Analysing Dracula
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I recently read Bram Stoker’s Dracula, mostly because I wanted to rewatch the 1990s movie of the same name and see how badly they’d mutilated the story. [Update 26/04/2015: I rewatched the 1992 movie and while it’s the most faithful version to the book, it has some weird additions and is utterly appalling. People have been mean about Keanu Reeves’ performance, but he’s no worse than anyone else and the direction is the real issue. The Annie Lennox song at the end is nice, though]. The book is wholly composed of diary entries, letters and newspaper clippings. This means that it’s effectively a self-contained social network, with every “sample” having a date, an author and an intended recipient. It occurred to me that it might be interesting to look at these data and nobody else seems to have done so. Fortunately, someone had tabulated much of the data, even if they hadn’t done anything further with it. Therefore, thanks to Harold J Hotchkiss for typing it up over a decade ago.
Did it really happen?
Obviously not. The year it takes place isn’t important, but it was published in 1897 and ends with a note dated “7 years later”, so let’s assume it takes place in 1890. If you calculate the days of the week that events occur on throughout the book, you get the plot below. I have no empirical evidence to back this up, but it looks too evenly distributed to me to be real. I’ve looked at real time series before and you will generally see some sort of trend, be it for weekdays, weekends or a particularly day. As it says in the book, “denn die Todten reiten Schnell” (“for the dead travel fast”) and they apparently don’t take days off, either.
Authorship and structure of the book
~70% of the book is diary entries chiefly written by 3 authors. John Seward (~40% of diary entries), Jonathan Harker (~31% of diary entries) and Wilhelmina “Mina” Murray-Harker (~24% of diary entries). Mina gets a double-barreled surname because she marries Jonathan midway in the book and changes her surname, which was the style at the time. I needed a single name to refer to both her ‘states’. There is some truly incredible sexism in the book, it seems like gender was a big deal in the late 19th century. You can divide documents by the type (e.g. letter, diary entry, etc) and also both who they are sent by and to.
The above figures are global summaries of the entire book. What is happening over the course of the book? This is slightly complicated because the book isn’t in strict chronological order; chapters may feature letters and events that occurred previously.
The figure above gives a clear view of the structure of the book. The plots below show who is writing what and to whom. The first few chapters are Jonathan Harker’s diary of events in Castle Dracula, before a flurry of letters to introduce most of the rest of the characters. Subsequently, the captain of the Demeter gives his account of bringing Dracula to Whitby (summary; not a lot of fun), and then Dr Seward gets on with the business of writing the majority of the book. You can also clearly see how certain characters come and go; for example, Lucy Westenra’s death and Jonathan Harker’s heroic reappearance in the second half.
Barplots are great, but there other ways to represent the data and they can show structure that hard to grasp in any other way.
Timeline
The above plot is a timeline of events in the book; it takes place in a mere 7 months. Events in odd-numbered chapters are above the timeline and those in even-numbered chapters are below. An arbitrary Y-axis separates events by their document type. Events are represented by coloured squares corresponding to the sender of the document, with the colours matching those used in other plots here.
Network analysis
We’ll start with the most interesting figure, the network itself. Primary characters are green hexagons and secondary characters are blue triangles. Nodes are labeled with character initials and scaled relative to how many documents they sent or received. Edges between nodes are also scaled relative to how many documents are sent.
A few things of note:
- Nobody ever writes or receives any correspondence from Count Dracula
- Jonathan Harker and Mina Murray-Harker never speak. Letters are mentioned, but we never get to read them
- It’s fairly interesting how many characters never speak to one another
- Arthur Holmwood receives a lot of post, but doesn’t bother writing back. I guess he’s too important being a Lord
- Quincey exists mostly as a bizarre stereotype of Americana and he only communicates with Arthur Holmwood once. He spends the entire book being very worthy, his actions announcing “I will die heroically” to the reader
- The cluster of secondary characters is some removal companies talking to each other about Dracula’s boxes of earth, some newspapers reporting on events at Whitby and the “bloofer lady” (i.e. undead Lucy), and of course the unfortunate captain of the Demeter, scribbling in his log about the worst cruise ever
Below is a barplot of how many characters each character sends and receives documents from (this is a summary of the edges in the network above). It includes self-edges, so people who write only to themselves still “receive” at a value of 1. It further illustrates the fact that Arthur Holmwood is the most popular girl at school, whilst Jonathan Harker is a weird, angry loner who talks only to himself. Van Helsing is nice enough to write to him… once.
I have also produced a correlation plot, although the data plotted are not correlation values. It shows the proportion of documents each author sent to each recipient. The scale is misleading, as all values are non-negative (i.e. between 0 and 1). I find it sad that Van Helsing likes to write letters and never writes in his own self-absorbed diary, but people don’t write back. The Y-axis is sender and the X-axis is recipient.
Finally in this section, an alternative to a network is an arcplot. These can look very nice indeed, but I don’t think they work particularly well for these data:
Wordcloud
By all means go nuts and do some sentiment analysis or something more complicated, but I was happy just to create the obligatory word cloud. They’re not really empirical, but they can be nice to look at, I guess. The most common word is “said”, because the book is a “he-said-she-said” bonanza, like all social networks.
Final thoughts
It holds up quite well. I was surprised to see how “modern” the text was; for example,
“To use an Americanism, he had ‘taken no chances’”
In other ways, it’s comically out of date. It’s very much a story about a group of wealthy people and anyone else is given short shrift and comical accents. Servants are also plentiful and the word “peasant” is used semi-frequently, usually to refer to people from central and eastern Europe:
Some of them were just like the peasants at home
or those I saw coming through France and Germany, with short jackets, and round hats, and home-made trousers; but others were very picturesque. The women looked pretty, except when you got near them, but they were
very clumsy about the waist.
The sexism is so overt it’s quite funny, with men being strong and noble whilst women are mostly feeble and emotional victims:
“Ah, that wonderful Madam Mina! She has man’s brain, a brain that a man should have were he much gifted, and a woman’s heart.”
It can just about be forgiven for its time, as Mina is ultimately instrumental in defeating Dracula and has a central role in events overall, even if she eventually becomes a bit of a damsel in distress. On the other hand, if you wrote a vampire novel today, it would be unforgivable to have a female protagonist willingly submitting to all kinds of demeaning emotional and physical abuse… wouldn’t it?
Perhaps the strangest thing about the story is the way they combat and defeat vampires; on the one hand, the most potent weapons against them appear to be herbs and Christian mythology (particularly bits of magic wafer), all of which is treated with absolute po-faced seriousness. On the other hand, while vampires seem to be averse to these things, but are not physically harmed by them. They are susceptible to ultraviolence, though, as stabbing a vampire in the heart and sawing its head off is pretty effective. Maybe the power of prayer is overstated. The next time you suspect that you’re in the presence of supernatural being intent on giving you immortality (although there are some downsides), try warding them off with a Kit Kat before you decide to brutally murder them, which will be frowned upon if they’re not actually one of the undead.
Files
You can access the code and files used on my GitHub repository, but the R code is copied below if you’re interested in a quick look at how the plots were made. You can download the book itself for free from Amazon, and also a nice text version from Project Gutenberg.
#### Load packages #### library(openxlsx) # for reading Excel files library(RColorBrewer) # nice color palettes library(ggplot2) # fancy-looking plots library(corrplot) # correlation plot library(network) # for network plot library(sna) # for network plot library(tm) # Text mining package library(wordcloud) # word cloud plotting #### Read in data and setup #### ## Read from Excel. Note that we have word counts for chapters, ## but never bother using them chron = read.xlsx("dracula.xlsx",sheet="chronology", startRow=1, colNames=TRUE, skipEmptyRows=TRUE, rowNames=FALSE) wc = read.xlsx("dracula.xlsx",sheet="wordcounts", startRow=1, colNames=TRUE, skipEmptyRows=TRUE, rowNames=FALSE) ## Reorder type column for aesthetic reasons and ## cast chapters as a factor chron$Type=factor(chron$Type,levels=c("Diary","Letter", "Telegram","Memo","Newspaper","Ship's log")) chron$Chapter=factor(chron$Chapter) ## To reorder the characters, we manually define ## the order to ensure proper grouping ## A frequency-based approach looks more confused primarycharacters=c("John Seward", "Jonathan Harker", "Wilhelmina Murray-Harker", "Lucy Westenra", "Abraham Van Helsing", "Arthur Holmwood", "Quincey P Morris") secondarycharacters=c("Captain of the Demeter", "The Dailygraph", "The Westminster Gazette", "The Pall Mall Gazette", "Carter, Paterson & Co", "Samuel F Billington & Son", "Mitchell, Sons & Candy", "Rufus Smith", "Patrick Hennessey", "Sister Agatha") chron$From=factor(chron$From,levels=c(primarycharacters, secondarycharacters)) chron$To=factor(chron$To,levels=c(primarycharacters, secondarycharacters)) ## Store the initials of each character for labelling plots initials=c("JS","JH","WMH","LW","AVH","AH","QPM","CD", "DG","WG","PMG","CPC","SBS","MSC","RS","PH","SA") ## Calculate days of the week - this cannot be done in ## Excel before the year 1900 chron$Date=as.Date(gsub(" ","-",chron$Date)) chron$Weekday=factor(weekdays(chron$Date), levels=c("Monday","Tuesday","Wednesday", "Thursday","Friday","Saturday","Sunday")) ## Define color palettes pcols=brewer.pal(length(primarycharacters),"Set1") scols=brewer.pal(length(secondarycharacters),"Set3") acols=c(pcols,scols) gothic=c("#6B090F","#360511","#080004","#230024","#3D1D54") #### End of reading data and setup ### #### Basic plots #### ## Which days documents were written on ggplot(chron,aes(Weekday,fill=Type))+geom_bar()+ ggtitle("Documents by day of the week") ## Plot document types with who wrote them and who to: ggplot(chron,aes(Type,fill=From))+geom_bar()+ scale_fill_manual(values=acols)+ggtitle("Documents From") ggplot(chron,aes(Type,fill=To))+geom_bar()+ scale_fill_manual(values=acols)+ggtitle("Documents To") ## Who writes what and when, split by chapters ggplot(chron,aes(Chapter,fill=Type))+geom_bar()+ ggtitle("Document type by chapter") ggplot(chron,aes(Chapter,fill=From))+geom_bar()+ scale_fill_manual(values=acols)+ggtitle("Document sender by chapter") ggplot(chron,aes(Chapter,fill=To))+geom_bar()+ scale_fill_manual(values=acols)+ggtitle("Document receiver by chapter") ## A heatmap of who sends what to who... fromtotable=table(chron$From,chron$To) fromtoproportions=fromtotable/rowSums(fromtotable) corrplot(fromtoproportions,method="color",tl.col="black") ## corrplot() messes with the device setup, so reset it dev.off() #### End of basic plots #### #### Timeline #### ## Crop off the last entry, as it's 7 years later main=chron[1:197,] ## Define heights for points height=rep(0,nrow(main)) height[main$Type=="Diary"]= 1 height[main$Type=="Ship's log"]=2 height[main$Type=="Memo"]= 3 height[main$Type=="Telegram"]=4 height[main$Type=="Newspaper"]= 5 height[main$Type=="Letter"]= 6 ## Define vectors of colours for the plot fromcols=as.character(main$From) tocols=as.character(main$To) for(i in 1:length(levels(chron$From))){ fromcols[fromcols==levels(chron$From)[i]]=acols[i] tocols[tocols==levels(chron$From)[i]]=acols[i] } ## Use chapter information to determine whether the ## point is above or below the line ## Odd numbers above, even below main$Chapter=as.numeric(as.character(main$Chapter)) height[main$Chapter%%2==0]=height[main$Chapter%%2==0]*-1 ## Perform plot plot(x=main$Date,y=rep(1,nrow(main)),type="n",xaxt = "n",yaxt="n", bty = "n",xlab = "Date", ylab=NA,ylim=c(-6,6),main="Dracula Timeline") abline(h=-6:6,col="lightgrey") points(main$Date,height,pch=15,col=fromcols) segments(main$Date,rep(0,nrow(main)),main$Date,height,col=fromcols,lty=1) #points(main$Date,height,pch=15,col=tocols) #uncomment for "to" colours #segments(main$Date,rep(0,nrow(main)),main$Date,height,col=tocols) u = par("usr") arrows(u[1], 0, u[2], 0, xpd=TRUE,lwd=3) xlabels=paste(c(month.abb[5:11]),"1890") ylabels=c("Diary","Ship's log","Memo","Telegram","Newspaper","Letter") axis(1, at=as.Date(c("1890-05-01","1890-06-01","1890-07-01", "1890-08-01","1890-09-01","1890-10-01","1890-11-01")),labels=xlabels) axis(2,at=c(-6:6),labels=c(rev(a),"",a),las=2,lty=0,cex.axis=0.7) #### End of timeline #### #### Network #### ## Create network of who communicates with who comtable=table(chron$From,chron$To) comtablebinary=comtable/comtable # divide by itself so all values are 1 comnetwork=network(comtablebinary,directed=TRUE,loops=TRUE) ## Number of people who write to each character to=(colSums(comtablebinary,na.rm=TRUE)) ## Number of people who each character writes to from=(rowSums(comtablebinary,na.rm=TRUE)) sentreceived=cbind(from,to) ## Barplot of results barplot(t(sentreceived),main="Characters interacted with", ylab="Characters",beside=TRUE,names=initials, las=3,col=c("#078E53","#111987")) legend("topright",c("Sent","Received"),pch=rep(15,2), col=c("#078E53","#111987")) ## Generate some scaled edge widths edgewidths=as.vector(comtable) edgewidths=edgewidths[edgeswidths!=0] edgewidths=sapply(log2(edgewidths),max,1) ## Perform network plot gplot(comnetwork, interactive=TRUE, # Uncomment for interactive plotting diag=TRUE, # self-loops loop.steps=5, # angular self-loops usecurve=TRUE, # curvy arrows arrowhead.cex=0.5, # smaller arrowheads label=initials, # initials not full names label.pos=5, # labels inside shapes label.cex=ifelse(levels(chron$From)%in%primarycharacters,1.1,0.75), label.col=ifelse(levels(chron$From)%in%primarycharacters,"black","white"), vertex.cex=(log2(table(c(as.character(chron$From), as.character(chron$To)))[levels(chron$From)])+3)/2, # log2 scale vertex.sides=ifelse(levels(chron$From)%in%primarycharacters,6,3), vertex.rot=ifelse(levels(chron$From)%in%primarycharacters,0,30), vertex.col=ifelse(levels(chron$From)%in%primarycharacters, "#078E53","#111987"), edge.lwd=edgewidths ) #### End of network #### #### Arc diagram #### ## IMPORTANT NOTE: loading this package will break the network-plotting ## code, which is why it's all commented out here: # https://github.com/gastonstat/arcdiagram # http://gastonsanchez.com/work/starwars/ # library(devtools) # for installing arcdiagram # install_github('arcdiagram', username='gastonstat') # library(arcdiagram) # for plotting arc diagram ## Generate a suitable object, order it for easy access and ## remove self-communication, as this can't be represented ## on an arc diagram arcmat=as.matrix(chron[,c("From","To")]) arcmat=arcmat[!(arcmat[,"From"]==arcmat[,"To"]),] arcmat=arcmat[order((arcmat[,2])),] arcmat=arcmat[order((arcmat[,1])),] ## Define which arcs we will place above and which below AVH2JS=c(1:6,14:16) AVH2JH=c(7) AVH2WMH=c(8:9,31:32) WMH2SA=30 WMH2LW=c(21:23,33:36) ## Define a custom order of nodes ordering=c("Abraham Van Helsing", "John Seward", "Jonathan Harker", "Wilhelmina Murray-Harker", "Arthur Holmwood", "Quincey P Morris", "Lucy Westenra", "Carter, Paterson & Co", "Samuel F Billington & Son", "Mitchell, Sons & Candy", "Sister Agatha", "Rufus Smith", "Patrick Hennessey") ## Perform the arc plot ## Col.nodes has a bug, so colours are defined manually ## Could be improved with weighted line arcs and/or nodes arcplot(arcmat,above=c(AVH2JS,AVH2JH,AVH2WMH,WMH2SA,WMH2LW), ordering=ordering, pch.nodes=15, col.nodes=c(rep("#078E53",6),rep("#111987",2),"#078E53",rep("#111987",4)), cex.nodes=2, cex.labels=0.5, lwd.arcs=3, col.arcs="#A71930") legend("topleft",c("Primary character","Secondary character"), pch=rep(15,2),col=c("#078E53","#111987"),box.lty=0) #### End of arc diagram #### #### Word cloud #### ## Plot a word cloud. It's not really quantitative, but it's pretty ## Create a corpus and do some basic cleaning up book=read.delim("pg345.txt",header=FALSE,stringsAsFactors=FALSE) corpus=Corpus(DataframeSource(book), readerControl=list(language = "en")) corpus=tm_map(corpus, content_transformer(removePunctuation)) corpus=tm_map(corpus, content_transformer(tolower)) corpus=tm_map(corpus, function(x) removeWords(x, stopwords("english"))) ## Create the necessary objects for plotting tdm = TermDocumentMatrix(corpus) tdmatrix = as.matrix(tdm) tdcounts = sort(rowSums(tdmatrix),decreasing=TRUE) tddf = data.frame(word = names(tdcounts),freq=tdcounts) ## Perform plot wordcloud(tddf$word,tddf$freq,min.freq=3, random.order=FALSE, rot.per=0.15, colors=gothic,random.color=TRUE) #### End of word cloud ####
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.