The Truth Is In There – an X-Files episode analysis with R (Part 4)
Part 4: Character occurrence
Section Chief Blevins: Agent Mulder. What are his thoughts?
Scully: Agent Mulder believes we are not alone.
This is part 4 of a natural language processing analysis of X-Files episode summaries. Parts 1 and 2 dealt with obtaining and cleaning the data, and part 3 showed an analysis of prominent words by episode type. In this post, we will look at characters.
As in every TV show, The X-Files features recurring characters – some taking leading roles, others as supporting characters. There are good guys, bad guys, and ambivalent ones. Let’s look at which of these characters gets mentioned most.
# get DTM of all terms including cast (strip the metadata)
termvector <- colnames(xfmerged[ , -(1:10)])
# create vector of character names
# make sure all character names are in the list of terms
xfcharacters <- intersect(xfcharterms, termvector)
# get rid of non-names using the remcast list
xfcharacters <- setdiff(xfcharacters, remcast)
# reduce the DTM to character terms only
xfcharacters <- select_(xfmerged, .dots = xfcharacters)
# set counts to Boolean
xfcharacters[ , -1][xfcharacters[ , -1] > 1] <- 1
xfcharacters[ , -1][xfcharacters[ , -1] == 0] <- 0
# construct data frame of terms and Boolean counts
xfcharacters <- data.frame(ct = colSums(xfcharacters),
                           name = names(xfcharacters))
# sort descending by count
xfcharacters <- arrange(xfcharacters, desc(ct))
head(xfcharacters, 20)
##     ct       name
## 1  199     sculli
## 2  179     mulder
## 3   73    skinner
## 4   40    doggett
## 5   24     krycek
## 6   23        rey
## 7   20      elder
## 8   20      kersh
## 9   20   samantha
## 10  11    spender
## 11  10 covarrubia
## 12  10     frohik
## 13  10     marita
## 14  10        mrs
## 15   8       byer
## 16   8       lang
## 17   8     melvin
## 18   7     albert
## 19   7     fowley
## 20   7      teena
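The counting step above reduces each term to presence or absence per episode, so the column sums give the number of episodes that mention a character rather than the total number of mentions. Here is a minimal base R sketch of that idea on a made-up matrix; `toy_dtm` and its values are hypothetical and not taken from the real data.

```r
# Toy document-term matrix: rows are episodes, columns are stemmed names.
# All values are made up for illustration.
toy_dtm <- data.frame(
  mulder  = c(12, 0, 7),
  sculli  = c(9, 4, 0),
  skinner = c(0, 0, 3)
)

# Presence/absence: TRUE where a name occurs at least once in an episode
present <- toy_dtm > 0

# Number of episodes mentioning each name, sorted descending
episodes_mentioned <- sort(colSums(present), decreasing = TRUE)
print(episodes_mentioned)
```

Comparing against zero yields a logical matrix directly, which `colSums()` treats as 0/1.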
Looks plausible, with our lead characters at the very top. We will now take a top-down approach: construct a very limited list of main characters and see how they appear over the various seasons. Here’s the list:
xfnames <- c(
  # X-Files agents
  "mulder",
  "sculli",     # written this way for stemming reasons
  "doggett",
  "rey",        # agent reyes, stemmed
  # the bad guys
  "smoke",      # the cigarette smoking man
  "krycek",
  "elder",
  "rohrer",
  # informants
  "covarrubia",
  "throat",     # deep throat
  "kritschgau",
  # "X", deep throat's successor, can't be distinguished in our texts
  # the FBI
  "skinner",
  "fowley",
  "kersh",
  "spender",
  # the Lone Gunmen
  "byer",
  "frohik",
  "lang",
  "gunmen"
)
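Because the names have to match the stemmed column names exactly, an unstemmed or misspelled entry would not match any column. A quick sanity check can be sketched with `setdiff()`; the two vectors below are hypothetical stand-ins for the real column names and name list.

```r
# Hypothetical column names standing in for colnames(xfmerged)
toy_columns <- c("ProdCode", "mulder", "sculli", "skinner", "smoke")

# Hypothetical name list; "scully" is deliberately left unstemmed
toy_names <- c("mulder", "scully", "skinner")

# Names without a matching column
missing_names <- setdiff(toy_names, toy_columns)
print(missing_names)
```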
We create a DTM from just these names, reconstruct the season number from the production code and summarize the occurrence of each character by season:
# get DTM
xftimeline <- select_(xfmerged, .dots = c("ProdCode", xfnames))
# construct summary by season
xftimeline <- xftimeline %>%
  mutate(Season = as.numeric(substr(ProdCode,1,1))) %>% # get season number
  select(-ProdCode) %>%          # Production code no longer needed
  group_by(Season) %>%           # group and summarise by season
  summarise_each(funs(sum)) %>%
  t() %>%                        # rotate the data frame (cols to rows)
  as.data.frame()                # matrix to df
xftimeline <- xftimeline[-1 ,]   # get rid of season numbers
colnames(xftimeline) <- 1:ncol(xftimeline) # add season numbers as col names
# get row character names as own column
xftimeline$term <- rownames(xftimeline)
str(xftimeline)
## 'data.frame': 19 obs. of 10 variables:
##  $ 1   : num 270 191 0 0 6 0 4 0 0 25 ...
##  $ 2   : num 318 200 0 0 19 24 6 0 0 4 ...
##  $ 3   : num 307 205 0 0 30 17 2 0 0 5 ...
##  $ 4   : num 275 179 0 0 41 11 5 0 9 3 ...
##  $ 5   : num 324 219 0 0 31 12 6 0 2 2 ...
##  $ 6   : num 330 239 0 0 21 9 6 0 2 1 ...
##  $ 7   : num 235 179 0 0 35 7 0 0 1 5 ...
##  $ 8   : num 122 208 264 30 0 25 1 18 0 2 ...
##  $ 9   : num 53 130 187 138 5 1 0 30 1 3 ...
##  $ term: chr "mulder" "sculli" "doggett" "rey" ...
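To make the season summary easier to follow, here is the same sequence of steps (season from production code, sum per season, rotate) as a base R sketch on toy data. The production codes and counts are made up, and `aggregate()` is swapped in for the dplyr pipeline so the example runs without extra packages.

```r
# Toy data: the production code starts with the season digit (e.g. "1X01").
# Codes and counts are made up for illustration.
toy <- data.frame(
  ProdCode = c("1X01", "1X02", "2X01", "2X02"),
  mulder   = c(10, 12, 8, 9),
  doggett  = c(0, 0, 3, 5),
  stringsAsFactors = FALSE
)

# Season number is the first character of the production code
toy$Season <- as.numeric(substr(toy$ProdCode, 1, 1))

# Sum the counts of every character column within each season
by_season <- aggregate(cbind(mulder, doggett) ~ Season, data = toy, FUN = sum)

# Rotate so that characters are rows and seasons are columns
wide <- as.data.frame(t(by_season[, -1]))
colnames(wide) <- by_season$Season
wide$term <- rownames(wide)
print(wide)
```

Taking only the first character works here because the show has nine seasons; a series with ten or more seasons would need a different way to parse the season.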
Since we’ll use ggplot for creating graphs, we will use a long format for our data:
# create long format
library(tidyr)
xftimeline <- gather(xftimeline, "season", "count", 1:9)
str(xftimeline)
## 'data.frame': 171 obs. of 3 variables:
##  $ term  : chr "mulder" "sculli" "doggett" "rey" ...
##  $ season: Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ count : num 270 191 0 0 6 0 4 0 0 25 ...
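For readers unfamiliar with the wide-to-long reshape, the following base R sketch shows the same change of shape on a made-up two-season table. It is only an illustration of what `gather()` does here, not a replacement for it; in current tidyr versions, `pivot_longer()` is the recommended successor to `gather()`.

```r
# Toy wide table: one row per character, one column per season (made-up counts)
wide <- data.frame(
  `1`  = c(22, 0),
  `2`  = c(17, 8),
  term = c("mulder", "doggett"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

season_cols <- setdiff(names(wide), "term")

# Stack the season columns: one row per (term, season) pair
long <- data.frame(
  term   = rep(wide$term, times = length(season_cols)),
  season = rep(season_cols, each = nrow(wide)),
  count  = unlist(wide[season_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
print(long)
```

Each character now appears once per season, which is the layout ggplot2 expects when mapping `season` to the x axis and `term` to colour.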
To plot multiple graphs, we’ll use the following function:
Plot_Timeline <- function(df, titletext) {
  # function to plot a line graph of character
  # appearance by season
  #
  # Args:
  #   df: data frame with cols season, term (name) and count (nr of appearances)
  #   titletext: title shown above the plot
  #
  # Return:
  #   ggplot2 object (line graph)
  library(ggthemes)
  g <- ggplot(data = df, aes(x = season, y = count, group = term)) +
    geom_line(aes(color = term)) +
    ggtitle(titletext) +
    theme_few() +
    scale_colour_few()
  g
}
Here are the plots – warning: Contains plot spoilers!
Plot_Timeline(filter(xftimeline, term %in% c("mulder", "sculli", "doggett", "rey")),
              "X-Files Agents")
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
##     annotate
While Agent Scully remains relatively constant over the various seasons (despite her character’s absence in some seasons), we clearly see Agent Mulder’s replacement by Agent Doggett in the last few seasons. Agent Reyes’ occurrence is more of an afterthought (which, in my humble opinion, is deserved).
Plot_Timeline(filter(xftimeline, term %in% c("smoke", "krycek", "elder", "rohrer")), "The Syndicate")
The Cigarette Smoking Man shows an increasingly strong presence until season 7 and then drops off, presumed dead. At this point, “super soldier” Knowle Rohrer takes over. Agent Krycek has two peaks, while the obscure First Elder stays in the background throughout all seasons.
Plot_Timeline(filter(xftimeline, term %in% c("throat", "covarrubia", "kritschgau")), "Informants")
Main informant Deep Throat plays a key role in the first season, gets killed and reappears in a couple of flashback scenes later. His replacement, “X”, unfortunately can’t be analyzed because his single-letter name can’t be distinguished in our texts (blame it on the 90s). Marita Covarrubias and Michael Kritschgau are important in some seasons but otherwise appear rarely.
Plot_Timeline(filter(xftimeline, term %in% c("skinner", "fowley", "kersh", "spender")), "FBI Agents")
Agent Skinner dominates this picture. During the sixth and seventh seasons we can see a dip in his mentions, while other FBI characters briefly gain more presence.
Plot_Timeline(filter(xftimeline, term %in% c("frohik", "byer", "lang", "gunmen")), "Lone Gunmen")
While the Lone Gunmen as a group steadily gain importance over the seasons, some seasons feature individual members more prominently (Langly in seasons 2 and 6, Byers in seasons 5 and 6). Towards the end, the group has a very strong presence, even though individual members are rarely mentioned.
Conclusion and Leftovers
Scully: Mulder, what did you find out there?
Mulder: Scully, I can’t tell you.
Scully: That doesn’t make sense.
Mulder: You’ve got to trust me, Scully. I know things. It’s better you don’t.
This concludes this exploration of the X-Files episode summary data – I hope you have enjoyed it as much as I have. As usual, most of the time was spent on data gathering and cleaning, which is especially true in the case of text-based data (don’t ask me how long it took to construct that remcast list).
Of course, much more could be done with the data, and I hope you grab some of it for your own analyses from my GitHub repo. The original intention of the exercise was to try out some unsupervised machine learning and network analysis techniques (K-Means clustering, Latent Semantic Analysis, Topic Models), which unfortunately didn’t prove fruitful.
Finally, this analysis doesn’t have to be constrained to X-Files episode summaries: a lot of the code from parts 1 and 2 can be re-used for other series, and maybe even turned into a package.