The Truth Is In There – an X-Files episode analysis with R (Part 3)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Part 3: Monster vs. Mythology word use
Mulder: We’ve both lost so much… but I believe that what we’re looking for is in the X-Files. I’m more certain than ever that the truth is in there.
Scully: I’ve heard the truth, Mulder. Now what I want are the answers.
This is part 3 of a natural language processing analysis of X-Files episode summaries. In parts 1 and 2 I explained how to get the data from Wikipedia and pre-process it for analysis. Now we can start working with the data.
Our data is stored in a neat document term matrix (DTM), along with meta-data. That means, every row of our data frame contains one of the 201 X-Files episodes. Our first 10 columns contain information about the episode, while the other 6000+ columns contain an alphabetized list of words and how often these words appear in each episode.
Let’s look at what the most prominent words are in the collection of episode summaries. Remember, our corpus cleanup got rid of English stop words (“the”, “and”, etc.) as well as the names of actors (i.e. “Duchovny” won’t appear in the list – but “Mulder” will).
The total count of the words could be misleading though – episode summary length or the way summaries are written skew the ratio. To “normalize” these effects, we will concentrate on if a word appears in an episode or not. For this, we need to transform the word count in our DTM to a Boolean value – TRUE if the count is 1 or greater, FALSE if the count is zero. For this, we define a simple function numToBool
numToBool <- function(x) { # function to set counts to boolean values # # Args: # x: an integer # # Return: # a Boolean value (>=1 TRUE, <1 FALSE) ifelse(is.numeric(x), x > 0, NA) }
We apply this function to all of the DTM cells.
xfmergedTF <- xfmerged # change word counts to Boolean values () xfmergedTF[ , -c(1:10)] <- as.data.frame(lapply( xfmergedTF[ , -c(1:10)],FUN = function(x) { sapply(x, FUN=numToBool) } )) str(xfmergedTF[ , -c(1:10)], list.len = 5) ## 'data.frame': 201 obs. of 6152 variables: ## $ aaa : logi FALSE FALSE FALSE FALSE FALSE FALSE ... ## $ aaronson : logi FALSE FALSE FALSE FALSE FALSE FALSE ... ## $ aback : logi FALSE FALSE FALSE FALSE FALSE FALSE ... ## $ abandon : logi FALSE FALSE FALSE TRUE FALSE FALSE ... ## $ abduct : logi FALSE FALSE FALSE FALSE FALSE FALSE ... ## [list output truncated]
Now we get the sums of all DTM columns (i.e. the number of episodes each word appears in) and divide it by the number of episodes to get a relative value.
nxf <- nrow(xfmergedTF) # get total Boolean counts (all episodes) xfcountAll <- xfmergedTF[ , -c(1:10)] %>% colSums() %>% sort(decreasing = TRUE) %>% `/`(nxf) #divide by number of episodes
Let’s look at the first 50 terms:
# show terms with largest counts head(xfcountAll, 50) ## sculli find mulder agent kill tell later ## 0.9900498 0.9353234 0.8905473 0.8457711 0.7810945 0.7810945 0.7462687 ## meanwhil believ arriv man one leav meet ## 0.7313433 0.7213930 0.6616915 0.6517413 0.6368159 0.6169154 0.6119403 ## see bodi investig reveal howev two fbi ## 0.6069652 0.5970149 0.5970149 0.5920398 0.5820896 0.5721393 0.5572139 ## discov return back take claim visit die ## 0.5522388 0.5373134 0.5323383 0.5273632 0.5174129 0.5024876 0.4825871 ## tri also head home car case realiz ## 0.4776119 0.4726368 0.4577114 0.4527363 0.4427861 0.4427861 0.4427861 ## dead found appear attack insid show work ## 0.4378109 0.4378109 0.4328358 0.4278607 0.4278607 0.4278607 0.4228856 ## attempt murder escap name room anoth death ## 0.4179104 0.4179104 0.4129353 0.4129353 0.4129353 0.4079602 0.4079602 ## ask ## 0.4029851
Pretty grim terminology, isn’t it? Unsurprisingly, we can see our two main characters’ names taking top spots in our list, along with procedural terms such as “case”, “fbi” and “investigate”. Most of the terms seem to be verbs, the most prominent one being “kill”. Words like “body”, “die”, “dead” and “murder” point to a rather dangerous world the protagonists live in.
What we want to find out now is how the terminology for the “Monster of the Week” and the “Mythology Arc” differ. For this, we will construct a slopegraph of top terms with strong differences. The first step is to separate the two datasets:
# get Boolean counts for Monster and Mythology episodes xfcountTFMonster <- filter(xfmergedTF, Mythology == FALSE) xfcountTFMyth <- filter(xfmergedTF, Mythology == TRUE)
Our graph will be based on the most prevalent terms from both of theses lists. We therefore need to sort, rank and reconnect them. Since we need to repeat the sorting/ranking for both sets, it makes sense to construct a function. prepCountList()
will prepare a ranked list our our terms, sorted by frequency.
prepCountList <- function(df, n) { # function to create a ranked list of relative Boolean counts # # Args: # df: data frame based on the data format in xfmerged # n: count of episodes # # Return: # df: a data frame of relative occurence (part), # term and rank df <- df[ , -c(1:10)] %>% # get rid of metadata colSums() %>% # get overall term counts sort(decreasing = TRUE) %>% # sort high to low `/`(n) %>% # divide by episode count as.data.frame() # make data frame df <- mutate(df, term = rownames(df)) # get row names as variable df$rank <- 1:nrow(df) # add ranks colnames(df) <- c("part", "term", "rank") # rename col names df }
Now we’ll apply the function to the two data frames:
# get number of episodes nmyth = nrow(xfcountTFMyth) nmon = nrow(xfcountTFMonster) nmyth ## [1] 60 nmon ## [1] 141 # create ranked lists for Monster and Mythology episodes xfcountTFMonster <- prepCountList(xfcountTFMonster, nmon) xfcountTFMyth <- prepCountList(xfcountTFMyth, nmyth) head(xfcountTFMonster) ## part term rank ## 1 0.9858156 sculli 1 ## 2 0.9432624 find 2 ## 3 0.8723404 agent 3 ## 4 0.8439716 mulder 4 ## 5 0.7801418 kill 5 ## 6 0.7446809 later 6 head(xfcountTFMyth) ## part term rank ## 1 1.0000000 mulder 1 ## 2 1.0000000 sculli 2 ## 3 0.9166667 find 3 ## 4 0.8666667 meanwhil 4 ## 5 0.8666667 tell 5 ## 6 0.7833333 agent 6
We join the two sets based on the term, so we’ll have both ranked lists in one data frame. We take the top 30 terms from both lists and throw out all terms where the difference in episode occurence is 10% or less. What remains is a list of the terms with the biggest difference between “Monster” and “Mythology” episodes.
# join the two ranked lists xfcountmerged <- full_join(xfcountTFMyth, xfcountTFMonster, by = "term") str(xfcountmerged) ## 'data.frame': 6152 obs. of 5 variables: ## $ part.x: num 1 1 0.917 0.867 0.867 ... ## $ term : chr "mulder" "sculli" "find" "meanwhil" ... ## $ rank.x: int 1 2 3 4 5 6 7 8 9 10 ... ## $ part.y: num 0.844 0.986 0.943 0.674 0.745 ... ## $ rank.y: int 4 1 2 9 7 3 5 14 184 8 ... # create data for slopegraph (top 30 terms for both episode types) # exclude difference below 10% xfcountmerged <- xfcountmerged %>% filter(rank.x <= 35 | rank.y <= 35) %>% select(myth = part.x, monster = part.y, term) %>% filter(abs(myth-monster) > .1) %>% mutate(myth = round(myth*100,0), monster = round(monster*100,0))
To construct the slopegraph, we’ll use a slightly adapted version of Nathan Yau’s plotting function. For space reasons, the code is sourced from a separate file in my GitHub repo.
# create slopegraph source("slopegraph.r") with(xfcountmerged, slopegraph(myth, monster, term))
## [1] "Plot generated: "
The “Mythology” episodes have more aliens and abductions in them; Agent Mulder is mentioned in all of these episodes, and Agent Skinner takes an important role in them too. Protagonists seem to move around more and meet more people (“visit”, “arrive”, “leave”, “return”), and more information is exchanged between people (“tell”, “reveal”, “inform”).
“Monster” episodes on the other hand feature more bodies, attacks< and murders, and the situations are described as “cases”. Names and homes are more important.
The differences in the other terms (“meanwhile”, “tri/try”) are difficult to interpret and probably have to do with narrative structures in the episode summaries.
Finally, some cleanup:
# cleanup rm(xfcountmerged, xfcountTFMonster, xfcountTFMyth, xfmergedTF, nmon, nmyth, nxf, xfcountAll, numToBool, prepCountList, slopegraph)
This concludes part 3 of our analysis. In part 4, we will look at characters and their occurence over seasons. Stay tuned!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.