The Truth Is In There – an X-Files episode analysis with R (Part 2)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Part 2: Preparing for text analysis
Cigarette Smoking Man: But I’ve come today not to ask, but to offer. To offer you the truths that you so desperately sought.
This is part 2 of a natural language processing analysis of X-Files episode summaries. In part 1 I explained how to get the data from Wikipedia. The next step is to get the data ready for analysis.
One problem that we’re facing in the episode summaries is that they contain names – lots of names. Some of them are useful (e.g. the main character names). Others, such as cast names, don’t have any relevance to our analysis. Let’s get those names and create stopword lists out of them, i.e. lists of terms that should be excluded from analysis.
The best source for cast and character information is the X-Files Full Credits page on IMDB, where it is stored in the third table. We scrape the table and store the cast and character names (columns V2
and V4
) in two separate data frames called xchar
and xcast
.
casturl <- "http://www.imdb.com/title/tt0106179/fullcredits/" # get third table (full cast) xcastchar <- content(GET(casturl)) xcastchar <- readHTMLTable(xcastchar, which = 3) # get cast and character names in separate tables xcast <- data.frame(as.character(xcastchar$V2), stringsAsFactors = FALSE) xchar <- data.frame(as.character(xcastchar$V4), stringsAsFactors = FALSE) str(xcast) ## 'data.frame': 1932 obs. of 1 variable: ## $ as.character.xcastchar.V2.: chr "Gillian Anderson" "David Duchovny" "Mitch Pileggi" "Robert Patrick" ... str(xchar) ## 'data.frame': 1932 obs. of 1 variable: ## $ as.character.xcastchar.V4.: chr " Dana Scully\n / ... \n (201 episodes, 1993-2002)" " Fox Mulder\n / ... \n (179 episodes, 1993-2002)" " Walter Skinner\n / ... \n (81 episodes, 1994-2002)" " John Doggett\n (40 episodes, 2000-2002)" ...
The cast list seems clean, but the character list is a little messy. What we really want is two lists of names that we can use as stopwords – i.e. two vectors containing words we can subtract from our episode summary collection.
To do this, we need to treat the cast and character list as a “text corpus”” for use in the tm
package. tm
is a framework of packages for natural language processing. Our two lists need to be cleaned, which means removing extra whitespaces, transforming to lower case, removing of punctuation and numbers. We will also stem the lists – which seems strange for names, but since we will be stemming the episode summaries later on, names can only be removed if they are stemmed too.
The following function takes care of all that cleaning, creates a document term matrix (DTM) and returns a term list, i.e. a list of all the words remaining after the cleaning.
library(tm) getcastcorpus <- function(df) { # Creates a frequent term matrix from a data frame of text # # Args: # df: data frame containing the corpus texts # # Return: # ffq: frequent term matrix corp <- Corpus(DataframeSource(df), readerControl = list(language = "en")) corp <- tm_map(corp, stripWhitespace) #white spaces corp <- tm_map(corp, content_transformer(tolower)) #lower case corp <- tm_map(corp, removePunctuation, preserve_intra_word_dashes = FALSE) #regular punctuation corp <- tm_map(corp, removeNumbers) # numbers corp <- tm_map(corp, stemDocument) # stemming dtm <- DocumentTermMatrix(corp) ffq <- findFreqTerms(dtm) ffq }
We now apply the function to our two name lists and receive the names as vectors:
#get cast and character names as separate vectors xfcastterms <- getcastcorpus(xcast) xfcharterms <- getcastcorpus(xchar) str(xfcastterms) ## chr [1:2314] "aaron" "aartsen" "abbott" "abdoulay" ... str(xfcharterms) ## chr [1:1736] "aaron" "aaronson" "abbott" "abducte" "abl" ...
Almost perfect – one problem remains: Some of the names are also English words which we want to include in our analysis. For example, a character might be described as “Gas Station Attendant”. We want to include those words in the analysis, so we need to exclude them from the stopword lists.
For this, I created a list called remcast.csv
which includes all those words we don’t want to exclude:
# manual list of non-names found in character description (e.g. "airplane pilot") remcast <- read.csv("remcast.csv", stringsAsFactors = FALSE) remcast <- remcast$x head(remcast) ## [1] "abducte" "abduct" "abandon" "addict" "african" "afterglow"
Now we subtract the remcast
list from both the cast and character vectors. Also, we make sure “Scully” (stemmed to “sculli”) is not removed from the character list.
# remove all non-name terms from cast and character vectors xfcastterms <- setdiff(xfcastterms, remcast) xfcharterms <- setdiff(xfcharterms, remcast) # remove "scully" from cast vector (it's both a cast an character name) xfcastterms <- setdiff(xfcastterms, "sculli") # cleanup rm(casturl, xcast, xchar, xcastchar, getcastcorpus)
Our stopword lists are ready. On to the real data – our episode summaries. Again, we will use the tm
package to create a corpus of summaries, clean and tokenize it. Since we want to keep our metadata (e.g. our episode titles), we will use a little trick to attach the episode code for later identification (check out the Stack Overflow question mentioned in the code for more info on this).
# Corpus construction and text cleanup # http://stackoverflow.com/questions/19850638/tm-reading-in-data-frame-and-keep-texts-id # create Corpus m <- list(id = "VarTitle", content = "content") myReader <- readTabular(mapping = m) xfcorpus <- Corpus(DataframeSource(xf2), readerControl = list(reader = myReader, language = "en")) # manually attach the product code for later identificaiton for (i in 1:length(xfcorpus)) { attr(xfcorpus[[i]], "ID") <- xf2$VarTitle[i] } inspect(xfcorpus[1]) ## <<VCorpus>> ## Metadata: corpus specific: 0, document level (indexed): 0 ## Content: documents: 1 ## ## [[1]] ## <<PlainTextDocument>> ## Metadata: 2 ## Content: chars: 4282
Our episode summary corpus is ready for cleaning. Again, we remove white spaces, transform to lower case, remove punctuation and numbers and stem the words. Additionally, we remove non-ASCII symbols, standard English stop words (“the”, “and”, etc.) and our cast stopwords – but not the character stop words.
xfcorpus <- tm_map(xfcorpus, stripWhitespace) #white spaces xfcorpus <- tm_map(xfcorpus, content_transformer(tolower)) #lower case xfcorpus <- tm_map(xfcorpus, removeWords, stopwords("english")) #stop words xfcorpus <- tm_map(xfcorpus, removePunctuation, preserve_intra_word_dashes = FALSE) #regular punctuation xfcorpus <- tm_map(xfcorpus, content_transformer(function(row) iconv(row, "latin1", "ASCII", sub=""))) # non-ascii chars xfcorpus <- tm_map(xfcorpus, removeNumbers) # numbers xfcorpus <- tm_map(xfcorpus, stemDocument) # stemming xfcorpus <- tm_map(xfcorpus, removeWords, xfcastterms) #remove names from cast list
We create a DTM out of our corpus…
# Create Document Term Matrix as data frame xfDTMfull <- DocumentTermMatrix(xfcorpus, control = list(wordLengths = c(3,15))) xfDTM <- data.frame(as.matrix(xfDTMfull), stringsAsFactors = TRUE) str(xfDTM, list.len = 5) ## 'data.frame': 201 obs. of 6152 variables: ## $ aaa : num 0 0 0 0 0 0 0 0 0 0 ... ## $ aaronson : num 0 0 0 0 0 0 0 0 0 0 ... ## $ aback : num 0 0 0 0 0 0 0 0 0 0 ... ## $ abandon : num 0 0 0 0 1 0 0 0 0 0 ... ## $ abduct : num 1 0 0 0 0 0 0 0 0 2 ... ## [list output truncated]
… and re-attach it to our meta-data:
# Merge DTM with episode data => full data set including character names xfmerged <- merge(xf2, xfDTM, by.x = "VarTitle", by.y = "row.names" ) str(xfmerged, list.len = 15) ## 'data.frame': 201 obs. of 6162 variables: ## $ VarTitle : chr "1X01_DeepThroat" "1X02_Squeeze" "1X03_Conduit" "1X04_TheJerseyDevil" ... ## $ ProdCode : chr "1X01" "1X02" "1X03" "1X04" ... ## $ Title : chr "Deep Throat" "Squeeze" "Conduit" "The Jersey Devil" ... ## $ epURL : chr "http://en.wikipedia.org/wiki/Deep_Throat_(The_X-Files_episode)" "http://en.wikipedia.org/wiki/Squeeze_(The_X-Files)" "http://en.wikipedia.org/wiki/Conduit_(The_X-Files)" "http://en.wikipedia.org/wiki/The_Jersey_Devil_(The_X-Files)" ... ## $ Director : chr "Daniel Sackheim" "Harry Longstreet" "Daniel Sackheim" "Joe Napolitano" ... ## $ Writer : chr "Chris Carter" "Glen Morgan" "Alex Gansa" "Chris Carter" ... ## $ AirDate : chr "September 17, 1993" "September 24, 1993" "October 1, 1993" "October 8, 1993" ... ## $ Viewers : num 11.1 11.1 9.2 10.4 8.8 9.5 10 10.7 8.8 10.4 ... ## $ Mythology : logi TRUE FALSE FALSE FALSE FALSE FALSE ... ## $ content.x : chr "In southwestern Idaho, near Ellens Air Force Base, military police raid the home of Colonel Robert Budahas, who has stolen a mi"| __truncated__ "In Baltimore, businessman George Usher arrives at his office building. He is watched from a storm drain by someone who then sne"| __truncated__ "At a campground at Lake Okobogee [sic] National Park in Sioux City, Iowa, Darlene Morris witnesses a flash of light outside of "| __truncated__ "In New Jersey in 1947, a man is attacked while fixing a flat tire on the road near the woods. His corpse is later found with hi"| __truncated__ ... ## $ aaa : num 0 0 0 0 0 0 0 0 0 0 ... ## $ aaronson : num 0 0 0 0 0 0 0 0 0 0 ... ## $ aback : num 0 0 0 0 0 0 0 0 0 0 ... ## $ abandon : num 0 0 0 1 0 0 0 0 0 0 ... ## $ abduct : num 0 0 0 0 0 0 0 0 2 1 ... ## [list output truncated]
Some cleanup:
# cleanup rm(i, m, myReader)
And we’re done with preprocessing our data! Everything we need is neatly stored in a DTM including episode meta-data.
In part 3, we’re going to explore the dataset and look for differences between the “monster of the week” and “mythology arc” episodes.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.