Gravity Falls and Tidy Data Principles (Part 2)
Motivation
The first part left the door open to analyzing the Gravity Falls dialogues using tf-idf, bag-of-words, or other NLP techniques. Here I’m also taking a lot of ideas from Julia Silge’s blog.
Note: if some images appear too small on your screen, you can open them in a new tab to view them at their original size.
Term Frequency
The most basic measure in natural language processing is simply to count words. This is a crude way of knowing what a document is about. The problem with counting words, however, is that some words (called stopwords) are always too common, like “the” or “that”. So to create a more meaningful representation, what people usually do is compare the word counts observed in a document with those of a larger body of text.
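As a quick illustration (not part of the original analysis), the stop_words data frame bundled with tidytext is one such list of overly common words, and it is what gets anti-joined away in the code further below:

library(tidytext)
library(dplyr)

# "the" and "that" are listed in the stop word lexicons that ship with tidytext
stop_words %>%
  filter(word %in% c("the", "that"))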
Tf-idf is the frequency of a term adjusted for how rarely it is used. It is intended to measure how important a word is to a document in a collection (or corpus) of documents.
The inverse document frequency for any given term is defined as

\[idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\]

We can use tidy data principles to approach tf-idf analysis and use consistent, effective tools to quantify how important various terms are in a document that is part of a collection.
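To make the formula concrete, here is a minimal sketch of computing tf-idf by hand for a two-document corpus (the counts are made up, not taken from the subtitles):

# Toy example: a corpus of 2 documents
n_documents <- 2

# A term that appears in only one of the two documents
idf_rare <- log(n_documents / 1)    # ln(2/1) ~ 0.693

# A term that appears in both documents
idf_common <- log(n_documents / 2)  # ln(2/2) = 0

# tf-idf = term frequency * idf; say the rare term is 0.5% of its document's words
tf <- 0.005
tf * idf_rare    # ~ 0.0035
tf * idf_common  # 0: terms that appear everywhere get no weight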
What do Gravity Falls characters say?
Let’s start by looking at the Gravity Falls dialogues and examine term frequency first, then tf-idf. I’ll remove stopwords before doing the analysis.
if (!require("pacman")) install.packages("pacman")
p_load(data.table, tidyr, tidytext, dplyr, ggplot2, viridis, ggstance)
p_load_gh("dgrtwo/widyr")

gravity_falls_subs <- as_tibble(fread("../../data/2017-10-13-rick-and-morty-tidy-data/gravity_falls_subs.csv")) %>%
  mutate(text = iconv(text, to = "ASCII")) %>%
  drop_na()

gravity_falls_subs_tidy <- gravity_falls_subs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(season, word, sort = TRUE)

total_words <- gravity_falls_subs_tidy %>%
  group_by(season) %>%
  summarize(total = sum(n))

season_words <- left_join(gravity_falls_subs_tidy, total_words)

season_words

# A tibble: 9,773 x 4
   season word       n total
   <chr>  <chr>  <int> <int>
 1 S01    ha       313 17261
 2 S02    mabel    249 20284
 3 S01    hey      234 17261
 4 S02    hey      219 20284
 5 S01    mabel    207 17261
 6 S01    stan     196 17261
 7 S02    dipper   194 20284
 8 S02    time     191 20284
 9 S01    yeah     182 17261
10 S01    gonna    180 17261
# … with 9,763 more rows
Let’s look at the distribution of n/total for each season: the number of times a word appears in a season divided by the total number of terms (words) in that season. This is term frequency!
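If you want term frequency as an explicit column rather than computing n/total inside the plot call, a small sketch (not part of the original pipeline) would be:

# Term frequency: the share of a season's (non-stopword) tokens taken up by each word
season_words %>%
  mutate(term_frequency = n / total) %>%
  arrange(desc(term_frequency))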
ggplot(season_words, aes(n/total, fill = season)) +
  geom_histogram(alpha = 0.8, show.legend = FALSE) +
  xlim(0, 0.001) +
  labs(title = "Term Frequency Distribution in Gravity Falls' Seasons", y = "Count") +
  facet_wrap(~season, nrow = 3, scales = "free_y") +
  theme_minimal(base_size = 13) +
  scale_fill_viridis(end = 0.85, discrete = TRUE) +
  theme(strip.text = element_text(hjust = 0)) +
  theme(strip.text = element_text(face = "italic"))
There are very long tails to the right for these dialogues because of the extremely common words. These plots exhibit similar distributions for each season, with many words that occur rarely and fewer words that occur frequently. The idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents, in this case, the group of Gravity Falls’ seasons as a whole. Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common. Let’s do that now.
season_words <- season_words %>%
  bind_tf_idf(word, season, n) %>%
  filter(!word %in% c("subtitles", "uksubtitles.ru", "http", "memoryonsmells"))

season_words

# A tibble: 9,769 x 7
   season word       n total      tf   idf tf_idf
   <chr>  <chr>  <int> <int>   <dbl> <dbl>  <dbl>
 1 S01    ha       313 17261 0.0181      0      0
 2 S02    mabel    249 20284 0.0123      0      0
 3 S01    hey      234 17261 0.0136      0      0
 4 S02    hey      219 20284 0.0108      0      0
 5 S01    mabel    207 17261 0.0120      0      0
 6 S01    stan     196 17261 0.0114      0      0
 7 S02    dipper   194 20284 0.00956     0      0
 8 S02    time     191 20284 0.00942     0      0
 9 S01    yeah     182 17261 0.0105      0      0
10 S01    gonna    180 17261 0.0104      0      0
# … with 9,759 more rows
Notice that idf, and thus tf-idf, is zero for these extremely common words even after removing stopwords. They appear in both seasons, so the idf term is the natural log of 1, which is zero; “Mabel” and “Stan” are examples of this. The inverse document frequency (and thus tf-idf) is very low (near zero) for words that occur in many of the documents in a collection; this is how the approach decreases the weight for common words. The inverse document frequency is higher for words that occur in fewer of the documents in the collection.
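As a quick sanity check (not in the original post), we can filter for those names and confirm that every row has idf and tf-idf equal to zero:

# "mabel" and "stan" occur in both seasons, so their idf (and hence tf-idf) is zero
season_words %>%
  filter(word %in% c("mabel", "stan")) %>%
  select(season, word, n, tf, idf, tf_idf)

Let’s look at terms with high tf-idf.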
season_words %>%
  select(-total) %>%
  arrange(desc(tf_idf))

# A tibble: 9,769 x 6
   season word         n       tf   idf   tf_idf
   <chr>  <chr>    <int>    <dbl> <dbl>    <dbl>
 1 S02    ford        59 0.00291  0.693 0.00202
 2 S02    gasping     32 0.00158  0.693 0.00109
 3 S02    screams     32 0.00158  0.693 0.00109
 4 S02    gasps       29 0.00143  0.693 0.000991
 5 S02    chuckles    28 0.00138  0.693 0.000957
 6 S02    stanley     28 0.00138  0.693 0.000957
 7 S02    mayor       26 0.00128  0.693 0.000888
 8 S02    puppet      26 0.00128  0.693 0.000888
 9 S02    school      24 0.00118  0.693 0.000820
10 S02    grunting    20 0.000986 0.693 0.000683
# … with 9,759 more rows
Curious about why I filtered “subtitles”, “uksubtitles.ru”, “http” and “memoryonsmells”? Some episodes end with the credit line “subtitles by memoryonsmells http://uksubtitles.ru”. Because that line appears in only some episodes, those terms look rare to tf-idf and would float to the top of the ranking even though they are not dialogue.
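An alternative (just a sketch, assuming the credit text always contains those strings, and not what was done above) is to drop the credit lines from the raw subtitles before tokenizing, so those terms never enter the counts:

# Alternative sketch: strip the subtitle credit lines before tokenizing,
# instead of filtering the resulting terms after bind_tf_idf()
gravity_falls_subs_clean <- gravity_falls_subs %>%
  filter(!grepl("memoryonsmells|uksubtitles", text, ignore.case = TRUE))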
All of the nonzero idf values are the same because there are only 2 documents (seasons) in this corpus, so idf can take just two values: ln(2/1) ≈ 0.693 and ln(2/2) = 0. Let’s look at a visualization for these high tf-idf words.
plot_tfidf <- season_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))

ggplot(plot_tfidf[1:20, ], aes(tf_idf, word, fill = season, alpha = tf_idf)) +
  geom_barh(stat = "identity") +
  labs(title = "Highest tf-idf words in Gravity Falls' Seasons", y = NULL, x = "tf-idf") +
  theme_minimal(base_size = 13) +
  scale_alpha_continuous(range = c(0.6, 1), guide = FALSE) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_viridis(end = 0.85, discrete = TRUE) +
  theme(legend.title = element_blank()) +
  theme(legend.justification = c(1, 0), legend.position = c(1, 0))
Let’s look at the seasons individually.
plot_tfidf <- plot_tfidf %>%
  group_by(season) %>%
  top_n(15) %>%
  ungroup()

ggplot(plot_tfidf, aes(tf_idf, word, fill = season, alpha = tf_idf)) +
  geom_barh(stat = "identity", show.legend = FALSE) +
  labs(title = "Highest tf-idf words in Gravity Falls' Seasons", y = NULL, x = "tf-idf") +
  facet_wrap(~season, nrow = 3, scales = "free") +
  theme_minimal(base_size = 13) +
  scale_alpha_continuous(range = c(0.6, 1)) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_viridis(end = 0.85, discrete = TRUE) +
  theme(strip.text = element_text(hjust = 0)) +
  theme(strip.text = element_text(face = "italic"))