Gravity Falls and Tidy Data Principles (Part 3)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Motivation
The first and second part of this analysis gave the idea that I did too much scrapping and processing and that deserves more analysis to use that information well. In this third and final part I’m also taking a lot of ideas from Julia Silge’s blog.
In the GitHub repo of this project you shall find not just Rick and Morty processed subs, but also for Gravity Falls, Bojack Horseman, Gravity Falls and Stranger Things. Why? In this post post I’m gonna compare the different shows.
Note: If some images appear too small on your screen you can open them in a new tab to show them in their original size.
Word Frequencies
Comparing frequencies across different shows can tell us how similar Gravity Falls, for example, is similar to Rick and Morty. I’ll use the subtitles from different shows that I scraped using the same procedure I did with Rick and Morty.
if (!require("pacman")) install.packages("pacman") p_load(data.table, tidyr, tidytext, dplyr, ggplot2, viridis, ggstance, stringr, scales) p_load_gh("dgrtwo/widyr") subs <- list.files("../../data/2017-10-13-rick-and-morty-tidy-data", pattern = "subs", full.names = T) archer_subs <- as_tibble(fread(subs[[1]])) %>% mutate(text = iconv(text, to = "ASCII")) %>% drop_na() bojack_horseman_subs <- as_tibble(fread(subs[[2]])) %>% mutate(text = iconv(text, to = "ASCII")) %>% drop_na() gravity_falls_subs <- as_tibble(fread(subs[[3]])) %>% mutate(text = iconv(text, to = "ASCII")) %>% drop_na() rick_and_morty_subs <- as_tibble(fread(subs[[4]])) %>% mutate(text = iconv(text, to = "ASCII")) %>% drop_na() stranger_things_subs <- as_tibble(fread(subs[[5]])) %>% mutate(text = iconv(text, to = "ASCII")) %>% drop_na() archer_subs_tidy <- archer_subs %>% unnest_tokens(word,text) %>% anti_join(stop_words) bojack_horseman_subs_tidy <- bojack_horseman_subs %>% unnest_tokens(word,text) %>% anti_join(stop_words) gravity_falls_subs_tidy <- gravity_falls_subs %>% unnest_tokens(word,text) %>% anti_join(stop_words) rick_and_morty_subs_tidy <- rick_and_morty_subs %>% unnest_tokens(word,text) %>% anti_join(stop_words) stranger_things_subs_tidy <- stranger_things_subs %>% unnest_tokens(word,text) %>% anti_join(stop_words)
With this processing we can compare frequencies across different shows. Here’s an example of the top ten words for each show:
bind_cols(gravity_falls_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10), gravity_falls_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10), bojack_horseman_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10), gravity_falls_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10), stranger_things_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10)) %>% setNames(c("rm_word","rm_n","a_word","a_n","bh_word","bh_n","gf_word","gf_n","st_word","st_n")) # A tibble: 10 x 10 rm_word rm_n a_word a_n bh_word bh_n gf_word gf_n st_word st_n <chr> <int> <chr> <int> <chr> <int> <chr> <int> <chr> <int> 1 mabel 456 mabel 456 bojack 807 mabel 456 yeah 482 2 hey 453 hey 453 yeah 695 hey 453 hey 317 3 ha 416 ha 416 hey 567 ha 416 mike 271 4 stan 369 stan 369 gonna 480 stan 369 sighs 261 5 dipper 347 dipper 347 time 446 dipper 347 uh 189 6 gonna 341 gonna 341 uh 380 gonna 341 dustin 179 7 time 313 time 313 diane 345 time 313 lucas 173 8 yeah 291 yeah 291 todd 329 yeah 291 gonna 166 9 uh 264 uh 264 people 307 uh 264 joyce 161 10 guys 244 guys 244 love 306 guys 244 mom 157
There are common words such as “yeah” for example.
Now I’ll combine the frequencies of all the shows and I’ll plot the top 50 frequencies to see similitudes with Gravity Falls:
tidy_others <- bind_rows(mutate(archer_subs_tidy, show = "Archer"), mutate(bojack_horseman_subs_tidy, show = "Bojack Horseman"), mutate(rick_and_morty_subs_tidy, show = "Rick and Morty"), mutate(stranger_things_subs_tidy, show = "Stranger Things")) frequency <- tidy_others %>% mutate(word = str_extract(word, "[a-z]+")) %>% count(show, word) %>% rename(other = n) %>% inner_join(count(gravity_falls_subs_tidy, word)) %>% rename(gravity_falls = n) %>% mutate(other = other / sum(other), gravity_falls = gravity_falls / sum(gravity_falls)) %>% ungroup() frequency_top_50 <- frequency %>% group_by(show) %>% arrange(-other,-gravity_falls) %>% filter(row_number() <= 50) ggplot(frequency_top_50, aes(x = other, y = gravity_falls, color = abs(gravity_falls - other))) + geom_abline(color = "gray40") + geom_jitter(alpha = 0.1, size = 2.5, width = 0.4, height = 0.4) + geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) + scale_x_log10(labels = percent_format()) + scale_y_log10(labels = percent_format()) + scale_color_gradient(limits = c(0, 0.5), low = "darkslategray4", high = "gray75") + facet_wrap(~show, ncol = 4) + theme_minimal(base_size = 14) + theme(legend.position="none") + labs(title = "Comparing Word Frequencies", subtitle = "Word frequencies in Gravity Falls episodes versus other shows'", y = "Gravity Falls", x = NULL)
What is only noticeable if you have seen the analysed shows suggests that we should explore global measures of lexical variety such as mean word frequency and type-token ratios.
Before going ahead let’s quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between Gravity Falls and the other shows?
cor.test(data = filter(frequency, show == "Archer"), ~ other + gravity_falls) Pearson's product-moment correlation data: other and gravity_falls t = 53.991, df = 4297, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.6176010 0.6532365 sample estimates: cor 0.6357574 cor.test(data = filter(frequency, show == "Bojack Horseman"), ~ other + gravity_falls) Pearson's product-moment correlation data: other and gravity_falls t = 70.829, df = 3802, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.7402008 0.7676188 sample estimates: cor 0.7542384 cor.test(data = filter(frequency, show == "Rick and Morty"), ~ other + gravity_falls) Pearson's product-moment correlation data: other and gravity_falls t = 59.98, df = 3362, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.7022447 0.7349146 sample estimates: cor 0.7189765 cor.test(data = filter(frequency, show == "Stranger Things"), ~ other + gravity_falls) Pearson's product-moment correlation data: other and gravity_falls t = 33.101, df = 2099, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.5568172 0.6130472 sample estimates: cor 0.5856364
The correlation test suggests that Gravity Falls and Bojack Horseman are the most similar from the considered sample.
The end
My analysis is now complete but the GitHub repo is open to anyone interested in using it for his/her own analysis. I covered mostly microanalysis, or words analysis as isolated units, while providing rusty bits of analysis beyond words as units that would deserve more and longer posts.
Those who find in this a useful material may explore global measures. One option is to read Text Analysis with R for Students of Literature that I’ve reviewed some time ago.
Interesting topics to explore are Hapax richness and keywords in context that correspond to mesoanalysis or even going for macroanalysis to do clustering, classification and topic modelling.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.