Archer and Tidy Data Principles (Part 3)

[This article was first published on Pachá, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Motivation

The first and second part of this analysis gave the idea that I did too much scrapping and processing and that deserves more analysis to use that information well. In this third and final part I’m also taking a lot of ideas from Julia Silge’s blog.

In the GitHub repo of this project you shall find not just Rick and Morty processed subs, but also for Archer, Bojack Horseman, Gravity Falls and Stranger Things. Why? In this post post I’m gonna compare the different shows.

Note: If some images appear too small on your screen you can open them in a new tab to show them in their original size.

Word Frequencies

Comparing frequencies across different shows can tell us how similar Archer, for example, is similar to Gravity Falls I’ll use the subtitles from different shows that I scraped using the same procedure I did with Rick and Morty.

if (!require("pacman")) install.packages("pacman")
p_load(data.table, tidyr, tidytext, dplyr, ggplot2, viridis, ggstance, stringr, scales)
p_load_gh("dgrtwo/widyr")
topicmodels (NA -> 0.2-9) [CRAN]
fuzzyjoin   (NA -> 0.1.5) [CRAN]
subs <- list.files("../../data/2017-10-13-rick-and-morty-tidy-data", pattern = "subs", full.names = T)

archer_subs <- as_tibble(fread(subs[[1]])) %>%
  mutate(text = iconv(text, to = "ASCII")) %>%
  drop_na()

bojack_horseman_subs <- as_tibble(fread(subs[[2]])) %>%
  mutate(text = iconv(text, to = "ASCII")) %>%
  drop_na()

gravity_falls_subs <- as_tibble(fread(subs[[3]])) %>%
  mutate(text = iconv(text, to = "ASCII")) %>%
  drop_na()

rick_and_morty_subs <- as_tibble(fread(subs[[4]])) %>%
  mutate(text = iconv(text, to = "ASCII")) %>%
  drop_na()

stranger_things_subs <- as_tibble(fread(subs[[5]])) %>%
  mutate(text = iconv(text, to = "ASCII")) %>%
  drop_na()

archer_subs_tidy <- archer_subs %>%
  unnest_tokens(word,text) %>%
  anti_join(stop_words)

bojack_horseman_subs_tidy <- bojack_horseman_subs %>%
  unnest_tokens(word,text) %>%
  anti_join(stop_words)

gravity_falls_subs_tidy <- gravity_falls_subs %>%
  unnest_tokens(word,text) %>%
  anti_join(stop_words)

rick_and_morty_subs_tidy <- rick_and_morty_subs %>%
  unnest_tokens(word,text) %>%
  anti_join(stop_words)

stranger_things_subs_tidy <- stranger_things_subs %>%
  unnest_tokens(word,text) %>%
  anti_join(stop_words)

With this processing we can compare frequencies across different shows. Here’s an example of the top ten words for each show:

bind_cols(archer_subs_tidy %>%
            count(word, sort = TRUE) %>%
            filter(row_number() <= 10),
          archer_subs_tidy %>%
            count(word, sort = TRUE) %>%
            filter(row_number() <= 10),
          bojack_horseman_subs_tidy %>%
            count(word, sort = TRUE) %>%
            filter(row_number() <= 10),
          gravity_falls_subs_tidy %>%
            count(word, sort = TRUE) %>%
            filter(row_number() <= 10),
          stranger_things_subs_tidy %>%
            count(word, sort = TRUE) %>%
            filter(row_number() <= 10)) %>%
  setNames(c("rm_word","rm_n","a_word","a_n","bh_word","bh_n","gf_word","gf_n","st_word","st_n"))
# A tibble: 10 x 10
   rm_word  rm_n a_word   a_n bh_word  bh_n gf_word  gf_n st_word  st_n
   <chr>   <int> <chr>  <int> <chr>   <int> <chr>   <int> <chr>   <int>
 1 archer   4526 archer  4526 bojack    807 mabel     456 yeah      482
 2 lana     2795 lana    2795 yeah      695 hey       453 hey       317
 3 yeah     1474 yeah    1474 hey       567 ha        416 mike      271
 4 cyril    1471 cyril   1471 gonna     480 stan      369 sighs     261
 5 malory   1460 malory  1460 time      446 dipper    347 uh        189
 6 pam      1297 pam     1297 uh        380 gonna     341 dustin    179
 7 god       873 god      873 diane     345 time      313 lucas     173
 8 wait      844 wait     844 todd      329 yeah      291 gonna     166
 9 uh        830 uh       830 people    307 uh        264 joyce     161
10 gonna     745 gonna    745 love      306 guys      244 mom       157

There are common words such as “yeah” for example.

Now I’ll combine the frequencies of all the shows and I’ll plot the top 50 frequencies to see similitudes with Archer:

tidy_others <- bind_rows(mutate(bojack_horseman_subs_tidy, show = "Bojack Horseman"),
                        mutate(gravity_falls_subs_tidy, show = "Gravity Falls"),
                        mutate(rick_and_morty_subs_tidy, show = "Rick and Morty"),
                        mutate(stranger_things_subs_tidy, show = "Stranger Things"))

frequency <- tidy_others %>%
  mutate(word = str_extract(word, "[a-z]+")) %>%
  count(show, word) %>%
  rename(other = n) %>%
  inner_join(count(archer_subs_tidy, word)) %>%
  rename(archer = n) %>%
  mutate(other = other / sum(other),
         archer = archer / sum(archer)) %>%
  ungroup()

frequency_top_50 <- frequency %>%
  group_by(show) %>%
  arrange(-other,-archer) %>%
  filter(row_number() <= 50)

ggplot(frequency_top_50, aes(x = other, y = archer, color = abs(archer - other))) +
  geom_abline(color = "gray40") +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.4, height = 0.4) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.5), low = "darkslategray4", high = "gray75") +
  facet_wrap(~show, ncol = 4) +
  theme_minimal(base_size = 14) +
  theme(legend.position="none") +
  labs(title = "Comparing Word Frequencies",
       subtitle = "Word frequencies in Archer episodes versus other shows'",
       y = "Archer", x = NULL)

What is only noticeable if you have seen the analysed shows suggests that we should explore global measures of lexical variety such as mean word frequency and type-token ratios.

Before going ahead let’s quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between Archer and the other shows?

cor.test(data = filter(frequency, show == "Bojack Horseman"), ~ other + archer)

    Pearson's product-moment correlation

data:  other and archer
t = 75.164, df = 5530, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6975951 0.7236750
sample estimates:
      cor 
0.7108793 
cor.test(data = filter(frequency, show == "Gravity Falls"), ~ other + archer)

    Pearson's product-moment correlation

data:  other and archer
t = 58.208, df = 4300, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6467996 0.6802407
sample estimates:
      cor 
0.6638519 
cor.test(data = filter(frequency, show == "Rick and Morty"), ~ other + archer)

    Pearson's product-moment correlation

data:  other and archer
t = 62.068, df = 4606, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6588360 0.6902949
sample estimates:
      cor 
0.6748719 
cor.test(data = filter(frequency, show == "Stranger Things"), ~ other + archer)

    Pearson's product-moment correlation

data:  other and archer
t = 45.812, df = 2653, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6427990 0.6853044
sample estimates:
     cor 
0.664589 

The correlation test suggests that Archer and Bojack Horseman are the most similar from the considered sample.

The end

My analysis is now complete but the GitHub repo is open to anyone interested in using it for his/her own analysis. I covered mostly microanalysis, or words analysis as isolated units, while providing rusty bits of analysis beyond words as units that would deserve more and longer posts.

Those who find in this a useful material may explore global measures. One option is to read Text Analysis with R for Students of Literature that I’ve reviewed some time ago.

Interesting topics to explore are Hapax richness and keywords in context that correspond to mesoanalysis or even going for macroanalysis to do clustering, classification and topic modelling.

To leave a comment for the author, please follow the link and comment on their blog: Pachá.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)