Motivation
After reading The Life Changing Magic of Tidying Text and A tidy text analysis of Rick and Morty, I did something similar for Rick and Morty. Now I’m doing the same for Bojack Horseman.
In this post I’ll focus on the Tidy Data principles. Here is the Github repo with the scripts to scrape the subtitles of Rick and Morty and other shows.
Note: If some images appear too small on your screen you can open them in a new tab to show them in their original size.
Let’s scrape
The subtools package returns a data frame after reading srt files. On top of that resulting data frame, I wanted to explicitly record the season and episode of each line of the subtitles. To do that I had to scrape the subtitles and then use str_replace_all. To follow the steps, clone the repo from Github:
git clone https://github.com/pachamaltese/rick_and_morty_tidy_text
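In case it helps, here is a minimal sketch (not the repo’s exact script) of how str_replace_all can pull season and episode tags out of a subtitle file name; the file name below is hypothetical:

library(stringr)

# Hypothetical file name; the repo uses its own naming scheme
srt_file <- "bojack_horseman_S01E01.srt"

# Keep only the season and episode tags, e.g. "S01 E01"
str_replace_all(srt_file, ".*(S[0-9]{2})(E[0-9]{2}).*", "\\1 \\2")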
Bojack Horseman Can Be So Tidy
After reading the tidy file I created when scraping the subtitles, I use unnest_tokens to divide the subtitles into words. This function uses the tokenizers package to separate each line into words. The default tokenizing unit is words, but other options include characters, sentences, lines, paragraphs, or separation around a regex pattern.
if (!require("pacman")) install.packages("pacman") p_load( tidyverse, tidytext, igraph, ggraph ) p_load_gh("dgrtwo/widyr") bojack_horseman_subs <- read_csv("../../data/2017-10-13-rick-and-morty-tidy-data/bojack_horseman_subs.csv") %>% mutate(text = iconv(text, to = "ASCII")) %>% drop_na() bojack_horseman_subs_tidy <- bojack_horseman_subs %>% unnest_tokens(word,text) %>% anti_join(stop_words)
The data is in one-word-per-row format, and we can manipulate it with tidy tools like dplyr. For example, in the last chunk I used an anti_join to remove stop words such as “a”, “an” or “the”.
Then we can use count to find the most common words in all of Bojack Horseman’s episodes as a whole.
bojack_horseman_subs_tidy %>%
  count(word, sort = TRUE)

# A tibble: 10,845 x 2
   word       n
   <chr>  <int>
 1 bojack   807
 2 yeah     695
 3 hey      567
 4 gonna    480
 5 time     446
 6 uh       380
 7 diane    345
 8 todd     329
 9 people   307
10 love     306
# … with 10,835 more rows
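The same verb extends naturally; for instance, counting within each season instead (output omitted here):

# Most common words per season rather than across the whole show
bojack_horseman_subs_tidy %>%
  count(season, word, sort = TRUE)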
Sentiment analysis can be done as an inner join. There is a sentiment lexicon in the tidytext package, which labels words as positive or negative. Let’s examine how sentiment changes during each season by counting the number of positive and negative words in the episodes of each season.
bojack_horseman_sentiment <- bojack_horseman_subs_tidy %>%
  inner_join(sentiments) %>%
  count(episode_name, index = linenumber %/% 50, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  left_join(bojack_horseman_subs_tidy[, c("episode_name", "season", "episode")] %>% distinct()) %>%
  arrange(season, episode) %>%
  mutate(
    episode_name = paste(season, episode, "-", episode_name),
    season = factor(season, labels = paste("Season", 1:4))
  ) %>%
  select(episode_name, season, everything(), -episode)

bojack_horseman_sentiment

# A tibble: 537 x 6
   episode_name                    season index negative positive sentiment
   <chr>                           <fct>  <dbl>    <dbl>    <dbl>     <dbl>
 1 S01 E01 - BoJack Horseman The … Seaso…     0        9        7        -2
 2 S01 E01 - BoJack Horseman The … Seaso…     1        4        3        -1
 3 S01 E01 - BoJack Horseman The … Seaso…     2        6        8         2
 4 S01 E01 - BoJack Horseman The … Seaso…     3        6        2        -4
 5 S01 E01 - BoJack Horseman The … Seaso…     4       12        8        -4
 6 S01 E01 - BoJack Horseman The … Seaso…     5       15        5       -10
 7 S01 E01 - BoJack Horseman The … Seaso…     6       13        5        -8
 8 S01 E01 - BoJack Horseman The … Seaso…     7        7        8         1
 9 S01 E01 - BoJack Horseman The … Seaso…     8       11        3        -8
10 S01 E01 - BoJack Horseman The … Seaso…     9       17        8        -9
# … with 527 more rows
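A note on the index = linenumber %/% 50 part: %/% is integer division, so it assigns each subtitle line to a 50-line chunk, and each bar in the plot below summarizes roughly 50 lines of dialogue:

# Integer division groups line numbers into 50-line chunks
49 %/% 50  # 0: lines 0 to 49 land in chunk 0
50 %/% 50  # 1: lines 50 to 99 land in chunk 1
137 %/% 50 # 2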
Now we can plot these sentiment scores across the plot trajectory of each season.
ggplot(bojack_horseman_sentiment, aes(index, sentiment, fill = season)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~season, nrow = 3, scales = "free_x", dir = "v") +
  theme_minimal(base_size = 13) +
  labs(title = "Sentiment in Bojack Horseman", y = "Sentiment") +
  scale_fill_viridis(end = 0.75, discrete = TRUE) +
  scale_x_discrete(expand = c(0.02, 0)) +
  theme(strip.text = element_text(hjust = 0)) +
  theme(strip.text = element_text(face = "italic")) +
  theme(axis.title.x = element_blank()) +
  theme(axis.ticks.x = element_blank()) +
  theme(axis.text.x = element_blank())
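If you are on ggplot2 3.0 or later, the viridis scales are built in, so the same fill can be had without the viridis package:

# Drop-in replacement for scale_fill_viridis(end = 0.75, discrete = TRUE)
scale_fill_viridis_d(end = 0.75)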
Looking at Units Beyond Words
Lots of useful work can be done by tokenizing at the word level, but sometimes it is useful or necessary to look at different units of text. For example, some sentiment analysis algorithms look beyond only unigrams (i.e. single words) to try to understand the sentiment of a sentence as a whole. These algorithms try to understand that “I am not having a good day” is a negative sentence, not a positive one, because of the negation.
bojack_horseman_sentences <- bojack_horseman_subs %>%
  group_by(season) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  ungroup()
Let’s look at just one.
bojack_horseman_sentences$sentence[1778]

[1] "hey, boys. what is this, a crossover episode?"
We can use tidy text analysis to ask questions such as: what is the most negative episode in each season of Bojack Horseman? First, let’s get the list of negative words from the lexicon. Second, let’s make a data frame of how many words are in each episode so we can normalize for episode length. Then, let’s find the number of negative words in each episode and divide by the total words in each episode. Which episode has the highest proportion of negative words?
sentiment_negative <- sentiments %>%
  filter(sentiment == "negative")

wordcounts <- bojack_horseman_subs_tidy %>%
  group_by(season, episode) %>%
  summarize(words = n())

bojack_horseman_subs_tidy %>%
  semi_join(sentiment_negative) %>%
  group_by(season, episode) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("season", "episode")) %>%
  mutate(ratio = negativewords / words) %>%
  top_n(1)

# A tibble: 4 x 5
# Groups:   season [4]
  season episode negativewords words ratio
  <chr>  <chr>           <int> <int> <dbl>
1 S01    E06               133  1277 0.104
2 S02    E07               136  1273 0.107
3 S03    E07               131  1218 0.108
4 S04    E06               160  1392 0.115
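On recent dplyr versions top_n() is superseded; an equivalent final step uses slice_max(), which names the ranking column explicitly:

# Same result as top_n(1), ranking by ratio within each season
bojack_horseman_subs_tidy %>%
  semi_join(sentiment_negative) %>%
  group_by(season, episode) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("season", "episode")) %>%
  mutate(ratio = negativewords / words) %>%
  slice_max(ratio, n = 1)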
Networks of Words
The widyr package provides pairwise_count, which counts pairs of items that occur together within a group. Let’s count the words that occur together in the lines of the first season.
bojack_horseman_words <- bojack_horseman_subs_tidy %>%
  filter(season == "S01")

word_cooccurences <- bojack_horseman_words %>%
  pairwise_count(word, linenumber, sort = TRUE)

word_cooccurences

# A tibble: 295,932 x 3
   item1    item2        n
   <chr>    <chr>    <dbl>
 1 yeah     bojack      57
 2 bojack   yeah        57
 3 bojack   hey         47
 4 hey      bojack      47
 5 horseman bojack      44
 6 bojack   horseman    44
 7 diane    bojack      36
 8 bojack   diane       36
 9 gonna    bojack      35
10 bojack   gonna       35
# … with 295,922 more rows
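To make pairwise_count concrete, here is a toy example on made-up data: two lines with two words each, so each pair is counted once in each order:

library(widyr)
library(tibble)

# Made-up miniature version of the tidy subtitles
toy <- tribble(
  ~linenumber, ~word,
  1, "bojack",
  1, "yeah",
  2, "bojack",
  2, "hey"
)

# For each ordered pair of words, count the lines containing both
pairwise_count(toy, word, linenumber)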
This can be useful, for example, to plot a network of co-occurring words with the igraph and ggraph packages.
set.seed(1717)

word_cooccurences %>%
  filter(n >= 25) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "#a8a8a8") +
  geom_node_point(color = "darkslategray4", size = 8) +
  geom_node_text(aes(label = name), vjust = 2.2) +
  ggtitle(expression(paste("Word Network in Bojack Horseman's ", italic("Season One")))) +
  theme_void()