[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers].
When I started this blog back in 2011, my goal was to write deep thoughts on trivial topics – specifically, to overthink and overanalyze pop culture and related topics that appear fluffy until you really dig into them. Recently, I’ve been blogging more about statistics, research, R, and data science, and I’ve loved getting to teach and share.
But sometimes, you just want to overthink and overanalyze pop culture.
So in a similar vein to the text analysis I’ve been demonstrating on my blog, I decided to answer a question I’m sure we all have – as Taylor Swift moved from country sweetheart to mega pop star, how have the words she uses in her songs changed?
I’ve used the geniusR package in a couple of posts, and I’ll be using it again today to answer this question. I’ll also be pulling in some additional code: some based on code from the Text Mining with R: A Tidy Approach book I recently devoured, and some written to tackle this problem I’ve created for myself to solve. I’ve shared all my code and tried to credit those who helped me write it where I can.
First, we want to pull in the names of Taylor Swift’s 6 studio albums. I found these and their release dates on Wikipedia. While there are only 6 and I could easily copy and paste them to create my data frame, I wanted to pull that data directly from Wikipedia, to write code that could be used on a larger set in the future. Thanks to this post, I could, with a couple small tweaks.
    library(rvest)
    ## Loading required package: xml2

    TSdisc <- 'https://en.wikipedia.org/wiki/Taylor_Swift_discography'

    disc <- TSdisc %>%
      read_html() %>%
      html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[2]') %>%
      html_table(fill = TRUE)
Since html() is deprecated, I replaced it with read_html(), and I got errors if I didn’t add fill = TRUE. The result is a list of 1, with an 8 by 14 data frame within that single list object. I can pull that out as a separate data frame.
    TS_albums <- disc[[1]]
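To see why the result comes back as a list, here is a minimal sketch that runs the same three steps on a tiny inline HTML page instead of Wikipedia. The page, the table contents, and the `toy_` names are all made up for illustration, and I left out `fill = TRUE` because this toy table has no ragged rows.

```r
library(rvest)

# Hypothetical stand-in for the Wikipedia page: one small table, written inline
toy_page <- read_html('
  <html><body><div id="content">
    <table>
      <tr><th>Title</th><th>Album details</th></tr>
      <tr><td>First Album</td><td>Released: January 1, 2001</td></tr>
      <tr><td>Second Album</td><td>Released: February 2, 2002</td></tr>
    </table>
  </div></body></html>')

# Select the table by XPath, then parse it
toy_tables <- toy_page %>%
  html_nodes(xpath = '//*[@id="content"]/table[1]') %>%
  html_table()

# html_table() returns one list element per matched table,
# so the data frame has to be pulled out with [[1]]
length(toy_tables)
toy_albums <- toy_tables[[1]]
toy_albums
```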
The data frame requires a little cleaning. First, there are 8 rows but only 6 albums. The Wikipedia table had a double header, and the second header was read in as a row of data; the last row contains a footnote that was included with the table. So I removed those two rows, first and last, and kept only the first two columns, which are all I need. Second, the release date sits in a table cell along with record label and formats (e.g., CD, vinyl). I don’t need those for my purposes, so I’ll pull out only the date pieces and drop the rest. Finally, I converted year from character to numeric – this becomes important later on.
    library(tidyverse)

    TS_albums <- TS_albums[2:7, 1:2]

    TS_albums <- TS_albums %>%
      separate(`Album details`, c("Released", "Month", "Day", "Year"), extra = 'drop') %>%
      select(c("Title", "Year"))

    TS_albums$Year <- as.numeric(TS_albums$Year)
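If the `separate()` step looks like magic, this small sketch shows what it is doing. By default, `separate()` splits on any run of non-alphanumeric characters, so the first four pieces of the cell are the word "Released", the month, the day, and the year, and `extra = "drop"` throws away everything after that. The cell text below is invented and only assumed to resemble the Wikipedia cells.

```r
library(dplyr)
library(tidyr)

# Hypothetical cells, assumed to look roughly like the "Album details" column
toy_albums <- tibble(
  Title = c("First Album", "Second Album"),
  `Album details` = c(
    "Released: January 1, 2001 Label: Example Records Formats: CD",
    "Released: February 2, 2002 Label: Example Records Formats: CD, LP"
  )
)

toy_clean <- toy_albums %>%
  # Keep the first four pieces; the label and formats are dropped
  separate(`Album details`, c("Released", "Month", "Day", "Year"), extra = "drop") %>%
  select(Title, Year) %>%
  # Year comes out as character, so convert it
  mutate(Year = as.numeric(Year))

toy_clean
```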
I asked geniusR to download lyrics for all 6 albums. (Note: this code may take a couple minutes to run.) It nests all of the individual album data, including lyrics, into a single column, so I just need to unnest that to create a long file, with album title and release year applied to each unnested line.
    library(geniusR)

    TS_lyrics <- TS_albums %>%
      mutate(tracks = map2("Taylor Swift", Title, genius_album))
    ## Joining, by = c("track_title", "track_n", "track_url")
    ## Joining, by = c("track_title", "track_n", "track_url")
    ## Joining, by = c("track_title", "track_n", "track_url")
    ## Joining, by = c("track_title", "track_n", "track_url")
    ## Joining, by = c("track_title", "track_n", "track_url")
    ## Joining, by = c("track_title", "track_n", "track_url")

    TS_lyrics <- TS_lyrics %>%
      unnest(tracks)
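The nest-then-unnest pattern is easier to see without a network call, so here is a rough sketch that swaps `genius_album()` for a made-up function, `fake_album()`, which just returns a two-row tibble. Everything prefixed `toy_` or `fake_` is hypothetical; the point is only that `map2()` builds a list column of data frames and `unnest()` stretches it into a long file, repeating Title and Year on every row.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical stand-in for genius_album(): returns a small tibble of "lyrics"
fake_album <- function(artist, album) {
  tibble(
    track_title = paste(album, "track", 1:2),
    lyric = paste("a line by", artist, "from", album)
  )
}

toy_albums <- tibble(Title = c("First Album", "Second Album"),
                     Year = c(2001, 2002))

# One tibble per album, stored in a list column called tracks
toy_nested <- toy_albums %>%
  mutate(tracks = map2("Some Artist", Title, fake_album))

# Unnesting repeats Title and Year for each row of each nested tibble
toy_long <- toy_nested %>%
  unnest(tracks)

toy_long
```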
Now we’ll tokenize our lyrics data frame, and start doing our word analysis.
    library(tidytext)

    tidy_TS <- TS_lyrics %>%
      unnest_tokens(word, lyric) %>%
      anti_join(stop_words)
    ## Joining, by = "word"

    tidy_TS %>%
      count(word, sort = TRUE)
    ## # A tibble: 2,024 x 2
    ##    word      n
    ##    <chr> <int>
    ##  1 time    198
    ##  2 love    180
    ##  3 baby    118
    ##  4 ooh     104
    ##  5 stay     89
    ##  6 night    85
    ##  7 wanna    84
    ##  8 yeah     83
    ##  9 shake    80
    ## 10 ey       72
    ## # ... with 2,014 more rows
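For anyone new to tidytext, here is a small sketch of the same tokenize, filter, and count steps on three invented lyric lines. `unnest_tokens()` lowercases the text and strips punctuation, giving one word per row, and the `anti_join()` against `stop_words` removes common words like "the" and "with". The toy lines are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical lyric lines
toy_lyrics <- tibble(
  Year = c(2001, 2001, 2002),
  lyric = c("Stay with me, baby", "Baby, stay the night", "Time to shake it")
)

toy_tidy <- toy_lyrics %>%
  unnest_tokens(word, lyric) %>%        # one lowercase word per row
  anti_join(stop_words, by = "word")    # drop common stop words

toy_counts <- toy_tidy %>%
  count(word, sort = TRUE)

toy_counts
```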
There are a little over 2,000 unique words across TS’s 6 albums. But how have they changed over time? To examine this, I’ll create a dataset that counts words by year (or album, really). Then I’ll use a binomial regression model to look at changes over time, one model per word. In their book, Julia Silge and David Robinson demonstrated how to use binomial regression to examine word use on the authors’ Twitter accounts over time, including an adjustment to the p-values to correct for multiple comparisons. So I based my code on that.
    words_by_year <- tidy_TS %>%
      count(Year, word) %>%
      group_by(Year) %>%
      mutate(time_total = sum(n)) %>%
      group_by(word) %>%
      mutate(word_total = sum(n)) %>%
      ungroup() %>%
      rename(count = n) %>%
      filter(word_total > 50)

    nested_words <- words_by_year %>%
      nest(-word)

    word_models <- nested_words %>%
      mutate(models = map(data, ~glm(cbind(count, time_total) ~ Year, ., family = "binomial")))
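A note on the model for a single word: with a binomial family, `glm()` reads a two-column response as `cbind(successes, failures)`. The sketch below fits one word's model on made-up counts using that form, with `time_total - count` as the second column. That is a small variation on the call above, not the code used for this post; when a word is a tiny share of each year's total, I would expect the two versions to give very similar slopes. All numbers here are invented.

```r
# Hypothetical counts for one word across six albums
toy_word <- data.frame(
  Year       = c(2006, 2008, 2010, 2012, 2014, 2017),
  count      = c(5, 8, 15, 22, 30, 45),               # times the word appears
  time_total = c(3000, 3500, 4000, 4200, 4500, 5000)  # all words that year
)

# Response is cbind(successes, failures): uses of this word vs. all other words
toy_fit <- glm(cbind(count, time_total - count) ~ Year,
               data = toy_word, family = "binomial")

# The Year slope is the change in log-odds of using the word per year
toy_slope <- coef(summary(toy_fit))["Year", ]
toy_slope

# Exponentiating gives the multiplicative change in odds per year
exp(toy_slope["Estimate"])
```

A positive slope means the word takes up a growing share of the lyrics over time; a negative slope means a shrinking share.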
This nests our regression results in a data frame called word_models. While I could unnest and keep all, I don’t care about every value the GLM gives me. What I care about is the slope for Year, so the filter selects only that slope and the associated p-value. I can then filter to select the significant/marginally significant slopes for plotting (p < 0.1).
    library(broom)

    slopes <- word_models %>%
      unnest(map(models, tidy)) %>%
      filter(term == "Year") %>%
      mutate(adjusted.p.value = p.adjust(p.value))

    top_slopes <- slopes %>%
      filter(adjusted.p.value < 0.1) %>%
      select(-statistic, -p.value)
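One detail worth knowing: `p.adjust()` uses the Holm method when no method is given, so that is the correction applied above. Here is a quick sketch on four made-up p-values (the names `w1` to `w4` are placeholders, not real results) showing how Holm compares with the stricter Bonferroni correction.

```r
# Hypothetical raw p-values for four words
raw_p <- c(w1 = 0.001, w2 = 0.012, w3 = 0.030, w4 = 0.400)

# Default method is "holm": the smallest p is multiplied by 4, the next by 3,
# and so on, and the adjusted values are never allowed to decrease
holm_p <- p.adjust(raw_p)

# Bonferroni multiplies every p-value by 4 (capped at 1)
bonf_p <- p.adjust(raw_p, method = "bonferroni")

rbind(raw = raw_p, holm = holm_p, bonferroni = bonf_p)

# Which words would survive the p < 0.1 cutoff under each correction?
names(holm_p)[holm_p < 0.1]
names(bonf_p)[bonf_p < 0.1]
```

With these toy values, Holm keeps three words under the 0.1 cutoff while Bonferroni keeps two.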
This gives me five words that show changes in usage over time: bad, call, dancing, eyes, and yeah. We can plot those five words to see how they’ve changed in usage over her 6 albums. And because I still have my TS_albums data frame, I can use that information to label the axis of my plot (which is why I needed year to be numeric). I also added a vertical line and annotations to note where TS believes she shifted from country to pop.
    library(scales)

    words_by_year %>%
      inner_join(top_slopes, by = "word") %>%
      ggplot(aes(Year, count/time_total, color = word, lty = word)) +
      geom_line(size = 1.3) +
      labs(x = NULL, y = "Word Frequency") +
      scale_x_continuous(breaks = TS_albums$Year, labels = TS_albums$Title) +
      scale_y_continuous(labels = scales::percent) +
      geom_vline(xintercept = 2014) +
      theme(panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
            panel.background = element_blank()) +
      annotate("text", x = c(2009.5, 2015.5), y = c(0.025, 0.025),
               label = c("Country", "Pop"), size = 5)
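The axis-labeling trick is the part most likely to be reusable, so here is a stripped-down sketch of it on invented data. Because Year is numeric, `scale_x_continuous()` can place breaks at the release years and label them with the album titles. The albums, words, and frequencies below are all made up.

```r
library(ggplot2)

# Hypothetical albums and word frequencies
toy_albums <- data.frame(
  Title = c("First Album", "Second Album", "Third Album"),
  Year  = c(2001, 2003, 2006)
)
toy_freq <- data.frame(
  Year = rep(toy_albums$Year, times = 2),
  word = rep(c("alpha", "beta"), each = 3),
  freq = c(0.010, 0.015, 0.022, 0.020, 0.012, 0.008)
)

p <- ggplot(toy_freq, aes(Year, freq, color = word)) +
  geom_line() +
  # Numeric breaks at the release years, labeled with album titles
  scale_x_continuous(breaks = toy_albums$Year, labels = toy_albums$Title) +
  scale_y_continuous(labels = scales::percent) +
  # Vertical marker plus text on either side of it
  geom_vline(xintercept = 2004) +
  annotate("text", x = c(2002, 2005.5), y = 0.024, label = c("Before", "After")) +
  labs(x = NULL, y = "Word Frequency")

p
```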