The Perfect Taylor Swift Playlist?

1 year ago

[This article was first published on Blog on Data Solutions | Dedicated to helping businesses making data-driven decisions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

One thing I really admire about Taylor Swift is her ability to tell a story. The way she weaves in details and characters, you feel like each song is about you or a friend. I think one reason for her popularity is because her music is so relateable. In particular, she infamously writes about her loves and breakups and exs and those songs have carried so many through love and heartbreak of their own. As someone currently going through a breakup, I thought it would be a good distraction to try and create a playlist for the emotions I’m feeling to get me through this heartbreak. And I thought I would use this as an opportunity to explore sentiment analysis with tidytext. So I hope that the fruits of my heartbreak will prove useful for someone either interested in sentiment analysis or also looking for a curated Taylor Swift playlist.

The Data

The data I am using today comes from the Tidy Tuesday dataset. The dataset contains lyrics from Taylor Swift and Beyoncé songs (Beyoncé lyrics analysis is forth coming, do not worry fellow hive members), as well as album sales and Billboard chart rankings. First I will load the data from the tidytuesdayR github page.

library(tidytuesdayR)
library(tidyverse)
library(tidytext)

ts <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/taylor_swift_lyrics.csv')
head(ts)
## # A tibble: 6 x 4
##   Artist     Album     Title         Lyrics                                     
##   <chr>      <chr>     <chr>         <chr>                                      
## 1 Taylor Sw… Taylor S… Tim McGraw    "He said the way my blue eyes shinx\nPut t…
## 2 Taylor Sw… Taylor S… Picture to B… "State the obvious, I didn't get my perfec…
## 3 Taylor Sw… Taylor S… Teardrops on… "Drew looks at me,\nI fake a smile so he w…
## 4 Taylor Sw… Taylor S… A Place in T… "I don't know what I want, so don't ask me…
## 5 Taylor Sw… Taylor S… Cold As You   "You have a way of coming easily to me\nAn…
## 6 Taylor Sw… Taylor S… The Outside   "I didn't know what I would find\nWhen I w…

For this post we are only going to focus on the Taylor Swift lyrics and you can see that dataframe has 4 columns:

Artist
Album
Song title
Lyrics

The lyrics are all in one row so the first step is to turn this into a tidytext format, which means a tibble with one token per row. A token is defined as a meaningful unit of text, most often a word, and tokenization is the process of breaking a vector down into tokens. We will use the unnest_tokens() function to do this. This function takes 2 arguments, the name of the new column that will be created with the tokens and the name of the current column to be turned into tokens. Note, I named the new column “word” so that it will match up with another tibble in the next step. The function separates the vectors into single words and by default turns everything into lowercase letters.

ts %>% 
  unnest_tokens(word, Lyrics)
## # A tibble: 48,555 x 4
##    Artist       Album        Title      word 
##    <chr>        <chr>        <chr>      <chr>
##  1 Taylor Swift Taylor Swift Tim McGraw he   
##  2 Taylor Swift Taylor Swift Tim McGraw said 
##  3 Taylor Swift Taylor Swift Tim McGraw the  
##  4 Taylor Swift Taylor Swift Tim McGraw way  
##  5 Taylor Swift Taylor Swift Tim McGraw my   
##  6 Taylor Swift Taylor Swift Tim McGraw blue 
##  7 Taylor Swift Taylor Swift Tim McGraw eyes 
##  8 Taylor Swift Taylor Swift Tim McGraw shinx
##  9 Taylor Swift Taylor Swift Tim McGraw put  
## 10 Taylor Swift Taylor Swift Tim McGraw those
## # … with 48,545 more rows

Next we want to remove some extremely common words such as “the”, “of”, “to”, etc. We do this by using stop words and the anti_join() function. We also are going to get rid of some frequently used sound effect words such as “ooh” or “ah”.

data("stop_words")

custom_stop <- data.frame(word = c("ooh", "yeah", "ah", "uh", "ha", "woah"))

tidy_ts <- ts %>% 
  unnest_tokens(word, Lyrics) %>% 
  anti_join(stop_words) %>% 
  anti_join(custom_stop)

Now let’s take a look at some of her most used words. This figure shows the top 15 most frequent words for each album:

tidy_ts %>% 
  count(word, sort = T)
## # A tibble: 2,574 x 2
##    word      n
##    <chr> <int>
##  1 love    248
##  2 time    225
##  3 wanna   158
##  4 baby    153
##  5 stay    100
##  6 gonna    98
##  7 night    96
##  8 bad      80
##  9 girl     80
## 10 home     76
## # … with 2,564 more rows
tidy_ts %>% 
  group_by(Album) %>% 
  count(word) %>% 
  group_by(Album) %>% 
  slice_max(order_by = n, n = 15) %>% 
  ungroup() %>% 
  mutate(word = reorder_within(word, n, Album),
         Album = factor(Album, levels = c("Taylor Swift", "Fearless", "Speak Now", "Red", "1989", "reputation", "Lover", "folklore"))) %>%
  ggplot(aes(x = word, y = n, fill = Album)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Album, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0))

For a detailed explanation on how to make this plot see Julia Silge’s post on using the reorder_within() function.

Some things we notice right away from just looking at this figure are that time is very important throughout the folklore album, love is one of the highest ranked words in all of her albums except Red and Reputation, and her earlier albums, Speak Now, Taylor Swift, and Fearless, used mostly verbs and her newer albums used more nouns. We can make a word cloud of the top 150 words she uses.

Sentiment Analysis

Now, to create the perfect playlists for any mood, we are going to use sentiment analysis to classify each song into a category and then create playlists from those categories. We will be evaluating the lyrics both as individual words and in the overall context of the song.

Using the AFINN sentiment lexicon we will compare all of the songs. The AFINN lexicon measures sentiment on a scale of -5 to 5 for negative and positive words instead of a binary scale like the NRC or Bing.

library(tidytext)
afinn <- tidy_ts %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(Title, Album) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(Album = factor(Album, levels = c("Taylor Swift", 
                                          "Fearless", 
                                          "Speak Now", 
                                          "Red", 
                                          "1989", 
                                          "reputation", 
                                          "Lover", 
                                          "folklore")))
## Joining, by = "word"
## `summarise()` regrouping output by 'Title' (override with `.groups` argument)
ggplot(afinn, aes(Title, sentiment, fill = Album)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Album, ncol = 1, scales = "free") +
  scale_x_discrete(guide = guide_axis(n.dodge=3))

When we break down the sentiments of each song in an album, we see some surprising and some not so surprising trends. Unsurprisingly, Lover, Fearless, and Taylor Swift, are mostly positive songs whereas Reputation is more negative or neutral. I was surprised to see that 1989 had many negative songs as well as folklore. I tend to think of 1989 as an upbeat pop album that is generally about starting over, however according to this analysis, most songs tend to be more negative. The folklore album is more indie and deep and I suppose more of the songs are sad stories. Now that we have a general idea of which songs are negative and which are positive, let’s make our first attempt at a playlist. We will do this by just grouping together songs by positive or negative sentiments and we will start from the saddest to the happiest.

tidy_ts %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(Title, Album) %>% 
  summarise(sentiment = sum(value)) %>% 
  ungroup() %>% 
  mutate(group = ifelse(sentiment < 0, "negative", "positive"),
         Title = reorder_within(Title, -sentiment, group)) %>% 
  ggplot(aes(x = Title, y = sentiment, fill = Album)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~group, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0))
## Joining, by = "word"
## `summarise()` regrouping output by 'Title' (override with `.groups` argument)

Overall, this list looks pretty good but I think we can do better. There are a few songs that seem out of place to me. So now that we have a basic, negative or positive understanding of Taylor’s songs, lets break it down further by emotions. We are going to use the syuzhet package to access the NRC emotion lexicon. The emotion lexicon is a list of words and their associations with two sentiments (negative and positive) and eight emotions: anger, fear, anticipation, trust, surprise, sadness, joy, and disgust. The function get_nrc_sentiment() takes a vector of lines of text and returns a dataframe where each row is a line of text and the emotions are represented in columns.

library(syuzhet)


word_vec <- tidy_ts %>% 
  select(word) %>% 
  pull() 
  
emotions <- get_nrc_sentiment(word_vec)

emotions2 <- tidy_ts %>% 
  bind_cols(emotions)

emotions2 %>% 
  select(-c(1,2)) %>% 
  group_by(Title) %>% 
  summarise_if(is.numeric, sum) %>% 
  select(Title, sadness, anger, disgust, fear) %>% 
  slice_max(order_by = sadness, n = 10)
## # A tibble: 13 x 5
##    Title           sadness anger disgust  fear
##    <chr>             <dbl> <dbl>   <dbl> <dbl>
##  1 Miss Americana       25    22      13    23
##  2 Bad Blood            24    28      22    28
##  3 Jump Then Fall       22     4       3     8
##  4 mad woman            22    26      19    24
##  5 Blank Space          20    26      12    31
##  6 Shake It Off         20    17      19    18
##  7 Cruel Summer         19    14      13    16
##  8 Haunted              19     1       1     7
##  9 Stay Stay Stay       19    17      16    19
## 10 Gorgeous             17    15      15    14
## 11 Red                  17     7       2     7
## 12 So It Goes           17    11      10    11
## 13 The Story of Us      17    12       3    18
emotions2 %>% 
  select(-c(1,2)) %>% 
  group_by(Title) %>% 
  summarise_if(is.numeric, sum) %>% 
  select(Title, sadness, anger, disgust, fear) %>% 
  slice_max(order_by = anger, n = 10)
## # A tibble: 10 x 5
##    Title           sadness anger disgust  fear
##    <chr>             <dbl> <dbl>   <dbl> <dbl>
##  1 Bad Blood            24    28      22    28
##  2 Blank Space          20    26      12    31
##  3 mad woman            22    26      19    24
##  4 Picture to Burn      14    23      17    19
##  5 Miss Americana       25    22      13    23
##  6 Don’t Blame Me       12    18      19    12
##  7 Shake It Off         20    17      19    18
##  8 Stay Stay Stay       19    17      16    19
##  9 Afterglow            15    16       9    15
## 10 Gorgeous             17    15      15    14

When we look at the top 10 songs ranked by the number of sad or angry lines, we see a slightly different line up from the songs that were considered most negative. First, Shake it off ranked as the most negative in the previous figure but with this one, Bad Blood is the saddest, followed by mad woman. When we rank them by anger, Bad Blood is top, followed by Blank Space. I think any Taylor fan would agree that Blank Space and Bad Blood are much more angry songs than Shake it Off. Of course, this second list is not perfect, but it’s an improvement. Now lets look at the other negative emotions, disgust and fear.

emotions2 %>% 
  select(-c(1,2)) %>% 
  group_by(Title) %>% 
  summarise_if(is.numeric, sum) %>% 
  select(Title, sadness, anger, disgust, fear) %>% 
  slice_max(order_by = disgust, n = 10)
## # A tibble: 12 x 5
##    Title                       sadness anger disgust  fear
##    <chr>                         <dbl> <dbl>   <dbl> <dbl>
##  1 Bad Blood                        24    28      22    28
##  2 Don’t Blame Me                   12    18      19    12
##  3 mad woman                        22    26      19    24
##  4 Shake It Off                     20    17      19    18
##  5 Picture to Burn                  14    23      17    19
##  6 Stay Stay Stay                   19    17      16    19
##  7 Gorgeous                         17    15      15    14
##  8 Dancing With Our Hands Tied      14    13      14    13
##  9 Clean                             4     1      13     3
## 10 Cruel Summer                     19    14      13    16
## 11 Miss Americana                   25    22      13    23
## 12 The Man                          13     7      13     7
emotions2 %>% 
  select(-c(1,2)) %>% 
  group_by(Title) %>% 
  summarise_if(is.numeric, sum) %>%  
  select(Title, sadness, anger, disgust, fear) %>% 
  slice_max(order_by = fear, n = 10)
## # A tibble: 10 x 5
##    Title           sadness anger disgust  fear
##    <chr>             <dbl> <dbl>   <dbl> <dbl>
##  1 Blank Space          20    26      12    31
##  2 Bad Blood            24    28      22    28
##  3 False God             7     5       1    24
##  4 mad woman            22    26      19    24
##  5 Miss Americana       25    22      13    23
##  6 Picture to Burn      14    23      17    19
##  7 Stay Stay Stay       19    17      16    19
##  8 Shake It Off         20    17      19    18
##  9 The Story of Us      17    12       3    18
## 10 Cruel Summer         19    14      13    16

Unsurprisingly, we see some of the same titles at the top of the list. So let’s rank these songs by a typical flow of negative emotions; we will start with sadness, then anger, disgust, and finally fear. To do this, I found the total number of lines for each negative emotion per song, grouped them by emotion and to prevent the same song from showing up multiple times, I only kept the row for the emotion that ranked highest. For example, Bad Blood ranked high for all four emotions but I kept the emotion that had the greatest score in the song and ties were broken by the order of emotions (i.e. anger beat fear and sadness beat anger).

Negative Playlist

emotions2 %>% 
  select(-c(1,2)) %>% 
  group_by(Title) %>% 
  summarise_if(is.numeric, sum) %>% 
  mutate(net_sentiment = positive - negative) %>%  #get overall sentiment (negative numbers = negative sentiment)
  filter(net_sentiment < 0) %>%  #find which songs have an overall negative sentiment
  select(Title, sadness, anger, disgust, fear) %>% 
  pivot_longer(cols = c("sadness", "anger", "disgust", "fear"), names_to = "emotion") %>% 
  group_by(Title) %>% 
  slice_max(order_by = value, n = 1, with_ties = FALSE) %>% 
  mutate(emotion = factor(emotion, levels = c("sadness", "anger", "disgust", "fear"))) %>% 
  arrange(emotion) %>% 
  ungroup() %>% 
  mutate(Title = reorder_within(Title, value, emotion)) %>% 
  ggplot(aes(x = Title, y = value, fill = emotion)) +
  geom_col() +
  geom_col(show.legend = FALSE) +
  facet_wrap(~emotion, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0))

I think the angry songs were classified best. I think the sad songs left out a lot of iconic breakup songs such as Teardrops on my Guitar, Red, Back to December, exile, and a few others. I would say I’m only half satisfied with this playlist.

Positive Playlist

Of course, once you go through the negative emotions, you will inevitiably start to be happy and maybe, eventually, fall in love again. At that point, you will want a different set of playlists to fit your moods. So let’s do the same process but using the positive songs and emotions.

emotions2 %>% 
  select(-c(1,2)) %>% 
  group_by(Title) %>% 
  summarise_if(is.numeric, sum) %>%  
  mutate(net_sentiment = positive - negative) %>%  #get overall sentiment (negative numbers = negative sentiment)
  filter(net_sentiment > 0) %>%  #find which songs have an overall positive sentiment
  select(Title, anticipation, surprise, joy, trust) %>% 
  pivot_longer(cols = c("anticipation", "surprise", "joy", "trust"), names_to = "emotion") %>% 
  group_by(Title) %>% 
  slice_max(order_by = value, n = 1, with_ties = FALSE) %>% 
  mutate(emotion = factor(emotion, levels = c("anticipation", "surprise", "joy", "trust"))) %>% 
  arrange(emotion) %>% 
  ungroup() %>% 
  mutate(Title = reorder_within(Title, value, emotion)) %>% 
  ggplot(aes(x = Title, y = value, fill = emotion)) +
  geom_col() +
  geom_col(show.legend = FALSE) +
  facet_wrap(~emotion, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0))

Overall, I think the Joy playlist is most accurate but there are still some songs that shouldn’t be in there, like Teardrops on my Guitar, betty, Death by a Thousands Cuts, etc. For the surprise category, Better than Revenge fits the category but the overall sentiment is not a positive one, it’s more negative and it could be categorized as angry or disgust.

Conclusions

Overall, I think the sentiment analysis generally categorized songs in the correct positive or negative category. Of course it was not perfect but that is because of the word choice Taylor used or the words that were available in the NRC and AFINN lexicons. I did also try the Bing lexicon but the results did not significantly change. Trying to use the NRC emotions lexicons to categorize songs into playlists for moods partially worked, however, you would still want to use your own judgement to re-categorize some songs. This exercise was a good distraction for me and I hope it helped you understand sentiment analysis a bit more.

Resources

For more information on sentiment analysis please see:

To leave a comment for the author, please follow the link and comment on their blog: Blog on Data Solutions | Dedicated to helping businesses making data-driven decisions.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.