Statistics Sunday: Welcome to Sentiment Analysis with “Hotel California”
Sentiment analysis is a natural language processing method that classifies the words in a document by whether they are positive or negative, or by whether they relate to a set of basic human emotions; the exact results differ depending on the sentiment analysis method selected. The tidytext R package gives access to 4 different sentiment lexicons (a quick way to peek at each one is sketched after the list):
- “AFINN” for Finn Årup Nielsen – which classifies words from -5 to +5 in terms of negative or positive valence
- “bing” for Bing Liu and colleagues – which classifies words as either positive or negative
- “loughran” for Loughran-McDonald – which classifies words as positive or negative, as well as into categories of uncertainty, litigious, modal, and constraining; intended mostly for financial and nonfiction works
- “nrc” for the NRC lexicon – which classifies words into eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) as well as positive or negative sentiment
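To get a feel for what each lexicon contains, you can pull them individually with tidytext's get_sentiments() function. A minimal sketch follows; note that in recent tidytext versions the AFINN, NRC, and Loughran lexicons are downloaded on first use via the textdata package.

```r
library(tidytext)
library(dplyr)

# Peek at each lexicon; afinn/nrc/loughran may prompt a one-time
# download through the textdata package in recent tidytext versions
get_sentiments("bing") %>% head()                # word + positive/negative
get_sentiments("afinn") %>% head()               # word + score from -5 to +5
get_sentiments("nrc") %>% count(sentiment)       # eight emotions + pos/neg
get_sentiments("loughran") %>% count(sentiment)  # financial categories
```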
To demonstrate sentiment analysis, I’ll use one of my favorite songs: “Hotel California” by the Eagles.
I know, I know.
Using code similar to last week's, let's pull in the lyrics of the song.
```r
library(geniusR)
library(tidyverse)

hotel_calif <- genius_lyrics(artist = "Eagles", song = "Hotel California") %>%
  mutate(line = row_number())
```
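If you want to confirm the scrape worked before moving on, a quick sanity check like the sketch below does the trick (the exact columns depend on what your version of geniusR returns, but there should be a lyric column plus the line numbers we just added):

```r
# Sanity check: one row per lyric line, with the line numbers we added
glimpse(hotel_calif)
nrow(hotel_calif)  # 43 lyric lines for this song
```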
First, we’ll chop up these 43 lines into individual words, using the unnest_tokens function from the tidytext package.
```r
library(tidytext)

tidy_hc <- hotel_calif %>%
  unnest_tokens(word, lyric)
```
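To see what unnest_tokens is doing under the hood, here's a toy example on the song's opening line: it lowercases each word, strips punctuation, and returns one row per word.

```r
# unnest_tokens in miniature: one row in, one row per word out
tibble(line = 1, lyric = "On a dark desert highway") %>%
  unnest_tokens(word, lyric)
# -> 5 rows: on, a, dark, desert, highway (all lowercased)
```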
This is also probably the point where I would remove stop words with anti_join. But these common words are very unlikely to have a sentiment attached to them, so I'll leave them in, knowing they'll be filtered out anyway by this analysis. (A sketch of that optional step is below.) We have 4 lexicons to choose from. Loughran is geared more toward financial and nonfiction text, but we'll still see how well it can classify the words anyway.
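For reference, stop-word removal would be a one-liner against tidytext's built-in stop_words data frame. This is a minimal sketch; tidy_hc_no_stops is a hypothetical object we won't use below.

```r
# Optional: drop common English stop words before scoring sentiment
tidy_hc_no_stops <- tidy_hc %>%
  anti_join(stop_words, by = "word")
```

First, though, let's create a data frame of our 4 sentiment lexicons.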
```r
new_sentiments <- sentiments %>%
  # AFINN scores are numeric, so recode them into positive/negative labels
  mutate(sentiment = ifelse(lexicon == "AFINN" & score >= 0, "positive",
                     ifelse(lexicon == "AFINN" & score < 0, "negative",
                            sentiment))) %>%
  # Count the distinct words each lexicon contains
  group_by(lexicon) %>%
  mutate(words_in_lexicon = n_distinct(word)) %>%
  ungroup()
```
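A caveat: the code above relies on the combined sentiments data frame shipped with older tidytext releases, which had lexicon and score columns. In tidytext 0.2.1 and later, sentiments contains only the Bing lexicon, so here is a roughly equivalent construction using get_sentiments(); it assumes the textdata package is installed for the AFINN, NRC, and Loughran downloads.

```r
# Build the same four-lexicon frame on newer tidytext (>= 0.2.1),
# where AFINN's numeric column is named value rather than score
new_sentiments <- bind_rows(
  get_sentiments("afinn") %>%
    mutate(lexicon = "AFINN",
           sentiment = ifelse(value >= 0, "positive", "negative")) %>%
    select(word, sentiment, lexicon),
  get_sentiments("bing") %>% mutate(lexicon = "bing"),
  get_sentiments("nrc") %>% mutate(lexicon = "nrc"),
  get_sentiments("loughran") %>% mutate(lexicon = "loughran")
) %>%
  group_by(lexicon) %>%
  mutate(words_in_lexicon = n_distinct(word)) %>%
  ungroup()
```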
Now, we’ll see how well the 4 lexicons match up with the words in the lyrics. Big thanks to Debbie Liske at DataCamp for this piece of code (and several other pieces used in this post):
```r
library(knitr)        # for kable (not attached by kableExtra)
library(kableExtra)
library(formattable)
library(yarrr)

my_kable_styling <- function(dat, caption) {
  kable(dat, "html", escape = FALSE, caption = caption) %>%
    kable_styling(bootstrap_options = c("striped", "condensed", "bordered"),
                  full_width = FALSE)
}

tidy_hc %>%
  mutate(words_in_lyrics = n_distinct(word)) %>%
  inner_join(new_sentiments) %>%
  group_by(lexicon, words_in_lyrics, words_in_lexicon) %>%
  summarise(lex_match_words = n_distinct(word)) %>%
  ungroup() %>%
  mutate(total_match_words = sum(lex_match_words),
         match_ratio = lex_match_words / words_in_lyrics) %>%
  select(lexicon, lex_match_words, words_in_lyrics, match_ratio) %>%
  mutate(lex_match_words = color_bar("lightblue")(lex_match_words),
         lexicon = color_tile("lightgreen", "lightgreen")(lexicon)) %>%
  my_kable_styling(caption = "Lyrics Found In Lexicons")

## Joining, by = "word"
```
| lexicon | lex_match_words | words_in_lyrics | match_ratio |
|---|---|---|---|
| AFINN | 18 | 175 | 0.1028571 |
| bing | 18 | 175 | 0.1028571 |
| loughran | 1 | 175 | 0.0057143 |
| nrc | 23 | 175 | 0.1314286 |
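If you just want these numbers without the HTML styling, the same summary can be computed with plain dplyr; this is a minimal sketch of the calculation behind the table above.

```r
# Share of the lyrics' 175 distinct words that each lexicon recognizes
tidy_hc %>%
  mutate(words_in_lyrics = n_distinct(word)) %>%
  inner_join(new_sentiments, by = "word") %>%
  group_by(lexicon, words_in_lyrics) %>%
  summarise(lex_match_words = n_distinct(word), .groups = "drop") %>%
  mutate(match_ratio = lex_match_words / words_in_lyrics)
```

The takeaway is the same either way: NRC matches the most words, while Loughran matches almost none, which is what you'd expect from a financial lexicon applied to rock lyrics.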