Statistics Sunday: Welcome to Sentiment Analysis with “Hotel California”
Sentiment analysis is a natural language processing method that classifies the words in a document by whether they are positive or negative, or by whether they relate to a set of basic human emotions; the exact results differ depending on the sentiment analysis method selected. The tidytext R package gives access to 4 different sentiment lexicons (a quick way to peek at each one is sketched after the list):
- “AFINN” for Finn Årup Nielsen – which classifies words from -5 to +5 in terms of negative or positive valence
- “bing” for Bing Liu and colleagues – which classifies words as either positive or negative
- “loughran” for Loughran-McDonald – which classifies words as positive or negative, as well as into categories of uncertainty, litigious, modal, and constraining; intended mostly for financial and nonfiction works
- “nrc” for the NRC lexicon – which classifies words into eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) as well as positive or negative sentiment
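To get a feel for what each lexicon contains, you can pull them individually with tidytext's get_sentiments() function. A minimal sketch follows; note that in recent tidytext versions the AFINN, NRC, and Loughran lexicons are downloaded on first use via the textdata package.

```r
library(tidytext)
library(dplyr)

# Peek at each lexicon; afinn/nrc/loughran may prompt a one-time
# download through the textdata package in recent tidytext versions
get_sentiments("bing") %>% head()                # word + positive/negative
get_sentiments("afinn") %>% head()               # word + score from -5 to +5
get_sentiments("nrc") %>% count(sentiment)       # eight emotions + pos/neg
get_sentiments("loughran") %>% count(sentiment)  # financial categories
```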
To demonstrate sentiment analysis, I’ll use one of my favorite songs: “Hotel California” by the Eagles.
I know, I know.
Using code similar to last week's, let's pull in the lyrics of the song.
```r
library(geniusR)
library(tidyverse)

hotel_calif <- genius_lyrics(artist = "Eagles", song = "Hotel California") %>%
  mutate(line = row_number())
```
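If you want to confirm the scrape worked before moving on, a quick sanity check like the sketch below does the trick (the exact columns depend on what your version of geniusR returns, but there should be a lyric column plus the line numbers we just added):

```r
# Sanity check: one row per lyric line, with the line numbers we added
glimpse(hotel_calif)
nrow(hotel_calif)  # 43 lyric lines for this song
```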
First, we’ll chop up these 43 lines into individual words, using the unnest_tokens function from the tidytext package.
```r
library(tidytext)

tidy_hc <- hotel_calif %>%
  unnest_tokens(word, lyric)
```
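To see what unnest_tokens is doing under the hood, here's a toy example on the song's opening line: it lowercases each word, strips punctuation, and returns one row per word.

```r
# unnest_tokens in miniature: one row in, one row per word out
tibble(line = 1, lyric = "On a dark desert highway") %>%
  unnest_tokens(word, lyric)
# -> 5 rows: on, a, dark, desert, highway (all lowercased)
```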
This is also probably the point where I would remove stop words with anti_join. But these common words are very unlikely to have a sentiment attached to them, so I'll leave them in, knowing they'll be filtered out anyway by this analysis. (A sketch of that optional step is below.) We have 4 lexicons to choose from. Loughran is geared more toward financial and nonfiction text, but we'll still see how well it can classify the words anyway.
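For reference, stop-word removal would be a one-liner against tidytext's built-in stop_words data frame. This is a minimal sketch; tidy_hc_no_stops is a hypothetical object we won't use below.

```r
# Optional: drop common English stop words before scoring sentiment
tidy_hc_no_stops <- tidy_hc %>%
  anti_join(stop_words, by = "word")
```

First, though, let's create a data frame of our 4 sentiment lexicons.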
```r
new_sentiments <- sentiments %>%
  # AFINN scores are numeric, so recode them into positive/negative labels
  mutate(sentiment = ifelse(lexicon == "AFINN" & score >= 0, "positive",
                     ifelse(lexicon == "AFINN" & score < 0, "negative",
                            sentiment))) %>%
  # Count the distinct words each lexicon contains
  group_by(lexicon) %>%
  mutate(words_in_lexicon = n_distinct(word)) %>%
  ungroup()
```
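A caveat: the code above relies on the combined sentiments data frame shipped with older tidytext releases, which had lexicon and score columns. In tidytext 0.2.1 and later, sentiments contains only the Bing lexicon, so here is a roughly equivalent construction using get_sentiments(); it assumes the textdata package is installed for the AFINN, NRC, and Loughran downloads.

```r
# Build the same four-lexicon frame on newer tidytext (>= 0.2.1),
# where AFINN's numeric column is named value rather than score
new_sentiments <- bind_rows(
  get_sentiments("afinn") %>%
    mutate(lexicon = "AFINN",
           sentiment = ifelse(value >= 0, "positive", "negative")) %>%
    select(word, sentiment, lexicon),
  get_sentiments("bing") %>% mutate(lexicon = "bing"),
  get_sentiments("nrc") %>% mutate(lexicon = "nrc"),
  get_sentiments("loughran") %>% mutate(lexicon = "loughran")
) %>%
  group_by(lexicon) %>%
  mutate(words_in_lexicon = n_distinct(word)) %>%
  ungroup()
```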
Now, we’ll see how well the 4 lexicons match up with the words in the lyrics. Big thanks to Debbie Liske at DataCamp for this piece of code (and several other pieces used in this post):
```r
library(knitr)        # for kable (not attached by kableExtra)
library(kableExtra)
library(formattable)
library(yarrr)

my_kable_styling <- function(dat, caption) {
  kable(dat, "html", escape = FALSE, caption = caption) %>%
    kable_styling(bootstrap_options = c("striped", "condensed", "bordered"),
                  full_width = FALSE)
}

tidy_hc %>%
  mutate(words_in_lyrics = n_distinct(word)) %>%
  inner_join(new_sentiments) %>%
  group_by(lexicon, words_in_lyrics, words_in_lexicon) %>%
  summarise(lex_match_words = n_distinct(word)) %>%
  ungroup() %>%
  mutate(total_match_words = sum(lex_match_words),
         match_ratio = lex_match_words / words_in_lyrics) %>%
  select(lexicon, lex_match_words, words_in_lyrics, match_ratio) %>%
  mutate(lex_match_words = color_bar("lightblue")(lex_match_words),
         lexicon = color_tile("lightgreen", "lightgreen")(lexicon)) %>%
  my_kable_styling(caption = "Lyrics Found In Lexicons")

## Joining, by = "word"
```
| lexicon | lex_match_words | words_in_lyrics | match_ratio |
|---|---|---|---|
| AFINN | 18 | 175 | 0.1028571 |
| bing | 18 | 175 | 0.1028571 |
| loughran | 1 | 175 | 0.0057143 |
| nrc | 23 | 175 | 0.1314286 |
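If you just want these numbers without the HTML styling, the same summary can be computed with plain dplyr; this is a minimal sketch of the calculation behind the table above.

```r
# Share of the lyrics' 175 distinct words that each lexicon recognizes
tidy_hc %>%
  mutate(words_in_lyrics = n_distinct(word)) %>%
  inner_join(new_sentiments, by = "word") %>%
  group_by(lexicon, words_in_lyrics) %>%
  summarise(lex_match_words = n_distinct(word), .groups = "drop") %>%
  mutate(match_ratio = lex_match_words / words_in_lyrics)
```

The takeaway is the same either way: NRC matches the most words, while Loughran matches almost none, which is what you'd expect from a financial lexicon applied to rock lyrics.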