Using tidytext to make sentiment analysis easy
Last week I discovered the R package tidytext and its very nice e-book detailing usage. Julia Silge and David Robinson have significantly reduced the effort it takes for me to “grok” text mining by making it “tidy.”
It certainly helped that a lot of the examples are from Pride and Prejudice and other books by Jane Austen, my most beloved author. Julia Silge’s examples on her blog doing NLP and sentiment analysis alone would have made me a life-long fan. The gifs from P&P (mostly the 1995 mini-series, to be honest) on her posts and the references in the titles made me very excited. My brain automatically started playing the theme and made me smile.
Okay, enough of that. Moving on.
Seeing her work, I started wondering what I could model to get some insight into my own life. I have a database of 92,372 text messages (basically every message sent to or from me from sometime in 2011/2012 to 2015) but text messages were weird (lots of “lol”s and “haha”s). I think there are some interesting insights there, but probably not what I wanted to cover today.
So I started thinking about what other plain text data I had that might be interesting. And then I realized I have a 149-page dissertation (excluding boilerplate and references) and it was in LaTeX (so easy to parse) and it was written in 5 different files that relate directly to the chapters (intro, lit review, methods, results and a discussion). I could do something with that!
My thesis is currently under embargo while I chop it into its respective papers (one under review, one soon to be under review and one undergoing a final revision; so close), so I can’t link to it. However, it relates the seasonality of two infectious diseases to local weather patterns.
I wonder how my sentiment changes across the thesis. To do this, I’ll use the tidytext package. Let’s import the relevant packages now.
    library(tidyverse)
    library(tidytext)
    library(stringr)
The tidyverse ecosystem and tidytext play well together (no surprises there) and so I also import tidyverse. The stringr package is useful for filtering out the LaTeX specific code and also for dropping words that have numbers in them (like jefferson1776 as a reference or 0.05).
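To make the number-dropping idea concrete, here is a minimal sketch using stringr on a handful of made-up tokens (the vector below is purely illustrative and not taken from the thesis):

```r
library(stringr)

# Hypothetical tokens of the kind a LaTeX thesis tends to produce
tokens <- c("seasonality", "jefferson1776", "0.05", "incidence", "p")

# Keep only the tokens that contain no digit anywhere
has_digit <- str_detect(tokens, "[0-9]")
clean_tokens <- tokens[!has_digit]

print(clean_tokens)
```

Citation keys and numeric values both fall out because each contains at least one digit; short stats terms like p survive and need to be removed by name.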
Now let’s read in the data (the tex files):
    # One row per chapter file; `text` is a list-column holding that file's lines.
    # tibble() replaces the deprecated data_frame().
    thesis_words <- tibble(file = paste0("~/thesis/thesis/",
                                         c("introduction.tex",
                                           "lit-review.tex",
                                           "methods.tex",
                                           "results.tex",
                                           "discussion.tex"))) %>%
      mutate(text = map(file, read_lines))
    thesis_words
    ## # A tibble: 5 × 2
    ##                               file          text
    ##                              <chr>        <list>
    ## 1 ~/thesis/thesis/introduction.tex   <chr [125]>
    ## 2   ~/thesis/thesis/lit-review.tex <chr [1,386]>
    ## 3      ~/thesis/thesis/methods.tex   <chr [625]>
    ## 4      ~/thesis/thesis/results.tex <chr [1,351]>
    ## 5   ~/thesis/thesis/discussion.tex   <chr [649]>
The resulting tibble has a variable file that is the name of the file that created that row and a list-column of the text of that file.
We want to unnest() that tibble, remove the lines that are LaTeX cruft (those that start with a backslash followed by a letter, \[A-Z] or \[a-z], like \section or \figure) and compute a line number.
    thesis_words <- thesis_words %>%
      unnest(text) %>%
      # drop the TeX root directive, lines that start with a LaTeX command, and blank lines
      filter(text != "%!TEX root = thesis.tex") %>%
      filter(!str_detect(text, "^(\\\\[A-Za-z])"), text != "") %>%
      mutate(line_number = row_number(),
             file = str_sub(basename(file), 1, -5))
    # order the chapters as they appear in the thesis
    thesis_words$file <- forcats::fct_relevel(thesis_words$file,
                                              c("introduction",
                                                "lit-review",
                                                "methods",
                                                "results",
                                                "discussion"))
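The line filter is easier to see on a few invented lines. This is only a sketch: the lines are made up, and the pattern is written as `[A-Za-z]` (a backslash followed by any letter), which is an assumption about what the filter is meant to match.

```r
library(stringr)

# Made-up lines in the style of a .tex chapter file
lines <- c("%!TEX root = thesis.tex",
           "\\section{Introduction}",
           "Seasonality or the periodic surges",
           "",
           "\\begin{figure}",
           "and lulls in incidence \\cite{fismanreview}")

# TRUE for lines that start with a backslash followed by a letter
is_command <- str_detect(lines, "^\\\\[A-Za-z]")

kept <- lines[!is_command & lines != "" & lines != "%!TEX root = thesis.tex"]
print(kept)
```

Only lines that begin with a command are dropped. A command in the middle of a line (the \cite here) stays, which is why some LaTeX and citation tokens still have to be cleaned up after tokenizing.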
Now we have a tibble with file giving us the chapter, text giving us the line of text from the tex files (when I wrote it, I strived to keep my line lengths under 80 characters, hence the relatively short value in text) and line_number giving a counter of the number of lines since the start of the thesis.
Now we want to tokenize (split each line into individual words, lowercased and stripped of punctuation; the default tokenizer does not reduce words to their roots). This is easy with unnest_tokens(). I’ve also played around with the results and came up with some other words that needed to be deleted (stats terms like ci or p, LaTeX terms like _i or tabular and references/numbers).
    thesis_words <- thesis_words %>%
      unnest_tokens(word, text) %>%
      filter(!str_detect(word, "[0-9]"),
             word != "fismanreview",
             word != "multicolumn",
             word != "p",
             word != "_i",
             word != "c",
             word != "ci",
             word != "al",
             word != "dowellsars",
             word != "h",
             word != "tabular",
             word != "t",
             word != "ref",
             word != "cite",
             !str_detect(word, "[a-z]_"),
             !str_detect(word, ":"),
             word != "bar",
             word != "emph",
             !str_detect(word, "textless"))
    thesis_words
    ## # A tibble: 27,787 × 3
    ##            file line_number        word
    ##          <fctr>       <int>       <chr>
    ## 1  introduction           1 seasonality
    ## 2  introduction           1          or
    ## 3  introduction           1         the
    ## 4  introduction           1    periodic
    ## 5  introduction           1      surges
    ## 6  introduction           1         and
    ## 7  introduction           1       lulls
    ## 8  introduction           1          in
    ## 9  introduction           1   incidence
    ## 10 introduction           1          is
    ## # ... with 27,777 more rows
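For a feel of what unnest_tokens() does on its own, here is a small sketch on two made-up lines (the toy tibble is illustrative only):

```r
library(dplyr)
library(tidytext)

# Two invented lines with mixed case and punctuation
toy <- tibble(line_number = 1:2,
              text = c("Seasonality, or the periodic SURGES",
                       "and lulls in incidence."))

# One row per word; the other columns (line_number) are carried along
toy_words <- toy %>%
  unnest_tokens(word, text)

print(toy_words)
```

With the default settings the words come back lowercased and without punctuation, and plurals such as "surges" and "lulls" are left as they are.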
Now to compute the sentiment using the words written per line in the thesis. tidytext comes with three sentiment lexicons, afinn, bing and nrc. afinn provides a score ranging from -5 (very negative) to +5 (very positive) for 2,476 words. bing provides a label of “negative” or “positive” for 6,788 words. nrc provides a label (anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise or trust) for 13,901 words. None of these account for negation (“I’m not sad” gets scored as negative because of “sad”, even though the sentence isn’t).
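The negation caveat is easy to reproduce. The sketch below uses a hypothetical two-word lexicon shaped like the bing one (a word column and a sentiment column) rather than the real lexicon, so it runs without downloading anything:

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for a word-level lexicon (not the real bing lexicon)
toy_lexicon <- tibble(word = c("sad", "happy"),
                      sentiment = c("negative", "positive"))

sentences <- tibble(id = 1:2,
                    text = c("I'm sad", "I'm not sad"))

# Tokenize, then keep only the words that appear in the lexicon
scored <- sentences %>%
  unnest_tokens(word, text) %>%
  inner_join(toy_lexicon, by = "word")

print(scored)
```

Both sentences end up with a single "negative" hit, because matching happens one word at a time and "not" carries no weight of its own.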
Using the nrc lexicon, let’s see how the emotions of my words change over the thesis.
    thesis_words %>%
      inner_join(get_sentiments("nrc")) %>%
      group_by(index = line_number %/% 25, file, sentiment) %>%
      summarize(n = n()) %>%
      ggplot(aes(x = index, y = n, fill = file)) +
      geom_bar(stat = "identity", alpha = 0.8) +
      facet_wrap(~ sentiment, ncol = 5)
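The `line_number %/% 25` in the group_by() is integer division, and it is what chops the thesis into 25-line chunks. A tiny sketch with made-up line numbers:

```r
# Made-up line numbers, not the real thesis
line_number <- c(1, 24, 25, 49, 50, 120)

# Integer division assigns each line to a chunk of (up to) 25 lines
index <- line_number %/% 25
print(index)

# Count how many of the lines 1..100 land in each chunk
chunk_sizes <- table((1:100) %/% 25)
print(chunk_sizes)
```

Because the line numbers start at 1 rather than 0, the first chunk holds lines 1 to 24 and is one line shorter than the rest. That makes no practical difference for a plot like this one.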
I wasn’t surprised, but at least I wasn’t sad? It looks like I used more “fear” and “negative” words in the lit-review than the other sections. However, it looks like “infectious” as in “infectious diseases” is a fear/negative word. I used that word a lot more in the lit review than other sections.
I can use the bing and afinn lexicons to look at how the sentiment of the words changed over the course of the thesis.
    # Note: newer tidytext releases name the afinn column `value` rather than `score`
    thesis_words %>%
      left_join(get_sentiments("bing")) %>%
      left_join(get_sentiments("afinn")) %>%
      group_by(index = line_number %/% 25, file) %>%
      summarize(afinn = mean(score, na.rm = TRUE),
                bing = sum(sentiment == "positive", na.rm = TRUE) -
                  sum(sentiment == "negative", na.rm = TRUE)) %>%
      gather(lexicon, lexicon_score, afinn, bing) %>%
      ggplot(aes(x = index, y = lexicon_score, fill = file)) +
      geom_bar(stat = "identity") +
      facet_wrap(~ lexicon, scales = "free_y") +
      scale_x_continuous("Location in thesis", breaks = NULL) +
      scale_y_continuous("Lexicon Score")
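One thing worth noticing in that summarize() is how the two scores treat words that aren't in a lexicon. Here is a toy sketch; the words, chunk numbers and scores are all made up, and the two small tibbles are stand-ins for the real lexicons:

```r
library(dplyr)

# Invented words already assigned to chunks 0 and 1
words <- tibble(index = c(0, 0, 0, 1, 1),
                word = c("problem", "the", "risk", "the", "incidence"))

# Hypothetical mini-lexicons (not the real afinn or bing entries)
toy_afinn <- tibble(word = c("problem", "risk"), score = c(-2, -2))
toy_bing <- tibble(word = c("problem", "risk"),
                   sentiment = c("negative", "negative"))

scores <- words %>%
  left_join(toy_bing, by = "word") %>%
  left_join(toy_afinn, by = "word") %>%
  group_by(index) %>%
  summarize(afinn = mean(score, na.rm = TRUE),
            bing = sum(sentiment == "positive", na.rm = TRUE) -
              sum(sentiment == "negative", na.rm = TRUE))

print(scores)
```

The left joins leave NA for unmatched words. A chunk with no scored words therefore gets NaN from the mean (so it has no bar) but 0 from the count difference. The mean also ignores how many words were scored, while the count difference grows with the number of hits, which may be part of why the two panels look so different.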
Looking at the two lexicons’ scoring of my thesis, the bing lexicon seems a little more stable if we assume local correlation of sentiments is likely. It seems like I started out all doom and gloom (hey, I needed to convince my committee that it was a real problem!), moved on to more doom and gloom (did I mention this is a problem and my question hasn’t been resolved?), the methods were more neutral, results were more doom and gloom but with a slight uplift at the end followed by more doom and gloom (this really is a problem guys!) and a little bit of hope at the end (now that we know, we can fix this?).
This got me thinking about what a typical academic paper looks like. My mental model for a paper is:
- show that the problem is really a problem (“… is a significant cause of morbidity and mortality”)
- show that the problem isn’t resolved by the prior work
- answer the question
- incorporate the answer into the existing literature
- discuss limitations and breezily dismiss them
- show hope for the future
So I pulled the text of my 4 currently published papers. I’m going to call them well-children, medication time series, transfer networks and COPD readmissions.
I took the text out of each paper and copied them into plain text files and read them into R as above. I also computed line numbers within each of the different papers.
    paper_words <- tibble(file = paste0("~/projects/paper_analysis/",
                                        c("well_child.txt",
                                          "pharm_ts.txt",
                                          "transfers.txt",
                                          "copd.txt"))) %>%
      mutate(text = map(file, read_lines)) %>%
      unnest(text) %>%
      group_by(file = str_sub(basename(file), 1, -5)) %>%
      # number lines within each paper, then bin by percent of that paper's
      # length (computed while grouped so max() is per paper, not overall)
      mutate(line_number = row_number(),
             index = round(line_number / max(line_number) * 100 / 5) * 5) %>%
      ungroup() %>%
      unnest_tokens(word, text)

    paper_sentiment <- inner_join(paper_words, get_sentiments("bing")) %>%
      count(file, index, sentiment) %>%
      spread(sentiment, n, fill = 0) %>%
      mutate(net_sentiment = positive - negative)

    paper_sentiment %>%
      ggplot(aes(x = index, y = net_sentiment, fill = file)) +
      geom_bar(stat = "identity", show.legend = FALSE) +
      facet_wrap(~ file) +
      scale_x_continuous("Location in paper (percent)") +
      scale_y_continuous("Bing Net Sentiment")
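Since the papers differ in length, the x axis is a percent-of-paper position rounded to the nearest 5. Here is a minimal sketch of that binning on two invented "papers" of 4 and 10 lines. It assumes the intent is to scale by each paper's own length, so max() is taken inside a group_by():

```r
library(dplyr)

# Two made-up papers of different lengths
toy <- tibble(file = c(rep("short", 4), rep("long", 10))) %>%
  group_by(file) %>%
  mutate(line_number = row_number(),
         # percent of the way through this paper, rounded to the nearest 5
         index = round(line_number / max(line_number) * 100 / 5) * 5) %>%
  ungroup()

# Last bin reached by each paper
last_bin <- toy %>%
  group_by(file) %>%
  summarize(last_index = max(index))

print(last_bin)
```

Both toy papers finish at 100, so their panels line up. If max(line_number) were taken over all papers at once, the shorter paper would stop well short of 100.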
It looks like I wasn’t totally off. Most of the papers start out relatively negative, have super negative results sections (judging by paper location) but I was wrong about them ending on a happy note.
And the sentiment for this post:
Talking about negative sentiments is a negative sentiment. But look at the start when I was talking about Austen… that was a good time.