Does sentiment analysis work? A tidy analysis of Yelp reviews
David Robinson
[This article was first published on Variance Explained, and kindly contributed to R-bloggers].
Sentiment analysis is often used by companies to quantify general social media opinion (for example, using tweets about several brands to compare customer satisfaction). One of the simplest and most common sentiment analysis methods is to classify words as “positive” or “negative”, then to average the values of each word to categorize the entire document. (See this vignette and Julia’s post for examples of a tidy application of sentiment analysis). But does this method actually work? Can you predict the positivity or negativity of someone’s writing by counting words?
To answer this, let’s try sentiment analysis on a text dataset where we know the “right answer”: one where each customer also quantified their opinion. In particular, we’ll use the Yelp Dataset: a wonderful collection of millions of restaurant reviews, each accompanied by a 1-5 star rating. We’ll try out a specific sentiment analysis method, and see the extent to which we can predict a customer’s rating based on their written opinion. In the process we’ll get a sense of the strengths and weaknesses of sentiment analysis, and explore another example of tidy text mining with tidytext, dplyr, and ggplot2.
Setup
I’ve downloaded the yelp_dataset_challenge_academic_dataset folder from here.[1] First I read the reviews and process them into a data frame:
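The code for this step is not reproduced here, so what follows is only a minimal sketch. It assumes the reviews file is newline-delimited JSON (one review object per line) with at least `stars` and `text` fields. The two records and the `review_id`/`business_id` values are invented for illustration. With the real data you would pass a `file()` connection to the downloaded reviews file instead of the `textConnection()`.

```r
library(jsonlite)
library(dplyr)

# Toy stand-in for the reviews file: one JSON object per line.
# These records are invented for illustration.
review_lines <- c(
  '{"review_id": "r1", "business_id": "b1", "stars": 5, "text": "Amazing food and great service!"}',
  '{"review_id": "r2", "business_id": "b2", "stars": 1, "text": "Terrible wait, rude staff."}'
)

# stream_in() parses newline-delimited JSON from a connection into a data frame
reviews <- stream_in(textConnection(review_lines), verbose = FALSE) %>%
  as_tibble()

reviews
```

Reading line by line like this avoids having to wrap the whole file in a single JSON array first.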
We now have a data frame with one row per review:
Notice the stars column with the star rating the user gave, as well as the text column (too large to display) with the actual text of the review. For now, we’ll focus on whether we can predict the star rating based on the text.
Tidy sentiment analysis
Right now, there is one row for each review. To analyze in the tidy text framework, we need to use the unnest_tokens function and turn this into one-row-per-term-per-document:
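Here is a minimal sketch of that step on two invented reviews (the data and the regular expression are my own assumptions, not the original code). It uses `unnest_tokens()` and the `stop_words` table from tidytext, plus a `str_detect()` filter to keep only tokens made of letters and apostrophes.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Toy reviews, invented for illustration
reviews <- tibble(
  review_id = c("r1", "r2"),
  stars = c(5, 1),
  text = c("The food was amazing and the staff were great!",
           "I hated the 45 minute wait. Terrible service.")
)

review_words <- reviews %>%
  unnest_tokens(word, text) %>%            # one row per word per review, lowercased
  anti_join(stop_words, by = "word") %>%   # drop stopwords such as "the" and "and"
  filter(str_detect(word, "^[a-z']+$"))    # drop tokens that are not words, e.g. "45"

review_words
```

Columns other than `text` (here `review_id` and `stars`) are carried along, so every word stays attached to its review and rating.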
Notice that there is now one-row-per-term-per-document: the tidy text form. In this cleaning process we’ve also removed “stopwords” (such as “I”, “the”, “and”, etc.), and removed tokens that are formatting (e.g. “----”) rather than words.
Now let’s perform sentiment analysis on each review. We’ll use the AFINN lexicon, which provides a positivity score for each word, from -5 (most negative) to 5 (most positive). This lexicon, along with several others, is stored in the sentiments table that comes with tidytext. (I’ve tried some other lexicons on this dataset and the results are pretty similar.)
Now as described in Julia’s post, our sentiment analysis is just an inner-join operation followed by a summary:
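To make the join-then-summarize idea concrete, here is a small sketch under stated assumptions. `toy_lexicon` is a hypothetical four-word stand-in with made-up scores on the -5 to 5 scale, and the review words are invented. To my knowledge, recent tidytext versions fetch the real lexicon with `get_sentiments("afinn")`, which needs the textdata package and a one-time download, so a toy table keeps the example self-contained.

```r
library(dplyr)

# Hypothetical AFINN-style lexicon: made-up scores on the -5..5 scale
toy_lexicon <- tibble(
  word = c("amazing", "great", "terrible", "hated"),
  score = c(4, 3, -3, -3)
)

# Tidy one-row-per-word data, as produced by unnest_tokens() (toy values)
review_words <- tibble(
  review_id = c("r1", "r1", "r1", "r2", "r2", "r2"),
  stars = c(5, 5, 5, 1, 1, 1),
  word = c("food", "amazing", "great", "hated", "wait", "terrible")
)

reviews_sentiment <- review_words %>%
  inner_join(toy_lexicon, by = "word") %>%   # keep only words the lexicon scores
  group_by(review_id, stars) %>%
  summarize(sentiment = mean(score), .groups = "drop")

reviews_sentiment
```

Note that the inner join silently drops words the lexicon does not know (here “food” and “wait”), and any review with no scored words disappears from the result entirely.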
We now have an average sentiment alongside the star ratings. If we’re right and sentiment analysis can predict a review’s opinion towards a restaurant, we should expect the sentiment score to correlate with the star rating.
Did it work?
Well, it’s a very good start! Our sentiment scores are certainly correlated with positivity ratings. But we do see that there’s a large amount of prediction error: some 5-star reviews have a highly negative sentiment score, and vice versa.
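One way to look at this relationship is a boxplot of sentiment within each star rating, alongside a rank correlation. The sketch below uses invented per-review scores, so its numbers say nothing about the Yelp data. The choice of Spearman correlation is my own assumption.

```r
library(dplyr)
library(ggplot2)

# Toy per-review sentiment scores, invented for illustration
reviews_sentiment <- tibble(
  stars = c(1, 1, 2, 3, 4, 5, 5, 5),
  sentiment = c(-2.5, 1.0, -1.0, 0.5, 1.5, 3.0, 2.0, -1.5)
)

# Rank correlation between star rating and average sentiment
rho <- cor(reviews_sentiment$stars, reviews_sentiment$sentiment,
           method = "spearman")

# Distribution of sentiment within each star rating
p <- ggplot(reviews_sentiment, aes(factor(stars), sentiment)) +
  geom_boxplot() +
  labs(x = "Stars", y = "Average sentiment score")

rho
p
```

The boxplot makes the prediction error visible: overlapping boxes across star ratings mean that a sentiment score alone cannot pin down the rating.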
Which words are positive or negative?
Our algorithm works at the word level, so if we want to improve our approach we should start there. Which words are suggestive of positive reviews, and which are negative?
To examine this, let’s create a per-word summary, and see which words tend to appear in positive or negative reviews. This takes more grouping and summarizing:
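A sketch of such a per-word summary, on invented data, might look like the following. The column names (`businesses`, `reviews`, `uses`, `average_stars`) are my own choices for illustration.

```r
library(dplyr)

# Toy tidy data: one row per word occurrence (invented values)
review_words <- tibble(
  review_id   = c("r1", "r1", "r2", "r2", "r3", "r3", "r3"),
  business_id = c("b1", "b1", "b1", "b1", "b2", "b2", "b2"),
  stars       = c(5, 5, 1, 1, 4, 4, 4),
  word        = c("food", "amazing", "food", "bad", "food", "amazing", "amazing")
)

word_summaries <- review_words %>%
  count(review_id, business_id, stars, word) %>%   # uses of each word per review
  group_by(word) %>%
  summarize(
    businesses = n_distinct(business_id),  # businesses whose reviews use the word
    reviews = n(),                         # number of reviews containing the word
    uses = sum(n),                         # total occurrences across reviews
    average_stars = mean(stars),           # mean rating of reviews with the word
    .groups = "drop"
  )

word_summaries
```

Counting per review first means `average_stars` weights each review once, no matter how many times it repeats a word.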
We can start by looking only at words that appear in at least 200 (out of 200,000) reviews. This makes sense both because rare words will have a noisier measurement (a few good or bad reviews could shift the balance), and because they’re less likely to be useful in classifying future reviews or text. I also filter for ones that appear in at least 10 businesses (others are likely to be specific to a particular restaurant).
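As a sketch, those two thresholds translate into a single `filter()` call. The summary table below is invented, and “bobs” is a hypothetical restaurant-specific word included to show what the business threshold removes.

```r
library(dplyr)

# Toy per-word summaries (invented values)
word_summaries <- tibble(
  word = c("food", "amazing", "bad", "bobs"),
  businesses = c(5000, 1200, 900, 1),
  reviews = c(60000, 9000, 7000, 250),
  average_stars = c(3.7, 4.4, 2.6, 4.9)
)

# Thresholds described in the text
min_reviews <- 200
min_businesses <- 10

word_summaries_filtered <- word_summaries %>%
  filter(reviews >= min_reviews, businesses >= min_businesses)

# Most positive words first; arrange(average_stars) gives the most negative
word_summaries_filtered %>%
  arrange(desc(average_stars))
```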
What were the most positive and negative words?
Looks plausible to me! What about negative?
Also makes a lot of sense. We can also plot positivity by frequency:
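A plot along these lines can be sketched with ggplot2 as follows. The data are invented, and the log-scaled x axis is my assumption, chosen because word frequencies usually span several orders of magnitude.

```r
library(dplyr)
library(ggplot2)

# Toy filtered per-word summaries (invented values)
word_summaries_filtered <- tibble(
  word = c("food", "amazing", "awesome", "bad", "told"),
  reviews = c(60000, 9000, 8000, 7000, 4000),
  average_stars = c(3.7, 4.4, 4.3, 2.6, 2.4)
)

p <- ggplot(word_summaries_filtered, aes(reviews, average_stars)) +
  geom_point() +
  geom_text(aes(label = word), vjust = 1, hjust = 1, check_overlap = TRUE) +
  scale_x_log10() +   # frequencies span orders of magnitude
  labs(x = "# of reviews", y = "Average stars")

p
```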
Note that some of the most common words (e.g. “food”) are pretty neutral. There are some common words that are pretty positive (e.g. “amazing”, “awesome”) and others that are pretty negative (“bad”, “told”).
Comparing to sentiment analysis
When we perform sentiment analysis, we’re typically comparing to a pre-existing lexicon, one that may have been developed for a particular purpose. That means that on our new dataset (Yelp reviews), some words may have different implications.
We can combine and compare the two datasets with inner_join.
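Sketched on toy data, the join looks like this. Both tables are invented stand-ins: `toy_lexicon` plays the role of AFINN, and its scores are illustrative only.

```r
library(dplyr)

# Toy per-word summaries from the reviews (invented values)
word_summaries_filtered <- tibble(
  word = c("amazing", "damn", "joke", "die", "food"),
  reviews = c(9000, 1500, 600, 700, 60000),
  average_stars = c(4.4, 3.8, 2.1, 4.0, 3.7)
)

# Hypothetical AFINN-style lexicon with illustrative scores
toy_lexicon <- tibble(
  word = c("amazing", "damn", "joke", "die"),
  score = c(4, -4, 2, -3)
)

words_afinn <- word_summaries_filtered %>%
  inner_join(toy_lexicon, by = "word") %>%   # keep words present in both tables
  arrange(score)

words_afinn
```

Each remaining row now carries both measures of positivity: the lexicon’s score and the average star rating of the reviews that use the word.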
Just like in our per-review predictions, there’s a very clear trend. AFINN sentiment analysis works, at least a little bit!
But we may want to see some of those details. Which positive/negative words were most successful in predicting a positive/negative review, and which broke the trend?
For example, we can see that most profanity has an AFINN score of -4, and that while some words, like “wtf”, successfully predict a negative review, others, like “damn”, are often positive (e.g. “the roast beef was damn good!”). Some of the words that AFINN most underestimated included “die” (“the pork chops are to die for!”), and one of the words it most overestimated was “joke” (“the service is a complete joke!”).
One other way we could look at misclassifications is to add AFINN sentiments to our frequency vs average stars plot:
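One way to draw that, sketched here on invented data with a hypothetical `score` column standing in for the AFINN score, is to map the score to color with a diverging scale centered at zero. The particular colors are my own choice.

```r
library(dplyr)
library(ggplot2)

# Toy joined table: per-word summaries plus illustrative lexicon scores
words_afinn <- tibble(
  word = c("amazing", "damn", "joke", "die", "wtf"),
  reviews = c(9000, 1500, 600, 700, 300),
  average_stars = c(4.4, 3.8, 2.1, 4.0, 1.9),
  score = c(4, -4, 2, -3, -4)
)

p <- ggplot(words_afinn, aes(reviews, average_stars, color = score)) +
  geom_point() +
  geom_text(aes(label = word), vjust = 1, hjust = 1, check_overlap = TRUE) +
  scale_x_log10() +
  # Diverging scale: negative scores red, positive blue, zero gray
  scale_color_gradient2(low = "red", mid = "gray", high = "blue", midpoint = 0) +
  labs(x = "# of reviews", y = "Average stars", color = "Lexicon score")

p
```

Misclassified words then stand out as points whose color disagrees with their vertical position, such as a red word high on the plot.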
One thing I like about the tidy text mining framework is that it lets us explore the successes and failures of our model at this granular level, using tools (ggplot2, dplyr) that we’re already familiar with.
Next time: Machine learning
In this post I’ve focused on basic exploration of the Yelp review dataset, and an evaluation of one sentiment analysis method for predicting review positivity. (Our conclusion: it’s good, but far from perfect!) But what if we want to create our own prediction method based on these reviews?
In my next post on this topic, I’ll show how to train LASSO regression (with the glmnet package) on this dataset to create a predictive model. This will serve as an introduction to machine learning methods in text classification. It will also let us create our own new “lexicon” of positive and negative words, one that may be more appropriate to our context of restaurant reviews.
[1] I encourage you to download this dataset and follow along, but note that if you do, you are bound by their Terms of Use.