Sentiment Analysis using R
[This article was first published on Data Perspective, and kindly contributed to R-bloggers].
September 23, 2013
Movie rating using Twitter Data – Using R
Today I will explain how to create a basic movie review engine in R, based on what people tweet about a movie.
The implementation of the Review Engine will be as follows:
- Get tweets from Twitter
- Clean the data
- Create a word cloud
- Create a data dictionary
- Score each tweet
The first step is to fetch the data from Twitter. In R, the twitteR package lets us call the Twitter API; the code further below uses it to fetch the tweets. Each tweet contains:
- Text
- Is re-tweeted
- Re-tweet count
- Tweeted User name
- Latitude/Longitude
- Replied to, etc.
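library(twitteR)  # provides searchTwitter()
library(tm)
# Assumes the R session is already authenticated with the Twitter API
tweets = searchTwitter("#ChennaiExpress", n=500, lang="en")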
Clean the data:
In the next step, we need to clean the data so that we can use it for our analysis. Cleaning of data is a very important step in Data Analysis. This step includes:
Extracting only text from Tweets:
tweets_txt = sapply(tweets,function(x) x$getText())
Removing URLs, reply-to mentions, punctuation, non-alphanumeric symbols, numbers, extra spaces, etc.
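# Remove retweet/via markers together with the handles that follow them
tweets_cl = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets_txt, perl=TRUE)
# Remove URLs and any remaining @mentions
tweets_cl = gsub("http[^[:blank:]]+", "", tweets_cl)
tweets_cl = gsub("@\\w+", "", tweets_cl)
# Replace punctuation and other non-alphanumeric characters with spaces
tweets_cl = gsub("[[:punct:]]", " ", tweets_cl)
tweets_cl = gsub("[^[:alnum:]]", " ", tweets_cl)
# Remove numbers
tweets_cl = gsub("\\d+", "", tweets_cl)
# Collapse repeated spaces/tabs last, then trim leading and trailing whitespace
tweets_cl = gsub("[ \t]{2,}", " ", tweets_cl)
tweets_cl = gsub("^\\s+|\\s+$", "", tweets_cl)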
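To check the cleaning steps without calling Twitter, they can be wrapped in a function and tried on a couple of hand-written strings. This is only a minimal sketch: the name clean_tweets and the sample tweets are made up for illustration, and the steps are slightly condensed compared with the lines above.

```r
# Hypothetical helper: applies the cleaning steps to a character vector
clean_tweets <- function(txt) {
  txt <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", txt, perl = TRUE)  # retweet headers
  txt <- gsub("http[^[:blank:]]+", "", txt)  # URLs
  txt <- gsub("@\\w+", "", txt)              # remaining mentions
  txt <- gsub("[^[:alnum:]]", " ", txt)      # punctuation and symbols
  txt <- gsub("[[:digit:]]+", "", txt)       # numbers
  txt <- gsub("[[:space:]]+", " ", txt)      # collapse runs of whitespace
  gsub("^ | $", "", txt)                     # trim leading/trailing space
}

# Made-up sample tweets, not real data
sample_tweets <- c(
  "RT @fan123: Loved #ChennaiExpress!!! http://t.co/abc123",
  "@friend  the movie was 2 hours too long :("
)
cleaned <- clean_tweets(sample_tweets)
print(cleaned)
```

Collapsing whitespace at the end matters here, because the earlier steps replace symbols with spaces and would otherwise leave gaps in the text.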
Create a Word Cloud:
At this point, let us view a word cloud of the most frequently tweeted words, to get a visual understanding of the data.
library(wordcloud)
wordcloud(tweets_cl)
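If you want to see the numbers behind the cloud, the word frequencies can also be counted by hand and passed to wordcloud() explicitly. The sketch below assumes a small made-up tweets_cl vector in place of the real cleaned tweets, and only draws the cloud if the wordcloud package is installed.

```r
# Stand-in for the cleaned tweets (made-up examples, not real data)
tweets_cl <- c("loved the movie", "the songs were great", "loved the songs")

# Split on whitespace and count how often each word occurs
words <- unlist(strsplit(tolower(tweets_cl), "[[:space:]]+"))
words <- words[nchar(words) > 0]
word_freq <- sort(table(words), decreasing = TRUE)
print(word_freq)

# Draw the cloud only if the wordcloud package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(word_freq), as.numeric(word_freq), min.freq = 1)
}
```

Very common words such as "the" will dominate a cloud built this way, so in practice you may want to drop stop words before counting.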
Create a data dictionary:
In this step, we use a dictionary of positive and negative words, downloaded from here. These two types of words are used as keywords for classifying each tweet into one of four categories: Very Positive, Positive, Negative and Very Negative.
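Loading such word lists into R might look like the sketch below. It assumes each list is a plain text file with one word per line; the helper name load_word_list is hypothetical, and temporary files with a few made-up words stand in for the downloaded lists.

```r
# Hypothetical helper: reads one word per line, lowercases, drops blank lines
load_word_list <- function(path) {
  words <- tolower(trimws(readLines(path, warn = FALSE)))
  unique(words[nchar(words) > 0])
}

# Temporary files standing in for the downloaded positive/negative lists
pos_file <- tempfile(fileext = ".txt")
neg_file <- tempfile(fileext = ".txt")
writeLines(c("good", "great", "Awesome", ""), pos_file)
writeLines(c("bad", "boring", "worst"), neg_file)

pos_words <- load_word_list(pos_file)
neg_words <- load_word_list(neg_file)
print(pos_words)
print(neg_words)
```

If the downloaded files carry header or comment lines, those would need to be filtered out as well before the lists are used.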
Score each tweet:
In this step, we will write a function that calculates the rating of the movie. The function is given below. After calculating the scores, we plot graphs showing the rating as "WORST", "BAD", "GOOD" and "VERYGOOD".
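As a rough illustration of how this kind of scoring can work (a minimal sketch, not necessarily the function used for the results in this article), each tweet can be scored as the number of positive words minus the number of negative words. The word lists, the function names score_tweet and rate_score, and the thresholds that map a score to a label are all assumptions made for the example.

```r
# Tiny illustrative word lists (assumptions, not the article's dictionary)
pos_words <- c("good", "great", "loved", "awesome")
neg_words <- c("bad", "boring", "worst", "hate")

# Score = count of positive words minus count of negative words
score_tweet <- function(tweet, pos, neg) {
  words <- unlist(strsplit(tolower(tweet), "[[:space:]]+"))
  sum(words %in% pos) - sum(words %in% neg)
}

# Hypothetical thresholds mapping a score to one of the four labels;
# a score of 0 (neutral) is left unrated in this sketch
rate_score <- function(score) {
  if (score == 0) {
    NA_character_
  } else if (score <= -2) {
    "WORST"
  } else if (score < 0) {
    "BAD"
  } else if (score < 2) {
    "GOOD"
  } else {
    "VERYGOOD"
  }
}

# Made-up cleaned tweets, not real data
tweets_cl <- c("loved the movie great songs", "boring and bad worst movie",
               "good movie", "hate the ending", "watched it yesterday")
scores <- vapply(tweets_cl, score_tweet, numeric(1),
                 pos = pos_words, neg = neg_words, USE.NAMES = FALSE)
ratings <- vapply(scores, rate_score, character(1))
print(table(ratings))
```

From here, barplot(table(ratings)) would give a simple chart of how many tweets fall into each rating.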