Twitter sentiment analysis with R
Recently I put together a relatively simple script in R that analyzes the content of Twitter posts and classifies them as positive, negative or neutral. The approach to processing tweets is based on this presentation: http://www.slideshare.net/ajayohri/twitter-analysis-by-kaify-rais. The algorithm scores each tweet by counting the positive and negative words it contains. The words are matched against dictionaries that you can find on the internet, or you can create such a list yourself; you can also edit an existing list or dictionary. Great work, but I discovered some issues.
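To illustrate the idea, here is a minimal sketch of the word-counting score. The full score.sentiment() function used later also strips punctuation, control characters and digits; the tiny dictionaries and the tweet below are made up just for this example:

# toy dictionaries, for illustration only
pos.words <- c('good', 'great', 'love')
neg.words <- c('bad', 'hate', 'wtf')

tweet <- 'I love this phone but the battery is bad'
words <- unlist(strsplit(tolower(tweet), '\\s+'))

# +1 for every positive word, -1 for every negative word
score <- sum(!is.na(match(words, pos.words))) - sum(!is.na(match(words, neg.words)))
score  # 1 - 1 = 0, i.e. the tweet is classified as neutral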
There are some limitations in the Twitter API. Depending on the total number of tweets you access via the API, you can usually only get tweets from the last 7-8 days (not longer, and sometimes as little as 1-2 days). This 7 to 8 day window makes it hard to understand what activities or events influenced the tweets, or to analyze historical trends.
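For reference, searchTwitter() from the twitteR package accepts since and until arguments (dates as 'YYYY-MM-DD' strings), but they only help within that same recent window; the keyword and the dates below are placeholders, and the authentication from the code further down is assumed to be in place:

# request tweets for a specific (recent) date range; dates older than the window return nothing
list <- searchTwitter('beer', n = 1500, since = '2014-07-01', until = '2014-07-08')
df <- twListToDF(list)
range(df$created)  # check which dates actually came back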
To bypass this limit and accumulate historical data, I created a cumulative file. If you access tweets regularly, you can then analyze the dynamics of the interactions with a chart like this one:
Furthermore, the algorithm is wrapped in a function, so all you need to do is enter the keyword you are interested in. The process can be repeated several times a day, and the data for each keyword is saved in a separate file. This is useful for analyzing several keywords simultaneously (e.g. several brand names or the names of competitors); a usage example follows the code below.
Let’s get started. We need to create a Twitter Application (https://apps.twitter.com/) to connect to Twitter’s API; this gives us a Consumer Key and a Consumer Secret.
#connect all libraries
library(twitteR)
library(ROAuth)
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)

#connect to API
download.file(url='http://curl.haxx.se/ca/cacert.pem', destfile='cacert.pem')
reqURL <- 'https://api.twitter.com/oauth/request_token'
accessURL <- 'https://api.twitter.com/oauth/access_token'
authURL <- 'https://api.twitter.com/oauth/authorize'
consumerKey <- '____________' #put the Consumer Key from Twitter Application
consumerSecret <- '______________' #put the Consumer Secret from Twitter Application
Cred <- OAuthFactory$new(consumerKey=consumerKey,
                         consumerSecret=consumerSecret,
                         requestURL=reqURL,
                         accessURL=accessURL,
                         authURL=authURL)
Cred$handshake(cainfo = system.file('CurlSSL', 'cacert.pem', package = 'RCurl')) #a URL appears in the Console; open it, get the code and enter it in the Console

save(Cred, file='twitter authentication.Rdata')
load('twitter authentication.Rdata') #once you have run the code the first time, you can start from this line in the future (libraries should be loaded)
registerTwitterOAuth(Cred)

#function that accesses and analyzes tweets
search <- function(searchterm)
{
  #access tweets and create cumulative file
  list <- searchTwitter(searchterm, cainfo='cacert.pem', n=1500)
  df <- twListToDF(list)
  df <- df[, order(names(df))]
  df$created <- strftime(df$created, '%Y-%m-%d')
  if (file.exists(paste(searchterm, '_stack.csv'))==FALSE) write.csv(df, file=paste(searchterm, '_stack.csv'), row.names=F)

  #merge last access with cumulative file and remove duplicates
  stack <- read.csv(file=paste(searchterm, '_stack.csv'))
  stack <- rbind(stack, df)
  stack <- subset(stack, !duplicated(stack$text))
  write.csv(stack, file=paste(searchterm, '_stack.csv'), row.names=F)

  #tweet evaluation function
  score.sentiment <- function(sentences, pos.words, neg.words, .progress='none')
  {
    require(plyr)
    require(stringr)
    scores <- laply(sentences, function(sentence, pos.words, neg.words){
      sentence <- gsub('[[:punct:]]', '', sentence) #remove punctuation
      sentence <- gsub('[[:cntrl:]]', '', sentence) #remove control characters
      sentence <- gsub('\\d+', '', sentence)        #remove digits
      sentence <- tolower(sentence)
      word.list <- str_split(sentence, '\\s+')      #split the tweet into words
      words <- unlist(word.list)
      pos.matches <- match(words, pos.words)
      neg.matches <- match(words, neg.words)
      pos.matches <- !is.na(pos.matches)
      neg.matches <- !is.na(neg.matches)
      score <- sum(pos.matches) - sum(neg.matches)  #number of positive minus number of negative words
      return(score)
    }, pos.words, neg.words, .progress=.progress)
    scores.df <- data.frame(score=scores, text=sentences)
    return(scores.df)
  }

  pos <- scan('C:/___________/positive-words.txt', what='character', comment.char=';') #folder with positive dictionary
  neg <- scan('C:/___________/negative-words.txt', what='character', comment.char=';') #folder with negative dictionary
  pos.words <- c(pos, 'upgrade')
  neg.words <- c(neg, 'wtf', 'wait', 'waiting', 'epicfail')
  Dataset <- stack
  Dataset$text <- as.factor(Dataset$text)
  scores <- score.sentiment(Dataset$text, pos.words, neg.words, .progress='text')
  write.csv(scores, file=paste(searchterm, '_scores.csv'), row.names=TRUE) #save evaluation results into a file

  #total evaluation: positive / negative / neutral
  stat <- scores
  stat$created <- stack$created
  stat$created <- as.Date(stat$created)
  stat <- mutate(stat, tweet=ifelse(stat$score > 0, 'positive', ifelse(stat$score < 0, 'negative', 'neutral')))
  by.tweet <- group_by(stat, tweet, created)
  by.tweet <- summarise(by.tweet, number=n())
  write.csv(by.tweet, file=paste(searchterm, '_opin.csv'), row.names=TRUE)

  #create chart
  ggplot(by.tweet, aes(created, number)) +
    geom_line(aes(group=tweet, color=tweet), size=2) +
    geom_point(aes(group=tweet, color=tweet), size=4) +
    theme(text = element_text(size=18),
          axis.text.x = element_text(angle=90, vjust=1)) +
    #stat_summary(fun.y = 'sum', fun.ymin='sum', fun.ymax='sum', colour = 'yellow', size=2, geom = 'line') +
    ggtitle(searchterm)

  ggsave(file=paste(searchterm, '_plot.jpeg'))
}

search("______") #enter keyword
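Since the whole workflow is wrapped in the search() function, running it for several keywords is just a matter of a few extra calls; the brand names below are placeholders:

#run the analysis for several keywords; each keyword gets its own set of files
keywords <- c('brandA', 'brandB', 'competitorC')
for (k in keywords) search(k)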
Finally we get 4 files:
- a cumulative file with all the raw data,
- a file with tweet scores (points based on the number of positive or negative words),
- a file with the number of tweets of each type (positive / negative / neutral) by date,
- and a chart that looks like this:
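As a quick check, the per-day counts can be read back and inspected; this assumes the keyword 'brandA' was used (note that paste() inserts a space before the suffix, so the file names contain one):

#load the per-day positive / negative / neutral counts produced by search('brandA')
by.tweet <- read.csv(file = paste('brandA', '_opin.csv'))
head(by.tweet)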