Site icon R-bloggers

Do we appreciate sunbathing in Spring ?

[This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We are currently experiencing an extremely hot month in Montréal (and more generally in North America). Looking at people having a beer, and starting the first barbecue of the year, I was wondering: if we asked people if global warming was a good or a bad thing, what do you think they will answer ? Wearing a T-shirt in Montréal in March is nice, trust me ! So how can we study, from a quantitative point of view, depending on the time of the year, what people think of global warming ?

A few month ago, I went quickly through Jeffrey Bree’s tutorial (online on his blog) on sentiment analysis, where he was studying perception of airline companies in the US based on Twitter messages. His idea was to define a score function, counting the number of positive and negative words in a short sentence (Twitter is then perfect to run such a study) mentioning an Airline company. Consider the following score function (the one proposed by Jeffrey)

score.sentiment = function(sentences, pos.words,
neg.words, .progress='none')
{
require(plyr)
require(stringr)
scores = laply(sentences, 
function(sentence, pos.words, neg.words) {
sentence = gsub('[[:punct:]]', '', sentence)
sentence = gsub('[[:cntrl:]]', '', sentence)
sentence = gsub('\\d+', '', sentence)
sentence = tolower(sentence)
word.list = strsplit(sentence, '\\s+')
words = unlist(word.list)
pos.matches = match(words, pos.words)
neg.matches = match(words, neg.words)
pos.matches = !is.na(pos.matches)
neg.matches = !is.na(neg.matches)
score = sum(pos.matches) - sum(neg.matches)
return(score)
}, pos.words, neg.words, .progress=.progress )
scores.df = data.frame(score=scores, text=sentences)
return(scores.df)
}

A list of positive and negative words (in English) can be found on http://www.cs.uic.edu/ (see e.g. the chapter on sentiment analysis).

hu.liu.pos = scan("positive-words.txt", what="character",
comment.char=';')
hu.liu.neg = scan('negative-words.txt', what='character',
comment.char=';')

The score here is simple the difference between positive versus negative words. For instance, the following sentence has a score of +3,

> score.sentiment("It's awesome I am so happy,
thank you all",
+ hu.liu.pos,hu.liu.neg)$score
[1] 3

while the following is -3.

> score.sentiment("I'm desperate, life is a nightmare,
I want to die",
+ hu.liu.pos,hu.liu.neg)$score
[1] -3

But one can easy see a big problem with this methodology. What if the sentence included negations ? E.g.

> score.sentiment("I'm no longer desperate, life is
not a nightmare anymore I don't want to die",
+ hu.liu.pos,hu.liu.neg)$score
[1] -3

Here the sentence is negative, extremely negative, if we look only at the score. But it should be the opposite. I simple idea is to change (slightly) the function, so that once a negation is found in the sentence, we take the opposite of the score. Hence, we just add at the end of the function

if("not"%in%words){score=-score}

Here we obtain

> score.sentiment.neg("I'm no longer desperate,
life is not a nightmare anymore I don't want to die",
+ hu.liu.pos,hu.liu.neg)$score
[1] 3

But does it really work ? Let us focus on Tweets,

library(twitteR)

Consider the following tweet-extractions, based on two words, a negative word, and the negation of a positive word,

> tweets=searchTwitter('not happy',n=1000)
> NH.text= lapply(tweets, function(t) t$getText() )
> NH.scores = score.sentiment(NH.text,
+ hu.liu.pos,hu.liu.neg)
 
> tweets=searchTwitter('unhappy',n=1000)
> UH.text= lapply(tweets, function(t) t$getText() )
> UH.scores = score.sentiment(UH.text,
+ hu.liu.pos,hu.liu.neg)

If we draw score density in tweets, we see that scores are quite different,

> plot(density(NH.scores$score,bw=.8),col="red")
> lines(density(UH.scores$score,bw=.8),col="blue")

If now we use the second function, taking the opposite of the score when the sentence contains a negation, we obtain

> UH.scores = score.sentiment.neg(UH.text,
+ hu.liu.pos,hu.liu.neg)
> NH.scores = score.sentiment.neg(NH.text,
+ hu.liu.pos,hu.liu.neg)
> plot(density(NH.scores$score,bw=.8),col="red")
> lines(density(UH.scores$score,bw=.8),col="blue")

So we can admit that our modified function is not that bad.

Let us now focus on tweets in North American. More specifically, we focus on tweets sent within the following region (in blue on the right), in order to avoid problems of hemispheres. Then, we restrict ourselves to specific days, or periods of time. For instance, we can wonder if snow is associated with positive or negative words, and if we enjoy snow when it does arrive, by the end of November, and if we start to find it boring by the end of March ?

Considerer the following research

> w.tweets=searchTwitter("snow",since= LISTEDATE[k],
+ until= LISTEDATE[k+1],geocode="40,-100,2000mi")
> W.text= lapply(w.tweets, function(t) t$getText() )
> W.scores = score.sentiment.neg(W.text,
+ hu.liu.pos,hu.liu.neg, .progress='text')
> M[k]=mean(W.scores$score)
We obtain here the following score function, over three years, on Twitter,

Well, we have to admit that the pattern is not that obvious. There might me small (local) bump in the winter, but it is not obvious…

Let us get back to the point used to introduce this post. If we study what people “feel” when they mention global warming, let us run the same code, again in North America
> w.tweets=searchTwitter("global warming",since= LISTEDATE[k],
+ until= LISTEDATE[k+1],geocode="40,-100,2000mi")
Actually, I was expecting a nice cycle, with positive scores in Spring, and perhaps negative scores during heat waves, in the middle of the Summer…

What we simply observe is that global warming was related to “negative words” on Twitter a few years ago, but we have reached a neutral position nowadays.

And to be honest, I do not really know how to interpret that: is there a problem with the technique I use (obviously, I use here a very simple scoring function, even after integrating a minor correction to take into consideration negations) or is there something going one that can be interpreted ?

To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics - Tag - R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.