[This article was first published on Hot Damn, Data!, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Clearly, a 🙂 is happier than a 🙁 but what about a :-* and a 😀 ? Or a 😐 and a 😮 ? In this post I attempt to rank emoticons in order of how happy someone has to be to use each one. (And I punctuate horribly to avoid mixing punctuation with the emoticons.)
To start off, I need a collection of emoticons associated with some text. And where else would I find this, but that gigantic compendium of everyday emotions, the definitive corpus of our age – Twitter.
The methodology is this: I collect lots of tweets containing emoticons, assign each tweet a ‘sentiment’ score[1], and then order the emoticons by the average sentiment score of the tweets containing each one.
The tweet gathering process is fairly direct. I read tweets from the streaming API[2], keep those that contain any of a predefined set of emoticons, and write them out to a file. If you want to, have a look at the Python code here. For the purpose of the R analysis, the tweet texts are already in a file. Each line is then (a) parsed for the emoticons it contains, and (b) assigned a sentiment score[3].
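The scoring and ranking steps can be sketched roughly as follows. This is a minimal Python illustration, not the code behind the chart: the `LEXICON` values, the `EMOTICONS` list, the function names and the sample tweets are all made up for the example, and the real lookup table is far larger. It also reports how many tweets each mean is based on, since a mean over a handful of tweets is not very trustworthy.

```python
import re
from collections import defaultdict

# Hypothetical word-score lexicon (illustrative values only,
# not the lookup table used for the post).
LEXICON = {"good": 3, "fun": 4, "funny": 4, "happy": 3, "love": 3,
           "sad": -2, "bad": -3, "hate": -3, "tired": -2}

# Hypothetical set of emoticons to track.
EMOTICONS = [":)", ":(", ":*", "o.O", ":D"]

def sentiment_score(text):
    """Sum the lexicon scores of the words in text; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(w, 0) for w in words)

def find_emoticons(text):
    """Return the tracked emoticons that occur in text (plain substring match)."""
    return [e for e in EMOTICONS if e in text]

def rank_emoticons(tweets):
    """Return (emoticon, mean score, tweet count) tuples, happiest first."""
    scores = defaultdict(list)
    for tweet in tweets:
        s = sentiment_score(tweet)
        for e in find_emoticons(tweet):
            scores[e].append(s)
    ranked = [(e, sum(v) / len(v), len(v)) for e, v in scores.items()]
    return sorted(ranked, key=lambda r: r[1], reverse=True)

# Made-up sample tweets, one per line as in the input file.
tweets = [
    "Such a good day, so happy :)",
    "I hate Mondays :(",
    "So tired and sad :(",
    "That was fun :D",
    "Well that was bad :)",
]
ranking = rank_emoticons(tweets)
for emoticon, mean, n in ranking:
    print(f"{emoticon}\t{mean:+.2f}\t(n={n})")
```

On this toy input :D comes out on top, but only on the strength of a single tweet.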
Finally, we plot each tweet on an emoticon-score plot. Like so:
The tiny vertical black lines mark the mean score for each emoticon.
There is no ordering to the colour scale. The colours just help differentiate each row.
Okay, so here’s a list of observations and (partial) explanations for some surprises:
- o.O and :* score higher than 🙂
I think the ubiquity of 🙂 is its burden. People feel 🙂 for all sorts of reasons. Also, the score for o.O is computed over a much smaller number of tweets, and is possibly unstable.
- I can understand people using 🙂 at sad stuff, but what kind of a person uses 🙁 for happy tweets? (There aren’t many of these, but a couple of them sit far to the right of the plot.) Let’s look at one of those tweets:
Wow I was sleeping sooooo good which doesn’t happen very often & They called from work & woke me up .. Now I can’t go back to sleep 🙁
That makes sense. It’s a tweet that turned sour halfway through, but overall it had a pretty high density of positive words, so it’s no surprise that our scorer tagged it with a positive score.
- Here’s a tweet with an 8D in it:
Got to take a pic with heage ! Who has by far been the most fun, funny and candid lecturer(in my… http://t.co/WaOTW8D2YO
Notice anything funny? It’s a happy tweet, but the emoticon we were looking for is conspicuously absent! Actually, the 8D does occur in the tweet, albeit in a URL: http://t.co/WaOTW8D2YO
Thanks to Twitter’s automatic URL shortening using t.co, it’s entirely possible to see an arbitrary collection of alphanumeric characters in a tweet without any semantic information. So be wary of the scores for stuff like 8D and xD.
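One way to guard against this kind of false match is to strip URLs from the tweet before looking for emoticons. The sketch below is only an illustration of that idea and was not part of the analysis above: `URL_RE` is a deliberately simple pattern and `find_emoticons` is a hypothetical helper name.

```python
import re

# Simplified URL pattern for illustration; real tweets may need a more careful one.
URL_RE = re.compile(r"https?://\S+")

def find_emoticons(text, emoticons):
    """Return the emoticons found in text once URLs have been removed."""
    without_urls = URL_RE.sub(" ", text)
    return [e for e in emoticons if e in without_urls]

tweet = ("Got to take a pic with heage ! Who has by far been the most fun, "
         "funny and candid lecturer(in my… http://t.co/WaOTW8D2YO")

# The only "8D" in this tweet sits inside the t.co link, so nothing matches.
print(find_emoticons(tweet, ["8D", "xD"]))

# A genuine 8D outside a URL is still picked up.
print(find_emoticons("Best day ever 8D", ["8D", "xD"]))
```

This does nothing about emoticon-like strings inside ordinary words or hashtags, so the scores for 8D and xD would still deserve some suspicion.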
So the next time you can’t tell what someone is trying to convey with an emoticon, this chart might come in handy as a reference. In the meantime, if you’re happy and you know it, contort your pupils o.O
Notes
[1] A linear scale where positive is happy and negative is unhappy.
[2] Twitter’s Search API handles punctuation poorly, so that’s not an option.
[3] Assignment of this score is done via a relatively simple lookup mechanism. This file provides a good evaluation.
To leave a comment for the author, please follow the link and comment on their blog: Hot Damn, Data!.