
impacTwit: How big is your work on twitter?

[This article was first published on distributed ecology, and kindly contributed to R-bloggers].
There’s a great Tom Waits song from the album “Mule Variations” called “Big in Japan”. The beauty of saying you’re big in Japan is that no one can ever really verify the statement (or at least that was more true in 1999). You might assert “my work is big on twitter”, and hey, how would I know? I think we all agree by now that if you’re a scientist, being big on twitter is important. But how much exposure does your work get on twitter?

In the research world, people are working on lots of interesting ways of measuring the impact of an article. People like Heather Piwowar, who cofounded total impact, are working to change how we measure the importance of a paper. More and more people want to look at a paper’s impact beyond just how many times other researchers cite it. That’s where projects like altmetrics and article-level metrics from PLoS come in. These are all great tools and I don’t doubt they are the future of how we measure impact. But what if you want to look under the hood of twitter and see what’s going on with a given research article? There are lots of web-based tools (like tweetreach), but none of them offer a concise way to extract and store twitter data about the impact of scientific articles. Enter impacTwit (a slightly tongue-in-cheek name).

impacTwit is a collection of R functions that output data about who tweeted and retweeted any collection of search terms, in a data frame that you can easily plot. It gives you the time stamp, originating tweeter, and follower count of each tweet matching a vector of search terms. It sorts them all by date and gives you cumulative sums for the entire set of search terms, cumulative sums by originating tweeter, and cumulative sums by search term. It lets you easily dissect which people or sources are influential about a paper, and gives you a sense of the total impact on twitter. Total impact here is defined as the number of potential viewers of a tweet. Before I give a worked example, two caveats about “total impact”: just because a tweet is retweeted to 10,000 people doesn’t mean they all see it, and even if they see it, how many actually click on the article link to go read it? It is an imperfect metric to be sure.
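To make the three kinds of cumulative sums concrete, here is a minimal sketch on made-up data. The data frame, its column names (`time`, `origin`, `term`, `followers`) and the values are assumptions for illustration only; they are not impacTwit’s actual output format or code.

```r
# Hypothetical tweet-level data; column names and values are assumptions,
# not the real structure returned by impacTwit.
tweets <- data.frame(
  time      = as.POSIXct(c("2012-06-26 09:00", "2012-06-26 09:30",
                           "2012-06-26 10:15", "2012-06-26 12:00",
                           "2012-06-27 08:45"), tz = "UTC"),
  origin    = c("userA", "userB", "userA", "userB", "userA"),
  term      = c("url", "title", "url", "title", "title"),
  followers = c(1200, 300, 4500, 80, 950),
  stringsAsFactors = FALSE
)

# Sort by time so the cumulative sums run in chronological order
tweets <- tweets[order(tweets$time), ]

# Potential viewers accumulated across all tweets
tweets$cum_total <- cumsum(tweets$followers)

# Running totals within each originating tweeter and within each search term;
# ave() applies cumsum inside each group and returns values in row order
tweets$cum_by_origin <- ave(tweets$followers, tweets$origin, FUN = cumsum)
tweets$cum_by_term   <- ave(tweets$followers, tweets$term,   FUN = cumsum)

print(tweets)
```

Plotting `cum_total` against `time` would give an overall impact curve, while the grouped columns give one curve per tweeter or per search term.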

The idea behind impacTwit is to measure the impact on twitter of a given scientific article (it could be used for blog posts too). The code works by searching twitter through the twitter API for article-specific terms or links. It’s not really designed to handle huge amounts of data, so if you do this with a search for “Justin Bieber” I’m pretty sure you’ll break the code. The input is a vector of search strings, presumably about the same article. As an example I’ll use a paper from PNAS that’s been making the rounds on twitter, called “Heavy use of equations impedes communication among biologists”. We can search for the actual URL of the article, but people might have linked to it from other sources, so we also want to include the title of the paper as a search string. We need both because one tweet might be “Love this article I hate math too : http://www.pnas.org/content/early/2012/06/22/1205259109.abstract”, while another might say “Heavy use of equations impedes communication among biologists: http://some.obscure.source”. Finally, there was an AP story about this called “Scientists think math is hard too”, so let’s include that title in our search. The actual input is here:
test.str <- c("Scientists think math is hard too",
              "http://www.pnas.org/content/early/2012/06/22/1205259109.abstract",
              "Heavy use of equations impedes communication among biologists")
tweet.dat <- impacTwit(test.str)
We can then generate a series of plots from the resulting data frame. The first one is just a cumulative sum of the total impact.


This is the total number of potential viewers of the article, from all people and all sources. OK, so that’s interesting: it topped out at over 120,000 potential views. What if we want to know who was influential? We can subset the data frame so it only has the top retweets, those from sources with 5 or more retweets, and plot them by originating tweeter. The first plot shows the top sources on a relative time scale, so the x axis is time since the original tweet that was retweeted.

We can also plot this on an absolute time scale to see when these retweets came into the stream. As you can see, @PlanktonMath was influential early, whereas others came late to the game. @BioScienceMum had 5 retweets, but all by low-impact people, so retweet count isn’t always a good measure of impact.
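The subsetting and the relative time scale described above might look roughly like this. It is a sketch under assumptions: the data frame `rt`, its columns and the user names are invented, and the first row for each originator is treated as the original tweet.

```r
# Hypothetical data: one row per tweet or retweet, labelled by originating tweeter.
# All names and values are made up for illustration.
base <- as.POSIXct("2012-06-26 09:00:00", tz = "UTC")
rt <- data.frame(
  origin    = c(rep("userA", 6), rep("userB", 2)),
  time      = base + 60 * c(0, 5, 12, 30, 45, 90, 20, 200),
  followers = c(800, 150, 2200, 60, 400, 1000, 5000, 75),
  stringsAsFactors = FALSE
)

# Retweets per originator = rows for that originator minus the original tweet
counts <- table(rt$origin)
keep   <- names(counts)[as.vector(counts) - 1 >= 5]
top    <- rt[rt$origin %in% keep, ]

# Relative time scale: minutes since each originator's first tweet
secs <- as.numeric(top$time)
top$rel_min <- (secs - ave(secs, top$origin, FUN = min)) / 60

# Cumulative potential viewers per originator, in time order
top <- top[order(top$origin, top$time), ]
top$cum_views <- ave(top$followers, top$origin, FUN = cumsum)

print(top)
```

Plotting `cum_views` against `rel_min` lines the originators up at a common starting point, while plotting against `time` keeps the absolute scale and shows who arrived early or late.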

 

Finally, let’s examine this by source. Our search vector had 3 terms, which I parsed out as the AP story, the direct link to the scientific article, and just the title. The AP story is all popular press, the direct link is only science, and the title is a bit of both (non-AP sources plus some posts with the direct link). impacTwit can do plots like both of the above, but here’s just one on an absolute time scale.
It’s clear that people tweeting about the article itself were less impactful, but they tweeted about it for longer. The AP tweets make a big splash and then they’re gone. You can try this all out yourself with the code over at my github, which has this example fully documented (including the parsing for figures). I’m open to any suggestions about features or improvements.
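For readers curious how tweets could be split by source, here is one possible sketch. It is not the parsing code from the github example: the tweet texts, the `example.com` links, the follower counts and the labels `AP`, `link` and `title` are all assumptions, and literal string matching is just one simple way to do it.

```r
# Hypothetical tweet texts and follower counts, invented for illustration
tweets <- data.frame(
  text = c("Scientists think math is hard too http://example.com/ap",
           "Love this article http://www.pnas.org/content/early/2012/06/22/1205259109.abstract",
           "Heavy use of equations impedes communication among biologists: http://example.com/x",
           "Scientists think math is hard too, apparently"),
  followers = c(20000, 350, 1200, 90),
  stringsAsFactors = FALSE
)

# The three search strings from the post, with assumed short labels
terms <- c(AP    = "Scientists think math is hard too",
           link  = "http://www.pnas.org/content/early/2012/06/22/1205259109.abstract",
           title = "Heavy use of equations impedes communication among biologists")

# fixed = TRUE matches each string literally, so URL characters
# such as "." and "/" are not interpreted as regex syntax
hits <- sapply(terms, function(s) grepl(s, tweets$text, fixed = TRUE))

# Label each tweet with the first term it matches (NA if it matches none)
tweets$source <- apply(hits, 1, function(h) {
  if (any(h)) names(terms)[which(h)[1]] else NA_character_
})

# Total potential viewers per source
impact <- tapply(tweets$followers, tweets$source, sum)
print(impact)
```

A tweet that matches more than one term is assigned to the first match here; a real analysis might prefer to count it under every matching source.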


Comments

Ted Hart
Thanks Galen. I might convert the normalizing by total tweets as another way of looking at it.
Galen
Nice Ted.

You might consider using entropy, and normalizing by total tweets to give relative frequencies.

You may find our presentation on Google N-Grams useful when thinking about networks:
http://prezi.com/wuhpytgxc721/exploring-the-english-language-network-with-google-n-grams/
