impacTwit is a collection of R functions that output data about who tweeted and retweeted about any collection of search terms, in a data frame that you can easily plot. It gives you the time stamp, originating tweeter, and follower count of each tweet matching a vector of search terms. It sorts them all by date and gives you cumulative sums for the entire set of search terms, cumulative sums by originating tweeter, and cumulative sums by search term. It lets you easily dissect who is influential about a paper, or which sources are, and gives you a sense of the total impact on twitter. Total impact here is defined as the number of potential viewers of a tweet. Before I give a worked example I’ll mention two caveats about “total impact”. Just because a tweet is retweeted to 10,000 people doesn’t mean they all see it, and even if they see it, how many actually click on the article link to go read it? It is an imperfect metric to be sure.
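To make the three kinds of cumulative sums concrete, here is a rough sketch in base R on a made-up data frame. This is not impacTwit's actual code: the column names (`time`, `originator`, `term`, `followers`) and all the values are assumptions I've invented for illustration.

```r
# Toy data: every column name and value here is hypothetical.
tweets <- data.frame(
  time = as.POSIXct(c("2012-06-26 10:00:00", "2012-06-25 09:00:00",
                      "2012-06-25 12:30:00", "2012-06-27 08:15:00"),
                    tz = "UTC"),
  originator = c("userA", "userB", "userA", "userB"),
  term = c("title", "url", "url", "title"),
  followers = c(1200, 300, 4500, 80),
  stringsAsFactors = FALSE
)

# Sort by time stamp so the cumulative sums run in chronological order.
tweets <- tweets[order(tweets$time), ]

# Running total of potential viewers across all tweets.
tweets$cum_total <- cumsum(tweets$followers)

# Running totals within each originating tweeter and within each search term.
tweets$cum_by_originator <- ave(tweets$followers, tweets$originator, FUN = cumsum)
tweets$cum_by_term <- ave(tweets$followers, tweets$term, FUN = cumsum)

print(tweets)
```

The last value of `cum_total` is the "total impact" in the sense defined above: the summed follower counts of everyone who tweeted or retweeted.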
The idea behind impacTwit is to measure the impact on twitter of a given scientific article (it could be used for blog posts too). The code works by searching twitter, via the twitter API, for article-specific terms or links. It’s not really designed to handle huge amounts of data, so if you do this with a search for “Justin Bieber” I’m pretty sure you’ll break the code. The input is a vector of search strings, presumably about the same article. As an example I’ll use a paper from PNAS that’s been making the rounds on twitter, called “Heavy use of equations impedes communication among biologists”. We can search for the actual URL of the article, but people might have linked to it from other sources, so we also want to include the title of the paper as a search string on top of the URL. One tweet might be: “Love this article I hate math too : http://www.pnas.org/content/early/2012/06/22/1205259109.abstract”, while another might say: “Heavy use of equations impedes communication among biologists: http://some.obscure.source”. The URL search catches the first and the title search catches the second. Finally, there was an AP story about this called “Scientists think math is hard too”, so let’s include that title in our search. The actual input is here:
    test.str <- c("Scientists think math is hard too",
                  "http://www.pnas.org/content/early/2012/06/22/1205259109.abstract",
                  "Heavy use of equations impedes communication among biologists")
    tweet.dat <- impacTwit(test.str)

We can then generate a series of plots from the resulting data frame. The first one is just a cumulative sum of the total impact.
Here is just the total number of potential viewers of the article from all people and all sources. Ok, so that’s interesting, it topped out over 120,000 potential views. What if we want to know who was influential about this? Well, we can subset our data frame so it only keeps the top retweeted sources, those with 5 or more retweets, and plot those out by originating tweeter. The first plot has the top sources all plotted on a relative time scale, so the x axis is time since the original tweet that was retweeted.
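The subsetting step can be sketched roughly like this. Again, this is only an illustration under my own assumptions, not the code from the repository: the data frame `retweets`, its columns, and the numbers are all made up, and I take each originator's earliest row as the "original tweet" when computing relative time.

```r
# Hypothetical data: one row per tweet or retweet, hourly time stamps.
retweets <- data.frame(
  originator = c(rep("userA", 6), rep("userB", 2), rep("userC", 5)),
  time = as.POSIXct("2012-06-25 09:00:00", tz = "UTC") + 3600 * (0:12),
  followers = c(100, 250, 40, 900, 60, 15, 3000, 20, 10, 10, 30, 5, 25),
  stringsAsFactors = FALSE
)

# Count rows per originator and keep only those with 5 or more.
counts <- table(retweets$originator)
keep <- names(counts)[counts >= 5]
top <- retweets[retweets$originator %in% keep, ]

# Relative time scale: hours since each originator's earliest row.
first_time <- ave(as.numeric(top$time), top$originator, FUN = min)
top$hours_since_first <- (as.numeric(top$time) - first_time) / 3600

# Cumulative potential viewers within each originator, in time order.
top <- top[order(top$originator, top$time), ]
top$cum_impact <- ave(top$followers, top$originator, FUN = cumsum)

print(top)
```

In this toy data `userB` is dropped for having only 2 rows even though it carries the largest follower counts, while `userC` passes the cutoff with very little cumulative impact, so a retweet-count threshold and total impact can disagree.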
We can also plot this on an absolute time scale to see when these retweets came into the stream. As you can see, @PlanktonMath was influential early, whereas others came late to the game. @BioScienceMum had 5 retweets, but all of them from low-impact people, so retweet count isn’t always a good measure of impact.
Finally let’s examine this by source. Our search string had 3 terms. I parsed those out as the AP story, the direct link to the scientific article, and just the title. The AP is all popular press, the direct link only science, and the title is a bit of both (non-AP sources plus some posts with the direct link). impacTwit can do plots like both of the above, but here’s just one on an absolute time scale.
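One simple way to assign each tweet to a source is plain substring matching against the search strings. The sketch below is my own guess at how that could look, not the parsing used for the figures: the helper `label_source`, the labels `ap`, `url` and `title`, and the rule that the first matching term wins are all assumptions.

```r
# The three search strings from the example, with hypothetical short labels.
terms <- c(
  ap    = "Scientists think math is hard too",
  url   = "http://www.pnas.org/content/early/2012/06/22/1205259109.abstract",
  title = "Heavy use of equations impedes communication among biologists"
)

# Example tweet texts (the first two are the ones quoted earlier in the post).
texts <- c(
  "Love this article I hate math too : http://www.pnas.org/content/early/2012/06/22/1205259109.abstract",
  "Heavy use of equations impedes communication among biologists: http://some.obscure.source",
  "Scientists think math is hard too",
  "Something unrelated"
)

# Hypothetical helper: return the label of the first term found in the text,
# or NA if none match. fixed = TRUE treats each term as a literal string.
label_source <- function(text, terms) {
  hits <- vapply(terms, function(p) grepl(p, text, fixed = TRUE), logical(1))
  if (any(hits)) names(terms)[which(hits)[1]] else NA_character_
}

src <- vapply(texts, label_source, character(1), terms = terms, USE.NAMES = FALSE)
print(src)
```

With a source label on every row, the cumulative sums by search term can be computed per group and plotted one line per source.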
It’s clear that people tweeting about the article itself were less impactful, but they tweeted about it longer. The AP tweets are a big splash and then they’re gone. You can try this all out yourself with the code over at my github, which has this example fully documented (including the parsing for figures). I’m open to any suggestions about features or improvements.
You might consider using entropy, and normalizing by total tweets to give relative frequencies.
You may find our presentation on Google N-Grams useful when thinking about networks:
http://prezi.com/wuhpytgxc721/exploring-the-english-language-network-with-google-n-grams/