Visualising Activity Around a Twitter Hashtag or Search Term Using R
I think one of the valid criticisms of a lot of the visualisations I post here and on my various #f1datajunkie blogs is that I often don’t post any explanatory context around them. This is partly a result of the way I use my blog posts, rather selfishly, to document the evolution of my own practice, rather than the “so what” elements that represent any meaning or sense I take from the visualisations. In many cases, this is because the understanding I come to of a dataset is typically the result of an (inter)active exploration of that dataset; what I blog are the pieces of the puzzle that show how I personally set about developing a conversation with a dataset, pieces that you can try out if you want to…;-)
An approach that might get me more readers would be to post commentary around what I’ve learned about a dataset from having a conversation with it. A good example of this can be seen in @mediaczar’s post on How should Page Admins deal with Flame Wars?, where this visualisation of activity around a Facebook post is analysed in terms of effective (or not!) strategies for moderating a flame war.
The chart shows a sequential ordering of posts in the order they were made along the x-axis, and the unique individual responsible for each post, ordered by accession to the debate along the y-axis. For interpretation and commentary, see the original post: How should Page Admins deal with Flame Wars? 😉
One takeaway from the chart for me is that it provides a great snapshot of new people entering into a conversation (vertical lines) as well as engagement by an individual (horizontal lines). If we use a time-proportional axis on x, we can also see engagement over time.
In a Twitter context, it’s likely that a rapid increase in the number of folk engaging with a hashtag, for example, might be the result of an RT-related burst of activity. For folk who have already engaged in hashtag usage, for example as part of a live event backchannel, a large number of near co-occurring tweets that are not RTs might signal some notable happenstance within the event.
To explore this idea, here’s a quick bit of R tooling inspired by Mat’s post… It uses the twitteR library and sources tweets via a Twitter search.
require(twitteR)
#Pull in a search around a hashtag.
searchTerm='#ukgc12'
rdmTweets <- searchTwitter(searchTerm, n=500)
# Note that the Twitter search API only goes back 1500 tweets

#Plot of tweet behaviour by user over time
#Based on @mediaczar's http://blog.magicbeanlab.com/networkanalysis/how-should-page-admins-deal-with-flame-wars/
#Make use of a handy dataframe creating twitteR helper function
tw.df=twListToDF(rdmTweets)
#@mediaczar's plot uses a list of users ordered by accession to user list
## 1) find earliest tweet in searchlist for each user [ http://stackoverflow.com/a/4189904/454773 ]
require(plyr)
tw.dfx=ddply(tw.df, .var = "screenName", .fun = function(x) {return(subset(x, created %in% min(created),select=c(screenName,created)))})
## 2) arrange the users in accession order
tw.dfxa=arrange(tw.dfx,-desc(created))
## 3) Use the username accession order to order the screenName factors in the searchlist
tw.df$screenName=factor(tw.df$screenName, levels = tw.dfxa$screenName)
#ggplot seems to be able to cope with time typed values...
require(ggplot2)
ggplot(tw.df)+geom_point(aes(x=created,y=screenName))
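If you don't have a Twitter search to hand, the accession ordering step can be tried out on a toy data frame. What follows is only a minimal sketch in base R: the `toy` data frame, its user names and its timestamps are all made up for illustration and simply stand in for `tw.df`.

```r
# Toy stand-in for tw.df (hypothetical users and times; no Twitter access needed)
toy <- data.frame(
  screenName = c("carol", "alice", "bob", "alice", "carol", "dave"),
  created = as.POSIXct("2012-01-20 10:00:00", tz = "UTC") + c(0, 60, 120, 180, 240, 300),
  stringsAsFactors = FALSE
)

# Sort the tweets by time, then keep each user name the first time it appears;
# unique() preserves first-occurrence order, which gives the accession order
byTime <- toy[order(toy$created), ]
accession <- unique(byTime$screenName)

# Use the accession order as the factor level order, so that a plot with
# screenName on the y-axis lists users in the order they joined in
toy$screenName <- factor(toy$screenName, levels = accession)
levels(toy$screenName)
```

Because each user appears only once in `accession`, this variant also sidesteps duplicated factor levels if a user happens to have two tweets sharing the same earliest timestamp.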
We can get a feeling for which occurrences were old-style RTs by identifying tweets that start with a classic RT, and then colouring each tweet appropriately (note there may be some overplotting/masking of points…I’m not sure how big the x-axis time bins are…)
#Identify and colour the RTs...
library(stringr)
#A helper function to remove @ symbols from user names...
trim <- function (x) sub('@','',x)
#Identify classic style RTs
tw.df$rt=sapply(tw.df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
tw.df$rtt=sapply(tw.df$rt,function(rt) if (is.na(rt)) 'T' else 'RT')
ggplot(tw.df)+geom_point(aes(x=created,y=screenName,col=rtt))
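To see what the classic RT pattern does and doesn't catch, here's a small sketch using base R's `regexec` and `regmatches` rather than stringr. The sample tweets are invented, and the pattern is a slight variation on the one above (the @ sits outside the capture group, so no trimming is needed).

```r
# Invented sample tweets (hypothetical text, for illustration only)
tweets <- c("RT @alice: great session at #ukgc12",
            "Enjoying the backchannel #ukgc12",
            "rt @bob lower-case rt is not matched",
            "So true! RT @carol: a mid-tweet RT is not matched")

# Classic RT: the tweet starts with "RT @username"
pattern <- "^RT @([[:alnum:]_]+)"

# regmatches() gives character(0) where there is no match,
# otherwise c(full match, captured user name)
m <- regmatches(tweets, regexec(pattern, tweets))
rtof <- sapply(m, function(x) if (length(x) == 2) x[2] else NA_character_)

# Label each tweet as a classic retweet (RT) or an ordinary tweet (T)
rtt <- ifelse(is.na(rtof), "T", "RT")
data.frame(rtof, rtt)
```

Only the first tweet is flagged: the match is case sensitive and anchored to the start of the tweet, so quoted or commented RTs further along the text are treated as ordinary tweets.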
So now we can see when folk entered into the hashtag community via a classic RT.
We can also start to explore who was classically retweeted when:
#Generate a plot showing how a person is RTd
tw.df$rtof=sapply(tw.df$text,function(tweet) trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
#Note that this doesn't show how many RTs each person got in a given time period if they got more than one...
ggplot(subset(tw.df,subset=(!is.na(rtof))))+geom_point(aes(x=created,y=rtof))
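One way of getting at how many RTs each person got in a given time period might be to bin the timestamps and count. The following is a rough sketch only: the `rts` data frame is made up, and the ten minute bin width (`binWidth`) is an arbitrary choice of mine rather than anything suggested by the data.

```r
# Made-up retweet records (hypothetical names and times)
rts <- data.frame(
  rtof = c("alice", "alice", "bob", "alice", "bob"),
  created = as.POSIXct("2012-01-20 10:00:00", tz = "UTC") + c(30, 200, 250, 700, 1300),
  stringsAsFactors = FALSE
)

# Assumed bin width in seconds (ten minutes); tune to taste
binWidth <- 600

# Round each timestamp down to the start of its bin
rts$bin <- as.POSIXct(floor(as.numeric(rts$created) / binWidth) * binWidth,
                      origin = "1970-01-01", tz = "UTC")

# Count the retweets each person received within each bin
rts$n <- 1
counts <- aggregate(n ~ rtof + bin, data = rts, FUN = sum)
counts
```

The counts could then be mapped on to point size in the chart, for example with something like `geom_point(aes(x=bin,y=rtof,size=n))`, though I haven't tuned that against real data.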
Another view might show who was classically RTd by whom (activity along a row indicating someone was retweeted a lot through one or more tweets, activity within a column identifying an individual who RTs a lot…):
#We can start to get a feel for who RTs whom...
require(gdata)
#We don't want to display screenNames of folk who tweeted but didn't RT
tw.df.rt=drop.levels(subset(tw.df,subset=(!is.na(rtof))))
#Order the screennames of folk who did RT by accession order (ie order in which they RTd)
tw.df.rta=arrange(ddply(tw.df.rt, .var = "screenName", .fun = function(x) {return(subset(x, created %in% min(created),select=c(screenName,created)))}),-desc(created))
tw.df.rt$screenName=factor(tw.df.rt$screenName, levels = tw.df.rta$screenName)
# Plot who RTd whom
# (opts() and theme_text() have been removed from ggplot2; theme() and element_text() are the current equivalents)
ggplot(subset(tw.df.rt,subset=(!is.na(rtof))))+geom_point(aes(x=screenName,y=rtof))+theme(axis.text.x=element_text(angle=-90,size=6)) + xlab(NULL)
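The same who-RTd-whom relationship can also be read off as a table of counts, which the scatterplot hides when someone retweets the same person more than once. Here's a minimal sketch using base R's `table`; the `rtPairs` data frame and the names in it are invented for the example.

```r
# Invented retweeter/retweeted pairs (hypothetical names)
rtPairs <- data.frame(
  screenName = c("dave", "erin", "dave", "frank", "erin"),
  rtof = c("alice", "alice", "bob", "alice", "alice"),
  stringsAsFactors = FALSE
)

# Rows: who was retweeted; columns: who did the retweeting
whoRTdWhom <- table(retweeted = rtPairs$rtof, retweeter = rtPairs$screenName)
whoRTdWhom

# Row totals: how often each person was retweeted
rowSums(whoRTdWhom)
# Column totals: how often each person retweeted someone
colSums(whoRTdWhom)
```

Row totals correspond to activity along a row of the chart, and column totals to activity within a column.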
What sense you might make of all this, or where to take it next, is down to you of course… Err, erm…?! 😉
PS see also: http://blog.ouseful.info/2012/01/21/a-quick-view-over-a-mashe-google-spreadsheet-twitter-archive-of-ukgc2012-tweets/