Playing around with #rstats twitter data
As a bit of weekend fun, I decided to briefly look into the #rstats Twitter data that Stephen Turner collected and made available (thanks!). Essentially, this data set contains some basic information about over 100,000 tweets containing the hashtag “#rstats”, which denotes that a tweeter is tweeting about R.
As a warning, I don’t know much about how these data were collected: whether they were collected at random times during the day, or whether collection was biased toward particular times and, therefore, locations. I wouldn’t really read too much into this.
Most common co-occurring hashtags
When a tweet uses a hashtag at all, it very often uses more than one. To extract the co-occurring hashtags, I used the following Perl script:
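    #!/usr/bin/perl
    use strict;
    use warnings;

    while(<>){
        chomp;
        $_ = lc($_);
        $_ =~ s/#rstats//g;          # drop the #rstats tag itself
        my @matches = /(#\w+)/g;     # collect every remaining hashtag
        print join("\n", @matches), "\n" if @matches;   # one hashtag per line
    }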
which uses the regular expression “(#\w+)” to search for hashtags after removing “#rstats” from every tweet.
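For readers who would rather stay in R, here is a minimal sketch of the same extraction in base R. The function name `extract_other_hashtags` and the example tweets are made up for illustration; they are not part of the original analysis.

```r
# Extract every hashtag except #rstats from a character vector of tweets.
# A sketch only: the function name and the sample tweets are hypothetical.
extract_other_hashtags <- function(tweets) {
  tweets <- tolower(tweets)
  # gregexpr() locates all matches in each tweet; regmatches() pulls them out
  tags <- unlist(regmatches(tweets, gregexpr("#\\w+", tweets)))
  # keep everything that is not the #rstats tag itself
  tags[tags != "#rstats"]
}

# Made-up example tweets, not taken from the data set
example_tweets <- c(
  "New #ggplot2 tricks for #dataviz #rstats",
  "Comparing #Python and #rstats for #DataScience",
  "No hashtags here"
)
other_tags <- extract_other_hashtags(example_tweets)
print(other_tags)
```

One difference from the Perl approach: this sketch drops only tags that are exactly “#rstats”, whereas deleting the substring “#rstats” from the tweet text also affects longer tags that merely start with it.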
On the unix command-line, I put these other hashtags into a file and sorted via these commands:
    cat data/R-hashtag-data.txt | ./PERL_SCRIPT_ABOVE.pl | tee other-hashtags.txt
    sort other-hashtags.txt | uniq -c | sort -n -r > sorted-other-hashtags.txt
After running these commands, I get a list of co-occurring hashtags with their counts, sorted in descending order. The top 10 co-occurring hashtags were as follows (you can see the rest here):
    5258 #datascience
    1665 #python
    1625 #bigdata
    1542 #r
    1451 #dataviz
    1360 #ggplot2
     852 #statistics
     783 #dplyr
     749 #machinelearning
     743 #analytics
Neat-o. The presence of “#python” and “#ggplot2” in the top 10 made me wonder what the top 10 programming-language-related and R-package-related hashtags were. Here they are, respectively:
    1665 #python
     423 #d3js (plus 72 for #d3) (plus 2 for #js)
     343 #sas
     312 #julialang (plus 43 for #julia)
     240 #fsharp
     140 #spss (plus 7 for #ibmspss)
     102 #stata
      75 #matlab
      55 #sql
      38 #java

    1360 #ggplot2 (plus 298 for ggplot) (plus 6 for #gglot2) (plus 4 for #ggpot)
     783 #dplyr
     663 #shiny
     557 #rcpp (plus 22 for rcpp11)
     251 #knitr
     156 #magrittr
     105 #lme4
      93 #ggvis (plus 11 for #ggivs)
      65 #datatable
      46 #rneo4j
You can view the full list here and here.
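The “plus N for …” notes in the lists above amount to folding misspellings and variants into one canonical tag. Here is a minimal sketch of how that merge could be done in R. The counts are copied from the lists above, but the lookup table `canonical` is an assumption about which tags belong together, not something the original analysis used.

```r
# Counts copied from the lists above (a small subset, for illustration)
counts <- c("#ggplot2" = 1360, "#ggplot" = 298, "#gglot2" = 6, "#ggpot" = 4,
            "#ggvis" = 93, "#ggivs" = 11, "#dplyr" = 783)

# Hypothetical lookup table: variant spelling -> canonical tag
canonical <- c("#ggplot" = "#ggplot2", "#gglot2" = "#ggplot2",
               "#ggpot" = "#ggplot2", "#ggivs" = "#ggvis")

# Replace each variant with its canonical tag, leave other tags alone
tag <- names(counts)
is_variant <- tag %in% names(canonical)
tag[is_variant] <- unname(canonical[tag[is_variant]])

# Sum the counts per canonical tag and sort in descending order
merged <- sort(tapply(counts, tag, sum), decreasing = TRUE)
print(merged)
```

With these numbers, the merged total for #ggplot2 is 1360 + 298 + 6 + 4 = 1668, and #ggvis comes to 93 + 11 = 104.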
I was happy to see my favorite languages (python, perl, clojure, lisp, haskell, c) besides R being represented in the first list. Additionally, most of my favorite packages were fairly well tweeted about–at least as far as hashtags-applied-to-a-package go.
#strangehashtags
Before moving on to the next section, I wanted to share my favorite co-occurring hashtags that I found while sifting through the data: #rcatladies, #rdogfella, #bayesianbootycall, #dontbeaplyrhater, #overlyhonestmethods, #rickshaw (??), #statafail, and #monkeysinfrontoftypewriters.
Most prolific #rstats tweeters
One of the first things I did with these data was a simple aggregation and sort to find the tweeters that used the hashtag most often:
    library(dplyr)

    THE_DATA %>%
      group_by(User) %>%
      summarise(count = n()) %>%
      arrange(desc(count)) -> prolific.rstats.tweeters
Here is the top 10 (you can see the rest here):
    @Rbloggers        1081
    @hadleywickham     498
    @timelyportfolio   427
    @recology_         419
    @revodavid         210
    @chlalanne         209
    @adolfoalvarez     199
    @RLangTip          175
    @jmgomez           160
Nothing terribly surprising here.
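If dplyr isn't installed, the same count-and-sort can be sketched in base R with `table()` and `sort()`. The data frame `toy` below is a made-up stand-in with one row per tweet; only the `User` column name comes from the code above.

```r
# Hypothetical stand-in for the real data: one row per #rstats tweet
toy <- data.frame(User = c("@a", "@b", "@a", "@c", "@a", "@b"),
                  stringsAsFactors = FALSE)

# table() counts tweets per user; sort() puts the most prolific first
tally <- sort(table(toy$User), decreasing = TRUE)

# Turn the result back into a data frame with the same columns as above
prolific <- data.frame(User = names(tally),
                       count = as.integer(tally),
                       stringsAsFactors = FALSE)
print(prolific)
```

This is the R analogue of the `sort | uniq -c | sort -n -r` idiom used for the hashtags earlier.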
Normalizing by total tweets
In a twitter discussion about these data, a twitter friend Tim Hopper posited that though he had fewer #rstats tweets than another mutual friend, Trey Causey, he would have a higher number of #rstats tweets if you control for total tweet volume. I wondered how this sorting would look.
Answering this question gave me an excuse to use Hadley Wickham’s new package, rvest (I literally just got why the package is named as such while typing this out), which makes web scraping easier, in part by leveraging the expressive power of the magrittr package.
To get the total number of tweets for a particular tweeter, I wrote the following function:
    library(rvest)
    library(magrittr)

    get.num.tweets <- function(handle){
      tryCatch({
        # convert the displayed count (e.g. "1,234" or "12K") to a number
        unraw <- function(raw_str){
          raw_str <- gsub(",", "", raw_str)   # remove commas if any
          if(grepl("K", raw_str)){
            return(as.numeric(sub("K", "", raw_str))*1000)   # in thousands
          }
          return(as.numeric(raw_str))
        }
        # html() was rvest's page reader when this was written;
        # newer releases call it read_html()
        html(paste0("http://twitter.com/", sub("@", "", handle))) %>%
          html_nodes(".is-active .ProfileNav-value") %>%
          html_text() %>%
          unraw
      }, error=function(cond){return(NA)})   # any failure yields NA
    }
The real logic (and beauty) of the function is contained in its last few lines:
    html(paste0("http://twitter.com/", sub("@", "", TWITTER_HANDLE))) %>%
      html_nodes(".is-active .ProfileNav-value") %>%
      html_text()
The CSS selector for the element that houses the total number of tweets on a useR’s Twitter page was found easily using SelectorGadget.
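The count-parsing step can be tried on its own, without touching the network. Below is a minimal, vectorized sketch of that step; the name `parse_count` and the sample strings are made up for illustration, and it handles only the comma and “K” formats that the function above deals with.

```r
# Convert displayed tweet counts such as "1,081" or "26.7K" to numbers.
# A sketch only: parse_count is a hypothetical name, and formats other
# than plain digits, commas, and a trailing "K" are not handled.
parse_count <- function(raw_str) {
  s <- gsub(",", "", trimws(raw_str), fixed = TRUE)  # drop thousands separators
  in_thousands <- grepl("K$", s)                     # e.g. "26.7K"
  value <- as.numeric(sub("K$", "", s))
  ifelse(in_thousands, value * 1000, value)
}

parsed <- parse_count(c("28", "1,081", "26.7K"))
print(parsed)
```

Note that a count displayed with a “K” suffix has already been rounded by the page, so the parsed total is only approximate for heavy tweeters.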
After scraping the number of tweets for almost 10,000 #rstats tweeters (waiting a few seconds between each request because I’m considerate), I divided the number of #rstats tweets by the total number of tweets to come up with a normalized value.
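The “waiting a few seconds between each request” part might look something like the loop below. This is a sketch under assumptions: `scrape_politely`, its `pause` argument, and the stub `fake_fetch` are all hypothetical names, and the stub stands in for a real page fetch so that the example runs offline.

```r
# Call `fetch` once per handle, pausing between requests.
# Hypothetical helper; `fetch` is any function that returns a tweet count.
scrape_politely <- function(handles, fetch, pause = 3) {
  totals <- rep(NA_real_, length(handles))
  for (i in seq_along(handles)) {
    # a failed request is recorded as NA rather than stopping the loop
    totals[i] <- tryCatch(fetch(handles[i]), error = function(cond) NA_real_)
    if (i < length(handles)) Sys.sleep(pause)  # be considerate to the server
  }
  data.frame(User = handles, num.of.tweets = totals, stringsAsFactors = FALSE)
}

# Stub fetcher so the sketch runs without network access (made-up values)
fake_fetch <- function(handle) {
  if (handle == "@broken") stop("page not found")
  nchar(handle) * 10
}

result <- scrape_politely(c("@alice", "@broken", "@bob"), fake_fetch, pause = 0)
print(result)
```

In real use, `fetch` would be a scraping function and `pause` would stay at a few seconds; it is set to 0 here only so the example finishes immediately.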
The top 10 tweeteRs were as follows:
                  User count num.of.tweets     ratio
    1     @medzihorsky     9            28 0.3214286
    2        @statworx     5            16 0.3125000
    3    @LearnRinaDay   114           404 0.2821782
    4   @RforExcelUsers     4            15 0.2666667
    5     @showmeshiny    27           102 0.2647059
    6           @tcrug     6            25 0.2400000
    7   @DailyRpackage   155           666 0.2327327
    8   @R_Programming    49           250 0.1960000
    9        @hexadata     8            41 0.1951220
    10     @Deep_RHelp    11            58 0.1896552
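The normalization itself is a single division followed by a sort. Here is a minimal sketch using the first three rows of the table above, entered out of order on purpose; the data frame name `top3` is made up for illustration.

```r
# First three rows of the table above, deliberately shuffled
top3 <- data.frame(
  User          = c("@LearnRinaDay", "@medzihorsky", "@statworx"),
  count         = c(114, 9, 5),     # number of #rstats tweets
  num.of.tweets = c(404, 28, 16),   # total tweets scraped from the profile
  stringsAsFactors = FALSE
)

# Share of each user's tweets that carry the #rstats hashtag
top3$ratio <- top3$count / top3$num.of.tweets

# Highest ratio first
top3 <- top3[order(-top3$ratio), ]
print(top3)
```

Accounts with very few total tweets dominate a ranking like this, since a handful of #rstats tweets is enough to produce a large ratio.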
In case you were wondering, Trey Causey still “won” by a long shot:
    > tweeters[which(tweeters$User=="@tdhopper"),]
    Source: local data frame [1 x 4]

           User count num.of.tweets        ratio
    1 @tdhopper     8         26700 0.0002996255

    > tweeters[which(tweeters$User=="@treycausey"),]
    Source: local data frame [1 x 4]

             User count num.of.tweets      ratio
    1 @treycausey    50         28700 0.00174216
Before ending this post, I feel compelled to issue an almost certainly unnecessary but customary warning against using number of #rstats tweets as a proxy for who likes R the most or who are the biggest R “thought leaders” (whatever that is). Most tweets about R don’t use the #rstats hashtag, anyway.
Again, I wouldn’t read too much into this 🙂