Analysis of the #7FavPackages hashtag
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Twitter has seen a recent trend of “first 7” and “favorite 7” hashtags, like #7FirstJobs and #7FavFilms. Last week I added one to the mix, about my 7 favorite R packages:
devtools
dplyr
ggplot2
knitr
Rcpp
rmarkdown
shiny
#7FavPackages #rstats
— David Robinson (@drob) August 16, 2016
Hadley Wickham agreed to share his own, but on one condition:
@drob I'll do it if you write a script to scrape the tweets, plot overall most common, and common co-occurences
— Hadley Wickham (@hadleywickham) August 16, 2016
Hadley followed through, so now it’s my turn.
Setup
We can use the same twitteR package that I used in my analysis of Trump’s Twitter account:
There were 116 (unique) tweets in this hashtag. I can use the tidytext package to analyze them, using a custom regular expression.
Note that since a lot of non-package words got mixed in with these tweets, I filtered for only packages in CRAN and Bioconductor (so packages that are only on GitHub or elsewhere won’t be included, though anecdotally I didn’t notice any among the tweets). Tweeters were sometimes inconsistent about case as well, so I kept all packages lowercase throughout this analysis.
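The filtering step described above can be illustrated with a small sketch. It is written in Python purely as an illustration and is not the R code used for this post: the token pattern is an assumed stand-in for the custom regular expression, and `KNOWN_PACKAGES` is a tiny hypothetical substitute for the full list of CRAN and Bioconductor package names.

```python
import re

# Hypothetical stand-in for the set of CRAN and Bioconductor package names;
# a real run would load the full list from those repositories.
KNOWN_PACKAGES = {"dplyr", "ggplot2", "knitr", "rcpp", "shiny", "data.table"}

# Assumed token pattern: a letter followed by letters, digits, or dots,
# so that names such as data.table survive as a single token.
TOKEN_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9.]*")


def extract_packages(tweet_text, known=KNOWN_PACKAGES):
    """Return the sorted, de-duplicated known package names found in one tweet."""
    # Lowercase every token and drop a trailing sentence period before matching.
    tokens = (tok.lower().rstrip(".") for tok in TOKEN_PATTERN.findall(tweet_text))
    return sorted({tok for tok in tokens if tok in known})


# Made-up tweet text for illustration only.
example = "My #7FavPackages: dplyr, GGPLOT2, Rcpp, data.table and coffee. #rstats"
print(extract_packages(example))
```

Lowercasing before the lookup is what lets inconsistent capitalization such as "GGPLOT2" match, and the whitelist is what discards ordinary words and hashtags.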
General results
There were 700 occurrences of 184 packages in these tweets. What were the most common?
Some observations:
- ggplot2 and dplyr were the most popular packages, each mentioned by more than half the tweets, and other packages by Hadley like tidyr, devtools, purrr and stringr weren’t far behind. This isn’t too surprising, since much of the attention to the hashtag came with Hadley’s tweet.
@drob @JaySun_Bee @ma_salmon HOW IS THAT BIASED?
— Hadley Wickham (@hadleywickham) August 16, 2016
- The next most popular packages involved reproducible research (rmarkdown and knitr), along with other RStudio tools like shiny. What if I excluded packages maintained by RStudio (or RStudio employees like Hadley and Yihui)?
- The vast majority of packages people listed as their favorite were CRAN packages: only 7 Bioconductor packages were mentioned (though it’s worth noting they occurred across four different tweets):
- There were 109 CRAN packages that were mentioned only once, and those showed a rather large variety. A random sample of 10:
Correlations
What packages tend to be “co-favorited”, that is, listed by the same people? Here I’m using my in-development widyr package, which makes it easy to calculate pairwise correlations in a tidy data frame.
For instance, this shows that the greatest correlations (technically phi coefficients) were between the base, graphics, and stats packages, from people showing loyalty to built-in packages.
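For readers unfamiliar with the phi coefficient: it is the Pearson correlation between two yes/no indicators, here "did this tweet mention package A?" and "did this tweet mention package B?". Below is a minimal Python sketch of that formula, computed from the 2x2 table of counts. It is an illustration only, not the widyr implementation, and the indicator vectors are made up.

```python
import math


def phi_coefficient(x, y):
    """Phi coefficient between two equal-length sequences of 0/1 indicators."""
    if len(x) != len(y):
        raise ValueError("indicator vectors must have the same length")
    # Cells of the 2x2 contingency table.
    n11 = sum(1 for a, b in zip(x, y) if a and b)
    n10 = sum(1 for a, b in zip(x, y) if a and not b)
    n01 = sum(1 for a, b in zip(x, y) if not a and b)
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # Undefined when either package appears in every tweet or in none.
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom


# Made-up indicators for five hypothetical tweets (1 = the tweet lists the package).
base = [1, 1, 0, 0, 0]
stats = [1, 1, 0, 0, 0]
dplyr = [0, 0, 1, 1, 1]

print(phi_coefficient(base, stats))
print(phi_coefficient(base, dplyr))
```

Two packages that always appear together score 1, two that never share a tweet but together cover every tweet score -1, and independent packages score near 0.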
I like using the ggraph package to visualize these relationships:
You can recognize most of RStudio’s packages (ggplot2, dplyr, tidyr, knitr, shiny) in the cluster on the bottom left of the graph. At the bottom right you can see the “base” cluster (stats, base, utils, grid, graphics), with people who showed their loyalty to base packages.
Beyond that, the relationships are a bit harder to parse (outside of some expected combinations like rstan and rstanarm): we may just not have enough data to create reliable correlations.
Compared to CRAN dependencies
This isn’t a particularly scientific survey, to say the least. So how does it compare to another metric of a package’s popularity: the number of packages that Depend on, Import, or Suggest it on CRAN? (You could also compare to the number of CRAN downloads using the cranlogs package, but since most downloads are due to dependencies, the two metrics give rather similar results.)
We can discover this using the available.packages() function, along with some processing.
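The processing mostly consists of splitting the comma-separated Depends, Imports, and Suggests fields, stripping version constraints such as "(>= 3.1.0)", and counting how many packages name each dependency. Here is a rough Python sketch of that idea. It is not the code behind this post, and the three rows are hypothetical records that only imitate the shape of those fields.

```python
import re
from collections import Counter

# Hypothetical rows imitating the Package, Depends, Imports, and Suggests fields.
PACKAGES = [
    {"Package": "pkgA", "Depends": "R (>= 3.1.0), ggplot2",
     "Imports": "dplyr (>= 0.5.0)", "Suggests": "knitr,\ntestthat"},
    {"Package": "pkgB", "Depends": None,
     "Imports": "dplyr, ggplot2", "Suggests": "knitr"},
    {"Package": "pkgC", "Depends": "R (>= 2.10)",
     "Imports": None, "Suggests": "ggplot2 (>= 2.0.0)"},
]


def parse_dependency_field(field):
    """Split one dependency string into a set of bare package names."""
    if not field:
        return set()
    # Drop version constraints in parentheses, then split on commas.
    cleaned = re.sub(r"\([^)]*\)", "", field)
    names = {entry.strip() for entry in cleaned.split(",")}
    # Ignore empty entries and the dependency on R itself.
    return {name for name in names if name and name != "R"}


def count_reverse_dependencies(rows):
    """Count, for each package, how many rows depend on, import, or suggest it."""
    counts = Counter()
    for row in rows:
        required = set()
        for field in ("Depends", "Imports", "Suggests"):
            required |= parse_dependency_field(row.get(field))
        # Each requiring package counts once, even if listed in several fields.
        counts.update(required)
    return counts


counts = count_reverse_dependencies(PACKAGES)
for name, n in counts.most_common():
    print(name, n)
```

Collecting the three fields into one set per package is a design choice in this sketch: a package that both imports and suggests the same dependency is counted once.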
We can compare the number of mentions in the hashtag to the number of packages that depend on each one:
Some like dplyr, ggplot2, and knitr are popular both within the hashtag and as CRAN dependencies. Some relatively new packages like purrr are popular on Twitter but haven’t built up as many packages needing them, and others like plyr and foreach are a common dependency but are barely mentioned. (This isn’t even counting the many packages never mentioned in the hashtag).
Since we have this dependency data, I can’t resist looking for correlations just like we did with the hashtag data. What packages tend to be depended on together?
(I skipped the code for these, but you can find it all here).
Some observations from the full network (it’s not related to the hashtag, but it’s still quite interesting):
- The RStudio cluster is prominent in the lower left, with ggplot2, knitr and testthat serving as the core anchors. A lot of packages depend on these in combination.
- You can spot a tight cluster of spatial statistics packages in the upper left (around “sp”) and of machine learning packages near the bottom right (around caret, rpart, and nnet).
- Smaller clusters include parallelization on the left (parallel, doParallel), time series forecasting on the upper right (zoo, xts, forecast), and parsing API data on top (RCurl, rjson, XML).
One thing I like about this 2D layout (much as I’ve done with programming languages using Stack Overflow data) is that we can bring in our hashtag information, and spot visually what types of packages tended to be favorited.
This confirms our observation that the favorited packages are slanted towards the tidyverse/RStudio cluster.
The #7First and #7Fav hashtags have been dying down a bit, but it may still be interesting to try this analysis for others, especially ones with more activity. Maëlle Salmon is working on a great analysis of #7FirstJobs and I’m sure others would be informative.