Analyzing R-bloggers
In the last two posts we saw how to download posts from R-bloggers, extract the title, author and date of each post, and write that information to a CSV file. Now that we have a nice data set from R-bloggers, we can start to examine how the site has developed over time. In this post I will look at the following patterns in the data:
- The monthly rate of posts submitted to R-bloggers
- The distribution of posts and contributors
- The top contributors in total and tabulated by year
The graph below shows the monthly count of posts submitted to R-bloggers.com:

As you can see, R-bloggers.com has experienced tremendous growth in posts. The first years, from 2005 to the end of 2008, were fairly consistent, with an average posting rate of 6 posts per month. In 2009 we see the beginning of a dramatic rise in submitted posts, which peaks in March 2011 with 266 posts that month. To see whether this is driven by a few very active bloggers, or whether there is a similar increase in contributors, the graph below plots the number of unique contributors for every month:

Here we see that the monthly number of contributors closely follows the monthly number of posts; the rise in posts is therefore not exclusively the result of a few extremely active bloggers. However, as the figure below shows, most authors contribute a fairly small number of posts:

The distribution is extremely skewed with a median of 6 posts, and a few authors contributing 200 or more posts.
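The skew can be quantified directly from the per-author counts. The sketch below uses a small made-up vector of counts (not the real data) in place of `authors$count` from the script at the end of this post, and compares the median with the mean and with the share of posts written by the most active authors.

```r
# Hypothetical per-author post counts; substitute authors$count for the real data
counts <- c(1, 1, 2, 3, 3, 6, 6, 8, 12, 40, 217, 647)

# In a right-skewed distribution the mean sits well above the median
med <- median(counts)
avg <- mean(counts)

# Share of all posts written by the three most active authors
top_share <- sum(sort(counts, decreasing = TRUE)[1:3]) / sum(counts)

cat("median:", med, "mean:", round(avg, 1), "\n")
cat("share of posts from top 3 authors:", round(top_share, 2), "\n")
```

A large gap between the mean and the median, together with a high top-author share, is what a long right tail looks like in numbers.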
The overall top ten contributors to R-bloggers.com are:
author | count |
---|---|
David Smith | 647 |
xi’an | 293 |
Thinking inside the box | 217 |
Tal Galili | 124 |
klr | 104 |
Stephen Turner | 102 |
dirk.eddelbuettel | 94 |
Ralph | 82 |
romain francois | 79 |
C | 77 |
Breaking this down by year, we can see that some very active R bloggers emerge from 2009 onwards:
2005
author | count |
---|---|
Hadley Wickham | 3 |
fernandohrosa | 2 |
2006
author | count |
---|---|
seth | 6 |
Hadley Wickham | 5 |
dataninja | 5 |
Di Cook | 3 |
Vincent Zoonekynd's Blog | 3 |
fernandohrosa | 2 |
Andrew Gelman | 1 |
2007
author | count |
---|---|
Mario Pineda-Krch | 20 |
Forester | 14 |
Egon Willighagen | 5 |
Andrew Gelman | 4 |
Rob J Hyndman | 4 |
dataninja | 4 |
Hadley Wickham | 3 |
John Johnson | 2 |
dan | 2 |
seth | 2 |
2008
author | count |
---|---|
Yu-Sung Su | 28 |
Michal | 9 |
Rob J Hyndman | 8 |
Gregor Gorjanc | 6 |
Forester | 5 |
Di Cook | 4 |
John Johnson | 4 |
Mario Pineda-Krch | 4 |
Radford Neal | 4 |
abiao | 4 |
2009
author | count |
---|---|
Thinking inside the box | 63 |
dirk.eddelbuettel | 36 |
Shige | 30 |
John Myles White | 28 |
Paolo | 26 |
David Smith | 25 |
Todos Logos | 25 |
Jeromy Anglim | 24 |
Stephen Turner | 23 |
romain francois | 23 |
2010
author | count |
---|---|
David Smith | 352 |
xi’an | 152 |
Thinking inside the box | 85 |
C | 75 |
Tal Galili | 74 |
dirk.eddelbuettel | 58 |
Ralph | 53 |
romain francois | 41 |
Stephen Turner | 34 |
Kelly | 33 |
2011
author | count |
---|---|
David Smith | 268 |
xi’an | 137 |
klr | 104 |
Thinking inside the box | 66 |
BMS Add-ons » BMS Blog | 58 |
Pat | 52 |
Scott Chamberlain | 48 |
Stephen Turner | 44 |
Kay Cichini | 43 |
Tal Galili | 37 |
From 2009 onwards a number of authors appear among the top contributors in every year, and of course in 2010 David Smith and Xi’an appear, both with a massive output.
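One way to check which authors recur is to intersect the yearly top lists. The sketch below is an illustration on a tiny hand-made data frame with the same columns as `authorsYear` in the script below; `top_n_by_year` is a hypothetical helper name, and it takes the top 2 rather than the top 10 only to keep the toy data short.

```r
# Toy stand-in for authorsYear (columns: author, year, count); values are made up
authorsYear <- data.frame(
  author = c("a", "b", "c", "a", "b", "c", "a", "c", "d"),
  year   = c(2009, 2009, 2009, 2010, 2010, 2010, 2011, 2011, 2011),
  count  = c(30, 20, 5, 50, 10, 40, 60, 45, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the n most prolific authors for each year,
# returned as a list of character vectors (one per year)
top_n_by_year <- function(df, n) {
  lapply(split(df, df$year), function(x) {
    head(x$author[order(x$count, decreasing = TRUE)], n)
  })
}

tops <- top_n_by_year(authorsYear, n = 2)

# Authors present in every yearly list
recurring <- Reduce(intersect, tops)
print(recurring)
```

With the real data, calling the helper with `n = 10` on the rows from 2009 onwards would list the authors who make the top ten in every one of those years.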
I see R-bloggers as one of the great services in the R community, and the presence of very knowledgeable and prolific contributors is a public good that we can all enjoy. So let's hope the current trend will continue into the new year!
As always, the full R script to reproduce the above analysis is here:

#read the libraries
library(plyr)
library(ggplot2)
library(xtable)
#set the working directory to where you saved the output.csv file from the previous post
setwd("/.../")
#read the data
data <- read.csv("output.csv")
#define the date variable and create the year and month variables
data$date <- as.Date(data$date, format = "%B %d %Y")
data$year <- as.POSIXlt(data$date)$year + 1900
data$month <- as.POSIXlt(data$date)$mon + 1
#get the monthly count of posts for every year
posts <- ddply(data, c("year","month"), function(x) data.frame(count = nrow(x)))
#for easier plotting create a date variable from the year and month
dates <- paste(posts$year, posts$month, "01", sep = "-")
posts$date <- as.Date(dates, format = "%Y-%m-%d")
#plot the monthly post count
plot <- ggplot(posts, aes(x = date, y = count)) + geom_line() + theme_bw() + ylab("Post Count")
plot
#get the number of monthly contributors
contributors <- ddply(data, c("year","month"), function(x) data.frame(contributors = length(unique(x$author))))
#for easier plotting create a date variable from the year and month
dates <- paste(contributors$year, contributors$month, "01", sep = "-")
contributors$date <- as.Date(dates, format = "%Y-%m-%d")
#plot the monthly count of contributors
plot <- ggplot(contributors, aes(x = date, y = contributors)) + geom_line() + theme_bw()
plot
#get the number of posts per author
authors <- ddply(data, "author", function(x) data.frame(count = nrow(x)))
#plot the density of contributions per author
#(theme() and element_blank() replace opts() and theme_blank(), which were removed from ggplot2)
plot <- ggplot(authors, aes(x = count)) +
  geom_density(fill = "red", alpha = .3) +
  theme_bw() +
  theme(axis.ticks = element_blank(), axis.text.x = element_blank())
plot
#get the ten authors with the highest post count
topten <- head(authors[order(authors$count, decreasing = TRUE),], 10)
print(xtable(topten), type = "html", include.rownames = FALSE)
#get the post count of each author for every year
authorsYear <- ddply(data, c("author","year"), function(x) data.frame(count = nrow(x)))
#for every year get a table of the ten most prolific authors and print it as html
#(head() avoids NA rows in years with fewer than ten authors)
for (year in unique(authorsYear$year)){
  print(year)
  table <- authorsYear[authorsYear$year == year,]
  table <- head(table[order(table$count, decreasing = TRUE),], 10)
  print(xtable(table[,c("author","count")]), type = "html", include.rownames = FALSE)
}
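
The summary figures quoted in the text (the average monthly rate for the early years, the peak month, and how closely contributors track posts) can be recomputed from the `posts` and `contributors` data frames. The following is a minimal sketch on made-up numbers, assuming both data frames have the columns created by the script above; the values are illustrative only and are not the real R-bloggers counts.

```r
# Made-up stand-ins for the posts and contributors data frames built above
posts <- data.frame(
  year  = c(2008, 2008, 2009, 2009, 2010, 2010),
  month = c(11, 12, 1, 2, 1, 2),
  count = c(5, 7, 20, 30, 120, 150)
)
contributors <- data.frame(
  year  = posts$year,
  month = posts$month,
  contributors = c(3, 4, 10, 14, 40, 55)
)

# Average number of posts per month up to the end of 2008
early_rate <- mean(posts$count[posts$year <= 2008])

# The month with the highest post count
peak <- posts[which.max(posts$count), c("year", "month", "count")]

# Join on year and month, then correlate posts with contributors
both <- merge(posts, contributors, by = c("year", "month"))
r <- cor(both$count, both$contributors)

print(early_rate)
print(peak)
print(round(r, 3))
```

A correlation close to 1 would be consistent with the observation that the number of contributors rose along with the number of posts.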