Rblogger Posting Patterns Analyzed with R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I've been a big fan of R-bloggers for quite some time, but have only recently started contributing myself. After my first post yesterday, I immediately started wondering how long most other bloggers go between posts.
I decided to gather the list of past posts to R-bloggers to investigate a bit. I've posted the data (as of yesterday evening) here. I'm a bit new to GitHub, but the file (RBloggersData.csv) should be there.
I started by using plyr to calculate the average delay between each author's posts. It turns out that this distribution is heavily right-skewed, and looks fairly normal (or at least mound-shaped; see plot above) when logged. Depending on how zero delays are handled (their log is -Inf), the average log delay between posts is around 3.5 to 3.75, which back-transforms to roughly 33 to 43 days, so a typical author posts about once a month.
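To make the delay calculation and the log back-transform concrete, here is a minimal sketch in base R on made-up dates. The `posts` vector and the `avg_gap_days` helper are hypothetical and not part of the analysis data; unlike the original code, the sketch sorts the dates before differencing so it does not depend on the order in which posts were scraped.

```r
# Hypothetical post dates for one author (illustrative only, not real data)
posts <- as.Date(c("2012-01-01", "2012-01-11", "2012-03-01"))

# Average gap in days between consecutive posts; NA for a single post
avg_gap_days <- function(dates) {
  if (length(dates) < 2) return(NA_real_)
  mean(as.numeric(diff(sort(dates)), units = "days"))
}

gap <- avg_gap_days(posts)   # gaps are 10 and 50 days, so the mean is 30
log(gap)                     # about 3.4 on the log scale

# Back-transforming the reported range of mean log delays
exp(c(3.5, 3.75))            # roughly 33 and 43 days

# A zero delay (two posts with the same timestamp) gives log(0) = -Inf,
# which is why zeros have to be recoded or dropped before averaging
log(0)
```

Note that exponentiating a mean of logs gives a geometric mean, so "about once a month" describes a typical author rather than the arithmetic average delay.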
Next, still being pretty new to blogging, I wondered which day of the week most people post on. The distribution shows that weekends have markedly fewer posts than weekdays, and there's a fairly strong downward trend over the course of the week. I'm guessing most people (like me) end up experimenting with data over the weekend and scraping together a post for Monday. (See first figure below.)
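For readers who want to try the weekday tabulation without lubridate, here is a small base R sketch on made-up dates. The `dates` vector is hypothetical; the analysis itself uses `wday()` from lubridate on the scraped post dates.

```r
# Hypothetical post dates (illustrative only, not real data)
dates <- as.Date(c("2012-06-04", "2012-06-04", "2012-06-05", "2012-06-09"))

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday, independent of locale
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
dow <- factor(day_names[as.POSIXlt(dates)$wday + 1], levels = day_names)

# Posts per weekday; days without posts appear as 0 because of the factor levels
counts <- table(dow)
counts
```

Using a factor with all seven levels keeps empty days in the table, which matters when comparing weekends against weekdays.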
Finally, even though I've been seeing the feed of R-bloggers posts for a while, I'd never really tracked the total number of posts per day. When I aggregated the data to the day level, I was surprised to find what explosive growth the site had starting around 2009. After fitting a nonparametric regression line (see second figure below), we can see the average posts per day roughly double from 2009 to 2010, and double again between 2010 and 2012! Below are the figures and the code used to generate them.
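A doubling claim like this can also be checked without a smoother, by averaging the daily counts within each year and taking year-over-year ratios. The sketch below does that in base R on made-up numbers; the `daily` data frame is hypothetical and only mimics the shape of a posts-per-day table.

```r
# Hypothetical daily post counts (illustrative only, not the R-bloggers data)
daily <- data.frame(
  date = as.Date(c("2009-03-01", "2009-09-01", "2010-03-01", "2010-09-01")),
  tot.posts = c(2, 4, 5, 7)
)

# Mean posts per day within each calendar year
daily$year <- as.integer(format(daily$date, "%Y"))
yearly <- aggregate(tot.posts ~ year, data = daily, FUN = mean)

# Ratio of each year's mean to the previous year's mean (NA for the first year)
yearly$ratio <- c(NA, yearly$tot.posts[-1] / yearly$tot.posts[-nrow(yearly)])
yearly
```

One caveat with this kind of table: if it only contains days that had at least one post, the yearly means are averages over active days rather than over all calendar days.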
# Load the scraped data; the .RData file is expected to provide the data frame `base`
load("C:/Users/Mark/Desktop/RInvest/WebScraping/rblogger.RData")

library(ggplot2)
library(plyr)
library(lubridate)
library(np)

# Average gap (in days) between consecutive posts by one author.
# Each date is differenced against the next element, so the gaps are
# positive when the posts are ordered newest first.
find.avg = function(post.inputs){
  if(length(post.inputs) == 1){
    out = NA
  } else {
    diffs.raw = difftime(post.inputs, c(post.inputs[-1], tail(post.inputs, 1)), units = "days")
    diffs = diffs.raw[-length(diffs.raw)]  # drop the last element (a date minus itself)
    out = mean(diffs)
  }
  return(out)
}

# Average delay and total number of posts per author
delay.frame = ddply(base, .(author), summarize,
                    avg.delay = round(as.numeric(find.avg(date.format)), 2),
                    tot.posts = length(date.format))

p = ggplot(delay.frame, aes(x = log(avg.delay))) + geom_density()
p + xlab("average delay between posts (log days)") + theme_bw()
ggsave("avgDelay.png")

# Mean log delay, with log(0) = -Inf recoded to 0
log.delay = log(delay.frame$avg.delay)
log.delay[which(log.delay == -Inf)] = 0
mean(log.delay, na.rm = TRUE)

# Day of week, month and year of each post
base$dow = wday(base$date.format, label = TRUE)
base$month = month(base$date.format, label = TRUE)
base$year = year(base$date.format)

p = ggplot(base, aes(x = dow)) + geom_bar(fill = "blue")
p + theme_bw() + xlab("day of week") + ylab("total posts")
ggsave("dayOfWeek.png")

## how many posts per day?
post.per.day.frame = ddply(base, .(date.format), summarize,
                           tot.posts = length(title))

# Days elapsed since the first post, used as the regressor
post.per.day.frame$time = as.numeric(difftime(post.per.day.frame$date.format,
                                              rep(min(post.per.day.frame$date.format), nrow(post.per.day.frame)),
                                              units = "days"))

# Nonparametric (kernel) regression of daily post counts on time
np.1 = npreg(tot.posts ~ time, data = post.per.day.frame)
post.per.day.frame$pred = predict(np.1, newdata = post.per.day.frame)

p = ggplot(post.per.day.frame, aes(x = date.format, y = tot.posts)) +
  geom_point() +
  geom_line(aes(x = date.format, y = pred), color = "red", size = 2)
p + theme_bw() + xlab("date") + ylab("total posts")
ggsave("postsPerDay.png")