[This article was first published on R-Chart, and kindly contributed to R-bloggers].
There are lots of references on Hacker News to the fact that the “good old days” are gone and that the character of the site has changed since it started. The visualization above was based on a sample of users who posted on the site recently. The data was gathered by iterating over the first 1000 pages and gleaning a list of user names. The users' account ages were then checked and are plotted above.
The Chart’s Meaning
Note that the chart does not represent the number of posts by a given user; it is just a list of distinct users with their start dates grouped in monthly buckets. I suppose that the shape of the graph makes sense: folks sign up so that they can post, and older users drift away and cease posting at some point. The chart does indicate that, as of a few days ago, a given user who posted was more likely to be someone who signed up in the last year or two than a veteran.
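To make the bucketing concrete, here is a minimal sketch in base R. The user names and dates are made up for illustration and are not from the scraped data; the point is only that each distinct user counts once, keyed by the first day of the month in which the account was created.

```r
# Hypothetical sample: one row per distinct user with the account creation date.
users <- data.frame(
  name = c("alice", "bob", "carol", "dave", "erin"),
  created = as.Date(c("2007-03-14", "2007-03-29", "2009-11-02",
                      "2010-06-18", "2010-06-01")),
  stringsAsFactors = FALSE
)

# Truncate each date to the first day of its month (the bucket key).
users$bucket <- as.Date(format(users$created, "%Y-%m-01"))

# Count users per bucket; how much each user posted plays no part here.
counts <- aggregate(name ~ bucket, data = users, FUN = length)
names(counts) <- c("start_month", "n_users")
print(counts)
```

A user with a hundred recent posts and a user with one recent post each add exactly one to their start month.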
Scraping the Data
I used Ruby and Hpricot (still missing you _why) to parse the site and ActiveRecord to store the list of users in a MySQL database. I use ActiveRecord outside of Rails rather frequently. It does great, straightforward object-relational mapping, and even an arbitrary query is returned as an object that can be manipulated.
I noticed a few differences using MySQL vs Oracle/RODBC with R.
1) Oracle/RODBC capitalizes column names in the result set.
2) Using RMySQL, there is no need to set up an ODBC connection.
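3) With RMySQL, dbSendQuery executes the query and fetch retrieves the results as separate steps. There is also a function dbGetQuery that allows both actions to be taken in a single step.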
4) I use TRUNC in Oracle – but ended up using the EXTRACT function and tagging on a 01 for the first day of the month with MySQL.
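Points 3 and 4 can be sketched together. The helper names below (year_month_to_date, fetch_monthly_counts) and the user_count column alias are hypothetical, introduced only for illustration; the sketch assumes an open DBI connection to a database with the users table described in this post.

```r
# Convert EXTRACT(YEAR_MONTH ...) values such as 201006 into the first day
# of that month by tagging on "01" and parsing the result as a date.
year_month_to_date <- function(year_month) {
  as.Date(paste(year_month, "01", sep = ""), format = "%Y%m%d")
}

# Hypothetical helper: dbGetQuery sends the query and fetches the rows in
# one step. Assumes `con` is an open DBI connection (e.g. from RMySQL).
fetch_monthly_counts <- function(con) {
  sql <- paste(
    "select extract(YEAR_MONTH from hn_created_date) hn_created_date,",
    "count(*) user_count from users",            # user_count is an assumed alias
    "group by extract(YEAR_MONTH from hn_created_date)"
  )
  df <- DBI::dbGetQuery(con, sql)
  df$hn_created_date <- year_month_to_date(df$hn_created_date)
  df
}

# The date conversion can be checked without a database:
print(year_month_to_date(c(200703, 201006)))
```

Aliasing count(*) in the query is optional, but it avoids having to quote the column name with backticks later on the R side.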
Creating the Chart
R speaks for itself:
library(RMySQL)
library(ggplot2)  # needed for ggplot(), geom_line() and stat_smooth()
drv <- dbDriver("MySQL")
con <- dbConnect(drv, username='xxxx', password='xxxx', dbname='xxxx')
# Buckets by month
sql <- 'select extract(YEAR_MONTH from hn_created_date) hn_created_date, count(*) from users group by extract(YEAR_MONTH from hn_created_date);'
# Execute the Query and Fetch the Data (n = -1 retrieves all pending rows)
rs <- dbSendQuery(con, sql)
df <- fetch(rs, n = -1)
dbClearResult(rs)
# Set the date to the first of the month (buckets of user by start month)
df$hn_created_date <- as.Date(paste(df$hn_created_date, '01', sep=''), format='%Y%m%d')
# The Actual Plot (refer to the count column by name inside aes rather than via df$)
p <- ggplot(data=df, aes(hn_created_date, `count(*)`)) + geom_line() + xlab('User Start Date') + ylab('Number of Users Who Posted Recently')
p + stat_smooth()