
GitHub Growth Appears Scale Free

[This article was first published on The Pith of Performance, and kindly contributed to R-bloggers].
In 2013, a blogger claimed that the growth of GitHub (GH) users follows a certain type of diffusion model called Bass diffusion. Growth here refers to the number of unique user IDs as a function of time, not the number of project repositories, which can have a high degree of multiplicity.

In response, I tweeted a plot that suggested GH growth might be following a power law, also known as scale-free growth. The tell-tale sign is the asymptotic linearity of the growth data on double-log axes, which the original blog post did not discuss. The periods on the x-axis correspond to years, with the first period representing calendar year 2008 and the fifth period being the year 2012.
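To make the double-log argument concrete, here is a minimal sketch in R. It uses synthetic numbers, not the GitHub data, and the values of `a_true` and `b_true` are hypothetical choices made purely for illustration. A power law N(t) = a * t^b becomes a straight line after taking logs of both axes, so an ordinary linear regression on the logged values recovers the exponent as the slope.

```r
# Synthetic illustration only: these are NOT the GitHub data.
# Assumed form: N(t) = a * t^b, with hypothetical a and b.
a_true <- 50000
b_true <- 2.5
period <- 1:5
users  <- a_true * period^b_true

# On double-log axes a power law is linear:
#   log10(N) = log10(a) + b * log10(t)
fit <- lm(log10(users) ~ log10(period))

a_hat <- unname(10^coef(fit)[1])  # intercept, back-transformed to a
b_hat <- unname(coef(fit)[2])     # slope, which is the exponent b

print(c(a = a_hat, b = b_hat))
```

With real data the points will scatter around the line, and only the later points may line up, which is what "asymptotic linearity" refers to.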

Scale-free networks can arise from preferential attachment to super-nodes that have a higher vertex degree and therefore more connections to other nodes, i.e., a kind of rich-get-richer effect. The same reasoning can be applied to GH growth viewed as a particular kind of social network. The interaction between software developers using GH can be thought of as involving super-nodes: influential users who prompt prospective GH users to open a new account and contribute to their projects.
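The rich-get-richer mechanism can be sketched with a toy simulation in R. This is a generic preferential-attachment process, not a model of GitHub itself; the network size `n_nodes` and the seed are arbitrary assumptions. Each new node attaches to one existing node chosen with probability proportional to that node's current degree.

```r
set.seed(1)              # arbitrary seed, for reproducibility only
n_nodes <- 2000          # hypothetical network size
degree  <- integer(n_nodes)
degree[1:2] <- 1L        # start from two nodes joined by a single edge

for (i in 3:n_nodes) {
  # Choose an existing node with probability proportional to its degree
  target <- sample.int(i - 1, size = 1, prob = degree[1:(i - 1)])
  degree[target] <- degree[target] + 1L  # the chosen node gains a link
  degree[i] <- 1L                        # the newcomer has one link
}

# Most nodes keep a single link while a few become highly connected hubs
print(summary(degree))
```

Plotting `table(degree)` on double-log axes is one way to inspect the heavy tail that this kind of process tends to produce.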

On this basis, I predicted GH would reach 4 million users during October 2013 and 5 million users during March 2014 (yellow points in the Linear axes plot below). In fact, GH reached those values slightly earlier than predicted by the power law model, and slightly later than the dates predicted by the diffusion model.
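For readers who want to see how such a date prediction follows from a power law, the relation N(t) = a * t^b can be inverted to give t = (N / a)^(1 / b). The sketch below uses hypothetical coefficients `a` and `b`, not the fitted GitHub values, and the helper name `time_to_reach` is made up for this illustration.

```r
# Hypothetical coefficients for N(t) = a * t^b; NOT the fitted GitHub values.
a <- 50000
b <- 2.5

# Time (in periods) at which the power law reaches a target user count,
# obtained by solving target = a * t^b for t.
time_to_reach <- function(target, a, b) (target / a)^(1 / b)

t4 <- time_to_reach(4e6, a, b)  # period at which 4 million is reached
t5 <- time_to_reach(5e6, a, b)  # period at which 5 million is reached
print(c(t4 = t4, t5 = t5))
```

The fractional part of the resulting period can then be converted to a month within the corresponding calendar year.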

Since 2013, new data has been reported, so I extended my previous analysis in R. Details of the respective models are contained in the R script at the end of this post. In the Linear axes plot, the diffusion model and the power law model essentially form an envelope around the newer data: diffusive on the upper side (red curve) and power law on the lower side (blue curve). In this sense, it could be argued that the jury is still out on which model offers the more reliable predictions.

However, one aspect of the diffusion model was overlooked in 2013: it predicts that GH growth will eventually plateau at 20 million users in 2020 (the 12th period, not shown). The beginning of this leveling off is apparent in the 10th period (i.e., 2017). By contrast, the power law model predicts that GH will reach 23.65 million users by the end of the same period (yellow point). Whereas the diffusion and power law models respectively represent the upper and lower edges of an envelope surrounding the more recent data in periods 6–9, their predictions will start to diverge in the 10th period.
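The plateau is built into the Bass formula. The sketch below evaluates the same cumulative Bass expression that appears in the script at the end of this post, with the same parameter values (m = 21e6, p = 0.003, q = 0.83). As a simplifying assumption, time `t` is measured directly in periods here, whereas the script rescales a monthly index; the function name `bass` is introduced only for this illustration.

```r
# Cumulative Bass diffusion curve:
#   N(t) = m * (1 - exp(-(p + q) t)) / (1 + (q / p) * exp(-(p + q) t))
# Parameter values are taken from the script at the end of the post.
bass <- function(t, m = 21e6, p = 0.003, q = 0.83) {
  e <- exp(-(p + q) * t)
  m * (1 - e) / (1 + (q / p) * e)
}

# Assumption: t is in periods (years), not the monthly index used in the plot.
periods <- c(5, 10, 12, 20, 50)
print(round(bass(periods) / 1e6, 2))  # users in millions

# Standard Bass result: growth per period peaks at t* = log(q / p) / (p + q)
peak_time <- log(0.83 / 0.003) / (0.003 + 0.83)
print(peak_time)
```

As t grows, the exponential terms vanish and the curve approaches the market size m, so no amount of additional time takes the Bass model beyond that ceiling. A power law has no such ceiling, which is why the two models must eventually part ways.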

“GitHub is not the only player in the market. Other companies like GitLab are doing a good job but GitHub has a huge head start and the advantage of the network effect around public repositories. Although GitHub’s network effect is weaker compared to the likes of Facebook/Twitter or Lyft/Uber, they are the default choice right now.”  —GitHub is Doing Much Better Than Bloomberg Thinks
Although there will inevitably be an equilibrium bound on the number of active GH users, it seems unlikely to be as small as 20 million, given the combination of GH’s first-mover advantage and its current popularity. Presumably the private investors in GH also hope it will be a large number. This year will tell.

# Data source ... https://classic.scraperwiki.com/scrapers/github_users_each_year/

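# Linear axes plot
# Assumes df.gh3 (columns index and users) and the fitted models gh.exp and
# gh.fit were created earlier in the script; that part is not shown here.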
plot(df.gh3$index, df.gh3$users, xlab="Period (years)", 
     ylab="Users (million)", col="gray", 
     ylim=c(0, 3e7), xaxt="n", yaxt="n")
axis(side=1, tck=1, at=c(0, seq(12,120,12)), labels=0:10, 
     col.ticks="lightgray", lty="dotted")
axis(side=2, tck=1, at=c(0, 10e6, 20e6, 30e6), labels=c(0,10,20,30), 
     col.ticks="lightgray", lty="dotted")

# Simple exp model
curve(coef(gh.exp)[2] * exp(coef(gh.exp)[1] * (x/13)), 
      from=1, to=108, add=TRUE, col="red2", lty="dotdash")

# Super-exp model 
curve(49100 * (x/13) * exp(0.54 * (x/13)), 
      from=1, to=120, add=TRUE, col="red", lty="dashed")

# Bass diffusion model
curve(21e6 * ( 1 - exp(-(0.003 + 0.83) * (x/13)) ) / ( 1 + (0.83 / 0.003) * exp(-(0.003 + 0.83) * (x/13)) ), 
      from=1, to=120, add=TRUE, col="red")

# Power law model
curve(10^coef(gh.fit)[2] * (x/13)^coef(gh.fit)[1], from=1, to=120, add=TRUE, 
      col="blue")

title(main="Linear axes: GitHub Growth 2008-2017")
legend("topleft",
       legend=c("Original data", "New data", "Predictions", "Exponential", 
                "Super exp", "Bass diffusion", "Scale free"), 
       lty=c(NA,NA,NA,4,2,1,1), pch=c(1,19,21,NA,NA,NA,NA), 
       col=c("gray", "black", "yellow", "red2", "red", "red", "blue"), 
       pt.bg=c(NA,NA,"yellow",NA,NA,NA,NA),
       cex=0.75, inset=0.05)
