Programming Language Popularity: StackOverflow and Ohloh
[This article was first published on R-Chart, and kindly contributed to R-bloggers].
In the following example, programming language popularity is measured based upon two data sets. The first is the number of contributors associated with a language on ohloh.net. The second is tag usage at stackoverflow.com.
SQL with no DDL
I admit it… in an age of NoSQL… I like SQL. I agree that fixed table schemas can be a real pain though… who wants the overhead of defining database tables for a quick comparison of two data sets? The sqldf package avoids that overhead by running SQL queries directly against R data frames.
Joining on the language name provides a simple, intuitive way to correlate the result sets. Of course there are limitations to this approach; after all, SQL was designed for relational databases. Since the join matches a language name to a tag name, some languages may be underrepresented. For instance:
- jQuery, a JavaScript library, is a leading tag.
- Objective-C questions might appear under the iPhone tag.
- C# questions might appear under .NET, ASP.NET or other Microsoft tags.
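One way to soften this limitation is to roll related tags up into their language before joining. The following is a minimal sketch in base R, not part of the original analysis: the `tag_aliases` table is hypothetical and only covers the examples listed above, and the counts are made-up placeholders rather than real StackOverflow numbers.

```r
# Hypothetical alias table: maps a StackOverflow tag to the language it relates to.
# Only the examples from the list above are covered.
tag_aliases <- c(
  "jquery"  = "javascript",
  "iphone"  = "objective-c",
  ".net"    = "c#",
  "asp.net" = "c#"
)

# Made-up tag counts, for illustration only (NOT the real StackOverflow numbers).
tags <- data.frame(
  Tag   = c("javascript", "jquery", "c#", ".net", "asp.net", "objective-c", "iphone"),
  Count = c(100, 40, 200, 30, 20, 10, 15),
  stringsAsFactors = FALSE
)

# Replace each aliased tag by its language; leave all other tags untouched.
key <- tolower(tags$Tag)
aliased <- key %in% names(tag_aliases)
key[aliased] <- tag_aliases[key[aliased]]
tags$Language <- key

# Sum the counts per language so that a later join sees one row per language.
rolled_up <- aggregate(Count ~ Language, data = tags, FUN = sum)
print(rolled_up)
```

Note that a single question can carry several tags, so summing like this can count the same question more than once. It is a rough correction at best.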
Top Languages
In this particular analysis, I am really interested in outliers – not the vast majority of the languages that appear in the data set. So the name of each point will be plotted beside it. For less popular languages, this chart is impossible to read and madly cluttered… but it is great for focusing on the most popular languages. So rather than coming up with a publication-quality graphic, the chart above provides a “quick-and-dirty” perspective that can lead to helpful discussions for people familiar with the programming language domain.
In the previous post, Ruby ranked at the top. This demonstrates the Ruby-centric nature of GitHub, which was initially directed towards the Ruby community. Similar trends affect the results in the current post (where Ruby ranks 12th in tag count and 16th in the number of contributors). R is 18th in tag count and 33rd in number of contributors.
The data was extracted over the last few days and is available on GitHub in ohlo_2010-08-16.txt and stackoverflow.txt (warning: 400MB file, since all tags from stackoverflow are listed in it). The process to analyze the files involved the following R code.
library(ggplot2)
library(sqldf)

# Both files are semicolon-delimited with a header row
SODF = read.csv('stackoverflow.txt', header=TRUE, sep=';')
OHLODF = read.csv('ohlo_2010-08-16.txt', header=TRUE, sep=';')
head(OHLODF)
head(SODF)

# Case-insensitive inner join of Ohloh languages to StackOverflow tags
df = sqldf('select Name name, Count tag_count, Contributors contributors
            from OHLODF o
            join SODF s on LOWER(s.Tag) = LOWER(o.Name) order by 1')

# Scatter plot with each point labelled by its language name
ggplot(data=df,
       aes(x=tag_count, y=contributors, color=name)) +
  geom_point() +
  geom_text(aes(label = name))
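For readers who want to see what the join step in the code above does without loading the 400MB file, here is a small sketch that mimics it with base R's `merge()` instead of sqldf. It is not the code used for this post. The three matching rows are copied from the result tables in this post; the assumption that StackOverflow tags are lower case while Ohloh names are mixed case is mine, and the jquery count is left as NA because it is not given here.

```r
# StackOverflow side: tag names assumed to be lower case.
so <- data.frame(Tag = c("java", "php", "c#", "jquery"),
                 Count = c(62386, 53884, 101811, NA),  # jquery count not given in the post
                 stringsAsFactors = FALSE)

# Ohloh side: language names in mixed case.
ohloh <- data.frame(Name = c("Java", "PHP", "C#"),
                    Contributors = c(78098, 36952, 22198),
                    stringsAsFactors = FALSE)

# Build a lower-cased join key on both sides, mirroring LOWER(...) in the SQL.
so$key <- tolower(so$Tag)
ohloh$key <- tolower(ohloh$Name)

# merge() performs an inner join by default, so tags without a matching
# language (jquery here) are dropped, just as in the SQL join.
joined <- merge(ohloh, so, by = "key")
toy <- data.frame(name = joined$Name,
                  tag_count = joined$Count,
                  contributors = joined$Contributors,
                  stringsAsFactors = FALSE)
toy <- toy[order(toy$name), ]
print(toy)
```

The dropped jquery row is exactly the kind of tag that makes JavaScript look smaller than it arguably is.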
The resulting chart is displayed above. To list the top 10 languages:
> head(df[order(df$contributors, decreasing=TRUE),],10)
         name tag_count contributors
57        XML     12374       133183
24       HTML     21936       106012
28       Java     62386        78098
9           C     17256        78023
13        CSS     16429        72060
11        C++     38691        61831
29 JavaScript     46608        60677
33       Make       537        50328
44     Python     31852        38691
39        PHP     53884        36952
> head(df[order(df$tag_count, decreasing=TRUE),],10)
          name tag_count contributors
10          C#    101811        22198
28        Java     62386        78098
39         PHP     53884        36952
29  JavaScript     46608        60677
11         C++     38691        61831
44      Python     31852        38691
48         SQL     25316        28069
24        HTML     21936       106012
9            C     17256        78023
37 Objective-C     17250         6555
All other things being equal, one might expect the relationship between the number of contributors to projects and the tag count to be roughly linear. As it stands, that is not the case at all.
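A quick way to put a number on that impression is to compute a correlation and fit a straight line. The sketch below is an addition, not part of the original analysis. It uses only the 13 languages that appear in the two top-10 listings above, so it says nothing about the long tail of the full data set.

```r
# The 13 distinct languages from the two top-10 listings above.
top <- data.frame(
  name = c("XML", "HTML", "Java", "C", "CSS", "C++", "JavaScript", "Make",
           "Python", "PHP", "C#", "SQL", "Objective-C"),
  tag_count = c(12374, 21936, 62386, 17256, 16429, 38691, 46608, 537,
                31852, 53884, 101811, 25316, 17250),
  contributors = c(133183, 106012, 78098, 78023, 72060, 61831, 60677, 50328,
                   38691, 36952, 22198, 28069, 6555),
  stringsAsFactors = FALSE
)

# Pearson measures linear association; Spearman only looks at rank order.
pearson  <- cor(top$tag_count, top$contributors)
spearman <- cor(top$tag_count, top$contributors, method = "spearman")

# Fit contributors as a linear function of tag count and read off R-squared.
fit <- lm(contributors ~ tag_count, data = top)
r_squared <- summary(fit)$r.squared

print(c(pearson = pearson, spearman = spearman, r_squared = r_squared))
```

A Pearson coefficient near 1 would support the linear expectation; a value near zero or below would not. Running the same three lines on the full `df` is the more meaningful check.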
Web Oriented Languages
The languages represented are heavily weighted towards web application technologies. HTML, CSS, JavaScript and PHP are used almost exclusively for such development, and Ruby, Python, Perl, Java, C# and SQL are also heavily used for web applications (though not exclusively). C, C++, Objective-C and Make are related technologies that are geared less towards web development.
Microsoft
According to Wikipedia, StackOverflow is a Microsoft partner, and StackOverflow itself was developed on the Microsoft platform. This might partly explain the high representation of C#.
Simple Languages = Fewer Questions
XML and HTML are markup languages with relatively simple syntax, hence the relatively small tag count. CSS and Make are also relatively small languages with specific uses rather than general purpose programming languages. The fact that C++ was developed as an enhancement to the C programming language explains why there are more questions (and a larger tag count) for C++ than C. A more speculative suggestion is that Perl’s relatively low tag count indicates that the “more than one way to do it” philosophy leads to fewer questions. An obvious alternative is that Perl users simply ask questions in other venues.
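The comparison in this section can be made concrete by dividing tag count by contributors, giving a rough "questions per contributor" figure. The sketch below is an addition to the post; the column name `tags_per_contributor` is my own, and the numbers are copied from the top-10 listings above.

```r
# Languages discussed in this section, with values from the listings above.
langs <- data.frame(
  name = c("XML", "HTML", "CSS", "Make", "C", "C++"),
  tag_count = c(12374, 21936, 16429, 537, 17256, 38691),
  contributors = c(133183, 106012, 72060, 50328, 78023, 61831),
  stringsAsFactors = FALSE
)

# Tags per contributor: a low value means few questions relative to usage.
langs$tags_per_contributor <- langs$tag_count / langs$contributors

# Sort from fewest to most questions per contributor.
langs <- langs[order(langs$tags_per_contributor), ]
print(langs, digits = 3, row.names = FALSE)
```

On these figures Make has the lowest ratio and C++ the highest, with C++ well above C, which is consistent with the reading given above.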
Conclusion
All measures of programming language popularity have their limitations. Correlating various sets of data can provide some additional insights into their prevalence and usage. R and sqldf provide a convenient means of making such comparisons, and ggplot2 provides a great way of charting the results.
Update
A log scale (as suggested by Tal in the comments) provides better insight into the majority of languages that appear clustered in the lower left hand corner of the chart. However, though this site might be considered R rated, the **** was added through later image editing to make it fit for all audiences.
ggplot(data=df,
aes(x=log(tag_count), y=log(contributors), color=name)) +
geom_point() +
geom_text(aes(label = name))
To leave a comment for the author, please follow the link and comment on their blog: R-Chart.