Comparing the #code2013 results with the current TIOBE rankings
The TIOBE language rankings have always been controversial, but in the absence of more meaningful metrics they tend to be viewed as holy writ. Over the last few days of 2013, Twitter user @deadprogram started the hashtag #code2013, the idea being that people would tweet which languages they had used over the past year. I felt this would make an interesting comparison to the TIOBE rankings: the latter are based on search engine popularity, while the #code2013 results reflect what people actually report using.
To do this I used my R library twitteR to pull 4624 tweets with this hashtag and then started pulling them apart to see what I could see. I had previously fetched the tweets with the searchTwitter() function and loaded them into my R session. From there my first step was to remove retweets. Removing the new-style Twitter retweets is simple; after that I removed anything with RT in the text. The latter isn’t perfect and is likely to throw out good data (e.g. “lang1 lang2 lang3 RT @deadprogram: What programming languages have you used this year? Tweet using #code2013 Please do it and also RT!”), but it seemed unlikely to radically skew the results. The R code I used to do this was:
load("code2013.rda") | |
# Find/remove the tweets flagged as retweets | |
is_retweets = which(sapply(code2013, function(x) x$getIsRetweet())) | |
if (length(is_retweets) > 0) { | |
filtered_tweets = code2013[-is_retweets] | |
} else { | |
filtered_tweets = code2013 | |
} | |
statuses = sapply(filtered_tweets, function(x) x$getText()) | |
# Find and remove RT based retweets. This will be overeager but we're not losing many | |
# tweets anyways | |
manual_retweets = grep("[[:space:]]?rt", statuses) | |
if (length(manual_retweets) > 0) { | |
filtered_tweets = filtered_tweets[-manual_retweets] | |
statuses = statuses[-manual_retweets] | |
} |
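For reference, the pull itself isn’t shown above. It was something roughly like the sketch below, assuming the twitteR OAuth setup had already been done; the search string and the value of n here are illustrative rather than the exact call I ran.

library(twitteR)
# Sketch only: assumes OAuth credentials have already been registered with twitteR.
# Pull statuses mentioning the hashtag; n is just a generous upper bound.
code2013 = searchTwitter("#code2013", n=5000)
# Save the raw status objects so the rest of the analysis can be re-run offline
save(code2013, file="code2013.rda")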
The next step was to read in the TIOBE rankings (well, the top 50). Visually inspecting a sampling of the #code2013 tweets alongside the TIOBE data made it clear that I would have to massage the language names a bit, as there were a few problems. The most notable were multi-word names like “Objective C” or “emacs lisp”, since I was planning on tokenizing languages by whitespace. Similarly, TIOBE defines “delphi/object pascal”, but people in #code2013 tended to say either “Delphi” or “object pascal”. Perfectly cleaning up the #code2013 data would be an impossible task, but I made a few adjustments to help things along:
# Read in the TIOBE data
tiobe = read.csv("tiobe.csv", stringsAsFactors=FALSE)
tiobe_langs = tolower(tiobe[, "lang"])
# Looking at the TIOBE listings and some of the tweet data, massage some of the entries
# here. This won't be perfect but will help a little bit
replace_statuses = function(statuses, was, is) {
  gsub(was, is, statuses, ignore.case=TRUE)
}
# Note: "delphi" is replaced before "object pascal" so that the "object pascal"
# replacement's output (which itself contains "delphi") isn't rewritten a second time
replacements = list(c("objective c", "objective-c"), c("visual basic", "visual-basic"),
  c("emacs lisp", "emacs-lisp"), c("delphi", "delphi/object-pascal"),
  c("object pascal", "delphi/object-pascal"), c("common lisp", "common-lisp"),
  c("elisp", "emacs-lisp"))
for (pair in replacements) {
  statuses = replace_statuses(statuses, pair[1], pair[2])
}
tiobe_langs[7] = "visual-basic"
tiobe_langs[11] = "visual-basic"
tiobe_langs[20] = "delphi/object-pascal"
tiobe_langs[46] = "emacs-lisp"
tiobe_langs[41] = "common-lisp"
tiobe$lang = tiobe_langs
# We've now got two visual-basic entries, so merge their ratings and drop the duplicate row
tiobe[7, "rating"] = tiobe[7, "rating"] + tiobe[11, "rating"]
tiobe = tiobe[-11, ]
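As a quick, optional sanity check (just a sketch, not part of the analysis proper), you can count how many statuses now mention each of the normalised names; since gsub() inserts the lowercase replacement strings verbatim, a fixed-string match is enough here:

# Rough check that the normalised names actually show up in the tweet text
sapply(c("objective-c", "visual-basic", "delphi/object-pascal", "emacs-lisp"),
       function(lang) sum(grepl(lang, statuses, fixed=TRUE)))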
# I want to convert this all to lowercase but there are 67 with weird encodings
bad_statuses = numeric()
lowercase_statuses = character()
for (i in seq_along(statuses)) {
  tl = try(tolower(statuses[[i]]), silent=TRUE)
  if (inherits(tl, "try-error")) {
    bad_statuses = c(bad_statuses, i)
  } else {
    lowercase_statuses = c(lowercase_statuses, tl)
  }
}
if (length(bad_statuses) > 0) {
  filtered_tweets = filtered_tweets[-bad_statuses]
}
statuses = lowercase_statuses
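As an aside, a vectorized alternative to that loop (not what I actually used, just a sketch) would be to normalise the encodings with base R’s iconv() first and drop whatever still can’t be converted:

# Alternative sketch: coerce everything to UTF-8, silently dropping unconvertible
# bytes, then lowercase in a single vectorized call
converted = iconv(statuses, to="UTF-8", sub="")
still_bad = which(is.na(converted))
if (length(still_bad) > 0) {
  filtered_tweets = filtered_tweets[-still_bad]
  converted = converted[-still_bad]
}
statuses = tolower(converted)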
# Tokenize each status: split on comma, period or whitespace
status_tokens = strsplit(statuses, ",|\\.|\\s+")
matching_tokens = sapply(status_tokens, function(x) {
  x[which(x %in% tiobe_langs)]
})
# Now have the languages mentioned in #code2013 which are in TIOBE
code2013_langs = unlist(matching_tokens)
lang_counts = sort(table(code2013_langs), decreasing=TRUE)
# Build the table explicitly so the language names end up as rownames
# and the counts in a single "Count" column
code2013_lang_table = data.frame(Count=as.vector(lang_counts), row.names=names(lang_counts))
# Create a column describing the rough place of the code2013 langs
code2013_lang_table$code2013_tier = ordered(c(rep("1-5", 5), rep("6-10", 5), rep("11-15", 5),
  rep("16-25", 10), rep("26-39", 14)), levels=c("1-5", "6-10", "11-15", "16-25", "26-39"))
# Order by the TIOBE rankings
code2013_lang_table$code2013_langs = ordered(rownames(code2013_lang_table),
  levels=rev(tiobe[, "lang"]))
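Before plotting it’s worth a quick peek at the result (output omitted here; note that the rep() counts in the tier column assume 39 matched languages in total):

# Peek at the most-mentioned languages, their counts and assigned tiers
head(code2013_lang_table, 10)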
library(ggplot2)
png(file="code2013_tiobe.png", width=640, height=640)
ggplot(code2013_lang_table, aes(x=code2013_langs, y=Count, fill=code2013_tier)) +
  geom_bar(stat="identity") +
  xlab("Language") + ylab("Count") +
  ggtitle("#code2013 Languages Sorted By TIOBE Rankings") +
  coord_flip()
dev.off()
I might be way off base here, but looking at the rankings of the #code2013 languages tells me a couple of things. One is that, unsurprisingly, web development still rules the roost: javascript, ruby, python, java, php. The other is that data analysis & big data (I loathe the term, but c'est la vie) is coming on stronger than TIOBE recognizes, considering that some of the darlings of that world are doing better in #code2013 than in TIOBE, with notable examples being Python, Scala, Haskell & R.
For the record, my tweet in this hashtag was: “Scala, java, R, python, matlab, C++ #code2013” so I have to say I’m pleasantly surprised to see some of my favorite languages (which would be the first four I mentioned, although not in that order) looking like a better combination than TIOBE would suggest.
Edit #1: Hadley Wickham suggested that I include a scatterplot of the data. Considering that one of the main motivations for this exercise was to force myself to figure out how his ggplot2 library worked, I figured I’d oblige:
code2013_lang_table$code2013_rank = 1:nrow(code2013_lang_table)
code2013_lang_table$tiobe_rank = match(code2013_lang_table$code2013_langs, tiobe[, "lang"])
# Make a scatterplot of the ranking differences
png(file="code2013_tiobe_scatter.png", width=640, height=640)
ggplot(code2013_lang_table, aes(x=code2013_rank, y=tiobe_rank, color=code2013_tier)) +
  geom_text(aes(label=code2013_langs), size=3) +
  ylab("TIOBE Rank") + xlab("#code2013 rank") +
  ggtitle("#code2013 vs TIOBE rankings")
dev.off()
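If you want a single number summarising how well the two orderings agree, a rank correlation along these lines would do the job (a sketch only; I didn’t compute this as part of the above):

# Spearman correlation between the #code2013 and TIOBE rank orderings; a crude
# summary since it only covers languages that appear in both lists
cor(code2013_lang_table$code2013_rank, code2013_lang_table$tiobe_rank,
    method="spearman")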