Gender ratios of programmers, by language
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
While there are many admirable efforts to increase participation by women in STEM fields, in many programming teams men still outnumber women, often by a significant margin. Exactly by how much is a fraught question, and accurate statistics are hard to come by. Another interesting question is whether the gender disparity varies by language, and how to define a “typical programmer” for a given language.
Jeff Allen from Trestle Tech recently took an interesting approach using R to gather data on gender ratios for programmers: get a list of the top coders for each programming language, and then count the number of men and women in each list. Neither task is trivial. For a list of coders, Jeff scraped GitHub's list of trending repositories over the past month by programming language, and then extracted the avatars for the listed contributors. Then, he used the Microsoft Cognitive Services Face API on each avatar to determine the apparent gender of the contributor, and tallied up the results. You can find the R code he used on GitHub.
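The final tallying step can be sketched in a few lines. This is a minimal Python illustration, not Jeff's R code: the function name `percent_female_by_language` and the record format (a language paired with an apparent-gender label, or `None` when no face was detected) are assumptions made for the example, and the sample records are made up.

```python
from collections import Counter, defaultdict


def percent_female_by_language(records):
    """Return {language: percent of counted contributors labeled 'female'}.

    `records` is an iterable of (language, label) pairs, where label is
    'female', 'male', or None when no face was detected in the avatar.
    This record format is hypothetical, chosen only for illustration.
    """
    counts = defaultdict(Counter)
    for language, label in records:
        if label is None:
            # Logos, cartoons, and other non-face avatars are not counted.
            continue
        counts[language][label] += 1

    result = {}
    for language, c in counts.items():
        total = c["female"] + c["male"]
        result[language] = 100.0 * c["female"] / total if total else float("nan")
    return result


# Made-up records for illustration only; these are not real results.
sample = [
    ("R", "female"),
    ("R", "male"),
    ("R", None),
    ("C++", "male"),
    ("C++", "male"),
]
percentages = percent_female_by_language(sample)
```

Note that avatars without a detectable face drop out of the denominator entirely, which is why the number of counted contributors per language can be much smaller than the number of contributors listed.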
I used Jeff's code to re-run his results based on GitHub's latest monthly rankings. The first thing I needed to do was to request an API key; a trial key is free with a Microsoft account. (The number of requests per second is limited, but the R code is written to limit the rate of requests accordingly.) I limited my search to the languages C++, C#, Java, JavaScript, Python, R and Ruby. The percentages of contributors identified as female, within each language, are shown below:
According to this analysis, none of the contributors to top C++ projects on GitHub are female; by contrast, almost 10% of contributors to R projects are female.
Now, these data need to be taken with a grain of salt. The main issue is numbers: fewer than 100 programmers per language are identified as “top programmers” via this method, and sometimes significantly fewer (just 45 top C++ contributors were identified). Part of the reason for this is that not all programmers use their face as an avatar; those that used a symbol, logo or cartoon were not counted. Furthermore, it's reasonable to assume that there's a disparity in the rate at which women use their own face as an avatar compared to men, which would add bias to the above results in addition to the variability from the small numbers. Finally, the gender determination is based on an algorithm, and isn't guaranteed to match the gender identity of the programmer (or their avatar).
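To get a feel for how much uncertainty the small numbers introduce, one option is to put a confidence interval around each percentage. The sketch below uses the Wilson score interval, which is my choice for illustration and not part of the original analysis; it also assumes that the 45 C++ contributors mentioned above are the ones actually counted, and it ignores the avatar-selection bias entirely.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion.

    `z` defaults to the normal quantile for a 95% interval.
    Returns (lower, upper) as proportions between 0 and 1.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Assumption: 0 of 45 counted C++ contributors were identified as female.
low, high = wilson_interval(0, 45)
```

Under those assumptions the interval runs from 0 up to roughly 8%, so an observed share of zero in a sample this small is still compatible with a true share well above zero.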
Nonetheless, it's an interesting example of using social network data in conjunction with cognitive APIs to conduct demographic studies. You can find examples of using other data from the facial analysis, including apparent happiness by language, at the link below.
(Update June 15: re-ran the analysis and updated the chart above to actually display percentages, not ratios, on the y-axis. The numbers changed slightly as the GitHub data changed. The old chart is here.)
Trestle Tech: EigenCoder: Programming Stereotypes