Usage shares of programming languages in economics research


My shiny app Finding Economics Articles with Data now contains over 8000 economics articles with replication packages. You can use it here: https://ejd.econ.mathematik.uni-ulm.de

Some of the data on articles and file types in the reproduction packages can be downloaded as a zipped SQLite database from my server (see the “About” page in the app for the link). Let us use the database to take a look at the usage shares of different programming languages.
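
If you want to follow along, a minimal sketch of downloading and unpacking the file could look as follows; the URL below is just a placeholder, take the actual link from the app's "About" page:

# Placeholder URL: replace with the actual link from the "About" page
zip_url = "https://example.com/articles.zip"
download.file(zip_url, "articles.zip", mode = "wb")
# Should extract articles.sqlite into the working directory
unzip("articles.zip")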

The following code extracts our data set by merging two tables from the database.

library(RSQLite)
library(dbmisc)
library(dplyr)

# Open the database (its table schemas are defined in my dbmisc package)
db = dbConnect(RSQLite::SQLite(), "articles.sqlite")

# Load the article and files_summary tables and add journal and year
# information to each file summary row
articles = dbGet(db, "article")
fs = dbGet(db, "files_summary")
fs = fs %>%
  left_join(select(articles, year, journ, id), by = "id")
head(fs)
id            file_type num_files        mb is_code is_data year  journ
aejapp_10_4_5 csv               9   6.49858       0       1 2018 aejapp
aejapp_10_4_5 do               19  0.169755       1       0 2018 aejapp
aejapp_10_4_5 dta             207 19918.231       0       1 2018 aejapp
aejpol_10_4_8 csv               1  2.110033       0       1 2018 aejpol
aejpol_10_4_8 do               18  0.118644       1       0 2018 aejpol
aejpol_10_4_8 gz                1 4294.9673       0       0 2018 aejpol

The data frame fs contains, for each article's reproduction package, one row per file type with the number of files, their total size in MB, and flags indicating whether the file type is counted as code or data.
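
As a quick orientation, you can list which file types are flagged as code (a small sketch, assuming the is_code column marks code file types):

# File types that are classified as code
fs %>%
  filter(is_code == 1) %>%
  distinct(file_type) %>%
  arrange(file_type)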

Let us take a look at the total number of reproduction packages and then compute the share of reproduction packages that contain at least one file of a given programming language. (I am aware that not everybody would call e.g. Stata a programming language; just feel free to replace the term with your favorite expression like scripting language or statistical software.)

n_art = n_distinct(fs$id)
n_art

## [1] 8262

fs %>% 
  group_by(file_type) %>%
  summarize(
    count = n(),
    share=round((count / n_art)*100,1)
  ) %>%
  # note that all file extensions are stored in lower case
  filter(file_type %in% c("do","r","py","jl","m","java","c","cpp","nb","f90","f95", "sas","mod","js","g","gms","ztt")) %>%
  arrange(desc(share))
file_type count share
do         5915  71.6
m          2023  24.5
r           808   9.8
sas         349   4.2
py          341   4.1
mod         198   2.4
f90         188   2.3
nb          116   1.4
c           105   1.3
ztt         104   1.3
cpp          66   0.8
jl           39   0.5
java         33   0.4
g            28   0.3
gms          19   0.2
js           18   0.2
f95           7   0.1

By far the most used software is Stata: its .do scripts can be found in 71.6% of the reproduction packages. Matlab follows with 24.5%. The most popular open source language is R with 9.8%. After one more proprietary software, SAS, Python follows as the second most used open source language with 4.1%. If you wonder why the shares add up to more than 100%: some reproduction packages simply use more than one language.
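
To illustrate the last point, here is a small sketch (assuming the is_code flag in fs marks code file types) that counts how many distinct code file types each reproduction package contains:

# Count the distinct code file types per reproduction package
multi_lang = fs %>%
  filter(is_code == 1) %>%
  group_by(id) %>%
  summarize(n_code_types = n_distinct(file_type))

# Share of packages that contain more than one code file type
mean(multi_lang$n_code_types > 1)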

Let us take a look at the development over time for Stata, Matlab, R and Python.

year_dat = fs %>%
  filter(year >= 2010) %>%
  group_by(year) %>%
  mutate(n_art_year = n_distinct(id)) %>%
  group_by(year, file_type) %>%
  summarize(
    count = n(),
    share=count / first(n_art_year),
    # Compute approximate 95% CI of proportion
    se = sqrt(share*(1-share)/first(n_art_year)),
    ci_up = share + 1.96*se,
    ci_low = share - 1.96*se
  ) %>%
  filter(file_type %in% c("do","r","py","m")) %>%
  arrange(year,desc(share))  

library(ggplot2)
ggplot(year_dat, aes(x=year, y=share,ymin=ci_low, ymax=ci_up, color=file_type)) +
  facet_wrap(~file_type) +
  geom_ribbon(fill="#000000", colour = NA, alpha=0.1) +
  geom_line() +
  theme_bw()

The usage shares of Stata and Matlab stay relatively constant over time. Yet we see a substantial increase in R usage, from 1.4% in 2010 to over 20% in 2023. Python usage also increases: from 0.4% in 2010 to almost 10% in 2023.
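
If you want to check the underlying numbers yourself, a small sketch like the following pulls the R and Python shares for the first and last year from year_dat:

# Shares of R and Python in 2010 and 2023
year_dat %>%
  filter(file_type %in% c("r", "py"), year %in% c(2010, 2023)) %>%
  select(year, file_type, share)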

So open source software is becoming more popular in academic economics research, with large growth rates, but its absolute usage levels are still substantially below those of Stata.

Note that the representation of journals is not balanced across years in our database. For example, the first reproduction package from Management Science in our database is from 2019. To check whether the growth of R usage can also be found within journals, let us look at the development of its usage share separately for each journal:
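
To see when each journal first appears in the data, a quick check is to compute the first year with a reproduction package per journal (a small sketch based on the fs data frame from above):

# First year with a reproduction package for each journal
fs %>%
  group_by(journ) %>%
  summarize(first_year = min(year, na.rm = TRUE)) %>%
  arrange(first_year)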

year_journ_dat = fs %>%
  filter(year >= 2010) %>%
  group_by(year, journ) %>%
  mutate(n_art = n_distinct(id)) %>%
  group_by(year, journ, file_type) %>%
  summarize(
    count = n(),
    share = count / first(n_art),
    # Compute approximate 95% CI of proportion
    se = sqrt(share*(1-share)/first(n_art)),
    ci_up = share + 1.96*se,
    ci_low = share - 1.96*se
  )
ggplot(year_journ_dat %>% filter(file_type=="r"),
  aes(x=year, y=share,ymin=ci_low, ymax=ci_up)) +
  facet_wrap(~ journ, scales = "free_y") +
  geom_ribbon(fill="#000000", colour = NA, alpha=0.1) +
  geom_line() +
  coord_cartesian(ylim = c(0, 0.4)) +
  ylab("") +
  ggtitle("Share of replication packages using R")+
  theme_bw()

We see a substantial increase in R usage in most journals. Finally, let us take a similar look at the time trends of Stata usage within journals.

ggplot(year_journ_dat %>% filter(file_type=="do"),
  aes(x=year, y=share,ymin=ci_low, ymax=ci_up)) +
  facet_wrap(~ journ, scales = "free_y") +
  geom_ribbon(fill="#000000", colour = NA, alpha=0.1) +
  geom_line() +
  coord_cartesian(ylim = c(0, 1)) +
  ylab("") +
  ggtitle("Share of replication packages using Stata")+
  theme_bw()