Finding Economic Articles With Data

[This article was first published on Economics and R, and kindly contributed to R-bloggers.]

In my view, one of the greatest developments in economics during the last decade is that the journals of the American Economic Association and some other leading journals now require authors to upload the replication code and data sets of accepted articles.

I wrote a Shiny app that currently allows you to search among more than 3,000 articles that have an accessible data and code supplement. You can use it here:

http://econ.mathematik.uni-ulm.de:3200/ejd/

One can perform a keyword search over the titles and abstracts. The screenshot shows an example:


You get some information about the size of the data files and the code files used. I also tried to find and extract a README file from each supplement. Most README files explain whether all results can be replicated with the provided data sets or whether some results require confidential or proprietary data sets. The link allows you to look at the README without having to download the whole data supplement.

The main idea is that such a search function could be helpful for teaching economics and data science. For example, my students can use the app to find an interesting topic for a Bachelor or Master thesis in the form of an interactive analysis with RTutor. You could also generate a topic list for a seminar in which students replicate some key findings of a research article.

While the app performs well for a single user, I have not tested its performance with many simultaneous users. If it is too sluggish or you cannot connect, there are probably too many users at the moment; just try again a bit later.

If you want to analyse the collected data underlying the search app yourself, you can download the zipped SQLite databases using the following links:

I try to update the databases regularly.

Below is an example of a simple analysis based on these databases. First make sure that you download articles.zip and extract it into your working directory.
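For example, assuming the zip file has been downloaded as articles.zip into the working directory, the extraction step in R is just:

# Extract the SQLite database (articles.sqlite) from the downloaded zip
unzip("articles.zip", exdir = ".")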

We first open a database connection

library(RSQLite)
# Open a connection to the extracted SQLite database
db = dbConnect(RSQLite::SQLite(),"articles.sqlite")

Data type conversion between databases and R can sometimes be a bit tedious. For example, SQLite has no native Date or logical type. For this reason, I typically use my package dbmisc when working with SQLite databases. It allows you to specify a database schema as a simple YAML file and has a lot of convenience functions that retrieve or modify data using the provided schema automatically. The following code sets the database schema that is provided in the package EconJournalData:

library(dbmisc)
db = set.db.schemas(db,schema.file=
  system.file("schema/articles.yaml", package="EconJournalData"))

Of course, for a simple analysis like the one below, just using the standard functions of the DBI package without a schema would suffice. But I am simply used to working with the dbmisc package.
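For reference, here is roughly how the same first query would look with plain DBI functions (a sketch without any schema handling, so dates and logical values come back in SQLite's raw representation):

library(DBI)
# First 4 rows of the article table via a plain SQL query
dbGetQuery(db, "SELECT * FROM article LIMIT 4")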

The main information about articles is stored in the table article

# Get the first 4 entries of articles as data frame
dbGet(db, "article",n = 4)
Row 1
  id: aer_108_11_1, year: 2018, date: 2018-11-01, journ: aer, vol: 108, issue: 11, artnum: 1
  title: Firm Sorting and Agglomeration
  article_url: https://www.aeaweb.org/articles?id=10.1257/aer.20150361
  has_data: TRUE, data_url: https://www.aeaweb.org/doi/10.1257/aer.20150361.data, size: 0.05339, unit: MB
  files_txt: NA, downloaded_file: aer_vol_108_issue_11_article_1.zip, num_authors: NA, file_info_stored: TRUE, file_info_summarized: NA
  readme_file: aer/2018/aer_108_11_1/READ_ME.pdf
  abstract: To account for the uneven distribution of economic activity in space, I propose a theory of the location choices of heterogeneous firms in a variety of sectors across cities. In equilibrium, the distribution of city sizes and the sorting patterns of firms are uniquely determined and affect aggregate TFP and welfare. I estimate the model using French firm-level data and find that nearly half of the productivity advantage of large cities is due to firm sorting, the rest coming from agglomeration economies. I quantify the general equilibrium effects of place-based policies: policies that subsidize smaller cities have negative aggregate effects.

Row 2
  id: aer_108_11_2, year: 2018, date: 2018-11-01, journ: aer, vol: 108, issue: 11, artnum: 2
  title: Near-Feasible Stable Matchings with Couples
  article_url: https://www.aeaweb.org/articles?id=10.1257/aer.20141188
  has_data: TRUE, data_url: https://www.aeaweb.org/doi/10.1257/aer.20141188.data, size: 0.07286, unit: MB
  files_txt: NA, downloaded_file: aer_vol_108_issue_11_article_2.zip, num_authors: NA, file_info_stored: TRUE, file_info_summarized: NA
  readme_file: aer/2018/aer_108_11_2/Readme.pdf
  abstract: The National Resident Matching program seeks a stable matching of medical students to teaching hospitals. With couples, stable matchings need not exist. Nevertheless, for any student preferences, we show that each instance of a matching problem has a “nearby” instance with a stable matching. The nearby instance is obtained by perturbing the capacities of the hospitals. In this perturbation, aggregate capacity is never reduced and can increase by at most four. The capacity of each hospital never changes by more than two.

Row 3
  id: aer_108_11_3, year: 2018, date: 2018-11-01, journ: aer, vol: 108, issue: 11, artnum: 3
  title: The Costs of Patronage: Evidence from the British Empire
  article_url: https://www.aeaweb.org/articles?id=10.1257/aer.20171339
  has_data: TRUE, data_url: https://www.aeaweb.org/doi/10.1257/aer.20171339.data, size: 0.44938, unit: MB
  files_txt: NA, downloaded_file: aer_vol_108_issue_11_article_3.zip, num_authors: NA, file_info_stored: TRUE, file_info_summarized: NA
  readme_file: aer/2018/aer_108_11_3/Readme.pdf
  abstract: I combine newly digitized personnel and public finance data from the British colonial administration for the period 1854-1966 to study how patronage affects the promotion and incentives of governors. Governors are more likely to be promoted to higher salaried colonies when connected to their superior during the period of patronage. Once allocated, they provide more tax exemptions, raise less revenue, and invest less. The promotion and performance gaps disappear after the abolition of patronage appointments. Patronage therefore distorts the allocation of public sector positions and reduces the incentives of favored bureaucrats to perform.

Row 4
  id: aer_108_11_4, year: 2018, date: 2018-11-01, journ: aer, vol: 108, issue: 11, artnum: 4
  title: The Logic of Insurgent Electoral Violence
  article_url: https://www.aeaweb.org/articles?id=10.1257/aer.20170416
  has_data: TRUE, data_url: https://www.aeaweb.org/doi/10.1257/aer.20170416.data, size: 56, unit: MB
  files_txt: NA, downloaded_file: aer_vol_108_issue_11_article_4.zip, num_authors: NA, file_info_stored: TRUE, file_info_summarized: NA
  readme_file: aer/2018/aer_108_11_4/READ_ME.pdf
  abstract: Competitive elections are essential to establishing the political legitimacy of democratizing regimes. We argue that insurgents undermine the state’s mandate through electoral violence. We study insurgent violence during elections using newly declassified microdata on the conflict in Afghanistan. Our data track insurgent activity by hour to within meters of attack locations. Our results suggest that insurgents carefully calibrate their production of violence during elections to avoid harming civilians. Leveraging a novel instrumental variables approach, we find that violence depresses voting. Collectively, the results suggest insurgents try to depress turnout while avoiding backlash from harming civilians. Counterfactual exercises provide potentially actionable insights for safeguarding at-risk elections and enhancing electoral legitimacy in emerging democracies.

The table files_summary contains information about code, data and archive files for each article

dbGet(db, "files_summary",n = 6)
id             file_type  num_files  mb        is_code  is_data
aejapp_1_1_10  do                 9  0.009427  TRUE     FALSE
aejapp_1_1_10  dta                2  0.100694  FALSE    TRUE
aejapp_1_1_3   do                19  0.103628  TRUE     FALSE
aejapp_1_1_4   csv                1  0.024872  FALSE    TRUE
aejapp_1_1_4   dat                1  7.15491   FALSE    TRUE
aejapp_1_1_4   do                 9  0.121618  TRUE     FALSE

Let us now analyse which share of articles uses Stata, R, Python, Matlab or Julia and how the usage has developed over time.

Since our data sets are small, we can just load the two tables into memory and work with dplyr there. Alternatively, you could use SQL commands or work with dplyr directly on the database connection.
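For instance, a sketch of the dplyr-on-the-database variant could look like this (it requires the dbplyr backend, which translates the dplyr verbs into SQL):

library(dplyr)
# Lazy table reference; nothing is fetched until collect()
fs_tbl = tbl(db, "files_summary")
fs_tbl %>%
  count(file_type, sort = TRUE) %>%
  collect()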

library(dplyr)
articles = dbGet(db,"article")
fs = dbGet(db,"files_summary")

Let us now compute the shares of articles that contain one of the file types we are interested in.

# Number of articles with an accessible data & code supplement
n_art = n_distinct(fs$id)

# Count articles by file types and compute shares
fs %>% group_by(file_type) %>%
  summarize(count = n(), share=round((count / n_art)*100,2)) %>%
  # note that all file extensions are stored in lower case
  filter(file_type %in% c("do","r","py","jl","m")) %>%
  arrange(desc(share))
file_type  count  share
do          2606  70.55
m            852  23.06
r            105   2.84
py            32   0.87
jl             2   0.05
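The same shares could also be computed directly in the database with a single SQL query; here is a sketch with plain DBI that mirrors the dplyr pipeline above:

dbGetQuery(db, "
  SELECT file_type,
         COUNT(*) AS count,
         ROUND(100.0 * COUNT(*) / (SELECT COUNT(DISTINCT id) FROM files_summary), 2) AS share
  FROM files_summary
  WHERE file_type IN ('do', 'r', 'py', 'jl', 'm')
  GROUP BY file_type
  ORDER BY share DESC
")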

Roughly 70% of the articles have Stata do files and almost a quarter have Matlab m files. Open source statistical software does not yet seem very popular among economists: less than 3% of the articles have R code files, Python is below 1%, and only 2 articles have Julia code.

This dominance of Stata in economics never ceases to surprise me, in particular when I happen to open the Stata do-file editor and compare it with RStudio… But then, I am not an expert in writing empirical economic research papers – I just like R programming and rather passively consume empirical research. For writing empirical papers it probably is convenient that in Stata you can add a robust or cluster option to almost every type of regression in order to quickly get the economists’ standard standard errors…
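To be fair, robust or clustered standard errors are also just a line in R nowadays, e.g. with the estimatr package (a sketch with made-up names dat, wage, schooling and firm_id):

library(estimatr)
# Heteroskedasticity-robust (Stata-style HC1) and cluster-robust standard errors
lm_robust(wage ~ schooling, data = dat, se_type = "HC1")
lm_robust(wage ~ schooling, data = dat, clusters = firm_id)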

For teaching empirical economics with R, the dominance of Stata is not necessarily bad news. It means that there are a lot of studies that students can replicate in R. Such replications would be considerably less interesting if the original code of the articles were already written in R.

Let us finish by having a look at the development over time…

sum_dat = fs %>% 
  left_join(select(articles, year, id), by="id") %>%
  group_by(year) %>%
  mutate(n_art_year = n()) %>%
  group_by(year, file_type) %>%
  summarize(
    count = n(),
    share=round((count / first(n_art_year))*100,2)
  ) %>%
  filter(file_type %in% c("do","r","py","jl","m")) %>%
  arrange(year,desc(share))  
head(sum_dat)
year  file_type  count  share
2005  do            25  22.12
2005  m             10   8.85
2006  do            24  20.87
2006  m             13  11.3
2007  do            24  19.35
2007  m             16  12.9
library(ggplot2)
ggplot(sum_dat, aes(x=year, y=share, color=file_type)) +
  geom_line(size=1.5) + scale_y_log10() + theme_bw()

Well, maybe there is a little upward trend for the open source languages, but not too much seems to have happened over time so far…
