Reproducibility: A cautionary tale from data journalism
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Timo Grossenbacher, data journalist with Swiss Radio and TV in Zurich, had a bit of a surprise when he attempted to recreate the results of one of the R Markdown scripts published by SRF Data to accompany their data journalism story about vested interests of Swiss members of parliament. Upon re-running the analysis in R last week, Timo was surprised when the results differed from those published in August 2015. There was no change to the R scripts or data in the intervening two-year period, so what caused the results to be different?
The version of R Timo was using had been updated, but that wasn't the root cause of the problem. What had also changed was the version of the dplyr package used by the script: version 0.5.0 now, versus version 0.4.2 then. For some unknown reason, a change in the dplyr package in the intervening package caused some data rows (shown in red above) to be deleted during the data preparation process, and so the results changed.
Timo was able to recreate the original results by forcing the script to run with package versions as they existed back in August 2015. This is easy to do with the checkpoint package: just add a line like
library(checkpoiint); checkpoint("2015-08-11")
to the top of your R script. We have been taking daily snapshots of every R package on CRAN since September 2014 to address exactly this situation, and the checkpoint package makes it super-easy to find and install all of the packages you need to make your script reproducible, without changing your main R installation or affecting any other projects you may have. (The checkpoint package is available on CRAN, and also included with all editions of Microsoft R.)
I've been including a call to checkpoint
on the top of most of my R scripts for several years now, and it's saved me from failing scripts many times. Likewise, Timo has created a structure and process to support truly reproducible data analysis with R, and it advocates using the checkpoint to manage package versions. You can find a description of the process here: A (truly) reproducible R workflow, and find the template in Github.
By the way, SRF Data — the data journalism arm of the national broadcaster in Switzerland — has published some outstanding stories over the past few years, and has even been nominated for Data Journalism Website of the Year. At the useR!2017 conference earlier this year, Timo presented several fascinating insights into the data journalism process at SRF Data, which you can see in his slides and talk (embedded below):
Timo Grossenbacher: This is what happens when you use different package versions, Larry! and A (truly) reproducible R workflow
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.