Site icon R-bloggers

How to distribute data with your R package

[This article was first published on Posts on R-hub blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Distributing data with an R package can be crucial for the package or even the only goal of a package: to show what a function can accomplish with a dataset; to show how a package can help tidy a messy data format; to test the package; for teaching purposes; to allow users to directly use the bundled data instead of having to fetch and clean the data. Now, how to provide data with/for your package is a recurring theme in R package development channels. In this post, we shall present various ways to distribute data with/for an R package, depending on the data use case and on its size.

Thanks to the R connoisseurs Thomas Vroylandt, Sébastien Rochette and Hugo Gruson for providing some inspiration and resources for this post! ???? 1

Data in your package

Sometimes the data can be vendored2 i.e. live in your package source, and even built into and installed with the package. An excellent overview of the different cases is provided in the R packages book by Hadley Wickham and Jenny Bryan; without forgetting the reference “Writing R Extensions”. In usethis, as explained in the R packages book, there are helpers for creating package data.

Data for whom?

my_languages <- function() {
  c("en", "fr")
}
# elsewhere
blabla code my_languages() blabla code

Data as small as possible

As “Writing R Extensions” underlines, for data under data/ and, “If your package is to be distributed, do consider the resource implications of large datasets for your users: they can make packages very slow to download and use up unwelcome amounts of storage space, as well as taking many seconds to load.". So what do you do?

?tools::resaveRdaFiles
Report on Details of Saved Images or Re-saves them

For data you use in tests, i.e. fixtures, you can think about how to make it as small as possible whilst still providing enough bug discovery potential. In other words, save minimal test cases, not a ton of data. This will make your package source lighter.

Data outside of your package

Now, sometimes the data is too big to be in your package, or follows a different development cycle i.e. is updated more or less often. What can you do about that? Well, have data live somewhere else.

Data packages

Yes, this subsection makes the point of having data live inside another package. ???? You could develop companion packages to go with your package, that hold data. A data package is user-friendly in the sense that installing it saves the data on the machine and makes it available. This is the setup of rnaturalearth, for which one of the companion packages, rnaturalearthhires, is not hosted on CRAN.

For #rnaturalearth I made 3 packages, 2 on CRAN, 1 not, rnaturalearth has methods and small example data, rnaturalearthdata has medium res data, rnaturalearthhires has hires data and is hosted by @rOpenSci because too big for CRAN.

— @southmapr (@southmapr) May 19, 2020

For a clear explanation of a way to host data packages outside of CRAN, refer to the R Journal article by Brooke Anderson and Dirk Eddelbuettel, “Hosting Data Packages via drat: A Case Study with Hurricane Exposure Data”.

You could also check out the datastorr package (not on CRAN) for integrating data packages with GitHub.

Other data services

Not using a data package also helps you make the data available e.g. as CSV to anyone including, gasp, Python users. ????

This is all good so far but with non data packages your package should still help download and save the data for a seamless workflow, e.g. via a function. What can this function do with the data? It could of course returns the data as an object, but also save it locally for easier re-use in future analyses.

Conclusion

In this post we went over different setups allowing you to distribute data with or for your R package: some standard locations in your package source (inst/extdata/, data/, R/sysdata.rda); locations outside of your package (using drat, git-LFS, GitHub releases via piggyback, a web API you’d build yourself), that your package would know how to access, and potentially save locally (in an app dir, or using a tool like bowerbird). Do you have a “data and R package” setup you’d like to share? How do you document the data you share, when it’s bigger than a small table?5 Please comment below!

< section class="footnotes" role="doc-endnotes">
  1. In a conversation in the friendly French-speaking R Slack workspace – where we’d write connaisseurs, not connoisseurs. If you want to join us, follow the invitation link. À bientôt ! ↩︎

  2. Now that you know the word vendor, ” to bundle one’s own, possibly modified version of dependencies with a standard program.", you can use your search engine to find and enjoy debates around vendoring or not. You’re welcome. ↩︎

  3. Thanks to Carl Boettiger for useful insight, including reminding me of the pins package. ↩︎

  4. Some caveats were noted in a comment by Carl Boettiger on Twitter. ↩︎

  5. E.g. the dataspice package aims at creating lightweight schema.org descriptions of datasets. ↩︎

To leave a comment for the author, please follow the link and comment on their blog: Posts on R-hub blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.