How to distribute data with your R package
Distributing data with an R package can be crucial for the package, or even its only goal. Data can show what a function can accomplish with a dataset, show how a package can help tidy a messy data format, serve to test the package or for teaching purposes, or let users directly use the bundled data instead of having to fetch and clean it. How to provide data with or for your package is a recurring theme in R package development channels. In this post, we present various ways to distribute data with or for an R package, depending on the data use case and on its size.
Thanks to the R connoisseurs Thomas Vroylandt, Sébastien Rochette and Hugo Gruson for providing some inspiration and resources for this post![1]
Data in your package
Sometimes the data can be vendored[2], i.e. live in your package source, and even be built into and installed with the package.
An excellent overview of the different cases is provided in the R packages book by Hadley Wickham and Jenny Bryan; without forgetting the reference “Writing R Extensions”.
In usethis, as explained in the R packages book, there are helpers for creating package data.
Data for whom?
- If the data is for the user to load and use in examples or their own code, you’re looking for exported data. Since it’s exported, it has to be documented. For an example, see the babynames data/ folder and R/data.R file.
- If the data is for your functions to use internally, you’re after R/sysdata.rda or an internal function.
  - R/sysdata.rda. E.g. that’s where mimetypes are stored in the magick package.
  - An internal function. E.g. how do you store the languages your pluralize and singularize functions support? It could be an internal function whose advantage is to be readable (as opposed to seeing a sysdata.rda in a repo) and whose downside is that it might be less natural to generate it with code. Example:

        my_languages <- function() {
          c("en", "fr")
        }

        # elsewhere in the package
        # ... some code ...
        my_languages()
        # ... more code ...

- If the data is to showcase how to tidy raw data, and you want the user to be able to access it, you’re after… raw data living in inst/extdata.
- If you want to keep the raw data used to create your exported or internal data, which you should, you’re after the data-raw folder. See the data-raw folder of the babynames package, in particular its R scripts. Note that data-raw has to be in the .Rbuildignore file (but usethis would help you with that).
- For data used in tests, i.e. fixtures, you could create a folder under e.g. tests/testthat/ whose content would be found using the testthat::test_path() function.
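To make the exported and internal cases above more concrete, here is a minimal base R sketch of what a script under data-raw/ could end up doing. The package root (a temporary directory here, so the sketch runs anywhere), the object names and the values are all made up for illustration. In a real package you would more likely call usethis::use_data(), which wraps a similar save() call.

```r
# Hypothetical package root: a temporary directory, so the sketch runs anywhere.
pkg_root <- file.path(tempdir(), "mypkg")
dir.create(file.path(pkg_root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(pkg_root, "R"), recursive = TRUE, showWarnings = FALSE)

# Pretend these objects result from cleaning raw files in data-raw/ (made-up values).
my_dataset <- data.frame(language = c("en", "fr"), n_rules = c(12L, 8L))
supported_languages <- c("en", "fr")

# Exported data: one object per .rda file under data/,
# roughly what usethis::use_data(my_dataset) would do.
save(my_dataset,
     file = file.path(pkg_root, "data", "my_dataset.rda"),
     compress = "bzip2")

# Internal data: objects only your functions see, all stored in R/sysdata.rda,
# roughly what usethis::use_data(supported_languages, internal = TRUE) would do.
save(supported_languages,
     file = file.path(pkg_root, "R", "sysdata.rda"),
     compress = "bzip2")

# Exported data must be documented, e.g. in R/data.R with roxygen2 comments:
# #' Made-up language data
# #' @format A data frame with 2 rows and 2 columns.
# "my_dataset"
```

Keeping this code in a script under data-raw/ means the .rda files can be regenerated whenever the raw data changes.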
Data as small as possible
As “Writing R Extensions” underlines for data under data/: “If your package is to be distributed, do consider the resource implications of large datasets for your users: they can make packages very slow to download and use up unwelcome amounts of storage space, as well as taking many seconds to load.”
So what do you do?
- You can compress the data, internal or external. Your friend is tools::resaveRdaFiles(); see ?tools::resaveRdaFiles, whose help page is titled “Report on Details of Saved Images or Re-saves them”.
- You can refer to the next section of this post and have data live outside of your package!
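As a rough illustration of the compression option, the sketch below saves a made-up table without compression in a temporary directory, then lets tools::resaveRdaFiles() pick a compression method. tools::checkRdaFiles(), documented on the same help page, reports the file size and compression type before and after. The object and file names are invented for the example.

```r
# Made-up data, saved uncompressed in a temporary directory for the demo.
rda_path <- file.path(tempdir(), "big_table.rda")
big_table <- data.frame(
  id = seq_len(1e5),
  group = rep(letters, length.out = 1e5)
)
save(big_table, file = rda_path, compress = FALSE)

# Report size and compression type before re-saving.
before <- tools::checkRdaFiles(rda_path)

# Re-save in place; "auto" tries several compression methods and keeps the best.
tools::resaveRdaFiles(rda_path, compress = "auto")

# Report again to compare.
after <- tools::checkRdaFiles(rda_path)

print(before[, c("size", "compress")])
print(after[, c("size", "compress")])
```

In a package you would point both functions at the data/ folder or at R/sysdata.rda rather than at a temporary file.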
For data you use in tests, i.e. fixtures, you can think about how to make it as small as possible whilst still providing enough bug discovery potential. In other words, save minimal test cases, not a ton of data. This will make your package source lighter.
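One way to build such a minimal fixture is to keep only a handful of representative rows of a larger dataset. The sketch below uses the built-in iris data as a stand-in for "a ton of data" and keeps one row per species. The folder and file names are hypothetical, and the folder is created under a temporary directory so that the sketch runs anywhere.

```r
# Hypothetical fixture folder; in a real package: tests/testthat/fixtures/.
fixture_dir <- file.path(tempdir(), "tests", "testthat", "fixtures")
dir.create(fixture_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for a large dataset: the built-in iris data (150 rows).
full_data <- datasets::iris

# Keep the first row of each species: enough to exercise per-group code paths.
small_fixture <- full_data[!duplicated(full_data$Species), ]

# Store the fixture as a single small file.
saveRDS(small_fixture, file.path(fixture_dir, "iris_small.rds"))

# In a test file, the fixture would then be read with something like:
# readRDS(testthat::test_path("fixtures", "iris_small.rds"))
```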
Data outside of your package
Now, sometimes the data is too big to be in your package, or follows a different development cycle i.e. is updated more or less often. What can you do about that? Well, have data live somewhere else.
Data packages
Yes, this subsection makes the point of having data live inside another package.
You could develop companion packages to go with your package, that hold data.
A data package is user-friendly in the sense that installing it saves the data on the machine and makes it available.
This is the setup of rnaturalearth, for which one of the companion packages, rnaturalearthhires, is not hosted on CRAN.
For #rnaturalearth I made 3 packages, 2 on CRAN, 1 not, rnaturalearth has methods and small example data, rnaturalearthdata has medium res data, rnaturalearthhires has hires data and is hosted by @rOpenSci because too big for CRAN.
— @southmapr (@southmapr) May 19, 2020
For a clear explanation of a way to host data packages outside of CRAN, refer to the R Journal article by Brooke Anderson and Dirk Eddelbuettel, “Hosting Data Packages via drat: A Case Study with Hurricane Exposure Data”.
You could also check out the datastorr package (not on CRAN) for integrating data packages with GitHub.
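When the data package is only an optional companion, your main package needs to cope with it being absent. A minimal sketch of one common pattern, assuming the data package is merely suggested by your package: check for it with requireNamespace() and tell the user how to install it. The package name "mydatapkg" and the repository URL are placeholders, not real resources.

```r
# Hypothetical names: "mydatapkg" and the repository URL are placeholders.
check_data_package <- function(pkg = "mydatapkg",
                               repo = "https://example.org/drat") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Fail with an actionable message rather than a cryptic "not found" error.
    stop(
      sprintf(
        "Package '%s' is needed for this function.\nInstall it with: install.packages('%s', repos = '%s')",
        pkg, pkg, repo
      ),
      call. = FALSE
    )
  }
  invisible(TRUE)
}
```

Functions that need the large data would call check_data_package() first, so that the rest of the package keeps working without the companion installed.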
Other data services
Not using a data package also helps you make the data available e.g. as CSV to anyone including, gasp, Python users.
- You could use an existing infrastructure: GitHub releases (with the piggyback package, whose documentation includes a thorough comparison with other approaches such as git-LFS), Amazon S3 (using the pins package?), etc.[3] You could make use of scientific data repositories (DataONE, Zenodo, OSF, Figshare…).
- You could… write your own web API[4]? Like the web APIs powering R-hub’s own pkgsearch, rversions; or the CRAN checks API. The web APIs have value of their own, and you can write wrapper functions in your package that thus becomes a data access package.
This is all good so far, but when the data does not live in a data package, your package should still help download and save the data for a seamless workflow, e.g. via a function. What can this function do with the data? It could of course return the data as an object, but also save it locally for easier re-use in future analyses.
- The data could be cached in an app dir, which we explained in a previous blog post, using the rappdirs or hoardr package;
- For other approaches, your package could take a dependency on a package aimed at data storage and retrieval such as the aforementioned piggyback and pins packages, the bowerbird package, the rdataretriever package.
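Here is a minimal sketch of such a download-and-cache function, assuming the data is a CSV file reachable at some URL. To stay dependency-free it swaps in base R’s tools::R_user_dir() (available from R 4.0.0) for the rappdirs and hoardr packages mentioned above; rappdirs::user_cache_dir() would play the same role. The function name, the "mypkg" package name and the URL in the usage note are all hypothetical.

```r
# Sketch of a download-and-cache helper. "mypkg" is a placeholder package name.
get_dataset <- function(url,
                        cache_dir = tools::R_user_dir("mypkg", which = "cache"),
                        refresh = FALSE) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache_dir, basename(url))

  # Only download when there is no local copy, or on explicit request.
  if (refresh || !file.exists(dest)) {
    utils::download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  }

  # Return the data as an object, read from the cached copy.
  utils::read.csv(dest)
}
```

A user would call something like get_dataset("https://example.org/data/my_table.csv") (a made-up URL): the first call downloads the file, later calls read the cached copy, and refresh = TRUE forces a new download.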
Conclusion
In this post we went over different setups allowing you to distribute data with or for your R package: some standard locations in your package source (inst/extdata/, data/, R/sysdata.rda); locations outside of your package (using drat, git-LFS, GitHub releases via piggyback, a web API you’d build yourself), that your package would know how to access, and potentially save locally (in an app dir, or using a tool like bowerbird).
Do you have a “data and R package” setup you’d like to share?
How do you document the data you share, when it’s bigger than a small table?[5]
Please comment below!
[1] In a conversation in the friendly French-speaking R Slack workspace – where we’d write connaisseurs, not connoisseurs. If you want to join us, follow the invitation link. À bientôt !
[2] Now that you know the word vendor, “to bundle one’s own, possibly modified version of dependencies with a standard program”, you can use your search engine to find and enjoy debates around vendoring or not. You’re welcome.
[3] Thanks to Carl Boettiger for useful insight, including reminding me of the pins package.
[4] Some caveats were noted in a comment by Carl Boettiger on Twitter.
[5] E.g. the dataspice package aims at creating lightweight schema.org descriptions of datasets.