Beyond Functions: How to Enrich an R Package with Data
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
You can read the original post in its original format on Rtask website by ThinkR here: Beyond Functions: How to Enrich an R Package with Data
As package developers, we often focus on custom functions, their documentation, and unit testing. To provide a more complete user experience, it can be helpful to include data within the package. Here’s everything you need to know!
If you’re not familiar with unit testing, we recommend reading our article on the subject: : https://rtask.thinkr.fr/expectations-verified-dive-into-the-world-of-unit-tests-with-expect_/
Short on time? Here’s the gist
In this article, we present the different ways to incorporate data into an R package. We cover the three directories used for storing data and explain how to access them, whether you are a package user or a developer. Finally, we discuss best practices for documenting these data.
Why Include Data in a Package?
Including data in a package can be useful for several reasons:
- Simplifying package usage: The data included in the package is directly accessible to users.
- Facilitating reproducibility: The data allows users to replicate the examples provided in the documentation.
- Enhancing unit testing: The included data can be used for testing the package functions.
- Sharing information: Distributing documentation, scientific articles, code samples, etc.
Here, “data” should be interpreted broadly. It includes tabular data typically used in R (e.g., CSV or XLSX files, or data.frame
objects), but also images, configuration files, articles, code samples, etc.
Data Directories in an R Package
There are three directories in an R package used to store data: data-raw/
, data/
, and inst/
, each serving a specific purpose and catering to different audiences (developers vs. users).
The data-raw/
and data/
pair
The goal here is to make data available to package users, which can be used by the package’s functions or included in the documentation examples. These data will be represented as R objects (e.g., data.frame
, list, etc.).
The inst/
Directory
This directory allows you to store files without format restrictions: tabular files, code sample scripts, notebooks in Rmd/Qmd format, PDF documentation, etc. There are no limits.
Using data-raw/
+ data/
Use case: You want to make data available to package users that can be used by the package’s functions. The goal is to provide native access from the package’s functions. You’re probably familiar with preloaded datasets in R like mtcars
or iris
; this follows the same principle.
data-raw/
The data-raw/
folder is used to store scripts for preparing the data. Files in this folder are not included in the final package installed on the user’s computer but contain the code needed to generate the datasets that will later be included in the package.
data/
Once prepared in data-raw/
, the data is stored in the data/
folder. Files in this folder are included in the final package and are accessible to users. The files are stored in .rda
format and are loaded when a user runs library(mypackage)
.
Example
1 -Create the data-raw/
folder using the command usethis::use_data_raw("my_dataset_demo")
. This command creates a my_dataset_demo.R
file in the data-raw/
folder.
2- Prepare the dataset in the my_dataset_demo.R
file:
# Create a sample of the "starwars" dataset from the dplyr package # See https://github.com/tidyverse/dplyr/tree/main/data-raw and https://github.com/tidyverse/dplyr/tree/main/data library(dplyr) library(readr) starwars_raw <- read_csv("data-raw/starwars.csv") starwars_sample <- starwars_raw |> sample_n(size = 10) usethis::use_data(starwars_sample, overwrite = TRUE)
3- After running the command usethis::use_data(starwars_sample, overwrite = TRUE)
, you’ll see a file named starwars_sample.rda
in the data/
folder.
4- There’s still some work to do: now we need to document the dataset. For this, we will use the {checkhelper}
package.
This creates a doc_starwars_sample.R
file in the package R/
folder. This file contains the dataset’s documentation. You can edit this file to add more information about the dataset.
#' starwars_sample #' #' Description. #' #' @format A data frame with 10 rows and 14 variables: #' \describe{ #' \item{ name }{ character name } #' \item{ height }{ numeric } #' \item{ mass }{ numeric } #' \item{ hair_color }{ character } #' \item{ skin_color }{ character } #' \item{ eye_color }{ character } #' \item{ birth_year }{ numeric } #' \item{ sex }{ character } #' \item{ gender }{ character } #' \item{ homeworld }{ character } #' \item{ species }{ character } #' \item{ films }{ character } #' \item{ vehicles }{ character } #' \item{ starships }{ character } #' } #' @source Source "starwars_sample"
While the overall structure of the file should be preserved, you can edit the description, format, and source information as needed.
Finally, generate the LaTeX documentation using the command devtools::document()
or attachment::att_amend_desc()
.
Once your package is installed with (remotes::install_local()
) and loaded, you can access the dataset using the command data("starwars_sample")
.
Using inst/
Use case: You want to store files intended for use only in unit tests or to share additional documentation (e.g., a scientific article).
Example
1- Create the inst/
folder at the package root: dir.create(here::here("inst"))
.
2- Place the desired files in the folder.
3- Install the package: remotes::install_local()
.
4- The files are now accessible using a special function: system.file()
, which points to the root of the inst/
directory. For example, to access a file named article.pdf
in the inst/
folder, you would use system.file("article.pdf", package = "mypackage")
. If the file is in a subfolder called “doc,” you would use system.file("doc", "article.pdf", package = "mypackage")
.
Note: system.file()
does not read a file; it only returns the file path.
Conclusion
You now know all about incorporating data into an R package. You’ve learned how to store data in the data-raw/
, data/
, and inst/
directories and make it accessible from the package functions. You’ve also learned how to document these data to make them usable for package users.
Whether the data is intended for users or developers, you now have all the tools to enrich your R package with data.
To go further
For those who want to go further and master the art of creating robust packages, including best practices for unit testing, check out our training
This post is better presented on its original ThinkR website here: Beyond Functions: How to Enrich an R Package with Data
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.