The Package: learning how to build an R package

[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers.]

I recently made my first R package and was asked how I did it. The answer of course was: I searched, read, and stumbled around until it was done. But having gone through the process I figured it was worthwhile summarising what I did and what I found tricky.

First off, there are a ton of resources out there that describe how to go about building a package. Here are a few that I found useful:

Why build a package?

If you are using R and you have developed some scripts and useful functions, you might be wondering why build a package? There are quite a few reasons. One is to make the functions available to use for other projects. Another is to force yourself to document these functions in a more rigorous way. In my case I needed to build an installable package so that students taking a class I teach could easily install my code and run a practical session.

One way to think of a package is as a robust set of scripts that are well documented. In fact, a package is probably best built out of a bunch of scripts that are already working and ready to be “packaged” up for release. Once you have some working code, the best approach is to use RStudio to make the package.

Very quick summary

Two libraries are required for making a package.

install.packages("devtools")
install.packages("roxygen2")

In RStudio, open File and select New Project and pick the option for R Package. Now name your project and pick where it will be saved.

RStudio will now initialise the project with a structure that will work as a package. Briefly, besides the .Rproj file for the RStudio project, you should see a folder called R (this is where all your .R files containing your package functions should go). You should also see a NAMESPACE file and a DESCRIPTION file. In the DESCRIPTION file you can store information about your project. The NAMESPACE file is auto-generated by roxygen and should not be edited by hand.

Once you have added your scripts into the R directory, you can get roxygen to format the documentation. Please refer to the links above for more information, but essentially a .R file might have one function, and that function comes after several lines of roxygen comments. These are lines that start with #'.
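As a minimal sketch of what such a file might contain (rolling_mean() is a made-up function for illustration, not one from my package):

```r
#' Compute a rolling mean of a numeric vector
#'
#' @param x A numeric vector.
#' @param k Window width (a positive whole number no larger than length(x)).
#' @return A numeric vector of length `length(x) - k + 1`.
#' @examples
#' rolling_mean(c(1, 2, 3, 4), k = 2)
#' @export
rolling_mean <- function(x, k = 3) {
  stopifnot(is.numeric(x), k >= 1, k <= length(x))
  # Cumulative sums let each window total be a difference of two sums
  cs <- c(0, cumsum(x))
  (cs[(k + 1):length(cs)] - cs[1:(length(cs) - k)]) / k
}
```

The title, @param, @return and @examples lines become the help page, and @export is what makes the function available to users of the package.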

In Project Options in RStudio you can select Build Tools and check the box for Create Documentation. This means that roxygen will read the comments in all the .R files in R/ and generate an .Rd (R Documentation) file in man/ for each. It is also possible to build a PDF manual for your package that looks exactly like those you have probably encountered for other packages on CRAN.

That was a very quick summary of how to make the core parts of a package. Here are a few things I ran into and how I solved them.

Putting the package onto GitHub

I did this part the wrong way around. Next time, I will develop the package on GitHub in a private repo from the start. Instead this time, I got a very basic working version going locally (without version control) and then uploaded the project as a GitHub repo (outside of RStudio). This allowed me to distribute a bare-bones version of the package to the class using devtools::install_github()

When it came to adding more features to the package and developing it further, I then cloned the repo and used that version for further development.

pkgdown

I used pkgdown to make a GitHub pages documentation website. This was ridiculously simple to do and again, the links above helped a lot.

library(pkgdown)
usethis::use_pkgdown()
pkgdown::build_site()
usethis::use_pkgdown_github_pages()

The repo can be configured so that whenever a commit is pushed to the remote, GitHub Actions are triggered, which run R CMD check and rebuild your documentation website.

(Image: will it, won’t it… waiting to compile)

One worry I had was whether it is possible to have multiple github.io sites (it is) or whether I should host the documentation at a custom URL. In the end, using GitHub Pages was so simple that I don’t know why I worried about this step at all.

R CMD check

Before pushing any changes, it is a good idea to run R CMD check in RStudio first and fix any problems. This is especially true if your package is already in a public repo (which mine was) and someone could potentially install it in a broken state. In the Build tab, click Check to trigger R CMD check. This checks that your package can be installed and run on someone else’s computer. It checks that all the code examples in your roxygen comments will run too.

A tip here: make those code examples as simple as possible. In my case, the package reads in a large file and does a lot of number crunching. This meant I had to wait a few minutes for each check. Simplifying this step would have saved me a lot of time. The checking stage can be halted if there is a warning or note that you want to fix, rather than waiting for the whole thing to complete (or crash).

I would recommend: developing the code in your cloned repo. Click Install in the Build tab and then run your test code. When it works correctly, add it as a function to the repo and commit/push. This was certainly easier than the way I initially worked which was to have a separate RStudio project where I installed the GitHub version of the package and then developed code in that project before moving it to the “real package”. I thought this would be preferable and would keep the “real code” clean, but it quickly got very annoying.

A common message I got during R CMD check (it is reported as a NOTE) was:

no visible binding for global variable [variable name]

This message usually meant that I had used a variable name (usually a column name in a data frame) without defining it. This is not usually a problem when you are working with objects in the global environment and just running a script. However, in the context of a package, R CMD check needs to know what you mean by [variable name]. There are many solutions to this, but the simplest I found was:

id <- displacement <- time <- NULL

added as the first line of the function, where id, displacement and time were all variables without a binding. There is disagreement over the best way to handle this situation, but this solution was intuitive for me, and it worked.
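To show where that line sits, here is a small sketch in base R. The function summarise_tracks() and its data frame layout are hypothetical, chosen only because they use the same three column names:

```r
# Hypothetical function: mean displacement per id, ignoring time zero.
# Assumes df has columns id, time and displacement.
summarise_tracks <- function(df) {
  # Bind the column names locally so the checker sees a definition for them
  id <- displacement <- time <- NULL
  # subset() evaluates these names inside df, so the NULLs are never used
  kept <- subset(df, time > 0, select = c(id, displacement))
  aggregate(displacement ~ id, data = kept, FUN = mean)
}

# One commonly suggested alternative is a single top-level call in the package:
# utils::globalVariables(c("id", "displacement", "time"))
```

The column lookups still resolve inside the data frame first, so the NULL assignments do not change the result.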

importFrom

You will almost certainly use functions from other packages and not just base R. This means you need to import the whole package or import individual functions from it. It is possible to add these as roxygen comments in the functions that need them; however, this can get out of control if you have lots of functions. Instead, use a file called myPackage-package.R (where myPackage is the name of your package). This file sits in R/ and looks something like this:

#' @importFrom utils write.csv
#' @importFrom XML xmlParse
#' @importFrom XML xmlDoc
#' @importFrom XML xpathApply
#' @importFrom zoo rollmean
#' @import ggplot2
#' @import dplyr
#' @import patchwork
NULL
#> NULL

This way you can keep track of what has been loaded. I also collated the libraries and documented them in DESCRIPTION although I admit that I do not fully understand what is required here and what is best practice.

Imports:
  doParallel,
  dplyr,
  foreach,
  ggplot2,
  ggforce,
  patchwork,
  reshape2,
  utils,
  XML,
  zoo
Suggests:
  knitr,
  rmarkdown
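For what it is worth, the usual reading of these two fields is that packages listed under Imports are installed along with yours, whereas packages under Suggests are optional and should be checked for before they are used. A minimal sketch of that pattern follows; render_report() is a hypothetical function, not part of my package:

```r
# Hypothetical function that relies on a package listed under Suggests
render_report <- function(input) {
  # Suggested packages may not be installed, so check before calling them
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    stop("Package 'rmarkdown' is needed for this function. Please install it.",
         call. = FALSE)
  }
  rmarkdown::render(input)
}
```

Functions from packages under Imports can be called directly (or as pkg::fun) without this kind of guard.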

Logically, it is best to keep the number of dependencies as low as possible. This is in case some future change in those packages breaks your code. However, with popular/established/well-developed packages you should be fairly safe.

Organising functions

This is slightly heading into code style territory, but I found that in general: one function per .R file was a good rule to follow. However, it is possible to have more than one function in a file. I decided to collate miscellaneous functions in one file and all general plotting functions in another, but apart from that each .R file contained one main function.

Some of these main functions had supporting functions which are only used by the main function. If you are used to programming in other languages, the supporting functions I am describing are similar to STATIC functions. How to handle these? Firstly, they should not get the #' @export roxygen comment, otherwise they will be made generally available and you probably don’t want this. However, even without this tag they will still get documented and could therefore clog up the Reference section built by pkgdown. So, the solution I found was to add #' @keywords internal to the comments. This hid the function from the documentation whilst leaving it available for use internally in my package.
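A sketch of how a main function and its supporting function might be laid out in one .R file (both function names here are invented for illustration):

```r
#' Rescale a numeric vector to the range 0 to 1
#'
#' Supporting function for scale_tracks(). No @export tag, so it stays
#' private, and @keywords internal keeps it out of the reference index.
#'
#' @param x A numeric vector.
#' @return A numeric vector the same length as `x`.
#' @keywords internal
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

#' Rescale the displacement column of a data frame
#'
#' @param df A data frame with a numeric column called displacement.
#' @return `df` with displacement rescaled to the range 0 to 1.
#' @export
scale_tracks <- function(df) {
  df$displacement <- rescale01(df$displacement)
  df
}
```

Only scale_tracks() is visible to someone who loads the package; rescale01() can still be called from any other function inside it.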

How do I make some example data available?

The purpose of my package was to read in some data and then process it. Therefore I needed to make some data available so that people can try out the package, but also so the checks work on GitHub etc. When working locally, the roxygen examples can point to a file on your hard drive, but this will obviously fail when the code is run on a remote server.

I struggled for an answer to this one. Most packages either do not require test data or they use mtcars or one of the other built-in datasets. Other solutions talk about using an .rda file (an R object) rather than raw data.

As far as I could tell, the “correct” solution is to put an example file in inst/extdata/. Bear in mind that this file will be installed onto the user’s computer, so it shouldn’t contain sensitive info and, importantly, should not be huge. R CMD check will give a warning about the size of the file anyhow. This file can then be accessed in your code (or by the user) by:

system.file("extdata", "nameOfFile.ext", package="myPackage")
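One thing to know is that system.file() returns an empty string when the file (or the package) cannot be found, so it can be worth wrapping the call. A small sketch, where example_file() is a hypothetical helper and the file and package names are the same placeholders as above:

```r
# Hypothetical helper: return the path to a file shipped in inst/extdata/
example_file <- function(name = "nameOfFile.ext", package = "myPackage") {
  path <- system.file("extdata", name, package = package)
  # An empty string means nothing was found at that location
  if (!nzchar(path)) {
    stop("Cannot find '", name, "' in package '", package, "'.", call. = FALSE)
  }
  path
}
```

A roxygen example can then call something like example_file() rather than pointing at a file on your own hard drive.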

I toyed with the idea of making data available at some other location, but this is really the only way that a user can install your package and then be offline and work with your example file.

In case you are wondering, if the data is an R object e.g. .rda file, the convention is to place it in data/ rather than inst/extdata/

Conclusion

I tried to give an overview of the roadblocks that I had to overcome to make my first R package. I’m sure there are better approaches out there to tackle some of these things. I learnt a lot by going through this process, and hopefully some of you will find this info useful too.

The post title comes from “The Package” a track on Castor’s eponymous 1995 album.
