The making of datasauRus

[This article was first published on R – Locke Data, and kindly contributed to R-bloggers.]

Around 8:30pm I saw this tweet and duly retweeted it.

It turns out awesome folks, George and Justin, had made a process whereby they can generate different distributions of points that retain the same summary statistics. They used this process for making some friends for Dino the Datasaurus who was created by Alberto Cairo. They made the data for Dino and the rest of the Datasaurus Dozen available for download.
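The idea that identical summary statistics can hide very different shapes can be seen without the Datasaurus data at all. As a minimal sketch, the snippet below uses Anscombe's quartet, which ships with base R as the anscombe data frame. It is a stand-in chosen for illustration only: it is not the Datasaurus Dozen and it says nothing about how George and Justin generate their datasets.

```r
# Anscombe's quartet is a built-in data frame with columns x1..x4 and y1..y4.
# It is used here as a stand-in for the Datasaurus data (an assumption made
# for illustration; the datasauRus package is not needed to run this).
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x   = sd(x),
    sd_y   = sd(y),
    cor_xy = cor(x, y)
  )
})
colnames(stats) <- paste0("set", 1:4)

# One column per dataset, one row per summary statistic
round(stats, 2)
```

The four columns agree to two decimal places, yet plotting each x/y pair shows four clearly different patterns, which is the same point Dino and friends make.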

The data seemed like an ideal thing to get into R, so it was R package development time!

I know a lot of folks can be a bit scared of producing packages so I thought I’d share the workflow so folks can see how easy the actual package bits and bobs can be.

Starting a package

For a long time I’ve been using devtools to make package building easier, and indeed I still do. However, I wanted to be even lazier so I didn’t even have to type out the ~10 lines of code to set up a best-practice package. To be lazy, I made a package called pRojects. pRojects helps with the initial setup of different project types, including packages. It’s still under development and it’s a great place to cut your teeth on contributing to a package.

To start the datasauRus package I just did:

install.packages("devtools")
devtools::install_github("lockedata/pRojects")
pRojects::createPackageProject("datasauRus")

Well, I’ve already got those prerequisites so I only did 1 line!

Then some (optional) online stuff is required to complete the awesomeness:

  1. Create a new repository on GitHub
  2. Turn on Travis continuous integration
  3. Turn on Coveralls

Then we need to add the GitHub repository to our project. I use the git command line for this:

git remote add origin git@github.com:stephlocke/datasauRus.git
git push --set-upstream origin master

With just these things, I have a package that contains the unit test framework, documentation stubs, continuous integration and test coverage, and source control.

That is all you need to do to get things going!

Adding package contents

I’m not going to take you through an in-depth piece on writing package contents as Hadley does that extremely well in R Packages (read it online or get the book). This package only requires datasets to be built.

Lucy McGowan made a PR overnight that converted the raw data (stored in inst/extdata/) into R datasets (stored in data/) with this R code:

# #here is how I converted, we can delete this
# files =  list.files(path = "./inst/extdata",pattern="*.tsv", full.names = TRUE)
#
# purrr::walk(files, function(x){
#   nm <- gsub("\\.tsv","",basename(x))
#   nm <- gsub("-","_",nm)
#
#   #make it slither
#   nm <-  gsub("^_+", "", tolower(gsub("([A-Z])", "_\\1", nm)))
#
#   if (grepl("wide",nm)){
#     header1 <- scan(x, nlines = 1, what = character())
#     header1 <- paste(header1,c("x","y"),sep="_")
#     .dat <- readr::read_tsv(x, col_names = header1, skip = 2)
#   } else {
#     .dat <- readr::read_tsv(x)
#   }
#   assign(nm, .dat)
#   save(list = nm, file = paste0("data/",nm,".rda"))
# })
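The renaming step ("make it slither") is the least obvious part of that script, so here is a small sketch that pulls the three gsub() calls into a standalone helper. The helper name to_snake and the example file names are assumptions made for illustration; only the regular expressions are taken from the script.

```r
# Hypothetical helper wrapping the renaming logic from the conversion script:
# strip the .tsv extension, swap hyphens for underscores, then convert
# CamelCase to snake_case.
to_snake <- function(path) {
  nm <- gsub("\\.tsv", "", basename(path))
  nm <- gsub("-", "_", nm)
  # put "_" before every capital, lower-case everything, drop leading "_"
  gsub("^_+", "", tolower(gsub("([A-Z])", "_\\1", nm)))
}

# Example file names are assumed, not read from inst/extdata/
example_files <- c(
  "inst/extdata/DatasaurusDozen.tsv",
  "inst/extdata/DatasaurusDozen-wide.tsv",
  "inst/extdata/BoxPlots.tsv"
)

to_snake(example_files)
# → "datasaurus_dozen" "datasaurus_dozen_wide" "box_plots"
```

Keeping the renaming in one function like this would also make it easy to unit test on its own, separately from reading the files.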


This made .rda files that are available once the package is installed and loaded. Lucy also added minimal documentation for each file in the R package doc created by default in the package setup phase. Each data entry looks like:

#'Datasaurus Dozen (wide) data
"datasaurus_dozen_wide"
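To see what the assign() and save(list = nm, ...) pair in the conversion script produces, here is a minimal round trip. It is a sketch under stated assumptions: it writes to a temporary directory rather than the package's data/ folder, and the toy data frame and the name demo_points are hypothetical.

```r
# Toy stand-in for one converted dataset (hypothetical values and name)
nm  <- "demo_points"
dat <- data.frame(
  dataset = c("dino", "dino", "star"),
  x = c(1, 2, 3),
  y = c(4, 5, 6)
)

# Write to a temporary "data" directory instead of the real package folder
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE)
rda_path <- file.path(out_dir, paste0(nm, ".rda"))

# save(list = ...) stores the object under the name held in nm, so bind the
# data to that name first. An explicit environment keeps the workspace clean.
env <- new.env()
assign(nm, dat, envir = env)
save(list = nm, file = rda_path, envir = env)

# load() recreates the object under its saved name and returns that name
restored <- new.env()
loaded_names <- load(rda_path, envir = restored)
loaded_names
identical(restored[[nm]], dat)
```

The point to notice is that the object name inside the .rda file, not the file name, is what users of the package end up typing, which is why the script goes to the trouble of building a tidy name first.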

Adding tests

We don’t have functions in this package but we do have datasets, so I wanted to write some unit tests for these.

Unit testing with testthat is set up by default so I just needed to add a file beginning with test- into the tests/testthat/ directory that contained some tests about the data.

As the code would be repetitive, I made a function that could be applied to each dataset.

context("datasets")

datashapetests<-function(df, ncols, nrows, uniquecol=NULL, nuniques=NULL){
  expect_equal(ncol(df),ncols)
  expect_equal(nrow(df),nrows)
  # only check the number of distinct values when a column is named
  if(!is.null(uniquecol)){
    expect_equal(nrow(unique(df[uniquecol])),nuniques)
  }
}

test_that("box_plots is correctly shaped",{
  datashapetests(box_plots,6,2484)
})

test_that("datasaurus_dozen is correctly shaped",{
  datashapetests(datasaurus_dozen,3,1846,"dataset",13)
})
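With more datasets, the individual test_that() calls could themselves be generated from a table of expected shapes. The following is only a sketch of that idea, not what the package does: it uses the built-in iris and mtcars data frames as stand-ins so it runs without datasauRus installed, and the expected list is a hypothetical structure.

```r
library(testthat)

# Same helper as in the package tests
datashapetests <- function(df, ncols, nrows, uniquecol = NULL, nuniques = NULL) {
  expect_equal(ncol(df), ncols)
  expect_equal(nrow(df), nrows)
  if (!is.null(uniquecol)) {
    expect_equal(nrow(unique(df[uniquecol])), nuniques)
  }
}

# Expected shapes, one entry per dataset. iris and mtcars are built-in
# stand-ins (an assumption for illustration), not datasauRus datasets.
expected <- list(
  iris   = list(ncols = 5,  nrows = 150, uniquecol = "Species", nuniques = 3),
  mtcars = list(ncols = 11, nrows = 32)
)

for (nm in names(expected)) {
  spec <- expected[[nm]]
  test_that(paste(nm, "is correctly shaped"), {
    # get() looks the dataset up by name; missing list entries come back as
    # NULL, which skips the uniqueness check inside the helper
    datashapetests(get(nm), spec$ncols, spec$nrows, spec$uniquecol, spec$nuniques)
  })
}
```

For two or three datasets the explicit calls are easier to read; a table like this starts to pay off once every dataset in data/ needs a shape check.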

Documentation

I mentioned some of the minimally written dataset documentation. These should be expanded, and the vignette needs filling in. This could be based on the information I’ve already provided in the README. Once the documentation is polished, this package could go to CRAN.

Run checks

Once you’ve made some changes to your package, you should make sure your code passes its unit tests, that documentation is correctly structured, and that your code compiles. Remove as many warnings and notes as possible.

Do this with devtools:

devtools::check()

Travis CI will also verify your code passes these checks in a clean environment, and if you hooked up code coverage then you’ll see how much your tests test your codebase.

That’s it

There can be nuance in writing your own functions and coding defensively, but you can now make a great package skeleton in just one line of code. The datasauRus package is almost ready to go to CRAN and it took fewer than 50 lines of code, including the package contents. I hope this has been a useful insight into R package development.

If you want to tackle making your first package, I’ll happily give you a hand – book into my office hours and let’s do it!

The post The making of datasauRus appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.
