Track changes in data with the lumberjack %>>%

mark

5 years ago

[This article was first published on R – Mark van der Loo, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

So you are using this pipeline to have data treated by different functions in R. For example, you may be imputing some missing values using the simputation package. Let us first load the only realistic dataset in R

> data(retailers, package="validate")
> head(retailers, 3)
  size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
1  sc0      0.02    75       NA        NA      1130          NA       18915  20045  NA
2  sc3      0.14     9     1607        NA      1607         131        1544     63  NA
3  sc3      0.14    NA     6886       -33      6919         324        6493    426  NA

This data is dirty with missings and full of errors. Let us do some imputations with simputation.

> out <- retailers %>% 
+   impute_lm(other.rev ~ turnover) %>%
+   impute_median(other.rev ~ size)
> 
> head(out,3)
  size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
1  sc0      0.02    75       NA  6114.775      1130          NA       18915  20045  NA
2  sc3      0.14     9     1607  5427.113      1607         131        1544     63  NA
3  sc3      0.14    NA     6886   -33.000      6919         324        6493    426  NA
>

Ok, cool, we know all that. But what if you’d like to know what value was imputed with which method? That’s where the lumberjack comes in.

The lumberjack operator is a `pipe'[1] operator that allows you to track changes in data.

> library(lumberjack)
> retailers$id <- seq_len(nrow(retailers))
> out <- retailers %>>% 
+   start_log(log=cellwise$new(key="id")) %>>%
+   impute_lm(other.rev ~ turnover) %>>%
+   impute_median(other.rev ~ size) %>>%
+   dump_log(stop=TRUE)
Dumped a log at cellwise.csv
> 
> read.csv("cellwise.csv") %>>% dplyr::arrange(key) %>>% head(3)
  step                     time                      expression key  variable old      new
1    2 2017-06-23 21:11:05 CEST impute_median(other.rev ~ size)   1 other.rev  NA 6114.775
2    1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover)   2 other.rev  NA 5427.113
3    1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover)   6 other.rev  NA 6341.683
>

So, to track changes we only need to switch from %>% to %>>% and add the start_log() and dump_log() function calls in the data pipeline. (to be sure: it works with any function, not only with simputation). The package is on CRAN now, and please see the introductory vignette for more examples and ways to customize it.

There are many ways to track changes in data. That is why the lumberjack is completely extensible. The package comes with a few loggers, but users or package authors are invited to write their own. Please see the extending lumberjack vignette for instructions.

If this post got you interested, please install the package using

install.packages('lumberjack')

You can get started with the introductory vignette or even just use the lumberjack operator %>>% as a (close) replacement of the %>% operator.

As always, I am open to suggestions and comments. Either through the packages github page.

Also, I will be talking at useR2017 about the simputation package, but I will sneak in a bit of lumberjack as well :p.

And finally, here’s a picture of a lumberjack smoking a pipe.

[1] It really should be called a function composition operator, but potetoes/potatoes.

Markdown with by

wp-gfm

To leave a comment for the author, please follow the link and comment on their blog: R – Mark van der Loo.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.