Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
So you are using this pipeline to have data treated by different functions in R. For example, you may be imputing some missing values using the simputation package. Let us first load the only realistic dataset in R:
> data(retailers, package="validate")
> head(retailers, 3)
  size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
1  sc0      0.02    75       NA        NA      1130          NA       18915  20045  NA
2  sc3      0.14     9     1607        NA      1607         131        1544     63  NA
3  sc3      0.14    NA     6886       -33      6919         324        6493    426  NA
This data is dirty with missings and full of errors. Let us do some imputations with simputation.
> library(magrittr)     # provides %>%
> library(simputation)  # provides impute_lm() and impute_median()
> out <- retailers %>%
+   impute_lm(other.rev ~ turnover) %>%
+   impute_median(other.rev ~ size)
>
> head(out,3)
  size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
1  sc0      0.02    75       NA  6114.775      1130          NA       18915  20045  NA
2  sc3      0.14     9     1607  5427.113      1607         131        1544     63  NA
3  sc3      0.14    NA     6886   -33.000      6919         324        6493    426  NA
Ok, cool, we know all that. But what if you’d like to know what value was imputed with which method? That’s where the lumberjack comes in.
The lumberjack operator is a 'pipe' [1] operator that allows you to track changes in data.
> library(lumberjack)
> retailers$id <- seq_len(nrow(retailers))
> out <- retailers %>>%
+   start_log(log=cellwise$new(key="id")) %>>%
+   impute_lm(other.rev ~ turnover) %>>%
+   impute_median(other.rev ~ size) %>>%
+   dump_log(stop=TRUE)
Dumped a log at cellwise.csv
>
> read.csv("cellwise.csv") %>>% dplyr::arrange(key) %>>% head(3)
  step                     time                      expression key  variable old      new
1    2 2017-06-23 21:11:05 CEST impute_median(other.rev ~ size)   1 other.rev  NA 6114.775
2    1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover)   2 other.rev  NA 5427.113
3    1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover)   6 other.rev  NA 6341.683
So, to track changes we only need to switch from %>% to %>>% and add the start_log() and dump_log() function calls to the data pipeline. (To be sure: it works with any function, not only with simputation.) The package is on CRAN now; please see the introductory vignette for more examples and ways to customize it.
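Once the log is on disk, answering "which method imputed which value" is ordinary data frame work. The sketch below uses only base R and rebuilds the three log rows shown above by hand (without the time column), so it runs without any packages; the names cells_per_method and method_for_key are just illustrative. In practice you would start from read.csv("cellwise.csv") instead.

```r
# Rows copied from the log excerpt shown in this post (time column omitted).
log <- data.frame(
  step       = c(2L, 1L, 1L),
  expression = c("impute_median(other.rev ~ size)",
                 "impute_lm(other.rev ~ turnover)",
                 "impute_lm(other.rev ~ turnover)"),
  key        = c(1L, 2L, 6L),
  variable   = "other.rev",
  old        = NA_real_,
  new        = c(6114.775, 5427.113, 6341.683),
  stringsAsFactors = FALSE
)

# How many cells did each expression change?
cells_per_method <- table(log$expression)

# Which expression produced the imputed value for a given record id?
method_for_key <- setNames(log$expression, log$key)
method_for_key[["1"]]
```

Because every row of the log carries both the key and the expression, the same lookup works for any variable and any number of pipeline steps.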
There are many ways to track changes in data. That is why the lumberjack is completely extensible. The package comes with a few loggers, but users or package authors are invited to write their own. Please see the extending lumberjack vignette for instructions.
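To give a feel for what writing your own logger involves, here is a minimal sketch. It assumes, as I understand the extending lumberjack vignette, that a logger is a reference object with an add(meta, input, output) method and a dump() method, and that meta carries the expression as a string in meta$src. The class name change_counter and its fields are made up for this illustration, and newer versions of the package may expect more from a logger. The example calls the methods by hand, so it runs in base R without lumberjack.

```r
# Hypothetical logger: counts the steps that changed the data and
# remembers the source text of those steps. Uses base R Reference Classes.
change_counter <- setRefClass("change_counter",
  fields = list(n_changed = "integer", steps = "character"),
  methods = list(
    initialize = function(...) {
      n_changed <<- 0L
      steps <<- character(0)
      callSuper(...)
    },
    # Assumed interface: called once per pipeline step with the data
    # before (input) and after (output) the step.
    add = function(meta, input, output) {
      if (!identical(input, output)) {
        n_changed <<- n_changed + 1L
        steps <<- c(steps, meta$src)
      }
    },
    # Assumed interface: called when the log is dumped.
    dump = function(...) {
      cat(sprintf("%d step(s) changed the data\n", n_changed))
      invisible(steps)
    }
  )
)

# Exercise the logger by hand, mimicking two pipeline steps.
before <- data.frame(x = c(1, NA))
after  <- before
after$x[2] <- 0

lg <- change_counter$new()
lg$add(list(expr = quote(fill_in(x)), src = "fill_in(x)"), before, after)
lg$add(list(expr = quote(identity()), src = "identity()"), after, after)
lg$dump()
```

If the assumed interface holds, such an object could be passed to start_log() in place of cellwise$new(key="id"); I have not shown that here because it depends on the installed package version.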
If this post got you interested, please install the package using
install.packages('lumberjack')
You can get started with the introductory vignette or even just use the lumberjack operator %>>% as a (close) replacement of the %>% operator.
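If you wonder what a pipe-like operator boils down to, the toy below may help. It is not how lumberjack or magrittr implement theirs; %then% is a made-up name, and the sketch only handles right-hand sides written as function calls. It inserts the left-hand value as the first argument of the call on the right and evaluates the result.

```r
# Toy pipe-like operator (hypothetical, for illustration only).
`%then%` <- function(lhs, rhs) {
  rhs_call <- substitute(rhs)          # capture e.g. head(2) unevaluated
  stopifnot(is.call(rhs_call))
  parts <- as.list(rhs_call)           # function name followed by its arguments
  # Rebuild the call with a placeholder for lhs as the first argument.
  piped <- as.call(c(list(parts[[1]], quote(.lhs)), parts[-1]))
  # Evaluate with .lhs bound to the left-hand value; everything else is
  # looked up in the caller's environment.
  eval(piped, list(.lhs = lhs), parent.frame())
}

result <- c(3, 1, 2) %then% sort() %then% head(2)
result
```

A tracking operator does the same thing but, in between, gets to compare the data before and after each call, which is the hook a logger needs.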
As always, I am open to suggestions and comments, for instance through the package's GitHub page.
Also, I will be talking at useR2017 about the simputation package, but I will sneak in a bit of lumberjack as well :p.
And finally, here’s a picture of a lumberjack smoking a pipe.
[1] It really should be called a function composition operator, but potetoes/potatoes.