
What is new in the vtreat library?

[This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers.]

The Win-Vector LLC vtreat library is a library we supply (under a GPL license) for automating the simple, domain-independent part of variable cleaning and preparation.

The idea is you supply (in R) an example general data.frame to vtreat’s designTreatmentsC method (for single-class categorical targets) or designTreatmentsN method (for numeric targets) and vtreat returns a data structure that can be used to prepare data frames for training and scoring. A vtreat-prepared data frame is nice in the sense:

vtreat automates a number of standard inspection and preparation steps that are common to all predictive analytic projects. This leaves the data scientist more time to work on important domain-specific steps. vtreat also leaves as much of variable selection as possible to the downstream modeling software. The goal of vtreat is to reliably (and repeatably) generate a data.frame that is safe to work with.
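As a rough illustration of what "safe to work with" could mean in practice, here is a minimal base-R sketch. The helper name isSafeFrame, the column names in the example frames, and the definition of "safe" used here (every variable column numeric, with no NA, NaN, or infinite values) are our assumptions for illustration; they are not part of the vtreat API.

```r
# Hypothetical helper (not part of vtreat): returns TRUE when every listed
# column is numeric and contains no NA, NaN, or infinite values.
isSafeFrame <- function(d, vars = colnames(d)) {
  all(vapply(vars, function(v) {
    col <- d[[v]]
    is.numeric(col) && all(is.finite(col))
  }, logical(1)))
}

# A raw frame with a categorical column and missing values
raw <- data.frame(x = c('a', 'b', NA), z = c(1, NA, 3))
# An all-numeric frame with no missing values (column names are made up)
clean <- data.frame(x_is_a = c(1, 0, 0), z_filled = c(1, 2, 3))

print(isSafeFrame(raw))
print(isSafeFrame(clean))
```

A check like this can be run on any prepared frame before handing it to modeling code.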

This note explains a few things that are new in the vtreat library.

The typical use of vtreat is to defend down-stream modeling code from all kinds of typical incoming data problems. Such issues include:

These are all things that “shouldn’t happen” but do happen often enough that you want systematic notifications of, treatments for, and defenses against them. Uncaught, these issues can cause your model to error out or skip examples during scoring (novel levels often cause this), or lurk subtly, causing a (large or small) unnoticed loss in model quality.
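The failure modes described above are easy to reproduce with base R alone. The following sketch uses a small made-up data frame and lm() (our choice of model for illustration, not something vtreat requires) to show a training row being silently dropped, a missing value producing an NA prediction, and a novel level causing an error during scoring.

```r
# Small made-up training frame with one missing numeric value
d <- data.frame(
  x = c('a', 'a', 'b', 'b', 'a', 'b'),
  z = c(1, 2, 3, 4, NA, 6),
  y = c(1.0, 2.1, 2.9, 4.2, 5.0, 6.1),
  stringsAsFactors = TRUE
)
model <- lm(y ~ x + z, data = d)

# With the default na.action, lm() quietly drops the row with the missing z,
# so the model is fit on fewer rows than were supplied
print(nrow(d))
print(nobs(model))

# Scoring a row with a missing value quietly yields an NA prediction
newMissing <- data.frame(x = 'a', z = NA_real_)
predMissing <- predict(model, newdata = newMissing)
print(predMissing)

# Scoring a row with a level not seen in training is an error;
# we capture the error message instead of letting the script stop
newLevel <- data.frame(x = 'c', z = 2)
predNovel <- tryCatch(
  predict(model, newdata = newLevel),
  error = function(e) conditionMessage(e)
)
print(predNovel)
```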

A typical use looks like the following:

    library('vtreat')

    # our design and training data frame
    dTrainC <- data.frame(x=c('a','a','a','b','b',NA),
                          z=c(1,2,3,4,NA,6),
                          y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
    print(dTrainC)

    # build the treatment plan on the training frame
    treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)

    # treat the training frame and use this treated frame to build models
    dTrainCTreated <- prepare(treatmentsC,dTrainC,pruneLevel=c())
    print(dTrainCTreated)

    # later, new test or application data arrives
    dTestC <- data.frame(x=c('a','b','c',NA),
                         z=c(10,20,30,NA))
    print(dTestC)

    # use the treatment plan to prepare this frame
    dTestCTreated <- prepare(treatmentsC,dTestC,pruneLevel=c())
    print(dTestCTreated)

vtreat was designed to package and automate some of the more common steps from section 4.1 of Practical Data Science with R. This is not a replacement for actually looking at the data. The automation is just to leave the data scientist more time to work on important domain-specific adaptations and transformations. Similarly, vtreat does a little variable scoring, but leaves the bulk of variable selection to the modeling technique the data scientist chooses to use after treatment. We want vtreat to be very lightweight and easy to combine with other libraries.
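To give a feel for the kind of step being automated, here is a hand-rolled base-R sketch of the same "design on training data, then prepare any later frame" pattern. It is not vtreat's implementation: the function names designByHand and prepareByHand, the output column names, and the specific choices (mean fill plus a missingness indicator for the numeric column, 0/1 indicator columns for the categorical one) are assumptions made only for this illustration.

```r
# Hypothetical helpers, not vtreat's implementation or API.

# "Design": record everything we need using the training data only.
designByHand <- function(dTrain) {
  x <- as.character(dTrain$x)
  list(
    zMean = mean(dTrain$z, na.rm = TRUE),   # fill value for missing z
    xLevels = sort(unique(x[!is.na(x)]))    # levels seen during training
  )
}

# "Prepare": apply the stored plan to any frame with the same columns.
prepareByHand <- function(plan, d) {
  out <- data.frame(
    z_filled = ifelse(is.na(d$z), plan$zMean, d$z),
    z_wasMissing = as.numeric(is.na(d$z))
  )
  x <- as.character(d$x)
  for (lev in plan$xLevels) {
    # NA and levels not seen in training match no indicator,
    # so they are encoded as zeros in every x_is_* column
    out[[paste0('x_is_', lev)]] <- as.numeric(!is.na(x) & x == lev)
  }
  out
}

dTrain <- data.frame(x = c('a', 'a', 'a', 'b', 'b', NA),
                     z = c(1, 2, 3, 4, NA, 6))
dTest <- data.frame(x = c('a', 'b', 'c', NA),
                    z = c(10, 20, 30, NA))

plan <- designByHand(dTrain)
treatedTest <- prepareByHand(plan, dTest)
print(treatedTest)
```

The point of the split is that the fill value and the level list come from the training data, so a novel level such as 'c' or a missing value in later data is encoded consistently instead of breaking the downstream model.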

A few things have been added since we introduced the Win-Vector LLC basic variable preparation library. In particular:

We strongly encourage all data scientists to incorporate vtreat (or something like it) into their workflow.

