Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.27 has been released on CRAN.
(from the package documentation)
vtreat is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.
Very roughly, vtreat accepts an arbitrary “from the wild” data frame (with different column types, NAs, NaNs, and so forth) and returns a transformation that reliably and repeatably converts similar data frames to numeric (matrix-like) frames (all independent variables numeric, free of NAs, NaNs, infinities, and so on) ready for predictive modeling. This is a systematic way to work with high-cardinality character and factor variables (which are incompatible with some machine learning implementations, such as random forest, and also bring in a danger of statistical over-fitting), and it leaves the analyst more time to incorporate domain-specific data preparation (as vtreat tries to handle as much of the common stuff as practical). For more of an overall description please see here.
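As a quick orientation, here is a minimal sketch of that workflow, with made-up column names and toy data: design a treatment plan on training data with designTreatmentsC() (designTreatmentsN() is the analogous call for numeric outcomes), then use prepare() to convert that frame, and later frames like it, into an all-numeric, NA-free form.

```r
library(vtreat)

# Illustrative training frame: mixed types, NAs, and a logical outcome y.
set.seed(2016)
dTrain <- data.frame(
  x = sample(c("a", "b", "c", NA), 100, replace = TRUE),
  z = c(rnorm(99), NA),
  y = runif(100) >= 0.5,
  stringsAsFactors = FALSE
)

# Design the treatment plan on training data
# (designTreatmentsN() is the analogous call for numeric outcomes).
treatments <- designTreatmentsC(dTrain,
                                varlist = c("x", "z"),
                                outcomename = "y",
                                outcometarget = TRUE)

# Apply the plan to get an all-numeric, NA-free frame ready for modeling;
# the same plan is re-used on test or future data.
dTrainTreated <- prepare(treatments, dTrain, pruneSig = 0.05)
str(dTrainTreated)
```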
We suggest that all users update (and you will want to re-run any “design” steps instead of mixing “design” and “prepare” from two different versions of vtreat).
For what is new in version 0.5.27 please read on.
vtreat 0.5.27 is a maintenance release. User-visible improvements include:
- Switching `catB` encodings to a logit scale (instead of the previous log scale); a rough sketch of what this encoding computes appears after this list.
- Increasing the degree of parallelism by separately parallelizing the level pruning steps (using the methods outlined here).
- Changing the default for `catScaling` to `FALSE`. We still think working in logistic link-space is a great idea for classification problems; we are just not fully satisfied that un-regularized logistic regressions are the best way to get there (largely due to issues of separation and quasi-separation). In the meantime we think working in an expectation space is the safer (and now default) alternative.
- Falling back to `stats::chisq.test()` instead of insisting on `stats::fisher.test()` for large counts. This calculation is used for level pruning and is only relevant if `rareSig < 1` (the default is `1`). We caution that setting `rareSig < 1` remains a fairly expensive setting. We are trying to make significance estimation much more transparent; for example, we now return how many extra degrees of freedom are hidden by categorical variable re-encodings in a new score frame column called `extraModelDegrees` (found in `designTreatments*()$scoreFrame`; see the sketch after this list).
The idea is that having data preparation as a re-usable library lets us research, document, optimize, and fine-tune a lot more details than would make sense on any one analysis project. The main design difference from other data preparation packages is that we emphasize “y-aware” (or outcome-aware) processing: using the training outcome to generate useful re-encodings of the data.
We have pre-rendered a lot of the package documentation, examples, and tutorials here.