Extending RevoScaleR for Mining Big Data – Discretization

Derek Norton

9 years ago

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

by Derek McCrae Norton, Senior Sales Engineer

In this second installment of Extending RevoScaleR for Mining Big Data, we look at how to use the building blocks provided by RevoScaleR to transform continuous variables into discrete ones.

Motivation: Discretize continuous variables on big data.

Discretization is a technique for converting continuous variables into discrete variables, and it is sometimes useful in data mining models such as Naïve Bayes. There are two basic methods, Equal Width and Equal Frequency, as well as many advanced methods such as Chi2, ChiMerge, and tree-based methods.

If we consider the two basic methods, they are quite easy to implement in RevoScaleR.

Equal Width – Simply divide the variable's range into k buckets of equal width. The range (minimum and maximum) of each variable is precalculated and stored in XDF files, which means most of the work is already done!

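Equal Frequency – Divide the data into k buckets that each hold roughly the same number of observations. The bucket boundaries are quantiles, and rxQuantile is a function that calculates them efficiently.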
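To make the two definitions concrete, here is a minimal sketch in base R of how the break points could be computed. The helper names equal_width_breaks and equal_freq_breaks are hypothetical and are not part of RevoScaleR. On an XDF file, the minimum and maximum would come from the stored metadata and the quantiles from rxQuantile instead of from base quantile on an in-memory vector.

```r
# Hypothetical helpers for illustration only; not part of RevoScaleR.

# Equal Width: k buckets of identical width need k + 1 evenly spaced
# break points between the minimum and maximum.
equal_width_breaks <- function(lo, hi, k) {
  seq(lo, hi, length.out = k + 1)
}

# Equal Frequency: break points are the quantiles at probabilities
# 0, 1/k, 2/k, ..., 1. unique() drops repeated breaks caused by ties.
equal_freq_breaks <- function(x, k) {
  probs <- seq(0, 1, length.out = k + 1)
  unique(quantile(x, probs = probs, names = FALSE))
}

# Small in-memory example with one large outlier.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

ew <- equal_width_breaks(min(x), max(x), k = 3)
ef <- equal_freq_breaks(x, k = 5)

print(ew)
print(ef)
```

The outlier shows the practical difference: the equal width breaks are stretched by the value 100, so nearly all observations fall into the first bucket, while the equal frequency breaks follow the bulk of the data.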

Bring it all together and use cut inside of an rxDataStep transform to create new discretized variables.
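The cut step itself can be sketched in base R on an ordinary data frame. This is an illustration under stated assumptions, not the rxDiscretize implementation: the data frame df, the column names, and the choice of k = 4 are made up for the example, and on big data the same cut calls would run inside an rxDataStep transform with the break points computed beforehand.

```r
# Illustrative data; in practice this would be an XDF file processed in chunks.
set.seed(42)
df <- data.frame(x = rexp(1000))
k <- 4

# Break points for each method (k buckets need k + 1 breaks).
width_breaks <- seq(min(df$x), max(df$x), length.out = k + 1)
freq_breaks <- unique(quantile(df$x,
                               probs = seq(0, 1, length.out = k + 1),
                               names = FALSE))

# cut() maps each value to the interval it falls in.
# include.lowest = TRUE keeps the minimum value in the first bucket.
df$x_width <- cut(df$x, breaks = width_breaks, include.lowest = TRUE)
df$x_freq <- cut(df$x, breaks = freq_breaks, include.lowest = TRUE)

# Equal width buckets are unevenly filled on skewed data;
# equal frequency buckets hold about the same number of rows.
print(table(df$x_width))
print(table(df$x_freq))
```

Computing the break points once, before the transform, matters on chunked data: if each chunk derived its own breaks, the same value could land in different buckets depending on which chunk it arrived in.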

You can test this out yourself with the function rxDiscretize, available on GitHub.

Look for upcoming posts on other ways to extend RevoScaleR for Mining Big Data.
