Unsupervised data pre-processing: individual predictors

Posted on November 7, 2013 by thiagogm in R bloggers | 0 Comments

[This article was first published on Thiago G. Martins » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I just got the excellent book Applied Predictive Modeling, by Max Kuhn and Kjell Johnson [1]. The book is designed for a broad audience and focus on the construction and application of predictive models. Besides going through the necessary theory in a not-so-technical way, the book provides R code at the end of each chapter. This enables the reader to replicate the techniques described in the book, which is nice. Most of such techniques can be applied through calls to functions from the caret package, which is a very convenient package to have around when doing predictive modeling.

Chapter 3 is about unsupervised techniques for pre-processing your data. The pre-processing step happens before you start building your model. Inadequate data pre-processing is pointed on the book as one of the common reasons on why some predictive models fail. Unsupervised means that the transformations you perform on your predictors (covariates) does not use information about the response variable.

Feature engineering

How your predictors are encoded can have a significant impact on model performance. For example, the ratio of two predictors may be more effective than using two independent predictors. This will depend on the model used as well as on the particularities of the phenomenon you want to predict. The manufacturing of predictors to improve prediction performance is called feature engineering. To succeed at this stage you should have a deep understanding of the problem you are trying to model.

Data transformations for individuals predictors

A good practice is to center, scale and apply skewness transformations for each of the individual predictors. This practice gives more stability for numerical algorithms used later in the fitting of different models as well as improve the predictive ability of some models. Box and Cox transformation [2], centering and scaling can be applied using the preProcess function from caret. Assume we have a predictors data frame with two predictors, x1 and x2, depicted in Figure 1.

Transforming individual predictors with caret

Figure 1

Then the following code

set.seed(1)
predictors = data.frame(x1 = rnorm(1000,
                                   mean = 5,
                                   sd = 2),
                        x2 = rexp(1000,
                                  rate=10))

require(caret)

trans = preProcess(predictors, 
                   c("BoxCox", "center", "scale"))
predictorsTrans = data.frame(
      trans = predict(trans, predictors))

will estimate the ${\lambda}$ of the Box and Cox transformation

$\displaystyle x^* = \bigg\{ \begin{array}{cc} \frac{x^{\lambda} - 1}{\lambda} & \text{ if }\lambda \neq 0 \\ \log(x) & \text{ if }\lambda = 0 \end{array}$

apply it to your predictors that take on positive values, and then center and scale each one of the predictors. The new data frame predictorsTrans with the transformed predictors is depicted in Figure 2.

Figure 2

I will write more about the book and the caret package at future posts. The complete code that I have used here to simulate the data, generate the pictures and transform the data can be found on gist.

References:

[1] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
[2] Box, G. and Cox, D. (1964). An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological) 211-252

To leave a comment for the author, please follow the link and comment on their blog: Thiago G. Martins » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Unsupervised data pre-processing: individual predictors

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)