Out-of-sample Imputation with {missRanger}
{missRanger} is a multivariate imputation algorithm based on random forests, and a fast version of the original missForest algorithm of Stekhoven and Buehlmann (2012). Surprise, surprise: it uses {ranger} to fit random forests. Especially combined with predictive mean matching (PMM), the imputations are often quite realistic.
Out-of-sample application
The newest CRAN release, 2.6.0, offers out-of-sample application. This is useful for removing leakage between train and test data, or between folds during cross-validation. It also lets you fill missing values in user-provided data. By default, it uses the same number of PMM donors as during training, but you can change this by passing a different pmm.k.
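As a minimal sketch of the cross-validation use case, the imputation model can be fitted on the training folds only and then applied to the held-out fold. The 3-fold assignment, the choice of num.trees = 50 (passed on to {ranger}), and the override pmm.k = 3 are illustrative assumptions, not recommendations from the package.

```r
library(missRanger)

# Iris with 10% missing values in every column
ir <- generateNA(iris, p = 0.1, seed = 11)

# Hypothetical 3-fold assignment; any fold scheme would do
set.seed(1)
fold <- sample(rep(1:3, length.out = nrow(ir)))

imputed <- ir
for (k in 1:3) {
  holdout <- fold == k

  # Fit the imputation model on the training folds only
  mr <- missRanger(
    ir[!holdout, ],
    pmm.k = 5,
    keep_forests = TRUE,  # required for predict()
    num.trees = 50,       # assumption: fewer trees to keep the sketch fast
    seed = k,
    verbose = 0
  )

  # Apply it to the held-out fold; pmm.k = 3 overrides the training value of 5
  imputed[holdout, ] <- predict(mr, ir[holdout, ], pmm.k = 3, seed = k)
}

# Check whether any missing values remain
anyNA(imputed)
```

Because each held-out fold is filled by a model that never saw it, no information from that fold leaks into its own imputations.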
We distinguish two types of observations to be imputed:
- Easy case: Only a single value is missing. Here, we simply apply the corresponding random forest to fill the one missing value.
- Hard case: Multiple values are missing. Here, we first fill the values univariately, and then repeatedly apply the corresponding random forests, in the hope that the effect of the univariate imputation vanishes. If values of two highly correlated features are missing, the imputations can be nonsensical. There is no way to mend this.
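The hard case can be illustrated with a toy sketch in base R. It is not the package's implementation: linear regressions stand in for the random forests, the univariate fill uses the training mean, and the number of passes (4) is an arbitrary choice. Only the overall structure (univariate fill, then repeated model-based updates) follows the description above.

```r
# Toy setup: rows 2..150 of the numeric iris columns serve as training data
train <- iris[-1, 1:4]

# A new row with two missing values (the "hard case")
new_row <- iris[1, 1:4]
new_row[c("Sepal.Width", "Petal.Width")] <- NA
to_fill <- names(which(colSums(is.na(new_row)) > 0))

# One model per variable to fill, each using all other columns as features.
# Assumption: lm() is a stand-in for the random forests used by missRanger.
models <- lapply(
  setNames(nm = to_fill),
  function(v) lm(as.formula(paste(v, "~ .")), data = train)
)

# Step 1: univariate fill (assumption: training mean)
filled <- new_row
for (v in to_fill) {
  filled[[v]] <- mean(train[[v]])
}

# Step 2: repeatedly update each filled value from the current state of the row
for (i in seq_len(4)) {
  for (v in to_fill) {
    filled[[v]] <- unname(predict(models[[v]], newdata = filled))
  }
}

filled
```

In the first pass, each prediction still depends on a crude univariate guess for the other missing variable. Later passes replace those guesses with model-based values, which is why the influence of the initial fill tends to fade.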
Example
To illustrate the technique with a simple example, we use the iris data.
1. First, we randomly add 10% missing values.
2. Then, we make a train/test split.
3. Next, we “fit” missRanger() to the training data.
4. Finally, we use its new predict() method to fill the test data.
```r
library(missRanger)

# 10% missings
ir <- iris |>
  generateNA(p = 0.1, seed = 11)

# Train/test split stratified by Species
oos <- c(1:10, 51:60, 101:110)
train <- ir[-oos, ]
test <- ir[oos, ]

head(test)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3          NA  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4          NA          1.7          NA  setosa

mr <- missRanger(train, pmm.k = 5, keep_forests = TRUE, seed = 1)
test_filled <- predict(mr, test, seed = 1)

head(test_filled)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         4.0          1.7         0.4  setosa

# Original
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa
```
The results look reasonable, in this case even for the “hard case” row 6 with missing values in two variables. Here, it is probably the strong association with Species that helped to create good values.
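To go beyond eyeballing the first six rows, one could compare every imputed test value with the original iris value. The following sketch repeats the setup from the example and adds a simple comparison; the variable names (num, was_na, abs_err) and the choice of absolute error as the measure are assumptions made for illustration.

```r
library(missRanger)

# Same setup as in the example
ir <- generateNA(iris, p = 0.1, seed = 11)
oos <- c(1:10, 51:60, 101:110)
train <- ir[-oos, ]
test <- ir[oos, ]

mr <- missRanger(train, pmm.k = 5, keep_forests = TRUE, seed = 1, verbose = 0)
test_filled <- predict(mr, test, seed = 1)

# Numeric columns: absolute error of the imputed cells only
num <- names(test)[sapply(test, is.numeric)]
was_na <- is.na(as.matrix(test[num]))
abs_err <- abs(as.matrix(test_filled[num]) - as.matrix(iris[oos, num]))[was_na]
summary(abs_err)

# Species: share of test rows where the filled data matches the original
mean(test_filled$Species == iris$Species[oos])
```

With only 30 test rows and 10% missing values, these numbers rest on a handful of cells, so they are a sanity check rather than a benchmark.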
The new predict() also works with single row input.
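For instance, the “hard case” row 6 can be passed on its own. This sketch repeats the setup from the example so that it runs standalone; the object name one_row is an arbitrary choice.

```r
library(missRanger)

# Same setup as in the example
ir <- generateNA(iris, p = 0.1, seed = 11)
oos <- c(1:10, 51:60, 101:110)
train <- ir[-oos, ]
test <- ir[oos, ]

mr <- missRanger(train, pmm.k = 5, keep_forests = TRUE, seed = 1, verbose = 0)

# Fill a single observation, here the row with two missing values
one_row <- predict(mr, test[6, ], seed = 1)
one_row
```

This is the typical situation when a fitted imputation model is applied to new observations arriving one at a time.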
Learn more about {missRanger}
- Basics: https://mayer79.github.io/missRanger/articles/missRanger.html
- Multiple imputation: https://mayer79.github.io/missRanger/articles/multiple_imputation.html
- Working with survival data: https://mayer79.github.io/missRanger/articles/working_with_censoring.html