
Feature Subsampling For Random Forest Regression


TLDR: The number of subsampled features is a main source of randomness and an important parameter in random forests. Mind the different default values across implementations.

Randomness in Random Forests

Random forests are very popular machine learning models. They are built from easily understandable and well visualizable decision trees and usually give good predictive performance without the need for excessive hyperparameter tuning. Some drawbacks are that they do not scale well to very large datasets and that their predictions are discontinuous in continuous features.

A key ingredient for random forests is—no surprise here—randomness. The two main sources of randomness are:

1. Feature subsampling: at each node split, only a random subset of the features is considered.
2. Bootstrapping: each tree is grown on a random (bootstrap) sample of the rows.

In this post, we want to investigate the first source, feature subsampling, with a special focus on regression problems on continuous targets (as opposed to classification).
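To make these two sources concrete, here is a minimal sketch using scikit-learn's RandomForestRegressor on synthetic data (parameter names are scikit-learn's; other implementations expose the same knobs under different names):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1_000, n_features=12, noise=1.0, random_state=0)

rf = RandomForestRegressor(
    n_estimators=500,
    max_features=0.5,   # source 1: fraction of features considered at each node split
    bootstrap=True,     # source 2: each tree sees a bootstrap sample of the rows
    max_samples=None,   # None means bootstrap samples as large as the training set
    random_state=0,
)
rf.fit(X, y)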

Feature Subsampling

In his seminal paper, Leo Breiman introduced random forests and pointed out several advantages of feature subsampling per node split. We cite from his paper:

The forests studied here consist of using randomly selected inputs or combinations of inputs at each node to grow each tree. The resulting forests give accuracy that compare favorably with Adaboost. This class of procedures has desirable characteristics:

i) Its accuracy is as good as Adaboost and sometimes better.

ii) It’s relatively robust to outliers and noise.

iii) It’s faster than bagging or boosting.

iv) It gives useful internal estimates of error, strength, correlation and variable importance.

v) It’s simple and easily parallelized.

Breiman, L. Random Forests. Machine Learning 45, 5–32 (2001).

Note the focus on comparing with Adaboost at that time, and note that the datasets used for the empirical studies in this paper are relatively small by today’s standards.

If the input data has p features (columns), implementations of random forests usually allow specifying how many features to consider at each split:

Implementation | Language | Parameter | Default
scikit-learn RandomForestRegressor | Python | max_features | p
scikit-learn RandomForestClassifier | Python | max_features | sqrt(p)
ranger | R | mtry | sqrt(p)
randomForest regression | R | mtry | p/3
randomForest classification | R | mtry | sqrt(p)
H2O regression | Python & R | mtries | p/3
H2O classification | Python & R | mtries | sqrt(p)

Note that the scikit-learn default for regression is surprising: it switches off the randomness from feature subsampling, rendering the model equivalent to bagged trees!
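As an illustration (a sketch, not code from this post), the scikit-learn defaults from the table translate into the following constructor arguments; max_features also accepts a float fraction, so randomForest’s p/3 can be mimicked for regression:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

reg_default = RandomForestRegressor()                  # max_features=1.0: all p features, i.e. bagged trees
clf_default = RandomForestClassifier()                 # max_features="sqrt": sqrt(p) features per split

# Re-enable feature subsampling for regression:
reg_third = RandomForestRegressor(max_features=1/3)    # about p/3 features per split
reg_sqrt = RandomForestRegressor(max_features="sqrt")  # sqrt(p) features per split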

While empirical studies on the impact of feature subsampling and on good default choices focus on classification problems (see the literature review in Probst et al. 2019), we consider a set of regression problems with continuous targets. Note that differing results might be related more to different feature spaces than to the difference between classification and regression.

The hyperparameters mtry, sample size and node size are the parameters that control the randomness of the RF. […]. Out of these parameters, mtry is most influential both according to the literature and in our own experiments. The best value of mtry depends on the number of variables that are related to the outcome.

Probst, P. et al. “Hyperparameters and tuning strategies for random forest.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9 (2019): n. pag.

Benchmarks

We selected the following 13 datasets with regression problems:

Dataset | Number of samples | Number of used features p
Allstate | 188,318 | 130
Bike_Sharing_Demand | 17,379 | 12
Brazilian_houses | 10,692 | 12
ames | 1,460 | 79
black_friday | 166,821 | 9
colleges | 7,063 | 49
delays_zurich_transport | 27,327 | 17
diamonds | 53,940 | 6
la_crimes | 1,468,825 | 25
medical_charges_nominal | 163,065 | 11
nyc-taxi-green-dec-2016 | 581,835 | 14
particulate-matter-ukair-2017 | 394,299 | 9
taxi | 581,835 | 18

Note that among those, there is no high-dimensional dataset in the sense of p > number of samples.

On these, we fitted the scikit-learn RandomForestRegressor (within a short pipeline handling missing values) with default parameters. We used 5-fold cross-validation with 4 different values of max_features: p/3 (blue), sqrt(p) (orange), 0.9 p (green), and p (red). We show the mean squared error with uncertainty bars (± one standard deviation across the cross-validation splits); the lower, the better.
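The exact pipeline is in the linked notebook below; as an illustrative sketch, with a hypothetical dataset already loaded into numeric arrays X and y, the benchmark loop looks roughly like this:

from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

for max_features in [1/3, "sqrt", 0.9, 1.0]:
    model = make_pipeline(
        SimpleImputer(),  # short pipeline handling missing values
        RandomForestRegressor(max_features=max_features, n_jobs=-1, random_state=0),
    )
    cv = cross_validate(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(
        max_features,
        -cv["test_score"].mean(),  # mean squared error, lower is better
        cv["test_score"].std(),    # spread across the 5 folds
        cv["fit_time"].sum(),      # total fit time in seconds
    )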

In addition, we report the fit time of each (5-fold) fit in seconds; again, the lower, the better.

Note that sqrt(p) is often smaller than p/3. With this in mind, these graphs show that the fit time is roughly proportional to the number of subsampled features.

Conclusion

The full code can be found here:

https://github.com/lorentzenchr/notebooks/blob/master/random_forests_max_features.ipynb
