New package for ensembling R models
I’ve written a new R package called caretEnsemble for creating ensembles of caret models. It currently works well for regression models, and I’ve written some preliminary support for binary classification models.
At this point, I’ve got 2 different algorithms for combining models:
1. Greedy stepwise ensembles (returns a weight for each model)
2. Stacks of caret models
(You can also manually specify weights for a greedy ensemble)
The greedy algorithm is based on the work of Caruana et al. (2004) and inspired by the medley package on GitHub. The stacking algorithm simply builds a second caret model on top of the existing models (using their predictions as input), and employs all of the flexibility of the caret package.
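To give a flavor of what the greedy algorithm does, here’s a toy sketch of Caruana-style hill-climbing. This is just an illustration of the idea, not the package’s actual code: preds is assumed to be a matrix of out-of-fold predictions (one column per model) and y the observed outcome.

#Toy illustration of greedy, stepwise ensemble selection (Caruana et al., 2004)
#preds: n x m matrix of out-of-fold predictions; y: observed outcome
greedy_weights <- function(preds, y, iter=1000L) {
  rmse <- function(fit) sqrt(mean((fit - y)^2))
  counts <- rep(0L, ncol(preds)) #How often each model is selected
  ens <- rep(0, length(y))       #Running sum of the selected predictions
  for (i in seq_len(iter)) {
    #Score each model when added (with replacement) to the current ensemble
    scores <- apply(preds, 2, function(p) rmse((ens + p) / i))
    best <- which.min(scores)
    counts[best] <- counts[best] + 1L
    ens <- ens + preds[, best]
  }
  counts / iter #Weights are selection frequencies
}

Because models are selected with replacement, strong models can accumulate large weights while weak models drop to zero.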
All the models in the ensemble must use the same training/test folds. Both algorithms use the out-of-sample predictions to find the weights and train the stack. Here’s a brief script demonstrating how to use the package:
#Setup
rm(list = ls(all = TRUE))
gc(reset=TRUE)
set.seed(42) #From random.org

#Libraries
library(caret)
library(devtools)
install_github('caretEnsemble', 'zachmayer') #Install zach's caretEnsemble package
library(caretEnsemble)

#Data
library(mlbench)
data(BostonHousing2)
X <- model.matrix(cmedv~crim+zn+indus+chas+nox+rm+age+dis+
                    rad+tax+ptratio+b+lstat+lat+lon, BostonHousing2)[,-1]
X <- data.frame(X)
Y <- BostonHousing2$cmedv

#Split train/test
train <- runif(nrow(X)) <= .66

#Setup CV Folds
#returnData=FALSE saves some space
folds=5
repeats=1
myControl <- trainControl(method='cv', number=folds, repeats=repeats, returnResamp='none',
                          returnData=FALSE, savePredictions=TRUE,
                          verboseIter=TRUE, allowParallel=TRUE,
                          index=createMultiFolds(Y[train], k=folds, times=repeats))
PP <- c('center', 'scale')

#Train some models
model1 <- train(X[train,], Y[train], method='gbm', trControl=myControl,
                tuneGrid=expand.grid(.n.trees=500, .interaction.depth=15, .shrinkage=0.01))
model2 <- train(X[train,], Y[train], method='blackboost', trControl=myControl)
model3 <- train(X[train,], Y[train], method='parRF', trControl=myControl)
model4 <- train(X[train,], Y[train], method='mlpWeightDecay', trControl=myControl, trace=FALSE, preProcess=PP)
model5 <- train(X[train,], Y[train], method='ppr', trControl=myControl, preProcess=PP)
model6 <- train(X[train,], Y[train], method='earth', trControl=myControl, preProcess=PP)
model7 <- train(X[train,], Y[train], method='glm', trControl=myControl, preProcess=PP)
model8 <- train(X[train,], Y[train], method='svmRadial', trControl=myControl, preProcess=PP)
model9 <- train(X[train,], Y[train], method='gam', trControl=myControl, preProcess=PP)
model10 <- train(X[train,], Y[train], method='glmnet', trControl=myControl, preProcess=PP)

#Make a list of all the models
all.models <- list(model1, model2, model3, model4, model5, model6, model7, model8, model9, model10)
names(all.models) <- sapply(all.models, function(x) x$method)
sort(sapply(all.models, function(x) min(x$results$RMSE)))

#Make a greedy ensemble - currently can only use RMSE
greedy <- caretEnsemble(all.models, iter=1000L)
sort(greedy$weights, decreasing=TRUE)
greedy$error

#Make a linear regression ensemble
linear <- caretStack(all.models, method='glm', trControl=trainControl(method='cv'))
summary(linear$ens_model$finalModel)
linear$error

#Predict for test set:
preds <- data.frame(sapply(all.models, predict, newdata=X[!train,]))
preds$ENS_greedy <- predict(greedy, newdata=X[!train,])
preds$ENS_linear <- predict(linear, newdata=X[!train,])
sort(sqrt(colMeans((preds - Y[!train]) ^ 2)))
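The script above is regression-only. Since the classification support is still preliminary I won’t demo it fully, but here is a rough sketch of how a binary-classification setup might look. The Sonar data, the model choices, and the ROC metric are my own illustrative assumptions, and the exact behavior of caretEnsemble on classification models may differ.

#Sketch of a binary-classification setup (illustrative assumptions: Sonar data,
#model choices, ROC metric; classification support in the package is preliminary)
library(mlbench)
data(Sonar)
X2 <- Sonar[, -ncol(Sonar)]
Y2 <- Sonar$Class #Two-class factor

#Shared resampling indexes so every model sees the same folds
myControl2 <- trainControl(method='cv', number=5, savePredictions=TRUE,
                           classProbs=TRUE, summaryFunction=twoClassSummary,
                           index=createMultiFolds(Y2, k=5, times=1))

m1 <- train(X2, Y2, method='glmnet', metric='ROC', trControl=myControl2, preProcess=PP)
m2 <- train(X2, Y2, method='rf', metric='ROC', trControl=myControl2)

#Combine exactly as in the regression example
class.models <- list(m1, m2)
names(class.models) <- sapply(class.models, function(x) x$method)
greedy.class <- caretEnsemble(class.models)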
Please feel free to submit any comments here or on GitHub. I’d also be happy to include any patches you feel like submitting. In particular, I could use some help writing support for multi-class models, writing more tests, and fixing bugs.