Motivation
The name bagging comes from bootstrap aggregating. It is a machine learning technique proposed by Breiman (1996) to increase the stability of potentially unstable estimators. For example, suppose you want to run a regression with a few variables in two steps. First, you run the regression with all the variables in your data and select the significant ones. Second, you run a new regression using only the selected variables and compute the predictions.
This procedure is not wrong if your goal is forecasting. However, this two-step estimation may result in highly unstable models. If many variables are important but each one's individual contribution is small, you will probably leave some of them out, and small perturbations in the data may drastically change the results. A minimal sketch of the two-step procedure is given below.
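To make the idea concrete, here is a minimal sketch of the two-step procedure on a generic dataset (the data frame df and the response y are placeholders, not objects used later in this post):

# Hypothetical data frame `df` with a numeric response `y` and several candidate regressors
full_fit  = lm(y ~ ., data = df)                    # step 1: regression with all variables
t_stats   = summary(full_fit)$coefficients[-1, 3]   # t-statistics (intercept excluded)
keep      = names(which(abs(t_stats) >= 1.96))      # "significant" variables at the 5% level
final_fit = lm(reformulate(keep, response = "y"), data = df)  # step 2: re-estimate
preds     = predict(final_fit, newdata = df)        # predictions from the selected model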
Bagging
Bagging solves the instability problem by giving each variable several chances to enter the model. The procedure described in the previous section is repeated on many bootstrap samples, and the final forecast is the average of the forecasts across all these samples, as sketched below.
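In generic terms, the Bagging forecast amounts to the loop below (a minimal sketch; the function fit_and_predict and the objects y_train, X_train and X_test are hypothetical placeholders for any unstable procedure and dataset):

B = 500                                    # number of bootstrap replications
forecasts = matrix(NA, nrow(X_test), B)    # one column of forecasts per replication
for (b in 1:B) {
  ind = sample(nrow(X_train), replace = TRUE)   # bootstrap sample (with replacement)
  forecasts[, b] = fit_and_predict(y_train[ind], X_train[ind, ], X_test)  # hypothetical estimator
}
y_bagged = rowMeans(forecasts)             # final forecast = average across replications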
In this post I will show an example of Bagging applied to regression in the variable-selection context described above, but keep in mind that the technique can be used with any unstable model. The most successful case is Bagging applied to regression trees, which is the basis of the famous random forest. Bagging is also very robust to overfitting and produces forecasts with less variance.
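As an illustration, assuming the randomForest package is installed, bagged regression trees can be obtained by letting every tree see all the variables (setting mtry to the number of predictors turns off the extra randomization that distinguishes a random forest from plain Bagging); the objects X_train, y_train and X_test are again hypothetical:

library(randomForest)
# Bagged regression trees: mtry = number of predictors, so no random subsetting of variables
bag_trees = randomForest(x = X_train, y = y_train, ntree = 500, mtry = ncol(X_train))
pred_bag_trees = predict(bag_trees, newdata = X_test)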
Application
I am going to show an application of Bagging from scratch, using the boot package for the bootstrap replications. We start by generating some data with the following DGP (data generating process):
The first equation shows how the variable we want to predict, y, is generated from the covariates, and the second shows that the covariates share a common factor f:

y_i = x_i'β + ε_i,  i = 1, …, N,
x_ij = a_j f_i + u_ij,  j = 1, …, K,

where ε, f and u are Gaussian noise and the coefficients a_j are drawn from a uniform distribution on [-1, 1].
N = 300        # = Number of obs. = #
K = 40         # = Number of Variables = #
set.seed(123)  # = Seed for replication = #
f = rnorm(N)         # = Common factor = #
A = runif(K, -1, 1)  # = Coefficients from the second equation = #
X = t(A %*% t(f)) + matrix(rnorm(N*K), N, K)  # = Generate xij = #
beta = c((-1)^(1:(K-10)), rep(0, 10))  # = Coefficients from the first equation, the first 30 are equal to 1 or -1 and the last 10 are 0 = #

# = R2 setup = #
aux = var(X %*% beta)
erro = rnorm(N, 0, sqrt(2*aux))  # = The variance of the error will be twice the variance of X %*% beta = #
y = X %*% beta + erro            # = Generate y = #
1 - var(erro)/var(y)             # = R2 = #

##           [,1]
## [1,] 0.2925866

# = Break data into in-sample and out-of-sample = #
y.in = y[1:(N-100)]
X.in = X[1:(N-100), ]
y.out = y[-c(1:(N-100))]
X.out = X[-c(1:(N-100)), ]
Now we must define a function to be called by the bootstrap. This function must receive the data and an argument for the indexes that tell R where to sample. It runs a linear regression with all the variables, selects those with absolute t-statistics bigger than 1.96, runs a second regression with only the selected variables, and returns the full coefficient vector, with zeros for the variables that were left out.
bagg = function(data, ind) {
  sample = data[ind, ]                          # = Bootstrap sample = #
  y = sample[, 1]
  X = sample[, 2:ncol(data)]
  model1 = lm(y ~ X)                            # = Regression with all variables = #
  t.stat = summary(model1)$coefficients[-1, 3]
  selected = which(abs(t.stat) >= 1.96)         # = Variables with |t| >= 1.96 = #
  model2 = lm(y ~ X[, selected])                # = Regression with the selected variables = #
  coefficients = coef(model2)
  cons = coefficients[1]
  betas = coefficients[-1]
  aux = rep(0, ncol(X))
  aux[selected] = betas                         # = Zeros for the variables left out = #
  res = c(cons, aux)
  return(res)
}
Now we are ready to use the boot function to run the Bagging. The code below does precisely that with 500 bootstrap replications, and it also estimates the simple two-step OLS procedure for comparison. If you look at the object selected after running this code, you will see that many relevant variables were left out and some variables with true coefficients equal to zero were kept. The last lines compute the out-of-sample RMSE of both approaches.
library(boot)

# = Bagging = #
bagging = boot(data = cbind(y.in, X.in), statistic = bagg, R = 500)
bagg.coef = bagging$t
y.pred.bagg = rowMeans(cbind(1, X.out) %*% t(bagg.coef))

# = OLS = #
ols0 = lm(y.in ~ X.in)
selected = which(abs(summary(ols0)$coefficients[-1, 3]) >= 1.96)
ols1 = lm(y.in ~ X.in[, selected])
y.pred = cbind(1, X.out[, selected]) %*% coef(ols1)

# = Forecasting RMSE = #
sqrt(mean((y.pred - y.out)^2))       # = OLS = #

## [1] 10.43338

sqrt(mean((y.pred.bagg - y.out)^2))  # = Bagging = #

## [1] 9.725554
The results show that Bagging has a smaller out-of-sample RMSE than the simple OLS procedure: 9.73 against 10.43. The plot below compares the coefficients from the OLS regression with all variables to the average Bagging coefficients across the bootstrap replications (in red).
barplot(coef(ols0)[-1])
barplot(colMeans(bagg.coef)[-1], add = TRUE, col = "red")
Finally, what would happen if we had no instability? You can check it by running the same code again with a smaller error variance in the line erro = rnorm(N, 0, sqrt(2*aux)) of the DGP. As you can see if you do so, Bagging and the simple procedure give the same results in the absence of instability.
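One way to shrink the error variance is the replacement line below (a hypothetical modification; any sufficiently small variance will do):

# = Hypothetical modification: make the error variance small relative to X %*% beta = #
erro = rnorm(N, 0, sqrt(aux/2))  # = replaces erro = rnorm(N, 0, sqrt(2*aux)) in the DGP = #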
References
Breiman, Leo (1996). “Bagging predictors”. Machine Learning. 24 (2): 123–140