Solutions for Multicollinearity in Regression (2)
Continuing the discussion of multicollinearity in regression, we first need to show how to calculate the VIF and the condition number in software such as R. This is really easy: vif() in the car package and the base function kappa() can be applied to calculate the VIF and the condition number, respectively. Consider the data from the last article of this series as an example:
> library(car)   # provides vif()
> # VIF
> vif(lm(GNP ~ ., data = longley))
GNP.deflator   Unemployed Armed.Forces   Population         Year
   81.946226    35.924858     9.406108   171.158675  1017.609561
    Employed
  196.247880
> # condition number
> kappa(longley[, -1])
[1] 8521.126
From the output it is clear that both the VIFs and the condition number are extremely large, which means the data exhibit severe multicollinearity. (Common rules of thumb flag VIF values above 10 and condition numbers above 1000 as signs of serious multicollinearity.)
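One usage note worth adding here (not in the original post): by default kappa() returns a fast approximation to the condition number; passing exact = TRUE computes it exactly from the singular value decomposition.

kappa(longley[, -1], exact = TRUE)   # exact condition number via the SVD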
2 Lasso and Least Angle Regression
Besides ridge regression, the lasso is another feasible and straightforward way. Lasso is the abbreviation of Least Absolute Shrinkage and Selection Operator, and its motivation is actually similar to that of ridge regression.
The main difference between ridge regression and the lasso is that the former uses a squared ℓ2 penalty, while the latter uses an ℓ1 penalty; the two objectives are written out after the function list below. Due to this difference, their solutions behave very differently. To implement lasso regression in R, we consider the package lars, which provides efficient procedures for fitting such models. The following three functions in lars are particularly useful:
(1) lars(): Fits Least Angle Regression (discussed later), Lasso, and Infinitesimal Forward Stagewise regression models.
(2) cv.lars(): Computes the K-fold cross-validated error curve for lars.
(3) plot.lars(): Plot method for lars objects.
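To make the penalty difference concrete, the two estimators can be written as penalized least-squares problems (standard textbook formulations, added here for reference):

\hat{\beta}_{\mathrm{ridge}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2
\hat{\beta}_{\mathrm{lasso}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1

The ℓ1 penalty is what allows the lasso to set some coefficients exactly to zero, so it performs variable selection as well as shrinkage.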
We again use the data demonstrated in the last article. Please run the code below; based on its output, the model can be built easily.
library(lars)
y <- matrix(longley[, 1])    # lars() only accepts matrix input
x <- as.matrix(longley[, -1])
lasso <- lars(x, y)          # type = "lasso" is the default
plot(lasso)                  # coefficient paths
summary(lasso)               # Df, RSS and Cp at each step
cvr <- cv.lars(x, y, K = 10) # 10-fold cross-validation
best <- cvr$index[which.min(cvr$cv)]              # fraction with the smallest CV error
coef0 <- coef.lars(lasso, mode = "fraction", s = best)
s <- which.min(lasso$Cp)[1]                       # step with the smallest Mallows' Cp
coef1 <- coef.lars(lasso, mode = "step", s = s)
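As a quick check of the selected model (a sketch added here, not part of the original code), predict() on a lars object accepts the same s and mode arguments and returns fitted values, which can be compared with the ordinary least-squares fit:

fit_lasso <- predict(lasso, x, s = best, mode = "fraction")$fit  # fit at the CV-chosen fraction
fit_ols   <- fitted(lm(GNP ~ ., data = longley))                 # plain least-squares fit
head(cbind(lasso = fit_lasso, ols = fit_ols))                    # shrinkage makes the two differ slightly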
Besides, least angle regression is a possible method as well; it can also be implemented with lars(), by setting the argument type to "lar".
lar <- lars(x, y, type = "lar")
plot(lar)
summary(lar)
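The summary prints Df, RSS, and Mallows' Cp for every step of the path. Mirroring the lasso code above (my sketch, following the same min-Cp convention), a model can be picked at the step with the smallest Cp:

s_lar    <- which.min(lar$Cp)                          # step minimizing Cp
coef_lar <- coef.lars(lar, mode = "step", s = s_lar)   # coefficients at that step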
Note that computing the lasso solutions is a quadratic programming problem, which can be tackled by standard numerical-analysis algorithms; the least angle regression procedure, however, is a more efficient approach for tracing out the whole solution path.