Solutions for Multicollinearity in Regression (2)


We continue the discussion of multicollinearity in regression. First, it is worth showing how to calculate the VIF and the condition number in software such as R, which is straightforward: vif() in the car package and the base function kappa() compute the VIF and the condition number, respectively. Consider the data from the previous article of this series as an example.

> #vif
> vif(lm(GNP~.,data=longley));
GNP.deflator   Unemployed Armed.Forces   Population         Year 
   81.946226    35.924858     9.406108   171.158675  1017.609561 
    Employed 
  196.247880 
> #condition number
> kappa(longley[,-1]);
[1] 8521.126

From the output, it is clear that both the VIFs and the condition number are extremely large, which means the data exhibit severe multicollinearity.
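As a quick cross-check (a sketch, not from the original post): kappa() returns an approximation of the condition number by default; the exact value can be requested with exact = TRUE, or computed directly as the ratio of the largest to the smallest singular value of the design matrix.

X <- as.matrix(longley[, -1])
kappa(X, exact = TRUE)    # exact 2-norm condition number
d <- svd(X)$d
max(d) / min(d)           # same quantity from the singular values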

2 Lasso and Least Angle Regression

Besides ridge regression, the lasso is another feasible and straightforward approach. Lasso stands for Least Absolute Shrinkage and Selection Operator, and its motivation is similar to that of ridge regression.

The main difference between ridge regression and the lasso is that the former uses a (squared) l_2 penalty, while the latter uses an l_1 penalty. Because of this difference, their solutions behave very differently: the l_1 penalty tends to shrink some coefficients exactly to zero, so the lasso also performs variable selection.
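In the usual notation (a standard textbook formulation, not taken from the original post), the two estimators solve

\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2, \qquad \hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1,

where \lambda \ge 0 controls the amount of shrinkage.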

To implement the lasso in R, we consider the package lars, which provides efficient fitting procedures. The following three functions in lars are particularly useful:

(1) lars(): fits Least Angle Regression (mentioned later), lasso, and Infinitesimal Forward Stagewise regression models.

(2) cv.lars(): computes the K-fold cross-validated error curve for a lars fit.

(3) plot.lars(): the plot method for lars objects.

We again use the data demonstrated in the previous article. Run the code below; based on its output, the model can be built easily.

library(lars)
# lars() works on matrices, not data frames
y <- matrix(longley[, 1])
x <- as.matrix(longley[, -1])
lasso <- lars(x, y)                  # fit the full lasso path
plot(lasso)                          # coefficient paths
summary(lasso)                       # df, RSS and Cp for each step
cvr <- cv.lars(x, y, K = 10)         # 10-fold cross-validated error curve
best <- cvr$index[which.min(cvr$cv)]            # fraction with the lowest CV error
coef0 <- coef.lars(lasso, mode = "fraction", s = best)
s <- which.min(lasso$Cp)[1]                     # step with the lowest Cp
coef1 <- coef.lars(lasso, mode = "step", s = s)

[Figure: lasso coefficient path plot produced by plot(lasso)]
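One possible follow-up (a sketch, not part of the original post): once the cross-validated fraction best has been chosen above, fitted values at that amount of shrinkage can be obtained with predict() on the same lars object.

# fitted values at the cross-validated fraction `best`
pred <- predict(lasso, newx = x, s = best, mode = "fraction")
head(pred$fit)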

Besides the lasso, least angle regression is a possible method as well, and it can also be implemented with lars(); the only change is to set the type argument to "lar".

lar <- lars(x, y, type = "lar")   # least angle regression path
plot(lar)
summary(lar)

Note that computing the lasso solution is a quadratic programming problem, which can be handled by standard numerical optimization algorithms; however, the least angle regression procedure is a more efficient approach, since a small modification of LARS traces out the entire lasso path at roughly the cost of a single least squares fit.
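To see the difference between the two procedures in practice, one can print the two fitted objects from the code above (a small sketch, assuming the lasso and lar objects created earlier):

# printing a lars object shows its sequence of moves;
# the lasso path may drop variables from the active set,
# while the plain LAR path only adds them
print(lasso)
print(lar)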

