Twelve Days 2013: LASSO Regression
Day Eight: LASSO Regression
TL/DR
LASSO regression (least absolute shrinkage and selection operator) is a modified form of least squares regression that penalizes model complexity via a regularization parameter. It does so by including a term proportional to $||\beta||_{l_1}$ in the objective function, which shrinks coefficients towards zero and can even eliminate them entirely. In that light, LASSO is a form of feature selection/dimensionality reduction. Unlike other forms of regularization such as ridge regression, LASSO will actually eliminate predictors. It's a simple, useful technique that performs quite well on many data sets.
Regularization
Regularization refers to the process of adding constraints to a problem to avoid overfitting. ML techniques such as neural networks can generate models of arbitrary complexity that fit the in-sample data exactly. As we recently saw in the post on Reed-Solomon FEC codes, the same applies to regression. We definitely have a problem whenever there are more regressors than data points, but any excessively complex model will generalize horribly and do you no good out of sample.
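To make that concrete, here's a minimal sketch (my addition, not from the original post) of what goes wrong: with more regressors than observations, ordinary least squares reproduces pure noise perfectly in sample and is useless out of sample. The variable names and sizes are arbitrary.

# More regressors (p = 50) than observations (n = 20): OLS fits noise exactly.
set.seed(1)
n <- 20; p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)                        # pure noise -- there is no signal to find
fit <- lm(y ~ x)                     # rank deficient; extra coefficients come back NA
sum(residuals(fit)^2)                # essentially zero: a "perfect" in-sample fit

# The same model does nothing useful on new data from the same distribution.
beta <- coef(fit)
beta[is.na(beta)] <- 0
x.new <- matrix(rnorm(n * p), n, p)
y.new <- rnorm(n)
mean((y.new - cbind(1, x.new) %*% beta)^2)   # out-of-sample MSE is large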
Why LASSO?
There's a litany of regularization techniques for regression, ranging from heuristic, hands-on ones like stepwise regression to full-blown dimensionality reduction. They all have their place, but I like LASSO because it works very well and it's simpler than most dimensionality reduction/ML techniques. And, despite being a non-linear method, as of 2008 it has a relatively efficient solution via coordinate descent: we can solve the optimization in $O(n \cdot p)$ time, where $n$ is the length of the data set and $p$ is the number of regressors.
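For intuition, here's a rough sketch of the coordinate descent idea (my own illustration, not glmnet's implementation): cycle over the predictors, and for each one apply a univariate least squares update followed by soft thresholding. It assumes a centered response and centered predictors (no intercept), and the names lasso.cd and soft.threshold are made up for this example.

soft.threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

lasso.cd <- function(x, y, lambda, n.iter = 100) {
  # Minimizes (1 / (2 * n)) * sum((y - x %*% beta)^2) + lambda * sum(abs(beta)),
  # assuming y and the columns of x have been centered.
  n <- nrow(x); p <- ncol(x)
  beta <- rep(0, p)
  r <- y                                  # current residual y - x %*% beta
  for (iter in seq_len(n.iter)) {
    for (j in seq_len(p)) {
      r.j <- r + x[, j] * beta[j]         # partial residual with predictor j removed
      beta[j] <- soft.threshold(sum(x[, j] * r.j) / n, lambda) / (sum(x[, j]^2) / n)
      r <- r.j - x[, j] * beta[j]         # restore the full residual
    }
  }
  beta
}

Each pass over the predictors is $O(n \cdot p)$ work thanks to the incremental residual update; production solvers add warm starts along a $\lambda$ path, active-set tricks, and convergence checks on top of this basic loop.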
An Example
Our objective function has the form:

$$\hat{\beta} = \underset{\beta}{\operatorname{arg\,min}} \; \sum_{i=1}^{n} \left( y_i - x_i^T \beta \right)^2 + \lambda ||\beta||_{l_1}$$
where $\lambda \geq 0$. The first half of the equation is just the standard objective function for least squares regression. The second half penalizes regression coefficients under the $l_1$ norm. The parameter $\lambda$ determines how important the penalty on coefficient weights is.
There are two R packages that I know of for LASSO: lars (short for least angle regression, a superset of LASSO) and glmnet. glmnet includes solvers for more general models, including the elastic net (a hybrid of the LASSO and ridge penalties) and generalized linear models that can handle categorical variables. Lars is simpler to work with but the documentation isn't great. As such, here are a few points worth noting:
- The primary lars function generates an object that's subsequently used to generate the fit that you actually want. There's a computational motivation behind this approach. The LARS technique works by solving for a series of "knot points" with associated, monotonically decreasing values of $\lambda$. The knot points are subsequently used to compute the LASSO regression for any value of $\lambda$ using only matrix math. This makes procedures such as cross validation, where we need to try lots of different values of $\lambda$, computationally tractable. Without it, we would have to recompute an expensive non-linear optimization each time $\lambda$ changed (see the short sketch after this list).
- There's a saturation point at which $\lambda$ is high enough that the null model is optimal. On the other end of the spectrum, when $\lambda = 0$, we're left with least squares. The final value of $\lambda$ on the path, right before we end up with least squares, will correspond to the largest coefficient norm. Let's call these coefficients $\beta_\text{thresh}$, and denote $\Delta = || \beta_\text{thresh} ||_{l_1}$. When the lars package does cross validation, it does so by computing the MSE for models where the second term in the objective function is fixed at $x \cdot \Delta,\ x \in [0, 1]$. This works from a calculation standpoint (and it keeps the computation tidy), but it's counterintuitive if you're interested in the actual value of $\lambda$ rather than just the regression coefficients. You could easily write your own cross validation routine to use $\lambda$ directly.
- The residual sum of squared errors will increase monotonically with $\lambda$. This makes sense as we’re trading off between minimizing RSS and the model’s complexity. As such, the smallest RSS will always correspond to the smallest value of $\lambda$, and not necessarily the optimal one.
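Before the full example, here's a small illustration of the first point above (my addition): a single lars() call computes the entire path, after which coefficients at any number of penalty values can be recovered with cheap matrix operations rather than by re-running the optimization. The fractions below are arbitrary.

require(lars)
data(diabetes)

# One expensive call computes the full sequence of knot points...
path <- lars(diabetes$x, diabetes$y, type = "lasso")

# ...and fits at arbitrary fractions of the final L1 norm are then nearly free.
predict(path, type = "coefficients", mode = "fraction",
        s = c(0.1, 0.25, 0.5, 0.75, 1.0))$coefficients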
Here's a simple example using data from the lars package. We'll follow a common heuristic that chooses the penalty whose cross-validated MSE is within one standard deviation of the minimum. Personally I prefer to examine the CV L-curve and pick a value right on the elbow, but this works.
require(lars)
data(diabetes)

# Compute MSEs for a range of coefficient penalties, expressed as a fraction
# of the final L1 norm on the interval [0, 1].
cv.res <- cv.lars(diabetes$x, diabetes$y, type = "lasso",
                  mode = "fraction", plot = FALSE)

# Choose an "optimal" value one standard deviation away from the
# minimum MSE.
opt.frac <- min(cv.res$cv) + sd(cv.res$cv)
opt.frac <- cv.res$index[which(cv.res$cv < opt.frac)[1]]

# Compute the LARS path.
lasso.path <- lars(diabetes$x, diabetes$y, type = "lasso")

# Compute a fit given the LARS path that we precomputed, and our optimal
# fraction of the final L1 norm.
lasso.fit <- predict.lars(lasso.path, type = "coefficients",
                          mode = "fraction", s = opt.frac)

# Extract the final vector of regression coefficients.
coef(lasso.fit)
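As an aside (my addition, not part of the original example): if you'd rather work with $\lambda$ directly, as discussed in the notes above, glmnet's cv.glmnet reports the $\lambda$ sequence itself and bakes in a one-standard-error rule via lambda.1se. A rough equivalent of the fit above might look like the sketch below, though glmnet standardizes predictors and scales its penalty differently, so the selected model won't match the lars result exactly.

require(glmnet)
require(lars)
data(diabetes)

# Cross validate over glmnet's own lambda sequence (alpha = 1 is the pure LASSO penalty).
cv.fit <- cv.glmnet(as.matrix(diabetes$x), as.numeric(diabetes$y), alpha = 1)

# Coefficients at the largest lambda within one standard error of the minimum CV error.
coef(cv.fit, s = "lambda.1se")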
Final Notes
LASSO is a biased, linear estimator whose bias increases with $\lambda$. It's not meant to provide the "best" fit in the Gauss-Markov sense; LASSO aims to find models that generalize well. Feature selection is a hard problem, and the best that we can do is a combination of common sense and model inference. However, no technique will save you from the worst case scenario: two very highly correlated variables, one of which is a good predictor, the other of which is spurious. It's a crapshoot as to which predictor a feature selection algorithm will penalize in that case. LASSO has a few technical issues as well. Omitted variable bias is still an issue, as it is in other forms of regression, and because of its non-linear solution, LASSO isn't invariant under transformations of the original data matrix.
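To illustrate the correlated-predictor caveat, here's a toy simulation (my addition; the variable names, seed, and settings are arbitrary): two nearly identical predictors, only one of which actually drives the response. How the coefficient path splits the weight between them is driven by noise and can flip under a different seed.

require(lars)

set.seed(42)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)   # nearly collinear with x1
y  <- 2 * x1 + rnorm(n)          # only x1 carries signal

# Inspect the coefficient path: the split between x1 and x2 is unstable.
fit <- lars(cbind(x1, x2), y, type = "lasso")
coef(fit)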