I’m writing a series of posts on various function options of the glmnet
function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation.
In this post, we will look at the type.gaussian option.
For reference, here is the full signature of the glmnet
function (v3.0-2):
glmnet(x, y, family = c("gaussian", "binomial", "poisson", "multinomial", "cox", "mgaussian"), weights, offset = NULL, alpha = 1, nlambda = 100, lambda.min.ratio = ifelse(nobs < nvars, 0.01, 1e-04), lambda = NULL, standardize = TRUE, intercept = TRUE, thresh = 1e-07, dfmax = nvars + 1, pmax = min(dfmax * 2 + 20, nvars), exclude, penalty.factor = rep(1, nvars), lower.limits = -Inf, upper.limits = Inf, maxit = 1e+05, type.gaussian = ifelse(nvars < 500, "covariance", "naive"), type.logistic = c("Newton", "modified.Newton"), standardize.response = FALSE, type.multinomial = c("ungrouped", "grouped"), relax = FALSE, trace.it = 0, ...)
type.gaussian
According to the official R documentation,
Two algorithm types are supported for (only) family="gaussian". The default when nvar < 500 is type.gaussian="covariance", and saves all inner-products ever computed. This can be much faster than type.gaussian="naive", which loops through nobs every time an inner-product is computed. The latter can be far more efficient for nvar >> nobs situations, or when nvar > 500.
Generally speaking, there is no need for you as the user to change this option.
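If you do want to force one algorithm or the other, it is just a matter of passing the option explicitly. Here is a minimal sketch on simulated data; the dimensions and seed are arbitrary choices of mine for illustration:

```r
library(glmnet)

set.seed(1)
n <- 200; p <- 50                       # arbitrary problem size for illustration
x <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)

# Both calls fit the same model; only the internal algorithm differs.
fit_cov   <- glmnet(x, y, family = "gaussian", type.gaussian = "covariance")
fit_naive <- glmnet(x, y, family = "gaussian", type.gaussian = "naive")
```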
How do the fitting times compare?
I ran a timing simulation to compare the function run times of type.gaussian="naive"
vs. type.gaussian="covariance"
across a range of values for the number of observations (nobs) and the number of features (nvars).
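For reference, a stripped-down version of such a timing experiment might look like the sketch below. The grid of problem sizes, the 5 repetitions and the use of system.time() are illustrative choices of mine, not necessarily what generated the plots that follow.

```r
library(glmnet)

# Time one glmnet fit on freshly simulated data of a given size.
time_one <- function(nobs, nvars, type) {
  x <- matrix(rnorm(nobs * nvars), nrow = nobs)
  y <- rnorm(nobs)
  system.time(glmnet(x, y, family = "gaussian", type.gaussian = type))["elapsed"]
}

# Illustrative grid of problem sizes, 5 repetitions each.
grid <- expand.grid(nobs = c(100, 1000), nvars = c(50, 500, 5000), rep = 1:5)
grid$naive      <- mapply(time_one, grid$nobs, grid$nvars, MoreArgs = list(type = "naive"))
grid$covariance <- mapply(time_one, grid$nobs, grid$nvars, MoreArgs = list(type = "covariance"))
grid$ratio <- grid$naive / grid$covariance   # > 1 means naive is slower
```

The ratio column is the kind of quantity summarized in the first panel of boxplots below.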
This first panel of boxplots shows the time taken for type.gaussian="naive"
to complete as a fraction (or multiple) of that for type.gaussian="covariance"
(each boxplot represents 5 simulation runs). As advertised, naive runs more slowly for small values of nvars, with the gap closing (and eventually reversing) as nvars grows large relative to nobs.
This next plot shows the absolute fitting times: note the log scale on both the x and y axes.
So, what algorithms do these two options represent? What follows is based on Statistical Learning with Sparsity by Hastie, Tibshirani and Wainwright (free PDF here, p124 of the PDF, p113 of the book).
Let $y \in \mathbb{R}^n$ denote the response and $X \in \mathbb{R}^{n \times p}$ the matrix of features, with columns $x_1, \dots, x_p$ (so $n$ is nobs and $p$ is nvars), and assume the columns of $X$ have been standardized. For the lasso (alpha = 1), and ignoring the intercept for simplicity, glmnet
is minimizing the following objective function:

$$J(\beta) = \frac{1}{2n} \sum_{i=1}^n \Big( y_i - \sum_{j=1}^p x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^p |\beta_j|.$$

We minimize the expression above by cyclic coordinate descent. That is, we cycle through the features $j = 1, 2, \dots, p$, minimizing over $\beta_j$ while holding all the other coefficients fixed. This gives the update

$$\beta_j \leftarrow S_\lambda \left( \frac{1}{n} \sum_{i=1}^n x_{ij}\, r_i^{(j)} \right), \qquad r_i^{(j)} = y_i - \sum_{k \neq j} x_{ik}\beta_k, \tag{1}$$

where $S_\lambda(z) = \mathrm{sign}(z)\,(|z| - \lambda)_+$ is the soft-thresholding operator and $r_i^{(j)}$ is the partial residual for observation $i$ with feature $j$ left out.

Both of the modes minimize the objective with exactly these updates; they differ only in how the argument of $S_\lambda$ is computed.
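In R, the soft-thresholding operator is a one-liner (shown here just to fix notation; glmnet itself implements all of this in Fortran):

```r
# Soft-thresholding operator: S_lambda(z) = sign(z) * max(|z| - lambda, 0)
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

soft_threshold(c(-2, -0.3, 0.5, 1.5), lambda = 1)
# [1] -1.0  0.0  0.0  0.5
```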
type.gaussian = "naive"
As the features are standardized, we can write the argument in the soft-thresholding operator as

$$\frac{1}{n} \sum_{i=1}^n x_{ij}\, r_i^{(j)} = \frac{1}{n} \sum_{i=1}^n x_{ij}\, r_i + \beta_j, \tag{2}$$

where $r_i = y_i - \sum_{k=1}^p x_{ik}\beta_k$ is the current full residual for observation $i$.

- At a coordinate descent step for feature $j$, if the coefficient doesn't change its value, no updating of the residuals $r_i$ is needed. However, to get the LHS of (2) for the next feature ($j+1$), we need $O(n)$ operations to compute the sum on the RHS of (2).
- If $\beta_j$ changes value, then we have to update the $r_i$'s, then recompute the LHS of (2) for the next feature using the expression on the RHS. This also takes $O(n)$ time.

All in all, a full cycle through all $p$ features takes $O(np)$ operations.
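To make the bookkeeping concrete, here is a rough R sketch of one full cycle of naive updates. This is my own illustrative code, not glmnet's actual implementation; it assumes standardized columns of x, no intercept, and alpha = 1.

```r
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# One full cycle of "naive" coordinate descent updates (illustrative sketch).
# x: n x p matrix with standardized columns, y: response, beta: current coefficients.
naive_cycle <- function(x, y, beta, lambda) {
  n <- nrow(x)
  r <- y - x %*% beta                       # full residuals, O(np) to initialize
  for (j in seq_along(beta)) {
    zj <- sum(x[, j] * r) / n + beta[j]     # O(n): inner product with the residual, eq. (2)
    new_bj <- soft_threshold(zj, lambda)
    if (new_bj != beta[j]) {
      r <- r - x[, j] * (new_bj - beta[j])  # O(n): update residuals only if beta_j moved
      beta[j] <- new_bj
    }
  }
  beta
}
```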
type.gaussian = "covariance"
Ignoring the factor of $\frac{1}{n}$, the argument in the soft-thresholding operator can also be written as

$$\sum_{i=1}^n x_{ij}\, r_i^{(j)} = \langle x_j, y \rangle - \sum_{k \neq j:\, \beta_k \neq 0} \langle x_j, x_k \rangle \beta_k, \tag{3}$$

where $\langle \cdot, \cdot \rangle$ denotes the usual inner product. In this mode, we compute all the inner products $\langle x_j, y \rangle$, $j = 1, \dots, p$, once at the start, and compute and store the inner products $\langle x_j, x_k \rangle$ as features enter the model.

- At a coordinate descent step for feature $j$, if $\beta_j \neq 0$ at the beginning of the step and its value changes, we only need to update the value of $\beta_j$ used on the RHS of (3); no new inner products have to be computed. Then, to calculate the RHS of (3) for the next coordinate descent step, we only need $O(q)$ operations, where $q$ is the number of non-zero coefficients at the moment.
- As such, if no new variables become non-zero in a full cycle through the features, one full cycle takes only $O(pq)$ operations.
- If a new feature $k$ enters the model for the first time (i.e. $\beta_k$ becomes non-zero), then we need to compute and store $\langle x_j, x_k \rangle$ for $j = 1, \dots, p$, which takes $O(np)$ operations.

This form of updating avoids the $O(n)$ inner product with the residual at every single coordinate descent step: once the relevant inner products have been computed and stored, each step costs $O(q)$ rather than $O(n)$. The price is the $O(np)$ time and memory cost each time a new feature enters the model, which is why covariance wins for small nvars while naive wins when nvars is much larger than nobs.
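And here is an analogous rough sketch of the covariance updates, under the same simplifying assumptions (again illustrative only; glmnet's Fortran code organizes the bookkeeping differently). The cache of inner products is built lazily as features enter the model.

```r
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# One full cycle of "covariance"-style coordinate descent updates (illustrative sketch).
# Assumes standardized columns of x, no intercept, and that beta / xtx start out as
# the all-zero vector / an empty cache, so every non-zero coefficient has cached inner products.
covariance_cycle <- function(x, y, beta, lambda,
                             xty = drop(crossprod(x, y)),  # <x_j, y>, computed once up front
                             xtx = list()) {               # lazily cached <x_., x_j> columns
  n <- nrow(x)
  for (j in seq_along(beta)) {
    active <- setdiff(which(beta != 0), j)
    # O(q) given the cache: <x_j, y> - sum over active k of <x_j, x_k> * beta_k, eq. (3)
    cross_sum <- if (length(active)) {
      sum(vapply(active, function(k) xtx[[k]][j], numeric(1)) * beta[active])
    } else 0
    new_bj <- soft_threshold((xty[j] - cross_sum) / n, lambda)
    if (new_bj != 0 && (length(xtx) < j || is.null(xtx[[j]]))) {
      # Feature j enters the model for the first time:
      # compute and store <x_k, x_j> for all k -- the O(np) step.
      xtx[[j]] <- drop(crossprod(x, x[, j]))
    }
    beta[j] <- new_bj
  }
  list(beta = beta, xtx = xtx)
}
```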