A deep dive into glmnet: type.gaussian
I’m writing a series of posts on various function options of the glmnet
function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation.
In this post, we will look at the type.gaussian option.
For reference, here is the full signature of the glmnet
function (v3.0-2):
glmnet(x, y, family = c("gaussian", "binomial", "poisson", "multinomial",
                        "cox", "mgaussian"),
       weights, offset = NULL, alpha = 1, nlambda = 100,
       lambda.min.ratio = ifelse(nobs < nvars, 0.01, 1e-04), lambda = NULL,
       standardize = TRUE, intercept = TRUE, thresh = 1e-07,
       dfmax = nvars + 1, pmax = min(dfmax * 2 + 20, nvars), exclude,
       penalty.factor = rep(1, nvars), lower.limits = -Inf,
       upper.limits = Inf, maxit = 1e+05,
       type.gaussian = ifelse(nvars < 500, "covariance", "naive"),
       type.logistic = c("Newton", "modified.Newton"),
       standardize.response = FALSE,
       type.multinomial = c("ungrouped", "grouped"),
       relax = FALSE, trace.it = 0, ...)
type.gaussian
According to the official R documentation,
Two algorithm types are supported for (only) family="gaussian". The default when nvar < 500 is type.gaussian="covariance", and saves all inner-products ever computed. This can be much faster than type.gaussian="naive", which loops through nobs every time an inner-product is computed. The latter can be far more efficient for nvar >> nobs situations, or when nvar > 500.
Generally speaking, there is no need for you as the user to change this option.
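Still, to make the option concrete, here is a minimal sketch of forcing each algorithm explicitly on simulated data (the data and the tolerance check below are my own illustration, not part of the documentation):

library(glmnet)

set.seed(1)
n <- 100; p <- 50
x <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)

# Force each algorithm explicitly; the two fits should agree up to
# convergence tolerance, since only the internal bookkeeping differs.
fit_cov   <- glmnet(x, y, type.gaussian = "covariance")
fit_naive <- glmnet(x, y, type.gaussian = "naive")

max(abs(fit_cov$beta - fit_naive$beta))  # should be very small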
How do the fitting times compare?
I ran a timing simulation to compare the function run times of type.gaussian="naive" vs. type.gaussian="covariance" for a range of values of the number of observations (nobs, or $n$) and the number of features (nvar, or $p$). The results are shown below. (For the R code that generated these plots, see here.)
This first panel of boxplots shows the time taken for type.gaussian="naive" to complete as a fraction (or multiple) of that for type.gaussian="covariance" (each boxplot represents 5 simulation runs). As advertised, naive runs more slowly for small values of $p$ but more quickly for large values of $p$. The difference seems to be more stark when $n$ is larger.
This next plot shows the absolute fitting times: note the log scale on both the x and y axes.
So, what algorithms do these two options represent? What follows is based on Statistical Learning with Sparsity by Hastie, Tibshirani and Wainwright (free PDF here, p124 of the PDF, p113 of the book).
Let $y_i$ denote the response for observation $i$, and let $x_{ij}$ denote the value of feature $j$ for observation $i$. Assume that the response and the features are standardized to have mean zero so that we don't have to worry about fitting an intercept term. For each value of $\lambda$ in lambda, glmnet is minimizing the following objective function (taking the default $\alpha = 1$, i.e. the lasso, for simplicity):

$$J(\beta) = \frac{1}{2n} \sum_{i=1}^n \Big( y_i - \sum_{j=1}^p x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^p |\beta_j|.$$
We minimize the expression above by cyclic coordinate descent. That is, we cycle through the features $j = 1, \dots, p$. For each $j$, treat $J(\beta)$ as a function of $\beta_j$ only (pretend that $\beta_k$ for $k \neq j$ are fixed), then update $\beta_j$ to the value which minimizes $J(\beta)$. It turns out that this update is very simple:

$$\hat{\beta}_j = S_\lambda \Big( \frac{1}{n} \sum_{i=1}^n x_{ij} r_i^{(j)} \Big), \tag{1}$$

where $S_\lambda(x) = \mathrm{sign}(x)\,(|x| - \lambda)_+$ is the soft-thresholding operator and $r_i^{(j)} = y_i - \sum_{k \neq j} x_{ik} \beta_k$ is the partial residual.
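As a concrete illustration, here is a small R sketch of one such coordinate descent cycle, written directly from update (1) (my own toy code, not glmnet's actual Fortran implementation; it assumes the columns of x are standardized):

# Soft-thresholding operator: S_lambda(z) = sign(z) * max(|z| - lambda, 0)
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# One full cycle of coordinate descent for the lasso objective, computing
# the partial residuals from scratch at every step (costly but transparent).
cd_cycle <- function(x, y, beta, lambda) {
  n <- nrow(x)
  for (j in seq_len(ncol(x))) {
    r_partial <- y - x[, -j, drop = FALSE] %*% beta[-j]  # r_i^(j)
    beta[j] <- soft_threshold(sum(x[, j] * r_partial) / n, lambda)
  }
  beta
}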
Both of the modes minimize $J(\beta)$ in this way. Where they differ is in how they keep track of the quantities needed to perform the update above. From here on, assume that the data has been standardized. (What follows works for unstandardized data as well, just with more complicated expressions.)
type.gaussian = "naive"
As the features are standardized, we can write the argument of the soft-thresholding operator as

$$\frac{1}{n} \sum_{i=1}^n x_{ij} r_i^{(j)} = \frac{1}{n} \sum_{i=1}^n x_{ij} r_i + \beta_j, \tag{2}$$

where $r_i = y_i - \sum_{k=1}^p x_{ik} \beta_k$ is the full residual for observation $i$. In this mode, we keep track of the full residuals $r_i$, $i = 1, \dots, n$.
- At a coordinate descent step for feature $j$, if the coefficient $\beta_j$ doesn't change its value, no updating of the $r_i$'s is needed. However, to get the LHS of (2) for the next feature ($j+1$), we need to make $O(n)$ operations to compute the sum on the RHS of (2).
- If $\beta_j$ changes value, then we have to update the $r_i$'s, then recompute the LHS of (2) for the next feature using the expression on the RHS. This also takes $O(n)$ time.
All in all, a full cycle through all $p$ variables costs $O(np)$ operations.
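Here is how that bookkeeping might look in R, extending the toy cd_cycle() above (again my own sketch of the idea, not glmnet's actual implementation):

# "Naive" cycle: maintain the full residual vector r across steps, paying
# O(n) per coordinate and hence O(np) per full cycle through the features.
cd_cycle_naive <- function(x, y, beta, lambda) {
  n <- nrow(x)
  r <- y - x %*% beta                          # full residuals, computed once
  for (j in seq_len(ncol(x))) {
    b_new <- soft_threshold(sum(x[, j] * r) / n + beta[j], lambda)  # LHS of (2)
    if (b_new != beta[j]) {
      r <- r - x[, j] * (b_new - beta[j])      # O(n) residual update
      beta[j] <- b_new
    }
  }
  beta
}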
type.gaussian = "covariance"
Ignoring the factor of $1/n$, note that the first term on the RHS of (2) can be written as

$$\sum_{i=1}^n x_{ij} r_i = \langle x_j, y \rangle - \sum_{k:\, |\beta_k| > 0} \langle x_j, x_k \rangle \beta_k,$$

where $x_j$ denotes the $j$th column of the design matrix and $\langle \cdot, \cdot \rangle$ denotes the usual inner product.
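To see how this identity enables caching, here is one more toy sketch (my own illustration of the idea, reusing x, y, and soft_threshold() from the snippets above): the inner products $\langle x_j, y \rangle$ are computed once up front, and each $\langle x_j, x_k \rangle$ is computed the first time it is needed and then reused, so an update only needs to touch the currently nonzero coefficients.

# "Covariance" cycle: cache inner products so that, once computed, they are
# never recomputed. Each update then costs O(q), q = number of nonzero betas.
cd_cycle_covariance <- function(x, y, beta, lambda, cache) {
  n <- nrow(x)
  for (j in seq_len(ncol(x))) {
    s <- cache$xty[j]                             # <x_j, y>, precomputed
    for (k in setdiff(which(beta != 0), j)) {
      key <- paste(min(j, k), max(j, k))
      if (is.null(cache$xtx[[key]]))
        cache$xtx[[key]] <- sum(x[, j] * x[, k])  # <x_j, x_k>, computed once
      s <- s - cache$xtx[[key]] * beta[k]
    }
    beta[j] <- soft_threshold(s / n, lambda)
  }
  beta
}

# Cache shared across all cycles (and, in glmnet's case, the whole lambda path):
cache <- new.env()
cache$xty <- drop(crossprod(x, y))                # all <x_j, y> in one pass
cache$xtx <- list()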