Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I’m writing a series of posts on various function options of the glmnet function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation.
In this post, we will look at the offset option.
For reference, here is the full signature of the glmnet function:
glmnet(x, y,
       family = c("gaussian", "binomial", "poisson", "multinomial", "cox", "mgaussian"),
       weights, offset = NULL, alpha = 1, nlambda = 100,
       lambda.min.ratio = ifelse(nobs < nvars, 0.01, 0.0001), lambda = NULL,
       standardize = TRUE, intercept = TRUE, thresh = 1e-07,
       dfmax = nvars + 1, pmax = min(dfmax * 2 + 20, nvars), exclude,
       penalty.factor = rep(1, nvars), lower.limits = -Inf, upper.limits = Inf,
       maxit = 100000, type.gaussian = ifelse(nvars < 500, "covariance", "naive"),
       type.logistic = c("Newton", "modified.Newton"),
       standardize.response = FALSE,
       type.multinomial = c("ungrouped", "grouped"))
offset
According to the official R documentation, offset should be
A vector of length nobs that is included in the linear predictor (a nobs x nc matrix for the “multinomial” family).
Its default value is NULL: in that case, glmnet internally sets the offset to be a vector of zeros having the same length as the response y.
Here is some example code for using the offset option:
library(glmnet)
set.seed(1)
n <- 50; p <- 10
x <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)
offset <- rnorm(n)
# fit model
fit1 <- glmnet(x, y, offset = offset)
If we specify offset in the glmnet call, then when making predictions with the model, we must specify the newoffset option. For example, if we want the predictions fit1 gives us at $\lambda = 0.1$ for the training data, calling predict(fit1, x, s = 0.1) without newoffset will give us an error.
This is the correct code:
predict(fit1, x, s = 0.1, newoffset = offset)
#                1
# [1,]  0.44691399
# [2,]  0.30013292
# [3,] -1.68825225
# [4,] -0.49655504
# [5,]  1.20180199
# ...
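If you want to see the failure without halting a script, one option is to wrap the call in tryCatch. This is only a sketch: it re-creates fit1 so that it runs on its own, and the names res_missing and pred_ok are mine, not glmnet's.

```r
library(glmnet)

set.seed(1)
n <- 50; p <- 10
x <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)
offset <- rnorm(n)
fit1 <- glmnet(x, y, offset = offset)

# Without newoffset: capture the error object instead of stopping the script
res_missing <- tryCatch(
  predict(fit1, x, s = 0.1),
  error = function(e) e
)
inherits(res_missing, "error")      # did predict() refuse to run?
if (inherits(res_missing, "error")) message(conditionMessage(res_missing))

# With newoffset: one prediction per row of x
pred_ok <- predict(fit1, x, s = 0.1, newoffset = offset)
dim(pred_ok)
```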
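So, what does offset actually do (or mean)? Recall that glmnet is fitting a linear model. More concretely, our data is $(x_1, y_1), \dots, (x_n, y_n)$, where $x_j \in \mathbb{R}^p$ are the features for observation $j$ and $y_j$ is its response. For each observation, we model some quantity $z_j$ as the linear combination $\beta_0 + \beta^T x_j$ of the features. What $z_j$ is depends on the family:
- For ordinary regression, $z_j = y_j$, i.e. the response itself.
- For logistic regression, $z_j = \log \left( \frac{p_j}{1 - p_j} \right)$, the log-odds of $p_j = P(y_j = 1)$.
- For Poisson regression, $z_j = \log \mu_j$, where $\mu_j = E[y_j]$.
So, we are trying to find $\beta_0$ and $\beta$ such that $\beta_0 + \beta^T x_j$ is a good estimate of $z_j$. If we have an offset $(e_1, \dots, e_n)$, then we are instead trying to find $\beta_0$ and $\beta$ such that $e_j + \beta_0 + \beta^T x_j$ is a good estimate of $z_j$.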
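For ordinary regression this has a concrete consequence that is easy to check: putting an offset into the linear predictor should give the same coefficients as subtracting it from the response and fitting without one. Here is a rough sketch of that check on simulated data. The names e, lam, fit_offset and fit_shifted are my own, and the value 0.1 for the penalty is arbitrary.

```r
library(glmnet)

set.seed(1)
n <- 50; p <- 10
x <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)
e <- rnorm(n)    # the offset vector (one entry per observation)
lam <- 0.1       # arbitrary penalty value for the comparison

# Fit 1: offset passed to glmnet; Fit 2: offset subtracted from the response
fit_offset  <- glmnet(x, y, offset = e)
fit_shifted <- glmnet(x, y - e)

# Coefficients (intercept included) at the same penalty value
coef_offset  <- as.matrix(coef(fit_offset,  s = lam))
coef_shifted <- as.matrix(coef(fit_shifted, s = lam))
max_coef_diff <- max(abs(coef_offset - coef_shifted))

# Predictions: the offset fit adds e back through newoffset
pred_offset  <- drop(predict(fit_offset,  x, s = lam, newoffset = e))
pred_shifted <- drop(predict(fit_shifted, x, s = lam)) + e
max_pred_diff <- max(abs(pred_offset - pred_shifted))

c(max_coef_diff, max_pred_diff)   # both should be numerically negligible
```

This shortcut is specific to the Gaussian family. For logistic or Poisson regression the offset sits inside a nonlinear link, so it cannot simply be subtracted from the response.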
Why might we want to use offsets? There are two primary reasons for them stated in the documentation:
Useful for the “poisson” family (e.g. log of exposure time), or for refining a model by starting at a current fit.
Let me elaborate. First, offsets are useful for Poisson regression. The official vignette has a little section explaining this; let me explain it through an example.
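Imagine that we are trying to predict how many points an NBA basketball player will score per minute based on his physical attributes. If the player’s physical attributes (i.e. the covariates of our model) are denoted by $x$, then our model might be that the number of points he scores in a minute is Poisson with mean $\mu$, where $\log \mu = \beta_0 + \beta^T x$.
Having described the model, let’s turn to our data. For each player $j = 1, \dots, n$, we have his physical covariates $x_j$. However, instead of his points per minute, we have the number of points he scored over some period of time: for example, "12 points over 30 minutes" rather than "0.4 points per minute".
Offsets allow us to use our data as is. In our example above, loosely speaking 12/30 (points per minute) is our estimate for $\mu$, so 12 (points in 30 minutes) is our estimate for $30\mu$. In our model, $\beta_0 + \beta^T x$ is our estimate for $\log \mu$; hence our estimate for $\log(30\mu)$ is $\log 30 + \beta_0 + \beta^T x$. The $\log 30$ term is the offset that lets the model describe the data as recorded.
Taking this to the full dataset: if player $j$ scores $y_j$ points in $t_j$ minutes, then the offset is the vector $(\log t_1, \dots, \log t_n)$ and the response we feed to glmnet is $(y_1, \dots, y_n)$.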
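To make the points-per-minute setup concrete, here is a minimal sketch on simulated data (no real NBA data is involved). Everything in it is an assumption made for illustration: the "true" coefficients, the range of minutes played, the penalty value s = 0.01, and the variable names.

```r
library(glmnet)

set.seed(1)
n <- 200; p <- 5
x <- matrix(rnorm(n * p), nrow = n)        # simulated "physical attributes"
minutes <- runif(n, min = 5, max = 40)     # minutes played by each player
beta_true <- c(0.3, -0.2, 0, 0, 0.1)       # made-up coefficients
mu <- exp(-1 + drop(x %*% beta_true))      # true points per minute
pts <- rpois(n, lambda = minutes * mu)     # points scored over those minutes

# Response = raw counts, offset = log of exposure time
fit_pois <- glmnet(x, pts, family = "poisson", offset = log(minutes))

# Predicted points per minute: exposure of 1 minute, i.e. offset log(1) = 0
rate_hat <- drop(predict(fit_pois, x, s = 0.01, type = "response",
                         newoffset = rep(0, n)))

# Predicted points over the minutes actually played
total_hat <- drop(predict(fit_pois, x, s = 0.01, type = "response",
                          newoffset = log(minutes)))

# On the response scale the offset acts as a multiplier
all.equal(total_hat, minutes * rate_hat)
```

The last line is the useful sanity check: because the offset enters on the log scale, predicting with newoffset = log(minutes) is the same as multiplying the per-minute prediction by minutes.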
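The second reason one might want to use offsets is to improve on an existing model. Continuing the example above: say we have a friend who has trained a model (not necessarily a linear model) to predict $\log \mu$, the log of a player’s expected points per minute. We think we can improve on his predictions by bringing the physical attributes $x$ into the model. One such refinement is
$\log \mu = \hat{\theta} + \beta_0 + \beta^T x,$
where $\hat{\theta}$ is the prediction of $\log \mu$ from our friend’s model. In this setting, the offsets are simply our friend’s predictions. For training, we provide his model’s predictions on the training observations as the offset. To get predictions from the refined model on new observations, we first compute his model’s predictions for them, then pass these as the newoffset option in the predict call.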
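Here is one way the refinement workflow could look in code. It is a sketch under made-up assumptions: the data is simulated, the friend's model is a plain Poisson glm on a hypothetical extra feature w (any model that predicts on the log scale would do), and s = 0.01 is an arbitrary penalty value.

```r
library(glmnet)

set.seed(1)
n <- 300; p <- 5
x <- matrix(rnorm(n * p), nrow = n)    # simulated "physical attributes"
w <- rnorm(n)                          # hypothetical feature used by the friend
y <- rpois(n, lambda = exp(0.5 + 0.4 * w + 0.3 * x[, 1]))

train <- 1:200
test  <- 201:300
dat <- data.frame(y = y, w = w)

# Step 1: the existing model (here a Poisson GLM that ignores x)
friend <- glm(y ~ w, family = poisson, data = dat[train, ])

# Its predictions on the log scale become the offsets
theta_train <- as.numeric(predict(friend, newdata = dat[train, ], type = "link"))
theta_test  <- as.numeric(predict(friend, newdata = dat[test, ],  type = "link"))

# Step 2: refine with glmnet, using the friend's predictions as the offset
fit_refine <- glmnet(x[train, ], y[train], family = "poisson",
                     offset = theta_train)

# Step 3: predict on new observations, passing the friend's predictions
# for those observations as newoffset
pred_test <- drop(predict(fit_refine, x[test, ], s = 0.01,
                          type = "response", newoffset = theta_test))
head(pred_test)
```

The key detail is that the offsets must be on the scale of the linear predictor (the log scale here), which is why the sketch uses type = "link" when predicting from the first model.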