One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows with the number of coefficients, not with the size of the training data. With R, though, glm models are not so concise; we noticed this to our dismay when we tried to automate fitting a moderate number of models (about 500 models, with on the order of 50 coefficients each) to data sets of moderate size (several tens of thousands of rows). A workspace save of the models alone was in the tens of gigabytes! How is this possible? We decided to find out.
As many R users know (but often forget), a glm model object carries a copy of its training data by default. You can use the settings y=FALSE and model=FALSE to turn this off.
set.seed(2325235)

# Set up a synthetic classification problem of a given size
# and two variables: one numeric, one categorical (two levels).
synthFrame = function(nrows) {
   d = data.frame(xN=rnorm(nrows),
                  xC=sample(c('a','b'), size=nrows, replace=TRUE))
   d$y = (d$xN + ifelse(d$xC=='a', 0.2, -0.2) + rnorm(nrows)) > 0.5
   d
}

# First show that model=FALSE and y=FALSE help reduce model size
dTrain = synthFrame(1000)
model1 = glm(y~xN+xC, data=dTrain, family=binomial(link='logit'))
model2 = glm(y~xN+xC, data=dTrain, family=binomial(link='logit'),
             y=FALSE)
model3 = glm(y~xN+xC, data=dTrain, family=binomial(link='logit'),
             y=FALSE, model=FALSE)

# Estimate each object's size as the size of its serialization
length(serialize(model1, NULL))
# [1] 225251
length(serialize(model2, NULL))
# [1] 206341
length(serialize(model3, NULL))
# [1] 189562

dTest = synthFrame(100)
p1 = predict(model1, newdata=dTest, type='response')
p2 = predict(model2, newdata=dTest, type='response')
p3 = predict(model3, newdata=dTest, type='response')
sum(abs(p1-p2))
# [1] 0
sum(abs(p1-p3))
# [1] 0
So we see (as expected) that removing the training data from the model decreases the size of the model (as estimated by the size of its serialization), without affecting the model’s predictions. What happens when you increase the training data size? The size of the model (with y=FALSE and model=FALSE) should not grow.
ndata = seq(from=0, to=100000, by=5000)[-1]

# A function to estimate the size of a model for
# our synthetic problem, with a training set of size n
getModelSize = function(n) {
   data = synthFrame(n)
   model = glm(y~xN+xC, data=data, family=binomial(link='logit'),
               y=FALSE, model=FALSE)
   length(serialize(model, NULL))
}

size1 = sapply(ndata, FUN=getModelSize)

library(ggplot2)
ggplot(data.frame(n=ndata, modelsize=size1), aes(x=n, y=modelsize)) +
   geom_point() + geom_line()
Lo and behold, we see that the model size still grows linearly with the size of the training data! The model objects are still holding something proportional to the size of the training data. Where?
We can use our serialization trick to find the size of the individual components of a model:
breakItDown = function(mod) {
   sapply(mod,
          FUN=function(x) {length(serialize(x, NULL))},
          simplify=TRUE)
}
Now let’s compare two models trained with datasets of different sizes (one ten times the size of the other).
mod1 = glm(y~xN+xC, data=synthFrame(1000),
           family=binomial(link='logit'),
           y=FALSE, model=FALSE)
c1 = breakItDown(mod1)

mod2 = glm(y~xN+xC, data=synthFrame(10000),
           family=binomial(link='logit'),
           y=FALSE, model=FALSE)
c2 = breakItDown(mod2)

# For pretty-printing a vector in a vertical, blog-friendly format:
# return a string of the vector formatted as a column with names;
# use cat to echo the value.
vfmtN = function(v) {
   width = max(sapply(names(v), nchar))
   paste(
      sapply(1:length(v), function(i) {
         paste(format(names(v)[i], width=width), format(v[[i]]))
      }),
      collapse='\n')
}

cat(vfmtN(c1))
# coefficients      119
# residuals         18948
# fitted.values     18948
# effects           16071
# R                 261
# rank              26
# qr                35261
# family            25160
# linear.predictors 18948
# deviance          30
# aic               30
# null.deviance     30
# iter              26
# weights           18948
# prior.weights     18948
# df.residual       26
# df.null           26
# converged         26
# boundary          26
# call              373
# formula           193
# terms             836
# data              16278
# offset            18
# control           140
# method            37
# contrasts         96
# xlevels           91

cat(vfmtN(c2))
# coefficients      119
# residuals         198949
# fitted.values     198949
# effects           160071
# R                 261
# rank              26
# qr                359262
# family            25160
# linear.predictors 198949
# deviance          30
# aic               30
# null.deviance     30
# iter              26
# weights           198949
# prior.weights     198949
# df.residual       26
# df.null           26
# converged         26
# boundary          26
# call              373
# formula           193
# terms             836
# data              160278
# offset            18
# control           140
# method            37
# contrasts         96
# xlevels           91
Look carefully, and you will see that certain components of the glm model are large, and grow with the data size.
r = c2/c1
cat(vfmtN(r))
# coefficients      1
# residuals         10.49974
# fitted.values     10.49974
# effects           9.960239
# R                 1
# rank              1
# qr                10.18865
# family            1
# linear.predictors 10.49974
# deviance          1
# aic               1
# null.deviance     1
# iter              1
# weights           10.49974
# prior.weights     10.49974
# df.residual       1
# df.null           1
# converged         1
# boundary          1
# call              1
# formula           1
# terms             1
# data              9.846296
# offset            1
# control           1
# method            1
# contrasts         1
# xlevels           1

cat(vfmtN(r[r>1]))
# residuals         10.49974
# fitted.values     10.49974
# effects           9.960239
# qr                10.18865
# linear.predictors 10.49974
# weights           10.49974
# prior.weights     10.49974
# data              9.846296
Now strictly speaking, all you need in order to apply a glm model are the coefficients and the appropriate link function. All the other things the glm model object carries around are there to characterize the model, for example when calculating coefficient significances. (And really, for most purposes, one could just calculate the quantities one wants to know, save those, and throw the data away; but we’re here to discuss R as it is, not as it should be.) Once you’ve examined a model and decided that it’s satisfactory, all you probably want to do is predict. So let’s try trimming all those large objects away.
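To see why the coefficients and the link really are enough, here is a minimal sketch (our own illustration, not part of the original analysis; predict_by_hand is a hypothetical helper, not a stats function) that reproduces the glm predictions from the coefficient vector alone:

# A minimal sketch: predict by hand from the coefficients alone.
# predict_by_hand is a hypothetical helper, not part of any package.
predict_by_hand = function(coefs, frame) {
   # model.matrix() builds the same design matrix glm() used:
   # an intercept column, xN, and the treatment-coded xC
   mm = model.matrix(~xN+xC, data=frame)
   eta = as.numeric(mm %*% coefs)  # the link value
   1/(1 + exp(-eta))               # inverse logit gives the response
}

p.hand = predict_by_hand(coef(mod1), dTest)
p.glm = predict(mod1, newdata=dTest, type='response')
sum(abs(p.hand - p.glm))
# should be (essentially) zero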
cleanModel1 = function(cm) {
   # just in case we forgot to set
   # y=FALSE and model=FALSE
   cm$y = c()
   cm$model = c()
   cm$residuals = c()
   cm$fitted.values = c()
   cm$effects = c()
   cm$qr = c()
   cm$linear.predictors = c()
   cm$weights = c()
   cm$prior.weights = c()
   cm$data = c()
   cm
}

cm1 = cleanModel1(mod1)
cm2 = cleanModel1(mod2)
dTest = synthFrame(100)

p1 = predict(cm1, newdata=dTest, type='response')
# FAILS
# Error in qr.lm(object) : lm object does not have a proper 'qr' component.
# Rank zero or should not have used lm(.., qr=FALSE).
Oops. We can’t null out the qr member of the model object if we want to predict. Incidentally, this is related to the observation that if you call lm(..., y=FALSE, model=FALSE, qr=FALSE), the result is a model object that can neither predict nor summarize. Don’t ask me why qr=FALSE is even an option. But back to the glm: what’s in the model’s qr field?
breakItDown(mod1$qr)
#    qr  rank qraux pivot   tol
# 35042    26    46    34    30

breakItDown(mod2$qr)
#     qr  rank qraux pivot   tol
# 359043    26    46    34    30
It turns out that we don’t actually need the model’s qr$qr component to predict, so let’s trim just that away:
cleanModel2 = function(cm) {
   cm$y = c()
   cm$model = c()
   cm$residuals = c()
   cm$fitted.values = c()
   cm$effects = c()
   cm$qr$qr = c()
   cm$linear.predictors = c()
   cm$weights = c()
   cm$prior.weights = c()
   cm$data = c()
   cm
}

# More reduction in model size
length(serialize(mod2, NULL))
# [1] 1701600
cm2 = cleanModel2(mod2)
length(serialize(cm2, NULL))
# [1] 27584

# And prediction works, too
resp.full = predict(mod2, newdata=dTest, type="response")
resp.cm = predict(cm2, newdata=dTest, type="response")
sum(abs(resp.full-resp.cm))
# [1] 0
Are we done?
getModelSize = function(n) {
   data = synthFrame(n)
   model = cleanModel2(glm(y~xN+xC, data=data,
                           family=binomial(link='logit'),
                           y=FALSE, model=FALSE))
   length(serialize(model, NULL))
}

size2 = sapply(ndata, FUN=getModelSize)

ggplot(data.frame(n=ndata, modelsize=size2), aes(x=n, y=modelsize)) +
   geom_point() + geom_line()
The models are substantially smaller than when we started, but they still grow with training data size.
A rough explanation for this is that glm hides pointers to the calling environment, and to objects from that environment, deep in many places. We didn’t notice this when we built models in the global environment, because then all those pointers point to the same things; so even though the models are much bigger than they need to be, they are all “too big” by the same amount, and hence don’t appear to grow as the training data grows. But when you build the models inside a function (as we did in getModelSize()), each call creates a fresh transient environment whose contents are proportional to the size of the training data, and so the model size grows with the training data size. This may not seem clear, because it depends on a lot of complicated implementation details (for a taste of how complicated it can get, see here).
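To make the mechanism concrete, here is a small sketch (our own illustration, not from the original analysis) showing that a formula created inside a function captures that function’s evaluation environment, and that serializing the formula drags the environment’s contents along with it:

# A formula carries a pointer to the environment where it was
# created; serialization captures that environment's contents.
makeFormula = function(n) {
   bigJunk = rnorm(n)  # local data the formula never references
   y ~ x               # created in this function's environment
}

f1 = makeFormula(10)
f2 = makeFormula(100000)
length(serialize(f1, NULL))
length(serialize(f2, NULL))
# f2 serializes to roughly 100000 numeric values more than f1,
# because its captured environment still contains bigJunk
ls(environment(f2))
# should list bigJunk (and n): the junk is reachable from the formula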
After much trial and error, this is the set of fields and attributes of the model that we found were growing with the data size, and that we could eliminate without breaking predict().
stripGlmLR = function(cm) {
   cm$y = c()
   cm$model = c()
   cm$residuals = c()
   cm$fitted.values = c()
   cm$effects = c()
   cm$qr$qr = c()
   cm$linear.predictors = c()
   cm$weights = c()
   cm$prior.weights = c()
   cm$data = c()

   cm$family$variance = c()
   cm$family$dev.resids = c()
   cm$family$aic = c()
   cm$family$validmu = c()
   cm$family$simulate = c()

   attr(cm$terms, ".Environment") = c()
   attr(cm$formula, ".Environment") = c()

   cm
}

getModelSize = function(n) {
   data = synthFrame(n)
   model = stripGlmLR(glm(y~xN+xC, data=data,
                          family=binomial(link='logit'),
                          y=FALSE, model=FALSE))
   length(serialize(model, NULL))
}

size3 = sapply(ndata, FUN=getModelSize)

ggplot(data.frame(n=ndata, modelsize=size3), aes(x=n, y=modelsize)) +
   geom_point() + geom_line()
Yahoo! It worked! The models are constant size with respect to training data size. And prediction works.
cm2 = stripGlmLR(mod2)
resp.full = predict(mod2, newdata=dTest, type="response")
resp.cm = predict(cm2, newdata=dTest, type="response")
sum(abs(resp.full-resp.cm))
# [1] 0
Comparing the sizes of the final stripped-down models (in the variable size3 in the demonstration code) to the originals (size1), we find that the final model is about 3/10 of a percent the size of the original model for small (n=5,000) training data sets, and about 0.015% the size of the original model for “large” (n=100,000) data sets. That’s a heckuva savings. We probably haven’t gotten rid of all the unnecessary baggage, but at least we seem to have stopped the growth. This was enough trimming to accomplish our task for the client (producing working models that stored and loaded quickly), so we stopped.
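If you want to check those ratios yourself, a quick computation over the vectors from the demonstration code does it (the exact values will vary slightly with the synthetic data):

# Ratio of stripped model size to original model size
size3[1] / size1[1]                          # smallest training set (n=5000)
size3[length(ndata)] / size1[length(ndata)]  # largest (n=100000)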
One point and one caveat. You can null out model$family entirely; the predict function will still return its default value, the link value (that is, predict(model, newdata=data) will work). However, predict(model, newdata=data, type='response') will fail. You can still recover the response by passing the link value through the inverse link function: in the case of logistic regression, this is the sigmoid function, sigmoid(x) = 1/(1 + exp(-x)). I didn’t test type='terms'.
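For example, here is a sketch of that recovery (cm2nofam is our own name for a copy of the stripped model with the family removed):

# Recover the response from the link value by hand
cm2nofam = cm2
cm2nofam$family = c()  # null out the family entirely

link = predict(cm2nofam, newdata=dTest)  # default type is the link value
resp = 1/(1 + exp(-link))                # apply the sigmoid by hand

sum(abs(resp - predict(cm2, newdata=dTest, type='response')))
# should be (essentially) zero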
The caveat: many of the other things you might like to do with a glm model besides predict will fail on the stripped-down version: in particular summary(), anova(), and step(). So any characterization that you want to do on a candidate model should be done before trimming off the fat. Once you have decided on a satisfactory model, you can strip it down and save it for use in future predictions, as in the sketch below.
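In practice that workflow might look something like this (saveRDS()/readRDS() are standard R; the file name is just an example):

# Characterize first, then strip and save
model = glm(y~xN+xC, data=dTrain, family=binomial(link='logit'),
            y=FALSE, model=FALSE)
summary(model)             # do all diagnostics BEFORE stripping

model = stripGlmLR(model)  # now trim it down
saveRDS(model, 'model.RDS')

# later, in the scoring process:
model = readRDS('model.RDS')
predictions = predict(model, newdata=dTest, type='response')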
- You can trim lm and gam models in a similar way, too. The exact fields to trim are a bit different. We will leave the details as an exercise for the reader (a possible starting point is sketched below).
- We are aware of the bigglm package, for fitting generalized linear models to big data. We didn’t test it, but I would imagine that it doesn’t have this problem. Note, though, that the problem here isn’t the size of the training data per se (which is only of moderate size); it’s the inordinate size of the resulting model.
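For the lm case, the following untested sketch is one possible starting point (our own guess, mirroring stripGlmLR; lm objects have no family component, and you should verify the set of safe deletions against your own use of predict()):

# An UNTESTED sketch for lm models, mirroring stripGlmLR;
# verify against your own predict() calls before relying on it
stripLm = function(cm) {
   cm$y = c()
   cm$model = c()
   cm$residuals = c()
   cm$fitted.values = c()
   cm$effects = c()
   cm$qr$qr = c()
   attr(cm$terms, ".Environment") = c()
   cm
}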