Is the size of your lm model causing you headaches?
If you build an R lm model with a relatively large number of rows, you may be surprised by just how large that lm model is and what impact it has on your environment and application.
Why might you care about size? The most obvious reason is that the size of R objects reduces the amount of RAM available for further R processing or for loading more data. However, size also affects how much space is required to save the model and how long it takes to move it around the network. For example, you may want to move the model from the database server R engine to the client R engine when using Oracle R Enterprise Embedded R Execution. If the model is too large, you may encounter latency when trying to retrieve the model or even receive the following error:
Error in .oci.GetQuery(conn, statement, data = data, prefetch = prefetch, :
ORA-20000: RQuery error
Error : serialization is too large to store in a raw vector
If you get this error, there are at least a few options:
- Perform summary component access, like coefficients, inside the embedded R function and return only what is needed
- Save the model in a database R datastore and manipulate that model at the database server to avoid pulling it to the client
- Reduce the size of the model by eliminating large and unneeded components
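The first option can be sketched as follows, assuming only the coefficients are needed on the client. The function name f.lm.coef and the synthetic data frame toy are illustrative stand-ins, not part of the original example; with Embedded R Execution, a function of this shape would be passed in place of one that returns the whole model.

```r
# Hypothetical helper: fit the model where the data lives and return
# only the coefficients instead of the whole lm object.
f.lm.coef <- function(dat) {
  mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)
  coef(mod)
}

# Synthetic stand-in for the ONTIME sample (assumed columns only)
set.seed(1)
toy <- data.frame(DISTANCE = runif(1000, 100, 2500),
                  DEPDELAY = rnorm(1000, 5, 20))
toy$ARRDELAY <- 0.2 - 0.001 * toy$DISTANCE + 0.96 * toy$DEPDELAY + rnorm(1000)

cf <- f.lm.coef(toy)
cf
length(serialize(cf, NULL))   # size of what would cross the network
```

The returned value is a named numeric vector of length three, so its serialized size is independent of the number of rows used to fit the model.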
In this blog post, we focus on the third approach and look at the size of lm model components, what you can do to control lm model size, and the implications of doing so. With vanilla R, objects are the “memory” serving as the repository for repeatability. As a result, models tend to be populated with the data used to build them to ensure model build repeatability.
When working with database tables, this “memory” is not needed because governance mechanisms are already in place to ensure either that data does not change or that logs are available to know what changes took place. Hence it is unnecessary to store the data used to build the model in the model object.
An lm model consists of several components, for example:
coefficients, residuals, effects, fitted.values, rank, qr, df.residual, call, terms, xlevels, model, na.action
Some of these components may appear deceptively small using R’s object.size function. The following script builds an lm model to help reveal what R reports for the size of various components. The examples use a sample of the ONTIME airline arrival and departure delays data set for domestic flights. The ONTIME_S data set is an ore.frame proxy object for data stored in an Oracle database and consists of 219932 rows and 26 columns. The R data.frame ontime_s is this same data pulled to the client R engine using ore.pull and is ~39.4MB.
Note: The results reported below use R 2.15.2 on Windows. Serialization of some components in the lm model has been improved in R 3.0.0, but the implications are the same.
f.lm.1 <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)
lm.fit.1 <- f.lm.1(ontime_s)
object.size(lm.fit.1)
54807720 bytes
According to object.size, the resulting model is about 55MB. If the model is used only for scoring, that is a lot of bloat for the few coefficients that scoring presumably needs. Moving this object over a network will not be instantaneous either. But is this the true size of the model?
A better way to determine just how big an object is, and what space is actually required to store the model or time to move it across a network, is the R serialize function.
length(serialize(lm.fit.1,NULL))
[1] 65826324
Notice that the size reported by serialize is larger than that reported by object.size: a difference of 11MB, or about 20%.
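To compare the two measures on any object, a small helper along these lines could be used. This is a sketch on synthetic data; size.report and toy are illustrative names, and the numbers it prints will differ from those reported for the ONTIME model.

```r
# Hypothetical helper: report both size measures, in bytes, for any R object.
size.report <- function(obj) {
  c(object.size = as.numeric(object.size(obj)),
    serialized  = length(serialize(obj, NULL)))
}

# Synthetic data standing in for ontime_s
set.seed(1)
toy <- data.frame(x = rnorm(5000))
toy$y <- 1 + 2 * toy$x + rnorm(5000)

fit <- lm(y ~ x, data = toy)
size.report(fit)
```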
What is taking up so much space? Let’s invoke object.size on each component of this lm model:
lapply(lm.fit.1, object.size)
$coefficients
424 bytes
$residuals
13769600 bytes
$effects
3442760 bytes
$rank
48 bytes
$fitted.values
13769600 bytes
$assign
56 bytes
$qr
17213536 bytes
$df.residual
48 bytes
$na.action
287504 bytes
$xlevels
192 bytes
$call
1008 bytes
$terms
4432 bytes
$model
6317192 bytes
The components residuals, fitted.values, qr, model, and even na.action are large. Do we need all these components?
The lm function provides arguments to control some aspects of model size. This can be done, for example, by specifying model=FALSE and qr=FALSE. However, as we saw above, there are other components that contribute heavily to model size.
f.lm.2 <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY,
                           data = dat, model=FALSE, qr=FALSE)
lm.fit.2 <- f.lm.2(ontime_s)
length(serialize(lm.fit.2,NULL))
[1] 51650410
object.size(lm.fit.2)
31277216 bytes
The resulting serialized model size is down to about 52MB, which is not significantly smaller than the full model's 66MB. The size reported by object.size is now ~20MB, or 39%, smaller than the serialized size.
Does removing these components have any effect on the usefulness of an lm model? We’ll explore this using four commonly used functions: coef, summary, anova, and predict. If we try to invoke summary on lm.fit.2, the following error results:
summary(lm.fit.2)
Error in qr.lm(object) : lm object does not have a proper ‘qr’ component.
Rank zero or should not have used lm(.., qr=FALSE).
The same error results when we try to run anova. Unfortunately, the predict function also fails with the error above. The qr component is necessary for these functions. Function coef returns without error.
coef(lm.fit.2)
(Intercept) DISTANCE DEPDELAY
0.225378249 -0.001217511 0.962528054
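A quick way to see which accessors survive a given set of lm arguments is to try each one and record whether it errors. The sketch below does this on synthetic data; works, toy, and fit.noqr are illustrative names, and predict can be added to the list in the same way.

```r
# Hypothetical helper: TRUE if calling f on its arguments succeeds,
# FALSE if it raises an error.
works <- function(f, ...) {
  !inherits(try(f(...), silent = TRUE), "try-error")
}

set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- 1 + 2 * toy$x + rnorm(200)

# Same settings as f.lm.2: no model frame and no qr component
fit.noqr <- lm(y ~ x, data = toy, model = FALSE, qr = FALSE)

res <- sapply(list(coef = coef, summary = summary, anova = anova),
              works, fit.noqr)
res
```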
If only coefficients are required, these settings may be acceptable. However, as we’ve seen, removing the model and qr components, while each is large, still leaves a large model. The really large components appear to be the effects, residuals, and fitted.values. We can explicitly nullify them to remove them from the model.
f.lm.3 <- function(dat) {
  mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY,
            data = dat, model=FALSE, qr=FALSE)
  mod$effects <- mod$residuals <- mod$fitted.values <- NULL
  mod
}
lm.fit.3 <- f.lm.3(ontime_s)
length(serialize(lm.fit.3,NULL))
[1] 24089000
object.size(lm.fit.3)
294968 bytes
Thinking the model size should be small, we might be surprised to see the results above. The function object.size reports ~295KB, but serializing the model shows 24MB, a difference of 23.8MB or 98.8%. What happened? We’ll get to that in a moment. First, let’s explore what effect nullifying these additional components has on the model.
To answer this, we’ll turn model and qr back on, and focus on effects, residuals, and fitted.values. If we nullify effects, the anova results are invalid, but the other results are fine. If we nullify residuals, summary cannot produce residual and coefficient statistics, and it also produces an odd F-statistic with a warning:
Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type ‘NULL’
The function anova produces invalid F values and residual statistics, clarifying with a warning:
Warning message:
In anova.lm(mod) :
ANOVA F-tests on an essentially perfect fit are unreliable
Otherwise, both predict and coef work fine.
If we nullify fitted.values, summary produces an invalid F-statistic, issuing the warning:
Warning message:
In mean.default(f) : argument is not numeric or logical: returning NA
Depending on what we need from our model, some of these components could be eliminated. But let’s continue looking at each remaining component, not with object.size, but serialize. Below, we use lapply to compute the serialized length of each model component. This reveals that the terms component is actually the largest component, despite object.size reporting only 4432 bytes above.
as.matrix(lapply(lm.fit.3, function(x) length(serialize(x,NULL))))
[,1]
coefficients 130
rank 26
assign 34
df.residual 26
na.action 84056
xlevels 55
call 275
terms 24004509
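To put the two measures side by side for every component, a helper along these lines could be used. This is a sketch on synthetic data; component.sizes, build, and toy are illustrative names rather than part of the original script.

```r
# Hypothetical helper: per-component sizes under both measures, in bytes.
component.sizes <- function(mod) {
  data.frame(
    object.size = sapply(mod, function(x) as.numeric(object.size(x))),
    serialized  = sapply(mod, function(x) length(serialize(x, NULL)))
  )
}

# Synthetic data; the model is built inside a function, as in the
# examples above, so the data is passed in as the argument 'dat'.
set.seed(1)
toy <- data.frame(x = rnorm(5000))
toy$y <- 1 + 2 * toy$x + rnorm(5000)
build <- function(dat) lm(y ~ x, data = dat)

sizes <- component.sizes(build(toy))
sizes[order(-sizes$serialized), ]   # largest serialized components first
```

Sorting by the serialized column makes any component whose two measures disagree sharply stand out.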
If we nullify the terms component, the model becomes quite compact. (By the way, if we simply nullify terms, summary, anova, and predict all fail.) Why is the terms component so large? It turns out it has an environment object as an attribute. The environment contains the variable dat, which contains the original data with 219932 rows and 26 columns. R’s serialize function includes this object, which is why the serialized model is so large. The function object.size ignores the contents of such environments.
attr(lm.fit.1$terms, ".Environment")
ls(envir = attr(lm.fit.1$terms, ".Environment"))
[1] "dat"
d <- get("dat", envir = attr(lm.fit.1$terms, ".Environment"))
dim(d)
[1] 219932 26
length(serialize(attr(lm.fit.1$terms, ".Environment"), NULL))
[1] 38959319
object.size(attr(lm.fit.1$terms, ".Environment"))
56 bytes
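The same effect can be reproduced without the ONTIME data. The sketch below, which uses the illustrative names build and toy, shows that a model built inside a function keeps that call's environment, and with it the full data frame, attached to its terms.

```r
set.seed(1)
toy <- data.frame(x = rnorm(5000))
toy$y <- 1 + 2 * toy$x + rnorm(5000)

# The formula is created inside the call to build(), so the environment
# recorded on the terms is that call's frame, which still holds 'dat'.
build <- function(dat) lm(y ~ x, data = dat)
fit <- build(toy)

env <- attr(fit$terms, ".Environment")
ls(envir = env)                           # names held by the captured frame
identical(get("dat", envir = env), toy)   # is it the full data set?

# object.size does not look inside the environment; serialize does
as.numeric(object.size(env))
length(serialize(env, NULL))
```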
If we remove this object from the environment, the serialized object size also becomes small.
rm(list = ls(envir = attr(lm.fit.1$terms, ".Environment")),
   envir = attr(lm.fit.1$terms, ".Environment"))
ls(envir = attr(lm.fit.1$terms, ".Environment"))
character(0)
length(serialize(lm.fit.1, NULL))
[1] 85500
lm.fit.1
Call:
lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model = FALSE,
qr = FALSE)
Coefficients:
(Intercept)     DISTANCE     DEPDELAY
   0.225378    -0.001218     0.962528
Is the associated environment essential to the model? If not, we could empty it to significantly reduce model size. We’ll rebuild the model using the function f.lm.full.
f.lm.full <- function(dat) lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)
lm.fit.full <- f.lm.full(ontime_s)
ls(envir=attr(lm.fit.full$terms, ".Environment"))
[1] "dat"
length(serialize(lm.fit.full,NULL))
[1] 65826324
We’ll create the model, removing some components, as defined in the function f.lm.small:
f.lm.small <- function(dat) {
  f.lm <- function(dat) {
    mod <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model=FALSE)
    mod$fitted.values <- NULL
    mod
  }
  mod <- f.lm(dat)
  # empty the env associated with local function
  e <- attr(mod$terms, ".Environment")
  # set parent env to .GlobalEnv so serialization doesn't include contents
  parent.env(e) <- .GlobalEnv
  rm(list=ls(envir=e), envir=e) # remove all objects from this environment
  mod
}
lm.fit.small <- f.lm.small(ontime_s)
ls(envir=attr(lm.fit.small$terms, ".Environment"))
character(0)
length(serialize(lm.fit.small, NULL))
[1] 16219251
We can use the same function with embedded R execution.
lm.fit.ere <- ore.pull(ore.tableApply(ONTIME_S, f.lm.small))
ls(envir=attr(lm.fit.ere$terms, ".Environment"))
character(0)
length(serialize(lm.fit.ere, NULL))
[1] 16219251
as.matrix(lapply(lm.fit.ere, function(x) length(serialize(x,NULL))))
[,1]
coefficients 130
residuals 4624354
effects 3442434
rank 26
fitted.values 4624354
assign 34
qr 8067072
df.residual 26
na.action 84056
xlevels 55
call 245
terms 938
Making this change does not affect the workings of the model for coef, summary, anova, or predict. For example, summary produces expected results:
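One way to check a claim like this is to build a full and a slimmed model from the same data and compare what each function returns. The sketch below does so on synthetic data. The names build.full, build.small, toy, and newd are illustrative, and the slimming step here only drops the model frame and empties the captured environment, which is an assumption of this sketch rather than a copy of f.lm.small.

```r
set.seed(1)
toy <- data.frame(x = rnorm(2000))
toy$y <- 1 + 2 * toy$x + rnorm(2000)

build.full <- function(dat) lm(y ~ x, data = dat)

# Hypothetical slimming step: no model frame, and the environment captured
# by the formula is emptied. The nested function keeps that environment
# separate from the frame holding 'mod' and 'e', so rm() cannot delete them.
build.small <- function(dat) {
  fit <- function(dat) lm(y ~ x, data = dat, model = FALSE)
  mod <- fit(dat)
  e <- attr(mod$terms, ".Environment")   # frame of the inner call
  parent.env(e) <- globalenv()           # do not drag the outer frame along
  rm(list = ls(envir = e), envir = e)
  mod
}

full  <- build.full(toy)
small <- build.small(toy)
newd  <- data.frame(x = c(-1, 0, 1))

checks <- c(
  coef    = isTRUE(all.equal(coef(full), coef(small))),
  summary = isTRUE(all.equal(coef(summary(full)), coef(summary(small)))),
  anova   = isTRUE(all.equal(anova(full), anova(small))),
  predict = isTRUE(all.equal(predict(full, newd), predict(small, newd)))
)
checks
c(full  = length(serialize(full, NULL)),
  small = length(serialize(small, NULL)))
```

Any FALSE entry in checks would flag a function whose result changed after slimming.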
summary(lm.fit.ere)
Call:
lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = dat, model = FALSE)
Residuals:
     Min       1Q   Median       3Q      Max
-1462.45    -6.97    -1.36     5.07   925.08
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.254e-01  5.197e-02   4.336 1.45e-05 ***
DISTANCE    -1.218e-03  5.803e-05 -20.979  < 2e-16 ***
DEPDELAY     9.625e-01  1.151e-03 836.289  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.73 on 215144 degrees of freedom
(4785 observations deleted due to missingness)
Multiple R-squared: 0.7647, Adjusted R-squared: 0.7647
F-statistic: 3.497e+05 on 2 and 215144 DF, p-value: < 2.2e-16
Using the model for prediction also produces expected results.
lm.pred <- function(dat, mod) {
  prd <- predict(mod, newdata=dat)
  prd[as.integer(rownames(prd))] <- prd
  cbind(dat, PRED = prd)
}
dat.test <- with(ontime_s, ontime_s[YEAR == 2003 & MONTH == 5,
                                    c("ARRDELAY", "DISTANCE", "DEPDELAY")])
head(lm.pred(dat.test, lm.fit.ere))
       ARRDELAY DISTANCE DEPDELAY        PRED
163267        0      748       -2 -2.61037575
163268       -8      361        0 -0.21414306
163269       -5      484        0 -0.36389686
163270       -3      299        3  2.74892676
163271        6      857       -6 -6.59319662
163272      -21      659       -8 -8.27718564
163273       -2     1448        0 -1.53757703
163274        5      238        9  8.59836323
163275       -5      744        0 -0.68044960
163276       -3      199        0 -0.01690635
As shown above, an lm model can become quite large. At least for some applications, several of these components may be unnecessary, allowing the user to significantly reduce the size of the model and space required for saving or time for transporting the model. Relying on Oracle Database to store the data instead of the R model object further allows for significant reduction in model size.