Introduction
Suppose we have the task of predicting an outcome y given a number of variables v1,...,vk. We often want to “prune variables” or build models with fewer than all the variables. This can be to speed up modeling, decrease the cost of producing future data, improve robustness, improve explainability, even reduce over-fit, and improve the quality of the resulting model.
For some informative discussion on such issues please see the following:
- How Do You Know if Your Data Has Signal?
- How do you know if your model is going to work?
- Variable pruning is NP hard
In this article we are going to deliberately (and artificially) find and test one of the limits of the technique. We recommend simple variable pruning, but also think it is important to be aware of its limits.
To be truly effective in applied fields (such as data science) one often has to use (with care) methods that “happen to work” in addition to methods that “are known to always work” (or at least be aware that you are always competing against those who do); hence the interest in mere heuristics.
The pruning heuristics
Let L(y;m;v1,...,vk) denote the estimated loss (or badness of performance, so smaller is better) of a model for y fit using modeling method m and the variables v1,...,vk. Let d(a;L(y;m;),L(y;m;a)) denote the portion of L(y;m;)-L(y;m;a) credited to the variable a. This could be the change in loss, something like effectsize(a), or -log(significance(a)); in all cases larger is considered better.
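For concreteness, here is a minimal sketch (our own illustration; the function name and particulars are not from any package) of scoring one candidate variable a by the in-sample change in squared-error loss and by -log10 of its coefficient significance in a linear fit:

# A hedged sketch of one possible d(a; ...): score a candidate variable by the
# drop in in-sample squared-error loss when it is added to a linear model, and
# by -log10 of its coefficient significance (assumes a numeric variable).
score_variable <- function(d, yvar, basevars, a) {
  f0 <- paste(yvar, "~", if (length(basevars) > 0) paste(basevars, collapse = " + ") else "1")
  f1 <- paste(yvar, "~", paste(c(basevars, a), collapse = " + "))
  m0 <- lm(as.formula(f0), data = d)
  m1 <- lm(as.formula(f1), data = d)
  lossDrop <- mean(residuals(m0)^2) - mean(residuals(m1)^2)  # change in loss
  sig <- summary(m1)$coefficients[a, "Pr(>|t|)"]             # coefficient significance
  c(lossDrop = lossDrop, negLogSig = -log10(sig))
}

Either number can play the role of d(a; ...); both are on a “larger is better” scale.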
For practical variable pruning (during predictive modeling) our intuition often implicitly relies on the following heuristic arguments.
- L(y;m;) is monotone decreasing in the variable set: we expect L(y;m;v1,...,vk,a) to be no larger than L(y;m;v1,...,vk). Note this may be achievable “in sample” (or on training data), but is often false if L(y;m;) accounts for model complexity or is estimated on out of sample data (itself a good practice).
- If L(y;m;v1,...,vk,a) is significantly lower than L(y;m;v1,...,vk), then we will be lucky enough to have d(a;L(y;m;),L(y;m;a)) not too small.
- If d(a;L(y;m;),L(y;m;a)) is not too small, then we will be lucky enough to have d(a;L(y;lm;),L(y;lm;a)) non-negligible (where the modeling method lm is linear regression or logistic regression).
Intuitively we are hoping variable utility has a roughly diminishing-returns structure and that at least some non-vanishing fraction of a variable’s utility can be seen in simple linear or generalized linear models. Obviously this cannot be true in general (interactions in decision trees being a well-known situation where variable utility can increase in the presence of other variables, and there are many non-linear relations that escape detection by linear models).
However, if the above were true (or often nearly true) we could effectively prune variables by keeping only the set of variables { a | d(a;L(y;lm;),L(y;lm;a)) is non-negligible }. This is a (user-controllable) heuristic built into our vtreat R package and it proves to be quite useful in practice.
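As a concrete toy version of such a filter (our own sketch for exposition, not vtreat's actual implementation), one could keep only the variables whose single-variable linear fit against y shows a coefficient significance below a chosen threshold:

# Toy significance-based variable filter (illustration only, not vtreat's code).
# Keep variables whose single-variable linear model has a coefficient
# significance at or below the chosen threshold.
prune_by_lm_significance <- function(d, yvar, vars, threshold = 0.05) {
  sigs <- vapply(vars, function(a) {
    m <- lm(as.formula(paste(yvar, "~", a)), data = d)
    summary(m)$coefficients[a, "Pr(>|t|)"]
  }, numeric(1))
  names(sigs)[sigs <= threshold]  # variables judged non-negligible
}

A filter of exactly this single-variable form is what the counter-examples below are designed to fool.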
I’ll repeat: we feel that on real-world data you can use the above heuristics to usefully prune variables. Complex models do eventually get into a regime of diminishing returns, and real-world engineered useful variables usually (by design) have a hard time hiding. Also, remember data science is an empirical field: methods that happen to work will dominate (even if they do not apply in all cases).
Counter-examples
For every heuristic you should crisply know if it is true (and is in fact a theorem) or it is false (and has counter-examples). We stand behind the above heuristics, and will show their empirical worth in a follow-up article. Let’s take some time and show that they are not in fact laws.
We are going to show that per-variable coefficient significances and effect sizes are not monotone, in that adding more variables can in fact improve them.
First example
First (using R) we build a data frame where y indicates whether a equals b (the complement of a xor b). This is a classic example of y being a function of two variables but not a linear function of them over the real numbers (it is, however, a linear relation over the field GF(2): y = 1 + a + b, mod 2).
d <- data.frame(a=c(0,0,1,1), b=c(0,1,0,1))
d$y <- as.numeric(d$a == d$b)
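As a quick check of the GF(2) claim (this check is our addition), addition mod 2 reproduces y exactly:

# Confirm y is a linear function of a and b over GF(2): y = (1 + a + b) mod 2.
all.equal(d$y, (1 + d$a + d$b) %% 2)

## [1] TRUE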
We look at the (real) linear relations between y and a, b.
summary(lm(y~a+b,data=d))

## 
## Call:
## lm(formula = y ~ a + b, data = d)
## 
## Residuals:
##    1    2    3    4 
##  0.5 -0.5 -0.5  0.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.500      0.866   0.577    0.667
## a              0.000      1.000   0.000    1.000
## b              0.000      1.000   0.000    1.000
## 
## Residual standard error: 1 on 1 degrees of freedom
## Multiple R-squared:  3.698e-32, Adjusted R-squared:  -2 
## F-statistic: 1.849e-32 on 2 and 1 DF,  p-value: 1

anova(lm(y~a+b,data=d))

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value Pr(>F)
## a          1      0       0       0      1
## b          1      0       0       0      1
## Residuals  1      1       1
As expected, linear methods fail to find any evidence of a relation between y and a, b. This clearly violates our hoped-for heuristics.
For details on reading these summaries we strongly recommend Practical Regression and Anova using R, Julian J. Faraway, 2002.
In this example the linear model fails to recognize a and b as useful variables (even though y is a function of a and b). From the linear model’s point of view the variables are not improving each other (so that at least looks monotone), but this is largely because the linear model cannot see the relation at all unless we add an interaction of a and b (denoted a:b).
Second example
Let us develop this example a bit more to get a more interesting counterexample.
Introduce new variables u = (a and b) and v = (a or b). By the rules of logic we have y == 1 + u - v, so there is a linear relation.
d$u <- as.numeric(d$a & d$b)
d$v <- as.numeric(d$a | d$b)
print(d)

##   a b y u v
## 1 0 0 1 0 0
## 2 0 1 0 0 1
## 3 1 0 0 0 1
## 4 1 1 1 1 1

print(all.equal(d$y, 1+d$u-d$v))

## [1] TRUE
We can now see the counter-example effect: together the variables work better than they did alone.
summary(lm(y~u,data=d))

## 
## Call:
## lm(formula = y ~ u, data = d)
## 
## Residuals:
##          1          2          3          4 
##  6.667e-01 -3.333e-01 -3.333e-01 -1.388e-16 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.3333     0.3333       1    0.423
## u             0.6667     0.6667       1    0.423
## 
## Residual standard error: 0.5774 on 2 degrees of freedom
## Multiple R-squared:  0.3333, Adjusted R-squared:  5.551e-16 
## F-statistic: 1 on 1 and 2 DF,  p-value: 0.4226

anova(lm(y~u,data=d))

## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq F value Pr(>F)
## u          1 0.33333 0.33333       1 0.4226
## Residuals  2 0.66667 0.33333

summary(lm(y~v,data=d))

## 
## Call:
## lm(formula = y ~ v, data = d)
## 
## Residuals:
##          1          2          3          4 
##  5.551e-17 -3.333e-01 -3.333e-01  6.667e-01 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.0000     0.5774   1.732    0.225
## v            -0.6667     0.6667  -1.000    0.423
## 
## Residual standard error: 0.5774 on 2 degrees of freedom
## Multiple R-squared:  0.3333, Adjusted R-squared:  0 
## F-statistic: 1 on 1 and 2 DF,  p-value: 0.4226

anova(lm(y~v,data=d))

## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq F value Pr(>F)
## v          1 0.33333 0.33333       1 0.4226
## Residuals  2 0.66667 0.33333

summary(lm(y~u+v,data=d))

## Warning in summary.lm(lm(y ~ u + v, data = d)): essentially perfect fit:
## summary may be unreliable

## 
## Call:
## lm(formula = y ~ u + v, data = d)
## 
## Residuals:
##          1          2          3          4 
## -1.849e-32  7.850e-17 -7.850e-17  1.849e-32 
## 
## Coefficients:
##              Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)  1.00e+00   1.11e-16  9.007e+15   <2e-16 ***
## u            1.00e+00   1.36e-16  7.354e+15   <2e-16 ***
## v           -1.00e+00   1.36e-16 -7.354e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.11e-16 on 1 degrees of freedom
## Multiple R-squared:  1, Adjusted R-squared:  1 
## F-statistic: 4.056e+31 on 2 and 1 DF,  p-value: < 2.2e-16

anova(lm(y~u+v,data=d))

## Warning in anova.lm(lm(y ~ u + v, data = d)): ANOVA F-tests on an
## essentially perfect fit are unreliable

## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq    F value    Pr(>F)    
## u          1 0.33333 0.33333 2.7043e+31 < 2.2e-16 ***
## v          1 0.66667 0.66667 5.4086e+31 < 2.2e-16 ***
## Residuals  1 0.00000 0.00000                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this example we see synergy instead of diminishing returns: each variable becomes better in the presence of the other. This is good in itself, but it indicates that variable pruning is harder than one might expect, even for a linear model.
Third example
We can get around the above warnings by adding some rows to the data frame that don’t follow the designed relation. We can even sample rows from this frame to show the effect on a data frame whose rows look more independent.
d0 <- d
d0$y <- 0
d1 <- d
d1$y <- 1
dG <- rbind(d,d,d,d,d0,d1)
set.seed(23235)
dR <- dG[sample.int(nrow(dG),100,replace=TRUE),,drop=FALSE]

summary(lm(y~u,data=dR))

## 
## Call:
## lm(formula = y ~ u, data = dR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8148 -0.3425 -0.3425  0.3033  0.6575 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.34247    0.05355   6.396 5.47e-09 ***
## u            0.47235    0.10305   4.584 1.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4575 on 98 degrees of freedom
## Multiple R-squared:  0.1765, Adjusted R-squared:  0.1681 
## F-statistic: 21.01 on 1 and 98 DF,  p-value: 1.349e-05

anova(lm(y~u,data=dR))

## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## u          1  4.3976  4.3976   21.01 1.349e-05 ***
## Residuals 98 20.5124  0.2093                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(lm(y~v,data=dR))

## 
## Call:
## lm(formula = y ~ v, data = dR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7619 -0.3924 -0.3924  0.6076  0.6076 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.7619     0.1049   7.263 9.12e-11 ***
## v            -0.3695     0.1180  -3.131   0.0023 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4807 on 98 degrees of freedom
## Multiple R-squared:  0.09093, Adjusted R-squared:  0.08165 
## F-statistic: 9.802 on 1 and 98 DF,  p-value: 0.002297

anova(lm(y~v,data=dR))

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## v          1  2.265 2.26503  9.8023 0.002297 **
## Residuals 98 22.645 0.23107                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(lm(y~u+v,data=dR))

## 
## Call:
## lm(formula = y ~ u + v, data = dR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8148 -0.1731 -0.1731  0.1984  0.8269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.76190    0.08674   8.784 5.65e-14 ***
## u            0.64174    0.09429   6.806 8.34e-10 ***
## v           -0.58883    0.10277  -5.729 1.13e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3975 on 97 degrees of freedom
## Multiple R-squared:  0.3847, Adjusted R-squared:  0.3721 
## F-statistic: 30.33 on 2 and 97 DF,  p-value: 5.875e-11

anova(lm(y~u+v,data=dR))

## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## u          1  4.3976  4.3976  27.833 8.047e-07 ***
## v          1  5.1865  5.1865  32.826 1.133e-07 ***
## Residuals 97 15.3259  0.1580                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion
Consider the above counter-examples as exceptio probat regulam in casibus non exceptis (“the exception confirms the rule in cases not excepted”): they roughly outline the (hopefully labored and uncommon) structure needed to break the otherwise common and useful heuristics.
In later articles in this series we will show more about the structure of model quality and show the above heuristics actually working very well in practice (and adding a lot of value to projects).