Regression on variables, or on categories?

arthur charpentier

8 years ago

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers].

I admit it, the title sounds weird. The problem I want to address this evening concerns the stepwise procedure applied to a regression model with categorical variables, and the misinterpretations it can lead to. Consider the following dataset:

> db = read.table("http://freakonometrics.free.fr/db2.txt",header=TRUE,sep=";")

First, let us change the reference in our categorical variable (just to get an easier interpretation later on)

> db$X3=relevel(as.factor(db$X3),ref="E")
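As a minimal illustration of what relevel does (on a toy factor, not the dataset above), the chosen reference level is moved to the front of the level list, so that the coefficients of the other levels are contrasts against it:

```r
# Toy factor, for illustration only (not the db dataset)
x <- factor(c("A", "B", "E", "C", "D"))
levels(x)                      # "A" "B" "C" "D" "E": alphabetical by default

# Make "E" the reference (first) level
x_releveled <- relevel(x, ref = "E")
levels(x_releveled)            # "E" "A" "B" "C" "D"
```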

If we run a logistic regression on the three variables (two continuous, one categorical), we get

> reg=glm(Y~X1+X2+X3,family=binomial,data=db)
> summary(reg)

Call:
glm(formula = Y ~ X1 + X2 + X3, family = binomial, data = db)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0758   0.1226   0.2805   0.4798   2.0345  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.39528    0.86649  -6.227 4.77e-10 ***
X1           0.51618    0.09163   5.633 1.77e-08 ***
X2           0.24665    0.05911   4.173 3.01e-05 ***
X3A         -0.09142    0.32970  -0.277   0.7816    
X3B         -0.10558    0.32526  -0.325   0.7455    
X3C          0.63829    0.37838   1.687   0.0916 .  
X3D         -0.02776    0.33070  -0.084   0.9331    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 806.29  on 999  degrees of freedom
Residual deviance: 582.29  on 993  degrees of freedom
AIC: 596.29

Number of Fisher Scoring iterations: 6

Now, if we use a stepwise procedure to select variables in the model, we get

> step(reg)
Start:  AIC=596.29
Y ~ X1 + X2 + X3

       Df Deviance    AIC
- X3    4   587.81 593.81
<none>      582.29 596.29
- X2    1   600.56 612.56
- X1    1   617.25 629.25

Step:  AIC=593.81
Y ~ X1 + X2

       Df Deviance    AIC
<none>      587.81 593.81
- X2    1   606.90 610.90
- X1    1   622.44 626.44

So clearly, if our starting point is the regression on the three variables, we should remove the categorical variable: dropping X3 lowers the AIC from 596.29 to 593.81, while dropping X1 or X2 raises it.
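The arithmetic behind that decision can be checked by hand. The sketch below only reuses the deviances printed above; the variable names are made up for illustration. It relies on the fact that, for a 0/1 response, the AIC reported by glm equals the residual deviance plus twice the number of coefficients, which agrees with the printed values.

```r
# Deviances and coefficient counts read off the outputs above
dev_full <- 582.29; k_full <- 7   # intercept, X1, X2 and four X3 contrasts
dev_red  <- 587.81; k_red  <- 3   # intercept, X1, X2

# AIC = deviance + 2 * (number of coefficients) for a 0/1 response
aic_full <- dev_full + 2 * k_full   # 596.29
aic_red  <- dev_red  + 2 * k_red    # 593.81

# Dropping X3 costs 5.52 in deviance but saves 2 * 4 = 8 in penalty
lr_stat <- dev_red - dev_full

# Likelihood ratio test of the whole factor, on 4 degrees of freedom
p_value <- pchisq(lr_stat, df = k_full - k_red, lower.tail = FALSE)
p_value
```

The factor is judged as a block: its four contrasts must jointly buy more than 8 points of deviance to stay in the model, whatever the contribution of any single category.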

Now consider the same model, written slightly differently, with one indicator variable for each of the five categories:

> X3complete = model.matrix(~0+X3,data=db)
> db2 = data.frame(db,X3complete)
> head(db2)
  Y       X1       X2 X3 X3A X3B X3C X3D X3E
1 1 3.297569 16.25411  B   0   1   0   0   0
2 1 6.418031 18.45130  D   0   0   0   1   0
3 1 5.279068 16.61806  B   0   1   0   0   0
4 1 5.539834 19.72158  C   0   0   1   0   0
5 1 4.123464 18.38634  C   0   0   1   0   0
6 1 7.778443 19.58338  C   0   0   1   0   0

From a technical point of view, this is exactly the same model as before, as the regression output shows:

> reg = glm(Y~X1+X2+X3A+X3B+X3C+X3D+X3E,family=binomial,data=db2)
> summary(reg)

Call:
glm(formula = Y ~ X1 + X2 + X3A + X3B + X3C + X3D + X3E, family = binomial, 
    data = db2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0758   0.1226   0.2805   0.4798   2.0345  

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.39528    0.86649  -6.227 4.77e-10 ***
X1           0.51618    0.09163   5.633 1.77e-08 ***
X2           0.24665    0.05911   4.173 3.01e-05 ***
X3A         -0.09142    0.32970  -0.277   0.7816    
X3B         -0.10558    0.32526  -0.325   0.7455    
X3C          0.63829    0.37838   1.687   0.0916 .  
X3D         -0.02776    0.33070  -0.084   0.9331    
X3E               NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 806.29  on 999  degrees of freedom
Residual deviance: 582.29  on 993  degrees of freedom
AIC: 596.29

Number of Fisher Scoring iterations: 6
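The NA on X3E comes from collinearity: the five indicators always sum to one, which is exactly the intercept column, so one of them cannot be estimated. A small sketch on a toy factor (illustrative data, not db) shows the rank deficiency:

```r
# Toy factor with the same five levels (illustrative data, not db)
X3 <- relevel(factor(c("A", "B", "C", "D", "E", "A", "C")), ref = "E")

dummies <- model.matrix(~ 0 + X3)          # one indicator column per level
design  <- cbind(Intercept = 1, dummies)   # intercept plus all five indicators

all(rowSums(dummies) == 1)   # TRUE: the indicators sum to the intercept column
ncol(design)                 # 6 columns
qr(design)$rank              # 5: one column is redundant
```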

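Both regressions are equivalent: same estimates, same deviance, same AIC, with X3E left out because it is collinear with the intercept and the other four indicators. Now, what about a stepwise selection on this new model?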

> step(reg)
Start:  AIC=596.29
Y ~ X1 + X2 + X3A + X3B + X3C + X3D + X3E

Step:  AIC=596.29
Y ~ X1 + X2 + X3A + X3B + X3C + X3D

       Df Deviance    AIC
- X3D   1   582.30 594.30
- X3A   1   582.37 594.37
- X3B   1   582.40 594.40
<none>      582.29 596.29
- X3C   1   585.21 597.21
- X2    1   600.56 612.56
- X1    1   617.25 629.25

Step:  AIC=594.3
Y ~ X1 + X2 + X3A + X3B + X3C

       Df Deviance    AIC
- X3A   1   582.38 592.38
- X3B   1   582.41 592.41
<none>      582.30 594.30
- X3C   1   586.30 596.30
- X2    1   600.58 610.58
- X1    1   617.27 627.27

Step:  AIC=592.38
Y ~ X1 + X2 + X3B + X3C

       Df Deviance    AIC
- X3B   1   582.44 590.44
<none>      582.38 592.38
- X3C   1   587.20 595.20
- X2    1   600.59 608.59
- X1    1   617.64 625.64

Step:  AIC=590.44
Y ~ X1 + X2 + X3C

       Df Deviance    AIC
<none>      582.44 590.44
- X3C   1   587.81 593.81
- X2    1   600.73 606.73
- X1    1   617.66 623.66

What do we get now? This time, the stepwise procedure recommends keeping one category (namely C), with an AIC of 590.44, below the 593.81 obtained by dropping the whole factor. So my point is simple: when running a stepwise procedure with a factor, either the factor is kept as it is, or it is dropped entirely. If the design needs to be changed by pooling some categories together and we forget to do so, the procedure will suggest removing the variable, because four categories that mean the same thing cost too many parameters under the Akaike criterion. And this is exactly what happens here:

> library(car)
> reg = glm(formula = Y ~ X1 + X2 + X3, family = binomial, data = db)
> linearHypothesis(reg,c("X3A=X3B","X3A=X3D","X3A=0"))
Linear hypothesis test

Hypothesis:
X3A - X3B = 0
X3A - X3D = 0
X3A = 0

Model 1: restricted model
Model 2: Y ~ X1 + X2 + X3

  Res.Df Df  Chisq Pr(>Chisq)
1    996                     
2    993  3 0.1446      0.986

So here, the test does not reject the joint hypothesis (p-value 0.986), and we should pool together categories A, B, D and E (the latter being the reference here). As mentioned in a previous post, categories that should be pooled together must be pooled as early as possible. Otherwise, the stepwise procedure might lead to misinterpretations.
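One way to carry out the pooling is sketched below on simulated data, so that it runs without the original file. The simulation coefficients and the names sim, X3pooled and ABDE are assumptions made up for illustration; only category C is given an effect of its own.

```r
# Simulated data (hypothetical parameters, not the db dataset)
set.seed(1)
n <- 1000
sim <- data.frame(
  X1 = rnorm(n, mean = 5, sd = 1.5),
  X2 = rnorm(n, mean = 18, sd = 2),
  X3 = factor(sample(LETTERS[1:5], n, replace = TRUE))
)
eta <- -5.4 + 0.5 * sim$X1 + 0.25 * sim$X2 + 0.6 * (sim$X3 == "C")
sim$Y <- rbinom(n, size = 1, prob = plogis(eta))

# Pool every level except C into a single level
sim$X3pooled <- sim$X3
levels(sim$X3pooled)[levels(sim$X3pooled) != "C"] <- "ABDE"

# Five-level factor versus pooled two-level factor
reg_full   <- glm(Y ~ X1 + X2 + X3,       family = binomial, data = sim)
reg_pooled <- glm(Y ~ X1 + X2 + X3pooled, family = binomial, data = sim)

# Likelihood ratio test of the pooling (3 degrees of freedom)
anova(reg_pooled, reg_full, test = "Chisq")

# The stepwise search now sees the factor as a single 1-df term
reg_step <- step(reg_pooled, trace = 0)
formula(reg_step)
```

With the pooled factor, the stepwise procedure compares models that differ by a single coefficient for C, so the choice is no longer between paying for four contrasts and losing the factor altogether.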
