Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A short post to get back – for my nonlife insurance course – on the interpretation of the output of a regression when there is a categorical covariate. Consider the following dataset
> db = read.table("http://freakonometrics.free.fr/db.txt",header=TRUE,sep=";") > tail(db) Y X1 X2 X3 995 1 4.801836 20.82947 A 996 1 9.867854 24.39920 C 997 1 5.390730 21.25119 D 998 1 6.556160 20.79811 D 999 1 4.710276 21.15373 A 1000 1 6.631786 19.38083 A
Let us run a logistic regression on that dataset
> reg = glm(Y~X1+X2+X3,family=binomial,data=db) > summary(reg) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -4.45885 1.04646 -4.261 2.04e-05 *** X1 0.51664 0.11178 4.622 3.80e-06 *** X2 0.21008 0.07247 2.899 0.003745 ** X3B 1.74496 0.49952 3.493 0.000477 *** X3C -0.03470 0.35691 -0.097 0.922543 X3D 0.08004 0.34916 0.229 0.818672 X3E 2.21966 0.56475 3.930 8.48e-05 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 552.64 on 999 degrees of freedom Residual deviance: 397.69 on 993 degrees of freedom AIC: 411.69 Number of Fisher Scoring iterations: 7
Here, the reference is modality
where
For someone with characteristics
For someone with characteristics
(etc.) Here, if we accept
A natural idea can be to change the reference modality, and to look at the
> M = matrix(NA,5,5) > rownames(M)=colnames(M)=LETTERS[1:5] > for(k in 1:5){ + db$X3 = relevel(X3,LETTERS[k]) + reg = glm(Y~X1+X2+X3,family=binomial,data=db) + M[levels(db$X3)[-1],k] = summary(reg)$coefficients[4:7,4] + } > M A B C D E A NA 0.0004771853 9.225428e-01 0.8186723647 8.482647e-05 B 4.771853e-04 NA 4.841204e-04 0.0009474491 4.743636e-01 C 9.225428e-01 0.0004841204 NA 0.7506242347 9.194193e-05 D 8.186724e-01 0.0009474491 7.506242e-01 NA 1.730589e-04 E 8.482647e-05 0.4743636442 9.194193e-05 0.0001730589 NA
and if we simply want to know if the
> M.TF = M>.05 > M.TF A B C D E A NA FALSE TRUE TRUE FALSE B FALSE NA FALSE FALSE TRUE C TRUE FALSE NA TRUE FALSE D TRUE FALSE TRUE NA FALSE E FALSE TRUE FALSE FALSE NA
The first column is obtained when
and are not different from is not different from and are not different from and are not different from is not different from
Note that we only have, here, some kind of intuition. So, let us run a more formal test. Let us consider the following regression (we remove the intercept to get a model easier to understand)
> library(car) > db$X3=relevel(X3,"A") > reg=glm(Y~0+X1+X2+X3,family=binomial,data=db) > summary(reg) Coefficients: Estimate Std. Error z value Pr(>|z|) X1 0.51664 0.11178 4.622 3.80e-06 *** X2 0.21008 0.07247 2.899 0.00374 ** X3A -4.45885 1.04646 -4.261 2.04e-05 *** X3E -2.23919 1.06666 -2.099 0.03580 * X3D -4.37881 1.04887 -4.175 2.98e-05 *** X3C -4.49355 1.06266 -4.229 2.35e-05 *** X3B -2.71389 1.07274 -2.530 0.01141 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1386.29 on 1000 degrees of freedom Residual deviance: 397.69 on 993 degrees of freedom AIC: 411.69 Number of Fisher Scoring iterations: 7
It is possible to use Fisher test to test if some coefficients are equal, or not (more generally if some linear constraints are satisfied)
> linearHypothesis(reg,c("X3A=X3C","X3A=X3D","X3B=X3E")) Linear hypothesis test Hypothesis: X3A - X3C = 0 X3A - X3D = 0 - X3E + X3B = 0 Model 1: restricted model Model 2: Y ~ 0 + X1 + X2 + X3 Res.Df Df Chisq Pr(>Chisq) 1 996 2 993 3 0.6191 0.892
Here, we clearly accept the assumption that the first three factors are equal, as well as the last two. What is the next step? Well, if we believe that there are mainly two categories,
> X3bis=rep(NA,length(X3)) > X3bis[X3%in%c("A","C","D")]="ACD" > X3bis[X3%in%c("B","E")]="BE" > db$X3bis=as.factor(X3bis) > reg=glm(Y~X1+X2+X3bis,family=binomial,data=db) > summary(reg) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -4.39439 1.02791 -4.275 1.91e-05 *** X1 0.51378 0.11138 4.613 3.97e-06 *** X2 0.20807 0.07234 2.876 0.00402 ** X3bisBE 1.94905 0.36852 5.289 1.23e-07 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 552.64 on 999 degrees of freedom Residual deviance: 398.31 on 996 degrees of freedom AIC: 406.31 Number of Fisher Scoring iterations: 7
Here, all the categories are significant. So we do have a proper model.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.