Automatically combining factor levels in R
Each time we face real applications in an applied econometrics course, we have to deal with categorical variables. And the same question arises from students: how can we automatically combine factor levels? Is there a simple R function?
I did upload a few blog posts over the past years, but so far nothing satisfying. Let me write down a few lines about what could be done. And if someone wants to write a nice R function, that would be awesome. To illustrate the idea, consider the following (simulated) dataset
n=200
set.seed(1)
x1=runif(n)
x2=runif(n)
y=1+2*x1-x2+rnorm(n,0,.2)
LB=sample(LETTERS[1:10])
b=data.frame(y=y,x1=x1,
             x2=cut(x2,breaks=c(-1,.05,.1,.2,.35,.4,.55,.65,.8,.9,2),
                    labels=LB))
str(b)
'data.frame': 200 obs. of 3 variables:
 $ y : num 1.345 1.863 1.946 2.481 0.765 ...
 $ x1: num 0.266 0.372 0.573 0.908 0.202 ...
 $ x2: Factor w/ 10 levels "I","A","H","F",..: 4 4 6 4 3 6 7 3 4 8 ...
table(b$x2)[LETTERS[1:10]]
 A  B  C  D  E  F  G  H  I  J 
11 12 23 34 23 36 12 32  3 14
There is one (continuous) dependent variable \(y\), one continuous covariate \(x_1\) and one categorical variable \(x_2\), here with ten levels. We can plot the data using
plot(b$x1,y,col="white",xlim=c(0,1.1))
text(b$x1,y,as.character(b$x2),cex=.5)
The output of a linear regression yields the following predictions
for(i in 1:10){
  p=function(x) predict(lm(y~x1+x2,data=b),newdata=data.frame(x1=x,x2=LETTERS[i]))
  u=seq(-1,1.065,by=.01)
  v=Vectorize(p)(u)
  lines(u,v)}
The slope for \(x_1\) is the same for all levels; we simply add a different constant for each level. As we can see, some levels are very close, so it seems legitimate to combine them into a single category. Here is the output of the linear regression,
summary(lm(y~x1+x2,data=b))

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.843802   0.119655   7.052 3.23e-11 ***
x1           1.992878   0.053838  37.016  < 2e-16 ***
x2A          0.055500   0.131173   0.423   0.6727    
x2H          0.009293   0.121626   0.076   0.9392    
x2F         -0.177002   0.121020  -1.463   0.1452    
x2B         -0.218152   0.130192  -1.676   0.0955 .  
x2D         -0.206970   0.121294  -1.706   0.0896 .  
x2G         -0.407417   0.129999  -3.134   0.0020 ** 
x2C         -0.526708   0.123690  -4.258 3.24e-05 ***
x2J         -0.664281   0.128126  -5.185 5.54e-07 ***
x2E         -0.816454   0.123625  -6.604 3.94e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2014 on 189 degrees of freedom
Multiple R-squared:  0.8995, Adjusted R-squared:  0.8942 
F-statistic: 169.1 on 10 and 189 DF,  p-value: < 2.2e-16

AIC(lm(y~x1+x2,data=b))
[1] -60.74443
BIC(lm(y~x1+x2,data=b))
[1] -21.16463
Here the reference category is “I”. And it looks like we could actually combine that category with several others. One strategy here would be to select all categories that do not seem to be significantly different, and to run a (multiple) test
library(car)
linearHypothesis(lm(y~x1+x2,data=b), c("x2A = 0", "x2H = 0", "x2F = 0"))

Hypothesis:
x2A = 0
x2H = 0
x2F = 0

Model 1: restricted model
Model 2: y ~ x1 + x2

  Res.Df    RSS Df Sum of Sq      F Pr(>F)    
1    192 8.4651                               
2    189 7.6654  3   0.79971 6.5726  3e-04 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
It seems that we can combine those four categories together.
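As a quick check, here is a small sketch (not part of the original analysis; the copy b2 and the merged label below are just illustrative choices): we can merge those four levels on a copy of the dataset and compare the information criteria with the ones of the full model above.

## merge I, A, H and F into a single level, on a copy of the dataset
## (b2 and the label "I+A+H+F" are illustrative names)
b2 = b
levels(b2$x2)[levels(b2$x2) %in% c("I","A","H","F")] = "I+A+H+F"
AIC(lm(y~x1+x2,data=b2))
BIC(lm(y~x1+x2,data=b2))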
Here, we can see what’s going on when we change the reference category (actually, looping over all categories)
P=matrix(NA,nlevels(b$x2),nlevels(b$x2))
colnames(P)=rownames(P)=LETTERS[1:10]
plot(1:nlevels(b$x2),1:nlevels(b$x2),col="white",xlab="",ylab="",axes=F,
     xlim=c(0,10.5),ylim=c(0,10.5))
text(1:10,0,LETTERS[1:10])
text(0,1:10,LETTERS[1:10])
for(i in 1:nlevels(b$x2)){
  b$x2=relevel(b$x2,LETTERS[i])
  p=summary(lm(y~x1+x2,data=b))$coefficients[-(1:2),4]
  names(p)=substr(names(p),3,3)
  P[LETTERS[i],names(p)]=p
  p=P[LETTERS[i],]
  idx=which(p>.05)
  points((1:10)[idx],rep(i,length(idx)),pch=1,cex=2)
  idx=which(p>.1)
  points((1:10)[idx],rep(i,length(idx)),pch=19,cex=2)}
We are glad to see that it is symmetric: if “H” should be combined with “I”, then “I” should also be combined with “H”.
Here black points indicate pairs of levels with a \(p\)-value above 10%, and white circles pairs with a \(p\)-value above 5%. This graph is actually hard to read… And actually, it reminds us of Bertin (1967).
Here, we can manually predefine some ordering (we will see below how it might be automated)
LETTERSord=c("I","A","H","F","B","D","G","C","J","E")
P=matrix(NA,nlevels(b$x2),nlevels(b$x2))
colnames(P)=rownames(P)=LETTERSord
plot(1:nlevels(b$x2),1:nlevels(b$x2),col="white",xlab="",ylab="",axes=F,
     xlim=c(0,10.5),ylim=c(0,10.5))
ct=c(3,3,2,1,1)
abline(v=.5+c(0,cumsum(ct)),lty=2)
abline(h=.5+c(0,cumsum(ct)),lty=2)
text(1:10,0,LETTERSord)
text(0,1:10,LETTERSord)
for(i in 1:nlevels(b$x2)){
  b$x2=relevel(b$x2,LETTERSord[i])
  p=summary(lm(y~x1+x2,data=b))$coefficients[-(1:2),4]
  names(p)=substr(names(p),3,3)
  P[LETTERSord[i],names(p)]=p
  p=P[LETTERSord[i],]
  idx=which(p>.05)
  points((1:10)[idx],rep(i,length(idx)),pch=1,cex=2)
  idx=which(p>.1)
  points((1:10)[idx],rep(i,length(idx)),pch=19,cex=2)}
Here we get the following
It looks like we have our combined categories…
Actually, it is possible to use another strategy. We start from some level, say “A”. Then, we merge it with all non-significantly different levels. If “B” is not one of them, we use it as the new reference. Etc.
for(i in 1:nlevels(b$x2)){
  if(LETTERS[i]%in%levels(b$x2)){
    b$x2=relevel(b$x2,LETTERS[i])
    ## p-values of the level effects, compared with the current reference
    p=summary(lm(y~x1+x2,data=b))$coefficients[-(1:2),4]
    names(p)=substr(names(p),3,nchar(names(p)))
    idx=which(p>.05)
    ## merge the reference level with all levels that are not significantly different
    mix=c(LETTERS[i],names(p)[idx])
    b$x2=recode(b$x2,
      paste("c('",paste(mix,collapse="','"),"')='",paste(mix,collapse="+"),"'",sep=""))
  }}
The final categories are
table(b$x2)

A+I+H B+D+F   C+G     E     J 
   46    82    35    23    14
with the following regression output
summary(lm(y~x1+x2,data=b))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.86407    0.03950  21.877  < 2e-16 ***
x1           1.99180    0.05323  37.417  < 2e-16 ***
x2B+D+F     -0.21517    0.03699  -5.817 2.44e-08 ***
x2C+G       -0.50545    0.04528 -11.164  < 2e-16 ***
x2E         -0.83617    0.05128 -16.305  < 2e-16 ***
x2J         -0.68398    0.06131 -11.156  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2008 on 194 degrees of freedom
Multiple R-squared:  0.8975, Adjusted R-squared:  0.8948 
F-statistic: 339.6 on 5 and 194 DF,  p-value: < 2.2e-16

AIC(lm(y~x1+x2,data=b))
[1] -66.76939
BIC(lm(y~x1+x2,data=b))
[1] -43.68117
This is consistent with the groups we got before. But actually, if we change the order, we can get different combinations. For instance, if we go from “J” to “A” instead of “A” to “J” (starting again from the original ten-level factor), we obtain
for(i in nlevels(b$x2):1){
  if(LETTERS[i]%in%levels(b$x2)){
    b$x2=relevel(b$x2,LETTERS[i])
    p=summary(lm(y~x1+x2,data=b))$coefficients[-(1:2),4]
    names(p)=substr(names(p),3,nchar(names(p)))
    idx=which(p>.05)
    mix=c(LETTERS[i],names(p)[idx])
    b$x2=recode(b$x2,
      paste("c('",paste(mix,collapse="','"),"')='",paste(mix,collapse="+"),"'",sep=""))
  }}
table(b$x2)

          E         G+C I+A+B+D+F+H           J 
         23          35         128          14
with different information criteria here
AIC(lm(y~x1+x2,data=b))
[1] -36.61665
BIC(lm(y~x1+x2,data=b))
[1] -16.82675
I guess it would be necessary to randomize the order in which we go through the levels (a small sketch of this is given at the end of the post). Last, but not least, one can use regression trees (even if that is not per se in the syllabus of the course). The problem is that there is another explanatory variable that might interfere. So I would suggest (1) to fit a linear model \(y_i=\beta_0+\beta_1 x_{1,i}+u_i\) and to compute the residuals \(\widehat{u}_i\), and (2) to run a regression tree explaining \(\widehat{u}_i\) with the categorical variable \(x_2\) (with its original ten levels). I did explain how trees are built when the explanatory variable is a categorical one in a previous post.
library(rpart)
library(rpart.plot)
b$e=residuals(lm(y~x1,data=b))
arbre=rpart(e~x2,data=b)
prp(arbre,type=2,extra=1)
Observe that the leaves contain the same groups as the ones we got before.
arbre
n= 200 

node), split, n, deviance, yval
      * denotes terminal node

1) root 200 22.563500  7.771561e-18  
  2) x2=G,C,J,E      72  4.441495 -3.232525e-01  
    4) x2=J,E        37  1.553520 -4.578492e-01 *
    5) x2=G,C        35  1.509068 -1.809646e-01 *
  3) x2=I,A,H,F,B,D 128  6.366628  1.818295e-01  
    6) x2=F,B,D      82  2.983381  1.048246e-01 *
    7) x2=I,A,H      46  2.030229  3.190993e-01 *
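If one prefers to read those groups directly off the fitted tree rather than from the printed output, here is a small sketch (using the arbre object above, and assuming x2 is still the ten-level factor the tree was fitted on): the terminal node of each observation is stored in arbre$where, so we can simply list which levels of x2 end up in each leaf.

## terminal node (leaf) of each observation
leaf = arbre$where
## levels of x2 falling in each leaf
lapply(split(as.character(b$x2), leaf), unique)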
I guess that it should be possible to put all that in an R function, to suggest combinations of levels that might improve the regression.
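For instance, here is a minimal sketch of what such a function could look like (the name merge_levels, its arguments, the use of a random ordering of the levels, and the choice of AIC as the criterion are all my own assumptions, not an existing packaged solution). It applies the sequential merging strategy above for one ordering of the levels, and returns the merged factor together with the AIC of the corresponding regression; one can then replicate it over random orderings and keep the best one.

library(car)   # for recode()

## A sketch, not a definitive implementation: sequentially merge the levels
## of 'xname' in 'data', following 'ordering', whenever they are not
## significantly different from the current reference (at level 'alpha').
merge_levels = function(data, yname="y", xname="x2", zname="x1",
                        ordering=sample(levels(data[,xname])), alpha=.05){
  f = as.formula(paste(yname,"~",zname,"+",xname))
  for(lev in ordering){
    if(lev %in% levels(data[,xname])){
      data[,xname] = relevel(data[,xname], lev)
      cf = summary(lm(f, data=data))$coefficients
      ## p-values of the level effects, with the "x2" prefix stripped
      p = setNames(cf[-(1:2),4], substring(rownames(cf)[-(1:2)], nchar(xname)+1))
      mix = c(lev, names(p)[p > alpha])
      if(length(mix) > 1)
        data[,xname] = recode(data[,xname],
          paste("c('",paste(mix,collapse="','"),"')='",
                paste(mix,collapse="+"),"'",sep=""))
    }
  }
  list(x = data[,xname], aic = AIC(lm(f, data=data)))
}

## try a few random orderings (on the original simulated dataset b, with its
## ten-level factor x2) and keep the merging with the lowest AIC
set.seed(123)
res = replicate(20, merge_levels(b), simplify=FALSE)
best = res[[which.min(sapply(res, function(r) r$aic))]]
table(best$x)
best$aic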