Classification with Categorical Variables (the fuzzy side)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The Gaussian and the (log) Poisson regressions share a very interesting property,
i.e. the average predicted value is the empirical mean of our sample.
> mean(predict(lm(dist~speed,data=cars))) [1] 42.98 > mean(cars$dist) [1] 42.98
One can prove that it is also the prediction for the average individual in our sample
> predict(lm(dist~speed,data=cars), + newdata=data.frame(speed=mean(cars$speed))) 42.98
The geometric interpretation is that the regression line passes through the centroid,
> plot(cars) > abline(lm(dist~speed,data=cars),col="red") > abline(h=mean(cars$dist),col="blue") > abline(v=mean(cars$speed),col="blue") > points(mean(cars$speed),mean(cars$dist))
But in all other cases, it is no longer the case. Consider for instance the case of a logistic regression. And to ask for something even more complicated, consider the case where we have only categorical explanatory variables. In that context, it is more difficult to get a prediction for the “average individual”. Unless we consider some fuzzy interpretation of the regression.
Consider the following dataset
> source("http://freakonometrics.free.fr/import_data_credit.R")
Just to get a simple model, consider the following regression model, on three covariates,
> reg_f=glm(class~checking_status+duration+ + credit_history,data=train.db,family=binomial) > summary(reg_f) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.3058 0.2765 -4.722 2.33e-06 *** checking_statusCA > 200 euros -1.2297 0.4691 -2.621 0.008761 ** checking_statusCA in [0-200 euros[ -0.6047 0.2314 -2.614 0.008962 ** checking_statusNo checking account -1.8756 0.2570 -7.298 2.92e-13 *** duration(15,36] 0.7630 0.2102 3.629 0.000284 *** duration(36,Inf] 1.3576 0.3543 3.832 0.000127 *** credit_historycritical account 1.9812 0.3679 5.385 7.24e-08 *** credit_historyexisting credits paid back duly till now 0.8171 0.2497 3.273 0.001065 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
An alternative is to use the regression on dummy variables,
> library(FactoMineR) > credit_disj=data.frame(class=train.db$class, + tab.disjonctif(train.db[,-which(names( + train.db)=="class")])) > reg_d=glm(class~.,data=credit_disj[,1:11], + family=binomial)
It is equivalent since it is exactly what R is doing while running the regression on the covariate. Well, not exactly. Reference modalities will be different, the output is different
Coefficients: (3 not defined because of singularities) Estimate Std. Error z value Pr(>|z|) (Intercept) -1.0066 0.3753 -2.682 0.007310 ** CA...0.euros 1.8756 0.2570 7.298 2.92e-13 *** CA...200.euros 0.6459 0.4855 1.330 0.183396 CA.in..0.200.euros. 1.2709 0.2609 4.871 1.11e-06 *** No.checking.account NA NA NA NA X.0.15. -1.3576 0.3543 -3.832 0.000127 *** X.15.36. -0.5947 0.3410 -1.744 0.081161 . X.36.Inf. NA NA NA NA all.credits.paid.back.duly -0.8171 0.2497 -3.273 0.001065 ** critical.account 1.1641 0.3156 3.688 0.000226 *** existing.credits.paid.back.duly.till.now NA NA NA NA --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
But it is the same model. Hence, predictions are exactly the same
> predict(reg_f,type="response")[1:10] 0.21319568 0.56568074 0.70452901 0.56814422 0.16780141 0.08593906 0.24094435 0.36753641 0.38020333 0.56814422 > predict(reg_d,type="response")[1:10] 0.21319568 0.56568074 0.70452901 0.56814422 0.16780141 0.08593906 0.24094435 0.36753641 0.38020333 0.56814422
Based on that second regression, it is possible to get a prediction for the average individual of the dataset
> tab.disj.comp <- tab.disjonctif( + train.db[,-which(names(train.db)=="class")]) > apply(tab.disj.comp,2,mean) CA < 0 euros CA > 200 euros CA in [0-200 euros[ 0.274844720 0.054347826 0.282608696 No checking account (0,15] (15,36] 0.388198758 0.423913043 0.495341615
Consider the regression on the contingency table
> credit_disj=data.frame(class=train.db$class, + tab.disj.comp) > reg=glm(class~.,data=credit_disj, + family=binomial)
and compute the prediction for the average individual
> nd=as.data.frame(t(apply(tab.disj.comp, + 2,mean))) > names(nd)=names(credit_disj)[-1] > predict(reg,newdata=nd,type="response") 0.1934358
We are quite far away, here, compared with the average value
> mean(as.numeric(train.db$class)-1) 0.2981366
but again, there is no reason to get the same value. Actually, if we were running a Gaussian regresision, it would be the same (even with that fuzzy interpretation of those categories),
> credit_disj=data.frame(class=as.numeric( + train.db$class)-1,tab.disj.comp) > reg=lm(class~.,data=credit_disj) > predict(reg,newdata=nd) 0.2981366
Soon, we will see an application of that fuzzy regression…
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.