Classification with Categorical Variables (the fuzzy side)

Posted on April 9, 2015 by arthur charpentier in R bloggers | 0 Comments

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The Gaussian and the (log) Poisson regressions share a very interesting property,

$\frac{1}{n}\sum_{i=1}^n \widehat{Y}_i=\frac{1}{n}\sum_{i=1}^n Y_i$

i.e. the average predicted value is the empirical mean of our sample.

> mean(predict(lm(dist~speed,data=cars)))
[1] 42.98
> mean(cars$dist)
[1] 42.98

One can prove that it is also the prediction for the average individual in our sample

> predict(lm(dist~speed,data=cars),
+ newdata=data.frame(speed=mean(cars$speed))) 
42.98

The geometric interpretation is that the regression line passes through the centroid,

> plot(cars)
> abline(lm(dist~speed,data=cars),col="red")
> abline(h=mean(cars$dist),col="blue")
> abline(v=mean(cars$speed),col="blue")
> points(mean(cars$speed),mean(cars$dist))

But in all other cases, it is no longer the case. Consider for instance the case of a logistic regression. And to ask for something even more complicated, consider the case where we have only categorical explanatory variables. In that context, it is more difficult to get a prediction for the “average individual”. Unless we consider some fuzzy interpretation of the regression.

Consider the following dataset

> source("http://freakonometrics.free.fr/import_data_credit.R")

Just to get a simple model, consider the following regression model, on three covariates,

> reg_f=glm(class~checking_status+duration+
+ credit_history,data=train.db,family=binomial)
> summary(reg_f)
 
Coefficients:
                                                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)                                             -1.3058     0.2765  -4.722 2.33e-06 ***
checking_statusCA > 200 euros                           -1.2297     0.4691  -2.621 0.008761 ** 
checking_statusCA in [0-200 euros[                      -0.6047     0.2314  -2.614 0.008962 ** 
checking_statusNo checking account                      -1.8756     0.2570  -7.298 2.92e-13 ***
duration(15,36]                                          0.7630     0.2102   3.629 0.000284 ***
duration(36,Inf]                                         1.3576     0.3543   3.832 0.000127 ***
credit_historycritical account                           1.9812     0.3679   5.385 7.24e-08 ***
credit_historyexisting credits paid back duly till now   0.8171     0.2497   3.273 0.001065 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

An alternative is to use the regression on dummy variables,

> library(FactoMineR)
> credit_disj=data.frame(class=train.db$class,
+ tab.disjonctif(train.db[,-which(names(
+ train.db)=="class")]))
> reg_d=glm(class~.,data=credit_disj[,1:11],
+ family=binomial)

It is equivalent since it is exactly what R is doing while running the regression on the covariate. Well, not exactly. Reference modalities will be different, the output is different

Coefficients: (3 not defined because of singularities)
                                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)                               -1.0066     0.3753  -2.682 0.007310 ** 
CA...0.euros                               1.8756     0.2570   7.298 2.92e-13 ***
CA...200.euros                             0.6459     0.4855   1.330 0.183396    
CA.in..0.200.euros.                        1.2709     0.2609   4.871 1.11e-06 ***
No.checking.account                            NA         NA      NA       NA    
X.0.15.                                   -1.3576     0.3543  -3.832 0.000127 ***
X.15.36.                                  -0.5947     0.3410  -1.744 0.081161 .  
X.36.Inf.                                      NA         NA      NA       NA    
all.credits.paid.back.duly                -0.8171     0.2497  -3.273 0.001065 ** 
critical.account                           1.1641     0.3156   3.688 0.000226 ***
existing.credits.paid.back.duly.till.now       NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

But it is the same model. Hence, predictions are exactly the same

> predict(reg_f,type="response")[1:10]
0.21319568 0.56568074 0.70452901 0.56814422 0.16780141 0.08593906 0.24094435 0.36753641 0.38020333 0.56814422 
> predict(reg_d,type="response")[1:10]
0.21319568 0.56568074 0.70452901 0.56814422 0.16780141 0.08593906 0.24094435 0.36753641 0.38020333 0.56814422

Based on that second regression, it is possible to get a prediction for the average individual of the dataset

> tab.disj.comp <- tab.disjonctif(
+ train.db[,-which(names(train.db)=="class")])
> apply(tab.disj.comp,2,mean)
      CA < 0 euros      CA > 200 euros    CA in [0-200 euros[ 
     0.274844720         0.054347826            0.282608696 
 No checking account      (0,15]                        (15,36] 
     0.388198758         0.423913043            0.495341615

Consider the regression on the contingency table

> credit_disj=data.frame(class=train.db$class,
+ tab.disj.comp)
> reg=glm(class~.,data=credit_disj,
+ family=binomial)

and compute the prediction for the average individual

> nd=as.data.frame(t(apply(tab.disj.comp,
+ 2,mean)))
> names(nd)=names(credit_disj)[-1]
> predict(reg,newdata=nd,type="response")
0.1934358

We are quite far away, here, compared with the average value

> mean(as.numeric(train.db$class)-1)
0.2981366

but again, there is no reason to get the same value. Actually, if we were running a Gaussian regresision, it would be the same (even with that fuzzy interpretation of those categories),

> credit_disj=data.frame(class=as.numeric(
+ train.db$class)-1,tab.disj.comp)
> reg=lm(class~.,data=credit_disj)
> predict(reg,newdata=nd)
0.2981366

Soon, we will see an application of that fuzzy regression…

To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics » R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Classification with Categorical Variables (the fuzzy side)

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)