Visualising a Classification in High Dimension, part 2
A few weeks ago, I published a post on Visualising a Classification in High Dimension, based on a principal component analysis, to get a projection on the first two components. Following that post, I was wondering what could be done in the context of a classification on categorical covariates. A natural idea would be to consider a multiple correspondence analysis, and to run a similar code.
Consider here the dataset used in a recent post,
> source("http://freakonometrics.free.fr/import_data_credit.R")
If we consider a correspondence analysis, we get
> library(FactoMineR)
> acm=MCA(train.db,quali.sup =
+ which(names(train.db)=="class"),ncp=10)
For the covariates (including the variable we want to model, considered here as a supplementary variable), the visualisation on the first two components is
and for the individuals
The extension of the previous post (based on a principal component analysis) is easy: based on a point in the (projection) space above, we want to get back to a point in the original space, use our classifier to get a prediction, and then visualise it on that projection. The problem is that when we get back to the original space, we might not end up in actual categories: instead of having a gender that is either “male” or “female”, our point might be 83% “male” and 17% “female”. If we can run our classifier in that fuzzy space, that’s fine. The good thing is that we can, in the context of a logistic regression (as discussed in a previous post, entitled Classification with Categorical Variables (the fuzzy side)).
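To fix ideas, here is a minimal sketch of that fuzzy evaluation, on a small simulated example (the variable sexe, the sample size and the probabilities below are all made up, not taken from the credit dataset): a logistic regression fitted on a 0/1 dummy can perfectly well be evaluated at a fuzzy point such as 83% “male”,

> set.seed(1)
> sexe=sample(c("male","female"),200,replace=TRUE)
> y=rbinom(200,size=1,prob=ifelse(sexe=="male",.3,.6))
> db=data.frame(y=y,male=(sexe=="male")*1)
> fit=glm(y~male,data=db,family=binomial)
> predict(fit,newdata=data.frame(male=.83),
+ type="response")

Since the logistic transform is monotone in the linear predictor, the returned probability lies between the predictions at male=0 and male=1, which is precisely what we will exploit below.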
I should also mention that when I wrote my initial code, I ran into a technical problem when getting back to the original space (one dimension was missing),
> X=tab.disjonctif(train.db[,-which(
+ names(train.db)=="class")])
> ncol(X)
[1] 56
> res.mca = MCA(train.db,quali.sup =
+ which(names(train.db)=="class"),ncp=56)
> ncol(res.mca$var$coord)
[1] 55
So I asked Julie, who kindly helped me by sharing some of her code.
> don <- train.db[,-which(names(train.db)=="class")]
> moy.p <- function(V, poids) {
+ res <- sum(V * poids, na.rm = TRUE)/sum(poids[!is.na(V)])
+ }
> row.w = rep(1/nrow(don), nrow(don))
> tab.disj.comp = tab.disjonctif(don)
> M = apply(tab.disj.comp, 2, moy.p, row.w)/ncol(don)
> Z = sweep(tab.disj.comp, 2, apply(tab.disj.comp, 2, moy.p,
+ row.w), FUN = "/")
> Z = sweep(Z, 2, apply(Z, 2, moy.p, row.w), FUN = "-")
> Zscale = sweep(Z, 2, sqrt(M), FUN = "*")
> svd.Zscale = svd.triplet(Zscale, row.w = row.w)
> eig.shrunk = svd.Zscale$vs[1:2]
Now, we can write a function that returns a point in the original space (in a fuzzy sense), given a location in the projected space,
> reconstruct=function(x1,x2){
+ rec = tcrossprod(sweep(cbind(x1,x2), 2,
+ eig.shrunk, FUN = "*"), svd.Zscale$V[, 1:2])
+ tab.disj.rec = sweep(rec, 2, sqrt(M), FUN = "/") + matrix(1,
+ nrow(rec), ncol(rec))
+ tab.disj.rec = sweep(tab.disj.rec, 2,
+ apply(tab.disj.comp, 2, moy.p, row.w),
+ FUN = "*")
+ colnames(tab.disj.rec)=colnames(tab.disj.comp)
+ return(tab.disj.rec)
+ }
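In other words (a sketch of what this function computes, with the notations of the code: m_k the column means of the disjunctive table, M_k = m_k/Q with Q the number of variables, and λ1, λ2, V the first singular values and right singular vectors of the weighted SVD of Zscale), a location (x1, x2) is mapped back to fuzzy frequencies via

\text{rec} = (x_1, x_2)\,\operatorname{diag}(\lambda_1, \lambda_2)\, V_{[\,,1:2]}^\top,
\qquad
\widehat{x}_k = m_k \left( 1 + \frac{\text{rec}_k}{\sqrt{M_k}} \right)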
For instance, if we consider the origin in the projected space
> reconstruct(0,0)[1:10]
 [1] 0.25931677 0.06677019 0.27173913 0.40217391
 [5] 0.42546584 0.48447205 0.09006211 0.29503106
 [9] 0.09161491 0.61335404
This point is actually the average individual in our dataset,
> (apply(tab.disj.comp,2,mean))[1:4]
       CA < 0 euros      CA > 200 euros
         0.25931677          0.06677019
CA in [0-200 euros[ No checking account
         0.27173913          0.40217391
And indeed, this individual belongs to the fuzzy set of our covariates: it has a 40% chance of having no checking account, a 26% chance of a negative balance, a 27% chance of a positive balance below 200 euros, and a 7% chance of a positive balance above 200 euros. So it cannot be an actual observation from our dataset.
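Note that, by construction, the fuzzy weights of the categories of a given variable sum to one; for instance, for the four checking account categories above,

> sum(reconstruct(0,0)[1:4])
[1] 1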
Our classifier should be fitted on that fuzzy space, not on the space of categorical covariates. This was discussed in our previous post,
> credit_disj=data.frame(class=train.db$class,
+ tab.disjonctif(don))
> reg=glm(class~.,data=credit_disj,
+ family=binomial)
> nd=as.data.frame(t(apply(tab.disj.comp,
+ 2,mean)))
> names(nd)=names(credit_disj)[-1]
> predict(reg,newdata=nd,type="response")
0.1934358
Here, we get exactly the same prediction using our reconstruction function,
> nd=data.frame(reconstruct(0,0))
> predict(reg,newdata=nd,type="response")
0.1934358
We can now define a function that returns a prediction for any location in the projected space,
> prev=function(x1,x2){
+ newd=data.frame(reconstruct(x1,x2))
+ return(predict(reg,newdata=newd,
+ type="response"))
+ }
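As a sanity check, evaluating this function at the origin returns the prediction obtained above,

> prev(0,0)
0.1934358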
It is then possible to get predictions on a grid over the projected space (vectorising the reconstruction directly, rather than calling prev() on each point),
> xgrid=seq(-1,2,length=51)
> ygrid=seq(-1,1,length=51)
> xygrid=as.matrix(expand.grid(xgrid,ygrid))
> rec = tcrossprod(sweep(xygrid, 2,
+ eig.shrunk, FUN = "*"), svd.Zscale$V[, 1:2])
> tab.disj.rec = sweep(rec, 2, sqrt(M),
+ FUN = "/") + matrix(1,
+ nrow(rec), ncol(rec))
> tab.disj.rec = sweep(tab.disj.rec, 2, apply(
+ tab.disj.comp, 2, moy.p, row.w), FUN = "*")
> colnames(tab.disj.rec)=colnames(tab.disj.comp)
> newd=data.frame(tab.disj.rec)
> z_vect=predict(reg,newdata=newd,
+ type="response")
> zgrid=matrix(z_vect,51,51)
We can then visualise our classifier on that projected space, the bold curve being the level curve associated with the prediction at the origin (the average individual),
> plot(acm$ind$coord[,1:2],xlab="Dim 1",
+ ylab="Dim 2",col="white")
> abline(h=0,col="grey")
> abline(v=0,col="grey")
> points(acm$ind$coord[train.db$class=="0",1],
+ acm$ind$coord[train.db$class=="0",2],
+ col="blue",pch=19,cex=.5)
> points(acm$ind$coord[train.db$class=="1",1],
+ acm$ind$coord[train.db$class=="1",2],
+ col="red",pch=19,cex=.5)
> contour(xgrid,ygrid,zgrid,add=TRUE,
+ levels=prev(0,0),lwd=2)
> contour(xgrid,ygrid,zgrid,add=TRUE,col="grey")