So far, when discussing classification, we’ve been playing with my toy dataset (actually, I should not claim it is mine, since it is inspired by the one used in the introduction of Boosting, by Robert Schapire and Yoav Freund). But in real life, there are more observations, and more explanatory variables. With more than two explanatory variables, it starts to be more complicated to visualise. For instance, consider
MYOCARDE=read.table("http://freakonometrics.free.fr/saporta.csv",
                    header=TRUE,sep=";")
where we have observations from E.R. patients, admitted for myocardial infarction, and we want to understand who survived, in order to get a predictive model. But before running some classifier, let us visualise our data. Since we have seven explanatory variables plus our class (survival or death), we can go for a PCA.
library(FactoMineR)   # PCA (on the continuous variables)
X=MYOCARDE[,1:7]
acp=PCA(X)
To add the death/survival variable, we can treat it as a numerical 0/1 variable (at least to get a direction),
MYOCARDE2=MYOCARDE
MYOCARDE2$PRONO=(MYOCARDE2$PRONO=="SURVIE")*1
acp=PCA(MYOCARDE2,quanti.sup=8,graph=TRUE)
The nice thing is that we can see here which variables are collinear with that one. It is also possible to visualise individuals, and classes, too
acp=PCA(MYOCARDE,quali.sup=8,graph=TRUE)
plot(acp,habillage=8,col.hab=c("red","blue"))
It looks like we should be able to derive a nice classifier here. At least, when projecting on the first two components, we can distinguish our classes.
Now, can’t we get a classifier and visualise it on the first two components? Because PCA is simply based on an orthogonal projection, yes, we can (we also have to keep in mind that the data were normalized here). Given two coordinates on the first-two-components plane (assuming the other five are null), our transformation matrix, our normalizing constants, and a classifier (here based on a logistic regression), we can get back to the original space, and classify that new data
acp=PCA(X,ncp=ncol(X))
M=acp$var$coord              # variable coordinates (transformation matrix)
m=apply(X,2,mean)            # means used for the normalization
s=apply(X,2,sd)              # standard deviations used for the normalization
pred=function(d1,d2,Mat,reg){
  z=Mat %*% c(d1,d2,rep(0,ncol(X)-2))   # back to the normalized original space
  newd=data.frame(t(z*s+m))             # un-normalize
  names(newd)=names(X)
  predict(reg,newdata=newd,type="response")
}
Consider now a logistic regression. Just to get something not too complicated (and to get rid of non-significant variables), we use a stepwise procedure to simplify the model,
reg_tot=glm(PRONO~.,data=MYOCARDE,family=binomial)
reg_tot=step(reg_tot)
Minv=solve(M)
p=function(d1,d2) pred(d1,d2,Minv,reg_tot)
To visualise the isoprobability line (where an individual has a 50% chance to survive), we can use the following
Outer <- function(x,y,fun){
  mat <- matrix(NA, length(x), length(y))
  for(i in seq_along(x)){
    for(j in seq_along(y)) mat[i,j]=fun(x[i],y[j])
  }
  return(mat)
}
xgrid=seq(-5,5,length=251)
ygrid=seq(-5,5,length=251)
zgrid=Outer(xgrid,ygrid,p)
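As a side note, base R’s outer() expects a vectorized function (it is called once on the whole expanded grids), hence the explicit double loop above. Wrapping our scalar function p with Vectorize should produce the same grid,

zgrid=outer(xgrid,ygrid,Vectorize(p))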
and then, we add a contour line on our previous graph
acp2=PCA(MYOCARDE,quali.sup=8,graph=TRUE)
plot(acp2,habillage=8,col.hab=c("red","blue"))
contour(xgrid,ygrid,zgrid,add=TRUE,levels=.5)
It is not too bad, but we should be able to do better. What if we keep all the variables here (even those that were not significant)?
reg_tot=glm(PRONO~.,data=MYOCARDE,family=binomial)
Minv=solve(M)
p=function(d1,d2) pred(d1,d2,Minv,reg_tot)
zgrid=Outer(xgrid,ygrid,p)
acp2=PCA(MYOCARDE,quali.sup=8,graph=TRUE)
plot(acp2,habillage=8,col.hab=c("red","blue"))
contour(xgrid,ygrid,zgrid,add=TRUE,levels=.5)
It is much better, isn’t it? Of course, in real life, to actually say something relevant about our classifier, we should fit our model on one part of the observations, and then test it on the other part. But here, the goal was more to get a function to visualise our classification in some projected space.
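For the record, here is a minimal sketch of what such a validation could look like, using the 0/1 version of the dataset (MYOCARDE2, defined above). The seed, the 75/25 split and the .5 threshold are arbitrary choices, just for illustration,

set.seed(1)                           # arbitrary seed, for reproducibility
n=nrow(MYOCARDE2)                     # MYOCARDE2 has PRONO coded as 0/1
idx=sample(1:n,size=round(.75*n))     # arbitrary 75/25 split
train=MYOCARDE2[idx,]
test=MYOCARDE2[-idx,]
reg_train=glm(PRONO~.,data=train,family=binomial)
prob_test=predict(reg_train,newdata=test,type="response")
# predict survival when the estimated probability exceeds 1/2,
# and compute the out-of-sample misclassification rate
mean((prob_test>.5) != test$PRONO)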