ROC curves and classification

Arthur Charpentier

8 years ago

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers].


To get back to a question asked after the last course (still on non-life insurance), I will spend some time discussing the construction and interpretation of ROC curves. Consider the dataset we used last week,

> db = read.table("http://freakonometrics.free.fr/db.txt",header=TRUE,sep=";")
> attach(db)

The first step is to get a model. For instance, a logistic regression, where some factors were merged together,

> X3bis=rep(NA,length(X3))
> X3bis[X3%in%c("A","C","D")]="ACD"
> X3bis[X3%in%c("B","E")]="BE"
> db$X3bis=as.factor(X3bis)
> reg=glm(Y~X1+X2+X3bis,family=binomial,data=db)

From this model, we can predict a probability, not the binary outcome itself,

> S=predict(reg,type="response")

Let $S$ denote this variable (we can use either the score or the predicted probability; it will not change the construction of our ROC curve). What if we really want to predict the binary outcome $Y$, as we usually do in decision theory? The idea is to consider a threshold $s$, so that

if $S > s$, then $\widehat{Y}$ will be $1$, or “positive” (using standard terminology)
if $S \leq s$, then $\widehat{Y}$ will be $0$, or “negative”
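The thresholding rule can be sketched on a handful of made-up scores. The names `S_toy`, `s_toy` and `Y_hat` are hypothetical and used only for this illustration; the comparison itself is the same `(S>s)*1` used later in `roc.curve`.

```r
# Hypothetical scores and threshold, for illustration only
S_toy <- c(0.15, 0.62, 0.48, 0.91, 0.50)
s_toy <- 0.5

# Predicted class: 1 ("positive") when the score is strictly above the threshold,
# 0 ("negative") otherwise; a score equal to the threshold is classified as 0
Y_hat <- (S_toy > s_toy) * 1
print(Y_hat)
```

Note that the last score, exactly equal to the threshold, falls on the "negative" side because the inequality is strict.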

Then we derive a contingency table, or a confusion matrix

                     observed value
predicted value      “positive”    “negative”
“positive”           TP            FP
“negative”           FN            TN

where TP is the number of so-called true positives, TN the true negatives, FP the false positives (type I errors) and FN the false negatives (type II errors). We can get that contingency table for a given threshold,

> roc.curve=function(s,print=FALSE){
+ Ps=(S>s)*1
+ FP=sum((Ps==1)*(Y==0))/sum(Y==0)
+ TP=sum((Ps==1)*(Y==1))/sum(Y==1)
+ if(print==TRUE){
+ print(table(Observed=Y,Predicted=Ps))
+ }
+ vect=c(FP,TP)
+ names(vect)=c("FPR","TPR")
+ return(vect)
+ }
> threshold = 0.5
> roc.curve(threshold,print=TRUE)
        Predicted
Observed   0   1
       0   5 231
       1  19 745
      FPR       TPR 
0.9788136 0.9751309

Here, we also compute the false positive rate and the true positive rate,

TPR = TP / P = TP / (TP + FN), also called sensitivity: the probability of being predicted positive, given that someone is positive (true positive rate)
FPR = FP / N = FP / (FP + TN): the probability of being predicted positive, given that someone is negative (false positive rate)
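As a check, the two rates can be recomputed by hand from the counts in the table printed above for the threshold 0.5. This is only a small sketch: the counts are copied from that output, and the variable names are chosen here for illustration.

```r
# Counts read off the table printed above (rows = observed, columns = predicted)
TN <- 5;  FP <- 231   # observed 0: predicted 0, predicted 1
FN <- 19; TP <- 745   # observed 1: predicted 0, predicted 1

TPR <- TP / (TP + FN)  # true positive rate
FPR <- FP / (FP + TN)  # false positive rate
print(round(c(FPR = FPR, TPR = TPR), 7))
```

Both values agree with the `FPR` and `TPR` returned by `roc.curve(threshold,print=TRUE)`. With a threshold of 0.5, almost everyone is predicted positive, so both rates are close to 1.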

The ROC curve is then obtained using several values for the threshold. For convenience, define

> ROC.curve=Vectorize(roc.curve)
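To see what `Vectorize` does here, consider a toy function that, like `roc.curve`, takes a single threshold and returns a named pair. The function `pair_fun` and its formulas are hypothetical stand-ins, not the actual rates.

```r
# Toy stand-in for roc.curve: ONE threshold in, a named pair out
# (the formulas are arbitrary and only serve the illustration)
pair_fun <- function(s) c(FPR = 1 - s, TPR = 1 - s^2)

# Vectorize() wraps the function so that it accepts a vector of thresholds
pair_vec <- Vectorize(pair_fun)
M <- pair_vec(c(0, 0.5, 1))

print(dim(M))       # 2 rows (FPR, TPR), one column per threshold
print(rownames(M))
print(M)
```

The result is a matrix with one column per threshold, which is why the first and second rows are extracted when plotting.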

First, we can plot $(S_i, Y_i)$ (a standard predicted versus observed graph), and visualize true and false positives and negatives, using simple colors (here, misclassified observations in blue and correctly classified ones in red)

> I=(((S>threshold)&(Y==0))|((S<=threshold)&(Y==1)))
> plot(S,Y,col=c("red","blue")[I+1],pch=19,cex=.7,xlab="",ylab="")
> abline(v=threshold,col="gray")

And for the ROC curve, simply use

> M.ROC=ROC.curve(seq(0,1,by=.01))
> plot(M.ROC[1,],M.ROC[2,],col="grey",lwd=2,type="l")
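The shape of the curve can be checked on simulated data. The sketch below does not use the `db` dataset: the data, the seed and the names `score`, `rates` and `roc` are all assumptions made for the illustration. It shows that the curve starts at (1,1) for a threshold of 0, ends at (0,0) for a threshold of 1, and that both rates can only decrease as the threshold increases.

```r
set.seed(1)

# Simulated data, purely illustrative (not the db dataset)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))

# Logistic regression and predicted probabilities
fit <- glm(y ~ x, family = binomial)
score <- predict(fit, type = "response")

# False and true positive rates for ONE threshold s
rates <- function(s) {
  c(FPR = mean(score[y == 0] > s), TPR = mean(score[y == 1] > s))
}

# One column per threshold, as with ROC.curve above
roc <- sapply(seq(0, 1, by = 0.01), rates)

print(roc[, 1])    # threshold 0: everything is predicted positive
print(roc[, 101])  # threshold 1: nothing is predicted positive
plot(roc["FPR", ], roc["TPR", ], type = "l", xlab = "FPR", ylab = "TPR")
```

Each point of the curve corresponds to one threshold, so moving along the curve from the upper right to the lower left amounts to raising the threshold.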

This is the ROC curve. Now, to see why it can be interesting, we need a second model. Consider for instance a classification tree

> library(tree)
> ctr <- tree(Y~X1+X2+X3bis,data=db)
> plot(ctr)
> text(ctr)

To plot the ROC curve, we just need to use the prediction obtained using this second model,

> S=predict(ctr)

All the code described above can be used. Again, we can plot $(S_i, Y_i)$ (observe that we have 5 possible values for $S_i$, which makes sense since we do have 5 leaves on our tree). Then, we can plot the ROC curve in the same way.

An interesting idea is to plot the two ROC curves on the same graph, in order to compare the two models

> M.ROC.tree=ROC.curve(seq(0,1,by=.01))
> plot(M.ROC[1,],M.ROC[2,],type="l")
> lines(M.ROC.tree[1,],M.ROC.tree[2,],type="l",col="grey",lwd=2)

The most difficult part is to get a proper interpretation. The tree does not predict well in the lower part of the curve. That part corresponds to high thresholds, so it concerns people with a very high predicted probability. If our interest is more in those with a probability lower than 90%, then we have to admit that the tree is doing a good job, since its ROC curve is always higher than that of the logistic regression.
