To follow up on a question asked after the last course (still on non-life insurance), I will spend some time discussing the construction and interpretation of ROC curves. Consider the dataset we used last week,
> db = read.table("http://freakonometrics.free.fr/db.txt",header=TRUE,sep=";")
> attach(db)
The first step is to get a model. For instance, a logistic regression, where some levels of the factor X3 were merged together,
> X3bis=rep(NA,length(X3))
> X3bis[X3%in%c("A","C","D")]="ACD"
> X3bis[X3%in%c("B","E")]="BE"
> db$X3bis=as.factor(X3bis)
> reg=glm(Y~X1+X2+X3bis,family=binomial,data=db)
From this model, we can predict a probability, not a binary (0/1) outcome,
> S=predict(reg,type="response")
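Let $s$ denote a given threshold, between 0 and 1, and write $\hat{Y}$ for the predicted class:
- if $S > s$, then $\hat{Y}$ will be 1, or “positive” (using standard terminology)
- if $S \leq s$, then $\hat{Y}$ will be 0, or “negative”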
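A minimal sketch of this thresholding rule (predict “positive” when the score exceeds the threshold), assuming a small made-up score vector. The names S_toy and Y_hat and the values below are illustrative only and are not taken from db:

```r
# Hypothetical scores, for illustration only (not from the db dataset)
S_toy <- c(0.15, 0.40, 0.55, 0.80, 0.95)
s <- 0.5                          # threshold
Y_hat <- as.integer(S_toy > s)    # 1 = "positive", 0 = "negative"
Y_hat
```

Raising s can only move predictions from “positive” to “negative”, which is what traces out the ROC curve later on.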
Then we derive a contingency table, or a confusion matrix
| | observed “positive” | observed “negative” |
| predicted “positive” | TP | FP |
| predicted “negative” | FN | TN |
where TP are the so-called true positives, TN the true negatives, FP the false positives (type I errors) and FN the false negatives (type II errors). We can get that contingency table for a given threshold
> roc.curve=function(s,print=FALSE){
+   Ps=(S>s)*1                           # predicted class for threshold s
+   FP=sum((Ps==1)*(Y==0))/sum(Y==0)     # false positive rate (a rate, despite the name)
+   TP=sum((Ps==1)*(Y==1))/sum(Y==1)     # true positive rate
+   if(print==TRUE){
+     print(table(Observed=Y,Predicted=Ps))
+   }
+   vect=c(FP,TP)
+   names(vect)=c("FPR","TPR")
+   return(vect)
+ }
> threshold = 0.5
> roc.curve(threshold,print=TRUE)
        Predicted
Observed   0   1
       0   5 231
       1  19 745
      FPR       TPR 
0.9788136 0.9751309 
Here, we also compute the false positive rate and the true positive rate,
- TPR = TP / P = TP / (TP + FN), also called sensitivity: the probability of being predicted positive, given that someone is positive (true positive rate)
- FPR = FP / N = FP / (FP + TN): the probability of being predicted positive, given that someone is negative (false positive rate)
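As a quick check of these two formulas, the rates can be recomputed by hand from the counts in the table printed above for a threshold of 0.5. This is only a sketch: the counts are copied from that output, and the variable names are chosen here for illustration:

```r
# Counts read off the table printed above for threshold = 0.5
TP <- 745; FN <- 19    # observed positives (Y == 1)
FP <- 231; TN <- 5     # observed negatives (Y == 0)

TPR <- TP / (TP + FN)  # true positive rate
FPR <- FP / (FP + TN)  # false positive rate
round(c(FPR = FPR, TPR = TPR), 7)
```

Both rates are close to 1 because, at this threshold, 976 of the 1000 observations are classified as positive.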
The ROC curve is then obtained using several values for the threshold. For convenience, define
> ROC.curve=Vectorize(roc.curve)
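To see what Vectorize() returns here, the sketch below applies the same logic to a small made-up dataset. The function roc_point and the vectors S_toy and Y_toy are hypothetical names introduced for this illustration, and the data are passed explicitly instead of being read from the global environment:

```r
# Toy data, for illustration only (not from the db dataset)
Y_toy <- c(0, 0, 1, 0, 1, 1)
S_toy <- c(0.1, 0.3, 0.35, 0.6, 0.7, 0.9)

# Same logic as roc.curve, but with scores and outcomes as arguments
roc_point <- function(s, S, Y) {
  Ps <- (S > s) * 1
  c(FPR = sum((Ps == 1) * (Y == 0)) / sum(Y == 0),
    TPR = sum((Ps == 1) * (Y == 1)) / sum(Y == 1))
}

# Vectorize over the threshold only; S and Y are passed through unchanged
roc_points <- Vectorize(roc_point, vectorize.args = "s")
M_toy <- roc_points(c(0, 0.5, 1), S = S_toy, Y = Y_toy)
dim(M_toy)   # 2 rows (FPR, TPR), one column per threshold
```

The result is a matrix with one column per threshold. A threshold of 0 gives the point (1,1), since every score is above it, and a threshold of 1 gives (0,0).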
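First, we can plot the observed outcome against the predicted score, with misclassified observations (false positives and false negatives) in blue and correctly classified ones in red,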
> I=(((S>threshold)&(Y==0))|((S<=threshold)&(Y==1)))
> plot(S,Y,col=c("red","blue")[I+1],pch=19,cex=.7,xlab="",ylab="")
> abline(v=threshold,col="gray")
And for the ROC curve, simply use
> M.ROC=ROC.curve(seq(0,1,by=.01))
> plot(M.ROC[1,],M.ROC[2,],col="grey",lwd=2,type="l")
This is the ROC curve. Now, to see why it can be interesting, we need a second model. Consider for instance a classification tree
> library(tree)
> ctr <- tree(Y~X1+X2+X3bis,data=db)
> plot(ctr)
> text(ctr)
To plot the ROC curve, we just need to use the prediction obtained using this second model,
> S=predict(ctr)
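All the code described above can be reused as is, because roc.curve() reads the score vector S from the global environment. Again, we can plot the predicted-versus-observed graph and then the ROC curve.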
It is then interesting to plot the two ROC curves on the same graph, in order to compare the two models,
> M.ROC.tree=ROC.curve(seq(0,1,by=.01))
> plot(M.ROC[1,],M.ROC[2,],type="l")
> lines(M.ROC.tree[1,],M.ROC.tree[2,],type="l",col="grey",lwd=2)
The most difficult part is to get a proper interpretation. The tree is not predicting well in the lower part of the curve, which concerns people with a very high predicted probability. If our interest is more in those with a probability lower than 90%, then we have to admit that the tree is doing a good job, since its ROC curve is always higher there, compared with the logistic regression.
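To make "higher" concrete, one option is to interpolate both curves at the same false positive rates and compare the true positive rates. The sketch below assumes two small made-up ROC curves (the vectors fpr_a, tpr_a, fpr_b and tpr_b are hypothetical, not the curves obtained from db) and uses approx() from base R for linear interpolation:

```r
# Hypothetical ROC points for two models (illustrative values only)
fpr_a <- c(0, 0.2, 0.5, 1); tpr_a <- c(0, 0.5, 0.8, 1)
fpr_b <- c(0, 0.1, 0.6, 1); tpr_b <- c(0, 0.2, 0.9, 1)

# Evaluate both curves at common false positive rates
grid <- c(0.1, 0.6)
tpr_a_grid <- approx(fpr_a, tpr_a, xout = grid)$y
tpr_b_grid <- approx(fpr_b, tpr_b, xout = grid)$y

# TRUE where model b has the higher true positive rate
b_higher <- tpr_b_grid > tpr_a_grid
data.frame(FPR = grid, TPR_a = tpr_a_grid, TPR_b = tpr_b_grid, b_higher)
```

In this toy case neither model dominates: which curve is higher depends on the range of false positive rates considered, which is the situation described above for the tree and the logistic regression.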