I wanted to come back to an interesting experience, following a discussion I had with Carlos after my class this morning. Let me simplify the problem, and also change the dataset. Consider the following dataset
> db = read.table("http://freakonometrics.free.fr/db2.txt",header=TRUE,sep=";")
Let me also change one little thing (in the course, we use the age of people as an explanatory variable, so let us consider rounded figures too),
> db$X1=round(db$X1*10)
> db$X2=round(db$X2*10)
Assume that you want to work with factors, because you do not see why there should be a linear relationship (and you did not look at the awesome posts on smoothing techniques)
> db$X1F=cut(db$X1,c(-12,45,75,120))
> db$X2F=cut(db$X2,c(100,200,300))
Then you run your regression,
> reg = glm(Y~X1F+X2F+X3,family=binomial,data=db)
So far, nothing wrong: you can try it, no error, no warning. Then Carlos wanted to use a ROC curve to see how the model was performing… so he used some code I had uploaded on the blog, something like
> reg = glm(Y~X1F+X2F+X3,family=binomial,data=db)
> S = predict(reg,type="response")
> Y = db$Y
> plot(0:1,0:1,xlab="False Positive Rate",ylab="True Positive Rate",cex=.5)
> for(s in seq(0,1,by=.01)){
+ Ps=(S>s)*1
+ FP=sum((Ps==1)*(Y==0))/sum(Y==0)
+ TP=sum((Ps==1)*(Y==1))/sum(Y==1)
+ points(FP,TP,cex=.5,col="red")
+ }
To make it nicer, let us use the following code
> ROCcurve=function(s){
+ Ps=(S>s)*1
+ FP=sum((Ps==1)*(Y==0))/sum(Y==0)
+ TP=sum((Ps==1)*(Y==1))/sum(Y==1)
+ return(c(FP,TP))}
> u=seq(0,1,by=.001)
> vectROC=Vectorize(ROCcurve)(u)
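To actually draw the curve from vectROC (a small addition of mine, consistent with the loop version above; the first row of the matrix holds the false positive rates, the second the true positive rates), something like
> plot(vectROC[1,],vectROC[2,],cex=.5,col="red",
+ xlab="False Positive Rate",ylab="True Positive Rate")
> abline(a=0,b=1,lty=2)   # diagonal of the random classifier, for reference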
Here, I should mention that I got warnings, but to be honest, when you see the graph, you are so puzzled that you forget about them…
There were 50 or more warnings (use warnings() to see the first 50)
Carlos ran the code, showed me the graph, and asked me: "can my model be that bad?"
My first answer was "no, your model cannot be that bad! you use (almost) all your explanatory variables, you cannot have such a bad model…". So, where does the problem come from? My first guess was that this is what you get when one of the vectors is scrambled, so that predictions no longer match observations. For instance, if you run the code above on
> reg2 = glm(Y~X1+X2+X3,family=binomial,data=db)
> S = sort(predict(reg2,type="response"))
> Y = db$Y
Here, the vector of predictions is sorted, so it no longer lines up with the observations, and we get
This confirms the idea that with a random classifier, the ROC curve lies on the diagonal. But here, when you look at the code, you do not see any sorting operation.
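If you want to convince yourself of that claim, here is a small simulation (hypothetical code of mine, not something Carlos ran): generate scores that have nothing to do with the outcome, and compute the curve,
> set.seed(1)
> Srandom = runif(nrow(db))   # scores drawn independently of the outcome
> ROCrandom = function(s){
+ Ps=(Srandom>s)*1
+ FP=sum((Ps==1)*(db$Y==0))/sum(db$Y==0)
+ TP=sum((Ps==1)*(db$Y==1))/sum(db$Y==1)
+ return(c(FP,TP))}
> vectRandom = Vectorize(ROCrandom)(u)
> plot(vectRandom[1,],vectRandom[2,],cex=.5,col="red",
+ xlab="False Positive Rate",ylab="True Positive Rate")
The points should stay close to the diagonal.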
So, again, what went wrong? The problem is actually very simple. The dataset looks like
> head(db)
  Y X1  X2 X3      X1F       X2F
1 1 33 163  B (-12,45] (100,200]
2 1 64 185  D  (45,75] (100,200]
3 1 53 166  B  (45,75] (100,200]
4 1 55 197  C  (45,75] (100,200]
5 1 41 184  C (-12,45] (100,200]
6 1 78 196  C (75,120] (100,200]
If we look at the factor variable, the lowest value is not included in the first interval. More precisely,
> levels(db$X1F)
[1] "(-12,45]"  "(45,75]"   "(75,120]"
and if we look more carefully at the range of the variable, we get
> range(db$X1)
[1] -12 120
Wait… the minimum is -12, but it does not appear in the factor (and we do not get any warning)? Yes,
> db[which.min(db$X1)+(-2):2,]
    Y  X1  X2 X3      X1F       X2F
429 1  76 227  A (75,120] (200,300]
430 0  35 186  D (-12,45] (100,200]
431 0 -12 109  E     <NA> (100,200]
432 1  76 225  B (75,120] (200,300]
433 1  61 206  A  (45,75] (200,300]
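The <NA> comes from cut()'s default convention: intervals are closed on the right and open on the left, so the left endpoint of the first interval is excluded. A quick check (an added illustration, not from the original discussion):
> cut(c(-12,-11.9),c(-12,45,75,120))
[1] <NA>     (-12,45]
Levels: (-12,45] (45,75] (75,120]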
Now we start to see more clearly what went wrong… there is a missing value in the dataset (generated when we built the factor). And when we compute our predictions, there is no missing value: the observation has been dropped.
> length(predict(reg))
[1] 999
> nrow(db)
[1] 1000
So when we compare the predictions and the observed values… there is a problem, since the vectors do not match. Actually, this was mentioned in the output of the regression (but we did not look at it),
> summary(reg)

Call:
glm(formula = Y ~ X1F + X2F + X3, family = binomial, data = db)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.9432   0.1627   0.2900   0.4823   1.1377

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.22696    0.23206   0.978 0.328067
X1F(45,75]    1.86569    0.23397   7.974 1.53e-15 ***
X1F(75,120]   2.97463    0.65071   4.571 4.85e-06 ***
X2F(200,300]  1.11643    0.32695   3.415 0.000639 ***
X3B          -0.06131    0.31076  -0.197 0.843609
X3C           0.75013    0.35825   2.094 0.036268 *
X3D           0.13846    0.31399   0.441 0.659226
X3E          -0.13277    0.31859  -0.417 0.676853
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 802.34  on 998  degrees of freedom
Residual deviance: 607.61  on 991  degrees of freedom
  (1 observation deleted due to missingness)
AIC: 623.61

Number of Fisher Scoring iterations: 7
Yes, there is a tiny little sentence,
(1 observation deleted due to missingness)
So that was it? Yes! If you drop that missing value, you get something more realistic,
> S = predict(reg,type="response")
> Y = db$Y[-which.min(db$X1)]
or (probably better), change the left endpoint of the first interval,
> db$X1F = cut(db$X1,c(-13,45,75,120))
> reg = glm(Y~X1F+X2F+X3,family=binomial,data=db)
> S = predict(reg,type="response")
> Y = db$Y
(the output would have been almost the same).
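Two more options are worth mentioning (standard R features, not part of the discussion with Carlos). Either ask cut() to close the first interval on the left,
> db$X1F = cut(db$X1,c(-12,45,75,120),include.lowest=TRUE)   # first interval becomes [-12,45]
or keep the factor with its missing value, but fit the model with na.exclude, so that predictions stay aligned with the dataset,
> db$X1F = cut(db$X1,c(-12,45,75,120))   # back to the factor with the <NA>
> reg = glm(Y~X1F+X2F+X3,family=binomial,data=db,na.action=na.exclude)
> S = predict(reg,type="response")
> length(S)   # 1000, with an NA where the observation was dropped
In the second case, S contains an NA, so you still have to decide what to do with it before computing the ROC curve.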
Observe that if we had used a dedicated package, we would not have encountered this problem. For instance (starting with the initial values of the vectors)
> library(ROCR)
> prediction(S,Y)
Error in prediction(S, Y) :
  Number of predictions in each run must be equal to the number of labels for each run.
We do have the answer here: the two vectors do not have the same length…
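And for completeness, once S and Y do have the same length, the standard ROCR recipe is short (a sketch based on the package's documented interface),
> pred = prediction(S,Y)
> perf = performance(pred,"tpr","fpr")   # true positive rate against false positive rate
> plot(perf)
> abline(a=0,b=1,lty=2)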
So, what is my point? R is great, because of all the packages, but as a teacher, I do not feel comfortable asking my students to use those functions as black boxes. So I try to write my own code, to get the same output. So yes, I do write code to explain what the black box is doing, to simplify the algorithm, and to show what's going on. While working on a (forthcoming) book as the editor, I had a discussion with Rob Hyndman about that issue. I wanted the contributors to explain the core of the code, with a simplified algorithm, when using a dedicated package. I do truly believe that using simplified code can help you understand things better. Until you start to have problems. Because I write code to deal with one specific problem, there are no checks for possible errors. And once the code is understood, please, please do not use it! Use R functions that can handle errors…