Classification from scratch, logistic with splines 2/8

[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers].

Today, second post of our series on classification from scratch, following the brief introduction on the logistic regression.

Piecewise linear splines

To illustrate what’s going on, let us start with a “simple” regression (with only one explanatory variable). The underlying idea is natura non facit saltus, for “nature does not make jumps”, i.e. the equations governing natural processes are continuous. That seems to be a rather strong assumption, because we could imagine that there is a fixed threshold to explain death. For instance, if patients die (for sure) when the “stroke index” exceeds a threshold, we might expect some discontinuity. Except that if that threshold is a heterogeneous (non-observable, continuous) variable, then we get back to the continuity assumption.

The most simple model we can think of to extend the linear model we’ve seen in the previous post is to consider a piecewise linear function, with two parts : small values of \(x\), and larger values of \(x\). The most convenient way to do so is to use the positive part function \((x-s)_+\) which is the difference between \(x\) and \(s\) if that difference is positive, and \(0\) otherwise. For instance \(\beta_1 x+\beta_2(x-s)_+\) is the following piecewise linear function, continuous, with a “rupture” at knot \(s\).

Observe also the following interpretation: for small values of \(x\), there is a linear increase, with slope \(\beta_1\), and for larger values of \(x\), there is a linear decrease, with slope \(\beta_1+\beta_2\). Hence, \(\beta_2\) is interpreted as a change of the slope.

And of course, it is possible to consider more than one knot. The function to get the positive part is the following

pos = function(x,s) (x-s)*(x>=s)
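As a quick check of the interpretation given above, here is a minimal sketch with hypothetical values for the knot and the two coefficients (s = 15, beta1 = 1 and beta2 = -3 are illustrative choices, not estimates). It computes the slope on each side of the knot and the size of the jump at the knot.

```r
# positive part function, as defined above
pos = function(x,s) (x-s)*(x>=s)

# hypothetical knot and coefficients, chosen only for illustration
s = 15
beta1 = 1
beta2 = -3
f = function(x) beta1*x + beta2*pos(x,s)

# slope on each side of the knot, from two points on each segment
slope_left  = (f(10)-f(5))/(10-5)
slope_right = (f(25)-f(20))/(25-20)
c(left=slope_left, right=slope_right)

# continuity at the knot: the values just before and just after are close
f(s+1e-8) - f(s-1e-8)
```

The left slope should match beta1 and the right slope beta1+beta2, while the difference at the knot is negligible.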

then we can use it directly in our regression model

reg = glm(PRONO~INSYS+pos(INSYS,15)+
pos(INSYS,25),data=myocarde,family=binomial)

The output of the regression is here

summary(reg)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)  
(Intercept)     -0.1109     3.2783  -0.034   0.9730  
INSYS           -0.1751     0.2526  -0.693   0.4883  
pos(INSYS, 15)   0.7900     0.3745   2.109   0.0349 *
pos(INSYS, 25)  -0.5797     0.2903  -1.997   0.0458 *

Hence, the original slope, for very small values, is not significant, but above 15 the change in the slope is significantly positive. And above 25, there is a significant change again. We can plot it to see what’s going on

u = seq(5,55,length=201)
v = predict(reg,newdata=data.frame(INSYS=u),type="response")
plot(u,v,type="l")
points(myocarde$INSYS,myocarde$PRONO,pch=19)
abline(v=c(5,15,25,55),lty=2)
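Going back to the coefficient table, the slope of the linear predictor (on the logit scale) on each segment can be recovered by cumulating the estimates, since each knot adds its coefficient to the slope of the previous segment. The sketch below simply copies the three estimates from the output; the vector names are ours.

```r
# estimates copied from the summary(reg) output above
b = c(INSYS=-0.1751, pos15=0.7900, pos25=-0.5797)

# slope of the linear predictor on (5,15), (15,25) and (25,55):
# each knot adds its coefficient to the slope of the previous segment
slopes = cumsum(b)
names(slopes) = c("below 15","15 to 25","above 25")
slopes
```

So the slope is about 0.61 between 15 and 25, and close to zero (about 0.04) above 25, on the logit scale.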

Using bs() linear splines

Using the bs() function, things are slightly different. We will use here so-called B-splines,

library(splines)

We can define spline functions with support \((5,55)\) and with knots \(\{15,25\}\)

clr6 = c("#1b9e77","#d95f02","#7570b3","#e7298a","#66a61e","#e6ab02")
x = seq(0,60,by=.25)
B = bs(x,knots=c(15,25),Boundary.knots=c(5,55),degree=1)
matplot(x,B,type="l",lty=1,lwd=2,col=clr6)


As we can see, the functions defined here are different from the ones before, but we still have (piecewise) linear functions on each segment \((5,15)\), \((15,25)\) and \((25,55)\). But linear combinations of those functions (the two sets of functions) will generate the same space. Said differently, even if the interpretation of the output will be different, predictions should be the same

reg = glm(PRONO~bs(INSYS,knots=c(15,25),
Boundary.knots=c(5,55),degree=1),
data=myocarde,family=binomial)
summary(reg)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)  
(Intercept)    -0.9863     2.0555  -0.480   0.6314  
bs(INSYS,..)1  -1.7507     2.5262  -0.693   0.4883  
bs(INSYS,..)2   4.3989     2.0619   2.133   0.0329 *
bs(INSYS,..)3   5.4572     5.4146   1.008   0.3135

Observe that there are three coefficients, as before, but again, the interpretation is here more complicated…

v=predict(reg,newdata=data.frame(INSYS=u),type="response")
plot(u,v,ylim=0:1,type="l",col="red")
points(myocarde$INSYS,myocarde$PRONO,pch=19)
abline(v=c(5,15,25,55),lty=2)


Nevertheless, the prediction is the same… and that’s nice.
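To see that this is not specific to our dataset, here is a small sketch on simulated data (the data frame sim and its coefficients are made up for the illustration, this is not the myocarde dataset). The same logistic model is fitted with the positive part basis and with the bs() basis, and the fitted probabilities are compared.

```r
library(splines)
pos = function(x,s) (x-s)*(x>=s)

# simulated data (not the myocarde dataset): x in (5,55), binary y
set.seed(1)
n = 200
sim = data.frame(x = runif(n,5,55))
sim$y = rbinom(n,1,plogis(-2+.05*sim$x+.1*pos(sim$x,15)-.15*pos(sim$x,25)))

# same model, two different bases
reg_pos = glm(y~x+pos(x,15)+pos(x,25),data=sim,family=binomial)
reg_bs  = glm(y~bs(x,knots=c(15,25),Boundary.knots=c(5,55),degree=1),
              data=sim,family=binomial)

# coefficients differ, fitted probabilities agree up to numerical error
max(abs(fitted(reg_pos)-fitted(reg_bs)))
```

The two sets of coefficients are not comparable, but the largest gap between fitted probabilities should be of the order of numerical error only.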

Piecewise quadratic splines

Let us go one step further… Can we have also the continuity of the derivative ? Yes, and that’s easy actually, considering parabolic functions. Instead of using a decomposition on \(x,(x-s_1)_+\) and \((x-s_2)_+\) consider now a decomposition on \(x,x^{\color{red}{2}},(x-s_1)^{\color{red}{2}}_+\) and \((x-s_2)^{\color{red}{2}}_+\).

pos2 = function(x,s) (x-s)^2*(x>=s)
reg = glm(PRONO~poly(INSYS,2)+pos2(INSYS,15)+pos2(INSYS,25),
data=myocarde,family=binomial)
summary(reg)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)  
(Intercept)      29.9842    15.2368   1.968   0.0491 *
poly(INSYS, 2)1 408.7851   202.4194   2.019   0.0434 *
poly(INSYS, 2)2 199.1628   101.5892   1.960   0.0499 *
pos2(INSYS, 15)  -0.2281     0.1264  -1.805   0.0712 .
pos2(INSYS, 25)   0.0439     0.0805   0.545   0.5855

As expected, there are five coefficients here: the intercept and two for the part on the left (three parameters for the parabolic function), and then two additional terms, one for the part in the center, here \((15,25)\), and one for the part on the right. Of course, each additional portion brings only one degree of freedom, since a parabolic function has three coefficients but must satisfy two constraints (continuity, and continuity of the first order derivative).
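The continuity of the derivative can be checked numerically. The following sketch (the helper names d_left and d_right are ours, and h is an arbitrary small step) compares one-sided finite differences at the knot for the linear and the quadratic positive part functions.

```r
pos  = function(x,s) (x-s)*(x>=s)
pos2 = function(x,s) (x-s)^2*(x>=s)

# one-sided numerical derivatives at the knot s
s = 15
h = 1e-4
d_left  = function(f) (f(s)-f(s-h))/h
d_right = function(f) (f(s+h)-f(s))/h

# jump of the first derivative at the knot
jump_linear    = d_right(function(x) pos(x,s))  - d_left(function(x) pos(x,s))
jump_quadratic = d_right(function(x) pos2(x,s)) - d_left(function(x) pos2(x,s))
c(linear=jump_linear, quadratic=jump_quadratic)
```

The linear version has a jump of one in its derivative, while for the quadratic version the difference is of order h and vanishes as h goes to zero.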

On a graph, we get the following

v = predict(reg,newdata=data.frame(INSYS=u),type="response")
plot(u,v,ylim=0:1,type="l",col="red",lwd=2,xlab="INSYS",ylab="")
points(myocarde$INSYS,myocarde$PRONO,pch=19)
abline(v=c(5,15,25,55),lty=2)

Using bs() quadratic splines

Of course, we can do the same with our R function. But as before, the basis of functions is expressed here differently

x = seq(0,60,by=.25)
B = bs(x,knots=c(15,25),Boundary.knots=c(5,55),degree=2)
matplot(x,B,type="l",xlab="INSYS",col=clr6)


If we run the R code, we get

reg = glm(PRONO~bs(INSYS,knots=c(15,25),
Boundary.knots=c(5,55),degree=2),data=myocarde,
family=binomial)
summary(reg)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)  
(Intercept)       7.186      5.261   1.366   0.1720  
bs(INSYS, ..)1  -14.656      7.923  -1.850   0.0643 .
bs(INSYS, ..)2   -5.692      4.638  -1.227   0.2198  
bs(INSYS, ..)3   -2.454      8.780  -0.279   0.7799  
bs(INSYS, ..)4    6.429     41.675   0.154   0.8774

But that’s not really a big deal since the prediction is exactly the same

v = predict(reg,newdata=data.frame(INSYS=u),type="response")
plot(u,v,ylim=0:1,type="l",col="red")
points(myocarde$INSYS,myocarde$PRONO,pch=19)
abline(v=c(5,15,25,55),lty=2)

Cubic splines

Last, but not least, we can reach the cubic splines. With our previous notions, we would consider a decomposition on (guess what) \(x,x^2,x^{\color{red}{3}},(x-s_1)^{\color{red}{3}}_+,(x-s_2)^{\color{red}{3}}_+\), to get this time continuity, as well as continuity of the first two derivatives (and to get a very smooth function, since even variations will be smooth). If we use the bs function, the basis is the following

B = bs(x,knots=c(15,25),Boundary.knots=c(5,55),degree=3)
matplot(x,B,type="l",lwd=2,col=clr6,lty=1,ylim=c(-.2,1.2))
abline(v=c(5,15,25,55),lty=2)

and the prediction will now be

reg = glm(PRONO~bs(INSYS,knots=c(15,25),
Boundary.knots=c(5,55),degree=3),
data=myocarde,family=binomial)
u = seq(5,55,length=201)
v = predict(reg,newdata=data.frame(INSYS=u),type="response")
plot(u,v,ylim=0:1,type="l",col="red",lwd=2)
points(myocarde$INSYS,myocarde$PRONO,pch=19)
abline(v=c(5,15,25,55),lty=2)
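To summarize the three cases, here is a small sketch (the object names x_in, n_col and B_full are ours) that counts the basis functions returned by bs() for degrees 1, 2 and 3 with the two knots used so far. It also checks one property that distinguishes B-splines from the positive part basis: when the intercept is included, the basis functions sum to one.

```r
library(splines)

# grid strictly inside the boundary knots
x_in = seq(5.5,54.5,by=.5)

# number of basis functions for degree 1, 2, 3 with the two knots used above
n_col = sapply(1:3, function(d)
  ncol(bs(x_in,knots=c(15,25),Boundary.knots=c(5,55),degree=d)))
n_col   # degree + number of interior knots

# with intercept=TRUE the B-spline basis functions sum to one at every point
B_full = bs(x_in,knots=c(15,25),Boundary.knots=c(5,55),degree=3,intercept=TRUE)
range(rowSums(B_full))
```

The counts (three, four and five) match the number of non-intercept coefficients in the regressions above, and the five terms of the cubic decomposition.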


Two last things before concluding (for today), the location of the knots, and the extension to additive models.

Location of knots

In many applications, we do not want to specify the location of the knots. We just want – say – three (intermediary) knots. This can be done using

reg = glm(PRONO~1+bs(INSYS,degree=1,df=4),data=myocarde,family=binomial)

We can actually get the locations of the knots by looking at

attr(reg$terms, "predvars")[[3]]
bs(INSYS, degree = 1L, knots = c(15.8, 21.4, 27.15), 
Boundary.knots = c(8.7, 54), intercept = FALSE)

which provides us with the location of the boundary knots (the minimum and the maximum from our sample) but also the three intermediary knots. Observe that actually, those five values are just (empirical) quantiles

quantile(myocarde$INSYS,(0:4)/4)
   0%   25%   50%   75%  100% 
 8.70 15.80 21.40 27.15 54.00
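The same property can be checked directly on the basis object, without fitting a model. Here is a minimal sketch on a simulated covariate (x_sim is made up, it is not the INSYS variable), reading the knots from the attributes of the object returned by bs().

```r
library(splines)

# simulated covariate (not the myocarde data)
set.seed(2)
x_sim = rexp(100,rate=1/20)

# degree 1 with df=4 leaves df - degree = 3 interior knots
B = bs(x_sim,degree=1,df=4)
unname(attr(B,"knots"))
unname(quantile(x_sim,(1:3)/4))

# boundary knots are the range of the sample
attr(B,"Boundary.knots")
range(x_sim)
```

The interior knots coincide with the quartiles, and the boundary knots with the minimum and the maximum.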

If we plot the prediction, we get

v = predict(reg,newdata=data.frame(INSYS=u),type="response")
plot(u,v,ylim=0:1,type="l",col="red",lwd=2)
points(myocarde$INSYS,myocarde$PRONO,pch=19)
abline(v=quantile(myocarde$INSYS,(0:4)/4),lty=2)


If we get back to what was computed before the logit transformation, we clearly see ruptures at the different quantiles

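# use the knots of the fitted model (quantiles of INSYS);
# bs(x,degree=1,df=4) would place the knots at the quantiles of x instead
q = quantile(myocarde$INSYS,(0:4)/4)
B = bs(x,degree=1,knots=q[2:4],Boundary.knots=q[c(1,5)])
B = cbind(1,B)
y = B%*%coefficients(reg)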
plot(x,y,type="l",col="red",lwd=2)
abline(v=quantile(myocarde$INSYS,(0:4)/4),lty=2)
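The quantity before the logistic transformation can also be obtained with predict(), using type="link", which avoids rebuilding the basis by hand. The sketch below uses simulated data (sim and reg_sim are made up for the illustration) and checks that both routes agree, provided the hand-made basis reuses the knots of the fitted one.

```r
library(splines)

# simulated data (not the myocarde dataset)
set.seed(3)
sim = data.frame(x = runif(200,5,55))
sim$y = rbinom(200,1,plogis(-3+.1*sim$x))
reg_sim = glm(y~bs(x,degree=1,df=4),data=sim,family=binomial)

# linear predictor and probability on a grid inside the observed range
grid = data.frame(x = seq(min(sim$x),max(sim$x),length=101))
eta = predict(reg_sim,newdata=grid,type="link")
p   = predict(reg_sim,newdata=grid,type="response")

# the probability is the logistic transform of the linear predictor
max(abs(p-plogis(eta)))

# rebuilding the basis by hand requires the knots of the original basis
B = predict(bs(sim$x,degree=1,df=4),grid$x)
max(abs(cbind(1,B)%*%coef(reg_sim)-eta))
```

Here predict() applied to a bs object evaluates the basis at new points with the knots computed on the original sample, which is what the regression does internally.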


Note that if we do not specify anything about knots (number or location), we get no knots…

reg = glm(PRONO~1+bs(INSYS,degree=2),data=myocarde,family=binomial)
attr(reg$terms, "predvars")[[3]]
bs(INSYS, degree = 2L, knots = numeric(0), 
Boundary.knots = c(8.7,54), intercept = FALSE)

and if we look at the prediction

u = seq(5,55,length=201)
v = predict(reg,newdata=data.frame(INSYS=u),type="response")
plot(u,v,ylim=0:1,type="l",col="red",lwd=2)
points(myocarde$INSYS,myocarde$PRONO,pch=19)


As expected, it is the same as a quadratic regression

reg = glm(PRONO~1+poly(INSYS,degree=2),data=myocarde,family=binomial)
v = predict(reg,newdata=data.frame(INSYS=u),type="response")
plot(u,v,ylim=0:1,type="l",col="red",lwd=2)
points(myocarde$INSYS,myocarde$PRONO,pch=19)

Additive models

Consider now the second dataset, with two variables. Consider here a model like
\(\mathbb{P}[Y=1|X_1=x_1,X_2=x_2]=\frac{\exp[\eta(x_1,x_2)]}{1+\exp[\eta(x_1,x_2)]}\)
where
\(\eta(x_1,x_2)=\beta_0+\color{red}{s_1(x_1)}+\color{blue}{s_2(x_2)}\)
with
\(\color{red}{s_1(x_1)}=\beta_{1,0}x_1+\beta_{1,1}(x_1-s_{11})_++\beta_{1,2}(x_1-s_{12})_+\)
and
\(\color{blue}{s_2(x_2)}=\beta_{2,0}x_2+\beta_{2,1}(x_2-s_{21})_++\beta_{2,2}(x_2-s_{22})_+\)
It might seem a little bit restrictive, but that’s actually the idea of additive models.
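To make the additive structure concrete, here is a minimal sketch where the score is the sum of an intercept and two piecewise linear functions, one per variable, before the logistic transformation. All knots and coefficients below are hypothetical values chosen for the illustration, not estimates.

```r
pos = function(x,s) (x-s)*(x>=s)

# hypothetical knots and coefficients, for illustration only
s1 = function(x1) 2*x1 - 3*pos(x1,.3) + 4*pos(x1,.7)
s2 = function(x2) -1*x2 + 2*pos(x2,.5) - 2*pos(x2,.8)
beta0 = -.5

# additive score, then logistic transformation
eta  = function(x1,x2) beta0 + s1(x1) + s2(x2)
prob = function(x1,x2) exp(eta(x1,x2))/(1+exp(eta(x1,x2)))

prob(.2,.4)

# no interaction: the effect of x1 on the score does not depend on x2
(eta(.6,.1)-eta(.2,.1)) - (eta(.6,.9)-eta(.2,.9))
```

The last quantity is zero (up to rounding), which is the restriction mentioned above: moving along the first variable changes the score by the same amount whatever the value of the second one.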

reg = glm(y~bs(x1,degree=1,df=3)+bs(x2,degree=1,df=3),data=df,family=binomial(link = "logit"))
u = seq(0,1,length=101)
p = function(x,y) predict.glm(reg,newdata=data.frame(x1=x,x2=y),type="response")
v = outer(u,u,p)
image(u,u,v,xlab="Variable 1",ylab="Variable 2",col=clr10,breaks=(0:10)/10)
points(df$x1,df$x2,pch=19,cex=1.5,col="white")
points(df$x1,df$x2,pch=c(1,19)[1+(df$y=="1")],cex=1.5)
contour(u,u,v,levels = .5,add=TRUE)


Now, if we think about it, we’ve been able to get a “perfect” model, so, somehow, it seems no longer continuous…

persp(u,u,v,theta=20,phi=40,col="green")


Of course, it is… it is piecewise linear, with hyperplanes, some being almost vertical.

And one can also consider piecewise quadratic functions

reg = glm(y~bs(x1,degree=2,df=3)+bs(x2,degree=2,df=3),data=df,family=binomial(link = "logit"))
u = seq(0,1,length=101)
p = function(x,y) predict.glm(reg,newdata=data.frame(x1=x,x2=y),type="response")
v = outer(u,u,p)
image(u,u,v,xlab="Variable 1",ylab="Variable 2",col=clr10,breaks=(0:10)/10)
points(df$x1,df$x2,pch=19,cex=1.5,col="white")
points(df$x1,df$x2,pch=c(1,19)[1+(df$y=="1")],cex=1.5)
contour(u,u,v,levels = .5,add=TRUE)


Funny thing, we now have two “perfect” models, with different areas for the white and the black dots… Don’t ask me how to choose on that one.

In R, it is possible to use the mgcv package to run a gam regression. It is used for generalized additive models, but here, we have only one variable, so it is difficult to see the “additive” part, actually. And to be more specific, mgcv is using penalized quasi-likelihood from the nlme package (but we’ll get back on penalized routines later on).
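For readers who want to try it, a call could look like the following sketch. It uses simulated data with a single covariate (sim is made up, it is not the myocarde dataset), and relies on the default settings of gam() and s(), so it should be seen as a starting point rather than a recommended specification.

```r
library(mgcv)

# simulated data (not the myocarde dataset)
set.seed(4)
sim = data.frame(x = runif(200,5,55))
sim$y = rbinom(200,1,plogis(-3+.1*sim$x))

# s(x) is a penalized smooth term, with default basis and smoothing selection
reg_gam = gam(y~s(x),data=sim,family=binomial)

# predicted probabilities on a grid
u = seq(5,55,length=201)
v = predict(reg_gam,newdata=data.frame(x=u),type="response")
range(v)
```

Unlike the bs() approach, we do not choose the knots here; the smoothness of the term is driven by the penalty.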

But maybe I should also mention another smoothing tool before, kernels (and maybe also \(k\)-nearest neighbors). To be continued…
