Splines: opening the (black) box…
[This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Splines in regression is something which looks like a black box (or
maybe like some dishes you get when you travel away from home: it tastes
good, but you don’t what’s inside… even if you might have some clues,
you never know for sure*). Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
![box_closed-300x263.jpg, nov. 2010](http://freakonometrics.blog.free.fr/index.php?post/2010/11/04/public/perso/.box_closed-300x263_s.jpg)
That sounds nice, but when you look at the output of the regression… you got figures, but you barely see how to interpret them… So let us have a look at the box, and I mean what is inside that box…
![box_open-300x233.jpg, nov. 2010](http://freakonometrics.blog.free.fr/index.php?post/2010/11/04/public/perso/.box_open-300x233_s.jpg)
> library(splines)
> K=c(14,20)
> plot(cars)
> reg=lm(dist~bs(speed,knots=c(K),
+ degree=2),data=cars)
> u=seq(4,25,by=.1)
> B=data.frame(speed=u)
> Y=predict(reg,newdata=B)
> lines(u,Y,lwd=2,col=”red”)
I.e. we have the following (nice) picture,
> summary(reg)
Call:
lm(formula = dist ~ bs(speed, knots = c(K), degree = 2), data = cars)
Residuals:
Min 1Q Median 3Q Max
-21.848 -8.702 -1.667 6.508 42.514
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.991 9.692 0.618 0.539596
bs(speed, knots = c(K), degree = 2)1 8.087 15.278 0.529 0.599174
bs(speed, knots = c(K), degree = 2)2 45.540 10.735 4.242 0.000109 ***
bs(speed, knots = c(K), degree = 2)3 49.789 15.704 3.170 0.002738 **
bs(speed, knots = c(K), degree = 2)4 95.550 13.651 7.000 1.02e-08 ***
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15 on 45 degrees of freedom
Multiple R-squared: 0.6888, Adjusted R-squared: 0.6611
F-statistic: 24.89 on 4 and 45 DF, p-value: 6.49e-11
So, what can we do with those numbers ?
First, assume know that we consider only one knot (we have to start somewhere), and we consider a b-spline interpolation of degree 1 (i.e. linear by parts).
> K=c(14)
> reg=lm(dist~bs(speed,knots=c(K),degree=1),data=cars)
> summary(reg)
Call:
lm(formula = dist ~ bs(speed, knots = c(K), degree = 1), data = cars)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.290 7.746 0.425 0.6730
bs(speed, knots = c(K), degree = 1)1 31.483 9.670 3.256 0.0021 **
bs(speed, knots = c(K), degree = 1)2 80.525 9.038 8.910 1.16e-11 ***
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.41 on 47 degrees of freedom
Multiple R-squared: 0.657, Adjusted R-squared: 0.6424
F-statistic: 45.01 on 2 and 47 DF, p-value: 1.203e-11
![http://perso.univ-rennes1.fr/arthur.charpentier/latex/spl01.png](https://i2.wp.com/www.r-bloggers.com/wp-content/uploads/2010/11/spl01.png?w=578)
![http://perso.univ-rennes1.fr/arthur.charpentier/latex/spl02.png](https://i2.wp.com/www.r-bloggers.com/wp-content/uploads/2010/11/spl02.png?w=578)
![http://perso.univ-rennes1.fr/arthur.charpentier/latex/spl03.png](https://i1.wp.com/www.r-bloggers.com/wp-content/uploads/2010/11/spl03.png?w=578)
![http://perso.univ-rennes1.fr/arthur.charpentier/latex/spl04.png](https://i2.wp.com/www.r-bloggers.com/wp-content/uploads/2010/11/spl04.png?w=578)
![http://perso.univ-rennes1.fr/arthur.charpentier/latex/spl05.png](https://i0.wp.com/www.r-bloggers.com/wp-content/uploads/2010/11/spl05.png?w=578)
> B=function(x,j,n,K){
+ b=0
+ a1=a2=0
+ if(((K[j+n+1]>K[j+1])&(j+n<=length(K))&(n>0))==TRUE){a2=(K[j+n+1]-x)/
+ (K[j+n+1]-K[j+1])*B(x,j+1,n-1,K) }
+ if(((K[j+n]>K[j])&(n>0))==TRUE){a1=(x-K[j])/
+ (K[j+n]-K[j])*B(x,j,n-1,K)}
+ if(n==0){ b=((x>K[j])&(x<=K[j+1]))*1 }
+ if(n>0){ b=a1+a2}
+ return(b)
+ }
So, for instance, for splines of degree 1, we have
> u=seq(0,1,by=.01)
> plot(u,B(u,1,1,c(0,.4,1,1)),lwd=2,col=”red”,type=”l”,ylim=c(0,1))
> lines(u,B(u,2,1,c(0,.4,1,1)),lwd=2,col=”blue”)
> abline(v=c(0,.4,1),lty=2)
> u=seq(0,1,by=.01)
> plot(u,B(u,1,2,c(0,0,.4,1,1,1)),lwd=2,col=”red”,type=”l”,ylim=c(0,1))
> lines(u,B(u,2,2,c(0,0,.4,1,1,1)),lwd=2,col=”blue”)
> lines(u,B(u,3,2,c(0,0,.4,1,1,1)),lwd=2,col=”green”)
> abline(v=c(0,.4,1),lty=2)
So, how do we use that, here ? Actually, there are two steps:
- we get from
to the unit interval (using a simple affine transformation)
- we consider
> u0=seq(0,1,by=.01)
> v=reg$coefficients[2]*u0+reg$coefficients[1]
> x1=seq(min(cars$speed),K,length=length(u0))
> lines(x1,v,col=”green”,lwd=2)
> u0=seq(0,1,by=.01)
> v=(reg$coefficients[3]-reg$coefficients[2])*u0+ reg$coefficients[1]+reg$coefficients[2]
> x2=seq(K,max(cars$speed),length=length(u0))
> lines(x2,v,col=”blue”,lwd=2)
> plot(cars)
> reg=lm(dist~bs(speed,knots=c(K),degree=1),data=cars)
> k=(K-min(cars$speed))/(max(cars$speed)-min(cars$speed))
> u0=seq(0,1,by=.01)
> v=reg$coefficients[1]+
+ reg$coefficients[2]*B(u0,1,1,c(0,k,1,1))+
+ reg$coefficients[3]*B(u0,2,1,c(0,k,1,1))
> lines(min(cars$speed)+u0*(max(cars$speed)-min(cars$speed)),
+ v,col=”purple”,lwd=2)
> abline(v=K,lty=2,col=”red”)
> K=c(14,20)
> plot(cars)
> reg=lm(dist~bs(speed,knots=c(K),degree=1),data=cars)
> u=seq(4,25,by=.1)
> B=data.frame(speed=u)
> Y=predict(reg,newdata=B)
> lines(u,Y,lwd=2,col=”red”)
> abline(v=K,lty=2,col=”red”)
First, we can plot our basis functions, with two knots,
> u=seq(0,1,by=.01)
> plot(u,B(u,1,1,c(0,.4,.7,1)),lwd=2,col=”red”,type=”l”,ylim=c(0,1))
> lines(u,B(u,2,1,c(0,.4,.7,1,1)),lwd=2,col=”blue”)
> lines(u,B(u,3,1,c(0,.4,.7,1,1)),lwd=2,col=”green”)
> abline(v=c(0,.4,.7,1),lty=2)
> plot(cars)
> reg=lm(dist~bs(speed,knots=c(K),degree=1),data=cars)
> k=(K-min(cars$speed))/(max(cars$speed)-min(cars$speed))
> u0=seq(0,1,by=.01)
> v=reg$coefficients[1]+
+ reg$coefficients[2]*B(u0,1,1,c(0,k,1,1))+
+ reg$coefficients[3]*B(u0,2,1,c(0,k,1,1))+
+ reg$coefficients[4]*B(u0,3,1,c(0,k,1,1))
> lines(min(cars$speed)+u0*(max(cars$speed)-min(cars$speed)),
+ v,col=”red”,lwd=2)
> abline(v=K,lty=2,col=”red”)
> u=seq(0,1,by=.01)
> plot(u,B(u,1,2,c(0,0,.4,.7,1,1,1)),lwd=2,col=”red”,type=”l”,ylim=c(0,1))
> lines(u,B(u,2,2,c(0,0,.4,.7,1,1,1)),lwd=2,col=”blue”)
> lines(u,B(u,3,2,c(0,0,.4,.7,1,1,1)),lwd=2,col=”green”)
> lines(u,B(u,4,2,c(0,0,.4,.7,1,1,1)),lwd=2,col=”orange”)
> abline(v=c(0,.4,.7,1),lty=2)
> plot(cars)
> reg=lm(dist~bs(speed,knots=c(K),degree=2),data=cars)
> k=(K-min(cars$speed))/(max(cars$speed)-min(cars$speed))
> u0=seq(0,1,by=.01)
> v=reg$coefficients[1]+
+ reg$coefficients[2]*B(u0,1,2,c(0,0,k,1,1,1))+
+ reg$coefficients[3]*B(u0,2,2,c(0,0,k,1,1,1))+
+ reg$coefficients[4]*B(u0,3,2,c(0,0,k,1,1,1))+
+ reg$coefficients[5]*B(u0,4,2,c(0,0,k,1,1,1))
> lines(min(cars$speed)+u0*(max(cars$speed)-min(cars$speed)),
+ v,col=”purple”,lwd=2)
> abline(v=K,lty=2,col=”red”)
> vk=seq(.05,.95,by=.05)
> SSR=matrix(NA,length(vk))
> for(i in 1:(length(vk))){
+ k=vk[i]
+ K=min(cars$speed)+k*(max(cars$speed)-min(cars$speed))
+ reg=lm(dist~bs(speed,knots=c(K),degree=2),data=cars)
+ SSR[i]=sum(residuals(reg)^2)
+ }
> plot(vk,SSR,type=”b”,col=”blue”)
To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics - Tag - R-english.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.