
Optimal regularization for smoothing splines

[This article was first published on R snippets, and kindly contributed to R-bloggers].
In the smooth.spline procedure one can use the df or spar parameter to control the smoothing level. Usually they are not set manually, but recently I was asked which of them is a better measure of the regularization level. Hastie et al. (2009) discuss the advantages of df, but I thought of a simple graphical illustration of this fact.
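As a quick illustration of the two parametrizations (the sample size and parameter values here are my own, chosen just for the example, though the data-generating model is the same as below):

```r
set.seed(1)
x <- runif(50, -2, 2)
y <- x ^ 2 / 2 + sin(4 * x) + rnorm(50)
# df requests a target number of effective degrees of freedom;
# spar sets the smoothing parameter directly (monotonically related
# to the roughness-penalty coefficient lambda).
fit.df   <- smooth.spline(x, y, df = 8)
fit.spar <- smooth.spline(x, y, spar = 0.5)
```

Both calls fit the same kind of smoothing spline; they only differ in how the amount of regularization is specified.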
I use the following criterion to judge whether some quantity measures the roughness penalty well: increasing the training sample size should influence the optimal roughness-penalty level in a monotonic way.
I compared the performance of the df and spar parametrizations using a sample function (defined in gen.data below) for different training sample sizes. Here is the code for the df parametrization:

set.seed(1)
gen.data <- function(n) {
      x <- runif(n, -2, 2)
      y <- x ^ 2 / 2 + sin(4 * x) + rnorm(n)
      return(data.frame(x, y))
}

df.levels <- seq(5, 15, length.out = 100)
n.train <- (3 ^ (0 : 5)) * (2 ^ (6 : 1))
cols <- 1
reps <- 100
valid <- gen.data(100000)
plot(NULL, xlab = "df", ylab = "mse",
     xlim = c(5, 18), ylim = c(1, 1.3))

for (n in n.train) {
      mse <- rep(0, length(df.levels))
      for (j in 1 : reps) {
            train <- gen.data(n)
            for (i in seq(along.with = df.levels)) {
                  ss <- smooth.spline(train, df = df.levels[i])
                  ss.y <- predict(ss, valid$x)$y
                  mse[i] <- mse[i] + mean((ss.y - valid$y) ^ 2)
            }
      }
      mse <- mse / reps
      lines(df.levels, mse, col = cols, lwd = 2)
      points(df.levels[which.min(mse)], min(mse),
             col = cols, pch = 19)
      text(15, mse[length(mse)], paste("n =", n),
           col = cols, pos = 4)
      cols <- cols + 1
}

It produces the following result:

[Figure: validation mse against df, one curve per training sample size n, with the minimum of each curve marked.]
The plot shows the desired property. A similar plot can be obtained for the spar parameter by a simple modification of the code:
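One possible modification is sketched below; it reuses gen.data, n.train, valid, and reps from the code above, and the spar range and plot limits are my own guesses rather than the original post's values:

```r
# Sweep spar instead of df over an assumed range.
spar.levels <- seq(0, 1.5, length.out = 100)
cols <- 1
plot(NULL, xlab = "spar", ylab = "mse",
     xlim = c(0, 1.5), ylim = c(1, 1.3))

for (n in n.train) {
      mse <- rep(0, length(spar.levels))
      for (j in 1 : reps) {
            train <- gen.data(n)
            for (i in seq(along.with = spar.levels)) {
                  ss <- smooth.spline(train, spar = spar.levels[i])
                  ss.y <- predict(ss, valid$x)$y
                  mse[i] <- mse[i] + mean((ss.y - valid$y) ^ 2)
            }
      }
      mse <- mse / reps
      lines(spar.levels, mse, col = cols, lwd = 2)
      points(spar.levels[which.min(mse)], min(mse),
             col = cols, pch = 19)
      cols <- cols + 1
}
```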

[Figure: validation mse against spar, one curve per training sample size n.]

It is easy to notice that the optimal values of spar do not change monotonically as the number of observations increases.

This comparison shows that df is a better measure of the regularization level than spar.

Additionally, one can notice that the curves for different training sample sizes intersect under the spar parametrization, which is unexpected. This might be due only to the randomness of the data-generation process, but I have run the simulation several times and the curves always intersected. Unfortunately, I do not have a proof of what should happen when the valid data set size and the reps parameter both tend to infinity.

