Last week, we had a “mid-term” exam for our introduction to statistical learning course. The question was simple: consider three points \((x_i,y_i)\), here \(\{(0,2),(2,2),(3,1)\}\). Considering linear models estimated with least squares techniques, what would the leave-one-out cross-validation MSE be?
I like this exercise since we can compute everything easily, by hand. Since at each step we remove a single observation, only two observations remain in the sample. And with two points, fitting a linear model is straightforward (whatever the technique considered). Here, we simply take the straight line that passes through the other two points. And once we have that straight line (without even having to minimize a sum of squared errors), we have the error committed on the omitted observation. This is exactly what we see in the drawing below
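To make the drawing concrete, here is a small by-hand sketch in R (the variable names and the slope/intercept computations are mine, added for illustration): for each omitted observation, we take the straight line through the two remaining points and look at the error at the omitted \(x_i\),

x = c(0,2,3)
y = c(2,2,1)
err = rep(NA, 3)
for(i in 1:3){
  # the two remaining points, once observation i is removed
  xr = x[-i]; yr = y[-i]
  # straight line through those two points
  slope = (yr[2] - yr[1]) / (xr[2] - xr[1])
  intercept = yr[1] - slope * xr[1]
  # error committed on the omitted observation
  err[i] = y[i] - (intercept + slope * x[i])
}
err          # -2, 2/3, -1
mean(err^2)  # 49/27, i.e. 1.814815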
In other words, the LOOCV MSE is here \({\displaystyle\operatorname{MSE}=\frac{1}{n}\sum_{i=1}^{n}\left(Y_{i}-\hat{Y}_{i}^{(-i)}\right)^{2}}\), where, intuitively, \(\hat{Y}_{i}^{(-i)}\) denotes the prediction at \(x_i\) from the model fitted on the other \(n-1\) observations. Thus, here \({\displaystyle\operatorname{MSE}=\frac{1}{3}\Big(2^2+\frac{2^2}{3^2}+1^2\Big)=\frac{1}{27}\big(36+4+9\big)=\frac{49}{27}}\). Note that we can also use R to compute that quantity,
> x = c(0,2,3)
> y = c(2,2,1)
> df = data.frame(x=x, y=y)
> yp = rep(NA,3)
> for(i in 1:3){
+   reg = lm(y~x, data=df[-i,])
+   yp[i] = predict(reg, newdata=df)[i]
+ }
> 1/3*sum((yp-y)^2)
[1] 1.814815
which is precisely what we obtained, by hand.
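As an aside (not part of the original exercise), for least squares regression there is a classical shortcut: the leave-one-out residual equals the ordinary residual divided by \(1-h_{ii}\), where \(h_{ii}\) is the leverage of observation \(i\), so the LOOCV MSE can be obtained from a single fit on all the observations. A quick sketch, reusing the df above,

> reg = lm(y ~ x, data = df)
> # leave-one-out residuals via the leverage shortcut e_i / (1 - h_ii)
> loo = residuals(reg) / (1 - hatvalues(reg))
> mean(loo^2)
[1] 1.814815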