
Last week, we had a "mid-term" exam for our introduction to statistical learning course. The question is simple: consider three points \((x_i,y_i)\), here \(\{(0,2),(2,2),(3,1)\}\). Consider some linear model, estimated using least squares techniques; what would be the leave-one-out cross-validation MSE?

I like this exercise since we can compute everything easily, by hand. Since at each step we remove one single observation, only two observations remain in the sample. With two points, fitting a linear model is straightforward (whatever technique is considered): here, we simply take the straight line that passes through the other two points. And since we have that straight line (without even having to minimize a sum of squared errors), we immediately get the error committed on the omitted observation. This is exactly what we see in the drawing below.
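
Concretely, with the three points above, each leave-one-out fit can be written down directly:

- removing \((0,2)\), the line through \((2,2)\) and \((3,1)\) is \(y=4-x\), so the prediction at \(x=0\) is \(4\) and the error is \(2-4=-2\);
- removing \((2,2)\), the line through \((0,2)\) and \((3,1)\) is \(y=2-x/3\), so the prediction at \(x=2\) is \(4/3\) and the error is \(2-4/3=2/3\);
- removing \((3,1)\), the line through \((0,2)\) and \((2,2)\) is \(y=2\), so the prediction at \(x=3\) is \(2\) and the error is \(1-2=-1\).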

In other words, the LOOCV MSE is here \(\operatorname{MSE}=\frac{1}{n}\sum_{i=1}^{n}\left(Y_i-\hat{Y}_i^{(-i)}\right)^2\), where, intuitively, \(\hat{Y}_i^{(-i)}\) denotes the prediction associated with \(x_i\) from the model obtained on the other \(n-1\) observations. Thus, here, \(\operatorname{MSE}=\frac{1}{3}\big(2^2+\frac{2^2}{3^2}+1^2\big)=\frac{1}{27}\big(36+4+9\big)=\frac{49}{27}\). Note that we can also use R to compute that quantity,

> x = c(0,2,3)
> y = c(2,2,1)
> df = data.frame(x=x,y=y)
> yp = rep(NA,3)
> for(i in 1:3){
+ reg = lm(y~x, data=df[-i,])
+ yp[i] = predict(reg,newdata=df)[i]
+ }
> 1/3*sum((yp-y)^2)
[1] 1.814815

which is precisely what we obtained, by hand.
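
As a side remark (not needed for the exercise), for a model fitted by least squares the leave-one-out errors can also be recovered from the full fit, without any refitting, through the classical identity \(e_i^{(-i)}=e_i/(1-h_{ii})\), where \(e_i\) are the ordinary residuals and \(h_{ii}\) the leverages. A minimal sketch in R, using the standard residuals() and hatvalues() functions on the model fitted to all three points:

# leave-one-out errors from a single fit, via the hat matrix
reg = lm(y~x, data=df)
e_loo = residuals(reg) / (1 - hatvalues(reg))
mean(e_loo^2)

which should again return \(49/27\approx 1.814815\).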


Understanding Linear Models and the LOOCV MSE

The text chiefly discusses a statistical exercise taken from a mid-term test in an introduction to statistical learning course. The problem involves three data points \((x_i,y_i)\), specifically \((0,2)\), \((2,2)\), and \((3,1)\). Participants are asked to consider a linear model estimated by least squares and to calculate the leave-one-out cross-validation mean squared error (LOOCV MSE).

The significance of the task lies in the fact that all computations can be done manually. Each iteration omits one observation from the sample, leaving only two observations behind. Fitting a linear model to these is then easy: all that is needed is the straight line passing through the two remaining points. That model immediately gives the error on the removed observation, which is exactly the quantity the LOOCV MSE averages.

The LOOCV MSE is defined as \(\operatorname{MSE}=\frac{1}{n}\sum_{i=1}^{n}\left(Y_i-\hat{Y}_i^{(-i)}\right)^2\), where \(\hat{Y}_i^{(-i)}\) is the prediction associated with \(x_i\) from a model fitted on the remaining \(n-1\) observations.

The calculation yields a LOOCV MSE of \(49/27\approx 1.814815\), both by hand and in R.
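
For readers who prefer a packaged routine over writing the refitting loop themselves, the same leave-one-out estimate can be obtained, for instance, with cv.glm() from the boot package (not used in the original post; a sketch assuming boot is installed, and noting that a Gaussian glm is the same fit as lm here):

library(boot)                 # provides cv.glm()
df  = data.frame(x = c(0,2,3), y = c(2,2,1))
fit = glm(y ~ x, data = df)   # Gaussian family by default, same fit as lm
cv.glm(df, fit)$delta[1]      # K defaults to n, i.e. leave-one-out; should be about 1.8148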

Long-term Implications and Future Developments

This simple exercise carries several implications for individuals interested in statistical learning and predictive modeling. While it seems basic, such rudimentary training forms the foundation of more complex problem-solving in disciplines like machine learning or artificial intelligence.

An understanding of the LOOCV MSE and similar techniques will remain crucial, especially as linear models continue to be a mainstay of statistical learning. Such validation techniques not only aid model interpretation but also offer more reliable error estimates, particularly with small sample sizes.

In terms of future developments, the most likely trajectory is toward increasingly sophisticated techniques for model validation and error estimation. The goal will always be to minimize error rates and thus develop more accurate and reliable models.

Actionable Advice

  1. For budding (and even experienced) statisticians, it is crucial to master these basic techniques. They form the underpinning of many advanced machine learning algorithms.
  2. Stay abreast of computational statistics tools like R or Python. These tools offer practical applications and can be incredibly efficient compared to working problems out by hand.
  3. Remember that even as we move towards more sophisticated statistical techniques, the foundation remains the same. A strong grasp of linear models and understanding of error calculations like LOOCV MSE will always be beneficial.
  4. Never underestimate the power of simplicity in models. Sometimes a simple model provides more interpretable and equally reliable results compared to a complex one.
