
Machine Learning Explained: Overfitting

[This article was first published on Enhance Data Science, and kindly contributed to R-bloggers].

Welcome to this new post of Machine Learning Explained. After dealing with bagging, today we will deal with overfitting. Overfitting is the devil of Machine Learning and Data Science and has to be avoided in all of your models.

What is overfitting?

A good model is able to learn the pattern from your training data and then to generalize it to new data (from a similar distribution). Overfitting is when a model fits your training data almost perfectly but performs poorly on new data. A model overfits when it learns the very specific patterns and the noise of the training data; such a model is not able to extract the “big picture” or the general pattern from your data. Hence, on new and different data, the performance of the overfitted model will be poor.

Overfitting and polynomial regression

Well, let’s see this through an example! We want a model to estimate the relationship between the wind speed and the quantity of ozone in the air.

The relationship does not seem linear; hence, polynomial regression may give good results. Five polynomial regressions were fitted to the data, of degree 1, 3, 5, 10 and 20 respectively. The models were trained on 70% of the data.

As we could expect, the higher the degree of the polynomial, the better the fit to the training data. However, we can see that the high-order polynomials (5, 10, 20) tend to learn the patterns of a few outliers and are not robust. To confirm this, let’s compute the error (root mean squared error, RMSE) on the training and testing sets.

The red line shows the evolution of the error on the testing set and the black line the error on the training set. As soon as the degree exceeds 9 or 10, the test error seems to start growing, and it explodes for even higher degrees. For the sake of visibility, let’s plot this for models of degree 1 to 8.

As we can see, though the training error keeps decreasing, the test error is not affected much by the complexity of the model. Here, the simpler models are the best choices.

And why is overfitting happening?

Overfitting happens when your model has too much freedom to fit the data. It is then easy for the model to fit the training data perfectly (and to minimize the loss function). Hence, more complex models are more likely to overfit.
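To make the "too much freedom" idea concrete, here is a minimal sketch on simulated data (not the airquality data used in this post; every name and value below is an illustrative assumption). A polynomial with as many coefficients as there are training points can pass through every one of them, so its training error is essentially zero, while a straight line cannot.

```r
# Illustrative sketch on simulated data; all names and values are assumptions.
set.seed(1)
n = 8
x = seq(0, 1, length.out = n)
y = 2 * x + rnorm(n, sd = 0.3)   # the true relationship is linear, plus noise

# A straight line has 2 coefficients: it cannot go through all 8 points
linear_fit = lm(y ~ x)
# A polynomial of degree n - 1 has n coefficients: one per training point
flexible_fit = lm(y ~ poly(x, n - 1))

train_rmse_linear = sqrt(mean(residuals(linear_fit)^2))
train_rmse_flexible = sqrt(mean(residuals(flexible_fit)^2))

# New points drawn from the same distribution
new_data = data.frame(x = runif(200))
y_new = 2 * new_data$x + rnorm(200, sd = 0.3)
test_rmse_linear = sqrt(mean((y_new - predict(linear_fit, new_data))^2))
test_rmse_flexible = sqrt(mean((y_new - predict(flexible_fit, new_data))^2))

print(c(train_linear = train_rmse_linear, train_flexible = train_rmse_flexible))
print(c(test_linear = test_rmse_linear, test_flexible = test_rmse_flexible))
```

The flexible model reaches a training error of (numerically) zero because it has memorized the noise. On the new points its error is usually higher than that of the straight line, although the exact numbers depend on the seed.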

My advice would be, the more complex the model, the more careful you need to be.

How to detect and avoid overfitting?

To detect overfitting, you need to look at how the test error evolves. As long as the test error is decreasing, the model is still fine. On the other hand, an increase in the test error indicates that you are probably overfitting.
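This check can be sketched in a few lines of base R on the same ozone and wind data. The snippet is only an illustration: the variable names, the helper `rmse` and the range of degrees are assumptions, and it uses orthogonal polynomials (the default of `poly`) rather than raw ones for numerical stability.

```r
# Sketch in base R; names and the range of degrees are illustrative assumptions.
set.seed(456)
ozone_data = na.omit(airquality[, c("Wind", "Ozone")])
train_rows = sample(nrow(ozone_data), size = floor(0.7 * nrow(ozone_data)))
train_set = ozone_data[train_rows, ]
test_set = ozone_data[-train_rows, ]

# Root mean squared error between observed and predicted values
rmse = function(observed, predicted) sqrt(mean((observed - predicted)^2))

degrees = 1:8
errors = data.frame(degree = degrees, train = NA_real_, test = NA_real_)
for (d in degrees) {
  fit = lm(Ozone ~ poly(Wind, d), data = train_set)
  errors$train[d] = rmse(train_set$Ozone, fitted(fit))
  errors$test[d] = rmse(test_set$Ozone, predict(fit, newdata = test_set))
}
print(errors)

# Degree with the lowest error on the held-out data
best_degree = errors$degree[which.min(errors$test)]
print(best_degree)
```

The training error can only go down as the degree increases, so it says nothing about overfitting; the test column is the one to watch. With a single random split the minimum is only indicative, and a different seed may point to a different degree.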

As said before, overfitting is caused by a model having too much freedom. Hence, most of the solutions to avoid overfitting add more constraints to the model.
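As one possible illustration of such a constraint, the sketch below adds a ridge (L2) penalty to a degree-10 polynomial regression. The choice of ridge regression, the helper `ridge_coef` and the value `lambda = 1` are assumptions made for this example, not something prescribed by this post. The penalty shrinks the polynomial coefficients toward zero, which leaves the model less freedom to chase individual points.

```r
# Sketch in base R; ridge_coef and lambda = 1 are illustrative assumptions.
set.seed(456)
ozone_data = na.omit(airquality[, c("Wind", "Ozone")])
train_rows = sample(nrow(ozone_data), size = floor(0.7 * nrow(ozone_data)))
train_set = ozone_data[train_rows, ]

# Design matrix: intercept plus a degree-10 orthogonal polynomial in Wind
X = cbind(1, poly(train_set$Wind, 10))
y = train_set$Ozone

# Closed-form ridge solution: (X'X + lambda * P)^-1 X'y
ridge_coef = function(X, y, lambda) {
  penalty = diag(ncol(X))
  penalty[1, 1] = 0   # the intercept is not penalized
  solve(crossprod(X) + lambda * penalty, crossprod(X, y))
}

coef_free = ridge_coef(X, y, lambda = 0)          # plain least squares
coef_constrained = ridge_coef(X, y, lambda = 1)   # penalized fit

# Size of the polynomial coefficients with and without the constraint
print(c(free = sum(coef_free[-1]^2), constrained = sum(coef_constrained[-1]^2)))
```

With `lambda = 0` the result is the ordinary least squares fit; larger values of `lambda` constrain the model more strongly. In practice `lambda` would be chosen by looking at the error on held-out data.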

R Code to replicate the plot

### Overfitting

library(data.table)
library(ggplot2)

set.seed(456)

## Reading data
overfitting_data=data.table(airquality)
ggplot(overfitting_data,aes(Wind,Ozone))+geom_point()+ggtitle("Ozone vs wind speed")
## Keep the two variables of interest and drop the rows with missing values
data_test=na.omit(overfitting_data[,.(Wind,Ozone)])
## 70% of the rows are used for training
train_sample=sample(1:nrow(data_test),size = floor(0.7*nrow(data_test)))

### Creation of polynomial models
degree_of_poly=1:20
degree_to_plot=c(1,3,5,10,20)
polynomial_model=list()
df_result=NULL
for (degree in degree_of_poly)
{
 # Fit a raw polynomial of the current degree on the training rows
 fm=as.formula(paste0("Ozone~poly(Wind,",degree,",raw=TRUE)"))
 polynomial_model=c(polynomial_model,list(lm(fm,data_test[train_sample])))
 data_fitted=tail(polynomial_model,1)[[1]]$fitted.values
 new_df=data.table(Wind=data_test[train_sample,Wind],Ozone_real=data_test[train_sample,Ozone],Ozone_fitted=data_fitted,degree=as.factor(degree))
 # Stack the fitted values of all degrees into one table
 if (is.null(df_result))
 df_result=new_df
 else
 df_result=rbind(df_result,new_df)
}
# Observed points and fitted curves for the selected degrees
gg=ggplot(df_result[degree%in%degree_to_plot],aes(x=Wind))+geom_point(aes(y=Ozone_real))+geom_line(aes(color=degree,y=Ozone_fitted))
gg+ggtitle('Ozone vs wind for several polynomial regressions')+ylab('Ozone')

### Computing the root mean squared error (RMSE) on the training and testing sets
SE_train_list=c()
SE_test_list=c()

for (poly_mod in polynomial_model)
{
 print(summary(poly_mod))
 SE_train_list=c(SE_train_list,sqrt(mean(poly_mod$residuals^2)))
 # Compare the observed Ozone values (not the whole table) with the predictions
 SE_test=sqrt(mean((data_test[-train_sample,Ozone]-predict(poly_mod,data_test[-train_sample,]))^2))
 SE_test_list=c(SE_test_list,SE_test)
}

data_plot=data.table(SE_test_list,SE_train_list,degree_of_poly)
# Red line: testing set, black line: training set
ggplot(data_plot[degree_of_poly<=8])+geom_line(aes(x=degree_of_poly,y=SE_test_list),color='red')+geom_line(aes(x=degree_of_poly,y=SE_train_list))+ylab('RMSE')+xlab('Degree of polynomial')

Thanks for reading!