Random thoughts on econometric models with (pure) random features
For my lectures on applied linear models, I wanted to illustrate the fact that the $R^2$ is never a good measure of the goodness of fit of a model, since it is quite easy to improve it artificially. Consider the following dataset
n = 100
# 100 observations of 100 independent N(0,1) variables: one response and 99 features
df = data.frame(matrix(rnorm(n*n), n, n))
names(df) = c("Y", paste("X", 1:99, sep=""))
with one variable of interest $y$ and 99 features $x_j$, all of them (by construction) independent, and 100 observations… Consider here the regression on the first $k$ features, and compute the $R^2_k$ of that regression
reg = function(k){
  frm = paste("Y~", paste("X", 1:k, collapse="+", sep=""))   # regression on the first k features
  model = lm(frm, data=df)
  summary(model)$r.squared
}
Let us see what’s going on…
plot(1:99,Vectorize(reg)(1:99))
(Actually, it is not exactly what we have on the graph: the curve is the average obtained over 1,000 randomly generated samples, with 90% confidence bands.) Observe that $E[R^2_k]\approx k/n$, i.e. if we keep adding pure random noise, we keep increasing the $R^2$ (up to 1, actually).
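For completeness, here is a minimal sketch of how such an average could be obtained; the number of replications (nsim = 1000) and the quantile-based 90% bands are assumptions on my side, not necessarily the exact code behind the graph.

nsim = 1000
R2 = matrix(NA, nsim, 99)
for(s in 1:nsim){
  # new dataset of pure noise, with the same structure as df
  dfs = data.frame(matrix(rnorm(n*n), n, n))
  names(dfs) = c("Y", paste("X", 1:99, sep=""))
  R2[s,] = sapply(1:99, function(k){
    frm = paste("Y~", paste("X", 1:k, collapse="+", sep=""))
    summary(lm(frm, data=dfs))$r.squared
  })
}
plot(1:99, colMeans(R2), type="l")                      # average R^2 as a function of k
lines(1:99, apply(R2, 2, quantile, probs=.05), lty=2)   # lower 90% band
lines(1:99, apply(R2, 2, quantile, probs=.95), lty=2)   # upper 90% band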
Good news: as we have seen in the course, the adjusted $R^2$, denoted $\bar R^2$, might help. Observe that $E[\bar R^2_k]\approx 0$, so, in some sense, adding features does not help here…
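Recall that the adjusted coefficient of determination penalizes the number of features,
$$\bar R^2_k = 1 - (1 - R^2_k)\,\frac{n-1}{n-k-1},$$
so the mechanical increase of $R^2_k$ obtained by adding pure noise is, on average, exactly offset by the correction factor, which is consistent with $E[\bar R^2_k]=0$.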
reg = function(k){
  frm = paste("Y~", paste("X", 1:k, collapse="+", sep=""))
  model = lm(frm, data=df)
  summary(model)$adj.r.squared
}
plot(1:99, Vectorize(reg)(1:99))
We can actually do the same with the Akaike information criterion $AIC_k$ and the Schwarz (Bayesian) criterion $BIC_k$.
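For reference, for a Gaussian linear model with $k$ features (plus an intercept and an error variance), these criteria can be written, up to an additive constant that does not depend on $k$, as
$$AIC_k = n\log\!\left(\frac{RSS_k}{n}\right) + 2(k+2), \qquad BIC_k = n\log\!\left(\frac{RSS_k}{n}\right) + (k+2)\log n,$$
where $RSS_k$ is the residual sum of squares of the model with $k$ features: the penalty grows linearly with $k$, while the first term diverges to $-\infty$ as the fit becomes perfect when $k$ gets close to $n$.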
reg = function(k){
  frm = paste("Y~", paste("X", 1:k, collapse="+", sep=""))
  model = lm(frm, data=df)
  AIC(model)
}
plot(1:99, Vectorize(reg)(1:99))
For the AIC, the initial increase makes sense: we should not prefer a model with 10 noise covariates to the empty model. The strange thing is the behaviour on the far right: there, we prefer 80 random noise features to none! Which I find hard to interpret… For the BIC, the code is simply
reg = function(k){
  frm = paste("Y~", paste("X", 1:k, collapse="+", sep=""))
  model = lm(frm, data=df)
  BIC(model)
}
plot(1:99, Vectorize(reg)(1:99))
and here also we have the same pattern: we prefer a big model with just pure noise to nothing…
One last criterion to conclude (or not): what about the leave-one-out cross-validation mean squared error? More precisely,
$$CV = \frac{1}{n}\sum_{i=1}^n \hat{\varepsilon}_{-i}^2, \quad\text{where } \hat{\varepsilon}_{-i} = y_i - \hat{y}_{-i}$$
and where $\hat{y}_{-i}$ is the predicted value obtained with the model estimated when the $i$-th observation is deleted. One can prove that
$$\hat{\beta}_{-i} = \hat{\beta} - (X^\top X)^{-1} x_i \hat{\varepsilon}_i (1 - H_{i,i})^{-1}$$
where $H$ is the classical hat matrix, and thus
$$\hat{\varepsilon}_{-i} = (1 - H_{i,i})^{-1} \hat{\varepsilon}_i,$$
i.e. we do not have to estimate $n$ models (one per deleted observation)
reg = function(k){
  frm = paste("Y~", paste("X", 1:k, collapse="+", sep=""))
  model = lm(frm, data=df)
  h = lm.influence(model)$hat          # diagonal of the hat matrix
  mean((residuals(model)/(1-h))^2)     # leave-one-out mean squared error
}
plot(1:99, Vectorize(reg)(1:99))
Here, it makes sense: adding noisy features yields overfitting, and the leave-one-out mean squared error keeps getting worse!
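As a quick sanity check of the hat-matrix shortcut (a small sketch of mine, not part of the original derivation; the choice of k = 5 features is arbitrary), we can compare it with a brute-force leave-one-out loop that refits the model n times:

# brute-force leave-one-out for a model with 5 features: refit the model n times
frm = paste("Y~", paste("X", 1:5, collapse="+", sep=""))
cv_loop = mean(sapply(1:n, function(i){
  model_i = lm(frm, data=df[-i,])
  (df$Y[i] - predict(model_i, newdata=df[i,]))^2
}))
# hat-matrix shortcut: fit the model only once
model = lm(frm, data=df)
h = lm.influence(model)$hat
cv_short = mean((residuals(model)/(1-h))^2)
c(cv_loop, cv_short)   # the two values coincide (up to numerical error)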
That's all nice, but it might not be very realistic… Here, for my model with only one variable, I just picked one at random… In practice, we try to get the "best" one… So a more natural idea would be to order the variables according to their correlation with $y$,
df = data.frame(matrix(rnorm(n*n), n, n))
# reorder the columns by decreasing absolute correlation with the first one (the response)
df = df[, rev(order(abs(cor(df)[1,])))]
names(df) = c("Y", paste("X", 1:99, sep=""))
and, as before, we can plot the evolution of $R^2_k$ as a function of $k$, the number of features considered.
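(For completeness, this is simply the same helper as before, applied to the reordered data frame.)

# same function as before, computed on the reordered data frame
reg = function(k){
  frm = paste("Y~", paste("X", 1:k, collapse="+", sep=""))
  model = lm(frm, data=df)
  summary(model)$r.squared
}
plot(1:99, Vectorize(reg)(1:99))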
The curve is increasing, with a higher slope at the beginning… For $\bar R^2_k$, we might actually prefer a correlated noise feature to nothing (which actually makes sense). So here, since we somehow chose our variables, $\bar R^2_k$ seems to be always positive…
For $AIC_k$, here also, there is an initial improvement before coming back to the level of the empty model (with about 80 features), and here also we observe the drop on the far-right part of the graph.
The $BIC_k$ might like the top three features, but soon we have a deterioration… even if, here also, we have the drop at the far right (with more than 95 features… for 100 observations).
Finally, observe that here again our (leave-one-out) cross-validation has not been misled by our noisy variables: the cross-validated mean squared error keeps deteriorating as we add them!
So it seems that cross-validation techniques are more robust than the AIC and BIC (even if we mentioned, in a previous post, connections between all those concepts) when we have a lot of noisy (irrelevant) features.