That damn R-squared !
Another post about the R-squared coefficient, and about why, after some years teaching econometrics, I still hate it when students ask questions about it. Usually, it starts with “I have a _____ R-squared… isn’t it too low?” Please feel free to fill in the blank with your favorite (low) number. Say 0.2. To keep it simple, there are several possible answers to that question:
- if you don’t want to waste time understanding econometrics, I would say something like “Forget about the R-squared, it is useless” (perhaps also “please, think twice about taking that econometrics course”)
- if you’re ready to spend some time getting a better understanding of subtle concepts, I would say “I don’t like the R-squared. It might be interesting in some rare cases (you can probably count them on the fingers of one finger), like comparing two models on the same dataset (even so, I would recommend the adjusted one). But usually, its value has no meaning. You can compare 0.2 and 0.3 (and prefer the 0.3 R-squared model to the 0.2 R-squared one), but 0.2 means nothing”. Well, not exactly, since it means something, but it is not a measure that tells you whether you are dealing with a good or a bad model. Well, again, not exactly, but it is rather difficult to say where bad ends and where good starts. Actually, it is exactly like the correlation coefficient (there is nothing mysterious here, since the R-squared can be related to some correlation coefficient, as mentioned in class)
- if you want some more advanced advice, I would say “It’s complicated…” (and perhaps also “Look in a textbook written by someone more clever than me, you can find hundreds of them in the library!”)
- if you want me to act like the people we’ve seen recently on TV (during the electoral debates), “It’s extremely interesting, but before answering your question, let me tell you a story…”
```r
> set.seed(1)
> n=20
> X=runif(n)
> E=rnorm(n)
> Y=2+5*X+E*.5
> base=data.frame(X,Y)
> reg=lm(Y~X,data=base)
> summary(reg)

Call:
lm(formula = Y ~ X, data = base)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.15961 -0.17470  0.08719  0.29409  0.52719 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.4706     0.2297   10.76 2.87e-09 ***
X             4.2042     0.3697   11.37 1.19e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.461 on 18 degrees of freedom
Multiple R-squared: 0.8778,	Adjusted R-squared: 0.871 
F-statistic: 129.3 on 1 and 18 DF,  p-value: 1.192e-09
```
```r
> Y=2+5*X+E*4
> base=data.frame(X,Y)
> reg=lm(Y~X,data=base)
> summary(reg)

Call:
lm(formula = Y ~ X, data = base)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.2769 -1.3976  0.6976  2.3527  4.2175 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    5.765      1.837   3.138  0.00569 **
X             -1.367      2.957  -0.462  0.64953   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 3.688 on 18 degrees of freedom
Multiple R-squared: 0.01173,	Adjusted R-squared: -0.04318 
F-statistic: 0.2136 on 1 and 18 DF,  p-value: 0.6495
```
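As a side check on the earlier remark that the R-squared is tied to a correlation coefficient: in a simple regression with an intercept, it is the squared sample correlation between X and Y, and the adjusted version only rescales it by degrees of freedom. Here is a minimal sketch on the same simulated data (the helper name `r2_check` is mine, introduced only for illustration):

```r
set.seed(1)
n <- 20
X <- runif(n)
E <- rnorm(n)

# Hypothetical helper: several ways of getting the (adjusted) R-squared
# for the model Y = 2 + 5 X + sigma * E fitted by lm(Y ~ X)
r2_check <- function(sigma) {
  Y <- 2 + 5 * X + E * sigma
  s <- summary(lm(Y ~ X))
  c(r.squared     = s$r.squared,
    cor.squared   = cor(X, Y)^2,                       # squared correlation
    adj.r.squared = s$adj.r.squared,
    adj.by.hand   = 1 - (1 - s$r.squared) * (n - 1) / (n - 2))
}

r2_check(0.5)  # low-noise fit
r2_check(4)    # high-noise fit
```

The first two entries should agree, and so should the last two. Note that nothing prevents the adjusted value from being negative when the R-squared is close to zero.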
```r
> S=seq(0,4,by=.2)
> R2=rep(NA,length(S))
> for(s in 1:length(S)){
+ Y=2+5*X+E*S[s]
+ base=data.frame(X,Y)
+ reg=lm(Y~X,data=base)
+ R2[s]=summary(reg)$r.squared}
```
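To see what this loop is tracing, it may help to put the sample values next to their population counterpart. Assuming X is uniform on [0,1] (variance 1/12) and E is standard normal, the population R-squared of Y = 2 + 5X + sE is (25/12)/(25/12 + s²), which depends only on the noise scale s, not on whether the model is correctly specified (it is, in every case here). A rough sketch, where the name `R2_pop` and the choice to start the grid at 0.2 rather than 0 (to avoid the perfect-fit case) are mine:

```r
set.seed(1)
n <- 20
X <- runif(n)
E <- rnorm(n)
S <- seq(0.2, 4, by = 0.2)  # noise scales; 0 is skipped (perfect fit)

# Sample R-squared for each noise scale; in a simple regression with an
# intercept this is the squared correlation between X and Y
R2 <- sapply(S, function(s) cor(X, 2 + 5 * X + E * s)^2)

# Population counterpart, under the assumptions X ~ U(0,1), E ~ N(0,1)
R2_pop <- (25 / 12) / (25 / 12 + S^2)

plot(S, R2, type = "b", ylim = c(0, 1),
     xlab = "noise scale", ylab = "R-squared")
lines(S, R2_pop, lty = 2)   # dashed: population value
```

With only 20 observations the sample curve can wander noticeably around the dashed one, but both fall toward zero as the noise grows, even though the true slope never changes.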
Nevertheless, it looks like some econometricians really care about the R-squared, and cannot imagine looking at a model if the R-squared is lower than, say, 0.4. It is always possible to reach that level: you just have to add more covariates! If you have some… And if you don’t, you can always use polynomials of a continuous covariate. For instance, on the previous example,
```r
> # poly() requires degree < number of distinct X values (here n=20)
> S=seq(1,n-1,by=1)
> R2=rep(NA,length(S))
> for(s in 1:length(S)){
+ reg=lm(Y~poly(X,degree=s),data=base)
+ R2[s]=summary(reg)$r.squared}
```
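The other trick mentioned above, adding covariates, can be sketched in the same spirit. The covariates below (`Z1`, …, `Z15`) are hypothetical and pure noise, unrelated to Y by construction. Since the models are nested, the R-squared can only go up as they are added, while the adjusted R-squared is not bound to follow:

```r
set.seed(1)
n <- 20
X <- runif(n)
Y <- 2 + 5 * X + 4 * rnorm(n)   # noisy response, as in the second fit

K <- 15                          # number of junk covariates (assumption)
Z <- matrix(rnorm(n * K), nrow = n,
            dimnames = list(NULL, paste0("Z", 1:K)))
d <- data.frame(Y, X, Z)

R2 <- adjR2 <- rep(NA, K + 1)
for (k in 0:K) {
  # regress Y on X and the first k noise covariates
  fit <- summary(lm(Y ~ ., data = d[, seq_len(k + 2), drop = FALSE]))
  R2[k + 1] <- fit$r.squared
  adjR2[k + 1] <- fit$adj.r.squared
}
round(cbind(extra = 0:K, R2, adjR2), 3)
```

Reaching any given R-squared threshold this way says nothing about the quality of the model, only about the number of parameters relative to the 20 observations.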