I had a very strange discussion on Twitter (yes, another one) about regression curves. I think it started with a tweet based on an xkcd picture (just for fun, because it was New Year’s Day):
“don’t trust linear regressions” https://t.co/exUCvyRd1G pic.twitter.com/O6rBJfkULa
— Arthur Charpentier (@freakonometrics) January 1, 2017
Econometricians commented on that picture, mainly about ‘significant’ trends when datasets are very noisy. And I mentioned a graph that I had seen earlier that day:
@AndyHarless @mileskimball actually, all that reminds me of a post by @RogerPielkeJr earlier (not a big fan of the regression line) pic.twitter.com/NQgzgVsBcE
— Arthur Charpentier (@freakonometrics) January 1, 2017
Let us reproduce that graph (Roger kindly sent me the dataset):
db=data.frame(year=1990:2016,
ratio=c(.23,.27,.32,.37,.22,.26,.29,.15,.40,.28,.14,.09,.24,.18,.29,.51,.13,.17,.25,.13,.21,.29,.25,.2,.15,.12,.12))
library(ggplot2)
Here is the graph, following the conventions of Roger’s original graph (a kind of bar plot):
ggplot(db, aes(year, ratio)) +
geom_bar(stat="identity") +
stat_smooth(method = "lm", se = FALSE)
My point was that the ‘confidence band’ of the regression is missing.
@freakonometrics @AndyHarless @mileskimball Because it is not a sample. Since 1990 weather losses/global GDP have gone down.
— Roger Pielke Jr. (@RogerPielkeJr) January 1, 2017
In R, at least, it is quite natural to get that band (it is actually the default behaviour of the smoothing function):
ggplot(db, aes(year, ratio)) +
geom_bar(stat="identity") +
stat_smooth(method = "lm", se = TRUE)
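The band drawn with se = TRUE can also be computed by hand. The following is a minimal sketch, assuming (as I understand the ggplot2 default) that the band is the pointwise 95% confidence interval for the fitted mean; the names fit, band, band_df and narrowest_year are mine, introduced only for illustration.

```r
# Data as given earlier in the post
db <- data.frame(year = 1990:2016,
  ratio = c(.23,.27,.32,.37,.22,.26,.29,.15,.40,.28,.14,.09,.24,.18,
            .29,.51,.13,.17,.25,.13,.21,.29,.25,.2,.15,.12,.12))

# Ordinary least squares fit of the ratio on the year
fit <- lm(ratio ~ year, data = db)

# Pointwise 95% confidence interval for the fitted mean at each observed year
band <- predict(fit, newdata = db, interval = "confidence", level = 0.95)
band_df <- cbind(db, as.data.frame(band))  # columns: year, ratio, fit, lwr, upr

# The band is narrowest at the mean year and widens towards both ends
band_df$width <- band_df$upr - band_df$lwr
narrowest_year <- band_df$year[which.min(band_df$width)]

head(band_df)
```

The columns lwr and upr can be passed to geom_ribbon() if one wants to draw the band explicitly rather than rely on stat_smooth.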
It is hard to claim that the ‘regression line’ is significant (in the sense of being significantly non-horizontal). To be more specific, if we look at the output of the regression model, we get
summary(lm(ratio~year,data=db))
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.158531   4.549672   2.013    0.055 .
year        -0.004457   0.002271  -1.962    0.061 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(which is exactly what Roger used in his graph to plot his red straight line). The p-value of the slope estimate in this linear regression model is about 6%. But I found Roger’s point puzzling:
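To read the slope's p-value directly, rather than from the printed summary, one can extract the coefficient row and a confidence interval for the slope. This is only a sketch; the object names (fit, slope_row, slope_p, slope_ci) are mine and not part of the original analysis.

```r
# Data as given earlier in the post
db <- data.frame(year = 1990:2016,
  ratio = c(.23,.27,.32,.37,.22,.26,.29,.15,.40,.28,.14,.09,.24,.18,
            .29,.51,.13,.17,.25,.13,.21,.29,.25,.2,.15,.12,.12))

fit <- lm(ratio ~ year, data = db)

# Row of the coefficient table for the slope:
# estimate, standard error, t value, p-value
slope_row <- coef(summary(fit))["year", ]
slope_p   <- unname(slope_row["Pr(>|t|)"])

# 95% confidence interval for the slope; an interval that contains 0
# means the slope is not significant at the 5% level
slope_ci <- confint(fit, "year", level = 0.95)

slope_row
slope_ci
```

With a p-value slightly above 5%, the 95% interval for the slope includes zero, which is another way of saying that a horizontal line cannot be ruled out at that level.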
@freakonometrics @AndyHarless @mileskimball Disagree. U can create one, of course, but doesnt mean much. These data are not balls from urns.
— Roger Pielke Jr. (@RogerPielkeJr) January 1, 2017
See also
@freakonometrics @AndyHarless @mileskimball These data are not random.
— Roger Pielke Jr. (@RogerPielkeJr) January 1, 2017
First of all, let us get back to a more standard graph, with a scatterplot rather than bars:
ggplot(db, aes(year, ratio)) +
stat_smooth(method = "lm") +
geom_point()
Here, we observe points
Even if observations are not obtained from balls in an urn, there is some kind of randomness here. One might consider a nonlinear model to reduce the error,
ggplot(db, aes(year, ratio)) +
geom_point() +
geom_smooth()
but in that case, the danger is overfitting.
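One rough way to see the overfitting risk is to compare in-sample error with leave-one-out error as the model becomes more flexible. The sketch below swaps in polynomial regressions of increasing degree for the default smoother used in the plot, purely because leave-one-out residuals of a linear model have a closed form (residual divided by one minus the leverage). The degrees 1 to 5 and the names degrees and cmp are arbitrary choices for illustration.

```r
# Data as given earlier in the post
db <- data.frame(year = 1990:2016,
  ratio = c(.23,.27,.32,.37,.22,.26,.29,.15,.40,.28,.14,.09,.24,.18,
            .29,.51,.13,.17,.25,.13,.21,.29,.25,.2,.15,.12,.12))

degrees <- 1:5  # polynomial degrees to compare (arbitrary choice)

cmp <- t(sapply(degrees, function(d) {
  # Polynomial regression of degree d (orthogonal polynomial basis)
  fit <- lm(ratio ~ poly(year, d), data = db)
  res <- residuals(fit)
  h   <- hatvalues(fit)  # leverage of each observation
  c(degree        = d,
    in_sample_mse = mean(res^2),
    # Leave-one-out residuals via the exact shortcut for linear models
    loo_mse       = mean((res / (1 - h))^2))
}))

cmp
```

The in-sample error can only go down as the degree increases, since the models are nested. The leave-one-out error carries no such guarantee, and that gap is what one would look at before trusting a more flexible curve.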