Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Following my previous post, I wanted to spend more time, on the time series with “global weather-related disaster losses as a proportion of global GDP” over the time period 1990-2016 that Roger Pilke sent me last night.
db=data.frame(year=1990:2016,
ratio=c(.23,.27,.32,.37,.22,.26,.29,.15,.40,.28,.14,.09,.24,.18,.29,.51,.13,.17,.25,.13,.21,.29,.25,.2,.15,.12,.12))
In my previous post, I spend some time explaining that we should provide some sort of ‘confidence interval’ when we try to predict a pattern. That was what we call ‘model uncertainty’. But there are two (important) issues that I did not mention. (1) it is a time series, so why not use techniques dedicated to time series objects ? (2) we do not really care actually about ‘model uncertainty’ (unless we want to assess if a decreasing trend is significant, or not), and we care more about real prediction uncertainty: in the next ten years, what could be the range for the this ratio, with some given probability (say 95%)? Could we say that with 95% chance the global weather-related disaster losses as a proportion of global GDP should be (each year) within 0 and 0.35 or 0 and 0.7?
A first idea might be to use exponential smoothing techniques (without a seasonal component here).
ratio=ts(db$ratio,start=1990,frequency=1)
plot(ratio,xlim=c(1990,2030))
hw=HoltWinters(ratio,gamma=FALSE)
phw=predict(hw,n.ahead=15,prediction.interval = TRUE)
plot(hw,phw,xlim=c(1990,2030))
polygon(c(2017:2031,rev(2017:2031)), c(phw[,2],rev(phw[,3])),border=NA,col=rgb(0,0,1,.2))
The decreasing trend is coming from the fact that exponential smoothing is here a linear regression, with weight exponentially decaying with time (the older, the smaller the weight). But we cannot use that prediction, since the ratio cannot (obviously) be negative. So why not consider, here, the logarithm of the ratio
plot(db$year,db$ratio,type="l",xlim=c(1990,2030),ylim=c(-.2,.7),xlab="year",ylab="ratio")
hw=HoltWinters(log(ratio),gamma=FALSE)
phw=predict(hw,n.ahead=15,prediction.interval = TRUE)
abline(v=2016,lty=2,col="grey")
lines(2017:2031,exp(phw[,2]),col="blue")
lines(2017:2031,exp(phw[,3]),col="blue")
lines(c(1992:2016,2017:2031),c(exp(hw$fitted[,1]),exp(phw[,1])),col="red")
polygon(c(2017:2031,rev(2017:2031)),exp(c(phw[,2],rev(phw[,3]))),border=NA,col=rgb(0,0,1,.2))
abline(h=0,lty=2)
The confidence band is huge, here. What if we consider some ARIMA model here?
fit=auto.arima(ratio)
farma=forecast(fit,15)
farma=cbind(as.numeric(farma$fitted)[1:15],as.numeric(farma$lower[,1]),as.numeric(farma$upper[,1]),as.numeric(farma$lower[,2]),as.numeric(farma$upper[,2]))
plot(db$year,db$ratio,type="l",xlim=c(1990,2030),ylim=c(-.2,.7),xlab="year",ylab="ratio")
abline(v=2016,lty=2,col="grey")
lines(2017:2031,farma[,4],col="blue")
lines(2017:2031,farma[,5],col="blue")
lines(2017:2031,farma[,1],col="red")
polygon(c(2017:2031,rev(2017:2031)),c(farma[,4],rev(farma[,5])),border=NA,col=rgb(0,0,1,.2))
abline(h=0,lty=2)
Here, there is an intercept, but no dynamics for the time series (which is considered, here, as a pure white noise). We get exactly the same if we consider the average value of the series
fit=lm(ratio~1,data=db)
s=summary(fit)$sigma
plot(db$year,db$ratio,type="l",xlim=c(1990,2030),ylim=c(-.2,.7),xlab="year",ylab="ratio")
abline(v=2016,lty=2,col="grey")
ndb=data.frame(year=2017:2031)
pf=predict(fit,newdata=ndb)
farma=cbind(pf,pf-1.96*s,pf+1.96*s)
lines(2017:2031,farma[,2],col="blue")
lines(2017:2031,farma[,3],col="blue")
lines(1990:2031,c(predict(fit),farma[,1]),col="red")
polygon(c(2017:2031,rev(2017:2031)),c(farma[,2],rev(farma[,3])),border=NA,col=rgb(0,0,1,.2))
abline(h=0,lty=2)
Here, we get back to my previous post, if we want to consider a possible trend (and not only an intercept)
fit=lm(ratio~year,data=db)
s=summary(fit)$sigma
plot(db$year,db$ratio,type="l",xlim=c(1990,2030),ylim=c(-.2,.7),xlab="year",ylab="ratio")
abline(v=2016,lty=2,col="grey")
ndb=data.frame(year=2017:2031)
pf=predict(fit,newdata=ndb)
farma=cbind(pf,pf-1.96*s,pf+1.96*s)
lines(2017:2031,farma[,2],col="blue")
lines(2017:2031,farma[,3],col="blue")
lines(1990:2031,c(predict(fit),farma[,1]),col="red")
polygon(c(2017:2031,rev(2017:2031)),c(farma[,2],rev(farma[,3])),border=NA,col=rgb(0,0,1,.2))
abline(h=0,lty=2)
Again, the confidence region is not based on inference related error, but on model uncertainty: we try to visualize where future observations might be with (say) 95% chance. Note we can also consider (why not?) a quadratic regression
fit=lm(ratio~poly(year,2),data=db)
s=summary(fit)$sigma
plot(db$year,db$ratio,type="l",xlim=c(1990,2030),ylim=c(-.2,.7),xlab="year",ylab="ratio")
abline(v=2016,lty=2,col="grey")
ndb=data.frame(year=2017:2031)
pf=predict(fit,newdata=ndb)
farma=cbind(pf,pf-1.96*s,pf+1.96*s)
lines(2017:2031,farma[,2],col="blue")
lines(2017:2031,farma[,3],col="blue")
lines(1990:2031,c(predict(fit),farma[,1]),col="red")
polygon(c(2017:2031,rev(2017:2031)),c(farma[,2],rev(farma[,3])),border=NA,col=rgb(0,0,1,.2))
abline(h=0,lty=2)
I am usually not a huge fan of those polynomial regression, but recently, I’ve seen that a lot in economic papers (like “if it’s not linear, add a squared version of the explanatory variable”, which is a rather odd strategy, I’ll publish some posts on that issue this year).
Here again, it might be more clever to consider a logarithmic transformation of the ratio, to insure that the ratio remains positive
fit=lm(log(ratio)~year,data=db)
s=summary(fit)$sigma
plot(db$year,db$ratio,type="l",xlim=c(1990,2030),ylim=c(-.2,.7),xlab="year",ylab="ratio")
abline(v=2016,lty=2,col="grey")
ndb=data.frame(year=2017:2031)
pf=predict(fit,newdata=ndb)
farma=cbind(exp(pf+s^2/2),exp(pf-1.96*s),exp(pf+1.96*s))
lines(2017:2031,farma[,2],col="blue")
lines(2017:2031,farma[,3],col="blue")
lines(1990:2031,c(exp(predict(fit)+s^2/2),farma[,1]),col="red")
polygon(c(2017:2031,rev(2017:2031)),c(farma[,2],rev(farma[,3])),border=NA,col=rgb(0,0,1,.2))
abline(h=0,lty=2)
Observe that future trend is mainly driven by the three latest observations, that were rather low (compared with older observations). What if we remove them?
dbna=db
db$ratio[25:27]=NA
fit=lm(ratio~1,data=dbna)
s=summary(fit)$sigma
plot(db$year,db$ratio,type="l",xlim=c(1990,2030),ylim=c(-.2,.7),xlab="year",ylab="ratio")
abline(v=2016-3,lty=2,col="grey")
ndb=data.frame(year=2014:2031)
pf=predict(fit,newdata=ndb)
farma=cbind(pf,pf-1.96*s,pf+1.96*s)
lines(2014:2031,farma[,2],col="blue")
lines(2014:2031,farma[,3],col="blue")
lines(1990:2031,c(predict(fit)[1:24],farma[,1]),col="red")
polygon(c(2014:2031,rev(2014:2031)),c(farma[,2],rev(farma[,3])),border=NA,col=rgb(0,0,1,.2))
abline(h=0,lty=2)
More funny, if we consider a quadratic regression, we obtain an increasing trend for the future
fit=lm(ratio~poly(year,2),data=dbna)
s=summary(fit)$sigma
plot(db$year,db$ratio,type="l",xlim=c(1990,2030),ylim=c(-.2,.7),xlab="year",ylab="ratio")
abline(v=2016-3,lty=2,col="grey")
ndb=data.frame(year=2014:2031)
pf=predict(fit,newdata=ndb)
farma=cbind(pf,pf-1.96*s,pf+1.96*s)
lines(2014:2031,farma[,2],col="blue")
lines(2014:2031,farma[,3],col="blue")
lines(1990:2031,c(predict(fit)[1:24],farma[,1]),col="red")
polygon(c(2014:2031,rev(2014:2031)),c(farma[,2],rev(farma[,3])),border=NA,col=rgb(0,0,1,.2))
abline(h=0,lty=2)
As we can see, it is rather difficult to get relevant prediction for the future, based on 25 observations…. If anyone has a suggestion, comments are open…
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.