Forecast, Automatic Routines vs. Experience
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This morning, in our Time Series course, we’ve been playing with some data I got from google.ca/trends/. Actually, we’ve been playing on some old version, downloaded 18 months ago (discussed in a previous post, in French).
> urls = "http://freakonometrics.free.fr/report-headphones-2015.csv" > report=read.table( + urls,skip=4,header=TRUE,sep=",",nrows=585) > tail(report) Semaine headphones 580 2015-02-08 - 2015-02-14 53 581 2015-02-15 - 2015-02-21 52 582 2015-02-22 - 2015-02-28 51 583 2015-03-01 - 2015-03-07 50 584 2015-03-08 - 2015-03-14 49 585 2015-03-15 - 2015-03-21 49
If we plot that weekly time series, we have
> plot(report[,2],type="l")
Working with weekly series is more complicated (at least to find a simple model, with only a few lags), so let us convert that series into a monthly one,
> source( + "http://freakonometrics.blog.free.fr/public/code/H2M.R") > headphones=H2M(report,lang="FR",type="ts") > plot(headphones)
Here, we did three things,
- we considered the logarithm of the series, because of the increasing variability we have on recent years,
- we removed (too) old observations
- we kept recent observations to test our models.
> X=log(headphones) > T=as.numeric(time(X)) > idx=which((T>2006)&(T<2014.3)) > T=T[idx] > X=as.numeric(X)[idx] > plot(T,X,type="l") > df=data.frame(X,T) > reg=lm(X~T,data=df) > abline(reg,col="red") > Y=residuals(reg)
The residuals obtained after fitting that linear model will be our (supposed to be) stationnary time series. Let us consider various models.
- model 1
We obtained our first model be playing with the orders, this morning. Here, the model is slightly more complex,
> fit1=arima(Y, + order = c(1, 0, 10), + seasonal = list(order = c(1, 1, 0), + period = 12)) > fit1 Call: arima(x = Y, order = c(1, 0, 10), seasonal = list(order = c(1, 1, 0), period = 12)) Coefficients: ar1 ma1 ma2 ma3 ma4 0.930 -0.033 0.078 -0.101 0.365 s.e. 0.064 0.102 0.132 0.119 0.152 ma5 ma6 ma7 ma8 ma9 ma10 -0.217 0.216 -0.137 0.134 0.048 0.465 0.154 0.187 0.166 0.173 0.153 0.154 sar1 -0.578 0.118 sigma^2 estimated as 0.00114: log likelihood = 166, aic = -305
If we look at the residuals, we have some white noise
We have here our first model.
- Model 2
For the second one, I was much more lazy
> library(forecast) > auto.arima(ts(Y,start=2006, + frequency=12),max.p=12,max.q=12) Series: ts(Y, start = 2006, frequency = 12) ARIMA(0,1,0)(0,1,1)[12] Coefficients: sma1 -0.950 s.e. 0.672 sigma^2 estimated as 0.00124: log likelihood=155 AIC=-306 AICc=-306 BIC=-301 > fit2=arima(Y, + order = c(0, 1, 0), + seasonal = list(order = c(0, 1, 1), + period = 12)) > plot(acf(residuals(fit2),lag=36),lwd=3)
Just to confirm that we actually have a white noise, here, use
> BP = function(h) Box.test(residuals(fit2),lag=h, type='Box-Pierce')$p.value > plot(1:24,Vectorize(BP)(1:24),type='b', + ylim=c(0,1),col="red") > abline(h=.05,lty=2,col="blue")
One more time, we have a white noise. So this model here is valid.
- Model 3
> fit3=arima(Y, + order = c(12, 0, 0), + seasonal = list(order = c(0, 1, 1), + period = 12)) > fit3 Call: arima(x = Y, order = c(12, 0, 0), seasonal = list(order = c(0, 1, 1), period = 12)) Coefficients: ar1 ar2 ar3 ar4 ar5 ar6 0.918 -0.044 0.014 0.253 -0.366 0.347 s.e. 0.112 0.148 0.145 0.135 0.142 0.151 ar7 ar8 ar9 ar10 ar11 -0.205 -0.030 0.395 -0.208 0.116 s.e. 0.152 0.139 0.134 0.162 0.169 ar12 sma1 -0.249 -0.767 s.e. 0.110 0.164
Again, we go a model similar, this morning, simply by playing with the orders. And one more time, we get a white noise,
- Model 4
For the last model, we can consider some exponential smoothing model
> HW = HoltWinters(log(headphones))
(again, on the logarithm of the series).
- Forecasting
For the first three models, use
> Y1=predict(fit1,n.ahead=36) > Y2=predict(fit2,n.ahead=36) > Y3=predict(fit3,n.ahead=36)
To visualise the three forecast, use, e.g.
> draw=function(fit=Y1){ + straight_line=predict(reg, + newdata=data.frame(T=Tfutur)) + mth=straight_line+fit$pred + sth2=fit$se^2 + Xmy=exp(mth+sth2/2) + qinf=qlnorm(.05,mth,sdlog=sqrt(sth2)) + qsup=qlnorm(.95,mth,sqrt(sth2)) + + plot(headphones,type="l",xlim=c(2006,2016), + ylim=c(25,105)) + polygon(c(Tfutur,rev(Tfutur)), + c(qinf,rev(qsup)), + col="yellow",border=NA) + lines(Tfutur,Xmy,col="red") + return(Xmy) + }
Here, we have
> draw(Y1) Time Series: Start = 100 End = 135 Frequency = 1 [1] 45.6 46.3 47.6 52.1 52.7 52.9 68.3 [8] 88.8 58.5 51.2 49.5 47.4 47.8 49.4 [15] 51.6 54.6 54.4 55.2 73.1 95.9 64.2 [22] 56.8 54.5 52.3 53.4 54.8 56.9 61.5 [29] 62.0 62.6 81.8 106.9 71.0 62.6 60.4 [35] 57.9
> draw(Y2)
> draw(Y3)
And finally, for the fourth model we can use
> Y4=predict(HW,n.ahead=36, + prediction.interval = TRUE) > sth2=((Y4[,"upr"]-Y4[,"fit"])/1.96)^2 > mth=Y4[,"fit"] > Xmy=exp(mth+sth2/2) > XHW=Xmy > qinf=qlnorm(.05,mth,sdlog=sqrt(sth2)) > qsup=qlnorm(.95,mth,sqrt(sth2)) > plot(headphones,type="l",xlim=c(2006,2016), + ylim=c(25,105)) > polygon(c(Tfutur,rev(Tfutur)), + c(qinf,rev(qsup)), + col="yellow",border=NA) > lines(Tfutur,Xmy,col="red")
- Model comparison
Here, we can compare the forecast with actual values
> X=headphones > T=as.numeric(time(X)) > idx=which((T>=2014.3)) > T=T[idx] > X=as.numeric(X)[idx] > D=data.frame(T=T,X=X,X1=draw(Y1)[1:length(T)],X2=draw(Y2)[1:length(T)],X3=draw(Y3)[1:length(T)],X4=XHW[2:(1+length(T))]) > D T X X1 X2 X3 X4 1 2014 44.8 45.6 43.7 44.3 49.0 2 2014 45.2 46.3 44.3 44.9 49.6 3 2014 46.8 47.6 45.5 45.6 51.2 4 2015 49.3 52.1 47.7 48.8 53.8 5 2015 49.8 52.7 47.9 49.7 53.6 6 2015 50.9 52.9 49.0 50.8 54.4 7 2015 69.8 68.3 62.3 64.9 72.3 8 2015 91.2 88.8 82.5 85.4 92.4 9 2015 57.5 58.5 58.0 59.4 59.8 10 2015 52.5 51.2 51.9 53.6 54.9 11 2015 49.3 49.5 49.3 51.4 52.6
We can plot the errors,
> barplot(D$X1-D$X,ylim=c(-7,5))
> barplot(D$X2-D$X,ylim=c(-7,5))
x
> barplot(D$X3-D$X,ylim=c(-7,5))
> barplot(D$X4-D$X,ylim=c(-7,5))
To compare the four model, compute the sum of the squares of those errors,
> L2_dist=function(var){ + return(sum((D[,"X"]-D[,var])^2)) + } > sapply(paste("X",1:4,sep=""),L2_dist) X1 X2 X3 X4 33.2 145.3 68.0 133.9
Model 2 and Model 4 were obtained using automatic routines. One on ARIMA models, the other one on the class of exponential smoothing models. The interesting point is thatthe we’ve been able to obtain two better models, just by playing a litle bit. From an initial guess, we look at the residuals, try a another one, and again another one, until the residuals series is a white noise… Somehow I like this idea, that I can still beat automatic routines, just using experience I have, now, on time series.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.