[This article was first published on Fear and Loathing in Data Science, and kindly contributed to R-bloggers.]
Last time we downloaded data from quandl.com: privately-owned homes completed per month, in thousands of units (not seasonally adjusted). Now let's take a look at some basic R functions for examining a time series, along with my first exploration of what I feel is an intriguing package called "changepoint". Please note that this is not a forecasting effort with much bloodshed over determining the proper ARIMA or Holt-Winters model, etc. That I shall leave for another day.
After loading and attaching the data, let's begin.
> head(data)
Date Value
1 1/1/1968 1257
2 2/1/1968 1174
3 3/1/1968 1323
4 4/1/1968 1328
5 5/1/1968 1367
6 6/1/1968 118
This goes all the way back to 1968, and the last observation is June of 2013. I want to trim this down. Let's look at the golden era of home construction, January 2000 through December 2006. Now, we need to turn Value into a time series so we can subset by date. Otherwise, we would need to identify which observation equals our date and go from there.
> value.ts = ts(Value, frequency=12, start=c(2000,1), end=c(2006,12)) #create subset by date and make it a time series
> plot(value.ts)
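A quick aside on that ts() call: if Value still holds the full 1968-2013 history, supplying start and end to ts() only truncates and relabels the first 84 observations rather than pulling out the year-2000 block by date. A minimal sketch of an alternative, assuming the series begins in January 1968, is to build the complete series and then subset it with window():
> full.ts = ts(Value, frequency=12, start=c(1968,1)) #full monthly series beginning January 1968
> value.ts = window(full.ts, start=c(2000,1), end=c(2006,12)) #keep January 2000 through December 2006
> plot(value.ts)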
Much to my surprise, new home construction had already topped out in late 2005. I had no idea. Side note: I still regret not taking advantage of this bubble. The sure sign of the top was when a good friend of mine sent me an email wanting to know if I was interested in becoming an investor in oceanfront property in North Carolina, circa 2007. If only R could create a time machine, eh?
The changepoint package seems to be a simple way to execute a rather complicated process: the identification of shifts in mean and/or variance in a time series. This is a lengthy subject to cover in depth, so consider this a mere introduction. First of all, why would we want to detect changes in the mean and variance of a time series? Well, my first exposure to changepoint analysis was during Six Sigma training on control charts. The CUSUM control chart allows one to identify when a process has undergone a transformation leading to a significant shift in the mean. Now, someone with a well-calibrated eyeball and knowledge of the underlying process can easily point out a shift. Take the time series plot above, for instance, where it is rather obvious when something has changed dramatically. However, what I'm interested in via this package is the visual capability and the examination of changes in variance. One of the quaint sayings we had in Lean Six Sigma was, "customers don't feel the mean, they feel the variation". I believe that if you ask a statistician who is drowning in a river that is, on average, 3 feet deep, they would agree.
There are a number of changepoint statistical methods, and I will show results only for Pruned Exact Linear Time (PELT) and Binary Segmentation (BinSeg). I will not get into the algorithms behind them, nor shall I list the pros and cons of each, as that is your assignment…should you choose to accept it. At any rate, I still need to get comfortable with the various methods, so who am I to lecture anyone on this topic?
Here is the synopsis (in R) of looking for mean and variance changepoints:
> library(changepoint)
> mvalue = cpt.mean(value.ts, method="PELT") #mean changepoints using PELT
> cpts(mvalue)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 24 25 26
[26] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 42 43 44 45 46 47 48 49 50 51 52
[51] 53 54 55 56 57 58 59 60 61 62 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
[76] 79 80 81 82 83
The command "cpts" lists the changepoints. Every observation is flagged as a change, so this is a complete waste, I believe.
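My hunch, though I have not chased it down here, is that the default penalty is simply too small for data on this scale. cpt.mean accepts a manual penalty, so a hedged sketch of stiffening it would look like the following, where the pen.value of 1000 is purely illustrative:
> mvalue2 = cpt.mean(value.ts, method="PELT", penalty="Manual", pen.value=1000) #larger manual penalty; 1000 is illustrative only
> cpts(mvalue2)
In any case, time to try BinSeg.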
> mvalue = cpt.mean(value.ts, method="BinSeg")
> cpts(mvalue)
[1] 13 39 48 74 78
> plot(mvalue)
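As a side note, the fitted mean within each segment is also available through the package's param.est() function; a quick sketch:
> param.est(mvalue)$mean #estimated mean of each segment identified by BinSeg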
Again, a calibrated eyeball could perform this task, but can we learn anything from variance?
> vvalue = cpt.var(diff(value.ts), method="PELT")
> cpts(vvalue)
[1] 16 25 32 38 66 69 79
> plot(vvalue)
> par(mfrow=c(2,1))
> plot(value.ts)
> plot(vvalue)
What have we here? This is not seasonally adjusted and it looks like we are picking up some of that seasonality.
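If that seasonality is the nuisance, one rough-and-ready option, and this is only a sketch I have not vetted against this series, is to take a 12-month (seasonal) difference before hunting for variance changes:
> svalue = cpt.var(diff(value.ts, lag=12), method="PELT") #year-over-year changes strip most of the seasonal swing
> cpts(svalue)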
After several iterations using the BinSeg method, to no useful outcome, I went with the method "SegNeigh" (segment neighborhoods).
> vnvalue = cpt.var(diff(value.ts), method="SegNeigh", Q=6) #Q=6 caps the maximum number of changepoints at 6
> cpts(vnvalue)
[1] 16 25 66 69 79
We have only 5 changepoints according to this method.
> plot(vvalue)
> plot(vnvalue)
Nearly the same outcome. At the end of the day and given this dataset, I’m not sure changepoint analysis has added any value. However, I do like the ability to produce graphics, which may allow an analyst to tell a compelling story. I also believe it is worth further examination on other time series.
One more thing…
Base R (the stats package) lets you decompose a time series into trend, seasonal, and random components with "decompose".
> d = decompose(value.ts)
> plot(d)
This plot clearly shows the summer peaks and winter troughs in new homes.
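The individual pieces are available from the decomposed object as well, so a rough seasonally adjusted series is one subtraction away (this assumes the default additive decomposition):
> adjusted = value.ts - d$seasonal #remove the estimated seasonal component (additive model)
> plot(adjusted)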
More to follow…