Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In time series work you often run into difficulty modeling processes where the overall level of one variable (an input, for example) changes over time but the level of another variable (an output) does not. For instance, if you want to use primary school enrollment as a factor in economic growth in a cross-country regression over time, you have to deal with the fact that while average growth is basically constant (as we saw in the numerous posts here), primary school enrollment increases until it reaches universal levels. If you propose some linear relationship between primary school enrollment and growth, you will find that your model fits one time period but not another. This, in a very basic sense, is the stationarity problem, more commonly called the unit root problem. The phrase “unit root” is probably our first indicator that elementary introductions to this issue will not be simple. Indeed they are not. Pick up almost any introductory graduate or undergraduate time series textbook and the problem of stationarity will be introduced via an exercise to find the characteristic equation of the underlying data-generating process. But the intuition behind unit roots, and their impact, is much simpler.
A time series is stationary if the mean of the series over some reasonable range does not change when different endpoints for that range are chosen. So if we have a time series with 100 periods and we sample periods 1-20, 30-50, and 70-100, the sample means should all be roughly the same. (Formal definitions also require the variance and autocovariances to be stable over time, but the mean is the easiest thing to check by eye.) I am describing an eyeball test, not a formal test. Please don’t put this on an econometrics midterm and hope to get anything better than partial credit. But if we can graph the series or repeatedly resample it, we can give ourselves some confidence that the underlying series is stationary. What does this look like in practice?
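The eyeball check described above can be sketched in a few lines of R. This is a minimal illustration, not the code behind the graphs in this post: the AR(1) coefficient of 0.5, the seed, and the names stationary_x, random_walk, and window_means are all assumptions chosen for the example.

```r
# Minimal sketch of the "compare sample means over different ranges" check.
# The series, seed, and object names are illustrative assumptions.
set.seed(123)
n <- 100

# A stationary AR(1) process and a random walk (which has a unit root)
stationary_x <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))
random_walk  <- cumsum(rnorm(n))

# The three ranges mentioned in the text
windows <- list("1-20" = 1:20, "30-50" = 30:50, "70-100" = 70:100)

# Sample mean of a series over each range
window_means <- function(x) sapply(windows, function(w) mean(x[w]))

means_stationary <- window_means(stationary_x)
means_walk       <- window_means(random_walk)

# Side-by-side comparison; for a stationary series the three means
# should be roughly equal, for a random walk they typically are not
print(round(rbind(stationary = means_stationary,
                  random_walk = means_walk), 2))
```

With only 100 observations the means of the stationary series will not match exactly, which is one reason this remains an informal check.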
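[Figure: simulated stationary time series, with blue lines marking the ranges over which the sample mean was computed and the ADF test result shown]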
The above is a simulated time series chosen so that it would be stationary. Each of the blue lines represents a range over which we computed the mean of the series. Because the series is ergodic, each of the sample means will converge on the ensemble mean as the sample size grows. Ergodicity, however, is a separate property and is not strictly related to stationarity. A non-stationary process will look quite different.
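[Figure: simulated non-stationary time series, with sample means computed over the same ranges and the ADF test result shown]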
Each sample mean (I chose the same ranges for each series) is different from the others. For both of these series we can perform a quick test for stationarity known as the Augmented Dickey-Fuller (ADF) test, the results of which are shown in both graphs. The null hypothesis of the test is that the series has a unit root, i.e. that it is non-stationary. We can reject that hypothesis for the first series with some confidence and cannot reject it for the second. R provides the ADF test as adf.test() in the tseries package. You can specify the number of lags in the test itself (the k argument) if you know it; in these examples, since I created the series, I know the lags. If you don’t know the lags, the function falls back on a default lag order computed from the length of the series.
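As a rough sketch of how the test might be called, assuming the tseries package is installed: the simulated series, the seed, and the choice of k = 1 below are illustrative assumptions, not the settings used for the graphs above.

```r
# Minimal sketch of running the ADF test with tseries::adf.test().
# Assumes tseries is installed: install.packages("tseries")
library(tseries)

set.seed(42)
n <- 200

# Illustrative series: a stationary AR(1) and a random walk
stationary_x <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))
random_walk  <- cumsum(rnorm(n))

# k is the lag order; k = 1 is an assumption made for this example.
# Leaving k out makes adf.test() use its default, trunc((n - 1)^(1/3)).
adf_stationary <- adf.test(stationary_x, alternative = "stationary", k = 1)
adf_walk       <- adf.test(random_walk,  alternative = "stationary", k = 1)

# A small p-value is evidence against a unit root
print(adf_stationary)
print(adf_walk)
```

adf.test() may warn that the true p-value lies outside the range of its tabulated values; that is a note about the printed p-value, not a failure of the test.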
So what do we do about estimating non-stationary processes? The simplest solution is to apply some function to the process in order to make it stationary. If we are attempting to estimate GDP levels in a regression and cannot do so because GDP is non-stationary over time, then we can difference the series (strictly, its logarithm) and get, for all intents and purposes, GDP growth. GDP growth should be stationary (we can of course test this). For other variables differencing will not work as easily. Take our primary school enrollment example. We can imagine that over a range from 10% to 90% primary school enrollment, differencing the series will give us a good stationary process with some intuitive application to growth in an economy. But that intuitive relationship falls apart when we look at countries that, at a given point in time, sit at a bound for enrollment. The differenced series of enrollment will converge to 0 as actual enrollment reaches 100%, and changes in enrollment may start to have a progressively smaller impact on GDP growth. The reverse may be true for enrollment around 0% (or it may not). Either way, the stationary process created by differencing is the result of a bounded process, and merely differencing doesn’t eliminate the effects of those bounds.
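Both halves of that argument can be illustrated with diff() in base R. The sketch below uses made-up numbers throughout: the 2% average growth rate, the logistic enrollment curve, and all object names are assumptions for illustration, not data.

```r
# Minimal sketch of differencing; every number here is hypothetical.
set.seed(1)
n <- 100

# Log GDP as a random walk with drift: non-stationary in levels
shocks  <- rnorm(n, mean = 0.02, sd = 0.01)  # assumed period-to-period growth
log_gdp <- cumsum(shocks)

# First difference of log GDP is (approximately) the growth rate;
# diff() returns n - 1 values and recovers the underlying shocks
growth <- diff(log_gdp)

# A bounded series: enrollment following an assumed logistic path to 100%
period     <- 1:n
enrollment <- 100 / (1 + exp(-0.1 * (period - 50)))
d_enroll   <- diff(enrollment)

# Changes in enrollment are largest mid-range and shrink toward 0
# as enrollment approaches its upper bound
print(round(c(largest_change = max(d_enroll),
              final_change   = d_enroll[n - 1]), 3))
```

The differenced enrollment series is not a well-behaved stationary process here: its level depends on how close enrollment is to the bound, which is the problem described above.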
Code for the graphs is below:
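The listing below is a minimal sketch, using base R graphics, of how graphs like the two above might be produced. It is an approximation rather than the original code: the helper plot_with_window_means, the simulated series, and the seed are all hypothetical.

```r
# Hypothetical helper (not the original code for this post): plot a series
# and overlay the sample mean of each range as a horizontal blue segment.
plot_with_window_means <- function(x, windows, main) {
  plot(x, type = "l", main = main, xlab = "Period", ylab = "Value")
  means <- sapply(windows, function(w) mean(x[w]))
  for (i in seq_along(windows)) {
    w <- windows[[i]]
    segments(x0 = min(w), y0 = means[i], x1 = max(w), y1 = means[i],
             col = "blue", lwd = 2)
  }
  invisible(means)  # return the means so they can be inspected
}

set.seed(123)
n <- 100
windows <- list(1:20, 30:50, 70:100)

# Illustrative series: a stationary AR(1) and a random walk
stationary_x <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))
random_walk  <- cumsum(rnorm(n))

means_stationary <- plot_with_window_means(stationary_x, windows,
                                           "Stationary AR(1)")
means_walk       <- plot_with_window_means(random_walk, windows,
                                           "Random walk")
```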