Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Last week I wrote about time-series cross-validation, and mentioned that my original motivation was forecasting unemployment. Actually, I’m interested in nowcasting unemployment rates – that is, estimating a value for a current or recently past period, before the official statistic becomes available. In particular, I’ve been wondering about what leading indicators might be available to help with this.
“Leading” in this case will have to mean pretty fast, because the official unemployment stats in Australia come out from the Australian Bureau of Statistics (ABS) with admirable promptitude given the complexity of managing the Labour Force Survey. ABS Series 6202.0 – the monthly summary from the Labour Force Survey – comes out around two weeks after the reference month. Only a few economic variables of interest are available faster than that. In this blog post I look at two candidates for leading information that are readily available in more or less real time – interest rates and stock exchange prices.
One big change in the past decade in this sort of short-term forecasting of unemployment has been to model the transitions between participation, employed and unemployed people, rather than direct modelling of the resulting proportions. This innovation comes from an interesting 2012 paper by Barnichon and Nekarda. I’ve only skimmed this paper, but I’d like to look into how much of the gains they report comes from the focus on workforce transitions, and how much from their inclusion of new information in the form of vacancy postings and claims for unemployment insurance. My suspicion is that these latter two series have powerful new information. I will certainly be returning to vacancy information and job adverts at a later time – these are items which feature prominently for me in my day job at Nous Group in analysing the labour market.
Unemployment stats
Here’s the Australian unemployment rate that we’re trying to forecast. I’ve added the party of federal government as context:
The data go back to 1978 and are published as their “original” survey estimates, seasonally adjusted and trend. “Trend” is nearly visually identical to “seasonally adjusted” so I’ve not included it in the chart above.
If I’m to be forecasting this I’ll want to forecast both the original and the seasonally adjusted values, so it’s worth checking out if I can replicate (or get close to) the ABS’ seasonal adjustment. Here’s a comparison of a seasonally adjusted series out of the box using X13-SEATS-ARIMA and the default settings provided by Christoph Sax’s excellent seasonal
package (which includes testing for and including-if-helpful trading days and Easter seasonal adjustments).
There are 13 observations where we are 0.1 of a percentage point or more out, mostly featuring a difference of opinion between my model and the ABS’ on the seasonality of the winter months. This would be worth further exploring in a future post; perhaps the ABS’ parameters are published somewhere, and/or it would be interesting to try to reverse-engineer them. But it’s close enough for today’s purposes.
Here’s the R code so far. Getting hold of and tidying ABS time series data is a cinch now thanks to Matt Cowgill et al’s readabs
package. BTW, I’m behind the times on how I manage time series data – I know that some of what I’m doing below can be done a bit easier with the help of tsibble and related R packages. But I haven’t got around to learning how yet. Which is a great example of why continuous learning is an essential part of modern statistics and data science.
Post continues after code excerpt
Stock prices and interest rates
The Australian Securities Exchange publishes an index based on the price of the “ASX 200” list of companies. Daily data are available from the likes of Yahoo Finance (easily accessed from the R quantmod
package), but the longest series I could find was from the ASX themselves. Unfortunately, even this series only goes back to the early 2000s. There must be more historical data out there somewhere, so please feel free to make a comment on this post about where to find it.
I grabbed the historical interest rates series form the Reserve Bank of Australia (RBA). This series goes all the way back to the 1970s.
Here’s what all the time series I’ve got so far look like – unemployment (both original and seasonally adjusted), interest rates and stock prices:
Relationships
So what’s the relationship between these different series when we put them into scatter plots, with one axis for each series? First, let’s look at them on their original scales. We see there’s strongly negative relationships between both interest rates and stock prices on unemployment:
Of course, this doesn’t mean that interest rates and stock prices going up leads to a fall in unemployment. Rather, there are some complicated things going on here:
- When unemployment is low or expected to be low, the authorities become more worried about inflation and tend to increase interest rates to control that.
- When the economy is doing well, unemployment goes low and stock prices tend to grow.
- When investors think that interest rates are likely to go up, they pay less for stocks and stock prices go down (and vice versa). So a decrease in unemployment can lead to “pessimism” on the part of stock market players, fearing a rise in interest rates despite (or because of) the good news for the economy that the decrease in unemployment is. And vice versa – an anticipated increase in unemployment leads to expectations of interest rate cuts, which can lead to a surge in stock prices.
- Complicating these relationships, stock prices are definitely a non-stationary time series (ie despite their ups and downs, they have definitely grown on average since the 2000s) – and interest rates and unemployment rates are probably non-stationary (that is, their average values change over time in historic level shifts, not just oscillating around a constant mean). Comparing non-stationary time-series leads to spurious correlations.
We can control to a degree for the non-stationarity problem by looking at growth rates or first differences rather than the original series, and the following set of scatter plots does just that:
Now we’re getting somewhere. The apparent relationship between the stock prices and unemployment was probably a spurious artefact of the secular change they were both undergoing. On the other hand, there is definitely a negative relationship between change in interest rates and change in unemployment rates. What we’re seeing here is the dominance of one way of the causality direction – as unemployment goes up, interest rates go down in response. This trumps the presumed causal effect in the opposite direction (as interest rates go down, it boosts the economy, eventually leading to a decrease in unemployment), which is probably only observeable, if at all, at quite some lag.
In fact, the situation was sufficiently complex that I had a go at drawing a direct acyclic graph (DAG), as is now all the rage in thinking about causality. Here’s my greatly simplified description of what is happening here:
This provides a framework for thinking about what we’re observing. Basically, the connection from unemployment to interest rates via the RBA is realtively strong and quick, whereas the causality in the other direction is complex, delayed and difficult to observe. And the price of stocks is impacted on by so many things in such weird ways (I haven’t tried to show all the “animal spirits” and other influences, rational and otherwise) that it isn’t that surprising that it’s little or no use as a predictor of unemployment.
Here’s R code to date, probably of particular interest for people either in interested in grabbing this historical data from the ASX and the RBA, or in drawing DAGs (for which I’m using the R package “ggdag: Analyze and Create Elegant Directed Acyclic Graphs” by Malcolm Barrett).
Post continues after code excerpt
Vector auto regression as an exploratory modelling stage
The data here seem appropriate for exploring with a vector auto-regression model (VAR). A VAR models the changes in multiple time series together and is a way to understand the impacts of each (and their lagged versions) on themselves and eachother.
One of the most useful outputs from fitting a VAR is the plot of the Impulse Response Function, which lets you visualise the ongoing impact on all series of a shock to one series. In our case, here’s what happens if there is an upwards shock to interest rates:
The biggest impact is on interest rates, they go high, increase further for a bit, and basically stay high – the shock leads to a level shift upwards in the interest rates. But the model also implies that this level shift up in interest rates leads to a level shift downwards in unemployment. Clearly, interpreting this causally would be a mistake, for the reasons discussed earlier.
Also, it’s interesting to note that the interest rate shift upwards has little evidence of any impact on stock prices (the red lines are 95% confidence intervals, and basically include zero, suggesting no clear impact one way or another).
For completeness, here are the charts showing the impact of shocks to the other two series.
None of these are as polished as I usually like my graphics, but I’m treating these as outputs from exploratory modelling rather than core parts of my results dissemination. Actually, I have no intent of using a VAR for my actual nowcasting, as the whole point of using leading indicators is that I don’t need to forecast all three series together. But a VAR is a useful thing to fit to add a bit more discipline in thinking about multi-directional Granger causality to what has so far been just a bunch of scatter plots.
Here’s the code that fits those VARs, and also which puts together the data in a form for what I’ll be actually using in a jiffy. I’m mostly using the first-differenced stock price series, and original unemployment and interest rates series. My rationale for doing this is the instinct that prices are fundamentally a different, more absolute measure than are either of the other two, which are expressed as rates or proportions. Definitely if I were doing this for real, this would be one of my more substantive decisions that would need some serious thinking, and it’s one we’ll come back to.
Post continues after code excerpt
Finally, creating some viable forecasting models
I fit four potential ARIMA (auto-regressive integrated moving average) models, in each case specifying level one of integration ie the original time series is to be first-differenced. I used the original observations, not seasonally adjusted – my idea is that if I ever do this for real, I will nowcast the actual observed unemployment, and turn it into a seasonally adjusted value myself, so I have both numbers to present.
My four models differ in the explanatory “x” regressor variables they include:
- none
- just interest rates
- interest rates and stock prices
- interest rates, stock prices, and stock prices at lag 1
Here’s the performance of each of these models on the basis of Akaike’s Information Criterion:
df | AIC | |
---|---|---|
mod_aa0 | 6 | -109.8404 |
mod_aa1 | 8 | -128.5164 |
mod_aa2 | 9 | -126.6400 |
mod_aa3 | 10 | -124.8471 |
So historically, the best performing model for predicting unemployment (the lowest value of AIC) is the one with interest rates in it as an explanatory variable, but no stock prices.
Here’s rough confidence intervals for the values of all the parameters in that model. We’re most interest in the xreg
estimate, which is the direction and strenght of the impact of interest rates in forecasting uinemployment. Remember, this doesn’t imply causality – just that it’s useful for forecasting purposes.
2.5 % | 97.5 % | |
---|---|---|
ar1 | -0.05 | 0.64 |
ar2 | -0.26 | 0.14 |
ma1 | -0.96 | -0.31 |
sar1 | -2.65 | 1.70 |
sma1 | -2.43 | 1.96 |
sma2 | -1.84 | 1.26 |
xreg | -0.38 | -0.21 |
The negative value means that when interest rates are high, it’s a sign that the change in unemployment rates is about to decrease. How are we to interpret this? Recall this simple chart:
We see that at times of substantial economic change, there is a very strong negative link between interest rates and unemployment. This is particularly apparent in the early 1990s and in the period around the 2008 global economic crises up to 2010. The 1990s are excluded from my modelling data in order to have four comparable models, because I don’t have stock prices to include; so I suspect it’s the 2007 to 2009 period that is being picked up in the model.
I am profoundly skeptical of whether this indicates that we really have found a useful forecasting explanatory regressor variable. Hence I wanted to check the AIC quality of fit figures with time series cross-validation of the one-period-out forecasts. AIC is actually asymptotically equivalent to one-period-out cross-validation, but how much data do we need for this property to kick in? Funny things are the rule rather than exception with time series, and spurious correlation lurks everywhere. I’d feel more comfortable if brute force showed me that I would have been better off to include interest rates than not, in a thought experiment where I was doing monthly nowcasts of unemployment rates since the 2000s. Hence my post last week on time-series cross-validation, to be sure with simulated data that I was doing it right.
The reason I wanted to be sure (and my digression last week into simulated data) was that in this case, the cross-validation comes up with different results from the AIC. Here’s the accuracy summary of the one-period-out cross-validation results for comparing models mod_aa0
(first differenced unemployment rate, no explanatory variables) and mod_aa1
(with interest rates as an explanatory variable). The no-explanatory-variables model wins hands-down:
model | ME | RMSE | MAE | MPE | MAPE | ACF1 | Theil’s U |
---|---|---|---|---|---|---|---|
With interest rates | 0.0285642 | 0.4014554 | 0.1882691 | 0.6239319 | 3.491464 | 0.0617393 | 1.2358698 |
Univariate | 0.0061069 | 0.2200771 | 0.1631800 | 0.2062656 | 3.032230 | -0.0147829 | 0.6987458 |
The univariate mode has lower bias (mean error) as well as lower root mean square error, mean absolute error, mean percentage error, mean absolute percentage error, autocorrelation of the residuals, and Theil’s U. The apparent superiority of the interest rates model from comparing AICs turns out to be a mistake.
So after all that, what have we found? If I want to nowcast unemployment rates, I am best off (on the potential leading indicators used so far), ignoring other variables and simply fitting a univariate time series model to the unemployment rate. This actually doesn’t surprise me, but it was worth looking around.
I still have hopes of vacancy rates or job advert numbers as a leading variable. Unlike interest rates, these are basically measures of a very similar, directly related concept to unemployment, without any significant causality complications. But this will need to wait for another post.
Here’s the R code for fitting the various ARIMA models and doing the time-series cross-validation. The cross-validation code is explained in last week’s post:
Post continues after code excerpt
Final clincher – testing on a longer time series
So far in my modelling, I’ve been working with a cut-down dataset, only going back to 2002 so I can include stock prices as a candidate variable. If I’m satisfied that stock prices have nothing to offer in nowcasting unemployemnt (and I am now), then I might as well refit my two univariate and interest rates models with the full set of data available back to the late 1970s.
This time, the interest rate model is inferior even on the basis of the AIC (which of course has a longer series to allow its asymptotic properties to kick in).:
df | AIC | |
---|---|---|
mod_aa0_long | 6 | -106.2901 |
mod_aa1_long | 7 | -104.3239 |
This is confirmed by the time-series cross-validation:
model | ME | RMSE | MAE | MPE | MAPE | ACF1 | Theil’s U |
---|---|---|---|---|---|---|---|
With interest rates | -0.0022767 | 0.3287481 | 0.1980351 | 0.0378736 | 2.959823 | -0.0400820 | 0.8890656 |
Univariate | 0.0064120 | 0.2404348 | 0.1847038 | 0.1837424 | 2.755187 | 0.0177967 | 0.6408704 |
Other than mean percentage error (which I dislike as a metric of accuracy anyway), the univariabte model is the better of the two.
Post concludes after code excerpt
Unfinished business
So, a nice, clear conclusion today – don’t bother with interest rates or stock prices in trying to nowcast unemployment rates.
I have three things that need further investigation here, but as this post is now well over 500 lines long these are definitely for another time:
- more systematic thinking and exploration of whether the interest rate explanatory regressor should be first differenced, not in its absolute values. For the curious, I did fit my models with differenced interest rates, and reach the same (no effect) eventual conclusions as reported in the post. In fact, is possible that the non-stationary undifferenced interest rate series is what’s responsible for the suprious AIC finding that interest rates are worth including in the model in the first place.
- replicating the ABS’ seasonal adjustment.
- testing other possible explanatory variables, particularly vacancy rates or job advert numbers.
That’s all for now, folks.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.