
ARMAX Offerings Remain a Muddle

[This article was first published on R – Win Vector LLC, and kindly contributed to R-bloggers.]

Introduction

To borrow a term from Hyndman, the following is going to come out as a bit of a muddle. However, that may be the honest way to survey a muddled situation.

I am going to write on time series problems and solvers, try to run some details to ground, and call a few babies ugly. This will either add some clarity (my hope), or be something the informed reader disagrees with (my fear).


A grain of salt against this note (just in case!).

A time series example

Let’s start with a concrete example.

Suppose we want to model y, the number of cars parking in a parking lot on a given day, as a linear function of previous days plus x, the set of external impermanent events (such as concerts) that may attract additional parking on a given day (imagine a parking pass included with event registration). We could set this up as the following (time indexed) system of equations.

  y_t  =  b1 (y_{t-1} - g · x_{t-1} - e_{t-1})  +  b2 (y_{t-2} - g · x_{t-2} - e_{t-2})  +  g · x_t  +  e_t

In this: t is a time index, y is an observed numeric outcome, and x are observed external regressors thought to affect the outcome. Then b1, b2 are the (unobserved) time modeling coefficients (to be inferred), g are the (unobserved) external coefficients (to be inferred), and e are the estimates of the (unobserved) unexplained shocks (to be minimized). In time series notation, the quantity z_t = y_t - g · x_t - e_t is an AR(2) system in terms of the (unobserved) z (an autoregressive system looking 2 periods back). A lot of the early time series problems used a z, y type notation to encode the essential recurring series as unobserved, and a noisy transcription of it as the observation.

The above system is saying: there is a loyal or durable sub-population (called z) of the parkers that recurs in some pattern (so can be well approximated by an AR(2) time series), and on top of this there are impermanent parkers that show up for events that we know about through our external regressors x. We will call this formulation the “external impermanent demand” model.
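As a concrete (and entirely hypothetical) sketch of the “external impermanent demand” model, the following pure-Python simulation draws a steady durable sub-population z, an event indicator x, and shocks e. Every numeric value here (b1, b2, g, the event rate, the shock scale) is invented for illustration:

```python
import random

random.seed(7)
b1, b2, g = 0.7, 0.3, 25.0  # invented coefficients (b1 + b2 = 1 keeps z steady)
sigma, T = 5.0, 300         # invented shock scale and series length

# z: the durable/recurring parker sub-population (unobserved), an AR(2) recurrence
z = [120.0, 120.0]
# x: 0/1 indicator of an impermanent external event (e.g. a concert)
x = [1 if random.random() < 0.1 else 0 for _ in range(T)]

# y: observed parking totals = durable parkers + event parkers + unexplained shock
y = [z[t] + g * x[t] + random.gauss(0.0, sigma) for t in range(2)]
for t in range(2, T):
    z.append(b1 * z[t - 1] + b2 * z[t - 2])
    y.append(z[t] + g * x[t] + random.gauss(0.0, sigma))
```

Event days lift y by g only on the day itself; the recurring series z is untouched, which is the defining feature of this formulation.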

We can re-arrange the above system into the following more regular form:

  (y_t - g · x_t - e_t)  =  b1 (y_{t-1} - g · x_{t-1} - e_{t-1})  +  b2 (y_{t-2} - g · x_{t-2} - e_{t-2})

The above is sometimes called “regression with AR errors” (the name used, for example, in the “Stat 510” course notes), as the residuals (or shocks, not the y's!) from the external regressor model are the quantities thought to obey the time series relations.
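As an illustrative (not recommended) estimator for the “regression with AR errors” reading, here is a naive two-stage sketch on simulated data in the usual textbook form y_t = g · x_t + n_t with n_t an AR(2) process: ordinary least squares of y on x for g, then the Yule-Walker equations on the residuals for b1 and b2. All numeric values are invented:

```python
import random

random.seed(3)
b1, b2, g, sigma, T = 0.5, 0.3, 25.0, 2.0, 4000  # invented true values

# simulate: n_t = b1 n_{t-1} + b2 n_{t-2} + e_t ; y_t = g x_t + n_t
x = [1 if random.random() < 0.3 else 0 for _ in range(T)]
n = [0.0, 0.0]
for t in range(2, T):
    n.append(b1 * n[t - 1] + b2 * n[t - 2] + random.gauss(0.0, sigma))
y = [g * x[t] + n[t] for t in range(T)]

# stage 1: simple least-squares regression of y on x (slope estimates g)
mx = sum(x) / T
my = sum(y) / T
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
g_hat = sxy / sxx
r = [yi - g_hat * xi for xi, yi in zip(x, y)]  # residuals should be ~AR(2)

# stage 2: Yule-Walker equations for the AR(2) coefficients of the residuals
mr = sum(r) / T
def acov(k):
    return sum((r[t] - mr) * (r[t - k] - mr) for t in range(k, T)) / T
c0, c1, c2 = acov(0), acov(1), acov(2)
det = c0 * c0 - c1 * c1
b1_hat = (c0 * c1 - c1 * c2) / det
b2_hat = (c0 * c2 - c1 * c1) / det
```

On a long simulated series the recovered g_hat, b1_hat, b2_hat land near the generating values; the point is only that in this formulation the time-series structure lives on the residuals, not on the y's.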

Note: this is a different claim than the one captured by relations of the form:

  y_t  =  b1 y_{t-1}  +  b2 y_{t-2}  +  g · x_t  +  e_t

or

  (y_t - g · x_t - e_t)  =  b1 y_{t-1}  +  b2 y_{t-2}

The new forms encode the following, different, problem: in each time period we recruit g · x_t + e_t new recurring or durable parkers, who start contributing to both present and future parking totals. This form isn't appropriate for our original application (modeling impermanent parkers who show up only for an event), but it is useful for modeling parkers attracted to the lot through marketing efforts designed to attract new recurring business. In practice we will want to model a mixture of new impermanent and durable parkers attracted by external regressors (parking attached to events, and marketing efforts). We will call this formulation the “external durable marketing” model.
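The practical difference between the two formulations shows up in the response to a single event. In this noise-free sketch (coefficients invented for illustration), an event at t = 10 adds parkers for one day under the “external impermanent demand” reading, but keeps contributing through the recurrence under the “external durable marketing” reading:

```python
b1, b2, g, T = 0.5, 0.3, 25.0, 30
x = [1 if t == 10 else 0 for t in range(T)]  # a single event at t = 10

# impermanent: event parkers show up only on the event day (no noise, zero baseline)
imp = [g * x[t] for t in range(T)]

# durable: the recruited parkers keep contributing through the AR(2) recurrence
dur = [0.0, 0.0]
for t in range(2, T):
    dur.append(b1 * dur[t - 1] + b2 * dur[t - 2] + g * x[t])
```

At t = 11 the impermanent effect is already back to zero, while the durable effect is still b1 · g = 12.5 and continues to echo forward.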

Ways to solve our problem include: ARIMAX/ARMAX, Kalman filters (which update unobserved state as a function of observations), EM methods, Bayesian methods, all-in-one time series packages (such as Prophet), or looking at frequency or auto-correlation transforms of the observations. We can even try to solve for b and e ourselves (using EM methods, which were considered statistically ineffective in the 1960s and 1970s, but, rightly or wrongly, are considered effective now).

The problems and tools

The question is: are the standard solvers powerful enough to encode the problem, and are the standard solvers controllable enough to respect the problem structure? Do they solve our problem, or just a near relative of our problem?

We mentioned: ARIMAX/ARMAX (ARIMA/ARMA plus X for external regressors). ARMAX is defined in D. F. Nicholls, A. R. Pagan and R. D. Terrell, “The Estimation and Use of Models with Moving Average Disturbance Terms: A Survey”, International Economic Review, Vol. 16, No. 1 (Feb. 1975), pp. 113-134:

The ARMAX relation itself has the form B(L) y(t) = C(L) x(t) + A(L) e(t). In time-series nomenclature “L” is a lag operator (it changes indices) and y(t) are the observations (t being a time or an index). Then: B(L) encodes the relations the y(t) obey across time (the autoregressive or AR part of the model), A(L) encodes how the external shocks are moved through time (for some systems, the moving average or MA part of the model), and C(L) encodes how copies of the external regressors x() are moved through time. What we are looking for in an ARMAX solver is expressiveness: non-trivial time shift/lag operators, or more than just one time index on the external regressors (which we have here). This is also likely the definition worked out in Ljung, L., System Identification: Theory for the User, Second Edition, Upper Saddle River, NJ: Prentice-Hall PTR, 1999. And it is certainly what is implemented in the related MathWorks MATLAB armax function. Interestingly, the germinal reference Box, Jenkins, Reinsel, Time Series Analysis, 4th Edition, Wiley doesn't cover ARMAX/ARIMAX (or at least not by name or in the index; the 5th edition appears to add Greta M. Ljung as a new author and some uses of arimax()). Instead it introduces external regressors with regression with ARIMA residuals, seasonal methods, and general transfer function methods.
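To make the lag-operator notation concrete, here is a tiny hypothetical helper (names invented) that applies a polynomial in L to a series, so a polynomial such as 1 - 0.5 L - 0.3 L^2 acting on y(t) is just a weighted sum of lagged values:

```python
def apply_lag_poly(coefs, series, t):
    """Apply sum_k coefs[k] * L^k at time t, i.e. sum_k coefs[k] * series[t - k]."""
    return sum(c * series[t - k] for k, c in enumerate(coefs))

y = [3.0, 5.0, 4.0, 6.0]
# B(L) = 1 - 0.5 L - 0.3 L^2 applied at t = 3: y[3] - 0.5 * y[2] - 0.3 * y[1]
val = apply_lag_poly([1.0, -0.5, -0.3], y, 3)
```

The same helper read against B(L), C(L), and A(L) makes clear what “more than just one time index on the external regressors” means: the coefficient list for x can be longer than one entry.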

Our original (impermanent demand, the first two equations) problem can be written in ARMAX notation as:

  (y_t - b1 y_{t-1} - b2 y_{t-2})  = 
      (g · x_t - b1 g · x_{t-1} - b2 g · x_{t-2}) 
      + (e_t - b1 e_{t-1} - b2 e_{t-2})

Notice the same time coefficients are shared by all three processes. Nicholls et al. mention examples such as ours in their “origins of the ARMA and ARMAX models” section. They in fact show how such problems can be re-encoded into ARMAX problems. However, such a conversion requires the additional ability to enforce constraints such as “the autoregressive and moving average coefficients are identical” (more on this in a bit!).
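The claimed rearrangement can be sanity-checked numerically: simulate the original system (arbitrary x and e, z defined by the AR(2) recurrence; all numeric values invented) and confirm both sides of the shared-coefficient ARMAX form agree at every t:

```python
import random

random.seed(1)
b1, b2, g, T = 0.5, 0.3, 25.0, 50

x = [random.random() for _ in range(T)]          # arbitrary external regressor
e = [random.gauss(0.0, 1.0) for _ in range(T)]   # arbitrary shocks
z = [10.0, 9.0]                                  # unobserved AR(2) series
for t in range(2, T):
    z.append(b1 * z[t - 1] + b2 * z[t - 2])
y = [z[t] + g * x[t] + e[t] for t in range(T)]   # observations

max_gap = 0.0
for t in range(2, T):
    lhs = y[t] - b1 * y[t - 1] - b2 * y[t - 2]
    rhs = ((g * x[t] - b1 * g * x[t - 1] - b2 * g * x[t - 2])
           + (e[t] - b1 * e[t - 1] - b2 * e[t - 2]))
    max_gap = max(max_gap, abs(lhs - rhs))
```

The gap is zero up to floating-point rounding, confirming the same b1, b2 really do appear in all three polynomials of the rearranged form.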

The first critique: over-kill

The above methodology (if available) will produce an estimate for e that is generally no larger than (and maybe even smaller than!) the best estimate for e possible for our specified form. This is a form of “model domination”: the structure we specified is one of the structures considered in the ARMAX solution, so the ARMAX solution is at least as good. As we consider e an unexplained shock (sometimes called an error, but that isn't safe terminology in time series work, as it implies these items can be assumed away; more on this in the appendix), that would seem to imply the solution is at least as good as the one we want.

However, this method may not produce the correct estimate or inference of b. This is because the solution above corresponds to the original problem only when it happens that B(L) = C(L) = A(L) in the above more general ARMAX formulation. This is a problem in two ways:

Moving from the earlier ad-hoc problems gives the answer “I can put your 5 liters of sand in a 5 liter bag” to the question “how much of my 5 liters of sand can I put in my 2 liter bag?” The system gets a “better” solution by ignoring some of the problem constraints (or by “being more powerful”). The user must then worry: is this merely a case of having more parameters than they wanted (risking overfitting or statistical inefficiency), or is it a situation where the model the user wants to specify is in the parameter space but inaccessible (due to fitting strictures, such as “no common roots”)? Either way, it is a detriment to not be able to specify the chosen problem constraints as a specific positive inductive bias.

In addition to ARMAX solvers, there are “transfer function model solvers” (which are even more general than ARIMAX, though one may not be able to force the transfer function to be merely an ARIMAX model). Roughly, a transfer function model solves problems of the form y(t) = C(L) x(t) / φ(L) + A(L) e(t) / B(L). This would be ARMAX if we could add a constraint forcing φ to equal B (which we cannot). These systems have even more degrees of freedom in solving, so they can likely claim to pack 7 liters of sand into the 2 liter bag (in contrast to the earlier claim of 5).

Over-kill is great for prediction (reproducing the ys), and horrible for inference (estimating the relations or bs).

The second critique: under-kill

The second problem is: many of the available packages that claim to solve ARMAX problems in fact solve only a special case that may or may not cover a given example problem.

For example:

The Python package PyFlux claims to implement ARIMAX, when it appears to implement only a single special case: there are no lags on the time subscripts for either the external regressor or shock terms (that is, a form along the lines of y_t = b1 y_{t-1} + b2 y_{t-2} + g · x_t + e_t). This is great for the “external durable marketing” situation, but not for the original “external impermanent demand” situation. This formulation is neither full ARMAX nor the previously noted Stat 510 definition of “regression with ARIMA errors” (as PyFlux writes the time indexed recurrence in terms of the y's and not in terms of the shock e's).

The Muddle

It appears a great deal of the web teaching consists of repetitions of variations on the same (not quite right) claim, one that starts seemingly harmless, but becomes more insidious when repeated without context: presenting a specialization (again with durable interventions) as “the basic” ARMAX model. In context (a first tutorial) this may not be so bad. The teacher likely knows their stuff and probably will not make the mistakes innocent readers would. The claim would even be correct if it said “a basic” instead of “the basic.” The issue is: this is a specialization, and not full ARMAX.

Some More Packages

If one ignores the original problem and instead asks “which packages solve ARMAX problems?”, I have a small number of notes.

Notice none of them seems to exactly target ARMAX. Of course ARMAX isn't the end-goal; it is just the name for a particular waypoint in the evolution of time series tools.

What is going on?

My feeling is: the time series literature (especially that only neighboring the core ideas) has drifted quite a lot since the 1970s. It appears attention moved from solving individual time-oriented problems, to solving generalizations, to merely claiming to solve the generalizations. I think some of the change was driven by an industry move from fitting as inference (specifying specific estimates of system parameters) to prediction (reproducing observed outcomes). Tool makers tend to be different from practitioners, and the two groups diverge over time. Eventually this degenerates to: even if the user knows what ARMAX is, they must question whether the tool makers also knew. Or (symmetrically) even if the tool makers know what ARMAX is, they must worry whether their future users will also know.

In my opinion, the use of terms is so degraded that there is no way to tell what problem a time series package actually solves without proofreading the exact recurrence equations. There is no easy way to tell from mere descriptions whether a package “does exactly what it says on the tin” without seeing the actual claimed recurrences and working known examples.

To summarize, repeat, and amplify Rob J Hyndman's The ARIMAX model muddle: many functions called “arimax()” don't fit ARMAX models, and many writers don't seem to care about the difference. Though it appears “regression with ARIMA residuals” (impermanent external regressors) is popular in the R community and “regression with durable external regressors” (regime changes) is popular in the Python community.

Conclusion

My warning is: don’t commit to a methodology or package until you have worked examples proving it is actually solving the problem you are working on and preserving your model structure. Due to concept drift and the echo chamber: solving something with the same name isn’t always solving the same problem.

My positive actionable advice is: insist on seeing and controlling the exact recurrence equations. The practitioner should not care about the name of the problem they are solving, but whether the solver is respecting the system they specified or not.

Appendices

Appendix: why not “error”

The differences between predictions and actuals are usually called errors. In time-series problems this (in my opinion) should be avoided, as these differences are often durable and important.

For example: let y_t be a bank account balance at time t, and suppose we model this balance as the time series y_t = y_{t-1} + e_t. Then e_t isn't an error, it is the net deposits minus net withdrawals at time t! Or: it may be an error from the model's perspective, but it is an essential part of the process from the system's perspective.
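A tiny numeric version of the point, with made-up figures: the “shock” at each step is the net cash flow, something no one would want to assume away:

```python
deposits    = [200.0,  0.0, 50.0,   0.0]   # made-up daily deposits
withdrawals = [  0.0, 75.0,  0.0, 120.0]   # made-up daily withdrawals

balance = [1000.0]                         # y_0: opening balance
shocks = []
for d, w in zip(deposits, withdrawals):
    e_t = d - w                            # the "shock" is real money movement
    shocks.append(e_t)
    balance.append(balance[-1] + e_t)      # y_t = y_{t-1} + e_t
```

Calling the shocks “errors” would suggest minimizing or ignoring them, when they are the very deposits and withdrawals the account exists to record.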

I recommend the less value-laden terms: “residuals”, “differences”, or “shocks.”

Appendix: go Bayesian?

A solution I endorse is: go Bayesian (if you can afford the run time). Bayesian methods have two huge advantages: specification is kept separate from implementation (making specification simpler), and non-identifiability of the system is passed on to the user in the results (meaning the user can try to fix it, or try to live with it). Though, be aware: a huge disadvantage of Bayesian methods is run time.

Bayesian packages, such as Stan, let you write down exactly the system structure you are trying to infer over. It remains your problem whether that is a good or bad idea (instead of a Procrustean enforcement of problem structure). For example, Stan's documentation on the various time series models is almost unique in even showing what model structures are specified in a comprehensible format (none of these have external regressors, but they are easy to add). It actually feels like they broke a rule in specifying what the solver even claims to solve for.

I have some good results here.
