Tornadoes Have Side Effects: A Response to Victor Chernozhukov
Recently I had a spirited conversation with Victor Chernozhukov, a leading econometrician whose work spans both traditional subjects like panel data modeling and newer applications of machine learning to causal inference. While I admire his work, I objected to his interpretation of a directed acyclic graph (DAG) that showed fixed effects/varying intercepts as “latent confounders.” Below is the screenshot Victor shared on Twitter:
Victor’s point was not about the fixed-effects-as-confounders aspect of the DAG but rather about people’s conflation of panel data with difference-in-differences methods (a point on which I very much agree with him). We had a substantial back-and-forth over a different question: when we include “fixed effects” in a panel data model, that is, a separate intercept or dummy variable for each case/unit/subject \(i\), what do those intercepts represent in causal terms?
I think this is an important and largely overlooked question in Judea Pearl’s causal diagram framework (though given how vast the literature using his methods is, I may simply have missed the relevant paper). The promise of causal graphs like the one above is that we can figure out what we are likely to observe in Nature given the graph, and consequently what kinds of statistical models we can use to identify effects. It’s really very cool, and it’s been a big help in my own work.
However, while I appreciate the use of DAGs, it can be tricky to pin down exactly what they mean. As Pearl explains in his recent book, The Book of Why, the point of causal analysis is to separate the things in the world we are trying to understand from the limitations of our mathematical models. To think in causal terms we have to have a causal language. So when we make a causal diagram, we have to think very carefully about how to translate it into other languages, such as regression equations.
In a causal diagram, every node on the graph has to be its own separate causal factor. For that to be true, it has to have some kind of independent existence in the “real world.” We could also call it a random variable: some trait we can observe that can take on multiple values. If we have a complete and accurate causal graph, we can then know with confidence which variables we need to measure to ensure that our statistics have a causal interpretation.
Now going back to Figure 8.1 above, it’s useful to think about the nodes on this graph and what they mean. The two main actors in the graph are \(D_{it}\), the treatment or main predictor of interest, and \(Y_{it}\), the outcome, indexed by \(i\) for cases/units/subjects and \(t\) for time. We also have two potential confounding variables, \(W_{it}\) and \(\alpha_i\). \(W_{it}\) is a conventional confounder: it sets up a “back-door” path between the treatment \(D_{it}\) and the outcome \(Y_{it}\) by causing both of them. If we don’t include \(W_{it}\), its causal influence will get absorbed by \(D_{it}\) in our regression model and we won’t know what the real causal effect of \(D_{it}\) is.
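To see what “absorbed” means in practice, here is a minimal simulation of my own (the variable names and effect sizes are made up, and I drop the panel indices for simplicity): a confounder \(W\) causes both \(D\) and \(Y\), and leaving it out of the regression biases the estimated effect of \(D\).

```r
set.seed(2021)

n <- 5000
W <- rnorm(n)                       # confounder
D <- 0.8 * W + rnorm(n)             # treatment partly caused by W
Y <- 0.5 * D + 1.2 * W + rnorm(n)   # true causal effect of D on Y is 0.5

coef(lm(Y ~ D))["D"]       # back-door path open: estimate pulled well above 0.5
coef(lm(Y ~ D + W))["D"]   # back-door path closed: estimate close to 0.5
```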
All well and good so far, but we still have this mysterious \(\alpha_i\). The first thing you should notice, of course, is that this variable is not indexed by \(t\). Why is that? Well, these variables represent intercepts or dummy variables that are included in the model, one for each case/subject/unit in the data. The DAG is claiming that these intercepts confound the relationship between \(D_{it}\) and \(Y_{it}\) in a manner identical to \(W_{it}\), and interestingly, also cause \(W_{it}\) in some sense. Based on this DAG, we must then include these intercepts to identify the effect of \(D_{it}\) just as we did with \(W_{it}\). Problem solved! We can all now go get a beer and call it a day.
Well… not so fast. Again, one element of causal analysis is that we are studying things “in the real world.” Each element of our graph must be some factor or force or element which we could (at least in theory) manipulate. Does including an intercept for each case meet that standard? I would say that in general, no. Including these intercepts is often a good idea to build a model that is relevant to your research question, but it’s neither necessary nor sufficient for causal inference with panel data. (A bold statement, but read on!)
To illustrate this, I’m going to start with a DAG that is much easier to understand and that has an empirical meaning. In this DAG, the outcome \(Y_{it}\) represents the level of democracy in a country \(i\) in year \(t\). We want to know what the causal effect of the level of gross domestic product (\(G_{it}\)) is on a country’s democracy over time. Unfortunately, there are potential confounding variables in this relationship.
In the diagram I also include \(F_i\), which represents a country’s factor endowment, or its types of soil and other natural resources. As Engerman and Sokoloff argue, it could be that the natural environment predisposed some countries to more repressive types of agriculture, such as large slaveholder plantations, and slave-owning countries and regions were more resistant to democracy over time. We also know that slave-owning areas tend to be poorer over time, as they are less likely to industrialize. As a consequence, we are concerned that the relationship between GDP and democracy will be confounded by the level of factor endowments. Note that for factor endowments I only include the subscript \(i\): a country’s factor endowments are fixed over time, unlike GDP, which varies both over time and across countries at a single point in time.
This sort of setup is the typical justification for including fixed effects, or one intercept/dummy variable for each country \(i\). So went Victor’s reasoning on Twitter: if we include \(\alpha_i\) then it will adjust for \(F_i\) so long as the two are equivalent (\(\alpha_i \equiv F_i\)). This seems easy and straightforward: \(\alpha_i\) only varies by country and \(F_i\) only varies by country. Bingo.
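Before going further, it helps to have a toy version of this panel to point at. The following simulation is purely illustrative (the numbers are mine, not real data): a time-invariant endowment \(F_i\) lowers both GDP and democracy, and the true effect of GDP on democracy is set to 0.4.

```r
set.seed(2022)

n_countries <- 100
n_years     <- 20

country <- rep(1:n_countries, each = n_years)
Fi  <- rnorm(n_countries)[country]                      # time-invariant factor endowment F_i
gdp <- -0.7 * Fi + rnorm(n_countries * n_years)         # endowments depress GDP
dem <- 0.4 * gdp - 0.9 * Fi + rnorm(n_countries * n_years)  # true effect of GDP = 0.4

coef(lm(dem ~ gdp))["gdp"]                     # confounded: biased away from 0.4
coef(lm(dem ~ gdp + Fi))["gdp"]                # adjusting for the actual confounder
coef(lm(dem ~ gdp + factor(country)))["gdp"]   # the fixed-effects route
```

In this simple linear setup both of the last two models recover something close to 0.4, but, as the rest of the post argues, they do not answer the same question.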
Well, I hate to be the bearer of bad news, but including \(\alpha_i\) as an adjustment variable to account for \(F_i\) doesn’t get you the same thing because \(\alpha_i\) and \(F_i\) are not equivalent, logically or otherwise. \(F_i\) is a vector of factor endowments with a plausible range from zero to a very large number, depending on how exactly we want to measure endowments. \(\alpha_i\), by contrast, is either a categorical variable where each value is the particular country in the data, or a matrix of dummy variables minus one country which serves as the reference category. Declaring these two equivalent is… odd.
In particular, something happens to the DAG above when we include \(\alpha_i\) as an adjustment factor instead of \(F_i\). The following DAG shows the result:
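In equation form, the de-meaned (“within”) regression that this DAG depicts looks like the following; the notation is mine (\(\beta\) for the coefficient on GDP, \(\epsilon\) for the error term), but the transformation is standard:

\[
Y_{it} - \bar{Y}_i \;=\; \beta\,\bigl(G_{it} - \bar{G}_i\bigr) \;+\; \bigl(\epsilon_{it} - \bar{\epsilon}_i\bigr),
\qquad
\bar{Y}_i = \frac{1}{T}\sum_{t=1}^{T} Y_{it}, \quad
\bar{G}_i = \frac{1}{T}\sum_{t=1}^{T} G_{it}.
\]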
The above formula is likely familiar to people who have studied panel data in economics or political science. For both the treatment, GDP, and the outcome, democracy (\(Y_{it}\)), I subtracted away the average value for each case \(i\). This process is also called “de-meaning” (about which no one has yet made any clever puns, unfortunately), as we convert the two variables into deviations from their over-time means. Why did this happen? Again, the \(\alpha_i\) here represent an intercept for every country. If we have 100 countries, then we have a giant matrix of 99 dummy variables. Each dummy variable will absorb everything that doesn’t vary over time in that country. Think of it like a tornado that sweeps through the country, leaving behind only time-varying debris.
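We can check that the dummy-variable tornado and explicit de-meaning are the same operation using the simulated panel from above (again just a sketch; `ave()` computes each country’s over-time mean):

```r
# De-mean GDP and democracy within each country
gdp_dm <- gdp - ave(gdp, country)   # G_it minus the country mean of G
dem_dm <- dem - ave(dem, country)   # Y_it minus the country mean of Y

# One intercept/dummy per country...
coef(lm(dem ~ gdp + factor(country)))["gdp"]

# ...gives exactly the same slope as regressing de-meaned democracy on de-meaned GDP
coef(lm(dem_dm ~ gdp_dm))["gdp_dm"]
```

The point estimates match exactly; only the reported standard errors differ, because the de-meaned regression does not account for the degrees of freedom used up by all those intercepts.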
Now did the \(\alpha_i\) get rid of the confounding variable \(F_i\)? Well, sort of. If you think about it, we couldn’t keep \(F_i\) in the DAG because there is only one value of factor endowment per country, and the average of one number is… that value. You subtract it away and you have zero. This neat math trick is what first motivated economists to think of fixed effects (intercepts for cases) as a cool way to get rid of confounders.
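You can watch this happen in the toy panel: once the country dummies are in the model, the time-invariant \(F_i\) is a perfect linear combination of them, and `lm()` simply drops it when it is entered after the dummies, reporting its coefficient as `NA`:

```r
# Fi is constant within each country, so the country dummies absorb it completely;
# entered after factor(country), it is dropped as perfectly collinear (coefficient NA).
coef(lm(dem ~ gdp + factor(country) + Fi))["Fi"]
```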
However, tornadoes have side effects. We no longer have \(Y_{it}\) as our outcome; we have \(Y_{it} - \bar{Y}_i\). In other words, we no longer have to worry about \(F_i\) or any other time-invariant confounder, but we’ve also changed the question we’re asking. We never answered the question posed in the earlier DAG about the effect of GDP on democracy because… we now only have above-or-below-average democracy regressed on above-or-below-average GDP.
So is \(\alpha_i\) an adjustment set or control variable for factor endowments? No, it isn’t. It’s true that if you include it in the model, you won’t need to worry about factor endowments, but that’s about as comforting as telling someone searching for shelter from a tornado that they no longer need to worry about missing the flight they were supposed to catch later that day. True, but largely irrelevant to the problem at hand.
If we want to know the causal effect of \(G_{it}\) on \(Y_{it}\) while accounting for factor endowments, then we need to include a measure of factor endowments or find a way of manipulating a country’s income that is independent of other causal factors (i.e., random assignment). Including fixed effects or intercepts for countries/cases does something different: it isolates a dimension of variation in the outcome. This is itself an important thing to do, as the two subscripts of \(Y_{it}\) both matter but have different interpretations. We may be more interested, for example, in comparing countries to each other in cross-section within each year. In that case, we could include intercepts for time points in the model. Or we might prefer to compare countries to themselves over time; in that case we would include the \(\alpha_i\) as above.
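In the toy panel, the two choices look like this (a sketch; the year index is something I am adding here, and in this particular simulation the confounding lives entirely between countries, so only the country intercepts happen to remove it):

```r
year <- rep(1:n_years, times = n_countries)   # hypothetical year index for the simulated panel

# Country intercepts: compare each country to itself over time (within variation)
coef(lm(dem ~ gdp + factor(country)))["gdp"]

# Year intercepts: compare countries to each other within each year (cross-sectional variation)
coef(lm(dem ~ gdp + factor(year)))["gdp"]
```

The two regressions isolate different dimensions of the variation in \(Y_{it}\) and, in general, answer different questions; neither is automatically the “causal” one.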
So circling back to the original question… can fixed effects help us identify a causal effect? I argue that it’s not by “adjusting for latent confounders” as above. But it could still be beneficial, though a causal diagram alone can’t tell us when it might be. We have to use our own knowledge of the case at hand and some heuristics about how we think the unobserved and unobservable aspects of our data–the unobserved heterogeneity–vary in the sample.
For example, if we had a very long time series, such as country democracy and GDP over a couple hundred years, then including case intercepts \(\alpha_i\) would mean comparing each country’s democracy and GDP to its own values across that entire period. In other words, we would compare the level of GDP and democracy in the United States in 1805 to the level of GDP and democracy in the United States in 2020. While we don’t know what confounders exist, we might suspect that this comparison is a bit apples and oranges. Contrary to received wisdom, it might actually be better to drop the \(\alpha_i\) and include \(\alpha_t\): Canada and the United States in 1805 might be more similar to each other than the 1805 and 2020 versions of the United States are.
If you want to read more on the topic, I have both a prior blog post and a co-authored paper, which I recommend you read at your leisure while avoiding tornadoes, mis-specified causal graphs, and false equivalencies.