Site icon R-bloggers

What Panel Data Is Really All About

[This article was first published on R on Robert Kubinec, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We’ve all been in that seminar where the author puts up a slide containing regression coefficients, and buried in the bottom line of the table we can see little Ys and Ns indicating the kind of panel data model employed. Quite often, these brief descriptions of models are taken as a mere statistical ritual, believed to be efficacious at appeasing the mercurial statistical deities, but rarely if ever investigated more thoroughly. Furthermore, no one wants to be in the difficult position of discussing panel data models, which inevitably boils down to a conversation laced with econometric terms generating more heat than light.

So what if I told you … panel data, or data with two dimensions, such as repeated observations on multiple cases over time, is really not that complicated. In fact, there are only two basic ways to analyze panel data, which I will explain briefly in this piece, just as every panel dataset has two basic dimensions (cases and time). However, when we confuse these dimensions, bad things can happen. In fact, one of the most popular panel data models, the two-way fixed effects model–widely used in the social sciences–is in fact statistical nonsense because it does not clearly distinguish between these two dimensions. This statement should sound implausible to you–really?, but it’s quite easy to demonstrate, as I’ll show you in this post.

This blog post is based on a recent publication with Jonathan Kropko which you can access here. In this post I provide a more reader-friendly overview of the article, dropping the academic-ese and focusing on substance as I think there are important issues for many people doing research.

In short: there are an untold number of analyses of panel data affected by an issue that is almost impossible to identify because R and Stata obscure the problem. Thanks to multi-collinearity checks that automatically drop predictors in regression models, a two-way fixed effects model can produce sensible-looking results that are not just irrelevant to the question at hand, but practically nonsense. Instead, we would all be better served by using simpler 1-way fixed effects models (intercepts on time points or cases/subjects, but not both).

How We Think About Panel Data

Panel data is one of the richest resources we have for observational inference in the social and even natural sciences. By comparing cases to each other as they change over time, such as countries and gross domestic product (GDP), we can learn much more than we can by only examining isolated cases. However, researchers for decades have been told by econometricians and applied statisticians that they must use a certain kind of panel data model or risk committing grave inferential errors. As a result, probably the most common panel data model is to include case fixed effects, or a unique intercept (i.e. a dummy variable) for every case in the panel data model. The second most likely is to include a fixed effect or intercept for every time point and case in the data. The belief is that including all these intercepts will control for omitted variables because there are factors unique to cases or time points that will be “controlled for” by these intercepts.

The result of this emphasis on the lurking dangers of panel data inference results in what I have decided to call the Cheeseburger Syndrome. This phenomenon was first studied by Saturday Night Live in a classic 1980s skit:

Applied researchers are in the position of customers in the video, continually asking for models that will analyze the variation in their panel data which actually exists. Applied statisticians are often the cooks, informing customers that regardless of what particular dataset they have, they will only ever be able to get a “cheeseburger.” As a result, there is remarkable uniformity of application of panel data models in the social sciences, even if these models don’t always fit the data very well.

What we put forward in our piece linked above is that any statistical model should have, as its first requirement, that it match the researcher’s question. Problems of omitted variables are important, but necessarily secondary. It does not matter how good the cheeseburger is if the researcher really wants eggs easy over.

In addition, fixed effects models do not control for omitted variables. What fixed effect models do is isolate one dimension of variance in the model. As a result, any variables that don’t vary on that dimension are by definition removed from the model. This side-effect is trumpeted as the great inferential benefit of fixed effect models, but it has nothing to do with inference. Fixed effects (or their cousin, random effects/hierarchical models) are simply about selecting which part of the panel dataset is most germane to the analysis.

The rule of thumb that we put forward in our paper is that fixed effects/dummy variables/intercepts on cases correspond to the following research question:

How much does a case change in relation to itself over time?

While fixed effects/dummy variables/intercepts on time points correspond to the following research question:

How much does a case change relative to other cases?

Some questions are fruitfully asked when comparing cases to themselves over time. For example, if a case is a human being, we might want to know whether obtaining more education leads to higher earnings. Conversely, if a case is a country, we might want to know if wealthier countries tend to be more democratic than poorer countries. Some questions primarily employ within-case variation, while others look at cross-sectional variation. Both are present in any panel dataset, and both are potentially interesting.

If fixed effects enable comparisons, then what happens if we have dummy variables/intercepts for every case and time point (the so-called two-way fixed effects model)? What comparison are we then making? As it turns out, the answer to this question is not very clear at all. I refer you to the paper above for a full exposition, but in essence, because cases and time points are nested, we end up making comparisons across both dimensions simultaneously, and this is just as obtuse as it sounds. There is no clear research question that matches this model.

Furthermore, if the original aim was to remove omitted variables, these omitted variables inevitably end up in the estimate again because a two-way estimate necessarily relates to both dimensions of variance simultaneously. As a result, it is not very clear what the point is. The one known use of the model is for difference-in-difference estimation, but only with two time points (see our paper for more detail).

Exposition with Data

This all may sound to you like a nice story. But how can you really know we are telling the truth? There are plenty of equations in our paper, but there is nothing like being able to see the data. One of the contributions of our paper is to create a simulation for panel data where we can control the within-case and cross-sectional variation separately for the same panel data set. This allows us to compare all of the possible panel data models with the same dataset while including or excluding omitted variables and other issues.

To show you how this works, I will generate a dataset with a single covariate in which the effect of that covariate is +1 for within-case comparisons and -3 for cross-sectional comparisons. There is noise in the data, but the effect is the same for all cases and for all cross-sections. The following code simulates fifty cases and fifty time points using our panelsim package (currently available only via Github):

# to install this package, use 
# remotes::install_github("saudiwin/panelsim")

require(panelsim)

# generate dataset with fixed coefficients within cases and cross-section
# no variation in effects across cases or time

gen_data <- tw_data(N=50,
                    T=50,
                    case.eff.mean = 1,
                    cross.eff.mean = -3,
                    cross.eff.sd = 0,
                    case.eff.sd = 0,
                    noise.sd=.25)

Because there is only one covariate, we can pretty easily visualize the relationships in the cross sectional and within-case variation. First, the value of the outcome/response \(y\) is shown for five cases, where one dot represents the value of \(x\) for each time point for that case:

gen_data$data %>%
  filter(case<5) %>% 
  ggplot(aes(x=x,y=y)) +
           geom_point() +
           stat_smooth(method="lm") +
           facet_wrap(~case,scales="free") +
  theme_minimal()

As can be seen, there is a consistent positive relationship of approximately +1 between \(x\) and \(y\) within cases. We can also examine the relationship for the cross-section by subsetting the data to each time point and plotting the cases for five of the time points:

gen_data$data %>%
  filter(time<5) %>% 
  ggplot(aes(x=x,y=y)) +
           geom_point() +
           stat_smooth(method="lm") +
           facet_wrap(~time,scales="free") +
  theme_minimal()

There is a separate and quite distinct relationship in the cross section between \(x\) and \(y\). Both components are present in the outcome. To find the actual coefficient, we can simply fit linear regression models on the generated data, first with intercepts/dummies for cases:

summary(lm(y ~ x + factor(case),data=gen_data$data))
## 
## Call:
## lm(formula = y ~ x + factor(case), data = gen_data$data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8932 -0.1599 -0.0015  0.1682  0.8304 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.773045   0.035151  21.992  < 2e-16 ***
## x               1.002403   0.018501  54.182  < 2e-16 ***
## factor(case)2  -0.184927   0.049529  -3.734 0.000193 ***
## factor(case)3   1.466627   0.050043  29.307  < 2e-16 ***
## factor(case)4  -1.363988   0.049898 -27.335  < 2e-16 ***
## factor(case)5  -0.884801   0.049660 -17.817  < 2e-16 ***
## factor(case)6   1.097485   0.049812  22.032  < 2e-16 ***
## factor(case)7  -0.992114   0.049705 -19.960  < 2e-16 ***
## factor(case)8  -1.597938   0.050057 -31.922  < 2e-16 ***
## factor(case)9  -2.126106   0.050445 -42.147  < 2e-16 ***
## factor(case)10 -0.028781   0.049525  -0.581 0.561205    
## factor(case)11 -1.171168   0.049831 -23.503  < 2e-16 ***
## factor(case)12 -0.132703   0.049526  -2.679 0.007423 ** 
## factor(case)13 -0.612141   0.049588 -12.344  < 2e-16 ***
## factor(case)14  0.866259   0.049705  17.428  < 2e-16 ***
## factor(case)15 -0.867871   0.049670 -17.473  < 2e-16 ***
## factor(case)16  0.514860   0.049600  10.380  < 2e-16 ***
## factor(case)17 -0.633858   0.049621 -12.774  < 2e-16 ***
## factor(case)18 -0.696507   0.049617 -14.038  < 2e-16 ***
## factor(case)19  0.806729   0.049680  16.239  < 2e-16 ***
## factor(case)20 -0.764169   0.049643 -15.393  < 2e-16 ***
## factor(case)21 -0.597111   0.049591 -12.041  < 2e-16 ***
## factor(case)22 -1.405450   0.049928 -28.150  < 2e-16 ***
## factor(case)23  0.261895   0.049546   5.286 1.36e-07 ***
## factor(case)24  0.973181   0.049738  19.566  < 2e-16 ***
## factor(case)25 -0.710288   0.049618 -14.315  < 2e-16 ***
## factor(case)26 -0.718412   0.049631 -14.475  < 2e-16 ***
## factor(case)27 -0.946765   0.049698 -19.050  < 2e-16 ***
## factor(case)28 -3.194416   0.051604 -61.902  < 2e-16 ***
## factor(case)29  0.120342   0.049531   2.430 0.015184 *  
## factor(case)30 -0.965050   0.049712 -19.413  < 2e-16 ***
## factor(case)31 -1.108207   0.049794 -22.256  < 2e-16 ***
## factor(case)32 -0.888764   0.049668 -17.894  < 2e-16 ***
## factor(case)33 -0.920184   0.049700 -18.515  < 2e-16 ***
## factor(case)34  0.588541   0.049613  11.863  < 2e-16 ***
## factor(case)35 -0.008302   0.049525  -0.168 0.866881    
## factor(case)36 -1.068951   0.049757 -21.484  < 2e-16 ***
## factor(case)37 -2.534691   0.050831 -49.865  < 2e-16 ***
## factor(case)38 -0.699826   0.049616 -14.105  < 2e-16 ***
## factor(case)39  0.389961   0.049578   7.866 5.46e-15 ***
## factor(case)40 -1.291089   0.049878 -25.885  < 2e-16 ***
## factor(case)41 -2.527182   0.050822 -49.726  < 2e-16 ***
## factor(case)42 -1.718180   0.050056 -34.325  < 2e-16 ***
## factor(case)43  0.539154   0.049589  10.872  < 2e-16 ***
## factor(case)44  0.946498   0.049741  19.029  < 2e-16 ***
## factor(case)45 -1.672768   0.050093 -33.393  < 2e-16 ***
## factor(case)46 -1.924235   0.050325 -38.236  < 2e-16 ***
## factor(case)47  0.297708   0.049551   6.008 2.16e-09 ***
## factor(case)48  0.045939   0.049527   0.928 0.353731    
## factor(case)49  1.051133   0.049777  21.117  < 2e-16 ***
## factor(case)50 -1.102510   0.049764 -22.155  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2476 on 2449 degrees of freedom
## Multiple R-squared:  0.9173, Adjusted R-squared:  0.9156 
## F-statistic: 543.2 on 50 and 2449 DF,  p-value: < 2.2e-16

We can see in the regression coefficient above that the coefficient for \(x\) is almost exactly +1.

Next we can fit a model with cases/intercepts for time points (cross-sectional variation):

summary(lm(y ~ x + factor(time),data=gen_data$data))
## 
## Call:
## lm(formula = y ~ x + factor(time), data = gen_data$data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9114 -0.1674 -0.0046  0.1699  0.7753 
## 
## Coefficients:
##                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)    -0.68479    0.03530  -19.400  < 2e-16 ***
## x              -2.99280    0.01924 -155.534  < 2e-16 ***
## factor(time)2   1.51868    0.05005   30.345  < 2e-16 ***
## factor(time)3   2.26458    0.05073   44.637  < 2e-16 ***
## factor(time)4  -1.04731    0.04975  -21.050  < 2e-16 ***
## factor(time)5   1.73263    0.05025   34.482  < 2e-16 ***
## factor(time)6   1.26293    0.04988   25.319  < 2e-16 ***
## factor(time)7   0.06635    0.04952    1.340  0.18044    
## factor(time)8   1.67967    0.05018   33.472  < 2e-16 ***
## factor(time)9  -0.75507    0.04964  -15.211  < 2e-16 ***
## factor(time)10 -0.55308    0.04957  -11.157  < 2e-16 ***
## factor(time)11  1.34417    0.04996   26.905  < 2e-16 ***
## factor(time)12  2.20623    0.05064   43.565  < 2e-16 ***
## factor(time)13  0.02975    0.04952    0.601  0.54803    
## factor(time)14  0.10429    0.04952    2.106  0.03530 *  
## factor(time)15  0.27429    0.04955    5.536 3.43e-08 ***
## factor(time)16  1.05558    0.04981   21.193  < 2e-16 ***
## factor(time)17 -0.40503    0.04954   -8.175 4.68e-16 ***
## factor(time)18  0.86367    0.04971   17.375  < 2e-16 ***
## factor(time)19  1.57862    0.05013   31.492  < 2e-16 ***
## factor(time)20 -0.01728    0.04952   -0.349  0.72717    
## factor(time)21  0.62057    0.04961   12.508  < 2e-16 ***
## factor(time)22 -1.38478    0.04995  -27.721  < 2e-16 ***
## factor(time)23  0.31574    0.04954    6.374 2.20e-10 ***
## factor(time)24  2.96579    0.05157   57.515  < 2e-16 ***
## factor(time)25  0.24178    0.04953    4.881 1.12e-06 ***
## factor(time)26  2.83592    0.05144   55.133  < 2e-16 ***
## factor(time)27 -0.86284    0.04968  -17.369  < 2e-16 ***
## factor(time)28  0.92505    0.04975   18.594  < 2e-16 ***
## factor(time)29  1.05202    0.04980   21.125  < 2e-16 ***
## factor(time)30  0.15122    0.04953    3.053  0.00229 ** 
## factor(time)31  1.81587    0.05033   36.082  < 2e-16 ***
## factor(time)32  1.38787    0.04999   27.763  < 2e-16 ***
## factor(time)33  1.87706    0.05038   37.257  < 2e-16 ***
## factor(time)34  1.76956    0.05029   35.184  < 2e-16 ***
## factor(time)35  1.55739    0.05016   31.048  < 2e-16 ***
## factor(time)36  0.57794    0.04960   11.653  < 2e-16 ***
## factor(time)37  0.10647    0.04952    2.150  0.03166 *  
## factor(time)38 -0.72418    0.04962  -14.594  < 2e-16 ***
## factor(time)39  1.80861    0.05033   35.938  < 2e-16 ***
## factor(time)40  0.24163    0.04953    4.879 1.14e-06 ***
## factor(time)41  1.27448    0.04994   25.521  < 2e-16 ***
## factor(time)42  1.82843    0.05029   36.361  < 2e-16 ***
## factor(time)43  1.57770    0.05011   31.487  < 2e-16 ***
## factor(time)44  1.50850    0.05006   30.135  < 2e-16 ***
## factor(time)45  0.23318    0.04953    4.708 2.64e-06 ***
## factor(time)46 -0.92491    0.04969  -18.613  < 2e-16 ***
## factor(time)47  1.97346    0.05044   39.125  < 2e-16 ***
## factor(time)48  0.60207    0.04961   12.135  < 2e-16 ***
## factor(time)49  0.75674    0.04964   15.244  < 2e-16 ***
## factor(time)50 -1.38077    0.04996  -27.637  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2476 on 2449 degrees of freedom
## Multiple R-squared:  0.9173, Adjusted R-squared:  0.9156 
## F-statistic: 543.4 on 50 and 2449 DF,  p-value: < 2.2e-16

Again, the estimated coefficient is almost exactly what we generated. Success!

However, this brings us back to one of the questions we started with. We know the within-case relationship and the cross-sectional relationship, so what happens if we put dummies/intercepts on both cases and time? Well let’s find out:

summary(lm(y~x + factor(case) + factor(time),data=gen_data$data))
## 
## Call:
## lm(formula = y ~ x + factor(case) + factor(time), data = gen_data$data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.87158 -0.16705 -0.00396  0.16311  0.78026 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.806372   0.087811   9.183  < 2e-16 ***
## x               1.000661   0.143392   6.978 3.84e-12 ***
## factor(case)2  -0.184865   0.049834  -3.710 0.000212 ***
## factor(case)3   1.465951   0.074525  19.671  < 2e-16 ***
## factor(case)4  -1.363415   0.068454 -19.917  < 2e-16 ***
## factor(case)5  -0.884457   0.057091 -15.492  < 2e-16 ***
## factor(case)6   1.096983   0.064585  16.985  < 2e-16 ***
## factor(case)7  -0.991716   0.059427 -16.688  < 2e-16 ***
## factor(case)8  -1.597253   0.075111 -21.265  < 2e-16 ***
## factor(case)9  -2.125203   0.089352 -23.785  < 2e-16 ***
## factor(case)10 -0.028795   0.049592  -0.581 0.561545    
## factor(case)11 -1.170649   0.065430 -17.892  < 2e-16 ***
## factor(case)12 -0.132674   0.049635  -2.673 0.007568 ** 
## factor(case)13 -0.611905   0.053240 -11.493  < 2e-16 ***
## factor(case)14  0.865860   0.059440  14.567  < 2e-16 ***
## factor(case)15 -0.867514   0.057639 -15.051  < 2e-16 ***
## factor(case)16  0.514605   0.053863   9.554  < 2e-16 ***
## factor(case)17 -0.633567   0.055045 -11.510  < 2e-16 ***
## factor(case)18 -0.696223   0.054818 -12.701  < 2e-16 ***
## factor(case)19  0.806361   0.058129  13.872  < 2e-16 ***
## factor(case)20 -0.763847   0.056206 -13.590  < 2e-16 ***
## factor(case)21 -0.596870   0.053399 -11.178  < 2e-16 ***
## factor(case)22 -1.404855   0.069741 -20.144  < 2e-16 ***
## factor(case)23  0.261759   0.050838   5.149 2.83e-07 ***
## factor(case)24  0.972749   0.061053  15.933  < 2e-16 ***
## factor(case)25 -0.710003   0.054856 -12.943  < 2e-16 ***
## factor(case)26 -0.718107   0.055565 -12.924  < 2e-16 ***
## factor(case)27 -0.946375   0.059073 -16.020  < 2e-16 ***
## factor(case)28 -3.193052   0.122837 -25.994  < 2e-16 ***
## factor(case)29  0.120274   0.049902   2.410 0.016019 *  
## factor(case)30 -0.964644   0.059789 -16.134  < 2e-16 ***
## factor(case)31 -1.107721   0.063746 -17.377  < 2e-16 ***
## factor(case)32 -0.888409   0.057538 -15.440  < 2e-16 ***
## factor(case)33 -0.919792   0.059160 -15.548  < 2e-16 ***
## factor(case)34  0.588263   0.054612  10.772  < 2e-16 ***
## factor(case)35 -0.008290   0.049589  -0.167 0.867240    
## factor(case)36 -1.068500   0.061960 -17.245  < 2e-16 ***
## factor(case)37 -2.533613   0.101653 -24.924  < 2e-16 ***
## factor(case)38 -0.699544   0.054758 -12.775  < 2e-16 ***
## factor(case)39  0.389746   0.052653   7.402 1.84e-13 ***
## factor(case)40 -1.290532   0.067565 -19.101  < 2e-16 ***
## factor(case)41 -2.526108   0.101383 -24.916  < 2e-16 ***
## factor(case)42 -1.717495   0.075071 -22.878  < 2e-16 ***
## factor(case)43  0.538917   0.053290  10.113  < 2e-16 ***
## factor(case)44  0.946063   0.061202  15.458  < 2e-16 ***
## factor(case)45 -1.672061   0.076523 -21.850  < 2e-16 ***
## factor(case)46 -1.923394   0.085197 -22.576  < 2e-16 ***
## factor(case)47  0.297558   0.051103   5.823 6.56e-09 ***
## factor(case)48  0.045901   0.049675   0.924 0.355567    
## factor(case)49  1.050663   0.062931  16.696  < 2e-16 ***
## factor(case)50 -1.102052   0.062325 -17.682  < 2e-16 ***
## factor(time)2   0.009865   0.089885   0.110 0.912618    
## factor(time)3  -0.027398   0.115374  -0.237 0.812314    
## factor(time)4  -0.039227   0.044426  -0.883 0.377342    
## factor(time)5  -0.038629   0.098266  -0.393 0.694277    
## factor(time)6   0.015206   0.081771   0.186 0.852496    
## factor(time)7  -0.052421   0.051843  -1.011 0.312051    
## factor(time)8  -0.009808   0.095634  -0.103 0.918319    
## factor(time)9  -0.029774   0.042955  -0.693 0.488287    
## factor(time)10 -0.073201   0.043597  -1.679 0.093275 .  
## factor(time)11 -0.034386   0.085805  -0.401 0.688639    
## factor(time)12  0.003032   0.112421   0.027 0.978486    
## factor(time)13 -0.092023   0.051904  -1.773 0.076362 .  
## factor(time)14  0.017896   0.051201   0.350 0.726720    
## factor(time)15 -0.099884   0.057487  -1.738 0.082424 .  
## factor(time)16 -0.059046   0.077744  -0.759 0.447632    
## factor(time)17 -0.066464   0.044756  -1.485 0.137669    
## factor(time)18 -0.039199   0.071529  -0.548 0.583733    
## factor(time)19 -0.040240   0.093375  -0.431 0.666546    
## factor(time)20 -0.019978   0.049628  -0.403 0.687311    
## factor(time)21 -0.021785   0.064293  -0.339 0.734760    
## factor(time)22 -0.016429   0.049358  -0.333 0.739273    
## factor(time)23  0.016278   0.055739   0.292 0.770279    
## factor(time)24 -0.021054   0.138843  -0.152 0.879486    
## factor(time)25 -0.016068   0.054798  -0.293 0.769382    
## factor(time)26 -0.054303   0.135548  -0.401 0.688739    
## factor(time)27 -0.034151   0.043223  -0.790 0.429545    
## factor(time)28 -0.075377   0.074360  -1.014 0.310839    
## factor(time)29 -0.046340   0.077258  -0.600 0.548686    
## factor(time)30 -0.062577   0.053830  -1.162 0.245150    
## factor(time)31 -0.049767   0.101325  -0.491 0.623362    
## factor(time)32 -0.037785   0.087273  -0.433 0.665091    
## factor(time)33 -0.052008   0.103393  -0.503 0.614999    
## factor(time)34 -0.059567   0.100140  -0.595 0.552009    
## factor(time)35 -0.104873   0.094761  -1.107 0.268534    
## factor(time)36 -0.010170   0.062857  -0.162 0.871484    
## factor(time)37 -0.082164   0.053289  -1.542 0.123243    
## factor(time)38 -0.057402   0.042945  -1.337 0.181464    
## factor(time)39 -0.056943   0.101323  -0.562 0.574169    
## factor(time)40  0.007376   0.054276   0.136 0.891908    
## factor(time)41 -0.070988   0.084778  -0.837 0.402484    
## factor(time)42  0.010645   0.099772   0.107 0.915043    
## factor(time)43 -0.013565   0.092496  -0.147 0.883418    
## factor(time)44 -0.015229   0.090355  -0.169 0.866166    
## factor(time)45  0.010920   0.054013   0.202 0.839797    
## factor(time)46 -0.060932   0.043387  -1.404 0.160328    
## factor(time)47 -0.020632   0.105521  -0.196 0.844996    
## factor(time)48 -0.038896   0.064256  -0.605 0.545015    
## factor(time)49  0.028808   0.066612   0.432 0.665437    
## factor(time)50        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2479 on 2401 degrees of freedom
## Multiple R-squared:  0.9187, Adjusted R-squared:  0.9154 
## F-statistic:   277 on 98 and 2401 DF,  p-value: < 2.2e-16

You might expect that perhaps the two-way coefficient would be somewhere between the cross-sectional and within-case coefficients. Nope. It’s equal in this simulation to roughly 1, but that depends on how much variance there is in the cross-section and the within-case components of the dataset as they are now being combined in a non-linear fashion (if you can’t wrap your mind around this, don’t worry, it’s nearly impossible to do and unfruitful if you do accomplish it). However, the coefficient is still equal to a value that appears reasonable, although it is much less precise, and we might be willing to traipse along with it. However, you should pay attention to what happened in the model summary above–one of the time dummies (factor(time)50) came back with a missing (NA) coefficient! What is that about?

I’m so glad you asked. As we mentioned earlier, R has the ability to check for multi-collinearity in the predictor variables to avoid getting errors when attempting to estimate linear regression. Multi-collinearity can arise for reasons that have nothing to do with fixed effects, such as accidentally including variables that are sums of each other, etc. It often happens in fixed effects regression models when people try to include variables that only vary in the cross-section when using case fixed effects (a symptom of the Cheeseburger Syndrome mentioned previously). In this instance, though, there is no reason to think that there is multi-collinearity as we generated the data with two continuous variables (\(x\) and \(y\)). The data are perfect with no missingness. So how could this happen?

This discovery is one of the central contributions of our paper. Whenever panel data has a single effect for the cross-section and for within-case variation–as is the dominant way of thinking about panel data–the two-way fixed effects coefficient is statistically unidentified. That is, trying to estimate it is equivalent to saying 2 + 2 = 5 or the earth is as round as a square. As we show in the paper, it algebraically reduces to dividing by zero. R magically saved the day by dropping a variable, but we can show what would happen if R had not intervened.

In the code below I manually estimate a regression using the matrix inversion formula, \((X^TX)^{-1}X^Ty\), where \(X\) in this case is all of our predictor variables, including the case/time dummies:

X <- model.matrix(y~x + factor(case) + factor(time),data=gen_data$data)
y <- gen_data$data$y
try(solve(t(X)%*%X)%*%t(X)%*%y)
## Error in solve.default(t(X) %*% X) : 
##   system is computationally singular: reciprocal condition number = 5.63975e-19

You can see the error message: the system (matrix computation of OLS regression) is computationally singular. You’ll need to see the paper for how this all breaks down, but essentially the question you are putting to R is nonsensical. You simply cannot estimate a single joint effect while including all the variables. Instead, you need to drop one, which essentially means the effect is relative to the whichever intercept happened to be dropped (in this case time point 50). The coefficient could change if we simply re-arrange the labeling of the time fixed effects, as in the following:

gen_data$data$time <- as.numeric(factor(gen_data$data$time,levels=sample(unique(gen_data$data$time))))
summary(lm(y~x + factor(case) + factor(time),data=gen_data$data))
## 
## Call:
## lm(formula = y ~ x + factor(case) + factor(time), data = gen_data$data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.87158 -0.16705 -0.00396  0.16311  0.78026 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.827062   0.058816  14.062  < 2e-16 ***
## x               0.892381   0.169360   5.269 1.49e-07 ***
## factor(case)2  -0.181065   0.049934  -3.626 0.000294 ***
## factor(case)3   1.423934   0.082322  17.297  < 2e-16 ***
## factor(case)4  -1.327772   0.074606 -17.797  < 2e-16 ***
## factor(case)5  -0.863081   0.059798 -14.433  < 2e-16 ***
## factor(case)6   1.065727   0.069627  15.306  < 2e-16 ***
## factor(case)7  -0.966975   0.062893 -15.375  < 2e-16 ***
## factor(case)8  -1.554645   0.083062 -18.717  < 2e-16 ***
## factor(case)9  -2.069070   0.100829 -20.521  < 2e-16 ***
## factor(case)10 -0.029652   0.049597  -0.598 0.549997    
## factor(case)11 -1.138408   0.070718 -16.098  < 2e-16 ***
## factor(case)12 -0.130897   0.049657  -2.636 0.008442 ** 
## factor(case)13 -0.597254   0.054619 -10.935  < 2e-16 ***
## factor(case)14  0.841102   0.062910  13.370  < 2e-16 ***
## factor(case)15 -0.845316   0.060527 -13.966  < 2e-16 ***
## factor(case)16  0.498709   0.055464   8.992  < 2e-16 ***
## factor(case)17 -0.615509   0.057060 -10.787  < 2e-16 ***
## factor(case)18 -0.678563   0.056754 -11.956  < 2e-16 ***
## factor(case)19  0.783446   0.061177  12.806  < 2e-16 ***
## factor(case)20 -0.743853   0.058618 -12.690  < 2e-16 ***
## factor(case)21 -0.581894   0.054834 -10.612  < 2e-16 ***
## factor(case)22 -1.367817   0.076249 -17.939  < 2e-16 ***
## factor(case)23  0.253267   0.051327   4.934 8.59e-07 ***
## factor(case)24  0.945844   0.065030  14.545  < 2e-16 ***
## factor(case)25 -0.692276   0.056805 -12.187  < 2e-16 ***
## factor(case)26 -0.699163   0.057758 -12.105  < 2e-16 ***
## factor(case)27 -0.922121   0.062427 -14.771  < 2e-16 ***
## factor(case)28 -3.108184   0.141697 -21.936  < 2e-16 ***
## factor(case)29  0.115992   0.050029   2.318 0.020507 *  
## factor(case)30 -0.939410   0.063371 -14.824  < 2e-16 ***
## factor(case)31 -1.077463   0.068540 -15.720  < 2e-16 ***
## factor(case)32 -0.866360   0.060394 -14.345  < 2e-16 ***
## factor(case)33 -0.895419   0.062541 -14.317  < 2e-16 ***
## factor(case)34  0.570972   0.056476  10.110  < 2e-16 ***
## factor(case)35 -0.007546   0.049593  -0.152 0.879081    
## factor(case)36 -1.040438   0.066215 -15.713  < 2e-16 ***
## factor(case)37 -2.466600   0.115948 -21.273  < 2e-16 ***
## factor(case)38 -0.681990   0.056674 -12.034  < 2e-16 ***
## factor(case)39  0.376359   0.053819   6.993 3.47e-12 ***
## factor(case)40 -1.255870   0.073466 -17.095  < 2e-16 ***
## factor(case)41 -2.459329   0.115618 -21.271  < 2e-16 ***
## factor(case)42 -1.674928   0.083011 -20.177  < 2e-16 ***
## factor(case)43  0.524163   0.054686   9.585  < 2e-16 ***
## factor(case)44  0.918966   0.065225  14.089  < 2e-16 ***
## factor(case)45 -1.628044   0.084840 -19.190  < 2e-16 ***
## factor(case)46 -1.871074   0.095680 -19.556  < 2e-16 ***
## factor(case)47  0.288205   0.051692   5.575 2.75e-08 ***
## factor(case)48  0.043567   0.049713   0.876 0.380918    
## factor(case)49  1.021395   0.067481  15.136  < 2e-16 ***
## factor(case)50 -1.073533   0.066692 -16.097  < 2e-16 ***
## factor(time)2  -0.069009   0.100571  -0.686 0.492671    
## factor(time)3  -0.108184   0.073336  -1.475 0.140295    
## factor(time)4   0.002837   0.043759   0.065 0.948307    
## factor(time)5  -0.011388   0.082981  -0.137 0.890857    
## factor(time)6  -0.079838   0.110453  -0.723 0.469861    
## factor(time)7  -0.066293   0.048598  -1.364 0.172660    
## factor(time)8  -0.135578   0.123915  -1.094 0.274015    
## factor(time)9  -0.149671   0.096133  -1.557 0.119623    
## factor(time)10 -0.135415   0.137057  -0.988 0.323244    
## factor(time)11 -0.046205   0.101477  -0.455 0.648917    
## factor(time)12 -0.094439   0.062068  -1.522 0.128252    
## factor(time)13 -0.056940   0.061062  -0.933 0.351172    
## factor(time)14 -0.074651   0.076729  -0.973 0.330690    
## factor(time)15 -0.064301   0.086107  -0.747 0.455286    
## factor(time)16 -0.148654   0.105819  -1.405 0.160210    
## factor(time)17 -0.109373   0.139416  -0.785 0.432820    
## factor(time)18 -0.023932   0.052512  -0.456 0.648610    
## factor(time)19 -0.054156   0.088109  -0.615 0.538842    
## factor(time)20 -0.035535   0.098979  -0.359 0.719611    
## factor(time)21 -0.116553   0.143594  -0.812 0.417053    
## factor(time)22 -0.010896   0.065124  -0.167 0.867139    
## factor(time)23 -0.030352   0.055015  -0.552 0.581207    
## factor(time)24 -0.097372   0.166084  -0.586 0.557742    
## factor(time)25 -0.033847   0.056850  -0.595 0.551644    
## factor(time)26 -0.144292   0.145023  -0.995 0.319857    
## factor(time)27 -0.076492   0.070012  -1.093 0.274696    
## factor(time)28 -0.081450   0.086159  -0.945 0.344576    
## factor(time)29 -0.059062   0.059659  -0.990 0.322275    
## factor(time)30 -0.069904   0.049340  -1.417 0.156679    
## factor(time)31 -0.116713   0.102264  -1.141 0.253863    
## factor(time)32 -0.050535   0.050595  -0.999 0.317977    
## factor(time)33 -0.025185   0.043191  -0.583 0.559876    
## factor(time)34 -0.026497   0.046296  -0.572 0.567147    
## factor(time)35 -0.059114   0.048596  -1.216 0.223940    
## factor(time)36 -0.109133   0.105935  -1.030 0.303025    
## factor(time)37 -0.119734   0.053185  -2.251 0.024457 *  
## factor(time)38 -0.042987   0.101938  -0.422 0.673288    
## factor(time)39 -0.035869   0.047662  -0.753 0.451788    
## factor(time)40 -0.113464   0.165575  -0.685 0.493239    
## factor(time)41 -0.059933   0.110559  -0.542 0.587805    
## factor(time)42 -0.009158   0.057266  -0.160 0.872963    
## factor(time)43 -0.126494   0.150871  -0.838 0.401878    
## factor(time)44 -0.059636   0.047396  -1.258 0.208424    
## factor(time)45 -0.039694   0.107192  -0.370 0.711183    
## factor(time)46 -0.056278   0.054291  -1.037 0.300026    
## factor(time)47 -0.146145   0.129554  -1.128 0.259404    
## factor(time)48 -0.088757   0.069468  -1.278 0.201491    
## factor(time)49 -0.136982   0.103233  -1.327 0.184661    
## factor(time)50        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2479 on 2401 degrees of freedom
## Multiple R-squared:  0.9187, Adjusted R-squared:  0.9154 
## F-statistic:   277 on 98 and 2401 DF,  p-value: < 2.2e-16

As you can see, the coefficient changed because we arbitrarily swapped the order of the time fixed effects, and as R will drop the last one in the case of multi-collinearity, the estimate changed.

At this point, you should be concerned. One of our frustrations with this piece is that although the working paper has been in circulation for years, no one seems to have cared about this apparently quite important problem.

You might wonder though that you’ve seen two-way fixed effects models where this didn’t happen. As we show in the paper, the two-way FE model is identified if we generate the data differently. What we have to do is use a different effect of \(x\) on \(y\) for each time point/case, or what can be called a varying slopes model (slope for regression coefficient). We are not varying the fixed effects/intercepts themselves, rather we are varying the relationship between \(x\) and \(y\) across time points and cases. We must do this to prevent the two-way FE model from being unidentified.

To demonstrate this, we will generate new data except the coefficients will vary across case and time points randomly:

# make the slopes vary with the .sd parameters
gen_data <- tw_data(N=20,
                    T=50,
                    case.eff.mean = 1,
                    cross.eff.mean = -1,
                    cross.eff.sd =.25,
                    case.eff.sd =1,
                    noise.sd=1)

summary(lm(y~x + factor(case) + factor(time),data=gen_data$data))
## 
## Call:
## lm(formula = y ~ x + factor(case) + factor(time), data = gen_data$data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5458 -0.7073  0.0026  0.6446  3.1709 
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)    -0.673576   0.264996   -2.542 0.011188 *  
## x              -0.812224   0.005523 -147.067  < 2e-16 ***
## factor(case)2   0.431969   0.201817    2.140 0.032582 *  
## factor(case)3   0.169660   0.201920    0.840 0.400994    
## factor(case)4   0.216513   0.201806    1.073 0.283604    
## factor(case)5   0.377437   0.201835    1.870 0.061794 .  
## factor(case)6   0.233856   0.201762    1.159 0.246726    
## factor(case)7   0.291743   0.202146    1.443 0.149292    
## factor(case)8   0.040054   0.201779    0.199 0.842696    
## factor(case)9   0.182815   0.201763    0.906 0.365123    
## factor(case)10  0.197116   0.201782    0.977 0.328886    
## factor(case)11  0.264956   0.201782    1.313 0.189480    
## factor(case)12  0.075685   0.201799    0.375 0.707708    
## factor(case)13  0.249261   0.201795    1.235 0.217062    
## factor(case)14  0.118726   0.201765    0.588 0.556382    
## factor(case)15  0.494571   0.201828    2.450 0.014451 *  
## factor(case)16 -0.085249   0.203634   -0.419 0.675579    
## factor(case)17  0.383878   0.201766    1.903 0.057403 .  
## factor(case)18  0.290350   0.201837    1.439 0.150619    
## factor(case)19  0.483656   0.201790    2.397 0.016734 *  
## factor(case)20  0.287368   0.201792    1.424 0.154759    
## factor(time)2   0.596679   0.319058    1.870 0.061780 .  
## factor(time)3   1.407443   0.319158    4.410 1.16e-05 ***
## factor(time)4  -0.328379   0.320361   -1.025 0.305616    
## factor(time)5  -0.119690   0.319047   -0.375 0.707634    
## factor(time)6   1.539948   0.319015    4.827 1.62e-06 ***
## factor(time)7   0.338443   0.319042    1.061 0.289052    
## factor(time)8   0.528072   0.319048    1.655 0.098232 .  
## factor(time)9  -0.032228   0.319030   -0.101 0.919556    
## factor(time)10  0.765270   0.319056    2.399 0.016657 *  
## factor(time)11  0.789097   0.319058    2.473 0.013568 *  
## factor(time)12  2.149822   0.319199    6.735 2.87e-11 ***
## factor(time)13  0.118301   0.319032    0.371 0.710862    
## factor(time)14 -0.734958   0.319015   -2.304 0.021452 *  
## factor(time)15  0.888259   0.319109    2.784 0.005486 ** 
## factor(time)16  0.553245   0.319027    1.734 0.083221 .  
## factor(time)17  0.652004   0.319030    2.044 0.041264 *  
## factor(time)18  1.351653   0.319104    4.236 2.50e-05 ***
## factor(time)19  0.260828   0.319061    0.817 0.413860    
## factor(time)20 -0.662932   0.319060   -2.078 0.038005 *  
## factor(time)21  1.509116   0.319095    4.729 2.60e-06 ***
## factor(time)22 -0.014879   0.319029   -0.047 0.962810    
## factor(time)23 -0.538377   0.319023   -1.688 0.091827 .  
## factor(time)24  0.493220   0.319046    1.546 0.122464    
## factor(time)25 -1.820174   0.319100   -5.704 1.57e-08 ***
## factor(time)26 -0.571426   0.319015   -1.791 0.073583 .  
## factor(time)27  0.093017   0.319026    0.292 0.770683    
## factor(time)28 -1.328269   0.319035   -4.163 3.43e-05 ***
## factor(time)29  0.342019   0.319038    1.072 0.283983    
## factor(time)30 -0.774338   0.319014   -2.427 0.015401 *  
## factor(time)31  0.461852   0.319026    1.448 0.148040    
## factor(time)32  0.308478   0.319046    0.967 0.333858    
## factor(time)33  1.343319   0.319094    4.210 2.80e-05 ***
## factor(time)34  1.989206   0.319258    6.231 7.03e-10 ***
## factor(time)35 -0.422407   0.319019   -1.324 0.185801    
## factor(time)36  0.614837   0.319085    1.927 0.054299 .  
## factor(time)37 -0.416240   0.319061   -1.305 0.192359    
## factor(time)38  0.779448   0.319089    2.443 0.014762 *  
## factor(time)39 -0.847622   0.319015   -2.657 0.008019 ** 
## factor(time)40  3.245566   0.319204   10.168  < 2e-16 ***
## factor(time)41 -0.589393   0.319031   -1.847 0.064999 .  
## factor(time)42  0.509922   0.319178    1.598 0.110469    
## factor(time)43 -0.815928   0.319029   -2.558 0.010699 *  
## factor(time)44  1.200743   0.320664    3.745 0.000192 ***
## factor(time)45 -0.677540   0.319015   -2.124 0.033946 *  
## factor(time)46  0.329598   0.319049    1.033 0.301842    
## factor(time)47  0.254068   0.319037    0.796 0.426027    
## factor(time)48  0.728453   0.319070    2.283 0.022652 *  
## factor(time)49  2.376336   0.319302    7.442 2.25e-13 ***
## factor(time)50  0.525069   0.319064    1.646 0.100172    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.009 on 930 degrees of freedom
## Multiple R-squared:  0.9625, Adjusted R-squared:  0.9598 
## F-statistic: 346.3 on 69 and 930 DF,  p-value: < 2.2e-16

Bada bing bada boom. Now there are no missing coefficients. However, the coefficient on \(x\) isn’t any easier to interpret, and in fact it is in even less transparent as it involves taking averages of averages of coefficients in the cross-section and within-case variation (say that five times fast). It is also peculiar that the model can only be fit with this kind of variation in the slopes, as opposed to the intercepts, as might appear more logical. The models with fixed effects only for cases or time points do not have this problem at all.

What makes this lack of identification particularly malicious is that it is virtually impossible to identify unless someone takes the trouble as we did to generate data from scratch. Visually, the dataset with single coefficients for cross-sectional/within-case variation appears pristine. It is only deep under the hood that the problem appears. The fact that R (and Stata has similar behavior) can magically make the problem go away by changing the data only makes it less likely that the problem will ever be identified. We simply do not know how many papers in the literature have been affected by this problem (or by similar situations where the regression slopes are almost fixed across cases or time points).

There is one important caveat to this post. The two-way fixed effects model can be a difference-in -differences model, but only if there are exactly two (2) time points. It is not possible to run a two-way FE model with many time points and call it difference-in-difference as there are the same difficulties in interpretation. We discuss this more in the paper.

Summary

In summary, fixed effects models are very useful for panel data as they help isolate dimensions of variation that matter for research questions. They are not magical tools for causal inference, but rather frameworks for understanding data. It is important to think about which dimension (cross-section or within-case) is more relevant, and then go with that dimension. All other concerns, including modeling spatial or time autocorrelation, omitted variables and endogeneity are all secondary to this first and most important point, which is what comparisons can be drawn from the data?

One important area that this can be applied to is allowing people to fit more models with fixed effects on time points rather than cases. There are many questions that can be best understood by comparing cases to each other rather than cases to themselves over time. Studying long-run institutions like electoral systems, for example, can only be understood in terms of cross-sectional variation. There is nothing unique or special about the within-case model.

Some claim that within-case variation is better because cases have less heterogeneity than the cross-section. However, there is no way this can be true a priori. If minimizing heterogeneity is the aim, then it is important to consider the time frame and how units change over time. For example, if we have a very long time series, say 200 years, and we are comparing the U.S. and Canada, then we might believe that the comparison of the U.S. of 1800 to the Canada of 1800 (the cross-section) has less noise than a comparison of the U.S. of 1800 to the U.S. of 2020 (within-case variation). We need to think more about our data and what it represents rather than taking the short-cut of employing the same model everyone else uses.

While we did not discuss random effects/hierarchical models in this post, the same principles apply even if “partial pooling” is used rather than “no pooling.” Intercepts are designed to draw comparisons, and however the intercepts are modeled, it is important to think about what they are saying about the data, and whether that statement makes any sense.

So if you don’t want to eat that cheeseburger… we release you. Go enjoy your tofu-based meat alternative with relish.

To leave a comment for the author, please follow the link and comment on their blog: R on Robert Kubinec.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.