Dynamite for Causal Inference from Panel Data using Dynamic Multivariate Panel Models

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

Panel data contains measurements from multiple subjects measured over multiple time points. Such data can be encountered in many social science applications such as when analysing register data or cohort studies (for example). Often the aim is to perform causal inference based on such observational data (instead of randomized control trials).

A new rOpensci-reviewed R package dynamite available on CRAN implements a new class of panel models called the Bayesian dynamic multivariate panel model (DMPM) which supports

  • Joint modelling of multiple response variables potentially from mixed distributions (e.g. Gaussian and categorical responses)
  • Time-varying regression coefficients modelled as splines
  • Group-level random effects (coming soon)
  • Probabilistic posterior predictive simulations for long-term causal effect estimation, including not only the average causal effects but the full interventional distributions of interest (i.e. the distribution of the response variable after an intervention).

The theory regarding the model and the subsequent causal effect estimation for panel data, with some examples, can be found in the SocArxiv preprint1 and the package vignette. In this post, I will illustrate the use of dynamite for causal inference in a non-panel setting (i.e. we have time series data on only a single “individual”).

The idea of the following example is similar to a synthetic control approach for time series causal inference, originally suggested by Abadie et al.2, and further extended and popularized by Brodersen et al.3 and their CausalImpact package.

The basic idea of the synthetic control method is that you have a time series of interest y1,,yTy_1,\ldots,y_T, for example daily sales of some product, and an intervention was made at some time point t<Tt (e.g., change in the value-added tax, VAT). You would then like to know what was the effect of the intervention on yt,,yTy_{t},\ldots,y_{T}. For example, by how much did the sales increase due to the decrease or removal of the VAT? Typically you also have one or more “control times series” xx which predict the behaviour of yy but for which the intervention has no affect (e.g. some time-varying properties of the product, the demographic variables of the market, or sales of the product in some other markets). You then build a statistical model for y1,,yt1y_1,\ldots, y_{t-1} using the control series for the same time points, use your estimated model to predict the values of yy in the intervention period t,,Tt, \ldots, T using the observed control series and compare these predictions to the observed values of yt,,yTy_t,\ldots, y_T (which experienced the intervention).

In the synthetic control approach one of the key assumptions is that the control series itself is not affected by the intervention. In the sales example above, using sales data of some distant market could be suitable control, but the change in VAT might have an effect on the sales in the neighboring markets as well, so using such a time series for the control would then violate this assumption. While applicable to valid synthetic control cases, the dynamite package can also be used in cases where not only the main response variable of interest (i.e. sales in the above example) but also the control series (sales of neighboring markets) are affected by the intervention. This is done by jointly modelling both the main response variable as well as the control time series (although it could be argued that they should not be called a ‘control’ series in this case).

Data generation

Load some packages:

library(dynamite)
library(dplyr) # Data manipulation
library(ggplot2) # Figures
library(patchwork) # Combining Figures

I will consider the following true data generating process (DGP):

xtN(0.5yt1+xt1+zt1,0.32)ytN(0.7xt1,0.32)\begin{aligned} x_t & \sim N(-0.5 y_{t-1} + x_{t-1} + z_{t-1}, 0.3^2)\ y_t & \sim N(0.7x_{t-1}, 0.3^2)\ \end{aligned}

for t=2,,100t=2, \ldots, 100, where the initial values x1x_1 and y1y_1 are drawn from the standard normal distribution (in subsequent modelling we condition on these first observations, i.e. they are treated as fixed data).

Variable ztz_t is our intervention variable, which I fixed to zero for the first 80 time points, and to one for the the last 20 time points, i.e. the intervention starts at time t=81t=81. Following the terminology in our DMPM paper1, this is a recurring intervention, in contrast to an atomic intervention where z81=1z_{81}=1 and zero otherwise. Naturally it would be possible to also consider interventions such as zt=1z_t=1 for 80<t<9080 and zt=0z_t=0 otherwise (e.g. an ad campaign starts at time 81 and ends at time 89).

Our hypothetical research question is how does zz affect yy? Looking at our data generating process it is clear that ztz_t does not affect yty_t, but it still affects yt+1,yt+2,y_{t+1},y_{t+2},\ldots via xt,xt+1,x_t,x_{t+1},\ldots Note that I chose this model just to exemplify the modelling, and these coefficients clearly do not reflect the sales example in the introduction.

I will first simulate some data according to our true model:

set.seed(1)
n <- 100
x <- y <- numeric(n)
z <- rep(0:1, times = c(80, 20))
x[1] <- rnorm(1)
y[1] <- rnorm(1)
for(i in 2:n) {
x[i] <- rnorm(1, -0.5 * y[i-1] + x[i-1] + z[i], 0.3)
y[i] <- rnorm(1, 0.7 * x[i-1], 0.3)
}
d <- data.frame(y = y, x = x, z = z, time = 1:n)
ggplot(d, aes(time)) +
geom_line(aes(y = y, colour = "y")) +
geom_line(aes(y = x, colour = "x")) +
scale_colour_discrete("Series") +
ylab("Value") + xlab("Time") +
theme_bw()
Line plot with a pink line labelled x and a blue line labelled y. The axes are Value on the Y and Time on the X. The two lines follow each other fluctuating around zero, and both jump to fluctuating around 2.5 after time 80.

Causal inference based on synthetic control with dynamite

While in the data generation the yty_t variable did not depend on the lagged value yt1y_{t-1} and ztz_t, I will nevertheless estimate a model where both xtx_t and yty_t depend on the xt1x_{t-1} and yt1y_{t-1}, as well as zz, mimicking the fact that I’m not sure about the true causal graph (structure of DGP). The model formula for the main function of the package, dynamite(), is defined by calling a special function obs() once for each “channel”, i.e. response variable. The lagged variables can be defined in the formula as lag(y), but here I use a special function lags() which by default adds lagged values of all channels to each channel:

f <- obs(y ~ z, family = "gaussian") + obs(x ~ z, family = "gaussian") + lags()
## same as 
# f <- obs(y ~ z + lag(y) + lag(x), "gaussian) + 
# obs(x ~ z + lag(y) + lag(x), "gaussian)

We can now estimate our model with dynamite() for which we need to define the data, the variable in the data defining the time index (argument time), and the grouping variable (argument group), which can be ignored in this univariate case:

fit <- dynamite(f,
data = d,
time = "time",
chains = 4, cores = 4, refresh = 0, seed = 1)

Compiling Stan model.

The actual estimation is delegated to Stan using either rstan (default) or cmdstanr backends. The last arguments are passed to rstan::sampling() which runs the Markov chain Monte Carlo for us. While rstan is the default backend as it is available at CRAN, we recommend the often more efficient cmdstanr or the latest development version of rstan, available at Stan’s repo for R packages.

Let’s see some results:

options(width = 90) # expand the width so that the column names are not cut off
fit

Model:
Family Formula
y gaussian y ~ z
x gaussian x ~ z
Lagged responses added as predictors with: k = 1
Data: d (Number of observations: 198)
Grouping variable: .group (Number of groups: 1)
Time index variable: time (Number of time points: 100)
Smallest bulk-ESS: 2259 (beta_y_x_lag1)
Smallest tail-ESS: 2353 (beta_y_x_lag1)
Largest Rhat: 1.003 (beta_x_z)
Elapsed time (seconds):
warmup sample
chain:1 5.544 5.377
chain:2 4.979 4.286
chain:3 5.295 5.501
chain:4 4.693 5.600
Summary statistics of the time-invariant parameters:
# A tibble: 10 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 alpha_x 0.0115 0.0114 0.0320 0.0325 -0.0416 0.0652 1.00 5866. 2920.
2 alpha_y -0.00536 -0.00514 0.0322 0.0317 -0.0574 0.0483 1.00 5476. 2733.
3 beta_x_x_lag1 0.989 0.988 0.0573 0.0557 0.897 1.08 1.00 2362. 2566.
4 beta_x_y_lag1 -0.572 -0.572 0.0683 0.0682 -0.685 -0.458 1.00 3211. 3189.
5 beta_x_z 1.22 1.22 0.141 0.139 0.991 1.46 1.00 3166. 3016.
6 beta_y_x_lag1 0.656 0.657 0.0614 0.0637 0.556 0.757 1.00 2259. 2353.
7 beta_y_y_lag1 0.0837 0.0839 0.0706 0.0690 -0.0341 0.197 1.00 2682. 2559.
8 beta_y_z -0.0115 -0.0120 0.148 0.144 -0.253 0.236 1.00 3186. 2381.
9 sigma_x 0.278 0.277 0.0201 0.0202 0.247 0.313 1.00 3508. 2605.
10 sigma_y 0.290 0.289 0.0209 0.0205 0.258 0.326 1.00 3874. 3218.

The coefficient estimates are pretty much in line with the data generation, but notice the relatively large posterior standard errors of the coefficients related to zz; this is due to the fact that we have only a single series and single changepoint at time t=81t=81.

We can now perform some posterior predictive checks. First, we can check how well the posterior samples of our one-step-ahead predictions match with the observations by using the fitted() method and visualizing these posterior predictive distributions (I’ll one plot the estimates for the variable yy for simplicity):

out <- fitted(fit) |>
group_by(time) |>
summarise(
mean = mean(y_fitted),
lwr80 = quantile(y_fitted, 0.1, na.rm = TRUE), # na.rm as t = 1 is fixed
upr80 = quantile(y_fitted, 0.9, na.rm = TRUE),
lwr95 = quantile(y_fitted, 0.025, na.rm = TRUE),
upr95 = quantile(y_fitted, 0.975, na.rm = TRUE))
ggplot(out, aes(time, mean)) +
geom_ribbon(aes(ymin = lwr95, ymax = upr95), alpha = 0.3, fill = "#ED3535") +
geom_ribbon(aes(ymin = lwr80, ymax = upr80), alpha = 0.3, fill = "#ED3535") +
geom_line(colour = "#ED3535") +
geom_line(data = d, aes(y = y), colour = "black") +
xlab("Time") + ylab("Value") +
theme_bw()

Warning: Removed 1 row(s) containing missing values (geom_path).
Line plot with a black line and a red line with a pink ribbon. The axes are Value on the Y and Time on the X. The two lines follow each other fluctuating around zero, and both jump to fluctuating around 2.5 after time 80.

Note that these are not real out-of-sample predictions as the posterior samples of model parameters used for these predictions are based on all our observations, which would be especially problematic for a model containing time-varying components (e.g., splines). A more “honest” (and time consuming) approach would be to use approximate leave-future-out cross-validation via dynamite’s lfo() function.

Given the posterior samples of the model parameters, I can also make some counterfactual predictions (how y81,,y100y_{81},\ldots,y_{100} would have looked like if no intervention was made, i.e. if ztz_t was zero for all tt). First I create a new data frame where z=0z=0 for all time points, and where yy and xx are set to missing values starting from t=81t=81 (the time point where started our intervention z=1z=1):

newdata <- d
newdata$z <- 0
newdata$y[81:100] <- NA
newdata$x[81:100] <- NA

I then input this new data to the predict() method and define that I want posterior samples of expected values instead of new observations by using type = "mean" (new observations are still simulated behind the scenes in order to move forward in time):

pred <- predict(fit, newdata = newdata, type = "mean") |>
filter(time > 80)
head(pred)

 y_mean x_mean .group time .draw z y x
1 0.04691573 -0.40116363 1 81 1 0 NA NA
2 -0.44723418 -0.87026782 1 82 1 0 NA NA
3 -0.28467017 -0.38001229 1 83 1 0 NA NA
4 -0.36577537 -0.45640698 1 84 1 0 NA NA
5 0.01759079 0.01213885 1 85 1 0 NA NA
6 0.10382535 0.12547867 1 86 1 0 NA NA

From these I compute the posterior mean, 80% and 95% intervals for each time point:

sumr <- pred |>
group_by(time) |>
summarise(
mean = mean(y_mean),
lwr80 = quantile(y_mean, 0.1),
upr80 = quantile(y_mean, 0.9),
lwr95 = quantile(y_mean, 0.025),
upr95 = quantile(y_mean, 0.975))

And some figures, following similar visualization style as popularized by the CausalImpact package, consisting of the actual predictions, difference compared to observed values, and cumulative differences of these:

p1 <- ggplot(sumr, aes(time, mean)) +
geom_ribbon(aes(ymin = lwr95, ymax = upr95), alpha = 0.3, fill = "#ED3535") +
geom_ribbon(aes(ymin = lwr80, ymax = upr80), alpha = 0.3, fill = "#ED3535") +
geom_line(colour = "#ED3535") +
geom_line(data = d, aes(y = y), colour = "black") +
xlab("Time") + ylab("Value") +
theme_bw()
sumr$y_obs <- y[81:100]
p2 <- sumr |>
mutate(
y_obs_diff = y_obs - mean,
lwr_diff80 = y_obs - lwr80,
upr_diff80 = y_obs - upr80,
lwr_diff95 = y_obs - lwr95,
upr_diff95 = y_obs - upr95
) |>
ggplot(aes(time, y_obs_diff)) +
geom_ribbon(aes(ymin = lwr_diff95, ymax = upr_diff95),
alpha = 0.3, fill = "#ED3535") +
geom_ribbon(aes(ymin = lwr_diff80, ymax = upr_diff80),
alpha = 0.3, fill = "#ED3535") +
geom_line(colour = "#ED3535") +
xlab("Time") + ylab("Predicted value - observed value") +
theme_bw()
cs_sumr <- pred |>
group_by(.draw) |>
summarise(
cs = cumsum(d$y[81:n] - y_mean), across()) |>
group_by(time) |>
summarise(mean = mean(cs),
lwr = quantile(cs, 0.025, na.rm = TRUE),
upr = quantile(cs, 0.975, na.rm = TRUE))

`summarise()` has grouped output by '.draw'. You can override using the `.groups`
argument.

p3 <- ggplot(cs_sumr, aes(time, mean)) +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3, fill = "#ED3535") +
geom_line(colour = "#ED3535") +
xlab("Time") +
ylab("Cumulative difference") +
theme_bw()
p1 + p2 + p3 + plot_layout(ncol = 1)
Three stacked line plots, each with time as the x axis. From top to bottom, the y-axes are Value, Predicted minus Observed value, and Cumulative differences. The top figure shows a pink ribbon deviating from a black line in the 80-100 time range. The middle figure shows a fluctuating pink ribbon. The bottom figure shows an increasing ribbon.

In the top figure, we see the predictions for the counterfactual case where no intervention was done (zt=0z_t=0 for all tt; pink ribbon/red line), whereas in the middle figure I have drawn the difference between predicted values and the actual observations for times t=81,,100t=81,\ldots,100, which show clear effect of intervention (the difference between between observations and predictions do not fluctuate around zero). In the bottom figure we see the cumulative difference between observations and our predictions, which emphasizes how the cumulative effect keeps increasing during the whole study period instead of tapering off or disappearing completely.

Finally we can consider a case where we assume that the intervention affects only a single response variable, yy, and xx does not depend on yy, Which, since we created this data, we know would be incorrect. This is essentially same as treating xx as exogenous and estimating a single-response model for yy. But in our case we can also proceed with our original model, and just modify the previous simulation so that the variable xx is fixed to its observed values:

newdata <- d
newdata$z <- 0
newdata$y[81:100] <- NA
pred_fixed_x <- predict(fit, newdata = newdata, type = "mean") |>
filter(time > 80)
sumr_fixed_x <- pred_fixed_x |>
group_by(time) |>
summarise(
mean = mean(y_mean),
lwr80 = quantile(y_mean, 0.1),
upr80 = quantile(y_mean, 0.9),
lwr95 = quantile(y_mean, 0.025),
upr95 = quantile(y_mean, 0.975))
p1 <- ggplot(sumr_fixed_x, aes(time, mean)) +
geom_ribbon(aes(ymin = lwr95, ymax = upr95), alpha = 0.3, fill = "#ED3535") +
geom_ribbon(aes(ymin = lwr80, ymax = upr80), alpha = 0.3, fill = "#ED3535") +
geom_line(colour = "#ED3535") +
geom_line(data = d, aes(y = y), colour = "black") +
xlab("Time") + ylab("Value") +
theme_bw()
sumr_fixed_x$y_obs <- y[81:100]
p2 <- sumr_fixed_x |>
mutate(
y_obs_diff = y_obs - mean,
lwr_diff80 = y_obs - lwr80,
upr_diff80 = y_obs - upr80,
lwr_diff95 = y_obs - lwr95,
upr_diff95 = y_obs - upr95
) |>
ggplot(aes(time, y_obs_diff)) +
geom_ribbon(aes(ymin = lwr_diff95, ymax = upr_diff95),
alpha = 0.3, fill = "#ED3535") +
geom_ribbon(aes(ymin = lwr_diff80, ymax = upr_diff80),
alpha = 0.3, fill = "#ED3535") +
geom_line(colour = "#ED3535") +
xlab("Time") + ylab("Predicted value - observed value") +
theme_bw()
cs_sumr_fixed_x <- pred_fixed_x |>
group_by(.draw) |>
summarise(
cs = cumsum(d$y[81:n] - y_mean), across()) |>
group_by(time) |>
summarise(mean = mean(cs),
lwr = quantile(cs, 0.025, na.rm = TRUE),
upr = quantile(cs, 0.975, na.rm = TRUE))

`summarise()` has grouped output by '.draw'. You can override using the `.groups`
argument.

p3 <- ggplot(cs_sumr_fixed_x, aes(time, mean)) +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3, fill = "#ED3535") +
geom_line(colour = "#ED3535") +
xlab("Time") +
ylab("Cumulative difference") +
theme_bw()
p1 + p2 + p3 + plot_layout(ncol = 1)
Three stacked line plots, each with time as the x axis. From top to bottom, the y-axes are Value, Predicted minus Observed value, and Cumulative differences. The top figure shows a pink ribbon matching a black line in the 80-100 time range. The middle figure shows a fluctuating pink ribbon. The bottom figure shows a ribbon funnelling wider around zero.

As expected, because we treated variable xx as fixed, and the intervention only affects yy only via xx, we see that the counterfactual predictions and the observed series are very similar and would (incorrecly) conclude that our original intervention did not affect yy. Therefore, by using the dynamite package to model multiple response variables we are able to capture patterns we would have otherwise missed.

Future directions

In future, we plan to add more distributions such as Weibull, multinomial, and tt-distribution for the response variables and improve the tools for visualization of the model parameters and predictions. We would also be very interested in hearing how the package is used in various applications, especially if you can share your data openly. Pull requests and other contributions are very welcome.

Acknowledgements

The package was created by Santtu Tikka and Jouni Helske as part of PREDLIFE project, funded by the Academy of Finland. The package was reviewed by Nicholas Clark and Lucy D’Agostino McGowan.


  1. Helske J, Tikka S (2022). Estimating Causal Effects from Panel Data with Dynamic Multivariate Panel Models. SocArxiv Preprint. doi.org/10.31235/osf.io/mdwu5. ↩︎

  2. Abadie, A, Gardeazabal, J (2003). The Economic Costs of Conflict: A Case Study of the Basque Country. The American Economic Review, 93(1), 113–132. doi.org/10.1257/000282803321455188 ↩︎

  3. Brodersen KH, Gallusser F, Koehler J, Remy N, and Scott SL (2015). Annals of Applied Statistics. Inferring causal impact using Bayesian structural time-series models. doi.org/10.1214/14-AOAS788 ↩︎

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)