Dynamite for Causal Inference from Panel Data using Dynamic Multivariate Panel Models


Introduction

Panel data consist of measurements from multiple subjects observed over multiple time points. Such data are encountered in many social science applications, for example when analysing register data or cohort studies. Often the aim is to perform causal inference based on such observational data (instead of randomized controlled trials).

A new rOpenSci-reviewed R package dynamite, available on CRAN, implements a new class of panel models, the Bayesian dynamic multivariate panel model (DMPM), which supports, among other things, the joint modelling of multiple response variables ("channels").

The theory regarding the model and the subsequent causal effect estimation for panel data, with some examples, can be found in the SocArxiv preprint1 and the package vignette. In this post, I will illustrate the use of dynamite for causal inference in a non-panel setting (i.e. we have time series data on only a single “individual”).

The idea of the following example is similar to a synthetic control approach for time series causal inference, originally suggested by Abadie et al.2, and further extended and popularized by Brodersen et al.3 and their CausalImpact package.

The basic idea of the synthetic control method is that you have a time series of interest $y_1, \ldots, y_T$, for example daily sales of some product, and an intervention was made at some time point $t < T$ (e.g., a change in the value-added tax, VAT). You would then like to know the effect of the intervention on $y_t, \ldots, y_T$: for example, by how much did the sales increase due to the decrease or removal of the VAT? Typically you also have one or more "control time series" $x$ which predict the behaviour of $y$ but on which the intervention has no effect (e.g. some time-varying properties of the product, the demographic variables of the market, or sales of the product in some other markets). You then build a statistical model for $y_1, \ldots, y_{t-1}$ using the control series for the same time points, use your estimated model to predict the values of $y$ in the intervention period $t, \ldots, T$ given the observed control series, and compare these predictions to the observed values of $y_t, \ldots, y_T$ (which experienced the intervention).

In the synthetic control approach, one of the key assumptions is that the control series itself is not affected by the intervention. In the sales example above, sales data from some distant market could be a suitable control, but the change in VAT might affect sales in the neighboring markets as well, so using such a time series as the control would violate this assumption. While applicable to valid synthetic control cases, the dynamite package can also be used in cases where not only the main response variable of interest (sales in the above example) but also the control series (sales in neighboring markets) are affected by the intervention. This is done by jointly modelling the main response variable and the control time series (although it could be argued that the latter should not be called a 'control' series in this case).

Data generation

Load some packages:

library(dynamite)
library(dplyr) # Data manipulation
library(ggplot2) # Figures
library(patchwork) # Combining Figures

I will consider the following true data generating process (DGP):

$$
\begin{aligned}
x_t &\sim N(-0.5 y_{t-1} + x_{t-1} + z_t, 0.3^2) \\
y_t &\sim N(0.7 x_{t-1}, 0.3^2)
\end{aligned}
$$

for $t = 2, \ldots, 100$, where the initial values $x_1$ and $y_1$ are drawn from the standard normal distribution (in subsequent modelling we condition on these first observations, i.e. they are treated as fixed data).

Variable $z_t$ is our intervention variable, which I fixed to zero for the first 80 time points and to one for the last 20 time points, i.e. the intervention starts at time $t = 81$. Following the terminology in our DMPM paper1, this is a recurring intervention, in contrast to an atomic intervention where $z_{81} = 1$ and $z_t = 0$ otherwise. Naturally, it would also be possible to consider interventions such as $z_t = 1$ for $80 < t < 90$ and $z_t = 0$ otherwise (e.g. an ad campaign that starts at time 81 and ends at time 89).
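
To make the distinction concrete, here is a small base R sketch (the names are mine, purely for illustration) of how these three intervention patterns could be encoded as indicator vectors of length 100:

z_recurring <- rep(0:1, times = c(80, 20))         # z_t = 1 for t = 81, ..., 100
z_atomic    <- as.integer(seq_len(100) == 81)      # z_81 = 1 and zero otherwise
z_campaign  <- as.integer(seq_len(100) %in% 81:89) # z_t = 1 for 80 < t < 90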

Our hypothetical research question is: how does $z$ affect $y$? Looking at our data generating process, it is clear that $z_t$ does not affect $y_t$, but it still affects $y_{t+1}, y_{t+2}, \ldots$ via $x_t, x_{t+1}, \ldots$. Note that I chose this model just to exemplify the modelling, and these coefficients clearly do not reflect the sales example in the introduction.

I will first simulate some data according to our true model:

set.seed(1)
n <- 100
x <- y <- numeric(n)
z <- rep(0:1, times = c(80, 20))
x[1] <- rnorm(1)
y[1] <- rnorm(1)
for (i in 2:n) {
  x[i] <- rnorm(1, -0.5 * y[i - 1] + x[i - 1] + z[i], 0.3)
  y[i] <- rnorm(1, 0.7 * x[i - 1], 0.3)
}
d <- data.frame(y = y, x = x, z = z, time = 1:n)
ggplot(d, aes(time)) +
  geom_line(aes(y = y, colour = "y")) +
  geom_line(aes(y = x, colour = "x")) +
  scale_colour_discrete("Series") +
  ylab("Value") + xlab("Time") +
  theme_bw()

Causal inference based on synthetic control with dynamite

While in the data generation the $y_t$ variable did not depend on the lagged value $y_{t-1}$ or on $z_t$, I will nevertheless estimate a model where both $x_t$ and $y_t$ depend on $x_{t-1}$ and $y_{t-1}$ as well as on $z_t$, mimicking the fact that I'm not sure about the true causal graph (the structure of the DGP). The model formula for the main function of the package, dynamite(), is defined by calling a special function obs() once for each "channel", i.e. response variable. Lagged variables can be defined in the formula as lag(y), but here I use a special function lags() which by default adds lagged values of all channels to each channel:

f <- obs(y ~ z, family = "gaussian") +
  obs(x ~ z, family = "gaussian") +
  lags()
## same as
# f <- obs(y ~ z + lag(y) + lag(x), family = "gaussian") +
#   obs(x ~ z + lag(y) + lag(x), family = "gaussian")
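
If higher-order lags were needed, lags() also accepts a k argument giving the lag orders to include (a sketch assuming the interface described in ?lags, rather than a definitive recipe):

# Hypothetical variant: add both first and second lags of all channels
f2 <- obs(y ~ z, family = "gaussian") +
  obs(x ~ z, family = "gaussian") +
  lags(k = 1:2)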

We can now estimate our model with dynamite(), for which we need to define the data and the variable in the data defining the time index (argument time); the grouping variable (argument group) can be ignored here as we have only a single "individual":

fit <- dynamite(f,
  data = d,
  time = "time",
  chains = 4, cores = 4, refresh = 0, seed = 1
)

Compiling Stan model.

The actual estimation is delegated to Stan using either the rstan (default) or cmdstanr backend. The last arguments are passed to rstan::sampling(), which runs the Markov chain Monte Carlo for us. While rstan is the default backend because it is available on CRAN, we recommend the often more efficient cmdstanr, or the latest development version of rstan, available from Stan's repository for R packages.
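
Switching the backend should only be a matter of a single argument; a minimal sketch, assuming the cmdstanr package and CmdStan are installed:

# Same model as above, but estimated via the cmdstanr backend
fit_cmdstanr <- dynamite(f,
  data = d,
  time = "time",
  backend = "cmdstanr",
  chains = 4, cores = 4, refresh = 0, seed = 1
)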

Let’s see some results:

options(width = 90) # expand the width so that the column names are not cut off
fit

Model:
  Family   Formula
y gaussian  y ~ z
x gaussian  x ~ z

Lagged responses added as predictors with: k = 1

Data: d (Number of observations: 198)
Grouping variable: .group (Number of groups: 1)
Time index variable: time (Number of time points: 100)

Smallest bulk-ESS: 2259 (beta_y_x_lag1)
Smallest tail-ESS: 2353 (beta_y_x_lag1)
Largest Rhat: 1.003 (beta_x_z)

Elapsed time (seconds):
        warmup sample
chain:1  5.544  5.377
chain:2  4.979  4.286
chain:3  5.295  5.501
chain:4  4.693  5.600

Summary statistics of the time-invariant parameters:
# A tibble: 10 × 10
   variable          mean   median     sd    mad      q5     q95  rhat ess_bulk ess_tail
   <chr>            <dbl>    <dbl>  <dbl>  <dbl>   <dbl>   <dbl> <dbl>    <dbl>    <dbl>
 1 alpha_x        0.0115   0.0114  0.0320 0.0325 -0.0416  0.0652  1.00    5866.    2920.
 2 alpha_y       -0.00536 -0.00514 0.0322 0.0317 -0.0574  0.0483  1.00    5476.    2733.
 3 beta_x_x_lag1  0.989    0.988   0.0573 0.0557  0.897   1.08    1.00    2362.    2566.
 4 beta_x_y_lag1 -0.572   -0.572   0.0683 0.0682 -0.685  -0.458   1.00    3211.    3189.
 5 beta_x_z       1.22     1.22    0.141  0.139   0.991   1.46    1.00    3166.    3016.
 6 beta_y_x_lag1  0.656    0.657   0.0614 0.0637  0.556   0.757   1.00    2259.    2353.
 7 beta_y_y_lag1  0.0837   0.0839  0.0706 0.0690 -0.0341  0.197   1.00    2682.    2559.
 8 beta_y_z      -0.0115  -0.0120  0.148  0.144  -0.253   0.236   1.00    3186.    2381.
 9 sigma_x        0.278    0.277   0.0201 0.0202  0.247   0.313   1.00    3508.    2605.
10 sigma_y        0.290    0.289   0.0209 0.0205  0.258   0.326   1.00    3874.    3218.

The coefficient estimates are pretty much in line with the data generation, but notice the relatively large posterior standard deviations of the coefficients related to $z$; this is due to the fact that we have only a single series and a single changepoint at time $t = 81$.
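
To inspect these posteriors more closely, the draws can be extracted into a long-format data frame; a minimal sketch, assuming the types argument and the parameter/value columns of the package's as.data.frame() method:

# Extract posterior draws of the regression coefficients and summarise
# the two coefficients related to the intervention z
betas <- as.data.frame(fit, types = "beta")
betas |>
  filter(parameter %in% c("beta_y_z", "beta_x_z")) |>
  group_by(parameter) |>
  summarise(mean = mean(value), sd = sd(value))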

We can now perform some posterior predictive checks. First, we can check how well the posterior samples of our one-step-ahead predictions match the observations by using the fitted() method and visualizing these posterior predictive distributions (I'll only plot the estimates for the variable $y$ for simplicity):

out <- fitted(fit) |>
  group_by(time) |>
  summarise(
    mean = mean(y_fitted),
    lwr80 = quantile(y_fitted, 0.1, na.rm = TRUE), # na.rm as t = 1 is fixed
    upr80 = quantile(y_fitted, 0.9, na.rm = TRUE),
    lwr95 = quantile(y_fitted, 0.025, na.rm = TRUE),
    upr95 = quantile(y_fitted, 0.975, na.rm = TRUE)
  )
ggplot(out, aes(time, mean)) +
  geom_ribbon(aes(ymin = lwr95, ymax = upr95), alpha = 0.3, fill = "#ED3535") +
  geom_ribbon(aes(ymin = lwr80, ymax = upr80), alpha = 0.3, fill = "#ED3535") +
  geom_line(colour = "#ED3535") +
  geom_line(data = d, aes(y = y), colour = "black") +
  xlab("Time") + ylab("Value") +
  theme_bw()

Warning: Removed 1 row(s) containing missing values (geom_path).

Note that these are not real out-of-sample predictions, as the posterior samples of the model parameters used for these predictions are based on all our observations; this would be especially problematic for a model containing time-varying components (e.g., splines). A more "honest" (and time-consuming) approach would be to use approximate leave-future-out cross-validation via dynamite's lfo() function.
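
A minimal sketch of such a call, assuming lfo()'s L argument gives the number of initial time points that are always kept in the estimation data (see ?lfo for the actual interface):

# Approximate leave-future-out cross-validation; the model is refitted
# as needed when moving forward through the remaining time points
cv <- lfo(fit, L = 20)
cv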

Given the posterior samples of the model parameters, I can also make some counterfactual predictions (how $y_{81}, \ldots, y_{100}$ would have looked if no intervention had been made, i.e. if $z_t$ had been zero for all $t$). First I create a new data frame where $z = 0$ for all time points, and where $y$ and $x$ are set to missing values starting from $t = 81$ (the time point where our intervention $z = 1$ started):

newdata <- d
newdata$z <- 0
newdata$y[81:100] <- NA
newdata$x[81:100] <- NA

I then input this new data to the predict() method and specify that I want posterior samples of the expected values instead of new observations by using type = "mean" (new observations are still simulated behind the scenes in order to move forward in time):

pred <- predict(fit, newdata = newdata, type = "mean") |>
  filter(time > 80)
head(pred)

       y_mean      x_mean .group time .draw z  y  x
1  0.04691573 -0.40116363      1   81     1 0 NA NA
2 -0.44723418 -0.87026782      1   82     1 0 NA NA
3 -0.28467017 -0.38001229      1   83     1 0 NA NA
4 -0.36577537 -0.45640698      1   84     1 0 NA NA
5  0.01759079  0.01213885      1   85     1 0 NA NA
6  0.10382535  0.12547867      1   86     1 0 NA NA

From these I compute the posterior mean and the 80% and 95% intervals for each time point:

sumr <- pred |>
  group_by(time) |>
  summarise(
    mean = mean(y_mean),
    lwr80 = quantile(y_mean, 0.1),
    upr80 = quantile(y_mean, 0.9),
    lwr95 = quantile(y_mean, 0.025),
    upr95 = quantile(y_mean, 0.975)
  )

And then some figures, in a visualization style similar to that popularized by the CausalImpact package: the counterfactual predictions alongside the observations, the differences between the observed and predicted values, and the cumulative sum of these differences:

p1 <- ggplot(sumr, aes(time, mean)) +
  geom_ribbon(aes(ymin = lwr95, ymax = upr95), alpha = 0.3, fill = "#ED3535") +
  geom_ribbon(aes(ymin = lwr80, ymax = upr80), alpha = 0.3, fill = "#ED3535") +
  geom_line(colour = "#ED3535") +
  geom_line(data = d, aes(y = y), colour = "black") +
  xlab("Time") + ylab("Value") +
  theme_bw()
sumr$y_obs <- y[81:100]
p2 <- sumr |>
  mutate(
    y_obs_diff = y_obs - mean,
    lwr_diff80 = y_obs - lwr80,
    upr_diff80 = y_obs - upr80,
    lwr_diff95 = y_obs - lwr95,
    upr_diff95 = y_obs - upr95
  ) |>
  ggplot(aes(time, y_obs_diff)) +
  geom_ribbon(aes(ymin = lwr_diff95, ymax = upr_diff95),
    alpha = 0.3, fill = "#ED3535") +
  geom_ribbon(aes(ymin = lwr_diff80, ymax = upr_diff80),
    alpha = 0.3, fill = "#ED3535") +
  geom_line(colour = "#ED3535") +
  xlab("Time") + ylab("Observed value - predicted value") +
  theme_bw()
cs_sumr <- pred |>
  group_by(.draw) |>
  summarise(cs = cumsum(d$y[81:n] - y_mean), across()) |>
  group_by(time) |>
  summarise(
    mean = mean(cs),
    lwr = quantile(cs, 0.025, na.rm = TRUE),
    upr = quantile(cs, 0.975, na.rm = TRUE)
  )

`summarise()` has grouped output by '.draw'. You can override using the `.groups`
argument.

p3 <- ggplot(cs_sumr, aes(time, mean)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3, fill = "#ED3535") +
  geom_line(colour = "#ED3535") +
  xlab("Time") +
  ylab("Cumulative difference") +
  theme_bw()
p1 + p2 + p3 + plot_layout(ncol = 1)

In the top figure, we see the predictions for the counterfactual case where no intervention was made ($z_t = 0$ for all $t$; pink ribbon/red line), whereas in the middle figure I have drawn the difference between the actual observations and the predicted values for times $t = 81, \ldots, 100$, which shows a clear effect of the intervention (the differences between the observations and the predictions do not fluctuate around zero). In the bottom figure we see the cumulative difference between the observations and our predictions, which emphasizes how the cumulative effect keeps increasing over the whole study period instead of tapering off or disappearing completely.
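
The same posterior samples also yield a direct numerical summary of the total effect; for example, a posterior interval for the cumulative difference over the whole intervention period (a sketch reusing pred from above):

# Posterior of the total cumulative effect over t = 81, ..., 100
total <- pred |>
  group_by(.draw) |>
  summarise(total_effect = sum(d$y[81:n] - y_mean))
quantile(total$total_effect, probs = c(0.025, 0.5, 0.975))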

Finally, we can consider a case where we assume that the intervention affects only a single response variable, $y$, and that $x$ does not depend on $y$, which, since we created this data, we know to be incorrect. This is essentially the same as treating $x$ as exogenous and estimating a single-response model for $y$. But in our case we can also proceed with our original model and just modify the previous simulation so that the variable $x$ is fixed to its observed values:

newdata <- d
newdata$z <- 0
newdata$y[81:100] <- NA
pred_fixed_x <- predict(fit, newdata = newdata, type = "mean") |>
  filter(time > 80)
sumr_fixed_x <- pred_fixed_x |>
  group_by(time) |>
  summarise(
    mean = mean(y_mean),
    lwr80 = quantile(y_mean, 0.1),
    upr80 = quantile(y_mean, 0.9),
    lwr95 = quantile(y_mean, 0.025),
    upr95 = quantile(y_mean, 0.975)
  )
p1 <- ggplot(sumr_fixed_x, aes(time, mean)) +
  geom_ribbon(aes(ymin = lwr95, ymax = upr95), alpha = 0.3, fill = "#ED3535") +
  geom_ribbon(aes(ymin = lwr80, ymax = upr80), alpha = 0.3, fill = "#ED3535") +
  geom_line(colour = "#ED3535") +
  geom_line(data = d, aes(y = y), colour = "black") +
  xlab("Time") + ylab("Value") +
  theme_bw()
sumr_fixed_x$y_obs <- y[81:100]
p2 <- sumr_fixed_x |>
  mutate(
    y_obs_diff = y_obs - mean,
    lwr_diff80 = y_obs - lwr80,
    upr_diff80 = y_obs - upr80,
    lwr_diff95 = y_obs - lwr95,
    upr_diff95 = y_obs - upr95
  ) |>
  ggplot(aes(time, y_obs_diff)) +
  geom_ribbon(aes(ymin = lwr_diff95, ymax = upr_diff95),
    alpha = 0.3, fill = "#ED3535") +
  geom_ribbon(aes(ymin = lwr_diff80, ymax = upr_diff80),
    alpha = 0.3, fill = "#ED3535") +
  geom_line(colour = "#ED3535") +
  xlab("Time") + ylab("Observed value - predicted value") +
  theme_bw()
cs_sumr_fixed_x <- pred_fixed_x |>
  group_by(.draw) |>
  summarise(cs = cumsum(d$y[81:n] - y_mean), across()) |>
  group_by(time) |>
  summarise(
    mean = mean(cs),
    lwr = quantile(cs, 0.025, na.rm = TRUE),
    upr = quantile(cs, 0.975, na.rm = TRUE)
  )

`summarise()` has grouped output by '.draw'. You can override using the `.groups`
argument.

p3 <- ggplot(cs_sumr_fixed_x, aes(time, mean)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3, fill = "#ED3535") +
  geom_line(colour = "#ED3535") +
  xlab("Time") +
  ylab("Cumulative difference") +
  theme_bw()
p1 + p2 + p3 + plot_layout(ncol = 1)

As expected, because we treated the variable $x$ as fixed and the intervention affects $y$ only via $x$, the counterfactual predictions and the observed series are very similar, and we would (incorrectly) conclude that our original intervention did not affect $y$. Therefore, by using the dynamite package to model multiple response variables jointly, we are able to capture effects that we would otherwise have missed.

Future directions

In the future, we plan to add more distributions for the response variables, such as the Weibull, multinomial, and $t$-distributions, and to improve the tools for visualizing the model parameters and predictions. We would also be very interested in hearing how the package is used in various applications, especially if you can share your data openly. Pull requests and other contributions are very welcome.

Acknowledgements

The package was created by Santtu Tikka and Jouni Helske as part of the PREDLIFE project, funded by the Academy of Finland. The package was reviewed by Nicholas Clark and Lucy D’Agostino McGowan.

1. Helske J, Tikka S (2022). Estimating Causal Effects from Panel Data with Dynamic Multivariate Panel Models. SocArxiv preprint. doi.org/10.31235/osf.io/mdwu5

2. Abadie A, Gardeazabal J (2003). The Economic Costs of Conflict: A Case Study of the Basque Country. The American Economic Review, 93(1), 113–132. doi.org/10.1257/000282803321455188

3. Brodersen KH, Gallusser F, Koehler J, Remy N, Scott SL (2015). Inferring Causal Impact Using Bayesian Structural Time-Series Models. The Annals of Applied Statistics. doi.org/10.1214/14-AOAS788
