Well, maybe not, but this comes up all the time. An investigator wants to assess the effect of an intervention on an outcome. Study participants are randomized either to receive the intervention (could be a new drug, new protocol, behavioral intervention, whatever) or treatment as usual. For each participant, the outcome measure is recorded at baseline – this is the pre in pre/post analysis. The intervention is delivered (or not, in the case of the control group), some time passes, and the outcome is measured a second time. This is our post. The question is, how should we analyze this study to draw conclusions about the intervention’s effect on the outcome?
There are at least three possible ways to approach this. (1) Ignore the pre outcome measure and just compare the average post scores of the two groups. (2) Calculate a change score for each individual (\(\Delta_i = post_i - pre_i\)), and compare the average \(\Delta\)’s for each group. Or (3), use a more sophisticated regression model to estimate the intervention effect while controlling for the pre or baseline measure of the outcome. Here are three models associated with each approach (\(T_i\) is 1 if individual \(i\) received the treatment, 0 if not, and \(\epsilon_i\) is an error term):
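\[
\begin{aligned}
(1) \quad post_i &= \beta_0 + \beta_1 T_i + \epsilon_i \\
(2) \quad \Delta_i &= \alpha_0 + \alpha_1 T_i + \epsilon_i \\
(3) \quad post_i &= \gamma_0 + \gamma_1 T_i + \gamma_2 \, pre_i + \epsilon_i
\end{aligned}
\]

(These are written as a sketch that follows the verbal descriptions above; the coefficient labels are placeholders. In each case, the coefficient on \(T_i\) is the estimate of the intervention effect.)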
I’ve explored various scenarios (i.e. different data generating assumptions) to see if it matters which approach we use. (Of course it does.)
Scenario 1: pre and post not correlated
In the simulations that follow, I am generating potential outcomes for each individual. So, the variable post0 represents the follow-up outcome for the individual under the control condition, and post1 is the outcome in the intervention condition. pre0 and pre1 are the same, because the intervention does not affect the baseline measurement. The effect of the intervention is specified by eff. In the first scenario, the baseline and follow-up measures are not related to each other, and the effect size is 1. All of the data definitions and data generation are done using package simstudy.
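A minimal sketch of how the data definitions might be set up in simstudy for this scenario (the variable names follow the text, but the variances shown here are illustrative assumptions, not necessarily the values used to produce the results below):

```r
library(simstudy)
library(data.table)

# potential outcomes: the baseline (pre) is unaffected by treatment, and the
# follow-up under treatment (post1) differs from control (post0) only by eff
def1 <- defData(varname = "pre0", formula = 0, variance = 1, dist = "normal")
def1 <- defData(def1, varname = "post0", formula = 0, variance = 1, dist = "normal")
def1 <- defData(def1, varname = "eff", formula = 1, dist = "nonrandom")
def1 <- defData(def1, varname = "pre1", formula = "pre0", dist = "nonrandom")
def1 <- defData(def1, varname = "post1", formula = "post0 + eff", dist = "nonrandom")
```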
Now we generate the potential outcomes, the group assignment, and observed data for 1000 individuals. (I’m using package stargazer, definitely worth checking out, to print out the first five rows of the dataset.)
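Something along the following lines would generate the potential outcomes, randomize treatment, and derive the observed measurements (the seed and the group variable name rx are assumptions for illustration):

```r
set.seed(2018)  # arbitrary seed

dd <- genData(1000, def1)                      # potential outcomes
dd <- trtAssign(dd, nTrt = 2, grpName = "rx")  # 1:1 randomized assignment

# observed outcomes depend on the assigned arm
dd[, pre  := ifelse(rx == 1, pre1, pre0)]
dd[, post := ifelse(rx == 1, post1, post0)]

# print the first five rows as a simple text table
stargazer::stargazer(as.data.frame(dd)[1:5, ], summary = FALSE, type = "text")
```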
The plots show the three different types of analysis - follow-up measurement alone, change, or follow-up controlling for baseline:
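A rough ggplot2 sketch of those three views (the original figures may have looked quite different):

```r
library(ggplot2)
library(gridExtra)

p1 <- ggplot(dd, aes(x = factor(rx), y = post)) +
  geom_boxplot() +
  labs(x = "group", title = "follow-up only")

p2 <- ggplot(dd, aes(x = factor(rx), y = post - pre)) +
  geom_boxplot() +
  labs(x = "group", y = "change", title = "change from baseline")

p3 <- ggplot(dd, aes(x = pre, y = post, color = factor(rx))) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(color = "group", title = "follow-up adjusted for baseline")

grid.arrange(p1, p2, p3, nrow = 1)
```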
I compare the different modeling approaches by using simulation to estimate statistical power for each. That is, given that there is some true effect, how often is the p-value of the test less than 0.05? I’ve written a function to handle the data generation and power estimation. The function generates 1000 data sets of a specified sample size, each time fitting the three models and keeping track of the relevant p-values for each iteration.
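Here is a sketch of what such a function might look like; the function and argument names are mine, and the actual implementation surely differs in its details:

```r
estPower <- function(def, n, nIter = 1000, alpha = 0.05) {

  pvals <- data.table(post = numeric(nIter),
                      change = numeric(nIter),
                      adjusted = numeric(nIter))

  for (i in 1:nIter) {

    dd <- genData(n, def)
    dd <- trtAssign(dd, nTrt = 2, grpName = "rx")
    dd[, pre  := ifelse(rx == 1, pre1, pre0)]
    dd[, post := ifelse(rx == 1, post1, post0)]

    # the three competing models
    fit.post <- lm(post ~ rx, data = dd)            # follow-up only
    fit.chng <- lm(I(post - pre) ~ rx, data = dd)   # change score
    fit.adj  <- lm(post ~ rx + pre, data = dd)      # adjusted for baseline

    pvals[i, `:=`(post     = coef(summary(fit.post))["rx", "Pr(>|t|)"],
                  change   = coef(summary(fit.chng))["rx", "Pr(>|t|)"],
                  adjusted = coef(summary(fit.adj))["rx", "Pr(>|t|)"])]
  }

  # estimated power: proportion of iterations with p < alpha
  pvals[, .(post = mean(post < alpha),
            change = mean(change < alpha),
            adjusted = mean(adjusted < alpha))]
}
```

For the first scenario, a call like `estPower(def1, n = 150)` would estimate power for a trial with 75 individuals per arm.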
The results for the first data set are based on a sample size of 150 individuals (75 in each group). The post-only model does just as well as the model that adjusts for the baseline measure. The model evaluating change in this scenario is way underpowered.
Scenario 2: pre and post are correlated
When the baseline and follow-up measures are correlated, the correlation actually increases power, so we use a reduced sample size of 120 for the power estimation. In this case, all three models do pretty well, but the adjusted model is slightly superior.
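One way to induce correlation between the pre and post measures in simstudy is to let the follow-up formula depend on the baseline; the coefficients below are purely illustrative (scenario 3 would simply push the correlation close to 1 by increasing the coefficient and shrinking the residual variance):

```r
# scenario 2 (sketch): follow-up depends on baseline, inducing correlation
def2 <- defData(varname = "pre0", formula = 0, variance = 1, dist = "normal")
def2 <- defData(def2, varname = "post0", formula = "0.5 * pre0",
                variance = 0.75, dist = "normal")
def2 <- defData(def2, varname = "eff", formula = 1, dist = "nonrandom")
def2 <- defData(def2, varname = "pre1", formula = "pre0", dist = "nonrandom")
def2 <- defData(def2, varname = "post1", formula = "post0 + eff", dist = "nonrandom")

estPower(def2, n = 120)
```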
Scenario 3: pre and post are almost perfectly correlated
When baseline and follow-up measurements are almost perfectly correlated (in this case about 0.85), we would be indifferent between the change and adjusted analyses - the power of the two tests is virtually identical. However, the analysis that considers the follow-up measure alone is much less adequate, due primarily to the measure’s relatively high variability.
Scenario 4: intervention effect depends on the baseline measurement
In a slight variation of the previous scenario, the effect of the intervention itself is now a function of the baseline score. Those who score higher will benefit less from the intervention - they simply have less room to improve. In this case, the adjusted model appears slightly inferior to the change model, while the unadjusted post-only model is still relatively low powered.
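A sketch of how this kind of heterogeneous effect might be specified (again, the particular coefficients are assumptions for illustration):

```r
# scenario 4 (sketch): the benefit of treatment shrinks as the baseline rises
def4 <- defData(varname = "pre0", formula = 0, variance = 1, dist = "normal")
def4 <- defData(def4, varname = "post0", formula = "0.8 * pre0",
                variance = 0.36, dist = "normal")
def4 <- defData(def4, varname = "eff", formula = "1 - 0.5 * pre0", dist = "nonrandom")
def4 <- defData(def4, varname = "pre1", formula = "pre0", dist = "nonrandom")
def4 <- defData(def4, varname = "post1", formula = "post0 + eff", dist = "nonrandom")
```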
The adjusted model has less power than the change model here because I used a reduced \(\alpha\)-level for the hypothesis tests of the adjusted model. I am testing for an interaction first and then, if that test fails, for a main effect, so I need to adjust for multiple comparisons. (I have another post that shows why this might be a good thing to do.) I have used a Bonferroni adjustment, which is a relatively conservative approach. I still prefer the adjusted model, because it provides more insight into the underlying process than the change model.
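The logic of that adjusted analysis might look something like the following sketch, where dd is a data set generated under scenario 4 and the overall \(\alpha\) of 0.05 is split across the two tests:

```r
# fit the adjusted model with a treatment-by-baseline interaction
fit.x <- lm(post ~ rx * pre, data = dd)
p.int <- coef(summary(fit.x))["rx:pre", "Pr(>|t|)"]

if (p.int < 0.025) {
  reject <- TRUE   # the effect of treatment varies with the baseline score
} else {
  # fall back to the main-effects model, also tested at alpha/2
  fit.m  <- lm(post ~ rx + pre, data = dd)
  reject <- coef(summary(fit.m))["rx", "Pr(>|t|)"] < 0.025
}
```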
Treatment assignment depends on baseline measurement
Now, slightly off-topic. So far, we’ve been talking about situations where treatment assignment is randomized. What happens in a scenario where those with higher baseline scores are more likely to receive the intervention? Well, if we don’t adjust for the baseline score, we will have unmeasured confounding. A comparison of follow-up scores in the two groups will be biased in favor of the intervention group whenever the baseline scores are correlated with the follow-up scores - as we see with a scenario in which the effect size is set to 0. Also notice that the p-values for the unadjusted model are consistently below 0.05 - we are almost always drawing the wrong conclusion if we use this model. On the other hand, the error rate for the adjusted model is close to 0.05, which is what we would expect.
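To mimic this, one could make the probability of treatment a function of the baseline score while setting the effect size to 0; the numbers below are again just for illustration:

```r
# no treatment effect at all, but assignment depends on the baseline score
def0 <- defData(varname = "pre", formula = 0, variance = 1, dist = "normal")
def0 <- defData(def0, varname = "post", formula = "0.8 * pre",
                variance = 0.36, dist = "normal")

dd <- genData(1000, def0)
dd[, rx := rbinom(.N, 1, plogis(-0.5 + 1.5 * pre))]  # higher pre -> more likely treated

coef(summary(lm(post ~ rx, data = dd)))        # unadjusted: biased, spurious "effect"
coef(summary(lm(post ~ rx + pre, data = dd)))  # adjusted: estimate near zero
```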
I haven’t proved anything here, but these simulations suggest that we should certainly think twice about using an unadjusted model if we happen to have baseline measurements. And it seems like you are likely to maximize power (and maybe minimize bias) if you compare follow-up scores while adjusting for baseline scores rather than analyzing change in scores by group.