Taking the baseline measurement into account
[This article was first published on Colman Statistics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In Randomized Controlled Trials (RCTs), a “Pre” measurement is often taken at baseline (before randomization), and treatment effects are measured at one or more time point(s) after randomization (“Post” measurements). There are many ways to take the baseline measurement into account when comparing 2 groups in a classic pre-post design with one post measurement. In this R-Bloggers post several strategies were discussed, including:
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
- The delta approach: a comparison of the “delta’s” (post measurement minus baseline measurement) using a independent samples t-test.
- The ANCOVA: A regression model with the pre measurement and group as predictors, and the post measurement as outcome variabele.
- In contrast to the ANCOVA and delta approach, all subjects with at least one measurement (baseline or post-randomization) are included in the analysis.
- The models can easily be extended to 3 or more repeated measures.
- It is possible to model the correlation between baseline and post measures differently for different treatment groups.
Simulating the data
Data were created using the simstudy package. Each subject has equal chance to be assigned to either the Control group or the Experimental group. Both Pre and Post measurements are normally distributed. The assignment has no influence on the pre measurement. The post measurement depends on Group (Exp or Con) and on the baseline measurement (Pre).library(simstudy) tdef <- defData(varname = "Group", dist = "binary", formula = 0.5, id = "Id") tdef <- defData(tdef, varname = "Pre", dist = "normal", formula = 7, variance = 10) tdef <- defData(tdef, varname = "Post", dist = "normal", formula = "Pre + 0.1 + 1.05* Group", variance =3 ) set.seed(123) dtTrial <- genData(150, tdef) dtTrial$Group <- factor(dtTrial$Group, levels=c(1,0), labels=c("Exp", "Con")) library(pander) pander(head(dtTrial))
Group | Pre | Post |
---|---|---|
Con | 10.24 | 10.45 |
Exp | 6.099 | 6.029 |
Con | 3.14 | 1.997 |
Exp | 7.573 | 10.26 |
Exp | 6.561 | 5.952 |
Con | 7.018 | 10.5 |
library(ggplot2) fav.col=c("#1A425C", "#8AB63F") ggplot(dtTrial, aes (x=Pre,y=Post, col=Group)) + geom_point() + geom_smooth(method="lm", se=FALSE) + scale_color_manual(values=fav.col) + theme_bw()
From wide to long format
When applying mixed models, the baseline measurement is seen as part of the outcome vector. All measurements (pre and post) should appear in one column (the variable Outcome ). Tidyverse’s gather() function is one of the many ways to do the job. In this example the columns Pre and Post are “gathered” in a column called Outcome (=value). The new variable called Time (=key) will show us whether it concerns a pre or post measurement.library (tidyverse) dtTrial_long <- gather(data=dtTrial, key=Time, value=Outcome, Pre:Post, factor_key=TRUE) pander(head(arrange(dtTrial_long, Id), 10))
Id | Group | Time | Outcome |
---|---|---|---|
1 | Con | Pre | 10.24 |
1 | Con | Post | 10.45 |
2 | Exp | Pre | 6.099 |
2 | Exp | Post | 6.029 |
3 | Con | Pre | 3.14 |
3 | Con | Post | 1.997 |
4 | Exp | Pre | 7.573 |
4 | Exp | Post | 10.26 |
5 | Exp | Pre | 6.561 |
5 | Exp | Post | 5.952 |
library(ggplot2) boxp <- ggplot (dtTrial_long , aes(x=Time,y=Outcome, col=Group))+ geom_boxplot(outlier.size = 0 ) + geom_point(aes(fill=Group, col=NULL),shape=21, alpha=0.5, position = position_jitterdodge(jitter.width = 0.2)) + scale_fill_manual(values=fav.col) + scale_color_manual(values=fav.col) + theme_bw() + xlab("") boxp
Applying constrained longitudinal data analysis (cLDA)
The gls() function of the nlme package can be used to apply LDA modelling. Fixed part: The terms of the most basic cLDA model are Time and Time * Group . By omitting Group , there is no term allowing for a difference in means at baseline between the groups. Consequently, the baseline means are assumed to be equal. However, a general rule in regression modelling is that when adding an interaction to the model, the lower ranked effects should also be included in the model. For this reason, R will automatically include Time and Group as main effects in the model when including the interaction term _Time*Group_. To avoid to have Group as a main effect in our model, we will create an alternative model matrix: Xalt.X <- model.matrix(~ Time * Group, data = dtTrial_long) colnames(X) [1] "(Intercept)" "TimePost" "GroupCon" [4] "TimePost:GroupCon" Xalt <- X[, c("TimePost", "TimePost:GroupCon")] colnames(Xalt) [1] "TimePost" "TimePost:GroupCon"Random part: The standard deviations and correlations that should be estimated are defined by the constant variance function varIdent(). By setting weights = varIdent(form = ~ 1 | Time) a separate standard deviation will be estimated for each time point and a seperate correlation will be estimated for each pair of time points (= unstructured variance covariance matrix). Applied to our example, we expect an estimated standard deviation for Time=Pre, an estimated standard deviation for Time=Post and one estimated correlation between the pre- and post measurements of the same subject. Remark: By setting weights = varIdent(form = ~ 1 | Time:Group), a separate variance is estimated for each combination of Group and Time (Pre-Exp Post-Exp Pre-Con Post-Con ). The argument correlation=corSymm (form = ~ 1 | Id) defines the subject levels. The correlation structure is assumed to apply only to observations within the same subject (in our example: Id); observations from different subjects (a different value for Id) are assumed to be uncorrelated.
library(nlme) clda_gls <- gls(Outcome ~ Xalt, weights = varIdent(form = ~ 1 | Time), correlation=corSymm (form = ~ 1 | Id), data = dtTrial_long) summary(clda_gls) Generalized least squares fit by REML Model: Outcome ~ Xalt Data: dtTrial_long AIC BIC logLik 1358.307 1380.47 -673.1537 Correlation Structure: General Formula: ~1 | Id Parameter estimate(s): Correlation: 1 2 0.842 Variance function: Structure: Different standard deviations per stratum Formula: ~1 | Time Parameter estimates: Pre Post 1.000000 1.059605 Coefficients: Value Std.Error t-value p-value (Intercept) 6.978858 0.2461488 28.352190 0e+00 XaltTimePost 1.240246 0.2047301 6.057956 0e+00 XaltTimePost:GroupCon -0.958945 0.2815211 -3.406297 7e-04 Correlation: (Intr) XltTmP XaltTimePost -0.129 XaltTimePost:GroupCon 0.000 -0.715 Standardized residuals: Min Q1 Med Q3 Max -2.20538393 -0.65214666 -0.09411877 0.63654736 3.40671592 Residual standard error: 3.014695 Degrees of freedom: 300 total; 297 residual predictions<- cbind( dtTrial_long,clda_gls$fitted)The p-value for the interaction effect (the effect of interest here) is 0.0007: the change in mean over time differs significantly between groups. The estimated difference in change is -0.95 (SE: 0.28) (Con increases 0.95 less then Exp). The estimated correlation between Pre and Post is 0.842. The estimated standard deviation for the Pre measurement is 3.01 and the estimated standard devaition for the Post measurement is 1.059 * 3.01. We can obtain predicted means using the gls() function. As expected, the predicted value is the same for both groups at baseline ( Time == Pre).
Id | Group | Time | Outcome | clda_gls$fitted |
---|---|---|---|---|
1 | Con | Pre | 10.24 | 6.979 |
1 | Con | Post | 10.45 | 7.26 |
2 | Exp | Pre | 6.099 | 6.979 |
2 | Exp | Post | 6.029 | 8.219 |
library(rms) clda_Gls <- Gls(Outcome ~ Xalt, weights = varIdent(form = ~ 1 | Time), correlation=corSymm (form = ~ 1 | Id), data = dtTrial_long) predictions <- cbind(dtTrial_long, predict (clda_Gls, dtTrial_long, conf.int=0.95))Now we do obtain 95% confidence intervals:
Id | Group | Time | Outcome | linear.predictors | lower | upper |
---|---|---|---|---|---|---|
1 | Con | Pre | 10.24 | 6.979 | 6.496 | 7.461 |
1 | Con | Post | 10.45 | 7.26 | 6.684 | 7.836 |
2 | Exp | Pre | 6.099 | 6.979 | 6.496 | 7.461 |
2 | Exp | Post | 6.029 | 8.219 | 7.633 | 8.805 |
Plotting the confidence intervals
First attempt for the visualization of the estimated means and their 95% confidence intervals:pd <- position_dodge(.1) limits <- aes(ymax = upper , ymin=lower, shape=Group) pCI1 <- ggplot(predictions, aes( y=linear.predictors, x=Time)) + geom_errorbar(limits, width= 0.1 , position=pd) + geom_line(aes(group=Group, y=linear.predictors), position=pd) + geom_point(position=pd, aes( fill=Group), shape=21, size=3) + scale_fill_manual(values=c( "black", "white")) + theme_bw() + theme(panel.grid.major = element_blank(), legend.title=element_blank(), legend.position="bottom") + xlab(NULL) + ylab("Estimated mean and 95% CI") pCI1To my opinion, this plot can be misleading. One estimated mean is presented by 2 different dots. I prefer to create the factor Group2 and remake the plot.
predictions$Group2<- factor(predictions$Group, levels= c("All", "Exp", "Con")) predictions$Group2 [predictions$Time=="Pre"] <- "All" pd <- position_dodge(.08) limits <- aes(ymax = upper , ymin=lower, shape=Group2) pCI <- ggplot(predictions, aes( y=linear.predictors, x=Time)) + geom_errorbar(limits, width=0.09,position=pd) + geom_line(aes(group=Group, y=linear.predictors), position=pd)+ geom_point(position=pd, aes( fill=Group2), shape=21, size=3) + scale_fill_manual(values=c("grey", "black", "white")) + theme_bw() + theme(panel.grid.major = element_blank(), legend.title=element_blank(), legend.position="bottom") + xlab (NULL) + ylab("Estimated mean and 95% CI") pCI
In the case of non-randomized experiments
In the case of non-randomized experiments or studies an equal mean at baseline cannot be assumed. Consequently, applying constrained LDA is not appropriate. It is however possible to perform an LDA (without the constraint). By including Group in the model (next to Time and the interaction term Group:Time ), the baseline means are allowed to differ between groups. There’s no need to create the alternative model matrix Xalt in that situation. (Xalt can thus be replaced by Group:Time ) The remaining arguments (random part of the model) doesn’t differ from the model we presented for cLDA.References
References LDA and cLDA: – Fitzmaurice, Garrett M., et al. Applied Longitudinal Analysis. John Wiley & Sons, 2004. – Liu GF, Lu K, Mogg R, et al. Should baseline be a covariate or dependent variable in analyses of change from baseline in clinical trials? Stat Med 2009; 28: 2509-2530. Reference Gls() function of the rms package: – Pinheiro J, Bates D (2000): Mixed effects models in S and S-Plus. New York: Springer-Verlag.The post Taking the baseline measurement into account appeared first on Colman Statistics.
To leave a comment for the author, please follow the link and comment on their blog: Colman Statistics.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.