Should we be concerned about incidence – prevalence bias?
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Recently, we were planning a study to evaluate the effect of an intervention on outcomes for very sick patients who show up in the emergency department. My collaborator had concerns about a phenomenon that she had observed in other studies that might affect the results – patients measured earlier in the study tend to be sicker than those measured later in the study. This might not be a problem, but in the context of a stepped-wedge study design (see this for a discussion that touches this type of study design), this could definitely generate biased estimates: when the intervention occurs later in the study (as it does in a stepped-wedge design), the “exposed” and “unexposed” populations could differ, and in turn so could the outcomes. We might confuse an artificial effect as an intervention effect.
What could explain this phenomenon? The title of this post provides a hint: cases earlier in a study are more likely to be prevalent ones (i.e. they have been sick for a while), whereas later in the study cases tend to be incident (i.e. they only recently become sick). Even though both prevalent and incident cases are sick, the former may be sicker on average than the latter, simply because their condition has had more time develop.
We didn’t have any data to test out this hypothesis (if our grant proposal is funded, we will be able to do that), so I decided to see if I could simulate this phenomenon. In my continuing series exploring simulation using Rcpp
, simstudy
, and data.table
, I am presenting some code that I used to do this.
Generating a population of patients
The first task is to generate a population of individuals, each of whom starts out healthy and potentially becomes sicker over time. Time starts in month 1 and ends at some fixed point – in the first example, I end at 400 months. Each individual has a starting health status and a start month. In the examples that follow, health status is 1 through 4, with 1 being healthy, 3 is quite sick, and 4 is death. And, you can think of the start month as the point where the individual ages into the study. (For example, if the study includes only people 65 and over, the start month is the month the individual turns 65.) If an individual starts in month 300, she will have no measurements in periods 1 through 299 (i.e. health status will be 0).
The first part of the simulation generates a start month and starting health status for each individual, and then generates a health status for each individual until the end of time. Some individuals may die, while others may go all the way to the end of the simulation in a healthy state.
Rcpp function to generate health status for each period
While it is generally preferable to avoid loops in R, sometimes it cannot be avoided. I believe generating a health status that depends on the previous health status (a Markov process) is one of those situations. So, I have written an Rcpp function to do this – it is orders of magnitude faster than doing this in R:
#include <RcppArmadilloExtensions/sample.h> // [[Rcpp::depends(RcppArmadillo)]] using namespace Rcpp; // [[Rcpp::export]] IntegerVector MCsim( unsigned int nMonths, NumericMatrix P, int startStatus, unsigned int startMonth ) { IntegerVector sim( nMonths ); IntegerVector healthStats( P.ncol() ); NumericVector currentP; IntegerVector newstate; unsigned int q = P.ncol(); healthStats = Rcpp::seq(1, q); sim[startMonth - 1] = startStatus; /* Loop through each month for each individual */ for (unsigned int i = startMonth; i < nMonths; i++) { /* new state based on health status of last period and probability of transitioning to different state */ newstate = RcppArmadillo::sample( healthStats, 1, TRUE, P.row(sim(i-1) - 1) ); sim(i) = newstate(0); } return sim; }
Generating the data
The data generation process is shown below. The general outline of the process is (1) define transition probabilities, (2) define starting health status distribution, (3) generate starting health statuses and start months, and (4) generate health statuses for each follow-up month.
# Transition matrix for moving through health statuses P <- matrix(c(0.985, 0.015, 0.000, 0.000, 0.000, 0.950, 0.050, 0.000, 0.000, 0.000, 0.850, 0.150, 0.000, 0.000, 0.000, 1.000), nrow = 4, byrow = TRUE) maxFU = 400 nPerMonth = 350 N = maxFU * nPerMonth ddef <- defData(varname = "sHealth", formula = "0.80; 0.15; 0.05", dist = "categorical") # generate starting health values (1, 2, or 3) for all individuals set.seed(123) did <- genData(n = N, dtDefs = ddef) # each month, 350 age in to the sample did[, sMonth := rep(1:maxFU, each = nPerMonth)] # show table for 10 randomly selected individuals did[id %in% sample(x = did$id, size = 10, replace = FALSE)] ## id sHealth sMonth ## 1: 15343 2 44 ## 2: 19422 2 56 ## 3: 41426 1 119 ## 4: 50050 1 143 ## 5: 63042 1 181 ## 6: 83584 1 239 ## 7: 93295 1 267 ## 8: 110034 1 315 ## 9: 112164 3 321 ## 10: 123223 1 353 # generate the health status history based on the transition matrix dhealth <- did[, .(sHealth, sMonth, health = MCsim(maxFU, P, sHealth, sMonth)), keyby = id] dhealth[, month := c(1:.N), by = id] dhealth ## id sHealth sMonth health month ## 1: 1 1 1 1 1 ## 2: 1 1 1 1 2 ## 3: 1 1 1 1 3 ## 4: 1 1 1 1 4 ## 5: 1 1 1 1 5 ## --- ## 55999996: 140000 1 400 0 396 ## 55999997: 140000 1 400 0 397 ## 55999998: 140000 1 400 0 398 ## 55999999: 140000 1 400 0 399 ## 56000000: 140000 1 400 1 400
Simulation needs burn-in period
The simulation process itself is biased in its early phases as there are too many individuals in the sample who have just aged in compared to those who are “older”. (This is sort of the the reverse of the incidence - prevalence bias.) Since individuals tend to have better health status when they are “younger”, the average health status of the simulation in its early phases is biased downwards by the preponderance of young individuals in the population. This suggests that any evaluation of simulated data needs to account for a “burn-in” period that ensures there is a mix of “younger” and “older” individuals. To show this, I have calculated an average health score for each period of the simulation and plotted the results. You can see that the sample stabilizes after about 200 months in this simulation.
# count number of individuals with a particular heath statust each month cmonth <- dhealth[month > 0, .N, keyby = .(month, health)] cmonth ## month health N ## 1: 1 0 139650 ## 2: 1 1 286 ## 3: 1 2 47 ## 4: 1 3 17 ## 5: 2 0 139300 ## --- ## 1994: 399 4 112203 ## 1995: 400 1 18610 ## 1996: 400 2 6515 ## 1997: 400 3 2309 ## 1998: 400 4 112566 # transform data from "long" form to "wide" form and calculate average mtotal <- dcast(data = cmonth, formula = month ~ health, fill = 0, value.var = "N") mtotal[, total := `1` + `2` + `3`] mtotal[, wavg := (`1` + 2*`2` + 3*`3`)/total] mtotal ## month 0 1 2 3 4 total wavg ## 1: 1 139650 286 47 17 0 350 1.231429 ## 2: 2 139300 558 106 32 4 696 1.244253 ## 3: 3 138950 829 168 45 8 1042 1.247601 ## 4: 4 138600 1104 215 66 15 1385 1.250542 ## 5: 5 138250 1362 278 87 23 1727 1.261726 ## --- ## 396: 396 1400 18616 6499 2351 111134 27466 1.407813 ## 397: 397 1050 18613 6537 2321 111479 27471 1.406938 ## 398: 398 700 18587 6561 2323 111829 27471 1.407957 ## 399: 399 350 18602 6541 2304 112203 27447 1.406201 ## 400: 400 0 18610 6515 2309 112566 27434 1.405810 ggplot(data = mtotal, aes(x=month, y=wavg)) + geom_line() + ylim(1.2, 1.5) + geom_hline(yintercept = 1.411, lty = 3) + geom_vline(xintercept = 200, lty = 3) + xlab("Month") + ylab("Average health status") + theme(panel.background = element_rect(fill = "grey90"), panel.grid = element_blank(), plot.title = element_text(size = 12, vjust = 0.5, hjust = 0) ) + ggtitle("Average health status of simulated population")
Generating monthly study cohorts
Now we are ready to see if we can simulate the incidence - prevalence bias. The idea here is to find the first month during which an individual (1) is “active” (i.e. the period being considered is on or after the individual’s start period), (2) has an emergency department visit, and (3) whose health status has reached a specified threshold.
We can set a final parameter that looks back some number of months (say 6 or 12) to see if there have been any previous qualifying emergency room visits before the study start period (which in our case will be month 290 to mitigate an burn-in bias identified above). This “look-back” will be used to mitigate some of the bias by creating a washout period that makes the prevalent cases look more like incident cases. This look-back parameter is calculated each month for each individual using an Rcpp function that loops through each period:
#include <Rcpp.h> using namespace Rcpp; // [[Rcpp::export]] IntegerVector cAddPrior(IntegerVector idx, IntegerVector event, int lookback) { int nRow = idx.length(); IntegerVector sumPrior(nRow, NA_INTEGER); for (unsigned int i = lookback; i < nRow; i++) { IntegerVector seqx = Rcpp::seq(i-lookback, i-1); IntegerVector x = event[seqx]; sumPrior[i] = sum(x); } return(sumPrior); }
Generating a single cohort
The following code (1) generates a population (as we did above), (2) generates emergency department visits that are dependent on the health status (the sicker an individual is, the more likely they are to go to the ED), (3) calculates the number of eligible ED visits during the look-back period, and (4) creates the monthly cohorts based on the selection criteria. At the end, we calculate average health status for the cohort by month of cohort - this will be used to illustrate the bias.
maxFU = 325 nPerMonth = 100 N = maxFU * nPerMonth START = 289 # to allow for adequate burn-in HEALTH = 2 LOOKBACK = 6 # how far to lookback set.seed(123) did <- genData(n = N, dtDefs = ddef) did[, sMonth := rep(1:maxFU, each = nPerMonth)] healthStats <- did[, .(sHealth, sMonth, health = MCsim(maxFU, P, sHealth, sMonth)), keyby = id] healthStats[, month := c(1:.N), by = id] # eliminate period without status measurement (0) & death (4) healthStats <- healthStats[!(health %in% c(0,4))] # ensure burn-in by starting with observations far # into simulation healthStats <- healthStats[month > (START - LOOKBACK)] # set probability of emergency department visit healthStats[, pED := (health == 1) * 0.02 + (health == 2) * 0.10 + (health == 3) * 0.20] # generate emergency department visit healthStats[, ed := rbinom(.N, 1, pED)] healthStats[, edAdj := ed * as.integer(health >= HEALTH)] # if you want to restrict healthStats[, pSum := cAddPrior(month, edAdj, lookback = LOOKBACK), keyby=id] # look at one individual healthStats[id == 28069] ## id sHealth sMonth health month pED ed edAdj pSum ## 1: 28069 1 281 1 284 0.02 0 0 NA ## 2: 28069 1 281 1 285 0.02 0 0 NA ## 3: 28069 1 281 1 286 0.02 0 0 NA ## 4: 28069 1 281 1 287 0.02 0 0 NA ## 5: 28069 1 281 2 288 0.10 0 0 NA ## 6: 28069 1 281 2 289 0.10 0 0 NA ## 7: 28069 1 281 2 290 0.10 1 1 0 ## 8: 28069 1 281 2 291 0.10 0 0 1 ## 9: 28069 1 281 2 292 0.10 0 0 1 ## 10: 28069 1 281 2 293 0.10 0 0 1 ## 11: 28069 1 281 2 294 0.10 0 0 1 ## 12: 28069 1 281 3 295 0.20 1 1 1 # cohort includes individuals with 1 prior ed visit in # previous 6 months cohort <- healthStats[edAdj == 1 & pSum == 0] cohort <- cohort[, .(month = min(month)), keyby = id] cohort ## id month ## 1: 53 306 ## 2: 82 313 ## 3: 140 324 ## 4: 585 291 ## 5: 790 299 ## --- ## 3933: 31718 324 ## 3934: 31744 325 ## 3935: 31810 325 ## 3936: 31860 325 ## 3937: 31887 325 # estimate average health status of monthly cohorts cohortStats <- healthStats[cohort, on = c("id","month")] sumStats <- cohortStats[ , .(avghealth = mean(health), n = .N), keyby = month] head(sumStats) ## month avghealth n ## 1: 290 2.248175 137 ## 2: 291 2.311765 170 ## 3: 292 2.367347 147 ## 4: 293 2.291925 161 ## 5: 294 2.366906 139 ## 6: 295 2.283871 155
Exploring bias
Finally, we are at the point where we can see what, if any, bias results in selecting our cohorts under the scenario I’ve outlined above. We start by generating multiple iterations of populations and cohorts and estimating average health status by month under the assumption that we will have a look-back period of 0. That is, we will accept an individual into the first possible cohort regardless of her previous emergency department visit history. The plot below shows average across 1000 iterations. What we see is that the average health status of the cohorts in the first 20 months or so exceed the long run average. The incidence - prevalence bias is extremely strong if we ignore prior ED history!
Taking history into account
Once we start to incorporate ED history by using look-back periods greater than 0, we see that we can reduce bias considerably. The two plots below show the results of using look-back periods of 6 and 12 months. Both have reduced bias, but only at 12 months are we approaching something that actually looks desirable. In fact, under this scenario, we’d probably like to go back 24 months to eliminate the bias entirely. Of course, these particular results are dependent on the simulation assumptions, so determining an appropriate look-back period will certainly depend on the actual data. (When we do finally get the actual data, I will follow-up to let you know what kind of adjustment we needed to make in the real, non-simulated world.)
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.