Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In this first post I’m going to present a way of obtaining age- and sex-adjusted incidence rates using poisson regression in R. This will be similar to what is done in Stata as described here.
I’ve written a R function that’s available for download here. The script can be sourced ( source("age-sex-adjust.R"
) and then the function age_sex_adjust()
can be used as is. The code will also be described step by step below.
There are, as usual, several ways to calculate adjusted incidence rates in R. I’ve chosen to use the package stdReg by Arvid Sjölander because it has a lot of nice features and useful implications in causal inference. Specifically, we will use the function stdGlm()
from stdReg to generate the the adjusted incidence rates.
But first we start off with a little bit of background on what an incidence rate is. It is simply a measure of a number of occurrences (a count) in a population over the total population time. For example, in a population of 10 people, each followed 1 year, there was one case of death. In that population, the incidence rate of death would 1 per 10 person years. In observational data, we often have larger cohorts with varying follow-up time and censoring. The calculation is of course the same, using the formula below:
\[\text{Incidence rate} = \frac{\text{Number of occurrences}}{\sum_{\text{Persons}}{\text{Time in study}}}\]
Calculating crude incidence rate
To illustrate, we will now use the colon
dataset from the survival
package.
library(survival) library(dplyr) library(broom)
Running ?survival::colon
tells us the following:
Data from one of the first successful trials of adjuvant chemotherapy for colon cancer
Variable | Explanation |
---|---|
id | Patient id |
study | 1 for all patients |
rx | 1 for all patients |
sex | 1 = male |
age | in years |
obstruct | colon obstruction by tumour |
perfor | performation of colon |
adhere | adherence to nearby organs |
nodes | number of positive lymph nodes |
time | days until event or censoring |
status | censoring status |
differ | tumour differentiation — 1 = well, 2 = moderate, 3 = poor |
extent | extent of local spread — 1 = submucosa, 2 = muscle, 3 = serosa, 4 = continious |
surg | time from surgery to registration — 0 = short, 1 = long |
node4 | more than 4 positive lymph nodes |
etype | event type — 1 = recurrence, 2 = death |
OK, so now that we understand the data, let’s start calculating crude incidence rates for death among the different treatment groups:
# Using the colon dataset from the survival package # Only keep records related to the death outcome colon_death <- colon %>% dplyr::filter(etype == 2) # Time is divided by 365.25/100 to get the time in days variable # first to years, then to 100 person-years colon_death %>% group_by(rx) %>% summarise(Events = sum(status), Time = sum(time/365.25/100), Rate = Events / Time, lower = poisson.test(Events, Time)$conf.int[1], upper = poisson.test(Events, Time)$conf.int[2]) ## # A tibble: 3 x 6 ## rx Events Time Rate lower upper ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Obs 168 13.8 12.2 10.4 14.2 ## 2 Lev 161 13.7 11.7 10.0 13.7 ## 3 Lev+5FU 123 15.0 8.22 6.83 9.80
Now we compare to the calculated rates with rates obtained from the survRate()
function from the biostat3 package:
library(biostat3) survRate(Surv(time/365.25/100, status) ~ rx, data = colon_death) %>% dplyr::select(rx, event, tstop, rate, lower, upper) %>% as_tibble() %>% dplyr::rename(Events = event, Time = tstop, Rate = rate) ## # A tibble: 3 x 6 ## rx Events Time Rate lower upper ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Obs 168 13.8 12.2 10.4 14.2 ## 2 Lev 161 13.7 11.7 10.0 13.7 ## 3 Lev+5FU 123 15.0 8.22 6.83 9.80
Good, the incidence rates are identical. The observational patients had an mortality incidence rate of 12.2 per 100 person-years, compared to the Lev+5-FU treated patients with an incidence rate of 8.22 per 100 person-years. Now, let’s try and repeat these results with poisson regression.
Obtaining estimated incidence rates using poisson regression
Here we use the broom package tidy
function to obtain exponentiated estimates:
poisson_fit <- glm(status ~ rx + offset(log(time/365.25/100)), family = poisson, data = colon_death) tidy(poisson_fit, exponentiate = T, conf.int = T) ## # A tibble: 3 x 7 ## term estimate std.error statistic p.value conf.low conf.high ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 12.2 0.0772 32.4 3.13e-230 10.4 14.1 ## 2 rxLev 0.965 0.110 -0.324 7.46e- 1 0.777 1.20 ## 3 rxLev+5FU 0.675 0.119 -3.32 9.16e- 4 0.534 0.850
The Intercept estimate here is the estimated IR for the reference level, i.e. the Obs group.
To get estimated IR of Lev+5FU treated:
lev_5fu <- predict(poisson_fit, newdata = data.frame(rx = "Lev+5FU", time = 36525), type = "link", se.fit = T) as_tibble(lev_5fu) %>% summarise(Treatment = "Lev+5FU", IR = exp(fit), lower = exp(fit - (1.96 * se.fit)), upper = exp(fit + (1.96 * se.fit))) ## # A tibble: 1 x 4 ## Treatment IR lower upper ## <chr> <dbl> <dbl> <dbl> ## 1 Lev+5FU 8.22 6.88 9.80
Here, the confidence interval needs to be calculated on the \(\log\) scale and then exponentiated back. This will cause the confidence interval to not be centered around the estimate.
A poisson model is actually modeling \(\log\text{incidence rates (ratios)}\) when we use the time variable as an offset. Therefore, we can include covariates in the model to be accounted for, such as age and sex.
Age- and sex-adjusted incidence rates using poisson regression
First, we’ll do it using my age_sex_adjust()
function
source("age-sex-adjust.R") # Usage: age_sex_adjust(data, group, age, sex, event, time) age_sex_adjust(colon_death, rx, age, sex, status, time) ## # A tibble: 3 x 4 ## rx IR lower upper ## <chr> <dbl> <dbl> <dbl> ## 1 Obs 12.2 10.2 14.7 ## 2 Lev 11.8 10.1 13.8 ## 3 Lev+5FU 8.25 6.99 9.73
Here we see that the adjusted rates are very similar to the crude rates calculated above. Since this data comes from a randomized trial, this is expected and can be taken as a sign that the randomization worked.
Now, let’s do the some thing but without using the ready made function to see how it works under the hood.
poisson_fit <- glm(status ~ rx * I(age^2) + sex, offset = log(time/365.25/100), data = colon_death, family = poisson) std_fit <- stdGlm(poisson_fit, data = transform(colon_death, time = 36525), X = "rx") std_sum <- summary(std_fit, CI.type = "log") std_sum[["est.table"]] %>% as_tibble(rownames = "rx") %>% transmute(Treatment = rx, IR = Estimate, lower = `lower 0.95`, upper = `upper 0.95`) ## # A tibble: 3 x 4 ## Treatment IR lower upper ## <chr> <dbl> <dbl> <dbl> ## 1 Obs 12.2 10.2 14.7 ## 2 Lev 11.8 10.1 13.8 ## 3 Lev+5FU 8.25 6.99 9.73
The numbers are identical to the ones obtained from the age_sex_adjust()
function, which is logical since we did the same thing as the function does.
A few finishing notes. Here I included age as a quadratic term, and as an interaction with exposure. These are modeling decisions one will have to take, however the model could have been only a main effects model such as:
\[\log\lambda = \beta_0 + \beta_1\text{rxLev} + \beta_2\text{rxLev+5FU} + \beta_3\text{age} + \beta_4\text{sex}\]
Regarding the interaction term, a good explanation was given in the Stata forum in this post.
For anyone who wants to read more, I recommend the course material from the PhD course Biostatistics III at Karolinska Institutet, available here.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.