
Relative risk regression (1/2)


When the outcome variable is binary such as alive/dead or yes/no, the most popular analytic method is logistic regression.

\[\textrm{logit}(\mathbb{E}[y]) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots \]

The name “logistic” likely comes from the equation below, which is obtained by applying the inverse of the logit function to both sides of the equation above.

\[ \mathbb{E}[y] = \textrm{logistic}( \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots)\]
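In R, the logit function and its inverse, the logistic function, are available as qlogis() and plogis(). A quick, self-contained check that they undo each other:

p <- 0.3
qlogis(p)          # logit(0.3) = log(0.3 / 0.7), about -0.847
plogis(qlogis(p))  # the logistic function undoes logit(): returns 0.3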

The link function of logistic regression is logit(). If we replace it with log(), the result looks like this:

\[ \textrm{log}(\mathbb{E}[y]) = \beta_0 + \beta_1 x_1 + \cdots \] This equation represents “Relative Risk Regression”, a.k.a. log-binomial regression.

Risk, Relative Risk

Risk is just another term for probability. For instance, “the probability of being struck by lightning” can be rephrased as “the risk of being struck by lightning”.

Relative risk, or risk ratio (RR), is the ratio of two probabilities (risks); it compares the probabilities of two events. For example, compare the probability of being struck by lightning while standing with nothing to the probability of being struck by lightning while holding an open umbrella. Dividing the second probability by the first tells us how many times more likely we are to be struck with an open umbrella than with nothing at all. This is the relative risk, or risk ratio. If it equals 2, then on average we will be struck twice (with an open umbrella) for every one strike (with nothing).
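As a toy calculation (the risks below are made up purely for illustration):

# Made-up risks, purely for illustration
p_nothing  <- 0.001   # risk of being struck with nothing
p_umbrella <- 0.002   # risk of being struck with an open umbrella
p_umbrella / p_nothing
## [1] 2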

The name “Relative Risk Regression” seems to come from the fact that the coefficients of a relative risk regression are closely related to relative risks! Let’s imagine a relative risk regression with only one predictor \(x\), which is \(1\) for holding an open umbrella and \(0\) for having nothing. We can compare \(y|x=0\) and \(y|x=1\).

\[\log(y_{x=1}) = \beta_0 + \beta_1\] \[\Rightarrow y_{x=1} = \exp(\beta_0 + \beta_1)\]

\[\Rightarrow y_{x=1} = \exp(\beta_0)\exp(\beta_1)\]

\[y_{x=0} = \exp(\beta_0)\]

Combining the last two equations, we can derive the following.

\[y_{x=1}/y_{x=0} = \exp(\beta_1)\]

If we interpret \(y_{x=1}\) as the probability of being struck when \(x=1\) (with an open umbrella), then the relative risk, or risk ratio, is \(\exp(\beta_1)\)!

The risk of being struck with an open umbrella divided by the risk of being struck with nothing is the exponential of the coefficient \(\beta_1\). So if \(\beta_1\) equals 1, holding an open umbrella makes the risk approximately 2.718 (\(\exp(1) = 2.718\cdots\)) times larger: on average you will be struck 2.718 times (with an open umbrella) for every one time people with nothing are struck.
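A quick sanity check in R:

beta1 <- 1
exp(beta1)  # about 2.718: the relative risk implied by a coefficient of 1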

Difficulties of applying MLE

Open any mathematical statistics textbook and you will see the wonderful properties of the MLE (maximum likelihood estimator). So MLE is the way to go when estimating the coefficients of a relative risk regression. But fitting a relative risk regression is difficult, because it means optimizing the likelihood under constraints on the parameters. See the equations below.

\[\log(y|x_1) = \beta_0 + \beta_1 x_1\] \[y|x_1 = \exp(\beta_0 + \beta_1 x_1)\]

Since \(y\) stands for a probability, \(\exp(\beta_0 + \beta_1 x_1)\) must not exceed \(1\) for any possible \(x_1\) (being an exponential, it is already positive)! Another problem is that the parameters can lie on the boundary of the feasible parameter space, which makes it difficult to estimate their variances.

Using R for Relative Risk Regression

We can use the traditional function glm() for relative risk regression, but the package logbin seems to offer more convenience and functionality; in particular, it lets us choose the estimation method. Let’s get to it!
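For reference, the plain glm() call would look like the sketch below (dat, y, x1 and x2 are hypothetical names; with a log link, glm() may fail to converge or stop with an error when fitted probabilities get close to 1):

# Minimal sketch with a hypothetical data frame `dat` and binary outcome `y`
fit_glm <- glm(y ~ x1 + x2, family = binomial(link = "log"), data = dat)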

First we will use the Heart Attack Data (data(heart)). A description of the data can be found with ?heart:

This data set is a cross-tabulation of data on 16949 individuals who experienced a heart attack (ASSENT-2 Investigators, 1999). There are 4 categorical factors each at 3 levels, together with the number of patients and the number of deaths for each observed combination of the factors. This data set is useful for illustrating the convergence properties of glm and glm2.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(logbin) # https://github.com/mdonoghoe/logbin
require(glm2, quietly = TRUE)
data(heart)

head(heart)
##   Deaths Patients AgeGroup Severity Delay Region
## 1     49     2611        1        1     1      1
## 2      1       74        1        1     1      2
## 3      2       96        1        1     1      3
## 4     30     2888        1        1     2      1
## 5      0       81        1        1     2      2
## 6      8      155        1        1     2      3

We can fit the relative risk regression model to the data as follows. Notice that the response part of the formula is cbind(# of successes, # of failures).

start.p <- sum(heart$Deaths) / sum(heart$Patients)
fit <- 
  logbin(cbind(Deaths, Patients-Deaths) ~ 
           factor(AgeGroup) + factor(Severity) 
           + factor(Delay) + factor(Region), 
         data = heart)
fit$converged

Using a binary response variable instead, we can do the following.

sum(duplicated(heart %>% select(AgeGroup:Region)))
## [1] 0
heart2 <- heart %>% 
  group_by(AgeGroup, Severity, Delay, Region) %>%
  summarise(data.frame(dead = c(rep(1,Deaths),
                                rep(0,Patients-Deaths)))) %>%
  ungroup()
## `summarise()` has grouped output by 'AgeGroup', 'Severity', 'Delay', 'Region'.
## You can override using the `.groups` argument.
fit2 <- 
  logbin(dead ~ 
           factor(AgeGroup) + factor(Severity) 
           + factor(Delay) + factor(Region),
         data = heart2)
fit2$converged

For me, it took a LONG time! Here is a faster way.1

start.p <- sum(heart$Deaths) / sum(heart$Patients)
fit <- 
  logbin(cbind(Deaths, Patients-Deaths) ~ 
           factor(AgeGroup) + factor(Severity) 
           + factor(Delay) + factor(Region), 
         data = heart,
         start = c(log(start.p), -rep(1e-4,8)),
         method = 'glm2')
cat('Is fit converged? ', fit$converged, '\n')
## Is fit converged?  TRUE
fit2 <- 
  logbin(dead ~ 
           factor(AgeGroup) + factor(Severity) 
           + factor(Delay) + factor(Region), 
         data = heart2,
         start = c(log(start.p), -rep(1e-4,8)),
         method = 'glm2')
cat('Is fit2 converged? ', fit2$converged, '\n')
## Is fit2 converged?  TRUE

Here is a tip: use the (# of successes, # of failures) form of the response. Using the binary response took much longer!

The results are almost identical:

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
compareCoefs(fit, fit2)
## Calls:
## 1: logbin(formula = cbind(Deaths, Patients - Deaths) ~ factor(AgeGroup) + 
##   factor(Severity) + factor(Delay) + factor(Region), data = heart, start = 
##   c(log(start.p), -rep(1e-04, 8)), method = "glm2")
## 2: logbin(formula = dead ~ factor(AgeGroup) + factor(Severity) + 
##   factor(Delay) + factor(Region), data = heart2, start = c(log(start.p), 
##   -rep(1e-04, 8)), method = "glm2")
## 
##                   Model 1 Model 2
## (Intercept)       -4.0275 -4.0273
## SE                 0.0889  0.0889
##                                  
## factor(AgeGroup)2   1.104   1.104
## SE                  0.089   0.089
##                                  
## factor(AgeGroup)3  1.9268  1.9266
## SE                 0.0924  0.0924
##                                  
## factor(Severity)2  0.7035  0.7035
## SE                 0.0701  0.0701
##                                  
## factor(Severity)3  1.3767  1.3768
## SE                 0.0955  0.0955
##                                  
## factor(Delay)2     0.0590  0.0589
## SE                 0.0693  0.0693
##                                  
## factor(Delay)3     0.1718  0.1720
## SE                 0.0808  0.0808
##                                  
## factor(Region)2    0.0757  0.0757
## SE                 0.1775  0.1775
##                                  
## factor(Region)3     0.483   0.483
## SE                  0.111   0.111
## 
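Since the coefficients are on the log scale, exponentiating them turns them into adjusted relative risks (coef() extracts the coefficient vector from a logbin fit, just as compareCoefs() above relies on):

exp(coef(fit))  # adjusted relative risks; exp(intercept) is the baseline risk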

The authors of logbin state that logbin solves problems that might pop up when using other packages.

Let’s compare!

start.p <- sum(heart$Deaths) / sum(heart$Patients)
t.glm <- system.time(
  fit.glm <- 
    logbin(cbind(Deaths, Patients-Deaths) ~ 
             factor(AgeGroup) + factor(Severity) 
             + factor(Delay) + factor(Region), 
           data = heart,
           start = c(log(start.p), -rep(1e-4, 8)), 
           method = "glm", 
           maxit = 10000)
)

t.glm2 <- system.time(
  fit.glm2 <- update(fit.glm, method='glm2'))
t.cem <- system.time(
  fit.cem <- update(fit.glm, method = "cem")
  #fit.cem <- update(fit.glm, method='cem', start = NULL)
  )
t.em <- system.time(
  fit.em <- update(fit.glm, method = "em"))
t.cem.acc <- system.time(
  fit.cem.acc <- update(fit.cem, accelerate = "squarem"))
t.em.acc <- system.time(
  fit.em.acc <- update(fit.em, accelerate = "squarem"))

objs = list("glm"=fit.glm, 
            "glm2"=fit.glm2,
            "cem"=fit.cem, 
            "em"=fit.em, 
            "cem.acc" = fit.cem.acc, 
            "em.acc" = fit.em.acc)
params = c('converged', "loglik", "iter")

# Collect the selected components from each fitted model into a data frame
to_dataframe = function(objs, params) {
  dat = data.frame(model = names(objs))
  
  for (param in params) {
    dat[[param]] = sapply(objs, 
                          function(x) x[[param]])
  }
  
  return(dat)
}

dat = to_dataframe(objs, params)

dat$time = c(t.glm['elapsed'], 
             t.glm2['elapsed'],
             t.cem['elapsed'], 
             t.em['elapsed'], 
             t.cem.acc['elapsed'], 
             t.em.acc['elapsed'])

Let’s see the result.

print(dat)
##     model converged    loglik         iter  time
## 1     glm     FALSE -186.7366        10000  1.61
## 2    glm2      TRUE -179.9016           14  0.00
## 3     cem      TRUE -179.9016 223196, 8451 42.47
## 4      em      TRUE -179.9016         6492  2.34
## 5 cem.acc      TRUE -179.9016    4215, 114  3.78
## 6  em.acc      TRUE -179.9016           81  0.09

The authors of the logbin package state that cem is the best method, but here it took the longest time. glm2 was the fastest and converged, but it requires sensible starting values. So we cannot tell which will win when the data are larger and the model is more complex.

In the next post, I will explain how the model and the meaning of the coefficients change with different link functions.


  1. This one uses the glm2 package; I think logbin is just a wrapper in this case. I omitted warnings and messages.↩︎
