Delta Method Standard Errors
Logistic regression produces results that are typically interpreted in one of two ways:
- Predicted probabilities
- Odds ratios
Odds are the ratio of the probability that something happens to the probability that it doesn't happen.
\[ \Omega(X) = \frac{p(y=1|X)}{1-p(y=1|X)} \] An odds ratio is the ratio of two odds, each calculated at a different score for \(X\).
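For example, a quick sketch in R with made-up probabilities (values chosen purely for illustration):

```r
# Hypothetical probabilities for two groups (illustration only)
p1 <- 0.6
p2 <- 0.4
odds1 <- p1 / (1 - p1)  # 1.5
odds2 <- p2 / (1 - p2)  # 0.667
odds1 / odds2           # odds ratio: 2.25
```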
There are strengths and weaknesses to either choice.
- Predicted probabilities are intuitive, but require assuming a value for every covariate.
- Odds ratios do not require specifying values for other covariates, but ratios of ratios are not always intuitive.
As an illustration, we model 2016 ANES respondents' votes for Trump or Clinton. See the prior blog post for details.
```r
trump_model <- glm(vote ~ gender + educ + age,
                   data = anes, family = binomial(link = "logit"))

estimates <- round(coef(trump_model), 3)

results_tab <- tidy(trump_model) %>%
  mutate_if(is.numeric, funs(round(., 3))) %>%
  var_mapping(term) %>%
  dplyr::rename(Estimate = term, Beta = estimate, SE = std.error,
                z = statistic, p = p.value)

kable(results_tab, align = c("l", rep("c", 4)))
```
Estimate | Beta | SE | z | p |
---|---|---|---|---|
Constant | -1.051 | 0.255 | -4.127 | 0.000 |
Female | -0.374 | 0.085 | -4.384 | 0.000 |
Completed HS | 0.655 | 0.231 | 2.839 | 0.005 |
College < 4 Years | 0.696 | 0.218 | 3.195 | 0.001 |
College 4 Year Degree | 0.411 | 0.222 | 1.853 | 0.064 |
Advanced Degree | -0.424 | 0.229 | -1.850 | 0.064 |
Age | 0.015 | 0.002 | 6.026 | 0.000 |
Interpreting as odds ratios:
- The odds of voting for Trump are \(100\times[\mbox{exp}(-.374) - 1]\) = 31% lower for women compared to men.
- The odds of voting for Trump are \(100\times[\mbox{exp}(.655) - 1]\) = 93% higher for those with only a high school diploma compared to those without a high school diploma.
- The odds of voting for Trump are \(100\times[\mbox{exp}(.696) - 1]\) = 101% higher for those with some college (but no 4-year degree) compared to those without a high school diploma.
- Each additional year of age increases the odds of voting for Trump by \(100\times[\mbox{exp}(.015) - 1]\), or about 1.5%.
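These percentage interpretations are easy to compute all at once. A sketch, using the `trump_model` fitted above:

```r
# Percentage change in the odds for a one-unit change in each predictor
round(100 * (exp(coef(trump_model)) - 1), 1)
```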
Interpreting as predicted probabilities:
```r
predict_data <- expand.grid(educ   = levels(anes$educ),
                            gender = levels(anes$gender),
                            age    = seq(20, 90, by = 10))

pred_probs <- predict_data %>%
  mutate(prob_trump = predict(trump_model, newdata = predict_data,
                              type = "response"))

ggplot(pred_probs, aes(x = age, y = prob_trump, group = educ, color = educ)) +
  geom_line() +
  facet_grid(~ gender) +
  labs(x = "Age (Years)", y = "Probability of Voting for Trump") +
  scale_color_discrete(name = "Highest\nEducation")
```
The problem:
These are sample estimates. How can we assess levels of uncertainty?
For sample estimates, we would like a standard error (SE) or a 95% confidence interval (the former usually used to create the latter).
Start with odds ratios, which at first seem the easier case. How can we get a 95% CI around an odds ratio?
- The odds ratio is just \(\mbox{exp}(\beta_k)\).
- Software gives us a 95% confidence interval around \(\beta_k\).
- Create the 95% confidence interval around \(\beta_k\) as \(\beta_k \pm 1.96 \times \mbox{SE}_{\beta_k}\).
- Exponentiate the lower limit and the upper limit of 95% CI around \(\beta_k\).
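By hand, the Wald version of these steps looks like the following sketch (using the model fitted above):

```r
# Wald 95% CI on the log-odds scale, then exponentiated to the OR scale
b  <- coef(trump_model)
se <- sqrt(diag(vcov(trump_model)))
exp(cbind(Lower = b - 1.96 * se, Upper = b + 1.96 * se))
```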
This is what is typically done in R with `exp(confint(model_object))`. Note that `confint()` applied to a glm profiles the likelihood (hence the "Waiting for profiling to be done..." message below) rather than using the Wald formula above, so the results can differ slightly from the hand-rolled version.
Note that, because `exp()` is a nonlinear transformation, the resulting confidence intervals will be asymmetric. To aid illustration, add a random noise variable (whose coefficient's standard error will be large).
```r
set.seed(1)
anes <- anes %>% mutate(noise = rnorm(nrow(anes), 0, .05))
```
Fit the model.
```r
trump_model <- glm(vote ~ gender + educ + age + noise,
                   data = anes, family = binomial(link = "logit"))
```
Exponentiate the 95% confidence interval around \(\beta_k\).
```r
confint(trump_model) %>% exp
## Waiting for profiling to be done...
##                               2.5 %    97.5 %
## (Intercept)               0.2110361 0.5737587
## genderFemale              0.5820401 0.8131175
## educCompleted HS          1.2277695 3.0374878
## educCollege < 4 Years     1.3127215 3.0913909
## educCollege 4 Year Degree 0.9793854 2.3409015
## educAdvanced Degree       0.4181680 1.0294183
## age                       1.0102303 1.0201813
## noise                     0.2170954 5.4118869
```
A graph is even better:
```r
cis <- exp(confint(trump_model)) %>%
  as.data.frame %>%
  rownames_to_column("Variable") %>%
  var_mapping(Variable)
## Waiting for profiling to be done...

graph_data <- tidy(trump_model) %>%
  mutate_if(is.numeric, funs(round(., 3))) %>%
  dplyr::rename(Variable = term, Beta = estimate, SE = std.error,
                z = statistic, p = p.value) %>%
  var_mapping(Variable) %>%
  mutate(OR = exp(Beta)) %>%
  left_join(cis) %>%
  filter(Variable != "Constant")
## Joining, by = "Variable"

ggplot(data = graph_data,
       aes(x = Variable, y = OR, ymin = `2.5 %`, ymax = `97.5 %`)) +
  geom_pointrange() +
  geom_hline(yintercept = 1, lty = 2) +
  coord_flip() +
  xlab("Variable") +
  ylab("Odds Ratio with 95% CI")
```
Delta Method Standard Errors for Odds Ratios
Alternatively, we can use the SE for the odds ratio to determine a normal (and symmetric) approximation for the 95% CI. But what is the SE for the odds ratio?
An approach known as the delta method is used frequently to come up with standard errors for nonlinear transformations of model parameters.
It is based on computing the variance for a Taylor series linearization of the function.
A Taylor series rewrites a function around a given location \(a\) as a (possibly infinite) sum of terms involving the function's derivatives at \(a\).
\[ f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \frac{f'''(a)}{3!}(x - a)^3 + \ldots \]
A Taylor series approximation chops off all but the first few terms. A linear approximation to \(f(x)\) at \(a\) keeps only the first derivative:
\[ f(x) \approx f(a) + f'(a)(x - a) \]
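For example, linearizing \(e^x\) at \(a = 0\) gives
\[ e^x \approx e^0 + e^0(x - 0) = 1 + x \]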
Take a linear transformation of a random variable \(x\).
\[ f(x) = a + bx \]
The variance of \(f(x)\) is known to be
\[ \mbox{Var}(a + bx) = b^2\mbox{Var}(x) \] or
\[ \mbox{Var}(a + bx) = b\mbox{Var}(x)b \] or
\[ \mbox{Var}(a + bx) = f'(x)\mbox{Var}(x)f'(x) \]
Generalizing to any (univariate) differentiable function of a random variable \(x\) with mean \(\mu\), we can approximate the function near \(\mu\) with
\[ f(x) \approx f(\mu) + f'(\mu)(x - \mu) \]
meaning that the variance of the function is
\[ \mbox{Var}\left(f(x)\right) = f'(\mu)\mbox{Var}(x)f'(\mu) \]
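A quick simulation makes the quality of this approximation concrete. The sketch below (with arbitrary values for \(\mu\) and \(\mbox{Var}(x)\)) compares the simulated variance of \(e^x\) to the delta-method value \(f'(\mu)^2\,\mbox{Var}(x)\):

```r
set.seed(42)
mu    <- 0.5
sigma <- 0.1
x <- rnorm(1e5, mu, sigma)

var(exp(x))          # simulated variance of exp(x)
exp(mu)^2 * sigma^2  # delta-method approximation: f'(mu)^2 * Var(x)
```

The two values agree closely here; the approximation degrades as \(\mbox{Var}(x)\) grows, since the linearization only holds near \(\mu\).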
This generalizes to functions of multiple variables. Simply replace the derivative with the gradient vector and the variance with the variance-covariance matrix.
\[ \mbox{Var}\left(f(\mathbf x)\right) = \nabla(\boldsymbol\mu )^{T}\mbox{Cov}(\mathbf x)\nabla(\boldsymbol\mu) \]
Going back to logistic regression, our random variables are the sample estimates \(\widehat{\beta}_k\), and our function is \(f(\beta_k) = e^{\beta_k}\). The maximum likelihood estimates are the values for the vector \(\boldsymbol \mu\), and the covariance matrix is the covariance matrix of the estimates. Both are easily recovered in R from a `glm` object.
```r
coef(trump_model)  # maximum likelihood estimates
vcov(trump_model)  # covariance matrix of the estimates
```
The function in which we are interested is
\[ f(\beta_k) = e^{\beta_k} \]
The delta-method variance is
\[ \mbox{Var}\left(f(\widehat{\beta}_k)\right) = f'(\beta_k)\mbox{Var}(\widehat{\beta}_k)f'(\beta_k) \]
This turns out to be a pretty simple problem, given that
\[ \frac{d e^x}{dx} = e^x \]
What would the variance be for the odds ratio on the noise term?
- The estimate \(\widehat{\beta}_{\mbox{noise}}\) was 0.0804698.
- Its variance, from the diagonal of `vcov(trump_model)`, was 0.6725694.
\[ \mbox{Var}\left(\mbox{exp}\left(\beta_{\mbox{noise}}\right)\right) = e^{\beta}\mbox{Var}\left(\hat{\beta}\right)e^{\beta} = 0.79 \]
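Reproducing this "by hand" in R:

```r
b <- coef(trump_model)["noise"]
v <- vcov(trump_model)["noise", "noise"]
exp(b) * v * exp(b)        # delta-method variance of the odds ratio: 0.79
sqrt(exp(b) * v * exp(b))  # its standard error: 0.889
```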
A normal-approximated 95% confidence interval for the odds ratio is then found as
\[ 95\% \mbox{ CI} = \widehat{\mbox{OR}} \pm 1.96 \times \mbox{SE}_{\widehat{\mbox{OR}}} \]
where \(\mbox{SE}_{\widehat{\mbox{OR}}}\) is the square root of the delta-method variance.
A function to return odds ratios and confidence intervals based on the normal approximation:
```r
OR_SE <- function(glm_object, variable){
  d <- exp(coef(glm_object)[variable])
  v <- vcov(glm_object)[variable, variable]
  output_df <- as_tibble(list(Estimate = variable,
                              Beta = coef(glm_object)[variable],
                              OR = d,
                              `OR SE` = sqrt(d * v * d))) %>%
    mutate(`Lower CI` = OR - 1.96 * `OR SE`,
           `Upper CI` = OR + 1.96 * `OR SE`)
  output_df
}
```
Map over all estimates and reduce to a tibble.
```r
or_tibble <- map(names(coef(trump_model)), function(i) OR_SE(trump_model, i)) %>%
  reduce(bind_rows)

kable(or_tibble %>% var_mapping(Estimate), align = c("l", rep("c", 5)))
```
Estimate | Beta | OR | OR SE | Lower CI | Upper CI |
---|---|---|---|---|---|
Constant | -1.0513062 | 0.3494810 | 0.0890344 | 0.1749736 | 0.5239883 |
Female | -0.3738035 | 0.6881121 | 0.0586800 | 0.5730993 | 0.8031249 |
Completed HS | 0.6543854 | 1.9239596 | 0.4436975 | 1.0543126 | 2.7936067 |
College < 4 Years | 0.6962532 | 2.0062217 | 0.4372836 | 1.1491458 | 2.8632976 |
College 4 Year Degree | 0.4109402 | 1.5082351 | 0.3344887 | 0.8526372 | 2.1638330 |
Advanced Degree | -0.4244709 | 0.6541158 | 0.1500084 | 0.3600993 | 0.9481323 |
Age | 0.0150621 | 1.0151761 | 0.0025378 | 1.0102021 | 1.0201501 |
Noise | 0.0804698 | 1.0837961 | 0.8888248 | -0.6583004 | 2.8258927 |
```r
cis <- exp(confint(trump_model)) %>%
  as.data.frame %>%
  rownames_to_column("Variable") %>%
  var_mapping(Variable)
## Waiting for profiling to be done...

graph_data_2 <- tidy(trump_model) %>%
  dplyr::rename(Variable = term, Beta = estimate, SE = std.error,
                z = statistic, p = p.value) %>%
  mutate_if(is.numeric, funs(round(., 3))) %>%
  var_mapping(Variable) %>%
  mutate(OR = exp(Beta)) %>%
  left_join(cis) %>%
  mutate(Source = "exp(confint(.))") %>%
  dplyr::select(Source, Variable, OR,
                `Lower CI` = `2.5 %`, `Upper CI` = `97.5 %`) %>%
  bind_rows(or_tibble %>%
              mutate(Source = "Delta Method") %>%
              dplyr::select(Source, Variable = Estimate, OR,
                            `Lower CI`, `Upper CI`) %>%
              var_mapping(Variable)) %>%
  mutate(Source = factor(Source, levels = c("exp(confint(.))", "Delta Method"))) %>%
  filter(Variable != "Constant")
## Joining, by = "Variable"

ggplot(data = graph_data_2,
       aes(x = Variable, y = OR, ymin = `Lower CI`, ymax = `Upper CI`)) +
  geom_pointrange() +
  geom_hline(yintercept = 1, lty = 2) +
  coord_flip() +
  xlab("Variable") +
  ylab("Odds Ratio with 95% CI") +
  facet_grid(~ Source)
```
Which is the correct method?
- Delta method provides a standard error for the odds ratio, which can be used to create a normal-approximated (i.e. symmetric) confidence interval.
- But delta method confidence intervals can also extend into negative territory.
What does Stata do?
- Stata reports standard errors for odds ratios determined by the delta method.
- But its 95% confidence intervals around the odds ratios are based on \(\mbox{exp}(\beta \pm 1.96\times\mbox{SE}_{\beta})\).
That is, the standard error comes from the delta method, but the confidence intervals are essentially R's `exp(confint(model_object))`!
Let’s export our data to Stata and take a look.
```r
haven::write_dta(anes, "anes-with-noise.dta")
```

```
. logistic vote i.gender i.educ age noise

Logistic regression                             Number of obs     =      2,368
                                                LR chi2(7)        =     147.10
                                                Prob > chi2       =     0.0000
Log likelihood = -1565.3592                     Pseudo R2         =     0.0449

----------------------------------------------------------------------------------------
                  vote | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------------+----------------------------------------------------------------
                gender |
                Female |   .6881121     .05868     -4.38   0.000      .582199    .8132929
                       |
                  educ |
          Completed HS |    1.92396   .4436975      2.84   0.005     1.224319    3.023412
     College < 4 Years |   2.006222   .4372836      3.19   0.001     1.308723     3.07546
 College 4 Year Degree |   1.508235   .3344887      1.85   0.064     .9765486    2.329401
       Advanced Degree |   .6541158   .1500084     -1.85   0.064     .4173001    1.025323
                       |
                   age |   1.015176   .0025378      6.03   0.000     1.010214    1.020162
                 noise |   1.083796   .8888249      0.10   0.922     .2172073    5.407803
                 _cons |    .349481   .0890344     -4.13   0.000     .2121143    .5758072
----------------------------------------------------------------------------------------
```
Comparing the standard errors from Stata to our delta method SEs, we find the two match.
But the delta method CIs do not match. In fact, Stata's confidence intervals are close to R's `exp(confint())` results.
Understanding Stata’s output for logit models with results reported as odds ratios:
- The CIs are exponentiated \(\widehat{\beta_k} \pm 1.96 \times \mbox{SE}_{\widehat{\beta_k}}\).
- The p-values reported for the odds ratio come from \(z = \frac{\widehat{\beta}_k}{\mbox{SE}_{\widehat{\beta}_k}}\). To see this, compare the p-values with and without the `or` option to the `logit` command.
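In R, the same z statistics and p-values fall out of the log-odds-scale estimates. A sketch:

```r
b  <- coef(trump_model)
se <- sqrt(diag(vcov(trump_model)))
z  <- b / se
2 * pnorm(-abs(z))  # two-sided p-values, computed on the log-odds scale
```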
If the sampling distribution is asymmetric, then the confidence interval should be asymmetric. We can use the nonparametric bootstrap to get a sense of the shape of the sampling distribution. The R code applied to the ANES data is:
```r
or_boot_func <- function(form, data, indices) {
  data <- data[indices, ]
  ors  <- exp(coef(glm(form, data = data, family = "binomial")))
  ors
}

or_boot_values <- boot(data = anes, statistic = or_boot_func, R = 1000,
                       form = vote ~ gender + educ + age + noise)

boot_ci <- apply(or_boot_values$t, 2, function(i) {
  as_tibble(list(Estimate = mean(i),
                 Lower = quantile(i, .025),
                 Upper = quantile(i, .975)))
}) %>%
  reduce(bind_rows)

boot_ci %>%
  mutate(Variable = c("Constant", graph_data$Variable)) %>%
  filter(Variable != "Constant") %>%
  ggplot(aes(x = Variable, y = Estimate, ymin = Lower, ymax = Upper)) +
  geom_pointrange() +
  geom_hline(yintercept = 1, lty = 2) +
  coord_flip() +
  xlab("Variable") +
  ylab("Bootstrapped Mean Odds Ratio and 2.5-97.5 Percentiles")
```
Delta Method Standard Errors for Predicted Probabilities
It makes most sense to stick with `exp(confint(glm_object))` for confidence intervals around ORs, since this is a trivial task. But what about other quantities that are functions of model parameters, such as predicted probabilities?
Predicted probabilities are obtained from the results of a logit model by:
\[ \begin{aligned} \mbox{Pr}(y=1|x) &= F(x\beta) \\ &= \frac{\mbox{exp}(x\beta)}{1 + \mbox{exp}(x\beta)} \end{aligned} \]
What is the prediction for a 55-year-old male who finished high school but did not go to college (with average noise)? First, get \(x\beta\).
```r
estimates <- coef(trump_model)
xb <- estimates[1] + estimates[2]*0 + estimates[3]*1 + estimates[4]*0 +
  estimates[5]*0 + estimates[6]*0 + estimates[7]*55 + estimates[8]*0
(xb <- as.numeric(round(xb, 3)))
## [1] 0.431
```
\[ \begin{aligned} \mbox{Pr}(y=1|x) &= \frac{\mbox{exp}(x\beta)}{1 + \mbox{exp}(x\beta)} \\ &= \frac{\mbox{exp}(0.431)}{1 + \mbox{exp}(0.431)} \\ &= 0.606 \end{aligned} \]
Equivalently, using the logistic cdf `plogis()`:
```r
round(plogis(xb), 3)
## [1] 0.606
```
The predicted probability is a function of all the parameter estimates, so we need to use the matrix version of the Taylor series approximation to get our SE.
\[ \mbox{Var}\left(f(\mathbf x)\right) = \nabla(\boldsymbol\mu )^{T}\mbox{Cov}(\mathbf x)\nabla(\boldsymbol\mu) \]
First, determine the gradient vector for the \(\beta_k\).
We’ve used \(F()\) to represent the standard logistic cdf. Let \(f()\) be the standard logistic pdf.
By the chain rule, and by the fact that the first derivative of a cdf is its pdf,
\[ \begin{aligned} \frac{\partial F(x'\beta)}{\partial \beta} &= \frac{\partial F(x'\beta)}{\partial x'\beta}\frac{\partial x'\beta}{\partial \beta} \\ &= f(x'\beta)x \end{aligned} \]
Define the \(x\) vector for our subject of interest along with the beta vector from the model results.
```r
x <- c(1, 0, 1, 0, 0, 0, 55, 0)
beta <- coef(trump_model)
cbind(beta, x)
##                                  beta  x
## (Intercept)               -1.05130623  1
## genderFemale              -0.37380350  0
## educCompleted HS           0.65438538  1
## educCollege < 4 Years      0.69625321  0
## educCollege 4 Year Degree  0.41094016  0
## educAdvanced Degree       -0.42447089  0
## age                        0.01506207 55
## noise                      0.08046982  0
```
And \(x'\beta\) is
```r
xb <- t(x) %*% beta
```
The variance is thus:
\[ \begin{aligned} \mbox{Var}(p(\mbox{Vote} = \mbox{Trump})) &= \nabla(\boldsymbol\mu)^{T}\mbox{Cov}(\mathbf x)\nabla(\boldsymbol\mu) \\ &= f(x'\beta)x'\mbox{Cov}(\widehat{\beta})xf(x'\beta) \end{aligned} \]
Calculating “by hand” in R:
```r
dlogis(xb) %*% t(x) %*% vcov(trump_model) %*% x %*% dlogis(xb) %>% sqrt
##            [,1]
## [1,] 0.02747443
```
The `deltamethod` function in the `msm` package will also do this for you. The arguments are
- The function expressed as a formula
- A vector of means, here \(\mu = \beta\)
- The covariance matrix
```r
deltamethod(~ exp(x1 + x2*0 + x3*1 + x4*0 + x5*0 + x6*0 + x7*55 + x8*0) /
              (1 + exp(x1 + x2*0 + x3*1 + x4*0 + x5*0 + x6*0 + x7*55 + x8*0)),
            coef(trump_model), vcov(trump_model))
## [1] 0.02747443
```
This is the standard error for our predicted probability.
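From here, a normal-approximated 95% CI for the predicted probability follows in the usual way. A sketch, plugging in the quantities computed above:

```r
p_hat <- as.numeric(plogis(xb))  # predicted probability, 0.606
se_p  <- 0.02747443              # delta-method SE from above
c(Lower = p_hat - 1.96 * se_p, Upper = p_hat + 1.96 * se_p)
# roughly (0.552, 0.660)
```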
One catch: because the derivatives are found symbolically, you are limited to what `stats::deriv` knows. For example, the following would be simpler to code.
```r
deltamethod(~ plogis(x1 + x2*0 + x3*1 + x4*0 + x5*0 + x6*0 + x7*55 + x8*0),
            coef(trump_model), vcov(trump_model))
```
But R throws an error: `Function 'plogis' is not in the derivatives table`.
By the way, this is where Stata truly shines and why I still turn to it despite protestations that "R can do everything." Here are all of the predicted probabilities, with standard errors, for all combinations of education and gender, for 55-year-olds.
```
. margins educ, over(gender) at(age = 55 noise = 0)

---------------------------------------------------------------------------------------
                      |            Delta-method
                      |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------------+----------------------------------------------------------------
          gender#educ |
   Male#HS Incomplete |   .4445065   .0514349     8.64   0.000     .3436959     .545317
     Male#HS Complete |   .6062301   .0274744    22.07   0.000     .5523812      .660079
    Male#Some College |   .6161789   .0205271    30.02   0.000     .5759466    .6564112
           Male#BA/BS |   .5468739   .0231694    23.60   0.000     .5014627    .5922851
          Male#Higher |    .343584   .0251349    13.67   0.000     .2943204    .3928475
 Female#HS Incomplete |      .3551   .0482039     7.37   0.000     .2606221    .4495779
   Female#HS Complete |   .5144184   .0278629    18.46   0.000     .4598081    .5690286
  Female#Some College |   .5248688   .0201584    26.04   0.000      .485359    .5643786
         Female#BA/BS |   .4536941   .0231244    19.62   0.000     .4083711    .4990171
        Female#Higher |   .2648002   .0215839    12.27   0.000     .2224966    .3071038
---------------------------------------------------------------------------------------
```
Type one more word, and this becomes a pretty graph.
```
. marginsplot
```
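In fairness, much of this particular table can be reproduced in R, because `predict.glm()` with `se.fit = TRUE` and `type = "response"` applies the same first-order (delta-method) transformation to the link-scale standard error. A sketch, with grid values mirroring the `margins` call above:

```r
# All education-by-gender combinations at age 55 and noise 0
grid <- expand.grid(educ   = levels(anes$educ),
                    gender = levels(anes$gender),
                    age    = 55,
                    noise  = 0)

pred <- predict(trump_model, newdata = grid, type = "response", se.fit = TRUE)

grid %>%
  mutate(Margin = pred$fit,
         SE     = pred$se.fit,       # delta-method standard errors
         Lower  = Margin - 1.96 * SE,
         Upper  = Margin + 1.96 * SE)
```

What Stata adds is the convenience: one short command produces every margin, and the plot is one more word away.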
Again, the delta method gives us an approximation that may not be accurate.
Predicted probabilities are bounded by zero and one, yet a delta method CI may extend below or above these limits.
Alternatives:
- Bootstrapping
- Simulation: repeated draws from \(\boldsymbol \beta \sim \mathcal{N}\left(\widehat{\boldsymbol \beta}, \mbox{Cov}(\widehat{\boldsymbol \beta})\right)\), as sketched below.
Both are computationally intensive and may be less attractive than the delta method for bigger data sets.
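For the simulation approach, a sketch (drawing coefficient vectors from their asymptotic normal distribution and reusing the `x` vector defined earlier):

```r
set.seed(2)
# Draw coefficient vectors from N(beta-hat, Cov(beta-hat))
beta_sims <- MASS::mvrnorm(5000, mu = coef(trump_model),
                           Sigma = vcov(trump_model))

# Predicted probability for our subject of interest under each draw
p_sims <- plogis(beta_sims %*% x)

quantile(p_sims, c(0.025, 0.975))  # respects the [0, 1] bounds by construction
```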