Example 8.16: Exact logistic regression
[This article was first published on SAS and R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In example 8.15, on Firth logistic regression, we mentioned alternative approaches to separation troubles. Here we demonstrate exact logistic regression. The code for this appears in the book (section 4.1.2) but we don’t show an example of it there. We’ll consider the setting of observing 100 subjects each with x=0 and x=1, observing no events when x=0, and 5 when x=1. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
SAS
We’ll create the data as a summary, rather than for every line of data. Then we can use the “events/trials” syntax (section 4.1.1) that both proc logistic and proc genmod accept. This is another way to reduce the size of data sets (along with the weight option mentioned previously) but is less generally useful. The the exact statement in proc logistic will fit the exact logistic regression and generate a p-value. The estimate option is required to display estimated log odds ratio.
data exact; x=0; count=0; n=100; output; x=1; count=5; n=100; output; run; proc logistic data=exact; model count/n = x; exact x / estimate; run;
This generates the following output:
Exact Parameter Estimates Standard 95% Confidence Parameter Estimate Error Limits p-Value x 1.9414* . -0.0677 Infinity 0.0594 NOTE: * indicates a median unbiased estimate.
R
In R we use the elrm() function in the elrm package to approximate exact logistic regression, as described in this paper by the package’s authors. The function requires a special formula object with syntax identical to the SAS events/trials syntax. (Note that the function does not behave as expected when identical observations with trials=1 are submitted. Thus data should be collapsed into unique combinations of predictors before using the function.) In addition, it requires its data to be included in a data frame. We’ll construct the data frame in one function call to data.frame().
elrmdata = data.frame(count=c(0,5), x=c(0,1), n=c(100,100)) library(elrm) resexact = elrm(count/n ~ x, interest = ~x, iter=22000, burnIn=2000, data=elrmdata, r=2) summary(resexact)
producing the following result:
Call: [[1]] elrm(formula = count/n ~ x, interest = ~x, r = 2, iter = 22000, dataset = elrmdata, burnIn = 2000) Results: estimate p-value p-value_se mc_size x 2.0225 0.02635 0.0011 20000 95% Confidence Intervals for Parameters lower upper x -0.02065572 Inf
Differences between the SAS and R results most likely arise from the fact that the elrm() function is an approximation of the exact approach. The upper limit of infinity seen in the exact SAS analysis and approximate exact elrm() analysis reveals a limitation of this approach relative to the Firth approach seen in example 8.15 and the Bayesian approach we’ll examine later.
A final note: if the true Pr(Y=1|X=1) = 0.05, then the true Pr(Y=1|X=0) that results in a log odds ratio of 1.94 is about 0.0075; for a log odds ratio of 2.02, the true probability is about 0.0069.
To leave a comment for the author, please follow the link and comment on their blog: SAS and R.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.