Introducing R package ‘oddsratio’
Are you dealing with statistical models (GLMs or GAMs) with a binomial response variable? Then the oddsratio package will improve your analysis routine!
This package simplifies the calculation of odds ratios in binomial models. For GAMs, it even lets you insert your results into plots of the smoothing functions of your predictors! But let’s start with some basics…
– This post refers to package version 0.3.0 –
GLM
The concept of odds ratio calculation
The standard approach to calculating odds ratios in Generalized Linear Models (GLMs) is to exponentiate the model coefficients using exp(coef(model)). Since the coefficients are returned on the log odds scale, exponentiating converts them to odds ratios. But wait! We want odds ratios that show the change in odds for a specific change of a predictor! Usually you create a vector that stores the increments for which you want to calculate odds ratios. Next, you have to remove the first value of the coef() output (which is usually the intercept), because you only want odds ratios for your predictors. Then you multiply the coefficients by your increment values.
But wait again!? How do we get a ratio (‘odds ratio’) by multiplying two values? Well, this is a mathematical thing. Behind the scenes you are applying exp() to a difference of log odds, log odds1 - log odds2, where log odds1 belongs to a predictor value that is “one unit” (+1) larger than the one of log odds2. This difference is exactly your coefficient, and applying exp() to a difference of logarithms results in a division: the result is the ratio of odds1 to odds2. If you then set an increment value to multiply your coefficient with, this “one unit increase” (+1) is scaled by that value before exp() is applied. This means that if you set an increment of 5, the “new” coefficient (corresponding to a change of 5) is simply five times the coefficient for a change of 1 (which is what your model summary returned).
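As a quick numeric check (a minimal sketch with a made-up coefficient, not taken from the models below), scaling the log-odds coefficient before exponentiating is the same as raising the per-unit odds ratio to the power of the increment:

b <- 0.3          # hypothetical log-odds coefficient for a one-unit change
exp(b * 5)        # odds ratio for a change of 5 units
exp(b)^5          # identical result: 4.481689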
If your predictor is not numeric but an indicator variable with several levels (like rank in the example below), it does not make sense to set increments, since the calculated coefficients always refer to the base factor level, here rank1. In practice this makes sense because you can simply say “being rank2 increases/decreases the odds of the event happening by x % compared to rank1”. So you just set the increment to 1 to calculate the basic odds ratio between the respective levels. Note that you need to supply as many “1”s as there are non-reference levels of your indicator variable!
All these steps result in the following code:
Data source: http://www.ats.ucla.edu/stat/r/dae/logit.htm
# Example data
dat <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
dat$rank <- factor(dat$rank)

fit.glm <- glm(admit ~ gre + gpa + rank, data = dat, family = "binomial")

incr <- c(100, 2, 1, 1, 1)
exp(coef(fit.glm)[-1] * incr)

#       gre       gpa     rank2     rank3     rank4
# 1.2541306 4.9931906 0.5089310 0.2617923 0.2119375
Here, the increments of our numeric predictors are 100 (gre) and 2 (gpa). The annoying part of this approach is that you have to specify as many “1”s as there are non-reference levels of your indicator variable, and you have to take care not to misplace the parentheses in the exp() call! Other possible errors are forgetting the [-1] that drops the intercept, or mismatching increments and predictors within the incr vector.
And yes, this was just the ‘easy’ procedure for GLMs – the GAM approach is way more extensive.
The ‘oddsratio’ approach
Everything shown above can be done better – in my opinion!
calc.oddsratio.glm(data = dat, model = fit.glm,
                   incr = list(gre = 100, gpa = 2))

#   predictor oddsratio CI.low (2.5 %) CI.high (97.5 %)          increment
# 1       gre     1.254          1.014            1.558                100
# 2       gpa     4.993          1.378           18.696                  2
# 3     rank2     0.509          0.272            0.945 Indicator variable
# 4     rank3     0.262          0.132            0.512 Indicator variable
# 5     rank4     0.212          0.091            0.471 Indicator variable
Note how the column names of CI.low and CI.high are automatically adjusted.
By using calc.oddsratio.glm() you get a nicely formatted output. When setting up the function arguments, you avoid mismatches between increments and predictors by providing the information as a named list (gre = 100, gpa = 2). Also, confidence intervals (CIs) of the odds ratios are calculated automatically and returned, so you can directly see how reliable each odds ratio is, based on the underlying fitted model.
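For comparison, a minimal sketch of a manual counterpart (reusing fit.glm and incr from above; whether the package derives its intervals exactly this way is not shown here) would exponentiate the scaled confidence limits of the coefficients:

# Profile-likelihood CIs of the coefficients, scaled by the increments,
# then exponentiated to the odds ratio scale
incr <- c(100, 2, 1, 1, 1)
exp(suppressMessages(confint(fit.glm))[-1, ] * incr)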
For GLMs the default CI is 95%, i.e. the lower border is 2.5% and the upper one is 97.5%. You can easily specify your own CI using the CI argument:
calc.oddsratio.glm(data = dat, model = fit.glm,
                   incr = list(gre = 100, gpa = 2), CI = .70)

#   predictor oddsratio CI.low (15 %) CI.high (85 %)          increment
# 1       gre     1.254         1.120          1.406                100
# 2       gpa     4.993         2.520          9.984                  2
# 3     rank2     0.509         0.366          0.706 Indicator variable
# 4     rank3     0.262         0.183          0.374 Indicator variable
# 5     rank4     0.212         0.136          0.325 Indicator variable
GAM
For Generalized Additive Models (GAMs) the odds ratio calculation works differently. Due to the non-linear behavior of this model type, odds ratios for a given increment are not constant throughout the value range of a predictor (as they are for GLMs) but differ for every value combination. For example, the odds ratio between two arbitrary values 3 and 10 (a difference of 7) differs from the odds ratio between 22 and 29 (also a difference of 7). The reason is that the slope of the smoothing function differs between these two value combinations (non-linear!). For GLMs, the slope would be the same (‘linear’) and hence so would the odds ratios.
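To illustrate the GLM side of this claim, here is a minimal sketch using the fit.glm from above (the chosen gre, gpa and rank values are arbitrary): the odds ratio for a fixed difference of 100 in gre is the same no matter where in the value range you start.

# Two pairs of observations, each differing by 100 in gre,
# with all other predictors held constant
nd <- data.frame(gre  = c(400, 500, 600, 700),
                 gpa  = 3.4,
                 rank = factor(2, levels = 1:4))
lp <- predict(fit.glm, newdata = nd, type = "link")  # log odds

exp(lp[2] - lp[1])               # odds ratio for gre 400 -> 500
exp(lp[4] - lp[3])               # odds ratio for gre 600 -> 700: identical
exp(coef(fit.glm)["gre"] * 100)  # the same value once more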
Let’s show some examples! (Data source: ?mgcv::predict.gam)
suppressPackageStartupMessages(library(mgcv))

set.seed(1234)
n <- 200
sig <- 2
dat <- gamSim(1, n = n, scale = sig, verbose = FALSE)
dat$x4 <- as.factor(c(rep("A", 50), rep("B", 50),
                      rep("C", 50), rep("D", 50)))

fit.gam <- mgcv::gam(y ~ s(x0) + s(I(x1^2)) + s(x2) + offset(x3) + x4,
                     data = dat)
Calculating odds ratios for GAMs is somewhat more exhausting and ‘complicated’ than for GLMs, where you can simply call exp(coef(model)). For GAMs, you can only calculate the odds ratio of one predictor at a time. First, you call predict() with your starting value, let’s call it value1. Next, you do the same again, now using value2 of your predictor. You can either call predict() on a single observation or on many, as long as you fix all other values! The logic is that the only difference between the two predict() calls is the change from value1 to value2 of your chosen predictor, while all other values stay the same.
After that, you have the two log odds values corresponding to your specific value change of the chosen predictor. Then you call exp() on the difference of these two log odds (the prediction at value2 minus the prediction at value1) to receive your odds ratio, just as for GLMs.
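As a rough sketch of this manual procedure (it reuses the fit.gam defined above and fixes the remaining predictors at arbitrary values; these choices are mine, not part of the package):

# Two observations that differ only in x2 (0.099 vs. 0.198);
# x0, x1, x3 and x4 are held constant at arbitrary values
nd <- data.frame(x0 = 0.5, x1 = 0.5, x2 = c(0.099, 0.198), x3 = 0,
                 x4 = factor("A", levels = c("A", "B", "C", "D")))
lp <- predict(fit.gam, newdata = nd, type = "link")

exp(lp[2] - lp[1])   # ratio for x2 changing from 0.099 to 0.198

Because the model is additive on the link scale, the difference of the two predictions depends only on the s(x2) smooth, so the values chosen for the other predictors do not affect the result.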
If you do not understand this theory in depth, do not worry – calc.oddsratio.gam() does the work for you! What counts in the first place is to be able to correctly interpret the odds ratios.
calc.oddsratio.gam(data = dat, model = fit.gam, pred = "x2",
                   values = c(0.099, 0.198))

#   predictor value1 value2 oddsratio CI.low (2.5%) CI.high (97.5%)
# 1        x2  0.099  0.198  50.34194      52.87883        47.92677
Using calc.oddsratio.gam() you just specify your fitted model and your predictor, provide the values to calculate the odds ratio for, and you receive your result!
“Hmmm – so I have to call the function x times if I want multiple odds ratios of the same predictor?”
Well, actually yes – but that is the moment when the slice argument comes into play!
calc.oddsratio.gam() supports the calculation of multiple odds ratios for one predictor using slice = TRUE. This option splits the value range of the predictor into percentage steps (specified via the percentage argument). So if you want, for example, to calculate odds ratios for 20% steps of your predictor’s value range, you proceed as follows:
calc.oddsratio.gam(data = dat, model = fit.gam, pred = "x2",
                   percentage = 20, slice = TRUE)

#   predictor value1 value2 perc1 perc2 oddsratio CI.low (2.5%) CI.high (97.5%)
# 1        x2  0.017  0.212     0    20   3254.67       1801.96         5878.52
# 2        x2  0.212  0.408    20    40      0.02          0.01            0.02
# 3        x2  0.408  0.604    40    60      0.24          0.27            0.22
# 4        x2  0.604  0.800    60    80      0.12          0.11            0.13
# 5        x2  0.800  0.995    80   100      0.09          0.17            0.05
You get the values that were used for the odds ratio calculation (value1, value2), the percentages of the predictor’s value range they correspond to (perc1, perc2), the calculated odds ratio and its confidence interval borders.
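The breakpoints themselves appear to be equally spaced cuts of the observed value range; a small sketch (assuming a linear split, which matches the value1/value2 columns above) would be:

# Six breakpoints delimiting five 20% steps of the x2 value range
round(seq(min(dat$x2), max(dat$x2), length.out = 6), 3)
# compare with the value1/value2 columns above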
Note that in package version 0.3.0 the CI of GAMs is fixed to 95% and cannot be modified.
Plot GAM(M) smoothing functions
Until now, the only (quick) way to plot the smoothing functions of a GAM(M) in R was the built-in plot() function. Since I prefer ggplot2 for all kinds of plotting, I implemented the somewhat fiddly procedure of plotting GAM smoothing functions with ggplot() in pl.smooth.gam():
pl.smooth.gam(fit.gam, pred = "x2", title = "Predictor 'x2'")
Add odds ratio information into smoothing function plot
So now we have the odds ratios and we have a plot of the smoothing function. Why not combine both? We can do so using add.oddsratio.into.plot()! Its main arguments are (i) a ggplot plotting object containing the smooth function (from pl.smooth.gam()) and (ii) a data frame returned by calc.oddsratio.gam() containing information about the predictor and the respective values we want to insert.
plot.object <- pl.smooth.gam(fit.gam, pred = "x2", title = "Predictor 'x2'")

or.object1 <- calc.oddsratio.gam(data = dat, model = fit.gam,
                                 pred = "x2", values = c(0.099, 0.198))

# insert first odds ratio into plot
plot.object <- add.oddsratio.into.plot(plot.object, or.object1, or.yloc = 3,
                                       values.xloc = 0.04, line.size = 0.5,
                                       line.type = "dotdash", text.size = 6,
                                       values.yloc = 0.5, arrow.col = "red")
The odds ratio information is always centered between the two vertical lines. Hence it only looks nice if the gap between the two chosen values (here 0.099 and 0.198) is large enough. If the smoothing line crosses the inserted text, you can correct this by adjusting or.yloc, which sets the y-location of the inserted odds ratio information. Depending on the number of digits of your chosen values (here 3), you might also need to adjust the x-locations of the two value labels so that they do not interfere with the vertical lines.
Let’s add another odds ratio into this plot! This time we simply take the already produced plot as an input to add.oddsratio.into.plot() and use a new odds ratio object:
or.object2 <- calc.oddsratio.gam(data = dat, model = fit.gam,
                                 pred = "x2", values = c(0.4, 0.6))

# add or.object2 into plot
add.oddsratio.into.plot(plot.object, or.object2, or.yloc = 2.1,
                        values.yloc = 2, line.col = "green4",
                        text.col = "black", rect.col = "green4",
                        rect.alpha = 0.2, line.alpha = 1, line.type = "dashed",
                        arrow.xloc.r = 0.01, arrow.xloc.l = -0.01,
                        arrow.length = 0.01, rect = TRUE)
Quite some adjustments were made for this insertion: I adjusted values.yloc because the values have only one digit this time. Also, or.yloc was set to a lower value than in the first example to avoid interference with the smoothing function. A green shaded rectangle was added using rect = TRUE, and the text color was set to black via text.col. Length and position of the arrows were slightly modified using arrow.length, arrow.xloc.r and arrow.xloc.l. If you do not like arrows, simply turn them off using arrow = FALSE. The same logic applies to the shaded rectangle (rect = FALSE) and the inserted values (values = FALSE).
Installation
You can install the package from CRAN by
install.packages("oddsratio")
or from GitHub using
devtools::install_github("pat-s/oddsratio", build_vignettes = TRUE)
Patrick Schratz
PhD Student at University of Jena
I have experience in working with all kinds of image analysis, GIS software and programming languages such as R and Python. Currently I use statistical and machine learning methods to improve the understanding of forest decline within tree plantations in northern Spain.