Too crude to be true?
The key to programming is being lazy; some have even called it a virtue. When I discovered the update() function it blew me away, and before long I had created a monster based upon this tiny function, allowing quick and easy output of regression tables that contain both crude and adjusted estimates. In this post I’ll show you how to tame the printCrudeAndAdjustedModel() function in my Gmisc package and take a peek behind the scenes.
Although not always included, I think a regression table that shows both the crude and adjusted estimates conveys important information. The crude estimate is also referred to as the unadjusted estimate; if it drops sharply in the adjusted model we can suspect that a strong confounder is stealing some of the effect attributed to that variable.
A simple walk-through
We will start with basic usage of the function. For this I’m using the colon data set from the survival package. So first we prepare our data by generating a few factors:
```r
library(survival)
data(colon)

# Choose the ones with death as outcome
colon <- subset(colon, etype == 2)

# Change into years
colon$time <- colon$time/365.25

# Factor these binary variables or the software will interpret them as
# numeric even if the regression is correct
colon$sex <- factor(colon$sex, labels = c("Female", "Male"))
colon$perfor <- factor(colon$perfor, labels = c("No", "Yes"))
colon$obstruct <- factor(colon$obstruct, labels = c("No", "Yes"))
```
If the variable names are a little too terse and we want prettier names in the output, it is easy to attach labels using the Hmisc label() function:
```r
library(Hmisc, verbose = FALSE)
label(colon$rx) <- "Treatment"
label(colon$perfor) <- "Colon perforation"
label(colon$age) <- "Age (years)"
label(colon$sex) <- "Sex"
```
The regression model is set up in the regular way. I use the rms package for a lot of my work, but regular lm(), glm() and coxph() should work just as well. If you’re not familiar with the rms package, it expects a summary of the data set to be stored in a datadist object.
```r
library(rms, verbose = FALSE)

dd <- datadist(colon[, c("time", "status", "rx", "sex", "obstruct", "perfor", "age")])
options(datadist = "dd")

sn <- Surv(colon$time, colon$status)
fit <- cph(sn ~ rx + sex + perfor + age, data = colon)
```
And now it is time for my monster function. I like having the reference category in the table so I always add the add_references option:
```r
library(Gmisc)
printCrudeAndAdjustedModel(fit, add_references = TRUE, ctable = TRUE)
```
Variable | Crude HR | 2.5 % to 97.5 % | Adjusted HR | 2.5 % to 97.5 %
--- | --- | --- | --- | ---
Treatment | | | |
  Obs | 1 | ref | 1 | ref
  Lev | 0.98 | 0.84 to 1.14 | 0.98 | 0.84 to 1.14
  Lev+5FU | 0.64 | 0.55 to 0.76 | 0.64 | 0.55 to 0.76
Sex | | | |
  Female | 1 | ref | 1 | ref
  Male | 0.97 | 0.85 to 1.10 | 0.95 | 0.83 to 1.08
Colon perforation | | | |
  No | 1 | ref | 1 | ref
  Yes | 1.30 | 0.92 to 1.85 | 1.30 | 0.91 to 1.85
Age (years) | 1.00 | 0.99 to 1.00 | 1.00 | 0.99 to 1.00
With event data like this it is also nice to have some information on the variables’ raw distribution, and we can easily combine the regression estimates with descriptive statistics by adding the desc_column option:
```r
printCrudeAndAdjustedModel(fit,
                           desc_column = TRUE,
                           add_references = TRUE,
                           ctable = TRUE)
```
Variable | Total | Event | Crude HR | 2.5 % to 97.5 % | Adjusted HR | 2.5 % to 97.5 %
--- | --- | --- | --- | --- | --- | ---
Treatment | | | | | |
  Obs | 630 | 345 (54.76 %) | 1 | ref | 1 | ref
  Lev | 620 | 333 (53.71 %) | 0.98 | 0.84 to 1.14 | 0.98 | 0.84 to 1.14
  Lev+5FU | 608 | 242 (39.80 %) | 0.64 | 0.55 to 0.76 | 0.64 | 0.55 to 0.76
Sex | | | | | |
  Female | 890 | 444 (49.89 %) | 1 | ref | 1 | ref
  Male | 968 | 476 (49.17 %) | 0.97 | 0.85 to 1.10 | 0.95 | 0.83 to 1.08
Colon perforation | | | | | |
  No | 1804 | 888 (49.22 %) | 1 | ref | 1 | ref
  Yes | 54 | 32 (59.26 %) | 1.30 | 0.92 to 1.85 | 1.30 | 0.91 to 1.85
Age (years) | 59.75 (± 11.95) | 59.46 (± 12.35) | 1.00 | 0.99 to 1.00 | 1.00 | 0.99 to 1.00
The function does not yet handle variables wrapped in functions such as splines. Currently it skips these by removing any variable from the formula that matches the regular expression "^(ns|rcs|bs|pspline)[(]". In the future I imagine it could be possible to add something that uses the rms contrast() function, although that will probably have to wait until the need arises.
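To make the skipping rule concrete, here is a minimal sketch of how that regular expression filters the term labels; the variable names below are made up for illustration and not taken from the model above:

```r
# Hypothetical term labels from a model formula (illustration only)
var_names <- c("rcs(age, 3)", "rx", "perfor")

# Terms matching the regular expression are dropped from the crude/adjusted table
skip <- grep("^(ns|rcs|bs|pspline)[(]", var_names)
var_names[skip]   # "rcs(age, 3)"
var_names[-skip]  # "rx" "perfor" -- these get crude and adjusted estimates
```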
What about update()?
As I mentioned in the introduction, the update() function is at the heart of this function. It can add and remove variables, swap data sets, and change other options of your model on the fly.
```r
# Now let's keep everything the same and add the obstruct variable
update(fit, . ~ . + obstruct)

##
## Cox Proportional Hazards Model
##
## cph(formula = sn ~ rx + sex + perfor + age + obstruct, data = colon)
##
##                      Model Tests        Discrimination
##                                             Indexes
## Obs      1858    LR chi2     44.77    R2       0.024
## Events    920    d.f.            6    Dxy      0.126
## Center -0.1992   Pr(> chi2) 0.0000    g        0.245
##                  Score chi2  43.75    gr       1.277
##                  Pr(> chi2) 0.0000
##
##              Coef    S.E.   Wald Z Pr(>|Z|)
## rx=Lev       -0.0193 0.0769 -0.25  0.8022
## rx=Lev+5FU   -0.4368 0.0840 -5.20  <0.0001
## sex=Male     -0.0440 0.0662 -0.66  0.5066
## perfor=Yes    0.1992 0.1819  1.10  0.2735
## age          -0.0012 0.0028 -0.44  0.6606
## obstruct=Yes  0.2120 0.0817  2.60  0.0094

# We can also remove a variable
update(fit, . ~ . - age)

##
## Cox Proportional Hazards Model
##
## cph(formula = sn ~ rx + sex + perfor, data = colon)
##
##                      Model Tests        Discrimination
##                                             Indexes
## Obs      1858    LR chi2     37.85    R2       0.020
## Events    920    d.f.            4    Dxy      0.096
## Center -0.1712   Pr(> chi2) 0.0000    g        0.214
##                  Score chi2  36.43    gr       1.238
##                  Pr(> chi2) 0.0000
##
##             Coef    S.E.   Wald Z Pr(>|Z|)
## rx=Lev      -0.0205 0.0769 -0.27  0.7895
## rx=Lev+5FU  -0.4430 0.0839 -5.28  <0.0001
## sex=Male    -0.0522 0.0661 -0.79  0.4297
## perfor=Yes   0.2674 0.1800  1.49  0.1374
```
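Beyond the formula, other parts of the original call can be updated just as easily. A small sketch (the added arguments below are my own illustration, not from the original post):

```r
# Ask cph() to keep the design matrix and response in the returned fit
fit_xy <- update(fit, x = TRUE, y = TRUE)

# Swap one variable for another in a single call
update(fit, . ~ . - age + obstruct)
```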
When I compute the crude estimates I simply remove all other variables:
```r
update(fit, . ~ rx)

##
## Cox Proportional Hazards Model
##
## cph(formula = sn ~ rx, data = colon)
##
##                      Model Tests        Discrimination
##                                             Indexes
## Obs      1858    LR chi2     35.23    R2       0.019
## Events    920    d.f.            2    Dxy      0.089
## Center -0.1512   Pr(> chi2) 0.0000    g        0.194
##                  Score chi2  33.63    gr       1.215
##                  Pr(> chi2) 0.0000
##
##            Coef    S.E.   Wald Z Pr(>|Z|)
## rx=Lev     -0.0209 0.0768 -0.27  0.7859
## rx=Lev+5FU -0.4408 0.0839 -5.25  <0.0001
```
Nice, isn’t it? I find it really useful when comparing AIC/BIC values, testing interactions, etc.
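For instance, a quick comparison could look something like this (a sketch of the kind of workflow I mean, not output from the post):

```r
# Does dropping age change the fit much?
AIC(fit)
AIC(update(fit, . ~ . - age))

# Is there an interaction between treatment and sex?
anova(update(fit, . ~ . + rx:sex))
```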
Where is the monster?
While the update() function is appealing in its simplicity, much more needed to be in place for this to work. The underlying function has to figure out the type of regression, loop through the variables, and check for the intercept. It still needs some work; this was one of the first functions I created for my package, so the coding is a little clumsy – feel free to suggest improvements. You can find the raw code on my GitHub account.
```r
#' This function helps with printing regression models
#'
#' This function is used for getting the adjusted and unadjusted values
#' for a regression model. It takes a full model and walks through each
#' variable, removes in the regression all variables except one then
#' reruns that variable to get the unadjusted value. This functions not
#' intended for direct use, it's better to use \code{\link{printCrudeAndAdjustedModel}}
#' that utilizes this function.
#'
#' This function saves a lot of time creating tables since it compiles a fully
#' unadjusted list of all your used covariates.
#'
#' If the model is an exponential poisson/logit/cox regression model then it automatically
#' reports the exp() values instead of the original values
#'
#' The function skips by default all spline variables since this becomes very complicated
#' and there is no simple \deqn{\beta}{beta} to display. For the same reason it skips
#' any interaction variables since it's probably better to display these as a contrast table.
#'
#' @param fit The regression model
#' @return Returns a matrix with the columns c('Crude', '95\% CI', 'Adjusted', '95\% CI').
#'   The rows are not changed from the original fit.
#'
#' @seealso \code{\link{printCrudeAndAdjustedModel}}
#'
#' @example examples/getCrudeAndAdjustedModelData_example.R
#'
#' @import boot
#'
#' @author max
#' @export
getCrudeAndAdjustedModelData <- function(fit) {
  if ("Glm" %in% class(fit))
    stop("The rms::Glm does not work properly with this function, please use regular glm instead")

  # Just a prettifier for the output an alternative could be:
  # paste(round(x[,1],1), ' (95% CI ', min(round(x[,2:3])), '-',
  #       max(round(x[,2:3])), ')', sep='')
  get_coef_and_ci <- function(fit, skip_intercept = FALSE) {
    # Get the coefficients
    my_coefficients <- coef(fit)
    coef_names <- names(my_coefficients)

    # Confint isn't implemented for the rms package we therefore need to do a
    # work-around
    if ("rms" %in% class(fit) && "ols" %nin% class(fit)) {
      # The summary function has to be invoked to get this to work in the confint
      # version and the summary.rms doesn't return a profile.glm() compatible
      # list, it returns a matrix It works though OK for regular ordinary least
      # square regression
      if ("lrm" %in% class(fit)) {
        warning("Using the confint.default, i.e. Wald statistics, and it may not be optimal but profile CI is not implemented for the lrm, an option could be to change to the Glm with the family=binomial")
        ci <- confint.default(fit)
      } else if ("Glm" %in% class(fit)) {
        # The summary.rms is causing issues when profiling, it works fine when
        # summary.glm is issued, we need to change class for that
        class(fit) <- class(fit)["rms" != class(fit)]
        ci <- confint(fit)
      } else if ("cph" %in% class(fit)) {
        # I haven't found any large issues with this call
        ci <- confint(fit)
      } else {
        # Untested option try confint for better or worse
        warning("The function has not been tested with this type of regression object, may not work properly.")
        ci <- confint(fit)
      }
    } else {
      ci <- confint(fit)
    }

    if (skip_intercept) {
      intercept <- grep("Intercept", coef_names)
      if (length(intercept)) {
        my_coefficients <- my_coefficients[-intercept]
        ci <- ci[-intercept, ]
        coef_names <- coef_names[-intercept]
      }
    }

    # Use the exp() if logit or cox regression
    if ("coxph" %in% class(fit) ||
        ("glm" %in% class(fit) && fit$family$link %in% c("logit", "log"))) {
      my_coefficients <- exp(my_coefficients)
      ci <- exp(ci)
    }

    if (length(my_coefficients) > 1)
      ret_val <- cbind(my_coefficients, ci)
    else
      ret_val <- matrix(c(my_coefficients, ci), nrow = 1)

    colnames(ret_val) <- c("", "2.5%", "97.5%")
    rownames(ret_val) <- coef_names
    return(ret_val)
  }

  all.terms <- terms(fit)
  var_names <- attr(all.terms, "term.labels")

  # Skip variables consisting of functions such as spline, strata variables
  regex_for_unwanted_vars <- "^(ns|rcs|bs|pspline)[(]"
  skip_variables <- grep(regex_for_unwanted_vars, var_names)
  skip_variables_names <- c()
  if (length(skip_variables) > 0) {
    for (vn in skip_variables) {
      match <- gregexpr("\\w+\\((?<var_name>\\w+)", var_names[vn], perl = TRUE)[[1]]
      if (match[1] >= 0) {
        st <- attr(match, "capture.start")[1, ]
        skip_variables_names <- append(skip_variables_names,
                                       substring(var_names[vn], st,
                                                 st + attr(match, "capture.length")[1, ] - 1))
      }
    }
  }

  # Skips the interaction terms since they should be displayed in a different
  # fashion
  raw_int_terms <- grep("[[:alnum:]]+:[[:alnum:]]+", var_names)
  if (length(raw_int_terms) > 0) {
    require("stringr")
    for (i in raw_int_terms) {
      for (name in str_split(var_names[i], ":")[[1]]) {
        skip_variables_names <- append(skip_variables_names, name)
        skip_variables <- union(skip_variables, grep(name, var_names))
      }
    }
  }

  skip_variables <- union(skip_variables, grep("^strat[a]{0,1}\\(", var_names))

  # Get the adjusted variables
  adjusted <- get_coef_and_ci(fit)

  # When using splines, rcs in cox regression this shows a little different
  # Remove all the splines, rcs etc
  rn <- rownames(adjusted)
  remove <- grep("('{1,}|[[][0-9]+[]]|[)][0-9]+)$", rn)
  for (name in skip_variables_names) {
    m <- grep(name, rn)
    if (length(m) > 0)
      remove <- union(remove, m)
  }

  # Remove the unwanted rows if any found
  if (length(remove) > 0) {
    adjusted <- adjusted[-remove, ]
  }

  unadjusted <- c()
  if (length(skip_variables) > 0) {
    variables_to_check <- var_names[-skip_variables]
  } else {
    variables_to_check <- var_names
  }

  if (length(variables_to_check) == 0)
    stop("You have no variables that can be displayed as adjusted/unadjusted since they all are part of an interaction, spline or strata")

  for (variable in variables_to_check) {
    interaction_variable <- length(grep(":", variable)) > 0

    # If it's an interaction variable the interacting parts have to be included
    if (interaction_variable) {
      variable <- paste(paste(unlist(strsplit(variable, ":")), sep = "", collapse = " + "),
                        variable, sep = " + ")
    }

    # Run the same fit but with only one variable
    fit_only1 <- update(fit, paste(".~", variable))

    if ("boot.coef" %in% names(fit)) {
      # Redo bootstrap if this was a bootstrapped rms model by rerunning the
      # bootcov part
      fit_only1 <- bootcov(fit_only1, B = fit$B, coef.reps = "boot.Coef" %in% names(fit))
    }

    # Get the coefficients processed with some advanced round part()
    new_vars <- get_coef_and_ci(fit_only1, skip_intercept = TRUE)

    # If interaction then we should only keep the interaction part - the other
    # variables are always included by default and need therefore to be removed
    if (interaction_variable) {
      new_vars <- new_vars[grep("[*:]", rownames(new_vars)), ]
    }

    # Add them to the previous
    unadjusted <- rbind(unadjusted, new_vars)
  }

  # If regression contains Intercept this is meaningless for the comparison of
  # adjusted and unadjusted values
  if (length(grep("Intercept", rownames(adjusted))) > 0) {
    if ("ols" %in% class(fit)) {
      # The simple .~1 doesn't work for me with the ols and I've therefore created
      # this workaround
      if ("boot.coef" %in% names(fit)) {
        if ("y" %nin% names(fit)) {
          warning("If you want the intercept on a bootstrapped ols() model then you need to specify y=TRUE or else you get NA in the unadjusted estimates")
          new_vars <- rep(NA, times = 3)
        } else {
          require("boot") || stop("`boot' package not found - it's needed for the unadjusted estimate of the intercept")
          ci <- boot.ci(boot(fit$y, function(x, i) mean(x[i], na.rm = TRUE),
                             R = fit$B, stype = "i"),
                        type = ifelse("boot.Coef" %in% names(fit), "perc", "norm"))
          new_vars <- c(mean(fit$y, na.rm = TRUE), ci$percent[4:5])
        }
      } else {
        new_vars <- c(mean(fit$y, na.rm = TRUE),
                      mean(fit$y) + c(-1, 1) * sd(fit$y)/sqrt(sum(is.na(fit$y) == FALSE)) * qnorm(0.05/2))
      }
    } else {
      # Run the same fit but with only one variable
      fit_only1 <- update(fit, ".~ 1")
      if ("boot.coef" %in% names(fit)) {
        # Redo bootstrap if this was a bootstrapped rms model by rerunning the
        # bootcov part
        fit_only1 <- bootcov(fit_only1, B = fit$B, coef.reps = "boot.Coef" %in% names(fit))
      }
      # Get the coefficients processed with some advanced round part()
      new_vars <- get_coef_and_ci(fit_only1, skip_intercept = FALSE)
    }

    unadjusted <- rbind(new_vars, unadjusted)
    rownames(unadjusted)[1] <- "Intercept"
  }

  # If just one variable it's not a proper matrix
  if (is.null(dim(adjusted))) {
    both <- matrix(c(unadjusted, adjusted), nrow = 1)
  } else {
    both <- cbind(unadjusted, adjusted)
  }

  colnames(both) <- c("Crude", "2.5 %", "97.5 %", "Adjusted", "2.5 %", "97.5 %")
  return(both)
}
```
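If you want the raw numbers rather than the formatted table, the data-gathering function can also be called directly on the fit; a minimal usage sketch:

```r
# Returns a matrix with the crude and adjusted estimates and their
# confidence intervals, one row per coefficient
getCrudeAndAdjustedModelData(fit)
```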
I hope you’ll find this as useful as I have.