An exercise in non-linear modeling
In my previous post I wrote about the importance of age and why it is a good idea to avoid modeling it as a linear variable. In this post I will (1) go through multiple options for modeling non-linear effects in a linear regression setting, (2) benchmark the methods on a real dataset, and (3) look at how the non-linearities actually behave. The post is based on the supplement in my article on age and health-related quality of life (HRQoL).
Background
What is linearity?
Wikipedia has an excellent explanation of linearity:
linearity refers to a mathematical relationship or function that can be graphically represented as a straight line
Why do we assume linearity?
Linearity is a common assumption when building a linear regression model. In a linear relation, a continuous variable has the same impact throughout its span. This makes the estimate easy to interpret: an increase of one unit gives the corresponding coefficient's change in the outcome. While this generates simple models, with their advantages, it is difficult to believe that nature follows a simple straight line. With today's large datasets I believe that our models should permit non-linear effects.
How do we do non-linearity?
There are plenty of non-linear alternatives that can be used to better find the actual relationships. Most of them rely on converting the single continuous variable into several. The simplest form is when we transform the variable into a polynomial, e.g. instead of having the model:
HRQoL ~ β0 + βage * Age + βsex * Sex
We expand the model to also include the squared term:

HRQoL ~ β0 + βage * Age + βage2 * Age² + βsex * Sex
This allows the line to bend. Unfortunately, as we add the squared term the coefficients become more difficult to interpret, and after adding a cubic term, i.e. Age³, it is almost impossible to interpret the coefficients. Due to this difficulty in interpretation I use either the rms::contrast function or stats::predict in order to illustrate how the variable behaves.
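As a minimal sketch of this approach (on simulated data, so the variable names and all numbers are purely illustrative), one can fit a quadratic with poly() and then inspect the shape through predict() rather than through the raw coefficients:

```r
# Simulated, purely illustrative data: a hump-shaped effect of age
set.seed(123)
age <- runif(200, 40, 90)
hrqol <- 0.9 - 0.0002 * (age - 65)^2 + rnorm(200, sd = 0.05)

# Orthogonal polynomial of degree 2 - the coefficients are hard
# to read directly, so we look at predictions instead
fit <- lm(hrqol ~ poly(age, 2))

# Predict over a grid of ages to see the shape of the relation
grid <- data.frame(age = seq(40, 90, by = 5))
cbind(grid, pred = predict(fit, newdata = grid))
```

The predictions trace out the fitted curve directly, which is usually far easier to communicate than the polynomial coefficients themselves.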
Splines
A frequently used alternative to polynomials is the spline. The most basic form of a spline consists of lines that are connected at different "knots". For instance, a linear spline with 1 knot can assume a V-shape, while 2 knots allow for an N-shaped relationship. The number of knots decides the flexibility of a spline, i.e. more knots allow for a more detailed description of the relationship.
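A linear spline with one knot can even be written down by hand: the basis is just the variable itself plus a hinge term, pmax(x - knot, 0). Below is a small sketch on simulated data (the knot placement is chosen to match the simulated bend, not estimated):

```r
set.seed(1)
x <- runif(300, 0, 10)
# True V-shaped relation with a bend at x = 5
y <- abs(x - 5) + rnorm(300, sd = 0.2)

# One knot at 5: the slope is b1 below the knot and b1 + b2 above it
fit <- lm(y ~ x + pmax(x - 5, 0))
coef(fit)
```

The estimated slope below the knot should be close to -1 and the hinge coefficient close to +2, recovering the V-shape that a single straight line could never capture.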
The models
The dataset comes from the Swedish Hip Arthroplasty Register's excellent PROM database and consists of more than 30,000 patients who have filled out the EQ-5D form both before and after surgery. We will focus on age and its impact on the EQ-5D index and the EQ VAS one year after total hip replacement surgery. We will control for sex, Charnley class, pre-operative HRQoL, and pre-operative pain.
The restricted cubic spline
I'm a big fan of Frank Harrell's rms-package, so we will start there, although first we split the dataset using caret::createDataPartition. We then need to set up rms::datadist in order for the rms-functions to work as expected:
```r
library(caret)
train_no <- createDataPartition(1:nrow(my_dataset),
                                list = FALSE,
                                p = .7)
train <- my_dataset[train_no, ]
test <- my_dataset[-train_no, ]

# A list with all the fits that are later to be benchmarked
fits <- list(eq5d = list(),
             eq_vas = list())

# The rms-setup
library(rms)
ddist <- datadist(train)
options("datadist" = "ddist")
```
Frank Harrell is a proponent of restricted cubic splines, also known as natural cubic splines. This is a type of spline that uses cubic terms in the center of the data and restricts the ends to straight lines, preventing the center from distorting the ends, i.e. curling. His rcs() function also integrates nicely with anova() to check whether non-linearity actually exists:
```r
idx_model <- ols(eq5d1 ~ rcs(Age, 3) + Sex * Charnley_class +
                   rcs(eq5d0, 3) + rcs(pain0, 3),
                 data = train, x = TRUE, y = TRUE)
anova(idx_model, ss = FALSE)
```
gives:
```
                Analysis of Variance          Response: eq5d1

 Factor   F      d.f. P
 Age      140.71 2
```
As we can see the model seems OK for the EQ-5D index, supporting both the non-linearity and the interaction between Charnley class and sex. If we for some reason cannot use the rms-package, we can check for linearity by comparing two regular models using splines::ns, as suggested below:
```r
library(splines)

lm1 <- lm(eq5d1 ~ Age + Sex * Charnley_class +
            ns(eq5d0, 3) + ns(pain0, 3),
          data = train)
lm2 <- lm(eq5d1 ~ ns(Age, 3) + Sex * Charnley_class +
            ns(eq5d0, 3) + ns(pain0, 3),
          data = train)
anova(lm1, lm2)
```
gives:
```
Analysis of Variance Table

Model 1: eq5d1 ~ Age + Sex * Charnley_class + ns(eq5d0, 3) + ns(pain0, 3)
Model 2: eq5d1 ~ ns(Age, 3) + Sex * Charnley_class + ns(eq5d0, 3) + ns(pain0, 3)
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1  19095 193.01
2  19093 192.66  2   0.35112 17.398 2.825e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```
In order to avoid overfitting we will select models based upon the AIC/BIC criteria. The selection simply consists of finding the lowest value; in general, AIC allows slightly more complex models than BIC. We will start by finding the optimal number of knots for the EQ-5D index using the AIC:
```r
#' A simple function for updating the formula and extracting
#' the information criteria
#'
#' @param no A number that is used together with the add_var_str
#' @param fit A regression fit that is used for the update
#' @param rm_var The variable that is to be substituted
#' @param add_var_str A sprintf() string that accepts the no
#'  parameter for each update
#' @param ic_fn The information criteria function (AIC/BIC)
getInfCrit <- function(no, fit, rm_var, add_var_str, ic_fn) {
  new_var <- sprintf(add_var_str, no)
  updt_frml <- as.formula(sprintf(".~.-%s+%s", rm_var, new_var))
  ret <- ic_fn(update(fit, updt_frml))
  names(ret) <- new_var
  return(ret)
}

# We start with a model where the other splines
# have been AIC-selected
idx_model <- ols(eq5d1 ~ rcs(Age, 3) + Sex * Charnley_class +
                   rcs(eq5d0, 8) + rcs(pain0, 6),
                 data = train, x = TRUE, y = TRUE)
sapply(3:9, getInfCrit,
       fit = idx_model,
       rm_var = "rcs(Age, 3)",
       add_var_str = "rcs(Age, %d)",
       ic_fn = AIC)
# rcs(Age, 3) rcs(Age, 4) rcs(Age, 5) rcs(Age, 6) rcs(Age, 7) rcs(Age, 8) rcs(Age, 9)
#   -33678.89   -33674.68   -33686.30   -33686.93   -33685.95   -33686.73   -33699.37

fits$eq5d[["RCS with AIC"]] <- update(idx_model, .~.-rcs(Age, 3)+rcs(Age, 5))
```
It can be debated whether the model should stop at 3 knots, but I chose to continue a little higher as the drop was relatively large between 4 and 5 knots. This is most likely due to an unfortunate fit at 4 knots. We could also have motivated a larger number of knots, but even with proper visualization these are difficult to interpret. When modeling confounders, such as the preoperative EQ-5D index (eq5d0) and the pre-operative pain (pain0), we may prefer a higher number of knots in order to avoid residual confounding, and we do not need to worry about visualizing/explaining those relations.
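As an aside on the AIC/BIC trade-off used above: the two criteria differ only in their penalty term, 2 per parameter for AIC versus log(n) for BIC, which is why BIC is the stricter one whenever n is larger than about 7. This can be verified directly against R's built-in functions (toy data, for illustration only):

```r
set.seed(42)
d <- data.frame(x = rnorm(100))
d$y <- 1 + 2 * d$x + rnorm(100)
fit <- lm(y ~ x, data = d)

# Number of estimated parameters (including sigma) and sample size
k <- attr(logLik(fit), "df")
n <- nobs(fit)

# The definitions: -2 * log-likelihood plus the penalty
manual_aic <- -2 * as.numeric(logLik(fit)) + 2 * k
manual_bic <- -2 * as.numeric(logLik(fit)) + log(n) * k

c(AIC = AIC(fit), manual = manual_aic)
c(BIC = BIC(fit), manual = manual_bic)
```

With n = 100 the BIC penalty per parameter is log(100) ≈ 4.6 instead of 2, so the BIC-selected models end up simpler.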
Now if we apply the same methodology to the EQ-VAS we get:
```r
vas_model <- ols(eq_vas1 ~ Sex * Charnley_cat + Sex +
                   rcs(Age, 3) + rcs(eq_vas0, 3) +
                   OpNr + rcs(pain0, 3),
                 data = train, x = TRUE, y = TRUE)
sapply(3:9, getInfCrit,
       fit = vas_model,
       rm_var = "rcs(Age, 3)",
       add_var_str = "rcs(Age, %d)",
       ic_fn = AIC)
# rcs(Age, 3) rcs(Age, 4) rcs(Age, 5) rcs(Age, 6) rcs(Age, 7) rcs(Age, 8) rcs(Age, 9)
#    166615.6    166619.8    166600.2    166600.1    166602.0    166603.0    166596.7

fits$eq_vas[["RCS with AIC"]] <- update(vas_model, .~.-rcs(Age, 3)+rcs(Age, 5))
```
Now let's do the same for the BIC:
```r
idx_model <- ols(eq5d1 ~ rcs(Age, 3) + Sex * Charnley_cat +
                   rcs(eq5d0, 3) + rcs(pain0, 3),
                 data = train, x = TRUE, y = TRUE)
sapply(3:9, getInfCrit,
       fit = idx_model,
       rm_var = "rcs(Age, 3)",
       add_var_str = "rcs(Age, %d)",
       ic_fn = BIC)
# rcs(Age, 3) rcs(Age, 4) rcs(Age, 5) rcs(Age, 6) rcs(Age, 7) rcs(Age, 8) rcs(Age, 9)
#   -33486.35   -33474.16   -33477.95   -33470.69   -33462.17   -33455.28   -33459.98

fits$eq5d[["RCS with BIC"]] <- idx_model

vas_model <- ols(eq_vas1 ~ rcs(Age, 3) + Sex * Charnley_cat +
                   rcs(eq_vas0, 3) + OpNr + rcs(pain0, 3),
                 data = train, x = TRUE, y = TRUE)
sapply(3:9, getInfCrit,
       fit = vas_model,
       rm_var = "rcs(Age, 3)",
       add_var_str = "rcs(Age, %d)",
       ic_fn = BIC)
# rcs(Age, 3) rcs(Age, 4) rcs(Age, 5) rcs(Age, 6) rcs(Age, 7) rcs(Age, 8) rcs(Age, 9)
#    166725.7    166737.8    166726.0    166733.8    166743.5    166752.4    166754.0

fits$eq_vas[["RCS with BIC"]] <- vas_model
```
B-splines
B-splines, also known as basis splines, are a common alternative to restricted cubic splines that also use knots to control flexibility. As they are not restricted at the ends, they have more flexible tails than restricted cubic splines. To get a fair comparison we will use the same model for the other variables:
```r
# Use the same setting model as used in the RCS
vars <- attr(terms(fits$eq5d[["RCS with AIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
idx_model <- update(fits$eq5d[["RCS with AIC"]],
                    sprintf(".~.-%s+bs(Age, 3)", rm_var))
sapply(3:9, getInfCrit,
       fit = idx_model,
       rm_var = "bs(Age, 3)",
       add_var_str = "bs(Age, %d)",
       ic_fn = AIC)
# bs(Age, 3) bs(Age, 4) bs(Age, 5) bs(Age, 6) bs(Age, 7) bs(Age, 8) bs(Age, 9)
#  -33669.07  -33683.35  -33680.55  -33683.44  -33683.03  -33681.79  -33685.55

# Chose 6 knots for illustration as it would otherwise be the
# same as the BIC model - not that interesting
fits$eq5d[["BS with AIC"]] <- update(idx_model, .~.-bs(Age, 3)+bs(Age, 6))

vars <- attr(terms(fits$eq5d[["RCS with BIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
idx_model <- update(fits$eq5d[["RCS with BIC"]],
                    sprintf(".~.-%s+bs(Age, 3)", rm_var))
sapply(3:9, getInfCrit,
       fit = idx_model,
       rm_var = "bs(Age, 3)",
       add_var_str = "bs(Age, %d)",
       ic_fn = BIC)
# bs(Age, 3) bs(Age, 4) bs(Age, 5) bs(Age, 6) bs(Age, 7) bs(Age, 8) bs(Age, 9)
#  -33468.29  -33475.09  -33464.40  -33459.42  -33451.12  -33442.38  -33438.71

fits$eq5d[["BS with BIC"]] <- update(idx_model, .~.-bs(Age, 3)+bs(Age, 4))

vars <- attr(terms(fits$eq_vas[["RCS with AIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
vas_model <- update(fits$eq_vas[["RCS with AIC"]],
                    sprintf(".~.-%s+bs(Age, 3)", rm_var))
sapply(3:9, getInfCrit,
       fit = vas_model,
       rm_var = "bs(Age, 3)",
       add_var_str = "bs(Age, %d)",
       ic_fn = AIC)
# bs(Age, 3) bs(Age, 4) bs(Age, 5) bs(Age, 6) bs(Age, 7) bs(Age, 8) bs(Age, 9)
#   166640.3   166617.2   166621.1   166612.9   166613.2   166614.8   166615.0

fits$eq_vas[["BS with AIC"]] <- update(vas_model, .~.-bs(Age, 3)+bs(Age, 6))

vars <- attr(terms(fits$eq_vas[["RCS with BIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
vas_model <- update(fits$eq_vas[["RCS with BIC"]],
                    sprintf(".~.-%s+bs(Age, 3)", rm_var))
sapply(3:9, getInfCrit,
       fit = vas_model,
       rm_var = "bs(Age, 3)",
       add_var_str = "bs(Age, %d)",
       ic_fn = BIC)
# bs(Age, 3) bs(Age, 4) bs(Age, 5) bs(Age, 6) bs(Age, 7) bs(Age, 8) bs(Age, 9)
#   166750.4   166735.1   166746.9   166746.5   166754.7   166764.2   166772.2

fits$eq_vas[["BS with BIC"]] <- update(vas_model, .~.-bs(Age, 3)+bs(Age, 4))
```
Polynomials
Polynomials can be of varying degrees and have often been argued to be a simple approach to fitting a more flexible, non-linear relationship. As the majority of the patients are located around the mean age of 69.1 years, it is important to remember that these patients will have the strongest influence on the shape of the curve. It is therefore possible that the tails are less reliable than the central portion.
```r
# Use the same setting model as used in the RCS
vars <- attr(terms(fits$eq5d[["RCS with AIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
idx_model <- update(fits$eq5d[["RCS with AIC"]],
                    sprintf(".~.-%s+poly(Age, 2)", rm_var))
sapply(2:9, getInfCrit,
       fit = idx_model,
       rm_var = "poly(Age, 2)",
       add_var_str = "poly(Age, %d)",
       ic_fn = AIC)
# poly(Age, 2) poly(Age, 3) poly(Age, 4) poly(Age, 5) poly(Age, 6) poly(Age, 7) poly(Age, 8) poly(Age, 9)
#    -33669.58    -33669.07    -33680.10    -33678.83    -33681.82    -33681.89    -33680.35    -33680.17

fits$eq5d[["Polynomial with AIC"]] <- update(idx_model, .~.-poly(Age, 2)+poly(Age, 6))

vars <- attr(terms(fits$eq5d[["RCS with BIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
idx_model <- update(fits$eq5d[["RCS with BIC"]],
                    sprintf(".~.-%s+poly(Age, 2)", rm_var))
sapply(2:9, getInfCrit,
       fit = idx_model,
       rm_var = "poly(Age, 2)",
       add_var_str = "poly(Age, %d)",
       ic_fn = BIC)
# poly(Age, 2) poly(Age, 3) poly(Age, 4) poly(Age, 5) poly(Age, 6) poly(Age, 7) poly(Age, 8) poly(Age, 9)
#    -33476.79    -33468.29    -33471.83    -33462.59    -33457.76    -33450.37    -33440.97    -33432.99

fits$eq5d[["Polynomial with BIC"]] <- idx_model

vars <- attr(terms(fits$eq_vas[["RCS with AIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
vas_model <- update(fits$eq_vas[["RCS with AIC"]],
                    sprintf(".~.-%s+poly(Age, 2)", rm_var))
sapply(2:9, getInfCrit,
       fit = vas_model,
       rm_var = "poly(Age, 2)",
       add_var_str = "poly(Age, %d)",
       ic_fn = AIC)
# poly(Age, 2) poly(Age, 3) poly(Age, 4) poly(Age, 5) poly(Age, 6) poly(Age, 7) poly(Age, 8) poly(Age, 9)
#     166638.4     166640.3     166622.3     166623.9     166615.7     166616.8     166617.2     166617.5

fits$eq_vas[["Polynomial with AIC"]] <- update(vas_model, .~.-poly(Age, 2)+poly(Age, 6))

vars <- attr(terms(fits$eq_vas[["RCS with BIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
vas_model <- update(fits$eq_vas[["RCS with BIC"]],
                    sprintf(".~.-%s+poly(Age, 2)", rm_var))
sapply(2:9, getInfCrit,
       fit = vas_model,
       rm_var = "poly(Age, 2)",
       add_var_str = "poly(Age, %d)",
       ic_fn = BIC)
# poly(Age, 2) poly(Age, 3) poly(Age, 4) poly(Age, 5) poly(Age, 6) poly(Age, 7) poly(Age, 8) poly(Age, 9)
#     166740.6     166750.4     166740.2     166749.7     166749.3     166758.3     166766.6     166774.7

fits$eq_vas[["Polynomial with BIC"]] <- update(vas_model, .~.-poly(Age, 2)+poly(Age, 4))
```
Multiple Fractional Polynomial Models
Multiple fractional polynomials (MFP) have been proposed as an alternative to splines. They use a defined set of power transformations of the variable and omit predictors according to their p-values. The mfp package has this selection algorithm built in, so with MFP we don't need to rely on either AIC or BIC.
```r
library(mfp)

# We will use the simple BIC models and,
# instead of rcs() that is not available
# for mfp, we use ns() from the splines package
fits$eq5d[["MFP"]] <-
  mfp(eq5d1 ~ fp(Age) + Sex * Charnley_class +
        ns(eq5d0, 3) + ns(pain0, 3),
      data = train)

fits$eq_vas[["MFP"]] <-
  mfp(eq_vas1 ~ fp(Age) + Sex * Charnley_class +
        ns(eq_vas0, 3) + ns(pain0, 3),
      data = train)
```
Generalized additive models
Generalized additive models (GAMs) are an extension of generalized linear models that specialize in non-linear relationships. The Elements of Statistical Learning by Hastie et al. is an excellent source for diving deeper into these. The simplest way to explain GAM smoothers is that they penalize flexibility in order to avoid over-fitting. There are plenty of options; the ones used here are:
Thin plate regression splines: This is generally the most common type of smoother in GAM models. The name refers to the physical analogy of bending a thin sheet of metal.
Cubic regression splines: The basis for the spline is cubic with evenly spread knots.
P-splines: P-splines are similar to B-splines in that they share their basis functions, with the main difference that P-splines enforce a penalty on the coefficients.
```r
library(mgcv)

fits$eq5d[["GAM thin plate"]] <-
  gam(eq5d1 ~ s(Age, bs = "tp") + Sex * Charnley_class +
        ns(eq5d0, 3) + ns(pain0, 3),
      data = train)
fits$eq_vas[["GAM thin plate"]] <-
  gam(eq_vas1 ~ s(Age, bs = "tp") + Sex * Charnley_class +
        ns(eq_vas0, 3) + ns(pain0, 3),
      data = train)

fits$eq5d[["GAM cubic regression"]] <-
  update(fits$eq5d[["GAM thin plate"]],
         .~.-s(Age, bs = "tp")+s(Age, bs = "cr"))
fits$eq_vas[["GAM cubic regression"]] <-
  update(fits$eq_vas[["GAM thin plate"]],
         .~.-s(Age, bs = "tp")+s(Age, bs = "cr"))

fits$eq5d[["GAM P-splines"]] <-
  update(fits$eq5d[["GAM thin plate"]],
         .~.-s(Age, bs = "tp")+s(Age, bs = "ps"))
fits$eq_vas[["GAM P-splines"]] <-
  update(fits$eq_vas[["GAM thin plate"]],
         .~.-s(Age, bs = "tp")+s(Age, bs = "ps"))
```
Benchmark
With all these fancy models in place, we will first evaluate how they perform under cross-validation and then on the test set that we set aside at the start. We will evaluate according to root-mean-square error (RMSE) and mean absolute percentage error (MAPE). RMSE penalizes outliers more heavily, while MAPE is more descriptive of the average performance. Our testing functions will therefore be:
```r
rmse <- function(fit, newdata, outcome_var){
  pred <- predict(fit, newdata = newdata)
  sqrt(mean((newdata[, outcome_var] - pred)^2, na.rm = TRUE))
}

mape <- function(fit, newdata, outcome_var){
  pred <- predict(fit, newdata = newdata)
  mean(abs(newdata[, outcome_var] - pred)/
         mean(newdata[, outcome_var], na.rm = TRUE)*100,
       na.rm = TRUE)
}
```
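As a quick intuition for why the two metrics complement each other, here is a small illustration with plain vector versions of the metrics (toy numbers, not from the dataset): a single large error inflates the RMSE far more than the mean absolute error.

```r
truth        <- c(10, 10, 10, 10, 10)
pred_good    <- c(9, 11, 9, 11, 11)    # small errors everywhere
pred_outlier <- c(10, 10, 10, 10, 15)  # one large error

rmse_vec <- function(y, p) sqrt(mean((y - p)^2))
mae_vec  <- function(y, p) mean(abs(y - p))

# Both prediction sets have the same mean absolute error (1),
# but the outlier set has a clearly larger RMSE (sqrt(5) vs 1)
c(rmse_good = rmse_vec(truth, pred_good),
  rmse_outlier = rmse_vec(truth, pred_outlier))
```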
Furthermore, in this particular article I wanted to look into what happens at the tails, as almost 70% of the patients were between 60 and 80 years while the variable spanned 40 to 100 years. I therefore defined a central vs. peripheral portion, where the central portion was defined as lying between the 15th and 85th percentiles.
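The split itself can be sketched directly with quantile() (simulated ages here; the percentile cut-offs are the ones from the text):

```r
set.seed(7)
age <- rnorm(1000, mean = 69, sd = 10)

# Central portion: between the 15th and 85th percentiles
central <- age >= quantile(age, probs = 0.15) &
           age <= quantile(age, probs = 0.85)

# By construction, roughly 70% of the observations
# end up in the central portion
mean(central)
```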
The cross-validation is a straightforward implementation. As it can take a little time, it is worth thinking about parallelization; we will use the parallel package here, although foreach is also an excellent option.
```r
library(parallel)

crossValidateInParallel <- function(fit, df, outcome_var,
                                    k = 10, tail_prop = .15){
  df$central <- with(df,
                     Age >= quantile(Age, probs = tail_prop) &
                       Age <= quantile(Age, probs = 1 - tail_prop))
  df$cv <- rep(1:k, length.out = nrow(df))

  cv_internal_fn <- function(x, cv_fit, df, outcome_var){
    lc_train <- subset(df, cv != x)
    lc_test <- subset(df, cv == x)

    # In the original version I had a BIC/AIC optimizing
    # function that worked with the cross-validation.
    # Here, we'll keep it simple and just use the previously
    # chosen functions
    cv_fit <- update(fit, data = lc_train)

    return(c(Main_RMSE = rmse(cv_fit, newdata = lc_test,
                              outcome_var = outcome_var),
             Central_RMSE = rmse(cv_fit,
                                 newdata = subset(lc_test, central == TRUE),
                                 outcome_var = outcome_var),
             Peripheral_RMSE = rmse(cv_fit,
                                    newdata = subset(lc_test, central == FALSE),
                                    outcome_var = outcome_var),
             Main_MAPE = mape(cv_fit, newdata = lc_test,
                              outcome_var = outcome_var),
             Central_MAPE = mape(cv_fit,
                                 newdata = subset(lc_test, central == TRUE),
                                 outcome_var = outcome_var),
             Peripheral_MAPE = mape(cv_fit,
                                    newdata = subset(lc_test, central == FALSE),
                                    outcome_var = outcome_var)))
  }

  # It is convenient to use the detectCores() function in
  # order to use the machine's full capacity. Subtracting
  # 1 is also preferable as it prohibits the computer from
  # stopping other tasks, i.e. you can keep surfing the web :-)
  cl <- makeCluster(mc <- getOption("cl.cores",
                                    ifelse(detectCores() <= 1,
                                           1,
                                           detectCores() - 1)))

  # In Windows each core starts out fresh and we therefore need
  # to export the functions, data etc. so that they can access
  # it as expected, or you will get nasty errors
  clusterEvalQ(cl, library(mfp))
  clusterEvalQ(cl, library(mgcv))
  clusterEvalQ(cl, library(rms))
  clusterEvalQ(cl, library(splines))
  clusterEvalQ(cl, library(stringr))
  clusterExport(cl, c("rmse", "mape"))

  res <- parallel::parLapply(1:k,
                             cl = cl,
                             fun = cv_internal_fn,
                             outcome_var = outcome_var,
                             cv_fit = fit,
                             df = df)
  stopCluster(cl)

  res <- as.data.frame(Gmisc::mergeLists(lapplyOutput = res))
  ret <- colMeans(res)
  attr(ret, "raw") <- res
  return(ret)
}
```
If we run all the models we get the following result:
Method | RMSE: Main | RMSE: Central | RMSE: Peripheral | MAPE: Main | MAPE: Central | MAPE: Peripheral |
---|---|---|---|---|---|---|
The EQ-5D index | |||||||
Restricted cubic splines | |||||||
AIC | 0.100 | 0.098 | 0.105 | 8.83 | 8.60 | 9.42 | |
BIC | 0.100 | 0.099 | 0.105 | 8.86 | 8.63 | 9.46 | |
B-splines | |||||||
AIC | 0.100 | 0.099 | 0.105 | 8.85 | 8.80 | 9.48 | |
BIC | 0.101 | 0.099 | 0.105 | 8.91 | 8.79 | 9.48 | |
Polynomials | |||||||
AIC | 0.103 | 0.105 | 0.117 | 8.94 | 9.09 | 10.21 | |
BIC | 0.103 | 0.105 | 0.116 | 8.96 | 9.07 | 10.09 | |
MFP | 0.101 | 0.099 | 0.105 | 8.87 | 8.64 | 9.43 | |
Generalized additive models | |||||||
Thin plate | 0.100 | 0.098 | 0.105 | 8.86 | 8.62 | 9.47 | |
Cubic regression | 0.100 | 0.098 | 0.105 | 8.86 | 8.62 | 9.47 | |
P-splines | 0.100 | 0.098 | 0.105 | 8.86 | 8.62 | 9.47 | |
The EQ VAS | |||||||
Restricted cubic splines | |||||||
AIC | 18.54 | 18.49 | 18.66 | 19.0 | 18.7 | 19.7 | |
BIC | 18.54 | 18.49 | 18.67 | 19.0 | 18.7 | 19.7 | |
B-splines | |||||||
AIC | 18.54 | 18.66 | 18.65 | 19.0 | 19.1 | 19.7 | |
BIC | 18.55 | 18.66 | 18.69 | 19.1 | 19.0 | 19.7 | |
Polynomials | |||||||
AIC | 19.27 | 20.06 | 21.81 | 19.5 | 20.1 | 22.9 | |
BIC | 19.24 | 19.96 | 21.67 | 19.5 | 20.0 | 22.6 | |
MFP | 18.55 | 18.50 | 18.69 | 19.0 | 18.7 | 19.7 | |
Generalized additive models | |||||||
Thin plate | 18.54 | 18.48 | 18.67 | 19.0 | 18.7 | 19.7 | |
Cubic regression | 18.54 | 18.48 | 18.67 | 19.0 | 18.7 | 19.7 | |
P-splines | 18.54 | 18.49 | 18.67 | 19.0 | 18.7 | 19.7 |
The differences between the methods are almost negligible, although the polynomials are clearly not performing as well as the other methods. This was less pronounced in the test set, where the polynomials mainly had some issues with the tails:
Method | RMSE: Main | RMSE: Central | RMSE: Peripheral | MAPE: Main | MAPE: Central | MAPE: Peripheral |
---|---|---|---|---|---|---|
The EQ-5D index | |||||||
Restricted cubic splines | |||||||
AIC | 0.100 | 0.098 | 0.104 | 8.75 | 8.56 | 9.23 | |
BIC | 0.100 | 0.099 | 0.104 | 8.79 | 8.60 | 9.27 | |
B-splines | |||||||
AIC | 0.100 | 0.099 | 0.104 | 8.77 | 8.77 | 9.30 | |
BIC | 0.100 | 0.099 | 0.105 | 8.85 | 8.78 | 9.34 | |
Polynomials | |||||||
AIC | 0.100 | 0.099 | 0.106 | 8.73 | 8.66 | 9.18 | |
BIC | 0.101 | 0.099 | 0.106 | 8.78 | 8.69 | 9.15 | |
MFP | 0.100 | 0.099 | 0.104 | 8.79 | 8.61 | 9.25 | |
Generalized additive models | |||||||
Thin plate | 0.100 | 0.098 | 0.105 | 8.79 | 8.59 | 9.29 | |
Cubic regression | 0.100 | 0.098 | 0.104 | 8.79 | 8.60 | 9.27 | |
P-splines | 0.100 | 0.099 | 0.104 | 8.79 | 8.60 | 9.28 | |
The EQ VAS | |||||||
Restricted cubic splines | |||||||
AIC | 18.70 | 18.50 | 19.18 | 19.1 | 18.7 | 20.1 | |
BIC | 18.71 | 18.51 | 19.19 | 19.1 | 18.8 | 20.2 | |
B-splines | |||||||
AIC | 18.70 | 18.71 | 19.18 | 19.2 | 19.2 | 20.2 | |
BIC | 18.71 | 18.71 | 19.17 | 19.2 | 19.1 | 20.2 | |
Polynomials | |||||||
AIC | 18.74 | 18.80 | 19.67 | 19.1 | 19.0 | 20.4 | |
BIC | 18.74 | 18.75 | 19.64 | 19.1 | 19.0 | 20.4 | |
MFP | 18.72 | 18.52 | 19.19 | 19.2 | 18.8 | 20.1 | |
Generalized additive models | |||||||
Thin plate | 18.69 | 18.50 | 19.17 | 19.1 | 18.7 | 20.1 | |
Cubic regression | 18.69 | 18.50 | 19.17 | 19.1 | 18.7 | 20.1 | |
P-splines | 18.69 | 18.50 | 19.17 | 19.1 | 18.7 | 20.1 |
The tables above were generated with the following code:

```r
cv_results <- list(eq5d = NULL,
                   eq_vas = NULL)
for (group in names(fits)){
  for (method in names(fits[[group]])){
    cv_results[[group]] <-
      rbind(cv_results[[group]],
            crossValidateInParallel(fits[[group]][[method]],
                                    df = train,
                                    outcome_var = sprintf("%s1", group)))
  }
  rownames(cv_results[[group]]) <- names(fits[[group]])
}

out <- rbind(t(apply(cv_results$eq5d, 1,
                     function(x) c(sprintf("%.3f", x[1:3]),
                                   sprintf("%.2f", x[4:6])))),
             t(apply(cv_results$eq_vas, 1,
                     function(x) c(sprintf("%.2f", x[1:3]),
                                   sprintf("%.1f", x[4:6])))))
cgroup <- unique(gsub("^.+_", "", colnames(cv_results$eq5d)))
n.cgroup <- as.numeric(table(gsub("^.+_", "", colnames(cv_results$eq5d))))
colnames(out) <- gsub("_.+$", "", colnames(cv_results$eq5d))
rownames(out) <-
  capitalize(gsub("(RCS|BS|GAM|Polynomial) (with |)", "",
                  c(names(fits$eq5d), names(fits$eq_vas))))
htmlTable(out,
          rgroupCSSstyle = "",
          rowlabel = "Method",
          cgroup = cgroup,
          n.cgroup = n.cgroup,
          rgroup = rep(c("Restricted cubic splines",
                         "B-splines",
                         "Polynomials",
                         "",
                         "Generalized additive models"), 2),
          n.rgroup = rep(c(2, 2, 2, 1, 3), 2),
          tspanner = c("The EQ-5D index",
                       "The EQ VAS"),
          n.tspanner = sapply(cv_results, nrow),
          ctable = TRUE)

test$central <- with(test,
                     Age >= quantile(Age, probs = .15) &
                       Age <= quantile(Age, probs = .85))

testset_results <- list(eq5d = NULL,
                        eq_vas = NULL)
for (group in names(fits)){
  outcome_var <- sprintf("%s1", group)
  for (method in names(fits[[group]])){
    testset_results[[group]] <-
      rbind(testset_results[[group]],
            c(Main_RMSE = rmse(fits[[group]][[method]],
                               newdata = test,
                               outcome_var = outcome_var),
              Central_RMSE = rmse(fits[[group]][[method]],
                                  newdata = subset(test, central == TRUE),
                                  outcome_var = outcome_var),
              Peripheral_RMSE = rmse(fits[[group]][[method]],
                                     newdata = subset(test, central == FALSE),
                                     outcome_var = outcome_var),
              Main_MAPE = mape(fits[[group]][[method]],
                               newdata = test,
                               outcome_var = outcome_var),
              Central_MAPE = mape(fits[[group]][[method]],
                                  newdata = subset(test, central == TRUE),
                                  outcome_var = outcome_var),
              Peripheral_MAPE = mape(fits[[group]][[method]],
                                     newdata = subset(test, central == FALSE),
                                     outcome_var = outcome_var))))
  }
  rownames(testset_results[[group]]) <- names(fits[[group]])
}

out <- rbind(t(apply(testset_results$eq5d, 1,
                     function(x) c(sprintf("%.3f", x[1:3]),
                                   sprintf("%.2f", x[4:6])))),
             t(apply(testset_results$eq_vas, 1,
                     function(x) c(sprintf("%.2f", x[1:3]),
                                   sprintf("%.1f", x[4:6])))))
cgroup <- unique(gsub("^.+_", "", colnames(testset_results$eq5d)))
n.cgroup <- as.numeric(table(gsub("^.+_", "", colnames(testset_results$eq5d))))
colnames(out) <- gsub("_.+$", "", colnames(testset_results$eq5d))
rownames(out) <-
  capitalize(gsub("(RCS|BS|GAM|Polynomial) (with |)", "",
                  c(names(fits$eq5d), names(fits$eq_vas))))
htmlTable(out,
          rgroupCSSstyle = "",
          rowlabel = "Method",
          cgroup = cgroup,
          n.cgroup = n.cgroup,
          rgroup = rep(c("Restricted cubic splines",
                         "B-splines",
                         "Polynomials",
                         "",
                         "Generalized additive models"), 2),
          n.rgroup = rep(c(2, 2, 2, 1, 3), 2),
          tspanner = c("The EQ-5D index",
                       "The EQ VAS"),
          n.tspanner = sapply(testset_results, nrow),
          ctable = TRUE)
```
Graphs
Now let's look at how the estimated relations actually look. In order to quickly style all the graphs I use a common setup:
```r
library(ggplot2)
library(RColorBrewer)
library(gridExtra)

# The ggplot look and feel
my_theme <- theme_bw() +
  theme(legend.position = "bottom",
        legend.text = element_text(size = 7),
        axis.title.x = element_text(size = 9),
        axis.title.y = element_text(size = 9),
        axis.text.x = element_text(size = 8),
        axis.text.y = element_text(size = 8))

getPredDf <- function(ds, eq_type, eq_values, Age){
  new_data <- data.frame(Age = Age)
  new_data$Sex = "Male"
  new_data[[sprintf("%s0", eq_type)]] = eq_values
  new_data$Charnley_class = "A"
  new_data$pain0 = median(train$pain0, na.rm = TRUE)
  new_data$OpNr = 1
  ret <- list("Best" = new_data)

  new_data$Sex = "Female"
  new_data$Charnley_class = "C"
  new_data$OpNr = 2
  ret[["Worst"]] <- new_data
  return(ret)
}

getAgePredDf <- function(ds, eq_type){
  mode_for_preop_eq <-
    as.numeric(names(which.max(table(ds[[sprintf("%s0", eq_type)]]))))
  Age <- seq(from = 40, to = 90, by = .1)
  return(getPredDf(ds, eq_type,
                   eq_values = rep(mode_for_preop_eq,
                                   length.out = length(Age)),
                   Age = Age))
}

getRmsPred <- function(nd, model, eq_type){
  new_data <- nd[["Best"]]
  best_age_spl <- predict(model,
                          newdata = new_data,
                          conf.type = "mean",
                          conf.int = .95)
  new_data <- nd[["Worst"]]
  worst_age_spl <- predict(model,
                           newdata = new_data,
                           conf.type = "mean",
                           conf.int = .95)

  getDf <- function(pred, newdata, eq_type){
    df <- data.frame(Age = newdata$Age,
                     Lower = pred$lower,
                     Upper = pred$upper)
    df[[sprintf("%s1", eq_type)]] = pred$linear.predictors
    df[[sprintf("%s0", eq_type)]] = newdata[[sprintf("%s0", eq_type)]]
    df$Pain = newdata$pain0
    df$OpNr = newdata$OpNr
    return(df)
  }

  df <- getDf(best_age_spl, new_data, eq_type)
  tmp <- getDf(worst_age_spl, new_data, eq_type)
  df$Cat = 1
  tmp$Cat = 2
  df <- rbind(df, tmp)
  rm(tmp)
  df$Cat <- factor(as.numeric(df$Cat),
                   labels = c(" Sex: Male\n Charnley class: A\n First THR ",
                              " Sex: Female\n Charnley class: C\n Previous contralateral THR "))
  return(df)
}

getGlmPred <- function(nd, model, eq_type){
  new_data <- nd[["Best"]]
  best_age_spl <- predict(model,
                          newdata = new_data,
                          se.fit = TRUE)
  new_data <- nd[["Worst"]]
  worst_age_spl <- predict(model,
                           newdata = new_data,
                           se.fit = TRUE)

  getDf <- function(pred, newdata, eq_type){
    df <- data.frame(Age = newdata$Age,
                     Lower = pred$fit - qnorm(0.975)*pred$se.fit,
                     Upper = pred$fit + qnorm(0.975)*pred$se.fit)
    df[[sprintf("%s1", eq_type)]] = pred$fit
    df[[sprintf("%s0", eq_type)]] = newdata[[sprintf("%s0", eq_type)]]
    df$Pain = newdata$pain0
    df$OpNr = newdata$OpNr
    return(df)
  }

  df <- getDf(best_age_spl, new_data, eq_type)
  tmp <- getDf(worst_age_spl, new_data, eq_type)
  df$Cat = 1
  tmp$Cat = 2
  df <- rbind(df, tmp)
  rm(tmp)
  df$Cat <- factor(as.numeric(df$Cat),
                   labels = c(" Sex: Male\n Charnley class: A\n First THR ",
                              " Sex: Female\n Charnley class: C\n Previous contralateral THR "))
  return(df)
}

g_legend <- function(a.gplot){
  tmp <- ggplot_gtable(ggplot_build(a.gplot))
  leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
  legend <- tmp$grobs[[leg]]
  return(legend)
}

plot2Lines <- function(eq5d_df, vas_df){
  clrs <- brewer.pal(3, "Pastel1")
  clrs <- GISTools::add.alpha(clrs, .5)
  clrs <- clrs[c(2, 1, 3)]

  gg_eq5d <- ggplot(data = eq5d_df,
                    aes(x = Age, y = eq5d1, fill = Cat, linetype = Cat)) +
    geom_hline(yintercept = eq5d_df$eq5d0[1],
               lwd = 1, lty = 2, colour = "#00640055") +
    geom_ribbon(aes(ymin = Lower, ymax = Upper)) +
    geom_line() +
    ylab("EQ-5D index one year after surgery") +
    xlab("Age (years)") +
    scale_x_continuous(expand = c(0, 0)) +
    scale_linetype_manual(name = "", values = c(1, 2)) +
    scale_fill_manual(name = "", values = clrs[1:2], guide = "none") +
    my_theme

  gg_eqvas <- ggplot(data = vas_df,
                     aes(x = Age, y = eq_vas1, fill = Cat, linetype = Cat)) +
    geom_hline(yintercept = vas_df$eq_vas0[1],
               lwd = 1, lty = 2, colour = "#00640055") +
    geom_ribbon(aes(ymin = Lower, ymax = Upper)) +
    geom_line() +
    ylab("EQ VAS one year after surgery") +
    xlab("Age (years)") +
    scale_x_continuous(expand = c(0, 0)) +
    scale_linetype_manual(name = "", values = c(1, 2)) +
    scale_fill_manual(name = "", values = clrs[1:2]) +
    my_theme

  mylegend <- g_legend(gg_eqvas)
  grid.arrange(arrangeGrob(gg_eq5d + theme(legend.position = "none"),
                           gg_eqvas + theme(legend.position = "none"),
                           nrow = 1),
               mylegend,
               nrow = 2, heights = c(10, 2))
}
```
It is important to remember that different algorithms will find different optima and may therefore look different to the eye even though they fit the data equally well. I think of each model as a form of skeleton that defines how it can move: it will try to adapt as best it can, but within its boundaries. We can, for instance, bend our elbow but not the forearm (unless you need my surgical skills).
Restricted cubic splines
AIC
BIC
B-splines
AIC
BIC
Polynomials
Note that the y-scale differs for the polynomials
AIC
BIC
MFP - multiple fractional polynomial
Generalized additive models
Thin plate regression splines
Cubic regression splines
P-splines
Summary
I hope you enjoyed the post and found something useful. If there is some model that you miss, write up some code and I can see if it runs. Note that there are differences from the original supplement, which used a slightly more complex setup for choosing the number of knots and a slightly different dataset.