Introduction
The rise of machine learning
In the current 4th industrial revolution, data science has penetrated all industries, and healthcare is no exception. The use of machine learning in clinical research has grown exponentially in the past decade and is expected to grow at an even faster rate in the decade to come. Many machine learning techniques are considered black box algorithms because the intrinsic workings of the models are too complex to justify the reasons for their predictions. However, the adoption of machine learning techniques is inevitable. Thus, instead of avoiding them, clinicians need to learn how to “understand and explain the predictions of these models” to patients for “moral, legal and scientific reasons”.
Explaining models
There are 2 approaches to explaining models:
- Use non-complex models. These algorithms are associated with traditional statistical models, and their simple structure allows us to understand the intrinsic workings of the model.
- Conduct post-hoc interpretation on models. Post-hoc interpretation can be applied to both simple and black box models. These analyses, done after model training, can be further broken down into model-specific and model-agnostic approaches.
Direction of the post
In this post we will explore the first approach to explaining models, using interpretable models such as logistic regression and decision trees (decision trees will be covered in another post). I will be using the tidymodels approach to create these algorithms. The dataset used is the Cleveland heart disease dataset, a binary classification problem of whether heart disease is present or absent for a patient.
Model production pipeline
Glossary
There are 3 columns in the glossary below:
- Original: The original variable names
- Definition: The expansion of the abbreviated original variable names and the details for each factor level for categorical variables.
- Rename: The renamed variable names, which are meant to be less jargonistic than the original variable names.
# library
library(tidyverse)
library(tidymodels)

# import
heart <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data", col_names = F)

# glossary
tribble(
  ~Original, ~Definition, ~Rename,
  "age", "", "age",
  "sex", "1 = Male, 0 = Female", "sex",
  "cp", "Chest pain at rest; there is another variable related to chest pain during the exercise stress test. 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic", "rest_cp",
  "trestbps", "Resting blood pressure (in mm Hg on admission to the hospital)", "rest_bp",
  "chol", "Serum cholesterol", "chol",
  "fbs", "Fasting blood sugar. 1 = >120 mg/dl, 0 = <120 mg/dl", "fast_bloodsugar",
  "restecg", "Resting electrocardiographic results. 0 = normal, 1 = ST-T wave abnormality, 2 = left ventricular hypertrophy", "rest_ecg",
  "thalach", "Maximum heart rate achieved. Based on the values, likely taken during the exercise stress test", "ex_maxHR",
  "exang", "Exercise induced angina (chest pain). 1 = yes, 0 = no", "ex_cp",
  "oldpeak", "ST depression induced by exercise relative to rest", "ex_STdepression_dur",
  "slope", "Slope of the peak exercise ST segment. 1 = upsloping/normal, 2 = flat, 3 = downsloping", "ex_STpeak",
  "ca", "Number of major vessels coloured by fluoroscopy", "coloured_vessels",
  "thal", "Thalassemia. 3 = normal, 6 = fixed defect, 7 = reversible defect", "thalassemia",
  "num", "Angiographic status of heart disease. 0 = <50% diameter narrowing (no heart disease), >=1 = >50% diameter narrowing (heart disease present)", "heart_disease"
) %>%
  DT::datatable(rownames = F, options = list(paging = T))
Data wrangling
Variable and value names
We will convert the numeric encoding of categorical variables to their intended meaning. Using the intended meaning will facilitate the interpretation of models. For instance, saying “typical resting chest pain is the most influential variable” is more comprehensible than “resting chest pain =1 is the most influential variable”.
# renaming variables
colnames(heart) <- c("age", "sex", "rest_cp", "rest_bp", "chol", "fast_bloodsugar",
                     "rest_ecg", "ex_maxHR", "ex_cp", "ex_STdepression_dur",
                     "ex_STpeak", "coloured_vessels", "thalassemia", "heart_disease")

# elaborating categorical variables
## simple ifelse conversion
heart <- heart %>%
  mutate(
    sex = ifelse(sex == "1", "male", "female"),
    fast_bloodsugar = ifelse(fast_bloodsugar == "1", ">120", "<120"),
    ex_cp = ifelse(ex_cp == "1", "yes", "no"),
    heart_disease = ifelse(heart_disease == "0", "no", "yes"))

## complex conversion using `case_when`
heart <- heart %>%
  mutate(
    rest_cp = case_when(rest_cp == "1" ~ "typical",
                        rest_cp == "2" ~ "atypical",
                        rest_cp == "3" ~ "non-CP pain",
                        rest_cp == "4" ~ "asymptomatic"),
    rest_ecg = case_when(rest_ecg == "0" ~ "normal",
                         rest_ecg == "1" ~ "ST-T abnorm",
                         rest_ecg == "2" ~ "LV hypertrophy"),
    ex_STpeak = case_when(ex_STpeak == "1" ~ "up/norm",
                          ex_STpeak == "2" ~ "flat",
                          ex_STpeak == "3" ~ "down"),
    thalassemia = case_when(thalassemia == "3.0" ~ "norm",
                            thalassemia == "6.0" ~ "fixed",
                            thalassemia == "7.0" ~ "reversable"))
Missing data
Missing values are often recorded as NA or “?”. First, let’s identify which variables have missing values recorded as “?”.
heart %>%
  map_df(~sum((.x) == "?")) %>%
  gather(key = "var", value = "value") %>%
  filter(value > 0)

## # A tibble: 1 x 2
##   var              value
##   <chr>            <int>
## 1 coloured_vessels     4
We will convert the 4 observations in “coloured_vessels” with missing values recorded as “?” into true NA.
heart <- heart %>%
  mutate_if(is.character, funs(replace(., . == "?", NA)))
Now let’s tally the total number of missing values reported as NA for each variable. We will impute these missing values later in the pre-processing stage with recipes::step_knnimpute().
heart %>% map_df(~sum(is.na(.x)))

## # A tibble: 1 x 14
##     age   sex rest_cp rest_bp  chol fast_bloodsugar rest_ecg ex_maxHR ex_cp
##   <int> <int>   <int>   <int> <int>           <int>    <int>    <int> <int>
## 1     0     0       0       0     0               0        0        0     0
## # ... with 5 more variables: ex_STdepression_dur <int>, ex_STpeak <int>,
## #   coloured_vessels <int>, thalassemia <int>, heart_disease <int>
Variable class
Currently, categorical variables are treated as characters. We will need to convert them to factors before feeding them into the model.
heart %>%
  map_df(~class(.x)) %>%
  gather(key = "var", value = "class")

## # A tibble: 14 x 2
##    var                 class
##    <chr>               <chr>
##  1 age                 numeric
##  2 sex                 character
##  3 rest_cp             character
##  4 rest_bp             numeric
##  5 chol                numeric
##  6 fast_bloodsugar     character
##  7 rest_ecg            character
##  8 ex_maxHR            numeric
##  9 ex_cp               character
## 10 ex_STdepression_dur numeric
## 11 ex_STpeak           character
## 12 coloured_vessels    character
## 13 thalassemia         character
## 14 heart_disease       character
I prefer to convert the categorical variables during the data wrangling stage rather than during the model pre-processing stage with recipes::step_string2factor. Having the dataset in the right form during the data wrangling phase helps to prevent errors further downstream during pre-processing and model production.
heart <- heart %>% mutate_if(is.character, as.factor)
Models
Training/test set
We will use functions from rsample to create the training and test sets.
set.seed(4595)

data_split <- initial_split(heart, prop = 0.75, strata = "heart_disease")
heart_train <- training(data_split)
heart_test <- testing(data_split)
Pre-processing
The training and test sets are pre-processed using functions from recipes. We will not explicitly create one-hot encoding for the categorical variables, as both logistic regression and decision trees are able to accommodate factors directly. I kept the feature engineering brief, as I wanted the explanation of the models to be succinct.
# create recipe object
heart_recipe <- recipe(heart_disease ~ ., data = heart_train) %>%
  step_knnimpute(all_predictors())

# process the training set / prepare the recipe (non-CV)
heart_prep <- heart_recipe %>%
  prep(training = heart_train, retain = TRUE)
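As a side note, if explicit one-hot encoding were ever needed (for example, for algorithms that only accept numeric inputs), a minimal sketch appending recipes::step_dummy() to the same recipe could look like the following; heart_recipe_dummy is a hypothetical object that is not used anywhere else in this post.

# Sketch only: one-hot encode all categorical predictors after imputation
heart_recipe_dummy <- recipe(heart_disease ~ ., data = heart_train) %>%
  step_knnimpute(all_predictors()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)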
Model creation
The model will be created using functions from parsnip.
lr_model <- logistic_reg(penalty = 10, mixture = 0.1, mode = "classification") %>%
  set_engine("glm") %>%
  fit(heart_disease ~ ., data = juice(heart_prep))
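A note on the arguments: with the glm engine, the penalty and mixture values above are not used, since base R's glm() fits an unregularized model. If a penalized logistic regression were actually desired, a minimal sketch using the glmnet engine might look like this (the penalty value is arbitrary and untuned, and lr_glmnet is a hypothetical object not used in the rest of this post).

# Sketch only: regularized (elastic net) logistic regression via the glmnet engine
lr_glmnet <- logistic_reg(penalty = 0.01, mixture = 0.1, mode = "classification") %>%
  set_engine("glmnet") %>%
  fit(heart_disease ~ ., data = juice(heart_prep))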
Now that we’ve built the model, let’s interpret the white box model.
Logistic regression
Logistic regression is one of the classic models used in medical research to solve classification problems. It provides us with coefficient estimates, but most often we use a derivative of the coefficient estimate, the odds ratio, to comprehend the model.
Coefficient estimate
Before elaborating on the odds ratio, let me quickly define the coefficient estimate. Coefficient estimates from a logistic regression characterize the relationship between a predictor and the outcome on the log-odds scale. A one-unit increase in a predictor (e.g. resting blood pressure rest_bp) is associated with a change in the log odds of the outcome (e.g. heart disease heart_disease) equal to the value of the coefficient estimate (e.g. 0.0453).
broom::tidy(lr_model$fit) %>%
  filter(term == "rest_bp")

## # A tibble: 1 x 5
##   term    estimate std.error statistic p.value
##   <chr>      <dbl>     <dbl>     <dbl>   <dbl>
## 1 rest_bp   0.0453    0.0161      2.81 0.00492
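To make the log-odds scale more concrete, here is a minimal arithmetic sketch. The baseline log-odds value is hypothetical (chosen purely for illustration, not taken from the model); the point is that a one-unit increase in rest_bp shifts the log odds by exactly the coefficient, which translates into a small change in the predicted probability.

coef_rest_bp <- 0.0453      # coefficient estimate shown in the output above

log_odds_base  <- -1                             # hypothetical baseline log odds, for illustration only
log_odds_plus1 <- log_odds_base + coef_rest_bp   # rest_bp increased by 1 mm Hg

plogis(log_odds_base)    # baseline probability of heart disease, ~0.269
plogis(log_odds_plus1)   # probability after a 1 mm Hg increase, ~0.278

exp(coef_rest_bp)        # ~1.05: the multiplicative change in the odds (see next section)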
Odds ratio
The odds ratio represents the odds that an outcome will occur given the presence of a specific predictor, compared to the odds of the outcome occurring in the absence of that predictor, assuming all other predictors remain constant.
The odds ratio is calculated as the exponential of the coefficient estimate, based on a one-unit increase in the predictor (as a side note, coefficient estimates are unbiased, but odds ratios are biased because of the exponential transformation).
As an example with a continuous variable: for a 1 mm Hg increase in resting blood pressure rest_bp, the odds of having heart disease increase by a factor of 1.05.
broom::tidy(lr_model$fit) %>%
  filter(term == "rest_bp") %>%
  mutate(odds_ratio = exp(estimate)) %>%
  select(term, estimate, odds_ratio)

## # A tibble: 1 x 3
##   term    estimate odds_ratio
##   <chr>      <dbl>      <dbl>
## 1 rest_bp   0.0453       1.05
As an example with a categorical variable, consider chest pain during the exercise stress test ex_cpyes. If chest pain is present, the odds of having heart disease increase by a factor of 1.55. Expressed as a percentage change, the odds of heart disease are 54.8% higher for individuals with chest pain during the exercise stress test than for individuals with no chest pain.
broom::tidy(lr_model$fit) %>%
  filter(term == "ex_cpyes") %>%
  mutate(odds_ratio = exp(estimate),
         percentage = (odds_ratio - 1) * 100) %>%
  select(term, estimate, odds_ratio, percentage)

## # A tibble: 1 x 4
##   term     estimate odds_ratio percentage
##   <chr>       <dbl>      <dbl>      <dbl>
## 1 ex_cpyes    0.437       1.55       54.8
An odds ratio of 1 means the variable does not change the odds of having heart disease; here cholesterol chol has an odds ratio of approximately 1 (after rounding).
# https://stackoverflow.com/questions/21509346/r-displays-numbers-in-scientific-notation
broom::tidy(lr_model$fit) %>%
  mutate(odds_ratio = round(exp(estimate), 2)) %>%
  select(term, estimate, odds_ratio) %>%
  filter(odds_ratio == 1)

## # A tibble: 1 x 3
##   term  estimate odds_ratio
##   <chr>    <dbl>      <dbl>
## 1 chol   0.00447          1
Variables such as age and normal resting ECG rest_ecgnormal have odds ratios less than 1, which suggests these variables are associated with lower odds of having heart disease. We tend to ignore the intercept when talking about odds ratios.
broom::tidy(lr_model$fit) %>%
  mutate(odds_ratio = round(exp(estimate), 2)) %>%
  select(term, estimate, odds_ratio) %>%
  filter(odds_ratio < 1)

## # A tibble: 10 x 3
##    term                estimate odds_ratio
##    <chr>                  <dbl>      <dbl>
##  1 (Intercept)          -0.750        0.47
##  2 age                  -0.0226       0.98
##  3 rest_cpatypical      -2.13         0.12
##  4 rest_cpnon-CP pain   -1.86         0.16
##  5 rest_cptypical       -4.18         0.02
##  6 fast_bloodsugar>120  -0.667        0.51
##  7 rest_ecgnormal       -0.779        0.46
##  8 ex_maxHR             -0.0471       0.95
##  9 ex_STpeakup/norm     -0.183        0.83
## 10 thalassemianorm      -0.201        0.82
Variables such as reversable thalassemia thalassemiareversable and resting blood pressure rest_bp have odds ratios greater than 1, which suggests that exposure to these variables is associated with higher odds of heart disease.
ST-T abnormality on resting ECG rest_ecgST-T abnorm, 2 vessels coloured during the angiogram coloured_vessels2.0, and males sexmale have unusually large odds ratios. More thorough data exploration and feature engineering will be needed to address them; our pre-processing was rather brief, with only imputation of missing values.
broom::tidy(lr_model$fit) %>%
  mutate(odds_ratio = round(exp(estimate), 2)) %>%
  select(term, estimate, odds_ratio) %>%
  filter(odds_ratio > 1) %>%
  arrange(desc(odds_ratio))

## # A tibble: 10 x 3
##    term                  estimate odds_ratio
##    <chr>                    <dbl>      <dbl>
##  1 rest_ecgST-T abnorm    13.1       465136.
##  2 coloured_vessels2.0     4.68         107.
##  3 sexmale                 2.48          12.0
##  4 coloured_vessels3.0     1.62           5.04
##  5 coloured_vessels1.0     1.52           4.59
##  6 thalassemiareversable   1.26           3.53
##  7 ex_cpyes                0.437          1.55
##  8 ex_STdepression_dur     0.296          1.34
##  9 rest_bp                 0.0453         1.05
## 10 ex_STpeakflat           0.0512         1.05
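One way to investigate such inflated odds ratios is to check how many training observations fall into each of these levels; very sparse cells can produce quasi-complete separation and unstable coefficient estimates. A minimal sketch on the prepared training data (the exact counts will depend on the random split):

# tally the outcome within each level of the suspect predictors
juice(heart_prep) %>% count(rest_ecg, heart_disease)
juice(heart_prep) %>% count(coloured_vessels, heart_disease)
juice(heart_prep) %>% count(sex, heart_disease)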
Mistakes with odds ratio
Although odds ratios are useful for identifying risk factors, they should not be confused with risk ratios; the two terms are frequently mixed up.
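A small worked example with a hypothetical 2x2 table makes the distinction concrete (the counts below are made up purely for illustration):

# hypothetical 2x2 table: exposure (rows) vs disease (columns)
#             disease  no disease
# exposed          30          70
# unexposed        10          90

odds_exposed   <- 30 / 70
odds_unexposed <- 10 / 90
odds_ratio     <- odds_exposed / odds_unexposed     # (30/70) / (10/90) ~ 3.86

risk_exposed   <- 30 / (30 + 70)
risk_unexposed <- 10 / (10 + 90)
risk_ratio     <- risk_exposed / risk_unexposed     # 0.30 / 0.10 = 3.00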
When using values from a logistic regression to create a point-based risk scoring system, the coefficient estimates should be used, not the odds ratios.
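As a rough illustration of why coefficients, not odds ratios, belong in a point system: coefficients are additive on the log-odds scale, so scaled and rounded coefficients can simply be summed into a score, whereas odds ratios are multiplicative and cannot be added. A minimal sketch (the scaling constant of 0.1 is arbitrary, and this is in no way a validated risk score):

# Sketch only: turn coefficient estimates into integer "points"
broom::tidy(lr_model$fit) %>%
  filter(term != "(Intercept)") %>%
  mutate(points = round(estimate / 0.1)) %>%   # 0.1 is an arbitrary scaling constant
  select(term, estimate, points)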
Conclusion
In this post we looked at explaining models using white box models such as logistic regression. In the next post, we will learn about another white box model, the decision tree.