Introduction
In this paper we fit a logistic regression model to the heart disease data downloaded from the Kaggle website.
For the data preparation we follow the same steps as in my previous paper about the naive Bayes model; for more detail, click here to access that paper.
Data preparation
First we load the data along with the required packages.
library(tidyverse, warn.conflicts = FALSE)
library(caret, warn.conflicts = FALSE)

mydata <- read.csv("heart.csv", header = TRUE)
names(mydata)[1] <- "age"
glimpse(mydata)

## Observations: 303
## Variables: 14
## $ age      <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp       <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ restecg  <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach  <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal     <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
The data at hand has the following features:
- age : age of the patient.
- sex : 1 = male, 0 = female.
- cp : chest pain type.
- trestbps : resting blood pressure.
- chol : serum cholesterol.
- fbs : fasting blood sugar.
- restecg : resting electrocardiographic results.
- thalach : maximum heart rate achieved.
- exang : exercise induced angina.
- oldpeak : ST depression induced by exercise relative to rest.
- slope : the slope of the peak exercise ST segment.
- ca : number of major vessels colored by fluoroscopy.
- thal : not well defined in the data source.
- target : presence or absence of heart disease.
We see that some features should be converted to factor type, as follows:
mydata <- mydata %>% modify_at(c(2, 3, 6, 7, 9, 11, 12, 13, 14), as.factor)
glimpse(mydata)

## Observations: 303
## Variables: 14
## $ age      <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      <fct> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp       <fct> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      <fct> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ restecg  <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach  <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    <fct> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal     <fct> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
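As a side note, the same conversion can be done by column name rather than by position, which is more robust if the column order ever changes. This is a minimal alternative sketch assuming dplyr >= 1.0 (for across()); it is not part of the original code and gives the same result.

# Equivalent conversion by column name instead of positional indices
mydata <- mydata %>%
  mutate(across(c(sex, cp, fbs, restecg, exang, slope, ca, thal, target), as.factor))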
Before going ahead, we should check the relationships between the target variable and the remaining factor variables.
xtabs(~target+sex, data = mydata)

##       sex
## target   0   1
##      0  24 114
##      1  72  93

xtabs(~target+cp, data = mydata)

##       cp
## target   0   1   2   3
##      0 104   9  18   7
##      1  39  41  69  16

xtabs(~target+fbs, data = mydata)

##       fbs
## target   0   1
##      0 116  22
##      1 142  23

xtabs(~target+restecg, data = mydata)

##       restecg
## target  0  1  2
##      0 79 56  3
##      1 68 96  1

xtabs(~target+exang, data = mydata)

##       exang
## target   0   1
##      0  62  76
##      1 142  23

xtabs(~target+slope, data = mydata)

##       slope
## target   0   1   2
##      0  12  91  35
##      1   9  49 107

xtabs(~target+ca, data = mydata)

##       ca
## target   0   1   2   3   4
##      0  45  44  31  17   1
##      1 130  21   7   3   4

xtabs(~target+thal, data = mydata)

##       thal
## target   0   1   2   3
##      0   1  12  36  89
##      1   1   6 130  28
As we can see, the restecg, ca and thal variables have cells with fewer than 5 cases, the usual threshold recommended for logistic regression. In addition, if we split the data into a training set and a test set, level 2 of restecg (with only one case in the disease group) may not appear in one of the sets. Therefore we remove these variables from the model.
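Before doing so, the same check can be done programmatically. The following is a minimal sketch (not part of the original code) that, for each factor predictor, lists the levels with fewer than 5 cases in either target class; the threshold of 5 is the rule of thumb mentioned above.

# Identify the factor predictors (everything except the target)
factor_vars <- names(mydata)[sapply(mydata, is.factor)]
factor_vars <- setdiff(factor_vars, "target")

# For each factor, report the levels with fewer than 5 cases in either target class
sparse_levels <- lapply(factor_vars, function(v) {
  tab <- table(mydata[[v]], mydata$target)
  rownames(tab)[apply(tab, 1, function(x) any(x < 5))]
})
names(sparse_levels) <- factor_vars
sparse_levels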
mydata <- mydata[, -c(7, 12, 13)]
glimpse(mydata)

## Observations: 303
## Variables: 11
## $ age      <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      <fct> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp       <fct> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      <fct> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ thalach  <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    <fct> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ target   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
Before training our model, we can get a rough idea of which predictors are likely to matter for predicting the dependent variable.
Let's plot the relationships between the target variable and the other features.
ggplot(mydata, aes(sex, target, color = target)) +
  geom_jitter()
If we look only at the red points (healthy patients) we might wrongly conclude that females are less healthy than males. This is because we are not taking into account the imbalanced number of cases in each sex level (96 females, 207 males). In contrast, if we look only at females, a particular female is more likely to have the disease than not.
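A quick count confirms this imbalance (a small check, not in the original code; the counts agree with the xtabs output above):

# Number of observations per sex level (0 = female, 1 = male)
table(mydata$sex)

##
##   0   1
##  96 207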
ggplot(mydata, aes(cp, fill = target)) +
  geom_bar(stat = "count", position = "dodge")
From this plot we can see that patients without any chest pain (cp = 0) are much less likely to have the disease, whereas patients with any other chest pain type are more likely to be affected by it. We can therefore expect this predictor to be of significant importance in the trained model.
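To put numbers on the plot, here is a short addition (not in the original code) showing the proportion of diseased and healthy patients within each chest pain type; the values follow directly from the xtabs table above.

# Row-wise proportions of target within each chest pain type
round(prop.table(xtabs(~ cp + target, data = mydata), margin = 1), 2)

##    target
## cp     0    1
##   0 0.73 0.27
##   1 0.18 0.82
##   2 0.21 0.79
##   3 0.30 0.70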
ggplot(mydata, aes(age, fill = target)) +
  geom_density(alpha = .5)
Data partition
We take 80% of the data to use as the training set, and the rest is put aside to evaluate the model's performance.
set.seed(1234)
index <- createDataPartition(mydata$target, p = .8, list = FALSE)
train <- mydata[index, ]
test  <- mydata[-index, ]
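Note that createDataPartition() stratifies on the target, so the class proportions should be roughly the same in both sets. A minimal check (an addition, not in the original code; output not shown):

# Class proportions in the training and test sets
round(prop.table(table(train$target)), 3)
round(prop.table(table(test$target)), 3)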
Train the model
We are now ready to train our model.
model <- glm(target ~ ., data = train, family = "binomial")
summary(model)

##
## Call:
## glm(formula = target ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.5855  -0.5294   0.1990   0.6120   2.4022
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  3.715274   2.883238   1.289 0.197545
## age         -0.014712   0.023285  -0.632 0.527502
## sex1        -1.686359   0.479254  -3.519 0.000434 ***
## cp1          1.212919   0.549670   2.207 0.027340 *
## cp2          2.010255   0.486638   4.131 3.61e-05 ***
## cp3          2.139066   0.682727   3.133 0.001730 **
## trestbps    -0.020471   0.012195  -1.679 0.093220 .
## chol        -0.005840   0.003776  -1.547 0.121959
## fbs1        -0.200690   0.519116  -0.387 0.699053
## thalach      0.024461   0.010928   2.238 0.025196 *
## exang1      -0.792717   0.431434  -1.837 0.066151 .
## oldpeak     -0.820508   0.231100  -3.550 0.000385 ***
## slope1      -0.999768   1.015514  -0.984 0.324872
## slope2      -0.767247   1.097448  -0.699 0.484477
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 191.33  on 229  degrees of freedom
## AIC: 219.33
##
## Number of Fisher Scoring iterations: 5
We see from the p-values that some variables are not significant, such as age, chol, fbs and slope, and also the intercept. First, let's remove the insignificant factor variables fbs and slope.
model <- glm(target ~ . - fbs - slope, data = train, family = "binomial")
summary(model)

##
## Call:
## glm(formula = target ~ . - fbs - slope, family = "binomial",
##     data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.6702  -0.5505   0.1993   0.6344   2.4495
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  2.826395   2.695175   1.049 0.294322
## age         -0.016677   0.023157  -0.720 0.471420
## sex1        -1.729320   0.470656  -3.674 0.000239 ***
## cp1          1.243879   0.548288   2.269 0.023289 *
## cp2          1.987151   0.472994   4.201 2.65e-05 ***
## cp3          2.125766   0.677257   3.139 0.001696 **
## trestbps    -0.020672   0.012005  -1.722 0.085084 .
## chol        -0.006434   0.003721  -1.729 0.083816 .
## thalach      0.026567   0.010432   2.547 0.010873 *
## exang1      -0.848162   0.423189  -2.004 0.045047 *
## oldpeak     -0.798699   0.198597  -4.022 5.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 192.66  on 232  degrees of freedom
## AIC: 214.66
##
## Number of Fisher Scoring iterations: 5
Now we remove the age variable, since it is the least significant.
model <- glm(target ~ . - fbs - slope - age, data = train, family = "binomial")
summary(model)

##
## Call:
## glm(formula = target ~ . - fbs - slope - age, family = "binomial",
##     data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.6925  -0.5397   0.2032   0.6345   2.4032
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  1.703126   2.188741   0.778 0.436492
## sex1        -1.677986   0.463447  -3.621 0.000294 ***
## cp1          1.221925   0.545175   2.241 0.025004 *
## cp2          1.961200   0.468443   4.187 2.83e-05 ***
## cp3          2.085409   0.676469   3.083 0.002051 **
## trestbps    -0.022133   0.011872  -1.864 0.062273 .
## chol        -0.006900   0.003675  -1.878 0.060443 .
## thalach      0.029761   0.009471   3.142 0.001676 **
## exang1      -0.820113   0.420434  -1.951 0.051101 .
## oldpeak     -0.803423   0.198400  -4.050 5.13e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 193.19  on 233  degrees of freedom
## AIC: 213.19
##
## Number of Fisher Scoring iterations: 5
Now we remove the exang variable.
model <- glm(target ~ . - fbs - slope - age - exang, data = train, family = "binomial")
summary(model)

##
## Call:
## glm(formula = target ~ . - fbs - slope - age - exang, family = "binomial",
##     data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.7030  -0.5643   0.2004   0.6510   2.5728
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.832691   2.105139   0.396 0.692436
## sex1        -1.713577   0.459659  -3.728 0.000193 ***
## cp1          1.494091   0.528172   2.829 0.004672 **
## cp2          2.205121   0.454341   4.853 1.21e-06 ***
## cp3          2.220423   0.668760   3.320 0.000899 ***
## trestbps    -0.021812   0.011704  -1.864 0.062375 .
## chol        -0.007110   0.003597  -1.977 0.048054 *
## thalach      0.033412   0.009291   3.596 0.000323 ***
## oldpeak     -0.822277   0.195993  -4.195 2.72e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 196.98  on 234  degrees of freedom
## AIC: 214.98
##
## Number of Fisher Scoring iterations: 5
Notice that we cannot remove the intercept even though it is not significant, because it absorbs the reference level "0" of the factor cp, which is significant. This is hence our final model.
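Since the coefficients are on the log-odds scale, a convenient way to interpret the final model is to exponentiate them into odds ratios. The lines below are a small addition (not part of the original code; output not shown); confint.default() gives Wald intervals.

# Odds ratios: values above 1 increase the odds of heart disease, values below 1 decrease them
exp(coef(model))

# Wald confidence intervals on the odds-ratio scale
exp(confint.default(model))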
Prediction and confusion matrix
We will use this model to make predictions on the training set.
pred <- predict(model, train, type = "response")
head(pred)

##         2         3         4         6         7         8
## 0.5202639 0.9331630 0.8330192 0.3354247 0.7730621 0.8705651
Using the confusion matrix, we get the accuracy rate on the training set.
pred <- as.integer(pred > 0.5)
confusionMatrix(as.factor(pred), train$target, positive = "1")

## Confusion Matrix and Statistics
##
##           Reference
## Prediction   0   1
##          0  87  17
##          1  24 115
##
##                Accuracy : 0.8313
##                  95% CI : (0.7781, 0.8761)
##     No Information Rate : 0.5432
##     P-Value [Acc > NIR] : <2e-16
##
##                   Kappa : 0.6583
##
##  Mcnemar's Test P-Value : 0.3487
##
##             Sensitivity : 0.8712
##             Specificity : 0.7838
##          Pos Pred Value : 0.8273
##          Neg Pred Value : 0.8365
##              Prevalence : 0.5432
##          Detection Rate : 0.4733
##    Detection Prevalence : 0.5720
##       Balanced Accuracy : 0.8275
##
##        'Positive' Class : 1
##
On the training set the accuracy rate is about 83.13%, but we are more interested in the accuracy on the test set.
pred <- predict(model, test, type = "response")
pred <- as.integer(pred > 0.5)
confusionMatrix(as.factor(pred), test$target)

## Confusion Matrix and Statistics
##
##           Reference
## Prediction  0  1
##          0 16  3
##          1 11 30
##
##                Accuracy : 0.7667
##                  95% CI : (0.6396, 0.8662)
##     No Information Rate : 0.55
##     P-Value [Acc > NIR] : 0.0004231
##
##                   Kappa : 0.5156
##
##  Mcnemar's Test P-Value : 0.0613688
##
##             Sensitivity : 0.5926
##             Specificity : 0.9091
##          Pos Pred Value : 0.8421
##          Neg Pred Value : 0.7317
##              Prevalence : 0.4500
##          Detection Rate : 0.2667
##    Detection Prevalence : 0.3167
##       Balanced Accuracy : 0.7508
##
##        'Positive' Class : 0
##
On the test set we get a lower accuracy rate of about 76.67%.
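Accuracy depends on the 0.5 cut-off; a threshold-free way to assess the model is the ROC curve and its AUC. Here is a minimal sketch assuming the pROC package is installed (it is not used in the original code), so it should be treated as an optional extra check.

library(pROC)

# Predicted probabilities of the final logit model on the test set
prob <- predict(model, test, type = "response")

# ROC curve and area under the curve
roc_obj <- roc(response = test$target, predictor = prob)
plot(roc_obj)
auc(roc_obj)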
Altering the link function
By default the link function is the logit, which is based on the logistic (sigmoid) distribution. We can instead use the probit link, which is based on the standard normal distribution.
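To make the difference concrete, here is a tiny base-R illustration (not part of the original analysis) of the two link functions, both of which map probabilities in (0, 1) to the whole real line:

# The logit link is the quantile function of the logistic distribution,
# the probit link the quantile function of the standard normal distribution
p <- seq(0.1, 0.9, by = 0.2)
cbind(p, logit = qlogis(p), probit = qnorm(p))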
model1 <- glm(target ~ . - fbs - slope - exang - age, data = train,
              family = binomial(link = "probit"))
summary(model1)

##
## Call:
## glm(formula = target ~ . - fbs - slope - exang - age, family = binomial(link = "probit"),
##     data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.7779  -0.5883   0.1666   0.6670   2.5989
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.373007   1.199910   0.311 0.755905
## sex1        -0.940784   0.252631  -3.724 0.000196 ***
## cp1          0.830588   0.299919   2.769 0.005616 **
## cp2          1.275100   0.253681   5.026 5.00e-07 ***
## cp3          1.262407   0.387479   3.258 0.001122 **
## trestbps    -0.011677   0.006660  -1.753 0.079549 .
## chol        -0.004068   0.002047  -1.987 0.046870 *
## thalach      0.018999   0.005163   3.680 0.000233 ***
## oldpeak     -0.470191   0.108935  -4.316 1.59e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 197.23  on 234  degrees of freedom
## AIC: 215.23
##
## Number of Fisher Scoring iterations: 6

pred <- predict(model1, test, type = "response")
pred <- as.integer(pred > 0.5)
confusionMatrix(as.factor(pred), test$target)

## Confusion Matrix and Statistics
##
##           Reference
## Prediction  0  1
##          0 16  3
##          1 11 30
##
##                Accuracy : 0.7667
##                  95% CI : (0.6396, 0.8662)
##     No Information Rate : 0.55
##     P-Value [Acc > NIR] : 0.0004231
##
##                   Kappa : 0.5156
##
##  Mcnemar's Test P-Value : 0.0613688
##
##             Sensitivity : 0.5926
##             Specificity : 0.9091
##          Pos Pred Value : 0.8421
##          Neg Pred Value : 0.7317
##              Prevalence : 0.4500
##          Detection Rate : 0.2667
##    Detection Prevalence : 0.3167
##       Balanced Accuracy : 0.7508
##
##        'Positive' Class : 0
##
As we can see, we get the same classification results, with only a slight difference in the AIC criterion: 215.23 for the probit link versus 214.98 for the logit link.
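The two fits can also be compared directly with the AIC() function (a small addition; model is the final logit fit and model1 the probit fit from above; output not shown):

# Compare the logit and probit fits by AIC (lower is better)
AIC(model, model1)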