Introduction
This tutorial explores the most popular models used to predict an ordered response variable. We will use the heart disease data downloaded from the kaggle website, with the chest pain variable cp as our response instead of the usual target variable.
Data preparation
First, we load the libraries and the data that we need throughout this illustration.
library(tidyverse)
library(caret)
library(tidymodels)

mydata <- read.csv("../heart.csv", header = TRUE)
names(mydata)[1] <- "age"
The data at hand has the following features:
- age : age of the patient.
- sex : 1 = male, 0 = female.
- cp : chest pain type.
- trestbps : resting blood pressure.
- chol : serum cholesterol.
- fbs : fasting blood sugar.
- restecg : resting electrocardiographic results.
- thalach : maximum heart rate achieved.
- exang : exercise induced angina.
- oldpeak : ST depression induced by exercise relative to rest.
- slope : the slope of the peak exercise ST segment.
- ca : number of major vessels colored by fluoroscopy.
- thal : not well defined in the data source.
- target : has heart disease or not.
A good starting point is to explore the summary of all predictors and the missing values with the powerful skim function from the skimr package.
skimr::skim(mydata)
Name | mydata |
Number of rows | 303 |
Number of columns | 14 |
Column type frequency: | |
numeric | 14 |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
age | 0 | 1 | 54.37 | 9.08 | 29 | 47.5 | 55.0 | 61.0 | 77.0 | ▁▆▇▇▁ |
sex | 0 | 1 | 0.68 | 0.47 | 0 | 0.0 | 1.0 | 1.0 | 1.0 | ▃▁▁▁▇ |
cp | 0 | 1 | 0.97 | 1.03 | 0 | 0.0 | 1.0 | 2.0 | 3.0 | ▇▃▁▅▁ |
trestbps | 0 | 1 | 131.62 | 17.54 | 94 | 120.0 | 130.0 | 140.0 | 200.0 | ▃▇▅▁▁ |
chol | 0 | 1 | 246.26 | 51.83 | 126 | 211.0 | 240.0 | 274.5 | 564.0 | ▃▇▂▁▁ |
fbs | 0 | 1 | 0.15 | 0.36 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | ▇▁▁▁▂ |
restecg | 0 | 1 | 0.53 | 0.53 | 0 | 0.0 | 1.0 | 1.0 | 2.0 | ▇▁▇▁▁ |
thalach | 0 | 1 | 149.65 | 22.91 | 71 | 133.5 | 153.0 | 166.0 | 202.0 | ▁▂▅▇▂ |
exang | 0 | 1 | 0.33 | 0.47 | 0 | 0.0 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▃ |
oldpeak | 0 | 1 | 1.04 | 1.16 | 0 | 0.0 | 0.8 | 1.6 | 6.2 | ▇▂▁▁▁ |
slope | 0 | 1 | 1.40 | 0.62 | 0 | 1.0 | 1.0 | 2.0 | 2.0 | ▁▁▇▁▇ |
ca | 0 | 1 | 0.73 | 1.02 | 0 | 0.0 | 0.0 | 1.0 | 4.0 | ▇▃▂▁▁ |
thal | 0 | 1 | 2.31 | 0.61 | 0 | 2.0 | 2.0 | 3.0 | 3.0 | ▁▁▁▇▆ |
target | 0 | 1 | 0.54 | 0.50 | 0 | 0.0 | 1.0 | 1.0 | 1.0 | ▇▁▁▁▇ |
For our case we will use the chest pain type cp variable as our target variable, since it is a categorical variable. However, for pedagogic purposes, we will convert it into an ordered factor with only three levels, no pain, moderate pain, and severe pain (instead of the current four levels).
Looking at the above output, we convert to factor type the variables that should be factors: sex, target, fbs, restecg, exang, slope, ca, and thal. For the response variable cp, we drop its least frequent level with all its related rows, then rename the remaining levels: no pain for the most frequent one, severe pain for the least frequent one, and moderate pain for the remaining one.
table(mydata$cp)

  0   1   2   3 
143  50  87  23
We see that level 3 is the least frequent one.
mydata <- mydata %>%
  modify_at(c("cp", "sex", "target", "fbs", "restecg", "exang",
              "slope", "ca", "thal"), as.factor)
mydata <- mydata[mydata$cp != 3, ]
mydata$cp <- fct_drop(mydata$cp, only = levels(mydata$cp))
table(mydata$cp)

  0   1   2 
143  50  87
According to these frequencies, we rename and order the levels as follows.
mydata$cp <- fct_recode(mydata$cp, no = "0", sev = "1", mod = "2")
mydata$cp <- factor(mydata$cp, ordered = TRUE)
mydata$cp <- fct_infreq(mydata$cp)
mydata$cp[1:5]

[1] mod sev sev no  no 
Levels: no < mod < sev
As with logistic regression, the number of cases in each cell of each cross table between the outcome and a factor predictor should be above the threshold of 5 commonly applied in practice.
xtabs(~cp+sex, data=mydata)
     sex
cp      0   1
  no   39 104
  mod  35  52
  sev  18  32

xtabs(~cp+target, data=mydata)
     target
cp      0   1
  no  104  39
  mod  18  69
  sev   9  41

xtabs(~cp+fbs, data=mydata)
     fbs
cp      0   1
  no  125  18
  mod  70  17
  sev  45   5

xtabs(~cp+restecg, data=mydata)
     restecg
cp     0  1  2
  no  78 62  3
  mod 36 50  1
  sev 19 31  0

xtabs(~cp+exang, data=mydata)
     exang
cp     0  1
  no  63 80
  mod 76 11
  sev 46  4

xtabs(~cp+slope, data=mydata)
     slope
cp     0  1  2
  no  11 84 48
  mod  5 33 49
  sev  2 12 36

xtabs(~cp+ca, data=mydata)
     ca
cp     0  1  2  3  4
  no  65 34 29 14  1
  mod 57 20  2  5  3
  sev 37  8  3  1  1

xtabs(~cp+thal, data=mydata)
     thal
cp     0  1  2  3
  no   1 12 52 78
  mod  1  2 62 22
  sev  0  2 39  9
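Rather than eyeballing each table, the check can be automated. Here is a minimal sketch (not part of the original workflow) that flags every factor predictor with at least one cell count below 5:

factors <- c("sex", "target", "fbs", "restecg", "exang", "slope", "ca", "thal")
# TRUE means the predictor violates the rule of thumb
sapply(factors, function(v) any(table(mydata$cp, mydata[[v]]) < 5))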
The following variables do not respect this threshold and hence will be removed from the predictor set: restecg, exang, slope, ca, and thal.
mydata<-mydata[,setdiff(names(mydata), c("restecg", "exang", "slope", "ca", "thal"))]
The data is now ready, so we can split it into training and testing sets.
set.seed(1122)
parts <- initial_split(mydata, prop = 0.8, strata = cp)
train <- training(parts)
test <- testing(parts)
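Since we stratified the split on cp, we can quickly verify that the class proportions are similar in both sets:

# class proportions should roughly agree between train and test
prop.table(table(train$cp))
prop.table(table(test$cp))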
The most popular models that we will use are: the ordinal logistic model, the CART model, the ordinal random forest model, and the continuation ratio model.
Ordered logistic regression (logit)
Before training this type of model let’s show how it works. For simplicity suppose we have data that has an ordered outcome \(y\) with three class labels (“1”,“2”,“3”) and only two features \(x_1\) and \(x_2\).
First we define a latent variable as a linear combination of the features plus a random error term \(\epsilon_i\), whose distribution will determine the type of model:

\[y_i^*=\beta_1 X_{i1}+\beta_2 X_{i2}+\epsilon_i\]
Then since we have three classes we define two thresholds for this latent variable \(\alpha_1\) and \(\alpha_2\) such that a particular observation \(y_i\) will be classified as follows:
\[\begin{cases} y_i=1 & \text{if $y_i^* \leq \alpha_1$} \\ y_i=2 & \text{if $\alpha_1 < y_i^* \leq \alpha_2$} \\ y_i=3 & \text{if $y_i^* > \alpha_2$}\end{cases}\]
Now we can obtain the probability of a particular observation to fall into a specific class as follows:
\[\begin{cases} p(y_i=1)=p(y_i^* \leq \alpha_1)=F(\alpha_1-\beta_1 X_{i1}-\beta_2 X_{i2}) \\ p(y_i=2)=p(\alpha_1 < y_i^* \leq \alpha_2)=F(\alpha_2-\beta_1 X_{i1}-\beta_2 X_{i2})-F(\alpha_1-\beta_1 X_{i1}-\beta_2 X_{i2}) \\ p(y_i=3)=1-p(y_i=2)-p(y_i=1)\end{cases}\]
It remains now to choose a suitable distribution function \(F\). Two are commonly used for this type of data: the logistic function \(F(x)=\frac{1}{1+e^{-x}}\) (logit) and the standard normal distribution function, aka probit.
Note: there exist other functions like loglog, cloglog, and cauchit.
Using the logit function, the probabilities become:
\[\begin{cases} p(y_i=1)=\frac{1}{1+e^{-(\alpha_1-\beta_1 X_{i1}-\beta_2 X_{i2})}} \\ p(y_i=2)=\frac{1}{1+e^{-(\alpha_2-\beta_1 X_{i1}-\beta_2 X_{i2})}}-p(y_i=1) \\ p(y_i=3)=1-p(y_i=2)-p(y_i=1)\end{cases}\]
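To make this mechanism concrete, here is a small simulation sketch; the coefficients and thresholds are made up for illustration, not estimated from our data:

# latent score with logistic noise, then cut at the thresholds
set.seed(1)
n  <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
ystar <- 1.5 * x1 - 0.8 * x2 + rlogis(n)          # rlogis() gives the logit link
y <- cut(ystar, breaks = c(-Inf, -0.5, 1, Inf),   # alpha_1 = -0.5, alpha_2 = 1
         labels = c("1", "2", "3"), ordered_result = TRUE)
table(y)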
The MASS package provides the method polr to perform an ordinal logistic regression.
library(MASS)
set.seed(1234)
model_logistic <- train(cp ~ ., data = train, method = "polr",
                        tuneGrid = expand.grid(method = "logistic"))
summary(model_logistic)

Coefficients:
              Value Std. Error  t value
age       0.0112236   0.018219  0.61605
sex1      0.2593720   0.316333  0.81993
trestbps -0.0002329   0.009090 -0.02562
chol     -0.0013238   0.002697 -0.49082
fbs1      0.3188826   0.401836  0.79356
thalach   0.0226246   0.008199  2.75933
oldpeak  -0.3360326   0.163547 -2.05465
target1   1.7234740   0.376279  4.58031

Intercepts:
        Value  Std. Error t value
no|mod  4.5786 1.9271     2.3759
mod|sev 6.5004 1.9527     3.3289

Residual Deviance: 376.4697 
AIC: 396.4697
This table does not provide the p-values. However, this is not a big problem, since we can add them with the following script.
prob <- pnorm(abs(summary(model_logistic)$coefficients[, 3]), lower.tail = FALSE) * 2
cbind(summary(model_logistic)$coefficients, prob)

                 Value  Std. Error     t value         prob
age       0.0112236479 0.018218848  0.61604597 5.378642e-01
sex1      0.2593719567 0.316332564  0.81993442 4.122535e-01
trestbps -0.0002329023 0.009090066 -0.02562163 9.795591e-01
chol     -0.0013237835 0.002697079 -0.49082122 6.235529e-01
fbs1      0.3188825831 0.401836034  0.79356393 4.274493e-01
thalach   0.0226246089 0.008199317  2.75932853 5.792027e-03
oldpeak  -0.3360326371 0.163547467 -2.05464899 3.991292e-02
target1   1.7234739863 0.376278770  4.58031152 4.642839e-06
no|mod    4.5785821473 1.927119568  2.37586822 1.750771e-02
mod|sev   6.5003986218 1.952726089  3.32888399 8.719471e-04
Using the threshold p-value of 0.05, we remove the non-significant variables: age, trestbps, and chol.
set.seed(1234)
model_logistic <- train(cp ~ . - age - trestbps - chol, data = train,
                        method = "polr", tuneGrid = expand.grid(method = "logistic"))
prob <- pnorm(abs(summary(model_logistic)$coefficients[, 3]), lower.tail = FALSE) * 2
cbind(summary(model_logistic)$coefficients, prob)

              Value  Std. Error    t value         prob
sex1     0.25427581 0.308143065  0.8251875 4.092651e-01
fbs1     0.37177505 0.384667871  0.9664832 3.338024e-01
thalach  0.02050951 0.007487511  2.7391620 6.159602e-03
oldpeak -0.33669473 0.161699555 -2.0822242 3.732199e-02
target1  1.71338020 0.369558584  4.6362885 3.547208e-06
no|mod   4.00836398 1.143111953  3.5065367 4.539789e-04
mod|sev  5.92987585 1.185074388  5.0038005 5.621092e-07
Notice that we do not remove the factors sex and fbs even though they are not significant, due to the significance of the intercepts.
To better understand these coefficients, let's restrict the model to only two predictors.
set.seed(1234)
model1 <- train(cp ~ target + thalach, data = train, method = "polr",
                tuneGrid = expand.grid(method = "logistic"))
summary(model1)

Coefficients:
          Value Std. Error t value
target1 1.87953   0.333153   5.642
thalach 0.02347   0.007372   3.184

Intercepts:
        Value  Std. Error t value
no|mod  4.6457 1.0799     4.3018
mod|sev 6.5325 1.1271     5.7959

Residual Deviance: 383.3144 
AIC: 391.3144
Plugging these coefficients into the above equations, we obtain the probability of each class as follows:
\[\begin{cases} p(no)=\frac{1}{1+e^{-(4.6457-1.87953X_{i1}-0.02347X_{i2})}} \\ p(mod)=\frac{1}{1+e^{-(6.5325-1.87953X_{i1}-0.02347X_{i2})}}-p(no) \\ p(sev)=1-p(mod)-p(no)\end{cases}\]
Let’s now predict a particular patient, say the third one.
train[3, c("cp", "thalach", "target")]

   cp thalach target
4 sev     178      1
We plug in the predictor values as follows:
\[\begin{cases} p(no)=\frac{1}{1+e^{-(4.6457-1.87953\times 1-0.02347\times 178)}} \\ p(mod)=\frac{1}{1+e^{-(6.5325-1.87953\times 1-0.02347\times 178)}}-p(no) \\ p(sev)=1-p(mod)-p(no)\end{cases}=\begin{cases} p(no)=0.1959992 \\ p(mod)=0.6166398-0.1959992=0.4206406 \\ p(sev)=1-0.4206406-0.1959992=0.3833602\end{cases}\]
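We can check this arithmetic in R with plogis(), the logistic distribution function that underlies the logit link:

# reproduce the hand computation for target = 1, thalach = 178
eta   <- 1.87953 * 1 + 0.02347 * 178   # linear predictor
p_no  <- plogis(4.6457 - eta)
p_mod <- plogis(6.5325 - eta) - p_no
p_sev <- 1 - p_no - p_mod
c(no = p_no, mod = p_mod, sev = p_sev)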
Using the highest probability, this patient will be predicted to have mod pain. Now let's compare these probabilities with those returned by the predict function.
predict(model1, train[1:3, ], type = "prob") %>%
  tail(1)

         no       mod      sev
4 0.1958709 0.4205981 0.383531
Now we go back to our original model and compute the accuracy rate for the training data.
predict(model_logistic, train) %>%
  bind_cols(train) %>%
  rename(pred = "...1", truth = cp) %>%
  accuracy(pred, truth)

# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.611
With the logistic regression model we get 61% accuracy on the training set, which is quite bad. So let's test the model on the testing set now.
predict(model_logistic, test) %>%
  bind_cols(test) %>%
  rename(pred = "...1", truth = cp) %>%
  accuracy(pred, truth)

# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.648
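Accuracy alone can hide where the model struggles. As a quick sketch, the conf_mat function from yardstick shows which pain levels get confused with each other:

predict(model_logistic, test) %>%
  bind_cols(test) %>%
  rename(pred = "...1", truth = cp) %>%
  conf_mat(truth, pred)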
Surprisingly, the accuracy rate for the testing set is about 65%, which is larger than that computed from the training data (61%). This is an indication of underfitting (the opposite of overfitting). Is there any way to improve the model performance? Maybe, by going back and tuning some hyperparameters. But since we have an underfitting problem, this model does not offer many hyperparameters apart from the link function, which is the logistic function by default; other choices exist, such as probit, loglog, etc.
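Before settling on one, a hedged sketch like the following compares the candidate links directly with MASS::polr via their AIC (lower is better); note that some links may fail to converge on a given dataset:

# compare link functions by AIC on the training set
links <- c("logistic", "probit", "loglog", "cloglog", "cauchit")
sapply(links, function(l)
  AIC(polr(cp ~ . - age - trestbps - chol, data = train, method = l)))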
For our case, let's try this model with the probit function.
Ordinal logistic regression (probit)
set.seed(1234)
model_probit <- train(cp ~ . - age - trestbps - chol, data = train,
                      method = "polr", tuneGrid = expand.grid(method = "probit"))
predict(model_probit, train) %>%
  bind_cols(train) %>%
  rename(pred = "...1", truth = cp) %>%
  accuracy(pred, truth)

# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.606
This rate is slightly worse than that of the previous model. But what about the testing set?
predict(model_probit, test) %>%
  bind_cols(test) %>%
  rename(pred = "...1", truth = cp) %>%
  accuracy(pred, truth)

# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.593
This one is also worse than the previous model, which means that the logistic link performs better on this data than the probit one.
When we try many things to improve the model performance and do not gain much, it is better to try different types of models.
CART model
This is a tree-based model. To train it we make use of the rpartScore package and, for simplicity, we include only the significant predictors from the previous model.
library(rpartScore)
set.seed(1234)
model_cart <- train(cp ~ . - age - trestbps - chol, data = train,
                    method = "rpartScore")
model_cart

CART or Ordinal Responses 

226 samples
  8 predictor
  3 classes: 'no', 'mod', 'sev' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 226, 226, 226, 226, 226, 226, ... 
Resampling results across tuning parameters:

  cp          split  prune  Accuracy   Kappa    
  0.02702703  abs    mr     0.5748197  0.2845545
  0.02702703  abs    mc     0.5796085  0.3011122
  0.02702703  quad   mr     0.5711605  0.2764466
  0.02702703  quad   mc     0.5805216  0.3020125
  0.04504505  abs    mr     0.5620975  0.2719646
  0.04504505  abs    mc     0.5966801  0.3274893
  0.04504505  quad   mr     0.5592845  0.2608402
  0.04504505  quad   mc     0.5930817  0.3208220
  0.21621622  abs    mr     0.5303342  0.1266324
  0.21621622  abs    mc     0.6004116  0.3343997
  0.21621622  quad   mr     0.5290009  0.1143360
  0.21621622  quad   mc     0.5928132  0.3225686

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were cp = 0.2162162, split = abs and prune = mc.
Caret uses bootstrapping for hyperparameter tuning. In our case, the largest accuracy rate is about 60.04%, obtained with the complexity parameter cp = 0.2162162, split = abs, and prune = mc.
The argument split controls the splitting function used to grow the tree by setting the misclassification costs in the generalized Gini impurity function to absolute (abs) or squared (quad) form. The argument prune selects the performance measure used to prune the tree: total misclassification rate (mr) or misclassification cost (mc).
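By default caret explores a small automatic grid over these parameters. If we wanted finer control, a sketch like the following passes an explicit tuning grid (the cp values here are illustrative, not recommendations):

grid <- expand.grid(cp = c(0.01, 0.05, 0.1, 0.2),
                    split = c("abs", "quad"),
                    prune = c("mr", "mc"))
set.seed(1234)
model_cart_grid <- train(cp ~ . - age - trestbps - chol, data = train,
                         method = "rpartScore", tuneGrid = grid)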
predict(model_cart, train) %>%
  bind_cols(train) %>%
  rename(pred = "...1", truth = cp) %>%
  accuracy(pred, truth)

# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.615
Surprisingly, we get approximately the same accuracy rate as the logit model. Let’s check the testing set.
predict(model_cart, test) %>%
  bind_cols(test) %>%
  rename(pred = "...1", truth = cp) %>%
  accuracy(pred, truth)

# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.630
With this model, however, we get a lower accuracy rate on the testing set than that of the logistic model.
Ordinal random forest model
This model is a corrected version of random forest model that takes into account the ordinal nature of the response variable. For more detail about this model read this great paper.
To train ordinal random forest model, we need to call the following packages: e1071, ranger, ordinalForest.
library(ordinalForest)
library(ranger)
library(e1071)
Since the caret function train uses bootstrapping to tune the hyperparameters, the training process is very slow. That is why I saved the resulting model and load it again here.
# set.seed(1234)
# model_forest <- train(cp ~ . - age - trestbps - chol, data = train,
#                       method = "ordinalRF")
# saveRDS(model_forest, "C://Users/dell/Documents/new-blog/content/post/ordinal/model_forest.rds")
model_forest <- readRDS("C://Users/dell/Documents/new-blog/content/post/ordinal/model_forest.rds")
model_forest

Random Forest 

226 samples
  8 predictor
  3 classes: 'no', 'mod', 'sev' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 226, 226, 226, 226, 226, 226, ... 
Resampling results across tuning parameters:

  nsets  ntreeperdiv  ntreefinal  Accuracy   Kappa    
   50     50          200         0.5878619  0.3173073
   50     50          400         0.5862998  0.3174139
   50     50          600         0.5793897  0.3052328
   50    100          200         0.5856573  0.3157938
   50    100          400         0.5865246  0.3155563
   50    100          600         0.5908431  0.3248941
   50    150          200         0.5929076  0.3277110
   50    150          400         0.5822072  0.3092856
   50    150          600         0.5878854  0.3209507
  100     50          200         0.5903104  0.3277174
  100     50          400         0.5817918  0.3127438
  100     50          600         0.5863685  0.3197145
  100    100          200         0.5841147  0.3136250
  100    100          400         0.5830929  0.3115348
  100    100          600         0.5875075  0.3228354
  100    150          200         0.5837402  0.3167977
  100    150          400         0.5841116  0.3187849
  100    150          600         0.5888038  0.3250622
  150     50          200         0.5872276  0.3193143
  150     50          400         0.5873943  0.3198634
  150     50          600         0.5874236  0.3224713
  150    100          200         0.5870714  0.3218660
  150    100          400         0.5805275  0.3107488
  150    100          600         0.5816353  0.3135750
  150    150          200         0.5884693  0.3247512
  150    150          400         0.5830154  0.3154980
  150    150          600         0.5885156  0.3253794

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nsets = 50, ntreeperdiv = 150 and ntreefinal = 200.
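If we ever need to retrain, one way to cut the computation cost is to replace the default 25 bootstrap resamples with a lighter resampling scheme, for example 5-fold cross-validation (a sketch; the fold count is a judgment call):

# lighter resampling for the same model
ctrl <- trainControl(method = "cv", number = 5)
set.seed(1234)
model_forest_cv <- train(cp ~ . - age - trestbps - chol, data = train,
                         method = "ordinalRF", trControl = ctrl)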
We can plot the variable importance of the predictors as follows.
plot(varImp(model_forest))
Now we can obtain the accuracy rate for the training data as follows.
predict(model_forest, train) %>%
  bind_cols(train) %>%
  rename(pred = "...1", truth = cp) %>%
  accuracy(pred, truth)

# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.845
Great! With this model, the accuracy rate has improved considerably, to roughly 84%. But wait: what matters is the accuracy on the testing set.
predict(model_forest, test) %>%
  bind_cols(test) %>%
  rename(pred = "...1", truth = cp) %>%
  accuracy(pred, truth)

# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.519
This is exactly what is called overfitting: the model generalizes poorly to new, unseen data. We could go back and tune other hyperparameters, such as increasing the minimum node size (default is 5), to fight the overfitting; we do not do that here, however, since it is not the purpose of this tutorial.
Continuation Ratio Model
This model uses the vector generalized additive models available in the VGAM package. For more detail about these models click here.
library(VGAM)
set.seed(1234)
model_vgam <- train(cp ~ . - age - trestbps - chol, data = train,
                    method = "vglmContRatio", trace = FALSE)
model_vgam

Continuation Ratio Model for Ordinal Data 

226 samples
  8 predictor
  3 classes: 'no', 'mod', 'sev' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 226, 226, 226, 226, 226, 226, ... 
Resampling results across tuning parameters:

  parallel  link     Accuracy   Kappa    
  FALSE     logit    0.5962581  0.3323075
  FALSE     probit   0.5942637  0.3302998
  FALSE     cloglog  0.5973844  0.3293056
  FALSE     cauchit  0.5967368  0.3316896
  FALSE     logc     0.5945121  0.3152759
   TRUE     logit    0.5758330  0.2961673
   TRUE     probit   0.5738297  0.2924747
   TRUE     cloglog  0.5838764  0.3014038
   TRUE     cauchit  0.5810184  0.3067004
   TRUE     logc     0.5302522  0.1031624

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were parallel = FALSE and link = cloglog.
The best model is obtained when the argument parallel is FALSE and the link is cloglog, the complementary log-log function.
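For readers who prefer working with VGAM directly, caret's "vglmContRatio" method wraps, as far as I understand it, the cratio family; a minimal sketch of the equivalent direct fit (using the family's default link rather than the tuned one) would be:

# direct continuation ratio fit with VGAM; cratio() models the
# conditional probabilities of moving past each cutpoint
fit_cr <- vglm(cp ~ . - age - trestbps - chol, data = train,
               family = cratio(parallel = FALSE))
summary(fit_cr)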
The accuracy rate of the training data is:
predict(model_vgam, train) %>%
  bind_cols(train) %>%
  rename(pred = "...1", truth = cp) %>%
  accuracy(pred, truth)

# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.659
And the accuracy of the testing set is:
predict(model_vgam, test) %>%
  bind_cols(test) %>%
  rename(pred = "...1", truth = cp) %>%
  accuracy(pred, truth)

# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.630
This ties the CART model on the testing set and remains below the 65% of the logistic model, even though its training accuracy was the best among the non-overfitting models.
Compare models
We can compare the above models using the resamples function from caret.
models_eval <- resamples(list(logit = model_logistic, cart = model_cart,
                              forest = model_forest, vgam = model_vgam))
summary(models_eval)

Call:
summary.resamples(object = models_eval)

Models: logit, cart, forest, vgam 
Number of resamples: 25 

Accuracy 
            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
logit  0.5060241 0.5731707 0.5822785 0.5871083 0.6097561 0.6627907    0
cart   0.3734940 0.5824176 0.6097561 0.6004116 0.6279070 0.6746988    0
forest 0.5357143 0.5569620 0.5853659 0.5929076 0.6117647 0.6860465    0
vgam   0.4936709 0.5760870 0.6046512 0.5973844 0.6202532 0.6626506    0

Kappa 
               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
logit   0.189086980 0.2792369 0.3144822 0.3100458 0.3437500 0.4512651    0
cart   -0.004889406 0.3185420 0.3474144 0.3343997 0.3775576 0.4526136    0
forest  0.233434988 0.2852665 0.3289346 0.3277110 0.3554862 0.4618772    0
vgam    0.144558744 0.2993406 0.3367647 0.3293056 0.3690791 0.4142980    0
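The same resampling distributions can also be inspected visually with caret's lattice helpers:

# box-and-whisker plot of the bootstrap accuracies per model
bwplot(models_eval, metric = "Accuracy")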
Based on the training resamples, and using the mean of the accuracy rates, we can say that the CART model is the best model for this data, with about 60.04% accuracy. However, things are different when it comes to the testing set.
tibble(models = c("logit", "cart", "forest", "vgam"),
       accuracy = c(
         predict(model_logistic, test) %>% bind_cols(test) %>%
           rename(pred = "...1", truth = cp) %>% accuracy(pred, truth) %>% .[, ".estimate"],
         predict(model_cart, test) %>% bind_cols(test) %>%
           rename(pred = "...1", truth = cp) %>% accuracy(pred, truth) %>% .[, ".estimate"],
         predict(model_forest, test) %>% bind_cols(test) %>%
           rename(pred = "...1", truth = cp) %>% accuracy(pred, truth) %>% .[, ".estimate"],
         predict(model_vgam, test) %>% bind_cols(test) %>%
           rename(pred = "...1", truth = cp) %>% accuracy(pred, truth) %>% .[, ".estimate"])) %>%
  unnest()

# A tibble: 4 x 2
  models accuracy
  <chr>     <dbl>
1 logit     0.648
2 cart      0.630
3 forest    0.519
4 vgam      0.630
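As an aside, the repeated pipeline above invites a small helper function; here is a sketch (test_accuracy is a name introduced here, not from the original code):

# compute test-set accuracy for any of the fitted models
test_accuracy <- function(model) {
  predict(model, test) %>%
    bind_cols(test) %>%
    rename(pred = "...1", truth = cp) %>%
    accuracy(truth, pred) %>%
    pull(.estimate)
}
sapply(list(logit = model_logistic, cart = model_cart,
            forest = model_forest, vgam = model_vgam), test_accuracy)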
Using the testing set, the logistic model with the logit link is the best model for predicting this data.
Conclusion
We have seen how to model ordinal data by exploring several models, and it turned out that the logistic model is the best one for our data. In general, however, the best model depends strongly on the data at hand.
Session information
sessionInfo()

R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] splines   stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] VGAM_1.1-3          e1071_1.7-3         ranger_0.12.1      
 [4] ordinalForest_2.3-1 rpartScore_1.0-1    rpart_4.1-15       
 [7] MASS_7.3-51.6       yardstick_0.0.6     workflows_0.1.1    
[10] tune_0.1.0          rsample_0.0.7       recipes_0.1.12     
[13] parsnip_0.1.1       infer_0.5.2         dials_0.0.7        
[16] scales_1.1.1        broom_0.5.6         tidymodels_0.1.0   
[19] caret_6.0-86        lattice_0.20-41     forcats_0.5.0      
[22] stringr_1.4.0       dplyr_1.0.0         purrr_0.3.4        
[25] readr_1.3.1         tidyr_1.1.0         tibble_3.0.1       
[28] ggplot2_3.3.1       tidyverse_1.3.0    

loaded via a namespace (and not attached):
  [1] readxl_1.3.1         backports_1.1.7      tidytext_0.2.4      
  [4] plyr_1.8.6           igraph_1.2.5         repr_1.1.0          
  [7] crosstalk_1.1.0.1    listenv_0.8.0        SnowballC_0.7.0     
 [10] rstantools_2.0.0     inline_0.3.15        digest_0.6.25       
 [13] foreach_1.5.0        htmltools_0.4.0      rsconnect_0.8.16    
 [16] fansi_0.4.1          magrittr_1.5         globals_0.12.5      
 [19] modelr_0.1.8         gower_0.2.1          matrixStats_0.56.0  
 [22] RcppParallel_5.0.1   xts_0.12-0           prettyunits_1.1.1   
 [25] colorspace_1.4-1     skimr_2.1.1          blob_1.2.1          
 [28] rvest_0.3.5          haven_2.3.1          xfun_0.14           
 [31] callr_3.4.3          crayon_1.3.4         jsonlite_1.6.1      
 [34] lme4_1.1-23          survival_3.1-12      zoo_1.8-8           
 [37] iterators_1.0.12     glue_1.4.1           gtable_0.3.0        
 [40] ipred_0.9-9          pkgbuild_1.0.8       rstan_2.19.3        
 [43] DBI_1.1.0            miniUI_0.1.1.1       Rcpp_1.0.4.6        
 [46] xtable_1.8-4         GPfit_1.0-8          StanHeaders_2.21.0-5
 [49] lava_1.6.7           prodlim_2019.11.13   DT_0.13             
 [52] htmlwidgets_1.5.1    httr_1.4.1           threejs_0.3.3       
 [55] ellipsis_0.3.1       loo_2.2.0            pkgconfig_2.0.3     
 [58] nnet_7.3-14          dbplyr_1.4.4         utf8_1.1.4          
 [61] tidyselect_1.1.0     rlang_0.4.6          DiceDesign_1.8-1    
 [64] reshape2_1.4.4       later_1.1.0.1        munsell_0.5.0       
 [67] cellranger_1.1.0     tools_4.0.1          cli_2.0.2           
 [70] generics_0.0.2       ggridges_0.5.2       evaluate_0.14       
 [73] fastmap_1.0.1        yaml_2.2.1           ModelMetrics_1.2.2.2
 [76] processx_3.4.2       knitr_1.28           fs_1.4.1            
 [79] future_1.17.0        nlme_3.1-148         mime_0.9            
 [82] rstanarm_2.19.3      xml2_1.3.2           tokenizers_0.2.1    
 [85] compiler_4.0.1       bayesplot_1.7.2      shinythemes_1.1.2   
 [88] rstudioapi_0.11      reprex_0.3.0         tidyposterior_0.0.3 
 [91] lhs_1.0.2            statmod_1.4.34       stringi_1.4.6       
 [94] highr_0.8            ps_1.3.3             blogdown_0.19       
 [97] Matrix_1.2-18        nloptr_1.2.2.1       markdown_1.1        
[100] shinyjs_1.1          vctrs_0.3.1          pillar_1.4.4        
[103] lifecycle_0.2.0      furrr_0.1.0          data.table_1.12.8   
[106] httpuv_1.5.4         R6_2.4.1             bookdown_0.19       
[109] promises_1.1.1       gridExtra_2.3        janeaustenr_0.1.5   
[112] codetools_0.2-16     boot_1.3-25          colourpicker_1.0    
[115] gtools_3.8.2         assertthat_0.2.1     withr_2.2.0         
[118] shinystan_2.5.0      parallel_4.0.1       hms_0.5.3           
[121] grid_4.0.1           timeDate_3043.102    minqa_1.2.4         
[124] class_7.3-17         rmarkdown_2.2        pROC_1.16.2         
[127] tidypredict_0.4.5    shiny_1.4.0.2        lubridate_1.7.9     
[130] base64enc_0.1-3      dygraphs_1.1.1.6    