Find the best predictive model using R/caret package/modelgrid
It's tough to make predictions, especially about the future (Yogi Berra), but I think the way to get there shouldn't be.
I have built a new Shiny application, BMuCaret, to fit and evaluate multiple classifiers and to select the one that achieves the best performance for a given data set. The area under the ROC curve (AUC), rather than the curve itself, is used as the key metric of model performance.
With this new application (BMuCaret: Best Model using Caret) I evaluate five classifiers from five model families (discriminant analysis, neural networks, boosting, generalized linear models, random forests) and identify the classifier with the highest AUC. On the training data the Random Forest classifier (method: rf) tends to perform best, but on the validation data the eXtreme Gradient Boosting classifier (method: xgbDART) performs better, with respect to both AUC and model accuracy.
Concept
In order to perform a fair comparison of the candidate classifiers:
- I will use the same training/test split.
- The learning problem (as an example) is a binary classification task: predicting customer churn. The data set can be found here.
- Data pre-processing is performed using the recipe method (recipes package).
- A common setting (trainControl) is created and shared by all considered classifiers; this allows us to fit the models under the same conditions.
- For tuning the hyperparameters, the adaptive_cv method is used, which also reduces the tuning time (see the sketch below).
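To make the shared setting concrete, here is a minimal sketch of such a trainControl object for the two-class setup described above; ctrl is an illustrative name, and the full version (including the shared predictors, response, and metric) appears in Module 2 below.

# Minimal sketch of the shared resampling settings (full version in Module 2)
library(caret)
ctrl <- trainControl(method = "adaptive_cv",    # adaptive resampling drops poor hyperparameter candidates early
                     number = 10, repeats = 5,  # 10-fold cross-validation, repeated 5 times
                     adaptive = list(min = 3, alpha = 0.05, method = "BT", complete = FALSE),
                     search = "random",                   # random search over the tuning grid
                     summaryFunction = twoClassSummary,   # reports ROC, sensitivity and specificity
                     classProbs = TRUE)                   # class probabilities are required for the ROC metric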
I use the following candidate classifiers:
- eXtreme Gradient Boosting (method: xgbDART).
- glmnet (method: glmnet).
- Linear discriminant analysis (method: lda).
- Neural Network (method: nnet).
- Random Forest (method : rf).
This list can be expanded with further classifiers by using the add_model function from the modelgrid package, as shown below.
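For example, a gradient boosting machine could be added to the grid like this (a sketch; the same call appears, commented out, in Module 2 below):

# Example: extend the grid with an additional classifier
mg_final <- mg_final %>%
  add_model(model_name = "gbm",
            method = "gbm",
            custom_control = list(verbose = FALSE))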
The caret and modelgrid packages are used to train and evaluate the candidate models (for a very accessible introduction to caret and modelgrid, please have a look here and here).
The DataExplorer package is used to explore the data.
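For example, two of the DataExplorer calls used later in the app can also be run directly on the raw data (a minimal sketch, assuming the data_set object created in Module 1 below):

library(DataExplorer)
plot_missing(data_set)      # share of missing values per variable
plot_correlation(data_set)  # correlation heatmap of the variables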
Here is the code
Data set preprocessing (Module 1)
This module returns a clean data set and partitions it into a training set and a validation set.
library(shiny)
library(shinythemes)
library(modelgrid)
library(magrittr)
library(caret)
library(foreach)
library(recipes)
library(plyr)
library(purrr)
library(dplyr)
library(pROC)
library(BradleyTerry2)
library(corrplot)
library(DataExplorer)
library(randomForest)
library(glmnet)
library(xgboost)
library(e1071)

#######################################
# load data
set.seed(9650)
Telco_custumer_2 <- read.csv("Telco_customer_2.csv")
data_set <- Telco_custumer_2
data_set <- data_set %>% dplyr::select(-"customerID")
data_set$Churn <- as.factor(data_set$Churn)

######################################
# Pre-Processing the data Telco_Customer_Churn using recipe
set.seed(9650)
Attri_rec_2 <- recipe(Churn ~ ., data = data_set) %>%  # Formula.
  step_dummy(all_predictors()) %>%      # convert nominal data into one or more numeric.
  step_knnimpute(all_predictors()) %>%  # impute missing data using nearest neighbors.
  step_zv(all_predictors()) %>%         # remove variables that are highly sparse and unbalanced.
  step_corr(all_predictors()) %>%       # remove variables that have large absolute correlations with other variables.
  step_center(all_predictors()) %>%     # normalize numeric data to have a mean of zero.
  step_scale(all_predictors()) %>%      # normalize numeric data to have a standard deviation of one.
  prep(training = data_set, retain = TRUE)  # train the data recipe
## Warning: The following variables are not factor vectors and will be
## ignored: `SeniorCitizen`, `tenure`, `MonthlyCharges`, `TotalCharges`

data_rec_2 <- as.data.frame(juice(Attri_rec_2))
# add the response variable
data_rec_2$Churn <- data_set$Churn
# str(data_rec_2)

# Create a Validation Dataset (training data 70% and validation data 30% of the original data)
set.seed(9650)
validation_index <- createDataPartition(data_rec_2$Churn, times = 1, p = 0.70, list = FALSE)
# select 30% of the data for validation
data_validation <- data_rec_2[-validation_index, ]
# use the remaining 70% of data to train the models
data_train <- data_rec_2[validation_index, ]

# For training the models
x_train <- data_train %>% select(-Churn)  # Predictors
y_train <- data_train[["Churn"]]          # Response

# for validation/test
x_validation <- data_validation %>% select(-Churn)
y_validation <- data_validation[["Churn"]]
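A quick sanity check on the partition (not part of the original module, just an illustration) confirms the 70/30 split and the class balance:

# Sizes and class balance of the two partitions
dim(x_train)
dim(x_validation)
prop.table(table(y_train))
prop.table(table(y_validation))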
Construct model grid and define shared settings (Module 2)
set.seed(9650)
mg <- model_grid() %>%
  share_settings(
    y = y_train,
    x = x_train,
    metric = "ROC",
    trControl = trainControl(
      method = "adaptive_cv",
      number = 10,
      repeats = 5,
      adaptive = list(min = 3, alpha = 0.05, method = "BT", complete = FALSE),
      search = "random",
      summaryFunction = twoClassSummary,
      classProbs = TRUE))

purrr::map_chr(mg$shared_settings, class)
##            y            x       metric    trControl
##     "factor" "data.frame"  "character"       "list"

# add models to train
mg_final <- mg %>%
  add_model(model_name = "LDA",
            method = "lda",
            custom_control = list(method = "repeatedcv", number = 10, repeats = 5)) %>%
  add_model(model_name = "eXtreme Gradient Boosting", method = "xgbDART") %>%
  add_model(model_name = "Neural Network", method = "nnet") %>%
  add_model(model_name = "glmnet", method = "glmnet") %>%
  add_model(model_name = "Random Forest", method = "rf")
# %>% add_model(model_name = "gbm",
#               method = "gbm",
#               custom_control = list(verbose = FALSE))

list_model <- c(names(mg_final$models))
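Outside the Shiny app, the same grid can be trained and compared in a few lines; the app below simply wraps these steps in reactive expressions (a minimal sketch; mg_final_trained is an illustrative name):

# Train every model in the grid with the shared settings
set.seed(9650)
mg_final_trained <- train(mg_final)

# Compare resampling performance across the fitted models
mg_final_trained$model_fits %>%
  caret::resamples() %>%
  lattice::dotplot()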
The code for the BMuCaret Shiny application (Module 3)
################################# Define UI #########################
ui <- fluidPage(theme = shinytheme("slate"),
  # Application title
  titlePanel(wellPanel("Find the best predictive model using R/caret package/modelgrid",
                       br(), "BMuCaret")),
  navbarPage("Workflow ===>",
    tabPanel("Exploratory Data Analysis",
      tabsetPanel(type = "tabs",
        tabPanel("Explore the data object",
          navlistPanel(
            tabPanel("1- Structure of the data object",
                     verbatimTextOutput("str_ucture")),
            tabPanel("2- Missing values ?",
                     plotOutput("missing_value")),
            tabPanel("3- Correlation analysis",
                     plotOutput("coor_plot", height = 700)))),
        tabPanel("After Data Cleaning (using recipe method)",
          navlistPanel(
            tabPanel("1- Structure of the data object after Processing",
                     verbatimTextOutput("str_ucture_after")),
            tabPanel("2- Missing values ?",
                     plotOutput("missing_value_after")),
            tabPanel("3- Correlation analysis after",
                     plotOutput("coor_plot_after", height = 600)))))),
    tabPanel("Fitting & Validation & Statistics",
      sidebarLayout(
        sidebarPanel(
          wellPanel(selectInput(inputId = "abd",
                                label = "Model choose : ",
                                choices = c("none", list_model),
                                selected = "none"))),
        mainPanel(
          tabsetPanel(type = "tabs",
            tabPanel("Model Training & Summary",
              navlistPanel(
                tabPanel("1- Show info model grid ",
                         verbatimTextOutput("info_modelgrid")),
                tabPanel("2- Performance statistics of the model grid (dotplot) ",
                         plotOutput("dotplot", width = 600, height = 600)),
                tabPanel("3- Extract Performance of the model grid ",
                         verbatimTextOutput(outputId = "summary")),
                tabPanel("4- Show the AUC & Accuracy of individual models (on data training)",
                         verbatimTextOutput("Accurac_AUC"),
                         htmlOutput("best_model_train")),
                tabPanel("5- Show Statistics of individual model",
                         verbatimTextOutput(outputId = "Indiv_Analysis"),
                         "Examine the relationship between the estimates of performance and the tuning parameters",
                         br(),
                         plotOutput(outputId = "tuning_parameter")),
                tabPanel("6- Show single model's Accuracy (on data training)",
                         verbatimTextOutput(outputId = "accuracy")))),
            # output model validation and statistics
            tabPanel("Model Validation & Statistics",
              navlistPanel(
                tabPanel("1- Show the AUC & Accuracy of individual models (on validation data)",
                         verbatimTextOutput(outputId = "AUC_of_individ"),
                         htmlOutput("best_model_vali")),
                tabPanel("2- Show single model's Accuracy/data validation",
                         verbatimTextOutput("accuracy_vali")),
                tabPanel("3- Plot variable importance",
                         plotOutput("variable_imp")))))))),
    tabPanel("Next: Model Interpretation")),
  br(), br(), br(),
  h5("BMuCaret App. built with",
     img(src = "https://www.rstudio.com/wp-content/uploads/2014/04/shiny-200x232.png", height = "30px"),
     "by", br(),
     "Author: Abderrahim Lyoubi-Idrissi", br(),
     "Contact: [email protected]"))

################################# Define server logic #########################
server <- function(input, output) {
  set.seed(9650)
  # before processing the data
  # data structure
  output$str_ucture <- renderPrint({
    str(data_set)
  })
  # Missing values ?
  output$missing_value <- renderPlot({
    plot_missing(data_set)
  })
  # plot to check the results of step_corr(all_predictors())
  output$coor_plot <- renderPlot({
    plot_correlation(data_set)
  })

  ## after processing the data
  # data structure
  output$str_ucture_after <- renderPrint({
    str(data_rec_2)
  })
  # plot to check the missing values
  output$missing_value_after <- renderPlot({
    plot_missing(data_rec_2)
  })
  # plot to check the correlation
  output$coor_plot_after <- renderPlot({
    plot_correlation(data_rec_2)
  })

  set.seed(9650)
  # Infos about the componets of the constructed model grid
  output$info_modelgrid <- renderPrint({
    mg_final$models
  })

  # train all models
  set.seed(9650)
  mg_final_tra <- reactive({ train(mg_final) })

  ## plot to compare
  output$dotplot <- renderPlot({
    mg_final_tra()$model_fits %>%
      caret::resamples(.) %>%
      lattice::dotplot(.)
  })

  # Show the overall summary
  set.seed(9650)
  output$summary <- renderPrint({
    mg_final_tra()$model_fits %>%
      caret::resamples(.) %>%
      summary(.)
  })

  # computing the auc & the accuracy (on train data)
  set.seed(9650)
  output$Accurac_AUC <- renderPrint({
    AUC1 <- mg_final_tra()$model_fits %>%
      predict(., newdata = x_train, type = "prob") %>%
      map(~roc(y_train, .x[, 2]))
    auc_value_train <- map_dbl(AUC1, ~(.x)$auc)
    auc_value_df_train_df <- as.data.frame(auc_value_train)
    accuarcy_all_train <- predict(mg_final_tra()$model_fits, newdata = x_train) %>%
      map(~confusionMatrix(.x, y_train)) %>%
      map_dfr(~ cbind(as.data.frame(t(.x$overall)), as.data.frame(t(.x$byClass))), .id = "Modelname") %>%
      select(Modelname, Accuracy, Kappa)
    accuracy_auc_train <- bind_cols(accuarcy_all_train, auc_value_df_train_df)
    print(accuracy_auc_train)
  })

  # computing the auc and the Accuracy
  set.seed(9650)
  output$best_model_train <- renderUI({
    AUC1 <- mg_final_tra()$model_fits %>%
      predict(., newdata = x_train, type = "prob") %>%
      map(~roc(y_train, .x[, 2]))
    auc_value_train <- map_dbl(AUC1, ~(.x)$auc)
    auc_value_df_train_df <- as.data.frame(auc_value_train)
    accuarcy_all_train <- predict(mg_final_tra()$model_fits, newdata = x_train) %>%
      map(~confusionMatrix(.x, y_train)) %>%
      map_dfr(~ cbind(as.data.frame(t(.x$overall)), as.data.frame(t(.x$byClass))), .id = "Modelname") %>%
      select(Modelname, Accuracy, Kappa)
    accuracy_auc_train <- bind_cols(accuarcy_all_train, auc_value_df_train_df)
    max_auc_train <- filter(accuracy_auc_train,
                            auc_value_train == max(accuracy_auc_train$auc_value_train)) %>%
      select(Modelname)
    max_Accuracy_train <- filter(accuracy_auc_train,
                                 Accuracy == max(accuracy_auc_train$Accuracy)) %>%
      select(Modelname)
    HTML(paste("Results", br(),
               "1- The model", strong(max_auc_train), "has the highest AUC value.", br(),
               "2- The model", strong(max_Accuracy_train), "has the highest Accuracy"))
  })

  # Show the summary of individual model
  output$Indiv_Analysis <- renderPrint({
    if (input$abd == "none") { print("Please select a model") }
    mg_final_tra()$model_fits[[input$abd]]
  })

  # Plot the individual model
  output$tuning_parameter <- renderPlot({
    ggplot(mg_final_tra()$model_fits[[input$abd]])
  })

  # ConfusionMatrix on training data
  output$accuracy <- renderPrint({
    if (input$abd == "none") {
      print("Please select a model")
    } else {
      confusionMatrix(predict(mg_final_tra()$model_fits[[input$abd]], x_train), y_train)
    }
  })

  ################################### Validation #########################
  # Extract the auc values
  output$AUC_of_individ <- renderPrint({
    AUC2 <- mg_final_tra()$model_fits %>%
      predict(., newdata = x_validation, type = "prob") %>%
      map(~roc(y_validation, .x[, 2]))
    auc_value_vali <- map_dbl(AUC2, ~(.x)$auc)
    auc_value_df_vali_df <- as.data.frame(auc_value_vali)
    cf_vali <- predict(mg_final_tra()$model_fits, newdata = x_validation) %>%
      map(~confusionMatrix(.x, y_validation))
    cf_vali %>%
      map_dfr(~ cbind(as.data.frame(t(.x$overall)), as.data.frame(t(.x$byClass))), .id = "Modelname") %>%
      select(Modelname, Accuracy, Kappa)
    accuarcy_all_vali <- predict(mg_final_tra()$model_fits, newdata = x_validation) %>%
      map(~confusionMatrix(.x, y_validation)) %>%
      map_dfr(~ cbind(as.data.frame(t(.x$overall)), as.data.frame(t(.x$byClass))), .id = "Modelname") %>%
      select(Modelname, Accuracy, Kappa)
    accuracy_auc_vali <- bind_cols(accuarcy_all_vali, auc_value_df_vali_df)
    print(accuracy_auc_vali)
  })

  output$best_model_vali <- renderUI({
    AUC2 <- mg_final_tra()$model_fits %>%
      predict(., newdata = x_validation, type = "prob") %>%
      map(~roc(y_validation, .x[, 2]))
    auc_value_vali <- map_dbl(AUC2, ~(.x)$auc)
    auc_value_df_vali_df <- as.data.frame(auc_value_vali)
    cf_vali <- predict(mg_final_tra()$model_fits, newdata = x_validation) %>%
      map(~confusionMatrix(.x, y_validation))
    cf_vali %>%
      map_dfr(~ cbind(as.data.frame(t(.x$overall)), as.data.frame(t(.x$byClass))), .id = "Modelname") %>%
      select(Modelname, Accuracy, Kappa)
    accuarcy_all_vali <- predict(mg_final_tra()$model_fits, newdata = x_validation) %>%
      map(~confusionMatrix(.x, y_validation)) %>%
      map_dfr(~ cbind(as.data.frame(t(.x$overall)), as.data.frame(t(.x$byClass))), .id = "Modelname") %>%
      select(Modelname, Accuracy, Kappa)
    accuracy_auc_vali <- bind_cols(accuarcy_all_vali, auc_value_df_vali_df)
    max_auc_vali <- filter(accuracy_auc_vali,
                           auc_value_vali == max(accuracy_auc_vali$auc_value_vali)) %>%
      select(Modelname)
    max_Accuracy_vali <- filter(accuracy_auc_vali,
                                Accuracy == max(accuracy_auc_vali$Accuracy)) %>%
      select(Modelname)
    HTML(paste("Results", br(),
               "1- The model", strong(max_auc_vali), "has the highest AUC value.", br(),
               "2- The model", strong(max_Accuracy_vali), "has the highest Accuracy."))
  })

  ## confusion matrix on data validation
  output$accuracy_vali <- renderPrint({
    if (input$abd == "none") {
      print("Please select a model")
    } else {
      confusionMatrix(predict(mg_final_tra()$model_fits[[input$abd]], x_validation), y_validation)
    }
  })

  ## Variable importance
  output$variable_imp <- renderPlot({
    var_imp_gr <- varImp(mg_final_tra()$model_fits[[input$abd]])
    ggplot(var_imp_gr)
  })
}
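If it is not already present at the end of the script, the application is launched with the standard shinyApp() call:

# Run the BMuCaret application
shinyApp(ui = ui, server = server)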
The BMuCaret Shiny Application
Conclusion
- The BMuCaret Shiny application should help reduce the effort needed to compare caret models for a classification problem.
- To use this application with a new data set, Modules 1 and 2 need to be customised to the new classification problem (see the sketch below).
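In practice, the changes are mostly confined to the data import and the recipe formula in Module 1. A hypothetical sketch, where the file name and response variable are placeholders:

# Module 1: point to the new data set and response variable (placeholders)
data_set <- read.csv("my_new_data.csv")
data_set$Outcome <- as.factor(data_set$Outcome)

# adjust the recipe formula to the new response
Attri_rec_2 <- recipe(Outcome ~ ., data = data_set) %>%
  step_dummy(all_predictors()) %>%
  prep(training = data_set, retain = TRUE)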
Outlook
- Integrating Module 1 and Module 2 into Module 3.
- Updating the BMuCaret application so that it can also be used for regression problems.
- Implementing model interpretation (maybe with the lime package for R, or DALEX, …).
- Developing a new Shiny application to perform model comparison by using other packages.
- ….
Thank you in advance for your feedback.