Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Hello everyone,
In the last post we have decided to continue our study with the logistic regression. We have obtained the following ROC curve with an area under the curve (AUC) of 0.843
.
Now we imagine 2 scenarios. In the first one, we suppose we have a large budget and we want to target many customers. In the second one, we have a restricted budget. Hence, we want to contact just 25% of the customers in our database.
For the first scenario, we will focus on the threshold value. We want to optimize the number of good classifications.
Then, in the second scenario, we will focus on the probability that each customer has to churn. We will sort our database by their probability to churn, we also call this probability, the score. Then we will select the top 25% customers of our database based on their probability to churn by using a lift curve.
Scenario 1: Threshold Optimization
Each customer has a score that corresponds to his probability to churn. Let’s see the score of the five first customers of our database.
head(pred_logistic,5) ## 1 2 3 4 5 ## 0.62513394 0.32512336 0.52812027 0.31273422 0.01264661
Originally, the threshold is 0.5
. We predict a positive answer if the score is higher than 0.5
and a negative answer if the score is lower than 0.5
.
It means for the 5 customers above that the model predicts a positive answer for customer 1 and customer 3 and a negative answer for the 3 other customers.
To optimize the threshold, we want to compute different statistical indicators (accuracy, precision, sensitivity, f1 and kappa) for different threshold values from 0.05
to 0.847
with a step value of 0.001
. We don’t go above 0.847
because after this value, we just have negative answers.
if (!require("tidyverse")) install.packages("tidyverse") library("tidyverse") if (!require("caret")) install.packages("caret") library("caret") if (!require("e1071")) install.packages("e1071") library("e1071") comp = cbind.data.frame(answer = db_test$ChurnNum, pred=pred_logistic) %>% arrange(desc(pred)) indic_perf = function(x){ compare = comp %>% mutate(pred = ifelse(pred>x,1,0)) if(ncol(table(compare))>1){ mat = confusionMatrix(table(compare), positive = "1") #acuracy acc = mat$overall["Accuracy"] #Kappa kap = mat$overall["Kappa"] #sensitivity sen = mat$byClass["Sensitivity"] #F1 f1_stat = mat$byClass["F1"] #Precision prec = mat$byClass["Precision"] }else{ acc = NA prec = NA sen = NA kap = NA f1_stat = NA } return(data.frame(threshold = x, accuracy = acc, precision = prec, sensitivity = sen, kappa = kap, f1= f1_stat)) } indics = do.call(rbind, lapply(seq(0.05,0.95, by=0.001), indic_perf)) %>% filter(!is.na(accuracy)) if (!require("plotly")) install.packages("plotly") library("plotly") if (!require("IRdisplay")) install.packages("IRdisplay") library("IRdisplay") gather_indics = tidyr::gather(indics, variable, value, -threshold) %>% group_by(variable) %>% mutate(color = (max(value) == value), threshold = as.numeric(threshold) ) q=ggplot(gather_indics , aes(x= threshold, y=value)) + ggtitle("Indicator values by thresholds")+ geom_point(aes(color = color), size=0.5) + facet_wrap(~variable, scales = 'free_x') + scale_color_manual(values = c(NA, "tomato")) + labs(x="thresholds", y=" ") + geom_line(color="navy") + theme_bw()+ theme( legend.position="none") offline(ggplotly(q), width = '100%')
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.