Data science with Microsoft Fabric – Plotting ROC curve and distribution of scores
The ROC (Receiver Operating Characteristic) curve is a graph that shows how a classifier performs by plotting the true positive rate against the false positive rate. It is used to evaluate the performance of binary classification models by illustrating the trade-off between the true positive rate (TPR) and the false positive rate (FPR) at various threshold settings.
Key concepts
True Positive Rate (TPR): Also known as Sensitivity or Recall, it measures the proportion of actual positives correctly identified by the model.
False Positive Rate (FPR): It measures the proportion of actual negatives that are incorrectly identified as positives.
Area Under the Curve (AUC): This value summarizes the performance of the model. A value of 1 indicates a perfect model, while a value of 0.5 suggests a model with no discrimination ability.
ROC (Receiver Operating Characteristic) Curve: A plot that illustrates the performance of a binary classifier at varying threshold values.
How to read ROC
The ROC curve is the plot of the true positive rate against the false positive rate at each threshold setting. Each threshold value produces its own confusion matrix, and from those statistics the TPR and FPR are derived.
Before going into the threshold values, let's generate some synthetic data:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Create a sample binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
Let’s split the data and train the model (using logistic regression):
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Get predicted probabilities and predicted classes
y_pred_proba = model.predict_proba(X_test)[:, 1]  # Probability for class 1
y_pred = model.predict(X_test)
And check the statistics:
# Calculate the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

# Compute the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
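With fpr, tpr and roc_auc in hand, the curve itself can be drawn with matplotlib. A minimal sketch (the styling here is my own, not prescribed by the original notebook):

# Plot the ROC curve with the AUC shown in the legend
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='navy', linestyle='--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()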
And collect the per-threshold results in a data frame:
import pandas as pd

# Helper that computes the confusion matrix statistics at a given threshold
# (the original post calls this helper without showing it; this is a
# straightforward reconstruction of what it needs to return)
def get_confusion_matrix_stats(y_true, y_scores, threshold):
    y_pred_t = (y_scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred_t).ravel()
    return tp, fn, fp, tn

confusion_stats = []

# Loop over each threshold and calculate TP, FN, FP, TN
for threshold in thresholds:
    tp, fn, fp, tn = get_confusion_matrix_stats(y_test, y_pred_proba, threshold)
    confusion_stats.append({
        'Threshold': threshold,
        'True Positive (TP)': tp,
        'False Negative (FN)': fn,
        'False Positive (FP)': fp,
        'True Negative (TN)': tn
    })

# Convert the results into a pandas DataFrame
confusion_df = pd.DataFrame(confusion_stats)
confusion_df
The top and bottom five threshold values, together with the statistics for TP, FN, FP and TN, can be displayed as shown below.
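One way to show both ends of the table at once (a small convenience snippet, not from the original notebook):

# Show the first and last five thresholds with their statistics
pd.concat([confusion_df.head(), confusion_df.tail()])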
For the same ROC curve, let us set the threshold value to 0.08 and check the statistics. For that threshold we can calculate the correctly and incorrectly classified observations (based on the model's predictions) and express them as the true positive rate (TPR) against the false positive rate (FPR).
The values at this threshold are:
TP (True Positives) = 149
FN (False Negatives) = 6
FP (False Positives) = 80
TN (True Negatives) = 65
TPR = TP / (TP+FN) = 149/(149+6) = 0.9612903
FPR = FP / (FP + TN) = 80/(80+65) = 0.5517241
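These numbers can be verified programmatically; a quick check using the helper defined above (the expected values are the ones from this article's run):

# Verify the TPR and FPR at the chosen threshold
chosen_threshold = 0.08
tp, fn, fp, tn = get_confusion_matrix_stats(y_test, y_pred_proba, chosen_threshold)
tpr_at_t = tp / (tp + fn)   # expected: 149 / 155 = 0.9612903
fpr_at_t = fp / (fp + tn)   # expected: 80 / 145 = 0.5517241
print(f"TPR = {tpr_at_t:.7f}, FPR = {fpr_at_t:.7f}")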
Each threshold value corresponds to a single point on the graph where its FPR and TPR cross: FPR is plotted on the X-axis (values from 0 to 1) and TPR on the Y-axis (values from 0 to 1). All values lie on a scale from 0 to 1 since we are plotting the results of a binary classifier built with logistic regression.
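To visualise this, the point for the chosen threshold can be marked on the ROC curve. A sketch building on the plot and variables above (the marker styling is my own):

# Mark the point corresponding to the chosen threshold on the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.scatter([fpr_at_t], [tpr_at_t], color='green', zorder=5,
            label=f'Threshold = {chosen_threshold:.2f}')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve with Chosen Threshold')
plt.legend(loc='lower right')
plt.show()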
Where and how to understand the curve
To understand the curve, and where to "cut" or how to segment results based on it, the score distributions of both classes can be displayed.
In binary classification, the class prediction for each instance is often made based on a continuous random variable X, which is a "score" computed for the instance (e.g. the estimated probability in logistic regression). Given a threshold T, the instance is classified as positive if X > T. If $f_1(x)$ denotes the probability density of the score for positive instances and $f_0(x)$ the density for negative instances, the true positive rate is given by $\mathrm{TPR}(T) = \int_T^{\infty} f_1(x)\,dx$ and the false positive rate is given by $\mathrm{FPR}(T) = \int_T^{\infty} f_0(x)\,dx$.
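These tail integrals have a direct empirical counterpart: the fraction of each class's predicted scores that exceed the threshold. A small sketch of this check, reusing the variables defined above:

# Empirical estimates of the tail integrals at the chosen threshold
y_test_arr = np.asarray(y_test)
emp_tpr = (y_pred_proba[y_test_arr == 1] >= chosen_threshold).mean()  # approximates TPR(T)
emp_fpr = (y_pred_proba[y_test_arr == 0] >= chosen_threshold).mean()  # approximates FPR(T)
print(f"Empirical TPR(T) = {emp_tpr:.4f}, empirical FPR(T) = {emp_fpr:.4f}")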
Plotting the probabilities using a KDE (Kernel Density Estimate) with the Python Seaborn package helps us understand how increasing the threshold would result in fewer false positives (and more false negatives), corresponding to a leftward movement on the curve. The actual shape of the curve is determined by how much overlap the two distributions have.
# Build a data frame of actual classes and predicted probabilities
df = pd.DataFrame({'y_test': y_test, 'y_pred_proba': y_pred_proba})

# Now, let's mark the threshold on the distribution plot and display confusion matrix stats
plt.figure(figsize=(10, 6))

# Plot the distributions of predicted probabilities for both classes
sns.kdeplot(df[df['y_test'] == 0]['y_pred_proba'], label="Class 0 (actual)", fill=True, color="blue")
sns.kdeplot(df[df['y_test'] == 1]['y_pred_proba'], label="Class 1 (actual)", fill=True, color="orange")

# Highlight the chosen threshold on the x-axis
plt.axvline(chosen_threshold, color='green', linestyle='--', label=f'Threshold = {chosen_threshold:.2f}')
plt.text(chosen_threshold + 0.05, 0.5, f"TP = {tp}\nFN = {fn}\nFP = {fp}\nTN = {tn}",
         fontsize=12, bbox=dict(facecolor='white', alpha=0.5))

plt.xlabel('Predicted Probability')
plt.ylabel('Density')
plt.title('Distribution of Predicted Probabilities with Highlighted Threshold')
plt.legend()
plt.grid(True)
plt.show()
The plot shows the distribution for each class and how the threshold splits them into true negatives, true positives, false negatives and false positives (the different shaded regions of the class distributions).
Depending on the problem you are trying to solve, it is also crucial to decide which predicted values you want to evaluate and reduce or increase. For a given problem, you can also reduce the complexity of the model by classifying the results in additional sub-models, making the solution more resilient and robust. This complexity reduction can be done using dimensionality-reduction techniques such as SVD, t-SNE, PCA, Isomap and others.
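As a brief illustration of one of these techniques, here is a minimal PCA sketch with scikit-learn (my own example; the choice of two components is arbitrary):

from sklearn.decomposition import PCA

# Reduce the 20 input features to 2 principal components
pca = PCA(n_components=2, random_state=42)
X_train_2d = pca.fit_transform(X_train)
X_test_2d = pca.transform(X_test)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")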
As always, the complete Fabric notebook is available in the GitHub repository for Data science with Microsoft Fabric.
Working with Fabric, you can always investigate Spark operations and runs further and optimise the workload. The complete analysis can also be done in Microsoft Fabric using the R language.
Stay healthy and keep exploring!