Data science with Microsoft Fabric – Plotting ROC curve and distribution of scores
The ROC (Receiver Operating Characteristic) curve is a graph that shows how a classifier performs by plotting the true positive rate against the false positive rate. It is used to evaluate the performance of binary classification models by illustrating the trade-off between the true positive rate (TPR) and the false positive rate (FPR) at various threshold settings.
Key concepts
True Positive Rate (TPR): Also known as Sensitivity or Recall, it measures the proportion of actual positives correctly identified by the model.
False Positive Rate (FPR): It measures the proportion of actual negatives that are incorrectly identified as positives.
Area Under the Curve (AUC): This value summarizes the performance of the model. A value of 1 indicates a perfect model, while a value of 0.5 suggests a model with no discrimination ability.
ROC (Receiver Operating Characteristic) Curve: A plot that illustrates the performance of a binary classifier at varying threshold values.
How to read ROC
The ROC curve is the plot of the true positive rate against the false positive rate at each threshold setting. Each threshold value produces its own confusion matrix, and from those statistics the TPR and FPR are derived.
Before going into the threshold values, let's generate some synthetic data:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Create a sample binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
Let’s split the data and train the model (using logistic regression):
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Get predicted probabilities and predicted classes
y_pred_proba = model.predict_proba(X_test)[:, 1]  # Probability for class 1
y_pred = model.predict(X_test)
And check the statistics:
# Calculate the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

# Compute the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
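With fpr, tpr and roc_auc in hand, the curve itself can be drawn with matplotlib. A minimal sketch (the styling here is my own, not prescribed by the original notebook):

# Plot the ROC curve with the AUC shown in the legend
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='navy', linestyle='--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()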
And collect the per-threshold results in a data frame:
import pandas as pd

# Helper that computes the confusion matrix statistics at a given threshold
# (the original post calls this helper without showing it; this is a
# straightforward reconstruction of what it needs to return)
def get_confusion_matrix_stats(y_true, y_scores, threshold):
    y_pred_t = (y_scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred_t).ravel()
    return tp, fn, fp, tn

confusion_stats = []

# Loop over each threshold and calculate TP, FN, FP, TN
for threshold in thresholds:
    tp, fn, fp, tn = get_confusion_matrix_stats(y_test, y_pred_proba, threshold)
    confusion_stats.append({
        'Threshold': threshold,
        'True Positive (TP)': tp,
        'False Negative (FN)': fn,
        'False Positive (FP)': fp,
        'True Negative (TN)': tn
    })

# Convert the results into a pandas DataFrame
confusion_df = pd.DataFrame(confusion_stats)
confusion_df
The top and bottom five threshold values, together with the statistics for TP, FN, FP and TN, can be displayed as shown below.
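One way to show both ends of the table at once (a small convenience snippet, not from the original notebook):

# Show the first and last five thresholds with their statistics
pd.concat([confusion_df.head(), confusion_df.tail()])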
For the same ROC curve, let us set the threshold value to 0.08 and check the statistics. For that threshold we can calculate the correctly and incorrectly classified observations (based on the model's predictions) and express them as the true positive rate (TPR) against the false positive rate (FPR).
The values at this threshold are:
TP (True Positives) = 149
FN (False Negatives) = 6
FP (False Positives) = 80
TN (True Negatives) = 65
TPR = TP / (TP+FN) = 149/(149+6) = 0.9612903
FPR = FP / (FP + TN) = 80/(80+65) = 0.5517241
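These numbers can be verified programmatically; a quick check using the helper defined above (the expected values are the ones from this article's run):

# Verify the TPR and FPR at the chosen threshold
chosen_threshold = 0.08
tp, fn, fp, tn = get_confusion_matrix_stats(y_test, y_pred_proba, chosen_threshold)
tpr_at_t = tp / (tp + fn)   # expected: 149 / 155 = 0.9612903
fpr_at_t = fp / (fp + tn)   # expected: 80 / 145 = 0.5517241
print(f"TPR = {tpr_at_t:.7f}, FPR = {fpr_at_t:.7f}")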
Each threshold value corresponds to a single point on the graph where its FPR and TPR cross: FPR is plotted on the X-axis (values from 0 to 1) and TPR on the Y-axis (values from 0 to 1). All values lie on a scale from 0 to 1 since we are plotting the results of a binary classifier built with logistic regression.
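To visualise this, the point for the chosen threshold can be marked on the ROC curve. A sketch building on the plot and variables above (the marker styling is my own):

# Mark the point corresponding to the chosen threshold on the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.scatter([fpr_at_t], [tpr_at_t], color='green', zorder=5,
            label=f'Threshold = {chosen_threshold:.2f}')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve with Chosen Threshold')
plt.legend(loc='lower right')
plt.show()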
Where and how to understand the curve
To understand the curve, and where to "cut" or how to segment results based on it, the score distributions of both classes can be displayed.
In binary classification, the class prediction for each instance is often made based on a continuous random variable X, which is a "score" computed for the instance (e.g. the estimated probability in logistic regression). Given a threshold T, the instance is classified as positive if X > T. If $f_1(x)$ denotes the probability density of the score for positive instances and $f_0(x)$ the density for negative instances, the true positive rate is given by $\mathrm{TPR}(T) = \int_T^{\infty} f_1(x)\,dx$ and the false positive rate is given by $\mathrm{FPR}(T) = \int_T^{\infty} f_0(x)\,dx$.
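These tail integrals have a direct empirical counterpart: the fraction of each class's predicted scores that exceed the threshold. A small sketch of this check, reusing the variables defined above:

# Empirical estimates of the tail integrals at the chosen threshold
y_test_arr = np.asarray(y_test)
emp_tpr = (y_pred_proba[y_test_arr == 1] >= chosen_threshold).mean()  # approximates TPR(T)
emp_fpr = (y_pred_proba[y_test_arr == 0] >= chosen_threshold).mean()  # approximates FPR(T)
print(f"Empirical TPR(T) = {emp_tpr:.4f}, empirical FPR(T) = {emp_fpr:.4f}")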
Plotting the probabilities using a KDE (Kernel Density Estimate) with the Python Seaborn package helps us understand how increasing the threshold would result in fewer false positives (and more false negatives), corresponding to a leftward movement on the curve. The actual shape of the curve is determined by how much overlap the two distributions have.
# Build a data frame of actual classes and predicted probabilities
df = pd.DataFrame({'y_test': y_test, 'y_pred_proba': y_pred_proba})

# Now, let's mark the threshold on the distribution plot and display confusion matrix stats
plt.figure(figsize=(10, 6))

# Plot the distributions of predicted probabilities for both classes
sns.kdeplot(df[df['y_test'] == 0]['y_pred_proba'], label="Class 0 (actual)", fill=True, color="blue")
sns.kdeplot(df[df['y_test'] == 1]['y_pred_proba'], label="Class 1 (actual)", fill=True, color="orange")

# Highlight the chosen threshold on the x-axis
plt.axvline(chosen_threshold, color='green', linestyle='--', label=f'Threshold = {chosen_threshold:.2f}')
plt.text(chosen_threshold + 0.05, 0.5, f"TP = {tp}\nFN = {fn}\nFP = {fp}\nTN = {tn}",
         fontsize=12, bbox=dict(facecolor='white', alpha=0.5))

plt.xlabel('Predicted Probability')
plt.ylabel('Density')
plt.title('Distribution of Predicted Probabilities with Highlighted Threshold')
plt.legend()
plt.grid(True)
plt.show()
The plot shows the distribution for each class and how the threshold splits them into true negatives, true positives, false negatives and false positives (the different shaded regions of the class distributions).
Depending on the problem you are trying to solve, it is also crucial to decide which predicted values you want to evaluate and reduce or increase. For a given problem, you can also reduce the complexity of the model by classifying the results in additional sub-models, making the solution more resilient and robust. This complexity reduction can be done using dimensionality-reduction techniques such as SVD, t-SNE, PCA, Isomap and others.
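As a brief illustration of one of these techniques, here is a minimal PCA sketch with scikit-learn (my own example; the choice of two components is arbitrary):

from sklearn.decomposition import PCA

# Reduce the 20 input features to 2 principal components
pca = PCA(n_components=2, random_state=42)
X_train_2d = pca.fit_transform(X_train)
X_test_2d = pca.transform(X_test)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")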
As always, the complete Fabric notebook is available in the GitHub repository for Data science with Microsoft Fabric.
Working with Fabric, you can always investigate Spark operations and runs further and optimise the workload. The complete analysis can also be done in Microsoft Fabric using the R language.
Stay healthy and keep exploring!