
Data science with Microsoft Fabric – Plotting ROC curve and distribution of scores

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers.]

ROC (Receiver Operating Characteristic) curve is a graph that shows how a classifier performs by plotting the true positive rate against the false positive rate. It is used to evaluate the performance of binary classification models by illustrating the trade-off between the true positive rate (TPR) and the false positive rate (FPR) at various threshold settings.

Key concepts

True Positive Rate (TPR): Also known as Sensitivity or Recall, it measures the proportion of actual positives correctly identified by the model.

False Positive Rate (FPR): It measures the proportion of actual negatives that are incorrectly identified as positives.

Area Under the Curve (AUC): This value summarizes the performance of the model. A value of 1 indicates a perfect model, while a value of 0.5 suggests a model with no discrimination ability.

ROC (Receiver Operating Characteristic) Curve: A plot that illustrates the performance of a binary classifier model at varying threshold values.

    How to read ROC

    The ROC curve is the plot of the true positive rate against the false positive rate at each threshold setting. The threshold values can be obtained from the same statistics used in the confusion matrix, from which the TPR and FPR are calculated.

    Before going into the threshold values, let's create and check some synthetic data.

    import numpy as np
    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, auc, confusion_matrix
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # Create a sample binary classification dataset
    X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
    

    Let’s split the data and train the model (using logistic regression):

    # Split the data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Train a logistic regression model
    model = LogisticRegression()
    model.fit(X_train, y_train)
    
    # Get predicted probabilities and predicted classes
    y_pred_proba = model.predict_proba(X_test)[:, 1]  # Probability for class 1
    y_pred = model.predict(X_test)
    
    

    And check the statistics:

    # Calculate the ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
    roc_auc = auc(fpr, tpr)
    
    # Compute the confusion matrix
    conf_matrix = confusion_matrix(y_test, y_pred)
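
    With fpr, tpr and roc_auc computed, a minimal sketch for plotting the ROC curve itself with matplotlib could look like this (the exact styling in the original notebook may differ):

    # Plot the ROC curve and mark the AUC value in the legend
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, color='darkorange', label=f'ROC curve (AUC = {roc_auc:.3f})')
    plt.plot([0, 1], [0, 1], color='navy', linestyle='--', label='Random classifier')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend(loc='lower right')
    plt.grid(True)
    plt.show()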
    

    And add the results to a data frame:

    # Helper that classifies an instance as positive when its predicted probability
    # meets the threshold, and returns the confusion-matrix counts as TP, FN, FP, TN
    def get_confusion_matrix_stats(y_true, y_scores, threshold):
        y_pred_t = (y_scores >= threshold).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred_t).ravel()
        return tp, fn, fp, tn
    
    confusion_stats = []
    
    # Loop over each threshold and calculate TP, FN, FP, TN
    for idx, threshold in enumerate(thresholds):
        tp, fn, fp, tn = get_confusion_matrix_stats(y_test, y_pred_proba, threshold)
        confusion_stats.append({
            'Threshold': threshold,
            'True Positive (TP)': tp,
            'False Negative (FN)': fn,
            'False Positive (FP)': fp,
            'True Negative (TN)': tn
        })
    
    # Convert the results into a pandas DataFrame
    confusion_df = pd.DataFrame(confusion_stats)
    
    confusion_df
    

    The output shows the top and bottom five rows of the thresholds, with the corresponding statistics for TP, FN, FP and TN.

    For the same ROC curve, let us set the threshold value to 0.08 and check the statistics:
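
    A minimal sketch for this step, reusing the get_confusion_matrix_stats helper defined above:

    # Pick the threshold and get the confusion-matrix counts at that operating point
    chosen_threshold = 0.08
    tp, fn, fp, tn = get_confusion_matrix_stats(y_test, y_pred_proba, chosen_threshold)
    print(tp, fn, fp, tn)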

    Adding the calculation formulas, for that threshold value we can express the correctly and falsely classified instances (based on the prediction model) as the true positive rate (TPR) against the false positive rate (FPR).

    The values are:
    TP (True Positives) = 149
    FN (False Negatives) = 6
    FP (False Positives) = 80
    TN (True Negatives) = 65
    TPR = TP / (TP + FN) = 149 / (149 + 6) = 0.9612903
    FPR = FP / (FP + TN) = 80 / (80 + 65) = 0.5517241
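
    The same arithmetic in Python, reusing the counts obtained at the chosen threshold above:

    # TPR and FPR computed from the confusion-matrix counts
    tpr_at_t = tp / (tp + fn)
    fpr_at_t = fp / (fp + tn)
    print(f"TPR = {tpr_at_t:.7f}, FPR = {fpr_at_t:.7f}")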


    The threshold value is determined where the X and Y values cross on the graph: the FPR is plotted on the X-axis (values going from 0 to 1) and the TPR on the Y-axis (values going from 0 to 1). All values lie on a scale from 0 to 1 since we are plotting the results of a binary classifier built with logistic regression.

    Where and how to understand the curve

    To understand the curve and decide where to "cut" it, that is, how to segment the results based on the curve, the score distributions of both classes can be displayed.

    In binary classification, the class prediction for each instance is often made based on a continuous random variable X, which is a "score" computed for the instance (e.g. the estimated probability in logistic regression). Given a threshold parameter T, the instance is classified as positive if X > T and negative otherwise. If f₁(x) denotes the probability density of the score for the actual positive class and f₀(x) the density for the actual negative class, the true positive rate is given by TPR(T) = ∫_T^∞ f₁(x) dx and the false positive rate is given by FPR(T) = ∫_T^∞ f₀(x) dx.
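
    A quick empirical check of these two quantities, a sketch reusing the y_test and y_pred_proba arrays from above: each integral corresponds to the fraction of the respective actual class scoring at or above the threshold T.

    # Empirical TPR(T) and FPR(T): share of each actual class with a score >= T
    T = 0.08
    tpr_T = np.mean(y_pred_proba[y_test == 1] >= T)
    fpr_T = np.mean(y_pred_proba[y_test == 0] >= T)
    print(f"TPR({T}) = {tpr_T:.4f}, FPR({T}) = {fpr_T:.4f}")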

    Plotting these densities with a KDE (Kernel Density Estimate) from the Python Seaborn package helps to understand how increasing the threshold would result in fewer false positives (and more false negatives), corresponding to a leftward movement on the curve. The actual shape of the curve is determined by how much overlap the two distributions have.

    # Put the actual labels and predicted probabilities into a data frame for plotting
    df = pd.DataFrame({'y_test': y_test, 'y_pred_proba': y_pred_proba})
    
    # Now, let's mark the threshold on the distribution plot and display confusion matrix stats
    plt.figure(figsize=(10, 6))
    
    # Plot the distributions of predicted probabilities for both classes
    sns.kdeplot(df[df['y_test'] == 0]['y_pred_proba'], label="Class 0 (actual)", fill=True, color="blue")
    sns.kdeplot(df[df['y_test'] == 1]['y_pred_proba'], label="Class 1 (actual)", fill=True, color="orange")
    
    # Highlight the chosen threshold on the x-axis
    plt.axvline(chosen_threshold, color='green', linestyle='--', label=f'Threshold = {chosen_threshold:.2f}')
    
    # Annotate the plot with the confusion-matrix counts at the chosen threshold
    plt.text(chosen_threshold + 0.05, 0.5,
             f"TP = {tp}\nFN = {fn}\nFP = {fp}\nTN = {tn}",
             size=12, bbox=dict(facecolor='white', alpha=0.5))
    
    plt.xlabel('Predicted Probability')
    plt.ylabel('Density')
    plt.title('Distribution of Predicted Probabilities with Highlighted Threshold')
    plt.legend()
    plt.grid(True)
    plt.show()
    
    

    The plot shows the distributions for each outcome: true negatives and true positives, false negatives and false positives, visible as different shaded regions of the two class distributions.

    Depending on the problem you are trying to solve, it also matters which type of prediction error you want to evaluate and reduce or increase. For a given problem, you can also reduce the complexity of the model by splitting the classification into additional sub-models, making the solution more resilient and robust. This complexity reduction can be supported by dimensionality-reduction techniques such as SVD, t-SNE, PCA, Isomap and others.
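
    As an illustration of the last point, a minimal sketch, assuming scikit-learn's PCA and the feature matrix X created above, of projecting the 20 features onto two principal components:

    from sklearn.decomposition import PCA
    
    # Project the 20 original features onto 2 principal components
    pca = PCA(n_components=2, random_state=42)
    X_reduced = pca.fit_transform(X)
    
    # Share of total variance captured by each retained component
    print(pca.explained_variance_ratio_)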

    As always, the complete Fabric notebook is available in the GitHub repository for Data science with Microsoft Fabric.

    Working with Fabric, you can always investigate the Spark operations and runs further and optimise the workload. The complete analysis can also be done in Microsoft Fabric using the R language.

    Stay healthy and keep exploring!
