Calibrating Confidence: Strategies for Accurate Probability Estimation in ML

Improve machine learning predictions’ reliability and interpretability by enhancing confidence levels through probability calibration techniques

Amit Kulkarni
AI Advances



In this blog, we will cover the following topics:

  • Introduction
  • What is probability calibration?
  • Data loading and processing
  • Use case 1: Fetching class probabilities
  • Use case 2: Improving model performance
  • Sigmoid vs. Isotonic
    - Applying the Brier Score
  • Conclusion & FAQs
  • References

Introduction

In machine learning classification tasks, the conventional approach is to predict class values directly, such as determining whether an observation belongs to class 0 or class 1 in binary classification. A more nuanced approach is to predict the probability of an observation belonging to each class. For instance, if a model assigns a probability of 0.85 (85%) to class 1 and 0.15 (15%) to class 0, it gives a clearer indication of the model's confidence in the prediction. Nevertheless, not all models excel at predicting calibrated probabilities that closely align with the expected distribution for each class. In such scenarios, the options are typically limited to either relying solely on the class predicted by the model, disregarding the associated probabilities, or opting for a different model that does provide class probabilities.

Is there a middle ground where we can leverage the strengths of our original model while still obtaining probabilities for enhanced interpretation? Indeed, there is. This middle path involves probability calibration, a technique that fine-tunes the predicted probabilities from our existing model, allowing us to maintain its benefits while improving the interpretability of the predictions. We’ll explore this approach in detail in the next section.

What is probability calibration?

Probability calibration in machine learning enhances prediction accuracy by aligning predicted probabilities with true likelihoods. Unlike traditional models that assign class labels, calibration estimates the probability of an observation belonging to each class, offering deeper insights into prediction confidence. By adjusting predicted probabilities to match the actual distribution, calibration improves model reliability. It provides a middle ground where users can leverage original models while obtaining calibrated probabilities for enhanced interpretation. With calibrated probabilities, users gain clarity on prediction uncertainties, facilitating more precise decision-making across various machine-learning applications.
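
As a quick, made-up illustration of what "aligned with true likelihoods" means: if we group predictions by their predicted probability, a well-calibrated model's average predicted probability in each group should be close to the observed fraction of positives in that group. The tiny sketch below uses hypothetical numbers (not from our dataset) just to show the idea.

import numpy as np

# Hypothetical predicted probabilities and true labels (illustrative only)
probs = np.array([0.10, 0.15, 0.20, 0.70, 0.75, 0.80, 0.85, 0.90])
labels = np.array([0, 0, 1, 1, 0, 1, 1, 1])

# For a well-calibrated model, the mean predicted probability within a group
# of predictions should be close to the observed fraction of positives.
low, high = probs < 0.5, probs >= 0.5
for name, mask in [("low-probability group", low), ("high-probability group", high)]:
    print(f"{name}: mean predicted = {probs[mask].mean():.2f}, "
          f"fraction positive = {labels[mask].mean():.2f}")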

Data loading and processing

We will be using the data from Playground Series — Season 4, Episode 1 Binary Classification with a Bank Churn Dataset from Kaggle. The detailed definition of each of the variables can be found in the Kaggle dataset description. Our objective is to build a binary classification model to predict the churn of either 1 or 0.

The data processing is the same as in Boost Your ML Model's Performance with Ensemble Modeling. Please feel free to refer to that post or access the complete code from GitHub.

I have used the same dataset in my previous blogs: the first covered three approaches to building ensemble models, the second used automation tools like Autogluon and Autoviz for effortless modeling, and the third was ML Model Transparency: LIME for Explainability in Python. This blog is not a sequel to those posts, but I thought it would be better to keep the same dataset, experiment on it, and learn from it.
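
For readers who haven't gone through those posts, a minimal loading and preprocessing sketch is shown below. It is only an approximation of the pipeline used there; the file name, the column names (e.g. the Exited target and the id/CustomerId/Surname identifiers), and the encoding step are assumptions based on the Kaggle data description.

import pandas as pd

# Load the Kaggle Playground S4E1 bank churn training data
# (file name and column names assumed from the competition's data page)
df = pd.read_csv("train.csv")

# Drop identifier columns and separate the target ("Exited": 1 = churn, 0 = no churn)
df = df.drop(columns=["id", "CustomerId", "Surname"])
y = df["Exited"]

# One-hot encode the categorical features (e.g. Geography, Gender)
X = pd.get_dummies(df.drop(columns=["Exited"]), drop_first=True)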

Use case 1: Fetching class probabilities

In this use case, we build an SVC model for binary classification with the steps listed below.

  • Split the dataset into train, test, and validation set
  • Build an SVC model and make a prediction
  • Measure the model results.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Split into train (60%), validation (20%), and test (20%) sets
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42)

# Build an SVC model and predict on the validation set
svc = SVC()
svc.fit(X_train, y_train)
y_pred_val = svc.predict(X_val)
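
To cover the last step in the list above (measuring the model results), a minimal check on the validation set could look like the snippet below; the complete evaluation is available in the GitHub code.

from sklearn.metrics import accuracy_score

# Quick sanity check of validation-set performance
print("Validation accuracy:", accuracy_score(y_val, y_pred_val))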

The important thing to note here is the predict() function. Let's predict a specific record.

print(svc.predict(X_val.loc[[1]]))
---------------------------------------
OUTPUT:
[0]

We get the predicted class as [0], which is fine, but we don't know how confident the model is in this prediction; that is, we don't have the predicted probabilities for each class. To fetch these probabilities, in most models we use the predict_proba() function on the same record.

print(svc.predict_proba(X_val.loc[[1]]))

-----------------------------------------------
OUTPUT:
---------------------------------------------------------------------------
AttributeError: Traceback (most recent call last)
.....
AttributeError: predict_proba is not available.....

Well, some models only output the predicted class and don't expose the probabilities, meaning they don't have a predict_proba() function. In the case of SVC, we can still set probability=True and get these probabilities, but this option may not be available in many other models.

svc = SVC(probability=True)
svc.fit(X_train, y_train)
y_pred_val = svc.predict(X_val)

print(svc.predict_proba(X_val.loc[[1]]))

OUTPUT:
[[0.88146797 0.11853203]]

Let’s say we have a working model that doesn’t give us the class probabilities. In such a scenario, we can leverage probability calibration as explained below.

  • CalibratedClassifierCV: A scikit-learn class that performs probability calibration for classifiers.
  • svc: The original classifier (in this case, a Support Vector Classifier or SVC) that we want to calibrate.
  • method=’isotonic’: Specifies the method used for probability calibration. In this case, ‘isotonic’ refers to isotonic regression, a non-parametric method for calibration.
  • cv='prefit': Indicates that the already trained model (svc) should be used as-is, without further training.
from sklearn.calibration import CalibratedClassifierCV

# Calibrate the already-fitted SVC on the held-out validation set
calibrated_svc = CalibratedClassifierCV(svc, method='isotonic', cv='prefit')
calibrated_svc.fit(X_val, y_val)
print(calibrated_svc.predict(X_val.loc[[1]]))
print(calibrated_svc.predict_proba(X_val.loc[[1]]))

---------------------------------------------------------
OUTPUT:
[0]
[[0.92592593 0.07407407]]

Key observations

  • The prediction from both the original model and the calibrated model is [0].
  • The probabilities differ between the two because the underlying methodologies are different, although the values are not drastically different.
  • We have been able to take a model that doesn’t output the class probabilities and create a wrapper on it via a calibrated model to get the class probabilities.

Use case 2: Improving model performance

Let’s extend the same model and learn how to incrementally improve the model’s performance. There are two methods to configure the calibration model — Sigmoid and Isotonic. We will implement both these methods and compare them with the uncalibrated model.

Sigmoid

Sigmoid calibration is a method that transforms predicted probabilities using a logistic regression model, assuming a parametric calibration curve and using optimization techniques. It is known for its computational efficiency and is suitable for cases where the relationship between predicted and true probabilities is sigmoidal. However, it may struggle to capture non-linear relationships effectively, making it less accurate in scenarios with complex or non-sigmoidal relationships between predicted and true probabilities.
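
Under the hood, sigmoid calibration (also known as Platt scaling) fits a logistic function of the classifier's score, p = 1 / (1 + exp(A*s + B)). The sketch below conveys the idea by fitting a one-dimensional logistic regression on the SVC's decision scores; it is a simplification, not scikit-learn's exact internal implementation.

from sklearn.linear_model import LogisticRegression

# Rough sketch of Platt scaling: map the classifier's decision scores
# to probabilities with a 1-D logistic regression
scores_val = svc.decision_function(X_val).reshape(-1, 1)
platt = LogisticRegression()
platt.fit(scores_val, y_val)

# Calibrated probability of class 1 for the validation records
calibrated_probs_sigmoid_sketch = platt.predict_proba(scores_val)[:, 1]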

Isotonic

Isotonic calibration is a non-parametric method that uses isotonic regression to refine predicted probabilities. It uses a piecewise constant function for adjustment, allowing for intricate, non-linear relationships between predicted and true probabilities. This method is particularly useful in scenarios where the calibration curve defies a sigmoidal model or exhibits complex, non-linear behavior, making it a valuable tool for enhancing calibration accuracy.
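
The same idea can be sketched directly with scikit-learn's IsotonicRegression: learn a monotonically increasing, piecewise-constant mapping from the classifier's scores to probabilities. Again, this is a simplification of what CalibratedClassifierCV does internally, not the exact implementation.

from sklearn.isotonic import IsotonicRegression

# Rough sketch of isotonic calibration: a monotone, piecewise-constant
# mapping from decision scores to calibrated probabilities
scores_val = svc.decision_function(X_val)
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds='clip')
iso.fit(scores_val, y_val)

calibrated_probs_isotonic_sketch = iso.predict(scores_val)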

import matplotlib.pyplot as plt
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

n_bins = 5
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)

# Train SVC model
svc = SVC(probability=True) # Set probability=True for SVC to enable probability estimates
svc.fit(X_train, y_train)
y_pred_val = svc.predict(X_val)
prob_pos_val = svc.predict_proba(X_val)[:, 1]

prob_true_uncalibrated, prob_pred_uncalibrated = calibration_curve(y_val, prob_pos_val, n_bins=n_bins)

calibrated_svc_sigmoid = CalibratedClassifierCV(svc, method='sigmoid', cv= 5)
calibrated_svc_sigmoid.fit(X_val, y_val)
prob_pos_val_sigmoid = calibrated_svc_sigmoid.predict_proba(X_val)[:, 1]

calibrated_svc_isotonic = CalibratedClassifierCV(svc, method='isotonic', cv=5)
calibrated_svc_isotonic.fit(X_val, y_val)
prob_pos_val_isotonic = calibrated_svc_isotonic.predict_proba(X_val)[:, 1]

prob_true_sigmoid, prob_pred_sigmoid = calibration_curve(y_val, prob_pos_val_sigmoid, n_bins=n_bins)
prob_true_isotonic, prob_pred_isotonic = calibration_curve(y_val, prob_pos_val_isotonic, n_bins=n_bins)

# Plot calibration curves
plt.figure(figsize=(8, 8))

# Plot uncalibrated calibration curve
plt.plot(prob_pred_uncalibrated, prob_true_uncalibrated, marker='o', label='Uncalibrated')
# Plot calibration curve for sigmoid calibration
plt.plot(prob_pred_sigmoid, prob_true_sigmoid, marker='o', label='Sigmoid Calibration')
# Plot calibration curve for isotonic calibration
plt.plot(prob_pred_isotonic, prob_true_isotonic, marker='o', label='Isotonic Calibration')
# Plot the diagonal line representing perfect calibration
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Perfect Calibration')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.title('Calibration Curve')
plt.legend()
plt.grid(True)
plt.show()
Source: Author | Fig 1: Calibration curve for each of the models

How to interpret the calibration curve?

  • The diagonal dotted line represents ideal or perfect calibration, where the predicted probabilities exactly match the observed fraction of positives.
  • The model whose curve lies closest to the diagonal is performing better than the rest.
  • Points above the diagonal indicate the model is under-confident (predicted probabilities are lower than the observed positive rate), while points below the diagonal indicate over-confidence.
  • The uncalibrated curve performs worst, the Isotonic model is the best, and Sigmoid lies between the two.

Let's also look at the model metrics for each of them. The complete code used in this blog can be found on GitHub.

Source: Author | Fig 2: Model metrics of uncalibrated, Sigmoid and Isotonic models

Key observations

  • There is an improvement in the model's accuracy across the three models: 0.83 < 0.85 < 0.86.
  • It is important to note that precision has also improved: 0.84 < 0.86 < 0.88.
  • The ROC curve also indicates that the Isotonic and Sigmoid models are very close to each other and both better than the Uncalibrated model.
Source: Author | Fig 3: ROC curve for each of the models
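
For reference, the kind of metrics shown in Fig 2 and Fig 3 can be reproduced along the following lines (a sketch that assumes the fitted models and the validation split from the code block above):

from sklearn.metrics import accuracy_score, precision_score, roc_auc_score

models = {
    'Uncalibrated': svc,
    'Sigmoid': calibrated_svc_sigmoid,
    'Isotonic': calibrated_svc_isotonic,
}

for name, model in models.items():
    preds = model.predict(X_val)
    probs = model.predict_proba(X_val)[:, 1]
    print(f"{name}: accuracy={accuracy_score(y_val, preds):.2f}, "
          f"precision={precision_score(y_val, preds):.2f}, "
          f"ROC AUC={roc_auc_score(y_val, probs):.2f}")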

The Brier score

As we have observed, the Sigmoid and Isotonic models give almost the same results. Let's see how similar they really are using the Brier score.

The Brier score is a metric used to assess the accuracy of probabilistic predictions in classification models. It measures the mean squared difference between predicted probabilities and actual outcomes. Lower scores indicate better calibration, while higher scores indicate poorer calibration. For binary classification the Brier score ranges from 0 to 1, with 0 indicating perfect calibration and 1 indicating complete miscalibration. It is commonly used to evaluate the calibration performance of classification models.
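
Concretely, for binary classification the Brier score is just BS = (1/N) * sum((p_i - y_i)^2), where p_i is the predicted probability of the positive class and y_i is the actual 0/1 outcome. The small sketch below computes it by hand; it should match the brier_score_loss values computed next.

import numpy as np

# Brier score by hand: mean squared difference between predicted
# probabilities of the positive class and the 0/1 outcomes
manual_brier_sigmoid = np.mean((prob_pos_val_sigmoid - np.asarray(y_val)) ** 2)
print(manual_brier_sigmoid)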

from sklearn.metrics import brier_score_loss

# Calculate Brier score for sigmoid calibration
brier_score_sigmoid = brier_score_loss(y_val, prob_pos_val_sigmoid)

# Calculate Brier score for isotonic calibration
brier_score_isotonic = brier_score_loss(y_val, prob_pos_val_isotonic)

print("Brier Score (Sigmoid Calibration):", brier_score_sigmoid)
print("Brier Score (Isotonic Calibration):", brier_score_isotonic)

-----------------------------------------------------------------------
OUTPUT:
Brier Score (Sigmoid Calibration): 0.10885411896674385
Brier Score (Isotonic Calibration): 0.10447999956877632

We observe that the Isotonic calibration model is slightly better than the Sigmoid model, but the difference is not significant.

Conclusion

Probability calibration plays a vital role in machine learning, improving prediction accuracy and instilling greater confidence in model results. By aligning predicted probabilities with actual likelihoods, it empowers decision-makers with trustworthy estimates of uncertainty. Both sigmoid and isotonic calibration techniques offer efficiency and adaptability, fostering a deeper comprehension of models and facilitating more precise decision-making processes.

I hope you liked the article and found it helpful.


FAQs

Q1: How do I know if my ML model needs probability calibration?
A1: Understanding when to use probability calibration depends on the application and the importance of accurate probabilities. If you’re relying on the confidence level of your model’s predictions for decision-making, like in healthcare or finance, probability calibration can be beneficial.

Q2: Are there any drawbacks or risks associated with probability calibration?
A2: Probability calibration enhances prediction reliability, but it adds complexity and computational cost to the modeling pipeline, and improper calibration can lead to overfitting or underfitting, affecting the model's performance.

Q3: How can I evaluate the effectiveness of probability calibration in my model?
A3: Calibration effectiveness can be evaluated using calibration curves, which compare predicted probabilities to the observed frequency of outcomes, together with metrics such as log-loss or the Brier score that quantify calibration quality.

Q4: Are there any best practices or tips for implementing probability calibration in machine learning projects?
A4: To optimize calibration, consider the trade-offs between different techniques and select the one that best suits your application and dataset. Use cross-validation and model evaluation techniques to prevent overfitting to the training data.
