Probabilistic Adaptive Performance Estimation (PAPE)


Intuition

Classification model predictions usually come with an associated uncertainty. For example, a binary classification model typically returns two outputs for each prediction - a predicted class (binary) and a class probability estimate (sometimes referred to as score). The score provides information about the confidence of the prediction. A rule of thumb is that the closer the score is to its lower or upper limit (usually 0 and 1), the higher the probability that the classifier’s prediction is correct. When this score is an actual probability, it can be directly used to estimate the probability of making an error. For instance, imagine a high-performing model which, for a large set of observations, returns a prediction of 1 (positive class) with a probability of 0.9. It means that the model is correct for approximately 90% of these observations, while for the other 10%, the model is wrong.
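
As a minimal illustration of this intuition (the scores and the 0.5 threshold below are made-up example values, not NannyML output): if the scores are well-calibrated probabilities, the expected fraction of correct predictions follows directly from them.

```python
import numpy as np

# Well-calibrated scores: P(y = 1) for each observation (illustrative values).
scores = np.array([0.9, 0.9, 0.9, 0.2, 0.05])
preds = (scores >= 0.5).astype(int)          # predicted class at a 0.5 threshold

# Probability that each individual prediction is correct.
prob_correct = np.where(preds == 1, scores, 1 - scores)
expected_accuracy = prob_correct.mean()      # expected fraction of correct predictions
print(expected_accuracy)                     # 0.89 for these scores
```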

PAPE can use the uncertainty information encoded in a model's outputs on a reference dataset to estimate the confusion matrix elements for the model on a newer dataset, called the analysis dataset. The resulting confusion matrix elements can then be transformed into our chosen performance metric, completing the estimation process. This is done using only the model's outputs on the analysis dataset.
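
Here is a minimal sketch of that estimation step (illustrative values, not NannyML's implementation): given calibrated probabilities and the model's predicted labels, each confusion matrix element is an expectation over the unknown targets, and any metric built from those elements follows.

```python
import numpy as np

calibrated = np.array([0.92, 0.80, 0.35, 0.10, 0.60])  # calibrated P(y = 1)
y_pred = (calibrated >= 0.5).astype(int)                # predictions (0.5 threshold, for illustration)

# Expected confusion matrix elements, without access to the true labels.
exp_tp = np.sum(calibrated[y_pred == 1])      # predicted 1 and truly 1
exp_fp = np.sum(1 - calibrated[y_pred == 1])  # predicted 1 but truly 0
exp_fn = np.sum(calibrated[y_pred == 0])      # predicted 0 but truly 1
exp_tn = np.sum(1 - calibrated[y_pred == 0])  # predicted 0 and truly 0

estimated_recall = exp_tp / (exp_tp + exp_fn)
estimated_precision = exp_tp / (exp_tp + exp_fp)
```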

NannyML has previously created the CBPE algorithm for performance estimation. Further research showed that covariate shift can have a material impact on the quality of calibration. PAPE addresses this by calibrating predicted probabilities according to the data distribution of the analysis data. This is done by calculating the ratio of probability density functions between the reference and the analysis datasets. This ratio is used to perform weighted calibration on the reference data, which is what makes the calibration result accurately reflect the uncertainty in the analysis data.
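
To see why a classifier that separates reference rows from analysis rows yields this density ratio (a standard density-ratio-estimation argument, sketched here for intuition rather than quoted from NannyML's implementation): by Bayes' rule, a well-calibrated classifier trained on $n_{ref}$ reference rows (label 0) and $n_{anl}$ analysis rows (label 1) outputs

$$\hat{p}(x) = \mathrm{P}(\text{analysis} \mid x) = \frac{n_{anl}\, p_{anl}(x)}{n_{ref}\, p_{ref}(x) + n_{anl}\, p_{anl}(x)},$$

which rearranges to the importance-weight formula used in the algorithm below:

$$\frac{p_{anl}(x)}{p_{ref}(x)} = \frac{n_{ref}}{n_{anl}} \cdot \frac{\hat{p}(x)}{1 - \hat{p}(x)}.$$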

Implementation Details

The PAPE Algorithm

Let's first go through the steps of the PAPE algorithm.

  1. Preprocess available data to create a training dataset for the density ratio estimation model.

    1. Assign label 0 to reference data and label 1 to analysis data.

    2. Concatenate reference and analysis data. Note that we are only using columns labelled as model inputs.

    3. Concatenate labels.

  2. Train the density ratio estimation model. Use it to estimate predicted probabilities, $\hat{p}$, for reference data.

  3. Translate the density ratio estimation model's predicted probabilities for reference data into density ratios, also called importance weights, using the formula:

$$\hat{w}_{ref} = \frac{n_{ref}}{n_{anl}} \cdot \frac{\hat{p}}{1-\hat{p}}$$
  4. Fit a weighted calibrator $c$ on reference data, using the calculated weights $\hat{w}_{ref}$.

  5. Get the monitored model's predicted scores on analysis data, $f(x_{anl})$, and perform weighted calibration on them, $c(f(x_{anl}))$.

  6. Use the uncertainty encoded in the calibrated scores to estimate performance. This is done in the same way as in the CBPE algorithm; the only difference is that we now use the weighted-calibration scores to estimate the expected performance.
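
Below is a rough end-to-end sketch of these six steps using scikit-learn. The DataFrame and column names (reference_df, analysis_df, feature_cols, y_pred_proba, y_true), the choice of a gradient-boosting classifier for density ratio estimation, and isotonic regression as the weighted calibrator are assumptions made for illustration; they are not NannyML's implementation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.isotonic import IsotonicRegression

def estimate_accuracy_pape(reference_df, analysis_df, feature_cols):
    # Step 1: label reference rows 0 and analysis rows 1, concatenate model inputs.
    X = np.vstack([reference_df[feature_cols].to_numpy(),
                   analysis_df[feature_cols].to_numpy()])
    z = np.concatenate([np.zeros(len(reference_df)), np.ones(len(analysis_df))])

    # Step 2: train the density ratio estimation (DRE) model and score reference rows.
    dre = GradientBoostingClassifier().fit(X, z)
    p_hat = dre.predict_proba(reference_df[feature_cols].to_numpy())[:, 1]

    # Step 3: turn DRE probabilities into importance weights for reference rows.
    n_ref, n_anl = len(reference_df), len(analysis_df)
    p_hat = np.clip(p_hat, 1e-6, 1 - 1e-6)          # avoid division by zero
    w_ref = (n_ref / n_anl) * p_hat / (1 - p_hat)

    # Step 4: fit a weighted calibrator on the monitored model's reference scores.
    calibrator = IsotonicRegression(out_of_bounds="clip")
    calibrator.fit(reference_df["y_pred_proba"].to_numpy(),
                   reference_df["y_true"].to_numpy(),
                   sample_weight=w_ref)

    # Step 5: calibrate the monitored model's scores on the analysis data.
    calibrated_anl = calibrator.predict(analysis_df["y_pred_proba"].to_numpy())

    # Step 6: estimate expected performance from the calibrated scores (CBPE-style),
    # here accuracy for predictions made at a 0.5 threshold.
    y_pred_anl = (analysis_df["y_pred_proba"].to_numpy() >= 0.5).astype(int)
    prob_correct = np.where(y_pred_anl == 1, calibrated_anl, 1 - calibrated_anl)
    return prob_correct.mean()
```

In this sketch the estimated metric is accuracy at a 0.5 threshold; the same calibrated scores can feed any metric derived from expected confusion matrix elements, as described above.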

Assumptions and Limitations

PAPE rests on some assumptions in order to give accurate estimates:

There is no Concept Drift

There should be no change in the relationship between the model targets and the model inputs. To express it mathematically, $\mathrm{P}(Y|X)$ should remain unchanged. This is a strong limitation, and PAPE will give inaccurate results if this assumption is violated.

The data available is large enough.

We need enough data to accurately train the density ratio estimation model and to properly calibrate the monitored model's scores.

There is no covariate shift to previously unseen regions.

PAPE will likely fail if there is covariate shift to previously unseen regions of the model input space. Mathematically, the support of the analysis data needs to be a subset of the support of the reference data. If it is not, density ratio estimation is theoretically undefined. Practically, if a region of the analysis data is not represented in the reference data, we cannot account for that shift with a weighted calculation on reference data.
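
A rough heuristic for spotting this situation (purely illustrative, not a NannyML feature) is to flag analysis rows whose numeric feature values fall outside the range observed in the reference data:

```python
import pandas as pd

def outside_reference_range(reference_df: pd.DataFrame,
                            analysis_df: pd.DataFrame,
                            feature_cols: list) -> pd.Series:
    """Flag analysis rows with any numeric feature outside the reference min/max range."""
    lo = reference_df[feature_cols].min()
    hi = reference_df[feature_cols].max()
    outside = analysis_df[feature_cols].lt(lo) | analysis_df[feature_cols].gt(hi)
    return outside.any(axis=1)

# Share of analysis rows in regions the reference data never covered:
# outside_reference_range(reference_df, analysis_df, feature_cols).mean()
```

Note that this only catches shifts beyond the numeric range of each feature; shifts into sparsely covered regions inside that range require a more careful check.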
