Writing Functions for Multiclass Classification

Writing the functions needed to create a custom multiclass classification metric.

As we have seen on the Introductory Custom Metric page, the key components of a custom multiclass classification metric are the Python functions we need to provide for the custom metric to work. Here we will see how to create them.

We will assume the user has access to a Jupyter Notebook running Python with the NannyML open-source library installed.

Sample Dataset

We have created a sample dataset to facilitate developing the code needed for custom multiclass classification metrics. The dataset is publicly accessible here. It is a pure covariate shift dataset that consists of:

7 numerical features: ['feature1', 'feature2', 'feature3', 'feature4', 'feature5', 'feature6', 'feature7']
Target column that contains the 5 classes {0,1,2,3,4}: y_true
Model prediction column: y_pred
The model predicted probability columns: y_pred_proba_0/1/2/3/4
A timestamp column: timestamp
An identifier column: identifier
The probabilities from which the target values have been sampled: estimated_target_probabilities_0/1/2/3/4

We can inspect the dataset with the following code in a Jupyter cell:

import pandas as pd
import nannyml as nml

reference = pd.read_parquet("https://github.com/NannyML/sample_datasets/raw/main/synthetic_pure_covariate_shift_datasets/multiclass_classification/synthetic_custom_metrics_multiclass_classification_reference.pq")
monitored = pd.read_parquet("https://github.com/NannyML/sample_datasets/raw/main/synthetic_pure_covariate_shift_datasets/multiclass_classification/synthetic_custom_metrics_multiclass_classification_monitored.pq")
reference.head(5)

+----+------------+------------+------------+------------+------------+------------+------------+----------+------------------+------------------+------------------+------------------+------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+----------+----------------------------+--------------+
|    | feature1   | feature2   | feature3   | feature4   | feature5   | feature6   | feature7   | y_true   | y_pred_proba_0   | y_pred_proba_1   | y_pred_proba_2   | y_pred_proba_3   | y_pred_proba_4   | estimated_target_probabilities_0   | estimated_target_probabilities_1   | estimated_target_probabilities_2   | estimated_target_probabilities_3   | estimated_target_probabilities_4   | y_pred   | timestamp                  | identifier   |
+====+============+============+============+============+============+============+============+==========+==================+==================+==================+==================+==================+====================================+====================================+====================================+====================================+====================================+==========+============================+==============+
| 0  | 1.00527    | -2.95951   | 3.13132    | 2.26554    | -2.83038   | -2.37214   | -0.403287  | 2        | 0                | 0.01             | 0.97             | 0.02             | 0                | 0.000636401                        | 0.00533408                         | 0.991451                           | 0.00255351                         | 2.47182e-05                        | 2        | 2020-03-25 00:00:00        | 60000        |
+----+------------+------------+------------+------------+------------+------------+------------+----------+------------------+------------------+------------------+------------------+------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+----------+----------------------------+--------------+
| 1  | -1.21882   | -0.494579  | 2.17917    | -0.422763  | 0.578662   | 2.98901    | -1.91584   | 1        | 0                | 0.12             | 0.1              | 0.01             | 0.77             | 0.00375799                         | 0.11238                            | 0.120747                           | 0.024366                           | 0.738749                           | 4        | 2020-03-25 00:02:00.960000 | 60001        |
+----+------------+------------+------------+------------+------------+------------+------------+----------+------------------+------------------+------------------+------------------+------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+----------+----------------------------+--------------+
| 2  | 0.692891   | 1.03325    | 1.46143    | 2.90911    | -0.868391  | 1.58143    | -0.94909   | 1        | 0.17             | 0.16             | 0.58             | 0.08             | 0.01             | 0.0527302                          | 0.113719                           | 0.742181                           | 0.0722241                          | 0.0191459                          | 2        | 2020-03-25 00:04:01.920000 | 60002        |
+----+------------+------------+------------+------------+------------+------------+------------+----------+------------------+------------------+------------------+------------------+------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+----------+----------------------------+--------------+
| 3  | -1.94359   | -0.606053  | 1.77703    | 4.61647    | -1.99186   | -0.307676  | -2.04368   | 2        | 0                | 0.03             | 0.96             | 0.01             | 0                | 0.000126977                        | 0.00524288                         | 0.972035                           | 0.0224324                          | 0.00016306                         | 2        | 2020-03-25 00:06:02.880000 | 60003        |
+----+------------+------------+------------+------------+------------+------------+------------+----------+------------------+------------------+------------------+------------------+------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+----------+----------------------------+--------------+
| 4  | -5.35189   | 0.369351   | -2.86275   | -2.59814   | 1.33145    | -2.88658   | -1.88045   | 3        | 0.12             | 0.25             | 0.01             | 0.42             | 0.2              | 0.0116875                          | 0.140845                           | 0.00724397                         | 0.480283                           | 0.359941                           | 3        | 2020-03-25 00:08:03.840000 | 60004        |
+----+------------+------------+------------+------------+------------+------------+------------+----------+------------------+------------------+------------------+------------------+------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+----------+----------------------------+--------------+

Developing custom multiclass classification metric functions

NannyML Cloud uses two functions for a custom metric. The first is the calculate function, which is mandatory and is used to calculate realized performance for the custom metric. The second is the estimate function, which is optional and is used to do performance estimation for the custom metric when target values are not available.

Custom Functions API

The API of these functions is set by NannyML Cloud and is shown as a template on the New Custom Multiclass Classification Metric screen. They are the same for both binary and multiclass classification.

import pandas as pd

def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    pass


def estimate(
    estimated_target_probabilities: pd.DataFrame,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs,
) -> float:
    pass

Creating the calculate function is simpler and depends on what we want our custom metric to be. Let's describe the data that are available to us when creating the calculate and estimate functions.

y_true: A pandas.Series Python object containing the target column.
y_pred: A pandas.Series Python object containing the model predictions column.
y_pred_proba: A pandas.DataFrame Python object containing the predicted probabilities columns. For multiclass classification it contains one column per class. For binary classification it is a single-column dataframe.
chunk_data: A pandas.DataFrame Python object containing all columns associated with the model. This allows using other columns in the data provided for the calculation of the custom metric.
labels: A Python list object containing the values for the class labels. Currently, for binary classification, only 0 and 1 are supported. This parameter is mostly for multiclass classification.
class_probability_columns: A Python list object containing the names of the class probability columns. The column names of the estimated_target_probabilities and y_pred_proba dataframes are the elements of this list. This helps ensure that the column referring to the appropriate class label is always selected.
estimated_target_probabilities: A pandas.DataFrame Python object containing the calibrated predicted probabilities calculated from the predicted probabilities of the monitored model. For multiclass classification it contains one column per class. For binary classification it is a single-column dataframe.
**kwargs: You can use the keyword arguments placeholder to omit any parameters you don't actually require in your custom metric functions. This keeps your function signatures nice and clean. This also serves as a placeholder for future arguments in later NannyML Cloud versions, intended to make the functions forward compatible.

Note that estimated_target_probabilities are calculated and provided by NannyML. The monitored model's predicted probabilities need not be calibrated for performance estimation to work. To simulate this, the estimated_target_probabilities columns in the dataset we provided contain the probabilities from which the target values have been sampled. While using NannyML Cloud, however, the estimated_target_probabilities are estimated from the provided data.
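To illustrate how these arguments fit together, here is a minimal sketch of a calculate function for a hypothetical metric: the mean probability the model assigned to the true class. The metric itself is only an illustration and is not part of NannyML. The sketch assumes that labels and class_probability_columns are ordered consistently, so that the i-th label corresponds to the i-th probability column. The toy data at the end is made up.

```python
import numpy as np
import pandas as pd

def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list,
    class_probability_columns: list,
    **kwargs
) -> float:
    # Assumption: labels[i] corresponds to class_probability_columns[i].
    column_for_label = dict(zip(labels, class_probability_columns))
    # Name of the probability column that belongs to each row's true class.
    true_class_columns = y_true.map(column_for_label).to_numpy()
    # Positional index of that column within y_pred_proba.
    column_idx = y_pred_proba.columns.get_indexer(true_class_columns)
    row_idx = np.arange(len(y_true))
    # Pick one probability per row and average them.
    return float(y_pred_proba.to_numpy()[row_idx, column_idx].mean())

# Toy data with three classes (hypothetical values).
toy_columns = ['y_pred_proba_0', 'y_pred_proba_1', 'y_pred_proba_2']
toy_proba = pd.DataFrame(
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
    columns=toy_columns,
)
toy_true = pd.Series([0, 1, 2])
toy_pred = pd.Series([0, 1, 2])

toy_result = calculate(
    y_true=toy_true,
    y_pred=toy_pred,
    y_pred_proba=toy_proba,
    chunk_data=toy_proba,
    labels=[0, 1, 2],
    class_probability_columns=toy_columns,
)
print(toy_result)  # mean of 0.7, 0.8 and 0.4
```

Looking columns up through class_probability_columns, rather than hard-coding names such as y_pred_proba_2, keeps the function independent of how the probability columns happen to be named in a given model.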

Custom F_2 score

To create a custom metric from the F_2 score we would create the calculate function below:

import pandas as pd
from sklearn.metrics import fbeta_score

def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    return fbeta_score(y_true, y_pred, beta=2, average='macro')

While the calculate function of the F_2 score is straightforward, this is not the case for the estimate function. In order to create an estimate function we need to understand performance estimation. Reading how CBPE works is enough to do so for classification problems. The key concepts to understand are the estimated confusion matrix elements and how they are created. We can then use the functional form of the F_2 score to estimate the metric.

\mathrm{F}_{\beta} = \frac{(1+\beta^2)\mathrm{TP}}{(1+\beta^2)\mathrm{TP}+\mathrm{FP}+\beta^2\mathrm{FN}}
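As a worked illustration of the formula, the sketch below computes the estimated confusion matrix elements for a single class treated one-vs-rest, on four hypothetical rows, and plugs them into the expression above with beta = 2. The arrays are made-up values and are not taken from the sample dataset.

```python
import numpy as np

# 1 where the model predicted the class, 0 otherwise (hypothetical values).
pred_is_class = np.array([1, 1, 0, 0])
# Estimated probability that each row truly belongs to the class (hypothetical values).
proba_is_class = np.array([0.9, 0.6, 0.2, 0.1])

# Each row contributes fractionally to the confusion matrix elements.
est_tp = np.sum(np.where(pred_is_class == 1, proba_is_class, 0))      # 0.9 + 0.6 = 1.5
est_fp = np.sum(np.where(pred_is_class == 1, 1 - proba_is_class, 0))  # 0.1 + 0.4 = 0.5
est_fn = np.sum(np.where(pred_is_class == 0, proba_is_class, 0))      # 0.2 + 0.1 = 0.3

beta = 2
est_f2 = (1 + beta**2) * est_tp / ((1 + beta**2) * est_tp + est_fp + beta**2 * est_fn)
print(round(est_f2, 4))  # 7.5 / 9.2, approximately 0.8152
```

No target values are used anywhere: a predicted positive with probability 0.9 counts as 0.9 of a true positive and 0.1 of a false positive.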

Lastly, we need to consider how to average the class results. We will use macro averaging. Putting everything together we get:

import numpy as np
import pandas as pd
from sklearn.preprocessing import label_binarize

def estimate(
    estimated_target_probabilities: pd.DataFrame,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    beta = 2

    def estimate_fb(_y_pred, _y_pred_proba, beta) -> float:
        # Estimates the Fb metric for a single class (one-vs-rest).
        #
        # Parameters
        # ----------
        # _y_pred: np.ndarray
        #     Binary indicator: 1 where the class was predicted, 0 otherwise.
        # _y_pred_proba: np.ndarray
        #     Estimated probability that each sample belongs to the class.
        # beta: float
        #     beta parameter
        #
        # Returns
        # -------
        # metric: float
        #     Estimated Fb score.

        # Estimated confusion matrix elements given the estimated probabilities.
        est_tp = np.sum(np.where(_y_pred == 1, _y_pred_proba, 0))
        est_fp = np.sum(np.where(_y_pred == 1, 1 - _y_pred_proba, 0))
        est_fn = np.sum(np.where(_y_pred == 0, _y_pred_proba, 0))

        fbeta = (1 + beta**2) * est_tp / ((1 + beta**2) * est_tp + est_fp + beta**2 * est_fn)
        # A zero denominator yields NaN, which is mapped to 0.
        fbeta = np.nan_to_num(fbeta)
        return fbeta

    estimated_target_probabilities = estimated_target_probabilities.to_numpy()
    # One indicator column per class, in the order given by labels.
    y_preds = label_binarize(y_pred, classes=labels)

    # Estimate the metric per class, then macro-average.
    ovr_estimates = []
    for idx, _ in enumerate(labels):
        ovr_estimates.append(
            estimate_fb(
                y_preds[:, idx],
                estimated_target_probabilities[:, idx],
                beta=beta
            )
        )
    multiclass_metric = np.mean(ovr_estimates)

    return multiclass_metric

We can test those functions on the dataset loaded earlier. Assuming we have run the function definitions above in a Jupyter cell, we can then call them. Running estimate we get:

features = ['feature1', 'feature2', 'feature3', 'feature4', 'feature5', 'feature6', 'feature7']
class_probability_columns = ['y_pred_proba_0', 'y_pred_proba_1', 'y_pred_proba_2', 'y_pred_proba_3', 'y_pred_proba_4',]
labels = [0, 1, 2, 3, 4]
estimated_target_probability_columns = [
    'estimated_target_probabilities_0',
    'estimated_target_probabilities_1',
    'estimated_target_probabilities_2',
    'estimated_target_probabilities_3',
    'estimated_target_probabilities_4'
]

# simulate estimated_target_probabilities and y_pred_proba having the same column names
estimated_target_probabilities = reference[estimated_target_probability_columns].rename(
    columns=dict(zip(estimated_target_probability_columns, class_probability_columns))
)

estimate(
    estimated_target_probabilities=estimated_target_probabilities,
    y_pred=reference['y_pred'],
    y_pred_proba=reference[class_probability_columns],
    chunk_data=reference,
    labels=labels,
    class_probability_columns=class_probability_columns
)

0.6929117226840699

While running calculate we get:

calculate(
    y_true=reference['y_true'],
    y_pred=reference['y_pred'],
    y_pred_proba=reference[class_probability_columns],
    chunk_data=reference,
    labels=labels,
    class_probability_columns=class_probability_columns
)

0.694377528321876

We can see that the estimated and realized F_2 scores are very close. This means that we are likely estimating the metric correctly. The values will rarely match exactly due to the statistical nature of the problem: sampling error will always induce some differences.
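The effect of sampling error can be illustrated with a small simulation. The sketch below is a simplification made for this illustration: it uses accuracy instead of the F_2 score to stay short, and the probabilities are randomly generated rather than taken from the sample dataset. It compares the expected accuracy implied by calibrated probabilities with the accuracy realized when targets are repeatedly sampled from those same probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 2000

# Hypothetical calibrated probabilities that each prediction is correct.
p_correct = rng.uniform(0.5, 1.0, size=n_rows)

# Estimated accuracy: the average probability of being correct.
expected_accuracy = p_correct.mean()

# Realized accuracy for 200 independent draws of the targets.
realized = np.array([
    (rng.random(n_rows) < p_correct).mean()
    for _ in range(200)
])

print(f"expected accuracy:        {expected_accuracy:.4f}")
print(f"mean realized accuracy:   {realized.mean():.4f}")
print(f"spread of realized (std): {realized.std():.4f}")
```

Each individual draw differs slightly from the expected value even though the probabilities are perfectly calibrated, which is the same kind of gap we see between the estimated and realized F_2 scores.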

Testing a Custom Metric in the Cloud product

We saw how to add a multiclass classification custom metric in the Custom Metrics Introductory page. We can further test it by using the dataset in the cloud product. The datasets are publicly available hence we can use the Public Link option when adding data to a new model.

Reference Dataset Public Link:

https://github.com/NannyML/sample_datasets/raw/main/synthetic_pure_covariate_shift_datasets/multiclass_classification/synthetic_custom_metrics_multiclass_classification_reference.pq

Monitored Dataset Public Link:

https://github.com/NannyML/sample_datasets/raw/main/synthetic_pure_covariate_shift_datasets/multiclass_classification/synthetic_custom_metrics_multiclass_classification_monitored.pq

The process of creating a new model is described on the Monitoring a tabular data model page.

We need to be careful to mark the estimated_target_probabilities columns as ignored columns, since they are related to our oracle knowledge of the problem and not to the monitored model the dataset represents.

Note that when we are on the Metrics page, we can go to Performance monitoring and directly add a custom metric we have already specified.

After the model has been added to NannyML Cloud and the first run has been completed we can inspect the monitoring results. Of particular interest to us is the comparison between estimated and realized performance for our custom metric.

We see that NannyML can accurately estimate our custom metric across the whole dataset, even in the areas where there is a performance difference. This means that our calculate and estimate functions have been correctly created, as the dataset was created specifically to facilitate this test.

You may have noticed that for custom metrics we don't have a sampling error implementation. Therefore you will have to make a qualitative judgement, based on the results, whether the estimated and realized performance results are a good enough match or not.

Next Steps

You are now ready to use your new custom metric in production. However, you may want to make your implementation more robust to account for the data you will encounter in production. For example, you can add missing value handling to your implementation.
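As one possible way to add missing value handling, the sketch below extends the F_2 calculate function so that rows with a missing target or prediction are dropped before scoring. Returning NaN when no valid rows remain is an assumption made for this illustration; how NannyML Cloud treats a NaN result should be checked before relying on it. The toy data at the end is made up.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import fbeta_score

def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list,
    class_probability_columns: list,
    **kwargs
) -> float:
    # Keep only rows where both the target and the prediction are present.
    valid = y_true.notna() & y_pred.notna()
    if not valid.any():
        # Assumption: signal "no result" with NaN when nothing can be scored.
        return np.nan
    return fbeta_score(y_true[valid], y_pred[valid], beta=2, average='macro')

# Toy data with one missing target (hypothetical values).
toy_true = pd.Series([0, 1, np.nan, 1])
toy_pred = pd.Series([0, 1, 1, 0])

toy_f2 = calculate(
    y_true=toy_true,
    y_pred=toy_pred,
    y_pred_proba=pd.DataFrame(),
    chunk_data=pd.DataFrame(),
    labels=[0, 1],
    class_probability_columns=[],
)
print(toy_f2)  # scored on the three rows that have a target
```

The estimate function can be hardened in the same spirit, for example by dropping rows where y_pred or any of the estimated_target_probabilities columns is missing.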
