Writing Functions for Multiclass Classification
Writing the functions needed to create a custom multiclass classification metric.
As we have seen on the Introductory Custom Metric page, the key components of a custom multiclass classification metric are the Python functions we must provide for the custom metric to work. Here we will see how to create them.
We will assume the user has access to a Jupyter Notebook running Python with the NannyML open-source library installed.
Sample Dataset
We have created a sample dataset to facilitate developing the code needed for custom multiclass classification metrics. The dataset is publicly accessible here. It is a pure covariate shift dataset that consists of:
7 numerical features:
['feature1', 'feature2', 'feature3', 'feature4', 'feature5', 'feature6', 'feature7']
Target column that contains the 5 classes {0,1,2,3,4}:
y_true
Model prediction column:
y_pred
The model predicted probability columns:
y_pred_proba_0/1/2/3/4
A timestamp column:
timestamp
An identifier column:
identifier
The probabilities from which the target values have been sampled:
estimated_target_probabilities_0/1/2/3/4
We can inspect the dataset with the following code in a Jupyter cell:
import pandas as pd
import nannyml as nml
reference = pd.read_parquet("https://github.com/NannyML/sample_datasets/raw/main/synthetic_pure_covariate_shift_datasets/multiclass_classification/synthetic_custom_metrics_multiclass_classification_reference.pq")
monitored = pd.read_parquet("https://github.com/NannyML/sample_datasets/raw/main/synthetic_pure_covariate_shift_datasets/multiclass_classification/synthetic_custom_metrics_multiclass_classification_monitored.pq")
reference.head(5)
+----+------------+------------+------------+------------+------------+------------+------------+----------+------------------+------------------+------------------+------------------+------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+----------+----------------------------+--------------+
| | feature1 | feature2 | feature3 | feature4 | feature5 | feature6 | feature7 | y_true | y_pred_proba_0 | y_pred_proba_1 | y_pred_proba_2 | y_pred_proba_3 | y_pred_proba_4 | estimated_target_probabilities_0 | estimated_target_probabilities_1 | estimated_target_probabilities_2 | estimated_target_probabilities_3 | estimated_target_probabilities_4 | y_pred | timestamp | identifier |
+====+============+============+============+============+============+============+============+==========+==================+==================+==================+==================+==================+====================================+====================================+====================================+====================================+====================================+==========+============================+==============+
| 0 | 1.00527 | -2.95951 | 3.13132 | 2.26554 | -2.83038 | -2.37214 | -0.403287 | 2 | 0 | 0.01 | 0.97 | 0.02 | 0 | 0.000636401 | 0.00533408 | 0.991451 | 0.00255351 | 2.47182e-05 | 2 | 2020-03-25 00:00:00 | 60000 |
+----+------------+------------+------------+------------+------------+------------+------------+----------+------------------+------------------+------------------+------------------+------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+----------+----------------------------+--------------+
| 1 | -1.21882 | -0.494579 | 2.17917 | -0.422763 | 0.578662 | 2.98901 | -1.91584 | 1 | 0 | 0.12 | 0.1 | 0.01 | 0.77 | 0.00375799 | 0.11238 | 0.120747 | 0.024366 | 0.738749 | 4 | 2020-03-25 00:02:00.960000 | 60001 |
+----+------------+------------+------------+------------+------------+------------+------------+----------+------------------+------------------+------------------+------------------+------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+----------+----------------------------+--------------+
| 2 | 0.692891 | 1.03325 | 1.46143 | 2.90911 | -0.868391 | 1.58143 | -0.94909 | 1 | 0.17 | 0.16 | 0.58 | 0.08 | 0.01 | 0.0527302 | 0.113719 | 0.742181 | 0.0722241 | 0.0191459 | 2 | 2020-03-25 00:04:01.920000 | 60002 |
+----+------------+------------+------------+------------+------------+------------+------------+----------+------------------+------------------+------------------+------------------+------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+----------+----------------------------+--------------+
| 3 | -1.94359 | -0.606053 | 1.77703 | 4.61647 | -1.99186 | -0.307676 | -2.04368 | 2 | 0 | 0.03 | 0.96 | 0.01 | 0 | 0.000126977 | 0.00524288 | 0.972035 | 0.0224324 | 0.00016306 | 2 | 2020-03-25 00:06:02.880000 | 60003 |
+----+------------+------------+------------+------------+------------+------------+------------+----------+------------------+------------------+------------------+------------------+------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+----------+----------------------------+--------------+
| 4 | -5.35189 | 0.369351 | -2.86275 | -2.59814 | 1.33145 | -2.88658 | -1.88045 | 3 | 0.12 | 0.25 | 0.01 | 0.42 | 0.2 | 0.0116875 | 0.140845 | 0.00724397 | 0.480283 | 0.359941 | 3 | 2020-03-25 00:08:03.840000 | 60004 |
+----+------------+------------+------------+------------+------------+------------+------------+----------+------------------+------------------+------------------+------------------+------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+----------+----------------------------+--------------+
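Before writing metric functions it can help to sanity-check the prediction columns. The sketch below retypes the first two rows of the table above by hand, so it runs without downloading the dataset. It checks two properties that hold for the rows shown but are assumptions rather than documented guarantees: the predicted probabilities of a row sum to one, and y_pred is the class with the highest predicted probability.

```python
import numpy as np
import pandas as pd

proba_cols = [f"y_pred_proba_{i}" for i in range(5)]

# First two rows of the reference table above, typed in by hand.
sample = pd.DataFrame(
    [[0.0, 0.01, 0.97, 0.02, 0.0, 2],
     [0.0, 0.12, 0.10, 0.01, 0.77, 4]],
    columns=proba_cols + ["y_pred"],
)

# Each row of predicted probabilities should sum to (approximately) one.
row_sums = sample[proba_cols].sum(axis=1)
sums_ok = bool(np.allclose(row_sums, 1.0))

# The predicted label should match the most probable class in these rows.
argmax_label = sample[proba_cols].to_numpy().argmax(axis=1)
argmax_ok = bool((argmax_label == sample["y_pred"].to_numpy()).all())

print(sums_ok, argmax_ok)
```

The same checks can be run on the full data by replacing sample with reference.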
Developing custom multiclass classification metric functions
NannyML Cloud requires two functions for the custom metric to be used. The first is the calculate function, which is mandatory and is used to calculate realized performance for the custom metric. The second is the estimate function, which is optional and is used to do performance estimation for the custom metric when target values are not available.
Custom Functions API
The API of these functions is set by NannyML Cloud and is shown as a template on the New Custom Multiclass Classification Metric screen. They are the same for both binary and multiclass classification.
import pandas as pd

def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    pass

def estimate(
    estimated_target_probabilities: pd.DataFrame,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs,
) -> float:
    pass
Creating the calculate function is simpler and depends on what we want our custom metric to be. Let's describe the data that are available to us when writing these functions.
y_true: A pandas.Series object containing the target column.
y_pred: A pandas.Series object containing the model predictions column.
y_pred_proba: A pandas.DataFrame object containing the predicted probabilities. For binary classification it has a single column; for multiclass classification it has one column per class, which is why it is a dataframe.
chunk_data: A pandas.DataFrame object containing all columns associated with the model. This allows using other columns in the data provided for the calculation of the custom metric.
labels: A Python list containing the values of the class labels. Currently, for binary classification, only 0 and 1 are supported. This parameter is mostly for multiclass classification.
class_probability_columns: A Python list containing the names of the class probability columns. The column names of the estimated_target_probabilities and y_pred_proba dataframes are the elements of this list. This helps ensure that the column referring to the appropriate class label is always selected.
estimated_target_probabilities: A pandas.DataFrame object containing the calibrated predicted probabilities calculated from the predicted probabilities of the monitored model. For binary classification it has a single column; for multiclass classification it has one column per class.
**kwargs: You can use the keyword arguments placeholder to omit any parameters you don't actually require in your custom metric functions. This keeps your function signatures clean. It also serves as a placeholder for arguments added in later NannyML Cloud versions, which makes the functions forward compatible.
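As a minimal sketch of how **kwargs and class_probability_columns might be used together, the hypothetical metric below returns the mean predicted probability assigned to the true class. It is an illustration only, not a NannyML built-in. The function names only the parameters it needs and lets **kwargs absorb the rest. It assumes that the i-th entry of labels corresponds to the i-th entry of class_probability_columns; the column names p_0, p_1, p_2 and the data are made up.

```python
import numpy as np
import pandas as pd

def calculate(
    y_true: pd.Series,
    y_pred_proba: pd.DataFrame,
    labels: list,
    class_probability_columns: list,
    **kwargs,  # absorbs y_pred, chunk_data and any future arguments
) -> float:
    # Assumption: labels[i] corresponds to class_probability_columns[i].
    label_to_column = dict(zip(labels, class_probability_columns))
    # For each row, find the probability column of its true class.
    true_columns = y_true.map(label_to_column)
    column_index = y_pred_proba.columns.get_indexer(true_columns)
    row_index = np.arange(len(y_pred_proba))
    return float(y_pred_proba.to_numpy()[row_index, column_index].mean())

# Tiny made-up example with three classes.
proba = pd.DataFrame(
    {"p_0": [0.7, 0.2, 0.1], "p_1": [0.2, 0.5, 0.3], "p_2": [0.1, 0.3, 0.6]}
)
result = calculate(
    y_true=pd.Series([0, 1, 1]),
    y_pred=pd.Series([0, 1, 2]),
    y_pred_proba=proba,
    chunk_data=proba,
    labels=[0, 1, 2],
    class_probability_columns=["p_0", "p_1", "p_2"],
)
print(result)
```

Looking columns up by name through class_probability_columns, rather than by position, keeps the function correct even if the dataframe columns arrive in a different order.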
Custom F_2 score
To create a custom metric from the F_2 score we would create the calculate function below:
import pandas as pd
from sklearn.metrics import fbeta_score

def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    # Macro-averaged F-beta score with beta=2 (recall weighted above precision).
    return fbeta_score(y_true, y_pred, beta=2, average='macro')
While the calculate function of the F_2 score is straightforward, this is not the case for the estimate function. In order to create an estimate function we need to understand performance estimation. Reading how CBPE works is enough to do so for classification problems. The key concepts to understand are the estimated confusion matrix elements and how they are created. We can then use the functional form of the F_2 score to estimate the metric.
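A minimal numeric sketch of this idea, for a single class treated one-vs-rest: each prediction contributes its calibrated probability to the expected confusion matrix elements, and those expected counts are then plugged into the F_2 formula. The four samples and their probabilities are made up for illustration.

```python
import numpy as np

# Made-up one-vs-rest example: 1 means "predicted as this class".
y_pred = np.array([1, 1, 0, 0])
# Calibrated probability that each sample truly belongs to this class.
proba = np.array([0.9, 0.6, 0.2, 0.3])
beta = 2

# Expected confusion matrix elements.
est_tp = np.sum(np.where(y_pred == 1, proba, 0))      # 0.9 + 0.6
est_fp = np.sum(np.where(y_pred == 1, 1 - proba, 0))  # 0.1 + 0.4
est_fn = np.sum(np.where(y_pred == 0, proba, 0))      # 0.2 + 0.3

# F-beta written in terms of the confusion matrix elements.
est_f2 = (1 + beta**2) * est_tp / (
    (1 + beta**2) * est_tp + est_fp + beta**2 * est_fn
)
print(round(est_f2, 4))
```

Here the expected counts are 1.5 true positives, 0.5 false positives and 0.5 false negatives, so the estimated F_2 is 7.5 / 10 = 0.75.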
Lastly, we need to consider how to average the class results. We will use macro averaging. Putting everything together we get:
import numpy as np
import pandas as pd
from sklearn.preprocessing import label_binarize

def estimate(
    estimated_target_probabilities: pd.DataFrame,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    beta = 2

    def estimate_fb(_y_pred, _y_pred_proba, beta) -> float:
        # Estimates the Fb metric for one class, treated one-vs-rest.
        #
        # Parameters
        # ----------
        # _y_pred: np.ndarray
        #     Binarized predictions: 1 if the sample is predicted as this class, else 0.
        # _y_pred_proba: np.ndarray
        #     Calibrated probability that the sample belongs to this class.
        # beta: float
        #     beta parameter
        #
        # Returns
        # -------
        # metric: float
        #     Estimated Fb score.
        est_tp = np.sum(np.where(_y_pred == 1, _y_pred_proba, 0))
        est_fp = np.sum(np.where(_y_pred == 1, 1 - _y_pred_proba, 0))
        est_fn = np.sum(np.where(_y_pred == 0, _y_pred_proba, 0))
        # Not needed for Fb; shown to complete the estimated confusion matrix.
        est_tn = np.sum(np.where(_y_pred == 0, 1 - _y_pred_proba, 0))
        fbeta = (1 + beta**2) * est_tp / ((1 + beta**2) * est_tp + est_fp + beta**2 * est_fn)
        # A 0/0 division yields NaN; report it as 0.
        fbeta = np.nan_to_num(fbeta)
        return fbeta

    estimated_target_probabilities = estimated_target_probabilities.to_numpy()
    # One column per class: 1 where y_pred equals that class label.
    y_preds = label_binarize(y_pred, classes=labels)

    ovr_estimates = []
    for idx, _ in enumerate(labels):
        ovr_estimates.append(
            estimate_fb(
                y_preds[:, idx],
                estimated_target_probabilities[:, idx],
                beta=beta
            )
        )
    # Macro averaging: unweighted mean of the per-class estimates.
    multiclass_metric = np.mean(ovr_estimates)

    return multiclass_metric
We can test those functions on the dataset loaded earlier. Assuming we have run the functions as provided in a Jupyter cell, we can then call them. Running estimate we get:
features = ['feature1', 'feature2', 'feature3', 'feature4', 'feature5', 'feature6', 'feature7']
class_probability_columns = ['y_pred_proba_0', 'y_pred_proba_1', 'y_pred_proba_2', 'y_pred_proba_3', 'y_pred_proba_4',]
labels = [0, 1, 2, 3, 4]
estimated_target_probability_columns = [
    'estimated_target_probabilities_0',
    'estimated_target_probabilities_1',
    'estimated_target_probabilities_2',
    'estimated_target_probabilities_3',
    'estimated_target_probabilities_4'
]
# simulate estimated_target_probabilities and y_pred_proba having the same column names
estimated_target_probabilities = reference[estimated_target_probability_columns].rename(
    columns=dict(zip(estimated_target_probability_columns, class_probability_columns))
)
estimate(
    estimated_target_probabilities=estimated_target_probabilities,
    y_pred=reference['y_pred'],
    y_pred_proba=reference[class_probability_columns],
    chunk_data=reference,
    labels=labels,
    class_probability_columns=class_probability_columns
)
0.6929117226840699
While running calculate we get:
calculate(
    y_true=reference['y_true'],
    y_pred=reference['y_pred'],
    y_pred_proba=reference[class_probability_columns],
    chunk_data=reference,
    labels=labels,
    class_probability_columns=class_probability_columns
)
0.694377528321876
We can see that the estimated and realized F_2 scores are very close. This means that we are likely estimating the metric correctly. The values will rarely match exactly due to the statistical nature of the problem: sampling error will always induce some difference.
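To illustrate why a small gap is expected, here is a sketch on fully synthetic data. It uses accuracy rather than F_2 to keep the code short. Targets are drawn from class probabilities that are perfectly calibrated by construction, so any difference between the estimated and realized values comes from sampling alone. The sample size, seed, and Dirichlet distribution are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_classes = 20_000, 5

# Synthetic, perfectly calibrated class probabilities (assumption).
proba = rng.dirichlet(np.ones(n_classes), size=n_samples)
y_pred = proba.argmax(axis=1)

# Draw one target per row from its probabilities (inverse-CDF sampling).
cumulative = proba.cumsum(axis=1)
u = rng.random(n_samples)
y_true = np.minimum((u[:, None] > cumulative).sum(axis=1), n_classes - 1)

# Estimated accuracy: mean probability that the predicted class is correct.
estimated_accuracy = proba[np.arange(n_samples), y_pred].mean()
# Realized accuracy: fraction of predictions that match the sampled targets.
realized_accuracy = (y_true == y_pred).mean()

gap = abs(estimated_accuracy - realized_accuracy)
print(estimated_accuracy, realized_accuracy, gap)
```

With fewer samples per chunk the gap would typically be larger, which is worth keeping in mind when judging chunk-level results.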
Testing a Custom Metric in the Cloud product
We saw how to add a multiclass classification custom metric on the Custom Metrics Introductory page. We can further test it by using the dataset in the cloud product. The datasets are publicly available, so we can use the Public Link option when adding data to a new model.
Reference Dataset Public Link:
https://github.com/NannyML/sample_datasets/raw/main/synthetic_pure_covariate_shift_datasets/multiclass_classification/synthetic_custom_metrics_multiclass_classification_reference.pq
Monitored Dataset Public Link:
https://github.com/NannyML/sample_datasets/raw/main/synthetic_pure_covariate_shift_datasets/multiclass_classification/synthetic_custom_metrics_multiclass_classification_monitored.pq
The process of creating a new model is described on the Monitoring a tabular data model page.
We need to be careful to mark the estimated_target_probabilities columns as ignored columns, since they relate to our oracle knowledge of the problem and not to the monitored model the dataset represents.

Note that when we are on the Metrics page we can go to Performance monitoring and directly add a custom metric we have already specified.

After the model has been added to NannyML Cloud and the first run has been completed we can inspect the monitoring results. Of particular interest to us is the comparison between estimated and realized performance for our custom metric.

We see that NannyML can accurately estimate our custom metric across the whole dataset, even in the areas where there is a performance difference. This means that our calculate and estimate functions have been created correctly, as the dataset was built specifically to facilitate this test.
You may have noticed that for custom metrics we don't have a sampling error implementation. Therefore you will have to make a qualitative judgement, based on the results, whether the estimated and realized performance results are a good enough match or not.
Next Steps
You are now ready to use your new custom metric in production. However, you may want to make your implementation more robust to account for the data you will encounter in production. For example, you can add missing value handling to your implementation.
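As one possible sketch of missing value handling, the calculate function below drops rows where either the target or the prediction is missing before computing the F_2 score. It assumes integer class labels, as in the sample dataset, and that y_true and y_pred share the same index. Returning NaN when no usable rows remain is an assumption made for illustration; check how NannyML Cloud treats such a value before relying on it.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import fbeta_score

def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    **kwargs,  # absorbs the parameters this metric does not need
) -> float:
    # Keep only rows where both the target and the prediction are present.
    mask = y_true.notna() & y_pred.notna()
    if not mask.any():
        # Assumption: NaN signals that there was nothing to score.
        return float("nan")
    return float(
        fbeta_score(
            y_true[mask].astype(int),  # assumes integer class labels
            y_pred[mask].astype(int),
            beta=2,
            average="macro",
        )
    )

# Made-up data with one missing target and one missing prediction.
y_true = pd.Series([0, 1, 2, np.nan, 1, 2])
y_pred = pd.Series([0, 1, 2, 1, np.nan, 2])
score = calculate(y_true=y_true, y_pred=y_pred, labels=[0, 1, 2])
print(score)
```

Dropping rows is only one option; depending on the use case, it may be preferable to count missing predictions as errors instead.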