Advanced Tutorial. Handling missing values with your custom metric functions.
In previous tutorials, we saw how to create the functions needed for simple custom metrics for binary classification, multiclass classification, and regression. Let's see how we can improve that code so it can handle missing values in our data. As before, we assume the user has access to a Jupyter Notebook Python environment with the NannyML open-source library installed.
Handling missing values in binary classification
Let's load the covariate shift dataset we have been using and add some missing values.
import numpy as np
import pandas as pd
import nannyml as nml
# Uncomment the code below, if needed, to filter out warnings
# import warnings
# warnings.filterwarnings('ignore')
# Uncomment the code below, if needed, to see logging messages
# import logging
# logging.basicConfig(level=logging.DEBUG)
reference = pd.read_parquet("https://github.com/NannyML/sample_datasets/raw/main/synthetic_pure_covariate_shift_datasets/binary_classification/synthetic_custom_metrics_binary_classification_reference.pq")
monitored = pd.read_parquet("https://github.com/NannyML/sample_datasets/raw/main/synthetic_pure_covariate_shift_datasets/binary_classification/synthetic_custom_metrics_binary_classification_monitored.pq")
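# Assign through a single .iloc call on the DataFrame: chained assignment such as
# reference.y_pred.iloc[...] = np.nan may not modify reference under pandas copy-on-write
y_true_col = reference.columns.get_loc("y_true")
y_pred_col = reference.columns.get_loc("y_pred")
y_pred_proba_col = reference.columns.get_loc("y_pred_proba")
reference.iloc[11_000:13_000, y_pred_col] = np.nan
reference.iloc[17_000:19_000, y_true_col] = np.nan
reference.iloc[21_000:23_000, y_pred_col] = np.nan
reference.iloc[27_000:29_000, y_true_col] = np.nan
reference.iloc[31_000:33_000, y_pred_col] = np.nan
reference.iloc[37_000:39_000, y_true_col] = np.nan
reference.iloc[17_000:19_000, y_pred_proba_col] = np.nan
reference.iloc[27_000:29_000, y_pred_proba_col] = np.nan
reference.iloc[37_000:39_000, y_pred_proba_col] = np.nan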
As a reminder, here are the custom metric functions for the F_2 metric that we created earlier.
import pandas as pd
from sklearn.metrics import fbeta_score
def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    # labels and class_probability_columns are only needed for multiclass classification
    # and can be ignored for binary classification custom metrics
    return fbeta_score(y_true, y_pred, beta=2)
How to deal with the missing values is an open question. The answer is ultimately up to the user and the particular use case for which the custom metric is being created. Here we show how to remove rows containing missing values before the custom metric is calculated. With this change, the custom metric functions become:
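A minimal sketch of a calculate function that tolerates missing values, assuming that a row should be dropped whenever either y_true or y_pred is missing. Returning NaN for a chunk with no complete rows, and casting the remaining values to int, are assumptions of this sketch rather than something the tutorial prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import fbeta_score


def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    # Pair targets and predictions by position so that a missing value
    # in either column removes the whole row.
    data = pd.DataFrame({
        "y_true": y_true.to_numpy(),
        "y_pred": y_pred.to_numpy(),
    })
    data = data.dropna(axis=0)

    # Assumption: a chunk without any complete row yields NaN.
    if data.empty:
        return np.nan

    # NaN forces a float dtype; cast back to int labels before scoring.
    return fbeta_score(
        data["y_true"].astype(int),
        data["y_pred"].astype(int),
        beta=2,
    )
```

Building a temporary DataFrame keeps the two columns aligned, so a single dropna call removes the same rows from both.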
Looking at the estimate function, we can see that decisions need to be made even there, for example which columns to pass to the function that drops rows containing missing values. Again, this depends on the use case and on the data the function is expected to handle.
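To make the column choice concrete, here is a sketch of an estimate function that drops a row only when the estimated target probability or the prediction is missing. The signature, in particular the estimated_target_probabilities argument, is assumed by analogy with calculate and should be checked against the earlier tutorial. Computing the F_2 estimate from an expected confusion matrix is likewise an assumption of this sketch.

```python
import numpy as np
import pandas as pd


def estimate(
    estimated_target_probabilities: pd.DataFrame,  # assumed argument name
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    # Decision point: only these two columns decide which rows are dropped.
    data = pd.DataFrame({
        "estimated_target_probabilities": estimated_target_probabilities.to_numpy().ravel(),
        "y_pred": y_pred.to_numpy(),
    })
    data = data.dropna(axis=0)

    # Assumption: a chunk without any complete row yields NaN.
    if data.empty:
        return np.nan

    probs = data["estimated_target_probabilities"].to_numpy()
    preds = data["y_pred"].to_numpy()

    # Expected confusion matrix entries needed for F-beta.
    tp = np.sum(np.where(preds == 1, probs, 0))
    fp = np.sum(np.where(preds == 1, 1 - probs, 0))
    fn = np.sum(np.where(preds == 0, probs, 0))

    beta = 2
    denominator = (1 + beta**2) * tp + fp + beta**2 * fn
    if denominator == 0:
        return 0.0
    return float((1 + beta**2) * tp / denominator)
```

Adding y_pred_proba to the temporary DataFrame would also drop rows where only the predicted probability is missing, which may or may not be desirable for a given use case.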
We can now test our functions to see if they are robust when they encounter missing values:
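One way to run such a check without downloading the dataset is sketched below. It uses a small synthetic frame with the same column names and calls a calculate-style function chunk by chunk. The helper run_on_chunks, the synthetic data, and stand_in_metric are all hypothetical names introduced for illustration; in practice the calculate function under test would be passed in instead of the stand-in.

```python
import numpy as np
import pandas as pd


def run_on_chunks(metric_fn, frame: pd.DataFrame, chunk_size: int) -> list[float]:
    """Call a calculate-style metric function on consecutive chunks of frame."""
    results = []
    for start in range(0, len(frame), chunk_size):
        chunk = frame.iloc[start:start + chunk_size]
        value = metric_fn(
            y_true=chunk["y_true"],
            y_pred=chunk["y_pred"],
            y_pred_proba=chunk[["y_pred_proba"]],
            chunk_data=chunk,
            labels=[],
            class_probability_columns=[],
        )
        results.append(float(value))
    return results


# Synthetic stand-in for the reference data, with blocks of missing values.
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "y_true": rng.integers(0, 2, size=1_000).astype(float),
    "y_pred": rng.integers(0, 2, size=1_000).astype(float),
    "y_pred_proba": rng.random(1_000),
})
frame.iloc[100:200, frame.columns.get_loc("y_pred")] = np.nan
frame.iloc[300:400, frame.columns.get_loc("y_true")] = np.nan


def stand_in_metric(y_true, y_pred, **kwargs) -> float:
    # Hypothetical placeholder metric: accuracy over complete rows only.
    data = pd.DataFrame({
        "y_true": y_true.to_numpy(),
        "y_pred": y_pred.to_numpy(),
    }).dropna(axis=0)
    if data.empty:
        return float("nan")
    return float((data["y_true"] == data["y_pred"]).mean())


results = run_on_chunks(stand_in_metric, frame, chunk_size=250)
print(results)
```

If the metric function is robust, every chunk returns a float without raising, including the chunks that overlap the blocks of missing values.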