Advanced Tutorial. Handling missing values with your custom metric functions.
In previous tutorials, we saw how to create the functions needed for simple custom metrics for binary classification, multiclass classification, and regression. Let's see how we can improve that code so that it handles missing values in our data. As before, we assume the user has access to a Jupyter Notebook Python environment with the NannyML open-source library installed.
Handling missing values in binary classification
Let's load the covariate shift dataset we have been using and add some missing values.
import numpy as np
import pandas as pd
import nannyml as nml

# Uncomment the code below if needed to filter out warnings
# import warnings
# warnings.filterwarnings('ignore')

# Uncomment the code below if needed to see logging messages
# import logging
# logging.basicConfig(level=logging.DEBUG)

reference = pd.read_parquet("https://github.com/NannyML/sample_datasets/raw/main/synthetic_pure_covariate_shift_datasets/binary_classification/synthetic_custom_metrics_binary_classification_reference.pq")
monitored = pd.read_parquet("https://github.com/NannyML/sample_datasets/raw/main/synthetic_pure_covariate_shift_datasets/binary_classification/synthetic_custom_metrics_binary_classification_monitored.pq")

# Overwrite slices of the reference data with missing values
reference.y_pred.iloc[11_000:13_000] = np.nan
reference.y_true.iloc[17_000:19_000] = np.nan
reference.y_pred.iloc[21_000:23_000] = np.nan
reference.y_true.iloc[27_000:29_000] = np.nan
reference.y_pred.iloc[31_000:33_000] = np.nan
reference.y_true.iloc[37_000:39_000] = np.nan
reference.y_pred_proba.iloc[17_000:19_000] = np.nan
reference.y_pred_proba.iloc[27_000:29_000] = np.nan
reference.y_pred_proba.iloc[37_000:39_000] = np.nan
As a reminder, here are the custom metric functions for the F_2 metric that we already created.
import pandas as pd
from sklearn.metrics import fbeta_score

def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    # labels and class_probability_columns are only needed for multiclass classification
    # and can be ignored for binary classification custom metrics
    return fbeta_score(y_true, y_pred, beta=2)
import numpy as np
import pandas as pd

def estimate(
    estimated_target_probabilities: pd.DataFrame,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    # labels and class_probability_columns are only needed for multiclass classification
    # and can be ignored for binary classification custom metrics
    estimated_target_probabilities = estimated_target_probabilities.to_numpy().ravel()
    y_pred = y_pred.to_numpy()

    # Create estimated confusion matrix elements
    est_tp = np.sum(np.where(y_pred == 1, estimated_target_probabilities, 0))
    est_fp = np.sum(np.where(y_pred == 1, 1 - estimated_target_probabilities, 0))
    est_fn = np.sum(np.where(y_pred == 0, estimated_target_probabilities, 0))
    est_tn = np.sum(np.where(y_pred == 0, 1 - estimated_target_probabilities, 0))

    beta = 2
    fbeta = (1 + beta**2) * est_tp / ((1 + beta**2) * est_tp + est_fp + beta**2 * est_fn)
    fbeta = np.nan_to_num(fbeta)
    return fbeta
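To see why missing values deserve attention, it can help to run the arithmetic from the estimate function above on a few toy values. The sketch below is illustrative only: the arrays are made up and do not come from the tutorial dataset. It suggests that a missing probability turns the sums into NaN, that a missing prediction is silently skipped because it matches neither class, and that np.nan_to_num then reports a score of 0.

```python
import numpy as np

# Toy inputs; the values are illustrative and not taken from the tutorial dataset.
y_pred = np.array([1.0, 1.0, 0.0, np.nan])
estimated_target_probabilities = np.array([0.9, np.nan, 0.2, 0.7])

# Same confusion matrix arithmetic as in the estimate function above.
# The NaN probability on a predicted-positive row makes these two sums NaN.
est_tp = np.sum(np.where(y_pred == 1, estimated_target_probabilities, 0))
est_fp = np.sum(np.where(y_pred == 1, 1 - estimated_target_probabilities, 0))

# The NaN prediction equals neither 0 nor 1, so the last row contributes nothing.
est_fn = np.sum(np.where(y_pred == 0, estimated_target_probabilities, 0))

beta = 2
fbeta = (1 + beta**2) * est_tp / ((1 + beta**2) * est_tp + est_fp + beta**2 * est_fn)

# nan_to_num converts the NaN score to 0.0, which hides the missing data.
fbeta = np.nan_to_num(fbeta)
```

In this toy case the function would return 0.0 without any error, so the effect of missing values could easily go unnoticed.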
There is an open question of how to deal with the missing values. This is ultimately up to the user and the particular use case for which the custom metric is being created. Here we will show how to remove rows containing missing values before the custom metric is computed. With this approach, the custom metric functions become:
import numpy as np
import pandas as pd

def estimate(
    estimated_target_probabilities: pd.DataFrame,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    # labels and class_probability_columns are only needed for multiclass classification
    # and can be ignored for binary classification custom metrics
    data = pd.DataFrame({
        'estimated_target_probabilities': estimated_target_probabilities.to_numpy().ravel(),
        'y_pred_proba': y_pred_proba.to_numpy().ravel(),
        'y_pred': y_pred,
    })
    # Drop every row that has a missing value in any of the columns above
    data.dropna(axis=0, inplace=True)
    y_pred = data.y_pred.to_numpy()
    estimated_target_probabilities = data.estimated_target_probabilities.to_numpy()

    # Create estimated confusion matrix elements
    est_tp = np.sum(np.where(y_pred == 1, estimated_target_probabilities, 0))
    est_fp = np.sum(np.where(y_pred == 1, 1 - estimated_target_probabilities, 0))
    est_fn = np.sum(np.where(y_pred == 0, estimated_target_probabilities, 0))
    est_tn = np.sum(np.where(y_pred == 0, 1 - estimated_target_probabilities, 0))

    beta = 2
    fbeta = (1 + beta**2) * est_tp / ((1 + beta**2) * est_tp + est_fp + beta**2 * est_fn)
    fbeta = np.nan_to_num(fbeta)
    return fbeta
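The same row-dropping idea can be applied to the calculate function. The following is a minimal sketch rather than a definitive implementation: it keeps the signature shown earlier, drops rows where either y_true or y_pred is missing, and then calls fbeta_score. Returning NaN when no complete rows remain and casting the cleaned columns to int are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.metrics import fbeta_score

def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    # labels and class_probability_columns are only needed for multiclass classification
    # and can be ignored for binary classification custom metrics
    data = pd.DataFrame({
        'y_true': y_true.to_numpy(),
        'y_pred': y_pred.to_numpy(),
    })
    # Keep only rows where both the target and the prediction are present
    data = data.dropna(axis=0)

    # Assumption for this sketch: report NaN when nothing is left to score
    if data.empty:
        return float('nan')

    # Columns holding NaN are stored as floats; cast back to integer class labels
    return fbeta_score(data['y_true'].astype(int), data['y_pred'].astype(int), beta=2)
```

For example, with y_true = [1, 1, 0, NaN, 1] and y_pred = [1, 0, NaN, 1, 1], only three complete rows remain and the F_2 score is computed from those alone.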
Looking at the estimate function, it is clear that even here decisions may need to be made. For example, we have to decide which columns to include in the DataFrame from which rows with missing values are dropped. Again, this can depend on the use case and on what data the function is expected to handle.
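The effect of the column choice can be illustrated with the subset parameter of pandas' dropna. This is a small sketch on made-up values, not a recommendation: it compares dropping a row when any of the three columns is missing with dropping a row only when a column that the F_2 estimate actually reads is missing.

```python
import numpy as np
import pandas as pd

# Toy data; the values are illustrative and not taken from the tutorial dataset.
data = pd.DataFrame({
    'estimated_target_probabilities': [0.9, 0.8, np.nan, 0.3],
    'y_pred_proba': [0.85, np.nan, 0.4, 0.2],
    'y_pred': [1, 1, 0, np.nan],
})

# Option 1: drop a row if any of the three columns is missing
strict = data.dropna(axis=0)

# Option 2: drop a row only if a column used by the F_2 arithmetic is missing
needed = ['estimated_target_probabilities', 'y_pred']
lenient = data.dropna(axis=0, subset=needed)

print(len(strict), len(lenient))
```

Here the stricter option keeps one row and the more lenient option keeps two, because the second row is missing only y_pred_proba. Which behavior is appropriate depends on the use case.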
We can now test our functions to see if they are robust when they encounter missing values: