Handling Missing Values

Advanced Tutorial. Handling missing values with your custom metric functions.

In previous tutorials, we saw how to create the functions needed for simple custom metrics for binary classification, multiclass classification, and regression. Let's see how we can improve that code so it can handle missing values in our data. As before, we assume the user has access to a Jupyter Notebook Python environment with the NannyML open-source library installed.

Handling missing values in binary classification

Let's load the covariate shift dataset we have been using and add some missing values.

import numpy as np
import pandas as pd
import nannyml as nml

# Uncomment the code below if you want to filter out warnings
# import warnings
# warnings.filterwarnings('ignore')

# Uncomment the code below if you want to see logging messages
# import logging
# logging.basicConfig(level=logging.DEBUG)

reference = pd.read_parquet("https://github.com/NannyML/sample_datasets/raw/main/synthetic_pure_covariate_shift_datasets/binary_classification/synthetic_custom_metrics_binary_classification_reference.pq")
monitored = pd.read_parquet("https://github.com/NannyML/sample_datasets/raw/main/synthetic_pure_covariate_shift_datasets/binary_classification/synthetic_custom_metrics_binary_classification_monitored.pq")

# Assign through the DataFrame itself rather than a chained
# `reference.y_pred.iloc[...]`, which pandas may apply to a temporary copy.
y_pred_col = reference.columns.get_loc("y_pred")
other_cols = [reference.columns.get_loc(c) for c in ("y_true", "y_pred_proba")]
for start in (11_000, 21_000, 31_000):
    reference.iloc[start:start + 2_000, y_pred_col] = np.nan
for start in (17_000, 27_000, 37_000):
    reference.iloc[start:start + 2_000, other_cols] = np.nan

As a reminder, here are the custom metric functions we already created for the F_2 metric.

import pandas as pd
from sklearn.metrics import fbeta_score

def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    # labels and class_probability_columns are only needed for multiclass classification
    # and can be ignored for binary classification custom metrics
    return fbeta_score(y_true, y_pred, beta=2)

import numpy as np
import pandas as pd

def estimate(
    estimated_target_probabilities: pd.DataFrame,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    # labels and class_probability_columns are only needed for multiclass classification
    # and can be ignored for binary classification custom metrics

    estimated_target_probabilities = estimated_target_probabilities.to_numpy().ravel()
    y_pred = y_pred.to_numpy()

    # Create estimated confusion matrix elements
    est_tp = np.sum(np.where(y_pred == 1, estimated_target_probabilities, 0))
    est_fp = np.sum(np.where(y_pred == 1, 1 - estimated_target_probabilities, 0))
    est_fn = np.sum(np.where(y_pred == 0, estimated_target_probabilities, 0))
    est_tn = np.sum(np.where(y_pred == 0, 1 - estimated_target_probabilities, 0))

    beta = 2
    fbeta = (1 + beta**2) * est_tp / ((1 + beta**2) * est_tp + est_fp + beta**2 * est_fn)
    fbeta = np.nan_to_num(fbeta)
    return fbeta
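
Before changing anything, it helps to see what the arithmetic in the estimate function does when it meets missing values. The following is a minimal sketch on made-up toy arrays (not the tutorial dataset), and `estimated_f2` is a hypothetical helper that only repeats the formula above. A NaN in y_pred fails both `y_pred == 1` and `y_pred == 0`, so that row is silently skipped, while a NaN in the estimated probabilities turns the sums into NaN, which `np.nan_to_num` then converts to 0.

```python
import numpy as np

def estimated_f2(y_pred: np.ndarray, probs: np.ndarray) -> float:
    # Hypothetical helper: same arithmetic as the estimate function above,
    # applied to plain NumPy arrays.
    est_tp = np.sum(np.where(y_pred == 1, probs, 0))
    est_fp = np.sum(np.where(y_pred == 1, 1 - probs, 0))
    est_fn = np.sum(np.where(y_pred == 0, probs, 0))
    beta = 2
    fbeta = (1 + beta**2) * est_tp / ((1 + beta**2) * est_tp + est_fp + beta**2 * est_fn)
    return float(np.nan_to_num(fbeta))

# Toy values chosen for illustration only.
clean_pred = np.array([1.0, 1.0, 0.0, 0.0])
clean_prob = np.array([0.9, 0.6, 0.2, 0.1])

# A missing prediction: the row matches neither class and is skipped.
pred_with_nan = np.array([1.0, np.nan, 0.0, 0.0])
# A missing probability on a row predicted positive: the sums become NaN.
prob_with_nan = np.array([0.9, np.nan, 0.2, 0.1])

f2_clean = estimated_f2(clean_pred, clean_prob)
f2_missing_pred = estimated_f2(pred_with_nan, clean_prob)
f2_missing_prob = estimated_f2(clean_pred, prob_with_nan)
print(f2_clean, f2_missing_pred, f2_missing_prob)
```

In this toy case the last value is exactly 0.0, which looks like a valid but terrible score rather than a sign of missing data. That is one reason to handle missing values explicitly.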

How to deal with missing values is an open question. The answer is ultimately up to the user and depends on the use case for which the custom metric is being created. Here we show how to remove rows containing missing values before the custom metric is calculated. With this change, the custom metric functions become:

import pandas as pd
from sklearn.metrics import fbeta_score

def calculate(
    y_true: pd.Series,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    **kwargs
) -> float:
    data = pd.DataFrame({
        'y_true': y_true,
        'y_pred': y_pred
    })
    data.dropna(axis=0, inplace=True)
    return fbeta_score(data.y_true, data.y_pred, beta=2)

import numpy as np
import pandas as pd

def estimate(
    estimated_target_probabilities: pd.DataFrame,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
    chunk_data: pd.DataFrame,
    labels: list[str],
    class_probability_columns: list[str],
    **kwargs
) -> float:
    # labels and class_probability_columns are only needed for multiclass classification
    # and can be ignored for binary classification custom metrics

    data = pd.DataFrame({
        'estimated_target_probabilities': estimated_target_probabilities.to_numpy().ravel(),
        'y_pred_proba': y_pred_proba.to_numpy().ravel(),
        'y_pred': y_pred,
    })
    data.dropna(axis=0, inplace=True)
    y_pred = data.y_pred.to_numpy()
    estimated_target_probabilities = data.estimated_target_probabilities.to_numpy()

    est_tp = np.sum(np.where(y_pred == 1, estimated_target_probabilities, 0))
    est_fp = np.sum(np.where(y_pred == 1, 1 - estimated_target_probabilities, 0))
    est_fn = np.sum(np.where(y_pred == 0, estimated_target_probabilities, 0))
    est_tn = np.sum(np.where(y_pred == 0, 1 - estimated_target_probabilities, 0))

    beta = 2
    fbeta = (1 + beta**2) * est_tp / ((1 + beta**2) * est_tp + est_fp + beta**2 * est_fn)
    fbeta = np.nan_to_num(fbeta)
    return fbeta

Looking at the estimate function, we can see that even there decisions need to be made, for example which columns to include when dropping rows that contain missing values. Again, this depends on the use case and on what data the function is expected to handle.
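
The choice of columns can be made explicit with the `subset` argument of `DataFrame.dropna`. The sketch below uses a made-up four-row frame (not the tutorial dataset) to compare dropping on every column against dropping only on the columns the F_2 estimate actually reads. Which variant is appropriate is an assumption that depends on your use case.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the column names mirror the estimate function.
data = pd.DataFrame({
    "estimated_target_probabilities": [0.9, 0.6, np.nan, 0.1],
    "y_pred_proba": [0.8, np.nan, 0.3, 0.2],
    "y_pred": [1.0, 1.0, 0.0, np.nan],
})

# Drop a row if any of the three columns is missing.
strict = data.dropna(axis=0)

# Drop a row only if a column used in the computation is missing.
used_columns = ["estimated_target_probabilities", "y_pred"]
lenient = data.dropna(axis=0, subset=used_columns)

print(len(strict), len(lenient))  # prints: 1 2
```

The second row survives the lenient variant because only its y_pred_proba value is missing, and the F_2 estimate never reads that column.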

We can now test our functions to see if they are robust when they encounter missing values:

class_probability_columns = ['y_pred_proba',]
labels = [0, 1]

calculate(
    reference['y_true'],
    reference['y_pred'],
    reference[class_probability_columns],
    reference,
    labels=labels,
    class_probability_columns=class_probability_columns
)

0.8081988545474137

estimate(
    reference[['estimated_target_probabilities']],
    reference['y_pred'],
    reference[class_probability_columns],
    reference,
    labels=labels,
    class_probability_columns=class_probability_columns
)

0.8081160256970812
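
If the sample dataset is not at hand, a similar check can be run on synthetic data. The sketch below is a stand-in under stated assumptions, not part of the NannyML API: it repeats the missing-value-aware estimate logic in a hypothetical helper named `estimate_f2_dropna`, injects NaN into random toy data, and compares the result with what the same formula gives on the rows that have no missing values.

```python
import numpy as np
import pandas as pd

def estimate_f2_dropna(
    estimated_target_probabilities: pd.DataFrame,
    y_pred: pd.Series,
    y_pred_proba: pd.DataFrame,
) -> float:
    # Hypothetical helper: same steps as the estimate function above.
    data = pd.DataFrame({
        "estimated_target_probabilities": estimated_target_probabilities.to_numpy().ravel(),
        "y_pred_proba": y_pred_proba.to_numpy().ravel(),
        "y_pred": y_pred.to_numpy(),
    }).dropna(axis=0)
    pred = data["y_pred"].to_numpy()
    probs = data["estimated_target_probabilities"].to_numpy()
    est_tp = np.sum(np.where(pred == 1, probs, 0))
    est_fp = np.sum(np.where(pred == 1, 1 - probs, 0))
    est_fn = np.sum(np.where(pred == 0, probs, 0))
    beta = 2
    fbeta = (1 + beta**2) * est_tp / ((1 + beta**2) * est_tp + est_fp + beta**2 * est_fn)
    return float(np.nan_to_num(fbeta))

# Synthetic toy data, not the tutorial dataset.
rng = np.random.default_rng(0)
n = 1_000
proba = rng.uniform(size=n)
toy = pd.DataFrame({
    "y_pred_proba": proba,
    "estimated_target_probabilities": proba,  # assumption: scores used as-is
    "y_pred": (proba >= 0.5).astype(float),
})

# Inject missing values into two different columns.
toy.loc[toy.index[100:200], "y_pred"] = np.nan
toy.loc[toy.index[300:400], "y_pred_proba"] = np.nan
complete_rows = toy.dropna(axis=0)

with_missing = estimate_f2_dropna(
    toy[["estimated_target_probabilities"]], toy["y_pred"], toy[["y_pred_proba"]]
)
without_missing = estimate_f2_dropna(
    complete_rows[["estimated_target_probabilities"]],
    complete_rows["y_pred"],
    complete_rows[["y_pred_proba"]],
)
print(np.isclose(with_missing, without_missing))
```

The two values agree because the function drops exactly the rows that `complete_rows` excludes, so the missing values no longer influence the result.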

Next Steps

We can now test our new functions by creating a new custom metric either through the GUI of the web interface or by using the NannyML Cloud SDK.