How to get data ready for NannyML

This tutorial is a practical, step-by-step guide to getting your data into the right format for monitoring your models with NannyML. We will start by building a simple model that we will monitor later.

Step 1: Have a trained model (prerequisite)

If you are looking into monitoring, you probably already have a model, a training set, and most likely some production (monitored) data on which you want to check the model's performance. This tutorial uses the Banana Quality Prediction dataset to illustrate how to prepare your data.

Let's start by partitioning our data into three sets:

  • Training

  • Testing

  • Monitored (production data)

In a real scenario, you won't need to create the monitored partition, since that data arrives naturally after the model is deployed.

If you already have a trained model, skip to Step 2: Building a reference dataset.

import pandas as pd

url = 'https://raw.githubusercontent.com/NannyML/sample_datasets/main/banana_quality/banana_quality.csv' 
data = pd.read_csv(url)
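# Positional split: 4000 training rows, 2000 testing rows, 2000 monitored rows.
# .copy() avoids pandas' SettingWithCopyWarning when columns are added later.
train_df = data.iloc[0:4000].copy()
test_df = data.iloc[4000:6000].copy()
monitor_df = data.iloc[6000:8000].copy()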
data.head()
Identifier  Timestamp                 Size     Weight  Sweetness   Softness  HarvestTime   Ripeness    Acidity  Quality
         0  2020-01-01 00:00:00   2.622631  -1.531197  -0.293286   2.086239     4.379607   0.865033   0.139479        1
         1  2020-01-01 00:30:00  -3.592020   0.220773   0.298705  -1.759699    -4.116195   0.604932  -2.387580        1
         2  2020-01-01 01:00:00   1.270482  -3.959625  -3.447087   0.799014     0.148791  -2.098769   0.727079        0
         3  2020-01-01 01:30:00  -0.002022  -1.417225  -1.847301   0.301628     1.364337   0.265501  -0.458392        1
         4  2020-01-01 02:00:00  -0.786569  -4.323583   1.466369   0.157923    -2.063598   2.619294  -0.340423        0

The Quality column is the target variable. It is encoded as:

  • 1 - Good banana quality

  • 0 - Bad banana quality

Once we have the data separated, we can build a simple model.

from lightgbm import LGBMClassifier

features = [
    'Size', 'Weight', 'Sweetness', 'Softness', 
    'HarvestTime', 'Ripeness', 'Acidity'
]
target = 'Quality'
timestamp = 'Timestamp'

clf = LGBMClassifier()
clf.fit(train_df[features], train_df[target])

Let's quickly check its performance on the training and testing sets before moving to the next step.

from sklearn.metrics import roc_auc_score

roc_auc_train = roc_auc_score(
    train_df[target], clf.predict_proba(train_df[features])[:, 1]
)
roc_auc_test = roc_auc_score(
    test_df[target], clf.predict_proba(test_df[features])[:, 1]
)

print(f"ROC-AUC (train): {roc_auc_train}")
print(f"ROC-AUC (test): {roc_auc_test}")
ROC-AUC (train): 1.0
ROC-AUC (test): 0.9931109775641026

The model performs well on both the training and testing sets.

Step 2: Building a reference dataset

Once we have a model that performs well on the test set, we can use it to build a reference dataset.

Reference dataset: The reference dataset is a benchmark set where the model behaves as expected. It is important that this dataset was not seen by the model during training. It should include targets and the model's predictions. In this tutorial, we will use the test set to build the reference set.

NannyML uses the reference dataset to establish a baseline of how the monitoring data and the model predictions should look after deployment. Internally, it uses it to calibrate most of its performance estimation methods and calculate the default threshold values for every monitored metric.

To transform the test dataset into a reference dataset, all we need to do is add the predicted probabilities and the model predictions to the test dataset.

This means that we will create the following columns:

  • y_pred_proba: contains the model's predicted probabilities (scores) for class 1 (Good quality banana, in this case).

  • y_pred: contains the model's predicted class (0 or 1).

reference_df = test_df.copy()

reference_df['y_pred_proba'] = clf.predict_proba(test_df[features])[:, 1]
reference_df['y_pred'] = clf.predict(test_df[features])
reference_df.head()
Identifier  Timestamp                 Size     Weight  Sweetness   Softness  HarvestTime   Ripeness    Acidity  Quality  y_pred_proba  y_pred
      4000  2020-03-24 08:00:00  -1.005942   0.457455  -0.819113   1.014088     1.380156  -2.005996   3.522873        0      0.002104       0
      4001  2020-03-24 08:30:00  -2.425685   1.682808  -2.041250  -1.925259    -1.118302  -1.264770  -3.848056        0      0.006822       0
      4002  2020-03-24 09:00:00   0.363760  -1.777900  -0.585864   2.736269     3.509048   1.986182  -1.747902        1      0.998873       1
      4003  2020-03-24 09:30:00  -2.206337  -3.712593   0.734068  -2.015895    -3.488548   0.747031   0.754206        0      0.019571       0
      4004  2020-03-24 10:00:00   0.649706  -3.580519   0.352195   1.058788     0.834113   3.243598  -3.792337        1      0.967419       1

We recommend storing your data as parquet files.

NannyML Cloud supports both parquet and CSV files, but CSV files don't store data type information, so incorrect data types may be inferred when they are read. If you later add more data to the model using the SDK or parquet files, a data type conflict may occur.
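To illustrate why data type information matters, here is a minimal sketch using a small made-up frame (not the banana dataset). It sends the frame through a CSV round trip and shows that the datetime type is not recovered on read. The final `to_parquet` call is left as a comment because it assumes a parquet engine such as pyarrow or fastparquet is installed, and the file name `reference.parquet` is only a placeholder.

```python
import io

import pandas as pd

# Small illustrative frame (hypothetical values, not the banana dataset).
df = pd.DataFrame({
    "Timestamp": pd.date_range("2020-01-01", periods=3, freq="30min"),
    "y_pred": [0, 1, 1],
})

# Round trip through CSV: the file holds plain text, so dtypes are re-inferred.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

original_is_datetime = pd.api.types.is_datetime64_any_dtype(df["Timestamp"])
csv_is_datetime = pd.api.types.is_datetime64_any_dtype(from_csv["Timestamp"])
print(f"datetime before CSV: {original_is_datetime}, after CSV: {csv_is_datetime}")

# Parquet stores the schema alongside the data, so dtypes survive the round trip.
# Assumes a parquet engine (pyarrow or fastparquet) is installed:
# df.to_parquet("reference.parquet", index=False)
```

With CSV you would need to pass `parse_dates=["Timestamp"]` to `pd.read_csv` yourself each time the file is loaded.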

Step 3: Setting up the monitored dataset

Monitored dataset: This is the data that NannyML will monitor. It typically contains the latest production data, recorded after the reference period ends. This dataset does not require targets.

As with the reference dataset, let's add the following columns to the monitored dataset:

  • y_pred_proba: contains the model's predicted probabilities (scores) for class 1 (Good quality banana, in this case).

  • y_pred: contains the model's predicted class (0 or 1).

monitor_df['y_pred_proba'] = clf.predict_proba(monitor_df[features])[:, 1]
monitor_df['y_pred'] = clf.predict(monitor_df[features])
monitor_df.head()
Identifier  Timestamp                 Size     Weight  Sweetness   Softness  HarvestTime   Ripeness    Acidity  Quality  y_pred_proba  y_pred
      6000  2020-05-05 00:00:00  -1.955202  -0.117761   0.121855  -5.150094    -1.387223   3.435446   1.662624        1      0.999419       1
      6001  2020-05-05 00:30:00  -0.940370  -1.119835  -0.998754   1.206442    -3.871381   0.711751   0.799148        0      0.001309       0
      6002  2020-05-05 01:00:00  -0.819169  -3.060052  -1.092720   2.581883    -0.306439  -2.864940  -1.350002        0      0.000332       0
      6003  2020-05-05 01:30:00  -1.757643  -0.109889   1.044176  -2.219761    -2.899120  -0.954929   2.877549        1      0.384413       0
      6004  2020-05-05 02:00:00  -1.499208   0.618753  -0.616309   0.195418    -0.299742  -0.323257  -4.152448        0      0.012700       0
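Since the monitored data should come after the reference period, a quick sanity check on the timestamps can be useful. The sketch below is not part of NannyML: `starts_after` is a hypothetical helper, and the two small frames are stand-ins for `reference_df` and `monitor_df` that contain only a `Timestamp` column.

```python
import pandas as pd

# Hypothetical stand-ins for reference_df and monitor_df (timestamps only).
reference = pd.DataFrame(
    {"Timestamp": pd.date_range("2020-03-24 08:00", periods=4, freq="30min")}
)
monitored = pd.DataFrame(
    {"Timestamp": pd.date_range("2020-05-05 00:00", periods=4, freq="30min")}
)


def starts_after(monitored_df, reference_df, column="Timestamp"):
    """Return True if every monitored timestamp is later than the last reference one."""
    monitored_ts = pd.to_datetime(monitored_df[column])
    reference_ts = pd.to_datetime(reference_df[column])
    return bool(monitored_ts.min() > reference_ts.max())


print(starts_after(monitored, reference))
```

The `pd.to_datetime` calls make the check work even when the column was loaded as text, for example from a CSV file.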

In this particular example, we are dealing with a binary classification problem. The required columns for multiclass classification and regression differ slightly.

Here is a summary of all the columns required for regression, binary classification, and multiclass classification.

Regression Model - Required columns

Column        Reference  Monitored
identifier    Required   Required
timestamp     Required   Required
y_pred_proba  N/A        N/A
y_pred        Required   Required
target        Required   Optional

Classification Model - Required columns

Column        Reference  Monitored
identifier    Required   Required
timestamp     Required   Required
y_pred_proba  Required   Required
y_pred        Required   Required
target        Required   Optional
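The requirements above can be turned into a small pre-upload check. This is only a sketch under the assumptions stated in this tutorial: `missing_columns` and its arguments are hypothetical names, not part of NannyML, and it covers regression and binary classification only (multiclass models need one probability column per class).

```python
import pandas as pd


def missing_columns(df, problem_type, is_reference, identifier, timestamp, target):
    """Return the required columns that are absent from df.

    problem_type is "regression" or "classification" (binary). All names
    here are hypothetical and not part of NannyML.
    """
    required = [identifier, timestamp, "y_pred"]
    if problem_type == "classification":
        required.append("y_pred_proba")  # not applicable to regression
    if is_reference:
        required.append(target)  # targets are optional in monitored data
    return [column for column in required if column not in df.columns]


# Hypothetical monitored frame without targets, using this tutorial's column names.
monitored = pd.DataFrame(columns=["Identifier", "Timestamp", "y_pred_proba", "y_pred"])

as_monitored = missing_columns(
    monitored, "classification", False, "Identifier", "Timestamp", "Quality"
)
as_reference = missing_columns(
    monitored, "classification", True, "Identifier", "Timestamp", "Quality"
)
print(as_monitored)  # columns missing if used as a monitored dataset
print(as_reference)  # columns missing if used as a reference dataset
```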

For a multiclass classification model, there should be one y_pred_proba column per class.

Example:

  • y_pred_proba_setosa

  • y_pred_proba_virginica

  • y_pred_proba_versicolor
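A minimal sketch of how such columns could be built is shown below. The `classes` list and `proba` array are made-up stand-ins for what a fitted scikit-learn style classifier would expose as `clf.classes_` and `clf.predict_proba(X)`; the column order of the probabilities is assumed to match the order of the classes.

```python
import numpy as np
import pandas as pd

# Stand-ins for clf.classes_ and clf.predict_proba(X) of a fitted multiclass model.
classes = ["setosa", "versicolor", "virginica"]
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
])

df = pd.DataFrame(index=range(len(proba)))

# One probability column per class, named y_pred_proba_<class>.
for position, class_name in enumerate(classes):
    df[f"y_pred_proba_{class_name}"] = proba[:, position]

# The predicted class is the one with the highest probability.
df["y_pred"] = [classes[i] for i in proba.argmax(axis=1)]
print(df)
```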

What if my data doesn't have timestamps?

Timestamps indicate the date and time when observations were recorded. If your dataset doesn't have timestamps, you can generate synthetic ones. Just make sure that the start date and frequency you choose make sense for your use case.

For example:

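import datetime as dt

# `df` is your dataset; this assumes a default integer index (0, 1, 2, ...).
# One timestamp every 30 minutes, starting on 2020-01-01.
timestamps = [dt.datetime(2020, 1, 1) + dt.timedelta(hours=x / 2) for x in df.index]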
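If your index is not a simple 0, 1, 2, ... sequence, pandas' `date_range` is an alternative that depends only on the number of rows. The sketch below uses a made-up four-row frame; the start date and the 30-minute frequency are assumptions you should adapt to your data.

```python
import pandas as pd

# Hypothetical dataset without timestamps.
df = pd.DataFrame({"y_pred": [0, 1, 1, 0]})

# One synthetic timestamp per row, 30 minutes apart, starting on 2020-01-01.
df["Timestamp"] = pd.date_range(start="2020-01-01", periods=len(df), freq="30min")
print(df)
```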

Where to go next?

Check out how to launch NannyML Cloud on Azure

Check out how to launch NannyML Cloud on AWS