How to get data ready for NannyML

This tutorial is a practical guide to getting your data into the right format for monitoring your models with NannyML. We will walk through the data preparation step by step, starting with a simple model that we will monitor later.

Step 1: Have a trained model (prerequisite)

If you are looking into monitoring, you probably already have a model, a training set, and most likely some production (monitored) data on which you would like to check how your model is doing. This tutorial uses the Banana Quality Prediction dataset to illustrate how to prepare your data.

Let's start by partitioning our data into three sets:

Training
Testing
Monitored (production data)

In a real scenario, you won't need to create the monitored partition yourself, since this data arrives naturally after the model is deployed.

If you already have a trained model, skip to Step 2: Building a reference dataset.

import pandas as pd

url = 'https://raw.githubusercontent.com/NannyML/sample_datasets/main/banana_quality/banana_quality.csv' 
data = pd.read_csv(url)
train_df, test_df, monitor_df = data.iloc[0:4000], data.iloc[4000:6000], data.iloc[6000:8000]

data.head()

| Identifier | Timestamp | Size | Weight | Sweetness | Softness | HarvestTime | Ripeness | Acidity | Quality |
|---|---|---|---|---|---|---|---|---|---|
| | 2020-01-01 00:00:00 | 2.622631 | -1.531197 | -0.293286 | 2.086239 | 4.379607 | 0.865033 | 0.139479 | |
| | 2020-01-01 00:30:00 | -3.592020 | 0.220773 | 0.298705 | -1.759699 | -4.116195 | 0.604932 | -2.387580 | |
| | 2020-01-01 01:00:00 | 1.270482 | -3.959625 | -3.447087 | 0.799014 | 0.148791 | -2.098769 | 0.727079 | |
| | 2020-01-01 01:30:00 | -0.002022 | -1.417225 | -1.847301 | 0.301628 | 1.364337 | 0.265501 | -0.458392 | |
| | 2020-01-01 02:00:00 | -0.786569 | -4.323583 | 1.466369 | 0.157923 | -2.063598 | 2.619294 | -0.340423 | |

The Quality column is the target variable. It is encoded as:

1 - Good banana quality
0 - Bad banana quality

Once we have the data separated, we can build a simple model.

from lightgbm import LGBMClassifier

features = [
    'Size', 'Weight', 'Sweetness', 'Softness', 
    'HarvestTime', 'Ripeness', 'Acidity'
]
target = 'Quality'
timestamp = 'Timestamp'

clf = LGBMClassifier()
clf.fit(train_df[features], train_df[target])

Let's quickly check its performance on the training and testing sets before moving to the next step.

from sklearn.metrics import roc_auc_score

roc_auc_train = roc_auc_score(
    train_df[target], clf.predict_proba(train_df[features])[:, 1]
)
roc_auc_test = roc_auc_score(
    test_df[target], clf.predict_proba(test_df[features])[:, 1]
)

print(f"ROC-AUC (train): {roc_auc_train}")
print(f"ROC-AUC (test): {roc_auc_test}")

ROC-AUC (train): 1.0
ROC-AUC (test): 0.9931109775641026

The model performs well on both the training and testing sets.

Step 2: Building a reference dataset

Once we have a model that performs well on the test set, we can use it to build a reference dataset.

Reference dataset: The reference dataset is a benchmark set where the model behaves as expected. It is important that this dataset was not seen by the model during training. It should include targets and the model's predictions. In this tutorial, we will use the test set to build the reference set.

NannyML uses the reference dataset to establish a baseline for how the monitored data and the model's predictions should look after deployment. Internally, the reference dataset is used to calibrate most of the performance estimation methods and to calculate the default threshold values for every monitored metric.

To transform the test dataset into a reference dataset, all we need to do is add the predicted probabilities and the model predictions to the test dataset.

This means that we will create the following columns:

y_pred_proba: contains the model's predicted probabilities (scores) for class 1 (Good quality banana, in this case).
y_pred: contains the model's predicted class (0 or 1).

reference_df = test_df.copy()

reference_df['y_pred_proba'] = clf.predict_proba(test_df[features])[:, 1]
reference_df['y_pred'] = clf.predict(test_df[features])

reference_df.head()

| Identifier | Timestamp | Size | Weight | Sweetness | Softness | HarvestTime | Ripeness | Acidity | Quality | y_pred_proba | y_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4000 | 2020-03-24 08:00:00 | -1.005942 | 0.457455 | -0.819113 | 1.014088 | 1.380156 | -2.005996 | 3.522873 | | 0.002104 | |
| 4001 | 2020-03-24 08:30:00 | -2.425685 | 1.682808 | -2.041250 | -1.925259 | -1.118302 | -1.264770 | -3.848056 | | 0.006822 | |
| 4002 | 2020-03-24 09:00:00 | 0.363760 | -1.777900 | -0.585864 | 2.736269 | 3.509048 | 1.986182 | -1.747902 | | 0.998873 | |
| 4003 | 2020-03-24 09:30:00 | -2.206337 | -3.712593 | 0.734068 | -2.015895 | -3.488548 | 0.747031 | 0.754206 | | 0.019571 | |
| 4004 | 2020-03-24 10:00:00 | 0.649706 | -3.580519 | 0.352195 | 1.058788 | 0.834113 | 3.243598 | -3.792337 | | 0.967419 | |

We recommend storing your data as parquet files.

NannyML Cloud supports both parquet and CSV files, but CSV files don't store data type information, so the types have to be inferred when the file is read and may be inferred incorrectly. If you later add more data to the model using the SDK or using parquet format, a data type conflict may occur.
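The data type issue can be reproduced with pandas alone. The sketch below is only an illustration: the two-row frame is made up for the example, and the CSV is written to an in-memory buffer rather than to disk. The commented-out `to_parquet` call shows the recommended alternative; it assumes a parquet engine such as pyarrow is installed, and the file name is a placeholder.

```python
import io

import pandas as pd

# Made-up frame: a datetime column and an identifier stored as zero-padded text.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2020-01-01 00:00:00", "2020-01-01 00:30:00"]),
    "Identifier": ["001", "002"],
})

# Round-trip through CSV using an in-memory buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
roundtrip = pd.read_csv(buffer)

print(df.dtypes)         # dtypes before writing
print(roundtrip.dtypes)  # dtypes after reading the CSV back

# Parquet stores the column types alongside the data (requires pyarrow or fastparquet):
# df.to_parquet("reference.parquet")
```

After the round trip, the timestamps are no longer a datetime column and the zero-padded identifiers have been read as integers, which is the kind of mismatch that can later conflict with correctly typed data.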

Step 3: Setting up the monitored dataset

Monitored dataset: This is the data that NannyML will monitor. It typically contains the latest production data, which should come from the period after the reference period ends. This dataset does not require targets.
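A quick way to confirm that the monitored period comes after the reference period is to compare the timestamp ranges of the two datasets. This is a minimal sketch using small stand-in frames, not a NannyML feature; with the tutorial's data you would use your own `reference_df` and `monitor_df` and their `Timestamp` column.

```python
import pandas as pd

# Tiny stand-in frames; the start times mirror the first rows shown in this tutorial.
reference_df = pd.DataFrame(
    {"Timestamp": pd.date_range("2020-03-24 08:00", periods=4, freq="30min")}
)
monitor_df = pd.DataFrame(
    {"Timestamp": pd.date_range("2020-05-05 00:00", periods=4, freq="30min")}
)

# pd.to_datetime also handles timestamps that were loaded as strings.
reference_end = pd.to_datetime(reference_df["Timestamp"]).max()
monitored_start = pd.to_datetime(monitor_df["Timestamp"]).min()

# True when every monitored observation is later than the whole reference period.
periods_are_ordered = monitored_start > reference_end
print(f"Reference ends {reference_end}, monitored starts {monitored_start}")
print(f"Monitored data follows reference data: {periods_are_ordered}")
```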

As with the reference dataset, let's add the following columns to the monitored dataset.

y_pred_proba: contains the model's predicted probabilities (scores) for class 1 (Good quality banana, in this case).
y_pred: contains the model's predicted class (0 or 1).

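# Work on a copy so the new columns are not written to a slice of `data`
monitor_df = monitor_df.copy()

monitor_df['y_pred_proba'] = clf.predict_proba(monitor_df[features])[:, 1]
monitor_df['y_pred'] = clf.predict(monitor_df[features])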

monitor_df.head()

| Identifier | Timestamp | Size | Weight | Sweetness | Softness | HarvestTime | Ripeness | Acidity | Quality | y_pred_proba | y_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6000 | 2020-05-05 00:00:00 | -1.955202 | -0.117761 | 0.121855 | -5.150094 | -1.387223 | 3.435446 | 1.662624 | | 0.999419 | |
| 6001 | 2020-05-05 00:30:00 | -0.940370 | -1.119835 | -0.998754 | 1.206442 | -3.871381 | 0.711751 | 0.799148 | | 0.001309 | |
| 6002 | 2020-05-05 01:00:00 | -0.819169 | -3.060052 | -1.092720 | 2.581883 | -0.306439 | -2.864940 | -1.350002 | | 0.000332 | |
| 6003 | 2020-05-05 01:30:00 | -1.757643 | -0.109889 | 1.044176 | -2.219761 | -2.899120 | -0.954929 | 2.877549 | | 0.384413 | |
| 6004 | 2020-05-05 02:00:00 | -1.499208 | 0.618753 | -0.616309 | 0.195418 | -0.299742 | -0.323257 | -4.152448 | | 0.012700 | |

In this particular example, we are dealing with a binary classification problem. The required columns for multiclass classification and regression differ slightly.

Here is a summary of all the columns required for regression, binary classification, and multiclass classification.

Regression Model - Required columns

| Column | Reference | Monitored |
|---|---|---|
| identifier | ✅ | ✅ |
| timestamp | ✅ | ✅ |
| y_pred_proba | N/A | N/A |
| y_pred | ✅ | ✅ |
| target | ✅ | Optional |

Classification Model - Required columns

| Column | Reference | Monitored |
|---|---|---|
| identifier | ✅ | ✅ |
| timestamp | ✅ | ✅ |
| y_pred_proba | ✅ | ✅ |
| y_pred | ✅ | ✅ |
| target | ✅ | Optional |
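The binary classification requirements above can be checked before uploading with a small helper. This is only a sketch: `REQUIRED_ROLES` and `missing_roles` are hypothetical names made up for this example and are not part of NannyML. Because your own column names may differ from the role names (in this tutorial the target column is `Quality`), the helper takes a mapping from role to column name.

```python
import pandas as pd

# Roles required per dataset for a binary classification model (target is optional when monitored).
REQUIRED_ROLES = {
    "reference": ["identifier", "timestamp", "y_pred_proba", "y_pred", "target"],
    "monitored": ["identifier", "timestamp", "y_pred_proba", "y_pred"],
}


def missing_roles(df, role_to_column, dataset):
    """Return the required roles whose mapped column is absent from df."""
    return [
        role
        for role in REQUIRED_ROLES[dataset]
        if role_to_column.get(role) not in df.columns
    ]


# Column names as used in this tutorial.
role_to_column = {
    "identifier": "Identifier",
    "timestamp": "Timestamp",
    "y_pred_proba": "y_pred_proba",
    "y_pred": "y_pred",
    "target": "Quality",
}

# Stand-in frame without targets: acceptable as monitored data, incomplete as reference data.
example_df = pd.DataFrame(
    columns=["Identifier", "Timestamp", "Size", "y_pred_proba", "y_pred"]
)
print(missing_roles(example_df, role_to_column, "monitored"))
print(missing_roles(example_df, role_to_column, "reference"))
```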

If the model is multiclass classification, there should be a y_pred_proba column for each class.

Example:

y_pred_proba_setosa
y_pred_proba_virginica
y_pred_proba_versicolor
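One way to produce these per-class columns is to loop over the model's class labels and take the matching column of the predicted probability matrix. The sketch below uses hard-coded stand-ins for `clf.classes_` and `clf.predict_proba(X)` so that it runs on its own; the probability values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for clf.classes_ and clf.predict_proba(X) of a fitted multiclass model.
classes = np.array(["setosa", "versicolor", "virginica"])
probas = np.array([
    [0.90, 0.07, 0.03],
    [0.10, 0.25, 0.65],
    [0.20, 0.70, 0.10],
])

df = pd.DataFrame(index=range(len(probas)))

# One probability column per class, in the same order as the class labels.
for position, class_name in enumerate(classes):
    df[f"y_pred_proba_{class_name}"] = probas[:, position]

# The predicted class is the label with the highest probability in each row.
df["y_pred"] = classes[probas.argmax(axis=1)]
print(df)
```

With a scikit-learn style classifier, the columns of `predict_proba` follow the order of `classes_`, which is why the loop indexes both by the same position.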

What if my data doesn't have timestamps?

Timestamps indicate the date and time when observations were recorded. If your dataset doesn't have timestamps, you can generate synthetic ones. Just make sure they align with your specific use case.

For example:

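import datetime as dt

# `df` is your dataset; this assumes its index is the default 0, 1, 2, ... range.
# Each row gets a timestamp 30 minutes after the previous one, starting on 2020-01-01.
timestamps = [dt.datetime(2020, 1, 1) + dt.timedelta(hours=x / 2) for x in df.index]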
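An equivalent approach that does not depend on the values of the index is `pd.date_range`. The sketch below uses a made-up four-row frame and the same assumptions as above (a start date of 2020-01-01 and one observation every 30 minutes); adjust both to your use case.

```python
import pandas as pd

# Made-up dataset without timestamps.
df = pd.DataFrame({"Size": [0.1, 0.2, 0.3, 0.4]})

# One timestamp per row, 30 minutes apart, regardless of what the index contains.
df["Timestamp"] = pd.date_range(start="2020-01-01", periods=len(df), freq="30min")
print(df)
```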

