How to get data ready for NannyML
This tutorial is a practical guide to getting your data into the right format so you can monitor your models with NannyML. We will illustrate, step by step, how to prepare data for monitoring, starting by building a simple model that we will monitor later.
Step 1: Have a trained model (prerequisite)
If you are looking into monitoring, you probably already have a model, a training set, and most likely some production (monitored) data on which you would like to check how your model is performing. This tutorial uses the Banana Quality Prediction dataset to illustrate how to prepare your data.
Let's start by partitioning our data into three sets:
Training
Testing
Monitored (production data)
In a real scenario, you won't need to create the monitoring partition, since this data naturally accumulates after the model is deployed.
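A chronological split along these lines might look as follows. This is a minimal sketch: the synthetic DataFrame, the generic `feature_1`…`feature_7` names, and the partition sizes are illustrative stand-ins for the banana data, chosen so the partition boundaries match the row indices shown later (reference starting at 4000, monitored at 6000).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the banana-quality data: 8,000 rows of
# 7 numeric features plus a binary Quality target, one row every 30 min.
rng = np.random.default_rng(42)
n = 8_000
features = [f"feature_{i}" for i in range(1, 8)]
df = pd.DataFrame(rng.normal(size=(n, 7)), columns=features)
df["Quality"] = (df[features].sum(axis=1) > 0).astype(int)
df.insert(0, "timestamp", pd.date_range("2020-01-01", periods=n, freq="30min"))

# Split chronologically: the monitored partition must come *after*
# the train/test period, so we cut by position rather than at random.
train = df.iloc[:4_000]
test = df.iloc[4_000:6_000]
monitored = df.iloc[6_000:]
```

Cutting by time rather than sampling at random keeps the monitored partition strictly later than the data the model was built on, mimicking a real deployment.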
If you already have a trained model, skip to Step 2: Building a reference dataset.
| | timestamp | Size | Weight | Sweetness | Softness | HarvestTime | Ripeness | Acidity | Quality |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2020-01-01 00:00:00 | 2.622631 | -1.531197 | -0.293286 | 2.086239 | 4.379607 | 0.865033 | 0.139479 | 1 |
| 1 | 2020-01-01 00:30:00 | -3.592020 | 0.220773 | 0.298705 | -1.759699 | -4.116195 | 0.604932 | -2.387580 | 1 |
| 2 | 2020-01-01 01:00:00 | 1.270482 | -3.959625 | -3.447087 | 0.799014 | 0.148791 | -2.098769 | 0.727079 | 0 |
| 3 | 2020-01-01 01:30:00 | -0.002022 | -1.417225 | -1.847301 | 0.301628 | 1.364337 | 0.265501 | -0.458392 | 1 |
| 4 | 2020-01-01 02:00:00 | -0.786569 | -4.323583 | 1.466369 | 0.157923 | -2.063598 | 2.619294 | -0.340423 | 0 |
The Quality column is the target variable. It is encoded as:
1 - Good banana quality
0 - Bad banana quality
Once we have the data separated, we can build a simple model.
Let's quickly check its performance on the training and testing sets before moving to the next step.
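A sketch of this step, assuming the train/test partitions from above. The synthetic partitions, the choice of logistic regression, and ROC AUC as the check are stand-ins for illustration, not the tutorial's exact setup.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical train/test partitions standing in for the banana data.
rng = np.random.default_rng(0)
features = [f"feature_{i}" for i in range(1, 8)]

def make_partition(n_rows):
    part = pd.DataFrame(rng.normal(size=(n_rows, 7)), columns=features)
    noise = rng.normal(scale=0.5, size=n_rows)
    part["Quality"] = (part[features].sum(axis=1) + noise > 0).astype(int)
    return part

train, test = make_partition(4_000), make_partition(2_000)

# Fit a simple classifier and sanity-check it on both partitions;
# a large train/test gap here would signal overfitting.
model = LogisticRegression().fit(train[features], train["Quality"])
train_auc = roc_auc_score(train["Quality"], model.predict_proba(train[features])[:, 1])
test_auc = roc_auc_score(test["Quality"], model.predict_proba(test[features])[:, 1])
print(f"train ROC AUC: {train_auc:.3f}, test ROC AUC: {test_auc:.3f}")
```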
It looks like the model is doing a good job on both the training and testing sets.
Step 2: Building a reference dataset
Once we have a model that performs well on the test set, we can use it to build a reference dataset.
Reference dataset: The reference dataset is a benchmark set where the model behaves as expected. It is important that this dataset was not seen by the model during training. It should include targets and the model's predictions. In this tutorial, we will use the test set to build the reference set.
NannyML uses the reference dataset to establish a baseline for how the monitoring data and the model's predictions should look after deployment. Internally, NannyML uses it to calibrate most of its performance estimation methods and to calculate the default threshold values for every monitored metric.
To transform the test dataset into a reference dataset, all we need to do is add the predicted probabilities and the model predictions to the test dataset.
This means that we will create the following columns:
y_pred_proba: contains the model's predicted probabilities (scores) for class 1 (Good quality banana, in this case).
y_pred: contains the model's predicted class (0 or 1).
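A minimal sketch of this step. The synthetic model and test set are hypothetical stand-ins; the two new columns are the ones the tutorial describes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for the trained model and the held-out test set.
rng = np.random.default_rng(1)
features = [f"feature_{i}" for i in range(1, 8)]

def make_partition(n_rows):
    part = pd.DataFrame(rng.normal(size=(n_rows, 7)), columns=features)
    part["Quality"] = (part[features].sum(axis=1) > 0).astype(int)
    return part

train, test = make_partition(4_000), make_partition(2_000)
model = LogisticRegression().fit(train[features], train["Quality"])

# The test set becomes the reference set once it carries the model's
# scores for class 1 (y_pred_proba) and predicted classes (y_pred).
reference = test.copy()
reference["y_pred_proba"] = model.predict_proba(reference[features])[:, 1]
reference["y_pred"] = model.predict(reference[features])
```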
| | timestamp | Size | Weight | Sweetness | Softness | HarvestTime | Ripeness | Acidity | Quality | y_pred_proba | y_pred |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4000 | 2020-03-24 08:00:00 | -1.005942 | 0.457455 | -0.819113 | 1.014088 | 1.380156 | -2.005996 | 3.522873 | 0 | 0.002104 | 0 |
| 4001 | 2020-03-24 08:30:00 | -2.425685 | 1.682808 | -2.041250 | -1.925259 | -1.118302 | -1.264770 | -3.848056 | 0 | 0.006822 | 0 |
| 4002 | 2020-03-24 09:00:00 | 0.363760 | -1.777900 | -0.585864 | 2.736269 | 3.509048 | 1.986182 | -1.747902 | 1 | 0.998873 | 1 |
| 4003 | 2020-03-24 09:30:00 | -2.206337 | -3.712593 | 0.734068 | -2.015895 | -3.488548 | 0.747031 | 0.754206 | 0 | 0.019571 | 0 |
| 4004 | 2020-03-24 10:00:00 | 0.649706 | -3.580519 | 0.352195 | 1.058788 | 0.834113 | 3.243598 | -3.792337 | 1 | 0.967419 | 1 |
We recommend storing your data as parquet files.
NannyML Cloud supports both Parquet and CSV files, but CSV files don't store data type information, so incorrect data types may be inferred when the data is loaded. If you later add more data for the same model through the SDK or in Parquet format, a data type conflict may occur.
Step 3: Setting up the monitored dataset
Monitored dataset: This is the data that NannyML will monitor. It typically contains the latest production data, which should be after the reference period ends. This dataset does not require targets.
As with the reference dataset, let's add the following columns to the monitored dataset:
y_pred_proba: contains the model's predicted probabilities (scores) for class 1 (Good quality banana, in this case).
y_pred: contains the model's predicted class (0 or 1).
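The recipe is the same as for the reference set. In this sketch the fitted model and the incoming production rows are hypothetical stand-ins; note that the production rows are scored without needing targets.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for the fitted model and incoming production data.
rng = np.random.default_rng(2)
features = [f"feature_{i}" for i in range(1, 8)]
train = pd.DataFrame(rng.normal(size=(4_000, 7)), columns=features)
train["Quality"] = (train[features].sum(axis=1) > 0).astype(int)
model = LogisticRegression().fit(train[features], train["Quality"])

# Production rows arrive without targets; that's fine for monitoring.
monitored = pd.DataFrame(rng.normal(size=(2_000, 7)), columns=features)
monitored["y_pred_proba"] = model.predict_proba(monitored[features])[:, 1]
monitored["y_pred"] = model.predict(monitored[features])
```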
| | timestamp | Size | Weight | Sweetness | Softness | HarvestTime | Ripeness | Acidity | Quality | y_pred_proba | y_pred |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6000 | 2020-05-05 00:00:00 | -1.955202 | -0.117761 | 0.121855 | -5.150094 | -1.387223 | 3.435446 | 1.662624 | 1 | 0.999419 | 1 |
| 6001 | 2020-05-05 00:30:00 | -0.940370 | -1.119835 | -0.998754 | 1.206442 | -3.871381 | 0.711751 | 0.799148 | 0 | 0.001309 | 0 |
| 6002 | 2020-05-05 01:00:00 | -0.819169 | -3.060052 | -1.092720 | 2.581883 | -0.306439 | -2.864940 | -1.350002 | 0 | 0.000332 | 0 |
| 6003 | 2020-05-05 01:30:00 | -1.757643 | -0.109889 | 1.044176 | -2.219761 | -2.899120 | -0.954929 | 2.877549 | 1 | 0.384413 | 0 |
| 6004 | 2020-05-05 02:00:00 | -1.499208 | 0.618753 | -0.616309 | 0.195418 | -0.299742 | -0.323257 | -4.152448 | 0 | 0.012700 | 0 |
In this particular example, we are dealing with a binary classification problem. The required columns for multiclass classification and regression differ slightly.
Here is a summary of all the columns required for regression, binary classification, and multiclass classification.
Regression Model - Required columns
| Column | Reference | Monitored |
| --- | --- | --- |
| identifier | ✅ | ✅ |
| timestamp | ✅ | ✅ |
| y_pred_proba | N/A | N/A |
| y_pred | ✅ | ✅ |
| target | ✅ | Optional |
Classification Model - Required Columns
| Column | Reference | Monitored |
| --- | --- | --- |
| identifier | ✅ | ✅ |
| timestamp | ✅ | ✅ |
| y_pred_proba | ✅ | ✅ |
| y_pred | ✅ | ✅ |
| target | ✅ | Optional |
If the model is a multiclass classifier, there should be a y_pred_proba column for each class.
Example:
y_pred_proba_setosa
y_pred_proba_virginica
y_pred_proba_versicolor
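Building those per-class columns can be sketched like this. The Iris data and logistic regression are stand-ins chosen only because their class names (setosa, versicolor, virginica) match the example columns above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Iris stands in for any multiclass problem; its class names match
# the example columns above (setosa, versicolor, virginica).
data = load_iris(as_frame=True)
X, y = data.data, data.target
model = LogisticRegression(max_iter=1_000).fit(X, y)

frame = X.copy()
proba = model.predict_proba(X)
# One y_pred_proba_<class> column per class, plus the predicted class.
# predict_proba's column order follows model.classes_, which here
# lines up with data.target_names.
for i, name in enumerate(data.target_names):
    frame[f"y_pred_proba_{name}"] = proba[:, i]
frame["y_pred"] = model.predict(X)
```

Each row's per-class probabilities sum to 1, and the y_pred column holds the class with the highest score.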