Data Preparation
How to prepare your data before using NannyML
* Data from two different data periods are needed.
* Data needs to be in tabular format.
* Model inputs, outputs, and optionally targets are needed.
* Some additional columns are needed, namely a timestamp and an identifier column.
The NannyML open-source library provides sample datasets. Let's have a quick preview of the synthetic car loan dataset before we go into details.
| id | timestamp | car_value | salary_range | debt_to_income_ratio | loan_length | repaid_loan_on_prev_car | size_of_downpayment | driver_tenure | repaid | y_pred_proba | y_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 00:00:00.000 | 39811 | 40K - 60K € | 0.63295 | 19 | False | 40% | 0.212653 | 1 | 0.99 | 1 |
| 1 | 2018-01-01 00:08:43.152 | 12679 | 40K - 60K € | 0.718627 | 7 | True | 10% | 4.92755 | 0 | 0.07 | 0 |
| 2 | 2018-01-01 00:17:26.304 | 19847 | 40K - 60K € | 0.721724 | 17 | False | 0% | 0.520817 | 1 | 1 | 1 |
| 3 | 2018-01-01 00:26:09.456 | 22652 | 20K - 20K € | 0.705992 | 16 | False | 10% | 0.453649 | 1 | 0.98 | 1 |
| 4 | 2018-01-01 00:34:52.608 | 21268 | 60K+ € | 0.671888 | 21 | True | 30% | 5.69526 | 1 | 0.99 | 1 |
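The preview above can be reproduced with the open-source library. The snippet below is a minimal sketch assuming the `nannyml` package is installed; the variable names on the left-hand side are our own choice for the three DataFrames the loader returns.

```python
import nannyml as nml

# Load the synthetic car loan dataset that ships with the open-source library.
# It returns three pandas DataFrames: a reference set, a monitored (analysis)
# set, and a separate frame holding the delayed targets.
reference_df, monitored_df, targets_df = nml.load_synthetic_car_loan_dataset()

# Preview the first rows, as shown in the table above.
print(reference_df.head())
```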
Let's now see the requirements in more detail.
In order to monitor a model's behavior, we first need to establish a pattern of acceptable behavior. This is done using data from a reference period, often called the reference dataset. Usually, this is the test set from when the model was developed, or the latest available production data where the model performed according to expectations.
Then we need a monitored dataset: data from the period in which we want to examine how well the model performs.
In some cases, the monitored dataset does not contain targets. This often happens when targets become available later than the predictions. To accommodate this, NannyML allows for a third dataset, the target dataset. The target dataset only needs to contain targets and an identifier column.
Also note that the same column names must be used in the reference, monitored and target datasets.
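As an illustration, here is a minimal pandas sketch of how these three datasets could be assembled from a single table of scored predictions. The file name and the cut-off date are placeholders; the column names follow the sample dataset above.

```python
import pandas as pd

# Hypothetical table of scored predictions: an 'id', a 'timestamp', the model
# inputs, the model outputs ('y_pred_proba', 'y_pred') and the target ('repaid').
df = pd.read_parquet("scored_predictions.parquet")  # placeholder file name
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Placeholder cut-off: end of the period where performance was acceptable.
cutoff = pd.Timestamp("2018-10-01")

# Reference dataset: the period of acceptable, known model behaviour.
reference_df = df[df["timestamp"] < cutoff]

# Monitored dataset: the period we want to examine, without targets.
monitored_df = df[df["timestamp"] >= cutoff].drop(columns=["repaid"])

# Target dataset: only the identifier and the late-arriving targets.
targets_df = df.loc[df["timestamp"] >= cutoff, ["id", "repaid"]]
```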
As can be seen in the sample data above, NannyML consumes data in a tabular format. Each prediction is described in one row, while features and other information are provided through columns. NannyML accepts data in CSV and Parquet formats.
This is the key information needed to monitor a model. All model inputs and outputs are expected to be represented by unique columns in the data provided. By outputs we mean both predicted probabilities and predicted classes for classification problems. For the reference dataset, model targets are required. For the monitored data they are optional: if they are not provided, they can be added later through the target dataset option. Until this is done, NannyML features such as realized performance and concept drift monitoring cannot be used.
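A small validation step can help catch missing columns early. This sketch reuses the `reference_df` from the earlier example; the column names mirror the sample dataset above and should be adjusted to your own model.

```python
# Expected columns, mirroring the sample dataset above (adjust to your model).
feature_columns = [
    "car_value", "salary_range", "debt_to_income_ratio", "loan_length",
    "repaid_loan_on_prev_car", "size_of_downpayment", "driver_tenure",
]
output_columns = ["y_pred_proba", "y_pred"]    # predicted probability and class
extra_columns = ["id", "timestamp", "repaid"]  # identifier, timestamp, target

missing = set(feature_columns + output_columns + extra_columns) - set(reference_df.columns)
if missing:
    raise ValueError(f"Reference dataset is missing columns: {missing}")
```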
Apart from the standard features, NannyML needs two additional columns: an id column and a timestamp column.
The id column provides a unique identifier for each model prediction. Since each prediction is expected to be in one row, the id column is unique per row. It can be an integer or a string. If a unique identifier is not present in your data, you need to create one before you can use NannyML Cloud.
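If no natural identifier exists, one can be derived, for example as in this hypothetical pandas sketch (reusing `monitored_df` from the earlier examples):

```python
import uuid

# Option 1: a simple running integer as the identifier.
monitored_df["id"] = range(len(monitored_df))

# Option 2: random UUID strings, handy when combining rows from several sources.
# Whichever option you choose, make sure the target dataset ends up with the
# same identifiers so the two can be joined later.
monitored_df["id"] = [str(uuid.uuid4()) for _ in range(len(monitored_df))]
```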
The timestamp column describes the time at which a model prediction was made. It is mainly used to aggregate predictions according to when they were made and to organize the model monitoring results.
If that information is not stored for your business use case, you need to create a synthetic timestamp column before using NannyML. Note that timestamp information is used in chunking and when plotting results, so be careful to use values that make sense. Any timestamp format supported by pandas can be used.
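A synthetic timestamp can be as simple as an evenly spaced range, as in this sketch; the start date and frequency are placeholders to adapt to your use case.

```python
import pandas as pd

# Evenly spaced synthetic timestamps, one per prediction (placeholder start
# date and frequency; pick values that make sense for your use case).
monitored_df["timestamp"] = pd.date_range(
    start="2018-01-01", periods=len(monitored_df), freq="min"
)
```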
Given all of the above, how would one go about creating datasets to monitor their model with NannyML Cloud? Let's list the necessary steps:
1. Decide which data period will be used for the reference dataset and which for the monitored dataset.
2. Gather the relevant data from where they are stored.
   * Identify what data need to be collected: model inputs, outputs, and targets, as well as an id and a timestamp column.
   * Create queries to gather the relevant data. This can be complicated if data are stored in different places. Keep in mind that the end result needs to be data in tabular format; see the sketch after this list.
3. Add the required additional columns if they are not present.
4. Store the data. Note that NannyML can receive data from:
   * A public URL serving the raw file.
   * A cloud storage option, namely S3 or Azure Blob Storage.
   * Local files on your computer, if their size is less than 100 MB.
   * The NannyML Cloud SDK.

   Hence, store the data in whichever way is most convenient for you.
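As an illustration of the "gather into one table" step, the sketch below joins hypothetical feature, prediction, and target extracts on the id column. The file and column names are placeholders.

```python
import pandas as pd

# Hypothetical extracts pulled from different storage systems.
features_df = pd.read_parquet("features.parquet")        # id + model inputs
predictions_df = pd.read_parquet("predictions.parquet")  # id, timestamp, y_pred_proba, y_pred
targets_df = pd.read_parquet("targets.parquet")          # id + repaid (may arrive later)

# One row per prediction; features, outputs and (optionally) targets as columns.
monitoring_df = (
    features_df
    .merge(predictions_df, on="id", how="inner")
    .merge(targets_df, on="id", how="left")
)
```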
We recommend storing your data as Parquet files.
NannyML Cloud supports both Parquet and CSV files, but CSV files do not store data type information, so incorrect data types may be inferred. If you later add more data for the model using the SDK or in Parquet format, a data type conflict may occur.
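For example, the datasets assembled in the earlier sketches could be written out as Parquet with pandas; the file names are placeholders, and pandas needs a Parquet engine such as pyarrow installed.

```python
# Write the prepared datasets as Parquet files, preserving column data types.
# Requires a Parquet engine such as pyarrow (pip install pyarrow).
reference_df.to_parquet("reference.parquet", index=False)
monitored_df.to_parquet("monitored.parquet", index=False)
targets_df.to_parquet("targets.parquet", index=False)
```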