Data Preparation

How to prepare your data before using NannyML

What data does NannyML need to monitor a machine learning model in production?

  • Data from two different periods is needed.

  • Data needs to be in tabular format.

  • Features, model outputs, and targets (targets are optional for the monitored dataset).

  • Some additional columns are needed, namely a timestamp and an identifier column.

The NannyML open-source library provides sample datasets. Let's have a quick preview of the synthetic car loan dataset before we go into details.

import nannyml as nml
reference, monitored, targets = nml.load_synthetic_car_loan_dataset()
reference.head()
| id | timestamp | car_value | salary_range | debt_to_income_ratio | loan_length | repaid_loan_on_prev_car | size_of_downpayment | driver_tenure | repaid | y_pred_proba | y_pred |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2018-01-01 00:00:00.000 | 39811 | 40K - 60K € | 0.63295 | 19 | False | 40% | 0.212653 | 1 | 0.99 | 1 |
| 1 | 2018-01-01 00:08:43.152 | 12679 | 40K - 60K € | 0.718627 | 7 | True | 10% | 4.92755 | 0 | 0.07 | 0 |
| 2 | 2018-01-01 00:17:26.304 | 19847 | 40K - 60K € | 0.721724 | 17 | False | 0% | 0.520817 | 1 | 1 | 1 |
| 3 | 2018-01-01 00:26:09.456 | 22652 | 20K - 40K € | 0.705992 | 16 | False | 10% | 0.453649 | 1 | 0.98 | 1 |
| 4 | 2018-01-01 00:34:52.608 | 21268 | 60K+ € | 0.671888 | 21 | True | 30% | 5.69526 | 1 | 0.99 | 1 |

Let's now see the requirements in more detail.

Datasets and Periods

In order to monitor a model's behavior, we first need to establish a pattern of acceptable behavior. This is done using data from a reference period, often called the reference dataset. Usually, this dataset is the test set from when the model was developed, or the latest available production data where the model performed according to expectations.

Then we need a monitored dataset: data from the period in which we want to examine how well the model performs.

In some cases, the monitored dataset does not contain targets. This often happens when targets only become available some time after the predictions are made. To accommodate this, NannyML allows for a third dataset, the target dataset, which only needs to contain targets and an identifier column.

Also note that the same column names must be used in the reference, monitored and target datasets.
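Since the reference, monitored, and target datasets must share column names, it can be useful to verify this before uploading. The sketch below uses tiny hypothetical frames standing in for real data; only the target column (here `repaid`) is allowed to differ:

```python
import pandas as pd

# Hypothetical minimal reference and monitored frames; in practice these
# come from your own data store and carry all features and outputs.
reference = pd.DataFrame({
    "id": [0, 1],
    "timestamp": ["2018-01-01 00:00:00", "2018-01-01 00:08:43"],
    "y_pred_proba": [0.99, 0.07],
    "y_pred": [1, 0],
    "repaid": [1, 0],  # target, present in reference
})
monitored = reference.drop(columns=["repaid"]).copy()  # targets arrive later

# Column names must match across datasets (targets aside), so a quick
# sanity check before upload avoids schema errors later.
missing = set(reference.columns) - set(monitored.columns) - {"repaid"}
assert not missing, f"monitored is missing columns: {missing}"
```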

Data Format

As can be seen in the sample data above, NannyML consumes data in a tabular format: each prediction is described in one row, and features and other information are provided through columns. NannyML accepts data in CSV and Parquet formats.
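As a small illustration of the expected layout, the snippet below loads a miniature CSV with one prediction per row; the column names mirror the sample dataset, while the values are illustrative:

```python
import io

import pandas as pd

# A miniature CSV in the expected tabular layout: one prediction per row,
# features and model outputs as columns.
csv_data = io.StringIO(
    "id,timestamp,car_value,y_pred_proba,y_pred\n"
    "0,2018-01-01 00:00:00,39811,0.99,1\n"
    "1,2018-01-01 00:08:43,12679,0.07,0\n"
)
df = pd.read_csv(csv_data, parse_dates=["timestamp"])
```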

Features, Model Outputs and Targets

These are the key pieces of information needed to monitor a model. All features and model outputs are expected to be represented by unique columns in the provided data. By outputs we mean both predicted probabilities and predicted classes for classification problems. For the reference dataset, model targets are required. For the monitored data they are optional; if they are not provided, they can be added later through the target dataset option. Until this is done, NannyML features such as realized performance and concept drift monitoring cannot be used.
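A sketch of how late-arriving targets relate back to the monitored data: targets are joined onto predictions via the identifier column. The frames below are illustrative stand-ins:

```python
import pandas as pd

# Monitored predictions; targets for some of them arrive later.
monitored = pd.DataFrame({"id": [0, 1, 2], "y_pred": [1, 0, 1]})

# Target dataset: only targets plus the identifier column.
targets = pd.DataFrame({"id": [0, 1], "repaid": [1, 0]})  # id 2 not yet labelled

# A left join keeps every prediction; unlabelled rows get a missing target.
with_targets = monitored.merge(targets, on="id", how="left")
```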

Additional Columns

Apart from the standard features, NannyML needs two additional columns: an id column and a timestamp column.

Id column

The id column provides a unique identifier for each model prediction. Since each prediction occupies one row, the id column is unique per row. It can be an integer or a string. If a unique identifier is not present in your data, you need to create one before you can use NannyML Cloud.
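If no natural identifier exists, one way to create a synthetic id column is to derive it from the row position, as in this sketch:

```python
import pandas as pd

# Illustrative frame without a unique identifier.
df = pd.DataFrame({"y_pred": [1, 0, 1]})

# Derive an id from the row position; ids may be integers or strings.
df["id"] = range(len(df))          # integer ids
# df["id"] = df["id"].astype(str)  # or cast to strings, if preferred
```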

Timestamp Column

The timestamp column describes the time at which each model prediction was made. It is mainly used to aggregate predictions according to when they were made, in order to organize model monitoring results.

If that information is not stored for your business use case, you need to create a synthetic timestamp column before using NannyML. Note that timestamp information is used in chunking and when plotting results, so be careful to use values that make sense. Any format supported by pandas can be used.
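A synthetic timestamp column could be created as in the sketch below. The start date and hourly spacing are arbitrary assumptions; any values that preserve the order of predictions keep chunking and plotting sensible:

```python
import pandas as pd

# Illustrative frame without a logged prediction time.
df = pd.DataFrame({"id": range(4), "y_pred": [1, 0, 1, 1]})

# Evenly spaced synthetic timestamps preserving prediction order.
df["timestamp"] = pd.date_range("2018-01-01", periods=len(df), freq="h")
```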

Data Preparation Workflow

Given all of the above, how would one go about creating datasets to monitor their model with NannyML Cloud? Let's list the necessary steps:

  1. Decide on which data period will be used for the reference dataset and which for the monitored dataset.

  2. Gather relevant data from where they are stored.

    1. Identify what data need to be collected: features, model outputs, targets, as well as an id and a timestamp column.

    2. Create queries to gather the relevant data. This can be complicated if data are stored in different places. Keep in mind that the end result needs to be data in tabular format.

  3. Add required additional columns if they are not present.

  4. Store the data. Note that NannyML can receive data from:

    1. A public URL serving the raw file.

    2. A cloud storage option, namely S3 and Azure blob storage.

    3. Local files on your computer, if they are smaller than 100 MB.

    4. The NannyML Cloud SDK.

    Hence store the data in a way that is most convenient for you.
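The gathering step above, collecting features and model outputs from different tables into one tabular dataset, might look like this sketch; table contents and names are purely illustrative:

```python
import pandas as pd

# Features and model outputs often live in separate tables or stores.
features = pd.DataFrame({"id": [0, 1], "car_value": [39811, 12679]})
outputs = pd.DataFrame({"id": [0, 1], "y_pred_proba": [0.99, 0.07], "y_pred": [1, 0]})

# Joining on the shared id yields the single tabular dataset NannyML needs.
dataset = features.merge(outputs, on="id")
```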

We recommend storing your data as parquet files.

NannyML Cloud supports both Parquet and CSV files, but CSV files don't store data type information, so incorrect data types may be inferred. If you later add more data to the model using the SDK or in Parquet format, a data type conflict may occur.
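The dtype loss can be demonstrated with a CSV round trip. The sketch below uses an illustrative frame with a categorical column, similar to salary_range in the sample data:

```python
import os
import tempfile

import pandas as pd

# Illustrative frame with an explicit categorical column.
df = pd.DataFrame({
    "id": [0, 1],
    "salary_range": pd.Categorical(["40K - 60K €", "60K+ €"]),
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "reference.csv")
    df.to_csv(csv_path, index=False)
    from_csv = pd.read_csv(csv_path)

# The CSV round trip drops the categorical dtype; the column comes back as
# a plain object column, which can later conflict with Parquet-typed data.
```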