Data Preparation
How to prepare your data before using NannyML
What data does NannyML need to monitor a machine learning model in production?
Data from two different periods are needed.
Data need to be in tabular format.
Features, model outputs, and targets are needed (targets are optional for the monitored dataset).
Some additional columns are needed, namely a timestamp column and an identifier column.
The NannyML open-source library provides sample datasets. Let's have a quick preview of the synthetic car loan dataset before we go into details.
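id | timestamp | car_value | salary_range | debt_to_income_ratio | loan_length | repaid_loan_on_prev_car | size_of_downpayment | driver_tenure | repaid | y_pred_proba | y_pred |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2018-01-01 00:00:00.000 | 39811 | 40K - 60K € | 0.63295 | 19 | False | 40% | 0.212653 | 1 | 0.99 | 1 |
1 | 2018-01-01 00:08:43.152 | 12679 | 40K - 60K € | 0.718627 | 7 | True | 10% | 4.92755 | 0 | 0.07 | 0 |
2 | 2018-01-01 00:17:26.304 | 19847 | 40K - 60K € | 0.721724 | 17 | False | 0% | 0.520817 | 1 | 1 | 1 |
3 | 2018-01-01 00:26:09.456 | 22652 | 20K - 20K € | 0.705992 | 16 | False | 10% | 0.453649 | 1 | 0.98 | 1 |
4 | 2018-01-01 00:34:52.608 | 21268 | 60K+ € | 0.671888 | 21 | True | 30% | 5.69526 | 1 | 0.99 | 1 |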
Let's now see the requirements in more detail.
To monitor a model's behavior, we first need to establish a pattern of acceptable behavior. This is done with data from a reference period, often called the reference dataset. Usually, this dataset is the test set from when the model was developed, or the latest available production data where the model performed according to expectations.
Then we need a monitored dataset, which is a dataset that comes from the period where we want to examine how well a model performs.
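As a minimal sketch of the two periods, assume all logged predictions sit in one pandas DataFrame with a `timestamp` column. The data could then be split at a cutoff date. The cutoff and the tiny example frame below are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical prediction log; in practice this would be loaded from storage.
predictions = pd.DataFrame(
    {
        "id": [0, 1, 2, 3],
        "timestamp": pd.to_datetime(
            ["2018-01-01", "2018-03-15", "2018-07-01", "2018-09-20"]
        ),
        "y_pred_proba": [0.99, 0.07, 1.0, 0.98],
        "repaid": [1, 0, 1, 1],
    }
)

# Assumed cutoff: the model is known to have performed as expected before it.
cutoff = pd.Timestamp("2018-06-01")

# Rows before the cutoff form the reference dataset, the rest is monitored.
reference = predictions[predictions["timestamp"] < cutoff].reset_index(drop=True)
monitored = predictions[predictions["timestamp"] >= cutoff].reset_index(drop=True)
```

Splitting on time rather than at random keeps the reference period strictly earlier than the monitored period.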
In some cases, the monitored dataset does not contain targets. This often happens when targets become available at a later date than the prediction. To accommodate this, NannyML allows a third dataset, the target dataset. The target dataset only needs to contain targets and an identifier column.
Also note that the same column names must be used in the reference, monitored and target datasets.
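The relationship between the monitored and target datasets can be illustrated with a pandas join on the identifier column. This is only a sketch of how the two datasets line up, not a description of how NannyML performs the match internally. The column names follow the car loan sample and the values are made up.

```python
import pandas as pd

# Monitored data: predictions without targets.
monitored = pd.DataFrame(
    {
        "id": [100, 101, 102],
        "timestamp": pd.to_datetime(
            ["2018-10-01 00:00", "2018-10-01 00:10", "2018-10-01 00:20"]
        ),
        "y_pred_proba": [0.91, 0.12, 0.67],
        "y_pred": [1, 0, 1],
    }
)

# Target data that became available later: identifier and target only.
targets = pd.DataFrame({"id": [100, 102], "repaid": [1, 0]})

# A left join keeps every prediction; rows with no target yet get NaN.
# validate="one_to_one" raises if an id is duplicated on either side.
with_targets = monitored.merge(targets, on="id", how="left", validate="one_to_one")
```

Because the identifier column is named `id` in both frames, the join needs no renaming, which is one practical reason to keep column names identical across datasets.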
As the sample data above shows, NannyML consumes data in a tabular format. Each prediction is expected to be described in one row, and features and other information are provided through columns. NannyML accepts data in CSV and parquet formats.
Features, model outputs, and targets are the key information needed to monitor a model. All features and model outputs are expected to be represented by unique columns in the data provided. For classification problems, outputs means both predicted probabilities and predicted classes. For the reference dataset, model targets are required. For the monitored data, they are optional; if they are not provided, they can be added later through the target dataset option. Until that is done, NannyML features such as realized performance and concept drift monitoring cannot be used.
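A quick check of these column requirements can be done before uploading anything. The sketch below uses plain Python on lists of column names. The helper functions are hypothetical, and the assignment of the car loan columns to features, outputs, and target is an assumption made for illustration.

```python
def missing_columns(columns, required):
    """Return the required column names that are absent from `columns`."""
    return sorted(set(required) - set(columns))


def duplicated_columns(columns):
    """Return the column names that appear more than once."""
    seen, dupes = set(), set()
    for name in columns:
        if name in seen:
            dupes.add(name)
        seen.add(name)
    return sorted(dupes)


# Assumed column roles for the car loan sample.
features = [
    "car_value", "salary_range", "debt_to_income_ratio", "loan_length",
    "repaid_loan_on_prev_car", "size_of_downpayment", "driver_tenure",
]
outputs = ["y_pred_proba", "y_pred"]
target = ["repaid"]

reference_columns = ["id", "timestamp"] + features + target + outputs
monitored_columns = ["id", "timestamp"] + features + outputs  # targets come later

# The reference dataset must include targets; the monitored dataset may omit them.
reference_missing = missing_columns(reference_columns, features + outputs + target)
monitored_missing = missing_columns(monitored_columns, features + outputs)
monitored_has_targets = not missing_columns(monitored_columns, target)
```

With a pandas DataFrame, the same helpers can be called with `list(df.columns)`.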
Apart from the standard features, NannyML needs two additional columns: an id column and a timestamp column.
An id column provides a unique identifier for each model prediction. Since each prediction is expected to be in one row, the id column is unique per row in our data. It can be an integer or a string. If a unique identifier is not present in your data, you need to create one before you can use NannyML Cloud.
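If no identifier exists, one simple option is a running integer. The following is a minimal sketch with pandas, using a made-up frame of predictions that were logged without an identifier.

```python
import pandas as pd

# Hypothetical predictions that were logged without an identifier.
df = pd.DataFrame({"y_pred_proba": [0.99, 0.07, 1.0], "y_pred": [1, 0, 1]})

# One row per prediction, so a running integer makes each row unique.
df = df.reset_index(drop=True)
df.insert(0, "id", range(len(df)))

# Guard against accidental duplicates before uploading.
if not df["id"].is_unique:
    raise ValueError("id column contains duplicate values")
```

When more data are added later, the new identifiers should continue from the highest existing value so that they do not collide with earlier rows.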
The timestamp column describes the time at which each model prediction was made. It is mainly used to aggregate predictions by the time they were made, which organizes the model monitoring results.
If that information is not stored for your business use case, you need to create a synthetic timestamp column before using NannyML. Note that timestamp information is used in chunking and when plotting results, so be careful to use values that make sense. Any format supported by pandas can be used.
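One way to build a synthetic timestamp column is to spread the rows evenly over the period in which the predictions are known to have been made. The sketch below uses `pandas.date_range`. The start and end dates are assumptions, and the rows are assumed to already be in the order in which the predictions were made.

```python
import pandas as pd

# Hypothetical predictions without time information, in prediction order.
df = pd.DataFrame({"id": range(5), "y_pred": [1, 0, 1, 1, 1]})

# Assumed period in which these predictions were made.
start = pd.Timestamp("2018-01-01")
end = pd.Timestamp("2018-01-02")

# Evenly spaced timestamps, assigned row by row in the existing order.
df["timestamp"] = pd.date_range(start=start, end=end, periods=len(df))
```

Evenly spaced values keep time-based chunks roughly equal in size, which tends to make the resulting plots easier to read.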
Given all of the above, how would one go about creating datasets to monitor their model with NannyML Cloud? Let's list the necessary steps:
Decide on which data period will be used for the reference dataset and which for the monitored dataset.
Gather relevant data from where they are stored.
Identify which data need to be collected: features, model outputs, targets, and an id and a timestamp column.
Create queries to gather the relevant data. This can be complicated if data are stored in different places. Keep in mind that the end result needs to be data in tabular format.
Add required additional columns if they are not present.
Store the data. Note that NannyML can receive data from:
A public URL serving the raw file.
A cloud storage option, namely Amazon S3 or Azure Blob Storage.
Local files on your computer, if their size is less than 100 MB.
The NannyML Cloud SDK.
Hence, store the data in whichever way is most convenient for you.
We recommend storing your data as parquet files.
NannyML Cloud supports both parquet and CSV files, but CSV files don't store data type information. CSV files may cause incorrect data types to be inferred. If you later add more data to the model using the SDK or using parquet format, a data type conflict may occur.
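The data type problem with CSV can be reproduced with a small in-memory round trip. The sketch below uses made-up string identifiers with leading zeros. After writing to CSV and reading back with default settings, pandas infers them as integers.

```python
import io

import pandas as pd

original = pd.DataFrame(
    {
        "id": ["001", "002", "003"],  # string identifiers with leading zeros
        "repaid_loan_on_prev_car": [False, True, False],
    }
)

# Round trip through CSV in memory; no data type information is written.
buffer = io.StringIO()
original.to_csv(buffer, index=False)

buffer.seek(0)
restored = pd.read_csv(buffer)

print(original["id"].tolist())  # ['001', '002', '003']
print(restored["id"].tolist())  # [1, 2, 3]

# If CSV must be used, stating the type explicitly avoids the wrong inference.
buffer.seek(0)
restored_as_str = pd.read_csv(buffer, dtype={"id": str})
```

Writing with `DataFrame.to_parquet` instead (which requires a parquet engine such as pyarrow to be installed) stores the column types in the file, so this kind of mismatch does not arise.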