# How to get data ready for NannyML

This tutorial is a practical guide to getting your data into the right format for monitoring your models with NannyML. We walk through the preparation step by step, starting with a simple model that we will monitor later.

## Step 1: Have a trained model (prerequisite)

If you are looking into monitoring, you probably already have a model, a training set, and most likely some production (monitored) data on which you would like to check how your model is doing. This tutorial uses the [Banana Quality Prediction](https://github.com/NannyML/sample_datasets/tree/main/banana_quality) dataset to illustrate how to prepare your data.

Let's start by partitioning our data into three sets:

* Training
* Testing
* Monitored (production data)

{% hint style="info" %}
In a real scenario, you won't need to create the monitored partition, since that data arrives naturally once the model is deployed.

If you already have a trained model, skip to [Step 2: Building a reference dataset.](#step-2-building-a-reference-dataset)
{% endhint %}

```python
import pandas as pd

url = 'https://raw.githubusercontent.com/NannyML/sample_datasets/main/banana_quality/banana_quality.csv' 
data = pd.read_csv(url)
train_df, test_df, monitor_df = data.iloc[0:4000], data.iloc[4000:6000], data.iloc[6000:8000]
```

```python
data.head()
```

| Identifier | Timestamp           | Size      | Weight    | Sweetness | Softness  | HarvestTime | Ripeness  | Acidity   | Quality |
| ---------- | ------------------- | --------- | --------- | --------- | --------- | ----------- | --------- | --------- | ------- |
| 0          | 2020-01-01 00:00:00 | 2.622631  | -1.531197 | -0.293286 | 2.086239  | 4.379607    | 0.865033  | 0.139479  | 1       |
| 1          | 2020-01-01 00:30:00 | -3.592020 | 0.220773  | 0.298705  | -1.759699 | -4.116195   | 0.604932  | -2.387580 | 1       |
| 2          | 2020-01-01 01:00:00 | 1.270482  | -3.959625 | -3.447087 | 0.799014  | 0.148791    | -2.098769 | 0.727079  | 0       |
| 3          | 2020-01-01 01:30:00 | -0.002022 | -1.417225 | -1.847301 | 0.301628  | 1.364337    | 0.265501  | -0.458392 | 1       |
| 4          | 2020-01-01 02:00:00 | -0.786569 | -4.323583 | 1.466369  | 0.157923  | -2.063598   | 2.619294  | -0.340423 | 0       |

The **Quality** column is the target variable. It is encoded as:

* **1** - Good banana quality
* **0** - Bad banana quality
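
The positional split above relies on the rows already being in chronological order, which holds for this dataset. As a minimal sketch, assuming your own data may arrive unsorted, the same three-way split can be done after sorting by the timestamp column. The toy frame and its values are illustrative stand-ins, not part of the tutorial data.

```python
import pandas as pd

# Toy stand-in for your own data (hypothetical values), deliberately shuffled.
toy = pd.DataFrame({
    'Timestamp': pd.date_range('2020-01-01', periods=8, freq='30min'),
    'Quality': [1, 1, 0, 1, 0, 0, 1, 0],
}).sample(frac=1, random_state=0)

# Sort by time first so each partition covers a later period than the previous one.
toy = toy.sort_values('Timestamp').reset_index(drop=True)

n = len(toy)
train_end, test_end = n // 2, n * 3 // 4  # 50% / 25% / 25%, the same ratio as the tutorial split
toy_train = toy.iloc[:train_end]
toy_test = toy.iloc[train_end:test_end]
toy_monitor = toy.iloc[test_end:]

print(len(toy_train), len(toy_test), len(toy_monitor))
```

Splitting by time rather than at random keeps the monitored partition strictly later than the data used for training and testing, which mirrors how production data arrives.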

Once we have the data separated, we can build a simple model.

```python
from lightgbm import LGBMClassifier

features = [
    'Size', 'Weight', 'Sweetness', 'Softness', 
    'HarvestTime', 'Ripeness', 'Acidity'
]
target = 'Quality'
timestamp = 'Timestamp'

clf = LGBMClassifier()
clf.fit(train_df[features], train_df[target])
```

Let's quickly check its performance on the training and testing sets before moving to the next step.

```python
from sklearn.metrics import roc_auc_score

roc_auc_train = roc_auc_score(
    train_df[target], clf.predict_proba(train_df[features])[:, 1]
)
roc_auc_test = roc_auc_score(
    test_df[target], clf.predict_proba(test_df[features])[:, 1]
)

print(f"ROC-AUC (train): {roc_auc_train}")
print(f"ROC-AUC (test): {roc_auc_test}")
```

```
ROC-AUC (train): 1.0
ROC-AUC (test): 0.9931109775641026
```

The model performs well on both the training and testing sets.

## Step 2: Building a reference dataset

Once we have a model that performs well on the test set, we can use it to build a reference dataset.

**Reference dataset:** The reference dataset is a benchmark set where the model behaves as expected. It is important that this dataset was not seen by the model during training. It should include targets and the model's predictions. In this tutorial, we will use the test set to build the reference set.

NannyML uses the reference dataset to establish a baseline of how the monitoring data and the model predictions should look after deployment. Internally, it uses it to calibrate most of its performance estimation methods and calculate the default threshold values for every monitored metric.

To transform the test dataset into a reference dataset, all we need to do is add the predicted probabilities and the model predictions to the test dataset.

This means that we will create the following columns:

* **y\_pred\_proba**: contains the model's predicted probabilities (scores) for class 1 (Good quality banana, in this case).
* **y\_pred**: contains the model's predicted class (0 or 1).

```python
reference_df = test_df.copy()

reference_df['y_pred_proba'] = clf.predict_proba(test_df[features])[:, 1]
reference_df['y_pred'] = clf.predict(test_df[features])
```

```python
reference_df.head()
```

| Identifier | Timestamp           | Size      | Weight    | Sweetness | Softness  | HarvestTime | Ripeness  | Acidity   | Quality | y\_pred\_proba | y\_pred |
| ---------- | ------------------- | --------- | --------- | --------- | --------- | ----------- | --------- | --------- | ------- | -------------- | ------- |
| 4000       | 2020-03-24 08:00:00 | -1.005942 | 0.457455  | -0.819113 | 1.014088  | 1.380156    | -2.005996 | 3.522873  | 0       | 0.002104       | 0       |
| 4001       | 2020-03-24 08:30:00 | -2.425685 | 1.682808  | -2.041250 | -1.925259 | -1.118302   | -1.264770 | -3.848056 | 0       | 0.006822       | 0       |
| 4002       | 2020-03-24 09:00:00 | 0.363760  | -1.777900 | -0.585864 | 2.736269  | 3.509048    | 1.986182  | -1.747902 | 1       | 0.998873       | 1       |
| 4003       | 2020-03-24 09:30:00 | -2.206337 | -3.712593 | 0.734068  | -2.015895 | -3.488548   | 0.747031  | 0.754206  | 0       | 0.019571       | 0       |
| 4004       | 2020-03-24 10:00:00 | 0.649706  | -3.580519 | 0.352195  | 1.058788  | 0.834113    | 3.243598  | -3.792337 | 1       | 0.967419       | 1       |

{% hint style="info" %}
We recommend storing your data as parquet files.

NannyML Cloud supports both parquet and CSV files, but CSV files don't store data type information. CSV files may cause incorrect data types to be inferred. If you later add more data to the model using the SDK or using parquet format, a data type conflict may occur.
{% endhint %}
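
To illustrate why CSV files can lead to type conflicts, here is a small sketch that round-trips a toy frame through CSV in memory. The frame and its values are hypothetical; the point is only that pandas has to re-infer every column type when reading CSV.

```python
import io
import pandas as pd

# Toy frame (hypothetical values) with a datetime column and a zero-padded string identifier.
df = pd.DataFrame({
    'Identifier': ['001', '002', '003'],
    'Timestamp': pd.to_datetime(['2020-01-01 00:00', '2020-01-01 00:30', '2020-01-01 01:00']),
    'y_pred': [1, 0, 1],
})

# Round-trip through CSV in memory: the types have to be re-inferred on read.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

print(df.dtypes)
print(from_csv.dtypes)  # Timestamp is no longer a datetime; Identifier is read as integers
```

Writing the data with `reference_df.to_parquet('reference.parquet', index=False)` keeps the column types instead. The file name is a placeholder, and pandas needs a parquet engine such as `pyarrow` installed for that call.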

## Step 3: Setting up the monitored dataset

**Monitored dataset:** This is the data that NannyML will monitor. It typically contains the latest production data, which should come from the period after the reference period ends. This dataset does not require targets.

As with the reference dataset, let's add the following columns to the monitored dataset.

* **y\_pred\_proba**: contains the model's predicted probabilities (scores) for class 1 (Good quality banana, in this case).
* **y\_pred**: contains the model's predicted class (0 or 1).

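```python
# monitor_df is a slice of `data`; copy it first to avoid pandas' SettingWithCopyWarning
monitor_df = monitor_df.copy()
monitor_df['y_pred_proba'] = clf.predict_proba(monitor_df[features])[:, 1]
monitor_df['y_pred'] = clf.predict(monitor_df[features])
```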

```python
monitor_df.head()
```

| Identifier | Timestamp           | Size      | Weight    | Sweetness | Softness  | HarvestTime | Ripeness  | Acidity   | Quality | y\_pred\_proba | y\_pred |
| ---------- | ------------------- | --------- | --------- | --------- | --------- | ----------- | --------- | --------- | ------- | -------------- | ------- |
| 6000       | 2020-05-05 00:00:00 | -1.955202 | -0.117761 | 0.121855  | -5.150094 | -1.387223   | 3.435446  | 1.662624  | 1       | 0.999419       | 1       |
| 6001       | 2020-05-05 00:30:00 | -0.940370 | -1.119835 | -0.998754 | 1.206442  | -3.871381   | 0.711751  | 0.799148  | 0       | 0.001309       | 0       |
| 6002       | 2020-05-05 01:00:00 | -0.819169 | -3.060052 | -1.092720 | 2.581883  | -0.306439   | -2.864940 | -1.350002 | 0       | 0.000332       | 0       |
| 6003       | 2020-05-05 01:30:00 | -1.757643 | -0.109889 | 1.044176  | -2.219761 | -2.899120   | -0.954929 | 2.877549  | 1       | 0.384413       | 0       |
| 6004       | 2020-05-05 02:00:00 | -1.499208 | 0.618753  | -0.616309 | 0.195418  | -0.299742   | -0.323257 | -4.152448 | 0       | 0.012700       | 0       |
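
Two properties of the monitored dataset described above can be checked in code: it should start after the reference period ends, and it does not need targets. The following is a minimal sketch using toy stand-ins for `reference_df` and `monitor_df`, assuming the timestamp column holds datetimes (for example after `pd.to_datetime`).

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for reference_df and monitor_df.
toy_reference = pd.DataFrame({
    'Timestamp': pd.date_range('2020-03-24 08:00', periods=4, freq='30min'),
    'Quality': [0, 0, 1, 0],
})
toy_monitored = pd.DataFrame({
    'Timestamp': pd.date_range('2020-05-05 00:00', periods=4, freq='30min'),
    'Quality': [1, 0, 0, 1],
})

reference_end = toy_reference['Timestamp'].max()
monitored_start = toy_monitored['Timestamp'].min()

# The monitored period should begin only after the reference period has ended.
if monitored_start <= reference_end:
    raise ValueError(
        f"Monitored data starts at {monitored_start}, "
        f"but reference data runs until {reference_end}."
    )

# Targets are optional in monitored data, so they can be left out
# when they are not yet available in production.
toy_monitored_no_target = toy_monitored.drop(columns=['Quality'])
print(toy_monitored_no_target.columns.tolist())
```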

{% hint style="info" %}
In this particular example, we are dealing with a binary classification problem. The required columns for multiclass classification and regression differ slightly.
{% endhint %}

Here is a summary of all the columns required for regression, binary classification, and multiclass classification.

#### Regression Model - Required columns

<table data-full-width="true"><thead><tr><th>Column</th><th width="181">Reference</th><th>Monitored</th></tr></thead><tbody><tr><td><strong>identifier</strong></td><td>✅</td><td>✅</td></tr><tr><td><strong>timestamp</strong></td><td>✅</td><td>✅</td></tr><tr><td><strong>y_pred_proba</strong></td><td>N/A</td><td>N/A</td></tr><tr><td><strong>y_pred</strong></td><td>✅</td><td>✅</td></tr><tr><td><strong>target</strong></td><td>✅</td><td>Optional</td></tr></tbody></table>

#### Classification Model - Required columns

<table data-full-width="true"><thead><tr><th>Column</th><th width="181">Reference</th><th>Monitored</th></tr></thead><tbody><tr><td><strong>identifier</strong></td><td>✅</td><td>✅</td></tr><tr><td><strong>timestamp</strong></td><td>✅</td><td>✅</td></tr><tr><td><strong>y_pred_proba</strong></td><td>✅</td><td>✅</td></tr><tr><td><strong>y_pred</strong></td><td>✅</td><td>✅</td></tr><tr><td><strong>target</strong></td><td>✅</td><td>Optional</td></tr></tbody></table>
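
The two tables above can be turned into a quick pre-upload check. This is a sketch only: the helper, its name, and the role dictionary are illustrative and not part of any NannyML API. It assumes a regression or binary classification model; multiclass models, which need one probability column per class, would require extending the check.

```python
import pandas as pd

# Required columns per problem type, transcribed from the tables above.
# The dictionary keys and the helper itself are illustrative, not a NannyML API.
REQUIRED_ROLES = {
    'regression': ['identifier', 'timestamp', 'y_pred'],
    'classification': ['identifier', 'timestamp', 'y_pred_proba', 'y_pred'],
}

def missing_roles(df, column_roles, problem_type, is_reference):
    """Return the required roles that have no matching column in df.

    column_roles maps a role name (e.g. 'timestamp') to the column name in df.
    """
    required = list(REQUIRED_ROLES[problem_type])
    if is_reference:
        required.append('target')  # targets are optional only for monitored data
    return [role for role in required if column_roles.get(role) not in df.columns]

# Usage with an empty toy frame that lacks the target column.
toy = pd.DataFrame(columns=['Identifier', 'Timestamp', 'y_pred_proba', 'y_pred'])
roles = {
    'identifier': 'Identifier', 'timestamp': 'Timestamp',
    'y_pred_proba': 'y_pred_proba', 'y_pred': 'y_pred', 'target': 'Quality',
}
missing_as_reference = missing_roles(toy, roles, 'classification', is_reference=True)
missing_as_monitored = missing_roles(toy, roles, 'classification', is_reference=False)
print(missing_as_reference, missing_as_monitored)
```

The same frame is incomplete as a reference dataset but acceptable as a monitored one, because only the reference dataset must include targets.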

If the model is a multiclass classifier, there should be a **y\_pred\_proba** column for each class.

Example:

* y\_pred\_proba\_setosa
* y\_pred\_proba\_virginica
* y\_pred\_proba\_versicolor
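
As a minimal sketch of how such columns could be built, the snippet below names one column per class from a matrix of predicted probabilities. The class labels and probabilities are hypothetical stand-ins for the `classes_` attribute and `predict_proba` output of a scikit-learn style classifier; the column order must follow the order of the classes in the probability matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical class labels and predicted probabilities, standing in for
# model.classes_ and model.predict_proba(X) of a scikit-learn style classifier.
classes = ['setosa', 'virginica', 'versicolor']
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
])

# One probability column per class, named y_pred_proba_<class>.
scores = pd.DataFrame(proba, columns=[f'y_pred_proba_{c}' for c in classes])

# The predicted class is the one with the highest probability in each row.
scores['y_pred'] = [classes[i] for i in proba.argmax(axis=1)]
print(scores)
```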

<details>

<summary>What if my data doesn't have timestamps?</summary>

Timestamps indicate the date and time when observations were recorded. If your dataset doesn't have timestamps, you can generate synthetic ones. Just make sure they align with your specific use case.

For example:

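```python
import datetime as dt

# `df` is your dataset; this assumes it has the default integer index (0, 1, 2, ...).
# Each row gets a timestamp 30 minutes after the previous one, starting on 2020-01-01.
timestamps = [dt.datetime(2020, 1, 1) + dt.timedelta(hours=x / 2) for x in df.index]
df['Timestamp'] = timestamps
```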
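
An alternative to the list comprehension above is `pd.date_range`, which does not depend on the index values. This is a sketch on a toy frame with hypothetical values; the start date and the 30-minute frequency are assumptions you should adapt to your use case.

```python
import pandas as pd

# Toy frame (hypothetical values) standing in for a dataset without timestamps.
df = pd.DataFrame({'Size': [2.6, -3.6, 1.3, 0.0]})

# One synthetic timestamp per row, 30 minutes apart, starting on 2020-01-01.
df['Timestamp'] = pd.date_range(start='2020-01-01', periods=len(df), freq='30min')
print(df)
```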

</details>

## Where to go next?

<table data-view="cards"><thead><tr><th></th><th></th><th></th><th data-hidden data-card-target data-type="content-ref"></th><th data-hidden data-card-cover data-type="files"></th></tr></thead><tbody><tr><td><a href="/pages/CBUhbrgPxH7E6VhLtxHs"><strong>Get started on Azure</strong></a></td><td>Check out how to launch NannyML Cloud on Azure</td><td></td><td><a href="/pages/CBUhbrgPxH7E6VhLtxHs">/pages/CBUhbrgPxH7E6VhLtxHs</a></td><td></td></tr><tr><td><a href="/pages/e1bOAHCEBBhJItFHonjK"><strong>Get started on AWS</strong></a></td><td>Check out how to launch NannyML Cloud on AWS</td><td></td><td><a href="/pages/e1bOAHCEBBhJItFHonjK">/pages/e1bOAHCEBBhJItFHonjK</a></td><td></td></tr></tbody></table>

