> For the complete documentation index, see [llms.txt](https://docs.nannyml.com/cloud/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs.nannyml.com/cloud/v0.23.0/model-monitoring/data-preparation.md).

# Data Preparation

What data does NannyML need to monitor a machine learning model in production?

* Data from two different periods is needed.
* Data needs to be in tabular format.
* Features, model outputs, and targets are needed (targets are optional for the monitored dataset).
* Some additional columns are needed, namely a timestamp and an identifier column.

The NannyML open-source library provides sample datasets. Let's have a quick preview of the [synthetic car loan dataset](https://nannyml.readthedocs.io/en/stable/datasets/binary_car_loan.html) before we go into details.

```python
import nannyml as nml

reference, monitored, targets = nml.load_synthetic_car_loan_dataset()
reference.head()
```

| id | timestamp | car_value | salary_range | debt_to_income_ratio | loan_length | repaid_loan_on_prev_car | size_of_downpayment | driver_tenure | repaid | y_pred_proba | y_pred |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2018-01-01 00:00:00.000 | 39811 | 40K - 60K € | 0.63295 | 19 | False | 40% | 0.212653 | 1 | 0.99 | 1 |
| 1 | 2018-01-01 00:08:43.152 | 12679 | 40K - 60K € | 0.718627 | 7 | True | 10% | 4.92755 | 0 | 0.07 | 0 |
| 2 | 2018-01-01 00:17:26.304 | 19847 | 40K - 60K € | 0.721724 | 17 | False | 0% | 0.520817 | 1 | 1 | 1 |
| 3 | 2018-01-01 00:26:09.456 | 22652 | 20K - 20K € | 0.705992 | 16 | False | 10% | 0.453649 | 1 | 0.98 | 1 |
| 4 | 2018-01-01 00:34:52.608 | 21268 | 60K+ € | 0.671888 | 21 | True | 30% | 5.69526 | 1 | 0.99 | 1 |

Let's now see the requirements in more detail.

### Datasets and Periods

In order to monitor a model's behavior, we first need to establish a pattern of acceptable behavior. This is done with data from a reference period, often called the **reference dataset**. Usually, this dataset is the test set from when the model was developed, or the latest available production data where the model performed according to expectations.

Then we need a **monitored dataset**, which is a dataset that comes from the period where we want to examine how well the model performs.

In some cases, the monitored dataset does not contain targets. This often happens when targets become available later than the date the prediction is made. To accommodate this, NannyML allows for a third dataset, the **target dataset**. The target dataset only needs to contain targets and an identifier column.

Also note that the same column names must be used in the reference, monitored and target datasets.

### Data Format

As can be seen in the sample data presented above, NannyML consumes data in a tabular format. Each prediction is expected to be described in one row. Features and other information are provided through columns. NannyML accepts data in CSV and parquet formats.

### Features, Model Outputs and Targets

These are the key pieces of information needed to monitor a model. All the features and model outputs are expected to be represented by unique columns in the data provided. By outputs we mean both predicted probabilities and predicted classes for classification problems.

For the reference dataset, model targets are required. For the monitored data they are optional. If they are not provided, they can be added later through the target dataset option. Unless this is done, NannyML features such as realized performance and concept drift monitoring cannot be used.
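To illustrate how a target dataset relates to a monitored dataset through the identifier column, here is a minimal local sketch using pandas. It is not how NannyML Cloud combines the datasets internally; the tiny DataFrames and their values are hypothetical, and only the column names (`id`, `timestamp`, `y_pred_proba`, `y_pred`, `repaid`) follow the sample data shown earlier.

```python
import pandas as pd

# Hypothetical miniature monitored dataset: predictions without targets.
monitored = pd.DataFrame(
    {
        "id": [100, 101, 102],
        "timestamp": pd.to_datetime(
            ["2018-11-01 00:00", "2018-11-01 00:10", "2018-11-01 00:20"]
        ),
        "y_pred_proba": [0.91, 0.12, 0.67],
        "y_pred": [1, 0, 1],
    }
)

# Hypothetical target dataset: only the identifier and the target column.
# The target for id 101 has not arrived yet.
targets = pd.DataFrame({"id": [100, 102], "repaid": [1, 0]})

# A left join keeps every prediction; rows without a target get NaN.
# validate="one_to_one" raises if an id is duplicated on either side.
monitored_with_targets = monitored.merge(
    targets, on="id", how="left", validate="one_to_one"
)

# Count predictions that are still waiting for their target.
n_missing = int(monitored_with_targets["repaid"].isna().sum())
print(monitored_with_targets)
print(f"Predictions still missing a target: {n_missing}")
```

A check like this can be useful before uploading, since duplicated identifiers or targets that match no prediction are easier to diagnose locally.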
### Additional Columns

Apart from the standard features, NannyML needs two additional columns: an **id** column and a **timestamp** column.

#### Id Column

An id column is a column that provides a unique identifier for each model prediction. Since each prediction is expected to be in one row, the id column is unique per row in our data. It can be an integer or a string. If a unique identifier is not present in your data, you need to create one before you can use NannyML Cloud.

#### Timestamp Column

The timestamp column is a column that describes the time at which a model prediction was made. It is mainly used to aggregate predictions according to when they were made, in order to organize model monitoring results. If that information is not stored for your business use case, you need to create a synthetic timestamp column before using NannyML. Note that timestamp information is used in [chunking](https://nannyml.readthedocs.io/en/stable/tutorials/chunking.html) and when plotting results, so be careful to use values that will make sense. Any format supported by pandas can be used.

## Data Preparation Workflow

Given all of the above, how would one go about creating datasets to monitor their model with NannyML Cloud? Let's list the necessary steps:

1. Decide which data period will be used for the reference dataset and which for the monitored dataset.
2. Gather the relevant data from where it is stored.
    1. Identify what data needs to be collected: features, model outputs, targets, as well as an id and a timestamp column.
    2. Create queries to gather the relevant data. This can be complicated if data is stored in different places. Keep in mind that the end result needs to be data in tabular format.
3. Add the required additional columns if they are not present.
4. Store the data. Note that NannyML can receive data from:
    1. A public URL serving the raw file.
    2. A cloud storage option, namely S3 and Azure blob storage.
    3. Local files on your computer, if their size is less than 100 MB.
    4. The NannyML Cloud SDK.

    Hence, store the data in the way that is most convenient for you.

{% hint style="info" %}
We recommend storing your data as parquet files. NannyML Cloud supports both parquet and CSV files, but CSV files don't store data type information, so incorrect data types may be inferred from them. If you later add more data to the model using the SDK or using parquet format, a data type conflict may occur.
{% endhint %}
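Step 3 of the workflow can be sketched with pandas as follows. This is only an illustration under stated assumptions: `add_required_columns` is a hypothetical helper (not part of NannyML), the `start` and `step` defaults are placeholders, and a synthetic timestamp should be replaced by values that make sense for your use case, as noted above.

```python
import pandas as pd


def add_required_columns(df, start="2024-01-01", step=pd.Timedelta(hours=1)):
    """Return a copy of df with 'id' and 'timestamp' columns, creating them if missing.

    Hypothetical helper for illustration; 'start' and 'step' are placeholder values.
    """
    out = df.copy()

    # Create a unique integer identifier, one per row, if none exists.
    if "id" not in out.columns:
        out.insert(0, "id", range(len(out)))

    if "timestamp" not in out.columns:
        # Synthetic, evenly spaced timestamps that preserve the row order.
        out["timestamp"] = pd.date_range(start=start, periods=len(out), freq=step)
    else:
        # Parse existing values so the column has a proper datetime dtype.
        out["timestamp"] = pd.to_datetime(out["timestamp"])

    # Each prediction must be identifiable by exactly one id.
    if not out["id"].is_unique:
        raise ValueError("The id column must be unique per row.")
    return out


# Hypothetical raw data lacking both additional columns.
raw = pd.DataFrame(
    {
        "car_value": [39811, 12679, 19847],
        "y_pred_proba": [0.99, 0.07, 1.0],
        "y_pred": [1, 0, 1],
        "repaid": [1, 0, 1],
    }
)
prepared = add_required_columns(raw)
print(prepared.dtypes)
```

The prepared frame could then be written with `prepared.to_parquet("reference.parquet")` (the file name is a placeholder), which needs a parquet engine such as `pyarrow` or `fastparquet` installed and keeps the data types shown by `prepared.dtypes`.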