Adding a model

There are two ways to add a model to NannyML Cloud. The recommended way is to add it programmatically, as described on the NannyML Cloud SDK page. Alternatively, you can add the model manually through the NannyML Cloud UI, which this page explains.

If you prefer a video walkthrough and you upload your data from Azure, here's our YouTube guide:

2. Provide model information

You need to provide four pieces of information about your model:

The machine learning problem type: The type of problem the machine learning model is solving. The options are binary classification, multiclass classification, and regression. This choice determines what type of model output and target data NannyML expects and which metrics NannyML can calculate. It cannot be changed later.
The main performance metric: The available metrics depend on the problem type you select. The metric can always be changed later, and you can monitor multiple metrics simultaneously. Currently, we support the following metrics:

ROC-AUC
F1
Precision
Recall
Specificity
Accuracy
Business value
Confusion matrix elements

How the data has to be chunked: Chunking determines how metrics are aggregated, i.e. the granularity of the monitoring analysis. The options are time-based or size-based. For time-based chunking, you can choose daily, weekly, monthly, quarterly, or yearly chunks. For size-based chunking, you select a chunk size, i.e. the number of records in a single chunk. The last chunk may not be completely filled if there are not enough records; it is recomputed automatically as more records are added and the chunk fills up. The chunking unit can always be changed later in the model settings.

We currently only support time-based and size-based chunking. If you need number-based chunking (splitting the data into a fixed number of chunks), contact us.
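To make the two supported chunking modes concrete, here is a minimal pandas sketch. It is an illustration only, not NannyML's implementation: the helper functions and the column names `timestamp` and `y_pred` are hypothetical.

```python
import pandas as pd

def time_based_chunks(df, timestamp_col, freq):
    """Group rows into calendar periods, e.g. freq="W" for weekly chunks."""
    periods = pd.to_datetime(df[timestamp_col]).dt.to_period(freq)
    return [chunk for _, chunk in df.groupby(periods)]

def size_based_chunks(df, chunk_size):
    """Split rows into consecutive chunks of chunk_size records.
    The last chunk may hold fewer records."""
    return [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

# Hypothetical data: 10 daily predictions starting on Monday 2024-01-01.
data = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=10, freq="D"),
    "y_pred": range(10),
})

weekly = time_based_chunks(data, "timestamp", "W")
by_size = size_based_chunks(data, chunk_size=4)

print([len(c) for c in weekly])   # rows per weekly chunk
print([len(c) for c in by_size])  # rows per size-based chunk
```

In both cases the final chunk is only partially filled, which is why its metrics can change as more records arrive.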

3. Configure the reference dataset

The reference dataset is the dataset NannyML will use as a baseline for monitoring your model. This dataset ideally represents a time when the model worked as expected. The ideal candidate for this is the test set. You need to point NannyML to where this dataset is located and provide some basic information about the dataset schema.

Point NannyML to the reference dataset location

Pick one of the following upload options:

Provide a public URL

If the dataset is accessible via a public URL, you can provide that link here:

To try out NannyML, use one of our public datasets on GitHub. Here is a link to the synthetic car price prediction - reference dataset:

https://github.com/NannyML/nannyml/raw/main/nannyml/datasets/data/regression_synthetic_reference.csv

We recommend using parquet files when uploading data using the user interface.

NannyML Cloud supports both parquet and CSV files, but CSV files don't store data type information. CSV files may cause incorrect data types to be inferred. If you later add more data to the model using the SDK or using parquet format, a data type conflict may occur.
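The following sketch shows how type information can get lost in a CSV round trip. It uses plain pandas and a small made-up table; the column names `timestamp` and `customer_id` are hypothetical, and the inference shown is pandas' own, which may differ from what NannyML Cloud infers.

```python
import io
import pandas as pd

original = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 08:00", "2024-01-02 09:30"]),
    "customer_id": ["007", "042"],  # identifiers stored as strings
})

# Write to CSV in memory and read it back without any type hints.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
roundtrip = pd.read_csv(buffer)

print(original.dtypes)
print(roundtrip.dtypes)
print(roundtrip["customer_id"].tolist())
```

After the round trip, the timestamp column is no longer a datetime column, and the string identifiers are read as integers, which drops their leading zeros. A format that stores data types, such as parquet, avoids this kind of silent change.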

Provide reference dataset information

NannyML requires schema information about the reference dataset. Although it infers column details automatically, it is wise to double-check them. The most critical columns to define are listed on the left. Which columns you need to specify depends on the machine learning problem type you chose at the beginning of this workflow. All other columns are automatically treated as features, and NannyML automatically detects the data types of these feature columns.

The following columns have to be specified:

Timestamp: The date and time at which the prediction was made.
Prediction: The model output that the model predicted for its target outcome.
Target: The ground truth or actual outcome of what the model is predicting.
Identifier: A unique identifier for each row in the dataset. NannyML will use this column to join analysis and target data sources.

You can change the column mapping by scrolling horizontally. It is also possible to ignore specific columns or to flag columns that should be used for joining predictions and targets later.
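Before uploading, it can help to run a few basic checks on the columns you plan to map. The sketch below is one way to do that with pandas. The function `check_reference_columns` and all column names are hypothetical, and the checks are assumptions drawn from the descriptions above, not NannyML's own validation.

```python
import pandas as pd

def check_reference_columns(df, timestamp, prediction, target, identifier):
    """Return (problems, features). An empty problems list means the
    basic checks passed; features are all columns not mapped to a role."""
    required = [timestamp, prediction, target, identifier]
    missing = [col for col in required if col not in df.columns]
    if missing:
        return [f"missing columns: {missing}"], []

    problems = []
    # Timestamps must be parseable as dates.
    if pd.to_datetime(df[timestamp], errors="coerce").isna().any():
        problems.append(f"'{timestamp}' has values that are not valid dates")
    # The identifier must be unique per row so it can be used for joining.
    if df[identifier].duplicated().any():
        problems.append(f"'{identifier}' is not unique per row")

    features = [col for col in df.columns if col not in required]
    return problems, features

# Hypothetical reference data with two deliberate problems.
reference = pd.DataFrame({
    "id": [1, 2, 2],
    "timestamp": ["2024-01-01", "2024-01-02", "not a date"],
    "y_pred": [10.5, 11.0, 9.8],
    "y_true": [10.0, 11.5, 10.1],
    "car_age": [3, 5, 1],
})

problems, features = check_reference_columns(
    reference, "timestamp", "y_pred", "y_true", "id"
)
print(problems)
print(features)
```

Here the check reports the unparseable timestamp and the duplicated identifier, and treats `car_age` as the only feature column.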

4. Configure the analysis dataset

The analysis dataset is what NannyML uses to analyze the performance of the monitored model. Typically, it will consist of the latest production data up to a desired point in the past, which should be after the reference dataset ends.

Note: NannyML assumes that the schema of the analysis dataset is the same as the reference dataset.
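As a quick local sanity check, you can compare the two datasets before uploading them. This is a minimal pandas sketch under stated assumptions: the helper `compare_to_reference` and the column names are hypothetical, and both timestamp columns are assumed to already be datetime columns.

```python
import pandas as pd

def compare_to_reference(reference, analysis, timestamp_col):
    """Check two assumptions from the text: both datasets share the same
    schema, and the analysis data starts after the reference data ends."""
    same_schema = reference.dtypes.to_dict() == analysis.dtypes.to_dict()
    starts_after = analysis[timestamp_col].min() > reference[timestamp_col].max()
    return {
        "same_schema": same_schema,
        "starts_after_reference": bool(starts_after),
    }

# Hypothetical datasets: reference from January, analysis from February.
reference = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=3, freq="D"),
    "y_pred": [10.5, 11.0, 9.8],
})
analysis = pd.DataFrame({
    "timestamp": pd.date_range("2024-02-01", periods=3, freq="D"),
    "y_pred": [10.1, 12.3, 9.9],
})

result = compare_to_reference(reference, analysis, "timestamp")
print(result)
```

If either value is `False`, it is worth fixing the data before adding the model, since the schema is assumed to be shared and the analysis period is expected to follow the reference period.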