Introduction

What Probabilistic Model Evaluation is and when to use it.

Probabilistic Model Evaluation

This module enables robust ML model evaluation by estimating the full probability distribution of selected performance metrics. A point estimate, which is what is typically calculated, only tells us the model's performance on a specific sample of data. The probability distribution, in contrast, tells us where the true population performance is likely to lie given the data observed so far. This allows us to answer questions such as whether the model's performance is within a specific range or above a certain threshold.
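To make this concrete, here is a minimal sketch that contrasts a point estimate with a probability distribution. It does not use NannyML's API; the numbers, the choice of accuracy as the metric, and the flat Beta prior are illustrative assumptions only.

```python
# Illustrative sketch (not NannyML code): a point estimate of accuracy versus
# a posterior distribution of the true population accuracy. Numbers are hypothetical.
import numpy as np
from scipy import stats

correct, total = 88, 100          # hypothetical evaluation sample
point_estimate = correct / total  # a single number: 0.88

# Beta posterior for the true population accuracy (flat Beta(1, 1) prior).
posterior = stats.beta(correct + 1, total - correct + 1)

# The distribution lets us answer range / threshold questions directly:
print(f"point estimate:            {point_estimate:.2f}")
print(f"P(true accuracy > 0.85):   {1 - posterior.cdf(0.85):.2f}")
print(f"P(0.80 < accuracy < 0.95): {posterior.cdf(0.95) - posterior.cdf(0.80):.2f}")
```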

When to use it?

The typical use case is the following: an ML model is deployed for a subset of the population (of users, machines, cars, etc.), and we want to know whether it maintains the performance seen during the model development phase. We usually want the answer as soon as possible, so we can scale the model up if it is good enough or iterate on it if it is not. Say we get 100 observations, and the metric we care about (F1, for example) is above our minimum threshold. Should we deploy the model at scale or wait and gather more data? A single value of F1 does not allow us to answer this question. We need a measure of uncertainty that accounts for the sample size and the data distribution, and the posterior distribution of the metric can be that measure.

Probabilistic Model Evaluation estimates the probability distribution of the selected performance metric using a Bayesian approach. We can then evaluate a null hypothesis (for example, that the model performance as measured by F1 is no lower than 0.9) using the HDI+ROPE decision rule, as sketched below.
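The sketch below illustrates this workflow for F1, again outside NannyML's API: a Dirichlet posterior over hypothetical confusion-matrix counts yields posterior samples of F1, from which a highest density interval (HDI) is computed and compared against a region of practical equivalence (ROPE). The counts, prior, and ROPE bounds are all illustrative assumptions.

```python
# Illustrative sketch (not NannyML code): posterior distribution of F1 via a
# Dirichlet-multinomial model over the confusion matrix, plus an HDI + ROPE check.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical confusion-matrix counts from 100 monitored observations:
# [true positives, false positives, false negatives, true negatives]
counts = np.array([42, 5, 8, 45])

# Dirichlet posterior over the cell probabilities (flat Dirichlet(1,1,1,1) prior).
samples = rng.dirichlet(counts + 1, size=20_000)
tp, fp, fn = samples[:, 0], samples[:, 1], samples[:, 2]

# Posterior samples of F1 = 2*TP / (2*TP + FP + FN).
f1_samples = 2 * tp / (2 * tp + fp + fn)

def hdi(draws: np.ndarray, prob: float = 0.95) -> tuple[float, float]:
    """Narrowest interval containing `prob` of the posterior draws."""
    sorted_draws = np.sort(draws)
    n = len(sorted_draws)
    width = int(np.floor(prob * n))
    lows = sorted_draws[: n - width]
    highs = sorted_draws[width:]
    best = np.argmin(highs - lows)
    return float(lows[best]), float(highs[best])

low, high = hdi(f1_samples)
rope = (0.90, 1.00)  # hypothetical ROPE: "F1 of at least 0.9 is good enough"

print(f"95% HDI for F1: ({low:.3f}, {high:.3f})")
if low >= rope[0]:
    print("HDI lies inside the ROPE -> performance is acceptable.")
elif high < rope[0]:
    print("HDI lies entirely below the ROPE -> performance is not acceptable.")
else:
    print("HDI overlaps the ROPE boundary -> gather more data before deciding.")
```

In the overlap case the data are not yet conclusive either way, which is exactly the situation the 100-observation example above describes.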

Where to go next?

Quickstart

See how to use Probabilistic Model Evaluation to robustly assess the performance of an ML model.

Find out what the HDI+ROPE decision rule is.

Find out how NannyML estimates the probability distribution of a performance metric and helps to define the ROPE and precision.
