Defaults for ROPE and estimation precision
This page explains how NannyML calculates default values for ROPE and estimation precision.
ROPE and estimation precision are typically driven by business requirements. ROPE is the region of metric values whose business impact is similar to the expected impact. Precision should be chosen so that the probability of a false positive (falsely accepting the hypothesis, that is, claiming the population performance is within ROPE) or of a false negative stays at or below the level the business requires. These parameters are sometimes difficult to provide, yet it is still worth running a Probabilistic Model Evaluation. For those cases, NannyML provides reasonable defaults for ROPE and estimation precision.
The default ROPE reflects the hypothesis that the ML model's performance is no worse than its performance on reference data. In practice, the default ROPE spans from the lower (left) edge of the 95% HDI of the reference performance posterior to the maximum possible value of the metric (1 for the metrics currently supported). Figure 1 shows the default ROPE.
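The construction of the default ROPE can be illustrated with a minimal sketch. It is not NannyML's implementation. It assumes the metric is accuracy, that the reference posterior is a Beta distribution obtained from a uniform prior and counts of correct and incorrect predictions, and that the HDI is approximated from posterior samples. The function names `hdi_from_samples` and `default_rope` and the counts used are hypothetical.

```python
import random


def hdi_from_samples(samples, mass=0.95):
    """Approximate the HDI as the narrowest interval holding `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, round(mass * n))  # number of samples the interval must contain
    # Slide a window of k consecutive sorted samples and keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def default_rope(correct, incorrect, mass=0.95, draws=20000, seed=0):
    """Return (lower HDI edge of the reference posterior, maximum metric value)."""
    rng = random.Random(seed)
    # Assumption: Beta(1 + correct, 1 + incorrect) posterior for accuracy
    # (binomial likelihood with a uniform prior).
    posterior = [rng.betavariate(1 + correct, 1 + incorrect) for _ in range(draws)]
    lower, _ = hdi_from_samples(posterior, mass)
    return lower, 1.0  # 1 is the maximum value of the metric


# Hypothetical reference data: 900 correct and 100 incorrect predictions.
rope = default_rope(correct=900, incorrect=100)
print(rope)
```

With these hypothetical counts the lower ROPE edge lands slightly below the observed reference accuracy of 0.9, and the upper edge is fixed at 1.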
The default precision is chosen so that the experiment's power is 0.8, where the goal of the experiment is to reach a conclusive answer (to accept or reject the hypothesis). In other words, at the default precision the experiment yields a conclusive answer with 80% probability. The default precision is estimated as follows:
1. Assume that the hypothesis is correct: the performance metric is within ROPE.
2. Sample the performance metric uniformly from ROPE.
3. Generate n observations of data that could be observed given the sampled performance metric (similar to posterior predictive sampling).
4. Get the posterior from the sampled observations.
5. Calculate the HDI of the posterior from step 4.
6. Check whether the HDI is fully within ROPE.
7. Repeat steps 2-6 multiple times, storing the result of step 6 each time.
8. Check whether ~80% of the experiment results (step 6) give a conclusive answer.
9. Repeat steps 2-8 for different values of n to find the n for which 80% of experiments give a conclusive, positive answer.
10. Repeat the process under the assumption that the hypothesis should be rejected (that is, sample the performance metric from outside ROPE) to find n for this assumption. Pick the larger n.
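The procedure above can be sketched as a small Monte Carlo simulation. This is an illustration under stated assumptions, not NannyML's implementation: the metric is accuracy, the posterior is a Beta distribution from a uniform prior, the HDI is approximated from posterior samples, "outside ROPE" means below its lower edge (since the upper edge is the metric maximum), and n is searched over a fixed grid. The names `power` and `default_n`, the grid, and the example ROPE are hypothetical.

```python
import random


def hdi_from_samples(samples, mass=0.95):
    """Approximate the HDI as the narrowest interval holding `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, round(mass * n))
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def power(n, rope, hypothesis_true=True, experiments=200, draws=500, seed=0):
    """Fraction of simulated experiments with n observations that are conclusive."""
    rng = random.Random(seed)
    lo, hi = rope
    conclusive = 0
    for _ in range(experiments):  # step 7: repeat steps 2-6
        # Step 2: sample the true metric uniformly inside ROPE
        # (or below ROPE when simulating the "reject" scenario of step 10).
        p = rng.uniform(lo, hi) if hypothesis_true else rng.uniform(0.0, lo)
        # Step 3: simulate n predictions, each correct with probability p.
        correct = sum(rng.random() < p for _ in range(n))
        # Step 4: posterior samples, assuming a Beta posterior with a uniform prior.
        posterior = [rng.betavariate(1 + correct, 1 + n - correct) for _ in range(draws)]
        # Step 5: HDI of the posterior.
        h_lo, h_hi = hdi_from_samples(posterior)
        # Step 6: conclusive if the HDI is fully inside ROPE (accept)
        # or, in the reject scenario, fully below it.
        if hypothesis_true:
            conclusive += int(lo <= h_lo and h_hi <= hi)
        else:
            conclusive += int(h_hi < lo)
    return conclusive / experiments  # step 8 compares this against 0.8


def default_n(rope, target=0.8, grid=(50, 100, 200, 400, 800, 1600, 3200)):
    """Steps 9-10: smallest grid n reaching the target power in both scenarios."""
    needed = []
    for hypothesis_true in (True, False):
        n_ok = next((n for n in grid if power(n, rope, hypothesis_true) >= target), None)
        if n_ok is None:
            return None  # no n in the grid reaches the target power
        needed.append(n_ok)
    return max(needed)  # pick the larger n


if __name__ == "__main__":
    print(default_n(rope=(0.88, 1.0)))
```

A coarse grid keeps the sketch short and avoids relying on the simulated power being strictly monotonic in n, which Monte Carlo noise does not guarantee. A finer search would need more experiments per candidate n.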