Test Categories

Subset Performance

Test that your model performs equally well across different subsets of the evaluation dataset.
Predictions are required. Labels are required for most tests.
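
As a rough illustration of the idea, the sketch below groups evaluation rows by one feature and compares accuracy across the groups. The function names, the record layout, and the choice of accuracy as the metric are assumptions made for this example, not part of the product's API.

```python
from collections import defaultdict


def accuracy_by_subset(records, feature):
    """Return {subset_value: accuracy} for rows grouped by `feature`.

    Each record is assumed to carry "label" and "prediction" keys
    (a hypothetical layout chosen for this sketch).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in records:
        key = row[feature]
        total[key] += 1
        correct[key] += int(row["prediction"] == row["label"])
    return {key: correct[key] / total[key] for key in total}


def max_accuracy_gap(per_subset):
    """Largest accuracy difference between any two subsets."""
    values = list(per_subset.values())
    return max(values) - min(values)


records = [
    {"region": "north", "label": 1, "prediction": 1},
    {"region": "north", "label": 0, "prediction": 0},
    {"region": "south", "label": 1, "prediction": 0},
    {"region": "south", "label": 0, "prediction": 0},
]
per_subset = accuracy_by_subset(records, "region")
gap = max_accuracy_gap(per_subset)
```

A large gap suggests the model serves some subsets noticeably worse than others; what counts as "large" depends on the application.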

Data Cleanliness

Test for data reliability by checking that your data is consistent and complete.
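
Two simple checks of this kind are counting missing values and counting duplicate rows. The sketch below is illustrative only; `cleanliness_report` and its output format are hypothetical, and real cleanliness tests may cover more conditions.

```python
def cleanliness_report(rows, required_columns):
    """Count missing values per column and exact duplicate rows."""
    missing = {col: 0 for col in required_columns}
    seen = set()
    duplicates = 0
    for row in rows:
        for col in required_columns:
            if row.get(col) is None:
                missing[col] += 1
        # A row's fingerprint is its values in a fixed column order.
        fingerprint = tuple(row.get(col) for col in required_columns)
        if fingerprint in seen:
            duplicates += 1
        else:
            seen.add(fingerprint)
    return {"missing": missing, "duplicates": duplicates}


rows = [
    {"age": 34, "country": "DE"},
    {"age": None, "country": "FR"},
    {"age": 34, "country": "DE"},
]
report = cleanliness_report(rows, ["age", "country"])
```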

Attacks

Test the robustness of your model by measuring the maximum difference in model predictions that can be caused by small perturbations to data points.
A model is required.
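
The quantity being measured can be sketched with a brute-force search: shift each feature by at most a small epsilon and record the largest change in the model's output. The `model` function below is a hypothetical stand-in, and checking only the corners of the perturbation box is a simplification chosen for this example rather than the attack method the tests use.

```python
import itertools
import math


def model(x):
    """Hypothetical stand-in model: a two-feature logistic scorer."""
    z = 0.8 * x[0] - 0.5 * x[1]
    return 1 / (1 + math.exp(-z))


def max_prediction_change(predict, point, epsilon):
    """Largest change in prediction when each feature moves by
    -epsilon, 0, or +epsilon (a coarse search of the perturbation box)."""
    base = predict(point)
    worst = 0.0
    for signs in itertools.product((-1, 0, 1), repeat=len(point)):
        perturbed = [v + s * epsilon for v, s in zip(point, signs)]
        worst = max(worst, abs(predict(perturbed) - base))
    return worst


change = max_prediction_change(model, [1.0, 2.0], 0.1)
no_change = max_prediction_change(model, [1.0, 2.0], 0.0)
```

The search evaluates 3^n points for n features, so it is only practical for very small inputs.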

Transformations

Augment your evaluation dataset with synthetic abnormal values to proactively test your pipeline’s error-handling behavior and measure the performance degradation caused by different types of abnormal values.
A model is required. Labels are not required, but they improve results.
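
A minimal sketch of this approach follows: copy each evaluation row, inject an abnormal value, and record both whether the pipeline raises an error and how accuracy changes. The `predict` pipeline, the two transformations, and the `evaluate` helper are all hypothetical and exist only for this example.

```python
def predict(row):
    """Hypothetical stand-in pipeline: flags rows whose amount exceeds 100."""
    return int(row["amount"] > 100)


def with_null(row, column):
    """Return a copy of the row with the column set to None."""
    changed = dict(row)
    changed[column] = None
    return changed


def with_outlier(row, column, factor=1000):
    """Return a copy of the row with the column scaled to an extreme value."""
    changed = dict(row)
    changed[column] = row[column] * factor
    return changed


def evaluate(rows, labels, transform=None):
    """Accuracy and error count after optionally transforming each row."""
    errors = 0
    correct = 0
    for row, label in zip(rows, labels):
        candidate = transform(row) if transform else row
        try:
            correct += int(predict(candidate) == label)
        except TypeError:
            # The pipeline could not handle the abnormal value.
            errors += 1
    return {"accuracy": correct / len(rows), "errors": errors}


rows = [{"amount": 50}, {"amount": 150}, {"amount": 20}, {"amount": 300}]
labels = [0, 1, 0, 1]

baseline = evaluate(rows, labels)
nulls = evaluate(rows, labels, lambda r: with_null(r, "amount"))
outliers = evaluate(rows, labels, lambda r: with_outlier(r, "amount"))
```

Here the null transformation exposes missing error handling, while the outlier transformation runs without errors but lowers accuracy.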

Distribution Drift

Test for differences between the distribution of the reference dataset and that of the evaluation dataset. If predictions and labels are provided, also measure the performance degradation caused by the shifting data, as well as drift in the predictions and labels themselves.
Labels and predictions are not required, but they improve results.
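
One common way to quantify drift in a numeric feature is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distributions. The sketch below uses it purely as an example; the text above does not say which drift measures the tests apply, and `ks_statistic` is a name chosen here.

```python
import bisect


def ks_statistic(reference, evaluation):
    """Two-sample Kolmogorov-Smirnov statistic for two numeric samples.

    Returns a value in [0, 1]; 0 means identical empirical
    distributions and 1 means no overlap.
    """
    ref = sorted(reference)
    ev = sorted(evaluation)
    largest_gap = 0.0
    for point in sorted(set(ref) | set(ev)):
        # Fraction of each sample that is <= point.
        cdf_ref = bisect.bisect_right(ref, point) / len(ref)
        cdf_ev = bisect.bisect_right(ev, point) / len(ev)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_ev))
    return largest_gap


same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = ks_statistic([1, 2, 3], [10, 11, 12])
```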

Abnormal Input

Check the evaluation dataset for abnormal values commonly encountered in production. If model predictions are provided, test if the observed abnormal values cause a degradation in your model’s performance.
Labels and predictions are not required, but they improve results.
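
Two examples of abnormal values are numeric values outside the range seen in the reference data and categories that never appear in it. The sketch below finds both; the helper names and the choice of these two checks are assumptions made for illustration.

```python
def out_of_range(reference_rows, evaluation_rows, column):
    """Indices of evaluation rows whose value falls outside the
    minimum and maximum observed in the reference data."""
    values = [row[column] for row in reference_rows]
    low, high = min(values), max(values)
    return [
        i for i, row in enumerate(evaluation_rows)
        if not low <= row[column] <= high
    ]


def unseen_categories(reference_rows, evaluation_rows, column):
    """Indices of evaluation rows whose category is absent from the
    reference data."""
    known = {row[column] for row in reference_rows}
    return [
        i for i, row in enumerate(evaluation_rows)
        if row[column] not in known
    ]


reference = [
    {"age": 25, "plan": "basic"},
    {"age": 60, "plan": "pro"},
]
evaluation = [
    {"age": 30, "plan": "basic"},
    {"age": 140, "plan": "pro"},
    {"age": 41, "plan": "trial"},
]
range_violations = out_of_range(reference, evaluation, "age")
new_categories = unseen_categories(reference, evaluation, "plan")
```

With predictions available, the flagged rows can be scored separately from the rest to see whether the abnormal values coincide with worse model performance.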

Compliance

Test that your model does not discriminate based on protected features.
Some tests require predictions, labels, a model, or a combination of these.
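
One widely used check of this kind is demographic parity: comparing the rate of positive predictions across groups defined by a protected feature. The sketch below shows that single metric as an example; it is an assumption that this metric is relevant here, and the function names are hypothetical.

```python
from collections import defaultdict


def positive_rate_by_group(predictions, groups):
    """Return {group: share of positive predictions} for binary predictions."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for prediction, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(prediction == 1)
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_gap(predictions, groups):
    """Difference between the highest and lowest positive rate."""
    rates = positive_rate_by_group(predictions, groups)
    return max(rates.values()) - min(rates.values())


predictions = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
rates = positive_rate_by_group(predictions, groups)
parity_gap = demographic_parity_gap(predictions, groups)
```

This metric needs only predictions; metrics that compare error rates across groups would also need labels.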