RIME v13 Changelog

AI Continuous Testing

Incremental Uploads

  • Continuous Testing now supports incremental uploads via the CLI. Users can create a CT run and then upload new batches of production data to detect drift in various summary metrics and diagnose the underlying cause of issues.

New Summary Metrics

  • New summary metrics are tracked over time to help users determine the underlying cause of errors. These are populated at the top level of the Continuous Testing page and include: Prediction Distribution Drift, Prediction Percentiles, Abnormal Inputs by Feature, Feature Distribution Drift, and Mutual Information Difference. These summary metrics are also available on the feature-level page, where they can be plotted alongside feature-level metrics to attribute changes in summary-level metrics to changes in feature-level metrics.
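
    For intuition, the following is a minimal sketch of how a per-feature Mutual Information Difference could be computed between the reference set and a new batch. This is not RIME's implementation; it assumes scikit-learn, numeric features, and a classification label.

        # Illustrative sketch only -- not RIME's actual metric computation.
        import numpy as np
        from sklearn.feature_selection import mutual_info_classif

        def mutual_information_difference(X_ref, y_ref, X_batch, y_batch):
            """Change in feature-label mutual information from reference set to new batch."""
            mi_ref = mutual_info_classif(X_ref, y_ref, random_state=0)
            mi_batch = mutual_info_classif(X_batch, y_batch, random_state=0)
            # A large absolute difference for a feature suggests its relationship
            # with the label has shifted in the new batch.
            return mi_batch - mi_ref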

Attribution

  • Within the main summary page of Continuous Testing, there are new columns in the Feature table. These three columns (Model Impact, Abnormality Rate, and Drift Statistic) allow the user to more easily attribute changes in summary metrics to specific features. Model Impact combines the model impacts calculated for each test run on a given feature, so this value can be thought of as an analogue to feature importance. Abnormality Rate is the number and percentage of rows for which an abnormal input test failed on a given feature. The Drift Statistic is the PSI of the feature's distribution in a given batch of data relative to its distribution in the original reference set.
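
    As a rough illustration of the Drift Statistic, here is a minimal PSI sketch for a single numeric feature, with bin edges taken from the reference set. This is a simplified sketch rather than RIME's implementation (for example, batch values falling outside the reference range are ignored here).

        # Illustrative PSI sketch only -- not RIME's implementation.
        import numpy as np

        def psi(reference, batch, n_bins=10, eps=1e-6):
            """Population Stability Index of `batch` relative to `reference`."""
            edges = np.histogram_bin_edges(reference, bins=n_bins)
            ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
            batch_frac = np.histogram(batch, bins=edges)[0] / len(batch)
            # Clip to avoid division by zero or log of zero for empty bins.
            ref_frac = np.clip(ref_frac, eps, None)
            batch_frac = np.clip(batch_frac, eps, None)
            return float(np.sum((batch_frac - ref_frac) * np.log(batch_frac / ref_frac)))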

NLP

  • Continuous Testing is now enabled for the Text Classification and Named Entity Recognition use cases. Track performance drift in your NLP models and understand the underlying distributional drift in your data that causes it.

AI Stress Testing

Smart Feature Sampling

  • For large datasets, RIME performance can be optimized by running on a subset of features. The user specifies the number of features to test; RIME automatically calculates feature importances and runs its suite of tests on the most important features, up to that number. The default run mode still runs RIME across all features. Recommended data limits on standard RIME production hardware (16GB memory) are a function of the number of rows of input data and the number of features: for data with 25 features, RIME supports up to 7 million rows; when input data has 500 features, the number of rows supported on standard hardware is 250,000.
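
    To illustrate the idea only (this is not how RIME computes feature importances), here is a minimal sketch of selecting the top-N features by importance before running tests, assuming scikit-learn and a tree-based surrogate model:

        # Illustrative sketch only -- not RIME's importance calculation.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        def select_top_features(X, y, feature_names, n_features_to_test):
            """Return the names of the most important features, up to the user-specified count."""
            surrogate = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
            order = np.argsort(surrogate.feature_importances_)[::-1]
            return [feature_names[i] for i in order[:n_features_to_test]]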

Exposing Configs

  • Clicking on the “More” button by each test description opens a new tooltip that splits the previously displayed sections into tabs for compactness. There is also a new tab called “Configuration” that displays the test config JSON used for that test in the run, along with the definition of each argument in the config.

Sensitivity Score

  • Within the Test Run Page, there is an additional metric, the Sensitivity Score, that can be used to compare multiple test runs. This score measures your model’s performance across the four tests of robustness to adversarial attacks (Single-Feature Changes, Multi-Feature Changes, Bounded Single-Feature Changes, and Bounded Multi-Feature Changes). A more robust model will have a lower Sensitivity Score than a less robust model, as it is less sensitive to perturbations in the data.

Side by Side Test Run Comparisons

  • Within the Test Run Page, clicking on the “Compare” button at the top of the table takes the user to the Side by Side Test Run Comparison Page. This page allows users to compare two test runs side by side. Users can see the performance of their model on all stress tests, and clicking on an individual stress test reveals information on the severity of test results on individual features.

Advanced Filtering

  • There is an updated filtering tool on all tables in RIME that allows users to filter by multiple attributes.