Validating Your Model with AI Stress Testing

This tutorial will guide you through validating NLP models with RIME AI Stress Testing.

All examples are available in the rime_trial/ bundle provided during installation.

An AI Stress Test is a statistical evaluation of a machine learning model, designed to detect a specific vulnerability. At Robust Intelligence, we are constantly researching new vulnerabilities to test.

For a full list of available stress tests, see our Test Bank.

NLP Setup

Please ensure that the extra RIME NLP dependencies have been installed from the nlp_requirements.txt file provided during installation. If you run into a ModuleNotFoundError at any point during this walkthrough, it is likely that you need to install the RIME NLP Extras!

pip install -r nlp_requirements.txt

Running Stress Testing on a Text Classification Example

This example uses a DistilBERT emotion recognition model trained on a slightly modified version of the CARER Emotion Recognition Dataset. The example relies on a couple of additional dependencies.

To install them, please run the following command:

pip install -r nlp_examples/trial_model_requirements.txt

To kick off a run of AI Stress Testing using a model and datasets:

rime-engine run-nlp --config-path nlp_examples/classification/emotion_recognition/stress_tests_config.json

NOTE: If the above command throws a ModuleNotFoundError, it is likely that you forgot to install the RIME NLP Extras (see setup above).

After this finishes running, the results are uploaded to the Default Project, where you can view them in the web client.

If you explore the test config in nlp_examples/classification/emotion_recognition/stress_tests_config.json you’ll see that we’ve configured a few parameters to specify the data, model, and other task-specific information.

For a full reference on the configuration file see the NLP Configuration Reference.

Running Stress Testing on a Text Classification Example with Metadata

This section covers adding custom metadata to your test run.

This example uses a RoBERTa-based model trained on tweets and fine-tuned for sentiment analysis. The dataset consists of tweets scraped from Twitter to analyze how travelers expressed their feelings about airlines.

Alongside text and label, the data includes several attributes that RIME will also run automated tests on:

  • Custom numeric metadata: Retweet_count

  • Custom categorical metadata: Reason, Airline, Location

These attributes exist for each datapoint in the meta dict, as key-value pairs. For example:

{"text": "@USAirways You have no idea how upset and frustrated I am right now. I'll be sure to follow through with never flying with you again.", "label": 0, "meta": {"Reason": "Can't Tell", "Airline": "US Airways", "Location": "Saratoga Springs", "Retweet_count": 0}}

To kick off a run of AI Stress Testing using a model, datasets, and custom metadata:

rime-engine run-nlp --config-path nlp_examples/classification/sentiment_analysis/stress_tests_config_with_metadata.json

If you poke around in stress_tests_config_with_metadata.json you’ll see that we’ve added custom_numeric_metadata and custom_categorical_metadata to data_profiling_info:

{
    "run_name": "Sentiment Analysis (Twitter Airline)",
    "model_task": "Text Classification",
    "data_info": { ... },
    "model_info": { ... },
    "prediction_info": { ... },
    "data_profiling_info": {
        "class_names": [
            "negative",
            "neutral",
            "positive"
        ],
        "custom_numeric_metadata": [
            "Retweet_count"
        ],
        "custom_categorical_metadata": [
            "Reason",
            "Airline",
            "Location"
        ]
    },
    "random_seed": 42
}

For a full reference on these data profiling parameters, see the Data Profiling Configuration.

Running Stress Testing on a Named Entity Recognition Example

This example uses the bert-base-NER model from Hugging Face.

To kick off a run of AI Stress Testing using the bert-base-NER model:

rime-engine run-nlp --config-path nlp_examples/ner/conll/stress_tests_config.json

If you poke around in nlp_examples/ner/conll/stress_tests_config.json you’ll see that we’ve changed the model_task to Named Entity Recognition, along with a couple of other parameters.
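To verify this yourself, you can load the config and inspect the task field. A minimal sketch, assuming you run it from the rime_trial/ directory:

import json

# Load the NER stress test config and check the task setting.
with open("nlp_examples/ner/conll/stress_tests_config.json") as f:
    config = json.load(f)

print(config["model_task"])  # "Named Entity Recognition"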

For a full reference on the configuration file see the NLP Configuration Reference.


Running Stress Testing on Your Own Model and Datasets

Define a Python Model File

Please refer to How to Create an NLP Model File for step-by-step instructions on creating a model interface for RIME.
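The exact interface is specified in that guide. Purely as an illustration, here is a hypothetical sketch in which RIME calls a predict function that maps a batch of texts to class probabilities; the function name, signature, and the Hugging Face model used below are placeholders, not the required interface:

from typing import List

import numpy as np
from transformers import pipeline

# Hypothetical sketch only: the real interface is documented in
# "How to Create an NLP Model File". Load the model once at import
# time so repeated prediction calls stay cheap.
_classifier = pipeline(
    "text-classification",
    model="bhadresh-savani/distilbert-base-uncased-emotion",  # placeholder model
    top_k=None,  # return scores for every class
)

def predict(texts: List[str]) -> np.ndarray:
    """Return one probability vector per input text (placeholder signature)."""
    outputs = _classifier(texts)
    # Sort scores by label name to fix a consistent column order; make sure
    # this order matches the class_names declared in your config.
    return np.array([
        [s["score"] for s in sorted(out, key=lambda s: s["label"])]
        for out in outputs
    ])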

Gather Datasets

1. Prepare Input Data

For a detailed specification of data formatting, see Input Data Format.
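As a sketch of what a classification dataset might look like on disk, following the one-JSON-object-per-line shape shown in the metadata example above (the file name and example rows are illustrative; consult Input Data Format for the authoritative schema):

import json

# Illustrative datapoints in the {"text", "label", optional "meta"} shape
# used earlier in this tutorial; labels index into class_names.
datapoints = [
    {"text": "I am thrilled with the service!", "label": 2},
    {"text": "My flight was delayed for hours.", "label": 0,
     "meta": {"Airline": "US Airways", "Retweet_count": 0}},
]

with open("eval_data.jsonl", "w") as f:
    for point in datapoints:
        f.write(json.dumps(point) + "\n")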

Create Configuration

With your data and model ready, you can now create a configuration file. Example configurations can be found in the rime_trial/ bundle (the ones used in this tutorial are under nlp_examples/).

For a detailed reference on what the configuration should look like, see AI Stress Testing Configuration Reference.
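As a rough sketch of how you might assemble one programmatically, mirroring the top-level keys of the sentiment analysis config shown earlier (the run name and file names are illustrative, and the nested sections are left empty to fill in per the reference):

import json

# Skeleton mirroring the top-level structure of the configs under
# nlp_examples/. Populate the nested sections per the AI Stress Testing
# Configuration Reference.
config = {
    "run_name": "My NLP Stress Test",      # illustrative name
    "model_task": "Text Classification",
    "data_info": {},                       # where to find your datasets
    "model_info": {},                      # how to load your model file
    "data_profiling_info": {
        "class_names": ["negative", "neutral", "positive"],
    },
    "random_seed": 42,
}

with open("my_stress_tests_config.json", "w") as f:
    json.dump(config, f, indent=4)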

Run the CLI

To kick off a run of AI Stress Testing using your configuration file, simply replace the --config-path argument below:

rime-engine run-nlp --config-path <PATH-TO-CONFIGURATION>

Conclusion

Congratulations! You’ve successfully used RIME to stress test a variety of NLP models.

Once again, we strongly recommend that you run RIME with a cached predictions file, similar to the one provided in the first part of this tutorial. This will greatly improve both the RIME runtime and the quality of the test suite results.

Model inference tends to be the most computationally expensive part of each RIME run, especially for large transformer models. While some tests still require access to the model by design (e.g., tests that use randomness or iterative attacks), providing a prediction file helps RIME avoid redundant computation so that each run is fast and focused.
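As an illustration of the idea only (the exact prediction file format, and how it is referenced via prediction_info, are covered in the configuration reference; the JSONL layout and the my_model import below are assumptions):

import json

# Hypothetical sketch: run the expensive inference once and cache one
# prediction vector per datapoint. The real format expected by
# prediction_info is defined in the configuration reference.
from my_model import predict  # placeholder: your model file's predict function

with open("eval_data.jsonl") as data_file:
    texts = [json.loads(line)["text"] for line in data_file]

with open("eval_preds.jsonl", "w") as preds_file:
    for probs in predict(texts):
        preds_file.write(json.dumps([float(p) for p in probs]) + "\n")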

Troubleshooting

If you run into issues, please refer to our Troubleshooting page for help! Additionally, your RI representative will be happy to assist, so feel free to reach out!