AI Stress Tests

Be sure to complete the initial setup described in RIME Data and Model Setup before proceeding.

Overview

An AI Stress Test is a statistical evaluation designed to detect a specific vulnerability in a machine learning model. At Robust Intelligence, we are constantly researching new vulnerabilities to test.

For a full list of available stress tests, see our Test Bank.

All tests expose a run_notebook method, which returns output in a notebook-friendly format: a dictionary with a few standard keys. The fundamental ones are:

  • status: One of PASS, FAIL, WARNING, or SKIP, denoting the outcome of the test.

  • severity: One of High, Medium, Low, or None, denoting the severity of the test's failure (None if the test did not fail).

  • params: A dictionary of all the parameters of the test.

  • columns: A list of the column names the test was run over.

Depending on its purpose, a test may also return additional keys carrying test-specific information.
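
For example, here is a minimal sketch of running several tests in a loop and reading these standard keys. It assumes container is the data-and-model container created during the setup described above; the individual tests match the examples later in this section.

from rime.tabular.tests import DuplicateRowsTest, UnseenCategoricalTest

# Run a small batch of tests and report the standard result keys.
# `container` is the data/model container from the RIME setup guide.
tests = [
    UnseenCategoricalTest(col_name="Device_operating_system"),
    DuplicateRowsTest(),
]
for test in tests:
    result = test.run_notebook(container)
    print(type(test).__name__, result["status"], result["severity"])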

Unseen Categorical

As an example, we can run the Unseen Categorical test:

from rime.tabular.tests import UnseenCategoricalTest
test = UnseenCategoricalTest(col_name="Device_operating_system")
test.run_notebook(container)

Output:

{'status': 'FAIL',
 'severity': 'Low',
 'params': {'_id': '4d2a94f6-d7aa-c547-b682-7e78fd71a79f',
  'model_impact_config': ObservedModelImpactConfig(severity_thresholds=None, min_num_samples=10),
  'col_name': 'Device_operating_system'},
 'columns': ['Device_operating_system'],
 'unseen_value_counts': Mac OS X 10_11_4    2
 Mac OS X 10.9       2
 Mac OS X 10_12_2    1
 Mac OS X 10_12_1    1
 Mac OS X 10.6       1
 Windows             1
 Mac OS X 10.10      1
 Name: Device_operating_system, dtype: int64,
 'failing_rows': [158, 1330, 1807, 2429, 2831, 4380, 4727, 7494, 9317],
 'num_failing_rows': 9}
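
Because failing_rows contains row indices, the offending rows are easy to pull out of the evaluation data for inspection. A minimal sketch, assuming your evaluation DataFrame is available as eval_df (an illustrative name; use whatever DataFrame you loaded during setup) and that the indices are positional:

result = test.run_notebook(container)

# Inspect the rows containing unseen categorical values.
# `eval_df` stands in for the evaluation DataFrame from the setup step.
failing = eval_df.iloc[result["failing_rows"]]
print(failing["Device_operating_system"].value_counts())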

Duplicate Rows

Running the Duplicate Rows test:

from rime.tabular.tests import DuplicateRowsTest
test = DuplicateRowsTest()
test.run_notebook(container)

Output:

This test passed because there are 0 duplicate row(s) in the evaluation data.
{'status': 'PASS',
 'severity': 'None',
 'Failing Rows': '0 (0.00%)',
 'params': {'_id': 'eccd9267-a47a-185c-58e4-eb88fea02ce7',
  'col_names': None,
  'severity_thresholds': (0.01, 0.05)},
 'columns': []}
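
The params show that the test ran over all columns (col_names is None), with severity_thresholds presumably setting the warning and failure cutoffs on the fraction of duplicate rows. A sketch of restricting the check to specific columns, assuming the constructor accepts the col_names and severity_thresholds parameters surfaced in params (an assumption, not a documented signature):

from rime.tabular.tests import DuplicateRowsTest

# Assumed: constructor keyword arguments mirror the `params` keys above.
test = DuplicateRowsTest(
    col_names=["DeviceType", "DeviceInfo"],
    severity_thresholds=(0.01, 0.05),
)
test.run_notebook(container)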

Non-Parametric Outliers

Running the Non-Parametric Outliers test on a numeric feature column:

from rime.tabular.tests import NonParametricOutliersTest
test = NonParametricOutliersTest("TransactionAmt")
test.run_notebook(container)

Output:

{'status': 'FAIL',
 'severity': 'Low',
 'params': {'_id': 'af584cae-191e-8cfa-b9f1-50dfa0a188a3',
  'model_impact_config': ObservedModelImpactConfig(severity_thresholds=None, min_num_samples=10),
  'col_name': 'TransactionAmt',
  'min_normal_prop': 0.99,
  'baseline_quantile': 0.1,
  'perturb_multiplier': 1.0},
 'columns': ['TransactionAmt'],
 'lower_threshold': -30.1300916166291,
 'upper_threshold': 4396.228995809948,
 'failing_rows': [3302, 8373],
 'num_failing_rows': 2}
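
The reported thresholds can be cross-checked directly against the data. A sketch, again assuming the evaluation DataFrame is available as eval_df:

result = test.run_notebook(container)

# Cross-check: count values outside the reported outlier thresholds.
col = eval_df["TransactionAmt"]
outliers = (col < result["lower_threshold"]) | (col > result["upper_threshold"])
print(f"{outliers.sum()} outlier value(s)")  # should agree with num_failing_rows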

Vulnerability

Running the Vulnerability test (also known as the single-feature change test):

from rime.tabular.tests import VulnerabilityTest
test = VulnerabilityTest("DeviceInfo")
test.run_notebook(container)

Output:

This test passed because the average change in prediction caused by an unbounded manipulation of the feature DeviceInfo over a sample of 10 rows was 0.00555, which is below the warning threshold of 0.05.
{'status': 'PASS',
 'severity': 'None',
 'Average Prediction Change': 0.0055514594454474705,
 'params': {'_id': 'e94863f0-e938-4be9-5e9b-e64674edc3b1',
  'severity_level_thresholds': (0.05, 0.15, 0.25),
  'col_names': ['DeviceInfo'],
  'l0_constraint': 1,
  'linf_constraint': None,
  'sample_size': 10,
  'search_count': 10,
  'use_tqdm': False,
  'label_range': (0.0, 1.0),
  'scaled_min_impact_threshold': 0.05},
 'columns': ['DeviceInfo'],
 'sample_inds': [3344, 1712, 4970, 4480, 1498, 1581, 3531, 473, 9554, 2929],
 'avg_score_change': 0.0055514594454474705,
 'normalized_avg_score_change': 0.0055514594454474705}
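
The params expose the knobs behind the search: l0_constraint bounds how many features may be changed, linf_constraint optionally bounds the magnitude of each change, and sample_size and search_count control how many rows are attacked and how many perturbations are tried. A sketch of a more thorough run, assuming these are accepted as constructor keyword arguments (an assumption based on the params dictionary, not a documented signature):

from rime.tabular.tests import VulnerabilityTest

# Assumed: keyword arguments mirror the `params` keys above.
test = VulnerabilityTest(
    "DeviceInfo",
    sample_size=50,   # attack more rows
    search_count=25,  # try more candidate perturbations per row
)
result = test.run_notebook(container)
print(result["avg_score_change"])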

Feature Subset

Running the Feature Subset test:

from rime.tabular.tests import FeatureSubsetTest
from rime.tabular.metric import AccuracyMetric
test = FeatureSubsetTest("DeviceType", AccuracyMetric, (0.1, 1.0, 1.0))
test.run_notebook(container)

Output:

{'status': 'PASS',
 'severity': 'None',
 'params': {'_id': '48457123-e119-0d15-c942-e9cb31e54840',
  'metric_name': <MetricName.ACCURACY: 'accuracy'>,
  'metric_cls': rime.tabular.metric.shared_metrics.AccuracyMetric,
  'min_sample_size': 20,
  'perf_change_thresholds': (0.1, 1.0, 1.0),
  'perf_change_threshold': 0.1,
  'col_name': 'DeviceType'},
 'columns': ['DeviceType'],
 'num_failing': 0,
 'overall_perf': 0.9692,
 'sample_size': 10000,
 'subsets_metric_dict': {'overall_perf': 0.9692,
  'subsets_info': {'desktop': {'name': 'desktop',
    'size': 1504,
    'criterion': 'desktop',
    'perf': 0.9461436170212766,
    'margin_error': 0.011408309534789187,
    'diff': 0.023056382978723367,
    'pos_rate': 0.06648936170212766,
    'sample_size_info': {<SampleSizeType.POS_LABEL: 'Positive Label'>: 100,
     <SampleSizeType.NEG_LABEL: 'Negative Label'>: 1404,
     <SampleSizeType.POS_PRED: 'Positive Prediction'>: 33,
     <SampleSizeType.NEG_PRED: 'Negative Prediction'>: 1471}},
   'mobile': {'name': 'mobile',
    'size': 947,
    'criterion': 'mobile',
    'perf': 0.9292502639915523,
    'margin_error': 0.016330589417348093,
    'diff': 0.039949736008447645,
    'pos_rate': 0.11298838437170011,
    'sample_size_info': {<SampleSizeType.POS_LABEL: 'Positive Label'>: 107,
     <SampleSizeType.NEG_LABEL: 'Negative Label'>: 840,
     <SampleSizeType.POS_PRED: 'Positive Prediction'>: 50,
     <SampleSizeType.NEG_PRED: 'Negative Prediction'>: 897}},
   'None': {'name': 'None',
    'size': 7549,
    'criterion': 'None',
    'perf': 0.9788051397536097,
    'margin_error': 0.003249127676628865,
    'diff': -0.00960513975360977,
    'pos_rate': 0.021592263876010067,
    'sample_size_info': {<SampleSizeType.POS_LABEL: 'Positive Label'>: 163,
     <SampleSizeType.NEG_LABEL: 'Negative Label'>: 7386,
     <SampleSizeType.POS_PRED: 'Positive Prediction'>: 3,
     <SampleSizeType.NEG_PRED: 'Negative Prediction'>: 7546}}}},
 'worst_subset': {'name': 'mobile',
  'size': 947,
  'criterion': 'mobile',
  'perf': 0.9292502639915523,
  'margin_error': 0.016330589417348093,
  'diff': 0.039949736008447645,
  'pos_rate': 0.11298838437170011,
  'sample_size_info': {<SampleSizeType.POS_LABEL: 'Positive Label'>: 107,
   <SampleSizeType.NEG_LABEL: 'Negative Label'>: 840,
   <SampleSizeType.POS_PRED: 'Positive Prediction'>: 50,
   <SampleSizeType.NEG_PRED: 'Negative Prediction'>: 897}}}
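
The worst_subset entry makes it easy to surface the weakest slice programmatically; every key used below appears in the output above:

result = test.run_notebook(container)

# Report the weakest subset found by the test.
worst = result["worst_subset"]
print(
    f"Worst subset: {worst['name']} "
    f"(n={worst['size']}, accuracy={worst['perf']:.4f}, "
    f"gap vs. overall={worst['diff']:+.4f})"
)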

That’s it!

NOTE: While we loaded a pretrained model here for convenience, the RIME Python Library can be used at any point in the prototyping workflow, whether during initial data exploration or during model training and iteration.