AI Stress Tests

Be sure to complete the initial setup described in RIME Data and Model Setup before proceeding.

Overview

An AI Stress Test is a statistical evaluation designed to detect a specific vulnerability in a machine learning model. At Robust Intelligence, we are constantly researching new vulnerabilities to test.

For a full list of available stress tests, see our Test Bank.

All tests expose a run_notebook method, which returns output in a notebook-friendly format: a dictionary with a few standard keys. The fundamental ones are:

  • status: One of PASS, FAIL, WARNING, or SKIP, denoting the outcome of the test.

  • severity: One of High, Medium, Low, or None, denoting the severity of the test's failure (None if the test did not fail).

  • params: A dictionary of all the parameters of the test.

  • columns: A list of the column names the test was run over.

Depending on its purpose, a test may also return additional keys carrying test-specific information.
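
For example, here is a minimal sketch of running several tests in a loop and reading these standard keys. It assumes container is the data-and-model container created during the setup described above; the individual tests match the examples later in this section.

from rime.tabular.tests import DuplicateRowsTest, UnseenCategoricalTest

# Run a small batch of tests and report the standard result keys.
# `container` is the data/model container from the RIME setup guide.
tests = [
    UnseenCategoricalTest(col_name="Device_operating_system"),
    DuplicateRowsTest(),
]
for test in tests:
    result = test.run_notebook(container)
    print(type(test).__name__, result["status"], result["severity"])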

Unseen Categorical

As an example, we can run the Unseen Categorical test:

from rime.tabular.tests import UnseenCategoricalTest
test = UnseenCategoricalTest(col_name="Device_operating_system")
test.run_notebook(container)

Output:

{'status': 'FAIL',
 'severity': 'Low',
 'params': {'_id': '4d2a94f6-d7aa-c547-b682-7e78fd71a79f',
  'model_impact_config': ObservedModelImpactConfig(severity_thresholds=None, min_num_samples=10),
  'col_name': 'Device_operating_system'},
 'columns': ['Device_operating_system'],
 'unseen_value_counts': Mac OS X 10_11_4    2
 Mac OS X 10.9       2
 Mac OS X 10_12_2    1
 Mac OS X 10_12_1    1
 Mac OS X 10.6       1
 Windows             1
 Mac OS X 10.10      1
 Name: Device_operating_system, dtype: int64,
 'failing_rows': [158, 1330, 1807, 2429, 2831, 4380, 4727, 7494, 9317],
 'num_failing_rows': 9}
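
Because failing_rows contains row indices, the offending rows are easy to pull out of the evaluation data for inspection. A minimal sketch, assuming your evaluation DataFrame is available as eval_df (an illustrative name; use whatever DataFrame you loaded during setup) and that the indices are positional:

result = test.run_notebook(container)

# Inspect the rows containing unseen categorical values.
# `eval_df` stands in for the evaluation DataFrame from the setup step.
failing = eval_df.iloc[result["failing_rows"]]
print(failing["Device_operating_system"].value_counts())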

Duplicate Rows

Running the Duplicate Rows test:

from rime.tabular.tests import DuplicateRowsTest
test = DuplicateRowsTest()
test.run_notebook(container)

Output:

This test passed because there are 0 duplicate row(s) in the evaluation data.
{'status': 'PASS',
 'severity': 'None',
 'Failing Rows': '0 (0.00%)',
 'params': {'_id': 'eccd9267-a47a-185c-58e4-eb88fea02ce7',
  'col_names': None,
  'severity_thresholds': (0.01, 0.05)},
 'columns': []}
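
The params show that the test ran over all columns (col_names is None), with severity_thresholds presumably setting the warning and failure cutoffs on the fraction of duplicate rows. A sketch of restricting the check to specific columns, assuming the constructor accepts the col_names and severity_thresholds parameters surfaced in params (an assumption, not a documented signature):

from rime.tabular.tests import DuplicateRowsTest

# Assumed: constructor keyword arguments mirror the `params` keys above.
test = DuplicateRowsTest(
    col_names=["DeviceType", "DeviceInfo"],
    severity_thresholds=(0.01, 0.05),
)
test.run_notebook(container)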

Non-Parametric Outliers

Running the Non-Parametric Outliers test on a numeric feature column:

from rime.tabular.tests import NonParametricOutliersTest
test = NonParametricOutliersTest("TransactionAmt")
test.run_notebook(container)

Output:

{'status': 'FAIL',
 'severity': 'Low',
 'params': {'_id': 'af584cae-191e-8cfa-b9f1-50dfa0a188a3',
  'model_impact_config': ObservedModelImpactConfig(severity_thresholds=None, min_num_samples=10),
  'col_name': 'TransactionAmt',
  'min_normal_prop': 0.99,
  'baseline_quantile': 0.1,
  'perturb_multiplier': 1.0},
 'columns': ['TransactionAmt'],
 'lower_threshold': -30.1300916166291,
 'upper_threshold': 4396.228995809948,
 'failing_rows': [3302, 8373],
 'num_failing_rows': 2}
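
The reported thresholds can be cross-checked directly against the data. A sketch, again assuming the evaluation DataFrame is available as eval_df:

result = test.run_notebook(container)

# Cross-check: count values outside the reported outlier thresholds.
col = eval_df["TransactionAmt"]
outliers = (col < result["lower_threshold"]) | (col > result["upper_threshold"])
print(f"{outliers.sum()} outlier value(s)")  # should agree with num_failing_rows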

Vulnerability

Running the Vulnerability test (also known as the single-feature change test):

from rime.tabular.tests import VulnerabilityTest
test = VulnerabilityTest("DeviceInfo")
test.run_notebook(container)

Output:

This test passed because the average change in prediction caused by an unbounded manipulation of the feature DeviceInfo over a sample of 10 rows was 0.00555, which is below the warning threshold of 0.05.
{'status': 'PASS',
 'severity': 'None',
 'Average Prediction Change': 0.0055514594454474705,
 'params': {'_id': 'e94863f0-e938-4be9-5e9b-e64674edc3b1',
  'severity_level_thresholds': (0.05, 0.15, 0.25),
  'col_names': ['DeviceInfo'],
  'l0_constraint': 1,
  'linf_constraint': None,
  'sample_size': 10,
  'search_count': 10,
  'use_tqdm': False,
  'label_range': (0.0, 1.0),
  'scaled_min_impact_threshold': 0.05},
 'columns': ['DeviceInfo'],
 'sample_inds': [3344, 1712, 4970, 4480, 1498, 1581, 3531, 473, 9554, 2929],
 'avg_score_change': 0.0055514594454474705,
 'normalized_avg_score_change': 0.0055514594454474705}
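
The params expose the knobs behind the search: l0_constraint bounds how many features may be changed, linf_constraint optionally bounds the magnitude of each change, and sample_size and search_count control how many rows are attacked and how many perturbations are tried. A sketch of a more thorough run, assuming these are accepted as constructor keyword arguments (an assumption based on the params dictionary, not a documented signature):

from rime.tabular.tests import VulnerabilityTest

# Assumed: keyword arguments mirror the `params` keys above.
test = VulnerabilityTest(
    "DeviceInfo",
    sample_size=50,   # attack more rows
    search_count=25,  # try more candidate perturbations per row
)
result = test.run_notebook(container)
print(result["avg_score_change"])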

Feature Subset

Running the Feature Subset test:

from rime.tabular.tests import FeatureSubsetTest
from rime.tabular.metric import AccuracyMetric
test = FeatureSubsetTest("DeviceType", AccuracyMetric, (0.1, 1.0, 1.0))
test.run_notebook(container)

Output:

{'status': 'PASS',
 'severity': 'None',
 'params': {'_id': '48457123-e119-0d15-c942-e9cb31e54840',
  'metric_name': <MetricName.ACCURACY: 'accuracy'>,
  'metric_cls': rime.tabular.metric.shared_metrics.AccuracyMetric,
  'min_sample_size': 20,
  'perf_change_thresholds': (0.1, 1.0, 1.0),
  'perf_change_threshold': 0.1,
  'col_name': 'DeviceType'},
 'columns': ['DeviceType'],
 'num_failing': 0,
 'overall_perf': 0.9692,
 'sample_size': 10000,
 'subsets_metric_dict': {'overall_perf': 0.9692,
  'subsets_info': {'desktop': {'name': 'desktop',
    'size': 1504,
    'criterion': 'desktop',
    'perf': 0.9461436170212766,
    'margin_error': 0.011408309534789187,
    'diff': 0.023056382978723367,
    'pos_rate': 0.06648936170212766,
    'sample_size_info': {<SampleSizeType.POS_LABEL: 'Positive Label'>: 100,
     <SampleSizeType.NEG_LABEL: 'Negative Label'>: 1404,
     <SampleSizeType.POS_PRED: 'Positive Prediction'>: 33,
     <SampleSizeType.NEG_PRED: 'Negative Prediction'>: 1471}},
   'mobile': {'name': 'mobile',
    'size': 947,
    'criterion': 'mobile',
    'perf': 0.9292502639915523,
    'margin_error': 0.016330589417348093,
    'diff': 0.039949736008447645,
    'pos_rate': 0.11298838437170011,
    'sample_size_info': {<SampleSizeType.POS_LABEL: 'Positive Label'>: 107,
     <SampleSizeType.NEG_LABEL: 'Negative Label'>: 840,
     <SampleSizeType.POS_PRED: 'Positive Prediction'>: 50,
     <SampleSizeType.NEG_PRED: 'Negative Prediction'>: 897}},
   'None': {'name': 'None',
    'size': 7549,
    'criterion': 'None',
    'perf': 0.9788051397536097,
    'margin_error': 0.003249127676628865,
    'diff': -0.00960513975360977,
    'pos_rate': 0.021592263876010067,
    'sample_size_info': {<SampleSizeType.POS_LABEL: 'Positive Label'>: 163,
     <SampleSizeType.NEG_LABEL: 'Negative Label'>: 7386,
     <SampleSizeType.POS_PRED: 'Positive Prediction'>: 3,
     <SampleSizeType.NEG_PRED: 'Negative Prediction'>: 7546}}}},
 'worst_subset': {'name': 'mobile',
  'size': 947,
  'criterion': 'mobile',
  'perf': 0.9292502639915523,
  'margin_error': 0.016330589417348093,
  'diff': 0.039949736008447645,
  'pos_rate': 0.11298838437170011,
  'sample_size_info': {<SampleSizeType.POS_LABEL: 'Positive Label'>: 107,
   <SampleSizeType.NEG_LABEL: 'Negative Label'>: 840,
   <SampleSizeType.POS_PRED: 'Positive Prediction'>: 50,
   <SampleSizeType.NEG_PRED: 'Negative Prediction'>: 897}}}
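
The worst_subset entry makes it easy to surface the weakest slice programmatically; every key used below appears in the output above:

result = test.run_notebook(container)

# Report the weakest subset found by the test.
worst = result["worst_subset"]
print(
    f"Worst subset: {worst['name']} "
    f"(n={worst['size']}, accuracy={worst['perf']:.4f}, "
    f"gap vs. overall={worst['diff']:+.4f})"
)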

That’s it!

NOTE: While we loaded a pretrained model here for convenience, the RIME Python Library can be used at any point in the prototyping workflow, whether during initial data exploration or during model training and iteration.