Tests

Distribution Drift

Nulls Per Feature Drift

This test measures the severity of passing to the model data points that have features with a null proportion that has drifted from the distribution observed in the reference set. The severity is a function of the impact on the model, as well as the presence of drift in the data. The model impact measures how much model performance changes due to drift in the given feature. The key detail is the p-value from a two-sample proportion test that checks if there is a statistically significant difference in the frequencies of null values between the reference and evaluation sets.

Why it matters: Distribution drift in null values between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the preprocessing pipeline. A big shift in null value proportion could indicate a degradation in model performance and signal the need for relabeling and retraining.

Configuration: By default, this test runs over all columns with sufficiently many samples.

Example: Suppose that the observed frequency of null values for a given feature is 100/2000 in the reference set but 100/1500 in the evaluation set. Then the p-value would be 0.0425. If our p-value threshold were 0.05, the test would fail.
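As a rough illustration of the two-sample proportion test, here is a minimal sketch of a pooled two-sided z-test using only the standard library. The function name is hypothetical, and the text does not say which variant of the test is used (pooled or unpooled variance, continuity correction), so this sketch need not reproduce the quoted p-value exactly.

```python
import math

def two_proportion_p_value(nulls_ref, n_ref, nulls_eval, n_eval):
    """Two-sided p-value of a pooled two-sample z-test for proportions."""
    p_ref = nulls_ref / n_ref
    p_eval = nulls_eval / n_eval
    # Pooled null rate under the hypothesis that both sets share one rate.
    pooled = (nulls_ref + nulls_eval) / (n_ref + n_eval)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_ref + 1 / n_eval))
    if std_err == 0:
        return 1.0  # all-null or null-free in both sets: no evidence of drift
    z = (p_eval - p_ref) / std_err
    # Two-sided tail probability of a standard normal variable.
    return math.erfc(abs(z) / math.sqrt(2))

p_value = two_proportion_p_value(100, 2000, 100, 1500)
test_fails = p_value < 0.05  # assumed rule: fail when p is below the threshold
```

With the counts from the example, this variant also yields a p-value below 0.05, so the test would fail under the same threshold.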

Nulls Per Row Drift

This test measures the severity of passing to the model data points that have proportions of null values that have drifted from the distribution observed in the reference set. The severity is a function of the impact on the model, as well as the presence of drift in the data. The model impact measures how much predictions change when the observed drift is applied to a given row. The key detail displayed is the Population Stability Index (PSI), which measures how much the distribution of the proportion of null values in a row differs between the reference and evaluation sets.

Why it matters: Distribution drift in null values between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the preprocessing pipeline. A big shift in null value proportion could indicate a degradation in model performance and signal the need for relabeling and retraining.

Configuration: By default, this test runs over all rows.

Example: Suppose that in the reference set 5% of rows had more than three features that were null. If we observe in the evaluation set that now 50% of rows had more than three features that were null, this test would fail, highlighting a large drift in the proportion of features within a row that were null.

Feature Correlation Drift

This test measures the severity of feature correlation drift from the reference to the evaluation set for a given pair of features. The severity is a function of the correlation drift in the data. The key detail is the difference in correlation scores between the reference and evaluation sets, along with an associated p-value. Correlation is a measure of the linear relationship between two numeric features, so this test checks for significant changes in this relationship between pairs of features in the reference and evaluation sets. To compute the p-value, we use Fisher's z-transformation to convert the distribution of sample correlations to a normal distribution, and then we run a standard two-sample test on two normal distributions.

Why it matters: Correlation drift between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the underlying processing stage. A big shift in these dependencies could indicate shifting datasets and degradation in model performance, signalling the need for relabeling and retraining.

Configuration: By default, this test runs over all pairs of features in the dataset.

Example: Suppose that the correlation between country and state is 0.5 in the reference set but 0.7 in the evaluation set, and the p-value is 0.03. Then the large difference in scores indicates that the dependency between the two features has drifted. If our difference threshold was 0.2, and p-value threshold was 0.05, then the test would fail.
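The Fisher z-transformation step can be sketched as follows. This is a minimal illustration, not the product's implementation: the function name is hypothetical, and the sample sizes (100 rows in each set) are assumed, since the example does not state them.

```python
import math

def correlation_drift_p_value(r_ref, n_ref, r_eval, n_eval):
    """Two-sided p-value for the difference between two sample correlations.

    Each correlation r is mapped to z = atanh(r), which is approximately
    normal with variance 1 / (n - 3). Requires n_ref > 3 and n_eval > 3.
    """
    z_ref = math.atanh(r_ref)
    z_eval = math.atanh(r_eval)
    std_err = math.sqrt(1 / (n_ref - 3) + 1 / (n_eval - 3))
    z = (z_eval - z_ref) / std_err
    return math.erfc(abs(z) / math.sqrt(2))

# Assumed sample sizes; the correlations are the ones from the example.
p_value = correlation_drift_p_value(0.5, 100, 0.7, 100)
```

Under these assumed sample sizes the p-value comes out a little under 0.03, in line with the example; larger samples would make the same difference in correlation more significant.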

Mutual Information Drift (Feature-to-Feature)

This test measures the severity of feature mutual information drift from the reference to the evaluation set for a given pair of features. The severity is a function of the mutual information drift in the data. The key detail is the difference in mutual information scores between the reference and evaluation sets. Mutual information is a measure of how dependent two features are, so this checks for significant changes in dependence between pairs of features in the reference and evaluation sets.

Why it matters: Mutual information drift between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the underlying processing stage. A big shift in these dependencies could indicate shifting datasets and degradation in model performance, signalling the need for relabeling and retraining.

Configuration: By default, this test runs over all pairs of features in the dataset.

Example: Suppose that the mutual information between country and state is 0.5 in the reference set but 0.7 in the evaluation set. Then the large difference in scores indicates that the dependency between the two features has drifted. If our difference threshold was 0.2 then the test would fail.
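For intuition, mutual information between two categorical columns can be estimated directly from co-occurrence counts. The sketch below uses toy data and a hypothetical function name; the estimator and the units (nats) are assumptions, as the text does not specify them.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in nats) between two columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts.
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi

# Reference: state fully determines country. Evaluation: the two are independent.
mi_ref = mutual_information(["US", "US", "FR", "FR"], ["CA", "NY", "IDF", "IDF"])
mi_eval = mutual_information(["US", "US", "FR", "FR"], ["A", "B", "A", "B"])
mi_drift = abs(mi_eval - mi_ref)  # compared against the difference threshold
```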

Mutual Information Drift (Feature-to-Label)

This test measures the severity of mutual information drift between a given feature and the label, from the reference to the evaluation set. The severity is a function of the mutual information drift in the data. The key detail is the difference in mutual information scores between the reference and evaluation sets. Mutual information is a measure of how dependent two variables are, so this checks for significant changes in the dependence between each feature and the label in the reference and evaluation sets.

Why it matters: Mutual information drift between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the underlying processing stage. A big shift in these dependencies could indicate shifting datasets and degradation in model performance, signalling the need for relabeling and retraining.

Configuration: By default, this test runs over all features in the dataset, each paired with the label.

Example: Suppose that the mutual information between country and the label is 0.5 in the reference set but 0.7 in the evaluation set. Then the large difference in scores indicates that the dependency between the feature and the label has drifted. If our difference threshold was 0.2 then the test would fail.

Categorical Feature Drift

This test measures the severity of passing to the model data points that have categorical features which have drifted from the distribution observed in the reference set. The severity is a function of the impact on the model, as well as the presence of drift in the data. The model impact measures how much model performance changes due to drift in the given feature. The key detail displayed is the Population Stability Index (PSI), which measures how different the frequencies of categorical values in the reference and evaluation sets are.

Why it matters: Distribution drift in categorical features between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the preprocessing pipeline. A big shift in categorical features towards categorical subsets that your model performs poorly in could indicate a degradation in model performance and signal the need for relabeling and retraining.

Configuration: By default, this test runs over all categorical columns with sufficiently many samples.

Example: Suppose that the observed frequencies of the isLoggedIn feature are [100, 200] in the reference set but [25, 150] in the evaluation set. Then the PSI would be 0.201. If our PSI threshold was 0.1 then the test would fail.
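A minimal sketch of PSI over category counts is shown below. The function name and the `eps` floor for empty categories are assumptions. The plain formula gives roughly 0.21 for these counts, so any smoothing or binning applied by the product (not described here) may account for small differences from the quoted figure.

```python
import math

def psi_from_counts(ref_counts, eval_counts, eps=1e-6):
    """PSI between two aligned lists of category counts."""
    ref_total = sum(ref_counts)
    eval_total = sum(eval_counts)
    psi = 0.0
    for r, e in zip(ref_counts, eval_counts):
        # Floor the proportions so an empty category does not produce log(0).
        p = max(r / ref_total, eps)
        q = max(e / eval_total, eps)
        psi += (p - q) * math.log(p / q)
    return psi

psi = psi_from_counts([100, 200], [25, 150])
test_fails = psi > 0.1  # assumed rule: fail when PSI exceeds the threshold
```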

Label Drift (PSI)

This test checks that the difference in label distribution between the reference and evaluation sets is small, using the Population Stability Index (PSI). The key detail displayed is the PSI statistic, which is a measure of how different the frequencies of the column in the reference and evaluation sets are.

Why it matters: Label distribution shift between reference and test can indicate that the underlying data distribution has changed significantly enough to modify model decisions. This may mean that the model needs to be retrained to adjust to the new data environment. In addition, significant label distribution shift may indicate that upstream decision-making modules (e.g. thresholds) may need to be updated.

Configuration: This test is run by default whenever both the reference and evaluation sets have associated labels.

Example: Suppose that the observed frequencies of the label column are [100, 200] in the reference set but [25, 150] in the evaluation set. Then the PSI would be 0.201. If our PSI threshold was 0.1 then the test would fail.

Label Drift

This test checks that the difference in label distribution between the reference and evaluation sets is small, using the Kolmogorov–Smirnov (K-S) test. The key detail displayed is the K-S statistic, which is a measure of how different the labels in the reference and evaluation sets are. Concretely, the K-S statistic is the maximum difference between the empirical CDFs of the two label columns.

Why it matters: Label distribution shift between reference and test can indicate that the underlying data distribution has changed significantly enough to modify model decisions. This may mean that the model needs to be retrained to adjust to the new data environment. In addition, significant label distribution shift may indicate that upstream decision-making modules (e.g. thresholds) may need to be updated.

Configuration: This test is run by default whenever both the reference and evaluation sets have associated labels.

Example: Suppose that the distribution of labels changes between the reference and evaluation sets such that the p-value for the K-S test between these two samples is 0.005 and the test statistic is 0.2. If the p-value threshold is set to 0.01 and the model impact threshold is set to 0.1, this test would raise a warning.
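The K-S statistic itself, the largest gap between the two empirical CDFs, can be computed with a short sketch like the one below. The function name and the toy label columns are illustrative assumptions, and the p-value computation is omitted.

```python
import bisect

def ks_statistic(ref, ev):
    """Maximum absolute difference between the empirical CDFs of two samples."""
    ref_sorted = sorted(ref)
    ev_sorted = sorted(ev)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for v in set(ref_sorted) | set(ev_sorted):
        cdf_ref = bisect.bisect_right(ref_sorted, v) / len(ref_sorted)
        cdf_ev = bisect.bisect_right(ev_sorted, v) / len(ev_sorted)
        max_gap = max(max_gap, abs(cdf_ref - cdf_ev))
    return max_gap

# Toy binary labels: 30% positive in the reference set, 50% in the evaluation set.
ref_labels = [0] * 70 + [1] * 30
eval_labels = [0] * 50 + [1] * 50
statistic = ks_statistic(ref_labels, eval_labels)
```

For these toy columns the statistic is 0.2 up to floating-point rounding, the same value used in the example.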

Numeric Feature Drift

This test measures the severity of passing to the model data points that have numeric features that have drifted from the distribution observed in the reference set. The severity is a function of the impact on the model, as well as the presence of drift in the data. The model impact measures how much model performance changes due to drift in the given feature. The key detail is the Population Stability Index statistic. The Population Stability Index (PSI) is a measure of how different two distributions are. Given two distributions P and Q, it is computed as the sum of the KL Divergence between P and Q and the (reverse) KL Divergence between Q and P. Thus, PSI is symmetric.
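Written out, with p_i and q_i denoting the probability mass that P and Q place on bin i (this notation is an assumption, and how numeric values are binned is not described here), the symmetry follows directly from the definition:

```latex
\mathrm{PSI}(P, Q)
  = D_{\mathrm{KL}}(P \,\|\, Q) + D_{\mathrm{KL}}(Q \,\|\, P)
  = \sum_i p_i \ln\frac{p_i}{q_i} + \sum_i q_i \ln\frac{q_i}{p_i}
  = \sum_i (p_i - q_i) \ln\frac{p_i}{q_i}
```

Swapping P and Q flips the sign of both factors in each term, so the sum is unchanged.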

Why it matters: Distribution shift between training and inference can cause degradation in model performance. If the shift is sufficiently large, retraining the model on newer data may be necessary.

Configuration: By default, this test runs over all numeric columns with sufficiently many samples and stored quantiles in each of the reference and evaluation sets.

Example: Suppose that the distribution of a feature Age changes between the reference and evaluation sets such that the Population Stability Index between these two samples is 0.2. If the distance threshold is set to 0.1, this test would raise a warning.

Overall Metrics

This test checks a set of overall metrics to see if any have experienced significant degradation. The key detail displays whether the given performance metric has degraded beyond a defined threshold.

Why it matters: During production, factors like distribution shift or a change in p(y|x) may cause model performance to decrease significantly.

Configuration: By default, this test runs over all metrics for this model task.

Example: Assume that on the reference set the model obtained 0.85 AUC but on the evaluation set the model obtained 0.5 AUC. Then this test raises a warning.

Prediction Drift

This test checks that the difference in the prediction distribution between the reference and evaluation sets is small, using Population Stability Index. The key detail displayed is the PSI which is a measure of how different the prediction distributions in the reference and evaluation sets are.

Why it matters: Prediction distribution shift between reference and test can indicate that the underlying data distribution has changed significantly enough to modify model decisions. This may mean that the model needs to be retrained to adjust to the new data environment. In addition, significant prediction distribution drift may indicate that upstream decision-making modules (e.g. thresholds) may need to be updated.

Configuration: This test is run by default whenever both the reference and evaluation sets have associated predictions. Different thresholds are associated with different severities.

Example: Suppose that the PSI between the prediction distributions in the reference and evaluation sets is 0.201. Then if the PSI thresholds are (0.1, 0.2, 0.3), the test would fail with medium severity.
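One way to picture the mapping from thresholds to severities is sketched below. The severity names ("low", "medium", "high") and the use of an inclusive comparison are assumptions made for illustration; the text only states that different thresholds correspond to different severities.

```python
def psi_severity(psi, thresholds=(0.1, 0.2, 0.3)):
    """Return the highest severity whose threshold the PSI reaches."""
    labels = ("low", "medium", "high")  # hypothetical severity names
    severity = "pass"
    for threshold, label in zip(thresholds, labels):
        if psi >= threshold:
            severity = label
    return severity

severity = psi_severity(0.201)  # "medium" under the assumptions above
```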

Calibration Comparison

This test checks that the reference and evaluation sets have sufficiently similar calibration curves as measured by the Mean Squared Error (MSE) between the two curves. The calibration curve is a line plot where the x-axis represents the average predicted probability and the y-axis is the proportion of positive labels. The curve of an ideally calibrated model is thus a straight line from (0, 0) to (1, 1).

Why it matters: Knowing how well-calibrated your model is can help you better interpret and act upon model outputs, and can even be an indicator of generalization. A greater difference between reference and evaluation curves could indicate a lack of generalizability. In addition, a change in calibration could indicate that decision-making or thresholding conducted upstream needs to change as it is behaving differently on held-out data.

Configuration: By default, this test runs over the predictions and labels.

Example: Suppose the model’s task is binary classification: it predicts whether or not a datapoint is fraudulent. If we have a reference set in which 1% of the datapoints are fraudulent, but an evaluation set where 50% are fraudulent, then our model may not be well calibrated, and the MSE difference in the curves will be large, resulting in a failing test.
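A calibration comparison along these lines might look like the following sketch. The equal-width binning, the number of bins, and the choice to compare only bins present in both curves are all assumptions, since the exact MSE computation is not specified; the function names are hypothetical.

```python
def calibration_curve(probs, labels, n_bins=5):
    """Map bin index -> (mean predicted probability, fraction of positive labels)."""
    bins = {}
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins.setdefault(idx, []).append((p, y))
    return {
        idx: (sum(p for p, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
        for idx, pts in bins.items()
    }

def calibration_mse(curve_ref, curve_eval):
    """MSE between the fraction of positives in bins present in both curves."""
    shared = sorted(set(curve_ref) & set(curve_eval))
    if not shared:
        raise ValueError("the two curves share no bins")
    return sum((curve_ref[i][1] - curve_eval[i][1]) ** 2 for i in shared) / len(shared)

# Toy data: well calibrated on the reference set, uninformative on the evaluation set.
ref_curve = calibration_curve([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
eval_curve = calibration_curve([0.1, 0.1, 0.9, 0.9], [0, 1, 1, 0])
mse = calibration_mse(ref_curve, eval_curve)
```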

Predicted Label Drift (PSI)

This test checks that the difference in predicted label distribution between the reference and evaluation sets is small, using the Population Stability Index (PSI). The key detail displayed is the PSI statistic, which is a measure of how different the frequencies of the column in the reference and evaluation sets are.

Why it matters: Predicted Label distribution shift between reference and test can indicate that the underlying data distribution has changed significantly enough to modify model decisions. This may mean that the model needs to be retrained to adjust to the new data environment. In addition, significant predicted label distribution shift may indicate that upstream decision-making modules (e.g. thresholds) may need to be updated.

Configuration: This test is run by default whenever the model or predictions are provided.

Example: Suppose that the observed frequencies of the predicted label column are [100, 200] in the reference set but [25, 150] in the evaluation set. Then the PSI would be 0.201. If our PSI threshold was 0.1 then the test would fail.

Abnormal Input

Must be Int

This test measures the number of failing rows in your data with values not of type Integer and their impact on the model. The model impact is the difference in model performance between passing and failing rows with values not of type Integer. If labels are not provided, prediction change is used instead of model performance change.

Why it matters: A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

Configuration: By default, this test runs over all features that are inferred to be type Integer.

Example: Say that the feature X requires the Integer type. This test raises a warning if we observe any values where X is represented as a different type instead.
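The idea of splitting rows into passing and failing groups and comparing model performance between them can be sketched as follows. Accuracy as the performance metric, the `(value, prediction, label)` row layout, and the function names are assumptions chosen for illustration.

```python
def is_integer(value):
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(value, int) and not isinstance(value, bool)

def accuracy(rows):
    return sum(pred == label for _, pred, label in rows) / len(rows)

def model_impact(rows):
    """Accuracy on passing rows minus accuracy on failing rows."""
    passing = [r for r in rows if is_integer(r[0])]
    failing = [r for r in rows if not is_integer(r[0])]
    if not passing or not failing:
        return 0.0  # nothing to compare against
    return accuracy(passing) - accuracy(failing)

# Each row: (value of feature X, model prediction, true label).
rows = [(3, 1, 1), (7, 0, 0), ("n/a", 1, 0), (2.5, 0, 0)]
num_failing = sum(not is_integer(r[0]) for r in rows)
impact = model_impact(rows)
```

The same passing/failing comparison applies to the other type checks in this section, with a different predicate in place of `is_integer`.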

Must be Float

This test measures the number of failing rows in your data with values not of type Float and their impact on the model. The model impact is the difference in model performance between passing and failing rows with values not of type Float. If labels are not provided, prediction change is used instead of model performance change.

Why it matters: A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

Configuration: By default, this test runs over all features that are inferred to be type Float.

Example: Say that the feature X requires the Float type. This test raises a warning if we observe any values where X is represented as a different type instead.

Must be String

This test measures the number of failing rows in your data with values not of type String Categorical and their impact on the model. The model impact is the difference in model performance between passing and failing rows with values not of type String Categorical. If labels are not provided, prediction change is used instead of model performance change.

Why it matters: A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

Configuration: By default, this test runs over all features that are inferred to be type String Categorical.

Example: Say that the feature X requires the String Categorical type. This test raises a warning if we observe any values where X is represented as a different type instead.

Must be Boolean

This test measures the number of failing rows in your data with values not of type Boolean Categorical and their impact on the model. The model impact is the difference in model performance between passing and failing rows with values not of type Boolean Categorical. If labels are not provided, prediction change is used instead of model performance change.

Why it matters: A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

Configuration: By default, this test runs over all features that are inferred to be type Boolean Categorical.

Example: Say that the feature X requires the Boolean Categorical type. This test raises a warning if we observe any values where X is represented as a different type instead.

Must be URL

This test measures the number of failing rows in your data with values not of type URL Categorical and their impact on the model. The model impact is the difference in model performance between passing and failing rows with values not of type URL Categorical. If labels are not provided, prediction change is used instead of model performance change.

Why it matters: A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

Configuration: By default, this test runs over all features that are inferred to be type URL Categorical.

Example: Say that the feature X requires the URL Categorical type. This test raises a warning if we observe any values where X is represented as a different type instead.

Must be Domain

This test measures the number of failing rows in your data with values not of type Domain Categorical and their impact on the model. The model impact is the difference in model performance between passing and failing rows with values not of type Domain Categorical. If labels are not provided, prediction change is used instead of model performance change.

Why it matters: A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

Configuration: By default, this test runs over all features that are inferred to be type Domain Categorical.

Example: Say that the feature X requires the Domain Categorical type. This test raises a warning if we observe any values where X is represented as a different type instead.

Must be Email

This test measures the number of failing rows in your data with values not of type Email Categorical and their impact on the model. The model impact is the difference in model performance between passing and failing rows with values not of type Email Categorical. If labels are not provided, prediction change is used instead of model performance change.

Why it matters: A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

Configuration: By default, this test runs over all features that are inferred to be type Email Categorical.

Example: Say that the feature X requires the Email Categorical type. This test raises a warning if we observe any values where X is represented as a different type instead.

Null Check

This test measures the number of failing rows in your data with nulls in features that should not have nulls and their impact on the model. The model impact is the difference in model performance between passing and failing rows with nulls in features that should not have nulls. If labels are not provided, prediction change is used instead of model performance change.

Why it matters: The model may make certain assumptions about a column depending on whether or not it had nulls in the training data. If these assumptions break during production, this may damage the model's performance. For example, if a column was never null during training then a model may not have learned to be robust against noise in that column.

Configuration: By default, this test runs over all columns that had zero nulls in the reference set.

Example: Suppose that the feature Age was never null in the reference set. This test raises a warning if Age was null 10% of the time in the evaluation set or if model performance decreases on observed datapoints with nulls.

Numeric Outliers

This test measures the number of failing rows in your data with outliers and their impact on the model. Outliers are values which may not necessarily be outside of an allowed range for a feature, but are extreme values that are unusual and may be indicative of abnormality. The model impact is the difference in model performance between passing and failing rows with outliers. If labels are not provided, prediction change is used instead of model performance change.

Why it matters: Outliers can be a sign of corrupted or otherwise erroneous data, and can degrade model performance if used in the training data, or lead to unexpected behaviour if input at inference time.

Configuration: By default this test is run over each numeric feature that is neither unique nor ascending.

Example: Suppose there is a feature age for which in the reference set the values 103 and 114 each appear once but every other value (with substantial sample size) is contained within the range [0, 97]. Then we would infer a lower outlier threshold of 0 and an upper outlier threshold of 97. This test raises a warning if we observe any values in the evaluation set outside these thresholds or if model performance decreases on observed datapoints with outliers.

Unseen URL

This test measures the number of failing rows in your data with unseen URL values and their impact on the model. The model impact is the difference in model performance between passing and failing rows with unseen URL values. If labels are not provided, prediction change is used instead of model performance change.

Why it matters: Unseen categorical values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen categorical value. In addition, such errors may expose gaps or errors in data collection.

Configuration: By default, this test runs over all features inferred to contain URLs.

Example: Say that the feature WebURL contains the values ['http://google.com', 'http://yahoo.com'] from the reference set. This test raises a warning if we observe any unseen values in the evaluation set such as 'http://xyzabc.com'.
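The unseen-value check reduces to a set-membership test against the reference values. A minimal sketch, with a hypothetical function name:

```python
def unseen_values(reference_values, evaluation_values):
    """Return evaluation values that never appear in the reference set."""
    seen = set(reference_values)
    return [v for v in evaluation_values if v not in seen]

reference = ["http://google.com", "http://yahoo.com"]
evaluation = ["http://google.com", "http://xyzabc.com"]
failing = unseen_values(reference, evaluation)
```

Here `failing` contains only 'http://xyzabc.com', the value that would trigger the warning.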

Unseen Domain

This test measures the number of failing rows in your data with unseen domain values and their impact on the model. The model impact is the difference in model performance between passing and failing rows with unseen domain values. If labels are not provided, prediction change is used instead of model performance change.

Why it matters: Unseen categorical values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen categorical value. In addition, such errors may expose gaps or errors in data collection.

Configuration: By default, this test runs over all features inferred to contain domains.

Example: Say that the feature WebDomain contains the values ['gmail.com', 'hotmail.com'] from the reference set. This test raises a warning if we observe any unseen values in the evaluation set such as 'xyzabc.com'.

Unseen Email

This test measures the number of failing rows in your data with unseen email values and their impact on the model. The model impact is the difference in model performance between passing and failing rows with unseen email values. If labels are not provided, prediction change is used instead of model performance change.

Why it matters: Unseen categorical values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen categorical value. In addition, such errors may expose gaps or errors in data collection.

Configuration: By default, this test runs over all features inferred to contain emails.

Example: Say that the feature Email contains the values ['user1@gmail.com', 'user2@yahoo.com'] from the reference set. This test raises a warning if we observe any unseen values in the evaluation set such as 'xyz@xyzabc.com'.

Out of Range

This test measures the number of failing rows in your data with values outside the inferred range of allowed values and their impact on the model. The model impact is the difference in model performance between passing and failing rows with values outside the inferred range of allowed values. If labels are not provided, prediction change is used instead of model performance change.

Why it matters: In production, the model may encounter corrupted or manipulated out of range values. It is important that the model is robust to such extremities.

Configuration: By default, this test runs over all numeric features.

Example: In the reference set, the Age feature has a range of [0, 121]. This test raises a warning if we observe values outside of this range in the evaluation set (e.g. 150, 200) or if model performance decreases on observed datapoints outside of this range.

Rare Categories

This test measures the severity of passing to the model data points whose features contain rarely observed categories (relative to the reference set). The severity is a function of the impact of these values on the model, as well as the presence of these values in the data. The model impact is the difference in model performance between passing and failing rows with rarely observed categorical values. If labels are not provided, prediction change is used instead of model performance change. The number of failing rows refers to the number of times rarely observed categorical values are observed in the evaluation set.

Why it matters: Rare categories are a common failure point in machine learning systems because less data often means worse performance. In addition, this may expose gaps or errors in data collection.

Configuration: By default, this test runs over all categorical features. A category is considered rare if it occurs fewer than min_num_occurrences times, or if it occurs less than min_pct_occurrences of the time. If neither of these values is specified, the rate of appearance below which a category is considered rare is min_ratio_rel_uniform divided by the number of classes.

Example: Say that the feature AgeGroup takes on the value 0-18 twice while taking on the value 35-55 a total of 98 times. If min_num_occurrences is 5 and min_pct_occurrences is 0.03, then the test will flag the value 0-18 as a rare category.
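The rarity rule described in the configuration can be sketched as below. The parameter names follow the text; the function name is hypothetical, and the default value given to `min_ratio_rel_uniform` is a placeholder, not the product's actual default.

```python
from collections import Counter

def rare_categories(values, min_num_occurrences=None, min_pct_occurrences=None,
                    min_ratio_rel_uniform=0.005):  # placeholder default
    """Return the categories considered rare under the thresholds."""
    if not values:
        return []
    counts = Counter(values)
    total = len(values)
    if min_num_occurrences is None and min_pct_occurrences is None:
        # Fallback: rarity is measured relative to a uniform spread over classes.
        min_pct_occurrences = min_ratio_rel_uniform / len(counts)
    rare = []
    for category, count in counts.items():
        too_few = min_num_occurrences is not None and count < min_num_occurrences
        too_small = min_pct_occurrences is not None and count / total < min_pct_occurrences
        if too_few or too_small:
            rare.append(category)
    return sorted(rare)

age_group = ["0-18"] * 2 + ["35-55"] * 98
flagged = rare_categories(age_group, min_num_occurrences=5, min_pct_occurrences=0.03)
```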

Empty String

This test measures the number of failing rows in your data with empty string values instead of null values and their impact on the model. The model impact is the difference in model performance between passing and failing rows with empty string values instead of null values. If labels are not provided, prediction change is used instead of model performance change.

Why it matters: In production, the model may encounter corrupted or manipulated string values. Null values and empty strings are often expected to be treated the same, but the model might not treat them that way. It is important that the model is robust to such extremities.

Configuration: By default, this test runs over all string features with null values.

Example: In the reference set, the Name feature contains nulls. This test raises a warning if we observe any empty string in the Name feature or if these values decrease model performance.

Inconsistencies

This test measures the severity of passing to the model data points whose values are inconsistent (as inferred from the reference set). The severity is a function of the impact of these values on the model, as well as the presence of these values in the data. The model impact is the difference in model performance between passing and failing rows with data containing inconsistent feature values. If labels are not provided, prediction change is used instead of model performance change. The number of failing rows refers to the number of times data containing inconsistent feature values are observed in the evaluation set.

Why it matters: Inconsistent values might be the result of malicious actors manipulating the data or errors in the data pipeline. Thus, it is important to be aware of inconsistent values to identify sources of manipulations or errors.

Configuration: By default, this test runs on pairs of categorical features whose correlations exceed some minimum threshold. The default threshold for the frequency ratio below which values are considered to be inconsistent is 0.02.

Example: Suppose we have a feature country that takes on value "US" with frequency 0.5, and a feature time_zone that takes on value "Central European Time" with frequency 0.2. Then if these values appear together with frequency less than 0.5 * 0.2 * 0.02 = 0.002 in the reference set, rows in which these values do appear together are inconsistencies.
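The frequency-ratio rule in this example can be sketched as follows. The function name and the toy reference data (1,000 rows built to match the frequencies above) are assumptions; only the 0.02 threshold comes from the text.

```python
from collections import Counter

def inconsistent_pairs(col_a, col_b, freq_ratio_threshold=0.02):
    """Value pairs whose joint frequency is far below what independence predicts."""
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    joint = Counter(zip(col_a, col_b))
    flagged = set()
    for a in freq_a:
        for b in freq_b:
            expected = (freq_a[a] / n) * (freq_b[b] / n)
            observed = joint[(a, b)] / n  # Counter returns 0 for unseen pairs
            if observed < expected * freq_ratio_threshold:
                flagged.add((a, b))
    return flagged

# Toy reference set: "US" has frequency 0.5, "Central European Time" has 0.2.
pairs = (
    [("US", "Eastern Time")] * 499
    + [("US", "Central European Time")] * 1
    + [("DE", "Central European Time")] * 199
    + [("DE", "Eastern Time")] * 301
)
country = [c for c, _ in pairs]
time_zone = [t for _, t in pairs]
flagged = inconsistent_pairs(country, time_zone)
```

Rows in the evaluation set whose (country, time_zone) pair is in `flagged` would then count as failing rows.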

Capitalization

This test measures the number of failing rows in your data with different types of capitalization and their impact on the model. The model impact is the difference in model performance between passing and failing rows with different types of capitalization. If labels are not provided, prediction change is used instead of model performance change.

Why it matters: In production, models can come across the same value with different capitalizations, making it important to explicitly check that your model is invariant to such differences.

Configuration: By default, this test runs over all categorical features.

Example: Suppose we had a column that corresponded to country code. For a specific row, let's say the observed value in the reference set was USA. This test raises a warning if we observe a similar value in the evaluation set with case changes, e.g. uSa, or if model performance decreases on observed datapoints with case changes.
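
One way to detect such case changes is to compare values case-insensitively while requiring that the exact string was not seen before. The sketch below assumes that simple rule; `case_variants` is a hypothetical helper, not part of any documented API.

```python
def case_variants(reference, evaluation):
    """Return evaluation values that match a reference value only
    after case folding (i.e. the same value with different capitalization)."""
    exact = set(reference)
    folded = {value.casefold() for value in reference}
    return [
        value for value in evaluation
        if value not in exact and value.casefold() in folded
    ]

print(case_variants(["USA", "CAN"], ["USA", "uSa", "can", "MEX"]))
```

Values such as MEX are not reported here because they are new values rather than recapitalized ones.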

Required Characters

This test measures the number of failing rows in your data with strings without any required characters and their impact on the model. The model impact is the difference in model performance between passing and failing rows with strings without any required characters. If labels are not provided, prediction change is used instead of model performance change.

Why it matters: A feature may require specific characters. However, errors in the data pipeline may allow invalid data points that lack these required characters to pass. Failing to catch such errors may lead to noisier training data or noisier predictions during inference, which can degrade model metrics.

Configuration: By default, this test runs over all string features that are inferred to have required characters.

Example: Say that the feature email requires the character @. This test raises a warning if we observe any values in the evaluation set where the character is missing.

Unseen Categorical

This test measures the number of failing rows in your data with unseen categorical values and their impact on the model. The model impact is the difference in model performance between passing and failing rows with unseen categorical values. If labels are not provided, prediction change is used instead of model performance change.

Why it matters: Unseen categorical values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen categorical value. In addition, such errors may expose gaps or errors in data collection.

Configuration: By default, this test runs over all categorical features.

Example: Say that the feature Animal contains the values ['Cat', 'Dog'] from the reference set. This test raises a warning if we observe any unseen values in the evaluation set such as 'Mouse'.
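
The check in this example amounts to a set difference between the evaluation values and the reference values. A minimal sketch, with the hypothetical helper name `unseen_values`:

```python
def unseen_values(reference, evaluation):
    """Return the sorted distinct evaluation values never seen in the reference set."""
    seen = set(reference)
    return sorted({value for value in evaluation if value not in seen})

print(unseen_values(["Cat", "Dog"], ["Dog", "Mouse", "Cat", "Mouse"]))
```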

Compliance

Demographic Parity

This test is commonly known as the demographic parity or statistical parity test in fairness literature. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Positive Prediction Rate of model predictions within a specific subset is significantly different than the model prediction Positive Prediction Rate over the entire 'population'.

Why it matters: Demographic parity is one of the most well-known and strict measures of fairness. It is meant to be used in a setting where we assert that the base label rates between subgroups should be the same (even if empirically they are different). This contrasts with equality of opportunity or predictive parity tests, which permit classification rates to depend on a protected attribute. It can be useful in legal/compliance settings where we want a selection rate for any protected group to fundamentally be the same as other groups.

Configuration: By default, the Positive Prediction Rate is computed for all protected features.

Example: Suppose we had data with the following protected feature 'animal': ['cat', 'cat', 'cat', 'dog', 'dog', 'dog'], and model predictions [0.3, 0.3, 0.9, 0.9, 0.9, 0.3]. Then regardless of the labels, and assuming predictions above 0.5 count as positive, the Positive Prediction Rate over the feature values ('cat', 'dog') would be (0.33, 0.67), indicating a failure in demographic parity.
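
The rates in this example can be reproduced with a short sketch. The decision threshold of 0.5 and the function name `positive_prediction_rate` are assumptions made for illustration; the source does not state how scores are converted to positive predictions.

```python
from collections import defaultdict

def positive_prediction_rate(groups, scores, threshold=0.5):
    """Fraction of rows in each group whose score is at or above `threshold`."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, score in zip(groups, scores):
        totals[group] += 1
        if score >= threshold:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}

animals = ["cat", "cat", "cat", "dog", "dog", "dog"]
predictions = [0.3, 0.3, 0.9, 0.9, 0.9, 0.3]
rates = positive_prediction_rate(animals, predictions)
print(rates)  # cat is one third, dog is two thirds
```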

Protected Feature Drift

This test measures the severity of passing to the model data points that have categorical features which have drifted from the distribution observed in the reference set. The severity is a function of the impact on the model, as well as the presence of drift in the data. The model impact measures how much model performance changes due to drift in the given feature. The key detail displayed is the Χ² test statistic and p-value, which are measures of how statistically significant the difference between the frequencies of categorical values in the reference and evaluation sets is.

Why it matters: Distribution drift in categorical features between training and inference can be caused by a variety of factors, including a change in the data generation process or a change in the preprocessing pipeline. A big shift in categorical features towards categorical subsets that your model performs poorly in could indicate a degradation in model performance and signal the need for relabeling and retraining.

Configuration: By default, this test runs over all categorical columns with sufficiently many samples.

Example: Suppose that the observed frequencies of the isLoggedIn feature are [100, 200] in the reference set but [75, 100] in the test set. Then the p-value would be 0.048. If our p-value threshold were 0.05, the test would fail.
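
The p-value in this example can be approximated with a Χ² test on the 2x2 table of reference and evaluation counts. The sketch below is an assumption about the computation, not a description of the product's code: it applies Yates' continuity correction, which happens to reproduce a value close to 0.048 (without the correction the p-value is closer to 0.038). The function name `chi2_2x2_yates` is hypothetical, and the closed-form p-value used here is only valid for one degree of freedom, i.e. two categories.

```python
import math

def chi2_2x2_yates(reference_counts, evaluation_counts):
    """Chi-squared statistic and p-value for a 2x2 contingency table,
    using Yates' continuity correction (assumed, see lead-in)."""
    table = [list(reference_counts), list(evaluation_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            diff = max(abs(table[i][j] - expected) - 0.5, 0.0)
            statistic += diff * diff / expected
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

statistic, p_value = chi2_2x2_yates([100, 200], [75, 100])
print(round(statistic, 3), round(p_value, 3))
```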

Discrimination By Proxy

This test checks whether any feature is a proxy for a protected feature. It runs over categorical features, using mutual information as a measure of similarity with a protected feature. Mutual information measures any dependencies between two variables.

Why it matters: A common strategy to try to ensure a model is not biased is to remove protected features from the training data entirely so the model cannot learn over them. However, if other features are highly dependent on those features, that could lead to the model effectively still training over those features by proxy.

Configuration: By default, this test is run over all categorical protected columns.

Example: Suppose we had data with a protected feature (`gender`). If there was another feature, like `title`, which was highly associated with gender, this test would raise a warning if the mutual information between those two features was particularly high.
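
To make the notion of mutual information concrete, here is a minimal sketch that estimates it from the empirical joint distribution of two categorical columns. The function name, the toy data, and the use of natural logarithms are illustrative assumptions; the source does not specify the estimator or the warning threshold.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two categorical sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    total = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) )
        total += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return total

gender = ["F", "M", "F", "M"]
title = ["Ms", "Mr", "Ms", "Mr"]      # fully determined by gender
color = ["red", "red", "blue", "red"]  # hypothetical unrelated feature

print(mutual_information(gender, title))  # log(2): title is a perfect proxy
print(mutual_information(gender, color))
```

A feature that fully determines the protected feature reaches the maximum (the entropy of the protected feature), while an independent feature scores zero.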

Intersectional Group Fairness

This test checks whether the model performs equally well across subgroups created from the intersection of protected groups. The test first creates unique pairs of categorical protected features. We then test whether the positive prediction rate of model predictions within a specific subset is significantly lower than the model positive prediction rate over the entire population. This will expose hidden biases against groups at the intersection of these protected features.

Why it matters: Most existing work in the fairness literature deals with a binary view of fairness - either a particular group is performing worse or not. This binary categorization misses the important nuance of the fairness field - that biases can often be amplified in subgroups that combine membership from different protected groups, especially if such a subgroup is particularly underrepresented in opportunities historically. The intersectional group fairness test is run over subsets representing this intersection between two protected groups.

Configuration: This test runs over unique pairs of categorical protected features.

Example: Suppose your dataset contains two protected features: race and gender. Both features pass the demographic parity test for categories women, men, white and black. However, when certain subsets of these features are combined, such as black women or white men, the positive prediction rate is significantly lower than that of the overall population. This would show disparate impact against this subgroup.

Selection Rate

This test checks whether the selection rate for any subset of a feature performs as well as the best selection rate across all subsets of that feature. The selection rate is calculated as the Positive Prediction Rate. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the selection rate of model predictions within a specific subset is significantly lower than that of other subsets by taking a ratio of the rates.

Why it matters: Assessing differences in selection rate is an important measure of fairness. It is meant to be used in a setting where we assert that the base selection rates between subgroups should be the same (even if empirically they are different). This contrasts with equality of opportunity or predictive parity tests, which permit classification rates to depend on a sensitive attribute. It can be useful in legal/compliance settings where we want a selection rate for any sensitive group to fundamentally be the same as other groups.

Configuration: By default, the selection rate is computed for all protected features.

Example: Suppose we had data with the following protected feature 'animal': ['cat', 'cat', 'cat', 'dog', 'dog', 'dog'], and model predictions [0.3, 0.3, 0.9, 0.9, 0.9, 0.3]. Then regardless of the labels, the selection rate over the feature values ('cat', 'dog') would be (0.33, 0.67), indicating a failure because cats would be selected half as often as dogs.
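
The "half as often" figure is the ratio of each subset's selection rate to the best subset's rate. A sketch of that ratio follows, assuming a 0.5 decision threshold; the function name `selection_rate_ratios` is hypothetical, and the ratio below which the test fails is not stated in the source.

```python
def selection_rate_ratios(groups, scores, threshold=0.5):
    """Selection rate of each group divided by the highest group selection rate."""
    counts, selected = {}, {}
    for group, score in zip(groups, scores):
        counts[group] = counts.get(group, 0) + 1
        selected[group] = selected.get(group, 0) + (1 if score >= threshold else 0)
    rates = {group: selected[group] / counts[group] for group in counts}
    best = max(rates.values())
    # If nobody is selected at all, treat every group as equal to the best.
    return {group: (rate / best if best > 0 else 1.0) for group, rate in rates.items()}

ratios = selection_rate_ratios(
    ["cat", "cat", "cat", "dog", "dog", "dog"],
    [0.3, 0.3, 0.9, 0.9, 0.9, 0.3],
)
print(ratios)  # cat is selected half as often as dog
```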

Feature Independence

This test checks the independence of each protected feature with the predicted label class. It runs over categorical protected features and uses the chi square test of independence to determine the feature independence. The test compares the observed data to a model that distributes the data according to the expectation that the variables are independent. Wherever the observed data does not fit the model, the likelihood that the variables are dependent becomes stronger.

Why it matters: A test of independence assesses whether observations consisting of measures on two variables, expressed in a contingency table, are independent of each other. This can be useful when assessing how protected features impact the predicted class and helping with the feature selection process.

Configuration: By default, this test is run over all protected categorical features.

Example: Let's say you have a model that predicts whether or not a person will be hired. One protected feature is gender. If these two variables are independent, then the male-female ratio across hired and not hired should be the same. Suppose the p-value is 0.06 and the chi-squared value is 300. Because the p-value is above the threshold of 0.05, we cannot reject the hypothesis that the two variables are independent.

Subset Sensitivity

This test measures how sensitive the model is to substituting the lowest performing subset of a feature into a sample of data. The test splits the dataset into various subsets based on the feature values and finds the lowest performing subset, based on the lowest Positive Prediction Rate. The test then substitutes this subset into a sample from the original data and calculates the average prediction change. This test fails if a model predicts worse on the lowest performing subset.

Why it matters: Assessing differences in model output is an important measure of fairness. If the model performs worse because of the value of a protected feature such as race or gender, then this could indicate bias. It can be useful in legal/compliance settings where we fundamentally want the prediction for any protected group to be the same as for other groups.

Configuration: By default, the subset sensitivity is computed for all protected features that are strings.

Example: Suppose the data had the following protected feature 'animal': ['cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'horse', 'horse'], and model predictions for cat were the lowest. If substituting cat for dog and horse in the other inputs causes model predictions to decrease, then this would indicate a failure because the model disadvantages cats.

Transformations

Out of Range Substitution

This test measures the impact on the model when we substitute values outside the inferred range of allowed values into clean datapoints.

Why it matters: In production, the model may encounter corrupted or manipulated out-of-range values. It is important that the model is robust to such extremes.

Configuration: By default, this test runs over all numeric features.

Example: In the reference set, the Age feature has a range of [0, 121]. This test raises a warning if substituting values outside of this range into Age (e.g. 150, 200) causes model performance to decrease.

Numeric Outliers Substitution

This test measures the impact on the model when we substitute outliers into clean datapoints. Outliers are values which may not necessarily be outside of an allowed range for a feature, but are extreme values that are unusual and may be indicative of abnormality.

Why it matters: Outliers can be a sign of corrupted or otherwise erroneous data, and can degrade model performance if used in the training data, or lead to unexpected behaviour if input at inference time.

Configuration: By default this test is run over each numeric feature that is neither unique nor ascending.

Example: Suppose there is a feature age for which in the reference set the values 103 and 114 each appear once but every other value (with substantial sample size) is contained within the range [0, 97]. Then we would infer a lower outlier threshold of 0 and an upper outlier threshold of 97. This test raises a warning if substituting outliers into age causes model performance to decrease.
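
The source does not say how the outlier thresholds are inferred. One hypothetical rule that is consistent with this example is to ignore values that occur fewer than some minimum number of times and take the range of what remains. The sketch below implements only that assumed rule; `outlier_thresholds` and `min_count` are illustrative names.

```python
from collections import Counter

def outlier_thresholds(values, min_count=2):
    """Return (lower, upper) bounds spanning the values that occur at least
    `min_count` times. This is an assumed rule, not the documented one."""
    counts = Counter(values)
    supported = [value for value, count in counts.items() if count >= min_count]
    return min(supported), max(supported)

# Every age from 0 to 97 appears three times; 103 and 114 appear once each.
ages = list(range(98)) * 3 + [103, 114]
print(outlier_thresholds(ages))
```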

Empty String Substitution

This test measures the impact on the model when we substitute empty string values instead of null values into clean datapoints.

Why it matters: In production, the model may encounter corrupted or manipulated string values. Null values and empty strings are often expected to be treated the same, but the model might not treat them that way. It is important that the model is robust to such inputs.

Configuration: By default, this test runs over all string features with null values.

Example: In the reference set, the Name feature contains nulls. This test raises a warning if substituting empty strings instead of null values into the Name feature causes model performance to decrease.

Int Feature Type Change

This test measures the impact on the model when we substitute values not of type Integer into features that are inferred to be Integer type from the reference set.

Why it matters: A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

Configuration: By default, this test runs over all features that are inferred to be type Integer.

Example: Say that the feature X requires the Integer type. This test raises a warning if changing values in X to a different type causes model performance to decrease.

Float Feature Type Change

This test measures the impact on the model when we substitute values not of type Float into features that are inferred to be Float type from the reference set.

Why it matters: A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

Configuration: By default, this test runs over all features that are inferred to be type Float.

Example: Say that the feature X requires the Float type. This test raises a warning if changing values in X to a different type causes model performance to decrease.

String Feature Type Change

This test measures the impact on the model when we substitute values not of type String Categorical into features that are inferred to be String Categorical type from the reference set.

Why it matters: A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

Configuration: By default, this test runs over all features that are inferred to be type String Categorical.

Example: Say that the feature X requires the String Categorical type. This test raises a warning if changing values in X to a different type causes model performance to decrease.

Boolean Feature Type Change

This test measures the impact on the model when we substitute values not of type Boolean Categorical into features that are inferred to be Boolean Categorical type from the reference set.

Why it matters: A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

Configuration: By default, this test runs over all features that are inferred to be type Boolean Categorical.

Example: Say that the feature X requires the Boolean Categorical type. This test raises a warning if changing values in X to a different type causes model performance to decrease.

URL Feature Type Change

This test measures the impact on the model when we substitute values not of type URL Categorical into features that are inferred to be URL Categorical type from the reference set.

Why it matters: A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

Configuration: By default, this test runs over all features that are inferred to be type URL Categorical.

Example: Say that the feature X requires the URL Categorical type. This test raises a warning if changing values in X to a different type causes model performance to decrease.

Domain Feature Type Change

This test measures the impact on the model when we substitute values not of type Domain Categorical into features that are inferred to be Domain Categorical type from the reference set.

Why it matters: A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

Configuration: By default, this test runs over all features that are inferred to be type Domain Categorical.

Example: Say that the feature X requires the Domain Categorical type. This test raises a warning if changing values in X to a different type causes model performance to decrease.

Email Feature Type Change

This test measures the impact on the model when we substitute values not of type Email Categorical into features that are inferred to be Email Categorical type from the reference set.

Why it matters: A feature may require a specific type. However, errors in the data pipeline may produce values that are outside the expected type. Failing to catch such errors may lead to errors or undefined behavior from the model.

Configuration: By default, this test runs over all features that are inferred to be type Email Categorical.

Example: Say that the feature X requires the Email Categorical type. This test raises a warning if changing values in X to a different type causes model performance to decrease.

Capitalization Change

This test measures the impact on the model when we substitute different types of capitalization into clean datapoints.

Why it matters: In production, models can come across the same value with different capitalizations, making it important to explicitly check that your model is invariant to such differences.

Configuration: By default, this test runs over all categorical features.

Example: Suppose we had a column that corresponded to country code. For a specific row, let's say the observed value in the reference set was USA. This test raises a warning if substituting different capitalizations of USA, e.g. usa, causes model performance to decrease.

Required Characters Deletion

This test measures the impact on the model when we delete required characters, inferred from the reference set, from the strings of clean datapoints.

Why it matters: A feature may require specific characters. However, errors in the data pipeline may allow invalid data points that lack these required characters to pass. Failing to catch such errors may lead to noisier training data or noisier predictions during inference, which can degrade model metrics.

Configuration: By default, this test runs over all string features that are inferred to have required characters.

Example: Say that the feature email requires the character @. This test raises a warning if removing @ from values in email causes model performance to decrease.

Unseen Categorical Substitution

This test measures the impact on the model when we substitute unseen categorical values into clean datapoints.

Why it matters: Unseen categorical values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen categorical value. In addition, such errors may expose gaps or errors in data collection.

Configuration: By default, this test runs over all categorical features.

Example: Say that the feature Animal contains the values ['Cat', 'Dog'] from the reference set. This test raises a warning if substituting unseen values into the feature Animal causes model performance to decrease.

Null Substitution

This test measures the impact on the model when we substitute nulls in features that should not have nulls into clean datapoints.

Why it matters: The model may make certain assumptions about a column depending on whether or not it had nulls in the training data. If these assumptions break during production, this may damage the model's performance. For example, if a column was never null during training then a model may not have learned to be robust against noise in that column.

Configuration: By default, this test runs over all columns that had zero nulls in the reference set.

Example: Suppose that the feature Age was never null in the reference set. This test raises a warning if substituting nulls into the Age feature causes model performance to decrease.

Unseen URL Substitution

This test measures the impact on the model when we substitute unseen URL values into clean datapoints.

Why it matters: Unseen categorical values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen categorical value. In addition, such errors may expose gaps or errors in data collection.

Configuration: By default, this test runs over all features inferred to contain URLs.

Example: Say that the feature WebURL contains the values ['http://google.com', 'http://yahoo.com'] from the reference set. This test raises a warning if substituting unseen values into the feature WebURL causes model performance to decrease.

Unseen Domain Substitution

This test measures the impact on the model when we substitute unseen domain values into clean datapoints.

Why it matters: Unseen categorical values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen categorical value. In addition, such errors may expose gaps or errors in data collection.

Configuration: By default, this test runs over all features inferred to contain domains.

Example: Say that the feature WebDomain contains the values ['gmail.com', 'hotmail.com'] from the reference set. This test raises a warning if substituting unseen values into the feature WebDomain causes model performance to decrease.

Unseen Email Substitution

This test measures the impact on the model when we substitute unseen email values into clean datapoints.

Why it matters: Unseen categorical values are a common failure point in machine learning systems; since these models are trained over a reference set, they may yield uninterpretable or undefined behavior when interacting with an unseen categorical value. In addition, such errors may expose gaps or errors in data collection.

Configuration: By default, this test runs over all features inferred to contain emails.

Example: Say that the feature Email contains the values ['user1@gmail.com', 'user2@yahoo.com'] from the reference set. This test raises a warning if substituting unseen values into the feature Email causes model performance to decrease.

Attacks

Single-Feature Changes

This test measures the severity of passing to the model data points that have been manipulated across a single feature in an unbounded manner. The severity is a function of the impact of these manipulations on the model.

Why it matters: In production, your model will likely come across inputs that are out-of-distribution with respect to the training data, and it is often difficult to know ahead of time how your model will behave on such inputs. 'Attacking' a model in the manner of this test is a technique for finding the out-of-distribution regions of the input space where your model most severely misbehaves, before putting it into production. Restricting ourselves to changing a single feature at a time is one proxy for what 'realistic' out-of-distribution data can look like.

Configuration: By default, for a given input we aim to change your model's prediction in the opposite direction of the true label. This test raises a warning if the average prediction change that can be achieved exceeds an acceptable threshold.

Example: Suppose your model has an Age feature with observed range 0 to 120. For every row in some sample, this test would search for the value of Age in 0 to 120 that caused the maximal change in prediction in the desired direction.

Bounded Single-Feature Changes

This test measures the severity of passing to the model data points that have been manipulated across a single feature in a bounded manner. The severity is a function of the impact of these manipulations on the model. We bound the manipulations to be less than some fraction of the range of the given feature.

Why it matters: In production, your model will likely come across inputs that are out-of-distribution with respect to the training data, and it is often difficult to know ahead of time how your model will behave on such inputs. 'Attacking' a model in the manner of this test is a technique for finding the out-of-distribution regions of the input space where your model most severely misbehaves, before putting it into production. Restricting ourselves to changing a single feature by a small amount is one proxy for what 'realistic' out-of-distribution data can look like.

Configuration: By default, for a given input we aim to change your model's prediction in the opposite direction of the true label. This test raises a warning if the average prediction change that can be achieved exceeds an acceptable threshold. This test runs only over numeric features.

Example: Suppose your model has an Age feature with observed range 0 to 120, and we restricted ourselves to changes that were no greater than 10% of the feature range. For every row in some sample, this test would search for the value of Age that was at most 12 away from the row's initial Age value and that caused the maximal change in prediction in the desired direction.
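
The search described in this example can be sketched as a grid search over the allowed interval around the row's original value. Everything below is illustrative: the function `bounded_attack`, the grid-search strategy, the clipping to the observed range, and the toy model are assumptions, since the source does not describe the search procedure.

```python
def bounded_attack(model, row, feature, lo, hi, fraction=0.1, steps=25, direction=1):
    """Find the value of one numeric feature, within fraction * (hi - lo) of its
    original value, that moves the model score furthest in `direction`."""
    radius = fraction * (hi - lo)
    start = max(lo, row[feature] - radius)  # clipping to [lo, hi] is an assumption
    stop = min(hi, row[feature] + radius)
    base = model(row)
    best_value, best_change = row[feature], 0.0
    for k in range(steps):
        candidate = start + (stop - start) * k / (steps - 1)
        change = direction * (model({**row, feature: candidate}) - base)
        if change > best_change:
            best_value, best_change = candidate, change
    return best_value, best_change

# Hypothetical model whose score grows linearly with Age.
toy_model = lambda r: r["Age"] / 120

best_value, best_change = bounded_attack(toy_model, {"Age": 40}, "Age", lo=0, hi=120)
print(best_value, best_change)  # the search may move Age by at most 12
```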

Multi-Feature Changes

This test measures the severity of passing to the model data points that have been manipulated across multiple features in an unbounded manner. The severity is a function of the impact of these manipulations on the model.

Why it matters: In production, your model will likely come across inputs that are out-of-distribution with respect to the training data, and it is often difficult to know ahead of time how your model will behave on such inputs. 'Attacking' a model in the manner of this test is a technique for finding the out-of-distribution regions of the input space where your model most severely misbehaves, before putting it into production. Restricting the number of features that can be changed is one proxy for what 'realistic' out-of-distribution data can look like.

Configuration: By default, for a given input we aim to change your model's prediction in the opposite direction of the true label. This test raises a warning if the average prediction change that can be achieved exceeds an acceptable threshold.

Example: Suppose we restricted ourselves to changing 5 features. This means for each input we would search for the 5 feature value changes that, when performed together, caused the largest possible change in your model's prediction on that input.

Bounded Multi-Feature Changes

This test measures the severity of passing to the model data points that have been manipulated across multiple features in a bounded manner. The severity is a function of the impact of these manipulations on the model. We bound the manipulations to be less than some fraction of the range of the given feature.

Why it matters: In production, your model will likely come across inputs that are out-of-distribution with respect to the training data, and it is often difficult to know ahead of time how your model will behave on such inputs. 'Attacking' a model in the manner of this test is a technique for finding the out-of-distribution regions of the input space where your model most severely misbehaves, before putting it into production. Restricting the number of features that can be changed and the magnitude of the change that can be made to each feature is one proxy for what 'realistic' out-of-distribution data can look like.

Configuration: By default, for a given input we aim to change your model's prediction in the opposite direction of the true label. This test raises a warning if the average prediction change that can be achieved exceeds an acceptable threshold. This test runs only over numeric features.

Example: Suppose we restricted ourselves to changing 5 features, each by no more than 10% of the range of the given feature. This means for each input we would search for the 5 restricted feature value changes that, when performed together, caused the largest possible change in your model's prediction on that input.

Data Cleanliness

Duplicate Row

This test checks if there are any duplicate rows in your dataset. The key detail displays the number of duplicate rows in your dataset.

Why it matters: Duplicate rows are potentially a sign of a broken data pipeline or an otherwise corrupted input.

Configuration: By default this test is run over all features, meaning two rows are considered duplicates only if they match across all features.

Example: Suppose we had two rows that were the same across every feature except an ID feature. By default these two rows would not be flagged as duplicates. If we exclude the ID feature, then these two rows would be flagged as duplicates.
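
The effect of excluding a feature can be sketched as follows. The helper `count_duplicate_rows` is hypothetical, and it counts every row that belongs to a group of identical rows; the source does not say whether the reported number includes the first occurrence.

```python
from collections import Counter

def count_duplicate_rows(rows, exclude=()):
    """Count rows that match at least one other row on all non-excluded features."""
    keys = [
        tuple(sorted((name, value) for name, value in row.items() if name not in exclude))
        for row in rows
    ]
    counts = Counter(keys)
    return sum(count for count in counts.values() if count > 1)

rows = [
    {"id": 1, "age": 30, "city": "NY"},
    {"id": 2, "age": 30, "city": "NY"},
    {"id": 3, "age": 41, "city": "LA"},
]
print(count_duplicate_rows(rows))                  # ID differs, so no duplicates
print(count_duplicate_rows(rows, exclude={"id"}))  # first two rows now match
```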

Required Features

This test checks that the features of a dataset are as expected.

Why it matters: Errors in data collection and processing can lead to missing (or extra) features. Missing features can cause failures in models. Extra features can lead to unnecessary storage and computation.

Configuration: This test runs only when required features are specified.

Example: Suppose we had a few features (Age, Location, etc.) that we always expected to be present in the dataset. We can configure this test to check that those columns are there.
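
The check amounts to a set difference between the required and the observed column names. The sketch below uses a hypothetical helper and column names chosen for illustration:

```python
def check_required_features(columns, required):
    """Report which required features are absent from the dataset columns."""
    missing = sorted(set(required) - set(columns))
    return {"passed": not missing, "missing": missing}

result = check_required_features(
    columns=["Age", "Income"],
    required=["Age", "Location"],
)
print(result)  # {'passed': False, 'missing': ['Location']}
```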

Feature Leakage

Feature leakage occurs when a model is trained on features that include information about the label that is not normally present during production. This test flags a likely data leakage issue if both of the following occur:

the normalized mutual information between the feature and the label is too high in the reference set
the normalized mutual information for the reference set is much higher than for the evaluation set

The first criterion is an indicator that this feature has unreasonably high predictive power for the label during training, and the second criterion checks that this feature is no longer a good predictor in the evaluation set. One requirement for this test to flag data leakage is that the evaluation set labels and features are collected properly.

Why it matters: Errors in data collection and processing can lead to some features containing information about the label in the reference set that does not appear in the evaluation set. This causes the model to under-perform during production.

Configuration: By default, this test always runs on all categorical features.

Example: Consider a lending model that is trying to predict a boolean variable loan given that reports whether or not a bank will issue this loan to a potential borrower, and suppose one of the features is total debt. An error during the data processing causes the model to be trained on a data set where total debt is calculated after the loan has already been given, resulting in the model predicting loan given to be true whenever total debt is large. However, when the model is deployed, the feature total debt must be calculated before the loan given prediction can be made.
The normalized mutual information between these columns might be 0.3 in the reference set but only 0.1 in the evaluation set. This test would then flag a likely feature leakage issue where total debt is leaking into the variable loan given during training.
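
The two criteria can be sketched as follows. The normalization (mutual information divided by the mean of the two entropies), the threshold names min_ref_nmi and min_ratio, and their values are assumptions made for illustration, not the test's documented settings.

```python
from collections import Counter
from math import log

def entropy(values):
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())

def normalized_mutual_information(xs, ys):
    """Mutual information divided by the mean of the two entropies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )
    denom = (entropy(xs) + entropy(ys)) / 2
    return mi / denom if denom > 0 else 0.0

def flags_leakage(ref_nmi, eval_nmi, min_ref_nmi, min_ratio):
    """Both criteria: high in reference, and much higher than in evaluation."""
    return ref_nmi > min_ref_nmi and ref_nmi > min_ratio * eval_nmi

# Toy columns: the feature determines the label in the reference set only.
ref_nmi = normalized_mutual_information(["high", "high", "low", "low"], [1, 1, 0, 0])
eval_nmi = normalized_mutual_information(["high", "low", "high", "low"], [1, 1, 0, 0])

# The 0.3 versus 0.1 values from the example, under hypothetical thresholds.
print(flags_leakage(0.3, 0.1, min_ref_nmi=0.2, min_ratio=2.0))  # True
```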

Subset Performance

Subset AUC

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Area Under Curve (AUC) of model predictions within a specific subset is significantly lower than the model prediction Area Under Curve (AUC) over the entire 'population'.

Why it matters: Having similar AUC between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

Configuration: By default, AUC is computed over all predictions/labels. Note that we compute AUC of the Receiver Operating Characteristic (ROC) curve.

Example: Suppose we had data with 2 features: [['cat', 0.2], ['dog', 0.3], ['cat', 0.5], ['dog', 0.7], ['cat', 0.7], ['dog', 0.2]], model predictions [0.3, 0.51, 0.7, 0.49, 0.9, 0.58], and labels [1, 0, 1, 0, 0, 1]. Then, the AUC over the feature subset value 'cat' would be 0.0, compared to the overall metric of 0.44.
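
The numbers in this example can be reproduced with a pairwise definition of ROC AUC: the fraction of (positive, negative) pairs in which the positive row receives the higher score, with ties counted as one half. The helper below is a sketch written for this example rather than the test's own implementation.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

animals = ["cat", "dog", "cat", "dog", "cat", "dog"]
preds = [0.3, 0.51, 0.7, 0.49, 0.9, 0.58]
labels = [1, 0, 1, 0, 0, 1]

cat = [i for i, a in enumerate(animals) if a == "cat"]
cat_auc = roc_auc([labels[i] for i in cat], [preds[i] for i in cat])
overall_auc = roc_auc(labels, preds)
print(round(cat_auc, 2), round(overall_auc, 2))  # 0.0 0.44
```

Within the 'cat' subset the only negative row has the highest score, so no positive row outranks it and the AUC is 0.0.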

Subset Accuracy

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the accuracy of model predictions within a specific subset is significantly lower than the model prediction accuracy over the entire 'population'.

Why it matters: Having similar accuracy between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Accuracy can be thought of as a 'weaker' metric of model bias compared to measuring false positive rate (predictive equality) or false negative rate (equal opportunity). This is because we can have similar accuracy between group A and group B; yet group A actually has higher false positive rate, while group B has higher false negative rate (e.g. we reject qualified applicants in group A but accept non-qualified applicants in group B). Nevertheless, accuracy is a standard metric used during evaluation and should be considered as part of performance bias testing.

Configuration: By default, accuracy is computed over all predictions/labels. Note we round predictions to 0/1 to compute accuracy.

Example: Suppose we had data with 2 features: [['cat', 0.2], ['dog', 0.3], ['cat', 0.5], ['dog', 0.7], ['cat', 0.7], ['dog', 0.2]], model predictions [0.3, 0.51, 0.7, 0.49, 0.9, 0.58], and labels [1, 0, 1, 0, 0, 1]. Then, the accuracy over the feature subset value 'cat' would be 0.33, compared to the overall metric of 0.5.
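
A short sketch of the per-subset computation for this example, assuming "rounding" means thresholding predictions at 0.5. It also prints the gap between the lowest-performing subset and the overall population, which is the quantity the key detail describes.

```python
def accuracy(labels, scores, threshold=0.5):
    hits = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return hits / len(labels)

animals = ["cat", "dog", "cat", "dog", "cat", "dog"]
preds = [0.3, 0.51, 0.7, 0.49, 0.9, 0.58]
labels = [1, 0, 1, 0, 0, 1]

# Split on the categorical feature and score each subset separately.
by_subset = {}
for value in sorted(set(animals)):
    idx = [i for i, a in enumerate(animals) if a == value]
    by_subset[value] = accuracy([labels[i] for i in idx], [preds[i] for i in idx])

overall = accuracy(labels, preds)
worst_gap = min(by_subset.values()) - overall
print(round(by_subset["cat"], 2), overall, round(worst_gap, 2))  # 0.33 0.5 -0.17
```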

Subset Multiclass AUC

In the multiclass setting, we compute one vs. one area under the curve (AUC), which computes the AUC between every pairwise combination of classes. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Area Under Curve (AUC) of model predictions within a specific subset is significantly lower than the model prediction Area Under Curve (AUC) over the entire 'population'.

Why it matters: Having similar AUC between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

Configuration: By default, AUC is computed over all predictions/labels. Note that we compute AUC of the Receiver Operating Characteristic (ROC) curve.

Example: Suppose we are differentiating between cats, bears, and dogs. Assume that across the data points where height=2 the predictions are [0.9, 0.1, 0], [0.1, 0.9, 0], [0.2, 0.1, 0.7] and the labels are [1, 0, 0], [1, 0, 0], [0, 0, 1] (where the first index corresponds to cat, the second corresponds to bear, and the third corresponds to dog). Then the AUC (one vs. one) across this subset is 0.75. If the overall AUC (one vs. one) across all subsets is 0.9 then this test raises a warning.
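
The 0.75 figure can be reproduced with the sketch below. It assumes that only classes that actually occur in the labels form pairs (no row in this subset is labeled bear), and that each pair's AUC is the average of the two directions; these choices and the helper names are assumptions made for this example.

```python
from itertools import combinations

def binary_auc(pos_scores, neg_scores):
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

def ovo_auc(true_classes, probs):
    """Average pairwise AUC over classes that appear in the labels."""
    present = sorted(set(true_classes))
    pair_aucs = []
    for a, b in combinations(present, 2):
        rows = [(c, p) for c, p in zip(true_classes, probs) if c in (a, b)]
        # Class a as positive, scored by its own probability column.
        auc_a = binary_auc([p[a] for c, p in rows if c == a],
                           [p[a] for c, p in rows if c == b])
        # Class b as positive, scored by its own probability column.
        auc_b = binary_auc([p[b] for c, p in rows if c == b],
                           [p[b] for c, p in rows if c == a])
        pair_aucs.append((auc_a + auc_b) / 2)
    return sum(pair_aucs) / len(pair_aucs)

# Columns: 0 = cat, 1 = bear, 2 = dog.
probs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.2, 0.1, 0.7]]
true_classes = [0, 0, 2]
subset_auc = ovo_auc(true_classes, probs)
print(subset_auc)  # 0.75
```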

Subset F1

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the F1 of model predictions within a specific subset is significantly lower than the model prediction F1 over the entire 'population'.

Why it matters: Having similar F1 between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

Configuration: By default, F1 is computed over all predictions/labels. Note that we round predictions to 0/1 to compute F1 score.

Example: Suppose we had data with 2 features: [['cat', 0.2], ['dog', 0.3], ['cat', 0.5], ['dog', 0.7], ['cat', 0.7], ['dog', 0.2]], model predictions [0.3, 0.51, 0.7, 0.49, 0.9, 0.58], and labels [1, 0, 1, 0, 0, 1]. Then, the F1 over the feature subset value 'cat' would be 0.5, compared to the overall metric of 0.57.
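
As a check on these numbers, here is a sketch that computes F1 from true positive, false positive, and false negative counts, assuming predictions are thresholded at 0.5:

```python
def f1_score(labels, scores, threshold=0.5):
    predicted = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predicted, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predicted, labels))
    if tp == 0:
        return 0.0  # convention chosen here when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

animals = ["cat", "dog", "cat", "dog", "cat", "dog"]
preds = [0.3, 0.51, 0.7, 0.49, 0.9, 0.58]
labels = [1, 0, 1, 0, 0, 1]

cat = [i for i, a in enumerate(animals) if a == "cat"]
cat_f1 = f1_score([labels[i] for i in cat], [preds[i] for i in cat])
overall_f1 = f1_score(labels, preds)
print(round(cat_f1, 2), round(overall_f1, 2))  # 0.5 0.57
```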

Subset Macro F1

F1 is a holistic measure of both precision and recall. When transitioning to the multiclass setting we can use macro F1 which computes the F1 of each class and averages them. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the macro F1 of model predictions within a specific subset is significantly lower than the model prediction macro F1 over the entire 'population'.

Why it matters: Having similar macro F1 between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

Configuration: By default, macro F1 is computed over all predictions/labels. Note that the predicted label is the label with the largest predicted probability.

Example: Suppose we are differentiating between cats, bears, and dogs. Assume that across the data points where height=2 the predictions are [0.9, 0.1, 0], [0.1, 0.9, 0], [0.2, 0.1, 0.7] and the labels are [1, 0, 0], [1, 0, 0], [0, 0, 1] (where the first index corresponds to cat, the second corresponds to bear, and the third corresponds to dog). Then the macro F1 across this subset is 0.78. If the overall macro F1 across all subsets is 0.9 then this test raises a warning.

Subset Precision

The precision test is also popularly referred to as positive predictive parity in fairness literature. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Precision of model predictions within a specific subset is significantly lower than the model prediction Precision over the entire 'population'.

Why it matters: Having similar precision (e.g. false discovery rates) between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. Note that positive predictive parity does not necessarily indicate equal opportunity or predictive equality: as a hypothetical example, imagine that a loan qualification classifier flags 100 entries for group A and 100 entries for group B, each with a precision of 100%, but there are 100 actual qualified entries in group A and 9000 in group B. This would indicate disparities in opportunities given to each subgroup.

Configuration: By default, Precision is computed over all predictions/labels. Note that we round predictions to 0/1 to compute precision.

Example: Suppose we had data with 2 features: [['cat', 0.2], ['dog', 0.3], ['cat', 0.5], ['dog', 0.7], ['cat', 0.7], ['dog', 0.2]], model predictions [0.3, 0.51, 0.7, 0.49, 0.9, 0.58], and labels [1, 0, 1, 0, 0, 1]. Then, the Precision over the feature subset value 'cat' would be 0.5, compared to the overall metric of 0.5.

Subset Macro Precision

The precision test is also popularly referred to as positive predictive parity in fairness literature. When transitioning to the multiclass setting, we can compute macro precision, which computes the precision of each class individually and then averages them. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Macro Precision of model predictions within a specific subset is significantly lower than the model prediction Macro Precision over the entire 'population'.

Why it matters: Having similar macro precision (e.g. false discovery rates) between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. Note that positive predictive parity does not necessarily indicate equal opportunity or predictive equality: as a hypothetical example, imagine that a loan qualification classifier flags 100 entries for group A and 100 entries for group B, each with a precision of 100%, but there are 100 actual qualified entries in group A and 9000 in group B. This would indicate disparities in opportunities given to each subgroup.

Configuration: By default, Macro Precision is computed over all predictions/labels. Note that the predicted label is the label with the greatest predicted probability.

Example: Suppose we are differentiating between cats, bears, and dogs. Assume that across the data points where height=2 the predictions are [0.9, 0.1, 0], [0.1, 0.9, 0], [0.2, 0.1, 0.7] and the labels are [1, 0, 0], [1, 0, 0], [0, 0, 1] (where the first index corresponds to cat, the second corresponds to bear, and the third corresponds to dog). Then the Macro Precision across this subset is 0.67. If the overall Macro Precision across all subsets is 0.9 then this test raises a warning.
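
A sketch of the macro precision computation for this subset. It assumes the average is taken over all three classes and that a class which is never predicted would contribute a precision of 0; both conventions are assumptions made here.

```python
def macro_precision(true_classes, probs, n_classes):
    # Predicted label is the class with the greatest predicted probability.
    predicted = [max(range(n_classes), key=lambda k: p[k]) for p in probs]
    per_class = []
    for k in range(n_classes):
        truths_when_predicted_k = [t for t, p in zip(true_classes, predicted) if p == k]
        if truths_when_predicted_k:
            correct = sum(t == k for t in truths_when_predicted_k)
            per_class.append(correct / len(truths_when_predicted_k))
        else:
            per_class.append(0.0)  # class never predicted
    return sum(per_class) / n_classes

# Columns: 0 = cat, 1 = bear, 2 = dog.
probs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.2, 0.1, 0.7]]
true_classes = [0, 0, 2]
subset_macro_precision = macro_precision(true_classes, probs, n_classes=3)
print(round(subset_macro_precision, 2))  # 0.67
```

The per-class precisions are 1.0 for cat, 0.0 for bear (its one prediction is wrong), and 1.0 for dog, which average to 0.67.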

Subset False Positive Rate

The false positive error rate test is also popularly referred to as predictive equality in fairness literature. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the false positive rate of model predictions within a specific subset is significantly higher than the model prediction false positive rate over the entire 'population'.

Why it matters: Having similar false positive rates (e.g. predictive equality) between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. As an intuitive example, consider the case when the label indicates an undesirable attribute: if predicting whether a person will default on their loan, make sure that for people who didn't default, the rate at which the model incorrectly predicts positive is similar for group A and B.

Configuration: By default, false positive rate is computed over all predictions/labels. Note that we round predictions to 0/1 to compute false positive rate.

Example: Suppose we had data with 2 features: [['cat', 0.2], ['dog', 0.3], ['cat', 0.5], ['dog', 0.7], ['cat', 0.7], ['dog', 0.2]], model predictions [0.3, 0.51, 0.7, 0.49, 0.9, 0.58], and labels [1, 0, 1, 0, 0, 1]. Then, the false positive rate over the feature subset value 'cat' would be 1.0, compared to the overall metric of 0.67.
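
The false positive rate only looks at rows whose true label is 0. A minimal sketch for this example, assuming a 0.5 threshold:

```python
def false_positive_rate(labels, scores, threshold=0.5):
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    false_positives = sum(s >= threshold for s in negatives)
    return false_positives / len(negatives)

animals = ["cat", "dog", "cat", "dog", "cat", "dog"]
preds = [0.3, 0.51, 0.7, 0.49, 0.9, 0.58]
labels = [1, 0, 1, 0, 0, 1]

cat = [i for i, a in enumerate(animals) if a == "cat"]
cat_fpr = false_positive_rate([labels[i] for i in cat], [preds[i] for i in cat])
overall_fpr = false_positive_rate(labels, preds)
print(cat_fpr, round(overall_fpr, 2))  # 1.0 0.67
```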

Subset Recall

The recall test is more popularly referred to as equal opportunity or false negative error rate balance in fairness literature. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Recall of model predictions within a specific subset is significantly lower than the model prediction Recall over the entire 'population'.

Why it matters: Having similar true positive rates (e.g. equal opportunity) between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only has implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. An intuitive example is when the label indicates a positive attribute: if predicting whether to interview a given candidate, make sure that out of qualified candidates, the rate at which the model predicts a rejection is similar for groups A and B.

Configuration: By default, Recall is computed over all predictions/labels. Note that we round predictions to 0/1 to compute recall.

Example: Suppose we had data with 2 features: [['cat', 0.2], ['dog', 0.3], ['cat', 0.5], ['dog', 0.7], ['cat', 0.7], ['dog', 0.2]], model predictions [0.3, 0.51, 0.7, 0.49, 0.9, 0.58], and labels [1, 0, 1, 0, 0, 1]. Then, the Recall over the feature subset value 'cat' would be 0.5, compared to the overall metric of 0.67.

Subset Macro Recall

The recall test is more popularly referred to as equal opportunity or false negative error rate balance in fairness literature. When transitioning to the multiclass setting we can use macro recall, which computes the recall of each individual class and then averages these numbers. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Macro Recall of model predictions within a specific subset is significantly lower than the model prediction Macro Recall over the entire 'population'.

Why it matters: Having similar true positive rates (e.g. equal opportunity) between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only has implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. An intuitive example is when the label indicates a positive attribute: if predicting whether to interview a given candidate, make sure that out of qualified candidates, the rate at which the model predicts an interview is similar for groups A and B.

Configuration: By default, Macro Recall is computed over all predictions/labels. Note that the predicted label is the label with the largest predicted class probability.

Example: Suppose we are differentiating between cats, bears, and dogs. Assume that across the data points where height=2 the predictions are [0.9, 0.1, 0], [0.1, 0.9, 0], [0.2, 0.1, 0.7] and the labels are [1, 0, 0], [1, 0, 0], [0, 0, 1] (where the first index corresponds to cat, the second corresponds to bear, and the third corresponds to dog). Then the Macro Recall across this subset is 0.67. If the overall Macro Recall across all subsets is 0.9 then this test raises a warning.

Subset Prediction Variance (Positive Labels)

The subset variance test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the variance of model predictions within a specific subset is significantly higher than model prediction variance of the entire 'population'. In this test, the population refers to all data with positive ground-truth labels.

Why it matters: High variance within a feature subset compared to the overall population could mean a few different things, and should be analyzed together with other subset performance tests (accuracy, AUC) for a clearer view. In the variance metric over positive/negative labels, this could mean the model is much more uncertain about the given subset. When paired with a decrease in AUC, this implies the model underperforms on this subset.

Configuration: By default, the variance is computed over all predictions with a positive ground-truth label.

Example: Suppose we had data with 2 features: [['cat', 0.2], ['dog', 0.3], ['cat', 0.5], ['dog', 0.7], ['cat', 0.7], ['dog', 0.2]] and model predictions [0.3, 0.51, 0.7, 0.49, 0.9, 0.48]. Assume the labels are [1, 0, 1, 0, 0, 0]. Then the prediction variance for feature column 1, subset 'cat' with positive labels would be 0.04.
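
The 0.04 figure corresponds to the population variance (dividing by n rather than n - 1) of the predictions for 'cat' rows with a positive label. The sketch below assumes that convention and uses a hypothetical helper name:

```python
from statistics import pvariance

def subset_prediction_variance(feature, preds, labels, value, label):
    """Population variance of predictions for rows matching value and label."""
    selected = [
        p for f, p, y in zip(feature, preds, labels) if f == value and y == label
    ]
    return pvariance(selected)

animals = ["cat", "dog", "cat", "dog", "cat", "dog"]
preds = [0.3, 0.51, 0.7, 0.49, 0.9, 0.48]
labels = [1, 0, 1, 0, 0, 0]

cat_positive = subset_prediction_variance(animals, preds, labels, "cat", 1)
print(round(cat_positive, 4))  # 0.04
```

Only two 'cat' rows have a positive label (predictions 0.3 and 0.7), so the variance is ((0.2)^2 + (0.2)^2) / 2 = 0.04.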

Subset Prediction Variance (Negative Labels)

The subset variance test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the variance of model predictions within a specific subset is significantly higher than model prediction variance of the entire 'population'. In this test, the population refers to all data with negative ground-truth labels.

Why it matters: High variance within a feature subset compared to the overall population could mean a few different things, and should be analyzed together with other subset performance tests (accuracy, AUC) for a clearer view. In the variance metric over positive/negative labels, this could mean the model is much more uncertain about the given subset. When paired with a decrease in AUC, this implies the model underperforms on this subset.

Configuration: By default, the variance is computed over all predictions with a negative ground-truth label.

Example: Suppose we had data with 2 features: [['cat', 0.2], ['dog', 0.3], ['cat', 0.5], ['dog', 0.7], ['cat', 0.7], ['dog', 0.2]] and model predictions [0.3, 0.51, 0.7, 0.49, 0.9, 0.48]. Assume the labels are [1, 0, 1, 0, 0, 0]. Then the prediction variance for feature column 1, subset 'cat' with negative labels would be 0.

Subset Mean-Absolute Error (MAE)

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the MAE of model predictions within a specific subset is significantly higher than the model prediction MAE over the entire 'population'.

Why it matters: Having similar mean-absolute error between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

Configuration: By default, mean-absolute error is computed over all predictions/labels.

Example: Suppose we had data with 2 features: [[0.4, 0.2], [0.5, 0.3], [0.7, 0.5], [0.6, 0.7], [0.8, 0.7]], model predictions [0.3, 0.4, 0.8, 0.8, 0.9], and labels [0.5, 0.5, 1.5, 1.5, 1.5]. Then, the Mean-absolute error over the feature subset (0.0, 0.5] for the first feature would be 0.15, compared to the overall metric of 0.46.
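
For a numeric feature the subset is an interval of feature values. The sketch below hard-codes the (0.0, 0.5] interval from the example rather than deriving it from quantiles:

```python
def mean_absolute_error(labels, preds):
    return sum(abs(y - p) for y, p in zip(labels, preds)) / len(labels)

first_feature = [0.4, 0.5, 0.7, 0.6, 0.8]
preds = [0.3, 0.4, 0.8, 0.8, 0.9]
labels = [0.5, 0.5, 1.5, 1.5, 1.5]

# Rows whose first feature falls in the half-open interval (0.0, 0.5].
in_bin = [i for i, x in enumerate(first_feature) if 0.0 < x <= 0.5]
subset_mae = mean_absolute_error([labels[i] for i in in_bin], [preds[i] for i in in_bin])
overall_mae = mean_absolute_error(labels, preds)
print(round(subset_mae, 2), round(overall_mae, 2))  # 0.15 0.46
```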

Subset Root-Mean-Square Error (RMSE)

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the RMSE of model predictions within a specific subset is significantly higher than the model prediction RMSE over the entire 'population'.

Why it matters: Having similar RMSE between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

Configuration: By default, RMSE is computed over all predictions/labels.

Example: Suppose we had data with 2 features: [[0.4, 0.2], [0.5, 0.3], [0.7, 0.5], [0.6, 0.7], [0.8, 0.7]], model predictions [0.3, 0.4, 0.8, 0.8, 0.9], and labels [0.5, 0.5, 1.5, 1.5, 1.5]. Then, the RMSE over the feature subset (0.0, 0.5] for the first feature would be 0.158, compared to the overall metric of 0.527.
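
The same subset can be scored with RMSE, which squares each error before averaging and therefore weights large errors more heavily. A brief sketch with the interval again hard-coded:

```python
from math import sqrt

def rmse(labels, preds):
    return sqrt(sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels))

first_feature = [0.4, 0.5, 0.7, 0.6, 0.8]
preds = [0.3, 0.4, 0.8, 0.8, 0.9]
labels = [0.5, 0.5, 1.5, 1.5, 1.5]

in_bin = [i for i, x in enumerate(first_feature) if 0.0 < x <= 0.5]
subset_rmse = rmse([labels[i] for i in in_bin], [preds[i] for i in in_bin])
overall_rmse = rmse(labels, preds)
print(round(subset_rmse, 3), round(overall_rmse, 3))  # 0.158 0.527
```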

Subset Prediction Variance

The subset variance test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the variance of model predictions within a specific subset is significantly higher than model prediction variance of the entire 'population'. In this test, the population refers to all data with both positive/negative ground-truth labels.

Why it matters: High variance within a feature subset compared to the overall population could mean a few different things, and should be analyzed together with other subset performance tests (accuracy, AUC) for a clearer view. In this variance metric over all labels, it could mean the label variance itself is higher within a subgroup. It could mean the model is much more uncertain about the given subset (especially when paired with a decrease in AUC). On the other hand, it could mean the model has gained predictive power on the subset (imagine the model outputting accurate predictions close to 0 and 1 within the subset, and 0.5 everywhere else).

Configuration: By default, the variance is computed over all predictions across all ground-truth labels.

Example: Suppose we had data with 2 features: [['cat', 0.2], ['dog', 0.3], ['cat', 0.5], ['dog', 0.7], ['cat', 0.7], ['dog', 0.2]] and model predictions [0.3, 0.51, 0.7, 0.49, 0.9, 0.48]. Then the prediction variance for feature column 1, subset 'cat' would be 0.062.

Subset Rank Correlation

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the rank correlation of model predictions within a specific subset is significantly lower than the model prediction rank correlation over the entire 'population'.

Why it matters: Having similar rank correlation between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning and not only contains implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation.

Configuration: By default, rank correlation is computed over all predictions/labels.

Example: Suppose we had the following query-document pairs: [[(qid: 1), 'A'], [(qid: 1), 'A'], [(qid: 2), 'B'], [(qid: 2), 'B']], model predictions [2, 1, 1, 2], and true relevance ranks [1,2,1,2]. Then, the rank correlation over the feature subset 'A' would be -1, compared to the overall metric of 0.
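
One way to arrive at these numbers is to compute Spearman's rank correlation per query and average across queries. That aggregation, and reading the predictions directly as ranks without ties, are assumptions of this sketch:

```python
def spearman(pred_ranks, true_ranks):
    """Spearman correlation for tie-free ranks 1..n."""
    n = len(pred_ranks)
    d2 = sum((p - t) ** 2 for p, t in zip(pred_ranks, true_ranks))
    return 1 - 6 * d2 / (n * (n * n - 1))

qids = [1, 1, 2, 2]
preds = [2, 1, 1, 2]
true_ranks = [1, 2, 1, 2]

per_query = {}
for q in sorted(set(qids)):
    idx = [i for i, qid in enumerate(qids) if qid == q]
    per_query[q] = spearman([preds[i] for i in idx], [true_ranks[i] for i in idx])

subset_a = per_query[1]  # query 1 holds the rows with feature value 'A'
overall = sum(per_query.values()) / len(per_query)
print(subset_a, overall)  # -1.0 0.0
```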

Subset Normalized Discounted Cumulative Gain (NDCG)

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the NDCG of model predictions within a specific subset is significantly lower than the model prediction NDCG over the entire 'population'.

Why it matters: Whether NDCG is similar across subgroups is an important indicator of performance bias. In general, bias is an important phenomenon in machine learning: it not only has implications for fairness and ethics, but can also indicate inadequate feature representation and spurious correlations.

Configuration: By default, NDCG is computed over all predictions/labels.

Example: Suppose we had the following query-document pairs: [[(qid: 1), 'A'], [(qid: 1), 'A'], [(qid: 2), 'B'], [(qid: 2), 'B']], model predictions [2, 1, 1, 2], and true relevance ranks [1,2,1,2]. Then, the NDCG over the feature subset 'A' would be 0.86, compared to the overall metric of 0.93.
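A short Python sketch reproduces these figures. It assumes linear gain with a log2 position discount, treats the predictions as scores (higher is ranked first) and the true values as graded relevance, and averages NDCG over queries for the overall metric. These choices match the stated 0.86 and 0.93 but are not confirmed by the text; the function names are hypothetical.

```python
from math import log2


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(scores, relevances):
    """NDCG for one query: DCG of the predicted ordering over the ideal DCG."""
    # Order documents by predicted score, highest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order]
    ideal = sorted(relevances, reverse=True)
    return dcg(ranked) / dcg(ideal)


ndcg_a = ndcg([2, 1], [1, 2])   # query 1, feature subset 'A'
ndcg_b = ndcg([1, 2], [1, 2])   # query 2, feature subset 'B'
ndcg_overall = (ndcg_a + ndcg_b) / 2

print(round(ndcg_a, 2), round(ndcg_overall, 2))  # 0.86 0.93
```

For subset 'A', the less relevant document is scored highest, giving (1 + 2/log2(3)) / (2 + 1/log2(3)), or about 0.86.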

Subset Mean Reciprocal Rank (MRR)

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the MRR of model predictions within a specific subset is significantly lower than the model prediction MRR over the entire 'population'.

Why it matters: Whether MRR is similar across subgroups is an important indicator of performance bias. In general, bias is an important phenomenon in machine learning: it not only has implications for fairness and ethics, but can also indicate inadequate feature representation and spurious correlations.

Configuration: By default, MRR is computed over all predictions/labels.

Example: Suppose we had the following query-document pairs: [[(qid: 1), 'A'], [(qid: 1), 'A'], [(qid: 2), 'B'], [(qid: 2), 'B']], model predictions [2, 1, 1, 2], and true relevance ranks [1,2,1,2]. Then, the MRR over the feature subset 'A' would be 0.5, compared to the overall metric of 0.75.
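The example values can be reproduced with the Python sketch below. It assumes that the single relevant document in each query is the one with the highest true value, that predictions are scores with the highest ranked first, and that the overall metric is the mean over queries. The text does not spell these conventions out, and `reciprocal_rank` is a hypothetical helper.

```python
def reciprocal_rank(scores, relevances):
    """1 / (position of the most relevant document when sorted by score)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best = max(range(len(relevances)), key=lambda i: relevances[i])
    return 1 / (order.index(best) + 1)


rr_a = reciprocal_rank([2, 1], [1, 2])   # query 1, feature subset 'A'
rr_b = reciprocal_rank([1, 2], [1, 2])   # query 2, feature subset 'B'
mrr_overall = (rr_a + rr_b) / 2

print(rr_a, mrr_overall)  # 0.5 0.75
```

In subset 'A' the most relevant document lands in second place, so its reciprocal rank is 1/2; in subset 'B' it lands first.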

Subset Precision

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Precision of model predictions within a specific subset is significantly lower than the model prediction Precision over the entire 'population'.

Why it matters: Whether Precision is similar across subgroups is an important indicator of performance bias. In general, bias is an important phenomenon in machine learning: it not only has implications for fairness and ethics, but can also indicate inadequate feature representation and spurious correlations.

Configuration: By default, Precision is computed over all predictions/labels.

Example: Suppose that in our subset the ground truth is: "[Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today". Suppose the model's extraction is: "[Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]". This has 1 true positive ([Microsoft Corp.]), 2 false negatives ([Steve Ballmer], [Windows 7]), and 3 false positives ([CEO], [Steve], [today]). This leads to a Precision of 1 / (1 + 3) = 0.25 on this subset of data. We then compare that to the overall Precision on the full dataset.
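The counts in this example can be checked with a few lines of Python. The sketch assumes exact-match scoring on entity text, so a partial match such as [Steve] counts as a false positive; the variable names are illustrative only.

```python
# Entities as exact strings; a prediction counts only if it matches a gold entity exactly.
gold = {"Microsoft Corp.", "Steve Ballmer", "Windows 7"}
predicted = {"Microsoft Corp.", "CEO", "Steve", "today"}

true_positives = gold & predicted
false_positives = predicted - gold

precision = len(true_positives) / (len(true_positives) + len(false_positives))

print(sorted(false_positives))  # ['CEO', 'Steve', 'today']
print(precision)                # 0.25
```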

Subset Recall

This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Recall of model predictions within a specific subset is significantly lower than the model prediction Recall over the entire 'population'.

Why it matters: Whether Recall is similar across subgroups is an important indicator of performance bias. In general, bias is an important phenomenon in machine learning: it not only has implications for fairness and ethics, but can also indicate inadequate feature representation and spurious correlations.

Configuration: By default, Recall is computed over all predictions/labels.

Example: Suppose that in our subset the ground truth is: "[Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today". Suppose the model's extraction is: "[Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]". This has 1 true positive ([Microsoft Corp.]), 2 false negatives ([Steve Ballmer], [Windows 7]), and 3 false positives ([CEO], [Steve], [today]). This leads to a Recall of 1 / (1 + 2) = 0.33 on this subset of data. We then compare that to the overall Recall on the full dataset.
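As a rough check of the Recall figure, the Python sketch below represents each entity as a (start, end) token span, which makes it explicit why [Steve] does not match [Steve Ballmer]. Whitespace tokenization and exact span matching are assumptions made for illustration, not a description of how the test tokenizes text.

```python
sentence = "Microsoft Corp. CEO Steve Ballmer announced the release of Windows 7 today"
tokens = sentence.split()

# Half-open (start, end) token spans; a gold span is found only on an exact match.
gold_spans = {(0, 2), (3, 5), (9, 11)}                 # Microsoft Corp. / Steve Ballmer / Windows 7
predicted_spans = {(0, 2), (2, 3), (3, 4), (11, 12)}   # Microsoft Corp. / CEO / Steve / today

true_positives = gold_spans & predicted_spans
false_negatives = gold_spans - predicted_spans

recall = len(true_positives) / (len(true_positives) + len(false_negatives))
missed = sorted(" ".join(tokens[start:end]) for start, end in false_negatives)

print(round(recall, 2), missed)  # 0.33 ['Steve Ballmer', 'Windows 7']
```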