Data Configuration
Configuring a data source can be done by specifying a mapping in the main RIME JSON configuration file, under the
data_info argument.
The data_info configuration can take on different forms, offering a tradeoff between simplicity and flexibility. In the default approach,
you may choose to specify the file paths to both the reference and evaluation datasets, along with supporting arguments such as pred_col,
label_col, and more. This is the easiest configuration to get started with.
In the split approach, you may choose to specify a separate “single” data info struct for both the reference and evaluation datasets:
ref_data_info and eval_data_info. Both the reference and evaluation single data info structs can take on different configuration types,
from a file-path based approach, to a custom data loader, to a Delta Lake table, to our data collector (for Continuous Testing only). This allows
you to specify different data loaders for the reference and evaluation datasets. The configuration
templates for all of these data infos are also detailed below.
NOTE: for AI Continuous Testing, predictions are required. Either pred_col must be specified or ref_pred_path and eval_pred_path must be specified.
Default Data Info Template
{
"data_info": {
"ref_path": "path/to/ref.csv", (REQUIRED)
"eval_path": "path/to/eval.csv", (REQUIRED)
"label_col": "Label",
"pred_col": "Prediction", (WORKS FOR ALL TASKS EXCEPT MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
"ref_pred_path": "path/to/ref/preds.csv", (ONLY SPECIFY FOR MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
"eval_pred_path": "path/to/eval/preds.csv", (ONLY SPECIFY FOR MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
"nrows": null,
"categorical_features": null,
"protected_features": null,
"features_not_in_model": null,
"feature_type_path": null,
"ranking_info": null, (REQUIRED FOR RANKING)
"loading_kwargs": null,
"embeddings": null
},
...
}
Arguments
ref_path: string, required. Path to reference data file.
eval_path: string, required. Path to evaluation data file.
label_col: string or null, default = null. Name of column in data that corresponds to the labels.
pred_col: string or null, default = null. Name of column in data that corresponds to the predictions.
ref_pred_path: string or null, default = null. Path to a CSV or Parquet file containing the predictions on the reference dataset. This is how predictions are specified for multi-class models.
eval_pred_path: string or null, default = null. Path to a CSV or Parquet file containing the predictions on the evaluation dataset. This is how predictions are specified for multi-class models.
nrows: int or null, default = null. Number of rows of data to load and test. If null, all rows are loaded.
categorical_features: list or null, default = null. List of categorical features in data. If provided, these should be ALL the categorical features. If null, RIME will automatically determine whether a column is categorical or not.
protected_features: list or null, default = null. List of protected features in data. If the Bias and Fairness category is added to categories in the test config (see TestSuiteConfig()) and protected_features are included, a set of bias and fairness tests will be run over the protected features.
features_not_in_model: list or null, default = null. List of features in the dataset that are not used by the model. Specifying this will ensure that only relevant tests are run on these features.
feature_type_path: string, default = null. Path to a CSV file that specifies the data type of each feature. The file should have two columns: FeatureName and FeatureType. The possible values for FeatureType are BoolCategoricalColumn, DomainColumn, EmailColumn, NumericCategoricalColumn, StringCategoricalColumn, UrlColumn, FloatColumn, IntegerColumn.
ranking_info: mapping, default = null. Arguments to be used for Ranking tasks. If you are not running RIME on a Ranking task, this value should be null. If you are running on a Ranking task, the following keys should be provided:
query_col: string, required. Name of column in dataset that contains the query IDs.
nqueries: int or null, default = null. Number of queries to consider when running RIME. If null, all queries are used.
nrows_per_query: int or null, default = null. Number of rows to use per query when running RIME. If null, all rows are used.
drop_query_id: bool, default = True. Whether to drop the query ID column from the dataset to avoid passing it as a feature to the model.
loading_kwargs: mapping, default = null. Keyword arguments to be passed to the pandas loading function (either pd.read_csv or pd.read_parquet, depending on your data format). NOTE: if you wish to specify nrows, this should NOT be done with these kwargs but rather with the nrows parameter above.
embeddings: list or null, default = null. A list of dictionaries corresponding to information for each embedding. The arguments for each dictionary are described below.
name: string. Name of the embedding to be shown in the test run results.
cols: list. List of column names corresponding to the embedding. For example, suppose each data row is represented by columns age, is_member, query_0, query_1, query_2, etc., where each column “query_\d” is a dimension of a sentence embedding extracted from a user query. Specifying "embeddings": [{"name": "user_query", "cols": ["query_0", "query_1", "query_2", ...]}] (i.e., specifying each column contained by the embedding) would direct the RI Platform to treat these columns as a single dense vector-valued embedding feature.
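For orientation, here is a hypothetical filled-in version of the default template for a binary classification dataset. All file paths, column names, and feature names below are illustrative, not defaults:
{
"data_info": {
"ref_path": "data/fraud_ref.csv",
"eval_path": "data/fraud_eval.csv",
"label_col": "is_fraud",
"pred_col": "pred_is_fraud",
"categorical_features": ["merchant_category", "country"],
"protected_features": ["age_group"],
"loading_kwargs": {"sep": ","},
"embeddings": [{"name": "user_query", "cols": ["query_0", "query_1", "query_2"]}]
},
...
}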
Split Data Info Template
{
"data_info": {
"ref_data_info": ..., (REQUIRED)
"eval_data_info": ..., (REQUIRED)
},
...
}
Arguments
ref_data_info: SingleDataInfo, required. Single data info struct for the reference dataset (see templates below).
eval_data_info: SingleDataInfo, required. Single data info struct for the evaluation dataset (see templates below).
Single Data Info Templates
Note that these single data info structs can be used to specify both the ref_data_info and the eval_data_info
in the split data info template above.
Note that all single data info structs also take in a set of tabular parameters which allow the user to additionally
specify properties of their data. These parameters include fields such as label_col, pred_col, nrows, protected_features,
and more. The full list is detailed below. Note that for AI Continuous Testing, predictions are required.
Either pred_col must be specified or pred_path must be specified.
General Tabular Parameters for Single Data Info
{
"label_col": "Label",
"pred_col": "Prediction", (WORKS FOR ALL TASKS EXCEPT MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
"pred_path": "path/to/preds.csv", (ONLY SPECIFY FOR MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
"nrows": null,
"nrows_per_time_bin": null,
"categorical_features": null,
"protected_features": null,
"features_not_in_model": null,
"feature_type_path": null,
"ranking_info": null, (REQUIRED FOR RANKING)
"loading_kwargs": null,
"embeddings": null
}
Arguments
label_col: string or null, default = null. Name of column in data that corresponds to the labels.
pred_col: string or null, default = null. Name of column in data that corresponds to the predictions.
pred_path: string or null, default = null. Path to a CSV or Parquet file containing the predictions on the corresponding dataset. This is how predictions are specified for multi-class models.
nrows: int or null, default = null. Number of rows of data to load and test. If null, all rows are loaded.
nrows_per_time_bin: int or null, default = null. Number of rows of data per time bin to load and test in continuous testing (does not affect stress testing results). If null, all rows are loaded.
categorical_features: list or null, default = null. List of categorical features in data. If provided, these should be ALL the categorical features. If null, RIME will automatically determine whether a column is categorical or not.
protected_features: list or null, default = null. List of protected features in data. If the Bias and Fairness category is added to categories in the test config (see TestSuiteConfig()) and protected_features are included, a set of bias and fairness tests will be run over the protected features.
features_not_in_model: list or null, default = null. List of features in the dataset that are not used by the model. Specifying this will ensure that only relevant tests are run on these features.
feature_type_path: string, default = null. Path to a CSV file that specifies the data type of each feature. The file should have two columns: FeatureName and FeatureType. The possible values for FeatureType are BoolCategoricalColumn, DomainColumn, EmailColumn, NumericCategoricalColumn, StringCategoricalColumn, UrlColumn, FloatColumn, IntegerColumn.
ranking_info: mapping, default = null. Arguments to be used for Ranking tasks. If you are not running RIME on a Ranking task, this value should be null. If you are running on a Ranking task, the following keys should be provided:
query_col: string, required. Name of column in dataset that contains the query IDs.
nqueries: int or null, default = null. Number of queries to consider when running RIME. If null, all queries are used.
nrows_per_query: int or null, default = null. Number of rows to use per query when running RIME. If null, all rows are used.
drop_query_id: bool, default = True. Whether to drop the query ID column from the dataset to avoid passing it as a feature to the model.
loading_kwargs: mapping, default = null. Keyword arguments to be passed to the pandas loading function (either pd.read_csv or pd.read_parquet, depending on your data format). NOTE: if you wish to specify nrows, this should NOT be done with these kwargs but rather with the nrows parameter above.
embeddings: list or null, default = null. A list of dictionaries corresponding to information for each embedding. The arguments for each dictionary are described below.
name: string. Name of the embedding to be shown in the test run results.
cols: list. List of column names corresponding to the embedding. For example, suppose each data row is represented by columns age, is_member, query_0, query_1, query_2, etc., where each column “query_\d” is a dimension of a sentence embedding extracted from a user query. Specifying "embeddings": [{"name": "user_query", "cols": ["query_0", "query_1", "query_2", ...]}] (i.e., specifying each column contained by the embedding) would direct the RI Platform to treat these columns as a single dense vector-valued embedding feature.
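As a sketch, the tabular parameters for a hypothetical ranking task might look like the following. These parameters would be merged into one of the single data info templates below (for example, alongside a file_name); all column names and values here are illustrative:
{
"label_col": "relevance",
"pred_col": "score",
"nrows_per_time_bin": 10000,
"ranking_info": {
"query_col": "query_id",
"nqueries": 500,
"nrows_per_query": 100,
"drop_query_id": true
}
}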
File-based Single Data Info Template
{
"file_name": "path/to/file.csv",
**tabular_params
}
Arguments
file_name: string, required. Path to data file.
**tabular_params: Dict. See Tabular Parameters above.
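Putting the pieces together, a split data_info that uses file-based single data info structs for both datasets might look like the following sketch (paths and column names are illustrative):
{
"data_info": {
"ref_data_info": {
"file_name": "path/to/ref.csv",
"label_col": "Label",
"pred_col": "Prediction"
},
"eval_data_info": {
"file_name": "path/to/eval.csv",
"label_col": "Label",
"pred_col": "Prediction"
}
},
...
}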
Custom Dataloader Single Data Info Template
{
"load_path": "path/to/custom_loader.py",
"load_func_name": "load_fn_name",
"loader_kwargs": null,
"loader_kwargs_json": null
**tabular_params
}
Arguments
load_path: string, required. Path to custom loader Python file.
load_func_name: string, required. Name of the loader function. Must be defined within the Python file.
loader_kwargs: Dict, default = null. Arguments to pass in to the loader function, in dictionary form. We pass these arguments in as **kwargs. Only one of loader_kwargs and loader_kwargs_json can be specified.
loader_kwargs_json: Dict. Arguments to pass in to the loader function, in JSON-serialized string form. We pass these arguments in as **kwargs. Only one of loader_kwargs and loader_kwargs_json can be specified.
**tabular_params: Dict. See Tabular Parameters above.
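For instance, a custom dataloader single data info might look like the sketch below. The function name load_data and the loader_kwargs keys are hypothetical names used for illustration; the loader function would receive these keys as keyword arguments:
{
"load_path": "path/to/custom_loader.py",
"load_func_name": "load_data",
"loader_kwargs": {"table": "transactions", "limit": 50000},
"label_col": "Label",
"pred_col": "Prediction"
}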
Data Collector Single Data Info Template
NOTE: this can only be specified as part of a Continuous Testing config, not an offline testing config. See the Continuous Tests Configuration for more details.
{
"start_time": start_time,
"end_time": end_time,
**tabular_params
}
Arguments
start_time: int, required. Start time of the data collector to fetch data from. Format is UNIX epoch time in seconds.
end_time: int, required. End time of the data collector to fetch data from. Format is UNIX epoch time in seconds.
**tabular_params: Dict. See Tabular Parameters above.
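For example, a data collector single data info covering a single 24-hour window might look like the following; the epoch timestamps correspond to 2023-01-01 00:00:00 UTC through 2023-01-02 00:00:00 UTC, and the column names are illustrative:
{
"start_time": 1672531200,
"end_time": 1672617600,
"label_col": "Label",
"pred_col": "Prediction"
}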
Intersections Template
Specifies a custom group based on a specified intersection of features. Subset performance tests and certain bias tests run over the specified intersection.
{
"data_info": {
"ref_path": "path",
"eval_path": "path",
"ref_pred_path": "path",
"eval_pred_path": "path",
"label_col": "Column label",
"protected_features": ["feature1", "feature2", "feature3"],
"intersections": [{"features": ["feature1", "feature2", "feature3"]}, {"features": ["feature3", "feature4"]}]
},
...
}
Arguments
ref_path: string, required. Path to reference data file.
eval_path: string, required. Path to evaluation data file.
ref_pred_path: string, required. Path to a CSV or Parquet file containing the predictions on the reference dataset. This is how predictions are specified for multi-class models.
eval_pred_path: string, required. Path to a CSV or Parquet file containing the predictions on the evaluation dataset. This is how predictions are specified for multi-class models.
label_col: string, required. Name of column in data that corresponds to the labels.
protected_features: array, required. List of protected features in data. If the Bias and Fairness category is added to categories in the test config (see TestSuiteConfig()) and protected_features are included, a set of bias and fairness tests will be run over the protected features.
intersections: array, required. A list of arrays of features. The intersection of the sets of features specifies a custom group on which to run subset performance tests and certain bias tests.
Delta Lake Single Data Info Template
NOTE: the Databricks secret access token (along with the “server_hostname” and “http_path”) is specified as part of the Data Sources Configuration. If you are launching a manual Stress Test run or Continuous Test run, do not fill in the server_hostname or http_path fields; those fields and the secret token will automatically be inserted if you specify the corresponding data source name when kicking off the run.
{
"server_hostname": "<workspace-id>.cloud.databricks.com",
"http_path": "http/path",
"table_name": "table_name",
"start_time": start_time,
"end_time": end_time,
"time_col": "Timestamp column",
**tabular_params
}
Arguments
server_hostname: string, required. Server hostname of the Databricks cluster. This should be specified as part of a Data Source. See Databricks docs for more details.
http_path: string, required. HTTP Path of the Databricks cluster. This should be specified as part of a Data Source. See Databricks docs for more details.
table_name: string, required. Name of the Delta Lake table.
start_time: int, requiredStart time of the Delta Lake table to fetch data from. Format is UNIX epoch time in seconds.
end_time: int, requiredEnd time of the Delta Lake table to fetch data from. Format is UNIX epoch time in seconds.
time_col: string, required. Name of the timestamp column. This is a required field in order to determine the range of data which satisfies the start_time and end_time params.
**tabular_params: Dict. See Tabular Parameters above.
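As a rough sketch, a Delta Lake single data info used with an already-configured Databricks data source might look like the following. The table name, timestamps, and column names are illustrative, and per the note above, server_hostname and http_path are omitted here because they are filled in automatically when you specify the corresponding data source name at launch:
{
"table_name": "fraud_scores",
"start_time": 1672531200,
"end_time": 1672617600,
"time_col": "event_timestamp",
"label_col": "is_fraud",
"pred_col": "pred_is_fraud"
}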