Data Configuration

To configure a data source, specify a mapping under the data_info argument of the main RIME JSON configuration file.

NOTE: AI Continuous Testing requires predictions. Specify either pred_col, or both ref_pred_path and eval_pred_path.

Template

{
    "data_info": {
        "ref_path": "path/to/ref.csv",        (REQUIRED)
        "eval_path": "path/to/eval.csv",      (REQUIRED)
        "label_col": "Label",
        "pred_col": "Prediction",             (WORKS FOR ALL TASKS EXCEPT MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
        "ref_pred_path": "path/to/ref/preds.csv",  (ONLY SPECIFY FOR MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
        "eval_pred_path": "path/to/eval/preds.csv",  (ONLY SPECIFY FOR MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
        "nrows": null,
        "categorical_features": null,
        "loading_kwargs": null,
        "ranking_info": null,
        "protected_features": null
    },
    ...
}
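
The parenthetical annotations in the template are documentation only and would not parse as JSON. As a minimal sketch, a config holding just the required paths plus a label and prediction column could be generated from Python as shown here. The paths and column names are the template's placeholders, and the other top-level sections of the RIME config (the "..." in the template) are left out.

```python
import json

# Placeholder paths and column names taken from the template; substitute your own.
data_info = {
    "ref_path": "path/to/ref.csv",    # required
    "eval_path": "path/to/eval.csv",  # required
    "label_col": "Label",
    "pred_col": "Prediction",
}

# Only the data_info section is shown; the rest of the config is omitted here.
# Keys left out of data_info are assumed to take their documented default of null.
config = {"data_info": data_info}

config_json = json.dumps(config, indent=4)
print(config_json)
```

Serializing with json.dumps also converts Python's None, True, and False to the JSON literals null, true, and false.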

Arguments

ref_path: string, required

Path to reference data file.

eval_path: string, required

Path to evaluation data file.

label_col: string or null, default = null

Name of column in data that corresponds to the labels.

pred_col: string or null, default = null

Name of column in data that corresponds to the predictions.

ref_pred_path: string or null, default = null

Path to a CSV or Parquet file containing the predictions on the reference dataset. This is how predictions are specified for multi-class models.

eval_pred_path: string or null, default = null

Path to a CSV or Parquet file containing the predictions on the evaluation dataset. This is how predictions are specified for multi-class models.
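
The rule that Continuous Testing needs either pred_col or both prediction files can be checked before a run is launched. The sketch below is one way to do that; has_predictions is a hypothetical helper written for this example and is not part of RIME.

```python
from typing import Any, Mapping


def has_predictions(data_info: Mapping[str, Any]) -> bool:
    """Return True if predictions are given via pred_col or via both prediction files.

    Hypothetical helper for illustration; not a RIME function.
    """
    has_pred_col = data_info.get("pred_col") is not None
    has_pred_files = (
        data_info.get("ref_pred_path") is not None
        and data_info.get("eval_pred_path") is not None
    )
    return has_pred_col or has_pred_files


# Multi-class style config: predictions come from separate files (placeholder paths).
multiclass_info = {
    "ref_path": "path/to/ref.csv",
    "eval_path": "path/to/eval.csv",
    "label_col": "Label",
    "ref_pred_path": "path/to/ref/preds.csv",
    "eval_pred_path": "path/to/eval/preds.csv",
}

# Config with no predictions at all.
no_pred_info = {"ref_path": "path/to/ref.csv", "eval_path": "path/to/eval.csv"}

print(has_predictions(multiclass_info))  # True
print(has_predictions(no_pred_info))     # False
```

Supplying only one of the two prediction paths does not satisfy the check, since both the reference and evaluation predictions are needed.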

nrows: int or null, default = null

Number of rows of data to load and test. If null, all rows are loaded.

categorical_features: list or null, default = null

List of categorical features in the data. If provided, the list must contain ALL of the categorical features. If null, RIME automatically determines which columns are categorical.

loading_kwargs: mapping or null, default = null

Keyword arguments passed to the pandas loading function (either pd.read_csv or pd.read_parquet, depending on your data format). NOTE: to limit the number of rows loaded, use the nrows parameter above rather than passing nrows in these kwargs.
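
As an illustration of keeping nrows out of loading_kwargs, the following sketch builds those two entries and rejects a misplaced nrows. build_loading_options is a hypothetical helper, not a RIME function; sep and encoding are ordinary pd.read_csv keyword arguments chosen as examples.

```python
from typing import Any, Dict, Optional


def build_loading_options(
    nrows: Optional[int] = None,
    loading_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return the nrows and loading_kwargs entries for data_info.

    Hypothetical helper for illustration; not a RIME function.
    """
    if loading_kwargs is not None and "nrows" in loading_kwargs:
        # The row limit belongs in the dedicated nrows key, not in the pandas kwargs.
        raise ValueError("Set nrows via the nrows key, not inside loading_kwargs.")
    return {"nrows": nrows, "loading_kwargs": loading_kwargs}


# Load the first 1000 rows of a semicolon-delimited, UTF-8 encoded CSV.
options = build_loading_options(
    nrows=1000,
    loading_kwargs={"sep": ";", "encoding": "utf-8"},
)
print(options)
```

The returned entries can be merged into an existing data_info mapping with dict.update.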

ranking_info: mapping or null, default = null

Arguments used for Ranking tasks. Leave this value as null unless you are running RIME on a Ranking task, in which case provide the following keys:
- query_col: string, required
  
  Name of the column in the dataset that contains the query IDs.
- nqueries: int or null, default = null
  
  Number of queries to consider when running RIME. If null, all queries are used.
- nrows_per_query: int or null, default = null
  
  Number of rows to use per query when running RIME. If null, all rows are used.
- drop_query_id: bool, default = True
  
  Whether to drop the query ID column from the dataset so that it is not passed to the model as a feature.
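
For a Ranking task, the keys above might be combined as in the following sketch. The column name "query_id" and the sampling values are made up for illustration; only the key names come from this page.

```python
import json

# Hypothetical column name and illustrative sampling values.
ranking_info = {
    "query_col": "query_id",  # required
    "nqueries": 100,          # None (JSON null) would use all queries
    "nrows_per_query": 20,    # None (JSON null) would use all rows per query
    "drop_query_id": True,    # serialized as JSON true
}

# Placeholder paths and column names from the template.
data_info = {
    "ref_path": "path/to/ref.csv",
    "eval_path": "path/to/eval.csv",
    "label_col": "Label",
    "pred_col": "Prediction",
    "ranking_info": ranking_info,
}

ranking_json = json.dumps({"data_info": data_info}, indent=4)
print(ranking_json)
```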

protected_features: list or null, default = null

List of protected features in the data. If the Compliance category is added to categories in the test config (see TestSuiteConfig()) and protected_features is specified, a set of compliance tests is run over the protected features.