Data Configuration

Configuring a data source can be done by specifying a mapping in the main RIME JSON configuration file, under the data_info argument.

NOTE: for AI Continuous Testing, predictions are required. Either pred_col must be specified or ref_pred_path and eval_pred_path must be specified.

Template

{
    "data_info": {
        "ref_path": "path/to/ref.csv",        (REQUIRED)
        "eval_path": "path/to/eval.csv",      (REQUIRED)
        "label_col": "Label",
        "pred_col": "Prediction",             (WORKS FOR ALL TASKS EXCEPT MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
        "ref_pred_path": "path/to/ref/preds.csv",  (ONLY SPECIFY FOR MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
        "eval_pred_path": "path/to/eval/preds.csv",  (ONLY SPECIFY FOR MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
        "nrows": null,
        "categorical_features": null,
        "loading_kwargs": null,
        "ranking_info": null,
        "protected_features": null
    },
    ...
}

Arguments

  • ref_path: string, required

    Path to reference data file.

  • eval_path: string, required

    Path to evaluation data file.

  • label_col: string or null, default = null

    Name of column in data that corresponds to the labels.

  • pred_col: string or null, default = null

    Name of column in data that corresponds to the predictions.

  • ref_pred_path: string or null, default = null

    Path to a csv or parquet file containing the predictions on the reference dataset. This is how predictions are specified for multi-class models.

  • eval_pred_path: string or null, default = null

    Path to a csv or parquet file containing the predictions on the evaluation dataset. This is how predictions are specified for multi-class models.

  • nrows: int or null, default = null

    Number of rows of data to load and test. If null, will load all rows. By default is null.

  • categorical_features: list or null, default = null

    List of categorical features in data. If provided, these should be ALL the categorical features. If null, RIME will automatically determine whether a column is categorical or not. By default is null.

  • loading_kwargs: mapping, default = null

    Keyword arguments to be passed to the pandas loading function (either pd.read_csv or pd.read_parquet, depending on your data format). NOTE: if you wish to specify nrows, this should NOT be done with these kwargs but rather with the nrows parameter above.

  • ranking_info: mapping, default = null

    Arguments to be used for Ranking tasks. If you are not running RIME on a Ranking task this value should be null. If you are running on a Ranking task, the following keys should be provided:

    • query_col: string, required

      Name of column in dataset that contains the query ids.

    • nqueries: int or null, default = null

      Number of queries to consider when running RIME. If null, will use all queries.

    • nrows_per_query: int or null, default = null

      Number of rows to use per query when running RIME. If null, will use all rows.

    • drop_query_id: bool, default = True

      Whether to drop the query ID column from the dataset to avoid passing as a feature to the model.

  • protected_features: list or null, default = null

    List of protected features in data. If Compliance category is added to categories in the test config (see TestSuiteConfig(), and protected_features are included - a set of compliance tests will be run over the protected features.