Data Configuration
To configure a data source, specify a mapping in the main RIME JSON configuration file
in the data_params argument.
The data_params configuration can take on different forms, offering a tradeoff between
simplicity and flexibility.
The “Register a reference dataset” step in the Stress Tests walkthrough shows an example of this configuration used in a data registry.
Default Data Configuration Template
{
    "connection_info": {...},
    "data_params": {
        "label_col": "",
        "pred_col": "",
        "timestamp_col": "",
        "class_names": [],
        "ranking_info":{
            "query_col":"",
            "nqueries": null,
            "nrows_per_query": null,
            "drop_query_id": null
        },
        "nrows": null,
        "nrows_per_time_bin": null,
        "sample": null,
        "categorical_features": null,
        "protected_features": null,
        "features_not_in_model": null,
        "text_features": null,
        "image_features": null,
        "features": null,
        "loading_kwargs": null,
        "feature_type_path": null,
        "pred_path": null,
        "image_load_path": null
    },
    ...
}
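As a minimal sketch, the template can be assembled in Python and serialized to JSON. Only the key names come from the template above; the column names, class names, feature name, and file path are hypothetical placeholders.

```python
import json

# Hypothetical values for illustration; only the key names come from the template.
config = {
    "connection_info": {"data_file": {"path": "data/reference.csv"}},
    "data_params": {
        "label_col": "label",
        "pred_col": "prediction",
        "timestamp_col": "timestamp",
        "class_names": ["negative", "positive"],
        "nrows": None,   # Python None is serialized as JSON null
        "sample": None,
        "categorical_features": ["country"],
    },
}

# Serialize with the same four-space indentation the template uses.
config_json = json.dumps(config, indent=4)
print(config_json)
```

Building the configuration as a dictionary and serializing it avoids hand-editing mistakes such as missing commas between keys.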
Parameters for the data_params object
General
| Parameter | Type | Description | 
|---|---|---|
| label_col | String | Name of the column that contains the labels. | 
| pred_col | String | Name of the column that contains the predictions. | 
| timestamp_col | String | Name of the column that contains the timestamps used by Continuous Testing (CT). | 
| class_names | Repeated String | List of label class names. | 
| ranking_info | JSON object | Contains parameters used for the ranking model task. | 
| query_col | String | Name of column in dataset that contains the query IDs. | 
| nqueries | Optional int64 | Number of queries to consider. Uses all queries when null. | 
| nrows_per_query | Optional int64 | Number of rows to use per query. Uses all rows when null. | 
| drop_query_id | Optional Boolean | Specifies whether to drop the query ID column from the dataset in order to prevent passing the query ID column to the model as a feature. | 
Dataset sizing
| Parameter | Type | Description | 
|---|---|---|
| nrows | Optional int64 | Number of rows of data to load and test. Loads all rows when null and sample is not specified. Infers the maximum number of rows possible when null and sample is specified. | 
| nrows_per_time_bin | Optional int64 | Number of rows of data per time bin to load and test in CT. Loads all rows when null. | 
| sample | Optional Boolean | Specifies whether to sample rows in the data. Default is True. | 
Feature types and relations
| Parameter | Type | Description | 
|---|---|---|
| categorical_features | Repeated String | A list of categorical features. | 
| protected_features | Repeated String | A list of features that are protected attributes. When the Bias and Fairness category is specified, these tests are only run over the listed features. | 
| features_not_in_model | Repeated String | A list of features not present in the model. | 
| text_features | Repeated String | A list of text features to run NLP tests over. | 
| image_features | Repeated String | A list of image features to run CV tests over. | 
Feature intersections
| Parameter | Type | Description | 
|---|---|---|
| features | Repeated String | A list of features to run tabular tests over. | 
External resources
| Parameter | Type | Description | 
|---|---|---|
| loading_kwargs | String | Keyword arguments passed to the pandas loading function. Do not specify nrows here. | 
| feature_type_path | String | Deprecated. Path to a CSV file that specifies the data type of each feature. The file must have two columns, FeatureName and FeatureType. | 
| pred_path | String | Deprecated. Path to a CSV file or Parquet file that contains predictions. | 
| image_load_path | String | Path to a Python file that contains a load_image function defining custom logic for loading an image from the file path provided in the dataset. | 
Data Info template
The data_info format supports separately specifying a reference and evaluation dataset.
Register your reference and evaluation datasets separately, then specify the unique
IDs for each dataset in data_info.
{
    "data_info": {
        "ref_dataset_id": ...,
        "eval_dataset_id": ...,
    },
    ...
}
Arguments
| Parameter | Type | Description | 
|---|---|---|
| ref_dataset_id | String | Unique identifier of a reference dataset. | 
| eval_dataset_id | String | Unique identifier of an evaluation dataset. | 
Single Data Info Templates
Use SingleDataInfo to specify the reference and evaluation datasets
separately, as in the previous template. SingleDataInfo takes
two elements, a connection_info object and a data_params object, both
described in this section.
All SingleDataInfo objects also take a set of parameters that enable you to specify
additional data properties.
Continuous Testing requires that you specify a prediction set by setting a value
for either pred_col or pred_path.
{
    "connection_info": ...,
    "data_params": ...,
}
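The Continuous Testing requirement above can be checked before a configuration is submitted. The following is a minimal sketch, not part of RIME: the helper name `has_prediction_source` is hypothetical, and treating an empty string or null as "not set" is an assumption.

```python
def has_prediction_source(single_data_info: dict) -> bool:
    """Return True if data_params sets pred_col or pred_path to a non-empty value.

    Assumption: an empty string or None counts as "not set".
    """
    data_params = single_data_info.get("data_params") or {}
    return bool(data_params.get("pred_col")) or bool(data_params.get("pred_path"))


# Hypothetical configurations for illustration.
with_preds = {
    "connection_info": {"data_file": {"path": "data/eval.csv"}},
    "data_params": {"pred_col": "prediction"},
}
without_preds = {
    "connection_info": {"data_file": {"path": "data/eval.csv"}},
    "data_params": {"pred_col": "", "pred_path": None},
}

print(has_prediction_source(with_preds))     # True
print(has_prediction_source(without_preds))  # False
```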
General Parameters for Single Data Info
| Parameter | Type | Description | 
|---|---|---|
| connection_info | ConnectionInfo | Path to a ConnectionInfo object. | 
| data_params | DataInfoParams | Parameters that describe the data, as listed under “Parameters for the data_params object”. | 
Connection Info template
Specifies how to connect to a data source. Specify exactly one of the parameters in the following table.
| Parameter | Type | Description | 
|---|---|---|
| data_file | DataFileInfo | Information required by RIME to load a data file. | 
| data_loading | DataLoadingInfo | Loads a data file with additional parameters. | 
| data_collector | DataCollectorInfo | Loads a data stream from a data collector. | 
| delta_lake | DeltaLakeInfo | Loads a Delta Lake table. | 
| hugging_face | HuggingFaceDataInfo | Loads a HuggingFace dataset. | 
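Because exactly one of these parameters may be specified, a small pre-flight check can catch a connection_info object that sets none or several. This is a hedged sketch: the function name `selected_connection` is hypothetical, and treating a null value as "not specified" is an assumption.

```python
# Parameter names taken from the table above.
CONNECTION_KEYS = (
    "data_file",
    "data_loading",
    "data_collector",
    "delta_lake",
    "hugging_face",
)


def selected_connection(connection_info: dict) -> str:
    """Return the single connection parameter that is set, or raise ValueError."""
    present = [key for key in CONNECTION_KEYS if connection_info.get(key) is not None]
    if len(present) != 1:
        raise ValueError(
            f"expected exactly one of {CONNECTION_KEYS}, found {present}"
        )
    return present[0]


# Hypothetical path for illustration.
print(selected_connection({"data_file": {"path": "data/reference.csv"}}))  # data_file
```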
File-based Single Data Info Template
Uses ConnectionInfo and DataInfoParams objects, which are discussed earlier in this
section.
{
    "connection_info": {
        "data_file": {
            "path": ""
        }
    },
    "data_params": ...,
}
Data Collector Single Data Info Template
This can only be specified as part of a Continuous Testing configuration.
{
    "connection_info": {
        "data_collector": {
            "data_stream_id": null,
            "start_time": 0,
            "end_time": 0
        }
    },
    "data_params": {}
}
| Parameter | Type | Description | 
|---|---|---|
| data_stream_id | rime.UUID | The unique identifier assigned by RIME to a data stream. | 
| start_time | int64 | The start time in seconds from the UNIX epoch. | 
| end_time | int64 | The end time in seconds from the UNIX epoch. | 
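Since start_time and end_time are expressed in seconds from the UNIX epoch, it can help to derive them from calendar dates rather than typing the integers by hand. The sketch below uses only the Python standard library; the helper name `to_epoch_seconds` and the dates are hypothetical.

```python
from datetime import datetime, timezone


def to_epoch_seconds(moment: datetime) -> int:
    """Convert a timezone-aware datetime to whole seconds since the UNIX epoch."""
    if moment.tzinfo is None:
        # A naive datetime would be interpreted in the machine's local time zone.
        raise ValueError("use a timezone-aware datetime to avoid local-time ambiguity")
    return int(moment.timestamp())


connection_info = {
    "data_collector": {
        "data_stream_id": None,  # replace with the data stream ID assigned by RIME
        "start_time": to_epoch_seconds(datetime(2023, 1, 1, tzinfo=timezone.utc)),
        "end_time": to_epoch_seconds(datetime(2023, 1, 2, tzinfo=timezone.utc)),
    }
}

print(connection_info["data_collector"]["start_time"])  # 1672531200
print(connection_info["data_collector"]["end_time"])    # 1672617600
```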
Delta Lake Single Data Info Template
Loads a Delta Lake table.
{
    "connection_info": {
        "delta_lake": {
            "table_name": "Table",
            "start_time": "1970-01-01 00:00:01",
            "end_time": "1970-01-01 00:00:02",
            "time_col": "Updated"
        }
    },
    "data_params": {}
}
| Parameter | Type | Description | 
|---|---|---|
| table_name | String | The name of the Delta Lake table. | 
| start_time | int64 | The start time in seconds from the UNIX epoch. | 
| end_time | int64 | The end time in seconds from the UNIX epoch. | 
| time_col | String | The name of the column that contains the timestamp of the last update. | 
HuggingFace Single Data Info Template
Specifies how to load a HuggingFace dataset.
{
    "connection_info": {
        "hugging_face": {
            "dataset_uri": "",
            "split_name": "",
            "loading_params_json": ""
        }
    },
    "data_params": {}
}
| Parameter | Type | Description | 
|---|---|---|
| dataset_uri | String | The unique identifier of the dataset. | 
| split_name | String | The name of a predefined subset of data. | 
| loading_params_json | String | A JSON serialized string that contains loading parameters. | 
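Because loading_params_json is a JSON serialized string rather than a nested object, the loading parameters are encoded once on their own and then embedded as a string value. The sketch below shows that double encoding in Python; the dataset URI, split name, and the `revision` key are hypothetical, since the accepted loading parameters are not listed here.

```python
import json

# Hypothetical loading parameters; the accepted keys are not listed in this document.
loading_params = {"revision": "main"}

config = {
    "connection_info": {
        "hugging_face": {
            "dataset_uri": "example-org/example-dataset",  # hypothetical identifier
            "split_name": "train",                         # hypothetical split
            # Encode the parameters as a string, not as a nested JSON object.
            "loading_params_json": json.dumps(loading_params),
        }
    },
    "data_params": {},
}

print(json.dumps(config, indent=4))
```

In the printed configuration, the inner quotation marks of loading_params_json appear escaped, which is expected for a string that itself contains JSON.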
