Data Configuration

To configure a data source, specify a mapping under the data_info argument of the main RIME JSON configuration file. By default, RIME can load any dataset from disk or cloud storage, provided the files are correctly formatted. RIME also supports user-defined dataloaders contained in a configured Python file, as well as a native integration with the Hugging Face datasets hub.

Stress Testing

Default Template

{
    "data_info": {
        "ref_path": "path/to/ref.jsonl.gz",        (REQUIRED)
        "eval_path": "path/to/eval.jsonl.gz"       (REQUIRED)
    },
    ...
}

Arguments

ref_path: string, required

Path to the reference data file. See the NLP file guide for a description of supported file formats.

eval_path: string, required

Path to the evaluation data file. See the NLP file guide for a description of supported file formats.
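
As a minimal sketch, the default template can also be assembled from Python using only the standard library. The helper name build_default_config is hypothetical and not part of RIME; only the key names and the placeholder paths come from the template above.

```python
import json


def build_default_config(ref_path: str, eval_path: str) -> dict:
    """Return a config dict whose data_info holds the two required paths.

    Hypothetical helper for illustration; it is not a RIME API.
    """
    return {
        "data_info": {
            "ref_path": ref_path,    # required
            "eval_path": eval_path,  # required
        }
        # Any other top-level sections of the configuration would go here.
    }


config = build_default_config("path/to/ref.jsonl.gz", "path/to/eval.jsonl.gz")

# Serialize with json.dumps so the result is strictly valid JSON
# (no trailing commas, no annotations).
config_text = json.dumps(config, indent=4)
print(config_text)
```

Generating the file this way avoids hand-editing mistakes such as a missing or trailing comma.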

Custom Dataloader

{
    "data_info": {
        "type": "custom",                           (REQUIRED)
        "load_path": "path/to/dataloader.py"        (REQUIRED)
    },
    ...
}

Arguments

type: string, required

Must be set to "custom".

load_path: string, required

Path to the custom dataloader file. See the NLP Dataloader documentation for instructions on how to create a compatible file.
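
Before launching a run, it can help to check that a custom data_info mapping has both required fields. The sketch below is a pre-flight check written for illustration; check_custom_data_info is a hypothetical name, not a function RIME provides, and it only enforces the two rules stated in the arguments above.

```python
import json


def check_custom_data_info(data_info: dict) -> None:
    """Raise ValueError if data_info does not match the custom dataloader template.

    Hypothetical helper; it checks only what the arguments above require.
    """
    if data_info.get("type") != "custom":
        raise ValueError('"type" must be set to "custom"')
    load_path = data_info.get("load_path")
    if not isinstance(load_path, str) or not load_path:
        raise ValueError('"load_path" must be a non-empty string')


config = json.loads(
    '{"data_info": {"type": "custom", "load_path": "path/to/dataloader.py"}}'
)
check_custom_data_info(config["data_info"])  # valid mapping: no exception raised
```

The check does not inspect the dataloader file itself, since the required contents of that file are described in the NLP Dataloader documentation rather than here.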

Hugging Face Dataset

{
    "data_info": {
        "type": "huggingface",                      (REQUIRED)
        "dataset_uri": "path",                      (REQUIRED)
        "ref_split": "train",
        "eval_split": "test",
        "text_key": "text",
        "label_key": "label",
        "eval_label_key": "label",
        "loading_params": null
    },
    ...
}

Arguments

type: string, required

Must be set to "huggingface".

dataset_uri: string, required

The path or tag passed to load_dataset.

ref_split: string, default = "train"

The key used to access the reference split from the downloaded DatasetDict.

eval_split: string, default = "test"

The key used to access the evaluation split from the downloaded DatasetDict.

text_key: string, default = "text"

The feature name for the NLP input text attribute.

label_key: string or null, default = "label"

The feature name for the label class ID in the reference split. If null, labels are not loaded.

eval_label_key: string or null, default = "label"

The feature name for the label class ID in the evaluation split. If null, labels are not loaded.

loading_params: dict or null, default = null

Additional keyword arguments to pass to load_dataset.
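
The following sketch shows one plausible way these fields map onto the Hugging Face datasets library. It is inferred from the argument descriptions above and is an assumption about the behavior, not RIME's actual implementation. The names DEFAULTS, load_splits, texts_and_labels, and fake_loader are hypothetical. The loader is injectable so the sketch can be exercised without network access; when no loader is given, it falls back to datasets.load_dataset.

```python
from typing import Any, Callable, Optional

# Defaults as listed in the Arguments section above.
DEFAULTS = {
    "ref_split": "train",
    "eval_split": "test",
    "text_key": "text",
    "label_key": "label",
    "eval_label_key": "label",
    "loading_params": None,
}


def load_splits(data_info: dict, load_fn: Optional[Callable[..., Any]] = None):
    """Return (reference_split, evaluation_split) for a huggingface data_info."""
    if data_info.get("type") != "huggingface":
        raise ValueError('"type" must be set to "huggingface"')
    if "dataset_uri" not in data_info:
        raise ValueError('"dataset_uri" is required')
    opts = {**DEFAULTS, **data_info}
    if load_fn is None:
        # Imported lazily so the sketch runs even where datasets is not installed.
        from datasets import load_dataset as load_fn
    # loading_params may be null, in which case no extra kwargs are passed.
    dataset_dict = load_fn(opts["dataset_uri"], **(opts["loading_params"] or {}))
    return dataset_dict[opts["ref_split"]], dataset_dict[opts["eval_split"]]


def texts_and_labels(split, text_key: str, label_key: Optional[str]):
    """Extract texts, and labels unless label_key is None (JSON null)."""
    texts = [row[text_key] for row in split]
    labels = None if label_key is None else [row[label_key] for row in split]
    return texts, labels


# Stand-in for load_dataset, used only to demonstrate the mapping offline.
def fake_loader(uri: str, **kwargs: Any) -> dict:
    return {
        "train": [{"text": "good", "label": 1}],
        "test": [{"text": "bad", "label": 0}],
    }


info = {"type": "huggingface", "dataset_uri": "path", "eval_label_key": None}
opts = {**DEFAULTS, **info}
ref, ev = load_splits(info, load_fn=fake_loader)
ref_texts, ref_labels = texts_and_labels(ref, opts["text_key"], opts["label_key"])
ev_texts, ev_labels = texts_and_labels(ev, opts["text_key"], opts["eval_label_key"])
```

In this example eval_label_key is null, so evaluation labels are skipped while reference labels are still read from the default "label" feature.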