Data Configuration

A data source is configured by specifying a mapping under the data_info argument of the main RIME JSON configuration file. By default, RIME can load any dataset from disk or cloud storage, provided the files are correctly formatted. RIME also supports user-defined dataloaders contained in a Python file, as well as a native integration with the Hugging Face datasets hub.

Stress Testing

Default Template

{
    "data_info": {
        "ref_path": "path/to/ref.jsonl.gz",        (REQUIRED)
        "eval_path": "path/to/eval.jsonl.gz"       (REQUIRED)
    },
    ...
}

Arguments

  • ref_path: string, required

    Path to the reference data file. See the NLP file guide for a description of supported file formats.

  • eval_path: string, required

    Path to the evaluation data file. See the NLP file guide for a description of supported file formats.
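As a minimal sketch of the default template above, the following Python builds the data_info mapping and writes it to a JSON file. The helper name build_default_data_info and the file name rime_config.json are hypothetical and chosen only for illustration; the other top-level configuration keys, which the template elides, are left out.

```python
import json
import tempfile
from pathlib import Path


def build_default_data_info(ref_path: str, eval_path: str) -> dict:
    """Return a data_info mapping for the default (file-based) template.

    Hypothetical helper: it only enforces that both required paths are given.
    """
    if not ref_path or not eval_path:
        raise ValueError("ref_path and eval_path are both required")
    return {"ref_path": ref_path, "eval_path": eval_path}


# Only the data_info section is shown; the rest of the configuration is omitted.
config = {
    "data_info": build_default_data_info(
        "path/to/ref.jsonl.gz", "path/to/eval.jsonl.gz"
    )
}

# Write to a temporary directory so the sketch does not touch real files.
config_path = Path(tempfile.mkdtemp()) / "rime_config.json"
config_path.write_text(json.dumps(config, indent=4))
```

Serializing with json.dumps rather than writing the text by hand avoids trailing-comma and quoting mistakes in the resulting file.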

Custom Dataloader

{
    "data_info": {
        "type": "custom",                           (REQUIRED)
        "load_path": "path/to/dataloader.py"        (REQUIRED)
    },
    ...
}

Arguments

  • type: string, required

    Must be set to “custom”.

  • load_path: string, required

    Path to the custom dataloader file. See the NLP Dataloader documentation for instructions on how to create a compatible file.
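The two required fields of the custom dataloader template can be checked before a run with a small script. This is a hedged sketch, not part of RIME: the function name check_custom_data_info is hypothetical, and the ".py" suffix check is an assumption drawn from the statement that the dataloader lives in a Python file.

```python
from typing import Any, List, Mapping


def check_custom_data_info(data_info: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in a custom-dataloader data_info mapping.

    An empty list means both required fields look plausible.
    """
    problems: List[str] = []
    if data_info.get("type") != "custom":
        problems.append('type must be set to "custom"')
    load_path = data_info.get("load_path")
    if not isinstance(load_path, str) or not load_path:
        problems.append("load_path is required and must be a string")
    elif not load_path.endswith(".py"):
        # Assumption: the dataloader is a Python source file.
        problems.append("load_path should point to a Python file")
    return problems


ok = check_custom_data_info(
    {"type": "custom", "load_path": "path/to/dataloader.py"}
)
bad = check_custom_data_info({"type": "huggingface"})
```

Here ok is an empty list, while bad reports both the wrong type and the missing load_path.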

Hugging Face Dataset

{
    "data_info": {
        "type": "huggingface",                      (REQUIRED)
        "dataset_uri": "path",                      (REQUIRED)
        "ref_split": "train",
        "eval_split": "test",
        "text_key": "text",
        "label_key": "label",
        "eval_label_key": "label",
        "loading_params": null
    },
    ...
}

Arguments

  • type: string, required

    Must be set to “huggingface”.

  • dataset_uri: string, required

    The path or tag passed to ‘load_dataset’.

  • ref_split: string, default = “train”

    The key used to access the reference split from the downloaded ‘DatasetDict’.

  • eval_split: string, default = “test”

    The key used to access the evaluation split from the downloaded ‘DatasetDict’.

  • text_key: string, default = “text”

    The feature name for the NLP input text attribute.

  • label_key: string or null, default = “label”

    The feature name for the label class ID. If null, labels are not loaded.

  • eval_label_key: string or null, default = “label”

    The feature name for the label class ID in the evaluation split. If null, labels are not loaded.

  • loading_params: dict or null, default = null

    Additional kwargs to pass to ‘load_dataset’.
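To make the argument list above concrete, the sketch below shows one plausible way these fields map onto a ‘load_dataset’ call and onto the returned ‘DatasetDict’. It is an illustration under stated assumptions, not RIME's implementation: the names HF_DEFAULTS, resolve_hf_data_info, load_dataset_call_args, and select_splits are hypothetical, and the defaults are copied from the documented values. The actual ‘load_dataset’ call is left in a comment so the block runs without the datasets package or network access.

```python
from typing import Any, Dict, Mapping, Tuple

# Defaults as documented in the argument list (assumed to apply when a key is absent).
HF_DEFAULTS: Dict[str, Any] = {
    "ref_split": "train",
    "eval_split": "test",
    "text_key": "text",
    "label_key": "label",
    "eval_label_key": "label",
    "loading_params": None,
}


def resolve_hf_data_info(data_info: Mapping[str, Any]) -> Dict[str, Any]:
    """Check the required fields and fill in the documented defaults."""
    if data_info.get("type") != "huggingface":
        raise ValueError('type must be set to "huggingface"')
    if not data_info.get("dataset_uri"):
        raise ValueError("dataset_uri is required")
    # Explicit values, including an explicit null (None), override the defaults.
    return {**HF_DEFAULTS, **data_info}


def load_dataset_call_args(
    resolved: Mapping[str, Any],
) -> Tuple[Tuple[str], Dict[str, Any]]:
    """Positional and keyword arguments one would pass to load_dataset."""
    return (resolved["dataset_uri"],), dict(resolved["loading_params"] or {})


def select_splits(dataset_dict: Mapping[str, Any], resolved: Mapping[str, Any]):
    """Pick the reference and evaluation splits out of a DatasetDict-like mapping."""
    return dataset_dict[resolved["ref_split"]], dataset_dict[resolved["eval_split"]]


resolved = resolve_hf_data_info({"type": "huggingface", "dataset_uri": "path"})
args, kwargs = load_dataset_call_args(resolved)

# With the Hugging Face `datasets` package installed, the call would look like:
#   from datasets import load_dataset
#   dataset_dict = load_dataset(*args, **kwargs)
#   ref_data, eval_data = select_splits(dataset_dict, resolved)
```

Keeping the defaults in a single mapping means that an explicit null for label_key or eval_label_key survives the merge, which matches the documented behavior of skipping labels.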