Input Data Format

Automated Validation

The Python SDK exposes a command-line utility that can automatically validate your input data:

rime-data-format-check <ARGS>

A successful run inspects each dataset and confirms that your data is compatible:

Inspecting <REFERENCE_SET>
Done!

Inspecting <EVALUATION_SET>
Done!


---


Your data should work with RIME!

Usage instructions for this utility are available in the Python SDK documentation.


Supported File Formats

RIME NLP currently supports both JSON (.json) and JSON lines (.jsonl) formats, optionally compressed using gzip (.json.gz or .jsonl.gz). For JSON lines files, RIME expects each line to be a dictionary representing a single data point. For standard JSON files, RIME expects the content to be a list of dictionaries. The structure of each dictionary is task-specific. See below for a detailed description of the expected data format for supported tasks.
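
As a concrete illustration, the following minimal sketch writes the same records in both accepted layouts. The file names and records here are hypothetical, and the per-record keys are task-specific (see below):

import gzip
import json

# Hypothetical records; the required keys depend on the task (see below).
records = [
    {"text": "Hello, world!", "label": 1},
    {"text": "Goodbye, world!", "label": 0},
]

# Standard JSON: the file content is a list of dictionaries.
with open("data.json", "w") as f:
    json.dump(records, f)

# JSON lines, gzip-compressed: one dictionary per line.
with gzip.open("data.jsonl.gz", "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")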

Requirements By Task

Text Classification

For the Text Classification task, each data point is represented by a dictionary containing the following keys:

[
  {
    "text": "Hello, world!",               (REQUIRED)
    "label": 1,
    "probabilities": [0.02, 0.94, 0.04]
  },
  ...
]
  • text: string, required

    The input string for this data point.

  • label: int

    The ground truth class label. This should be an integer in [0, num_classes), where num_classes is the length of the probability vector output by the model. The label for a class should correspond to the index of that class in the model output.

  • probabilities: List[float]

    The model prediction for this data point. This should be a normalized vector of class probabilities, with a probability for each possible class. NOTE: predictions can also be provided in a separate file. See the NLP Prediction Configuration reference for more on how to provide cached predictions.
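
To make these constraints concrete, here is a minimal, hypothetical sanity check (not part of the SDK) that one might run over classification records before handing them to RIME:

import math

def check_classification_record(record):
    # "text" is the only required key and must be a string.
    assert isinstance(record.get("text"), str)

    probabilities = record.get("probabilities")
    if probabilities is not None:
        # The prediction must be a normalized probability vector.
        assert all(0.0 <= p <= 1.0 for p in probabilities)
        assert math.isclose(sum(probabilities), 1.0, rel_tol=1e-6)

    label = record.get("label")
    if label is not None and probabilities is not None:
        # Labels index into the model output: 0 <= label < num_classes.
        assert 0 <= label < len(probabilities)

check_classification_record(
    {"text": "Hello, world!", "label": 1, "probabilities": [0.02, 0.94, 0.04]}
)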

Named Entity Recognition

For the Named Entity Recognition task, each data point is represented by a dictionary containing the following keys:

[
  {
    "text": "Hello, world!",               (REQUIRED)
    "entities": [
      {
        "mentions": [
          {
            "start_offset": 7,
            "end_offset": 11
          }
        ],
        "type": "LOC"
      }
    ],
    "predicted_entities": [
      {
        "mentions": [
          {
            "start_offset": 7,
            "end_offset": 11
          }
        ],
        "type": "ORG"
      }
    ]
  },
  ...
]
  • text: string, required

    The input string for this data point.

  • entities: List[dict]

    The ground truth annotations. This should be a list of dictionaries, with each dictionary corresponding to an entity. Each entity dictionary should have a ‘type’ key (specifying the type of the entity) and a ‘mentions’ key containing all the mentions that refer to this entity. Each mention is itself a dictionary with two keys, ‘start_offset’ and ‘end_offset’, both integers referring to the start and end of the mention in the text.

  • predicted_entities: List[dict]

    The model predictions for this data point. This should be a list of dictionaries, with each dictionary corresponding to an entity. Each entity dictionary should have a ‘type’ key (specifying the type the entity is predicted to be) and a ‘mentions’ key containing all the mentions predicted to refer to this entity. Each mention is itself a dictionary with two keys, ‘start_offset’ and ‘end_offset’, both integers referring to the start and end of the mention in the text. See the NLP Prediction Configuration reference for more on how to provide cached predictions.
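
The offsets index into the raw text string. As a hypothetical illustration (not part of the SDK), the sketch below recovers each mention's surface form, assuming character offsets with an inclusive end_offset, which is what the sample above suggests ("world" occupies offsets 7 through 11 of "Hello, world!"):

def mention_text(text, mention):
    # Assumes character offsets with an inclusive end_offset,
    # matching the sample above ("world" spans offsets 7-11).
    return text[mention["start_offset"] : mention["end_offset"] + 1]

record = {
    "text": "Hello, world!",
    "entities": [
        {"mentions": [{"start_offset": 7, "end_offset": 11}], "type": "LOC"}
    ],
}

for entity in record["entities"]:
    for mention in entity["mentions"]:
        print(entity["type"], mention_text(record["text"], mention))  # LOC world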