RIME Data Format Check

rime-data-format-check is a command-line utility that validates data for use with the RIME platform by performing a series of sanity checks, depending on the nature of the data (tabular vs. NLP).

Be sure to have the Python SDK installed beforehand:

pip install rime-sdk

After installation, you will be able to run rime-data-format-check commands in your shell. Try running it with --help to see what flags to specify to use it on your data:

rime-data-format-check --help

usage: rime-data-format-check [-h] (-nlp | -tabular) --ref-path REF_PATH
                              --eval-path EVAL_PATH --task {Multi-class
                              Classification,Regression,Binary
                              Classification,Named Entity Recognition,Text
                              Classification}
                              [--preds-ref-path PREDS_REF_PATH]
                              [--preds-eval-path PREDS_EVAL_PATH]
                              [--label-col-name LABEL_COL_NAME]
                              [--pred-col-name PRED_COL_NAME]
                              [--timestamp-col-name TIMESTAMP_COL_NAME]

Examples

Tabular

For a full list of the requirements for tabular input data, see the specification here.

These examples were run using the Kaggle Titanic dataset.

1. Basic Usage (no labels or predictions)

rime-data-format-check -tabular \
  --task "Binary Classification" \
  --ref-path data/titanic/train.csv \
  --eval-path data/titanic/test.csv

Output (PASSING):


Inspecting 'data/titanic/train.csv'
Done!

Inspecting 'data/titanic/test.csv'
Done!

WARNING: No prediction column is provided. Although you can still run RIME without predictions, it will not be as powerful as if you run it WITH predictions.

WARNING: No label column is provided. Although you can still run RIME without labels, it will not be as powerful as if you run it WITH labels.


---


Your data should work with RIME!

2. With Labels (and a sample failure)

rime-data-format-check -tabular \
  --task "Binary Classification" \
  --ref-path data/titanic/train.csv \
  --eval-path data/titanic/test.csv \
  --label-col-name "Survived"

Output (ERROR):


Inspecting 'data/titanic/train.csv'
Done!

Inspecting 'data/titanic/test.csv'

---

Error:

Label column (Survived) not found in data (data/titanic/test.csv). If a label column exists in one dataset, it MUST exist in the other.

In this case, it looks like we’re missing the label column from our evaluation set. For the Titanic dataset, these values are provided in the gender_submission.csv.

Adding the label column to the evaluation set should resolve the issue:

import pandas as pd
df_test = pd.read_csv("titanic/test.csv")
df_labels = pd.read_csv("titanic/gender_submission.csv")
df_test_with_labels = pd.merge(df_test, df_labels, on=["PassengerId"])
df_test_with_labels.to_csv("titanic/test_with_labels.csv")

Using the updated evaluation set, the data checks should pass:

rime-data-format-check -tabular \
  --task "Binary Classification" \
  --ref-path data/titanic/train.csv \
  --eval-path data/titanic/test_with_labels.csv \
  --label-col-name "Survived"

Output (PASSING):


Inspecting 'data/titanic/train.csv'
Done!

Inspecting 'data/titanic/test_with_labels.csv'
Done!

WARNING: No prediction column is provided. Although you can still run RIME without predictions, it will not be as powerful as if you run it WITH predictions.


---


Your data should work with RIME!

The --pred-col-name flag operates identically to --label-col-name.

NLP

For a full list of the requirements for NLP input data, see the specification here.

These examples were run on a formatted version of the arXiv dataset, where input files contains data points with the following structure:

(in data/classification/arxiv/train.json)
[
  ...
  {
    "text": "On maxima and ladder processes for a dense class of Levy processes",
    "timestamp": "2007-05-23 00:00:00",
    "label": 4,
    "probabilities": [0.11, 0.0, 0.03, 0.01, 0.82, 0.0, 0.03, 0.0, 0.0, 0.0]
  },
  ...
]

1. Basic Usage (predictions included with inputs)

rime-data-format-check -nlp \
  --task "Text Classification" \
  --ref-path data/classification/arxiv/train.json \
  --eval-path data/classification/arxiv/val.json

Output (PASSING)

rime-data-format-check -nlp \
  --task "Text Classification" \
  --ref-path ../../../rime-data-format-check/data/classification/arxiv/train.json \
  --eval-path ../../../rime-data-format-check/data/classification/arxiv/val.json

Inspecting '../../../rime-data-format-check/data/classification/arxiv/train.json':
100%|█████████████████████████████████████████████████████████████████████████████| 50000/50000 [00:01<00:00, 33219.50it/s]

Inspecting '../../../rime-data-format-check/data/classification/arxiv/val.json':
100%|███████████████████████████████████████████████████████████████████████████| 147087/147087 [00:04<00:00, 33319.64it/s]

---


Your data should work with RIME!

2. With Predictions Provided Separately (and a sample failure)

In this example, predictions are omitted from the inputs and provided in a separate file, data/classification/arxiv/train_preds.json.

(in data/classification/arxiv/train.json)
[
  ...
  {
    "text": "On maxima and ladder processes for a dense class of Levy processes"
    "timestamp": "2007-05-23 00:00:00",
    "label": 4
  },
  ...
]

(in data/classification/arxiv/train_preds.json)
[
  ...
  {
    "probabilities": [0.11, 0.0, 0.03, 0.01, 0.82, 0.0, 0.03, 0.0, 0.0, 0.0]
  },
  ...
]

rime-data-format-check -nlp \
  --task "Text Classification" \
  --ref-path data/classification/arxiv/train.json \
  --preds-ref-path data/classification/arxiv/train_preds.json \
  --eval-path data/classification/arxiv/val.json \
  --preds-eval-path data/classification/arxiv/val_preds.json

Output (ERROR):



Inspecting 'data/classification/arxiv/train_preds.json':
  0%|                                                                             | 35/50000 [00:00<00:02, 24377.39it/s]

---

Error:

File 'data/classification/arxiv/train_preds.json', Index 35:

Key 'probabilities' error:
Or(<class 'float'>) did not validate 24
24 should be instance of 'float'

---

Inputs for task 'Text Classification' must adhere to the following structure:

{'probabilities': [<class 'float'>]}

In this case, it looks like one of our prediction entities has an invalid value of 24, an int instead of the expected float.

The element at the specified file (data/classification/arxiv/train_preds.json) and index (35) has the following values:

{
  "probabilities": [24, 0.01, 0.03, 0.03, 0.68, 0.01, 0.0, 0.0, 0.0, 0.0]
}

It appears that the first value, 24, should actually be 0.24. Adusting the value resolves the issue:


Inspecting 'data/classification/arxiv/train_preds.json':
100%|█████████████████████████████████████████████████████████████████████████████|50000/50000 [00:01<00:00, 26954.06it/s]

Inspecting 'data/classification/arxiv/val_preds.json':
100%|███████████████████████████████████████████████████████████████████████████|147087/147087 [00:05<00:00, 27215.18it/s]

Inspecting 'data/classification/arxiv/train.json':
100%|█████████████████████████████████████████████████████████████████████████████|50000/50000 [00:01<00:00, 32684.76it/s]

Inspecting 'data/classification/arxiv/val.json':
100%|███████████████████████████████████████████████████████████████████████████|147087/147087 [00:04<00:00, 32876.84it/s]

---


Your data should work with RIME!