Data and Model Setup

For readability, we recommend running these guides in a Jupyter notebook.

Begin by setting the RIME_PATH variable to your local path to the trial bundle you downloaded as part of installation, e.g., RIME_PATH = '/home/ec2-user/rime_trial'

import pandas as pd
RIME_PATH = 'SET THIS!'
train_df = pd.read_csv(RIME_PATH + 'examples/fraud/train.csv')
test_df = pd.read_csv(RIME_PATH + 'examples/fraud/val.csv')
label_col = "isFraud"

Make sure to split the train label data out for RIME:

def split_df(df: pd.DataFrame, label_col: str):
    labels = df[label_col]
    df = df.drop(columns=[label_col])
    return df, labels

train_df, train_labels = split_df(train_df, label_col)
test_df, test_labels = split_df(test_df, label_col)

We then load the model as well as the preprocessing helper from their corresponding pickle files:

import catboost as catb
import pickle

model = catb.CatBoostClassifier()
model.load_model(str(RIME_PATH + "examples/fraud/fraud.catb"))
with open(RIME_PATH + "examples/fraud/null_impute.pkl", "rb") as f:
    null_impute = pickle.load(f)

def preprocess(x: dict):
    """Null impute categoricals."""
    for col_name in x.keys():
        if pd.isnull(x[col_name]) and col_name in null_impute.keys():
            x[col_name] = null_impute[col_name]
    return x

def preprocess_df(df: pd.DataFrame):
    """Null impute categoricals."""
    new_df = df.copy()
    for col_name in df.columns:
        if col_name in null_impute.keys():
            new_df.loc[new_df[col_name].isnull(), col_name] = null_impute[col_name]
    return new_df

We now define the inference function we want to use. We can either define a predict_dict or predict_df function.

NOTE: We recommend using predict_df to speed up profiling and testing.

# We now define our interface.
def predict_dict(x: dict):
    """Predict dict function."""
    new_x = preprocess(x)
    new_x = pd.DataFrame(new_x, index=[0])
    return model.predict_proba(new_x)[0][1]

With this, we can directly start using the tests for the RIME Python Library! We first instantiate the data containers:

from rime.tabular import DataContainer, TabularRunContainer, ModelTask

data_container = DataContainer.from_df(train_df, model_task=ModelTask.BINARY_CLASSIFICATION, labels=train_labels)
test_data_container = DataContainer.from_df(test_df, labels=test_labels, model_task=ModelTask.BINARY_CLASSIFICATION, ref_data=data_container)
container = TabularRunContainer.from_predict_dict_function(data_container, test_data_container, predict_dict, ModelTask.BINARY_CLASSIFICATION)

Once you have done that, we can access the components we need from the container as follows:

model_wrapper = container.model.base_model
df = container.eval_data.df
labels = container.eval_data.labels

This allows us to do initial profiling on the dataset and model. Now you’re all set to run tests.