Debugging the Python SDK

Troubleshooting Stress Tests

This document includes a few common failure scenarios and recommended debugging techniques.

Lost Job Object

When you close your Python notebook or scripting session, you lose access to ephemeral in-memory objects such as Job. To recover these objects, create a new client connected to the same backend service.

from rime_sdk import Client

rime_client = Client("my_vpc.rime.com", api_key="api-key")

After connecting, use rime_client.list_stress_testing_jobs() to query the server for a list of jobs from the past two days. You can filter the list by status and project ID to reduce the number of jobs returned. Call get_status() on each job; the return value includes the job's start time and status, which should help you identify the one you started.

jobs = rime_client.list_stress_testing_jobs(
    status_filters=['succeeded', 'failed'],
    project_id="bar",
)
# Print the metadata for each job to see which one you started most recently.
for job in jobs:
    print(job.get_status())
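
When the list is long, sorting by start time can narrow the search. The following is a minimal sketch, not part of the SDK: it assumes get_status() returns a dict and that the start time is stored under a key named "start_time" as an ISO 8601 string. Both the key name and the format are assumptions, so inspect the real return value first and adjust. StubJob is a hypothetical stand-in that lets the snippet run without a cluster.

```python
from datetime import datetime


def most_recent_job(jobs, time_key="start_time"):
    """Return the job whose status reports the latest start time, or None.

    `time_key` is an assumed field name; change it to match the actual
    structure returned by get_status().
    """
    dated = []
    for job in jobs:
        status = job.get_status()
        raw = status.get(time_key)
        if raw is None:
            continue  # Skip jobs that do not report a start time.
        dated.append((datetime.fromisoformat(raw), job))
    if not dated:
        return None
    return max(dated, key=lambda pair: pair[0])[1]


class StubJob:
    """Hypothetical stand-in for a Job object, for illustration only."""

    def __init__(self, job_id, start_time):
        self.job_id = job_id
        self._start_time = start_time

    def get_status(self):
        return {"status": "succeeded", "start_time": self._start_time}


stub_jobs = [
    StubJob("a", "2023-01-01T09:00:00"),
    StubJob("b", "2023-01-02T14:30:00"),
]
latest = most_recent_job(stub_jobs)
print(latest.job_id)
```

With real jobs, you would pass the list returned by list_stress_testing_jobs() instead of stub_jobs.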

Test Run Results Don’t Show Up in UI

This usually indicates that the Job executing the suite of stress tests failed along the way. Possible causes include, but are not limited to:

  • Misspecified test_run_config, dataset, or model.

  • Resource limits exceeded.

  • Managed image doesn’t include the model’s required dependencies.

  • SDK version doesn’t match the version of the cluster.
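
For the version-mismatch case, it can help to confirm which SDK version is installed in your environment before comparing it with the cluster. The sketch below uses only the Python standard library. The distribution name "rime-sdk" is an assumption, the cluster version is a placeholder you must supply yourself, and comparing only the major and minor components is an illustrative rule, not a documented compatibility policy.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def same_major_minor(a, b):
    """Compare only the first two dot-separated components of two versions."""
    return a.split(".")[:2] == b.split(".")[:2]


# "rime-sdk" is an assumed distribution name; adjust it if yours differs.
sdk_version = installed_version("rime-sdk")
cluster_version = "0.0.0"  # Placeholder: replace with your cluster's version.

if sdk_version is None:
    print("SDK distribution not found in this environment")
elif not same_major_minor(sdk_version, cluster_version):
    print(f"SDK {sdk_version} may not match cluster {cluster_version}")
else:
    print(f"SDK {sdk_version} matches cluster {cluster_version} (major.minor)")
```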

The best place to start is the get_status() method of the Job object. When the job status is 'FAILING' and the verbose flag is set to True, get_status() dumps the logs to stdout, which is especially helpful for configuration issues.

# Assume the job's status is 'FAILING'.
# With verbose=True, any available logs are dumped to stdout.
status = job.get_status(verbose=True, wait_until_finish=True)

The logs often point directly to the cause, such as a misspecified configuration or a missing dependency.
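
Because the logs go to stdout, long output can be hard to search in a notebook. One option, sketched here with the standard library only, is to capture that output into a string and filter it. This assumes the logs are written through Python's sys.stdout; output written by a subprocess or at the file-descriptor level would not be captured this way. StubJob and its log line are hypothetical placeholders so the snippet runs on its own.

```python
import contextlib
import io


def capture_status_output(job):
    """Call get_status(verbose=True) and return (status, captured stdout)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        status = job.get_status(verbose=True, wait_until_finish=True)
    return status, buffer.getvalue()


class StubJob:
    """Hypothetical stand-in for a failing Job, for illustration only."""

    def get_status(self, verbose=False, wait_until_finish=False):
        if verbose:
            print("ERROR: hypothetical log line")
        return {"status": "FAILING"}


status, logs = capture_status_output(StubJob())
# Keep only the lines that mention an error.
error_lines = [line for line in logs.splitlines() if "ERROR" in line]
print(error_lines)
```

The captured text can also be written to a file and attached to a support request.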

If you cannot make further progress with debugging, please contact RI support.