Smart Feature Sampling and Performance Profiling

Smart Feature Sampling

For large datasets, you can speed up RIME by running it on a subset of features. To do this, specify the number of features to test on. RIME automatically calculates feature importances both with and without a model (as long as labels or predictions are provided), and runs its suite of tests across the features it selects.

To set up smart feature sampling, specify the num_feats_to_profile option in DataProfilingInfo. The maximum number of features that RIME can be run on is 750.
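As a minimal sketch of the option described above, the following Python helper builds the num_feats_to_profile entry and rejects values above the 750-feature limit. The helper name data_profiling_info and the use of a plain dict are illustrative assumptions, not part of RIME's API; where this fragment belongs in a full configuration is not shown here.

```python
# Upper limit on features RIME can be run on, as stated in this section.
MAX_FEATS_TO_PROFILE = 750


def data_profiling_info(num_feats_to_profile: int) -> dict:
    """Build a dict holding the num_feats_to_profile option.

    Hypothetical helper for illustration; only the option name and the
    750-feature limit come from the documentation.
    """
    if not 1 <= num_feats_to_profile <= MAX_FEATS_TO_PROFILE:
        raise ValueError(
            f"num_feats_to_profile must be between 1 and {MAX_FEATS_TO_PROFILE}, "
            f"got {num_feats_to_profile}"
        )
    return {"num_feats_to_profile": num_feats_to_profile}


# Example: profile 200 features of a wide dataset.
profiling_options = data_profiling_info(200)
```

Validating the value before a run starts surfaces an out-of-range setting early, rather than partway through profiling.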

More information can be found in DataProfilingInfo Configuration.

Performance Profiling

RIME’s analysis of datasets augments your data with additional profiles and relationships. The following tables show RIME’s recommended limits for data in CSV format.

Each feature count/row pair gives the recommended maximum number of rows to run RIME on for that number of features, under the stated memory ceiling. Memory scales roughly linearly, so you can estimate the maximum advisable number of rows for your feature count by taking a ratio with the closest feature count in the table.

For ranking use cases, recommended row values should be roughly 80% of the values provided below.

If your data is in Parquet format, RIME supports 20% more rows than recommended below.

8GB Memory Ceiling Recommendations

Feature Count    Row Guidelines for Standard Production Resources
25               13,000,000
50               6,400,000
75               4,200,000
100              3,000,000
200              1,600,000
300              1,000,000
400              750,000
500              600,000
750              175,000

16GB Memory Ceiling Recommendations

Feature Count    Row Guidelines for Standard Production Resources
25               26,000,000
50               12,800,000
75               8,400,000
100              6,000,000
200              3,200,000
300              2,000,000
400              1,500,000
500              1,200,000
750              350,000

32GB Memory Ceiling Recommendations

Feature Count    Row Guidelines for Standard Production Resources
25               50,000,000
50               24,500,000
75               16,200,000
100              11,500,000
200              6,100,000
300              3,800,000
400              2,800,000
500              2,200,000
750              650,000
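
The ratio rule and the ranking and Parquet adjustments described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not RIME's own calculation: the function name estimate_max_rows is hypothetical, the ratio is read as rows × (table feature count / your feature count), a tie between two equally close feature counts resolves to the smaller one, and the 80% and 120% adjustments are assumed to compound when both apply. The row values are copied from the tables above.

```python
# Row guidelines copied from the 8GB, 16GB, and 32GB tables (CSV data).
ROW_GUIDELINES = {
    8: {25: 13_000_000, 50: 6_400_000, 75: 4_200_000, 100: 3_000_000,
        200: 1_600_000, 300: 1_000_000, 400: 750_000, 500: 600_000,
        750: 175_000},
    16: {25: 26_000_000, 50: 12_800_000, 75: 8_400_000, 100: 6_000_000,
         200: 3_200_000, 300: 2_000_000, 400: 1_500_000, 500: 1_200_000,
         750: 350_000},
    32: {25: 50_000_000, 50: 24_500_000, 75: 16_200_000, 100: 11_500_000,
         200: 6_100_000, 300: 3_800_000, 400: 2_800_000, 500: 2_200_000,
         750: 650_000},
}

MAX_FEATURES = 750  # maximum feature count stated in this document


def estimate_max_rows(num_features, memory_gb=8, ranking=False, parquet=False):
    """Estimate the advisable row count by ratio with the closest table entry."""
    if memory_gb not in ROW_GUIDELINES:
        raise ValueError(f"no table for a {memory_gb}GB memory ceiling")
    if not 1 <= num_features <= MAX_FEATURES:
        raise ValueError(f"num_features must be between 1 and {MAX_FEATURES}")

    table = ROW_GUIDELINES[memory_gb]
    # Closest tabulated feature count; ties go to the smaller count (assumption).
    closest = min(table, key=lambda f: (abs(f - num_features), f))

    # Memory scales roughly linearly, so rows * features stays about constant.
    rows = table[closest] * closest // num_features

    if ranking:
        rows = rows * 80 // 100   # ranking use cases: roughly 80% of the guideline
    if parquet:
        rows = rows * 120 // 100  # Parquet data: 20% more rows than the guideline
    return rows


print(estimate_max_rows(80))                # 3937500 (from 75 features, 8GB)
print(estimate_max_rows(80, ranking=True))  # 3150000
print(estimate_max_rows(80, parquet=True))  # 4725000
```

Integer arithmetic keeps the results exact and rounds down, which errs on the side of fewer rows. Treat the output as a starting point rather than a hard limit, since the guidelines themselves are approximate.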