Capacity Evaluation

etalon provides a tool through capacity search to find the maximum Queries per Second (QPS) that a given model and inference system can handle given various constraints.

What is Capacity?

It is defined as maximum request load (queries-per-second) a system can sustain while meeting certain latency targets (SLOs). Higher capacity reduces the cost of serving requests and improves the user experience.

Steps to Measure Capacity

Prefill Profiler

First etalon needs to profile the prefill times of the given open source system and model combination.

Refer to Prefill Profiler for more details on how to run prefill profiler.