Capacity Evaluation¶
etalon
provides a tool through capacity search to find the maximum Queries per Second (QPS) that a given model and inference system can handle given various constraints.
What is Capacity?¶
It is defined as maximum request load (queries-per-second) a system can sustain while meeting certain latency targets (SLOs). Higher capacity reduces the cost of serving requests and improves the user experience.
Steps to Measure Capacity¶
Prefill Profiler¶
First etalon
needs to profile the prefill times of the given open source system and model combination.
Refer to Prefill Profiler for more details on how to run prefill profiler.
Capacity Search¶
etalon
then runs capacity search to find the maximum QPS.
Refer to Capacity Search for more details on how to run capacity search.