How to use etalon

etalon can evaluate LLM inference systems as a black box and also determine the serving capacity of an LLM inference system. etalon provides two evaluation recipes, described below:
Black-box Evaluation: etalon hits an LLM inference server, exposed through an API endpoint, with a set of requests of varying prompt lengths and tracks when each output token is generated. This allows etalon to calculate several metrics such as TTFT, TBT, TPOT, and fluidity-index. Using the fluidity-index metric, etalon also infers the minimum TBT required to meet a target acceptance-rate threshold (for example, 99% of requests should have an acceptance rate of at least 90%) and derives the fluid-token-generation-rate metric.

Capacity Evaluation: When deploying an LLM inference system, the operator needs to know how many requests the system can serve. This helps the operator determine the system's configuration, for example the number of GPUs needed, to meet certain service-quality requirements. To help with this process, etalon provides a capacity evaluation module that determines the maximum capacity each replica can provide under different request loads while meeting target SLO requirements.
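The per-token timestamps collected during black-box evaluation are enough to derive latency metrics like TTFT, TBT, and TPOT. The following is a minimal illustrative sketch of those calculations; the function and field names are assumptions for this example, not part of etalon's API:

```python
import statistics

def token_latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Derive latency metrics from per-token arrival timestamps.

    request_start: wall-clock time the request was sent.
    token_times:   wall-clock time each output token was observed (sorted).
    """
    ttft = token_times[0] - request_start  # time to first token
    # time between consecutive tokens (TBT); empty for single-token outputs
    tbts = [b - a for a, b in zip(token_times, token_times[1:])]
    # TPOT: mean time per output token after the first one
    tpot = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
    return {
        "ttft": ttft,
        "tbt_mean": statistics.mean(tbts) if tbts else 0.0,
        "tbt_max": max(tbts) if tbts else 0.0,
        "tpot": tpot,
    }

# Example: request sent at t=0.0 s, tokens observed at the times below.
m = token_latency_metrics(0.0, [0.50, 0.60, 0.75, 0.85])
```

Here TTFT is 0.5 s (first token), and TPOT averages the remaining inter-token gaps; fluidity-index builds on the same per-token timeline, as described in Metrics.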
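Conceptually, capacity evaluation searches for the highest sustainable request load that still meets the target SLO. One way to sketch that search, assuming SLO compliance is monotone in load (`meets_slo` is a hypothetical probe standing in for an actual benchmark run, not an etalon function):

```python
def find_max_capacity(meets_slo, lo_qps: float, hi_qps: float, tol: float = 0.1) -> float:
    """Binary-search the highest request rate (QPS) at which the system
    still meets its SLO. Assumes meets_slo(qps) -> bool is monotone:
    once the SLO is violated, all higher loads also violate it.
    """
    while hi_qps - lo_qps > tol:
        mid = (lo_qps + hi_qps) / 2
        if meets_slo(mid):
            lo_qps = mid   # SLO held; try a higher load
        else:
            hi_qps = mid   # SLO violated; back off
    return lo_qps

# Toy stand-in: pretend the replica holds its SLO up to 12.5 QPS.
cap = find_max_capacity(lambda qps: qps <= 12.5, lo_qps=0.0, hi_qps=32.0)
```

In a real run, each probe would replay a request trace at the given rate and check the measured metrics against the SLO, which is far more expensive than this toy predicate suggests.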
The description of each metric used in black-box evaluation is provided in Metrics.
Check out the following resources to learn more: