Metrics used by etalon
======================

``etalon`` supports four conventional metrics: TTFT, TBT, TPOT, and Normalized Latency. Additionally, it introduces two new metrics, *fluidity-index* and *fluid-token-generation-rate*, to evaluate LLM inference systems. Each metric is described below.

Time to First Token (TTFT)
^^^^^^^^^^^^^^^^^^^^^^^^^^

The time between a request's arrival and the first output token generated by the system for that request. TTFT includes both scheduling delay and prompt processing time. Lower TTFT is better.

Time Between Tokens (TBT)
^^^^^^^^^^^^^^^^^^^^^^^^^

The time between two consecutive output tokens generated by the system for a request. Lower TBT is better.

Time Per Output Token (TPOT)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The total time taken to generate all output tokens divided by the number of output tokens generated. Lower TPOT is better.

Normalized Latency
^^^^^^^^^^^^^^^^^^

The total execution time of a request divided by the number of output tokens generated. It includes scheduling delay, prompt processing time, and the time taken to generate all decode tokens. Lower Normalized Latency is better.

.. _fluidity-index:

*fluidity-index*
^^^^^^^^^^^^^^^^

.. note::

    *fluidity-index* is a new metric introduced by ``etalon`` to evaluate LLM inference systems. It is designed to capture the nuances of the LLM inference process and its impact on real-time user experience.

Given target prefill and decode latencies, *fluidity-index* is defined as the fraction of tokens that satisfy the target latencies for a given request. It accounts for slack, the difference between a token's deadline and the actual time taken to generate it: if the current token is generated before its deadline, the leftover slack is credited to subsequent tokens. Higher *fluidity-index* is better.

More formally, let the target prefill latency be :math:`D_p` and the target decode latency be :math:`D_d`.
Let :math:`D` denote the deadline for each token, where :math:`D = D_p` for the first token and :math:`D = D_d` for subsequent tokens. Each token generation is characterized as a periodic task :math:`r_i = (t_i, d_i, s_i)`, where :math:`t_i` is the arrival time of the :math:`i^{th}` token, :math:`d_i` is the deadline for the :math:`i^{th}` token (i.e., :math:`t_i + D + slack_{i-1}`), and :math:`s_i` is the actual time taken to generate the :math:`i^{th}` token. If :math:`s_i + t_i \leq d_i`, then :math:`slack_{i} = slack_{i-1} + D - s_i`; otherwise :math:`slack_{i} = 0`. The first token starts with :math:`slack_{0} = 0`.

The *fluidity-index* is calculated as follows:

.. math::

    \textit{fluidity-index} = \frac{\sum_{i=1}^{n} \mathbb{I}\{t_i + s_i \leq d_i\}}{n}

where :math:`\mathbb{I}\{t_i + s_i \leq d_i\} = 1` if :math:`t_i + s_i \leq d_i` and :math:`0` otherwise, and :math:`n` is the number of decode tokens generated.

.. _fluid-token-generation-rate:

*fluid-token-generation-rate*
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note::

    *fluid-token-generation-rate* is another new metric introduced by ``etalon`` to evaluate LLM inference systems.

*fluid-token-generation-rate* is defined as the maximum number of tokens per second an inference system can serve such that 99% of requests achieve a *fluidity-index* of at least 0.9. Higher *fluid-token-generation-rate* is better.
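As a concrete illustration, the slack-based deadline accounting behind *fluidity-index* can be sketched in Python. This is a hypothetical helper, not ``etalon``'s actual API; the function name, signature, and the choice of per-token completion timestamps as input are assumptions for the example.

.. code-block:: python

    from typing import List

    def fluidity_index(token_times: List[float], arrival: float,
                       prefill_target: float, decode_target: float) -> float:
        """Sketch of the fluidity-index computation (hypothetical helper).

        token_times: absolute completion time of each output token.
        arrival: absolute arrival time of the request.
        prefill_target / decode_target: the targets D_p and D_d.
        """
        slack = 0.0            # slack_0 = 0 for the first token
        met = 0
        prev_done = arrival
        for i, done in enumerate(token_times):
            target = prefill_target if i == 0 else decode_target  # D
            s_i = done - prev_done     # actual time to generate token i
            if s_i <= target + slack:  # t_i + s_i <= d_i = t_i + D + slack
                met += 1
                slack += target - s_i  # slack_i = slack_{i-1} + D - s_i
            else:
                slack = 0.0            # deadline missed: slack resets
            prev_done = done
        return met / len(token_times)

For example, with targets :math:`D_p = 1.0` s and :math:`D_d = 0.1` s, token completion times ``[0.5, 1.2, 1.31]`` for a request arriving at time 0 yield a *fluidity-index* of :math:`1/3`: the first token meets the prefill deadline (banking 0.5 s of slack), but the second token takes 0.7 s, exceeding its slack-extended deadline and resetting the slack, and the third token then misses its 0.1 s deadline as well.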