Metrics used by etalon¶
etalon supports 4 conventional metrics: TTFT, TBT, TPOT and Normalized Latency.
Additionally, it introduces two new metrics, fluidity-index and fluid-token-generation-rate, to evaluate LLM inference systems.
The description of each metric is provided below:
Time to First Token (TTFT)¶
It is defined as the time between a request's arrival and the first output token generated by the system for that request. TTFT includes both scheduling delay and prompt processing time. Lower TTFT is better.
Time Between Tokens (TBT)¶
It is defined as the time between two consecutive output tokens generated by the system for a request. Lower TBT is better.
Time Per Output Token (TPOT)¶
It is defined as the total time taken to generate all output tokens divided by the number of output tokens generated. Lower TPOT is better.
Normalized Latency¶
It is defined as the total execution time of a request divided by the number of output tokens generated. It includes scheduling delay, prompt processing time and the time taken to generate all decode tokens. Lower Normalized Latency is better.
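The four conventional metrics can be computed from per-request timestamps. The sketch below is illustrative only and is not etalon's API: the function name `conventional_metrics` and its arguments are hypothetical. It also assumes that "total time taken to generate all output tokens" for TPOT means the span from the first output token to the last, which this section does not state explicitly.

```python
from typing import Any, Dict, List


def conventional_metrics(arrival_time: float, token_times: List[float]) -> Dict[str, Any]:
    """Compute TTFT, TBT, TPOT and Normalized Latency for one request.

    arrival_time: time (seconds) at which the request arrived.
    token_times:  time (seconds) at which each output token was produced, in order.
    Both names are hypothetical and used only for this illustration.
    """
    if not token_times:
        raise ValueError("at least one output token is required")

    num_tokens = len(token_times)

    # TTFT: arrival to first output token (scheduling delay + prompt processing).
    ttft = token_times[0] - arrival_time

    # TBT: one value per pair of consecutive output tokens.
    tbt = [later - earlier for earlier, later in zip(token_times, token_times[1:])]

    # TPOT: assumed here to be (last token time - first token time) / output tokens.
    tpot = (token_times[-1] - token_times[0]) / num_tokens

    # Normalized Latency: total execution time of the request / output tokens.
    normalized_latency = (token_times[-1] - arrival_time) / num_tokens

    return {
        "ttft": ttft,
        "tbt": tbt,
        "tpot": tpot,
        "normalized_latency": normalized_latency,
    }


metrics = conventional_metrics(arrival_time=0.0, token_times=[0.5, 0.6, 0.8, 0.9])
print(metrics)
```

Note that TBT yields a list of values per request, while the other three metrics yield a single value per request.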
fluidity-index¶
Note
fluidity-index is a new metric introduced by etalon to evaluate LLM inference systems. It is designed to capture the nuances of the LLM inference process and its impact on real-time user experience.
Given target prefill and decode latencies, fluidity-index is defined as the fraction of tokens that satisfy the target latencies for a given request. It accounts for slack, which is the difference between the deadline for a token and the actual time taken to generate it. If the current token is generated before its deadline, the remaining slack is carried over to subsequent tokens. Higher fluidity-index is better.
More formally, let the target prefill latency be \(D_p\) and the target decode latency be \(D_d\). Let \(D\) denote the target latency for each token, where \(D = D_p\) for the first token and \(D = D_d\) for subsequent tokens. Every token generation is characterized as a periodic task \(r_i = (t_i, d_i, s_i)\), where \(t_i\) is the arrival time of the \(i^{th}\) token, \(d_i\) is the deadline for the \(i^{th}\) token (i.e., \(t_i + D + slack_{i-1}\)) and \(s_i\) is the actual time taken to generate the \(i^{th}\) token.
If \(t_i + s_i \leq d_i\), then \(slack_{i} = slack_{i-1} + D - s_i\), else \(slack_{i} = 0\). For the first token, \(slack_{0} = 0\).
The fluidity-index is calculated as follows:
\[\text{fluidity-index} = \frac{\sum_{i=1}^{n} \mathbb{I}\{t_i + s_i \leq d_i\}}{n}\]
where \(\mathbb{I}\{t_i + s_i \leq d_i\} = 1\) if \(t_i + s_i \leq d_i\) and \(0\) otherwise, and \(n\) is the number of decode tokens generated.
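The deadline and slack recurrence can be traced with a short sketch. It follows the definitions as written in this section and is not taken from etalon's implementation. The function name `fluidity_index` and its parameters are hypothetical. Two assumptions are made: the input list holds the generation time \(s_i\) of every token with the first entry being the prefill token, and \(n\) is taken to be the length of that list.

```python
from typing import List


def fluidity_index(gen_times: List[float], target_prefill: float, target_decode: float) -> float:
    """Fraction of tokens generated within their (slack-adjusted) deadline.

    gen_times:      s_i, the actual time taken to generate each token (seconds);
                    gen_times[0] is assumed to be the first (prefill) token.
    target_prefill: D_p, target latency for the first token.
    target_decode:  D_d, target latency for every subsequent token.
    """
    if not gen_times:
        raise ValueError("at least one token is required")

    slack = 0.0  # slack_0 = 0 before the first token
    tokens_on_time = 0

    for i, s_i in enumerate(gen_times):
        target = target_prefill if i == 0 else target_decode

        # d_i = t_i + D + slack_{i-1}, so t_i + s_i <= d_i reduces to
        # s_i <= D + slack_{i-1}; the arrival time t_i cancels out.
        if s_i <= target + slack:
            tokens_on_time += 1
            slack = slack + target - s_i  # unused time carries over
        else:
            slack = 0.0  # deadline missed: accumulated slack is reset

    return tokens_on_time / len(gen_times)


# A slow third token is absorbed by slack accumulated from earlier tokens.
print(fluidity_index([0.4, 0.02, 0.08, 0.03], target_prefill=0.5, target_decode=0.05))
# A missed prefill deadline and a late final token lower the score.
print(fluidity_index([0.6, 0.04, 0.04, 0.2], target_prefill=0.5, target_decode=0.05))
```

In the first call the third token takes 0.08 s against a 0.05 s target but still meets its deadline, because the earlier tokens finished early and left slack behind.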
fluid token generation rate¶
Note
fluid token generation rate is another new metric introduced by etalon to evaluate LLM inference systems.
fluid token generation rate is defined as the maximum tokens per second an inference system can serve such that 99% of the requests achieve a fluidity-index of at least 0.9. Higher fluid token generation rate is better.
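As a rough illustration of this criterion, the sketch below picks the highest measured token rate at which at least 99% of requests reach a fluidity-index of 0.9 or more. It is not etalon's search procedure. The function names and the `measurements` mapping (token rate in tokens per second to the per-request fluidity-index values observed at that rate) are hypothetical, and how those rates are generated and measured is left outside the sketch.

```python
from typing import Dict, List, Optional


def meets_fluidity_target(
    fluidity_indices: List[float],
    min_index: float = 0.9,
    min_fraction: float = 0.99,
) -> bool:
    """True if at least `min_fraction` of requests reach `min_index`."""
    if not fluidity_indices:
        return False
    passing = sum(1 for value in fluidity_indices if value >= min_index)
    return passing >= min_fraction * len(fluidity_indices)


def fluid_token_generation_rate(measurements: Dict[float, List[float]]) -> Optional[float]:
    """Highest measured token rate that still meets the fluidity target.

    measurements: hypothetical mapping of token rate (tokens/second) to the
    fluidity-index of every request served at that rate.
    Returns None when no measured rate meets the target.
    """
    passing_rates = [
        rate for rate, indices in measurements.items() if meets_fluidity_target(indices)
    ]
    return max(passing_rates) if passing_rates else None


# Made-up numbers purely for illustration.
measurements = {
    20.0: [1.0] * 100,               # every request meets the target
    30.0: [1.0] * 199 + [0.5],       # 99.5% of requests meet the target
    40.0: [1.0] * 90 + [0.5] * 10,   # only 90% of requests meet the target
}
print(fluid_token_generation_rate(measurements))
```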