Proprietary Systems

etalon can benchmark the performance of LLM inference systems that are exposed as public APIs. The following sections describe how to benchmark such systems.

Note

The tokenizer corresponding to the model is fetched from the Hugging Face Hub. Make sure you have access to the model and are logged in to Hugging Face. See Setup Hugging Face for more details.
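
If you are not already authenticated, a typical login flow looks like the following (a minimal sketch, assuming the huggingface_hub package is installed; tokens can be created at https://huggingface.co/settings/tokens):

pip install huggingface_hub
huggingface-cli login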

Export API Key and URL

export OPENAI_API_BASE=https://api.endpoints.anyscale.com/v1
export OPENAI_API_KEY=secret_abcdefg
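
Since the base URL points to an OpenAI-compatible endpoint, you can sanity-check the credentials before benchmarking (a minimal sketch; the /models route is an assumption that holds for most OpenAI-compatible APIs):

curl "$OPENAI_API_BASE/models" \
  -H "Authorization: Bearer $OPENAI_API_KEY"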

Running Benchmark

python -m etalon.run_benchmark \
  --model "meta-llama/Meta-Llama-3-8B-Instruct" \
  --max-num-completed-requests 20 \
  --request-interval-generator-provider "gamma" \
  --request-length-generator-provider "zipf" \
  --request-generator-max-tokens 8192 \
  --output-dir "results"

Be sure to update the --model flag to match the model served by the proprietary system.

Note

etalon supports different generator providers for request interval and request length. For more details, refer to Configuring Request Generator Providers.
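
For example, swapping in a Poisson arrival process with uniformly distributed request lengths would only change these two flags (the "poisson" and "uniform" provider names are assumptions; check Configuring Request Generator Providers for the exact supported values):

--request-interval-generator-provider "poisson" \
--request-length-generator-provider "uniform"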

Specifying wandb args [Optional]

Optionally, you can also specify the following arguments to log results to wandb:

--should-write-metrics \
--wandb-project Project \
--wandb-group Group \
--wandb-run-name Run
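
Put together, a benchmark run that logs to wandb might look like this (project, group, and run names are placeholders):

python -m etalon.run_benchmark \
  --model "meta-llama/Meta-Llama-3-8B-Instruct" \
  --max-num-completed-requests 20 \
  --request-interval-generator-provider "gamma" \
  --request-length-generator-provider "zipf" \
  --request-generator-max-tokens 8192 \
  --output-dir "results" \
  --should-write-metrics \
  --wandb-project Project \
  --wandb-group Group \
  --wandb-run-name Run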

Other Arguments

There are many more arguments for running the benchmark; run the following to see the full list:

python -m etalon.run_benchmark -h
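
For example, to filter the help text for wandb-related options:

python -m etalon.run_benchmark -h | grep wandb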