Proprietary Systems¶
etalon can benchmark the performance of LLM inference systems that are exposed as public APIs. The following sections describe how to benchmark such systems.
Note
The custom tokenizer corresponding to the model is fetched from the Hugging Face Hub. Make sure you have access to the model and are logged in to Hugging Face. See Setup Hugging Face for more details.
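If you have not authenticated yet, a minimal sketch of doing so from the shell (assuming the huggingface_hub CLI is installed; see Setup Hugging Face for the full steps):

# Log in interactively so the tokenizer for the benchmarked model can be downloaded
huggingface-cli login
# Or export a token non-interactively (token value below is a placeholder)
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx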
Export API Key and URL¶
export OPENAI_API_BASE=https://api.endpoints.anyscale.com/v1
export OPENAI_API_KEY=secret_abcdefg
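Before running a long benchmark, you can sanity-check that the endpoint accepts your credentials. A minimal sketch, assuming the target system exposes the standard OpenAI-compatible /v1/models route:

# List available models; a JSON response confirms the URL and key are valid
curl -s "$OPENAI_API_BASE/models" \
  -H "Authorization: Bearer $OPENAI_API_KEY"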
Running Benchmark¶
python -m etalon.run_benchmark \
--model "meta-llama/Meta-Llama-3-8B-Instruct" \
--max-num-completed-requests 20 \
--request-interval-generator-provider "gamma" \
--request-length-generator-provider "zipf" \
--request-generator-max-tokens 8192 \
--output-dir "results"
Be sure to update the --model flag to match the model served by the proprietary system.
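For instance, if the endpoint serves a different model, pass that model's Hugging Face identifier so the matching tokenizer is fetched (the model name below is illustrative):

--model "mistralai/Mixtral-8x7B-Instruct-v0.1"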
Note
etalon supports different generator providers for request interval and request length. For more details, refer to Configuring Request Generator Providers.
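For example, a run that samples request intervals from a Poisson process and uses fixed request lengths might look like this (a sketch assuming the "poisson" and "fixed" providers are available; see Configuring Request Generator Providers for the authoritative list):

python -m etalon.run_benchmark \
--model "meta-llama/Meta-Llama-3-8B-Instruct" \
--max-num-completed-requests 20 \
--request-interval-generator-provider "poisson" \
--request-length-generator-provider "fixed" \
--output-dir "results"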
Specifying wandb args [Optional]¶
Optionally, you can also specify the following arguments to log results to wandb:
--should-write-metrics \
--wandb-project Project \
--wandb-group Group \
--wandb-run-name Run
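Put together, a full invocation with wandb logging appends these flags to the benchmark command shown earlier (project, group, and run names are placeholders):

python -m etalon.run_benchmark \
--model "meta-llama/Meta-Llama-3-8B-Instruct" \
--max-num-completed-requests 20 \
--request-interval-generator-provider "gamma" \
--request-length-generator-provider "zipf" \
--request-generator-max-tokens 8192 \
--output-dir "results" \
--should-write-metrics \
--wandb-project Project \
--wandb-group Group \
--wandb-run-name Run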
Other Arguments¶
There are many more arguments for running the benchmark. Run the following to see them all:
python -m etalon.run_benchmark -h