Open Source Systems¶
etalon can be run with any open source LLM inference system. If a system does not provide OpenAI-compatible APIs, a new LLM client can be implemented to support it, as explained in Implementing New LLM Clients.
Note
The custom tokenizer corresponding to the model is fetched from the Hugging Face Hub. Make sure you have access to the model and are logged in to Hugging Face. Check Setup Hugging Face for more details.
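If you are not logged in yet, one way to do so is through the Hugging Face CLI (this assumes the huggingface_hub package is installed; setting the HF_TOKEN environment variable works as well):
huggingface-cli login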
Here we give an example with vLLM.
Launch vLLM Server¶
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123 -tp 1 --rope-scaling '{"type":"dynamic","factor":2.0}'
If a context length longer than the model's native maximum is needed, add RoPE scaling with --rope-scaling '{"type":"dynamic","factor":2.0}' as shown above. Adjust the type and factor to suit your use case.
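As a sketch of how these flags can be adjusted, the command below serves the same model across two GPUs with a larger scale factor; the -tp and factor values here are illustrative and should match your hardware and context-length requirements:
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123 -tp 2 --rope-scaling '{"type":"dynamic","factor":4.0}'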
Export API Key and URL¶
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=token-abc123
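To check that the server is reachable and the variables are set correctly, you can query the OpenAI-compatible models endpoint (assuming curl is available):
curl "$OPENAI_API_BASE/models" -H "Authorization: Bearer $OPENAI_API_KEY"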
Running Benchmark¶
The benchmark can be run as shown below:
python -m etalon.run_benchmark \
--model "meta-llama/Meta-Llama-3-8B-Instruct" \
--max-num-completed-requests 20 \
--request-interval-generator-provider "gamma" \
--request-length-generator-provider "zipf" \
--request-generator-max-tokens 8192 \
--output-dir "results"
Be sure to set the --model flag to the same model used to launch vLLM.
Note
etalon supports different generator providers for request interval and request length. For more details, refer to Configuring Request Generator Providers.
Specifying wandb args [Optional]¶
Optionally, you can specify the following arguments to log results to wandb:
--should-write-metrics \
--wandb-project Project \
--wandb-group Group \
--wandb-run-name Run
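For example, the full benchmark command with wandb logging enabled would look like the following, where Project, Group, and Run are placeholder names for your own wandb project, group, and run:
python -m etalon.run_benchmark \
--model "meta-llama/Meta-Llama-3-8B-Instruct" \
--max-num-completed-requests 20 \
--request-interval-generator-provider "gamma" \
--request-length-generator-provider "zipf" \
--request-generator-max-tokens 8192 \
--output-dir "results" \
--should-write-metrics \
--wandb-project Project \
--wandb-group Group \
--wandb-run-name Run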
Other Arguments¶
There are many more arguments for running the benchmark; run the following to see all of them:
python -m etalon.run_benchmark -h
Saving Results¶
The benchmark results are saved in the directory specified by the --output-dir argument (results in the example above).
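After the run completes, you can inspect the output directory to see what was written; the exact file names and formats depend on the etalon version:
ls -R results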