Prefill Profiler
================

``etalon.prefill_profiler`` profiles the prefill times of an open source system and builds a prefill time predictor for a given model and system combination, as a function of input prompt length.

.. image:: ../_static/assets/yi-prefill-time-curve.png
    :align: center
    :scale: 50%

The figure above shows the prefill time curve for Yi-34B on 2 H100s. Prefill time increases quadratically with prompt length.
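The quadratic trend noted above suggests one simple way to turn profiled measurements into a predictor: fit a degree-2 polynomial to (prompt length, prefill time) pairs. The snippet below is a minimal sketch of that idea, not the predictor that ``etalon.prefill_profiler`` itself builds, which may use a different model. The measurements are synthetic, generated from made-up coefficients purely for illustration, and the names ``true_coeffs`` and ``predict`` are hypothetical.

.. code-block:: python

    import numpy as np

    # Prompt lengths (in tokens) at which prefill time was measured.
    prompt_lengths = np.array([256, 512, 1024, 2048, 4096, 8192, 16384], dtype=float)

    # Synthetic prefill times in seconds, generated from made-up coefficients
    # (a, b, c) of t = a*n^2 + b*n + c. These are NOT real measurements.
    true_coeffs = (2.0e-9, 5.0e-5, 0.01)
    prefill_times = (
        true_coeffs[0] * prompt_lengths**2
        + true_coeffs[1] * prompt_lengths
        + true_coeffs[2]
    )

    # Least-squares fit of a degree-2 polynomial to the measurements.
    coeffs = np.polyfit(prompt_lengths, prefill_times, deg=2)

    # Wrap the fitted coefficients in a callable predictor.
    predict = np.poly1d(coeffs)

    # Estimate the prefill time for a prompt length that was not profiled.
    print(f"Predicted prefill time at 3000 tokens: {predict(3000):.4f} s")

With real data, replace ``prefill_times`` with the measured values and check the fit against a held-out prompt length before relying on it.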

Launch any open source system and set up the API keys and URL as shown in :doc:`open_source_systems`.

Then run the following command:

.. code-block:: shell

    python -m etalon.prefill_profiler \
    --model "meta-llama/Meta-Llama-3-8B-Instruct" \
    --timeout 600 \
    --fixed-request-generator-decode-tokens 16 \
    --output-dir "prefill_experiments/prefill_profiler_vllm_llama-3-8b"

Adjusting Prompt Lengths
~~~~~~~~~~~~~~~~~~~~~~~~

By default, the prefill profiler profiles the following prompt lengths:

.. code-block:: python

    [256, 512, 1024, 2048, 4096, 8192, 16384]

To profile a custom set of prompt lengths, pass them to the ``--prefill-lengths`` flag as follows:

.. code-block:: shell

    python -m etalon.prefill_profiler \
    --model "meta-llama/Meta-Llama-3-8B-Instruct" \
    --timeout 600 \
    --fixed-request-generator-decode-tokens 16 \
    --output-dir "prefill_experiments/prefill_profiler_vllm_llama-3-8b" \
    --prefill-lengths 256 512 1024 2048 4096 8192 16384 32768 65536
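Both the default list and the custom list above double the prompt length at each step. For longer ranges, it can be convenient to generate the flag value rather than type it by hand. The helper below is a small sketch of that; ``doubling_lengths`` is a hypothetical name and not part of etalon.

.. code-block:: python

    def doubling_lengths(start: int, stop: int) -> list[int]:
        """Return start, 2*start, 4*start, ... up to and including stop."""
        if start <= 0:
            raise ValueError("start must be positive")
        lengths = []
        n = start
        while n <= stop:
            lengths.append(n)
            n *= 2
        return lengths

    # Build the space-separated value expected by --prefill-lengths.
    lengths = doubling_lengths(256, 65536)
    flag = "--prefill-lengths " + " ".join(str(n) for n in lengths)
    print(flag)

The printed string can be appended to the ``python -m etalon.prefill_profiler`` command in place of a hand-written list.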