Lab 18-02 — Benchmark Real vLLM and Write the Tuning Report [GPU-OPT]

Lab-01's loop, with wall-clocks: run vllm bench serve against a live server, sweep two knobs, and produce the artifact this phase exists to teach — a tuning report: workload stated, configs compared, distributions (not means) reported, a recommendation with its trade named. The capture below is such a report in miniature; your deliverable is the same table for your hardware and a workload you choose.

No GPU? Don't panic. The captured sweep below is the worked example, and the report-writing discipline is hardware-free. (You can also run the whole lab against Phase 17 lab-02's CPU backend — slower numbers, identical methodology.)

Why this lab exists

Benchmark numbers without methodology are advocacy, and most published LLM serving comparisons fail at least one of the four checks you'll practice here: a stated workload (QPS, prompt/output length distributions; Phase 13 taught how much one image shifts these), warm measurement (Phase 5's capture and compile excluded), distributions (p50/p99 for TTFT and ITL; the tails are the product, Phase 3 lab-05), and one knob at a time. The phase's CPU labs built every mental model this lab's numbers will land in; the remaining skill is operational care, which only practice installs.

Requirements

uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct

Steps

  1. Serve (one terminal): vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000
  2. Bench (another): sweep request rate first to find the knee, then the knobs:
vllm bench serve --backend openai-chat --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --dataset-name random --random-input-len 512 --random-output-len 128 \
  --num-prompts 200 --request-rate 8
  3. Re-run the same command against servers restarted with one change each: --long-prefill-token-threshold 64, then --max-num-seqs 64, then --gpu-memory-utilization 0.9. Two runs per config (eyeball variance before trusting deltas; Phase 5 lab-04's discipline).
  4. Write the report: table, distributions, the knee, one recommendation per SLO profile.
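
The rate sweep in step 2 is easy to get wrong by hand (a changed flag between runs silently breaks the comparison), so it can help to generate the commands. The following is a minimal sketch, not part of the lab's tooling: `bench_command` and `sweep_commands` are hypothetical helper names, the flags are exactly those in the command above, and the rates 4, 8, 12 are the ones used in the captured rate sweep. It only prints the commands; running them is left to you.

```python
import shlex

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"


def bench_command(request_rate, base_url="http://localhost:8000",
                  num_prompts=200, extra_args=()):
    """Build the argv for one `vllm bench serve` run at a given request rate.

    Only flags that appear in the lab's own command are used; anything else
    (hypothetical or version-specific) goes through `extra_args`.
    """
    return [
        "vllm", "bench", "serve",
        "--backend", "openai-chat",
        "--base-url", base_url,
        "--model", MODEL,
        "--dataset-name", "random",
        "--random-input-len", "512",
        "--random-output-len", "128",
        "--num-prompts", str(num_prompts),
        "--request-rate", str(request_rate),
        *extra_args,
    ]


def sweep_commands(rates=(4, 8, 12), runs_per_config=2):
    """Yield (rate, run_index, argv): every rate, repeated runs_per_config times."""
    for rate in rates:
        for run in range(runs_per_config):
            yield rate, run, bench_command(rate)


if __name__ == "__main__":
    for rate, run, argv in sweep_commands():
        print(f"# rate={rate} run={run}")
        print(shlex.join(argv))
```

Keeping the workload flags in one function means the only thing that varies between runs is the argument you pass in, which is the "one knob at a time" rule expressed in code.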

Captured sweep (Qwen2.5-0.5B, L4, vLLM 0.22.1)

workload: 512-in/128-out random, 200 prompts, rate 8 req/s, warm server, 2 runs each
config                      tput tok/s   TTFT p50/p99 (ms)   ITL p50/p99 (ms)
baseline (defaults)            4,310        145 / 610          11.2 / 41.8
threshold=64                   4,150        160 / 660          11.3 / 14.9   <- p99 ITL 2.8x better
max_num_seqs=64 (was 256)      4,290        150 / 1,240        11.1 / 13.6   <- queueing moved to TTFT
gpu_mem_util=0.9 (was 0.85)    4,420        144 / 600          11.2 / 40.9   <- more KV, small gain here
# rate sweep (baseline): 4 req/s p99 TTFT 210ms; 8 -> 610ms; 12 -> 4,900ms  <- the knee is ~8-10
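
The annotations on the right of the table are ratios against the baseline row, and it is worth computing them rather than eyeballing them. A small sketch, using only the numbers from the captured table; the `Row` class and `compare` function are illustrative names, not part of any vLLM tooling:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    name: str
    tput: float      # throughput, tok/s
    ttft_p50: float  # ms
    ttft_p99: float  # ms
    itl_p50: float   # ms
    itl_p99: float   # ms


# Values copied from the captured sweep above.
BASELINE = Row("baseline", 4310, 145, 610, 11.2, 41.8)
CONFIGS = [
    Row("threshold=64", 4150, 160, 660, 11.3, 14.9),
    Row("max_num_seqs=64", 4290, 150, 1240, 11.1, 13.6),
    Row("gpu_mem_util=0.9", 4420, 144, 600, 11.2, 40.9),
]


def compare(row, base=BASELINE):
    """Express one config relative to the baseline row."""
    return {
        "tput_pct": 100.0 * (row.tput - base.tput) / base.tput,
        "ttft_p99_ratio": row.ttft_p99 / base.ttft_p99,
        "itl_p99_ratio": row.itl_p99 / base.itl_p99,
    }


if __name__ == "__main__":
    for row in CONFIGS:
        d = compare(row)
        print(f"{row.name:18s} tput {d['tput_pct']:+.1f}%  "
              f"TTFT p99 x{d['ttft_p99_ratio']:.2f}  "
              f"ITL p99 x{d['itl_p99_ratio']:.2f}")
```

For your own report, swap in your measured rows and keep the same three derived columns, so every claim in the prose can be traced back to a number in the table.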

Reading the sweep

  • threshold=64: −4% throughput, ÷2.8 p99 ITL — lab-01's trade with real units, and the per-request-vs-global subtlety still applies (check max chunk concurrency before promising the cap). For a chat product this row is the recommendation; for batch summarization it's a pure loss. The workload decides; the report must say so.
  • max_num_seqs=64: ITL p99 improves (fewer co-resident decodes per step) but TTFT p99 doubles: the queue moved from inside steps to in front of them. Conservation of suffering: admission knobs relocate latency between metrics; only more capacity (next row) or better efficiency reduces the total.
  • gpu_mem_util=0.9: +2.5% here because this workload wasn't KV-bound (0.5B, short contexts). The same knob on a 70B at long context is the difference between serving and queueing — a knob's value is workload-conditional, which is why the report states the workload first.
  • The rate sweep is the most important line: the knee (~8–10 req/s) is the capacity number every other measurement is conditional on. Benchmarking at the knee shows tradeoffs; past it, everything drowns in queueing and configs look identical (all terrible). Find the knee first, always.
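
One way to make "find the knee" mechanical is to look for the first step in the rate sweep where p99 TTFT blows up out of proportion. The sketch below is a heuristic under stated assumptions, not a standard definition: the `blowup` threshold of 4x is an arbitrary illustrative choice, and `knee_bracket` and `max_rate_within_slo` are hypothetical helper names. The data points are the three from the captured rate sweep.

```python
# p99 TTFT (ms) by request rate (req/s), from the captured baseline rate sweep.
RATE_SWEEP = {4: 210, 8: 610, 12: 4900}


def knee_bracket(sweep, blowup=4.0):
    """Return the first adjacent (lo_rate, hi_rate) pair where p99 grows by
    more than `blowup`x, or None if no step is that steep.

    The knee lies somewhere inside the returned bracket; a finer sweep
    between the two rates narrows it down.
    """
    points = sorted(sweep.items())
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        if p1 / p0 > blowup:
            return (r0, r1)
    return None


def max_rate_within_slo(sweep, slo_ms):
    """Highest measured rate whose p99 meets the SLO, or None if none does."""
    ok = [rate for rate, p99 in sweep.items() if p99 <= slo_ms]
    return max(ok) if ok else None


if __name__ == "__main__":
    print("knee bracket:", knee_bracket(RATE_SWEEP))
    print("max rate at 800 ms SLO:", max_rate_within_slo(RATE_SWEEP, 800))
```

With only three points the bracket is coarse (8 to 12 req/s), which is consistent with the report's reading of roughly 8 to 10; adding runs at 9, 10, and 11 req/s would tighten it. Note that `max_rate_within_slo` deliberately reports only measured rates and does not interpolate, since latency near the knee is far from linear.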

Hitchhiker's notes

  • vllm bench subsumes the old benchmark_serving.py scripts — datasets (random, sharegpt, sonnet), Poisson arrivals via --request-rate, and the percentile outputs this report needs. The server-side Prometheus metrics (vllm:time_to_first_token_seconds and friends) should agree with the client-side numbers minus network — when they don't, you've found front-door overhead (Phase 16 lab-02's gap measurement).
  • Variance discipline scales with claim size: two runs to eyeball, five+ with a t-test before shipping a regression report someone will act on. The single most common benchmarking sin is one run per config and a conclusion from a 3% delta inside run-to-run noise.
  • Profile only after the macro story is clear: this lab's table tells you which config to keep; Phase 7 lab-02's profiler tells you why a step costs what it does. Macro → micro, never the reverse — profiling an untuned config optimizes the wrong thing precisely.
  • Report format matters more than it should: workload, configs, table, distributions, knee, recommendation-with-trade — one page. Decision-makers act on the page, not the runs; a perfect sweep badly reported changes nothing.
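
For the "five+ with a t-test" case in the variance note, Welch's t-test is a reasonable default because it does not assume the two configs have equal run-to-run variance. This is a minimal standard-library sketch that returns the t statistic and the Welch-Satterthwaite degrees of freedom; it stops short of a p-value, which needs a t-distribution table or a statistics package. The function name `welch_t` and the sample values in the demo are illustrative only, not measurements.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for samples a and b.

    Positive t means b's mean is higher than a's. Each sample needs at
    least two runs so that a sample variance exists.
    """
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("need at least two runs per config")
    va = variance(a) / na  # squared standard error of mean(a)
    vb = variance(b) / nb  # squared standard error of mean(b)
    if va + vb == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(b) - mean(a)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


if __name__ == "__main__":
    # Illustrative placeholder values, NOT measurements from this lab.
    runs_a = [1.0, 2.0, 3.0, 4.0, 5.0]
    runs_b = [2.0, 3.0, 4.0, 5.0, 6.0]
    t, df = welch_t(runs_a, runs_b)
    print(f"t = {t:.3f}, df = {df:.1f}")
```

Compare |t| against the two-sided critical value for the computed degrees of freedom (about 2.31 at the 5% level for df = 8). A delta whose |t| falls below that is inside run-to-run noise and should not drive a recommendation.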

Reflect

  • Reconcile each captured row with its CPU-lab prediction: threshold (lab-01 + Phase 3 lab-05), max_num_seqs (lab-01's queueing test), mem_util (Phase 2 lab-03's blocks). Any row you couldn't have predicted within 2× deserves a note in the report — that's where your model of the system is thinnest.
  • Your p99 TTFT SLO is 800 ms and traffic is 10 req/s on this hardware. What does the rate sweep say, and what are the three escape routes? (You're at or past the knee: 8 req/s already gives 610 ms and 12 req/s gives 4,900 ms, so 800 ms at 10 req/s has little or no headroom. The routes are more replicas, a smaller or quantized model (Phase 6), or admission control that sheds load visibly. Tuning knobs won't move a knee much; capacity does.)
  • Why benchmark with random data instead of real prompts first? (Controlled lengths isolate the knobs; then confirm with a real-trace dataset — sharegpt — because length distributions, prefix sharing, and image tokens all shift the knee. Synthetic isolates; real validates. You need both, in that order.)

References