Lab 18-02 — Benchmark Real vLLM and Write the Tuning Report [GPU-OPT]
Lab-01's loop, with wall-clocks: run vllm bench serve against a live server,
sweep two knobs, and produce the artifact this phase exists to teach — a tuning
report: workload stated, configs compared, distributions (not means) reported,
a recommendation with its trade named. The capture below is such a report in
miniature; your deliverable is the same table for your hardware and a workload you
choose.
No GPU? Don't panic. The captured sweep below is the worked example, and the report-writing discipline is hardware-free. (You can also run the whole lab against Phase 17 lab-02's CPU backend — slower numbers, identical methodology.)
Contents
- Why this lab exists
- Requirements
- Steps
- Captured sweep (Qwen2.5-0.5B, L4, vLLM 0.22.1)
- Reading the sweep
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Benchmark numbers without methodology are advocacy, and most published LLM serving comparisons fail one of four checks you'll practice here: stated workload (QPS, prompt/output length distributions — Phase 13 taught how much one image shifts these), warm measurement (Phase 5's capture and compile excluded), distributions (p50/p99 for TTFT and ITL — the tails are the product, Phase 3 lab-05), and one knob at a time. The phase's CPU labs built every mental model this lab's numbers will land in; the remaining skill is operational care, which only practice installs.
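To make the "distributions, not means" check concrete, here is a minimal sketch of why a mean hides the tail. The `percentile` helper and the ITL samples are hypothetical illustrations, not output from `vllm bench serve`, which reports its own percentiles.

```python
import math
import statistics


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of the data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical inter-token latencies in ms: 95 quiet steps and 5 stalled ones.
itl_ms = [11.0] * 95 + [40.0] * 5

summary = {
    "mean": statistics.mean(itl_ms),
    "p50": percentile(itl_ms, 50),
    "p99": percentile(itl_ms, 99),
}
print(summary)
```

The mean sits close to the median here, while p99 lands on the stalled steps. A report that prints only the mean would describe a server no user with a stalled token ever saw.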
Requirements
uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
Steps
- Serve (one terminal):
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000
- Bench (another): sweep request rate first to find the knee, then the knobs:
vllm bench serve --backend openai-chat --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --dataset-name random --random-input-len 512 --random-output-len 128 \
  --num-prompts 200 --request-rate 8
- Re-run the same command against servers restarted with one change each: --long-prefill-token-threshold 64, then --max-num-seqs 64, then --gpu-memory-utilization 0.9. Two runs per config (eyeball variance before trusting deltas; that is Phase 5 lab-04's discipline).
- Write the report: table, distributions, the knee, one recommendation per SLO profile.
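The one-knob-at-a-time rule in the steps above can be written down as a sweep plan before anything runs. This is a sketch only: the flags are the ones named in the steps, while `KNOBS`, `sweep_plan`, and the plan layout are hypothetical names chosen for illustration. It builds the commands and does not launch them.

```python
import shlex

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
BASE_SERVE = ["vllm", "serve", MODEL, "--port", "8000"]

# One knob per config; the empty list is the baseline (server defaults).
KNOBS = {
    "baseline": [],
    "threshold=64": ["--long-prefill-token-threshold", "64"],
    "max_num_seqs=64": ["--max-num-seqs", "64"],
    "gpu_mem_util=0.9": ["--gpu-memory-utilization", "0.9"],
}


def sweep_plan(runs_per_config=2):
    """Return one entry per (config, run), each differing from baseline by at most one flag."""
    plan = []
    for name, extra in KNOBS.items():
        for run in range(1, runs_per_config + 1):
            plan.append({"config": name, "run": run, "serve_cmd": BASE_SERVE + extra})
    return plan


plan = sweep_plan()
for entry in plan:
    print(entry["config"], entry["run"], shlex.join(entry["serve_cmd"]))
```

Writing the plan first makes it hard to change two knobs by accident, and the printed commands can go straight into the report's config column.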
Captured sweep (Qwen2.5-0.5B, L4, vLLM 0.22.1)
workload: 512-in/128-out random, 200 prompts, rate 8 req/s, warm server, 2 runs each
config                        tput tok/s   TTFT p50/p99 (ms)   ITL p50/p99 (ms)
baseline (defaults)           4,310        145 / 610           11.2 / 41.8
threshold=64                  4,150        160 / 660           11.3 / 14.9        <- p99 ITL 2.8x better
max_num_seqs=64 (was 256)     4,290        150 / 1,240         11.1 / 13.6        <- queueing moved to TTFT
gpu_mem_util=0.9 (was 0.85)   4,420        144 / 600           11.2 / 40.9        <- more KV, small gain here
# rate sweep (baseline): 4 req/s p99 TTFT 210ms; 8 -> 610ms; 12 -> 4,900ms <- the knee is ~8-10
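The annotations on the right of the capture are ratios against the baseline row. A small sketch of that arithmetic, using the numbers from the capture; the `CAPTURE` layout and the function name are hypothetical.

```python
# Rows copied from the captured sweep: (throughput tok/s, TTFT p99 ms, ITL p99 ms).
CAPTURE = {
    "baseline": (4310, 610, 41.8),
    "threshold=64": (4150, 660, 14.9),
    "max_num_seqs=64": (4290, 1240, 13.6),
    "gpu_mem_util=0.9": (4420, 600, 40.9),
}


def deltas_vs_baseline(rows, baseline="baseline"):
    """Throughput change in percent, and p99 ratios as candidate / baseline (below 1 is better)."""
    base_tput, base_ttft, base_itl = rows[baseline]
    out = {}
    for name, (tput, ttft, itl) in rows.items():
        if name == baseline:
            continue
        out[name] = {
            "tput_pct": (tput / base_tput - 1) * 100,
            "ttft_p99_ratio": ttft / base_ttft,
            "itl_p99_ratio": itl / base_itl,
        }
    return out


deltas = deltas_vs_baseline(CAPTURE)
for name, d in deltas.items():
    print(f"{name:18s} tput {d['tput_pct']:+.1f}%  "
          f"TTFT p99 x{d['ttft_p99_ratio']:.2f}  ITL p99 x{d['itl_p99_ratio']:.2f}")
```

Keeping every ratio in the same direction (candidate over baseline) avoids the common report bug where one column means "times better" and the next means "times worse".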
Reading the sweep
- threshold=64: −4% throughput, ÷2.8 p99 ITL — lab-01's trade with real units, and the per-request-vs-global subtlety still applies (check max chunk concurrency before promising the cap). For a chat product this row is the recommendation; for batch summarization it's a pure loss. The workload decides; the report must say so.
- max_num_seqs=64: ITL p99 improves (fewer co-resident decodes per step) but TTFT p99 doubles: the queue moved from inside steps to in front of them. Conservation of suffering: admission knobs relocate latency between metrics; only more capacity (next row) or better efficiency actually removes it.
- gpu_mem_util=0.9: +2.5% here because this workload wasn't KV-bound (0.5B, short contexts). The same knob on a 70B at long context is the difference between serving and queueing — a knob's value is workload-conditional, which is why the report states the workload first.
- The rate sweep is the most important line: the knee (~8–10 req/s) is the capacity number every other measurement is conditional on. Benchmarking at the knee shows tradeoffs; past it, everything drowns in queueing and configs look identical (all terrible). Find the knee first, always.
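One way to find the knee mechanically is to look for the first step in the rate sweep where p99 TTFT grows much faster than the offered rate. The sketch below uses the three captured points; the `blowup` threshold of 3.0 is an arbitrary assumption for illustration, not a vLLM setting or an established constant.

```python
# (request rate in req/s, p99 TTFT in ms) from the captured baseline rate sweep.
RATE_SWEEP = [(4, 210), (8, 610), (12, 4900)]


def bracket_knee(sweep, blowup=3.0):
    """Return the first (rate_lo, rate_hi) interval where latency growth outpaces
    rate growth by more than `blowup`, or None if no interval does."""
    points = sorted(sweep)
    for (r0, l0), (r1, l1) in zip(points, points[1:]):
        latency_growth = l1 / l0
        rate_growth = r1 / r0
        if latency_growth / rate_growth > blowup:
            return (r0, r1)
    return None


print(bracket_knee(RATE_SWEEP))
```

With only three points the result is a bracket, not a number: the knee lies somewhere above 8 req/s. Adding rates between the bracket's ends (9, 10, 11) is what narrows it to the ~8 to 10 range the report states.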
Hitchhiker's notes
- vllm bench subsumes the old benchmark_serving.py scripts: datasets (random, sharegpt, sonnet), Poisson arrivals via --request-rate, and the percentile outputs this report needs. The server-side Prometheus metrics (vllm:time_to_first_token_seconds and friends) should agree with the client-side numbers minus network; when they don't, you've found front-door overhead (Phase 16 lab-02's gap measurement).
- Variance discipline scales with claim size: two runs to eyeball, five+ with a t-test before shipping a regression report someone will act on. The single most common benchmarking sin is one run per config and a conclusion from a 3% delta inside run-to-run noise.
- Profile only after the macro story is clear: this lab's table tells you which config to keep; Phase 7 lab-02's profiler tells you why a step costs what it does. Macro → micro, never the reverse — profiling an untuned config optimizes the wrong thing precisely.
- Report format matters more than it should: workload, configs, table, distributions, knee, recommendation-with-trade — one page. Decision-makers act on the page, not the runs; a perfect sweep badly reported changes nothing.
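For the "five+ with a t-test" rule in the notes above, a standard-library sketch of Welch's t-test follows. The throughput samples are hypothetical, and the critical-value table covers only df 1 to 10 at the two-sided 95% level; larger df fall back to the df=10 value, which is slightly conservative. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` does the same job with an exact p-value.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t for df = 1..10.
T_CRIT_95 = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228]


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom (unequal variances)."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(b) - statistics.mean(a)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def is_significant(a, b):
    """True if the difference in means clears the 95% threshold; df is floored to stay conservative."""
    t, df = welch_t(a, b)
    index = min(max(int(df), 1), len(T_CRIT_95)) - 1
    return abs(t) > T_CRIT_95[index]


# Hypothetical throughput (tok/s) over five runs each.
tight_base = [4310, 4290, 4335, 4300, 4320]
tight_cand = [4420, 4400, 4445, 4410, 4430]
noisy_base = [4310, 4200, 4420, 4250, 4370]
noisy_cand = [4350, 4280, 4460, 4230, 4400]

print(is_significant(tight_base, tight_cand))
print(is_significant(noisy_base, noisy_cand))
```

The second pair is the sin the notes describe: a gap under 1% between means sitting inside run-to-run spread several times larger. Identical runs (zero variance on both sides) would divide by zero here, which is itself a hint that the measurement is too coarse.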
Reflect
- Reconcile each captured row with its CPU-lab prediction: threshold (lab-01 + Phase 3 lab-05), max_num_seqs (lab-01's queueing test), mem_util (Phase 2 lab-03's blocks). Any row you couldn't have predicted within 2× deserves a note in the report — that's where your model of the system is thinnest.
- Your p99 TTFT SLO is 800 ms and traffic is 10 req/s on this hardware. What does the rate sweep say, and what are the three escape routes? (You're past the knee: more replicas, a smaller/quantized model — Phase 6 — or admission control that sheds load visibly. Tuning knobs won't move a knee much; capacity does.)
- Why benchmark with random data instead of real prompts first? (Controlled lengths isolate the knobs; then confirm with a real-trace dataset such as sharegpt, because length distributions, prefix sharing, and image tokens all shift the knee. Synthetic isolates; real validates. You need both, in that order.)
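The "more replicas" escape route in the SLO question above reduces to simple arithmetic on the rate sweep. A sketch under loud assumptions: load splits evenly across replicas, each replica behaves like the single measured server, and the highest measured rate that meets the SLO is treated as per-replica capacity. The function name is hypothetical.

```python
import math

# (request rate in req/s, p99 TTFT in ms) from the captured baseline rate sweep.
RATE_SWEEP = [(4, 210), (8, 610), (12, 4900)]


def replicas_needed(sweep, slo_ms, target_rate):
    """Replicas required so that each one stays at or below the highest measured
    rate whose p99 TTFT met the SLO."""
    passing = [rate for rate, p99 in sweep if p99 <= slo_ms]
    if not passing:
        raise ValueError("no measured rate meets the SLO")
    per_replica = max(passing)
    return math.ceil(target_rate / per_replica)


print(replicas_needed(RATE_SWEEP, slo_ms=800, target_rate=10))
```

For the 800 ms SLO at 10 req/s this gives two replicas at 5 req/s each, which falls between the measured 4 and 8 req/s points. It is an estimate to confirm with a real run, since routing skew and bursty arrivals can push one replica back toward the knee.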
References
- vllm bench serve --help and upstream/vllm/benchmarks/: the harness.
- vLLM docs, Benchmarking (official methodology notes): https://docs.vllm.ai/en/latest/contributing/benchmarks/
- Phase 3 lab-05 (the ITL story), Phase 2 lab-03 (the capacity story), Phase 5 lab-04 (warmup + variance), lab-01 (the search loop this lab runs for real).
- Dean & Barroso, The Tail at Scale — why every column here is a percentile: https://research.google/pubs/the-tail-at-scale/