Lab 18-02 — Benchmark Real vLLM and Write the Tuning Report `[GPU-OPT]`

Lab-01's loop, with wall-clock numbers: run vllm bench serve against a live server, sweep three knobs one at a time, and produce the artifact this phase exists to teach, a tuning report: workload stated, configs compared, distributions (not means) reported, and a recommendation with its trade-off named. The capture below is such a report in miniature; your deliverable is the same table for your hardware and a workload you choose.

No GPU? Don't panic. The captured sweep below is the worked example, and the report-writing discipline is hardware-free. (You can also run the whole lab against Phase 17 lab-02's CPU backend — slower numbers, identical methodology.)

Why this lab exists
Requirements
Steps
Captured sweep (Qwen2.5-0.5B, L4, vLLM 0.22.1)
Reading the sweep
Hitchhiker's notes
Reflect
References

Why this lab exists

Benchmark numbers without methodology are advocacy. Most published LLM serving comparisons fail at least one of the four checks you'll practice here: a stated workload (QPS and prompt/output length distributions; Phase 13 showed how much a single image shifts these), warm measurement (Phase 5's capture and compile time excluded), distributions (p50/p99 for TTFT and ITL, because the tails are the product; see Phase 3 lab-05), and one knob at a time. The phase's CPU labs built every mental model this lab's numbers will land in; the remaining skill is operational care, which only practice installs.
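To make the "distributions, not means" check concrete, here is a minimal sketch of a nearest-rank percentile next to a mean. The ITL samples are hypothetical, invented only to show how two stalls vanish in the mean and dominate p99; percentile is an illustrative helper, not a vLLM function.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical inter-token latencies in ms: 98 normal steps and 2 stalls.
itl_ms = [11.0] * 98 + [40.0, 45.0]

mean_ms = sum(itl_ms) / len(itl_ms)
p50_ms = percentile(itl_ms, 50)
p99_ms = percentile(itl_ms, 99)

if __name__ == "__main__":
    print(f"mean={mean_ms:.2f} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

The mean moves by well under a millisecond while p99 nearly quadruples relative to p50, which is why the report's columns are percentiles.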

Requirements

uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct

Steps

Serve (one terminal): vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000
Bench (another): sweep request rate first to find the knee, then the knobs:

vllm bench serve --backend openai-chat --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --dataset-name random --random-input-len 512 --random-output-len 128 \
  --num-prompts 200 --request-rate 8

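Re-run the same command against servers restarted with one change each: --long-prefill-token-threshold 64, then --max-num-seqs 64, then --gpu-memory-utilization 0.9. Defaults differ across vLLM versions, so confirm your baseline's value for each flag before the run; a restart that repeats the default measures nothing. Do two runs per config so you can eyeball variance before trusting deltas (Phase 5 lab-04's discipline).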
Write the report: table, distributions, the knee, one recommendation per SLO profile.
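A rate sweep like the one in step 2 is easier to repeat when the commands are generated rather than retyped. The sketch below only prints command lines; it reuses the flags from the bench command above and passes --endpoint /v1/chat/completions on the assumption that the openai-chat backend expects the chat completions route. RATES, RUNS_PER_RATE, and bench_cmd are illustrative names, not part of vLLM.

```python
import shlex

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
RATES = [4, 8, 12]      # req/s; the same points as the captured rate sweep
RUNS_PER_RATE = 2       # two runs each, to eyeball variance

def bench_cmd(rate, base_url="http://localhost:8000"):
    """Build the argv for one `vllm bench serve` run at a given request rate."""
    return [
        "vllm", "bench", "serve",
        "--backend", "openai-chat",
        "--endpoint", "/v1/chat/completions",  # assumption: chat route
        "--base-url", base_url,
        "--model", MODEL,
        "--dataset-name", "random",
        "--random-input-len", "512",
        "--random-output-len", "128",
        "--num-prompts", "200",
        "--request-rate", str(rate),
    ]

if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe
    # to execute without a GPU or a live server.
    for rate in RATES:
        for run in range(1, RUNS_PER_RATE + 1):
            print(f"# rate={rate} run={run}")
            print(shlex.join(bench_cmd(rate)))
```

Pipe the output into a shell once the server is warm, and restart the server yourself between knob changes so each run differs from baseline by exactly one flag.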

Captured sweep (Qwen2.5-0.5B, L4, vLLM 0.22.1)

workload: 512-in/128-out random, 200 prompts, rate 8 req/s, warm server, 2 runs each
config                      tput tok/s   TTFT p50/p99 (ms)   ITL p50/p99 (ms)
baseline (defaults)            4,310        145 / 610          11.2 / 41.8
threshold=64                   4,150        160 / 660          11.3 / 14.9   <- p99 ITL 2.8x better
max_num_seqs=64 (was 256)      4,290        150 / 1,240        11.1 / 13.6   <- queueing moved to TTFT
gpu_mem_util=0.9 (was 0.85)    4,420        144 / 600          11.2 / 40.9   <- more KV, small gain here
# rate sweep (baseline): 4 req/s p99 TTFT 210ms; 8 -> 610ms; 12 -> 4,900ms  <- the knee is ~8-10
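
The deltas quoted in the report can be recomputed from the table rather than read off by eye. A minimal sketch, with the numbers copied from the capture above; ROWS and deltas are illustrative names.

```python
# (throughput tok/s, TTFT p50, TTFT p99, ITL p50, ITL p99); latencies in ms.
# Values copied from the captured sweep.
ROWS = {
    "baseline":         (4310, 145, 610, 11.2, 41.8),
    "threshold=64":     (4150, 160, 660, 11.3, 14.9),
    "max_num_seqs=64":  (4290, 150, 1240, 11.1, 13.6),
    "gpu_mem_util=0.9": (4420, 144, 600, 11.2, 40.9),
}

def deltas(name, base="baseline"):
    """Compare one config to the baseline row.

    tput_pct is a percent change; the *_x fields are multiples of the
    baseline value, so below 1.0 means lower latency than baseline.
    """
    b, r = ROWS[base], ROWS[name]
    return {
        "tput_pct": 100.0 * (r[0] - b[0]) / b[0],
        "ttft_p99_x": r[2] / b[2],
        "itl_p99_x": r[4] / b[4],
    }

if __name__ == "__main__":
    for name in ROWS:
        if name == "baseline":
            continue
        d = deltas(name)
        print(f"{name:18s} tput {d['tput_pct']:+.1f}%  "
              f"TTFT p99 x{d['ttft_p99_x']:.2f}  ITL p99 x{d['itl_p99_x']:.2f}")
```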

Reading the sweep

threshold=64: −4% throughput, ÷2.8 p99 ITL — lab-01's trade with real units, and the per-request-vs-global subtlety still applies (check max chunk concurrency before promising the cap). For a chat product this row is the recommendation; for batch summarization it's a pure loss. The workload decides; the report must say so.
max_num_seqs=64: ITL p99 improves (fewer co-resident decodes per step) but TTFT p99 doubles: the queue moved from inside steps to in front of them. Conservation of suffering: admission knobs relocate latency between metrics; only capacity (next row) or efficiency removes it.
gpu_mem_util=0.9: +2.5% here because this workload wasn't KV-bound (0.5B, short contexts). The same knob on a 70B at long context is the difference between serving and queueing — a knob's value is workload-conditional, which is why the report states the workload first.
The rate sweep is the most important line: the knee (~8–10 req/s) is the capacity number every other measurement is conditional on. Benchmarking at the knee shows tradeoffs; past it, everything drowns in queueing and configs look identical (all terrible). Find the knee first, always.
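
One rough way to locate the knee from a rate sweep is to look for the interval where p99 latency per added req/s jumps sharply. The sketch below does that for the captured sweep; the jump factor of 3 is an arbitrary assumption, and knee_interval is an illustrative helper. With only three measured rates it can bracket the knee, not pin it down.

```python
def knee_interval(sweep, jump=3.0):
    """Return the (rate_lo, rate_hi) interval whose latency slope exceeds
    `jump` times the previous interval's slope, or None if none does.

    sweep maps request rate (req/s) to p99 latency (ms).
    """
    pts = sorted(sweep.items())
    slopes = [
        (r1, r2, (y2 - y1) / (r2 - r1))
        for (r1, y1), (r2, y2) in zip(pts, pts[1:])
    ]
    for (_, _, prev), (lo, hi, cur) in zip(slopes, slopes[1:]):
        if prev > 0 and cur / prev > jump:
            return lo, hi
    return None

# req/s -> p99 TTFT in ms, copied from the captured rate sweep.
RATE_SWEEP = {4: 210, 8: 610, 12: 4900}

if __name__ == "__main__":
    print(knee_interval(RATE_SWEEP))  # (8, 12)
```

The slope goes from 100 ms per req/s between 4 and 8 to about 1,070 between 8 and 12, so the knee sits somewhere above 8. Adding measurements at 9, 10, and 11 req/s would narrow the bracket.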

Hitchhiker's notes

vllm bench subsumes the old benchmark_serving.py scripts — datasets (random, sharegpt, sonnet), Poisson arrivals via --request-rate, and the percentile outputs this report needs. The server-side Prometheus metrics (vllm:time_to_first_token_seconds and friends) should agree with the client-side numbers minus network — when they don't, you've found front-door overhead (Phase 16 lab-02's gap measurement).
Variance discipline scales with claim size: two runs to eyeball, five+ with a t-test before shipping a regression report someone will act on. The single most common benchmarking sin is one run per config and a conclusion from a 3% delta inside run-to-run noise.
Profile only after the macro story is clear: this lab's table tells you which config to keep; Phase 7 lab-02's profiler tells you why a step costs what it does. Macro → micro, never the reverse — profiling an untuned config optimizes the wrong thing precisely.
Report format matters more than it should: workload, configs, table, distributions, knee, recommendation-with-trade — one page. Decision-makers act on the page, not the runs; a perfect sweep badly reported changes nothing.
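
For the "five+ runs with a t-test" step, here is a minimal standard-library sketch of Welch's t statistic, which does not assume equal variance between configs. The per-run throughput lists are hypothetical, invented only to exercise the function; welch_t is an illustrative helper. It returns the statistic and degrees of freedom only; for a p-value, look the pair up in a t table or, if SciPy is installed, call scipy.stats.ttest_ind(a, b, equal_var=False).

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples (mean of b minus mean of a)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error of the difference
    t = (statistics.mean(b) - statistics.mean(a)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical throughput (tok/s) from five runs per config.
BASE = [4310, 4280, 4350, 4300, 4330]
CAND = [4420, 4390, 4450, 4400, 4440]

if __name__ == "__main__":
    t, df = welch_t(BASE, CAND)
    print(f"t={t:.2f} df={df:.2f}")
```

With roughly 8 degrees of freedom the two-sided 5% critical value is about 2.31, so a t near 6 would be a real difference, while a 3% delta with t under 2 is still inside the noise.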

Reflect

Reconcile each captured row with its CPU-lab prediction: threshold (lab-01 + Phase 3 lab-05), max_num_seqs (lab-01's queueing test), mem_util (Phase 2 lab-03's blocks). Any row you couldn't have predicted within 2× deserves a note in the report — that's where your model of the system is thinnest.
Your p99 TTFT SLO is 800 ms and traffic is 10 req/s on this hardware. What does the rate sweep say, and what are the three escape routes? (The sweep measured 610 ms at 8 req/s and 4,900 ms at 12, so at 10 req/s you are at or past the knee with no headroom. The routes: more replicas; a smaller or quantized model (Phase 6); or admission control that sheds load visibly. Tuning knobs won't move a knee much; capacity does.)
Why benchmark with random data instead of real prompts first? (Controlled lengths isolate the knobs; then confirm with a real-trace dataset — sharegpt — because length distributions, prefix sharing, and image tokens all shift the knee. Synthetic isolates; real validates. You need both, in that order.)
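
The "more replicas" route from the SLO question above reduces to a small calculation once the rate sweep exists. A minimal sketch, assuming traffic is split evenly across identical replicas and that latency does not improve again at higher rates; replicas_needed is an illustrative helper.

```python
import math

def replicas_needed(traffic_rps, sweep, slo_ms):
    """Return (replica_count, safe_rate): the smallest replica count that
    keeps per-replica load at or below the highest measured rate whose
    p99 latency met the SLO."""
    ok = [rate for rate, p99 in sweep.items() if p99 <= slo_ms]
    if not ok:
        raise ValueError("no measured rate meets the SLO")
    safe_rate = max(ok)
    return math.ceil(traffic_rps / safe_rate), safe_rate

# req/s -> p99 TTFT in ms, copied from the captured rate sweep.
RATE_SWEEP = {4: 210, 8: 610, 12: 4900}

if __name__ == "__main__":
    print(replicas_needed(10, RATE_SWEEP, slo_ms=800))  # (2, 8)
```

Two replicas at 5 req/s each sit well under the 8 req/s point that met the SLO. The estimate is only as fine-grained as the sweep, so measure the per-replica rate you actually plan to run before committing to it.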

References

vllm bench serve --help and upstream/vllm/benchmarks/ — the harness.
vLLM docs, Benchmarking — official methodology notes: https://docs.vllm.ai/en/latest/contributing/benchmarks/
Phase 3 lab-05 (the ITL story), Phase 2 lab-03 (the capacity story), Phase 5 lab-04 (warmup + variance), lab-01 (the search loop this lab runs for real).
Dean & Barroso, The Tail at Scale — why every column here is a percentile: https://research.google/pubs/the-tail-at-scale/

vLLM Mastery — From Zero to Maintainer