Phase 18 — Cheatsheet: Performance Engineering

Loop: measure (TTFT = time to first token, ITL = inter-token latency, throughput) -> find bottleneck -> turn one knob -> re-measure.
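A minimal sketch of the "measure" step, assuming you have recorded the request start time and the arrival time of each output token. The function name `latency_metrics` and the sample timestamps are illustrative, not part of vLLM.

```python
from statistics import mean


def latency_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT, mean ITL, and output tokens/s for one request.

    token_times holds the arrival time (in seconds) of each output token.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_start
    # ITL is the gap between consecutive output tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0
    total = token_times[-1] - request_start
    return {"ttft_s": ttft, "itl_s": itl, "tokens_per_s": len(token_times) / total}


# Hypothetical timestamps: first token at 0.5 s, then one token every 0.25 s.
m = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(m)
```

Per-request numbers like these are usually aggregated into percentiles across many requests before comparing runs.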
Knobs: max_num_seqs, max_num_batched_tokens, gpu_memory_utilization, chunked prefill, CUDA graphs, quant, spec decode.
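One way to keep the "turn one knob" discipline is to generate configurations that differ from a baseline in exactly one setting. The sketch below is plain Python and does not call vLLM. The baseline values are placeholders, not recommendations, and `enable_chunked_prefill` and `enforce_eager` are assumed here as the engine-argument names for chunked prefill and for disabling CUDA graphs; check them against your vLLM version.

```python
# Placeholder baseline; the values are assumptions for illustration only.
BASELINE = {
    "max_num_seqs": 256,
    "max_num_batched_tokens": 8192,
    "gpu_memory_utilization": 0.90,
    "enable_chunked_prefill": True,
    "enforce_eager": False,  # False leaves CUDA graphs enabled
}


def one_knob_variants(baseline: dict, knob: str, values: list) -> list[dict]:
    """Return copies of baseline that change only `knob`, skipping the baseline value."""
    if knob not in baseline:
        raise KeyError(knob)
    return [{**baseline, knob: v} for v in values if v != baseline[knob]]


variants = one_knob_variants(BASELINE, "max_num_seqs", [128, 256, 512])
for cfg in variants:
    changed = [k for k in cfg if cfg[k] != BASELINE[k]]
    print(changed, cfg["max_num_seqs"])
```

Each variant would then be benchmarked against the baseline under the same traffic, so any change in the metrics can be attributed to that single knob.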
Roofline: decode is typically memory-bandwidth-bound, prefill compute-bound. Little's Law: in-flight requests = request rate x mean latency.
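Both rules of thumb reduce to short arithmetic. The sketch below uses made-up hardware numbers (1e15 FLOP/s peak compute, 2e12 bytes/s memory bandwidth) and the rough assumption that, with 16-bit weights, arithmetic intensity is about one FLOP per byte for each token processed per weight read. Treat it as an illustration, not a model of any specific GPU.

```python
def in_flight_requests(rate_rps: float, mean_latency_s: float) -> float:
    """Little's Law: L = lambda * W."""
    return rate_rps * mean_latency_s


def roofline_regime(flops_per_byte: float, peak_flops: float, peak_bytes_per_s: float) -> str:
    """Classify a workload by comparing its arithmetic intensity to the ridge point."""
    ridge = peak_flops / peak_bytes_per_s  # FLOPs per byte where the two limits meet
    return "compute-bound" if flops_per_byte >= ridge else "memory-bound"


# Hypothetical accelerator; the ridge point is 1e15 / 2e12 = 500 FLOP/byte.
PEAK_FLOPS = 1e15
PEAK_BW = 2e12

# 10 requests/s at 4 s mean latency keeps about 40 requests in flight.
print(in_flight_requests(10, 4))

# Decode at batch size 1: about 1 FLOP/byte under the assumption above.
print(roofline_regime(1, PEAK_FLOPS, PEAK_BW))

# Prefill over a 2048-token prompt: about 2048 FLOP/byte.
print(roofline_regime(2048, PEAK_FLOPS, PEAK_BW))
```

Under these assumptions, raising the decode batch size moves decode toward the ridge point, which is why batching knobs such as max_num_seqs matter for throughput.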
Benchmark with warmup + steady state + identical traffic, or it's noise.
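A minimal sketch of those three conditions, assuming per-request latencies have already been collected. The helper names, the warmup count, and the prompt-length range are illustrative choices; a fixed random seed stands in for "identical traffic".

```python
import random
import statistics


def make_traffic(seed: int, n: int) -> list[int]:
    """Generate n prompt lengths deterministically so every run sees the same load."""
    rng = random.Random(seed)
    return [rng.randint(32, 2048) for _ in range(n)]


def steady_state_summary(latencies_s: list[float], warmup: int = 10) -> dict[str, float]:
    """Drop the first `warmup` samples, then report the median and p99 of the rest."""
    steady = latencies_s[warmup:]
    if len(steady) < 2:
        raise ValueError("not enough samples after warmup")
    cuts = statistics.quantiles(steady, n=100, method="inclusive")  # 99 cut points
    return {"n": len(steady), "p50_s": statistics.median(steady), "p99_s": cuts[98]}


# Same seed, same traffic: two runs are directly comparable.
assert make_traffic(0, 50) == make_traffic(0, 50)

# Hypothetical latencies: three slow warmup requests, then 100 steady ones.
samples = [5.0] * 3 + [1.0] * 100
summary = steady_state_summary(samples, warmup=3)
print(summary)
```

Without the warmup cut, the three slow samples would pull up the tail percentiles and make the run look worse than its steady state.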

Key upstream files

benchmarks/
vllm/benchmarks/
vllm/v1/metrics/
vllm/v1/metrics/stats.py
vllm/config/scheduler.py

Full reference: 00-guide.md · 01-deep-dive.md