Phase 18 — Cheatsheet: Performance Engineering
- Loop: measure (TTFT/ITL/throughput) -> find bottleneck -> turn one knob -> re-measure.
- Knobs: max_num_seqs, max_num_batched_tokens, gpu_memory_utilization, chunked prefill, CUDA graphs, quant, spec decode.
- Roofline: decode=memory-bound, prefill=compute-bound. Little's Law (in-flight requests = request rate x mean latency) links batch/rate/latency.
- Benchmark with warmup + steady state + identical traffic, or it's noise.
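Quick arithmetic for the roofline and Little's Law bullets, as a rough sketch rather than a vLLM API. All function names here are hypothetical, the hardware peaks are made-up round numbers, and the intensity model counts only weight traffic (it ignores KV-cache and activation reads), so treat the result as a first-order estimate.

```python
def littles_law_in_flight(request_rate_per_s: float, mean_latency_s: float) -> float:
    """Little's Law: L = lambda * W (mean requests in flight at steady state)."""
    return request_rate_per_s * mean_latency_s


def weight_pass_intensity(tokens_per_step: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte for one pass over the weights.

    Assumption: each parameter costs about 2 FLOPs (multiply + add) per token
    and is read from memory once per step, so intensity grows with the number
    of tokens processed in that step.
    """
    return (2.0 * tokens_per_step) / bytes_per_param


def classify(intensity: float, peak_flops: float, peak_bytes_per_s: float) -> str:
    """Compare intensity to the roofline ridge point (peak FLOP/s / peak B/s)."""
    ridge = peak_flops / peak_bytes_per_s
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical accelerator: 1e15 FLOP/s and 2e12 B/s, so the ridge is 500 FLOP/B.
PEAK_FLOPS = 1e15
PEAK_BW = 2e12

decode = classify(weight_pass_intensity(8), PEAK_FLOPS, PEAK_BW)      # 8 seqs, 1 token each
prefill = classify(weight_pass_intensity(2048), PEAK_FLOPS, PEAK_BW)  # one 2048-token prompt
in_flight = littles_law_in_flight(10.0, 2.0)                          # 10 req/s, 2 s mean latency

print(decode, prefill, in_flight)
```

Under these assumptions, raising the decode batch (max_num_seqs) moves decode toward the ridge, and the Little's Law figure is roughly the concurrency the scheduler has to sustain at that rate and latency.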
Key upstream files
benchmarks/ · vllm/benchmarks/ · vllm/v1/metrics/ · vllm/v1/metrics/stats.py · vllm/config/scheduler.py
Full reference: 00-guide.md · 01-deep-dive.md