Phase 18 — Cheatsheet: Performance Engineering

  • Loop: measure (TTFT, ITL, throughput) -> find the bottleneck -> change one knob -> re-measure.
  • Knobs: max_num_seqs, max_num_batched_tokens, gpu_memory_utilization, chunked prefill, CUDA graphs, quantization, speculative decoding.
  • Roofline: decode is typically memory-bandwidth-bound, prefill is typically compute-bound. Little's Law links batch (requests in flight), arrival rate, and latency: L = λW.
  • Benchmark with warmup, a steady-state window, and identical traffic across runs; otherwise the numbers are noise.
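
The measuring step and Little's Law can be sketched in a few lines of plain Python. This is a minimal illustration, not vLLM code: the helper names `latency_metrics` and `littles_law_concurrency` are hypothetical, and the timestamps and rates are made-up inputs rather than measurements.

```python
from statistics import mean


def latency_metrics(request_start_ms, token_times_ms):
    """Return (ttft_ms, mean_itl_ms) for a single request.

    request_start_ms: time the request was sent.
    token_times_ms:   arrival time of each output token, ascending.
    """
    if not token_times_ms:
        raise ValueError("need at least one output token")
    # TTFT: delay between sending the request and the first output token.
    ttft = token_times_ms[0] - request_start_ms
    # ITL: gap between consecutive output tokens, averaged over the request.
    gaps = [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]
    itl = mean(gaps) if gaps else 0
    return ttft, itl


def littles_law_concurrency(arrival_rate_per_s, mean_latency_s):
    """Little's Law, L = lambda * W: mean number of requests in flight."""
    return arrival_rate_per_s * mean_latency_s


# Hypothetical request: sent at t=0 ms, tokens arrive at 250, 300, 350, 400 ms.
ttft_ms, itl_ms = latency_metrics(0, [250, 300, 350, 400])
print(f"TTFT={ttft_ms} ms, mean ITL={itl_ms} ms")

# Hypothetical load: 8 requests/s arriving, 2.0 s mean end-to-end latency.
in_flight = littles_law_concurrency(8, 2.0)
print(f"mean requests in flight: {in_flight}")
```

One way to read the second number: if the mean number of requests in flight implied by Little's Law is larger than max_num_seqs, some requests would be expected to spend time waiting in the queue, which tends to show up as higher TTFT rather than higher ITL.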

Key upstream files

  • benchmarks/
  • vllm/benchmarks/
  • vllm/v1/metrics/
  • vllm/v1/metrics/stats.py
  • vllm/config/scheduler.py

Full reference: 00-guide.md · 01-deep-dive.md