Phase 18 — Deep Dive: Performance Engineering

Read this with upstream/ open. Every path is relative to upstream/ at the pinned commit v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number ever drifts, search for the named symbol instead.

Guided reading list

Work through these in order. This is a scaffold: the reading targets and the questions are real, and the line-by-line annotations are yours to fill in as you go. That is exactly the muscle a maintainer uses: reading unfamiliar code and extracting its contract.

  1. benchmarks/ — The benchmark suite (throughput, latency, serving).
    • Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
  2. vllm/benchmarks/ — The 'vllm bench' implementation.
    • Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
  3. vllm/v1/metrics/ — The metrics/stats the engine exposes (Prometheus + logging).
    • Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
  4. vllm/v1/metrics/stats.py — SchedulerStats / IterationStats: what's measured each step.
    • Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
  5. vllm/config/scheduler.py — The tuning knobs and their defaults/semantics.
    • Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.

Questions to answer as you read

  • Which metrics matter, and what does each one measure: throughput (tok/s), TTFT, ITL/TPOT, goodput, latency percentiles?
  • What does Little's Law say, and how does it relate batch size, arrival rate, and latency?
  • In the roofline model, what separates compute-bound from memory-bound work, and where does arithmetic intensity come in?
  • How do you profile, and what does each tool show: the torch profiler, Nsight Systems, and vLLM's own metrics?
  • What does each knob trade off: max_num_seqs, max_num_batched_tokens, gpu_memory_utilization, enable_chunked_prefill, CUDA graphs, quantization, speculative decoding?
  • How do you benchmark properly with vllm bench: warmup, steady state, fair comparisons?
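
Several of the questions above reduce to short arithmetic, so it can help to have the definitions in executable form before reading the upstream code. The sketch below uses the common textbook definitions of TTFT, ITL/TPOT, nearest-rank percentiles, Little's Law (L = λW), and the roofline ridge point. It is not taken from vllm/benchmarks/ or vllm/v1/metrics/; every function name here is hypothetical, and the hardware figures in the usage note are made-up round numbers. Compare it against what upstream actually computes as you read.

```python
import math


def ttft(request_start: float, token_times: list[float]) -> float:
    """Time to first token: first token arrival minus request start (seconds)."""
    return token_times[0] - request_start


def inter_token_latencies(token_times: list[float]) -> list[float]:
    """ITL: the gap between each pair of consecutive output tokens."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]


def tpot(token_times: list[float]) -> float:
    """Time per output token: mean ITL, excluding the first token."""
    gaps = inter_token_latencies(token_times)
    return sum(gaps) / len(gaps) if gaps else 0.0


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (one common convention; others interpolate)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def littles_law_concurrency(arrival_rate_rps: float, mean_latency_s: float) -> float:
    """Little's Law: mean requests in the system L = arrival rate * mean latency."""
    return arrival_rate_rps * mean_latency_s


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def is_compute_bound(intensity: float, peak_flops: float, peak_bandwidth: float) -> bool:
    """Roofline: compute-bound once intensity reaches the ridge point
    peak_flops / peak_bandwidth; below it the kernel is memory-bound."""
    return intensity >= peak_flops / peak_bandwidth


if __name__ == "__main__":
    # One request: sent at t=0.0, output tokens arrive at these timestamps.
    times = [0.5, 0.75, 1.0, 1.5]
    print("TTFT:", ttft(0.0, times))
    print("TPOT:", tpot(times))
    print("p99 ITL:", percentile(inter_token_latencies(times), 99))

    # 4 requests/s arriving, each taking 2.5 s on average.
    print("mean in-flight requests:", littles_law_concurrency(4.0, 2.5))

    # Hypothetical accelerator: 1e15 FLOP/s peak, 2e12 B/s memory bandwidth.
    print("compute-bound at intensity 1?", is_compute_bound(1.0, 1e15, 2e12))
```

With the hypothetical accelerator above, the ridge point is 500 FLOP/byte, so a kernel that does roughly one FLOP per byte moved sits far on the memory-bound side. Batching more sequences reuses each weight byte for more FLOPs and raises intensity, which is one way to reason about why max_num_seqs and max_num_batched_tokens affect throughput.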

Cross-references