Phase 18 — Performance Engineering
Contents
- Don't Panic
- Why this phase matters
- What you'll learn
- The map: where this lives in the real code
- Labs in this phase
- How to work this phase
- Where you are
Don't Panic
Now you make it FAST and prove it. This phase is the engineer's loop: measure (time to first token, or TTFT; inter-token latency, or ITL; throughput) with the right tools, find the bottleneck (CPU launch overhead? memory bandwidth? a single slow kernel?), turn the right knob (batch size, token budget, memory utilization, CUDA graphs, quantization), and re-measure. It's the meta-skill that ties phases 2–17 together.
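The "measure" step can be sketched in a few lines. This is a minimal illustration, not vLLM's or mini_vllm's metrics code: it assumes you have recorded an arrival time and one timestamp per output token for each request, and the names request_metrics and percentile are hypothetical helpers invented for this example.

```python
import math
from statistics import mean


def request_metrics(arrival: float, token_times: list[float]) -> dict[str, float]:
    """Latency metrics for one request, from token emission timestamps in seconds."""
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - arrival  # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0  # mean inter-token latency
    e2e = token_times[-1] - arrival  # end-to-end latency
    return {"ttft": ttft, "itl": itl, "e2e": e2e, "tok_per_s": len(token_times) / e2e}


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (p in 0..100); no interpolation."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# One request: arrives at t=0, first token at 0.5 s, then one token every 0.1 s.
m = request_metrics(arrival=0.0, token_times=[0.5, 0.6, 0.7, 0.8, 0.9])

# TTFTs from five requests: the tail, not the mean, is what an SLO constrains.
ttfts = [0.2, 0.5, 0.3, 1.4, 0.4]
p50, p99 = percentile(ttfts, 50), percentile(ttfts, 99)
```

With only five samples the p99 is simply the worst request, which is a reminder that tail percentiles need many requests before they mean much.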
Why this phase matters
This is the daily job of a staff inference engineer and the thing startups live or die on (cost/token). Being able to read a profile, reason with a roofline, and tune vLLM's knobs methodically is what separates senior from staff.
What you'll learn
- Metrics that matter: throughput (tok/s), TTFT, ITL/TPOT, goodput, latency percentiles
- Little's Law and how batch size, arrival rate, and latency relate
- The roofline model: compute-bound vs memory-bound; arithmetic intensity
- Profiling: the torch profiler, Nsight Systems, and vLLM's own metrics
- The knobs: max_num_seqs, max_num_batched_tokens, gpu_memory_utilization, enable_chunked_prefill, CUDA graphs, quantization, speculative decoding
- Benchmarking properly: vllm bench, warmup, steady state, fair comparisons
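Two of the ideas in this list reduce to one-line formulas, sketched below. Little's Law says the mean number of requests in flight equals arrival rate times mean latency. The roofline model says attainable compute is the smaller of peak compute and arithmetic intensity times memory bandwidth. The hardware numbers here are round hypothetical values, not any real GPU, and decode_intensity is a deliberately rough assumption: it counts only weight traffic (2 FLOPs per weight per sequence) and ignores KV-cache and activation reads.

```python
PEAK_FLOPS = 1.0e15  # hypothetical peak compute, FLOP/s
MEM_BW = 2.0e12      # hypothetical memory bandwidth, bytes/s


def in_flight(arrival_rate_rps: float, mean_latency_s: float) -> float:
    """Little's Law: L = lambda * W (mean concurrent requests)."""
    return arrival_rate_rps * mean_latency_s


def attainable_flops(intensity: float, peak: float = PEAK_FLOPS, bw: float = MEM_BW) -> float:
    """Roofline: min(compute roof, memory roof at this FLOP/byte intensity)."""
    return min(peak, intensity * bw)


def decode_intensity(batch_size: int, bytes_per_weight: float = 2.0) -> float:
    """Rough decode intensity: each weight is read once per step and does one
    multiply-add (2 FLOPs) per sequence in the batch. Weights only."""
    return 2.0 * batch_size / bytes_per_weight


# 4 requests/s at a 2.5 s mean latency keeps about 10 requests in flight.
concurrency = in_flight(4.0, 2.5)

# The ridge point is where the two roofs meet: below it you are memory-bound.
ridge = PEAK_FLOPS / MEM_BW

small = attainable_flops(decode_intensity(8))     # far left of the ridge
large = attainable_flops(decode_intensity(1024))  # right of the ridge
```

Under these assumptions a batch of 8 reaches only a small fraction of peak compute, which is the usual argument for raising batch size during decode until latency or memory stops you.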
The map: where this lives in the real code
Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md) walks through the important ones line by line.
- benchmarks/: The benchmark suite (throughput, latency, serving).
- vllm/benchmarks/: The 'vllm bench' implementation.
- vllm/v1/metrics/: The metrics/stats the engine exposes (Prometheus + logging).
- vllm/v1/metrics/stats.py: SchedulerStats / IterationStats: what's measured each step.
- vllm/config/scheduler.py: The tuning knobs and their defaults/semantics.
Labs in this phase
- lab-01-tune-the-knobs [CPU-OK]: build the full tuning loop on mini_vllm, with arrival schedules (queueing enters the course), TTFT/spike/steps metrics, and an SLO-constrained grid search that refuses impossible SLOs. Includes two measured surprises about the chunk threshold.
- lab-02-benchmark-real-vllm [GPU-OPT]: the same loop with wall-clocks: vllm bench serve sweeps, the rate-sweep knee found first, percentiles everywhere, and the one-page tuning report as the deliverable. Captured numbers included.
See labs/README.md for how to run them.
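To make "an SLO-constrained grid search that refuses impossible SLOs" concrete before you open the lab, here is a toy sketch. It is not the lab's code. The knob names mirror max_num_seqs and max_num_batched_tokens, but toy_measure is an invented cost model (step time grows linearly with the token budget, prefill takes as many steps as the budget requires), and PROMPT_LEN, Result, and grid_search are hypothetical names for this illustration only.

```python
import math
from dataclasses import dataclass
from itertools import product

PROMPT_LEN = 4096  # hypothetical prompt length in tokens


@dataclass(frozen=True)
class Result:
    max_num_seqs: int
    max_num_batched_tokens: int
    throughput: float  # tokens/s
    ttft: float        # seconds
    itl: float         # seconds


def toy_measure(max_num_seqs: int, max_num_batched_tokens: int) -> Result:
    """Invented cost model standing in for a real benchmark run."""
    step_s = 0.005 + 1e-5 * max_num_batched_tokens  # bigger budget, longer step
    ttft = math.ceil(PROMPT_LEN / max_num_batched_tokens) * step_s
    throughput = min(max_num_seqs, max_num_batched_tokens) / step_s
    return Result(max_num_seqs, max_num_batched_tokens, throughput, ttft, step_s)


def grid_search(seqs_grid, budget_grid, ttft_slo, itl_slo, measure=toy_measure) -> Result:
    """Return the highest-throughput config that meets both SLOs, or refuse."""
    results = [measure(s, b) for s, b in product(seqs_grid, budget_grid)]
    feasible = [r for r in results if r.ttft <= ttft_slo and r.itl <= itl_slo]
    if not feasible:
        raise ValueError("no configuration in the grid meets the SLO")
    return max(feasible, key=lambda r: r.throughput)


SEQS = (8, 32, 128)
BUDGETS = (512, 2048, 8192)

best = grid_search(SEQS, BUDGETS, ttft_slo=0.06, itl_slo=0.03)

try:
    grid_search(SEQS, BUDGETS, ttft_slo=0.01, itl_slo=0.03)
    refused = False
except ValueError:
    refused = True
```

In this toy model the middle token budget wins: the small budget needs too many prefill steps to meet the TTFT target, and the large one makes every step too slow. Raising an error when nothing is feasible is a design choice worth keeping in real tuning scripts, since silently returning the "least bad" config hides the fact that the SLO cannot be met on that grid.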
How to work this phase
- Read this guide for intuition.
- Read 01-deep-dive.md with the upstream/ files open.
- Do 02-mini-build.md: build the mini_vllm piece yourself.
- Run the labs, then attempt EXERCISES.md.
- Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.
Where you are
This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully-worked, line-by-line treatment (with starter/solution/test code in every lab) follows the gold standard set by the flagship phases, Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.