Phase 18 — Performance Engineering


Don't Panic
Why this phase matters
What you'll learn
The map: where this lives in the real code
Labs in this phase
How to work this phase
Where you are

Don't Panic

Now you make it FAST and prove it. This phase is the engineer's loop: measure (time to first token, inter-token latency, throughput) with the right tools; find the bottleneck (CPU launch overhead? memory bandwidth? a single kernel?); turn the right knob (batch size, token budget, memory utilization, CUDA graphs, quantization); then re-measure. It's the meta-skill that ties Phases 2–17 together.
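
To make the "measure" step concrete, here is a minimal sketch of how TTFT, ITL, and throughput fall out of per-token timestamps. The function names and the sample timestamps are illustrative assumptions, not vLLM's API; `vllm bench` computes its own versions of these numbers.

```python
from statistics import mean

def request_metrics(arrival_s, token_times_s):
    """Per-request latency metrics from wall-clock timestamps (seconds)."""
    if not token_times_s:
        raise ValueError("need at least one output token")
    # TTFT: time from arrival until the first output token appears.
    ttft = token_times_s[0] - arrival_s
    # ITL: mean gap between consecutive output tokens (0.0 for a single token).
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    itl = mean(gaps) if gaps else 0.0
    # End-to-end latency: arrival until the last token.
    e2e = token_times_s[-1] - arrival_s
    return {"ttft": ttft, "itl": itl, "e2e": e2e, "tokens": len(token_times_s)}

def throughput_tok_s(requests):
    """Output tokens per second over the whole run window (first arrival to last token)."""
    start = min(arrival for arrival, _ in requests)
    end = max(times[-1] for _, times in requests)
    total = sum(len(times) for _, times in requests)
    return total / (end - start)

# Made-up timestamps: (arrival time, [time each output token was emitted]).
requests = [
    (0.0, [0.5, 0.75, 1.0, 1.25]),
    (0.5, [1.0, 1.5, 2.0]),
]
m0 = request_metrics(*requests[0])
print(m0["ttft"], m0["itl"], throughput_tok_s(requests))  # 0.5 0.25 3.5
```

Note that throughput is a property of the whole run while TTFT and ITL are per-request, which is why a change that raises tok/s can still hurt individual users.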

Why this phase matters

This is the daily job of a staff inference engineer and the thing startups live or die on (cost per token). Being able to read a profile, reason with a roofline, and tune vLLM's knobs methodically is what separates a senior engineer from a staff engineer.

What you'll learn

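Metrics that matter: throughput (tok/s), TTFT (time to first token), ITL/TPOT (inter-token latency / time per output token), goodput (throughput that still meets the SLO), latency percentiles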
Little's Law and how batch size, arrival rate, and latency relate
The roofline model: compute-bound vs memory-bound; arithmetic intensity
Profiling: the torch profiler, Nsight Systems, and vLLM's own metrics
The knobs: max_num_seqs, max_num_batched_tokens, gpu_memory_utilization, enable_chunked_prefill, CUDA graphs, quantization, speculative decoding
Benchmarking properly: vllm bench, warmup, steady state, fair comparisons
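
Two of the topics above reduce to back-of-envelope arithmetic, sketched below. Little's Law (L = λ · W) gives the average number of requests in flight; the roofline caps attainable FLOP/s at min(peak compute, bandwidth × arithmetic intensity). The hardware figures are invented round numbers, not a real GPU's specs, and the decode-matmul estimate assumes weight traffic dominates and ignores activations and the KV cache.

```python
def littles_law_concurrency(arrival_rate_rps, mean_latency_s):
    """L = lambda * W: average number of requests in the system."""
    return arrival_rate_rps * mean_latency_s

def roofline_attainable(intensity, peak_flops, peak_bw):
    """Attainable FLOP/s = min(peak compute, memory bandwidth * intensity)."""
    return min(peak_flops, peak_bw * intensity)

def decode_matmul_intensity(batch, d_in, d_out, bytes_per_param=2):
    """Rough FLOP/byte of (batch x d_in) @ (d_in x d_out) during decode.

    ASSUMPTION: only weight bytes are counted; activations are ignored.
    """
    flops = 2 * batch * d_in * d_out          # one multiply + one add per weight per row
    bytes_moved = bytes_per_param * d_in * d_out
    return flops / bytes_moved

# HYPOTHETICAL accelerator: 1e15 FLOP/s peak compute, 2e12 B/s memory bandwidth.
PEAK_FLOPS, PEAK_BW = 1e15, 2e12
ridge = PEAK_FLOPS / PEAK_BW                  # intensity where the two roofs meet

for batch in (8, 1024):
    ai = decode_matmul_intensity(batch, 4096, 4096)
    bound = "memory-bound" if ai < ridge else "compute-bound"
    print(batch, ai, bound, roofline_attainable(ai, PEAK_FLOPS, PEAK_BW))

# 4 requests/s arriving with a 2.5 s mean latency keeps about 10 requests in flight.
in_flight = littles_law_concurrency(4.0, 2.5)
print(in_flight)
```

With 2-byte weights the intensity of this matmul works out to roughly the batch size, which is the usual intuition for why small-batch decode is memory-bound and why batching helps until the ridge point. If the Little's Law estimate exceeds max_num_seqs, requests have to queue.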

The map: where this lives in the real code

Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md) walks through the important ones line by line.

benchmarks/ — The benchmark suite (throughput, latency, serving).
vllm/benchmarks/ — The 'vllm bench' implementation.
vllm/v1/metrics/ — The metrics/stats the engine exposes (Prometheus + logging).
vllm/v1/metrics/stats.py — SchedulerStats / IterationStats: what's measured each step.
vllm/config/scheduler.py — The tuning knobs and their defaults/semantics.

Labs in this phase

lab-01-tune-the-knobs [CPU-OK] — build the full tuning loop on mini_vllm: arrival schedules (queueing enters the course), TTFT/spike/steps metrics, and an SLO-constrained grid search that refuses impossible SLOs — with two measured surprises about the chunk threshold.
lab-02-benchmark-real-vllm [GPU-OPT] — the same loop with wall-clocks: vllm bench serve sweeps, the rate-sweep knee found first, percentiles everywhere, and the one-page tuning report as the deliverable. Captured numbers included.
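
As a preview of the shape of the tuning loop in lab-01, here is a rough sketch of an SLO-constrained grid search that refuses an impossible SLO. It is not the lab's solution: `toy_model` is a hypothetical, made-up cost model standing in for real measurements, and the grid values are arbitrary.

```python
from itertools import product

def toy_model(max_num_seqs, token_budget):
    """HYPOTHETICAL cost model (not measured): returns (ttft_ms, throughput_tok_s)."""
    ttft_ms = 50 + 2 * max_num_seqs + (2048 / token_budget) * 40
    throughput = max_num_seqs * 30 / (1 + max_num_seqs / 64)
    return ttft_ms, throughput

def grid_search(grid, ttft_slo_ms, evaluate=toy_model):
    """Return the highest-throughput config whose TTFT meets the SLO."""
    feasible = []
    for seqs, budget in product(grid["max_num_seqs"], grid["max_num_batched_tokens"]):
        ttft_ms, throughput = evaluate(seqs, budget)
        if ttft_ms <= ttft_slo_ms:
            feasible.append({
                "max_num_seqs": seqs,
                "max_num_batched_tokens": budget,
                "ttft_ms": ttft_ms,
                "throughput": throughput,
            })
    if not feasible:
        # Refuse rather than silently return a config that violates the SLO.
        raise ValueError(f"no config in the grid meets TTFT <= {ttft_slo_ms} ms")
    # Maximize throughput; break ties by preferring the lower TTFT.
    return max(feasible, key=lambda c: (c["throughput"], -c["ttft_ms"]))

GRID = {
    "max_num_seqs": [8, 32, 128],
    "max_num_batched_tokens": [512, 2048, 8192],
}

best = grid_search(GRID, ttft_slo_ms=200)
print(best["max_num_seqs"], best["max_num_batched_tokens"])  # 32 8192

try:
    grid_search(GRID, ttft_slo_ms=50)
except ValueError as err:
    print("refused:", err)
```

In the real labs the `evaluate` step would be a simulated or wall-clock benchmark run; the point of the sketch is only the filter-then-maximize structure and the explicit refusal when nothing is feasible.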

See labs/README.md for how to run them.

How to work this phase

Read this guide for intuition.
Read 01-deep-dive.md with the upstream/ files open.
Do 02-mini-build.md — build the mini_vllm piece yourself.
Run the labs, then attempt EXERCISES.md.
Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.

Where you are

This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully worked, line-by-line treatment (with starter/ solution/test code in every lab) follows the gold standard set by the two flagship phases, Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.