Phase 18 — Performance Engineering

Phase 17 · Course home · Phase 19


Don't Panic

Now you make it FAST and prove it. This phase is the engineer's loop: measure (time to first token, TTFT; inter-token latency, ITL; throughput) with the right tools, find the bottleneck (CPU launch overhead, memory bandwidth, or a single slow kernel), turn the right knob (batch size, token budget, memory utilization, CUDA graphs, quantization), and re-measure. It's the meta-skill that ties phases 2–17 together.
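The "measure" step of that loop can be sketched in a few lines. This is a minimal illustration, not vLLM's own metrics code: the function names (`request_metrics`, `percentile`, `throughput_tok_s`) and the input shape (an arrival time plus a list of per-token emission timestamps) are assumptions made for the example.

```python
import math
from statistics import mean


def request_metrics(arrival_s, token_times_s):
    """Latency metrics for one request (hypothetical helper).

    arrival_s: when the request arrived, in seconds.
    token_times_s: ascending times at which each output token was emitted.
    """
    if not token_times_s:
        raise ValueError("need at least one output token")
    # Gaps between consecutive output tokens; their mean is the ITL.
    gaps = [later - earlier
            for earlier, later in zip(token_times_s, token_times_s[1:])]
    return {
        "ttft_s": token_times_s[0] - arrival_s,   # time to first token
        "itl_s": mean(gaps) if gaps else 0.0,     # mean inter-token latency
        "e2e_s": token_times_s[-1] - arrival_s,   # end-to-end latency
        "tokens": len(token_times_s),
    }


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def throughput_tok_s(total_tokens, wall_time_s):
    """Output tokens per second over a measurement window."""
    return total_tokens / wall_time_s


m = request_metrics(arrival_s=0.0, token_times_s=[0.5, 0.75, 1.0, 1.25])
print(m["ttft_s"], m["itl_s"])  # 0.5 0.25
```

Collect `ttft_s` and `itl_s` across many requests and report them with `percentile` (p50, p99) rather than a single average, since tail latency is usually what an SLO constrains.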

Why this phase matters

This is the daily job of a staff inference engineer and the thing startups live or die on (cost/token). Being able to read a profile, reason with a roofline, and tune vLLM's knobs methodically is what separates senior from staff.

What you'll learn

  • Metrics that matter: throughput (tok/s), TTFT, ITL/TPOT, goodput, latency percentiles
  • Little's Law and how batch size, arrival rate, and latency relate
  • The roofline model: compute-bound vs memory-bound; arithmetic intensity
  • Profiling: the torch profiler, Nsight Systems, and vLLM's own metrics
  • The knobs: max_num_seqs, max_num_batched_tokens, gpu_memory_utilization, enable_chunked_prefill, CUDA graphs, quant, spec decode
  • Benchmarking properly: vllm bench, warmup, steady state, fair comparisons
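Two of the ideas in this list reduce to one-line formulas, sketched below. Little's Law says the average number of requests in flight equals arrival rate times average latency. The roofline model caps attainable FLOP/s at the smaller of peak compute and arithmetic intensity times memory bandwidth. The hardware numbers are made-up round figures, and `decode_intensity` uses a common simplification (weight traffic dominates, activations and KV cache ignored), so treat the outputs as an order-of-magnitude guide only.

```python
def littles_law_concurrency(arrival_rate_rps, mean_latency_s):
    """Little's Law, L = lambda * W: average requests in the system."""
    return arrival_rate_rps * mean_latency_s


def roofline_attainable_flops(intensity, peak_flops, peak_bandwidth_bytes_s):
    """Attainable FLOP/s at a given arithmetic intensity (FLOP per byte)."""
    return min(peak_flops, intensity * peak_bandwidth_bytes_s)


def is_memory_bound(intensity, peak_flops, peak_bandwidth_bytes_s):
    """True when intensity is below the ridge point peak_flops / bandwidth."""
    return intensity < peak_flops / peak_bandwidth_bytes_s


def decode_intensity(batch_size, bytes_per_param=2):
    """ASSUMPTION: one decode step does about 2 FLOPs per parameter per
    sequence and reads each parameter once, so intensity is roughly
    2 * batch_size / bytes_per_param."""
    return 2 * batch_size / bytes_per_param


# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BW = 1e12

print(littles_law_concurrency(10, 2.0))                          # 20.0
print(is_memory_bound(decode_intensity(1), PEAK_FLOPS, PEAK_BW))    # True
print(is_memory_bound(decode_intensity(256), PEAK_FLOPS, PEAK_BW))  # False
```

Under these assumptions, batch-1 decode sits far left of the ridge point (memory-bound), which is why raising the batch size is usually the first knob to try for throughput.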

The map: where this lives in the real code

Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md) walks through the important ones line by line.

Labs in this phase

  • lab-01-tune-the-knobs [CPU-OK] — build the full tuning loop on mini_vllm: arrival schedules (queueing enters the course), TTFT/spike/steps metrics, and an SLO-constrained grid search that refuses impossible SLOs — with two measured surprises about the chunk threshold.
  • lab-02-benchmark-real-vllm [GPU-OPT] — the same loop with wall-clocks: vllm bench serve sweeps, the rate-sweep knee found first, percentiles everywhere, and the one-page tuning report as the deliverable. Captured numbers included.
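To give a feel for the shape of an SLO-constrained grid search before opening the labs, here is a minimal sketch. It is not the lab's solution: `toy_measure` is a hypothetical stand-in for a real measurement run, and its formula is invented purely so the example executes. Only the knob names `max_num_seqs` and `max_num_batched_tokens` come from this guide.

```python
from itertools import product


def toy_measure(max_num_seqs, max_num_batched_tokens):
    """HYPOTHETICAL cost model; replace with a real benchmark run.

    Pretends each step takes 10 ms plus 1 ms per 256 tokens of budget,
    and that every scheduled sequence emits one token per step.
    """
    step_ms = 10 + max_num_batched_tokens // 256
    seqs_per_step = min(max_num_seqs, max_num_batched_tokens)
    return {
        "p99_itl_ms": step_ms,
        "throughput_tok_s": seqs_per_step * 1000 / step_ms,
    }


def grid_search(measure, grid, slo_p99_itl_ms):
    """Return the highest-throughput (knobs, result) pair that meets the SLO.

    Raises ValueError when no point in the grid is feasible, instead of
    silently returning a configuration that violates the SLO.
    """
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        knobs = dict(zip(names, values))
        result = measure(**knobs)
        if result["p99_itl_ms"] > slo_p99_itl_ms:
            continue  # violates the latency SLO, so not a candidate
        if best is None or result["throughput_tok_s"] > best[1]["throughput_tok_s"]:
            best = (knobs, result)
    if best is None:
        raise ValueError(f"no configuration meets p99 ITL <= {slo_p99_itl_ms} ms")
    return best


GRID = {
    "max_num_seqs": [64, 128, 256],
    "max_num_batched_tokens": [2048, 4096, 8192],
}

knobs, result = grid_search(toy_measure, GRID, slo_p99_itl_ms=20)
print(knobs)  # {'max_num_seqs': 256, 'max_num_batched_tokens': 2048}
```

The design choice worth noticing is the explicit failure: when the SLO is tighter than anything the grid can deliver, the search raises rather than handing back the least-bad point.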

See labs/README.md for how to run them.

How to work this phase

  1. Read this guide for intuition.
  2. Read 01-deep-dive.md with the upstream/ files open.
  3. Do 02-mini-build.md — build the mini_vllm piece yourself.
  4. Run the labs, then attempt EXERCISES.md.
  5. Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.

Where you are

This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully worked, line-by-line treatment (with starter/, solution/, and test code in every lab) follows the gold standard set by the flagship phases, Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.
