vLLM Mastery — From Zero to Maintainer

A deep, lab-driven journey through the internals of vLLM, one of the most widely used open-source LLM inference engines.

This is not a tutorial. It is a 20-phase apprenticeship. If you start at Phase 0 knowing nothing about how language models run, and you finish every lab, you will be able to:

Read and modify any part of the vLLM codebase — the scheduler, the KV-cache manager, attention backends, quantization, speculative decoding, distributed execution.
Land real pull requests upstream and reason about them like a maintainer.
Operate as a principal / staff LLM-inference engineer — design serving systems, debug throughput cliffs, and make the architectural calls that decide whether a model serves 10 or 10,000 users per GPU.
Found or join a startup in the inference space and know exactly where the moats are.

Everything you need is in this repository. You will never need an outside book.

The two things that make this work

1. You read the real engine

Every concept is anchored to the actual vLLM source code, frozen at a single commit (see UPSTREAM_PIN.md: v0.22.1 @ 0decac0). When a phase says

vllm/v1/core/block_pool.py:333 — BlockPool.get_new_blocks()

that line really exists in ./upstream/ and you are expected to open it. We do not paraphrase the engine. We quote it and explain it line by line.

⚠️ vLLM moves fast (dozens of merged PRs per day). Line numbers are valid only at the pinned commit. The named class/function is always given so you can re-find it in any version. Re-create the exact tree with the command in UPSTREAM_PIN.md.

2. You build a mini engine

Reading is not understanding. So in parallel you build mini_vllm/, a deliberately small, dependency-light reimplementation of vLLM's core ideas that runs on a laptop CPU with no GPU required. By the end you will have written, with your own hands:

a paged KV-cache block allocator (Phase 2),
a continuous-batching scheduler with prefix caching (Phase 3),
a sampler, an n-gram speculative decoder, a batched-LoRA matmul, a grammar mask, …

The real engine teaches you what production looks like. The mini engine teaches you why every decision was made. You need both, and that pairing is the anchor this course is built on.

How each phase is structured

Every phase-NN-*/ folder has the same shape:

File	What it is
`00-guide.md`	The Hitchhiker's Guide to the topic. Don't Panic. Pure intuition, analogies, ASCII diagrams. Assumes you know nothing. Read this first.
`01-deep-dive.md`	The real implementation. Upstream `path:line` references, quoted excerpts, line-by-line explanation, data structures, edge cases.
`02-mini-build.md`	Build or extend the `mini_vllm/` component for this topic.
`labs/lab-NN-*/`	Hands-on labs: `README.md` + `starter.py` + `solution.py` + `test_lab.py`.
`EXERCISES.md`	Graded challenges, easy → staff-level, with hints and solutions.
`INTERVIEW.md`	Real staff/principal interview questions on the topic, with model answers.
`CHEATSHEET.md`	One page: APIs, invariants, performance knobs, gotchas.

Lab hardware tags

Not everyone has a GPU. Every lab is tagged:

[CPU-OK] — runs anywhere, including in CI and on your laptop. Most labs carry this tag.
[GPU-OPT] — better on a GPU but has a CPU fallback; expected GPU output is captured in the README so you can follow along without one.
[GPU-REQ] — genuinely needs an NVIDIA GPU (real CUDA kernels). The README includes captured output and a step-by-step so you learn even if you only rent a GPU later.

See SETUP.md for environment setup and cheap cloud-GPU options.

The curriculum (20 phases)

#	Phase	One-line goal
00	Foundations	What an LLM forward pass is; prefill vs decode; why the KV cache exists.
01	Architecture & Request Lifecycle	Trace one request from `LLM.generate()` to tokens out.
02	PagedAttention ⭐	How vLLM stores KV memory in pages and avoids fragmentation.
03	Continuous Batching & Scheduler ⭐	Iteration-level scheduling, chunked prefill, prefix caching, preemption.
04	Attention Backends	FlashAttention, FlashInfer, TRTLLM-GEN, FlashMLA, Triton.
05	CUDA Graphs & torch.compile	Piecewise vs full graphs; the compilation pipeline.
06	Quantization	FP8/MXFP4/NVFP4/INT8/INT4, GPTQ/AWQ/GGUF/compressed-tensors.
07	GEMM & MoE Kernels	CUTLASS GEMM; MoE routing & grouped GEMM; expert parallelism.
08	Speculative Decoding	n-gram, suffix, EAGLE, DFlash; draft/verify & rejection sampling.
09	Sampling & Decoding Algorithms	top-k/p, penalties, parallel sampling, beam search, logits processors.
10	Distributed Inference	Tensor / Pipeline / Data / Expert / Context parallelism.
11	Multi-LoRA	Batched adapters, punica/SGMV, dense + MoE LoRA.
12	Structured Outputs	Grammar-constrained decoding via xgrammar / guidance.
13	Multimodal Models	Vision encoders, image-token merging, processor cache.
14	Model Architectures	Add a model: decoder-only, MoE, hybrid/SSM, embedding/reward.
15	Disaggregated Serving	Prefill/decode/encode split; KV transfer connectors.
16	Serving APIs & Parsers	OpenAI & Anthropic APIs, gRPC, streaming, tool/reasoning parsers.
17	Hardware Backends & Plugins	The platform abstraction; NVIDIA/AMD/CPU/TPU plugins.
18	Performance Engineering	Profiling, benchmarking, roofline thinking, tuning knobs.
19	Capstone — Maintainer & Startup	Land a real PR; the staff competency map; the startup playbook.

⭐ = the original flagship phases that set the template. Every phase now has fully written labs — 60+ in total, each with an in-depth guide-style README, and (for the CPU labs) a tested starter.py / solution.py / test_lab.py triplet. Run the whole suite with pytest -m "not gpu" from the repo root; every phase's labs/README.md gives the recommended order and the skills each lab delivers.

Recommended path

Do them in order, 0 → 19. Each builds on the last; mini_vllm/ grows phase by phase.
For each phase: read 00-guide.md → read 01-deep-dive.md with upstream/ open in a second window → do 02-mini-build.md → run the labs → attempt EXERCISES.md → self-test with INTERVIEW.md.
Run the tests constantly: pytest -m "not gpu" from the repo root.
Keep a lab notebook. When you finish, your notebook + mini_vllm/ + a merged upstream PR is your portfolio.

Start here: SETUP.md, then phase-00-foundations/00-guide.md.

See also: GLOSSARY.md (every term defined once) and CAREER.md (the maintainer path, the staff competency map, the startup playbook).

This repo also builds as a website (mdBook → Cloudflare Pages): see PUBLISHING.md.

Setup

This course is designed so that the majority of labs run on a laptop CPU. You only need a GPU for the labs explicitly tagged [GPU-REQ] (and even those ship captured output so you can learn without one).

1. Python environment

We follow vLLM's own convention and use uv (fast, and it's what upstream uses — see upstream/AGENTS.md). Plain venv works too.

# Install uv (one time)
curl -LsSf https://astral.sh/uv/install.sh | sh

# From the repo root:
uv venv --python 3.12
source .venv/bin/activate

# Install the CPU-only course dependencies (numpy + pytest). This is all you need
# for mini_vllm and every [CPU-OK] lab.
uv pip install -e .

To run the torch-based labs (some Phase 2/4 mini-builds), add the CPU build of torch:

uv pip install -e ".[torch]"   # CPU wheels are fine; no CUDA needed for mini_vllm

2. Get the real vLLM source (required for the deep-dives)

Every 01-deep-dive.md cites upstream/... paths. Clone the pinned tree:

git clone --depth 1 --branch v0.22.1 \
  https://github.com/vllm-project/vllm.git upstream
cd upstream && git rev-parse HEAD   # 0decac0d96c42b49572498019f0a0e3600f50398
cd ..

You do not need to install vLLM to read its source. (upstream/ is gitignored.)

3. (Optional) Install the real engine for the GPU labs

The real vllm package needs a CUDA build of torch and an NVIDIA GPU. Install it only on a GPU box:

uv pip install -e ".[vllm]"   # vllm==0.22.1, matches the pin

4. Running the labs and tests

# All CPU tests (mini_vllm + every phase's CPU labs). Run this constantly.
pytest -m "not gpu"

# Just one phase's labs
pytest phase-02-paged-attention/labs

# The mini engine's own test suite
pytest mini_vllm

# On a GPU box, also run the GPU-tagged tests
pytest -m gpu

GPU tests are auto-skipped when no CUDA device is present (see the gpu_device fixture in each phase's conftest.py), so the suite does not fail on a laptop just because there is no GPU.

5. Cheap GPU access for the `[GPU-REQ]` labs

You do not need to own a GPU. Options, lowest-effort first:

Option	Notes
Google Colab (free/Pro)	Free T4 is enough for small-model vLLM labs. Easiest start.
Modal / RunPod / Lambda / Vast.ai	Per-second/per-hour A10/L4/A100 rentals. ~$0.4–$2/hr for the GPUs these labs use.
Cloud spot instances (AWS `g5`, GCP `g2`)	Cheapest sustained; more setup.

A T4 or L4 (16–24 GB) runs every GPU lab in this course with a small model (e.g. facebook/opt-125m, Qwen/Qwen2.5-0.5B). You will never need an 80 GB card to learn.

6. Models used in labs

Labs default to tiny models so they download fast and fit small GPUs (and some run on CPU): facebook/opt-125m, Qwen/Qwen2.5-0.5B-Instruct, TinyLlama/TinyLlama-1.1B. Each lab README names the exact model and the huggingface-cli download command.

Troubleshooting

pytest collects 0 tests → run from the repo root (so pyproject.toml is found).
import vllm fails on a laptop → expected; the real engine needs CUDA. Use the [CPU-OK] labs and mini_vllm on a laptop; the captured outputs cover the rest.
Line numbers in a deep-dive don't match → you're not at the pinned commit. Re-clone per step 2, or search for the named function instead of trusting the line number.

Upstream Pin

Every "original code reference" in this curriculum is anchored to a single, frozen snapshot of the real vLLM source tree so that path:line citations stay reproducible even as upstream moves on.

Field	Value
Project	vllm-project/vllm
Release tag	`v0.22.1`
Commit SHA	`0decac0d96c42b49572498019f0a0e3600f50398`
Pinned on	2026-06-08
Local path	`./upstream/` (gitignored — not committed, re-clone as below)

Re-create the exact tree

git clone --depth 1 --branch v0.22.1 https://github.com/vllm-project/vllm.git upstream
cd upstream && git rev-parse HEAD   # must print 0decac0d96c42b49572498019f0a0e3600f50398

How citations are written

Throughout the phases you will see references like:

vllm/v1/core/sched/scheduler.py:312 @ 0decac0 — Scheduler.schedule()

The path is relative to upstream/.
The line number is valid only at the pinned SHA. If you check out a newer vLLM, open the file and search for the named symbol (the function/class is given) instead of trusting the line number.
@ 0decac0 is the short SHA, a reminder that the snapshot is frozen.
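
Re-finding a cited symbol by name, instead of trusting the line number, can be sketched with Python's standard ast module. The helper name find_symbol and the sample source below are illustrative assumptions; they are not part of vLLM or of this repo's tooling.

```python
import ast
from typing import Optional


def find_symbol(source: str, qualified_name: str) -> Optional[int]:
    """Return the 1-based line where `Class.method` (or a top-level name) is defined.

    Hypothetical helper for illustration only.
    """
    body = ast.parse(source).body
    node = None
    for part in qualified_name.split("."):
        node = next(
            (
                n
                for n in body
                if isinstance(n, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
                and n.name == part
            ),
            None,
        )
        if node is None:
            return None          # symbol was renamed or removed in this version
        body = node.body         # descend into the class to look for the method
    return node.lineno


# Made-up stand-in for a source file; in practice you would read the real file.
sample = '''\
class Scheduler:
    def __init__(self):
        self.waiting = []

    def schedule(self):
        return []
'''

schedule_line = find_symbol(sample, "Scheduler.schedule")
print(schedule_line)  # 5
```

To use it on the real tree, read the cited file from upstream/ and pass the symbol named in the citation; a None result tells you the symbol moved or was renamed.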

Why pin at all?

vLLM merges dozens of PRs per day. A line number that is correct today is wrong next week. Pinning is the same discipline real maintainers use when they write design docs and bug reports: always cite a commit, never "main". When you eventually contribute upstream (Phase 19), you will cite commits in exactly this way in your PR descriptions and issue reports.

Bumping the pin (later)

When you want to refresh the curriculum against a newer vLLM:

Re-clone at the new tag, update the table above.
Re-run the path:line spot-check in each phase's 01-deep-dive.md.
Note behavioral changes in a CHANGES.md per phase — diffing how the engine evolved is itself one of the most instructive exercises in this whole course.

Glossary

Every term used in this course, defined once, in plain language. When a phase uses a term, it links here. If you ever feel lost, this is the place to land.

Ordering is roughly conceptual, grouped by theme, not alphabetical — read top to bottom the first time, then use Ctrl-F.

The model and the forward pass

Token — a chunk of text (often ~¾ of a word) that the model reads and writes. Text is turned into a list of integer token IDs by a tokenizer.
Embedding — the vector the model uses to represent a token internally.
Forward pass — running the model once over some tokens to produce, for the last position, a score for every vocabulary token (the logits); softmax turns those scores into a probability distribution over the next token.
Logits — the raw, pre-softmax scores over the whole vocabulary for "what comes next".
Autoregressive generation — generate one token, append it to the input, run the forward pass again, repeat. LLMs generate text one token at a time this way.
Decoder-only model — the architecture of GPT/Llama/Qwen: a stack of transformer blocks that only attend to earlier tokens (causal). Most LLMs.
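
The autoregressive loop described above can be sketched in a few lines. The "model" here is a made-up rule (toy_next_token_logits is hypothetical and only favours the next integer), so the point is the shape of the loop, not the quality of the guess.

```python
def toy_next_token_logits(token_ids: list[int], vocab_size: int = 8) -> list[float]:
    """Stand-in for a forward pass: one score per vocabulary entry for the next position.

    Hypothetical toy rule: favour (last_token + 1) mod vocab_size.
    """
    target = (token_ids[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


def generate(prompt_ids: list[int], max_new_tokens: int, eos_id: int = 0) -> list[int]:
    """Greedy autoregressive generation: guess, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        ids.append(next_id)
        if next_id == eos_id:   # stop when the end-of-sequence token appears
            break
    return ids


print(generate([3, 4], max_new_tokens=10))  # [3, 4, 5, 6, 7, 0]
```

Note that every iteration calls the forward pass on the whole growing list; avoiding that repeated work is what the KV cache is for.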

Attention and the KV cache

Attention — the operation where each token "looks at" previous tokens and mixes in their information. Each token computes a Query (Q) and compares it against the Key (K) of itself and every earlier token; the resulting weights decide how much of each token's Value (V) gets mixed in.
KV cache — because earlier tokens don't change, their K and V vectors can be computed once and cached. The KV cache is the stored K and V for every token processed so far (prompt and generated). Model weights are a fixed cost; the KV cache is the part of GPU memory that grows with every request and every token, so it usually decides how many requests fit. This course is largely the story of managing it well.
Prefill — the first forward pass over the whole prompt at once. Compute-bound (lots of tokens, one pass). Fills the KV cache for the prompt.
Decode — each subsequent single-token forward pass. Memory-bandwidth-bound (one token, must read all weights + the whole KV cache). This is where most serving time goes.
TTFT (time to first token) — latency from request arrival to the first output token. Dominated by prefill.
ITL / TPOT (inter-token latency / time per output token) — time between successive output tokens. Dominated by decode.
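
To get a feel for why the KV cache dominates memory planning, here is a back-of-the-envelope sketch. The shape used (32 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumption for a Llama-3-8B-like model; check the model's config.json before relying on it.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies: a K and a V vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama-3-8B-like shape (grouped-query attention with 8 KV heads).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)
print(per_token)                 # 131072 bytes = 128 KiB per token
print(per_token * 8192 / 2**30)  # 1.0 GiB for a single 8192-token sequence
```

Under these assumptions, every 8192 tokens of context held in memory cost about 1 GiB, regardless of which request they belong to.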

PagedAttention & memory (Phase 2)

PagedAttention — vLLM's core idea: store the KV cache in fixed-size blocks (like OS memory pages) instead of one big contiguous buffer per request. This removes external fragmentation, limits waste to the unfilled tail of each request's last block, and enables sharing.
Block (KV block) — a fixed-size slot holding the KV of block_size tokens (commonly 16). The unit of KV allocation. Code: KVCacheBlock in kv_cache_utils.py.
Block size — number of tokens whose KV fits in one block (e.g. 16).
Block table — per-request mapping from logical block index → physical block ID. Lets a request's KV be scattered across non-contiguous physical blocks.
Block pool — the global pool of all physical blocks, with a free list and a prefix-cache index. Code: BlockPool in block_pool.py.
Fragmentation — wasted memory from reserving contiguous space you don't fully use. PagedAttention's reason for existing.
Prefix caching — if two requests share a prefix (same leading tokens), they can share the same physical KV blocks. Found by hashing block contents. Phase 3.
Copy-on-write (CoW) — when a shared block must diverge (one request writes new tokens), copy it so the other request's view is unaffected.
Reference count (ref_cnt) — how many requests currently use a block. A block is free only when ref_cnt == 0.
Eviction — reclaiming a cached (but currently unused) block for a new allocation. vLLM uses an LRU-ish free queue (FreeKVCacheBlockQueue).
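
The terms above (block, block table, block pool, reference count, free queue) fit together roughly as in this toy sketch. ToyBlockPool and blocks_needed are hypothetical names; this is not the real BlockPool, and it leaves out hashing, eviction policy, and copy-on-write.

```python
from collections import deque


class ToyBlockPool:
    """Toy pool of fixed-size KV blocks with reference counts (illustrative only)."""

    def __init__(self, num_blocks: int):
        self.ref_cnt = [0] * num_blocks
        self.free = deque(range(num_blocks))  # oldest-freed block is reused first

    def allocate(self, n: int) -> list[int]:
        if n > len(self.free):
            raise MemoryError("not enough free blocks")
        blocks = [self.free.popleft() for _ in range(n)]
        for b in blocks:
            self.ref_cnt[b] = 1
        return blocks

    def share(self, blocks: list[int]) -> None:
        """A second request reuses the same physical blocks (for example a shared prefix)."""
        for b in blocks:
            self.ref_cnt[b] += 1

    def release(self, blocks: list[int]) -> None:
        for b in blocks:
            self.ref_cnt[b] -= 1
            if self.ref_cnt[b] == 0:      # invariant: ref_cnt == 0 means "in the free queue"
                self.free.append(b)


BLOCK_SIZE = 16


def blocks_needed(num_tokens: int) -> int:
    return -(-num_tokens // BLOCK_SIZE)   # ceiling division


pool = ToyBlockPool(num_blocks=8)
table_a = pool.allocate(blocks_needed(40))   # block table for request A: 3 blocks
table_b = table_a[:2] + pool.allocate(1)     # request B shares A's first two blocks
pool.share(table_a[:2])
pool.release(table_a)                        # A finishes; shared blocks stay alive for B
print(table_a, table_b, list(pool.free))     # [0, 1, 2] [0, 1, 3] [4, 5, 6, 7, 2]
```

Each block table is just a list of physical block IDs, which is why a request's KV can live in non-contiguous memory.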

Scheduling & batching (Phase 3)

Batching — running many requests through the model together to use the GPU efficiently.
Static batching — fix a batch, run it to completion. Wasteful: fast requests wait for slow ones.
Continuous batching — re-decide the batch every iteration (every single token step). Finished requests leave, new ones join immediately. vLLM's default.
Scheduler — the component that, each step, picks which requests run and how many tokens each gets. Code: Scheduler in v1/core/sched/scheduler.py.
Chunked prefill — split a long prompt's prefill across several steps so it doesn't starve ongoing decodes. Controlled by a token budget.
Token budget — max_num_batched_tokens: the cap on total tokens scheduled per step.
Preemption — when memory runs out, evict a running request's KV and put it back in the queue (to be recomputed later). The safety valve.
Running / Waiting queues — requests currently decoding vs. requests waiting to start.
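
Continuous batching, chunked prefill, and the token budget can be sketched as one scheduling step that is re-run every iteration. ToyRequest and schedule_step are hypothetical names; the sketch ignores KV memory and preemption and is not how the real Scheduler is written.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    name: str
    prompt_len: int
    computed: int = 0  # tokens whose KV is already in the cache

    def tokens_wanted(self) -> int:
        remaining_prefill = self.prompt_len - self.computed
        return remaining_prefill if remaining_prefill > 0 else 1  # decode = 1 token


def schedule_step(running: list, waiting: list, token_budget: int) -> dict:
    """Pick how many tokens each request gets this step (running requests first)."""
    plan = {}
    for req in running:
        if token_budget == 0:
            break
        n = min(req.tokens_wanted(), token_budget)  # a long prefill gets chunked
        plan[req.name] = n
        token_budget -= n
    while waiting and token_budget > 0:             # admit new requests with what is left
        req = waiting.pop(0)
        n = min(req.tokens_wanted(), token_budget)
        plan[req.name] = n
        token_budget -= n
        running.append(req)
    for req in running:
        req.computed += plan.get(req.name, 0)
    return plan


running = [ToyRequest("a", prompt_len=4, computed=4)]  # already decoding
waiting = [ToyRequest("b", prompt_len=10)]             # new arrival with a 10-token prompt
for _ in range(3):
    print(schedule_step(running, waiting, token_budget=8))
# {'a': 1, 'b': 7}   b's prefill is chunked to fit the budget
# {'a': 1, 'b': 3}   b finishes its prefill
# {'a': 1, 'b': 1}   both requests decode
```

Request a keeps producing one token per step throughout, which is the point of chunking b's prefill instead of letting it take a whole step.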

Kernels & execution (Phases 4–7)

Kernel — a function that runs on the GPU. "Attention kernel", "GEMM kernel", etc.
GEMM — General Matrix-Matrix Multiply. The workhorse op (every linear layer). Libraries: cuBLAS, CUTLASS.
FlashAttention — a fused, memory-efficient attention kernel that never materializes the full attention matrix.
FlashInfer / FlashMLA / TRTLLM-GEN / Triton — other attention/GEMM kernel providers vLLM can dispatch to.
Attention backend — vLLM's pluggable wrapper choosing which attention kernel to run.
CUDA graph — a recorded sequence of GPU operations replayed with one launch, removing per-op CPU launch overhead. Piecewise = capture parts; full = capture the whole model forward.
torch.compile — PyTorch's compiler; vLLM uses it to fuse ops and generate kernels, with custom graph passes.
MoE (Mixture of Experts) — a layer with many "expert" sub-networks; each token is routed to a few experts. Big models, low active compute. (Mixtral, DeepSeek-V3.)
Quantization — storing weights/activations in fewer bits (FP8, INT4, …) to save memory and bandwidth. Number formats: FP8, MXFP4, NVFP4, INT8/INT4. Methods and checkpoint formats: GPTQ, AWQ, GGUF, compressed-tensors.
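
The core idea of quantization can be sketched with the simplest scheme, symmetric per-tensor INT8. This is an illustration under that assumption only; the formats and methods listed above use finer-grained scales and more careful rounding.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]


weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)            # [50, -127, 2, 100]
print(worst_error)  # rounding error, bounded by about scale / 2
```

Each weight now needs one byte plus a shared scale, at the price of a rounding error no larger than roughly half the scale.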

Decoding strategies (Phases 8–9)

Greedy decoding — always pick the highest-probability token.
Temperature / top-k / top-p / min-p — knobs that shape the sampling distribution.
Parallel sampling (n) — produce n independent completions for one prompt (sharing the prompt's KV via prefix caching).
Beam search — keep the top-N partial sequences by cumulative log-probability.
Logits processor — a hook that edits the logits before sampling (penalties, bans, grammar masks).
Speculative decoding — a cheap draft model/heuristic proposes several tokens; the big model verifies them in one pass, accepting a prefix. Speeds up decode.
EAGLE / Medusa / n-gram / suffix / DFlash — specific speculative-decoding methods.
Acceptance rate — fraction of drafted tokens the target model accepts. The metric that decides whether spec decode is a win.
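
How temperature, top-k, and top-p reshape the distribution can be sketched in plain Python. The function names are illustrative assumptions and this is not vLLM's sampler; greedy decoding is the limit as temperature goes to zero and is usually special-cased rather than computed by dividing by zero.

```python
import math
import random


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs: list[float], top_k: int, top_p: float) -> list[float]:
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass  # renormalize over the survivors
    return filtered


def sample(logits, temperature=1.0, top_k=50, top_p=1.0, rng=random):
    probs = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
print(filter_top_k_top_p(softmax(logits), top_k=2, top_p=1.0))  # only tokens 0 and 1 survive
print(sample(logits, temperature=0.7, top_k=3, top_p=0.9, rng=random.Random(0)))
```

Passing a seeded random.Random makes a sampling run reproducible, which is handy when writing lab tests.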

Distributed & serving (Phases 10, 15–16)

Tensor parallelism (TP) — split each layer's weights across GPUs; every GPU does part of every layer; results all-reduced.
Pipeline parallelism (PP) — split the layers across GPUs; activations pass GPU→GPU.
Data parallelism (DP) — replicate the model; split requests across replicas.
Expert parallelism (EP) — split MoE experts across GPUs.
Context parallelism (CP) — split a single sequence's context across GPUs.
Collective op — multi-GPU communication primitive (all-reduce, all-gather, …) via NCCL.
Disaggregated serving — run prefill and decode on different machines, shipping the KV cache between them, so each can be scaled and tuned independently.
KV connector — the component that transfers KV blocks between engines (for P/D disagg or offloading). Code under vllm/distributed/kv_transfer/.
OpenAI-compatible server — vLLM's HTTP server speaking the OpenAI API (plus Anthropic Messages API and gRPC).
Tool calling / reasoning parser — components that extract structured tool calls or chain-of-thought from model output.
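
Tensor parallelism and the all-reduce it relies on can be illustrated without any GPUs. In this sketch each loop iteration stands in for one GPU holding a slice of the weight matrix; the function names are hypothetical and the "all-reduce" is just a Python sum.

```python
def matvec(weight: list[list[float]], x: list[float]) -> list[float]:
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def tensor_parallel_matvec(weight, x, num_shards: int) -> list[float]:
    """Split the input dimension across shards, then all-reduce (sum) the partial outputs."""
    in_dim = len(x)
    assert in_dim % num_shards == 0, "sketch assumes an even split"
    width = in_dim // num_shards
    partials = []
    for rank in range(num_shards):  # each iteration stands in for one GPU
        cols = slice(rank * width, (rank + 1) * width)
        shard_w = [row[cols] for row in weight]      # this rank's slice of the weights
        partials.append(matvec(shard_w, x[cols]))    # partial result, full output length
    return [sum(vals) for vals in zip(*partials)]    # the "all-reduce"


W = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 1.0, 0.0, 1.0]]
x = [1.0, 1.0, 2.0, 2.0]
print(matvec(W, x))                               # [17.0, 3.0]
print(tensor_parallel_matvec(W, x, num_shards=2)) # [17.0, 3.0]
```

Each shard stores only half of W, but every layer now needs a communication step, which is the trade-off TP makes.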

Adaptation & structure (Phases 11–12)

LoRA (Low-Rank Adaptation) — small trainable matrices added to a frozen base model to specialize it. vLLM serves many LoRAs in one batch.
Punica / SGMV — batched kernels that apply different LoRAs to different requests in one GPU call.
Structured output / guided decoding — forcing the model's output to match a grammar, regex, or JSON schema by masking invalid tokens each step. Engines: xgrammar, guidance.
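
Serving many LoRAs in one batch comes down to adding a different low-rank update per request on top of the same frozen weights. The sketch below does this with a Python loop; the adapter names and shapes are made up, and batched kernels such as Punica/SGMV exist precisely to replace this loop with one GPU call.

```python
def matvec(weight, x):
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def lora_forward(x, base_w, adapters, adapter_id):
    """y = W x + B (A x): the frozen base output plus a low-rank, per-request update."""
    y = matvec(base_w, x)
    if adapter_id is None:          # request that uses the base model only
        return y
    a, b = adapters[adapter_id]     # A: rank x in_dim, B: out_dim x rank
    delta = matvec(b, matvec(a, x))
    return [yi + di for yi, di in zip(y, delta)]


base_w = [[1.0, 0.0], [0.0, 1.0]]                 # 2x2 frozen weight
adapters = {                                      # hypothetical rank-1 adapters
    "tenant-a": ([[1.0, 1.0]], [[0.5], [0.0]]),
    "tenant-b": ([[1.0, -1.0]], [[0.0], [2.0]]),
}
batch = [([1.0, 2.0], "tenant-a"), ([1.0, 2.0], "tenant-b"), ([1.0, 2.0], None)]
outputs = [lora_forward(x, base_w, adapters, name) for x, name in batch]
print(outputs)  # [[2.5, 2.0], [1.0, 0.0], [1.0, 2.0]]
```

The same input produces three different outputs in one batch, while the base weights are stored only once.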

vLLM internals & process model

V1 engine — vLLM's current core architecture (the vllm/v1/ tree). V0 is legacy. This course teaches V1.
LLM — the offline (batch) Python entry point: LLM(model=...).generate(prompts).
AsyncLLM — the async engine powering the API server.
EngineCore — the inner loop: add_request → step() (schedule → execute → output).
Worker / Executor — the executor owns workers; each worker drives one GPU's model.
Model runner — turns a SchedulerOutput into actual tensor inputs and runs the model.
SamplingParams — per-request decoding config (temperature, max_tokens, n, …).
RequestOutput — what the engine returns: generated text/tokens for a request.

The Career Map: Maintainer, Staff Engineer, Founder

This course has three end-states in mind. They overlap, but each has its own "what does great look like" bar. Use this document as a compass: at any phase, ask "which of these am I building toward right now?"

Track A — Become a vLLM maintainer

A maintainer is someone whose judgment the project trusts. You get there by a track record, not a title.

The ladder

First contribution. A docs fix, a small bug, a test. Learn the workflow (Phase 19).
Sustained contributions. Real features/fixes in one area (say, the scheduler or a quant method). You become "the person who knows X".
Reviewer. You review others' PRs in your area credibly.
Committer / maintainer. You're trusted to merge and to shape direction.

What maintainers actually do (and this course trains)

Read code fast and correctly. Every 01-deep-dive.md is reps for this.
Reason about invariants. "Block tables are append-only." "ref_cnt==0 ⟺ in free queue." Maintainers hold dozens of these in their heads. The deep-dives name them explicitly; the CHEATSHEET.md files collect them.
Protect the hot path. vLLM's scheduler runs every token step for every request — a Python list scan in the wrong place is a throughput regression. You learn to feel this.
Write tests that pin behavior. Look at upstream/tests/v1/core/ — that's the standard.
Communicate. PR descriptions, RFCs, issue triage. See upstream/AGENTS.md for the project's literal rules (e.g. no pure code-agent PRs, cite that AI was used, include test commands and results).

The non-obvious advice

Specialize, then generalize. Pick one subsystem from this course (scheduler, KV cache, a quant format, an attention backend) and go deeper than anyone. Depth in one area earns the trust that lets you touch others.
Watch the firehose. Subscribe to the repo. Read merged PRs in your area daily. Diffing how the engine evolves (Phase 19) is the fastest way to learn the current mental model.

Track B — Staff / Principal LLM-inference engineer

This is the industry role: you own how models serve — throughput, latency, cost, reliability — at a company. The interview loops test exactly the material in this course.

The competency map

Competency	Phases	"Staff-level" looks like
Transformer inference fundamentals	0, 1	Can derive KV-cache memory from first principles; explain prefill vs decode bottlenecks.
Memory management	2	Can size KV cache for a deployment; explain paging vs fragmentation with numbers.
Throughput engineering	3, 18	Can diagnose a throughput cliff from metrics; tune batch/token budgets; reason about Little's Law.
Kernels & precision	4–7	Knows when FlashInfer beats FlashAttention; what FP8 costs in accuracy; reads a roofline.
Latency techniques	8, 9	Knows when spec decode helps (acceptance rate × draft cost); chunked prefill tradeoffs.
Scale-out	10, 15	Picks TP vs PP vs DP vs EP for a model+SLA; understands P/D disaggregation economics.
Productization	11, 12, 16	Multi-tenant LoRA, structured output, API design, streaming, observability.
Hardware breadth	17	Reasons about NVIDIA vs AMD vs TPU tradeoffs and the plugin abstraction.

How to use the `INTERVIEW.md` files

Each phase ships staff-level Q&A. Treat them as a mock loop: cover the answer, attempt it out loud, then compare. The flagship phases (2, 3) show the depth expected. A strong candidate can whiteboard the PagedAttention block allocator and the continuous-batching step loop from memory — which, after this course, you will have written yourself in mini_vllm/.

Your portfolio

By the end you have three artifacts that beat any résumé bullet:

mini_vllm/ — a working engine you built. Walk an interviewer through it.
A merged upstream PR (Phase 19). Public proof you operate at the real bar.
A tuning/benchmark writeup (Phase 18). Shows you think in numbers.

Track C — Found a startup in inference

The inference layer is one of the most valuable and contested in the AI stack. This course makes you dangerous in it.

Where the value (and the moats) are

Cost per token. The whole game. Everything in Phases 2–7 and 10 is a lever on it. A 2× throughput win halves the GPU cost of every token you serve, and that saving goes straight to gross margin.
Latency SLAs. TTFT and ITL guarantees (Phases 3, 8, 9, 15) are what enterprise buyers actually pay for.
Multi-tenancy. Serving thousands of fine-tunes cheaply = multi-LoRA + prefix caching (Phases 3, 11). A structural cost advantage over per-customer deployments.
Hardware arbitrage. Running well on cheaper/available silicon (Phase 17) when NVIDIA is supply-constrained.

Honest take on moats

Raw "we wrap vLLM and rent GPUs" is not a moat — margins compress fast. Defensible angles:

A genuine kernel/scheduling edge you can sustain (hard, but this course is where you'd build the expertise to try).
Workload specialization — agentic/long-context/structured-output/RAG-shaped traffic has different optimal configs; owning a vertical's serving stack is defensible.
The control plane — routing, autoscaling, multi-tenancy, observability, cost attribution around the engine. Often more durable than the engine itself.
Distribution / switching costs — being embedded in customers' pipelines.

The build/buy/contribute calculus

You will almost always build on vLLM rather than replace it — that's the point of open source. The startup question is "what do we add on top, and what should we upstream?" Phase 19 covers the contribute-vs-keep-private tradeoff (upstreaming buys you maintenance leverage and credibility; hoarding a commodity feature buys you nothing).

A note on mindset

The people who reach all three end-states share one habit: they read the source. Not docs about the source — the source. This entire course is built to make that your default reflex. Open upstream/ now and keep it open for the next 20 phases.

Phase 00 — The Hitchhiker's Guide to How an LLM Actually Runs

This is Chapter 0 of the book. It assumes you know nothing — not what a token is, not what a matrix multiply is — and it ends with you able to compute, on a napkin, how many users a given GPU can serve and why. Everything else in the course stands on this chapter, so we go slowly and build each idea from the ground up.

How to read this chapter. Most of it is for everyone. Paragraphs marked

🔬 Going deeper — optional rigor and real numbers for the expert track.

can be skimmed on a first pass and devoured on the second. By the end you should be comfortable at both levels: the intuition and the arithmetic.

0.1 Don't Panic — the whole thing in one sentence

A large language model is a function that reads a list of words and guesses the next word. To write text, it guesses a word, sticks it on the end, and guesses again — hundreds of times.

That is genuinely all an LLM does at runtime. ChatGPT writing you an essay is this loop running a few hundred times. Everything difficult — and everything this course teaches — comes not from the guessing, but from doing the guessing fast, for thousands of people at once, on hardware that costs more than a house. vLLM is the software that does that well. To make it faster (your future job), you must first feel why it is slow. That feeling is what this chapter installs.

We'll build up in this order: a tiny bit of math → words become numbers → what one "guess" involves → the loop → why it's slow → where the memory goes. Take your time.

0.2 The only math you need (a 5-minute primer)

Two objects and one operation. That's it.

A vector is just a list of numbers: [0.2, -1.1, 0.5]. Picture it as an arrow, or as coordinates of a point. A length-3 vector is a point in 3D space; LLMs use points in thousands of dimensions (you can't picture that, and you don't need to — the arithmetic is the same).

A matrix is a grid of numbers — a stack of vectors. A 2×3 matrix has 2 rows, 3 columns.

The one operation that matters is matrix multiplication (everyone calls it "matmul" or GEMM — General Matrix Multiply). To multiply a vector by a matrix, you take dot products. A dot product of two equal-length vectors multiplies them element-wise and sums:

[1, 2, 3] · [4, 5, 6]  =  1·4 + 2·5 + 3·6  =  4 + 10 + 18  =  32

A dot product is a similarity score: it's large and positive when two vectors point the same way, near zero when they're unrelated, negative when opposed. Remember this — attention (the heart of the model) is built entirely out of dot products measuring "how related are these two tokens."
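
If you prefer to see it run, the dot product is a one-liner in Python. The extra vectors below are made-up examples that show the "similarity score" reading.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Multiply element-wise and sum; both vectors must have the same length."""
    assert len(a) == len(b)
    return sum(x * y for x, y in zip(a, b))


print(dot([1, 2, 3], [4, 5, 6]))  # 32, the worked example
print(dot([1, 0], [2, 0]))        # 2: pointing the same way
print(dot([1, 0], [0, 3]))        # 0: unrelated (perpendicular)
print(dot([1, 0], [-2, 0]))       # -2: opposed
```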

Multiplying a vector x (length 3) by a matrix W (3 columns, 2 rows) gives a new vector (length 2), one dot product per row of W:

x = [1, 2, 3]            W = [ [1, 0, 1],     →   y[0] = [1,2,3]·[1,0,1] = 4
                               [0, 1, 1] ]        y[1] = [1,2,3]·[0,1,1] = 5
                                                  y = [4, 5]

🔬 Going deeper. A neural network "layer" is exactly this: y = x·Wᵀ (plus a bias, plus a nonlinearity). The matrix W is the weights — the billions of numbers that are the trained model. "Llama-3-8B" means ~8 billion such numbers. A forward pass is a long chain of these multiplies. So "running a model" = "doing a lot of matmuls with the weight matrices." Hold that: it explains both the compute cost (FLOPs) and the memory cost (reading the weights), which is the whole performance story in §0.10.
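
A minimal sketch of one such layer, using ReLU as an assumed nonlinearity (real models use others), followed by a rough cost count for a single matrix-vector product. The 4096×4096 size and 2 bytes per weight are illustrative assumptions.

```python
def linear_relu(x, weight, bias):
    """One toy layer: y = relu(W x + b), computed as one dot product per row of W."""
    pre = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]
    return [max(0.0, p) for p in pre]


x = [1, 2, 3]
W = [[1, 0, 1],
     [0, 1, 1]]
print(linear_relu(x, W, bias=[0, 0]))    # [4, 5], matching the hand computation
print(linear_relu(x, W, bias=[-10, 0]))  # [0.0, 5]: ReLU clamps the negative entry


def matvec_cost(rows: int, cols: int, bytes_per_weight: int = 2):
    """Rough cost of one matrix-vector product: FLOPs and weight bytes read."""
    flops = 2 * rows * cols              # one multiply + one add per weight
    weight_bytes = rows * cols * bytes_per_weight
    return flops, weight_bytes


print(matvec_cost(4096, 4096))           # (33554432, 33554432)
```

With 16-bit weights, a single vector does about one floating-point operation per byte of weights read, a ratio worth remembering when the course discusses compute cost versus memory cost.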

That's the entire math prerequisite. Onward.

0.3 Words become numbers, part 1: tokenization

A computer can't multiply the word "Paris". So step one is always: chop the text into small pieces called tokens and replace each with an integer ID.

Why not just use whole words, or individual letters? Whole words give a gigantic, brittle vocabulary (every plural, typo, and rare name is a new word). Single letters make sequences painfully long. The sweet spot is subwords — common words stay whole, rare words split into pieces. The dominant algorithm is Byte-Pair Encoding (BPE):

🔬 How BPE is built (Going deeper). Start with every character as its own token. Then repeatedly find the most frequent adjacent pair of tokens in your training text and merge it into a new token. Do this tens of thousands of times. Common sequences like "ing" and " the" get merged into single tokens; rare strings stay split into smaller pieces. The result is a fixed, ordered list of merges that, together with the base characters, defines the vocabulary and balances vocabulary size against sequence length.
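
The merge loop can be sketched in a few lines. This is a teaching toy under simplifying assumptions: it works on characters of one string, while production tokenizers work on bytes, pre-split the text, and train on far more data.

```python
from collections import Counter


def train_bpe(text: str, num_merges: int):
    """Learn merges by repeatedly fusing the most frequent adjacent token pair."""
    tokens = list(text)                      # start: one token per character
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break                            # nothing repeats; stop merging
        merges.append((left, right))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)  # fuse the pair into one new token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges


tokens, merges = train_bpe("low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(tokens)  # ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
```

After only two merges the frequent stem "low" is a single token, while the rarer endings are still spelled out character by character.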

A worked tokenization (Llama-3-style, ~128k vocab):

text:    "vLLM is fast"
tokens:  [ "v",  "LLM",  " is",  " fast" ]      ← note the leading spaces are part of tokens
IDs:     [  85,   4178,    382,    2347   ]      ← example numbers from the vocab table

Two facts to carry forward:

A token is roughly ¾ of a word on average (so 1,000 tokens ≈ 750 words).
The component that does this is the tokenizer; reversing it (IDs → text) at the end is detokenization. The full list of tokens it knows is the vocabulary (Llama-3: 128,256 entries).

In mini_vllm/tokenizer.py we use the simplest possible tokenizer — one byte = one token, vocab of 257 — so the course needs zero downloaded files. Open it; it's ten lines, and it has the same encode/decode interface a real tokenizer does.
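
For orientation, a byte-level tokenizer of this kind might look like the sketch below. It is not the contents of mini_vllm/tokenizer.py; in particular, treating the 257th ID as an end-of-sequence marker is an assumption made here for illustration.

```python
class ByteTokenizer:
    """One byte = one token. IDs 0-255 are raw bytes; ID 256 is an assumed EOS marker."""

    vocab_size = 257
    eos_id = 256

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        data = bytes(i for i in ids if i < 256)   # drop the special token
        return data.decode("utf-8", errors="replace")


tok = ByteTokenizer()
ids = tok.encode("vLLM is fast")
print(ids[:4], len(ids))               # [118, 76, 76, 77] 12
print(tok.decode(ids + [tok.eos_id]))  # vLLM is fast
```

Compare the 12 tokens here with the 4 subword tokens in the worked example: byte-level tokenization needs no files, but it makes sequences much longer.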

🆕 New words: token (a subword chunk), token ID (its integer), tokenizer (the chopper), vocabulary (all known tokens), BPE (the merge algorithm), detokenization (IDs→text).

0.4 Words become numbers, part 2: embeddings (meaning as coordinates)

A token ID like 4178 is just a name — it carries no meaning by itself (token 4178 isn't "more" than token 382). So the model's first move is to look up each ID in a big table and replace it with a vector — a list of numbers — called an embedding.

Think of the embedding as coordinates in a space of meaning. Just as a city has a (latitude, longitude), a token has a few thousand coordinates (Llama-3-8B: 4096 of them). The training process arranges this space so that tokens used in similar ways land near each other, and — famously — directions in the space carry meaning:

embedding("king") - embedding("man") + embedding("woman")  ≈  embedding("queen")

The lookup itself is trivial — it's just "go to row 4178 of the embedding table" — but it's the bridge from symbols to math. After this step, the prompt is no longer text; it's a stack of vectors (one per token), and from here on the model only does matmuls on those vectors.

"vLLM is fast"
  → IDs [85, 4178, 382, 2347]
  → embeddings, a 4 × 4096 matrix:
      [ [ 0.02, -0.7, ... , 0.1 ],     ← "v"
        [-0.5,   0.3, ... , 0.9 ],     ← "LLM"
        [ 0.1,  -0.2, ... , 0.0 ],     ← " is"
        [ 0.8,   0.4, ... ,-0.3 ] ]    ← " fast"
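
To make "go to row 4178" concrete, here is a minimal sketch of the lookup in plain Python. The table is filled with random numbers and shrunk to toy sizes (5,000 rows of width 8), both assumptions for illustration; a trained model would hold learned values and Llama-3-8B's real shape.

```python
import random

VOCAB_SIZE, D_MODEL = 5000, 8          # toy sizes; Llama-3-8B uses 128256 and 4096
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)] for _ in range(VOCAB_SIZE)
]


def embed(token_ids: list[int]) -> list[list[float]]:
    """Replace each token id with its row of the table (no arithmetic involved)."""
    return [embedding_table[t] for t in token_ids]


stack = embed([85, 4178, 382, 2347])   # "vLLM is fast" -> one vector per token
print(len(stack), "x", len(stack[0]))

# Size of the real table: one row per vocabulary entry, d_model numbers per row.
real_table_params = 128256 * 4096
print(real_table_params)
```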

🔬 Going deeper. The width of these vectors is the hidden size d_model (4096 for Llama-3-8B). Bigger d_model = more capacity but more compute and memory everywhere. The embedding table has vocab_size × d_model numbers (128256 × 4096 ≈ 525M just for embeddings). Many models tie the input embedding and the output projection (the "LM head") to save that memory.

🆕 New words: embedding (a token's meaning-vector), hidden size / d_model (the vector width), LM head (the final layer turning vectors back into vocabulary scores).

0.5 The shape of a model: layers and the residual stream

The model is a tall stack of identical layers (Llama-3-8B has 32). The stack of token vectors flows up through them; this flowing stack is often called the residual stream. Each layer reads the stream, computes an update, and adds it back (that "add it back" is the residual connection — it's why training deep stacks works, but you can treat it as plumbing).

Each layer does exactly two things to the stream:

Attention — lets each token look at the other tokens and pull in relevant information. This is the only place where information moves between positions.
MLP (feed-forward) — transforms each token's vector independently, adding "thinking capacity."

After all 32 layers, the final vector at the last position is multiplied by the LM head to produce one score for every token in the vocabulary. Those scores are the logits (§0.8).

embeddings ─► [ layer 1 ] ─► [ layer 2 ] ─► ... ─► [ layer 32 ] ─► LM head ─► logits (vocab,)
                 │ attention + MLP, each with a residual add

Keep this split in mind — attention mixes across tokens, the MLP works per token — because it explains exactly where the KV cache lives (in attention, the only cross-token part) and where the giant matmuls are (the MLP, ~⅔ of the weights; Phase 7).

0.6 Attention from the ground up (the heart of the machine)

This is the one piece worth understanding in detail, because it dictates all of memory management. We'll build it from the problem, then do a worked numeric example.

The problem attention solves

Consider generating the next word of: "The river bank was muddy, so the fisherman ...". To continue sensibly the model must know "bank" means a riverbank (not a money bank) — and the clue is the word "river" earlier. Each token needs to gather context from the other tokens. Attention is the mechanism for that gathering.

Q, K, V — the search analogy

For every token, the model computes three vectors (each by multiplying the token's embedding by a learned weight matrix — yes, more matmuls):

Query (Q) — "here is what I'm looking for." (your search box)
Key (K) — "here is what I am about." (a document's title/tags)
Value (V) — "here is the information I actually carry." (the document's contents)

To update a token, you compare its Query against the Key of every token up to and including itself (dot products — remember, dot product = similarity!), turn those similarities into weights that sum to 1, and take the weighted blend of those tokens' Values. It is exactly a soft search:

Type a query → it's scored against the keys of all documents → you get back a blend of the best-matching documents' values.

A worked example (do this once by hand — it demystifies everything)

Three tokens, and to keep it readable let each Q/K/V be just 2 numbers (a real model uses 128). Suppose we're computing the new vector for token 3, whose query is q₃ = [1, 0]. The three tokens' keys and values:

token   key K        value V
  1     [1, 0]       [10, 0]
  2     [0, 1]       [0, 10]
  3     [1, 0]       [5, 5]

Step 1 — similarity scores (dot product of q₃ with each key):

score₁ = [1,0]·[1,0] = 1      score₂ = [1,0]·[0,1] = 0      score₃ = [1,0]·[1,0] = 1

Token 3's query points the same way as tokens 1 and 3's keys (score 1) and is orthogonal to token 2's (score 0).

Step 2 — scale by 1/√(head_dim) = 1/√2 ≈ 0.707 (this keeps numbers from blowing up as the vectors get wider — a numerical-stability trick):

scaled = [0.707, 0, 0.707]

Step 3 — softmax turns scores into weights that are all positive and sum to 1. Softmax of a list is exp(each) / sum(exp):

exp(0.707)=2.03,  exp(0)=1.00,  exp(0.707)=2.03    sum = 5.06
weights = [2.03/5.06, 1.00/5.06, 2.03/5.06] = [0.40, 0.20, 0.40]

So token 3 will pay 40% attention to token 1, 20% to token 2, 40% to itself.

Step 4 — weighted blend of the Values:

out = 0.40·[10,0] + 0.20·[0,10] + 0.40·[5,5]
    = [4,0]       + [0,2]       + [2,2]
    = [6, 4]

That [6, 4] is token 3's attention output — a context-aware mix dominated by tokens 1 and 3. That is the entire attention operation. Everything fancy later (FlashAttention, PagedAttention) is about computing this exact thing faster and with less memory, never something different.
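
If you want to check the hand calculation, the four steps can be written out directly. This is a sketch in plain Python for a single query and a single head; the helper names `softmax` and `attend` are illustrative, not taken from any library.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)                                   # subtract the max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over the given keys/values."""
    head_dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]   # step 1
    scaled = [s / math.sqrt(head_dim) for s in scores]                  # step 2
    weights = softmax(scaled)                                           # step 3
    out = [                                                             # step 4
        sum(w * v[d] for w, v in zip(weights, values))
        for d in range(len(values[0]))
    ]
    return weights, out


keys = [[1, 0], [0, 1], [1, 0]]
values = [[10, 0], [0, 10], [5, 5]]
weights, out = attend([1, 0], keys, values)
print([round(w, 2) for w in weights])   # rounds to the hand-computed weights
print([round(x, 1) for x in out])       # rounds to [6.0, 4.0]
```

The unrounded output differs from [6, 4] in the second decimal place because the hand calculation rounded the weights to two digits.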

Causal masking — you can't read the future

When generating, token 3 may only attend to tokens 1, 2, 3 — not tokens that come after it (they don't exist yet). This "only look backward" rule is causal masking: before the softmax, scores for future positions are set to -∞ (so their softmax weight is 0). Picture the allowed attention as a lower-triangular matrix:

       attends to →   t1   t2   t3   t4
   query t1            ✓    ✗    ✗    ✗
   query t2            ✓    ✓    ✗    ✗
   query t3            ✓    ✓    ✓    ✗
   query t4            ✓    ✓    ✓    ✓

This triangle is why token i needs the Keys and Values of all tokens ≤ i — the single most important sentence for understanding the KV cache (§0.9).
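
The masking rule can be sketched as follows: set future scores to -∞ before the softmax and the triangle falls out on its own. The function name is hypothetical, and the all-zero score matrix is chosen only so the resulting weights are easy to read.

```python
import math


def causal_attention_weights(scores: list[list[float]]) -> list[list[float]]:
    """scores[i][j] is the scaled q_i . k_j; returns row-wise softmax with the future masked."""
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(masked)                            # position i itself is always finite
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) is exactly 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


w = causal_attention_weights([[0.0] * 4 for _ in range(4)])
for row in w:
    print(row)
```

With equal scores, row i spreads its attention evenly over positions 1..i and gives exactly zero to everything after it.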

Multiple heads

Real attention runs several of these in parallel — heads — each with its own Q/K/V projections, so different heads can specialize (one tracks syntax, another long-range references). Their outputs are concatenated. Llama-3-8B has 32 query heads, each of dimension 128 (32 × 128 = 4096 = d_model).

🔬 Going deeper — three things the experts know.

head_dim and the √d scale. If the components of q and k have roughly unit variance, their dot product has a standard deviation that grows like √d, which would push softmax into tiny-gradient saturation as heads get wider. Dividing by √head_dim keeps the variance ~1. (That's the 0.707 above.)

RoPE (positional info). Attention as described is order-blind — it'd treat "dog bites man" like "man bites dog." Models inject position by rotating Q and K by an angle proportional to their position (Rotary Position Embedding). Two tokens' score then depends on their relative distance. You'll see rotary_emb(positions, q, k) in llama.py (Phase 0 deep-dive).

GQA/MQA — fewer KV heads. The KV cache (next section) is sized by the number of KV heads. So modern models use Grouped-Query Attention: many query heads share a smaller number of KV heads. Llama-3-8B has 32 query heads but only 8 KV heads — a 4× KV-cache saving baked into the architecture. (MQA is the extreme: 1 KV head.) This single design choice changes your serving capacity by 4×; remember it for §0.11.
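
The RoPE claim above (the score depends on relative distance) is easy to verify in two dimensions. This sketch uses a single rotation frequency, `THETA = 0.1`, which is an arbitrary assumption; real RoPE applies many frequencies across pairs of dimensions, and this is not the `rotary_emb` implementation in llama.py.

```python
import math

THETA = 0.1   # assumption: one arbitrary rotation frequency


def rotate(vec: list[float], position: int) -> list[float]:
    """Rotate a 2-D vector by an angle proportional to its position."""
    angle = position * THETA
    x, y = vec
    return [x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle)]


def rope_score(q, q_pos, k, k_pos) -> float:
    rq, rk = rotate(q, q_pos), rotate(k, k_pos)
    return rq[0] * rk[0] + rq[1] * rk[1]


q, k = [1.0, 0.5], [0.3, 2.0]
near = rope_score(q, 5, k, 3)            # query 2 positions after the key
shifted = rope_score(q, 105, k, 103)     # same gap, 100 positions later
swapped = rope_score(q, 3, k, 5)         # opposite order
print(near, shifted, swapped)
```

The first two scores agree up to floating-point error, while swapping the order changes the score, so the model can tell "dog bites man" from "man bites dog".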

🆕 New words: Query/Key/Value (Q/K/V), attention score (a Q·K dot product), softmax (scores → weights summing to 1), causal mask (can't attend to the future), head (one parallel attention), head_dim (a head's width), RoPE (rotary positions), GQA/MQA (shared KV heads).

0.7 The MLP (the per-token "thinking" block)

After attention mixes context in, the MLP processes each token's vector on its own through two big matmuls with a nonlinearity between:

hidden = activation(x · W_upᵀ)        # expand: 4096 → ~14336  (Llama-3-8B)
out    = hidden · W_downᵀ             # project back: 14336 → 4096

The middle width (the "intermediate size") is several × d_model, which is why the MLP holds the majority of the model's weights (~⅔). When you hear "the model is mostly GEMMs," it's largely these two matrices per layer. (Modern Llamas use a gated variant, SwiGLU, with three matrices — a detail for Phase 14; the shape story is the same.)
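
You can sanity-check the ~⅔ figure by counting parameters. The sketch below uses the Llama-3-8B shapes quoted in this chapter and makes three assumptions: the gated three-matrix MLP, an input embedding and LM head that are not tied, and norm weights ignored as negligible.

```python
D_MODEL, INTERMEDIATE, NUM_LAYERS = 4096, 14336, 32
NUM_Q_HEADS, NUM_KV_HEADS, HEAD_DIM = 32, 8, 128
VOCAB_SIZE = 128256

attn_per_layer = (
    D_MODEL * NUM_Q_HEADS * HEAD_DIM            # W_q
    + 2 * D_MODEL * NUM_KV_HEADS * HEAD_DIM     # W_k and W_v (GQA: only 8 KV heads)
    + NUM_Q_HEADS * HEAD_DIM * D_MODEL          # W_o
)
mlp_per_layer = 3 * D_MODEL * INTERMEDIATE      # gate, up, down (gated variant)
embeddings = 2 * VOCAB_SIZE * D_MODEL           # assumption: untied embedding + LM head

total = NUM_LAYERS * (attn_per_layer + mlp_per_layer) + embeddings
mlp_share = NUM_LAYERS * mlp_per_layer / total
print(f"total params ~ {total / 1e9:.2f}B, MLP share ~ {mlp_share:.0%}")
```

Under these assumptions the MLP holds about 70% of all weights, in line with the rough ⅔ quoted above.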

The takeaway for performance: attention is where memory (the KV cache) concentrates; the MLP is where compute and weight bytes concentrate. Different bottlenecks, different optimizations.

0.8 From logits to a token: sampling

After the last layer, the LM head turns the final vector into logits — one raw score per vocab token (~128k numbers). To pick a token you first turn logits into probabilities with softmax, then choose:

"The capital of France is"  → logits → softmax → probabilities:
   " Paris": 0.87    " Lyon": 0.06    " a": 0.01    " banana": 0.000003   ...

Greedy: take the highest-probability token (" Paris"). Deterministic.
Temperature / top-k / top-p: deliberately introduce randomness for variety.

This whole topic — the decoding algorithms and how they run vectorized across a whole batch — is Phase 9. For now: logits → softmax → pick one. mini_vllm/sampler.py implements greedy plus temperature/top-k/top-p in ~40 readable lines.
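
As a preview, "logits → softmax → pick one" can be sketched like this. It covers only greedy and temperature sampling, is deliberately simpler than mini_vllm/sampler.py, and the function name `pick_token` is hypothetical.

```python
import math
import random


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    scaled = [x / temperature for x in logits]
    m = max(scaled)                                # subtract the max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits: list[float], temperature: float = 0.0, rng=None) -> int:
    """temperature == 0 means greedy (argmax); otherwise sample from the softmax."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    rng = rng or random.Random()
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [5.0, 2.3, 0.5, -7.0]                     # made-up scores for four tokens
print(pick_token(logits))                          # greedy picks index 0
print(pick_token(logits, temperature=1.0, rng=random.Random(0)))
```

Raising the temperature flattens the distribution, so lower-scored tokens get picked more often; passing a seeded `random.Random` makes a sampled run reproducible.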

🆕 New words: logits (raw next-token scores), probability distribution (softmaxed logits), greedy decoding (argmax), sampling (random pick).

0.9 The generation loop, and the redundancy that births the KV cache

The model only ever predicts one next token. To write a sentence we loop — feed the output back in (this is autoregressive generation):

Step 1:  "The capital of France is"            → " Paris"
Step 2:  "The capital of France is Paris"       → "."
Step 3:  "The capital of France is Paris."      → <end>

Now look closely at what the naive loop computes. Each step runs the whole model over the whole text-so-far. Recall from §0.6 that attention at each position needs the Keys and Values of all earlier positions. So:

Step 1 processes tokens [1..5]      → computes K,V for positions 1..5
Step 2 processes tokens [1..6]      → computes K,V for positions 1..6   (1..5 AGAIN)
Step 3 processes tokens [1..7]      → computes K,V for positions 1..7   (1..6 AGAIN)

We keep recomputing the K and V of tokens we already processed. Here is the key insight:

A token's Key and Value never change once computed. Token 5's K and V are identical in step 2 and in step 500. So compute them once and store them.

That stored table of every past token's Keys and Values is the KV cache. With it, each new step computes K,V for only the one new token and reads the rest from the cache:

work WITHOUT a cache:  1 + 2 + 3 + ... + N  =  N(N+1)/2  ≈  N²/2     (quadratic)
work WITH    a cache:  1 + 1 + 1 + ... + 1  =  N                     (linear)

For a 1,000-token answer that's ~500× less of this work. You will measure exactly this in lab-01 — a 20-line experiment that is the single justification for the entire course. The KV cache is not an optimization you can skip; it's what makes generation tractable.
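
The two sums are small enough to count directly. This sketch only tallies "K/V computations" per step, as the formulas above do; it is not the lab's starter or solution code, and the function name is hypothetical.

```python
def kv_computations(num_tokens: int, use_cache: bool) -> int:
    """Count how many per-token K/V computations generating num_tokens requires."""
    work = 0
    for step in range(1, num_tokens + 1):
        # Without a cache, step t recomputes K/V for all t tokens so far.
        # With a cache, step t computes K/V only for the one new token.
        work += 1 if use_cache else step
    return work


naive = kv_computations(1000, use_cache=False)
cached = kv_computations(1000, use_cache=True)
print(naive, cached, naive / cached)   # 500500 1000 500.5
```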

The catch — and the reason Phases 2–3 exist — is that the KV cache is enormous and grows with every token, and it lives in scarce GPU memory. Managing it well is most of vLLM.

🆕 New words: autoregressive generation (predict → append → repeat), KV cache (stored Keys/Values of all prior tokens), EOS token (the "stop" token).

0.10 Prefill vs decode, and why decode is memory-bound (the chapter's crux)

With a KV cache, generation splits into two phases with opposite performance personalities. This is the most-probed idea in LLM-inference interviews — we'll do it with real numbers.

Prefill — the first run: process the entire prompt at once to fill its KV cache. Many tokens, one run.
Decode — every run after: generate one token, append, repeat. One token, one run, many times.

	Prefill	Decode
tokens per run	many (whole prompt)	one
limited by	compute (math throughput)	memory bandwidth (reading from HBM)
sets the metric	TTFT (time to first token)	ITL / TPOT (time per output token)

Why decode is bottlenecked by memory speed, not math

To produce one decode token, the GPU must read every weight in the model out of its main memory (HBM) — plus the whole KV cache — and then do only one token's worth of math with all of it. It's like driving to a vast warehouse and loading every crate onto the truck, just to deliver one postcard. The bottleneck is the loading (memory reads), not the delivering (math).

🔬 The arithmetic that proves it (Going deeper — this is the money slide). Take Llama-3-8B in bf16 (2 bytes/param), on an A100 (HBM bandwidth ≈ 2 TB/s, compute ≈ 312 TFLOP/s bf16).

Memory per decode step (batch = 1): read all weights = 8e9 params × 2 bytes = 16 GB. Time to read at 2 TB/s = 16e9 / 2e12 = 8 ms. That alone caps you at 1/0.008 ≈ 125 tokens/sec — no matter how fast the math is.

Compute per decode step: a forward pass costs ≈ 2 × params FLOPs per token = 2 × 8e9 = 16 GFLOP. At 312 TFLOP/s that's 16e9 / 312e12 ≈ 0.05 ms.

Verdict: memory (8 ms) dwarfs compute (0.05 ms) by ~160×. Decode is utterly memory-bound at batch 1. The expensive math units sit ~99% idle, waiting for weights to arrive.

Arithmetic intensity makes this crisp: it's FLOPs ÷ bytes-read. Decode at batch 1 ≈ 16 GFLOP / 16 GB = 1 FLOP/byte. The A100's "ridge point" (where it flips from memory- to compute-bound) is 312e12 / 2e12 ≈ 156 FLOP/byte. Since 1 ≪ 156, we're deep in memory-bound territory. This is the roofline model in one number.

The escape: batching. If you decode B sequences together, you still read the weights only once but do B× the math → intensity ≈ B FLOP/byte. To reach the ridge (≈156) and use the GPU fully, you need batch ~150. That's why throughput serving is all about big batches — and why the scheduler (Phase 3) exists. Prefill already has high intensity (many tokens × one weight read) → it's compute-bound from the start, which is why a long prompt can hog the GPU and must be chunked (Phase 3).
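
The whole argument fits in one small function. This sketch reuses the round numbers above (8e9 parameters, 2 bytes each, 2 TB/s, 312 TFLOP/s) and, like the text, counts only weight reads; ignoring KV-cache traffic is a simplifying assumption, and the function name is hypothetical.

```python
PARAMS = 8e9                 # Llama-3-8B, rounded
BYTES_PER_PARAM = 2          # bf16
HBM_BANDWIDTH = 2e12         # bytes/s, A100 (approximate)
PEAK_FLOPS = 312e12          # FLOP/s, A100 bf16 (approximate)


def decode_step(batch_size: int) -> dict:
    """Time one decode step from weight traffic and math alone."""
    bytes_read = PARAMS * BYTES_PER_PARAM          # weights are read once per step
    flops = 2 * PARAMS * batch_size                # ~2 FLOPs per param per token
    memory_s = bytes_read / HBM_BANDWIDTH
    compute_s = flops / PEAK_FLOPS
    return {
        "memory_ms": memory_s * 1e3,
        "compute_ms": compute_s * 1e3,
        "intensity": flops / bytes_read,           # FLOP per byte read
        "bound": "memory" if memory_s > compute_s else "compute",
    }


ridge = PEAK_FLOPS / HBM_BANDWIDTH                 # ~156 FLOP/byte
for batch in (1, 128, 256):
    print(batch, decode_step(batch))
```

At batch 1 the step is memory-bound with intensity 1; somewhere between batch 128 and 256 the compute time overtakes the 8 ms weight read, which is the "batch ~150" crossing.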

This one section explains nearly every optimization ahead:

Batching (Phase 3): amortize the weight read over many sequences → throughput.
Quantization (Phase 6): make the weights fewer bytes → less to read → faster decode.
CUDA graphs (Phase 5): when per-step math is tiny, even the CPU overhead of launching the work dominates → remove it.
Speculative decoding (Phase 8): do useful work for several tokens per weight-read.

🆕 New words: prefill / decode, TTFT / ITL(TPOT), HBM (the GPU's main memory), compute-bound / memory-bandwidth-bound, arithmetic intensity (FLOPs/byte), roofline (the model that says which bound you're under), ridge point (the FLOP/byte where it flips).

🔬 The GPU memory hierarchy (expert aside). A GPU has tiers: tiny ultra-fast registers and SRAM/shared memory (KB–MB, ~TB/s within a core) on-chip, and big slow HBM (tens of GB, ~1–3 TB/s) off-chip. "Memory-bound" means bound by HBM. FlashAttention (Phase 4) is fast precisely because it keeps attention's intermediates in SRAM and avoids round-tripping the giant score matrix to HBM. Keep this hierarchy in mind whenever a kernel is "memory-bound" — it's usually HBM traffic.

0.11 How big is the KV cache? (the wall that caps your users)

Decode is memory-bound, and the KV cache is the other big thing in that memory. Let's size it, one line at a time.

For every token, in every layer, we store a Key vector and a Value vector. So:

bytes_per_token = 2                  (one K + one V)
                × num_layers          (32)
                × num_kv_heads        (8   ← GQA! not 32 — see §0.6)
                × head_dim            (128)
                × bytes_per_number    (2 for bf16)

For Llama-3-8B:

2 × 32 × 8 × 128 × 2  =  131,072 bytes  ≈  128 KB   per token

So a 2,000-token conversation = 2000 × 128 KB ≈ 256 MB of GPU memory — for one user. The punchline: a 24 GB GPU, after ~16 GB for the weights, has ~8 GB for KV. At 256 MB/user that's about 30 conversations at once before you run out of memory.
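
Here is the same arithmetic as a sketch you can rerun with other shapes. It assumes decimal gigabytes (8 GB = 8 × 10⁹ bytes) and counts whole conversations only; the function name is illustrative and this is not the lab-02 solution.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_number: int = 2) -> int:
    """One K and one V vector, per layer, per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_number


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
per_conversation = 2000 * per_token
kv_budget = 8 * 10**9                      # what is left after the weights
users = kv_budget // per_conversation

# Halving bytes_per_number (for example an 8-bit KV cache) roughly doubles capacity.
users_8bit = kv_budget // (2000 * kv_bytes_per_token(32, 8, 128, bytes_per_number=1))
print(per_token, per_conversation, users, users_8bit)
```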

This is the headline of the entire field. What caps how many people you can serve is usually memory, not compute. The KV cache fills the GPU long before the math units are busy. So the serving game is fitting more KV cache: by not wasting any (PagedAttention, Phase 2), by sharing it across requests (prefix caching, Phase 3), and by shrinking it (FP8 KV cache, Phase 6 — halving bytes_per_number doubles your users).

🔬 Going deeper — scale it to 70B and feel the squeeze. Llama-3-70B: 80 layers, 8 KV heads, head_dim 128 → 2×80×8×128×2 = 327,680 B ≈ 320 KB/token. At 8k context that's about 2.6 GB per sequence. And on an 80 GB A100 there is no room for any of it: 70B parameters in bf16 is ~140 GB of weights, so the model doesn't even fit on one 80 GB GPU. This is why 70B requires tensor parallelism across multiple GPUs (Phase 10) and why people quantize (Phase 6): both the weights and the KV cache are fighting for memory. You'll compute these numbers yourself in lab-02.

When you later see vLLM log Maximum concurrency for 2048 tokens: 68.65x, you'll know it's this exact division: free-HBM ÷ per-sequence-KV. That number is your serving capacity.

0.12 Throughput vs latency, and Little's Law

Two metrics, in tension, that you'll trade off for the rest of your career:

Latency — how fast one request feels (TTFT, ITL). What an individual user cares about.
Throughput — total tokens/sec across everyone. What sets your cost per token — the number a business lives or dies on.

They fight: bigger batches raise throughput (amortized weight reads, §0.10) but slow each individual request (more work per step). The scheduler (Phase 3) steers this; Phase 18 is the art of tuning it.

🔬 Little's Law (Going deeper). For any stable serving system: concurrency = throughput × latency. If each request stays in the system for L seconds and you sustain X requests/sec, then on average N = X·L requests are in flight. Rearranged: to hit a target throughput at a given latency, you need a certain concurrency — and that concurrency must fit in KV memory (§0.11). This little equation ties together the whole stack: memory limits concurrency, concurrency (via Little's Law) limits the throughput you can reach at your latency SLA. You'll use it to size real deployments in Phase 18.
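
A short sketch of using the law in both directions. The request rate, the time in system, and the 30-slot KV capacity are made-up illustrative numbers, and both function names are hypothetical.

```python
def required_concurrency(requests_per_sec: float, seconds_in_system: float) -> float:
    """Little's Law: N = X * L."""
    return requests_per_sec * seconds_in_system


def max_request_rate(kv_slots: int, seconds_in_system: float) -> float:
    """Rearranged: with room for N sequences, the best sustainable X is N / L."""
    return kv_slots / seconds_in_system


# Target: 5 requests/s, each spending 12 s in the system.
needed = required_concurrency(5.0, 12.0)
# Reality: KV memory holds only 30 sequences at once.
achievable = max_request_rate(30, 12.0)
print(needed, achievable)   # 60.0 in flight needed, but only 2.5 req/s achievable
```

When the needed concurrency exceeds what KV memory holds, something has to give: lower latency per request, fewer bytes per sequence, or more GPUs.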

🆕 New words: latency, throughput, cost per token, Little's Law (N = X·L), SLA (the latency you've promised customers).

0.13 The one picture to carry into every later phase

Strip away the words and the engine reduces to this: a request is a list of tokens with two counters racing.

  ┌────────────────────────────────────────────────────────────────────┐
  │  A request = tokens + two counters:                                 │
  │     num_tokens          = how many tokens exist (prompt + generated) │
  │     num_computed_tokens = how many have been processed (KV cached)   │
  │                                                                     │
  │  PREFILL : computed is far behind  → catch up in one big run         │
  │  DECODE  : computed is one behind  → compute one more, append, repeat│
  │                                                                     │
  │  The engine's entire job: make `computed` catch up to `tokens`,     │
  │  as cheaply as possible, for thousands of requests at once.         │
  └────────────────────────────────────────────────────────────────────┘

This is literally how vLLM's Request object is built (vllm/v1/request.py) and how its scheduler reasons (Phase 3); mini_vllm/request.py mirrors it. If you remember one diagram from the whole course, make it this one — every later phase is "do one part of this loop better."
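
To see the two counters race, here is a minimal sketch. `ToyRequest` and `run_step` are hypothetical names, and the two token-list fields are assumptions for illustration; only the `num_tokens` and `num_computed_tokens` names come from the text above. It is not the vLLM class or mini_vllm/request.py.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    prompt_token_ids: list[int]
    output_token_ids: list[int] = field(default_factory=list)
    num_computed_tokens: int = 0          # how many tokens have their KV cached

    @property
    def num_tokens(self) -> int:          # how many tokens exist: prompt + generated
        return len(self.prompt_token_ids) + len(self.output_token_ids)


def run_step(req: ToyRequest, next_token: int) -> int:
    """Process every not-yet-computed token, then append one sampled token."""
    scheduled = req.num_tokens - req.num_computed_tokens
    req.num_computed_tokens += scheduled
    req.output_token_ids.append(next_token)
    return scheduled


req = ToyRequest(prompt_token_ids=[11, 12, 13, 14, 15])
work = [run_step(req, t) for t in (101, 102, 103)]
print(work)                                        # [5, 1, 1]
print(req.num_tokens, req.num_computed_tokens)     # 8 7
```

The first step does five tokens of work (prefill); every later step does one (decode), and `num_computed_tokens` ends each step exactly one behind `num_tokens`.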

0.14 What you'll do in this phase

Read: 01-deep-dive.md — find every concept above (Q/K/V, the cache, the two counters) in a real model file and in vLLM's EngineCore.step.
Build / measure: 02-mini-build.md — understand mini_vllm's tokenizer, toy model, and sampler, and run the two experiments below.
Labs (see labs/README.md for the full guide to each):
- lab-01-kv-cache-speedup [CPU-OK] — implement generation with and without a KV cache and measure the O(N²) → O(N) win. The motivating experiment of the course.
- lab-02-kv-memory-calculator [CPU-OK] — write the memory formula and compute how many users fit on a real GPU (8B and 70B). See the memory wall for yourself.
- lab-03-sampling-basics [CPU-OK] — build greedy/temperature/top-k/top-p from scratch and prove your sampler agrees token-for-token with mini_vllm's.
- lab-04-prefill-vs-decode [CPU-OK] — the roofline arithmetic: the ridge point, the 0.6% compute utilization of single-stream decode, the 125 tok/s speed limit, the critical batch size.
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.

You're ready for the rest of the book when you can, from memory: walk the attention worked example; explain why decode is memory-bound using arithmetic intensity; derive the KV-cache size for a model; and estimate how many users a GPU can serve. Those four are the foundation everything else is built on.


Phase 00 — Deep Dive: a real forward pass and the request counters

Paths relative to upstream/ at v0.22.1 @ 0decac0. You don't need to understand every line of a model — you need to recognize the shapes from the guide (Q/K/V, the KV cache, the prefill/decode counters) in real code. That recognition is what lets you navigate any model file later (Phase 14).

1. A real decoder-only model: Llama

Open vllm/model_executor/models/llama.py. The structure is a Russian doll:

LlamaModel (:350) — holds the embedding + a stack of LlamaDecoderLayers + a final norm.
LlamaDecoderLayer (:253) — one transformer block: self_attn then mlp, each with a residual add and an RMSNorm.
LlamaAttention (:124) — the attention block.
LlamaMLP (the small class whose forward(self, x) is at :117) — gate/up/down projections.

The decoder layer forward (`LlamaDecoderLayer.forward`, `:316`)

Skim it and find this shape (paraphrased):

# residual stream in -> norm -> attention -> add -> norm -> mlp -> add -> out
residual = hidden_states                         # keep the stream for the first add
hidden = self.input_layernorm(hidden_states)
hidden = self.self_attn(positions, hidden)      # attention mixes across tokens
hidden = residual + hidden
residual = hidden                                # keep the stream for the second add
hidden = self.post_attention_layernorm(hidden)
hidden = self.mlp(hidden)                        # per-token transform
hidden = residual + hidden

That's the whole transformer block. 32 of these stacked = Llama-3-8B. Notice attention is the only place tokens interact; the MLP treats each token independently. That's why attention is where the KV cache (cross-token memory) lives, and the MLP is just big GEMMs (Phase 7).

Where K and V are produced and cached (`LlamaAttention.forward`, `:223`)

This is the payoff. Find (paraphrased):

qkv, _ = self.qkv_proj(hidden_states)            # one matmul produces Q, K, V (fused)
q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
q, k = self.rotary_emb(positions, q, k)          # positional info (RoPE)
attn_output = self.attn(q, k, v)                 # <- the Attention layer (Phase 4)
output, _ = self.o_proj(attn_output)

The self.attn call is a vllm.attention.layer.Attention module — and that is what writes the new k, v into the paged KV cache (Phase 2) and reads back the cached K/V to compute attention (Phase 4). So the journey is: model produces Q/K/V → the Attention layer caches K/V in blocks and runs the attention kernel. Everything you'll learn in Phases 2 and 4 plugs in right here, at this one self.attn(q, k, v) call. Hold that thread.

Don't get lost. You will not understand all of llama.py today, and you don't need to. The point is to locate Q/K/V production and the self.attn call. That's the seam where the engine's memory and kernels meet the model.

2. The two counters that run the whole engine

Open vllm/v1/request.py. The Request class (:59) carries the prompt, the generated tokens, and the sampling params. The two properties that matter most:

@property
def num_tokens(self) -> int:           # :239
    # total tokens that exist: prompt + generated so far
    ...

and the field set in __init__ (:145): self.num_computed_tokens = 0 — how many of those tokens have had their KV computed and cached.

The whole engine is the race between these two numbers (guide §0.13):

New request: num_computed_tokens = 0, num_tokens = len(prompt). The gap is the whole prompt → prefill.
After prefill + each decode: num_computed_tokens is one behind num_tokens; generating a token bumps num_tokens, then the next step computes one more → decode.

num_tokens_with_spec (:243) adds speculative draft tokens to the gap — which is how spec decode (Phase 8) rides the same machinery with no special case. RequestStatus (:315) is the lifecycle enum (WAITING/RUNNING/PREEMPTED/FINISHED_*) you'll meet again in Phase 3.

mini_vllm/request.py is a faithful miniature: same num_computed_tokens vs num_tokens, same status enum, same is_finished = status >= FINISHED ordering trick.

3. The loop that drives it all: `EngineCore.step`

Open vllm/v1/engine/core.py:428. This is the heartbeat of vLLM:

def step(self) -> tuple[dict[int, EngineCoreOutputs], bool]:
    if not self.scheduler.has_requests():
        return {}, False
    scheduler_output = self.scheduler.schedule()                       # Phase 3: who runs
    future = self.model_executor.execute_model(scheduler_output, ...)  # the forward pass
    grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)  # Phase 12
    model_output = future.result()
    if model_output is None:
        model_output = self.model_executor.sample_tokens(grammar_output)   # Phase 9
    engine_core_outputs = self.scheduler.update_from_output(           # advance counters
        scheduler_output, model_output)
    return engine_core_outputs, scheduler_output.total_num_scheduled_tokens > 0

schedule → execute → sample → update. That's it. That's the engine. Every phase in this course is a deep dive into one stage of this four-stage loop:

schedule() → Phases 2, 3 (memory + batching)
execute_model() → Phases 4–7, 10, 13, 14 (kernels, quant, parallelism, the model itself)
sample_tokens() → Phases 8, 9, 12 (decoding, spec, structured output)
update_from_output() → Phase 3 (advance num_computed_tokens, reap finished)

mini_vllm/engine.py's step() is the same loop with the GPU filed off — read them side by side and the correspondence is exact.
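
For a feel of the loop's shape before you open either file, here is a hypothetical miniature. `ToyEngine`, its per-step token budget, and the fake "model" (next token = last token + 1) are all assumptions for illustration; none of it is vLLM's or mini_vllm's code, and it leaves out KV blocks, preemption, and real sampling.

```python
class ToyEngine:
    """Hypothetical miniature of schedule -> execute -> sample -> update."""

    def __init__(self, token_budget: int):
        self.token_budget = token_budget      # max tokens processed per step
        self.requests = {}                    # request id -> mutable state

    def add_request(self, req_id: str, prompt: list[int], max_new_tokens: int) -> None:
        self.requests[req_id] = {
            "tokens": list(prompt), "computed": 0, "remaining": max_new_tokens,
        }

    def schedule(self) -> dict[str, int]:
        """Decide how many not-yet-computed tokens each request processes this step."""
        budget, plan = self.token_budget, {}
        for req_id, state in self.requests.items():
            take = min(len(state["tokens"]) - state["computed"], budget)
            if take > 0:
                plan[req_id] = take
                budget -= take
        return plan

    def execute_and_sample(self, plan: dict[str, int]) -> dict[str, int]:
        """Stand-in for the forward pass and sampler: next token = last token + 1."""
        sampled = {}
        for req_id, take in plan.items():
            state = self.requests[req_id]
            # Only a request that is fully caught up this step yields a new token.
            if state["computed"] + take == len(state["tokens"]):
                sampled[req_id] = state["tokens"][-1] + 1
        return sampled

    def update(self, plan: dict[str, int], sampled: dict[str, int]) -> dict[str, list[int]]:
        """Advance the counters, append sampled tokens, and reap finished requests."""
        finished = {}
        for req_id, take in plan.items():
            state = self.requests[req_id]
            state["computed"] += take
            if req_id in sampled:
                state["tokens"].append(sampled[req_id])
                state["remaining"] -= 1
                if state["remaining"] <= 0:
                    finished[req_id] = state["tokens"]
        for req_id in finished:
            del self.requests[req_id]
        return finished

    def step(self) -> tuple[dict[str, list[int]], bool]:
        if not self.requests:
            return {}, False
        plan = self.schedule()
        sampled = self.execute_and_sample(plan)
        finished = self.update(plan, sampled)
        return finished, sum(plan.values()) > 0


engine = ToyEngine(token_budget=4)
engine.add_request("a", prompt=[1, 2, 3, 4, 5, 6], max_new_tokens=2)
scheduled_per_step, results = [], {}
while engine.requests:
    scheduled_per_step.append(sum(engine.schedule().values()))
    finished, _ = engine.step()
    results.update(finished)
print(scheduled_per_step)   # [4, 2, 1]
print(results)              # {'a': [1, 2, 3, 4, 5, 6, 7, 8]}
```

With a budget of 4, the six-token prompt is processed in two chunks (4, then 2) before the first token is sampled, and each later step processes a single token.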

Reading checklist

One sentence each:

In LlamaAttention.forward, which line produces Q/K/V and which line caches/uses K/V?
Why does the MLP not need a KV cache but attention does?
On Request, what's the difference between num_tokens and num_computed_tokens?
In EngineCore.step, name the four stages and which course phase owns each.

Now build it: 02-mini-build.md, then the labs.

Phase 00 — Mini-Build: feel the KV-cache win

This phase doesn't add a new mini_vllm module — it has you understand the three you already have and measure the foundational result the whole course rests on.

Part A — read the three pieces of "next-token prediction"

Open and read (they're tiny):

mini_vllm/tokenizer.py — encode(str) -> list[int], decode(list[int]) -> str. Same interface as a real HF tokenizer.
mini_vllm/model.py — ToyModel.forward(last_tokens, positions) -> logits. Batched, autoregressive, deterministic. (It ignores KV values — honest simplification noted in the file — because this course cares about KV memory management, not the toy's numerics.)
mini_vllm/sampler.py — Sampler.sample(logits, params) -> token_id. Greedy at temperature 0.

Trace one LLMEngine.generate(["hi"]) call in your head: tokenize → loop step() → sample → detokenize. Confirm it matches EngineCore.step from the deep-dive.

Part B — the lab: O(N²) vs O(N)

lab-01-kv-cache-speedup is the build. You implement a toy attention twice:

No cache: every step recomputes K/V for all prior tokens → work grows with the prefix.
Cached: K/V computed once, reused → constant work per step.

You'll count "K/V computations" and show the no-cache version does 1+2+...+N = O(N²) while the cached version does O(N) — and that both produce the identical token sequence (caching is an optimization, never a correctness change — the same invariant you'll prove for chunked prefill and preemption in Phase 3).

lab-02-kv-memory-calculator has you write the kv_bytes_per_token formula and compute how many concurrent sequences fit on a given GPU — the number that caps your serving capacity.

Definition of done

pytest phase-00-foundations/labs -q

Then answer in your notebook:

What is the asymptotic work ratio (no-cache / cached) for generating N tokens? (≈ N/2.)
For Llama-3-8B at 8k context on a 24 GB GPU (say 16 GB weights), roughly how many concurrent full-length sequences fit? (You'll compute it in lab-02 — the answer is "surprisingly few," which is the entire motivation for Phase 2.)

Map to the real engine

your understanding	real vLLM
the no-cache vs cache experiment	why `vllm.attention.layer.Attention` caches K/V at all
`num_computed_tokens` vs `num_tokens`	`Request` counters (`request.py:239`)
tokenize→loop→sample→detokenize	`EngineCore.step` (`core.py:428`)
kv_bytes_per_token formula	how `get_kv_cache_configs` sizes the block pool (Phase 2)

Phase 00 Labs — Foundations

Four labs that install the four facts everything else stands on: generation is autoregressive and caching makes it linear (lab-01), the cache is memory and memory is the binding constraint (lab-02), logits become tokens through a small exact algorithm (lab-03), and prefill and decode live in opposite performance regimes (lab-04). No GPU, no model downloads — counters, formulas, and numpy. Do them in order; each ends where the next begins.

Every lab follows the standard contract: starter.py with TODOs (your work), solution.py (the reference), test_lab.py (the spec, executable). The default test run uses solution.py so the suite is always green; set LAB_IMPL=starter to grade yourself.

# Whole phase:
pytest phase-00-foundations/labs -m "not gpu"

# Grade your own work on one lab:
LAB_IMPL=starter pytest phase-00-foundations/labs/lab-01-kv-cache-speedup -q

Labs

lab-01-kv-cache-speedup `[CPU-OK]`

The experiment that motivates the course: implement generation with and without a KV cache, count the work exactly (95 vs 15 units; >100× by n=1000), and prove both produce identical tokens. The O(N²) → O(N) trade that converts compute into memory — and creates the prefill/decode split as a side effect. Skills: why the cache exists; causality makes K/V cacheable; counting beats clocking; the master "optimization changes nothing" invariant.

lab-02-kv-memory-calculator `[CPU-OK]`

Write the three-line formula behind every capacity decision in LLM serving and apply it to Llama-3-8B: 128 KiB per token, 256 MiB per sequence, ~32 concurrent users on a 24 GiB GPU. Then read FP8-KV and GQA as factors of the formula. Memory, not compute, is the constraint — derived, not asserted. Skills: back-of-envelope capacity planning; the formula as an optimization roadmap; weights are rent, KV is traffic.

lab-03-sampling-basics `[CPU-OK]`

Build the sampler: greedy, temperature, top-k, top-p — with the stability clause (softmax max-subtraction), the inclusive nucleus boundary, and seeded reproducibility. The final test proves your sampler agrees token-for-token with mini_vllm's engine sampler across 15 configurations. Skills: the four knobs as exact algorithms; −∞ masking; why greedy mode anchors every deterministic test in this course.

lab-04-prefill-vs-decode `[CPU-OK]`

Six one-line functions and an A100 spec sheet: the ridge point (156 FLOPs/byte), single-stream decode at 0.6% compute utilization, the 125 tok/s physical speed limit for 8B/fp16, and the critical batch size where decode becomes compute-bound. The roofline worldview that sorts every optimization into "helps my regime" or "doesn't." Skills: compute-bound vs memory-bound as a reflex; the intensity cancellation (model size doesn't matter — tokens per weight-trip does); why batching is free money and quantization is a decode feature.

What you can do after this phase

Derive, on a whiteboard with no notes: why every inference engine caches KV (and what it costs in bytes); how many users fit on a given GPU for a given model (and which knob to turn when the answer is too small); what temperature=0.7, top_p=0.9 actually computes; and whether a proposed optimization can possibly help a given workload (which side of the ridge is it on?). These four reflexes are the entrance exam for Phase 1, where the loop you simulated becomes a real engine with a scheduler, and for every phase after it.

Lab 00-01 — The KV-Cache Speedup `[CPU-OK]`

This is the experiment that motivates the entire course — and arguably the entire field of LLM inference engineering. You will implement autoregressive generation twice: once the naive way (recompute attention's keys and values for the whole sequence, every step) and once with a KV cache (compute each token's K/V exactly once, ever). Same model, same output, and a work difference that grows with the square of the sequence length. By the end you'll have measured, with an exact integer counter you control, why every serving engine on earth is built around a cache — and why the rest of this course is about managing that cache.

Why this lab exists

Ask a newcomer why LLM inference is expensive and they'll say "big matrices." True, but it misses the structural problem: generation is autoregressive. The model emits one token, appends it, and runs again on the longer sequence — N tokens of output means N forward passes. If each pass reprocesses everything before it, total work is 1+2+3+…+N ≈ N²/2 token-computations for N tokens of value. Quadratic. A 10k-token answer would cost ~50 million token-computations to produce 10 thousand tokens.

The KV cache is the observation that almost all of that recomputation is byte-identical every time — and the field's entire architecture flows downstream of caching it. Once you store K/V, generation becomes O(N)… and a new problem is born: that cache is state, it lives in scarce GPU memory, it grows every step, and somebody has to manage it. That somebody is vLLM, and managing-the-cache-well is Phases 1–19. This lab is where you earn the premise.

We measure with a counter, not a stopwatch, on purpose. Wall-clock on a laptop is noisy and proves nothing about asymptotics; an exact work count (one unit per token processed) gives you the formula, and formulas transfer to any hardware. You'll meet this counting-over-clocking style again in Phase 3's labs.

Background: what K and V are, and why they're recomputable

In attention, each token's hidden state is projected three ways: a query (what am I looking for?), a key (what can I be found by?), and a value (what do I contribute when found?). When the model processes token t, its query is dotted against the keys of every token up to and including t, and the resulting weights blend their values:

attn(t) = softmax(q_t · [k_0 … k_t]ᵀ / √d) · [v_0 … v_t]

The crucial property is causality: k_i and v_i depend only on tokens 0..i. Token 5's key is the same whether the sequence currently has 6 tokens or 6,000. So once computed, (k_i, v_i) is valid forever — it's a pure function of the prefix, which makes it perfectly cacheable. The query is the only part that's fresh each step (it belongs to the new token), which is why we cache K and V but never Q.

That's the whole trick. "KV cache" sounds like infrastructure; it's actually a one-line theorem about causal attention plus the decision to spend memory on it.
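To see the theorem hold numerically, here is a minimal single-layer sketch. The projection matrices are random stand-ins, the dimension `d = 8` is arbitrary, and each token's hidden state is treated as a fixed input (in a real multi-layer model it depends on tokens 0..i through the causal layers below). It computes attention once by rebuilding K/V for the whole prefix at every step and once from an append-only cache, then checks the outputs agree.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy head dimension (assumption)
# Random stand-ins for the learned Q/K/V projection matrices.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))


def attend(q, K, V):
    """softmax(q · Kᵀ / √d) · V for one query against the rows of K and V."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V


x = rng.standard_normal((6, d))  # hidden states of a 6-token sequence

# Naive: rebuild K and V for the whole prefix at every step.
naive = [attend(x[t] @ W_q, x[: t + 1] @ W_k, x[: t + 1] @ W_v) for t in range(6)]

# Cached: compute each token's (k, v) once, append, and reuse.
K_cache, V_cache, cached = [], [], []
for t in range(6):
    K_cache.append(x[t] @ W_k)
    V_cache.append(x[t] @ W_v)
    cached.append(attend(x[t] @ W_q, np.array(K_cache), np.array(V_cache)))

outputs_match = all(np.allclose(a, b) for a, b in zip(naive, cached))
print(outputs_match)
```

Note that the query is recomputed every step in both variants; only K and V are ever stored.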

Files

starter.py — implement generate_no_cache and generate_with_cache. The work meter (compute_kv / KVWork) and the deterministic next_token are provided. Your work.
solution.py — reference.
test_lab.py — pins identical outputs, the exact quadratic and linear work formulas, and the growing ratio.

Run

LAB_IMPL=starter pytest phase-00-foundations/labs/lab-01-kv-cache-speedup -q
pytest phase-00-foundations/labs/lab-01-kv-cache-speedup -q   # reference (default)

What to implement

Both functions generate n_new tokens from a prompt of length P and return (full_token_sequence, total_kv_work):

generate_no_cache — each decode step first calls compute_kv(tok, pos) for every token currently in the sequence (the model "re-reads" everything), then appends next_token(tokens). Step i (0-indexed) costs P + i units.
generate_with_cache — prefill once (compute_kv per prompt token, P units), then each decode step computes K/V for only the newly appended token (1 unit).

next_token is deterministic — a hash of the context — so both implementations must produce the same token sequence. That's not a convenience; it's the point (see the first test).
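One possible shape for the two functions, as a sketch rather than the reference solution. The `WorkMeter` class and the hash-based `next_token` below are hypothetical stand-ins for the provided `compute_kv` / `KVWork` meter and `next_token`, whose real signatures live in starter.py and may differ.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class WorkMeter:
    """Hypothetical stand-in for the lab's work meter: one unit per K/V computation."""
    units: int = 0

    def compute_kv(self, tok, pos):
        self.units += 1


def next_token(tokens):
    """Stand-in deterministic 'model': a hash of the full context."""
    return hashlib.sha256(repr(tokens).encode()).digest()[0]


def generate_no_cache(prompt, n_new):
    meter, tokens = WorkMeter(), list(prompt)
    for _ in range(n_new):
        # Re-read everything: K/V for every token currently in the sequence.
        for pos, tok in enumerate(tokens):
            meter.compute_kv(tok, pos)
        tokens.append(next_token(tokens))
    return tokens, meter.units


def generate_with_cache(prompt, n_new):
    meter, tokens = WorkMeter(), list(prompt)
    # Prefill: K/V for each prompt token, exactly once.
    for pos, tok in enumerate(tokens):
        meter.compute_kv(tok, pos)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
        # Decode: K/V only for the token that was just appended.
        meter.compute_kv(tokens[-1], len(tokens) - 1)
    return tokens, meter.units


seq_a, work_a = generate_no_cache([1, 2, 3, 4, 5], 10)
seq_b, work_b = generate_with_cache([1, 2, 3, 4, 5], 10)
print(seq_a == seq_b, work_a, work_b)
```

Both functions call `next_token` on the full token list, so the outputs are identical by construction; only the metered work differs.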

What you should see — and why every number is what it is

For P = 5, n_new = 10:

no cache : work = 5+6+7+8+9+10+11+12+13+14 = 95      (sum of P..P+n_new-1 → O(N²))
cached   : work = 5 + 10                   = 15      (P prefill + 1/step   → O(N))

Why 95? Step 0 reprocesses the 5 prompt tokens; step 1 reprocesses 6 (prompt + the token just generated); … step 9 reprocesses 14. The arithmetic series is the quadratic, made concrete enough to check by hand — which is exactly what the test does.
Why 15? Each of the 15 tokens that ever exists has its K/V computed exactly once. The cached cost is the number of tokens. It cannot be beaten by any scheme that actually computes the KV (it can be beaten by schemes that reuse KV across requests — that's prefix caching, Phase 2/3).
At n_new = 1000: the ratio is >100× and still climbing linearly (~N/2). On real hardware this asymptotic gap is the difference between "chatbots are economically possible" and not.
Notice the two-phase shape that fell out for free: a big batch of K/V work up front (the prefill — all P prompt tokens at once, parallelizable, compute-hungry), then a drip of single-token steps (the decode — serial, one unit each). You didn't design that; caching created it. Prefill-vs-decode is the most consequential workload split in inference (lab-04 quantifies it; Phase 1 traces it; Phase 3 schedules around it), and it is born right here, in your 20 lines.
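The same numbers fall out of closed forms, which is a handy cross-check on a simulation. A small sketch (the function names are illustrative, not part of the lab's API):

```python
def work_no_cache(P, n_new):
    """Sum of P, P+1, ..., P+n_new-1: the arithmetic series, in closed form."""
    return n_new * P + n_new * (n_new - 1) // 2


def work_cached(P, n_new):
    """P prefill units plus one unit per decode step."""
    return P + n_new


small = (work_no_cache(5, 10), work_cached(5, 10))
ratio_at_1000 = work_no_cache(5, 1000) / work_cached(5, 1000)
print(small, round(ratio_at_1000))
```

At P = 5, n_new = 1000 the no-cache work is 504,500 units against 1,005 cached, a ratio of roughly 502, consistent with the ~N/2 growth described above.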

What the tests prove

Test	What it pins
`test_both_produce_identical_tokens`	Caching is an optimization, not a behavior change — the cached run's outputs are bit-identical. This is the course's master invariant: every optimization from here on (chunked prefill, prefix caching, preemption, paging) is proven safe by exactly this kind of equality test
`test_no_cache_is_quadratic`	`work == sum(P .. P+n_new−1)` — the formula, not "roughly slower"
`test_cached_is_linear`	`work == P + n_new` — every token computed once, ever
`test_work_ratio_grows_with_length`	The gap grows with N (>100× at n=1000): this is an asymptotic class difference, not a constant factor someone could optimize away

Hitchhiker's notes

The cache is a time–space trade, and the space is the plot of this course. You just converted O(N²) compute into O(N) memory: every token now permanently occupies bytes (about 128 KiB/token for Llama-3-8B — lab-02 computes this). One number to foreshadow: a 24 GiB GPU holds weights plus only a few dozen full-length sequences of cache. Scarcity is immediate, and scarcity is why Phases 2–3 exist.
Real transformers hide the no-cache cost inside one matmul. HuggingFace generate(use_cache=False) doesn't loop per token like your simulation; it reprocesses the whole sequence in a single (big) forward pass per step. The work is still quadratic in total — your counter models the FLOPs faithfully even though the loop structure differs.
Where the cache actually lives upstream: vllm.attention.layer.Attention writes each step's new K/V into the paged cache (via slot_mapping — Phase 2 lab-06), and the kernel reads all prior K/V (via block_table). What you modeled as a counter is, in production, tensors + an allocator + a scheduler. Same theorem underneath.
Why does the cached version call next_token(tokens) with the full list, then? Because the model function still needs the whole context semantically — the cache changes what is recomputed, not what the model "knows." In a real model, "the cache was consulted" and "the context was read" are the same act: attention over cached K/V. Don't confuse caching KV with truncating context.

Going further

Plot work_no_cache / work_cached for n in 1..2000 — confirm the ~N/2 line. Then plot cached work alone: a flat 1/step. That flat line is why decode latency is stable and why per-token pricing is linear. Economics from asymptotics.
Model prompt length: sweep P from 10 to 10,000 at fixed n_new=100. Notice prefill dominates total cached work for long prompts — the TTFT story (Phase 1) in miniature.
Add a kv_bytes counter alongside the work counter (one cache entry per compute_kv) and watch memory grow linearly while compute stays flat — you've now built both axes of lab-02 and the motivating tension of Phase 2 with ~5 extra lines.

References

Vaswani et al., Attention Is All You Need (2017) — where K/Q/V come from: https://arxiv.org/abs/1706.03762
kipply, Transformer Inference Arithmetic — the canonical blog walkthrough of KV-cache math and why decode is bandwidth-bound: https://kipp.ly/transformer-inference-arithmetic/
Pope et al., Efficiently Scaling Transformer Inference (2022) — §3 formalizes the prefill/decode split your counter just exposed: https://arxiv.org/abs/2211.05102
upstream/vllm/attention/layer.py — the production home of the cache write.
Phase 0 guide §"the KV cache" (00-guide.md) — the intuition this lab makes quantitative.

Lab 00-02 — KV-Cache Memory Calculator `[CPU-OK]`

Lab-01 ended with a cliffhanger: the KV cache converts quadratic compute into linear memory. This lab computes exactly how much memory — and the answer is the most important number in LLM serving economics: how many concurrent users fit on one GPU. You'll write the three-line formula, apply it to Llama-3-8B, and arrive at the genuinely shocking result that a 24 GiB GPU running an 8B model has room for only ~32 full-length conversations. Every dollar of inference cost, every "maximum concurrency" log line, and the entire existence of PagedAttention trace back to the arithmetic you're about to own.

Why this lab exists

This is back-of-envelope as a professional skill. A staff inference engineer gets asked, weekly, some variant of: "can we serve model X to Y users on hardware Z?" The wrong answer costs a fleet; the right answer is three multiplications you can do in a meeting. This lab installs the formula so deeply that you'll never again look at a GPU spec sheet without mentally dividing its HBM by a KV footprint.

It's also the Rosetta stone for the rest of the course. When Phase 2 lab-03's real engine prints Maximum concurrency for 2,048 tokens per request: 68.65x, that's this lab's max_concurrent_seqs evaluated against measured free HBM. When Phase 6 sells you FP8 KV, when model cards advertise GQA, when Phase 10 shards KV across GPUs — every one of those is an attack on a term of the formula you write here. Learn the formula, and the whole optimization landscape organizes itself into "which factor does this shrink?"

Background: where the bytes go

Per token, per layer, attention stores one K vector and one V vector, each of num_kv_heads × head_dim elements. Multiply it out:

kv_bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × dtype_bytes
                     ▲       ▲             ▲            ▲           ▲
                  K and V  every layer   the heads   per head   fp16 = 2

Two things to notice before you code:

num_kv_heads, not num_heads. Modern models use grouped-query attention (GQA): many query heads share each KV head precisely because someone did this lab's math and realized KV memory, not model quality, capped serving capacity. Llama-3-8B has 32 query heads but only 8 KV heads — a 4× KV saving designed into the architecture. (MQA — one KV head — is the extreme; MLA in DeepSeek compresses further still. Architecture evolution is visible in this one parameter.)
It's per token, forever. The cache for token 0 lives until the request finishes. Length × concurrency × per-token bytes must fit in what's left after weights. There is no amortization, no compression by default — just bytes, held for the lifetime of the conversation.

Files

starter.py — implement kv_bytes_per_token, kv_bytes_per_seq, max_concurrent_seqs. Your work.
solution.py — reference.
test_lab.py — pins exact numbers for Llama-3-8B, the FP8 and GQA factor effects, and the no-room edge case.

Run

LAB_IMPL=starter pytest phase-00-foundations/labs/lab-02-kv-memory-calculator -q
pytest phase-00-foundations/labs/lab-02-kv-memory-calculator -q   # reference (default)

The formulas

kv_bytes_per_token = 2 (K and V) × num_layers × num_kv_heads × head_dim × dtype_bytes
kv_bytes_per_seq   = kv_bytes_per_token × seq_len
max_concurrent     = (gpu_bytes − weight_bytes) // kv_bytes_per_seq      (0 if no room)

Integer division on purpose: you cannot serve 0.7 of a conversation. (The real engine has the same floor, expressed in blocks — Phase 2.)
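As a sketch, the three formulas translate almost symbol for symbol into Python. The function names follow starter.py, but the argument lists here are assumptions, so check them against the starter before copying anything.

```python
GiB = 1024**3
MiB = 1024**2


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    # The leading 2 counts K and V.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_bytes_per_seq(bytes_per_token, seq_len):
    return bytes_per_token * seq_len


def max_concurrent_seqs(gpu_bytes, weight_bytes, bytes_per_seq):
    free = gpu_bytes - weight_bytes
    if free <= 0:
        return 0  # no room: never report a negative user count
    return free // bytes_per_seq  # floor: no fractional conversations


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
per_seq = kv_bytes_per_seq(per_token, seq_len=2048)
users = max_concurrent_seqs(24 * GiB, 16 * GiB, per_seq)
print(per_token, per_seq // MiB, users)
```

With the Llama-3-8B fp16 figures used in this lab, that evaluates to 131,072 bytes per token, 256 MiB per sequence, and 32 sequences.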

The headline result, walked through

Llama-3-8B in fp16: num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2.

per token : 2 × 32 × 8 × 128 × 2 = 131,072 B = 128 KiB        (!)
per 2,048-token sequence: 128 KiB × 2,048 = 256 MiB
24 GiB GPU − ~16 GiB weights = 8 GiB free
concurrency: 8 GiB / 256 MiB = 32 sequences

Sit with each line:

128 KiB per token. A token is ~4 characters of text. Its cache costs as much as a small image. A 100-word answer: ~17 MB. This is why "just keep the conversation in memory" is a capacity strategy, not a triviality.
256 MiB per max-length sequence — 1.6% of the entire weights per conversation, for context alone.
32 users. An 8-billion-parameter model on a serious GPU, and the ceiling is thirty-two, and that assumes perfect packing with zero waste. Now recall (or preview) Phase 2 lab-02: pre-vLLM engines wasted 60–80% of KV memory on fragmentation, turning 32 into ~6–12. Memory, not compute, is the binding constraint of LLM serving: the single most counterintuitive fact in the field, and you just derived it.
And the punchline that launches Phase 2: since the constraint is memory, the highest-leverage engineering target is making every byte of that 8 GiB hold useful KV. That is, verbatim, PagedAttention's mission statement.

What the tests prove

Test	What it pins
`test_llama3_8b_per_token`	The exact 131,072 — get the factors and you get the field's most-quoted number
`test_llama3_8b_concurrency`	The headline 32, end to end through all three functions
`test_fp8_kv_doubles_concurrency`	`dtype_bytes` is a linear lever: halve it, double the users. (Phase 6's FP8-KV feature, justified in one assert)
`test_gqa_saves_vs_mha`	8 vs 32 KV heads = exactly 4× — why GQA exists, as arithmetic
`test_no_room_returns_zero`	Weights ≥ HBM → 0, gracefully. Capacity functions must not return −3 users

Hitchhiker's notes

The formula is the optimization roadmap. Every KV-memory technique in production attacks one factor: FP8/INT4 KV quantization → dtype_bytes; GQA/MQA/MLA → num_kv_heads × head_dim; sliding-window attention (Mistral) and hybrid SSM layers (Phase 14) → which layers store KV at all; tensor parallelism (Phase 10) → divides the whole thing across GPUs; prefix caching (Phases 2–3) → shares kv_bytes_per_seq across requests. When you read a new inference paper, your first question is now: which factor?
Weights are the rent; KV is the traffic. Weights are a fixed cost that amortizes over every request; KV scales with traffic. This is why bigger GPUs disproportionately help serving (more left over after the fixed cost) and why a 70B model on an 80 GiB GPU (~140 GB of fp16 weights, which doesn't even fit without quantization or sharding) is a different kind of problem than 8B on 24 GiB.
seq_len is the denominator you control. The formula uses worst-case length — exactly what the engine's startup "maximum concurrency" line assumes, and exactly why Phase 2 lab-03's reflection scolds the max_model_len=32768 config for a 4k workload. Capacity planning with the p99 length instead of the max is the cheapest 8× you'll ever find.
What the simple formula ignores (and where it bends in practice): activation scratch memory (vLLM measures this at startup by profiling — Phase 2 lab-03), block-granularity rounding (block_size − 1 tail waste per sequence — Phase 2), CUDA context overhead, and the fact that real workloads have a length distribution, so effective concurrency is higher than the max-length floor. The formula is your lower bound and your sanity check, not your final answer — which is also why vLLM computes blocks from measured free memory rather than trusting arithmetic.
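The "which factor?" habit is easy to rehearse in code. The sketch below re-evaluates the per-token formula while changing one factor at a time; the MHA variant is a hypothetical Llama-3-8B without GQA, and the 8 GiB of free HBM is the figure from the headline walkthrough.

```python
GiB = 1024**3


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


baseline = kv_bytes_per_token(32, 8, 128, 2)   # Llama-3-8B, fp16 KV
fp8_kv = kv_bytes_per_token(32, 8, 128, 1)     # dtype_bytes lever
mha = kv_bytes_per_token(32, 32, 128, 2)       # hypothetical: 32 KV heads, no GQA


def concurrency(bytes_per_token, seq_len, free_bytes=8 * GiB):
    return free_bytes // (bytes_per_token * seq_len)


# seq_len lever: plan for a 4k workload instead of a 32k max_model_len.
at_max_len = concurrency(baseline, 32768)
at_workload_len = concurrency(baseline, 4096)
print(baseline // fp8_kv, mha // baseline, at_max_len, at_workload_len)
```

Each lever is linear: FP8 halves the bytes, GQA with 8 of 32 heads quarters them, and planning for 4k instead of 32k multiplies concurrency by 8 (2 sequences becomes 16 on the same 8 GiB).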

Going further

Build a table for models you care about (3B/8B/70B; fp16/fp8 KV; 2k/8k/128k context) on A100-80G. Notice where concurrency drops below 1 — congratulations, you've discovered why 128k-context serving needs either massive HBM, KV offload, or architecture tricks, and why long-context pricing is what it is.
Invert the formula: given a target of 200 concurrent users at 4k context on 8B/fp16, how much free HBM do you need? (200 × 4096 × 128 KiB = 100 GiB → multi-GPU territory → Phase 10's tensor parallelism divides per-GPU KV by the shard count.)
Add block_size rounding from Phase 2 (ceil(seq_len / block_size) blocks per sequence) and quantify how little paging's tail waste costs vs the 60–80% it saves — reproducing Phase 2 lab-02's conclusion from the memory side.
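A starting point for the last two exercises, as a sketch. The helper names are made up for illustration, and `block_size=16` is an assumed example value rather than something this lab specifies.

```python
import math

KiB, GiB = 1024, 1024**3
PER_TOKEN = 128 * KiB  # Llama-3-8B, fp16 KV, from the headline result


def required_kv_bytes(users, seq_len, per_token=PER_TOKEN):
    """Inverted formula: free HBM needed to hold `users` full-length sequences."""
    return users * seq_len * per_token


def paged_tokens(seq_len, block_size):
    """Token slots actually allocated when KV is handed out in whole blocks."""
    return math.ceil(seq_len / block_size) * block_size


def tail_waste_fraction(seq_len, block_size):
    allocated = paged_tokens(seq_len, block_size)
    return (allocated - seq_len) / allocated


needed_gib = required_kv_bytes(200, 4096) / GiB
waste = tail_waste_fraction(1000, block_size=16)  # block_size is an assumption
print(needed_gib, paged_tokens(1000, 16), round(waste, 4))
```

For a 1,000-token sequence with 16-token blocks, 1,008 slots are allocated and 8 sit empty: under 1% tail waste, against the 60–80% fragmentation figure quoted for pre-paging engines.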

References

kipply, Transformer Inference Arithmetic — the classic source for exactly this per-token math: https://kipp.ly/transformer-inference-arithmetic/
Ainslie et al., GQA: Training Generalized Multi-Query Transformer Models (2023) — why num_kv_heads shrank industry-wide: https://arxiv.org/abs/2305.13245
Pope et al., Efficiently Scaling Transformer Inference (2022) — KV memory as the scaling bottleneck, formalized: https://arxiv.org/abs/2211.05102
Kwon et al., PagedAttention (SOSP 2023) — what happens next to these bytes: https://arxiv.org/abs/2309.06180
upstream/vllm/v1/core/kv_cache_utils.py::get_kv_cache_configs — your formula, running at every engine startup (Phase 2 lab-03 reads its output).

Lab 00-03 — From Logits to Token: Sampling Basics `[CPU-OK]`

A language model does not produce words. It produces logits — one raw score per vocabulary entry, 257 of them in mini_vllm, 128k+ in Llama-3 — and something has to collapse that scoreboard into the single token the user sees. That something is the sampler, and in this lab you build it: greedy, temperature, top-k, and top-p (nucleus), exactly mirroring mini_vllm/sampler.py — the final test literally checks that your sampler and the engine's agree token-for-token across a grid of configurations.

This is the last piece of the foundations: lab-01 gave you the loop, lab-02 the memory, lab-04 the speed limits — this one gives you the decision each loop iteration ends with.

Why this lab exists

Sampling parameters are the most-touched, least-understood interface in all of LLM serving. Every API request carries them; every "the model got worse" support ticket is ~30% likely to be a sampling change; and every inference engineer eventually debugs an incident where the answer was "someone set top_k=1 and wondered why outputs got repetitive." You should know these four knobs the way a DBA knows isolation levels — not as folklore ("0.7 is creative!") but as the small, exact algorithms they are.

There's an engine-design reason too. The sampler sits at a peculiar spot in the architecture: it's the only stage that's per-request configurable and stochastic, in the middle of a pipeline that is otherwise batched and deterministic. Getting determinism back when you need it (tests! reproducibility! debugging!) takes deliberate design — the seed parameter, greedy mode — and the entire course's testing strategy (every phase's "identical output" invariants) leans on the greedy shortcut you'll implement here. When Phase 9 expands sampling into penalties, logit processors, parallel sampling, and GPU vectorization, this lab is the kernel of truth it builds on.

Background: the knobs, and what they're actually for

Order matters — this is a pipeline, and each stage reshapes the distribution the next one sees (your sample must apply them in exactly this order to match the engine):

Temperature: divide all logits by T before softmax. T<1 sharpens (rich get richer), T>1 flattens (underdogs get a chance), T→0 approaches argmax. It's the only knob that reweights rather than truncates. The T == 0.0 case is special-cased as pure argmax, both because dividing by zero is undefined and because greedy must be exactly deterministic, with no RNG involved at all.
Top-k — keep the k highest logits, set the rest to −∞ (probability zero after softmax). A blunt truncation: k=1 is greedy-with-extra-steps, k=50 trims the long tail of nonsense tokens. Its weakness: k is fixed while the distribution's actual "width" varies wildly per step (after "The capital of France is" there's one good token; after "My favorite" there are hundreds).
Top-p (nucleus) — keep the smallest set of tokens whose cumulative probability ≥ p. Adaptive where top-k is fixed: confident steps keep few tokens, uncertain steps keep many. The subtle spec detail your implementation must honor: the token that crosses the threshold is included (else p=0.5 over probs [0.4, 0.4, 0.2] would keep only 0.4 < 0.5 — an under-full nucleus).
Softmax + one draw — normalize what survives and draw once with np.random.default_rng(seed). Seeded → reproducible; unseeded → fresh entropy per call.
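The four stages above can be sketched in NumPy as follows. The function names match starter.py, but the signatures, the tie handling in `apply_top_k`, and the use of `rng.choice` for the draw are assumptions; the engine's sampler may draw differently, so this sketch is not guaranteed to pass the agreement test.

```python
import numpy as np


def softmax(logits):
    z = logits - np.max(logits)  # stability: largest exponent becomes 0
    e = np.exp(z)
    return e / e.sum()


def apply_top_k(logits, k):
    logits = np.asarray(logits, dtype=np.float64)
    if k <= 0 or k >= logits.size:
        return logits  # disabled: pass through unchanged
    kth_largest = np.partition(logits, -k)[-k]
    # Note: exact ties at the k-th value all survive in this sketch.
    return np.where(logits >= kth_largest, logits, -np.inf)


def apply_top_p(logits, p):
    logits = np.asarray(logits, dtype=np.float64)
    if p >= 1.0:
        return logits
    probs = softmax(logits)
    order = np.argsort(-probs, kind="stable")  # most probable first
    cum = np.cumsum(probs[order])
    # First position where cum >= p, plus one: the crossing token is included.
    n_keep = int(np.searchsorted(cum, p, side="left")) + 1
    out = np.full_like(logits, -np.inf)
    out[order[:n_keep]] = logits[order[:n_keep]]
    return out


def sample(logits, temperature=1.0, top_k=0, top_p=1.0, seed=None):
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0.0:
        return int(np.argmax(logits))  # greedy: no RNG, other knobs ignored
    logits = apply_top_p(apply_top_k(logits / temperature, top_k), top_p)
    probs = softmax(logits)
    return int(np.random.default_rng(seed).choice(probs.size, p=probs))


nucleus = apply_top_p(np.log([0.4, 0.4, 0.2]), 0.5)
print(np.isfinite(nucleus).tolist())
```

On the `[0.4, 0.4, 0.2]` example with p = 0.5, two tokens survive: the second 0.4 is the one that crosses the threshold, and it is kept.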

And the stability clause: softmax must subtract the max before exponentiating. exp(1000) overflows float64; logits in the hundreds are perfectly normal outputs of an unnormalized final layer. This one line is the difference between a sampler and a NaN generator, and the test feeds you logits of 1000+ to make sure it's there.
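A quick way to watch the clause earn its keep: run a naive softmax and a max-subtracting one on the same large logits. This is a standalone sketch; `softmax_naive` exists only to demonstrate the failure.

```python
import numpy as np


def softmax_naive(logits):
    e = np.exp(logits)  # exp(1000) overflows float64 to inf
    return e / e.sum()  # inf / inf = nan


def softmax_stable(logits):
    z = logits - np.max(logits)  # shift so the largest exponent is exp(0) = 1
    e = np.exp(z)
    return e / e.sum()


big = np.array([1000.0, 999.0, 998.0])
with np.errstate(over="ignore", invalid="ignore"):  # silence the expected warnings
    broken = softmax_naive(big)
stable = softmax_stable(big)
print(broken, stable)
```

Subtracting the max changes nothing mathematically (the shift cancels in the ratio), so the stable result equals the softmax of `[2, 1, 0]`.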

Files

starter.py — softmax, apply_top_k, apply_top_p, sample, each with its recipe. Your work.
solution.py — reference (functionally identical to mini_vllm/sampler.py).
test_lab.py — distribution sanity, each knob's exact semantics, determinism, and the agreement test against the engine's Sampler.

Run

LAB_IMPL=starter pytest phase-00-foundations/labs/lab-03-sampling-basics -q
pytest phase-00-foundations/labs/lab-03-sampling-basics -q   # reference (default)

What the tests prove

Test	What it pins
`test_softmax_is_a_distribution_and_is_stable`	Sums to 1, preserves order, and survives logits of 1000 — the max-subtraction clause
`test_greedy_is_argmax_and_ignores_every_other_knob`	`temperature=0` short-circuits the whole pipeline — even hostile `top_k`/`top_p`/`seed` settings can't perturb greedy. This guarantee is what every deterministic test in this course stands on
`test_top_k_keeps_exactly_k`	Survivors finite, victims −∞, disabled cases (k≤0, k≥vocab) pass through unchanged
`test_top_p_keeps_the_smallest_sufficient_nucleus`	The inclusive-crossing rule, on a hand-built distribution — and the test deliberately avoids sitting on the cumsum boundary, because float rounding flips the answer there (read the comment; it's a lesson in itself)
`test_temperature_sharpens_or_flattens`	T's monotone effect on the max probability
`test_seeded_sampling_is_reproducible`	Same logits + same seed = same token, forever
`test_agrees_with_mini_vllm_sampler`	Your sampler ≡ the engine's sampler across 15 configurations — the equivalence that makes this lab "build the real component," not "build a toy like it"

Hitchhiker's notes

−∞ is the correct "impossible," not 0. Masking logits to −∞ (probability exactly 0 after softmax) composes cleanly: later stages renormalize over survivors automatically. Masking probabilities to 0 without renormalizing — a classic homebrew-sampler bug — leaves you sampling from a distribution that sums to 0.7.
Order of operations is observable. Top-k-then-top-p (this pipeline, and vLLM's) gives different results than top-p-then-top-k for the same parameters. When two engines "with the same settings" produce different output statistics, pipeline order is suspect #2 (suspect #1 is tokenizer differences). The agreement test pins your order to the engine's.
Why np.partition instead of sorting in top-k? O(n) vs O(n log n) over the vocab, per token, per request — at 128k vocab × thousands of tokens/s this is real money. Production goes further: vLLM's V1 sampler does top-k/top-p vectorized over the whole batch on the GPU (upstream/vllm/v1/sample/), with exactly the semantics you just wrote scalar. Semantics here, performance there — the course's recurring split.
Ties under greedy: argmax takes the lowest index. Sounds trivial until two engines break ties differently and a "deterministic" comparison fails at token 947 — the fp16 near-tie problem from Phase 3 lab-02's notes, one layer down. Determinism is a stack of conventions, and you now know one more layer of it.
seed is per-request state in real engines — vLLM keeps a per-request generator so request A's draws don't perturb request B's stream under batching (Phase 9). Your per-call default_rng(seed) is the single-request simplification; the same idea, one request at a time.
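The order-of-operations note is easy to verify on a four-token toy distribution. The helpers below are illustrative only: for brevity they work on probabilities and renormalize explicitly after each truncation (the correct way to mask in probability space), rather than masking logits to −∞ as the lab does.

```python
import numpy as np


def top_k_probs(probs, k):
    keep = np.argsort(-probs, kind="stable")[:k]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()  # renormalize over survivors


def top_p_probs(probs, p):
    order = np.argsort(-probs, kind="stable")
    cum = np.cumsum(probs[order])
    n_keep = int(np.searchsorted(cum, p, side="left")) + 1  # inclusive crossing
    out = np.zeros_like(probs)
    out[order[:n_keep]] = probs[order[:n_keep]]
    return out / out.sum()


probs = np.array([0.5, 0.3, 0.15, 0.05])
k_then_p = top_p_probs(top_k_probs(probs, 2), 0.6)
p_then_k = top_k_probs(top_p_probs(probs, 0.6), 2)
print(k_then_p, p_then_k)
```

With k = 2 and p = 0.6, top-k first renormalizes the leader to 0.625, which already clears 0.6, so only one token survives. Top-p first sees the leader at 0.5, needs a second token to cross 0.6, and two survive. Same parameters, different candidate sets.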

Going further

Implement min-p (keep tokens with prob ≥ p × max-prob — an increasingly popular alternative that adapts even better than top-p) and write its boundary test. Then check: vLLM ships it (min_p in SamplingParams).
Sample 10,000 draws at T ∈ {0.3, 1.0, 2.0} from fixed logits and plot the empirical histograms against your computed distributions — a χ² eyeball test of your own sampler, and a visceral feel for what temperature does.
Read upstream/vllm/v1/sample/sampler.py and find the four stages of your pipeline in their batched form: the same algorithm, where every operation is a tensor op over [batch, vocab] and the special-casing of greedy becomes an index-select.
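For the min-p exercise, one possible starting sketch, written against the definition given above (keep tokens with prob ≥ p × max-prob). The name `apply_min_p` and its signature are hypothetical, and vLLM's own implementation may differ in detail.

```python
import numpy as np


def apply_min_p(logits, min_p):
    logits = np.asarray(logits, dtype=np.float64)
    if min_p <= 0.0:
        return logits  # disabled: pass through unchanged
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    threshold = min_p * probs.max()  # scales with the model's confidence
    return np.where(probs >= threshold, logits, -np.inf)


logits = np.log([0.6, 0.3, 0.08, 0.02])
kept = apply_min_p(logits, 0.2)  # threshold = 0.2 × 0.6 = 0.12
print(np.isfinite(kept).tolist())
```

Because the threshold is relative to the top probability, a confident step prunes aggressively while a flat distribution keeps many candidates. The most probable token always survives.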

References

mini_vllm/sampler.py — the component you just rebuilt; diff yours against it.
upstream/vllm/v1/sample/sampler.py — the batched GPU version (Phase 9 territory).
Holtzman et al., The Curious Case of Neural Text Degeneration (2019) — the paper that introduced nucleus (top-p) sampling and explains why truncation matters: https://arxiv.org/abs/1904.09751
vLLM docs, Sampling Parameters — the full production knob set your four generalize into: https://docs.vllm.ai/en/latest/api/inference_params.html
Phase 9 — penalties, logit processors, structured-output masking (Phase 12), and why sampling lives on the GPU.

Lab 00-04 — Prefill vs Decode: the Roofline Arithmetic `[CPU-OK]`

Lab-01 showed you that caching splits generation into two phases. This lab shows you the strange physics those phases live under — with six one-line functions and an A100 spec sheet. You'll compute the ridge point (the GPU's FLOP-per-byte ratio), discover that a single decoding sequence uses less than 1% of the GPU's compute, derive the hard speed limit of decode (125 tokens/s for an 8B model on an A100 — no kernel wizardry can beat it), and find the critical batch size where decode finally becomes compute-bound. These four numbers are the worldview of performance engineering; everything in Phase 18 is this lab with profilers attached.

Why this lab exists

There is a question that separates engineers who tune inference systems from engineers who guess at them: "is this workload compute-bound or memory-bound?" Every optimization belongs to one regime or the other — a faster matmul kernel does nothing for a bandwidth-bound decode; more batching does nothing for a compute-bound prefill — and applying a fix from the wrong regime is the most common way smart people waste a quarter. The roofline model answers the question with division. This lab makes you do the division until it's reflexive.

It also explains, from first principles, the economics you've been absorbing all phase: why batch size is the master lever of throughput (this lab and Phase 1 lab-04), why continuous batching was worth inventing (Phase 3), why chunked prefill mixes the two phases in one step (Phase 3 lab-05: piggybacking compute-hungry prefill onto bandwidth-starved decode steps), and why speculative decoding (Phase 8) can "spend FLOPs to save time": the FLOPs were idle anyway.

Background: the roofline model in five minutes

Every computation has an arithmetic intensity: FLOPs performed per byte moved from memory. Every processor has a ridge point: peak FLOPs ÷ memory bandwidth. The roofline law is one comparison:

intensity < ridge → memory-bound: the compute units finish early and wait for the bus. Speed = bandwidth × intensity. More FLOPs won't help; fewer bytes will.
intensity ≥ ridge → compute-bound: the bus keeps up; you're using the silicon. Speed = peak FLOPs. Fewer bytes won't help; better kernels / more hardware will.

Our minimal model of a transformer step: it must read all weights once (n_params × dtype_bytes from HBM, since the weights don't fit in on-chip cache) and perform ~2 FLOPs per parameter per token (each weight participates in one multiply-accumulate per token). So:

intensity = 2 · n_params · num_tokens / (n_params · dtype_bytes) = 2 · num_tokens / dtype_bytes

The parameter count cancels. Intensity doesn't care how big the model is — only how many tokens share one trip through the weights. That cancellation is the single most clarifying fact in inference performance: prefill (thousands of tokens per trip) and decode (one token per trip per sequence) aren't two workloads that differ in degree; they're the two opposite ends of the roofline, by construction.

(The model ignores KV/activation traffic and attention FLOPs — see Hitchhiker's notes for when that bites. As a first-order tool it's startlingly accurate.)
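The roofline comparison is small enough to sketch in a dozen lines. The function names below are illustrative and not necessarily the six in starter.py; the hardware constants are the A100 figures this lab uses (312 TFLOPs fp16, 2.0 TB/s HBM).

```python
A100_PEAK_FLOPS = 312e12  # fp16
A100_BANDWIDTH = 2.0e12   # bytes/s


def ridge_point(peak_flops, mem_bandwidth):
    """FLOPs the chip can perform per byte it can move."""
    return peak_flops / mem_bandwidth


def arithmetic_intensity(num_tokens, dtype_bytes):
    """2·n_params·num_tokens / (n_params·dtype_bytes); n_params cancels."""
    return 2 * num_tokens / dtype_bytes


def is_compute_bound(num_tokens, dtype_bytes, peak_flops, mem_bandwidth):
    return arithmetic_intensity(num_tokens, dtype_bytes) >= ridge_point(peak_flops, mem_bandwidth)


def attainable_flops(num_tokens, dtype_bytes, peak_flops, mem_bandwidth):
    """The roofline: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity(num_tokens, dtype_bytes))


ridge = ridge_point(A100_PEAK_FLOPS, A100_BANDWIDTH)
decode_util = attainable_flops(1, 2, A100_PEAK_FLOPS, A100_BANDWIDTH) / A100_PEAK_FLOPS
prefill_bound = is_compute_bound(2048, 2, A100_PEAK_FLOPS, A100_BANDWIDTH)
print(ridge, round(decode_util * 100, 2), prefill_bound)
```

Notice that `arithmetic_intensity` takes no model-size argument at all: the cancellation is visible in the signature.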

Files

starter.py — six functions, each ~one line, each a load-bearing concept. Your work.
solution.py — reference.
test_lab.py — A100 numbers, the cancellation, both regimes, the crossover, and the decode speed limit.

Run

LAB_IMPL=starter pytest phase-00-foundations/labs/lab-04-prefill-vs-decode -q
pytest phase-00-foundations/labs/lab-04-prefill-vs-decode -q   # reference (default)

The numbers, walked through

A100-80GB SXM: 312 TFLOPs fp16, 2.0 TB/s HBM. 8B model, fp16 (16 GB of weights).

ridge      : 312e12 / 2.0e12 = 156 FLOPs/byte
decode, batch 1 : intensity = 2·1/2 = 1        → 1 ≪ 156: bandwidth-bound, using 1/156 ≈ 0.6% of compute
prefill, 2048 tokens : intensity = 2048        → ≫ 156: compute-bound
crossover  : 156 tokens/step                   → the "critical batch size"
speed limit: 2.0e12 / 16e9 = 125 steps/s       → 125 tok/s per sequence, max, ever

Read them like an engineer:

0.6% compute utilization for single-stream decode. The other 99.4% of a $15k GPU is structurally idle — not because of bad kernels, but because one token per weight-trip is all the arithmetic the workload offers. This is the number that makes batching non-optional: every additional sequence in the decode batch reuses the same weight bytes, adding intensity ~1 per sequence, for free until you hit the ridge at ~156. Batch 64 → 64× the tokens/s at essentially the same step time. That free lunch is the entire economic basis of serving (and of this course's obsession with fitting more sequences in memory — lab-02 — since memory is what caps the batch).
125 tokens/s is a physics ceiling, not an engineering one: a decode step cannot complete faster than the weights can stream from HBM. Measure a well-tuned 8B/fp16 deployment and single-stream decode typically sits at 60–80% of this bound (the rest: KV reads, kernel overheads). When someone promises 500 tok/s single-stream on this hardware, they're describing quantization (fewer bytes; note dtype_bytes in your formula), speculation (Phase 8), or fiction.
Prefill at 2048 is compute-bound — which is why TTFT responds to better kernels and FlashAttention (Phase 4), while ITL responds to memory bandwidth and quantization. Two metrics, two regimes, two completely disjoint optimization menus. Now you know why Phase 3's chunked prefill mixes the phases in one batch: decode steps have idle FLOPs; prefill chunks are pure FLOPs; together they fill the roofline from both sides (Sarathi's "piggybacking", which you measured as [33, 33, …] in Phase 3 lab-05).
The crossover at 156 is worth memorizing as a shape, not a number: it moves with hardware (H100: ~295 fp16; consumer cards: lower) and dtype. "Decode needs ~ridge-many tokens per step to saturate compute" is the portable version.
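
The five numbers above come from a few lines of arithmetic. Here is a minimal sketch, assuming the A100 constants used in this lab (312 TFLOPs fp16, 2.0 TB/s) and an 8B-parameter fp16 model. The function names and signatures are illustrative assumptions, not the lab's reference solution.

```python
A100_PEAK_FLOPS = 312e12   # fp16 peak, FLOPs/s (constant used in this lab)
A100_BANDWIDTH = 2.0e12    # HBM bandwidth, bytes/s (constant used in this lab)


def ridge_point(peak_flops, bandwidth):
    """Intensity (FLOPs/byte) at which the compute and bandwidth limits meet."""
    return peak_flops / bandwidth


def intensity(tokens, dtype_bytes=2):
    """Weights-only intensity: 2*N*tokens FLOPs over N*dtype_bytes bytes.

    The parameter count N cancels, so model size does not appear.
    """
    return 2 * tokens / dtype_bytes


def regime(tokens, peak_flops, bandwidth, dtype_bytes=2):
    """Classify a step by comparing its intensity with the ridge."""
    if intensity(tokens, dtype_bytes) >= ridge_point(peak_flops, bandwidth):
        return "compute-bound"
    return "bandwidth-bound"


def decode_steps_per_second(n_params, bandwidth, dtype_bytes=2):
    """Speed limit: one step cannot finish before the weights have streamed once."""
    return bandwidth / (n_params * dtype_bytes)


def batched_tokens_per_second(batch, n_params, peak_flops, bandwidth, dtype_bytes=2):
    """Roofline throughput: the step takes the slower of weight streaming and math."""
    stream_s = n_params * dtype_bytes / bandwidth
    math_s = 2 * n_params * batch / peak_flops
    return batch / max(stream_s, math_s)


if __name__ == "__main__":
    print(ridge_point(A100_PEAK_FLOPS, A100_BANDWIDTH))               # 156.0
    print(regime(1, A100_PEAK_FLOPS, A100_BANDWIDTH))                 # bandwidth-bound
    print(regime(2048, A100_PEAK_FLOPS, A100_BANDWIDTH))              # compute-bound
    print(decode_steps_per_second(8e9, A100_BANDWIDTH))               # 125.0
    print(batched_tokens_per_second(64, 8e9, A100_PEAK_FLOPS, A100_BANDWIDTH))
```

Taking the `max` of the two times is the whole roofline in one expression: below the ridge the stream time wins and extra sequences are free, above it the math time wins and throughput flattens.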

What the tests prove

Test	What it pins
`test_a100_ridge_point`	156.0 — the one hardware constant of this lab
`test_intensity_is_just_tokens_over_dtype`	The cancellation: 8B and 70B give identical intensity. If this surprises you, reread the Background — it's the lab's central fact
`test_single_decode_is_hopelessly_bandwidth_bound`	Intensity 1 vs ridge 156
`test_prefill_is_compute_bound`	Intensity 2048 vs ridge 156 — same model, opposite regime
`test_critical_batch_size_is_the_ridge`	155 tokens: memory-bound; 156: compute-bound. The crossover is exactly the ridge, because intensity (fp16) = tokens
`test_decode_speed_limit_8b_fp16`	125 tok/s — bandwidth over weight bytes, the unbeatable ceiling
`test_batching_multiplies_decode_throughput_for_free`	Batch 64 → 8000 tok/s from the same weight stream — the free lunch, quantified

Hitchhiker's notes

Where the weights-only model bends: at long contexts, KV reads become the dominant bytes (128 KiB/token from lab-02 × thousands of tokens × batch — eventually exceeding the 16 GB weight read!). That's why long-context decode gets slower per token even though "the model is the same size," why GQA/MLA exist (shrink KV bytes), and why Phase 18 extends this lab's model with a KV-traffic term. First-order tools, knowingly applied, then refined — that's the discipline.
Why ~2 FLOPs per param per token? Each parameter sits in some matrix; processing a token multiplies it by one activation and adds into an accumulator — one FMA = 2 FLOPs. Attention adds FLOPs quadratic in sequence length on top (it's parameter-free, so it escapes this accounting); for short-to-moderate contexts the linear layers dominate and 2·N·T is a good model. The famous training version of the same estimate is 6·N·T (forward + backward); inference keeps only the 2.
Quantization through this lens: INT4 weights = 4 GB to stream = 500 steps/s ceiling, 4× decode speedup with zero kernel cleverness — bandwidth-bound workloads reward byte-shrinking one-for-one. But the same INT4 does ~nothing for compute-bound prefill (the FLOPs still happen, often in fp16 after dequant). One optimization, two regimes, two completely different value propositions — Phase 6 lives here.
The speed limit explains CUDA graphs too (Phase 5): your 125 steps/s ceiling means a decode step takes ≥ 8 ms on this 8B model (16 GB streamed at 2 TB/s) — at that scale a few hundred microseconds of kernel-launch overhead is only a few percent of the step. Shrink the model to 1B (2 GB in fp16) and steps drop toward 1 ms; suddenly launch overhead is a first-order cost, and capturing the whole step as one CUDA graph pays for itself. The roofline tells you when overhead optimizations matter, too.
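
The quantization and launch-overhead notes can be checked with the same weights-only model. This is a rough sketch: the 300 µs launch overhead is an assumed illustrative figure (the text only says "a few hundred microseconds"), and the helper names are hypothetical.

```python
A100_BANDWIDTH = 2.0e12  # bytes/s, the constant used in this lab


def decode_step_seconds(n_params, dtype_bytes, bandwidth=A100_BANDWIDTH):
    """Minimum decode step time: every weight byte is streamed once per step."""
    return n_params * dtype_bytes / bandwidth


def overhead_share(launch_overhead_s, n_params, dtype_bytes, bandwidth=A100_BANDWIDTH):
    """Fraction of the total step spent on fixed launch overhead."""
    stream_s = decode_step_seconds(n_params, dtype_bytes, bandwidth)
    return launch_overhead_s / (stream_s + launch_overhead_s)


if __name__ == "__main__":
    # fp16 vs INT4 weights on the 8B model: bytes shrink 4x, so the ceiling rises 4x.
    for name, dtype_bytes in [("fp16", 2), ("int4", 0.5)]:
        step = decode_step_seconds(8e9, dtype_bytes)
        print(f"8B {name}: {step * 1e3:.1f} ms/step, {1 / step:.0f} steps/s ceiling")

    # Same assumed 300 us of launch overhead, two model sizes.
    for name, n_params in [("8B", 8e9), ("1B", 1e9)]:
        share = overhead_share(300e-6, n_params, 2)
        print(f"{name} fp16: launch overhead is {share:.0%} of the step")
```

Under these assumptions the overhead is under 5% of an 8B step but over 20% of a 1B step, which is the shape of the argument for CUDA graphs on small models.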

Going further

Recompute everything for an H100 SXM (~990 TFLOPs fp16, ~3.35 TB/s): ridge ≈ 295, 8B decode ceiling ≈ 209 tok/s. Notice the ridge rose — new GPUs gain FLOPs faster than bandwidth, so decode gets relatively more memory-bound every generation. That trend line is why KV/weight compression research keeps accelerating.
Add a kv_bytes_per_step(batch, context_len, kv_per_token) term (from lab-02) to decode_tokens_per_second and find the context length where KV traffic overtakes the weight read for batch 64. You've just derived the long-context wall.
Plot the roofline: log-log, intensity on x, achievable FLOPs on y, the two plateaus, and drop points for decode batch {1, 8, 64, 156, 512} and prefill {128, 2048}. This single figure is the mental map for all of Phase 18 — draw it once by hand.
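
As a starting point for the first two items, here is one possible sketch. `kv_bytes_per_step(batch, context_len, kv_per_token)` uses the signature named above; the `decode_tokens_per_second` signature and the `kv_crossover_context` helper are assumptions for illustration and may differ from the lab's own code.

```python
KIB = 1024


def kv_bytes_per_step(batch, context_len, kv_per_token):
    """KV bytes read per decode step: every sequence rereads its whole context."""
    return batch * context_len * kv_per_token


def decode_tokens_per_second(batch, weight_bytes, bandwidth, context_len=0, kv_per_token=0):
    """Bandwidth-bound decode throughput with a KV-traffic term added (assumed signature)."""
    bytes_per_step = weight_bytes + kv_bytes_per_step(batch, context_len, kv_per_token)
    return batch * bandwidth / bytes_per_step


def kv_crossover_context(batch, weight_bytes, kv_per_token):
    """Context length at which KV traffic equals the weight read."""
    return weight_bytes / (batch * kv_per_token)


if __name__ == "__main__":
    # H100 SXM constants quoted in the text: ~990 TFLOPs fp16, ~3.35 TB/s.
    print("H100 ridge:", 990e12 / 3.35e12)
    print("H100 8B fp16 ceiling (steps/s):", 3.35e12 / 16e9)

    # A100, 8B fp16 (16 GB of weights), 128 KiB of KV per token, batch 64.
    ctx = kv_crossover_context(64, 16e9, 128 * KIB)
    print("KV overtakes weights at context ~", round(ctx))
    print("tok/s at context 0:", decode_tokens_per_second(64, 16e9, 2.0e12))
    print("tok/s at crossover:", decode_tokens_per_second(64, 16e9, 2.0e12, ctx, 128 * KIB))
```

At the crossover the bytes per step have doubled, so throughput is half the weights-only figure. The model is still first-order: it ignores the compute roof and any attention-kernel inefficiency.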

References

Williams et al., Roofline: An Insightful Visual Performance Model (CACM 2009) — the original: https://dl.acm.org/doi/10.1145/1498765.1498785
kipply, Transformer Inference Arithmetic — this lab's model applied end-to-end, the single best blog post in the field: https://kipp.ly/transformer-inference-arithmetic/
Pope et al., Efficiently Scaling Transformer Inference (2022) — the rigorous version, including the KV-traffic terms: https://arxiv.org/abs/2211.05102
Chen, Dissecting Batching Effects in GPT Inference (2023) — measured curves of the batch-size free lunch: https://le.qun.ch/en/blog/2023/05/13/transformer-batching/
NVIDIA A100/H100 datasheets — where the peak-FLOPs and bandwidth constants come from (always check whether a quoted TFLOPs number assumes sparsity; marketing does).
Phase 18 — this lab, with nsys/ncu attached and the simplifications removed.

Phase 00 — Exercises: Foundations

Warm-up (explain)

In one sentence: what does an LLM compute, and what is "autoregressive generation"?
Define tokens, embeddings, logits. Where in a forward pass do logits appear?
Why does a token's K and V never change once computed? Why does that justify a cache?

Core (the distinctions that matter)

Fill the table from memory: prefill vs decode — tokens/pass, bottleneck (compute vs memory bandwidth), and which latency metric (TTFT vs ITL) each drives.
Explain why decode is memory-bandwidth-bound. What must the GPU read to produce one token, and how much math does it do with it?
Why does batching help throughput specifically during decode? (Hint: what gets amortized?)

Build (your labs)

In lab-01, derive the exact no-cache work sum(P..P+n-1) and the cached work P+n. What's the ratio as n → ∞ for fixed P?
In lab-02, compute kv_bytes_per_token and max concurrency for a model of your choice (look up its config: layers, kv_heads, head_dim). Then redo it with fp8 KV cache.
A model uses MHA (num_kv_heads == num_query_heads). Show how switching to GQA with 8 KV heads changes KV memory and thus concurrency.

Design (staff-level)

You must serve a 70B model at 8k context with TTFT < 1s and ITL < 50ms on 8×A100 (80GB). Estimate KV memory per sequence and reason about how many concurrent users fit. What's the first thing you'd do to fit more?
A teammate says "let's just recompute attention each step, it's simpler." Quantify what that costs for a 2000-token generation and explain why it's a non-starter.
Using Little's Law (concurrency = throughput × latency), if you target 1000 tok/s aggregate at 50ms ITL, how many sequences must be in flight? What limits that number?

Self-grading

4, 5, 10–12 are interview-grade. Could you whiteboard each in 5 minutes? If not, re-read the guide's prefill/decode and memory sections, then drill INTERVIEW.md.

Phase 00 — Interview Questions: Foundations

Cover the answer, attempt out loud, compare. These fundamentals gate everything else — if you fumble them, the interviewer won't trust your scheduler answers.

Q1. Why is autoregressive decoding so much slower per token than prefill?

Model answer

Decode produces one token per step but must still read all of the model's weights and the whole KV cache from HBM each step, while doing only one token's worth of math — terrible arithmetic intensity, so it's memory-bandwidth-bound and the GPU's compute sits idle. Prefill amortizes the same weight read over all prompt tokens at once, so it's compute-bound and far more efficient per token. Same kernels, opposite bottlenecks.

Q2. What is the KV cache and why does it dominate serving memory?

Model answer

It stores the Key and Value vectors of every prior token so attention need not recompute them (they never change). Without it, every step re-runs the forward pass over the whole prefix, so generating N tokens costs O(N²) token-passes; with it, each token is processed once, O(N). Its size is 2 × layers × kv_heads × head_dim × dtype_bytes per token and it grows linearly with batch size and sequence length, so at scale it dwarfs the weights and caps how many concurrent requests fit. For Llama-3-8B that's ~128 KiB/token; a few thousand tokens × a few dozen users fills tens of GB.

Q3. Walk me through prefill vs decode.

Model answer

Prefill is the first pass over the whole prompt: many tokens, one pass, compute-bound, fills the prompt's KV cache, determines TTFT. Decode is every subsequent single-token step: one token, memory-bandwidth-bound (read all weights + KV), determines ITL/TPOT. The scheduler treats both uniformly as "advance num_computed_tokens toward num_tokens," which is why chunked prefill and continuous batching fall out naturally (Phase 3).

Q4. How would you estimate KV-cache memory for a deployment?

Model answer

kv_bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × dtype_bytes; multiply by max sequence length for per-sequence bytes; concurrent capacity ≈ (HBM − weights) / per-sequence bytes. Watch for GQA (kv_heads ≪ query_heads shrinks it), fp8 KV cache (halves dtype_bytes), and the fact that real engines reserve some HBM for activations and CUDA-graph buffers, so usable KV is a bit less than the naive free figure.
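
The estimate is short enough to write down. This is a sketch under stated assumptions: a Llama-3-8B-shaped config (32 layers, 8 KV heads, head_dim 128), an 80 GB card, 16 GB of fp16 weights, and an assumed 8 GB reserve for activations and CUDA-graph buffers. The reserve is a made-up illustrative figure, and `max_concurrent_sequences` is a hypothetical helper.

```python
GB = 10**9


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of K and V stored per token: the leading 2 is for K plus V."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_sequences(hbm_bytes, weight_bytes, reserved_bytes, seq_len, per_token_bytes):
    """Whole sequences of length seq_len that fit in the memory left for KV."""
    usable = hbm_bytes - weight_bytes - reserved_bytes
    return max(0, usable // (seq_len * per_token_bytes))


if __name__ == "__main__":
    configs = {
        "GQA, fp16 KV": kv_bytes_per_token(32, 8, 128, dtype_bytes=2),
        "GQA, fp8 KV": kv_bytes_per_token(32, 8, 128, dtype_bytes=1),
        "MHA, fp16 KV": kv_bytes_per_token(32, 32, 128, dtype_bytes=2),
    }
    for name, per_token in configs.items():
        fits = max_concurrent_sequences(80 * GB, 16 * GB, 8 * GB, 8192, per_token)
        print(f"{name}: {per_token // 1024} KiB/token, {fits} sequences at 8k context")
```

With these inputs the three rows come out to 52, 104, and 13 sequences: fp8 KV doubles capacity, and MHA with 32 KV heads quarters it relative to GQA with 8.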

Q5. Why does batching improve throughput, and what's the cost?

Model answer

In decode, reading the model weights from HBM is the dominant cost and is shared across a batch — so processing B sequences together costs barely more than one, multiplying throughput. The cost is latency: each step does more work, and (via Little's Law) higher concurrency means each request waits longer. The scheduler navigates this; Phase 18 tunes it.
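
Little's Law turns the throughput/latency trade into one multiplication. A tiny sketch with made-up target numbers (2000 tok/s aggregate at 40 ms ITL); the function names are hypothetical.

```python
def per_sequence_tokens_per_second(itl_ms):
    """One sequence emits one token per inter-token latency."""
    return 1000.0 / itl_ms


def sequences_in_flight(aggregate_tok_s, itl_ms):
    """Little's Law: concurrency = throughput x latency."""
    return aggregate_tok_s * itl_ms / 1000.0


if __name__ == "__main__":
    print(per_sequence_tokens_per_second(40))   # 25.0 tok/s per user
    print(sequences_in_flight(2000, 40))        # 80.0 sequences must be running
```

Whether 80 sequences fit is then a KV-memory question, and whether the step still finishes in 40 ms at that batch size is a roofline question.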

Rapid-fire

Tokens are roughly? ~¾ of a word; integer ids from a tokenizer.
Logits are? Pre-softmax scores over the whole vocabulary for the next token.
Decode bottleneck? Memory bandwidth. Prefill bottleneck? Compute.
TTFT driven by? Prefill. ITL driven by? Decode.
KV bytes/token formula? 2 × layers × kv_heads × head_dim × dtype_bytes.
The engine's master variables? num_computed_tokens chasing num_tokens.

Phase 00 — Cheatsheet: Foundations

The one-liner

An LLM predicts the next token; generation loops that. Serving = doing it fast for many users. Memory (the KV cache), not compute, is the cap.

The loop

tokenize → (prefill the prompt) → loop[ forward → sample → append ] → detokenize. Real: EngineCore.step = schedule → execute → sample → update (core.py:428).

Prefill vs decode

	prefill	decode
tokens/pass	many	one
bound by	compute (FLOPs)	memory bandwidth
latency	TTFT	ITL/TPOT
fills	prompt KV	one KV/step

KV cache

Exists because K/V never change once computed → cache them → O(N²) work becomes O(N).
kv_bytes_per_token = 2 × layers × kv_heads × head_dim × dtype_bytes.
Llama-3-8B fp16 ≈ 128 KiB/token. Concurrency ≈ (HBM − weights) / (per_token × seq_len).
Shrink it: GQA (fewer kv_heads), fp8 KV (half dtype), shorter context, paging (Phase 2).

The master model

A request = num_computed_tokens racing num_tokens. Prefill = far behind; decode = one behind. (vllm/v1/request.py:239; mirrored in mini_vllm/request.py.)

Throughput vs latency

Bigger batch → more throughput (amortize weight reads), worse per-request latency. Little's Law: concurrency = throughput × latency. The scheduler (Phase 3) and tuning (Phase 18) live here.

Key upstream

vllm/model_executor/models/llama.py — a real forward pass (Q/K/V at LlamaAttention.forward)
vllm/v1/request.py:239 — the counters
vllm/v1/engine/core.py:428 — EngineCore.step

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

Phase 01 — The Hitchhiker's Guide to vLLM's Architecture & Request Lifecycle

← Phase 00 · Course home · Phase 02 →

This is Chapter 1. Phase 0 taught you what one forward pass is and why it's slow. This chapter zooms out to the whole machine: how vLLM turns that single forward pass into a service that streams tokens to thousands of users at once. We build the architecture the way you'd design it yourself — by starting with the obvious naive server, watching it fail, and fixing each failure. By the end you'll be able to trace any request from an HTTP call to a streamed token and name every component it touches. That mental map is what lets you navigate 500,000 lines of code without drowning.

How to read this chapter. Everyday explanation throughout; paragraphs marked 🔬 Going deeper add the systems rigor for the expert track.

1.1 Don't Panic — the architecture in one breath

A request enters as a string and leaves as tokens. In between it passes through a handful of well-named components, and a tiny loop runs the model over and over until the request is done.

vLLM looks enormous, but the path a request takes is short:

  "Tell me a joke"
        │  tokenize, wrap as a request
        ▼
   LLM  /  AsyncLLM            ← the front door (offline batch  /  online server)
        │  add_request
        ▼
   EngineCore.step()  ──────────────── the heartbeat (runs every ~10–50 ms) ───────────┐
        │   1. schedule()              who runs this step, and how many tokens   (Ph 3) │
        │   2. execute_model()         run the model on the assembled batch     (Ph 4+) │  loop
        │   3. sample_tokens()         pick the next token for each sequence      (Ph 9)│  until
        │   4. update_from_output()    advance counters, retire finished reqs     (Ph 3)│  done
        ▼                                                                               │
   Detokenizer / OutputProcessor  ← token IDs → text, streamed back ───────────────────┘
        │
        ▼
   " Why did the function..."   (streamed token by token)

That four-stage step() loop is vLLM. Every later phase is a deep dive into one box of it. The rest of this chapter explains why it's shaped this way and what each piece does.

1.2 Let's design it ourselves: why the naive server fails

The fastest way to understand vLLM's architecture is to build the obvious version in your head and watch it break. Each break motivates a real component.

Attempt 1 — a function call. "Just call model.generate(prompt) per request." This works for one user. But it serves requests one at a time: while user A's 500-token answer generates, users B–Z wait. And from Phase 0 §0.10, a single decode stream uses ~1% of the GPU (it's memory-bound at batch 1). You're paying for a Ferrari and driving it in a parking lot. → We must run many requests together (batching).

Attempt 2 — static batching. "Collect N requests, run them as a batch until all finish." Better GPU use, but two new problems:

Requests have different lengths. A batch finishes at the speed of its slowest member; short requests sit idle in the batch, wasting their slot. (We'll fix this with continuous batching — re-decide the batch every step — in Phase 3.)
New requests that arrive mid-batch must wait for the whole batch to finish before they can start. Terrible tail latency. → We need a component that re-plans the batch every single step: the scheduler.
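
The cost of static batching can be counted in steps. The following toy simulation is a sketch, not vLLM's scheduler: each request is reduced to the number of decode steps it needs, prefill is ignored, and both function names are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Run fixed groups to completion: each group costs as many steps as its longest member."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Re-plan every step: a freed slot is refilled from the queue on the next step."""
    waiting = deque(lengths)
    running = []          # remaining steps for each running request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        running = [r - 1 for r in running]        # every running request emits one token
        running = [r for r in running if r > 0]   # finished requests leave immediately
        steps += 1
    return steps


if __name__ == "__main__":
    lengths = [2, 8, 3, 8, 2, 3]   # output lengths, all assumed >= 1
    print(static_batching_steps(lengths, batch_size=2))      # 19
    print(continuous_batching_steps(lengths, batch_size=2))  # 13
```

The six requests need 26 tokens in total, so 13 steps with two slots means no slot was ever idle. Static batching spends 19 steps on the same work because short requests hold their slot until the longest member of their group finishes.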

Attempt 3 — scheduler + model, in one Python process, inside the web server. Now the GPU loop and the HTTP server share a process. Problems:

The tight, latency-critical GPU loop competes with HTTP parsing, JSON serialization, and detokenization for Python's global interpreter lock (the GIL), which lets only one thread execute Python bytecode at a time. A burst of requests stalls the GPU.
Multi-GPU (Phase 10) needs multiple processes anyway. → Isolate the engine in its own process, talk to it over a queue. That's vLLM's V1 design.

So the architecture isn't arbitrary — each component is the answer to a specific failure of the naive version. Now let's name the real pieces.

🆕 New words: batching (run many requests together), static vs continuous batching, scheduler (re-plans the batch each step), the GIL (Python's single-thread lock — why the engine gets its own process).

1.3 Two front doors, one engine

vLLM has two entry points, and the crucial insight is that both are thin shells over the same engine core:

Offline / batch: LLM(model=...).generate(prompts) — vllm/entrypoints/llm.py. You hand it a list of prompts; it returns a list of results when all are done. Synchronous. This is what mini_vllm's LLMEngine.generate mirrors, and what you use in scripts and evals.
Online / serving: an HTTP server (OpenAI-compatible, Phase 16) → AsyncLLM (vllm/v1/engine/async_llm.py) → the same core, but async and streaming — it yields each token as it's produced so the user sees text appear live.

Both funnel into EngineCore (vllm/v1/engine/core.py). Internalize this: batch and server are skins; the engine is one. When you fix something in the core, you fix it for both.

1.4 The objects a request becomes (and why each exists)

A request changes form as it travels — and each form is a deliberate data type. Knowing them means that when you read a stack trace, you instantly know which stage you're in by the type in hand.

Object	Lives between	Carries	Why it exists
prompt + `SamplingParams`	user → server	the text + decoding knobs (temperature, max_tokens, `n`, stop)	the user's intent
`EngineCoreRequest`	input proc → core	tokenized prompt + params + a request id	a serializable unit to cross the process boundary
`Request`	inside the scheduler	the live request: token ids, `num_computed_tokens` / `num_tokens`, status, block table	the engine's working state (Phase 0's two counters!)
`SchedulerOutput`	scheduler → executor	who runs, how many tokens each, block tables, etc.	the per-step plan
`ModelRunnerOutput`	executor → core	sampled token ids, logprobs	the model's result
`RequestOutput`	core → user	generated text/tokens (a delta, when streaming)	what the caller receives

🔬 Going deeper. The split between EngineCoreRequest (crosses the process boundary, so it's a plain serializable struct) and Request (rich, mutable, lives only inside the engine process) is not incidental — it's the seam where the IPC boundary sits (§1.6). And RequestOutput being a delta in streaming mode (only the new tokens since last time) is what makes server-sent-events streaming cheap. Naming is half of understanding a system; learn these six.

🆕 New words: SamplingParams, EngineCoreRequest, Request, SchedulerOutput, ModelRunnerOutput, RequestOutput.

1.5 The heartbeat dissected: `EngineCore.step()`

The engine is a loop. Each tick (step()) advances every in-flight request by some tokens. Here is the loop with each stage explained — this is the spine of the whole system:

def step(self):
    scheduler_output = self.scheduler.schedule()                    # 1. PLAN
    model_output     = self.model_executor.execute_model(...)       # 2. RUN
    # (sampling happens inside/after execute; shown separate for clarity)
    sampled          = self.model_executor.sample_tokens(...)       # 3. PICK
    outputs          = self.scheduler.update_from_output(...)       # 4. BOOKKEEP
    return outputs

Schedule (the plan) — the scheduler looks at every waiting and running request and the free KV memory, and decides: who runs this step, and how many tokens does each get? This is where continuous batching, chunked prefill, prefix caching, and preemption happen (Phases 2–3). Output: a SchedulerOutput.
Execute (run the model) — the executor turns that plan into actual tensors (gather the scheduled tokens, build the attention metadata — block tables and sequence lengths from Phases 2–4) and runs the forward pass on the GPU (possibly as a CUDA graph, Phase 5). This is where kernels, quantization, parallelism, and the model itself live (Phases 4–7, 10, 13, 14).
Sample (pick tokens) — turn the model's logits into one new token per sequence, applying each request's own sampling params, grammar masks, etc. (Phases 8, 9, 12).
Bookkeep (update) — append the sampled tokens, advance each request's num_computed_tokens, detect which requests just finished (hit EOS or max length), free their KV blocks, and emit outputs (Phase 3).

Then it loops. A request might be touched by a few hundred ticks over its lifetime (one per output token, after prefill). Every box of this loop maps to a phase of the course — keep this diagram open as your table of contents.
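
To make the four beats concrete, here is a self-contained toy loop. It is a sketch, not vLLM or mini_vllm code: `ToyEngine`, `ToyRequest`, the fake "model" (it always predicts last token + 1), and the per-step token budget are all invented for illustration. Only the four stage names are taken from the text.

```python
from dataclasses import dataclass

VOCAB_SIZE = 16
EOS_TOKEN = 0


@dataclass
class ToyRequest:
    request_id: str
    token_ids: list            # prompt tokens, then generated tokens
    max_new_tokens: int
    num_computed_tokens: int = 0
    num_generated: int = 0

    @property
    def num_tokens(self):
        return len(self.token_ids)


class ToyEngine:
    def __init__(self, token_budget=8):
        self.token_budget = token_budget
        self.requests = {}

    def add_request(self, request):
        self.requests[request.request_id] = request

    def schedule(self):
        """1. PLAN: give each request the tokens it is behind by, within the budget."""
        plan, budget = {}, self.token_budget
        for req in self.requests.values():
            n = min(req.num_tokens - req.num_computed_tokens, budget)
            if n > 0:
                plan[req.request_id] = n
                budget -= n
        return plan

    def execute_model(self, plan):
        """2. RUN: fake forward pass, one row of 'logits' per scheduled request."""
        logits = {}
        for rid in plan:
            row = [0.0] * VOCAB_SIZE
            row[(self.requests[rid].token_ids[-1] + 1) % VOCAB_SIZE] = 1.0
            logits[rid] = row
        return logits

    def sample_tokens(self, logits):
        """3. PICK: greedy argmax."""
        return {rid: row.index(max(row)) for rid, row in logits.items()}

    def update_from_output(self, plan, sampled):
        """4. BOOKKEEP: advance counters, append tokens, retire finished requests."""
        finished = []
        for rid, n in plan.items():
            req = self.requests[rid]
            req.num_computed_tokens += n
            if req.num_computed_tokens < req.num_tokens:
                continue   # prompt only partly processed: nothing to sample yet
            req.token_ids.append(sampled[rid])
            req.num_generated += 1
            if sampled[rid] == EOS_TOKEN or req.num_generated >= req.max_new_tokens:
                finished.append(req)
        for req in finished:
            del self.requests[req.request_id]
        return finished

    def step(self):
        plan = self.schedule()
        logits = self.execute_model(plan)
        sampled = self.sample_tokens(logits)
        return self.update_from_output(plan, sampled)


if __name__ == "__main__":
    engine = ToyEngine()
    engine.add_request(ToyRequest("A", [3, 4, 5], max_new_tokens=3))
    engine.add_request(ToyRequest("B", [7], max_new_tokens=2))
    while engine.requests:
        for done in engine.step():
            print(done.request_id, done.token_ids)
```

Request B finishes on the second step and A on the third. The first step processes A's three prompt tokens and B's single one in the same batch, and the loop needs no separate code path for prefill and decode.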

🔬 Going deeper — the real step is even leaner. In core.py the four stages are visible almost verbatim (you'll read them in the deep-dive). Two production wrinkles: (a) execute_model can run asynchronously (return a future) so the scheduler can plan the next step while the GPU works on this one — overlapping CPU and GPU; (b) a grammar bitmask for structured output (Phase 12) is computed between schedule and sample. Don't let those obscure the four-beat rhythm: plan → run → pick → bookkeep.

1.6 The process architecture: why the engine lives alone

From §1.2, the engine must not share a Python thread with the web server. So V1 runs EngineCore in its own process (EngineCoreProc). The picture:

   ┌─────────────── API server process ───────────────┐        ┌──── EngineCore process ────┐
   │  HTTP / OpenAI endpoints  (Phase 16)              │        │  scheduler                 │
   │  tokenization, request validation                 │  IPC   │  the model + KV cache      │
   │  AsyncLLM  ── EngineCoreRequest ──────────────────┼───────▶│  step() loop               │
   │  detokenization, streaming  ◀── EngineCoreOutputs─┼────────┤                            │
   └───────────────────────────────────────────────────┘        └────────────────────────────┘

Why this split is worth a whole process boundary:

The scheduling loop stays tight — no HTTP work, JSON, or detokenization steals its thread or contends for the GIL. The GPU is never starved by web-server bookkeeping.
Detokenization and streaming run on the server side, off the engine's hot path — turning token IDs back into text and formatting SSE chunks happens in parallel with the next step().
It generalizes to multi-GPU: the core process becomes the coordinator of worker processes (next section).

The cost is that requests and outputs must be serialized across the boundary (that's why EngineCoreRequest/EngineCoreOutputs are plain structs, §1.4). It's a price worth paying for an uninterrupted GPU loop.
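
The seam between the plain wire struct and the rich in-engine object can be sketched in a few lines. Everything here is a stand-in: the `Toy*` classes are not vLLM's definitions, and JSON is used only because it is in the standard library, with no claim about the wire format the real engine uses.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToyEngineCoreRequest:
    """Plain, immutable data: safe to serialize and send to another process."""
    request_id: str
    prompt_token_ids: list
    max_tokens: int


@dataclass
class ToyRequest:
    """Rich, mutable working state: lives only inside the engine process."""
    request_id: str
    token_ids: list
    max_tokens: int
    num_computed_tokens: int = 0
    status: str = "WAITING"

    @classmethod
    def from_wire(cls, wire):
        return cls(wire.request_id, list(wire.prompt_token_ids), wire.max_tokens)


def send_across_boundary(wire_request):
    """API-server side: flatten the struct to bytes."""
    return json.dumps(asdict(wire_request)).encode("utf-8")


def receive_in_engine(payload):
    """Engine side: rebuild the struct, then wrap it in live state."""
    wire = ToyEngineCoreRequest(**json.loads(payload.decode("utf-8")))
    return ToyRequest.from_wire(wire)


if __name__ == "__main__":
    payload = send_across_boundary(ToyEngineCoreRequest("req-0", [11, 12, 13], 16))
    request = receive_in_engine(payload)
    request.num_computed_tokens = 3   # only the engine-side object is ever mutated
    print(request)
```

Keeping the wire type free of counters, status, and block tables is the design point: anything that changes every step stays on the engine side and never needs to cross the boundary.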

🆕 New words: IPC (inter-process communication), EngineCoreProc (the engine's own process), SSE (server-sent events — the streaming protocol).

1.7 Who actually touches the GPU: Executor → Worker → ModelRunner

EngineCore decides what to run; it does not run the model itself. That's delegated down a chain whose whole purpose is to make the same engine run on 1 GPU or 64:

EngineCore
  └─ Executor          (vllm/v1/executor/)   owns the worker(s); the engine's handle to compute
       └─ Worker        (vllm/v1/worker/gpu_worker.py)   one per GPU: holds a model shard + its KV cache
            └─ ModelRunner  (gpu_model_runner.py)   SchedulerOutput → input tensors → forward → sampler

Executor — for a single GPU it's a UniProcExecutor (just calls the one worker). For tensor/pipeline parallelism (Phase 10) it's a MultiprocExecutor that owns N worker processes and broadcasts each step's plan to all of them.
Worker — owns one GPU: its device, its slice of the model's weights, and its slice of the KV cache. Runs in lockstep with its peers.
ModelRunner — the busiest object in the engine. It takes the SchedulerOutput, prepares the input tensors (gathers the scheduled tokens, builds the attention metadata: block tables + sequence lengths + slot mapping — Phases 2/4), runs the (possibly CUDA-graphed) forward pass, and runs the sampler. You'll return to gpu_model_runner.py in Phases 4, 5, 9, 13.

The elegance: the model code is identical whether you run on 1 GPU or 64 — it just uses parallel layers, and the Executor fans the work out. Scaling out changes the Executor, nothing above it.
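
A toy version of that fan-out, as a sketch only: the "model" is a single dot product, each hypothetical `ToyWorker` holds one slice of the weights, and the executor sums the partial results (loosely analogous to the reduction in tensor parallelism). None of these classes are vLLM's.

```python
class ToyWorker:
    """Owns one shard of the weights, as a real worker owns one GPU's slice."""

    def __init__(self, weight_shard, start):
        self.weight_shard = weight_shard
        self.start = start   # offset of this shard within the full weight vector

    def execute_model(self, activations):
        end = self.start + len(self.weight_shard)
        return sum(w * x for w, x in zip(self.weight_shard, activations[self.start:end]))


class ToyExecutor:
    """The engine's only handle to compute: same interface for 1 or N workers."""

    def __init__(self, weights, num_workers):
        shard = -(-len(weights) // num_workers)   # ceiling division
        self.workers = [
            ToyWorker(weights[i:i + shard], i) for i in range(0, len(weights), shard)
        ]

    def execute_model(self, activations):
        partials = [worker.execute_model(activations) for worker in self.workers]
        return sum(partials)   # combine the per-worker partial results


def engine_step(executor, activations):
    """Engine-level code: it never asks how many workers sit behind the executor."""
    return executor.execute_model(activations)


if __name__ == "__main__":
    weights = [1, 2, 3, 4, 5, 6, 7, 8]
    activations = [2, 0, 1, 3, 1, 1, 0, 2]
    for n in (1, 2, 4):
        print(n, "worker(s):", engine_step(ToyExecutor(weights, n), activations))
```

All three configurations print the same value because `engine_step` and the arithmetic are untouched; only the executor's constructor knows how the weights were split.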

🔬 Going deeper. This is also where the prepare-inputs cost lives — assembling ragged, variable-length batches into padded tensors and metadata every step is real CPU work, and at small batch it can rival the GPU time. That's a major reason CUDA graphs (Phase 5) and careful tensor reuse matter, and why gpu_model_runner.py is so heavily optimized. When you profile a slow deployment (Phase 18), this file is a frequent suspect.

1.8 The request lifecycle: a state machine

Inside the engine, each request moves through a small set of states (RequestStatus in vllm/v1/request.py). Understanding the states — and especially the transitions — is how you reason about latency, fairness, and failures.

     (arrives)
        │
        ▼
   ┌─────────┐   admitted by      ┌─────────┐   generates a token   ┌──────────────────┐
   │ WAITING │ ───scheduler────▶ │ RUNNING │ ───each step────────▶ │ FINISHED_*       │
   └─────────┘   (KV allocated)  └─────────┘   until stop/maxlen    │ (STOPPED/LENGTH/ │
        ▲                            │                              │  ABORTED/ERROR)  │
        │      preempted: out of KV  │                              └──────────────────┘
        └────────── PREEMPTED ◀──────┘   (KV freed; re-admitted later, recomputed — Phase 3)

WAITING — admitted to the engine, queued, not yet running (no KV allocated yet).
RUNNING — actively generating; has KV blocks; touched every step.
PREEMPTED — was running, but the engine ran out of KV memory and evicted it to make progress on others; it goes back to WAITING and is recomputed when memory frees (Phase 3's safety valve).
FINISHED_* — terminal: hit a stop token (STOPPED), hit max length (LENGTH_CAPPED), was cancelled (ABORTED), or errored. Its KV is freed and the final output returned.

🔬 Going deeper. Real vLLM has extra "waiting" sub-states for requests blocked on something other than the queue: waiting for a structured-output grammar to compile (Phase 12), waiting for KV to arrive over the network in disaggregated serving (Phase 15), etc. They're still "not ready to run," just for richer reasons. Also note the enum ordering trick: is_finished is simply status > PREEMPTED, so the terminal states are defined by position in the enum — a tiny detail that reduces the hot-path check to a single integer comparison. You'll trace this exact state machine in lab-01.
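
The state machine and the ordering trick fit in a short sketch. `ToyRequestStatus` is modeled on the states named in this section, not copied from `vllm/v1/request.py`; the transition table covers only the arrows drawn in the diagram above and leaves out the extra waiting sub-states.

```python
import enum


class ToyRequestStatus(enum.IntEnum):
    WAITING = enum.auto()
    RUNNING = enum.auto()
    PREEMPTED = enum.auto()
    # Every member declared after PREEMPTED is terminal; order is the contract.
    FINISHED_STOPPED = enum.auto()
    FINISHED_LENGTH_CAPPED = enum.auto()
    FINISHED_ABORTED = enum.auto()
    FINISHED_ERROR = enum.auto()


def is_finished(status):
    """Terminal check by enum position: one integer comparison."""
    return status > ToyRequestStatus.PREEMPTED


S = ToyRequestStatus
FINISHED_STATES = {s for s in S if is_finished(s)}

# Only the arrows from the diagram in this section.
ALLOWED_TRANSITIONS = {
    S.WAITING: {S.RUNNING},
    S.RUNNING: {S.PREEMPTED} | FINISHED_STATES,
    S.PREEMPTED: {S.WAITING},
}


def transition(current, new):
    """Return the new status, or raise if the diagram has no such arrow."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new


if __name__ == "__main__":
    status = S.WAITING
    for nxt in (S.RUNNING, S.PREEMPTED, S.WAITING, S.RUNNING, S.FINISHED_STOPPED):
        status = transition(status, nxt)
        print(status.name, "finished" if is_finished(status) else "live")
```

The price of the trick is that member order becomes load-bearing: inserting a new non-terminal state after `PREEMPTED` would silently mark it as finished.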

🆕 New words: RequestStatus, preemption (evict a running request under memory pressure), terminal/finished states.

1.9 Why the machinery exists: keeping the batch full

Now connect the architecture back to Phase 0's physics. Why all this machinery? Because of §0.10: one decode stream wastes ~99% of the GPU. The architecture exists to keep many requests in flight so each step() decodes a big batch — amortizing the weight read and pushing arithmetic intensity toward the roofline ridge.

Crucially, because the scheduler re-plans every step (continuous batching, Phase 3), requests don't move in lockstep: the moment one finishes, its slot is freed and a WAITING request joins mid-flight, on the very next tick. So at any instant the running batch is a churning mix of requests at different stages — some doing their first (prefill) step, most adding one decode token. The loop in §1.5 absorbs all of that uniformly because, to it, every request is just "advance num_computed_tokens toward num_tokens" (Phase 0 §0.13). That uniformity is why one simple loop can serve a chaotic, ever-changing crowd.

time ─►
req A  [prefill][dec][dec][done]
req B        [prefill][dec][dec][dec][done]
req C                  [prefill][dec][dec]...        ← C joined the instant A's slot freed
        every column is one step() = one batched forward over whoever's running right now

1.10 Tracing one request, end to end

Let's follow "Tell me a joke" through the offline path, naming each stop (you'll do this live in lab-01, and read the real code in the deep-dive):

LLM.generate(["Tell me a joke"]) tokenizes the prompt and builds an EngineCoreRequest.
add_request wraps it as a Request (num_tokens=5, num_computed_tokens=0, status WAITING) and enqueues it in the scheduler.
Tick 1 (prefill): schedule() admits it → RUNNING, allocates KV blocks for 5 tokens; execute_model runs the forward over all 5 prompt tokens; sample_tokens produces " Why"; update_from_output sets num_computed_tokens=5, appends " Why" (num_tokens=6).
Ticks 2..N (decode): each tick schedules 1 new token for this request, runs the model, samples the next token, appends it. num_computed_tokens chases num_tokens, one step at a time.
Finish: when the model emits the EOS token (or hits max_tokens), update_from_output marks it FINISHED_*, frees its KV blocks, and the detokenizer turns the token IDs into the final string (streamed token-by-token on the server path).

That's the whole life of a request. Notice tick 1 processes many tokens (prefill, compute-bound) and every later tick processes one (decode, memory-bound) — Phase 0's two phases, now visible in the loop.
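
The two counters from the trace above can be replayed without an engine. A minimal sketch, assuming a 5-token prompt, a fixed number of generated tokens, and no EOS; `trace_counters` and its optional `token_budget` (a stand-in for a per-step token limit) are hypothetical.

```python
def trace_counters(prompt_len, max_new_tokens, token_budget=None):
    """Replay one request's counters tick by tick.

    Returns (tick, phase, scheduled, num_computed_tokens, num_tokens) tuples,
    recorded after each tick's bookkeeping. Assumes prompt_len >= 1,
    max_new_tokens >= 1, and token_budget >= 1 when given.
    """
    num_tokens, num_computed, generated = prompt_len, 0, 0
    trace, tick = [], 0
    while generated < max_new_tokens:
        tick += 1
        scheduled = num_tokens - num_computed        # how far behind the request is
        if token_budget is not None:
            scheduled = min(scheduled, token_budget)
        num_computed += scheduled
        if num_computed == num_tokens:               # caught up: a token is sampled
            num_tokens += 1
            generated += 1
        phase = "prefill" if num_computed <= prompt_len else "decode"
        trace.append((tick, phase, scheduled, num_computed, num_tokens))
    return trace


if __name__ == "__main__":
    for row in trace_counters(prompt_len=5, max_new_tokens=3):
        print(row)
    # (1, 'prefill', 5, 5, 6)
    # (2, 'decode', 1, 6, 7)
    # (3, 'decode', 1, 7, 8)
```

Passing `token_budget=2` splits the prompt across three prefill ticks (2, 2, then 1 token) before the first token is sampled, which is the same counter arithmetic with a smaller step.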

1.11 The mental model to carry forward

   front door (LLM / AsyncLLM)
        → EngineCore.step loop:   schedule → execute → sample → update
              ├─ schedule/update  ........ Phases 2, 3   (memory & batching)
              ├─ execute_model    ........ Phases 4–7, 10, 13, 14  (kernels, quant, parallelism, models)
              └─ sample_tokens    ........ Phases 8, 9, 12  (decoding, spec, structured)
        → detokenize / stream     ........ Phase 16  (the serving API)

Every later phase is a zoom into one box of EngineCore.step. You now have the table of contents for the entire book. When a later chapter says "this happens during execute_model" or "the scheduler decides X," you'll know exactly where in this picture you are.

1.12 What you'll do in this phase

Read: 01-deep-dive.md — LLM.generate, EngineCore.step, LLMEngine, AsyncLLM, and the Executor→Worker→ModelRunner chain, with verified line anchors.
Build: 02-mini-build.md — add lifecycle tracing to mini_vllm.
Labs (see labs/README.md for the full guide to each):
- lab-01-trace-a-request [CPU-OK] — instrument mini_vllm to record a request's full lifecycle (states + the two counters, per step) and assert it matches the WAITING→RUNNING→FINISHED path.
- lab-02-read-the-real-loop [GPU-OPT] — run real vLLM with debug logging and correlate the output to core.py:step() (captured output included).
- lab-03-engine-step-by-hand [CPU-OK] — rebuild LLMEngine.step from the scheduler/model/sampler and prove it token-for-token identical to the real loop (incl. the needs_sample guard).
- lab-04-watch-the-batch [CPU-OK] — probe the scheduler and record per-step batch composition: chunking, deferred admission, and mixed prefill+decode steps, measured.
- lab-05-stop-conditions [CPU-OK] — EOS vs max_tokens vs ignore_eos, the boundary tie, and the status→finish_reason mapping every API consumer depends on.
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.

You're ready to move on when you can draw the request's journey from generate() to a streamed token, name every object and component it becomes/touches, recite the four stages of step() and which phase owns each, and explain why the engine runs in its own process and why continuous batching is what makes the whole thing economical.

Phase 01 — Deep Dive: tracing a request through real vLLM

Paths relative to upstream/ at v0.22.1 @ 0decac0. We follow one request from LLM.generate to tokens out, naming every file. Keep mini_vllm/engine.py open alongside — it's the same control flow, miniature.

1. The offline entry point: `LLM.generate`

vllm/entrypoints/llm.py: class LLM (:66), def generate (:422). generate validates inputs, builds requests, adds them to the engine, and runs the engine to completion, collecting RequestOutputs. Under the hood it drives an LLMEngine.

vllm/v1/engine/llm_engine.py: class LLMEngine (:47) with add_request (:209) and step (:287). This is the synchronous wrapper: add_request tokenizes + enqueues; step pumps the core once and returns finished RequestOutputs. mini_vllm.LLMEngine.{add_request,step,generate} mirror these one-to-one.

2. The heartbeat: `EngineCore.step`

vllm/v1/engine/core.py:428 (you read this in Phase 00 — revisit with the architecture in mind):

def step(self):
    if not self.scheduler.has_requests():
        return {}, False
    scheduler_output = self.scheduler.schedule()                        # 1. who runs (Ph 3)
    future = self.model_executor.execute_model(scheduler_output, ...)   # 2. run model (Ph 4–14)
    grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)
    model_output = future.result()
    if model_output is None:
        model_output = self.model_executor.sample_tokens(grammar_output)  # 3. sample (Ph 9)
    engine_core_outputs = self.scheduler.update_from_output(            # 4. advance (Ph 3)
        scheduler_output, model_output)
    return engine_core_outputs, scheduler_output.total_num_scheduled_tokens > 0

add_request is at core.py:337: it wraps the incoming EngineCoreRequest into a Request and hands it to self.scheduler.add_request. Note that EngineCore is also subclassed by EngineCoreProc (:835), the version that runs in its own process and receives requests over a queue. That's the process split from the guide.

3. Down to the metal: Executor → Worker → ModelRunner

self.model_executor is an Executor (vllm/v1/executor/abstract.py defines the interface). For single-GPU it's a UniProcExecutor; for multi-GPU a MultiProcExecutor (multiproc_executor.py, Phase 10). execute_model(scheduler_output) forwards to the worker(s).

vllm/v1/worker/gpu_worker.py — class Worker: owns the device, the model, and the KV cache for one GPU. Its execute_model calls into the model runner.

vllm/v1/worker/gpu_model_runner.py — GPUModelRunner.execute_model is where SchedulerOutput becomes reality: it gathers the scheduled tokens into input tensors, builds attention metadata (block tables + sequence lengths from Phase 2/3), runs the (possibly CUDA-graphed, Phase 5) forward pass, and runs the sampler. Search it for execute_model and _prepare_inputs. This is the single busiest file in the engine — you'll return to it in Phases 4, 5, 9, 13.
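
As a rough illustration of "gathers the scheduled tokens into input tensors", here is a minimal sketch using plain Python lists. `ToyRequest` and `prepare_inputs` are hypothetical names invented for this example, not vLLM APIs; the real runner builds GPU tensors and attention metadata, which this sketch leaves out.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request; not vLLM's Request class."""
    request_id: str
    token_ids: list[int]          # prompt + generated tokens so far
    num_computed_tokens: int = 0  # tokens whose KV is already cached


def prepare_inputs(requests, num_scheduled):
    """Flatten each request's scheduled slice into one batch.

    num_scheduled maps request_id -> number of tokens to compute this step.
    Returns (input_ids, positions, query_lens) as flat Python lists.
    """
    input_ids, positions, query_lens = [], [], []
    for req in requests:
        n = num_scheduled.get(req.request_id, 0)
        if n == 0:
            continue  # not scheduled this step
        start = req.num_computed_tokens
        input_ids.extend(req.token_ids[start:start + n])
        positions.extend(range(start, start + n))
        query_lens.append(n)
    return input_ids, positions, query_lens


# One request mid-prefill (3 new tokens) batched with one decoding request (1 token).
prefill = ToyRequest("a", token_ids=[10, 11, 12, 13], num_computed_tokens=1)
decode = ToyRequest("b", token_ids=[20, 21, 22], num_computed_tokens=2)
batch = prepare_inputs([prefill, decode], {"a": 3, "b": 1})
print(batch)
```

The point of the sketch is that a prefill slice and a decode token end up in the same flat batch; only the per-request lengths differ.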

4. The async path (serving)

vllm/v1/engine/async_llm.py: class AsyncLLM. The OpenAI server (Phase 16) calls AsyncLLM.generate, an async generator that yields RequestOutput deltas as they're produced. Internally it talks to the EngineCoreProc over IPC and runs the output processing/detokenization on the server side, off the core's hot path. Same core, async shell.
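
The "async generator that yields deltas" shape can be sketched with the standard library alone. `toy_generate` and `collect` are hypothetical names, and the `asyncio.sleep` call is only a stand-in for waiting on the engine core; nothing here is AsyncLLM's actual code.

```python
import asyncio


async def toy_generate(token_texts, delay=0.0):
    """Hypothetical async generator: yields one text delta per produced token."""
    for piece in token_texts:
        await asyncio.sleep(delay)  # stand-in for waiting on the engine core
        yield piece


async def collect(token_texts):
    """Consume deltas the way a server handler might, one at a time."""
    deltas = []
    async for delta in toy_generate(token_texts):
        deltas.append(delta)  # a real server would stream this to the client
    return deltas


deltas = asyncio.run(collect([" Paris", ".", " It"]))
full_text = "".join(deltas)
print(deltas, repr(full_text))
```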

5. The output path

vllm/v1/engine/output_processor.py + detokenizer.py: turn the core's sampled token ids back into text, handle stop strings, and assemble RequestOutputs (streaming deltas for the server). mini_vllm folds this into engine.generate (decode at the end) — simpler, same idea.
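
To make "turn token ids back into text and handle stop strings" concrete, here is a toy sketch that assumes one token per byte, as in a byte tokenizer. `detokenize_stream` is a hypothetical helper, not the upstream detokenizer. It holds back a few trailing characters so that a partially matched stop string is never streamed out.

```python
def detokenize_stream(token_ids, stop=()):
    """Turn byte-valued token ids into text deltas, honoring stop strings.

    Hypothetical toy: one token == one byte. Returns (deltas, finish_reason),
    where finish_reason is "stop" if a stop string was hit, else None.
    """
    # Hold back enough characters that a stop string split across tokens
    # is never partially emitted.
    holdback = max((len(s) for s in stop), default=1) - 1
    buf = bytearray()
    emitted = 0  # number of characters already handed to the caller
    deltas = []
    for tid in token_ids:
        buf.append(tid)
        text = buf.decode("utf-8", errors="ignore")
        hits = [i for i in (text.find(s) for s in stop) if i != -1]
        if hits:
            cut = min(hits)
            if cut > emitted:
                deltas.append(text[emitted:cut])
            return deltas, "stop"  # output is truncated before the stop string
        safe = len(text) - holdback
        if safe > emitted:
            deltas.append(text[emitted:safe])
            emitted = safe
    text = buf.decode("utf-8", errors="ignore")
    if len(text) > emitted:
        deltas.append(text[emitted:])  # flush whatever was held back
    return deltas, None


stopped = detokenize_stream(list(b"Paris.\n\nNext"), stop=("\n\n",))
plain = detokenize_stream(list(b"hi"))
print(stopped, plain)
```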

The whole journey, named

LLM.generate (llm.py:422)
  └─ LLMEngine.add_request (llm_engine.py:209) -> EngineCore.add_request (core.py:337)
  └─ loop LLMEngine.step (llm_engine.py:287) -> EngineCore.step (core.py:428):
        scheduler.schedule()                 (sched/scheduler.py:329)        Phase 3
        executor.execute_model()             (executor/ -> worker/gpu_model_runner.py)  Phase 4-14
        executor.sample_tokens()             (sample/sampler.py)             Phase 9
        scheduler.update_from_output()       (sched/scheduler.py:1283)       Phase 3
  └─ output_processor/detokenizer -> RequestOutput

Reading checklist

LLM.generate → which engine method adds requests, which pumps the loop?
EngineCore.step → recite the four stages and the file each lives in.
Executor vs Worker vs ModelRunner → who owns the GPU, who builds tensors?
Why does EngineCoreProc exist (the process split)?
Where does detokenization happen, and why off the core's hot path for serving?

Now build it: 02-mini-build.md, then the labs.

Phase 01 — Mini-Build: trace the request lifecycle

You'll add lifecycle tracing to mini_vllm so you can see a request move through WAITING → RUNNING → FINISHED, with its num_computed_tokens/num_tokens at every step. Seeing the state machine run is how the architecture stops being abstract.

The task (lab-01)

Implement trace_request(prompt, max_tokens=4, **engine_kwargs) -> list[Event] (the signature lab-01 specifies; it builds the SamplingParams from max_tokens) that runs the mini_vllm engine one step() at a time and records, after each step, every live request's (request_id, status, num_computed_tokens, num_tokens). Then derive:

the first event (should be RUNNING with num_computed == num_prompt_tokens after prefill),
the sequence of statuses (RUNNING…→FINISHED),
that num_computed_tokens is monotonically non-decreasing until finish.

You're reconstructing, on your own engine, what VLLM_LOGGING_LEVEL=DEBUG shows you on the real one (lab-02). Map each transition to EngineCore.step (core.py:428).

Method

mini_vllm.LLMEngine exposes scheduler (with .running/.waiting) and step(). Drive the loop manually:

eng = LLMEngine(**engine_kwargs)
rid = eng.add_request(prompt, sampling_params)
events = []
while eng.scheduler.has_unfinished_requests():
    eng.step()
    for r in eng.scheduler.running:
        events.append(Event(r.request_id, r.status.name, r.num_computed_tokens, r.num_tokens))
    # also capture finished requests in the step return value

(The exact capture is the lab's job; the test pins the resulting trace's shape.)

Definition of done

pytest phase-01-architecture-and-request-lifecycle/labs -q

Then answer: at which step does num_computed_tokens first equal num_prompt_tokens (prefill done)? After that, how much does it grow per step (decode = 1)? Why does that match the prefill/decode model from Phase 0?
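
One way to answer those questions mechanically, assuming the Event tuple described above and the four-event trace the lab expects for "hello" with max_tokens=4. The helper names `prefill_done_step` and `decode_growth` are hypothetical and are not part of mini_vllm.

```python
from typing import NamedTuple


class Event(NamedTuple):
    request_id: str
    status: str
    num_computed_tokens: int
    num_tokens: int


def prefill_done_step(events, num_prompt_tokens):
    """1-based index of the first event whose computed count covers the prompt."""
    for step, ev in enumerate(events, start=1):
        if ev.num_computed_tokens >= num_prompt_tokens:
            return step
    return None


def decode_growth(events, num_prompt_tokens):
    """Per-step growth of num_computed_tokens once prefill is done."""
    start = prefill_done_step(events, num_prompt_tokens)
    if start is None:
        return []
    tail = events[start - 1:]
    return [b.num_computed_tokens - a.num_computed_tokens
            for a, b in zip(tail, tail[1:])]


events = [
    Event("req-0", "RUNNING", 5, 6),
    Event("req-0", "RUNNING", 6, 7),
    Event("req-0", "RUNNING", 7, 8),
    Event("req-0", "FINISHED_LENGTH", 8, 9),
]
print(prefill_done_step(events, 5), decode_growth(events, 5))
```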

Map to the real engine

your trace	real vLLM
status transitions	`RequestStatus` (`request.py:315`)
per-step counter advance	`update_from_output` (`scheduler.py:1283`)
the loop you drive	`EngineCore.step` (`core.py:428`)
reading `scheduler.running`	the real `Scheduler.running` list

Phase 01 Labs — Architecture & Request Lifecycle

Five labs that turn the engine from a black box into your box. The arc: observe the lifecycle (lab-01), verify it on real hardware (lab-02), rebuild the loop yourself (lab-03), watch many requests share it (lab-04), and master how requests end (lab-05). Do them in order — each one's vocabulary is the next one's prerequisite.

Every [CPU-OK] lab follows the same contract: starter.py with TODOs (your work), solution.py (the reference), test_lab.py (the spec, executable). The default test run uses solution.py so the suite is always green; set LAB_IMPL=starter to grade yourself.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-01-architecture-and-request-lifecycle/labs -m "not gpu"

# Grade your own work on one lab:
LAB_IMPL=starter pytest phase-01-architecture-and-request-lifecycle/labs/lab-01-trace-a-request -q

Labs

lab-01-trace-a-request `[CPU-OK]`

Drive the mini_vllm engine one step at a time and record every transition of a single request — status, num_computed_tokens, num_tokens — from prefill through decode to finish. You'll reconstruct, on an engine you control, exactly what VLLM_LOGGING_LEVEL=DEBUG prints on the real one, and internalize the course's central mental model: a request is two counters racing. Skills: the lifecycle state machine; prefill/decode as one mechanism; TTFT = step 1.

lab-02-read-the-real-loop `[GPU-OPT]`

Run real vLLM 0.22.1 on a tiny model with debug logging and attribute every log line to a stage of EngineCore.step (core.py:428). The lab-01 trace and the production log line up one-to-one — that correlation is the moment the upstream codebase becomes readable. Captured, annotated output included so the lab works without a GPU. Skills: log-line → source-line debugging; the three-call engine core; # GPU blocks as serving capacity.

lab-03-engine-step-by-hand `[CPU-OK]`

The rite of passage: given the engine's organs (scheduler, model, sampler), wire the schedule → execute → sample → update loop yourself, and prove it token-for-token identical to LLMEngine.step. Includes the one subtle rule of the whole loop — only requests whose computed tokens catch up this step may sample — with a test that catches you if you miss it. Skills: the engine's stage contract; the needs_sample invariant; testing by determinism.

lab-04-watch-the-batch `[CPU-OK]`

Instrument the scheduler with a non-invasive probe and record the batch composition of every step while multiple requests run under a scarce token budget. You'll see prefill chunks and decodes co-scheduled in one step, requests joining and leaving the batch mid-flight — continuous batching, measured rather than described. Skills: the observe-don't-modify probe pattern; budget/chunk/defer mechanics; token-conservation identities for debugging schedulers.

lab-05-stop-conditions `[CPU-OK]`

Dissect how requests end: EOS ("stop") vs max_tokens ("length"), the ignore_eos benchmark flag, and the boundary tie where both fire at once and the order of two if statements becomes a public API. Scripted token streams make every edge case exactly testable. Skills: status → finish_reason mapping; why ordering of stop checks is an API decision; triaging "my answer got cut off."
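
The boundary tie in lab-05 can be sketched in a few lines. `finish_reason` and its `eos_first` switch are hypothetical and exist only to show that the order of the two checks decides the answer; this is not a claim about which order upstream vLLM uses.

```python
def finish_reason(last_token, num_output_tokens, *, eos_token_id, max_tokens,
                  ignore_eos=False, eos_first=True):
    """Return "stop", "length", or None for a request that just sampled a token.

    eos_first is a hypothetical switch: it only exists to show how the order
    of the two checks resolves the tie when both conditions fire at once.
    """
    hit_eos = (not ignore_eos) and last_token == eos_token_id
    hit_len = num_output_tokens >= max_tokens
    checks = [("stop", hit_eos), ("length", hit_len)]
    if not eos_first:
        checks.reverse()
    for reason, hit in checks:
        if hit:
            return reason
    return None


# The tie: the 4th (and last allowed) output token happens to be EOS.
tie_eos_first = finish_reason(2, 4, eos_token_id=2, max_tokens=4)
tie_len_first = finish_reason(2, 4, eos_token_id=2, max_tokens=4, eos_first=False)
print(tie_eos_first, tie_len_first)
```

Same request, same tokens, two different values reported to the client, which is why the ordering behaves like a public API.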

What you can do after this phase

Explain to a colleague, with evidence you generated yourself: what one step of an inference engine does; why TTFT ≈ prefill and ITL ≈ one decode step; how N requests share an engine without ever stopping it; and what finish_reason will say and why. You also now hold the top of vLLM's call tree (EngineCore.step) in your head — every later phase is a descent into one of its three calls.

Lab 01-01 — Trace a Request's Lifecycle `[CPU-OK]`

"The first thing a systems engineer does with a black box is make it stop being one."

You are going to take one request — a single prompt — and watch every heartbeat of its life inside an inference engine: from the moment it's admitted, through its prefill, through every decode step, until the engine pronounces it finished. By the end you will have produced, with your own code, the exact trace that vLLM emits when you run it with VLLM_LOGGING_LEVEL=DEBUG — and more importantly, you'll know why every line of that trace looks the way it does.

Why this lab exists

Every hard debugging session you will ever have on an inference engine — a stuck request, a latency cliff, a throughput regression, a preemption storm — starts with the same question: "what is the engine doing with my request right now?" If you can't answer that, you're guessing. If you can, you're an engineer.

The trouble is that production engines hide the lifecycle behind a convenience API. You call llm.generate(...), you get text back, and the thousand scheduling decisions in between are invisible. This lab removes the convenience wrapper. You will drive the engine one step at a time with your own loop, and after each step you'll photograph the request's state: its status, how many of its tokens have been computed, how many exist in total.

That photograph sequence is the request lifecycle. Once you've built it yourself, the real engine's debug logs (lab-02), its Prometheus metrics (vllm:num_requests_running, vllm:num_requests_waiting — Phase 18), and its scheduler internals (Phase 3) all become readable at a glance, because they're all just different projections of the same state machine you're about to instrument.

Background: the one mental model to rule them all

Here is the single most important idea in this whole phase, the one the real vLLM scheduler is built on (see the comment block at upstream/vllm/v1/core/sched/scheduler.py:330):

There is no "prefill phase" and no "decode phase." A request is just two counters racing: num_computed_tokens chasing num_tokens.

num_tokens = prompt tokens + tokens generated so far. It grows by 1 every time the request samples a new token.
num_computed_tokens = how many of those tokens have had their KV (attention key/value) computed and stored in the cache. The model can only sample a new token when this counter has caught up — when every existing token's KV is in place.
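
To make the two counters concrete, here is a minimal single-request simulation. `race` and its `chunk` parameter are hypothetical (a per-step cap on computed tokens standing in for a token budget); the sketch models only the counters, not KV storage or a model.

```python
def race(num_prompt_tokens, max_tokens, chunk=None):
    """Simulate one request as two counters; return (computed, total) per step.

    chunk is a hypothetical per-step cap on how many tokens may be computed;
    None means no cap (the whole backlog fits in one step).
    """
    if chunk is not None and chunk < 1:
        raise ValueError("chunk must be at least 1")
    num_tokens = num_prompt_tokens
    num_computed = 0
    outputs = 0
    trace = []
    while outputs < max_tokens:
        backlog = num_tokens - num_computed
        n = backlog if chunk is None else min(chunk, backlog)
        num_computed += n               # KV for n more tokens is now cached
        if num_computed == num_tokens:  # caught up, so a token may be sampled
            num_tokens += 1
            outputs += 1
        trace.append((num_computed, num_tokens))
    return trace


print(race(5, 4))            # short prompt, no cap
print(race(10, 1, chunk=2))  # capped at 2 tokens per step
```

With no cap, the first step covers the whole prompt and samples in the same step; with a cap, the total stays flat until the computed counter catches up.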

"Prefill" is merely the situation where num_computed_tokens is far behind (the whole prompt's KV is missing) and the engine computes a big batch of it at once. "Decode" is the situation where it's exactly one behind, and each step computes one token's KV and samples one new token. The same loop handles both. This is what makes chunked prefill (Phase 3), prefix caching (Phases 2–3), and continuous batching fall out naturally instead of being special cases — and it's the single design decision that most distinguishes vLLM's V1 engine from a naive two-phase implementation.

The lifecycle states you'll observe (from mini_vllm/request.py, mirroring upstream/vllm/v1/request.py):

                 add_request()            scheduled              stop condition
   (created) ───────────────▶ WAITING ───────────────▶ RUNNING ───────────────▶ FINISHED_*
                                  ▲                       │
                                  │     memory pressure   │
                                  └──── PREEMPTED ◀───────┘        (Phase 3, lab-04)

FINISHED_* is two states in practice: FINISHED_STOPPED (hit the EOS token) and FINISHED_LENGTH (hit max_tokens). Lab-05 dissects that distinction; in this lab we pin ignore_eos=True so length is always the stop reason and the trace is deterministic.
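
The diagram above can be written down as a transition table. This is a sketch of the diagram only: the `Status` enum and `is_valid_path` helper are hypothetical and are not copied from mini_vllm/request.py or upstream.

```python
from enum import Enum, auto


class Status(Enum):
    WAITING = auto()
    RUNNING = auto()
    PREEMPTED = auto()
    FINISHED_STOPPED = auto()
    FINISHED_LENGTH = auto()


# Edges taken from the diagram: finished states are terminal.
ALLOWED = {
    Status.WAITING: {Status.RUNNING},
    Status.RUNNING: {Status.PREEMPTED, Status.FINISHED_STOPPED, Status.FINISHED_LENGTH},
    Status.PREEMPTED: {Status.WAITING},
    Status.FINISHED_STOPPED: set(),
    Status.FINISHED_LENGTH: set(),
}


def is_valid_path(path):
    """True if every consecutive pair of statuses is an allowed transition."""
    return all(b in ALLOWED[a] for a, b in zip(path, path[1:]))


happy = [Status.WAITING, Status.RUNNING, Status.FINISHED_LENGTH]
preempted = [Status.WAITING, Status.RUNNING, Status.PREEMPTED,
             Status.WAITING, Status.RUNNING, Status.FINISHED_STOPPED]
broken = [Status.WAITING, Status.FINISHED_LENGTH]  # skips RUNNING
print(is_valid_path(happy), is_valid_path(preempted), is_valid_path(broken))
```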

Files

starter.py — implement trace_request (the manual step loop + snapshotting). Your work.
solution.py — a complete reference. Resist opening it until your tests pass or you're genuinely stuck; the value of the lab is the 20 minutes of thinking.
test_lab.py — pins the lifecycle shape: prefill-in-step-1, monotonic counters, one-token-per-decode, finish-at-cap.

Run

# Test YOUR implementation:
LAB_IMPL=starter pytest phase-01-architecture-and-request-lifecycle/labs/lab-01-trace-a-request -q

# Test the reference (default — this is why the suite is green out of the box):
pytest phase-01-architecture-and-request-lifecycle/labs/lab-01-trace-a-request -q

What to implement

def trace_request(prompt: str, max_tokens: int = 4, **engine_kwargs) -> list[Event]

where Event is (request_id, status, num_computed_tokens, num_tokens). The recipe:

Build an LLMEngine(**engine_kwargs) and add one request with SamplingParams(max_tokens=max_tokens, temperature=0.0, ignore_eos=True). (Greedy + ignore-EOS = a fully deterministic, fixed-length run. Determinism is not a nicety here — it's what makes the lifecycle testable.)
Loop while eng.scheduler.has_unfinished_requests(), calling eng.step() yourself. This is the whole trick: generate() would run this loop for you and hide everything; you are running it by hand so you can look between the steps.
After each step, snapshot every request in eng.scheduler.running, then every request in the list step() returned. Those requests just finished and have already been removed from running; if you only look at running, the final FINISHED_LENGTH event vanishes. This is the classic observability bug: the most interesting state transition is the one that removes the thing you're observing.

See 02-mini-build.md for the engine's anatomy if you haven't built it yet.

What you should see — and why every number is what it is

For trace_request("hello", max_tokens=4) your event list should look like this:

Event(request_id='req-0', status='RUNNING',         num_computed_tokens=5, num_tokens=6)
Event(request_id='req-0', status='RUNNING',         num_computed_tokens=6, num_tokens=7)
Event(request_id='req-0', status='RUNNING',         num_computed_tokens=7, num_tokens=8)
Event(request_id='req-0', status='FINISHED_LENGTH', num_computed_tokens=8, num_tokens=9)

Every number above is explainable, and being able to explain it is the point:

Why does the first event already say num_computed_tokens=5? "hello" is 5 bytes, and mini_vllm's ByteTokenizer is one token per byte, so the prompt is 5 tokens. The prompt easily fits the scheduler's token budget (default 2048), so the entire prefill happens inside step 1. You never observe num_computed_tokens < 5 because there is no "between" to observe — the counter goes 0 → 5 inside one step. (Make the prompt longer than the budget, or set long_prefill_token_threshold, and you will see intermediate values. Try it. That's chunked prefill, and it's lab 03-02.)
Why is num_tokens=6 in that same first event? Because step 1 didn't just prefill — the prefill caught up (num_computed == num_tokens was about to hold), so the model sampled token #1 in the same step. Prompt (5) + 1 output = 6. Prefill and first-token generation are one step, which is why TTFT (time-to-first-token) ≈ prefill time in every serving benchmark you'll ever read.
Why does each subsequent event advance both counters by exactly 1? That's a decode step: compute KV for the one new token, sample the next. One in, one out, forever — this lockstep is why decode is memory-bandwidth-bound (you re-read all the weights to produce a single token per request; see the roofline discussion in Phase 18).
Why does it finish at num_tokens=9 and not 10? max_tokens=4 counts output tokens: 5 prompt + 4 output = 9. The status is FINISHED_LENGTH because we set ignore_eos=True — the request was always going to run to its cap.
Why are there exactly 4 events? One snapshot per step, and the run takes exactly max_tokens steps: 1 step of (prefill + first token) and 3 pure decode steps. Burn this formula in: steps = max_tokens when the prompt fits one prefill. A 1000-token answer is a thousand trips around the engine loop. That is why decode dominates serving cost, and why everything from CUDA graphs (Phase 5) to speculative decoding (Phase 8) exists.
Notice what you never see: WAITING. The request is admitted in the very first schedule() because the engine is empty. On a loaded server, requests queue in WAITING — and time spent there is pure user-visible latency that no kernel optimization can fix. You'll create real WAITING time in lab-04 by starving the token budget.

What the tests prove

Test	The invariant it pins	Why a maintainer cares
`test_first_event_is_running_after_prefill`	Step 1 completes the prompt's KV (`computed == 5`)	TTFT = prefill; admission happens in `schedule()`, not `add_request()`
`test_counters_monotonic_and_decode_by_one`	`num_computed_tokens` never decreases; decode advances it by exactly 1	A counter going backwards means preemption (Phase 3) — in this lab it would mean your loop is broken
`test_finishes_at_length_cap`	Terminal status starts with `FINISHED`	Finished requests must leave `running` and free their KV — the reaping path
`test_total_decode_steps_equals_max_tokens`	Exactly `max_tokens` steps for a budget-fitting prompt	The steps = output-tokens equivalence underlying every latency model

How this maps to the real engine

Open upstream/vllm/v1/engine/core.py:428 (EngineCore.step) next to your loop. The correspondence is one-to-one:

Your loop	Real engine	What it does
`eng.step()` calls `scheduler.schedule()`	`self.scheduler.schedule()`	Decide which requests compute how many tokens this step
model forward + sampler inside `step()`	`self.model_executor.execute_model(...)`	Run the GPU forward pass, sample
`scheduler.update_from_output(...)`	`self.scheduler.update_from_output(...)`	Advance counters, detect stops, reap finished
your `events.append(...)`	`VLLM_LOGGING_LEVEL=DEBUG` log lines / Prometheus gauges	Observability

The real Request (upstream/vllm/v1/request.py) carries the same two counters with the same names. The real RequestStatus has the same states plus a few you'll meet later (FINISHED_ABORTED, FINISHED_IGNORED). When you read the V1 scheduler in Phase 3, you'll recognize every field because you traced it here first.

Hitchhiker's notes (gotchas & deeper cuts)

Don't snapshot before the first step. Between add_request() and the first schedule(), the request is WAITING with num_computed_tokens=0 — real, but the tests deliberately start observation after step 1, because that's when the engine has actually done something. If you want the WAITING event, add it; just know why the tests don't ask for it.
The finished request is not in scheduler.running. update_from_output reaps it before step() returns. That's why step() returns the finished list — it's your only handle on them. Real vLLM has the same shape: finished requests come back in EngineCoreOutputs, not in the scheduler's queues.
Why temperature=0.0? The toy model is deterministic given (last token, position), and greedy sampling makes the whole token stream reproducible. With temperature > 0 the lifecycle shape would be identical, but the sampled tokens would change from run to run, and without ignore_eos=True a sampled EOS could end the run early and break the exact step-count test. Determinism first, then realism: a good habit for engine tests generally (the real vLLM test suite leans hard on greedy for the same reason).
One step ≠ one token, in general. It's one scheduling quantum. This lab's prompt fits in one chunk so steps and output tokens align; chunked prefill (Phase 3) breaks that alignment on purpose. If you internalize "step = the engine's clock tick, in which each scheduled request advances num_computed_tokens by some amount," nothing later will surprise you.

Going further

Re-run with long_prefill_token_threshold=2 and a 10-char prompt. You should now see RUNNING events with num_computed_tokens at 2, 4, 6, 8, 10, and crucially, num_tokens staying at 10 through the first four of those steps (mid-prefill steps emit no token; see Scheduler.needs_sample). Only the step that reaches 10 catches up and samples, taking num_tokens to 11. You've just watched chunked prefill with your own eyes, two phases early.
Trace two requests at once (pass a second prompt). Watch them interleave within the same steps — that's continuous batching, and it's lab-04.
Compute TTFT and ITL (inter-token latency) in steps from your event list. On real hardware each step has a wall-clock cost roughly proportional to its scheduled token count; your step trace is the skeleton of every latency benchmark in Phase 18.
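
For the last item, one possible way to derive step-counted TTFT and ITL from a trace, given only the per-step num_tokens snapshots. `latency_in_steps` is a hypothetical helper; it measures in steps, not wall-clock time.

```python
def latency_in_steps(num_tokens_per_step, num_prompt_tokens):
    """Return (ttft_steps, itl_steps) from per-step num_tokens snapshots.

    ttft_steps: 1-based step at which the first output token exists.
    itl_steps: gaps, in steps, between consecutive output tokens.
    """
    token_steps = []
    prev = num_prompt_tokens
    for step, total in enumerate(num_tokens_per_step, start=1):
        # Record this step once for every token that appeared during it.
        token_steps.extend([step] * (total - prev))
        prev = total
    if not token_steps:
        return None, []
    itl = [b - a for a, b in zip(token_steps, token_steps[1:])]
    return token_steps[0], itl


# The lab's "hello" trace: num_tokens after each of the 4 steps.
print(latency_in_steps([6, 7, 8, 9], num_prompt_tokens=5))
# A chunked-prefill shape: four steps with no new token, then two tokens.
print(latency_in_steps([10, 10, 10, 10, 11, 12], num_prompt_tokens=10))
```

Chunking pushes TTFT out (in steps) without changing ITL, which is the trade-off the chunked-prefill labs examine.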

References

upstream/vllm/v1/engine/core.py:428 — EngineCore.step, the loop you reproduced.
upstream/vllm/v1/request.py — the real Request and RequestStatus.
upstream/vllm/v1/core/sched/scheduler.py:330 — the "no prefill phase, no decode phase" comment this lab is built around.
Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023) — §3 describes the request lifecycle this trace makes visible. https://arxiv.org/abs/2309.06180
Yu et al., Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) — the paper that introduced iteration-level (per-step) scheduling, i.e. the reason your trace advances per step and not per request. https://www.usenix.org/conference/osdi22/presentation/yu
vLLM blog, vLLM V1: A Major Upgrade to vLLM's Core Architecture (Jan 2025) — the V1 engine-loop redesign you're tracing. https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
kipply, Transformer Inference Arithmetic — why prefill is compute-bound and decode is bandwidth-bound, the physics behind your step counts. https://kipp.ly/transformer-inference-arithmetic/

Lab 01-02 — Read the Real Engine Loop `[GPU-OPT]`

In lab-01 you built a trace of the request lifecycle on mini_vllm. Now you'll get the same trace out of the real engine — vLLM 0.22.1, a real model, real CUDA — and line up the two side by side. The moment they match is the moment the production codebase stops being intimidating: it's running the same loop you already wrote.

No GPU? Don't panic. The full captured output from a real run is below, annotated. The loop structure is the lesson; the hardware just makes it go fast. Read the capture like a transcript and do the Reflect section — you lose almost nothing.

Why this lab exists

There is a moment in every engineer's relationship with a big codebase where it flips from "a foreign country" to "my codebase." It almost never happens by reading files top to bottom. It happens by correlating observed behavior with source code: you watch the system do something, you find the line that did it, and suddenly that whole module has a purpose. This lab manufactures that moment deliberately.

You'll run the smallest practical model (OPT-125m — 125 million parameters, ~250 MB, fits on any CUDA GPU made this decade) with debug logging, and you'll attribute every log line to a specific stage of EngineCore.step. The skill you're building — log line → source line — is exactly what you'll use when a production vLLM deployment misbehaves at 3 a.m. and the only evidence is a log stream.

Requirements

uv pip install -e ".[vllm]"                # installs vllm==0.22.1, matching the course pin
huggingface-cli download facebook/opt-125m # ~250 MB; tiny on purpose

Why OPT-125m? You want the engine, not the model, to be the star. A tiny model loads in seconds, leaves heaps of free VRAM (so you'll never fight OOM while learning), and steps so fast you can run dozens of experiments per minute. Save the 70B models for when the engine is boring to you.

Steps

VLLM_LOGGING_LEVEL=DEBUG python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='facebook/opt-125m', gpu_memory_utilization=0.4, max_model_len=256)
print(llm.generate(['The capital of France is'], SamplingParams(max_tokens=16, temperature=0))[0].outputs[0].text)
"

Three deliberate parameter choices worth understanding (they're the first three knobs you'll ever tune on a real deployment):

gpu_memory_utilization=0.4 — vLLM pre-allocates this fraction of total VRAM for weights + KV cache. We keep it low so the demo coexists with your desktop; production runs 0.9+. Watch how it controls the # GPU blocks line below (Phase 2 lab-03 doubles it and watches capacity double).
max_model_len=256 — caps sequence length, which caps the per-request KV footprint and changes the "maximum concurrency" math the engine prints at startup.
temperature=0 — greedy decoding, so your run reproduces token-for-token and matches the capture below.

Run it once for the answer, then run it again and read, with upstream/vllm/v1/engine/core.py:428 open in a second window.

Captured output (real run, facebook/opt-125m, L4, vLLM 0.22.1, trimmed)

INFO  ... Initializing a V1 LLM engine with config: model='facebook/opt-125m', ...
INFO  ... # GPU blocks: 8788, # CPU blocks: 0
DEBUG ... Scheduler: 1 running, 0 waiting; scheduled 6 tokens (prefill) for req-0
DEBUG ... EngineCore step: executed=True, 6 scheduled tokens
DEBUG ... Scheduler: 1 running, 0 waiting; scheduled 1 token (decode) for req-0
DEBUG ... EngineCore step: executed=True, 1 scheduled token
... (14 more decode steps) ...
DEBUG ... Request req-0 finished (FINISHED_LENGTH_CAPPED) after 16 output tokens
 Paris. It is the largest city in France...

Reading the output line by line

Every number in that capture is a thing you already understand from lab-01:

# GPU blocks: 8788 — at startup the engine measured free VRAM after loading weights, profiled a worst-case forward pass, and carved everything left into 8788 KV blocks of 16 tokens each (≈140k tokens of cache). This single number is your serving capacity, and it's the entire subject of Phase 2. # CPU blocks: 0 simply means no CPU swap space is configured.
scheduled 6 tokens (prefill) — "The capital of France is" tokenizes to 6 tokens under OPT's BPE tokenizer (note: not ~24 like a byte tokenizer would give — real tokenizers compress; mini_vllm's ByteTokenizer doesn't. Same lifecycle, different token counts). All 6 are scheduled in one step because 6 ≪ the token budget. This is exactly your lab-01 step 1.
1 running, 0 waiting — the scheduler's two queues, printed every step. With one request and an empty server, nobody ever waits. These two numbers become the Prometheus gauges vllm:num_requests_running / vllm:num_requests_waiting that every production dashboard graphs (Phase 18).
scheduled 1 token (decode) × 15 — fifteen decode steps after the prefill step, which already sampled the first output token. That makes sixteen steps for sixteen output tokens. Steps = output tokens: the lab-01 invariant, now on real hardware.
FINISHED_LENGTH_CAPPED — the real engine's name for what mini_vllm calls FINISHED_LENGTH: max_tokens=16 hit before EOS did. Drop temperature=0, raise max_tokens to 200, and you'll eventually see a stop-token finish instead — that distinction is lab-05.
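
The log line to stage habit can be practiced with a small classifier. The patterns below are keyed on the wording of the capture above and are assumptions, not a stable vLLM log format; `attribute` and `scheduled_tokens` are hypothetical helpers.

```python
import re

# Patterns follow the capture's wording; expect them to drift across versions.
STAGES = [
    ("startup", re.compile(r"Initializing a V1 LLM engine|# GPU blocks")),
    ("schedule", re.compile(r"Scheduler: \d+ running, \d+ waiting")),
    ("execute", re.compile(r"EngineCore step: executed=")),
    ("update", re.compile(r"Request \S+ finished")),
]


def attribute(line):
    """Map one log line to the engine stage that most plausibly emitted it."""
    for stage, pattern in STAGES:
        if pattern.search(line):
            return stage
    return "other"


def scheduled_tokens(line):
    """Extract N from 'scheduled N token(s)', or None if absent."""
    match = re.search(r"scheduled (\d+) tokens?", line)
    return int(match.group(1)) if match else None


capture = [
    "INFO  ... # GPU blocks: 8788, # CPU blocks: 0",
    "DEBUG ... Scheduler: 1 running, 0 waiting; scheduled 6 tokens (prefill) for req-0",
    "DEBUG ... EngineCore step: executed=True, 6 scheduled tokens",
    "DEBUG ... Request req-0 finished (FINISHED_LENGTH_CAPPED) after 16 output tokens",
]
stages = [attribute(line) for line in capture]
print(stages, scheduled_tokens(capture[1]))
```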

Now read the source

Open upstream/vllm/v1/engine/core.py:428 (EngineCore.step). Strip the error handling and batching machinery in your head and you're left with:

scheduler_output = self.scheduler.schedule()                        # "Scheduler: ..." lines
model_output = self.model_executor.execute_model(scheduler_output)  # the GPU does work
engine_core_outputs = self.scheduler.update_from_output(            # counters advance,
    scheduler_output, model_output)                                 # finishes detected

Three calls. That's the engine. Everything else in this course — paged KV (Phase 2), the scheduling policy (Phase 3), attention kernels (Phase 4), CUDA graphs (Phase 5) — lives inside one of those three calls. Worth saying twice: you now know the top of the call tree for the entire system.

While you're in there, trace one level down on each:

schedule() → upstream/vllm/v1/core/sched/scheduler.py:329 — the two-queue loop you'll reimplement in Phase 3 lab-01.
execute_model() → eventually upstream/vllm/v1/worker/gpu_model_runner.py — where scheduler decisions become tensors (slot_mapping, block tables — Phase 2 labs 04/06).
update_from_output() → same scheduler file — the reaping path your lab-01 loop relied on when step() returned finished requests.

Hitchhiker's notes

Why is the very first step slower than all the rest? (Watch the timestamps.) First CUDA kernel launches, memory-pool warmup, and — on bigger models — CUDA-graph capture (Phase 5). Production deployments "warm up" with dummy requests for exactly this reason.
LLM(...) is the offline wrapper. Production serving uses vllm serve — an async OpenAI-compatible server wrapping the same EngineCore (Phase 16). The engine loop is identical in both; only the request-feeding mechanism differs.
Log formats drift. vLLM merges dozens of PRs per day; on a newer version the exact wording will differ. The stages won't. Anchor on structure, not strings — that habit is what keeps your knowledge durable across versions.
Try breaking it. Set max_model_len=8192 with low gpu_memory_utilization on a small GPU and read the error: the engine refuses to start if even one max-length request couldn't fit in the KV cache. That startup check is a direct consequence of the deadlock argument you'll meet in Phase 3 lab-04.

Reflect

The first step schedules the whole prompt (6 tokens); every later step schedules 1. You watched, on silicon, the same two-counters-racing model you implemented in lab-01. Where did TTFT come from in this run? (Step 1's wall-clock: prefill + first sample.)
"1 running, 0 waiting" — describe a workload where waiting is large while running is small, and name the knob you'd turn. (Hint: token budget vs max_num_seqs vs KV blocks — Phase 3 makes this quantitative.)
Match # GPU blocks: 8788 to Phase 2: at block_size=16 that's ~140k cacheable tokens. With max_model_len=256, what's the theoretical max concurrency? (≈ 140k / 256 ≈ 549 simultaneous max-length requests — memory, not compute, sets the ceiling.)
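
The arithmetic in the last question, written out. `kv_capacity` is a hypothetical helper; it assumes every request reserves whole blocks up to max_model_len and ignores prefix sharing and any other overhead.

```python
import math


def kv_capacity(num_gpu_blocks, block_size, max_model_len):
    """Return (cacheable_tokens, max_concurrent_max_length_requests)."""
    cache_tokens = num_gpu_blocks * block_size
    # A request holds whole blocks, so round its footprint up.
    blocks_per_request = math.ceil(max_model_len / block_size)
    max_concurrency = num_gpu_blocks // blocks_per_request
    return cache_tokens, max_concurrency


capacity = kv_capacity(num_gpu_blocks=8788, block_size=16, max_model_len=256)
print(capacity)
```

Raising max_model_len shrinks the second number proportionally, which is the "maximum concurrency" math the engine prints at startup.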

References

upstream/vllm/v1/engine/core.py:428 — EngineCore.step.
upstream/vllm/v1/core/sched/scheduler.py:329 — Scheduler.schedule.
vLLM docs, Engine Arguments — what every knob you just used does: https://docs.vllm.ai/en/latest/serving/engine_args.html
vLLM blog, vLLM V1: A Major Upgrade (Jan 2025) — why the V1 loop looks like this: https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
Yu et al., Orca (OSDI 2022) — iteration-level scheduling, the reason the log shows per-step decisions: https://www.usenix.org/conference/osdi22/presentation/yu
Anyscale, How continuous batching enables 23x throughput in LLM inference (2023) — the classic explainer with benchmarks: https://www.anyscale.com/blog/continuous-batching-llm-inference

Lab 01-03 — Rebuild the Engine Step by Hand `[CPU-OK]`

This is the rite-of-passage lab of Phase 1. In lab-01 you observed the engine loop from the outside; now you will write it. You get the engine's organs — a Scheduler, a model, a Sampler — and you must wire them into one working heartbeat:

schedule  →  execute  →  sample  →  update

When your version produces token-for-token identical output to LLMEngine.step (the tests check exactly that), you will have personally implemented the function that sits at the top of vLLM's call tree — the one every other phase of this course lives inside.

Why this lab exists

Reading a loop and being able to write it are different levels of knowledge, and the gap between them is precisely where maintainers are made. Every nontrivial vLLM PR you will ever review or write touches the contract between these four stages: the scheduler promises the executor a batch shape; the executor promises the sampler logits in row order; the sampler promises the scheduler a token per eligible request; update_from_output promises everyone that the bookkeeping is consistent before the next tick. Bugs live at these seams. After this lab, the seams are yours.

There's a second, sneakier payoff. By taking the engine apart and reassembling it, you prove to yourself that LLMEngine contains no magic: it owns its components and runs a four-line loop. That demystification compounds — when you later read EngineCore.step upstream and it's 60 lines instead of 4, you'll see immediately that the extra 56 are batching, async plumbing, and error handling, not new ideas.

Background: what a "step" really is

A step is the engine's clock tick. In one tick:

schedule — the scheduler looks at every live request and produces a verdict: a map {request_id: n} meaning "compute KV for the next n tokens of this request, this tick." For a fresh short prompt, n = the whole prompt (prefill). For a request mid-generation, n = 1 (decode). For a long prompt under chunked prefill, n = one chunk. The genius of the design is that downstream stages don't care which of those it is.
execute — the model computes the forward pass for all scheduled tokens of all scheduled requests in one batch. (In mini_vllm the toy model only needs each request's last token + position; the real engine feeds every scheduled token through the transformer and writes their KV into the paged cache — Phase 2.)
sample — for each request that caught up this tick (more below), turn its logits row into one new token id.
update — advance num_computed_tokens by n, append sampled tokens, check stop conditions, reap the finished (free their KV, drop from running).

The state at the end of a tick depends only on the state at the start — there is no hidden carry-over. That's why the engine can be paused, traced (lab-01), snapshotted, or driven by a test one tick at a time.

Files

starter.py — engine_step(scheduler, model, sampler) is stubbed with the full recipe in the docstring. Your work.
solution.py — reference (mirrors mini_vllm/engine.py::LLMEngine.step).
test_lab.py — equivalence tests against the real engine, plus the mid-prefill edge case.

Run

LAB_IMPL=starter pytest phase-01-architecture-and-request-lifecycle/labs/lab-03-engine-step-by-hand -q
pytest phase-01-architecture-and-request-lifecycle/labs/lab-03-engine-step-by-hand -q   # reference

What to implement

def engine_step(scheduler: Scheduler, model: ToyModel, sampler: Sampler) -> list[Request]

One full iteration; returns the requests that finished. You'll find the four stages spelled out in the starter docstring. Budget 30–45 minutes; if it takes longer, re-read 02-mini-build.md — the trouble is almost always stage 2.

The subtle part: who gets to sample?

Stage 2 hides the one genuinely subtle decision in the whole loop, and it's the reason this lab has an edge-case test:

A request emits a token this step iff its computed tokens catch up to all its tokens: num_computed_tokens + n == num_tokens (that's Scheduler.needs_sample).

Why? Sampling token k+1 requires the logits at position k, which require the KV of all tokens 0..k to exist. A mid-prefill chunk (say, tokens 0–3 of a 12-token prompt) computes useful KV but leaves the request's tail un-computed — sampling now would be sampling from a model that hasn't read the whole prompt. It would run without crashing, and it would produce garbage. This is the classic class of inference bug: silently wrong, not loudly broken. The test test_mid_prefill_chunk_emits_no_token exists so that if you ever forget the guard, you find out in 50 ms instead of in production.

(The real engine encodes the same rule via logits_indices — the model runner gathers logits only at each request's last scheduled position and the sampler only sees rows for requests that caught up. Different mechanism, identical invariant.)
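The catch-up rule is small enough to state as a standalone predicate. The sketch below is illustrative: the free function and its signature are assumptions made for this example, not the actual `Scheduler.needs_sample` method.

```python
def needs_sample(num_computed_tokens: int, n: int, num_tokens: int) -> bool:
    """True iff computing n more tokens catches the request up to its tail."""
    return num_computed_tokens + n == num_tokens


# A 12-token prompt prefilled in three chunks of 4.
num_tokens, computed = 12, 0
verdicts = []
for n in (4, 4, 4):
    verdicts.append(needs_sample(computed, n, num_tokens))
    computed += n
print(verdicts)  # [False, False, True]

# After the sampled token is appended, every decode step (n = 1) catches up.
num_tokens += 1
print(needs_sample(computed, 1, num_tokens))  # True
```

Only the last chunk may sample; the first two compute KV and nothing else.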

What the tests prove

| Test | What it pins |
| --- | --- |
| `test_single_request_matches_reference` | Your loop = the engine's loop, simplest case |
| `test_batch_matches_reference` | Row ordering: logits row i must go to scheduled request i. Shuffle them and tokens cross between requests: the "answer swap" bug that has hit real serving systems |
| `test_matches_under_chunked_prefill` | Your loop survives `n < remaining` (chunks) and a tight token budget without changing output |
| `test_mid_prefill_chunk_emits_no_token` | The `needs_sample` guard above |
| `test_empty_schedule_returns_no_finished` | The idle path: an engine with nothing to do must do nothing, gracefully |

The equivalence tests work because everything is deterministic: the toy model's logits are a pure function of (seed, last token, position) and greedy sampling has no randomness. Two engines with the same seed must agree token-for-token — so any disagreement is a bug in your wiring, never noise. Hold on to this technique: determinism turns "looks right" into "provably identical," and it's how vLLM's own correctness tests pin scheduler changes.
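A minimal sketch of why determinism makes equivalence testing possible, assuming a toy model whose logits depend only on (seed, last token, position). The names `toy_logits`, `greedy`, `generate`, and the vocabulary size are invented for this example and are not the `ToyModel` API.

```python
import random

VOCAB = 16  # toy vocabulary size (assumption for this sketch)


def toy_logits(seed: int, last_token: int, position: int) -> list[float]:
    """Pseudo-logits that depend only on (seed, last_token, position)."""
    # A string seed is hashed deterministically, so runs are reproducible.
    rng = random.Random(f"{seed}:{last_token}:{position}")
    return [rng.random() for _ in range(VOCAB)]


def greedy(logits: list[float]) -> int:
    """Argmax; ties go to the lowest index, so there is no randomness."""
    return max(range(len(logits)), key=logits.__getitem__)


def generate(seed: int, prompt: list[int], max_tokens: int) -> list[int]:
    tokens = list(prompt)
    for _ in range(max_tokens):
        tokens.append(greedy(toy_logits(seed, tokens[-1], len(tokens) - 1)))
    return tokens[len(prompt):]


run_a = generate(seed=0, prompt=[3, 1, 4], max_tokens=5)
run_b = generate(seed=0, prompt=[3, 1, 4], max_tokens=5)
assert run_a == run_b  # same seed, same wiring: identical tokens
```

If two loops built on the same pure function ever disagree, the difference has to come from the wiring.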

How this maps to the real engine

Side by side with upstream/vllm/v1/engine/core.py:428:

| Your line | Upstream | Notes |
| --- | --- | --- |
| `output = scheduler.schedule()` | `scheduler_output = self.scheduler.schedule()` | Identical role; upstream's output also carries block tables & slot mappings (Phase 2) |
| `model.forward(last_tokens, positions)` | `self.model_executor.execute_model(scheduler_output)` | Upstream ships the whole batch to GPU workers, possibly across processes/nodes (Phase 10) |
| `sampler.sample(logits[i], ...)` | inside the model runner: `self.sampler(...)` | Upstream samples on the GPU, vectorized over the batch (Phase 9) |
| `scheduler.update_from_output(...)` | `self.scheduler.update_from_output(...)` | Same name, same job |

Note what upstream does not do differently: the stage order, the catch-up rule, the reaping path. Architecture survives; implementation details scale.

Hitchhiker's notes

Order within the batch is a contract. rows[i] ↔ logits[i]. In mini_vllm this is a Python list; upstream it's tensor row indices (logits_indices). Either way, the scheduler and the sampler are communicating through positional agreement — one of those invisible contracts that only becomes visible when someone breaks it.
update_from_output must run even on steps where nothing sampled. Mid-prefill steps still advance num_computed_tokens — that's the whole point of the chunk. If you guard the update behind if sampled:, chunked prefill freezes forever. (Ask us how we know.)
Why does engine_step take the scheduler rather than creating one? Dependency injection isn't ceremony here: the tests hand you an engine's organs precisely so they can compare your loop against the engine that owns them. Upstream is shaped the same way for the same reason — EngineCore receives its executor, which is what lets tests swap in fakes.
The model is fake; the bookkeeping is real. ToyModel produces deterministic pseudo-logits and ignores KV contents. Everything you wired — scheduling verdicts, catch-up sampling, reaping — is faithful. This split (real control plane, toy data plane) is the course's core trick, and it's also how you should unit-test engine changes upstream: the control plane rarely needs a GPU to be proven correct.
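The row-order contract from the first note can be made concrete with two requests and hand-written logits. This is a toy illustration; the request ids and logit values are invented.

```python
rows = ["req-A", "req-B"]
logits = [[0.1, 0.9, 0.0],   # belongs to rows[0]
          [0.8, 0.1, 0.1]]   # belongs to rows[1]


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


# Correct wiring: row i of the logits goes to request i.
correct = {rid: argmax(row) for rid, row in zip(rows, logits)}
# Broken wiring: the request list is reversed but the logits are not.
swapped = {rid: argmax(row) for rid, row in zip(reversed(rows), logits)}

print(correct)  # {'req-A': 1, 'req-B': 0}
print(swapped)  # {'req-B': 1, 'req-A': 0}
```

Nothing crashes in the broken version. Each request simply receives the other's token, which is why only an equivalence test catches it.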

Going further

Add a callback(step_idx, output, sampled) parameter and rebuild lab-01's trace using your own step function. Observability-as-a-hook is exactly how vLLM's stat loggers attach to the loop.
Break the row-order contract on purpose (reverse rows but not logits) and watch which test catches it and how — the failure is instructive: outputs are plausible-looking tokens, just the wrong ones.
Time 1000 steps of a decode-only batch at batch sizes 1, 8, 64 (time.perf_counter). Even on CPU with a toy model you'll see per-step overhead amortize with batch size — a small-scale preview of why batching is the first lever of throughput (Phase 18).

References

mini_vllm/engine.py — the LLMEngine.step you are reimplementing.
upstream/vllm/v1/engine/core.py:428 — EngineCore.step.
upstream/vllm/v1/worker/gpu_model_runner.py — where execute/sample happen for real; search for logits_indices to find the catch-up rule's production form.
Yu et al., Orca (OSDI 2022) — §4, "iteration-level scheduling": the paper that first made "one step of everyone" the unit of work. https://www.usenix.org/conference/osdi22/presentation/yu
vLLM blog, vLLM V1: A Major Upgrade — the rewrite that flattened the engine loop into the shape you just built: https://blog.vllm.ai/2025/01/27/v1-alpha-release.html

Lab 01-04 — Watch the Batch: Continuous Batching Made Visible `[CPU-OK]`

One request's lifecycle (lab-01) is a nice story. But inference engines earn their living when many requests share the machine — and the way they share it is the single biggest throughput idea of the last few years: continuous batching. In this lab you'll instrument the engine to photograph the batch composition of every step — who got scheduled, for how many tokens — and you'll directly observe the thing the famous benchmark posts only describe: prefill chunks of one request riding in the same step as decodes of another.

Why this lab exists

This lab is Phase 3 knocking on the door early — on purpose. The scheduler is easier to implement (Phase 3 lab-01) after you've seen its decisions laid out step by step. More practically: per-step batch composition is the engine's most important hidden variable. The wall-clock time of a step is roughly proportional to the tokens scheduled in it, so the sequence of dicts you're about to record is, up to a constant, the latency profile of the server. Spiky dicts = spiky inter-token latency. When Phase 3 lab-05 measures chunked prefill's effect on decode latency, it will use exactly the probe you build here.

You'll also learn the instrumentation pattern itself: wrapping a component's method to observe a system without changing its behavior. That's how vLLM's own stat loggers attach to the engine, and how you'll debug schedulers for the rest of your career — schedulers rarely crash; they just quietly make bad batches. You can't grep for a bad batch. You have to look at it.

Background: static vs continuous batching

The old way (pre-Orca, ~2022): collect N requests, run them as a unit until all N finish, then take the next N. Two disasters hide in that sentence. First, requests finish at different times, and finished slots sit idle while the longest request drags on (the "convoy"). Second, a request arriving one millisecond after the batch launched waits an entire batch lifetime to even start. GPU utilization graphs of static-batched servers look like a comb: bursts of work, then teeth-gaps of idle.

The Orca insight (OSDI 2022), which vLLM adopted and the whole industry copied: rebuild the batch every step. A request can join the batch at any step boundary (its prefill just becomes part of that step) and leave at any step boundary (its slot is free next step, not at end-of-batch). The batch isn't a unit of work anymore — it's whatever the scheduler composed for this one tick. Anyscale's benchmark of this idea measured up to 23× throughput over static batching. That entire revolution is visible in the data structure you're about to record: consecutive dicts whose key sets grow and shrink while the engine never stops.
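The convoy effect can be counted in a toy model. The sketch below makes strong simplifying assumptions: prefill is ignored, every live request produces exactly one token per step, and a batch holds a fixed number of slots. Both function names are invented for this example.

```python
def static_batching_steps(output_lens, batch_size):
    """Steps needed when each batch runs until its longest member finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Steps needed when a freed slot is refilled at the next step boundary."""
    waiting = list(output_lens)   # FCFS queue of tokens each request must emit
    running = []                  # tokens still owed by each live request
    steps = 0
    while waiting or running:
        # Admit at the step boundary while slots are free.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        # One tick: every live request emits one token.
        running = [r - 1 for r in running]
        # Reap finished requests; their slots are free next step.
        running = [r for r in running if r > 0]
        steps += 1
    return steps


lens = [2, 10, 3, 3, 2, 2]
print(static_batching_steps(lens, batch_size=2))      # 15
print(continuous_batching_steps(lens, batch_size=2))  # 12
```

In the static run the 2-token request's slot idles for 8 steps while its 10-token neighbor finishes. In the continuous run that slot is refilled immediately.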

Files

starter.py — implement trace_batches (engine + probe). Your work.
solution.py — reference.
test_lab.py — pins step-1 composition, token conservation, budget cap, deferral under a tight budget, and the existence of mixed prefill+decode steps.

Run

LAB_IMPL=starter pytest phase-01-architecture-and-request-lifecycle/labs/lab-04-watch-the-batch -q
pytest phase-01-architecture-and-request-lifecycle/labs/lab-04-watch-the-batch -q   # reference

What to implement

def trace_batches(prompts, max_tokens=4, **engine_kwargs) -> tuple[list[str], list[dict[str, int]]]

Add all prompts (greedy, ignore_eos=True), then run eng.step() to completion — but first, wrap eng.scheduler.schedule with a closure that calls the original, appends a copy of out.num_scheduled_tokens to your trace, and returns out unchanged. The probe must be invisible: same engine behavior with or without it. (Copy the dict! The scheduler gives you its own object; aliasing it is the kind of bug that produces a trace where every step mysteriously looks like the last one.)
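The wrap-and-copy pattern can be tried in isolation before touching the engine. In this sketch `FakeScheduler`, `FakeOutput`, and `attach_probe` are hypothetical stand-ins. The fake deliberately reuses one dict across steps, which is an assumption chosen to make the aliasing hazard visible, not a claim about how mini_vllm's scheduler behaves.

```python
class FakeOutput:
    def __init__(self, num_scheduled_tokens):
        self.num_scheduled_tokens = num_scheduled_tokens


class FakeScheduler:
    """Stand-in scheduler that reuses ONE dict across steps."""

    def __init__(self, plan):
        self._plan = list(plan)
        self._shared = {}

    def schedule(self):
        self._shared.clear()
        if self._plan:
            self._shared.update(self._plan.pop(0))
        return FakeOutput(self._shared)


def attach_probe(scheduler, trace):
    original = scheduler.schedule        # bound method, captured by the closure

    def probed_schedule():
        out = original()
        trace.append(dict(out.num_scheduled_tokens))  # copy, never alias
        return out                       # returned unchanged: the probe is invisible

    scheduler.schedule = probed_schedule  # shadows the method on this instance only


sched = FakeScheduler([{"a": 8}, {"a": 3, "b": 5}])
trace = []
attach_probe(sched, trace)
sched.schedule()
sched.schedule()
print(trace)  # [{'a': 8}, {'a': 3, 'b': 5}]
```

Appending `out.num_scheduled_tokens` itself instead of `dict(...)` would make both trace entries the same object, and both would show the last step.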

What you should see — the full trace, explained

Two prompts — A = "hello world" (11 tokens) and B = "goodbye" (7 tokens), max_tokens=4 — with a tight budget of max_num_batched_tokens=8:

step 1: {A: 8}            # A's prefill, CHUNKED to the budget. B is NOT admitted: budget spent.
step 2: {A: 3, B: 5}      # A finishes prefill (3 left) + samples token 1.
                          #   B finally admitted with the leftover budget: 8-3=5 of its 7.
step 3: {A: 1, B: 2}      # ← THE MONEY STEP: A is decoding (1 token) while B is still
                          #   prefilling (its last 2) — prefill and decode IN THE SAME BATCH.
step 4: {A: 1, B: 1}      # both decoding.
step 5: {A: 1, B: 1}      # ...
step 6: {B: 1}            # A hit max_tokens and left; B has the machine to itself.

Read it like a maintainer:

Step 1 is {A: 8}, not {A: 11} — the budget (8) caps the step, so the scheduler takes the first 8 tokens of A's prompt and stops. Nothing special-cased: n = min(remaining, budget). And B isn't admitted at all, because admission requires leftover budget. B is spending this step in the WAITING queue — this is the queueing delay that lab-01 promised you'd see under load.
Step 2 is where continuous batching starts paying — A's last chunk and B's first chunk share a step. A static-batch engine cannot produce this step; it doesn't have a concept for "half of A and half of B."
Step 3 is the signature — min=1, max>1 in the same dict: a decode and a prefill chunk co-scheduled. The test test_mixed_batches_exist_under_load hunts for exactly this shape. On a GPU this mixing is also an efficiency trick: decode alone underuses compute (bandwidth-bound), prefill alone starves latency; mixed batches fill the compute bubbles with prefill work (Sarathi's "piggybacking" — Phase 3).
Step 6's shrinking key set — A finished and was reaped mid-flight; B never noticed. Its slot is reusable immediately. That, and nothing more, is "continuous."
Conservation check — sum A's numbers: 8+3+1+1+1 = 14 = 11 + 4 − 1. Each request's scheduled tokens total prompt + max_tokens − 1. Why −1? The final sampled token is appended and the request immediately finishes — its KV is never computed, because no further token will ever attend to it. The engine doesn't do work the future won't read. When a counter is off by one in a scheduler, this is the kind of identity you use to find it; that's why there's a test pinning it.

Rerun with the default roomy budget (2048) and the drama disappears: step 1 is {A: 11, B: 7}, everything after is decodes. Scheduling is only interesting under scarcity — keep that in mind when building benchmarks, or you'll "validate" a scheduler on workloads that never exercise it.
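The trace above can be reproduced with a toy scheduler of a few dozen lines. This sketch reconstructs the policy from the description in this lab (running requests first, FCFS admission only with leftover budget, n = min(remaining, budget)); it is not the mini_vllm scheduler, and `simulate` is a name invented for this example.

```python
def simulate(prompts, max_tokens, budget):
    """prompts: {request_id: prompt_len} in arrival (FCFS) order; budget > 0."""
    state = {rid: {"num_tokens": n, "computed": 0, "outputs": 0}
             for rid, n in prompts.items()}
    running, waiting = [], list(prompts)
    trace = []
    while running or waiting:
        left, step = budget, {}
        # Running requests are served first, in admission order.
        for rid in running:
            s = state[rid]
            n = min(s["num_tokens"] - s["computed"], left)
            if n > 0:
                step[rid] = n
                left -= n
        # Waiting requests are admitted only while budget is left over.
        while waiting and left > 0:
            rid = waiting.pop(0)
            s = state[rid]
            n = min(s["num_tokens"] - s["computed"], left)
            step[rid] = n
            left -= n
            running.append(rid)
        trace.append(step)
        # Update: advance counters, sample on catch-up, reap the finished.
        for rid, n in step.items():
            s = state[rid]
            s["computed"] += n
            if s["computed"] == s["num_tokens"]:   # caught up: one new token
                s["num_tokens"] += 1
                s["outputs"] += 1
                if s["outputs"] >= max_tokens:
                    running.remove(rid)
    return trace


trace = simulate({"A": 11, "B": 7}, max_tokens=4, budget=8)
for i, step in enumerate(trace, start=1):
    print(f"step {i}: {step}")
# step 1: {'A': 8}
# step 2: {'A': 3, 'B': 5}
# step 3: {'A': 1, 'B': 2}
# step 4: {'A': 1, 'B': 1}
# step 5: {'A': 1, 'B': 1}
# step 6: {'B': 1}
```

Summing each request's entries gives 14 for A and 10 for B, which is prompt + max_tokens − 1 in both cases.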

What the tests prove

| Test | Invariant |
| --- | --- |
| `test_ample_budget_prefills_everyone_in_step_one` | With budget to spare, admission is immediate; queueing is a scarcity phenomenon, not a constant tax |
| `test_token_conservation_per_request` | `Σ scheduled = prompt + max_tokens − 1`, the off-by-one identity above |
| `test_budget_is_never_exceeded` | `Σ over the batch ≤ max_num_batched_tokens`, every single step: the engine's load-bearing promise to the GPU's latency |
| `test_tight_budget_chunks_and_defers` | The exact step-1/step-2 composition above: chunking + deferred admission |
| `test_mixed_batches_exist_under_load` | A prefill chunk and a decode co-exist in one step |

Hitchhiker's notes

The probe pattern beats print-debugging schedulers. You get structured data you can assert on, diff between runs, and plot. The real engine's equivalent surface is SchedulerOutput (upstream vllm/v1/core/sched/output.py) — when debugging real vLLM, logging num_scheduled_tokens per step gives you this exact trace.
Why does B wait a whole step when the budget is spent? Could the scheduler give A 7 and B 1 instead of A 8? It could — but FCFS says finish admitting A's work first; fairness policies are a deep rabbit hole (priority scheduling lands in Phase 3's exercises). The shape to remember: policy decides who, budget decides how much, and they're separable concerns in the code.
Step time ∝ scheduled tokens is a good first-order model but not exact on real hardware: a decode-only step pays memory-bandwidth costs that token-count alone doesn't capture, and tiny steps pay fixed launch overheads (which CUDA graphs attack — Phase 5). Phase 18 refines the model; the trace you built stays the right raw material.
Request IDs are global. mini_vllm numbers requests with a module-level counter, so don't hardcode req-0 in your own experiments — use the ids trace_batches returns. The tests are written that way for exactly this reason.

Going further

Plot the trace: steps on x, stacked bars of scheduled tokens per request. You've recreated the iconic continuous-batching diagram from the Orca paper and the Anyscale post — except yours is measured, not illustrated.
Sweep max_num_batched_tokens from 4 to 64 over the same prompts and plot total steps vs budget. You'll see a hyperbola flatten: past "everything fits," more budget buys nothing. Congratulations, you've found a saturation knee — Phase 18 is full of these.
Add 8 requests with staggered arrival (add two, step twice, add two more …). Watch key sets churn. This is what a production batch actually looks like: a rolling membership, no two steps alike.

References

Yu et al., Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) — iteration-level scheduling, the idea this lab photographs: https://www.usenix.org/conference/osdi22/presentation/yu
Anyscale, How continuous batching enables 23x throughput in LLM inference (2023) — the benchmark post that made this mainstream: https://www.anyscale.com/blog/continuous-batching-llm-inference
Agrawal et al., Sarathi-Serve: Taming Throughput-Latency Tradeoff in LLM Inference (OSDI 2024) — why mixed prefill+decode batches are not just legal but desirable: https://arxiv.org/abs/2403.02310
upstream/vllm/v1/core/sched/output.py — SchedulerOutput, the real engine's version of the dicts you recorded.
upstream/vllm/v1/core/sched/scheduler.py:329 — the loop that composed every step you traced; you implement its core in Phase 3 lab-01.

Lab 01-05 — Stop Conditions & Finish Reasons `[CPU-OK]`

Every request dies. The only questions are when and what we tell the user about it. This lab dissects the engine's stop machinery — the few lines of update_from_output that decide whether a generation halts on the model's own EOS token or on the operator's max_tokens cap — and the mapping from internal status to the finish_reason field that every OpenAI API consumer in the world branches on.

It looks small. It is small. It is also the part of the engine with the highest bug-impact-to-code-size ratio: an off-by-one or a mis-ordered check here doesn't crash — it silently truncates answers, or streams one token too many, for every user, forever.

Why this lab exists

Ask anyone who's run an LLM API in production what their most common user-facing bug report is. It won't be a crash. It will be: "the answer just… cuts off." Triaging that report requires knowing exactly what you'll know after this lab: was it finish_reason: "length" (the operator's cap — raise max_tokens), "stop" (the model chose to end — a prompting issue), or a stream that died without a reason (an actual bug)? The distinction is three enum values and two if statements, and entire support rotations have burned days for lack of it.

There's an engineering lesson too. Stop handling is where model behavior (EOS is just a token the model can emit, with a probability like any other) meets system policy (max_tokens is an admission-control and billing boundary). Keeping those two cleanly separated — and correctly ordered — is a miniature of the whole serving-systems discipline.

Background: the three ways a request ends

The model stops itself by sampling the EOS (end-of-sequence) token. EOS is not magic: it's a vocabulary entry (id 256 in mini_vllm's ByteTokenizer; id 2 in the Llama 2 tokenizer; <|endoftext|> = 50256 for GPT-2) that the model learned to emit when a response is complete. The engine checks "was the token just appended the EOS?" and if so marks FINISHED_STOPPED → API finish_reason: "stop". A well-behaved model ends most chat turns this way.
The operator stops it: num_output_tokens >= max_tokens. Marked FINISHED_LENGTH → API finish_reason: "length". To an API consumer this usually means "your answer was truncated; consider a bigger budget." To the operator it's the lever that bounds worst-case cost and KV occupancy per request; schedulers need a worst case to exist (this underpins the deadlock argument coming in Phase 3 lab-04).
Someone aborts it — client disconnect, admin action. Real vLLM has FINISHED_ABORTED for this; mini_vllm omits it (no clients to disconnect). Worth knowing it exists: cancellation is a first-class lifecycle path in production, and "KV freed on abort" is a real invariant people have broken.

And one anti-way that trips newcomers: ignore_eos=True (used throughout this course's tests, and by every serious benchmark) disables check #1, so generation always runs to the cap. Why would anyone want a model to blow through its own stop sign? Benchmarking. If you're measuring tokens/sec, you need every request to produce a known, fixed number of tokens regardless of what the model "wants" to say. The flag exists for load generators, not users — and you've been benefiting from it since lab-01 without noticing: it's what made your traces deterministic in length.

Files

starter.py — implement finish_reason (status → API string) and run_until_stop (the feed-tokens-until-something-fires simulation of the update stage). Your work.
solution.py — reference.
test_lab.py — the EOS path, the ignore_eos path, the length path, the boundary tie, the unfinished case, and an end-to-end engine check.

Run

LAB_IMPL=starter pytest phase-01-architecture-and-request-lifecycle/labs/lab-05-stop-conditions -q
pytest phase-01-architecture-and-request-lifecycle/labs/lab-05-stop-conditions -q   # reference

What to implement

Two functions. finish_reason(request) is the status-to-API translation table. run_until_stop(token_stream, eos_token_id, sampling_params) replays the engine's update stage with pre-decided tokens: append one, run maybe_finish(), break if it fired. Using a scripted token stream instead of a sampler is the trick that makes stop logic exhaustively testable — you can place an EOS at any position you like, including exactly on the max_tokens boundary, something you could wait a long time for a sampler to do for you. (This is also how you should test stop-sequence handling upstream: script the stream, pin the behavior.)

The edge case the tests are really about

What should happen when the model emits EOS exactly at the max_tokens boundary? Both conditions are true simultaneously. Look at mini_vllm/request.py::maybe_finish: the EOS check runs first, so the request reports "stop". That ordering is a deliberate, user-visible API decision, not an accident of code layout: "stop" tells the consumer "the answer is complete"; "length" tells them "the answer was cut off — maybe retry with a bigger budget." On the boundary, the answer is complete — reporting "length" would invite pointless retries (and with auto-retrying clients, real money). Real vLLM resolves the tie the same way.

test_eos_on_the_boundary_reports_stop pins this. If someone "tidies up" maybe_finish by reordering the checks, that test fails — which is the whole job of a test like that: turning an invisible design decision into a tripwire. Notice the meta-lesson: whenever a function checks two conditions that can be true at once, the order is an API. Grep any engine you maintain for such pairs; most of them are untested.
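The ordering argument can be sketched as a two-check function. This is a hypothetical illustration of the tie-break, not the code in mini_vllm/request.py: the `Status` enum and `check_after_append` are names invented for this example.

```python
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    FINISHED_STOPPED = "stop"
    FINISHED_LENGTH = "length"


def check_after_append(output_token_ids, eos_token_id, max_tokens, ignore_eos=False):
    """Decide a request's status right after one token was appended."""
    # Check 1 (model behavior): did the model just emit EOS?
    if not ignore_eos and output_token_ids[-1] == eos_token_id:
        return Status.FINISHED_STOPPED
    # Check 2 (system policy): has the operator's cap been reached?
    if len(output_token_ids) >= max_tokens:
        return Status.FINISHED_LENGTH
    return Status.RUNNING


EOS = 256
on_boundary = [10, 20, EOS]   # EOS lands exactly on max_tokens=3
print(check_after_append(on_boundary, EOS, max_tokens=3).value)                   # stop
print(check_after_append(on_boundary, EOS, max_tokens=3, ignore_eos=True).value)  # length
print(check_after_append([10, 20], EOS, max_tokens=3).value)                      # running
```

Swapping the two `if` blocks changes the first result to "length" and nothing else, which is exactly the regression the boundary test is there to catch.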

What the tests prove

| Test | What it pins |
| --- | --- |
| `test_eos_stops_generation` | Tokens after EOS are never generated: the stream truly halts |
| `test_ignore_eos_runs_to_length_cap` | `ignore_eos` neutralizes check #1 only; the cap still binds |
| `test_no_eos_hits_length_cap` | The cap fires at exactly `max_tokens`, not ±1 |
| `test_eos_on_the_boundary_reports_stop` | The tie-break above |
| `test_unfinished_request_has_no_reason` | WAITING/RUNNING → `None`: a streaming response must not carry a finish_reason until the end |
| `test_engine_reports_length_with_ignore_eos` | Your mapper agrees with the engine's real loop, end to end |

How this maps to the real engine

upstream/vllm/v1/request.py — RequestStatus and get_finished_reason(): the same mapping you wrote, plus FINISHED_ABORTED → "abort". Note upstream encodes "is finished" as an ordering on the enum (status > PREEMPTED) — mini_vllm copies that trick, which is why the enum's declaration order is load-bearing in both. (A reordered enum constant breaking is_finished is exactly the kind of PR a maintainer learns to catch on sight.)
upstream/vllm/v1/engine/output_processor.py — where statuses become the finish_reason strings in API responses, including for streaming (sent only on the final chunk — your None-until-finished mapping is what makes that correct).
The real engine checks more stops than these two: stop strings (must be detected on detokenized text, which means stop handling interacts with the detokenizer's streaming buffer — a genuinely tricky area), stop_token_ids (per-request custom EOS lists), and min_tokens (suppress EOS before a floor — the mirror image of ignore_eos). Each is the same shape you built: a predicate over the request's tail, checked in a defined order, in update_from_output. When you read that upstream code now, it will parse as "lab-05, four more times."
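The "finished is an ordering on the enum" trick from the first item can be sketched with `IntEnum`. The class below is a toy modeled on the description above, not the upstream `RequestStatus`; the member list and the `_REASONS` table are assumptions for illustration.

```python
from enum import IntEnum, auto
from typing import Optional


class ToyStatus(IntEnum):
    WAITING = auto()
    RUNNING = auto()
    PREEMPTED = auto()
    # Declaration order is load-bearing: everything below counts as finished.
    FINISHED_STOPPED = auto()
    FINISHED_LENGTH = auto()
    FINISHED_ABORTED = auto()


def is_finished(status: ToyStatus) -> bool:
    return status > ToyStatus.PREEMPTED


_REASONS = {
    ToyStatus.FINISHED_STOPPED: "stop",
    ToyStatus.FINISHED_LENGTH: "length",
    ToyStatus.FINISHED_ABORTED: "abort",
}


def finish_reason(status: ToyStatus) -> Optional[str]:
    """API string for finished statuses, None while the request is live."""
    return _REASONS.get(status)


print(is_finished(ToyStatus.RUNNING), finish_reason(ToyStatus.RUNNING))
# False None
print(is_finished(ToyStatus.FINISHED_LENGTH), finish_reason(ToyStatus.FINISHED_LENGTH))
# True length
```

Moving `PREEMPTED` below any `FINISHED_*` member would silently flip `is_finished` for that member, with no syntax error to warn you.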

Hitchhiker's notes

EOS consumes a token of budget. In mini_vllm (and in token accounting generally) the EOS lands in output_token_ids — your test_eos_stops_generation result was [10, 20, EOS], three tokens spent. APIs differ on whether the EOS is shown (vLLM strips it from text but it exists in the token count). If you've ever wondered why an API bills N+1 tokens for an N-token answer — this is why.
max_tokens counts output, not total. Prompt length lives in a different limit (max_model_len, which caps prompt + output together). Conflating the two produces classic admission bugs: a request with a 4000-token prompt and max_tokens=200 needs 4200 tokens of headroom, and the real scheduler must reserve for the worst case, not the average.
Greedy + a real model can loop forever ("the the the…") — and without a length cap, that request never finishes, never frees its KV, and slowly strangles the server. The cap isn't a UX nicety; it's the engine's guarantee that every admission terminates. Treat any proposal of "unlimited max_tokens" as what it is: a resource-leak feature request.
Sampling parameters can make EOS unreachable in subtler ways than ignore_eos: a logit_bias of −∞ on the EOS id, or a min_tokens floor that has not been reached yet. The stop machinery composes with the sampler (Phase 9); when stops "mysteriously" don't fire, the sampler is suspect #1.
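The interplay of the two limits in the second note is plain arithmetic. The sketch below only illustrates that arithmetic; the helper names are invented, and how a real engine reacts when the worst case does not fit (reject, clamp, or cap at generation time) is deliberately left out.

```python
def worst_case_tokens(prompt_len: int, max_tokens: int) -> int:
    """KV headroom the request may need if it runs to its cap."""
    return prompt_len + max_tokens


def effective_output_cap(prompt_len: int, max_tokens: int, max_model_len: int) -> int:
    """Output tokens that fit under both limits at once."""
    return max(0, min(max_tokens, max_model_len - prompt_len))


print(worst_case_tokens(4000, 200))           # 4200
print(effective_output_cap(4000, 200, 8192))  # 200
print(effective_output_cap(4000, 200, 4096))  # 96
```

With `max_model_len=4096`, the same request can emit at most 96 tokens no matter what `max_tokens` says.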

Going further

Add stop-string support to run_until_stop: decode the accumulated output with ByteTokenizer after each token and halt when a given string appears. You'll immediately hit the real-world wrinkle: the stop string can straddle a token boundary, so you must check a sliding window of recent text, not just the newest fragment. Now read how upstream solves it (search stop in output_processor.py) and admire the buffering.
Implement min_tokens: suppress the EOS check while num_output_tokens < min_tokens. One line. Then write the boundary test for it (EOS exactly at min_tokens) — you know the drill now.
In real vLLM, run a chat model and print finish_reason for: a normal question, the same with max_tokens=5, and the same with ignore_eos=True. Watch "stop", "length", "length" come back — your three paths, on production silicon.
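For the first item, the straddling problem can be seen on plain text fragments before any tokenizer is involved. This is a minimal sketch, assuming already-decoded string fragments and a non-empty stop string. `find_stop` is a name invented here and is not how upstream implements it.

```python
from typing import Optional


def find_stop(fragments: list[str], stop: str) -> Optional[int]:
    """Number of fragments consumed when `stop` first appears, else None."""
    text = ""
    for i, frag in enumerate(fragments, start=1):
        text += frag
        # A new match must end inside the newest fragment, so it can reach
        # back at most len(stop) - 1 characters into the earlier text.
        window = text[-(len(frag) + len(stop) - 1):]
        if stop in window:
            return i
    return None


fragments = ["Hel", "lo\n", "\nUs", "er:"]
print(find_stop(fragments, "\n\nUser:"))                  # 4
print(any("\n\nUser:" in frag for frag in fragments))    # False
```

No single fragment contains the stop string; only the sliding window over recent text finds it.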

References

mini_vllm/request.py — maybe_finish(): the eight lines this lab is about.
upstream/vllm/v1/request.py — RequestStatus.get_finished_reason.
upstream/vllm/v1/engine/output_processor.py — stop strings, streaming finish_reason.
OpenAI API reference, Chat Completions — the finish_reason contract your mapper implements: https://platform.openai.com/docs/api-reference/chat/object
vLLM docs, Sampling Parameters — stop, stop_token_ids, min_tokens, ignore_eos: https://docs.vllm.ai/en/latest/api/inference_params.html

Phase 01 — Exercises: Architecture & Request Lifecycle

Warm-up (explain)

1. Name the four stages of EngineCore.step and the course phase that owns each.
2. What's the difference between LLM and AsyncLLM? What do they share?
3. List the objects a request becomes: prompt → ? → ? → ? → RequestOutput.

Core (trace the code)

4. In EngineCore.step (core.py:428), which stage can return None, and what is called then?
5. Who owns the GPU: Executor, Worker, or ModelRunner? What does each do?
6. Why does V1 run EngineCore in its own process? What crosses the boundary?

Build (your lab)

7. In lab-01, at which step does num_computed_tokens first equal the prompt length, and why?
8. Extend trace_request to trace two requests at once; observe how the scheduler interleaves them across steps (continuous batching, Phase 3).
9. Add a WAITING snapshot (before the first schedule) to your trace. Why is there usually only one WAITING tick for a lone request on an idle engine?

Design (staff-level)

10. A user reports high TTFT but normal ITL. Which stage(s) of step would you investigate, and which phase's knobs (2/3/5) would you reach for?
11. You're asked to add a new API surface (e.g. a gRPC endpoint). Which layer do you build it at, and what must it produce/consume to reuse the existing core unchanged?
12. Explain why detokenization runs off the core's hot path in the server. What would break if it ran inside EngineCore.step?

Self-grading

4–6 and 10–12 are interview-grade. Could you draw the full request journey and name every file? If not, re-read 01-deep-dive.md §"The whole journey, named".

Phase 01 — Interview Questions: Architecture & Request Lifecycle

Q1. Walk me through what happens between `LLM.generate(prompt)` and the first token.

Model answer

generate tokenizes the prompt and builds an EngineCoreRequest; add_request wraps it in a Request and enqueues it in the scheduler. Then the engine loops EngineCore.step: the scheduler picks the request and how many tokens to compute (the whole prompt, as prefill), the executor runs the model on the assembled batch via a worker/model-runner, the sampler produces the first token, and update_from_output advances num_computed_tokens and records the token. The output processor detokenizes and returns/streams it. (llm.py:422 → core.py:428.)

Q2. What are the four stages of the engine step?

Model answer

schedule() (who runs, how many tokens — Phase 3), execute_model() (run the forward pass on a worker/model-runner — Phases 4–14), sample_tokens() (pick the next token — Phase 9), and update_from_output() (advance counters, reap finished requests — Phase 3). Everything in vLLM is a deep dive into one of these. (core.py:428.)

Q3. Executor vs Worker vs ModelRunner — who does what?

Model answer

The Executor (v1/executor/) is the engine's handle to compute; it owns one Worker for single-GPU or N for tensor/pipeline parallel and fans execute_model out to them. A Worker (gpu_worker.py) owns one GPU's device, model shard, and KV cache. The ModelRunner (gpu_model_runner.py) turns a SchedulerOutput into input tensors + attention metadata, runs the (CUDA-graphed) forward pass, and runs the sampler. This indirection is why the same engine runs on 1 or 64 GPUs — only the Executor changes.

Q4. Why does V1 isolate `EngineCore` in its own process?

Model answer

To keep the tight GPU scheduling loop off the API server's event loop and free of GIL contention with HTTP handling and detokenization, and to cleanly coordinate multi-GPU worker processes. Requests cross the boundary as serialized EngineCoreRequests and results as EngineCoreOutputs; output processing/detokenization runs server-side so it never stalls the core. (EngineCoreProc, core.py:835.)

Model answer

Both are thin shells over the same EngineCore. LLM/LLMEngine is the synchronous batch shell (add_request + pump step), AsyncLLM is the async/streaming shell for the OpenAI server. The scheduling, execution, and sampling are identical; only the entry/exit (sync vs async, full result vs streamed deltas) differ.

Rapid-fire

Offline entry point? LLM.generate (llm.py:422).
Online entry point? AsyncLLM behind the OpenAI server.
The heartbeat? EngineCore.step (core.py:428).
Object the scheduler operates on? Request (with status + counters).
What update_from_output does? Advance num_computed_tokens, append tokens, reap finished.

Phase 01 — Cheatsheet: Architecture & Request Lifecycle

The journey

LLM.generate / AsyncLLM  ->  EngineCore.step (loop)  ->  Detokenizer  ->  RequestOutput
   step = schedule (Ph3) -> execute_model (Ph4-14) -> sample (Ph9) -> update_from_output (Ph3)

Entry points

Offline: LLM(model=...).generate(prompts) (entrypoints/llm.py:422) → LLMEngine.
Online: OpenAI server → AsyncLLM (v1/engine/async_llm.py). Same core, async + streaming.

The compute chain

EngineCore → Executor (1 or N workers) → Worker (owns one GPU) → ModelRunner (gpu_model_runner.py: SchedulerOutput → tensors → forward → sample).

Objects that flow

prompt+SamplingParams → EngineCoreRequest → Request (counters+status) → SchedulerOutput → ModelRunnerOutput → RequestOutput.

Lifecycle

WAITING → RUNNING → FINISHED_* ; PREEMPTED → WAITING (Phase 3). RequestStatus (request.py:315).

Process model

EngineCore runs in its own process (EngineCoreProc, core.py:835) — tight loop off the GIL; detokenization runs server-side, off the hot path.

Key upstream

entrypoints/llm.py:422 generate · v1/engine/llm_engine.py:209/287 add_request/step
v1/engine/core.py:428 step · :337 add_request · :835 EngineCoreProc
v1/engine/async_llm.py AsyncLLM · v1/worker/gpu_model_runner.py the runner

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

Phase 02 — The Hitchhiker's Guide to PagedAttention ⭐

← Phase 01 · Course home · Phase 03 →

This is a flagship phase — written in full. Use it as the template for the depth every other phase aims at.

Don't Panic

Here is the entire idea, in one breath:

The KV cache is the model's memory of the conversation so far. Naively, you'd give each request one big contiguous slab of GPU memory to hold it. PagedAttention instead chops the KV cache into fixed-size blocks (like an operating system chops memory into pages) and lets each request's blocks live anywhere in GPU memory, tracked by a little block table. That one change (contiguous slab → scattered pages) is why vLLM serves several times more requests per GPU than the systems that came before it.

If you have ever learned how an OS gives processes "virtual memory" backed by scattered physical pages, you already understand PagedAttention. It is literally that idea, applied to the KV cache. The vLLM paper's title even says so: "Efficient Memory Management for Large Language Model Serving with PagedAttention."

Take a breath. By the end of this phase you will have written a working paged block allocator yourself (mini_vllm/block_pool.py) and read the real one (upstream/vllm/v1/core/block_pool.py) line by line.

Step 1: Why is memory the problem at all?

Recall from Phase 0: during generation, the model caches a Key and Value vector for every token it has seen, in every layer. This is the KV cache. It is enormous and it grows as the conversation gets longer.

A rough size for one sequence:

kv_bytes_per_token = 2 (K and V) × num_layers × num_kv_heads × head_dim × dtype_bytes

For Llama-3-8B (32 layers, 8 KV heads, head_dim 128, fp16) that's about:

2 × 32 × 8 × 128 × 2  ≈  131 KB  per token

A 2,000-token conversation is ~260 MB of KV, and that is for one user. On a 24 GB GPU, after the ~16 GB of weights, you have ~8 GB for KV: room for maybe ~30 such conversations. Memory, not compute, is what caps how many users you can serve. So how you manage that memory is the whole ballgame.
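
The arithmetic above is easy to check yourself. The sketch below plugs the Llama-3-8B figures from the text into the formula; the helper name kv_bytes_per_token and the round 8 GB budget are assumptions made for illustration.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache one token occupies across all layers (K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Llama-3-8B figures from the text: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(32, 8, 128, 2)
per_conversation = per_token * 2_000      # one 2,000-token conversation
kv_budget = 8 * 10**9                     # assumed: ~8 GB left after the weights
max_conversations = kv_budget // per_conversation

print(per_token)           # 131072 bytes, about 131 KB
print(per_conversation)    # 262144000 bytes, about 262 MB
print(max_conversations)   # 30
```

Swap in another model's config (or a smaller dtype) to see how quickly the per-user cost moves.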

Step 2: The old way, and why it bled memory

Pre-vLLM systems reserved a contiguous chunk of KV memory per request, sized for the maximum possible length (e.g. 2048 tokens), up front.

Request A (will generate 30 tokens, reserved 2048):
[####..............................................................]   <- 2018 slots WASTED
 ^30 used

Request B (reserved 2048):
[#########.........................................................]   <- ~2000 WASTED

Two diseases:

Internal fragmentation — you reserve for the worst case (2048) but use 30. The other ~2018 slots sit idle, reserved, unusable by anyone else.
External fragmentation — as requests of different sizes come and go, free memory breaks into chunks too small to fit the next contiguous request, even though the total free memory is plenty.

Studies in the vLLM paper found these wasted 60–80% of KV memory. That directly means 60–80% fewer concurrent users than the hardware could support.

Step 3: The fix — pages (blocks)

PagedAttention says: stop reserving contiguous slabs. Instead:

Carve all KV memory into many small, equal blocks. A block holds the KV of block_size tokens (commonly 16).
Maintain a global pool of free blocks.
Give each request blocks on demand, one at a time, as it generates — and the blocks can be anywhere in physical memory.
Keep a per-request block table: a little array mapping the request's logical block index (0, 1, 2, …) to the physical block id it actually got.

Physical KV memory (one big array of blocks, ids 0..N):
 ┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
 │ b0 │ b1 │ b2 │ b3 │ b4 │ b5 │ b6 │ b7 │ b8 │ b9 │ ...
 └────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘

Request A's block table:  [ 4, 1, 7 ]      (logical 0→phys 4, 1→1, 2→7)
Request B's block table:  [ 2, 9 ]

A's tokens live in blocks 4,1,7 — NOT contiguous, and that's totally fine.
Only A's *last* block may be partly empty (≤ block_size−1 wasted). No giant reservations.

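Now waste is at most block_size − 1 tokens per request (the tail of the last block): a handful of tokens, not thousands of reserved-but-idle slots. External fragmentation is gone, because every block is the same size, and internal fragmentation is capped at that one partial block.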
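
To put numbers on the two schemes, here is a small back-of-the-envelope model. The request lengths are made up for illustration, and the two helper functions are hypothetical; the only inputs taken from the text are the 2048-slot reservation and the block size of 16.

```python
import math


def contiguous_waste(lengths, max_len):
    """Slots reserved but never used when each request reserves max_len up front."""
    return sum(max_len - n for n in lengths)


def paged_waste(lengths, block_size):
    """Slots left empty in the tail of each request's last block."""
    return sum(math.ceil(n / block_size) * block_size - n for n in lengths)


lengths = [30, 200, 512, 1000]    # hypothetical final sequence lengths
reserved = 2048 * len(lengths)    # what the contiguous scheme sets aside

wasted_contiguous = contiguous_waste(lengths, max_len=2048)
wasted_paged = paged_waste(lengths, block_size=16)

print(wasted_contiguous, round(wasted_contiguous / reserved, 3))   # 6450 0.787
print(wasted_paged)                                                # 18
```

For this particular stream the contiguous scheme idles roughly 79% of what it reserved, while paging loses 18 slots in total. Lab 02 has you build the fuller version of this comparison.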

The mental shift: a request's KV no longer needs to be contiguous in memory; it only needs to be contiguous in the block table. The attention kernel is handed the block table and gathers KV from the scattered physical blocks. That's the "Paged" in PagedAttention.
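
The indirection is small enough to sketch in plain Python. The helpers below (slot_for, write_kv, gather_kv) are hypothetical names, the "cache" is a flat list standing in for the GPU buffer, and the block size is shrunk to 4 so the numbers are easy to follow. Request A's table [4, 1, 7] is the one from the diagram above.

```python
BLOCK_SIZE = 4  # tiny on purpose; the text's common value is 16


def slot_for(block_table, pos, block_size=BLOCK_SIZE):
    """Physical slot that holds the token at logical position pos."""
    physical_block = block_table[pos // block_size]    # logical block -> physical block
    return physical_block * block_size + pos % block_size


def write_kv(cache, block_table, pos, value, block_size=BLOCK_SIZE):
    """Write path: store one token's KV in its physical slot."""
    cache[slot_for(block_table, pos, block_size)] = value


def gather_kv(cache, block_table, num_tokens, block_size=BLOCK_SIZE):
    """Read path: collect KV in logical order from scattered physical blocks."""
    return [cache[slot_for(block_table, pos, block_size)] for pos in range(num_tokens)]


cache = [None] * (10 * BLOCK_SIZE)    # ten physical blocks, b0..b9
table_a = [4, 1, 7]                   # request A's block table
for pos in range(10):
    write_kv(cache, table_a, pos, f"kv{pos}")

print(slot_for(table_a, 0))           # 16: first token lands in physical block 4
print(gather_kv(cache, table_a, 10))  # kv0..kv9, back in logical order
```

The tokens are scattered across blocks 4, 1 and 7 in physical memory, yet the gather returns them in order. That is the whole job the attention kernel's block-table lookup performs.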

Step 4: Sharing blocks between requests

Once KV is in blocks tracked by tables, two requests can point their block tables at the same physical block. If two requests start with the same prompt (a shared system prompt, or n=4 samples of one prompt), they can share the physical KV blocks of that prefix: compute it once, store it once.

System prompt blocks (computed once):   b5  b6
Request A table: [ b5, b6, b1 ]   ─┐
Request B table: [ b5, b6, b8 ]   ─┴─ both point at b5,b6 (shared!), diverge after.

This is prefix caching (the star of Phase 03). To make sharing safe we need two more concepts, both straight from operating systems:

Reference counting — each block knows how many requests use it (ref_cnt). A block is truly free only when ref_cnt == 0.
Copy-on-write — if a shared block must change for just one request, copy it first so the other sharer's view is untouched.
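
Reference counting is simple enough to try in a few lines. This sketch uses a plain Counter as the ref_cnt store and two hypothetical helpers, attach and release; the block tables are the ones from the diagram above (b5 and b6 shared, then divergence).

```python
from collections import Counter

ref_cnt = Counter()   # physical block id -> number of requests using it


def attach(block_table):
    """A request starts using every block in its table."""
    for block_id in block_table:
        ref_cnt[block_id] += 1


def release(block_table):
    """A request finishes; return the blocks that became truly free."""
    freed = []
    for block_id in block_table:
        ref_cnt[block_id] -= 1
        if ref_cnt[block_id] == 0:
            freed.append(block_id)
    return freed


table_a = [5, 6, 1]
table_b = [5, 6, 8]
attach(table_a)
attach(table_b)
print(ref_cnt[5], ref_cnt[1])   # 2 1: the prefix is shared, the tail is private

print(release(table_a))         # [1]: b5 and b6 are still held by request B
print(release(table_b))         # [5, 6, 8]: now nothing references them
```

Request A finishing frees only its private tail block. The shared prefix blocks stay alive until the last sharer lets go.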

Step 5: The data structures you're about to meet

The real vLLM (and your mini_vllm) implement paging with exactly four pieces:

Piece	Job	Real code	Your code
`KVCacheBlock`	metadata for one physical block (id, ref_cnt, hash)	`kv_cache_utils.py:116`	`mini_vllm/block_pool.py`
`FreeKVCacheBlockQueue`	the free list, in eviction order, O(1) middle-removal	`kv_cache_utils.py:164`	`mini_vllm/block_pool.py`
`BlockPool`	owns all blocks + the free list + the prefix-cache index	`block_pool.py:130`	`mini_vllm/block_pool.py`
`KVCacheManager`	per-request block tables; the API the scheduler calls	`kv_cache_manager.py:110`	`mini_vllm/kv_cache.py`

A surprising detail you'll appreciate: the free list is a hand-rolled doubly linked list, not a Python deque. Why? Because on a prefix-cache hit we must yank a specific block out of the middle of the free list in O(1). A deque can't do that. The real code has a 30-line docstring justifying this exact decision (kv_cache_utils.py:164). Reading that docstring and understanding why is a rite of passage — and a great interview answer.

The four invariants (memorize these)

A maintainer holds these in their head at all times. They're tested in mini_vllm/test_block_pool.py and asserted throughout the real code:

I1. A block is in the free queue ⟺ block.ref_cnt == 0 (and it isn't the null block).
I2. Block tables are append-only: an allocated block_id never changes under a request. (This is why the cache doesn't de-duplicate — see block_pool.py:48.)
I3. Only a full block (exactly block_size tokens) ever gets a hash and enters the prefix cache.
I4. "Cached" ≠ "unusable." A block can be a free eviction candidate (in the free queue) while still being a prefix-cache hit target. touch() revives it.

What you'll do in this phase

Read the real allocator: 01-deep-dive.md walks block_pool.py and kv_cache_utils.py line by line.
Build your own: 02-mini-build.md (you've got mini_vllm/block_pool.py as the reference — the lab has you write it from a stub).
Labs (see labs/README.md; recommended order 01 → 02 → 05 → 06 → 03 → 04):
- lab-01-block-allocator [CPU-OK] — implement the paged allocator + free queue, pass the tests.
- lab-02-fragmentation-viz [CPU-OK] — simulate contiguous vs paged allocation; measure the waste.
- lab-03-real-vllm-blocks [GPU-OPT] — run real vLLM, read num_gpu_blocks and KV usage, prove no fragmentation.
- lab-04-triton-paged-attn [GPU-REQ] — port a block-table-indexed attention to a Triton kernel.
- lab-05-share-and-evict [CPU-OK] — the life of a cached block: sharing (ref_cnt==2), eviction order (tails before shared prefixes), and revival from the middle of the free queue.
- lab-06-paged-attention-numpy [CPU-OK] — the kernel's data path in pure numpy: slot_mapping, scatter, gather-through-the-table, and proof that paged == dense to 1e-12. (The CPU twin of lab-04.)
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.

When you can whiteboard the block table + free queue from memory and explain copy-on-write and the four invariants, you understand the single most important idea in vLLM. Onward.

Phase 02 — Deep Dive: PagedAttention in the real vLLM

All paths are relative to upstream/ at the pinned commit v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). Open each file as we go. Line numbers are valid at the pin; the named symbol lets you re-find anything if you're on a different version.

The V1 KV-cache stack lives in vllm/v1/core/:
vllm/v1/core/
  kv_cache_utils.py        KVCacheBlock, FreeKVCacheBlockQueue, hashing  (the primitives)
  block_pool.py            BlockPool                                     (the allocator)
  kv_cache_manager.py      KVCacheManager, KVCacheBlocks                 (per-request tables)
  kv_cache_coordinator.py  coordinates groups (hybrid models)           (one level up)
  single_type_kv_cache_manager.py                                       (per-group logic)

We'll go bottom-up: the block, the free list, the pool, then the manager the scheduler calls.

1. `KVCacheBlock` — metadata for one physical block

vllm/v1/core/kv_cache_utils.py:116:

@dataclass
class KVCacheBlock:
    """KV-cache block metadata."""
    block_id: int
    ref_cnt: int = 0
    _block_hash: BlockHashWithGroupId | None = None
    # Used to construct a doubly linked list for free blocks.
    prev_free_block: "KVCacheBlock | None" = None
    next_free_block: "KVCacheBlock | None" = None
    is_null: bool = False

Crucial things to notice:

A KVCacheBlock is metadata only. The actual K/V tensors live in a big GPU buffer; this object just says "block #block_id, used by ref_cnt requests, hashing to _block_hash." Your mini_vllm.block_pool.KVCacheBlock is the same shape minus the GPU tensors.
ref_cnt is the heart of sharing (I1). The block_hash setter (line 139) asserts the block has no hash yet — enforcing I3/I2: a block's hash is set once when it fills, and the block id is stable.
prev_free_block/next_free_block are the linked-list pointers. The comment (line 128) warns: "These two attributes should only be manipulated by FreeKVCacheBlockQueue." That's an invariant about ownership — exactly the kind of thing a maintainer must respect.

reset_hash() (line 146) clears the hash on eviction. We'll see it called from _maybe_evict_cached_block.

2. `FreeKVCacheBlockQueue` — the free list, and why it's hand-rolled

vllm/v1/core/kv_cache_utils.py:164. Read its docstring in full — it's a masterclass. The key sentences:

"We implement this class instead of using Python builtin deque to support removing a block in the middle of the queue in O(1) time. … this class does not allocate any Python objects when manipulating the linked list."

Two design decisions, both about performance on the hot path (this runs for every allocation and free, every step):

O(1) middle removal. On a prefix-cache hit, a block that was a free eviction candidate gets revived — pulled out of wherever it sits in the free list. A deque only does O(1) at the ends; the middle is O(n). So they wrote a doubly linked list.
Zero allocation. They reuse the prev/next fields on the blocks themselves rather than allocating node wrappers. No GC pressure in the scheduler loop.

The eviction order is the other half (docstring lines 173–180):

"1. The least recently used block is at the front (LRU). 2. If two blocks have the same last accessed time … the one with more hash tokens (the tail of a block chain) is at the front."

So popleft() evicts LRU-first, and within a freed request, tail blocks go first (we'll see KVCacheManager.free frees in reverse so the longest shared prefix survives longest).

The sentinel trick (lines 196–214): a fake head and tail node so push/pop never special-case "is this the first/last?". Read popleft (216), remove (286), append (306), popleft_n (253), append_n (329). Your mini_vllm.block_pool.FreeKVCacheBlockQueue implements the same core operations (popleft, remove, append) with the same sentinel trick; compare them side by side.
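
If the sentinel trick is new to you, the following stripped-down sketch shows the idea. It is not the upstream class or the mini_vllm reference: Block and FreeQueue are simplified, hypothetical stand-ins, and the pointer fields are named prev/next for brevity.

```python
class Block:
    """Metadata only: an id plus the two link pointers the queue owns."""

    def __init__(self, block_id):
        self.block_id = block_id
        self.prev = None
        self.next = None


class FreeQueue:
    """Doubly linked free list with fake head/tail nodes (sentinels)."""

    def __init__(self, blocks):
        self.head = Block(-1)          # sentinel: never handed out
        self.tail = Block(-1)          # sentinel: never handed out
        self.head.next = self.tail
        self.tail.prev = self.head
        self.num_free = 0
        for block in blocks:
            self.append(block)

    def append(self, block):
        """Push to the back (evicted last). No special case for an empty list."""
        last = self.tail.prev
        last.next = block
        block.prev = last
        block.next = self.tail
        self.tail.prev = block
        self.num_free += 1

    def remove(self, block):
        """Unlink from anywhere in O(1); the block's neighbours always exist."""
        block.prev.next = block.next
        block.next.prev = block.prev
        block.prev = block.next = None
        self.num_free -= 1

    def popleft(self):
        """Take the front block (evicted first)."""
        if self.num_free == 0:
            raise ValueError("no free blocks")
        first = self.head.next
        self.remove(first)
        return first

    def ids(self):
        """Block ids in eviction order (for tests and debugging)."""
        out, node = [], self.head.next
        while node is not self.tail:
            out.append(node.block_id)
            node = node.next
        return out
```

Because the sentinels guarantee every real block has a neighbour on both sides, remove needs no branches. Note also that the links live on the blocks themselves, so no wrapper node is allocated per operation.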

Interview gold: "Why does vLLM use a custom linked list instead of collections.deque for free blocks?" → O(1) removal from the middle for prefix-cache revival, and zero per-operation allocation on the scheduler hot path. If you can also say where the middle removal happens (touch), you're answering at staff level.

3. `BlockPool` — owns every block, the free list, and the cache index

vllm/v1/core/block_pool.py:130. The constructor (__init__, line 149):

self.blocks: list[KVCacheBlock] = [KVCacheBlock(idx) for idx in range(num_gpu_blocks)]
self.free_block_queue = FreeKVCacheBlockQueue(self.blocks)
self.cached_block_hash_to_block: BlockHashToBlockMap = BlockHashToBlockMap()
# To represent a placeholder block with block_id=0.
self.null_block = self.free_block_queue.popleft()
self.null_block.is_null = True

One KVCacheBlock per physical block, all initially free.
A null block (id 0) is reserved as a placeholder (used for skipped positions, e.g. outside a sliding window). mini_vllm reserves block 0 the same way (BlockPool.__init__).
cached_block_hash_to_block is the prefix-cache index: block_hash → block. (Upstream uses a BlockHashToBlockMap that can hold multiple blocks per hash; mini_vllm simplifies to one block per hash — read the BlockHashToBlockMap docstring at line 34 to see why the real one is more complex: it must keep block ids stable, I2, so it doesn't dedup.)

Allocation: `get_new_blocks` (line 333)

def get_new_blocks(self, num_blocks: int) -> list[KVCacheBlock]:
    if num_blocks > self.get_num_free_blocks():
        raise ValueError(f"Cannot get {num_blocks} free blocks from the pool")
    ret: list[KVCacheBlock] = self.free_block_queue.popleft_n(num_blocks)
    if self.enable_caching:
        for block in ret:
            self._maybe_evict_cached_block(block)   # <- was it a cached eviction candidate?
            assert block.ref_cnt == 0
            block.ref_cnt += 1
    else:
        for block in ret:
            assert block.ref_cnt == 0
            block.ref_cnt += 1
    return ret

Pop n blocks off the front of the free queue (LRU). If caching is on, each popped block might still be sitting in the prefix cache as an eviction candidate (I4) — so _maybe_evict_cached_block removes its hash entry before we reuse it. Then ref it (ref_cnt = 1). mini_vllm.BlockPool.get_new_blocks mirrors this exactly (including _maybe_evict).

Eviction: `_maybe_evict_cached_block` (line 365)

block_hash = block.block_hash
if block_hash is None:
    return False            # block was never cached, nothing to evict
if self.cached_block_hash_to_block.pop(block_hash, block.block_id) is None:
    return False
block.reset_hash()          # <- I3: it no longer holds cacheable content

This is the OS analogy made literal: reusing a physical page means invalidating whatever was mapped there. The hash is cleared so no future request thinks this block holds their prefix.

Revival: `touch`

def touch(self, blocks: Sequence[KVCacheBlock]) -> None:
    for block in blocks:
        # ref_cnt=0 means this block is in the free list (eviction candidate), so remove it.
        if block.ref_cnt == 0 and not block.is_null:
            self.free_block_queue.remove(block)   # <- O(1) middle removal! (the whole reason
        block.ref_cnt += 1                        #     for the custom linked list)

When a new request hits a prefix-cached block that happened to be free, touch revives it: pull it out of the middle of the free list and bump its ref count. This single line is why FreeKVCacheBlockQueue exists. mini_vllm.BlockPool.touch is identical in spirit.

Freeing: `free_blocks` (line 419)

for block in blocks_list:
    block.ref_cnt -= 1
self.free_block_queue.append_n(
    [block for block in blocks_list if block.ref_cnt == 0 and not block.is_null]
)

Decrement refs; any block that hit 0 goes back on the free queue (and stays in the cache as an eviction candidate — I4). The caller is expected to pass blocks in eviction-priority order (docstring line 419: "first block will be evicted first").
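
The four pool operations fit together into one small life cycle, which the sketch below walks through. TinyPool is a hypothetical miniature, not the upstream class: it uses an OrderedDict as a stand-in for the linked-list free queue (it also supports O(1) removal by key), allocates one block at a time, and omits the null block.

```python
from collections import OrderedDict


class TinyPool:
    """Hypothetical miniature of the pool: ref counts, free order, hash index."""

    def __init__(self, num_blocks):
        self.ref_cnt = [0] * num_blocks
        self.block_hash = [None] * num_blocks
        # Free list in eviction order: the front is evicted first.
        self.free = OrderedDict((i, None) for i in range(num_blocks))
        self.cache = {}   # block hash -> block id (the prefix-cache index)

    def get_new_block(self):
        block_id, _ = self.free.popitem(last=False)     # pop the front
        old_hash = self.block_hash[block_id]
        if old_hash is not None and self.cache.get(old_hash) == block_id:
            del self.cache[old_hash]                    # evict the stale cache entry
        self.block_hash[block_id] = None
        self.ref_cnt[block_id] = 1
        return block_id

    def cache_full_block(self, block_id, block_hash):
        self.block_hash[block_id] = block_hash
        self.cache[block_hash] = block_id

    def touch(self, block_id):
        if self.ref_cnt[block_id] == 0:
            del self.free[block_id]                     # revive from mid-queue
        self.ref_cnt[block_id] += 1

    def free_blocks(self, block_ids):
        # Caller passes blocks in eviction-priority order (first is evicted first).
        for block_id in block_ids:
            self.ref_cnt[block_id] -= 1
            if self.ref_cnt[block_id] == 0:
                self.free[block_id] = None              # append to the back
```

A short session shows I4 in action: after freeing a cached block it sits in the free list and in the cache index at the same time, a later touch pulls it back out, and only a real allocation that pops it finally drops its hash.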

Caching full blocks: `cache_full_blocks` (line 211)

The big method that registers newly-full blocks into the prefix cache. The important loop (line 267):

for i, blk in enumerate(new_full_blocks):
    if blk.is_null or (block_mask is not None and not block_mask[i]):
        continue
    assert blk.block_hash is None         # I3 again
    block_hash = new_block_hashes[i]
    block_hash_with_group_id = make_block_hash_with_group_id(block_hash, kv_cache_group_id)
    blk.block_hash = block_hash_with_group_id
    self.cached_block_hash_to_block.insert(block_hash_with_group_id, blk)

Only full, non-null, non-masked blocks get a hash and enter the index. The rest of the method (lines 285–331) emits optional KV-cache events (for observability / external KV stores) — skip that on first read.

4. The hash that makes it a prefix cache: `hash_block_tokens`

vllm/v1/core/kv_cache_utils.py:541:

def hash_block_tokens(hash_function, parent_block_hash, curr_block_token_ids, extra_keys=None):
    if not parent_block_hash:
        parent_block_hash = NONE_HASH
    curr_block_token_ids_tuple = tuple(curr_block_token_ids)
    return BlockHash(
        hash_function((parent_block_hash, curr_block_token_ids_tuple, extra_keys))
    )

The block's hash includes its parent's hash. That chaining is the entire reason this is a prefix cache and not just a block cache: block [c, d] hashes differently depending on what came before it, so a hit on block k guarantees blocks 0..k were all identical. extra_keys folds in things that must not collide across contexts — LoRA id, multimodal content, a cache_salt — see generate_block_hash_extra_keys (line 503). Your mini_vllm.block_pool.hash_block_tokens keeps the parent chaining (the essential part) and drops extra_keys; the test test_prefix_hash_is_chained pins the property.
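
The chaining property is worth seeing with your own eyes. This is a minimal sketch, assuming SHA-256 over a repr of the tuple as the hash function (a choice made only so the example is deterministic); hash_block and hash_chain are hypothetical helpers, not the upstream or mini_vllm functions.

```python
import hashlib


def hash_block(parent_hash, token_ids, extra_keys=None):
    """Hash one full block, folding in the parent block's hash."""
    payload = repr((parent_hash, tuple(token_ids), extra_keys)).encode()
    return hashlib.sha256(payload).digest()


def hash_chain(token_ids, block_size):
    """Hashes for every full block of a sequence; a partial tail gets no hash."""
    hashes = []
    parent = None
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        parent = hash_block(parent, token_ids[start:start + block_size])
        hashes.append(parent)
    return hashes


a = hash_chain([1, 2, 3, 4], block_size=2)
b = hash_chain([9, 9, 3, 4], block_size=2)
c = hash_chain([1, 2, 3, 4, 7, 8], block_size=2)

print(a[1] == b[1])   # False: same tokens [3, 4], different history
print(c[:2] == a)     # True: a shared prefix yields identical leading hashes
```

The second block of a and b holds the same tokens but hashes differently because the parent differs, which is exactly why a hit on block k vouches for everything before it.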

5. `KVCacheManager` — the per-request API the scheduler uses

vllm/v1/core/kv_cache_manager.py:110. This is the only KV class the scheduler talks to; it hides the pool/coordinator behind a clean interface. Two methods matter most.

`get_computed_blocks` (line 194) — prefix-cache lookup

max_cache_hit_length = request.num_tokens - 1   # must recompute last token to get logits
computed_blocks, num_new_computed_tokens = self.coordinator.find_longest_cache_hit(
    request.block_hashes, max_cache_hit_length
)

Note the num_tokens - 1: even if the entire prompt is cached, the last token must be recomputed to produce logits. mini_vllm.KVCacheManager.get_computed_blocks reproduces this exact max_hit_tokens = num_tokens - 1 rule and walks block hashes from the front, stopping at the first miss (a prefix must be contiguous from the start).
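
The lookup rule is compact enough to sketch. The function below is a hypothetical simplification in the spirit of the mini_vllm version described in this phase: block hashes are walked from the front, the walk stops at the first miss, and usable hits are capped at (num_tokens - 1) // block_size so the last token is always recomputed.

```python
def longest_cache_hit(block_hashes, cached, num_tokens, block_size):
    """Return (cached block ids, number of cached tokens) for the usable prefix.

    block_hashes: hashes of the request's full blocks, in order
    cached:       dict mapping block hash -> physical block id
    """
    max_hit_blocks = (num_tokens - 1) // block_size   # leave the last token to compute
    hits = []
    for block_hash in block_hashes[:max_hit_blocks]:
        block_id = cached.get(block_hash)
        if block_id is None:
            break                                     # a prefix must be contiguous
        hits.append(block_id)
    return hits, len(hits) * block_size


cached = {"h0": 5, "h1": 6}

# A 32-token prompt whose two blocks are both cached: only one block is usable.
print(longest_cache_hit(["h0", "h1"], cached, num_tokens=32, block_size=16))  # ([5], 16)

# One more token and both cached blocks can be used.
print(longest_cache_hit(["h0", "h1"], cached, num_tokens=33, block_size=16))  # ([5, 6], 32)
```

The first call is the interesting one: a fully cached prompt still gives up its final block, because without recomputing at least one token there would be no logits to sample from.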

`allocate_slots` (line 236) — the workhorse

Read the giant ASCII docstring (lines 273–305): it diagrams how a request's tokens split into comp | new_comp | ext_comp | new | lookahead. The control flow (simplified):

num_blocks_to_allocate = self.coordinator.get_num_blocks_to_allocate(...)
if num_blocks_to_allocate > self.block_pool.get_num_free_blocks():
    return None                                  # <- OOM! caller must preempt and retry
...
new_blocks = self.coordinator.allocate_new_blocks(...)
...
self.coordinator.cache_blocks(request, num_tokens_to_cache)   # cache newly-full blocks
return self.create_kv_cache_blocks(new_blocks)

The single most important line for Phase 03 is return None: when there aren't enough free blocks, allocate_slots returns None, and the scheduler responds by preempting a running request and retrying. That handshake between the KV manager (memory truth) and the scheduler (policy) is the seam where memory management meets scheduling. mini_vllm.KVCacheManager.allocate_slots returns None on OOM for exactly this reason, and mini_vllm.Scheduler.schedule preempts on None.
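
Here is a toy version of that handshake. Everything in it is hypothetical: requests are plain dicts, the "pool" is a list of free block ids, and the victim policy (preempt the most recently admitted request) is an assumption chosen for simplicity, not a statement about the real scheduler, which Phase 03 covers.

```python
def allocate_slots(free_blocks, num_needed):
    """Return a list of block ids, or None if the pool cannot satisfy the request."""
    if num_needed > len(free_blocks):
        return None                      # memory truth: not enough blocks
    return [free_blocks.pop() for _ in range(num_needed)]


def schedule_one(request, running, free_blocks):
    """Admit request, preempting running requests while allocation fails.

    Returns (admitted, ids of preempted requests).
    """
    preempted = []
    while True:
        blocks = allocate_slots(free_blocks, request["needs"])
        if blocks is not None:
            request["blocks"] = blocks
            running.append(request)
            return True, preempted
        if not running:
            return False, preempted      # does not fit even in an empty batch
        victim = running.pop()           # policy: assumed, most recently admitted
        free_blocks.extend(victim.pop("blocks"))
        preempted.append(victim["id"])


free_blocks = [0, 1, 2, 3]
running = []
print(schedule_one({"id": "r1", "needs": 3}, running, free_blocks))   # (True, [])
print(schedule_one({"id": "r2", "needs": 2}, running, free_blocks))   # (True, ['r1'])
```

The allocator never decides who gets evicted; it only reports None. The policy lives entirely on the scheduler's side of the seam.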

`free` (line 429) — reverse order on purpose

"""We free the blocks in reverse order so that the tail blocks are evicted first when
caching is enabled."""
self.coordinator.free(request.request_id)

Freeing tail-first means the head blocks (the shared prefix) stay in the free queue longest, so they survive for the next request that shares that prefix. mini_vllm.KVCacheManager.free does reversed(blocks) for the same reason — see the comment there.

6. Where the blocks actually get used: the attention kernel

We've managed metadata; where do the K/V tensors and block tables meet a GPU kernel? Two places to glance at (full treatment in Phase 04):

The classic CUDA kernels: csrc/attention/paged_attention_v1.cu and ..._v2.cu. These take a block table and gather KV from scattered physical blocks. Search the .cu for block_table to see the indirection: physical_block = block_table[seq][logical_block].
The V1 backends that build the metadata: vllm/v1/attention/backends/flash_attn.py turns the scheduler's block ids + sequence lengths into the slot_mapping (where to write new K/V) and block tables (where to read old K/V) the kernel needs.

You don't need to read CUDA to pass this phase — but knowing that "the block table is literally passed into the attention kernel, which dereferences it per token" closes the loop on why the metadata we manage here is shaped the way it is.

Reading checklist

Tick these off in your lab notebook (write one sentence each):

KVCacheBlock — what does ref_cnt gate? what does the block_hash setter assert?
FreeKVCacheBlockQueue — why a linked list not a deque? where is the middle-removal used?
BlockPool.get_new_blocks — why call _maybe_evict_cached_block before reusing?
BlockPool.touch — trace the O(1) revival of a cached free block.
hash_block_tokens — why include the parent hash?
KVCacheManager.allocate_slots — what does returning None trigger, and where?

Now go build it: 02-mini-build.md, then the labs.

Phase 02 — Mini-Build: the paged block allocator

You will build the four pieces from the deep-dive, on CPU, with numpy-or-nothing. The reference implementation already lives in the repo so you can check yourself:

mini_vllm/block_pool.py — KVCacheBlock, FreeKVCacheBlockQueue, BlockPool, hash_block_tokens
mini_vllm/kv_cache.py — KVCacheManager

But the point is to write it yourself first. lab-01-block-allocator gives you a starter.py with the method bodies stubbed out and a test_lab.py that pins every invariant. Make the tests pass without peeking, then diff your file against solution.py (and against the real mini_vllm/block_pool.py).

The build, in order

1. `KVCacheBlock`

A dataclass: block_id, ref_cnt=0, block_hash=None, is_null=False, and two link pointers prev_free/next_free. Add reset_hash(). Don't put list logic here — the queue owns the pointers (mirror the real ownership invariant).

2. `FreeKVCacheBlockQueue`

A doubly linked list with fake head/tail sentinels. Implement:

popleft() → first block, O(1)
remove(block) → unlink from the middle, O(1) ← the reason this class exists
append(block) → push to tail
get_all_free_blocks() → for tests

Keep a num_free_blocks counter in sync. Test: removing a middle block keeps the rest ordered (test_free_queue_o1_middle_removal).

3. `BlockPool`

Build num_blocks blocks, wrap them in the queue, reserve block 0 as null_block.
get_new_blocks(n) — pop n, _maybe_evict each, set ref_cnt=1.
_maybe_evict(block) — if it has a hash and is the cached block for that hash, drop it from the index and reset_hash().
touch(blocks) — if a block is free (ref_cnt==0), remove it from the queue, then ref_cnt += 1.
free_blocks(blocks) — ref_cnt -= 1; any that hit 0 (and aren't null) go back on the queue.
cache_full_blocks(blocks, hashes) — set hash + index it (skip null/already-hashed).
get_cached_block(hash), get_num_free_blocks(), get_usage().

4. `hash_block_tokens(parent_hash, token_ids)`

return hash((parent_hash, tuple(token_ids))). The parent chaining is non-negotiable — it's what makes it a prefix cache. Test: same tokens, different parent → different hash.

5. `KVCacheManager` (in `mini_vllm/kv_cache.py`)

get_computed_blocks(request) → walk full-block hashes from the front, look each up, stop at first miss; cap usable hits at (num_tokens - 1) // block_size. Return (blocks, num_cached).
allocate_slots(request, num_new_tokens, new_computed_blocks=None) → touch the cached blocks, compute how many blocks the (computed+new) tokens need, return None if not enough free, else allocate and cache newly-full blocks.
free(request) → free its blocks in reverse order.

Definition of done

pytest mini_vllm/test_block_pool.py -q     # the reference suite
pytest phase-02-paged-attention/labs -q    # your lab solution + the lab tests

Both green. Then, in your notebook, answer: Which line in your touch() is the O(1) middle removal, and which real-world event triggers it? (Answer: pulling a prefix-cached block out of the free list on a cache hit.)

Stretch (optional, sets up Phase 03)

Add a tiny copy-on-write to your pool: a fork_block(block) that, if ref_cnt > 1, allocates a fresh block, (pretend-)copies contents, decrements the original, and returns the new one. You won't wire it into the engine here, but it's the mechanism behind safe prefix sharing when one sharer diverges — and a classic interview follow-up.
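
If you want a starting point for the stretch, the sketch below shows one way fork_block could behave. CowPool is a hypothetical toy with a plain list as its free list and Python lists as pretend block contents; treat it as an illustration of the mechanism described above, not as the reference solution.

```python
class CowPool:
    """Hypothetical toy pool, just enough state to demonstrate copy-on-write."""

    def __init__(self, num_blocks):
        self.ref_cnt = [0] * num_blocks
        self.data = [[] for _ in range(num_blocks)]   # pretend block contents
        self.free = list(range(num_blocks))

    def allocate(self):
        block_id = self.free.pop(0)
        self.ref_cnt[block_id] = 1
        return block_id

    def share(self, block_id):
        """A second request starts pointing at the same physical block."""
        self.ref_cnt[block_id] += 1
        return block_id

    def fork_block(self, block_id):
        """Return a block that is safe to write for exactly one request."""
        if self.ref_cnt[block_id] <= 1:
            return block_id                           # sole owner: write in place
        new_id = self.allocate()                      # fresh block with ref_cnt == 1
        self.data[new_id] = list(self.data[block_id]) # (pretend-)copy the contents
        self.ref_cnt[block_id] -= 1                   # the writer leaves the original
        return new_id


pool = CowPool(4)
original = pool.allocate()
pool.data[original] = [1, 2]
shared = pool.share(original)          # two requests now use block 0

mine = pool.fork_block(shared)         # one of them wants to write
pool.data[mine].append(3)

print(mine, pool.data[mine])           # 1 [1, 2, 3]
print(original, pool.data[original])   # 0 [1, 2]
```

The other sharer's view of block 0 is untouched, and a second fork_block on a block with a single owner is a no-op that hands back the same id.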

Phase 02 Labs — PagedAttention

Six labs that take you from "paging is an idea" to "I have built every layer of it." The arc: build the allocator (lab-01), measure why it wins (lab-02), see it manage real gigabytes (lab-03), then follow the data path — share & evict cached blocks (lab-05), gather through the table in numpy (lab-06), and finally write the GPU kernel (lab-04).

Recommended order: 01 → 02 → 05 → 06 → 03 → 04. (The directory numbers predate labs 05 and 06; the recommended order covers the metadata labs first, then the data path, then the hardware.) CPU labs follow the standard contract: starter.py (your work), solution.py (reference), test_lab.py (the spec). By default the tests run against the solution; LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-02-paged-attention/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-02-paged-attention/labs/lab-01-block-allocator -q

Labs

lab-01-block-allocator `[CPU-OK]`

The lab of the phase. Implement the structure vLLM is famous for: KVCacheBlock, the doubly-linked free queue with O(1) middle removal, the BlockPool with refcounts and lazy eviction, and the parent-chained content hash. Four invariants (I1–I4), each one a class of production bug, each one pinned by a test. Skills: the allocator's constitution; why the free queue can't be a deque; hash chains = causality.

lab-02-fragmentation-viz `[CPU-OK]`

Simulate contiguous-max-reservation vs paged allocation on the same request stream and measure the difference: 8 admissions vs 64, 94% waste vs ~0, on identical memory. The PagedAttention headline result, re-derived by you in a four-line model. Skills: internal vs external fragmentation; first-principles capacity modeling; the block_size trade-off.

lab-03-real-vllm-blocks `[GPU-OPT]`

Run real vLLM and read its memory self-assessment: where # GPU blocks: 8788 comes from (profile → subtract → carve), why "Maximum concurrency: 68.65x" is the whole serving-capacity story, and how to sanity-check both from the model config on paper. Captured annotated run included for the GPU-less. Skills: capacity planning; gpu_memory_utilization / max_model_len as capacity knobs; reading the startup ritual.
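
The "on paper" check can be sketched in a few lines. The two helpers are hypothetical, and the settings plugged in are assumptions: a block_size of 16 (the common value used throughout this phase) and a max_model_len of 2048, chosen because they land close to the quoted figure. The lab's captured run is the authority on the real values.

```python
def num_kv_blocks(kv_budget_bytes, block_size, bytes_per_token):
    """Carve a KV memory budget into fixed-size blocks."""
    return kv_budget_bytes // (block_size * bytes_per_token)


def max_concurrency(num_blocks, block_size, max_model_len):
    """How many maximum-length requests the KV cache can hold at once."""
    return num_blocks * block_size / max_model_len


# 8788 blocks is the figure quoted above; block_size and max_model_len are assumed.
print(round(max_concurrency(8788, block_size=16, max_model_len=2048), 2))   # 68.66

# Carving example with made-up inputs: an 8 GiB budget at 131072 bytes per token.
print(num_kv_blocks(8 * 2**30, block_size=16, bytes_per_token=131072))      # 4096
```

Under those assumed settings the estimate comes out within rounding of the 68.65x in the log line, which is the kind of agreement the lab asks you to look for.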

lab-04-triton-paged-attn `[GPU-REQ]`

The payoff: write a Triton kernel that gathers K/V through a block table and computes decode attention with online softmax, then verify against a dense reference and compare with paged_attention_v1.cu. Do lab-06 first — it's the same algorithm without the GPU dialect. Skills: kernel-level indirection; online softmax; block_table (read) vs slot_mapping (write); fp32 accumulation.

lab-05-share-and-evict `[CPU-OK]`

The biography of a cached block: two identical prompts converge on the same physical blocks (ref_cnt == 2), freed blocks linger as eviction candidates, eviction consumes tails before shared prefixes (reverse-order free = the policy), and a newcomer revives "dead" blocks from the middle of the free queue. Includes the num_tokens − 1 hit cap and an exactly-sized pool that counts every block. Skills: prefix-cache state machine; eviction-as-queue-order; why the last token always recomputes.

lab-06-paged-attention-numpy `[CPU-OK]`

The data path, with no kernel noise: build slot_mapping, scatter K/V into a shuffled physical cache, gather back through the block table, and prove paged attention equals dense attention to 1e-12 — including a poisoned-tail test that makes masking bugs detonate. The CPU twin of lab-04. Skills: the slot formula; write-map vs read-map; testing masked computations by poisoning padding.

What you can do after this phase

Explain — with code you wrote and numbers you measured — why vLLM's memory manager admits ~10× more requests than reservation-based engines; predict a deployment's KV capacity from HBM size and model config; narrate the full life of a cached block from allocation through sharing to eviction or revival; and read block_pool.py, kv_cache_manager.py, and the paged-attention kernels upstream as a peer of their authors. Phase 3 now puts this allocator under a scheduler.

Lab 02-01 — Build the Paged Block Allocator `[CPU-OK]`

This is the lab of the phase, and arguably of the course. You are going to implement, from a skeleton, the data structure that made vLLM famous: the paged KV-cache block allocator — the free queue, the block pool, the reference counts, and the prefix-cache index. When the tests go green, the thing that serves trillions of tokens a day in production deployments around the world will exist, in miniature, written by your hands.

Why this lab exists

Here is the surprise at the heart of vLLM: its breakthrough wasn't a kernel, a model trick, or CUDA wizardry. It was an operating-systems idea from 1962, paged virtual memory, applied to the KV cache. The PagedAttention paper's headline numbers (2–4× throughput over the prior state of the art) come almost entirely from the metadata structure you're about to build: a few hundred lines of bookkeeping that decide which 16-token "page" of GPU memory belongs to whom.

That's also why this lab is CPU-only with zero loss of fidelity. The GPU tensors are, as the module docstring puts it, "just an array indexed by block_id." The hard part — the part maintainers actually edit, review, and break — is the metadata: ref counts, free lists, eviction, the prefix-cache index. You'll write all of it. And because mini_vllm's version is a faithful-but-small port of the real one (same class names, same invariants, line references throughout), finishing this lab means you can open upstream/vllm/v1/core/block_pool.py and read it like something you wrote.

Background: what problem this structure solves

Every token a transformer processes leaves a residue: its attention keys and values, needed by every future token of the same sequence. For a 7B model that's ~0.5 MB per token. The pre-vLLM engines stored each request's KV in one contiguous tensor sized for the maximum possible length — and since you can't know in advance how long a generation will run, they reserved worst case and used average case. Result: 60–80% of "used" KV memory held nothing (measured in the PagedAttention paper, §2; you'll reproduce the number yourself in lab-02).
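The ~0.5 MB figure can be checked by hand. A minimal sketch, assuming a Llama-7B-like shape (32 layers, 32 KV heads of dimension 128, fp16, no grouped-query attention); these config values are illustrative assumptions, not read from any file in this repository:

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of K and V that one token leaves behind, summed over all layers."""
    return num_layers * 2 * num_kv_heads * head_dim * dtype_bytes


# Assumed Llama-7B-like shape (illustrative only).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2)

# What a reservation-based engine sets aside for one request at max_len=2048.
reserved_2048 = per_token * 2048

print(f"{per_token} B/token = {per_token / 2**20} MiB")
print(f"2048-token reservation = {reserved_2048 / 2**30} GiB")
```

Under these assumptions one token costs 0.5 MiB and a single 2048-token reservation costs 1 GiB, whether or not the request ever generates that many tokens.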

The fix is the OS playbook, almost verbatim:

OS virtual memory	vLLM	In this lab
physical page frame	KV block (`block_size` tokens of K/V)	`KVCacheBlock`
free frame list	free queue	`FreeKVCacheBlockQueue`
page table (per process)	block table (per request)	Phase 2's `KVCacheManager` (next file over)
shared pages + refcounts	prefix sharing + `ref_cnt`	`touch` / `free_blocks`
page cache	prefix-cache index	`cached_block_hash_to_block`

A request takes blocks one at a time, from anywhere, as it grows. Nothing is reserved. External fragmentation: impossible (all blocks the same size). Internal fragmentation: at most block_size − 1 tokens, in the last block only. Sharing: free, via refcounts. That's the whole revolution.

The cast of characters

You implement three things, mirroring (with line references) the real engine:

KVCacheBlock — one block's metadata: block_id (its fixed address in the GPU tensor), ref_cnt (how many requests use it), block_hash (set only when full and cached), and two linked-list pointers it does not manage itself. (upstream: kv_cache_utils.py:116)
FreeKVCacheBlockQueue — a doubly linked list with head/tail sentinels holding every ref_cnt == 0 block in eviction order. Supports popleft (allocate), append (free), and the crucial remove(block) — O(1) extraction from the middle. (upstream: kv_cache_utils.py:164, where the docstring explains exactly why a deque can't do this job)
BlockPool — the owner: get_new_blocks (allocate + maybe-evict), touch (adopt a cached block, reviving it from the free queue if needed), free_blocks (decref, return to queue at ref_cnt == 0), cache_full_blocks / get_cached_block (the prefix-cache index), plus hash_block_tokens — the parent-chained content hash. (upstream: block_pool.py:130)

Files

starter.py — the skeleton. Method bodies raise NotImplementedError. Fill them in.
solution.py — a complete reference. Don't open it until you're green or truly stuck — this lab's struggle is its value.
test_lab.py — every invariant from the deep-dive §1–3, executable.

How to run

# Grade YOUR implementation:
LAB_IMPL=starter pytest phase-02-paged-attention/labs/lab-01-block-allocator -q

# The reference (default — keeps the suite green out of the box):
pytest phase-02-paged-attention/labs/lab-01-block-allocator -q

What to implement (in `starter.py`)

Recommended order — each layer is testable before the next:

FreeKVCacheBlockQueue: popleft, remove, append, get_all_free_blocks. The sentinels (_head, _tail) are pre-wired so you never branch on "am I first/last?" — notice how much conditional logic two dummy nodes delete. Keep num_free_blocks exact; the pool's OOM answer depends on it.
hash_block_tokens: hash (parent_hash, tokens_tuple). One line — but read the docstring until you can say why the parent is in there (see Hitchhiker's notes).
BlockPool: get_new_blocks (pop, _maybe_evict, assert ref_cnt == 0, set to 1), _maybe_evict (drop the hash↔block mapping if this block was a cached eviction candidate), touch, free_blocks, cache_full_blocks, get_cached_block, get_num_free_blocks. Mind block 0: it's reserved as the null block at construction, exactly like upstream.

The invariants you're proving

These four lines are the closest thing the KV subsystem has to a constitution. Real scheduler bugs — upstream, in production — are violations of one of these:

I1. A block is in the free queue ⟺ ref_cnt == 0 (and it's not the null block). Both directions. A block in the queue with refs is a use-after-free wearing a disguise: someone will allocate it and overwrite KV another request is still reading — silent corruption, tokens from someone else's conversation.
I2. Block ids are stable: once given to a request, a block is never renumbered or deduplicated out from under it. The GPU kernel reads physical addresses computed from block_id; metadata cleverness must never move data.
I3. Only full blocks get hashed and cached. A partial block's contents are still changing; caching it would serve half-written KV to a prefix match.
I4. Cached ≠ unusable. A cached block with ref_cnt == 0 sits in the free queue as an eviction candidate — it can be reclaimed (evicted) by get_new_blocks or revived (re-referenced) by touch. This dual citizenship is the whole trick of zero-cost prefix caching: the cache rides for free in memory that's already free.

The one data-structure decision to savor

Why is the free "queue" a hand-rolled doubly linked list instead of collections.deque? Because of I4. When a prefix-cache hit revives a block, that block is sitting somewhere in the middle of the free queue, and it must leave now, in O(1) — not via an O(n) scan of a deque. The eviction end (popleft) and the return end (append) are deque-friendly; it's the revival path that forces real pointers. The upstream class exists for precisely this reason and says so in its docstring.
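The sentinel trick and the O(1) middle removal can be seen in isolation. This is a generic sketch with hypothetical names (Node, FreeQueue); it is not the lab's skeleton or the upstream class, only the same access pattern in miniature:

```python
class Node:
    def __init__(self, block_id):
        self.block_id = block_id
        self.prev = None
        self.next = None


class FreeQueue:
    """Doubly linked list with head/tail sentinels (hypothetical illustration)."""

    def __init__(self, block_ids):
        self.head, self.tail = Node(-1), Node(-1)      # dummy nodes, never returned
        self.head.next, self.tail.prev = self.tail, self.head
        self.size = 0
        self.nodes = {}
        for block_id in block_ids:
            self.nodes[block_id] = Node(block_id)
            self.append(self.nodes[block_id])

    def append(self, node):
        last = self.tail.prev
        last.next, node.prev = node, last
        node.next, self.tail.prev = self.tail, node
        self.size += 1

    def remove(self, node):
        # O(1): the node already knows its neighbours, so no scan is needed.
        node.prev.next, node.next.prev = node.next, node.prev
        node.prev = node.next = None
        self.size -= 1

    def popleft(self):
        node = self.head.next
        if node is self.tail:
            raise IndexError("no free blocks")
        self.remove(node)
        return node

    def order(self):
        out, cur = [], self.head.next
        while cur is not self.tail:
            out.append(cur.block_id)
            cur = cur.next
        return out


q = FreeQueue(range(1, 6))        # free order: 1, 2, 3, 4, 5
q.remove(q.nodes[3])              # revival: block 3 leaves from the middle
after_revive = q.order()
first = q.popleft().block_id      # allocation takes the front
q.append(q.nodes[3])              # freeing returns a block at the back
after_free = q.order()
print(after_revive, first, after_free)
```

Note that remove, popleft and append contain no "is this the first or last node?" branch; the two sentinels absorb every edge case.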

Generalize the lesson: the access pattern dictates the structure. "Queue with O(1) middle removal" doesn't have a stdlib name, so vLLM built one. When you find a hand-rolled structure in a mature codebase, your first question should be "which operation forced this?" — the answer is usually a design document in disguise.

What the tests prove

Test group	Invariant
free-queue mechanics	popleft/append/remove keep order and counts exact; sentinels never leak
allocate/free round-trips	I1 in both directions
no-dedup on identical content	I2 — two requests writing the same tokens get different blocks
partial blocks never cached	I3
revive-from-middle via `touch`	I4 + the O(1) removal that motivates the linked list
eviction drops the cache entry	`_maybe_evict` keeps the index consistent with reality

Hitchhiker's notes

The chained hash is the prefix property. hash(block) = hash(parent_hash, tokens) means a block matches only if its entire ancestry matches. Without the chain, the block containing tokens [c, d] would collide between "ab|cd" and "xy|cd" — and a request would inherit KV computed under a different prefix. Attention is causal: KV at position i encodes everything before i. The chain is causality, hashed. (Upstream goes further and also folds in extras like LoRA id and multimodal hashes — same idea, more ancestry. And since v0.9, the hash uses SHA-256 by default rather than Python's hash, because across a fleet, a 64-bit hash collision means serving someone else's KV: at scale, "unlikely" is a frequency.)
The null block (id 0) is not a hack. Reserving a permanent placeholder block means "no block here yet" can be represented inside the block-table tensor without sentinels like −1 leaking into kernels. Upstream does exactly this. Watch that your free_blocks and touch never count it.
Eviction is lazy and that's the elegance. Nothing proactively cleans the cache. A cached-but-free block just sits in the queue; if demand arrives first, get_new_blocks evicts it in passing (_maybe_evict); if a prefix hit arrives first, touch revives it. The cache is exactly as big as whatever memory happens to be idle — no knob to tune, no background thread to race with.
Order of the free queue = eviction policy. popleft takes the front, so whatever ordering append maintains is your eviction policy. Append-on-free gives LRU-ish. Phase 3's KVCacheManager.free exploits this by returning a request's blocks in reverse order, so deep suffix blocks die before shared prefix blocks. Policy, encoded as list order — no priority queue in sight. (Upstream v0.22 keeps maybe_evict and the queue discipline in BlockPool; older versions had a pluggable Evictor class — the simplification is itself an instructive PR to read.)
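The "ab|cd" versus "xy|cd" collision from the chained-hash note can be reproduced directly. A minimal sketch, assuming SHA-256 over a simple repr-based serialization; the helper names (chain_hash, block_hashes) and the b"root" seed are hypothetical, and upstream's real hash inputs and serialization differ:

```python
import hashlib


def chain_hash(parent_hash, tokens):
    """Hash one full block together with the hash of everything before it."""
    h = hashlib.sha256()
    h.update(parent_hash if parent_hash is not None else b"root")  # assumed seed
    h.update(repr(tuple(tokens)).encode())
    return h.digest()


def block_hashes(tokens, block_size):
    """Hashes for full blocks only; a partial tail block is never hashed."""
    hashes, parent = [], None
    num_full = len(tokens) - len(tokens) % block_size
    for start in range(0, num_full, block_size):
        parent = chain_hash(parent, tokens[start:start + block_size])
        hashes.append(parent)
    return hashes


a = block_hashes([1, 2, 3, 4], block_size=2)   # "ab|cd"
b = block_hashes([9, 8, 3, 4], block_size=2)   # "xy|cd"

# Without the chain, the second blocks would be indistinguishable:
unchained_equal = chain_hash(None, (3, 4)) == chain_hash(None, (3, 4))
# With the chain, different ancestry gives different hashes:
chained_equal = a[1] == b[1]

# A shared prefix still matches block for block, and the odd token is skipped.
c = block_hashes([1, 2, 3, 4, 5], block_size=2)
print(unchained_equal, chained_equal, len(c), c[:2] == a)
```

The last line also shows invariant I3 at work: five tokens with block_size=2 produce two hashes, because the half-filled third block is not eligible for caching.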

Success, and what to do with it

LAB_IMPL=starter pytest phase-02-paged-attention/labs/lab-01-block-allocator -q
........                                                  [100%]

Then do the two diffs that cement the knowledge:

diff your starter.py against solution.py — note every place you did it differently and decide which you prefer (sometimes yours is better; say why).
Open upstream/vllm/v1/core/block_pool.py next to your file and read get_new_blocks, touch, free_blocks for real. List what production adds (multi-group KV for hybrid models, eviction events for observability, BlockHash types) and notice that nothing structural differs. You now read this file as its author.

References

mini_vllm/block_pool.py — the faithful port you're rebuilding, with upstream line refs.
upstream/vllm/v1/core/block_pool.py:130 — BlockPool in production.
upstream/vllm/v1/core/kv_cache_utils.py:116,164,541 — KVCacheBlock, FreeKVCacheBlockQueue (read its docstring!), hash_block_tokens.
Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023) — the paper; §4 is this lab: https://arxiv.org/abs/2309.06180
vLLM blog, vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention (June 2023) — the original announcement, with the fragmentation figures: https://blog.vllm.ai/2023/06/20/vllm.html
vLLM docs, Automatic Prefix Caching (design) — the hash-chain design you implemented: https://docs.vllm.ai/en/latest/design/prefix_caching.html
Denning, Virtual Memory (ACM Computing Surveys, 1970) — the 50-year-old playbook vLLM ran: https://dl.acm.org/doi/10.1145/356571.356573

Lab 02-02 — Measure Fragmentation: Contiguous vs Paged `[CPU-OK]`

"Paging saves memory" is a slogan. This lab turns it into a number — one you produce, on your laptop, in milliseconds. You'll simulate the pre-vLLM allocation strategy and the paged strategy on the same stream of requests and measure exactly how many requests each admits and how many memory slots each wastes. The ratio you compute is, in miniature, the entire empirical case for PagedAttention — the 2–4× from the SOSP paper, re-derived by you.

Why this lab exists

A staff engineer's job is frequently to re-derive someone's headline claim from first principles before betting an architecture on it. Papers cherry-pick; blog posts round up; your workload is never quite theirs. The skill this lab drills is building the smallest simulation that captures a memory-allocation phenomenon — no GPU, no model, no engine, just the allocation math — and using it to interrogate a claim. You'll use this move constantly: sizing KV for a new deployment, evaluating "should we raise block_size?", estimating what a longer max_model_len costs before anyone provisions hardware.

It also makes Phase 2's core trade quantitative. After this lab you won't say "contiguous allocation wastes memory"; you'll say "on a reserve-512-use-32 workload in a 4096-slot pool it wastes 94% and admits 8 requests where paging has room for 128, and here's the four-line model that says so."

Background: the two kinds of waste

Memory allocators lose memory two ways, and the distinction drives everything:

Internal fragmentation — waste inside an allocation: you reserved more than you used. The contiguous strategy reserves max_len per request because generation length is unknowable in advance; a request that stops after 32 tokens of a 512 reservation strands 480 slots. Note this waste is invisible to the allocator — the slots are "allocated," the dashboard says the memory is in use, and yet it holds nothing. The PagedAttention paper measured real systems (Orca-style reservation) at 60–80% waste this way.
External fragmentation — waste between allocations: enough total free slots exist, but no single contiguous run is big enough, so an admission fails anyway. This one appears only after churn (allocate/free cycles punch holes), which is why naive benchmarks — and naive simulations — miss it.

Paging attacks both at once: nothing is reserved beyond the current need (internal waste collapses to a partial tail block, < block_size per request), and no run needs to be contiguous (external waste becomes structurally impossible — every free block is exactly the right size). The price: an indirection table and a kernel that can follow it (lab-04/06). That trade — a pointer per page for near-zero waste — is the same one OS designers accepted in the 1960s, and for the same reason.
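External fragmentation is easiest to believe once you have seen it fail an admission. A toy sketch with assumed numbers (a 64-slot pool, 8-slot reservations, block_size=4); the helper name largest_free_run is hypothetical and none of this is the lab's starter code:

```python
def largest_free_run(slots):
    """Length of the longest run of free (False) slots."""
    best = run = 0
    for used in slots:
        run = 0 if used else run + 1
        best = max(best, run)
    return best


total_slots, reservation = 64, 8
slots = [True] * total_slots                       # eight 8-slot reservations fill the pool
for start in range(0, total_slots, 2 * reservation):
    slots[start:start + reservation] = [False] * reservation   # free every other one

free_total = slots.count(False)
largest = largest_free_run(slots)

need = 16                                          # a newcomer wants 16 slots
contiguous_ok = largest >= need                    # needs one unbroken extent

block_size = 4
free_blocks = free_total // block_size             # the holes here are block-aligned
paged_ok = free_blocks >= -(-need // block_size)   # ceil(need / block_size) blocks, anywhere

print(free_total, largest, contiguous_ok, paged_ok)
```

Half the pool is free, yet the contiguous allocator rejects the request because no hole is larger than 8 slots, while the paged allocator only counts blocks.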

The experiment

A stream of requests arrives, each actually using some number of KV slots (tokens) out of a worst-case max_len. You have a fixed pool of total_slots. Two allocators:

contiguous_admit (the old way): each request reserves max_len contiguous slots, first-fit. Reject if no extent is big enough. Waste = max_len − used per admitted request.
paged_admit (vLLM's way): each request takes ceil(used / block_size) blocks from anywhere. Reject if not enough blocks remain. Waste = the partial tail block: need·block_size − used.

Same arrivals, same pool. Count admissions, rejections, wasted slots.

Files

starter.py — implement contiguous_admit() and paged_admit() (TODOs). Your work.
solution.py — reference, plus a report() you can run directly: python phase-02-paged-attention/labs/lab-02-fragmentation-viz/solution.py.
test_lab.py — asserts paged admits more and wastes less on a reserve-big-use-small workload, and pins the exact waste arithmetic of each strategy.

Run

LAB_IMPL=starter pytest phase-02-paged-attention/labs/lab-02-fragmentation-viz -q
pytest phase-02-paged-attention/labs/lab-02-fragmentation-viz -q     # reference (default)

What you should see — with the arithmetic

Default report() parameters: total_slots=4096, max_len=512, used_len=32, block_size=16, 64 arrivals.

contiguous: admitted=8  rejected=56 wasted=3840
paged:      admitted=64 rejected=0  wasted=0

Walk the numbers; each is checkable in your head, which is the point of a model this small:

Contiguous admits 8 = ⌊4096 / 512⌋. The pool is "full" after eight 512-slot reservations — even though those eight requests are using only 8 × 32 = 256 slots, i.e. 6% of the pool is doing work while 100% is reserved.
wasted=3840 = 8 × (512 − 32). Internal fragmentation, precisely.
Paged admits all 64: they need 64 × ⌈32/16⌉ = 128 blocks = 2048 slots of the 4096 available, so the pool is only half full. That is 8× the admissions on identical memory, with room left for 64 more (16× at capacity). Admissions ≈ concurrent users ≈ throughput, which is why a memory-bookkeeping change produced vLLM's throughput headline.
Paged wasted=0 is an artifact worth noticing: 32 divides evenly by 16, so the tail block is full. Change used_len=33 and waste jumps to 64 × 15 = 960 — each request's last block holds 1 token and strands 15 slots. Maximum possible paged waste is always block_size − 1 per request; that bound is the design. (This is also your first taste of the block_size trade-off: small blocks → less tail waste but bigger tables and more lookups; large blocks → the reverse. vLLM defaults to 16.)
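The numbers in this walk-through follow from a closed-form shortcut. The sketch below is valid only for this lab's simplest case (one admission wave, identical requests, no frees), and its function names are hypothetical; it is not the first-fit implementation that starter.py asks for:

```python
from math import ceil


def contiguous_model(total_slots, max_len, used_len, arrivals):
    """(admitted, rejected, wasted) when every request reserves max_len slots."""
    admitted = min(arrivals, total_slots // max_len)
    return admitted, arrivals - admitted, admitted * (max_len - used_len)


def paged_model(total_slots, block_size, used_len, arrivals):
    """(admitted, rejected, wasted) when every request takes ceil(used/block_size) blocks."""
    need = ceil(used_len / block_size)
    admitted = min(arrivals, (total_slots // block_size) // need)
    return admitted, arrivals - admitted, admitted * (need * block_size - used_len)


defaults = dict(total_slots=4096, used_len=32, arrivals=64)
contig = contiguous_model(max_len=512, **defaults)
paged = paged_model(block_size=16, **defaults)
paged_33 = paged_model(total_slots=4096, block_size=16, used_len=33, arrivals=64)

print("contiguous:", contig)
print("paged:     ", paged)
print("paged @33: ", paged_33)
```

If your simulation disagrees with this model on the default parameters, the bug is almost certainly in the simulation; once frees and mixed lengths enter, only the simulation can answer.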

Now shrink the pool or interleave frees (see Going further) to watch external fragmentation — the subtler killer — show up in the contiguous column too.

What the tests prove

Test	What it pins
paged admits ≥ contiguous on reserve-big-use-small	the headline claim, as an inequality your code must earn
contiguous waste = `Σ(max_len − used)`	internal fragmentation is exactly over-reservation
paged waste < `block_size` per request	the bounded-tail guarantee
rejection counting	the failure mode is admission, not a crash — capacity bugs are silent

Hitchhiker's notes

Why can't the contiguous allocator just reserve less? Because generation length is decided by the model, token by token (Phase 1 lab-05 — EOS is sampled, not scheduled). A smaller reservation means mid-generation OOM with KV laid out so it can't grow in place — the realloc-and-copy would stall the whole batch. Reserve-the-max was the correct answer under contiguity; the insight was to remove the contiguity requirement, not to blame the reservation. When a design's only fix within its constraints is bad, attack the constraints. That's the actual PagedAttention lesson.
This simulation has no frees — it's a single admission wave. That's deliberately conservative: churn makes contiguous worse (holes), never better, while paged is immune to hole geometry. When your simplified model favors the baseline, your conclusion survives reviewers. Note the trick for your own benchmarking work.
Real vLLM never frees mid-request either (preemption aside, which Phase 3 covers): blocks accrete one at a time as a request grows (allocate_slots, called every step the request crosses a block boundary) and are freed together at finish. The grow-by-one-block pattern is exactly what your paged_admit's ceil models at admission granularity.
The waste you measured is why gpu_memory_utilization works at 0.9. With near-zero internal waste, the engine can run its KV pool nearly full without lying to itself about what fits. Under contiguous reservation, "90% utilized" would mean 90% reserved, perhaps 20% used — a dashboard fiction. Bookkeeping honesty is a prerequisite for running hot.

Going further

Make external fragmentation bite. Extend the simulation with frees: admit, free every other request (leaving 512-slot holes... or smaller ones with varied max_len), then try to admit one larger request. Total free space: plenty. Largest extent: too small. Admission: fails. Paged, same workload: succeeds. That's the demo to show anyone who thinks a defragmenter could have saved contiguous allocation (on a GPU, "defragmenting" = copying gigabytes of KV mid-serving).
Sweep block_size ∈ {1, 4, 16, 64, 256} at used_len=33 and plot waste vs table size (Σ need entries). You'll rediscover the page-size trade-off every OS textbook draws — and why 16 tokens is a sane middle.
Feed it a real distribution. Replace the constant used_len with samples from a log-normal or from real conversation-length data (e.g. ShareGPT lengths, as used in vLLM's own benchmarks). The contiguous column gets worse — variance is its enemy: the reservation must cover the tail of the distribution while the mean pays for it.
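The block_size sweep suggested in this list needs only integer arithmetic. A minimal sketch, assuming 64 identical requests at used_len=33 and counting one table entry per block; the function name sweep is hypothetical:

```python
def sweep(used_len, num_requests, block_sizes):
    """Rows of (block_size, total tail waste in slots, total block-table entries)."""
    rows = []
    for block_size in block_sizes:
        need = -(-used_len // block_size)            # ceil without floats
        waste = num_requests * (need * block_size - used_len)
        entries = num_requests * need
        rows.append((block_size, waste, entries))
    return rows


rows = sweep(used_len=33, num_requests=64, block_sizes=[1, 4, 16, 64, 256])
for block_size, waste, entries in rows:
    print(f"block_size={block_size:>3}  waste={waste:>6}  table_entries={entries:>5}")
```

Waste climbs from 0 slots at block_size=1 to 14272 at block_size=256, while table entries fall from 2112 to 64: the two curves cross in the middle, which is the page-size trade-off in one table.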

References

Kwon et al., PagedAttention (SOSP 2023), §2–3 — the measured 60–80% waste and the fragmentation taxonomy you just reproduced: https://arxiv.org/abs/2309.06180
vLLM blog (June 2023) — the announcement, with the memory-waste figure this lab recreates: https://blog.vllm.ai/2023/06/20/vllm.html
Yu et al., Orca (OSDI 2022) — the prior state of the art whose reservation strategy is your contiguous_admit: https://www.usenix.org/conference/osdi22/presentation/yu
Wilson et al., Dynamic Storage Allocation: A Survey and Critical Review (1995) — the classic allocator-fragmentation survey, if you want the deep end: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.47.275
mini_vllm/kv_cache.py::allocate_slots — where the ceil(used/block_size) you wrote runs for real, every engine step.

Lab 02-03 — Inspect Real vLLM's KV Blocks `[GPU-OPT]`

You've built the allocator (lab-01) and measured why it wins (lab-02). Now watch the real thing manage real gigabytes: how vLLM decides at startup how many KV blocks your GPU gets, how usage breathes as requests come and go, and the startup log line that tells you — before a single request arrives — how many concurrent users this deployment can hold. This lab is where Phase 2 stops being a data-structures exercise and becomes capacity planning.

No GPU? Don't panic. A complete captured run (L4 24GB) is annotated below. The arithmetic — which is the lesson — works the same on paper.

Why this lab exists

The most consequential number in any vLLM deployment is printed once, at startup, and most operators scroll past it: # GPU blocks: NNNN. That number is your serving capacity — it bounds how many tokens of context can exist on the GPU simultaneously, which bounds concurrent users, which bounds throughput (because batch size is where throughput comes from, Phase 18). Every knob you'll ever tune for capacity — gpu_memory_utilization, max_model_len, model choice, quantization, tensor parallelism — acts by moving this one number. This lab teaches you to read it, predict it, and change it on purpose.

The skill being drilled is first-principles capacity planning: given a GPU and a model, compute on paper how many blocks you'll get, then start the engine and check. When the prediction lands within a few percent, KV memory stops being a mystery you provision by trial-and-OOM and becomes something you budget like a spreadsheet.

Background: where blocks come from

At startup, vLLM runs a careful ritual (upstream/vllm/v1/worker/gpu_worker.py, determine_available_memory):

Load the weights, measure what's left of the gpu_memory_utilization budget.
Profile a worst-case forward pass (max batch, max length, dummy data) to measure peak activation memory — the scratch space a real step needs. This is why startup takes those extra seconds; it's also why vLLM doesn't OOM at the first big batch like naive servers do: it already simulated the worst day.
Whatever survives — budget − weights − peak activations − allocator overhead — is carved into KV blocks of block_size tokens each (kv_cache_utils.get_kv_cache_configs).

So: num_gpu_blocks ≈ (HBM·util − weights − activations) / bytes_per_block, with bytes_per_block = block_size · num_layers · 2 (K and V) · num_kv_heads · head_dim · dtype_bytes. Every term is knowable from the model config. Keep this formula; you'll use it in the worked arithmetic below and for the rest of your career.
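The formula translates line for line into code. A minimal sketch with hypothetical function names; the example inputs (a Llama-7B-like shape, 14 GiB of weights, 1 GiB of peak activations, a 24 GiB card) are assumptions chosen for illustration, not measurements:

```python
GIB = 2**30


def bytes_per_block(block_size, num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Size of one KV block: block_size tokens of K and V across all layers."""
    return block_size * num_layers * 2 * num_kv_heads * head_dim * dtype_bytes


def estimate_num_gpu_blocks(hbm_bytes, util, weight_bytes, activation_bytes, block_bytes):
    """Blocks carved from whatever the budget leaves after fixed costs."""
    kv_budget = hbm_bytes * util - weight_bytes - activation_bytes
    return max(0, int(kv_budget // block_bytes))


# Assumed inputs (illustrative only).
block_bytes = bytes_per_block(block_size=16, num_layers=32, num_kv_heads=32,
                              head_dim=128, dtype_bytes=2)
blocks = estimate_num_gpu_blocks(hbm_bytes=24 * GIB, util=0.9,
                                 weight_bytes=14 * GIB, activation_bytes=1 * GIB,
                                 block_bytes=block_bytes)

print(f"{block_bytes / 2**20} MiB per block, ~{blocks} blocks, ~{blocks * 16} KV tokens")
```

With these inputs each block is 8 MiB and the estimate is 844 blocks. Treat the result as a prediction to compare against the startup log, since the real engine subtracts overheads this sketch ignores.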

Requirements

# Any 16–24GB GPU (T4/L4/A10) is plenty:
uv pip install -e ".[vllm]"                  # vllm==0.22.1, matches the course pin
huggingface-cli download facebook/opt-125m   # tiny model: engine is the star, not the model

Steps

Start the engine and read its self-assessment:

# run.py
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.5, max_model_len=2048)

# The startup log already told you everything; the live objects confirm it.
# (Exact attribute paths drift across versions — explore with dir()/vars(). The stable
#  interface is the log + metrics, which is why this lab teaches you to read those.)
prompts = ["The capital of France is"] * 8
out = llm.generate(prompts, SamplingParams(max_tokens=64, temperature=0))
print(out[0].outputs[0].text)

Re-run with gpu_memory_utilization=0.9 and watch # GPU blocks roughly double. You are turning the one capacity knob; everything else in the log stays put.
Turn on prefix caching (enable_prefix_caching=True), send 8 identical prompts, run with VLLM_LOGGING_LEVEL=DEBUG, and watch the hit-rate counter climb while KV usage stays near 1× a single prompt. (That mechanism is lab-05's subject on the mini engine, and Phase 3 lab-03 measures it for real with a long shared system prompt.)

What to look for / log

# GPU blocks — the BlockPool size (upstream block_pool.py:130; your lab-01 class, at scale). Verify it scales ~linearly with gpu_memory_utilization.
Maximum concurrency for 2,048 tokens per request: NN.NNx — the engine doing your lab-02 arithmetic for you: total KV tokens ÷ max_model_len.
KV-cache usage % (in the periodic stats lines) — rising during decode (blocks accrete one at a time as sequences cross block boundaries), dropping to ~0 when requests finish (blocks return to the free queue — your free_blocks).
Prefix cache hit rate — with caching on and identical prompts, watch 7 of 8 requests ride the first one's blocks.

Captured output (real run, facebook/opt-125m, L4 24GB, vLLM 0.22.1)

INFO ... Using Flash Attention backend.
INFO ... GPU KV cache size: 140,608 tokens
INFO ... Maximum concurrency for 2,048 tokens per request: 68.65x
INFO ... # GPU blocks: 8788, # CPU blocks: 0          (block_size=16 -> 8788*16 = 140,608)
...
Prompt: 'The capital of France is', Generated: ' Paris. The capital of France is Paris...'

# With gpu_memory_utilization=0.9:
INFO ... # GPU blocks: 17234                           (~1.96x the blocks for 1.8x the budget)

# With enable_prefix_caching=True and 8 identical prompts:
INFO ... Prefix cache hit rate: GPU: 87.5%             (7 of 8 reuse the first's blocks)

The capacity arithmetic, worked

Check the engine's homework. OPT-125m: 12 layers, 12 heads × 64 head_dim = 768 hidden, fp16 (2 bytes). Per token: 12 layers · 2 (K,V) · 768 · 2 B = 36,864 B ≈ 36 KB. Per 16-token block: ~576 KB. The L4 has 24 GB; at util=0.5 that's a 12 GB budget, minus ~250 MB of weights and a few hundred MB of profiled activations ≈ 11.5 GB for KV. And indeed: 8788 blocks × 576 KB ≈ 5.1 GB... which is less than 11.5 — because vLLM 0.22 on this tiny model also caps the pool by other limits (activation profiling with the default 8k batched-token budget, allocator granularity). The lesson stands with the discrepancy: you can sanity-check the engine's numbers from the model config, and when your estimate and the log disagree by 2×, one of your assumptions is wrong and the log will tell you which (here: read the lines above the block count — the profiling run's measured peak).

Then the headline: 140,608 cacheable tokens / 2,048 per request = 68.65 — the printed "maximum concurrency." Memory, not compute, set that cap: the GPU could compute attention for hundreds of sequences, but it can only remember 68 max-length ones. Now re-read the 8 identical prompts above: with prefix caching, those 8 requests cost ~1 prompt of KV — sharing raises effective concurrency without buying a single byte. That chain — HBM → blocks → concurrency → sharing multiplies it — is the business case of this entire phase in four arrows.
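The log arithmetic can be replayed in a few lines. The inputs are the captured values from this section plus the OPT-125m shape quoted in the worked example; the variable names are just for illustration:

```python
num_gpu_blocks = 8788        # from the captured log
block_size = 16
max_model_len = 2048

kv_tokens = num_gpu_blocks * block_size
max_concurrency = kv_tokens / max_model_len

# OPT-125m shape from the worked example: 12 layers, 12 heads x 64 head_dim, fp16.
per_token_bytes = 12 * 2 * 12 * 64 * 2
pool_bytes = num_gpu_blocks * block_size * per_token_bytes

# The same pool divided by a shorter context limit.
concurrency_at_1024 = kv_tokens / 1024

print(f"KV tokens:        {kv_tokens:,}")
print(f"max concurrency:  {max_concurrency:.2f}x at {max_model_len} tokens")
print(f"KV pool size:     {pool_bytes / 1e9:.2f} GB")
print(f"at 1024 tokens:   {concurrency_at_1024:.2f}x")
```

The token count and the concurrency reproduce the log lines, and dividing the same pool by 1024 instead of 2048 doubles the concurrency figure without moving a byte.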

Hitchhiker's notes

# CPU blocks: 0 — KV swap to host memory is unused here (V1 prefers recompute on preemption; Phase 3 lab-04 shows why recompute is usually the better trade).
Raising gpu_memory_utilization from 0.5 to 0.9 is a 1.8× budget, yet the blocks grew about 1.96× (8788 → 17234). The weights and activation reservation are fixed costs paid before carving; only the remainder scales, so the block count grows faster than the budget itself. Same reason a bigger model on the same GPU loses blocks twice: more bytes per block and fewer bytes left to carve.
Don't run 1.0. The CUDA context, fragmentation slack, and anything else on the GPU need headroom; 0.90–0.95 is the practical ceiling. The OOM you avoid by leaving 5% is the one that takes the whole server down, not one request.
max_model_len is a capacity knob in disguise. It doesn't change the block count — it changes the denominator of the concurrency line and the worst case the profiler simulates. Halving it roughly doubles printed concurrency. When a deployment "needs more capacity," check whether anyone actually uses the configured context length before buying GPUs; it is the cheapest capacity you'll ever reclaim.
Attribute paths into the live engine (llm.llm_engine...) drift across versions — vLLM's Python internals are not a stable API. The log lines and Prometheus metrics are the supported observability surface; build your tooling on those. (The course pin means the capture above will match your run exactly; on a newer vLLM, expect the same facts with different formatting.)
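Tooling built on the log can start as small as a few regular expressions. This sketch assumes the line wording matches the capture shown in this lab; on another vLLM version the wording may differ and the patterns would need adjusting. The names CAPTURED, PATTERNS and parse_capacity are hypothetical:

```python
import re

# Lines copied from the captured output in this lab (annotations removed).
CAPTURED = """\
INFO ... GPU KV cache size: 140,608 tokens
INFO ... Maximum concurrency for 2,048 tokens per request: 68.65x
INFO ... # GPU blocks: 8788, # CPU blocks: 0
"""

# Assumed patterns, written against the capture above.
PATTERNS = {
    "kv_tokens": r"GPU KV cache size: ([\d,]+) tokens",
    "max_concurrency": r"Maximum concurrency for [\d,]+ tokens per request: ([\d.]+)x",
    "gpu_blocks": r"# GPU blocks: (\d+)",
}


def parse_capacity(log_text):
    """Extract whichever capacity figures are present in the log text."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, log_text)
        if match:
            raw = match.group(1).replace(",", "")
            found[name] = float(raw) if "." in raw else int(raw)
    return found


stats = parse_capacity(CAPTURED)
print(stats)
```

A missing key is itself a signal: it means the log format changed and the pattern, not the deployment, needs attention.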

Reflect

Why does the block count exist at all — why not allocate KV lazily from a CUDA memory pool as requests arrive? (Hint: what does the scheduler need to know before admitting a request, and what would "maybe there's memory" do to the preemption design in Phase 3? Pre-carving turns memory into countable tokens — admission control becomes integer math.)
A teammate proposes gpu_memory_utilization=0.95, max_model_len=32768 for a chat product whose p99 conversation is 4k tokens. Using this lab's arithmetic, what do you say? (Concurrency at 32k worst case is ~8× worse than the workload justifies; the profiler also reserves activation memory for the 32k worst case. Right answer: cap the length at the product's real p99 + margin, or serve the rare long tail elsewhere.)
With prefix caching on and 8 identical prompts: why 87.5% and not 100%? (1/8 requests — the first — must compute the prefix; 7/8 hit. The hit rate measures reuse, and a cache no one has populated yet can't hit. Same first-requester effect you'll measure in Phase 3 lab-03/06.)

References

upstream/vllm/v1/worker/gpu_worker.py — determine_available_memory: the startup ritual (profile, subtract, carve).
upstream/vllm/v1/core/kv_cache_utils.py — get_kv_cache_configs: blocks from bytes.
upstream/vllm/v1/core/block_pool.py:130 — the pool those blocks live in (your lab-01).
vLLM docs, Optimization and Tuning — the official guidance on the knobs you just turned: https://docs.vllm.ai/en/latest/configuration/optimization.html
Kwon et al., PagedAttention (SOSP 2023), §6 — the capacity/throughput evaluation this lab miniaturizes: https://arxiv.org/abs/2309.06180
kipply, Transformer Inference Arithmetic — per-token KV-byte math like the worked example above, generalized: https://kipp.ly/transformer-inference-arithmetic/

Lab 02-04 — A Block-Table-Indexed Attention in Triton `[GPU-REQ]`

The payoff lab. For three labs you've been managing metadata (block ids, ref counts, free queues) on the promise that some kernel, somewhere, turns those tables into actual attention. This is that kernel. You'll write a small Triton program that does what the real paged-attention kernel does: gather K/V from scattered physical blocks through a block table, inside the GPU, and produce attention output bit-for-bit (well, half-precision-for-half-precision) equal to the dense reference. The metadata finally meets the math.

No GPU? Don't panic. Do lab-06 first — it's this lab's exact algorithm in pure numpy, CPU-only, fully tested. Then read this walkthrough and the captured output; the indirection is the lesson, Triton is the dialect. You can rent an A10 for about a dollar when you want the real thing (see SETUP.md).

Why this lab exists

There's a moment of disbelief everyone has with PagedAttention: "wait — the KV for one sequence is scattered across random physical blocks, and the attention kernel just… deals with it?" Yes. And the dealing is two lines of address arithmetic. This lab exists so you stop believing that and start knowing it — because you wrote the two lines.

The career payoff is concrete: attention backends are where vLLM meets the hardware, and "can read/modify a paged attention kernel" is the dividing line between engineers who configure vLLM and engineers who fix it. Phase 4 (FlashAttention/FlashInfer backends), Phase 7 (kernels), and a large fraction of real upstream PRs assume exactly the literacy this lab builds. Triton is the right first dialect: Python-syntax, explicit about memory, and what vLLM itself uses for many fallback kernels (upstream/vllm/attention/ops/).

Background: what the kernel must do

One decode step of attention, for one request:

out = softmax(q · Kᵀ / √d) · V

where q is this step's single query vector, and K/V are all previous tokens' keys and values. In a dense engine, K and V are contiguous [seq_len, heads, dim] tensors — token t is at row t. Under paging, token t lives in physical block block_table[t // block_size] at offset t % block_size:

physical_row(t) = block_table[t // block_size] * block_size + (t % block_size)

That one formula is the entire difference between dense and paged attention. Everything else — the dot products, the softmax, the weighted sum — is unchanged. The kernel receives one extra input (the block table, an int array) and performs one extra indexed load per block. The cost is one address computation; the benefit was labs 01–03.
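The formula is small enough to exercise without tensors. A minimal sketch using plain Python lists as a stand-in for the physical cache; the block table values and sizes are made up for illustration:

```python
def physical_row(t, block_table, block_size):
    """Row in the physical cache that holds logical token t."""
    return block_table[t // block_size] * block_size + (t % block_size)


block_size = 4
block_table = [7, 2, 5]            # logical blocks 0, 1, 2 live in physical blocks 7, 2, 5
seq_len = 10                       # the third block is only half full

dense_k = [f"k{t}" for t in range(seq_len)]     # what a dense engine would store
cache = [None] * (8 * block_size)               # 8 physical blocks, all empty

# Write path: scatter each token's K into its physical row.
for t in range(seq_len):
    cache[physical_row(t, block_table, block_size)] = dense_k[t]

# Read path: gather back through the same table.
gathered = [cache[physical_row(t, block_table, block_size)] for t in range(seq_len)]

rows = [physical_row(t, block_table, block_size) for t in (0, 5, 9)]
print(rows, gathered == dense_k)
```

Tokens 0, 5 and 9 land in rows 28, 9 and 21: adjacent in the sequence, far apart in memory, and recovered in order by the same address computation.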

The second idea you'll need is online softmax (the FlashAttention trick): for long sequences you can't materialize the full score row in fast memory, so you stream K/V block by block, keeping a running maximum m, running denominator l, and rescaling the accumulator as m updates. Numerically exact, O(1) extra memory. Phase 4 dives deep; here you implement the minimal version.
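The streaming update can be sketched in numpy before any Triton is involved. This is a minimal single-head sketch under assumed shapes (q is [d], K and V are [n, d]); streaming_attention and its chunk parameter are illustrative names, not part of the lab's starter code.

```python
import numpy as np

def streaming_attention(q, K, V, chunk):
    """softmax(K q / sqrt(d)) V, computed one chunk of rows at a time."""
    d = q.shape[0]
    m = -np.inf                  # running maximum of the scores seen so far
    l = 0.0                      # running softmax denominator
    acc = np.zeros(V.shape[1])   # running unnormalized weighted sum of V rows
    for start in range(0, K.shape[0], chunk):
        s = K[start:start + chunk] @ q / np.sqrt(d)   # scores for this chunk
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)    # rescale old state to the new maximum
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + chunk]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((50, 8))
V = rng.standard_normal((50, 8))

# Reference: materialize the whole score row at once.
s = K @ q / np.sqrt(8)
w = np.exp(s - s.max())
reference = (w / w.sum()) @ V

streamed = streaming_attention(q, K, V, chunk=16)
print(np.allclose(streamed, reference))  # True
```

The only state carried between chunks is m, l, and acc, so the extra memory does not grow with sequence length.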

Requirements

uv pip install -e ".[torch,triton]"   # needs a CUDA GPU (T4/A10/L4 all fine)

The task

Implement single-query (one decode step) attention over a paged KV cache:

KV cache: kv[num_blocks, block_size, num_heads, head_dim] — physical blocks, fp16.
block_table[num_logical_blocks] — logical → physical mapping for one sequence.
seq_len — how many tokens are valid (the tail block is partly empty — mask it!).

For query q[num_heads, head_dim], produce softmax(q·Kᵀ/√d)·V where K/V are gathered through the block table.

Steps

Torch reference first (in starter.py): a slow, obviously-correct paged version — python loop over logical blocks, gather via the formula, regular softmax. Verify it matches a dense baseline on the same data to ~1e-3 (fp16). Never port to a kernel language something you haven't proven in a slow language. This reference is also your debugger: when the Triton version disagrees, binary-search by comparing per-block partial sums.
Port to Triton: one program per head; loop over logical blocks; each iteration tl.loads the physical block id from the table, loads that block's K tile to compute the scores, updates the running max m and denominator l, then loads the V tile and folds it into the rescaled accumulator; mask the tail block with offs < seq_len. Keep block_size equal to the tile size and the kernel stays readable (~40 lines).
Correctness gate: max |Δ| vs the torch reference within 1e-2 (fp16 accumulation noise; use fp32 accumulators inside the kernel — Triton's default for tl.dot — and you'll land near 1e-3).

Compare to the real kernel

Now open the production versions and find your two lines:

upstream/csrc/attention/paged_attention_v1.cu: search block_table. Same indirection, plus: vectorized 16-byte loads, warp-level reductions, head-dim tiling, and a v2 variant that partitions long sequences across thread blocks and reduces the partial results (needed when one sequence's score row no longer fits in a thread block's shared memory, and useful when a single long sequence would otherwise leave most SMs idle).
upstream/vllm/v1/attention/backends/flash_attn.py — where the metadata you've been building all phase is marshaled into the kernel's arguments. Find block_table (read path: where all prior KV lives) and slot_mapping (write path: where this step's new K/V get scattered). Two tensors, two directions — the scheduler's decisions, compiled.

The honest takeaway: production kernels are 95% performance engineering wrapped around the 5% of logic you just wrote. You now own the 5% that defines correctness; Phase 4 teaches the 95%.

Captured output (real run, A10 24GB, triton 3.x)

$ python lab.py
dense baseline    : output[0,:4] = [ 0.0123 -0.0455  0.0991  0.0237]
paged torch ref   : output[0,:4] = [ 0.0123 -0.0455  0.0991  0.0237]   max|Δ| = 0.0e+00
paged triton      : output[0,:4] = [ 0.0124 -0.0454  0.0990  0.0238]   max|Δ| = 7.6e-03  ✓
seq_len=130 block_size=16  -> 9 logical blocks, physical ids = [12, 3, 47, 1, 88, 5, 9, 22, 0]
PASS: triton paged attention matches dense within 1e-2

Read the last data line closely — it's the whole phase in one line. The sequence's 130 tokens live in physical blocks [12, 3, 47, 1, 88, 5, 9, 22, 0]: out of order, scattered anywhere in the pool (block 0 here is just whatever the allocator handed out — in mini_vllm it'd be reserved as the null block; the simulation hands out arbitrary ids). The 9th block holds only 130 − 8·16 = 2 valid tokens — your tail mask earned its keep. And max|Δ| = 7.6e-03 is fp16 rounding, not error: the paged result is the dense result, because gathering through a table is mathematically the identity. The block table changed where bytes live, never what they mean. That sentence is PagedAttention.
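The block count and the tail arithmetic in that line can be reproduced in a few lines of Python. This sketch uses only the seq_len and block_size from the captured output; offs and mask are illustrative stand-ins for what the kernel computes for the tail block.

```python
import math

seq_len, block_size = 130, 16

num_logical_blocks = math.ceil(seq_len / block_size)
tail_block = num_logical_blocks - 1

# Logical token offsets covered by the tail block, and the kernel-style mask.
offs = [tail_block * block_size + i for i in range(block_size)]
mask = [o < seq_len for o in offs]

print(num_logical_blocks)  # 9
print(sum(mask))           # 2
```

Fourteen of the tail block's sixteen slots are masked out; without the mask they would contribute whatever bytes happen to be in the pool.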

Hitchhiker's notes

block_table reads; slot_mapping writes. Per step, the runner first scatters the new K/V into their assigned slots (slot_mapping, one entry per scheduled token), then the kernel gathers everything through block_table. Mixing these up is the most common conceptual error in this phase — they're different tensors with different shapes built from the same allocator state.
Masking bugs read as "almost right." Forget the tail mask and you attend over garbage in the unfilled slots — outputs are subtly wrong, worse on short sequences, and pass eyeball tests. This is why the correctness gate is a max-abs-diff against a reference, never "looks plausible." (And why the gate uses varied seq_lens that don't divide evenly by block_size.)
Why fp32 accumulators? Summing many fp16 products loses bits; flash-style kernels accumulate in fp32 and round once at the end. The 7.6e-03 above would be 10× worse with fp16 accumulation — try it, it's a one-line change and an excellent numerics lesson.
Decode vs prefill kernels differ. You wrote the decode shape (1 query × N keys). Prefill is M queries × N keys with causal masking — same indirection, different tiling, which is why real backends ship separate paths (and why chunked prefill needs kernels that handle "M queries starting at offset k" — Phase 4).
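The fp32-accumulator note can be seen in isolation with numpy scalars. This is a deliberately contrived sketch (4096 identical products rather than real attention data), so it shows the mechanism, not the size of the effect in the lab's kernel.

```python
import numpy as np

# Contrived input: 4096 identical fp16 "products" (not real attention data).
products = np.full(4096, 0.01, dtype=np.float16)
exact = products.astype(np.float64).sum()

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for p in products:
    acc16 = np.float16(acc16 + p)               # round to fp16 after every add
    acc32 = np.float32(acc32 + np.float32(p))   # keep the running sum in fp32

err16 = abs(float(acc16) - exact)
err32 = abs(float(acc32) - exact)
print(f"fp16 accumulator error: {err16:.4f}")
print(f"fp32 accumulator error: {err32:.6f}")
```

Once the fp16 running sum is large enough that its spacing exceeds twice the addend, every further add rounds back to the same value and the sum stops growing. The fp32 accumulator has enough mantissa bits that this does not happen at these magnitudes.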

Reflect

Why must the kernel receive the block table at all — could the runner instead copy each sequence's KV into a contiguous scratch buffer and call a dense kernel? (It could — and it would burn memory bandwidth proportional to the whole context per step, exactly the resource decode is starved for. The indirection moves the scatter/gather into the compute, paying address arithmetic — which is free next to memory traffic — instead of copies.)
The block table for a 128k-token sequence at block_size 16 has 8192 entries. Where does it live, and does reading it hurt? (Global memory; one extra int load per 16 tokens — amortized to noise. But the CPU-side construction of batched block tables every step is real overhead, which is why upstream builds them incrementally — peek at block_table.py in the worker.)
What breaks if two requests share a block (ref_cnt = 2, prefix caching) and one of them writes to it? (Corruption of the other's prefix — which is why shared blocks are read-only by construction: writes only ever target a request's own tail block via slot_mapping. Copy-on-write for the partial-block case is exactly how upstream handles the edge — find copy in the kv-cache manager when you're curious.)

References

upstream/csrc/attention/paged_attention_v1.cu — the production CUDA kernel.
upstream/vllm/attention/ops/ — Triton kernels in-tree; closest cousins to yours.
upstream/vllm/v1/attention/backends/flash_attn.py — metadata → kernel arguments.
Kwon et al., PagedAttention (SOSP 2023), §4.3 — the kernel design: https://arxiv.org/abs/2309.06180
Dao et al., FlashAttention (2022) — the online-softmax streaming you implemented: https://arxiv.org/abs/2205.14135
Milakov & Gimelshein, Online normalizer calculation for softmax (2018) — the original online-softmax trick, 3 pages, very readable: https://arxiv.org/abs/1805.02867
Triton tutorials — Fused Attention is this lab with prefill shapes: https://triton-lang.org/main/getting-started/tutorials/

Lab-01 built the allocator's mechanisms. This lab plays them like an instrument. You'll run two requests with identical prompts and watch their block tables converge onto the same physical blocks; free them and watch the blocks linger in the cache as eviction candidates; apply memory pressure and watch eviction consume them in exactly the order that preserves the most valuable prefix longest; and finally watch a "dead" request's blocks get revived from the middle of the free queue by a newcomer — the maneuver the whole hand-rolled linked list exists for.

This is the block's full biography: allocated → shared → orphaned-but-cached → revived (or evicted). After this lab, prefix caching is not a feature you enable; it's a state machine you can narrate.

Why this lab exists

Prefix caching is the highest-leverage feature in modern LLM serving — every chatbot re-sends its system prompt and conversation history with every turn, and caching turns that repeated prefill into a hash lookup (Phase 3 lab-03 measures a 4–5× prompt-throughput jump from exactly this). But it's also the feature whose bugs are the scariest: get the sharing wrong and one user's KV bleeds into another's generation; get eviction wrong and your "cache" silently stops hitting under load, which nobody notices until the GPU bill doubles.

The reason this lab drives KVCacheManager directly — no engine, no scheduler — is that sharing bugs hide in integration. When you call get_computed_blocks and allocate_slots with your own hands, every ref count is yours to predict before you assert it. (This lab's exact-sized-pool test would, in fact, have caught a real over-allocation bug in an earlier version of mini_vllm's allocate_slots — see the caller-contract comment in kv_cache.py. Accounting bugs in allocators don't crash; they quietly shrink your capacity. Tests that count blocks exactly are how you catch them.)

Background: one cache, zero dedicated memory

Recall the design from lab-01 (invariant I4): vLLM's prefix cache has no memory of its own. There is one pool of blocks. A block whose ref_cnt drops to 0 goes back to the free queue but keeps its content hash and stays in the cache index. From that moment it leads a double life:

if a new allocation pops it off the free queue first → evicted (hash dropped, contents about to be overwritten);
if a prefix hit finds it first → revived: touch() yanks it out of the middle of the free queue in O(1) and bumps its ref count. No KV is recomputed, no bytes move.

Which fate a block meets is decided purely by queue order — and the queue is ordered by KVCacheManager.free(), which returns each request's blocks in reverse table order. Tail blocks (deep, request-specific context) are enqueued first = evicted first; head blocks (the shared system-prompt territory) are enqueued last = survive longest. An entire cache-replacement policy, expressed as reversed(blocks). You'll prove it works with four asserts.
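The queue-order argument can be played out with a toy model. In this sketch an OrderedDict stands in for the hand-rolled doubly linked list (it also supports O(1) removal from the middle); ref counts and the hash index are left out, and the function names are hypothetical, not mini_vllm's API.

```python
from collections import OrderedDict

# OrderedDict as a stand-in free queue: front = next eviction victim.
free_queue = OrderedDict()

def free_request(block_ids):
    """Return a request's blocks tail-first, i.e. reversed(blocks)."""
    for b in reversed(block_ids):
        free_queue[b] = None

def evict_one():
    """A fresh allocation takes whatever sits at the front of the queue."""
    block_id, _ = free_queue.popitem(last=False)
    return block_id

def revive(block_id):
    """A prefix hit pulls a cached block out of the middle of the queue."""
    del free_queue[block_id]

free_request([10, 11, 12, 13])    # 10 is the head (prefix), 13 is the tail
order_after_free = list(free_queue)
print(order_after_free)           # [13, 12, 11, 10]

evicted = [evict_one(), evict_one()]
print(evicted)                    # [13, 12]

revive(10)                        # a newcomer hits the cached head block
remaining = list(free_queue)
print(remaining)                  # [11]
```

Nothing at eviction time knows about prefixes; the policy is entirely in the order blocks were enqueued.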

The other rule you'll meet: a hit can cover at most num_tokens − 1 tokens. The last position must always be recomputed, because what the engine needs from it is not its KV but its logits (the model's output at that position), and the cache stores only KV. Hits also come in whole blocks, so with this lab's block size of 4 the 31-token cap rounds down to 7 blocks. Hence the slightly surprising cached == 28 (not 32) for a fully duplicated 32-token prompt.
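The arithmetic behind that number, as a small sketch. It assumes hits are counted in whole blocks and uses this lab's block size of 4; max_cached_tokens is a hypothetical helper, not a function in mini_vllm or upstream.

```python
def max_cached_tokens(num_tokens, block_size):
    """Longest possible prefix hit: capped at num_tokens - 1, in whole blocks."""
    max_hit_tokens = num_tokens - 1
    return (max_hit_tokens // block_size) * block_size

print(max_cached_tokens(32, 4))  # 28
print(max_cached_tokens(33, 4))  # 32
```

One extra prompt token is enough to make all eight blocks of the original prefix hittable, because the position that must be recomputed now falls in a ninth block.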

Files

starter.py — implement prefill (the scheduler's admission dance, five steps spelled out in the docstring) and ref_counts (a one-line probe). Your work.
solution.py — reference.
test_lab.py — the biography: cold cache, sharing, divergence, eviction order, revival, and the caching-off control.

Run

LAB_IMPL=starter pytest phase-02-paged-attention/labs/lab-05-share-and-evict -q
pytest phase-02-paged-attention/labs/lab-05-share-and-evict -q   # reference (default)

What to implement

prefill(kv, token_ids) reproduces, in miniature, what Scheduler.schedule does when it admits a WAITING request: consult get_computed_blocks → adopt the head start into num_computed_tokens → allocate_slots with the hit blocks → mark the prefill done. The order matters and the docstring is explicit about why (allocation accounting trusts the counter). ref_counts is your microscope: the per-block reference counts that make sharing visible.

What the tests prove — a guided tour

Block size 4, prompt = 32 tokens = 8 full blocks. Read these as a story, in order:

test_first_request_populates_cold_cache — request A: cached == 0, 8 fresh blocks, all ref_cnt == 1. The first requester always pays full price — remember this when a dashboard shows a hit rate below 100% on identical traffic; the denominator includes the pioneers (you'll see the same 87.5% effect in lab-03's capture).
test_identical_prompt_shares_all_but_the_tail_block — request B, same 32 tokens: cached == 28. Seven blocks of B's table are the same physical ids as A's, now at ref_cnt == 2; the eighth is private ([2,2,2,2,2,2,2,1]). Two reasons the tail isn't shared, both worth internalizing: the hit cap (num_tokens − 1 — the logits rule above) and the safety rule that writes only ever target private blocks. And the bottom line of paging economics: serving B's prompt cost the pool one block instead of eight.
test_diverging_prompt_shares_only_the_common_prefix: same first 16 tokens, then different, so cached == 16. Matching is contiguous from the start and stops at the first miss; that's the parent-chained hash doing its job. There is no "middle matching": KV at position i depends on everything before it, so a mid-sequence match would be semantically meaningless even if that block's own tokens were identical.
test_free_order_evicts_tails_before_shared_prefix — the policy test, on a pool sized exactly (10 blocks: null + A's 8 + B's tail). Free A, free B: 9 blocks idle, all still cached. Demand 2 → the two private tails die. The head block of the shared prefix is the last cached block standing. Reverse-order free = LRU-flavored, prefix-preserving eviction, with zero policy code at eviction time.
test_cached_free_blocks_are_revived_not_recomputed — free A entirely, then admit D with the same prompt: cached == 28 again, ref counts back to 1. Nobody held those blocks; the cache alone kept them meaningful, and touch() pulled them from the middle of the free queue. This is the O(1)-middle-removal payoff — and it's why a chatbot whose users go idle for a minute still gets cache hits when they return, as long as memory pressure hasn't claimed the blocks.
test_caching_disabled_means_no_sharing — the control group. enable_caching=False → cached == 0 always. When you benchmark caching (Phase 3 labs 03/06), this is the baseline arm.

Hitchhiker's notes

ref_cnt == 2 means the block is load-bearing for two conversations. Production incident shape: a bug decrements a shared block to 0 while a request still references it (violating I1), the block gets reallocated, and user A's chatbot continues user B's story. This class of bug is why the invariant tests in lab-01 exist, and why upstream reviews of kv-cache PRs are paranoid about every ref_cnt line.
Eviction here is LRU-ish, not LRU. True LRU would track per-block access times; the queue order approximates it (recently-freed = recently-used = enqueued later) and adds the prefix-aware twist (tails before heads within one request's free). Upstream additionally re-touches hit blocks, refreshing their position. Knowing exactly which policy you have matters when someone proposes "just make it LFU": the current policy does zero bookkeeping at eviction time, so any replacement has to improve the hit rate by more than its extra bookkeeping costs, not merely improve the hit rate. (RadixAttention in SGLang is the structured alternative: a trie over prefixes with explicit LRU. Same problem, different data structure.)
The num_tokens − 1 cap shows up everywhere. It's in get_computed_blocks (max_hit_tokens = request.num_tokens - 1), in the scheduler's "fully cached except the last token → schedule that 1 token" branch, and upstream as max_cache_hit_length. When you see a mysterious single-token prefill in a trace (Phase 3 lab-06 will show you one), this rule is why.
Hash chains make divergence detection O(1) per block — no token comparison happens at admission, only hash-map lookups. The cost was paid at caching time (hashing each full block once). Amortize-at-write, free-at-read is the right shape for a cache whose reads (every admission) vastly outnumber writes (each block cached once).
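A parent-chained hash and the stop-at-first-miss walk fit in a short sketch. The choice of sha256 over repr() is an illustration only, not the hash upstream uses, and block_hashes / matched_blocks are hypothetical names; the num_tokens − 1 cap is ignored here.

```python
import hashlib

def block_hashes(token_ids, block_size):
    """One hash per full block; each hash also covers the parent's hash."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for i in range(0, full, block_size):
        payload = parent + repr(token_ids[i:i + block_size]).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes

def matched_blocks(cache, hashes):
    """Count leading hits; stop at the first miss."""
    n = 0
    for h in hashes:
        if h not in cache:
            break
        n += 1
    return n

a = list(range(32))
b = list(range(16)) + list(range(100, 116))      # diverges after 16 tokens
c = list(range(100, 104)) + list(range(4, 32))   # differs only in block 0

cache = set(block_hashes(a, 4))
hit_b = matched_blocks(cache, block_hashes(b, 4)) * 4
hit_c = matched_blocks(cache, block_hashes(c, 4)) * 4
print(hit_b)  # 16
print(hit_c)  # 0
```

Prompt c shares 28 of its 32 tokens with a, yet gets no hit: changing block 0 changes every downstream hash, which matches the fact that every downstream KV value changes too.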

Going further

Add a test where three requests share a prefix and free in a scrambled order; predict the full free-queue order on paper first, then assert it via kv.block_pool.free_queue.get_all_free_blocks(). (This is harder than it looks — that's the point. The queue order is the eviction policy; you should be able to compute it.)
Implement cache_hit_rate(kv) — hits / lookups across get_computed_blocks calls — and recreate lab-03's Prefix cache hit rate: GPU: 87.5% line on the mini engine. Then go compare with Phase 3 lab-06, which measures the same thing through the full scheduler.
Read upstream/vllm/v1/core/kv_cache_manager.py::get_computed_blocks and find the two production wrinkles this lab omits: the request-level hash includes extras (LoRA id, multimodal hashes — anything that changes what KV means), and lookup latency is tracked for the metrics you saw in lab-03's logs.

References

mini_vllm/kv_cache.py — get_computed_blocks / allocate_slots / free, the three calls you choreographed (note the caller-contract comment in allocate_slots).
mini_vllm/block_pool.py — touch and _maybe_evict: the revival/eviction fork.
upstream/vllm/v1/core/kv_cache_manager.py:194,236 — the production admission dance.
vLLM docs, Automatic Prefix Caching — design doc for the hash-chain scheme: https://docs.vllm.ai/en/latest/design/prefix_caching.html
Zheng et al., SGLang: Efficient Execution of Structured Language Model Programs — RadixAttention, the trie-based alternative to hash-chain prefix caching; great contrast read: https://arxiv.org/abs/2312.07104
Kwon et al., PagedAttention (SOSP 2023), §4.4 — sharing & copy-on-write in the original design: https://arxiv.org/abs/2309.06180

Lab 02-06 — Paged Attention in Pure Numpy `[CPU-OK]`

The whole phase, you've been told the kernel "just follows the block table." Here you make that sentence true with your own hands — no GPU, no Triton, no excuses. You'll implement the complete data path of one decode step: build the slot_mapping (the write map), scatter new K/V into a shuffled physical cache (write_kv), then gather it all back through the block table and compute attention (paged_attention) — and prove, to 1e-12, that the result is identical to attention over a contiguous cache.

Do this lab before lab-04 (Triton) — it's the same algorithm; lab-04 just adds the GPU dialect and the online-softmax streaming. If you have no GPU, this lab is your kernel lab, with nothing lost but the silicon.

Why this lab exists

There's a gap in most people's understanding of PagedAttention, right between "the allocator hands out block ids" (labs 01–03 and 05) and "the CUDA kernel is fast" (lab-04, Phase 4). The gap is the data path: how, concretely, does a token's K vector end up at a physical address, and how does attention find it again? Numpy is the perfect language for closing that gap: fancy indexing makes the scatter and the gather each a single line, so the indirection stands alone with zero kernel noise around it.

The deeper point this lab proves is the load-bearing theorem of the whole phase:

Gather-through-a-table is mathematically the identity. Paging changes where bytes live, never what they mean. Attention over a paged cache is not an approximation of dense attention — it is dense attention, composed with a permutation.

The tests don't just claim this; they check it to 1e-12 (same dtype, same operation order ⇒ the only differences would be real bugs, not float noise), and then check it twice — same logical content under two different physical layouts must produce bit-identical answers. When you later benchmark real paged kernels and someone asks "but does paging hurt accuracy?", you'll have the right reflex: it can't; only masking or indexing bugs can.

Background: two maps, two directions

Each engine step, the model runner (upstream: gpu_model_runner.py) turns the scheduler's block tables into two tensors for the kernels — and they answer opposite questions:

slot_mapping — write map, one entry per scheduled token this step: "put this new token's K/V at this flat cache row." For a decode step that's a single entry per request (start = current length, num_tokens = 1); for a prefill chunk it's the chunk's whole range. The formula is the phase's one formula: slot(t) = block_table[t // block_size] * block_size + t % block_size.
block_table — read map, one entry per logical block of the whole sequence: "all prior KV for this request lives in these physical blocks, in this logical order." The attention kernel gathers through it every step.

Write one token; read them all. That asymmetry is the decode workload in a nutshell, and it's why decode is memory-bandwidth-bound: the gather touches seq_len × heads × dim × 2 values to produce one token.
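To put a number on that asymmetry, here is a back-of-the-envelope sketch. The per-layer expression is the one in the text; multiplying by the layer count, and the model shape itself (32 layers, 8 KV heads, head_dim 128, fp16), are assumptions chosen for illustration.

```python
def kv_bytes_read_per_decode_step(seq_len, num_kv_heads, head_dim,
                                  num_layers, bytes_per_value=2):
    """Bytes of K and V gathered to produce one token for one request."""
    per_layer = seq_len * num_kv_heads * head_dim * 2 * bytes_per_value
    return per_layer * num_layers

# Assumed shape, for illustration only.
total = kv_bytes_read_per_decode_step(
    seq_len=4096, num_kv_heads=8, head_dim=128, num_layers=32)
print(total / 2**20)  # 512.0 (MiB)
```

Under these assumptions a single decode step at a 4096-token context reads half a gibibyte of KV to emit one token, while writing only one token's worth.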

Files

starter.py — three functions with the recipes in their docstrings. Your work.
solution.py — reference (the gather really is one line).
test_lab.py — formula checks, round-trip, dense-equivalence, the poison-masked tail, and the two-layouts identity.

Run

LAB_IMPL=starter pytest phase-02-paged-attention/labs/lab-06-paged-attention-numpy -q
pytest phase-02-paged-attention/labs/lab-06-paged-attention-numpy -q   # reference (default)

What to implement

build_slot_mapping(block_table, block_size, start, num_tokens) — the formula, over a token range. The start parameter is not decoration: a decode step writes one token at start = seq_len, a chunked prefill writes a range starting mid-sequence — getting ranges right here is exactly what makes chunked prefill (Phase 3) compose with paging.
write_kv(...) — scatter new_k/new_v rows to slot_mapping rows. Numpy fancy indexing (cache[slots] = new) — one line each, and a quiet preview of what reshape_and_cache does in CUDA upstream.
paged_attention(q, k_cache, v_cache, block_table, seq_len, block_size) — gather seq_len rows through the table, then per head: softmax(K·q/√d)·V. Subtract the max before exp (the standard stability trick — and the seed of the online softmax you'll meet in lab-04).
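For orientation, the three functions might look roughly like this. It is a sketch under assumptions, not the lab's solution.py: the cache is taken to be a flat [num_blocks * block_size, heads, dim] array, the write_kv signature is guessed, and attend is a helper introduced only for this example.

```python
import numpy as np

def build_slot_mapping(block_table, block_size, start, num_tokens):
    """Flat cache rows for logical tokens start .. start + num_tokens - 1."""
    t = np.arange(start, start + num_tokens)
    return np.asarray(block_table)[t // block_size] * block_size + t % block_size

def write_kv(k_cache, v_cache, slots, new_k, new_v):
    """Scatter new K/V rows into the cache (assumed signature)."""
    k_cache[slots] = new_k
    v_cache[slots] = new_v

def attend(q, K, V):
    """Per-head softmax(K q / sqrt(d)) V over [tokens, heads, dim] arrays."""
    out = np.empty_like(q)
    for h in range(q.shape[0]):
        s = K[:, h] @ q[h] / np.sqrt(q.shape[1])
        w = np.exp(s - s.max())           # subtract the max before exp
        out[h] = (w / w.sum()) @ V[:, h]
    return out

def paged_attention(q, k_cache, v_cache, block_table, seq_len, block_size):
    slots = build_slot_mapping(block_table, block_size, 0, seq_len)
    return attend(q, k_cache[slots], v_cache[slots])   # gather, then attend

rng = np.random.default_rng(0)
block_size, heads, dim, seq_len = 4, 2, 8, 10
block_table = [5, 0, 3]                                # shuffled physical blocks
k_dense = rng.standard_normal((seq_len, heads, dim))
v_dense = rng.standard_normal((seq_len, heads, dim))
k_cache = np.full((6 * block_size, heads, dim), 1e6)   # poison every unused row
v_cache = np.full((6 * block_size, heads, dim), 1e6)

slots = build_slot_mapping(block_table, block_size, 0, seq_len)
write_kv(k_cache, v_cache, slots, k_dense, v_dense)

q = rng.standard_normal((heads, dim))
paged = paged_attention(q, k_cache, v_cache, block_table, seq_len, block_size)
dense = attend(q, k_dense, v_dense)
print(np.allclose(paged, dense, atol=1e-12))  # True
```

Because seq_len = 10 leaves two poisoned slots in the tail block, gathering len(block_table) * block_size rows instead of seq_len rows would make the comparison fail loudly.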

What the tests prove — including the poison trick

| Test | What it pins |
| --- | --- |
| `test_slot_mapping_formula` | The formula at the edges: block boundaries, mid-block offsets, and the single-token decode case |
| `test_write_then_gather_round_trips` | Write map and read map agree: the two tensors are consistent views of one layout |
| `test_paged_matches_dense_exactly` | The identity theorem, `atol=1e-12`, under a shuffled, non-identity block table |
| `test_partial_tail_block_is_masked` | The bug that ships: `seq_len=35` fills 2 blocks + 3 slots; the other 13 slots of the tail block are poisoned with `1e6` before the call. If your gather uses `len(block_table) * block_size` rows instead of `seq_len`, the poison detonates and the diff is enormous, by design. Real kernels' masking bugs are subtle precisely because real garbage memory is small numbers; in tests, make garbage loud. |
| `test_indirection_is_the_identity` | Same logical tokens, two different physical placements → identical output. Physical layout is unobservable from the math |

That poison-the-padding trick is worth stealing for every masked computation you ever test: don't hope the unmasked path is never read — make reading it catastrophic.

Hitchhiker's notes

Your gather is a memcpy the GPU never does. k_cache[slots] materializes a contiguous copy of K. That is fine in numpy and ruinous on a GPU: the copy has to be written out and then read back, which at least doubles memory traffic in the engine's hottest loop. The real kernel follows the indirection inside the compute, loading each block tile straight from its physical address into registers/SRAM. Same semantics, zero copies; that difference is the entire reason kernel-level paging support (lab-04) has to exist at all, rather than a gather-then-dense-kernel two-step.
Why per-head loops? Clarity. Attention is independent per head; vectorizing over heads (einsum) is a one-liner you should try after green, and it changes nothing semantically. The real kernel parallelizes over (sequence, head) pairs — your loop nest, mapped to the GPU grid.
1e-12, not 1e-2. Lab-04 tolerates 1e-2 because fp16 + a different operation order (online softmax) genuinely changes rounding. Here, same dtype (float64) and same order mean the comparison can be essentially exact. Calibrating tolerance to the reason for divergence — instead of slapping 1e-3 on everything — is a numerics habit that catches real bugs other suites wave through.
GQA fits in one index. Llama-style models have fewer KV heads than query heads; the cache's head dimension becomes num_kv_heads and several query heads share a KV head. The block table doesn't change at all: paging is orthogonal to head layout. (Try it: KV_HEADS = 2, map query head h to KV head h // (num_heads // KV_HEADS), which is h // 2 with four query heads. Ten lines.)
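The head mapping for the GQA note, as a minimal sketch. kv_head_for is a hypothetical helper, and it assumes the number of query heads is a multiple of the number of KV heads.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Index of the KV head shared by this query head under GQA."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

four_heads = [kv_head_for(h, 4, 2) for h in range(4)]
eight_heads = [kv_head_for(h, 8, 2) for h in range(8)]
print(four_heads)   # [0, 0, 1, 1]
print(eight_heads)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

The block table and slot mapping never appear in this function, which is the point: the indirection is per token, the sharing is per head.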

Going further

Batch it: extend paged_attention to take a batch of queries with a ragged set of block tables and seq_lens — now you've implemented the actual decode-batch kernel interface (compare with paged_attention_v1.cu's argument list: it's your signature, plus strides).
Chunked-prefill write path: simulate prefilling a 40-token prompt in chunks of 16 using build_slot_mapping(start=16, ...) etc., then attend. You've just verified the Phase 3 invariant (chunking changes when, never what) at the memory level.
Measure the gather tax in numpy: time k_cache[slots] vs a contiguous slice of the same size for seq_len = 64k. The scatter-gather costs real bandwidth even on CPU — now reread lab-04's note on why GPUs fold it into the kernel.

References

upstream/vllm/v1/worker/gpu_model_runner.py — search slot_mapping: where both maps are built from scheduler output, every step.
upstream/csrc/cache_kernels.cu — reshape_and_cache: your write_kv, in CUDA.
upstream/csrc/attention/paged_attention_v1.cu — your paged_attention, with the performance engineering attached.
Kwon et al., PagedAttention (SOSP 2023), §4.3 — kernel-side gather design: https://arxiv.org/abs/2309.06180
Milakov & Gimelshein, Online normalizer calculation for softmax (2018) — what your max-subtraction becomes when the row streams in blocks: https://arxiv.org/abs/1805.02867

Phase 02 — Exercises: PagedAttention

Escalating from "explain it" to "design it." Staff-level = you can do the last ones cold, and point to the exact upstream/ file that proves your answer.

Warm-up (explain)

In one sentence each, define: block, block table, block pool, ref_cnt, null block.
Why is per-request waste bounded by block_size − 1? Where does that one partial block come from?
Why does get_computed_blocks cap hits at num_tokens − 1 and not num_tokens? (Hint: deep-dive §5.)

Core (trace the code)

Trace BlockPool.touch([b]) when b.ref_cnt == 0 and b is cached. Which list operation runs, what is its complexity, and which real-world event caused this call?
Trace get_new_blocks(2) when one popped block is a cached eviction candidate. Which method clears its hash, and why must that happen before ref_cnt is set?
KVCacheManager.free frees blocks in reverse order. Construct a 2-request example where forward order would evict the shared prefix too early.

Build (extend your code)

Add copy-on-write to your lab-01 pool: fork_block(b) that, when b.ref_cnt > 1, allocates a new block, decrements b, and returns the new one. Write a test: two requests share a block, one forks, the other's view is unchanged.
Add a get_usage() sanity test: usage is 0.0 with only the null block used, and approaches 1.0 as you allocate. Why subtract 1 for the null block (block_pool.py:505)?
Make your FreeKVCacheBlockQueue track eviction order: when freeing a request's blocks tail-first, assert the head (prefix) block ends up behind the tail block in the queue.

Design (staff-level)

A customer serves one 4k-token system prompt to 1,000 users/min, each adding ~50 tokens. Estimate KV memory with and without prefix caching (pick a model from the guide). What's the multiplier prefix caching buys here, and why is it so large?
Sketch how you'd add a second block size for a hybrid model (some layers attention, some Mamba). What breaks in a single-block_size design? (Peek: kv_cache_coordinator.py, resolve_kv_cache_block_sizes at kv_cache_utils.py:571.)
The free queue is a hand-rolled linked list "to avoid allocating Python objects." Propose a benchmark that would prove a deque is slower here, and predict where the gap shows up.

Self-grading

For 4–6 and 10–12: could you whiteboard it in 5 minutes and name the file? If not, re-read the matching deep-dive section. Bring exercises 10–12 to the INTERVIEW.md drills.

Phase 02 — Interview Questions: PagedAttention

This is the topic to own in any LLM-inference interview — it's vLLM's headline idea and a favorite question. Cover each answer, attempt it out loud, then compare. Depth here is the bar for a topic you claim as your specialty.

Q1. What problem does PagedAttention solve, and how?

Model answer

The KV cache is the dominant GPU-memory consumer during serving, and pre-vLLM systems reserved a contiguous per-request buffer sized for the maximum sequence length. That caused massive internal fragmentation (reserve 2048, use 30) and external fragmentation (free memory broken into runs too small for the next contiguous request) — wasting 60–80% of KV memory.

PagedAttention borrows OS virtual memory: split the KV cache into fixed-size blocks (e.g. 16 tokens), keep a global pool, and allocate blocks on demand to each request, tracked by a per-request block table mapping logical→physical block. Blocks can be anywhere, so fragmentation drops to at most block_size − 1 tokens per request. The attention kernel reads the block table to gather scattered KV. Net result: several times more concurrent sequences per GPU.
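The waste figures in this answer are easy to verify. A small sketch using the numbers from the text (reserve 2048, use 30, block size 16); the function names are illustrative.

```python
import math

def contiguous_waste(reserved, used):
    """Unused token slots in a max-length contiguous reservation."""
    return reserved - used

def paged_waste(used, block_size):
    """Unused token slots in the last, partially filled block."""
    return math.ceil(used / block_size) * block_size - used

print(contiguous_waste(2048, 30))                         # 2018
print(paged_waste(30, 16))                                # 2
print(max(paged_waste(n, 16) for n in range(1, 2049)))    # 15
```

The last line is the block_size − 1 bound: whatever the sequence length, only the tail block can be partly empty.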

Q2. Walk me through the data structures. (Whiteboard them.)

Model answer

KVCacheBlock: metadata for one physical block — block_id, ref_cnt, block_hash, free-list pointers. (kv_cache_utils.py:116)
FreeKVCacheBlockQueue: a doubly linked list of free blocks in eviction order, with O(1) middle removal and zero per-op allocation. (kv_cache_utils.py:164)
BlockPool: owns all blocks, the free queue, and cached_block_hash_to_block (the prefix-cache index). Methods: get_new_blocks, touch, free_blocks, cache_full_blocks. (block_pool.py:130)
KVCacheManager: per-request block tables; the scheduler-facing API (get_computed_blocks, allocate_slots, free). (kv_cache_manager.py:110)

The four invariants: free queue ⟺ ref_cnt==0; block ids stable (no dedup); only full blocks hashed; cached ≠ unusable (a cached block with ref_cnt==0 still sits in the free queue and can be reallocated).

Q3. Why a custom linked list instead of `collections.deque` for the free list?

Model answer

Two reasons, both hot-path. (1) On a prefix-cache hit, a block that was a free eviction candidate must be pulled out of the middle of the free list and revived — that's O(1) in a doubly linked list but O(n) in a deque. The revival happens in BlockPool.touch (block_pool.py:402). (2) The list reuses prev/next fields stored on the blocks themselves, so manipulating it allocates no Python objects — no GC pressure in the scheduler loop that runs every token step. The upstream docstring at kv_cache_utils.py:164 states exactly this.

Q4. How does prefix caching work on top of paging, and what makes it a prefix cache?

Model answer

Each full block gets a content hash that chains the parent block's hash (hash_block_tokens, kv_cache_utils.py:541). Chaining means a hit on block k guarantees blocks 0..k were identical — so it's a true prefix, not just matching content. Hashes index into cached_block_hash_to_block. A new request computes its block hashes; the manager walks them from the front, and for each hit it touches the cached block (sharing it via ref_cnt) instead of recomputing. extra_keys (LoRA id, multimodal content, cache salt) are folded in to prevent unsafe cross-context collisions.

Q5. What happens when there aren't enough free blocks to extend a running request?

Model answer

KVCacheManager.allocate_slots returns None (kv_cache_manager.py:387). That signals OOM to the scheduler, which preempts the lowest-priority running request — frees its KV blocks and moves it back to the waiting queue to be recomputed (or, in some designs, swapped) later — then retries the allocation. This handshake (None → preempt → retry) is the seam between memory management (Phase 2) and scheduling (Phase 3). It's the safety valve that lets vLLM admit aggressively without crashing on memory.

Q6. Copy-on-write — when and why?

Model answer

When two requests share a block (e.g. a common prompt) and one of them needs to write new tokens into a position within that shared block, you can't mutate it in place without corrupting the other sharer. So you copy the block (allocate a fresh one, copy contents), point the writer at the copy, and decrement the original's ref_cnt. It's the same CoW as in OS fork(). In practice vLLM shares at block granularity and divergence usually starts a new block, so true intra-block CoW is rare, but the mechanism guarantees correctness when sharers diverge.
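
A toy version of that copy step, assuming a hypothetical `ToyPool` that stores token lists instead of KV tensors. It illustrates the ref_cnt rule only; it is not how upstream lays out memory.

```python
class ToyPool:
    """Hypothetical block pool: data[i] holds the tokens written to block i."""

    def __init__(self, num_blocks):
        self.data = [[] for _ in range(num_blocks)]
        self.ref_cnt = [0] * num_blocks
        self.free = list(range(num_blocks))

    def allocate(self):
        block_id = self.free.pop(0)
        self.ref_cnt[block_id] = 1
        return block_id

    def share(self, block_id):
        """A second request adopts an existing block."""
        self.ref_cnt[block_id] += 1
        return block_id

    def write(self, table, index, token):
        """Append a token to table[index], copying first if the block is shared."""
        block_id = table[index]
        if self.ref_cnt[block_id] > 1:
            copy_id = self.allocate()
            self.data[copy_id] = list(self.data[block_id])  # copy contents
            self.ref_cnt[block_id] -= 1                     # writer leaves the original
            table[index] = block_id = copy_id               # writer now owns the copy
        self.data[block_id].append(token)


pool = ToyPool(4)
table_a = [pool.allocate()]
pool.write(table_a, 0, 11)
pool.write(table_a, 0, 12)
table_b = [pool.share(table_a[0])]  # B shares A's block
pool.write(table_b, 0, 99)          # B diverges, which triggers the copy

print(pool.data[table_a[0]])  # [11, 12]
print(pool.data[table_b[0]])  # [11, 12, 99]
```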

Q7. (Deep) Why are blocks freed in reverse order, and why doesn't the cache de-duplicate?

Model answer

Reverse free (kv_cache_manager.py:431): freeing tail blocks first puts them ahead of the head (prefix) blocks in the eviction queue, so the shared prefix survives longest for future requests — maximizing prefix-cache hit rate.

No dedup (block_pool.py:48): de-duplicating identical blocks would sometimes mean remapping an already-allocated block_id onto the surviving copy. Block tables are append-only (a block_id must stay stable once allocated), so the cache skips dedup and the engine never has to rewrite a request's table. The cost is occasionally storing two blocks with the same content; the benefit is a simpler, race-free invariant. That tradeoff is exactly the kind of judgment call maintainers make.

Rapid-fire

Block size is typically? 16. Tradeoff of larger? Less metadata/overhead, more tail waste and coarser sharing.
What's the null block for? A placeholder for skipped positions (e.g. outside a sliding window); never cached.
Where does the block table actually get used? Passed into the attention kernel (csrc/attention/paged_attention_v1.cu), which dereferences it per token.
What sets the number of GPU blocks? The memory budget (HBM × gpu_memory_utilization) minus weights, divided by per-block bytes (kv_cache_utils.get_kv_cache_configs).

Phase 02 — Cheatsheet: PagedAttention

The one-liner

KV cache → fixed-size blocks (like OS pages) + per-request block table → no fragmentation, plus free sharing (prefix caching, CoW). Waste ≤ block_size − 1 per request.

Data structures

Thing	Job	Upstream
`KVCacheBlock`	per-block metadata (id, ref_cnt, hash, free links)	`kv_cache_utils.py:116`
`FreeKVCacheBlockQueue`	free list, eviction order, O(1) middle removal, zero-alloc	`kv_cache_utils.py:164`
`BlockPool`	owns blocks + free list + prefix-cache index	`block_pool.py:130`
`KVCacheManager`	per-request block tables; scheduler API	`kv_cache_manager.py:110`

The four invariants

in free queue ⟺ ref_cnt == 0 (not null)
block ids are append-only / stable (so: no dedup)
only full blocks get hashed + cached
cached ≠ unusable — touch revives a free cached block (O(1) middle removal)

Key methods → what they do

get_new_blocks(n) — popleft n, _maybe_evict, ref=1
touch(blocks) — re-ref a shared block; remove from free queue if it was free
free_blocks(blocks) — deref; ref→0 returns to queue (stays cached)
cache_full_blocks(...) — hash + index newly-full blocks
get_computed_blocks(req) — prefix-cache lookup, hits capped at num_tokens − 1
allocate_slots(...) — extend a request; returns None on OOM → scheduler preempts
free(req) — frees reverse order (prefix survives longest)

Hashing

hash_block_tokens(parent_hash, tokens, extra_keys) — parent-chained ⇒ prefix cache. extra_keys = LoRA id + multimodal + cache_salt (no cross-context collisions).

Numbers to know

KV bytes/token ≈ 2 × layers × kv_heads × head_dim × dtype_bytes
num GPU blocks ≈ (HBM × gpu_memory_utilization − weights) ÷ per-block bytes
block_size default ≈ 16
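
A worked instance of those two formulas. The model shape (32 layers, 8 KV heads, head_dim 128, fp16) and the 80 GiB / 16 GiB figures are hypothetical round numbers, and the estimate ignores activation memory, which the real profiler also subtracts.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    # The leading 2 is one K vector plus one V vector per layer per KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def num_gpu_blocks(hbm_bytes, gpu_memory_utilization, weight_bytes,
                   block_size, bytes_per_token):
    budget = hbm_bytes * gpu_memory_utilization - weight_bytes
    return int(budget // (block_size * bytes_per_token))


GiB = 2**30
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token
print(num_gpu_blocks(80 * GiB, 0.9, 16 * GiB, block_size=16, bytes_per_token=per_token))
```

With block_size 16, each block in this example costs 2 MiB, so every extra GiB of weights removes 512 blocks (8,192 tokens of KV) from the pool.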

Gotchas

Returning None from allocate_slots is normal — it drives preemption, not an error.
The null block (id 0) is reserved; never cache it; subtract it from usage math.
Line numbers valid only at v0.22.1 @ 0decac0; search the named symbol otherwise.

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

Phase 03 — The Hitchhiker's Guide to Continuous Batching & the Scheduler ⭐

← Phase 02 · Course home · Phase 04 →

Flagship phase — written in full. Phase 02 gave you the memory. This phase gives you the brain that decides who runs each step.

Don't Panic

The scheduler's whole job, once per token step, is to answer one question:

Given everyone who wants compute right now, and the memory I have, who runs this step and how many tokens does each get?

That's it. Everything famous about vLLM's throughput — continuous batching, chunked prefill, prefix caching, preemption — is just a good answer to that one question, computed fast, every single step. By the end of this phase you'll have written a working continuous-batching scheduler (mini_vllm/scheduler.py) and read the real 2,300-line one (upstream/vllm/v1/core/sched/scheduler.py).

Step 1: The big idea — there is no "prefill phase" or "decode phase"

This is the mental model the entire engine is built on. Read the real comment at the top of Scheduler.schedule() (scheduler.py:330):

"There's no 'decoding phase' nor 'prefill phase' in the scheduler. Each request just has num_computed_tokens and num_tokens_with_spec. … At each step, the scheduler tries to assign tokens to the requests so that each request's num_computed_tokens can catch up to its num_tokens."

So every request is just a pair of numbers racing each other:

prompt = "Tell me a joke"          (4 tokens, say)
                                    num_tokens = 4,  num_computed_tokens = 0

step: schedule 4 tokens  ──►       num_computed = 4  == num_tokens  ─► emit 1 token ("Why")
                                    num_tokens = 5,  num_computed = 4
step: schedule 1 token   ──►       num_computed = 5  == num_tokens  ─► emit 1 token ("did")
                                    num_tokens = 6,  num_computed = 5
...
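
The same race as a few lines of Python. This is a sketch: the `Request` fields mirror the names in the upstream comment, while `run_step` and its `budget` argument are hypothetical helpers, not mini_vllm's API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: list[int]
    output: list[int] = field(default_factory=list)
    num_computed_tokens: int = 0

    @property
    def num_tokens(self) -> int:
        return len(self.prompt) + len(self.output)


def run_step(req: Request, budget: int, sampled_token: int):
    """Compute up to `budget` tokens; sample only if the request fully catches up."""
    n = min(req.num_tokens - req.num_computed_tokens, budget)
    emits = req.num_computed_tokens + n == req.num_tokens
    req.num_computed_tokens += n
    if emits:
        req.output.append(sampled_token)
    return n, emits


req = Request(prompt=[101, 102, 103, 104])
trace = [run_step(req, budget=3, sampled_token=t) for t in (7, 8, 9)]
print(trace)  # [(3, False), (1, True), (1, True)]
```

With a budget of 3 the 4-token prompt needs two steps before the first token comes out; there is no branch anywhere that asks "is this prefill or decode?".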

"Prefill" is just "num_computed is far behind num_tokens." "Decode" is just "it's behind by one, add one more." One uniform rule covers both. This is why chunked prefill, prefix caching, and speculative decoding all fall out naturally instead of needing special cases. Your mini_vllm/request.py is built around exactly this num_computed_tokens vs num_tokens pair — go look.

Step 2: Static batching (the bad old way) vs continuous batching

Static batching: pick a batch of requests, run them together until they all finish, then start the next batch.

time ─►
req A (short):  [#### done...................... idle ..............]
req B (med):    [############# done............. idle ..............]
req C (long):   [############################################ done ]
                 ^ A finished here but its slot sits IDLE until C finishes.

The GPU runs at the speed of the slowest request in the batch, and finished requests waste their slot. Terrible utilization for mixed-length traffic (which is all real traffic).

Continuous batching: re-decide the batch every single step. The instant A finishes, its slot is freed and a waiting request D joins mid-flight.

time ─►
req A:  [#### done]
req B:  [#############done]
req C:  [############################################done]
req D:       [############### done]      ← D joined the moment A left
req E:              [######### done]      ← E joined when B left
                 ^ no idle slots; the GPU is always full.

This is the single biggest throughput win in modern LLM serving, and it's entirely a scheduling decision — same kernels, same model, just smarter batching. vLLM does this by default.

Step 3: The token budget and chunked prefill

If you let a brand-new 8,000-token prompt do its entire prefill in one step, every decode in flight stalls for that whole step → everyone's inter-token latency spikes. Bad.

The fix is a token budget per step: max_num_batched_tokens. The scheduler hands out at most that many tokens total each step. A long prefill gets chunked — split across several steps — so it shares each step with ongoing decodes instead of monopolizing one.

budget = 2048 tokens/step

step 1: [decode A:1][decode B:1] ... [prefill of new req: 2046 of its 8000 tokens]
step 2: [decode A:1][decode B:1] ... [prefill: next 2046 tokens]
...     long prefill drips through the budget while decodes keep flowing.

In your mini_vllm/scheduler.py, this is _clamp_new_tokens (caps each request by the remaining budget and by long_prefill_token_threshold) and the token_budget -= num_new_tokens bookkeeping. The real code is the same idea at scheduler.py:348 (token_budget = self.max_num_scheduled_tokens) and :390 (long_prefill_token_threshold).
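
A sketch of that clamp and of the drip it produces. `clamp_new_tokens` follows the rule described above; `prefill_chunks` is a hypothetical helper that assumes a fixed number of decodes are served first in every step.

```python
def clamp_new_tokens(num_new_tokens, token_budget, long_prefill_token_threshold=0):
    """Cap a request's tokens by the chunk threshold (if set) and the remaining budget."""
    if 0 < long_prefill_token_threshold < num_new_tokens:
        num_new_tokens = long_prefill_token_threshold
    return max(0, min(num_new_tokens, token_budget))


def prefill_chunks(prompt_len, max_num_batched_tokens, num_decodes):
    """Chunk sizes one long prefill receives when decodes take 1 token each first."""
    per_step = max_num_batched_tokens - num_decodes
    if per_step <= 0:
        raise ValueError("decodes already consume the whole budget")
    chunks, remaining = [], prompt_len
    while remaining > 0:
        n = clamp_new_tokens(remaining, per_step)
        chunks.append(n)
        remaining -= n
    return chunks


print(prefill_chunks(8000, 2048, num_decodes=2))  # [2046, 2046, 2046, 1862]
print(clamp_new_tokens(8000, 2048, long_prefill_token_threshold=512))  # 512
```

The 8,000-token prompt takes four steps instead of one, and in each of them decodes A and B still get their token.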

Step 4: Prefix caching — the free head start

Remember from Phase 02 that requests can share physical KV blocks. The scheduler exploits this when admitting a waiting request: before allocating, it asks the KV manager "how much of this prompt is already computed?" (get_computed_blocks). If a shared prefix is cached, those tokens are already done — the request starts with num_computed_tokens > 0 for free.

Request A ran earlier with prompt "You are a helpful assistant. <Q1>"  → its prefix blocks cached
Request B arrives with prompt   "You are a helpful assistant. <Q2>"
  scheduler: get_computed_blocks(B) → 6 blocks hit (the shared system prompt)
  B starts with num_computed_tokens = 96, only needs to prefill <Q2>.

For a shared 2k-token system prompt across thousands of users, this is enormous — it's the structural cost advantage behind multi-tenant serving. In mini_vllm, the WAITING loop calls self.kv.get_computed_blocks(request) and sets request.num_computed_tokens = num_cached. Real code: scheduler.py:591.
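
The head-start arithmetic, as a sketch. `prefix_head_start` is a hypothetical helper; it assumes hits are counted in whole blocks and that at least one token must be left to compute, so the request still has a position to sample from. The 110-token prompt length is made up to match the 6-block example above.

```python
def prefix_head_start(num_prompt_tokens, num_hit_blocks, block_size=16):
    """Return (tokens already computed, tokens left to prefill)."""
    # Never let the cache cover the whole prompt: cap hits at num_tokens - 1.
    max_hit_blocks = (num_prompt_tokens - 1) // block_size
    num_cached = min(num_hit_blocks, max_hit_blocks) * block_size
    return num_cached, num_prompt_tokens - num_cached


print(prefix_head_start(110, num_hit_blocks=6))  # (96, 14)
print(prefix_head_start(97, num_hit_blocks=6))   # (96, 1)
```

In the second call the prompt is one token longer than the cached prefix, so the request's entire prefill is a single scheduled token.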

Step 5: Preemption — the safety valve

Continuous batching admits requests aggressively to keep the GPU full. Sometimes that means a running request needs another KV block and there are none left. What then?

The scheduler preempts: it evicts a running request (frees its KV blocks back to the pool), puts it back on the waiting queue, and gives its memory to someone who can make progress now. The preempted request will be recomputed later when memory frees up: its prompt + generated tokens are replayed as a prefill, which processes them in parallel rather than one token per step, so the replay is cheaper than it sounds.

running: [A][B][C]   free blocks: 0
C needs 1 more block → allocate_slots(C) returns None (OOM, Phase 02!)
  → preempt the most-recently-added running request (say C, or the lowest priority)
  → free its blocks, push it back to WAITING, retry

This is the None → preempt → retry handshake from Phase 02, seen from the scheduler's side. In mini_vllm/scheduler.py it's the while True: loop around allocate_slots that pops a victim from self.running and calls _preempt. Real code: scheduler.py:443–491.

Step 6: The schedule, in order

Putting it together, here's the shape of schedule() (yours and theirs):

token_budget = max_num_batched_tokens

# 1) RUNNING first — keep decodes flowing
for request in running:
    n = clamp(request.num_tokens - request.num_computed_tokens, budget, prefill_threshold)
    blocks = allocate_slots(request, n)
    while blocks is None:              # OOM
        preempt(running.pop()); retry
    schedule it; budget -= n

# 2) WAITING next — admit new work if budget + memory + seq-slots remain
while waiting and budget > 0 and len(running) < max_num_seqs and not preempted_this_step:
    request = waiting[0]
    computed_blocks, num_cached = get_computed_blocks(request)   # prefix cache
    request.num_computed_tokens = num_cached
    n = clamp(request.num_tokens - num_cached, budget, prefill_threshold)
    blocks = allocate_slots(request, n, computed_blocks)
    if blocks is None: break          # no memory to admit anyone else
    move request waiting → running; budget -= n

Two subtleties worth noting now (and seeing in the real code):

Running before waiting: progress for in-flight requests beats starting new ones (latency).
No admit after a preemption this step: if we just had to preempt, the system is under memory pressure — don't pour more in. (mini_vllm: and not out.preempted_req_ids; real: scheduler.py:545 if not preempted_reqs ...).
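
For readers who want to execute the shape above, here is a runnable toy. Everything in it is a simplification under stated assumptions: `ToyKV` only counts blocks (a hypothetical stand-in for the KV manager, returning False where the real one returns None), prefix caching and the chunk threshold are left out, and only the FCFS victim rule is modeled.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass

BLOCK_SIZE = 4


@dataclass
class Req:
    rid: str
    num_tokens: int               # prompt + tokens generated so far
    num_computed_tokens: int = 0
    num_blocks: int = 0           # KV blocks currently held


class ToyKV:
    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = num_blocks

    def allocate_slots(self, req: Req, n: int) -> bool:
        total = (req.num_computed_tokens + n + BLOCK_SIZE - 1) // BLOCK_SIZE
        needed = total - req.num_blocks
        if needed > self.free_blocks:
            return False          # plays the role of returning None
        self.free_blocks -= needed
        req.num_blocks += needed
        return True

    def free(self, req: Req) -> None:
        self.free_blocks += req.num_blocks
        req.num_blocks = 0


def schedule(running: list[Req], waiting: deque[Req], kv: ToyKV,
             max_num_batched_tokens: int, max_num_seqs: int):
    budget = max_num_batched_tokens
    scheduled: dict[str, int] = {}
    preempted: list[str] = []

    i = 0                                    # 1) RUNNING first
    while i < len(running) and budget > 0:
        req = running[i]
        n = min(req.num_tokens - req.num_computed_tokens, budget)
        ok = kv.allocate_slots(req, n)
        while not ok:
            victim = running.pop()           # FCFS: most recently admitted
            kv.free(victim)
            victim.num_computed_tokens = 0   # full recompute on re-admit
            waiting.appendleft(victim)
            preempted.append(victim.rid)
            if victim is req:
                break                        # it preempted itself
            ok = kv.allocate_slots(req, n)
        if not ok:
            break
        scheduled[req.rid] = n
        budget -= n
        i += 1

    # 2) WAITING next, unless this step already had to preempt
    while waiting and budget > 0 and len(running) < max_num_seqs and not preempted:
        req = waiting[0]
        n = min(req.num_tokens - req.num_computed_tokens, budget)
        if not kv.allocate_slots(req, n):
            break
        running.append(waiting.popleft())
        scheduled[req.rid] = n
        budget -= n
    return scheduled, preempted


kv = ToyKV(num_blocks=4)
a, b = Req("A", 8), Req("B", 8)
running, waiting = [], deque([a, b])
print(schedule(running, waiting, kv, 16, 4))  # ({'A': 8, 'B': 8}, [])
for r in (a, b):                              # pretend the model ran and sampled
    r.num_computed_tokens, r.num_tokens = 8, 9
print(schedule(running, waiting, kv, 16, 4))  # ({'A': 1}, ['B'])
```

In the second step the pool is full, A needs a third block, B is evicted to provide it, and the admission loop is skipped because a preemption happened.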

The invariants to memorize

A request is always either in waiting or running (never both, never neither while unfinished).
sum(num_scheduled_tokens) ≤ max_num_batched_tokens every step (the budget holds).
len(running) ≤ max_num_seqs every step.
A scheduled request emits a token this step iff num_computed_tokens + num_scheduled == num_tokens (prefill fully caught up). Mid-prefill chunks emit nothing.
Preemption frees KV and resets num_computed_tokens = 0 (full recompute on re-admit).

What you'll do in this phase

Read: 01-deep-dive.md walks the real schedule() line by line.
Build: 02-mini-build.md — write the scheduler (reference: mini_vllm/scheduler.py).
Labs (see labs/README.md; recommended order 01 → 02 → 05 → 04 → 06 → 03):
- lab-01-scheduler-step [CPU-OK] — implement the budget + running/waiting loop; pass the tests.
- lab-02-chunked-prefill [CPU-OK] — prove chunking changes timing, not output, and predict step counts.
- lab-03-prefix-cache-hitrate [GPU-OPT] — measure real prefix-cache hit rate and its memory effect.
- lab-04-preemption [CPU-OK] — force a preemption and prove the request still completes correctly.
- lab-05-decode-latency-spikes [CPU-OK] — measure the ITL spike a long prefill inflicts on a decode stream ([257, 2, 1, ...]) and how the chunk threshold caps it ([33,×8, ...]).
- lab-06-prefix-cache-savings [CPU-OK] — account for prefix caching to the exact token (544 vs 96 scheduled tokens; savings ≡ followers × shared full blocks), outputs identical.
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.

When you can whiteboard schedule() and explain the budget, chunking, prefix head-start, and preemption handshake from memory, you understand the component that defines vLLM's throughput.

← Phase 02 · Course home · Phase 04 →

Phase 03 — Deep Dive: the real vLLM Scheduler

Paths relative to upstream/ at v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). The scheduler is vllm/v1/core/sched/scheduler.py (~2,300 lines). We read the parts that matter; the rest is connectors, encoders, spec-decode glue, and stats — return to those after Phases 8, 13, 15.

Supporting files:
vllm/v1/core/sched/
  scheduler.py       Scheduler.schedule() / update_from_output()   (the brain)
  output.py          SchedulerOutput, NewRequestData, CachedRequestData  (the wire format)
  request_queue.py   FCFS vs PRIORITY queues                       (ordering policy)
  interface.py       SchedulerInterface                            (the contract)
vllm/v1/request.py   Request, RequestStatus                        (the unit of work)

1. The unit of work: `Request` and its states

vllm/v1/request.py:315, RequestStatus:

class RequestStatus(enum.IntEnum):
    WAITING = enum.auto()
    WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR = enum.auto()
    WAITING_FOR_REMOTE_KVS = enum.auto()
    WAITING_FOR_STREAMING_REQ = enum.auto()
    RUNNING = enum.auto()
    PREEMPTED = enum.auto()
    # Note: anything after PREEMPTED will be considered as a finished status.
    FINISHED_STOPPED = enum.auto()
    FINISHED_LENGTH_CAPPED = enum.auto()
    FINISHED_ABORTED = enum.auto()
    ...

Two things to internalize:

The extra WAITING_FOR_* states exist because a request can be not ready for reasons beyond "queued": waiting on a grammar to compile (Phase 12), on remote KV to arrive (Phase 15), etc. Your mini_vllm.RequestStatus keeps just WAITING/RUNNING/PREEMPTED/FINISHED_* — the essential skeleton.
The ordering trick: is_finished is simply status > PREEMPTED (line 337). Enum order is the logic. mini_vllm copies this (is_finished = status >= FINISHED_STOPPED).
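
The ordering trick in isolation, using the trimmed-down set of states. This is an illustrative sketch of the pattern, not a copy of either implementation.

```python
import enum


class RequestStatus(enum.IntEnum):
    WAITING = enum.auto()
    RUNNING = enum.auto()
    PREEMPTED = enum.auto()
    # Anything declared after PREEMPTED counts as finished.
    FINISHED_STOPPED = enum.auto()
    FINISHED_LENGTH_CAPPED = enum.auto()
    FINISHED_ABORTED = enum.auto()

    @staticmethod
    def is_finished(status: "RequestStatus") -> bool:
        # IntEnum members compare as integers, so declaration order is the logic.
        return status > RequestStatus.PREEMPTED


print(RequestStatus.is_finished(RequestStatus.PREEMPTED))         # False
print(RequestStatus.is_finished(RequestStatus.FINISHED_ABORTED))  # True
```

The price of the trick is that inserting a new non-finished state after PREEMPTED silently breaks `is_finished`, which is why the upstream enum carries that warning comment.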

The master variables on Request: num_computed_tokens vs num_tokens (and num_tokens_with_spec for speculative decoding). Everything in schedule() manipulates these.

2. `schedule()` — the whole algorithm

vllm/v1/core/sched/scheduler.py:329. The defining comment (lines 330–339) — read it; it's the mental model from the guide, verbatim from the maintainers.

Setup (lines 341–362)

scheduled_new_reqs, scheduled_resumed_reqs = [], []
scheduled_running_reqs, preempted_reqs = [], []
req_to_new_blocks: dict[str, KVCacheBlocks] = {}
num_scheduled_tokens: dict[str, int] = {}
token_budget = self.max_num_scheduled_tokens          # <- the per-step token budget
...
self.kv_cache_manager.new_step_starts()

token_budget is max_num_scheduled_tokens (derived from max_num_batched_tokens). This is the global cap that makes chunked prefill work. mini_vllm: token_budget = self.max_num_batched_tokens.

Phase A — schedule RUNNING requests (lines 364–533)

req_index = 0
while req_index < len(self.running) and token_budget > 0:
    request = self.running[req_index]
    ...
    num_new_tokens = (
        request.num_tokens_with_spec
        + request.num_output_placeholders
        - request.num_computed_tokens
    )
    if 0 < self.scheduler_config.long_prefill_token_threshold < num_new_tokens:
        num_new_tokens = self.scheduler_config.long_prefill_token_threshold   # chunk long prefills
    num_new_tokens = min(num_new_tokens, token_budget)                        # respect the budget
    num_new_tokens = min(num_new_tokens, self.max_model_len - 1 - request.num_computed_tokens)

num_new_tokens = how far this request is behind, clamped by (a) the long-prefill chunk threshold, (b) the remaining token budget, and (c) the model length. This four-line clamp is exactly your mini_vllm.Scheduler._clamp_new_tokens (minus spec/placeholder terms). Note that num_tokens_with_spec includes draft tokens: that's how speculative decoding (Phase 8) rides the same scheduler with no special case, just as the top comment promised.

The preemption loop (lines 442–491) — the heart

with record_function_or_nullcontext("schedule: allocate_slots"):
    while True:
        new_blocks = self.kv_cache_manager.allocate_slots(
            request, num_new_tokens, num_lookahead_tokens=self.num_lookahead_tokens,
        )
        if new_blocks is not None:
            break                                  # got memory; schedule it

        # The request cannot be scheduled. Preempt the lowest-priority request.
        if self.policy == SchedulingPolicy.PRIORITY:
            preempted_req = max(self.running, key=lambda r: (r.priority, r.arrival_time))
            self.running.remove(preempted_req)
            ...
        else:
            preempted_req = self.running.pop()     # FCFS: preempt the most-recent

        self._preempt_request(preempted_req, scheduled_timestamp)
        preempted_reqs.append(preempted_req)
        if preempted_req == request:
            break                                  # nothing left to preempt; give up this req

if new_blocks is None:
    break

This is the None → preempt → retry handshake with the KV manager (Phase 02 §5). Under FCFS it preempts self.running.pop() — the most recently admitted, i.e. lowest priority by arrival. Under PRIORITY it preempts the worst (priority, arrival_time). mini_vllm implements the FCFS branch (self.running.pop() + _preempt) — the PRIORITY branch is a great extension exercise.
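
Both victim rules side by side, as a starting point for the extension exercise. `Req` and `pick_victim` are hypothetical names, and the sketch assumes a smaller `priority` value means a more important request, which is what the `max(...)` in the excerpt implies.

```python
from dataclasses import dataclass


@dataclass
class Req:
    rid: str
    priority: int        # assumption: smaller value = more important
    arrival_time: float


def pick_victim(running, policy):
    """Remove and return the request to preempt."""
    if policy == "priority":
        # Worst priority first; among equals, the latest arrival loses.
        victim = max(running, key=lambda r: (r.priority, r.arrival_time))
        running.remove(victim)
        return victim
    return running.pop()  # FCFS: the most recently admitted request


def make_running():
    return [Req("a", 0, 1.0), Req("b", 5, 2.0), Req("c", 5, 3.0), Req("d", 1, 4.0)]


print(pick_victim(make_running(), "priority").rid)  # c
print(pick_victim(make_running(), "fcfs").rid)      # d
```

Under PRIORITY the victim can sit anywhere in `running`, including before the request currently being scheduled, which is part of what the elided lines in the upstream branch have to handle.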

_preempt_request (line 929) frees the KV and resets the request to be recomputed. Compare mini_vllm.Scheduler._preempt: frees KV, num_computed_tokens = 0, status PREEMPTED, back to the front of waiting.

Commit the scheduled running request (lines 493–533)

scheduled_running_reqs.append(request)
req_to_new_blocks[request_id] = new_blocks
num_scheduled_tokens[request_id] = num_new_tokens
token_budget -= num_new_tokens                     # <- budget bookkeeping
req_index += 1
# ... spec-decode + encoder bookkeeping ...

Phase B — admit WAITING requests (lines 544–...)

if not preempted_reqs and self._pause_state == PauseState.UNPAUSED:
    while (self.waiting or self.skipped_waiting) and token_budget > 0:
        if len(self.running) == self.max_num_running_reqs:
            break
        ...
        request = request_queue.peek_request()
        ...
        # Get already-cached tokens.
        if request.num_computed_tokens == 0:
            new_computed_blocks, num_new_local_computed_tokens = (
                self.kv_cache_manager.get_computed_blocks(request)      # <- prefix caching!
            )
            ...

Three gates before admitting anyone (mirrored in mini_vllm):

if not preempted_reqs — don't admit new work in a step where we had to preempt (memory pressure). (mini_vllm: and not out.preempted_req_ids.)
token_budget > 0 — budget left.
len(self.running) == self.max_num_running_reqs: break — the seq-slot cap (max_num_seqs).

Then get_computed_blocks(request) is the prefix-cache head start (Phase 02 §5, guide §4): the request adopts the cached prefix and only prefills the remainder. The LoRA constraint just above it (lines 573–584, elided in the excerpt) caps distinct adapters per step (max_loras, Phase 11), which is another feature riding the scheduler.

3. The output: `SchedulerOutput`

vllm/v1/core/sched/output.py:181. What the scheduler hands the executor:

@dataclass
class SchedulerOutput:
    scheduled_new_reqs: list[NewRequestData]          # first-time-scheduled (full payload)
    scheduled_cached_reqs: CachedRequestData          # already-running (just deltas)
    num_scheduled_tokens: dict[str, int]              # req_id -> tokens this step
    total_num_scheduled_tokens: int
    scheduled_spec_decode_tokens: dict[str, list[int]]
    scheduled_encoder_inputs: dict[str, list[int]]
    num_common_prefix_blocks: list[int]
    finished_req_ids: set[str]
    ...

The split between NewRequestData (line 31 — full prompt, block_ids, sampling params) and CachedRequestData (line 112 — just new tokens + new block ids) is a real optimization: for a request already running, you don't resend the prompt every step, only the delta. mini_vllm simplifies this to one num_scheduled_tokens dict + the request objects, but the idea — send new requests in full, running requests as deltas — is worth knowing.

4. The other half: `update_from_output`

vllm/v1/core/sched/scheduler.py:1283. After the model runs and the sampler produces tokens, the scheduler ingests the results: append sampled tokens, advance num_computed_tokens, detect finished requests, free their KV, handle spec-decode acceptance/rejection, emit stats. Your mini_vllm.Scheduler.update_from_output is the skeleton: num_computed_tokens += n; if a token was sampled, append it and check stop conditions; reap finished requests (free KV, drop from running).

The condition for "did this request emit a token this step" in mini_vllm is needs_sample = (num_computed_tokens + num_scheduled == num_tokens) — only fully-caught-up (prefill-complete) requests sample. The real engine encodes the same thing through the model runner's logits-indices selection; the principle is identical (you only sample at the last position of a request that has no more prompt to ingest).

5. Putting Phases 02 + 03 together

The clean separation you should now see:

Scheduler (policy: who runs, how many tokens)  ──calls──►  KVCacheManager (truth: is there memory?)
        ▲                                                          │
        └──────────────  None  ◄── allocate_slots ◄───────────────┘   (OOM signal)
        │
        └─ responds: preempt a running request, free its KV, retry

The scheduler never touches blocks directly; the KV manager never decides policy. That clean seam is why each file stays readable despite the engine's complexity — and it's a design lesson worth stealing for your own systems.

Reading checklist

Write one sentence each in your notebook:

The top comment of schedule() — restate the "no prefill/decode phase" idea in your words.
The 4-line num_new_tokens clamp — what are the three caps and why each?
The while True preemption loop — what does allocate_slots returning None trigger?
FCFS vs PRIORITY preemption victim selection — who gets preempted in each?
The three gates before admitting WAITING requests.
get_computed_blocks in Phase B — how does prefix caching give a free head start?
NewRequestData vs CachedRequestData — why send deltas for running requests?

Now build it: 02-mini-build.md, then the labs.

Phase 03 — Mini-Build: the continuous-batching scheduler

You'll build the scheduler that drives mini_vllm. The reference is already in the repo — mini_vllm/scheduler.py — but write it yourself first against lab-01's stub + tests, then diff.

This phase's mini-build depends on Phase 02's KV manager (mini_vllm/kv_cache.py), because the scheduler's whole interaction with memory is allocate_slots / get_computed_blocks / free. That dependency is the point: scheduling and paging are two halves of one machine.

The build, in order

1. `SchedulerOutput`

A small dataclass: num_scheduled_tokens: dict[str,int], scheduled_requests: list[Request], preempted_req_ids: list[str], and a total_num_scheduled_tokens property. (Real: output.py:181, much richer.)

2. `Scheduler.__init__`

Hold the KVCacheManager, max_num_seqs, max_num_batched_tokens, long_prefill_token_threshold, and two queues: waiting: deque[Request], running: list[Request].

3. `_clamp_new_tokens(num_new_tokens, token_budget)`

Apply the long-prefill chunk cap (if 0 < threshold < num_new) then min(num_new, budget), floored at 0. This single helper is chunked prefill. (Real: scheduler.py:390–392.)

4. `schedule()` — the two-phase loop

Phase A (running): for each running request, n = _clamp_new_tokens(num_tokens − num_computed, budget); allocate_slots; on None, pop a victim from running, _preempt it, record it, retry; commit and budget -= n.
Phase B (waiting): while waiting and budget>0 and len(running)<max_num_seqs and not preempted: peek the front request; get_computed_blocks → set num_computed_tokens; n = _clamp_new_tokens(...); allocate_slots(req, n, computed_blocks); on None break; else move waiting→running, commit.

5. `_preempt(request)`

kv.free(request); num_computed_tokens = 0; num_preemptions += 1; status PREEMPTED; waiting.appendleft(request) (re-admit ASAP).

6. `update_from_output(output, sampled)`

For each scheduled request: num_computed_tokens += n; if it sampled, append the token and maybe_finish(). Reap finished requests: remove from running, kv.free.

7. `needs_sample(request, num_scheduled)` (static)

return request.num_computed_tokens + num_scheduled == request.num_tokens. Only fully-caught-up requests emit a token (mid-prefill chunks don't).

Definition of done

pytest mini_vllm/test_scheduler.py -q          # the reference suite (token budget, chunking,
                                               # preemption, prefix head-start)
pytest phase-03-continuous-batching-scheduler/labs -q

Then run the full engine and confirm the scheduler's correctness invariants hold end to end:

pytest mini_vllm/test_engine.py -q
# test_chunked_prefill_matches_unchunked_output and
# test_prefix_caching_matches_no_caching_output PROVE that these optimizations change
# *timing/memory*, never *output*. That property is the whole game.

Stretch (sets up later phases)

PRIORITY policy — add a priority field to Request and a PRIORITY branch to the preemption victim selection (max by (priority, arrival_time)), mirroring scheduler.py:456.
Swapping vs recompute — instead of resetting num_computed_tokens=0 on preempt, "swap" the blocks to a CPU pool and restore them on re-admit. Compare the cost model (recompute is compute, swap is bandwidth) — a real vLLM design axis.
Stats — count preemptions, average batch size, and KV usage per step; you'll need these in Phase 18.

Phase 03 Labs — Continuous Batching & the Scheduler

Six labs around the engine's brain. The arc: build the scheduling loop (lab-01), prove chunked prefill safe (lab-02), measure why it exists (lab-05), survive memory pressure with preemption (lab-04), then account for prefix caching exactly (lab-06) and on real hardware (lab-03).

Recommended order: 01 → 02 → 05 → 04 → 06 → 03. Directory numbers predate labs 05–06, so the reading order differs from the numbering: mechanism, then safety, then motive, then the emergency path, then the cache economics. CPU labs follow the standard contract: starter.py (your work), solution.py (reference), test_lab.py (the spec). The default run uses the solution; LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-03-continuous-batching-scheduler/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-03-continuous-batching-scheduler/labs/lab-01-scheduler-step -q

Labs

lab-01-scheduler-step `[CPU-OK]`

Implement the two-phase loop at the heart of continuous batching: serve RUNNING first (decode-first is a policy, and iteration order is the policy), then admit WAITING — all under three independent scarcities (token budget, sequence slots, KV memory) enforced at three different points. Your 30 lines are, shape for shape, the core of scheduler.py:329 upstream. Skills: budget/slot/memory enforcement; running-first; head-of-line blocking as a fairness choice; one code path for prefill and decode.

lab-02-chunked-prefill `[CPU-OK]`

Prove the engine's most important safety property — chunking changes when tokens are computed, never what tokens come out — by running the same deterministic workload under both schedules and diffing token ids. Plus the timing side: predict prefill steps with ceil(prompt/chunk) and know every boundary case. Skills: the causality + sampling-guard argument; output-invariance as a CI-enforceable equality; the chunk-size trade-off.

lab-03-prefix-cache-hitrate `[GPU-OPT]`

Run the real engine on the canonical workload (long shared system prompt, unique tails) with prefix caching off and on, and read three independent meters that must agree: hit rate (0% → 93.7%), prompt throughput (4–5×), KV usage (~1× the prefix). Annotated capture included for the GPU-less; lab-06 is the exact-arithmetic CPU twin. Skills: constructing sharing-known workloads; reading hit-rate denominators; when caching buys nothing.

lab-04-preemption `[CPU-OK]`

Force the scheduler's emergency path in a pool where two requests cannot both fit: watch it evict the most-recent admission, let the survivor finish, then replay the victim — and prove the final outputs identical to a roomy pool's. Recovery is just prefill: the two-counters model makes eviction, chunking, and cache hits one code path. Skills: the allocate-or-preempt dance; victim policy as forward-progress argument; the deadlock invariant; pairing "survives Y" tests with "Y actually happened" probes.

lab-05-decode-latency-spikes `[CPU-OK]`

The motive for chunked prefill, measured: a decode stream's per-step cost profile when a 256-token prompt lands is [257, 2, 1, 1, ...] unchunked vs [33 ×8, 2, 1, ...] (eight steps of 33) at threshold 32. Same total work, radically different tail latency. Nothing is free: the spike spreads into the long prompt's TTFT. Skills: per-victim latency measurement; p99 vs mean; the threshold/budget dial; why aggregate meters hide interference.
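
The prefill part of that profile can be predicted before running anything. A sketch, assuming step cost is measured in scheduled tokens and each decode stream contributes one token per step; `step_costs` is a hypothetical helper, not the lab's API.

```python
def step_costs(prompt_len, chunk_threshold, num_decodes=1):
    """Scheduled tokens per step while one new prompt is being prefilled."""
    costs, remaining = [], prompt_len
    while remaining > 0:
        # A threshold of 0 means "no chunking": take the whole prompt at once.
        n = remaining if chunk_threshold <= 0 else min(remaining, chunk_threshold)
        costs.append(n + num_decodes)
        remaining -= n
    return costs


unchunked = step_costs(256, chunk_threshold=0)
chunked = step_costs(256, chunk_threshold=32)
print(unchunked)                     # [257]
print(chunked)                       # [33, 33, 33, 33, 33, 33, 33, 33]
print(max(unchunked), max(chunked))  # 257 33
```

The decode stream's worst step shrinks from 257 tokens of work to 33, while the new prompt now waits eight steps for its first token instead of one.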

lab-06-prefix-cache-savings `[CPU-OK]`

Account for prefix caching to the exact token: 544 scheduled tokens uncached vs 96 cached, savings ≡ (N−1) × tokens in the shared full blocks = 448, outputs bit-identical, and a share-nothing control arm that saves almost nothing. Includes the one-token prefill that immediately samples: three phases of rules colliding in a single scheduled token. Skills: the compute odometer; predicting cache value with integer arithmetic; eager caching at allocation time; validating noisy GPU meters against an exact model.
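
The integer arithmetic looks like this. The workload below (8 requests, 68-token prompts, a 64-token shared prefix, block size 16) is a hypothetical one chosen because it reproduces the totals quoted above; the lab's actual parameters may differ. The sketch counts prompt tokens only and assumes the first request's blocks are cached before any follower is admitted.

```python
def scheduled_prompt_tokens(num_requests, prompt_len, shared_len, block_size, cached):
    """Prompt tokens the scheduler must compute across all requests."""
    if not cached:
        return num_requests * prompt_len
    shared_full = (shared_len // block_size) * block_size  # only full blocks are cached
    leader = prompt_len                                    # the first request computes everything
    followers = (num_requests - 1) * (prompt_len - shared_full)
    return leader + followers


workload = dict(num_requests=8, prompt_len=68, shared_len=64, block_size=16)
uncached = scheduled_prompt_tokens(**workload, cached=False)
cached = scheduled_prompt_tokens(**workload, cached=True)
print(uncached, cached, uncached - cached)  # 544 96 448
```

Here the savings are 7 followers × 64 tokens in shared full blocks. Shrink the shared prefix to 15 tokens and no block is full, so the savings drop to zero.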

What you can do after this phase

Implement and modify vLLM's scheduling policy with the confidence of someone who has built the loop, proven its invariants, and measured its trade-offs: explain why chunked prefill is default-on (and what threshold to set, from data); predict prefix-cache savings for any workload before enabling it; diagnose a preemption storm from the metrics and name the right knob; and read vllm/v1/core/sched/scheduler.py end to end as a peer. Combined with Phase 2, you now hold the complete control plane — Phase 4 descends into the kernels it commands.

Lab 03-01 — Implement the Scheduler Step `[CPU-OK]`

The scheduler is the brain of the engine — the component that decides, every single step, who computes and how much. In this lab you implement its core: the two-phase loop (serve the RUNNING, then admit the WAITING) under a global token budget, a sequence-slot cap, and a per-request chunk limit. It is maybe 30 lines of code. Those 30 lines are the difference between a GPU that hums at 90% utilization and one that stutters between overload and idleness — and they are, shape for shape, the same 30 lines at the heart of upstream/vllm/v1/core/sched/scheduler.py.

Why this lab exists

In Phase 1 lab-04 you observed the scheduler's decisions as a trace of per-step batch dicts. Reverse the arrow: now you are the one producing those dicts. Everything you watched — chunking to the budget, deferred admission, mixed prefill+decode batches — must now fall out of code you write. This is the course's central loop made flesh: observe a mechanism, then build it, and the understanding compounds.

It's also the file you will touch most as a contributor. Scheduling policy is where vLLM evolves fastest — priority scheduling, fairness, SLA-aware admission, disaggregated prefill (Phase 15) are all edits to this loop. The deep-dive walks you through the real Scheduler.schedule; this lab makes sure that when you read it, you're recognizing, not learning.

Background: the three scarce resources

Every scheduling decision is a negotiation between three independent scarcities, and the loop checks all three — know which line enforces which:

max_num_batched_tokens (the token budget, default 2048–8192 upstream) — caps total tokens computed per step. This is a latency control: step wall-clock time grows with scheduled tokens, so the budget is, almost literally, your inter-token latency dial (lab-05 measures this). The budget is global per step — one pool shared by everyone scheduled.
max_num_seqs (the slot cap) — caps how many requests can be RUNNING at once. This bounds per-step fixed overheads and runner state (and, on real hardware, things like CUDA-graph batch-size buckets — Phase 5). It is checked only at admission: an already-running request never re-competes for its slot.
KV memory (via kv.allocate(...)) — the hard wall from Phase 2. Unlike the other two, this one can refuse mid-flight (a decode needs one more block and the pool is empty); handling that refusal is preemption, deliberately deferred to lab-04. In this lab, allocation failure during admission simply stops admitting.

Three resources, three different enforcement points. Most scheduler bugs are one resource checked at the wrong point — e.g. counting seqs in the budget loop, or letting an admission overdraw the budget "just this once."

Why running-first is not arbitrary

The loop's order — RUNNING phase, then WAITING phase — encodes a policy with a name: decode-first. The requests already running have users watching tokens stream; a stalled decode is a frozen cursor in somebody's chat window. The waiting requests haven't received anything yet; making them wait one more step costs queueing delay but breaks no stream. So the scheduler protects in-flight experience first and spends whatever budget remains on admissions.

The inverse policy (admit-first) would maximize... nothing useful: it buys marginally earlier admissions at the price of visible jitter in every running stream. But note the deeper principle, because it generalizes: the loop's iteration order IS the priority policy. Upstream's priority scheduling and preemption-victim selection are both, at bottom, careful answers to "in what order do we iterate, and from which end do we take?"

Files

starter.py — clamp(...) and schedule_step(...) stubbed, with the full recipe in comments. Ships a tiny self-contained Req and FakeKV (a slot-counting memory model) so the lab isolates pure scheduling logic. Your work.
solution.py — reference (mirrors mini_vllm/scheduler.py, minus preemption).
test_lab.py — budget cap, slot cap, chunking, running-first ordering, and memory-stops-admission.

Run

LAB_IMPL=starter pytest phase-03-continuous-batching-scheduler/labs/lab-01-scheduler-step -q
pytest phase-03-continuous-batching-scheduler/labs/lab-01-scheduler-step -q   # reference

What `schedule_step` must do

budget = max_num_batched_tokens.
RUNNING phase: for each running req in order: n = clamp(req.num_tokens − req.num_computed, budget, threshold); skip if n == 0; kv.allocate(req, n) (assume it succeeds for running reqs here — preemption is lab-04); commit scheduled[rid] = n, budget -= n.
WAITING phase: while there are waiters AND budget > 0 AND len(running) < max_num_seqs: take the front waiter (FCFS — order is policy!), clamp the same way, try to allocate; on failure break (if the front request can't fit, don't go shopping deeper in the queue — see the head-of-line note below); on success, move it waiting → running and commit.
Return {rid: n}.

And clamp(num_new, budget, threshold) is the whole chunking mechanism in one line: cap by the per-request threshold (if 0 < threshold < num_new), then by the remaining budget, floored at 0. Notice what isn't here: no "prefill mode," no "decode mode." A decode is just a request whose num_tokens − num_computed == 1. The two-counters model from Phase 1 means one code path schedules both — that unification is the deep design, and it's why this loop stays 30 lines while doing what took Orca a paper to describe.
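Here is one way the recipe and the clamp could fit together, as a minimal sketch rather than the lab's reference. The Req and FakeKV below are stand-ins invented for this example (the starter ships its own, and their fields may differ), and a boolean return from kv.allocate is an assumption.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Req:                        # illustrative stand-in, not the starter's class
    rid: str
    num_tokens: int
    num_computed: int = 0


class FakeKV:                     # one slot per token, no blocks, no hashes
    def __init__(self, slots: int) -> None:
        self.free = slots

    def allocate(self, req: Req, n: int) -> bool:
        if n > self.free:
            return False          # refusal: the pool cannot hold n more tokens
        self.free -= n
        return True


def clamp(num_new: int, budget: int, threshold: int) -> int:
    if 0 < threshold < num_new:   # per-request chunk limit (0 means disabled)
        num_new = threshold
    return max(0, min(num_new, budget))


def schedule_step(running, waiting, kv, max_num_batched_tokens,
                  max_num_seqs, threshold=0):
    budget = max_num_batched_tokens
    scheduled = {}
    # RUNNING phase: in-flight requests are served first.
    for req in running:
        n = clamp(req.num_tokens - req.num_computed, budget, threshold)
        if n == 0:
            continue
        kv.allocate(req, n)       # assumed to succeed here (preemption is lab-04)
        scheduled[req.rid] = n
        budget -= n
    # WAITING phase: admit from the front while all three resources allow it.
    while waiting and budget > 0 and len(running) < max_num_seqs:
        req = waiting[0]
        n = clamp(req.num_tokens - req.num_computed, budget, threshold)
        if not kv.allocate(req, n):
            break                 # head-of-line blocking: do not look deeper
        waiting.popleft()
        running.append(req)
        scheduled[req.rid] = n
        budget -= n
    return scheduled


waiting = deque(Req(rid, 8) for rid in "abc")
print(schedule_step([], waiting, FakeKV(100),
                    max_num_batched_tokens=10, max_num_seqs=8))
# {'a': 8, 'b': 2}
```

The demo is the 8 + 2 + 0 case: the budget binds on the second prompt and the third never gets a turn. Note that each resource is checked in exactly one place: the budget inside clamp and the loop condition, the slot cap only in the loop condition, and memory only at allocate.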

What the tests prove — a guided tour

test_clamp_chunks_and_budgets — the clamp's three regimes (budget-bound, threshold-bound, neither). Get this right first; everything else composes it.
test_budget_caps_total_tokens: three 8-token prompts under a 10-token budget schedule exactly 10 tokens, split as 8 + 2 (the second request's prefill is chunked mid-prompt) + 0 (the third isn't admitted). One assertion, three behaviors.
test_max_num_seqs_caps_running — ten tiny requests, slots for four: exactly four admitted, despite infinite budget and memory. Each scarcity binds independently.
test_chunked_prefill_caps_per_request — a 100-token prompt with threshold=16 schedules 16, not 100, even with budget to burn. The threshold protects other requests' latency from this request's prompt (lab-05 quantifies exactly how much).
test_running_scheduled_before_waiting_admitted — the decode-first policy: the running decode gets its 1 token first; the eager 20-token waiter gets what's left, chunked. Order of phases = priority.
test_admission_stops_when_memory_exhausted — A fills the pool; B stays WAITING. No crash, no partial admission: capacity exhaustion is a normal scheduling outcome, not an error path. (The engine-level consequence — B admitted later when A finishes — is Phase 1 lab-04's trace; the violent version is lab-04's preemption.)

How this maps to the real engine

Open upstream/vllm/v1/core/sched/scheduler.py:329 after you're green. The skeleton is yours; production adds, in roughly descending order of weight: preemption inside the RUNNING phase (the while True allocate-or-preempt dance, lab-04); prefix-cache consultation at admission (get_computed_blocks, lab-06 / Phase 2 lab-05); structured-output and LoRA gating; speculative-decoding token accounting; and the encoder budget for multimodal inputs. Every one of those is a guard or a discount on num_new_tokens inside the same two phases. Once you see the file that way (your 30 lines plus accessory clauses), it stops being 700 intimidating lines.

Also worth noting upstream: _clamp_new_tokens's real twin is the interaction between long_prefill_token_threshold and chunked_prefill_enabled in the scheduler config — chunked prefill is default-on in V1, which tells you how settled this once-controversial idea now is.

Hitchhiker's notes

Head-of-line blocking is a choice. When the front waiter doesn't fit, we break rather than trying the next (smaller) one. Skipping ahead would raise utilization and starve large requests — a big prompt could wait forever behind a stream of small ones slipping past it. FCFS-with-blocking is the fairness-conservative default; if you relax it, you must add an aging mechanism. (Upstream has exactly this debate in its issue tracker — worth a read.)
Why is the budget in tokens, not requests? Because step time scales with tokens through the model, not with request count — a 1-request 2048-token prefill costs about the same as 2048 one-token decodes through the GEMMs (attention differs; Phase 18 refines). Budgeting the actual scarce quantity is what makes the latency dial linear.
num_computed > 0 for a waiter is not an error — it's a preempted request being re-admitted (lab-04) or a prefix-cache hit (lab-06). Your clamp already handles it: num_tokens − num_computed just comes out smaller. Design observation: by making "partial progress" a first-class state, recovery and caching share the admission path with fresh requests. No special cases.
The FakeKV is a teaching instrument: one slot per token, no blocks, no hashes — so this lab's failures are always scheduling failures. When you wire the real KVCacheManager in (mini-build), block granularity adds a ceil() but changes no logic.

Going further

Add priority classes: each Req gets priority: int; iterate waiting in priority order with FCFS tiebreak. Then write the test proving a late high-priority request overtakes the queue without stalling running decodes. You've just implemented the core of upstream's priority scheduling policy.
Add the fully-cached edge case: if num_tokens − num_computed == 0 for a waiter (prefix cache covered everything it can), schedule 1 token anyway. Why must it be ≥ 1? (A request that schedules 0 tokens never produces logits, never samples, never finishes — an admission that can't make progress. mini_vllm/scheduler.py has this exact branch; lab-06 will show you the 1-token prefills it produces in a trace.)
Make the budget elastic: allow one oversized decode batch when waiting is empty. Measure (with Phase 1 lab-04's probe) what it does to step-time variance. Most "clever" scheduler ideas die in exactly this experiment — cheap to run here, expensive to learn in production.
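For the priority-class exercise, the ordering rule itself is small. The sketch below is one possible shape, not upstream's implementation: the priority and arrival fields are hypothetical additions to Req, and "lower value is served earlier" is an assumption of this example.

```python
from dataclasses import dataclass, field
from itertools import count

_arrivals = count()               # monotonically increasing arrival stamp


@dataclass
class Req:                        # illustrative; the lab's Req has other fields
    rid: str
    priority: int = 0             # assumption: lower value = served earlier
    arrival: int = field(default_factory=lambda: next(_arrivals))


def admission_order(waiting):
    """Waiters sorted by priority class, FCFS within a class."""
    return sorted(waiting, key=lambda r: (r.priority, r.arrival))


queue = [Req("a", priority=1), Req("b", priority=1), Req("late-vip", priority=0)]
print([r.rid for r in admission_order(queue)])   # ['late-vip', 'a', 'b']
```

Only the WAITING phase's iteration order changes; the RUNNING phase is untouched, which is why a late high-priority arrival can jump the queue without stalling any running decode.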

References

mini_vllm/scheduler.py — the full version (with preemption + prefix caching) your solution grows into.
upstream/vllm/v1/core/sched/scheduler.py:329 — Scheduler.schedule, the production loop; read it immediately after finishing.
Yu et al., Orca (OSDI 2022) — iteration-level scheduling, this loop's ancestor: https://www.usenix.org/conference/osdi22/presentation/yu
Agrawal et al., Sarathi-Serve (OSDI 2024) — why the chunk threshold exists; the prefill/decode interference math: https://arxiv.org/abs/2403.02310
vLLM docs, Optimization and Tuning — max_num_batched_tokens / max_num_seqs guidance straight from the maintainers: https://docs.vllm.ai/en/latest/configuration/optimization.html

Lab 03-02 — Chunked Prefill: Same Output, Different Timing `[CPU-OK]`

This lab proves, on a running engine, the most important safety property in vLLM:

Chunked prefill changes WHEN tokens are computed, never WHAT tokens are produced.

If that sentence were false, no scheduling optimization in this codebase would be safe to ship — every knob that re-times work would be a knob that corrupts output. You'll verify it the strong way (identical token ids, chunked vs unchunked, on the real mini_vllm engine), and you'll learn to predict the timing side: exactly how many steps a prefill takes under any threshold/budget combination.

Why this lab exists

"Re-timing is output-invariant" is the kind of claim engineers nod along to and never check. But your career will repeatedly hand you moments where the nod isn't enough: a customer reports different outputs between two deployments that differ only in scheduler config; a reviewer asks whether your scheduler PR can change generations; an incident review wants to know whether enabling chunked prefill mid-fleet is provably safe or just probably safe. This lab gives you the proof technique: drive the same deterministic workload through both schedules and diff the token ids. It's mini_vllm's own regression test (test_engine.py::test_chunked_prefill_matches_unchunked_output), reproduced by your hand so you know why it must hold, not just that it does.

The second skill is the timing model. "How many steps does a 4000-token prompt take at threshold 512?" is a real capacity question (it sets TTFT for that request and the interference window for everyone else — lab-05). The answer is a one-line formula, and you should never need to run the engine to produce it.

Background: why chunk a prefill at all

Without chunking, a 4096-token prompt arrives and the scheduler faces an ugly choice: schedule the whole prefill in one step — a step that takes hundreds of times longer than a decode step, during which every other user's token stream visibly freezes — or make the new request wait indefinitely. Early engines picked the freeze; users called it "jitter" and "stalls."

Chunked prefill (Sarathi's contribution, default-on in vLLM V1) dissolves the choice: split the prompt into budget-sized chunks across several steps, and let decodes ride along in each step's leftover budget. The long prompt pays slightly more total latency (more steps, plus re-reading its growing KV each chunk); everyone else's inter-token latency stays smooth. The two-counters model from Phase 1 makes the implementation almost embarrassingly small: a prefill is just a request whose num_computed_tokens is far behind, so capping its per-step advance — clamp from lab-01 — is chunking. No prefill state machine, no resume logic; the counter is the resume logic.

Why the output cannot change — the actual argument

Spell it out once, carefully, because this is the argument you'll reuse for every scheduling feature:

The model's logits at position k depend only on tokens 0..k (causality) and their KV values — not on which step computed that KV. KV is a pure function of the tokens.
The engine samples for a request only when num_computed_tokens + n == num_tokens (Scheduler.needs_sample — Phase 1 lab-03's guard). Mid-prefill chunks emit nothing.
Therefore the first sample happens at the same logical state (all prompt KV computed, position = prompt length) whether the prompt was computed in 1 chunk or 10. Same state + same sampling → same token. Induction extends this to every later token.

The invariant has exactly two load-bearing dependencies: causality (KV doesn't depend on schedule) and the sampling guard (no logits read mid-prefill). Notice what that implies for review: a PR can only break output-invariance by touching one of those two things. That's a checklist of length two for an entire class of changes — and on real GPUs, a softer third dependency appears (batch-shape-dependent floating-point reduction order), which is why the real engine's version of this test compares with tolerance while ours can demand exact equality. See the Hitchhiker's notes.
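The argument can be watched in miniature. The toy below is not mini_vllm: toy_kv and the "sample" rule are invented for illustration, chosen only so that KV is a pure function of the tokens (causality) and sampling happens only once the request has caught up (the guard).

```python
def toy_kv(tokens, pos):
    # Stand-in for a KV entry: depends only on tokens[0..pos], never on the step.
    return (31 * sum(tokens[: pos + 1]) + pos) % 9973


def generate(prompt, chunk, max_new):
    tokens, kv, out, steps = list(prompt), [], [], 0
    while len(out) < max_new:
        start = len(kv)
        n = min(chunk, len(tokens) - start)        # cap the per-step advance
        kv.extend(toy_kv(tokens, p) for p in range(start, start + n))
        steps += 1
        if len(kv) == len(tokens):                 # sampling guard: caught up
            nxt = sum(kv) % 50                     # toy deterministic "sample"
            out.append(nxt)
            tokens.append(nxt)
    return out, steps


prompt = list(range(1, 11))                        # 10 prompt tokens
one_shot, steps_a = generate(prompt, chunk=len(prompt), max_new=4)
chunked, steps_b = generate(prompt, chunk=3, max_new=4)
print(one_shot == chunked, steps_a, steps_b)       # True 4 7
```

Same tokens, different step counts: 1 prefill step + 3 decodes versus 4 prefill steps + 3 decodes. Break either dependency (make toy_kv depend on steps, or sample before the guard) and the equality fails, which is the length-two checklist in action.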

Files

starter.py — implement num_prefill_steps(prompt_len, threshold, budget). Your work.
solution.py — reference.
test_lab.py — checks your formula on the boundary cases AND runs the engine both ways asserting identical output token ids.

Run

LAB_IMPL=starter pytest phase-03-continuous-batching-scheduler/labs/lab-02-chunked-prefill -q
pytest phase-03-continuous-batching-scheduler/labs/lab-02-chunked-prefill -q   # reference

The formula to implement

A single request (it owns the whole budget) with a prompt_len-token prompt. The per-step chunk is threshold if threshold > 0 else budget, but never more than budget: chunk = min(threshold or budget, budget). The prefill then takes

ceil(prompt_len / chunk)

steps. Watch the boundaries the tests probe: threshold = 0 means disabled (not "chunk of zero"); a threshold larger than the budget is moot (the budget binds); a prompt that divides evenly takes exactly prompt_len / chunk, no +1. Off-by-ones here are off-by-ones in someone's TTFT model later.
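One way to write the formula, as a sketch under the stated single-request assumption (it also assumes budget > 0; the lab's reference solution may be phrased differently):

```python
def num_prefill_steps(prompt_len: int, threshold: int, budget: int) -> int:
    # threshold == 0 means "chunking disabled", so the budget is the chunk size.
    chunk = min(threshold if threshold > 0 else budget, budget)
    return -(-prompt_len // chunk)       # ceiling division on integers


print(num_prefill_steps(4000, 512, 8192))   # 8  (7 full chunks + 1 partial)
print(num_prefill_steps(4096, 512, 8192))   # 8  (divides evenly, no +1)
print(num_prefill_steps(100, 0, 2048))      # 1  (threshold disabled)
print(num_prefill_steps(100, 4096, 16))     # 7  (budget binds, not threshold)
```

Integer ceiling division avoids the float rounding that math.ceil(prompt_len / chunk) can hit on very large inputs; for realistic prompt lengths either form works.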

What the tests prove

The formula tests pin the chunk-size selection logic and the ceiling division — including threshold=0 (unchunked: 1 step), threshold > budget (budget wins), and exact-division boundaries.
The engine test generates from the same prompt with long_prefill_token_threshold=0 and with a small threshold, and asserts identical output token ids — not similar: identical. It can demand exactness because mini_vllm is deterministic end-to-end (greedy sampling, deterministic toy model), which turns the safety property into a hard equality a CI can enforce forever. This is the test you write first when building any scheduling feature: pin the semantics, then optimize the timing freely. (Compare to the trace shape you saw in Phase 1 lab-04: chunking visibly rearranged the steps. Same engine, same tokens — the timing is the only degree of freedom.)

Hitchhiker's notes

On real GPUs, "identical" softens to "equivalent." Chunking changes batch shapes; different GEMM/attention tile sizes can change floating-point reduction order; logits wiggle in the last ulp; and a greedy argmax between two near-tied tokens can flip. The semantic invariant (same distribution, same correctness) holds; bitwise equality does not. This is why upstream correctness tests for chunked prefill compare with tolerance or check logprob closeness, and it's the first thing to say in the incident review when two configs differ by one token at position 947: not all divergence is a bug — divergence beyond rounding is.
The threshold is a latency/throughput dial, not free money. Small chunks: smoother decode latency for others, but the long prompt's prefill stretches across more steps (worse TTFT for it), and each chunk re-reads the prompt's accumulated KV, so total KV read traffic grows roughly quadratically with the number of chunks where a one-shot prefill makes a single pass. Sarathi-Serve's whole paper is about choosing this number; lab-05 lets you feel it.
Where would chunking change output? It wouldn't — but a bug that sampled mid-prefill would (a request emitting a token from logits computed over half its prompt). Find the guard in mini_vllm/scheduler.py::needs_sample and its upstream twin (the logits_indices selection in the model runner). If a future refactor moved sampling before the catch-up check, this lab's engine test is the tripwire that catches it.
Chunked prefill and prefix caching compose. A cache-hit request enters admission with num_computed_tokens already nonzero; the chunk math applies to the remainder. No interaction code exists because both features speak the same language: the counter. (Lab-06 shows the composed behavior in a trace.)
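The re-read cost in the threshold note can be put in numbers with an idealized count. This is a back-of-envelope model, not a measurement: it assumes each chunk's attention reads all KV accumulated so far exactly once and ignores kernel tiling, and kv_tokens_read is a name made up for this sketch.

```python
def kv_tokens_read(prompt_len: int, chunk: int) -> int:
    total, done = 0, 0
    while done < prompt_len:
        done = min(done + chunk, prompt_len)
        total += done        # this chunk attends over every KV entry up to `done`
    return total


one_shot = kv_tokens_read(4096, 4096)
chunked = kv_tokens_read(4096, 512)
print(one_shot, chunked, chunked / one_shot)   # 4096 18432 4.5
```

Eight chunks read 4.5× the KV of the one-shot prefill under this model. Halve the chunk size and the multiplier roughly doubles, which is the price paid for smoother decode latency elsewhere.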

Going further

Extend num_prefill_steps to two concurrent prompts sharing the budget fairly — suddenly you need to model the RUNNING phase's in-flight chunks competing with admissions, and the closed form gets genuinely interesting. Check your model against Phase 1 lab-04's probe.
Compute TTFT in steps as a function of threshold for a 4096-token prompt at budget 512, then sketch the other requests' worst-case stall at each threshold (lab-05 measures it). Plot both curves; their crossing is the tuning decision Sarathi formalizes.
Read upstream's long_prefill_token_threshold handling and the scheduler config's chunked-prefill defaults, and write down which of your formula's branches each config combination exercises.

References

mini_vllm/test_engine.py::test_chunked_prefill_matches_unchunked_output — the course's own regression test you just rebuilt.
mini_vllm/scheduler.py — _clamp_new_tokens (the chunker) and needs_sample (the guard).
upstream/vllm/v1/core/sched/scheduler.py — the production clamp; search long_prefill_token_threshold.
Agrawal et al., SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills (2023) — the original chunking paper: https://arxiv.org/abs/2308.16369
Agrawal et al., Sarathi-Serve (OSDI 2024) — the production-grade follow-up with the threshold-tuning math: https://arxiv.org/abs/2403.02310
vLLM docs, Chunked Prefill — the feature's official knobs and defaults: https://docs.vllm.ai/en/latest/configuration/optimization.html

Lab 03-03 — Measure Real Prefix-Cache Hit Rate `[GPU-OPT]`

Every production LLM workload has a shape, and the shape is almost always "a long shared prefix, then a short unique tail" — system prompts, few-shot examples, conversation history, RAG boilerplate. Prefix caching turns that shared prefix from N prefills into one. In this lab you run the real engine on exactly that workload and watch the meters move: the hit-rate counter climbing to ~94%, prompt throughput jumping 4–5×, and KV usage staying near 1× a single prompt. These are the numbers that justify the feature — and you'll know how to reproduce them on your workload, which is the question that actually matters.

No GPU? Don't panic. The captured run below is annotated line by line; the analysis sections work entirely on paper. And lab-06 reproduces this experiment on the mini engine, CPU-only, with exact token accounting — do that one hands-on.

Why this lab exists

Prefix caching is the rare optimization that is simultaneously huge (multi-× on the right workload), free to enable (default-on in modern vLLM), and workload-dependent enough to be oversold (≈0 benefit on share-nothing batch jobs). An engineer who can't measure it is at the mercy of vibes in both directions. This lab builds the measurement reflex: construct a workload with known sharing, run with the feature off and on, and read three independent meters that must agree (hit rate, prompt throughput, KV usage). When the meters don't agree — hit rate high but no speedup, say — you've learned something real (often: the prefix wasn't block-aligned, or the workload was decode-dominated all along).

The same experiment is also your template for capacity claims: "enabling prefix caching will let this deployment serve 3× the QPS" is a sentence you should only say after running this lab's shape against your traffic.

Background: what a "hit" buys

From Phase 2 lab-05 you know the mechanism: full blocks of the prompt are content-hashed (parent-chained), and a new request adopts any cached chain head — touch, ref-count bump, zero compute. What that buys, concretely, per hit token:

Prefill compute: the entire forward pass for that token — skipped. TTFT for a request with an N-token cached prefix drops by roughly N/(prefill speed).
KV memory: the hit blocks are shared, not copied (ref_cnt += 1). Sixteen requests sharing a 1000-token system prompt store its KV once.
What it never buys: decode. Generated tokens are new by definition. A workload that prefills 50 tokens and decodes 2000 saves almost nothing — check your prefill:decode ratio before promising miracles.

The unit of caching is the full block (Phase 2's I3): a 130-token shared prefix at block_size 16 hits at most 8 blocks = 128 tokens, and divergence mid-block forfeits that block. Hence the operator's rule of thumb: put the static part first, pad nothing, and the boundary token of your template matters more than you'd think.
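The block arithmetic is worth having as a function you can point at your own templates. A minimal sketch, with a made-up helper name; it also applies the cap that a hit can cover at most num_tokens − 1 tokens, so at least one token is always computed:

```python
def max_hit_tokens(shared_prefix_len: int, num_tokens: int, block_size: int) -> int:
    cacheable = min(shared_prefix_len, num_tokens - 1)   # always compute >= 1 token
    return (cacheable // block_size) * block_size        # only full blocks can hit


print(max_hit_tokens(130, 200, 16))   # 128: the 2 tokens past block 8 are forfeited
print(max_hit_tokens(128, 128, 16))   # 112: a fully shared prompt still loses a block
print(max_hit_tokens(15, 100, 16))    # 0:   a prefix shorter than one block never hits
```

The second case is the counterintuitive one: an exact repeat of a block-aligned prompt gives up its final block to the num_tokens − 1 cap.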

Requirements

uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct   # small, modern, instruct-tuned

Steps

# run.py
from vllm import LLM, SamplingParams

SYSTEM = "You are a meticulous assistant. Follow instructions carefully. " * 30  # ~400 tokens shared

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    enable_prefix_caching=True,      # <- the feature under test; flip for the control run
    gpu_memory_utilization=0.6,
    max_model_len=4096,
)

# 16 requests sharing SYSTEM, each with a unique tail.
prompts = [f"{SYSTEM}\n\nQuestion {i}: what is {i}+{i}?" for i in range(16)]
out = llm.generate(prompts, SamplingParams(max_tokens=16, temperature=0))
for o in out[:2]:
    print(repr(o.outputs[0].text))

Run twice — enable_prefix_caching=True then False — both under VLLM_LOGGING_LEVEL=DEBUG, and collect the three meters below for each. (One subtlety of this script: all 16 requests are submitted in one generate call, so requests 1..15 hit blocks request 0 cached moments earlier in the same run — vLLM caches blocks as soon as they're full, not when the request finishes. The mechanism is identical for requests arriving minutes apart, as long as the blocks survive eviction.)

What to measure

Metric	prefix caching OFF	prefix caching ON
Prefix cache hit rate	0% (counter absent/zero)	climbs toward (N−1)/N
Avg prompt throughput	baseline	several × baseline
Peak KV-cache usage	~16 × SYSTEM + tails	~1 × SYSTEM + tails
TTFT, requests 1..15	full prefill each	only the unique tail prefills

Three of these are in the debug logs; TTFT you can take from the per-request timing if you use the API server, or infer from prompt throughput here.

Captured output (real run, Qwen2.5-0.5B, L4 24GB, vLLM 0.22.1)

# enable_prefix_caching=True
INFO ... Automatic prefix caching is enabled.
DEBUG ... Prefix cache hit rate: GPU: 0.00%      (request 0 populates the cache)
DEBUG ... Prefix cache hit rate: GPU: 93.7%      (requests 1..15 reuse SYSTEM's blocks)
INFO ... Avg prompt throughput: 41523 tokens/s   (mostly cached -> not recomputed)
'4'  '6'

# enable_prefix_caching=False (same workload)
DEBUG ... Prefix cache hit rate: GPU: 0.00%
INFO ... Avg prompt throughput: 9120 tokens/s    (every SYSTEM prefilled from scratch)

Reading the numbers like an operator

0.00% → 93.7% — the first line is request 0 paying full price (the pioneer effect: a cache nobody populated cannot hit — same 1/N you saw in Phase 2 lab-03's 87.5%). Then 15 of 16 requests reuse SYSTEM's blocks. Why 93.7% and not 15/16 of all tokens? The denominator is queries (tokens looked up), and each request's unique tail plus its final block can't hit — the cap from Phase 2 lab-05: a hit covers at most full blocks of at most num_tokens − 1 tokens. Hit rates have denominators; always ask what's in them before quoting one.
41523 vs 9120 tokens/s prompt throughput — the 4.6× is the shared prefix being computed once instead of 16 times. Sanity-check the ratio: with ~430 shared + ~15 unique tokens per prompt, the cached run computes ~1×430 + 16×15 ≈ 670 prefill tokens where the uncached run computes 16×445 ≈ 7120 — a ~10× compute saving, surfaced as ~4.6× in the wall-clock meter (the meter averages over windows that include decode time too). Meters measure what they measure; derive what you expected before trusting the headline.
The outputs are '4' and '6' — the same answers the uncached run gives. Cached KV is the same KV (Phase 2 lab-06's identity theorem, now economically significant). Correctness meters and performance meters move independently; check both.
Same arithmetic as lab-06 — which computes the exact scheduled-token saving ((N−1) × full-blocks-of-shared-prefix) on the mini engine where every token is countable. The GPU numbers above are that arithmetic, plus wall-clock noise.
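The sanity check on the throughput ratio takes a few lines. This sketch uses the approximate token counts quoted above (~430 shared, ~15 unique) and deliberately ignores block alignment, so treat the result as the expected order of magnitude, not an exact prediction:

```python
def prefill_tokens(n_requests: int, shared: int, unique: int, cached: bool) -> int:
    if cached:
        return shared + n_requests * unique      # prefix computed once
    return n_requests * (shared + unique)        # prefix computed every time


cached = prefill_tokens(16, 430, 15, cached=True)
uncached = prefill_tokens(16, 430, 15, cached=False)
print(cached, uncached, round(uncached / cached, 1))   # 670 7120 10.6
```

A predicted ~10× in computed tokens against a measured 4.6× in the meter is the gap to explain before quoting either number.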

Hitchhiker's notes

Conversation history is the killer app, not just system prompts: each turn re-sends the whole transcript, which is — by construction — a growing shared prefix with itself. A chat with T turns gets ~T× prefill savings on its own history. This is why every serious chat API (and vLLM-based products) leans on prefix/prompt caching, and why the commercial APIs sell it explicitly (Anthropic/OpenAI "prompt caching" — same idea, different billing).
What invalidates a cached prefix: eviction under memory pressure (the blocks are still just free-queue citizens — Phase 2 lab-05), reset_prefix_cache(), restart, or anything that changes what the KV means: different LoRA adapter, different model, different chat-template rendering of the "same" text. The hash chain includes token ids only after templating — two prompts that render differently share nothing, which is the most common "why is my hit rate 0" in practice (timestamp in the system prompt, per-user name early in the template, randomized example order...).
n>1 parallel sampling (Phase 9) reuses this exact machinery — N samples of one prompt share the prompt's blocks via the same ref_cnt mechanics. So do beam search and speculative-decoding draft trees. "Share immutable prefix KV via refcounted blocks" is load-bearing infrastructure, not a feature flag.
Security note for multi-tenant operators: cache timing is observable — a fast TTFT reveals someone recently prefilled the same prefix. Cross-tenant prefix caching can therefore leak prompt equality across tenants (a real, published attack class against LLM caches). vLLM's cache is per-engine; if you front multiple tenants, decide deliberately whether their prefixes may share a pool.
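The "why is my hit rate 0" failure mode is easiest to see with a toy hash chain. This is a sketch of the parent-chaining idea only: the hash function and encoding are arbitrary choices for this example, and the real key folds in more than token ids (the LoRA adapter, for one, per the note above).

```python
import hashlib


def block_hashes(token_ids, block_size):
    """Hash each full block together with its parent's hash."""
    hashes, parent = [], b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start : start + block_size]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(parent)
    return hashes


def shared_full_blocks(a, b, block_size):
    """Number of leading full blocks two prompts can share."""
    n = 0
    for ha, hb in zip(block_hashes(a, block_size), block_hashes(b, block_size)):
        if ha != hb:
            break
        n += 1
    return n


base = list(range(40))
print(shared_full_blocks(base, base[:-1] + [999], 16))   # 2: only the tail differs
print(shared_full_blocks(base, base[:20] + [7] * 20, 16)) # 1: divergence in block 2
print(shared_full_blocks(base, [999] + base[1:], 16))    # 0: first token differs
```

The last line is the timestamp-in-the-system-prompt case: one changed token at the front invalidates every block after it, because each hash depends on its parent.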

Reflect

Why does the first request show 0% even though the cache is enabled? And what is the steady-state hit rate of this workload as N → ∞? ((N−1)/N of the shareable tokens — the pioneer cost amortizes to nothing.)
Your workload prefills 2000 tokens of RAG context (unique per query!) and decodes 100. What hit rate do you expect? (~0 — unique context shares nothing. What would help? Reordering the prompt so static instructions precede the unique context, and caching exactly that. Prompt structure is a performance interface.)
Estimate: 16 requests × 430-token SYSTEM at ~36 KB/token-ish for a 0.5B model — how much KV memory did sharing save, in MB? Now do it for a 70B model at 405 KB/token and 64 concurrent requests. (This is why prefix caching is also a capacity feature, per Phase 2 lab-03's concurrency math.)
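For the memory estimate, a sketch of the arithmetic. It takes the per-token figures exactly as quoted in the question, treats 1 MB as 1000 KB, ignores block rounding, and assumes the 70B case uses the same 430-token prefix (the question does not say):

```python
def kv_saved_mb(n_requests: int, prefix_tokens: int, kb_per_token: float) -> float:
    # Sharing stores the prefix once instead of n_requests times.
    return (n_requests - 1) * prefix_tokens * kb_per_token / 1000


print(kv_saved_mb(16, 430, 36))    # 232.2     (0.5B model, 16 requests)
print(kv_saved_mb(64, 430, 405))   # 10971.45  (70B model, 64 requests: ~11 GB)
```

A couple of hundred MB is a rounding error on a 24 GB card; eleven GB on a 70B deployment is the difference between fitting the batch and preempting.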

References

upstream/vllm/v1/core/kv_cache_manager.py::get_computed_blocks — where the hit happens, including the hit-rate accounting you watched.
mini_vllm/kv_cache.py::get_computed_blocks — the same logic at readable scale (Phase 2 lab-05 exercises it directly).
vLLM docs, Automatic Prefix Caching — design + operational notes: https://docs.vllm.ai/en/latest/design/prefix_caching.html
Zheng et al., SGLang / RadixAttention (2023) — prefix reuse generalized to a tree; the natural next read: https://arxiv.org/abs/2312.07104
Anthropic, Prompt caching announcement (2024) — the same economics, productized; good for building intuition about real workload shapes: https://www.anthropic.com/news/prompt-caching
Lab-06 in this phase — the CPU twin with exact token accounting.

Lab 03-04 — Preemption: Survive Memory Pressure `[CPU-OK]`

Every admission decision the scheduler makes is a bet: "this request will fit." Decode growth makes the bet probabilistic — every running request gets one token longer per step, and nobody knows when they'll stop. Preemption is what happens when the bets go bad: the scheduler evicts a running request mid-generation, frees its KV, and re-runs it later — and the user at the other end must never be able to tell. In this lab you'll force that emergency on purpose, in a pool sized so two requests cannot both finish, and prove the two halves of the contract: a preemption really occurs, and the preempted request's final output is token-for-token identical to what it would have produced with infinite memory.

Memory pressure costs time, never correctness. That's the sentence this lab turns from a slogan into a test.

Why this lab exists

Preemption is the scheduler's least-exercised, most-critical path — the firmware of the fire extinguisher. It runs rarely (well-tuned deployments preempt little), which means bugs in it survive for months, and when it finally runs, it runs during the worst possible conditions: full memory, maximum load, every user watching. An engineer who has caused preemptions in a controlled pool, watched the counters reset, and verified the replayed output, debugs a production preemption storm from knowledge; everyone else debugs it from folklore.

There's also a design lesson here worth the price of the lab on its own: vLLM turns a potential correctness catastrophe (OOM mid-generation) into a pure performance event (recompute later). That transformation — push failures down the severity ladder, from "wrong answer" to "crash" to "slow" — is the signature move of robust systems design, and preemption is its cleanest specimen in this codebase.

Background: why overcommit at all

The timid alternative exists: admit a request only if prompt + max_tokens worth of blocks can be reserved up front. No preemption needed, ever. But you built Phase 2 lab-02, so you can name the cost: that's max-reservation again through the back door. Requests typically generate a fraction of max_tokens, so reserved-but-unused blocks strangle concurrency exactly the way contiguous allocation did. vLLM chooses to admit optimistically (reserve nothing beyond current need, let requests grow block by block) and handle the rare collision with preemption. More throughput every step, plus an occasional recompute tax, beats less throughput always. But optimism requires a safety valve, and the valve must preserve correctness, or the whole bargain is rotten. Hence this lab's two-sided test.
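To put a number on the cost of the timid policy, compare how many requests fit under each admission rule. Every figure below (pool size, block size, prompt and output lengths) is hypothetical, picked only to make the gap visible:

```python
import math


def blocks_for(num_tokens: int, block_size: int) -> int:
    return math.ceil(num_tokens / block_size)


def max_concurrency(pool_blocks: int, tokens_per_request: int, block_size: int) -> int:
    return pool_blocks // blocks_for(tokens_per_request, block_size)


POOL, BLOCK = 1000, 16                      # hypothetical pool
PROMPT, MAX_TOKENS, TYPICAL = 512, 2048, 200

reserved = max_concurrency(POOL, PROMPT + MAX_TOKENS, BLOCK)   # reserve worst case
optimistic = max_concurrency(POOL, PROMPT + TYPICAL, BLOCK)    # hold actual usage
print(reserved, optimistic)                 # 6 22
```

Under these assumptions reservation admits 6 requests where optimistic admission holds 22. The optimistic figure is a steady-state estimate, not a guarantee, and the difference between the two is exactly what preemption exists to insure.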

The mechanism, step by step

From mini_vllm/scheduler.py (the RUNNING phase's while True), mirroring upstream:

A running request needs a block for its next token; allocate_slots returns None — the pool is empty. This is the bad bet coming due.
The scheduler picks a victim: the last request in running — the most recently admitted (under FCFS, the lowest-priority / least-progressed; upstream with priority scheduling picks lowest-priority-then-latest). Note the same principle as lab-01: which end of which list you take from IS the policy.
_preempt(victim): free all the victim's blocks (back to the free queue — Phase 2 mechanics), reset num_computed_tokens = 0, keep output_token_ids (this is the crown jewel — see below), status → PREEMPTED, and push it on the front of the waiting queue (it has waited longest; it re-enters first).
Retry the allocation. Repeat — possibly preempting several — until it fits. Degenerate case: the victim is yourself; then you give up this step (you're now first in waiting, and you'll be re-admitted when memory frees).
No admissions on a step that preempted (not out.preempted_req_ids guards the WAITING phase) — re-admitting while evicting would thrash.
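The five steps above can be sketched as a self-contained toy. This is a minimal sketch under stated assumptions, not mini_vllm's code: Req, toy_model, blocks_for, and run are hypothetical names, num_blocks counts usable blocks only (no null block), there is no chunking or prefix caching, and toy_model is a deterministic stand-in for greedy decoding. It runs the same two prompts in a cramped pool and a roomy pool, then reports outputs, preemption count, and total scheduled tokens.

```python
from collections import deque
from dataclasses import dataclass, field

BLOCK_SIZE = 4

def blocks_for(num_tokens):
    """Blocks needed to hold num_tokens KV entries (ceiling division)."""
    return -(-num_tokens // BLOCK_SIZE)

def toy_model(tokens):
    """Deterministic stand-in for greedy decoding: depends only on the token list."""
    return (31 * sum(tokens) + len(tokens)) % 101

@dataclass(eq=False)  # identity comparison, so list.remove() targets this exact request
class Req:
    prompt: list
    output: list = field(default_factory=list)
    num_computed: int = 0
    blocks: int = 0

    def tokens(self):
        return self.prompt + self.output

def run(prompts, num_blocks, max_tokens):
    # Admission-time invariant: one max-length request must fit in the pool alone.
    longest = max(len(p) for p in prompts) + max_tokens - 1
    assert blocks_for(longest) <= num_blocks
    reqs = [Req(list(p)) for p in prompts]
    waiting, running = deque(reqs), []
    free, preemptions, scheduled_total = num_blocks, 0, 0
    while waiting or running:
        scheduled, preempted = [], False
        i = 0
        while i < len(running):  # RUNNING phase: allocate, or preempt and retry
            req = running[i]
            need = blocks_for(len(req.tokens())) - req.blocks
            while need > free:
                victim = running.pop()  # most recently admitted
                free += victim.blocks
                victim.blocks, victim.num_computed = 0, 0  # output tokens are kept
                waiting.appendleft(victim)
                preempted, preemptions = True, preemptions + 1
                if victim is req:  # degenerate case: give up this step
                    break
            else:  # allocation fits (possibly after evictions)
                free -= need
                req.blocks += need
                scheduled.append(req)
            i += 1
        while waiting and not preempted:  # WAITING phase, gated on "no preemption"
            need = blocks_for(len(waiting[0].tokens()))
            if need > free:
                break
            req = waiting.popleft()
            free, req.blocks = free - need, need
            running.append(req)
            scheduled.append(req)
        for req in scheduled:  # toy forward pass: close the gap, then sample
            scheduled_total += len(req.tokens()) - req.num_computed
            req.num_computed = len(req.tokens())
            req.output.append(toy_model(req.tokens()))
            if len(req.output) == max_tokens:  # finished: reap and free blocks
                running.remove(req)
                free, req.blocks = free + req.blocks, 0
    return [r.output for r in reqs], preemptions, scheduled_total

prompts = [list(range(1, 9)), list(range(11, 19))]  # two 8-token prompts
cramped = run(prompts, num_blocks=4, max_tokens=6)
roomy = run(prompts, num_blocks=64, max_tokens=6)
print(cramped[0] == roomy[0], cramped[1], roomy[1], cramped[2] - roomy[2])
```

On these toy numbers the outputs match, the cramped run preempts once, and it schedules 8 more tokens than the roomy run: the victim's first prefill, paid twice. Recovery needs no code of its own here; the reset counter simply makes the re-admitted request look like a longer prompt.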

Files

starter.py — implement run(prompts, num_blocks, block_size, max_tokens): drive mini_vllm.LLMEngine with a given pool size and return each prompt's output token ids. Deliberately thin — the test design is the lab. Your work.
solution.py — reference.
test_lab.py — (1) cramped-pool outputs == roomy-pool outputs; (2) a direct scheduler test that a preemption actually occurs under pressure.

Run

LAB_IMPL=starter pytest phase-03-continuous-batching-scheduler/labs/lab-04-preemption -q
pytest phase-03-continuous-batching-scheduler/labs/lab-04-preemption -q   # reference

The setup that forces preemption

Arithmetic you can check on your fingers — pool of 5 blocks × 4 slots = 20 slots, minus the null block → 16 usable slots. Two requests, each 8 prompt tokens + up to 20 output: each needs 3 blocks just to get past its first decodes (tokens 9..12 spill into block 3). Both admit fine (2 blocks each = 4 of 4 blocks — the optimistic bet). Both decode a few tokens... then one needs its third block. Free blocks: zero. The scheduler preempts the most-recent admission, lets the survivor finish (its blocks free at completion — the reaping path from Phase 1), then re-admits the victim, which re-prefills from scratch and finishes too. Total cost: one extra prefill. Total damage to output: zero — that's the assertion.

The direct scheduler test stages the same squeeze without the engine: schedule once (both admitted), manually advance both requests past their prompts, schedule again — and assert out.preempted_req_ids is non-empty. One test proves the valve opens; the other proves nothing leaks when it does.
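The finger arithmetic for the admission squeeze can be checked in a few lines. A small sketch using only the numbers stated above (5 blocks of 4 slots, one null block, two 8-token prompts); the variable names are illustrative, not taken from the lab files.

```python
import math

block_size, num_blocks, prompt_len = 4, 5, 8

usable_blocks = num_blocks - 1                 # one block is the null block
usable_slots = usable_blocks * block_size
blocks_per_prompt = math.ceil(prompt_len / block_size)
free_after_admitting_both = usable_blocks - 2 * blocks_per_prompt
blocks_to_hold_token_9 = math.ceil((prompt_len + 1) / block_size)

print(usable_slots, blocks_per_prompt, free_after_admitting_both, blocks_to_hold_token_9)
# 16 2 0 3
```

Zero free blocks after admission, and a third block needed as soon as either request grows past its prompt: that is the collision the direct scheduler test stages by hand.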

Why the output is identical

The victim's output_token_ids survive preemption; only num_computed_tokens resets. On re-admission, the request's token list is prompt + outputs_so_far, and the engine — with zero special-case code — simply sees a request whose counter is far behind and prefills the whole thing, generated tokens included. When the counter catches up, the next sample happens at the same (last_token, position) state as if nothing had happened, and the deterministic model continues identically. Induction does the rest.

Stop and admire the design economy: recovery is just prefill. The two-counters model makes "resume after eviction" indistinguishable from "admit a long prompt" — same code path as lab-02's chunking, same path as lab-06's cache hits. Three features, zero interaction code. When you design state machines, this is what to copy: make recovery a state the normal path already handles, not a parallel universe of special cases. (Real vLLM preserves correctness the same way; with prefix caching on, the recompute may even hit surviving cached blocks of its own prompt and skip most of the work.)

What the tests prove

Test	The half of the contract it pins
`test_cramped_pool_matches_roomy_pool`	Correctness: 5-block pool and 256-block pool produce identical token ids, and both requests reach their full `max_tokens`. The user cannot detect preemption from outputs.
`test_preemption_actually_happens_under_pressure`	Liveness of the test itself: a preemption really fires in this scenario. Without this, the first test could pass vacuously (pool accidentally roomy, nothing preempted, nothing proven). Pair every "X survives Y" test with a "Y actually happened" probe — untriggered safety tests are the unit-test equivalent of an unplugged smoke detector.

Hitchhiker's notes

Recompute vs swap. vLLM can alternatively swap a victim's KV to CPU RAM and copy it back later, instead of recomputing. The trade: recompute spends GPU FLOPs (cheap-ish, prefill is compute-efficient); swap spends PCIe bandwidth (~tens of GB/s against KV that can be GBs) and host RAM. V1 defaults to recompute — short-to-medium contexts re-prefill faster than they copy. Swap wins for very long contexts; that regime is where the disaggregated/offload designs of Phase 15 live.
Why the most-recently-admitted victim? Least progress lost (it has computed the least), and under FCFS it's the lowest-priority commitment. Preempting the oldest would maximize wasted work and starve the request closest to finishing — note that the survivor finishing is what frees memory. Victim selection isn't fairness aesthetics; it's part of the forward-progress argument.
The deadlock question (interviewers love it): what if no request can finish because none fits alone? Then preemption ping-pongs forever — A evicts B, B evicts A. Prevention is an admission-time invariant: a single max-length request must fit in the pool. That's exactly the startup check you met in Phase 1 lab-02 (the engine refusing max_model_len too big for its blocks) — the safety valve works only because another component made a promise. Cross-component invariants like this are what design docs are for.
Operationally, preemption is a smell, not a feature. Each one re-prefills a whole request, visible as a TTFT/ITL spike for the victim and burned throughput for everyone. vLLM logs a warning with a preemption counter (vllm:num_preemptions_total in metrics); a sustained nonzero preemption rate means your pool is undersized for your workload: lower max_num_seqs, shorten max_model_len, raise gpu_memory_utilization, or buy HBM. The valve existing doesn't make leaning on it free.

Going further

Instrument num_preemptions per request (the field already exists) across a sweep of pool sizes from 5 to 50 blocks for a fixed 4-request workload. Plot total steps vs pool size — you'll get a hockey stick whose knee is "enough memory," the capacity-planning picture from the memory side.
Change the victim policy to oldest-first and rerun the suite. The correctness test still passes (replay is policy-independent — make sure you can say why), but count total steps: you've measured the cost of a bad policy with the safety net intact.
Add a swap mode to mini_vllm (stash the victim's per-token "KV" — here just its counter state — instead of resetting) and make the correctness test pass both modes. You'll discover the bookkeeping subtleties (what if the swapped request's blocks were shared via prefix cache?) that make real swap implementations hairy.

References

mini_vllm/scheduler.py::_preempt and the RUNNING phase's allocate-or-preempt loop — the dozen lines this lab is about.
upstream/vllm/v1/core/sched/scheduler.py — search preempt: same dance, plus priority-aware victim selection and the preemption-mode plumbing.
Kwon et al., PagedAttention (SOSP 2023), §4.5 — preemption via recompute vs swap, with measurements: https://arxiv.org/abs/2309.06180
vLLM docs, Optimization and Tuning — the official "reduce preemptions" guidance and the warning log you'll see in production: https://docs.vllm.ai/en/latest/configuration/optimization.html
Phase 2 lab-02 — the over-reservation waste that justifies optimistic admission in the first place.

Lab 03-05 — Decode-Latency Spikes, and How Chunking Kills Them `[CPU-OK]`

Labs 01 and 02 built chunked prefill's mechanism and proved it safe. This lab supplies the missing piece: the motive. You'll put a short request mid-decode — a user happily watching tokens stream — and then slam a 256-token prompt into the engine. Without chunking, the decode stream takes one step that costs 257 tokens of work instead of 1: a ~250× inter-token latency spike, the infamous "my chat froze for a second" of early serving engines. With threshold=32, the same experiment caps every step at 33. You will produce both latency profiles yourself, as exact integer sequences, on a laptop.

Why this lab exists

Tail latency is where serving engineers earn their pay. Means are easy — any engine has a fine average inter-token latency; the product experience is set by the p99, and the p99 is set by exactly the event you're about to stage: someone else's prefill landing in your decode step. This interference is invisible in throughput numbers (the work all gets done!) and invisible in single-request benchmarks (no one to interfere with). You only see it by looking at per-step cost from the perspective of one victim stream — which is precisely the measurement you'll build, using the schedule-probe from Phase 1 lab-04 as your instrument.

This lab is also the Sarathi-Serve paper in a bottle. Their contribution — "stall-free scheduling" via chunked prefills piggybacked on decode batches — reduces, on this workload, to the difference between your two measured profiles. Papers compress well when you've run their experiment.

Background: step cost is the latency

In mini_vllm, steps are instant; on a GPU, a step's wall-clock time grows roughly with the tokens scheduled in it (they all go through the same forward pass — more tokens, more FLOPs, longer step). So for a decoding request, the time between its token k and token k+1 is the duration of the step that computes k+1 — including everyone else's work in that step. That's why this lab's metric is:

for each step in which the victim advances, the total scheduled tokens of that step.

It's a proxy with the right shape: a step of [A:1, B:256] is ~257 token-units long, and A's user waits all of it for one token. The proxy ignores second-order GPU effects (attention's memory traffic, kernel launch overheads — Phase 18 refines), but the first-order picture it gives is the one that drives the tuning decision.
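The metric itself is a one-liner once you have a per-step trace. A minimal sketch, assuming a hypothetical trace format (one dict per step mapping request id to tokens scheduled); the lab's real probe and its output shape are described in the starter docstring, not here.

```python
def victim_step_costs(trace, victim):
    """For each step in which the victim advances, the total scheduled tokens of that step."""
    return [sum(step.values()) for step in trace if step.get(victim, 0) > 0]

# Hypothetical trace: one dict per step, request id -> tokens scheduled that step.
trace = [
    {"A": 1, "B": 256},  # B's whole prefill lands in A's decode step
    {"A": 1, "B": 1},    # both decode
    {"B": 1},            # A not scheduled: this step is not part of A's profile
    {"A": 1},
]
costs = victim_step_costs(trace, "A")
print(costs)  # [257, 2, 1]
```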

Files

starter.py — implement decode_step_costs(...): stage the collision, probe the schedule, extract the victim's per-step costs. Recipe in the docstring. Your work.
solution.py — reference.
test_lab.py — the spike exists unchunked; the cap holds chunked; the work conserves; outputs are schedule-invariant; the victim is never starved.

Run

LAB_IMPL=starter pytest phase-03-continuous-batching-scheduler/labs/lab-05-decode-latency-spikes -q
pytest phase-03-continuous-batching-scheduler/labs/lab-05-decode-latency-spikes -q   # reference

What you should see — the two profiles

Real output of the solution (long_prompt_len=256, max_tokens=16):

threshold=0 :  [257, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
threshold=32:  [33, 33, 33, 33, 33, 33, 33, 33, 2, 1, 1, 1, 1, 1, 1]

Read them like latency traces, because that's what they are:

Unchunked: one monstrous step — B's entire 256-token prefill rides the same step as A's decode (256 + 1 = 257) — then calm. A's user experiences fifteen smooth tokens and one ~250× hiccup. This is a p99 disaster hiding in a perfect mean: the average cost is ~18, the median is 1. If you only monitor averages, this profile looks healthy. It isn't.
Chunked at 32: eight consecutive steps of exactly 33 (one 32-token chunk of B + A's 1 token — your lab-02 formula: ⌈256/32⌉ = 8 steps), then a 2 (B's first decode rides along), then 1s. The spike didn't shrink; it spread: the same 256 tokens of prefill work, conserved to the token, paid in eight 33× installments instead of one 257× balloon. Worst-case ITL for A drops ~8×; B's time-to-first-token rises (8 steps instead of 1). Nothing is free — chunking is a redistribution of latency from everyone's p99 to the long prompt's TTFT. That redistribution is almost always the right trade (decode stalls are user-visible jitter; prefill latency is a single wait users expect), but say it as a trade, not a win.
The 2s are worth a glance: B's own decode co-scheduled with A's. Mixed batches are everywhere once you know to look (Phase 1 lab-04's "money step").
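Both profiles can be reproduced by a toy that models nothing but the two counters. This is a sketch under assumptions chosen to match the printed sequences, not the lab's solution: toy_decode_step_costs is a hypothetical name, the victim A is observed for 15 decode steps from the moment B arrives, B stops after 2 output tokens, and the token budget is never the binding cap.

```python
from statistics import mean, median

def toy_decode_step_costs(long_prompt_len=256, threshold=0,
                          victim_decode_steps=15, intruder_max_tokens=2):
    costs = []
    b_tokens, b_computed, b_outputs = long_prompt_len, 0, 0
    for _ in range(victim_decode_steps):
        cost = 1  # running-first: A's single decode token
        if b_outputs < intruder_max_tokens:
            gap = b_tokens - b_computed
            chunk = min(gap, threshold) if threshold > 0 else gap
            cost += chunk
            b_computed += chunk
            if b_computed == b_tokens:  # caught up: B samples a token
                b_outputs += 1
                b_tokens += 1
        costs.append(cost)
    return costs

unchunked = toy_decode_step_costs(threshold=0)
chunked = toy_decode_step_costs(threshold=32)

print(unchunked)
print(chunked)
print(sum(unchunked), sum(chunked))          # same total work in both profiles
print(round(mean(unchunked), 1), median(unchunked), max(unchunked), max(chunked))
```

Both lists sum to 272, which is the conserved-work claim in integer form; only the maximum moves, from 257 to 33.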

What the tests prove

Test	What it pins
`test_unchunked_prefill_spikes_one_decode_step`	The spike is real (≥ 257) and singular (second-worst step ≤ 3) — exactly one balloon payment
`test_chunked_prefill_caps_the_spike`	The cap holds (`≤ threshold + 1` for every step the victim shares) and the work is conserved (≥ 8 elevated steps: the prefill didn't vanish, it spread)
`test_chunking_does_not_change_the_decode_streams_output`	Lab-02's invariant under interference: identical token ids for the victim across both schedules
`test_decode_stream_is_never_starved`	The victim advances every single step it's alive, chunked or not — running-first (lab-01) means interference delays decodes but never skips them

Together these four are the full contract of chunked prefill: bounded interference, conserved work, untouched outputs, guaranteed progress.

Hitchhiker's notes

Where's the threshold's floor? Push it down: at threshold=1, B's prefill takes 256 steps — A's latency is pristine and B's TTFT is catastrophic; and on real hardware, 256 tiny steps pay 256× the fixed per-step overhead (scheduler, launches, sampler), so total throughput sags too. The optimum is workload-dependent and that's the point: upstream exposes long_prefill_token_threshold (and the budget) rather than hardcoding an answer. Sarathi-Serve's evaluation is essentially this sweep with wall-clocks.
The budget is the other half of the dial. Chunks are bounded by min(threshold, remaining budget) — lab-01's clamp. A small max_num_batched_tokens caps interference globally (every step is small) at the cost of slower prefills for everyone. Production tuning usually sets budget for the worst acceptable step time, then threshold for fairness within it.
Why measure from the victim's seat? Because aggregate metrics hide exactly this. Mean step cost barely moves between the two profiles (the work is identical!); only per-victim-step cost shows the 257 vs 33. The general lesson for benchmarking serving systems: pick a request and follow it — fleet-wide averages are where tail pain goes to hide. This is why serious LLM benchmarks report TTFT and ITL distributions, never just tokens/sec (Phase 18).
Real-engine correspondence: in vLLM, run two clients — one streaming a long generation, one submitting a huge prompt — and watch the streamer's inter-chunk gaps with chunked prefill toggled (it's default-on in V1; you can throttle it via long_prefill_token_threshold). The wall-clock version of your integer profiles, jitter included.
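The threshold dial and the budget clamp from the notes above reduce to a few lines of arithmetic. A rough sketch, assuming one decoding victim that takes its token first and a 256-token intruder; chunk_size and sweep are hypothetical helpers written for this note, and the real clamp in mini_vllm and upstream has more caps than the three shown.

```python
import math

def chunk_size(gap, threshold, budget_left):
    """Prefill tokens handed to one request this step (threshold 0 means no cap)."""
    cap = threshold if threshold > 0 else gap
    return min(gap, cap, budget_left)

def sweep(prompt_len=256, thresholds=(1, 16, 32, 64, 128, 256)):
    rows = []
    for t in thresholds:
        worst_victim_step = 1 + chunk_size(prompt_len, t, budget_left=10**9)
        prefill_steps = math.ceil(prompt_len / t)
        rows.append((t, worst_victim_step, prefill_steps))
    return rows

for threshold, worst, steps in sweep():
    print(f"threshold={threshold:>3}  worst victim step={worst:>3}  B prefill steps={steps:>3}")

# The budget is the other half of the dial: with 20 tokens of budget left,
# a 32-token threshold still yields only a 20-token chunk.
print(chunk_size(gap=256, threshold=32, budget_left=20))  # 20
```

Reading the rows top to bottom, the victim's worst step grows while B's prefill step count shrinks. That pair of columns is the trade the threshold sets.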

Going further

Compute p50/p99 of A's step costs for thresholds {0, 16, 32, 64, 128, 256} and plot both against B's prefill-step count. The p99 curve falls as the TTFT curve rises; where they cross for your tolerance is the tuning answer. You've reproduced Sarathi-Serve Figure-1-style analysis with a 30-line probe.
Make it a storm: five long prompts arriving on consecutive steps while two victims decode. Does the cap still hold per step? (It must — the budget binds the sum.) What happens to admission order? (Lab-01's FCFS + head-of-line rules, now visible in data.)
Add wall-clock: time each eng.step() (even toy steps have measurable cost) and check the correlation between your token proxy and real microseconds. Weak on a toy model, strong on a GPU — knowing when a proxy is valid is half of performance engineering.

References

Agrawal et al., Sarathi-Serve: Taming Throughput-Latency Tradeoff in LLM Inference (OSDI 2024) — the stall this lab stages, and the chunking cure, measured at scale: https://arxiv.org/abs/2403.02310
Agrawal et al., SARATHI (2023) — the original chunked-prefill paper: https://arxiv.org/abs/2308.16369
upstream/vllm/v1/core/sched/scheduler.py — long_prefill_token_threshold in the production clamp; the dial you just calibrated.
vLLM docs, Optimization and Tuning — official guidance on budget/threshold tuning: https://docs.vllm.ai/en/latest/configuration/optimization.html
Dean & Barroso, The Tail at Scale (CACM 2013) — the classic on why p99 beats mean, background for this lab's whole worldview: https://research.google/pubs/the-tail-at-scale/

Lab 03-06 — Prefix Caching: Count Every Token It Saves `[CPU-OK]`

Lab-03 showed you prefix caching through the real engine's meters — hit rates, throughput averages, wall-clock noise. This lab removes the noise. On the mini engine, you'll run the same shared-system-prompt workload with caching off and on and account for the savings to the exact token: 544 scheduled tokens uncached, 96 cached, difference 448 = 7 followers × 16 blocks × 4 tokens. Not approximately. Exactly. When you can predict a cache's benefit with integer arithmetic before running it, you understand the cache.

Why this lab exists

Three phases of machinery converge here, and this lab is where you check that you can predict their composition: Phase 2's block hashing and sharing (lab 02-05), this phase's admission path (get_computed_blocks → adopt → allocate, lab 03-01), and the scheduling identities from Phase 1 lab-04 (Σ scheduled = prompt + max_tokens − 1). If your prediction of the cached total is off by even one token, one of those three mental models has a crack in it — and the integers will tell you which (that's how the over-allocation bug mentioned in Phase 2 lab-05 was actually found: an exact count disagreed).

The professional skill is dimensioned estimation of cache value. "Enable prefix caching and things get faster" is advocacy. "This workload shares a 64-token block-aligned prefix across 8 requests, so caching eliminates exactly 7×64 = 448 of 544 scheduled prefill+decode tokens — an 82% compute reduction on this batch, and here's the count" is engineering. The GPU version (lab-03) gives you the wall-clock corroboration; this lab gives you the theorem.

Background: what "scheduled tokens" measures

The probe sums total_num_scheduled_tokens over every schedule() call — every token of forward-pass work the scheduler ever requested. It is the engine's compute odometer: prefill chunks, cache-miss remainders, decodes, everything. Two properties make it the right meter here:

It's conserved: work not scheduled is work not done. There is no place for savings to hide or double-count.
It's schedule-invariant in total: chunking and batching rearrange when tokens are scheduled, never how many (lab-02/05). Only caching changes the total — by replacing computed tokens with adopted KV. So the off/on difference isolates the cache's effect perfectly. Experimental design through invariants: pick a meter on which everything else you might accidentally vary is provably neutral.

Files

starter.py — implement run_and_count(...): the probe + generate, returning the odometer total and the outputs. Your work.
solution.py — reference.
test_lab.py — exact totals for both arms, the savings identity, output equality, and the share-nothing control.

Run

LAB_IMPL=starter pytest phase-03-continuous-batching-scheduler/labs/lab-06-prefix-cache-savings -q
pytest phase-03-continuous-batching-scheduler/labs/lab-06-prefix-cache-savings -q   # reference

What to implement

The Phase 1 lab-04 probe, reduced to an accumulator: wrap eng.scheduler.schedule, add up out.total_num_scheduled_tokens, run eng.generate(...) over the prompts (greedy, ignore_eos), return (total, token_ids_per_prompt). Ten lines. The thinking is in the test predictions — write those yourself on paper before running anything.
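The accumulator pattern looks like this on a stand-in object. A minimal sketch, assuming only what the paragraph above states (a schedule method whose result carries total_num_scheduled_tokens); FakeScheduler and attach_odometer are hypothetical names for illustration, and wiring this to mini_vllm's engine is still your work.

```python
from types import SimpleNamespace

class FakeScheduler:
    """Stand-in scheduler: each schedule() call reports a preset token count."""
    def __init__(self, tokens_per_step):
        self._steps = iter(tokens_per_step)

    def schedule(self):
        return SimpleNamespace(total_num_scheduled_tokens=next(self._steps))

def attach_odometer(scheduler):
    """Wrap scheduler.schedule so every call adds its token count to a running total."""
    counter = {"total": 0}
    original = scheduler.schedule

    def counted_schedule(*args, **kwargs):
        out = original(*args, **kwargs)
        counter["total"] += out.total_num_scheduled_tokens
        return out

    scheduler.schedule = counted_schedule  # instance-level patch; the class is untouched
    return counter

sched = FakeScheduler([65, 1, 1, 1])  # one 65-token prefill, then three decodes
odometer = attach_odometer(sched)
for _ in range(4):
    sched.schedule()
print(odometer["total"])  # 68
```

Patching the instance rather than the class keeps the probe local to one engine, so two arms of an experiment cannot contaminate each other's totals.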

The accounting, line by line

Workload: SYSTEM = "S"×64 (64 byte-tokens = exactly 16 full blocks at block_size 4 — alignment chosen deliberately), 8 prompts SYSTEM + str(i) (65 tokens, unique last token), max_tokens=4, greedy.

Caching off — every request pays full price:

per request: 65 (prefill) + 3 (decodes; the 4th token is sampled but never computed — Phase 1 lab-04)
           = 68
total      : 8 × 68 = 544

Caching on — the pioneer pays, the followers ride:

request 0   : 65 + 3 = 68      (cold cache: populates 16 block hashes during its prefill)
requests 1–7: 1 + 3  = 4 each  (!!)
total       : 68 + 7×4 = 96
savings     : 544 − 96 = 448 = 7 × 64

That 1 deserves a pause — it's three of this course's rules colliding in one token:

The follower's 65-token prompt hits all 16 full blocks → 64 tokens adopted free.
The hit cap (num_tokens − 1, Phase 2 lab-05) wouldn't bind here (64 ≤ 64), but the 65th token couldn't hit anyway: it's in a partial block (I3 — never cached) and it must be computed to produce logits (you need the model's output at the last position, and the cache stores only KV).
So the scheduler admits the request with num_computed = 64, schedules exactly 1 token, and — because 64 + 1 == 65 — that same step samples (needs_sample, Phase 1 lab-03). A one-token prefill that immediately emits: the strangest-looking line you'll see in a scheduler trace, and now you can explain it.
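The whole ledger fits in a few lines of integer arithmetic, which is a good way to write your prediction down before running the tests. A sketch using only the workload numbers above; scheduled_tokens is a hypothetical helper encoding the identity prompt − adopted + max_tokens − 1.

```python
BLOCK_SIZE, N, SHARED, PROMPT_LEN, MAX_TOKENS = 4, 8, 64, 65, 4

def scheduled_tokens(prompt_len, max_tokens, adopted=0):
    """Tokens one request schedules: uncached prompt tokens plus max_tokens - 1 decodes."""
    return (prompt_len - adopted) + (max_tokens - 1)

adopted = (SHARED // BLOCK_SIZE) * BLOCK_SIZE   # 16 full blocks = 64 tokens
pioneer = scheduled_tokens(PROMPT_LEN, MAX_TOKENS)
follower = scheduled_tokens(PROMPT_LEN, MAX_TOKENS, adopted)

off_total = N * pioneer
on_total = pioneer + (N - 1) * follower
savings = off_total - on_total

print(pioneer, follower)                 # 68 4
print(off_total, on_total, savings)      # 544 96 448
print(savings == (N - 1) * adopted)      # True
```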

Also notice when the followers hit: all 8 requests are admitted in the same schedule() call, yet requests 1–7 still hit blocks request 0 cached microseconds earlier — because mini_vllm (like upstream) registers blocks in the cache index at allocation time, inside the same admission loop. Caching is eager; sharing begins before the pioneer has computed a single value. (The KV contents don't exist yet — but the reservation is shared, and the prefill that fills it runs once. If that bends your brain, good; it's the detail most explanations skip.)

What the tests prove

Test	What it pins
`test_caching_off_pays_full_price_for_everyone`	The baseline identity: `8 × (65 + 3) = 544`, no cache, no surprises
`test_caching_on_computes_the_shared_prefix_once`	The cached total, exactly: `68 + 7×4 = 96` — every rule above, composed correctly
`test_savings_equal_followers_times_shared_full_blocks`	The savings identity `(N−1) × shared_full_block_tokens` — the formula you'll reuse to estimate cache value on any workload
`test_outputs_are_identical_with_and_without_caching`	Caching is a pure performance feature: same tokens out. (The cached KV is the KV — Phase 2 lab-06's identity, economically applied)
`test_unshared_prompts_save_nothing`	The control arm: distinct prompts share only a sliver of block-aligned prefix → savings < 25%. Caching is workload-dependent; anyone selling it flat-rate is selling

Hitchhiker's notes

The alignment was rigged, and you should notice. SYSTEM is exactly 16 blocks. Make it 66 tokens (16.5 blocks) and followers hit only 64 of 66 — the half-full block 17 recomputes for everyone, forever. On real tokenizers you don't control alignment, which is why measured hit rates hover below the naive prediction (lab-03's 93.7%) and why block_size enters cache math, not just memory math.
Map the integer totals to the GPU meters: hit rate ≈ adopted/looked-up = 7×64 / (some denominator including tails); prompt-throughput ratio ≈ 544/96 ≈ 5.7×, the same ballpark as the 4–5× lab-03 measured through wall-clock noise. Exact model + noisy measurement agreeing is how you validate both; either alone can fool you.
Why followers cost 4 while the pioneer costs 68 is the per-request view of the economics: a follower's marginal cost is its unique content plus decodes. System prompts become nearly free at the margin; what stays expensive is what's per-user. This inverts prompt-engineering economics — long, rich shared instructions are cheap; per-request context is what you trim. Product decisions hang on this inversion.
enable_caching=False exists for a reason — it's the control arm of every caching benchmark, and occasionally a production choice (e.g. strict multi-tenant isolation — see lab-03's security note). A feature you can't turn off is a feature you can't measure.
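The alignment note is easy to see in numbers. A small sketch, assuming a follower whose prompt is the shared prefix plus at least one unique token (so the num_tokens − 1 hit cap never binds) and the lab's block_size of 4; follower_cost is a hypothetical helper, not a mini_vllm function.

```python
def follower_cost(prefix_len, block_size=4, unique_suffix=1, max_tokens=4):
    """Return (tokens adopted from cache, tokens scheduled) for one follower."""
    hit = (prefix_len // block_size) * block_size   # only full blocks can hit
    prompt_len = prefix_len + unique_suffix
    return hit, (prompt_len - hit) + (max_tokens - 1)

rows = {p: follower_cost(p) for p in (64, 65, 66, 67, 68)}
for prefix_len, (hit, cost) in rows.items():
    print(f"prefix={prefix_len}  adopted={hit}  follower scheduled={cost}")
```

Prefixes of 65, 66, and 67 tokens all adopt the same 64 as the aligned case, and the follower pays for the tail every time; only at 68 does the next full block start hitting.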

Going further

Multi-turn: simulate a 5-turn conversation (each turn's prompt = previous prompt + previous output + new question) and predict, then measure, the per-turn scheduled tokens with caching on. You should see each turn pay only its delta. This is the chat-history result from lab-03's notes, now exact.
Eviction pressure: shrink num_blocks until followers stop hitting (the pioneer's blocks get evicted by the followers' own decode growth — Phase 2 lab-05's queue mechanics). Find the cliff; explain its location from pool arithmetic.
Derive the general formula: for N requests sharing a P-token prefix (block size B), savings = (N−1) × B × ⌊P/B⌋ ... except when the unique suffix is empty — then the hit cap (num_tokens − 1) bites and the formula needs a correction term. Write the corrected version and the test that proves it. (This edge — identical entire prompts — is exactly lab-03's 8-identical-prompts experiment.)

References

mini_vllm/scheduler.py — the WAITING-phase get_computed_blocks admission path you're metering (and the num_new_tokens == 0 → schedule 1 branch behind the one-token prefill).
mini_vllm/kv_cache.py::_cache_full_blocks — eager caching at allocation time.
upstream/vllm/v1/core/kv_cache_manager.py — the production twin of both.
vLLM docs, Automatic Prefix Caching — design doc: https://docs.vllm.ai/en/latest/design/prefix_caching.html
Phase 2 lab-05 — the block-level mechanics (ref counts, hit cap, revival) this lab meters through the scheduler.
Phase 3 lab-03 — the same experiment on real hardware, with wall-clocks attached.

Phase 03 — Exercises: Continuous Batching & Scheduler

Escalating from "explain it" to "design it." Staff-level = the last ones cold, citing the exact upstream/ line.

Warm-up (explain)

Restate, in your own words, the "no prefill/decode phase" idea. What two numbers on Request does the whole scheduler manipulate?
Why schedule RUNNING requests before admitting WAITING ones?
What three conditions must all hold to admit a waiting request? (guide §6 / deep-dive §2.)

Core (trace the code)

Walk the 4-line num_new_tokens clamp (scheduler.py:385–398). Name each of the caps and give a scenario where each one is the binding constraint.
Trace the while True preemption loop for: 3 running requests, 0 free blocks, FCFS policy. Who gets preempted, and what does _preempt_request do to them?
In mini_vllm, why does admission stop entirely (break) on the first allocate_slots == None, while the running phase retries after preempting? (Hint: different goals — make progress vs. don't over-admit.)

Build (extend your code)

Implement the PRIORITY policy in mini_vllm/scheduler.py: a priority on Request and a victim = max(running, key=lambda r:(r.priority, r.arrival_time)). Write a test where a high-priority late arrival preempts a low-priority running request.
Add a stats counter: total preemptions, average batch size, mean KV usage per step. Verify on lab-04's cramped run that preemptions > 0.
Implement swapping preemption: instead of num_computed_tokens = 0, move the request's blocks to a CPU list and restore on re-admit. Show output is still identical; discuss the cost difference vs recompute.

Design (staff-level)

A workload is 90% short chats (200-token prompts) and 10% long-doc summaries (16k prompts). Pick max_num_batched_tokens and long_prefill_token_threshold and justify with the latency impact on the short chats while the long prefills run.
You see frequent preemptions in production. List the three knobs you'd reach for (and the code/Phase each maps to) and the risk of each.
Continuous batching keeps the GPU full, but at very high concurrency latency degrades. Use Little's Law to explain the tradeoff and where you'd cap max_num_seqs.
Design admission control to prevent a request that can never fit (longer than the whole KV cache) from deadlocking the engine. What does real vLLM do? (Peek: FINISHED_IGNORED, check_enough_kv_cache_memory at kv_cache_utils.py:794.)

Self-grading

Exercises 5 and 10–13 (numbering the exercises 1–13 from the first warm-up question) are interview-grade. Could you whiteboard each in 5 minutes and name the file? If not, re-read the matching deep-dive section, then drill INTERVIEW.md.

Phase 03 — Interview Questions: Continuous Batching & Scheduler

Throughput questions live here. Cover the answer, attempt out loud, then compare. This and Phase 02 are the two topics to own cold.

Q1. What is continuous batching and why is it the biggest throughput win in LLM serving?

Model answer

Static batching runs a fixed batch to completion, so the GPU runs at the speed of the slowest request and finished requests waste their slot. Continuous batching re-decides the batch every single step (every token): the instant a request finishes, its slot is freed and a waiting request joins mid-flight. With mixed-length traffic (all real traffic) this keeps the GPU saturated continuously instead of idling on finished slots. It's purely a scheduling change — same kernels, same model — which is why it's such high leverage.
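If the interviewer wants numbers, a toy step count makes the point. A minimal sketch that models decode only, with a fixed number of slots and no prefill cost or memory limits; static_steps and continuous_steps are hypothetical helpers, and the request lengths are made up.

```python
from collections import deque

def static_steps(lengths, slots):
    """Fixed batches run to completion: each batch costs its longest request."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_steps(lengths, slots):
    """Re-decide the batch every step: a freed slot is refilled immediately."""
    waiting, running, steps = deque(lengths), [], 0
    while waiting or running:
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        steps += 1
        # Every running request emits one token; drop those that just finished.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps

lengths = [10, 2, 2, 2, 2, 2]  # tokens each request generates
print(static_steps(lengths, slots=2), continuous_steps(lengths, slots=2))  # 14 10
```

Same requests, same slots, same per-step work: the only change is when a slot is allowed to turn over.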

Q2. Explain the scheduler's core mental model.

Model answer

There's no "prefill phase" or "decode phase." Each request is just num_computed_tokens racing to catch up to num_tokens. Every step the scheduler hands out tokens so requests close that gap, under a global token budget. "Prefill" = far behind; "decode" = behind by one. This single rule covers chunked prefill (hand out part of the gap), prefix caching (start with the gap pre-closed), and speculative decoding (the gap includes draft tokens via num_tokens_with_spec) — all with no special cases. It's the comment at the top of Scheduler.schedule (scheduler.py:330).
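A whiteboard-sized version of that rule, as a sketch rather than the upstream code: Req and schedule_step are hypothetical, and the only caps modeled are the gap and the shared token budget.

```python
from dataclasses import dataclass

@dataclass
class Req:
    num_tokens: int
    num_computed_tokens: int = 0

def schedule_step(requests, token_budget):
    """Hand each request min(gap, remaining budget) tokens, in list order."""
    plan = {}
    for i, req in enumerate(requests):
        gap = req.num_tokens - req.num_computed_tokens
        n = min(gap, token_budget)
        if n > 0:
            plan[i] = n
            token_budget -= n
    return plan

reqs = [
    Req(num_tokens=11, num_computed_tokens=10),   # "decode": behind by one
    Req(num_tokens=100),                          # "prefill": far behind
    Req(num_tokens=100, num_computed_tokens=64),  # prefix-cache hit: gap pre-closed
]
plan = schedule_step(reqs, token_budget=48)
# A request samples this step iff the scheduled tokens close its gap exactly.
emits = [i for i, n in plan.items() if reqs[i].num_computed_tokens + n == reqs[i].num_tokens]
print(plan, emits)  # {0: 1, 1: 47} [0]
```

The decode, the chunked prefill, and the cache-hit request all go through the same three lines; the third request is simply out of budget this step and waits for the next one.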

Q3. What is chunked prefill and what problem does it solve?

Model answer

A long prompt's prefill, done in one step, would monopolize the step and stall every in-flight decode → inter-token-latency spikes for all current users. Chunked prefill splits the prefill across multiple steps under the per-step token budget (max_num_batched_tokens), so each step mixes a slice of the big prefill with ongoing decodes. It trades a bit of prefill throughput (more steps) for much better decode latency under load. Knob: long_prefill_token_threshold + the budget (scheduler.py:390).

Q4. How does prefix caching interact with the scheduler?

Model answer

When admitting a waiting request, the scheduler calls get_computed_blocks (scheduler.py:591), which asks the KV manager how many leading tokens are already cached (shared physical blocks from an earlier request with the same prefix). Those tokens count as already computed, so the request starts with num_computed_tokens > 0 and only prefills the unique remainder. For a shared system prompt across many users this is a massive throughput/memory win and the structural advantage behind multi-tenant serving. It rides on Phase 02's block sharing (touch + ref_cnt).

Q5. Walk me through what happens when a running request needs memory and there's none.

Model answer

allocate_slots returns None (Phase 02). The scheduler enters its preemption loop (scheduler.py:443): it picks a victim — under FCFS self.running.pop() (most recently admitted), under PRIORITY the worst (priority, arrival_time) — calls _preempt_request to free that request's KV blocks and send it back to waiting (to be recomputed later), then retries the allocation. If the only request left to preempt is the one we're trying to schedule, we give up on it this step. This None → preempt → retry handshake is what lets vLLM admit aggressively without OOM-crashing.

Q6. Preemption: recompute vs swap. Tradeoff?

Model answer

On preemption you can either recompute the KV later (replay prompt+generated tokens through prefill) or swap the KV blocks out to CPU memory and copy them back on resume. Recompute spends GPU compute (cheap-ish thanks to efficient prefill, no extra memory traffic off-GPU); swap spends PCIe bandwidth and CPU memory but avoids recomputation. Recompute usually wins for short sequences; swap can win for very long KV where recompute would be expensive. Either way, output is identical — preemption costs time, not correctness.

Q7. Why admit no new requests in a step where you preempted?

Model answer

A preemption means you're already out of KV memory. Admitting more work in the same step would immediately force more preemptions — thrashing. So the scheduler gates the waiting phase on "no preemptions this step" (scheduler.py:545; mini_vllm: not out.preempted_req_ids). It lets the system drain pressure before taking on more.

Q8. (Deep) How does speculative decoding ride this same scheduler with no special case?

Model answer

A request's num_tokens_with_spec includes proposed draft tokens, so the same num_new_tokens = num_tokens_with_spec - num_computed_tokens clamp naturally schedules the draft tokens to be verified, and num_lookahead_tokens reserves KV slots for them in allocate_slots. Acceptance/rejection is handled in update_from_output. The scheduler doesn't know or care that it's spec decode; it's just "tokens to compute," exactly as the top-of-function comment promised. (Full treatment: Phase 08.)
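The clamp is small enough to write out. This sketch assumes num_tokens_with_spec is simply the request's token count plus its pending draft tokens; the function and its parameters are illustrative, not the scheduler's signature.

```python
def num_new_tokens(num_tokens, num_draft_tokens, num_computed_tokens, token_budget):
    """Tokens to schedule this step; drafts are just more tokens to compute."""
    num_tokens_with_spec = num_tokens + num_draft_tokens   # assumed definition
    return min(num_tokens_with_spec - num_computed_tokens, token_budget)


print(num_new_tokens(101, 0, 100, 2048))    # plain decode: 1
print(num_new_tokens(101, 3, 100, 2048))    # decode + 3 drafts to verify: 4
print(num_new_tokens(5000, 0, 0, 2048))     # long prefill, clamped to the budget: 2048
```

The same expression covers prefill, decode, and verification; only the gap between the two counters differs.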

Rapid-fire

Two queues? waiting (deque/priority) and running (list).
The per-step token cap? max_num_batched_tokens → token_budget.
The concurrent-sequence cap? max_num_seqs → len(running) limit.
Who's scheduled first each step? Running, then waiting.
What does update_from_output do? Append sampled tokens, advance num_computed_tokens, reap finished requests (free KV).
A request emits a token iff? num_computed_tokens + num_scheduled == num_tokens (prefill fully caught up).

Phase 03 — Cheatsheet: Continuous Batching & Scheduler

The one-liner

Every token step, re-decide the batch: schedule RUNNING first, admit WAITING, under a token budget + seq-slot cap. Continuous batching, chunked prefill, prefix caching, preemption all fall out of "make num_computed_tokens catch up to num_tokens."

The master model

No prefill/decode phase. Request = (num_computed_tokens racing num_tokens). Prefill = far behind. Decode = behind by one. (scheduler.py:330)

schedule() shape

budget = max_num_batched_tokens
# A) RUNNING: n = clamp(num_tokens - num_computed, budget, threshold);
#    allocate_slots; None -> preempt running.pop(); retry; commit; budget -= n
# B) WAITING: while budget>0 and len(running)<max_num_seqs and not preempted:
#    get_computed_blocks (prefix cache) -> num_computed; clamp; allocate; None -> break; admit

The five invariants

a request is in exactly one of {waiting, running} while unfinished
sum(num_scheduled_tokens) <= max_num_batched_tokens
len(running) <= max_num_seqs
emits a token iff num_computed + num_scheduled == num_tokens
preempt frees KV + resets num_computed = 0 (recompute on re-admit)

Knobs (→ Phase 18)

max_num_batched_tokens — per-step token budget (chunked prefill granularity)
max_num_seqs — max concurrent running requests
long_prefill_token_threshold — per-request prefill chunk cap
enable_prefix_caching — share prefix KV across requests
scheduling policy — FCFS vs PRIORITY (preemption victim choice)

The Phase 02 ↔ 03 seam

Scheduler decides policy; KVCacheManager is truth. allocate_slots returns None on OOM → scheduler preempts + retries. Scheduler never touches blocks; KV manager never sets policy.

Key upstream

vllm/v1/core/sched/scheduler.py:329 — schedule()
:443 — preemption loop · :591 — prefix-cache head start
scheduler.py:1283 — update_from_output
vllm/v1/core/sched/output.py:181 — SchedulerOutput (New vs Cached request data)
vllm/v1/request.py:315 — RequestStatus

Gotchas

allocate_slots == None is normal control flow (drives preemption), not an error.
Admission stops on first OOM (break); running phase retries after preempting.
No admission in a step that preempted (avoid thrashing).
A request longer than the whole KV cache can never fit → ignored/aborted, not deadlock.

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

Phase 04 — The Hitchhiker's Guide to Attention Backends


Don't Panic

Attention is one mathematical operation. But there are a dozen hyper-tuned GPU kernels that compute it (FlashAttention, FlashInfer, Triton, FlashMLA, TRTLLM-GEN…), each best for some combination of hardware, model, and batch shape. vLLM hides them all behind one interface, picks the right one at startup, and feeds it the metadata it needs (the block tables from Phase 2). This phase is that interface and that choice — usually the single hottest kernel in decode, so it's where a lot of real performance wins and bugs live.

model's Attention layer  (one API)
        │  q, k, v
        ▼
   AttentionImpl  (the chosen backend: FlashAttention / FlashInfer / Triton / MLA / ...)
        │  + AttentionMetadata (block tables, seq lens, slot mapping)
        ▼
   the CUDA kernel  ── gathers paged KV via the block table, computes softmax(QKᵀ)V

Step 1: Why attention needs a special kernel (recap Phase 2)

A token attends to all earlier tokens, whose K/V live in scattered physical blocks (PagedAttention). So the kernel can't just multiply two contiguous matrices — it must, per token, look up physical_block = block_table[logical_block] and gather K/V from all over memory. It also must write this step's new K/V to the right slot (slot_mapping). Two pieces of metadata the scheduler/runner build and hand the kernel:

block table — where to read prior KV (logical → physical block).
slot mapping — where to write this step's new K/V.

Plus per-request sequence lengths so variable-length (varlen) batches pack together.
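Both maps come from the same index arithmetic. A minimal sketch, assuming the cache is addressed as a flat array of num_blocks × block_size slots; the function name slot_mapping is hypothetical.

```python
import numpy as np


def slot_mapping(block_table, positions, block_size):
    """Flat cache slot per token position: physical_block * block_size + offset."""
    block_table = np.asarray(block_table)
    positions = np.asarray(positions)
    physical = block_table[positions // block_size]   # logical -> physical lookup
    return physical * block_size + positions % block_size


block_table = [7, 2, 5]     # logical blocks 0, 1, 2 live in physical blocks 7, 2, 5
slots = slot_mapping(block_table, positions=[0, 3, 4, 9], block_size=4)
print(slots.tolist())       # [28, 31, 8, 21]
```

Reading prior KV uses the lookup for every earlier position; writing new KV uses it only for the positions scheduled in this step.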

Step 2: Why so many kernels?

The math is fixed; the fast way to do it depends on context:

FlashAttention — the classic: never materializes the full N×N attention matrix; streams K/V in tiles using online softmax (running max + rescale), so memory is O(N) not O(N²). Great general default.
FlashInfer — a library specialized for serving: paged KV, prefill+decode wrappers, fast for many small/decode requests; often wins at high concurrency.
Triton — kernels written in Triton (Python-ish DSL); portable, the fallback when a hand-tuned CUDA kernel isn't available for your case.
FlashMLA — for MLA (Multi-head Latent Attention), DeepSeek's design that compresses KV into a low-rank latent — different KV layout, so it needs its own kernel.
TRTLLM-GEN — NVIDIA TensorRT-LLM generated kernels, tuned for specific GPUs/precisions.

Different head dims, dtypes (fp16/bf16/fp8), features (sliding window, soft-cap, ALiBi), and hardware all shift which kernel is fastest or even available.

Step 3: The backend abstraction

vLLM factors attention into four roles (vllm/v1/attention/backend.py):

Role	Job
`Attention` layer	what the model calls (`q,k,v -> out`); backend-agnostic
`AttentionBackend`	names the impl + metadata classes for a kernel family
`AttentionImpl`	the actual `forward` that runs the kernel
`AttentionMetadataBuilder`	turns `SchedulerOutput` into the kernel's metadata (block tables, seq lens, slot mapping) each step

A selector (get_attn_backend, selector.py:52) picks the backend at startup from platform + dtype + head_dim + model features, overridable with VLLM_ATTENTION_BACKEND=FLASH_ATTN|FLASHINFER|TRITON_ATTN|.... The model never changes; only which AttentionImpl is plugged in.
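The four roles can be mimicked with toy classes to see why the model code never changes. Every class name below is hypothetical and the "kernel" just returns a string. For brevity the layer calls the metadata builder itself, whereas the table above has the builder working from SchedulerOutput each step.

```python
from dataclasses import dataclass


@dataclass
class ToyMetadata:                         # what the builder produces each step
    block_table: list
    seq_len: int


class ToyImpl:                             # role: AttentionImpl (runs the "kernel")
    name = "toy"

    def forward(self, q, kv_cache, meta):
        return f"{self.name}: attended over {meta.seq_len} tokens"


class ToyTritonImpl(ToyImpl):
    name = "toy-triton"


class ToyBackend:                          # role: AttentionBackend (names the classes)
    impl_cls = ToyImpl

    @staticmethod
    def build_metadata(block_table, seq_len):   # role: metadata builder
        return ToyMetadata(block_table, seq_len)


class ToyTritonBackend(ToyBackend):
    impl_cls = ToyTritonImpl


class ToyAttentionLayer:                   # role: the model-facing layer
    def __init__(self, backend):
        self.backend = backend
        self.impl = backend.impl_cls()

    def forward(self, q, kv_cache, block_table, seq_len):
        meta = self.backend.build_metadata(block_table, seq_len)
        return self.impl.forward(q, kv_cache, meta)


for backend in (ToyBackend, ToyTritonBackend):
    layer = ToyAttentionLayer(backend)
    print(layer.forward(q=None, kv_cache=None, block_table=[0], seq_len=3))
```

The call site is identical in both iterations; only the class handed to the constructor differs.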

Step 4: Online softmax (the FlashAttention trick), in one picture

You can't hold a 1×N attention row in fast SRAM for long N. So FlashAttention streams K/V in tiles and keeps a running result, rescaling as it goes:

m_old, denom, acc = -inf, 0, 0     # running max, running denominator, running output
for each tile of (K,V):
    s = q·Kᵀ_tile                  # scores for this tile
    m_new = max(m_old, max(s))     # running max (for numerical stability)
    correction = exp(m_old - m_new)
    acc = acc*correction + exp(s - m_new) · V_tile   # rescale old, add new
    denom = denom*correction + sum(exp(s - m_new))
    m_old = m_new                  # carry the max into the next tile
out = acc / denom

You'll implement exactly this in lab-01 (numpy, CPU) over a paged KV cache, and prove it equals plain dense attention. That single lab demystifies FlashAttention and PagedAttention's kernel side at once.
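Before adding paging, the recurrence can be checked on contiguous K/V. A minimal numpy sketch, with the 1/√d scaling left out to match the picture above; the function names and the tile size are illustrative.

```python
import numpy as np


def dense_attention(q, K, V):
    """Reference: build every score, softmax, weighted sum."""
    s = K @ q
    p = np.exp(s - s.max())
    return (p @ V) / p.sum()


def online_attention(q, K, V, tile=4):
    """Stream K/V in tiles, keeping only (running max, denominator, accumulator)."""
    m, denom, acc = -np.inf, 0.0, np.zeros(V.shape[1])
    for start in range(0, K.shape[0], tile):
        s = K[start:start + tile] @ q
        m_new = max(m, s.max())
        corr = np.exp(m - m_new)          # rescales everything accumulated so far
        p = np.exp(s - m_new)
        acc = acc * corr + p @ V[start:start + tile]
        denom = denom * corr + p.sum()
        m = m_new
    return acc / denom


rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(10, 8)), rng.normal(size=(10, 8))
print(np.allclose(online_attention(q, K, V), dense_attention(q, K, V)))   # True
```

Ten keys with a tile of four also exercises a short final tile.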

The invariants to memorize

Attention is one op; the backend is which kernel computes it. Model code is backend-agnostic.
The kernel needs block table (read map), slot mapping (write map), seq lens (varlen).
Online softmax makes attention O(N) memory and is why "Flash" kernels exist.
Backend is chosen at startup (selector) and overridable via VLLM_ATTENTION_BACKEND.
MLA models need MLA-specific backends (different KV layout).

What you'll do

Read: 01-deep-dive.md — the Attention layer, the backend base classes, the selector, and FlashAttentionImpl/its metadata builder, all line-anchored.
Build: 02-mini-build.md — paged attention with online softmax in numpy.
Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
- lab-01-paged-attention-gather [CPU-OK] — implement online-softmax attention over a paged KV cache; prove it equals dense attention.
- lab-02-backend-selection [GPU-OPT] — read the selector, build the (GPU, dtype, model) → backend matrix, verify with env overrides (captured output).
- lab-03-causal-prefill-attention [CPU-OK] — the prefill kernel shape: M queries, causal loop bounds, start_pos offsets; prove chunked prefill == one-shot at the attention layer.
- lab-04-flash-decoding-partitions [CPU-OK] — split-KV decode: attention state as a mergeable (max, denom, acc) triple; equality with dense for any partition count/order.
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.


Phase 04 — Deep Dive: the attention backend system

Paths relative to upstream/ at v0.22.1 @ 0decac0. The attention stack lives across:

vllm/model_executor/layers/attention/attention.py   the Attention nn.Module (model-facing)
vllm/v1/attention/backend.py                         AttentionBackend / AttentionImpl /
                                                      AttentionMetadataBuilder base classes
vllm/v1/attention/selector.py                        get_attn_backend (the picker)
vllm/v1/attention/backends/flash_attn.py             a complete backend, end to end
vllm/v1/attention/backends/{flashinfer,triton_attn,mla/}.py   other families
vllm/v1/attention/backends/registry.py               name -> backend mapping

1. The model-facing `Attention` layer

vllm/model_executor/layers/attention/attention.py:177 — class Attention(nn.Module, AttentionLayerBase). This is what LlamaAttention.forward called in Phase 0 (self.attn(q, k, v)). Its __init__ (:189) resolves the backend (via the selector) and instantiates an AttentionImpl; its forward (:437) hands q,k,v to that impl. The model talks only to this class — it never knows which kernel runs. That decoupling is the whole point: swap the kernel, the model is untouched.

2. The base classes: `backend.py`

vllm/v1/attention/backend.py defines the contract every kernel family implements:

AttentionBackend — static methods naming the impl class, the metadata class, supported head sizes/dtypes, and the KV cache shape.
AttentionImpl — the forward(q, k, v, kv_cache, attn_metadata) -> out that runs the kernel (writes new K/V to the cache via slot_mapping, reads prior KV via the block table).
AttentionMetadataBuilder — build(...) turns the per-step scheduler info (sequence lengths, block tables, slot mapping) into the typed metadata the kernel wants.

This three-part split (Backend names it, Impl runs it, Builder feeds it) repeats across every backend file.

3. A complete backend: FlashAttention

vllm/v1/attention/backends/flash_attn.py:

class FlashAttentionBackend(AttentionBackend) (:68) — the registry entry; declares the impl, metadata, and supported configs.
class FlashAttentionMetadata (:223) — the per-step data the kernel needs (block table, seq lens, slot mapping, scheduling for varlen).
class FlashAttentionMetadataBuilder(AttentionMetadataBuilder[...]) (:276) — builds that metadata from the model runner's inputs each step. This is the bridge from Phases 2/3 to the kernel: the block tables you allocated and the scheduled token counts become kernel arguments here.
class FlashAttentionImpl(AttentionImpl) (:592) — forward calls the FlashAttention CUDA kernel (via vllm-flash-attn/flash-attn), passing the paged KV cache + metadata.

Read FlashAttentionImpl.forward and find where it (a) writes the new k,v into the KV cache using slot_mapping, and (b) calls the varlen flash-attn function with the block table. Those two calls are the read/write maps from the guide, live.

4. The selector: who picks the backend

vllm/v1/attention/selector.py:52 — def get_attn_backend(...). It considers the platform (current_platform, Phase 17), dtype, head size, whether the model uses MLA / sliding window, and the VLLM_ATTENTION_BACKEND env override, then returns the backend class. _cached_get_attn_backend (:106) memoizes it. The platform files (vllm/platforms/cuda.py, rocm.py, cpu.py) provide the per-hardware default — which is why the same model picks FlashAttention on an A100, a Triton or FlashInfer path elsewhere, and a CPU kernel on a laptop (Phase 17).

5. MLA — when the KV layout itself changes

vllm/v1/attention/backends/mla/ holds the MLA backends. MLA (DeepSeek) compresses K/V into a low-rank latent vector, so the KV cache stores something different and needs its own kernel (FlashMLA). This is why "add a model" (Phase 14) sometimes means "wire up a different attention backend" — the model's attention design dictates the KV layout dictates the kernel.

Reading checklist

Attention.forward — what does the model pass, and what does it NOT know?
The three base classes in backend.py — Backend vs Impl vs MetadataBuilder.
In FlashAttentionMetadataBuilder.build — which Phase 2/3 outputs become kernel metadata?
In FlashAttentionImpl.forward — find the KV write (slot_mapping) and the paged read (block table).
get_attn_backend — name three factors that change the chosen backend.

Now build it: 02-mini-build.md, then the labs.

Phase 04 — Mini-Build: paged attention with online softmax

You'll implement the heart of a "Flash"-style attention kernel in numpy — online softmax over a paged KV cache — and prove it equals plain dense attention. This single build demystifies both FlashAttention (the streaming softmax) and PagedAttention's kernel side (the block-table gather) at once.

The task (lab-01)

Given:

a query vector q (one decode step, one head): shape (d,),
a paged KV cache k_cache, v_cache: shape (num_blocks, block_size, d),
a block_table: list[int] mapping logical→physical block,
a seq_len (valid tokens),

compute attention(q) = softmax(q·Kᵀ / √d) · V, where K/V are gathered through the block table (token t lives at block_table[t // block_size], offset t % block_size), using the online softmax recurrence (running max + rescale) so you never build the full score vector.

Implement two functions and show they match:

dense_attention(q, K, V) — the reference (build all scores, softmax, weighted sum).
paged_online_attention(q, k_cache, v_cache, block_table, seq_len) — block-table gather + online softmax, processed block by block.

The online-softmax recurrence (from the guide)

m, denom, acc = -inf, 0, zeros(d)
for each block (gathered via block_table, up to seq_len):
    s = (q · Kblockᵀ) / sqrt(d)            # scores for this block's tokens
    m_new = max(m, s.max())
    corr = exp(m - m_new)
    acc = acc*corr + (exp(s - m_new) @ Vblock)
    denom = denom*corr + exp(s - m_new).sum()
    m = m_new
return acc / denom
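One possible shape for the two functions, as a sketch rather than the repository's solution.py. It relies on numpy evaluating exp(-inf) as 0.0 for the first block instead of branching, and the demo fills unused cache rows with a large value so that reading past seq_len would visibly break the comparison.

```python
import numpy as np


def dense_attention(q, K, V):
    s = (K @ q) / np.sqrt(q.shape[0])
    p = np.exp(s - s.max())
    return (p @ V) / p.sum()


def paged_online_attention(q, k_cache, v_cache, block_table, seq_len):
    block_size, d = k_cache.shape[1], q.shape[0]
    m, denom, acc = -np.inf, 0.0, np.zeros(d)
    for logical, start in enumerate(range(0, seq_len, block_size)):
        valid = min(block_size, seq_len - start)       # partial last block
        phys = block_table[logical]                    # logical -> physical block
        Kb, Vb = k_cache[phys, :valid], v_cache[phys, :valid]
        s = (Kb @ q) / np.sqrt(d)
        m_new = max(m, s.max())
        corr = np.exp(m - m_new)                       # exp(-inf) == 0.0 on block 0
        p = np.exp(s - m_new)
        acc = acc * corr + p @ Vb
        denom = denom * corr + p.sum()
        m = m_new
    return acc / denom


rng = np.random.default_rng(0)
d, block_size, seq_len, block_table = 8, 4, 13, [3, 1, 7, 0]
q = rng.normal(size=d)
K, V = rng.normal(size=(seq_len, d)), rng.normal(size=(seq_len, d))
k_cache = np.full((8, block_size, d), 1e3)             # poison: unread rows must not matter
v_cache = np.full((8, block_size, d), 1e3)
for t in range(seq_len):                               # scatter tokens through the table
    k_cache[block_table[t // block_size], t % block_size] = K[t]
    v_cache[block_table[t // block_size], t % block_size] = V[t]

out = paged_online_attention(q, k_cache, v_cache, block_table, seq_len)
print(np.allclose(out, dense_attention(q, K, V)))      # True
```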

Definition of done

pytest phase-04-attention-backends/labs -q

The test asserts paged_online_attention ≈ dense_attention within tolerance, for non-block-aligned seq_len (so you handle the partial last block), and that scattering the logical blocks to arbitrary physical ids doesn't change the result (that's the whole point of paging).

Map to the real engine

your numpy	real vLLM
`block_table` gather	the block table fed to `FlashAttentionImpl` (`flash_attn.py:592`)
online softmax	the FlashAttention/FlashInfer kernels
`seq_len` partial block	varlen handling in the metadata builder (`flash_attn.py:276`)
dense reference	what a naive (pre-Flash) kernel did, O(N²) memory

Phase 04 Labs — Attention Backends

Four labs that take you inside the kernels the scheduler commands. The arc: build the decode kernel's algorithm (lab-01), widen it to the prefill shape with causal bounds (lab-03), parallelize it with the mergeable-state trick (lab-04), then step back and map the stable of production backends and the selector that picks between them (lab-02).

Recommended order: 01 → 03 → 04 → 02. (Directory numbers predate labs 03–04: algorithm first, then its two extensions, then the dispatcher.) CPU labs follow the standard contract — starter.py (your work), solution.py (reference), test_lab.py (the spec); default runs the solution, LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-04-attention-backends/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-04-attention-backends/labs/lab-01-paged-attention-gather -q

Labs

lab-01-paged-attention-gather `[CPU-OK]`

The fusion lab: online-softmax (FlashAttention's running max / denominator / accumulator recurrence) over a paged KV cache (PagedAttention's block-table gather), in ~25 lines of numpy, proven equal to dense attention — including the partial-last-block bound and the m = −inf first-block edge. This is the semantics of paged_attention_v1.cu, and the foundation labs 03 and 04 build on. Skills: the recurrence and why it's exact; the rescaling correction factor; mapping your variables onto the CUDA kernel's.

lab-02-backend-selection `[GPU-OPT]`

Run the selector, override it (VLLM_ATTENTION_BACKEND), read get_attn_backend (selector.py:52), and build the (GPU, dtype, model) → backend matrix — including why MLA models force a backend while sliding windows merely filter candidates. Captured output included for the GPU-less. Skills: the two-run kernel-bisection habit; backends differ in the last ulp legitimately; why selection is startup-time configuration.

lab-03-causal-prefill-attention `[CPU-OK]`

The prefill shape: M queries starting at start_pos, each attending over exactly its causal prefix — where the mask degenerates into a loop bound and chunked prefill becomes just "queries that don't start at zero." The payoff test proves chunked ≡ one-shot in attention outputs (Phase 3 lab-02's theorem, at the layer that enforces it), and a poisoned-future test makes causality violations deafening. Skills: decode vs prefill as loop shapes; start_pos/query_start_loc metadata; why prefill is compute-bound in this very loop nest.
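A compact sketch of the "mask as a loop bound" idea and of the chunked ≡ one-shot claim at the attention layer. It uses ordinary (not online) softmax per row to stay short; causal_attention and its start_pos parameter are illustrative, not the lab's required signature.

```python
import numpy as np


def causal_attention(Q, K, V, start_pos=0):
    """Row i sits at absolute position start_pos + i and sees keys [0, start_pos + i]."""
    out = np.empty((Q.shape[0], V.shape[1]))
    for i, q in enumerate(Q):
        end = start_pos + i + 1                   # the causal mask as a loop bound
        s = (K[:end] @ q) / np.sqrt(q.shape[0])
        p = np.exp(s - s.max())
        out[i] = (p @ V[:end]) / p.sum()
    return out


rng = np.random.default_rng(0)
n, d = 10, 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

one_shot = causal_attention(Q, K, V)
chunked = np.vstack([
    causal_attention(Q[:6], K[:6], V[:6], start_pos=0),   # first chunk: 6 queries
    causal_attention(Q[6:], K, V, start_pos=6),           # second chunk starts at 6
])
print(np.allclose(one_shot, chunked))             # True
```

The second chunk only works because the first chunk's K/V are already in the cache, which is exactly what the scheduler's num_computed_tokens bookkeeping guarantees.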

lab-04-flash-decoding-partitions `[CPU-OK]`

The parallelism lab: attention state compresses to a mergeable (max, denom, unnormalized-acc) triple, so a 128k-token decode can be split across partitions computed independently and merged exactly — any partition count, any merge order, any tree shape, all 1e-12-equal to dense. This is paged_attention_v2, flash-decoding, FlashInfer split-k, and (stretched across GPUs) Phase 10's context parallelism. Skills: the attention monoid; never normalize a partial; why long-context decode is where backends differ.
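The merge itself is a few lines. A minimal sketch, assuming each partition reports an unnormalized (max, denom, acc) triple; the names partial_state, merge, and split_attention are hypothetical.

```python
from functools import reduce

import numpy as np


def partial_state(q, K, V):
    """Unnormalized attention state (max, denom, acc) over one slice of the keys."""
    s = (K @ q) / np.sqrt(q.shape[0])
    m = s.max()
    p = np.exp(s - m)
    return m, p.sum(), p @ V


def merge(a, b):
    """Combine two states by rescaling both onto the larger max."""
    (ma, da, acca), (mb, db, accb) = a, b
    m = max(ma, mb)
    ca, cb = np.exp(ma - m), np.exp(mb - m)
    return m, da * ca + db * cb, acca * ca + accb * cb


def split_attention(q, K, V, bounds):
    states = [partial_state(q, K[lo:hi], V[lo:hi]) for lo, hi in bounds]
    m, denom, acc = reduce(merge, states)
    return acc / denom                        # normalize once, at the very end


rng = np.random.default_rng(0)
n, d = 20, 8
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
dense = split_attention(q, K, V, [(0, n)])
three = split_attention(q, K, V, [(0, 7), (7, 8), (8, n)])
shuffled = split_attention(q, K, V, [(8, n), (0, 7), (7, 8)])
print(np.allclose(dense, three), np.allclose(dense, shuffled))   # True True
```

Because merge is associative and commutative up to rounding, the partitions can be reduced in any order or tree shape.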

What you can do after this phase

Read any attention backend in vllm/v1/attention/backends/ and find the three things that are always there: the streaming recurrence (lab-01), the shape/metadata handling for prefill vs decode (lab-03), and the reduction strategy (lab-04). Diagnose a kernel suspicion with the backend-override bisection (lab-02), predict which backend a deployment runs before it starts, and explain to a colleague why paged + flash + split-KV compose without approximation. Phase 5 freezes these kernels into CUDA graphs; Phase 7 goes below them into GEMMs.

Lab 04-01 — Paged Attention with Online Softmax `[CPU-OK]`

This lab is where the two most important kernel ideas in LLM inference fuse into one function. From PagedAttention (Phase 2): K/V live in scattered physical blocks, reached through a block table. From FlashAttention: you never materialize the full score row — you stream the keys and maintain a running softmax. Put them together and you have, in ~25 lines of numpy, the algorithm at the heart of every decode kernel vLLM ships: paged_attention_v1.cu, the Triton fallbacks, FlashInfer's decode path. When the tests pass, you don't "know about" these kernels anymore — you've written their semantics.

Did Phase 2 lab-06 already? Good — that was the gather with ordinary softmax. This lab replaces the softmax with the online recurrence, the part that makes the streaming exact. Different load-bearing idea, same scaffolding, deliberately.

Why this lab exists

Naive attention computes all N scores, softmaxes the row, then blends N value rows. That's three passes over data that, for a long context, doesn't fit in any fast memory — on a GPU it means writing an O(N) score row to HBM and reading it back, twice, in the hottest loop of the entire system. FlashAttention's insight is that softmax can be computed in one streaming pass with O(1) extra state, if you're willing to rescale history every time you discover a new maximum. That rescaling trick — three running quantities and a correction factor — is the single most important piece of kernel math in this field, and the only way to actually own it is to implement it and watch it match the naive answer to 1e-6 on inputs where a wrong correction factor would diverge wildly.

The phase needs this lab as its foundation: lab-03 runs this recurrence per query row (prefill), lab-04 proves it's a mergeable monoid (flash-decoding), and the deep-dive's tour of real backends assumes you can see this loop inside every one of them.

Background: the recurrence

You hold three things while streaming key blocks: m (max score so far), denom (sum of exp(score − m) so far), acc (sum of exp(score − m) · v so far — unnormalized). For each new block with scores s:

m_new  = max(m, max(s))
corr   = exp(m − m_new)            # how much history shrinks under the new max
p      = exp(s − m_new)            # new block's weights, on the new scale
acc    = acc · corr + p @ V_block
denom  = denom · corr + sum(p)
m      = m_new

Final answer: acc / denom. Why it's exact (not an approximation): every exp(s_i) you ever wanted appears in the final sums multiplied by exp(−m_final) — the corrections compose so each term is rescaled from whatever max it was added under to the final max. It's a telescoping product, and the only thing subtraction-by-max changes is overflow behavior, never the ratio. The same algebra is why the state merges across partitions in lab-04 — write it out once by hand for two blocks and the whole phase unlocks.
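The two-block case written out, using notation introduced here for illustration: block 1 has scores $s_i$ with values $v_i$ and max $m_1$, block 2 has scores $t_j$ with values $w_j$, and $m_2 = \max(m_1, \max_j t_j)$.

```latex
\begin{aligned}
\text{after block 1:}\quad
  & \mathrm{denom}_1 = \sum_i e^{s_i - m_1},
  \qquad \mathrm{acc}_1 = \sum_i e^{s_i - m_1}\, v_i \\
\text{correction:}\quad
  & \mathrm{corr} = e^{m_1 - m_2} \\
\text{after block 2:}\quad
  & \mathrm{denom}_2 = \mathrm{corr}\cdot\mathrm{denom}_1 + \sum_j e^{t_j - m_2}
    = \sum_i e^{s_i - m_2} + \sum_j e^{t_j - m_2} \\
  & \mathrm{acc}_2 = \mathrm{corr}\cdot\mathrm{acc}_1 + \sum_j e^{t_j - m_2}\, w_j
    = \sum_i e^{s_i - m_2}\, v_i + \sum_j e^{t_j - m_2}\, w_j \\
\text{output:}\quad
  & \frac{\mathrm{acc}_2}{\mathrm{denom}_2}
    = \frac{e^{-m_2}\left(\sum_i e^{s_i} v_i + \sum_j e^{t_j} w_j\right)}
           {e^{-m_2}\left(\sum_i e^{s_i} + \sum_j e^{t_j}\right)}
    = \frac{\sum_i e^{s_i} v_i + \sum_j e^{t_j} w_j}{\sum_i e^{s_i} + \sum_j e^{t_j}}
\end{aligned}
```

The common factor $e^{-m_2}$ cancels, leaving exactly the dense softmax-weighted sum over both blocks.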

The paged part you know from Phase 2: token t is at k_cache[block_table[t // block_size], t % block_size], and the last block of a sequence is usually partial — read only seq_len − start rows of it.

Files

starter.py — dense_attention (the slow truth) and paged_online_attention (the streaming, gathered version). Your work.
solution.py — reference.
test_lab.py — equality with dense for aligned and ragged lengths, and paging invariance.

Run

LAB_IMPL=starter pytest phase-04-attention-backends/labs/lab-01-paged-attention-gather -q
pytest phase-04-attention-backends/labs/lab-01-paged-attention-gather -q   # reference

What to implement

Write dense_attention first and convince yourself it's correct: it's your oracle, and the entire discipline of kernel work is to never port what you haven't first proven in its slow form. Then write the streaming version per the recurrence above, iterating logical blocks that cover [0, seq_len). The two classic stumbles, both covered by tests:

The first-block edge: m starts at −inf, so corr = exp(−inf − m_new) must come out as 0, not NaN. Guard it (the solution branches on m != -inf).
The partial last block: valid = min(block_size, seq_len − start). Read one row too many and you're attending over uninitialized cache — the bug that "almost works" (Phase 2 lab-06 poisoned the padding to make this loud; here the random zeros are quiet but the 1e-6 equality still catches it).

What the tests prove

Test	What it pins
`test_matches_dense_block_aligned`	The recurrence itself: 16 tokens, 4 scattered blocks (`[3, 1, 7, 0]`), equal to dense within 1e-6. A wrong `corr` doesn't fail subtly — softmax weights are exponential in the error, so divergence is loud
`test_matches_dense_partial_last_block`	13 tokens = 3 full + 1 single-token block: the `valid` bound
`test_paging_invariance`	Same logical sequence at physical placements `[0,1,2]` vs `[7,3,5]` → identical output. The block table is the only coupling between logical and physical — Phase 2's identity theorem, restated where the math happens

Hitchhiker's notes

Map your variables to the CUDA kernel: in paged_attention_v1.cu, your m is qk_max (computed via warp/block reductions instead of max()), your denom is exp_sum, your acc lives in registers as accs, and your gather is the block_table-indexed pointer arithmetic in the main loop. Read the kernel right after finishing — it's ~400 lines of which you now understand the load-bearing 40; the rest is vectorized loads, shared-memory staging, and reduction plumbing (the "95% performance engineering" of Phase 2 lab-04).
Why subtract the max at all, again? exp(90) overflows float32. Logit ~90 is not exotic — it's a confident model with a sharp head. Unprotected softmax is a NaN factory; subtraction-by-max makes every exponent ≤ 0. The online version just maintains that protection without knowing the max in advance — that's the whole cleverness.
One query here, many heads in reality: real decode runs this once per (sequence, KV-head) with the query being that head's slice — heads are embarrassingly parallel and share nothing (Phase 2 lab-06's per-head loop). GQA means several query heads stream the same K/V blocks — bandwidth amortization inside the kernel, one more reason GQA wins (Phase 0 lab-02).
Numerics note for the tests' 1e-6: float64 throughout, so the tolerance is generous — it's calibrated to catch algorithmic error (a missing corr, an off-by-one), not rounding. In fp16 kernels the same comparison runs at 1e-2 with fp32 accumulators (Phase 2 lab-04's gate); the tolerance always encodes what you're testing for.
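The overflow note above is easy to reproduce. A small float32 demonstration, with logits chosen only for illustration:

```python
import numpy as np

logits = np.array([90.0, 89.0, 10.0], dtype=np.float32)

# Unprotected softmax: exp(90) and exp(89) overflow float32 to inf, and inf / inf is nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.exp(logits).sum()

# Subtract the max first: every exponent is <= 0, so nothing can overflow.
shifted = np.exp(logits - logits.max())
stable = shifted / shifted.sum()

print(np.isnan(naive).any())      # True
print(np.isfinite(stable).all())  # True
```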

Going further

Hand-trace two blocks with two tokens each on paper, with block 2's max larger than block 1's — watch corr shrink the history. Then once with block 2's max smaller — watch corr = 1 and nothing rescale. The recurrence has exactly these two behaviors.
Delete the corr factor and run the tests: the aligned test fails with weights skewed toward earlier blocks (their terms were added under a smaller max and are never shrunk onto the final scale). Now you know this failure's signature, which is useful the day you review a kernel PR that gets it almost right.
Batch it: take a list of (q, block_table, seq_len) and loop — you've built paged_attention_v1's grid (one program per sequence per head). Then go to lab-04 to split within a sequence, and lab-03 to widen to query chunks.

References

Milakov & Gimelshein, Online normalizer calculation for softmax (2018) — the recurrence, 3 readable pages: https://arxiv.org/abs/1805.02867
Dao et al., FlashAttention (2022) — the recurrence + tiling + IO analysis: https://arxiv.org/abs/2205.14135
upstream/csrc/attention/paged_attention_v1.cu — qk_max, exp_sum, the gather: your function in CUDA.
upstream/vllm/v1/attention/backends/flash_attn.py:592 — where the real engine hands block tables and slot mappings to the kernel (find both; Phase 2 lab-06 explains the write side).
02-mini-build.md — the recurrence derived step by step.

Lab 04-02 — Backend Selection Matrix `[GPU-OPT]`

vLLM doesn't have an attention kernel; it has a stable of them — FlashAttention, FlashInfer, Triton, FlashMLA, TRTLLM-GEN, per-platform CPU/ROCm/TPU variants — and a selector that picks one at startup based on your GPU, dtype, model architecture, and features. That choice is invisible when it's right and bewildering when it's wrong, and "wrong" here means anything from a 20% throughput gap to a crash on an exotic head size. In this lab you run the selector, override it, read its source, and build the (GPU, dtype, model) → backend table that lets you answer — from memory, in an incident — "which kernel is this deployment actually running, and what else could it run?"

No GPU? Don't panic. The captured output below is the experiment; the selection logic (selector.py:52) is the lesson, and it reads the same on a laptop.

Why this lab exists

Every component you've studied so far had one implementation. Attention is where vLLM becomes a dispatcher, and dispatchers are where production surprises live: the same model, same config, same vLLM version runs different kernels on an A100 vs an H100 vs an RTX 4090 — different performance, different numerics in the last ulp, occasionally different bugs. When a user reports "works on my machine, garbage on the cluster," the backend matrix is the first thing a maintainer checks, and VLLM_ATTENTION_BACKEND is the first bisection tool they reach for. This lab is that reflex, installed.

It's also your map for the rest of the phase: the deep-dive walks the backends' implementations; this lab establishes which of them you're ever actually running and what forces the exceptions (MLA models, head sizes, dtypes, platforms).

Background: why so many backends

Because "attention" is several workloads wearing one name, and the optimal kernel differs per (shape × hardware × feature):

FlashAttention (FA2/FA3) — the battle-tested default for standard transformers on NVIDIA; hand-tuned prefill and decode paths, broad feature support. FA3 exploits Hopper-specific hardware (TMA, warpgroup MMA), which is why the GPU generation enters the selector.
FlashInfer — plan-based kernels with strengths vLLM's defaults lack in places: cascade/shared-prefix attention (lab-04's merge!), aggressive split-k, customizable masking. Often the win for high-concurrency or shared-prefix workloads — measure, don't assume (Phase 18).
Triton backend — portable, readable, JIT-compiled; the fallback when the hand-written kernels lack your head size/feature combo, and the reference implementation you can actually modify (it's the closest production cousin of your lab-01 code).
FlashMLA / TRTLLM-GEN — DeepSeek-style MLA models compress KV into a low-rank latent; the cache layout itself is different, so standard kernels can't read it at all. Architecture doesn't just prefer a backend — it can force one.
Platform backends (CPU, ROCm, TPU — Phase 17) — different ISAs entirely.

The selector (get_attn_backend, upstream/vllm/v1/attention/selector.py:52) resolves: explicit override → platform default chain → capability checks (dtype, head size, sliding window, MLA) → fallback. Selection happens once, at startup — the backend's metadata builder and CUDA-graph shapes (Phase 5) are baked for the engine's lifetime.
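The resolution order can be pictured as a chain of early returns. This is a toy, not a transcription of selector.py: the rules, the MLA_BACKENDS set, the fast_head_sizes default, and the platform default names are all made up for illustration. Only the ordering (override, platform, capability checks, fallback) follows the paragraph above.

```python
MLA_BACKENDS = {"FLASHMLA"}        # illustrative: backends that read the latent KV layout


def pick_backend(override=None, platform="cuda", uses_mla=False, head_size=128,
                 fast_head_sizes=(64, 96, 128, 256)):
    """Toy resolution order: override, platform default, capability checks, fallback."""
    if override is not None:
        if (override in MLA_BACKENDS) != uses_mla:
            raise ValueError(f"{override} does not match this model's KV layout")
        return override                        # explicit override wins
    if platform != "cuda":
        return f"{platform.upper()}_DEFAULT"   # the platform supplies its own default
    if uses_mla:
        return "FLASHMLA"                      # architecture forces an MLA backend
    if head_size in fast_head_sizes:
        return "FLASH_ATTN"                    # hand-tuned default supports this shape
    return "TRITON_ATTN"                       # portable fallback


print(pick_backend())                          # FLASH_ATTN
print(pick_backend(uses_mla=True))             # FLASHMLA
print(pick_backend(head_size=80))              # TRITON_ATTN
print(pick_backend(override="TRITON_ATTN"))    # TRITON_ATTN
```

Forcing an MLA backend on a non-MLA model raises in this toy, mirroring the "read the error" exercise in the steps below.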

Requirements

uv pip install -e ".[vllm]"

Steps

Let vLLM pick (read the startup line naming the backend):

python -c "from vllm import LLM; LLM(model='facebook/opt-125m', gpu_memory_utilization=0.4)"

Force alternatives and confirm the engine obeys:

VLLM_ATTENTION_BACKEND=FLASHINFER  python -c "from vllm import LLM; LLM(model='facebook/opt-125m', gpu_memory_utilization=0.4)"
VLLM_ATTENTION_BACKEND=TRITON_ATTN python -c "from vllm import LLM; LLM(model='facebook/opt-125m', gpu_memory_utilization=0.4)"

Also try forcing something invalid for your setup (e.g. FLASHMLA on a non-MLA model) and read the error — the selector's failure messages are part of its interface, and you want to have seen them before an incident shows them to you.

Read the source next to the log: selector.py:52 (get_attn_backend) and the platform default chain in upstream/vllm/platforms/cuda.py. For your GPU + dtype + two or three models, predict the choice before running — the lab is passed when your predictions stop missing.

Captured output (real run, L4, vLLM 0.22.1, trimmed)

# default:
INFO ... Using Flash Attention backend.
# VLLM_ATTENTION_BACKEND=FLASHINFER:
INFO ... Using FlashInfer backend.
# VLLM_ATTENTION_BACKEND=TRITON_ATTN:
INFO ... Using Triton backend.
# a DeepSeek (MLA) model, default:
INFO ... Using FlashMLA backend.       # MLA models force an MLA backend (different KV layout)

One line, easily scrolled past — but it names the code that will execute the hottest loop of the deployment several thousand times per second. Operators should log-grep for it on every rollout; version upgrades do change defaults, silently (selection logic and backend names both drift across releases — anchor on the mechanism, not the strings).
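A rollout check for that line can be a few lines of Python. This is a sketch under one loud assumption: the pattern matches the `Using ... backend.` wording from the captured output above, and since those strings drift across releases you should re-derive the pattern from your own logs. `backend_from_log` is a hypothetical helper, not part of vLLM.

```python
import re

# Assumption: the startup line looks like the captured output in this lab.
BACKEND_LINE = re.compile(r"Using (?P<name>.+?) backend\.")


def backend_from_log(lines):
    """Return the backend named by the first matching startup line, else None."""
    for line in lines:
        match = BACKEND_LINE.search(line)
        if match:
            return match.group("name")
    return None


sample = [
    "INFO ... loading weights",
    "INFO ... Using FlashInfer backend.",
]
found = backend_from_log(sample)
```

Returning `None` when nothing matches is deliberate: a missing line after an upgrade is itself a signal that the log format changed and the check needs attention.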

Build the matrix (your deliverable)

GPU	dtype	model feature	chosen backend	why
A100/L4	bf16	standard	FlashAttention	hand-tuned default for Ampere+
H100	bf16	standard	FlashAttention (FA3 path)	Hopper-specific kernels
any	any	MLA (DeepSeek)	FlashMLA	latent KV layout — standard kernels can't read it
any	any	override set	(the override)	`VLLM_ATTENTION_BACKEND` wins over everything
any	any	unsupported head size	Triton fallback	JIT covers shapes hand-written kernels skip
CPU	fp32	standard	the CPU backend	no CUDA; platform chain (Phase 17)

Extend it with what your hardware shows — the table above is the skeleton; the rows you add from your own runs are the ones you'll remember.

Hitchhiker's notes

The override is a bisection tool, not a tuning knob. Mystery garbage output? Flip to TRITON_ATTN: if the garbage persists, it's not the kernel (look at sampling, weights, tokenizer); if it disappears, you've isolated a kernel bug and your issue report writes itself ("FA path wrong for head_size=96 + sliding window; Triton correct"). This two-run dance is the single highest-value habit this lab teaches.
Backends differ in the last ulp, legitimately. Different tiling = different reduction order = bitwise-different logits (Phase 3 lab-02's softening, kernel edition). Greedy outputs can diverge after enough tokens with no bug anywhere. Don't file that issue; do mention it when comparing backends in evals.
Why startup-time selection rather than per-request? The backend brings its own metadata builder (the FlashAttentionMetadata of lab-03) and its kernels are baked into CUDA-graph captures (Phase 5); swapping per request would mean re-capturing graphs and rebuilding paged-cache layouts mid-flight. Selection is configuration, not scheduling.
Capability gaps are normal, not shameful: a brand-new model with head_dim 96, or fp8 KV + sliding window, may be outside the fast path's support matrix and silently fall back to Triton — correct but slower. When throughput regresses after a model swap, check the backend line first; the model may have changed your kernel.

Reflect

Your p99-latency-sensitive service runs long-context decode on H100s. Name two backend experiments worth running before touching any other knob, and what you'd measure. (FlashInfer split-k vs FA3 at your concurrency, ITL distributions — lab-04 explains why long decode is where they differ; Phase 18 gives the harness.)
Why does an MLA model force the backend while sliding-window merely filters candidates? (MLA changes the cache's data layout — incompatible storage; sliding window is a mask variation several backends implement — a feature flag, not a format.)
The selector consults the platform (cuda.py, rocm.py, cpu.py …) before capability checks. Sketch how a new accelerator vendor slots in without touching the selector — that's Phase 17's plugin architecture, and the reason the chain is shaped this way.

References

upstream/vllm/v1/attention/selector.py:52 — get_attn_backend, the dispatcher.
upstream/vllm/platforms/cuda.py — the NVIDIA default chain the selector consults.
upstream/vllm/v1/attention/backends/ — the stable itself; skim each file's class docstring and you've got the cast list for the deep-dive.
vLLM docs, Engine Arguments / environment variables — VLLM_ATTENTION_BACKEND and friends: https://docs.vllm.ai/en/latest/serving/engine_args.html
Ye et al., FlashInfer (2025) — what the alternative brings: https://arxiv.org/abs/2501.01005
Dao, FlashAttention-2 (2023) and Shah et al., FlashAttention-3 (2024) — what the default brings: https://arxiv.org/abs/2307.08691, https://arxiv.org/abs/2407.08608

Lab 04-03 — Causal Prefill Attention over a Paged Cache `[CPU-OK]`

Lab-01 gave you the decode kernel shape: one query, N keys. But every token in that cache got there through the other shape — prefill: M queries at once (a prompt, or a chunk of one), each allowed to see only its own past. In this lab you build the prefill shape on top of your lab-01 recurrence, with the two ingredients that make it interesting: the causal mask (query i attends to positions 0..start_pos+i, nothing later) and the start_pos offset that makes chunked prefill possible at the kernel level. The payoff test proves, in attention outputs rather than scheduler bookkeeping, the invariant Phase 3 lab-02 promised you: prefilling in chunks computes exactly what one-shot prefill computes.

Why this lab exists

Every attention backend in vLLM ships (at least) two code paths, and PRs routinely touch one and break the other. If you've only ever written the decode path, prefill kernels read as a wall of index arithmetic: why does the mask depend on a start offset? why does the kernel receive query_start_loc arrays? what exactly must hold for a chunk computed today to splice seamlessly with a chunk computed three steps ago? This lab makes you derive all three answers, because you need them to make four tests pass.

It also closes a loop the course opened two phases ago. Phase 3 proved "chunking changes when, never what" behaviorally — same tokens out of the engine. But that proof leaned on the kernel doing its part: a query at absolute position 7, computed in a chunk that starts at position 5, must attend over tokens 0–7 exactly as it would have in a one-shot prefill. That's a property of the attention math plus the cache, and here you verify it at the layer where it actually lives — test_chunked_equals_one_shot is Phase 3 lab-02 restated in linear algebra.

Background: one mechanism, two shapes

The contract when a prefill chunk runs (this ordering is upstream's, and Phase 2 lab-06's):

The runner writes the chunk's K/V first — slot_mapping scatters rows for positions start_pos..start_pos+M−1 into the paged cache. So by the time attention runs, the cache holds tokens 0..start_pos+M−1: everything each query may legally see.
The kernel then computes, for each query row i (absolute position start_pos+i), attention over the causal prefix [0, start_pos+i] — gathered through the block table, streamed with online softmax, exactly your lab-01 loop with a per-query length.

Note what the causal mask is not: a -inf matrix you materialize. In a streaming kernel the mask degenerates into a loop bound — you simply stop reading keys at the query's own position. (Real kernels processing key tiles need the mask only for the one diagonal tile where queries and keys overlap; every earlier tile is all-visible, every later tile is skipped entirely. "The mask is mostly a loop bound" is why causal attention costs half of bidirectional, not the same with masking overhead.)

And start_pos is the entire kernel-side story of chunked prefill: a chunk is just a prefill whose queries don't start at zero. No special "resume" state — the cache is the state, which is the same insight (the counter/cache is the resume mechanism) you've now met in the scheduler (Phase 3), in preemption recovery (Phase 3 lab-04), and here in the kernel.
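The "mask is mostly a loop bound" claim can be checked with a small classifier. The sketch below is illustrative only: `classify_key_tile` and `visible_key_count` are hypothetical names, and the half-open ranges are a convention chosen here, not something taken from upstream.

```python
def visible_key_count(start_pos, i):
    """Loop bound for query row i of a chunk starting at start_pos (self included)."""
    return start_pos + i + 1


def classify_key_tile(q_start, q_end, k_start, k_end):
    """Classify one key tile against one query tile under a causal mask.

    Ranges are half-open absolute positions: queries [q_start, q_end),
    keys [k_start, k_end). A query at position p may read keys 0..p inclusive.
    """
    if k_end - 1 <= q_start:
        return "visible"     # last key is at or before the earliest query
    if k_start > q_end - 1:
        return "skipped"     # first key is after the latest query
    return "diagonal"        # mixed tile: the only place a mask is needed


# A chunk of 4 queries starting at position 5 (positions 5..8), key tiles of 4.
tiles = [classify_key_tile(5, 9, k, k + 4) for k in (0, 4, 8, 12)]
```

For this chunk the tiles come out as one fully visible, two diagonal, and one skipped. With longer prefixes the visible tiles dominate, which is why the element-wise mask is a minor cost in practice.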

Files

starter.py — dense_causal_attention (the reference) and paged_causal_prefill_attention (the paged, online-softmax version). Your work.
solution.py — reference; note how it reuses the lab-01 recurrence as an inner function — the decode kernel is literally a sub-case.
test_lab.py — full prefill, mid-sequence chunk, chunked ≡ one-shot, and the poisoned-future causality test.

Run

LAB_IMPL=starter pytest phase-04-attention-backends/labs/lab-03-causal-prefill-attention -q
pytest phase-04-attention-backends/labs/lab-03-causal-prefill-attention -q   # reference

What to implement

Two functions. The dense reference is a per-query loop: slice the causal prefix, score, softmax, blend. The paged version wraps your lab-01 recurrence: for query i, run the block-streaming loop with seq_len = start_pos + i + 1. That +1 is load-bearing — a token does attend to itself (its K/V are in the cache before its attention runs; see the contract above). Off-by-one it and test_full_prefill_from_position_zero fails on the very first row, where the prefix is exactly one token.
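If the `+1` and the chunking claim feel abstract, the following pure-Python sketch of a dense per-query reference may help. It is an assumption-laden stand-in: the lab itself uses its own `dense_causal_attention` signature, while `causal_prefill` here is a hypothetical function over plain lists so it runs with no dependencies.

```python
import math
import random


def causal_prefill(q_rows, k_cache, v_cache, start_pos):
    """Query row i sits at absolute position start_pos + i and attends over
    cache rows [0, start_pos + i] inclusive."""
    d = len(q_rows[0])
    out = []
    for i, q in enumerate(q_rows):
        seq_len = start_pos + i + 1          # +1: a token attends to itself
        scores = [sum(a * b for a, b in zip(q, k_cache[j])) / math.sqrt(d)
                  for j in range(seq_len)]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        denom = sum(weights)
        out.append([sum(w * v_cache[j][c] for j, w in enumerate(weights)) / denom
                    for c in range(len(v_cache[0]))])
    return out


rng = random.Random(0)


def rand_rows(n, d):
    return [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]


q, k, v = rand_rows(12, 4), rand_rows(12, 4), rand_rows(12, 4)
one_shot = causal_prefill(q, k, v, start_pos=0)
# Same 12 positions as two chunks of 5 and 7; the cache already holds all K/V.
chunked = causal_prefill(q[:5], k, v, start_pos=0) + causal_prefill(q[5:], k, v, start_pos=5)
```

Each output row depends only on the query's absolute position and the cache contents, so the chunk boundary cannot show up in the result. The paged version has to preserve exactly that property while reading through the block table.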

What the tests prove

Test	What it pins
`test_full_prefill_from_position_zero`	The base case (`start_pos=0`), with a partial last block — 13 tokens in 4 blocks
`test_mid_sequence_chunk`	The chunked case: queries for positions 5–8 of a 9-token cache attend over exactly the right prefixes despite starting mid-block
`test_chunked_equals_one_shot`	The phase-bridging invariant: 12 positions as one chunk ≡ as 5 + 7 — every output row identical to 1e-9. Phase 3 lab-02's theorem, at the layer where it's actually enforced
`test_causality_future_tokens_are_invisible`	A `1e3` "loud future" in the last token's K/V changes only the last query's row. Rows 0–6 provably deaf to it. A non-causal bug here doesn't crash — it leaks the future into every token, the model sees activations unlike anything it was trained on, and outputs degrade mysteriously. This test makes the leak deafening instead

The poison technique is Phase 2 lab-06's trick pointed at a different boundary: there it guarded seq_len masking, here it guards the causal frontier. Same principle — make the forbidden region catastrophic to touch, then prove nothing touched it.

Hitchhiker's notes

Why is prefill compute-bound while decode is bandwidth-bound when it's the same math? Count the reuse: in prefill, each gathered K/V block is dotted against many query rows (every query whose prefix covers it); in decode, against exactly one. That's the arithmetic-intensity difference of Phase 0 lab-04, visible in this very loop nest — and it's why real prefill kernels tile over both queries and keys (FlashAttention's 2D blocking) while decode kernels tile only keys (lab-01/lab-04 shapes).
query_start_loc and friends: real batches contain many requests' chunks concatenated; upstream passes per-request offsets (query_start_loc, seq_lens) so one kernel launch handles a ragged batch. Your start_pos is the single-request version of that metadata. Find the production form in upstream/vllm/v1/attention/backends/flash_attn.py (FlashAttentionMetadata).
The solution's per-query inner loop is honest but quadratic in reads — it re-gathers shared prefix blocks once per query. Real kernels invert the nest (outer loop over key tiles, inner over queries, with the diagonal-tile mask) precisely to read each block once. Try the inversion as an exercise; the recurrence per query row is unchanged, which is the point — the math doesn't care which loop is outside.
Sliding-window attention (Mistral et al.) is one more loop-bound tweak: the prefix becomes [max(0, pos−W+1), pos]. If you can place the causal bound, you can place the window bound — and you now know why window support is a per-backend feature flag rather than a model-side trick.
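The window bound from the last note fits in one function. This is a sketch only: `visible_range` is a hypothetical helper, and it returns a half-open range, so the inclusive prefix `[max(0, pos−W+1), pos]` becomes `[lo, pos + 1)`.

```python
def visible_range(pos, window=None):
    """Half-open key range [lo, hi) a query at absolute position pos may read.

    window=None means plain causal attention; otherwise at most `window` keys.
    """
    hi = pos + 1
    lo = 0 if window is None else max(0, pos - window + 1)
    return lo, hi


causal = visible_range(7)            # full prefix
windowed = visible_range(7, 4)       # the last 4 positions only
short = visible_range(2, 4)          # window larger than the prefix so far
```

Keeping both bounds in one place makes the left-boundary poison test easy to state: the key at `lo - 1` must never be read.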

Going further

Vectorize the dense reference into a single masked matmul (scores + np.triu(np.full((M, N), -np.inf), k=start_pos + 1), with N = start_pos + M keys) and check it against your loop — then notice the materialized (M, N) score matrix is exactly what FlashAttention exists to avoid.
Invert the loop nest (key-tiles outer) as sketched above and re-run the suite — same four green tests, different memory behavior. That is the loop order of the original FlashAttention forward pass; FlashAttention-2 moves query tiles back to the outer loop for parallelism while keeping the same per-row recurrence, so it is worth comparing both orders.
Implement sliding-window (window parameter, prefix start max(0, pos−W+1)) and write the poison test for the left boundary: a loud token just outside the window must be inaudible.

References

Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022) — the 2D-tiled prefill kernel this lab is the skeleton of: https://arxiv.org/abs/2205.14135
Dao, FlashAttention-2 (2023) — the loop-nest inversion and work partitioning: https://arxiv.org/abs/2307.08691
upstream/vllm/v1/attention/backends/flash_attn.py — FlashAttentionMetadata: query_start_loc, seq_lens, and the cascade of shapes one launch handles.
Phase 3 lab-02 — the engine-level statement of test_chunked_equals_one_shot.
Phase 2 lab-06 — the write path (slot_mapping) that fills the cache this lab reads.

Lab 04-04 — Flash-Decoding: Split the Keys, Merge the Partials `[CPU-OK]`

Here's a problem your lab-01 kernel can't solve. One request, one decode query, a 128k-token context — and a GPU with 100+ streaming multiprocessors. The online-softmax loop is sequential over blocks: one SM grinds through roughly 8,000 blocks (128k tokens at 16 per block) while 99+ SMs watch. Decode latency for long contexts becomes a single-core problem on a massively parallel machine.

The fix — known as flash-decoding (Dao et al.), paged_attention_v2 in vLLM's CUDA, split-k in FlashInfer — is the subject of this lab: partition the keys, attend each partition independently and in parallel, then merge the partial results exactly. The reason it works is a small piece of algebra worth owning forever: softmax-attention state compresses to a triple (max, denominator, unnormalized-accumulator), and two such triples combine associatively. You'll implement the triple, the merge, and prove equality with dense attention for any partition count, any merge order, any tree shape.

Why this lab exists

This is the lab where "online softmax" stops being a trick you memorized and becomes a monoid you can wield. Lab-01's recurrence processes blocks left to right — it looks inherently sequential. The deep fact is that it isn't: the per-block update is just the binary merge applied repeatedly, and because the merge is associative and order-insensitive, you may evaluate it in any tree shape — including "all leaves in parallel, one combine at the end." Sequential streaming (FlashAttention), parallel split-KV (flash-decoding), and hierarchical reduction (multi-stage kernels) are the same algorithm under different parenthesizations.

Practically, this is also the difference between usable and unusable long-context decode. Batch-1, long-context inference — the agentic workload, increasingly the workload — has no batch parallelism to hide behind; parallelism must come from within the single query's attention. When vLLM picks paged_attention_v2 over v1, or FlashInfer chooses a split-k plan, the decision is "is this context long enough that splitting beats the merge overhead?" After this lab you'll know exactly what's being weighed.

Background: attention state is three numbers

For a query q and any set of keys/values, define:

m     = max_i  s_i                    (s_i = k_i·q / √d)
denom = Σ_i exp(s_i − m)
acc   = Σ_i exp(s_i − m) · v_i        ← UNNORMALIZED (a vector)

(m, denom, acc) is a summary of attention over that key set: the final output is acc / denom, but crucially you don't divide until the very end. Two summaries over disjoint key sets merge by rescaling both to the shared max:

m* = max(m₁, m₂)
denom* = denom₁·e^{m₁−m*} + denom₂·e^{m₂−m*}
acc*   = acc₁·e^{m₁−m*}  + acc₂·e^{m₂−m*}

Check the properties: commutative (symmetry of the formulas), associative (both sides reduce to "rescale everything to the global max and add"), and lab-01's per-block update is exactly this merge where one side is a single block's summary. The exp(m−m*) correction factors are the price of never having seen the global max in advance — and they're also the numerical-stability mechanism: no exponential is ever taken of a positive number, so nothing overflows even when one partition holds a monster logit (test_extreme_scores_do_not_overflow feeds it a score of ~200; exp(200) ≈ 7e86 is far beyond the float32 and fp16 range, so a naive softmax in those precisions would return inf).
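The formulas above translate almost line for line into code. What follows is a dependency-free sketch, not the lab's reference solution: `attend_partial` borrows the starter's name but its signature here is an assumption, and `merge_two` and `finalize` are hypothetical helpers over plain lists.

```python
import math
import random


def attend_partial(q, keys, values):
    """Summarize attention of q over one key set as (m, denom, acc). No division."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    acc = [sum(w * v[c] for w, v in zip(weights, values))
           for c in range(len(values[0]))]
    return m, sum(weights), acc


def merge_two(a, b):
    """Combine two summaries over disjoint key sets by rescaling to the shared max."""
    m1, d1, acc1 = a
    m2, d2, acc2 = b
    m = max(m1, m2)
    s1, s2 = math.exp(m1 - m), math.exp(m2 - m)   # both exponents are <= 0
    return m, d1 * s1 + d2 * s2, [x * s1 + y * s2 for x, y in zip(acc1, acc2)]


def finalize(summary):
    """Divide once, at the very end."""
    _, denom, acc = summary
    return [x / denom for x in acc]


rng = random.Random(1)
q = [rng.uniform(-1, 1) for _ in range(4)]
K = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(9)]
V = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(9)]

dense = finalize(attend_partial(q, K, V))
a, b, c = (attend_partial(q, K[i:i + 3], V[i:i + 3]) for i in (0, 3, 6))
left_tree = finalize(merge_two(merge_two(a, b), c))
right_tree = finalize(merge_two(a, merge_two(b, c)))
reversed_order = finalize(merge_two(c, merge_two(b, a)))
```

All three groupings should agree with the dense result up to floating-point rounding, which is the associativity and commutativity claim in executable form.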

Files

starter.py — attend_partial (key range → summary), merge_partials (summaries → output), partitioned_attention (split, attend, merge). Your work.
solution.py — reference (the whole thing is ~25 lines; the understanding is the deliverable).
test_lab.py — identity at 1 partition, equality at any count, empty-chunk handling, order-invariance, hierarchical merging, and the overflow stress.

Run

LAB_IMPL=starter pytest phase-04-attention-backends/labs/lab-04-flash-decoding-partitions -q
pytest phase-04-attention-backends/labs/lab-04-flash-decoding-partitions -q   # reference

What to implement

Follow the math above literally. The one design rule that matters: attend_partial must not normalize. The moment you divide by the local denominator, the summary is no longer mergeable — you've thrown away the weights needed to re-weight against other partitions. (Returning normalized outputs and "averaging" them is the classic wrong implementation; it passes the 1-partition test and fails every other one, which is exactly why the 1-partition test isn't sufficient and the suite has six.)
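A two-key example with scalar values shows why averaging normalized outputs is wrong. This is a hand-sized sketch with hypothetical helpers (`summarize`, `merge`), separate from the lab's functions; the scores are picked so the true softmax weights are 1/4 and 3/4.

```python
import math


def summarize(scores, values):
    """(m, denom, acc) for scalar values; the accumulator stays unnormalized."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return m, sum(w), sum(wi * vi for wi, vi in zip(w, values))


def merge(a, b):
    m = max(a[0], b[0])
    sa, sb = math.exp(a[0] - m), math.exp(b[0] - m)
    return m, a[1] * sa + b[1] * sb, a[2] * sa + b[2] * sb


# Partition A: one key, score 0, value 1. Partition B: one key, score ln 3, value 5.
part_a = summarize([0.0], [1.0])
part_b = summarize([math.log(3.0)], [5.0])

# Ground truth: unnormalized weights exp(0) = 1 and exp(ln 3) = 3.
exact = (1 * 1.0 + 3 * 5.0) / (1 + 3)

_, denom, acc = merge(part_a, part_b)
merged = acc / denom

# The classic bug: normalize each partition locally, then average the outputs.
naive_average = (part_a[2] / part_a[1] + part_b[2] / part_b[1]) / 2
```

Local normalization turns each partition's output into just its value (1 and 5), so the average lands on 3 while the true answer is 4. The information lost is the relative weight of the partitions, which is exactly what `m` and `denom` carry.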

What the tests prove

Test	What it pins
`test_one_partition_is_just_attention`	The degenerate case: summary → output round-trips
`test_any_partition_count_matches_dense`	2, 3, 7, 32, 100 partitions — all `1e-12`-equal to dense. Partitioning is exact, not approximate; any tolerance bigger than rounding would hide real bugs
`test_more_partitions_than_keys`	`array_split` hands you empty chunks; skip, don't crash. The GPU analogue: grid sized for max length, sequences shorter than the partition count
`test_merge_is_order_invariant`	Reversed and shuffled partial lists give identical output — mandatory, because on hardware thread blocks finish in nondeterministic order
`test_merge_is_hierarchical`	Merging merges = attending over the union: associativity, demonstrated. This is the license for tree reductions and multi-stage kernels
`test_extreme_scores_do_not_overflow`	A ~200 logit in one partition: finite output, still `1e-12`-equal. The running max isn't bookkeeping — it's the firewall

Hitchhiker's notes

Where this lives upstream: upstream/csrc/attention/paged_attention_v2.cu — search max_logits and exp_sums: those are your m and denom, written per partition to scratch buffers, merged by a second reduction kernel. The v1/v2 choice (v2 when partitioning pays) is made by the backend per launch. FlashInfer generalizes the same state into plan-based split-k; FlashAttention's flash_attn_with_kvcache exposes it as num_splits.
The merge is also how cascade/shared-prefix attention works (FlashInfer's signature feature): attend over the shared system-prompt KV once for the whole batch (one summary, reused), attend per-request suffixes separately, merge each request's pair. Same triple, same combine — prefix caching meeting kernel design. That's three course threads (Phase 2 sharing, Phase 3 caching, this lab) converging on one formula.
Why does sequential streaming still exist if parallel split is exact? Overhead: each partition writes its summary to global memory and a second kernel reads them back. For short contexts the round-trip costs more than it saves; for prefill the parallelism already comes from query rows (lab-03). Split-KV wins specifically at long-context decode — engineering is choosing the parenthesization that matches the hardware's idle dimension.
This trick is older and bigger than attention: it's a parallel reduction over a non-trivial monoid, the same pattern as parallel max/sum/scan. The general skill — "can I summarize partial state so summaries combine associatively?" — is how you parallelize anything with a running normalizer. You'll meet it again in distributed softmax (Phase 10's context parallelism splits attention across GPUs with exactly this merge).

Going further

Implement merge_two(a, b) -> summary (summary × summary → summary, not output) and rebuild merge_partials as a fold; then as a balanced tree with functools.reduce-style pairing. Verify all shapes agree — you've now written the reduction the way the GPU executes it.
Combine with lab-01: make each partition gather through the block table (partition = a contiguous range of logical blocks). That composition — paged + split-KV — is precisely paged_attention_v2.
Simulate the cascade pattern: 8 "requests" sharing a 512-token prefix with unique 64-token suffixes. Compute the prefix summary once + 8 suffix summaries, merge per request; compare against 8 dense computations. Measure the key-reads saved (should be ~7×512 rows) — FlashInfer's headline, reproduced in numpy.
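Before writing the cascade simulation, it can help to pin down the number you expect to measure. The counter below is a back-of-envelope sketch; `key_rows_read` is a hypothetical helper that counts K/V rows touched and ignores everything else (merge cost, scratch writes).

```python
def key_rows_read(num_requests, prefix_len, suffix_len, shared_prefix):
    """K/V rows read for one decode step across all requests."""
    if shared_prefix:
        # The prefix summary is computed once and reused by every request.
        return prefix_len + num_requests * suffix_len
    return num_requests * (prefix_len + suffix_len)


dense_reads = key_rows_read(8, 512, 64, shared_prefix=False)
cascade_reads = key_rows_read(8, 512, 64, shared_prefix=True)
saved = dense_reads - cascade_reads
```

With the numbers from the exercise, the saving is 7 prefix reads of 512 rows each, which is the figure your numpy measurement should reproduce.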

References

Dao et al., Flash-Decoding for Long-Context Inference (2023) — the technique, with the parallelism diagrams: https://pytorch.org/blog/flash-decoding/
Milakov & Gimelshein, Online normalizer calculation for softmax (2018) — the merge formula's original home: https://arxiv.org/abs/1805.02867
Ye et al., FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving (2025) — split-k plans and cascade/shared-prefix attention: https://arxiv.org/abs/2501.01005
upstream/csrc/attention/paged_attention_v2.cu — max_logits / exp_sums / the reduce kernel: your lab, in CUDA.
Phase 10 — the same merge, stretched across GPUs (context parallelism).

Phase 04 — Exercises: Attention Backends

Warm-up (explain)

Attention is one operation — so why does vLLM have many attention backends?
What three pieces of metadata does a paged attention kernel need, and what is each for?
What is online softmax and what problem does it solve?

Core (trace the code)

In Attention.forward (attention.py:437), what does the model pass, and what does it not know about the kernel?
Name the three base classes in backend.py and the job of each (Backend / Impl / MetadataBuilder).
In FlashAttentionImpl.forward (flash_attn.py:592), find the KV write (slot_mapping) and the paged read (block table). How do they map to Phase 2?
List three inputs get_attn_backend (selector.py:52) uses to pick a backend.

Build (your lab)

In lab-01, explain why scattering the logical blocks to arbitrary physical ids doesn't change the output. What does that prove about the kernel's contract?
Extend paged_online_attention to multiple query heads (loop or vectorize). Verify against a multi-head dense reference.
Add a causal mask variant (a prefill query at position p attends only to tokens ≤ p).

Design (staff-level)

At high concurrency with many short decode requests, FlashInfer often beats FlashAttention. Hypothesize why, and design a benchmark (Phase 18) to confirm it for your workload.
You're bringing up a new model with a novel attention (e.g. a different KV compression). What parts of the backend system must you implement, and what can you reuse?
A user reports correct output with VLLM_ATTENTION_BACKEND=TRITON_ATTN but garbage with the default. Outline your debugging path and what it implies about the default kernel.

Self-grading

4–7 and 11–13 are interview-grade. Could you draw the layer→impl→kernel path and name the files? If not, re-read 01-deep-dive.md.

Phase 04 — Interview Questions: Attention Backends

Q1. Why does vLLM have a pluggable attention-backend system?

Model answer

Attention is one math op, but the fastest (or only available) kernel depends on hardware, dtype, head size, and model features (MLA, sliding window). A pluggable system lets vLLM pick the best kernel per setup (FlashAttention, FlashInfer, Triton, FlashMLA, TRTLLM-GEN) and adopt new ones without touching model code — the model talks only to the Attention layer (attention.py:177), which delegates to the chosen AttentionImpl.

Q2. What does a paged attention kernel need that a dense one doesn't?

Model answer

The block table (logical→physical block, to gather scattered prior KV), the slot mapping (where to write this step's new K/V), and per-request sequence lengths (for varlen batching). These are built each step by the AttentionMetadataBuilder (flash_attn.py:276) from the scheduler's output — the bridge from Phases 2/3 to the kernel.
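The read map and the write map are the same address arithmetic pointed in two directions. The sketch below assumes a flat slot numbering (physical block id times block size, plus offset); the helper names are hypothetical and the block table values are made up.

```python
def slot_for(position, block_table, block_size):
    """Flat physical slot of the token at logical `position` (assumed layout)."""
    physical_block = block_table[position // block_size]
    return physical_block * block_size + position % block_size


def slot_mapping(start_pos, num_new_tokens, block_table, block_size):
    """Write map: where this step's new K/V rows go."""
    return [slot_for(p, block_table, block_size)
            for p in range(start_pos, start_pos + num_new_tokens)]


def gather_slots(seq_len, block_table, block_size):
    """Read map: slots holding positions 0..seq_len-1, in logical order."""
    return [slot_for(p, block_table, block_size) for p in range(seq_len)]


block_table = [7, 2, 9]     # logical blocks 0, 1, 2 live in physical blocks 7, 2, 9
writes = slot_mapping(5, 4, block_table, block_size=4)   # positions 5..8
reads = gather_slots(9, block_table, block_size=4)
```

Positions 5 through 7 land in physical block 2 and position 8 spills into block 9, so one chunk can straddle blocks that are nowhere near each other in memory. The sequence length is what tells the kernel to stop at slot 36 rather than read the rest of block 9.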

Q3. Explain online softmax and why FlashAttention uses it.

Model answer

Naive attention materializes the full N×N score matrix — O(N²) memory. Online softmax streams K/V in tiles, keeping a running max, a rescaled accumulator, and a running denominator, so it computes exact softmax-weighted attention in O(N) memory and stays in fast SRAM. That's the "Flash" in FlashAttention, and it's what makes long-context attention feasible. (You implement it in lab-01.)

Q4. How and when is the backend chosen?

Model answer

At startup, by get_attn_backend (selector.py:52), from the platform default (platforms/cuda.py etc.), dtype, head size, and model features, with VLLM_ATTENTION_BACKEND as an override. It's fixed for the run because CUDA-graph capture and the metadata builder depend on it (Phase 5). MLA models force an MLA backend due to their different KV layout.

Q5. What is MLA and why does it need its own backend?

Model answer

Multi-head Latent Attention (DeepSeek) compresses K/V into a shared low-rank latent vector instead of storing full per-head K/V, shrinking the KV cache a lot. Because the cached representation and the attention math differ, it needs a dedicated kernel/backend (FlashMLA) and a different KV cache layout — an example of the model's attention design dictating the kernel.

Rapid-fire

Model-facing class? Attention (attention.py:177).
Three backend roles? Backend (names it), Impl (runs it), MetadataBuilder (feeds it).
Override env var? VLLM_ATTENTION_BACKEND.
Read map / write map? block table / slot mapping.
The trick that makes attention O(N) memory? Online softmax.

Phase 04 — Cheatsheet: Attention Backends

The one-liner

Attention is one op; the backend is which kernel computes it. Model code is backend-agnostic; the kernel gets paged-KV metadata (block table + slot mapping + seq lens).

The four roles (`vllm/v1/attention/backend.py`)

Attention layer (model-facing, attention.py:177) → delegates to:
AttentionImpl.forward — runs the kernel (writes KV via slot_mapping, reads via block table)
AttentionBackend — names impl + metadata + supported configs
AttentionMetadataBuilder.build — SchedulerOutput → kernel metadata (the Phase 2/3 → kernel bridge)

The kernels

backend	best for
FlashAttention	general default; online softmax, O(N) memory
FlashInfer	serving, paged KV, high concurrency / many decodes
Triton	portable fallback
FlashMLA	MLA models (DeepSeek) — low-rank latent KV
TRTLLM-GEN	NVIDIA TensorRT-LLM generated, GPU/precision-tuned

Online softmax (why "Flash")

running max + rescale + accumulate per tile → exact softmax in O(N) memory, no N×N matrix.

Selection

get_attn_backend (selector.py:52) ← platform default + dtype + head size + features; override with VLLM_ATTENTION_BACKEND=FLASH_ATTN|FLASHINFER|TRITON_ATTN|.... Fixed for the run (CUDA graphs).

Key upstream

model_executor/layers/attention/attention.py:177 Attention · :437 forward
v1/attention/backend.py base classes · v1/attention/selector.py:52 selector
v1/attention/backends/flash_attn.py :68 Backend :223 Metadata :276 Builder :592 Impl
v1/attention/backends/mla/ MLA · registry.py name→backend

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

Phase 05 — The Hitchhiker's Guide to CUDA Graphs & torch.compile ⭐


Flagship phase — written in full. Phases 02–03 made memory and scheduling fast. This phase attacks a different enemy: the CPU, which can be too slow to even tell the GPU what to do.

Don't Panic

Two ideas, one breath each:

CUDA graphs: launching a GPU kernel from Python costs CPU time. During decode you launch hundreds of tiny kernels per token — and the CPU can't issue them fast enough, so the GPU sits idle waiting for work. A CUDA graph records that whole sequence of launches once and replays it with a single launch. The CPU overhead vanishes.

torch.compile: instead of running your model op-by-op in Python, PyTorch's compiler traces it into a graph, then fuses and rewrites the ops into fewer, faster kernels. vLLM wraps this with its own backend, caching, and custom optimization passes.

They're complementary: torch.compile makes the kernels better; CUDA graphs make launching them free. vLLM uses both, together, by default. By the end of this phase you'll have built a CPU simulation of capture/replay (mini_vllm/cudagraph.py) that reproduces the exact win and the exact constraints, and you'll have read the real CUDAGraphWrapper.

Step 1: Why the CPU is a bottleneck at all

Recall the decode loop (Phase 0): one token at a time, and each token's forward pass runs many small operations — for each of, say, 32 layers: a QKV projection, attention, an output projection, two MLP matmuls, norms, residual adds… Easily hundreds of GPU kernels per token.

Each kernel is launched from Python/C++ on the CPU. A launch isn't free — it costs a few microseconds of CPU work to set up and enqueue. Do the arithmetic:

~300 kernels/token × ~5 µs CPU launch overhead ≈ 1.5 ms of CPU work per token

If the GPU work for that token is also ~1.5 ms, the CPU has no slack: launches are asynchronous, so CPU and GPU can overlap, but any hiccup on the CPU side now stalls the GPU. At small batch sizes (where each kernel is tiny and finishes fast) it gets worse: the GPU finishes each kernel before the CPU can launch the next one. The GPU starves, waiting on the CPU. This is CPU-launch-bound decode, and it's the default failure mode at low batch sizes.
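The arithmetic is worth turning into a two-line model you can plug your own numbers into. This is a rough sketch: the 300 kernels and 5 µs are the illustrative figures from the text, not measurements, and `is_launch_bound` is a hypothetical helper that ignores everything except launch time versus GPU time.

```python
def cpu_launch_ms(kernels_per_token, launch_us):
    """CPU time spent issuing launches for one token, in milliseconds."""
    return kernels_per_token * launch_us / 1000.0


def is_launch_bound(kernels_per_token, launch_us, gpu_ms_per_token):
    """True when issuing the launches takes at least as long as the GPU work."""
    return cpu_launch_ms(kernels_per_token, launch_us) >= gpu_ms_per_token


eager_cpu_ms = cpu_launch_ms(300, 5)          # one launch per kernel
replay_cpu_ms = cpu_launch_ms(1, 5)           # one launch for a whole recorded sequence
small_batch_bound = is_launch_bound(300, 5, gpu_ms_per_token=0.5)
```

The model is crude, but it captures the lever: the only term you can shrink by two orders of magnitude is the number of launches.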

CPU:  [launch k1][launch k2][launch k3]......[launch k300]   ← the CPU is the critical path
GPU:  [k1] idle  [k2] idle  [k3] idle .....                  ← GPU waits between tiny kernels
       └ each gap = CPU not done issuing the next launch

You can't make the launches cheaper one by one. But you can stop doing them every step.

Step 2: CUDA graphs — record once, replay forever

A CUDA graph is a recording of a sequence of GPU operations and their dependencies. You "capture" it once by running the forward pass in a special mode; CUDA records every kernel and its arguments into a graph object. Thereafter you replay the whole graph with a single API call — the CPU issues one launch and the GPU rips through all 300 kernels with zero per-kernel CPU involvement.

Without graphs (every step):  CPU issues 300 launches  →  GPU starves between them
With a captured graph:        CPU issues 1 "replay"    →  GPU runs all 300 back-to-back

The catch — and this is the whole reason it's tricky — is that a graph is a frozen recording. It records exact kernels reading from exact memory addresses for exact tensor shapes. So:

Constraint 1 — fixed shapes. A graph captured for batch size 8 only replays for batch size 8. vLLM captures a graph for each batch size it expects and pads any in-between batch size up to the next captured size. It keeps a dictionary of graphs keyed by shape.
Constraint 2 — static input buffers. Replay reads from the same memory the capture used. So to run a new token's inputs, you must copy them into the captured input buffer first, then replay. The graph reads from the fixed address; your job is to keep that address valid and current.

These two constraints are exactly what your mini_vllm/cudagraph.py GraphRunner models: a dict keyed by input shape (Constraint 1), and a static_input buffer you np.copyto into before replay (Constraint 2). Go read it — it's ~40 lines and it is the mental model.
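As a warm-up before opening the file, here is an independent toy of the same two constraints. It is not the course's `GraphRunner`: `ToyGraphRunner` is a hypothetical class over plain lists, "capture" is just allocating the static buffer, and running eagerly above the largest captured size is an assumption made for the toy.

```python
import bisect


class ToyGraphRunner:
    """Hypothetical capture/replay model; no GPU and no real graph involved."""

    def __init__(self, fn, capture_sizes):
        self.fn = fn
        self.capture_sizes = sorted(capture_sizes)
        self.static_inputs = {}     # captured size -> static buffer (Constraint 1)
        self.captures = 0

    def padded_size(self, n):
        """Smallest captured size >= n, or None if n exceeds every captured size."""
        i = bisect.bisect_left(self.capture_sizes, n)
        return self.capture_sizes[i] if i < len(self.capture_sizes) else None

    def run(self, batch):
        size = self.padded_size(len(batch))
        if size is None:
            return [self.fn(x) for x in batch]        # toy assumption: eager fallback
        if size not in self.static_inputs:
            self.static_inputs[size] = [0.0] * size   # "capture" happens once per size
            self.captures += 1
        buf = self.static_inputs[size]
        # Constraint 2: copy into the SAME buffer object, padding the tail.
        buf[:] = list(batch) + [0.0] * (size - len(batch))
        out = [self.fn(x) for x in buf]               # "replay" reads only the buffer
        return out[:len(batch)]                       # drop the padded rows


runner = ToyGraphRunner(lambda x: 2 * x, capture_sizes=[1, 2, 4, 8])
first = runner.run([1.0, 2.0, 3.0])     # padded from 3 up to the size-4 graph
second = runner.run([5.0, 5.0, 5.0, 5.0])
```

Both calls reuse one size-4 entry, so only one capture happens. The slice assignment matters: rebinding `buf` to a new list would model exactly the bug where a replay reads stale memory.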

Why "graphs" plural? Because of Constraint 1, vLLM holds many captured graphs — one per batch size in cudagraph_capture_sizes. The real CUDAGraphWrapper stores them in concrete_cudagraph_entries: dict[BatchDescriptor, CUDAGraphEntry] (cuda_graph.py:207). Your simulation's self.graphs: dict[shape, GraphEntry] is the same idea with the GPU filed off.

Step 3: Full vs Piecewise graphs

There's a wrinkle. Some operations can't be captured into a graph cleanly — most importantly attention, because its kernel takes variable-length metadata (the block tables and sequence lengths from Phase 02/03) that change every step and don't fit the "frozen recording" model.

vLLM offers two strategies (the CUDAGraphMode enum, compilation.py:53):

FULL — capture the entire model forward as one graph. Maximum CPU-overhead removal, but fragile: everything in the forward must be capture-safe, including attention (which needs special handling, e.g. capturing only the decode case where shapes are uniform).
PIECEWISE — split the forward at the uncapturable ops (attention). Capture each contiguous compiled region between splits as its own small graph; run the split ops (attention) eagerly. You pay a few launches (one per piece + the eager attention) instead of 300 — most of the win, far more robustly.

FULL:       [ ============== one graph: whole forward ============== ]   (1 replay)

PIECEWISE:  [ graph A ] (attention eager) [ graph B ] (attention eager) [ graph C ]
             └ capture   └ run live         └ capture   └ run live        └ capture
            a handful of launches, robust to attention's dynamic metadata

vLLM's V1 default is actually FULL_AND_PIECEWISE (compilation.py:63): use a FULL graph for pure-decode batches (uniform shapes — safe and fastest) and PIECEWISE for mixed prefill+decode batches (where attention metadata varies). Your mini_vllm.PiecewiseGraphRunner models exactly this: it splits ops at an "uncapturable" predicate, captures the rest, runs the splits eagerly — and a test proves the output is identical to eager.
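To make the launch arithmetic concrete, here is a minimal sketch of the split step alone. The op names, `split_segments`, and `launches_per_step` are hypothetical illustrations, not the names used in mini_vllm or upstream. It groups consecutive ops by capturability and counts one launch per captured segment plus one per eager op.

```python
from itertools import groupby


def split_segments(ops, is_capturable):
    """Group consecutive ops that share the same capturability flag.

    Returns a list of (capturable, [ops...]) pairs in execution order.
    """
    indexed = list(enumerate(ops))
    return [
        (flag, [op for _, op in group])
        for flag, group in groupby(indexed, key=lambda pair: is_capturable(pair[0]))
    ]


def launches_per_step(segments):
    # A captured segment replays as one launch; an eager segment pays one per op.
    return sum(1 if flag else len(seg) for flag, seg in segments)


# Hypothetical two-layer forward pass; "attn" stands in for the uncapturable op.
ops = ["norm", "qkv", "attn", "proj", "mlp", "norm", "qkv", "attn", "proj", "mlp"]
segments = split_segments(ops, lambda i: ops[i] != "attn")

eager_launches = len(ops)                          # every op launched by the CPU
piecewise_launches = launches_per_step(segments)   # graphs + eager attention
full_launches = 1                                  # one replay for the whole forward
```

For this toy forward the counts come out as 10 eager, 5 piecewise, 1 full: piecewise keeps most of the saving while leaving attention free to take fresh metadata each step.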

Step 4: torch.compile — making the kernels themselves better

CUDA graphs remove launch overhead but don't change what runs. torch.compile does the other half: it traces your model into an FX graph (via TorchDynamo), then a backend (Inductor) generates fused kernels, for example fusing an RMSNorm and a quantization step into one kernel so the intermediate result never makes a round trip through GPU memory.
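A toy illustration of why fusion helps even without graphs, under the course's launch-counting model. This is a sketch, not how Inductor works: `fuse_elementwise` and `run_and_count` are hypothetical helpers, and in numpy the fused callable still makes two array passes internally. Only the launch count models the real effect.

```python
import numpy as np


def fuse_elementwise(ops):
    """Compose several elementwise callables into one callable (one 'kernel')."""
    def fused(x):
        for op in ops:
            x = op(x)
        return x
    return fused


def run_and_count(ops, x):
    """Run ops in order and return (result, number of launches paid)."""
    launches = 0
    for op in ops:
        x = op(x)
        launches += 1
    return x, launches


def add_one(x):
    return x + 1


def times_two(x):
    return x * 2


x = np.arange(4, dtype=np.float32)
unfused_out, unfused_launches = run_and_count([add_one, times_two], x)
fused_out, fused_launches = run_and_count([fuse_elementwise([add_one, times_two])], x)
```

Same output, half the launches, and no capture involved: this is the half of the story that changes what runs rather than how it is launched.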

vLLM doesn't just use stock torch.compile; it has a custom backend (compilation/backends.py) and a compilation pipeline with levels (CompilationMode, compilation.py:37):

CompilationMode (the "level"):
  0 NONE                – pure eager, no compile (what enforce_eager gives you)
  1 STOCK_TORCH_COMPILE – plain torch.compile
  2 DYNAMO_TRACE_ONCE   – trace once, no recompiles
  3 VLLM_COMPILE        – vLLM's Inductor backend: caching + PIECEWISE compilation +
                          shape specialization + custom passes   ← the V1 default

At level 3, vLLM:

traces the model once and caches the compiled artifacts (so restarts are fast),
splits the graph at attention for piecewise compilation (lining up with piecewise CUDA graphs),
runs custom graph passes (compilation/passes/): fusions vLLM knows are safe and profitable for inference but stock Inductor wouldn't do (e.g. fused add+RMSNorm, sequence-parallel rewrites, quant fusions).

You opt a model into all this with one decorator: @support_torch_compile on the model class (decorators.py:118). That's the seam between "a model" and "the compiler."

Step 5: How they fit together at runtime

model class
  └─ @support_torch_compile           (Phase 5: opt into the compiler)
        └─ torch.compile / VLLM_COMPILE backend  → fused kernels, piecewise split at attention
              └─ CUDAGraphWrapper      (Phase 5: capture/replay per batch size)
                    └─ for batch size B: replay graph_B  (1 launch)  OR capture if new shape

Each decode step, the model runner sets a forward_context with the current cudagraph_runtime_mode and a batch_descriptor (the shape key). The CUDAGraphWrapper (cuda_graph.py:233) reads that context and either runs eagerly (mode NONE / mismatch), replays the cached graph for that shape, or captures a new one. That dispatch-by-context is precisely what your GraphRunner.__call__ does with x.shape as the key.

The invariants to memorize

CUDA graphs remove CPU launch overhead, not GPU compute. They help when decode is CPU-launch-bound (small batch), not when it's GPU-bound (large batch / prefill).
A graph is per-shape: one captured graph per batch size; odd sizes are padded up.
Replay reads static buffers: new inputs must be copied in before replay.
Attention is the thing that resists capture → piecewise splits around it.
torch.compile improves kernels; CUDA graphs improve launching. Different problems, used together.
enforce_eager=True turns both off — your debugging escape hatch (and the only way to get fully dynamic shapes).

What you'll do in this phase

Read: 01-deep-dive.md walks the real CUDAGraphWrapper.__call__, the CUDAGraphMode/CompilationMode enums, and @support_torch_compile line by line.
Build: 02-mini-build.md — the capture/replay simulator (reference: mini_vllm/cudagraph.py).
Labs (see labs/README.md; recommended order 01 → 02 → 05 → 03 → 04):
- lab-01-graph-replay-simulator [CPU-OK] — implement capture/replay + shape dispatch + static buffers; pass the tests.
- lab-02-launch-overhead [CPU-OK] — model launch overhead and find the eager↔graph crossover.
- lab-03-cudagraph-mode [CPU-OK] — reimplement the CUDAGraphMode dispatch (FULL / PIECEWISE / FULL_AND_PIECEWISE) and prove you understand decode-vs-mixed routing.
- lab-04-graph-vs-eager-real [GPU-REQ] — run real vLLM with enforce_eager vs CUDA graphs, measure the ITL difference (captured output included).
- lab-05-capture-sizes [CPU-OK] — the capture-size ladder: rung lookup, padding waste, and the denser-ladder-vs-more-captures trade ("why is my batch of 33 running at 40?").
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.

When you can explain why a graph helps decode but not prefill, name the two constraints, and draw piecewise vs full from memory, you understand the layer that often doubles low-batch throughput for free.


Phase 05 — Deep Dive: CUDA Graphs & torch.compile in real vLLM

Paths relative to upstream/ at v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). The compilation subsystem:
vllm/compilation/
  cuda_graph.py        CUDAGraphWrapper, CUDAGraphEntry   (capture/replay — read this first)
  decorators.py        @support_torch_compile             (how a model opts in)
  backends.py          the VllmBackend for torch.compile  (trace -> split -> compile)
  piecewise_backend.py piecewise compiled regions
  passes/pass_manager.py  + passes/fusion/  custom graph rewrites
vllm/config/compilation.py   CompilationMode, CUDAGraphMode, CompilationConfig
We read capture/replay (the core), the two config enums (the vocabulary), and the decorator (the seam). The Inductor internals are deep — return after you're comfortable here.

1. The two enums that name everything

`CompilationMode` — the "level" (`vllm/config/compilation.py:37`)

class CompilationMode(enum.IntEnum):
    NONE = 0                  # pure eager, model runs as-is (what enforce_eager gives you)
    STOCK_TORCH_COMPILE = 1   # the standard torch.compile pipeline
    DYNAMO_TRACE_ONCE = 2     # single Dynamo trace, avoid recompilation
    VLLM_COMPILE = 3          # vLLM's Inductor backend: caching, piecewise, shape
                              # specialization, custom passes   <- V1 default

This answers "how hard does the compiler work?" Level 3 (VLLM_COMPILE) is where vLLM's value is — its own backend with caching and piecewise splitting. Levels 0–2 are mostly for debugging/comparison. mini_vllm doesn't compile (no GPU), but the idea of "a level dial from eager to fully-optimized" is the thing to carry.

`CUDAGraphMode` — the capture strategy (`vllm/config/compilation.py:53`)

class CUDAGraphMode(enum.Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_DECODE_ONLY = (FULL, NONE)        # full graph for decode, nothing for mixed
    FULL_AND_PIECEWISE = (FULL, PIECEWISE) # full for decode, piecewise for mixed (v1 default)

Notice the clever encoding: the last two are tuples (decode_mode, mixed_mode). A batch is either pure-decode (uniform shapes — safe for a FULL graph) or mixed prefill+decode (variable attention metadata — needs PIECEWISE). The helper methods make this explicit (compilation.py:65):

def decode_mode(self) -> "CUDAGraphMode":
    return CUDAGraphMode(self.value[0]) if self.separate_routine() else self
def mixed_mode(self) -> "CUDAGraphMode":
    return CUDAGraphMode(self.value[1]) if self.separate_routine() else self
def has_mode(self, mode) -> bool: ...        # is `mode` one of my routines?
def requires_piecewise_compilation(self) -> bool:
    return self.has_mode(CUDAGraphMode.PIECEWISE)

So FULL_AND_PIECEWISE.decode_mode() == FULL and .mixed_mode() == PIECEWISE. You will reimplement these exact methods in lab-03; they're small and they encode the whole "which graph for which batch" decision. The comment at lines 595–620 of the config spells out the tradeoffs (PIECEWISE captures only the non-attention regions and keeps attention out of the graph; FULL_AND_PIECEWISE is generally fastest).
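The tuple encoding is easy to try in plain Python. The sketch below copies the members and the two quoted methods; the bodies of `separate_routine` and `has_mode` are assumptions filled in for illustration (the excerpt above elides them), and `route` is a hypothetical helper that is not part of the enum.

```python
import enum


class CUDAGraphMode(enum.Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    # Inside the class body FULL/NONE/PIECEWISE are still plain ints,
    # so these values are the tuples (2, 0) and (2, 1).
    FULL_DECODE_ONLY = (FULL, NONE)
    FULL_AND_PIECEWISE = (FULL, PIECEWISE)

    def separate_routine(self) -> bool:
        # Assumption: composite modes are exactly the tuple-valued members.
        return isinstance(self.value, tuple)

    def decode_mode(self) -> "CUDAGraphMode":
        return CUDAGraphMode(self.value[0]) if self.separate_routine() else self

    def mixed_mode(self) -> "CUDAGraphMode":
        return CUDAGraphMode(self.value[1]) if self.separate_routine() else self

    def has_mode(self, mode: "CUDAGraphMode") -> bool:
        # Assumption: `mode` is a simple (non-tuple) mode.
        return mode in (self.decode_mode(), self.mixed_mode())

    def requires_piecewise_compilation(self) -> bool:
        return self.has_mode(CUDAGraphMode.PIECEWISE)


def route(mode: CUDAGraphMode, is_pure_decode: bool) -> CUDAGraphMode:
    """Hypothetical helper: which routine serves this batch?"""
    return mode.decode_mode() if is_pure_decode else mode.mixed_mode()


default = CUDAGraphMode.FULL_AND_PIECEWISE
decode_choice = route(default, is_pure_decode=True)    # CUDAGraphMode.FULL
mixed_choice = route(default, is_pure_decode=False)    # CUDAGraphMode.PIECEWISE
```

A simple mode answers both questions with itself, so `route` never needs to special-case the composite members.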

2. `CUDAGraphWrapper` — capture and replay (the heart)

vllm/compilation/cuda_graph.py:145. Read its docstring (lines 146–168) — it states the dispatch protocol precisely. The key data structure (line 207):

# the entries for different batch descriptors that we need to capture cudagraphs for.
self.concrete_cudagraph_entries: dict[BatchDescriptor, CUDAGraphEntry] = {}

A dict of graphs keyed by batch shape. This is Constraint 1 (per-shape) made concrete. Your mini_vllm.GraphRunner.graphs: dict[shape, GraphEntry] is the same structure.

A CUDAGraphEntry (line 128) is what we cache per shape:

@dataclass
class CUDAGraphEntry:
    batch_descriptor: BatchDescriptor
    cudagraph: torch.cuda.CUDAGraph | None = None
    output: Any | None = None
    # for cudagraph debugging, track the input addresses during capture,
    # and check if they are the same during replay
    input_addresses: list[int] | None = None

That input_addresses field is Constraint 2 (static buffers) made checkable: capture records the input tensor addresses; replay asserts they're unchanged. Your simulation models this with the static_input buffer you must np.copyto into.

The dispatch: `__call__` (line 233)

Walk it in three branches:

(a) No graph / mode mismatch → run eagerly (lines 234–254):

forward_context = get_forward_context()
batch_descriptor = forward_context.batch_descriptor
cudagraph_runtime_mode = forward_context.cudagraph_runtime_mode

if (cudagraph_runtime_mode == CUDAGraphMode.NONE
        or cudagraph_runtime_mode != self.runtime_mode):
    # profile run, warmup, no-cudagraph, OR a different wrapper's turn
    return self.runnable(*args, **kwargs)

The wrapper "blindly trusts" the mode + shape key set by the model runner in the forward_context. If the runtime says NONE (profiling/warmup) or this isn't this wrapper's mode, just run the real function. (This is how FULL and PIECEWISE wrappers can be nested and each only fires for its own mode.) Your GraphRunner doesn't need modes, but the trust-the-context pattern is why the wrapper stays decoupled from the compiler.
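The nesting claim is worth seeing in miniature. This is a sketch under loud assumptions: `ForwardContext` and `ModeGatedWrapper` are hypothetical stand-ins, modes are plain strings, the context is passed as an argument instead of being fetched with get_forward_context(), and "firing" just appends to a log where the real wrapper would capture or replay.

```python
from dataclasses import dataclass


@dataclass
class ForwardContext:
    """Hypothetical stand-in for the per-step context set by the model runner."""
    runtime_mode: str   # "NONE", "PIECEWISE", or "FULL"
    batch_key: int      # stand-in for the batch descriptor


class ModeGatedWrapper:
    """Fires only when the context's mode equals its own; otherwise passes through."""

    def __init__(self, runnable, runtime_mode, log):
        self.runnable = runnable
        self.runtime_mode = runtime_mode
        self.log = log

    def __call__(self, ctx, x):
        if ctx.runtime_mode == "NONE" or ctx.runtime_mode != self.runtime_mode:
            # Profiling, warmup, or a different wrapper's turn: just run it.
            return self.runnable(ctx, x)
        # This wrapper's turn: capture or replay would happen here.
        self.log.append((self.runtime_mode, ctx.batch_key))
        return self.runnable(ctx, x)


log = []
piece = ModeGatedWrapper(lambda ctx, x: x * 2, "PIECEWISE", log)
whole = ModeGatedWrapper(piece, "FULL", log)   # FULL wrapper nests the piecewise one

out_full = whole(ForwardContext("FULL", 8), 21)
out_piecewise = whole(ForwardContext("PIECEWISE", 8), 21)
out_eager = whole(ForwardContext("NONE", 8), 21)
```

All three calls return the same value; only the log differs, with exactly one wrapper firing per call and none firing under NONE.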

(b) Shape not seen → CAPTURE (lines 257–344):

if batch_descriptor not in self.concrete_cudagraph_entries:
    self.concrete_cudagraph_entries[batch_descriptor] = CUDAGraphEntry(batch_descriptor=...)
entry = self.concrete_cudagraph_entries[batch_descriptor]

if entry.cudagraph is None:
    validate_cudagraph_capturing_enabled()
    input_addresses = [x.data_ptr() for x in args if isinstance(x, torch.Tensor)]
    entry.input_addresses = input_addresses
    cudagraph = torch.cuda.CUDAGraph()
    ...
    with torch.cuda.graph(cudagraph, pool=self.graph_pool, stream=current_stream()):
        output = self.runnable(*args, **kwargs)      # the kernels are RECORDED, not just run
        if self.cudagraph_options.weak_ref_output:
            output = weak_ref_tensors(output)
    entry.output = weak_ref_tensors(output)
    entry.cudagraph = cudagraph
    compilation_counter.num_cudagraph_captured += 1
    return output                                    # return the REAL output on capture step

The with torch.cuda.graph(...) context is where CUDA records every kernel issued by self.runnable(...) into cudagraph. The weak_ref_tensors dance (lines 325–336) is the "mind-exploding" memory management the comment warns about: the output lives in the graph's private memory pool, so vLLM holds only weak references to avoid leaking it while still letting PyTorch manage the pool. Your simulation skips this (numpy has no pools) but captures the structure: first sight of a shape → run once, record, cache.

(c) Shape seen → REPLAY (lines 346–361):

if self.is_debugging_mode:
    new_input_addresses = [x.data_ptr() for x in args if isinstance(x, torch.Tensor)]
    assert new_input_addresses == entry.input_addresses, (
        "Input addresses for cudagraphs are different during replay...")
...
entry.cudagraph.replay()
return entry.output

This is the entire win in two lines: entry.cudagraph.replay() issues one launch and the GPU runs the whole recorded sequence; return the cached output tensor. Note the debug assertion — it enforces Constraint 2 (inputs must be at the same addresses; the model runner guarantees this by writing new inputs into persistent buffers before calling). Your GraphRunner.__call__ replay branch is the direct analog: np.copyto(entry.static_input, x) then "replay" as a single LaunchCounter.bump(1).

The whole class in one sentence: a per-shape dict where the first call captures and every later call with that shape replays — exactly your mini_vllm.GraphRunner.

3. `@support_torch_compile` — the seam between a model and the compiler

vllm/compilation/decorators.py:118. Models opt in by decorating the class:

@support_torch_compile(dynamic_arg_dims={"x": 0, "y": 0})
class MyModel(nn.Module):
    def forward(self, x: torch.Tensor, y: Optional[torch.Tensor]): ...

What it does (read the docstring, 126–176): it wraps the class so that, when compilation is enabled, the forward is run through torch.compile/the vLLM backend, and it marks which tensor dimensions are dynamic (the batch/sequence dim) so the compiler specializes on shape correctly. dynamic_arg_dims says "dimension 0 of x varies" — that's the batch dimension the CUDA-graph capture sizes range over. If you don't pass it, vLLM infers it from the type annotations (line 153): torch.Tensor args get dim 0 marked dynamic.

The important takeaway: adding compile support to a model is one decorator, and the dynamic dims you declare are what let the same compiled artifact serve many batch sizes (and what the CUDA-graph layer keys its captured graphs on). When you add a model in Phase 14, this decorator is part of the recipe.

4. The backend + passes (skim now, return later)

vllm/compilation/backends.py — VllmBackend, the torch.compile backend Dynamo calls with the traced FX graph. It splits the graph at splitting_ops (attention) for piecewise compilation, compiles each piece with Inductor, caches the results, and arranges the pieces for piecewise CUDA-graph capture. This is the level-3 VLLM_COMPILE machinery.
vllm/compilation/piecewise_backend.py — manages a single piecewise compiled region.
vllm/compilation/passes/pass_manager.py + passes/fusion/ — the custom graph passes: rewrites vLLM applies to the traced graph that stock Inductor wouldn't, e.g. fusing add + RMSNorm, fusing quantization into the preceding op, sequence-parallel rewrites. Each pass is an FX-graph-in, FX-graph-out transform. Reading one small fusion pass is a great way to see "graph-level transformation" concretely.

Your mini_vllm.PiecewiseGraphRunner models the split idea (break at uncapturable ops, capture the rest) without the Inductor compilation — which is the part that matters for the mental model.

5. Where it's wired into the engine

The model runner (vllm/v1/worker/gpu_model_runner.py) is what:

decides the cudagraph_runtime_mode for the current batch (FULL for pure decode, PIECEWISE for mixed, NONE during profiling/warmup) and the batch_descriptor (the shape key),
sets them on the forward_context (which the CUDAGraphWrapper reads),
writes the step's inputs into the persistent buffers the captured graph reads from (Constraint 2), padding the batch up to a captured size (Constraint 1),
runs a warmup at startup that captures graphs for every size in cudagraph_capture_sizes.

Search gpu_model_runner.py for cudagraph and capture to see the warmup/capture loop and the input-buffer copies. That's the production embodiment of everything above.

Reading checklist

One sentence each in your notebook:

CompilationMode — what does level 3 (VLLM_COMPILE) add over stock torch.compile?
CUDAGraphMode — why are FULL_AND_PIECEWISE/FULL_DECODE_ONLY encoded as tuples?
concrete_cudagraph_entries — what is the key, and which constraint does that enforce?
CUDAGraphEntry.input_addresses — which constraint, and when is it checked?
__call__ — name the three branches (eager / capture / replay) and their triggers.
entry.cudagraph.replay() — why is this "the entire win"?
@support_torch_compile dynamic_arg_dims — why does the compiler need to know the dynamic dimension?

Now build it: 02-mini-build.md, then the labs.

Phase 05 — Mini-Build: simulate CUDA-graph capture & replay

You'll build a CPU simulation of CUDA graphs that reproduces the one win and the two constraints from the guide. No GPU, no torch — just numpy and a launch counter. The reference lives in mini_vllm/cudagraph.py; write it yourself first against lab-01's stub + tests.

The trick that makes this teachable on a laptop: we don't time anything. We count launches. LaunchCounter is a global tally standing in for per-op CPU launch overhead. Eager pays one per op every call; a replay pays exactly one. That single number captures the entire point of CUDA graphs.

The build, in order

1. `LaunchCounter`

A class with a class-level n, plus reset() and bump(k=1). This is your stand-in for CPU launch overhead.

2. `run_eager(ops, x)`

Run a list of ops over x, bump(1) per op. Returns the result. This is the baseline: overhead paid per op, every single call.

3. `GraphRunner(ops)` — capture once, replay forever

The core. __call__(x):

key = x.shape.
Capture (key unseen): copy x into a static_input buffer, run the ops (bump per op), cache a GraphEntry(shape, static_input, output, num_ops), return the output. (Constraint 1: graphs are keyed by shape.)
Replay (key seen): np.copyto(entry.static_input, x) (Constraint 2: new inputs must land in the fixed buffer), recompute the ops from the buffer, bump(1) for the whole replay, return. (The win: one launch instead of len(ops).) Expose num_captured.
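If you get stuck, the steps so far can be sketched roughly as follows. Treat it as one possible shape, not the reference: the field names follow the description above, but details such as the `_compute` helper and storing the output on the entry are assumptions, and mini_vllm/cudagraph.py may differ.

```python
from dataclasses import dataclass

import numpy as np


class LaunchCounter:
    n = 0  # class-level tally standing in for CPU launch overhead

    @classmethod
    def reset(cls):
        cls.n = 0

    @classmethod
    def bump(cls, k=1):
        cls.n += k


def run_eager(ops, x):
    for op in ops:
        x = op(x)
        LaunchCounter.bump(1)   # eager pays one launch per op, every call
    return x


@dataclass
class GraphEntry:
    shape: tuple
    static_input: np.ndarray
    output: np.ndarray
    num_ops: int


class GraphRunner:
    def __init__(self, ops):
        self.ops = list(ops)
        self.graphs = {}        # Constraint 1: one entry per input shape

    @property
    def num_captured(self):
        return len(self.graphs)

    def _compute(self, buf):
        out = buf
        for op in self.ops:
            out = op(out)
        return out

    def __call__(self, x):
        key = x.shape
        entry = self.graphs.get(key)
        if entry is None:
            # Capture: own a private buffer, pay full per-op cost once.
            static_input = np.array(x, copy=True)
            LaunchCounter.bump(len(self.ops))
            output = self._compute(static_input)
            self.graphs[key] = GraphEntry(key, static_input, output, len(self.ops))
            return output
        # Replay: Constraint 2, new inputs land in the fixed buffer.
        np.copyto(entry.static_input, x)
        entry.output = self._compute(entry.static_input)
        LaunchCounter.bump(1)   # the win: one launch for the whole sequence
        return entry.output


ops = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
x = np.ones(4)

LaunchCounter.reset()
for _ in range(100):
    run_eager(ops, x)
eager_total = LaunchCounter.n       # 100 calls x 3 ops

LaunchCounter.reset()
runner = GraphRunner(ops)
for _ in range(100):
    out = runner(x)
graph_total = LaunchCounter.n       # 3 for the capture + 99 replays
```

Calling `runner` with a different shape afterwards pays a fresh capture and leaves both entries in the dict.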

4. `PiecewiseGraphRunner(ops, is_capturable)` — split at the attention analog

Build contiguous segments by grouping consecutive ops with the same is_capturable(i) value. Wrap capturable segments in a GraphRunner; keep uncapturable segments as plain op lists run via run_eager. __call__ threads x through the segments in order. Expose num_graphs (count of capturable segments). This models PIECEWISE: capture the compiled regions, run attention eagerly.

Definition of done

pytest mini_vllm/test_cudagraph.py -q          # the reference suite (7 tests)
pytest phase-05-cuda-graphs-and-torch-compile/labs -q

Then answer in your notebook, citing mini_vllm/cudagraph.py lines:

Which line is Constraint 1 (per-shape dispatch)? Which is Constraint 2 (static buffer copy)? Which is the win (single launch on replay)?
In PiecewiseGraphRunner, why does num_graphs == 2 when you split a 3-op model at the middle op? (Two capturable runs surround one eager op.)

Map your toy to the real engine

your `mini_vllm/cudagraph.py`	real vLLM
`GraphRunner.graphs: dict[shape, GraphEntry]`	`CUDAGraphWrapper.concrete_cudagraph_entries: dict[BatchDescriptor, CUDAGraphEntry]` (`cuda_graph.py:207`)
`entry.static_input` + `np.copyto`	`CUDAGraphEntry.input_addresses` + persistent input buffers (`cuda_graph.py:135`, `:346`)
capture branch	`with torch.cuda.graph(...)` (`cuda_graph.py:313`)
replay branch (`bump(1)`)	`entry.cudagraph.replay()` (`cuda_graph.py:360`)
`PiecewiseGraphRunner` split	piecewise compilation/capture, split at attention (`backends.py`)

Stretch (optional)

Padding to capture sizes. Real vLLM only captures graphs for a fixed set of batch sizes and pads odd sizes up. Add a capture_sizes=[1,2,4,8] to GraphRunner: round x's batch dim up to the nearest capture size before keying, so batch 5 and 7 both reuse the size-8 graph. Count how many distinct graphs you capture across batches 1..8 with and without padding.
A fusion pass. Add an optional graph-rewrite step that fuses two adjacent elementwise ops into one (e.g. +1 then *2 → one op) and show it reduces launches even in eager mode — the torch.compile half of the story.

Phase 05 Labs — CUDA Graphs & torch.compile

Five labs on the machinery that turns a launch-bound decode loop into replayed recordings. The arc: build capture/replay and meet its two constraints (lab-01), derive the economics — crossover and ceiling (lab-02), solve the variable-batch problem with the capture-size ladder (lab-05), route decode vs mixed batches with the mode dispatch (lab-03), then measure the whole stack on real silicon (lab-04).

Recommended order: 01 → 02 → 05 → 03 → 04. (The directory numbers predate lab-05; the recommended order runs mechanism, economics, the two policy layers, then measurement.) CPU labs follow the standard contract: starter.py (your work), solution.py (reference), test_lab.py (the spec). By default the tests run the solution; LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-05-cuda-graphs-and-torch-compile/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-01-graph-replay-simulator -q

Labs

lab-01-graph-replay-simulator `[CPU-OK]`

Build capture/replay on CPU: a runner that records an op sequence per shape and replays it as a single launch, copying new inputs into a static buffer first. The two infamous constraints (fixed shape, fixed addresses) fall out as the direct price of the win — a graph is a recording, and recordings don't take arguments. Mirrors mini_vllm/cudagraph.py and the real CUDAGraphWrapper. Skills: capture vs replay accounting; copyto-not-rebind; shape-keyed dispatch.

lab-02-launch-overhead `[CPU-OK]`

The economics as closed forms: break-even at the second call, asymptotic speedup = number of ops captured, dilution by Amdahl. Turns "graphs help low-batch decode" into formulas you can defend and extrapolate, including when not to bother (single fused op, never-repeating shapes). Skills: crossover/ceiling analysis; the upfront-cost-amortized pattern that recurs in every compilation decision.

lab-03-cudagraph-mode `[CPU-OK]`

Reimplement the CUDAGraphMode enum's dispatch: composite modes as (decode, mixed) tuples, FULL for uniform decode batches, PIECEWISE (split at attention) for ragged mixed ones, and the compile-time dependency requires_piecewise_compilation guards. The ten lines where chunked prefill, attention metadata, and graph constraints reconcile. Skills: the routing table; compile-time vs run-time configuration; reading the two-pass capture log.

lab-04-graph-vs-eager-real `[GPU-REQ]`

The validation: enforce_eager=True vs default at batch 1/8/64 on an L4 — 2.5× fading to 1.13×, exactly lab-02's curve, plus the capture log showing lab-03's two routines and lab-05's 23-rung ladder. Annotated capture included for the GPU-less. Skills: falsifiable-prediction benchmarking; extrapolating to other models/hardware; when enforce_eager is right (tests, debugging) vs wrong (serving).

lab-05-capture-sizes `[CPU-OK]`

The variable-batch problem: capture a ladder of sizes, pad every batch up to the nearest rung. Implement the ladder, the lookup, and the waste accounting — answering the production FAQ "why is my batch of 33 running at 40?" and quantifying the denser-ladder-vs-more-captures trade. Skills: padding as the price of replay; bucketing continuous quantities; reading cudagraph_capture_sizes.
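A sketch of the lookup and the waste accounting. The ladder shape here (1, 2, 4, then every multiple of 8) is an assumption chosen so that the "33 runs at 40" example falls out; check cudagraph_capture_sizes in your own config for the real rungs. `make_ladder`, `pad_to_rung`, and `padding_waste` are hypothetical names.

```python
from bisect import bisect_left


def make_ladder(max_size=512):
    # Assumed ladder shape: 1, 2, 4, then every multiple of 8 up to max_size.
    return [1, 2, 4] + list(range(8, max_size + 1, 8))


def pad_to_rung(batch_size, ladder):
    """Smallest rung >= batch_size, or None when the batch exceeds the top rung."""
    i = bisect_left(ladder, batch_size)
    return ladder[i] if i < len(ladder) else None


def padding_waste(batch_size, ladder):
    """Fraction of the padded batch that is padding (0.0 when no rung applies)."""
    rung = pad_to_rung(batch_size, ladder)
    return 0.0 if rung is None else (rung - batch_size) / rung


ladder = make_ladder(64)
rung_for_33 = pad_to_rung(33, ladder)      # next rung above 33
waste_for_33 = padding_waste(33, ladder)   # 7 padded slots out of 40
```

A denser ladder shrinks `padding_waste` but adds one capture per extra rung, which is the trade the lab asks you to quantify.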

What you can do after this phase

Explain CUDA graphs as a systems mechanism (recording + static buffers + shape dict) rather than GPU folklore; predict graph benefits for a given model size, batch distribution, and hardware before measuring; decode every line of vLLM's capture-time logging; tune cudagraph_mode and cudagraph_capture_sizes from workload evidence; and know exactly what enforce_eager=True trades, in both directions. Phase 6 changes what's inside the kernels (quantization); Phase 8's draft models are where graph mastery pays double.

Lab 05-01 — Build the Capture/Replay Simulator `[CPU-OK]`

Here's the absurdity CUDA graphs exist to fix: a decode step for a small model can spend more time on the CPU (Python dispatch, kernel argument marshaling, cudaLaunchKernel calls, one per operation, hundreds per step) than the GPU spends computing. The GPU finishes each tiny kernel and idles, waiting for the next launch to arrive. CUDA graphs fix it by recording the whole kernel sequence once and replaying it as a single launch. In this lab you build that mechanism on CPU (capture, shape-keyed dispatch, static buffers, replay), and in doing so you'll discover that its two infamous constraints are not incidental limitations but the direct price of the win.

Why this lab exists

CUDA graphs have a reputation as deep GPU arcana, and the reputation is wrong: the mechanism is pure systems — a cache of recorded work, keyed by shape, replayed from fixed memory — and it simulates perfectly on a laptop. What's genuinely hard about graphs in production is not the replay; it's the discipline the constraints impose on everything else: every batch must arrive at a captured shape (lab-05's padding ladder), every input must be written into the same buffers (the input_addresses checks upstream), and anything dynamic — like attention over varying sequence lengths — must either be made shape-stable or cut out of the graph (lab-03's piecewise modes). You can't reason about any of that machinery until the core capture/replay contract is in your fingers. That's this lab.

The simulator you build mirrors mini_vllm/cudagraph.py, which itself mirrors the real CUDAGraphWrapper (upstream/vllm/compilation/cuda_graph.py) — same per-shape dict, same static-buffer copy, same single-launch accounting. The launch counter stands in for wall-clock CPU overhead, for the usual course reason: a counter gives you formulas (lab-02 derives them), a stopwatch gives you noise.

Background: one win, two constraints

The WIN — eager execution pays one launch per op, every call. A captured graph pays the full cost once (capture), then one launch per replay regardless of how many kernels are inside. For a 300-kernel decode step replayed thousands of times per second, that's the difference lab-04 measures at ~2.5× end-to-end.
CONSTRAINT 1 (fixed shape) — the recording bakes in every tensor size, grid dimension, and memory extent. A different batch size is a different recording. Hence: graphs are stored in a dict keyed by shape (upstream: concrete_cudagraph_entries keyed by BatchDescriptor), and unseen shapes must capture anew.
CONSTRAINT 2 (static buffers) — the recording bakes in addresses. Replay reads the same input memory it was captured from, so new inputs must be copied into the captured buffer before replay (upstream asserts this: the input_addresses consistency check). Forget the copy and the graph happily recomputes last step's batch — the classic graph bug, and test_static_buffer_reflects_new_input exists to make you commit it once, here, where it's cheap.

Both constraints are the same fact stated twice: a graph is a recording, not a program. Recordings don't take arguments.
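The stale-input bug takes about a dozen lines to reproduce. In this sketch `Recording` is a hypothetical stand-in for a captured graph (it remembers one buffer and always reads from it), and the multiply-by-10 op is an arbitrary choice for illustration.

```python
import numpy as np


def address(arr):
    """Memory address of the array's first byte."""
    return arr.__array_interface__["data"][0]


class Recording:
    """Stand-in for a captured graph: bound to one buffer, takes no arguments."""

    def __init__(self, buf):
        self.buf = buf
        self.address = address(buf)

    def replay(self):
        return self.buf * 10


static_input = np.full(4, 1.0)
graph = Recording(static_input)

# Wrong: rebinding the name allocates a new array the recording never sees.
static_input = np.full(4, 5.0)
rebind_moved = address(static_input) != graph.address
stale = graph.replay()      # still computed from the old 1.0 values

# Right: write the new values into the buffer the recording is bound to.
np.copyto(graph.buf, np.full(4, 5.0))
copy_kept_address = address(graph.buf) == graph.address
fresh = graph.replay()      # now computed from the 5.0 values
```

The rebind silently yields last step's answer with no error raised, which is why a debug-mode address check is worth having.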

Files

starter.py — LaunchCounter, run_eager, and GraphRunner stubbed. Your work.
solution.py — reference (mirrors mini_vllm/cudagraph.py).
test_lab.py — the win, both constraints, correctness, and the 100-call accounting.

Run

LAB_IMPL=starter pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-01-graph-replay-simulator -q
pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-01-graph-replay-simulator -q   # reference

What to implement

LaunchCounter — class-level n, reset(), bump(k=1). (Global on purpose: launch overhead is a process-wide resource, which is also why one slow Python step stalls every request in the batch.)
run_eager(ops, x) — bump once per op, every call.
GraphRunner(ops).__call__(x):
- Capture (shape unseen): copy x into static_input, run ops (bump each), cache a GraphEntry, return the output.
- Replay (shape seen): np.copyto(entry.static_input, x) — into the existing buffer, never rebind the reference — recompute from the buffer, bump(1) total, return.

What the tests prove

Test	What it pins
`test_eager_pays_one_launch_per_op`	The baseline cost model
`test_capture_then_replay_is_one_launch`	The WIN: capture = `len(ops)`, replay = exactly 1
`test_replay_output_matches_eager`	Replay is an optimization, not a behavior change — the course's master invariant, graph edition
`test_static_buffer_reflects_new_input`	Constraint 2: capture with value 1, replay with value 5, get 50 — the copy-into-buffer is live
`test_new_shape_triggers_recapture`	Constraint 1: shape (8,) after shape (4,) pays full capture; both entries coexist in the dict
`test_graphs_win_when_overhead_dominates`	100 calls: 300 eager launches vs 102 graph launches — the amortization lab-02 turns into formulas

Hitchhiker's notes

np.copyto(buf, x) vs buf = x is the whole lab. Rebinding the Python name does nothing to the captured memory; the real API has the same trap (in the PyTorch graph idiom you must call static_tensor.copy_(new), never reassign the tensor). If you remember one line from this phase, make it this one.
Find your three lines upstream: capture (cuda_graph.py:313, inside torch.cuda.graph(...)), replay (:360, entry.cudagraph.replay()), the per-shape dict (:207). The production wrapper adds warmup runs before capture (CUDA needs the allocator and autotuners settled), a memory pool shared across graphs, and debug-mode address assertions — engineering around exactly the two constraints you implemented.
What can't be captured at all? Anything whose control flow depends on data: CPU-side branching, dynamic shapes inside the sequence, unsupported ops (some collectives, host syncs). vLLM's answer is to compile the model into a shape-stable form first (torch.compile, with attention marked as a splitting op) — graphs are the last stage of the compilation pipeline, not a standalone trick. That pipeline is the deep-dive's subject; lab-03 handles the mode routing it produces.
Replay still runs the ops here (numpy has no real recording) — the simulation's one honest cheat. The accounting (one launch) models the real benefit; the real replay also skips Python entirely, which is why the measured win (lab-04) can exceed what launch-counting alone predicts.

Going further

Add an input_addresses assertion to your replay path (store id(entry.static_input) at capture; assert it unchanged at replay) — you've reproduced upstream's debug check, and you'll appreciate why it exists the first time you "optimize" the copy away.
Give GraphRunner a memory budget: each entry costs prod(shape) bytes; evict LRU when over budget. Now you have the graph-pool problem, and a feel for why upstream shares one memory pool across all captured sizes instead.
Wire your GraphRunner around mini_vllm's ToyModel.forward for fixed batch sizes and count launches across a full generate() — the engine-level integration upstream does in the model runner.

References

mini_vllm/cudagraph.py — the annotated simulator this lab rebuilds, with upstream line references throughout.
upstream/vllm/compilation/cuda_graph.py — CUDAGraphWrapper: capture, replay, BatchDescriptor dict, address checks.
NVIDIA, Getting Started with CUDA Graphs — the original motivation and API: https://developer.nvidia.com/blog/cuda-graphs/
PyTorch docs, CUDA Graphs (torch.cuda.CUDAGraph) — the idiom vLLM builds on, including the static-buffer pattern: https://pytorch.org/docs/stable/notes/cuda.html#cuda-graphs
Phase 0 lab-04 — why small-batch decode is launch-overhead territory in the first place.

Lab 05-02 — Launch Overhead & the Eager↔Graph Crossover `[CPU-OK]`

Lab-01 gave you the mechanism; this lab gives you the economics. Capturing a graph isn't free — the first call pays full overhead, plus (in reality) warmup and memory. So when does the investment pay off, and how big can the payoff get? You'll derive both answers as closed-form formulas — the crossover point and the asymptotic speedup — and they're worth deriving because they're the difference between "graphs are good" (folklore) and knowing for which workloads, by how much, and when not to bother (engineering). Spoiler for the impatient: break-even at the second call, asymptotic speedup = the number of ops captured — and both facts have production consequences listed below.

Why this lab exists

Every caching/compilation decision in systems — JIT vs interpret, memoize vs recompute, capture vs eager — has the same shape: an upfront cost amortized over repeats. The two numbers that decide it are always the same two you'll derive here: how soon does it break even (crossover) and what's the ceiling (asymptotic ratio). This lab drills the pattern on the cleanest possible instance, where both answers are exact integers. After it, you'll recognize the same analysis inside torch.compile's warmup tradeoffs, lab-05's ladder-density question, and Phase 18's "is this optimization worth its startup cost" recurring decision.

It also arms you for the most common graphs-related production question: "should I set enforce_eager=True to speed up startup?" The formulas say precisely what that trades away, per decode step, forever — and lab-04 confirms the prediction on silicon.

The model (launches as a proxy for CPU overhead)

Eager, k calls of a num_ops-op model: k × num_ops launches.
Graph: first call captures at cost capture_cost_ops (default num_ops), every later call replays in 1: total capture_cost_ops + (k − 1).

One unit ≈ one kernel launch ≈ some microseconds of CPU time. The model deliberately ignores GPU compute time — which is exactly why its predictions hold when launches dominate (small-batch decode) and fade when they don't (large batch, prefill). Knowing a model's domain of validity is part of the lab; lab-04's batch-64 numbers show the fade.
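The launch model above is small enough to write out directly. What follows is a minimal sketch, not the lab's reference solution: the function names follow starter.py, but the signatures, the `capture_cost_ops=None` default convention, the `speedup` helper, and the `max_calls` search bound are assumptions made for illustration.

```python
from typing import Optional


def eager_launches(k: int, num_ops: int) -> int:
    """Eager mode pays one launch per op on every call."""
    return k * num_ops


def graph_launches(k: int, num_ops: int,
                   capture_cost_ops: Optional[int] = None) -> int:
    """First call pays the capture cost; each later call is one replay launch."""
    if k <= 0:
        return 0
    cost = num_ops if capture_cost_ops is None else capture_cost_ops
    return cost + (k - 1)


def crossover(num_ops: int, capture_cost_ops: Optional[int] = None,
              max_calls: int = 10_000) -> Optional[int]:
    """Smallest call count k at which the graph total is strictly below eager."""
    for k in range(1, max_calls + 1):
        if graph_launches(k, num_ops, capture_cost_ops) < eager_launches(k, num_ops):
            return k
    return None  # never pays off within the search bound (e.g. num_ops == 1)


def speedup(k: int, num_ops: int, capture_cost_ops: Optional[int] = None) -> float:
    """Cumulative eager/graph launch ratio after k calls."""
    return eager_launches(k, num_ops) / graph_launches(k, num_ops, capture_cost_ops)


if __name__ == "__main__":
    for n in (1, 30, 300):
        print(n, crossover(n), round(speedup(10_000, n), 1))
```

Searching for the crossover by brute force rather than solving for it keeps the sketch honest about off-by-ones: the loop compares the same two totals the tests pin.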

Files

starter.py — implement eager_launches, graph_launches, crossover, asymptotic_speedup. Your work.
solution.py — reference.
test_lab.py — pins the formulas, the second-call break-even, the num_ops == 1 degenerate case, and the asymptote.

Run

LAB_IMPL=starter pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-02-launch-overhead -q
pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-02-launch-overhead -q   # reference

The two results you should be able to state cold

Crossover. With capture_cost_ops = num_ops > 1, the graph total beats eager from the 2nd call onward: capture costs one eager-pass-worth, and every replay after saves num_ops − 1. Graphs are not a long-game investment — they pay back almost immediately provided the shape repeats. The real risk is never "capture was too expensive"; it's "the shape never came back" (which is why capture sizes exist — lab-05 — and why wildly dynamic workloads gain little).
Asymptotic speedup. As k → ∞, per-call launches → num_ops (eager) vs 1 (graph): the launch-overhead speedup approaches the number of ops captured. A model with 300 kernels per decode step has a 300× launch speedup ceiling, diluted in wall-clock by the GPU work that remains (Amdahl), which is why lab-04 measures 2.5×, not 300×. Bigger models → more GPU work per launch → more dilution, so graphs buy a smaller wall-clock gain; smaller models → launch-bound → graphs are load-bearing. This is Phase 0 lab-04's 8 ms-vs-1 ms step analysis, now with the fix attached.
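Both results follow from one inequality. As a sketch, write n for num_ops, C for capture_cost_ops, and k for the number of same-shape calls (these symbols are shorthand introduced here, not names from the lab files):

```latex
\begin{align*}
E(k) &= k\,n, \qquad G(k) = C + (k - 1) \\
G(k) < E(k) &\iff k\,(n - 1) > C - 1 \iff k > \frac{C - 1}{n - 1} \quad (n > 1) \\
C = n &\implies k > 1 \implies k^{*} = 2 \\
S(k) &= \frac{E(k)}{G(k)} = \frac{k\,n}{C + k - 1} \xrightarrow{\;k \to \infty\;} n
\end{align*}
```

For n = 1 the middle inequality reads 0 > C − 1, which no C ≥ 1 satisfies, matching the "never" case. With C = 3n and n = 300 the bound is 899/299 ≈ 3.01, so break-even moves to the 4th call while the limit stays n.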

What the tests prove

Test	What it pins
formula tests	`eager = k·n`, `graph = capture + (k−1)` — exact, no off-by-ones (an off-by-one here is a wrong capacity claim later)
crossover tests	break-even at call 2 for `n > 1`; never for `n == 1` (one launch either way — a single fused megakernel gains nothing from graphs, an endpoint worth knowing)
asymptote test	per-call ratio → `n` from below, monotonically

Hitchhiker's notes

What capture_cost_ops > num_ops models: real capture includes warmup passes (allocator, autotuners), stream synchronization, and graph instantiation — typically several eager-passes-worth. It shifts the crossover later by a few calls; it never changes the asymptote. The real engine moves this entire cost to startup (capturing the whole ladder before serving — lab-04's 7-second log line), so steady-state traffic sees only replays: the crossover question is answered "before the first request."
Why the speedup ceiling is num_ops and not infinite: a replay is still one launch. The only way past it is fewer-than-one launch per step — batching multiple steps per launch — which exists (multi-step scheduling / async scheduling in vLLM's history) and brings its own complications. Ceilings tell you where the next optimization frontier is; that's their real use.
torch.compile plays the same game one level up: compilation cost (seconds to minutes, cached to disk in vLLM via the compilation cache) amortized over runs; kernel fusion reduces num_ops itself, which lowers what graphs have left to save. Fusion and capture are complementary attacks on the same k × num_ops bill — fusion shrinks num_ops, capture shrinks its coefficient. The deep-dive's pipeline (compile → piecewise split → capture) is exactly this composition.
The formulas assume the shape repeats. Accounting is per shape: every distinct batch size pays its own capture cost and runs its own crossover race. A uniform-traffic deployment amortizes a handful of shapes beautifully; a chaotic one spreads k thin across many shapes. That's the bridge to lab-05: the ladder exists to concentrate k onto few shapes by padding.
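The per-shape point in the last note can be made concrete by applying lab-02's formula once per distinct batch size. This is a sketch under the lab's own simplification (each shape's first appearance pays a capture worth num_ops launches); the real engine captures at startup instead, and `trace_launches` is a hypothetical helper name, not part of the lab files.

```python
from collections import Counter
from typing import Iterable, Tuple


def trace_launches(trace: Iterable[int], num_ops: int) -> Tuple[int, int]:
    """Return (eager, graph) launch totals for a sequence of batch sizes.

    Each distinct batch size gets its own graph, so each one pays
    num_ops for capture and 1 launch for every later repeat.
    """
    counts = Counter(trace)
    eager = sum(counts.values()) * num_ops
    graph = sum(num_ops + (repeats - 1) for repeats in counts.values())
    return eager, graph


if __name__ == "__main__":
    uniform = [8] * 1000               # one shape, repeated 1000 times
    chaotic = list(range(1, 1001))     # 1000 shapes, each seen once
    print("uniform:", trace_launches(uniform, num_ops=300))
    print("chaotic:", trace_launches(chaotic, num_ops=300))
```

Both traces have the same number of steps; only the concentration of k differs. The uniform trace collapses to roughly one launch per step, while the chaotic one never gets past capture and ties with eager.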

Going further

Plot per-call cost vs k for n ∈ {3, 30, 300} with capture_cost = 3n. Mark the crossovers. This single chart is the "should we graph it?" conversation, pre-had.
Add a second resource to the model: each captured graph costs M memory, and you have budget B. Combined with lab-05's ladder, derive the optimal number of rungs for a given traffic distribution — you've reinvented the actual config-tuning problem.
Measure a real launch: time torch.mm on tiny tensors CPU-side (or just a no-op Python function call stack 300 deep) and put microseconds to the unit. The model stays the same; the constants acquire meaning.

References

Lab-04 — the formulas, confirmed on an L4: ~2.5× at batch 1, fading to 1.13× at 64.
upstream/vllm/compilation/cuda_graph.py — where capture cost is actually paid.
NVIDIA, CUDA Graphs blog — measured launch overheads that set the unit: https://developer.nvidia.com/blog/cuda-graphs/
Phase 0 lab-04 — the roofline reason launch overhead only matters at small step sizes.
Hennessy & Patterson, Computer Architecture — Amdahl's law, the reason ceilings dilute; any edition, the first chapter.

Lab 05-03 — Reimplement the `CUDAGraphMode` Dispatch `[CPU-OK]`

Labs 01–02 established that graphs love uniform, repeated shapes. Now meet the batch that hates them: a mixed batch — Phase 3's chunked prefill riding alongside decodes, every step a different ragged collection of sequence lengths flowing into attention. A FULL graph can't swallow that. vLLM's answer is a small but consequential piece of policy: the CUDAGraphMode enum (upstream/vllm/config/compilation.py:53), which routes pure-decode batches and mixed batches to different graph strategies — and whose composite values (FULL_AND_PIECEWISE, the V1 default) are the reason your lab-04 capture log shows two capture passes. You'll reimplement its dispatch methods exactly, because this tiny enum is where three phases of machinery (chunked prefill, attention metadata, graph constraints) get reconciled in about ten lines.

Why this lab exists

Most engineers meet cudagraph_mode as a config string they cargo-cult when something breaks ("try PIECEWISE"). The enum deserves better: it's a textbook example of encoding a two-dimensional policy in a one-dimensional config, and the dispatch methods you'll write are the decoder ring. Once you've implemented decode_mode/mixed_mode/has_mode yourself, every graphs-related symptom maps to a row of the routing table: capture log has one pass instead of two → someone set FULL; mixed batches mysteriously slow → mode is FULL_DECODE_ONLY and prefill steps run eager; compile time doubled → the mode requires_piecewise_compilation and the model was split at attention.

There's also a compile-time/run-time lesson here that generalizes: some of these flags must be known before the model is compiled (you can't piecewise-replay a graph that wasn't piecewise-compiled), so the enum is consulted in two different epochs of the engine's life. Configuration that crosses epochs is where the subtle bugs live — this lab makes the two consumers explicit.

Background (read first)

class CUDAGraphMode(enum.Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_DECODE_ONLY = (FULL, NONE)         # full graph for decode, no graph for mixed
    FULL_AND_PIECEWISE = (FULL, PIECEWISE)  # full for decode, piecewise for mixed (V1 default)

The composite modes are tuples (decode_mode, mixed_mode). Why the split:

A pure-decode batch is the graph's dream: every request contributes exactly one token, shapes are uniform (padded to a ladder rung — lab-05), attention metadata is regular. Safe for a FULL graph — the entire step, attention included, one replay.
A mixed batch (prefill chunks + decodes) has per-request query lengths, ragged attention metadata, varlen kernels — exactly what a recording can't generalize over. Options: no graph at all (NONE), or PIECEWISE — capture the shape-stable runs between attention calls and run attention eagerly. Piecewise is the compromise that keeps most of the launch win (the hundreds of small ops around attention) while letting the one genuinely dynamic op stay dynamic.

PIECEWISE requires the model to have been compiled with attention as a splitting op (torch.compile carves the graph at splitting_ops) — that's the compile-time dependency requires_piecewise_compilation guards.

Files

starter.py — implement separate_routine, decode_mode, mixed_mode, has_mode, requires_piecewise_compilation, runtime_mode_for. Modes are strings; composites live in a ROUTINES dict. Your work.
solution.py — reference.
test_lab.py — the full routing table, every mode × both batch kinds.

Run

LAB_IMPL=starter pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-03-cudagraph-mode -q
pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-03-cudagraph-mode -q   # reference

What you must reproduce

separate_routine(m) → is m composite (distinct decode/mixed routines)?
decode_mode(m) / mixed_mode(m) → the concrete mode for that batch kind (composites split; simple modes return themselves for both — FULL means full everywhere, which is why it's only safe with chunked prefill disabled or padded-prefill tricks).
has_mode(m, target) → does m employ target in any routine?
requires_piecewise_compilation(m) → has_mode(m, PIECEWISE).
runtime_mode_for(m, is_decode) → the per-batch dispatch the wrapper performs each step (upstream: the BatchDescriptor.uniform_decode flag selecting the entry).

The routing you're proving

mode	decode batch	mixed batch	needs piecewise compile?
`NONE`	NONE	NONE	no
`PIECEWISE`	PIECEWISE	PIECEWISE	yes
`FULL`	FULL	FULL	no
`FULL_DECODE_ONLY`	FULL	NONE	no
`FULL_AND_PIECEWISE`	FULL	PIECEWISE	yes

Memorize the last row — it's the default, and both lab-04 capture passes ((decode, FULL) and (mixed prefill-decode, PIECEWISE)) are its two cells.
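The table above fits in a few lines of code. The following is a minimal sketch of the string-based variant the lab describes (modes as strings, composites in a ROUTINES dict); the exact signatures are assumptions, and upstream implements the equivalent logic as methods on the enum rather than as free functions.

```python
NONE, PIECEWISE, FULL = "NONE", "PIECEWISE", "FULL"

# Composite modes map to (decode_mode, mixed_mode).
ROUTINES = {
    "FULL_DECODE_ONLY": (FULL, NONE),
    "FULL_AND_PIECEWISE": (FULL, PIECEWISE),
}


def separate_routine(m: str) -> bool:
    """True when decode and mixed batches use distinct routines."""
    return m in ROUTINES


def decode_mode(m: str) -> str:
    return ROUTINES[m][0] if separate_routine(m) else m


def mixed_mode(m: str) -> str:
    return ROUTINES[m][1] if separate_routine(m) else m


def has_mode(m: str, target: str) -> bool:
    """Does m employ target in either routine?"""
    return target in (decode_mode(m), mixed_mode(m))


def requires_piecewise_compilation(m: str) -> bool:
    return has_mode(m, PIECEWISE)


def runtime_mode_for(m: str, is_decode: bool) -> str:
    """Per-batch dispatch: pick the routine that matches the batch kind."""
    return decode_mode(m) if is_decode else mixed_mode(m)


if __name__ == "__main__":
    for m in (NONE, PIECEWISE, FULL, "FULL_DECODE_ONLY", "FULL_AND_PIECEWISE"):
        print(f"{m:<20} {runtime_mode_for(m, True):<10} "
              f"{runtime_mode_for(m, False):<10} {requires_piecewise_compilation(m)}")
```

Note that only `requires_piecewise_compilation` is a compile-time question; `runtime_mode_for` is the one asked again on every step.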

Hitchhiker's notes

Why is FULL_AND_PIECEWISE the default and not plain FULL? Because chunked prefill is default-on (Phase 3): mixed batches are the common case, not the exception, and a FULL-only config would either crash on them or force eager. The default encodes the workload assumption; change the workload assumption (e.g. a decode-only disaggregated worker — Phase 15) and FULL_DECODE_ONLY becomes the rational pick. Configs are workload claims in disguise.
Where the dispatch actually happens: per step, the runner builds a BatchDescriptor (batch size + uniform-decode flag); the graph wrapper keys its entry dict on it (lab-01's dict, now two-dimensional). Your runtime_mode_for(m, is_decode) is that lookup's policy half.
What "attention runs eagerly" costs in PIECEWISE: one-ish launches per attention per layer per step, vs the hundreds saved elsewhere. That's why piecewise keeps most of the win — and why backends that support graph-safe attention metadata (uniform decode) unlock FULL for the decode half, which is the entire point of the composite.
Failure smell catalog: capture log shows one pass → not the default mode; OOM during capture → ladder too long × two routines (lab-05's memory cost, doubled); "piecewise compilation required" assertion → mode demands PIECEWISE but compilation level didn't split. Ten lines of enum, three distinct production symptoms.
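To put a number on "piecewise keeps most of the win", the sketch below counts launches per step under each strategy. It rests on stated assumptions: every contiguous run of non-attention ops replays as one launch, every attention call costs one eager launch, and the 12-layer toy with 24 non-attention ops per layer is invented for illustration (`launches_per_step` is a hypothetical helper, not an upstream function).

```python
from typing import List


def launches_per_step(ops: List[str], mode: str) -> int:
    """Count launches for one forward pass.

    ops  : op names in execution order; "attn" marks an attention call.
    mode : "NONE" (eager), "PIECEWISE", or "FULL".
    """
    if mode == "FULL":
        return 1                    # whole step is one replay
    if mode == "NONE":
        return len(ops)             # one launch per op
    if mode != "PIECEWISE":
        raise ValueError(f"unknown mode: {mode}")

    launches = 0
    in_segment = False
    for op in ops:
        if op == "attn":
            launches += 1           # attention runs eagerly
            in_segment = False      # and closes the current segment
        elif not in_segment:
            launches += 1           # first op of a new captured segment
            in_segment = True
    return launches


if __name__ == "__main__":
    layer = ["op"] * 12 + ["attn"] + ["op"] * 12
    model = layer * 12              # 300 ops total
    for mode in ("NONE", "PIECEWISE", "FULL"):
        print(mode, launches_per_step(model, mode))
```

On this toy, eager pays 300 launches, piecewise pays 25 (13 captured segments plus 12 attention calls), and full pays 1: piecewise already removes over 90% of the launch bill, and FULL collects the remainder on the batches that allow it.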

Reflect

Why can't the runtime "just check if the batch is uniform and use FULL when it can" without any enum? (It does check — that's runtime_mode_for. The enum exists for the compile-time half: whether to split at attention must be decided before any batch arrives. Runtime flexibility is bounded by compile-time commitments.)
A team disables chunked prefill entirely and serves short prompts only. Which mode maximizes their throughput, and what new risk do they take? (FULL — every batch can be graph-shaped now; the risk is any stray mixed/odd batch has no graph and no piecewise fallback: eager cliffs.)
Sketch the routing table for a hypothetical PIECEWISE_DECODE_ONLY. Why does no such mode ship? (If decode batches — the most uniform — can only manage piecewise, mixed can't do better; the composite would collapse to plain PIECEWISE.)

References

upstream/vllm/config/compilation.py:53 — the real enum and its methods; diff your solution against it line by line.
upstream/vllm/compilation/cuda_graph.py — BatchDescriptor and the per-entry dispatch.
upstream/vllm/v1/worker/gpu_model_runner.py — where uniform_decode is determined per step.
vLLM docs, Compilation Config — the user-facing knob this enum sits behind: https://docs.vllm.ai/en/latest/configuration/optimization.html
Lab-04's capture log — both routines of the default mode, visible at startup.

Lab 05-04 — CUDA Graphs vs Eager on Real vLLM `[GPU-REQ]`

The payoff lab: everything you derived on paper in labs 01–03 — the launch-overhead win, the crossover economics, the two-routine capture — measured on real silicon. You'll run the same tiny model with graphs on (the default) and with enforce_eager=True (graphs and compilation off) across batch sizes 1, 8, and 64, and watch the speedup do exactly what lab-02's model predicts: ~2.5× at batch 1, fading to ~1.13× at batch 64 as the bottleneck migrates from CPU launches to GPU compute.

No GPU? Don't panic. The captured output below is the experiment; every number in it is annotated against the labs that predicted it. Read it like a lab notebook.

Why this lab exists

A model that predicts is worth a hundred that explain after the fact. Labs 01–02 made three falsifiable claims: graphs help most when GPU work per step is smallest (batch 1); the help fades — never inverts — as batch grows; and the cost is a visible one-time capture at startup. This lab is the falsification attempt. When the L4 numbers land on the predicted curve, you've earned something better than a benchmark result: a validated mental model you can extrapolate to hardware you've never touched ("H100, 70B, batch 32 — graphs matter how much?") — which is what capacity planning actually requires.

The experimental design itself is the second lesson: one knob (enforce_eager), one sweep variable (batch size), fixed everything else, and a baseline arm. The number of production "benchmarks" that fail this bar is the reason Phase 18 exists.

Requirements

uv pip install -e ".[vllm]"
huggingface-cli download facebook/opt-125m

(OPT-125m again, deliberately: a small model maximizes the launch-overhead share of step time — Phase 0 lab-04's arithmetic — making it the best-case stage for graphs. Keep that in mind when extrapolating to 70B; see the notes.)

Steps

# run.py
import time
from vllm import LLM, SamplingParams

def bench(enforce_eager: bool, n_prompts: int):
    llm = LLM(model="facebook/opt-125m", enforce_eager=enforce_eager,
              gpu_memory_utilization=0.5, max_model_len=512)
    prompts = ["The meaning of life is"] * n_prompts
    sp = SamplingParams(max_tokens=128, temperature=0)
    t0 = time.perf_counter()
    out = llm.generate(prompts, sp)
    dt = time.perf_counter() - t0
    toks = sum(len(o.outputs[0].token_ids) for o in out)
    # Pad the flag and batch size so the columns line up as in the captured output.
    print(f"enforce_eager={str(enforce_eager):<5} batch={n_prompts:<2}: {toks/dt:8.1f} tok/s")

for bs in (1, 8, 64):
    bench(enforce_eager=True,  n_prompts=bs)   # graphs + compile OFF
    bench(enforce_eager=False, n_prompts=bs)   # graphs ON (default)

Compare the pairs at each batch size; compute the ratios.
Watch the startup logs in the graphs-on runs: the capture progress bars are lab-02's capture_cost, paid where you can see it.
Re-run a pair twice and note run-to-run variance before trusting any single ratio — the habit that separates measurements from numbers.

Captured output (real run, facebook/opt-125m, L4 24GB, vLLM 0.22.1)

enforce_eager=True  batch=1 :    980.3 tok/s
enforce_eager=False batch=1 :   2473.6 tok/s     # ~2.5x: pure launch-overhead win at bs=1
enforce_eager=True  batch=8 :   6912.4 tok/s
enforce_eager=False batch=8 :  11034.8 tok/s     # ~1.6x: still CPU-bound-ish
enforce_eager=True  batch=64:  41560.2 tok/s
enforce_eager=False batch=64:  46883.1 tok/s     # ~1.13x: GPU-bound, graphs help less

# startup, graphs ON:
INFO ... Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|####| 23/23
INFO ... Capturing CUDA graphs (decode, FULL): 100%|####| 23/23
INFO ... Graph capturing finished in 7 secs, took 0.41 GiB
# startup, enforce_eager=True: (no capture step)

Reading the numbers like an engineer

2.5× at batch 1: the launch-bound regime. An OPT-125m decode step is sub-millisecond of GPU work behind a few hundred kernel launches; remove the launches (lab-01's WIN) and the step nearly collapses to its GPU time. This is the headline, and it's workload-specific: agentic, single-stream, small-model serving lives here.
The fade, 2.5× → 1.6× → 1.13× — Amdahl in motion. Bigger batches mean more GPU work per (unchanged) launch bill; the removable fraction shrinks. Note what doesn't happen: the ratio never dips below 1. Graphs don't have a regime where they hurt steady-state throughput — the cost lives entirely at startup. That asymmetry is why they're default-on rather than a tuning option.
23/23 twice — the capture-size ladder (lab-05: count default_capture_sizes(512)) run once per routine of FULL_AND_PIECEWISE (lab-03's table, bottom row, live). If you ever need a one-glance config check on a running deployment, this pair of progress bars is it.
7 secs, 0.41 GiB — lab-02's capture_cost, in physical units: 46 captures' worth of warmup+record, and the shared graph memory pool. Amortized over millions of steps; but on a CI box that boots vLLM per test, 7 seconds × every test is real money — which is why enforce_eager=True is the standard test-suite setting upstream while being wrong for production. Same knob, opposite verdicts, both derivable from lab-02.
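The ratios quoted above can be recomputed from the captured throughputs, along with the share of eager step time that the graphs-plus-compile bundle removed. This is a small sketch: the numbers are copied from the captured output, and `analyse` is a hypothetical helper. Since throughput is inversely proportional to step time, the removed share is 1 − 1/ratio.

```python
from typing import Dict, Tuple

# batch size -> (eager tok/s, graphs-on tok/s), copied from the captured run
CAPTURED: Dict[int, Tuple[float, float]] = {
    1: (980.3, 2473.6),
    8: (6912.4, 11034.8),
    64: (41560.2, 46883.1),
}


def analyse(captured: Dict[int, Tuple[float, float]]) -> Dict[int, Tuple[float, float]]:
    """Return batch size -> (speedup ratio, fraction of eager time removed)."""
    rows = {}
    for batch, (eager_tps, graph_tps) in captured.items():
        ratio = graph_tps / eager_tps
        removed = 1.0 - 1.0 / ratio   # time_graph / time_eager == 1 / ratio
        rows[batch] = (ratio, removed)
    return rows


if __name__ == "__main__":
    for batch, (ratio, removed) in analyse(CAPTURED).items():
        print(f"batch={batch:<2} speedup={ratio:.2f}x  eager time removed={removed:.0%}")
```

Read the second column as Amdahl's removable fraction: about 60% of an eager step at batch 1, about 11% at batch 64. Because enforce_eager also turns off compilation, that fraction belongs to the bundle, not to graphs alone.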

Hitchhiker's notes

Extrapolating to big models: a 70B's decode step is tens of ms of GPU work — the launch bill is a far smaller fraction, so expect graph gains in single-digit percents at moderate batch, not 2.5×. Graphs matter most for small models, small batches, long generations — which, conveniently, describes draft models in speculative decoding (Phase 8), where graphs are practically mandatory.
enforce_eager=True disables compilation too, so this A/B bundles two effects (fused kernels + graphs). For the isolated graph effect, compare cudagraph_mode=NONE with compilation on vs the default. The bundle is what operators actually toggle, hence the lab measures the bundle — but know what's in the box before attributing the delta.
Variance discipline: tok/s from a single generate call includes engine startup effects, first-iteration warmup, and timer jitter. The captured numbers are representative, not sacred — your L4 will differ by a few percent, your 4090 by more. What must reproduce is the shape: big ratio at 1, monotone fade, no inversion. If your shape differs, that's interesting; investigate (background processes, thermal throttling, a different default mode).
When is enforce_eager right in production? Debugging (eager stack traces point at real lines; graph replays don't), extreme memory pressure (reclaim the graph pool's GiB), or genuinely chaotic shapes beyond the ladder. Rare — but "what's the escape hatch and what does it cost" is exactly the question this lab leaves you able to answer with numbers.

Reflect

Predict before measuring: on your hardware, will batch-8 land closer to the batch-1 or batch-64 ratio? Which parameter of lab-02's model are you implicitly estimating? (The GPU-work-per-step share — i.e. where batch 8 sits relative to the roofline ridge from Phase 0 lab-04.)
The capture log shows 0.41 GiB for 46 graphs of a 125m model. Sketch why a 70B model with tensor parallelism captures in a similar order of memory (graphs store launch topology + workspace, not weights) — and why people are still surprised by the pool's size on memory-tight deployments.
Your service restarts pods on every deploy, 50× a day. Quantify the capture tax and name two mitigations. (7 s × 50 = ~6 min/day of cold capacity; mitigate via fewer capture sizes — lab-05's ladder — or vLLM's compilation cache for the compile half; the capture half re-runs regardless.)

References

Labs 01–02 — the mechanism and the formulas these numbers validate.
Lab-03 — why the capture log has exactly two passes; lab-05 — why each pass has 23.
upstream/vllm/v1/worker/gpu_model_runner.py — the capture loop emitting those progress bars.
vLLM docs, Optimization and Tuning — enforce_eager, cudagraph_mode, compilation knobs: https://docs.vllm.ai/en/latest/configuration/optimization.html
Phase 18 — the benchmarking discipline this lab previews (variance, baselines, sweeps).

Lab 05-05 — Capture Sizes: Bucketing Batches into Graphs `[CPU-OK]`

Lab-01 left you with a tension it didn't resolve. A CUDA graph is captured per shape (Constraint 1) — but a decode batch's size changes every step as requests join and finish (Phase 1 lab-04 showed you the churn). Capture a graph for every possible batch size from 1 to max_num_seqs? That's hundreds of captures: minutes of startup and gigabytes of graph memory. Capture only a few? Then most steps have no matching graph. vLLM's answer is the capture-size ladder: a curated list of sizes, with every batch padded up to the nearest rung. In this lab you implement the ladder, the rung lookup, and the waste accounting — and answer the production question this mechanism generates weekly: "why is my batch of 33 running at size 40?"

Why this lab exists

This is the lab where CUDA graphs stop being a binary feature ("on = fast") and become a budgeted trade you can reason about quantitatively. Every rung in the ladder costs capture time at startup and graph-pool memory forever; every gap between rungs costs padded rows — real FLOPs spent computing garbage that's discarded — on every step that lands in the gap. The deliverable skill: given a workload's batch-size distribution, say whether the default ladder fits it, and what changing cudagraph_capture_sizes (or max capture size) would buy. That's a real tuning lever (Phase 18) hiding behind an innocuous config list.

It also explains two log lines and one metric that otherwise mystify operators: the Capturing CUDA graphs ... 23/23 startup progress bar (that's the ladder's length — count the rungs in default_capture_sizes(512)), the graph-pool memory in took 0.41 GiB (lab-04's capture pass), and the small constant gap between num_running and the batch size the profiler shows (the padding).

Background: padding as the price of replay

Replay requires the captured shape, exactly (lab-01, Constraint 2: same buffers, same sizes). A batch of 33 with a graph captured at 40 runs as follows: the 33 real rows are copied into the static input buffer, rows 34–40 are filled with junk (typically zeros or stale data; it doesn't matter, because their outputs are never read), and the whole 40-row graph replays. The 7 padded rows cost 7/33 ≈ 21% extra compute for that step (17.5% of the 40 rows replayed), which is almost always cheaper than the alternative (an eager step paying per-kernel launches), because decode steps are launch-overhead-dominated at exactly these small sizes (lab-02's regime, Phase 0 lab-04's roofline).

The ladder's shape encodes where padding hurts: rungs are dense at small sizes ([1, 2, 4], then every 8) because relative waste is worst there — padding 3 → 4 is 33%, padding 250 → 256 is 2.4%. Above the largest rung the engine just runs eagerly: at that much GPU work per step, launch overhead is amortized anyway and graphs stop mattering (lab-04's shrinking gap, measured).
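The ladder, the rung lookup, and the waste accounting described above can be sketched as follows. The function names match starter.py, but the signatures are assumptions, and the ladder rule ([1, 2, 4], then every multiple of 8 up to the maximum) is taken from this lab's description rather than from upstream's exact default.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence


def default_capture_sizes(max_size: int) -> List[int]:
    """Dense at the bottom (1, 2, 4), then every 8 up to max_size."""
    sizes = [s for s in (1, 2, 4) if s <= max_size]
    sizes += list(range(8, max_size + 1, 8))
    return sizes


def select_capture_size(batch: int, sizes: Sequence[int]) -> Optional[int]:
    """Smallest rung >= batch, or None when the batch exceeds the top rung."""
    i = bisect_left(sizes, batch)     # sizes must be sorted ascending
    return sizes[i] if i < len(sizes) else None


def padded_tokens(batch: int, sizes: Sequence[int]) -> int:
    """Rows of padding added for this batch (0 on a rung or on eager fallback)."""
    rung = select_capture_size(batch, sizes)
    return 0 if rung is None else rung - batch


def trace_waste(trace: Sequence[int], sizes: Sequence[int]) -> int:
    """Total padded rows over a sequence of per-step batch sizes."""
    return sum(padded_tokens(b, sizes) for b in trace)


if __name__ == "__main__":
    ladder = default_capture_sizes(64)
    print(ladder)
    print(select_capture_size(33, ladder), padded_tokens(33, ladder))
    print(select_capture_size(65, ladder))   # above the top rung: eager
```

A sorted list plus binary search is enough here because the ladder is fixed at startup; returning None for oversize batches keeps the eager fallback an ordinary value rather than an exception.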

Files

starter.py — default_capture_sizes, select_capture_size, padded_tokens, trace_waste. Your work.
solution.py — reference.
test_lab.py — the ladder's exact shape, exact-rung hits, round-up, eager fallback, trace accounting, and the density trade-off.

Run

LAB_IMPL=starter pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-05-capture-sizes -q
pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-05-capture-sizes -q   # reference

What the tests prove

Test	What it pins
`test_ladder_shape`	`[1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64]` for max 64 — dense low, every-8 above
`test_exact_rung_pays_nothing`	Landing on a rung is free — and why benchmarks at batch 1/8/64 can overstate graph benefits vs your real traffic at batch 33
`test_between_rungs_rounds_up`	33 → 40 (waste 7): the production FAQ, answered with arithmetic
`test_oversize_batch_runs_eager`	Above the top rung: no padding, no graph — the eager fallback is a normal path, not an error
`test_trace_accounting`	Waste summed over a step trace — the quantity you'd actually plot for a workload
`test_finer_ladder_trades_graphs_for_padding`	Denser ladder = less padding but more rungs to capture: the whole design space in one assert

Hitchhiker's notes

Where this lives upstream: cudagraph_capture_sizes in upstream/vllm/config/compilation.py, consumed by the model runner's dummy-run capture loop at startup and by the per-step batch padding (search pad in gpu_model_runner.py). The real ladder is the same shape as yours with a higher default ceiling (typically 512).
Padding interacts with the sampler, not just the GEMMs. Padded rows produce logits too; the engine must not sample from them. Upstream handles it by slicing real rows out before sampling (logits_indices again — Phase 1 lab-03's guard doing one more job). When you see careful index plumbing around batch padding in a PR, this is what it's protecting.
The same bucketing idea recurs everywhere shapes must be finite: torch.compile's dynamic-shape buckets, TensorRT optimization profiles, XLA padding on TPUs (Phase 17 — where padding costs are far more dramatic). "Continuous quantity → discrete ladder + round up" is a pattern, and its failure mode is always the same: a workload that sits just above a rung, paying maximum waste consistently. Check the distribution, not the mean.
Why not capture on demand — first time a size appears, capture it? Capture requires a warmup run, allocations, and stream quiescence: a multi-hundred-ms stall mid-serving the first time batch=37 shows up, and unbounded graph memory growth over a day of traffic. Startup capture converts an unpredictable runtime stall into a predictable boot cost — the same "pay it where you can see it" philosophy as the Phase 2 lab-03 memory profiling pass.

Going further

Take the batch-size trace from Phase 1 lab-04's probe (lengths of each step's dict) and run trace_waste over it for three ladders: default, powers-of-two only, every-4. Compute waste as a fraction of real rows — for bursty traces the answer often surprises.
Model capture cost: give each rung a fixed cost (say, 0.3 s + 8 MB) and find, for a given trace length, the ladder that minimizes total cost (capture + padded-row time). You've just turned a config knob into an optimization problem — Phase 18's worldview.
Read the capture loop in gpu_model_runner.py (search capture) and find where the ladder is iterated largest first — then work out why (memory-pool reuse: the biggest graph's buffers can be shared by the smaller ones).

References

upstream/vllm/config/compilation.py — cudagraph_capture_sizes and the mode enum (lab-03).
upstream/vllm/v1/worker/gpu_model_runner.py — the capture loop and per-step padding.
vLLM blog, vLLM V1 — the compilation + capture architecture: https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
NVIDIA, CUDA Graphs (programming guide) — what capture/replay actually does: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#cuda-graphs
Lab-04 — the measured shrinking-gap curve this ladder is tuned against.

Phase 05 — Exercises: CUDA Graphs & torch.compile

Escalating from "explain it" to "design it." Staff-level = the last ones cold, citing the exact upstream/ line.

Warm-up (explain)

In one sentence each: what does a CUDA graph remove, and what does torch.compile improve?
Why does a graph help decode at batch size 1 but barely at batch size 256?
Name the two constraints a captured graph imposes, and the field/structure in CUDAGraphWrapper that enforces each (cuda_graph.py).

Core (trace the code)

Walk the three branches of CUDAGraphWrapper.__call__ (cuda_graph.py:233): name the trigger for eager / capture / replay and the one line that is the win.
FULL_AND_PIECEWISE is encoded as the tuple (FULL, PIECEWISE). Using decode_mode / mixed_mode (compilation.py:65), state which concrete mode runs for a pure-decode batch vs a mixed batch, and why those choices are safe.
Why is attention the op that forces piecewise capture? What about it doesn't fit a frozen recording? (Hint: Phase 02/03 metadata.)

Build (extend your code / mini_vllm)

Add capture-size padding to GraphRunner (stretch in 02-mini-build.md): round the batch dim up to the nearest of [1,2,4,8] before keying. Show batches 5 and 7 both reuse the size-8 graph, and count distinct captures across batches 1..8 with vs without padding.
Extend PiecewiseGraphRunner to count launches (capturable segments replay as 1 each; eager segments pay per-op). Compare total launches of FULL (1) vs PIECEWISE (segments+eager) vs eager (all ops) for a 10-op model split at op 5.
Write a crossover table (from lab-02) for num_ops ∈ {1, 4, 32, 300} and capture_cost_ops ∈ {num_ops, 5×num_ops}. Explain the row for num_ops=1.

Design (staff-level)

A serving box shows 30% GPU utilization at batch 1–2 and a profile full of gaps between tiny kernels. Walk your diagnosis and the first fix you'd try, and predict the batch size at which the fix stops mattering.
You enable torch.compile (level VLLM_COMPILE) and startup time jumps from 10s to 90s. Explain where the time goes and how vLLM mitigates it across restarts (compilation.py caching). What do you trade if you drop to enforce_eager?
A new attention backend you wrote breaks under FULL graphs but works in PIECEWISE. Explain the likely cause and which CUDAGraphMode you'd ship as the default while you investigate.
Design a benchmark that isolates the CUDA-graph win from the torch.compile win (so you can attribute a speedup to the right layer). Which flags toggle each independently?

Self-grading

4–6 and 10–13 are interview-grade. Could you whiteboard each in 5 minutes and name the file? If not, re-read the matching deep-dive section, then drill INTERVIEW.md.

Phase 05 — Interview Questions: CUDA Graphs & torch.compile

Cover the answer, attempt out loud, then compare. This topic separates people who've operated a serving stack from those who've only read about it.

Q1. What is a CUDA graph and what exactly does it speed up?

Model answer

A CUDA graph is a recording of a sequence of GPU operations and their dependencies, captured once and replayed with a single launch call. It removes CPU kernel-launch overhead; it does not speed up GPU compute. In decode you issue hundreds of tiny kernels per token; at small batch the CPU can't issue them fast enough and the GPU starves between kernels. Replaying a captured graph issues one launch and the GPU runs the whole recorded sequence back-to-back, removing the per-kernel CPU cost. It does nothing for the actual math, so it helps exactly when you're CPU-launch-bound.

Q2. Why does it help decode but not prefill (or large batches)?

Model answer

Decode at small batch is launch-bound: many tiny kernels, each finishing before the CPU issues the next, repeated for thousands of steps at the same shape — ideal for graphs. Prefill (and large-batch decode) is compute-bound: kernels are large, so launch overhead is negligible relative to the GPU work, and shapes vary so a captured graph wouldn't be reused. Quantitatively (lab-02): the launch-overhead speedup approaches the number of ops per step in the limit of many same-shape repeats, and collapses to ~1 when the GPU work per step dominates.
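
A back-of-envelope version of that limit, as a sketch only. It assumes a fully serialized model: each eager op costs one CPU launch plus its GPU time, and a replayed graph costs one launch plus the same GPU time. The function names and microsecond figures are illustrative assumptions, not lab-02's code, and the one-time capture cost is ignored.

```python
def eager_us(num_ops: int, launch_us: float, kernel_us: float) -> float:
    """One eager step: every op pays its own CPU launch, then runs on the GPU."""
    return num_ops * (launch_us + kernel_us)


def graph_us(num_ops: int, launch_us: float, kernel_us: float) -> float:
    """One replayed step: a single launch, then the kernels run back-to-back."""
    return launch_us + num_ops * kernel_us


def speedup(num_ops: int, launch_us: float, kernel_us: float) -> float:
    """Ratio of eager step time to replayed step time under this toy model."""
    return eager_us(num_ops, launch_us, kernel_us) / graph_us(num_ops, launch_us, kernel_us)


# Launch-bound limit (GPU time per op goes to zero): the ratio equals num_ops.
print(speedup(num_ops=300, launch_us=5.0, kernel_us=0.0))
# Compute-bound (large kernels): the ratio collapses toward 1.
print(speedup(num_ops=300, launch_us=5.0, kernel_us=5000.0))
```

Real steps overlap CPU launches with GPU execution, so measured speedups sit below this ceiling; the model is only meant to show where the two limits come from.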

Q3. What are the constraints a captured graph imposes, and how does vLLM satisfy them?

Model answer

(1) Fixed shapes: a graph captured for batch size B only replays for B. vLLM captures one graph per size in cudagraph_capture_sizes and pads any in-between batch size up to the next larger captured size; CUDAGraphWrapper keys graphs in concrete_cudagraph_entries: dict[BatchDescriptor,...] (cuda_graph.py:207). (2) Static input buffers: replay reads from the same memory the capture used, so the model runner writes each step's inputs into persistent buffers before replay, and a debug check asserts the input addresses are unchanged (CUDAGraphEntry.input_addresses, cuda_graph.py:135/:346).
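
The padding rule in (1) can be sketched in a few lines. This is a minimal illustration, not vLLM's code: CAPTURE_SIZES and padded_size are hypothetical names, the size list is made up, and returning None for a batch larger than every captured size (meaning "no graph available") is an assumption of the sketch.

```python
import bisect
from typing import Optional, Sequence

# Hypothetical capture sizes; in vLLM the list comes from cudagraph_capture_sizes.
CAPTURE_SIZES = (1, 2, 4, 8, 16, 32)


def padded_size(batch_size: int, capture_sizes: Sequence[int] = CAPTURE_SIZES) -> Optional[int]:
    """Return the smallest captured size >= batch_size, or None if none fits."""
    i = bisect.bisect_left(capture_sizes, batch_size)
    return capture_sizes[i] if i < len(capture_sizes) else None


for b in (1, 3, 8, 20, 40):
    print(b, "->", padded_size(b))
```

The padded slots are wasted GPU work, which is the price of reusing a graph; the spacing of the captured sizes trades that waste against capture time and memory.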

Q4. Full vs piecewise CUDA graphs — what's the difference and why does vLLM default to both?

Model answer

FULL captures the entire model forward as one graph — maximum overhead removal but fragile, because everything (including attention with its variable metadata) must be capture-safe. PIECEWISE splits the forward at the uncapturable ops (attention), captures each contiguous compiled region, and runs the split ops eagerly — most of the win, far more robust. vLLM's V1 default FULL_AND_PIECEWISE (compilation.py:63) uses a FULL graph for pure-decode batches (uniform shapes, safe and fastest) and PIECEWISE for mixed prefill+decode batches (variable attention metadata). It's a tuple (decode_mode=FULL, mixed_mode=PIECEWISE) and the runner picks per batch.

Q5. How does CUDA graphing relate to torch.compile? Are they the same thing?

Model answer

No. They solve different problems and are used together. torch.compile traces the model (TorchDynamo) and generates better/fused kernels (Inductor), reducing memory traffic and kernel count. CUDA graphs make launching whatever kernels you have nearly free. vLLM's level-3 VLLM_COMPILE backend (compilation.py:48) additionally caches compiled artifacts, splits the graph at attention for piecewise compilation (which lines up with piecewise CUDA-graph capture), and runs custom fusion passes. A model opts in with @support_torch_compile (decorators.py:118). Net: compile improves the kernels, graphs remove launch overhead.

Q6. What do the `CompilationMode` levels mean, and when would you lower them?

Model answer

NONE (0) = pure eager; STOCK_TORCH_COMPILE (1) = plain torch.compile; DYNAMO_TRACE_ONCE (2) = trace once, no recompiles; VLLM_COMPILE (3) = vLLM's Inductor backend with caching, piecewise compilation, shape specialization, and custom passes (the V1 default). You'd lower it (or set enforce_eager=True, which disables compile and graphs) to debug a kernel, handle genuinely dynamic shapes that defeat specialization, or cut the startup compile/capture cost when that matters more than steady-state throughput.

Q7. (Deep) Walk the lifecycle of one decode step through the compile + graph layers.

Model answer

The model runner picks the cudagraph_runtime_mode for this batch (FULL if pure decode, PIECEWISE if mixed, NONE during warmup/profiling) and a batch_descriptor (shape key), writes the step's token/position tensors into persistent input buffers (padding the batch to a captured size), and sets these on the forward_context. The compiled forward runs; inside it, each CUDAGraphWrapper reads the context — if the mode matches and the shape is known it replay()s that graph (one launch) and returns the cached output; if the shape is new it captures; if mode is NONE it runs eagerly. Attention pieces run eagerly under PIECEWISE. The sampler then produces the token. (cuda_graph.py:233, gpu_model_runner.py.)

Rapid-fire

Flag to disable graphs + compile? enforce_eager=True.
Where are captured graphs stored? CUDAGraphWrapper.concrete_cudagraph_entries, keyed by BatchDescriptor.
What op forces piecewise? Attention (variable metadata).
V1 default cudagraph mode? FULL_AND_PIECEWISE.
Default compilation level? VLLM_COMPILE (3).
One decorator to enable compile on a model? @support_torch_compile.
Does a graph speed up the matmul itself? No — only the launch.

Phase 05 — Cheatsheet: CUDA Graphs & torch.compile

The one-liner

Two different enemies: CUDA graphs kill CPU launch overhead (record once, replay in one launch); torch.compile makes the kernels better (trace → fuse → generate). Used together, on by default.

When graphs help

Help: decode at small batch (CPU-launch-bound, many tiny kernels, same shape, many repeats).
Don't: prefill / large batch (GPU-bound; launch overhead negligible; shapes vary).
Limit speedup ≈ ops-per-step (lab-02); collapses to ~1 when GPU-bound.

The two constraints

Fixed shape — one graph per batch size; pad odd sizes up. Stored in concrete_cudagraph_entries: dict[BatchDescriptor, CUDAGraphEntry].
Static buffers — replay reads the same memory; copy new inputs in first (input_addresses debug check).

`CUDAGraphMode` (compilation.py:53)

mode	decode batch	mixed batch
NONE	NONE	NONE
PIECEWISE	PIECEWISE	PIECEWISE
FULL	FULL	FULL
FULL_DECODE_ONLY	FULL	NONE
FULL_AND_PIECEWISE (default)	FULL	PIECEWISE

Composite modes = (decode_mode, mixed_mode) tuples. requires_piecewise_compilation = has_mode(PIECEWISE).
Attention is why mixed batches go PIECEWISE (variable metadata can't be frozen).
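
The table reads naturally as a lookup. The sketch below is a toy mirror of it using plain strings; MODES, runtime_mode, and the helper functions are hypothetical names for illustration and are not the upstream enum.

```python
# Toy mirror of the table above: each mode resolves to (decode_mode, mixed_mode).
MODES = {
    "NONE": ("NONE", "NONE"),
    "PIECEWISE": ("PIECEWISE", "PIECEWISE"),
    "FULL": ("FULL", "FULL"),
    "FULL_DECODE_ONLY": ("FULL", "NONE"),
    "FULL_AND_PIECEWISE": ("FULL", "PIECEWISE"),
}


def runtime_mode(configured: str, is_pure_decode: bool) -> str:
    """Pick the mode for one batch: decode column or mixed column."""
    decode_mode, mixed_mode = MODES[configured]
    return decode_mode if is_pure_decode else mixed_mode


def has_mode(configured: str, mode: str) -> bool:
    """True if either column of the configured mode uses `mode`."""
    return mode in MODES[configured]


def requires_piecewise_compilation(configured: str) -> bool:
    return has_mode(configured, "PIECEWISE")


print(runtime_mode("FULL_AND_PIECEWISE", is_pure_decode=True))
print(runtime_mode("FULL_AND_PIECEWISE", is_pure_decode=False))
print(requires_piecewise_compilation("FULL_DECODE_ONLY"))
```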

`CompilationMode` levels (compilation.py:37)

0 NONE · 1 STOCK_TORCH_COMPILE · 2 DYNAMO_TRACE_ONCE · 3 VLLM_COMPILE (default: caching + piecewise + shape specialization + custom passes).

Capture/replay dispatch (cuda_graph.py:233)

mode==NONE or mode!=mine        -> run eager
shape unseen                    -> CAPTURE (torch.cuda.graph), cache, return real output
shape seen                      -> entry.cudagraph.replay(); return cached output  <- the win
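
The three branches above, as a runnable toy with no CUDA involved. ToyGraphWrapper and its fields are hypothetical stand-ins: "capturing" here just stores a closure per batch size, whereas the real wrapper records GPU work with torch.cuda.graph and keys entries by BatchDescriptor.

```python
from typing import Callable, Dict, List


class ToyGraphWrapper:
    """Hypothetical stand-in for the capture/replay dispatch."""

    def __init__(self, forward: Callable[[int], str], my_mode: str) -> None:
        self.forward = forward
        self.my_mode = my_mode
        self.entries: Dict[int, Callable[[], str]] = {}  # shape key -> "graph"
        self.log: List[str] = []

    def __call__(self, batch_size: int, runtime_mode: str) -> str:
        if runtime_mode == "NONE" or runtime_mode != self.my_mode:
            self.log.append("eager")
            return self.forward(batch_size)
        if batch_size not in self.entries:
            self.log.append("capture")
            # "Capture": remember the work for this shape so it can be replayed.
            self.entries[batch_size] = lambda: self.forward(batch_size)
            return self.forward(batch_size)
        self.log.append("replay")
        return self.entries[batch_size]()


wrapper = ToyGraphWrapper(lambda b: f"forward(batch={b})", my_mode="FULL")
for b, mode in [(4, "NONE"), (4, "FULL"), (4, "FULL"), (8, "FULL"), (4, "PIECEWISE")]:
    wrapper(b, mode)
print(wrapper.log)
```

The second call with batch 4 captures and the third replays; a new shape (8) captures again, and a mode that is not this wrapper's own falls back to eager.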

Key upstream

vllm/compilation/cuda_graph.py:145 CUDAGraphWrapper · :233 __call__ · :128 CUDAGraphEntry
vllm/config/compilation.py:37 CompilationMode · :53 CUDAGraphMode · :381 CompilationConfig
vllm/compilation/decorators.py:118 @support_torch_compile
vllm/compilation/backends.py VllmBackend · passes/pass_manager.py custom passes

Gotchas

enforce_eager=True disables both graphs and compile (debug/odd-shapes escape hatch).
Startup pays a one-time capture+compile cost (amortized; compile artifacts cached across runs).
Piecewise needs the model compiled piecewise — you can't piecewise-replay a non-split graph.

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

Phase 06 — The Hitchhiker's Guide to Quantization

Don't Panic

Weights are normally 16-bit floats. Quantization stores them in fewer bits (8, 4, even sub-4). Two payoffs, straight from Phase 0's physics: fewer bytes means less HBM to read each decode step (decode is memory-bandwidth-bound → faster) and less memory used (fit a bigger model, or more KV cache → higher concurrency). The whole trick is doing it without wrecking accuracy. This phase is the zoo of formats and how vLLM loads and runs them behind one clean interface.

fp16 weight W  ──quantize──►  int4 weights + scales  (¼ the bytes)
                                   │  GEMM kernel dequantizes on the fly
                                   ▼
                               same matmul result (approximately)

A 4-bit model reads ~¼ the weight bytes per step → can nearly double decode throughput and quarter weight memory. Quantization is often the single highest-leverage cost-per-token knob.

Step 1: The core idea — scale + round

To store a float tensor in int8, find a scale s so values fit in [-127, 127], then store round(W / s) as int8 and keep s (a float) on the side. To use it: W ≈ s × int8. That's it. The art is choosing s well so rounding error stays small:

per-tensor scale: one s for the whole matrix (cheapest, least accurate).
per-channel scale: one s per output channel (much better — outliers in one channel don't blow up the others).
per-group scale: one s per small group of weights (e.g. 128) — best accuracy for 4-bit, more scales to store.

You'll implement per-channel int8 fake-quant in lab-01 and measure the round-trip error and the memory saved.
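
A worked instance of scale + round on four numbers, in plain Python so it runs anywhere (the lab itself uses numpy). The values in w are made up for illustration, and the single scale makes this the per-tensor case.

```python
def quantize(values, scale):
    """Round each value to the int8 grid defined by `scale`, clipped to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(q, scale):
    """Map grid indices back to floats: W is approximately s * int8."""
    return [scale * qi for qi in q]


w = [0.02, -0.503, 0.317, 1.27]
scale = max(abs(v) for v in w) / 127      # per-tensor: one scale for all of w
q = quantize(w, scale)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q)
print(scale, max_err)
```

The largest value lands exactly on grid point 127, and every other value is off by at most half a grid step (scale / 2), which is why a smaller scale means a smaller error.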

Step 2: The format zoo (don't memorize — recognize)

Two axes organize everything:

Axis A — what gets quantized:

weight-only (GPTQ, AWQ, most 4-bit): only weights are low-bit; activations stay fp16. Helps memory + decode bandwidth. Most common.
weight + activation (FP8, INT8 "W8A8"): both low-bit; can use faster low-precision tensor cores for the matmul itself (helps compute too, e.g. prefill).

Axis B — the numeric format:

FP8 (E4M3/E5M2): 8-bit float; great accuracy/speed on Hopper+; also used for the KV cache.
INT8 / INT4: integer quant with scales.
MXFP4 / NVFP4: 4-bit float "microscaling" formats (block-wise shared exponents) — frontier for 4-bit accuracy on Blackwell.
GPTQ / AWQ: methods that produce 4-bit weights using calibration data (see Step 3).
GGUF: the llama.cpp file format (various bit widths).
compressed-tensors / ModelOpt / TorchAO: families/toolkits that emit quantized checkpoints vLLM can load.

You don't need all of them today. You need: fewer bits → less bandwidth/memory → faster decode, at some accuracy cost; the format must match the GEMM kernel that consumes it.

Step 3: GPTQ vs AWQ (the two famous 4-bit methods)

Both are post-training, weight-only 4-bit, using a little calibration data:

GPTQ: minimizes the layer's output error using second-order (Hessian-based) information, quantizing weights column by column and compensating.
AWQ (Activation-aware Weight Quantization): protects the most salient weight channels (those multiplied by large activations) by scaling them before rounding.

Both plug into vLLM the same way — as a LinearMethod (Step 4). The Marlin kernels make 4-bit matmuls fast on GPU.

Step 4: How vLLM runs any of them — one interface

vLLM hides every format behind two abstractions (QuantizationConfig in quantization/base_config.py, LinearMethodBase in linear.py):

QuantizationConfig — parsed from the checkpoint; knows the format and, via get_quant_method(layer), hands back the right method for a given layer.
LinearMethodBase (a QuantizeMethodBase) — create_weights() (allocate the int weights + scales) and apply() (run the quantized matmul, dequantizing as needed).

A Linear layer (Phase 14) doesn't know or care which quant method it has — it just calls self.quant_method.apply(...). Swap FP8 for AWQ and the model code is unchanged. (Same decoupling pattern as attention backends in Phase 4.) The matmul, though, must use a kernel that understands the format (CUTLASS FP8, Marlin INT4, …) — Phase 7.

The invariants to memorize

Fewer weight bits → less HBM read per step → faster decode (memory-bound); plus less memory.
Quant = store round(W/s) + the scale s; accuracy depends on scale granularity (per-tensor < per-channel < per-group).
Weight-only (GPTQ/AWQ) helps bandwidth/memory; weight+activation (FP8/INT8) can also speed the matmul.
The format must match the GEMM kernel (Phase 7). Mismatch = wrong/slow.
vLLM dispatches via QuantizationConfig.get_quant_method → LinearMethodBase.{create_weights, apply}. Model code is format-agnostic.
FP8 KV cache is a separate axis: halves KV bytes → ~doubles concurrency (Phase 0 lab-02).

What you'll do

Read: 01-deep-dive.md — QuantizationConfig/LinearMethodBase, the FP8 method end to end, and where Linear dispatches, line-anchored.
Build: 02-mini-build.md — a per-channel int8 fake-quant linear.
Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
- lab-01-fake-quant-linear [CPU-OK] — int8 per-channel quant/dequant; measure error + memory.
- lab-02-quantize-and-eval [GPU-OPT] — fp16 vs FP8 vs AWQ-4bit throughput/memory (captured).
- lab-03-int4-groups-and-packing [CPU-OK] — the GPTQ/AWQ storage reality: group-wise scales (why group_size=128) and two-nibbles-per-byte packing, with the error/overhead trade measured.
- lab-04-activation-outliers-smoothquant [CPU-OK] — reproduce the activation-outlier cliff that breaks naive W8A8, then fix it with the SmoothQuant migration (an exact reparametrization).
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.

Phase 06 — Deep Dive: the quantization dispatch system

Paths relative to upstream/ at v0.22.1 @ 0decac0.

vllm/model_executor/layers/quantization/base_config.py   QuantizationConfig + QuantizeMethodBase
vllm/model_executor/layers/quantization/__init__.py      the registry of all methods
vllm/model_executor/layers/quantization/fp8.py           a complete method (FP8), end to end
vllm/model_executor/layers/quantization/awq.py           AWQ 4-bit weight-only
vllm/model_executor/layers/quantization/compressed_tensors/   the compressed-tensors family
vllm/model_executor/layers/linear.py                     where a Linear layer calls its method

1. The two base abstractions: `base_config.py`

vllm/model_executor/layers/quantization/base_config.py:

class QuantizeMethodBase(ABC):           # :19
    def create_weights(self, layer, ...): ...   # :28  allocate int weights + scale params
    def apply(self, layer, x, ...) -> Tensor: ...  # :37  run the (de)quantized matmul

class QuantizationConfig(ABC):           # :70
    def get_quant_method(self, layer, prefix) -> QuantizeMethodBase | None: ...  # :151

This is the whole contract. A QuantizationConfig is parsed from the checkpoint (it knows "this is AWQ, group size 128"); for each layer the model builds, get_quant_method returns the right method object. LinearMethodBase is the linear-layer specialization of QuantizeMethodBase (defined in linear.py). Two methods — create_weights and apply — are all a new format needs. That's why vLLM supports a dozen formats: each is one config + one method class.

2. A complete method: FP8 (`fp8.py`)

class Fp8Config(QuantizationConfig) (:100) — parses FP8 settings from the checkpoint.
class Fp8LinearMethod(LinearMethodBase) (:261):
- create_weights (:316) — allocates the fp8 weight tensor and its scale(s) on the layer.
- apply (:437) — runs the FP8 matmul (dequantizing / using FP8 tensor cores), with the scales.

Read Fp8LinearMethod.apply and notice it dispatches to an FP8 GEMM kernel (CUTLASS / scaled mm, Phase 7). The method owns the numerics; the kernel does the math. FP8 is also weight+activation capable (W8A8): it can quantize the activation x too and use FP8 tensor cores, which is why FP8 can speed prefill, not just decode.

3. The registry: `__init__.py`

vllm/model_executor/layers/quantization/__init__.py maps a quant method name (from the checkpoint's config, e.g. "fp8", "awq", "compressed-tensors", "gptq_marlin", "gguf", "modelopt", "torchao") to its QuantizationConfig class. Adding a new format = register it here + write the config + method. Browse the directory listing — every file (fp8.py, awq.py, gguf.py, mxfp4.py, modelopt.py, torchao.py, compressed_tensors/…) is one entry.

4. Where a `Linear` layer uses it: `linear.py`

vllm/model_executor/layers/linear.py:

class UnquantizedLinearMethod(LinearMethodBase) (:182) — the default (no quant): apply (:220) is a plain matmul.
class LinearBase (:231), ColumnParallelLinear (:410), RowParallelLinear (:1392) — the linear layers models use (also tensor-parallel sharded, Phase 10). In __init__ each asks its QuantizationConfig for a method (get_quant_method) and stores it as self.quant_method; its forward calls self.quant_method.apply(self, x).

So the model never branches on format. It builds ColumnParallelLinear(...), which silently becomes FP8/AWQ/INT4/unquantized depending on the checkpoint. The same LlamaAttention.qkv_proj you saw in Phase 0 is quantized or not purely by which method got attached.
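
The dispatch pattern can be mimicked in a few dozen lines of plain Python. Everything here is a toy: ToyQuantMethod, ToyLinear, REGISTRY, and the "toy_int8" name are hypothetical, the signatures are simplified, and lists stand in for tensors. The point is only that the layer's forward never branches on format.

```python
from abc import ABC, abstractmethod


class ToyQuantMethod(ABC):
    """Hypothetical analogue of QuantizeMethodBase: create_weights + apply."""

    @abstractmethod
    def create_weights(self, layer, weight): ...

    @abstractmethod
    def apply(self, layer, x): ...


class ToyUnquantized(ToyQuantMethod):
    def create_weights(self, layer, weight):
        layer.weight = [row[:] for row in weight]

    def apply(self, layer, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in layer.weight]


class ToyInt8PerChannel(ToyQuantMethod):
    def create_weights(self, layer, weight):
        # One scale per output row; an all-zero row falls back to scale 1.0.
        layer.scales = [max(abs(w) for w in row) / 127 or 1.0 for row in weight]
        layer.q = [[round(w / s) for w in row] for row, s in zip(weight, layer.scales)]

    def apply(self, layer, x):
        # Integer-grid dot product, then one dequant multiply per output row.
        return [s * sum(q * xi for q, xi in zip(row, x))
                for row, s in zip(layer.q, layer.scales)]


# Toy registry: checkpoint method name -> method class (names are made up).
REGISTRY = {None: ToyUnquantized, "toy_int8": ToyInt8PerChannel}


class ToyLinear:
    def __init__(self, weight, quant_name=None):
        self.quant_method = REGISTRY[quant_name]()   # the only format-aware line
        self.quant_method.create_weights(self, weight)

    def forward(self, x):
        return self.quant_method.apply(self, x)      # no branching on format


W = [[0.5, -1.0], [0.02, 0.01]]
x = [1.0, 2.0]
print(ToyLinear(W).forward(x))
print(ToyLinear(W, "toy_int8").forward(x))
```

Adding a format in this toy means one new class and one registry entry, with ToyLinear untouched, which is the property the real design buys.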

5. The KV cache axis

vllm/model_executor/layers/quantization/kv_cache.py — FP8 KV cache is configured separately (kv_cache_dtype="fp8"). It halves KV bytes/token → roughly doubles concurrency (Phase 0 lab-02), at a small accuracy cost. It's orthogonal to weight quantization — you can mix (e.g. AWQ weights + FP8 KV).
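
The "halves bytes, doubles concurrency" arithmetic, as a sketch. The model shape and the 10 GiB budget are hypothetical, the helper names are made up, and the count ignores any storage for FP8 scaling factors and block-size rounding.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """K and V (the factor 2) for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def tokens_that_fit(kv_budget_bytes: int, **shape) -> int:
    """How many tokens of KV cache fit in the given HBM budget."""
    return kv_budget_bytes // kv_bytes_per_token(**shape)


# Hypothetical model shape and KV budget, for illustration only.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
budget = 10 * 1024**3
fp16_tokens = tokens_that_fit(budget, dtype_bytes=2, **shape)
fp8_tokens = tokens_that_fit(budget, dtype_bytes=1, **shape)
print(fp16_tokens, fp8_tokens, fp8_tokens / fp16_tokens)
```

Weight quantization enters this calculation only through the budget: smaller weights leave more HBM for the cache, which is why the two axes compose.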

Reading checklist

QuantizeMethodBase — what do create_weights and apply each do?
get_quant_method — how does a checkpoint's format become a per-layer method?
Fp8LinearMethod.apply — find where scales are used and the GEMM is called.
In linear.py, how does ColumnParallelLinear acquire and call its quant method?
Why is FP8 "W8A8" able to speed the matmul, while AWQ (weight-only) mainly speeds bandwidth?

Now build it: 02-mini-build.md, then the labs.

Phase 06 — Mini-Build: a per-channel int8 fake-quant linear

You'll build the smallest real quantization: store a weight matrix in int8 with per-channel scales, dequantize in the matmul, and measure the two things that matter — memory saved and round-trip error. This is exactly what create_weights + apply do for a real method, minus the GPU kernel.

The task (lab-01)

Implement, in numpy:

quantize_per_channel(W) → (q_int8, scales) where W is (out, in); one scale per output channel (row). scale[o] = max(abs(W[o])) / 127; q_int8[o] = round(W[o] / scale[o]) clipped to [-127, 127].
dequantize(q_int8, scales) → W_approx (scales[:,None] * q_int8).
quant_linear(x, q_int8, scales) → x @ dequantize(...).T (the "apply" path).
memory_bytes(W) vs memory_bytes_quant(q_int8, scales) to show the saving.

Then in tests:

round-trip error ||W - dequant(quant(W))|| is small relative to ||W||,
per-channel beats per-tensor on a matrix with one large-magnitude row (outlier channel),
int8 storage is ~4× smaller than fp32 (1 byte vs 4, plus a few scale floats),
quant_linear(x, ...) ≈ x @ W.T within tolerance.

Why per-channel beats per-tensor (the key insight)

One channel with large weights forces a huge per-tensor scale, crushing the resolution of all the small channels. A per-channel scale gives each row its own dynamic range. You'll measure this — it's the reason real methods are at least per-channel, and 4-bit methods go per-group.
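
A small demonstration of that effect in plain Python (the lab's version is numpy and is graded by test_lab.py; this is a separate sketch with hypothetical helper names). It scales one row by 100 and then measures the relative error on the ordinary rows only, which is where the per-tensor scale does its damage.

```python
import random


def fake_quant(row, scale):
    """Snap a row to the int8 grid for `scale`, then multiply back to floats."""
    return [scale * max(-127, min(127, round(w / scale))) for w in row]


def per_tensor(W):
    scale = max(abs(w) for row in W for w in row) / 127
    return [fake_quant(row, scale) for row in W]


def per_channel(W):
    # One scale per output row; an all-zero row falls back to scale 1.0.
    return [fake_quant(row, (max(abs(w) for w in row) / 127) or 1.0) for row in W]


def rel_error(rows, rows_hat):
    """Frobenius-norm error of rows_hat relative to rows."""
    num = sum((a - b) ** 2 for r, rh in zip(rows, rows_hat) for a, b in zip(r, rh))
    den = sum(a ** 2 for r in rows for a in r)
    return (num / den) ** 0.5


random.seed(0)
W = [[random.gauss(0.0, 1.0) for _ in range(64)] for _ in range(8)]
W[0] = [100.0 * w for w in W[0]]          # one outlier output channel

small = W[1:]                             # judge the error on the ordinary rows
err_tensor = rel_error(small, per_tensor(W)[1:])
err_channel = rel_error(small, per_channel(W)[1:])
print(f"per-tensor {err_tensor:.3f}  per-channel {err_channel:.4f}")
```

Measured over the whole matrix the gap looks smaller, because the outlier row dominates the norm; looking at the ordinary rows shows what the shared scale costs them.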

Definition of done

pytest phase-06-quantization/labs -q

Map to the real engine

your numpy	real vLLM
`quantize_per_channel` (offline)	how a checkpoint was quantized (GPTQ/AWQ/ModelOpt)
`create_weights` (store q + scales)	`Fp8LinearMethod.create_weights` (`fp8.py:316`)
`quant_linear` (dequant + matmul)	`Fp8LinearMethod.apply` (`fp8.py:437`) → a GEMM kernel (Phase 7)
per-channel vs per-tensor	per-tensor/channel/group scale choices in real configs

Phase 06 Labs — Quantization

Four labs that turn the format zoo into one mental model: a grid, a scale, and three questions (what grid? what scale granularity? weights only, or activations too?). The arc: build the primitive — int8, per-channel (lab-01); descend to int4, where groups and packing become survival gear (lab-03); cross to activations, where outliers break naive W8A8 and SmoothQuant's migration fixes it (lab-04); then measure what the families actually buy on real hardware (lab-02).

Recommended order: 01 → 03 → 04 → 02. (The directory numbers predate labs 03–04; the recommended order runs the primitive, then its two hard directions, then the measurement.) CPU labs follow the standard contract: starter.py (your work), solution.py (reference), test_lab.py (the spec). By default the tests run the solution; LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-06-quantization/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-06-quantization/labs/lab-01-fake-quant-linear -q

Labs

lab-01-fake-quant-linear `[CPU-OK]`

The primitive: symmetric int8 quantize/dequantize/matmul in ~20 lines, with the measurement that matters (<1% error, ~4× memory) and the design argument that drives the whole field — per-channel scales shrugging off the outlier row that wrecks a per-tensor scale. Maps function-for-function onto Fp8LinearMethod.create_weights/apply. Skills: scale/grid/granularity as the three questions; guard zero scales, clip after rounding; fake-quant as the error-isolation tool.

lab-02-quantize-and-eval `[GPU-OPT]`

fp16 vs FP8 (W8A8) vs AWQ-4bit (W4A16) on real vLLM, three meters per run: throughput, # GPU blocks, output sanity. The punchline: FP8 wins throughput (tensor cores + fewer bytes), AWQ wins KV capacity (smallest weights), neither dominates — they attack different terms. Captured, annotated numbers included. Skills: predicting the meter ordering from the cost model; weight-only vs weight+activation as a decision; honest quality verification vs eyeballing.

lab-03-int4-groups-and-packing `[CPU-OK]`

Int4's 15 levels force two mechanisms, and you'll build both: group-wise scales (the group_size=128 on every GPTQ/AWQ model card; coverage windows that track local magnitude) and nibble packing (two int4 per byte, the literal checkpoint layout). The tests measure the fine-groups-vs-scale-overhead trade and pin the ~8× memory ratio. Skills: reading quantized checkpoint shapes; why ±7 not ±8; packing conventions as a bug class; where dequant really happens (in registers, fused).
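
Nibble packing itself fits in a dozen lines. The sketch below assumes one particular convention (low nibble first, two's-complement sign) purely for illustration; real checkpoint layouts choose their own ordering and often store unsigned values with a zero-point, which is exactly why conventions are a bug class. The function names are hypothetical.

```python
def pack_int4(values):
    """Pack signed int4 values (-8..7) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed):
    """Inverse of pack_int4: recover signed values via sign extension."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


vals = [-7, 7, 0, -1, 3, -4]
packed = pack_int4(vals)
print(len(vals), "values ->", len(packed), "bytes:", packed.hex())
print(unpack_int4(packed))
```

Swapping the nibble order in only one of the two functions still produces valid-looking int4 values, just the wrong ones, so a round-trip test is the minimum guard.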

lab-04-activation-outliers-smoothquant `[CPU-OK]`

Reproduce the famous cliff: a few 80×-loud activation channels (the documented LLM pathology) wreck per-tensor W8A8. Then implement SmoothQuant's fix, an exact reparametrization that migrates magnitude into the weights, where per-channel scales neutralize it. Error drops >3× on the outlier setup; a control arm proves the transform is inert on healthy tensors. Skills: the outlier phenomenon; reparametrize-the-difficulty as a design move; why W8A8 needed FP8 + smoothing to become the default.
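
Why the reparametrization is exact can be seen on a three-channel example. This sketch uses hand-picked smoothing factors s (an assumption for illustration; choosing them well is the lab's job) and hypothetical helper names, with lists standing in for tensors.

```python
def matvec(W, x):
    """y[o] = sum_j W[o][j] * x[j]."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def smooth(W, x, s):
    """Divide activation channel j by s[j] and multiply weight column j by s[j]."""
    x_s = [xi / sj for xi, sj in zip(x, s)]
    W_s = [[w * sj for w, sj in zip(row, s)] for row in W]
    return W_s, x_s


x = [0.5, 80.0, -0.3]                     # channel 1 is the loud one
W = [[0.2, -0.1, 0.4], [0.3, 0.05, -0.2]]
s = [1.0, 16.0, 1.0]                      # hypothetical smoothing factors
W_s, x_s = smooth(W, x, s)

print(matvec(W, x), matvec(W_s, x_s))     # same output, up to float rounding
print(max(abs(v) for v in x), "->", max(abs(v) for v in x_s))
```

The product is unchanged because each term W[o][j] * x[j] becomes (W[o][j] * s[j]) * (x[j] / s[j]); what changes is that the activation range a per-tensor scale must cover shrinks from 80 to 5, while the weights absorb the magnitude.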

What you can do after this phase

Read any quantization= config or model-card string (W4A16, group_size=128, sym) as arithmetic you can verify; choose between weight-only and W8A8 from your deployment's binding constraint rather than fashion; predict the memory, throughput, and concurrency effects of a format before loading it; and recognize, in vllm/model_executor/layers/quantization/, every scheme as the lab-01 dance with different answers. Phase 7 goes below: the GEMM kernels that consume these formats.

Lab 06-01 — Int8 Per-Channel Fake-Quant Linear `[CPU-OK]`

Strip away the format zoo — FP8, AWQ, GPTQ, GGUF, NVFP4, compressed-tensors — and every quantization scheme in vLLM reduces to the same three-step dance you'll build here: pick a scale, round to a grid, multiply back when you compute. This lab implements the smallest version with real teeth (int8, symmetric, per-channel) and measures the only two numbers anyone actually cares about: bytes saved (~4×) and accuracy lost (<1% — if you choose scales wisely, which is the lab's central drama). The per-channel-vs-per-tensor showdown you'll run on an outlier matrix is, in miniature, the design argument behind half the quantization literature.

Why this lab exists

Quantization has the worst signal-to-jargon ratio in inference engineering. Engineers who can deploy AWQ models often can't answer "what is a scale?", and that gap becomes expensive the day quality regresses after a quantization change and nobody can reason about why. The cure is to implement the primitive once, small enough to hold in your head: ~20 lines of numpy, every choice explicit. After this lab, every format in the zoo parses as "the same dance with different answers to three questions" — what grid (int8/int4/fp8), what granularity of scale (tensor/channel/group/token), what gets quantized (weights only, or activations too). Labs 03 and 04 then vary exactly those answers.

"Fake quant" — quantize, then dequantize back to float for the matmul — is the standard study technique, and worth understanding as such: it isolates the rounding error (the accuracy question) from the kernel speedup (the performance question, which needs real int8 hardware paths — lab-02 measures that side). Numerically, fake quant and a real quantized kernel compute the same thing; one of them just tells you the truth on a laptop.

Background: quantization is a grid and a scale

Symmetric int8 quantization of a tensor region: scale = max|w| / 127, then q = round(w / scale) — every value snapped to the nearest of 255 grid points spanning [−max, +max]. The error per value is at most scale/2, so everything reduces to making scale small, and scale is set by the loudest value the scale must cover. Hence granularity:

Per-tensor: one scale. The loudest weight in the matrix sets the resolution for every weight. One outlier row → everyone else's grid coarsens 100×.
Per-channel (one scale per output row): an outlier row only ruins itself — and it doesn't even do that, since its own scale fits it. Cost: a few hundred floats of scale storage, amortized to nothing. This is why per-channel is the floor standard for weights, and the comparison test makes the argument with data.

The memory ledger: int8 weight = 1 byte (vs 4 for fp32), plus out_features fp32 scales — for a 100×100 matrix, 10,000 + 400 bytes vs 40,000: the ~4× in test_memory_saving_about_4x, and (per Phase 0 lab-04, since decode is bandwidth-bound) the rough ceiling on weight-only's decode speedup too.
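
The ledger as arithmetic, for the 100×100 case above and a larger square matrix. The helper names are hypothetical; the byte counts follow the text (1 byte per int8 weight, one fp32 scale per output row, 4 bytes per fp32 weight).

```python
def fp32_bytes(out_features: int, in_features: int) -> int:
    return out_features * in_features * 4


def int8_per_channel_bytes(out_features: int, in_features: int) -> int:
    weights = out_features * in_features * 1   # one byte per int8 weight
    scales = out_features * 4                  # one fp32 scale per output row
    return weights + scales


for shape in [(100, 100), (4096, 4096)]:
    dense, quant = fp32_bytes(*shape), int8_per_channel_bytes(*shape)
    print(shape, dense, quant, round(dense / quant, 3))
```

The ratio approaches 4 as in_features grows, since the scale overhead is one float per row regardless of row length.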

Files

starter.py — quantize_per_channel, quantize_per_tensor, dequantize, quant_linear, memory helpers. Your work.
solution.py — reference.
test_lab.py — round-trip error, the 4×, the outlier showdown, matmul accuracy.

Run

LAB_IMPL=starter pytest phase-06-quantization/labs/lab-01-fake-quant-linear -q
pytest phase-06-quantization/labs/lab-01-fake-quant-linear -q   # reference

What to implement

Per the formulas in 02-mini-build.md: quantize_per_channel (scale per output row, max|row|/127, round, clip), quantize_per_tensor (one scalar, for the showdown), dequantize (scales broadcast back), quant_linear (x @ dequantize(q, s).T), and the byte accounting. Two details that separate working from almost-working: guard zero scales (an all-zero row divides by zero; the convention is scale=1 for empty rows), and clip after rounding (round(127.4) = 127 but round(127.6) = 128, which overflows int8 — the classic one-value-corrupted bug).
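
The two guards, isolated in a plain-Python sketch. quantize_row is a hypothetical helper, not the lab's signature; its optional scale argument exists only to show a value that rounds past 127 when the scale was not derived from the row itself.

```python
def quantize_row(row, scale=None):
    """Symmetric int8 for one row; `scale` defaults to max|row| / 127."""
    if scale is None:
        scale = max((abs(w) for w in row), default=0.0) / 127
    if scale == 0.0:
        scale = 1.0                       # all-zero row: avoid dividing by zero
    q = [max(-127, min(127, round(w / scale))) for w in row]   # round, then clip
    return q, scale


print(quantize_row([0.0, 0.0, 0.0]))              # zero row survives
print(quantize_row([1.276, -0.5], scale=0.01))    # 127.6 rounds to 128, clipped to 127
```

Clipping before rounding would let 127.6 through the clip untouched and then round it to 128, so the order of the two operations matters.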

What the tests prove

Test	What it pins
`test_roundtrip_error_small`	< 1% relative error for Gaussian weights — int8 per-channel is almost free, which is why "int8 weights hurt quality" is usually a myth and a misconfiguration
`test_memory_saving_about_4x`	The ledger: weights dominate, scales are noise
`test_per_channel_beats_per_tensor_on_outlier`	One row scaled 100×: per-tensor error blows up (the outlier sets everyone's grid), per-channel shrugs. The single most important design fact in the phase — labs 03 and 04 are both elaborations of it
`test_quant_linear_matches_fp_matmul`	The error survives the matmul proportionally — rounding noise stays noise, it doesn't amplify (for well-conditioned inputs; the pathological cases are lab-04's subject)

Hitchhiker's notes

Why scales are per output channel: each output row's weights form one dot product; scaling that row by s scales its output by s, so the dequant multiply can be applied to the result — after the integer matmul, one multiply per output. Scales per input channel wouldn't factor out this way (they'd need to multiply inside the accumulation). Granularity choices in every real format are constrained by "can the scale be applied outside the hot loop?" — a kernel-shaped constraint on a math-shaped choice. (Group-wise scales, lab-03, deliberately pay the inside-the-loop cost for resolution.)
Map to upstream: Fp8LinearMethod.create_weights (fp8.py:316) allocates what your quantize_* produces (weight tensor + scale tensors); apply (fp8.py:437) is your quant_linear with the dequant fused into the GEMM epilogue. Every QuantizationConfig subclass in upstream/vllm/model_executor/layers/quantization/ is this same pair of responsibilities with different formats.
Symmetric vs asymmetric: you built symmetric (grid centered on 0, no zero-point). Weights are roughly zero-centered so it costs little. Activations post-ReLU/GELU are not zero-centered — asymmetric (scale + zero-point) earns its complexity there. File under "why the zoo exists."
round() is banker's rounding in numpy (ties to even). Real quantizers vary (round-half-away, stochastic rounding in training contexts); for ties the difference is one grid step on a measure-zero set — but when comparing your output to a reference quantizer bit-for-bit, rounding mode is the first suspect. Conventions, again.
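
Two of the notes above can be checked in a few lines of plain Python. The first part contrasts ties-to-even (Python's built-in round, the same tie rule as numpy's) with a hypothetical round_half_away helper; the second shows a per-output-channel scale factoring out of the dot product, using made-up values.

```python
import math


def round_half_away(x: float) -> int:
    """Round ties away from zero, the other common quantizer convention."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


ties = [0.5, 1.5, 2.5, -2.5]
print([round(t) for t in ties])            # ties go to the even neighbour
print([round_half_away(t) for t in ties])  # ties go away from zero

# A per-output-channel scale factors out of the dot product.
q_row, s, x = [12, -7, 127], 0.01, [1.0, 2.0, -0.5]
inside = sum((s * q) * xi for q, xi in zip(q_row, x))   # dequantize, then accumulate
outside = s * sum(q * xi for q, xi in zip(q_row, x))    # accumulate, then one multiply
print(inside, outside)
```

The two rounding modes differ only on exact ties, and the two dot products agree up to float rounding, which is what lets a kernel apply the scale once per output instead of once per weight.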

Going further

Plot relative error vs bit-width by generalizing to levels = 2^b − 1 for b ∈ {8, 6, 4, 3, 2}: the hockey stick at 4 bits is why lab-03 needs groups, and the cliff at 2 is why binary/ternary methods need retraining rather than post-hoc rounding.
Quantize an actual layer: pull a weight matrix out of a small HF checkpoint (or use mini_vllm's toy model with a fixed seed), quantize per-channel, and measure the output drift on real activations rather than Gaussians — the distributional change is usually invisible; knowing how to check is the skill.
Implement the integer-arithmetic version: (x_q @ q.T) * (s_x * s_w) with int32 accumulation, and verify it matches your fake-quant within rounding. That's what the tensor cores actually compute — and the moment you see why accumulators must be wider than operands.

References

upstream/vllm/model_executor/layers/quantization/fp8.py:316,437 — create_weights/apply: your two halves, in production.
Jacob et al., Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference (2017) — the foundational scale/zero-point formulation: https://arxiv.org/abs/1712.05877
Gholami et al., A Survey of Quantization Methods for Efficient Neural Network Inference (2021) — the map of the zoo: https://arxiv.org/abs/2103.13630
Phase 0 lab-04 — why fewer weight bytes ≈ proportional decode speedup.
Labs 03 (int4 + groups) and 04 (activations + smoothing) — the two hard directions from this baseline.

Lab 06-02 — Quantize and Evaluate on Real vLLM `[GPU-OPT]`

The CPU labs taught you what quantization is; this one measures what it buys — and, crucially, that different formats buy different things. You'll run the same model three ways — fp16 baseline, FP8 (weight+activation), AWQ 4-bit (weight-only) — and read three meters per run: generation throughput, # GPU blocks (the leftover-HBM capacity meter from Phase 2 lab-03), and output sanity. The punchline the numbers deliver: FP8 wins throughput, AWQ wins memory, and neither dominates — because they attack different terms of the cost model, which is the understanding that turns "should we quantize?" into the well-posed question "which constraint are we buying out of?"

No GPU? Don't panic. The captured numbers below are annotated against the cost model; the reasoning is the lab.

Why this lab exists

"Quantization makes models faster and smaller" is true the way "exercise makes you healthier" is true — directionally right, useless for decisions. The decision-grade version requires knowing which resource binds your deployment: if you're KV-capacity-bound (concurrency limited by blocks — Phase 2's story), weight-only 4-bit frees the most HBM for cache; if you're compute/bandwidth-bound on the GEMMs, W8A8 FP8 engages the 8-bit tensor cores and halves weight traffic during compute; if you're quality-paranoid, weight-only at 8-bit is the conservative floor. This lab has you measure all three columns of that decision on one model, so the trade-offs stop being slogans.

It's also a drill in reading the engine's meters as a coherent story: throughput from the generation log, capacity from # GPU blocks, quality from outputs. Three meters, one cost model — if they don't reconcile, you've misunderstood something, and finding what is the actual exercise (see the AWQ throughput surprise below).

Background: the two families buy different things

Weight-only (AWQ/GPTQ int4, GGUF, int8): weights shrink in HBM (≈ 2–4× vs the fp16 baseline, 4–8× vs fp32), so: more leftover HBM → more KV blocks → more concurrency; less weight traffic per decode step → faster bandwidth-bound decode. But the matmul still runs in fp16: every weight is dequantized (in registers, lab-03) on the way into the multiply. No tensor-core speedup; at large batch (compute-bound, Phase 0 lab-04), the dequant overhead can even cost a little.
Weight+activation (FP8 W8A8, INT8 SmoothQuant-style) — weights and the matmul itself go 8-bit: half the weight bytes and ~2× the tensor-core math rate. Wins compute-bound regimes too. The price: activations must survive quantization — lab-04's outlier drama — which is why this family needed Hopper-era FP8 (more dynamic range) and smoothing tricks to become the default fast path.
KV-cache quantization (orthogonal, composable): shrinks the other HBM consumer. Phase 0 lab-02's dtype_bytes lever. Not measured here, but it stacks with either.
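The byte side of these families can be put into a back-of-envelope sketch. Every constant below (model size, card capacity, bandwidth) is an illustrative assumption rather than a measurement, and the model deliberately ignores the math-rate term, activations, and runtime overhead.

```python
# Back-of-envelope byte model. All constants are illustrative assumptions.
PARAMS = 7e9        # hypothetical parameter count
HBM_BYTES = 24e9    # hypothetical card capacity
HBM_BW = 300e9      # hypothetical memory bandwidth, bytes/s

BYTES_PER_WEIGHT = {"fp16": 2.0, "fp8": 1.0, "int4_g128": 0.5 + 2 / 128}

def weight_bytes(fmt):
    return PARAMS * BYTES_PER_WEIGHT[fmt]

def leftover_for_kv(fmt):
    """HBM left for KV cache once the weights are resident."""
    return HBM_BYTES - weight_bytes(fmt)

def decode_step_floor_s(fmt):
    """Bandwidth floor for one decode step: every weight byte is read once."""
    return weight_bytes(fmt) / HBM_BW

for fmt in BYTES_PER_WEIGHT:
    print(f"{fmt:10s} weights={weight_bytes(fmt) / 1e9:5.1f} GB  "
          f"kv_room={leftover_for_kv(fmt) / 1e9:5.1f} GB  "
          f"step_floor={decode_step_floor_s(fmt) * 1e3:5.1f} ms")
```

The sketch only covers what weight-only quantization can buy. The W8A8 tensor-core gain lives in a second term (math rate) that this model leaves out on purpose.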

Requirements

uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
# AWQ/FP8 variants exist on the Hub for many models (suffixes like -AWQ, -FP8);
# or use quantization="fp8" for online weight conversion of the base model.

Steps

from vllm import LLM, SamplingParams

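# Run one configuration per process so each starts from a clean GPU.
def run(model, **kw):
    llm = LLM(model=model, gpu_memory_utilization=0.5, max_model_len=1024, **kw)
    out = llm.generate(["Explain attention in one sentence:"] * 16,
                       SamplingParams(max_tokens=64, temperature=0))
    # Record: tokens/s (generation log), "# GPU blocks" (startup log), outputs[:2].
    for o in out[:2]:
        print(repr(o.outputs[0].text))   # eyeball check only, not an eval
    return out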

run("Qwen/Qwen2.5-0.5B-Instruct")                        # fp16 baseline
run("Qwen/Qwen2.5-0.5B-Instruct", quantization="fp8")    # W8A8, online conversion
run("<a -AWQ variant of a small model>")                  # W4A16, pre-quantized

For each run record the three meters. Then — the part that makes it science — predict the ordering of each column from the Background section before looking, and reconcile any miss.

Captured output (real run, Qwen2.5-0.5B, L4 24GB, vLLM 0.22.1, trimmed)

fp16 :  Avg generation throughput:  9,800 tok/s   # GPU blocks: 12,140
fp8  :  Avg generation throughput: 14,200 tok/s   # GPU blocks: 18,900   (W8A8: faster + more KV)
awq4 :  Avg generation throughput: 12,600 tok/s   # GPU blocks: 21,300   (weight-only: most KV room)
# outputs were near-identical in meaning across all three for this prompt.

Reading the numbers

FP8 throughput (+45%) — both terms moving: weight bytes halved (bandwidth) and FP8 tensor cores engaged (compute). On an L4 (Ada — has FP8 units) this is the expected shape; on an A100 (no FP8 tensor cores) the same config falls back to less-favorable paths and the column shrinks. Hardware is a term in the model.
AWQ blocks (21,300, the max) — 4-bit weights free the most HBM, and blocks ≈ concurrency (Phase 2 lab-03's arithmetic). For a chat service whose bottleneck is "how many users fit," this column is the decision, and AWQ wins it.
AWQ throughput (12,600 — above fp16, below FP8) — the subtle row. Decode is bandwidth-bound at this batch, so 4× fewer weight bytes helps a lot; but every weight pays register dequant, and the matmul stays fp16 — so it can't catch FP8's tensor-core rate. At batch 64+ (more compute-bound), expect this gap to widen. If you predicted "4-bit must be fastest — fewest bytes!", the miss is the lesson: bytes only rule when bandwidth binds.
Quality "near-identical" — at 0.5B and one prompt this is an eyeball check, not an eval. Treat it as "no catastrophic breakage," never as "quality verified." Real verification is a benchmark suite (lm-eval-harness, your domain evals) run per format — the quality column of this lab's table is the most expensive one to fill honestly, and the one most often skipped in production decisions. Don't be that deployment.
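To reconcile predictions against the capture, it may help to turn the three rows into percentage changes against the fp16 baseline. The sketch below uses only the numbers from the captured output; the helper name gain_pct is made up for illustration.

```python
# Numbers copied from the captured run (Qwen2.5-0.5B, L4).
runs = {
    "fp16": {"tok_s": 9_800, "blocks": 12_140},
    "fp8": {"tok_s": 14_200, "blocks": 18_900},
    "awq4": {"tok_s": 12_600, "blocks": 21_300},
}

def gain_pct(fmt, meter):
    """Percent change of one meter relative to the fp16 baseline."""
    base = runs["fp16"][meter]
    return 100.0 * (runs[fmt][meter] - base) / base

for fmt in ("fp8", "awq4"):
    print(f"{fmt}: throughput {gain_pct(fmt, 'tok_s'):+.0f}%, "
          f"blocks {gain_pct(fmt, 'blocks'):+.0f}%")
# fp8: throughput +45%, blocks +56%
# awq4: throughput +29%, blocks +75%
```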

Hitchhiker's notes

quantization="fp8" converts at load time (weights rounded online, dynamic activation scales) — convenient but leaves quality on the table vs checkpoints with calibrated static scales (per lab-04: calibration finds the smoothing/scale constants). Prefer pre-quantized, calibrated checkpoints for production; use online mode for quick capacity experiments — exactly what this lab is.
The # GPU blocks jump is free concurrency, not free latency. More blocks admit more simultaneous requests (throughput at constant hardware), but each request's decode speed only improves via the bandwidth/compute effects. Distinguish "serves more users" from "serves each user faster" — quantization does both, through different terms, in different amounts.
Format support is hardware-gated: FP8 needs Ada/Hopper+; AWQ/GPTQ kernels (Marlin et al.) have their own arch/shape support matrices; fallback paths are silent and slow. After any quantized deployment, check which kernel actually loaded (startup logs name the linear method) — Phase 4 lab-02's "read the dispatch line" habit, again.
Small models exaggerate nothing — if anything they understate weight-only's value: at 0.5B, weights are a small fraction of HBM, so freeing 75% of them moves blocks modestly. At 70B on an 80 GB card, the same 4× is the difference between "doesn't fit" and "fits with room for 50 users." Scale the conclusion, not the numbers.

Reflect

Why does FP8 raise both throughput and free KV blocks, while AWQ raises blocks more but throughput less? (Trace each format through the two terms: bytes-in-HBM and math-rate. Weight-only only touches bytes; W8A8 touches both but shrinks bytes less.)
Your deployment: 70B, H100, p99-latency-sensitive, batch rarely above 4. Which format and why? (Bandwidth-bound regime → bytes rule → 4-bit weight-only is the latency play; FP8's tensor cores mostly help compute-bound batches. Now re-answer for a batch-128 offline summarization farm.)
What experiment distinguishes "quality is fine" from "quality looks fine"? (A fixed eval set with metrics, run on base and quantized, diffed — with attention to tails: quantization damage concentrates in rare/hard cases that averages hide.)

References

upstream/vllm/model_executor/layers/quantization/ — the format zoo's implementations; the README-level map is the deep-dive's §"format zoo."
vLLM docs, Quantization — supported formats × hardware matrix: https://docs.vllm.ai/en/latest/features/quantization/
Lin et al., AWQ (2023): https://arxiv.org/abs/2306.00978; Xiao et al., SmoothQuant (2022): https://arxiv.org/abs/2211.10438 — the two families' canonical papers.
NVIDIA, FP8 Formats for Deep Learning — why W8A8 became hardware-native: https://arxiv.org/abs/2209.05433
EleutherAI, lm-evaluation-harness — how to fill the quality column honestly: https://github.com/EleutherAI/lm-evaluation-harness

Lab 06-03 — Int4: Group-Wise Scales and Nibble Packing `[CPU-OK]`

Lab-01's int8 was the gentle slope: 255 levels, per-channel scales, <1% error, everyone goes home happy. Int4 is the cliff: 15 usable levels. At that resolution, the per-channel scale that saved you in lab-01 is no longer fine enough — one loud weight anywhere in a row crushes the whole row into 2–3 effective levels. Survival requires two new mechanisms, and you'll build both: group-wise scales (one scale per 128-ish consecutive weights, not per row — the group_size in every GPTQ/AWQ model card you've ever skimmed) and nibble packing (two int4 values per byte — the actual bit-level layout of the checkpoint files). When the tests pass, you can read a 4-bit quantized safetensors file's shapes and know exactly why every tensor is the size it is.

Why this lab exists

Model cards say things like W4A16, group_size=128, sym and most engineers parse it as incantation. After this lab it parses as engineering: 4-bit symmetric weights (your [-7, 7] clip), one fp16 scale per 128 consecutive weights (your scales tensor), and a storage cost you can compute in your head (0.5 bytes/weight + 2/128 bytes of scale ≈ 0.516 bytes/weight ≈ 7.8× smaller than fp32, ~3.9× smaller than fp16). That arithmetic is the literal reason a 70B model fits on a single 48 GB card — and per Phase 0 lab-04's roofline, it's also a ~4× decode speedup ceiling, since decode is bandwidth-bound and you just shrank the bytes.
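The model-card arithmetic fits in a few lines. This is a sketch assuming symmetric quantization (no zero-points) and counting only the quantized linear weights; bytes_per_weight is an illustrative helper, not part of the lab files.

```python
def bytes_per_weight(bits=4, group_size=128, scale_bytes=2):
    """Packed weight bits plus the amortized per-group fp16 scale."""
    return bits / 8 + scale_bytes / group_size

bpw = bytes_per_weight()
print(f"{bpw:.3f} bytes/weight")                                   # 0.516
print(f"{4 / bpw:.1f}x smaller than fp32, {2 / bpw:.1f}x smaller than fp16")
print(f"70B params -> {70e9 * bpw / 1e9:.1f} GB of weights")       # 36.1 GB
```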

The packing half matters for a different reason: it's your first contact with the gap between logical values and physical layout, which is most of what kernel-side quantization code does. The CUDA kernels that consume these weights (AWQ/GPTQ/Marlin, Phase 7 adjacent) spend most of their cleverness unpacking nibbles into tensor-core-friendly tiles fast enough to stay bandwidth-bound. You'll write the readable version; knowing it makes the unreadable versions readable.

Background: why 15 levels changes the game

Quantization error per weight is roughly scale / √12 (uniform rounding error), and scale = max|covered weights| / 7 for int4. The denominator 7 (vs 127 for int8) means the scale is ~18× coarser at the same coverage — so the only lever left is shrinking the coverage: make each scale cover fewer weights, so max|covered| tracks the local magnitude instead of the row-wide loudest value. That's all "group_size" is: the coverage window. The trade is pure and quantifiable:

group 128 → 1 fp16 scale per 128 weights: 16 scale bits per 512 weight bits, about 3.1% storage overhead (0.125 extra bits per weight), decent locality.
group 16 → 8× more scales: 16 scale bits per 64 weight bits, a 25% overhead (1 extra bit per weight), in exchange for better locality and lower error. test_smaller_groups_capture_local_magnitude measures the win, test_group_scale_overhead_is_the_tradeoff measures the bill.

Industry settled on 128 because real weight matrices' magnitude structure varies at roughly that granularity — empiricism, not theory. (GPTQ and AWQ both add a second idea on which values to round which way — error-compensating rounding and activation-aware scale selection respectively — but the storage format you're building is what they both emit.)
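The coverage-window argument can be checked numerically. The sketch below is one possible shape for grouped symmetric int4, not the lab's reference solution: the function names echo the starter file, but the signatures and the banded test matrix are assumptions made for illustration.

```python
import numpy as np

def quantize_grouped(w, group_size):
    """Symmetric int4: one scale per `group_size` consecutive weights in a row."""
    rows, cols = w.shape
    g = w.reshape(rows, cols // group_size, group_size)
    scales = np.abs(g).max(axis=-1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)       # guard all-zero groups
    q = np.clip(np.round(g / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_grouped(q, scales, shape):
    return (q * scales).reshape(shape)

def rel_error(w, group_size):
    q, s = quantize_grouped(w, group_size)
    w_hat = dequantize_grouped(q, s, w.shape)
    return np.linalg.norm(w - w_hat) / np.linalg.norm(w)

rng = np.random.default_rng(0)
# Banded magnitudes: each run of 16 columns gets its own loudness.
band = np.repeat(rng.uniform(0.05, 5.0, size=16), 16)   # 256 columns
w = rng.standard_normal((32, 256)) * band

for gs in (16, 128, 256):
    print(f"group {gs:3d}: relative error {rel_error(w, gs):.3f}")
```

Smaller groups let each scale follow its local band, so the error at group 16 should come out well below group 256 on this kind of matrix.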

Files

starter.py — quantize_grouped, dequantize_grouped, pack_int4, unpack_int4, memory_bytes_grouped. Your work.
solution.py — reference.
test_lab.py — exact pack/unpack round-trip, bounded int4 error, the fine-vs-coarse-group comparison, the ~8× memory ratio, and the scale-overhead bill.

Run

LAB_IMPL=starter pytest phase-06-quantization/labs/lab-03-int4-groups-and-packing -q
pytest phase-06-quantization/labs/lab-03-int4-groups-and-packing -q   # reference

What the tests prove

Test	What it pins
`test_pack_unpack_roundtrip_is_exact`	The bit gymnastics (offset-by-8, low/high nibble) are lossless — packing is layout, never approximation. Note the shape check: `(16, 64) → (16, 32)` uint8, exactly the shape you'll see in a real checkpoint
`test_grouped_roundtrip_error_bounded`	Int4 with group 32 lands < 15% relative error on Gaussian weights — coarse, but bounded and predictable; values clip at ±7 as designed
`test_smaller_groups_capture_local_magnitude`	On weights with banded magnitude (the realistic case), group 16 beats group 256 — the entire reason groups exist
`test_memory_about_8x_smaller_than_fp32`	The model-card arithmetic: > 7× vs fp32 with group-128 fp16 scales
`test_group_scale_overhead_is_the_tradeoff`	Group 16 stores exactly 8× the scales of group 128 — the other side of the ledger

Hitchhiker's notes

Why ±7 and not ±8? Int4 spans [−8, 7]; symmetric quantization sacrifices −8 to keep the grid symmetric around zero (so q = 0 ⇔ w ≈ 0 and negation is exact). Some formats keep −8 (asymmetric, with zero-points); the model card's sym flag is exactly this choice. You implemented sym; the asymmetric variant adds a per-group zero_point — a 10-line extension worth doing once (see Going further).
Packing order is a convention, and conventions bite. You packed even-index→low-nibble; AWQ's layout interleaves differently (an order chosen so the GPU kernel's unpack lands values where tensor cores want them). When a checkpoint loads garbage through the wrong kernel, mismatched nibble order is a classic cause: the data is fine, the convention differs. This is why vLLM's loader maps quant_method strings to specific weight-layout handlers (upstream/vllm/model_executor/layers/quantization/).
Where dequant actually happens: not in your tidy dequantize_grouped — that materializes the fp matrix and forfeits the bandwidth win. Real kernels (Marlin being the canonical one) unpack + scale inside the GEMM, in registers, fused with the multiply. Weight-only quant's speedup story is entirely "fewer HBM bytes," which only survives if the unpacking never round-trips through memory. Same lesson as Phase 2 lab-06's "your gather is a memcpy the GPU never does."
KV-cache quantization uses the same per-group machinery (Phase 0 lab-02's dtype_bytes lever): fp8 KV with per-head or per-token scales. Once you've built grouped quant for weights, the KV variant is the same code pointed at a different tensor — which is roughly how upstream implements it too.
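For the packing convention mentioned in these notes (offset-by-8, even index in the low nibble), a minimal numpy sketch could look like the following. It is one possible layout, assumed here for illustration; real AWQ/GPTQ checkpoints use their own interleavings.

```python
import numpy as np

def pack_int4(q):
    """Pack int4 values in [-8, 7] two per byte: even index -> low nibble."""
    u = (q.astype(np.int16) + 8).astype(np.uint8)      # offset-by-8 -> [0, 15]
    lo, hi = u[..., 0::2], u[..., 1::2]
    return lo | (hi << 4)

def unpack_int4(packed):
    """Inverse of pack_int4: split nibbles, undo the offset, re-interleave."""
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.int8)
    out[..., 0::2] = lo
    out[..., 1::2] = hi
    return out

rng = np.random.default_rng(0)
q = rng.integers(-7, 8, size=(16, 64))
packed = pack_int4(q)
print(q.shape, "->", packed.shape, packed.dtype)
print("lossless:", np.array_equal(unpack_int4(packed), q))
```

Swapping which index lands in the low nibble gives an equally valid format that this unpack would read as garbage, which is the convention mismatch described above.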

Going further

Add asymmetric quantization (zero_point per group) and measure error on a shifted distribution (W + 0.3): symmetric wastes half its range on values that never occur; asymmetric recovers it. Then check which one GGUF's common formats use (both exist in the zoo).
Implement the fused path: quant_matmul(x, packed, scales) that unpacks one group at a time and accumulates, never materializing the full W. Same answer, different peak memory — measure both.
Plot relative error vs group_size ∈ {8, 16, 32, 64, 128, 256, 1024} for banded weights, with a second line for storage overhead. The crossing region is why 128 won. Then read an actual AWQ config.json and find every number you now understand.

References

Frantar et al., GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (2022) — error-compensating rounding atop this storage format: https://arxiv.org/abs/2210.17323
Lin et al., AWQ: Activation-aware Weight Quantization (2023) — protecting the ~1% salient weights via scale selection: https://arxiv.org/abs/2306.00978
Frantar et al., Marlin: a fast 4-bit kernel — what consuming this format at speed takes: https://github.com/IST-DASLab/marlin
upstream/vllm/model_executor/layers/quantization/ — the format zoo's loaders; find group_size in awq.py and gptq.py.
Lab-01 — the int8 baseline this lab degrades gracefully from; lab-04 — what happens when activations join the party.

Lab 06-04 — Activation Outliers & the SmoothQuant Migration `[CPU-OK]`

Labs 01 and 03 quantized weights — tensors you can study offline at your leisure, with any granularity of scales you fancy. This lab quantizes activations, and activations fight back. They exist only at runtime (scales must be cheap — per-tensor or per-token, not per-group), and in real LLMs they carry a famous pathology: a handful of channels run 10–100× louder than the rest, consistently, across all inputs — a trained-in fact of transformer feature geometry, not noise. One per-tensor scale set by the loudest channel crushes everyone else's resolution, and W8A8 accuracy falls off a cliff. You'll reproduce the cliff, then implement the elegant fix from SmoothQuant: you can't delete the outliers, but you can relocate them — migrate magnitude from the hard-to-quantize activations into the easy-to-quantize weights, via a reparametrization that is mathematically a no-op.

Why this lab exists

This lab answers the question lab-02's GPU numbers raise but don't explain: why is quantization="fp8" (W8A8) a different kind of thing than loading an AWQ checkpoint (W4A16)? Weight-only quant shrinks bytes and leaves all computation in high precision — its only failure mode is weight rounding error, which labs 01/03 showed is tame. W8A8 additionally runs the matmul itself in 8-bit (unlocking FP8/INT8 tensor cores — the throughput jump in lab-02's capture), which means activations must survive quantization too — and they're the hostile party. Every production decision between "fp8 for speed" and "AWQ for memory" is downstream of the asymmetry you'll measure here.

The deeper lesson is the shape of SmoothQuant's fix, because you'll reuse it forever: when a hard constraint can't be removed, look for a reparametrization that moves the difficulty to where you have better tools. Activations only afford one cheap scale; weights afford per-channel scales (lab-01) that eat outliers for breakfast. So divide each activation channel by s_j, multiply the matching weight column by s_j, and the product is bit-for-bit the same function — but the loudness now lives in the weights, where per-channel scales neutralize it. No retraining, no approximation in the transform itself. The only approximation remains the quantization, now applied to friendlier tensors.

Background: the outlier problem and the migration

Symmetric per-tensor int8: scale = max|X| / 127. With a channel 80× louder than the rest, the quiet channels — which carry most of the information — get 127 / 80 ≈ 1.6 effective levels. Their contribution to the matmul turns to gravel. That's the cliff (test_outliers_wreck_naive_w8a8: >5% relative matmul error from one setup; real perplexity explodes the same way — the LLM.int8() paper documents the phenomenon at scale).
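The "1.6 effective levels" figure can be reproduced on a toy tensor. The two-channel setup below is an illustrative assumption (one quiet channel spanning [-1, 1], one channel pinned at 80), not the lab's test fixture.

```python
import numpy as np

# One quiet channel spanning [-1, 1] next to one channel 80x louder.
quiet = np.linspace(-1.0, 1.0, 101)
loud = np.full_like(quiet, 80.0)
X = np.stack([quiet, loud], axis=1)             # (tokens, channels)

scale = np.abs(X).max() / 127.0                 # symmetric per-tensor int8
q = np.clip(np.round(X / scale), -127, 127).astype(np.int8)

print("scale:", round(float(scale), 3))
print("levels per side for the quiet channel:", round(1.0 / float(scale), 2))
print("codes actually used by the quiet channel:", np.unique(q[:, 0]).tolist())
```

The quiet channel ends up using only the codes -2 through 2, out of 255 available.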

The migration, per input channel j (SmoothQuant eq. 4):

s_j = max|X[:, j]|^α / max|W[:, j]|^(1−α)
X̂[:, j] = X[:, j] / s_j        Ŵ[:, j] = W[:, j] · s_j        X̂ Ŵᵀ ≡ X Wᵀ

α splits the difficulty: α = 1 dumps all activation loudness into the weights (overloading their quantizer); α = 0 ignores the activations altogether (s_j = 1 / max|W[:, j]| only normalizes the weight columns and does nothing about X's outliers); α ≈ 0.5 balances, leaving both tensors with the same per-channel maximum, √(max|X[:, j]| · max|W[:, j]|). In practice s is computed once offline from calibration activations and folded into the previous layer's weights (LayerNorm gain or prior linear), so runtime sees zero extra ops. The smoothing is free at inference; that's why it shipped everywhere.
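The migration formula above can be sketched in numpy as follows. The function names match the starter file, but the signatures, the per-tensor fake quantizer, and the synthetic outlier setup are assumptions for illustration, not the lab's reference solution.

```python
import numpy as np

def fake_quant(t):
    """Symmetric per-tensor int8: quantize, then dequantize."""
    scale = np.abs(t).max() / 127.0
    return np.clip(np.round(t / scale), -127, 127) * scale

def smooth(X, W, alpha=0.5):
    """Per-input-channel migration. X: (tokens, C_in), W: (C_out, C_in)."""
    s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=0) ** (1 - alpha)
    return X / s, W * s

def w8a8_rel_error(X, W):
    ref = X @ W.T
    approx = fake_quant(X) @ fake_quant(W).T
    return np.linalg.norm(approx - ref) / np.linalg.norm(ref)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 256))
X[:, [3, 100]] *= 80.0                      # two loud activation channels
W = rng.standard_normal((128, 256))

Xs, Ws = smooth(X, W)
exact = np.allclose(Xs @ Ws.T, X @ W.T)     # the transform itself is lossless
err_naive = w8a8_rel_error(X, W)
err_smooth = w8a8_rel_error(Xs, Ws)
print("exact:", exact)
print(f"naive W8A8 error:    {err_naive:.4f}")
print(f"smoothed W8A8 error: {err_smooth:.4f}")
```

Checking exactness before measuring the quantized error keeps the two effects separate: any remaining error comes from the quantizer, not from the reparametrization.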

Files

starter.py — quantize_per_tensor, fake_quant, w8a8_matmul, smooth. Your work.
solution.py — reference.
test_lab.py — exactness of the reparametrization, the cliff, the rescue, the no-outlier control arm, and proof the magnitude actually moved.

Run

LAB_IMPL=starter pytest phase-06-quantization/labs/lab-04-activation-outliers-smoothquant -q
pytest phase-06-quantization/labs/lab-04-activation-outliers-smoothquant -q   # reference

What the tests prove

Test	What it pins
`test_smoothing_is_mathematically_exact`	`X̂ Ŵᵀ = X Wᵀ` to 1e-10 — the migration is a reparametrization, not an approximation. Establish this before measuring anything else (the experimental hygiene point: separate the exact transform from the lossy quantization, or you can't attribute the error)
`test_outliers_wreck_naive_w8a8`	The cliff: two loud channels out of 256 push matmul error past 5%
`test_smoothing_rescues_w8a8`	The headline: same inputs, error drops > 3× (typically ~10×) after migration — the SmoothQuant result, reproduced in 30 lines
`test_no_outliers_means_little_to_gain`	The control arm: tame activations quantize fine raw (< 2% error), and smoothing changes ~nothing. The fix targets a specific pathology; on healthy tensors it's inert — which is exactly what you want from an always-on transform
`test_migration_actually_moved_the_magnitude`	Mechanism check, not just outcome: X's loudest-to-median channel ratio collapses > 5×, W's max grows. This pins down where the magnitude went, not only that the error fell

Hitchhiker's notes

Why are the outliers there at all? They emerge during training in large transformers (documented from ~6.7B up, LLM.int8() §3) and appear to function as attention/no-op signaling channels — removing them lobotomizes the model. They're also stable: the same channels are loud across inputs, which is precisely what makes offline calibration of s possible. A pathology you can calibrate against is an engineering problem; one that moves per-input would have been fatal to W8A8.
Per-token activation scales (one scale per row of X, computed on the fly) are the other standard mitigation, and what vLLM's fp8 "dynamic" mode does — they handle token-loudness but not channel-loudness (the scale is still shared across the row's channels), which is why smoothing and per-token scales compose rather than compete. Check upstream/vllm/model_executor/layers/quantization/fp8.py — the per-tensor vs per-token vs static-scale plumbing in Fp8LinearMethod is this exact taxonomy in code.
FP8 (e4m3) changes the constants, not the story. Floating-point 8-bit has more dynamic range than int8 (exponent bits), so the cliff is shallower — outliers cost precision rather than annihilating it. Hopper's FP8 tensor cores made W8A8 the default "fast mode"; the outlier discipline is why it usually just works now. The analysis you did here is why it sometimes doesn't (extreme models, exotic layers), and what to reach for then.
Folding s into the previous layer is the production detail worth savoring: the division by s becomes part of the LayerNorm weights, the multiplication lives in the quantized checkpoint. The runtime graph is identical to the unsmoothed model's. When you diff a SmoothQuant checkpoint against its base, all you see is slightly different numbers — the entire technique hides in plain sight.
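The folding described in the last note can be verified on a toy LayerNorm → Linear pair. The factors s below are random stand-ins for calibrated smoothing factors, and the shapes are arbitrary; the point is only that both graphs compute the same function.

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
C_in, C_out = 64, 32
x = rng.standard_normal((8, C_in))
gamma = rng.uniform(0.5, 2.0, C_in)
beta = rng.standard_normal(C_in)
W = rng.standard_normal((C_out, C_in))
s = rng.uniform(0.5, 8.0, C_in)      # stand-in for calibrated smoothing factors

# Unsmoothed graph: LayerNorm -> Linear.
y_ref = layernorm(x, gamma, beta) @ W.T

# Folded graph: the division by s lives in the LayerNorm parameters,
# the multiplication lives in the stored weights. Same ops at runtime.
y_folded = layernorm(x, gamma / s, beta / s) @ (W * s).T

print("same function:", np.allclose(y_ref, y_folded))
```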

Going further

Sweep α ∈ {0, 0.25, 0.5, 0.75, 1.0} on the outlier setup and plot W8A8 error. You'll see the U: α too low leaves X hard, too high makes W hard. The paper's 0.5 default is the bottom for typical magnitude ratios — find a setup where 0.75 wins (hint: make the weights unusually tame).
Implement per-token activation scales (scale_i = max|X[i]| / 127 per row) and compare: per-token alone vs smoothing alone vs both, on the outlier setup. Reproduces the design space the fp8 backends actually navigate.
Add a "quantize the smoothed weights with lab-01's per-channel int8" step and verify end-to-end W8A8 error lands near the fp baseline — you've now composed three labs into the actual SmoothQuant pipeline.

References

Xiao et al., SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (2022) — the migration, eq. 4 is your smooth: https://arxiv.org/abs/2211.10438
Dettmers et al., LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (2022) — the outlier phenomenon, documented and dissected: https://arxiv.org/abs/2208.07339
NVIDIA, FP8 Formats for Deep Learning (2022) — why e4m3's range softens the cliff: https://arxiv.org/abs/2209.05433
upstream/vllm/model_executor/layers/quantization/fp8.py — Fp8LinearMethod: the scale-mode taxonomy (static/dynamic, per-tensor/per-token) in production form.
Lab-02 — the GPU measurements this lab explains; lab-01 — the per-channel weight quantizer the migration relies on.

Phase 06 — Exercises: Quantization

Warm-up (explain)

Why does weight quantization speed up decode specifically? Tie it to Phase 0's physics.
Quantize a value to int8: write the scale + round + dequant steps.
per-tensor vs per-channel vs per-group scales — accuracy vs storage tradeoff.

Core (trace the code)

What two methods does every quant format implement (base_config.py:28/:37)?
How does a checkpoint's format become a per-layer method (get_quant_method, :151)?
In Fp8LinearMethod.apply (fp8.py:437), where are scales used and the GEMM called?
Why is FP8 (W8A8) able to speed the matmul while AWQ (weight-only) mainly speeds bandwidth?

Build (your lab)

In lab-01, construct a matrix where per-tensor int8 loses a whole channel. Quantify the error gap vs per-channel.
Extend to int4 per-group (group size 32). Compare error and storage to int8 per-channel.
Add fake activation quantization (W8A8) and show the matmul can run in int8 then rescale.

Design (staff-level)

A customer needs to fit a 70B model on 1×80GB GPU with decent quality. Walk your quant choice (weights? KV? which format?) and the accuracy validation you'd run first.
Throughput improved with FP8 but a downstream eval regressed 2 points. Diagnose: which layers are most sensitive, and what mitigations exist (keep some layers fp16, per-group scales)?
You want to add a new format (say a vendor's INT3). What exactly must you implement in vLLM, and what kernel work does it imply (Phase 7)?

Self-grading

4–7 and 11–13 are interview-grade. Could you draw the config→method→kernel dispatch and name the files? If not, re-read 01-deep-dive.md.

Phase 06 — Interview Questions: Quantization

Q1. Why does weight quantization speed up decode?

Model answer

Decode is memory-bandwidth-bound on reading the model weights from HBM each step. Storing weights in fewer bits (int4 ≈ ¼ the bytes) means ¼ the HBM traffic per step → higher decode throughput, even when the math is done in higher precision after dequant. It also frees HBM for more KV cache (higher concurrency). Prefill (compute-bound) benefits less unless you also quantize activations (W8A8) to use low-precision tensor cores.

Q2. How do you quantize a tensor to int8, and why do scale granularities matter?

Model answer

Pick a scale s so values fit in int8 range, store round(W/s) and s; reconstruct as s×int8. Granularity controls error: per-tensor uses one scale (an outlier channel forces a huge scale, crushing small channels); per-channel gives each output channel its own range; per-group (e.g. 128 weights) is finest, best for 4-bit, at the cost of more stored scales. You measure exactly this in lab-01.
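A compact way to show the granularity effect in an interview is a few lines of numpy. This is a sketch with an artificial outlier row; the helper names are illustrative and not taken from lab-01.

```python
import numpy as np

def quantize_int8(w, axis=None):
    """Symmetric int8. axis=None: one scale; axis=1: one scale per row."""
    amax = np.abs(w).max(axis=axis, keepdims=True)
    scale = amax / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 64))
W[0] *= 100.0                          # one outlier output channel

def quiet_rows_error(axis):
    """Relative error on the rows that are NOT the outlier."""
    q, s = quantize_int8(W, axis=axis)
    err = dequantize(q, s) - W
    return np.linalg.norm(err[1:]) / np.linalg.norm(W[1:])

err_tensor = quiet_rows_error(axis=None)
err_channel = quiet_rows_error(axis=1)
print(f"per-tensor:  {err_tensor:.4f}")
print(f"per-channel: {err_channel:.4f}")
```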

Q3. GPTQ vs AWQ?

Model answer

Both are post-training, weight-only 4-bit methods using calibration data. GPTQ minimizes layer output error with second-order (Hessian) info, quantizing and compensating column by column. AWQ scales the most salient weight channels (those hit by large activations) before rounding to protect them. Both plug into vLLM as a LinearMethod and use fast 4-bit kernels (Marlin).

Q4. How does vLLM run many formats without the model knowing?

Model answer

A QuantizationConfig parsed from the checkpoint returns a per-layer LinearMethodBase via get_quant_method. The method's create_weights allocates int weights + scales and apply runs the (de)quantized matmul. A Linear layer just calls self.quant_method.apply(x) — it never branches on format. Adding a format = one config + one method class + a registry entry (quantization/__init__.py). The matmul must use a kernel that understands the format (Phase 7).
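The dispatch pattern can be mimicked in plain Python. Everything below is a toy: the class names, the registry, and the "toy_int8" key are hypothetical, only loosely mirror the names in the answer, and are not vLLM's code. Passing the layer explicitly into apply is a choice made for this sketch.

```python
import numpy as np

class LinearMethod:
    """Toy stand-in for a per-layer quant method (not vLLM's class)."""
    def create_weights(self, layer, w):
        raise NotImplementedError
    def apply(self, layer, x):
        raise NotImplementedError

class UnquantizedMethod(LinearMethod):
    def create_weights(self, layer, w):
        layer.weight = w
    def apply(self, layer, x):
        return x @ layer.weight.T

class Int8PerChannelMethod(LinearMethod):
    def create_weights(self, layer, w):
        layer.scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
        layer.qweight = np.round(w / layer.scale).astype(np.int8)
    def apply(self, layer, x):
        return x @ (layer.qweight * layer.scale).T    # dequant, then matmul

REGISTRY = {None: UnquantizedMethod, "toy_int8": Int8PerChannelMethod}

class ToyQuantConfig:
    def __init__(self, quant_method=None):
        self.quant_method = quant_method
    def get_quant_method(self, layer):
        return REGISTRY[self.quant_method]()

class Linear:
    def __init__(self, w, config):
        self.quant_method = config.get_quant_method(self)
        self.quant_method.create_weights(self, w)
    def forward(self, x):
        return self.quant_method.apply(self, x)       # never branches on format

rng = np.random.default_rng(0)
w, x = rng.standard_normal((16, 32)), rng.standard_normal((4, 32))
y_fp = Linear(w, ToyQuantConfig()).forward(x)
y_q = Linear(w, ToyQuantConfig("toy_int8")).forward(x)
print("relative diff:", np.linalg.norm(y_q - y_fp) / np.linalg.norm(y_fp))
```

Adding a format to this toy means one new method class and one registry entry, with no change to Linear, which is the property the answer describes.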

Q5. What's FP8 KV cache and when do you use it?

Model answer

Storing the KV cache in FP8 (instead of fp16) halves KV bytes/token, roughly doubling how many concurrent sequences fit (Phase 0 lab-02). It's orthogonal to weight quantization (mix freely). Use it when KV memory caps your concurrency and the small accuracy hit is acceptable; validate on your eval first.
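The "halves KV bytes" claim is plain arithmetic. The model shape and KV budget below are hypothetical, chosen only to make the numbers round, and the sketch ignores any storage for FP8 scales.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    """K and V, for every layer and every KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Hypothetical model shape and KV budget, for illustration only.
shape = dict(layers=32, kv_heads=8, head_dim=128)
kv_budget = 10 * 1024**3

for name, nbytes in (("fp16", 2), ("fp8", 1)):
    per_tok = kv_bytes_per_token(**shape, dtype_bytes=nbytes)
    print(f"{name}: {per_tok} B/token -> {kv_budget // per_tok} cacheable tokens")
# fp16: 131072 B/token -> 81920 cacheable tokens
# fp8: 65536 B/token -> 163840 cacheable tokens
```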

Rapid-fire

Two methods a format implements? create_weights, apply.
Weight-only vs W8A8? bandwidth/memory vs also matmul speed.
4-bit accuracy trick? per-group scales (+ GPTQ/AWQ calibration).
Dispatch entry point? QuantizationConfig.get_quant_method.
FP8 KV cache effect? ~2× concurrency.

Phase 06 — Cheatsheet: Quantization

The one-liner

Fewer weight bits → less HBM read/step → faster decode + more room for KV. Store round(W/s) + scale s; accuracy ∝ scale granularity. Format must match the GEMM kernel.

Two axes

What: weight-only (GPTQ/AWQ, 4-bit) = bandwidth/memory; weight+activation (FP8/INT8 W8A8) = also faster matmul (low-precision tensor cores).
Format: FP8(E4M3/E5M2), INT8/INT4, MXFP4/NVFP4, GPTQ, AWQ, GGUF, compressed-tensors, ModelOpt, TorchAO.

Scale granularity

per-tensor (1 scale, worst) < per-channel (1/row) < per-group (1/128, best for 4-bit).

Dispatch (model is format-agnostic)

QuantizationConfig (from checkpoint) → get_quant_method(layer) → LinearMethodBase: create_weights (alloc int weights + scales) + apply (de/quant matmul → GEMM kernel, Phase 7). Linear just calls self.quant_method.apply(x).

GPTQ vs AWQ

Both post-training weight-only 4-bit w/ calibration. GPTQ: Hessian-based error min. AWQ: scale salient channels before rounding. Fast via Marlin kernels.

FP8 KV cache

Separate axis (kv_cache_dtype="fp8"): halves KV bytes → ~2× concurrency. Mix with any weight quant.

Key upstream

quantization/base_config.py:19 QuantizeMethodBase :28 create_weights :37 apply :70 Config :151 get_quant_method
quantization/fp8.py:100 Fp8Config :261 Fp8LinearMethod :316 create_weights :437 apply
quantization/__init__.py registry · quantization/awq.py · compressed_tensors/
layers/linear.py:182 Unquantized :231 LinearBase :410 ColumnParallel :1392 RowParallel

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

Phase 07 — The Hitchhiker's Guide to GEMM & MoE Kernels


Don't Panic

GEMM = General Matrix-Matrix Multiply. It's ~all the FLOPs in a transformer — every linear layer is a GEMM. MoE (Mixture of Experts) makes it weirder: instead of one big MLP, there are many expert MLPs, and each token is routed to only a few of them. That turns the dense GEMM into a routed, grouped GEMM — the frontier of open models (Mixtral, DeepSeek-V3, GPT-OSS) and where a lot of vLLM's current performance work lives. This phase is the kernels and the MoE machinery that make big sparse models fast.

Dense MLP:  x ─► [ one big W ] ─► y                       (every token, all weights)

MoE MLP:    x ─► gate ─► top-k experts ─► run only those ─► weighted combine
                                 │
                 token 1 → experts {3, 7}     "sparse": each token uses few of many experts
                 token 2 → experts {0, 7}

Step 1: GEMM — the workhorse and its kernels

A transformer is mostly matmuls: QKV projection, attention output, the two MLP matrices, the LM head. Making these fast is the job of GEMM libraries:

cuBLAS — NVIDIA's baseline.
CUTLASS — NVIDIA's open, composable GEMM templates; vLLM uses it heavily for quantized GEMMs (FP8/INT8, Phase 6).
TRTLLM-GEN / CuTeDSL — generated/DSL kernels tuned per GPU and precision.

The reason there are so many: a GEMM kernel must be tiled to fit the GPU's memory hierarchy and specialized per dtype (fp16 vs fp8 vs int4) to use the right tensor cores. The quant format from Phase 6 dictates which GEMM kernel runs.
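Why tiling matters can be seen from a simple traffic model. The sketch below assumes that computing one output tile loads a strip of A and a strip of B per step along K and nothing else; it is a simplification for intuition, not a description of any particular kernel.

```python
def tile_reuse(tile_m, tile_n):
    """FLOPs per element loaded for one (tile_m x tile_n) output tile.

    Per step along K: tile_m + tile_n elements are loaded and
    2 * tile_m * tile_n FLOPs are performed, so K cancels out.
    """
    return 2 * tile_m * tile_n / (tile_m + tile_n)

for m, n in [(128, 128), (64, 256), (1, 128), (1, 4096)]:
    print(f"tile {m}x{n}: reuse {tile_reuse(m, n):.2f}")
```

The expression is the harmonic mean of the two tile dimensions, so with M = 1 (decode) it stays below 2 however wide the tile gets.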

Step 2: MoE — sparse compute, dense capacity

A MoE layer replaces the dense MLP with E experts (each its own MLP). A router (a small linear "gate") scores the experts per token; each token goes to its top-k (e.g. top-2). So a model can have huge total parameters (capacity) but only activate a few experts per token (cheap compute). DeepSeek-V3 has 256 experts but activates ~8 per token.

The MoE forward, step by step (you'll build this in lab-01):

1. router:    logits = x @ W_gate        → (tokens, E)
2. top-k:     pick the k best experts per token + their weights (softmax over the k)
3. permute:   group tokens by their assigned expert (so each expert's tokens are contiguous)
4. grouped GEMM:  run each expert's MLP on its block of tokens
5. un-permute: scatter results back to original token order
6. combine:   weighted sum of each token's k expert outputs (by the gate weights)

Steps 3 & 5 (the permute/un-permute) exist because GPUs want contiguous work per expert — you can't efficiently do "token 1 → expert 3, token 2 → expert 0" as scattered tiny matmuls. Sorting tokens by expert turns it into a few big grouped GEMMs.
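The six steps above can be sketched in numpy as follows. To keep it short, each "expert" is a single (D, D) matrix rather than a full MLP, and the function names are illustrative; this is not lab-01's reference solution nor vLLM's kernel.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, w_gate, experts, k):
    """x: (T, D); w_gate: (D, E); experts: list of E toy (D, D) matrices."""
    T, D = x.shape
    logits = x @ w_gate                                        # 1. router
    topk = np.argsort(-logits, axis=-1)[:, :k]                 # 2. top-k expert ids
    gate = softmax(np.take_along_axis(logits, topk, axis=-1))  #    softmax over the k

    flat_expert = topk.reshape(-1)                 # expert of each (token, slot)
    flat_token = np.repeat(np.arange(T), k)        # token of each (token, slot)
    order = np.argsort(flat_expert, kind="stable") # 3. permute: sort slots by expert
    x_perm = x[flat_token[order]]

    y_perm = np.empty_like(x_perm)
    counts = np.bincount(flat_expert, minlength=len(experts))
    start = 0
    for e, n in enumerate(counts):                 # 4. one contiguous block per expert
        y_perm[start:start + n] = x_perm[start:start + n] @ experts[e]
        start += n

    y_slots = np.empty_like(y_perm)
    y_slots[order] = y_perm                        # 5. un-permute
    y_slots = y_slots.reshape(T, k, D)
    return (gate[..., None] * y_slots).sum(axis=1) # 6. weighted combine

def moe_reference(x, w_gate, experts, k):
    """Naive per-token loop, used only to check the permuted version."""
    out = np.zeros_like(x)
    logits = x @ w_gate
    for t in range(x.shape[0]):
        ids = np.argsort(-logits[t])[:k]
        g = softmax(logits[t, ids])
        for gi, e in zip(g, ids):
            out[t] += gi * (x[t] @ experts[e])
    return out

rng = np.random.default_rng(0)
T, D, E, k = 6, 4, 5, 2
x = rng.standard_normal((T, D))
w_gate = rng.standard_normal((D, E))
experts = [rng.standard_normal((D, D)) for _ in range(E)]
print(np.allclose(moe_forward(x, w_gate, experts, k),
                  moe_reference(x, w_gate, experts, k)))
```

The loop in step 4 is where a real implementation would issue one grouped GEMM; the sort in step 3 is what makes each expert's rows contiguous.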

Step 3: Why fused MoE kernels matter

Done naively, MoE is a gather + many small GEMMs + a scatter — launch-bound and memory-bound (Phase 5's enemy, at the kernel level). Fused MoE kernels combine routing, the grouped GEMM, and the combine into one (or few) kernels, keeping tensor cores busy and killing launch overhead. This is decisive for MoE throughput and is exactly what vllm/model_executor/layers/fused_moe/ provides (Triton and CUTLASS variants).

Step 4: Expert parallelism (EP) — experts across GPUs

Experts are independent, so you can place different experts on different GPUs. Each step, tokens are shuffled to wherever their expert lives (an all-to-all collective), run, and shuffled back. EP scales the number of experts cheaply, at the cost of communication and load balancing (if everyone routes to expert 7, that GPU is the bottleneck). Contrast with tensor parallelism (Phase 10), which shards each expert's weights across GPUs. Real deployments combine EP for the MoE layers with DP/TP for attention.
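The load-balancing cost can be illustrated with a toy placement. The sketch assumes experts are split into contiguous, equal-size shards and that step time tracks the busiest device, ignoring all-to-all communication; the routing lists are synthetic.

```python
from collections import Counter

def ep_step_load(expert_ids, num_experts, num_devices):
    """Tokens per device when experts are split into contiguous equal shards."""
    per_device = num_experts // num_devices
    loads = Counter(e // per_device for e in expert_ids)
    return [loads.get(d, 0) for d in range(num_devices)]

balanced = [t % 8 for t in range(64)]           # every expert gets 8 tokens
hot = [7] * 40 + [t % 8 for t in range(24)]     # same 64 tokens, expert 7 is popular

for name, ids in (("balanced", balanced), ("hot", hot)):
    loads = ep_step_load(ids, num_experts=8, num_devices=4)
    print(name, loads, "step time ~ max load =", max(loads))
# balanced [16, 16, 16, 16] step time ~ max load = 16
# hot [6, 6, 6, 46] step time ~ max load = 46
```

Total work is identical in both cases; only its distribution changes, and the slowest device sets the pace.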

The invariants to memorize

GEMM = the FLOPs; CUTLASS/TRTLLM-GEN/CuTeDSL are the fast, dtype-specialized kernels.
MoE = router → top-k → permute → grouped GEMM → un-permute → weighted combine.
Permute/un-permute exist to make per-expert work contiguous (big GEMMs, not scattered tiny ones).
Fused MoE kernels remove the gather/scatter launch + memory overhead.
EP spreads experts across GPUs (all-to-all + load balancing); TP shards each expert.
The quant format (Phase 6) selects the GEMM kernel.

What you'll do

Read: 01-deep-dive.md — FusedMoE, the fused kernel + fused_experts, permute/un-permute, and a real MoE model (Mixtral), line-anchored.
Build: 02-mini-build.md — top-k routing + grouped experts + combine.
Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
- lab-01-moe-routing [CPU-OK] — implement the full MoE forward in numpy; prove it equals a reference and that permute/un-permute round-trips.
- lab-02-profile-fused-moe [GPU-OPT] — profile fused MoE's share of step time (captured).
- lab-03-tiled-gemm [CPU-OK] — tiling and the memory-traffic model: reuse = harmonic mean of tile dims; why decode (M=1) caps at reuse 2 and no tile size can save it.
- lab-04-expert-load-balance [CPU-OK] — loads, imbalance, EP step time = max device load; prove a hot expert inflates the step >2.5× at identical total work; capacity-factor drops.
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.


Phase 07 — Deep Dive: fused MoE in real vLLM

Paths relative to upstream/ at v0.22.1 @ 0decac0.

vllm/model_executor/layers/fused_moe/layer.py             FusedMoE nn.Module (the layer)
vllm/model_executor/layers/fused_moe/fused_moe.py         the Triton fused kernel + fused_experts
vllm/model_executor/layers/fused_moe/moe_permute_unpermute.py   permute/un-permute
vllm/model_executor/layers/fused_moe/moe_align_block_size.py    align tokens to GEMM tiles
vllm/model_executor/layers/fused_moe/fused_moe_method_base.py   the method base (quant-aware)
vllm/model_executor/models/mixtral.py                     a real MoE model
vllm/model_executor/models/deepseek_v2.py                 DeepSeek MoE (shared experts + MLA)

1. The layer: `FusedMoE`

vllm/model_executor/layers/fused_moe/layer.py:73 — class FusedMoE(PluggableLayer). This is what a model instantiates (see Mixtral below). It holds all experts' weights as stacked tensors (shape roughly (E, ...)) and a quant method (Phase 6 — MoE weights are often quantized too). Its forward (:1306) takes the hidden states and the router logits and returns the combined output. It hides the routing + grouped GEMM + combine behind one call — the model just does self.experts(hidden_states, router_logits).

2. The fused kernel: `fused_moe.py`

vllm/model_executor/layers/fused_moe/fused_moe.py:

fused_moe_kernel (:295) — the Triton kernel doing the grouped expert GEMM (one program per tile, looking up which expert a tile belongs to). fused_moe_kernel_gptq_awq (:61) is the quantized variant (Phase 6 formats need their own MoE kernel).
fused_experts (:1587) / fused_experts_impl (:1664) — the host-side orchestration: align tokens to block size, run the kernel for the up/gate projection, apply the activation, run the down projection, and combine. Read fused_experts_impl to see the full sequence — it's the guide's 6 steps in code.

The win vs naive: instead of a Python loop of E small matmuls, one kernel processes all tokens for all experts, indexed by a sorted token→expert mapping. That's the "fused" in fused MoE.

3. Permute / align: making per-expert work contiguous

moe_align_block_size.py — sorts/pads tokens so each expert's tokens form contiguous, tile-aligned blocks the GEMM kernel can chew through efficiently. This is the practical form of the guide's "permute" step.
moe_permute_unpermute.py — the explicit permute (group by expert) and un-permute (scatter back) used by some paths.

Either way the principle is the same: sort tokens by expert → big grouped GEMM → scatter back. Your lab-01 does this with argsort, which is exactly the idea minus the tile alignment.
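As a rough picture of "sort tokens by expert, then scatter back", here is a small numpy sketch of the permutation and its inverse. The assignment matrix is a made-up example, and nothing here mirrors vLLM's actual buffers or tile alignment; it only shows the bookkeeping.

```python
import numpy as np

# Hypothetical assignment: 5 tokens, top-2 experts each, 4 experts.
topk_ids = np.array([[3, 0], [0, 1], [2, 3], [0, 2], [1, 3]])
T, k = topk_ids.shape
E = 4

flat_expert = topk_ids.reshape(-1)            # (T*k,) expert of each assignment
flat_token = np.repeat(np.arange(T), k)       # (T*k,) token of each assignment

# Permute: a stable sort by expert makes each expert's assignments contiguous.
perm = np.argsort(flat_expert, kind="stable")
sorted_expert = flat_expert[perm]
sorted_token = flat_token[perm]

# Block boundaries: expert e owns rows offsets[e]:offsets[e+1] of the sorted order.
counts = np.bincount(flat_expert, minlength=E)
offsets = np.concatenate(([0], np.cumsum(counts)))

# Un-permute: the inverse permutation restores the original assignment order.
inv = np.empty_like(perm)
inv[perm] = np.arange(perm.size)

for e in range(E):
    print(e, sorted_token[offsets[e]:offsets[e + 1]].tolist())
```

For this input, expert 0 owns tokens `[0, 1, 3]`, expert 1 owns `[1, 4]`, expert 2 owns `[2, 3]`, and expert 3 owns `[0, 2, 4]`; indexing any sorted array with `inv` returns it to the original order.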

4. Routing (top-k) and combine

The router is a small linear (gate) producing (tokens, E) logits. Selecting top-k experts and their normalized weights happens in the layer/kernel path (look for topk / select_experts in layer.py and fused_moe.py). DeepSeek adds grouped top-k (group experts, pick groups first) and shared experts (always-on experts added to every token) — see deepseek_v2.py. The combine is a weighted sum of each token's k expert outputs by the gate weights.

5. A real MoE model: Mixtral

vllm/model_executor/models/mixtral.py:

class MixtralMoE(nn.Module) (:77) — builds self.experts = FusedMoE(...) (:132) and a gate linear; its forward computes router_logits = gate(x) then self.experts(x, router_logits) (:153). That's the entire MoE block — the complexity is inside FusedMoE. When you add a model (Phase 14), wiring an MoE layer is this small.

6. Expert parallelism

fused_moe/all2all_utils.py, prepare_finalize/, and expert_map_manager.py implement EP: an expert-to-GPU map, the all-to-all that ships tokens to their expert's GPU and back, and load handling. EP is configured alongside TP/DP (Phase 10). The key cost is the all-to-all + imbalance when routing is skewed.

Reading checklist

FusedMoE.forward — what two things does it take, and what does it hide?
fused_experts_impl — find the up/gate GEMM, activation, down GEMM, and combine.
Why does moe_align_block_size exist (contiguous, tile-aligned per-expert work)?
In Mixtral, how few lines is the MoE block once FusedMoE exists?
EP vs TP for MoE — what does each shard, and what communication does each imply?

Now build it: 02-mini-build.md, then the labs.

Phase 07 — Mini-Build: the MoE forward in numpy

You'll implement the full MoE forward — router → top-k → permute → grouped experts → un-permute → weighted combine — and prove it equals a simple reference. This makes the fused kernel's job concrete: it's this, fused into one GPU pass.

The task (lab-01)

Implement, in numpy:

route(x, W_gate, k) → (topk_ids (T,k), topk_weights (T,k)): logits = x @ W_gate.T; pick the top-k experts per token; softmax the k selected logits for the combine weights.
moe_forward_reference(x, experts, topk_ids, topk_weights) → the naive version: for each token, for each of its k experts, run that expert's MLP and weight-sum. (Correct, slow — the oracle.)
moe_forward_grouped(x, experts, topk_ids, topk_weights) → the "fused" idea: permute tokens by expert (argsort), run each expert once on its contiguous block (grouped GEMM), un-permute, then combine. Must equal the reference.

An "expert" here is a tiny MLP: relu(x @ W1) @ W2.

Why permute/un-permute (the key insight)

Scattered per-token expert calls are tiny and launch-bound. Sorting tokens by expert turns the work into a handful of big matmuls (one per expert), which the GPU loves. Your argsort-based permute is the CPU mirror of moe_align_block_size / moe_permute_unpermute.

Definition of done

pytest phase-07-gemm-and-moe-kernels/labs -q

Tests pin: grouped == reference output; the permutation round-trips (un-permute ∘ permute = identity); each expert is invoked on exactly its assigned tokens; top-k weights sum to 1 per token.

Map to the real engine

| your numpy | real vLLM |
|---|---|
| `route` top-k | routing in `FusedMoE`/`fused_moe.py` |
| permute by `argsort` | `moe_align_block_size` / `moe_permute_unpermute.py` |
| grouped expert matmuls | `fused_moe_kernel` (`fused_moe.py:295`) |
| weighted combine | the combine in `fused_experts_impl` (`:1664`) |
| (experts on different GPUs) | expert parallelism (`all2all_utils.py`) |

Phase 07 Labs — GEMM & MoE Kernels

Four labs below the attention line: the matmuls that are most of every step's milliseconds, and the mixture-of-experts machinery that reorganizes them. The arc: build the MoE forward and prove the grouped formulation exact (lab-01), learn the tiling arithmetic that makes any GEMM fast — and why decode shapes defeat it (lab-03), measure the balance tax that routing levies on parallel experts (lab-04), then profile a real MoE model and check all three models against silicon (lab-02).

Recommended order: 01 → 03 → 04 → 02. (Directory numbers predate labs 03–04.) CPU labs follow the standard contract — starter.py (your work), solution.py (reference), test_lab.py (the spec); default runs the solution, LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-07-gemm-and-moe-kernels/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-07-gemm-and-moe-kernels/labs/lab-01-moe-routing -q

Labs

lab-01-moe-routing `[CPU-OK]`

Implement the MoE forward twice — the naive per-token oracle and the grouped formulation (permute by expert → one matmul per expert → scatter back with combine weights) — and prove them equal. The grouped version is the readable edition of fused_moe_kernel; the scatter-back hides the lab's boss fight (np.add.at vs the duplicate-dropping +=). Skills: select-normalize-combine routing; the permute trick; conservation bookkeeping as debugging.

lab-02-profile-fused-moe `[GPU-OPT]`

Capture decode steps of a real MoE model under torch.profiler and read the kernel table: experts ~41%, permute ~7%, router ~1%. Predict the breakdown first; the gaps are your misconceptions. Annotated capture included. Skills: warm-up discipline; kernel-table 80/20; decision-cheap/consequence-expensive structure of routing.

lab-03-tiled-gemm `[CPU-OK]`

The idea that fills the gap between three nested loops and CUTLASS: tiling. Implement a tiled matmul (exact, ragged edges included) and the memory-traffic model — reuse equals the harmonic mean of tile dimensions, square tiles win, and a decode-shaped matmul (M=1) caps at reuse 2 no matter what, re-deriving decode's bandwidth wall from the kernel side. Skills: traffic counting; intensity as an algorithm property; the three-level tiling hierarchy.

lab-04-expert-load-balance `[CPU-OK]`

The MoE serving problem: with experts sharded across devices, a step lasts as long as the busiest device. Build the diagnostics (loads, imbalance factor, EP step time, capacity-overflow drops) and prove a hot expert inflates step time >2.5× at identical total work. Skills: straggler arithmetic; placement vs routing; why inference never drops tokens; what EPLB optimizes.

What you can do after this phase

Estimate any GEMM's achievable performance from shape + tile + hardware on a napkin; explain why grouped MoE kernels exist and verify one against an oracle; diagnose an underperforming MoE deployment with a routing histogram before touching a profiler, and with one afterward; and read fused_moe.py upstream as four phases you've personally implemented. Phase 8 (speculative decoding) spends the idle FLOPs you now know how to find; Phase 10 stretches the all-to-all across nodes.

Lab 07-01 — The MoE Forward (Routing + Grouped Experts) `[CPU-OK]`

A mixture-of-experts layer makes a strange promise: E experts' worth of MLP parameters for only k experts' worth of per-token compute (Mixtral: 8 experts, top-2), because each token visits only its top-k of E expert MLPs. The catch is operational: "each token visits different experts" is a scatter/gather nightmare for hardware that loves big uniform matmuls. This lab has you implement both sides of the resolution: the naive per-token loop (obviously correct, hopelessly slow, and therefore your oracle) and the grouped formulation (permute tokens by expert → one big matmul per expert → scatter back with combine weights), and prove they're equal. The grouped version is, step for step, what vLLM's fused_moe_kernel does in one GPU pass. You're writing the readable edition of one of the hottest kernels in modern serving.

Why this lab exists

MoE is where the frontier lives (Mixtral, DeepSeek-V3, Qwen-MoE, most rumored frontier models), and its serving stack confuses newcomers because the math (a weighted sum of small MLPs) and the implementation (sorts, histograms, alignment buffers, grouped GEMMs) look unrelated. They aren't — the implementation is the math, reorganized so the GPU sees few large uniform operations instead of many tiny ragged ones. Building both versions and asserting equality is how you internalize the correspondence; after this lab, moe_align_block_size (a real kernel whose name suggests nothing) reads as "my argsort, made tile-friendly."

The grouped-equals-reference test is this course's master invariant (optimizations must not change output) appearing again, this time in its most insidious habitat: the scatter-back. Combine-weight bugs and duplicate-token-row bugs (np.add.at vs out[toks] +=; see the notes) produce outputs that are plausibly wrong, the worst kind. The oracle test is the only honest defense.

Background: the permute trick

Per token: router logits x @ W_gateᵀ → take top-k experts → softmax the selected k logits into combine weights → output is the weighted sum of those experts' MLP outputs. Done literally, that's T × k tiny matmuls — death by launch overhead and zero data reuse (lab-03 quantifies why tiny matmuls waste a GPU).

The grouped reformulation observes that the same set of (token, expert) pairs can be processed expert-major instead of token-major:

Flatten the (T, k) assignment matrix into T·k (token, expert, weight) triples.
Permute: sort triples by expert (your argsort; real kernels build the equivalent grouping with a histogram + prefix sum — moe_align_block_size).
Grouped GEMM: for each expert, one matmul over its contiguous block of tokens — E medium matmuls instead of T·k tiny ones, each big enough to tile well (lab-03).
Un-permute + combine: scatter results back to token order, multiplying by the combine weights, summing the k contributions per token.

No arithmetic changed — only its order. The speedup comes entirely from shaping the work to what hardware rewards: contiguity and uniformity.
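To make "only the order changed" concrete, here is a compact sketch of both formulations side by side. It is one possible shape for the lab's functions, not the reference solution: representing `experts` as a list of `(W1, W2)` tuples and the toy sizes are assumptions of this example.

```python
import numpy as np

def expert_mlp(x, w1, w2):
    # The lab's tiny expert: relu(x @ W1) @ W2.
    return np.maximum(x @ w1, 0.0) @ w2

def moe_forward_reference(x, experts, topk_ids, topk_weights):
    # Token-major oracle: T*k tiny expert calls.
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(topk_ids.shape[1]):
            w1, w2 = experts[topk_ids[t, j]]
            out[t] += topk_weights[t, j] * expert_mlp(x[t:t + 1], w1, w2)[0]
    return out

def moe_forward_grouped(x, experts, topk_ids, topk_weights):
    T, k = topk_ids.shape
    flat_expert = topk_ids.reshape(-1)                 # 1. flatten to T*k triples
    flat_token = np.repeat(np.arange(T), k)
    flat_weight = topk_weights.reshape(-1)
    perm = np.argsort(flat_expert, kind="stable")      # 2. permute: sort by expert
    tok, wgt = flat_token[perm], flat_weight[perm]
    counts = np.bincount(flat_expert, minlength=len(experts))
    out = np.zeros_like(x)
    start = 0
    for e, n in enumerate(counts):
        if n:
            rows = tok[start:start + n]
            y = expert_mlp(x[rows], *experts[e])       # 3. one expert call per block
            # 4. un-permute + combine; add.at accumulates even if a row repeats
            np.add.at(out, rows, wgt[start:start + n, None] * y)
        start += n
    return out

rng = np.random.default_rng(0)
T, d, h, E, k = 16, 8, 12, 4, 2                        # toy sizes for illustration
x = rng.standard_normal((T, d))
experts = [(rng.standard_normal((d, h)), rng.standard_normal((h, d))) for _ in range(E)]
logits = rng.standard_normal((T, E))
topk_ids = np.argsort(-logits, axis=1)[:, :k]
sel = np.take_along_axis(logits, topk_ids, axis=1)
topk_weights = np.exp(sel) / np.exp(sel).sum(axis=1, keepdims=True)

ref = moe_forward_reference(x, experts, topk_ids, topk_weights)
grp = moe_forward_grouped(x, experts, topk_ids, topk_weights)
print(np.allclose(ref, grp))
```

The comparison uses `np.allclose` rather than exact equality because the two versions sum each token's k contributions in a different order, which can move the last floating-point bit.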

Files

starter.py — route, expert_mlp, moe_forward_reference, moe_forward_grouped. Your work.
solution.py — reference.
test_lab.py — grouped == reference, combine weights sum to 1, assignment bookkeeping.

Run

LAB_IMPL=starter pytest phase-07-gemm-and-moe-kernels/labs/lab-01-moe-routing -q
pytest phase-07-gemm-and-moe-kernels/labs/lab-01-moe-routing -q   # reference

What to implement

Per 02-mini-build.md: route (logits → top-k ids + softmax of the selected logits), expert_mlp (relu(x @ W1) @ W2), the reference loop, and the grouped version. Two precision points. First, softmax over the selected k logits only, not all E: selecting then normalizing is the standard formulation, and normalizing then selecting gives different weights. Second, the scatter-back must accumulate correctly whenever a token index repeats in the scatter index. That happens if you scatter all T·k permuted rows in one shot (every token appears k times), or within a single expert's block if the router ever emits the same expert twice for one token. np.add.at accumulates correctly where fancy-indexed += silently keeps only one write per repeated index. That numpy footgun is the lab's hidden boss; the bookkeeping test exists for it.
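The scatter-back footgun is easy to reproduce in isolation. The sketch below uses made-up indices and values; only the numpy behavior itself is the point.

```python
import numpy as np

rows = np.array([0, 2, 2, 1, 2])          # token 2 appears three times
vals = np.ones((5, 3))

buffered = np.zeros((3, 3))
buffered[rows] += vals                    # repeated index: only one write survives

accumulated = np.zeros((3, 3))
np.add.at(accumulated, rows, vals)        # unbuffered: every row is added

print(buffered[:, 0])       # [1. 1. 1.]
print(accumulated[:, 0])    # [1. 1. 3.]
```

The `+=` form is evaluated as a gather, an add, and a single assignment, so the three contributions to token 2 collapse into one. `np.add.at` applies each contribution in turn.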

What the tests prove

| Test | What it pins |
|---|---|
| grouped ≈ reference | The permute/group/scatter pipeline is an identity on the math: the kernel's entire correctness claim |
| combine weights sum to 1 | The router emits a proper convex combination; drop this and outputs scale with k |
| assignments = `T × k`, each expert sees exactly its tokens | The bookkeeping conservation law: nothing dropped, nothing duplicated in the permute. This is the histogram you'd actually print when debugging a real routing issue (lab-04 builds the diagnostics on top) |

Hitchhiker's notes

Why softmax-after-top-k? It renormalizes mass over the experts actually used, so the output is a proper weighted average regardless of how confident the router was. Mixtral and most modern MoEs do exactly this; some (DeepSeek-V3) use sigmoid gates with normalization — same pipeline, different gate function. The structure (select → normalize → combine) is the stable part.
The real kernel fuses steps 2–4 into one launch: fused_moe_kernel (upstream/vllm/model_executor/layers/fused_moe/fused_moe.py:295) — a Triton kernel whose grid covers (expert blocks × tile positions), reading the alignment metadata that moe_align_block_size produced. Your four functions are its four phases; the fusion exists so intermediate permuted tensors never round-trip through HBM (the recurring lesson from Phase 2 lab-06 and lab-03 here: materializing intermediates forfeits the bandwidth win).
SwiGLU, not ReLU, in real models: (silu(x@W1) * (x@W3)) @ W2 — three weight matrices per expert, one extra elementwise multiply. Changes the per-expert FLOPs, changes nothing about routing/grouping. The lab uses ReLU to keep the oracle short.
Where the time really goes: lab-02's profile shows experts (grouped GEMM) at ~41%, permute at ~7%, router at ~1%. Routing is decision-cheap, consequence-expensive — the gate is a tiny matmul whose output determines whether the expensive part runs balanced (lab-04's entire subject).
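For the SwiGLU note above, a minimal numpy sketch of the two expert shapes side by side may help. The weight names follow the formula in the note; the helper `matmul_flops_per_token` and the toy sizes are assumptions introduced for this example.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))          # z * sigmoid(z)

def relu_expert(x, w1, w2):
    # The lab's expert: two weight matrices.
    return np.maximum(x @ w1, 0.0) @ w2

def swiglu_expert(x, w1, w3, w2):
    # Gate branch (silu) times up branch, then the down projection: three matrices.
    return (silu(x @ w1) * (x @ w3)) @ w2

def matmul_flops_per_token(d, h, num_matrices):
    # Each (d x h) or (h x d) matmul costs 2*d*h FLOPs per token.
    return num_matrices * 2 * d * h

rng = np.random.default_rng(0)
d, h = 8, 16                                # toy sizes for illustration
x = rng.standard_normal((5, d))
w1, w3 = rng.standard_normal((d, h)), rng.standard_normal((d, h))
w2 = rng.standard_normal((h, d))

y = swiglu_expert(x, w1, w3, w2)
ratio = matmul_flops_per_token(d, h, 3) / matmul_flops_per_token(d, h, 2)
print(y.shape, ratio)                       # (5, 8) 1.5
```

Both experts map `(tokens, d)` to `(tokens, d)`, so either can be dropped into the routing and grouping code unchanged; only the per-expert matmul cost differs, by a factor of 1.5 in this count.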

Going further

Replace argsort with the histogram + prefix-sum (counting sort) the real kernel uses: np.bincount → np.cumsum → stable placement. Same permutation, O(T·k + E) instead of O(T·k · log(T·k)), and now you've written moe_align_block_size's algorithm.
Implement SwiGLU experts and re-run the equality test (it should pass untouched — routing is orthogonal to expert internals; prove it).
Pad each expert's token block to a multiple of 16 (the GEMM tile constraint — lab-03) with zero rows, and verify the output is still exact. You've discovered why moe_align_block_size has "block size" in its name, and where MoE's small padding-waste overhead comes from.
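The first and third exercises above could start from something like the sketch below. It is a plain-Python counting sort written for readability, not a transcription of the upstream kernel; the function name and the block size of 4 are assumptions for the example.

```python
import numpy as np

def counting_sort_perm(flat_expert, num_experts):
    counts = np.bincount(flat_expert, minlength=num_experts)   # histogram
    next_slot = np.cumsum(counts) - counts                     # exclusive prefix sum
    perm = np.empty(flat_expert.size, dtype=np.int64)
    for i, e in enumerate(flat_expert):                        # stable placement
        perm[next_slot[e]] = i
        next_slot[e] += 1
    return perm, counts

flat_expert = np.array([3, 0, 0, 1, 2, 3, 0, 2, 1, 3])
perm, counts = counting_sort_perm(flat_expert, num_experts=4)
same = np.array_equal(perm, np.argsort(flat_expert, kind="stable"))

block = 4   # hypothetical tile height; the exercise above suggests 16
padded_counts = -(-counts // block) * block                    # round up to a multiple
print(same, counts.tolist(), padded_counts.tolist())
# True [3, 2, 2, 3] [4, 4, 4, 4]
```

The gap between `padded_counts` and `counts` is the padding waste: rows the GEMM processes that carry no real token.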

References

upstream/vllm/model_executor/layers/fused_moe/fused_moe.py:295 — the fused kernel; read it next to your grouped function, phase by phase.
Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017) — the modern MoE formulation: https://arxiv.org/abs/1701.06538
Jiang et al., Mixtral of Experts (2024) — the architecture this lab's shapes mimic (8 experts, top-2): https://arxiv.org/abs/2401.04088
Lab-03 — why grouped beats tiny matmuls (tiling/reuse); lab-04 — what the routing histogram costs; lab-02 — the profile showing where the milliseconds go.

Lab 07-02 — Profile the Fused MoE Kernel `[GPU-OPT]`

You've built the MoE forward (lab-01), the tiling that makes its GEMMs fast (lab-03), and the balance diagnostics (lab-04). This lab closes the loop with the instrument that tells you which of those matters on your model, on your hardware, right now: the profiler. You'll capture a few decode steps of a real MoE model under torch.profiler and read the kernel-level time breakdown — discovering that the grouped expert GEMM eats ~40% of the step, the router costs one percent, and the permute machinery is visible but minor. That breakdown is the empirical ground truth that every MoE optimization argument has to answer to.

No GPU? Don't panic. The captured profile below is annotated line by line against labs 01/03/04 — the reading skill transfers intact.

Why this lab exists

Profiling is the difference between optimizing and gesturing. Every phase so far has handed you models of where time goes (roofline, launch counts, traffic formulas, imbalance factors); the profiler is how you check a model against a machine — and the discipline of "predict the breakdown, then look" is what makes profiles informative instead of just colorful. Before running this lab, write down your guesses: what share for the experts? for attention? for the router? The gaps between your guesses and the table below are precisely your remaining misconceptions about MoE — that's the lab.

The kernel-table-reading skill is also the universal entry point to Phase 18 (where profiling becomes systematic, with nsys/ncu and timeline views). A key_averages() table sorted by CUDA time is the 80/20 of GPU performance work: ten seconds of looking tells you which subsystem owns the milliseconds, which is the only question that decides where engineering effort goes.

Requirements

uv pip install -e ".[vllm]"
# a small MoE checkpoint, e.g. Qwen1.5-MoE-A2.7B or any 0.5–3B-activated MoE on the Hub

Steps

import torch
from vllm import LLM, SamplingParams

llm = LLM(model="<a small MoE model>", gpu_memory_utilization=0.6, max_model_len=1024)
llm.generate(["warmup"] * 4, SamplingParams(max_tokens=8))   # warm up: capture, caches, autotune

with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) as prof:
    llm.generate(["Explain MoE in one line:"] * 8,
                 SamplingParams(max_tokens=32, temperature=0))
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))

Warm up first — profiling a cold engine records CUDA-graph capture, compilation, and autotuning, drowning the steady state you actually care about (the mistake that invalidates more first profiles than any other). Then find: the fused MoE / grouped-GEMM kernels, the align/permute ops, attention, and the router's gate matmul. Compute each one's share.

Captured output (real run, small MoE, L4, vLLM 0.22.1, trimmed)

Name                                   CUDA time %
fused_moe_kernel (grouped expert GEMM)    41.2%      <- the experts dominate
moe_align_block_size / permute             6.8%      <- the sort/permute (your argsort)
flash_attn (attention)                    18.5%
rms_norm / residual / misc                 9.0%
gate (router linear)                       1.3%      <- routing is cheap
all-to-all (if EP enabled)                 ...       <- expert-parallel comms

Reading the profile

fused_moe_kernel at 41% — the experts are the model, economically. Every percent shaved here is ~0.4% of the whole step, which is why the fused kernel gets CUTLASS/Triton-level attention upstream and why lab-03's tiling arithmetic is load-bearing. It's also why balance matters so much (lab-04): this 41% is the part that inflates under a hot expert.
moe_align_block_size + permute at ~7% — the bookkeeping tax of the grouped formulation (lab-01's argsort, tile-aligned). Visible, real, and worth exactly as much optimization effort as 7% justifies — which is some, not much. When a PR claims big wins from permute cleverness, this number is your calibration.
gate at 1.3% — the most strategically interesting line: the decision is nearly free while its consequences (the 41% above, and the balance of it) are everything. Cheap decisions with expensive consequences are where you spend design attention, not kernel attention — lab-04 is entirely about this line's downstream effects.
Attention at 18.5%: for context, in a dense model this and the MLP GEMMs would be the whole story (Phase 4's attention and lab-03's GEMMs). MoE adds the expert economy on top; it doesn't replace the transformer's costs.
The missing line: all-to-all — single-GPU here, so no EP communication. On a multi-node DeepSeek-scale deployment this line appears and can rival the GEMM itself — Phase 10's territory, lab-04's placement problem made physical.
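The "how much effort does a share justify" reasoning above is Amdahl's law applied to the captured table. A small sketch, using the percentages from that table and hypothetical speedup factors chosen only for illustration:

```python
# Shares copied from the captured table above (percent of CUDA time).
shares = {
    "fused_moe_kernel": 41.2,
    "align_permute": 6.8,
    "attention": 18.5,
    "norm_misc": 9.0,
    "gate": 1.3,
}

def step_speedup(share_pct, component_speedup):
    """Whole-step speedup when one component becomes component_speedup times faster."""
    f = share_pct / 100.0
    return 1.0 / ((1.0 - f) + f / component_speedup)

# Hypothetical scenarios: a 1.25x faster expert GEMM, a free permute, a free gate.
scenarios = [
    ("fused_moe_kernel", 1.25),
    ("align_permute", float("inf")),
    ("gate", float("inf")),
]
for name, s in scenarios:
    print(f"{name}: {step_speedup(shares[name], s):.3f}x")
```

Under these assumptions a 25% faster expert GEMM buys about 1.09x on the step, while making the permute entirely free buys about 1.07x and a free gate about 1.01x, which is the calibration the list above describes.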

Hitchhiker's notes

Percentages lie across regimes. This is a decode-heavy profile at modest batch. A prefill-heavy run shifts share toward attention (longer sequences — Phase 4 lab-03's quadratic); a bigger batch shifts toward GEMMs and improves their efficiency (lab-03's reuse). Always note the workload a profile was taken under — a profile without its workload is a number without units.
Kernel names drift across versions (Triton autogenerated names especially). Anchor on the structure: one big grouped GEMM, one alignment/permute pass, one tiny gate. Those three will exist under any naming in any version.
ProfilerActivity.CUDA measures GPU time; add CPU and compare totals — if CPU time ≫ CUDA time at small batch, you're launch-bound and Phase 5's graphs are the fix, not kernel work. The profiler answers the "which regime am I in?" question from Phase 0 lab-04 empirically.
vLLM also ships its own profiling hooks (VLLM_TORCH_PROFILER_DIR for trace-on-demand against a running server) — same data, production-shaped collection. Phase 18 uses them; this lab's inline version is the minimal form.

Reflect

Predict-then-check: how would this table change for (a) batch 64 instead of 8, (b) a 4k-token prefill, (c) the same model dense-ified (experts merged)? Each answer applies one of lab-03, lab-04, or Phase 4.
The gate is 1.3% of time but determines the balance of the 41%. Where, concretely, would you add instrumentation to catch a balance regression in production? (Routing histogram per window — lab-04's expert_loads — exported as a metric; the profiler is for diagnosis, histograms are for monitoring.)
If moe_align_block_size grew to 20% of the step after a model swap, what changed? (More experts and/or smaller per-expert blocks — the permute is per-assignment, the GEMMs amortize per expert block; small-expert MoEs pay relatively more bookkeeping.)

References

upstream/vllm/model_executor/layers/fused_moe/fused_moe.py — the kernels whose names you just learned to find in a table.
PyTorch docs, torch.profiler — the instrument: https://pytorch.org/docs/stable/profiler.html
vLLM docs, Profiling — server-shaped trace collection: https://docs.vllm.ai/en/latest/contributing/profiling/
Labs 01 / 03 / 04 — the three models this profile validates (grouped formulation, tiling economics, balance tax).
Phase 18 — profiling as a discipline: timelines, nsys, regression hunting.

Lab 07-03 — Tiled GEMM and the Memory-Traffic Model `[CPU-OK]`

A matrix multiply is three nested loops a first-year student can write. CUTLASS — the template library behind most of vLLM's GEMMs — is tens of thousands of lines. This lab is about the single idea that fills that gap: tiling. Not because the loops are wrong, but because the memory traffic is: a naive GEMM re-reads its operands from slow memory incessantly, while a tiled one stages blocks in fast memory and reuses every loaded element many times. You'll implement the tiling (and prove it changes nothing numerically — it's pure loop reordering), then build the traffic model that explains why tile shape is the most important number in any GEMM kernel — and derive, as a bonus, exactly why a decode-shaped matmul (M=1) can't be saved by any tile size at all.

Why this lab exists

Phase 0 lab-04 gave you the roofline: below the ridge you're bandwidth-bound, above it compute-bound. What it didn't say is that arithmetic intensity is not a property of the problem; it's a property of the algorithm. A 1024³ GEMM has plenty of FLOPs per byte in principle (its three fp16 matrices total ~6 MB and the work totals ~2.1 GFLOPs: roughly 340 FLOPs per byte), but the naive loop order achieves an intensity of ~1 FLOP per element loaded anyway, because it keeps re-loading what it just evicted. Tiling is the act of claiming the intensity the math always had. Every fast kernel you'll ever read, whether a CUTLASS GEMM, FlashAttention (which is exactly this lab applied to attention; see Phase 4), or the fused MoE kernel (lab-02's profile), is this one idea wearing different clothes, and the traffic model you build here is how you estimate any of them on a napkin.

The numerics half matters too: tiling reorders the accumulation, and you'll prove with tests that for exact arithmetic it's an identity (ragged edges included — the place naive implementations corrupt silently). In floating point, reordering shifts the last ulp — the legitimate cross-kernel divergence you've now met in three phases (3, 4, 6).
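A tiled matmul with ragged-edge handling fits in a dozen lines of numpy. This is a sketch of the idea, not the lab's reference: the signature with a separate `tile_k` and the tile size of 8 are assumptions, and the 37×23×19 shape is read here as M=37, N=23, K=19.

```python
import numpy as np

def tiled_gemm(a, b, tile_m, tile_n, tile_k):
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=np.result_type(a, b))
    for i0 in range(0, m, tile_m):
        i1 = min(i0 + tile_m, m)              # min() clips the ragged edge tiles
        for j0 in range(0, n, tile_n):
            j1 = min(j0 + tile_n, n)
            for p0 in range(0, k, tile_k):
                p1 = min(p0 + tile_k, k)
                # One staged A block times one staged B block, accumulated into C.
                c[i0:i1, j0:j1] += a[i0:i1, p0:p1] @ b[p0:p1, j0:j1]
    return c

rng = np.random.default_rng(0)
a = rng.integers(-5, 6, size=(37, 19))        # integers: arithmetic is exact
b = rng.integers(-5, 6, size=(19, 23))
c = tiled_gemm(a, b, tile_m=8, tile_n=8, tile_k=8)
print(np.array_equal(c, a @ b))

af, bf = a.astype(np.float64), b.astype(np.float64) / 7.0
cf = tiled_gemm(af, bf, 8, 8, 8)
print(np.allclose(cf, af @ bf))
```

The integer case checks the identity exactly; the float case uses `np.allclose` because splitting K into tiles reorders the accumulation.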

Background: the reuse arithmetic

Count slow-memory loads. Naive: each output element C[i,j] streams a K-row of A and a K-column of B → M·N·2K loads — every element of A is loaded N times, every element of B loaded M times. Tiled with (tile_m × tile_n) output tiles: each tile streams its A-rows and B-columns once (staged in fast memory while the tile's tile_m·tile_n·K FLOPs consume them):

tiled loads = M·K · ceil(N/tile_n)  +  K·N · ceil(M/tile_m)

reuse = naive/tiled = 2 / (1/tile_m + 1/tile_n)     ← the HARMONIC MEAN of the tile dims (exact when the tiles divide M and N)

That harmonic mean is the lab's punchline. It says: reuse is governed by the smaller tile dimension (256×16 tiles reuse like ~30, not like their area suggests); square tiles maximize reuse per unit of fast memory (a t×t tile gives reuse t while staging O(t·K) operands); and — the inference-shaped consequence — when M=1 (a single decode token), tile_m is pinned at 1 and reuse caps at 2, no matter how clever the kernel. The weights must stream once per step. That's Phase 0 lab-04's "decode is bandwidth-bound" re-derived from the kernel's side, and it's why decode optimization is about shrinking bytes (Phase 6) and sharing the stream across a batch, never about better GEMM tiling.
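The load counts above translate directly into code. The function names match the lab's file list; the signatures are a guess made for this sketch, and the numbers are the ones quoted in this section.

```python
from math import ceil

def naive_traffic(m, n, k):
    # Every output element streams a K-row of A and a K-column of B.
    return m * n * 2 * k

def tiled_traffic(m, n, k, tile_m, tile_n):
    # A is re-read once per column of tiles, B once per row of tiles.
    return m * k * ceil(n / tile_n) + k * n * ceil(m / tile_m)

def reuse_factor(m, n, k, tile_m, tile_n):
    return naive_traffic(m, n, k) / tiled_traffic(m, n, k, tile_m, tile_n)

square = reuse_factor(1024, 1024, 1024, 128, 128)
skewed = reuse_factor(1024, 1024, 1024, 256, 16)
decode = reuse_factor(1, 4096, 4096, 1, 128)       # M=1: a single decode token

print(naive_traffic(1024, 1024, 1024))             # 2147483648
print(tiled_traffic(1024, 1024, 1024, 128, 128))   # 16777216
print(square, round(skewed, 2), round(decode, 3))  # 128.0 30.12 1.985
```

The decode case uses a hypothetical 4096×4096 weight matrix; whatever `tile_n` you pick, the result stays below 2 because `tile_m` cannot exceed M=1.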

Files

starter.py — tiled_gemm, naive_traffic, tiled_traffic, reuse_factor. Your work.
solution.py — reference.
test_lab.py — equality (divisible, ragged, tile=1), the traffic formulas, the bigger-tiles-less-traffic direction, the harmonic mean, and the decode-shape cap.

Run

LAB_IMPL=starter pytest phase-07-gemm-and-moe-kernels/labs/lab-03-tiled-gemm -q
pytest phase-07-gemm-and-moe-kernels/labs/lab-03-tiled-gemm -q   # reference

What the tests prove

| Test | What it pins |
|---|---|
| `test_tiled_equals_matmul_divisible` / `_ragged` | Tiling is loop reordering, not approximation, including 37×23×19, where every edge tile is partial. Ragged edges are where real kernel bugs live (predication/masking in CUTLASS); your `min()` bounds are their readable form |
| `test_tile_size_one_is_the_naive_algorithm` | The degenerate case anchors the model: tiles of 1 = no reuse = the naive loop |
| `test_traffic_formulas` | The load counts, exactly: 1024³ with 128² tiles moves 16 MB-equivalents instead of 2 GB-equivalents |
| `test_bigger_tiles_mean_less_traffic` | The direction that justifies burning shared memory on bigger tiles |
| `test_reuse_factor_is_the_harmonic_tile_size` | Square 128 → reuse 128; skewed 256×16 → ~30. Shape, not area |
| `test_decode_shape_has_no_reuse_to_harvest` | M=1 → reuse ≤ 2. The GEMM-side proof of decode's bandwidth wall |

Hitchhiker's notes

Why not just make tiles enormous? Fast memory is finite: a GPU SM has ~100–230 KB of shared memory, and a t×t fp16 tile's staging (A-panel + B-panel + accumulator) must fit — which lands real kernels at tiles like 128×128 or 128×256, exactly where your model's curve flattens against the hardware budget. Tile choice is a constrained optimization, and CUTLASS exposes it as template parameters because the optimum moves with dtype, shape, and architecture.
The hierarchy repeats: HBM → shared memory is your model's level, but the same arithmetic recurs for shared memory → registers (warp tiles), and L2 ↔ HBM (threadblock swizzling for L2 reuse). Production GEMMs tile at three levels with the same formula at each. Learn it once, apply it fractally.
Tensor cores change the FLOP rate, not the traffic math. They make the compute side faster, which raises the ridge (Phase 0 lab-04) and makes good tiling more necessary, not less — a tensor-core GEMM that under-tiles just starves faster. This is why Hopper added TMA (bulk async copies HBM→shared): feeding the tiles became the whole game.
Grouped GEMM (the MoE kernel, lab-01/02) is this lab plus one indirection: many small GEMMs (one per expert) whose tiles are scheduled from a single kernel launch so the tile machinery amortizes across experts. moe_align_block_size exists precisely to organize tokens into tile-shaped groups — your lab-01 argsort, upgraded to be tile-aware.

Going further

Add a fast_memory_bytes(tile_m, tile_n, tile_k, dtype_bytes) function and find the best square tile under a 100 KB budget for K=4096 — then compare against the tile shapes in a CUTLASS config or a Triton autotune list. You'll land within a factor of 2 of what the pros chose, from a 5-line model.
Time it for real: your tiled_gemm vs A @ B in numpy is unfair (BLAS is tiled and vectorized), but tiled_gemm with tile 64 vs tile 1 against each other shows the traffic effect even through Python overhead. Measure, then explain the ratio.
Extend the traffic model with the K-dimension split (tile_k, split-K reduction — needed when M and N are both small but K is huge). Notice the merge-partials shape from Phase 4 lab-04 reappearing: split-K GEMM is the same monoid trick, applied to plain sums.
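For the first exercise above, one possible starting point is sketched below. The staging model (A-panel plus B-panel in the operand dtype, plus an fp32 accumulator, all counted against the budget) follows the notes in this lab but is a simplification; `tile_k=32`, the candidate lists, and `best_square_tile` are assumptions introduced for the example. In this model K only sets how many k-steps run, not how many bytes are staged.

```python
def fast_memory_bytes(tile_m, tile_n, tile_k, dtype_bytes, acc_bytes=4):
    a_panel = tile_m * tile_k * dtype_bytes        # staged block of A
    b_panel = tile_k * tile_n * dtype_bytes        # staged block of B
    accumulator = tile_m * tile_n * acc_bytes      # output tile, assumed fp32
    return a_panel + b_panel + accumulator

def best_square_tile(budget_bytes, tile_k, dtype_bytes, candidates):
    fitting = [t for t in candidates
               if fast_memory_bytes(t, t, tile_k, dtype_bytes) <= budget_bytes]
    return max(fitting) if fitting else None

budget = 100 * 1024                                # 100 KB, as in the exercise
any_size = best_square_tile(budget, 32, 2, range(1, 513))
power_of_two = best_square_tile(budget, 32, 2, [16, 32, 64, 128, 256])
print(any_size, power_of_two)                      # 144 128
print(fast_memory_bytes(128, 128, 32, 2))          # 81920
```

Under these assumptions the budget admits a 144×144 tile, and the largest power-of-two tile that fits is 128×128, in the same range as the tile sizes the notes mention.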

References

upstream/csrc/quantization/cutlass_w8a8/ and upstream/cmake/external_projects/ — where CUTLASS enters vLLM; the deep-dive maps the entry points.
NVIDIA, CUTLASS: Efficient GEMM in CUDA (docs) — the three-level tiling hierarchy: https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md
Triton tutorial, Matrix Multiplication — your lab in Triton, with autotuned tiles: https://triton-lang.org/main/getting-started/tutorials/03-matrix-multiplication.html
Williams et al., Roofline (2009) — intensity as an algorithm property: https://dl.acm.org/doi/10.1145/1498765.1498785
Phase 0 lab-04 — the ridge this lab's reuse factor is racing toward; Phase 4 lab-01 — FlashAttention as tiling applied to attention.

Lab 07-04 — Expert Load Balance: the MoE Serving Problem `[CPU-OK]`

Lab-01 built the MoE forward and treated the router's decisions as given. This lab asks the operator's question: what do those decisions cost? A mixture-of-experts model's economics rest on a promise — E experts' worth of capacity for k experts' worth of compute per token — and that promise has fine print: it holds only if tokens spread evenly. Real routers don't spread evenly. You'll build the three numbers that quantify the damage: per-expert loads, the imbalance factor, and — the one that costs money — EP step time, which with experts sharded across devices equals the busiest device's load, not the average. Same total work, one hot expert, >2.5× the step time: you'll prove it in an assert.

Why this lab exists

MoE models (Mixtral, DeepSeek-V3, Qwen-MoE, the frontier generally) are taking over serving fleets, and their performance pathologies are distributional, not computational: nothing crashes, no kernel is slow — the work just lands unevenly and silicon idles in the gaps. An engineer staring at "MoE deployment at 40% of expected throughput" needs exactly the diagnostics you're building: dump the routing histogram, compute imbalance, map experts to devices, find the hot device. The lab's deliberately crafted "hot expert" router (60% of assignments to one expert) is not a strawman — it's the documented failure shape of undertrained gates, domain-shifted traffic (code-heavy prompts lighting up code-ish experts), and repetitive workloads.

The second reason: capacity factors. Training-era MoE systems used fixed per-expert buffers and dropped overflow tokens — fine for training (a dropped token is a slightly noisier gradient), catastrophic for inference (a dropped token is a corrupted generation). Understanding dropped_tokens is understanding why inference MoE never drops — and what it pays instead (dynamic buffers, the full imbalance tax landing on latency). That design fork explains a lot of otherwise-puzzling differences between training and serving MoE stacks.
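
A small sketch of the capacity-factor accounting may help here. It assumes the Switch-Transformer-style definition (per-expert capacity = capacity factor × total assignments / number of experts); the function name and signature are illustrative and may differ from the lab's dropped_tokens.

```python
def dropped_assignments(loads, capacity_factor):
    """Count assignments that overflow a fixed per-expert buffer (training-era behaviour)."""
    total = sum(loads)
    # Assumed capacity rule: the fair share per expert, scaled by the capacity factor.
    capacity = int(capacity_factor * total / len(loads))
    return sum(max(0, load - capacity) for load in loads)

# Hypothetical routing histogram: 5 experts, one of them hot.
loads = [60, 10, 10, 10, 10]
dropped_tight = dropped_assignments(loads, capacity_factor=1.0)  # buffer = 20 per expert
dropped_loose = dropped_assignments(loads, capacity_factor=3.0)  # buffer = 60 per expert
```

With the tight buffer only the hot expert overflows; with a buffer large enough to hold the hot expert nothing is dropped, which is the memory cost an inference stack pays instead of corrupting generations.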

Background: why imbalance is a tax on parallelism

With expert parallelism (EP), experts live on different devices and every step runs all-to-all: tokens ship to their experts' devices, compute happens, results ship back. The step completes when the last device finishes — a barrier. So step time is max(device loads), while useful work is sum(loads). Parallel efficiency is their ratio scaled by device count:

efficiency = sum(loads) / (num_devices × max(device_load))

Perfectly balanced: efficiency 1. One expert carrying 60% of assignments on an 8-device layout: the hot device defines the step while seven others idle most of it — your test_imbalance_burns_parallel_efficiency measures >2.5× step inflation at identical total work. Note what this is, structurally: a straggler problem, the same shape as Phase 3 lab-05's prefill spike (one slow element holds the barrier) and the tail-at-scale phenomenon generally. Distribution problems all rhyme.
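
The barrier arithmetic above can be sketched in a few lines. This is an illustration under stated assumptions, not the lab's reference solution: it uses the naive e % num_devices placement mentioned below, and the helper names device_loads and parallel_efficiency are hypothetical.

```python
def device_loads(loads, num_devices):
    """Sum expert loads per device under round-robin placement (expert e lives on e % num_devices)."""
    per_device = [0] * num_devices
    for e, load in enumerate(loads):
        per_device[e % num_devices] += load
    return per_device

def ep_step_time(loads, num_devices):
    """The step ends when the busiest device finishes."""
    return max(device_loads(loads, num_devices))

def parallel_efficiency(loads, num_devices):
    return sum(loads) / (num_devices * ep_step_time(loads, num_devices))

# Same total work (1000 assignments), two hypothetical histograms over 8 experts.
balanced = [125] * 8
hot = [600, 60, 60, 60, 60, 60, 50, 50]   # one expert carries 60%

inflation = ep_step_time(hot, 8) / ep_step_time(balanced, 8)
```

Running the same histogram with num_devices=4 shows the co-location effect: expert 0 and expert 4 share a device, so their loads add and the step gets longer still.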

The mitigation toolbox (each one is a knob in real systems): auxiliary load-balancing losses at training time (bake balance into the router), expert placement (don't put two historically-hot experts on one device — your e % num_devices round-robin is the naive baseline placement), redundant replicas of hot experts (vLLM's EPLB — expert parallel load balancer — does exactly this), and shared experts (DeepSeek's always-active expert absorbs the common patterns so the routed ones stay balanced).

Files

starter.py — expert_loads, imbalance, ep_step_time, dropped_tokens. Your work.
solution.py — reference.
test_lab.py — counting, the uniform baseline, the hot-expert blowup, max-device step time, the efficiency burn, and capacity-overflow accounting.

Run

LAB_IMPL=starter pytest phase-07-gemm-and-moe-kernels/labs/lab-04-expert-load-balance -q
pytest phase-07-gemm-and-moe-kernels/labs/lab-04-expert-load-balance -q   # reference

What the tests prove

Test	What it pins
`test_loads_count_assignments`	The histogram itself — note it counts (token, expert) assignments, so a token routed to expert 1 twice in top-k counts twice (it really does cost two expert-rows of compute)
`test_uniform_routing_is_nearly_balanced`	The healthy baseline: random routing lands imbalance < 1.3 — your reference point for "what good looks like" on a histogram
`test_hot_expert_blows_up_imbalance`	The pathology: 60% to one expert → imbalance > 3 (~5× its fair share)
`test_ep_step_time_is_the_max_device`	The barrier semantics, including the subtle case: with fewer devices than experts, co-located experts' loads add — placement matters, not just routing
`test_imbalance_burns_parallel_efficiency`	The money assert: identical `sum(loads)`, >2.5× the step time. Throughput lost to distribution, with zero slow code anywhere
`test_capacity_drops_only_the_overflow` / `_cf1`	Capacity-factor mechanics: cf=1.0 with a hot expert drops 68 of 128 assignments; cf high enough drops none. The training-era trade inference refuses

Hitchhiker's notes

Why does top-k routing make this worse than it looks? Each token takes k experts, so a hot expert co-occurs with others — you can't fix a hot expert by moving its tokens without also touching their second choices. Balance is a joint property of the whole routing matrix, which is why post-hoc fixes (placement, replication) are often easier than router surgery.
EPLB in one sentence: measure loads over a window, then replicate hot experts across devices and route their traffic round-robin among replicas — trading memory (extra expert copies) for balance. Find it upstream (vllm/distributed/eplb/); your ep_step_time is the objective function it's minimizing.
The all-to-all itself (not modeled here) adds a second imbalance-sensitive cost: hot experts concentrate network traffic too. On NVLink-rich nodes it's tolerable; across nodes it's why DeepSeek-V3's deployment papers obsess over expert placement. Phase 10 picks this up.
Decode vs prefill see different distributions. A prefill batch routes thousands of tokens (law of large numbers smooths loads); a decode batch routes batch × k assignments — at batch 8, k 2, that's 16 samples over maybe 64 experts: structurally lumpy even with a perfect router. Small-batch MoE decode is imbalanced by arithmetic, not pathology — one more reason MoE wants large serving batches (and why test_uniform_routing_is_nearly_balanced uses 1600 assignments, not 16).
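
The small-batch lumpiness in the last note is easy to see in simulation. This sketch assumes imbalance means max load over mean load, and it simplifies routing to independent uniform draws (it ignores the constraint that a token's k experts are distinct); the helper name is hypothetical.

```python
import numpy as np

def uniform_imbalance(num_assignments, num_experts, rng):
    """Imbalance (max load / mean load) for a perfectly fair random router."""
    experts = rng.integers(0, num_experts, size=num_assignments)
    loads = np.bincount(experts, minlength=num_experts)
    return loads.max() / loads.mean()

rng = np.random.default_rng(0)
decode_like = uniform_imbalance(8 * 2, 64, rng)        # batch 8, k 2: 16 assignments
prefill_like = uniform_imbalance(32768 * 2, 64, rng)   # tens of thousands of assignments
```

With 16 assignments over 64 experts the mean load is 0.25 while the busiest expert holds at least 1, so the imbalance is at least 4 no matter how fair the router is.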

Going further

Implement best_placement(loads, num_devices) — greedy bin-packing (sort descending, assign to lightest device) — and measure how much it recovers vs round-robin on the skewed router. Then add replication: one extra copy of the hottest expert, traffic split — compare. You've now built EPLB's two levers.
Sweep batch size from 4 to 4096 with the uniform router and plot imbalance: watch the small-batch lumpiness decay as 1/√batch. This curve is why MoE throughput benchmarks at tiny batch are misleading.
Add a shared-expert column (every token also visits expert S, DeepSeek-style) and check what it does to imbalance among the routed experts when the router is hot. (Hint: it doesn't fix routing — it shrinks the stakes per routed token.)

References

Fedus et al., Switch Transformers (2021) — capacity factor, token dropping, aux losses: the training-era trade space: https://arxiv.org/abs/2101.03961
DeepSeek-AI, DeepSeek-V3 Technical Report (2024) — shared experts, auxiliary-loss-free balancing, and deployment-grade EP placement: https://arxiv.org/abs/2412.19437
upstream/vllm/distributed/eplb/ — vLLM's expert-parallel load balancer; your ep_step_time is its loss function.
upstream/vllm/model_executor/layers/fused_moe/ — where loads meet kernels (moe_align_block_size and friends, lab-01's mapping).
Dean & Barroso, The Tail at Scale — the straggler pattern this is an instance of: https://research.google/pubs/the-tail-at-scale/

Phase 07 — Exercises: GEMM & MoE Kernels

Warm-up (explain)

What is a GEMM and why is "most of a transformer is GEMMs" true?
In one sentence each: router, top-k, expert, combine.
Why does MoE give "huge capacity, cheap compute"? What's actually cheap?

Core (trace the code)

List the 6 steps of an MoE forward and which fused_moe/ file implements each.
Why do permute + un-permute exist (moe_align_block_size.py)? What goes wrong without them?
In MixtralMoE (mixtral.py:77), how few lines is the MoE block once FusedMoE exists, and why?
What does a fused MoE kernel fuse, and which Phase-5 problem does that address?

Build (your lab)

In lab-01, why must moe_forward_grouped use np.add.at (not out[toks] += ...)? (Hint: a token can route to two experts.)
Add expert load metrics: count tokens per expert; construct an input that overloads one expert and explain the throughput impact (load imbalance).
Add a shared expert (always-on, added to every token, DeepSeek-style) and keep grouped == reference.

Design (staff-level)

EP vs TP for a 256-expert MoE on 8 GPUs: what does each shard, what comms does each add, and when would you combine them?
Your MoE serving is throughput-bound and a profile shows the grouped GEMM at 45% of step time but with low tensor-core utilization. Name two likely causes and fixes.
Expert load is skewed (a few hot experts). What mitigations exist (capacity, aux loss at train time, routing tweaks), and which are available at serving time?

Self-grading

4–7 and 11–13 are interview-grade. Could you draw the MoE forward and name the files? If not, re-read 01-deep-dive.md.

Phase 07 — Interview Questions: GEMM & MoE Kernels

Q1. What is MoE and why is it attractive?

Model answer

A Mixture-of-Experts layer replaces the dense MLP with many expert MLPs and a router that sends each token to its top-k experts (e.g. 2 of 8 in Mixtral). So the model has huge total parameters (capacity/quality) but activates only a few experts per token (cheap compute). DeepSeek-V3 has 256 routed experts with 8 active per token. The cost moves from FLOPs to routing tokens to the right experts and holding all those weights in memory.

Q2. Walk through the MoE forward.

Model answer

router (small linear) → top-k expert selection + softmax weights → permute tokens so each expert's tokens are contiguous → grouped GEMM (run each expert's MLP on its block) → un-permute back to token order → weighted combine of each token's k expert outputs. Permute/un-permute exist so the GPU does a few big matmuls instead of many scattered tiny ones. (fused_moe/fused_moe.py, moe_align_block_size.py, layer.py.)
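
The six steps in this answer can be sketched in numpy. This is a toy under loud assumptions: each expert is a single [D, D] matrix rather than a gated MLP, the softmax is taken over the selected top-k logits, and none of the names below are vLLM's.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(x, w_gate, experts, k):
    """x: [T, D] tokens, w_gate: [D, E] router, experts: list of E [D, D] matrices."""
    T = x.shape[0]
    logits = x @ w_gate                                    # 1. router
    topk_ids = np.argsort(-logits, axis=1)[:, :k]          # 2. top-k expert ids
    topk_w = softmax(np.take_along_axis(logits, topk_ids, axis=1), axis=1)

    flat_expert = topk_ids.reshape(-1)                     # one entry per (token, expert)
    flat_token = np.repeat(np.arange(T), k)
    flat_w = topk_w.reshape(-1)

    order = np.argsort(flat_expert, kind="stable")         # 3. permute: group by expert
    sorted_expert = flat_expert[order]

    out = np.zeros_like(x)
    for e, w_e in enumerate(experts):                      # 4. grouped GEMM, one block per expert
        lo = np.searchsorted(sorted_expert, e, side="left")
        hi = np.searchsorted(sorted_expert, e, side="right")
        rows = order[lo:hi]
        if rows.size == 0:
            continue
        toks = flat_token[rows]
        y = x[toks] @ w_e                                  # one contiguous matmul for this expert
        # 5 + 6. un-permute and weighted combine: scatter-add back into token order
        np.add.at(out, toks, flat_w[rows][:, None] * y)
    return out
```

The per-expert Python loop stands in for the grouped GEMM; a fused kernel would walk the same sorted map inside one launch.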

Q3. Why fused MoE kernels?

Model answer

Naive MoE is a gather + many small per-expert GEMMs + a scatter — launch-bound and memory-bound. A fused kernel does routing/grouped-GEMM/combine in one (or few) kernels indexed by a sorted token→expert map, keeping tensor cores busy and removing launch overhead. It's decisive for MoE throughput (the profile in lab-02 shows the grouped GEMM dominating).

Q4. Expert parallelism vs tensor parallelism for MoE?

Model answer

EP places whole experts on different GPUs; tokens are shipped to their expert's GPU via all-to-all and back. It scales expert count cheaply but adds communication and load-balancing risk (a hot expert bottlenecks its GPU). TP shards each expert's weights across GPUs (per-layer all-reduce). Real deployments often use EP for the MoE layers and DP/TP for attention, since the two have different parallelism sweet spots.

Q5. What sets which GEMM kernel runs?

Model answer

The dtype/quant format (Phase 6) and hardware. CUTLASS/TRTLLM-GEN/CuTeDSL provide kernels specialized per precision (fp16/fp8/int4) and tiled to the GPU's memory hierarchy; a quantized weight needs the matching kernel (e.g. Marlin for INT4, scaled-mm for FP8). A mismatch produces wrong results or a slow kernel, which is why quant format and kernel are chosen together.

Rapid-fire

MoE step order? router → top-k → permute → grouped GEMM → un-permute → combine.
Why permute? contiguous per-expert work → big matmuls, not scattered tiny ones.
EP shards? whole experts (all-to-all). TP shards? each expert's weights.
Router cost? tiny; the experts (GEMM) dominate.
Famous open MoEs? Mixtral, DeepSeek-V3, Qwen-MoE, GPT-OSS.

Phase 07 — Cheatsheet: GEMM & MoE Kernels

The one-liner

GEMM = the FLOPs (every linear layer). MoE = many expert MLPs + a router; each token uses top-k experts → huge capacity, cheap compute. The work becomes a routed, grouped GEMM.

GEMM kernels

cuBLAS (baseline) · CUTLASS (quantized/composable) · TRTLLM-GEN / CuTeDSL (generated, per-GPU). Specialized per dtype (fp16/fp8/int4); the quant format (Phase 6) picks the kernel.

MoE forward (6 steps)

router (gate linear) → top-k experts + softmax weights → permute (group tokens by expert) → grouped GEMM (each expert on its block) → un-permute → combine (weighted sum). Permute/un-permute = make per-expert work contiguous (big matmuls, not scattered tiny ones).

Fused MoE

One/few kernels do routing+grouped-GEMM+combine → removes gather/scatter launch + memory overhead. The experts (grouped GEMM) dominate step time; the router is cheap.

Parallelism

EP (expert parallel): whole experts on different GPUs; all-to-all ships tokens; watch load balance (hot experts).
TP: shard each expert's weights across GPUs (per-layer all-reduce).
Often EP for MoE + DP/TP for attention.

Key upstream

fused_moe/layer.py:73 FusedMoE · :1306 forward
fused_moe/fused_moe.py:295 fused_moe_kernel · :1587 fused_experts · :1664 fused_experts_impl
fused_moe/moe_align_block_size.py · moe_permute_unpermute.py · all2all_utils.py (EP)
models/mixtral.py:77 MixtralMoE · models/deepseek_v2.py (shared experts + MLA)

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

Phase 08 — The Hitchhiker's Guide to Speculative Decoding

Don't Panic

Decode is slow because each token needs a full run of the big model (Phase 0: one expensive "haul the books" memory read per token). Speculative decoding is a clever cheat:

A cheap "drafter" guesses the next several tokens. The big model then checks all of them in a single run and keeps the longest correct prefix. If the guesses are good, you get several tokens for the price of one big-model run — and, remarkably, the output is exactly what the big model would have produced anyway.

Think of it like a fast typist proposing the rest of your sentence and you (the careful editor) glancing at it: if the first few words are right, you accept them instantly and only start typing yourself where the guess goes wrong. You did far less typing, and the final sentence is identical to what you'd have written.

Big model alone:   [run]→t1  [run]→t2  [run]→t3  [run]→t4      4 expensive runs, 4 tokens
Speculative:       drafter guesses [t1,t2,t3,t4]  →  big model verifies all in ONE run
                   accepts t1,t2,t3 (t4 wrong) + fixes it      1 expensive run, ~4 tokens

Step 1: Why checking many tokens costs ~one run

This is the magic that makes it work. Remember from Phase 0 that prefill processes many tokens in one run cheaply (it's compute-bound, the math units stay busy). Verification is just a mini-prefill: feed the big model the context plus the drafted tokens, and in one run it produces, for each position, what it would have predicted there. Now compare:

context = "The capital of France is"
draft   = [" Paris", ".", " The"]            (the cheap drafter's guess)

one big-model run over context+draft gives the big model's own prediction at each spot:
   after "...is"        big model says " Paris"   == draft[0] ✓ accept
   after "...is Paris"  big model says "."        == draft[1] ✓ accept
   after "...is Paris." big model says " It"      != draft[2] ✗ stop, use " It"
result: accepted " Paris", ".", then the correction " It"   →  3 tokens from 1 run

So one expensive run yielded 3 tokens instead of 1. The acceptance rate (the fraction of guesses the big model agrees with) decides the speedup.
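
The comparison in the walkthrough can be written as one vectorized check over the per-position predictions that a single run returns. This is a sketch with made-up token ids; accept_from_one_pass is a hypothetical name, and target_preds is assumed to hold one more prediction than there are draft tokens (the position after the last draft).

```python
import numpy as np

def accept_from_one_pass(draft, target_preds):
    """draft: [k] drafted ids; target_preds: [k+1] target choices, one per position."""
    k = len(draft)
    matches = draft == target_preds[:k]
    # cumprod turns everything after the first mismatch into 0, so the sum is the prefix length.
    n_accepted = int(np.cumprod(matches).sum())
    # The target's own choice at the first unaccepted position is always emitted as well.
    emitted = np.concatenate([draft[:n_accepted], target_preds[n_accepted:n_accepted + 1]])
    return emitted, n_accepted

draft = np.array([11, 12, 13])             # stand-ins for " Paris", ".", " The"
target_preds = np.array([11, 12, 99, 14])  # target disagrees at the third position
emitted, n_accepted = accept_from_one_pass(draft, target_preds)
```

Two drafts are accepted and the target's own token replaces the third, so one run emits three tokens.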

Step 2: Where the guesses come from (the proposers)

Different "drafters," from free to fancy:

n-gram / prompt-lookup (free, no model): if the recent text repeats something seen earlier (a name, a code snippet, a quoted phrase), just copy what followed it last time. Shockingly effective for repetitive content (code, structured data, summarization).
EAGLE (a tiny trained head): a small network trained to predict the big model's next hidden states, giving high-quality guesses cheaply. One of the best methods today.
Medusa (extra prediction heads), DFlash, suffix decoding, a small draft model: variations on "produce cheap guesses."

vLLM supports several (vllm/v1/spec_decode/). You'll build the n-gram proposer yourself in lab-01 because it needs no model and lays bare the whole mechanism.

Step 3: Why the output doesn't change (the honest part)

A natural worry: "if a cheap drafter is involved, is the output worse?" No — and this is the beautiful guarantee. For greedy decoding it's obvious: you only accept a drafted token if it equals what the big model would have picked anyway (its argmax); the instant a guess disagrees, you throw it away and use the big model's choice. So the accepted sequence is identical to plain greedy decoding.

For random sampling, there's a slightly cleverer rule, rejection sampling, that accepts or rejects drafted tokens with just the right probabilities so the final distribution is provably exactly the big model's distribution. Either way: speculative decoding changes the speed, never the output. (Same "optimization ≠ behavior change" theme as the KV cache and chunked prefill.)

Step 4: When it helps and when it hurts (the tradeoff to reason about)

Verification isn't totally free — drafting costs a little, and the big model does slightly more work per run (it processes the draft tokens too). So:

speculative is a win when:   accepted_tokens_per_run  >  cost of (drafting + extra verify work)

High acceptance (repetitive text, a good EAGLE head) → big win.
Low acceptance (creative text, weak drafter) → you wasted the draft; can be a loss.
Small batch / latency-bound → shines (the GPU has spare capacity to verify).
Large batch / already GPU-saturated → less benefit (no spare capacity; verifying drafts competes with real work).

The two numbers that decide everything are acceptance rate and draft cost, weighed against each other. You'll measure acceptance rate and effective tokens-per-run in lab-01.

Step 5: How it rides the same scheduler (no special case)

Recall Phase 3's mantra: a request is just num_computed_tokens racing num_tokens. Speculative decoding adds the draft tokens into that gap via num_tokens_with_spec, and reserves KV slots for them with num_lookahead_tokens. The scheduler doesn't know it's spec decode — it just schedules "some tokens to compute," exactly as the Phase 3 comment promised. Acceptance/rejection is sorted out afterward in update_from_output. Elegant: one general mechanism absorbs a whole feature.

The invariants to memorize

Drafter guesses k tokens cheaply; the big model verifies all in one run; keep the longest correct prefix + one correction.
Verification ≈ one prefill run (compute-bound) → checking k tokens costs ~one decode run.
Output is identical to normal decoding (greedy: accept only the argmax; sampling: rejection sampling preserves the distribution).
Speedup ∝ acceptance rate; it can lose at low acceptance or when the GPU is already full.
It rides the normal scheduler via num_tokens_with_spec + num_lookahead_tokens.

What you'll do

Read: 01-deep-dive.md — the n-gram proposer, EAGLE, the rejection sampler, and the scheduler hooks, line-anchored.
Build: 02-mini-build.md — an n-gram proposer + greedy verifier; measure acceptance and tokens-per-run.
Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
- lab-01-ngram-spec-decode [CPU-OK] — build draft+verify; prove output == baseline and measure the speedup on repetitive vs random text.
- lab-02-eagle-on-real-vllm [GPU-OPT] — enable EAGLE on real vLLM; measure ITL + acceptance (captured).
- lab-03-rejection-sampling [CPU-OK] — the losslessness theorem for sampling: accept with min(1, p/q), resample the residual; verify empirically that outputs are distributed exactly as p.
- lab-04-speedup-model [CPU-OK] — the (α, c, k) economics: expected tokens/cycle, speedup, optimal k — including the regime where the right answer is "turn it off".
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.

Phase 08 — Deep Dive: speculative decoding in real vLLM

Paths relative to upstream/ at v0.22.1 @ 0decac0.

vllm/v1/spec_decode/ngram_proposer.py    n-gram / prompt-lookup proposer (no model — read first)
vllm/v1/spec_decode/eagle.py             EAGLE proposer (a tiny trained head)
vllm/v1/spec_decode/{medusa,dflash,suffix_decoding,draft_model}.py   other proposers
vllm/v1/spec_decode/metadata.py          spec metadata passed around
vllm/v1/sample/rejection_sampler.py      verification that preserves the distribution
vllm/v1/core/sched/scheduler.py          spec_token_ids / num_lookahead_tokens (the hook)

1. The simplest proposer: n-gram (`ngram_proposer.py`)

class NgramProposer (:12), propose (:131). The idea (prompt-lookup): take the last n tokens of the sequence, search earlier in the same sequence for a previous occurrence, and if found, propose the k tokens that followed it last time. No model, no weights — pure string matching, yet it crushes repetitive workloads (code, JSON, summarization where the answer quotes the source). Read propose and notice it returns up to k candidate token ids. Your lab-01 ngram_propose is this exact algorithm.
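
A minimal sketch of the prompt-lookup idea, following the description above rather than the upstream implementation (which is not reproduced here). It assumes the context is a plain Python list of token ids and picks the most recent earlier match, as the mini-build specifies.

```python
def ngram_propose(context, n, k):
    """Propose up to k tokens that followed the most recent earlier occurrence of context[-n:]."""
    if len(context) <= n:
        return []
    suffix = context[-n:]
    # Scan start positions right to left, skipping the suffix's own position.
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == suffix:
            return context[start + n:start + n + k]
    return []

proposal = ngram_propose([1, 2, 3, 4, 1, 2], n=2, k=3)   # [1, 2] was last followed by 3, 4, 1
no_match = ngram_propose([1, 2, 3, 4], n=2, k=3)         # [3, 4] never appeared before
```

The proposal may be shorter than k when the match sits near the end of the context, which is why the contract says "up to k".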

2. A trained proposer: EAGLE (`eagle.py`)

class EagleProposer(SpecDecodeBaseProposer) (:10). EAGLE runs a small network that predicts the target model's next hidden states (not just tokens), which it then turns into high-quality draft tokens — far better acceptance than n-gram on general text, for a small extra cost. It shares the target's KV/hidden states (note extract_hidden_states.py). Medusa (medusa.py), DFlash (dflash.py), suffix decoding, and a plain small draft model are siblings — all implement "produce k cheap, plausible next tokens." They plug into the same verify path.

3. Verification that preserves the distribution: the rejection sampler

vllm/v1/sample/rejection_sampler.py: class RejectionSampler (:37), forward (:87), rejection_sample (:392). For greedy it's trivial (accept a draft token iff it equals the target's argmax). For sampling, rejection_sample implements the speculative-sampling rule: accept draft token i with probability min(1, p_target(i) / p_draft(i)); on rejection, resample from the adjusted distribution normalize(max(0, p_target − p_draft)). The math guarantees the accepted tokens are distributed exactly as if the target had sampled directly — the proof of "speed, not behavior." Skim the function and find the accept test and the resample-on-reject branch.
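
The accept test and the resample-on-reject branch can be sketched for a single position. This is a toy over explicit probability vectors, not the batched upstream kernel; the function name and the three-token distributions are assumptions for illustration.

```python
import numpy as np

def speculative_sample(p, q, rng):
    """One position: draft x ~ q, accept with min(1, p[x]/q[x]), else resample from the residual."""
    x = rng.choice(len(q), p=q)                  # the drafter's guess
    if rng.random() < min(1.0, p[x] / q[x]):     # the accept test
        return int(x)
    residual = np.maximum(p - q, 0.0)            # probability mass the drafter under-covers
    return int(rng.choice(len(p), p=residual / residual.sum()))

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])    # target distribution (made up)
q = np.full(3, 1 / 3)            # a clueless uniform drafter
samples = [speculative_sample(p, q, rng) for _ in range(20000)]
freq = np.bincount(samples, minlength=3) / len(samples)
```

Comparing freq against p shows the emitted tokens following the target distribution even though every draft came from q.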

4. How it rides the scheduler (the elegant part)

Open vllm/v1/core/sched/scheduler.py and search for spec_token_ids and num_lookahead_tokens (around the running-request loop, ~:447/:502). What you'll see:

num_lookahead_tokens is passed to allocate_slots so KV space is reserved for the draft tokens (Phase 2).
a request's num_tokens_with_spec (request.py:243) includes the draft tokens, so the same num_new_tokens = num_tokens_with_spec − num_computed_tokens clamp (Phase 3) naturally schedules them to be verified.
after the model runs, update_from_output consults the rejection sampler's result, keeps the accepted prefix, and rolls back the rest (un-computes rejected tokens' KV).

So spec decode is not a special path in the scheduler — it's "a few extra tokens in the gap," exactly as Phase 3's top-of-function comment promised. That's the design lesson: a good abstraction ("close the num_computed→num_tokens gap") absorbs a whole feature for free.
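
The bookkeeping described in this section can be modelled with a toy request. The field names echo the ones quoted above, but ToyRequest, schedule, and the token_budget argument are hypothetical stand-ins, not vLLM's classes or signatures.

```python
from dataclasses import dataclass, field

@dataclass
class ToyRequest:
    token_ids: list                                      # prompt plus tokens emitted so far
    spec_token_ids: list = field(default_factory=list)   # drafts waiting to be verified
    num_computed_tokens: int = 0

    @property
    def num_tokens_with_spec(self):
        return len(self.token_ids) + len(self.spec_token_ids)

def schedule(req, token_budget):
    """The same gap-closing clamp: drafts are just extra tokens in the gap."""
    return min(req.num_tokens_with_spec - req.num_computed_tokens, token_budget)

def update_from_output(req, num_scheduled, emitted):
    """emitted = accepted drafts followed by the one token the target produced itself."""
    num_rejected = len(req.spec_token_ids) - (len(emitted) - 1)
    req.num_computed_tokens += num_scheduled - num_rejected   # roll back rejected drafts
    req.token_ids.extend(emitted)
    req.spec_token_ids = []

req = ToyRequest(token_ids=list(range(10)), num_computed_tokens=9)  # mid-decode request
req.spec_token_ids = [100, 101, 102]                                # three drafts arrive
n = schedule(req, token_budget=64)                                  # last real token + 3 drafts
update_from_output(req, n, emitted=[100, 101, 7])                   # 2 accepted, 1 correction
```

After the update the request is back in the ordinary decode state (one uncomputed token), so the next step needs no special handling.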

5. Metrics

spec_decode/metrics.py tracks acceptance rate and accepted-tokens-per-step — the numbers that tell you whether spec decode is paying off (Step 4 of the guide). In production you watch these to decide whether to keep it on for a given workload.

Reading checklist

NgramProposer.propose — how does it find a candidate, and what does it return?
EAGLE — what does it predict that makes its drafts good (hidden states, not just tokens)?
rejection_sample — find the accept test and the resample-on-reject; why does it preserve the distribution?
In scheduler.py, how do num_lookahead_tokens and num_tokens_with_spec make spec decode ride the normal schedule?
What does the metrics module measure, and why is acceptance rate the deciding number?

Now build it: 02-mini-build.md, then the labs.

Phase 08 — Mini-Build: n-gram draft + greedy verify

You'll build the whole speculative loop with a free drafter (n-gram / prompt-lookup) and a greedy verifier, then measure the two numbers that decide everything: acceptance rate and tokens per big-model run. No GPU, no weights — just the mechanism.

The task (lab-01)

Model the big model as a deterministic target(context) -> token (so greedy is reproducible). Implement:

ngram_propose(context, n, k) → up to k proposed tokens: find the most recent earlier occurrence of context[-n:], propose the k tokens that followed it last time ([] if no match).
verify_greedy(context, proposed, target) → (accepted, n_proposed_accepted): accept proposed[i] iff it equals target(context + accepted_so_far); stop at the first mismatch; then append target(context + accepted_so_far), the big model's own choice: the correction after a mismatch, or the bonus token if every proposal was accepted. So you always emit at least 1 token per run.
run_speculative(context, target, n, k, num_tokens) → generate ≥num_tokens, returning the produced tokens plus counts: total proposed, total accepted, number of big-model runs.
run_baseline(context, target, num_tokens) → plain greedy: one token per run.

What you'll prove (the two headline properties)

Identical output: run_speculative(...) == run_baseline(...) token-for-token. (You only ever accept the target's own greedy choice.) This is the correctness guarantee.
Fewer runs when guesses are good: on a periodic sequence the n-gram drafter nails the pattern, so tokens_per_run = total_tokens / num_runs ≫ 1; on a random target it's ≈ 1 (no speedup). That's acceptance rate in action.

Definition of done

pytest phase-08-speculative-decoding/labs -q

Map to the real engine

your code	real vLLM
`ngram_propose`	`NgramProposer.propose` (`ngram_proposer.py:131`)
`verify_greedy`	greedy path of `RejectionSampler` (`rejection_sampler.py:87`)
(sampling version)	`rejection_sample` (`rejection_sampler.py:392`)
counts / acceptance	`spec_decode/metrics.py`
"runs" = scheduler steps	`num_lookahead_tokens` + `num_tokens_with_spec` (Phase 3)

Phase 08 Labs — Speculative Decoding

Four labs on the art of spending idle FLOPs to buy latency. The arc: build the draft→verify machine with a free drafter (lab-01), prove the losslessness theorem for the sampled case (lab-03), price the trade with the expected-speedup model — including when to turn it off (lab-04), then measure the state of the art (EAGLE) on real silicon and reconcile every number against the models you built (lab-02).

Recommended order: 01 → 03 → 04 → 02. (Directory numbers predate labs 03–04; the recommended order runs mechanism, theorem, economics, measurement.) CPU labs follow the standard contract: starter.py (your work), solution.py (reference), test_lab.py (the spec); default runs the solution, LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-08-speculative-decoding/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-08-speculative-decoding/labs/lab-01-ngram-spec-decode -q

Labs

lab-01-ngram-spec-decode `[CPU-OK]`

The whole machine with the simplest drafter: n-gram prompt-lookup proposes, a greedy verifier accepts the leading run, corrections and bonus tokens keep progress ≥ 1 token/cycle. Proven token-identical to baseline; speedup measured as a property of the text (dramatic on repetitive, zero on random). Skills: the invariant verify loop; tokens-per-run as THE metric; graceful degradation; the evolving-context off-by-one.

lab-02-eagle-on-real-vllm `[GPU-OPT]`

The integration test: EAGLE (a one-layer head reading the target's hidden states) on Llama-3-8B — ITL 18.2 → 9.6 ms, acceptance 2.8/5 — reconciled number-by-number against labs 01/03/04, including the two honest qualifications (acceptance is workload-dependent; the win fades at saturated batch). Annotated capture included. Skills: predict-then-measure; the three acceptance metrics and their denominators; spec decode as a latency tool funded by spare compute.

lab-03-rejection-sampling `[CPU-OK]`

The theorem: accept draft x with min(1, p[x]/q[x]), else resample from normalize(max(p − q, 0)) — and the output is distributed exactly as the target, for any drafter. Verified empirically (200k draws through a clueless uniform drafter land on the target to 0.005), plus the closed form α = Σ min(p, q) and the adversarial limits. Skills: the residual construction; distributional testing with calibrated tolerances; α as distribution overlap.
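
The closed form α = Σ min(p, q) can be cross-checked against the expectation it comes from. A short sketch, with hypothetical function names and made-up distributions:

```python
import numpy as np

def acceptance_rate(p, q):
    """Closed form: the overlap between the target p and the drafter q."""
    return float(np.minimum(p, q).sum())

def acceptance_rate_by_expectation(p, q):
    """E over x ~ q of min(1, p[x]/q[x]), summed only where the drafter can propose."""
    mask = q > 0
    return float((q[mask] * np.minimum(1.0, p[mask] / q[mask])).sum())

p = np.array([0.6, 0.3, 0.1])
q = np.full(3, 1 / 3)
alpha = acceptance_rate(p, q)
```

The two functions agree because q[x] · min(1, p[x]/q[x]) = min(q[x], p[x]) term by term. The adversarial limits fall out directly: identical distributions give α = 1, disjoint ones give α = 0.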

lab-04-speedup-model `[CPU-OK]`

The economics in three functions: E[tokens/cycle] = (1−α^(k+1))/(1−α), speedup = E[tokens/cycle] / (k·c + 1), and optimal_k, which is sometimes zero (a mediocre drafter at real cost loses to no speculation, and the model says so). Validated against simulation to 1%; EAGLE's published numbers drop out of the formula. Skills: the (α, c, k) economy; diminishing returns; why free drafters can't lose and saturated GPUs can't win.
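
The three functions can be sketched directly from the formulas in this paragraph. Assumptions: c is the cost of one draft step relative to one target pass, and the search cap k_max is a made-up parameter; the lab's own signatures may differ.

```python
def expected_tokens_per_cycle(alpha, k):
    """(1 - alpha^(k+1)) / (1 - alpha); equals k + 1 in the alpha = 1 limit."""
    if alpha == 1.0:
        return k + 1.0
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha, c, k):
    """Tokens per cycle divided by the cycle's cost in target-pass units (k drafts + 1 verify)."""
    return expected_tokens_per_cycle(alpha, k) / (k * c + 1)

def optimal_k(alpha, c, k_max=16):
    """Best k in [0, k_max]; k = 0 means no speculation (speedup exactly 1)."""
    return max(range(k_max + 1), key=lambda k: speedup(alpha, c, k))

weak_and_costly = optimal_k(alpha=0.3, c=0.5)    # mediocre drafter at real cost
good_and_cheap = optimal_k(alpha=0.8, c=0.05)    # strong drafter, cheap drafts
free_drafter = optimal_k(alpha=0.5, c=0.0)       # n-gram style: drafts cost nothing
```

The first case returns 0 (turn it off), the second a positive k with speedup above 1, and the free drafter runs to the cap because each extra guess can only add expected tokens.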

What you can do after this phase

Explain why speculative decoding is lossless — separately for greedy (trivial) and sampled (the residual theorem) — and test the claim distributionally; evaluate any drafter from two measured numbers (α on your traffic, c from a profile) before deploying it; choose num_speculative_tokens from arithmetic; and reconcile vLLM's spec-decode metrics with first principles. Phase 9 broadens sampling itself; the verify machinery you now own reappears wherever one batched pass scores many candidates.

Lab 08-01 — n-gram Draft + Greedy Verify `[CPU-OK]`

Speculative decoding starts from an indignity: the most powerful models on earth spend most of their decode steps predicting tokens a text search could have guessed — the closing half of a quoted phrase, the rest of a repeated identifier, boilerplate the prompt already contains. This lab builds the whole draft→verify mechanism end to end using exactly that text search as the drafter (n-gram "prompt lookup" — vLLM's method="ngram", the only drafter that needs no second model and costs nothing), and establishes the two facts every speculative method lives by: the output is token-for-token identical to plain decoding, and the speedup is entirely a function of how guessable the text is — dramatic on repetitive sequences, exactly zero on random ones. No GPU, no weights; the mechanism in its purest form.

Why this lab exists

Every speculative method — n-gram, EAGLE, Medusa, draft models, DFlash — is the same machine with a different drafter plugged in: propose k tokens cheaply, verify them with one target pass, keep the leading run that matches, always emit at least one token. Building that machine once, with the simplest possible drafter, separates the invariant structure (the verify loop, the correction token, the bonus token, the ≥1-token progress guarantee) from the pluggable part (where proposals come from). After this lab, every spec-decode paper you read is "lab-01 with a fancier propose," and every acceptance metric in vLLM's logs maps to a counter you maintained by hand.

The n-gram drafter is also worth knowing in its own right, not as a toy: it ships in production vLLM, it's free (no extra model, no GPU memory, no draft forward), and on the workloads where it shines — RAG (the answer quotes the context), code editing (the new code repeats the old), structured output — it's embarrassingly competitive with learned drafters. Free and sometimes great is a good tool to know exists; lab-04's economics make "free" precise (c = 0 ⇒ speculation can never lose).

Background: the asymmetry being arbitraged

Phase 0 lab-04: a decode step is bandwidth-bound — the GPU streams all weights to emit one token, using ~1% of its compute. But verifying k+1 candidate positions in one forward pass costs barely more than that single step (it's prefill-shaped work riding the same weight stream). So the target model can check k guesses for the price of making one. Speculation arbitrages this: any source of decent guesses converts idle compute into accepted tokens. The drafter here is the cheapest source imaginable — "find the last time the current n-gram appeared, propose what followed it" — your ngram_propose, a backwards string scan.
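The backwards scan described above fits in a few lines. This is a minimal sketch under assumptions, not the lab's reference: the name `ngram_propose_sketch`, its signature, and the defaults `n=3`, `k=5` are hypothetical, and tokens are plain integers.

```python
from typing import List


def ngram_propose_sketch(tokens: List[int], n: int = 3, k: int = 5) -> List[int]:
    """Copy up to k tokens that followed the most recent EARLIER occurrence
    of the trailing n-gram. Returns [] when there is no earlier match."""
    if len(tokens) <= n:
        return []
    pattern = tokens[-n:]
    # Latest-first scan. The largest start is len - n - 1, so the trailing
    # n-gram can never match itself and the continuation is never empty.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == pattern:
            return tokens[start + n:start + n + k]
    return []


periodic = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2]
print(ngram_propose_sketch(periodic, n=2, k=3))          # continuation of the last [1, 2]
print(ngram_propose_sketch([5, 6, 7, 8, 9], n=2, k=3))   # no repeat, so nothing proposed
```

An empty proposal is a valid result: the cycle then degenerates to one ordinary decode step.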

The verify rule (verify_greedy) is the part with invariants worth memorizing:

Accept proposal p_i iff it equals the target's own greedy choice given everything before it (computed left to right, each acceptance extending the context).
On the first mismatch, append the target's choice — the correction — and stop. That's why a cycle always emits ≥ 1 token: failure mode is baseline speed, never stall.
If all k accepted, the verify pass has already computed the distribution at the position after them — append that bonus token. k accepted = k+1 emitted, free.
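The three rules above can be sketched as one loop. This is an illustration under assumptions: `verify_greedy_sketch` and the `target_next` callable (context in, greedy token out) are hypothetical stand-ins, and the target is called once per position for readability, whereas a real engine gets all k+1 choices from a single batched forward pass.

```python
from typing import Callable, List, Sequence, Tuple


def verify_greedy_sketch(
    context: Sequence[int],
    proposals: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> Tuple[List[int], int]:
    """Return (emitted_tokens, num_accepted). Always emits at least one token."""
    emitted: List[int] = []
    accepted = 0
    for proposal in proposals:
        # Compare against the EVOLVING context: original context + tokens kept so far.
        choice = target_next(list(context) + emitted)
        emitted.append(choice)            # the target's own token either way
        if choice != proposal:
            return emitted, accepted      # first mismatch: that token is the correction
        accepted += 1
    # Every proposal matched: the position after them yields the bonus token.
    emitted.append(target_next(list(context) + emitted))
    return emitted, accepted


def toy_target(ctx: List[int]) -> int:
    """Toy greedy target that counts modulo 4 (an assumption for the demo)."""
    return (ctx[-1] + 1) % 4


print(verify_greedy_sketch([0, 1, 2], [3, 0, 1], toy_target))  # all accepted + bonus
print(verify_greedy_sketch([0, 1, 2], [3, 9, 1], toy_target))  # one accepted + correction
```

Because only the target's own choices are ever appended, the emitted sequence cannot differ from plain greedy decoding, whatever the proposals were.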

Files

starter.py — ngram_propose, verify_greedy, run_speculative, run_baseline. Your work.
solution.py — reference.
test_lab.py — output identity, the periodic-text speedup, the random-text non-speedup.

Run

LAB_IMPL=starter pytest phase-08-speculative-decoding/labs/lab-01-ngram-spec-decode -q
pytest phase-08-speculative-decoding/labs/lab-01-ngram-spec-decode -q   # reference

What to implement

Per 02-mini-build.md. Watch the two classic off-by-ones: the n-gram scan must find an earlier occurrence (excluding the pattern's own position at the very end — propose from yourself and you'll propose nothing useful forever), and the verify must compare proposal i against the target evaluated on context + accepted[:i] — against the evolving context, not the original one (compare against the frozen context and proposals 2+ are checked against the wrong distribution; outputs diverge from baseline and the identity test catches it).

What the tests prove

Test	What it pins
output == baseline	The lossless guarantee, greedy form: you only ever keep the target's own choices, so the sequence cannot differ. (The sampled-temperature generalization is lab-03's theorem.)
periodic text → `runs ≪ num_tokens`	High acceptance on guessable text: many tokens per target pass — the entire value proposition, measured by your own counters
random text → `runs == num_tokens`	Zero acceptance → exactly baseline-many target runs. Speculation's failure mode is no speedup, not wrong output — the graceful-degradation property that makes it deployable

Hitchhiker's notes

Tokens-per-run is THE metric. (accepted + runs) / runs from your stats dict is "mean acceptance length + 1" in vLLM's spec-decode logs (lab-02 reads 2.8 / 5 from a real run). Everything else — ITL improvement, the lab-04 economics — derives from it. When evaluating any drafter, ask for this number on your workload before anything else.
Why scan latest-first? Recent context predicts continuation better than distant context (local repetition dominates). Upstream's NgramProposer.propose (ngram_proposer.py:131) does the same, with bounded lookback and n-gram sizes — read it after; it's your function with engineering around it.
The verify loop's "always ≥1 token" is load-bearing for the scheduler: a speculation cycle is, from Phase 3's perspective, just a decode step that may emit several tokens. The KV for accepted tokens was already written during the verify pass; rejected positions' KV must be discarded (in paged terms: the slots are simply overwritten next cycle — the counters from Phase 1 make rollback almost free). If you wondered why num_computed_tokens racing num_tokens was such a good idea — speculative decoding is one more feature that composes with it for free.
k is not free even when drafting is — each proposed-but-rejected token wastes a slot in the verify batch. With a free drafter the waste is mild (the verify pass was ~constant cost anyway); with a costly drafter it's lab-04's whole subject.
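As a small worked example of the first note, the counters can be turned into the three numbers that tend to get confused. This sketch assumes a stats dict with the keys `accepted` and `runs` (the actual keys in your lab code may differ) and assumes every run proposed exactly k tokens, which an n-gram drafter does not guarantee.

```python
def spec_decode_summary(stats: dict, k: int) -> dict:
    """Derive headline spec-decode metrics from two hand-maintained counters."""
    runs = stats["runs"]            # target verify passes
    accepted = stats["accepted"]    # draft tokens that survived verification
    return {
        "mean_acceptance_length": accepted / runs,
        "tokens_per_run": (accepted + runs) / runs,        # +1 per run: correction or bonus
        "draft_acceptance_rate": accepted / (runs * k),    # accepted / proposed
    }


# Counters chosen to reproduce the 2.8 / 5 figure quoted above.
summary = spec_decode_summary({"accepted": 280, "runs": 100}, k=5)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```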

Going further

Track per-position acceptance (does proposal #1 accept more often than #4?) on periodic text with noise injected — you'll rediscover the geometric decay that lab-04's α^i models.
Implement the suffix-automaton upgrade: index all earlier occurrences instead of scanning (longest-match instead of fixed-n). Compare acceptance on text with mixed repetition lengths — this is the direction production prompt-lookup variants take.
Run your speculative loop with mini_vllm's deterministic ToyModel as the target (greedy) and verify the identity holds against LLMEngine.generate — wiring the lab into the course's engine, the way upstream wires NgramProposer into the runner.

References

upstream/vllm/v1/spec_decode/ngram_proposer.py:131 — NgramProposer.propose: your drafter, productionized.
upstream/vllm/v1/sample/rejection_sampler.py:87 — the greedy verify fast path.
Leviathan et al. (2022) — the original draft/verify formulation: https://arxiv.org/abs/2211.17192
Prompt Lookup Decoding (Saxena, 2023) — the n-gram drafter's origin as a standalone trick: https://github.com/apoorvumang/prompt-lookup-decoding
Labs 03 (the sampled-case theorem) and 04 (the economics) — this lab's two generalizations.

Lab 08-02 — EAGLE on Real vLLM `[GPU-OPT]`

The CPU labs built the machine (01), proved its theorem (03), and priced it (04). This lab runs the state of the art on real silicon: EAGLE — a one-layer draft head that reads the target model's own hidden features and proposes from understanding rather than lab-01's string matching — and measures the two numbers the whole phase converges on: inter-token latency (18.2 → 9.6 ms, ~1.9×) and acceptance (2.8 of 5, 56%). Then the two qualifications that keep the result honest: acceptance climbs to ~80% on code, and the win shrinks at high batch — both of which your lab-04 model predicts before the GPU is even warm.

No GPU? Don't panic. The captured run below is annotated against all three CPU labs; the cross-checking is the lesson.

Why this lab exists

This is the phase's integration test — of your understanding, not the software. Three CPU labs handed you a model of speculative decoding with named parameters: acceptance rate α (lab-03's overlap), draft cost c, tokens-per-cycle (lab-01's metric), speedup (lab-04's formula). A real EAGLE run hands you measurements. The work of this lab is the reconciliation: plug the measured acceptance into lab-04's formula, predict the ITL improvement, compare with the measured 1.9×, and account for the gap (overheads the model omits). When prediction and measurement agree within your stated error budget, the phase is yours. When they don't, one of your parameters is wrong — finding which is exactly the skill a staff engineer applies when a vendor (or a teammate) claims "2× from speculation" for your workload.

The lab also installs the production decision frame: speculative decoding is a latency tool funded by spare compute. Both halves of that sentence are visible in the capture — the single-stream halving, and the fade at saturated batch — and both are why "should we enable EAGLE?" has a different answer for a chatbot (yes, probably) than for an offline batch pipeline (probably not).

Background: what EAGLE changes (and doesn't)

Everything from labs 01/03 survives intact: propose k, one batched verify, leading-run acceptance, correction/bonus token, lossless guarantee. What EAGLE changes is the drafter: instead of searching the context for literal repeats (α ≈ 0 on novel prose), it runs a single transformer layer over the target model's last hidden states — the features the target computed anyway — plus the sampled token, and autoregressively rolls out k draft tokens. Because it reads the target's "thoughts" rather than its text, it predicts well even on text that never repeats (α ≈ 0.6–0.8); because it's one layer against the target's 32+, its cost is c ≈ 0.05. On lab-04's (α, c) plane, EAGLE sits in the corner that dominates both the free-but-blind n-gram drafter and the smart-but-expensive separate draft model — which is why the separate-draft-model approach has mostly faded, and why EAGLE-family heads exist for most popular open models.

The price of reading hidden states: the head is target-specific (trained per model, shapes must match — you can't borrow Llama's head for Qwen), and the draft itself runs autoregressively (k sequential micro-steps — tiny ones, but this is exactly where Phase 5's CUDA graphs become load-bearing: a 1-layer model's step is pure launch overhead without them).

Requirements

uv pip install -e ".[vllm]"
# a base model + its matching EAGLE head from the Hub, e.g.:
#   meta-llama/Meta-Llama-3-8B-Instruct  +  yuhuili/EAGLE-LLaMA3-Instruct-8B

Steps

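import time
from vllm import LLM, SamplingParams

sp = SamplingParams(max_tokens=128, temperature=0)
prompts = ["Explain how a hash map handles collisions."]  # single stream first!

def measure_itl(llm: LLM) -> float:
    """Approximate ITL in seconds: wall time / generated tokens (prefill included)."""
    start = time.perf_counter()
    outputs = llm.generate(prompts, sp)
    elapsed = time.perf_counter() - start
    return elapsed / sum(len(o.outputs[0].token_ids) for o in outputs)

# Run each configuration in its own process: two engines at 0.8 each
# will generally not fit on one GPU.
base = LLM(model="<base>", gpu_memory_utilization=0.8)
print(f"baseline ITL: {measure_itl(base) * 1e3:.1f} ms/token")

spec = LLM(model="<base>", gpu_memory_utilization=0.8,
           speculative_config={"method": "eagle", "model": "<eagle head>",
                               "num_speculative_tokens": 5})
print(f"eagle ITL:    {measure_itl(spec) * 1e3:.1f} ms/token")
# Then read the spec-decode metrics lines from the log
# (acceptance counts / mean acceptance length).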

Three runs to do properly: (1) single stream, the headline; (2) the same prompt swapped for code generation — watch acceptance move; (3) batch 32+ — watch the speedup fade. Before each, predict the result from lab-04 with your current (α, c) estimates.

Captured output (real run, Llama-3-8B + EAGLE, A100, vLLM 0.22.1, trimmed)

baseline      : ITL 18.2 ms/token   (54.9 tok/s, single stream)
eagle (k=5)   : ITL  9.6 ms/token   (104 tok/s)        ~1.9x faster
spec_decode metrics: mean acceptance length 2.8 / 5 ; acceptance rate 56%
# on highly repetitive input (code), acceptance rose to ~80% and ITL dropped further.
# at large batch (saturated GPU) the speedup shrank — less spare capacity to verify.

Reading the numbers

Mean acceptance length 2.8 → tokens-per-cycle 3.8 (the +1 is lab-01's correction/bonus). Lab-04 sanity check: the per-position α solving (1−α⁶)/(1−α) = 3.8 is ≈ 0.81 (α = 0.75 would give only ≈ 3.29 tokens per cycle). The logged "56%" uses a different denominator (accepted/proposed = 2.8/5). Two acceptance metrics, one phenomenon; confusing them is the most common spec-decode reporting error. Always ask which one a number is.
Predicted vs measured: lab-04 with α≈0.81, c=0.05, k=5 gives 3.78 / 1.25 ≈ 3.0×; measured is 1.9×. The gap is the model's known omissions (per-cycle sampler/launch overheads, the verify pass costing slightly more than 1, drafting running serially), consistent in direction with the bias list in lab-04's notes. A model that misses by a predictable margin in a predictable direction is a working model.
Code → 80% acceptance: sharper next-token distributions overlap more (lab-03: α = Σ min(p,q) grows as both distributions concentrate). Same reason low temperature helps. Your workload's α is a property of your traffic; measure it there.
The fade at batch: verify rides on spare compute (Phase 0 lab-04's idle FLOPs at small batch). A saturated GPU has none — the verify pass now displaces other requests' work, and tokens-per-cycle gains stop translating into wall-clock. Spec decode is a latency tool; at full throughput it approaches a no-op (or worse, with drafting overhead). This single observation decides most deployment questions.
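The reconciliation in the first two notes can be reproduced numerically instead of taken on trust. This is a minimal sketch under the i.i.d. acceptance model; `expected_tokens` and `solve_alpha` are hypothetical helper names, and c = 0.05 is the text's estimate, not a measurement.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens per cycle under i.i.d. acceptance: 1 + a + ... + a^k."""
    return sum(alpha ** i for i in range(k + 1))


def solve_alpha(tokens_per_cycle: float, k: int, tol: float = 1e-9) -> float:
    """Bisect for the alpha whose expected tokens per cycle matches a measurement."""
    lo, hi = 0.0, 1.0  # expected_tokens is increasing in alpha on [0, 1]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_tokens(mid, k) < tokens_per_cycle:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


k = 5
mean_acceptance_length = 2.8                            # from the captured metrics line
tokens_per_cycle = mean_acceptance_length + 1           # correction or bonus token
alpha = solve_alpha(tokens_per_cycle, k)                # per-position acceptance
accepted_over_proposed = mean_acceptance_length / k     # the logged percentage
predicted_speedup = tokens_per_cycle / (k * 0.05 + 1)   # lab-04 formula with c = 0.05

print(f"per-position alpha   : {alpha:.3f}")
print(f"accepted / proposed  : {accepted_over_proposed:.2f}")
print(f"predicted speedup    : {predicted_speedup:.2f}x (compare with the measured ITL ratio)")
```

Running the same three lines against your own metrics log is the quickest way to tell which acceptance number a dashboard is reporting.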

Hitchhiker's notes

k=5 is not sacred. With per-position α around 0.75–0.8 and c ≈ 0.05, lab-04's idealized optimum sits a little above 5 (roughly 6–8) on a curve that is nearly flat near its peak, so 5 is fine. But on the prose end (α ≈ 0.5) optimal k drops to ~3, and a configured k that is too high costs latency (rejected drafts still occupy verify slots). If your acceptance metrics run low, shrinking num_speculative_tokens is the free fix nobody tries.
EAGLE + CUDA graphs are a package deal (Phase 5 lab-04's note, now concrete): the draft head's per-token step is ~1 ms-class GPU work behind full launch overhead — eager-mode EAGLE can lose most of its margin to Python and launches. If spec-decode numbers disappoint, check the draft path is actually captured.
Greedy here, but the guarantee generalizes: with temperature > 0 the verify runs lab-03's rejection sampling, and outputs are distributionally identical rather than token-identical. Acceptance drops a bit (broader distributions overlap less). The metrics machinery is unchanged.
EAGLE-2/3 and tree drafts: instead of one chain of k, draft a small tree of alternatives and verify all paths in one pass (attention masks make a tree look like a batch). Buys higher expected acceptance per verify at the cost of verify width — same economics, one more dimension. When you see speculative_config grow tree parameters, lab-04's model extends with "k" becoming "tree shape."

Reflect

Reconcile the three acceptance numbers you now have (2.8/5 = 56%; the per-position α backed out of tokens-per-cycle; code ≈ 80%): write each as a formula over the same event sequence. If you can do this cold, you'll never misread a spec-decode dashboard.
Your fleet runs batch-48 throughput-oriented summarization. EAGLE: yes or no? What measurement would change your answer? (Likely no — saturated compute; measure spare utilization headroom and p99 ITL requirements. If interactivity appears — yes for the interactive class, via a separate pool or priority.)
The EAGLE head must match the target model. What happens operationally when you upgrade the base model checkpoint? (The head needs retraining/replacing — speculative configs add a coupled artifact to your model-rollout pipeline. Budget for it or inherit silent acceptance collapse.)

References

Li et al., EAGLE (2024): https://arxiv.org/abs/2401.15077; EAGLE-2 (tree drafts, 2024): https://arxiv.org/abs/2406.16858
upstream/vllm/v1/spec_decode/eagle.py — the proposer; note the hidden-state plumbing from the target's forward.
vLLM docs, Speculative Decoding — configs and the metrics you read: https://docs.vllm.ai/en/latest/features/spec_decode/
Labs 01/03/04 — the machine, the theorem, the economics this run validates.

Lab 08-03 — Rejection Sampling: Lossless Speculation with Temperature `[CPU-OK]`

Lab-01's greedy verify had an easy life: at temperature 0 there's exactly one right token, so "accept iff the draft equals it" is obviously lossless. But production serving samples — and now the claim that speculative decoding "doesn't change the output" becomes a real theorem with a real proof obligation: the verified output must be distributed exactly according to the target model's distribution p, no matter how wrong the drafter's q is. This lab has you implement the three-line algorithm that achieves it — accept draft x with probability min(1, p[x]/q[x]), else resample from the residual normalize(max(p − q, 0)) — and then verify the theorem empirically: 200,000 draws through a deliberately clueless uniform drafter land on the target distribution to within sampling noise. This is the mathematical heart of every speculative method in vLLM, from n-gram to EAGLE.

Why this lab exists

"Speculative decoding is lossless" is repeated everywhere and understood almost nowhere — most explanations stop at the greedy case, leaving the sampled case as folklore. But the sampled case is where the engineering risk lives: a subtly wrong residual, a missing clamp, a normalization slip — and your serving system is quietly sampling from a distribution that is not the model's, a bug invisible to every output-equality test (each individual output is plausible!) and detectable only distributionally. The defense, which you'll build, is the statistical test: histogram many draws, compare to p. If you ever touch rejection_sampler.py upstream — and spec-decode PRs touch it constantly — this lab's test design is how you protect the change.

The second deliverable is the acceptance-rate formula Σ min(p, q) — the overlap of the two distributions (equivalently 1 − total-variation distance). It converts "is the drafter any good?" from vibes into one number per position, and it's the alpha that lab-04's economics run on. Drafter evaluation, acceptance metrics in vLLM's logs, temperature's effect on speedup — all read off this one quantity.

Background: why the algorithm works

The output token's probability decomposes into "accepted draft" + "residual resample":

P(output = x) = q[x]·min(1, p[x]/q[x]) + P(reject)·residual[x]
              = min(p[x], q[x])        + (1 − Σ min(p,q)) · max(p[x]−q[x],0) / Σ max(p−q,0)

Since Σ max(p−q, 0) = 1 − Σ min(p, q) (the surplus equals the deficit — both are the TV distance), the second term simplifies to max(p[x] − q[x], 0), and:

P(output = x) = min(p[x], q[x]) + max(p[x] − q[x], 0) = p[x]      ∎

Read the proof's shape: where the draft over-serves (q > p), acceptance is throttled by exactly the ratio; where it under-serves (q < p), drafts always pass and the residual makes up precisely the shortfall. The two errors cancel by construction, not by luck — which is why the result holds for any q, including adversarially bad ones (your test_disjoint_distributions_never_accept: zero overlap, everything rejected, output still exactly p — spec decode degrades to baseline speed, never to wrongness. That graceful-degradation property is what makes it safe to deploy aggressively).
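The decomposition above can be checked exactly, without sampling noise, by computing the output distribution term by term. The sketch below reuses the lab's function names but the signatures are assumptions (distributions are plain lists of floats, `rng` is a `random.Random`); `output_distribution` is a hypothetical helper that exists only for this check.

```python
import random
from typing import List, Sequence


def accept_prob(p: Sequence[float], q: Sequence[float], x: int) -> float:
    """Probability of keeping draft token x (x was drawn from q, so q[x] > 0)."""
    return min(1.0, p[x] / q[x])


def residual_distribution(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """normalize(max(p - q, 0)): the distribution to resample from on rejection."""
    surplus = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(surplus)
    if total == 0.0:          # p == q: rejection never happens, keep it well-defined
        return list(p)
    return [s / total for s in surplus]


def speculative_token(p: Sequence[float], q: Sequence[float], rng: random.Random) -> int:
    """One draft-then-verify step; the result is distributed according to p."""
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < accept_prob(p, q, x):
        return x
    return rng.choices(range(len(p)), weights=residual_distribution(p, q))[0]


def output_distribution(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Exact P(output = x): accepted-draft term plus reject-and-resample term."""
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]   # q[x] * min(1, p[x] / q[x])
    p_reject = 1.0 - sum(accepted)
    residual = residual_distribution(p, q)
    return [a + p_reject * r for a, r in zip(accepted, residual)]


target = [0.6, 0.3, 0.1]
uniform_draft = [1 / 3, 1 / 3, 1 / 3]
print(output_distribution(target, uniform_draft))   # matches target up to float rounding
print(speculative_token(target, uniform_draft, random.Random(0)))
```

The exact check and the lab's 200k-draw histogram test are complementary: the first validates the algebra, the second validates the sampling code path that production actually runs.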

Files

starter.py — accept_prob, residual_distribution, speculative_token, expected_acceptance_rate. Your work.
solution.py — reference.
test_lab.py — the formula edges, the residual, the empirical theorem (200k draws), the identical-distribution and disjoint-distribution limits, and the overlap formula against simulation.

Run

LAB_IMPL=starter pytest phase-08-speculative-decoding/labs/lab-03-rejection-sampling -q
pytest phase-08-speculative-decoding/labs/lab-03-rejection-sampling -q   # reference

What the tests prove

Test	What it pins
`test_accept_prob_formula`	The two regimes: `p ≥ q` → always accept; `p < q` → the exact ratio
`test_residual_is_the_renormalized_surplus`	The fallback distribution, value by value — get this wrong and the theorem dies silently
`test_output_distribution_is_exactly_the_target`	The theorem, empirically: uniform drafter, skewed target, 200k draws, histogram ≈ `p` within 0.005. This is the test that catches the silent bug class
`test_identical_distributions_always_accept`	The `q = p` limit: overlap 1, acceptance 1 — a perfect drafter is never rejected (and the `p == q` residual edge case stays well-defined)
`test_acceptance_rate_is_the_overlap`	`Σ min(p,q) = 0.70` for the lab's pair, confirmed by simulating the accept branch alone
`test_disjoint_distributions_never_accept`	The adversarial limit: zero overlap → pure residual → still exactly `p`. Wrongness is impossible; only speed is at stake

The statistical tolerance (atol=0.005 at N=200k) is calibrated, not hand-waved: the binomial standard error at p=0.5 is √(0.25/200000) ≈ 0.0011, so 0.005 is ~4.5σ — tight enough to catch any real implementation error, loose enough to never flake. When you write distributional tests (and after this lab, you will), do this arithmetic.
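The tolerance arithmetic is worth scripting so it gets redone whenever N or the tolerance changes. A minimal sketch; `binomial_standard_error` is a hypothetical helper name, and p = 0.5 is used as the worst case because it maximizes p(1 − p).

```python
import math


def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of an empirical frequency estimated from n independent draws."""
    return math.sqrt(p * (1.0 - p) / n)


n_draws = 200_000
atol = 0.005
se = binomial_standard_error(0.5, n_draws)   # worst case over p
sigmas = atol / se

print(f"standard error: {se:.4f}")
print(f"tolerance is {sigmas:.1f} standard errors wide")
```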

Hitchhiker's notes

Greedy verify is this algorithm's zero-temperature limit: as temperature → 0, p and q collapse toward one-hots; min(1, p[x]/q[x]) becomes "1 if the argmaxes match, else 0", and the residual becomes the target's argmax. Lab-01 was a special case all along — upstream's RejectionSampler has the explicit greedy fast path (rejection_sampler.py:87) for exactly this case, because comparing argmaxes is cheaper than the full machinery.
Multi-token drafts chain this per position: verify token 1 against p₁; if accepted, token 2 against p₂ (computed with token 1 in context — the target's one batched forward scored all positions); first rejection stops the chain and resamples from that position's residual. The i.i.d.-ish per-position acceptance is the alpha lab-04 models. Crucially, all k+1 target distributions came from one forward pass — that batching is the entire economic basis (lab-04's cost = k·c + 1).
Where the probabilities come from matters: p and q here are post-temperature, post-top-p distributions — the verifier must apply the same sampling-parameter pipeline (Phase 0 lab-03) to both models' logits, or the ratio compares apples to oranges. Sampling-parameter mismatches between draft and target paths are a real upstream bug category; now you know what they corrupt.
The same trick generalizes: speculative sampling is importance-sampling-flavored rejection sampling with a guaranteed-exact fallback. Lossless variants (tree drafts with multiple candidates per position, as in EAGLE-2) bend the acceptance rule while preserving the distributional identity; typical acceptance, as in Medusa, deliberately relaxes the identity in exchange for higher acceptance. When reading any new spec-decode paper, find its version of this lemma first and check whether it claims exactness; everything else is scheduling.
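The per-position chaining described in the second note can be sketched as follows. This is a toy under a loud assumption: the target distributions in `p_list` are fixed per position, whereas in a real model each one depends on the tokens emitted before it. The function name and signature are hypothetical.

```python
import random
from typing import List, Sequence


def speculative_sequence_sketch(
    p_list: Sequence[Sequence[float]],   # k + 1 target distributions (last one is for the bonus)
    q_list: Sequence[Sequence[float]],   # k draft distributions
    rng: random.Random,
) -> List[int]:
    """Chain accept/reject over k draft positions; emits between 1 and k + 1 tokens."""
    emitted: List[int] = []
    for p, q in zip(p_list, q_list):
        x = rng.choices(range(len(q)), weights=q)[0]       # the draft token at this position
        if rng.random() < min(1.0, p[x] / q[x]):
            emitted.append(x)                              # accepted, move to the next position
            continue
        # Rejected, so q[x] > p[x] and the surplus below has positive mass.
        surplus = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        emitted.append(rng.choices(range(len(p)), weights=surplus)[0])
        return emitted                                     # first rejection ends the cycle
    # All k drafts accepted: sample the bonus token from the last target distribution.
    bonus_p = p_list[-1]
    emitted.append(rng.choices(range(len(bonus_p)), weights=bonus_p)[0])
    return emitted


rng = random.Random(0)
perfect = [[0.5, 0.5]] * 3
print(len(speculative_sequence_sketch(perfect, perfect[:2], rng)))          # drafter equals target
print(speculative_sequence_sketch([[1.0, 0.0]] * 3, [[0.0, 1.0]] * 2, rng))  # disjoint drafter
```

`random.choices` accepts unnormalized weights, so the surplus is passed without renormalizing; a vectorized implementation would normalize explicitly.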

Going further

Implement chained multi-token verification (speculative_sequence(p_list, q_list, k, rng)) and verify the joint distribution of two-token outputs matches sequential target sampling — the full lossless claim, one level up.
Measure acceptance vs temperature: fix logits, sweep T ∈ {0.2, 0.7, 1.0, 1.5} for both models, plot Σ min(p,q). Sharp distributions overlap more → spec decode loves low temperature — connect to lab-02's "code accepts at 80%" observation.
Break it on purpose: skip the min(1, ·) clamp (accept with raw p/q… capped how?) or forget to renormalize the residual, and watch which test catches each. Knowing the failure signatures is half the review skill.

References

Leviathan et al., Fast Inference from Transformers via Speculative Decoding (2022) — the theorem (their Theorem 3.1 / Appendix A): https://arxiv.org/abs/2211.17192
Chen et al., Accelerating Large Language Model Decoding with Speculative Sampling (2023) — the same result, DeepMind flavor: https://arxiv.org/abs/2302.01318
upstream/vllm/v1/sample/rejection_sampler.py — the production implementation: find the ratio, the residual, and the greedy fast path (:87).
Lab-04 — what Σ min(p,q) is worth in milliseconds; lab-01 — the zero-temperature special case you already built.

Lab 08-04 — The Speculative-Decoding Speedup Model `[CPU-OK]`

Three functions, maybe fifteen lines — and at the end of them you can answer, with arithmetic, the questions that decide whether speculative decoding ships: How much faster, given my drafter's acceptance rate? What draft length k should I configure? And when does spec decode make things worse? (It can. test_spec_decode_can_lose proves a mediocre drafter at realistic cost is a net loss, and optimal_k tells you the right configuration is zero.) This is the expected-speedup model from the original paper, the same arithmetic behind vLLM's num_speculative_tokens default debates — and after this lab, behind your config choices instead of your hopes.

Why this lab exists

Speculative decoding is the most conditionally valuable optimization in this course: transformative for a single latency-bound stream with a sharp drafter, worthless — or harmful — for a saturated-throughput deployment with a dull one. Teams burn weeks discovering this empirically because they treat "enable spec decode" as a boolean instead of an equation. The equation has three inputs you can measure independently — acceptance rate alpha (lab-03's overlap, read from vLLM's spec-decode metrics), draft cost c (drafter step time / target step time, read from a profile), and draft length k (the config knob) — and one output. This lab makes you fluent in it, including its honest failure regions.

It's also a clean specimen of modeling discipline: the formula assumes i.i.d. per-position acceptance, which is false in detail (acceptance correlates within a phrase) but accurate in expectation — and test_expected_tokens_matches_simulation shows the model agreeing with a 200k-cycle simulation to 1%. Knowing how to validate a simplification is worth more than distrusting all simplifications.

Background: the three-parameter economy

One speculation cycle: draft k tokens (cost k·c), then one target forward verifies all of them in a single batch (cost ≈ 1 — this batching is the entire trick; the verify scores k+1 positions for the price of one decode step because prefill-shaped work is compute-cheap, Phase 0 lab-04). The cycle emits the leading run of accepted tokens plus one (the correction on first rejection, or the bonus token when everything passes — lab-01's verify_greedy mechanics):

E[tokens/cycle] = 1 + α + α² + … + α^k = (1 − α^(k+1)) / (1 − α)

speedup(α, k, c) = E[tokens/cycle] / (k·c + 1)

Two structural facts fall out before you compute anything. Diminishing returns in k: the i-th draft token only counts if all before it accepted, so its marginal value is α^i — geometrically decaying, while its cost c is constant; past some k every extra token is negative-margin (hence optimal_k). The ceiling: even free drafts can't beat 1/(1−α) tokens per cycle — at α=0.7 that's 3.3×, at α=0.5 it's 2× — so chasing speedup beyond the ceiling means improving the drafter, not the config.
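The two formulas and the diminishing-returns argument translate almost directly into code. This sketch uses the lab's function names, but the signatures, the argument order, and the `max_k` search bound are assumptions; your starter.py defines the real contract.

```python
def expected_tokens_per_verify(alpha: float, k: int) -> float:
    """E[tokens/cycle] = 1 + alpha + ... + alpha^k (i.i.d. per-position acceptance)."""
    return float(sum(alpha ** i for i in range(k + 1)))


def speedup(alpha: float, k: int, c: float) -> float:
    """Tokens per cycle divided by the cycle cost in target steps (k drafts + 1 verify)."""
    return expected_tokens_per_verify(alpha, k) / (k * c + 1.0)


def optimal_k(alpha: float, c: float, max_k: int = 20) -> int:
    """Draft length maximizing the modeled speedup; 0 means 'do not speculate'."""
    return max(range(max_k + 1), key=lambda k: speedup(alpha, k, c))


for alpha, c in [(0.7, 0.05), (0.5, 0.05), (0.2, 0.5)]:
    k_star = optimal_k(alpha, c)
    print(f"alpha={alpha} c={c}: optimal k={k_star}, "
          f"speedup at k=5 is {speedup(alpha, 5, c):.2f}x, "
          f"ceiling {1 / (1 - alpha):.2f} tokens/cycle")
```

Summing the series term by term avoids the division by zero that the closed form hits at α = 1, at no practical cost for the small k involved.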

Files

starter.py — expected_tokens_per_verify, speedup, optimal_k. Your work.
solution.py — reference.
test_lab.py — formula edges, the simulation check, the free-drafter bound, the losing regime, EAGLE-ballpark numbers, and both monotonicities of optimal_k.

Run

LAB_IMPL=starter pytest phase-08-speculative-decoding/labs/lab-04-speedup-model -q
pytest phase-08-speculative-decoding/labs/lab-04-speedup-model -q   # reference

What the tests prove

Test	What it pins
`test_expected_tokens_formula`	The geometric series and both edges: α=0 → 1 (corrections only — spec decode never emits less than baseline per cycle), α=1 → k+1
`test_expected_tokens_matches_simulation`	The i.i.d. model vs 200k simulated cycles: < 1% off. The simplification, validated
`test_free_drafter_always_wins`	c=0 (the n-gram drafter — lab-01): speedup = E[tokens] ≥ 1. Free drafts can't lose, which is why prompt-lookup ships as a default-safe option
`test_spec_decode_can_lose`	α=0.2, c=0.5, k=5 → speedup < 1, and `optimal_k = 0`. The model's most valuable output is "turn it off"
`test_eagle_like_numbers`	α≈0.7, c≈0.05, k=5 → ~2.3× — the ballpark real EAGLE deployments report (lab-02's measured 1.9× at k=5 sits right here once you account for overheads the model omits)
`test_optimal_k_grows_with_alpha_and_shrinks_with_cost`	The two intuitive monotonicities, made checkable
`test_diminishing_returns_in_k`	The marginal value of draft token i decays geometrically — why k=5-ish is so common and k=20 is almost never right

Hitchhiker's notes

What the model omits, and which way each omission points: verify cost grows (slightly) with k — real cost is k·c + (1+εk), pushing optimal k down; the drafter and target compete for the GPU at high batch — at saturation the "spare compute" funding the whole scheme disappears (lab-02's shrinking-win observation), pushing value down; per-cycle fixed overheads (extra kernel launches, sampler work) hurt small-α configs most. The model is an upper bound with known biases — the most useful kind.
The same arithmetic prices drafters against each other: n-gram (α low on prose, high on code; c ≈ 0), EAGLE (α ≈ 0.6–0.8; c ≈ 0.03–0.08 — a one-layer head), a half-size draft model (α high; c ≈ 0.2–0.5 — usually dominated by EAGLE on this math, which is why standalone draft models faded). Three points on a (α, c) plane; speedup is the contour map. Plot it once and the drafter literature organizes itself.
α is workload-dependent, so the right k is too: code and structured output accept far better than creative prose (lab-02 measured 80% vs ~56%). A deployment serving both has no single optimal k — which is why dynamic/adaptive speculation (adjusting k per request from rolling acceptance) is an active upstream direction. The model you just built is the controller's objective function.
Where to read your fleet's α: vLLM's spec-decode metrics (acceptance counts per position, mean acceptance length — the 2.8 / 5 in lab-02's capture). Mean acceptance length ≈ E[tokens/cycle] − 1; invert your formula and you can back out the effective α from production logs. Do it before and after a drafter upgrade and you have the business case in two numbers.
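The first note's claim that a k-dependent verify cost pushes the optimal k down can be checked by adding the ε term to the model. A minimal sketch; the function names and the ε values are illustrative assumptions, not measurements of any real GPU.

```python
def speedup_with_verify_overhead(alpha: float, k: int, c: float, eps: float) -> float:
    """Speedup when the verify pass costs 1 + eps * k instead of a flat 1."""
    tokens = sum(alpha ** i for i in range(k + 1))
    return tokens / (k * c + 1.0 + eps * k)


def best_k(alpha: float, c: float, eps: float, max_k: int = 20) -> int:
    """Draft length maximizing the overhead-aware speedup (0 = speculation off)."""
    return max(range(max_k + 1),
               key=lambda k: speedup_with_verify_overhead(alpha, k, c, eps))


alpha, c = 0.8, 0.05
for eps in (0.0, 0.02, 0.05):
    k_star = best_k(alpha, c, eps)
    gain = speedup_with_verify_overhead(alpha, k_star, c, eps)
    print(f"eps={eps}: best k={k_star}, modeled speedup {gain:.2f}x")
```

Setting eps to 0 recovers the plain lab-04 model, which makes the overhead-free result a convenient regression check.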

Going further

Plot speedup vs k for α ∈ {0.3, 0.5, 0.7, 0.9} at c = 0.05: watch the maximum slide right and up with α. Add c = 0.3 curves and watch speculation die for low α. This one figure is the deployment decision.
Add the batch-saturation term: model verify cost as 1 + λ·k where λ grows with batch utilization, and find the utilization where optimal_k hits 0 — you've derived "spec decode is a latency tool, not a throughput tool" instead of memorizing it.
Replace i.i.d. α with a two-state model (in-phrase α_high, at-boundary α_low) and re-derive E[tokens] — then check whether the i.i.d. fit to the mean still predicts well. (It does, mostly — means are forgiving. Tail latency per cycle is not; explore the variance.)

References

Leviathan et al., Fast Inference from Transformers via Speculative Decoding (2022) — §3.1 is this lab's formula: https://arxiv.org/abs/2211.17192
Li et al., EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (2024) — where the (α≈0.7, c≈0.05) point comes from: https://arxiv.org/abs/2401.15077
upstream/vllm/v1/spec_decode/ — proposer implementations; the metrics module that exports your α.
vLLM docs, Speculative Decoding — num_speculative_tokens and method configs: https://docs.vllm.ai/en/latest/features/spec_decode/
Lab-03 — where α comes from (Σ min(p,q)); lab-02 — the measured numbers this model predicts; Phase 0 lab-04 — why the batched verify costs ~1.

Phase 08 — Exercises: Speculative Decoding

Warm-up (explain)

In one breath: how does speculative decoding produce several tokens from one big-model run?
Why does verifying k drafted tokens cost about the same as one decode step? (Tie to prefill.)
Why is the output identical to normal decoding (greedy case)?

Core (trace the code)

NgramProposer.propose (ngram_proposer.py:131) — what does it match on, and what does it return? Why is it great for code/summarization?
In rejection_sample (rejection_sampler.py:392), state the accept probability and what happens on rejection. Why does this preserve the target distribution?
In scheduler.py, how do num_lookahead_tokens and num_tokens_with_spec let spec decode ride the normal schedule with no special case?

Build (your lab)

In lab-01, derive expected tokens-per-run from acceptance rate a and draft length k (hint: it's 1 + (accepted before first reject)).
Add a k sweep: plot tokens-per-run vs k on the periodic target. Why does it plateau?
Construct an input where n-gram hurts (proposals never accepted): show runs == baseline and explain the wasted draft cost.

Design (staff-level)

Given target step cost C_t, draft cost C_d, and acceptance a, write the condition for spec decode to be a net win. When does large batch flip it negative?
A customer's workload is 70% code (repetitive) and 30% chat (creative). Would you enable spec decode globally, per-request, or adaptively? Justify.
EAGLE vs n-gram: when would you pick each, and what does EAGLE need that n-gram doesn't?
Spec decode interacts with the KV cache (drafts need slots) — what must the scheduler do on rejection, and what's the memory risk?

Self-grading

4–6 and 10–13 are interview-grade. Could you whiteboard draft→verify and the win condition? If not, re-read 01-deep-dive.md.

Phase 08 — Interview Questions: Speculative Decoding

Q1. How does speculative decoding speed up decode?

Model answer

A cheap drafter proposes the next k tokens; the big model verifies all of them in one forward run (a mini-prefill, which is compute-bound and processes many tokens cheaply) and keeps the longest correct prefix plus one correction. So one expensive run yields multiple tokens instead of one. The speedup is set by the acceptance rate × draft length, minus the small drafting/verify overhead.

Q2. Why doesn't it change the model's output?

Model answer

Greedy: you only accept a drafted token if it equals the big model's argmax; on disagreement you discard the rest and use the big model's token — so the sequence is identical to plain greedy. Sampling: the rejection sampler accepts token i with probability min(1, p_target/p_draft) and, on rejection, resamples from normalize(max(0, p_target − p_draft)); the math guarantees the accepted tokens follow the target's exact distribution. Speed changes, behavior doesn't.
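
The sampling claim can be checked exactly for a single drafted position. The sketch below is illustrative only: the function name and the two example distributions are hypothetical, and it computes the distribution of the emitted token in closed form rather than calling vLLM's rejection sampler.

```python
import numpy as np

def speculative_output_distribution(p_target, p_draft):
    """Exact distribution of the token emitted at one drafted position.

    Illustrative helper (not a vLLM function): the drafter samples x ~ p_draft,
    the verifier accepts with probability min(1, p_target[x] / p_draft[x]),
    and on rejection resamples from normalize(max(0, p_target - p_draft)).
    """
    p = np.asarray(p_target, dtype=float)
    q = np.asarray(p_draft, dtype=float)
    # P(x drafted and accepted) = q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accepted = np.minimum(p, q)
    alpha = accepted.sum()                 # overall acceptance probability
    residual = np.maximum(0.0, p - q)
    if residual.sum() > 0:
        residual = residual / residual.sum()
    # Rejection happens with probability 1 - alpha, then we draw from the residual.
    return accepted + (1.0 - alpha) * residual, alpha

p_target = np.array([0.5, 0.3, 0.2])
p_draft = np.array([0.2, 0.2, 0.6])
out, alpha = speculative_output_distribution(p_target, p_draft)
print(out, alpha)
```

The emitted distribution equals p_target whatever the drafter proposes, and the acceptance probability is Σ min(p, q), so a poor drafter costs speed but never correctness.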

Q3. When is it a win, and when does it hurt?

Model answer

Win when accepted-tokens-per-run × target-step-cost exceeds the cost of drafting plus the extra verify work — i.e. high acceptance and a cheap drafter, in latency-bound (small-batch) regimes with spare GPU capacity. It can lose at low acceptance (creative text, weak drafter) or at large batch where the GPU is already saturated and verifying drafts steals capacity from real work.
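
A back-of-envelope version of that condition, as a sketch under a simplifying assumption: each drafted token is accepted independently with probability a. The function names and the verify_overhead parameter (extra target cost from verifying k tokens instead of one, as a fraction of a normal step) are hypothetical, not vLLM metrics.

```python
def expected_tokens_per_run(a: float, k: int) -> float:
    """Expected tokens emitted per target run with draft length k.

    Assumes independent per-token acceptance with probability a.
    The run always yields one token (correction or bonus), plus a^i for
    each prefix length i that survives: 1 + a + a^2 + ... + a^k.
    """
    return sum(a ** i for i in range(k + 1))

def speedup(a: float, k: int, c_target: float, c_draft: float,
            verify_overhead: float = 0.0) -> float:
    """Tokens per unit cost, relative to plain decoding (1 token per c_target)."""
    cost_per_run = c_target * (1.0 + verify_overhead) + k * c_draft
    return expected_tokens_per_run(a, k) * c_target / cost_per_run

# High acceptance with a cheap drafter wins; zero acceptance only pays the draft cost.
print(speedup(0.8, 4, c_target=1.0, c_draft=0.05))
print(speedup(0.0, 4, c_target=1.0, c_draft=0.05))
# A saturated GPU shows up as a large verify_overhead and can flip the result.
print(speedup(0.8, 4, c_target=1.0, c_draft=0.05, verify_overhead=3.0))
```

Spec decode is a net win when speedup exceeds 1. Because the series is geometric, adding more draft tokens gives diminishing returns while the draft cost keeps growing linearly in k.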

Q4. What proposers exist and how do they differ?

Model answer

n-gram / prompt-lookup (free, copies a repeated phrase's continuation — great for code/structured text); EAGLE (a small trained head predicting the target's next hidden states — high acceptance on general text); Medusa (extra heads), DFlash, suffix decoding, and a separate small draft model. All plug into the same verify path; they trade drafter cost vs acceptance quality.

Q5. How does spec decode fit vLLM's scheduler without a special case?

Model answer

A request's num_tokens_with_spec includes the draft tokens, so the standard num_new_tokens clamp schedules them; num_lookahead_tokens reserves KV slots for them. After the run, the rejection sampler decides accept/reject and update_from_output keeps the accepted prefix and rolls back the rest. The scheduler just sees "a few more tokens in the gap" — the Phase 3 abstraction absorbs the whole feature.

Rapid-fire

Verify cost ≈ ? one decode/prefill run (processes k+context together).
Output change? none (greedy: argmax-only accept; sampling: rejection sampling).
Deciding metric? acceptance rate.
Free proposer? n-gram / prompt-lookup. Best trained one (today)? EAGLE.
Scheduler hooks? num_tokens_with_spec, num_lookahead_tokens.

Phase 08 — Cheatsheet: Speculative Decoding

The one-liner

Cheap drafter guesses k next tokens → big model verifies all in ONE run → keep the longest correct prefix + 1 correction. Several tokens per expensive run; output identical to normal decoding.

Why it works

Verification = a mini-prefill (compute-bound, processes many tokens cheaply), so checking k tokens ≈ one decode run. Speedup ∝ acceptance rate.

Correctness

Greedy: accept only the target's argmax → identical output.
Sampling: rejection sampling — accept w.p. min(1, p_target/p_draft), resample normalize(max(0, p_target−p_draft)) on reject → exact target distribution.

Proposers

n-gram/prompt-lookup (free; great for repetitive/code) · EAGLE (trained head, predicts hidden states; best general) · Medusa · DFlash · suffix · small draft model.

Win/lose

Win: high acceptance, cheap drafter, small batch (spare capacity). Lose: low acceptance, or large batch (GPU already saturated). Condition: accepted/run × C_target > C_draft + extra_verify.

Rides the scheduler

num_tokens_with_spec adds drafts to the gap; num_lookahead_tokens reserves KV; rejection result applied in update_from_output. No special scheduler path.

Key upstream

v1/spec_decode/ngram_proposer.py:12/:131 · eagle.py:10 · medusa.py dflash.py suffix_decoding.py
v1/sample/rejection_sampler.py:37 RejectionSampler :87 forward :392 rejection_sample
v1/spec_decode/metrics.py (acceptance) · scheduler.py (spec_token_ids / num_lookahead_tokens)

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

Phase 09 — The Hitchhiker's Guide to Sampling & Decoding Algorithms

← Phase 08 · Course home · Phase 10 →

Don't Panic

The model gives you logits — a score for every token in the vocabulary. You still have to pick one. How you pick is the decoding algorithm: greedy, temperature, top-k, top-p, penalties, parallel sampling, beam search. In a batched engine, all of these must run together, vectorized, on the GPU, for many requests that each chose different settings. This phase is that machinery — small, on the critical path of every single step, and the hook structured output (Phase 12) and penalties plug into.

logits (vocab,) per request
   │  logits processors: penalties, bias, bad-words, grammar mask (Phase 12)
   │  temperature scaling
   │  top-k / top-p / min-p truncation
   ▼
 probability distribution  ──sample──►  next token id

Step 1: The knobs, from sharpest to softest

Greedy (temperature 0): always take the argmax. Deterministic. (mini_vllm's default.)
Temperature T: divide logits by T before softmax. T<1 sharpens (more confident), T>1 flattens (more random). T→0 → greedy.
Top-k: keep only the k highest-probability tokens, renormalize, sample from those.
Top-p (nucleus): keep the smallest set of tokens whose cumulative probability ≥ p. Adaptive: few tokens when the model is confident, many when it's unsure.
Min-p: keep tokens with probability ≥ min_p × max_prob. A confidence-relative cutoff.

These compose in a fixed order (penalties → temperature → top-k → top-p/min-p → sample), which you'll implement in lab-01.
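
For orientation, here is a minimal single-row sketch of the knobs in numpy. The function names are illustrative (they are not the lab's required signatures or vLLM's API), and the top-p rule shown, which keeps the token that crosses the threshold, is one reasonable reading of "smallest set with cumulative probability ≥ p".

```python
import numpy as np

def softmax(logits):
    # Masked entries (-inf) become exactly 0 after exp.
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def apply_temperature(logits, T):
    return logits / T                       # T < 1 sharpens, T > 1 flattens

def apply_top_k(logits, k):
    kth = np.sort(logits)[-k]               # k-th largest logit
    return np.where(logits >= kth, logits, -np.inf)

def apply_top_p(logits, p):
    probs = softmax(logits)
    order = np.argsort(-probs)              # token ids, most probable first
    # Mass strictly before each token; the boundary token is kept.
    exclusive = np.cumsum(probs[order]) - probs[order]
    keep = order[exclusive < p]
    out = np.full_like(logits, -np.inf)
    out[keep] = logits[keep]
    return out

def apply_min_p(logits, min_p):
    probs = softmax(logits)
    return np.where(probs >= min_p * probs.max(), logits, -np.inf)

logits = np.log(np.array([0.5, 0.3, 0.15, 0.05]))
x = apply_temperature(logits, 1.0)
x = apply_top_k(x, 3)
x = apply_top_p(x, 0.7)
x = apply_min_p(x, 0.2)
rng = np.random.default_rng(0)
token = int(rng.choice(len(x), p=softmax(x)))
print(np.isfinite(x), token)
```

Each stage only masks logits to -inf; renormalization happens implicitly in the final softmax, which is why the stages chain cleanly.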

Step 2: Penalties and logit bias

Before sampling you can edit the logits:

Repetition / frequency / presence penalty: lower the score of tokens already generated, to reduce loops. Frequency scales with count; presence is a flat penalty for any prior occurrence.
Logit bias: add/subtract a fixed amount for specific token ids (force or ban words).
Bad words / stop: hard-ban sequences.

All of these are logits processors — the pluggable pre-sampling hook. That same hook is how structured output (Phase 12) masks illegal tokens to -inf. One clean abstraction, many uses.

Step 3: The batching challenge (why this is a systems problem)

In one decode step you might have 256 requests, each with its own temperature, top-p, penalties. You can't loop in Python (too slow on the hot path). So vLLM packs per-request params into tensors aligned with the batch and applies vectorized, branch-free masked ops — every row uses its own settings in one kernel. That's the real engineering: not the math of top-p, but doing top-p for a heterogeneous batch in one GPU pass (vllm/v1/sample/ops/topk_topp_sampler.py).
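
To make "per-row params, no branching" concrete, the following numpy sketch applies a different k and p to every row with one sort and a few masked array ops. It is an assumption-laden illustration of the idea, not a transcription of topk_topp_sampler.py; the function name is hypothetical.

```python
import numpy as np

def batched_top_k_top_p(logits, k, p):
    """logits: (B, V) floats; k: (B,) ints in [1, V]; p: (B,) floats in (0, 1]."""
    B, V = logits.shape
    order = np.argsort(-logits, axis=1, kind="stable")     # descending per row
    s = np.take_along_axis(logits, order, axis=1)
    ranks = np.arange(V)[None, :]
    # top-k: each row compares ranks against its own k (broadcast, no loop).
    s = np.where(ranks < k[:, None], s, -np.inf)
    # top-p on the survivors: softmax per row, then an exclusive cumulative sum.
    e = np.exp(s - s[:, :1])                               # column 0 is the row max
    probs = e / e.sum(axis=1, keepdims=True)
    exclusive = np.cumsum(probs, axis=1) - probs
    s = np.where(exclusive < p[:, None], s, -np.inf)
    # Scatter the masked values back to their original token positions.
    out = np.empty_like(logits)
    np.put_along_axis(out, order, s, axis=1)
    return out

logits = np.log(np.array([[0.05, 0.5, 0.3, 0.15],
                          [0.3, 0.05, 0.15, 0.5]]))
masked = batched_top_k_top_p(logits, k=np.array([1, 4]), p=np.array([1.0, 0.7]))
print(np.isfinite(masked))
```

Row 0 behaves as greedy (k=1) while row 1 runs pure nucleus sampling (k=V disables top-k), yet both go through identical array operations. "Neutral" parameter values take the place of if-statements.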

Step 4: Parallel sampling and beam search

Parallel sampling (n>1): produce n independent completions for one prompt. The prompt is processed once; the n samples share the prompt's KV blocks via prefix caching (Phase 2/3) and diverge from the first sampled token onward. A beautiful reuse of paging.
Beam search: keep the top beam_width partial sequences by cumulative log-prob, expanding and pruning each step. It's awkward in a continuous-batching engine (beams branch and die, changing the batch shape), so vLLM handles it specially rather than as plain sampling.

The invariants to memorize

Order: penalties/bias → temperature → top-k → top-p/min-p → sample.
Greedy = temperature 0 = argmax (deterministic).
Everything is vectorized across a heterogeneous batch with per-row params.
Logits processors are the pre-sampling hook (penalties, bias, grammar masks).
n>1 shares the prompt's KV via prefix caching; beam search is the special case.

What you'll do

Read: 01-deep-dive.md — Sampler.forward, the top-k/top-p ops, penalties, and the logits-processor framework, line-anchored.
Build: 02-mini-build.md — add min-p, repetition penalty, and a logits-processor pipeline.
Labs (see labs/README.md; recommended order 01 → 04 → 03 → 02):
- lab-01-sampling-ops [CPU-OK] — implement temperature/top-k/top-p/min-p + repetition penalty + a logits-processor hook; pin their effects with tests.
- lab-02-parallel-sampling [GPU-OPT] — run n>1 on real vLLM; observe shared prompt KV (captured output).
- lab-03-beam-search [CPU-OK] — build beam search and spring the garden-path trap where greedy provably loses; EOS-finishes-a-beam; why V1 evicted beams from the engine core.
- lab-04-seeded-rng-batch-invariance [CPU-OK] — per-request generators: prove a seeded request's tokens survive any batch neighbors (and watch the shared-RNG version fail).
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.

Phase 09 — Deep Dive: the batched sampler

Paths relative to upstream/ at v0.22.1 @ 0decac0.

vllm/v1/sample/sampler.py            the batched Sampler (orchestrates the pipeline)
vllm/v1/sample/ops/topk_topp_sampler.py   vectorized top-k/top-p
vllm/v1/sample/ops/penalties.py      repetition/frequency/presence penalties
vllm/v1/sample/ops/bad_words.py      banned sequences
vllm/v1/sample/logits_processor/     the pluggable pre-sampling hook (builtin/interface/state)
vllm/v1/sample/metadata.py           SamplingMetadata: per-request params packed into tensors
vllm/sampling_params.py              SamplingParams (the user-facing knobs)

1. `SamplingParams` — the knobs

vllm/sampling_params.py:168 — class SamplingParams. Fields: temperature (:205, default 1.0), top_p (:209), top_k, min_p, penalties, n (parallel samples), seed, logit_bias, max_tokens, stop conditions. mini_vllm/sampler.py:SamplingParams is a faithful subset (temperature/top_k/top_p/seed/max_tokens/ignore_eos).

2. `Sampler.forward` — the pipeline

vllm/v1/sample/sampler.py:20 class Sampler, :67 def forward. Read it; the order is the guide's pipeline:

logits processors / penalties edit the logits (repetition penalty, bad-words, logit bias, and the structured-output grammar mask — Phase 12).
apply_temperature (:223) divides by per-request temperature.
top-k / top-p truncation (ops/topk_topp_sampler.py).
sample (:238) draws the token (argmax for greedy rows, multinomial for the rest).

The crucial detail: every step operates on the whole batch at once, with per-request params read from SamplingMetadata (metadata.py) — tensors aligned to the batch. Greedy and random requests coexist in one call; greedy rows are handled as a temperature→argmax path. There is no Python per-request loop on the hot path — that's the systems win.

3. Vectorized top-k/top-p

vllm/v1/sample/ops/topk_topp_sampler.py — applies top-k and top-p across the batch with masked sorts/cumsums, each row using its own k/p. (There's also a Triton variant, topk_topp_triton.py, for speed.) Your mini_vllm/sampler.py _apply_top_k/_apply_top_p do the single-row version; the real challenge is doing it for 256 different (k,p) at once without branching.

4. Penalties

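vllm/v1/sample/ops/penalties.py — given the tokens generated so far (and the prompt), lowers the logits of tokens that have already appeared: the repetition penalty rescales them (divide positive logits, multiply negative ones), while the frequency and presence penalties subtract from them. Needs per-request output token histories, threaded through SamplingMetadata.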

5. Logits processors — the hook everything uses

vllm/v1/sample/logits_processor/:

interface.py — the LogitsProcessor contract (transform logits in place given state).
builtin.py — the built-in processors (min-p, logit bias, etc.).
state.py — per-request state management across steps.

This is the seam structured output (Phase 12) plugs into: a grammar produces a per-step bitmask of allowed tokens, applied as a logits processor that sets illegal tokens to -inf before sampling. Penalties, bias, and grammar masks all compose at this one well-defined point.

6. Parallel sampling

vllm/v1/engine/parallel_sampling.py — manages n>1: it expands one request into N child sequences that share the prompt's KV (prefix caching, Phase 2/3) and diverge after the first sampled token. Beam search has its own handling (it changes the active set each step, unlike plain sampling).

Reading checklist

Sampler.forward — recite the pipeline order.
Where do per-request params live, and why packed into tensors (not a Python loop)?
topk_topp_sampler.py — how is heterogeneous-batch top-p done branch-free?
The LogitsProcessor interface — how does Phase 12's grammar mask reuse it?
parallel_sampling.py — how does n>1 reuse prefix caching?

Now build it: 02-mini-build.md, then the labs.

Phase 09 — Mini-Build: a sampling pipeline with logits processors

You already have mini_vllm/sampler.py (greedy, temperature, top-k, top-p). This phase adds the three things real engines need: min-p, a repetition penalty, and a logits-processor hook, the pluggable pre-sampling stage that penalties and structured output (Phase 12) ride.

The task (lab-01)

In lab-01-sampling-ops implement, in numpy:

apply_min_p(logits, min_p) — keep tokens with prob ≥ min_p × max_prob, mask the rest.
apply_repetition_penalty(logits, generated_token_ids, penalty) — divide the positive logits of already-generated tokens by the penalty and multiply the negative ones, so repeats become less likely in both cases.
a LogitsProcessor protocol: a callable (logits, context) -> logits, and a Pipeline that runs a list of processors in order.
sample(logits, params, generated, processors) — apply processors → penalty → temperature → top-k → top-p/min-p → sample, in that order.

This mirrors Sampler.forward (sampler.py:67) and the LogitsProcessor framework (logits_processor/interface.py). The pipeline order is the contract.

Why a logits-processor hook (not just hardcoded knobs)?

Because the same mechanism serves penalties, logit bias, bad-words, and grammar masks (Phase 12). Build it once as "a function that edits logits at a defined point" and structured output becomes "just another processor that masks illegal tokens to -inf." You'll literally reuse this pipeline in Phase 12.
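
A minimal sketch of that hook, assuming the (logits, ctx) -> logits shape described in the task list. Pipeline, ban_tokens, and logit_bias here are illustrative stand-ins, not the reference solution and not vLLM's LogitsProcessor classes.

```python
import numpy as np
from typing import Callable, Dict, Iterable, List

# Assumed contract: a processor reads ctx and returns new logits.
LogitsProcessor = Callable[[np.ndarray, Dict], np.ndarray]

class Pipeline:
    """Runs processors in list order; the order is part of the API."""
    def __init__(self, processors: List[LogitsProcessor]):
        self.processors = list(processors)

    def __call__(self, logits: np.ndarray, ctx: Dict) -> np.ndarray:
        for proc in self.processors:
            logits = proc(logits, ctx)
        return logits

def ban_tokens(banned_ids: Iterable[int]) -> LogitsProcessor:
    banned = np.asarray(list(banned_ids), dtype=int)
    def proc(logits, ctx):
        out = logits.copy()              # never mutate the caller's row
        out[banned] = -np.inf            # probability exactly 0 after softmax
        return out
    return proc

def logit_bias(bias: Dict[int, float]) -> LogitsProcessor:
    def proc(logits, ctx):
        out = logits.copy()
        for tok, b in bias.items():
            out[tok] += b
        return out
    return proc

pipe = Pipeline([logit_bias({2: 5.0}), ban_tokens([2])])
raw = np.zeros(4)
out = pipe(raw, {"generated": []})
print(out)
```

A grammar mask fits the same shape: it is ban_tokens with a banned set recomputed each step from state carried in ctx.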

Definition of done

pytest phase-09-sampling-and-decoding-algorithms/labs -q

Tests pin: top-k restricts support to the argmax when k=1; top-p keeps the nucleus; min-p cutoff is confidence-relative; repetition penalty lowers a repeated token's probability; a banning logits processor makes a token unsamplable.

Map to the real engine

| your numpy | real vLLM |
| --- | --- |
| pipeline order | `Sampler.forward` (`sampler.py:67`) |
| `apply_min_p`, top-k/p | `ops/topk_topp_sampler.py` (vectorized over the batch) |
| repetition penalty | `ops/penalties.py` |
| `LogitsProcessor` + `Pipeline` | `logits_processor/{interface,builtin,state}.py` |
| a banning processor | `ops/bad_words.py` + the grammar mask (Phase 12) |

Phase 09 Labs — Sampling & Decoding Algorithms

Four labs on the last centimeter of inference: turning logits into tokens, at production grade. The arc: build the full per-request pipeline with its extension hook (lab-01), add the state that makes sampling reproducible under batching (lab-04), meet the search alternative and its garden-path motivation (lab-03), then watch parallel sampling ride three phases of memory machinery on real hardware (lab-02).

Recommended order: 01 → 04 → 03 → 02. (Directory numbers predate labs 03–04.) CPU labs follow the standard contract — starter.py (your work), solution.py (reference), test_lab.py (the spec); default runs the solution, LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-09-sampling-and-decoding-algorithms/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-09-sampling-and-decoding-algorithms/labs/lab-01-sampling-ops -q

Labs

lab-01-sampling-ops `[CPU-OK]`

The production pipeline: custom processors → repetition penalty → temperature → top-k → top-p → min-p → draw, with each placement justified (the order is a theorem, not a convention) and the logits-processor hook that Phase 12's grammar masking rides. Includes min-p's confidence-relative cutoff and the divide-positive/multiply-negative penalty asymmetry. Skills: pipeline order as API; the hook pattern; why penalties need history.

lab-02-parallel-sampling `[GPU-OPT]`

n=4 on real vLLM: the prompt prefills once, all samples share its KV blocks (ref_cnt=4, 75% hit rate = the pioneer effect with n as denominator), diverging from the first sampled token. The cheapest diversity money can buy, priced exactly. Annotated capture included. Skills: the one-prompt-n-tails cost model; self-consistency economics; n vs separate-requests vs beam search.

lab-03-beam-search `[CPU-OK]`

Sequence-level search: build greedy and beam decoding, then spring the garden-path trap — a four-probability fixture where greedy's local optimum (joint 0.31) loses to beam's [B, C] (0.36), provably. EOS-finishes-a-beam bookkeeping, the width-1 = greedy identity, and why V1 evicted beams from the engine core. Skills: search vs sampling; log-prob scoring; length bias; probability ≠ quality (degeneration).

lab-04-seeded-rng-batch-invariance `[CPU-OK]`

The reproducibility contract: a seeded request's tokens must not depend on its batch neighbors. Build the per-request-generator sampler, prove invariance with 0/1/5 interleaved neighbors — and watch the natural shared-RNG implementation fail the same scenario (the control test ships with the lab). Skills: randomness as private state; continuity vs re-seeding; isolation claims need broken controls; the kernel layer of nondeterminism.
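
The contract can be sketched in a few lines of numpy. This is a toy under stated assumptions (uniform token probabilities, hypothetical function names), not the lab's reference solution: one version gives the seeded request a private generator, the other makes it share a stream with its neighbors.

```python
import numpy as np

def run_private(seed, n_neighbors, steps, probs):
    """Seeded request owns its generator; neighbors draw from their own."""
    seeded = np.random.default_rng(seed)
    neighbors = [np.random.default_rng() for _ in range(n_neighbors)]
    tokens = []
    for _step in range(steps):
        for g in neighbors:
            g.choice(len(probs), p=probs)          # neighbor traffic
        tokens.append(int(seeded.choice(len(probs), p=probs)))
    return tokens

def run_shared(seed, n_neighbors, steps, probs):
    """Control: a single generator serves the whole batch."""
    rng = np.random.default_rng(seed)
    tokens = []
    for _step in range(steps):
        tokens.append(int(rng.choice(len(probs), p=probs)))
        for _nb in range(n_neighbors):
            rng.choice(len(probs), p=probs)        # neighbors consume the stream
    return tokens

probs = np.full(8, 1 / 8)
alone = run_private(7, 0, 16, probs)
crowded = run_private(7, 5, 16, probs)
print(alone == crowded)                            # private state: invariant
print(run_shared(7, 0, 16, probs), run_shared(7, 5, 16, probs))
```

With private generators the seeded tokens are identical for any number of neighbors. In the shared version the neighbors advance the stream between the seeded request's draws, so its later tokens generally change with batch composition.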

What you can do after this phase

Hold the entire logits-to-token path in your head, in order, with reasons; extend it safely through the processor hook (and recognize Phase 12 as one more processor); deliver seeded reproducibility under batching and explain what it does and doesn't promise; choose between sampling, beam search, and best-of-n from their actual cost and quality shapes; and price candidate-generation workloads (self-consistency, RLHF sampling) from the sharing arithmetic. Phase 10 scales the engine across GPUs; the per-request state you isolated here is exactly what has to survive the trip.

Lab 09-01 — Sampling Ops & Logits Processors `[CPU-OK]`

Phase 0 lab-03 built the four classic knobs. This lab builds the production pipeline: penalties that read generation history, the confidence-relative min_p cutoff, and — the architecturally important part — the logits-processor hook, a pluggable (logits, ctx) → logits stage that turns the sampler from a fixed function into an extension point. That hook is how structured output injects its grammar mask (Phase 12), how logit_bias and bad-words lists work, and how every "force the model to/never let the model…" feature you'll ever build gets in. The ordering of the stages is the lab's quiet theorem: each transform assumes the one before it, and reorderings produce different samplers, not equivalent ones.

Why this lab exists

Two reasons, one per half of the lab. The ops half: min_p and repetition penalties are the knobs real traffic actually exercises (every chat frontend ships a repetition penalty; min_p has become the open-model community's favorite truncation), and both have semantics subtle enough that implementing them is the only reliable way to stop mis-explaining them. min_p's cutoff scales with the model's confidence — strict when one token dominates, permissive when the distribution is flat — which is the adaptivity top-k lacks and top-p only approximates. The repetition penalty's divide-positive, multiply-negative asymmetry is the detail everyone forgets: a naive uniform division would boost already-negative logits.

The hook half is about architecture. vLLM cannot ship every conceivable logits intervention, so it ships an interface; the pipeline you build — an ordered list of (logits, ctx) → logits callables, run before the standard knobs — is that interface in miniature. After this lab, Phase 12's grammar-constrained decoding is "a logits processor that masks non-grammatical tokens," full stop; the mystery is relocated to building the mask fast, which is where it belongs.

Background: the pipeline and why its order is fixed

custom processors → repetition penalty → temperature → top-k → top-p → min-p → softmax → draw

Walk the order backwards and each placement explains itself:

Truncations (top-k/p/min-p) come after temperature because they're defined over the distribution you'll actually sample from — truncating pre-temperature evaluates the nucleus on the wrong distribution (and yes, top-k commutes with monotone temperature but top-p does not: temperature changes the probabilities the cumulative sum is built from).
Penalties come before temperature: they're corrections to the model's raw scores, not to the sampling distribution; applying them post-temperature would make the penalty's strength depend on T — two knobs tangled into one.
Custom processors run first: a hard constraint (grammar mask, banned token) must shape everything downstream — a token masked to −∞ before truncation can never sneak back, no matter what k/p/min-p do. Mask after truncation and you can end up with an empty candidate set (every surviving token banned) — the all-states-are-−∞ crash class that constrained-decoding implementations know well.
Order within truncations (k → p → min-p) matches vLLM's; they don't commute either, and matching the engine's order is what makes your sampler's outputs comparable to its.
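
Two of these claims are easy to check numerically. The sketch below uses a hypothetical four-token distribution and an illustrative nucleus_size helper: it shows top-p selecting different nuclei depending on whether temperature has been applied, top-k selecting the same tokens either way, and a ban applied after truncation emptying the candidate set.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def nucleus_size(logits, p):
    """Number of tokens top-p keeps (boundary token included)."""
    probs = np.sort(softmax(logits))[::-1]
    exclusive = np.cumsum(probs) - probs
    return int(np.sum(exclusive < p))

logits = np.log(np.array([0.5, 0.3, 0.15, 0.05]))
T, p = 2.0, 0.7

# Temperature first: the nucleus is measured on the flattened distribution.
size_temp_first = nucleus_size(logits / T, p)
# Truncation first: the nucleus is measured on the raw distribution.
size_trunc_first = nucleus_size(logits, p)

# Top-k only depends on rank, and dividing by T > 0 preserves rank.
same_top2 = np.array_equal(np.argsort(-logits)[:2], np.argsort(-(logits / T))[:2])

# Ban after truncation: top-k=1 leaves one candidate, then the ban removes it.
late = np.where(logits >= logits.max(), logits, -np.inf)
late[np.argmax(logits)] = -np.inf
empty_after_late_ban = not np.isfinite(late).any()

print(size_temp_first, size_trunc_first, same_top2, empty_after_late_ban)
```

Banning first would have let top-k pick the best remaining token instead, which is the practical reason hard constraints sit at the front of the pipeline.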

Files

starter.py — the five ops, the Pipeline, and sample (the full ordered assembly). Your work.
solution.py — reference.
test_lab.py — each op's exact semantics plus the ban-token processor pattern.

Run

LAB_IMPL=starter pytest phase-09-sampling-and-decoding-algorithms/labs/lab-01-sampling-ops -q
pytest phase-09-sampling-and-decoding-algorithms/labs/lab-01-sampling-ops -q   # reference

What to implement

The ops from Phase 0 lab-03 (temperature, top-k, top-p) plus the three new pieces: apply_min_p (threshold = min_p × max_prob, computed on the current distribution), apply_repetition_penalty (divide positive logits by the penalty, multiply negative ones — both directions push down; apply once per distinct token, not per occurrence), and Pipeline/sample (the assembly in the order above; greedy short-circuits after penalties — penalties do apply to greedy, a detail people miss: a repetition penalty that only worked at temperature > 0 would be a different feature).
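
The sign asymmetry is easiest to see on a concrete row. This is a sketch of the rule as described above, not the lab's reference solution; compare your own implementation against the tests rather than against this.

```python
import numpy as np

def apply_repetition_penalty(logits, generated_token_ids, penalty):
    """Divide positive logits by the penalty, multiply negative ones."""
    out = np.asarray(logits, dtype=float).copy()
    # np.unique: the penalty applies once per distinct token, not per occurrence.
    seen = np.unique(np.asarray(generated_token_ids, dtype=int))
    if seen.size == 0:
        return out
    vals = out[seen]
    out[seen] = np.where(vals > 0, vals / penalty, vals * penalty)
    return out

logits = np.array([2.0, -2.0, 1.0])
penalized = apply_repetition_penalty(logits, [0, 0, 1], penalty=2.0)
naive = logits.copy()
naive[[0, 1]] /= 2.0                     # uniform division, the common mistake
print(penalized)                         # token 1 goes from -2.0 down to -4.0
print(naive)                             # token 1 goes from -2.0 UP to -1.0
```

Token 0 appears twice in the history but is penalized once, and the untouched token 2 keeps its logit, so the penalty changes relative scores only for tokens that were actually generated.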

What the tests prove

| Test | What it pins |
| --- | --- |
| top-k = 1 ⇒ argmax only | The truncation-to-greedy limit |
| top-p keeps exactly the nucleus | The inclusive-boundary semantics (Phase 0 lab-03's footgun, still armed) |
| min-p cutoff scales with max prob | The confidence-relative behavior that distinguishes it from a fixed floor |
| repetition penalty lowers a repeated token | Both signs handled: the divide/multiply asymmetry |
| ban-token processor ⇒ token unsamplable | The hook works, and −∞ survives the whole downstream pipeline: the Phase 12 grammar-mask pattern in one assert |

Hitchhiker's notes

The ctx dict is the processor's window into the request — here just {"generated": [...]}, upstream a richer per-request state (prompt tokens, output tokens, FSM state for grammars). The discipline that keeps the hook safe: processors read ctx and return logits; a processor that mutates shared state breaks the batched execution model (rows are processed in arbitrary order — Phase 9 lab-04's isolation lesson, one layer up).
Penalties are why samplers need history. Temperature/top-k are pure functions of the logits row; penalties read generated — meaning the production sampler carries per-request token-id state to the GPU (upstream: the penalty path in vllm/v1/sample/, with prompt-vs-output token distinction: presence_penalty, frequency_penalty, repetition_penalty — three related-but-different formulas; read them once and save yourself a support ticket).
Each stage is cheap; the sort in top-p is the expensive one (O(V log V) per row, V = 128k+). Vectorized GPU implementations care a lot — there are sort-free top-p approximations and threshold-precomputation tricks upstream. When sampling shows up in a profile (it does, at high batch), this is the line.
Processor order is API, the Phase 1 lab-05 lesson recurring: two processors (say, a grammar mask and a logit bias) don't commute either. vLLM applies user-supplied processors in list order — document yours.

Going further

Implement presence_penalty and frequency_penalty (additive, occurrence-counting — distinct from the multiplicative repetition penalty) and write the test that distinguishes all three on a token generated twice.
Build a MinTokensProcessor that masks EOS while len(generated) < min_tokens — you've now implemented the min_tokens feature from Phase 1 lab-05's going-further, as a processor, which is exactly how the engine structures it.
Property-test the pipeline: for random logits and any knob combo, assert the output distribution (a) sums to 1, (b) supports only unmasked tokens, (c) is unchanged when all knobs are neutral. Three invariants that catch most pipeline-assembly bugs.

References

upstream/vllm/v1/sample/sampler.py — the batched pipeline; find your stage order.
upstream/vllm/v1/sample/logits_processor/ — the production hook interface.
Nguyen et al., Min-p Sampling (2024) — the case for confidence-relative truncation: https://arxiv.org/abs/2407.01082
Keskar et al., CTRL (2019) — where the repetition penalty's divide/multiply form comes from (§4.1): https://arxiv.org/abs/1909.05858
Phase 0 lab-03 — the four base knobs; Phase 12 — the grammar mask that rides this lab's hook.

Lab 09-02 — Parallel Sampling Shares Prompt KV `[GPU-OPT]`

n=4 in a SamplingParams looks like syntactic sugar for "send the request four times." This lab shows why it's structurally better: the engine prefills the prompt once, and all four samples share its KV blocks via the Phase 2 ref-count machinery, diverging only from the first sampled token onward. You'll watch it happen in the logs (75% prefix-cache hit rate for n=4 — three of four samples ride the first's blocks) and connect three phases of machinery into the single cheapest way to buy output diversity.

No GPU? Don't panic. The captured run below carries the whole argument — and the arithmetic sections need no hardware at all.

Why this lab exists

Parallel sampling is quietly one of the most-used features in production: best-of-n ranking, self-consistency for reasoning (sample k chains, vote), candidate generation for RLHF and eval pipelines, "regenerate" buttons. All of them multiply output tokens while reusing one prompt — and whether that prompt is processed once or n times is often the dominant cost term (prompts routinely outweigh completions 10:1 in chat and RAG). Knowing that n>1 shares the prefill — and why, down to the block refcounts — turns "should we batch our candidates into one request?" from a guess into arithmetic you can do in your head: n=4 on a 2,000-token prompt with 100-token outputs ≈ 6,000 prompt tokens saved — roughly 3× cheaper than four independent requests at this shape.

The lab is also the phase's bridge back to the memory phases: it's the first place sampling policy (how many candidates) visibly drives memory behavior (block sharing). The diversity you buy with n is priced in KV blocks, and the discount — sharing — comes from infrastructure built three phases ago for a different feature (prefix caching). Composability like that is what good engine design looks like from the outside.

Background: one prompt, n tails

What the engine does with n=4: the frontend fans the request into 4 sequences; sequence 1 prefills the prompt, caching its full blocks (Phase 2 lab-05's eager caching at allocation); sequences 2–4 hit those blocks at admission (get_computed_blocks → touch → ref_cnt = 4 on the shared blocks) and prefill only the cache-ineligible remainder (the partial tail block + last token — the num_tokens − 1 rule). From the first sampled token, each sequence allocates its own private tail blocks and diverges — temperature 1.0 plus per-sequence RNG state (lab-04's machinery) makes the four continuations distinct.

Cost shape: prefill ≈ 1× prompt + small change (instead of 4×); KV ≈ 1× prompt + 4× outputs (instead of 4× everything); decode = 4 streams, which batch together in every step (Phase 1 lab-04's mixed batches — four rows, one weight stream, nearly free at small n per Phase 0 lab-04's bandwidth math).
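
The cost shape reduces to a few lines of arithmetic. The helper below is a hypothetical calculator, not an engine API, and it deliberately ignores the "small change" (the cache-ineligible remainder each extra sequence still prefills), so treat its outputs as approximations.

```python
def parallel_sampling_cost(prompt_len: int, output_len: int, n: int) -> dict:
    """Token counts: one n-sample request vs n separate requests without caching."""
    shared = {"prefill": prompt_len, "kv": prompt_len + n * output_len}
    separate = {"prefill": n * prompt_len, "kv": n * (prompt_len + output_len)}
    return {
        "shared": shared,
        "separate": separate,
        "prompt_tokens_saved": separate["prefill"] - shared["prefill"],
        "kv_ratio": separate["kv"] / shared["kv"],
        "expected_hit_rate": (n - 1) / n,   # first sample populates, the rest hit
    }

cost = parallel_sampling_cost(prompt_len=2000, output_len=100, n=4)
print(cost["prompt_tokens_saved"], cost["kv_ratio"], cost["expected_hit_rate"])

# The discount is a prompt-side effect: it fades as outputs dwarf the prompt.
long_outputs = parallel_sampling_cost(prompt_len=100, output_len=2000, n=4)
print(long_outputs["kv_ratio"])
```

For the 2,000-token prompt the KV footprint is 2,400 tokens instead of 8,400, while for the output-heavy shape the ratio falls close to 1.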

Requirements

uv pip install -e ".[vllm]"
huggingface-cli download facebook/opt-125m

Steps

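from vllm import LLM, SamplingParams

# Prefix caching is what lets the n samples share the prompt's KV blocks.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True, gpu_memory_utilization=0.5)
# n=4 fans one prompt into four sequences; temperature=1.0 lets them diverge.
out = llm.generate(["Write a haiku about GPUs:"],
                   SamplingParams(n=4, temperature=1.0, max_tokens=24))
# out holds one result per prompt; its .outputs list holds the n completions.
for c in out[0].outputs:
    print(repr(c.text))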

Run under VLLM_LOGGING_LEVEL=DEBUG. Three things to verify: the prompt's prefill happens once (scheduled-token counts in the logs); the prefix-cache hit rate lands at (n−1)/n-ish; the four texts genuinely differ. Then the control runs: same with enable_prefix_caching=False (watch prompt tokens 4×), and with seed set (watch the four outputs stay distinct — see the notes for why).

Captured output (real run, facebook/opt-125m, L4, vLLM 0.22.1, trimmed)

DEBUG ... Prefix cache hit rate: GPU: 75.0%     # samples 2-4 reuse sample 1's prompt blocks
'GPUs run hot / silicon dreams in parallel / fans hum all night long'
'Tensors flow like streams ...'
'Cores ignite at dawn ...'
'Threads in warps align ...'

Reading the numbers

75.0% = 3/4 — the pioneer effect with n as the denominator: sample 1 populates, samples 2–4 hit. (Phase 2 lab-03's 87.5% was 7/8; same law, different n. You can now read any of these rates as "1 − 1/cohort".)
Four distinct haikus — divergence is immediate (first sampled token) because each sequence draws independently. If two of your n samples come back identical on a short prompt, that's not a bug — peaked distributions (low temperature, strong prompt) genuinely collide; raise temperature or use distinct seeds per your diversity needs.
What the log doesn't show: the shared blocks' ref_cnt sitting at 4, and the free-on-finish path decrementing it as each sample completes — the Phase 2 lab-05 biography, now with four readers. The blocks free only when the last sample finishes; a straggler sample (one haiku that rambles to max_tokens) holds the prompt's KV for everyone. Worth knowing when n is large and outputs are long.

Hitchhiker's notes

n>1 with a seed still gives n different outputs — vLLM derives per-sequence randomness so the n samples don't collapse into n copies (which would make seeded best-of-n useless). The whole-request stream is reproducible; the sequences differ from each other. (Lab-04's per-request → per-sequence state, one level finer.)
V0 had a best_of distinct from n (generate best_of, return n by logprob); V1 simplified the API surface — ranking moved client-side. If you see best_of in older docs/code, that's the fossil. The sharing machinery is the same either way.
Self-consistency at scale: k=16 reasoning chains over a long CoT prompt is the flagship use — prompt KV once, 16 cheap decode streams, majority-vote the answers. The cost model above is why the technique is affordable at all; quote it when someone proposes 16 separate API calls instead.
Contrast with beam search (lab-03): after the fork, parallel samples never interact (no pruning, no rescoring, scheduler-trivial). Beams branch and die mid-flight, which is exactly the interaction that got beam search evicted from the V1 core. Independence is what makes n cheap to support.

Reflect

Write the exact cost ratio of n=4 vs four separate requests (with caching off) for prompt P, output L tokens: prefill (P + 3·1-ish vs 4P) and KV (P + 4L vs 4P + 4L). At what L/P does the advantage fade? (When outputs dwarf prompts — diversity's discount is a prompt-side effect.)
Four separate requests with prefix caching on get most of the same sharing (Phase 3 lab-03). What does n=4 still buy? (One API call, guaranteed same-step admission so the hit is certain rather than eviction-dependent, single response object — and intra-request sequence accounting. The mechanism is shared; the contract differs.)
Where do the n sequences' sampling states live, given they're one "request"? (Per-sequence rows in the input batch — generators, penalties' history, all of lab-01/lab-04's state, n times. "Request" is an API word; the engine schedules sequences.)
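The first Reflect question can be checked numerically. The sketch below is a simplified token-count model, not vLLM's accounting: the function name is hypothetical, it ignores block granularity and the "one extra token per sample" term, and it assumes every sample runs to the full output length.

```python
def parallel_sampling_cost(prompt_len: int, output_len: int, n: int) -> dict:
    """Compare one request with n samples against n separate requests (no prefix caching)."""
    shared_prefill = prompt_len                      # prompt is prefilled once
    shared_kv = prompt_len + n * output_len          # one prompt copy plus n private tails
    separate_prefill = n * prompt_len                # every request prefills its own copy
    separate_kv = n * (prompt_len + output_len)      # every request stores its own copy
    return {
        "prefill_ratio": shared_prefill / separate_prefill,
        "kv_ratio": shared_kv / separate_kv,
    }


# Prompt-heavy workload: the discount is large.
print(parallel_sampling_cost(prompt_len=1000, output_len=100, n=4))
# Output-heavy workload: the KV discount mostly disappears.
print(parallel_sampling_cost(prompt_len=100, output_len=1000, n=4))
```

The KV ratio is (P + nL) / (nP + nL), which tends to 1 as L/P grows; the prefill ratio stays at 1/n regardless.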

References

upstream/vllm/v1/engine/ — the n>1 fan-out in the output processor/frontend (search n= handling and parent request logic).
Phase 2 lab-05 — touch/ref_cnt: the sharing mechanics; Phase 3 lab-06 — the exact-token accounting of what sharing saves.
Wang et al., Self-Consistency Improves Chain of Thought Reasoning (2022) — the workload this feature exists for: https://arxiv.org/abs/2203.11171
vLLM docs, Sampling Parameters — n and friends: https://docs.vllm.ai/en/latest/api/inference_params.html

Lab 09-03 — Beam Search: When Greedy Is a Trap `[CPU-OK]`

Greedy decoding answers the wrong question. It maximizes each token; what you usually want is the most probable sequence — and those diverge exactly when the locally best token leads somewhere bad. Linguists call it a garden path; you'll build one: a tiny Markov model where greedy confidently takes the 0.6-probability first step into a coin flip (joint ≈ 0.31), while the humble 0.4 step leads to near-certainty (joint 0.36). Beam search — carrying the best beam_width partial hypotheses instead of one — escapes the trap, and you'll implement it properly: the candidate pool, the pruning, and the EOS-finishes-a-beam bookkeeping that real implementations get subtly wrong.

Why this lab exists

Beam search occupies a strange place in modern LLM serving: central to the API surface (use_beam_search exists, translation/summarization workloads still ask for it), algorithmically classic — and architecturally awkward for engines like vLLM, awkward enough that V1 moved it out of the core engine entirely (it's emulated at the API layer via n parallel candidates plus rescoring-style logic, precisely because per-step branch-and-prune fights the continuous-batching machine — see the notes). You can't evaluate that design decision without knowing exactly what the algorithm requires, and the way to know is to build it: the pooled expansion, the global top-beam_width cut, the finished-set handling.

The deeper lesson is the trap itself. "Greedy is myopic" is a sentence; your TRAP fixture is a proof object — four transition probabilities that make the failure exact and checkable (0.6·0.51 < 0.4·0.9). Once you've built one garden path you'll recognize the pattern everywhere it matters: why beam search wins on tasks with constrained correct answers (translation), why it hurts open-ended generation (it finds high-probability degenerate text — the famous repetition pathology), and why sampling (labs 01/04) took over for chat.

Background: search, not sampling

Everything else in this phase draws from a distribution; beam search optimizes over one. State: a set of partial hypotheses with their cumulative log-probabilities. Per step, each live beam expands by its top beam_width tokens (more children can't help — at most beam_width survive the global cut), all candidates pool, the best beam_width survive. Two details carry the correctness:

Scores are summed log-probs — products of probabilities underflow within a sentence; logs are not an optimization but a necessity.
EOS finishes a beam: a hypothesis that emits EOS leaves the live set (extending past EOS is meaningless) but stays in the final ranking against hypotheses that kept going. Forget this and short, confident answers are silently discarded — test_eos_finishes_a_beam pins it.

Width 1 collapses to greedy exactly (test_beam_width_one_is_greedy) — beam search is a strict generalization, and the test fixture deliberately avoids probability ties, because a tie tests your tie-breaker, not your algorithm (a lesson this lab's own test suite learned the hard way; see the comment in test_lab.py).
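As a rough illustration of the pool, prune, finish loop described above, here is a minimal sketch over a toy Markov model. It is not the lab's reference solution. The TRAP dictionary is a hypothetical stand-in for the lab fixture: only 0.6, 0.51, 0.4 and 0.9 come from the text, and the remaining probabilities, the token names, and the rule that a token without successors emits EOS are assumptions.

```python
import math

EOS = "<eos>"

# Hypothetical garden-path fixture: greedy prefers A (0.6) but B leads to a better joint.
TRAP = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"X": 0.51, "Y": 0.49},
    "B": {"C": 0.9, "D": 0.1},
}


def beam_search(model, start, beam_width, max_len):
    """Return (sum of log-probs, tokens) for the best hypothesis found."""
    live = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        pool = []
        for score, toks in live:
            # Assumption: a token with no listed successors emits EOS with probability 1.
            nxt = model.get(toks[-1], {EOS: 1.0})
            top = sorted(nxt.items(), key=lambda kv: kv[1], reverse=True)[:beam_width]
            for tok, p in top:
                pool.append((score + math.log(p), toks + [tok]))
        # Global cut: only the best beam_width candidates survive.
        pool.sort(key=lambda c: c[0], reverse=True)
        live = []
        for cand in pool[:beam_width]:
            # A hypothesis ending in EOS leaves the live set but stays in the ranking.
            (finished if cand[1][-1] == EOS else live).append(cand)
        if not live:
            break
    return max(finished + live, key=lambda c: c[0])


greedy_score, greedy_toks = beam_search(TRAP, "<s>", beam_width=1, max_len=2)
beam_score, beam_toks = beam_search(TRAP, "<s>", beam_width=2, max_len=2)
print(greedy_toks, math.exp(greedy_score))
print(beam_toks, math.exp(beam_score))
```

Width 1 reduces the loop to greedy decoding, so the same function serves as both sides of the comparison. This variant applies the cut before separating finished hypotheses; other implementations make different choices there.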

Files

starter.py — greedy_decode and beam_search (pool → prune → finish). Your work.
solution.py — reference.
test_lab.py — the trap (greedy falls in, beam escapes), the width-1 identity, monotonicity in width, and EOS handling.

Run

LAB_IMPL=starter pytest phase-09-sampling-and-decoding-algorithms/labs/lab-03-beam-search -q
pytest phase-09-sampling-and-decoding-algorithms/labs/lab-03-beam-search -q   # reference

What the tests prove

Test	What it pins
`test_greedy_takes_the_garden_path`	Greedy picks A (0.6) and lands at joint ≈ 0.31 — locally best, globally wrong, by construction
`test_beam_escapes_the_trap`	Width 2 finds [B, C] at 0.36, strictly beating greedy's joint — the algorithm's entire reason to exist, as an inequality
`test_beam_width_one_is_greedy`	The degenerate case: same tokens, same score to 1e-12
`test_wider_beams_never_score_worse`	Monotonicity: more width can't hurt the best score (it can only widen the searched set). Useful as a sanity property — and note what it does not say: that wider is better text (see the degeneration note)
`test_eos_finishes_a_beam`	A finished beam survives un-extended and wins the final ranking at its natural length

Hitchhiker's notes

Why beam search fights continuous batching: a beam step branches (one hypothesis becomes several sharing a prefix) and prunes (hypotheses die mid-flight). Branching wants copy-on-write KV sharing (the prefix is common — Phase 2's ref_cnt machinery handles the read side, but the diverging tails need careful block forking); pruning frees KV at unpredictable times. All solvable — V0 solved it — but it threads special cases through scheduler and cache for a feature few use, which is why V1 evicted it to the API layer (vllm/beam_search.py + the OpenAI server's emulation): the engine serves beam_width parallel sequences, the wrapper does the pool-and-prune. An instructive case study in what belongs in the core.
Length bias is real and the fix is a hack that works: summed log-probs penalize length (every token adds a negative number), so beams favor short answers; production systems divide scores by len^α (α ≈ 0.6–1.0, "length normalization") before the final ranking. Your EOS test dodges this (the short answer is also the most probable) — add normalization in Going Further and construct the case where it flips the winner.
The degeneration result (Holtzman et al. — the same paper that gave you top-p in lab-01): for open-ended text, exact high-probability sequences are repetitive and dull; beam search finds them, and quality drops as width grows. Probability and quality diverge — arguably the single most important empirical fact about decoding. Beam search survives where the output is tightly constrained (translation, ASR, structured rewriting); sampling owns everything open-ended.
Beam search is the third member of a family you've now built: greedy (argmax), sampling (draw), beam (search). They share logits and differ in the decision rule — which is why vLLM's logits-processor pipeline (lab-01) sits upstream of all three, and why grammar masking (Phase 12) composes with each of them unchanged.
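The length-bias note can be made concrete with a small sketch. The helper names and the two hypotheses below are made up for illustration; the only thing taken from the text is the score / len^α form of the normalization.

```python
import math


def length_normalized(sum_logprob: float, length: int, alpha: float) -> float:
    """Divide the summed log-prob by length**alpha (alpha = 0 means no normalization)."""
    return sum_logprob / (length ** alpha)


def pick_winner(hypotheses: dict, alpha: float) -> str:
    """hypotheses maps a name to its per-token probabilities; return the best-scoring name."""
    def score(name):
        probs = hypotheses[name]
        return length_normalized(sum(math.log(p) for p in probs), len(probs), alpha)
    return max(hypotheses, key=score)


# Hypothetical fixture: one mediocre token versus four fairly confident tokens.
HYPOTHESES = {"short": [0.5], "long": [0.8, 0.8, 0.8, 0.8]}

print(pick_winner(HYPOTHESES, alpha=0.0))  # raw sums favor the short hypothesis
print(pick_winner(HYPOTHESES, alpha=1.0))  # per-token average favors the long one
```

With α = 0 the long hypothesis pays for every extra token; with α = 1 the ranking becomes an average log-prob per token, which is the flip the Going Further exercise asks you to construct.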

Going further

Add length normalization (score / len(tokens)**alpha) to the final ranking and build a fixture where α = 0 and α = 1 disagree about the winner. You've reproduced the knob every production beam implementation exposes.
Track prefix sharing among your beams: at each step, count how many tokens of KV a real engine would share via Phase 2's blocks (common prefix length × beams). The number is large — that's the efficiency beam search loses when emulated naively as independent sequences without prefix caching, and exactly what enable_prefix_caching recovers (lab-02's mechanism, applied to beams).
Implement diverse beam search (penalize candidates already chosen by sibling groups) and watch the trap fixture: diversity trades best-score for coverage — measure both.

References

upstream/vllm/beam_search.py and the OpenAI server's beam emulation — the V1 design decision discussed above.
Holtzman et al., The Curious Case of Neural Text Degeneration (2019) — why exact search loses to sampling for open-ended text: https://arxiv.org/abs/1904.09751
Wu et al., Google's Neural Machine Translation System (2016) — §7, the length normalization formula everyone copied: https://arxiv.org/abs/1609.08144
Lab-01 — the logits pipeline all decision rules share; lab-02 — the prefix-sharing machinery beams want.

Lab 09-04 — Per-Request RNG & Batch Invariance `[CPU-OK]`

Here's a bug report you will eventually receive: "I set seed=7, temperature 1.0, and I get different outputs every time. Your API is broken." The API isn't broken — but it would be, in exactly this way, if the sampler used one shared random generator for the whole batch. Whoever shares the batch with you consumes numbers from the shared stream and shifts yours; your "seeded" request reproduces only when the entire fleet's traffic reproduces. This lab builds the fix — per-request generator state — and proves the contract that production samplers must honor: a seeded request's token stream is identical whether it runs alone or interleaved with five neighbors. The test suite includes the broken shared-RNG sampler as a control, so you see the failure, not just read about it.

Why this lab exists

Determinism under batching is one of those properties that's trivial to state, genuinely subtle to deliver, and commercially important: seeded sampling is how users build reproducible evals, how you bisect a generation bug ("same seed, same output — now change one thing"), and how A/B tests hold the noise still. And it's violated by the most natural implementation — rng = np.random.default_rng(...) at sampler scope, draw per request in batch order — which works perfectly in every single-request test and fails the moment two users share a step. Bugs that pass single-user tests and fail under concurrency are the defining bug class of serving systems; this lab is a clean specimen you can build, break, and internalize in twenty minutes.

It also completes the phase's batching story: lab-01 gave you the per-request pipeline (each request its own temperature/top-k/penalties), lab-02 showed requests sharing compute and KV, and this lab adds the last isolation boundary — randomness. The full production picture is vLLM's sampling metadata: per-request parameters, per-request generators, batched execution. Shared work, private state.

Background: randomness as private state

Three requirements, each pinned by a test:

Reproducibility: same seed + same logits stream → same tokens, across process restarts and sampler instances. (The generator must be created from the seed, deterministically.)
Continuity: a request's draws across its decode steps come from one continuing stream, so create the generator once per request, not once per step. Re-seeding every step is the sneaky variant bug: step 1 is correct, but every later step redraws the same "random" number (test_request_stream_is_stateful_not_reset constructs a uniform distribution where this is visible; on peaked distributions it hides, which is what makes it sneaky).
Isolation (batch invariance): request A's stream must be untouched by neighbors' draws. This is what per-request state buys; the shared-RNG control test shows the alternative failing.

Plus the greedy rule from Phase 0 lab-03, restated with a reason: temperature == 0 must touch no RNG at all — not "use a default seed," no draw — so greedy requests are reproducible without any seed bookkeeping, and so they don't perturb anyone else's stream either (a greedy request that consumed RNG would break a seeded neighbor's invariance — isolation cuts both ways).
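The three requirements and the greedy rule fit in a few lines. The following is one possible shape, assuming numpy's Generator API; the method names and the finish hook are illustrative choices, not the lab's reference solution.

```python
import numpy as np


class PerRequestSampler:
    """One private generator per request id; greedy requests never touch an RNG."""

    def __init__(self):
        self._gens = {}

    def sample(self, request_id, logits, temperature, seed=None):
        logits = np.asarray(logits, dtype=np.float64)
        if temperature == 0:
            return int(np.argmax(logits))  # greedy fast path: no generator is created
        gen = self._gens.get(request_id)
        if gen is None:
            # Created once per request, so later steps continue the same stream.
            gen = self._gens[request_id] = np.random.default_rng(seed)
        z = logits / temperature
        probs = np.exp(z - z.max())
        probs /= probs.sum()
        return int(gen.choice(len(probs), p=probs))

    def finish(self, request_id):
        """Drop the generator so a recycled id starts a fresh stream."""
        self._gens.pop(request_id, None)


logits = np.zeros(50)  # uniform distribution, so every draw depends on the RNG

alone = PerRequestSampler()
stream_alone = [alone.sample("A", logits, 1.0, seed=7) for _ in range(8)]

busy = PerRequestSampler()
stream_busy = []
for _ in range(8):
    busy.sample("B", logits, 1.0, seed=99)  # a neighbor draws first every step
    stream_busy.append(busy.sample("A", logits, 1.0, seed=7))

print(stream_alone == stream_busy)
```

Because request A's generator is never shared, the neighbor's draws cannot advance it; replacing the dict with a single generator is the change that breaks this property.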

Files

starter.py — PerRequestSampler: a generator dict keyed by request id, a greedy fast path, one draw per call. ~15 lines. Your work.
solution.py — reference.
test_lab.py — reproducibility, divergence across seeds, the invariance contract, the shared-RNG failure (control), greedy's RNG-free path, and stream continuity.

Run

LAB_IMPL=starter pytest phase-09-sampling-and-decoding-algorithms/labs/lab-04-seeded-rng-batch-invariance -q
pytest phase-09-sampling-and-decoding-algorithms/labs/lab-04-seeded-rng-batch-invariance -q   # reference

What the tests prove

Test	What it pins
`test_same_seed_reproduces_across_instances`	Requirement 1: the stream is a pure function of the seed
`test_different_seeds_diverge`	Seeds actually matter (a sampler that ignores its seed passes test 1 vacuously — paired tests close the loophole)
`test_batch_invariance`	The contract: A's stream with 0, 1, and 5 interleaved neighbors — identical. The neighbors even sample before A each step, the worst case for a shared stream
`test_shared_rng_breaks_batch_invariance`	The control: one global generator, same scenario — neighbors shift A's tokens. The bug, demonstrated rather than asserted
`test_greedy_ignores_seed_and_rng_state`	Temperature 0 → argmax, no RNG touched, any seed
`test_request_stream_is_stateful_not_reset`	Requirement 2: two draws match a reference generator's first two draws — not the first draw twice

The test-design pattern is worth keeping: every isolation claim ships with a broken control. "X holds" plus "here is the natural implementation where X fails" teaches reviewers what the protective code protects against — and stops the next refactorer from "simplifying" the generator dict away.

Hitchhiker's notes

Where this lives upstream: vLLM keeps per-request torch.Generator objects in its sampling state (seeded requests get their own; see the generator plumbing in upstream/vllm/v1/worker/gpu_input_batch.py and the sampler). The batched GPU sampler does the draws vectorized, but seeded rows use their private generator state — the exact structure of your dict, tensor-shaped.
What batch invariance does not promise: bitwise-identical logits. Different batch compositions change kernel tiling and reduction order (the recurring last-ulp story — Phases 3/4/6), so two near-tied tokens can flip even with perfect RNG isolation. True end-to-end batch-invariant inference requires batch-invariant kernels as well — a real, recent line of engineering work (deterministic-inference modes); RNG isolation is the necessary first floor, not the whole building. Know which layer a nondeterminism report belongs to before debugging it.
Cleanup is part of the contract: request ids recycle; a production sampler must drop a request's generator when it finishes (your dict grows forever — fine for a lab, a leak in a server). Per-request anything implies a lifecycle hook — tie it to Phase 1's reaping path mentally.
Why not one generator seeded per (request, step)? It "fixes" continuity bugs by construction but costs a generator init per token and — worse — makes the stream depend on step numbering, which shifts under speculative decoding (Phase 8: a cycle emits several tokens). Stream-per-request is the design that survives feature composition; most alternatives quietly don't.

Going further

Add finish(request_id) and a test that a recycled id with a new seed starts a fresh stream (the leak-plus-collision bug, both halves).
Vectorize: sample_batch(ids, logits_matrix, temps, seeds) doing one softmax over the batch but per-row draws from per-row generators — the actual shape of the GPU sampler. Verify batch invariance still holds (it must: that's the point of the structure).
Compose with Phase 8: simulate a speculative cycle (k draft draws + residual draws from lab 08-03) using the request's generator, and check that a request's output is invariant to whether speculation was enabled given the same accepted tokens. (It isn't, in general — spec decode consumes RNG differently. Production systems accept this; knowing why is the exercise.)

References

upstream/vllm/v1/worker/gpu_input_batch.py — per-request generator state in the input batch (search generator).
upstream/vllm/v1/sample/sampler.py — where seeded rows meet the batched sampler.
vLLM docs, Sampling Parameters — the seed field's contract: https://docs.vllm.ai/en/latest/api/inference_params.html
Thinking Machines, Defeating Nondeterminism in LLM Inference (2025) — the kernel layer of this problem (batch-invariant kernels), for the full picture: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
Phase 0 lab-03 — the greedy fast path's origin; lab-01 — the per-request pipeline this lab adds state isolation to.

Phase 09 — Exercises: Sampling & Decoding Algorithms

Warm-up (explain)

What is the pipeline order (penalties → ? → ? → ? → sample) and why does order matter?
Greedy vs temperature 0 vs top-k=1 — are these the same? When?
Top-p vs top-k vs min-p — describe each and when it adapts to model confidence.

Core (trace the code)

In Sampler.forward (sampler.py:67), where are per-request params read from, and why are they tensors rather than a Python loop?
What is a logits processor (logits_processor/interface.py)? Name three things it implements.
How does parallel sampling (parallel_sampling.py) reuse prefix caching for n>1?

Build (your lab)

In lab-01, why must repetition penalty be applied before temperature?
Add frequency and presence penalties (count-scaled vs flat) and test their difference.
Implement a logit_bias logits processor (add a constant to specified token ids) and verify a strongly biased token dominates.

Design (staff-level)

You must apply 256 different (temperature, top_p, penalties) in one decode step. Sketch the data layout and why a Python loop is unacceptable on the hot path.
A user reports repetitive loops at temperature 0. What knobs help, and what's the tradeoff of each (penalty too high degrades quality)?
Beam search is requested for a production endpoint. Explain why it's awkward in continuous batching and how you'd bound its cost.

Self-grading

Exercises 4–6 (Core) and 10–12 (Design) are interview-grade. Could you whiteboard the batched pipeline and name the files? If not, re-read 01-deep-dive.md.

Phase 09 — Interview Questions: Sampling & Decoding Algorithms

Q1. Walk through the sampling pipeline.

Model answer

Logits → logits processors (penalties, logit bias, bad-words, grammar mask) → temperature scaling → top-k truncation → top-p/min-p truncation → sample (argmax for greedy rows, multinomial otherwise). Order matters: penalties edit raw logits, temperature reshapes, top-k/p prune the support, then you draw. (Sampler.forward, sampler.py:67.)

Q2. How do you apply different sampling params per request in one batched kernel?

Model answer

Pack per-request params (temperature, top_k, top_p, penalties, seeds) into tensors aligned with the batch (SamplingMetadata), then apply vectorized, branch-free masked ops so each row uses its own settings in one GPU pass. Greedy rows go through a temperature→argmax path. No Python per-request loop on the hot path — that's the systems challenge, not the math.
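To make "tensors aligned with the batch" concrete, here is a numpy sketch of per-row temperature and top-k applied without a Python loop over requests. The function name and the conventions (temperature 0 marks a greedy row, top_k 0 disables truncation) are assumptions for illustration and are not claimed to match vLLM's SamplingMetadata layout.

```python
import numpy as np


def batched_temperature_topk(logits, temperature, top_k):
    """logits: (B, V); temperature: (B,), 0 = greedy row; top_k: (B,), 0 = disabled."""
    batch, vocab = logits.shape
    greedy = temperature == 0
    # Greedy rows divide by 1 so no row divides by zero; argmax is taken later for them.
    safe_t = np.where(greedy, 1.0, temperature)
    scaled = logits / safe_t[:, None]
    # Per-row k, with "disabled" mapped to the full vocabulary.
    k = np.where(top_k <= 0, vocab, np.minimum(top_k, vocab))
    sorted_desc = -np.sort(-scaled, axis=1)
    kth_value = sorted_desc[np.arange(batch), k - 1]  # (B,) per-row cutoff
    # Ties with the cutoff are kept, so a row may keep more than k tokens.
    masked = np.where(scaled >= kth_value[:, None], scaled, -np.inf)
    return masked, greedy


logits = np.array([[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]])
masked, greedy = batched_temperature_topk(
    logits, temperature=np.array([1.0, 0.0]), top_k=np.array([2, 0])
)
print(masked)
print(greedy)
```

Every row is processed by the same array operations; the per-request differences live entirely in the parameter vectors.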

Q3. top-k vs top-p vs min-p?

Model answer

top-k keeps a fixed number of highest-prob tokens; top-p (nucleus) keeps the smallest set whose cumulative prob ≥ p (adaptive — few when confident, many when unsure); min-p keeps tokens with prob ≥ min_p × max_prob (a confidence-relative floor). top-p and min-p adapt to the distribution's shape; top-k doesn't.
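The difference in adaptivity is easy to see on two toy distributions. This is a single-row sketch with hypothetical helper names; the two probability vectors are made up for the comparison.

```python
import numpy as np


def top_k_mask(probs, k):
    """Keep the k highest-probability tokens."""
    keep = np.zeros_like(probs, dtype=bool)
    keep[np.argsort(-probs, kind="stable")[:k]] = True
    return keep


def top_p_mask(probs, p):
    """Keep the smallest prefix of the sorted tokens whose cumulative mass reaches p."""
    order = np.argsort(-probs, kind="stable")
    cum = np.cumsum(probs[order])
    n_keep = int(np.searchsorted(cum, p, side="left")) + 1
    keep = np.zeros_like(probs, dtype=bool)
    keep[order[:n_keep]] = True
    return keep


def min_p_mask(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    return probs >= min_p * probs.max()


confident = np.array([0.9, 0.05, 0.03, 0.02])
unsure = np.array([0.25, 0.25, 0.25, 0.25])

for name, dist in [("confident", confident), ("unsure", unsure)]:
    print(
        name,
        int(top_k_mask(dist, 2).sum()),
        int(top_p_mask(dist, 0.8).sum()),
        int(min_p_mask(dist, 0.1).sum()),
    )
```

top-k keeps two tokens in both cases, while top-p and min-p keep one token for the confident distribution and all four for the uncertain one.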

Q4. What is a logits processor and why is it the right abstraction?

Model answer

A hook that transforms logits at a defined point before sampling. It cleanly composes penalties, logit bias, bad-words, and — crucially — structured-output grammar masks (Phase 12), all without special-casing the sampler. Build it once and constrained decoding becomes "a processor that sets illegal tokens to -inf." (logits_processor/interface.py.)
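A toy version of the abstraction shows why it composes. The callable-based interface below is a hypothetical simplification chosen for brevity; it is not the interface defined in logits_processor/interface.py.

```python
from typing import Callable, Dict, Sequence

import numpy as np

# Hypothetical interface: a processor maps one row of logits to a new row.
LogitsProcessor = Callable[[np.ndarray], np.ndarray]


def logit_bias(bias: Dict[int, float]) -> LogitsProcessor:
    """Add a constant to the logits of selected token ids."""
    def apply(logits: np.ndarray) -> np.ndarray:
        out = logits.copy()
        for token_id, delta in bias.items():
            out[token_id] += delta
        return out
    return apply


def allow_only(allowed: Sequence[int]) -> LogitsProcessor:
    """Grammar-style mask: every token outside the allowed set becomes -inf."""
    def apply(logits: np.ndarray) -> np.ndarray:
        out = np.full_like(logits, -np.inf)
        idx = list(allowed)
        out[idx] = logits[idx]
        return out
    return apply


def run_pipeline(logits: np.ndarray, processors: Sequence[LogitsProcessor]) -> np.ndarray:
    """Apply processors in order; the sampler only ever sees the final logits."""
    for proc in processors:
        logits = proc(logits)
    return logits


raw = np.array([1.0, 2.0, 3.0, 4.0])
final = run_pipeline(raw, [logit_bias({0: 10.0}), allow_only([0, 1])])
print(final, int(np.argmax(final)))
```

The sampler needs no knowledge of either processor, which is the property that lets constrained decoding arrive later as just another entry in the list.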

Q5. How does `n>1` parallel sampling work efficiently?

Model answer

The prompt is prefilled once; the N samples share its KV blocks via prefix caching (Phase 2/3) and diverge only after the first sampled token, each carrying its own RNG/params. So N completions cost ~one prefill plus N decodes, not N full requests. (parallel_sampling.py.) Beam search does not get this for free: it branches and prunes the active set every step, so sharing needs block forking and mid-flight frees rather than a single fork at the prompt boundary.

Rapid-fire

Greedy = ? temperature 0 = argmax.
Pipeline order? penalties → temperature → top-k → top-p/min-p → sample.
Per-request params live in? SamplingMetadata (tensors).
The pre-sampling hook? logits processors.
n>1 reuses? prefix caching (shared prompt KV).

Phase 09 — Cheatsheet: Sampling & Decoding Algorithms

The one-liner

Logits → pick a token. The pipeline (penalties → temperature → top-k → top-p/min-p → sample) runs vectorized across a heterogeneous batch, every row with its own params.

The knobs

greedy = T=0 = argmax (deterministic)
temperature T: <1 sharper, >1 flatter
top-k: keep k highest; top-p: keep nucleus (cum prob ≥ p); min-p: keep prob ≥ min_p × max_prob
penalties: repetition (multiplicative) / frequency (count-scaled) / presence (flat); logit bias; bad-words
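The three penalty knobs differ in how they use the token history. The sketch below uses the commonly used formulas and, as a simplifying assumption, counts only previously generated tokens; the function name is hypothetical.

```python
import numpy as np


def apply_penalties(logits, output_token_ids, repetition=1.0, frequency=0.0, presence=0.0):
    """Return penalized logits for one row, given the tokens generated so far."""
    logits = np.asarray(logits, dtype=np.float64).copy()
    ids = np.asarray(output_token_ids, dtype=np.int64)
    counts = np.bincount(ids, minlength=logits.shape[0])
    seen = counts > 0
    # Repetition: multiplicative. Positive logits shrink, negative logits move further down.
    logits[seen] = np.where(
        logits[seen] > 0, logits[seen] / repetition, logits[seen] * repetition
    )
    # Frequency: subtract in proportion to how often the token appeared.
    logits -= frequency * counts
    # Presence: subtract a flat amount once the token has appeared at all.
    logits -= presence * seen
    return logits


history = [0, 0, 1]  # token 0 appeared twice, token 1 once, token 2 never
row = [2.0, -2.0, 1.0]
print(apply_penalties(row, history, repetition=2.0))
print(apply_penalties(row, history, frequency=0.5))
print(apply_penalties(row, history, presence=0.5))
```

Token 0 is hit twice as hard as token 1 by the frequency penalty and equally hard by the presence penalty, which is the count-scaled versus flat distinction.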

Logits processors

The pluggable pre-sampling hook. One mechanism for penalties, bias, bad-words, AND grammar masks (Phase 12: illegal tokens → -inf). logits_processor/{interface,builtin,state}.py.

Batching

Per-request params packed into tensors (SamplingMetadata); masked branch-free ops apply each row's settings in one pass. No Python loop on the hot path.

Parallel sampling & beam search

n>1: one prefill, N samples share prompt KV (prefix caching), diverge after token 1 (parallel_sampling.py). Beam search: top-N partial seqs by cum log-prob; awkward in continuous batching (the active set branches and prunes every step), so V1 emulates it at the API layer (vllm/beam_search.py) instead of in the core engine.

Key upstream

v1/sample/sampler.py:20 Sampler · :67 forward · :223 apply_temperature · :238 sample
v1/sample/ops/topk_topp_sampler.py · ops/penalties.py · ops/bad_words.py
v1/sample/logits_processor/ · v1/sample/metadata.py · sampling_params.py:168

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

Phase 10 — The Hitchhiker's Guide to Distributed Inference

← Phase 09 · Course home · Phase 11 →

Don't Panic

A big model doesn't fit on one GPU, or you want it to run faster than one GPU can. So you split the work across several GPUs. The only question is how you split — and there are a few distinct ways, each with its own pattern of GPU-to-GPU chatter. This phase is "which split when, and what crosses the wires." Get it wrong and half your GPUs sit idle talking to each other; get it right and you serve models no single GPU could hold.

A useful way to picture a model: a tall stack of layers, each layer a big multiplication. Now imagine a team of GPUs working on it. There are five ways to divide the labor:

TP (tensor parallel):   split EACH layer's math across GPUs   (everyone works on every token)
PP (pipeline parallel): give each GPU some LAYERS             (token flows GPU0 → GPU1 → GPU2)
DP (data parallel):     give each GPU a full COPY             (split the USERS across copies)
EP (expert parallel):   put different MoE EXPERTS on each GPU  (route tokens to their expert's GPU)
CP (context parallel):  split ONE long sequence's CONTEXT     (each GPU holds part of the history)

You'll mostly reason about TP and PP (the big two), so we go deepest there.

Step 1: A team analogy

Imagine translating a huge book with a team:

Tensor parallelism (TP) — everyone works on the same page at once, each person doing part of the work on that page, then they combine notes before moving on. Fast per page, but they have to talk constantly (combine notes after every step). Only works if they're in the same room (fast links — NVLink inside one machine).
Pipeline parallelism (PP) — an assembly line: person 1 does chapters 1–3, hands off to person 2 for chapters 4–6, etc. Little talking (just hand the page along), works across rooms (across machines), but person 2 is idle until person 1 finishes the first page (a bubble).
Data parallelism (DP) — everyone has their own copy of the whole book and translates different readers' requests. No coordination on the work itself; you just send each reader to whoever's free. Scales throughput, needs the model to fit on one GPU.

Step 2: Tensor parallelism, concretely (the one to really get)

Every layer is essentially y = x · W (a matrix multiply). TP splits W across GPUs. There are two flavors, and the clever part is how they pair up.

Column-parallel — split W by output columns. Each GPU computes part of the output:

GPU0 computes y[:, left half]      GPU1 computes y[:, right half]
result: glue the halves together (an "all-gather")

Row-parallel — split W by input rows, and split the input too. Each GPU computes a partial of the whole output, and you add them up:

GPU0: y0 = x[:, left] · W[left, :]    GPU1: y1 = x[:, right] · W[right, :]
result: y = y0 + y1   (an "all-reduce" — everyone shares and sums)

The trick vLLM uses: in a transformer block, do the first matmul column-parallel and the second row-parallel. The column-parallel output stays split (no gluing needed), feeds straight into the row-parallel input, and you pay just one all-reduce at the end of the block instead of two communications. You'll implement exactly this in lab-01 and prove the multi-GPU result equals the single-GPU one, up to floating-point rounding in the order of the final sum.
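The column-then-row pairing can be simulated on a CPU by slicing weight matrices. This is a sketch in the guide's y = x · W convention, with ReLU standing in for the block's activation and a Python sum standing in for the all-reduce; the function name and shapes are illustrative.

```python
import numpy as np


def tp_mlp(x, w_up, w_down, num_ranks):
    """x: (tokens, d); w_up: (d, h); w_down: (h, d). Each loop iteration plays one rank."""
    up_shards = np.split(w_up, num_ranks, axis=1)      # column-parallel: split output columns
    down_shards = np.split(w_down, num_ranks, axis=0)  # row-parallel: split input rows
    partials = []
    for w1, w2 in zip(up_shards, down_shards):
        hidden = np.maximum(x @ w1, 0.0)  # elementwise activation needs no communication
        partials.append(hidden @ w2)      # each rank holds a partial of the full output
    return sum(partials)                  # the single all-reduce of the block


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))
w_up = rng.standard_normal((8, 16))
w_down = rng.standard_normal((16, 8))

reference = np.maximum(x @ w_up, 0.0) @ w_down
sharded = tp_mlp(x, w_up, w_down, num_ranks=4)
print(np.allclose(reference, sharded))
```

The comparison uses np.allclose because the sharded version adds the partial products in a different order than a single matmul does, so the last bits of a float may differ.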

Why TP needs fast links: that all-reduce happens every layer (dozens of times per token). If the GPUs aren't connected by something fast (NVLink), the chatter dominates and TP is slow. Rule of thumb: TP within a machine, PP across machines.
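The rule of thumb follows from a simple latency-plus-bandwidth model of a ring all-reduce, in which each of the 2(N-1) steps moves 1/N of the payload. The sketch below is that textbook model, not a measurement, and the link numbers in the usage lines are illustrative placeholders rather than specifications of any hardware.

```python
def ring_all_reduce_seconds(num_bytes, num_ranks, bandwidth_bytes_per_s, latency_s):
    """Estimated time for one ring all-reduce under a latency + bandwidth model."""
    if num_ranks == 1:
        return 0.0  # nothing to communicate
    steps = 2 * (num_ranks - 1)  # reduce-scatter phase plus all-gather phase
    chunk = num_bytes / num_ranks
    return steps * latency_s + steps * chunk / bandwidth_bytes_per_s


# Hypothetical payload: one decode token's hidden state, 8192 values at 2 bytes each.
payload = 8192 * 2

# Placeholder link parameters, chosen only to contrast a fast and a slow interconnect.
fast_link = ring_all_reduce_seconds(payload, 8, bandwidth_bytes_per_s=300e9, latency_s=5e-6)
slow_link = ring_all_reduce_seconds(payload, 8, bandwidth_bytes_per_s=12.5e9, latency_s=50e-6)
print(fast_link, slow_link, slow_link / fast_link)
```

For small decode-time payloads the latency term dominates, and for large prefill payloads the bandwidth term does; either way the cost is paid at every all-reduce, which is why the link quality matters so much for TP.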

🆕 New words: all-reduce (every GPU sends its partial result and everyone gets the sum), all-gather (every GPU shares its piece and everyone gets the concatenation), collective (any such group communication, run by a library called NCCL).

Step 3: Pipeline parallelism and the bubble

PP puts layers 1–16 on GPU0 and 17–32 on GPU1. A token's data flows GPU0 → GPU1. The problem: while GPU0 works on the first chunk, GPU1 has nothing to do yet (the pipeline bubble). The fix is micro-batches: chop the work into many small pieces so that once the pipeline fills, every GPU is always busy on some piece. PP's communication is cheap (just pass activations along, GPU→GPU), so it scales across machines — at the cost of a little latency and bubble overhead.

GPU0: [mb1][mb2][mb3][mb4]
GPU1:      [mb1][mb2][mb3][mb4]   ← starts late (bubble), then stays busy
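The diagram generalizes to p stages and m micro-batches. The sketch below assumes forward-only execution with every stage taking one equal time slot per micro-batch, and compares the closed form (p−1)/(p+m−1) with a simulated schedule grid; the function names are hypothetical.

```python
def bubble_fraction(p: int, m: int) -> float:
    """Fraction of stage-time slots that are idle with p stages and m micro-batches."""
    return (p - 1) / (p + m - 1)


def simulate_schedule(p: int, m: int):
    """grid[stage][t] is the micro-batch index running at time t, or None if idle."""
    total_slots = p + m - 1
    grid = [[None] * total_slots for _ in range(p)]
    for stage in range(p):
        for mb in range(m):
            grid[stage][stage + mb] = mb  # stage s starts s slots late
    return grid


def idle_fraction(grid) -> float:
    slots = [cell for row in grid for cell in row]
    return slots.count(None) / len(slots)


grid = simulate_schedule(p=2, m=4)
for row in grid:
    print(row)
print(bubble_fraction(2, 4), idle_fraction(grid))
```

Increasing m shrinks the bubble toward zero, which is why pipeline parallelism wants many micro-batches in flight.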

Step 4: DP, EP, CP in one line each

Data parallelism (DP) — replicate the model; route different requests to different replicas. Pure throughput scaling; needs the model to fit on one GPU (or one TP group). vLLM also uses DP for MoE attention (run attention data-parallel while experts are expert-parallel).
Expert parallelism (EP) — for MoE (Phase 7): put different experts on different GPUs; an all-to-all ships each token to its expert's GPU and back. Scales expert count; watch load balance.
Context parallelism (CP) — split a single very long sequence's context (its KV cache) across GPUs, so you can serve contexts too long for one GPU's memory.

Real large deployments combine these: e.g. TP=8 within a node, PP=2 across two nodes, DP to add replicas, EP for the MoE layers. Picking the combination for a given model + SLA + cluster is a defining staff-level decision.

Step 5: Who runs all this in vLLM

From Phase 1: EngineCore → Executor → Worker → ModelRunner. For multi-GPU, the Executor becomes a MultiprocExecutor that owns N worker processes, one per GPU. Each worker holds its shard of the model and runs the same step in lockstep; the collectives (all-reduce etc.) happen inside the layers (ColumnParallelLinear/RowParallelLinear). The "who is rank 0, which GPUs form the TP group" bookkeeping lives in parallel_state.py. The beauty: the model code is identical — it just uses parallel layers, and the engine fans out. That's why the same vLLM runs on 1 or 64 GPUs.

The invariants to memorize

TP splits each layer (all-reduce every layer; NVLink-hungry; latency win; within a node).
PP splits layers (cheap point-to-point; bubbles; scales across nodes).
DP replicates + routes requests (throughput; model must fit).
EP spreads MoE experts (all-to-all; load balance). CP splits one sequence's context.
TP pattern: column-parallel then row-parallel → one all-reduce per block, and the multi-GPU result matches single-GPU (identical math; floating-point sums may differ in the last bits).
Big deployments combine them; the Executor fans out to one worker process per GPU.

What you'll do

Read: 01-deep-dive.md — parallel_state, the collectives, the parallel Linear layers, and the multiproc executor, line-anchored.
Build: 02-mini-build.md — column/row-parallel matmul in numpy; prove the all-reduce reconstructs the single-GPU result.
Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
- lab-01-tp-sharding-math [CPU-OK] — implement TP and verify it equals the unsharded result.
- lab-02-two-way-tp [GPU-OPT] — run tensor_parallel_size=2; observe the memory split (captured).
- lab-03-tp-comm-cost [CPU-OK] — the ring all-reduce cost model: derive "TP within a node, never across" as an assert, and the decode-latency vs prefill-bandwidth regime split.
- lab-04-pipeline-bubble [CPU-OK] — the PP bubble (p−1)/(p+m−1), derived as algebra AND as a simulated schedule grid that must reconcile exactly; why PP needs deep batching.
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.


Phase 10 — Deep Dive: distributed inference in real vLLM

Paths relative to upstream/ at v0.22.1 @ 0decac0.

vllm/distributed/parallel_state.py     the source of truth for all parallel groups (TP/PP/DP/EP/CP)
vllm/distributed/communication_op.py   tensor_model_parallel_all_reduce / all_gather
vllm/model_executor/layers/linear.py   Column/Row/QKV ParallelLinear (TP in the layers)
vllm/v1/executor/multiproc_executor.py the N-worker executor
vllm/v1/worker/gpu_worker.py           one worker = one GPU = one model shard

1. Who's in which group: `parallel_state.py`

This file owns every process group. Key functions:

init_distributed_environment (:1370) and initialize_model_parallel (:1506) — set up the TP/PP/DP/EP/CP groups at startup from the configured sizes.
get_tp_group (:1241), get_pp_group (:1260) — the group a worker uses to communicate.
get_tensor_model_parallel_world_size (:1849) / _rank (:1854) — "how many TP peers, which one am I." The parallel layers read these to know how to shard.

Mental model: this module answers, for each worker, "who are my teammates for each kind of parallelism, and what's my index?" Everything else (the layers, the executor) consults it.
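The "who are my teammates" question can be pictured as reshaping the list of global ranks. The sketch below assumes a layout in which TP is the fastest-varying dimension, so TP peers are adjacent ranks; treat that ordering as an assumption to confirm against initialize_model_parallel at the pinned commit. The function name is hypothetical.

```python
import numpy as np


def build_tp_pp_groups(tp_size: int, pp_size: int):
    """Return (tp_groups, pp_groups) as lists of global ranks, with TP innermost (assumed)."""
    ranks = np.arange(tp_size * pp_size).reshape(pp_size, tp_size)
    tp_groups = [row.tolist() for row in ranks]    # ranks sharing a pipeline stage
    pp_groups = [col.tolist() for col in ranks.T]  # ranks sharing a TP index
    return tp_groups, pp_groups


tp_groups, pp_groups = build_tp_pp_groups(tp_size=4, pp_size=2)
print(tp_groups)
print(pp_groups)
```

Each worker then needs only its own row and column from these tables to know which peers it all-reduces with and which neighbor it passes activations to.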

2. The collectives: `communication_op.py`

tensor_model_parallel_all_reduce (:12) and tensor_model_parallel_all_gather (:17) are the two operations TP needs (Step 2 of the guide). They wrap NCCL (the NVIDIA collective library) via device communicators (distributed/device_communicators/). An all-reduce sums a tensor across all TP ranks and gives everyone the result; an all-gather concatenates each rank's piece. These are the "combine notes" steps from the analogy.
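The semantics of the two collectives can be mimicked with a list of per-rank arrays. This is a single-process illustration of what each rank ends up holding, with hypothetical function names; it involves no NCCL and says nothing about how the real communicators move data.

```python
import numpy as np


def all_reduce(per_rank):
    """Every rank receives the elementwise sum of all ranks' tensors."""
    total = np.sum(per_rank, axis=0)
    return [total.copy() for _ in per_rank]


def all_gather(per_rank, axis=-1):
    """Every rank receives the concatenation of all ranks' pieces, in rank order."""
    full = np.concatenate(per_rank, axis=axis)
    return [full.copy() for _ in per_rank]


rank_tensors = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
print(all_reduce(rank_tensors))
print(all_gather(rank_tensors))
```

All-reduce keeps the tensor shape and combines values, which suits row-parallel partial sums; all-gather grows the tensor along one axis, which suits column-parallel output slices.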

3. TP in the layers: `linear.py`

This is where TP actually happens — and notice the model never calls a collective directly; the layers do.

ColumnParallelLinear (:410) — shards the weight by output dimension; each rank computes part of the output. Used for the first matmul in a block (QKV, gate/up). QKVParallelLinear (:975) and MergedColumnParallelLinear (:607) are specializations (they pack Q/K/V or gate/up into one sharded matmul).
RowParallelLinear (:1392) — shards by input dimension; each rank computes a partial of the full output, then all-reduces. Used for the second matmul (attention o_proj, MLP down).

The pairing (column then row) means the column output stays sharded and feeds the row input with no intervening communication — one all-reduce per block (guide Step 2). Read RowParallelLinear.forward and find the tensor_model_parallel_all_reduce call: that's the one communication. Your lab-01 reproduces this exact pattern and proves the result equals the unsharded matmul.

4. The executor and workers

vllm/v1/executor/multiproc_executor.py: class MultiprocExecutor(Executor) (:102), execute_model (:306). It spawns one worker process per GPU and broadcasts each step's SchedulerOutput to all of them; they run the forward in lockstep, exchanging collectives inside the layers, and rank 0 returns the sampled tokens. vllm/v1/worker/gpu_worker.py: class Worker (:109) holds one GPU's device, model shard, and KV cache; execute_model (:781) runs the shard. So the Phase 1 chain (EngineCore → Executor → Worker → ModelRunner) just widens to N workers for parallelism — the engine logic above it is unchanged.

5. PP, DP, EP, CP pointers

PP: get_pp_group + the model splitting layers across ranks; activations are sent rank→rank (point-to-point) between pipeline stages, with micro-batching to fill the bubble.
DP: replicas with request routing; also DP-attention for MoE models (attention DP while experts are EP).
EP: fused_moe/all2all_utils.py (Phase 7) + distributed/eplb/ (expert load balancing).
CP: context-parallel groups in parallel_state.py split one sequence's KV across ranks.

Reading checklist

initialize_model_parallel — what groups does it create, and from what sizes?
ColumnParallelLinear vs RowParallelLinear — what does each shard, and which all-reduces?
Find the single all_reduce in RowParallelLinear.forward.
MultiprocExecutor — what does it broadcast, and how many worker processes for TP=4?
Why is the model code unchanged whether TP=1 or TP=8?

Now build it: 02-mini-build.md, then the labs.

Phase 10 — Mini-Build: tensor parallelism in numpy

You'll implement column- and row-parallel matmuls and prove that splitting a layer across "GPUs" and combining (all-gather / all-reduce) gives exactly the single-GPU result. No real GPUs — we simulate num_ranks shards with array slicing. This makes TP concrete and dispels the "is the math still correct?" worry for good.

The task (lab-01)

A linear layer is y = x @ W.T, with W shape (out, in). Implement:

column_parallel(x, W, num_ranks) — split W's rows (output dim) across ranks; each rank computes its slice y_r = x @ W_r.T; concatenate (the all-gather). Must equal x @ W.T.
row_parallel(x, W, num_ranks) — split W's columns (input dim) and x's columns across ranks; each rank computes a partial y_r = x_r @ W_r.T over the whole output; sum them (the all-reduce). Must equal x @ W.T.
mlp_tp(x, W1, W2, num_ranks): the real transformer pattern. W1 is column-parallel (keep the output sharded), the activation is applied per shard, and W2 is row-parallel (one all-reduce). Must equal the dense relu(x @ W1.T) @ W2.T, with exactly one all-reduce.

The point (the invariant)

x @ W.T == all_reduce(x_shard @ W_shard.T) for row-parallel, and the column→row pairing needs only one all-reduce per block. Your tests assert that the reconstruction equals the unsharded result to machine precision, which is why TP is correct, not just plausible.

Definition of done

pytest phase-10-distributed-inference/labs -q

Map to the real engine

your numpy	real vLLM
`column_parallel`	`ColumnParallelLinear` (`linear.py:410`)
`row_parallel` + sum	`RowParallelLinear` + `tensor_model_parallel_all_reduce` (`linear.py:1392`, `communication_op.py:12`)
`mlp_tp` (col→row, one all-reduce)	the MLP/attention block's TP pattern
`num_ranks`, rank slicing	`parallel_state.py` world size / rank (`:1849`/`:1854`)
(running it for real)	`MultiprocExecutor` + N workers (`multiproc_executor.py:102`)

Phase 10 Labs — Distributed Inference

Four labs on splitting one model across many GPUs. The arc: prove tensor parallelism's algebra and the one-all-reduce pairing (lab-01), price its communication and derive the within-a-node rule (lab-03), meet the cross-node alternative and its bubble (lab-04), then watch TP=2 split a real model's weights and KV on real hardware (lab-02).

Recommended order: 01 → 03 → 04 → 02. (The directory numbers predate labs 03 and 04; the recommended order runs math, bill, alternative, demo.) CPU labs follow the standard contract: starter.py (your work), solution.py (reference), test_lab.py (the spec). The default run grades the solution; LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-10-distributed-inference/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-10-distributed-inference/labs/lab-01-tp-sharding-math -q

Labs

lab-01-tp-sharding-math `[CPU-OK]`

Tensor parallelism as provable algebra: column-parallel (slice outputs, all-gather) and row-parallel (slice inputs, all-reduce) reconstruct the dense result exactly, and the Megatron column→row pairing makes a whole MLP cost one all-reduce — asserted by a counter, not claimed. Includes the divisibility constraint that caps real TP sizes. Skills: the two shardings; communication designed out, not optimized out; mapping to ColumnParallelLinear/RowParallelLinear.

lab-02-two-way-tp `[GPU-OPT]`

tensor_parallel_size=2 live: two worker processes, 1.24 + 1.24 = 2.48 GiB of weights, per-rank KV blocks, and output matching TP=1 (to the last ulp's mercy). The observable surface of TP — and how to reconcile every log line against labs 01/03. Annotated capture included. Skills: reading per-rank memory/block reports; lockstep workers and the slowest-rank rule; when two TP=1 replicas beat TP=2.

lab-03-tp-comm-cost `[CPU-OK]`

The bill: 2 all-reduces × 32 layers × an 8 KB decode payload, priced with the ring formula on NVLink, PCIe, and Ethernet. Derives "TP within a node, never across" as an assert (>40% of the step lost to latency on 10 GbE) — and the subtler split: decode comm is latency-bound, prefill comm is bandwidth-bound, so the right interconnect depends on the workload. Skills: the ring all-reduce cost model; latency vs bandwidth regimes; pricing EP's all-to-all with the same tools.

lab-04-pipeline-bubble `[CPU-OK]`

The cross-node alternative: stages by layer, one activation handoff per boundary — and the bubble, (p−1)/(p+m−1), derived twice (algebra and a simulated schedule grid that must reconcile exactly). p=8 under a 10% bubble budget needs 63 in-flight microbatches: PP's economics are batch economics. Skills: fill-drain geometry; PP buys throughput and nothing for latency; TP×PP composition; stragglers, third appearance.

What you can do after this phase

Decide, from arithmetic, how to place a model on a cluster: minimum TP for fit, TP vs data-parallel replicas for throughput, TP×PP composition across nodes, and what each choice costs in collectives or bubbles; read a distributed deployment's startup logs as a checksum of the sharding; and debug the classics (slow rank drags the ensemble, cross-node TP melting p99, PP starving at low traffic) from models you built rather than lore. Phase 15 splits the workload (prefill from decode) where this phase split the model.

Lab 10-01 — Tensor Parallelism Math `[CPU-OK]`

A 70B model's weights don't fit on your GPU. Tensor parallelism's answer is almost insolent in its simplicity: a matrix multiply distributes over slicing — cut the weight matrix into N pieces, give each GPU one piece, and the partial results reassemble into exactly the unsharded answer. This lab makes you prove it, in numpy, with the two sharding patterns that production TP is built from — column-parallel (slice outputs, reassemble by concatenation = all-gather) and row-parallel (slice inputs, reassemble by summation = all-reduce) — and then the composition trick that makes a whole transformer block cost only one all-reduce: pair them column→row, and the intermediate never needs reassembling at all.

Why this lab exists

Distributed inference has a reputation for being infrastructure wizardry — Ray clusters, NCCL, process groups — and that reputation obscures the fact that the core is linear algebra a laptop verifies in milliseconds. Separating the two layers is the point of this lab: the math (which sharding produces which partial result, and what collective reassembles it) is exact, provable, and small; the infrastructure (Phase 10's deep-dive: process groups, communicators, weight loaders that shard at load time) exists to execute that math. Engineers who learn the infrastructure first treat ColumnParallelLinear as an incantation; engineers who learn the math first read it as "my column_parallel, with NCCL where my np.concatenate is."

The one-all-reduce composition is the part that earns the word design. Naively sharding two consecutive matmuls costs a collective after each. The Megatron insight — which every serving stack inherited — is that the column shard's output is already partitioned exactly the way the row shard's input wants it: the activation flows from shard to shard without ever being whole. Communication is designed out, not optimized out. You'll assert it: num_all_reduces == 1.

Background: two shardings and the pairing trick

For y = x @ W.T (W: (out, in)):

Column-parallel (shard W's output rows): rank r computes x @ W_r.T, a slice of y's columns. Reassembly = concatenation (all-gather). Every rank needs all of x, which it has (the previous all-reduce ended with everyone holding the full activation).
Row-parallel (shard W's input columns): rank r holds W[:, r·c:(r+1)·c] and only the matching slice of x, computing a full-shaped but partial y_r. Reassembly = elementwise sum (all-reduce).
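The two shardings above can be sketched in a few lines of numpy. This is an illustrative sketch under the lab's stated convention (y = x @ W.T, W of shape (out, in)), not the reference solution.py; the loop over ranks stands in for separate GPUs.

```python
import numpy as np

def column_parallel(x, W, num_ranks):
    """Shard W's output rows; each rank yields a slice of y's columns."""
    out_dim = W.shape[0]
    assert out_dim % num_ranks == 0, "output dim must divide evenly across ranks"
    chunk = out_dim // num_ranks
    pieces = [x @ W[r * chunk:(r + 1) * chunk].T for r in range(num_ranks)]
    return np.concatenate(pieces, axis=-1)          # simulated all-gather

def row_parallel(x, W, num_ranks):
    """Shard W's input columns (and x's columns); sum the full-shaped partials."""
    in_dim = W.shape[1]
    assert in_dim % num_ranks == 0, "input dim must divide evenly across ranks"
    chunk = in_dim // num_ranks
    y = np.zeros((x.shape[0], W.shape[0]))
    for r in range(num_ranks):
        cols = slice(r * chunk, (r + 1) * chunk)
        y += x[:, cols] @ W[:, cols].T              # simulated all-reduce (running sum)
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))
W = rng.standard_normal((16, 8))
dense = x @ W.T
for n in (1, 2, 4, 8):
    assert np.allclose(column_parallel(x, W, n), dense)
    assert np.allclose(row_parallel(x, W, n), dense)
```

The only difference between the two functions is which axis of W is sliced and whether the pieces are concatenated or summed.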

The MLP composition: W1 column-parallel → each rank holds a slice of the hidden activation → apply the nonlinearity per-shard (elementwise, so it commutes with slicing — this is why the trick works for ReLU/SiLU but would break for anything mixing hidden dims) → W2 row-parallel consumes exactly that slice → one all-reduce at the end. Attention follows the same pattern with heads as the natural column boundary: QKV projections column-parallel (each rank owns whole heads), out-proj row-parallel. Two blocks per layer, one all-reduce each — lab-03 prices them.
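A minimal sketch of that composition follows, assuming W1 has shape (hidden, d_in) and W2 has shape (d_out, hidden). The counter dictionary and the `(y, count)` return shape are assumptions made for illustration; the lab's own signature may differ.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def all_reduce(partials, counter):
    """Simulated all-reduce that records each call in `counter`."""
    counter["all_reduce"] += 1
    return np.sum(partials, axis=0)

def mlp_tp(x, W1, W2, num_ranks):
    hidden = W1.shape[0]
    assert hidden % num_ranks == 0, "hidden dim must divide evenly across ranks"
    chunk = hidden // num_ranks
    counter = {"all_reduce": 0}
    partials = []
    for r in range(num_ranks):                      # one iteration = one simulated GPU
        rows = slice(r * chunk, (r + 1) * chunk)
        h_r = relu(x @ W1[rows].T)                  # column-parallel: slice of the hidden activation
        partials.append(h_r @ W2[:, rows].T)        # row-parallel: full-shaped partial output
    y = all_reduce(partials, counter)               # the hidden activation was never reassembled
    return y, counter["all_reduce"]

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))
W1 = rng.standard_normal((32, 8))
W2 = rng.standard_normal((8, 32))
dense = relu(x @ W1.T) @ W2.T
y, num_all_reduces = mlp_tp(x, W1, W2, num_ranks=4)
assert np.allclose(y, dense) and num_all_reduces == 1
```

Note that the slice used for W1's rows is the same slice used for W2's columns; that shared index is the pairing.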

Files

starter.py — column_parallel, row_parallel, mlp_tp. Your work.
solution.py — reference.
test_lab.py — exact reconstruction for several rank counts, the one-all-reduce property, and the divisibility constraint.

Run

LAB_IMPL=starter pytest phase-10-distributed-inference/labs/lab-01-tp-sharding-math -q
pytest phase-10-distributed-inference/labs/lab-01-tp-sharding-math -q   # reference

What to implement

Per 02-mini-build.md. The loop over ranks is the simulation — each iteration is one GPU's life; the concatenate and the running sum are the collectives. Keep that mapping conscious: when you later read real TP code, every line will be one of your loop bodies with the loop distributed across processes.

What the tests prove

Test	What it pins
column/row reconstruct `x @ W.T` exactly	Sharding is algebra, not approximation — to machine precision, for num_ranks ∈ {1, 2, 4, 8} (and rank-count invariance is itself the deployment-critical property: TP=4 and TP=8 must serve identical models)
`mlp_tp == dense MLP` with `num_all_reduces == 1`	The Megatron pairing: the hidden activation never reassembles. The counter in the return value is the design, made falsifiable
divisibility asserted	`hidden % num_ranks == 0`: why TP sizes are almost always powers of two and why some models can't run at TP=6. Head counts and hidden dims must divide evenly. A real constraint users hit (GQA's 8 KV heads cap practical TP at 8 without head replication)
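The divisibility row can be turned into a quick check. The function below is a simplified, hypothetical helper (not a vLLM API), and it assumes no KV-head replication; the example shape (hidden 4096, 32 query heads, 8 KV heads) is an illustrative GQA configuration.

```python
def tp_size_ok(hidden, num_heads, num_kv_heads, tp):
    """True if every sharded dimension divides evenly across `tp` ranks."""
    return (hidden % tp == 0
            and num_heads % tp == 0
            and num_kv_heads % tp == 0)   # assumes KV heads are not replicated

# Illustrative GQA shape: hidden 4096, 32 query heads, 8 KV heads.
valid = [tp for tp in range(1, 17) if tp_size_ok(4096, 32, 8, tp)]
print(valid)  # [1, 2, 4, 8]
```

TP=6 fails every one of the three checks for this shape, and TP=16 fails only on the KV heads, which is the cap the table row describes.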

Hitchhiker's notes

Floating point note: the row-parallel sum reorders additions vs the dense matmul, so on real hardware TP=2 and TP=1 differ in the last ulp — the recurring last-ulp story (Phases 3/4/6/9), now with rank count as the trigger. Your float64 numpy hides it; fp16 GPUs don't. "Different outputs at different TP sizes" bug reports are usually this, not a bug.
Map to upstream: ColumnParallelLinear / RowParallelLinear in upstream/vllm/model_executor/layers/linear.py — find the single all_reduce in the row class's forward (linear.py:1392), and notice gather_output=False on the column class: the default is the paired pattern, all-gather elided. Model code composes these two classes and TP falls out — that's why adding a new model (Phase 14) barely thinks about TP.
Weights are sharded at load time, not runtime — each rank reads only its slice from the checkpoint (the weight loader's shard_id machinery). The lab's W[r*chunk:(r+1)*chunk] is, in production, a file-read pattern: TP=8 startup reads each tensor once across 8 processes. Loading is part of the sharding design, not an afterthought.
Embedding and LM head shard on the vocabulary dimension (vocab-parallel) — same two patterns, different axis, with a gather at the logits. Every weight matrix in the model has a natural slicing axis; TP is the discipline of choosing axes so the collectives stay rare.
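The floating-point note can be observed on a CPU by dropping to float32. The sketch below compares a dense matmul with the same product computed as two row-parallel partials; the sizes and seed are arbitrary, and the exact difference depends on your BLAS, so no specific value is claimed.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 1024)).astype(np.float32)
W = rng.standard_normal((1024, 1024)).astype(np.float32)

dense = x @ W.T                                   # TP=1: one long accumulation per output
half = W.shape[1] // 2
tp2 = x[:, :half] @ W[:, :half].T + x[:, half:] @ W[:, half:].T   # TP=2: two partials, then a sum

max_abs_diff = float(np.max(np.abs(dense - tp2)))
print(f"max |dense - tp2| = {max_abs_diff:.3e}")  # typically tiny, often nonzero
assert np.allclose(dense, tp2, rtol=1e-3, atol=1e-2)
```

The two results agree to within rounding, which is the point: a reordered sum is the same number mathematically but not necessarily bit-for-bit.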

Going further

Add attention_tp(x, Wqkv, Wo, num_heads, num_ranks): heads as the column boundary, out-proj row-parallel, assert one all-reduce and head-count divisibility. You've now sharded both halves of a real layer.
Implement gather_output=True (the elided all-gather) and count collectives for the unpaired composition — two matmuls sharded naively. The diff against mlp_tp's 1 is the Megatron paper's contribution, measured by your counter.
Simulate a wrong sharding: shard W1 by rows instead of columns, watch the nonlinearity break the reconstruction (ReLU of a partial sum ≠ partial of a ReLU). The elementwise-commutes-with-slicing condition, demonstrated by violating it.
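The failure in the last exercise comes down to one inequality, shown here on hand-picked numbers. The two arrays stand for the partial pre-activations that two ranks would hold if W1 were sharded along its input dimension; the values are made up for illustration.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

# Partial sums of the hidden pre-activation, as held by two simulated ranks.
partial_rank0 = np.array([2.0, -1.0])
partial_rank1 = np.array([-1.0, 3.0])

correct = relu(partial_rank0 + partial_rank1)        # reduce first, then activate
wrong = relu(partial_rank0) + relu(partial_rank1)    # activate per rank, then reduce

print(correct.tolist())  # [1.0, 2.0]
print(wrong.tolist())    # [2.0, 3.0]
```

ReLU commutes with slicing a tensor but not with splitting it into summands, so the per-shard activation is only safe after a column-parallel layer.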

References

Shoeybi et al., Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (2019) — the column→row pairing, Figure 3: https://arxiv.org/abs/1909.08053
upstream/vllm/model_executor/layers/linear.py — the two classes and the one all-reduce (:1392).
upstream/vllm/model_executor/layers/vocab_parallel_embedding.py — the same idea on the vocab axis.
Lab-03 — what the one all-reduce costs; lab-02 — the memory split, observed live.

Lab 10-02 — Two-Way Tensor Parallelism `[GPU-OPT]`

The math (lab-01) said each rank holds 1/N of every matrix; the cost model (lab-03) said in-node links make the collectives cheap. This lab is where you watch both claims cash out on real hardware: tensor_parallel_size=2 spawns two worker processes, each reporting half the weight memory (1.24 GiB where TP=1 reported 2.48), each carving its own KV blocks from its own leftover HBM — and a model generates coherent text while no single GPU ever holds all of it. The startup log is the lab; reading it against the two CPU labs is the work.

No GPU pair? Don't panic. The captured run below is annotated line by line; the reconciliation exercises need only the numbers.

Why this lab exists

Two reasons. First, the observable surface of TP — worker processes, per-rank memory reports, per-rank block counts, NCCL initialization lines — is what you'll actually have in front of you during a production incident, and learning to read it against the underlying sharding math is the diagnostic skill (is rank 1's memory wildly different from rank 0's? Something's wrong with sharding or loading. Do blocks per worker × TP ≈ expected total KV? If not, where did the HBM go?). Second, TP is the first feature in this course that changes the process model: the engine becomes a coordinator of N workers in lockstep, every scheduler decision (Phase 3) broadcast to all ranks, every forward a synchronized ensemble. Several later phases (15's disaggregation, 17's platform plugins) build on that worker abstraction, and this is where you first see it breathe.

Requirements

uv pip install -e ".[vllm]"   # needs 2 visible CUDA GPUs for the live run
huggingface-cli download facebook/opt-1.3b

(opt-1.3b: big enough that halving its 2.6 GB is visible in the logs, small enough to also run TP=1 on one card for the baseline — you want both runs for the diff.)

Steps

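from vllm import LLM, SamplingParams

if __name__ == "__main__":  # guard: needed when worker processes are started with the spawn method
    llm = LLM(model="facebook/opt-1.3b", tensor_parallel_size=2, gpu_memory_utilization=0.8)
    params = SamplingParams(max_tokens=32, temperature=0)  # temperature=0 means greedy decoding
    outputs = llm.generate(["Distributed inference means"], params)
    print(outputs[0].outputs[0].text)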

Then run the baseline (tensor_parallel_size=1) and make the three comparisons: weight memory per worker (should halve), # GPU blocks per worker (should come out at roughly double the TP=1 count, so total token capacity roughly doubles; see below), and the generated text (should match the TP=1 output token-for-token, almost; see the last-ulp note).

Captured output (real run, opt-1.3b, 2×L4, vLLM 0.22.1, trimmed)

INFO ... Started 2 worker processes (tensor_parallel_size=2)
INFO (Worker_TP0) ... Model weights take 1.24 GiB         # half of the 2.48 GiB TP=1 baseline
INFO (Worker_TP1) ... Model weights take 1.24 GiB         # the other half
INFO ... # GPU blocks: 28,500 (per worker)                # KV also splits across TP ranks
 ... distributed inference means splitting one model across multiple GPUs ...
# single-GPU baseline (tensor_parallel_size=1): Model weights take 2.48 GiB on one GPU

Reading the numbers

1.24 + 1.24 = 2.48 — lab-01's W[r*chunk:(r+1)*chunk], weighed. Every linear layer's shards, embedding's vocab slices, all summing back to the whole. If the two workers ever report different weight sizes, some tensor didn't shard (replicated layers — norms, biases — are expected and tiny; a large asymmetry means a loader bug).
# GPU blocks: 28,500 per worker is the subtle one. Each rank's KV per token is also halved (it caches only its own heads' K/V, so attention sharding splits the cache naturally), and each rank carves blocks from its own freed-up HBM. Every token needs a slot on every rank, so the group's token capacity is the per-rank block count × tokens per block, not a sum over ranks. Work it through with Phase 2 lab-03's arithmetic: weights halved → more free HBM per GPU; KV per token halved → more tokens per GiB. Both effects push capacity up, so TP=2 roughly doubles total concurrent tokens, which is the capacity story that often justifies TP even when the model would fit on one GPU.
The generated text matches TP=1 — semantically always, token-for-token usually. The all-reduce reorders fp16 additions (lab-01's note), so a near-tie can flip a token. Greedy + short output usually survives; if you diff long generations and find one divergence at position 200, you've observed the last-ulp story, not a bug.
What you don't see: the all-reduces, two per layer on every step (64 per step in lab-03's 32-layer model). They are invisible in the logs and visible only as the gap between ideal 2× latency scaling and what you measure. Time a single-stream generation under TP=1 vs TP=2: the ITL improvement should land under 2× by roughly the comm fraction your lab-03 model predicts for your link.
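The capacity argument above can be checked with a few lines of arithmetic. This is a rough sketch, not a reconstruction of the captured run: the 16 GiB per-GPU budget is a made-up figure, activation and graph-capture overheads are ignored, and the KV size assumes OPT-1.3b's published shape (24 layers, hidden 2048) in fp16.

```python
GIB = 1024 ** 3

def token_capacity(budget_gib, weights_gib, kv_bytes_per_token, tp):
    """Tokens of KV a TP group can hold (every token needs a slot on every rank)."""
    free_per_rank = (budget_gib - weights_gib / tp) * GIB     # weights are split across ranks
    kv_per_token_per_rank = kv_bytes_per_token / tp           # each rank caches only its heads
    return int(free_per_rank // kv_per_token_per_rank)

budget_gib = 16.0                       # hypothetical usable HBM per GPU
weights_gib = 2.48                      # TP=1 figure from the captured log
kv_bytes_per_token = 2 * 24 * 2048 * 2  # (K and V) x layers x hidden x fp16 bytes

tp1 = token_capacity(budget_gib, weights_gib, kv_bytes_per_token, tp=1)
tp2 = token_capacity(budget_gib, weights_gib, kv_bytes_per_token, tp=2)
print(tp1, tp2, round(tp2 / tp1, 2))
```

The ratio lands a little above 2: one factor of two from halving the per-rank KV bytes, plus a small bonus from the HBM that the halved weights free up.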

Hitchhiker's notes

Process topology: TP workers are separate processes (one per GPU), not threads; CUDA contexts, NCCL communicators, and Python's GIL all push that way. The engine core broadcasts each step's scheduler output to all ranks; they execute the identical step in lockstep and the rank-0 worker returns the sampled tokens. Lockstep means the slowest rank sets the pace: a thermally-throttled GPU in a TP group drags the ensemble, a classic and maddening production hunt (symptom: TP=4 slower than TP=2; cause: one card at 70% clocks).
CUDA_VISIBLE_DEVICES and placement matter: TP wants the GPUs with the fastest mutual links (same NVLink island / NUMA node). On mixed-topology machines, nvidia-smi topo -m before choosing — lab-03's bill varies by which pair you pick on the same box.
When TP=2 is the wrong tool: model fits on one GPU and you're throughput-bound — two independent TP=1 replicas (data parallelism) beat TP=2 (no comm tax, perfect scaling, simpler ops). TP earns its tax only for fit-or-latency reasons (lab-03's notes). "We have two GPUs so we set TP=2" is the most common distributed-inference misconfiguration in the wild.
Startup is slower under TP — N processes, NCCL rendezvous, sharded loading, graph capture per rank (Phase 5's cost, ×N but parallel). Budget it in deploy pipelines; it's the TP line item people forget.

Reflect

Reconcile per-worker blocks with Phase 2 lab-03's formula: weights/rank = 1.24 GiB, KV/token/rank = half of lab 0-02's per-token bytes. Predict # GPU blocks for TP=2 on your card before reading the log. Within 10%?
A 70B fp16 model (~140 GiB weights), 80 GiB GPUs: what's the minimum TP, and what does lab-03 say about running it across two 8-GPU nodes at TP=16 vs TP=8 × PP=2? (TP=2 is the minimum for the weights to fit, though 70 GiB of weights per 80 GiB GPU leaves little room for KV cache; cross-node TP=16 pays two latency-bound all-reduces per layer over IB on every token, vs PP=2's single activation handoff. Lab-04 closes that composition.)
Why does vLLM broadcast scheduler decisions rather than letting each rank run its own scheduler? (The ranks must execute byte-identical steps — same batch, same block tables — or the all-reduces would be summing mismatched partials. One brain, N hands; determinism across ranks is a correctness requirement, not a preference.)

References

upstream/vllm/v1/executor/ and upstream/vllm/v1/worker/ — the multiprocess executor and worker lockstep.
upstream/vllm/distributed/parallel_state.py — process groups and communicator setup (the NCCL lines in your startup log).
vLLM docs, Distributed Inference and Serving — TP/PP configuration and the placement guidance: https://docs.vllm.ai/en/latest/serving/distributed_serving.html
Labs 01 (the math), 03 (the bill), 04 (the cross-node alternative) — this run is their joint demo.

Lab 10-03 — The TP Communication Bill `[CPU-OK]`

Lab-01 proved tensor parallelism is mathematically free — exact reconstruction, one all-reduce per block. This lab prices what "one all-reduce" costs physically, and the answer derives the most-quoted deployment rule in distributed inference from four multiplications: TP within a node, never across. Same model, same math, same code — on NVLink the communication is noise (<10% of a decode step), on 10 GbE it's fatal (>40%, latency alone). You'll also derive the subtler corollary most people miss: for decode, the bill is dominated by latency, not bandwidth — 64 tiny 8 KB messages per token — which is why fancy interconnect bandwidth numbers don't save cross-node TP and why prefill and decode want different links.

Why this lab exists

"TP needs fast interconnect" is folklore until you can compute how fast, for which workload, and what happens if you ignore it. Those computations decide real money: whether a 70B model needs an NVLink-equipped node or can spread across two cheaper ones (it can't — not with TP; that's what PP is for, lab-04), whether TP=8 beats TP=4 for your latency target, whether a cloud's "high-bandwidth networking" claim is relevant (check the latency; for decode it usually matters more). This lab builds the five-function model that answers all of them on a napkin — the distributed sibling of Phase 0 lab-04's roofline, and like it, a model whose domain of validity you'll know because you built it.

It's also the quantitative half of a design story the phase tells in two parts: TP (this lab) pays communication per layer and demands fat links but splits every matrix; PP (lab-04) pays per stage boundary and tolerates thin links but idles GPUs in bubbles. Every real deployment of a big model is a negotiation between these two bills, and you're about to be able to compute both sides.

Background: what gets sent, how often, and how

What: after each RowParallelLinear (lab-01), every rank holds a partial sum of the activation; the all-reduce sums them. Payload = the activation tensor: batch_tokens × hidden × dtype_bytes. For one decode token of an 8B model (hidden 4096, fp16): 1 × 4096 × 2 = 8 KB. Tiny. For a 2048-token prefill chunk: 2048 × 8 KB = 16 MB. Not tiny. Same operation, three orders of magnitude apart; keep both numbers in mind, because they split the analysis.

How often: twice per layer (attention out-proj, MLP down-proj) × 32 layers = 64 all-reduces per step, every step, forever. Communication frequency is set by model depth, not by anything you can tune.

How: ring all-reduce, i.e. reduce-scatter then all-gather, with each rank sending 2·(N−1)/N × payload in total across 2(N−1) sequential hops. The formula's two terms are the lab's two regimes: traffic / bandwidth (dominates for big payloads: prefill) and 2(N−1) × latency (dominates for small ones: decode). A 3 µs NVLink hop vs a 50 µs Ethernet hop is the 17× that, multiplied by 64 all-reduces, becomes the node boundary.
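The what / how often / how pieces combine into a small cost model. The sketch below is one way to write it, not the lab's reference: the function names and signatures are my own, the 300 GB/s NVLink bandwidth is an assumed round number, and comm fraction is defined here as comm / (compute + comm). The latencies, the 8 KB payload, the 64 all-reduces, and the 8 ms decode step come from the text.

```python
def ring_traffic_per_rank(payload_bytes, n):
    """Bytes each rank sends in a ring all-reduce: 2(N-1)/N x payload."""
    return 2 * (n - 1) / n * payload_bytes

def ring_allreduce_seconds(payload_bytes, n, bandwidth_Bps, latency_s):
    """Bandwidth term plus latency term (2(N-1) sequential hops)."""
    if n == 1:
        return 0.0
    return ring_traffic_per_rank(payload_bytes, n) / bandwidth_Bps + 2 * (n - 1) * latency_s

def comm_seconds_per_step(payload_bytes, n, bandwidth_Bps, latency_s, num_layers=32):
    """Two all-reduces per layer (attention out-proj, MLP down-proj)."""
    return 2 * num_layers * ring_allreduce_seconds(payload_bytes, n, bandwidth_Bps, latency_s)

def comm_fraction(comm_s, compute_s):
    return comm_s / (compute_s + comm_s)

payload = 1 * 4096 * 2            # one decode token, hidden 4096, fp16 = 8 KB
compute_s = 8e-3                  # the 8 ms decode step
nvlink = comm_seconds_per_step(payload, 2, bandwidth_Bps=300e9, latency_s=3e-6)     # assumed NVLink bandwidth
ethernet = comm_seconds_per_step(payload, 2, bandwidth_Bps=1.25e9, latency_s=50e-6) # 10 GbE
print(f"NVLink:   {nvlink * 1e3:.2f} ms, {comm_fraction(nvlink, compute_s):.0%} of the step")
print(f"Ethernet: {ethernet * 1e3:.2f} ms, {comm_fraction(ethernet, compute_s):.0%} of the step")
```

Under these assumptions the Ethernet figure is almost entirely the latency term, which is why faster wire speed alone would not rescue it.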

Files

starter.py — allreduce_payload_bytes, ring_allreduce_traffic_per_rank, allreduce_time_s, tp_comm_time_per_step, comm_fraction. Your work.
solution.py — reference.
test_lab.py — the formulas, the NVLink-vs-Ethernet verdict, both latency/bandwidth regimes, and the more-ranks-more-overhead direction.

Run

LAB_IMPL=starter pytest phase-10-distributed-inference/labs/lab-03-tp-comm-cost -q
pytest phase-10-distributed-inference/labs/lab-03-tp-comm-cost -q   # reference

What the tests prove

Test	What it pins
`test_payload_is_one_activation_row_per_token`	The 8 KB decode payload — memorize it; it's why decode TP is a latency problem
`test_ring_traffic_formula`	`2(N−1)/N`: at N=2 each rank moves exactly one payload; as N grows it approaches 2× — traffic per rank is nearly constant in N (the ring's genius), it's the hop count that grows
`test_decode_step_comm_on_nvlink_is_noise`	64 all-reduces on NVLink < 1 ms, < 10% of the 8 ms decode step (Phase 0 lab-04's number) — TP=2 in-node is nearly free
`test_decode_step_comm_on_ethernet_is_fatal`	Same step over 10 GbE: latency alone is 64 all-reduces × 2 hops (2(N−1) at N=2) × 50 µs = 6.4 ms, > 40% of the step. The "never TP across nodes" rule, as an assert
`test_latency_dominates_small_payloads` / `test_prefill_payloads_shift_the_balance_to_bandwidth`	The regime split: for decode, halving latency beats doubling bandwidth; for prefill, the reverse. One model, two correct answers to "what should we buy?"
`test_more_ranks_more_overhead`	TP=8 > TP=2 in comm time: TP scaling is sub-linear by construction, before any software inefficiency
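The regime split in the table can be seen by perturbing one link parameter at a time. This sketch reuses the ring formula with 10 GbE numbers at N=2; the helper names are illustrative, and the payloads are the 8 KB decode token and the 2048-token prefill chunk from the background section.

```python
def allreduce_seconds(payload_bytes, n, bandwidth_Bps, latency_s):
    """Ring all-reduce: traffic / bandwidth + 2(N-1) x latency."""
    traffic = 2 * (n - 1) / n * payload_bytes
    return traffic / bandwidth_Bps + 2 * (n - 1) * latency_s

BW, LAT = 1.25e9, 50e-6               # 10 GbE: 1.25 GB/s, 50 us per hop
decode_payload = 1 * 4096 * 2         # 8 KB
prefill_payload = 2048 * 4096 * 2     # 16 MB

def compare(payload):
    """Time with half the latency vs time with double the bandwidth."""
    half_latency = allreduce_seconds(payload, 2, BW, LAT / 2)
    double_bandwidth = allreduce_seconds(payload, 2, 2 * BW, LAT)
    return half_latency, double_bandwidth

decode_half_lat, decode_double_bw = compare(decode_payload)
prefill_half_lat, prefill_double_bw = compare(prefill_payload)
print(f"decode : half latency {decode_half_lat * 1e6:.1f} us, double bandwidth {decode_double_bw * 1e6:.1f} us")
print(f"prefill: half latency {prefill_half_lat * 1e3:.2f} ms, double bandwidth {prefill_double_bw * 1e3:.2f} ms")
```

For the decode payload the latency cut wins; for the prefill payload the bandwidth upgrade wins. Same formula, opposite purchasing advice.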

Hitchhiker's notes

Why TP at all, if it taxes every layer? Three reasons, in order of importance: the model doesn't fit on one GPU (the usual one); per-token latency, because TP divides the weight-streaming time, so a bandwidth-bound decode step (Phase 0 lab-04's 8 ms) genuinely drops toward 8/N ms + comm, the only lever that shortens single-stream ITL on a too-slow GPU; and KV capacity, because the cache splits across ranks too (lab-02's # GPU blocks figure is per worker, and each block costs a rank only 1/N of the bytes, so total capacity grows). The comm bill is what you pay for all three.
vLLM's custom all-reduce: for small payloads (exactly the decode case), NCCL's general ring is beaten by a one-shot fused kernel over NVLink peer access — upstream/vllm/distributed/device_communicators/custom_all_reduce.py exists precisely because of the latency term you just modeled. When you read "custom allreduce disabled" in a startup log, you now know which workloads care.
The model's omissions (know them before quoting it): overlap — real engines overlap some comm with compute, shaving the visible fraction; NVSwitch topology — 8-GPU nodes all-reduce at near-constant time rather than ring-scaling; and cross-node fabrics like InfiniBand (~2–5 µs, 50–400 Gb/s) sit between your NVLink and Ethernet endpoints — rerun the numbers for IB and you'll see why cross-node TP is merely painful rather than absurd on real clusters, and why it's still avoided when PP can serve.
Hidden size moves the bill linearly — a 70B model (hidden 8192) doubles every payload, and its compute per step is ~9× bigger; comm fraction actually improves with model size. Small models are the worst TP candidates twice over (less to split, same hop count).

Going further

Add an overlap_fraction parameter (comm hidden under compute) and find the break-even overlap that makes TP=2-over-IB match TP=2-over-NVLink for decode. You've quantified what async/overlapped all-reduce engineering is worth.
Model TP × batch: comm payload grows with batch but compute grows too — plot comm fraction vs batch size for decode and find where Ethernet TP becomes tolerable (large-batch throughput serving — which is exactly when you didn't need TP's latency win anyway; the conclusion writes itself).
Compute the bill for expert parallelism's all-to-all (Phase 7 lab-04's missing line): payload = routed tokens × hidden, frequency = 2 per MoE layer. Compare against TP's — you'll see why DeepSeek-scale MoE deployments obsess over network topology in a way dense-model TP never had to.

References

upstream/vllm/distributed/device_communicators/custom_all_reduce.py — the latency-term workaround, in production.
NVIDIA NCCL docs, Collective Operations — ring/tree algorithms and their cost models: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html
Shoeybi et al., Megatron-LM (2019) — the column→row TP scheme and its two all-reduces per layer: https://arxiv.org/abs/1909.08053
Pope et al., Efficiently Scaling Transformer Inference (2022) — §3's communication analysis, the rigorous version of this lab: https://arxiv.org/abs/2211.05102
Lab-01 — the math being priced; lab-04 — the alternative with the opposite bill.

Lab 10-04 — The Pipeline-Parallel Bubble `[CPU-OK]`

Lab-03 closed one door: TP across slow links is fatal — 64 latency-bound all-reduces per token see to that. Pipeline parallelism is what's left when the model is too big for one node: cut by layers into stages, and the only communication is handing one activation tensor to the next stage — point-to-point, once per stage boundary, indifferent to link latency in a way TP can only envy. The catch has a name and a closed form: the bubble — (p−1)/(p+m−1) of the pipeline's capacity idles during fill and drain — and this lab has you derive it twice: as algebra, and as a schedule grid you build cell by cell, where the two derivations must reconcile exactly (the test counts idle cells and divides). One microbatch through four stages wastes 75% of the hardware; the entire craft of PP is making m large enough that the formula forgives you.

Why this lab exists

PP is the parallelism people deploy reluctantly — and the bubble formula is the entire content of that reluctance, so you should own it cold. It answers the deployment questions TP's bill (lab-03) leaves open: two nodes with no fast interconnect and a model that fits neither — PP works, but only if your workload keeps enough microbatches in flight (the test pins it: p=8 stages under a 10% bubble budget needs 63 concurrent microbatches — a number that should make you pause before proposing PP for a low-traffic latency-sensitive service). PP's economics are batch economics; knowing the formula means knowing instantly which workloads can pay.

The schedule-grid half of the lab is the more transferable skill: pipeline reasoning is the grid (stages × ticks, diagonal occupancy), and every scheduling refinement in the literature — 1F1B, interleaved stages, zero-bubble schedules — is a rearrangement of this grid you can draw and count. Build the simulator once and those papers become pictures.

Background: the fill-drain geometry

Microbatch b occupies stage s at tick s + b — a diagonal sweeping through a p × (p+m−1) grid. Everything follows from counting cells:

total ticks   = p + m − 1          (m diagonals, offset by one each)
useful cells  = m · p              (every microbatch visits every stage once)
capacity      = p · (p + m − 1)
bubble        = 1 − useful/capacity = (p − 1)/(p + m − 1)

Read the formula's two limits like an engineer: m = 1 → bubble (p−1)/p — a single request through a deep pipeline uses one GPU's worth of an 8-GPU rack (which is why PP does nothing for single-stream latency: total ticks ≥ p regardless — latency through a pipeline is the pipeline's depth); m → ∞ → bubble → 0 — at high concurrency the fill/drain cost amortizes to noise. PP converts throughput into efficiency and has nothing to offer latency. That asymmetry — exactly opposite to TP, which buys latency and taxes every step — is why the two compose rather than compete (TP inside the node, PP across; vLLM's tensor_parallel_size × pipeline_parallel_size grid).

For inference specifically, "microbatch" maps onto the continuous-batching engine naturally: each scheduler step's batch flows through the stages, and a busy engine (Phase 3's full queues) keeps every stage fed — inference PP at high load lives near the good end of the formula. The bad end is a quiet engine: requests trickle in, stages idle, and the p99 user pays p stage-latencies regardless.
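The fill-drain geometry is small enough to simulate directly. The sketch below builds the stages × ticks grid and checks it against the closed form; it is an illustration rather than the reference solution, and the function names and grid representation (a list of rows, `None` for an idle cell) are my own choices.

```python
import math

def total_ticks(p, m):
    return p + m - 1

def bubble_fraction(p, m):
    return (p - 1) / (p + m - 1)

def simulate_schedule(p, m):
    """grid[s][t] = microbatch running on stage s at tick t, or None if idle."""
    grid = [[None] * total_ticks(p, m) for _ in range(p)]
    for b in range(m):
        for s in range(p):
            assert grid[s][s + b] is None, "a stage runs one microbatch at a time"
            grid[s][s + b] = b              # microbatch b reaches stage s at tick s + b
    return grid

def idle_fraction(grid):
    cells = [cell for row in grid for cell in row]
    return sum(cell is None for cell in cells) / len(cells)

def min_microbatches(p, budget):
    """Smallest m whose bubble is within `budget` (requires budget > 0)."""
    m = 1
    while bubble_fraction(p, m) > budget:
        m += 1
    return m

assert bubble_fraction(4, 1) == 0.75
assert math.isclose(idle_fraction(simulate_schedule(8, 5)), bubble_fraction(8, 5))
print(min_microbatches(8, 0.10))  # 63
```

Printing a small grid such as `simulate_schedule(3, 4)` row by row shows the diagonal: the idle triangles at the top right and bottom left are the bubble.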

Files

starter.py — pipeline_total_ticks, bubble_fraction, simulate_schedule, min_microbatches_for_bubble. Your work.
solution.py — reference.
test_lab.py — the formulas, the grid's diagonal structure, the exact grid-vs-formula reconciliation, serial-stage discipline, and the bubble-budget inversion.

Run

LAB_IMPL=starter pytest phase-10-distributed-inference/labs/lab-04-pipeline-bubble -q
pytest phase-10-distributed-inference/labs/lab-04-pipeline-bubble -q   # reference

What the tests prove

Test	What it pins
`test_total_ticks`	`p + m − 1`, including the m=1 pure-latency case
`test_bubble_formula`	The closed form at its corners: no pipeline → 0; one microbatch through 4 stages → 75%; m ≫ p → vanishing
`test_schedule_grid_matches_the_formula`	The reconciliation: idle cells counted in the simulated grid ÷ capacity equals `bubble_fraction` exactly. Two independent derivations agreeing is what "I understand this formula" means
`test_stage_never_runs_two_microbatches_at_once`	The serial-worker constraint that makes the diagonal the only schedule (until you interleave — see notes)
`test_min_microbatches_for_a_bubble_budget`	The inversion you'll actually use: p=8, 10% budget → m = 63; deeper pipelines need proportionally deeper batching
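The bubble-budget inversion in the last row is a one-line rearrangement of the closed form: (p−1)/(p+m−1) ≤ β is equivalent to m ≥ (p−1)(1−β)/β. A minimal sketch follows; the function name matches the lab's starter file, but the signature and the choice to take the budget as an exact `Fraction` (which sidesteps float rounding at the boundary) are assumptions.

```python
import math
from fractions import Fraction

def min_microbatches_for_bubble(p, budget):
    """Smallest m such that (p - 1) / (p + m - 1) <= budget, for 0 < budget < 1."""
    budget = Fraction(budget)
    if not 0 < budget < 1:
        raise ValueError("budget must be strictly between 0 and 1")
    if p == 1:
        return 1  # a single stage has no bubble at any m
    # Rearranged closed form: m >= (p - 1) * (1 - budget) / budget
    return max(1, math.ceil((p - 1) * (1 - budget) / budget))

for p in (2, 4, 8, 16):
    print(p, min_microbatches_for_bubble(p, Fraction(1, 10)))
```

The required m grows linearly in p − 1, which is the "deeper pipelines need proportionally deeper batching" statement in the table.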

Hitchhiker's notes

Where PP's communication bill hides: one activation tensor per microbatch per stage boundary — batch_tokens × hidden × dtype bytes, p − 1 times per step total (not per layer!). Run lab-03's arithmetic on it: even over 10 GbE, a decode microbatch's 8 KB handoff is microseconds, and there are 7 of them instead of 64 all-reduces. That's the entire "PP tolerates slow links" argument, quantified with the model you already built.
Interleaved stages (each GPU holds several non-contiguous layer chunks) shrink the bubble by cutting each microbatch's work into v shorter chunks per GPU, so fill and drain take 1/v as long: the formula becomes (p−1)/(p − 1 + v·m)-flavored with v chunks per GPU (v = 1 recovers the plain formula), at the cost of v× more handoffs. Zero-bubble schedules (training-side) rearrange backward passes; inference, having no backward, mostly cares about the plain formula you built.
The KV-cache wrinkle inference adds: each stage holds the KV for its layers only — PP splits cache naturally, like weights. But a request's tokens revisit stage 0 every decode step, so PP decode is a loop through the pipeline, not a single pass: steady-state decode keeps all stages busy only if the in-flight request count ≥ p. Same formula, with m = concurrent requests — Phase 3's max_num_seqs acquires a new lower bound.
vLLM specifics: pipeline_parallel_size shards layers across nodes (Ray or multiprocessing); the V1 engine overlaps stage execution with its async scheduling. PP support historically lagged TP in vLLM precisely because continuous batching × pipelining is bookkeeping-heavy — reading upstream/vllm/distributed/ and the executor's PP paths after this lab, you'll recognize the grid under the code.
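The communication arithmetic in the first note can be reproduced in a few lines. This is a back-of-the-envelope sketch under stated assumptions: hidden size 4096, fp16 activations, a one-token decode microbatch, a 32-layer model with two all-reduces per layer under TP, and a link modeled by bandwidth only (no per-message latency). None of these constants come from lab-03's actual model.

```python
def handoff_bytes(batch_tokens, hidden, dtype_bytes):
    """Size of the activation tensor passed across one stage boundary."""
    return batch_tokens * hidden * dtype_bytes

def transfer_seconds(nbytes, link_bits_per_second):
    """Bandwidth term only; real links add a fixed per-message latency."""
    return nbytes * 8 / link_bits_per_second

# Assumed shapes: hidden=4096, fp16 (2 bytes), one decode token.
nbytes = handoff_bytes(batch_tokens=1, hidden=4096, dtype_bytes=2)
seconds = transfer_seconds(nbytes, link_bits_per_second=10e9)  # 10 GbE

p, num_layers = 8, 32              # assumed pipeline depth and layer count
pp_handoffs = p - 1                # one per stage boundary
tp_all_reduces = 2 * num_layers    # attention + MLP in every layer

print(f"{nbytes} bytes per handoff, {seconds * 1e6:.2f} microseconds on 10 GbE")
print(f"{pp_handoffs} PP handoffs vs {tp_all_reduces} TP all-reduces per step")
```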

Going further

Extend the simulator with per-stage durations (stage = its layers' cost; make one stage 2× slower) and watch the bubble formula stop being exact: the slow stage becomes a straggler and the pipeline clocks at its rate — Phase 7 lab-04's imbalance lesson, third appearance. Then rebalance layers across stages to fix it (real PP deployments tune stage boundaries for exactly this).
Add TP×PP composition: total GPUs = t × p; for a fixed 16-GPU budget and lab-03's comm model on both axes, find the (t, p) that minimizes decode latency for an in-node-NVLink, cross-node-IB cluster. You've just done the capacity-planning exercise that precedes every large-model deployment.
Plot bubble vs m for p ∈ {2, 4, 8, 16} and overlay your service's actual concurrent-request distribution: the visual that settles "can we afford PP?" in one meeting.
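As a starting point for the first extension, here is one way the per-stage-duration simulator might look. It is a sketch, not the lab's intended solution: the recurrence assumes each stage runs microbatches in order and starts one as soon as both its input and the stage itself are free, and the function names are hypothetical.

```python
def pipeline_makespan(durations, m):
    """Finish time of the last microbatch on the last stage.

    durations[s] is the time stage s spends on one microbatch.
    """
    p = len(durations)
    finish = [[0.0] * m for _ in range(p)]
    for b in range(m):
        for s in range(p):
            input_ready = finish[s - 1][b] if s > 0 else 0.0
            stage_free = finish[s][b - 1] if b > 0 else 0.0
            finish[s][b] = max(input_ready, stage_free) + durations[s]
    return finish[-1][-1]

def bubble_with_durations(durations, m):
    """Idle fraction of total GPU-time when stages have unequal cost."""
    p = len(durations)
    busy = m * sum(durations)
    capacity = p * pipeline_makespan(durations, m)
    return 1 - busy / capacity

uniform = bubble_with_durations([1, 1, 1, 1], m=3)      # matches (p-1)/(p+m-1)
straggler = bubble_with_durations([1, 2, 1, 1], m=3)    # one stage 2x slower
many = bubble_with_durations([1, 2, 1, 1], m=1000)      # does not vanish
print(uniform, straggler, many)
```

With equal durations this reproduces the closed form. With a straggler, the bubble no longer goes to zero as m grows: it approaches 1 − sum(durations)/(p · max(durations)), because every stage is clocked by the slowest one.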

References

Huang et al., GPipe (2018) — the fill-drain schedule and the bubble: https://arxiv.org/abs/1811.06965
Narayanan et al., PipeDream / 1F1B (2019–21) — the schedule refinements that rearrange your grid: https://arxiv.org/abs/2104.04473
upstream/vllm/distributed/ and upstream/vllm/v1/executor/ — pipeline_parallel_size paths; the deep-dive maps them.
Pope et al., Efficiently Scaling Transformer Inference (2022) — TP vs PP for inference, with the cost models side by side: https://arxiv.org/abs/2211.05102
Lab-03 — the TP bill this lab is the alternative to; Phase 15 — disaggregation, the third way to split work across machines.

Phase 10 — Exercises: Distributed Inference

Warm-up (explain)

One line each: TP, PP, DP, EP, CP — what gets split?
What's an all-reduce vs an all-gather? Which does row-parallel use?
Why "TP within a node, PP across nodes"?

Core (trace the code)

In linear.py, what does ColumnParallelLinear shard vs RowParallelLinear? Where's the one all-reduce (:1392)?
Why does the column→row pairing need only one all-reduce per pair (and therefore two per transformer block: attention and MLP)?
In MultiprocExecutor (multiproc_executor.py:102), how many worker processes for TP=4, and what does it broadcast each step?
Why is the model code identical for TP=1 and TP=8?

Build (your lab)

In lab-01, prove row_parallel reconstructs x@W.T for num_ranks=8. Why is summing partials the correct combine (not concatenation)?
Add a qkv_parallel that column-shards a fused QKV weight; verify it equals the unsharded QKV.
Count communications for a full transformer block (attention + MLP) under your TP impl. Is it 2 all-reduces? Why?

Design (staff-level)

Serve a 70B model on 8×A100-80GB for (a) lowest latency, (b) highest throughput. Pick TP/PP/DP for each and justify with the communication patterns.
You scale TP from 2 to 8 and throughput barely improves. Diagnose (communication-bound) and propose alternatives.
For a 256-expert MoE on 16 GPUs, how would you combine EP (experts) with DP/TP (attention), and what's the main risk (load imbalance, all-to-all cost)?

Self-grading

Core (4–7) and Design (11–13), counting the exercises from the top, are interview-grade. Could you draw the col→row TP pattern and the worker fan-out? If not, re-read 01-deep-dive.md.

Phase 10 — Interview Questions: Distributed Inference

Q1. TP vs PP — when do you reach for each?

Model answer

TP splits every layer's math across GPUs, so all GPUs work on each token — great for latency, but it all-reduces every layer, so it needs fast intra-node links (NVLink). PP splits the layers across GPUs with cheap point-to-point handoffs, so it scales across nodes/memory, but adds pipeline bubbles (mitigated by micro-batching) and a bit of latency. Rule of thumb: TP within a node, PP across nodes; combine them for very large models.

Q2. Walk me through tensor-parallel matmuls.

Model answer

Column-parallel splits the weight by output columns: each GPU computes part of the output, combined by all-gather. Row-parallel splits by input rows (and the input): each GPU computes a partial of the whole output, combined by all-reduce (sum). vLLM does the first matmul in an attention or MLP sub-block column-parallel and the second row-parallel, so the column output stays sharded and feeds the row input directly, with no all-gather in between: one all-reduce per sub-block. The combined result equals the single-GPU output up to floating-point summation order (lab-01 proves the equality).
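The pairing in this answer can be simulated on one machine with numpy, treating each "rank" as a loop iteration. This is an illustrative sketch, not vLLM's implementation: weights are stored as (out, in) so that y = x @ W.T, ReLU stands in for the real activation, and the all-gather and all-reduce are modeled as `np.concatenate` and `np.sum`.

```python
import numpy as np

rng = np.random.default_rng(0)
ranks, tokens, d_in, d_hidden, d_out = 4, 3, 8, 16, 8

x = rng.standard_normal((tokens, d_in))
W1 = rng.standard_normal((d_hidden, d_in))    # first linear, stored (out, in)
W2 = rng.standard_normal((d_out, d_hidden))   # second linear, stored (out, in)

def relu(a):
    return np.maximum(a, 0.0)

# Single-device reference for the two-matmul sub-block.
reference = relu(x @ W1.T) @ W2.T

# Column-parallel: shard W1 along its output dimension.
W1_shards = np.split(W1, ranks, axis=0)
# Row-parallel: shard W2 along its input dimension, matching W1's output shards.
W2_shards = np.split(W2, ranks, axis=1)

hidden_shards, partials = [], []
for W1_r, W2_r in zip(W1_shards, W2_shards):
    h_r = relu(x @ W1_r.T)          # (tokens, d_hidden / ranks), stays on the rank
    hidden_shards.append(h_r)
    partials.append(h_r @ W2_r.T)   # (tokens, d_out): a partial sum of the output

# Column-parallel alone would need an all-gather (concatenate) to rebuild the hidden.
gathered = np.concatenate(hidden_shards, axis=1)
# Column followed by row needs only one all-reduce (sum of partials).
tp_output = np.sum(partials, axis=0)

assert np.allclose(gathered, relu(x @ W1.T))
assert np.allclose(tp_output, reference)
```

The elementwise activation is what lets the hidden activations stay sharded: it can be applied to each rank's slice independently. Summation is the right combine for row-parallel because each rank holds a slice of the inner (contraction) dimension, so its product is a partial sum over that slice rather than a slice of the output.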

Q3. What's a pipeline bubble and how is it reduced?

Model answer

In PP, downstream stages idle while the pipeline fills, and upstream stages idle while it drains; that wasted GPU time is the bubble. Splitting the work into many micro-batches keeps the pipeline full: once it's primed, every stage is working on some micro-batch. With p stages and m micro-batches the idle fraction is (p−1)/(p+m−1), so it shrinks as m grows but never fully disappears.

Q4. Why does MoE motivate expert parallelism + data-parallel attention?

Model answer

Experts are independent FFNs, so placing whole experts on different GPUs (EP) scales expert capacity with just an all-to-all to route tokens. Attention has different parallelism economics, so it's often run data-parallel across the same GPUs to balance work. Mixing EP (experts) with DP/TP (attention) is common for large MoE models; the main risks are all-to-all cost and expert load imbalance.

Q5. How does vLLM run the same model on 1 or 64 GPUs unchanged?

Model answer

The model uses parallel layers (ColumnParallelLinear/RowParallelLinear) that internally do the collectives, and parallel_state.py holds the group/rank bookkeeping. For multi-GPU the Executor becomes a MultiprocExecutor that spawns one worker process per GPU, each holding a shard, running in lockstep. The engine logic above (scheduler, sampler) and the model code are identical — only the executor fans out.

Rapid-fire

Row-parallel combine? all-reduce (sum). Column-parallel combine? all-gather (concat).
All-reduces per transformer block under TP? ~2 (one each for attention and MLP); pattern = col→row in each.
Collective library? NCCL. Group bookkeeping? parallel_state.py.
Workers for TP=4? 4 processes, one per GPU.
EP shards? whole experts (all-to-all). CP shards? one sequence's context/KV.

Phase 10 — Cheatsheet: Distributed Inference

The five splits

	splits	comms	where
TP tensor	each layer's weights	all-reduce every layer	within a node (NVLink)
PP pipeline	layers across GPUs	point-to-point + bubbles	across nodes
DP data	full replicas	none on the work; route requests	model must fit
EP expert	MoE experts across GPUs	all-to-all	MoE layers
CP context	one sequence's KV	along the sequence	ultra-long context

TP math (the one to know)

column-parallel: split W by output cols → all-gather. row-parallel: split W by input rows + split x → all-reduce (sum).
block pattern: column then row → one all-reduce per column→row pair (two per transformer block: attention and MLP); result equals single-GPU up to floating-point summation order.
TP all-reduces every layer → needs fast links → TP within a node, PP across nodes.

Who runs it

EngineCore → MultiprocExecutor → N Worker processes (1/GPU) → ModelRunner. Collectives happen inside the parallel Linear layers; groups/ranks in parallel_state.py. Model code unchanged for any parallel size.

Combine for scale

e.g. TP=8 in-node + PP=2 across nodes + DP replicas + EP for MoE. Choosing the mix for a model+SLA is the staff decision.

Key upstream

distributed/parallel_state.py:1370 init :1506 initialize_model_parallel :1241 get_tp_group :1849 tp_world_size
distributed/communication_op.py:12 all_reduce :17 all_gather
layers/linear.py:410 ColumnParallelLinear :975 QKVParallelLinear :1392 RowParallelLinear
v1/executor/multiproc_executor.py:102 MultiprocExecutor · v1/worker/gpu_worker.py:109 Worker

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

Phase 11 — The Hitchhiker's Guide to Multi-LoRA

← Phase 10 · Course home · Phase 12 →

Don't Panic

A LoRA is a tiny "personality patch" for a big model. Instead of fine-tuning all 8 billion weights (expensive, and you'd need a full copy per use-case), you train two small matrices that nudge the frozen base model toward a specific task — legal writing, a coding style, a customer's tone. The magic vLLM does:

Serve many different LoRAs from ONE base model at the same time — request A uses the legal adapter, request B the medical one, request C none — all in a single batch, sharing the base weights, by applying each request's tiny patch inside one batched operation.

This is a structural cost win: thousands of fine-tunes on shared base weights, instead of a whole deployment per customer. This phase is that batched-adapter machinery.

Base model (shared, frozen) ──────────────┐
   request A → + legal adapter (A_legal, B_legal)   ┐
   request B → + medical adapter (A_med, B_med)     ├─ all in one batch, one base read
   request C → + nothing (base only)                ┘

Step 1: What a LoRA actually is (the math, gently)

A model layer multiplies by a big weight W (say 4096×4096 = 16M numbers). A LoRA says: don't change W; add a small correction made of two skinny matrices.

W'  =  W  +  scaling × (B · A)
                         │   │
                         │   └ A: shape (r, in)    "down" — squeeze to a tiny rank r (e.g. 16)
                         └ B: shape (out, r)       "up"   — expand back to full size

r (the rank) is tiny (8, 16, 64), so A and B together are tens to hundreds of times smaller than W: for the 4096×4096 example, r = 16 gives 16 × (4096 + 4096) ≈ 131K numbers, 128× fewer. Applying the patch to an input x is two small matmuls:

1. SHRINK:  s = x · Aᵀ          (in → r)   "squeeze x down to rank r"
2. EXPAND:  Δ = s · Bᵀ          (r → out)  "expand back up"
output = x · Wᵀ  +  scaling × Δ

So a LoRA costs one big base matmul (shared by everyone) plus two tiny rank-r matmuls. That's why it's cheap. You'll implement exactly this shrink/expand in lab-01.
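The shrink/expand recipe can be checked against the merged-weight definition W' = W + scaling × (B · A) in a few lines of numpy. A minimal sketch with small made-up shapes; `lora_forward` is a hypothetical helper name, not a vLLM function.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, scaling = 64, 48, 4, 0.5

x = rng.standard_normal((5, d_in))       # 5 input rows
W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in))       # "down": in -> r
B = rng.standard_normal((d_out, r))      # "up":   r -> out

def lora_forward(x, W, A, B, scaling):
    s = x @ A.T                           # 1. shrink: (rows, in) -> (rows, r)
    delta = s @ B.T                       # 2. expand: (rows, r)  -> (rows, out)
    return x @ W.T + scaling * delta      # base output plus the scaled patch

# Same result as materializing the patched weight, without ever building B @ A.
merged = x @ (W + scaling * (B @ A)).T
assert np.allclose(lora_forward(x, W, A, B, scaling), merged)

# Size of (A, B) relative to a 4096 x 4096 base weight, for a few ranks.
for rank in (8, 16, 64):
    ratio = (4096 * 4096) / (rank * (4096 + 4096))
    print(f"r={rank}: adapter is {ratio:.0f}x smaller than W")
```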

🆕 New words: LoRA (Low-Rank Adaptation — a small additive patch), rank r (the squeeze dimension, small), A/B (the down/up matrices), shrink/expand (the two matmuls), adapter (one trained (A,B) pair).

Step 2: The hard part — many adapters in one batch

Serving one LoRA is easy (just add its delta). The challenge is a batch where different rows use different adapters:

batch row 0 → adapter "legal"     row 1 → adapter "medical"    row 2 → base (no adapter)

The naive fix — loop over rows, apply each adapter separately — destroys batching (you're back to tiny per-request work, Phase 5's enemy). The real fix is a grouped operation: sort/group rows by adapter, and in one kernel apply each adapter to its group. This is what the punica / SGMV kernels do (SGMV = Segmented Gather Matrix-Vector). Conceptually it's the same "group by id, do a grouped matmul" trick you saw for MoE experts in Phase 7 — here grouped by adapter id instead of expert id.

group rows by adapter id  →  for each adapter: one matmul on its rows  →  scatter back
cost ≈ base matmul (shared)  +  a little per distinct adapter  ≪  N separate model runs

You'll build this grouped application in lab-01 and prove it equals the per-row reference.
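To make the "group by adapter id" idea concrete before the lab, here is a small numpy sketch of both the per-row loop and the grouped form. It is not the lab's reference solution: the signatures follow the mini-build's description, the dict-based `adapters` and `scalings` containers are assumptions, and −1 marks a base-only row as in the mini-build's convention.

```python
import numpy as np

def apply_per_row(x, W, adapters, adapter_ids, scalings):
    """Reference: handle each row on its own (what batching wants to avoid)."""
    out = np.empty((x.shape[0], W.shape[0]))
    for i, aid in enumerate(adapter_ids):
        out[i] = x[i] @ W.T
        if aid != -1:
            A, B = adapters[aid]
            out[i] += scalings[aid] * (x[i] @ A.T) @ B.T
    return out

def apply_batched(x, W, adapters, adapter_ids, scalings):
    """Grouped: one shared base matmul, then one small delta per distinct adapter."""
    adapter_ids = np.asarray(adapter_ids)
    out = x @ W.T                                   # shared by every row
    for aid in np.unique(adapter_ids):
        if aid == -1:
            continue                                # base-only rows need nothing more
        rows = np.flatnonzero(adapter_ids == aid)   # this adapter's segment
        A, B = adapters[int(aid)]
        out[rows] += scalings[int(aid)] * (x[rows] @ A.T) @ B.T
    return out

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 12, 2
x = rng.standard_normal((6, d_in))
W = rng.standard_normal((d_out, d_in))
adapters = {
    0: (rng.standard_normal((r, d_in)), rng.standard_normal((d_out, r))),
    1: (rng.standard_normal((r, d_in)), rng.standard_normal((d_out, r))),
}
scalings = {0: 0.5, 1: 2.0}
adapter_ids = [0, 1, -1, 1, 0, -1]

batched = apply_batched(x, W, adapters, adapter_ids, scalings)
assert np.allclose(batched, apply_per_row(x, W, adapters, adapter_ids, scalings))
```

The loop over distinct adapter ids is what a GPU kernel would fuse into a single launch; the grouping logic is the part that carries over.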

Step 3: Managing adapters in memory

GPUs have limited memory, so vLLM keeps a bounded number of adapters resident:

max_loras — how many distinct adapters can be in a single batch/step.
adapters are loaded on demand and LRU-evicted when the budget is exceeded (like the KV cache's eviction, Phase 2 — same pattern, different objects).
the scheduler (Phase 3) respects max_loras: it won't admit a request whose adapter would exceed the limit this step (you saw the scheduled_loras check in the Phase 3 deep-dive).

A request names its adapter with a LoRARequest (id + name + path). Adapter id 0 conventionally means "base model, no adapter."
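The load-on-demand, LRU-evict behavior described above can be modeled with an `OrderedDict`. This is a toy sketch, not vLLM's manager: `AdapterSlotCache` and its methods are hypothetical names, and "loading" an adapter is reduced to assigning it a slot index.

```python
from collections import OrderedDict

class AdapterSlotCache:
    """Toy model of a fixed number of adapter slots with LRU eviction."""

    def __init__(self, max_loras):
        self._free_slots = list(range(max_loras))
        self._resident = OrderedDict()   # adapter_id -> slot, least recently used first
        self.hits = 0
        self.misses = 0

    def activate(self, adapter_id):
        """Return the slot holding adapter_id, loading (and evicting) if needed."""
        if adapter_id in self._resident:
            self._resident.move_to_end(adapter_id)    # mark as most recently used
            self.hits += 1
            return self._resident[adapter_id]
        self.misses += 1
        if self._free_slots:
            slot = self._free_slots.pop(0)
        else:
            _evicted_id, slot = self._resident.popitem(last=False)  # evict the LRU adapter
        self._resident[adapter_id] = slot
        return slot

    def resident_ids(self):
        return list(self._resident)

cache = AdapterSlotCache(max_loras=2)
slots = [cache.activate(adapter_id) for adapter_id in (1, 2, 1, 3, 2)]
print(slots, cache.hits, cache.misses, cache.resident_ids())
```

In the sequence above, adapter 3 evicts adapter 2 (the least recently used), so the final request for adapter 2 is a miss that in turn evicts adapter 1. A working set larger than the slot count turns every request into a reload, which is the thrash case.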

Step 4: LoRA on MoE and which layers get patched

LoRA is applied to the linear layers — typically the attention projections (Q/K/V/O) and the MLP. For MoE models (Phase 7), adapters can patch the expert layers too (lora/layers/fused_moe.py) — trickier because of the routing, but the same shrink/expand idea. Not every layer needs an adapter; which ones are patched is part of how the LoRA was trained.

The invariants to memorize

LoRA: W' = W + scaling × B·A, rank r ≪ in,out. Apply = base matmul + shrink (→r) + expand (→out).
Multi-LoRA = grouped application by adapter id (punica/SGMV): one base read, a little extra per adapter — not N separate runs.
max_loras bounds distinct adapters per step; the manager LRU-evicts the rest; the scheduler enforces it.
Base weights are shared and read once; each adapter adds only r×(in+out) params per patched layer.
Output for a batch of mixed adapters equals applying each adapter per-request — batching is an optimization, not a behavior change (recurring theme).

What you'll do

Read: 01-deep-dive.md — LoRARequest, the LoRA layers, the punica shrink/expand/add_lora_linear, and the manager + scheduler hook, line-anchored.
Build: 02-mini-build.md — batched multi-adapter LoRA matmul.
Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
- lab-01-batched-lora-matmul [CPU-OK] — implement shrink/expand + grouped multi-adapter application; prove it equals the per-request loop.
- lab-02-serve-many-loras [GPU-OPT] — serve 3 adapters in one batch on real vLLM (captured).
- lab-03-lora-economics [CPU-OK] — the multi-tenant arithmetic: 32 MiB per adapter (deriving lab-02's logged number), ~430× shrink, 87 GPUs saved at 100 tenants.
- lab-04-adapter-slot-cache [CPU-OK] — the LRU slot cache behind max_loras and the scheduler walk that defers (not barriers) overflow requests; thrash arithmetic included.
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.

← Phase 10 · Course home · Phase 12 →

Phase 11 — Deep Dive: multi-LoRA in real vLLM

Paths relative to upstream/ at v0.22.1 @ 0decac0.

vllm/lora/request.py            LoRARequest (how a request names its adapter)
vllm/lora/lora_weights.py       the (A, B) weight tensors of an adapter
vllm/lora/lora_model.py         LoRAModel (one loaded adapter's layers)
vllm/lora/model_manager.py      load / activate / LRU-evict adapters
vllm/lora/worker_manager.py     per-worker adapter management
vllm/lora/layers/               LoRA-wrapped layers (base_linear, column/row parallel, fused_moe)
vllm/lora/punica_wrapper/       the batched SGMV/BGMV kernels (shrink / expand / add_lora_linear)

1. The request: `LoRARequest`

vllm/lora/request.py:8 — class LoRARequest with lora_int_id (globally unique id), lora_name, and the adapter path. The scheduler and managers key everything off lora_int_id; id 0 means base. This is what a user attaches to a request to say "serve me with the legal adapter."

2. The patched layers: `lora/layers/`

A LoRA layer wraps a base layer (Phase 6's ColumnParallelLinear, etc.) and adds the shrink/expand delta. Read lora/layers/base_linear.py and column_parallel_linear.py: in forward they compute the base output, then call the punica wrapper to add the per-request LoRA delta. So the model still builds normal layers; the LoRA manager swaps in these wrappers when adapters are active. lora/layers/fused_moe.py does the same for MoE expert layers (Phase 7).

3. The batched kernels: `punica_wrapper/`

This is the heart — applying different adapters to different rows in one call. punica_base.py defines the interface (PunicaWrapperABC :22, PunicaWrapperBase :124):

add_shrink (:42) — the down-projection s = x · Aᵀ for all rows, each using its adapter's A.
add_expand (:57) — the up-projection Δ = s · Bᵀ, each using its adapter's B.
add_lora_linear (:88) — the full "base + shrink + expand" for a linear layer.

The implementations (punica_gpu.py, punica_cpu.py, selected by punica_selector.py) use SGMV (Segmented Gather Matrix-Vector): rows are segmented by adapter id, and each segment is matmul'd against its adapter's slice in one grouped kernel. Read PunicaWrapperCPU.add_shrink/add_expand (punica_cpu.py:166/:197) for the most readable version — it's literally "for each adapter segment, do the small matmul," which is exactly your lab-01 grouped implementation.

4. The manager: load, activate, evict

vllm/lora/model_manager.py — LoRAModelManager loads adapters into a fixed set of GPU "slots", activates the ones needed this step, and LRU-evicts when over max_loras (same eviction pattern as the KV BlockPool, Phase 2). worker_manager.py drives this per worker. lora_weights.py holds an adapter's A/B tensors (stacked across layers).

5. The scheduler hook (recall Phase 3)

In vllm/v1/core/sched/scheduler.py, the waiting-admission loop checks max_loras: it tracks scheduled_loras and skips a waiting request if admitting its adapter would exceed the limit this step (you saw this around :573 in the Phase 3 deep-dive). So multi-LoRA, like spec decode, rides the normal scheduler with one extra constraint rather than a separate path.
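The shape of that constraint can be sketched in isolation. The code below is a simplified illustration, not a copy of vLLM's scheduler: `admit_waiting` is a hypothetical function, requests are reduced to (request id, adapter id) pairs, adapter id 0 means base as described above, and every other admission condition (token budget, KV blocks) is ignored.

```python
def admit_waiting(waiting, running_loras, max_loras):
    """Walk the waiting queue, skipping requests whose adapter would exceed max_loras.

    waiting: list of (request_id, lora_id); lora_id 0 means the base model.
    running_loras: adapter ids already in use by running requests this step.
    """
    scheduled_loras = set(running_loras)
    admitted, deferred = [], []
    for request_id, lora_id in waiting:
        needs_new_slot = lora_id != 0 and lora_id not in scheduled_loras
        if needs_new_slot and len(scheduled_loras) >= max_loras:
            deferred.append(request_id)   # skip it, but keep walking the queue
            continue
        if lora_id != 0:
            scheduled_loras.add(lora_id)
        admitted.append(request_id)
    return admitted, deferred

waiting = [("a", 1), ("b", 2), ("c", 3), ("d", 0), ("e", 1)]
admitted, deferred = admit_waiting(waiting, running_loras=set(), max_loras=2)
print(admitted, deferred)
```

Request "c" is deferred because a third distinct adapter would exceed the limit, while "d" (base) and "e" (an adapter already scheduled) behind it are still admitted. The overflow request is skipped for this step without blocking the rest of the queue.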

Reading checklist

LoRARequest — what identifies an adapter, and what does id 0 mean?
A LoRA layer's forward — base output then what? Where does the delta come from?
add_shrink/add_expand (punica_cpu.py:166/:197) — match them to shrink (→r) / expand (→out).
How does SGMV apply different adapters to different rows in one call (segments)?
Where does max_loras get enforced — in the manager and the scheduler?

Now build it: 02-mini-build.md, then the labs.

Phase 11 — Mini-Build: batched multi-adapter LoRA

You'll implement the LoRA delta (shrink → expand) and the grouped application that serves many adapters in one batch, then prove it equals applying each adapter per-request. This is the punica/SGMV idea in numpy.

The task (lab-01)

Implement, in numpy:

lora_delta(x, A, B, scaling) → scaling × (x @ A.T) @ B.T. (A:(r,in), B:(out,r).) Note it's two small matmuls with a rank-r bottleneck.
apply_single(x, W, A, B, scaling) → x @ W.T + lora_delta(...) (base + one adapter).
apply_batched(x, W, adapters, adapter_ids, scalings) → each row i of x uses adapters[adapter_ids[i]] (an (A,B) pair, or None for base-only). Do it grouped: compute the shared base x @ W.T once, then for each distinct adapter id add its delta to its rows. Must equal a per-row reference loop.

adapter_ids[i] == -1 (or None entry) means "base only, no adapter" for that row.

The point (the insight)

apply_batched reads the base weight once for the whole batch and adds only a tiny rank-r delta per adapter group — so serving N adapters costs ≈ base + N small matmuls, not N full model runs. That's the multi-tenant cost advantage. Your grouping by adapter_id mirrors SGMV's segmenting; it's the same "group by id" trick as MoE (Phase 7), here by adapter.

Definition of done

pytest phase-11-multi-lora/labs -q

Tests pin: apply_batched == per-row reference; base-only rows equal x @ W.T; the delta has the right rank-r structure; and a single shared base matmul covers all rows.

Map to the real engine

your numpy	real vLLM
`lora_delta` (shrink→expand)	`add_shrink` / `add_expand` (`punica_cpu.py:166`/`:197`)
`apply_batched` (grouped by id)	`add_lora_linear` / SGMV (`punica_base.py:88`)
`adapters` dict by id	`LoRAModelManager` slots (`model_manager.py`)
`adapter_ids` per row	`LoRARequest.lora_int_id` (`request.py:8`)
`max distinct adapters`	`max_loras` (manager LRU + scheduler check)

Phase 11 Labs — Multi-LoRA

Four labs on serving many fine-tunes over one set of base weights. The arc: build the grouped delta math and prove consolidation changes nothing (lab-01), price the adapters and the fleet savings (lab-03), manage the slot cache and its scheduler constraint (lab-04), then watch a mixed base+adapter batch produce two models' behavior in one step on real hardware (lab-02).

Recommended order: 01 → 03 → 04 → 02. (Directory numbers predate labs 03–04; the recommended order runs math, economics, machinery, demo.) CPU labs follow the standard contract: starter.py (your work), solution.py (reference), test_lab.py (the spec); the default run grades the solution, LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-11-multi-lora/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-11-multi-lora/labs/lab-01-batched-lora-matmul -q

Labs

lab-01-batched-lora-matmul `[CPU-OK]`

The punica/SGMV idea in numpy: shrink → expand deltas, one shared base matmul for the whole batch, per-adapter-group scatter-adds — proven exactly equal to the per-row loop (the consolidation safety case). Base-only rows ride free; the delta provably factors through the rank-r bottleneck. Skills: never materialize B@A; the group-by-parameter-set pattern (MoE's permute trick, second appearance); why mixed batches cost ~nothing.

lab-02-serve-many-loras `[GPU-OPT]`

The integration test: one batch, base + SQL adapter, two behaviors, one 12.55 GiB weight copy, 0.03 GiB of adapter — every number reconciled against labs 01/03/04. Plus the productization surface: adapters as model names in the OpenAI API, runtime loading, the cold-slot p99 signature. Annotated capture included. Skills: the operational knobs; behavior follows the tag; eval-diff due diligence for tenant migrations.

lab-03-lora-economics `[CPU-OK]`

The multi-tenant arithmetic as functions: 32 MiB per rank-16 7B adapter (deriving lab-02's logged "0.03 GiB" from constants), ~430× smaller than the base, 32 per GiB, rank as a linear memory dial with fleet-wide blast radius, and 87 GPUs saved at 100 tenants. Skills: economics-as-tested-functions; max_lora_rank as a memory commitment; auditing platform pitches in your head.
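One set of constants that reproduces the 32 MiB figure is sketched below. The constants are assumptions, not taken from lab-03: a 32-layer model with hidden size 4096, adapters on the four attention projections only (each 4096 × 4096), rank 16, and 2 bytes per parameter. The lab's own derivation may use different inputs.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def lora_adapter_bytes(num_layers, rank, module_shapes, dtype_bytes=2):
    """Bytes for one adapter: rank * (d_in + d_out) params per patched module."""
    params_per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes)
    return num_layers * params_per_layer * dtype_bytes

# Assumed targets: q, k, v, o projections, each 4096 -> 4096.
attention_projections = [(4096, 4096)] * 4

adapter = lora_adapter_bytes(num_layers=32, rank=16, module_shapes=attention_projections)
print(adapter / MIB, "MiB per adapter")          # 32.0
print(round(adapter / GIB, 2), "GiB")            # 0.03
print(GIB // adapter, "adapters per GiB")        # 32

# Rank is a linear dial: doubling it doubles the per-adapter footprint.
doubled = lora_adapter_bytes(num_layers=32, rank=32, module_shapes=attention_projections)
assert doubled == 2 * adapter
```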

lab-04-adapter-slot-cache `[CPU-OK]`

The machinery max_loras names: pre-allocated slots (kernel/graph shape stability — Phase 5's constraint, again), an LRU cache with honest hit accounting (>75% on 80/20 traffic with 4 slots over 16 adapters), and the scheduler walk that defers — not barriers — overflow requests. The serving-systems kata (cache-with-eviction + admission-under-capacity), third appearance. Skills: OrderedDict as LRU; thrash arithmetic; per-resource admission policy as a design decision; cross-component invariants.

What you can do after this phase

Explain to a CFO why 100 fine-tunes need 13 engines, and to an engineer why the consolidation is provably lossless; size max_loras/max_lora_rank from traffic shape and memory budget rather than defaults; diagnose tenant p99 complaints down to slot thrash with the cache model; and read vllm/lora/ — punica wrappers, the model manager, the scheduler gating — as three labs you've already written. Phase 12 rides lab 09-01's processor hook; the slot discipline you built here returns whenever per-request GPU state does.

Lab 11-01 — Batched Multi-Adapter LoRA `[CPU-OK]`

A fine-tuned model is a base model plus a small correction — LoRA makes the correction a rank-r factorization (ΔW = B @ A, lab-03 prices it at ~1/400th of the base). The serving problem this lab solves: a single batch arrives carrying requests for different fine-tunes — tenant 1 wants the SQL adapter, tenant 2 the support-bot adapter, tenant 3 the plain base — and the engine must apply each row's own correction without forking the base computation. You'll implement the answer in three layers: the rank-r delta itself (shrink → expand), single-adapter application, and the batched grouped form — one shared base matmul for everyone, plus per-adapter-group deltas — proven exactly equal to the naive per-row loop. That grouped form is the punica/SGMV idea, and it's what makes multi-tenant fine-tune serving a product instead of a hack.

Why this lab exists

Multi-LoRA is the cleanest case study in the course of a structural insight beating a resource problem. The naive reading of "serve 50 fine-tunes" is 50 model deployments — 50× the weights, 50× the GPUs (lab-03 does the bill). The structural reading: the 50 models share 99.75% of their parameters, so factor the computation the same way the parameters factor — shared base, per-tenant deltas. This lab makes you earn that reading by implementing it and proving equality, because the equality is the entire safety case: a tenant must get bit-for-bit (well, float-for-float) the same output from the shared deployment as from a dedicated one, or the consolidation is a quality regression wearing a cost-savings hat.

It's also the phase's foundation stone: lab-02 runs this exact computation on a GPU, lab-03 prices the structures you're multiplying, lab-04 manages which adapters are allowed into the batch. And the grouping pattern itself — sort work by its parameter-set, run one efficient op per group, scatter back — is Phase 7 lab-01's MoE permute trick with adapters in place of experts. Second appearance; it has a third (Phase 13's modality grouping). Learn the shape, not just the instance.

Background: shrink, expand, group

The delta for one row: Δy = scaling · (x @ Aᵀ) @ Bᵀ. Shrink to the r-dimensional bottleneck (x @ Aᵀ: in→r), then expand back (r→out). Never materialize B @ A (that's an out × in matrix, and the whole point is not to build it); the two skinny matmuls cost r·(in+out) multiplies per token vs the base's in·out, a ~128× compute shrink at in = out = 4096 and r = 16 that mirrors lab-03's memory shrink. scaling (= α/r in the standard parametrization) is a training-side constant that rides along.

The batch: rows tagged with adapter_ids (−1 = base only). The grouped application:

One base matmul for the whole batch — x @ Wᵀ, every row, regardless of adapter. This is the line that shares the expensive read (the base weights stream from HBM once — Phase 0 lab-04's bandwidth economics, multi-tenant edition).
Per adapter group: gather that adapter's rows, run shrink/expand on the slice, scatter-add back. Segments of rows × one small GEMM each — "Segmented Gather Matrix-Vector multiply" (SGMV), named exactly for this shape.

Base-only rows simply skip step 2 — they cost nothing extra, which is why mixed base+adapter batches (lab-02's demo) are free to compose.

Files

starter.py — lora_delta, apply_single, apply_batched. Your work.
solution.py — reference.
test_lab.py — batched ≡ per-row, base-only rows, the rank-r structure, and the shared-base property.

Run

LAB_IMPL=starter pytest phase-11-multi-lora/labs/lab-01-batched-lora-matmul -q
pytest phase-11-multi-lora/labs/lab-01-batched-lora-matmul -q   # reference

What to implement

The three functions per 02-mini-build.md. The one trap: in apply_batched, accumulate with indexed addition onto the base output (out[rows] += …) — and note that here, unlike Phase 7 lab-01's MoE scatter, plain fancy-indexed += is safe, because each row belongs to exactly one adapter (no duplicate indices). If you reflexively reached for np.add.at after Phase 7: good reflex, then notice why it's not needed — knowing when the footgun fires is better than fearing it always.
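The footgun mentioned here is easy to see in isolation. The snippet below is a small numpy demonstration, independent of the lab code: with duplicate indices, fancy-indexed `+=` applies only one update per index, while `np.add.at` accumulates all of them; with unique indices the two agree.

```python
import numpy as np

# Duplicate indices: fancy-indexed += is buffered, so index 0 is updated once.
buffered = np.zeros(3)
buffered[[0, 0, 1]] += 1.0

# np.add.at is unbuffered: index 0 receives both contributions.
accumulated = np.zeros(3)
np.add.at(accumulated, [0, 0, 1], 1.0)

print(buffered)      # [1. 1. 0.]
print(accumulated)   # [2. 1. 0.]

# Unique indices (each row belongs to exactly one adapter): both forms agree.
rows = np.array([0, 2])
plain = np.zeros(3)
plain[rows] += np.array([5.0, 7.0])
safe = np.zeros(3)
np.add.at(safe, rows, np.array([5.0, 7.0]))
assert np.array_equal(plain, safe)
```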

What the tests prove

Test	What it pins
batched ≡ per-row loop	The consolidation safety case: grouping is an execution strategy, not a semantics change — the course's master invariant, tenant edition
`adapter_id == -1` rows equal pure base	Base traffic rides free in mixed batches; no adapter machinery touches it
the delta is genuinely rank-r	It factors through the r-dim bottleneck — a delta that doesn't is a bug that costs you the entire economics (you'd be applying a full-rank update at full-rank prices)
one shared base matmul	The structural win itself, asserted: the base is read once per batch, not once per tenant
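The "genuinely rank-r" property in the table can be checked numerically. The sketch below is one possible way to do it, not necessarily how test_lab.py does: it computes the delta for a batch with more rows than r and asks numpy for the matrix rank, then contrasts that with a full-rank update of the same shape.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d_in, d_out, r = 32, 24, 20, 4

x = rng.standard_normal((tokens, d_in))
A = rng.standard_normal((r, d_in))
B = rng.standard_normal((d_out, r))

# The LoRA delta factors through an r-dimensional bottleneck...
delta = (x @ A.T) @ B.T
lora_rank = np.linalg.matrix_rank(delta)

# ...whereas a dense update of the same shape does not.
dense_update = x @ rng.standard_normal((d_out, d_in)).T
dense_rank = np.linalg.matrix_rank(dense_update)

print(lora_rank, dense_rank)
assert lora_rank <= r
assert dense_rank > r
```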

Hitchhiker's notes

Map to upstream: add_shrink / add_expand in upstream/vllm/lora/punica_wrapper/punica_base.py (and the CPU reference in punica_cpu.py — genuinely readable, go diff it against your solution) are your two halves of lora_delta; add_lora_linear is your apply_batched. The GPU versions fuse the segment loop into one kernel launch indexed by lab-04's slot ids — grouping logic identical, loop distributed across the grid.
Where LoRA hooks into the model: every ColumnParallelLinear / RowParallelLinear (Phase 10 lab-01!) gets a LoRA-aware wrapper that adds the delta after the base matmul. Under tensor parallelism the adapter shards along the same axes as its base layer — A with the input shard, B with the output shard — so TP × LoRA composes with no new collectives. Layer abstractions that compose are what make features multiply instead of interfere; vLLM's linear-layer stack is the load-bearing example.
Why group at all, on a GPU? The per-row loop launches a skinny matmul per request; the grouped form launches per adapter — and within a group, the rows share the adapter's A/B read (the tiling/reuse argument of Phase 7 lab-03, at miniature scale). With 64 rows across 4 adapters, that's 4 well-shaped small GEMMs vs 64 degenerate ones. Same arithmetic, ~order-of-magnitude better hardware shape.
The delta is dense in the batch dimension but tiny in compute — so multi-LoRA overhead rides almost entirely on decode steps' idle compute (Phase 0 lab-04's story again: bandwidth-bound steps have FLOPs to spare, and the adapter's extra bytes are 32 MiB against the base's 13 GiB). This is why lab-02's capture shows no visible throughput tax — and why the claim "LoRA serving is nearly free" survives measurement.

Going further

Implement the fused-into-base alternative for a single-adapter batch ((W + scaling·B@A) materialized, one matmul) and benchmark both in numpy at batch 1 vs 64. Merging wins single-tenant; grouping wins multi-tenant — find the crossover and you've reproduced the deployment decision lab-03's notes describe.
Add rank heterogeneity: adapters of rank 8, 16, 64 in one batch (real fleets have this). Your grouped loop handles it naturally; the slot-buffer version (lab-04) pads everyone to max_lora_rank — compute the padding waste and you've found why that config knob is set with gritted teeth.
Wire it into mini_vllm: adapter id on the Request, deltas applied to the toy model's logits per row. Multi-tenant mini-serving in ~30 lines — and the scheduler interaction (lab-04's max_schedulable) has a home to land in.

References

upstream/vllm/lora/punica_wrapper/punica_cpu.py — the readable reference your solution mirrors; punica_base.py — add_shrink/add_expand/add_lora_linear.
Hu et al., LoRA (2021) — the factorization: https://arxiv.org/abs/2106.09685
Chen et al., Punica: Multi-Tenant LoRA Serving (2023) — SGMV, the kernel this lab's grouping becomes: https://arxiv.org/abs/2310.18547
Phase 7 lab-01 — the same grouping pattern with experts; lab-03 — the economics; lab-04 — which adapters get into the batch at all.

Lab 11-02 — Serve Many LoRAs in One Batch `[GPU-OPT]`

The CPU labs built the machinery: the grouped delta (lab-01), the 32 MiB price tag (lab-03), the slot cache (lab-04). This lab watches all three earn their keep on real hardware: one batch with two requests (one wanting the plain base model, one wanting a SQL fine-tune), served together over a single 12.55 GiB copy of Llama-2-7B, each getting visibly different behavior ('apple, banana, orange' vs 'SELECT name FROM users;'), with the adapter adding 0.03 GiB. The multi-tenant economics, demonstrated in four lines of API and one annotated log.

No GPU? Don't panic. The capture below carries the demonstration; the reconciliation against labs 01/03/04 is the work, and it's hardware-free.

Why this lab exists

Every GPU-OPT lab in this course is an integration test of the CPU labs' models, and this one has the most user-visible payoff: different model behavior per request in one batch is the kind of thing that sounds impossible until you've traced lab-01's grouped matmul, and obvious afterward. Running it (or reading the capture) closes the loop and teaches the operational surface you'll actually touch: enable_lora; the max_loras/max_lora_rank reservations (lab-04's and lab-03's knobs, now with startup-log consequences); LoRARequest's id-and-path plumbing; and the per-request lora_request parameter. The OpenAI-compatible server exposes that last one as the model field: each adapter looks like a model name to API clients, the productization detail that makes multi-tenant serving feel like multi-model serving.

Requirements

uv pip install -e ".[vllm]"
huggingface-cli download meta-llama/Llama-2-7b-hf   # the shared base
# plus a small LoRA adapter for it from the Hub (any task with visible behavior —
# SQL generation is ideal because base-vs-adapter outputs differ unmistakably)

Steps

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True,
          max_loras=2, max_lora_rank=16)
sql = LoRARequest("sql-adapter", 1, "/path/to/sql_lora")
sp = SamplingParams(max_tokens=32, temperature=0)

out = llm.generate(
    ["List 3 fruits:", "Table users(id,name). Query all names:"],
    sp,
    lora_request=[None, sql],   # request 0 = base, request 1 = SQL adapter
)
for o in out:
    print(repr(o.outputs[0].text))

Then the experiments that make it a lab rather than a demo: swap which request gets the adapter (behavior follows the tag, not the prompt); send the SQL prompt to the base (watch it ramble — the adapter, not the prompt, carries the behavior); and load a third adapter with max_loras=2 to meet lab-04's slot machinery in the logs.

Captured output (real run, Llama-2-7b + SQL LoRA, A100, vLLM 0.22.1, trimmed)

INFO ... LoRA enabled: max_loras=2, max_lora_rank=16
'apple, banana, orange'                          # request 0: base behavior
'SELECT name FROM users;'                         # request 1: SQL adapter behavior
INFO ... Model weights take 12.55 GiB (shared by ALL requests)
INFO ... LoRA adapter 'sql-adapter' loaded: 0.03 GiB   # ~1/400th of the base

Reading the numbers

12.55 GiB, shared — the base read once per step for the whole batch: lab-01's step-1 matmul, weighed. The dedicated-deployment alternative would hold this per tenant — lab-03's gpus_saved, with real units.
0.03 GiB: lab-03's adapter_bytes(4096, 32, 16) = 32 MiB = 0.03125 GiB, measured. When a derived constant and a log line agree to the precision the log prints, both the model and your reading of the log are validated (the reconciliation habit, sixth appearance).
Two behaviors, one batch — the rows took the same forward pass through the base; only request 1's rows detoured through the shrink/expand delta (lora_request=[None, sql] is literally lab-01's adapter_ids = [-1, 1]). The same step, two models' worth of behavior — there is no trick left in that sentence for you anymore.
max_loras=2, max_lora_rank=16 in the first line — lab-04's slot count and lab-03's per-slot size, reserved at startup. Read them as a memory line item: 2 slots × rank-16 buffers, carved before KV blocks (Phase 2 lab-03's ritual gained a claimant).

Hitchhiker's notes

The API server's productization: under vllm serve --enable-lora --lora-modules sql=/path/..., each adapter appears as a model name in the OpenAI-compatible /v1/models list, and clients select fine-tunes via the standard model field. Tenants never learn they're sharing; the consolidation is invisible by design. Runtime add/remove exists too (/v1/load_lora_adapter) — onboarding a tenant without a restart.
Latency asymmetry to expect: the first request for a cold adapter pays the host→device slot load (lab-04's miss, milliseconds) plus — first time ever — disk loading. Steady-state requests pay only the delta compute (~1%, invisible). If a tenant's p50 is fine but p99 spikes correlate with their traffic gaps, that's the slot cache breathing — lab-04's thrash arithmetic is the diagnosis sheet.
Quality due diligence transfers from Phase 6 lab-02: "the outputs looked right" is a smoke test. A tenant migration to shared serving deserves an eval-set diff (dedicated vs consolidated), which — per lab-01's equality proof — should show only float-reordering noise. If it shows more, suspect rank/config mismatches in the adapter conversion, not the engine.
What doesn't work (v0.22): adapters must target the base's linear layers (embedding/lm-head support varies), rank ≤ max_lora_rank, and the base model must match exactly (an adapter trained on Llama-2-7B-chat applied to Llama-2-7B-hf loads fine and behaves subtly wrong — the silent version-skew failure; checksum your bases).
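
The "adapter as model name" point in the first note above is easiest to see in the request body itself. The sketch below only builds the JSON for a POST to the OpenAI-compatible /v1/completions route and sends nothing; the helper name completion_request is hypothetical, the adapter name sql is taken from the --lora-modules example above, and the base name assumes the server was started with the Hub id as its model argument.

```python
import json

def completion_request(model_name, prompt, max_tokens=32):
    """Build the JSON body for POST /v1/completions (no network call here)."""
    return json.dumps({
        "model": model_name,      # base model name OR a registered adapter name
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    })

# Same server, same route: only the model field differs between tenants.
base_body = completion_request("meta-llama/Llama-2-7b-hf", "List 3 fruits:")
sql_body = completion_request("sql", "Table users(id,name). Query all names:")
```

From the client's side nothing says "LoRA": the tenant picks a name, and the server maps that name to an adapter.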

Reflect

Trace request 1's tokens through the phase: which lab's code decided it could enter the batch (lab-04), which loaded its weights where (lab-03's bytes into lab-04's slot), which computed its detour (lab-01), and what the base request paid for any of it (nothing — lab-01's -1 rows). If you can narrate that chain cold, the phase is yours.
Your platform hosts 40 tenant fine-tunes on max_loras=8 engines. Using labs 03+04: what traffic shape makes this comfortable, what shape melts it, and what do you monitor to tell them apart? (Skew → slot hit rate; uniform simultaneous activity → thrash; monitor per-engine adapter hit rate and defer counts.)
Why does the engine require max_lora_rank up front instead of sizing slots per adapter? (Phase 5's Constraint: fixed buffer shapes for captured graphs and fused kernels — the recurring trade of flexibility for replay. Heterogeneous ranks pad to the max; lab-01's going-further priced that.)

References

upstream/vllm/lora/ — request plumbing, slot manager, punica kernels: the whole phase's upstream home.
vLLM docs, LoRA Adapters — serving config, runtime loading, the OpenAI-server productization: https://docs.vllm.ai/en/latest/features/lora/
Labs 01 (the math), 03 (the bill), 04 (the slots) — this run is their joint integration test.
Phase 6 lab-02 — the quality-verification discipline that transfers here verbatim.

Lab 11-03 — LoRA Economics: the Multi-Tenant Arithmetic `[CPU-OK]`

Multi-LoRA serving exists as a product category because of five numbers, and this lab has you compute all five: a rank-16 adapter for a 7B model weighs 32 MiB (you'll derive the exact figure that appeared as "0.03 GiB" in lab-02's capture; model and measurement agreeing is the course's favorite trick); that's ~1/417th of the base weights; 32 of them fit in a single GiB of spare HBM; rank scales the bill linearly (and quality famously doesn't); and serving 100 tenants takes 13 engines instead of 100 GPUs. When a platform pitch says "thousands of fine-tunes on shared infrastructure," this lab is the spreadsheet behind the slide, and after it you can audit such pitches in your head.

Why this lab exists

This is the third "economics as functions" lab in the course (after Phase 0 lab-02's KV calculator and Phase 8 lab-04's speculation model), and the pattern deserves naming: the highest-leverage engineering questions — can we afford it? how many fit? when does it stop paying? — reduce to short arithmetic over architecture constants, and an engineer who has packaged that arithmetic into tested functions answers in seconds what others answer with meetings. Multi-LoRA's arithmetic is the most business-shaped of the three: it directly prices a product (per-tenant fine-tunes) against its alternative (dedicated deployments), and the gpus_saved function is, not even metaphorically, a line in someone's cloud bill.

The lab also grounds two config knobs you'll meet operationally: max_lora_rank sizes the pre-allocated adapter buffers (rank is a memory commitment, not just a quality dial — lab-04 builds the slots this arithmetic sizes), and max_loras is the concurrency denominator in the fleet math.

Background: where the 400× comes from

A LoRA adapter replaces a weight update ΔW (which would be out × in, as big as the layer) with a rank-r factorization B @ A — A: (r, in), B: (out, r) — so the parameter count collapses from out·in to r·(in + out). For a 4096² projection at r=16: 16.8M → 131K parameters, a 128× shrink per layer. Across a 7B model (32 layers × 4 attention projections targeted, the standard recipe):

131,072 params × 4 targets × 32 layers × 2 B (fp16) = 32 MiB
7,000,000,000 / 16,777,216 params ≈ 417×

The shrink is the whole business model: base weights are read once per step for the entire batch regardless of how many tenants it contains (lab-01's shared base matmul), KV cache is adapter-agnostic, and each tenant's marginal footprint is their 32 MiB plus nothing. The compute side has the same shape — the delta costs 2·r·(in+out) FLOPs per token against the base's 2·in·out, the same ~128× ratio — which is why a batch full of different adapters runs at nearly base-model speed (punica/SGMV kernels make the grouping efficient; lab-01 built their logic).
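
The arithmetic above fits in four short functions. This is a sketch under stated assumptions rather than the lab's solution.py: the function names come from the Files list below, but the signatures are guesses (only the positional call adapter_bytes(4096, 32, 16) appears in the course text), and the model is the simplified one used here, with square hidden×hidden targets, 4 targets per layer, and fp16 weights.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def lora_params_per_layer(d_in, d_out, rank):
    """Parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)

def adapter_bytes(hidden, n_layers, rank, targets=4, bytes_per_param=2):
    """Adapter size assuming square hidden x hidden targets (a simplification)."""
    per_target = lora_params_per_layer(hidden, hidden, rank)
    return per_target * targets * n_layers * bytes_per_param

def adapters_per_gib(nbytes):
    """How many whole adapters of this size fit in one GiB."""
    return GIB // nbytes

def shrink_ratio(base_params, hidden, n_layers, rank, targets=4):
    """Base parameter count divided by adapter parameter count."""
    adapter_params = lora_params_per_layer(hidden, hidden, rank) * targets * n_layers
    return base_params / adapter_params

size = adapter_bytes(4096, 32, 16)
ratio = shrink_ratio(7_000_000_000, 4096, 32, 16)
print(size // MIB, "MiB =", round(size / GIB, 2), "GiB")
print(adapters_per_gib(size), "adapters per GiB; shrink ratio ~", round(ratio))
```

Because every term is a product, rank is a linear dial: adapter_bytes(4096, 32, 64) is exactly four times the rank-16 figure.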

Files

starter.py — lora_params_per_layer, adapter_bytes, adapters_per_gib, shrink_ratio, gpus_saved. Your work.
solution.py — reference.
test_lab.py — the per-layer count, the 32 MiB ↔ lab-02 reconciliation, density per GiB, the headline ratio, rank linearity, and the fleet math.

Run

LAB_IMPL=starter pytest phase-11-multi-lora/labs/lab-03-lora-economics -q
pytest phase-11-multi-lora/labs/lab-03-lora-economics -q   # reference

What the tests prove

Test	What it pins
`test_per_layer_params`	`r·(in+out)` — the factorization's bill, exactly
`test_adapter_size_matches_the_lab02_capture`	32 MiB = the "0.03 GiB" from lab-02's real log. Deriving a measured number from constants is the moment the model becomes trustworthy
`test_hundreds_of_adapters_per_gib`	32/GiB — adapter storage is never the constraint; slots and loading are (lab-04)
`test_shrink_ratio_is_the_headline`	400–450×, computed not quoted
`test_rank_is_a_linear_dial`	r=64 costs exactly 4× r=16 — and since `max_lora_rank` sizes every pre-allocated slot, one tenant demanding rank 64 quadruples everyone's slot reservation. Config knobs with fleet-wide blast radius deserve tests
`test_gpus_saved`	100 tenants @ max_loras=8 → 87 GPUs saved. The slide, audited
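
The fleet math in the last row is a ceiling division. A minimal sketch, assuming one GPU per dedicated deployment and one GPU per shared engine; engines_needed is a hypothetical helper name, and the signature of gpus_saved is a guess at the lab's function.

```python
import math

def engines_needed(n_tenants, max_loras):
    """Shared engines required if each serves up to max_loras tenants at once."""
    return math.ceil(n_tenants / max_loras)

def gpus_saved(n_tenants, max_loras):
    """Dedicated GPUs (one per tenant) minus shared engines (one GPU each)."""
    return n_tenants - engines_needed(n_tenants, max_loras)

print(engines_needed(100, 8), gpus_saved(100, 8))
```

The formula is a floor on engine count: it presumes tenant traffic interleaves well enough that max_loras concurrent adapters per engine suffices.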

Hitchhiker's notes

What the simplification hides (know before quoting): real targets aren't all square — Llama's gate/up/down MLP projections (often also adapted) are hidden × 2.7·hidden-ish, and GQA's k/v projections are narrower than q/o. Adapting all-linear-layers at r=16 lands nearer 80–120 MiB for a 7B. The structure of the arithmetic is what transfers; refit the constants to any model card in two minutes.
Why rank-16 at all, if rank is linear cost? Because LoRA quality saturates fast: the original paper's striking result was r=1..4 capturing most of full fine-tuning on many tasks. The production default of 8–16 is generosity, not necessity; tenants asking for 256 are usually solving a data problem with a parameter budget (and multiplying your slot memory 16× over a rank-16 reservation, since max_lora_rank sizes every slot; push back with this lab's numbers).
The denominator in gpus_saved is max_loras, not "adapters you host." Hundreds can sit in host RAM or disk; only max_loras are concurrently active per step. The fleet math assumes tenant traffic interleaves well — 100 tenants who all spike at 9 a.m. sharp need more headroom than the formula's floor. Capacity formulas are load-shape assumptions in disguise (Phase 7 lab-04's lesson, tenant edition).
Why not merge the adapter into the weights (W + BA, zero overhead)? Single-tenant: absolutely, and tooling does. Multi-tenant: merging forks the base, so you're back to one model copy per tenant, which is the disease this phase cures. The unmerged factorization is the sharing mechanism, the same way Phase 2's block indirection is the memory sharing.
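
The merge-or-not trade in the last note can be made concrete with numpy. This is an illustrative sketch with made-up toy shapes: it checks that the merged weight gives the same output as the unmerged base-plus-delta path, and counts what each tenant would have to store under either choice.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, scaling = 32, 24, 4, 0.5         # toy sizes, hypothetical
W = rng.standard_normal((d_out, d_in))           # shared base weight
A = rng.standard_normal((r, d_in))               # shrink factor
B = rng.standard_normal((d_out, r))              # expand factor
x = rng.standard_normal((5, d_in))

# Unmerged: shared base matmul plus the rank-r detour.
unmerged = x @ W.T + scaling * (x @ A.T) @ B.T

# Merged: fold the delta into a private copy of the weight, then one matmul.
W_merged = W + scaling * (B @ A)
merged = x @ W_merged.T

# Per-tenant storage: a full weight copy vs just the two factors.
merged_params_per_tenant = W_merged.size
unmerged_params_per_tenant = A.size + B.size
delta_rank = np.linalg.matrix_rank(B @ A)
```

The outputs match up to float reordering, so the choice is purely about what gets stored per tenant: d_out·d_in parameters when merged, r·(d_in + d_out) when kept as factors.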

Going further

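Refit adapter_bytes with real Llama-2-7B shapes (q/k/v/o + gate/up/down; note that Llama-2-7B uses full multi-head attention, so its k/v projections are as wide as q/o, and the narrower GQA widths only matter when you refit for a GQA model) and compare against an actual adapter checkpoint's file size from the Hub. Close the loop with a du -sh.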
Add slot_reservation_bytes(max_loras, max_lora_rank, ...) — the pre-allocated HBM the engine reserves at startup whether or not adapters load (it competes with KV blocks! Phase 2 lab-03's carving, with a new claimant). Compute the KV-block cost of max_loras=32, max_lora_rank=64 on a 24 GiB card.
Price the compute side: delta FLOPs per token vs base FLOPs, then the batch-of-mixed-adapters overhead vs batch-of-one. The answer (~1%) is why lab-02's capture shows no visible throughput tax; verify against it.

References

Hu et al., LoRA: Low-Rank Adaptation of Large Language Models (2021) — the factorization and the rank-saturation result: https://arxiv.org/abs/2106.09685
Chen et al., Punica: Multi-Tenant LoRA Serving (2023) — the SGMV kernel and the multi-tenant economics formalized: https://arxiv.org/abs/2310.18547
Sheng et al., S-LoRA: Serving Thousands of Concurrent LoRA Adapters (2023) — the thousands-of-adapters regime this arithmetic enables: https://arxiv.org/abs/2311.03285
upstream/vllm/lora/ — where max_loras / max_lora_rank size real buffers.
Lab-02 — the captured 0.03 GiB this lab derives; lab-04 — the slots this lab sizes.

Lab 11-04 — Adapter Slots: the LRU Cache the Scheduler Must Obey `[CPU-OK]`

Lab-03 priced adapters at 32 MiB; hundreds fit in spare HBM. So why does max_loras=2 exist, and why is exceeding it a scheduling event rather than a memory error? Because active adapters don't live in loose 32 MiB allocations — they live in pre-allocated slots (fixed buffers sized for max_lora_rank, baked into the kernels' launch shapes and CUDA graphs), and max_loras is the slot count. This lab builds both halves of the machinery that manages them: the LRU slot cache (hit / load / evict+load, with the recency bookkeeping that keeps hot tenants resident) and the scheduler constraint it forces — a step's batch may reference at most max_loras distinct adapters, with overflow requests deferred, not barriered. It is Phase 2's eviction story and Phase 3's admission story, replayed one level up the stack — deliberately.

Why this lab exists

Multi-LoRA's failure modes in production are almost never about the math (lab-01 settled that) — they're about slot pressure: a tenant complains about p99 latency and the cause is their adapter thrashing in and out of slots behind two hotter tenants; throughput sags after onboarding tenant #9 on a max_loras=8 engine because every step now defers someone. Diagnosing these requires exactly the two models you'll build: the cache (whose hit rate is the tenant-experience metric) and the admission rule (whose deferrals are the throughput tax). Both are ~20 lines, and both behave counterintuitively enough under skewed traffic that you want the test suite's numbers in your head before the incident.

The pedagogical reason is the rhyme. You have now built an LRU-flavored eviction structure for KV blocks (Phase 2 lab-05), a multi-resource admission loop (Phase 3 lab-01), and here both again for adapters. The course repeats the pattern on purpose: cache-with-eviction + admission-under-capacity is THE serving-systems kata, and recognizing it instantly — whatever the cached object is — is a maintainer reflex. (You'll see it once more with prefix-cache-aware routing in Phase 15.)

Background: why slots, and what they cost

Why not allocate adapters dynamically, since they're tiny? Three converging reasons:

Kernel shape stability — the punica/SGMV kernels (lab-01's grouping, fused) index adapter weights by slot id out of a stacked buffer (max_loras, max_lora_rank, …); a fixed buffer means fixed pointers and shapes, which CUDA graphs (Phase 5's Constraint 2!) can capture. Dynamic allocation would re-trigger capture or force eager mode.
Predictable memory — max_loras × slot_size is reserved at startup, before KV blocks are carved (Phase 2 lab-03's ritual gains a line item). No mid-serving OOM from a tenant spike; the cost is paid visibly, up front (the course's recurring "pay it where you can see it").
Bounded step complexity — the per-step adapter gather is over ≤ max_loras segments, keeping the kernel's metadata small and the scheduler's reasoning finite.

The slot cache's job is then classic: keep the right max_loras adapters resident. LRU is the policy (recency ≈ tenant activity), move_to_end is the entire implementation subtlety, and a miss costs a host→device copy of lab-03's 32 MiB (~milliseconds — a few decode steps' worth, painful only when it recurs, i.e. when thrashing).
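
A minimal sketch of that LRU policy, using the OrderedDict idiom. The class and method names (AdapterSlotCache, ensure, resident, stats) follow the Files list below, but the return values and the stats layout are assumptions for illustration, not the lab's reference code.

```python
from collections import OrderedDict

class AdapterSlotCache:
    """LRU over a fixed number of adapter slots (sketch; no real weights moved)."""

    def __init__(self, max_loras):
        self.max_loras = max_loras
        self._slots = OrderedDict()      # adapter_id -> None, least recent first
        self.hits = 0
        self.misses = 0

    def ensure(self, adapter_id):
        """Make adapter_id resident; return 'hit', 'load', or 'evict+load'."""
        if adapter_id in self._slots:
            self._slots.move_to_end(adapter_id)   # recency refresh: the LRU part
            self.hits += 1
            return "hit"
        self.misses += 1
        outcome = "load"
        if len(self._slots) >= self.max_loras:
            self._slots.popitem(last=False)       # evict the least recently used
            outcome = "evict+load"
        self._slots[adapter_id] = None
        return outcome

    def resident(self):
        """Resident adapter ids, least recently used first."""
        return list(self._slots)

    def stats(self):
        total = self.hits + self.misses
        return {"hits": self.hits, "misses": self.misses,
                "hit_rate": self.hits / total if total else 0.0}

cache = AdapterSlotCache(max_loras=2)
outcomes = [cache.ensure(a) for a in (1, 2, 1, 3)]
print(outcomes, cache.resident(), cache.stats())
```

Re-touching adapter 1 before adapter 3 arrives is what saves it; delete the move_to_end line and the same trace evicts 1 instead of 2, which is FIFO wearing an LRU name tag.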

Files

starter.py — AdapterSlotCache (ensure/resident/stats) and max_schedulable (the FCFS admission walk with deferral). Your work.
solution.py — reference (note OrderedDict as the LRU: insertion order + move_to_end + popitem(last=False) — the standard Python idiom, worth owning).
test_lab.py — fill/hit/evict mechanics, LRU ordering, the skewed-traffic hit rate, the distinct-adapter cap, base requests riding free, and deferral-not-barrier.

Run

LAB_IMPL=starter pytest phase-11-multi-lora/labs/lab-04-adapter-slot-cache -q
pytest phase-11-multi-lora/labs/lab-04-adapter-slot-cache -q   # reference

What the tests prove

Test	What it pins
`test_fill_then_hit`	The three outcomes and honest hit/miss accounting — the metric a tenant dashboard graphs
`test_lru_evicts_the_coldest`	Recency refresh works: re-touching adapter 1 saves it; 2 dies. Forget `move_to_end` and this test is your tripwire (FIFO masquerading as LRU is the classic one-line bug)
`test_skewed_traffic_loves_lru`	80/20 traffic over 16 adapters, 4 slots → >75% hit rate. Skew is the friend of small caches — the same reason CPU caches work — and the reason `max_loras=8` serves 100 tenants acceptably if traffic is skewed (lab-03's fleet math gains its load-shape footnote)
`test_scheduler_caps_distinct_adapters_per_step`	The admission walk: slots claimed FCFS, reuse free, overflow deferred
`test_base_requests_never_consume_slots`	`None` rides free — mixed base+adapter batches (lab-02's demo) cost slots only for the adapters
`test_deferral_is_not_a_barrier`	A blocked adapter request doesn't stall later admissible ones — contrast with Phase 3 lab-01's head-of-line `break` for memory. Two resources, two deliberately different policies: KV exhaustion stops admission (fairness, deadlock logic); slot exhaustion skips individuals (slots free predictably next step). Policy per resource is a design decision, not a default
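
The last three rows describe one short loop. A sketch under assumptions: the name max_schedulable comes from the Files list, but its signature and return shape (two index lists) are invented here, and None stands for a base-model request as in lab-02's lora_request list.

```python
def max_schedulable(adapter_ids, max_loras):
    """FCFS admission walk; returns (admitted, deferred) request indices."""
    claimed = set()                      # distinct adapters admitted this step
    admitted, deferred = [], []
    for i, aid in enumerate(adapter_ids):
        if aid is None or aid in claimed:
            admitted.append(i)           # base rides free; reuse costs no slot
        elif len(claimed) < max_loras:
            claimed.add(aid)             # claim a slot for a new adapter
            admitted.append(i)
        else:
            deferred.append(i)           # skip this request, keep walking
    return admitted, deferred

queue = [1, None, 2, 3, 1, None]         # adapter id per waiting request
admitted, deferred = max_schedulable(queue, max_loras=2)
print(admitted, deferred)
```

Request 3 is deferred, yet the two requests queued behind it are still admitted. Replacing the deferral branch with a break would turn the skip into a head-of-line barrier.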

Hitchhiker's notes

Where this lives upstream: upstream/vllm/lora/models.py (LoRAModelManager / LRUCacheLoRAModelManager — your cache, with host-side tiers) and the scheduler's lora gating (search max_loras in vllm/v1/core/sched/scheduler.py — your max_schedulable walk, inline in the admission loop). The two-tier reality: evicted adapters drop to host RAM (cheap reload), not to disk; "cold start" for a brand-new adapter adds checkpoint loading on top.
Thrash arithmetic: at max_loras slots and k > max_loras simultaneously active uniform tenants, every step evicts — hit rate collapses toward max_loras/k, each miss costs a 32 MiB copy, and aggregate throughput cliffs. The fix hierarchy: raise max_loras (costs slot memory — lab-03's reservation), shard tenants across engines by affinity (routing — Phase 15's cousin), or batch tenant traffic in time. Knowing the cliff exists before tenant #9 onboards is this lab's operational payoff.
Prefix caching interaction (Phase 2 lab-05's note, now load-bearing): KV computed under adapter X is not valid for adapter Y — the adapter changes the model. The block hash therefore includes the LoRA id; two tenants with identical system prompts share nothing. Multi-tenant capacity planning that assumed prefix-cache savings across tenants is wrong by exactly that assumption.
ensure and max_schedulable must agree — the scheduler admits a set of adapters, then the cache loads them; if the admission cap exceeded the slot count, the load would evict an adapter another admitted request needs this same step. The invariant "admitted distinct adapters ≤ slots" is cross-component (scheduler promises, cache relies), the same shape as Phase 3 lab-04's deadlock invariant. When you modify one side upstream, the review question is always "who relies on this bound?"
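
The thrash arithmetic in the second note can be sanity-checked with a small simulation. This sketch treats traffic as a per-request trace rather than per-step batches, uses a stripped-down LRU instead of the full cache class, and picks its tenant counts and trace length arbitrarily; lru_hit_rate is a hypothetical helper.

```python
import random
from collections import OrderedDict

def lru_hit_rate(trace, slots):
    """Fraction of requests in trace that find their adapter already resident."""
    cache = OrderedDict()
    hits = 0
    for aid in trace:
        if aid in cache:
            cache.move_to_end(aid)
            hits += 1
        else:
            if len(cache) >= slots:
                cache.popitem(last=False)
            cache[aid] = None
    return hits / len(trace)

rng = random.Random(0)
k, slots, n = 16, 4, 20_000
uniform = [rng.randrange(k) for _ in range(n)]       # k tenants, all equally hot
fits = [rng.randrange(slots) for _ in range(n)]      # working set fits the slots

uniform_rate = lru_hit_rate(uniform, slots)
fits_rate = lru_hit_rate(fits, slots)
print(round(uniform_rate, 3), round(fits_rate, 4))
```

With uniform traffic over k tenants, the next request has a slots/k chance of being resident, so the hit rate settles near 4/16. Once the active set fits in the slots, only the cold-start misses remain.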

Going further

Add a host tier: evicted adapters go to a (larger) host LRU; ensure returns "hit" / "load-from-host" / "load-from-disk" with costs 0 / 1 / 30. Run the skewed workload and price the tiers — you've rebuilt LRUCacheLoRAModelManager's actual shape and S-LoRA's core argument.
Couple the two halves: drive max_schedulable's admitted set into the cache per step and assert the invariant above holds for random traffic — then break the cap (+1) and watch which workloads corrupt. Cross-component invariants deserve cross-component tests.
Simulate tenant p99: timestamped requests, miss = +3 steps of latency; compare per-tenant p99 under LRU vs random eviction at various skews. The plot is the argument for LRU — and for affinity routing once skew fades.

References

upstream/vllm/lora/models.py — LoRAModelManager and the LRU variant.
upstream/vllm/lora/punica_wrapper/ — the slot-indexed kernel buffers your cache fronts.
Sheng et al., S-LoRA (2023) — paged adapter memory + the host tier at thousands of adapters: https://arxiv.org/abs/2311.03285
Phase 2 lab-05 — the eviction kata's first appearance; Phase 3 lab-01 — the admission kata's; this lab — both, one level up.

Phase 11 — Exercises: Multi-LoRA

Warm-up (explain)

What is a LoRA, in terms of W' = W + ?? Why is it cheap (use the rank r)?
What are the shrink and expand steps, and what shapes do they pass through?
Why does serving N adapters cost ≈ one base read + N small matmuls, not N model runs?

Core (trace the code)

In LoRARequest (request.py:8), what identifies an adapter and what does id 0/-1 mean?
In punica_cpu.py, match add_shrink (:166) and add_expand (:197) to shrink/expand.
How does SGMV apply different adapters to different rows in one kernel (segments by id)?
Where is max_loras enforced — name both the manager and the scheduler spot (Phase 3).

Build (your lab)

In lab-01, why is the LoRA delta at most rank r? Prove it with matrix_rank.
Add an effective_rank knob: stack two adapters on the same rows (sum of deltas) and verify it equals adding them sequentially.
Measure FLOPs: compare base matmul FLOPs to the adapter's shrink+expand FLOPs for r=16, in=out=4096. What's the overhead ratio?

Design (staff-level)

A platform serves 5,000 customer fine-tunes. Compare (a) one full deployment per customer vs (b) shared base + multi-LoRA: memory, cost, cold-start. Where does (b) win and where does it hurt?
max_loras is hit constantly (lots of distinct adapters per batch). What are your options (raise it, route by adapter, replicate), and the tradeoffs?
How does LoRA on MoE expert layers (lora/layers/fused_moe.py) complicate the batched apply, and why?

Self-grading

4–7 and 11–13 are interview-grade. Could you whiteboard shrink/expand and the grouped batched apply? If not, re-read 01-deep-dive.md.

Phase 11 — Interview Questions: Multi-LoRA

Q1. What is a LoRA and why is it cheap?

Model answer

A LoRA replaces full fine-tuning with a small additive patch: W' = W + scaling·B·A, where A (r×in) and B (out×r) have a tiny rank r (8–64). Applying it is the base matmul plus two small rank-r matmuls (shrink in→r, expand r→out). A and B together are hundreds of times smaller than W (128× for a 4096×4096 projection at r=16), so you can store and serve many adapters cheaply over one shared base.

Q2. How does vLLM apply different adapters to different requests in one batch?

Model answer

It groups the batch by adapter id and uses SGMV/punica kernels: rows are segmented by their lora_int_id, and each segment is matmul'd against its adapter's A/B in one grouped kernel (add_shrink/add_expand/add_lora_linear). So a heterogeneous batch costs the shared base read plus a little per distinct adapter — not one model run per request. It's the same "group by id, do a grouped matmul" trick as MoE, keyed by adapter instead of expert.

Q3. What's the cost model that makes multi-LoRA a structural advantage?

Model answer

The base weights are shared and read once for the whole batch; each adapter adds only r×(in+out) parameters and a rank-r matmul. So serving N fine-tunes costs ≈ base + N tiny deltas, versus N full model copies. That lets a platform serve thousands of customer fine-tunes from one deployment — a real cost moat (Phase 19, Track C).

Q4. How are adapters managed in memory, and how does the scheduler get involved?

Model answer

The LoRAModelManager loads adapters into a bounded set of GPU slots and LRU-evicts when over max_loras (same eviction discipline as the KV cache). max_loras bounds distinct adapters per step; the scheduler enforces it during waiting-admission (it tracks scheduled_loras and skips a request whose adapter would exceed the limit this step). So multi-LoRA rides the normal scheduler with one extra constraint.

Q5. Does batching many adapters change the output?

Model answer

No — the grouped/SGMV application produces exactly the same result as applying each adapter to its request individually; it just shares the base matmul and fuses the per-adapter work. Same "optimization, not behavior change" guarantee as the KV cache, chunked prefill, and spec decode.

Rapid-fire

LoRA formula? W' = W + scaling·B·A, rank r small.
Two apply steps? shrink (in→r), expand (r→out).
Batched-LoRA kernel family? punica / SGMV (segment by adapter id).
Bounds adapters/step? max_loras (manager LRU + scheduler check).
Adapter id 0/-1? base / no adapter.

Phase 11 — Cheatsheet: Multi-LoRA

The one-liner

A LoRA is a tiny additive patch W' = W + scaling·B·A (rank r ≪ in,out). vLLM serves MANY adapters in one batch over a shared base by grouping rows by adapter id (punica/SGMV) — base read once, a little per adapter.

The math

shrink: s = x·Aᵀ (in→r). expand: Δ = s·Bᵀ (r→out). output = x·Wᵀ + scaling·Δ.
A:(r,in), B:(out,r). Adapter size = r×(in+out) ≪ W = in×out.

Multi-adapter batching

Group rows by lora_int_id; per-group grouped matmul (SGMV). Cost ≈ base + Σ(small per adapter), NOT N model runs. Output identical to per-request application.

Memory & scheduling

max_loras: distinct adapters per step. Manager LRU-evicts extras (like the KV BlockPool).
Scheduler enforces max_loras at waiting-admission (scheduled_loras check, Phase 3).
LoRARequest (id+name+path); id 0 = base.

MoE LoRA

lora/layers/fused_moe.py patches expert layers too (same shrink/expand, trickier routing).

Key upstream

lora/request.py:8 LoRARequest
lora/punica_wrapper/punica_base.py:42 add_shrink :57 add_expand :88 add_lora_linear · punica_cpu.py:166/:197 (readable)
lora/layers/{base_linear,column_parallel_linear,row_parallel_linear,fused_moe}.py
lora/model_manager.py (load/activate/LRU) · lora/lora_weights.py (A,B)

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

Phase 12 — The Hitchhiker's Guide to Structured Outputs

Don't Panic

Sometimes the output must be valid JSON, or match a regex, or follow a grammar — because a machine, not a human, is going to parse it. vLLM enforces this by computing, every decode step, the set of tokens that are legal next under the grammar, and setting every other token's logit to −inf before sampling. The model literally cannot emit an invalid token. No retries, no "I asked nicely," no post-hoc repair. The whole phase is that one move — mask, then sample — plus the machinery that makes it correct (a grammar automaton per request) and fast (a precompiled token bitmask per step).

            every step, per constrained request
            ┌────────────────────────────────────┐
 grammar ──►│ automaton state ──► allowed tokens │──► bitmask (vocab bits)
            └────────────────────────────────────┘            │
 model ────────────────► logits [vocab] ──── mask (−inf) ◄────┘
                                   │
                                   ▼
                               sample → token ──► advance automaton state

Step 1: The problem — "please output JSON" is a prayer, not a guarantee

A sampler picks from a probability distribution over ~100k tokens. Even a model that is 99.9% "JSON-reliable" per token averages one bad token per 1,000-token response, and at least one break lands in about 63% of such responses (1 − 0.999^1000 ≈ 0.63). Tool calls, agents, extraction pipelines (anything that feeds model output to json.loads or an API schema) need 0% failure, by construction. Prompting harder changes the probabilities; constrained decoding changes the support of the distribution. That's the difference between unlikely and impossible.

Step 2: The idea — make illegal tokens impossible, not unlikely

At any point mid-generation, the text so far puts the grammar in some state: "inside a string", "after a {, expecting a key or }", "just closed the top-level object — done". Each state defines exactly which next characters are legal. So:

Track the grammar state as tokens are emitted.
Before each sample, compute the allowed set for the current state.
Mask: logits[token] = −inf for every token not in the set. Softmax renormalizes the probability over the legal tokens — the model still chooses which legal token, with its own preferences intact.
After sampling, advance the state by the chosen token.

Note what this is not: it's not generate-then-validate (wasteful, unbounded retries), and it's not beam-searching for valid outputs (expensive). It's one mask per step, exact by construction.
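
The mask-then-sample move from the steps above, as a numpy sketch. The helper name masked_sample and the four-token toy vocabulary are illustrative only; the production path applies the mask with a fused GPU kernel rather than numpy indexing.

```python
import numpy as np

def masked_sample(logits, allowed, rng):
    """Set every disallowed logit to -inf, renormalize, then sample."""
    masked = np.full_like(logits, -np.inf)
    masked[allowed] = logits[allowed]            # keep only the legal tokens
    shifted = masked - masked[allowed].max()     # stabilize the softmax
    probs = np.exp(shifted)                      # exp(-inf) is exactly 0
    probs /= probs.sum()
    token = int(rng.choice(len(logits), p=probs))
    return token, probs

logits = np.array([2.0, 1.0, 0.5, 3.0])          # model prefers tokens 3 and 0
allowed = [1, 2]                                 # grammar permits only 1 and 2
rng = np.random.default_rng(0)
token, probs = masked_sample(logits, allowed, rng)
```

Tokens 0 and 3 end up with probability exactly zero, while tokens 1 and 2 keep their original ratio (e^0.5): the model's preferences survive among the legal choices.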

Step 3: From spec to automaton — why JSON needs a stack

What machine tracks "the state"? Depends on the language class:

Regex → a finite-state machine (FSM). Finitely many states, a transition table next = δ[state][char]. Your lab-01 builds exactly this.
JSON / EBNF grammars → nesting is unbounded ([[[[…]]]]), and an FSM cannot count. You need a pushdown automaton: a state plus a stack (push on {/[, pop on }/]). Your lab-03 builds exactly this, and it's what xgrammar implements for real.
JSON Schema → compiled into such a grammar first ("key name then a string, key age then an integer…"). Schema → grammar → pushdown automaton is the production pipeline.

 regex      ──compile──►  FSM            (states × chars table)
 JSON-schema──compile──►  CFG ──►  pushdown automaton (state + stack)
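
To make the language-class distinction concrete, here is a toy of each machine. Both are deliberately tiny illustrations rather than the lab solutions: the FSM recognizes the regex [0-9]+ through a transition table, and the pushdown checker handles arbitrarily deep [ ] / { } nesting with a stack.

```python
# Finite-state machine for the regex [0-9]+
# state 0 = start, state 1 = "seen at least one digit" (accepting)
DELTA = {0: {"digit": 1}, 1: {"digit": 1}}
ACCEPTING = {1}

def char_class(ch):
    return "digit" if ch in "0123456789" else None

def fsm_accepts(text):
    state = 0
    for ch in text:
        state = DELTA[state].get(char_class(ch))   # next = delta[state][char]
        if state is None:
            return False                           # no transition: illegal char
    return state in ACCEPTING

# Pushdown automaton (a single control state plus a stack) for nested brackets
CLOSERS = {"]": "[", "}": "{"}

def brackets_balanced(text):
    stack = []
    for ch in text:
        if ch in "[{":
            stack.append(ch)                       # push on open
        elif ch in CLOSERS:
            if not stack or stack.pop() != CLOSERS[ch]:
                return False                       # pop must match the opener
        else:
            return False
    return not stack                               # everything opened was closed

print(fsm_accepts("2024"), brackets_balanced("[[{}]]"), brackets_balanced("[[]"))
```

The FSM needs no memory beyond its current state; the bracket checker cannot be written that way because the nesting depth it must remember is unbounded.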

Step 4: Characters vs tokens — the lifting problem

Grammars speak characters; the model emits tokens (multi-character chunks like {"name). A token is legal iff feeding its characters one-by-one through the automaton succeeds from the current state. Naively that's vocab_size × token_len automaton steps — per decode step. The production answer (xgrammar's core trick) is to do the expensive analysis at compile time: for each automaton context, precompute which tokens are context-independent (always legal / always illegal) and store them in a compressed token bitmask (vocab_size / 32 int32 words); only a small context-dependent remainder (tokens that interact with the stack) is checked at runtime. That's why the per-step cost is "fill a bitmask," not "simulate 100k tokens."
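
The naive version of that lifting, with the bitmask packing, looks roughly like this. It is a sketch of the idea only: the automaton is a toy for [0-9]+, the vocabulary is invented, every token is simulated on every call (exactly the cost the compile-time classification avoids), and uint32 words are used for simplicity where the text describes int32 words.

```python
import numpy as np

DIGITS = "0123456789"

def step(state, ch):
    """Toy automaton for [0-9]+: any digit is legal, anything else is not."""
    return 1 if ch in DIGITS else None

def token_is_legal(state, token):
    """A token is legal iff all its characters walk through the automaton."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return False
    return True

def fill_bitmask(state, vocab):
    """Pack one allowed/disallowed bit per token id into 32-bit words."""
    words = np.zeros((len(vocab) + 31) // 32, dtype=np.uint32)
    for token_id, token in enumerate(vocab):
        if token_is_legal(state, token):
            words[token_id // 32] |= np.uint32(1 << (token_id % 32))
    return words

def is_allowed(words, token_id):
    return bool((int(words[token_id // 32]) >> (token_id % 32)) & 1)

# Hypothetical 38-token vocabulary: ids 0..33 are digit strings, 34..37 are not.
vocab = [str(i) for i in range(34)] + ["{", "4a", "a", '"x']
words = fill_bitmask(0, vocab)
print([hex(int(w)) for w in words])
```

Thirty-eight tokens pack into two words. A real ~100k-token vocabulary packs into a few thousand words per request per step, which is why the mask is cheap to ship to the workers.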

Step 5: The per-step pipeline in vLLM

The flow at the pinned commit (deep-dive walks every hop):

 SamplingParams.structured_outputs        {json=…, regex=…, choice=…, grammar=…}
        │ (request arrives)
        ▼
 StructuredOutputManager.grammar_init     compile grammar — async, off the hot path
        │  (request not schedulable until compiled — a new WAITING substate)
        ▼
 Scheduler.get_grammar_bitmask            each step: collect constrained requests
        │                                  → manager fills one bitmask row per request
        ▼  GrammarOutput (numpy bitmask, serialized to workers)
 gpu_model_runner: apply_grammar_bitmask  reorder rows to batch order, −inf via xgr kernel
        ▼
 sample → accepted tokens → grammar.accept_tokens() advances the automaton

Two production wrinkles worth noticing now (the deep-dive shows the code):

Compilation is async (ThreadPoolExecutor): a request whose grammar is still compiling simply isn't scheduled yet. Compile cost never blocks the engine loop.
Speculative decoding composes with this (Phase 8): the bitmask tensor holds one row per position (each draft token + the bonus token), and the grammar exposes rollback(n) so rejected drafts un-advance the automaton. Constraint + speculation, no special case.

Step 6: The costs, and where they're hidden

Cost	When	How it's hidden
Grammar compile (schema → automaton + token tables)	once per distinct grammar	async executor; cached by `(type, spec)` key
Bitmask fill (automaton state → vocab bits)	per request, per step	compile-time token classification; parallel fill above a batch threshold
Mask apply (−inf on logits)	per step	one fused GPU kernel over the batch
State advance	per accepted token	trivial (table/stack step)

The first request with a new big schema pays a visible TTFT hit (lab-02 measures it on real vLLM). Steady-state per-step overhead is a small single-digit percentage.

The invariants to memorize

Constrained decoding = mask then sample: illegal tokens get −inf; softmax renormalizes over the legal set. Output is valid by construction.
The machine matching the language: regex → FSM; JSON/EBNF → pushdown (stack); JSON Schema compiles down to the latter.
Grammars speak chars, models speak tokens — the token bitmask is the precompiled lifting of char-rules to vocab entries (vocab/32 int32 words per position).
One grammar state per request, advanced on accept, rolled back on spec-decode rejection. Compile happens once per distinct grammar, off the hot path.
Honest caveat: constraints guarantee validity, not truth — and max_tokens can still truncate mid-structure (finish_reason="length"), which no mask can save you from.

What you'll do

Read: 01-deep-dive.md — the manager, the backend interface, xgrammar, and the scheduler/runner hops, all line-anchored.
Build: 02-mini-build.md — a regex-FSM grammar mask as a mini_vllm/grammar.py logits processor (reference implementation + tests included).
Labs (see labs/README.md; recommended order 01 → 03 → 02):
- lab-01-regex-fsm-mask [CPU-OK] — compile a regex to an FSM, lift it to per-step token masks, and prove an adversarial model still always matches the regex.
- lab-03-json-pushdown [CPU-OK] — why regex isn't enough: the stack-aware pushdown mask for a JSON subset; a brace-hating model still emits parseable JSON at depth 8.
- lab-02-json-schema-constrained [GPU-OPT] — xgrammar via guided_json on real vLLM: 31/50 → 50/50 schema validity, first-request compile latency, the finish_reason="length" trap. Captured output included.
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.


Phase 12 — Deep Dive: structured outputs in real vLLM

Paths relative to upstream/ at v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number drifts in a newer tree, search for the named symbol.

vllm/sampling_params.py                          StructuredOutputsParams (the user API)
vllm/v1/structured_output/backend_types.py       the two-interface contract (read first)
vllm/v1/structured_output/__init__.py            StructuredOutputManager (compile + bitmask)
vllm/v1/structured_output/backend_xgrammar.py    the default backend
vllm/v1/structured_output/request.py             per-request state + the cache key
vllm/v1/structured_output/utils.py               apply_grammar_bitmask (runner side)
vllm/v1/core/sched/scheduler.py                  get_grammar_bitmask (the scheduler hook)

1. The user API: `StructuredOutputsParams`

vllm/sampling_params.py:41 — class StructuredOutputsParams holds exactly one of json | regex | choice | grammar | json_object | structural_tag (__post_init__ counts the set fields and raises if ≠ 1). This rides on every SamplingParams, so a constraint is a per-request property — one batch can mix free requests, a JSON-schema request, and a regex request.

The constraint becomes a cache key in vllm/v1/structured_output/request.py:77 — get_structured_output_key() maps params to a (StructuredOutputOptions, spec_string) tuple (JSON dict gets json.dumps-normalized). Two requests with the same schema share one compiled grammar context.
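
A sketch of why a normalized key lets two requests share one compile. Kind, cache_key, and get_or_compile are hypothetical stand-ins, not the upstream symbols, and sort_keys=True is this sketch's own choice rather than a claim about what get_structured_output_key() does:

```python
import json
from enum import Enum, auto

class Kind(Enum):
    """Hypothetical stand-in for StructuredOutputOptions."""
    JSON = auto()
    REGEX = auto()

def cache_key(kind, spec):
    """Turn (kind, spec) into a hashable key; dict specs become a string."""
    if isinstance(spec, dict):
        spec = json.dumps(spec, sort_keys=True)   # assumption: order-insensitive
    return (kind, spec)

compiled = {}
compile_calls = 0

def get_or_compile(kind, spec):
    """Compile on first sight of a key, reuse the result afterwards."""
    global compile_calls
    key = cache_key(kind, spec)
    if key not in compiled:
        compile_calls += 1
        compiled[key] = object()    # placeholder for a compiled grammar context
    return compiled[key]

first = get_or_compile(Kind.JSON, {"type": "object", "required": ["name"]})
second = get_or_compile(Kind.JSON, {"required": ["name"], "type": "object"})
```

Both calls resolve to the same cached object, so the expensive compile runs once.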

2. The contract: two abstract classes

backend_types.py is the whole design in 136 lines — read it before anything else:

StructuredOutputOptions (:19) — the six request types (JSON, JSON_OBJECT, REGEX, GRAMMAR, CHOICE, STRUCTURAL_TAG).
StructuredOutputGrammar (:31) — per-request state. Five methods carry the whole feature: accept_tokens (advance state), validate_tokens (check without advancing — used to vet spec-decode drafts), rollback(n) (un-advance — spec-decode rejection), fill_bitmask(tensor, index) (write this request's allowed-token bits into row index), is_terminated (grammar reached an accepting end state).
StructuredOutputBackend (:99) — engine-level: compile_grammar(type, spec) → StructuredOutputGrammar and allocate_token_bitmask(max_seqs).

That rollback is in the base interface tells you spec decode wasn't bolted on — the contract was designed so constraints and speculation compose.

3. The manager: compile off the hot path

vllm/v1/structured_output/__init__.py:36 — class StructuredOutputManager, owned by the scheduler (scheduler.py:90), not the workers. Compile and bitmask-fill happen on the scheduler side; only the finished numpy bitmask is shipped to GPU workers.

grammar_init (:115) — called when a constrained request arrives. Lazily instantiates the single engine-wide backend (xgrammar / guidance / outlines / lm-format-enforcer — note the comment: one backend per engine, not per request), then submits _create_grammar to a ThreadPoolExecutor: the request's grammar field holds a Future until compilation lands.
request.py:60 — the grammar property resolves that Future: a request whose grammar isn't ready yet is simply not schedulable (the scheduler skips it — search structured_output_request.grammar in scheduler.py). Compile latency costs that one request TTFT, never the engine loop.
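
The Future-gated scheduling described above can be sketched with the standard library alone. Request, compile_grammar, and is_schedulable here are simplified hypothetical names, and a threading.Event stands in for a slow compile so the example is deterministic:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

gate = threading.Event()

def compile_grammar(spec):
    gate.wait(timeout=5)            # stands in for an expensive compile
    return f"compiled<{spec}>"

class Request:
    def __init__(self, req_id, spec=None, executor=None):
        self.req_id = req_id
        # Constrained requests hold a Future until compilation lands.
        self.future = executor.submit(compile_grammar, spec) if spec is not None else None

    def is_schedulable(self):
        return self.future is None or self.future.done()

executor = ThreadPoolExecutor(max_workers=1)
waiting = [Request("free"), Request("regex", spec=r"\d+", executor=executor)]

# Engine step 1: the compile is still running, so only the free request runs.
first_step = [r.req_id for r in waiting if r.is_schedulable()]

gate.set()                                  # the compile finishes
waiting[1].future.result(timeout=5)

# Engine step 2: both requests are now schedulable.
second_step = [r.req_id for r in waiting if r.is_schedulable()]
executor.shutdown()
```

The loop that builds each step never blocks on the Future; it only asks whether it is done.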

4. The bitmask: one row per position, spec-decode included

grammar_bitmask (__init__.py:204) is the heart. Per step the scheduler calls Scheduler.get_grammar_bitmask (scheduler.py:1259), which collects the scheduled constrained request IDs and delegates here. What to notice:

Allocation (once): max_num_seqs × (1 + num_speculative_tokens) rows — one row per possible sampled position, not per request. With spec decode, request r drafting tokens d1..dk contributes k+1 rows: mask for the state before d1, before d2, …, before the bonus token.
The spec-decode dance (the serial path): for each draft token it fills a row, then accept_tokens([token]) to advance the state, counting state_advancements; after the last row it calls grammar.rollback(state_advancements) — the grammar temporarily pretends the drafts were accepted to compute their masks, then rewinds, because the real accept/reject verdict belongs to the rejection sampler (Phase 8).
Parallel fill: above fill_bitmask_parallel_threshold (non-spec case), requests are batched to the executor in groups — bitmask filling is pure CPU work and parallelizes.
Serialization: the tensor is returned as numpy (.numpy(), see the comment) because ndarray serializes much faster than a torch tensor on the way to workers — it travels in GrammarOutput (scheduler.py:1281).
should_advance (:322) / should_fill_bitmask (:302) — the reasoning-model gate: while a model is inside its thinking section, the constraint is suspended (the mask row is set to all-ones via _full_mask) and the automaton doesn't advance; enforcement begins when the reasoning parser (Phase 16) says reasoning ended.
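
The fill, accept, rollback sequence from the spec-decode item above, sketched on a toy single-character grammar for ab*c. ToyGrammar and masks_for_drafts are hypothetical names; the sketch assumes drafts were already vetted (so each accept succeeds) and returns Python lists rather than packed rows:

```python
DELTA = {0: {"a": 1}, 1: {"b": 1, "c": 2}, 2: {}}   # FSM for ab*c
VOCAB = ["a", "b", "c"]                             # token id = index

class ToyGrammar:
    def __init__(self):
        self.history = [0]                          # state stack, newest last

    def fill_row(self):
        return [tok in DELTA[self.history[-1]] for tok in VOCAB]

    def accept_tokens(self, token_ids):
        for t in token_ids:
            nxt = DELTA[self.history[-1]].get(VOCAB[t])
            if nxt is None:
                return False
            self.history.append(nxt)
        return True

    def rollback(self, n):
        if n:                                       # del history[-0:] would wipe it
            del self.history[-n:]

def masks_for_drafts(grammar, draft_ids):
    """One row per draft position plus one for the bonus token, then rewind."""
    rows, advanced = [], 0
    for tok in list(draft_ids) + [None]:            # None marks the bonus slot
        rows.append(grammar.fill_row())
        if tok is not None:
            assert grammar.accept_tokens([tok])     # assumption: drafts pre-vetted
            advanced += 1
    grammar.rollback(advanced)
    return rows

g = ToyGrammar()
g.accept_tokens([0])                                # "a" was really sampled
rows = masks_for_drafts(g, [1, 2])                  # drafts: "b", then "c"
```

Two drafts yield three rows, and the grammar ends where it started: the drafts were only pretended, because the accept or reject verdict comes later.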

5. The runner: reorder and apply

vllm/v1/structured_output/utils.py:44 — apply_grammar_bitmask(scheduler_output, grammar_output, input_batch, logits), called from the GPU model runner right before sampling (gpu_model_runner.py:4359). Two jobs:

Reorder: the bitmask rows are in the scheduler's request order; the runner's batch order differs, and spec-decode offsets each request's logit rows. The function builds struct_out_req_batch_indices walking input_batch.req_ids with a cumulative_offset of spec tokens, then scatters rows into a sorted_bitmask sized [logits.rows, words] (unconstrained rows = all -1 = all-allowed).
Apply: xgr.apply_token_bitmask_inplace(logits, bitmask, indices=out_indices) — one fused kernel writes −inf into every disallowed logit. 32 vocab entries per int32 word is why the mask is cheap to ship and apply.
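
A NumPy reference for the two jobs, useful for checking your mental model. It is not the xgrammar kernel: reorder_rows ignores spec-decode offsets, both function names are hypothetical, and the bit layout (bit i of word w is token 32*w + i) is an assumption of this sketch:

```python
import numpy as np

def reorder_rows(sched_bitmask, sched_order, batch_order, words):
    """Scatter scheduler-order rows into batch order; other rows stay all-allowed."""
    sorted_bitmask = np.full((len(batch_order), words), -1, dtype=np.int32)
    row_of = {req: i for i, req in enumerate(batch_order)}
    for src, req in enumerate(sched_order):
        sorted_bitmask[row_of[req]] = sched_bitmask[src]
    return sorted_bitmask

def apply_bitmask_reference(logits, bitmask):
    """Write -inf wherever a token's bit is 0. Modifies logits in place."""
    token_ids = np.arange(logits.shape[1])
    words = bitmask[:, token_ids // 32]             # word holding each token's bit
    allowed = (words >> (token_ids % 32)) & 1       # & 1 makes the sign bit harmless
    logits[allowed == 0] = -np.inf
    return logits

# One constrained request ("json") allowing only tokens 1 and 33, vocab of 40.
sched_bitmask = np.array([[1 << 1, 1 << 1]], dtype=np.int32)
sorted_bitmask = reorder_rows(sched_bitmask, ["json"], ["free", "json"], words=2)
logits = apply_bitmask_reference(np.zeros((2, 40), dtype=np.float32), sorted_bitmask)
```

An int32 of -1 has all 32 bits set, which is why a row of -1 words leaves an unconstrained request untouched.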

6. The xgrammar backend

backend_xgrammar.py:35 — class XgrammarBackend. compile_grammar (:77) is a clean switch over the six request types: compile_json_schema / compile_regex / compile_grammar (EBNF) / compile_structural_tag, each returning a compiled context ctx. CHOICE never reaches here as such — choices are converted to a grammar upstream. Then:

return XgrammarGrammar(
    matcher=xgr.GrammarMatcher(ctx, max_rollback_tokens=self.num_speculative_tokens),
    vocab_size=self.vocab_size, ctx=ctx)

max_rollback_tokens sized to the spec-decode draft length — the compose-with-Phase-8 contract again, now at the C++ matcher level. XgrammarGrammar (:132) is a thin wrapper: accept_tokens → matcher.accept_token loop, fill_bitmask (:191) → matcher.fill_next_token_bitmask(bitmask, idx), rollback → matcher.rollback. The actual FSM/pushdown machinery — and the compile-time token classification from the guide's Step 4 — lives inside the xgrammar library; what vLLM owns is the plumbing you just traced. Also skim has_xgrammar_unsupported_json_features (:221) and validate_xgrammar_grammar (:268): unsupported schema features are rejected at the front door (processor), not at compile time — fail fast, fail in the API layer.

backend_guidance.py implements the same two interfaces over llguidance (better coverage of exotic JSON-schema features, lazy-computed masks); backend_outlines.py and backend_lm_format_enforcer.py likewise. One contract, four interchangeable engines — the same backend-registry pattern you saw for attention (Phase 4) and quantization (Phase 6).

Reading checklist

backend_types.py — why are validate_tokens and rollback in the per-request interface? Which phase forces their existence?
grammar_init — what exactly is async, and what state is a request in while its grammar compiles?
grammar_bitmask — why max_num_seqs × (1 + num_spec_tokens) rows? Walk the fill→accept→…→rollback sequence for one request with 2 draft tokens.
apply_grammar_bitmask — why is reordering needed, and what does an all -1 row mean?
XgrammarBackend.compile_grammar — where does max_rollback_tokens come from?
In scheduler.py:968, why is a request with is_prefill_chunk excluded from bitmask generation? (Hint: which step actually samples a token?)

Now build it: 02-mini-build.md, then the labs.

Phase 12 — Mini-Build: a grammar mask for `mini_vllm`

Your task

Build mini_vllm/grammar.py: a regex-FSM grammar that produces a per-step allowed-token mask and plugs into the mini engine as a logits processor — so a generation literally cannot emit a string that violates the regex.

A reference implementation ships in mini_vllm/grammar.py with tests in mini_vllm/test_grammar.py. Build yours first; compare after.

Why build it (and not just read it)

Reading the real feature tells you what production does. Re-implementing a tiny version tells you why every decision was made — which is the understanding that survives into an interview or a 2 a.m. incident. Keep it small; keep it tested.

The spec

Mirror the upstream contract from backend_types.py, shrunk to its essence:

import numpy as np

class RegexGrammar:
    """Compile once; one instance of state per request."""
    def __init__(self, pattern: str, vocab: dict[int, str]): ...
    def allowed_token_mask(self) -> np.ndarray:          # bool[vocab_size]
        """True where emitting the token keeps a path to a match alive."""
    def accept_token(self, token_id: int) -> bool: ...   # advance; False if illegal
    def rollback(self, n: int) -> None: ...              # un-advance n tokens (spec decode!)
    def is_terminated(self) -> bool: ...                 # matched a full accepting state

Constraints that make it honest:

Compile the regex to an explicit FSM yourself (subset is fine: literals, [...] classes, |, *, +, ?, digits — no need for full PCRE). re may be used only as a test oracle, never inside the mask.
A token (multi-char string) is allowed iff feeding its chars through the FSM from the current state stays alive. Cache per-state token masks after first computation — that's xgrammar's compile-time trick in miniature.
rollback must restore the exact state — keep a state-stack history.
Wire it into mini_vllm/engine.py's sampling path as an optional logits_mask_fn(request) -> mask hook, applied as logits[~mask] = -inf pre-softmax.

Method

Re-read the matching real code: backend_types.py (the contract), backend_xgrammar.py:132 (XgrammarGrammar — yours is this class with the library replaced by your FSM).
Write the FSM compiler first; property-test it against re.fullmatch on random strings.
Add the token lifting + mask caching; then the engine hook.
pytest mini_vllm -q and keep it green.

Definition of done

CPU only, numpy only.
A test proves the property: an adversarial sampler (always picks the worst allowed token) still produces a string for which re.fullmatch(pattern, out) is not None, for at least 3 patterns.
A test proves renormalization: with all tokens masked but one, that one is sampled with probability 1.
A test proves rollback: advance k tokens, rollback k, masks are bit-identical.
You can say out loud where yours simplifies: chars = bytes, no pushdown stack (lab-03 adds it), no compile-time context-independent/dependent token split (you cache lazily instead).

Map back to the real engine

Yours	Upstream
`RegexGrammar`	`StructuredOutputGrammar` impl (`backend_xgrammar.py:132`)
`allowed_token_mask()`	`fill_bitmask()` row (bool array vs packed int32 bits)
per-state mask cache	xgrammar compile-time token classification
`logits[~mask] = -inf` hook	`apply_grammar_bitmask` (`structured_output/utils.py:44`)
`rollback` state stack	`xgr.GrammarMatcher(max_rollback_tokens=…)`

Phase 12 Labs — Structured Outputs

Three labs that turn "please respond with JSON" into a mathematical guarantee. The arc: build the regex→FSM→token-mask pipeline and its adversarial proof (lab-01), cross the regular/context-free boundary with a pushdown machine for JSON — and get caught by the fuzz oracle on a grammar corner (lab-03), then measure the industrialized version (xgrammar via guided_json) forcing 50/50 schema validity on real hardware (lab-02).

Recommended order: 01 → 03 → 02. CPU labs follow the standard contract — starter.py (your work), solution.py (reference), test_lab.py (the spec); default runs the solution, LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-12-structured-outputs/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-12-structured-outputs/labs/lab-01-regex-fsm-mask -q

Labs

lab-01-regex-fsm-mask `[CPU-OK]`

The three moves of constrained decoding: compile a pattern to a char-level FSM, lift it to token masks (a token is allowed iff its characters keep the machine alive — the outlines insight, including multi-char tokens crossing atom boundaries), and gate EOS on accepting states. Proven against an adversarial model that prefers garbage and emits valid hex anyway — plus the honest truncation-caveat test (prefix-valid ≠ complete). Skills: masks edit support, not mood; the compile-time/runtime split; char→token lifting; the max_tokens trap.

lab-02-json-schema-constrained `[GPU-OPT]`

The verification protocol on real vLLM: one schema, 50 prompts, two arms, a strict jsonschema validator — baseline 31/50 (mostly JSON wrapped in chat), guided 50/50. Plus the operational signatures: +210 ms first-request grammar compile, and the finish_reason: "length" truncation trap sprung deliberately. Annotated capture included. Skills: control-arm benchmarking; the four guided formats; user-supplied schemas as an operational risk surface.

lab-03-json-pushdown `[CPU-OK]`

Why regex isn't enough: JSON nests, nesting needs a stack, and you'll build the pushdown machine (modes + depth) whose mask is stack-aware — a brace-hating model still emits parseable JSON at depth 8. Featuring the lab's best war story: the json.loads fuzz oracle caught the reference implementation accepting 0123 (JSON forbids leading zeros) — grammar bugs need independent oracles. Skills: the regular/CFG boundary as a product boundary; resume-the-parent via the stack; oracle-driven grammar debugging; checkpointable machines for spec-decode composition.

What you can do after this phase

Explain precisely why constrained decoding guarantees validity (and the two ways it still doesn't: truncation, and bugs in the grammar itself); choose between regex/choice/schema/grammar constraints by their compile cost and expressive need; operate structured-output services with eyes open (grammar cache hit rates, first-request latency, finish_reason hygiene, user-schema risk); and read vllm/v1/structured_output/ as the industrial form of two machines you built by hand. The masks ride Phase 9's processor hook; the per-request grammar state joins Phase 9 lab-04's isolation discipline; and Phase 16's tool-calling parsers consume what these masks guarantee.

Lab 12-01 — Regex → FSM → Token Masks `[CPU-OK]`

Prompting a model to "respond with a hex number" gets you a hex number most of the time — and "most of the time" is a production outage with a delay. Constrained decoding replaces the request with a guarantee: compile the pattern to a finite-state machine, and each step mask the logits so only tokens that keep the machine alive are sampleable (Phase 9 lab-01's processor hook, finally meeting its most important client). The model cannot emit an invalid output — not "is unlikely to": cannot — and you'll prove it with an adversarial fake model that prefers garbage and emits valid hex anyway. Plus the two ideas that separate toy versions from real ones: the char-to-token lifting (the FSM walks characters; the model emits multi-character tokens) and the truncation caveat (the mask guarantees prefix validity, not completion — a test demonstrates the failure honestly).

Why this lab exists

Structured output is the feature that turned LLMs from chatbots into components — nothing downstream can consume "mostly JSON" — and it's also the feature whose implementation most people get wrong on the first guess (validate-and-retry? post-hoc repair? few-shot harder?). The correct answer is masks, and it's correct for a deep reason worth internalizing: it moves enforcement from after sampling (reject, retry, pray) to before (the invalid token's probability is −∞; renormalization spreads its mass over valid continuations). Zero retries, zero latency tax beyond the mask computation, and the guarantee is structural rather than statistical.

Building it small teaches you the production system's actual anatomy. Outlines' famous contribution was precisely your allowed_tokens: precompute, for every FSM state, which tokens (not characters) survive — turning the per-step cost from "simulate the vocab" into a dict lookup. When you read vLLM's structured-output manager (upstream/vllm/v1/structured_output/), you'll find your three functions with caching and bitmask plumbing around them.

Background: the three moves

Compile the pattern to a char-level FSM. The lab's pattern subset (atom sequences: literals and [...] classes, optional +) compiles to a beautifully simple machine — state i = "atoms 0..i−1 matched", + adds a self-loop. Real engines compile full regex via standard NFA→DFA machinery (interegular/outlines) — bigger automata, same interface: transitions, accepting.
Lift chars to tokens: token t is allowed in state s iff feeding t's characters from s never hits a missing transition. This is where the tokenizer's weirdness lives — a single token "0x" crosses two atoms in one step, and your mask test pins exactly that. The lifted table is per-(state, vocab), computed once per pattern: the compile-time/runtime split that makes masking affordable at 128k vocab.
Mask per step: allowed tokens keep their logits, the rest get −∞, EOS is legal iff the state is accepting (forcing the model through the pattern — the test_eos_only_when_accepting behavior: a model that wants to stop immediately must emit a digit first).

Files

starter.py — parse_atoms, compile_pattern, advance, allowed_tokens, constrained_generate. Your work.
solution.py — reference.
test_lab.py — parsing, survival/death of multi-char tokens, mask exactness, the adversarial model, a 50-permutation fuzz, EOS gating, and the truncation caveat.

Run

LAB_IMPL=starter pytest phase-12-structured-outputs/labs/lab-01-regex-fsm-mask -q
pytest phase-12-structured-outputs/labs/lab-01-regex-fsm-mask -q   # reference

What the tests prove

Test	What it pins
`test_parse_atoms` / `test_advance_and_death`	The compiler and the walker, including a token dying mid-token (`"1x"`) — partial consumption must not corrupt state
`test_mask_is_exactly_the_survivors`	The lifting: from start, only `"0"` and `"0x"` survive `0x[0-9a-f]+` — note the multi-char token legally crossing two atoms in one step
`test_constrained_output_always_matches`	The adversarial guarantee: a model preferring `q`, `@`, `zz` emits valid hex anyway. The mask doesn't persuade; it removes the alternatives
`test_fuzzed_preferences_never_violate`	50 random models, zero violations — on a pattern chosen so truncation is harmless (see next row for why that choice was necessary)
`test_truncation_caveat_is_real`	The honest failure: a digit-loving model loops in `[0-9]+` and never reaches the `.`; at `max_tokens` the output is a valid prefix but not a valid match. Constrained decoding + token caps = possibly-incomplete output — a real production gotcha (validate downstream anyway!), demonstrated rather than footnoted
`test_eos_only_when_accepting`	The stop token is part of the grammar: stopping is only legal where the pattern says so

Hitchhiker's notes

The compile-time/runtime split is the whole performance story. Compiling a complex schema's automaton and lifting it over a 128k vocab takes real time (xgrammar's headline is doing this fast + caching it); the per-step cost is then a bitmask apply. vLLM compiles grammars asynchronously — a request can sit in WAITING while its grammar compiles (a new reason to wait that Phase 3's scheduler gates on; search grammar in the V1 scheduler). First-request latency on a new schema vs steady-state is the operational signature of this split.
The mask must reach the GPU. Your set-of-ints becomes a packed [batch, vocab/32] int32 bitmask tensor applied inside the sampler (Phase 9 lab-01's pipeline, stage one). At 128k vocab × 256 batch that's real bytes per step, which is why the format is a packed bitmask and why xgrammar emits it natively.
Tokenizer dependence is total: the lifted table is per-(pattern, tokenizer). Same pattern, different model → recompile. And exotic vocab corners (bytes, partial UTF-8 tokens) are exactly where naive lifters break — one more reason the production engines are libraries, not weekend scripts.
The truncation caveat generalizes: any constrained system that guarantees step-wise validity (prefix-closed) but not termination has this hole. Lab-03 shows its pushdown version (unclosed braces forever); real APIs return finish_reason: "length" (Phase 1 lab-05!) on exactly these — your downstream parser must treat "length" + structured output as suspect. The three labs of this course that compose here (1-05, 9-01, 12-01) are the whole story.
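
The byte counts behind the "mask must reach the GPU" note, worked out for a 128,000-entry vocabulary and a batch of 256. The exact vocabulary size is an assumption for the arithmetic; real models differ:

```python
vocab, batch = 128_000, 256

# One bool (1 byte) per vocab entry per row.
bool_mask_bytes = vocab * batch * 1

# Packed: 32 vocab entries per int32 word, 4 bytes per word.
words_per_row = -(-vocab // 32)            # ceiling division
bitmask_bytes = words_per_row * batch * 4

# For comparison, the float32 logits the mask is applied to.
logits_bytes = vocab * batch * 4

print(f"bool mask: {bool_mask_bytes / 1e6:.1f} MB")
print(f"bitmask:   {bitmask_bytes / 1e6:.1f} MB")
print(f"logits:    {logits_bytes / 1e6:.1f} MB")
```

Packing is an 8x saving over a byte-per-entry mask and a 32x saving relative to the logits themselves, per step.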

Going further

Add * (zero-or-more) to the pattern subset — note how it changes which states are skippable and that your state-numbering scheme needs epsilon-collapsing. You're one feature away from needing the real NFA→DFA pipeline; feel the cliff.
Precompute the full state → allowed token-id list table and benchmark constrained_generate against the on-the-fly version at vocab 50k (build a random vocab) — the outlines speedup, measured.
Wire the mask into Phase 9 lab-01's Pipeline as a logits processor over mini_vllm's ByteTokenizer vocab and constrain the toy engine end-to-end: structured output in your own engine, ~20 glue lines.

References

Willard & Louf, Efficient Guided Generation for Large Language Models (2023) — the outlines paper; your allowed_tokens is its §3: https://arxiv.org/abs/2307.09702
upstream/vllm/v1/structured_output/ — the manager, backends, and the async compile path.
Dong et al., XGrammar (2024) — the compile-time/runtime split industrialized: https://arxiv.org/abs/2411.15100
Phase 9 lab-01 — the hook this mask rides; lab-03 — why JSON needs a stack on top of everything here.

Lab 12-02 — JSON Schema Constrained, on Real vLLM `[GPU-OPT]`

The CPU labs built the theory bottom-up: masks from FSMs (lab-01), stacks for nesting (lab-03). This lab runs the industrialized version — xgrammar via vLLM's guided_json — and measures the property that justifies the whole phase: 50 of 50 schema-valid outputs constrained, versus a baseline that politely wraps its JSON in markdown fences and apologies. You'll also watch the operational signatures the CPU labs predicted: the first-request grammar-compile latency (the compile-time/runtime split, on a wall clock) and the finish_reason: "length" truncation caveat, live.

No GPU? Don't panic. The captured run below carries the measurements; the reconciliation against labs 01/03 is the work.

Why this lab exists

"100% valid JSON" is a strong claim and engineers should be professionally suspicious of strong claims — this lab is the verification protocol. The design matters more than the running: a fixed schema, N diverse prompts, two arms (constrained vs unconstrained-but-asked-nicely), and a strict validator (jsonschema, not json.loads — type and required-key checking, not just parseability). The unconstrained arm is the control every structured-output benchmark needs and most skip: without it, "98% valid" tells you nothing about what the constraint bought (small instruct models often manage 60–85% unconstrained; the delta is the feature).

It's also your introduction to the feature's operational personality: per-schema compile cost (cached thereafter), the scheduler's grammar-wait state, and the interaction with max_tokens that labs 01/03 made you predict — all visible from the client side if you know to look.

Requirements

uv pip install -e ".[vllm]" jsonschema
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct   # small instruct model: a fair baseline arm

Steps

import json

import jsonschema
from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
        "skills": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "age", "skills"],
}

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", gpu_memory_utilization=0.6)
prompts = [f"Generate a profile for a fictional {job}." for job in
           ["pirate", "astronaut", "barista", "wizard", "plumber"] * 10]

def validity(outputs):
    """Count outputs that parse as JSON and satisfy SCHEMA (types and required keys)."""
    ok = 0
    for o in outputs:
        try:
            jsonschema.validate(json.loads(o.outputs[0].text), SCHEMA)
            ok += 1
        except (json.JSONDecodeError, jsonschema.ValidationError):
            pass
    return ok

# Control arm: ask nicely, no constraint.
base = llm.generate([p + " Respond ONLY with JSON matching the schema." for p in prompts],
                    SamplingParams(max_tokens=128, temperature=0.8))
# Constrained arm: same prompts and sampling, plus the schema constraint
# (StructuredOutputsParams is the user API described in the deep dive, section 1).
guided = llm.generate(prompts, SamplingParams(
    max_tokens=128, temperature=0.8,
    structured_outputs=StructuredOutputsParams(json=SCHEMA)))
print(f"baseline: {validity(base)}/50   guided: {validity(guided)}/50")

Time the first guided request separately from the rest (grammar compile), and run one guided request with max_tokens=12 to spring the truncation trap on purpose.

Captured output (real run, Qwen2.5-0.5B-Instruct, L4, vLLM 0.22.1, trimmed)

baseline: 31/50   guided: 50/50
# typical baseline failure: 'Sure! Here is the profile:\n```json\n{"name": ...'
#   (valid JSON, wrapped in chat — json.loads sees the fence and dies)
# first guided request: +210 ms (xgrammar compile, then cached for the schema)
# guided with max_tokens=12: '{"name": "Captain Redb'  finish_reason='length'
#   (prefix-valid, incomplete — the labs' truncation caveat, on silicon)

Reading the results

31/50 baseline — and look at how it fails: mostly not malformed JSON but JSON wrapped in helpfulness ("Sure! Here is..."). Instruct-tuning taught the model to chat; your parser disagrees. Prompt-engineering harder buys a few points and plateaus — the failure is distributional, and no amount of asking changes the distribution's tails. (This is the precise sense in which masking is structural: it edits the distribution's support, not its mood.)
50/50 guided — the first token is forced into { territory; the fence is unsamplable. Note this matches labs 01/03's adversarial tests exactly: the model's preferences (chatty preamble) lose to the mask, every time, by construction.
+210 ms first request — lab-01's compile-time/runtime split with a wall-clock number: schema → grammar → automaton → token bitmask tables, then cached (per schema × tokenizer). A fleet serving many distinct schemas pays this repeatedly — cache hit rate on grammars is a real metric for structured-output-heavy services.
The truncated run — finish_reason: "length" plus a prefix-valid fragment: labs 01/03's caveat verbatim. The defensive pattern: treat "length" + structured-output as invalid regardless of how parseable the prefix looks, and size max_tokens for the schema's worst case (arrays make worst cases long).
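
The defensive pattern from the last item can be sketched with the standard library. parse_structured is a hypothetical helper name; a real service would follow it with schema validation (for example with jsonschema), which this sketch leaves out:

```python
import json

def parse_structured(text, finish_reason):
    """Return the parsed value, or None if the output cannot be trusted."""
    if finish_reason == "length":
        return None                 # truncated: reject even if the prefix parses
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

truncated = parse_structured('{"name": "Captain Redb', "length")
parseable_prefix = parse_structured("12", "length")   # parses, but may be a cut-off number
complete = parse_structured('{"name": "Redbeard", "age": 41, "skills": []}', "stop")
```

The second case is the subtle one: "12" is valid JSON on its own, yet with finish_reason "length" there is no way to know the model was not on its way to a longer value.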

Hitchhiker's notes

The API surface spans four formats: guided_json (schema), guided_regex (lab-01's domain), guided_choice (the degenerate-but-useful enum case), and guided_grammar (full EBNF, lab-03's domain, user-supplied). These are the older names; at the pinned commit they correspond to the json, regex, choice, and grammar fields of StructuredOutputsParams. All compile to the same masking machinery with different front ends; choosing the narrowest format that fits is both faster to compile and a better model-steering signal.
Backend choice exists (xgrammar default, guidance, outlines lineage) — like Phase 4's attention backends, with the same operational reflex: when structured output misbehaves, swapping backends is the bisection move (--guided-decoding-backend). Feature-support matrices differ (regex corners, schema keywords); the deep-dive maps them.
Quality inside validity: 50/50 valid says nothing about whether the content is good — masks constrain syntax, not sense. A model bullied through an unfamiliar schema produces valid-but-vapid fields. The schema is also a prompt: include it in the text and the constraint, and the two reinforce (measure content quality separately — Phase 6 lab-02's eval discipline applies).
Throughput cost is real but modest: bitmask application is cheap; grammar advance (per accepted token, per request) is CPU-side work that can bottleneck at high concurrency with complex grammars — watch the structured-output scheduling stats. Tail risk: one pathological schema compiling for seconds can stall its request, not the engine (the async-compile design — Phase 3's WAITING state earning a new tenant).

Reflect

Map every capture line to its CPU-lab origin: the fence failure (mask edits support — labs 01/03's adversarial tests), the +210 ms (lab-01's compile/runtime split), the truncation (both labs' caveat tests). If each has a home, the phase composed.
Your service takes user-supplied schemas. Name the three operational risks this lab armed you against. (Unbounded compile cost per novel schema — cache + limits; worst-case output length vs max_tokens — validate finish_reason; pathological grammars as a DoS surface — compile timeouts.)
Why does the guided arm use temperature 0.8 rather than 0? (The claim under test is "valid under sampling" — greedy would make validity trivially repetitive and hide mask bugs that only sampled tails reach. Constrain the support, then let the distribution be itself.)

References

upstream/vllm/v1/structured_output/ — manager, xgrammar backend, the async compile path and bitmask plumbing.
vLLM docs, Structured Outputs — the four guided formats and backend selection: https://docs.vllm.ai/en/latest/features/structured_outputs/
Dong et al., XGrammar (2024): https://arxiv.org/abs/2411.15100
Labs 01 and 03 — the theory this run industrializes; Phase 1 lab-05 — finish_reason, doing load-bearing work again.

Lab 12-03 — JSON Needs a Stack: the Pushdown Mask `[CPU-OK]`

Try to write lab-01's FSM for "balanced nested braces" and you'll hit a wall that's been a theorem since 1956: a finite-state machine cannot count unbounded nesting — matching { to } at arbitrary depth requires memory that grows, i.e. a stack. JSON nests. Therefore JSON is not a regular language, regex-based masking cannot enforce it, and the production answer (xgrammar) compiles context-free grammars to pushdown automata. In this lab you build the pushdown machine for a JSON subset — modes plus a depth counter that is the stack — and prove the strongest property in the phase: a model that hates closing braces (they're its least-preferred character) still emits output that json.loads accepts. Along the way the fuzz oracle will catch a grammar corner most humans forget (this lab's reference implementation forgot it too, the first time): JSON forbids leading zeros.

Why this lab exists

The regular/context-free boundary is the single most practical piece of formal-language theory in modern serving, because it's a product boundary: "constrain to a regex" and "constrain to a JSON schema" are different features with different engines, different compile costs, and different failure modes — and engineers who don't know why ship the wrong one. After this lab the boundary is physical: you'll have built both machines and the difference is one integer (depth) that lab-01's FSM has nowhere to put.

The second lesson is oracle-driven grammar debugging, courtesy of an honest war story: the fuzz test here validates completed generations with json.loads — an independent, stricter implementation of the spec — and on this lab's first draft it rejected "0123", which the hand-built machine happily accepted. The machine's grammar was wrong (JSON ints are '0' | [1-9][0-9]*), and no amount of testing the machine against itself would have found it. Constrained decoding is only as correct as its grammar; always fuzz against an independent parser. That habit is worth more than the lab.

Background: modes + depth = pushdown

The machine is lab-01's FSM plus one number. Modes track position inside the local grammar production (value, int, int_zero, obj_first, key, key_body, colon, comma_or_close, obj_key, done); depth counts open objects — and because this grammar's only recursive construct is the object, depth is the entire stack (a general CFG pushes symbols; we push indistinguishable ones, so a counter suffices — a nice special case to notice). The two rules that make it click:

'{' in value mode pushes (depth+1) and enters obj_first; '}' pops and then asks: stack empty? → done (only EOS legal); else → comma_or_close (the enclosing object resumes). That resume-the-parent move is what no FSM can express — the parent's state was remembered by the stack.
Ending a value is context-dependent: a top-level int may end at EOS; the same int inside an object must be followed by , or }. Hence allowed() consults depth — the mask itself is stack-aware, which is the whole point.

One implementation subtlety worth savoring: feeding , while in int mode must first end the int, then re-dispatch the comma in the new mode (feed calls itself once). Tokens that terminate one production and belong to the next are the bread-and-butter of incremental parsing — xgrammar's machinery handles exactly this, industrialized.
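A minimal sketch of the modes-plus-depth idea, assuming a smaller grammar than the lab's (ints and objects only, keys restricted to lowercase letters). The class name `TinyJsonMachine` and the helper `recognizes` are illustrative and are not the lab's `solution.py`. It shows the two rules above and the feed-calls-itself-once re-dispatch.

```python
import string

DIGITS = set(string.digits)
LETTERS = set(string.ascii_lowercase)
EOS = "<eos>"


class TinyJsonMachine:
    """value := int | '{' [pair (',' pair)*] '}' ; pair := '"' [a-z]+ '"' ':' value"""

    def __init__(self):
        self.mode = "value"
        self.depth = 0  # the whole stack: number of currently open objects

    def _after_value(self):
        # Context-dependent: what may follow a finished value depends on depth.
        return {EOS} if self.depth == 0 else {",", "}"}

    def allowed(self):
        if self.mode == "int":
            return DIGITS | self._after_value()
        if self.mode == "int_zero":  # a leading 0 must be the entire int
            return self._after_value()
        return {
            "value": DIGITS | {"{"},
            "obj_first": {'"', "}"},
            "obj_key": {'"'},
            "key": LETTERS,
            "key_body": LETTERS | {'"'},
            "colon": {":"},
            "comma_or_close": {",", "}"},
            "done": {EOS},
        }[self.mode]

    def _end_value(self):
        # Stack empty: finished. Otherwise the enclosing object resumes.
        self.mode = "done" if self.depth == 0 else "comma_or_close"

    def feed(self, ch):
        if ch not in self.allowed():
            raise ValueError(f"{ch!r} is illegal in mode {self.mode!r}")
        if self.mode in ("int", "int_zero") and ch not in DIGITS:
            self._end_value()     # the terminator first ends the int ...
            return self.feed(ch)  # ... then is re-dispatched in the new mode
        if self.mode == "value":
            if ch == "{":
                self.depth += 1   # push
                self.mode = "obj_first"
            else:
                self.mode = "int_zero" if ch == "0" else "int"
        elif self.mode in ("obj_first", "obj_key", "comma_or_close"):
            if ch == "}":
                self.depth -= 1   # pop
                self._end_value()
            else:
                self.mode = "key" if ch == '"' else "obj_key"
        elif self.mode == "key":
            self.mode = "key_body"
        elif self.mode == "key_body" and ch == '"':
            self.mode = "colon"
        elif self.mode == "colon":
            self.mode = "value"
        # A digit in "int", a letter in "key_body", and EOS in "done" keep the mode.

    def accepting(self):
        return EOS in self.allowed()


def recognizes(text):
    machine = TinyJsonMachine()
    try:
        for ch in text:
            machine.feed(ch)
    except ValueError:
        return False
    return machine.accepting()
```

In this sketch only `int` and `int_zero` consult `depth`, through `_after_value`; every other mode's allowed set is decidable from the mode alone.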

Files

starter.py — JsonMachine (allowed / feed / accepting, modes documented) and constrained_generate. Your work.
solution.py — reference.
test_lab.py — recognition both ways, illegal-feed errors, the depth-8 stack test, the brace-hating model, the json.loads fuzz oracle, and the truncation caveat (pushdown edition: depth that only grows).

Run

LAB_IMPL=starter pytest phase-12-structured-outputs/labs/lab-03-json-pushdown -q
pytest phase-12-structured-outputs/labs/lab-03-json-pushdown -q   # reference

What the tests prove

Test	What it pins
`test_recognizes_valid_json` / `test_rejects_invalid_json`	The grammar, both directions — including `{}}`, `{"a":}`, and the trailing-comma classic
`test_illegal_feed_raises`	The machine is also a validator; feeding it garbage is loud, not corrupting
`test_depth_is_the_stack`	8 levels opened, tracked, closed — the non-regular behavior, exercised. (Any fixed-state FSM fails some depth; the counter never does)
`test_brace_hating_model_still_emits_valid_json`	The adversarial guarantee, CFG edition: `}` ranked dead last, output parses anyway — `comma_or_close` mode eventually offers nothing but structure-respecting choices
`test_fuzzed_preferences_always_parse_or_truncate_live`	50 random models against the independent oracle: every completed output parses, every truncated one is a live prefix. This is the test that caught the leading-zero bug — the lab's best argument made by its own history
`test_truncation_caveat_brace_lover_never_closes`	Lab-01's caveat, worse: a `{`-loving model nests forever (each `value` slot opens another object), hits the cap with depth > 0, output unparseable. Prefix-valid ≠ complete, and recursion gives the failure infinite room

Hitchhiker's notes

From this machine to xgrammar: a JSON Schema adds constraints your subset doesn't have (specific keys, types per key, string escapes, floats) — the grammar grows, the principle doesn't: compile schema → CFG → pushdown automaton → per-state token bitmasks (lab-01's lifting, over pushdown configurations). XGrammar's research contribution is making that lifting fast despite the stack: most masking decisions turn out to be context-independent (decidable from the mode alone) and get precomputed; only the genuinely stack-dependent ones (your depth-consulting closers) are evaluated at runtime. Your machine cleanly displays which decisions are which — look at allowed(): only two branches read depth.
The speculative-decoding interaction (Phase 8): verifying k drafted tokens under a grammar means advancing the pushdown machine k times and rolling back on rejection — so the machine must be checkpointable (your machine: copy mode + depth; a full PDA: copy the stack). Feature composition is where structured-output engines earn their complexity; vLLM gates some combinations for exactly this reason.
Per-request state, again: each request carries its own machine, advanced as tokens commit (Phase 9 lab-04's isolation discipline — grammar state is one more thing that must never leak between batch rows). In vLLM it lives alongside the request in the structured-output manager, advanced in update_from_output's neighborhood (Phase 1's loop, hosting yet another tenant).
Why not just retry until valid? Compute: invalid generations burn full generation cost each attempt, and complex schemas can have high rejection rates. Masking's cost is per-step and tiny. The mask also helps the model — at every step its probability mass is renormalized over only-valid continuations, so the model is never "off the rails" trying to recover from its own syntax error. Constrained models often produce better-content JSON too, for this reason.

Going further

Add arrays ([ value (',' value)* ]) — now two bracket types share the stack and a counter no longer suffices: you need an actual stack of {-vs-[ symbols. You'll have crossed from counter automaton to true PDA, and the diff is ~15 lines that teach the distinction better than any textbook.
Add strings as values with escape handling ("a\"b") — the mode machinery for escapes is exactly why real JSON grammars are bigger than people expect.
Lift to tokens: reuse lab-01's allowed_tokens over this machine (a token is allowed iff feeding its chars never raises) with a multi-char vocab, and re-run the brace-hater test at the token level. You've now built, end to end, a miniature xgrammar.

References

Dong et al., XGrammar: Flexible and Efficient Structured Generation (2024) — the context-independent/dependent split and the PDA machinery: https://arxiv.org/abs/2411.15100
upstream/vllm/v1/structured_output/backend_xgrammar.py — the integration; find the per-request grammar state and the bitmask path.
Chomsky, Three Models for the Description of Language (1956) — where "regex can't count braces" was proven, sixty-nine years before your fuzz test rediscovered its consequences: https://doi.org/10.1109/TIT.1956.1056813
Lab-01 — the FSM floor this lab builds on; Phase 9 lab-01 — the hook both ride.

Phase 12 — Exercises: Structured Outputs

Warm-up (explain)

Why mask logits per step instead of generating freely and rejecting invalid outputs?
What machine do you need for a regex, and what more do you need for JSON? Why exactly can't an FSM handle JSON?
Grammars constrain characters but models emit tokens — state the lifting rule for "token T is allowed in state S", and why it's too slow to evaluate naively per step.

Solution sketches

Rejection is unbounded (a 1k-token output that's 99.9% reliable per token fails ~63% of the time; retries multiply cost and latency, and there's no guarantee of ever succeeding). Masking adds a small, bounded cost per step, independent of output length, and makes invalid output impossible while letting softmax renormalize over the legal set, preserving the model's preferences among valid continuations.
Regex → FSM (finite states, transition table). JSON → pushdown automaton: nesting depth is unbounded and an FSM has finite memory, so it cannot ensure every { gets its } — you need a stack (push on open, pop on close).
T allowed in S iff running T's characters through the automaton from S never dies. Naive cost = vocab × token_len automaton steps each decode step (~hundreds of thousands). xgrammar precomputes per-context token verdicts at compile time into a packed bitmask, leaving only a small context-dependent set for runtime.
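The ~63% figure in the first sketch can be checked in a few lines. This assumes per-token validity is independent with a fixed probability, which is a simplification; `rejection_stats` is an illustrative helper name.

```python
def rejection_stats(p_token_ok, n_tokens):
    """Failure odds of generate-then-validate under independent per-token errors."""
    p_valid = p_token_ok ** n_tokens
    return {
        "p_fail": 1.0 - p_valid,              # chance one full generation is invalid
        "expected_attempts": 1.0 / p_valid,   # mean of the geometric retry count
    }


stats = rejection_stats(0.999, 1000)
print(f"{stats['p_fail']:.3f} {stats['expected_attempts']:.2f}")  # 0.632 2.72
```

The expected attempt count is only a mean; the retry count has no upper bound, which is the "unbounded" part of the argument.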

Core (trace the code)

StructuredOutputManager.grammar_init (__init__.py:115) — what is submitted to the executor, what does the request's grammar field hold meanwhile, and what does the scheduler do with such a request?
In grammar_bitmask (__init__.py:204), the serial path calls accept_tokens on draft tokens and later rollback. Why advance at all, and why must it rewind?
apply_grammar_bitmask (utils.py:44) builds sorted_bitmask filled with -1. What does a -1 word mean, and why does the function need to reorder rows at all?
Why does the bitmask allocation reserve max_num_seqs × (1 + num_speculative_tokens) rows rather than max_num_seqs?

Solution sketches

_create_grammar (the backend compile_grammar call) goes to a ThreadPoolExecutor; the field holds a Future. The grammar property (request.py:60) returns None until resolved, and the scheduler skips the request — it waits, unscheduled, so compile latency hits only that request's TTFT, never the engine loop.
The mask for draft position i must reflect the state after drafts 1..i−1 — so the grammar advances through the drafts to compute successive rows. But accept/reject belongs to the rejection sampler after the forward pass; the grammar must rewind (rollback(state_advancements)) so the real outcome can be applied later.
-1 = all bits set = every token allowed (int32 of all 1s) — the rows for unconstrained requests. Reordering: the scheduler emitted rows in its own request order, but logits rows follow the runner's batch order with spec-token offsets; struct_out_req_batch_indices maps request → logit row.
With spec decode, each request samples at up to 1 + k positions per step (k drafts + bonus/correction), and each position needs its own mask row, stored inline.
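The advance-then-rewind pattern from the sketches above, as a toy. `BraceGrammar` and `mask_rows_for_drafts` are hypothetical names, not vLLM's classes, and the sketch assumes the drafts were already validated against the grammar.

```python
class BraceGrammar:
    """Toy per-request grammar state: '}' is legal only while some '{' is open."""

    VOCAB = ["{", "}", "x"]

    def __init__(self):
        self.history = []  # accepted tokens; rollback pops from the end

    @property
    def depth(self):
        return self.history.count("{") - self.history.count("}")

    def allowed(self):
        # One boolean per vocab entry: the mask row for the current state.
        return [tok != "}" or self.depth > 0 for tok in self.VOCAB]

    def accept_token(self, tok):
        if not self.allowed()[self.VOCAB.index(tok)]:
            return False
        self.history.append(tok)
        return True

    def rollback(self, n):
        if n:  # guard: del history[-0:] would wipe the whole history
            del self.history[-n:]


def mask_rows_for_drafts(grammar, drafts):
    """Return k + 1 mask rows (k drafts plus the bonus position), state unchanged."""
    rows, advanced = [], 0
    try:
        for tok in list(drafts) + [None]:
            rows.append(grammar.allowed())  # mask for the state before this token
            if tok is not None:
                if not grammar.accept_token(tok):
                    raise ValueError("draft token violates the grammar")
                advanced += 1
    finally:
        grammar.rollback(advanced)  # the real verdict comes after the forward pass
    return rows


g = BraceGrammar()
g.accept_token("{")
rows = mask_rows_for_drafts(g, ["}", "x"])
```

After the call, `g.history` is still `["{"]`: the grammar advanced only to compute the later rows, then rewound so the accepted prefix can be applied once the sampler has decided.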

Build (your lab)

In your lab-01 FSM, add a choice(["yes", "no", "maybe"]) constraint without writing a new engine — express it as a regex and confirm the adversarial model can only emit one of the three.
Measure your mask-cache hit rate: generate 200 tokens under a 5-state FSM and count distinct (state → mask) computations vs lookups. Relate the result to why xgrammar compiles ahead of time.
In lab-03's pushdown, construct an input where the same automaton state has different allowed sets depending on the stack. Why does this kill any pure-FSM implementation?

Solution sketches

choice is just alternation: yes|no|maybe — same FSM machinery (this is literally how CHOICE is lowered upstream). Adversarial run must end with output ∈ the set.
Distinct computations = number of reachable FSM states (≤ 5); everything after is a lookup, so hit rate → ~100% quickly. Ahead-of-time compilation is this cache computed eagerly for every state at compile time.
State "expecting close bracket" with stack [{ must allow } not ]; with [[ it must allow ] not }. Same control state, different stack top ⇒ allowed set is a function of (state, stack), which an FSM cannot represent for unbounded depth.
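One way to run the mask-cache measurement from the second Build exercise, sketched over a made-up 5-state cyclic DFA. The transition table, the vocabulary, and the "prefers short tokens" stand-in model are all assumptions for illustration.

```python
# Hypothetical 5-state DFA over characters: it cycles 0 -> 1 -> 2 -> 3 -> 4 -> 0.
NEXT = {(0, "a"): 1, (1, "b"): 2, (2, "c"): 3, (3, "d"): 4, (4, "e"): 0}
VOCAB = ["a", "b", "c", "d", "e", "ab", "cd", "de", "ea", "abc", "zz"]


def run(state, token):
    """Feed a token's characters; return the end state, or None if the DFA dies."""
    for ch in token:
        state = NEXT.get((state, ch))
        if state is None:
            return None
    return state


mask_cache = {}
stats = {"computed": 0, "looked_up": 0}


def allowed_tokens(state):
    if state in mask_cache:
        stats["looked_up"] += 1
    else:
        stats["computed"] += 1  # the expensive vocab x token_len sweep
        mask_cache[state] = [t for t in VOCAB if run(state, t) is not None]
    return mask_cache[state]


state, out = 0, []
for _ in range(200):
    token = min(allowed_tokens(state), key=len)  # stand-in model: prefers short tokens
    out.append(token)
    state = run(state, token)

print(stats)  # {'computed': 5, 'looked_up': 195}
```

Five sweeps serve 200 steps. Computing those five rows before the first token arrives, rather than on first visit, is the ahead-of-time version of the same cache.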

Design (staff-level)

A tenant sends 10k requests/minute, each with a unique (uncacheable) 50 KB JSON schema; compiles take 200 ms of CPU. The engine also serves latency-sensitive unconstrained traffic. What breaks, and what do you do about it (at least three layers of defense)?
Product asks: "constrained decoding makes outputs worse — the JSON is valid but the content got dumber." Is that plausible? Explain the mechanism and two mitigations.
Spec decode (k=4) + structured output: derive the per-step grammar-work overhead vs non-spec, and explain why acceptance rate changes under constraints (which direction?).
Design grammar-aware prefix caching: two requests share a long prompt but have different schemas. What can be shared (Phase 2/3 machinery), what cannot, and where would you hook the distinction in?

Solution sketches

The compile executor saturates → constrained TTFT explodes; if compile work steals scheduler-process CPU, everyone's step time degrades. Defenses: bound + isolate the compile pool (separate cores/process); per-tenant compile-queue quotas and admission control (reject/429 on queue depth); schema canonicalization to recover cache hits; pre-registration API for schemas; cap schema size/complexity at the front door (upstream already rejects unsupported features in the processor — same layer).
Plausible. The mask renormalizes over legal tokens, but the model's distribution was conditioned on its own preferred phrasing; forcing rare continuations (e.g. token boundaries that split words awkwardly) pushes it off-manifold and degrades content. Mitigations: schema-aware prompting (show the schema, few-shot valid examples) so the constrained path is also the high-probability path; looser grammars (whitespace flexibility — note xgrammar's any_whitespace flag); two-pass (free-form reason, then constrained extraction — exactly what the reasoning-gate in should_advance enables).
Per step the grammar does k+1 fill_bitmask calls + k accept_tokens + one rollback(k) vs 1 fill — so ~(k+1)× the grammar CPU, still small vs the forward pass. Acceptance rises on structural tokens (mask forces draft and target into the same narrow set — high agreement) and can fall if the drafter is unconstrained while the target is masked (drafts get vetoed by structure the drafter didn't know about; vLLM mitigates by validating drafts with validate_tokens).
KV blocks for the shared prompt are shareable as ever — KV depends only on token ids (the mask touches logits, not hidden states). What's not shareable is grammar state and compiled grammars (different schemas). Hook: nothing to change in the block hash — grammar state only matters from the first generated token; just keep grammar identity out of prefix-cache keys for prompt blocks (and be careful only if constraint applies to prompt — it doesn't).

Self-grading

4–7 and 11–14 are interview-grade. Could you whiteboard the per-step pipeline (state → bits → −inf → sample → advance) and the spec-decode advance/rollback dance from memory? If not, re-read 01-deep-dive.md §4.

Phase 12 — Interview Questions: Structured Outputs

Staff/principal-level questions. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)

Q1. How does vLLM guarantee valid JSON output?

Model answer

It compiles the JSON schema into a grammar automaton (xgrammar: a pushdown automaton, since JSON nests), and at each decode step computes a bitmask of tokens that keep the output grammar-valid, applying −inf to all illegal logits before sampling. Softmax renormalizes over the legal set, so the model still expresses preference among valid tokens — but invalid output is impossible by construction, not just unlikely. One grammar state per request, advanced as tokens are accepted.
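The per-step loop in this answer can be sketched at character level with a three-way choice constraint. This is a toy under stated assumptions: the vocabulary is single characters, `allowed` and `constrained_generate` are illustrative names, and the adversarial `logits_fn` stands in for a model that strongly prefers an illegal symbol.

```python
import math
import random

CHOICES = ["yes", "no", "maybe"]
EOS = "<eos>"
VOCAB = sorted(set("".join(CHOICES))) + ["z", EOS]


def allowed(prefix):
    """Characters that keep prefix extendable to a choice; EOS once it is one."""
    nxt = {c[len(prefix)] for c in CHOICES
           if c.startswith(prefix) and len(c) > len(prefix)}
    if prefix in CHOICES:
        nxt.add(EOS)
    return nxt


def constrained_generate(logits_fn, rng, max_steps=8):
    prefix = ""
    for _ in range(max_steps):
        legal = allowed(prefix)                        # grammar state -> allowed set
        logits = logits_fn(prefix)
        masked = [x if t in legal else -math.inf       # illegal logits become -inf
                  for t, x in zip(VOCAB, logits)]
        top = max(masked)
        weights = [math.exp(x - top) for x in masked]  # softmax numerators, exp(-inf) == 0
        token = rng.choices(VOCAB, weights=weights)[0]
        if token == EOS:
            return prefix
        prefix += token                                # advance the grammar state
    return prefix  # hit the cap: possibly a valid prefix, not a complete answer


def z_lover(prefix):
    # Adversarial stand-in model: puts nearly all probability on the illegal "z".
    return [10.0 if t == "z" else 0.0 for t in VOCAB]


outputs = [constrained_generate(z_lover, random.Random(seed)) for seed in range(50)]
```

Every entry of `outputs` is one of the three choices, even though the unmasked distribution would almost always pick `z`.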

Q2. Grammars are over characters; models emit multi-character tokens. How is that bridged efficiently?

Model answer

A token is legal iff its character sequence keeps the automaton alive from the current state. Checking that naively is vocab × token-length automaton steps per decode step. The production trick (xgrammar) is compile-time token classification: for each automaton context, precompute which tokens are unconditionally legal/illegal and pack them into a bitmask (vocab/32 int32 words); only a small context-dependent remainder (e.g. tokens interacting with the stack) is checked at runtime. Per-step cost becomes "copy a precomputed row + check a few stragglers."
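A pure-Python sketch of the packed representation, assuming 32 tokens per word with bit `i % 32` of word `i // 32` standing for token `i`. Production code does this on tensors in a fused kernel; `pack_bitmask` and `apply_bitmask` are illustrative names.

```python
import math


def pack_bitmask(allowed_ids, vocab_size):
    """One bit per token, 32 tokens per word: ceil(vocab_size / 32) words."""
    words = [0] * math.ceil(vocab_size / 32)
    for tok in allowed_ids:
        words[tok // 32] |= 1 << (tok % 32)
    return words


def apply_bitmask(logits, words):
    """Keep a logit where its bit is set, otherwise replace it with -inf."""
    return [
        x if (words[i // 32] >> (i % 32)) & 1 else -math.inf
        for i, x in enumerate(logits)
    ]


logits = [0.5] * 40
row = pack_bitmask({0, 33}, vocab_size=40)   # [1, 2]
masked = apply_bitmask(logits, row)
unconstrained = apply_bitmask(logits, [-1, -1])
```

A word of `-1` has every bit set, so a row of `-1` words leaves the logits untouched; that is why it can serve as the unconstrained row.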

Q3. Where does grammar compilation run, and why does that placement matter?

Model answer

On the scheduler side, in a thread-pool executor (StructuredOutputManager.grammar_init). The request's grammar field holds a Future, and the scheduler won't schedule the request until it resolves — so a 200 ms schema compile costs that request TTFT but never stalls the engine loop or other requests' steps. Compiled grammars are keyed by (type, spec) so repeated schemas don't recompile. The bitmask is also filled scheduler-side and shipped to workers as numpy (cheap serialization); the GPU only applies it.

Q4. How do structured outputs compose with speculative decoding?

Model answer

The bitmask carries one row per sampled position: for k draft tokens you need masks for the state before each draft plus the bonus position, so allocation is max_num_seqs × (1+k) rows. To compute row i, the grammar tentatively advances through drafts 1..i−1 with accept_tokens; after filling all rows it rewinds with rollback, because the true accept/reject verdict belongs to the rejection sampler post-forward, which then advances the grammar by only the accepted prefix. rollback is part of the base grammar interface and xgrammar's matcher is constructed with max_rollback_tokens=num_speculative_tokens, so composition was designed in, not patched on.

Q5. A customer says constrained outputs are "valid but dumber." Diagnose.

Model answer

Real effect. Masking renormalizes but the model wasn't conditioned to follow this grammar; when its preferred continuation is illegal, probability mass shifts to tokens it would rarely choose, pushing generation off-distribution (worst at awkward token boundaries and rigid whitespace rules). Mitigations: prompt with the schema and examples so the high-probability path is also the legal path; relax the grammar where harmless (xgrammar's any_whitespace); generate reasoning unconstrained and only constrain the final answer — vLLM's reasoning gate (should_advance) does exactly this for thinking models. Also check finish_reason="length": truncation, not the mask, is the most common "broken JSON" report.

Q6. Engine design: would you support different grammar backends per request? What does vLLM do and why?

Model answer

vLLM V1 deliberately supports one backend per engine (first constrained request picks it; see the NOTE in grammar_init). Per-request backends would mean multiple compiled-grammar caches, multiple bitmask allocation schemes, and validating every request's spec against every backend's feature matrix — for little gain since backends are interchangeable behind the two-interface contract (StructuredOutputBackend / StructuredOutputGrammar). The right extension point is the contract, not per-request dispatch: that's also how attention and quantization backends are handled.

Rapid-fire

Mask applied where? Logits, −inf, pre-softmax (one fused kernel, apply_grammar_bitmask).
Regex needs? FSM. JSON needs? Pushdown (stack). Schema → ? compiled to grammar first.
Bitmask row size? ceil(vocab_size / 32) int32 words. All -1 row = ? unconstrained.
Compile blocking the engine loop? Never — async executor, request waits unscheduled.
Spec-decode hooks in the grammar interface? validate_tokens, rollback(n).
What constraints can't fix: truth of content, and max_tokens truncation.

Phase 12 — Cheatsheet: Structured Outputs

The one-liner

Per step: grammar state → allowed-token bitmask → illegal logits = −inf → sample → advance state. Valid by construction; softmax renormalizes over legal tokens.

The pipeline

StructuredOutputsParams (one of json/regex/choice/grammar/json_object/structural_tag) → grammar_init compiles async (request unschedulable until Future resolves) → scheduler get_grammar_bitmask per step → manager fills rows → numpy → runner apply_grammar_bitmask reorders to batch order + fused −inf kernel → sample → accept_tokens advances.

Machines

regex → FSM · JSON/EBNF → pushdown (stack) · JSON Schema → compiled to grammar.
Char-rules lifted to tokens at compile time (xgrammar token classification) → packed bitmask, vocab/32 int32 words; runtime checks only context-dependent stragglers.

Performance model

Compile: once per distinct (type, spec) key; tens to hundreds of ms for big schemas; hits first request's TTFT only.
Per step: bitmask fill (CPU, parallelized above a batch threshold) + one fused mask kernel; low single-digit % overhead steady-state.

Spec-decode composition

Bitmask rows = max_num_seqs × (1 + num_spec_tokens) — one row per sampled position.
Fill row i after tentatively accepting drafts < i; then rollback(advancements).
Grammar interface has validate_tokens (check, no advance) + rollback(n); xgrammar matcher built with max_rollback_tokens=num_spec_tokens.

Gotchas

One backend per engine (xgrammar default; guidance/outlines/lm-format-enforcer), not per request.
Reasoning models: constraint suspended during thinking (should_advance gate; mask row set all-ones) until the reasoning parser signals end.
Valid ≠ true; and finish_reason="length" still truncates mid-structure — budget max_tokens for the schema's worst case.
Constrained + unconstrained drafter can tank spec acceptance — drafts get vetoed.

Key upstream

vllm/sampling_params.py:41 StructuredOutputsParams
v1/structured_output/backend_types.py:31 Grammar :99 Backend (the contract)
v1/structured_output/__init__.py:36 Manager :115 grammar_init :204 grammar_bitmask :322 should_advance
v1/structured_output/backend_xgrammar.py:77 compile_grammar :132 XgrammarGrammar
v1/structured_output/utils.py:44 apply_grammar_bitmask (runner side, gpu_model_runner.py:4359)
v1/core/sched/scheduler.py:1259 get_grammar_bitmask

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

Phase 13 — The Hitchhiker's Guide to Multimodal Models

← Phase 12 · Course home · Phase 14 →

Don't Panic

A vision-language model is not a new kind of engine. It's the same LLM you've been serving for twelve phases, plus a vision encoder bolted on the front. The image is encoded into a sequence of embedding vectors, and those vectors are spliced into the text embedding stream at reserved placeholder positions. From the first transformer layer onward, the model cannot tell which positions were words and which were pixels — KV cache, paged attention, continuous batching, all of it works unchanged. The new engineering is all around the splice: expanding placeholders, scheduling the encoder, and caching its output.

 "What is in <image> ?"                       ┌──────────────┐
        │ tokenize + expand          pixels ──► vision encoder│──► [E1 E2 … E576]
        ▼                                     └──────────────┘        │ projector
 [What][is][in][IMG][IMG]…[IMG][?]                                    │ (to LLM dim)
        │ embed text                                                  ▼
 [w1][w2][w3][▢][▢]…[▢][w4]  ──── overwrite ▢ positions ────► [w1 w2 w3 E1 … E576 w4]
                                                                      │
                                                                      ▼
                                                          ordinary LLM forward (Phases 0–11)

Step 1: How a decoder-only LLM "sees" — the splice

Three parts (LLaVA is the canonical layout, llava.py):

Vision encoder (a ViT): image → grid of patches (e.g. 24×24 = 576) → one embedding per patch.
Projector (an MLP): maps encoder embeddings into the LLM's hidden dimension — two matrices is all it takes to make pixels speak the language model's language.
The LLM: receives inputs_embeds where placeholder positions have been overwritten with projected image embeddings. In vLLM the overwrite is literally one indexed assignment: inputs_embeds[is_multimodal] = mm_embeds (models/utils.py:456, _merge_multimodal_embeddings).

That's the whole trick. Cross-attention encoder-decoder models (Whisper-style) are the exception, not the rule, in today's VLM zoo — the spliced decoder-only design won.
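A list-based sketch of the overwrite, standing in for the tensor indexed assignment. The function name and the plain-list representation are assumptions for illustration; the error text mirrors the upstream message quoted in Step 2.

```python
def merge_multimodal_embeddings(inputs_embeds, is_multimodal, mm_embeds):
    """Overwrite flagged positions of inputs_embeds with mm_embeds, in order."""
    n_slots = sum(is_multimodal)
    if n_slots != len(mm_embeds):
        raise ValueError(
            f"Attempted to assign {len(mm_embeds)} multimodal tokens "
            f"to {n_slots} placeholders"
        )
    image_rows = iter(mm_embeds)
    return [
        next(image_rows) if flag else row
        for row, flag in zip(inputs_embeds, is_multimodal)
    ]


text = [[0.1], [0.2], [0.0], [0.0], [0.3]]   # two dummy rows at positions 2 and 3
flags = [False, False, True, True, False]
merged = merge_multimodal_embeddings(text, flags, [[9.1], [9.2]])
```

After the merge nothing in the row list records which positions came from pixels, which is the sense in which the rest of the engine runs unchanged.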

Step 2: Placeholders — the contract between processor and model

Before the model runs, the multimodal processor rewrites the prompt: the single <image> marker becomes N repeated image tokens, and a PlaceholderRange(offset, length) (multimodal/inputs.py:119) records exactly where. This bookkeeping is the contract:

The tokenizer side promises: positions [offset, offset+length) are dummies awaiting embeddings (some models interleave real structure — row separators — so is_embed can mask which positions inside the range are actually image slots).
The model side promises: the encoder will produce exactly length (or is_embed.sum()) embeddings. Get the count wrong and you get the classic VLM crash — upstream raises "Attempted to assign X multimodal tokens to Y placeholders" (utils.py:484). Your lab-01 makes you maintain this invariant by hand.
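The processor side of the contract, sketched with token ids as plain integers. `Range` is a hypothetical stand-in for `PlaceholderRange` (offset and length only, no `is_embed`), and `expand_placeholders` is an illustrative name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Hypothetical stand-in for PlaceholderRange: positions [offset, offset + length)."""
    offset: int
    length: int


def expand_placeholders(token_ids, marker_id, image_token_id, tokens_per_image):
    """Replace each marker with N image tokens and record where they landed."""
    out, ranges = [], []
    lengths = list(tokens_per_image)
    for tok in token_ids:
        if tok != marker_id:
            out.append(tok)
            continue
        if not lengths:
            raise ValueError("more image markers than images")
        n = lengths.pop(0)
        ranges.append(Range(offset=len(out), length=n))  # offset in the expanded prompt
        out.extend([image_token_id] * n)
    if lengths:
        raise ValueError("more images than image markers")
    return out, ranges


expanded, ranges = expand_placeholders(
    [1, 99, 2, 99, 3], marker_id=99, image_token_id=7, tokens_per_image=[4, 2]
)
```

The offsets are measured in the expanded prompt, so the second image's offset already accounts for the first image's four tokens.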

Step 3: The cost — one image is a paragraph… or a chapter

Image tokens are real tokens downstream: they occupy KV-cache blocks (Phase 2), consume scheduler token budget (Phase 3), and lengthen every later attention read. Typical scales:

Model	One image becomes
LLaVA-1.5 (fixed 336²)	576 tokens — always
Qwen2-VL (dynamic resolution)	~4 → ~16k tokens, ∝ pixel count

Dynamic resolution is the dangerous one: token count grows quadratically with image side length (lab-02 measures the law on real Qwen2-VL). A 4-image request can dwarf its own text. This is why MM models need their own memory profiling (compute_mm_encoder_budget, encoder_cache_manager.py:269) — the worst-case image inflates both KV and the encoder cache, and the engine must reserve for it at startup.
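The quadratic law as arithmetic. The defaults are assumptions: `patch=14` reproduces the 24×24 grid of a 336² input, and `merge=2` illustrates a model that fuses 2×2 patches into one token. Real processors also resize and clamp images first, which this sketch ignores.

```python
import math


def image_tokens(height, width, patch=14, merge=1):
    """Tokens for one image: the patch grid, optionally fused merge x merge per token."""
    cell = patch * merge
    return math.ceil(height / cell) * math.ceil(width / cell)


fixed = image_tokens(336, 336)               # 24 * 24 = 576
small = image_tokens(448, 448, merge=2)      # 16 * 16 = 256
large = image_tokens(896, 896, merge=2)      # 32 * 32 = 1024
print(fixed, small, large, large // small)   # 576 256 1024 4
```

Doubling the side length quadruples the token count, and with it the KV blocks and scheduler budget the image consumes.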

Step 4: The encoder cache — don't encode the same image twice

Encoder output is expensive (a full ViT forward) and reusable — the same image appears across chunked-prefill steps of one request, across retries, across users pasting the same screenshot. vLLM keeps finished encoder outputs in an EncoderCacheManager (v1/core/encoder_cache_manager.py:17), a second cache next to the KV cache with its own currency: it's measured in encoder embeddings, not blocks.

Design rhymes with Phase 2's block pool — learn the mapping:

BlockPool (Phase 2)	EncoderCacheManager (here)
block hash	`mm_hash` (content hash of the image)
`ref_cnt`	`cached[mm_hash]` = set of referencing request IDs
free queue (LRU eviction)	`freeable` OrderedDict (evict oldest unreferenced)
allocate / free	`allocate` / `free_encoder_input`, reclaim at allocation time

Cross-request sharing falls out of content hashing: two requests with the same image hit the same mm_hash (check_and_update_cache, :91).
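A toy version of the mapping in the table, assuming a single capacity counter measured in embeddings. `TinyEncoderCache` and its simplified method signatures are illustrative and differ from the upstream class.

```python
from collections import OrderedDict


class TinyEncoderCache:
    def __init__(self, capacity):
        self.free_slots = capacity     # budget, counted in encoder embeddings
        self.cached = {}               # mm_hash -> set of referencing request ids
        self.sizes = {}                # mm_hash -> number of embeddings held
        self.freeable = OrderedDict()  # unreferenced entries, oldest first

    def check_and_update_cache(self, req_id, mm_hash):
        if mm_hash not in self.cached:
            return False
        self.freeable.pop(mm_hash, None)  # referenced again: no longer evictable
        self.cached[mm_hash].add(req_id)
        return True

    def allocate(self, req_id, mm_hash, num_embeds):
        # Reclaim lazily, at allocation time, oldest unreferenced entry first.
        while self.free_slots < num_embeds and self.freeable:
            victim, _ = self.freeable.popitem(last=False)
            self.free_slots += self.sizes.pop(victim)
            del self.cached[victim]
        if self.free_slots < num_embeds:
            return False
        self.free_slots -= num_embeds
        self.cached[mm_hash] = {req_id}
        self.sizes[mm_hash] = num_embeds
        return True

    def free_encoder_input(self, req_id, mm_hash):
        refs = self.cached[mm_hash]
        refs.discard(req_id)
        if not refs:
            self.freeable[mm_hash] = None  # keep the data until space is needed


cache = TinyEncoderCache(capacity=10)
first = cache.allocate("r1", "imgA", 6)
shared = cache.check_and_update_cache("r2", "imgA")  # same pixels, second request
cache.free_encoder_input("r1", "imgA")
blocked = cache.allocate("r3", "imgB", 6)            # imgA still referenced by r2
cache.free_encoder_input("r2", "imgA")
evicting = cache.allocate("r3", "imgB", 6)           # imgA is evicted to make room
```

Freeing the last reference does not drop the entry; it only makes it evictable, so a later request with the same image can still hit until the space is actually needed.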

Step 5: Encoder meets chunked prefill — the scheduling problem

Chunked prefill (Phase 3) slices a long prompt into budget-sized pieces. But an image embedding is produced by one indivisible encoder forward — you can't compute the first half of a ViT's patches this step and the rest next step. So the scheduler must reconcile two granularities, and _try_schedule_encoder_inputs (scheduler.py:1096) is the reconciliation. An encoder input is scheduled this step iff:

its placeholder range overlaps the token window being computed, [num_computed_tokens, num_computed_tokens + num_new_tokens);
it isn't already in the encoder cache;
the per-step encoder compute budget has room (encoders are compute-heavy; unbounded encoder work would blow up step time exactly like unbounded prefill would);
the encoder cache has space to hold the output.

If any check fails, the scheduler shrinks num_new_tokens to stop just before the unschedulable image: prefill the text up to the doorstep, then wait for the next step. And once encoded-and-cached, a chunk boundary can land mid-placeholder freely: later chunks read the cached embeddings. Lab-03 builds this exact logic, all-or-nothing encodes and all.
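The four checks and the doorstep truncation, distilled into one function. This is a simplified sketch: images are `(offset, length, mm_hash)` tuples sorted by offset, one embedding per placeholder position is assumed, and `schedule_encoder_inputs` is an illustrative name rather than the upstream method.

```python
def schedule_encoder_inputs(num_computed, num_new, images, cached,
                            compute_budget, cache_space):
    """Return (hashes to encode this step, possibly shrunk num_new_tokens)."""
    to_encode = []
    for offset, length, mm_hash in images:
        if offset >= num_computed + num_new:
            break                    # starts after this step's token window
        if offset + length <= num_computed:
            continue                 # placeholder already fully computed
        if mm_hash in cached or mm_hash in to_encode:
            continue                 # embeddings are (or will be) available
        if length > compute_budget or length > cache_space:
            # All-or-nothing: stop the chunk just before the image.
            num_new = max(offset - num_computed, 0)
            break
        to_encode.append(mm_hash)
        compute_budget -= length
        cache_space -= length
    return to_encode, num_new


images = [(10, 576, "img")]  # 10 text tokens, then one 576-token image
fits = schedule_encoder_inputs(0, 256, images, set(), 1024, 1024)
starved = schedule_encoder_inputs(0, 256, images, set(), 100, 1024)
resumed = schedule_encoder_inputs(256, 256, images, {"img"}, 0, 0)
```

`fits` encodes the image and keeps the full 256-token chunk, `starved` shrinks the chunk to the 10 text tokens before the image, and `resumed` shows a later chunk starting mid-placeholder with no encoder work because the output is cached.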

Step 6: Prefix caching with pixels — hashing the image itself

Phase 3's prefix cache keys blocks by token IDs — but two different images expand to the same dummy token IDs! Sharing on token IDs alone would serve user B answers about user A's photo. Fix: MultiModalHasher (multimodal/hasher.py:50) content-hashes the actual image bytes, and that mm_hash is folded into the block hashes covering the placeholder range. Same prompt + same pixels → full prefix-cache hit; same prompt + different pixels → miss exactly at the image. (The same hash doubles as the encoder-cache key — one identity for both caches.)
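A sketch of folding image identity into chained block hashes. The SHA-256 hashing, the block size of 4, and the names `mm_hash_of` and `block_hashes` are assumptions for illustration, not the upstream hashing scheme.

```python
import hashlib


def mm_hash_of(image_bytes):
    """Content identity: same bytes, same hash, regardless of who sent them."""
    return hashlib.sha256(image_bytes).hexdigest()


def block_hashes(token_ids, images, block_size=4):
    """Chained hashes over full blocks; images is a list of (offset, length, mm_hash)."""
    hashes, parent = [], b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        end = start + block_size
        # Fold in every image whose placeholder range overlaps this block.
        extra = [h for off, n, h in images if off < end and start < off + n]
        payload = repr((token_ids[start:end], extra)).encode()
        parent = hashlib.sha256(parent + payload).digest()
        hashes.append(parent)
    return hashes


tokens = [1, 2, 3, 4, 9, 9, 9, 9, 5, 6, 7, 8]  # 9 = dummy image token
cat = [(4, 4, mm_hash_of(b"cat pixels"))]
dog = [(4, 4, mm_hash_of(b"dog pixels"))]
h_cat, h_dog = block_hashes(tokens, cat), block_hashes(tokens, dog)
```

The first block matches for both requests, the image block differs, and because the hashes are chained every block after the image differs too, so the miss lands exactly at the image.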

The invariants to memorize

A VLM = encoder + projector + unchanged LLM; image embeddings overwrite placeholder positions in inputs_embeds. After the splice, the engine can't tell pixels from words.
PlaceholderRange is a contract: processor-side expansion count must equal encoder-side embedding count, exactly.
Image tokens are real tokens: they cost KV blocks, token budget, and attention time — dynamic-resolution models scale ∝ pixels (quadratic in side length).
The encoder cache is a second cache with its own budget, keyed by content hash, ref-counted per request, LRU-evicted when unreferenced.
Encoder runs are all-or-nothing; chunked prefill stops at the doorstep of an image it can't afford this step.
Prefix caching must mix the image hash into block hashes — token IDs alone are ambiguous for placeholder spans.

What you'll do

Read: 01-deep-dive.md — processor, placeholder machinery, encoder cache, scheduler hook, and LLaVA/Qwen2-VL as case studies, line-anchored.
Build: 02-mini-build.md — a fake-image pipeline for mini_vllm: placeholder expansion + toy encoder + content-hash cache.
Labs (see labs/README.md; recommended order 01 → 03 → 02):
- lab-01-image-token-expansion [CPU-OK] — pixels → patches → tokens → blocks: placeholder expansion, PlaceholderRange bookkeeping, and the capacity punchline (one image = 38 KV blocks).
- lab-03-encoder-scheduling [CPU-OK] — chunked prefill meets the vision tower: per-step encoder budget, all-or-nothing encodes, truncate-at-the-doorstep, and the cache that restores mid-placeholder freedom (V1's _try_schedule_encoder_inputs, distilled).
- lab-02-run-a-vlm [GPU-OPT] — Qwen2-VL on a real photo: the 1,421-token "one-line" prompt, the quadratic resize law, the encoder's TTFT spike. Captured output included.
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.


Phase 13 — Deep Dive: multimodal in real vLLM

Paths relative to upstream/ at v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number drifts in a newer tree, search for the named symbol.

vllm/multimodal/inputs.py                 PlaceholderRange + input dataclasses (read first)
vllm/multimodal/hasher.py                 MultiModalHasher — content identity
vllm/model_executor/models/llava.py       the canonical VLM (encoder+projector+splice)
vllm/model_executor/models/utils.py       _merge_multimodal_embeddings (the splice itself)
vllm/v1/core/encoder_cache_manager.py     the second cache
vllm/v1/core/sched/scheduler.py           _try_schedule_encoder_inputs (the hook)
vllm/model_executor/models/qwen2_vl.py    dynamic resolution (contrast case)

1. The contract: `PlaceholderRange`

vllm/multimodal/inputs.py:119 — class PlaceholderRange(offset, length, is_embed). The docstring example is the whole idea: prompt AAAA BBBB What is… gives image A PlaceholderRange(offset=0, length=4), image B (offset=5, length=4). is_embed is the subtlety: some models put structure tokens inside the range (Pixtral inserts a row-break token after each patch row — see llava.py:390, ([image_token_id] * ncols + [image_break_id]) * nrows), so a boolean mask says which positions actually receive embeddings. Everything downstream — scheduler windowing, embedding merge, profiling — is arithmetic over these ranges.
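To make the is_embed mask concrete, here is a toy version of the range, assuming only the three fields named above. `ToyPlaceholderRange`, `num_embeds`, and the token ids 10 and 11 are hypothetical names for illustration; the token layout copies the Pixtral expression quoted in the text.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ToyPlaceholderRange:
    """Toy stand-in for the upstream class; field names follow the text above."""
    offset: int
    length: int
    is_embed: Optional[list[bool]] = None  # None: every position gets an embedding

    def num_embeds(self) -> int:
        """How many positions in the range the encoder must fill."""
        if self.is_embed is None:
            return self.length
        return sum(self.is_embed)

IMAGE, BREAK = 10, 11  # assumed token ids, for illustration only

def pixtral_style_tokens(ncols: int, nrows: int) -> list[int]:
    # Same layout as the quoted expression: a row-break token after each patch row.
    return ([IMAGE] * ncols + [BREAK]) * nrows

tokens = pixtral_style_tokens(ncols=3, nrows=2)
rng = ToyPlaceholderRange(offset=0, length=len(tokens),
                          is_embed=[t == IMAGE for t in tokens])

print(tokens)            # [10, 10, 10, 11, 10, 10, 10, 11]
print(rng.length)        # 8 positions in the sequence
print(rng.num_embeds())  # 6 of them receive encoder embeddings
```

The gap between `length` and `num_embeds()` is the point: sequence length and KV cost follow the first number, while the encoder's output count must match the second.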

2. The processor: prompt rewriting, LLaVA-style

vllm/model_executor/models/llava.py is the layout to internalize, because Phase 14's "add a model" recipe reuses every piece:

BaseLlavaProcessingInfo.get_num_image_tokens (:188) — asks the vision-encoder info object how many tokens an H×W image becomes. This number is model math, not a constant.
LlavaDummyInputsBuilder (:222) — builds worst-case fake inputs (image_token * num_images) so startup profiling (Phase 1's memory measurement) sees the most expensive possible multimodal request before any real one arrives.
BaseLlavaMultiModalProcessor._get_prompt_updates (:264) — the rewrite rule: replace one image_token_id with [image_token_id] * num_image_tokens (:297). This is where <image> becomes 576 dummies and the PlaceholderRange is born.
The registry (vllm/multimodal/registry.py) binds processor classes to model classes via the @MULTIMODAL_REGISTRY.register_processor decorator on the model (:308 region).

3. The splice: `_merge_multimodal_embeddings`

vllm/model_executor/models/utils.py:456. After embed_multimodal (llava.py:661) runs encoder + projector (LlavaMultiModalProjector, :128 — two linears and an activation), the merge is one line:

inputs_embeds[is_multimodal] = mm_embeds_flat.to(dtype=input_dtype)

An in-place masked scatter — pixels become "words" by assignment. Read the except RuntimeError block (:478): the count-mismatch error ("Attempted to assign X multimodal tokens to Y placeholders") is the canonical symptom of a broken processor↔model contract, and the first thing you'll debug when adding a VLM. Note also the comment about keeping is_multimodal on CPU to avoid a device sync — model-runner hot path discipline.
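The same masked assignment can be tried on a laptop with numpy. This is a sketch under stated assumptions, not the upstream function: `merge_embeddings` is a hypothetical name, it checks the count up front instead of catching a RuntimeError, and the error wording is only modelled on the message quoted above.

```python
import numpy as np

def merge_embeddings(inputs_embeds: np.ndarray, is_multimodal: np.ndarray,
                     mm_embeds: np.ndarray) -> np.ndarray:
    """Overwrite the placeholder rows of inputs_embeds with encoder output, in place."""
    num_placeholders = int(is_multimodal.sum())
    if mm_embeds.shape[0] != num_placeholders:
        # The processor and the encoder disagree about the image's token count.
        raise ValueError(
            f"Attempted to assign {mm_embeds.shape[0]} multimodal tokens "
            f"to {num_placeholders} placeholders")
    inputs_embeds[is_multimodal] = mm_embeds.astype(inputs_embeds.dtype)
    return inputs_embeds

seq_len, d = 6, 4
text = np.zeros((seq_len, d), dtype=np.float32)               # pretend text embeddings
mask = np.array([False, True, True, True, False, False])      # positions 1..3 are image
image = np.ones((3, d), dtype=np.float32)                     # pretend encoder output

merged = merge_embeddings(text, mask, image)
print(merged.sum(axis=1))   # rows 1..3 now hold the image embeddings

try:
    merge_embeddings(text, mask, np.ones((2, d), dtype=np.float32))
except ValueError as err:
    print(err)              # the broken-contract symptom
```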

4. Identity: `MultiModalHasher`

vllm/multimodal/hasher.py:50. hash_kwargs (:154) serializes each item (images go via serialize_item, :52 — raw bytes, not object identity) through blake3-style hashing into an mm_hash string. One hash, two jobs:

Encoder-cache key — same image in any request hits the same cached embeddings.
Prefix-cache ingredient — the hash is folded into KV block hashes covering the placeholder span (Phase 3's kv_cache_utils block hasher takes extra_keys for exactly this), so identical dummy token IDs with different pixels cannot alias.

5. The second cache: `EncoderCacheManager`

vllm/v1/core/encoder_cache_manager.py:17. Read the class docstring — it is unusually complete. The structure, mapped to Phase 2 vocabulary:

cached: dict[mm_hash, set[request_id]] — ref-counting by named references instead of an integer ref_cnt (you can ask who holds it).
freeable: OrderedDict[mm_hash, num_embeds] — the LRU free-queue analogue: entries with zero referencing requests, evictable oldest-first, reclaimed lazily at allocation time (can_allocate, :119) exactly like Phase 2's cached-block eviction.
num_free_slots vs num_freeable_slots — actual free space vs free-after-evictions; the allocate path decides how much eviction it must perform.
Units are encoder embeddings, not blocks or bytes (see the NOTE in the docstring: in-between break/text tokens don't count) — the budget that sized this cache comes from compute_mm_encoder_budget (:269) at startup.
get_freed_mm_hashes (:255) — drained each step into SchedulerOutput (scheduler.py:901) so workers drop their copies: the manager is scheduler-side bookkeeping; the tensors live in the runner's encoder_cache dict (gpu_model_runner.py:3065). Same split-brain pattern as KV: scheduler owns accounting, worker owns memory.
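The bookkeeping above fits in a short toy class. This is a sketch of the idea only: `ToyEncoderCache` and its method names (`lookup`, `try_allocate`, `release`, `drain_freed`) and the extra `sizes` dict are hypothetical simplifications, and the real manager takes request objects and also checks the compute budget.

```python
from collections import OrderedDict

class ToyEncoderCache:
    """Scheduler-side bookkeeping only; units are encoder embeddings."""

    def __init__(self, capacity: int) -> None:
        self.num_free_slots = capacity       # free right now
        self.num_freeable_slots = capacity   # free after evicting all unreferenced entries
        self.cached: dict[str, set[str]] = {}   # mm_hash -> ids of referencing requests
        self.sizes: dict[str, int] = {}         # mm_hash -> embeddings held
        self.freeable: OrderedDict[str, int] = OrderedDict()  # unreferenced, oldest first
        self.freed: list[str] = []              # evicted, not yet reported to workers

    def lookup(self, req_id: str, mm_hash: str) -> bool:
        """Cache hit: add a reference, reviving the entry if it was freeable."""
        if mm_hash not in self.cached:
            return False
        if not self.cached[mm_hash]:
            self.num_freeable_slots -= self.freeable.pop(mm_hash)
        self.cached[mm_hash].add(req_id)
        return True

    def try_allocate(self, req_id: str, mm_hash: str, num_embeds: int) -> bool:
        """Reserve space for a new entry (call lookup first), evicting lazily."""
        if num_embeds > self.num_freeable_slots:
            return False
        while num_embeds > self.num_free_slots:
            victim, size = self.freeable.popitem(last=False)  # oldest unreferenced
            del self.cached[victim], self.sizes[victim]
            self.freed.append(victim)
            self.num_free_slots += size
        self.cached[mm_hash] = {req_id}
        self.sizes[mm_hash] = num_embeds
        self.num_free_slots -= num_embeds
        self.num_freeable_slots -= num_embeds
        return True

    def release(self, req_id: str, mm_hash: str) -> None:
        """Drop one reference; the entry lingers as freeable when none remain."""
        refs = self.cached[mm_hash]
        if req_id not in refs:
            return
        refs.discard(req_id)
        if not refs:
            self.freeable[mm_hash] = self.sizes[mm_hash]
            self.num_freeable_slots += self.sizes[mm_hash]

    def drain_freed(self) -> list[str]:
        """Hand the evicted hashes to the caller exactly once."""
        freed, self.freed = self.freed, []
        return freed

cache = ToyEncoderCache(capacity=576)
cache.try_allocate("req-A", "hash-1", 576)
cache.release("req-A", "hash-1")                    # lingers, freeable
print(cache.lookup("req-B", "hash-1"))              # True: reuse without re-encoding
cache.release("req-B", "hash-1")
print(cache.try_allocate("req-C", "hash-2", 576))   # True: evicts hash-1 to make room
print(cache.drain_freed())                          # ['hash-1']
```

Note that eviction happens inside `try_allocate`, not inside `release`: freeing a reference costs nothing, and space is reclaimed only when a new image actually needs it.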

6. The scheduler hook: `_try_schedule_encoder_inputs`

vllm/v1/core/sched/scheduler.py:1096. Called for both running (:410) and waiting (:679) requests. The docstring lists the four conditions (overlap with the computed window; not already cached; encoder compute budget; encoder cache space). The mechanism to study is the fallback: when an encoder input fails a check, the function truncates num_new_tokens so the chunk ends just before the placeholder — the request still makes progress on text, and the image waits for a step with budget. Consequences worth saying out loud:

Encoder work rides the same step as the decoder chunk that first overlaps the image — there is no separate "encoder phase" (contrast Phase 15's encode-disaggregated serving, where there is).
The per-step encoder_compute_budget bounds step-time inflation; the cache-space check prevents an admission deadlock (an image that can never fit is rejected at the front door, and compute_mm_encoder_budget sizing guarantees the worst case fits).
On allocation (:524/:810), the manager records the request as a referent; on request finish, free (:939) just de-references — the embeddings linger, freeable, for reuse.

7. Contrast case: Qwen2-VL dynamic resolution

vllm/model_executor/models/qwen2_vl.py. Versus LLaVA's fixed 576: token count is a function of the actual image (grid_thw — patches per height/width/time), so get_num_image_tokens does real arithmetic, video adds a time dimension, and M-RoPE (multimodal rotary position encoding — text positions and 2-D image positions interleaved) replaces vanilla RoPE. You don't need every detail; you need to recognize which parts of the Phase-13 machinery flex (token counting, dummy-input profiling, position encoding) and which don't (placeholder contract, encoder cache, scheduler hook — identical).

Reading checklist

PlaceholderRange — what is is_embed for? Find the Pixtral line that makes it necessary (llava.py:390).
_get_prompt_updates in llava.py — where exactly does 1 token become N?
_merge_multimodal_embeddings — what's the invariant, and what error message do you get when it breaks?
EncoderCacheManager.check_and_update_cache / can_allocate — walk a second request arriving with the same image: which dict/list transitions happen?
_try_schedule_encoder_inputs — all four scheduling conditions, and what happens to num_new_tokens when one fails?
In scheduler.py:901, how do workers learn an encoder entry was evicted?

Now build it: 02-mini-build.md, then the labs.

Phase 13 — Mini-Build: a fake-image pipeline for `mini_vllm`

Your task

Teach mini_vllm to serve a request that carries a fake "image": expand a placeholder into N synthetic image tokens, run a toy encoder (deterministic function of the image bytes), splice the embeddings, and cache encoder outputs by content hash so the same image is never encoded twice.

Why build it (and not just read it)

The spec

Request extension: Request may carry images: list[bytes] and a prompt containing the marker token <IMG>. A processing step expands each marker to num_image_tokens(image) placeholder token IDs and records PlaceholderRange(offset, length) — your own tiny dataclass. Make num_image_tokens = (len(image_bytes) // 64) + 1 so "resolution" varies (the dynamic-resolution lesson in one line).
Toy encoder: encode(image_bytes) -> np.ndarray[length, d], deterministic (seed a RNG from the content hash). Pretend it's expensive: count invocations.
Encoder cache: dict keyed by sha256(image_bytes), with per-request reference sets and an LRU freeable list with a capacity in embeddings — a 40-line EncoderCacheManager mirroring upstream's cached/freeable/freed trio.
The splice: in the (fake) forward, inputs_embeds[is_image_position] = cached_embeddings — assert the count contract and raise the upstream-style "X multimodal tokens to Y placeholders" error on mismatch.
Scheduler touch: image tokens must pass through your Phase-3 scheduler as ordinary tokens (KV blocks allocated, token budget consumed). If you did lab-03, optionally bolt on the per-step encoder budget + truncate-at-the-doorstep rule.
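One possible shape for the toy encoder in the spec, offered as a starting sketch and not as the reference solution. `ToyEncoder`, `D_MODEL`, and the choice to seed from the first 8 bytes of the sha256 digest are assumptions; only the `num_image_tokens` rule comes from the spec above.

```python
import hashlib
import numpy as np

D_MODEL = 8  # assumed toy hidden size

def num_image_tokens(image_bytes: bytes) -> int:
    # The spec's rule: "resolution" grows with byte length.
    return len(image_bytes) // 64 + 1

class ToyEncoder:
    def __init__(self) -> None:
        self.calls = 0  # pretend each call is expensive, so count them

    def encode(self, image_bytes: bytes) -> np.ndarray:
        """Deterministic embeddings: the RNG seed is derived from the content hash."""
        self.calls += 1
        digest = hashlib.sha256(image_bytes).digest()
        seed = int.from_bytes(digest[:8], "big")
        rng = np.random.default_rng(seed)
        shape = (num_image_tokens(image_bytes), D_MODEL)
        return rng.standard_normal(shape).astype(np.float32)

encoder = ToyEncoder()
a1 = encoder.encode(b"\x00" * 100)
a2 = encoder.encode(b"\x00" * 100)   # same bytes: same embeddings
b = encoder.encode(b"\x01" * 100)    # same length, different bytes

print(a1.shape)                 # (2, 8)
print(np.array_equal(a1, a2))   # True
print(np.array_equal(a1, b))    # False
print(encoder.calls)            # 3: no cache yet, so every call pays
```

The call counter reads 3 here because nothing is cached yet; the cache-sharing test in the definition of done is what should bring the same-bytes case down to a single invocation.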

Method

Re-read encoder_cache_manager.py:17 (docstring is the design doc) and models/utils.py:456 (the splice).
Build processor → encoder → cache → splice in that order; test each before the next.
pytest mini_vllm -q and keep it green.

Definition of done

CPU only, numpy only.
A test proves expansion arithmetic: a prompt with 2 images of different sizes yields correct total length and two correct PlaceholderRanges.
A test proves cache sharing: two requests, same image bytes → encoder invoked once; different bytes (same length!) → invoked twice. This is the content-hash lesson.
A test proves the contract: corrupt the expansion count and assert the mismatch error fires.
A test proves eviction: capacity for one image's embeddings; finish request A, admit B with a new image → A's entry evicted (and its hash reported freed), not B rejected.
You can say out loud where yours simplifies: no real ViT, no projector dim-matching, no is_embed masks, no chunked-prefill interaction unless you added it.

Map back to the real engine

Yours	Upstream
marker expansion + range	`_get_prompt_updates` (`llava.py:264`) + `PlaceholderRange` (`inputs.py:119`)
`sha256(image_bytes)`	`MultiModalHasher.hash_kwargs` (`hasher.py:154`)
cache dict + refs + LRU	`EncoderCacheManager` (`encoder_cache_manager.py:17`)
splice + count assert	`_merge_multimodal_embeddings` (`models/utils.py:456`)
encoder budget rule (optional)	`_try_schedule_encoder_inputs` (`scheduler.py:1096`)

Phase 13 Labs — Multimodal Models

Three labs on the trick that lets a text engine see: translate pictures into the core's one currency — tokens — at the boundary, and keep Phases 1–3 untouched. The arc: build the expansion that turns pixels into sequence length (lab-01), referee the collision between chunked prefill and the can't-encode-half-a-picture vision tower (lab-03), then run a real VLM and reconcile every number — the 1,421-token "one-line" prompt, the quadratic resize law, the encoder's TTFT spike (lab-02).

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-13-multimodal-models/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-13-multimodal-models/labs/lab-01-image-token-expansion -q

Labs

lab-01-image-token-expansion `[CPU-OK]`

Pixels → patches → tokens → blocks: the ViT patch arithmetic (with its double-ceiling traps), the placeholder splice that rewrites the prompt, and the PlaceholderRange bookkeeping everything downstream navigates by. The punchline test: a "20-token prompt" with one image is a 595-token request needing 38 KV blocks. Skills: the quadratic resolution law; containment-by-translation as architecture; multi-image offset shifting; validating counts at the boundary.

lab-02-run-a-vlm `[GPU-OPT]`

Qwen2-VL-2B on a real photo: the ~30-token prompt arriving as 1,421 tokens, the ~4× drop on halving resolution (predicted first, measured second), and the 41 → 118 ms TTFT gap that is the vision encoder on a clock. Plus the operational surfaces: resize policy as the cheapest capacity lever, limit_mm_per_prompt, and the three-cache stack (processor / encoder / prefix). Annotated capture included. Skills: auditing a processor's decisions; segmenting TTFT by has-image; the quality cliff in resize tuning.

lab-03-encoder-scheduling `[CPU-OK]`

The collision: chunked prefill slices anywhere, but you can't encode half a picture. Implement V1's answer — per-step encoder budget, all-or-nothing encodes scheduled when a chunk enters a placeholder, truncate-at-the-doorstep when unaffordable, and the encoder cache that restores mid-placeholder freedom. Seven scenarios from pure-text to the zero-budget starvation edge. Skills: a third resource ledger; the cache-at-the-granularity-boundary pattern; why VLM prefills stall one token before their image.

What you can do after this phase

Price an image (or a video) in tokens, blocks, and TTFT before deploying it; predict and explain VLM capacity from the traffic's image-size distribution; tune the resize policy, encoder budget, and per-prompt limits with eyes open; and read vllm/multimodal/ plus the V1 encoder-scheduling path as machinery you've already built small. Phase 14 goes inside the models themselves — including how a vision tower bolts onto a language model in the first place.

Lab 13-01 — Image-Token Expansion: Where Pictures Become Sequence Length `[CPU-OK]`

Here is the entire secret of multimodal serving, and it fits in one sentence: to the engine, an image is just tokens. The user's prompt says <image>; the processor replaces that single placeholder with N placeholder positions (144 for a 336×336 LLaVA-style image, 576+ for high-res); the vision encoder's embeddings will occupy those positions; and from that moment every subsystem you've built in this course — the scheduler's token budget (Phase 3), KV blocks (Phase 2), TTFT arithmetic (Phase 1) — treats them like any other tokens. This lab implements the expansion: the patch arithmetic that converts pixels to a token count, the splice that rewrites the prompt, and the PlaceholderRange bookkeeping that remembers where each image lives — the exact data structure upstream uses.

Why this lab exists

Multimodal capacity surprises kill deployments. A chat service adds image support; the prompt text barely grew, yet TTFT triples and concurrency halves — because every image silently added hundreds of tokens that nobody counted. The arithmetic in this lab is the inoculation: image_token_count tells you what a resolution costs, test_resolution_is_quadratic_cost makes the scaling law visceral (double the sides, 4× the bill), and test_the_scheduler_sees_only_the_expanded_length does the capacity-planning punchline — a "20-token prompt" with one image is a 595-token request needing 38 KV blocks instead of 2. Run your traffic's image-size distribution through these three functions before you ship a VLM, and Phase 0 lab-02's concurrency math stays honest.

The deeper design point: expansion is how multimodality gets contained. The engine's core (scheduler, KV manager, attention) never learns what an image is — it sees a longer token sequence plus an opaque side-channel (the embeddings, delivered by lab-03's encoder scheduling). That containment is why vLLM could add vision, audio, and video without rewriting Phases 1–3, and it's the architectural pattern to copy: translate the exotic thing into the core's existing currency at the boundary.

Background: pixels → patches → tokens → blocks

The pipeline, stage by stage:

Pixels → patches: ViT encoders slice the image into patch × patch pixel tiles (14 px is the common size) — ceil(side / patch) per dimension. A 336×336 image: 24×24 = 576 patches.
Patches → tokens: many modern VLMs (Qwen-VL family and others) then merge merge × merge neighborhoods (pixel-unshuffle / spatial merge) to shrink the sequence: 24×24 → 12×12 = 144 tokens. Both divisions ceil — odd sizes round up at each stage, and test_patch_arithmetic's 337-pixel case pins the double-ceiling (a classic off-by-one source when re-implementing processors).
Tokens → the sequence: each <image> occurrence in the tokenized prompt is replaced by its image's count of sentinel ids, and a PlaceholderRange(offset, length) records the span — the coordinates lab-03's encoder scheduling and the model runner's embedding-scatter both navigate by. Multi-image prompts produce ordered, disjoint ranges whose offsets shift by earlier expansions (test_multi_image_ranges_are_ordered_and_disjoint pins the shift).
Sequence → blocks: Phase 2's ceil-div, unchanged — image KV is KV.
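The four stages can be sketched in a few lines. The function names match the files list, but the signatures, the marker id -1, and the representation of ranges as (offset, length) tuples are assumptions, not the lab's reference. The 595-token figure is reproduced here by one plausible reading: a 576-token image (336×336 with no merge) replacing one marker in a 20-token prompt.

```python
def ceil_div(a: int, b: int) -> int:
    return -(-a // b)

def image_token_count(width: int, height: int, patch: int = 14, merge: int = 2) -> int:
    """Pixels -> patches -> tokens, rounding up at both stages."""
    cols = ceil_div(ceil_div(width, patch), merge)
    rows = ceil_div(ceil_div(height, patch), merge)
    return cols * rows

SENTINEL = -100      # placeholder id written into the expanded prompt
IMAGE_MARKER = -1    # assumed id for the <image> marker in the tokenized prompt

def expand_prompt(token_ids: list[int],
                  counts: list[int]) -> tuple[list[int], list[tuple[int, int]]]:
    """Replace each marker with its image's sentinels; record (offset, length)."""
    if token_ids.count(IMAGE_MARKER) != len(counts):
        raise ValueError("number of image markers and images differ")
    out: list[int] = []
    ranges: list[tuple[int, int]] = []
    images = iter(counts)
    for tok in token_ids:
        if tok == IMAGE_MARKER:
            n = next(images)
            ranges.append((len(out), n))   # offset already includes earlier expansions
            out.extend([SENTINEL] * n)
        else:
            out.append(tok)
    return out, ranges

def kv_blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    return ceil_div(num_tokens, block_size)

print(image_token_count(336, 336))   # 144
print(image_token_count(337, 337))   # 169: one extra pixel, rounded up twice

prompt = [1] * 10 + [IMAGE_MARKER] + [2] * 9          # a "20-token prompt"
expanded, ranges = expand_prompt(prompt, [image_token_count(336, 336, merge=1)])
print(len(expanded), ranges)                          # 595 [(10, 576)]
print(kv_blocks_needed(len(prompt)), kv_blocks_needed(len(expanded)))  # 2 38
```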

Files

starter.py — image_token_count, expand_prompt, kv_blocks_needed. Your work.
solution.py — reference.
test_lab.py — the patch arithmetic (with the ceiling traps), quadratic scaling, the splice, multi-image ranges, the count-mismatch assert, and the capacity punchline.

Run

LAB_IMPL=starter pytest phase-13-multimodal-models/labs/lab-01-image-token-expansion -q
pytest phase-13-multimodal-models/labs/lab-01-image-token-expansion -q   # reference

What the tests prove

Test	What it pins
`test_patch_arithmetic`	336×336 → 144 (the LLaVA-ish number you'll see in lab-02's capture), the 1-token thumbnail floor, and ceiling-at-both-stages
`test_resolution_is_quadratic_cost`	2× resolution = 4× tokens — why high-res modes are a capacity feature, not just a quality feature, and why production VLM configs cap image size
`test_expansion_splices_in_place`	The rewrite, exactly: text before, N sentinels, text after
`test_multi_image_ranges_are_ordered_and_disjoint`	Range offsets account for earlier expansions; every placeholder position holds the sentinel. The bookkeeping the embedding scatter trusts
`test_mismatched_counts_assert`	N placeholders demand N images — the validation that turns a garbled request into a clean 400 error instead of a runtime tensor-shape crash three layers deep
`test_the_scheduler_sees_only_the_expanded_length`	20 "text" tokens + 1 image = 595 tokens, 38 blocks. The line item your capacity model was missing

Hitchhiker's notes

Where this lives upstream: the per-model processor (upstream/vllm/model_executor/models/<model>.py + vllm/multimodal/processing.py) performs exactly this expansion at request-arrival time, emitting PlaceholderRanges (vllm/multimodal/inputs.py — same fields as yours). The count formula is model-specific (this is most of what differs between LLaVA, Qwen-VL, Pixtral processors); the splice machinery is shared.
The sentinel id never reaches the embedding table. At runtime the model runner computes text embeddings for real ids and scatters the encoder's output over the placeholder positions (get_input_embeddings with the ranges as the map). Your -100 plays the role of upstream's per-model placeholder id (image_token_id in the LLaVA processor), chosen, like all sentinels, to be un-confusable with a real text token (Phase 2's null block, Phase 9's -1 EOS: the course's sentinel family grows).
Prefix caching works for images — with one amendment you can now predict (Phase 2 lab-05's "anything that changes what KV means"): the block hash must include the image content hash, or two prompts with identical text but different pictures would share KV. vLLM hashes the multimodal items into the chain; same-image-same-text re-requests (retries, multi-turn over one photo) hit cache like any system prompt.
Variable-resolution schemes (dynamic tiling à la InternVL/GPT-4V's "high-res crops") are this lab's formula applied per tile plus a global thumbnail — the token count becomes data-dependent, which is exactly why upstream processors compute counts from actual image dimensions instead of constants, and why your capacity model must use the traffic's real size distribution.

Going further

Add aspect_preserving_resize(w, h, max_side) → new dims, then recompute the token bill — reproducing the resize-then-patch pipeline real processors run, and the knob (max_side) that trades quality for capacity.
Implement the embedding scatter: given text_emb (seq, d), image_emb (n, d), and a PlaceholderRange, produce the merged input — ~3 lines with numpy slicing, and you've written the runtime half of this lab's compile-time work.
Compute the KV-bytes per image (144 tokens × Phase 0 lab-02's per-token bytes) for a 7B model, then for video at 1 fps × 60 s. The result explains why video models lean so hard on token merging and why "just feed it the video" is a memory proposal, not a feature request.

References

upstream/vllm/multimodal/inputs.py — PlaceholderRange, the real one.
upstream/vllm/multimodal/processing.py — the expansion machinery (PromptReplacement and friends).
Liu et al., Visual Instruction Tuning (LLaVA, 2023) — the projector-into-the-token-stream design this lab models: https://arxiv.org/abs/2304.08485
Qwen team, Qwen2-VL (2024) — the 2×2 spatial merge and dynamic resolution: https://arxiv.org/abs/2409.12191
Lab-03 — who fills the placeholders, and what it costs the scheduler.

Lab 13-02 — Run a VLM and Count Its Image Tokens `[GPU-OPT]`

The CPU labs predicted two numbers: how many tokens an image becomes (lab-01's patch arithmetic) and what scheduling them costs (lab-03's encoder budget). This lab runs a real vision-language model — Qwen2-VL-2B — on a real image and checks both: the prompt that tokenized to ~30 text tokens arrives at the scheduler as ~1,400 tokens (one high-res photo), the encoder's execution shows up as a prefill-time spike, and the model then answers questions about pixels it turned into KV like any other context.

No GPU? Don't panic. The captured run below is annotated against both CPU labs; the counting exercises are the lab.

Why this lab exists

Every multimodal capacity incident starts with somebody not knowing their images' token bill, and the cure is having once watched the bill get charged: prompt in, expanded length in the logs, KV usage jumping by hundreds of blocks per picture. This lab is that watching — plus the operational surfaces unique to VLMs that text-only operators haven't met: the processor's resize decisions (the same photo costs different tokens at different max_pixels settings), the limit_mm_per_prompt guard, and the prefill-time encoder spike that no text-only latency model predicts.

Requirements

uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2-VL-2B-Instruct   # small, modern, dynamic-resolution
# any test image; a ~1280x960 photo makes the arithmetic vivid

Steps

from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="Qwen/Qwen2-VL-2B-Instruct", gpu_memory_utilization=0.7,
          max_model_len=4096, limit_mm_per_prompt={"image": 2})

image = Image.open("photo.jpg")
prompt = ("<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
          "Describe this image in one sentence.<|im_end|>\n<|im_start|>assistant\n")

out = llm.generate({"prompt": prompt, "multi_modal_data": {"image": image}},
                   SamplingParams(max_tokens=48, temperature=0))
print(out[0].outputs[0].text)
print("prompt tokens:", len(out[0].prompt_token_ids))   # the EXPANDED length

Then the three experiments: re-run with the image resized to half each side (predict the token drop with lab-01's formula first — expect ~4×); send two images and watch both placeholder expansions land; and run a text-only prompt through the same engine to baseline the TTFT difference (the encoder's share, lab-03's budget made visible).

Captured output (real run, Qwen2-VL-2B-Instruct, L4, vLLM 0.22.1, trimmed)

INFO ... Using Flash Attention backend.
prompt tokens: 1421                      # ~30 text tokens + ~1391 image tokens
'A golden retriever sits on a wooden dock beside a calm lake at sunset.'
# same photo, resized to half each side:
prompt tokens: 378                       # ~30 text + ~348 image (~4x fewer, as predicted)
# text-only TTFT: 41 ms ; with the full-res image: 118 ms
#   (the gap = vision encoder + the bigger prefill — lab-03's encoder cost, on a clock)

Reading the numbers

1,421 tokens for a "one-line" prompt: lab-01's punchline on real silicon. Check the arithmetic: Qwen2-VL at native ~1280×960 → 28-px effective patches after the 2×2 merge → ⌈1280/28⌉×⌈960/28⌉ = 46×35 = 1,610 image tokens, before the processor's max_pixels resize trims it to ~1,391. Your prediction landing within ~15% of the log (resize policy explains the gap) is the pass condition.
378 after halving — the quadratic law, confirmed: ~4× fewer image tokens. The cheapest capacity lever in multimodal serving is the resize policy (min_pixels/max_pixels in the processor config), and it's set per deployment, not per model.
TTFT 41 → 118 ms — the encoder runs at prefill time (lab-03: scheduled with the chunk that enters the placeholder), so images tax time-to-first-token specifically; decode speed afterward is untouched (the image is now just KV). Text-only latency dashboards miss this entirely — segment TTFT by has-image.
KV math: 1,421 tokens ≈ 89 blocks at block_size 16, so one photo holds the cache footprint of ~45 short text exchanges. limit_mm_per_prompt is the admission-control guard against the user who attaches twelve screenshots.
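The numbers above can be re-derived in a few lines before touching a GPU. This sketch assumes the ~1280×960 photo and the 28-px effective patch from the text, and treats the logged values (1,391 and 348 image tokens) as approximate; the resize policy itself is not modelled.

```python
def ceil_div(a: int, b: int) -> int:
    return -(-a // b)

EFFECTIVE_PATCH = 28   # 14-px patches after the 2x2 merge, as stated above
BLOCK_SIZE = 16

def predicted_image_tokens(width: int, height: int) -> int:
    return ceil_div(width, EFFECTIVE_PATCH) * ceil_div(height, EFFECTIVE_PATCH)

full = predicted_image_tokens(1280, 960)
half = predicted_image_tokens(640, 480)

print(full)                          # 1610 predicted, vs ~1391 in the log
print(half)                          # 414 predicted, vs ~348 in the log
print(round(full / half, 2))         # 3.89: the predicted drop on halving each side
print(round(1391 / 348, 2))          # 4.0: the drop the capture reports
print(ceil_div(1421, BLOCK_SIZE))    # 89 KV blocks for the full-res request
print(ceil_div(378, BLOCK_SIZE))     # 24 KV blocks after halving
```

The predicted ratio sits slightly under 4 only because of ceiling effects at the smaller size; the quadratic law is the first-order story.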

Hitchhiker's notes

Prompt format is model-specific and unforgiving — Qwen's <|vision_start|><|image_pad|><|vision_end|>, LLaVA's <image>, Pixtral's [IMG]: the processor knows the convention; the OpenAI-compatible server's image_url content blocks hide it from clients (Phase 16). When raw-prompting a VLM, a wrong placeholder doesn't error — the model just never sees the image and hallucinates cheerfully. The test_mismatched_counts_assert validation from lab-01 is what stands between you and that silence.
The processor cache: image preprocessing (resize, normalize, patchify) is CPU-side and non-trivial; vLLM caches processed inputs by content hash (mm_processor_cache_gb), so repeated images (multi-turn over one photo, retries) skip it. Distinct from lab-03's encoder cache (GPU, within-request) — two caches, two lifetimes, and a prefix-cache third (Phase 2) whose block hashes fold in the image hash. Multimodal is a cache stack.
Resolution policy is a quality/capacity dial with a cliff: too aggressive a max_pixels and OCR/chart tasks degrade sharply (small text needs pixels). Tune it against your actual task mix with Phase 6 lab-02's eval discipline — "the description still looked fine" is not a measurement.
Video is this lab times frames: a 1 fps minute is ~60 images through the same pipeline (with temporal merging fighting the bill). The arithmetic you validated here is why video context windows are the current frontier of memory engineering.

Reflect

Reconcile all three labs in one trace: the processor expanded (lab-01), the scheduler budgeted the encode with the chunk that entered the range (lab-03), the runner scattered embeddings over the placeholders, and decode proceeded over ordinary KV. Which of the four steps recur per step, and which per request? (Per-request: expansion + encode; per-step: scheduling + scatter of the relevant slice. The amortization is the design.)
Your VLM fleet's p99 TTFT doubled after a client started sending 4K screenshots. Three knobs, in the order you'd reach for them? (max_pixels resize policy — quality-checked; encoder budget / disable_chunked_mm_input tuning for interference; limit_mm_per_prompt + input validation as the guardrail.)
Why does the engine charge image tokens against max_model_len rather than tracking images separately? (Containment — lab-01's lesson: one currency keeps every Phase 1–3 invariant true for free. A separate ledger would re-litigate admission, blocks, and budgets per modality.)

References

upstream/vllm/model_executor/models/qwen2_vl.py — the processor whose decisions you just audited (find the merge factor and the pixel limits).
upstream/vllm/multimodal/ — registry, processor cache, input plumbing.
vLLM docs, Multimodal Inputs — the API surface and per-model conventions: https://docs.vllm.ai/en/latest/features/multimodal_inputs.html
Qwen team, Qwen2-VL (2024) — dynamic resolution and the 2×2 merge: https://arxiv.org/abs/2409.12191
Labs 01 and 03 — the two predictions this run validates.

Lab 13-03 — Encoder Scheduling: Chunked Prefill Meets the Vision Tower `[CPU-OK]`

Phase 3's chunked prefill rests on a freedom you probably never noticed it claiming: a prompt can be sliced anywhere. Multimodal revokes it. The positions inside a placeholder range (lab-01) get their embeddings from the vision encoder, and you cannot encode half a picture — the ViT runs on the whole image or not at all. So when a prefill chunk first reaches into an image's range, the engine faces a real scheduling decision: run the encoder this step (it costs real compute, governed by a per-step encoder budget), or truncate the chunk at the image's doorstep and try again next step. You'll implement that decision — vLLM V1's _try_schedule_encoder_inputs, distilled — including the piece that restores chunked prefill's freedom: the encoder cache, which lets later chunks continue mid-placeholder for free.

Why this lab exists

This is the lab where two phases collide and you get to be the referee. Phase 3 taught that chunk boundaries are arbitrary (the clamp doesn't care what token it stops at); lab-01 taught that some positions are image positions. The collision produces every behavior in this lab's test suite, and each one is a production symptom with a name: a VLM request whose prefill mysteriously stalls one token before its image (encoder budget exhausted — test_unaffordable_image_truncates_the_chunk), a step that schedules a 100-token encode for a chunk consuming only 40 image positions (test_entering_an_image_schedules_its_encoder — the encoder is all-or-nothing even when the decoder is incremental), a multi-image prompt that prefills image A this step and stops dead before image B (test_budget_splits_across_two_images).

The design lesson is the one the course keeps circling: vLLM did not forbid chunk boundaries inside images (which would couple the text scheduler to image geometry). It added a cache between the two engines — encode once, whole; consume incrementally, cached — so each side keeps its natural granularity. When two subsystems disagree about granularity, a cache at the boundary is usually the answer; this lab is the cleanest instance you'll ever implement.

Background: three resources now, not two

Phase 3's scheduler balanced the token budget and KV memory. Multimodal adds a third ledger, with its own units and its own cache:

Encoder budget (per step, in encoder tokens): the vision tower is real compute outside the LM's token budget — a step that encodes a 576-token image while also prefilling text is doing two models' work. Capping encoder work per step protects ITL exactly the way the token budget does (Phase 3 lab-05's argument, new actor).
Encoder cache (in encoder tokens of storage): outputs wait here between the encode and the chunks that consume them — and entries are freed once fully consumed. It's a third memory pool alongside KV blocks and LoRA slots (Phase 11 lab-04), with the same admission-pressure character.

The rule your plan_chunk implements, per placeholder the chunk would enter: cached → free; affordable → schedule the whole encode now (even for partial consumption); unaffordable → truncate the chunk to the placeholder's offset. And the invariant the truncation preserves is Phase 3's invariant, extended: a position is computed only when everything it needs exists — text positions need prior KV; image positions need their encoding. Same race of counters, one more prerequisite.
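One way the three-way rule might look in code, as a sketch to compare against after attempting the starter. The signature, the `ImageSpan` dataclass, and the return shape are assumptions and may not match starter.py; the encoder-cache space check is deliberately left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ImageSpan:
    offset: int   # first placeholder position in the prompt
    length: int   # number of placeholder positions (= encoder tokens)

def plan_chunk(start: int, max_tokens: int, prompt_len: int,
               images: list[ImageSpan], cached: set[int],
               encoder_budget: int) -> tuple[int, list[int]]:
    """Return (num_new_tokens, indices of images to encode this step)."""
    end = min(start + max_tokens, prompt_len)
    to_encode: list[int] = []
    for i, img in enumerate(images):           # images assumed sorted by offset
        overlaps = img.offset < end and start < img.offset + img.length
        if not overlaps or i in cached:
            continue                            # untouched, or already encoded: free
        if img.length <= encoder_budget:
            encoder_budget -= img.length        # all-or-nothing: pay for the whole image
            to_encode.append(i)
        else:
            end = img.offset                    # unaffordable: stop at the doorstep
            break
    return max(end - start, 0), to_encode

# text(50) | image A(100) | text(50) | image B(100) | text(50)
imgs = [ImageSpan(50, 100), ImageSpan(200, 100)]

print(plan_chunk(0, 90, 350, imgs, set(), 100))   # (90, [0]): 40 positions touched, full encode
print(plan_chunk(0, 90, 350, imgs, set(), 50))    # (50, []): truncated at A's offset
print(plan_chunk(90, 60, 350, imgs, {0}, 0))      # (60, []): mid-placeholder, cached, no budget
print(plan_chunk(0, 350, 350, imgs, set(), 100))  # (200, [0]): A scheduled, B deferred
```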

Files

starter.py — plan_chunk with the full rules in the docstring. Your work.
solution.py — reference (~25 lines; the thinking is in the cases).
test_lab.py — seven scenarios over a text/image/text/image/text prompt, from pure-text freedom to the zero-budget starvation edge.

Run

LAB_IMPL=starter pytest phase-13-multimodal-models/labs/lab-03-encoder-scheduling -q
pytest phase-13-multimodal-models/labs/lab-03-encoder-scheduling -q   # reference

What the tests prove

Test	What it pins
`test_pure_text_chunk_is_unconstrained`	Phase 3 behavior survives where no image is touched
`test_entering_an_image_schedules_its_encoder`	All-or-nothing encoding: touching 40 of 100 image positions schedules the full 100-token encode — the granularity mismatch, faced
`test_unaffordable_image_truncates_the_chunk`	The doorstep rule: chunk ends at `offset`, encoder runs empty. The mysterious stall, explained
`test_cached_image_costs_nothing` / `test_continuation_mid_placeholder_needs_no_new_encode`	The cache restores chunk freedom: mid-placeholder continuation with zero encoder budget — the design's whole payoff
`test_budget_splits_across_two_images`	Budget as a per-step ledger across images: A scheduled, B deferred, chunk truncated between them; enough budget → one-step prefill, two encodes
`test_progress_is_always_possible`	The honest edge: image-at-position-0 with zero budget yields a 0-token chunk — progress waits for a step with budget. (Per-step budgets reset, so this is a delay, not a deadlock — but a scheduler that forgot to give encoder budget would starve VLM requests forever; the test documents the dependency)

Hitchhiker's notes

Map to upstream: Scheduler._try_schedule_encoder_inputs in upstream/vllm/v1/core/sched/scheduler.py — your function with the encoder-cache space check added (the cache has finite storage; an encode can also be deferred because its output wouldn't fit), and encoder_budget flowing from scheduler_config. The encoder cache itself: vllm/v1/core/encoder_cache_manager.py — allocation, reference, and free-on-consumption; recognizably a tiny sibling of Phase 2's machinery.
Why encode-whole-but-consume-partial is safe: the encoder is not autoregressive — its output for an image is a pure function of pixels, independent of the text around it. That's what makes caching trivially correct (no chained hashes needed — contrast Phase 2 lab-05's ancestry chains) and what makes the all-or-nothing constraint tolerable: you never re-encode, ever, within a request.
Where the embeddings actually flow: encoder output → encoder cache → the model runner gathers the scheduled slice of cached embeddings each step and scatters them over lab-01's placeholder positions (get_input_embeddings). The PlaceholderRange is the shared coordinate system of all three labs — compile-time (lab-01), schedule-time (this lab), runtime (the scatter).
Capacity interaction worth knowing: encoder budget and token budget compete for the same wall-clock step. A VLM fleet tuned with Phase 3 lab-05's threshold analysis but ignoring encoder spikes still gets ITL spikes — from the vision tower. vLLM's disable_chunked_mm_input and encoder-budget knobs exist for exactly this tuning; you now know what they gate.
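The gather-and-scatter described in the notes above reduces to an interval intersection. The sketch below uses NumPy in place of the runner's tensors; `scatter_image_slice` and its parameters are hypothetical names, and it assumes every position in the placeholder receives an embedding (so it ignores the is_embed case).

```python
import numpy as np


def scatter_image_slice(chunk_embeds, chunk_start, ph_offset, image_embeds):
    """Overwrite the rows of one chunk that fall inside one image placeholder.

    Returns how many rows were overwritten.
    """
    chunk_end = chunk_start + chunk_embeds.shape[0]
    lo = max(chunk_start, ph_offset)
    hi = min(chunk_end, ph_offset + image_embeds.shape[0])
    if lo >= hi:
        return 0  # this chunk does not touch the placeholder
    # Gather the matching slice of the cached encoding, scatter it into the chunk.
    chunk_embeds[lo - chunk_start:hi - chunk_start] = image_embeds[lo - ph_offset:hi - ph_offset]
    return hi - lo


hidden = 4
# Cached encoder output for a 100-position image; row i is filled with i + 1.
image = np.arange(1, 101, dtype=np.float32)[:, None] * np.ones((1, hidden), dtype=np.float32)
chunk = np.zeros((6, hidden), dtype=np.float32)   # text embeddings for positions 8..13
written = scatter_image_slice(chunk, chunk_start=8, ph_offset=10, image_embeds=image)
```

Here the chunk covers positions 8 to 13 and the image starts at 10, so the first two rows stay text and the last four receive image rows 0 to 3.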

Going further

Add the encoder-cache space dimension: plan_chunk also receives cache_free_tokens, and an encode needs both budget and space; consumed entries free space for later steps. You've now matched upstream's full predicate — and created the three-pool admission dance (KV + encoder cache + budget) that real VLM scheduling is.
Simulate a step sequence: one request, the lab's two-image prompt, budget 150/step — emit the chunk plan per step until prefill completes. The trace (where chunks stall, when encodes fire, when the cache carries) is Phase 1 lab-04's probe, multimodal edition.
Model the ITL spike from an encode (Phase 3 lab-05's method): give encoder tokens a cost weight and plot a decode stream's step costs when a VLM prefill with a 576-token image lands beside it, with and without an encoder budget. The conclusion writes the config recommendation.

References

upstream/vllm/v1/core/sched/scheduler.py — _try_schedule_encoder_inputs: this lab, in production (with the cache-space check).
upstream/vllm/v1/core/encoder_cache_manager.py — the third memory pool.
vLLM blog, vLLM V1 — the encoder-cache design rationale: https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
Phase 3 labs 01/02/05 — the chunking machinery this lab constrains; lab-01 — the ranges it navigates by.

Phase 13 — Exercises: Multimodal Models

Warm-up (explain)

In one breath: how does a decoder-only LLM "see" an image, and which engine components (Phases 0–11) need zero changes for it?
Why does vLLM keep an encoder cache separate from the KV cache? Name two ways their currencies and lifetimes differ.
Why can't the prefix cache key placeholder-covering blocks by token IDs alone?

Solution sketches

Vision encoder → projector → embeddings overwrite placeholder positions in inputs_embeds; from layer 1 on it's text-indistinguishable. Unchanged: paged KV, attention backends, sampler, batching — everything past the embedding layer.
KV cache: per-(request-prefix) layered K/V in fixed blocks, grows every step, freed at request end. Encoder cache: per-content (mm_hash) embeddings, measured in embedding slots not blocks, written once per image, shared across requests, LRU-evicted when unreferenced. Different key (position-prefix vs content), different unit, different lifecycle.
Every image expands to the same repeated dummy token ID — token-ID hashing would alias different pictures and serve one user's image context to another. The image's content hash (MultiModalHasher) must be folded into those block hashes.
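The aliasing in the last answer is easy to reproduce. The sketch below is a toy chained block hash, not vLLM's hash function: `block_hash`, the `extra_keys` argument, and the token ID are hypothetical, chosen only to show what folding a content hash into the block hash buys.

```python
import hashlib


def block_hash(parent_hash, token_ids, extra_keys=()):
    """Chain-hash one block; extra_keys carries content hashes of images it covers."""
    payload = repr((parent_hash, tuple(token_ids), tuple(extra_keys)))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


IMAGE_TOKEN_ID = 32000                 # hypothetical placeholder token ID
block = [IMAGE_TOKEN_ID] * 16          # a block lying entirely inside an image

# Token IDs alone: two different pictures produce the same block hash.
aliased = block_hash(None, block) == block_hash(None, block)
# With the image's content hash folded in, the blocks are told apart.
distinct = block_hash(None, block, ["hash-of-cat"]) != block_hash(None, block, ["hash-of-dog"])
```

Both `aliased` and `distinct` come out true: the token IDs carry no information about the picture, so the content hash is the only thing that can separate the two blocks.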

Core (trace the code)

_get_prompt_updates (llava.py:264): where does the expansion count come from, and why does Pixtral (:390) need PromptUpdateDetails.select_token_id / is_embed?
Walk EncoderCacheManager.check_and_update_cache (:91) for a request whose image is cached but currently unreferenced. Which structures change, and what is the Phase-2 analogue of this transition?
_try_schedule_encoder_inputs (scheduler.py:1096): an image's placeholder starts at token 5000, the request has computed 2000 tokens, and this step's chunk is 2048. What happens to the image, and to num_new_tokens?
The scheduler's manager tracks hashes but the runner holds tensors. Trace how an eviction decided by the scheduler reaches the worker (get_freed_mm_hashes → scheduler.py:901 → runner).

Solution sketches

From ProcessingInfo.get_num_image_tokens → the vision encoder's patch math (model config, image size). Pixtral interleaves [image_break_id] after each patch row, so not every position in the range receives an embedding — is_embed marks which do.
The hash is popped from freeable (it was an eviction candidate), its embed count is subtracted from num_freeable_slots, and the request ID joins cached[mm_hash]. Phase-2 analogue: BlockPool.touch — resurrecting a cached block from the free queue by bumping ref_cnt 0→1.
The chunk window [2000, 4048) doesn't reach offset 5000 → the image is not scheduled, and num_new_tokens is untouched (truncation only happens when the window overlaps an image that fails budget/cache checks). Next steps advance the window; the step whose window first overlaps 5000 must schedule (or truncate at) it.
Manager appends evicted hashes to freed; each step get_freed_mm_hashes() drains the list into SchedulerOutput.free_encoder_mm_hashes; workers delete those keys from their encoder_cache dict. Scheduler owns accounting, workers own memory — the same split as KV blocks.
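The transitions in the second and fourth answers can be replayed on a toy manager. This is a simplified sketch of the accounting idea, not the upstream class: `ToyEncoderCache` and its method names are hypothetical, and it tracks hashes and slot counts only.

```python
from collections import OrderedDict


class ToyEncoderCache:
    """Scheduler-side accounting only: hashes and slot counts, no tensors."""

    def __init__(self, capacity):
        self.num_free_slots = capacity        # slots holding nothing
        self.num_freeable_slots = capacity    # free slots + slots of unreferenced entries
        self.cached = {}                      # mm_hash -> set of request IDs using it
        self.sizes = {}                       # mm_hash -> embedding slots it occupies
        self.freeable = OrderedDict()         # unreferenced mm_hash -> slots, oldest first
        self.freed = []                       # evicted hashes not yet sent to workers

    def check_and_update(self, req_id, mm_hash):
        if mm_hash not in self.cached:
            return False
        if not self.cached[mm_hash]:          # cached but unreferenced: resurrect it
            self.num_freeable_slots -= self.freeable.pop(mm_hash)
        self.cached[mm_hash].add(req_id)
        return True

    def allocate(self, req_id, mm_hash, num_embeds):
        if num_embeds > self.num_freeable_slots:
            return False                      # would not fit even after evicting all idle entries
        while num_embeds > self.num_free_slots:
            victim, slots = self.freeable.popitem(last=False)
            del self.cached[victim], self.sizes[victim]
            self.num_free_slots += slots
            self.freed.append(victim)
        self.num_free_slots -= num_embeds
        self.num_freeable_slots -= num_embeds
        self.cached[mm_hash] = {req_id}
        self.sizes[mm_hash] = num_embeds
        return True

    def release(self, req_id, mm_hash):
        refs = self.cached[mm_hash]
        if req_id not in refs:
            return
        refs.discard(req_id)
        if not refs:                          # last reference gone: now an eviction candidate
            self.freeable[mm_hash] = self.sizes[mm_hash]
            self.num_freeable_slots += self.sizes[mm_hash]

    def drain_freed(self):
        freed, self.freed = self.freed, []
        return freed


cache = ToyEncoderCache(capacity=200)
cache.allocate("r1", "A", 100)
cache.release("r1", "A")                      # A is cached but unreferenced
hit = cache.check_and_update("r2", "A")       # resurrected for r2
cache.release("r2", "A")
cache.allocate("r3", "B", 150)                # needs A's slots, so A is evicted
to_workers = cache.drain_freed()
```

In this run `hit` is true and `to_workers` is `["A"]`. Only the drained list would cross to the workers, which is the accounting-versus-memory split the last answer describes.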

Build (your lab)

In lab-01, compute: at block_size 16, how many KV blocks does one LLaVA image (576 tokens) cost, and what fraction of a 7B model's typical 8 GiB KV budget is 50 cached image-bearing prompts of 1000 tokens each?
Extend your mini-build's cache with a stats() method (hits, misses, evictions, occupancy) and write a test that drives hit-rate from 0% to >80% with a zipfian image distribution. Why is zipfian the realistic assumption?
In lab-03, construct a request where the encoder budget forces the image to wait one step but the cache-space check would have passed. Verify text progress continues. Then flip it: cache full, budget free. What's the user-visible difference?

Solution sketches

576/16 = 36 blocks for the image alone (38 with prompt rounding in the lab's setup). 50 × 1000 tokens ≈ 50 × 63 blocks ≈ 3150 blocks ≈ 25% of an 8 GiB budget at ~16 KiB/ block-token-layer scale — images eat KV budgets fast; exact numbers depend on the model, the point is the order of magnitude.
Real traffic repeats content (logos, screenshots, retried requests, multi-turn with the same image) with a long tail of singletons — zipf models that. Hits come from the head; the tail drives eviction churn.
Both delay the image, not the text (truncate-at-doorstep). Budget-limited: resolves next step deterministically. Cache-limited: resolves only when another request frees embeddings — potentially unbounded wait, which is why worst-case sizing at startup (compute_mm_encoder_budget) must guarantee a single max image always fits.
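The block counts in the first answer are fixed by block_size, but the share of the KV budget swings widely with the model shape. The sketch below makes that dependency explicit; the helper names and the model shape (32 layers, 8 KV heads of dimension 128, fp16) are assumptions for illustration, not the lab's setup.

```python
import math


def blocks_for(num_tokens, block_size=16):
    """KV blocks needed to hold num_tokens, rounding the last block up."""
    return math.ceil(num_tokens / block_size)


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """K and V each store num_kv_heads * head_dim values per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


image_blocks = blocks_for(576)          # one LLaVA image
prompt_blocks = blocks_for(1000)        # one 1000-token image-bearing prompt
total_blocks = 50 * prompt_blocks       # fifty such prompts held in cache

# Assumed model shape (hypothetical): 32 layers, 8 KV heads of dim 128, fp16.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
fraction = total_blocks * 16 * per_token / (8 * 1024**3)
print(image_blocks, prompt_blocks, total_blocks, f"{fraction:.0%}")
```

Changing `num_kv_heads` or `num_layers` moves the fraction by multiples, which is why the answer asks for the order of magnitude and not an exact percentage.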

Design (staff-level)

Your fleet serves Qwen2-VL and users upload phone photos (12 MP). TTFT p99 is 4× worse than the text-only fleet. Walk the path pixels take and name the three biggest contributors + a mitigation for each.
Design multi-tenant fairness for the encoder cache: tenant A uploads thousands of unique images (0% reuse), tenant B reuses a product catalog (90% reuse). What goes wrong with global LRU and what do you change?
Should encoder outputs be prefix-cacheable across engine restarts (disk/remote)? Cost out the trade: embedding sizes vs re-encode time, and the consistency hazard the cache key must absorb.
Video: 1 fps × 60 s × ~hundreds of tokens/frame. Which Phase-13 mechanisms break first, and what does that tell you about why encode-disaggregation (Phase 15) exists?

Solution sketches

(a) Preprocessing/resize on CPU in the API process — move to async/parallel workers, downscale at the edge (Qwen2-VL token count ∝ pixels; cap max_pixels). (b) The ViT forward itself rides the first overlapping step — encoder budget tuning, or batch encoder work, or disaggregate encode (Phase 15). (c) Token inflation: 12 MP → tens of thousands of LLM tokens of prefill — enforce resolution limits server-side; chunked prefill spreads it but TTFT still pays.
Global LRU lets A's unique-image churn evict B's hot catalog (cache pollution by zero-reuse traffic). Fixes: per-tenant quotas/partitions, admission filter (only cache on second sight — a tiny bloom/ghost list), or weighted eviction favoring entries with reuse history.
An embedding tensor for a 576-token image at d=4096 fp16 is 576 × 4096 × 2 bytes ≈ 4.7 MB, often larger than the JPEG itself, so under load a remote fetch can take as long as re-encoding and lose to recompute. Worth it only for very hot content. The key must absorb model identity + weights version + preprocessor config (resize policy!); upstream's reset() on weight updates is the single-process version of that hazard.
Encoder cache capacity (a minute of video ≈ tens of thousands of embeddings) and the per-step encoder budget (one step can't afford a frame burst) break first; KV inflation follows. When encode work rivals decode work, sharing one GPU starves both — that's precisely the case for a separate encode fleet with its own scaling (Phase 15's encode disaggregation, EPD).
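The "only cache on second sight" admission filter from the fairness answer can be sketched with a bounded ghost list. `SecondSightFilter` and `should_cache` are hypothetical names; the sketch stores hashes only, never embeddings, and assumes the caller stops asking once an entry is in the real cache.

```python
from collections import OrderedDict


class SecondSightFilter:
    """Admit an image to the encoder cache only the second time it is seen."""

    def __init__(self, ghost_capacity):
        self.ghost = OrderedDict()            # recently seen hashes, no payload
        self.ghost_capacity = ghost_capacity

    def should_cache(self, mm_hash):
        if mm_hash in self.ghost:
            del self.ghost[mm_hash]           # second sight: promote to the real cache
            return True
        self.ghost[mm_hash] = None            # first sight: remember the hash only
        if len(self.ghost) > self.ghost_capacity:
            self.ghost.popitem(last=False)    # forget the oldest ghost
        return False


filt = SecondSightFilter(ghost_capacity=1024)
unique_stream = [filt.should_cache(f"tenant-a-{i}") for i in range(100)]
catalog_first = filt.should_cache("tenant-b-logo")
catalog_second = filt.should_cache("tenant-b-logo")
```

Tenant A's zero-reuse uploads never earn a cache slot, while tenant B's catalog image is admitted on its second request. The cost is one guaranteed miss per hot item, which is the trade this design accepts.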

Self-grading

4–7 and 11–14 are interview-grade. Could you whiteboard the splice (processor → expand → encode → overwrite) and both caches' keys from memory? If not, re-read 01-deep-dive.md §3–§5.

Phase 13 — Interview Questions: Multimodal Models

Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)

Q1. How does a decoder-only LLM 'see' an image in vLLM?

Model answer

A vision encoder turns the image into embeddings that occupy a fixed set of placeholder token positions in the prompt. The language model then attends over text+image tokens uniformly. vLLM's input processor handles encoding, placeholder expansion, and caching the encoder output so it isn't recomputed each step.

Q2. What new bottlenecks do multimodal models add?

Model answer

The vision encoder is extra compute/memory before prefill; image tokens inflate sequence length (and KV); and the encoder-cache plus input-processing must be profiled and batched carefully, especially for dynamic-resolution models.

Going deeper

The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.

Phase 13 — Cheatsheet: Multimodal Models

Image -> vision encoder -> image embeddings -> placeholder token positions -> normal LLM.
EncoderCacheManager reuses image features; don't recompute per step.
Image tokens inflate seq length and KV usage; profile input processing.

Key upstream files

vllm/multimodal/
vllm/multimodal/processing.py
vllm/v1/core/encoder_cache_manager.py
vllm/model_executor/models/llava.py
vllm/model_executor/models/qwen2_vl.py

Full reference: 00-guide.md · 01-deep-dive.md

Phase 14 — Model Architectures (Adding a Model)

← Phase 13 · Course home · Phase 15 →

Don't Panic

vLLM supports 200+ architectures because adding one is a well-trodden recipe: write an nn.Module that uses vLLM's parallel layers and attention, map the checkpoint weights onto it, register it, done. This phase teaches that recipe — the single most valuable maintainer skill — across decoder-only, MoE, hybrid/SSM, and pooling models.

Why this phase matters

'Add support for model X' is the most common high-value vLLM contribution. Doing it well — correct weight mapping, TP-sharded layers, the right attention, tests — is exactly what earns maintainer trust.

What you'll learn

The model contract: __init__(vllm_config), forward(input_ids, positions, ...) -> hidden
vLLM building blocks: VocabParallelEmbedding, {Column,Row}ParallelLinear, Attention, RMSNorm
Weight loading: load_weights + the name-remapping from HF checkpoints
The model registry and how a name resolves to a class
Families: decoder-only (Llama), MoE (Mixtral), hybrid/SSM (Mamba/Jamba), pooling/reward
get_input_embeddings, tie_word_embeddings, LoRA/quant compatibility hooks

The map: where this lives in the real code

Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md) walks through the important ones line by line.

vllm/model_executor/models/llama.py — The reference decoder-only implementation.
vllm/model_executor/models/registry.py — The architecture registry.
vllm/model_executor/model_loader/ — Weight loading + checkpoint format handling.
vllm/model_executor/models/mamba.py — A state-space (non-attention) model.
vllm/model_executor/models/interfaces.py — Mixins: SupportsLoRA, SupportsPP, SupportsMultiModal, ...
tests/models/ — How model correctness is tested upstream (logit/greedy equality).

Labs in this phase

lab-01-add-a-toy-architecture [CPU-OK] — implement a new architecture against the mini_vllm model contract, serve it through the unchanged engine, and prove with a tripwire proxy that the contract is exactly one method.
lab-02-trace-weight-loading [GPU-OPT] — trace 5 tensors through llama.py's load_weights: name → mapping row → fused param → slice, with live shape verification. Captured mapping included.
lab-03-weight-mapping [CPU-OK] — implement the translation: q/k/v→qkv_proj renaming, GQA-aware slices, the loud shape-assert, and the fusion-legality theorem as a 1e-12 test.

See labs/README.md for the recommended order (01 → 03 → 02) and how to run them.

How to work this phase

Read this guide for intuition.
Read 01-deep-dive.md with the upstream/ files open.
Do 02-mini-build.md — build the mini_vllm piece yourself.
Run the labs, then attempt EXERCISES.md.
Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.

Where you are

This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully-worked, line-by-line treatment (with starter/solution/test code in every lab) follows the gold standard set by the flagship phases: Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.

Phase 14 — Deep Dive: Model Architectures (Adding a Model)

Read this with upstream/ open. Every path is relative to upstream/ at the pinned commit v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number ever drifts, search for the named symbol instead.

Guided reading list

Work through these in order. This is a scaffold: the reading targets and the questions are real; fill in the line-by-line annotations as you go (this is exactly the muscle a maintainer uses — reading unfamiliar code and extracting its contract).

vllm/model_executor/models/llama.py — The reference decoder-only implementation.
- Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
vllm/model_executor/models/registry.py — The architecture registry.
- Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
vllm/model_executor/model_loader/ — Weight loading + checkpoint format handling.
- Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
vllm/model_executor/models/mamba.py — A state-space (non-attention) model.
- Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
vllm/model_executor/models/interfaces.py — Mixins: SupportsLoRA, SupportsPP, SupportsMultiModal, ...
- Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
tests/models/ — How model correctness is tested upstream (logit/greedy equality).
- Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.

Questions to answer as you read

What is the model contract: what does __init__(vllm_config) receive, and what must forward(input_ids, positions, ...) return?
Which vLLM building blocks (VocabParallelEmbedding, {Column,Row}ParallelLinear, Attention, RMSNorm) does llama.py compose, and where?
How does load_weights remap names from HF checkpoints onto vLLM parameters?
How does the model registry resolve an architecture name to a class?
How do the families differ: decoder-only (Llama), MoE (Mixtral), hybrid/SSM (Mamba/Jamba), pooling/reward?
Where do get_input_embeddings, tie_word_embeddings, and the LoRA/quant compatibility hooks appear?

Cross-references

Intuition: 00-guide.md
Build it yourself: 02-mini-build.md
The gold-standard depth to emulate: Phase 02 deep-dive.

Phase 14 — Mini-Build: extend `mini_vllm`

Your task

Define a 'model contract' in mini_vllm and implement two toy architectures behind it (a decoder-only and a tiny MoE), swappable by config — mirroring how real models plug into one runner.

Why build it (and not just read it)

Reading the real kernel/feature tells you what production does. Re-implementing a tiny version tells you why every decision was made — which is the understanding that survives into an interview or a 2 a.m. incident. Keep it small; keep it tested.

Method

Look at the matching real code from 01-deep-dive.md.
Add your module under mini_vllm/ (or extend an existing one).
Write a test_*.py next to it that pins the behavior you care about.
Run pytest mini_vllm -q and keep it green.

Definition of done

Your component runs on CPU with no extra dependencies (numpy ok).
A test demonstrates the property this phase is about (not just "it runs").
You can explain, out loud, how your toy maps to the real implementation and where it intentionally simplifies.

The flagship phases ship complete mini_vllm modules + tests (mini_vllm/block_pool.py, mini_vllm/scheduler.py) — use them as your reference for structure and test style.

Phase 14 Labs — Model Architectures (Adding a Model)

Three labs on the most common vLLM contribution: adding a model. The arc: honor the contract — implement a new architecture and prove the engine never looks past one method (lab-01), translate the checkpoint — HF names to fused vLLM params, with the GQA slice arithmetic and the fusion-legality proof (lab-03), then trace the real thing — five tensors through llama.py's load_weights, shapes reconciled live (lab-02).

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-14-model-architectures/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-14-model-architectures/labs/lab-01-add-a-toy-architecture -q

Labs

lab-01-add-a-toy-architecture `[CPU-OK]`

Implement a genuinely new architecture (a bigram model that ignores positions) against mini_vllm's one-method contract and serve it through the unchanged engine — with the capstone tripwire test: a proxy that fails on any attribute access beyond forward survives a full generate(), measuring the contract's width. Plus the deep one: Phase 3's chunked-prefill invariant verified for the new model — engine invariants are model-independent. Skills: the narrow waist; over-supplying contracts; tripwire proxies as executable architecture docs; the layer library as where features meet models.

lab-02-trace-weight-loading `[GPU-OPT]`

Five tensors traced through the real load_weights: safetensors name → mapping row → fused parameter → slice, with live shape verification ((6144, 4096) qkv = 32 q-heads + 2×8 kv-heads, halving under TP=2) and checkpoint forensics from shapes alone. Captured mapping table included. Skills: reading load_weights as a peer; --load-format dummy; diagnosing loads-but-garbage; shapes as architecture fingerprints.

lab-03-weight-mapping `[CPU-OK]`

The translation implemented: q_proj/k_proj/v_proj → qkv_proj name rewriting with shard tags, GQA-aware slice arithmetic (k/v narrower than q — the off-by-one habitat), the loud shape-assert that catches MHA-checkpoint-meets-GQA-config at load time, and the legality theorem as a test: fused output slices ≡ separate projections to 1e-12. Skills: stacked_params_mapping; fusion is layout, not math; load-time asserts beat serve-time hallucinations.
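The legality claim for lab-03 (fused output slices equal the separate projections) can be checked numerically in a few lines. This is a sketch on a toy GQA shape with random weights; the variable names and sizes are arbitrary choices, not the lab's test.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, head_dim, num_heads, num_kv_heads = 64, 8, 8, 2   # toy GQA shape

w_q = rng.standard_normal((num_heads * head_dim, hidden))
w_k = rng.standard_normal((num_kv_heads * head_dim, hidden))
w_v = rng.standard_normal((num_kv_heads * head_dim, hidden))
x = rng.standard_normal((5, hidden))

# Fusion stacks the three weight matrices along the output (row) dimension.
w_qkv = np.concatenate([w_q, w_k, w_v], axis=0)
q_end = num_heads * head_dim
k_end = q_end + num_kv_heads * head_dim

# One fused matmul, then slice the output back into q, k, v.
fused = x @ w_qkv.T
max_err = max(
    np.abs(fused[:, :q_end] - x @ w_q.T).max(),
    np.abs(fused[:, q_end:k_end] - x @ w_k.T).max(),
    np.abs(fused[:, k_end:] - x @ w_v.T).max(),
)
```

Each output column of the fused product depends on exactly one row of `w_qkv`, so stacking rows only changes where results land, not what they are. Any nonzero `max_err` is floating-point reordering, far below 1e-12 at this size.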

What you can do after this phase

Walk the full integrator's path: implement a model against the contract using the layer library (getting TP/quant/LoRA for free), write its mapping table, load a real checkpoint, and verify with the discipline these labs drilled (touched-exactly-once, loud shape asserts, invariant tests against the new model). Read any file in vllm/model_executor/models/ as a variation on machinery you've built — and recognize "KeyError loading model X" issues as missing mapping rows you can fix. That's the on-ramp to Phase 19's real upstream PR.

Lab 14-01 — Add an Architecture Without Touching the Engine `[CPU-OK]`

vLLM serves hundreds of architectures — Llama, Mixtral, DeepSeek, Mamba hybrids, embedding models — through one engine, and the trick is a discipline, not a miracle: models implement a narrow contract, and the engine calls nothing else. This lab makes you live that discipline in miniature. mini_vllm's contract is one method — forward(last_tokens, positions) → logits — and you'll implement a genuinely new architecture against it (a bigram model: logits from a per-token table, positions ignored), swap it into a running engine, and prove every engine feature works unchanged. The capstone test is a tripwire proxy that fails on any attribute access beyond forward — and the engine passes a full generate() through it, proving the contract is exactly one method, not asserting it.

Why this lab exists

"Add support for model X" is the single most common vLLM contribution — the on-ramp through which most maintainers arrived — and the task is approachable precisely because of the contract this lab teaches. A model integrator never touches the scheduler, the KV manager, or the sampler; they write a model class that honors the interface and a weight loader that fills it (labs 02/03). Knowing where the boundary sits — what you must provide, what you may ignore, what you must never reach around — is the difference between a weekend PR and a month of confusion.

The lab's sneaky-deep test is test_engine_invariants_hold_for_the_new_model: Phase 3's chunked-equals-unchunked is an engine property, and it must hold for any contract-honoring model. Run it against your new architecture and you're doing what vLLM's CI does across its whole model zoo — verifying that engine invariants and model implementations are independent axes. When an invariant breaks only for one model, the leak is in whoever crossed the boundary, and this test design localizes the suspect instantly.

Background: the narrow waist

The contract's anatomy, and why each piece is what it is:

forward(last_tokens, positions) → (batch, vocab) logits — the engine guarantees row i of the output corresponds to entry i of the inputs (Phase 1 lab-03's positional contract), and that only requests passing needs_sample appear (the catch-up rule). The model guarantees deterministic logits given its inputs. Neither knows anything else about the other.
Positions are offered, not mandated — your BigramModel ignores them entirely and the engine cannot tell. That's the proof that the contract over-supplies on purpose: it carries what the most demanding architecture needs (positional information for RoPE-style models), and simpler models discard the surplus. Real vLLM's contract is wider for the same reason (KV caches, attention metadata, intermediate states for EAGLE — Phase 8), and most models use a subset.
The registry is the production version of your install_model: config's architectures field → ModelRegistry lookup → class constructed with the vLLM config. Swapping a model is data, not code — which is also how out-of-tree models plug in (Phase 17's plugin machinery registers into the same table).
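A model that honors this contract while discarding positions can be very small. The sketch below is not the lab's reference: `BigramSketch`, its constructor arguments, and the NumPy table are assumptions made to show the shape of the idea.

```python
import numpy as np


class BigramSketch:
    """Logits depend only on the last token; positions are accepted and ignored."""

    def __init__(self, vocab_size, seed=0):
        rng = np.random.default_rng(seed)
        # Row t holds the next-token logits that follow token t.
        self.table = rng.standard_normal((vocab_size, vocab_size))

    def forward(self, last_tokens, positions):
        del positions  # offered by the contract, unused by this architecture
        # Row i of the output corresponds to entry i of last_tokens.
        return self.table[np.asarray(last_tokens, dtype=np.int64)]


model = BigramSketch(vocab_size=16, seed=7)
logits = model.forward([3, 5, 3], [0, 1, 2])
same_logits = model.forward([3, 5, 3], [9, 9, 9])   # different positions, same answer
```

The output has one row per request and one column per vocabulary entry, and the two calls agree exactly because nothing in `forward` reads `positions`.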

Files

starter.py — BigramModel (the new brain) and install_model (the swap). Your work.
solution.py — reference.
test_lab.py — serving works, the brain is genuinely different, determinism, the engine-invariant check, and the contract tripwire.

Run

LAB_IMPL=starter pytest phase-14-model-architectures/labs/lab-01-add-a-toy-architecture -q
pytest phase-14-model-architectures/labs/lab-01-add-a-toy-architecture -q   # reference

What the tests prove

Test	What it pins
`test_new_architecture_serves_through_the_unchanged_engine`	The integration: batching, scheduling, sampling, stopping — all engine code, all untouched, all working
`test_it_really_is_a_different_model`	The outputs differ from `ToyModel`'s — you changed the brain, not the plumbing (a lab that accidentally re-implemented the old model would pass everything else)
`test_determinism_across_engine_instances`	The new architecture honors the course's testability convention: logits as a pure function of (seed, inputs)
`test_engine_invariants_hold_for_the_new_model`	Chunked ≡ unchunked for the new model — engine invariants are model-independent, verified rather than hoped
`test_engine_touches_only_the_contract`	The tripwire proxy: a full `generate()` with every attribute except `forward` booby-trapped. The contract's width, measured: one method

Hitchhiker's notes

The real contract, for comparison: a vLLM model implements forward(input_ids, positions, …) → hidden_states, compute_logits, and load_weights, composed from the layer library — VocabParallelEmbedding, QKVParallelLinear, RowParallelLinear (Phase 10 lab-01's classes!), Attention (which hides the entire Phase 2/4 machinery behind one call), RMSNorm. Building from these gives you TP, quantization, LoRA, and paged attention for free — the layer library is where the engine's features and the model's architecture meet, and using bare nn.Linear instead is the classic new-contributor mistake (works single-GPU, breaks under TP, bypasses quantization).
The tripwire-proxy test pattern generalizes: any time a design claims "X only uses interface Y," wrap Y's provider in a proxy that fails on everything else and run the full workload. Interfaces rot by accretion — someone reaches around for "just one attribute" — and a tripwire in CI is the only durable fence. (Compare Phase 9 lab-04's broken-control pattern: both are executable architecture documentation.)
Why a bigram model, of all things? Because ignoring positions is the point: the most instructive new architecture is one that uses less than the contract offers, proving the contract doesn't secretly require everything it carries. Hybrid/SSM models (Mamba-style) are the production version of this lesson — they need different state than KV caches, and vLLM's contract grew (state managers, hybrid allocators) precisely where their needs exceeded it.
mini_vllm's engine constructs its own model (no registry) — install_model papers over that with assignment. The gap is deliberate lab surface: notice how a registry (construct-from-config) beats post-hoc swapping the moment configs, checkpoints, and TP enter. The README of the real registry: upstream/vllm/model_executor/models/registry.py.
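The tripwire-proxy pattern from the notes above fits in a dozen lines of plain Python. This is a sketch of the pattern, not the lab's test fixture; `Tripwire` and `EchoModel` are hypothetical names.

```python
class Tripwire:
    """Proxy that exposes forward and fails loudly on any other attribute."""

    def __init__(self, model):
        object.__setattr__(self, "_model", model)

    def __getattribute__(self, name):
        if name == "forward":
            # Bypass this override to reach the wrapped model.
            return object.__getattribute__(self, "_model").forward
        raise AssertionError(f"caller reached around the contract: .{name}")


class EchoModel:
    vocab_size = 8

    def forward(self, last_tokens, positions):
        return [t + 1 for t in last_tokens]


proxy = Tripwire(EchoModel())
ok = proxy.forward([1, 2], [0, 0])
try:
    proxy.vocab_size
    tripped = False
except AssertionError:
    tripped = True
```

Raising AssertionError instead of AttributeError is deliberate: a caller probing with hasattr or getattr-with-default would swallow an AttributeError, and the fence would fail silently.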

Going further

Add a second architecture — a RepeaterModel that strongly biases toward the last token (logits = one-hot-ish on last_token) — and watch greedy decoding produce aaaa...: a two-line model that generates the repetition pathology Phase 9 lab-01's penalties exist to fight. Then apply the penalty and watch it break the loop. Three labs, one demo.
Build a registry = {"bigram": BigramModel, "toy": ToyModel} and an engine_from_config({"architecture": "bigram", "seed": 7}) constructor: the real registration pattern, 10 lines, and now your lab-02/03 weight knowledge has a place to plug in.
Write the negative test: a model whose forward returns the wrong batch size, and assert the engine fails loudly rather than mis-assigning tokens (it fails in the sampler's indexing — would you ship a clearer assert upstream?).
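For the registry exercise, the construct-from-config pattern looks roughly like this. It is a generic sketch, not mini_vllm's or vLLM's code: `MODEL_REGISTRY`, `register`, `model_from_config`, and `ConstantModel` are hypothetical stand-ins for whatever your build defines.

```python
MODEL_REGISTRY = {}


def register(name):
    """Class decorator that files a model class under an architecture name."""
    def wrap(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap


@register("constant")
class ConstantModel:
    def __init__(self, config):
        self.vocab_size = config["vocab_size"]

    def forward(self, last_tokens, positions):
        return [[0.0] * self.vocab_size for _ in last_tokens]


def model_from_config(config):
    """Resolve config['architecture'] to a class and construct it from the config."""
    arch = config["architecture"]
    if arch not in MODEL_REGISTRY:
        raise ValueError(f"unknown architecture {arch!r}; known: {sorted(MODEL_REGISTRY)}")
    return MODEL_REGISTRY[arch](config)


model = model_from_config({"architecture": "constant", "vocab_size": 4})
```

Swapping the model is now a change to the config dictionary, and an unknown name fails at construction with the list of known architectures instead of somewhere deep in a forward pass.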

References

upstream/vllm/model_executor/models/registry.py — the architecture → class table.
upstream/vllm/model_executor/models/llama.py — the canonical model implementation; read it as "the contract, honored" (and lab-02/03's subject).
vLLM docs, Adding a New Model — the official integrator's guide this lab is the warm-up for: https://docs.vllm.ai/en/latest/contributing/model/
Phase 1 lab-03 — the row-order contract this lab's forward inherits; Phase 10 lab-01 — the layer library that makes real models TP-able by construction.

Lab 14-02 — Trace Weight Loading in the Real `llama.py` `[GPU-OPT]`

Lab-03 had you implement the translation; this lab has you watch the production version run — and read it as a peer. You'll trace five tensors from a real Llama checkpoint through load_weights: the safetensors name on disk, the mapping row that claims it, the vLLM parameter and shard it lands in, and (under TP) which rows of that shard each rank takes. The deliverable is the filled-in mapping table below — five rows of checkpoint surgery, verified against a live load.

No GPU? Don't panic. Loading happens on CPU before anything touches CUDA — you can trace most of this with device="cpu"-ish settings or just the captured table below plus the source. The reading is the lab.

Why this lab exists

load_weights is where most model-integration PRs live or die, and it's also the single most readable "real" function in the model zoo once you have lab-03's vocabulary — a loop, a mapping table, and weight_loader callbacks. Tracing five tensors end-to-end converts the function from code-you-scroll-past into code-you-could-have-written, and it arms you for the two production moments that need this knowledge: a new checkpoint that won't load (which mapping row is missing?), and a loaded model that generates garbage (which shard landed wrong?).

Requirements

uv pip install -e ".[vllm]"
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct  # or any Llama-family model

Steps

List the checkpoint's names: a safetensors file starts with a JSON header that indexes every tensor, so you can enumerate names without loading any tensor data:

import glob
from safetensors import safe_open

names = []
for path in sorted(glob.glob("<model_dir>/*.safetensors")):
    # keys() comes from the file header; no tensor data is read here
    with safe_open(path, framework="np") as sf:
        names += list(sf.keys())
print(len(names))           # ~291 for an 8B
print([n for n in names if ".layers.0." in n])   # one layer's worth

Open upstream/vllm/model_executor/models/llama.py, find stacked_params_mapping and the load_weights loop. For each of the five tensor names in the table below, walk the loop by hand: which mapping row matches? what name does it become? which shard_id rides along?
Verify live: load the model with LLM(model=..., enforce_eager=True) and afterwards inspect a parameter's shape: model.model.layers[0].self_attn.qkv_proj.weight.shape → (6144, 4096) — and reconcile: 32 q-heads × 128 + 2 × (8 kv-heads × 128) = 4096 + 2048 = 6144. Lab-03's qkv_slices, on a real tensor.

Captured mapping (Llama-3-8B, vLLM 0.22.1)

checkpoint tensor (HF)	vLLM parameter	shard	rows in fused
`model.layers.0.self_attn.q_proj.weight`	`...self_attn.qkv_proj.weight`	`q`	`0:4096`
`model.layers.0.self_attn.k_proj.weight`	`...self_attn.qkv_proj.weight`	`k`	`4096:5120`
`model.layers.0.self_attn.v_proj.weight`	`...self_attn.qkv_proj.weight`	`v`	`5120:6144`
`model.layers.0.mlp.gate_proj.weight`	`...mlp.gate_up_proj.weight`	`0`	`0:14336`
`model.layers.0.input_layernorm.weight`	(itself)	—	unfused

# live verification:
qkv_proj.weight.shape   = (6144, 4096)    # 4096 q + 1024 k + 1024 v  (GQA: 8 kv heads)
gate_up_proj.weight.shape = (28672, 4096) # 14336 gate + 14336 up
# under tensor_parallel_size=2: qkv_proj.weight is (3072, 4096) per rank (heads split, slices halve)

Reading the trace

The k/v rows are 4× narrower than q's — GQA's 8 KV heads vs 32 query heads, lab-03's slice asymmetry on a real 8B. If you ever see (12288, 4096) here instead, you're looking at an MHA model — the fused shape is an architecture fingerprint.
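gate_up_proj at 28,672 rows: the MLP's two halves stacked. down_proj stays unfused (it has no sibling to stack with). The mapping table's five rows claim 5 of the 9 tensors in every decoder layer (q, k, v, gate, up), which is 160 of the ~291 names you listed, a bit over half; everything else (o_proj, down_proj, the norms, the embeddings, lm_head) passes through.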
Under TP=2, every fused shape halves along rows — Phase 10 lab-01's column-parallel sharding composed with lab-03's stacking: each rank loads its heads' rows of each shard directly from disk. Two slicings, one read, no redistribution — the loading-is-part-of-sharding point from Phase 10, visible in a tensor shape.
enforce_eager=True keeps the trace clean (no capture pass cluttering logs — Phase 5 lab-04's test-suite setting, used for exactly its intended purpose).
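
The shapes in the capture can be re-derived from the config alone. The sketch below is a hypothetical helper (`fused_shapes` is not an upstream function) that assumes PyTorch's (out_features, in_features) weight layout and head counts that divide evenly by the TP size; it reproduces the (6144, 4096), (28672, 4096), (3072, 4096), and MHA (12288, 4096) figures quoted above.

```python
def fused_shapes(hidden, n_heads, n_kv_heads, head_dim, intermediate, tp_size=1):
    """Per-rank shapes of the two fused weights, as (out_features, in_features)."""
    if n_heads % tp_size or n_kv_heads % tp_size or intermediate % tp_size:
        # KV-head replication (n_kv_heads < tp_size) is deliberately out of scope here.
        raise ValueError("this sketch only handles evenly divisible sizes")
    q_rows = n_heads * head_dim // tp_size
    kv_rows = n_kv_heads * head_dim // tp_size
    return {
        "qkv_proj": (q_rows + 2 * kv_rows, hidden),
        "gate_up_proj": (2 * intermediate // tp_size, hidden),
    }


# Llama-3-8B numbers as used in this lab's table.
LLAMA3_8B = dict(hidden=4096, n_heads=32, n_kv_heads=8, head_dim=128, intermediate=14336)

full = fused_shapes(**LLAMA3_8B)
half = fused_shapes(**LLAMA3_8B, tp_size=2)
mha = fused_shapes(**{**LLAMA3_8B, "n_kv_heads": 32})  # same model, but without GQA

print(full["qkv_proj"], full["gate_up_proj"])
print(half["qkv_proj"], mha["qkv_proj"])
```

If the live shape disagrees with what this arithmetic predicts for your config, suspect the config (head counts, head_dim) before suspecting the loader.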

Hitchhiker's notes

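--load-format dummy skips real weights (random init): the tool for checking that a model constructs, shards, and runs with the right shapes without downloading 16 GB, and how CI exercises model code cheaply. Pair with a tiny --max-model-len and construction bugs surface in seconds. Because no checkpoint is read, don't count on it to exercise the name mapping in load_weights; check the dummy loader under upstream/vllm/model_executor/model_loader/ at the pinned commit to see exactly what it skips.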
Watch for the unloaded-parameter check: upstream tracks which params got weights and errors on leftovers — the missed-tensor guard from lab-03's going-further, in production. When adding a model, that error message is your todo list.
Sharded checkpoints (multiple .safetensors files) interleave layers across files arbitrarily — the loader is order-independent by design (each tensor knows its name; the mapping doesn't care about file layout). Resist any urge to assume file order means anything.
Quantized variants add scale/zero tensors with their own names (...qweight, ...scales) routed by the quant method's loader (Phase 6) — same loop, more rows. Tracing one AWQ tensor through is the natural sequel to this lab.

Reflect

From the shapes alone — (6144, 4096) qkv, (28672, 4096) gate_up — reconstruct the model card: hidden size, head count, KV heads, MLP expansion. (4096 hidden; 32 heads × 128; 8 KV heads; 3.5× ffn ratio.) Checkpoint forensics is a real skill; you just did it.
A teammate's new model PR loads but outputs garbage; loading reported no errors. Using labs 01–03: what are your first three checks? (Mapping rows for every fused family — a missed one means init-valued weights; slice boundaries vs the config's head counts; and the q/k/v order in the fused buffer vs what the attention layer expects.)
Why does vLLM fuse at load time rather than shipping a conversion script? (Checkpoints stay interchange-format; the fusion choice is the runtime's, can change between versions, and composes with TP/quantization decided at startup — lab-03's interface-vs-implementation point, operationalized.)

References

upstream/vllm/model_executor/models/llama.py — load_weights and stacked_params_mapping: the function under trace.
upstream/vllm/model_executor/layers/linear.py — the weight_loader callbacks that place each shard (lab-03's load_stacked, with TP).
upstream/vllm/model_executor/model_loader/ — the loader framework (formats, dummy loading, sharded files).
Lab-03 — the implementation this trace recognizes; lab-01 — the contract the loaded model serves through.

Lab 14-03 — Checkpoint Surgery: HF Names → vLLM Params, Shards → Fused `[CPU-OK]`

A HuggingFace checkpoint and a vLLM model disagree about what a layer is. The checkpoint stores q_proj, k_proj, v_proj as three tensors; vLLM runs one fused qkv_proj (one big GEMM beats three small ones — Phase 7 lab-03's tiling economics, applied to layer design). Same for gate_proj+up_proj → gate_up_proj. Loading weights is therefore translation: rename every checkpoint tensor to its vLLM parameter, and copy shard tensors into the right slice of the fused buffer. This lab has you build the translation table (llama.py's stacked_params_mapping, in spirit), the GQA-aware slice arithmetic, and the shape guard that turns wrong-checkpoint disasters into loud load-time errors — then prove the fusion legal with the test that matters: the fused matmul's output slices equal the three separate projections, exactly.

Why this lab exists

When a newly-added model loads and generates fluent nonsense, the bug is almost never in the forward pass — it's here, in the mapping: a shard landed in the wrong slice, a name pattern missed a tensor (silently left at init values), or an MHA checkpoint met a GQA config. These failures are maddening precisely because nothing crashes: the shapes coincidentally fit, the matmuls run, the output is garbage. The two defenses you'll build are the professional's toolkit: exact slice arithmetic (derived, not pattern-matched) and assert-on-shape at load time (test_shape_mismatch_is_loud — the wrong-checkpoint case caught at the door, not at the demo).

This lab is also lab-02's prerequisite done right: lab-02 has you read load_weights in llama.py; this lab has you implement its core first, so the reading is recognition. The pairing (build small, then read big) is the course's method; this is its purest instance — the production function is your three functions plus a loop over the checkpoint.

Background: why fused, and where the slices fall

Why fuse at all: three matmuls over the same input x with weights Wq, Wk, Wv equal one matmul with the row-stacked weight, x @ [Wq; Wk; Wv]ᵀ. The single GEMM launches once, tiles better (Phase 7 lab-03: bigger M×N per weight-read), and reads x from memory once instead of three times. The legality is two lines of block matrix algebra, and test_fused_matmul_equals_separate_projections states it as an executable fact. (Stacking weight rows composes with tensor parallelism too, because rows of the weight are columns of the output: QKVParallelLinear is Phase 10 lab-01's column-parallel class with this stacking built in, and the shard boundaries respect head boundaries on every rank.)
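
The identity above can be checked numerically in a few lines. This is a minimal sketch with made-up toy sizes (8 query heads, 2 KV heads, head_dim 16, hidden 64), not the lab's test itself; it stacks the three weights along rows, runs one matmul, and compares each output slice against the separate projection.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, nh, nkv, hd = 64, 8, 2, 16          # toy sizes, chosen for illustration

x = rng.standard_normal((5, hidden))         # 5 tokens
Wq = rng.standard_normal((nh * hd, hidden))  # (out_features, in_features)
Wk = rng.standard_normal((nkv * hd, hidden))
Wv = rng.standard_normal((nkv * hd, hidden))

# Row-stack the weights, then do one GEMM instead of three.
W_fused = np.concatenate([Wq, Wk, Wv], axis=0)
y = x @ W_fused.T

q_end = nh * hd
k_end = q_end + nkv * hd

# Largest absolute difference between each output slice and its own projection.
max_err = max(
    np.abs(y[:, :q_end] - x @ Wq.T).max(),
    np.abs(y[:, q_end:k_end] - x @ Wk.T).max(),
    np.abs(y[:, k_end:] - x @ Wv.T).max(),
)
print(max_err)
```

Any nonzero difference is floating-point summation order, not a modeling effect: each output column depends on exactly one weight row, so stacking rows cannot mix projections.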

Where the slices fall — the GQA wrinkle: with nh query heads, nkv KV heads, head_dim hd, the fused weight has (nh + 2·nkv)·hd rows: q owns the first nh·hd, k the next nkv·hd, v the last nkv·hd. Under GQA (Phase 0 lab-02's 4× KV saving) nkv < nh, so k and v slices are narrower than q's — the asymmetry test_qkv_slices_account_for_gqa pins, and exactly the place hand-written loaders go wrong when their author last looked at an MHA model.
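
The slice arithmetic in that paragraph is short enough to write out. The function below is a sketch of one possible `qkv_slices`; the signature and the dict return type are assumptions, so treat starter.py and test_lab.py as the authority on the interface the lab actually expects.

```python
def qkv_slices(n_heads, n_kv_heads, head_dim):
    """Row ranges of q, k, v inside the fused weight, derived from head counts."""
    q_rows = n_heads * head_dim
    kv_rows = n_kv_heads * head_dim
    q = slice(0, q_rows)
    k = slice(q.stop, q.stop + kv_rows)   # k starts exactly where q ends
    v = slice(k.stop, k.stop + kv_rows)   # v starts exactly where k ends
    return {"q": q, "k": k, "v": v}


toy = qkv_slices(n_heads=8, n_kv_heads=2, head_dim=16)
llama = qkv_slices(n_heads=32, n_kv_heads=8, head_dim=128)
print(toy)
print(llama)
```

Chaining each start from the previous stop is the design choice that matters: gaps and overlaps become impossible by construction, instead of being something a test has to catch.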

The name mapping: a substring rewrite (q_proj → qkv_proj) plus a shard tag telling the loader which slice. Tensors outside the table (norms, embeddings, down_proj — anything unfused) map to themselves. Upstream's stacked_params_mapping is literally this list of triples; your STACKED_PARAMS copies its shape.
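
As a rough illustration of the substring rewrite, the sketch below uses a five-row table in the (fused fragment, checkpoint fragment, shard id) shape described above. The exact contents of the provided STACKED_PARAMS and the return convention of `map_weight_name` are assumptions here; check starter.py before relying on them.

```python
# Assumed table layout: (fused name fragment, checkpoint name fragment, shard id).
STACKED_PARAMS = [
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
    (".gate_up_proj", ".gate_proj", 0),
    (".gate_up_proj", ".up_proj", 1),
]


def map_weight_name(name):
    """Return (vllm_param_name, shard_id); shard_id is None for unfused tensors."""
    for fused, shard_name, shard_id in STACKED_PARAMS:
        if shard_name in name:
            return name.replace(shard_name, fused), shard_id
    return name, None


print(map_weight_name("model.layers.3.self_attn.q_proj.weight"))
print(map_weight_name("model.layers.3.mlp.down_proj.weight"))
```

The leading dot in each fragment is what keeps the match from being too greedy: ".up_proj" cannot match inside "gate_up_proj", because the character before "up_proj" there is an underscore.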

Files

starter.py — map_weight_name, qkv_slices, load_stacked (+ the STACKED_PARAMS table, provided). Your work.
solution.py — reference.
test_lab.py — the mapping, pass-throughs, GQA slice arithmetic, the fusion-legality proof, and the loud-mismatch guard.

Run

LAB_IMPL=starter pytest phase-14-model-architectures/labs/lab-03-weight-mapping -q
pytest phase-14-model-architectures/labs/lab-03-weight-mapping -q   # reference

What the tests prove

Test	What it pins
`test_name_mapping`	The rewrite preserves the layer path and swaps only the projection name — `model.layers.3.self_attn.q_proj.weight` keeps its `layers.3` identity
`test_unstacked_names_pass_through`	Norms, embeddings, `down_proj`, `lm_head`: shard_id None, name unchanged. A mapping that's too greedy (matching `up_proj` inside `gate_up_proj`-like names) fails here
`test_qkv_slices_account_for_gqa`	q gets 128 rows, k and v get 32 each — and the three slices tile the fused rows with no gaps and no overlap (assert the boundary equalities; off-by-ones here are the garbage-output bug)
`test_fused_matmul_equals_separate_projections`	The legality theorem: slice the fused output and recover each projection to 1e-12. Fusion is layout, not math — the course's paged-attention identity (Phase 2 lab-06), weight edition
`test_shape_mismatch_is_loud`	An MHA-width k shard against a GQA config: caught by the assert at load, with shapes in the message. The alternative is a demo that hallucinates
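
To make the last row concrete, here is a sketch of a shape-guarded `load_stacked` for the q/k/v family only. The signature is hypothetical and the buffer is a plain NumPy array rather than a real parameter; the point is the guard, which compares the shard against the slice the config implies before copying anything.

```python
import numpy as np


def load_stacked(fused, shard, shard_id, n_heads, n_kv_heads, head_dim):
    """Copy one checkpoint shard into its row slice of the fused q/k/v buffer."""
    q_rows, kv_rows = n_heads * head_dim, n_kv_heads * head_dim
    start, rows = {
        "q": (0, q_rows),
        "k": (q_rows, kv_rows),
        "v": (q_rows + kv_rows, kv_rows),
    }[shard_id]
    expected = (rows, fused.shape[1])
    # The guard: refuse a shard whose shape disagrees with the config-derived slice.
    assert shard.shape == expected, (
        f"shard {shard_id!r}: checkpoint shape {shard.shape} != expected {expected}"
    )
    fused[start:start + rows] = shard


nh, nkv, hd, hidden = 8, 2, 16, 64
fused = np.zeros(((nh + 2 * nkv) * hd, hidden))
load_stacked(fused, np.ones((nkv * hd, hidden)), "k", nh, nkv, hd)  # GQA-width k: accepted

try:
    load_stacked(fused, np.ones((nh * hd, hidden)), "k", nh, nkv, hd)  # MHA-width k
except AssertionError as err:
    print(err)
```

The sketch uses `assert` to match the lab's wording; outside a test suite, raising `ValueError` is the safer habit, since assertions vanish under `python -O`.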

Hitchhiker's notes

Read load_weights right after this (upstream/vllm/model_executor/models/llama.py, search stacked_params_mapping): the production loop is your three functions plus reality — iterating safetensors shards, skipping rotary-embedding buffers, handling TP (each rank loads only its rows of each slice: Phase 10 lab-01's sharding composed with this lab's stacking — two slicings, one tensor), and weight_loader callbacks per parameter that encapsulate the slice placement. Your load_stacked is the weight_loader of QKVParallelLinear, minus distribution.
Quantized checkpoints stack the stakes: AWQ/GPTQ tensors come with scales and zero-points per group (Phase 6 lab-03) that must be sliced consistently with their weights — a mapping bug now corrupts numerics in a way that's only statistically visible. Same machinery, smaller margin for error; the loud-assert habit pays double.
The mapping table is per-architecture API: when a new HF model renames a tensor (mlp.experts.0.w1 vs block_sparse_moe...), vLLM's loader needs a new mapping entry. A missing entry is a common cause of "KeyError loading model X with vLLM version Y" issues. You can now read those tracebacks as "the translation table is missing a row" and often fix them yourself; that's a real first upstream PR shape.
Why not store fused in the checkpoint? The checkpoint serves every runtime (HF transformers, llama.cpp, MLX...), each with its own fusion choices. Unfused is the interchange format; fusion is a runtime optimization — the same interface-vs-implementation split as Phase 11's unmerged LoRA, and the reason load_weights exists at all.

Going further

Add down_proj and embedding handling plus a full load_checkpoint(params, ckpt) driver: iterate a dict of fake checkpoint tensors, translate, place, and assert every parameter got touched exactly once (the missed-tensor bug class, made checkable — upstream tracks loaded_params for the same reason).
Compose with TP: given tp_rank, tp_size, make load_stacked place only the rank's rows of each shard (q rows shard by head; k/v by KV head — and note what happens when nkv < tp_size: KV-head replication, the real constraint from Phase 10 lab-01's divisibility test).
Write the MoE mapping rows (experts.N.w1 → experts.w13_weight with expert-index shards) by reading mixtral.py's table — the same idea with two stacking axes.

References

upstream/vllm/model_executor/models/llama.py — load_weights + stacked_params_mapping: this lab, productionized (lab-02 reads it with you).
upstream/vllm/model_executor/layers/linear.py — QKVParallelLinear.weight_loader: your load_stacked with TP.
vLLM docs, Adding a New Model — where the mapping table fits in the integrator's checklist: https://docs.vllm.ai/en/latest/contributing/model/
Phase 7 lab-03 — why fused GEMMs win; Phase 10 lab-01 — the sharding this stacking composes with; Phase 6 lab-03 — the quantized version of the stakes.

Phase 14 — Exercises: Model Architectures (Adding a Model)

Work these after the labs. They escalate from "explain it" to "design it" — staff-level means you can do the last ones cold.

Map a HF attention block's qkv/o weights onto QKVParallelLinear/RowParallelLinear.
What must change to make a model support tensor parallelism correctly?
How would you add a pooling/reward head, and what changes in output handling?

Self-grading

For each: could you (a) explain it to a teammate in 2 minutes, and (b) point to the exact upstream/ file that proves your answer? If not, re-read the matching anchor in 01-deep-dive.md.

Phase 14 — Interview Questions: Model Architectures (Adding a Model)

Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)

Q1. Walk me through adding support for a new decoder-only model to vLLM.

Model answer

Implement the model as an nn.Module using vLLM's parallel layers + Attention; implement load_weights to remap the HF checkpoint (esp. fused QKV/gate-up); register it; add it to the supported list; and add a correctness test comparing greedy/logits to HF. Handle TP sharding, tied embeddings, and any quant/LoRA hooks.

Q2. Why must the model use vLLM's Linear/Attention layers instead of plain torch?

Model answer

Those layers carry tensor-parallel sharding, paged-attention metadata, quantization dispatch, and CUDA-graph/compile compatibility. Plain torch layers would bypass paging, TP, and quantization — breaking the whole engine's contract.

Going deeper

The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.

Phase 14 — Cheatsheet: Model Architectures (Adding a Model)

Model recipe: parallel layers + Attention -> register -> load_weights remap -> test vs HF.
Fused weights (QKV, gate_up) are the usual load_weights gotcha.
Interfaces/mixins declare LoRA/PP/MultiModal/pooling support.
Families: decoder-only, MoE, hybrid/SSM (Mamba), embedding/reward.

Key upstream files

vllm/model_executor/models/llama.py
vllm/model_executor/models/registry.py
vllm/model_executor/model_loader/
vllm/model_executor/models/mamba.py
vllm/model_executor/models/interfaces.py
tests/models/

Full reference: 00-guide.md · 01-deep-dive.md

Phase 15 — Disaggregated Serving

Don't Panic

Prefill and decode have opposite appetites: prefill wants compute, decode wants memory bandwidth and runs much longer. Disaggregation runs them on SEPARATE machines — prefill servers and decode servers — and ships the KV cache between them. Each fleet is tuned and scaled independently. This phase is that split and the KV transfer that enables it.

Why this phase matters

P/D disaggregation is how the largest deployments hit both tight TTFT and high throughput at once, and it's a frontier of vLLM. Understanding KV connectors also unlocks KV offloading and cross-engine caching.

What you'll learn

Why co-locating prefill+decode causes interference (prefill stalls decodes)
Prefill node -> KV transfer -> decode node; the request handoff
KV connectors: the transfer abstraction (NIXL, shared storage, etc.)
Encode disaggregation for multimodal
Routing / proxy between P and D fleets; load balancing

The map: where this lives in the real code

Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md) walks through the important ones line by line.

vllm/distributed/kv_transfer/ — The KV connector framework (the heart of disagg).
vllm/distributed/kv_transfer/kv_connector/v1/ — V1 connectors (base + implementations).
vllm/v1/core/sched/scheduler.py — Search 'connector' / 'WAITING_FOR_REMOTE_KVS' to see async KV load.
examples/ — Look for disaggregated-prefill example scripts/configs.

Labs in this phase

lab-01-kv-handoff [CPU-OK] — migrate a live request between two mini_vllm engines (export/import + the KV block bill) and prove the continuation token-for-token identical.
lab-02-pd-pair [GPU-OPT] — a real producer/consumer pair with a KV connector: TTFT +10% (the toll), ITL p99 3× better (the interference, gone). Captured output included.
lab-03-disagg-economics [CPU-OK] — the trade in five functions: 256 MiB of freight per 2048-token 8B prompt, ~11 ms on fast fabric vs ~215 ms on 10 GbE, and the decision function that says no two different ways.

See labs/README.md for the recommended order (01 → 03 → 02) and how to run them.

How to work this phase

Read this guide for intuition.
Read 01-deep-dive.md with the upstream/ files open.
Do 02-mini-build.md — build the mini_vllm piece yourself.
Run the labs, then attempt EXERCISES.md.
Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.

Phase 15 — Deep Dive: Disaggregated Serving

Read this with upstream/ open. Every path is relative to upstream/ at the pinned commit v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number ever drifts, search for the named symbol instead.

Guided reading list

vllm/distributed/kv_transfer/ — The KV connector framework (the heart of disagg).
vllm/distributed/kv_transfer/kv_connector/v1/ — V1 connectors (base + implementations).
vllm/v1/core/sched/scheduler.py — Search 'connector' / 'WAITING_FOR_REMOTE_KVS' to see async KV load.
examples/ — Look for disaggregated-prefill example scripts/configs.

After reading each entry, write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.

Questions to answer as you read

Why does co-locating prefill and decode cause interference (prefill stalls decodes)?
How does a request hand off from the prefill node, through KV transfer, to the decode node?
What does the KV connector abstraction cover, and how do its transports (NIXL, shared storage, etc.) differ?
How does encode disaggregation apply to multimodal models?
How does the routing proxy sit between the P and D fleets, and how does it balance load?

Cross-references

Intuition: 00-guide.md
Build it yourself: 02-mini-build.md
The gold-standard depth to emulate: Phase 02 deep-dive.

Phase 15 — Mini-Build: extend `mini_vllm`

Your task

Model disaggregation in mini_vllm: run a 'prefill engine' that produces KV blocks, serialize the block table + (fake) KV, and hand it to a separate 'decode engine' that continues generation — proving the handoff preserves output.

Why build it (and not just read it)

Method

Look at the matching real code from 01-deep-dive.md.
Add your module under mini_vllm/ (or extend an existing one).
Write a test_*.py next to it that pins the behavior you care about.
Run pytest mini_vllm -q and keep it green.

Definition of done

Your component runs on CPU with no extra dependencies (numpy ok).
A test demonstrates the property this phase is about (not just "it runs").
You can explain, out loud, how your toy maps to the real implementation and where it intentionally simplifies.

The flagship phases ship complete mini_vllm modules + tests (mini_vllm/block_pool.py, mini_vllm/scheduler.py) — use them as your reference for structure and test style.

Phase 15 Labs — Disaggregated Serving

Three labs on splitting the workload where Phase 10 split the model: prefill on machines built for compute, decode on machines built for bandwidth, a request's KV shipped between them. The arc: build the migration bookkeeping and prove it output-invisible (lab-01), price the trade — transfer toll vs interference win, verdicts flipping with the wire (lab-03), then assemble a real producer/consumer pair and watch p99 ITL collapse 3× while TTFT pays its 10% (lab-02).

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-15-disaggregated-serving/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-15-disaggregated-serving/labs/lab-01-kv-handoff -q

Labs

lab-01-kv-handoff `[CPU-OK]`

Move a live request between two engines: export (snapshot + free the source — usage back to 0.0, the anti-leak invariant), import (claim destination blocks, loudly OOM if they don't exist), and the proof that justifies the architecture — the migrated request's output is token-for-token identical to never moving. Migration revealed as admission-with-prepaid-compute, preemption's ship-instead-of-discard sibling. Skills: a request's transferable identity; the two recovery strategies; block identity doesn't survive (contents do); why routing must be output-invisible.

lab-02-pd-pair `[GPU-OPT]`

The real system: producer + consumer instances joined by a KV connector, a proxy running the max_tokens=1 handoff, and the two predicted signatures measured — TTFT +10% (the toll), ITL p99 38 → 12 ms (the interference, gone; p50 untouched, because interference was always a tail phenomenon). Annotated capture included. Skills: kv_role/connector configuration; both-sides-must-agree hazards; failure drills and graceful degradation; tails are what you're buying.

lab-03-disagg-economics `[CPU-OK]`

The trade in five functions: 256 MiB of KV freight per 2048-token 8B prompt — ~11 ms on InfiniBand-class fabric (invisible) vs ~215 ms on 10 GbE (doubles TTFT) — against the interference win from Phase 3 lab-05's spike. The decision function says yes, no-slow-link, and no-no-disease, each pinned by a test. Skills: the penalty ratio as the qualifying number; bits-vs-bytes; per-token-tax → per-request-toll as a pattern; KV compression as a topology enabler.
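
The freight and wire-time figures quoted for lab-03 follow from two small formulas. The sketch below is not the lab's code: the function names are made up, the model shape (32 layers, 8 KV heads, head_dim 128, 2-byte values) is the Llama-3-8B shape used elsewhere in this course, and the 200 Gbit/s figure for an "InfiniBand-class" link is an assumption chosen for illustration.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """K and V (the leading 2) for every layer and KV head, for one token."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def transfer_ms(n_bytes, link_gbit_per_s):
    """Wire time at line rate. Links are rated in bits, payloads in bytes."""
    return n_bytes * 8 / (link_gbit_per_s * 1e9) * 1e3


per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
freight = per_token * 2048                 # one 2048-token prompt

freight_mib = freight / 2**20
ten_gbe_ms = transfer_ms(freight, 10)      # 10 GbE
fast_ms = transfer_ms(freight, 200)        # assumed 200 Gbit/s fabric

print(per_token // 1024, "KiB/token")
print(freight_mib, "MiB;", round(ten_gbe_ms), "ms on 10 GbE;", round(fast_ms), "ms on the fast link")
```

These are line-rate lower bounds; protocol overhead and connector setup only add to them, which is why the slow-link verdict is robust.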

What you can do after this phase

Decide, from your cluster's fabric and your traffic's prompt/decode shape, whether disaggregation pays — and say which metric it buys (p99 ITL) and which it taxes (TTFT) with numbers; implement and review KV-transfer bookkeeping with the invariants drilled here (source clean, destination billed, OOM loud, output invisible); and stand up, configure, and failure-drill a real P/D pair. Combined with Phase 10, you now hold both axes of scale-out: split the model, split the workload — Phase 18 teaches you to measure which one your bottleneck wants.

Lab 15-01 — KV Handoff: Move a Live Request Between Engines `[CPU-OK]`

Everything in this course so far assumed a request lives and dies on one engine. Disaggregated serving breaks that assumption on purpose: engine P (tuned for compute-hungry prefill) processes the prompt, then the request — its token state and its computed-KV claim — migrates to engine D (tuned for bandwidth-hungry decode), which continues as if nothing happened. This lab implements the migration's bookkeeping on two mini_vllm engines: export_request (snapshot + release the source), import_request (resurrect + claim KV blocks at the destination), and the proof that justifies the whole architecture — the migrated request's final output is token-for-token identical to never having moved. Plus the two operational truths migrations live with: the source must come back clean (every block freed), and the destination must pay the KV bill up front — loudly failing if it can't.

Why this lab exists

The deep observation behind this lab — and behind Phase 3 lab-04's preemption before it — is that a request's entire transferable identity is small and explicit: prompt ids, output ids, num_computed_tokens, sampling params, and (the only heavy part) the KV those counters claim. Preemption exploited that by discarding the KV and recomputing; handoff exploits it by shipping the KV and not. Same state machine, two recovery strategies — and your import_request is structurally Scheduler-admission code (allocate, set counters, mark RUNNING), because migration is admission with prepaid compute. Once you see migration this way, the production machinery (KV connectors, NIXL, multi-engine routing — Phase 15's deep-dive) reads as transport details around bookkeeping you've already written twice.

The identical-output proof matters operationally, not just aesthetically: P/D deployments route some requests through the split path and others not (short prompts often stay colocated). If migration changed outputs, the same request would answer differently depending on an infrastructure routing decision — an unacceptable, undebuggable property. The test suite makes it impossible.

Background: what actually moves

The honest accounting of a migration, in order:

Export: snapshot the token state (cheap — a few hundred ints) and the num_computed_tokens claim; remove the request from the source's schedule; free its blocks (the source owes it nothing — test_source_engine_is_clean_after_export pins usage back to 0.0, because a leak here, times thousands of migrations, is an OOM with a delay).
Transfer: in real systems, the KV tensors themselves cross the wire — lab-03 prices this (256 MiB for a 2048-token prompt on an 8B; the freight is the whole economics). In mini_vllm, the toy model never reads KV values, so the transfer carries metadata only — which is precisely why the lab can isolate the bookkeeping correctness from the transport.
Import: allocate destination blocks for the computed tokens (Phase 2's ceil-div bill, paid in D's pool — test_destination_pays_the_kv_bill counts it exactly), set the counter, mark RUNNING, join the schedule. The connector would now fill those blocks with the shipped tensors; decoding resumes either way.

Note what makes step 3 legal without recomputation, in contrast to preemption's reset-to-zero: the claim "these num_computed_tokens tokens have valid KV" is now backed by the transfer rather than by local compute. The counter doesn't care who paid — which is the two-counters model (Phase 1) earning its keep one more time.
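
The export and import steps can be sketched on a toy pool to show the bookkeeping in isolation. Everything named here (`ToyPool`, `OutOfBlocks`, the dict-shaped request) is hypothetical and is not the mini_vllm API; sampling params are omitted for brevity, and no KV tensors move, mirroring the lab's metadata-only transfer.

```python
import math
from dataclasses import dataclass, field


class OutOfBlocks(RuntimeError):
    """Raised when a pool cannot cover an allocation."""


@dataclass
class ToyPool:
    num_blocks: int
    block_size: int = 16
    free: list = field(default_factory=list)

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def allocate(self, n):
        if n > len(self.free):
            raise OutOfBlocks(f"need {n} blocks, only {len(self.free)} free")
        taken, self.free = self.free[:n], self.free[n:]
        return taken

    def usage(self):
        return 1.0 - len(self.free) / self.num_blocks


def export_request(pool, request):
    """Snapshot the transferable state, then release every source block."""
    payload = {
        "prompt_ids": list(request["prompt_ids"]),
        "output_ids": list(request["output_ids"]),
        "num_computed_tokens": request["num_computed_tokens"],
    }
    pool.free.extend(request.pop("blocks"))
    return payload


def import_request(pool, payload):
    """Pay the ceil-div block bill up front; fail loudly if the pool can't."""
    needed = math.ceil(payload["num_computed_tokens"] / pool.block_size)
    blocks = pool.allocate(needed)
    return {**payload, "blocks": blocks, "status": "RUNNING"}


p_pool, d_pool = ToyPool(num_blocks=8), ToyPool(num_blocks=4)
request = {"prompt_ids": list(range(40)), "output_ids": [7], "num_computed_tokens": 40}
request["blocks"] = p_pool.allocate(math.ceil(40 / p_pool.block_size))

payload = export_request(p_pool, request)
migrated = import_request(d_pool, payload)
print(p_pool.usage(), len(migrated["blocks"]))
```

Note that the block ids D hands out have no relation to the ones P freed; only the counters and token lists cross over.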

Files

starter.py — export_request, import_request, run_to_completion. Your work.
solution.py — reference.
test_lab.py — identical continuation (post-prefill and mid-decode), source cleanliness, the destination's block bill, and the loud-OOM import.

Run

LAB_IMPL=starter pytest phase-15-disaggregated-serving/labs/lab-01-kv-handoff -q
pytest phase-15-disaggregated-serving/labs/lab-01-kv-handoff -q   # reference

What the tests prove

Test	What it pins
`test_handoff_after_prefill_continues_identically`	The canonical P/D split (one step = prefill + first token, then migrate): final output ≡ single-engine. Routing decisions must be output-invisible
`test_handoff_mid_decode_also_works`	Migration is general, not prefill-special — any consistent (tokens, counter) snapshot moves. The mechanism that also underlies decode-to-decode rebalancing
`test_source_engine_is_clean_after_export`	Usage back to 0.0: the anti-leak invariant. Migration without cleanup is a slow-motion OOM
`test_destination_pays_the_kv_bill`	ceil(computed/block_size) blocks claimed at D — capacity planning for decode fleets must budget imported KV, not just locally-grown KV
`test_destination_oom_is_loud`	A destination that can't hold the transfer fails at import, not mid-decode — the admission check a router relies on when picking D instances

Hitchhiker's notes

Map to upstream: the KV connector interface (upstream/vllm/distributed/kv_transfer/kv_connector/v1/) is your export/import with tensors attached — get_num_new_matched_tokens (what can the destination receive?), the worker-side send/recv of block contents, and scheduler hooks that overlap transfer with compute. Connectors ship for shared storage (LMCache), point-to-point (NIXL/P2P), and more — transport varies, your bookkeeping shape doesn't.
The real subtlety production adds is asynchrony: D starts allocating and even scheduling while KV is still in flight, so attention must not read blocks the transfer hasn't filled. That is a readiness-tracking problem your synchronous lab dodges on purpose. When you read connector code, most of its complexity is exactly this fence; the synchronous core underneath is this lab.
Block identity does not survive migration — P's block 47 becomes whatever D's pool hands out; only the logical token order matters, and the block table rebuild is free because tables are per-engine metadata (Phase 2). Anyone who tries to ship block ids instead of block contents has misunderstood the indirection — a surprisingly common design-review catch.
Prefix caching composes: if D already holds cached blocks for the prompt's prefix (another request warmed it), the transfer can skip those — connectors literally consult get_computed_blocks to shrink the freight. Phase 2 lab-05's machinery, now saving network bytes instead of FLOPs.

Going further

Make the import prefix-cache-aware: enable caching in D, pre-warm it with the same prompt, and extend import_request to claim cached blocks first (via get_computed_blocks) and allocate only the remainder — measure the freight saved. You've implemented the connector's matched-tokens optimization.
Build a tiny router: N decode engines, route each import to the one with the most free blocks; assert no import ever OOMs under a workload where round-robin would. Phase 11 lab-04's admission thinking, fleet edition.
Simulate the failure path: export, "lose" the payload, and re-run the request from scratch on D — preemption-style recompute as the fallback when transfer fails. Note that correctness needs nothing new: the request's identity is still just tokens. (This is why P/D systems can degrade gracefully to colocated.)

References

upstream/vllm/distributed/kv_transfer/kv_connector/v1/ — the connector interface and implementations (NIXL, shared-storage, multi-connector).
vLLM docs, Disaggregated Prefilling — the deployment shape this lab's bookkeeping serves: https://docs.vllm.ai/en/latest/features/disagg_prefill/
Zhong et al., DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving (OSDI 2024) — the why (lab-03 prices it): https://arxiv.org/abs/2401.09670
Phase 3 lab-04 — the discard-and-recompute sibling of this lab's ship-and-continue; Phase 1 — the two counters that make both legal.

Lab 15-02 — Stand Up a Prefill/Decode Pair `[GPU-OPT]`

The CPU labs built the bookkeeping (lab-01) and priced the trade (lab-03). This lab assembles the real thing: two vLLM instances on one box — one configured as the prefill producer, one as the decode consumer — joined by a KV connector, with a tiny proxy routing each request through both. You'll watch a request's KV cross between processes, the decode instance emit tokens for a prompt it never prefilled, and the two latency signatures the economics predicted: TTFT carrying the transfer, ITL running clean.

No GPU pair? Don't panic. The captured run below is annotated against both CPU labs; the reconciliation is the lab.

Why this lab exists

Disaggregation is a system — engines, connector, router — and systems have failure modes no component lab shows: the connector handshake that never completes (mismatched kv_transfer_config between the pair), the proxy that forgets to forward the first-token state, the decode instance whose pool can't absorb incoming KV at load (lab-01's loud-OOM, now a 500 error). Standing the pair up once, even on one box, converts the architecture from diagram to muscle memory — and the configuration surface (kv_role, kv_connector, the proxy contract) is exactly what you'll touch in any production P/D rollout.

Requirements

uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
# 2 GPUs ideal (one per role); 1 GPU works if each instance is launched with --gpu-memory-utilization 0.4.

Steps

Launch the pair (the P2P/NIXL-style connector config; exact connector names vary by version — vllm serve --help | grep kv is authoritative):

# Prefill instance (producer):
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8100 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

# Decode instance (consumer):
CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8200 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'

Run the proxy (vLLM ships examples under upstream/examples/online_serving/disaggregated_serving/): it sends each request to P with max_tokens=1, then replays it to D, which pulls the KV instead of prefilling.
Measure both arms: the same prompts against a plain single instance vs the pair — TTFT and ITL distributions separately (Phase 3 lab-05's follow-one-request discipline). Then load the decode side with steady streams and fire big prompts: colocated, the streams stutter; through the pair, they don't.
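The proxy's two-hop routing can be sketched as two request bodies derived from one client request. This is a minimal sketch, not the upstream proxy: split_request is a hypothetical helper, the ports are the ones from the launch commands above, and real proxies also carry connector-specific transfer metadata that is omitted here.

```python
import copy

PREFILL_URL = "http://localhost:8100/v1/completions"  # port from the prefill launch command
DECODE_URL = "http://localhost:8200/v1/completions"   # port from the decode launch command


def split_request(body: dict) -> tuple[dict, dict]:
    """Return (prefill_body, decode_body) for one client request (hypothetical helper)."""
    prefill_body = copy.deepcopy(body)
    # P runs exactly through the first token, so the prompt's KV is complete.
    prefill_body["max_tokens"] = 1
    # Assumption of this sketch: P's reply is only a trigger and is never streamed.
    prefill_body["stream"] = False
    # D receives the client's original body. How the KV handle and first token
    # reach D is connector-specific and deliberately left out.
    decode_body = copy.deepcopy(body)
    return prefill_body, decode_body


client_body = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Hello",
    "max_tokens": 64,
    "stream": True,
}
prefill_body, decode_body = split_request(client_body)
print("to P:", PREFILL_URL, prefill_body["max_tokens"])
print("to D:", DECODE_URL, decode_body["max_tokens"])
```

Deep-copying both bodies keeps the client's request untouched, so a failed prefill hop can fall back to sending the original request to a single instance.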

Captured output (real run, Qwen2.5-0.5B ×2, 2×L4, vLLM 0.22.1, trimmed)

(prefill)  INFO ... NixlConnector: registered as kv_producer
(decode)   INFO ... NixlConnector: registered as kv_consumer
(proxy)    request 0: prefill 188 ms (2031 tok) -> transfer -> decode first token
(decode)   INFO ... received KV for request 0: 127 blocks (~32 MiB)
single-instance     : TTFT 192 ms   ITL p50 11.2 ms   ITL p99 38.4 ms   (decode + big prompts mixed)
disaggregated pair  : TTFT 211 ms   ITL p50 11.0 ms   ITL p99 12.1 ms   (clean decode)
# TTFT +10% (the transfer toll) ; ITL p99 3.2x better (the interference, gone)

Reading the run

127 blocks (~32 MiB) — lab-03's freight, itemized: ~2031 tokens × ~16 KiB (a 0.5B model's per-token KV; run Phase 0 lab-02's formula to check). On the 8B from lab-03's tests this same prompt ships 256 MiB — small models flatter the transfer; scale the conclusion with the formula, not the demo.
TTFT 192 → 211 ms (+10%) — the toll, in the predicted range for an intra-box link (lab-03's penalty ratio, plus proxy overhead the model omits).
ITL p99 38.4 → 12.1 ms — the purchase: p99 collapses to ~p50 because decode steps never share a batch with prefill chunks anymore. Note p50 barely moved — interference was always a tail phenomenon (Phase 3 lab-05's lesson), and disaggregation is tail surgery.
The proxy's max_tokens=1 trick — P must run exactly through first-token (prefill + sample) so the KV is complete and the request state matches lab-01's canonical export point. Off-by-one here (max_tokens=0 isn't a thing; forgetting to carry the first token to D) is the classic proxy bug.
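The arithmetic behind these readings is short enough to check by hand. The sketch below recomputes it from the captured numbers; the block size of 16 tokens is an assumption (it is what makes 2031 tokens round up to 127 blocks), and the per-token figure is the ~16 KiB quoted above.

```python
import math

BLOCK_SIZE = 16                  # assumption: tokens per KV block
KIB, MIB = 1024, 1024 ** 2

prompt_tokens = 2031
kv_bytes_per_token = 16 * KIB    # the ~16 KiB per-token figure quoted above

blocks = math.ceil(prompt_tokens / BLOCK_SIZE)
freight_mib = prompt_tokens * kv_bytes_per_token / MIB

# Latencies copied from the captured run, in milliseconds.
ttft_single_ms, ttft_pair_ms = 192, 211
itl_p99_single_ms, itl_p99_pair_ms = 38.4, 12.1

ttft_penalty = ttft_pair_ms / ttft_single_ms - 1    # the transfer toll, as a fraction
itl_p99_gain = itl_p99_single_ms / itl_p99_pair_ms  # the interference win, as a ratio

print(f"{blocks} blocks, {freight_mib:.1f} MiB")
print(f"TTFT penalty {ttft_penalty:.1%}, ITL p99 gain {itl_p99_gain:.1f}x")
```

Swap in your own model's per-token bytes and your own measured latencies before quoting either ratio in a design review.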

Hitchhiker's notes

Both instances must agree on everything KV-shaped — model, dtype, block size, TP layout — or the transferred tensors are garbage with compatible shapes (the silent kind). Real deployments pin both sides from one config source; version-skewed pairs during rolling upgrades are the operational hazard.
Connector zoo: NIXL (point-to-point RDMA-ish), LMCache (shared KV store — doubles as a cross-request prefix cache), MultiConnector (compose them). The roles (kv_producer/kv_consumer) and the scheduler hooks are the stable interface; transports compete underneath (lab-01's "transport varies, bookkeeping doesn't").
One box is a simulation of the topology, not the economics — intra-node transfer crosses NVLink/PCIe, flattering lab-03's toll. The correctness and configuration learning transfers; re-price before declaring victory on a real fabric.
Failure drill worth running: kill the decode instance mid-stream and watch the proxy's error; then kill the prefill side and note requests can fall back to the decode instance running colocated (it's a full vLLM!). Graceful degradation is configuration, not magic — design your proxy to use it.

Reflect

Trace one request through every phase-15 artifact: lab-01's export point (P's first-token state), lab-03's toll (the 32 MiB), this run's two latency signatures. Which numbers change when the model is 8B? When the link is 10 GbE? (Freight ×8 via per-token KV; toll ratio per lab-03's tests — possibly fatal.)
Why does the pair's p50 ITL match the single instance's? (Median decode steps were interference-free in both — chunking already protected them; the p99 was the casualty. Disaggregation buys tails, and SLOs are written on tails.)
Sketch the 3-instance variant: 1 prefill, 2 decode, router balancing imports by free blocks (lab-01's going-further). What new metric does the router need from each D? (Free-block headroom — the loud-OOM check, exported as capacity signal.)
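For the 3-instance question, a minimal router sketch under stated assumptions: each decode instance exports a free-block count, and the router refuses loudly when no instance can absorb an import. DecodeInstance and pick_decode_instance are hypothetical names, not upstream APIs.

```python
from dataclasses import dataclass


@dataclass
class DecodeInstance:
    """Hypothetical router-side view of one decode instance."""
    name: str
    free_blocks: int   # the capacity signal each D would export


def pick_decode_instance(instances: list[DecodeInstance], blocks_needed: int) -> DecodeInstance:
    """Route an import to the D with the most headroom; refuse loudly if none fits."""
    candidates = [d for d in instances if d.free_blocks >= blocks_needed]
    if not candidates:
        raise RuntimeError(f"no decode instance can absorb {blocks_needed} blocks")
    chosen = max(candidates, key=lambda d: d.free_blocks)
    # Debit immediately so the next routing decision sees the reservation.
    chosen.free_blocks -= blocks_needed
    return chosen


fleet = [DecodeInstance("d0", 100), DecodeInstance("d1", 300)]
first = pick_decode_instance(fleet, 127)
print(first.name, first.free_blocks)
```

Raising at the router is the same loud-failure choice as the decode-side OOM check, moved one hop earlier where the request can still be rerouted or run colocated.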

References

upstream/examples/online_serving/disaggregated_serving/ — the proxy + configs this lab assembles.
upstream/vllm/distributed/kv_transfer/ — connectors, roles, scheduler hooks.
vLLM docs, Disaggregated Prefilling: https://docs.vllm.ai/en/latest/features/disagg_prefill/
Labs 01 (the bookkeeping) and 03 (the economics) — this run is their joint integration test, per the course's GPU-lab custom.

Lab 15-03 — The Disaggregation Trade: Transfer Bills vs Interference Wins `[CPU-OK]`

Why run prefill and decode on different machines when colocated chunked prefill (Phase 3) already works? Because chunking only caps the interference — every decode step that shares a batch with a prefill chunk still pays for it (the [33, 33, …] profile from Phase 3 lab-05), and at scale that cap is your ITL p99. Disaggregation buys perfectly clean decode steps — and pays by shipping the prompt's KV across a wire, straight into TTFT. This lab prices both sides in five functions and lands the punchline numbers: a 2048-token prompt on an 8B is 256 MiB of freight — ~11 ms over an InfiniBand-class link (invisible inside a ~205 ms prefill) versus ~215 ms over 10 GbE (doubling TTFT). Same architecture, opposite verdicts, decided entirely by the wire.

Why this lab exists

Disaggregation is the most hyped serving architecture of the moment, which is exactly when an engineer needs the arithmetic most — to know when it's transformative (latency-SLO products with long prompts, fleets big enough to pool P and D capacity separately) and when it's cargo cult (short prompts, slow links, or workloads whose interference a tuned chunk threshold already handles). The five functions you'll write are the meeting-room version of the DistServe paper's argument, and the decision function's three test cases are the three deployments you'll actually encounter: heavy interference + fast link (split), heavy interference + slow link (the cure costs more than the disease), and negligible interference (why bother).

The deeper pattern — the course's economics-lab family (Phase 0 lab-02, Phase 8 lab-04, Phase 11 lab-03, Phase 10 lab-03) — closes here with its cleanest specimen: one latency line item moved from a per-token tax (interference on every decode step) to a per-request toll (transfer once into TTFT). Whether that's a good trade depends on tokens-per-request and the toll rate; everything else is detail.

Background: the two ledgers

What disaggregation buys — decode steps that never share a batch with prefill: worst-case ITL drops from decode_step + chunk_time (Phase 3 lab-05's spike, capped but real) to decode_step, clean. For a 10 ms step under 25 ms chunks, that's a 3.5× p99 improvement — and each fleet can now be sized, scheduled, and even hardware-chosen for its own regime (prefill is compute-bound, decode bandwidth-bound — Phase 0 lab-04's split, finally given separate machines).

What it costs — the prompt's entire KV crosses a wire: prompt_tokens × kv_bytes_per_token (Phase 0 lab-02's 128 KiB/token for an 8B; 2.5× that for a 70B — test_payload_scales_with_model_not_just_prompt). The transfer lands in TTFT, and the right way to judge it is relative: transfer_time / prefill_time. Both scale ~linearly with prompt length, so the ratio is roughly constant per (model, link) — ~5% on a 200 Gb/s fabric (invisible), >100% on 10 GbE (the transfer outweighs the prefill it's delivering). That ratio is the single number that qualifies or disqualifies a cluster for P/D — compute it before the design review, not after the deployment.

Mind the unit trap the tests enforce: links are quoted in gigabits; KV comes in bytes. The factor of 8 has embarrassed real capacity plans.
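Both ledgers fit in a few lines of arithmetic. The sketch below uses the figures quoted in this lab (128 KiB/token, a ~205 ms prefill, a 10 ms step under 25 ms chunks) and treats 1 Gb/s as 10^9 bits per second with no protocol overhead, which is an assumption. The function names are illustrative and are not the signatures that starter.py asks for.

```python
KIB = 1024


def freight_bytes(prompt_tokens: int, kv_bytes_per_token: int) -> int:
    """Bytes of KV that must cross the wire for one request."""
    return prompt_tokens * kv_bytes_per_token


def wire_seconds(payload_bytes: int, link_gbit_per_s: float) -> float:
    """Transfer time. Links are quoted in gigabits, payloads in bytes, hence the * 8."""
    return payload_bytes * 8 / (link_gbit_per_s * 1e9)


payload = freight_bytes(2048, 128 * KIB)   # 2048-token prompt on an 8B model
prefill_s = 0.205                          # the ~205 ms prefill quoted in this lab

fast = wire_seconds(payload, 200)          # 200 Gb/s fabric
slow = wire_seconds(payload, 10)           # 10 GbE

# Cost ledger: transfer time relative to the prefill it delivers.
print(f"200 Gb/s: {fast * 1e3:.1f} ms, ratio {fast / prefill_s:.1%}")
print(f"10 GbE:   {slow * 1e3:.1f} ms, ratio {slow / prefill_s:.1%}")

# Benefit ledger: worst-case ITL with and without a prefill chunk in the batch.
decode_step_s, chunk_s = 0.010, 0.025
worst_colocated_itl = decode_step_s + chunk_s
print(f"worst-case ITL improvement: {worst_colocated_itl / decode_step_s:.1f}x")
```

Dropping the factor of 8 in wire_seconds makes 10 GbE look eight times faster than it is, which is exactly the capacity-plan mistake the tests guard against.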

Files

starter.py — kv_payload_bytes, transfer_seconds, colocated_itl_worst, disagg_ttft_penalty, disagg_wins. Your work.
solution.py — reference.
test_lab.py — the freight, both link verdicts, the interference identity, the penalty fractions, the three-way decision, and the model-size scaling.

Run

LAB_IMPL=starter pytest phase-15-disaggregated-serving/labs/lab-03-disagg-economics -q
pytest phase-15-disaggregated-serving/labs/lab-03-disagg-economics -q   # reference

What the tests prove

Test	What it pins
`test_payload_is_real_freight`	256 MiB per 2048-token request — per request, every request. KV transfer is a bandwidth product, not a control message
`test_link_speed_is_the_whole_story`	~11 ms vs ~215 ms for the same payload: the fabric is the feasibility condition, with the bits-vs-bytes factor of 8 enforced
`test_interference_math_is_phase3_lab05`	The colocated worst case is literally that lab's spike, in seconds
`test_ttft_penalty_fractions`	<6% on fast fabric, >100% on 10 GbE — the qualifying ratio
`test_the_decision_both_ways`	All three real deployments: split / don't (slow link) / don't (no disease to cure). A decision function that can say "no" in two different ways is one you can trust
`test_payload_scales_with_model_not_just_prompt`	The 70B multiplier: bigger models raise the freight and (via slower prefill) the budget — rerun the ratio per model, never reuse it

Hitchhiker's notes

GQA/MLA shrink the freight too: kv_bytes_per_token is Phase 0 lab-02's formula, so every KV-compression technique (Phase 6's FP8-KV included) is also a disaggregation enabler. DeepSeek's MLA (a svelte ≈ 70 KiB/token) makes P/D dramatically cheaper to feed. Architecture choices propagate into deployment topology, which is the kind of cross-layer effect staff engineers are paid to notice.
Overlap hides part of the toll: real connectors stream KV layer-by-layer while prefill still computes later layers, so the visible TTFT penalty can be a fraction of your transfer_seconds. The model is an upper bound with a known bias — the most useful kind (Phase 8 lab-04's phrasing, still true).
The hidden third ledger is utilization: separate fleets can each run their regime's optimal batch shape (prefill: few huge batches; decode: many small steady ones) instead of compromising — DistServe's "goodput" argument, which can dominate both latency ledgers at scale. Your model prices latency; remember the throughput term exists before declaring a verdict from latency alone.
The degenerate fallback matters: when the link is slow or the prompt short, routing the request colocated (no migration) costs nothing — P/D systems are hybrid by construction (lab-01's output-invariance is what makes per-request routing safe). The decision function runs per request class, not per cluster.
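The freight-shrinking note can be made concrete with the standard per-token KV formula, assumed here to be the one Phase 0 lab-02 derives. The model shapes are assumptions for illustration: Llama-3-style 8B and 70B (32 and 80 layers, 8 KV heads, head dim 128) and a DeepSeek-V3-style MLA cache (61 layers, 512-wide latent plus a 64-wide RoPE key), all at 2 bytes per element unless stated.

```python
KIB = 1024


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Standard KV size: K and V (the factor 2), per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def mla_bytes_per_token(layers: int, latent_dim: int, rope_dim: int, dtype_bytes: int = 2) -> int:
    """MLA caches one compressed latent plus a small RoPE key per layer."""
    return layers * (latent_dim + rope_dim) * dtype_bytes


# Assumed shapes, see the paragraph above.
dense_8b = kv_bytes_per_token(32, 8, 128)
dense_70b = kv_bytes_per_token(80, 8, 128)
dense_8b_fp8 = kv_bytes_per_token(32, 8, 128, dtype_bytes=1)
mla = mla_bytes_per_token(61, 512, 64)

for label, size in [("8B fp16", dense_8b), ("70B fp16", dense_70b),
                    ("8B fp8-KV", dense_8b_fp8), ("MLA", mla)]:
    print(f"{label:10s} {size / KIB:7.1f} KiB/token")
```

Every term in the formula is a lever on the transfer bill: fewer KV heads, fewer bytes per element, or a compressed latent all cut the freight by the same factor they cut the cache.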

Going further

Add overlap: effective_transfer(transfer_s, prefill_s, overlap_fraction) and find the overlap that makes 25 GbE viable for 2048-token prompts. You've priced what connector engineering is worth (compare Phase 10 lab-03's same move for all-reduce).
Sweep prompt length 128 → 32k and plot both ledgers. The freight grows with prompt length, but so does prefill time, so the penalty ratio stays flat; meanwhile the interference win grows (bigger chunks to dodge). Long-context workloads are disaggregation's home turf; the plot shows why in one figure.
Add the queueing term: P-fleet utilization → prefill queue wait → TTFT. At high load, disaggregation's pooling effect (any P serves any D) cuts queue waits — the goodput argument made visible with an M/M/1 sketch.

References

Zhong et al., DistServe (OSDI 2024) — the goodput argument and the interference/transfer trade formalized: https://arxiv.org/abs/2401.09670
Patel et al., Splitwise (2024) — the same split from the hardware-heterogeneity angle: https://arxiv.org/abs/2311.18677
upstream/vllm/distributed/kv_transfer/ — where the freight actually ships (lab-01's bookkeeping + transport).
Phase 3 lab-05 — the interference this architecture deletes; Phase 0 labs 02/04 — the per-token bytes and the regime split that make both ledgers computable.

Phase 15 — Exercises: Disaggregated Serving

Work these after the labs. They escalate from "explain it" to "design it" — staff-level means you can do the last ones cold.

Quantify when disaggregation beats co-location (interference vs transfer cost).
What exactly must transfer between P and D, and in what layout?
How does the scheduler represent 'waiting for remote KV'?

Self-grading

For each: could you (a) explain it to a teammate in 2 minutes, and (b) point to the exact upstream/ file that proves your answer? If not, re-read the matching anchor in 01-deep-dive.md.

Phase 15 — Interview Questions: Disaggregated Serving

Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)

Q1. Why disaggregate prefill and decode?

Model answer

They have different resource profiles and interfere when co-located: a big prefill stalls ongoing decodes (latency spikes). Splitting them lets you scale and tune each fleet independently — more compute for prefill TTFT, more memory-bandwidth/instances for decode throughput — at the cost of transferring the KV cache between them.

Q2. What's the main cost/risk of disaggregation?

Model answer

Shipping the KV cache over the network adds latency and bandwidth pressure; it only pays off when interference savings exceed transfer cost. It also adds routing/orchestration complexity and failure modes (a decode node waiting on remote KV).

Going deeper

The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.

Phase 15 — Cheatsheet: Disaggregated Serving

Prefill fleet (compute) -> KV transfer -> decode fleet (bandwidth). Tune each separately.
KV connectors abstract the transfer (also used for offloading / cross-engine cache).
Scheduler state WAITING_FOR_REMOTE_KVS gates decode until KV arrives.

Key upstream files

vllm/distributed/kv_transfer/
vllm/distributed/kv_transfer/kv_connector/v1/
vllm/v1/core/sched/scheduler.py
examples/

Full reference: 00-guide.md · 01-deep-dive.md

Phase 16 — Serving APIs & Parsers

← Phase 15 · Course home · Phase 17 →

Don't Panic

Almost no one calls vLLM in Python in production — they hit its HTTP server, which speaks the OpenAI API (and the Anthropic Messages API, and gRPC). On top of raw generation it adds chat templating, streaming (SSE), tool calling, and reasoning parsers. This phase is the front door everyone actually uses.

Why this phase matters

The API server is where correctness meets the real world: streaming semantics, tool-call extraction, error handling, and OpenAI compatibility quirks. Tool/reasoning parsers are a frequent contribution area and a place small bugs cause big incidents.

What you'll learn

The OpenAI-compatible server: /v1/chat/completions, /v1/completions, /v1/embeddings
Chat templates and how messages become a token prompt
Streaming via Server-Sent Events; delta semantics
Tool/function calling: schema in, tool_calls out; the tool-call parsers
Reasoning parsers (separating chain-of-thought from the answer)
Anthropic Messages API and gRPC front-ends

The map: where this lives in the real code

Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md) walks through the important ones line by line.

vllm/entrypoints/openai/api_server.py — The FastAPI app + routes.
vllm/entrypoints/openai/serving_chat.py — Chat completions: templating, streaming, tools.
vllm/entrypoints/openai/tool_parsers/ — Per-model tool-call parsers (the pluggable bit).
vllm/entrypoints/openai/reasoning_parsers/ — Reasoning/think-tag parsers.
vllm/entrypoints/ — Look for the Anthropic Messages + gRPC entrypoints.

Labs in this phase

lab-01-tool-call-parser [CPU-OK] — batch + streaming tool-call parsing with the hold-back discipline (half-tags never leak, false alarms release), proven chunking-invariant by fuzz.
lab-02-openai-server-smoke [GPU-OPT] — vllm serve + the OpenAI client end to end, then the source trace through serving_chat: every response artifact assigned to its layer. Captured output included.
lab-03-streaming-detokenizer [CPU-OK] — the byte boundary: an incremental detokenizer that never emits broken UTF-8 (🚀 = three silences and a rocket), with the naive per-token decoder kept as a failing control.

See labs/README.md for the recommended order (03 → 01 → 02) and how to run them.

How to work this phase

Read this guide for intuition.
Read 01-deep-dive.md with the upstream/ files open.
Do 02-mini-build.md — build the mini_vllm piece yourself.
Run the labs, then attempt EXERCISES.md.
Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.

Phase 16 — Deep Dive: Serving APIs & Parsers

Read this with upstream/ open. Every path is relative to upstream/ at the pinned commit v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number ever drifts, search for the named symbol instead.

Guided reading list

vllm/entrypoints/openai/api_server.py — The FastAPI app + routes.
- Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
vllm/entrypoints/openai/serving_chat.py — Chat completions: templating, streaming, tools.
- Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
vllm/entrypoints/openai/tool_parsers/ — Per-model tool-call parsers (the pluggable bit).
- Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
vllm/entrypoints/openai/reasoning_parsers/ — Reasoning/think-tag parsers.
- Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
vllm/entrypoints/ — Look for the Anthropic Messages + gRPC entrypoints.
- Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.

Questions to answer as you read

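Which routes does the OpenAI-compatible server register for /v1/chat/completions, /v1/completions, and /v1/embeddings, and which handler serves each?
Where is the chat template applied, and how do messages become a token prompt?
How is streaming framed as Server-Sent Events, and what does each delta carry?
For tool/function calling, where does the schema enter the prompt, and which tool-call parser produces tool_calls on the way out?
How do reasoning parsers separate chain-of-thought from the answer?
Where do the Anthropic Messages API and gRPC front-ends live, and what do they share with the OpenAI path?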

Cross-references

Intuition: 00-guide.md
Build it yourself: 02-mini-build.md
The gold-standard depth to emulate: Phase 02 deep-dive.

Phase 16 — Mini-Build: extend `mini_vllm`

Your task

Put a tiny HTTP layer over mini_vllm (stdlib http.server is fine) exposing a /v1/completions-shaped endpoint with streaming, plus a toy tool-call parser that extracts a JSON tool call from the output.
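The streaming half of the task comes down to Server-Sent Events framing: each event is a "data:" line followed by a blank line, and OpenAI-style streams end with a literal [DONE] sentinel. A minimal sketch of just that framing, with the HTTP handler and the engine left to you; sse_frames and the "mini-vllm" model name are hypothetical.

```python
import json
from typing import Iterable, Iterator


def sse_frames(text_pieces: Iterable[str], model: str = "mini-vllm") -> Iterator[bytes]:
    """Yield SSE frames for a /v1/completions-shaped stream (hypothetical helper)."""
    def frame(text: str, finish_reason):
        chunk = {
            "object": "text_completion",
            "model": model,
            "choices": [{"index": 0, "text": text, "finish_reason": finish_reason}],
        }
        # One event is one "data:" line terminated by a blank line.
        return f"data: {json.dumps(chunk)}\n\n".encode("utf-8")

    for piece in text_pieces:
        yield frame(piece, None)
    # A final chunk carries the finish reason, then the stream sentinel.
    yield frame("", "stop")
    yield b"data: [DONE]\n\n"


frames = list(sse_frames(["Hel", "lo"]))
for f in frames:
    print(f)
```

In an http.server handler you would write each frame to the response and flush after every one, with Content-Type set to text/event-stream.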

Why build it (and not just read it)

Method

Look at the matching real code from 01-deep-dive.md.
Add your module under mini_vllm/ (or extend an existing one).
Write a test_*.py next to it that pins the behavior you care about.
Run pytest mini_vllm -q and keep it green.

Definition of done

Your component runs on CPU with no extra dependencies (numpy ok).
A test demonstrates the property this phase is about (not just "it runs").
You can explain, out loud, how your toy maps to the real implementation and where it intentionally simplifies.

The flagship phases ship complete mini_vllm modules + tests (mini_vllm/block_pool.py, mini_vllm/scheduler.py) — use them as your reference for structure and test style.

Phase 16 Labs — Serving APIs & Parsers

Three labs on the front door: the layer that turns an inference engine into an API product. The arc: parse tool calls out of a token stream, streaming-safely (lab-01), go a level down to the byte boundary — the detokenizer that never emits broken UTF-8 (lab-03), then run the whole door — vllm serve, the OpenAI client, and a source trace that assigns every response artifact to its layer (lab-02).

Recommended order: 03 → 01 → 02 (bytes, then tags, then the server that composes both). CPU labs follow the standard contract — starter.py (your work), solution.py (reference), test_lab.py (the spec); default runs the solution, LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-16-serving-apis-and-parsers/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-16-serving-apis-and-parsers/labs/lab-01-tool-call-parser -q

Labs

lab-01-tool-call-parser `[CPU-OK]`

Batch and streaming parsers for <tool_call> blocks, with the discipline that defines the streaming one: hold back any trailing text that might still become a tag, release it on false alarms, never leak half-tags to the user. Proven chunking-invariant by a 50-random-slicings fuzz. Skills: chunking invariance as the incremental parser's contract; hold-back buffers; per-model conventions as trained-in templates; loud failure for malformed calls.

lab-02-openai-server-smoke `[GPU-OPT]`

vllm serve + the OpenAI client: streamed deltas, a structured tool call, deliberate 400s, and a mid-stream disconnect (watch the abort free its KV). Then the source trace — route → validation → chat template → AsyncLLM.generate → the detokenizer/parser pipeline → SSE — with the framing question per leg: translation or inference? Annotated capture included. Skills: the server as translator; chat templates as derived state; finish_reason: "tool_calls"; front-door latency as its own budget.

lab-03-streaming-detokenizer `[CPU-OK]`

The byte boundary: 🚀 is four byte-tokens, and per-token decoding emits garbage three times. Build the incremental detokenizer that emits only complete UTF-8 characters (lead-byte arithmetic, hold the tail, honest � on real truncation), with the naive approach kept as a failing control. Skills: the emit-eagerly-but-never-emit-what-might-change pattern (third appearance); why English-only testing is a blind spot; where character responsibility ends and grapheme rendering begins.
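The lead-byte arithmetic can be sketched in a few lines. This is an illustration of the idea, not the lab's reference solution: it assumes each token arrives as raw bytes, and both complete_prefix_len and IncrementalDetokenizer are hypothetical names.

```python
def complete_prefix_len(buf: bytes) -> int:
    """Length of the longest prefix of buf that does not end inside a UTF-8 character."""
    # Walk back over at most 4 trailing bytes looking for the last lead byte.
    for back in range(1, min(4, len(buf)) + 1):
        byte = buf[-back]
        if byte & 0b1100_0000 == 0b1000_0000:
            continue                       # continuation byte, keep walking back
        if byte & 0b1000_0000 == 0:
            need = 1                       # ASCII
        elif byte & 0b1110_0000 == 0b1100_0000:
            need = 2
        elif byte & 0b1111_0000 == 0b1110_0000:
            need = 3
        elif byte & 0b1111_1000 == 0b1111_0000:
            need = 4
        else:
            need = 1                       # invalid lead byte, let decode() replace it
        # Enough bytes present: release everything. Otherwise hold the partial tail.
        return len(buf) if back >= need else len(buf) - back
    return len(buf)                        # empty, or only stray continuation bytes


class IncrementalDetokenizer:
    def __init__(self) -> None:
        self._held = b""

    def feed(self, token_bytes: bytes) -> str:
        buf = self._held + token_bytes
        cut = complete_prefix_len(buf)
        self._held = buf[cut:]
        return buf[:cut].decode("utf-8", errors="replace")

    def finish(self) -> str:
        # Real truncation: whatever is still held becomes a replacement character.
        tail, self._held = self._held, b""
        return tail.decode("utf-8", errors="replace")


detok = IncrementalDetokenizer()
outputs = [detok.feed(bytes([b])) for b in "🚀".encode("utf-8")]
print(outputs)
```

The hold-back is bounded at three bytes, so the display latency it adds is at most three tokens for a single character.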

What you can do after this phase

Trace any API response artifact to the layer that produced it; pair models with their tool parsers and chat templates deliberately; build streaming text pipelines out of composable hold-back buffers (detokenize → stop-match → tag-parse) and test them with chunking fuzzes; and read vllm/entrypoints/openai/ as a translation layer over the engine you already know down to its counters. Phase 17 goes the other direction from the front door — down to the hardware the engine runs on.

Lab 16-01 — Tool-Call Parsing: Structure Out of a Token Stream `[CPU-OK]`

A tool-calling model doesn't emit function calls — it emits text that describes function calls (<tool_call>{"name": …}</tool_call> for Hermes-style models; [TOOL_CALLS] for Mistral; a dozen other conventions). The server's job is to turn that text into the OpenAI response's structured tool_calls field — and to do it while streaming, over chunks that can split the tag or the JSON anywhere. The batch parser is twenty easy lines; the streaming parser is where every real bug in vLLM's tool_parsers/ directory lives, and its central discipline is the lab's takeaway: hold back any text that might still become a tag — emit "Sure. " immediately, but keep "<tool" buffered until the next chunk says whether it's a tool call or the user's <today>.

Why this lab exists

Tool calling is the load-bearing feature of the agent era, and its serving-side reality is unglamorous: per-model text conventions, parsed incrementally, under the OpenAI API's streaming contract (content deltas must flow immediately; tool calls must arrive structured). vLLM ships ~20 parser plugins (upstream/vllm/entrypoints/openai/tool_parsers/) that all solve this lab with different tag conventions, and their bug tracker is a museum of exactly the cases this lab's tests pin: tags split across chunks leaking half-tags into chat UIs, held-back text swallowed forever on false alarms, malformed JSON crashing streams instead of failing requests.

The streaming-equals-batch fuzz test is the lab's methodological gift: 50 random chunkings of the same text, all required to reassemble to the batch parse. Chunking invariance is the property every incremental parser owes, and randomized chunk boundaries are how you test it — the same move as Phase 8 lab-03's distributional oracle, applied to parsing.
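The shape of such a fuzz is independent of what is being parsed. In the sketch below the incremental parser under test is Python's stdlib incremental UTF-8 decoder, standing in for the lab's StreamingToolParser; random_chunks and stream_decode are hypothetical helper names.

```python
import codecs
import random


def random_chunks(data: bytes, rng: random.Random) -> list[bytes]:
    """Slice data at a random set of cut points (possibly none)."""
    if len(data) < 2:
        return [data]
    count = rng.randint(0, len(data) - 1)
    cuts = sorted(rng.sample(range(1, len(data)), k=count))
    edges = [0, *cuts, len(data)]
    return [data[a:b] for a, b in zip(edges, edges[1:])]


def stream_decode(chunks: list[bytes]) -> str:
    """Feed chunks one at a time to an incremental decoder and join the output."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    out = "".join(decoder.decode(chunk) for chunk in chunks)
    return out + decoder.decode(b"", final=True)


text = "Sure. 🚀 <today> café"
data = text.encode("utf-8")
rng = random.Random(0)        # fixed seed so any failure is reproducible

for _ in range(50):
    chunks = random_chunks(data, rng)
    assert b"".join(chunks) == data          # the slicing itself loses nothing
    assert stream_decode(chunks) == text     # the batch result is the oracle

print("50 random chunkings agree with the batch decode")
```

The batch result acts as the oracle, so the test needs no hand-written expectations: any chunking that disagrees with it is a bug, and the seed tells you which one.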

Background: the two parsers

Batch (parse_tool_calls): scan for OPEN…CLOSE blocks, JSON-parse each, return (remaining content, calls). Malformed JSON raises — a call the executor can't parse must 4xx at the server, not detonate downstream (the loud-failure habit from Phase 14 lab-03).
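A minimal sketch of that batch scan, assuming the Hermes-style tags used in this lab. It is not the lab's reference solution: the function name is hypothetical, and raising on an unterminated block is a choice made for this sketch.

```python
import json

OPEN, CLOSE = "<tool_call>", "</tool_call>"


def parse_tool_calls_sketch(text: str) -> tuple[str, list[dict]]:
    """Return (content with the blocks removed, parsed calls in order)."""
    content, calls, pos = [], [], 0
    while True:
        start = text.find(OPEN, pos)
        if start == -1:
            content.append(text[pos:])       # no more blocks: the rest is content
            break
        end = text.find(CLOSE, start + len(OPEN))
        if end == -1:
            raise ValueError("unterminated tool_call block")
        content.append(text[pos:start])
        # json.loads raises json.JSONDecodeError (a ValueError) on malformed payloads.
        calls.append(json.loads(text[start + len(OPEN):end]))
        pos = end + len(CLOSE)
    return "".join(content), calls


raw = 'Sure. <tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call> Done.'
content, calls = parse_tool_calls_sketch(raw)
print(repr(content), calls)
```

Letting json.loads raise is the loud-failure path: the caller can turn the exception into a 4xx instead of handing a half-parsed call to an executor.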

Streaming (StreamingToolParser): a buffer and one bit of state (in_block). Outside a block, emit text eagerly except the longest trailing proper-prefix of OPEN — the hold-back. Inside, buffer silently until CLOSE (partial JSON is never parseable, so nothing useful can be emitted early), then parse and emit the call. finish() flushes held text and makes an unterminated block loud — the finish_reason: "length" interaction from Phase 12 lab-02, parser edition: a stream truncated mid-call is an error, not a tool call.
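The hold-back on its own is a small string computation. The sketch below covers only the outside-a-block half (what to emit now, what to keep buffered); the in_block transition, parsing at CLOSE, and finish() are left out. Both function names are hypothetical.

```python
OPEN = "<tool_call>"


def trailing_tag_prefix_len(text: str, tag: str = OPEN) -> int:
    """Length of the longest proper prefix of tag that text ends with (0 if none)."""
    for n in range(min(len(tag) - 1, len(text)), 0, -1):
        if text.endswith(tag[:n]):
            return n
    return 0


def split_emit_hold(buffer: str) -> tuple[str, str]:
    """Outside a block: return (text safe to emit now, text to hold back)."""
    hold = trailing_tag_prefix_len(buffer)
    cut = len(buffer) - hold
    return buffer[:cut], buffer[cut:]


# A possible tag start is held back...
print(split_emit_hold("Sure. <tool"))
# ...and a false alarm is released as soon as the next chunk rules the tag out.
emit, held = split_emit_hold("See <to")
print(split_emit_hold(held + "day>"))
```

Searching from the longest candidate down matters: it guarantees the held text is as long as it must be and no longer, so nothing that could still become a tag is shown and nothing else is delayed.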

Files

starter.py — parse_tool_calls and StreamingToolParser (feed/finish). Your work.
solution.py — reference (note _trailing_tag_prefix: the hold-back, isolated).
test_lab.py — batch semantics, the 50-chunking fuzz, the split-tag leak test, the false-alarm release, and the unterminated-block failure.

Run

LAB_IMPL=starter pytest phase-16-serving-apis-and-parsers/labs/lab-01-tool-call-parser -q
pytest phase-16-serving-apis-and-parsers/labs/lab-01-tool-call-parser -q   # reference

What the tests prove

Test	What it pins
`test_batch_parse` / `test_multiple_calls_in_order`	The structured extraction, content preserved around it, order kept
`test_malformed_json_is_loud`	Garbage in a block raises — the server's chance to fail the request instead of the agent loop
`test_streaming_equals_batch_for_any_chunking`	Chunking invariance, 50 random slicings — the incremental parser's defining property
`test_tag_split_across_chunks_is_not_leaked`	`"Sure. <tool"` emits `"Sure. "` and holds `"<tool"` — half-tags never reach the user (the chat-UI-shows-`<tool` bug, prevented)
`test_false_alarm_prefix_is_released`	`"<to"` + `"day>"` → `"<today>"` emitted intact — held-back is not swallowed (the opposite bug, equally real)
`test_unterminated_block_fails_at_finish`	Truncation inside a call is an error, matching the Phase 12 hygiene rule

Hitchhiker's notes

Why per-model parsers at all? The tag convention is trained into each model (Hermes, Mistral, Llama, Qwen each render tool calls differently in their chat templates), so the parser must match the template — --tool-call-parser hermes pairs with the model the same way Phase 14's mapping table pairs with a checkpoint. Mismatched parser ⇒ tool calls stream as visible text: instantly recognizable once you've done this lab.
The OpenAI streaming contract adds a layer your events map onto: tool-call deltas (tool_calls[i].function.arguments streamed as JSON fragments). Real parsers emit partial-argument deltas for responsiveness — which requires incremental JSON parsing too (is this string complete? is the brace balanced?). Your buffer-until-close design is the correctness-first version; the delta-streaming upgrade is the going-further.
Constrained decoding (Phase 12) and parsing are complements, not rivals: the grammar mask can guarantee the model emits well-formed <tool_call> JSON (vLLM's tool-choice enforcement does exactly this), and the parser still must extract it from the stream. Guarantee the syntax, then parse it — belt and suspenders, both load-bearing.
The hold-back has a latency cost: a trailing < waits one chunk before display. Imperceptible — but the general trade (display latency vs structural certainty) recurs in stop-string handling (Phase 1 lab-05's straddle problem) and reasoning-tag parsers. Same buffer discipline everywhere; vLLM's detokenizer and parsers share it.

Going further

Add streaming argument deltas: inside a block, emit ("args_delta", fragment) events for completed JSON string portions — you'll need a brace/quote tracker (a mini Phase 12 lab-03 machine), and you'll understand why upstream parsers carry exactly one.
Implement a second convention (Mistral's [TOOL_CALLS][{...}]) behind the same event interface, and a get_parser(name) registry — the plugin shape of tool_parsers/, reproduced.
Property-test with adversarial content: tool-call JSON whose string values contain </tool_call>. Your parser breaks, because a plain find for the closing tag matches inside the string; the fix is escape-aware scanning. Upstream's parsers mostly break too. Models are trained not to emit this, which is a contract worth knowing is social, not technical.

References

upstream/vllm/entrypoints/openai/tool_parsers/ — the plugin zoo; hermes_tool_parser.py is your lab with delta streaming.
vLLM docs, Tool Calling — parser selection and --enable-auto-tool-choice: https://docs.vllm.ai/en/latest/features/tool_calling/
OpenAI API reference, function calling & streaming — the contract being satisfied: https://platform.openai.com/docs/guides/function-calling
Phase 12 — the masks that can guarantee what this lab parses; lab-03 — the same buffering discipline one level down, at the byte boundary.

Lab 16-02 — The OpenAI Server, End to End `[GPU-OPT]`

The CPU labs built the two text-pipeline stages (detokenizer, tool parser); this lab runs the whole front door: vllm serve, the OpenAI client, a streamed chat completion, and a tool call — then traces one request through the server source (serving_chat.py) so the HTTP layer stops being a fog between you and the engine you know. The payoff observation: everything from Phase 1 onward sits behind one async generator call — the server is a translator, not a second engine.

No GPU? Don't panic. The captured exchange below is annotated; the source trace is hardware-free.

Why this lab exists

Production vLLM is touched through this server far more often than through LLM() — and most operational questions ("why did this request 400?", "where do sampling defaults come from?", "what adds the latency between client and first token?") are server-layer questions. The trace this lab walks — FastAPI route → request validation → chat-template rendering → AsyncLLM.generate → per-token streaming through detokenizer/parsers → SSE chunks — is the request's actual itinerary, and each leg is a place you'll someday debug. The lab's framing question for every leg: is this translation (server's job) or inference (engine's job)? Keeping that line sharp is what makes the 20k-line entrypoints directory navigable.

Requirements

uv pip install -e ".[vllm]" openai
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct   # small instruct model with tool support

Steps

Serve (note the parser flags — lab-01's convention pairing):

vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000 \
  --enable-auto-tool-choice --tool-call-parser hermes

Stream a chat completion and watch the deltas arrive:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="-")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Say hi in French, one word."}],
    stream=True)
for chunk in stream:
    print(repr(chunk.choices[0].delta.content), end=" ")

Force a tool call (define one tool, ask a matching question) and inspect the structured tool_calls in the response — lab-01's parser output, arriving over HTTP.
Misbehave on purpose: oversized max_tokens (read the 400's error body — validation is the server's first translation), a wrong model name, and a request with stream=true killed mid-stream (watch the server log the disconnect and the engine abort the request — Phase 1 lab-05's FINISHED_ABORTED, finally observed).
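
The tool-call step can be sketched as follows. This is one possible shape, not the lab's reference script: the get_weather tool, its schema, and the helper names build_tool_request and ask_with_tool are hypothetical, chosen to line up with the captured output. The request fields (tools, tool_choice) follow the OpenAI chat-completions format.

```python
import json

# Hypothetical tool for illustration; the schema follows the OpenAI "function" tool format.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(question: str) -> dict:
    """Keyword arguments for client.chat.completions.create()."""
    return {
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "messages": [{"role": "user", "content": question}],
        "tools": [GET_WEATHER_TOOL],
        "tool_choice": "auto",  # relies on --enable-auto-tool-choice on the server
    }


def ask_with_tool(question: str) -> list[tuple[str, dict]]:
    """Send the request to the local server and return (name, arguments) pairs."""
    from openai import OpenAI  # imported here so the payload helper stays dependency-free

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="-")
    response = client.chat.completions.create(**build_tool_request(question))
    calls = response.choices[0].message.tool_calls or []
    # The contract delivers arguments as a JSON string, so decode before use.
    return [(c.function.name, json.loads(c.function.arguments)) for c in calls]
```

With the server from step 1 running, ask_with_tool("What's the weather in Paris?") should return something like one ("get_weather", {...}) pair. A 0.5B model will not pick the tool every time, so an empty list is a valid outcome to observe, not a bug in the server.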

Captured output (real run, Qwen2.5-0.5B-Instruct, L4, vLLM 0.22.1, trimmed)

INFO ... Started server process; Application startup complete.    (Uvicorn + FastAPI)
INFO ... "POST /v1/chat/completions HTTP/1.1" 200 OK
None ' Bon' 'jour' ' !' None      # deltas: first None = role chunk, last = finish chunk
# tool call response (non-streamed):
"tool_calls": [{"type": "function", "function":
    {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}],
"finish_reason": "tool_calls"
# the deliberate 400:
{"error": {"message": "max_tokens must be at most 32768 ...", "type": "BadRequestError"}}

Tracing the request through the source

Open these in order, one request in mind:

upstream/vllm/entrypoints/openai/api_server.py — the FastAPI route; finds the handler per endpoint. (Translation: HTTP ↔ python objects.)
upstream/vllm/entrypoints/openai/serving_chat.py — the heart: create_chat_completion validates, renders the chat template (messages → the model's prompt format — the per-model convention lab-01's parser is the inverse of), builds SamplingParams from the request body (every Phase 9 knob, arriving as JSON), and calls AsyncLLM.generate — the only line where inference happens.
The streaming loop just below — consumes engine outputs, runs the detokenizer-fed deltas (lab-03's output!) through the tool parser (lab-01!), and yields SSE chunks with finish_reason mapped per Phase 1 lab-05.
upstream/vllm/v1/engine/async_llm.py — AsyncLLM: the async wrapper over the EngineCore you traced in Phase 1 lab-02. The circle closes.
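
The "translator, not a second engine" shape can be sketched in a few lines. This is a toy under stated assumptions, not the code in serving_chat.py: toy_engine_generate stands in for AsyncLLM.generate and yields ready-made text deltas, and the chunk payloads are trimmed to the fields the captured output shows. All function names here are hypothetical.

```python
import asyncio
import json
from typing import AsyncIterator


async def toy_engine_generate(prompt: str) -> AsyncIterator[str]:
    """Stand-in for the engine: yields already-detokenized text deltas."""
    for piece in [" Bon", "jour", " !"]:
        await asyncio.sleep(0)  # yield to the event loop, as a real engine wait would
        yield piece


def sse(payload: dict) -> str:
    """Frame one JSON payload as a Server-Sent Events message."""
    return f"data: {json.dumps(payload)}\n\n"


async def toy_chat_stream(prompt: str) -> AsyncIterator[str]:
    """Translate engine deltas into OpenAI-style streaming chunks."""
    # Role chunk: server framing, produced before any token exists.
    yield sse({"choices": [{"delta": {"role": "assistant"}, "finish_reason": None}]})
    async for piece in toy_engine_generate(prompt):  # the only "inference" line
        yield sse({"choices": [{"delta": {"content": piece}, "finish_reason": None}]})
    # Finish chunk: empty delta plus the mapped finish reason, then the end sentinel.
    yield sse({"choices": [{"delta": {}, "finish_reason": "stop"}]})
    yield "data: [DONE]\n\n"


async def collect(prompt: str) -> list[str]:
    return [chunk async for chunk in toy_chat_stream(prompt)]


chunks = asyncio.run(collect("Say hi in French, one word."))
print("".join(chunks))
```

Everything except the single async for line is framing. That is the line to look for when you read the real streaming loop.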

Hitchhiker's notes

The chat template is the most consequential invisible step: the same messages render differently per model (system-prompt placement, tool-schema injection, generation prompt), and template mismatches are the top cause of "model is dumb via API but fine in the playground." --chat-template overrides it; the template ships in the tokenizer config. The server's prompt is derived state — when debugging quality, print it (add_generation_prompt, the works) before blaming weights.
finish_reason: "tool_calls" — a third value joining Phase 1 lab-05's "stop"/"length": set when the parser extracted calls, telling the client to execute and continue the loop. The enum keeps earning.
One server, many surfaces: the same process exposes /v1/completions, /v1/chat/completions, embeddings, and (version-dependent) Anthropic-style routes — all translating onto the same AsyncLLM. API multiplexing is cheap because the engine boundary is clean; that's the architectural moral of the whole phase.
Disconnect handling is a correctness feature: a client that vanishes mid-stream must abort its request (free KV! — Phase 2's blocks don't free themselves), and the server's disconnect-watcher → abort_request path is what stands between you and a slow leak under flaky clients. Your step-4 experiment watched it work; know where it lives (api_server's disconnect checks).
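
To make "the server's prompt is derived state" concrete, here is a hand-rolled render using ChatML-style markers, the convention Qwen-family templates follow. It is an illustration only: the real template is a Jinja file in the tokenizer config and also handles details this sketch omits (default system prompt, tool-schema injection). The function name render_chatml is hypothetical.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Hand-rolled ChatML-style rendering, for illustration only."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn so the model continues as the assistant.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "Say hi in French, one word."},
]
prompt = render_chatml(messages)
print(prompt)
```

With transformers installed, tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) renders the shipped template, which is the string worth printing when a quality bug is suspected.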

Reflect

For each captured artifact, name the layer that produced it: the None role chunk (server's SSE framing), ' Bon' (engine token → lab-03 detokenizer → delta), the structured tool_calls (lab-01's parser), the 400 (validation — never reached the engine). If every artifact has an owner, the fog is gone.
The OpenAI contract returns arguments as a JSON string, not an object — and your lab-01 parser emitted dicts. Where must the re-serialization live, and why there? (The server's translation layer: the contract is the client's, the dict is internal. Translation owns format debts.)
What's the latency budget of the server layer itself? Measure: time-to-first-delta minus engine TTFT (from metrics) ≈ template render + validation + HTTP. If that gap grows under load, you're CPU-bound in the front door: a real failure mode (event-loop starvation) that no GPU dashboard will show you.
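
The re-serialization question above can be illustrated with a minimal sketch. The internal dict shape, the helper name to_openai_tool_call, and the call_id value are assumptions made for the example. Only the wire shape (arguments as a JSON string) comes from the contract.

```python
import json

# What an internal parser might hand over: arguments as a Python dict.
parsed_call = {"name": "get_weather", "arguments": {"city": "Paris"}}


def to_openai_tool_call(call: dict, call_id: str) -> dict:
    """Translate the internal dict into the wire shape, with arguments as a JSON string."""
    return {
        "id": call_id,
        "type": "function",
        "function": {
            "name": call["name"],
            "arguments": json.dumps(call["arguments"]),
        },
    }


wire = to_openai_tool_call(parsed_call, call_id="call_0")  # call_id value is made up
# The client side reverses the translation before executing the tool.
assert json.loads(wire["function"]["arguments"]) == parsed_call["arguments"]
print(wire["function"]["arguments"])
```

Keeping json.dumps at the boundary means the parser can stay contract-agnostic, and a different API surface could serialize the same dict differently.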

References

upstream/vllm/entrypoints/openai/serving_chat.py — the file this lab makes readable.
upstream/vllm/v1/engine/async_llm.py — the engine's async face.
vLLM docs, OpenAI-Compatible Server — endpoints, flags, template overrides: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html
Labs 01 and 03 — the two pipeline stages this server composes; Phase 1 lab-02 — the engine loop at the bottom of the stack.

Lab 16-03 — The Streaming Detokenizer: Never Emit Broken UTF-8 `[CPU-OK]`

Streaming sends text the instant tokens arrive — but token boundaries and character boundaries don't align. With mini_vllm's ByteTokenizer the problem is stark: 🚀 is four byte-tokens, and decoding after each token emits replacement-character garbage (�) three times before the rocket completes. Real BPE tokenizers have the identical problem wherever a multibyte character spans tokens (CJK text, emoji, accents — i.e. most of the world's traffic). This lab builds the fix every serving stack carries: an incremental detokenizer that only ever emits complete characters, holding incomplete byte sequences until they finish — with a control test proving the naive approach really does produce four pieces of garbage where yours produces three silences and a rocket.

Why this lab exists

This bug ships constantly. Every few months a chat product somewhere streams � mid-emoji or garbles Chinese text, because someone decoded per-token and tested only in English. ASCII is the one alphabet where token and character boundaries happen to agree, which makes English-only testing a perfect blind spot. The fix is small but must live in the streaming path (vLLM's IncrementalDetokenizer holds back exactly these bytes), and implementing it once inoculates you: afterward, "stream text" and "stream complete characters" register as different operations, the way Phase 9 taught you to separate "random" from "reproducibly random".

It's also the purest specimen of the phase's recurring discipline — lab-01 held back possible tag prefixes, stop-string handling (Phase 1 lab-05) holds back possible stop matches, and this lab holds back incomplete characters. One pattern, three layers: emit eagerly, but never emit what might still change meaning.

Background: UTF-8 tells you how long to wait

UTF-8's self-describing first byte is what makes the fix clean: 0xxxxxxx = 1-byte char, 110xxxxx = 2, 1110xxxx = 3, 11110xxx = 4 — your utf8_expected_len table. The detokenizer keeps a byte buffer; after each token it computes the longest prefix that is a whole number of complete sequences, decodes and emits that, and keeps the tail. The lead byte announces the wait; no guessing, no decode-and-check. flush() handles the honest edge: a stream truncated mid-character (max_tokens landing inside an emoji — Phase 1 lab-05's cap, byte edition) decodes the remnant with errors='replace', because at end-of-stream the garbage is real and hiding it would be lying.
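
The algorithm just described fits in a short sketch. This is one possible shape, not the lab's reference solution: it assumes token ids 0..255 are raw bytes and that any other id (such as EOS) carries no text, and it treats a stray continuation byte as a one-byte unit so that errors='replace' can surface it.

```python
def utf8_expected_len(lead: int) -> int:
    """Sequence length announced by a UTF-8 lead byte (0 if it is not a lead byte)."""
    if lead < 0x80:
        return 1          # 0xxxxxxx
    if lead >> 5 == 0b110:
        return 2          # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3          # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4          # 11110xxx
    return 0              # continuation byte (10xxxxxx) or invalid


class StreamingDetokenizer:
    """Byte-token detokenizer that only emits complete characters."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def _complete_prefix_len(self) -> int:
        i, n = 0, len(self._buf)
        while i < n:
            need = utf8_expected_len(self._buf[i]) or 1  # stray byte: length 1
            if i + need > n:
                break     # the tail is an unfinished character: keep holding it
            i += need
        return i

    def feed(self, token_id: int) -> str:
        if not 0 <= token_id <= 255:
            return ""     # assumed convention: non-byte ids (e.g. EOS) carry no text
        self._buf.append(token_id)
        cut = self._complete_prefix_len()
        out = bytes(self._buf[:cut]).decode("utf-8", errors="replace")
        del self._buf[:cut]
        return out

    def flush(self) -> str:
        # End of stream: whatever is left is decoded honestly, garbage included.
        out = bytes(self._buf).decode("utf-8", errors="replace")
        self._buf.clear()
        return out


detok = StreamingDetokenizer()
pieces = [detok.feed(b) for b in "🚀".encode("utf-8")]
print(pieces)  # ['', '', '', '🚀']
```

The lead byte alone decides how long to hold, so feed never needs a trial decode to find the cut point.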

Files

starter.py — utf8_expected_len and StreamingDetokenizer (feed/flush). Your work.
solution.py — reference.
test_lab.py — the length table, ASCII eagerness, emoji holding, the no-garbage-ever invariant on mixed multilingual text, the naive-approach control, truncation honesty, and EOS handling.

Run

LAB_IMPL=starter pytest phase-16-serving-apis-and-parsers/labs/lab-03-streaming-detokenizer -q
pytest phase-16-serving-apis-and-parsers/labs/lab-03-streaming-detokenizer -q   # reference

What the tests prove

Test	What it pins
`test_ascii_streams_one_char_per_token`	Eagerness: nothing is held that could be shown — latency is sacrificed only when correctness demands
`test_emoji_is_held_until_complete`	`["", "", "", "🚀"]` — three silences, one rocket: the wait is exactly the character's length, no more
`test_never_emits_replacement_chars_for_valid_text`	The invariant, on `naïve café — 你好 🚀🇫🇷`: no `�` ever, and concatenation loses nothing (both halves matter: no garbage AND no swallowing)
`test_naive_approach_really_is_broken`	The control (Phase 9 lab-04's pattern): per-token decode of 🚀 yields four garbage strings — the bug demonstrated, not described
`test_flush_handles_truncated_sequence`	Stream cut mid-emoji: flush emits honest `�` rather than raising or hiding — truncation is the caller's fact to handle
`test_eos_is_ignored`	Non-byte ids pass through silently — the sentinel discipline again

Hitchhiker's notes

The real version sits one level up: BPE tokens map to byte sequences (via the tokenizer's byte-level encoding), so vLLM's incremental detokenizer (upstream/vllm/v1/engine/detokenizer.py, backed by the tokenizers library's incremental decode) buffers token-ids and re-decodes a sliding window — same hold-back logic with tokenizer-specific machinery for "which prefix is stable." Your byte-level version is that algorithm with the cleanest possible alphabet.
The flag emoji in the test is a deliberate landmine that doesn't explode: 🇫🇷 is two complete 4-byte codepoints (regional indicators) that render as one flag. Your detokenizer may legally emit them separately — character completeness is the engine's contract; grapheme clustering is the terminal's problem. Knowing where your responsibility ends is part of the spec (and why the test checks for �, not for atomic flags).
This buffering interacts with everything downstream: stop strings are matched on detokenized text (so they inherit this buffer's timing), and lab-01's tag parser consumes this lab's output. The serving text pipeline is a stack of hold-back buffers, each with its own "might still change" criterion — when streamed output seems to lag by a character or two, you now know all three suspects.
Performance note: production detokenizers avoid re-decoding from scratch per token (your _complete_prefix_len scan is O(buffer), fine; re-decoding the whole output per token, the other naive approach, is O(n²) over a generation and has caused real regressions). Incrementality is a performance property here, not just a correctness one.

Going further

Build the full pipeline: ByteTokenizer → StreamingDetokenizer → lab-01's StreamingToolParser, fed token-by-token; assert end-to-end that a tool call with an emoji in its arguments survives both buffers. Two hold-backs composed — the actual server path.
Add stop-string support on top (Phase 1 lab-05's going-further, now with the right substrate): match on the emitted text, hold back any suffix that prefixes a stop string. Three buffers. Notice they compose without coordinating — each one's output is the next one's honest input.
Measure the worst-case display latency your buffer adds for a pathological all-emoji stream — then check the real detokenizer's equivalent bound. (Four tokens. The wait is bounded by UTF-8's max sequence length; this is why the design needs no timeout.)
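
For the stop-string exercise, the hold-back rule ("hold back any suffix that prefixes a stop string") might look like the sketch below. The names held_back_len and StopStringFilter are hypothetical, and the sketch works on already-detokenized text pieces, assuming non-empty stop strings.

```python
def held_back_len(text: str, stop_strings: list[str]) -> int:
    """Length of the longest suffix of text that is a proper prefix of some stop string."""
    longest = 0
    for stop in stop_strings:
        for k in range(min(len(stop) - 1, len(text)), 0, -1):
            if text.endswith(stop[:k]):
                longest = max(longest, k)
                break
    return longest


class StopStringFilter:
    """Hypothetical third buffer: emits eagerly, except what might become a stop string."""

    def __init__(self, stop_strings: list[str]) -> None:
        self.stop_strings = stop_strings
        self.pending = ""
        self.stopped = False

    def feed(self, piece: str) -> str:
        if self.stopped:
            return ""
        self.pending += piece
        hits = [i for i in (self.pending.find(s) for s in self.stop_strings) if i != -1]
        if hits:
            # A full stop string arrived: emit what precedes it and finish.
            self.stopped = True
            out, self.pending = self.pending[: min(hits)], ""
            return out
        keep = held_back_len(self.pending, self.stop_strings)
        cut = len(self.pending) - keep
        out, self.pending = self.pending[:cut], self.pending[cut:]
        return out

    def flush(self) -> str:
        # End of stream without a match: the held text was ordinary output after all.
        out, self.pending = self.pending, ""
        return out


f = StopStringFilter(["</s>"])
print([f.feed(p) for p in ["Hi", " <", "/s", ">x"]])  # ['Hi', ' ', '', '']
```

The filter only sees text, so it composes with the detokenizer without knowing about bytes. Its worst-case delay is one character less than the longest stop string.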

References

upstream/vllm/v1/engine/detokenizer.py — IncrementalDetokenizer: this lab at the BPE level.
The Unicode Standard, ch. 3 (UTF-8) — the lead-byte table you implemented: https://www.unicode.org/versions/latest/
Phase 1 lab-05 — stop strings, the neighboring hold-back; lab-01 — the tag hold-back this lab feeds.

Phase 16 — Exercises: Serving APIs & Parsers

Work these after the labs. They escalate from "explain it" to "design it" — staff-level means you can do the last ones cold.

Why are streaming tool-call parsers hard (partial JSON across deltas)?
How does a chat template turn messages into a single token sequence?
What must be true for vLLM to be a drop-in OpenAI replacement?

Self-grading

For each: could you (a) explain it to a teammate in 2 minutes, and (b) point to the exact upstream/ file that proves your answer? If not, re-read the matching anchor in 01-deep-dive.md.

Phase 16 — Interview Questions: Serving APIs & Parsers

Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)

Q1. How does vLLM implement tool calling on top of plain text generation?

Model answer

The server injects tool schemas into the prompt (often via the chat template / structured output), then a model-specific tool-call parser extracts the function name and JSON args from the generated text — incrementally during streaming — and emits OpenAI-style tool_calls. Structured output can hard-constrain the args to the schema.
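
A non-streaming version of the extraction step can be sketched as below. It assumes the Hermes-style convention (a JSON object wrapped in <tool_call> tags) and is not vLLM's parser. The function name extract_tool_calls is hypothetical, and a streaming parser would additionally have to hold back partial tags and partial JSON.

```python
import json
import re

# Hermes-style convention (assumed here): each call is a JSON object inside <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(generated: str) -> tuple[str, list[dict]]:
    """Split generated text into plain content and OpenAI-style tool_calls."""
    calls = []
    for match in TOOL_CALL_RE.finditer(generated):
        obj = json.loads(match.group(1))
        calls.append({
            "type": "function",
            "function": {
                "name": obj["name"],
                # The wire format carries arguments as a JSON string.
                "arguments": json.dumps(obj.get("arguments", {})),
            },
        })
    content = TOOL_CALL_RE.sub("", generated).strip()
    return content, calls


text = (
    "Let me check.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
)
content, calls = extract_tool_calls(text)
print(content, calls)
```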

Q2. What's tricky about streaming responses?

Model answer

You must emit incremental deltas while maintaining correct semantics (role, finish_reason), and parse partial content (tool-call JSON, reasoning tags) that spans multiple chunks without committing to an interpretation too early.

Going deeper

The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.

Phase 16 — Cheatsheet: Serving APIs & Parsers

vllm serve -> FastAPI -> serving_chat -> AsyncLLM. Speaks OpenAI + Anthropic + gRPC.
Chat template turns messages -> prompt tokens. SSE for streaming deltas.
Tool/reasoning parsers are pluggable and per-model; streaming makes them partial-parse.

Key upstream files

vllm/entrypoints/openai/api_server.py
vllm/entrypoints/openai/serving_chat.py
vllm/entrypoints/openai/tool_parsers/
vllm/entrypoints/openai/reasoning_parsers/
vllm/entrypoints/

Full reference: 00-guide.md · 01-deep-dive.md

Phase 17 — Hardware Backends & Plugins

← Phase 16 · Course home · Phase 18 →

Don't Panic

vLLM runs on NVIDIA, AMD, CPUs, TPUs, Gaudi, and more. It does this by hiding every hardware difference behind a Platform abstraction and a plugin system, so the engine code stays hardware-agnostic and new accelerators arrive as plugins. This phase is that abstraction — and you'll run the CPU backend with no GPU at all.

Why this phase matters

Hardware breadth is a strategic advantage (GPU supply, cost arbitrage) and the Platform abstraction is a clean piece of architecture worth studying. Knowing where the seams are lets you reason about porting and about why a feature is available on one backend but not another.

What you'll learn

The Platform abstraction: device type, attention backend default, capabilities
How the engine queries the platform instead of hardcoding CUDA
The out-of-tree plugin system (entry points) for new hardware
CPU backend: what changes (no paging kernels? threading? dtype support)
Why some features are platform-gated (FP8, CUDA graphs, certain kernels)

The map: where this lives in the real code

Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md) walks through the important ones line by line.

vllm/platforms/interface.py — The Platform base class — the contract every backend implements.
vllm/platforms/cuda.py — The NVIDIA platform.
vllm/platforms/cpu.py — The CPU platform — read this; you can run it on a laptop.
vllm/platforms/__init__.py — Platform detection/resolution + plugin discovery.
vllm/plugins/ — The plugin loading mechanism.

Labs in this phase

lab-01-platform-abstraction [CPU-OK] — build the Platform interface, registry, resolver (CPU floor + loud override), then register an out-of-tree platform and change the engine's decisions with zero core edits — plus the duplicate-registration supply-chain guard.
lab-02-run-cpu-vllm [CPU-OK] — run vLLM on laptop cores and read cpu.py against lab-01's interface: every override checked off, the Phase 1–3 engine untouched. Captured output included.

See labs/README.md for how to run them.

How to work this phase

Read this guide for intuition.
Read 01-deep-dive.md with the upstream/ files open.
Do 02-mini-build.md — build the mini_vllm piece yourself.
Run the labs, then attempt EXERCISES.md.
Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.


Phase 17 — Deep Dive: Hardware Backends & Plugins

Read this with upstream/ open. Every path is relative to upstream/ at the pinned commit v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number ever drifts, search for the named symbol instead.

Guided reading list

vllm/platforms/interface.py — The Platform base class — the contract every backend implements.
- Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
vllm/platforms/cuda.py — The NVIDIA platform.
- Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
vllm/platforms/cpu.py — The CPU platform — read this; you can run it on a laptop.
- Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
vllm/platforms/__init__.py — Platform detection/resolution + plugin discovery.
- Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
vllm/plugins/ — The plugin loading mechanism.
- Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.

Questions to answer as you read

What does the Platform abstraction expose (device type, default attention backend, capabilities)?
How does the engine query the platform instead of hardcoding CUDA?
How does the out-of-tree plugin system (entry points) add new hardware?
What changes on the CPU backend (no paging kernels? threading? dtype support)?
Why are some features platform-gated (FP8, CUDA graphs, certain kernels)?

Cross-references

Intuition: 00-guide.md
Build it yourself: 02-mini-build.md
The gold-standard depth to emulate: Phase 02 deep-dive.

Phase 17 — Mini-Build: extend `mini_vllm`

Your task

Add a 'platform' abstraction to mini_vllm: a base class exposing device/dtype/default-backend, with a CPU implementation, and have the engine consult it instead of hardcoding — mirroring vLLM's Platform.

Why build it (and not just read it)

Method

Look at the matching real code from 01-deep-dive.md.
Add your module under mini_vllm/ (or extend an existing one).
Write a test_*.py next to it that pins the behavior you care about.
Run pytest mini_vllm -q and keep it green.

Definition of done

Your component runs on CPU with no extra dependencies (numpy ok).
A test demonstrates the property this phase is about (not just "it runs").
You can explain, out loud, how your toy maps to the real implementation and where it intentionally simplifies.

The flagship phases ship complete mini_vllm modules + tests (mini_vllm/block_pool.py, mini_vllm/scheduler.py) — use them as your reference for structure and test style.

Phase 17 Labs — Hardware Backends & Plugins

Two labs on the layer that lets one engine speak to any silicon. The arc: build the platform interface, the registry, and the resolver — then register an out-of-tree platform and change the engine's decisions with zero core edits (lab-01); then run the realest possible demonstration — vLLM on your laptop's CPU, with cpu.py read against the interface you built (lab-02).

CPU labs follow the standard contract — starter.py (your work), solution.py (reference), test_lab.py (the spec); default runs the solution, LAB_IMPL=starter grades yours.

# Whole phase:
pytest phase-17-hardware-backends-and-plugins/labs -m "not gpu"

# Grade yourself:
LAB_IMPL=starter pytest phase-17-hardware-backends-and-plugins/labs/lab-01-platform-abstraction -q

Labs

lab-01-platform-abstraction `[CPU-OK]`

The funnel: a Platform interface answering every hardware question (attention backend, dtypes, graph support), a registry with a CPU floor and a loud override, and the test that is the architecture — an out-of-tree "vendor" platform changes the engine's decisions without touching core. Plus the supply-chain guard: duplicate registration refused. Skills: the registry trilogy completed (attention → models → platforms); capability negotiation over assumption; plugins as additive hardware support; tests as architecture proofs.

lab-02-run-cpu-vllm `[CPU-OK]`

vLLM on laptop cores: the platform resolver choosing the floor, Torch SDPA standing in for flash attention, KV carved from RAM by VLLM_CPU_KVCACHE_SPACE, graphs degrading to eager — and the whole Phase 1–3 engine running unmodified, because none of it was ever a GPU concept. Read cpu.py against lab-01 and check off every override; note what a backend doesn't have to implement. Captured run included (your tok/s will differ; nothing else will). Skills: knob translation across platforms; the CPU roofline pricing the 9 tok/s; what to ask a vendor pitching "vLLM support."

What you can do after this phase

Explain how one engine serves five silicon families; evaluate or review a hardware plugin by what it overrides and what it leaves alone; run and tune vLLM where there is no GPU at all; and place any hardware question ("does X support fp8? graphs? custom all-reduce?") at the platform boundary where its answer lives. Phase 18 measures what all these layers cost; Phase 19 sends you upstream.

Lab 17-01 — The Platform Abstraction: One Engine, Any Silicon `[CPU-OK]`

vLLM runs on NVIDIA, AMD, Intel GPUs, TPUs, Gaudi, and plain CPUs — and the reason it can is one interface and one registry: every hardware-specific decision (which attention backend? which dtypes? are CUDA graphs a thing here?) is asked of a Platform object, and platforms register into a table that out-of-tree plugins can join without touching a line of core code. You'll build the whole mechanism small — the interface, two in-tree platforms, the resolver with its override and its CPU floor — and then the test that is the architecture: register a third platform from "outside" and watch the engine's decisions change, core untouched. Plus the security posture detail most plugin systems forget: duplicate registration is refused, because a plugin silently shadowing the CUDA platform is a supply-chain incident wearing a convenience feature.

Why this lab exists

The platform layer is how vLLM scaled organizationally, not just technically: hardware vendors (AMD, Intel, Google, Huawei, IBM) maintain their own backends — some in-tree, some as plugin packages — without serializing through the core team. That only works because the interface is explicit and the extension point is a registry, and the lab's plugin test demonstrates the payoff in its purest form: new silicon support is additive. If you ever bring vLLM to new hardware (a real career path — ask the Spyre and Ascend teams), this lab is the map of what you'll implement; if you review plugin PRs, it's the map of what to check.

The design pattern is also the course's registry trilogy completed: attention backends (Phase 4's selector), model architectures (Phase 14's registry), and now platforms — three tables, one philosophy: core code asks "who handles this?" instead of knowing. Each table is also a place where Phase 4 lab-02's bisection move works (override exists at every layer for exactly that reason).

Background: the decisions that funnel through

The real Platform interface (upstream/vllm/platforms/interface.py) answers, per hardware: which attention backend class (this is literally where Phase 4's selector gets its platform default), supported dtypes (your check_dtype is the negotiation — bf16 everywhere, fp16 not on CPU, fp8 only on Hopper+-class), device introspection (memory totals — Phase 2 lab-03's carving needs to ask someone), graph capture support (Phase 5 is a no-op on CPU), and communicator choices (Phase 10's collectives differ per fabric). Resolution happens once at import/startup: detect devices → consult the registry → (or honor the override) → fall back to CPU, the platform that always exists — the floor that makes "no accelerator detected" a slow day instead of a crash.

Plugins join via Python entry points: installing vllm-ascend registers its platform at import time — your register_platform, with packaging around it. The refuse-duplicates rule is the trust boundary: in-tree names are spoken for.

Files

starter.py — Platform.check_dtype, register_platform, resolve_platform, make_default_platforms. Your work.
solution.py — reference.
test_lab.py — accelerator preference, the CPU floor, override + loud unknowns, dtype negotiation, the out-of-tree plugin, and the duplicate refusal.

Run

LAB_IMPL=starter pytest phase-17-hardware-backends-and-plugins/labs/lab-01-platform-abstraction -q
pytest phase-17-hardware-backends-and-plugins/labs/lab-01-platform-abstraction -q   # reference

What the tests prove

Test	What it pins
`test_resolution_prefers_the_accelerator`	Detection order: the GPU wins when present, and with it come flash_attn and graphs — the decisions travel as a bundle
`test_cpu_is_the_floor`	Empty device list still resolves — vLLM always has somewhere to run, which is why lab-02 works at all
`test_override_wins_and_unknown_is_loud`	The bisection hook (Phase 4 lab-02's reflex, platform edition) — and typos fail fast instead of silently falling back
`test_dtype_negotiation`	Unsupported dtype → float32, never a crash mid-load: capability mismatches are negotiated at the boundary
`test_out_of_tree_plugin_changes_decisions_without_core_edits`	The architecture: a "vendor" registers `mytpu`, resolution returns it, the attention backend is now `pallas` — and the diff to core is zero lines
`test_duplicate_registration_is_refused`	A plugin cannot shadow `cpu` or `cuda` — the supply-chain guard, as an assert

Hitchhiker's notes

Find your functions upstream: upstream/vllm/platforms/interface.py (Platform, with ~30 methods where you wrote 1 — same skeleton), upstream/vllm/platforms/__init__.py (detection + resolution + plugin loading — your resolve_platform with the entry-point scan), and any of cuda.py / cpu.py / rocm.py / tpu.py as the in-tree implementations. Read cpu.py with lab-02 — its overrides are exactly the decision list above.
The plugin mechanism is general: vLLM's plugin system (upstream/vllm/plugins/) loads any registered entry point at startup — platforms, but also out-of-tree models (Phase 14's registry accepts plugins the same way) and custom components. One loading mechanism, many tables — when you see VLLM_PLUGINS in an environment, this is what it gates.
Why funnel rather than if torch.cuda.is_available() sprinkled everywhere? Because the sprinkled version is what most codebases have, and it makes new hardware a grep-and-pray refactor across hundreds of sites. The funnel makes it one class. The lab's plugin test is unwritable against sprinkled conditionals — which is the test-as-architecture-proof point again (Phase 14's tripwire, in registry form).
Capability negotiation beats capability assumption: check_dtype's fall-to-float32 is a microcosm of how the whole layer behaves — requests for the unsupported degrade explicitly (with a warning upstream) rather than crashing or, worse, silently miscomputing. Every backend boundary in your own systems deserves the same negotiation shape.

Going further

Wire it into mini_vllm: give LLMEngine a platform parameter whose attention_backend string selects between two toy attention impls (both correct, different "hardware"). The Phase 14 lab-01 tripwire test then proves the engine consults only the platform — the funnel, enforced.
Add get_device_memory() per platform and route Phase 2 lab-03's blocks-from-bytes carving through it — the startup ritual becomes platform-portable, which is precisely how the real worker does it.
Simulate the entry-point load: a plugins/ dict of callables, each registering a platform; load them in sorted order and re-run the duplicate test. Then consider: what should happen when two plugins collide? (Upstream: first wins, plus a warning. Reasonable people disagree; write down the trade.)

References

upstream/vllm/platforms/interface.py — the real Platform.
upstream/vllm/platforms/__init__.py — detection, resolution, plugin loading.
vLLM docs, vLLM Plugin System: https://docs.vllm.ai/en/latest/design/plugin_system.html
Phase 4 lab-02 (attention selector) and Phase 14 lab-01 (model registry) — the other two tables in the trilogy.

Lab 17-02 — Run vLLM on CPU, and Read What the Platform Overrode `[CPU-OK]`

The one GPU-flavored lab in this course that genuinely needs no GPU: install vLLM's CPU backend, serve a tiny model on your laptop cores, and then read cpu.py against lab-01's interface to see exactly which decisions the platform redirected — attention backend swapped, CUDA graphs gone, KV cache carved from RAM by a different knob (VLLM_CPU_KVCACHE_SPACE instead of gpu_memory_utilization). Same engine, same scheduler, same paged KV, different silicon — Phase 1–3's machinery proving itself hardware-agnostic before your eyes.

The captured run below is from a 16-core laptop; yours will differ in tok/s and nothing else. That's the lesson.

Why this lab exists

Three reasons, ascending. Practically: CPU vLLM is a real deployment surface. CI pipelines, edge boxes, air-gapped environments, and cost-floor serving of small models all use it, and its knobs differ enough from CUDA's to merit one deliberate run. Pedagogically: it's the existence proof for lab-01's architecture. Every concept this course taught in GPU terms (paged KV, continuous batching, chunked prefill) executes here unmodified, because none of them were ever GPU concepts; they were engine concepts, and the platform layer is what kept them so. Strategically: reading cpu.py teaches you the size of a backend. It's a short file, and "supporting new hardware is a short file plus kernels" is the fact that makes Phase 17's vendor-plugin world believable.

Requirements

# CPU wheels/build per the official guide (the pip default wheel is CUDA-flavored):
# https://docs.vllm.ai/en/latest/getting_started/installation/cpu.html
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct

Steps

VLLM_CPU_KVCACHE_SPACE=4 python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='Qwen/Qwen2.5-0.5B-Instruct', dtype='bfloat16', max_model_len=1024)
print(llm.generate(['The CPU backend exists because'],
                   SamplingParams(max_tokens=48, temperature=0))[0].outputs[0].text)
"

Three observations to collect: the startup log naming the platform and attention backend (compare with any GPU capture from earlier phases); the KV cache sized from VLLM_CPU_KVCACHE_SPACE (gigabytes of RAM — Phase 2 lab-03's carving with a new budget source); and your tok/s (single-digit to low-double — see the roofline note).

Captured output (real run, Qwen2.5-0.5B, 16-core CPU, vLLM 0.22.1, trimmed)

INFO ... Using CPU platform.                       # lab-01's resolver, choosing the floor
INFO ... Using Torch SDPA backend.                 # the platform's attention answer
INFO ... CPU KV cache space: 4 GiB                 # VLLM_CPU_KVCACHE_SPACE, not gpu_mem_util
INFO ... # CPU blocks: 13,107                      # same BlockPool, RAM-backed
WARNING ... CUDA graphs are not supported ... falling back to eager
 the CPU backend exists because not every deployment has a GPU ...
# generation: ~9 tok/s single stream (16 cores, bf16)

Reading cpu.py against lab-01

Open upstream/vllm/platforms/cpu.py next to your lab-01 Platform and check off the decisions: get_attn_backend_cls → Torch SDPA (the platform is Phase 4's selector for this hardware); dtype checks (fp16 discouraged on CPU, which is your check_dtype negotiation, with a warning); graphs unsupported (Phase 5 short-circuits; note the engine degrades, not crashes: eager mode was always a valid path); memory introspection reading system RAM. Then notice what's absent: nothing about schedulers, blocks, batching, or sampling. The platform overrides the hardware-touching edge and only the edge: lab-01's funnel, confirmed by reading what a real backend did not have to implement.

Hitchhiker's notes

The performance is honest, and the roofline explains it (Phase 0 lab-04 with CPU constants): ~50 GB/s of DRAM bandwidth vs a GPU's ~2,000 GB/s. Decode streams every weight byte once per token, so a 1 GB-weight model has a ceiling of about 50 GB/s ÷ 1 GB ≈ 50 tok/s, and the captured ~9 tok/s sits under that ceiling, as real runs do once achieved bandwidth and per-token overhead are counted. CPU serving is bandwidth-priced, same physics, smaller numbers, which is also why small models + quantization (fewer bytes!) are disproportionately effective here.
Knob translation table: gpu_memory_utilization → VLLM_CPU_KVCACHE_SPACE (absolute GiB — RAM isn't pre-carved like HBM); TP within a node → multiple NUMA-pinned CPU "devices" (VLLM_CPU_OMP_THREADS_BIND); graphs → nothing (eager always). The concepts you tuned all course exist; the spellings moved to where the hardware's truth lives.
CI is the killer app: vLLM's own test suite exercises engine logic on CPU runners constantly — correctness of schedulers and parsers doesn't need an A100 (this course's whole premise, which the project itself relies on).
From cpu.py to a vendor plugin is a difference of packaging, not kind: vllm-ascend, vllm-spyre and friends are out-of-tree cpu.py-shaped files plus kernels, registered through lab-01's entry-point mechanism. After reading one in-tree backend, you can review (or write) an out-of-tree one.
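
The roofline note above is a one-line division, and it is worth running once. A minimal sketch, assuming decode streams every weight byte once per generated token and using the note's round-number constants (50 GB/s for CPU DRAM, 2,000 GB/s for a GPU, 1 GB of weights). `decode_ceiling_tok_s` is a hypothetical helper for illustration, not a vLLM function.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on single-stream decode speed when each token
    must stream all weights from memory once (memory-bound regime)."""
    if weight_gb <= 0:
        raise ValueError("weight_gb must be positive")
    return bandwidth_gb_s / weight_gb


# Round-number constants from the note (assumptions, not measurements).
cpu_ceiling = decode_ceiling_tok_s(bandwidth_gb_s=50.0, weight_gb=1.0)
gpu_ceiling = decode_ceiling_tok_s(bandwidth_gb_s=2000.0, weight_gb=1.0)
# Halving the bytes per weight (8-bit instead of bf16) doubles the ceiling.
cpu_ceiling_8bit = decode_ceiling_tok_s(bandwidth_gb_s=50.0, weight_gb=0.5)

print(f"CPU ceiling:        {cpu_ceiling:.0f} tok/s")
print(f"GPU ceiling:        {gpu_ceiling:.0f} tok/s")
print(f"CPU ceiling, 8-bit: {cpu_ceiling_8bit:.0f} tok/s")
```

A measured rate should land at or under the ceiling for its hardware; a measurement above it means one of the constants is wrong.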

Reflect

List which course phases' machinery you just watched run unchanged on CPU, and which were platform-swapped. (Unchanged: 1, 2, 3, 9, 12, 16 — the engine and text layers. Swapped: 4's backend choice, 5 disabled, 7's kernels, 0/18's constants.) The ratio is the architecture's grade.
Why is VLLM_CPU_KVCACHE_SPACE absolute GiB while the GPU knob is a fraction? (HBM is the engine's to claim — a fraction of a dedicated resource; RAM is shared with the OS and everything else — an absolute budget is the honest contract. Knob design encodes resource ownership.)
A vendor pitches you "vLLM support" for their accelerator. From this phase, what three artifacts do you ask to see? (Their platform class and what it overrides; their attention backend's correctness story against Phase 4's reference shapes; benchmark constants for the Phase 0 lab-04 roofline so claims can be checked.)

References

upstream/vllm/platforms/cpu.py — the backend under read.
vLLM docs, CPU installation: https://docs.vllm.ai/en/latest/getting_started/installation/cpu.html
vLLM docs, Plugin System — the out-of-tree path this is the in-tree template for: https://docs.vllm.ai/en/latest/design/plugin_system.html
Lab-01 — the interface this file implements; Phase 0 lab-04 — the physics that prices the capture's 9 tok/s.

Phase 17 — Exercises: Hardware Backends & Plugins

Work these after the labs. They escalate from "explain it" to "design it" — staff-level means you can do the last ones cold.

List 3 decisions the Platform abstraction centralizes and why hardcoding them would hurt.
Why is FP8 / CUDA-graph support platform-gated?
How would a new accelerator vendor add support without forking vLLM?

Self-grading

For each: could you (a) explain it to a teammate in 2 minutes, and (b) point to the exact upstream/ file that proves your answer? If not, re-read the matching anchor in 01-deep-dive.md.

Phase 17 — Interview Questions: Hardware Backends & Plugins

Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)

Q1. How does vLLM support so many hardware backends without forking the engine?

Model answer

A Platform abstraction centralizes hardware-specific choices (device, default attention backend, supported dtypes, capabilities), and the engine queries it instead of hardcoding CUDA. New hardware can register out-of-tree via the plugin entry-point system, so vendors add support without modifying core code.
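
The shape of that answer fits in a toy. A minimal sketch of the ask-the-platform pattern, assuming a registry of detector functions that each return a platform or None. Every name here (`ToyPlatform`, `PLUGIN_REGISTRY`, `register`, `resolve_platform`, `plan_step`) is hypothetical; this is not vLLM's resolver or its entry-point loading code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToyPlatform:
    """Hypothetical stand-in for a platform: the hardware-specific answers."""
    device: str
    attn_backend: str
    supports_graphs: bool


CPU = ToyPlatform(device="cpu", attn_backend="sdpa", supports_graphs=False)

# Each detector returns a platform if its hardware is present, else None.
Detector = Callable[[], Optional[ToyPlatform]]
PLUGIN_REGISTRY: dict[str, Detector] = {}


def register(name: str, detector: Detector) -> None:
    """Out-of-tree code calls this; the engine core is never edited."""
    PLUGIN_REGISTRY[name] = detector


def resolve_platform() -> ToyPlatform:
    """Return the first detected platform, falling back to CPU (the floor)."""
    for detector in PLUGIN_REGISTRY.values():
        platform = detector()
        if platform is not None:
            return platform
    return CPU


def plan_step(platform: ToyPlatform) -> str:
    """Engine-side code asks the platform instead of hardcoding a device."""
    mode = "graphs" if platform.supports_graphs else "eager"
    return f"{platform.device}:{platform.attn_backend}:{mode}"
```

With an empty registry, `plan_step(resolve_platform())` describes the CPU floor; after a vendor registers a detector, the same engine-side call returns the vendor's answers without any change to `plan_step`.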

Q2. Why can you run vLLM on a CPU at all, and what's different?

Model answer

The CPU platform provides CPU-appropriate kernels and disables GPU-only features (certain fused/quant kernels, CUDA graphs). It's slower but lets you develop and test the engine logic — exactly what the [CPU-OK] labs in this course rely on.

Going deeper

The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.

Phase 17 — Cheatsheet: Hardware Backends & Plugins

Platform abstraction = one place for device/dtype/default-backend/capabilities.
Engine asks the Platform; it never hardcodes CUDA.
New hardware = out-of-tree plugin via entry points.
CPU backend runs on a laptop (no CUDA graphs, no GPU-only fused/quant kernels); great for learning.

Key upstream files

vllm/platforms/interface.py
vllm/platforms/cuda.py
vllm/platforms/cpu.py
vllm/platforms/__init__.py
vllm/plugins/

Full reference: 00-guide.md · 01-deep-dive.md

Phase 18 — Performance Engineering

Don't Panic

Now you make it FAST and prove it. This phase is the engineer's loop: measure (TTFT, ITL, throughput) with the right tools, find the bottleneck (CPU launch? memory? a kernel?), turn the right knob (batch size, token budget, memory utilization, graphs, quant), and re-measure. It's the meta-skill that ties phases 2–17 together.

Why this phase matters

This is the daily job of a staff inference engineer and the thing startups live or die on (cost/token). Being able to read a profile, reason with a roofline, and tune vLLM's knobs methodically is what separates senior from staff.

What you'll learn

Metrics that matter: throughput (tok/s), TTFT, ITL/TPOT, goodput, latency percentiles
Little's Law and how batch size, arrival rate, and latency relate
The roofline model: compute-bound vs memory-bound; arithmetic intensity
Profiling: the torch profiler, Nsight Systems, and vLLM's own metrics
The knobs: max_num_seqs, max_num_batched_tokens, gpu_memory_utilization, enable_chunked_prefill, CUDA graphs, quant, spec decode
Benchmarking properly: vllm bench, warmup, steady state, fair comparisons
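
Little's Law from the list above is short enough to keep as code. A minimal sketch, assuming steady state: mean requests in flight equals arrival rate times mean latency, and a batch of B decoding sequences emits B tokens per inter-token interval. The function names and the example numbers are illustrative, not taken from vLLM.

```python
def in_flight(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's Law: L = lambda * W (mean requests in the system)."""
    return arrival_rate_per_s * latency_s


def batch_for_throughput(target_tok_s: float, itl_s: float) -> float:
    """Each decoding sequence emits one token per ITL, so B sequences
    emit B / ITL tok/s; solve for B."""
    return target_tok_s * itl_s


# Illustrative numbers: 8 req/s arriving, 1.5 s mean end-to-end latency.
print(in_flight(arrival_rate_per_s=8.0, latency_s=1.5))           # requests in flight
# Illustrative numbers: 4,000 tok/s target at a 10 ms ITL.
print(batch_for_throughput(target_tok_s=4000.0, itl_s=0.010))     # decode batch size
```

If the required batch size exceeds max_num_seqs, the target is unreachable at that ITL no matter how the other knobs are set.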

The map: where this lives in the real code

Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md) walks through the important ones line by line.

benchmarks/ — The benchmark suite (throughput, latency, serving).
vllm/benchmarks/ — The 'vllm bench' implementation.
vllm/v1/metrics/ — The metrics/stats the engine exposes (Prometheus + logging).
vllm/v1/metrics/stats.py — SchedulerStats / IterationStats: what's measured each step.
vllm/config/scheduler.py — The tuning knobs and their defaults/semantics.

Labs in this phase

lab-01-tune-the-knobs [CPU-OK] — build the full tuning loop on mini_vllm: arrival schedules (queueing enters the course), TTFT/spike/steps metrics, and an SLO-constrained grid search that refuses impossible SLOs — with two measured surprises about the chunk threshold.
lab-02-benchmark-real-vllm [GPU-OPT] — the same loop with wall-clocks: vllm bench serve sweeps, the rate-sweep knee found first, percentiles everywhere, and the one-page tuning report as the deliverable. Captured numbers included.

See labs/README.md for how to run them.

How to work this phase

Read this guide for intuition.
Read 01-deep-dive.md with the upstream/ files open.
Do 02-mini-build.md — build the mini_vllm piece yourself.
Run the labs, then attempt EXERCISES.md.
Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.

Phase 18 — Deep Dive: Performance Engineering

Read this with upstream/ open. Every path is relative to upstream/ at the pinned commit v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number ever drifts, search for the named symbol instead.

Guided reading list

For each entry, read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.

benchmarks/ — The benchmark suite (throughput, latency, serving).
vllm/benchmarks/ — The 'vllm bench' implementation.
vllm/v1/metrics/ — The metrics/stats the engine exposes (Prometheus + logging).
vllm/v1/metrics/stats.py — SchedulerStats / IterationStats: what's measured each step.
vllm/config/scheduler.py — The tuning knobs and their defaults/semantics.

Questions to answer as you read

Which metrics matter, and how do throughput (tok/s), TTFT, ITL/TPOT, goodput, and latency percentiles differ?
How does Little's Law relate batch size, arrival rate, and latency?
What does the roofline model say about compute-bound vs memory-bound work, and what is arithmetic intensity?
What do the torch profiler, Nsight Systems, and vLLM's own metrics each show you?
What does each knob control: max_num_seqs, max_num_batched_tokens, gpu_memory_utilization, enable_chunked_prefill, CUDA graphs, quant, spec decode?
What does benchmarking properly require: vllm bench, warmup, steady state, fair comparisons?

Cross-references

Intuition: 00-guide.md
Build it yourself: 02-mini-build.md
The gold-standard depth to emulate: Phase 02 deep-dive.

Phase 18 — Mini-Build: extend `mini_vllm`

Your task

Add a metrics collector to mini_vllm (tokens/step, batch size, KV usage, preemptions) and a tiny benchmark that sweeps max_num_batched_tokens to find the throughput knee — the real tuning loop in miniature.
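
One possible starting shape for the collector, as a minimal sketch. It assumes your scheduler loop can report four numbers per step; `StepRecord` and `MetricsCollector` are hypothetical names, not part of mini_vllm's shipped modules.

```python
from dataclasses import dataclass, field


@dataclass
class StepRecord:
    tokens: int          # tokens scheduled this step (prefill + decode)
    batch_size: int      # sequences that ran this step
    kv_blocks_used: int  # KV blocks allocated after the step
    preemptions: int     # sequences preempted during the step


@dataclass
class MetricsCollector:
    total_kv_blocks: int
    steps: list[StepRecord] = field(default_factory=list)

    def record(self, tokens: int, batch_size: int,
               kv_blocks_used: int, preemptions: int = 0) -> None:
        """Call once per scheduler step."""
        self.steps.append(
            StepRecord(tokens, batch_size, kv_blocks_used, preemptions))

    def summary(self) -> dict[str, float]:
        """Aggregate the per-step records into the numbers a sweep compares."""
        if not self.steps:
            raise ValueError("no steps recorded")
        n = len(self.steps)
        total_tokens = sum(s.tokens for s in self.steps)
        return {
            "steps": n,
            "total_tokens": total_tokens,
            "mean_tokens_per_step": total_tokens / n,
            "mean_batch_size": sum(s.batch_size for s in self.steps) / n,
            "peak_kv_usage": max(s.kv_blocks_used for s in self.steps)
                             / self.total_kv_blocks,
            "preemptions": sum(s.preemptions for s in self.steps),
        }
```

For the sweep, create one collector per max_num_batched_tokens value, run the same workload, and compare the `summary()` dicts; the knee is where a larger budget stops reducing `steps`.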

Why build it (and not just read it)

Method

Look at the matching real code from 01-deep-dive.md.
Add your module under mini_vllm/ (or extend an existing one).
Write a test_*.py next to it that pins the behavior you care about.
Run pytest mini_vllm -q and keep it green.

Definition of done

Your component runs on CPU with no extra dependencies (numpy ok).
A test demonstrates the property this phase is about (not just "it runs").
You can explain, out loud, how your toy maps to the real implementation and where it intentionally simplifies.

The flagship phases ship complete mini_vllm modules + tests (mini_vllm/block_pool.py, mini_vllm/scheduler.py) — use them as your reference for structure and test style.

Phase 18 Labs — Performance Engineering

Two labs, one loop: define metrics → measure a workload under a config → search under an SLO constraint. First built cheap — a simulator over mini_vllm with arrival schedules, spike proxies, and a grid search that refuses impossible SLOs (lab-01) — then run for real with vllm bench serve, wall-clocks, percentile distributions, and the tuning report as the deliverable artifact (lab-02).

CPU labs follow the standard contract — starter.py (your work), solution.py (reference), test_lab.py (the spec); default runs the solution, LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-18-performance-engineering/labs -m "not gpu"

# Grade yourself:
LAB_IMPL=starter pytest phase-18-performance-engineering/labs/lab-01-tune-the-knobs -q

Labs

lab-01-tune-the-knobs `[CPU-OK]`

The loop, built cheap: a simulator running arrival schedules (queueing finally enters the course) with three metrics — TTFT-from-arrival, worst-step tokens (the spike proxy), total steps — and an SLO-constrained grid search that breaks ties toward latency and raises on unsatisfiable SLOs. Two measured surprises: the chunk threshold is per-request (only the budget caps a step globally), and chunking can cost zero throughput when decode steps are already there to hide chunks in. Skills: constraints beat preferences; metric calibration tests; cheap models with known biases shrink expensive searches.

lab-02-benchmark-real-vllm `[GPU-OPT]`

The loop, run for real: vllm bench serve sweeps with warm servers, two runs per config, percentiles everywhere, the rate sweep that finds the knee first — and the tuning report as the artifact (workload, table, distributions, recommendation with its trade named). The captured sweep reconciles every row against the CPU labs that predicted it. Skills: the four methodology checks; conservation of suffering (admission knobs relocate latency); macro before micro; benchmark at the knee.

What you can do after this phase

Run a tuning engagement end to end: state the workload, find the knee, sweep one knob at a time with honest variance, report distributions, and recommend with the trade named — having prototyped the search cheaply enough to afford it. You can also audit anyone else's benchmark in about a minute (workload? warm? percentiles? one knob?), which is its own kind of superpower. Phase 19 sends everything upstream.

Lab 18-01 — Tune the Knobs: an SLO-Constrained Grid Search `[CPU-OK]`

Performance engineering is one loop, run with discipline: define metrics, measure a workload under a config, search the config space under a constraint. This lab has you build the whole loop on mini_vllm — a simulator that runs an arrival schedule (requests landing at different times, the thing every previous lab simplified away) and emits three metrics: per-request TTFT, the worst step's token count (the ITL-spike proxy), and total steps (throughput). Then grid_search sweeps budget × chunk-threshold under a hard spike SLO and returns the best legal config — refusing loudly when no config qualifies, because a quietly violated SLO is the worst outcome in the trade. Along the way the tests teach two facts that surprise most tuners: the chunk threshold is per-request (two chunked prefills still stack in one step — only the budget caps globally), and chunking can cost zero throughput when a long decode stream's steps are already there to hide the chunks in.

Why this lab exists

Every knob in this course got its own lab; this one is where they meet a workload — and workloads, not knobs, are what you actually tune for. The arrival schedule is the lab's quiet upgrade over everything before it: queueing (TTFT now includes waiting — test_queueing_shows_up_in_ttft), interference between requests that arrive at different moments, and the SLO-vs-throughput tension that only exists when both matter at once. The simulator is deliberately the cheapest possible version of the loop (steps and token counts, no GPUs, milliseconds per run) because the methodology is the deliverable: lab-02 runs the identical loop with vllm bench and wall-clocks, and the only thing that changes is the cost of each measurement — which is exactly why you prototype the search cheap.

The grid search's design choices are the staff-engineer content: the SLO is a constraint, not a weighted term (latency SLOs are promises, not preferences); ties break toward lower worst-TTFT (when throughput is equal, take the latency); and unsatisfiability raises (test_unsatisfiable_slo_is_loud) — the tuning loop's version of the course's loud-failure habit, because "best effort" on an impossible SLO ships a violation with extra steps.
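
Those three design choices can be read off a few lines of code. A minimal sketch, assuming a `measure(budget, threshold)` callable that returns the three metrics; the `Result` type, this `grid_search` signature, and the `FAKE` numbers are illustrative and are not the lab's reference solution.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Sequence


@dataclass(frozen=True)
class Result:
    total_steps: int      # throughput proxy: fewer is better
    max_step_tokens: int  # spike proxy: the SLO constrains this
    worst_ttft: int       # tiebreak: lower is better


def grid_search(
    measure: Callable[[int, int], Result],
    budgets: Sequence[int],
    thresholds: Sequence[int],
    max_spike: int,
) -> tuple[int, int]:
    """Return the (budget, threshold) with the fewest steps among configs
    that satisfy the spike SLO; ties break toward lower worst TTFT."""
    legal = []
    for budget, threshold in product(budgets, thresholds):
        result = measure(budget, threshold)
        if result.max_step_tokens <= max_spike:  # hard constraint, not a weight
            key = (result.total_steps, result.worst_ttft)
            legal.append((key, (budget, threshold)))
    if not legal:
        # Loud failure: never return the least-bad violation.
        raise ValueError(f"no config satisfies max_step_tokens <= {max_spike}")
    return min(legal)[1]


# Made-up measurements, keyed by (budget, threshold), for illustration only.
FAKE = {
    (40, 16): Result(total_steps=12, max_step_tokens=40, worst_ttft=5),
    (40, 32): Result(total_steps=12, max_step_tokens=40, worst_ttft=4),
    (64, 16): Result(total_steps=10, max_step_tokens=49, worst_ttft=2),
    (64, 32): Result(total_steps=10, max_step_tokens=65, worst_ttft=3),
}


def fake_measure(budget: int, threshold: int) -> Result:
    return FAKE[(budget, threshold)]
```

Note that the SLO never enters the sort key. A violating config is dropped before ranking, so no amount of throughput can buy its way back in.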

Background: metrics, then search

The three metrics, and what each proxies:

ttft_steps — steps from arrival (not admission!) to first token. Queueing, scheduling, prefill: all of it. The user-facing wait.
max_step_tokens — the worst step's total scheduled tokens ≈ the worst inter-token stall any decoding user felt (Phase 3 lab-05's proxy, now a tunable's objective).
total_steps — the schedule's length ≈ inverse throughput at fixed step cost. (The proxy's known bias: real steps' wall-clock varies with their token count — total tokens would weight differently; lab-02's wall-clocks settle it.)

The two knobs swept are the course's latency dial (threshold) and throughput dial (budget) — and the search space is tiny on purpose. Real tuning fails far more often from unclear objectives than from undersized grids; get the constraint and the tiebreak right first, then enlarge the grid.

Files

starter.py — Metrics, simulate (arrivals + the Phase 1 lab-04 probe + first-token tracking), grid_search. Your work.
solution.py — reference.
test_lab.py — monotonicity, the per-request-vs-global cap lesson, queueing in TTFT, arrival-relative measurement, SLO compliance, and loud unsatisfiability.

Run

LAB_IMPL=starter pytest phase-18-performance-engineering/labs/lab-01-tune-the-knobs -q
pytest phase-18-performance-engineering/labs/lab-01-tune-the-knobs -q   # reference

What the tests prove

| Test | What it pins |
| --- | --- |
| `test_throughput_more_budget_never_more_steps` | The sanity direction every tuning loop needs before it can be trusted with anything subtle |
| `test_spike_threshold_is_per_request_budget_is_global` | The two-cap structure: threshold=32 still allows a 65-token step (two chunks + a decode); only budget=40 forces ≤ 40. And the surprise: chunking cost zero steps here, because the 24-token decode stream's steps were already there to hide chunks in (Sarathi's piggybacking, measured from the throughput side). A lonely fat prompt, with nowhere to hide, pays the full chunking step-count bill |
| `test_queueing_shows_up_in_ttft` | `max_num_seqs=1` makes later arrivals wait, and the metric sees it; TTFT without queueing is a benchmark fiction |
| `test_ttft_is_measured_from_arrival` | The zero-point check: an idle engine serves first tokens in the arrival step. Metrics need calibration tests too |
| `test_grid_search_respects_the_slo` | The constrained search refuses the throughput-optimal-but-violating config: constraints beat preferences |
| `test_unsatisfiable_slo_is_loud` | An impossible SLO raises; it does not return the least-bad violation |
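
The 65-token step in the second test is easy to reproduce by hand. A toy sketch of one scheduling step, assuming decodes are admitted first at one token each and each prefill is then capped per-request by the threshold and globally by what is left of the budget. `schedule_step` is a hypothetical helper, not mini_vllm's or vLLM's scheduler.

```python
def schedule_step(prefill_remaining: list[int], num_decodes: int,
                  threshold: int, budget: int) -> int:
    """Return the total tokens scheduled in one step."""
    scheduled = min(num_decodes, budget)        # decodes: 1 token each
    for remaining in prefill_remaining:
        # Per-request cap (threshold) and global cap (budget left) both apply.
        chunk = min(remaining, threshold, budget - scheduled)
        scheduled += chunk
    return scheduled


# Two long prompts plus one decoding request.
print(schedule_step([200, 200], num_decodes=1, threshold=32, budget=10_000))
print(schedule_step([200, 200], num_decodes=1, threshold=32, budget=40))
```

With a generous budget the step carries 32 + 32 + 1 tokens, because the threshold limits each prompt separately; only the budget of 40 bounds the step as a whole.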

Hitchhiker's notes

The hide-the-chunks result generalizes and matters: chunked prefill's throughput cost is max(0, chunk_steps − coexisting_decode_steps)-shaped. Fleets with deep decode streams (chat) chunk nearly free; bursty prefill-only fleets (batch summarization) pay full price — and that's also the fleet that didn't need the latency protection. The knob's cost and its benefit anti-correlate across workloads, which is why per-deployment tuning beats global defaults.
Arrival schedules are the difference between benchmarks and reality: this lab's three-request workload already produces queueing, interference, and hiding effects no all-at-once batch shows. Real benchmark suites (vllm bench serve) generate Poisson arrivals at a target QPS for the same reason — lab-02 uses exactly that.
Grid search is the right first search: 6 configs here, exhaustive, done. At real scale (5+ knobs), the same loop wraps Bayesian/successive-halving optimizers — but the metrics, the constraint handling, and the loud unsatisfiability transfer unchanged. The loop is the asset; the optimizer is a plug-in.
One proxy limitation to carry consciously: step counts can't see fixed per-step overheads (launch costs, scheduler time), so this simulator systematically favors many-small-steps configs vs what wall-clocks will say — Phase 5's whole subject is that bias. Cheap models with known biases, again (Phase 8 lab-04, Phase 15 lab-03): use them to shrink the expensive search, never to replace the final measurement.
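
Poisson arrivals, mentioned in the notes above, need only the standard library. A minimal sketch, assuming a fixed target rate and exponential gaps between requests (the defining property of a Poisson process). `poisson_arrivals` is a hypothetical helper, not the generator `vllm bench serve` uses.

```python
import random


def poisson_arrivals(rate_per_s: float, num_requests: int,
                     seed: int = 0) -> list[float]:
    """Arrival timestamps in seconds: inter-arrival gaps are
    exponentially distributed with mean 1 / rate_per_s."""
    if rate_per_s <= 0:
        raise ValueError("rate_per_s must be positive")
    rng = random.Random(seed)   # seeded so two configs see identical traffic
    now = 0.0
    times = []
    for _ in range(num_requests):
        now += rng.expovariate(rate_per_s)
        times.append(now)
    return times


arrivals = poisson_arrivals(rate_per_s=8.0, num_requests=200)
print(f"first arrival at {arrivals[0]:.3f}s, last at {arrivals[-1]:.1f}s")
```

Reusing the same seed across configs keeps the comparison fair: the traffic is identical and only the knob changes.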

Going further

Add a worst_ttft SLO as a second constraint and find workloads where the two SLOs conflict (spike cap wants small budget; TTFT wants big) — the multi-objective frontier, met honestly.
Generate Poisson arrivals (rng.poisson) at increasing rates and plot worst-TTFT vs offered load for two configs: the hockey stick where queueing takes over is the capacity limit, found by simulation — Phase 3 lab-04's going-further, completed.
Port simulate's probe to count tokens per step and weight total_steps by a per-step cost model (fixed + per-token). Calibrate the two constants against one lab-02 measurement, then re-run the grid. You've built the cheap-model/expensive-measurement two-tier loop that production tuning actually uses.
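
The per-step cost model in the last item can be sketched as follows. Two constants need two equations, so this sketch assumes two observations with different step/token mixes (for example two configs from one sweep), each given as (num_steps, total_tokens, wall_seconds). `fit_step_cost` and `weighted_cost` are hypothetical helpers.

```python
Observation = tuple[int, int, float]  # (num_steps, total_tokens, wall_seconds)


def fit_step_cost(obs_a: Observation, obs_b: Observation) -> tuple[float, float]:
    """Solve fixed * steps + per_token * tokens = seconds for both
    observations (Cramer's rule on a 2x2 system)."""
    (s1, t1, w1), (s2, t2, w2) = obs_a, obs_b
    det = s1 * t2 - s2 * t1
    if det == 0:
        raise ValueError("observations are proportional; "
                         "cannot separate the two constants")
    fixed = (w1 * t2 - w2 * t1) / det
    per_token = (s1 * w2 - s2 * w1) / det
    return fixed, per_token


def weighted_cost(step_tokens: list[int], fixed: float,
                  per_token: float) -> float:
    """Estimated wall-clock for a schedule: every step pays the fixed
    overhead plus a per-token charge."""
    return sum(fixed + per_token * n for n in step_tokens)
```

Replacing `total_steps` with `weighted_cost` over the simulator's per-step token counts gives the grid search an objective in estimated seconds instead of steps.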

References

Phase 1 lab-04 (the probe), Phase 3 labs 01/05 (the two caps and the spike) — the parts this lab assembles.
upstream/vllm/benchmarks/ and vllm bench — the production version of this loop (lab-02).
vLLM docs, Optimization and Tuning — the knobs' official guidance, now checkable against your own search: https://docs.vllm.ai/en/latest/configuration/optimization.html
Agrawal et al., Sarathi-Serve (OSDI 2024) — the piggybacking result your zero-cost-chunking test measured: https://arxiv.org/abs/2403.02310

Lab 18-02 — Benchmark Real vLLM and Write the Tuning Report `[GPU-OPT]`

Lab-01's loop, with wall-clocks: run vllm bench serve against a live server, sweep two knobs, and produce the artifact this phase exists to teach — a tuning report: workload stated, configs compared, distributions (not means) reported, a recommendation with its trade named. The capture below is such a report in miniature; your deliverable is the same table for your hardware and a workload you choose.

No GPU? Don't panic. The captured sweep below is the worked example, and the report-writing discipline is hardware-free. (You can also run the whole lab against Phase 17 lab-02's CPU backend — slower numbers, identical methodology.)

Why this lab exists

Benchmark numbers without methodology are advocacy, and most published LLM serving comparisons fail one of four checks you'll practice here: stated workload (QPS, prompt/output length distributions — Phase 13 taught how much one image shifts these), warm measurement (Phase 5's capture and compile excluded), distributions (p50/p99 for TTFT and ITL — the tails are the product, Phase 3 lab-05), and one knob at a time. The phase's CPU labs built every mental model this lab's numbers will land in; the remaining skill is operational care, which only practice installs.

Requirements

uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct

Steps

Serve (one terminal): vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000
Bench (another): sweep request rate first to find the knee, then the knobs:

vllm bench serve --backend openai-chat --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --dataset-name random --random-input-len 512 --random-output-len 128 \
  --num-prompts 200 --request-rate 8

Re-run the same command against servers restarted with one change each: --long-prefill-token-threshold 64, then --max-num-seqs 64, then --gpu-memory-utilization 0.9. Two runs per config (eyeball variance before trusting deltas — Phase 5 lab-04's discipline).
Write the report: table, distributions, the knee, one recommendation per SLO profile.
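
The rate sweep is the same command repeated with one flag changed, which makes it easy to script. A minimal sketch that only builds and prints the commands; `bench_command` is a hypothetical helper, and the flags are the ones from the command in the Steps (pass anything else through `extra_args`).

```python
import shlex


def bench_command(rate: float, extra_args: tuple[str, ...] = ()) -> list[str]:
    """Build one `vllm bench serve` invocation for a given request rate."""
    return [
        "vllm", "bench", "serve",
        "--backend", "openai-chat",
        "--base-url", "http://localhost:8000",
        "--model", "Qwen/Qwen2.5-0.5B-Instruct",
        "--dataset-name", "random",
        "--random-input-len", "512",
        "--random-output-len", "128",
        "--num-prompts", "200",
        "--request-rate", str(rate),
        *extra_args,
    ]


# Print the sweep; run each line yourself, or hand the list to subprocess.run.
for rate in (4, 8, 12):
    print(shlex.join(bench_command(rate)))
```

Keeping the workload flags in one place is the point: between runs only the request rate moves, so any change in the percentiles is attributable to load.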

Captured sweep (Qwen2.5-0.5B, L4, vLLM 0.22.1)

workload: 512-in/128-out random, 200 prompts, rate 8 req/s, warm server, 2 runs each
config                      tput tok/s   TTFT p50/p99 (ms)   ITL p50/p99 (ms)
baseline (defaults)            4,310        145 / 610          11.2 / 41.8
threshold=64                   4,150        160 / 660          11.3 / 14.9   <- p99 ITL 2.8x better
max_num_seqs=64 (was 256)      4,290        150 / 1,240        11.1 / 13.6   <- queueing moved to TTFT
gpu_mem_util=0.9 (was 0.85)    4,420        144 / 600          11.2 / 40.9   <- more KV, small gain here
# rate sweep (baseline): 4 req/s p99 TTFT 210ms; 8 -> 610ms; 12 -> 4,900ms  <- the knee is ~8-10

Reading the sweep

threshold=64: −4% throughput, ÷2.8 p99 ITL — lab-01's trade with real units, and the per-request-vs-global subtlety still applies (check max chunk concurrency before promising the cap). For a chat product this row is the recommendation; for batch summarization it's a pure loss. The workload decides; the report must say so.
max_num_seqs=64: ITL p99 improves (fewer co-resident decodes per step) but TTFT p99 doubles, because the queue moved from inside steps to in front of them. Conservation of suffering: admission knobs relocate latency between metrics; only capacity (next row) or efficiency actually removes it.
gpu_mem_util=0.9: +2.5% here because this workload wasn't KV-bound (0.5B, short contexts). The same knob on a 70B at long context is the difference between serving and queueing — a knob's value is workload-conditional, which is why the report states the workload first.
The rate sweep is the most important line: the knee (~8–10 req/s) is the capacity number every other measurement is conditional on. Benchmarking at the knee shows tradeoffs; past it, everything drowns in queueing and configs look identical (all terrible). Find the knee first, always.
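
The ratios quoted in this reading come straight from the captured table, and recomputing them is a good habit before they go into a report. A minimal sketch using only the captured numbers; `compare` and `knee_bracket` are hypothetical helpers, and picking the knee as the adjacent pair of rates with the largest p99 growth is one simple heuristic, not the only definition.

```python
baseline = {"tput": 4310, "ttft_p99": 610, "itl_p99": 41.8}
rows = {
    "threshold=64":     {"tput": 4150, "ttft_p99": 660,  "itl_p99": 14.9},
    "max_num_seqs=64":  {"tput": 4290, "ttft_p99": 1240, "itl_p99": 13.6},
    "gpu_mem_util=0.9": {"tput": 4420, "ttft_p99": 600,  "itl_p99": 40.9},
}
rate_sweep = {4: 210, 8: 610, 12: 4900}  # req/s -> p99 TTFT (ms), baseline


def compare(row: dict) -> dict:
    """Express one config relative to the baseline row."""
    return {
        "tput_pct": 100 * (row["tput"] / baseline["tput"] - 1),
        "ttft_p99_x": row["ttft_p99"] / baseline["ttft_p99"],  # >1 is worse
        "itl_p99_x": baseline["itl_p99"] / row["itl_p99"],     # >1 is better
    }


def knee_bracket(sweep: dict) -> tuple:
    """Adjacent pair of rates with the largest p99 growth ratio."""
    rates = sorted(sweep)
    pairs = zip(rates, rates[1:])
    return max(pairs, key=lambda p: sweep[p[1]] / sweep[p[0]])


for name, row in rows.items():
    c = compare(row)
    print(f"{name:18s} tput {c['tput_pct']:+.1f}%  "
          f"TTFT p99 x{c['ttft_p99_x']:.2f}  ITL p99 /{c['itl_p99_x']:.2f}")
print("knee lies between", knee_bracket(rate_sweep), "req/s")
```

A three-point sweep can only bracket the knee between 8 and 12 req/s; narrowing it to the ~8–10 quoted above takes more rates inside that bracket.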

Hitchhiker's notes

vllm bench subsumes the old benchmark_serving.py scripts — datasets (random, sharegpt, sonnet), Poisson arrivals via --request-rate, and the percentile outputs this report needs. The server-side Prometheus metrics (vllm:time_to_first_token_seconds and friends) should agree with the client-side numbers minus network — when they don't, you've found front-door overhead (Phase 16 lab-02's gap measurement).
Variance discipline scales with claim size: two runs to eyeball, five+ with a t-test before shipping a regression report someone will act on. The single most common benchmarking sin is one run per config and a conclusion from a 3% delta inside run-to-run noise.
Profile only after the macro story is clear: this lab's table tells you which config to keep; Phase 7 lab-02's profiler tells you why a step costs what it does. Macro → micro, never the reverse — profiling an untuned config optimizes the wrong thing precisely.
Report format matters more than it should: workload, configs, table, distributions, knee, recommendation-with-trade — one page. Decision-makers act on the page, not the runs; a perfect sweep badly reported changes nothing.
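
The "eyeball variance" step from the notes above can be made mechanical. A minimal sketch of the two-run check, assuming each config was run at least twice; it compares the difference in means against the worst run-to-run spread. It is not a t-test, `delta_vs_noise` is a hypothetical helper, and the throughput numbers in the usage lines are invented for illustration.

```python
def delta_vs_noise(runs_a: list[float], runs_b: list[float]) -> dict:
    """Compare the mean difference between two configs with the
    largest run-to-run spread seen inside either config."""
    if len(runs_a) < 2 or len(runs_b) < 2:
        raise ValueError("need at least two runs per config")
    mean_a = sum(runs_a) / len(runs_a)
    mean_b = sum(runs_b) / len(runs_b)
    spread = max(max(runs_a) - min(runs_a), max(runs_b) - min(runs_b))
    delta = mean_b - mean_a
    return {"delta": delta, "spread": spread, "trust": abs(delta) > spread}


# Invented tok/s numbers: the first delta clears the noise, the second does not.
print(delta_vs_noise([4280, 4340], [4390, 4450]))
print(delta_vs_noise([4280, 4340], [4300, 4420]))
```

A delta that fails this check is not evidence of no effect; it is a signal to collect more runs before writing a conclusion.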

Reflect

Reconcile each captured row with its CPU-lab prediction: threshold (lab-01 + Phase 3 lab-05), max_num_seqs (lab-01's queueing test), mem_util (Phase 2 lab-03's blocks). Any row you couldn't have predicted within 2× deserves a note in the report — that's where your model of the system is thinnest.
Your p99 TTFT SLO is 800 ms and traffic is 10 req/s on this hardware. What does the rate sweep say, and what are the three escape routes? (You're past the knee: more replicas, a smaller/quantized model — Phase 6 — or admission control that sheds load visibly. Tuning knobs won't move a knee much; capacity does.)
Why benchmark with random data instead of real prompts first? (Controlled lengths isolate the knobs; then confirm with a real-trace dataset — sharegpt — because length distributions, prefix sharing, and image tokens all shift the knee. Synthetic isolates; real validates. You need both, in that order.)

References

vllm bench serve --help and upstream/vllm/benchmarks/ — the harness.
vLLM docs, Benchmarking — official methodology notes: https://docs.vllm.ai/en/latest/contributing/benchmarks/
Phase 3 lab-05 (the ITL story), Phase 2 lab-03 (the capacity story), Phase 5 lab-04 (warmup + variance), lab-01 (the search loop this lab runs for real).
Dean & Barroso, The Tail at Scale — why every column here is a percentile: https://research.google/pubs/the-tail-at-scale/

Phase 18 — Exercises: Performance Engineering

Work these after the labs. They escalate from "explain it" to "design it" — staff-level means you can do the last ones cold.

From a profile showing low GPU util at small batch, name the likely cause and fix.
Use Little's Law to predict the batch size needed for a target throughput at a given ITL.
Design a fair benchmark comparing two configs (warmup, steady state, same traffic).

Self-grading

For each: could you (a) explain it to a teammate in 2 minutes, and (b) point to the exact upstream/ file that proves your answer? If not, re-read the matching anchor in 01-deep-dive.md.

Phase 18 — Interview Questions: Performance Engineering

Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)

Q1. Throughput is low and GPU utilization is ~30% at batch size 1–2. What's happening?

Model answer

Almost certainly CPU-launch-bound decode: many tiny kernels per step, CPU can't feed the GPU. Enable CUDA graphs, increase batch size (raise max_num_seqs / accept more concurrency), and check for Python overhead on the hot path. Confirm with a profile showing gaps between kernels.

Q2. How do you decide max_num_batched_tokens and gpu_memory_utilization?

Model answer

max_num_batched_tokens trades prefill chunk size vs decode latency: bigger = better prefill throughput but can stall decodes; tune to your prompt/output mix. gpu_memory_utilization sets how much HBM the KV cache may use — raise it to fit more concurrent sequences, but leave headroom for activations/CUDA-graph buffers to avoid OOM.

Going deeper

The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.

Phase 18 — Cheatsheet: Performance Engineering

Loop: measure (TTFT/ITL/throughput) -> find bottleneck -> turn one knob -> re-measure.
Knobs: max_num_seqs, max_num_batched_tokens, gpu_memory_utilization, chunked prefill, CUDA graphs, quant, spec decode.
Roofline: decode=memory-bound, prefill=compute-bound. Little's Law links batch/rate/latency.
Benchmark with warmup + steady state + identical traffic, or it's noise.

Key upstream files

benchmarks/
vllm/benchmarks/
vllm/v1/metrics/
vllm/v1/metrics/stats.py
vllm/config/scheduler.py

Full reference: 00-guide.md · 01-deep-dive.md

Phase 19 — Capstone — Maintainer & Startup

Don't Panic

You now understand the engine. The capstone turns understanding into a track record: land a real upstream PR, pass the staff interview loop, and (optionally) sketch a startup that's actually defensible. Don't Panic — you've already done the hard part; this phase is about leverage and judgment.

Why this phase matters

Knowledge without a public artifact is invisible. A merged PR, a benchmark writeup, and the mini_vllm engine you built ARE your portfolio. This phase is how you convert the last 18 phases into a maintainer reputation, a job, or a company.

What you'll learn

The contribution workflow: finding good-first-issues, duplicate checks, RFCs
vLLM's actual rules for AI-assisted contributions (read upstream/AGENTS.md!)
Writing a PR that gets merged: scope, tests, benchmarks, description
Code review etiquette and how trust accrues to maintainers
The staff competency map and the mock interview loop (see CAREER.md)
The startup playbook: where cost/moats live; build vs buy vs upstream

The map: where this lives in the real code

Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md) walks through the important ones line by line.

AGENTS.md — vLLM's literal contribution policy (read it before any PR). Note: no pure code-agent PRs; disclose AI use; include tests + results; check for duplicates.
docs/contributing/ — The contributing guides.
.buildkite/ and tests/ — How CI is structured and what your PR must pass.
docs/design/ — Design docs / the kind of thinking RFCs require.

Labs in this phase

lab-01-find-and-scope-a-pr [CPU-OK] — issue triage as engineering: the five-check disqualification gauntlet, then the one-page implementation plan (invariants named, regression test planned, blast radius bounded, out-of-scope explicit) for the survivor.
lab-02-mock-staff-loop [CPU-OK] — the exit exam: four timed sessions (rapid-fire, deep-dive, a design scenario with shown arithmetic, two debugging trees), graded in three layers against the model answers and CAREER.md's competency map — honestly.

See labs/README.md for the exit criteria these two labs define.

How to work this phase

Read this guide for intuition.
Read 01-deep-dive.md with the upstream/ files open.
Do 02-mini-build.md — build the mini_vllm piece yourself.
Run the labs, then attempt EXERCISES.md.
Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.

Phase 19 — Deep Dive: Capstone — Maintainer & Startup

Read this with upstream/ open. Every path is relative to upstream/ at the pinned commit v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number ever drifts, search for the named symbol instead.

Guided reading list

For each entry, read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.

AGENTS.md — vLLM's literal contribution policy (read it before any PR). Note: no pure code-agent PRs; disclose AI use; include tests + results; check for duplicates.
docs/contributing/ — The contributing guides.
.buildkite/ and tests/ — How CI is structured and what your PR must pass.
docs/design/ — Design docs / the kind of thinking RFCs require.

Questions to answer as you read

How does the contribution workflow run: where do good-first-issues come from, how do you check for duplicates, and when does a change need an RFC?
What are vLLM's actual rules for AI-assisted contributions? (Read upstream/AGENTS.md.)
What makes a PR mergeable in terms of scope, tests, benchmarks, and description?
What does good code-review etiquette look like, and how do you build reviewer trust over time?
What does the staff competency map cover, and how does the mock interview loop run? (See CAREER.md.)
In the startup playbook, where do cost and moats live, and when do you build, buy, or upstream?

Cross-references

Intuition: 00-guide.md
Build it yourself: 02-mini-build.md
The gold-standard depth to emulate: Phase 02 deep-dive.

Phase 19 — Mini-Build: extend `mini_vllm`

Your task

Capstone build: pick ONE real improvement to mini_vllm (e.g. add swapping-based preemption, beam search, or a second KV-cache group) and ship it with tests + a short design note — your dry run for an upstream PR.

Why build it (and not just read it)

Method

Look at the matching real code from 01-deep-dive.md.
Add your module under mini_vllm/ (or extend an existing one).
Write a test_*.py next to it that pins the behavior you care about.
Run pytest mini_vllm -q and keep it green.

Definition of done

Your component runs on CPU with no extra dependencies (numpy ok).
A test demonstrates the property this phase is about (not just "it runs").
You can explain, out loud, how your toy maps to the real implementation and where it intentionally simplifies.
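
As an illustration of a test that demonstrates a property rather than mere execution, here is a minimal sketch. ToyBlockPool and all of its methods are hypothetical, invented for this example; they are not the API of mini_vllm/block_pool.py or of vLLM. The property checked is the free-queue invariant the course states as "blocks in queue ⟺ ref_cnt==0".

```python
import random
from collections import deque


class ToyBlockPool:
    """Hypothetical stand-in for a block allocator (not the real mini_vllm API)."""

    def __init__(self, num_blocks: int) -> None:
        self.ref_cnt = [0] * num_blocks
        self.free_queue = deque(range(num_blocks))

    def allocate(self) -> int:
        block_id = self.free_queue.popleft()  # raises IndexError when exhausted
        self.ref_cnt[block_id] = 1
        return block_id

    def share(self, block_id: int) -> None:
        assert self.ref_cnt[block_id] > 0, "cannot share a free block"
        self.ref_cnt[block_id] += 1

    def release(self, block_id: int) -> None:
        assert self.ref_cnt[block_id] > 0, "double free"
        self.ref_cnt[block_id] -= 1
        if self.ref_cnt[block_id] == 0:
            self.free_queue.append(block_id)


def invariant_holds(pool: ToyBlockPool) -> bool:
    """A block is in the free queue if and only if its ref count is zero."""
    queued = set(pool.free_queue)
    no_duplicates = len(queued) == len(pool.free_queue)
    return no_duplicates and all(
        (i in queued) == (cnt == 0) for i, cnt in enumerate(pool.ref_cnt)
    )


def test_free_queue_invariant_survives_random_ops() -> None:
    rng = random.Random(0)  # fixed seed so a failure is reproducible
    pool = ToyBlockPool(num_blocks=8)
    held: list[int] = []  # one entry per outstanding reference
    for _ in range(1000):
        op = rng.choice(["allocate", "share", "release"])
        if op == "allocate" and pool.free_queue:
            held.append(pool.allocate())
        elif op == "share" and held:
            block_id = rng.choice(held)
            pool.share(block_id)
            held.append(block_id)
        elif op == "release" and held:
            pool.release(held.pop(rng.randrange(len(held))))
        assert invariant_holds(pool)  # checked after every operation
```

The test asserts the invariant after every step of a randomized sequence, so it fails on the first operation that breaks the property instead of only confirming that the code runs.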

The flagship phases ship complete mini_vllm modules + tests (mini_vllm/block_pool.py, mini_vllm/scheduler.py) — use them as your reference for structure and test style.

Phase 19 Labs — Capstone: Maintainer & Startup

Two process labs — no starters, no pytest; the graders are the vLLM review queue and your own honesty. Lab-01 turns the course's knowledge into a merged upstream PR: harvest real issues, run the disqualification gauntlet (claimed? reproducible? within your map? testable? reviewable?), and write the one-page implementation plan a maintainer would approve. Lab-02 is the exit exam: a timed four-session mock staff loop (rapid-fire, deep-dive, design, debugging) built from eighteen phases of INTERVIEW.md files, scored against CAREER.md's competency map — peeks cap at 2, skipped arithmetic caps at 2, and the low rows become your revision list.

Exit criteria for the course, per these labs:
  1. A merged (or at least review-surviving) upstream PR.   [lab-01]
  2. A competency matrix you'd show a hiring manager.       [lab-02]

Labs

lab-01-find-and-scope-a-pr `[CPU-OK]`

Issue triage as engineering: three candidates, five disqualification checks in cost order, and the survivor scoped with the course's move — find the load-bearing lines, name the invariants, plan the regression test, bound the blast radius (TP/quant/LoRA/spec interactions), and write the one-pager with an explicit out-of-scope line. Plus the mechanical friction-removers: claim the issue first, pre-commit hooks, DCO sign-off. Skills: selection and scoping — the actual gap between knowing and contributing.

lab-02-mock-staff-loop `[CPU-OK]`

Four timed sessions: rapid-fire fundamentals (phases 0–3), systems deep-dive (4–8), a design scenario with topology + knobs + shown arithmetic + named risks (choose from three realistic builds), and two debugging trees from the symptom catalog. Graded in three layers per answer — mechanism, invariant/arithmetic, operational consequence — against the model answers and the competency map. Skills: producing under pressure; committing to defended choices; calibrated self-assessment; the staff sentence ("128 KiB/token, so…").

Where to go from here

The course's final claim was made on page one: finish every lab and you can read and modify any part of vLLM, operate it like a principal engineer, and know where the moats are. These two labs are where you check the claim against reality — the PR against the review queue, the matrix against the map. Whatever rows come back weak, the phases are still there; whatever comes back strong, CAREER.md maps the three roads it opens (maintainer, staff IC, founder). Your notebook, your mini_vllm, your tuning reports, and your merged PR are the portfolio. Don't Panic — you built the whole engine once already.

Lab 19-01 — Find and Scope a Real Upstream PR `[CPU-OK]`

Nineteen phases built the knowledge; this lab spends it. You will triage real open vLLM issues, run the checks that separate a contributable issue from a trap, and produce the lab's artifact: a one-page implementation plan for one issue — written to the standard where a maintainer reading it would say "yes, do that." This is a process lab: no starter.py, no tests. The deliverable is the plan, and the grader is eventually the vLLM review queue itself.

Why this lab exists

The distance between "understands vLLM" and "has a merged vLLM PR" is not knowledge — it's selection and scoping, and most first-time contributors fail there: they pick an issue that's secretly hard, already claimed, or quietly obsolete, burn two weekends, and bounce off. The defense is treating issue triage as an engineering activity with checks, which is also — not coincidentally — what maintainers themselves do all day. Doing this lab honestly once gives you the habit; the habit gives you the merged PR; the merged PR (per CAREER.md) is the portfolio line that compounds.

Step 1 — Harvest candidates

gh issue list -R vllm-project/vllm --label "good first issue" --state open --limit 30
gh issue list -R vllm-project/vllm --label bug --state open --search "sort:created-desc" --limit 30

Pick three candidates with different shapes if you can: a model-support gap (Phase 14's mapping-row class), a parser/frontend bug (Phase 16's territory), and a docs/test gap (don't sneer — test PRs teach the review process at minimum stakes). For each, skim the issue thread fully: maintainer comments often contain the scoping ("this needs X first", "blocked on Y") that the title hides.
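
Before reading threads in full, the harvested list can be pre-filtered mechanically. The sketch below assumes you re-ran the first command with gh's --json flag (for example --json number,title,assignees,updatedAt) and saved the output; the shortlist function and the 60-day staleness cutoff are arbitrary choices made for this example, not part of the lab.

```python
import json
from datetime import datetime, timezone


def shortlist(raw_json: str, now: datetime, max_age_days: int = 60) -> list[dict]:
    """Keep unassigned issues updated within max_age_days, most recent first."""
    keep = []
    for issue in json.loads(raw_json):
        if issue.get("assignees"):
            continue  # someone has already claimed it
        # gh emits ISO 8601 timestamps ending in "Z"; normalize for fromisoformat.
        updated = datetime.fromisoformat(issue["updatedAt"].replace("Z", "+00:00"))
        age_days = (now - updated).days
        if age_days <= max_age_days:
            keep.append(
                {"number": issue["number"], "title": issue["title"], "age_days": age_days}
            )
    return sorted(keep, key=lambda item: item["age_days"])


if __name__ == "__main__":
    import sys

    # Usage (assumed): gh issue list ... --json number,title,assignees,updatedAt | python shortlist.py
    for item in shortlist(sys.stdin.read(), datetime.now(timezone.utc)):
        print(f'#{item["number"]}  {item["age_days"]:>3}d  {item["title"]}')
```

An empty assignee list does not prove the issue is free (claims are often made in comments), so this only shrinks the pile; the thread skim is still required.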

Step 2 — The disqualification gauntlet

Run every candidate through these checks — in this order, cheapest first:

Already claimed/fixed? gh pr list -R vllm-project/vllm --state all --search "<keywords>" (the default lists open PRs only; --state all adds closed and merged ones), plus a look at the PRs linked in the issue. The most common first-timer waste is duplicating an in-flight fix. (The repo's AGENTS.md encodes exactly these checks for AI-assisted contributors. Read it; the checklist applies to humans identically.)
Still reproducible at HEAD? Issues rot; vLLM merges dozens of PRs daily. A bug filed three weeks ago may be gone. Reproduce (or for model-support, confirm the model still errors) before writing a line.
Is the cause within your current map? Trace it to a file. If the file is one whose machinery you've built in this course (scheduler, block pool, parsers, loaders, sampler, platform code) — green. If it's deep in a CUDA kernel and you skipped the GPU labs — pick another, or budget honestly.
Is the fix small but the test meaningful? The ideal first PR is a ≤100-line diff with a regression test that pins the behavior forever — the shape every lab in this course drilled. Issues whose fix is one line but whose test is impossible, or vice versa, score worse than they look.
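Will anyone review it? Check the subsystem's recent merge velocity. GitHub's issue/PR search has no path: qualifier, so read it from history in an up-to-date clone (git log --since="1 month ago" --oneline -- vllm/<area>) or from the area's label, if it has one (gh pr list -R vllm-project/vllm --label "<area-label>" --state merged --limit 10). A perfect PR into an unowned corner can sit for months; that's demoralizing precisely when momentum matters most.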

Expect to disqualify two of three. That's the gauntlet working, not failing.
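
The ordering rule above (cheapest check first, stop at the first failure) can be sketched as a short-circuiting pipeline. Everything here is illustrative: the dictionary keys are hypothetical fields you would fill in by hand after doing each check, and "at least one recent merge" is an assumed stand-in for merge velocity.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Check:
    name: str
    passes: Callable[[dict], bool]  # True means the candidate survives this check


# Order matters: cheapest check first, so a doomed candidate costs little.
GAUNTLET = [
    Check("not already claimed or fixed", lambda c: not c["open_pr_exists"]),
    Check("still reproducible at HEAD", lambda c: c["reproduces_at_head"]),
    Check("cause within your map", lambda c: c["cause_in_known_file"]),
    Check("small fix, meaningful test", lambda c: c["diff_lines"] <= 100 and c["testable"]),
    Check("someone will review it", lambda c: c["recent_merges_in_area"] > 0),
]


def first_failure(candidate: dict) -> Optional[str]:
    """Return the name of the first failed check, or None for a survivor."""
    for check in GAUNTLET:
        if not check.passes(candidate):
            return check.name  # stop here; later checks are more expensive
    return None


if __name__ == "__main__":
    # Made-up candidate, for illustration only.
    candidate = {
        "open_pr_exists": False,
        "reproduces_at_head": True,
        "cause_in_known_file": True,
        "diff_lines": 40,
        "testable": True,
        "recent_merges_in_area": 6,
    }
    print(first_failure(candidate) or "survivor")
```

Recording which check killed each candidate is useful on its own: if every candidate dies at the same check, the harvesting query is the thing to change.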

Step 3 — Scope the survivor

For the survivor, do the course's move: find the load-bearing lines. Identify the function(s) to change, the invariants they maintain (you can usually name them now — I1–I4, the budget cap, chunking invariance, the contract widths), the test file where the regression test belongs, and the blast radius (who calls this? does TP/quant/LoRA/spec-decode interact? — the feature-composition questions Phases 10–12 trained).

The one-page plan (the artifact)

Issue: #NNNNN — <title>                       Reproduced at: <commit>
Cause (file:line): ...                        (one paragraph, mechanism not symptom)
Fix sketch: ...                               (what changes, why it preserves the invariants)
Test plan: ...                                (the regression test: file, fixture, the assert)
Blast radius: ...                             (interactions checked: TP / quant / LoRA / spec / none)
Out of scope: ...                             (the adjacent improvements you are NOT doing — scoping
                                               discipline is mostly this line)
Open questions for maintainers: ...           (≤2, specific — these go in the PR description)

One page. If it doesn't fit, the scope is wrong — shrink the issue, not the font.
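
If you keep plans in a repository, the template can be held as a small record that refuses to call itself done while a field is blank. This is a sketch under assumptions: the OnePagePlan class is hypothetical, and the 50-line limit is an arbitrary proxy for "one page".

```python
from dataclasses import dataclass, field


@dataclass
class OnePagePlan:
    """The plan template as a record; labels mirror the template above."""

    issue: str
    reproduced_at: str
    cause: str
    fix_sketch: str
    test_plan: str
    blast_radius: str
    out_of_scope: str
    open_questions: list[str] = field(default_factory=list)

    def rows(self) -> list[tuple[str, str]]:
        return [
            ("Issue", self.issue),
            ("Reproduced at", self.reproduced_at),
            ("Cause (file:line)", self.cause),
            ("Fix sketch", self.fix_sketch),
            ("Test plan", self.test_plan),
            ("Blast radius", self.blast_radius),
            ("Out of scope", self.out_of_scope),
        ]

    def render(self) -> str:
        lines = [f"{label}: {text}" for label, text in self.rows()]
        lines += [f"Open question: {q}" for q in self.open_questions]
        return "\n".join(lines)

    def problems(self, max_lines: int = 50) -> list[str]:
        """Return reasons the plan is not ready; an empty list means it passes."""
        found = [f"{label} is empty" for label, text in self.rows() if not text.strip()]
        if len(self.open_questions) > 2:
            found.append("more than 2 open questions")
        if len(self.render().splitlines()) > max_lines:
            found.append(f"longer than {max_lines} lines: shrink the scope")
        return found
```

The check treats an empty "Out of scope" field as a defect on purpose, since that is the line the template singles out as the core of scoping discipline.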

Hitchhiker's notes

Comment on the issue before coding ("I'd like to take this; plan: <one-line summary>"): it claims the work, invites early correction, and costs nothing. Maintainers redirect cheap plans gladly and expensive PRs reluctantly.
Read docs/contributing/ and run the pre-commit hooks before the first commit — format/lint failures are the #1 cause of first-PR friction, and they're entirely avoidable mechanically.
The DCO sign-off (git commit -s) is required and forgotten by almost every first contributor. Set git config alias.cs "commit -s" now.
Your course artifacts are your credibility: a PR description that says "this preserves the free-queue invariant (blocks in queue ⟺ ref_cnt==0)" or "verified chunking-invariance with a randomized-slicing test" reads as a peer, not a tourist. Use the vocabulary; you've earned it.
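
The sign-off that git commit -s adds is a trailer line of the form "Signed-off-by: Name <email>". A minimal local check, assuming you feed it the text printed by git log -1 --format=%B, might look like the sketch below; has_sign_off is a hypothetical helper, and the project's own DCO check may apply stricter rules (for instance matching the commit author).

```python
import re

# One trailer line: "Signed-off-by: <name> <<email>>", anchored to a whole line.
SIGN_OFF = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$", re.MULTILINE)


def has_sign_off(commit_message: str) -> bool:
    """True if the message contains at least one well-formed sign-off line."""
    return SIGN_OFF.search(commit_message) is not None


if __name__ == "__main__":
    import sys

    # Usage (assumed): git log -1 --format=%B | python check_dco.py
    sys.exit(0 if has_sign_off(sys.stdin.read()) else 1)
```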

Going further

Do it: implement the plan, open the PR, survive review. Review is the lab's second half — expect a round or two; respond to every comment (fix or argue, never silence). The merge is Phase 19's true exit criterion.
Then do the maintainer's side once: pick someone else's open first-PR and review it against your gauntlet — kindly, concretely. Both seats teach.
Keep the plan template. Every nontrivial change you ever make — upstream or at work — deserves the one-pager; teams that institutionalize it ship faster and argue less.

References

upstream/AGENTS.md and upstream/docs/contributing/ — the project's own checks and process.
vLLM good-first-issue board: https://github.com/vllm-project/vllm/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
CAREER.md — where the merged PR fits in the maintainer path.
Phase 14 lab-03 (mapping rows — the classic first-PR shape), Phase 16 lab-01 (parser bugs — the second-classic).

Lab 19-02 — The Mock Staff Loop `[CPU-OK]`

Eighteen phases of INTERVIEW.md files exist for this moment: a full, timed, self-administered staff-engineer loop — four sessions, graded against the model answers and CAREER.md's competency map, with the gaps feeding a revision list rather than an ego. The deliverable is two artifacts: your scored competency matrix and the one-pager from the design session. This is the course's exit exam, and you are both candidate and (the harder job) honest grader.

Why this lab exists

Knowledge you can't produce under time pressure, out of order, against follow-up questions, isn't yet yours — it's still the book's. Staff loops test exactly the transformations this course optimized for: derive rather than recall (the economics labs), name the invariant under the feature (every lab's tables), state the trade with both sides priced (every Hitchhiker's note). The mock loop is where you find which of those moves are reflexes now and which still need the page open. Run it honestly once and your revision list writes itself; run it honestly twice, a month apart, and you'll have the rare commodity of calibrated confidence going into real loops — or real design meetings, which are the same exam with stakes.

The loop format

Four sessions, strictly timed, one sitting if you can manage it (fatigue realism included), notes only AFTER each session ends:

| Session | Time | Source material |
| --- | --- | --- |
| 1. Fundamentals rapid-fire | 30 min | 2 questions each from phases 0–3 INTERVIEW.md, randomized |
| 2. Systems deep-dive | 45 min | 1 question each from phases 4–8, with self-posed follow-ups |
| 3. Design: "serve X under SLO Y on hardware Z" | 60 min | Construct from phases 10/15/18 (3 scenarios below) |
| 4. Debugging scenario | 30 min | Pick 2 from the symptom catalog below |

Design scenarios (pick one): (a) 70B chat, p99 TTFT < 1 s / ITL < 30 ms, 16 A100s across 2 nodes; (b) 100-tenant fine-tune platform, 8B base, 8 GPUs; (c) agentic workload, 8B + heavy tool calling, single-stream-latency-obsessed, 4 H100s. Produce the one-pager: topology (TP/PP/replicas/disagg), knobs with values and reasons, capacity arithmetic shown, the two biggest risks named.
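
One way to start the topology line of the one-pager is to enumerate the options before committing to one. The sketch below makes several assumptions not stated in the scenarios: 8 GPUs per node for scenario (a), tensor-parallel groups that must fit inside a node, and weights that split evenly across TP × PP ranks (real splits are uneven, and KV cache and activations are ignored). Both function names are hypothetical.

```python
def topologies(total_gpus: int, gpus_per_node: int) -> list[tuple[int, int, int]]:
    """All (tp, pp, replicas) with tp * pp * replicas == total_gpus.

    Assumption: a TP group never spans nodes, so tp must divide gpus_per_node.
    """
    found = []
    for tp in range(1, gpus_per_node + 1):
        if gpus_per_node % tp or total_gpus % tp:
            continue
        groups = total_gpus // tp  # number of TP groups available
        for pp in range(1, groups + 1):
            if groups % pp == 0:
                found.append((tp, pp, groups // pp))
    return found


def weight_gb_per_gpu(params_billion: float, bytes_per_param: float, tp: int, pp: int) -> float:
    """Weights per GPU in decimal GB, assuming a perfectly even split."""
    return params_billion * bytes_per_param / (tp * pp)


if __name__ == "__main__":
    for tp, pp, replicas in topologies(total_gpus=16, gpus_per_node=8):
        gb = weight_gb_per_gpu(70, 2, tp, pp)  # 70B parameters at 2 bytes each
        print(f"TP={tp} PP={pp} replicas={replicas}: {gb:.1f} GB weights per GPU")
```

Under these assumptions, 70B parameters at 2 bytes each on TP=4 PP=2 comes to 17.5 GB of weights per GPU with two replicas. The list is only the menu; the one-pager still has to pick one row and defend it with the communication and KV arithmetic.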

Debugging symptoms (pick two; talk through the diagnosis tree out loud): p99 ITL spikes hourly (Phase 3/18); throughput fell 30% after a model swap (Phase 4/6 — check the backend line); tenant 7 complains, dashboards green (Phase 11 — slot thrash); seeded requests not reproducing (Phase 9); outputs differ across TP sizes (Phase 10 — the last ulp); VLM TTFT doubled (Phase 13 — image sizes).

Session guide

Answer out loud or in writing — producing is the test; reading silently grades as zero. For each question, the staff-grade answer has three layers, and you should consciously hit all three: the mechanism (what happens), the invariant or arithmetic underneath (why it must be so — quote the formula, name the I-number), and the operational consequence (what you'd do about it at 3 a.m.). The model answers in the INTERVIEW.md files are written in roughly that shape; grade against the shape, not just the facts.

Grading honestly

Score each competency row from CAREER.md's map: 3 = derived it cold, follow-ups survived; 2 = got there with hesitation or one peek; 1 = knew of it; 0 = blank. The two honesty rules: a peek caps the row at 2 (that's what the peek means), and an answer that skipped the arithmetic when arithmetic existed caps at 2 (staff answers compute — the course's whole thesis). Rows at ≤1 map directly to phases; that's your revision list, and the labs are designed for exactly this kind of targeted re-entry (each phase's index lists its skills).
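
The scoring rules above are mechanical enough to write down, which helps keep the grading honest. In this sketch the function names are hypothetical and the row names in the usage example are placeholders, not the actual rows of CAREER.md's map.

```python
def row_score(raw: int, peeked: bool, skipped_arithmetic: bool) -> int:
    """Apply the two honesty caps to a raw 0-3 self-score."""
    if not 0 <= raw <= 3:
        raise ValueError("raw score must be between 0 and 3")
    if peeked or skipped_arithmetic:
        return min(raw, 2)  # either rule caps the row at 2
    return raw


def revision_list(scores: dict[str, int]) -> list[str]:
    """Rows scoring 1 or below are the ones to revisit."""
    return sorted(name for name, score in scores.items() if score <= 1)


if __name__ == "__main__":
    # Placeholder row names, for illustration only.
    scores = {
        "row A": row_score(3, peeked=True, skipped_arithmetic=False),
        "row B": row_score(1, peeked=False, skipped_arithmetic=False),
        "row C": row_score(3, peeked=False, skipped_arithmetic=False),
    }
    print(scores, revision_list(scores))
```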

Hitchhiker's notes

The design session is the one that decides real loops — and its failure mode is breadth without commitment. Force yourself to choose (TP=4 PP=2, not "TP or maybe PP") and defend with the lab arithmetic (Phase 10 lab-03's comm bill, Phase 0 lab-02's KV budget, Phase 15 lab-03's toll). Reviewers — real and self — reward a defended wrong choice over an undefended hedge.
Say the numbers out loud. "128 KiB per token, so 2048-token contexts cost 256 MiB each, so 8 GiB of free HBM holds ~32 of them" is a staff sentence; "KV is big" is not. The course gave you maybe twenty such derivations — sessions 1 and 3 should each surface five.
Interviewing the interviewer: after each model-answer comparison, ask what follow-up the answer invites and answer that too. Real loops live in the follow-ups; the INTERVIEW.md files seed them deliberately.
A month later, rerun only the rows you revised. Spaced, targeted, calibrated: the same discipline as performance work (measure, change one thing, measure).
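
The staff sentence quoted in the notes above ("128 KiB per token, so...") can be checked in a few lines. The model shape used to arrive at 128 KiB per token (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is an assumption chosen because it produces that figure; the source does not say which model the number comes from.

```python
KIB, MIB, GIB = 2**10, 2**20, 2**30


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def contexts_that_fit(per_token_bytes: int, context_tokens: int, free_hbm_bytes: int) -> tuple[int, int]:
    """Return (bytes per full context, number of full contexts that fit)."""
    per_context = per_token_bytes * context_tokens
    return per_context, free_hbm_bytes // per_context


per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2)
per_context, count = contexts_that_fit(per_token, context_tokens=2048, free_hbm_bytes=8 * GIB)
print(per_token // KIB, per_context // MIB, count)  # 128 256 32
```

Changing one input at a time (halve the element size, double the context) and re-reading the three numbers is a quick way to rehearse the derivation before session 3.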

Going further

Trade loops with a colleague — grading someone else against the model answers teaches more than being graded, and explaining a phase you "know" is the final filter for whether you do.
Take the design one-pager from session 3 and cost it on real cloud prices — the startup half of CAREER.md begins exactly there (capacity arithmetic × dollars = the unit economics every inference company lives or dies by).
Publish your best answer (blog post, internal doc) — the act of writing for strangers finds the remaining gaps, and the artifact compounds the way merged PRs do.

References

The INTERVIEW.md in every phase directory — the question bank.
CAREER.md — the competency map you're scoring against, and the maintainer/staff/startup paths the scores feed.
Lab-01 — the other half of the capstone: the loop proves you can explain the engine; the merged PR proves you can change it. Exit with both.

Phase 19 — Exercises: Capstone — Maintainer & Startup

Work these after the labs. They escalate from "explain it" to "design it" — staff-level means you can do the last ones cold.

Write a merge-ready PR description for a small fix (scope, tests run, why not a duplicate).
Pick a model from this course and propose, with numbers, the single highest-ROI optimization.
Draft a one-paragraph startup thesis with a defensible moat per CAREER.md Track C.

Self-grading

For each: could you (a) explain it to a teammate in 2 minutes, and (b) point to the exact upstream/ file that proves your answer? If not, re-read the matching anchor in 01-deep-dive.md.

Phase 19 — Interview Questions: Capstone — Maintainer & Startup

Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)

Q1. How would you land your first vLLM contribution?

Model answer

Find a good-first-issue or a real bug you hit; run the duplicate checks (gh issue/pr search) per AGENTS.md; reproduce it; write a minimal fix WITH a test that pins the behavior and a clear PR description stating what you ran and why it isn't a duplicate; respond to review quickly. Specialize in one area to build reviewer trust over time.

Q2. Where's the moat for an inference startup built on vLLM?

Model answer

Not in renting GPUs around vanilla vLLM (margins compress). It's in a sustained kernel/scheduling edge, workload specialization (long-context/agentic/structured), the control plane (routing, autoscaling, multi-tenancy, cost attribution), or distribution/switching costs. Upstream commodity features; keep the genuine edge.

Going deeper

The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.

Phase 19 — Cheatsheet: Capstone — Maintainer & Startup

Read AGENTS.md FIRST. No pure code-agent PRs; disclose AI; include tests+results; no dupes.
Merge-ready PR = small scope + tests that pin behavior + benchmark if perf + clear why.
Portfolio = merged PR + benchmark writeup + your mini_vllm engine.
Moats: kernel/scheduling edge, workload specialization, control plane, distribution.

Key upstream files

AGENTS.md
docs/contributing/
.buildkite/ and tests/
docs/design/

Full reference: 00-guide.md · 01-deep-dive.md
