vLLM Mastery — From Zero to Maintainer
A deep, lab-driven journey through the internals of the world's most popular open-source LLM inference engine.
This is not a tutorial. It is a 20-phase apprenticeship. If you start at Phase 0 knowing nothing about how language models run, and you finish every lab, you will be able to:
- Read and modify any part of the vLLM codebase — the scheduler, the KV-cache manager, attention backends, quantization, speculative decoding, distributed execution.
- Land real pull requests upstream and reason about them like a maintainer.
- Operate as a principal / staff LLM-inference engineer — design serving systems, debug throughput cliffs, and make the architectural calls that decide whether a model serves 10 or 10,000 users per GPU.
- Found or join a startup in the inference space and know exactly where the moats are.
Everything you need is in this repository. You will never need an outside book.
Contents
- The two things that make this work
- How each phase is structured
- The curriculum (20 phases)
- Recommended path
The two things that make this work
1. You read the real engine
Every concept is anchored to the actual vLLM source code, frozen at a single commit
(see UPSTREAM_PIN.md: v0.22.1 @ 0decac0). When a phase says
vllm/v1/core/block_pool.py:333 — BlockPool.get_new_blocks()
that line really exists in ./upstream/ and you are expected to open it. We do not
paraphrase the engine. We quote it and explain it line by line.
⚠️ vLLM moves fast (dozens of merged PRs per day). Line numbers are valid only at the pinned commit. The named class/function is always given so you can re-find it in any version. Re-create the exact tree with the command in UPSTREAM_PIN.md.
2. You build a small engine
Reading is not understanding. So in parallel you build mini_vllm/ — a deliberately
small, dependency-light reimplementation of vLLM's core ideas that runs on a laptop CPU,
no GPU required. By the end you will have written, with your own hands:
- a paged KV-cache block allocator (Phase 2),
- a continuous-batching scheduler with prefix caching (Phase 3),
- a sampler, an n-gram speculative decoder, a batched-LoRA matmul, a grammar mask, …
The real engine teaches you what production looks like. The mini engine teaches you why every decision was made. You need both, and this pairing of reading the real code and building your own is what the course is anchored on.
How each phase is structured
Every phase-NN-*/ folder has the same shape:
| File | What it is |
|---|---|
| 00-guide.md | The Hitchhiker's Guide to the topic. Don't Panic. Pure intuition, analogies, ASCII diagrams. Assumes you know nothing. Read this first. |
| 01-deep-dive.md | The real implementation. Upstream path:line references, quoted excerpts, line-by-line explanation, data structures, edge cases. |
| 02-mini-build.md | Build or extend the mini_vllm/ component for this topic. |
| labs/lab-NN-*/ | Hands-on labs: README.md + starter.py + solution.py + test_lab.py. |
| EXERCISES.md | Graded challenges, easy → staff-level, with hints and solutions. |
| INTERVIEW.md | Real staff/principal interview questions on the topic, with model answers. |
| CHEATSHEET.md | One page: APIs, invariants, performance knobs, gotchas. |
Lab hardware tags
Not everyone has a GPU. Every lab is tagged:
- [CPU-OK] — runs anywhere, including the CI on your laptop. Most labs.
- [GPU-OPT] — better on a GPU but has a CPU fallback; expected GPU output is captured in the README so you can follow along without one.
- [GPU-REQ] — genuinely needs an NVIDIA GPU (real CUDA kernels). The README includes captured output and a step-by-step so you learn even if you only rent a GPU later.
See SETUP.md for environment setup and cheap cloud-GPU options.
The curriculum (20 phases)
| # | Phase | One-line goal |
|---|---|---|
| 00 | Foundations | What an LLM forward pass is; prefill vs decode; why the KV cache exists. |
| 01 | Architecture & Request Lifecycle | Trace one request from LLM.generate() to tokens out. |
| 02 | PagedAttention ⭐ | How vLLM stores KV memory in fixed-size pages to avoid fragmentation. |
| 03 | Continuous Batching & Scheduler ⭐ | Iteration-level scheduling, chunked prefill, prefix caching, preemption. |
| 04 | Attention Backends | FlashAttention, FlashInfer, TRTLLM-GEN, FlashMLA, Triton. |
| 05 | CUDA Graphs & torch.compile | Piecewise vs full graphs; the compilation pipeline. |
| 06 | Quantization | FP8/MXFP4/NVFP4/INT8/INT4, GPTQ/AWQ/GGUF/compressed-tensors. |
| 07 | GEMM & MoE Kernels | CUTLASS GEMM; MoE routing & grouped GEMM; expert parallelism. |
| 08 | Speculative Decoding | n-gram, suffix, EAGLE, DFlash; draft/verify & rejection sampling. |
| 09 | Sampling & Decoding Algorithms | top-k/p, penalties, parallel sampling, beam search, logits processors. |
| 10 | Distributed Inference | Tensor / Pipeline / Data / Expert / Context parallelism. |
| 11 | Multi-LoRA | Batched adapters, punica/SGMV, dense + MoE LoRA. |
| 12 | Structured Outputs | Grammar-constrained decoding via xgrammar / guidance. |
| 13 | Multimodal Models | Vision encoders, image-token merging, processor cache. |
| 14 | Model Architectures | Add a model: decoder-only, MoE, hybrid/SSM, embedding/reward. |
| 15 | Disaggregated Serving | Prefill/decode/encode split; KV transfer connectors. |
| 16 | Serving APIs & Parsers | OpenAI & Anthropic APIs, gRPC, streaming, tool/reasoning parsers. |
| 17 | Hardware Backends & Plugins | The platform abstraction; NVIDIA/AMD/CPU/TPU plugins. |
| 18 | Performance Engineering | Profiling, benchmarking, roofline thinking, tuning knobs. |
| 19 | Capstone — Maintainer & Startup | Land a real PR; the staff competency map; the startup playbook. |
⭐ = the original flagship phases that set the template. Every phase now has fully
written labs — 60+ in total, each with an in-depth guide-style README, and (for the
CPU labs) a tested starter.py / solution.py / test_lab.py triplet. Run the whole
suite with pytest -m "not gpu" from the repo root; every phase's labs/README.md
gives the recommended order and the skills each lab delivers.
Recommended path
- Do them in order, 0 → 19. Each builds on the last; mini_vllm/ grows phase by phase.
- For each phase: read 00-guide.md → read 01-deep-dive.md with upstream/ open in a second window → do 02-mini-build.md → run the labs → attempt EXERCISES.md → self-test with INTERVIEW.md.
- Run the tests constantly: pytest -m "not gpu" from the repo root.
- Keep a lab notebook. When you finish, your notebook + mini_vllm/ + a merged upstream PR is your portfolio.
Start here: SETUP.md, then phase-00-foundations/00-guide.md.
See also: GLOSSARY.md (every term defined once) and CAREER.md (the maintainer path, the staff competency map, the startup playbook).
This repo also builds as a website (mdBook → Cloudflare Pages): see PUBLISHING.md.
Setup
This course is designed so that the majority of labs run on a laptop CPU. You only need
a GPU for the labs explicitly tagged [GPU-REQ] (and even those ship captured output so you
can learn without one).
Contents
- 1. Python environment
- 2. Get the real vLLM source (required for the deep-dives)
- 3. (Optional) Install the real engine for the GPU labs
- 4. Running the labs and tests
- 5. Cheap GPU access for the [GPU-REQ] labs
- 6. Models used in labs
- Troubleshooting
1. Python environment
We follow vLLM's own convention and use uv (fast, and
it's what upstream uses — see upstream/AGENTS.md). Plain venv works too.
# Install uv (one time)
curl -LsSf https://astral.sh/uv/install.sh | sh
# From the repo root:
uv venv --python 3.12
source .venv/bin/activate
# Install the CPU-only course dependencies (numpy + pytest). This is all you need
# for mini_vllm and every [CPU-OK] lab.
uv pip install -e .
To run the torch-based labs (some Phase 2/4 mini-builds), add the CPU build of torch:
uv pip install -e ".[torch]" # CPU wheels are fine; no CUDA needed for mini_vllm
2. Get the real vLLM source (required for the deep-dives)
Every 01-deep-dive.md cites upstream/... paths. Clone the pinned tree:
git clone --depth 1 --branch v0.22.1 \
https://github.com/vllm-project/vllm.git upstream
cd upstream && git rev-parse HEAD # 0decac0d96c42b49572498019f0a0e3600f50398
cd ..
You do not need to install vLLM to read its source. (upstream/ is gitignored.)
3. (Optional) Install the real engine for the GPU labs
The real vllm package needs a CUDA build of torch and an NVIDIA GPU. Install it only on
a GPU box:
uv pip install -e ".[vllm]" # vllm==0.22.1, matches the pin
4. Running the labs and tests
# All CPU tests (mini_vllm + flagship labs). Run this constantly.
pytest -m "not gpu"
# Just one phase's labs
pytest phase-02-paged-attention/labs
# The mini engine's own test suite
pytest mini_vllm
# On a GPU box, also run the GPU-tagged tests
pytest -m gpu
GPU tests are auto-skipped when no CUDA device is present (see the gpu_device fixture in
each phase's conftest.py), so pytest is always green on a laptop.
5. Cheap GPU access for the [GPU-REQ] labs
You do not need to own a GPU. Options, cheapest-effort first:
| Option | Notes |
|---|---|
| Google Colab (free/Pro) | Free T4 is enough for small-model vLLM labs. Easiest start. |
| Modal / RunPod / Lambda / Vast.ai | Per-second/per-hour A10/L4/A100 rentals. ~$0.4–$2/hr for the GPUs these labs use. |
| Cloud spot instances (AWS g5, GCP g2) | Cheapest sustained; more setup. |
A T4 or L4 (16–24 GB) runs every GPU lab in this course with a small model
(e.g. facebook/opt-125m, Qwen/Qwen2.5-0.5B). You will never need an 80 GB card to learn.
6. Models used in labs
Labs default to tiny models so they download fast and fit small GPUs (and some run on
CPU): facebook/opt-125m, Qwen/Qwen2.5-0.5B-Instruct, TinyLlama/TinyLlama-1.1B. Each
lab README names the exact model and the huggingface-cli download command.
Troubleshooting
- pytest collects 0 tests → run from the repo root (so pyproject.toml is found).
- import vllm fails on a laptop → expected; the real engine needs CUDA. Use the [CPU-OK] labs and mini_vllm on a laptop; the captured outputs cover the rest.
- Line numbers in a deep-dive don't match → you're not at the pinned commit. Re-clone per step 2, or search for the named function instead of trusting the line number.
Upstream Pin
Every "original code reference" in this curriculum is anchored to a single, frozen
snapshot of the real vLLM source tree so that path:line citations stay reproducible
even as upstream moves on.
| Field | Value |
|---|---|
| Project | vllm-project/vllm |
| Release tag | v0.22.1 |
| Commit SHA | 0decac0d96c42b49572498019f0a0e3600f50398 |
| Pinned on | 2026-06-08 |
| Local path | ./upstream/ (gitignored — not committed, re-clone as below) |
Re-create the exact tree
git clone --depth 1 --branch v0.22.1 https://github.com/vllm-project/vllm.git upstream
cd upstream && git rev-parse HEAD # must print 0decac0d96c42b49572498019f0a0e3600f50398
How citations are written
Throughout the phases you will see references like:
vllm/v1/core/sched/scheduler.py:312 @ 0decac0 — Scheduler.schedule()
- The path is relative to upstream/.
- The line number is valid only at the pinned SHA. If you check out a newer vLLM, open the file and search for the named symbol (the function/class is given) instead of trusting the line number.
- @ 0decac0 is the short SHA, a reminder that the snapshot is frozen.
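If you want to check citations mechanically (for example when spot-checking after a pin bump), the format above is regular enough to parse. The sketch below is only an illustration: the `Citation` type, `CITATION_RE` pattern, and `parse_citation` helper are hypothetical names, not part of this repository or of vLLM.

```python
import re
from typing import NamedTuple


class Citation(NamedTuple):
    path: str    # path relative to upstream/
    line: int    # valid only at the pinned SHA
    sha: str     # short or full commit SHA
    symbol: str  # the named class/function to search for


# Hypothetical pattern for "path:line @ sha <dash> Symbol".
# \u2014 is the long dash used in the course's citations; a plain "-" is accepted too.
CITATION_RE = re.compile(
    r"^(?P<path>[^:\s]+):(?P<line>\d+)\s*@\s*(?P<sha>[0-9a-f]{7,40})"
    r"\s*[\u2014-]\s*(?P<symbol>.+)$"
)


def parse_citation(text: str) -> Citation:
    """Split one citation string into its four parts."""
    match = CITATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a citation: {text!r}")
    return Citation(
        match["path"], int(match["line"]), match["sha"], match["symbol"].strip()
    )


cite = parse_citation(
    "vllm/v1/core/sched/scheduler.py:312 @ 0decac0 \u2014 Scheduler.schedule()"
)
print(cite.symbol)  # Scheduler.schedule()
```

The symbol is the part that survives a version change, which is why it is kept as its own field.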
Why pin at all?
vLLM merges dozens of PRs per day. A line number that is correct today is wrong next week. Pinning is the same discipline real maintainers use when they write design docs and bug reports: always cite a commit, never "main". When you eventually contribute upstream (Phase 19), you will cite commits in exactly this way in your PR descriptions and issue reports.
Bumping the pin (later)
When you want to refresh the curriculum against a newer vLLM:
- Re-clone at the new tag, update the table above.
- Re-run the path:line spot-check in each phase's 01-deep-dive.md.
- Note behavioral changes in a CHANGES.md per phase — diffing how the engine evolved is itself one of the most instructive exercises in this whole course.
Glossary
Every term used in this course, defined once, in plain language. When a phase uses a term, it links here. If you ever feel lost, this is the place to land.
Ordering is roughly conceptual, grouped by theme, not alphabetical — read top to bottom the first time, then use Ctrl-F.
Contents
- The model and the forward pass
- Attention and the KV cache
- PagedAttention & memory (Phase 2)
- Scheduling & batching (Phase 3)
- Kernels & execution (Phases 4–7)
- Decoding strategies (Phases 8–9)
- Distributed & serving (Phases 10, 15–16)
- Adaptation & structure (Phases 11–12)
- vLLM internals & process model
The model and the forward pass
- Token — a chunk of text (often ~¾ of a word) that the model reads and writes. Text is turned into a list of integer token IDs by a tokenizer.
- Embedding — the vector the model uses to represent a token internally.
- Forward pass — running the model once over some tokens to produce, for the last position, a score for every possible next token (the logits); softmax turns those scores into a probability distribution.
- Logits — the raw, pre-softmax scores over the whole vocabulary for "what comes next".
- Autoregressive generation — generate one token, append it to the input, run the forward pass again, repeat. LLMs generate text one token at a time this way.
- Decoder-only model — the architecture of GPT/Llama/Qwen: a stack of transformer blocks that only attend to earlier tokens (causal). Most LLMs.
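The autoregressive loop defined above fits in a few lines. This is a minimal sketch with a made-up stand-in for the model: `toy_forward` and its "next ID is last ID plus one" rule are invented for illustration and have nothing to do with a real network.

```python
from typing import Callable, List


def toy_forward(token_ids: List[int], vocab_size: int = 8) -> List[float]:
    """Stand-in for a model: return logits for the next token.

    Invented rule: the highest score goes to (last token + 1) mod vocab_size.
    """
    target = (token_ids[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


def generate(
    prompt_ids: List[int],
    max_new_tokens: int,
    forward: Callable[[List[int]], List[float]] = toy_forward,
) -> List[int]:
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = forward(ids)  # one forward pass over everything so far
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        ids.append(next_id)  # append the new token and go around again
    return ids


print(generate([3, 4], max_new_tokens=4))  # [3, 4, 5, 6, 7, 0]
```

Note that every iteration passes the whole sequence back in. That repeated work is what the KV cache, defined in the next section, removes.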
Attention and the KV cache
- Attention — the operation where each token "looks at" previous tokens and mixes in their information. Each token computes a Query (Q) and compares it against the Key (K) of every earlier token; the resulting weights decide how much of each token's Value (V) to mix in.
- KV cache — because earlier tokens don't change, their K and V vectors can be computed once and cached. The KV cache is the stored K and V for every token processed so far (prompt and generated). After the model weights, it is usually the biggest consumer of GPU memory during serving, and the part that grows with every request and every token. This course is largely the story of managing it well.
- Prefill — the first forward pass over the whole prompt at once. Compute-bound (lots of tokens, one pass). Fills the KV cache for the prompt.
- Decode — each subsequent single-token forward pass. Memory-bandwidth-bound (one token, must read all weights + the whole KV cache). This is where most serving time goes.
- TTFT (time to first token) — latency from request arrival to the first output token. Dominated by prefill.
- ITL / TPOT (inter-token latency / time per output token) — time between successive output tokens. Dominated by decode.
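To make "the KV cache is a big memory consumer" concrete, here is a back-of-the-envelope calculation. The helper name `kv_bytes_per_token` is hypothetical, and the model shape used (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is the published Llama-3-8B configuration, taken here as an assumption rather than read from the pinned tree.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_elem: int = 2
) -> int:
    """Bytes of KV cache one token occupies across the whole model."""
    # The leading 2 counts one K vector plus one V vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed shape: Llama-3-8B with 16-bit K/V values.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
per_request = per_token * 8192  # one request holding an 8,192-token context

print(per_token)             # 131072 bytes, i.e. 128 KiB per token
print(per_request / 2**30)   # 1.0 GiB for that single request
```

At roughly 1 GiB per long request, a handful of concurrent users is enough to exhaust whatever memory is left after the weights are loaded.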
PagedAttention & memory (Phase 2)
- PagedAttention — vLLM's core idea: store the KV cache in fixed-size blocks (like OS memory pages) instead of one big contiguous buffer per request. Nearly eliminates fragmentation (only a request's last, partially filled block wastes space) and enables sharing.
- Block (KV block) — a fixed-size slot holding the KV of block_size tokens (commonly 16). The unit of KV allocation. Code: KVCacheBlock in kv_cache_utils.py.
- Block size — number of tokens whose KV fits in one block (e.g. 16).
- Block table — per-request mapping from logical block index → physical block ID. Lets a request's KV be scattered across non-contiguous physical blocks.
- Block pool — the global pool of all physical blocks, with a free list and a prefix-cache index. Code: BlockPool in block_pool.py.
- Fragmentation — wasted memory from reserving contiguous space you don't fully use. PagedAttention's reason for existing.
- Prefix caching — if two requests share a prefix (same leading tokens), they can share the same physical KV blocks. Found by hashing block contents. Phase 3.
- Copy-on-write (CoW) — when a shared block must diverge (one request writes new tokens), copy it so the other request's view is unaffected.
- Reference count (ref_cnt) — how many requests currently use a block. A block is free only when ref_cnt == 0.
- Eviction — reclaiming a cached (but currently unused) block for a new allocation. vLLM uses an LRU-ish free queue (FreeKVCacheBlockQueue).
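The terms in this section (block pool, block table, reference count, free queue) can be seen working together in a toy allocator. This is a deliberately simplified sketch: `ToyBlockPool` and `blocks_needed` are hypothetical names, and the real BlockPool also handles prefix-cache hashing and eviction, which are left out here.

```python
from collections import deque
from typing import List


class ToyBlockPool:
    """Fixed number of physical KV blocks with ref counts and a free queue."""

    def __init__(self, num_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size
        self.ref_cnt = [0] * num_blocks
        self.free = deque(range(num_blocks))  # oldest-freed block is reused first

    def allocate(self, n: int) -> List[int]:
        if n > len(self.free):
            raise MemoryError("out of KV blocks")
        blocks = [self.free.popleft() for _ in range(n)]
        for block_id in blocks:
            self.ref_cnt[block_id] = 1
        return blocks

    def share(self, block_id: int) -> None:
        """A second request starts using an already-filled block."""
        self.ref_cnt[block_id] += 1

    def release(self, blocks: List[int]) -> None:
        for block_id in blocks:
            self.ref_cnt[block_id] -= 1
            if self.ref_cnt[block_id] == 0:  # free only when nobody uses it
                self.free.append(block_id)


def blocks_needed(num_tokens: int, block_size: int) -> int:
    return -(-num_tokens // block_size)  # ceiling division


pool = ToyBlockPool(num_blocks=8)
table_a = pool.allocate(blocks_needed(40, pool.block_size))  # 40 tokens -> 3 blocks
pool.share(table_a[0])                       # request B has the same first 16 tokens
table_b = [table_a[0]] + pool.allocate(1)    # B's block table: shared + private
pool.release(table_a)                        # request A finishes
print(table_a, table_b, len(pool.free))      # [0, 1, 2] [0, 3] 6
```

Block 0 stays allocated after A finishes because B still references it, which is the ref_cnt invariant from the definition above.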
Scheduling & batching (Phase 3)
- Batching — running many requests through the model together to use the GPU efficiently.
- Static batching — fix a batch, run it to completion. Wasteful: fast requests wait for slow ones.
- Continuous batching — re-decide the batch every iteration (every single token step). Finished requests leave, new ones join immediately. vLLM's default.
- Scheduler — the component that, each step, picks which requests run and how many tokens each gets. Code: Scheduler in v1/core/sched/scheduler.py.
- Chunked prefill — split a long prompt's prefill across several steps so it doesn't starve ongoing decodes. Controlled by a token budget.
- Token budget — max_num_batched_tokens: the cap on total tokens scheduled per step.
- Preemption — when memory runs out, evict a running request's KV and put it back in the queue (to be recomputed later). The safety valve.
- Running / Waiting queues — requests currently decoding vs. requests waiting to start.
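Continuous batching, chunked prefill, and the token budget combine into one small decision made every step. The sketch below is a toy under stated assumptions: `Req` and `schedule_step` are hypothetical, memory limits and preemption are ignored, and the real Scheduler is considerably more involved.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Req:
    name: str
    prompt_len: int
    computed: int = 0  # tokens whose KV already exists (prompt + generated)


def schedule_step(
    running: List[Req], waiting: List[Req], token_budget: int
) -> Dict[str, int]:
    """Decide how many tokens each request gets this step."""
    plan: Dict[str, int] = {}
    for req in running + waiting:  # running requests get first claim on the budget
        if token_budget == 0:
            break
        remaining_prefill = req.prompt_len - req.computed
        # A decoding request needs exactly 1 token; a prefill takes the
        # largest chunk that still fits in the budget (chunked prefill).
        n = 1 if remaining_prefill <= 0 else min(remaining_prefill, token_budget)
        plan[req.name] = n
        token_budget -= n
        req.computed += n
    # Any waiting request that received tokens joins the running queue.
    running.extend(r for r in waiting if r.name in plan)
    waiting[:] = [r for r in waiting if r.name not in plan]
    return plan


running = [Req("a", prompt_len=4, computed=4)]  # already decoding
waiting = [Req("b", prompt_len=12)]             # long prompt arrives
steps = [schedule_step(running, waiting, token_budget=8) for _ in range(3)]
print(steps)  # [{'a': 1, 'b': 7}, {'a': 1, 'b': 5}, {'a': 1, 'b': 1}]
```

Request a keeps producing one token per step while b's 12-token prompt is prefilled in two chunks (7, then 5); from the third step on, both are decoding.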
Kernels & execution (Phases 4–7)
- Kernel — a function that runs on the GPU. "Attention kernel", "GEMM kernel", etc.
- GEMM — General Matrix-Matrix Multiply. The workhorse op (every linear layer). Libraries: cuBLAS, CUTLASS.
- FlashAttention — a fused, memory-efficient attention kernel that never materializes the full attention matrix.
- FlashInfer / FlashMLA / TRTLLM-GEN / Triton — other attention/GEMM kernel providers vLLM can dispatch to.
- Attention backend — vLLM's pluggable wrapper choosing which attention kernel to run.
- CUDA graph — a recorded sequence of GPU operations replayed with one launch, removing per-op CPU launch overhead. Piecewise = capture parts; full = capture the whole model forward.
- torch.compile — PyTorch's compiler; vLLM uses it to fuse ops and generate kernels, with custom graph passes.
- MoE (Mixture of Experts) — a layer with many "expert" sub-networks; each token is routed to a few experts. Big models, low active compute. (Mixtral, DeepSeek-V3.)
- Quantization — storing weights/activations in fewer bits (FP8, INT4, …) to save memory and bandwidth. Formats: FP8, MXFP4, NVFP4, INT8/INT4, GPTQ, AWQ, GGUF, compressed-tensors.
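As one concrete instance of the quantization entry above, here is symmetric per-tensor INT8 quantization, about the simplest scheme there is. It is a sketch only: `quantize_int8` and `dequantize` are hypothetical helpers, and formats such as FP8, GPTQ, or AWQ work differently in detail.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # 1.0 guards the all-zero case
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.03, 1.00]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)  # [50, -127, 3, 100]
print(max(abs(w - r) for w, r in zip(weights, restored)))  # rounding error
```

Each weight now takes 1 byte instead of 2 or 4, at the cost of a rounding error bounded by half the scale. Since decode is memory-bandwidth-bound, fewer bytes read per weight translates fairly directly into speed.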
Decoding strategies (Phases 8–9)
- Greedy decoding — always pick the highest-probability token.
- Temperature / top-k / top-p / min-p — knobs that shape the sampling distribution.
- Parallel sampling (n) — produce N independent completions for one prompt (sharing the prompt's KV via prefix caching).
- Beam search — keep the top-N partial sequences by cumulative probability.
- Logits processor — a hook that edits the logits before sampling (penalties, bans, grammar masks).
- Speculative decoding — a cheap draft model/heuristic proposes several tokens; the big model verifies them in one pass, accepting a prefix. Speeds up decode.
- EAGLE / Medusa / n-gram / suffix / DFlash — specific speculative-decoding methods.
- Acceptance rate — fraction of drafted tokens the target model accepts. The metric that decides whether spec decode is a win.
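The draft/verify idea behind speculative decoding, and the acceptance rate that measures it, can be sketched for the greedy case. This is a simplification under stated assumptions: `verify_greedy` is a hypothetical helper, and with sampling enabled the comparison is replaced by rejection sampling, which is not shown here.

```python
from typing import List, Tuple


def verify_greedy(
    draft_tokens: List[int], target_argmax: List[int]
) -> Tuple[List[int], int]:
    """Accept the longest draft prefix the target model agrees with.

    target_argmax[i] is the target model's greedy choice at draft position i,
    computed in one forward pass; it has one extra entry for the position
    after the last draft token.
    Returns (tokens to emit, number of draft tokens accepted).
    """
    accepted: List[int] = []
    for i, token in enumerate(draft_tokens):
        if token != target_argmax[i]:
            # First disagreement: emit the target's token instead and stop.
            return accepted + [target_argmax[i]], len(accepted)
        accepted.append(token)
    # Every draft token matched: the extra target position is a free bonus token.
    return accepted + [target_argmax[len(draft_tokens)]], len(accepted)


draft = [5, 9, 2, 7]
tokens, num_accepted = verify_greedy(draft, target_argmax=[5, 9, 4, 1, 8])
print(tokens)                     # [5, 9, 4]
print(num_accepted / len(draft))  # 0.5 acceptance rate
```

One target forward pass produced three tokens instead of one. Whether that is a net win depends on how often drafts are accepted and what the draft cost.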
Distributed & serving (Phases 10, 15–16)
- Tensor parallelism (TP) — split each layer's weights across GPUs; every GPU does part of every layer; results all-reduced.
- Pipeline parallelism (PP) — split the layers across GPUs; activations pass GPU→GPU.
- Data parallelism (DP) — replicate the model; split requests across replicas.
- Expert parallelism (EP) — split MoE experts across GPUs.
- Context parallelism (CP) — split a single sequence's context across GPUs.
- Collective op — multi-GPU communication primitive (all-reduce, all-gather, …) via NCCL.
- Disaggregated serving — run prefill and decode on different machines, shipping the KV cache between them, so each can be scaled and tuned independently.
- KV connector — the component that transfers KV blocks between engines (for P/D disagg or offloading). Code under vllm/distributed/kv_transfer/.
- OpenAI-compatible server — vLLM's HTTP server speaking the OpenAI API (plus Anthropic Messages API and gRPC).
- Tool calling / reasoning parser — components that extract structured tool calls or chain-of-thought from model output.
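Tensor parallelism and the all-reduce collective are easier to trust once you see that the split computation gives the same answer. The sketch below simulates two "GPUs" with plain Python lists; it shows only the row-parallel case (splitting the input dimension), and `matvec` is an illustrative helper, not a vLLM function.

```python
from typing import List


def matvec(x: List[float], W: List[List[float]]) -> List[float]:
    """y[j] = sum_i x[i] * W[i][j], where W has shape (in_features, out_features)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


x = [1.0, 2.0, 3.0, 4.0]
W = [[1, 0], [0, 1], [1, 1], [2, 0]]  # in_features=4, out_features=2

full = matvec(x, W)  # what a single GPU would compute

# Split the input dimension across 2 simulated GPUs: each holds half of the
# weight rows and sees only the matching half of the activations.
half = len(x) // 2
partials = [matvec(x[:half], W[:half]), matvec(x[half:], W[half:])]

# All-reduce (sum): every GPU ends up with the element-wise sum of the partials.
reduced = [sum(p[j] for p in partials) for j in range(len(full))]

print(partials)          # [[1.0, 2.0], [11.0, 3.0]]
print(reduced == full)   # True
```

Each device stores half the weights and does half the multiply-adds; the price is one communication step per layer, which is why TP is sensitive to interconnect speed.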
Adaptation & structure (Phases 11–12)
- LoRA (Low-Rank Adaptation) — small trainable matrices added to a frozen base model to specialize it. vLLM serves many LoRAs in one batch.
- Punica / SGMV — batched kernels that apply different LoRAs to different requests in one GPU call.
- Structured output / guided decoding — forcing the model's output to match a grammar, regex, or JSON schema by masking invalid tokens each step. Engines: xgrammar, guidance.
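The LoRA entry above can be made concrete: the adapted layer computes the frozen base output plus a scaled low-rank correction, y = W·x + scale · B·(A·x). The sketch below applies a different adapter (or none) to each request in a batch with a plain loop. All names and numbers are invented for illustration; real batched kernels such as Punica/SGMV do this in one GPU call.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def matvec(W: Matrix, x: List[float]) -> List[float]:
    """Multiply matrix W (one row per output) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def lora_forward(
    x: List[float],
    W: Matrix,
    adapter: Optional[Tuple[Matrix, Matrix]] = None,
    scale: float = 1.0,
) -> List[float]:
    y = matvec(W, x)  # frozen base weights, shared by every request
    if adapter is None:
        return y
    A, B = adapter                   # A: (rank, in), B: (out, rank)
    delta = matvec(B, matvec(A, x))  # low-rank path: project down, then up
    return [yi + scale * di for yi, di in zip(y, delta)]


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # out=2, in=3
adapters = {"tenant-a": ([[1.0, 1.0, 1.0]], [[0.5], [-0.5]])}  # rank 1

batch = [("tenant-a", [1.0, 2.0, 3.0]), (None, [1.0, 2.0, 3.0])]
outputs = [lora_forward(x, W, adapters.get(name), scale=2.0) for name, x in batch]
print(outputs)  # [[7.0, -4.0], [1.0, 2.0]]
```

The adapter here has 5 numbers against the base layer's 6, which looks unimpressive at this size; at rank 16 on a 4096×4096 layer the adapter is under 1% of the base weights, which is what makes serving many of them at once practical.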
vLLM internals & process model
- V1 engine — vLLM's current core architecture (the vllm/v1/ tree). V0 is legacy. This course teaches V1.
- LLM — the offline (batch) Python entry point: LLM(model=...).generate(prompts).
- AsyncLLM — the async engine powering the API server.
- EngineCore — the inner loop: add_request → step() (schedule → execute → output).
- Worker / Executor — the executor owns workers; each worker drives one GPU's model.
- Model runner — turns a SchedulerOutput into actual tensor inputs and runs the model.
- SamplingParams — per-request decoding config (temperature, max_tokens, n, …).
- RequestOutput — what the engine returns: generated text/tokens for a request.
The Career Map: Maintainer, Staff Engineer, Founder
This course has three end-states in mind. They overlap, but each has its own "what does great look like" bar. Use this document as a compass: at any phase, ask "which of these am I building toward right now?"
Contents
- Track A — Become a vLLM maintainer
- Track B — Staff / Principal LLM-inference engineer
- Track C — Found a startup in inference
- A note on mindset
Track A — Become a vLLM maintainer
A maintainer is someone whose judgment the project trusts. You get there by a track record, not a title.
The ladder
- First contribution. A docs fix, a small bug, a test. Learn the workflow (Phase 19).
- Sustained contributions. Real features/fixes in one area (say, the scheduler or a quant method). You become "the person who knows X".
- Reviewer. You review others' PRs in your area credibly.
- Committer / maintainer. You're trusted to merge and to shape direction.
What maintainers actually do (and this course trains)
- Read code fast and correctly. Every 01-deep-dive.md is reps for this.
- Reason about invariants. "Block tables are append-only." "ref_cnt==0 ⟺ in free queue." Maintainers hold dozens of these in their head. The deep-dives name them explicitly; the CHEATSHEET.md files collect them.
- Protect the hot path. vLLM's scheduler runs every token step for every request — a Python list scan in the wrong place is a throughput regression. You learn to feel this.
- Write tests that pin behavior. Look at upstream/tests/v1/core/ — that's the standard.
- Communicate. PR descriptions, RFCs, issue triage. See upstream/AGENTS.md for the project's literal rules (e.g. no pure code-agent PRs, cite that AI was used, include test commands and results).
The non-obvious advice
- Specialize, then generalize. Pick one subsystem from this course (scheduler, KV cache, a quant format, an attention backend) and go deeper than anyone. Depth in one area earns the trust that lets you touch others.
- Watch the firehose. Subscribe to the repo. Read merged PRs in your area daily. Diffing how the engine evolves (Phase 19) is the fastest way to learn the current mental model.
Track B — Staff / Principal LLM-inference engineer
This is the industry role: you own how models serve — throughput, latency, cost, reliability — at a company. The interview loops test exactly the material in this course.
The competency map
| Competency | Phases | "Staff-level" looks like |
|---|---|---|
| Transformer inference fundamentals | 0, 1 | Can derive KV-cache memory from first principles; explain prefill vs decode bottlenecks. |
| Memory management | 2 | Can size KV cache for a deployment; explain paging vs fragmentation with numbers. |
| Throughput engineering | 3, 18 | Can diagnose a throughput cliff from metrics; tune batch/token budgets; reason about Little's Law. |
| Kernels & precision | 4–7 | Knows when FlashInfer beats FlashAttention; what FP8 costs in accuracy; reads a roofline. |
| Latency techniques | 8, 9 | Knows when spec decode helps (acceptance rate × draft cost); chunked prefill tradeoffs. |
| Scale-out | 10, 15 | Picks TP vs PP vs DP vs EP for a model+SLA; understands P/D disaggregation economics. |
| Productization | 11, 12, 16 | Multi-tenant LoRA, structured output, API design, streaming, observability. |
| Hardware breadth | 17 | Reasons about NVIDIA vs AMD vs TPU tradeoffs and the plugin abstraction. |
How to use the INTERVIEW.md files
Each phase ships staff-level Q&A. Treat them as a mock loop: cover the answer, attempt it out
loud, then compare. The flagship phases (2, 3) show the depth expected. A strong candidate
can whiteboard the PagedAttention block allocator and the continuous-batching step loop from
memory — which, after this course, you will have written yourself in mini_vllm/.
Your portfolio
By the end you have three artifacts that beat any résumé bullet:
- mini_vllm/ — a working engine you built. Walk an interviewer through it.
- A merged upstream PR (Phase 19). Public proof you operate at the real bar.
- A tuning/benchmark writeup (Phase 18). Shows you think in numbers.
Track C — Found a startup in inference
The inference layer is one of the most valuable and contested in the AI stack. This course makes you dangerous in it.
Where the value (and the moats) are
- Cost per token. The whole game. Everything in Phases 2–7 and 10 is a lever on it. A 2× throughput win halves GPU cost per token, and that saving flows straight into gross margin.
- Latency SLAs. TTFT and ITL guarantees (Phases 3, 8, 9, 15) are what enterprise buyers actually pay for.
- Multi-tenancy. Serving thousands of fine-tunes cheaply = multi-LoRA + prefix caching (Phases 3, 11). A structural cost advantage over per-customer deployments.
- Hardware arbitrage. Running well on cheaper/available silicon (Phase 17) when NVIDIA is supply-constrained.
Honest take on moats
Raw "we wrap vLLM and rent GPUs" is not a moat — margins compress fast. Defensible angles:
- A genuine kernel/scheduling edge you can sustain (hard, but this course is where you'd build the expertise to try).
- Workload specialization — agentic/long-context/structured-output/RAG-shaped traffic has different optimal configs; owning a vertical's serving stack is defensible.
- The control plane — routing, autoscaling, multi-tenancy, observability, cost attribution around the engine. Often more durable than the engine itself.
- Distribution / switching costs — being embedded in customers' pipelines.
The build/buy/contribute calculus
You will almost always build on vLLM rather than replace it — that's the point of open source. The startup question is "what do we add on top, and what should we upstream?" Phase 19 covers the contribute-vs-keep-private tradeoff (upstreaming buys you maintenance leverage and credibility; hoarding a commodity feature buys you nothing).
A note on mindset
The people who reach all three end-states share one habit: they read the source. Not docs
about the source — the source. This entire course is built to make that your default
reflex. Open upstream/ now and keep it open for the next 20 phases.
Phase 00 — The Hitchhiker's Guide to How an LLM Actually Runs
This is Chapter 0 of the book. It assumes you know nothing — not what a token is, not what a matrix multiply is — and it ends with you able to compute, on a napkin, how many users a given GPU can serve and why. Everything else in the course stands on this chapter, so we go slowly and build each idea from the ground up.
How to read this chapter. Most of it is for everyone. Paragraphs marked "🔬 Going deeper" (optional rigor and real numbers for the expert track) can be skimmed on a first pass and devoured on the second. By the end you should be comfortable at both levels: the intuition and the arithmetic.
Contents
- 0.1 Don't Panic — the whole thing in one sentence
- 0.2 The only math you need (a 5-minute primer)
- 0.3 Words become numbers, part 1: tokenization
- 0.4 Words become numbers, part 2: embeddings (meaning as coordinates)
- 0.5 The shape of a model: layers and the residual stream
- 0.6 Attention from the ground up (the heart of the machine)
- 0.7 The MLP (the per-token "thinking" block)
- 0.8 From logits to a token: sampling
- 0.9 The generation loop, and the redundancy that births the KV cache
- 0.10 Prefill vs decode, and why decode is memory-bound (the chapter's crux)
- 0.11 How big is the KV cache? (the wall that caps your users)
- 0.12 Throughput vs latency, and Little's Law
- 0.13 The one picture to carry into every later phase
- 0.14 What you'll do in this phase
0.1 Don't Panic — the whole thing in one sentence
A large language model is a function that reads a list of words and guesses the next word. To write text, it guesses a word, sticks it on the end, and guesses again — hundreds of times.
That is genuinely all an LLM does at runtime. ChatGPT writing you an essay is this loop running a few hundred times. Everything difficult — and everything this course teaches — comes not from the guessing, but from doing the guessing fast, for thousands of people at once, on hardware that costs more than a house. vLLM is the software that does that well. To make it faster (your future job), you must first feel why it is slow. That feeling is what this chapter installs.
We'll build up in this order: a tiny bit of math → words become numbers → what one "guess" involves → the loop → why it's slow → where the memory goes. Take your time.
0.2 The only math you need (a 5-minute primer)
Two objects and one operation. That's it.
A vector is just a list of numbers: [0.2, -1.1, 0.5]. Picture it as an arrow, or as
coordinates of a point. A length-3 vector is a point in 3D space; LLMs use points in thousands of
dimensions (you can't picture that, and you don't need to — the arithmetic is the same).
A matrix is a grid of numbers — a stack of vectors. A 2×3 matrix has 2 rows, 3 columns.
The one operation that matters is matrix multiplication (everyone calls it "matmul" or GEMM — General Matrix Multiply). To multiply a vector by a matrix, you take dot products. A dot product of two equal-length vectors multiplies them element-wise and sums:
[1, 2, 3] · [4, 5, 6] = 1·4 + 2·5 + 3·6 = 4 + 10 + 18 = 32
A dot product is a similarity score: it's large and positive when two vectors point the same way, near zero when they're unrelated, negative when opposed. Remember this — attention (the heart of the model) is built entirely out of dot products measuring "how related are these two tokens."
Multiplying a vector x (length 3) by a matrix W (3 columns, 2 rows) gives a new vector (length
2), one dot product per row of W:
x = [1, 2, 3]    W = [ [1, 0, 1],    →  y[0] = [1,2,3]·[1,0,1] = 4
                       [0, 1, 1] ]       y[1] = [1,2,3]·[0,1,1] = 5
y = [4, 5]
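The two worked examples above take only a few lines of plain Python. The function names `dot` and `matvec` are illustrative; real engines hand this work to optimized GEMM kernels rather than Python loops.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Multiply element-wise and sum."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    return sum(x * y for x, y in zip(a, b))


def matvec(W: List[List[float]], x: List[float]) -> List[float]:
    """One dot product per row of W."""
    return [dot(row, x) for row in W]


print(dot([1, 2, 3], [4, 5, 6]))                  # 32
print(matvec([[1, 0, 1], [0, 1, 1]], [1, 2, 3]))  # [4, 5]
```

Everything a forward pass does is a scaled-up version of `matvec`: bigger matrices, many of them in a row.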
🔬 Going deeper. A neural network "layer" is exactly this: y = x·Wᵀ (plus a bias, plus a nonlinearity). The matrix W is the weights — the billions of numbers that are the trained model. "Llama-3-8B" means ~8 billion such numbers. A forward pass is a long chain of these multiplies. So "running a model" = "doing a lot of matmuls with the weight matrices." Hold that: it explains both the compute cost (FLOPs) and the memory cost (reading the weights), which is the whole performance story in §0.10.
That's the entire math prerequisite. Onward.
0.3 Words become numbers, part 1: tokenization
A computer can't multiply the word "Paris". So step one is always: chop the text into small
pieces called tokens and replace each with an integer ID.
Why not just use whole words, or individual letters? Whole words give a gigantic, brittle vocabulary (every plural, typo, and rare name is a new word). Single letters make sequences painfully long. The sweet spot is subwords — common words stay whole, rare words split into pieces. The dominant algorithm is Byte-Pair Encoding (BPE):
🔬 How BPE is built (Going deeper). Start with every character as its own token. Then repeatedly find the most frequent adjacent pair of tokens in your training text and merge it into a new token. Do this tens of thousands of times. Common sequences like "ing", " the", "vLLM" get merged into single tokens; rare strings stay split into smaller bits. The result is a fixed list of merges (the vocabulary) that balances vocabulary size against sequence length.
A worked tokenization (Llama-3-style, ~128k vocab):
text: "vLLM is fast"
tokens: [ "v", "LLM", " is", " fast" ] ← note the leading spaces are part of tokens
IDs: [ 85, 4178, 382, 2347 ] ← example numbers from the vocab table
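The BPE merge procedure described above (start from characters, repeatedly merge the most frequent adjacent pair) can be sketched as follows. This is a toy under stated assumptions: it works on characters of a single string, has no pre-tokenization or byte fallback, and the names `train_bpe`, `merge_pair`, and `most_frequent_pair` are hypothetical. Production tokenizers are considerably more involved.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent pair of tokens, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; return the final tokens and the merge list."""
    tokens = list(text)  # every character starts as its own token
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        merges.append(pair)
        tokens = merge_pair(tokens, pair)
    return tokens, merges


tokens, merges = train_bpe("aaabdaaabac", 1)
print(merges)  # [('a', 'a')]
print(tokens)  # ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
```

Each merge shortens the sequence and lengthens the vocabulary by one entry, which is the size-versus-length trade the text describes.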
Two facts to carry forward:
- A token is roughly ¾ of a word on average (so 1,000 tokens ≈ 750 words).
- The component that does this is the tokenizer; reversing it (IDs → text) at the end is detokenization. The full list of tokens it knows is the vocabulary (Llama-3: ~128,256).
In mini_vllm/tokenizer.py we use the simplest possible tokenizer — one byte = one token, vocab of
257 — so the course needs zero downloaded files. Open it; it's ten lines, and it has the same
encode/decode interface a real tokenizer does.
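For orientation, a byte-level tokenizer of the kind described can look like the sketch below. It is not a copy of `mini_vllm/tokenizer.py`; in particular, treating the 257th ID as an end-of-sequence marker (`EOS_ID`) is an assumption made here for illustration.

```python
VOCAB_SIZE = 257  # 256 possible byte values + 1 extra ID
EOS_ID = 256      # assumption: the extra ID marks end-of-sequence


def encode(text: str) -> list[int]:
    """Text -> token IDs: one ID per UTF-8 byte."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Token IDs -> text: drop non-byte IDs (such as EOS), then decode UTF-8."""
    raw = bytes(i for i in ids if i < 256)
    return raw.decode("utf-8", errors="replace")


ids = encode("vLLM is fast")
print(len(ids))     # 12 (one token per byte, far more than a BPE tokenizer's 4)
print(decode(ids))  # vLLM is fast
```

The round trip `decode(encode(text)) == text` is the property that matters; the sequence is simply longer than with a subword vocabulary.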
🆕 New words: token (a subword chunk), token ID (its integer), tokenizer (the chopper), vocabulary (all known tokens), BPE (the merge algorithm), detokenization (IDs→text).
0.4 Words become numbers, part 2: embeddings (meaning as coordinates)
A token ID like 4178 is just a name — it carries no meaning by itself (token 4178 isn't "more"
than token 382). So the model's first move is to look up each ID in a big table and replace it with
a vector — a list of numbers — called an embedding.
Think of the embedding as coordinates in a space of meaning. Just as a city has a (latitude, longitude), a token has a few thousand coordinates (Llama-3-8B: 4096 of them). The training process arranges this space so that tokens used in similar ways land near each other, and — famously — directions in the space carry meaning:
embedding("king") - embedding("man") + embedding("woman") ≈ embedding("queen")
The lookup itself is trivial — it's just "go to row 4178 of the embedding table" — but it's the bridge from symbols to math. After this step, the prompt is no longer text; it's a stack of vectors (one per token), and from here on the model only does matmuls on those vectors.
"vLLM is fast"
→ IDs [85, 4178, 382, 2347]
→ embeddings, a 4 × 4096 matrix:
[ [ 0.02, -0.7, ... , 0.1 ], ← "v"
[-0.5, 0.3, ... , 0.9 ], ← "LLM"
[ 0.1, -0.2, ... , 0.0 ], ← " is"
[ 0.8, 0.4, ... ,-0.3 ] ] ← " fast"
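The lookup step can be sketched in a few lines. The table here is tiny and filled with random numbers purely for illustration (a trained model's table is learned, and Llama-3-8B's is 128256 × 4096); `embed` is a hypothetical helper name.

```python
import random

VOCAB_SIZE = 8  # toy size for illustration
D_MODEL = 4     # toy hidden size

rng = random.Random(0)
# One row of D_MODEL numbers per token ID.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Replace each token ID with its row of the embedding table."""
    return [embedding_table[t] for t in token_ids]


hidden = embed([5, 1, 3, 7])
print(len(hidden), len(hidden[0]))  # 4 4  (num_tokens x d_model)
```

The output shape, one vector per token, is the "stack of vectors" every later layer operates on.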
🔬 Going deeper. The width of these vectors is the hidden size d_model (4096 for Llama-3-8B). Bigger d_model = more capacity but more compute and memory everywhere. The embedding table has vocab_size × d_model numbers (128256 × 4096 ≈ 525M just for embeddings). Many models tie the input embedding and the output projection (the "LM head") to save that memory.
🆕 New words: embedding (a token's meaning-vector), hidden size / d_model (the vector width), LM head (the final layer turning vectors back into vocabulary scores).
0.5 The shape of a model: layers and the residual stream
The model is a tall stack of identical layers (Llama-3-8B has 32). The stack of token vectors flows up through them; this flowing stack is often called the residual stream. Each layer reads the stream, computes an update, and adds it back (that "add it back" is the residual connection — it's why training deep stacks works, but you can treat it as plumbing).
Each layer does exactly two things to the stream:
- Attention — lets each token look at the other tokens and pull in relevant information. This is the only place where information moves between positions.
- MLP (feed-forward) — transforms each token's vector independently, adding "thinking capacity."
After all 32 layers, the final vector at the last position is multiplied by the LM head to produce one score for every token in the vocabulary. Those scores are the logits (§0.8).
embeddings ─► [ layer 1 ] ─► [ layer 2 ] ─► ... ─► [ layer 32 ] ─► LM head ─► logits (vocab,)
│ attention + MLP, each with a residual add
Keep this split in mind — attention mixes across tokens, the MLP works per token — because it explains exactly where the KV cache lives (in attention, the only cross-token part) and where the giant matmuls are (the MLP, ~⅔ of the weights; Phase 7).
0.6 Attention from the ground up (the heart of the machine)
This is the one piece worth understanding in detail, because it dictates all of memory management. We'll build it from the problem, then do a worked numeric example.
The problem attention solves
Consider generating the next word of: "The river bank was muddy, so the fisherman ...". To continue sensibly the model must know "bank" means a riverbank (not a money bank) — and the clue is the word "river" earlier. Each token needs to gather context from the other tokens. Attention is the mechanism for that gathering.
Q, K, V — the search analogy
For every token, the model computes three vectors (each by multiplying the token's embedding by a learned weight matrix — yes, more matmuls):
- Query (Q) — "here is what I'm looking for." (your search box)
- Key (K) — "here is what I am about." (a document's title/tags)
- Value (V) — "here is the information I actually carry." (the document's contents)
To update a token, you compare its Query against the Key of every token up to and including itself (dot products; remember, dot product = similarity!), turn those similarities into weights that sum to 1, and take the weighted blend of those tokens' Values. It is exactly a soft search:
Type a query → it's scored against the keys of all documents → you get back a blend of the best-matching documents' values.
A worked example (do this once by hand — it demystifies everything)
Three tokens, and to keep it readable let each Q/K/V be just 2 numbers (a real model uses 128).
Suppose we're computing the new vector for token 3, whose query is q₃ = [1, 0]. The three
tokens' keys and values:
token key K value V
1 [1, 0] [10, 0]
2 [0, 1] [0, 10]
3 [1, 0] [5, 5]
Step 1 — similarity scores (dot product of q₃ with each key):
score₁ = [1,0]·[1,0] = 1 score₂ = [1,0]·[0,1] = 0 score₃ = [1,0]·[1,0] = 1
Token 3's query points the same way as tokens 1 and 3's keys (score 1) and is orthogonal to token 2's (score 0).
Step 2 — scale by 1/√(head_dim) = 1/√2 ≈ 0.707 (this keeps numbers from blowing up as the
vectors get wider — a numerical-stability trick):
scaled = [0.707, 0, 0.707]
Step 3 — softmax turns scores into weights that are all positive and sum to 1. Softmax of a
list is exp(each) / sum(exp):
exp(0.707)=2.03, exp(0)=1.00, exp(0.707)=2.03 sum = 5.06
weights = [2.03/5.06, 1.00/5.06, 2.03/5.06] = [0.40, 0.20, 0.40]
So token 3 will pay 40% attention to token 1, 20% to token 2, 40% to itself.
Step 4 — weighted blend of the Values:
out = 0.40·[10,0] + 0.20·[0,10] + 0.40·[5,5]
= [4,0] + [0,2] + [2,2]
= [6, 4]
That [6, 4] is token 3's attention output — a context-aware mix dominated by tokens 1 and 3.
That is the entire attention operation. Everything fancy later (FlashAttention, PagedAttention)
is about computing this exact thing faster and with less memory, never something different.
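The four steps of the worked example can be reproduced with the standard library alone. This is a single-query, single-head sketch; `softmax` and `attend` are illustrative names. Note that the hand calculation rounds the weights to two decimals before blending, so the unrounded output is about [6.02, 3.98] rather than exactly [6, 4].

```python
import math


def softmax(scores):
    """exp(each) / sum(exp), with the max subtracted first for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    head_dim = len(q)
    # Steps 1 and 2: dot product with each key, scaled by 1/sqrt(head_dim).
    scaled = [
        sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(head_dim)
        for k in keys
    ]
    # Step 3: scores -> weights that sum to 1.
    weights = softmax(scaled)
    # Step 4: weighted blend of the values, one output coordinate at a time.
    out = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return weights, out


keys = [[1, 0], [0, 1], [1, 0]]
values = [[10, 0], [0, 10], [5, 5]]
weights, out = attend([1, 0], keys, values)

print([round(w, 2) for w in weights])  # [0.4, 0.2, 0.4]
print([round(o, 2) for o in out])      # [6.02, 3.98]
```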
Causal masking — you can't read the future
When generating, token 3 may only attend to tokens 1, 2, 3 — not tokens that come after it
(they don't exist yet). This "only look backward" rule is causal masking: before the softmax,
scores for future positions are set to -∞ (so their softmax weight is 0). Picture the allowed
attention as a lower-triangular matrix:
attends to → t1 t2 t3 t4
query t1 ✓ ✗ ✗ ✗
query t2 ✓ ✓ ✗ ✗
query t3 ✓ ✓ ✓ ✗
query t4 ✓ ✓ ✓ ✓
This triangle is why token i needs the Keys and Values of all tokens ≤ i — the single most important sentence for understanding the KV cache (§0.9).
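A minimal sketch of the masking rule, assuming scores are plain Python lists: build the lower-triangular "allowed" matrix, then set disallowed scores to -∞ before the softmax. The helper names are illustrative.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when query i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, allowed):
    """Softmax where disallowed positions are set to -inf, so they get weight 0."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because a token can always attend to itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(4)
for row in mask:
    print("".join("✓" if ok else "✗" for ok in row))

# Query t2: the large scores at the future positions t3 and t4 are ignored.
print(masked_softmax([0.5, 0.5, 9.0, 9.0], mask[1]))  # [0.5, 0.5, 0.0, 0.0]
```

However large a future position's raw score is, its weight is exactly zero after masking, which is what makes earlier tokens' K and V independent of later tokens.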
Multiple heads
Real attention runs several of these in parallel — heads — each with its own Q/K/V projections,
so different heads can specialize (one tracks syntax, another long-range references). Their outputs
are concatenated. Llama-3-8B has 32 query heads, each of dimension 128 (32 × 128 = 4096 =
d_model).
🔬 Going deeper — three things the experts know.
- head_dim and the √d scale. Dot products of d-dimensional vectors grow like √d, which would push softmax into tiny-gradient saturation. Dividing by √head_dim keeps the variance ~1. (That's the 0.707 above.)
- RoPE (positional info). Attention as described is order-blind: it'd treat "dog bites man" like "man bites dog." Models inject position by rotating Q and K by an angle proportional to their position (Rotary Position Embedding). Two tokens' score then depends on their relative distance. You'll see rotary_emb(positions, q, k) in llama.py (Phase 0 deep-dive).
- GQA/MQA: fewer KV heads. The KV cache (next section) is sized by the number of KV heads. So modern models use Grouped-Query Attention: many query heads share a smaller number of KV heads. Llama-3-8B has 32 query heads but only 8 KV heads, a 4× KV-cache saving baked into the architecture. (MQA is the extreme: 1 KV head.) This single design choice changes your serving capacity by 4×; remember it for §0.11.
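The head-sharing arithmetic of GQA can be sketched as below. The contiguous grouping (query heads 0 to 3 share KV head 0, and so on) is an assumption for illustration; check a given implementation before relying on the exact mapping.

```python
NUM_Q_HEADS = 32   # Llama-3-8B query heads
NUM_KV_HEADS = 8   # Llama-3-8B KV heads
GROUP_SIZE = NUM_Q_HEADS // NUM_KV_HEADS  # query heads per KV head


def kv_head_for(q_head):
    """KV head read by a given query head (assumes contiguous groups)."""
    return q_head // GROUP_SIZE


print(GROUP_SIZE)                          # 4
print([kv_head_for(h) for h in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]

# Only KV heads are cached, so the cache shrinks by the group size.
kv_cache_saving = NUM_Q_HEADS / NUM_KV_HEADS
print(kv_cache_saving)                     # 4.0
```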
🆕 New words: Query/Key/Value (Q/K/V), attention score (a Q·K dot product), softmax (scores → weights summing to 1), causal mask (can't attend to the future), head (one parallel attention), head_dim (a head's width), RoPE (rotary positions), GQA/MQA (shared KV heads).
0.7 The MLP (the per-token "thinking" block)
After attention mixes context in, the MLP processes each token's vector on its own through two big matmuls with a nonlinearity between:
hidden = activation(x · W_upᵀ) # expand: 4096 → ~14336 (Llama-3-8B)
out = hidden · W_downᵀ # project back: 14336 → 4096
The middle width (the "intermediate size") is several × d_model, which is why the MLP holds the
majority of the model's weights (~⅔). When you hear "the model is mostly GEMMs," it's largely
these two matrices per layer. (Modern Llamas use a gated variant, SwiGLU, with three matrices —
a detail for Phase 14; the shape story is the same.)
The takeaway for performance: attention is where memory (the KV cache) concentrates; the MLP is where compute and weight bytes concentrate. Different bottlenecks, different optimizations.
0.8 From logits to a token: sampling
After the last layer, the LM head turns the final vector into logits — one raw score per vocab token (~128k numbers). To pick a token you first turn logits into probabilities with softmax, then choose:
"The capital of France is" → logits → softmax → probabilities:
" Paris": 0.87 " Lyon": 0.06 " a": 0.01 " banana": 0.000003 ...
- Greedy: take the highest-probability token (" Paris"). Deterministic.
- Temperature / top-k / top-p: deliberately introduce randomness for variety.
This whole topic — the decoding algorithms and how they run vectorized across a whole batch — is
Phase 9. For now: logits → softmax → pick one. mini_vllm/sampler.py implements greedy plus
temperature/top-k/top-p in ~40 readable lines.
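As a preview, one way the "logits → softmax → pick one" step can be written is sketched below. It is not the contents of `mini_vllm/sampler.py`; the function name `sample`, its parameters, and the order in which top-k and top-p are applied are assumptions, and real samplers differ in such details.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Pick one token ID. temperature == 0 means greedy (argmax)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    if rng is None:
        rng = random.Random()
    # Temperature rescales logits: below 1 sharpens, above 1 flattens.
    probs = softmax([x / temperature for x in logits])
    # Candidates from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]  # keep only the k most likely tokens
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Draw among the survivors; random.choices renormalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [2.0, 1.0, 0.1]
print(sample(logits, temperature=0))  # 0 (greedy always picks the largest logit)
print(sample(logits, temperature=0.7, top_k=2, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the random modes reproducible, which is useful for tests.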
🆕 New words: logits (raw next-token scores), probability distribution (softmaxed logits), greedy decoding (argmax), sampling (random pick).
0.9 The generation loop, and the redundancy that births the KV cache
The model only ever predicts one next token. To write a sentence we loop — feed the output back in (this is autoregressive generation):
Step 1: "The capital of France is" → " Paris"
Step 2: "The capital of France is Paris" → "."
Step 3: "The capital of France is Paris." → <end>
Now look closely at what the naive loop computes. Each step runs the whole model over the whole text-so-far. Recall from §0.6 that attention at each position needs the Keys and Values of all earlier positions. So:
Step 1 processes tokens [1..5] → computes K,V for positions 1..5
Step 2 processes tokens [1..6] → computes K,V for positions 1..6 (1..5 AGAIN)
Step 3 processes tokens [1..7] → computes K,V for positions 1..7 (1..6 AGAIN)
We keep recomputing the K and V of tokens we already processed. Here is the key insight:
A token's Key and Value never change once computed. Token 5's K and V are identical in step 2 and in step 500. So compute them once and store them.
That stored table of every past token's Keys and Values is the KV cache. With it, each new step computes K,V for only the one new token and reads the rest from the cache:
work WITHOUT a cache: 1 + 2 + 3 + ... + N = N(N+1)/2 ≈ N²/2 (quadratic)
work WITH a cache: 1 + 1 + 1 + ... + 1 = N (linear)
For a 1,000-token answer that's ~500× less of this work. You will measure exactly this in
lab-01 — a 20-line experiment that is the single justification for the entire course. The KV
cache is not an optimization you can skip; it's what makes generation tractable.
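The two sums above can be counted directly. This sketch uses the same simplified accounting as the formulas (step i touches i tokens without a cache, one token with it); `count_kv_work` is an illustrative name, and the lab's own counter may define a unit of work differently.

```python
def count_kv_work(num_steps, use_cache):
    """Count how many per-token K/V computations a generation loop performs."""
    work = 0
    for step in range(1, num_steps + 1):
        if use_cache:
            work += 1     # only the newest token's K and V are computed
        else:
            work += step  # K and V recomputed for positions 1..step
    return work


n = 1000
no_cache = count_kv_work(n, use_cache=False)
cached = count_kv_work(n, use_cache=True)

print(no_cache)           # 500500  = N(N+1)/2
print(cached)             # 1000    = N
print(no_cache / cached)  # 500.5
```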
The catch — and the reason Phases 2–3 exist — is that the KV cache is enormous and grows with every token, and it lives in scarce GPU memory. Managing it well is most of vLLM.
🆕 New words: autoregressive generation (predict → append → repeat), KV cache (stored Keys/Values of all prior tokens), EOS token (the "stop" token).
0.10 Prefill vs decode, and why decode is memory-bound (the chapter's crux)
With a KV cache, generation splits into two phases with opposite performance personalities. This is the most-probed idea in LLM-inference interviews — we'll do it with real numbers.
- Prefill — the first run: process the entire prompt at once to fill its KV cache. Many tokens, one run.
- Decode — every run after: generate one token, append, repeat. One token, one run, many times.
| | Prefill | Decode |
|---|---|---|
| tokens per run | many (whole prompt) | one |
| limited by | compute (math throughput) | memory bandwidth (reading from HBM) |
| sets the metric | TTFT (time to first token) | ITL / TPOT (time per output token) |
Why decode is bottlenecked by memory speed, not math
To produce one decode token, the GPU must read every weight in the model out of its main memory (HBM) — plus the whole KV cache — and then does only one token's worth of math with all of it. It's like driving to a vast warehouse, loading every crate onto the truck, to deliver one postcard. The bottleneck is the loading (memory reads), not the delivering (math).
🔬 The arithmetic that proves it (Going deeper — this is the money slide). Take Llama-3-8B in bf16 (2 bytes/param), on an A100 (HBM bandwidth ≈ 2 TB/s, compute ≈ 312 TFLOP/s bf16).
- Memory per decode step (batch = 1): read all weights = 8e9 params × 2 bytes = 16 GB. Time to read at 2 TB/s = 16e9 / 2e12 = 8 ms. That alone caps you at 1/0.008 ≈ 125 tokens/sec — no matter how fast the math is.
- Compute per decode step: a forward pass costs ≈ 2 × params FLOPs per token = 2 × 8e9 = 16 GFLOP. At 312 TFLOP/s that's 16e9 / 312e12 ≈ 0.05 ms.
- Verdict: memory (8 ms) dwarfs compute (0.05 ms) by ~160×. Decode is utterly memory-bound at batch 1. The expensive math units sit ~99% idle, waiting for weights to arrive.
Arithmetic intensity makes this crisp: it's FLOPs ÷ bytes-read. Decode at batch 1 ≈ 16 GFLOP / 16 GB = 1 FLOP/byte. The A100's "ridge point" (where it flips from memory- to compute-bound) is 312e12 / 2e12 ≈ 156 FLOP/byte. Since 1 ≪ 156, we're deep in memory-bound territory. This is the roofline model in one number.
The escape: batching. If you decode B sequences together, you still read the weights only once but do B× the math → intensity ≈ B FLOP/byte. To reach the ridge (≈156) and use the GPU fully, you need batch ~150. That's why throughput serving is all about big batches — and why the scheduler (Phase 3) exists. Prefill already has high intensity (many tokens × one weight read) → it's compute-bound from the start, which is why a long prompt can hog the GPU and must be chunked (Phase 3).
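The roofline arithmetic above fits in a few lines. This sketch uses the same round numbers as the text (8e9 parameters, bf16, 2 TB/s, 312 TFLOP/s) and, like the text, ignores KV-cache reads and any overheads; the function names are illustrative.

```python
PARAMS = 8e9           # Llama-3-8B, rounded
BYTES_PER_PARAM = 2    # bf16
HBM_BANDWIDTH = 2e12   # bytes/s (A100, rounded)
PEAK_FLOPS = 312e12    # FLOP/s (A100 bf16)


def decode_step_times(batch_size):
    """Seconds spent reading weights vs doing math for one decode step."""
    memory_s = PARAMS * BYTES_PER_PARAM / HBM_BANDWIDTH  # weights are read once
    compute_s = 2 * PARAMS * batch_size / PEAK_FLOPS     # about 2 FLOPs per param per token
    return memory_s, compute_s


def arithmetic_intensity(batch_size):
    """FLOPs per byte read, counting only the weights."""
    return (2 * PARAMS * batch_size) / (PARAMS * BYTES_PER_PARAM)


memory_s, compute_s = decode_step_times(batch_size=1)
ridge = PEAK_FLOPS / HBM_BANDWIDTH

print(f"memory  : {memory_s * 1e3:.2f} ms per step")   # 8.00 ms
print(f"compute : {compute_s * 1e3:.3f} ms per step")  # 0.051 ms
print(f"ceiling : {1 / memory_s:.0f} tokens/s")        # 125 tokens/s
print(f"ridge   : {ridge:.0f} FLOP/byte")              # 156 FLOP/byte
print(f"batch to reach ridge: about {ridge / arithmetic_intensity(1):.0f}")
```

Note that `PARAMS` cancels out of the intensity: it depends only on batch size and bytes per parameter, not on model size.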
This one section explains nearly every optimization ahead:
- Batching (Phase 3): amortize the weight read over many sequences → throughput.
- Quantization (Phase 6): make the weights fewer bytes → less to read → faster decode.
- CUDA graphs (Phase 5): when per-step math is tiny, even the CPU overhead of launching the work dominates → remove it.
- Speculative decoding (Phase 8): do useful work for several tokens per weight-read.
🆕 New words: prefill / decode, TTFT / ITL(TPOT), HBM (the GPU's main memory), compute-bound / memory-bandwidth-bound, arithmetic intensity (FLOPs/byte), roofline (the model that says which bound you're under), ridge point (the FLOP/byte where it flips).
🔬 The GPU memory hierarchy (expert aside). A GPU has tiers: tiny ultra-fast registers and SRAM/shared memory (KB–MB, ~TB/s within a core) on-chip, and big slow HBM (tens of GB, ~1–3 TB/s) off-chip. "Memory-bound" means bound by HBM. FlashAttention (Phase 4) is fast precisely because it keeps attention's intermediates in SRAM and avoids round-tripping the giant score matrix to HBM. Keep this hierarchy in mind whenever a kernel is "memory-bound" — it's usually HBM traffic.
0.11 How big is the KV cache? (the wall that caps your users)
Decode is memory-bound, and the KV cache is the other big thing in that memory. Let's size it, one line at a time.
For every token, in every layer, we store a Key vector and a Value vector. So:
bytes_per_token = 2 (one K + one V)
× num_layers (32)
× num_kv_heads (8 ← GQA! not 32 — see §0.6)
× head_dim (128)
× bytes_per_number (2 for bf16)
For Llama-3-8B:
2 × 32 × 8 × 128 × 2 = 131,072 bytes ≈ 128 KB per token
So a 2,000-token conversation = 2000 × 128 KB ≈ 256 MB of GPU memory — for one user. The
punchline: a 24 GB GPU, after ~16 GB for the weights, has ~8 GB for KV. At 256 MB/user that's about
30 conversations at once before you run out of memory.
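The sizing formula and the capacity estimate can be written out as below. It is a back-of-envelope sketch: it uses decimal GB, ignores activations, fragmentation, and runtime overhead, and the function names are illustrative rather than taken from the labs.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_number):
    """Bytes of KV cache one token occupies across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_number  # 2 = one K + one V


def max_sequences(gpu_bytes, weight_bytes, tokens_per_sequence, bytes_per_token):
    """How many full-length sequences fit in the memory left after the weights."""
    free_bytes = gpu_bytes - weight_bytes
    return free_bytes // (tokens_per_sequence * bytes_per_token)


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_number=2)
users = max_sequences(
    gpu_bytes=24 * 10**9,
    weight_bytes=16 * 10**9,
    tokens_per_sequence=2000,
    bytes_per_token=per_token,
)

print(per_token)  # 131072
print(users)      # 30
```

Every factor in `kv_bytes_per_token` is a lever: fewer KV heads (GQA) or fewer bytes per number shrink the per-token cost directly.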
This is the headline of the entire field. What caps how many people you can serve is usually memory, not compute. The KV cache fills the GPU long before the math units are busy. So the serving game is fitting more KV cache: by not wasting any (PagedAttention, Phase 2), by sharing it across requests (prefix caching, Phase 3), and by shrinking it (FP8 KV cache, Phase 6: halving bytes_per_number doubles your users).
🔬 Going deeper: scale it to 70B and feel the squeeze. Llama-3-70B: 80 layers, 8 KV heads, head_dim 128 → 2×80×8×128×2 = 327,680 B ≈ 320 KB/token. At 8k context that's 2.6 GB per sequence. And on an 80 GB A100 there is no room left for it: 70B parameters in bf16 are ~140 GB of weights, so the model doesn't even fit on one 80 GB GPU. This is why 70B requires tensor parallelism across multiple GPUs (Phase 10) and why people quantize (Phase 6): both the weights and the KV cache are fighting for memory. You'll compute these numbers yourself in lab-02.
When you later see vLLM log Maximum concurrency for 2048 tokens: 68.65x, you'll know it's this
exact division: free-HBM ÷ per-sequence-KV. That number is your serving capacity.
0.12 Throughput vs latency, and Little's Law
Two metrics, in tension, that you'll trade off for the rest of your career:
- Latency — how fast one request feels (TTFT, ITL). What an individual user cares about.
- Throughput — total tokens/sec across everyone. What sets your cost per token — the number a business lives or dies on.
They fight: bigger batches raise throughput (amortized weight reads, §0.10) but slow each individual request (more work per step). The scheduler (Phase 3) steers this; Phase 18 is the art of tuning it.
🔬 Little's Law (Going deeper). For any stable serving system: concurrency = throughput × latency. If each request stays in the system for L seconds and you sustain X requests/sec, then on average N = X·L requests are in flight. Rearranged: to hit a target throughput at a given latency, you need a certain concurrency, and that concurrency must fit in KV memory (§0.11). This little equation ties together the whole stack: memory limits concurrency, concurrency (via Little's Law) limits the throughput you can reach at your latency SLA. You'll use it to size real deployments in Phase 18.
🆕 New words: latency, throughput, cost per token, Little's Law (N = X·L), SLA (the latency you've promised customers).
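Little's Law turns into a sizing check in two small functions. The workload numbers below (4 requests/sec, 5 s per request, 2,000 tokens each) are made up for illustration; the 131,072 bytes per token is the Llama-3-8B figure from §0.11.

```python
def required_concurrency(requests_per_sec, latency_sec):
    """Little's Law: N = X * L requests in flight on average."""
    return requests_per_sec * latency_sec


def kv_bytes_needed(concurrency, tokens_per_request, kv_bytes_per_token):
    """KV-cache memory needed to hold that many requests at once."""
    return concurrency * tokens_per_request * kv_bytes_per_token


in_flight = required_concurrency(requests_per_sec=4, latency_sec=5)
kv_bytes = kv_bytes_needed(in_flight, tokens_per_request=2000, kv_bytes_per_token=131072)

print(in_flight)                  # 20
print(f"{kv_bytes / 1e9:.2f} GB")  # 5.24 GB
```

If the result exceeds the KV memory you have, either the target throughput or the latency target has to give, or the per-token KV cost has to shrink.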
0.13 The one picture to carry into every later phase
Strip away the words and the engine reduces to this: a request is a list of tokens with two counters racing.
┌────────────────────────────────────────────────────────────────────┐
│ A request = tokens + two counters: │
│ num_tokens = how many tokens exist (prompt + generated) │
│ num_computed_tokens = how many have been processed (KV cached) │
│ │
│ PREFILL : computed is far behind → catch up in one big run │
│ DECODE : computed is one behind → compute one more, append, repeat│
│ │
│ The engine's entire job: make `computed` catch up to `tokens`, │
│ as cheaply as possible, for thousands of requests at once. │
└────────────────────────────────────────────────────────────────────┘
This is literally how vLLM's Request object is built (vllm/v1/request.py) and how its scheduler
reasons (Phase 3); mini_vllm/request.py mirrors it. If you remember one diagram from the whole
course, make it this one — every later phase is "do one part of this loop better."
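The two-counter picture can be simulated in a few lines. `ToyRequest` and `step` below are hypothetical illustrations, not vLLM's `Request` class or `mini_vllm/request.py`; the token IDs are arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    prompt_token_ids: list
    output_token_ids: list = field(default_factory=list)
    num_computed_tokens: int = 0  # tokens whose K/V are already cached

    @property
    def num_tokens(self):
        """Tokens that exist so far: prompt + generated."""
        return len(self.prompt_token_ids) + len(self.output_token_ids)


def step(req, sampled_token):
    """One engine step: process every uncomputed token, then append one new token."""
    scheduled = req.num_tokens - req.num_computed_tokens  # the gap to close
    req.num_computed_tokens += scheduled
    req.output_token_ids.append(sampled_token)
    return scheduled


req = ToyRequest(prompt_token_ids=[11, 12, 13, 14, 15])
work = [step(req, tok) for tok in (101, 102, 103)]

print(work)                                     # [5, 1, 1]  prefill, then decode
print(req.num_tokens, req.num_computed_tokens)  # 8 7
```

The first step closes a gap of five tokens (prefill); every later step closes a gap of one (decode), leaving `num_computed_tokens` one behind `num_tokens`.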
0.14 What you'll do in this phase
- Read: 01-deep-dive.md. Find every concept above (Q/K/V, the cache, the two counters) in a real model file and in vLLM's EngineCore.step.
- Build / measure: 02-mini-build.md. Understand mini_vllm's tokenizer, toy model, and sampler, and run the two experiments below.
- Labs (see labs/README.md for the full guide to each):
  - lab-01-kv-cache-speedup [CPU-OK]: implement generation with and without a KV cache and measure the O(N²) → O(N) win. The motivating experiment of the course.
  - lab-02-kv-memory-calculator [CPU-OK]: write the memory formula and compute how many users fit on a real GPU (8B and 70B). See the memory wall for yourself.
  - lab-03-sampling-basics [CPU-OK]: build greedy/temperature/top-k/top-p from scratch and prove your sampler agrees token-for-token with mini_vllm's.
  - lab-04-prefill-vs-decode [CPU-OK]: the roofline arithmetic: the ridge point, the 0.6% compute utilization of single-stream decode, the 125 tok/s speed limit, the critical batch size.
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
You're ready for the rest of the book when you can, from memory: walk the attention worked example; explain why decode is memory-bound using arithmetic intensity; derive the KV-cache size for a model; and estimate how many users a GPU can serve. Those four are the foundation everything else is built on.
Phase 00 — Deep Dive: a real forward pass and the request counters
Paths relative to upstream/ at v0.22.1 @ 0decac0. You don't need to understand every line of a model; you need to recognize the shapes from the guide (Q/K/V, the KV cache, the prefill/decode counters) in real code. That recognition is what lets you navigate any model file later (Phase 14).
Contents
- 1. A real decoder-only model: Llama
- 2. The two counters that run the whole engine
- 3. The loop that drives it all: EngineCore.step
- Reading checklist
1. A real decoder-only model: Llama
Open vllm/model_executor/models/llama.py. The structure is a Russian doll:
- LlamaModel (:350): holds the embedding + a stack of LlamaDecoderLayers + a final norm.
- LlamaDecoderLayer (:253): one transformer block: self_attn then mlp, each with a residual add and an RMSNorm.
- LlamaAttention (:124): the attention block.
- LlamaMLP (the small class with forward(self, x) at :117): gate/up/down projections.
The decoder layer forward (LlamaDecoderLayer.forward, :316)
Skim it and find this shape (paraphrased):
# residual stream in -> norm -> attention -> add -> norm -> mlp -> add -> out
residual = hidden_states                    # keep the incoming stream for the residual add
hidden = self.input_layernorm(hidden_states)
hidden = self.self_attn(positions, hidden)  # attention mixes across tokens
hidden = residual + hidden
residual = hidden                           # the updated stream feeds the second add
hidden = self.post_attention_layernorm(hidden)
hidden = self.mlp(hidden)                   # per-token transform
hidden = residual + hidden
That's the whole transformer block. 32 of these stacked = Llama-3-8B. Notice attention is the only place tokens interact; the MLP treats each token independently. That's why attention is where the KV cache (cross-token memory) lives, and the MLP is just big GEMMs (Phase 7).
Where K and V are produced and cached (LlamaAttention.forward, :223)
This is the payoff. Find (paraphrased):
qkv, _ = self.qkv_proj(hidden_states) # one matmul produces Q, K, V (fused)
q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
q, k = self.rotary_emb(positions, q, k) # positional info (RoPE)
attn_output = self.attn(q, k, v) # <- the Attention layer (Phase 4)
output, _ = self.o_proj(attn_output)
The self.attn call is a vllm.attention.layer.Attention module — and that is what writes the
new k, v into the paged KV cache (Phase 2) and reads back the cached K/V to compute
attention (Phase 4). So the journey is: model produces Q/K/V → the Attention layer caches K/V in
blocks and runs the attention kernel. Everything you'll learn in Phases 2 and 4 plugs in right
here, at this one self.attn(q, k, v) call. Hold that thread.
Don't get lost. You will not understand all of llama.py today, and you don't need to. The point is to locate Q/K/V production and the self.attn call. That's the seam where the engine's memory and kernels meet the model.
2. The two counters that run the whole engine
Open vllm/v1/request.py. The Request class (:59) carries the prompt, the generated tokens,
and the sampling params. The two properties that matter most:
@property
def num_tokens(self) -> int: # :239
# total tokens that exist: prompt + generated so far
...
and the field set in __init__ (:145): self.num_computed_tokens = 0 — how many of those
tokens have had their KV computed and cached.
The whole engine is the race between these two numbers (guide §0.13, the mental model):
- New request: num_computed_tokens = 0, num_tokens = len(prompt). The gap is the whole prompt → prefill.
- After prefill + each decode: num_computed_tokens is one behind num_tokens; generating a token bumps num_tokens, then the next step computes one more → decode.
num_tokens_with_spec (:243) adds speculative draft tokens to the gap — which is how spec
decode (Phase 8) rides the same machinery with no special case. RequestStatus (:315) is the
lifecycle enum (WAITING/RUNNING/PREEMPTED/FINISHED_*) you'll meet in Phase 3.
mini_vllm/request.py is a faithful miniature: same num_computed_tokens vs num_tokens, same
status enum, same is_finished = status >= FINISHED ordering trick.
3. The loop that drives it all: EngineCore.step
Open vllm/v1/engine/core.py:428. This is the heartbeat of vLLM:
def step(self) -> tuple[dict[int, EngineCoreOutputs], bool]:
    if not self.scheduler.has_requests():
        return {}, False
    scheduler_output = self.scheduler.schedule()  # Phase 3: who runs
    future = self.model_executor.execute_model(scheduler_output, ...)  # the forward pass
    grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)  # Phase 12
    model_output = future.result()
    if model_output is None:
        model_output = self.model_executor.sample_tokens(grammar_output)  # Phase 9
    engine_core_outputs = self.scheduler.update_from_output(  # advance counters
        scheduler_output, model_output)
    return engine_core_outputs, scheduler_output.total_num_scheduled_tokens > 0
schedule → execute → sample → update. That's it. That's the engine. Every phase in this course is a deep dive into one box of this four-stage loop:
- schedule() → Phases 2, 3 (memory + batching)
- execute_model() → Phases 4–7, 10, 13, 14 (kernels, quant, parallelism, the model itself)
- sample_tokens() → Phases 8, 9, 12 (decoding, spec, structured output)
- update_from_output() → Phase 3 (advance num_computed_tokens, reap finished)
mini_vllm/engine.py's step() is the same loop with the GPU filed off — read them side by
side and the correspondence is exact.
Reading checklist
One sentence each:
- In LlamaAttention.forward, which line produces Q/K/V and which line caches/uses K/V?
- Why does the MLP not need a KV cache but attention does?
- On Request, what's the difference between num_tokens and num_computed_tokens?
- In EngineCore.step, name the four stages and which course phase owns each.
Now build it: 02-mini-build.md, then the labs.
Phase 00 — Mini-Build: feel the KV-cache win
This phase doesn't add a new mini_vllm module — it has you understand the three you already
have and measure the foundational result the whole course rests on.
Contents
- Part A — read the three pieces of "next-token prediction"
- Part B — the lab: O(N²) vs O(N)
- Definition of done
- Map to the real engine
Part A — read the three pieces of "next-token prediction"
Open and read (they're tiny):
- mini_vllm/tokenizer.py: encode(str) -> list[int], decode(list[int]) -> str. Same interface as a real HF tokenizer.
- mini_vllm/model.py: ToyModel.forward(last_tokens, positions) -> logits. Batched, autoregressive, deterministic. (It ignores KV values, an honest simplification noted in the file, because this course cares about KV memory management, not the toy's numerics.)
- mini_vllm/sampler.py: Sampler.sample(logits, params) -> token_id. Greedy at temperature 0.
Trace one LLMEngine.generate(["hi"]) call in your head: tokenize → loop step() → sample →
detokenize. Confirm it matches EngineCore.step from the deep-dive.
Part B — the lab: O(N²) vs O(N)
lab-01-kv-cache-speedup is the build. You implement a toy attention twice:
- No cache: every step recomputes K/V for all prior tokens → work grows with the prefix.
- Cached: K/V computed once, reused → constant work per step.
You'll count "K/V computations" and show the no-cache version does 1+2+...+N = O(N²) while the
cached version does O(N) — and that both produce the identical token sequence (caching is an
optimization, never a correctness change — the same invariant you'll prove for chunked prefill
and preemption in Phase 3).
lab-02-kv-memory-calculator has you write the kv_bytes_per_token formula and compute how many
concurrent sequences fit on a given GPU — the number that caps your serving capacity.
Definition of done
pytest phase-00-foundations/labs -q
Then answer in your notebook:
- What is the asymptotic work ratio (no-cache / cached) for generating N tokens? (≈ N/2.)
- For Llama-3-8B at 8k context on a 24 GB GPU (say 16 GB weights), roughly how many concurrent full-length sequences fit? (You'll compute it in lab-02 — the answer is "surprisingly few," which is the entire motivation for Phase 2.)
Map to the real engine
| your understanding | real vLLM |
|---|---|
| the no-cache vs cache experiment | why vllm.attention.layer.Attention caches K/V at all |
| num_computed vs num_tokens | Request counters (request.py:239) |
| tokenize→loop→sample→detokenize | EngineCore.step (core.py:428) |
| kv_bytes_per_token formula | how get_kv_cache_configs sizes the block pool (Phase 2) |
Phase 00 Labs — Foundations
Four labs that install the four facts everything else stands on: generation is autoregressive and caching makes it linear (lab-01), the cache is memory and memory is the binding constraint (lab-02), logits become tokens through a small exact algorithm (lab-03), and prefill and decode live in opposite performance regimes (lab-04). No GPU, no model downloads — counters, formulas, and numpy. Do them in order; each ends where the next begins.
Every lab follows the standard contract: starter.py with TODOs (your work),
solution.py (the reference), test_lab.py (the spec, executable). The default test run
uses solution.py so the suite is always green; set LAB_IMPL=starter to grade yourself.
# Whole phase:
pytest phase-00-foundations/labs -m "not gpu"
# Grade your own work on one lab:
LAB_IMPL=starter pytest phase-00-foundations/labs/lab-01-kv-cache-speedup -q
Contents
- lab-01-kv-cache-speedup [CPU-OK]
- lab-02-kv-memory-calculator [CPU-OK]
- lab-03-sampling-basics [CPU-OK]
- lab-04-prefill-vs-decode [CPU-OK]
- What you can do after this phase
Labs
lab-01-kv-cache-speedup [CPU-OK]
The experiment that motivates the course: implement generation with and without a KV cache, count the work exactly (95 vs 15 units; >100× by n=1000), and prove both produce identical tokens. The O(N²) → O(N) trade that converts compute into memory — and creates the prefill/decode split as a side effect. Skills: why the cache exists; causality makes K/V cacheable; counting beats clocking; the master "optimization changes nothing" invariant.
lab-02-kv-memory-calculator [CPU-OK]
Write the three-line formula behind every capacity decision in LLM serving and apply it to Llama-3-8B: 128 KiB per token, 256 MiB per sequence, ~32 concurrent users on a 24 GiB GPU. Then read FP8-KV and GQA as factors of the formula. Memory, not compute, is the constraint — derived, not asserted. Skills: back-of-envelope capacity planning; the formula as an optimization roadmap; weights are rent, KV is traffic.
lab-03-sampling-basics [CPU-OK]
Build the sampler: greedy, temperature, top-k, top-p — with the stability clause (softmax
max-subtraction), the inclusive nucleus boundary, and seeded reproducibility. The final
test proves your sampler agrees token-for-token with mini_vllm's engine sampler across
15 configurations. Skills: the four knobs as exact algorithms; −∞ masking; why greedy
mode anchors every deterministic test in this course.
lab-04-prefill-vs-decode [CPU-OK]
Six one-line functions and an A100 spec sheet: the ridge point (156 FLOPs/byte), single-stream decode at 0.6% compute utilization, the 125 tok/s physical speed limit for 8B/fp16, and the critical batch size where decode becomes compute-bound. The roofline worldview that sorts every optimization into "helps my regime" or "doesn't." Skills: compute-bound vs memory-bound as a reflex; the intensity cancellation (model size doesn't matter — tokens per weight-trip does); why batching is free money and quantization is a decode feature.
What you can do after this phase
Derive, on a whiteboard with no notes: why every inference engine caches KV (and what it
costs in bytes); how many users fit on a given GPU for a given model (and which knob to
turn when the answer is too small); what temperature=0.7, top_p=0.9 actually computes;
and whether a proposed optimization can possibly help a given workload (which side of the
ridge is it on?). These four reflexes are the entrance exam for Phase 1, where the loop
you simulated becomes a real engine with a scheduler, and for every phase after it.
Lab 00-01 — The KV-Cache Speedup [CPU-OK]
This is the experiment that motivates the entire course — and arguably the entire field of LLM inference engineering. You will implement autoregressive generation twice: once the naive way (recompute attention's keys and values for the whole sequence, every step) and once with a KV cache (compute each token's K/V exactly once, ever). Same model, same output, and a work difference that grows with the square of the sequence length. By the end you'll have measured, with an exact integer counter you control, why every serving engine on earth is built around a cache — and why the rest of this course is about managing that cache.
Contents
- Why this lab exists
- Background: what K and V are, and why they're recomputable
- Files
- Run
- What to implement
- What you should see — and why every number is what it is
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Ask a newcomer why LLM inference is expensive and they'll say "big matrices." True, but it misses the structural problem: generation is autoregressive. The model emits one token, appends it, and runs again on the longer sequence — N tokens of output means N forward passes. If each pass reprocesses everything before it, total work is 1+2+3+…+N ≈ N²/2 token-computations for N tokens of value. Quadratic. A 10k-token answer would cost ~50 million token-computations to produce 10 thousand tokens.
The KV cache is the observation that almost all of that recomputation is byte-identical every time — and the field's entire architecture flows downstream of caching it. Once you store K/V, generation becomes O(N)… and a new problem is born: that cache is state, it lives in scarce GPU memory, it grows every step, and somebody has to manage it. That somebody is vLLM, and managing-the-cache-well is Phases 1–19. This lab is where you earn the premise.
We measure with a counter, not a stopwatch, on purpose. Wall-clock on a laptop is noisy
and proves nothing about asymptotics; an exact work count (one unit per token processed)
gives you the formula, and formulas transfer to any hardware. You'll meet this
counting-over-clocking style again in Phase 3's labs.
Background: what K and V are, and why they're recomputable
In attention, each token's hidden state is projected three ways: a query (what am I looking for?), a key (what can I be found by?), and a value (what do I contribute when found?). When the model processes token t, its query is dotted against the keys of all previous tokens, and the resulting weights blend their values:
attn(t) = softmax(q_t · [k_0 … k_t]ᵀ / √d) · [v_0 … v_t]
The crucial property is causality: k_i and v_i depend only on tokens 0..i. Token
5's key is the same whether the sequence currently has 6 tokens or 6,000. So once computed,
(k_i, v_i) is valid forever — it's a pure function of the prefix, which makes it
perfectly cacheable. The query is the only part that's fresh each step (it belongs to the
new token), which is why we cache K and V but never Q.
That's the whole trick. "KV cache" sounds like infrastructure; it's actually a one-line theorem about causal attention plus the decision to spend memory on it.
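The causality claim can be checked numerically. The sketch below is a toy, not anything from mini_vllm or vLLM: it assumes a single-head, two-layer causal attention stack with random weights (the names causal_self_attention and layer2_keys are hypothetical), and compares token 5's layer-2 key in a 6-token sequence against the same token's key in a 60-token sequence.

```python
import numpy as np

def causal_self_attention(x, wq, wk, wv):
    """Single-head causal attention; returns (keys, values, output)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    # Token t may only attend to positions 0..t: mask the strict upper triangle.
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return k, v, weights @ v

def layer2_keys(x, params):
    """Keys of a second layer, which depend on layer 1's attention output."""
    wq1, wk1, wv1, wk2 = params
    _, _, out1 = causal_self_attention(x, wq1, wk1, wv1)
    hidden = x + out1  # residual connection
    return hidden @ wk2

rng = np.random.default_rng(0)
d = 8
params = [rng.normal(size=(d, d)) for _ in range(4)]
long_seq = rng.normal(size=(60, d))  # toy embeddings for 60 tokens
short_seq = long_seq[:6]             # the same first 6 tokens

k_short = layer2_keys(short_seq, params)
k_long = layer2_keys(long_seq, params)

# Token 5's layer-2 key should not change when 54 more tokens follow it.
keys_match = bool(np.allclose(k_short[5], k_long[5]))
print("token 5 key identical in both sequences:", keys_match)
```

The comparison uses np.allclose rather than exact equality because the two matmuls have different shapes and may round differently in the last bits. Removing the causal mask should break the match, since layer 1's output for token 5 would then depend on later tokens.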
Files
- starter.py: implement generate_no_cache and generate_with_cache. The work meter (compute_kv / KVWork) and the deterministic next_token are provided. Your work.
- solution.py: reference.
- test_lab.py: pins identical outputs, the exact quadratic and linear work formulas, and the growing ratio.
Run
LAB_IMPL=starter pytest phase-00-foundations/labs/lab-01-kv-cache-speedup -q
pytest phase-00-foundations/labs/lab-01-kv-cache-speedup -q # reference (default)
What to implement
Both functions generate n_new tokens from a prompt of length P and return
(full_token_sequence, total_kv_work):
- generate_no_cache: each decode step first calls compute_kv(tok, pos) for every token currently in the sequence (the model "re-reads" everything), then appends next_token(tokens). Step i (0-indexed) costs P + i units.
- generate_with_cache: prefill once (compute_kv per prompt token, P units), then each decode step computes K/V for only the newly appended token (1 unit).
next_token is deterministic — a hash of the context — so both implementations must
produce the same token sequence. That's not a convenience; it's the point (see the first
test).
What you should see — and why every number is what it is
For P = 5, n_new = 10:
no cache : work = 5+6+7+8+9+10+11+12+13+14 = 95 (sum of P..P+n_new-1 → O(N²))
cached : work = 5 + 10 = 15 (P prefill + 1/step → O(N))
- Why 95? Step 0 reprocesses the 5 prompt tokens; step 1 reprocesses 6 (prompt + the token just generated); … step 9 reprocesses 14. The arithmetic series is the quadratic, made concrete enough to check by hand — which is exactly what the test does.
- Why 15? Each of the 15 tokens that ever exists has its K/V computed exactly once. The cached cost is the number of tokens. It cannot be beaten by any scheme that actually computes the KV (it can be beaten by schemes that reuse KV across requests — that's prefix caching, Phase 2/3).
- At n_new = 1000: the ratio is >100× and still climbing linearly (~N/2). On real hardware this asymptotic gap is the difference between "chatbots are economically possible" and not.
- Notice the two-phase shape that fell out for free: a big batch of K/V work up front (the prefill: all P prompt tokens at once, parallelizable, compute-hungry), then a drip of single-token steps (the decode: serial, one unit each). You didn't design that; caching created it. Prefill-vs-decode is the most consequential workload split in inference (lab-04 quantifies it; Phase 1 traces it; Phase 3 schedules around it), and it is born right here, in your 20 lines.
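The two work formulas can be cross-checked against a loop that counts units the way the lab describes. This is a standalone sketch, not the lab's starter or solution: the function names are hypothetical, and it counts work only, without generating tokens.

```python
def work_no_cache(prompt_len, n_new):
    """Closed form: step i (0-indexed) reprocesses prompt_len + i tokens."""
    return sum(prompt_len + i for i in range(n_new))

def work_cached(prompt_len, n_new):
    """Closed form: one unit per token that ever exists."""
    return prompt_len + n_new

def simulate(prompt_len, n_new, use_cache):
    """Count K/V computations by walking the decode loop step by step."""
    work = 0
    seq_len = prompt_len
    if use_cache:
        work += prompt_len  # prefill: each prompt token computed once
    for _ in range(n_new):
        # Cached: only the newest token. Uncached: the whole sequence so far.
        work += 1 if use_cache else seq_len
        seq_len += 1
    return work

for n_new in (10, 100, 1000):
    slow = simulate(5, n_new, use_cache=False)
    fast = simulate(5, n_new, use_cache=True)
    print(f"n_new={n_new:5d}  no cache={slow:7d}  cached={fast:5d}  ratio={slow / fast:.1f}")
```

For P = 5 the first row reproduces the 95 and 15 above, and the ratio keeps growing with n_new, which is the asymptotic gap the last test pins.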
What the tests prove
| Test | What it pins |
|---|---|
| test_both_produce_identical_tokens | Caching is an optimization, not a behavior change: the cached run's outputs are bit-identical. This is the course's master invariant: every optimization from here on (chunked prefill, prefix caching, preemption, paging) is proven safe by exactly this kind of equality test |
| test_no_cache_is_quadratic | work == sum(P .. P+n_new−1): the formula, not "roughly slower" |
| test_cached_is_linear | work == P + n_new: every token computed once, ever |
| test_work_ratio_grows_with_length | The gap grows with N (>100× at n=1000): this is an asymptotic class difference, not a constant factor someone could optimize away |
Hitchhiker's notes
- The cache is a time–space trade, and the space is the plot of this course. You just converted O(N²) compute into O(N) memory: every token now permanently occupies bytes (about 128 KiB/token for Llama-3-8B — lab-02 computes this). One number to foreshadow: a 24 GiB GPU holds weights plus only a few dozen full-length sequences of cache. Scarcity is immediate, and scarcity is why Phases 2–3 exist.
- Real transformers hide the no-cache cost inside one matmul. HuggingFace generate(use_cache=False) doesn't loop per token like your simulation; it reprocesses the whole sequence in a single (big) forward pass per step. The work is still quadratic in total: your counter models the FLOPs faithfully even though the loop structure differs.
- Where the cache actually lives upstream: vllm.attention.layer.Attention writes each step's new K/V into the paged cache (via slot_mapping; Phase 2 lab-06), and the kernel reads all prior K/V (via block_table). What you modeled as a counter is, in production, tensors + an allocator + a scheduler. Same theorem underneath.
- Why does the cached version call next_token(tokens) with the full list, then? Because the model function still needs the whole context semantically: the cache changes what is recomputed, not what the model "knows." In a real model, "the cache was consulted" and "the context was read" are the same act: attention over cached K/V. Don't confuse caching KV with truncating context.
Going further
- Plot work_no_cache / work_cached for n in 1..2000 and confirm the ~N/2 line. Then plot cached work alone: a flat 1/step. That flat line is why decode latency is stable and why per-token pricing is linear. Economics from asymptotics.
- Model prompt length: sweep P from 10 to 10,000 at fixed n_new=100. Notice prefill dominates total cached work for long prompts: the TTFT story (Phase 1) in miniature.
- Add a kv_bytes counter alongside the work counter (one cache entry per compute_kv) and watch memory grow linearly while compute stays flat. You've now built both axes of lab-02 and the motivating tension of Phase 2 with ~5 extra lines.
References
- Vaswani et al., Attention Is All You Need (2017) — where K/Q/V come from: https://arxiv.org/abs/1706.03762
- kipply, Transformer Inference Arithmetic — the canonical blog walkthrough of KV-cache math and why decode is bandwidth-bound: https://kipp.ly/transformer-inference-arithmetic/
- Pope et al., Efficiently Scaling Transformer Inference (2022) — §3 formalizes the prefill/decode split your counter just exposed: https://arxiv.org/abs/2211.05102
- upstream/vllm/attention/layer.py: the production home of the cache write.
- Phase 0 guide §"the KV cache" (00-guide.md): the intuition this lab makes quantitative.
Lab 00-02 — KV-Cache Memory Calculator [CPU-OK]
Lab-01 ended with a cliffhanger: the KV cache converts quadratic compute into linear memory. This lab computes exactly how much memory — and the answer is the most important number in LLM serving economics: how many concurrent users fit on one GPU. You'll write the three-line formula, apply it to Llama-3-8B, and arrive at the genuinely shocking result that a 24 GiB GPU running an 8B model has room for only ~32 full-length conversations. Every dollar of inference cost, every "maximum concurrency" log line, and the entire existence of PagedAttention trace back to the arithmetic you're about to own.
Contents
- Why this lab exists
- Background: where the bytes go
- Files
- Run
- The formulas
- The headline result, walked through
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
This is back-of-envelope as a professional skill. A staff inference engineer gets asked, weekly, some variant of: "can we serve model X to Y users on hardware Z?" The wrong answer costs a fleet; the right answer is three multiplications you can do in a meeting. This lab installs the formula so deeply that you'll never again look at a GPU spec sheet without mentally dividing its HBM by a KV footprint.
It's also the Rosetta stone for the rest of the course. When Phase 2 lab-03's real engine
prints Maximum concurrency for 2,048 tokens per request: 68.65x, that's this lab's
max_concurrent_seqs evaluated against measured free HBM. When Phase 6 sells you FP8 KV,
when model cards advertise GQA, when Phase 10 shards KV across GPUs — every one of those is
an attack on a term of the formula you write here. Learn the formula, and the whole
optimization landscape organizes itself into "which factor does this shrink?"
Background: where the bytes go
Per token, per layer, attention stores one K vector and one V vector, each of
num_kv_heads × head_dim elements. Multiply it out:
kv_bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × dtype_bytes
  2            : K and V
  num_layers   : every layer
  num_kv_heads : the heads
  head_dim     : per head
  dtype_bytes  : fp16 = 2
Two things to notice before you code:
- num_kv_heads, not num_heads. Modern models use grouped-query attention (GQA): many query heads share each KV head precisely because someone did this lab's math and realized KV memory, not model quality, capped serving capacity. Llama-3-8B has 32 query heads but only 8 KV heads, a 4× KV saving designed into the architecture. (MQA, with one KV head, is the extreme; MLA in DeepSeek compresses further still. Architecture evolution is visible in this one parameter.)
- It's per token, forever. The cache for token 0 lives until the request finishes. Length × concurrency × per-token bytes must fit in what's left after weights. There is no amortization, no compression by default: just bytes, held for the lifetime of the conversation.
Files
- starter.py: implement kv_bytes_per_token, kv_bytes_per_seq, max_concurrent_seqs. Your work.
- solution.py: reference.
- test_lab.py: pins exact numbers for Llama-3-8B, the FP8 and GQA factor effects, and the no-room edge case.
Run
LAB_IMPL=starter pytest phase-00-foundations/labs/lab-02-kv-memory-calculator -q
pytest phase-00-foundations/labs/lab-02-kv-memory-calculator -q # reference (default)
The formulas
kv_bytes_per_token = 2 (K and V) × num_layers × num_kv_heads × head_dim × dtype_bytes
kv_bytes_per_seq = kv_bytes_per_token × seq_len
max_concurrent = (gpu_bytes − weight_bytes) // kv_bytes_per_seq (0 if no room)
Integer division on purpose: you cannot serve 0.7 of a conversation. (The real engine has the same floor, expressed in blocks — Phase 2.)
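As a rough sketch, the three formulas translate almost word for word into Python. The function names come from the lab's file list, but the parameter names and order here are assumptions, and the starter's real signatures may differ.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    # Factor 2: one K vector and one V vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_bytes_per_seq(bytes_per_token, seq_len):
    return bytes_per_token * seq_len

def max_concurrent_seqs(gpu_bytes, weight_bytes, bytes_per_seq):
    free = gpu_bytes - weight_bytes
    if free <= 0:
        return 0  # weights alone fill the GPU: no room for any sequence
    return free // bytes_per_seq  # floor: no fractional conversations

# Llama-3-8B in fp16, 2,048-token sequences, 24 GiB GPU, ~16 GiB of weights.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
per_seq = kv_bytes_per_seq(per_token, seq_len=2048)
users = max_concurrent_seqs(24 * GIB, 16 * GIB, per_seq)
print(per_token // 1024, "KiB/token,", per_seq // MIB, "MiB/seq,", users, "sequences")

# The same call with dtype_bytes=1 models an FP8 KV cache.
fp8_per_seq = kv_bytes_per_seq(kv_bytes_per_token(32, 8, 128, 1), 2048)
fp8_users = max_concurrent_seqs(24 * GIB, 16 * GIB, fp8_per_seq)
```

Keeping everything in integer bytes avoids float rounding at the floor division, which matters when the answer sits exactly on a boundary, as 32 does here.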
The headline result, walked through
Llama-3-8B in fp16: num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2.
per token : 2 × 32 × 8 × 128 × 2 = 131,072 B = 128 KiB (!)
per 2,048-token sequence: 128 KiB × 2,048 = 256 MiB
24 GiB GPU − ~16 GiB weights = 8 GiB free
concurrency: 8 GiB / 256 MiB = 32 sequences
Sit with each line:
- 128 KiB per token. A token is ~4 characters of text. Its cache costs as much as a small image. A 100-word answer: ~17 MB. This is why "just keep the conversation in memory" is a capacity strategy, not a triviality.
- 256 MiB per max-length sequence — 1.6% of the entire weights per conversation, for context alone.
- 32 users. An 8-billion-parameter model on a serious GPU and the ceiling is thirty-two, and that's assuming perfect packing with zero waste. Now recall (or preview) Phase 2 lab-02: pre-vLLM engines wasted 60–80% of KV memory on fragmentation, turning 32 into ~6–12. Memory, not compute, is the binding constraint of LLM serving. It is the single most counterintuitive fact in the field, and you just derived it.
- And the punchline that launches Phase 2: since the constraint is memory, the highest-leverage engineering target is making every byte of that 8 GiB hold useful KV. That is, verbatim, PagedAttention's mission statement.
What the tests prove
| Test | What it pins |
|---|---|
| test_llama3_8b_per_token | The exact 131,072: get the factors and you get the field's most-quoted number |
| test_llama3_8b_concurrency | The headline 32, end to end through all three functions |
| test_fp8_kv_doubles_concurrency | dtype_bytes is a linear lever: halve it, double the users. (Phase 6's FP8-KV feature, justified in one assert) |
| test_gqa_saves_vs_mha | 8 vs 32 KV heads = exactly 4×: why GQA exists, as arithmetic |
| test_no_room_returns_zero | Weights ≥ HBM → 0, gracefully. Capacity functions must not return −3 users |
Hitchhiker's notes
- The formula is the optimization roadmap. Every KV-memory technique in production attacks one factor: FP8/INT4 KV quantization → dtype_bytes; GQA/MQA/MLA → num_kv_heads × head_dim; sliding-window attention (Mistral) and hybrid SSM layers (Phase 14) → which layers store KV at all; tensor parallelism (Phase 10) → divides the whole thing across GPUs; prefix caching (Phases 2–3) → shares kv_bytes_per_seq across requests. When you read a new inference paper, your first question is now: which factor?
- Weights are the entry fee; KV is the rent. Weights are fixed and amortize over every request; KV scales with traffic. This is why bigger GPUs disproportionately help serving (more leftover after the fixed cost) and why a 70B model on an 80 GiB GPU (~140 GiB fp16 weights, which doesn't even fit without quantization or sharding) is a different kind of problem than 8B on 24 GiB.
- seq_len is the denominator you control. The formula uses worst-case length: exactly what the engine's startup "maximum concurrency" line assumes, and exactly why Phase 2 lab-03's reflection scolds the max_model_len=32768 config for a 4k workload. Capacity planning with the p99 length instead of the max is the cheapest 8× you'll ever find.
- What the simple formula ignores (and where it bends in practice): activation scratch memory (vLLM measures this at startup by profiling; Phase 2 lab-03), block-granularity rounding (up to block_size − 1 tokens of tail waste per sequence; Phase 2), CUDA context overhead, and the fact that real workloads have a length distribution, so effective concurrency is higher than the max-length floor. The formula is your lower bound and your sanity check, not your final answer, which is also why vLLM computes blocks from measured free memory rather than trusting arithmetic.
Going further
- Build a table for models you care about (3B/8B/70B; fp16/fp8 KV; 2k/8k/128k context) on A100-80G. Notice where concurrency drops below 1 — congratulations, you've discovered why 128k-context serving needs either massive HBM, KV offload, or architecture tricks, and why long-context pricing is what it is.
- Invert the formula: given a target of 200 concurrent users at 4k context on 8B/fp16, how much free HBM do you need? (200 × 4096 × 128 KiB = 100 GiB → multi-GPU territory → Phase 10's tensor parallelism divides per-GPU KV by the shard count.)
- Add block_size rounding from Phase 2 (ceil(seq_len / block_size) blocks per sequence) and quantify how little paging's tail waste costs vs the 60–80% it saves, reproducing Phase 2 lab-02's conclusion from the memory side.
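One possible starting point for the block-rounding exercise is sketched below. The block size of 16 tokens is an assumption chosen for illustration, the helper names are hypothetical, and the 128 KiB per-token figure is the Llama-3-8B fp16 number derived earlier in this lab.

```python
import math

KIB = 1024
BYTES_PER_TOKEN = 128 * KIB  # Llama-3-8B, fp16 KV (derived above)
BLOCK_SIZE = 16              # assumed block size in tokens, for illustration

def blocks_needed(seq_len, block_size=BLOCK_SIZE):
    return math.ceil(seq_len / block_size)

def tail_waste_tokens(seq_len, block_size=BLOCK_SIZE):
    """Allocated-but-unused token slots in the sequence's last block."""
    return blocks_needed(seq_len, block_size) * block_size - seq_len

def tail_waste_fraction(seq_len, block_size=BLOCK_SIZE):
    allocated = blocks_needed(seq_len, block_size) * block_size
    return tail_waste_tokens(seq_len, block_size) / allocated

for seq_len in (100, 2048, 2049):
    wasted = tail_waste_tokens(seq_len)
    print(
        f"seq_len={seq_len:5d}  blocks={blocks_needed(seq_len):4d}  "
        f"wasted tokens={wasted:2d}  wasted KiB={wasted * BYTES_PER_TOKEN // KIB:5d}  "
        f"fraction={tail_waste_fraction(seq_len):.2%}"
    )
```

Under these assumptions the waste is bounded by block_size − 1 tokens per sequence regardless of length, so its share of the allocation shrinks as sequences grow.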
References
- kipply, Transformer Inference Arithmetic — the classic source for exactly this per-token math: https://kipp.ly/transformer-inference-arithmetic/
- Ainslie et al., GQA: Training Generalized Multi-Query Transformer Models (2023), on why num_kv_heads shrank industry-wide: https://arxiv.org/abs/2305.13245
- Pope et al., Efficiently Scaling Transformer Inference (2022), which formalizes KV memory as the scaling bottleneck: https://arxiv.org/abs/2211.05102
- Kwon et al., PagedAttention (SOSP 2023) — what happens next to these bytes: https://arxiv.org/abs/2309.06180
- upstream/vllm/v1/core/kv_cache_utils.py::get_kv_cache_configs: your formula, running at every engine startup (Phase 2 lab-03 reads its output).
Lab 00-03 — From Logits to Token: Sampling Basics [CPU-OK]
A language model does not produce words. It produces logits — one raw score per
vocabulary entry, 257 of them in mini_vllm, 128k+ in Llama-3 — and something has to
collapse that scoreboard into the single token the user sees. That something is the
sampler, and in this lab you build it: greedy, temperature, top-k, and top-p (nucleus),
exactly mirroring mini_vllm/sampler.py — the final test literally checks that your
sampler and the engine's agree token-for-token across a grid of configurations.
This lab completes the picture drawn by the other three: lab-01 gives you the loop, lab-02 the memory, lab-04 the speed limits; this one gives you the decision each loop iteration ends with.
Contents
- Why this lab exists
- Background: the knobs, and what they're actually for
- Files
- Run
- What to implement
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Sampling parameters are the most-touched, least-understood interface in all of LLM
serving. Every API request carries them; every "the model got worse" support ticket is
~30% likely to be a sampling change; and every inference engineer eventually debugs an
incident where the answer was "someone set top_k=1 and wondered why outputs got
repetitive." You should know these four knobs the way a DBA knows isolation levels — not
as folklore ("0.7 is creative!") but as the small, exact algorithms they are.
There's an engine-design reason too. The sampler sits at a peculiar spot in the
architecture: it's the only stage that's per-request configurable and stochastic, in
the middle of a pipeline that is otherwise batched and deterministic. Getting determinism
back when you need it (tests! reproducibility! debugging!) takes deliberate design — the
seed parameter, greedy mode — and the entire course's testing strategy (every phase's
"identical output" invariants) leans on the greedy shortcut you'll implement here. When
Phase 9 expands sampling into penalties, logit processors, parallel sampling, and GPU
vectorization, this lab is the kernel of truth it builds on.
Background: the knobs, and what they're actually for
Order matters — this is a pipeline, and each stage reshapes the distribution the next one
sees (your sample must apply them in exactly this order to match the engine):
- Temperature: divide all logits by T before softmax. T<1 sharpens (rich get richer), T>1 flattens (underdogs get a chance), T→0 approaches argmax. It's the only knob that reweights rather than truncates. The T == 0.0 case is special-cased as pure argmax, both because dividing by zero is undefined and because greedy must be exactly deterministic, with no RNG involved at all.
- Top-k: keep the k highest logits, set the rest to −∞ (probability zero after softmax). A blunt truncation: k=1 is greedy-with-extra-steps, k=50 trims the long tail of nonsense tokens. Its weakness: k is fixed while the distribution's actual "width" varies wildly per step (after "The capital of France is" there's one good token; after "My favorite" there are hundreds).
- Top-p (nucleus): keep the smallest set of tokens whose cumulative probability ≥ p. Adaptive where top-k is fixed: confident steps keep few tokens, uncertain steps keep many. The subtle spec detail your implementation must honor: the token that crosses the threshold is included (else p=0.5 over probs [0.4, 0.4, 0.2] would keep only 0.4 < 0.5, an under-full nucleus).
- Softmax + one draw: normalize what survives and draw once with np.random.default_rng(seed). Seeded → reproducible; unseeded → fresh entropy per call.
And the stability clause: softmax must subtract the max before exponentiating.
exp(1000) overflows float64; logits in the hundreds are perfectly normal outputs of an
unnormalized final layer. This one line is the difference between a sampler and a NaN
generator, and the test feeds you logits of 1000+ to make sure it's there.
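The pipeline and the stability clause fit in a few dozen lines of numpy. What follows is one possible sketch written from the description above, not a copy of mini_vllm/sampler.py or of the lab's solution: the function names match the lab's file list, but the signatures, defaults, and tie-handling are assumptions.

```python
import numpy as np

def softmax(logits):
    z = np.asarray(logits, dtype=np.float64)
    z = z - np.max(z)  # stability clause: the largest exponent becomes 0
    e = np.exp(z)
    return e / e.sum()

def apply_top_k(logits, k):
    logits = np.asarray(logits, dtype=np.float64)
    if k <= 0 or k >= logits.size:
        return logits  # disabled: pass through unchanged
    keep = np.argpartition(logits, -k)[-k:]  # indices of the k largest, O(n)
    out = np.full_like(logits, -np.inf)
    out[keep] = logits[keep]
    return out

def apply_top_p(logits, p):
    logits = np.asarray(logits, dtype=np.float64)
    if p >= 1.0:
        return logits  # disabled: pass through unchanged
    probs = softmax(logits)
    order = np.argsort(-probs, kind="stable")  # most probable first
    cum = np.cumsum(probs[order])
    # First position where the running sum reaches p, inclusive of that token.
    cutoff = int(np.searchsorted(cum, p, side="left")) + 1
    keep = order[:cutoff]
    out = np.full_like(logits, -np.inf)
    out[keep] = logits[keep]
    return out

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, seed=None):
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0.0:
        return int(np.argmax(logits))  # greedy: no RNG, other knobs ignored
    logits = logits / temperature
    logits = apply_top_k(logits, top_k)
    logits = apply_top_p(logits, top_p)
    probs = softmax(logits)  # renormalizes over the survivors
    rng = np.random.default_rng(seed)
    return int(rng.choice(probs.size, p=probs))

demo_logits = np.array([2.0, 1.0, 0.5, -1.0])
token = sample(demo_logits, temperature=0.7, top_k=3, top_p=0.9, seed=0)
print("sampled token id:", token)
```

Masking with −∞ at the logit level means the final softmax renormalizes automatically, so no stage needs to know what the earlier stages removed.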
Files
- starter.py: softmax, apply_top_k, apply_top_p, sample, each with its recipe. Your work.
- solution.py: reference (functionally identical to mini_vllm/sampler.py).
- test_lab.py: distribution sanity, each knob's exact semantics, determinism, and the agreement test against the engine's Sampler.
Run
LAB_IMPL=starter pytest phase-00-foundations/labs/lab-03-sampling-basics -q
pytest phase-00-foundations/labs/lab-03-sampling-basics -q # reference (default)
What the tests prove
| Test | What it pins |
|---|---|
| test_softmax_is_a_distribution_and_is_stable | Sums to 1, preserves order, and survives logits of 1000: the max-subtraction clause |
| test_greedy_is_argmax_and_ignores_every_other_knob | temperature=0 short-circuits the whole pipeline: even hostile top_k/top_p/seed settings can't perturb greedy. This guarantee is what every deterministic test in this course stands on |
| test_top_k_keeps_exactly_k | Survivors finite, victims −∞, disabled cases (k≤0, k≥vocab) pass through unchanged |
| test_top_p_keeps_the_smallest_sufficient_nucleus | The inclusive-crossing rule, on a hand-built distribution; the test deliberately avoids sitting on the cumsum boundary, because float rounding flips the answer there (read the comment; it's a lesson in itself) |
| test_temperature_sharpens_or_flattens | T's monotone effect on the max probability |
| test_seeded_sampling_is_reproducible | Same logits + same seed = same token, forever |
| test_agrees_with_mini_vllm_sampler | Your sampler ≡ the engine's sampler across 15 configurations: the equivalence that makes this lab "build the real component," not "build a toy like it" |
Hitchhiker's notes
- −∞ is the correct "impossible," not 0. Masking logits to −∞ (probability exactly 0 after softmax) composes cleanly: later stages renormalize over survivors automatically. Masking probabilities to 0 without renormalizing — a classic homebrew-sampler bug — leaves you sampling from a distribution that sums to 0.7.
- Order of operations is observable. Top-k-then-top-p (this pipeline, and vLLM's) gives different results than top-p-then-top-k for the same parameters. When two engines "with the same settings" produce different output statistics, pipeline order is suspect #2 (suspect #1 is tokenizer differences). The agreement test pins your order to the engine's.
- Why np.partition instead of sorting in top-k? O(n) vs O(n log n) over the vocab, per token, per request. At 128k vocab × thousands of tokens/s this is real money. Production goes further: vLLM's V1 sampler does top-k/top-p vectorized over the whole batch on the GPU (upstream/vllm/v1/sample/), with exactly the semantics you just wrote scalar. Semantics here, performance there: the course's recurring split.
- Ties under greedy: argmax takes the lowest index. Sounds trivial until two engines break ties differently and a "deterministic" comparison fails at token 947. That is the fp16 near-tie problem from Phase 3 lab-02's notes, one layer down. Determinism is a stack of conventions, and you now know one more layer of it.
- seed is per-request state in real engines: vLLM keeps a per-request generator so request A's draws don't perturb request B's stream under batching (Phase 9). Your per-call default_rng(seed) is the single-request simplification; the same idea, one request at a time.
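The order-of-operations note is easy to see on a four-token distribution. The sketch below works directly on probabilities and renormalizes after each truncation, which is equivalent to −∞ masking followed by softmax. The helpers top_k_probs and top_p_probs and the chosen numbers are illustrative assumptions, not the lab's or vLLM's code.

```python
import numpy as np

def top_k_probs(probs, k):
    """Keep the k most probable tokens, then renormalize."""
    keep = np.argsort(-probs, kind="stable")[:k]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_probs(probs, p):
    """Keep the smallest prefix whose cumulative mass reaches p (inclusive)."""
    order = np.argsort(-probs, kind="stable")
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p, side="left")) + 1
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

probs = np.array([0.5, 0.3, 0.1, 0.1])

# Top-k first: {0.5, 0.3} renormalizes to {0.625, 0.375}; 0.625 already
# reaches p=0.6, so the nucleus shrinks to a single token.
k_then_p = top_p_probs(top_k_probs(probs, 2), 0.6)

# Top-p first: 0.5 alone is below 0.6, so two tokens survive; top-k=2 then
# changes nothing.
p_then_k = top_k_probs(top_p_probs(probs, 0.6), 2)

print("top-k then top-p survivors:", np.count_nonzero(k_then_p))
print("top-p then top-k survivors:", np.count_nonzero(p_then_k))
```

With identical parameters, one order makes the step effectively greedy while the other still samples between two tokens, so the difference would show up in output statistics.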
Going further
- Implement min-p (keep tokens with prob ≥ p × max-prob, an increasingly popular alternative that adapts even better than top-p) and write its boundary test. Then check: vLLM ships it (min_p in SamplingParams).
- Sample 10,000 draws at T ∈ {0.3, 1.0, 2.0} from fixed logits and plot the empirical histograms against your computed distributions: a χ² eyeball test of your own sampler, and a visceral feel for what temperature does.
- Read upstream/vllm/v1/sample/sampler.py and find the four stages of your pipeline in their batched form: the same algorithm, where every operation is a tensor op over [batch, vocab] and the special-casing of greedy becomes an index-select.
References
- mini_vllm/sampler.py: the component you just rebuilt; diff yours against it.
- upstream/vllm/v1/sample/sampler.py: the batched GPU version (Phase 9 territory).
- Holtzman et al., The Curious Case of Neural Text Degeneration (2019), the paper that introduced nucleus (top-p) sampling and explains why truncation matters: https://arxiv.org/abs/1904.09751
- vLLM docs, Sampling Parameters — the full production knob set your four generalize into: https://docs.vllm.ai/en/latest/api/inference_params.html
- Phase 9 — penalties, logit processors, structured-output masking (Phase 12), and why sampling lives on the GPU.
Lab 00-04 — Prefill vs Decode: the Roofline Arithmetic [CPU-OK]
Lab-01 showed you that caching splits generation into two phases. This lab shows you the strange physics those phases live under — with six one-line functions and an A100 spec sheet. You'll compute the ridge point (the GPU's FLOP-per-byte ratio), discover that a single decoding sequence uses less than 1% of the GPU's compute, derive the hard speed limit of decode (125 tokens/s for an 8B model on an A100 — no kernel wizardry can beat it), and find the critical batch size where decode finally becomes compute-bound. These four numbers are the worldview of performance engineering; everything in Phase 18 is this lab with profilers attached.
Contents
- Why this lab exists
- Background: the roofline model in five minutes
- Files
- Run
- What to implement
- The numbers, walked through
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
There is a question that separates engineers who tune inference systems from engineers who guess at them: "is this workload compute-bound or memory-bound?" Every optimization belongs to one regime or the other — a faster matmul kernel does nothing for a bandwidth-bound decode; more batching does nothing for a compute-bound prefill — and applying a fix from the wrong regime is the most common way smart people waste a quarter. The roofline model answers the question with division. This lab makes you do the division until it's reflexive.
It also explains, from first principles, the economics you've been absorbing all phase: why batch size is the master lever of throughput (this lab and Phase 1 lab-04), why continuous batching was worth inventing (Phase 3), why chunked prefill mixes the two phases in one step (Phase 3 lab-05, which piggybacks compute-hungry prefill onto bandwidth-starved decode steps), and why speculative decoding (Phase 8) can "spend FLOPs to save time": the FLOPs were idle anyway.
Background: the roofline model in five minutes
Every computation has an arithmetic intensity: FLOPs performed per byte moved from memory. Every processor has a ridge point: peak FLOPs ÷ memory bandwidth. The roofline law is one comparison:
- intensity < ridge → memory-bound: the compute units finish early and wait for the bus. Speed = bandwidth × intensity. More FLOPs won't help; fewer bytes will.
- intensity ≥ ridge → compute-bound: the bus keeps up; you're using the silicon. Speed = peak FLOPs. Fewer bytes won't help; better kernels / more hardware will.
Our minimal model of a transformer step: it must read all weights once (n_params × dtype_bytes from HBM — weights don't fit in cache) and performs ~2 FLOPs per parameter
per token (each weight participates in one multiply-accumulate per token). So:
intensity = 2 · n_params · num_tokens / (n_params · dtype_bytes) = 2 · num_tokens / dtype_bytes
The parameter count cancels. Intensity doesn't care how big the model is — only how many tokens share one trip through the weights. That cancellation is the single most clarifying fact in inference performance: prefill (thousands of tokens per trip) and decode (one token per trip per sequence) aren't two workloads that differ in degree; they're the two opposite ends of the roofline, by construction.
(The model ignores KV/activation traffic and attention FLOPs — see Hitchhiker's notes for when that bites. As a first-order tool it's startlingly accurate.)
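The roofline comparison and the cancellation can be written out directly. This is a sketch under the minimal model above, using the A100 figures this lab works with (312 TFLOPs fp16, 2.0 TB/s). The function names are hypothetical and need not match the six functions in starter.py.

```python
A100_PEAK_FLOPS = 312e12  # fp16, FLOPs per second
A100_BANDWIDTH = 2.0e12   # HBM, bytes per second

def ridge_point(peak_flops, mem_bandwidth):
    return peak_flops / mem_bandwidth  # FLOPs per byte

def arithmetic_intensity(n_params, num_tokens, dtype_bytes):
    flops = 2 * n_params * num_tokens    # ~2 FLOPs per parameter per token
    bytes_moved = n_params * dtype_bytes # every weight read once per step
    return flops / bytes_moved           # n_params cancels

def regime(intensity, ridge):
    return "compute-bound" if intensity >= ridge else "memory-bound"

def attainable_flops(intensity, peak_flops, mem_bandwidth):
    # The roofline itself: the lower of the two ceilings.
    return min(peak_flops, mem_bandwidth * intensity)

ridge = ridge_point(A100_PEAK_FLOPS, A100_BANDWIDTH)
for n_params in (8e9, 70e9):
    for num_tokens in (1, 2048):
        i = arithmetic_intensity(n_params, num_tokens, dtype_bytes=2)
        used = attainable_flops(i, A100_PEAK_FLOPS, A100_BANDWIDTH) / A100_PEAK_FLOPS
        print(f"{n_params / 1e9:.0f}B params, {num_tokens:4d} tokens/step: "
              f"intensity={i:.0f}, {regime(i, ridge)}, compute used={used:.1%}")
```

The 8B and 70B rows print the same intensity for the same token count, which is the cancellation: only tokens per weight-trip moves a workload along the roofline.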
Files
- starter.py: six functions, each ~one line, each a load-bearing concept. Your work.
- solution.py: reference.
- test_lab.py: A100 numbers, the cancellation, both regimes, the crossover, and the decode speed limit.
Run
LAB_IMPL=starter pytest phase-00-foundations/labs/lab-04-prefill-vs-decode -q
pytest phase-00-foundations/labs/lab-04-prefill-vs-decode -q # reference (default)
The numbers, walked through
A100-80GB SXM: 312 TFLOPs fp16, 2.0 TB/s HBM. 8B model, fp16 (16 GB of weights).
ridge : 312e12 / 2.0e12 = 156 FLOPs/byte
decode, batch 1 : intensity = 2·1/2 = 1 → 1 ≪ 156: bandwidth-bound, using 1/156 ≈ 0.6% of compute
prefill, 2048 tokens : intensity = 2048 → ≫ 156: compute-bound
crossover : 156 tokens/step → the "critical batch size"
speed limit: 2.0e12 / 16e9 = 125 steps/s → 125 tok/s per sequence, max, ever
Read them like an engineer:
- 0.6% compute utilization for single-stream decode. The other 99.4% of a $15k GPU is structurally idle — not because of bad kernels, but because one token per weight-trip is all the arithmetic the workload offers. This is the number that makes batching non-optional: every additional sequence in the decode batch reuses the same weight bytes, adding intensity ~1 per sequence, for free until you hit the ridge at ~156. Batch 64 → 64× the tokens/s at essentially the same step time. That free lunch is the entire economic basis of serving (and of this course's obsession with fitting more sequences in memory — lab-02 — since memory is what caps the batch).
- 125 tokens/s is a physics ceiling, not an engineering one: a decode step cannot complete faster than the weights can stream from HBM. Measure any well-tuned 8B/fp16 deployment and single-stream decode sits at 60–80% of this bound (the rest: KV reads, kernel overheads). When someone promises 500 tok/s single-stream on this hardware, they're describing quantization (fewer bytes — note dtype_bytes in your formula), speculation (Phase 8), or fiction.
- Prefill at 2048 is compute-bound — which is why TTFT responds to better kernels and FlashAttention (Phase 4), while ITL responds to memory bandwidth and quantization. Two metrics, two regimes, two completely disjoint optimization menus. Now you know why Phase 3's chunked prefill mixes the phases in one batch: decode steps have idle FLOPs; prefill chunks are pure FLOPs; together they fill the roofline from both sides (Sarathi's "piggybacking", which you measured as [33, 33, …] in Phase 3 lab-05).
- The crossover at 156 is worth memorizing as a shape, not a number: it moves with hardware (H100: ~295 fp16; consumer cards: lower) and dtype. "Decode needs ~ridge-many tokens per step to saturate compute" is the portable version.
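The batching "free lunch" and its end at the ridge can be sketched by taking a step's time as the slower of the weight stream and the arithmetic. The function names and default arguments below are assumptions for illustration (8B parameters, fp16, the A100 figures from this lab), and the model still ignores KV and activation traffic.

```python
def decode_step_seconds(batch, n_params=8e9, dtype_bytes=2,
                        peak_flops=312e12, bandwidth=2.0e12):
    """One decode step under the weights-only roofline model."""
    stream_time = n_params * dtype_bytes / bandwidth  # read every weight once
    math_time = 2.0 * n_params * batch / peak_flops   # 2 FLOPs per param per token
    return max(stream_time, math_time)


def roofline_decode_tok_s(batch, **hardware):
    """Aggregate tokens/s: one token per sequence per step."""
    return batch / decode_step_seconds(batch, **hardware)


for batch in (1, 8, 64, 156, 512):
    step_ms = decode_step_seconds(batch) * 1e3
    print(f"batch {batch:>3}: {step_ms:6.2f} ms/step, "
          f"{roofline_decode_tok_s(batch):8.0f} tok/s")
```

Below the ridge the step time stays flat, so throughput scales with the batch. Past it the arithmetic term takes over and aggregate throughput plateaus at peak_flops / (2 · n_params), while every sequence's per-step latency starts to grow.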
What the tests prove
| Test | What it pins |
|---|---|
| test_a100_ridge_point | 156.0 — the one hardware constant of this lab |
| test_intensity_is_just_tokens_over_dtype | The cancellation: 8B and 70B give identical intensity. If this surprises you, reread the Background — it's the lab's central fact |
| test_single_decode_is_hopelessly_bandwidth_bound | Intensity 1 vs ridge 156 |
| test_prefill_is_compute_bound | Intensity 2048 vs ridge 156 — same model, opposite regime |
| test_critical_batch_size_is_the_ridge | 155 tokens: memory-bound; 156: compute-bound. The crossover is exactly the ridge, because intensity (fp16) = tokens |
| test_decode_speed_limit_8b_fp16 | 125 tok/s — bandwidth over weight bytes, the unbeatable ceiling |
| test_batching_multiplies_decode_throughput_for_free | Batch 64 → 8000 tok/s from the same weight stream — the free lunch, quantified |
Hitchhiker's notes
- Where the weights-only model bends: at long contexts, KV reads become the dominant bytes (128 KiB/token from lab-02 × thousands of tokens × batch — eventually exceeding the 16 GB weight read!). That's why long-context decode gets slower per token even though "the model is the same size," why GQA/MLA exist (shrink KV bytes), and why Phase 18 extends this lab's model with a KV-traffic term. First-order tools, knowingly applied, then refined — that's the discipline.
- Why ~2 FLOPs per param per token? Each parameter sits in some matrix; processing a token multiplies it by one activation and adds into an accumulator — one FMA = 2 FLOPs. Attention adds FLOPs quadratic in sequence length on top (it's parameter-free, so it escapes this accounting); for short-to-moderate contexts the linear layers dominate and 2·N·T is a good model. The famous training version of the same estimate is 6·N·T (forward + backward); inference keeps only the 2.
- Quantization through this lens: INT4 weights = 4 GB to stream = 500 steps/s ceiling, 4× decode speedup with zero kernel cleverness — bandwidth-bound workloads reward byte-shrinking one-for-one. But the same INT4 does ~nothing for compute-bound prefill (the FLOPs still happen, often in fp16 after dequant). One optimization, two regimes, two completely different value propositions — Phase 6 lives here.
- The ridge explains CUDA graphs too (Phase 5): your 125 steps/s ceiling means a decode step takes ≥ 8 ms on this 8B model (16 GB streamed at 2 TB/s) — at that scale a few hundred microseconds of kernel-launch overhead is noise. Shrink the model to 1B and steps drop toward 1 ms; suddenly launch overhead is a first-order cost, and capturing the whole step as one CUDA graph pays for itself. The roofline tells you when overhead optimizations matter, too.
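Two of the notes above reduce to the same division, sketched here. The helper names are hypothetical, the 300 µs overhead is an assumed stand-in for "a few hundred microseconds", and quantized formats are treated as pure byte counts (real INT4 checkpoints also carry scales and zero-points, which this ignores).

```python
A100_BANDWIDTH = 2.0e12  # bytes/s of HBM bandwidth, the A100 figure used in this lab


def weight_stream_seconds(n_params, bytes_per_param, bandwidth=A100_BANDWIDTH):
    """Lower bound on one decode step: time to stream every weight once."""
    return n_params * bytes_per_param / bandwidth


def overhead_share(step_seconds, overhead_seconds):
    """Fraction of wall time spent on a fixed per-step overhead."""
    return overhead_seconds / (step_seconds + overhead_seconds)


# Quantization: fewer bytes per weight raises the decode ceiling one-for-one.
for label, bytes_per_param in (("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)):
    t = weight_stream_seconds(8e9, bytes_per_param)
    print(f"8B {label}: {t * 1e3:4.1f} ms/step, ceiling {1 / t:4.0f} steps/s")

# Launch overhead: noise on a large model, first-order on a small one.
LAUNCH_OVERHEAD = 300e-6  # seconds, assumed for illustration
for label, n_params in (("8B", 8e9), ("1B", 1e9)):
    t = weight_stream_seconds(n_params, 2.0)
    print(f"{label} fp16: overhead is {overhead_share(t, LAUNCH_OVERHEAD):.0%} of the step")
```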
Going further
- Recompute everything for an H100 SXM (~990 TFLOPs fp16, ~3.35 TB/s): ridge ≈ 295, 8B decode ceiling ≈ 209 tok/s. Notice the ridge rose — new GPUs gain FLOPs faster than bandwidth, so decode gets relatively more memory-bound every generation. That trend line is why KV/weight compression research keeps accelerating.
- Add a kv_bytes_per_step(batch, context_len, kv_per_token) term (from lab-02) to decode_tokens_per_second and find the context length where KV traffic overtakes the weight read for batch 64. You've just derived the long-context wall.
- Plot the roofline: log-log, intensity on x, achievable FLOPs on y, the two plateaus, and drop points for decode batch {1, 8, 64, 156, 512} and prefill {128, 2048}. This single figure is the mental map for all of Phase 18 — draw it once by hand.
References
- Williams et al., Roofline: An Insightful Visual Performance Model (CACM 2009) — the original: https://dl.acm.org/doi/10.1145/1498765.1498785
- kipply, Transformer Inference Arithmetic — this lab's model applied end-to-end, the single best blog post in the field: https://kipp.ly/transformer-inference-arithmetic/
- Pope et al., Efficiently Scaling Transformer Inference (2022) — the rigorous version, including the KV-traffic terms: https://arxiv.org/abs/2211.05102
- Chen, Dissecting Batching Effects in GPT Inference (2023) — measured curves of the batch-size free lunch: https://le.qun.ch/en/blog/2023/05/13/transformer-batching/
- NVIDIA A100/H100 datasheets — where the peak-FLOPs and bandwidth constants come from (always check whether a quoted TFLOPs number assumes sparsity; marketing does).
- Phase 18 — this lab, with nsys/ncu attached and the simplifications removed.
Phase 00 — Exercises: Foundations
Contents
- Warm-up (explain)
- Core (the distinctions that matter)
- Build (your labs)
- Design (staff-level)
- Self-grading
Warm-up (explain)
- In one sentence: what does an LLM compute, and what is "autoregressive generation"?
- Define tokens, embeddings, logits. Where in a forward pass do logits appear?
- Why does a token's K and V never change once computed? Why does that justify a cache?
Core (the distinctions that matter)
- Fill the table from memory: prefill vs decode — tokens/pass, bottleneck (compute vs memory bandwidth), and which latency metric (TTFT vs ITL) each drives.
- Explain why decode is memory-bandwidth-bound. What must the GPU read to produce one token, and how much math does it do with it?
- Why does batching help throughput specifically during decode? (Hint: what gets amortized?)
Build (your labs)
- In lab-01, derive the exact no-cache work sum(P..P+n-1) and the cached work P+n. What's the ratio as n → ∞ for fixed P?
- In lab-02, compute kv_bytes_per_token and max concurrency for a model of your choice (look up its config: layers, kv_heads, head_dim). Then redo it with fp8 KV cache.
- A model uses MHA (num_kv_heads == num_query_heads). Show how switching to GQA with 8 KV heads changes KV memory and thus concurrency.
Design (staff-level)
- You must serve a 70B model at 8k context with TTFT < 1s and ITL < 50ms on 8×A100 (80GB). Estimate KV memory per sequence and reason about how many concurrent users fit. What's the first thing you'd do to fit more?
- A teammate says "let's just recompute attention each step, it's simpler." Quantify what that costs for a 2000-token generation and explain why it's a non-starter.
- Using Little's Law (concurrency = throughput × latency), if you target 1000 tok/s aggregate at 50ms ITL, how many sequences must be in flight? What limits that number?
Self-grading
4, 5, 10–12 are interview-grade. Could you whiteboard each in 5 minutes? If not, re-read the guide's prefill/decode and memory sections, then drill INTERVIEW.md.
Phase 00 — Interview Questions: Foundations
Cover the answer, attempt out loud, compare. These fundamentals gate everything else — if you fumble them, the interviewer won't trust your scheduler answers.
Q1. Why is autoregressive decoding so much slower per token than prefill?
Model answer
Decode produces one token per step but must still read the entire model weights and the whole KV cache from HBM each step, while doing only one token's worth of math — terrible arithmetic intensity, so it's memory-bandwidth-bound and the GPU's compute sits idle. Prefill amortizes the same weight read over all prompt tokens at once, so it's compute-bound and far more efficient per token. Same kernels, opposite bottlenecks.
Q2. What is the KV cache and why does it dominate serving memory?
Model answer
It stores the Key and Value vectors of every prior token so attention need not recompute them
(they never change). Without it, every step reprocesses the whole sequence, so generating N tokens
costs O(N²) token-passes through the model; with it, O(N). Its size is
2 × layers × kv_heads × head_dim × dtype_bytes per token and it grows linearly with batch size
and sequence length, so at scale it dwarfs the weights and caps how many concurrent requests fit.
For Llama-3-8B that's ~128 KiB/token; a few thousand tokens × a few dozen users fills tens of GB.
Q3. Walk me through prefill vs decode.
Model answer
Prefill is the first pass over the whole prompt: many tokens, one pass, compute-bound, fills the
prompt's KV cache, determines TTFT. Decode is every subsequent single-token step: one token,
memory-bandwidth-bound (read all weights + KV), determines ITL/TPOT. The scheduler treats both
uniformly as "advance num_computed_tokens toward num_tokens," which is why chunked prefill and
continuous batching fall out naturally (Phase 3).
Q4. How would you estimate KV-cache memory for a deployment?
Model answer
kv_bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × dtype_bytes; multiply by max
sequence length for per-sequence bytes; concurrent capacity ≈ (HBM − weights) / per-sequence
bytes. Watch for GQA (kv_heads ≪ query_heads shrinks it), fp8 KV cache (halves dtype_bytes), and
that real engines reserve some HBM for activations and CUDA-graph buffers, so usable KV is a bit
less than the naive free figure.
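The estimate in this answer, as a minimal sketch. The function names and the 10% reserve for activations and CUDA-graph buffers are assumptions for illustration; the Llama-3-8B shape (32 layers, 8 KV heads, head_dim 128) reproduces the ~128 KiB/token figure quoted above.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """K and V (the leading 2) for every layer and KV head of one token."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_sequences(hbm_bytes, weight_bytes, seq_len,
                             per_token_bytes, reserve_fraction=0.10):
    """Sequences of length seq_len that fit in the HBM left after weights.

    reserve_fraction is an assumed allowance for activations and buffers.
    """
    usable = (hbm_bytes - weight_bytes) * (1.0 - reserve_fraction)
    return int(usable // (seq_len * per_token_bytes))


fp16_kv = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
fp8_kv = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=1)
mha_kv = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2)

print(f"GQA fp16: {fp16_kv // 1024} KiB/token")
print(f"GQA fp8 : {fp8_kv // 1024} KiB/token")
print(f"MHA fp16: {mha_kv // 1024} KiB/token")

# 80 GB card, 16 GB of fp16 weights, 8k-token sequences.
for label, per_token in (("fp16 KV", fp16_kv), ("fp8 KV", fp8_kv)):
    n = max_concurrent_sequences(80e9, 16e9, seq_len=8192, per_token_bytes=per_token)
    print(f"{label}: about {n} concurrent 8k-token sequences")
```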
Q5. Why does batching improve throughput, and what's the cost?
Model answer
In decode, reading the model weights from HBM is the dominant cost and is shared across a batch — so processing B sequences together costs barely more than one, multiplying throughput. The cost is latency: each step does more work, and (via Little's Law) higher concurrency means each request waits longer. The scheduler navigates this; Phase 18 tunes it.
Rapid-fire
- Tokens are roughly? ~¾ of a word; integer ids from a tokenizer.
- Logits are? Pre-softmax scores over the whole vocabulary for the next token.
- Decode bottleneck? Memory bandwidth. Prefill bottleneck? Compute.
- TTFT driven by? Prefill. ITL driven by? Decode.
- KV bytes/token formula? 2 × layers × kv_heads × head_dim × dtype_bytes.
- The engine's master variables? num_computed_tokens chasing num_tokens.
Phase 00 — Cheatsheet: Foundations
Contents
- The one-liner
- The loop
- Prefill vs decode
- KV cache
- The master model
- Throughput vs latency
- Key upstream
The one-liner
An LLM predicts the next token; generation loops that. Serving = doing it fast for many users. Memory (the KV cache), not compute, is the cap.
The loop
tokenize → (prefill the prompt) → loop[ forward → sample → append ] → detokenize.
Real: EngineCore.step = schedule → execute → sample → update (core.py:428).
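The loop, as a toy you can run. Everything here is a stand-in invented for illustration: a seven-word vocabulary, a "model" that always prefers the next vocabulary id, and greedy sampling. Only the control flow (prefill once, then forward → sample → append until EOS or the token budget) mirrors the real thing.

```python
VOCAB = ["<eos>", "why", "did", "the", "function", "return", "early"]
TOKEN_ID = {word: i for i, word in enumerate(VOCAB)}
EOS = TOKEN_ID["<eos>"]


def tokenize(text):
    return [TOKEN_ID[word] for word in text.lower().split()]


def detokenize(token_ids):
    return " ".join(VOCAB[i] for i in token_ids)


def forward(token_ids):
    """Toy model: logits that favor the id right after the last token."""
    favored = (token_ids[-1] + 1) % len(VOCAB)
    return [1.0 if i == favored else 0.0 for i in range(len(VOCAB))]


def sample(logits):
    """Greedy sampling: pick the highest-scoring token id."""
    return max(range(len(logits)), key=logits.__getitem__)


def generate(prompt, max_tokens=16):
    ids = tokenize(prompt)
    prompt_len = len(ids)
    logits = forward(ids)  # prefill: one pass over the whole prompt
    while len(ids) - prompt_len < max_tokens:
        token = sample(logits)
        if token == EOS:
            break
        ids.append(token)
        # Decode step. A real engine computes only the new token here
        # and reads earlier K/V from the cache.
        logits = forward(ids)
    return detokenize(ids[prompt_len:])


print(generate("why did"))
```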
Prefill vs decode
| | prefill | decode |
|---|---|---|
| tokens/pass | many | one |
| bound by | compute (FLOPs) | memory bandwidth |
| latency | TTFT | ITL/TPOT |
| fills | prompt KV | one KV/step |
KV cache
- Exists because K/V never change once computed → cache them → O(N²) work becomes O(N).
- kv_bytes_per_token = 2 × layers × kv_heads × head_dim × dtype_bytes.
- Llama-3-8B fp16 ≈ 128 KiB/token. Concurrency ≈ (HBM − weights) / (per_token × seq_len).
- Shrink it: GQA (fewer kv_heads), fp8 KV (half dtype), shorter context, paging (Phase 2).
The master model
A request = num_computed_tokens racing num_tokens. Prefill = far behind; decode = one behind.
(vllm/v1/request.py:239; mirrored in mini_vllm/request.py.)
Throughput vs latency
Bigger batch → more throughput (amortize weight reads), worse per-request latency. Little's Law: concurrency = throughput × latency. The scheduler (Phase 3) and tuning (Phase 18) live here.
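Little's Law with units attached, as a tiny sketch (the function names and the example numbers are illustrative assumptions): throughput in tokens per second times ITL in seconds per token gives the number of sequences that must be decoding at once.

```python
def sequences_in_flight(aggregate_tok_s, itl_seconds):
    """Little's Law: concurrency = throughput * latency."""
    return aggregate_tok_s * itl_seconds


def per_sequence_tok_s(itl_seconds):
    """Each sequence emits one token per inter-token latency."""
    return 1.0 / itl_seconds


target_tok_s, itl = 4000.0, 0.025  # 4000 tok/s aggregate at 25 ms ITL
n = sequences_in_flight(target_tok_s, itl)
print(f"{n:.0f} sequences, each at {per_sequence_tok_s(itl):.0f} tok/s")
```

Whether that many sequences fit is a KV-memory question, which is why the concurrency formula in the KV cache section above is the other half of this calculation.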
Key upstream
- vllm/model_executor/models/llama.py — a real forward pass (Q/K/V at LlamaAttention.forward)
- vllm/v1/request.py:239 — the counters
- vllm/v1/engine/core.py:428 — EngineCore.step
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md
Phase 01 — The Hitchhiker's Guide to vLLM's Architecture & Request Lifecycle
This is Chapter 1. Phase 0 taught you what one forward pass is and why it's slow. This chapter zooms out to the whole machine: how vLLM turns that single forward pass into a service that streams tokens to thousands of users at once. We build the architecture the way you'd design it yourself — by starting with the obvious naive server, watching it fail, and fixing each failure. By the end you'll be able to trace any request from an HTTP call to a streamed token and name every component it touches. That mental map is what lets you navigate 500,000 lines of code without drowning.
How to read this chapter. Everyday explanation throughout; paragraphs marked 🔬 Going deeper add the systems rigor for the expert track.
Contents
- 1.1 Don't Panic — the architecture in one breath
- 1.2 Let's design it ourselves: why the naive server fails
- 1.3 Two front doors, one engine
- 1.4 The objects a request becomes (and why each exists)
- 1.5 The heartbeat dissected: EngineCore.step()
- 1.6 The process architecture: why the engine lives alone
- 1.7 Who actually touches the GPU: Executor → Worker → ModelRunner
- 1.8 The request lifecycle: a state machine
- 1.9 How thousands of requests share the loop (the payoff)
- 1.10 Tracing one request, end to end
- 1.11 The mental model to carry forward
- 1.12 What you'll do in this phase
1.1 Don't Panic — the architecture in one breath
A request enters as a string and leaves as tokens. In between it passes through a handful of well-named components, and a tiny loop runs the model over and over until the request is done.
vLLM looks enormous, but the path a request takes is short:
"Tell me a joke"
│ tokenize, wrap as a request
▼
LLM / AsyncLLM ← the front door (offline batch / online server)
│ add_request
▼
EngineCore.step() ──────────────── the heartbeat (runs every ~10–50 ms) ───────────┐
│ 1. schedule() who runs this step, and how many tokens (Ph 3) │
│ 2. execute_model() run the model on the assembled batch (Ph 4+) │ loop
│ 3. sample_tokens() pick the next token for each sequence (Ph 9)│ until
│ 4. update_from_output() advance counters, retire finished reqs (Ph 3)│ done
▼ │
Detokenizer / OutputProcessor ← token IDs → text, streamed back ───────────────────┘
│
▼
" Why did the function..." (streamed token by token)
That five-line step() loop is vLLM. Every later phase is a deep dive into one box of it. The
rest of this chapter explains why it's shaped this way and what each piece does.
1.2 Let's design it ourselves: why the naive server fails
The fastest way to understand vLLM's architecture is to build the obvious version in your head and watch it break. Each break motivates a real component.
Attempt 1 — a function call. "Just call model.generate(prompt) per request." This works for
one user. But it serves requests one at a time: while user A's 500-token answer generates, users
B–Z wait. And from Phase 0 §0.10, a single decode stream uses ~1% of the GPU (it's memory-bound at
batch 1). You're paying for a Ferrari and driving it in a parking lot. → We must run many requests
together (batching).
Attempt 2 — static batching. "Collect N requests, run them as a batch until all finish." Better GPU use, but two new problems:
- Requests have different lengths. A batch finishes at the speed of its slowest member; short requests sit idle in the batch, wasting their slot. (We'll fix this with continuous batching — re-decide the batch every step — in Phase 3.)
- New requests that arrive mid-batch must wait for the whole batch to finish before they can start. Terrible tail latency. → We need a component that re-plans the batch every single step: the scheduler.
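The cost of static batching can be counted in engine steps. The sketch below is a deliberately crude model with hypothetical function names: each request is just a number of decode steps, all requests are queued up front, and prefill and KV-memory limits are ignored.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest member finishes."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        total += max(lengths[start:start + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Re-plan every step: a freed slot is refilled on the next tick."""
    waiting = deque(lengths)
    running = []  # remaining steps for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one batched forward pass
        running = [left - 1 for left in running if left > 1]
    return steps


lengths = [2, 8, 3, 5]  # output lengths of four queued requests
print("static    :", static_batching_steps(lengths, batch_size=2), "steps")
print("continuous:", continuous_batching_steps(lengths, batch_size=2), "steps")
```

With two slots and 18 token-steps of total work, no schedule can finish in fewer than 9 steps; re-planning every step gets within one step of that bound, while static batching spends the difference on idle slots.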
Attempt 3 — scheduler + model, in one Python process, inside the web server. Now the GPU loop and the HTTP server share a process. Problems:
- The tight, latency-critical GPU loop competes with HTTP parsing, JSON serialization, and detokenization for the single Python thread (the GIL). A burst of requests stalls the GPU.
- Multi-GPU (Phase 10) needs multiple processes anyway. → Isolate the engine in its own process, talk to it over a queue. That's vLLM's V1 design.
So the architecture isn't arbitrary — each component is the answer to a specific failure of the naive version. Now let's name the real pieces.
🆕 New words: batching (run many requests together), static vs continuous batching, scheduler (re-plans the batch each step), the GIL (Python's single-thread lock — why the engine gets its own process).
1.3 Two front doors, one engine
vLLM has two entry points, and the crucial insight is that both are thin shells over the same engine core:
- Offline / batch: LLM(model=...).generate(prompts) — vllm/entrypoints/llm.py. You hand it a list of prompts; it returns a list of results when all are done. Synchronous. This is what mini_vllm's LLMEngine.generate mirrors, and what you use in scripts and evals.
- Online / serving: an HTTP server (OpenAI-compatible, Phase 16) → AsyncLLM (vllm/v1/engine/async_llm.py) → the same core, but async and streaming — it yields each token as it's produced so the user sees text appear live.
Both funnel into EngineCore (vllm/v1/engine/core.py). Internalize this: batch and server
are skins; the engine is one. When you fix something in the core, you fix it for both.
1.4 The objects a request becomes (and why each exists)
A request changes form as it travels — and each form is a deliberate data type. Knowing them means that when you read a stack trace, you instantly know which stage you're in by the type in hand.
| Object | Lives between | Carries | Why it exists |
|---|---|---|---|
| prompt + SamplingParams | user → server | the text + decoding knobs (temperature, max_tokens, n, stop) | the user's intent |
| EngineCoreRequest | input proc → core | tokenized prompt + params + a request id | a serializable unit to cross the process boundary |
| Request | inside the scheduler | the live request: token ids, num_computed_tokens / num_tokens, status, block table | the engine's working state (Phase 0's two counters!) |
| SchedulerOutput | scheduler → executor | who runs, how many tokens each, block tables, etc. | the per-step plan |
| ModelRunnerOutput | executor → core | sampled token ids, logprobs | the model's result |
| RequestOutput | core → user | generated text/tokens (a delta, when streaming) | what the caller receives |
🔬 Going deeper. The split between EngineCoreRequest (crosses the process boundary, so it's a plain serializable struct) and Request (rich, mutable, lives only inside the engine process) is not incidental — it's the seam where the IPC boundary sits (§1.6). And RequestOutput being a delta in streaming mode (only the new tokens since last time) is what makes server-sent-events streaming cheap. Naming is half of understanding a system; learn these six.
🆕 New words: SamplingParams, EngineCoreRequest, Request, SchedulerOutput, ModelRunnerOutput, RequestOutput.
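The seam between the wire-format object and the live one can be sketched with two dataclasses. These are hypothetical Toy* types, not vLLM's definitions, and JSON is only an assumed stand-in for whatever serialization the real IPC layer uses; the point is the shape: a small immutable message crosses the boundary, and the receiver rebuilds a mutable object that carries the two counters.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ToyEngineCoreRequest:
    """Plain data only, so it can be serialized across a process boundary."""
    request_id: str
    prompt_token_ids: list
    max_tokens: int


@dataclass
class ToyRequest:
    """Rich, mutable working state; lives only inside the engine."""
    request_id: str
    token_ids: list
    max_tokens: int
    num_computed_tokens: int = 0
    status: str = "WAITING"
    block_ids: list = field(default_factory=list)

    @property
    def num_tokens(self):
        return len(self.token_ids)

    @classmethod
    def from_wire(cls, message):
        return cls(message.request_id, list(message.prompt_token_ids), message.max_tokens)


# Front-end side: build the message and serialize it.
wire_bytes = json.dumps(asdict(ToyEngineCoreRequest("req-0", [11, 22, 33, 44, 55], 8))).encode()

# Engine side: deserialize, then wrap in the live object.
request = ToyRequest.from_wire(ToyEngineCoreRequest(**json.loads(wire_bytes)))
print(request.status, request.num_computed_tokens, request.num_tokens)
```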
1.5 The heartbeat dissected: EngineCore.step()
The engine is a loop. Each tick (step()) advances every in-flight request by some tokens. Here
is the loop with each stage explained — this is the spine of the whole system:
def step(self):
    scheduler_output = self.scheduler.schedule()              # 1. PLAN
    model_output = self.model_executor.execute_model(...)     # 2. RUN
    # (sampling happens inside/after execute; shown separate for clarity)
    sampled = self.model_executor.sample_tokens(...)          # 3. PICK
    outputs = self.scheduler.update_from_output(...)          # 4. BOOKKEEP
    return outputs
- Schedule (the plan) — the scheduler looks at every waiting and running request and the free KV memory, and decides: who runs this step, and how many tokens does each get? This is where continuous batching, chunked prefill, prefix caching, and preemption happen (Phases 2–3). Output: a SchedulerOutput.
- Execute (run the model) — the executor turns that plan into actual tensors (gather the scheduled tokens, build the attention metadata — block tables and sequence lengths from Phases 2–4) and runs the forward pass on the GPU (possibly as a CUDA graph, Phase 5). This is where kernels, quantization, parallelism, and the model itself live (Phases 4–7, 10, 13, 14).
- Sample (pick tokens) — turn the model's logits into one new token per sequence, applying each request's own sampling params, grammar masks, etc. (Phases 8, 9, 12).
- Bookkeep (update) — append the sampled tokens, advance each request's num_computed_tokens, detect which requests just finished (hit EOS or max length), free their KV blocks, and emit outputs (Phase 3).
Then it loops. A request might be touched by a few hundred ticks over its lifetime (one per output token, after prefill). Every box of this loop maps to a phase of the course — keep this diagram open as your table of contents.
🔬 Going deeper — the real step is even leaner. In core.py the four stages are visible almost verbatim (you'll read them in the deep-dive). Two production wrinkles: (a) execute_model can run asynchronously (return a future) so the scheduler can plan the next step while the GPU works on this one — overlapping CPU and GPU; (b) a grammar bitmask for structured output (Phase 12) is computed between schedule and sample. Don't let those obscure the four-beat rhythm: plan → run → pick → bookkeep.
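The four-beat rhythm can be exercised end to end with a toy engine. This is a sketch under heavy assumptions: ToyEngine and ToyRequest are hypothetical, the "forward pass" is a sum of token ids, "sampling" is a modulo, and there is no memory limit, EOS, or chunking. Only the method names and their order follow the loop described above.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    token_ids: list           # prompt tokens, then generated tokens
    num_prompt_tokens: int
    max_new_tokens: int
    num_computed_tokens: int = 0
    finished: bool = False


class ToyEngine:
    def __init__(self):
        self.requests = {}

    def add_request(self, request_id, prompt_ids, max_new_tokens):
        self.requests[request_id] = ToyRequest(list(prompt_ids), len(prompt_ids), max_new_tokens)

    def schedule(self):
        # PLAN: request id -> number of not-yet-computed tokens to run.
        return {rid: len(r.token_ids) - r.num_computed_tokens
                for rid, r in self.requests.items() if not r.finished}

    def execute_model(self, plan):
        # RUN: stand-in for the forward pass, one value per scheduled request.
        return {rid: sum(self.requests[rid].token_ids) for rid in plan}

    def sample_tokens(self, model_output):
        # PICK: stand-in for sampling, deterministic on purpose.
        return {rid: value % 100 for rid, value in model_output.items()}

    def update_from_output(self, plan, sampled):
        # BOOKKEEP: advance counters, append tokens, retire finished requests.
        for rid, num_scheduled in plan.items():
            req = self.requests[rid]
            req.num_computed_tokens += num_scheduled
            req.token_ids.append(sampled[rid])
            generated = len(req.token_ids) - req.num_prompt_tokens
            req.finished = generated >= req.max_new_tokens
        return {rid: sampled[rid] for rid in plan}

    def step(self):
        plan = self.schedule()
        if not plan:
            return {}
        return self.update_from_output(plan, self.sample_tokens(self.execute_model(plan)))


engine = ToyEngine()
engine.add_request("a", [1, 2, 3], max_new_tokens=2)
engine.add_request("b", [5], max_new_tokens=3)
history = []
while (outputs := engine.step()):
    history.append(outputs)
print(history)
```

Request "a" drops out of the plan one step before "b" does; nothing in `step()` needed to know that in advance.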
1.6 The process architecture: why the engine lives alone
From §1.2, the engine must not share a Python thread with the web server. So V1 runs EngineCore
in its own process (EngineCoreProc). The picture:
┌─────────────── API server process ───────────────┐ ┌──── EngineCore process ────┐
│ HTTP / OpenAI endpoints (Phase 16) │ │ scheduler │
│ tokenization, request validation │ IPC │ the model + KV cache │
│ AsyncLLM ── EngineCoreRequest ──────────────────┼───────▶│ step() loop │
│ detokenization, streaming ◀── EngineCoreOutputs─┼────────┤ │
└───────────────────────────────────────────────────┘ └────────────────────────────┘
Why this split is worth a whole process boundary:
- The scheduling loop stays tight — no HTTP work, JSON, or detokenization steals its thread or contends for the GIL. The GPU is never starved by web-server bookkeeping.
- Detokenization and streaming run on the server side, off the engine's hot path — turning token IDs back into text and formatting SSE chunks happens in parallel with the next step().
- It generalizes to multi-GPU: the core process becomes the coordinator of worker processes (next section).
The cost is that requests and outputs must be serialized across the boundary (that's why
EngineCoreRequest/EngineCoreOutputs are plain structs, §1.4). It's a price worth paying for an
uninterrupted GPU loop.
🆕 New words: IPC (inter-process communication), EngineCoreProc (the engine's own process), SSE (server-sent events — the streaming protocol).
1.7 Who actually touches the GPU: Executor → Worker → ModelRunner
EngineCore decides what to run; it does not run the model itself. That's delegated down a chain
whose whole purpose is to make the same engine run on 1 GPU or 64:
EngineCore
└─ Executor (vllm/v1/executor/) owns the worker(s); the engine's handle to compute
└─ Worker (vllm/v1/worker/gpu_worker.py) one per GPU: holds a model shard + its KV cache
└─ ModelRunner (gpu_model_runner.py) SchedulerOutput → input tensors → forward → sampler
- Executor — for a single GPU it's a UniProcExecutor (just calls the one worker). For tensor/pipeline parallelism (Phase 10) it's a MultiprocExecutor that owns N worker processes and broadcasts each step's plan to all of them.
- Worker — owns one GPU: its device, its slice of the model's weights, and its slice of the KV cache. Runs in lockstep with its peers.
- ModelRunner — the busiest object in the engine. It takes the SchedulerOutput, prepares the input tensors (gathers the scheduled tokens, builds the attention metadata: block tables + sequence lengths + slot mapping — Phases 2/4), runs the (possibly CUDA-graphed) forward pass, and runs the sampler. You'll return to gpu_model_runner.py in Phases 4, 5, 9, 13.
The elegance: the model code is identical whether you run on 1 GPU or 64 — it just uses parallel layers, and the Executor fans the work out. Scaling out changes the Executor, nothing above it.
🔬 Going deeper. This is also where the prepare-inputs cost lives — assembling ragged, variable-length batches into padded tensors and metadata every step is real CPU work, and at small batch it can rival the GPU time. That's a major reason CUDA graphs (Phase 5) and careful tensor reuse matter, and why gpu_model_runner.py is so heavily optimized. When you profile a slow deployment (Phase 18), this file is a frequent suspect.
1.8 The request lifecycle: a state machine
Inside the engine, each request moves through a small set of states (RequestStatus in
vllm/v1/request.py). Understanding the states — and especially the transitions — is how you
reason about latency, fairness, and failures.
(arrives)
│
▼
┌─────────┐ admitted by ┌─────────┐ generates a token ┌──────────────────┐
│ WAITING │ ───scheduler────▶ │ RUNNING │ ───each step────────▶ │ FINISHED_* │
└─────────┘ (KV allocated) └─────────┘ until stop/maxlen │ (STOPPED/LENGTH/ │
▲ │ │ ABORTED/ERROR) │
│ preempted: out of KV │ └──────────────────┘
└────────── PREEMPTED ◀──────┘ (KV freed; re-admitted later, recomputed — Phase 3)
- WAITING — admitted to the engine, queued, not yet running (no KV allocated yet).
- RUNNING — actively generating; has KV blocks; touched every step.
- PREEMPTED — was running, but the engine ran out of KV memory and evicted it to make progress on others; it goes back to WAITING and is recomputed when memory frees (Phase 3's safety valve).
- FINISHED_* — terminal: hit a stop token (STOPPED), hit max length (LENGTH_CAPPED), was cancelled (ABORTED), or errored. Its KV is freed and the final output returned.
🔬 Going deeper. Real vLLM has extra "waiting" sub-states for requests blocked on something other than the queue: waiting for a structured-output grammar to compile (Phase 12), waiting for KV to arrive over the network in disaggregated serving (Phase 15), etc. They're still "not ready to run," just for richer reasons. Also note the enum ordering trick: is_finished is simply status > PREEMPTED, so the terminal states are defined by position in the enum — a tiny detail that makes the hot-path check branch-free. You'll trace this exact state machine in lab-01.
🆕 New words: RequestStatus, preemption (evict a running request under memory pressure), terminal/finished states.
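The state machine and the ordering trick, as a simplified sketch. ToyRequestStatus is a hypothetical enum that follows the diagram above; it omits the extra waiting sub-states, and its member list is not a copy of vLLM's RequestStatus.

```python
import enum


class ToyRequestStatus(enum.IntEnum):
    WAITING = enum.auto()
    RUNNING = enum.auto()
    PREEMPTED = enum.auto()
    # Every member declared after PREEMPTED is terminal.
    FINISHED_STOPPED = enum.auto()
    FINISHED_LENGTH_CAPPED = enum.auto()
    FINISHED_ABORTED = enum.auto()
    FINISHED_ERROR = enum.auto()

    @staticmethod
    def is_finished(status):
        # Position in the enum decides: one integer comparison.
        return status > ToyRequestStatus.PREEMPTED


S = ToyRequestStatus
FINISHED = {s for s in S if S.is_finished(s)}

# Legal transitions, read off the lifecycle diagram.
ALLOWED = {
    S.WAITING: {S.RUNNING},
    S.RUNNING: {S.PREEMPTED} | FINISHED,
    S.PREEMPTED: {S.WAITING},
}


def transition(current, new):
    """Return the new status, or raise if the diagram has no such arrow."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new


status = S.WAITING
for nxt in (S.RUNNING, S.PREEMPTED, S.WAITING, S.RUNNING, S.FINISHED_STOPPED):
    status = transition(status, nxt)
    print(f"{status.name:<18} finished={S.is_finished(status)}")
```

Terminal states have no entry in `ALLOWED`, so any attempt to leave one raises.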
1.9 How thousands of requests share the loop (the payoff)
Now connect the architecture back to Phase 0's physics. Why all this machinery? Because of §0.10:
one decode stream wastes ~99% of the GPU. The architecture exists to keep many requests in
flight so each step() decodes a big batch — amortizing the weight read and pushing arithmetic
intensity toward the roofline ridge.
Crucially, because the scheduler re-plans every step (continuous batching, Phase 3), requests
don't move in lockstep: the moment one finishes, its slot is freed and a WAITING request joins
mid-flight, on the very next tick. So at any instant the running batch is a churning mix of
requests at different stages — some doing their first (prefill) step, most adding one decode token.
The loop in §1.5 absorbs all of that uniformly because, to it, every request is just "advance
num_computed_tokens toward num_tokens" (Phase 0 §0.13). That uniformity is why one simple loop
can serve a chaotic, ever-changing crowd.
time ─►
req A [prefill][dec][dec][done]
req B [prefill][dec][dec][dec][done]
req C [prefill][dec][dec]... ← C joined the instant A's slot freed
every column is one step() = one batched forward over whoever's running right now
1.10 Tracing one request, end to end
Let's follow "Tell me a joke" through the offline path, naming each stop (you'll do this live in
lab-01, and read the real code in the deep-dive):
- LLM.generate(["Tell me a joke"]) tokenizes the prompt and builds an EngineCoreRequest.
- add_request wraps it as a Request (num_tokens=5, num_computed_tokens=0, status WAITING) and enqueues it in the scheduler.
- Tick 1 (prefill): schedule() admits it → RUNNING, allocates KV blocks for 5 tokens; execute_model runs the forward over all 5 prompt tokens; sample_tokens produces " Why"; update_from_output sets num_computed_tokens=5, appends " Why" (num_tokens=6).
- Ticks 2..N (decode): each tick schedules 1 new token for this request, runs the model, samples the next token, appends it. num_computed_tokens chases num_tokens, one step at a time.
- Finish: when the model emits the EOS token (or hits max_tokens), update_from_output marks it FINISHED_*, frees its KV blocks, and the detokenizer turns the token IDs into the final string (streamed token-by-token on the server path).
That's the whole life of a request. Notice tick 1 processes many tokens (prefill, compute-bound) and every later tick processes one (decode, memory-bound) — Phase 0's two phases, now visible in the loop.
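The counter arithmetic of that trace, isolated. This sketch assumes the simplest case (the whole prompt prefilled in one tick, no preemption, EOS never sampled), and `trace_counters` is a hypothetical helper, not part of mini_vllm.

```python
def trace_counters(prompt_len, max_new_tokens):
    """Rows of (tick, tokens scheduled, num_computed_tokens, num_tokens),
    each recorded after that tick's bookkeeping."""
    num_tokens = prompt_len
    num_computed_tokens = 0
    rows = []
    tick = 0
    while num_tokens - prompt_len < max_new_tokens:
        tick += 1
        scheduled = num_tokens - num_computed_tokens  # close the gap
        num_computed_tokens += scheduled
        num_tokens += 1                               # append the sampled token
        rows.append((tick, scheduled, num_computed_tokens, num_tokens))
    return rows


for tick, scheduled, computed, total in trace_counters(prompt_len=5, max_new_tokens=3):
    kind = "prefill" if tick == 1 else "decode"
    print(f"tick {tick} ({kind}): scheduled={scheduled} "
          f"num_computed_tokens={computed} num_tokens={total}")
```

Tick 1 schedules the full gap of 5; every later tick schedules exactly 1, because each append reopens a gap of one.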
1.11 The mental model to carry forward
front door (LLM / AsyncLLM)
→ EngineCore.step loop: schedule → execute → sample → update
├─ schedule/update ........ Phases 2, 3 (memory & batching)
├─ execute_model ........ Phases 4–7, 10, 13, 14 (kernels, quant, parallelism, models)
└─ sample_tokens ........ Phases 8, 9, 12 (decoding, spec, structured)
→ detokenize / stream ........ Phase 16 (the serving API)
Every later phase is a zoom into one box of EngineCore.step. You now have the table of contents
for the entire book. When a later chapter says "this happens during execute_model" or "the
scheduler decides X," you'll know exactly where in this picture you are.
1.12 What you'll do in this phase
- Read: 01-deep-dive.md — LLM.generate, EngineCore.step, LLMEngine, AsyncLLM, and the Executor→Worker→ModelRunner chain, with verified line anchors.
- Build: 02-mini-build.md — add lifecycle tracing to mini_vllm.
- Labs (see labs/README.md for the full guide to each):
  - lab-01-trace-a-request [CPU-OK] — instrument mini_vllm to record a request's full lifecycle (states + the two counters, per step) and assert it matches the WAITING→RUNNING→FINISHED path.
  - lab-02-read-the-real-loop [GPU-OPT] — run real vLLM with debug logging and correlate the output to core.py:step() (captured output included).
  - lab-03-engine-step-by-hand [CPU-OK] — rebuild LLMEngine.step from the scheduler/model/sampler and prove it token-for-token identical to the real loop (incl. the needs_sample guard).
  - lab-04-watch-the-batch [CPU-OK] — probe the scheduler and record per-step batch composition: chunking, deferred admission, and mixed prefill+decode steps, measured.
  - lab-05-stop-conditions [CPU-OK] — EOS vs max_tokens vs ignore_eos, the boundary tie, and the status→finish_reason mapping every API consumer depends on.
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
You're ready to move on when you can draw the request's journey from generate() to a streamed
token, name every object and component it becomes/touches, recite the four stages of step() and
which phase owns each, and explain why the engine runs in its own process and why continuous
batching is what makes the whole thing economical.
← Phase 00 · Course home · Phase 02 →
Phase 01 — Deep Dive: tracing a request through real vLLM
Paths relative to upstream/ at v0.22.1 @ 0decac0. We follow one request from LLM.generate to tokens out, naming every file. Keep mini_vllm/engine.py open alongside: it's the same control flow, miniature.
Contents
- 1. The offline entry point: LLM.generate
- 2. The heartbeat: EngineCore.step
- 3. Down to the metal: Executor → Worker → ModelRunner
- 4. The async path (serving)
- 5. The output path
- The whole journey, named
- Reading checklist
1. The offline entry point: LLM.generate
vllm/entrypoints/llm.py: class LLM (:66), def generate (:422). generate validates
inputs, builds requests, adds them to the engine, and runs the engine to completion, collecting
RequestOutputs. Under the hood it drives an LLMEngine.
vllm/v1/engine/llm_engine.py: class LLMEngine (:47) with add_request (:209) and step
(:287). This is the synchronous wrapper: add_request tokenizes + enqueues; step pumps the
core once and returns finished RequestOutputs. mini_vllm.LLMEngine.{add_request,step,generate}
mirror these one-to-one.
2. The heartbeat: EngineCore.step
vllm/v1/engine/core.py:428 (you read this in Phase 00 — revisit with the architecture in mind):
def step(self):
    if not self.scheduler.has_requests():
        return {}, False
    scheduler_output = self.scheduler.schedule()                          # 1. who runs (Ph 3)
    future = self.model_executor.execute_model(scheduler_output, ...)     # 2. run model (Ph 4–14)
    grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)
    model_output = future.result()
    if model_output is None:
        model_output = self.model_executor.sample_tokens(grammar_output)  # 3. sample (Ph 9)
    engine_core_outputs = self.scheduler.update_from_output(              # 4. advance (Ph 3)
        scheduler_output, model_output)
    return engine_core_outputs, scheduler_output.total_num_scheduled_tokens > 0
add_request is at core.py:337: it wraps the incoming EngineCoreRequest into a Request and
hands it to self.scheduler.add_request. Note that EngineCore is also subclassed by EngineCoreProc
(:835), the version that runs in its own process and receives requests over a queue. That's
the process split from the guide.
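The four stages can be sketched on a toy engine that runs anywhere. This is a minimal illustration, not vLLM code: ToyRequest, toy_step, the token_budget default of 8, and the stand-in "forward pass" (a sum of token ids) are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison, so list.remove() targets this object
class ToyRequest:
    prompt_ids: list[int]
    max_tokens: int
    output_ids: list[int] = field(default_factory=list)
    num_computed_tokens: int = 0

    @property
    def num_tokens(self) -> int:
        return len(self.prompt_ids) + len(self.output_ids)


def toy_step(running: list[ToyRequest], token_budget: int = 8):
    """One engine tick. Returns (finished_requests, did_work)."""
    if not running:
        return [], False

    # 1. schedule: decide how many tokens each request computes this step.
    scheduled = {}
    for idx, req in enumerate(running):
        n = min(req.num_tokens - req.num_computed_tokens, token_budget)
        if n > 0:
            scheduled[idx] = n
            token_budget -= n

    # 2. execute: a stand-in for the forward pass (hypothetical, no real model).
    hidden = {idx: sum(running[idx].prompt_ids + running[idx].output_ids)
              for idx in scheduled}

    # 3. sample: only requests whose computed tokens catch up this step sample.
    sampled = {}
    for idx, n in scheduled.items():
        req = running[idx]
        if req.num_computed_tokens + n == req.num_tokens:
            sampled[idx] = hidden[idx] % 100

    # 4. update: advance counters, append new tokens, reap finished requests.
    finished = []
    for idx, n in scheduled.items():
        req = running[idx]
        req.num_computed_tokens += n
        if idx in sampled:
            req.output_ids.append(sampled[idx])
            if len(req.output_ids) >= req.max_tokens:
                finished.append(req)
    for req in finished:
        running.remove(req)
    return finished, bool(scheduled)
```

Calling toy_step repeatedly on a list holding one ToyRequest reproduces the shape of the real loop: the first call computes the whole prompt and samples once, each later call advances by one token, and an empty list returns ([], False), mirroring the early return at the top of step.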
3. Down to the metal: Executor → Worker → ModelRunner
self.model_executor is an Executor (vllm/v1/executor/abstract.py defines the interface). For
single-GPU it's a UniProcExecutor; for multi-GPU a MultiProcExecutor (multiproc_executor.py,
Phase 10). execute_model(scheduler_output) forwards to the worker(s).
vllm/v1/worker/gpu_worker.py — class Worker: owns the device, the model, and the KV cache for
one GPU. Its execute_model calls into the model runner.
vllm/v1/worker/gpu_model_runner.py — GPUModelRunner.execute_model is where SchedulerOutput
becomes reality: it gathers the scheduled tokens into input tensors, builds attention metadata
(block tables + sequence lengths from Phase 2/3), runs the (possibly CUDA-graphed, Phase 5)
forward pass, and runs the sampler. Search it for execute_model and _prepare_inputs. This is
the single busiest file in the engine — you'll return to it in Phases 4, 5, 9, 13.
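The layering is easier to hold in your head as three small classes that delegate downward. The sketch below is purely illustrative: ToyExecutor, ToyWorker, ToyModelRunner and the "last token + 1" forward pass are hypothetical stand-ins, and the real classes carry far more state.

```python
class ToyModelRunner:
    """Turns a scheduling decision into model inputs and runs the 'model'."""

    def execute_model(self, scheduled: dict[str, list[int]]) -> dict[str, int]:
        # Flatten every request's scheduled tokens into one batch and remember
        # where each request's slice ends (a stand-in for attention metadata).
        flat, last_index = [], {}
        for request_id, token_ids in scheduled.items():
            flat.extend(token_ids)
            last_index[request_id] = len(flat) - 1
        # Stand-in forward pass: the "next token" is the last input token + 1.
        return {rid: flat[i] + 1 for rid, i in last_index.items()}


class ToyWorker:
    """Owns one device and the model runner that lives on it."""

    def __init__(self, device: str):
        self.device = device
        self.model_runner = ToyModelRunner()

    def execute_model(self, scheduled: dict[str, list[int]]) -> dict[str, int]:
        return self.model_runner.execute_model(scheduled)


class ToyExecutor:
    """The engine-facing interface: forwards the call to every worker."""

    def __init__(self, devices: list[str]):
        self.workers = [ToyWorker(device) for device in devices]

    def execute_model(self, scheduled: dict[str, list[int]]) -> dict[str, int]:
        outputs = [worker.execute_model(scheduled) for worker in self.workers]
        return outputs[0]  # in this toy every worker computes the same result


executor = ToyExecutor(["cuda:0"])
next_tokens = executor.execute_model({"req-0": [1, 2, 3], "req-1": [7]})
```

The engine only ever sees the executor; swapping one device for several changes the constructor argument and nothing above it, which is the point of the split.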
4. The async path (serving)
vllm/v1/engine/async_llm.py: class AsyncLLM. The OpenAI server (Phase 16) calls
AsyncLLM.generate, an async generator that yields RequestOutput deltas as they're produced.
Internally it talks to the EngineCoreProc over IPC and runs the output processing/detokenization
on the server side, off the core's hot path. Same core, async shell.
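The "async generator that yields deltas" shape can be sketched with nothing but asyncio. This is an assumption-laden toy: toy_generate and core_loop are hypothetical names, and the background task here shares the event loop, whereas the real engine core sits in another process behind IPC.

```python
import asyncio


async def toy_generate(prompt: str, max_tokens: int):
    """Async generator: yields one text delta per (pretend) engine step."""
    outputs = asyncio.Queue()

    async def core_loop() -> None:
        # Stand-in for the engine core: produce one "token" per step.
        for i in range(max_tokens):
            await asyncio.sleep(0)           # give control back between steps
            await outputs.put(f"{prompt}-{i}")
        await outputs.put(None)              # sentinel: the request finished

    task = asyncio.create_task(core_loop())
    while (delta := await outputs.get()) is not None:
        yield delta                          # the caller sees tokens as they arrive
    await task


async def collect() -> list:
    return [delta async for delta in toy_generate("tok", 3)]


if __name__ == "__main__":
    print(asyncio.run(collect()))
```

The consumer never blocks the producer: the loop that makes tokens and the code that streams them only meet at the queue, which is the same decoupling the serving path relies on.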
5. The output path
vllm/v1/engine/output_processor.py + detokenizer.py: turn the core's sampled token ids back
into text, handle stop strings, and assemble RequestOutputs (streaming deltas for the server).
mini_vllm folds this into engine.generate (decode at the end) — simpler, same idea.
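Stop strings are the subtle part of the output path, because a stop string can straddle two tokens. One way to handle that, sketched below under assumptions, is to hold back the last len(stop) - 1 characters until they are known to be safe. The function stream_text and the toy vocab are hypothetical; this is not how output_processor.py is written, only an illustration of the problem it solves.

```python
def stream_text(token_ids, vocab, stop=""):
    """Detokenize incrementally and return (deltas, stopped).

    Text that could still turn into the stop string is held back, so no
    part of the stop string is ever emitted.
    """
    text = ""          # everything detokenized so far
    sent = 0           # number of characters already emitted
    deltas = []
    hold = max(len(stop) - 1, 0)
    for token_id in token_ids:
        text += vocab[token_id]
        cut = text.find(stop) if stop else -1
        if cut != -1:
            # Stop string found: emit what precedes it, then finish.
            if cut > sent:
                deltas.append(text[sent:cut])
            return deltas, True
        safe = len(text) - hold
        if safe > sent:
            deltas.append(text[sent:safe])
            sent = safe
    if len(text) > sent:
        deltas.append(text[sent:])  # flush the held-back tail
    return deltas, False


vocab = {0: "Par", 1: "is", 2: ".", 3: " It"}
deltas, stopped = stream_text([0, 1, 2, 3], vocab, stop=".")
```

With stop="s." the same token stream yields "Pa" and "ri" and then stops: the trailing "s" was held back and never leaked to the client.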
The whole journey, named
LLM.generate (llm.py:422)
 └─ LLMEngine.add_request (llm_engine.py:209) -> EngineCore.add_request (core.py:337)
 └─ loop LLMEngine.step (llm_engine.py:287) -> EngineCore.step (core.py:428):
      scheduler.schedule()            (sched/scheduler.py:329)                   Phase 3
      executor.execute_model()        (executor/ -> worker/gpu_model_runner.py)  Phase 4-14
      executor.sample_tokens()        (sample/sampler.py)                        Phase 9
      scheduler.update_from_output()  (sched/scheduler.py:1283)                  Phase 3
 └─ output_processor/detokenizer -> RequestOutput
Reading checklist
- LLM.generate → which engine method adds requests, which pumps the loop?
- EngineCore.step → recite the four stages and the file each lives in.
- Executor vs Worker vs ModelRunner → who owns the GPU, who builds tensors?
- Why does EngineCoreProc exist (the process split)?
- Where does detokenization happen, and why off the core's hot path for serving?
Now build it: 02-mini-build.md, then the labs.
Phase 01 — Mini-Build: trace the request lifecycle
You'll add lifecycle tracing to mini_vllm so you can see a request move through
WAITING → RUNNING → FINISHED, with its num_computed_tokens/num_tokens at every step. Seeing
the state machine run is how the architecture stops being abstract.
The task (lab-01)
Implement trace_request(engine_kwargs, prompt, sampling_params) -> list[Event] that runs the
mini_vllm engine one step() at a time and records, after each step, every live request's
(request_id, status, num_computed_tokens, num_tokens). Then derive:
- the first event (should be RUNNING with num_computed == num_prompt_tokens after prefill),
- the sequence of statuses (RUNNING…→FINISHED),
- that num_computed_tokens is monotonically non-decreasing until finish.
You're reconstructing, on your own engine, what VLLM_LOGGING_LEVEL=DEBUG shows you on the real
one (lab-02). Map each transition to EngineCore.step (core.py:428).
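The three derived facts are plain list operations once the events exist. A minimal sketch, assuming Event is a simple named tuple with the four fields listed above (the real lab may define it differently) and using a hypothetical helper name, summarize_trace:

```python
from typing import NamedTuple


class Event(NamedTuple):
    request_id: str
    status: str
    num_computed_tokens: int
    num_tokens: int


def summarize_trace(events: list[Event], num_prompt_tokens: int) -> dict:
    """Derive the three facts the task asks for from a recorded trace."""
    first = events[0]
    computed = [e.num_computed_tokens for e in events]
    return {
        # Prefill finished inside step 1 and the request is already RUNNING.
        "prefill_done_in_first_event": (
            first.status == "RUNNING"
            and first.num_computed_tokens == num_prompt_tokens
        ),
        # The status path, in order.
        "statuses": [e.status for e in events],
        # The counter never moves backwards between consecutive events.
        "monotonic": all(a <= b for a, b in zip(computed, computed[1:])),
    }


example = [
    Event("req-0", "RUNNING", 5, 6),
    Event("req-0", "RUNNING", 6, 7),
    Event("req-0", "FINISHED_LENGTH", 7, 8),
]
summary = summarize_trace(example, num_prompt_tokens=5)
```

Keeping the derivation separate from the capture means a failing test tells you which half is wrong: the loop that recorded the events, or the reasoning about them.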
Method
mini_vllm.LLMEngine exposes scheduler (with .running/.waiting) and step(). Drive the
loop manually:
eng = LLMEngine(**engine_kwargs)
rid = eng.add_request(prompt, sampling_params)
events = []
while eng.scheduler.has_unfinished_requests():
    finished = eng.step()  # step() returns whatever finished during this step
    for r in eng.scheduler.running:
        events.append(Event(r.request_id, r.status.name, r.num_computed_tokens, r.num_tokens))
    # also capture the entries in `finished`: they have already left `running`
(The exact capture is the lab's job; the test pins the resulting trace's shape.)
Definition of done
pytest phase-01-architecture-and-request-lifecycle/labs -q
Then answer: at which step does num_computed_tokens first equal num_prompt_tokens (prefill
done)? After that, how much does it grow per step (decode = 1)? Why does that match the
prefill/decode model from Phase 0?
Map to the real engine
| your trace | real vLLM |
|---|---|
| status transitions | RequestStatus (request.py:315) |
| per-step counter advance | update_from_output (scheduler.py:1283) |
| the loop you drive | EngineCore.step (core.py:428) |
| reading scheduler.running | the real Scheduler.running list |
Phase 01 Labs — Architecture & Request Lifecycle
Five labs that turn the engine from a black box into your box. The arc: observe the lifecycle (lab-01), verify it on real hardware (lab-02), rebuild the loop yourself (lab-03), watch many requests share it (lab-04), and master how requests end (lab-05). Do them in order — each one's vocabulary is the next one's prerequisite.
Every [CPU-OK] lab follows the same contract: starter.py with TODOs (your work),
solution.py (the reference), test_lab.py (the spec, executable). The default test run
uses solution.py so the suite is always green; set LAB_IMPL=starter to grade yourself.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-01-architecture-and-request-lifecycle/labs -m "not gpu"
# Grade your own work on one lab:
LAB_IMPL=starter pytest phase-01-architecture-and-request-lifecycle/labs/lab-01-trace-a-request -q
Contents
- lab-01-trace-a-request [CPU-OK]
- lab-02-read-the-real-loop [GPU-OPT]
- lab-03-engine-step-by-hand [CPU-OK]
- lab-04-watch-the-batch [CPU-OK]
- lab-05-stop-conditions [CPU-OK]
- What you can do after this phase
Labs
lab-01-trace-a-request [CPU-OK]
Drive the mini_vllm engine one step at a time and record every transition of a single
request — status, num_computed_tokens, num_tokens — from prefill through decode to
finish. You'll reconstruct, on an engine you control, exactly what
VLLM_LOGGING_LEVEL=DEBUG prints on the real one, and internalize the course's central
mental model: a request is two counters racing. Skills: the lifecycle state machine;
prefill/decode as one mechanism; TTFT = step 1.
lab-02-read-the-real-loop [GPU-OPT]
Run real vLLM 0.22.1 on a tiny model with debug logging and attribute every log line to a
stage of EngineCore.step (core.py:428). The lab-01 trace and the production log line up
one-to-one — that correlation is the moment the upstream codebase becomes readable. Captured,
annotated output included so the lab works without a GPU. Skills: log-line → source-line
debugging; the three-call engine core; # GPU blocks as serving capacity.
lab-03-engine-step-by-hand [CPU-OK]
The rite of passage: given the engine's organs (scheduler, model, sampler), wire the
schedule → execute → sample → update loop yourself, and prove it token-for-token identical
to LLMEngine.step. Includes the one subtle rule of the whole loop — only requests whose
computed tokens catch up this step may sample — with a test that catches you if you miss
it. Skills: the engine's stage contract; the needs_sample invariant; testing by
determinism.
lab-04-watch-the-batch [CPU-OK]
Instrument the scheduler with a non-invasive probe and record the batch composition of every step while multiple requests run under a scarce token budget. You'll see prefill chunks and decodes co-scheduled in one step, requests joining and leaving the batch mid-flight — continuous batching, measured rather than described. Skills: the observe-don't-modify probe pattern; budget/chunk/defer mechanics; token-conservation identities for debugging schedulers.
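The budget, chunk and defer mechanics fit in a few lines. The sketch below is a toy with made-up names (Req, schedule); it assumes running requests are served before waiting ones and that a request may receive only part of its pending tokens, and it ignores KV-block limits entirely.

```python
from dataclasses import dataclass


@dataclass
class Req:
    rid: str
    num_tokens: int        # prompt + tokens generated so far
    num_computed: int = 0  # tokens whose KV is already cached


def schedule(running: list[Req], waiting: list[Req], budget: int) -> dict[str, int]:
    """Return {request id: tokens scheduled this step}, never exceeding budget."""
    plan = {}
    for req in running:                      # running requests are served first
        if budget == 0:
            break
        n = min(req.num_tokens - req.num_computed, budget)
        if n > 0:
            plan[req.rid] = n
            budget -= n
    while waiting and budget > 0:            # then admit from the waiting queue
        req = waiting.pop(0)
        n = min(req.num_tokens - req.num_computed, budget)  # may be a chunk
        plan[req.rid] = n
        budget -= n
        running.append(req)
    return plan


running = [Req("a", num_tokens=11, num_computed=10)]       # mid-decode
waiting = [Req("b", num_tokens=12), Req("c", num_tokens=5)]
plan = schedule(running, waiting, budget=8)
```

Here "a" decodes one token, "b" is admitted with a 7-token prefill chunk, and "c" is deferred because the budget is spent. The identity sum(plan.values()) <= budget is the kind of token-conservation check the lab uses to debug schedulers.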
lab-05-stop-conditions [CPU-OK]
Dissect how requests end: EOS ("stop") vs max_tokens ("length"), the ignore_eos
benchmark flag, and the boundary tie where both fire at once and the order of two if
statements becomes a public API. Scripted token streams make every edge case exactly
testable. Skills: status → finish_reason mapping; why ordering of stop checks is an API
decision; triaging "my answer got cut off."
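The boundary tie is easiest to see when the order of the two checks is itself a parameter. In this sketch the function finish_reason and its eos_first flag are hypothetical, introduced only to show that the order is observable; it takes no position on which order mini_vllm or vLLM actually uses.

```python
def finish_reason(last_token, num_output_tokens, *, eos_token_id, max_tokens,
                  ignore_eos=False, eos_first=True):
    """Return "stop", "length", or None if the request keeps running."""
    hit_eos = (not ignore_eos) and last_token == eos_token_id
    hit_length = num_output_tokens >= max_tokens
    checks = [("stop", hit_eos), ("length", hit_length)]
    if not eos_first:
        checks.reverse()  # same two conditions, opposite precedence
    for reason, fired in checks:
        if fired:
            return reason
    return None


# The tie: the 4th (and last allowed) output token happens to be EOS.
tie_eos_first = finish_reason(2, 4, eos_token_id=2, max_tokens=4)
tie_len_first = finish_reason(2, 4, eos_token_id=2, max_tokens=4, eos_first=False)
```

The same token stream reports "stop" under one ordering and "length" under the other, which is exactly why clients that branch on finish_reason turn an implementation detail into a contract.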
What you can do after this phase
Explain to a colleague, with evidence you generated yourself: what one step of an inference
engine does; why TTFT ≈ prefill and ITL ≈ one decode step; how N requests share an engine
without ever stopping it; and what finish_reason will say and why. You also now hold the
top of vLLM's call tree (EngineCore.step) in your head — every later phase is a descent
into one of its three calls.
Lab 01-01 — Trace a Request's Lifecycle [CPU-OK]
"The first thing a systems engineer does with a black box is make it stop being one."
You are going to take one request — a single prompt — and watch every heartbeat of its life
inside an inference engine: from the moment it's admitted, through its prefill, through every
decode step, until the engine pronounces it finished. By the end you will have produced, with
your own code, the exact trace that vLLM emits when you run it with
VLLM_LOGGING_LEVEL=DEBUG — and more importantly, you'll know why every line of that trace
looks the way it does.
Contents
- Why this lab exists
- Background: the one mental model to rule them all
- Files
- Run
- What to implement
- What you should see — and why every number is what it is
- What the tests prove
- How this maps to the real engine
- Hitchhiker's notes (gotchas & deeper cuts)
- Going further
- References
Why this lab exists
Every hard debugging session you will ever have on an inference engine — a stuck request, a latency cliff, a throughput regression, a preemption storm — starts with the same question: "what is the engine doing with my request right now?" If you can't answer that, you're guessing. If you can, you're an engineer.
The trouble is that production engines hide the lifecycle behind a convenience API. You call
llm.generate(...), you get text back, and the thousand scheduling decisions in between are
invisible. This lab removes the convenience wrapper. You will drive the engine one step at
a time with your own loop, and after each step you'll photograph the request's state:
its status, how many of its tokens have been computed, how many exist in total.
That photograph sequence is the request lifecycle. Once you've built it yourself, the real
engine's debug logs (lab-02), its Prometheus metrics (vllm:num_requests_running,
vllm:num_requests_waiting — Phase 18), and its scheduler internals (Phase 3) all become
readable at a glance, because they're all just different projections of the same state
machine you're about to instrument.
Background: the one mental model to rule them all
Here is the single most important idea in this whole phase, the one the real vLLM scheduler
is built on (see the comment block at upstream/vllm/v1/core/sched/scheduler.py:330):
There is no "prefill phase" and no "decode phase." A request is just two counters racing: num_computed_tokens chasing num_tokens.
- num_tokens = prompt tokens + tokens generated so far. It grows by 1 every time the request samples a new token.
- num_computed_tokens = how many of those tokens have had their KV (attention key/value) computed and stored in the cache. The model can only sample a new token when this counter has caught up, that is, when every existing token's KV is in place.
"Prefill" is merely the situation where num_computed_tokens is far behind (the whole
prompt's KV is missing) and the engine computes a big batch of it at once. "Decode" is the
situation where it's exactly one behind, and each step computes one token's KV and samples
one new token. The same loop handles both. This is what makes chunked prefill (Phase 3),
prefix caching (Phases 2–3), and continuous batching fall out naturally instead of being
special cases — and it's the single design decision that most distinguishes vLLM's V1 engine
from a naive two-phase implementation.
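The race can be simulated without any engine at all. A minimal sketch, assuming a single request and a per-step cap called chunk (a hypothetical stand-in for the token budget; it must be at least 1):

```python
def race(prompt_len: int, max_tokens: int, chunk: int):
    """Yield (num_computed_tokens, num_tokens) after each step."""
    num_tokens = prompt_len
    num_computed = 0
    generated = 0
    while generated < max_tokens:
        # Compute KV for as many pending tokens as this step allows.
        num_computed += min(num_tokens - num_computed, chunk)
        # A token can be sampled only once every existing token has its KV.
        if num_computed == num_tokens:
            num_tokens += 1
            generated += 1
        yield num_computed, num_tokens


one_shot = list(race(prompt_len=5, max_tokens=4, chunk=2048))
chunked = list(race(prompt_len=10, max_tokens=2, chunk=4))
```

There is no branch for "prefill" or "decode" anywhere in the loop. With a large chunk the first step jumps straight to (5, 6); with chunk=4 and a 10-token prompt the first two steps advance only the computed counter, and sampling starts on the step that catches up.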
The lifecycle states you'll observe (from mini_vllm/request.py, mirroring
upstream/vllm/v1/request.py):
           add_request()              scheduled              stop condition
(created) ───────────────▶ WAITING ───────────────▶ RUNNING ───────────────▶ FINISHED_*
                              ▲                       │
                              │    memory pressure    │
                              └──── PREEMPTED ◀───────┘  (Phase 3, lab-04)
FINISHED_* is two states in practice: FINISHED_STOPPED (hit the EOS token) and
FINISHED_LENGTH (hit max_tokens). Lab-05 dissects that distinction; in this lab we pin
ignore_eos=True so length is always the stop reason and the trace is deterministic.
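The diagram can be written down as a transition table and used to validate a recorded status path. This is a sketch that follows the diagram as drawn; the names Status, ALLOWED and is_valid_path are hypothetical, and an implementation may well route a preempted request straight back to RUNNING instead of through WAITING.

```python
from enum import Enum, auto


class Status(Enum):
    WAITING = auto()
    RUNNING = auto()
    PREEMPTED = auto()
    FINISHED_STOPPED = auto()
    FINISHED_LENGTH = auto()


# Which statuses may directly follow each status, per the diagram.
ALLOWED = {
    Status.WAITING: {Status.RUNNING},
    Status.RUNNING: {Status.PREEMPTED, Status.FINISHED_STOPPED,
                     Status.FINISHED_LENGTH},
    Status.PREEMPTED: {Status.WAITING},
    Status.FINISHED_STOPPED: set(),   # terminal
    Status.FINISHED_LENGTH: set(),    # terminal
}


def is_valid_path(path: list[Status]) -> bool:
    """True if every consecutive pair in the path is an allowed transition."""
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(path, path[1:]))


happy_path = [Status.WAITING, Status.RUNNING, Status.FINISHED_LENGTH]
```

A table like this makes a handy assertion inside a trace: any pair of consecutive statuses it rejects is either a bug in the engine or a hole in your understanding of it.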
Files
- starter.py: implement trace_request (the manual step loop + snapshotting). Your work.
- solution.py: a complete reference. Resist opening it until your tests pass or you're genuinely stuck; the value of the lab is the 20 minutes of thinking.
- test_lab.py: pins the lifecycle shape: prefill-in-step-1, monotonic counters, one-token-per-decode, finish-at-cap.
Run
# Test YOUR implementation:
LAB_IMPL=starter pytest phase-01-architecture-and-request-lifecycle/labs/lab-01-trace-a-request -q
# Test the reference (default — this is why the suite is green out of the box):
pytest phase-01-architecture-and-request-lifecycle/labs/lab-01-trace-a-request -q
What to implement
def trace_request(prompt: str, max_tokens: int = 4, **engine_kwargs) -> list[Event]
where Event is (request_id, status, num_computed_tokens, num_tokens). The recipe:
- Build an LLMEngine(**engine_kwargs) and add one request with SamplingParams(max_tokens=max_tokens, temperature=0.0, ignore_eos=True). (Greedy + ignore-EOS = a fully deterministic, fixed-length run. Determinism is not a nicety here: it's what makes the lifecycle testable.)
- Loop while eng.scheduler.has_unfinished_requests(): calling eng.step() yourself. This is the whole trick: generate() would run this loop for you and hide everything; you are running it by hand so you can look between the steps.
- After each step, snapshot every request in eng.scheduler.running, then every request in the list step() returned. Those requests just finished and have already been removed from running, so if you only look at running, the final FINISHED_LENGTH event vanishes. This is the classic observability bug: the most interesting state transition is the one that removes the thing you're observing.
See 02-mini-build.md for the engine's anatomy if you haven't built it yet.
What you should see — and why every number is what it is
For trace_request("hello", max_tokens=4) your event list should look like this:
Event(request_id='req-0', status='RUNNING', num_computed_tokens=5, num_tokens=6)
Event(request_id='req-0', status='RUNNING', num_computed_tokens=6, num_tokens=7)
Event(request_id='req-0', status='RUNNING', num_computed_tokens=7, num_tokens=8)
Event(request_id='req-0', status='FINISHED_LENGTH', num_computed_tokens=8, num_tokens=9)
Every number above is explainable, and being able to explain it is the point:
- Why does the first event already say num_computed_tokens=5? "hello" is 5 bytes, and mini_vllm's ByteTokenizer is one token per byte, so the prompt is 5 tokens. The prompt easily fits the scheduler's token budget (default 2048), so the entire prefill happens inside step 1. You never observe num_computed_tokens < 5 because there is no "between" to observe: the counter goes 0 → 5 inside one step. (Make the prompt longer than the budget, or set long_prefill_token_threshold, and you will see intermediate values. Try it. That's chunked prefill, and it's lab 03-02.)
- Why is num_tokens=6 in that same first event? Because step 1 didn't just prefill: the prefill caught up (num_computed_tokens reached num_tokens), so the model sampled token #1 in the same step. Prompt (5) + 1 output = 6. Prefill and first-token generation are one step, which is why TTFT (time-to-first-token) ≈ prefill time in every serving benchmark you'll ever read.
- Why does each subsequent event advance both counters by exactly 1? That's a decode step: compute KV for the one new token, sample the next. One in, one out, forever. This lockstep is why decode is memory-bandwidth-bound (you re-read all the weights to produce a single token per request; see the roofline discussion in Phase 18).
- Why does it finish at num_tokens=9 and not 10? max_tokens=4 counts output tokens: 5 prompt + 4 output = 9. The status is FINISHED_LENGTH because we set ignore_eos=True, so the request was always going to run to its cap.
- Why are there exactly 4 events? One snapshot per step, and the run takes exactly max_tokens steps: 1 step of (prefill + first token) and 3 pure decode steps. Burn this formula in: steps = max_tokens when the prompt fits one prefill. A 1000-token answer is a thousand trips around the engine loop. That is why decode dominates serving cost, and why everything from CUDA graphs (Phase 5) to speculative decoding (Phase 8) exists.
- Notice what you never see: WAITING. The request is admitted in the very first schedule() because the engine is empty. On a loaded server, requests queue in WAITING, and time spent there is pure user-visible latency that no kernel optimization can fix. You'll create real WAITING time in lab-04 by starving the token budget.
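The step count generalizes once the prompt no longer fits one prefill. A small sketch, assuming a single request with no EOS stop and no preemption, where chunk is a hypothetical name for the most prompt tokens one step may compute:

```python
import math


def steps_to_finish(prompt_len: int, max_tokens: int, chunk: int = 2048) -> int:
    """Engine steps needed for one request (prompt_len >= 1, max_tokens >= 1)."""
    # The last prefill step also samples the first output token ...
    prefill_steps = math.ceil(prompt_len / chunk)
    # ... so only max_tokens - 1 pure decode steps remain.
    return prefill_steps + (max_tokens - 1)


fits_one_prefill = steps_to_finish(prompt_len=5, max_tokens=4)
chunked_prefill = steps_to_finish(prompt_len=10, max_tokens=4, chunk=2)
```

When the prompt fits in one step the first term is 1 and the formula collapses to steps = max_tokens, the case traced above; every extra prefill chunk adds one step before the first token appears.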
What the tests prove
| Test | The invariant it pins | Why a maintainer cares |
|---|---|---|
| test_first_event_is_running_after_prefill | Step 1 completes the prompt's KV (computed == 5) | TTFT = prefill; admission happens in schedule(), not add_request() |
| test_counters_monotonic_and_decode_by_one | num_computed_tokens never decreases; decode advances it by exactly 1 | A counter going backwards means preemption (Phase 3); in this lab it would mean your loop is broken |
| test_finishes_at_length_cap | Terminal status starts with FINISHED | Finished requests must leave running and free their KV (the reaping path) |
| test_total_decode_steps_equals_max_tokens | Exactly max_tokens steps for a budget-fitting prompt | The steps = output-tokens equivalence underlying every latency model |
How this maps to the real engine
Open upstream/vllm/v1/engine/core.py:428 (EngineCore.step) next to your loop. The
correspondence is one-to-one:
| Your loop | Real engine | What it does |
|---|---|---|
| eng.step() calls scheduler.schedule() | self.scheduler.schedule() | Decide which requests compute how many tokens this step |
| model forward + sampler inside step() | self.model_executor.execute_model(...) | Run the GPU forward pass, sample |
| scheduler.update_from_output(...) | self.scheduler.update_from_output(...) | Advance counters, detect stops, reap finished |
| your events.append(...) | VLLM_LOGGING_LEVEL=DEBUG log lines / Prometheus gauges | Observability |
The real Request (upstream/vllm/v1/request.py) carries the same two counters with the
same names. The real RequestStatus has the same states (FINISHED_LENGTH is spelled
FINISHED_LENGTH_CAPPED there) plus a few you'll meet later (FINISHED_ABORTED,
FINISHED_IGNORED). When you read the V1 scheduler in Phase 3, you'll recognize every field
because you traced it here first.
Hitchhiker's notes (gotchas & deeper cuts)
- Don't snapshot before the first step. Between add_request() and the first schedule(), the request is WAITING with num_computed_tokens=0. That state is real, but the tests deliberately start observation after step 1, because that's when the engine has actually done something. If you want the WAITING event, add it; just know why the tests don't ask for it.
- The finished request is not in scheduler.running. update_from_output reaps it before step() returns. That's why step() returns the finished list: it's your only handle on them. Real vLLM has the same shape: finished requests come back in EngineCoreOutputs, not in the scheduler's queues.
- Why temperature=0.0? The toy model is deterministic given (last token, position), and greedy sampling makes the whole token stream reproducible. With temperature > 0 the lifecycle shape would be identical, but the test for exact step counts could break if a sampled EOS sneaked in on a run without ignore_eos. Determinism first, then realism: a good habit for engine tests generally (the real vLLM test suite leans hard on greedy for the same reason).
- One step ≠ one token, in general. It's one scheduling quantum. This lab's prompt fits in one chunk so steps and output tokens align; chunked prefill (Phase 3) breaks that alignment on purpose. If you internalize "step = the engine's clock tick, in which each scheduled request advances num_computed_tokens by some amount," nothing later will surprise you.
Going further
- Re-run with long_prefill_token_threshold=2 and a 10-char prompt. You should now see RUNNING events with num_computed_tokens at 2, 4, 6, 8, 10 and, crucially, num_tokens not growing until the step that reaches 10 (mid-prefill steps emit no token; see Scheduler.needs_sample). You've just watched chunked prefill with your own eyes, two phases early.
- Trace two requests at once (pass a second prompt). Watch them interleave within the same steps: that's continuous batching, and it's lab-04.
- Compute TTFT and ITL (inter-token latency) in steps from your event list. On real hardware each step has a wall-clock cost roughly proportional to its scheduled token count; your step trace is the skeleton of every latency benchmark in Phase 18.
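For the last exercise, one possible starting point is to look only at when num_tokens grows. The sketch assumes one Event per step for a single request; Event is redefined here as a plain named tuple so the block runs on its own, and latency_in_steps is a hypothetical helper name.

```python
from typing import NamedTuple


class Event(NamedTuple):
    request_id: str
    status: str
    num_computed_tokens: int
    num_tokens: int


def latency_in_steps(events: list[Event], num_prompt_tokens: int):
    """Return (ttft_steps, itl_steps) for one request's per-step events."""
    token_steps = []                 # 1-based step at which each token appeared
    previous = num_prompt_tokens
    for step, event in enumerate(events, start=1):
        if event.num_tokens > previous:
            token_steps.append(step)
        previous = event.num_tokens
    ttft = token_steps[0] if token_steps else None
    itl = [b - a for a, b in zip(token_steps, token_steps[1:])]
    return ttft, itl


trace = [
    Event("req-0", "RUNNING", 5, 6),
    Event("req-0", "RUNNING", 6, 7),
    Event("req-0", "RUNNING", 7, 8),
    Event("req-0", "FINISHED_LENGTH", 8, 9),
]
ttft, itl = latency_in_steps(trace, num_prompt_tokens=5)
```

For the unchunked trace every gap is one step. Feed it a chunked-prefill trace and TTFT grows by the number of mid-prefill steps while ITL stays at 1, which is the trade-off chunked prefill makes.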
References
- upstream/vllm/v1/engine/core.py:428 (EngineCore.step): the loop you reproduced.
- upstream/vllm/v1/request.py: the real Request and RequestStatus.
- upstream/vllm/v1/core/sched/scheduler.py:330: the "no prefill phase, no decode phase" comment this lab is built around.
- Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023). §3 describes the request lifecycle this trace makes visible. https://arxiv.org/abs/2309.06180
- Yu et al., Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) — the paper that introduced iteration-level (per-step) scheduling, i.e. the reason your trace advances per step and not per request. https://www.usenix.org/conference/osdi22/presentation/yu
- vLLM blog, vLLM V1: A Major Upgrade to vLLM's Core Architecture (Jan 2025) — the V1 engine-loop redesign you're tracing. https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
- kipply, Transformer Inference Arithmetic — why prefill is compute-bound and decode is bandwidth-bound, the physics behind your step counts. https://kipp.ly/transformer-inference-arithmetic/
Lab 01-02 — Read the Real Engine Loop [GPU-OPT]
In lab-01 you built a trace of the request lifecycle on mini_vllm. Now you'll get the same
trace out of the real engine — vLLM 0.22.1, a real model, real CUDA — and line up the two
side by side. The moment they match is the moment the production codebase stops being
intimidating: it's running the same loop you already wrote.
No GPU? Don't panic. The full captured output from a real run is below, annotated. The loop structure is the lesson; the hardware just makes it go fast. Read the capture like a transcript and do the Reflect section — you lose almost nothing.
Contents
- Why this lab exists
- Requirements
- Steps
- Captured output (real run, facebook/opt-125m, L4, vLLM 0.22.1, trimmed)
- Reading the output line by line
- Now read the source
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
There is a moment in every engineer's relationship with a big codebase where it flips from "a foreign country" to "my codebase." It almost never happens by reading files top to bottom. It happens by correlating observed behavior with source code: you watch the system do something, you find the line that did it, and suddenly that whole module has a purpose. This lab manufactures that moment deliberately.
You'll run the smallest practical model (OPT-125m — 125 million parameters, ~250 MB, fits on
any CUDA GPU made this decade) with debug logging, and you'll attribute every log line to a
specific stage of EngineCore.step. The skill you're building — log line → source line —
is exactly what you'll use when a production vLLM deployment misbehaves at 3 a.m. and the
only evidence is a log stream.
Requirements
uv pip install -e ".[vllm]" # installs vllm==0.22.1, matching the course pin
huggingface-cli download facebook/opt-125m # ~250 MB; tiny on purpose
Why OPT-125m? You want the engine, not the model, to be the star. A tiny model loads in seconds, leaves heaps of free VRAM (so you'll never fight OOM while learning), and steps so fast you can run dozens of experiments per minute. Save the 70B models for when the engine is boring to you.
Steps
VLLM_LOGGING_LEVEL=DEBUG python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='facebook/opt-125m', gpu_memory_utilization=0.4, max_model_len=256)
print(llm.generate(['The capital of France is'], SamplingParams(max_tokens=16, temperature=0))[0].outputs[0].text)
"
Three deliberate parameter choices worth understanding (they're the first three knobs you'll ever tune on a real deployment):
- gpu_memory_utilization=0.4: vLLM pre-allocates this fraction of total VRAM for weights + KV cache. We keep it low so the demo coexists with your desktop; production runs 0.9+. Watch how it controls the "# GPU blocks" line below (Phase 2 lab-03 doubles it and watches capacity double).
- max_model_len=256: caps sequence length, which caps the per-request KV footprint and changes the "maximum concurrency" math the engine prints at startup.
- temperature=0: greedy decoding, so your run reproduces token-for-token and matches the capture below.
Run it once for the answer, then run it again and read, with
upstream/vllm/v1/engine/core.py:428 open in a second window.
Captured output (real run, facebook/opt-125m, L4, vLLM 0.22.1, trimmed)
INFO ... Initializing a V1 LLM engine with config: model='facebook/opt-125m', ...
INFO ... # GPU blocks: 8788, # CPU blocks: 0
DEBUG ... Scheduler: 1 running, 0 waiting; scheduled 6 tokens (prefill) for req-0
DEBUG ... EngineCore step: executed=True, 6 scheduled tokens
DEBUG ... Scheduler: 1 running, 0 waiting; scheduled 1 token (decode) for req-0
DEBUG ... EngineCore step: executed=True, 1 scheduled token
... (14 more decode steps) ...
DEBUG ... Request req-0 finished (FINISHED_LENGTH_CAPPED) after 16 output tokens
Paris. It is the largest city in France...
Reading the output line by line
Every number in that capture is a thing you already understand from lab-01:
- "# GPU blocks: 8788": at startup the engine measured free VRAM after loading weights, profiled a worst-case forward pass, and carved everything left into 8788 KV blocks of 16 tokens each (≈140k tokens of cache). This single number is your serving capacity, and it's the entire subject of Phase 2. "# CPU blocks: 0" simply means no CPU swap space is configured.
- "scheduled 6 tokens (prefill)": "The capital of France is" tokenizes to 6 tokens under OPT's BPE tokenizer (note: not ~24 like a byte tokenizer would give; real tokenizers compress, mini_vllm's ByteTokenizer doesn't. Same lifecycle, different token counts). All 6 are scheduled in one step because 6 ≪ the token budget. This is exactly your lab-01 step 1.
- "1 running, 0 waiting": the scheduler's two queues, printed every step. With one request and an empty server, nobody ever waits. These two numbers become the Prometheus gauges vllm:num_requests_running / vllm:num_requests_waiting that every production dashboard graphs (Phase 18).
- "scheduled 1 token (decode)" × 15: the prefill step already sampled the first output token, so fifteen decode steps produce the other fifteen. That makes sixteen steps for sixteen output tokens. Steps = output tokens: the lab-01 invariant, now on real hardware.
- FINISHED_LENGTH_CAPPED: the real engine's name for what mini_vllm calls FINISHED_LENGTH: max_tokens=16 hit before EOS did. Drop temperature=0, raise max_tokens to 200, and you'll eventually see a stop-token finish instead. That distinction is lab-05.
Now read the source
Open upstream/vllm/v1/engine/core.py:428 (EngineCore.step). Strip the error handling and
batching machinery in your head and you're left with:
scheduler_output = self.scheduler.schedule()                        # "Scheduler: ..." lines
model_output = self.model_executor.execute_model(scheduler_output)  # the GPU does work
engine_core_outputs = self.scheduler.update_from_output(            # counters advance,
    scheduler_output, model_output)                                 # finishes detected
Three calls. That's the engine. Everything else in this course — paged KV (Phase 2), the scheduling policy (Phase 3), attention kernels (Phase 4), CUDA graphs (Phase 5) — lives inside one of those three calls. Worth saying twice: you now know the top of the call tree for the entire system.
While you're in there, trace one level down on each:
- schedule() → upstream/vllm/v1/core/sched/scheduler.py:329: the two-queue loop you'll reimplement in Phase 3 lab-01.
- execute_model() → eventually upstream/vllm/v1/worker/gpu_model_runner.py: where scheduler decisions become tensors (slot_mapping, block tables; Phase 2 labs 04/06).
- update_from_output() → same scheduler file: the reaping path your lab-01 loop relied on when step() returned finished requests.
Hitchhiker's notes
- Why is the very first step slower than all the rest? (Watch the timestamps.) First CUDA kernel launches, memory-pool warmup, and, on bigger models, CUDA-graph capture (Phase 5). Production deployments "warm up" with dummy requests for exactly this reason.
- LLM(...) is the offline wrapper. Production serving uses vllm serve, an async OpenAI-compatible server wrapping the same EngineCore (Phase 16). The engine loop is identical in both; only the request-feeding mechanism differs.
- Log formats drift. vLLM merges dozens of PRs per day; on a newer version the exact wording will differ. The stages won't. Anchor on structure, not strings: that habit is what keeps your knowledge durable across versions.
- Try breaking it. Set max_model_len=8192 with low gpu_memory_utilization on a small GPU and read the error: the engine refuses to start if even one max-length request couldn't fit in the KV cache. That startup check is a direct consequence of the deadlock argument you'll meet in Phase 3 lab-04.
Reflect
- The first step schedules the whole prompt (6 tokens); every later step schedules 1. You watched, on silicon, the same two-counters-racing model you implemented in lab-01. Where did TTFT come from in this run? (Step 1's wall-clock: prefill + first sample.)
- "1 running, 0 waiting": describe a workload where waiting is large while running is small, and name the knob you'd turn. (Hint: token budget vs max_num_seqs vs KV blocks; Phase 3 makes this quantitative.)
- Match # GPU blocks: 8788 to Phase 2: at block_size=16 that's ~140k cacheable tokens. With max_model_len=256, what's the theoretical max concurrency? (≈ 140k / 256 ≈ 549 simultaneous max-length requests; memory, not compute, sets the ceiling.)
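The concurrency estimate in the last bullet can be checked with a few lines of arithmetic. This is a back-of-envelope sketch using only the numbers quoted above; it assumes every request occupies exactly max_model_len tokens and that nothing else consumes KV blocks.

```python
# Back-of-envelope check of the KV-capacity arithmetic quoted above.
num_gpu_blocks = 8788   # from the "# GPU blocks" log line
block_size = 16         # tokens per KV block
max_model_len = 256     # worst-case tokens per request

cacheable_tokens = num_gpu_blocks * block_size
# Blocks one max-length request needs (ceiling division).
blocks_per_request = -(-max_model_len // block_size)
max_concurrency = num_gpu_blocks // blocks_per_request

print(cacheable_tokens)  # 140608
print(max_concurrency)   # 549
```

Counting in whole blocks rather than tokens gives the same answer here because 256 is a multiple of 16; for lengths that are not, the block-based count is the more conservative one.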
References
- upstream/vllm/v1/engine/core.py:428 (EngineCore.step).
- upstream/vllm/v1/core/sched/scheduler.py:329 (Scheduler.schedule).
- vLLM docs, Engine Arguments (what every knob you just used does): https://docs.vllm.ai/en/latest/serving/engine_args.html
- vLLM blog, vLLM V1: A Major Upgrade (Jan 2025) — why the V1 loop looks like this: https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
- Yu et al., Orca (OSDI 2022) — iteration-level scheduling, the reason the log shows per-step decisions: https://www.usenix.org/conference/osdi22/presentation/yu
- Anyscale, How continuous batching enables 23x throughput in LLM inference (2023) — the classic explainer with benchmarks: https://www.anyscale.com/blog/continuous-batching-llm-inference
Lab 01-03 — Rebuild the Engine Step by Hand [CPU-OK]
This is the rite-of-passage lab of Phase 1. In lab-01 you observed the engine loop from
the outside; now you will write it. You get the engine's organs — a Scheduler, a model,
a Sampler — and you must wire them into one working heartbeat:
schedule → execute → sample → update
When your version produces token-for-token identical output to LLMEngine.step (the
tests check exactly that), you will have personally implemented the function that sits at
the top of vLLM's call tree — the one every other phase of this course lives inside.
Contents
- Why this lab exists
- Background: what a "step" really is
- Files
- Run
- What to implement
- The subtle part: who gets to sample?
- What the tests prove
- How this maps to the real engine
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Reading a loop and being able to write it are different levels of knowledge, and the gap
between them is precisely where maintainers are made. Every nontrivial vLLM PR you will ever
review or write touches the contract between these four stages: the scheduler promises the
executor a batch shape; the executor promises the sampler logits in row order; the sampler
promises the scheduler a token per eligible request; update_from_output promises
everyone that the bookkeeping is consistent before the next tick. Bugs live at these seams.
After this lab, the seams are yours.
There's a second, sneakier payoff. By taking the engine apart and reassembling it, you
prove to yourself that LLMEngine contains no magic: it owns its components and runs a
four-line loop. That demystification compounds — when you later read
EngineCore.step upstream and it's 60 lines instead of 4, you'll see immediately that the
extra 56 are batching, async plumbing, and error handling, not new ideas.
Background: what a "step" really is
A step is the engine's clock tick. In one tick:
- schedule: the scheduler looks at every live request and produces a verdict, a map {request_id: n} meaning "compute KV for the next n tokens of this request, this tick." For a fresh short prompt, n = the whole prompt (prefill). For a request mid-generation, n = 1 (decode). For a long prompt under chunked prefill, n = one chunk. The genius of the design is that downstream stages don't care which of those it is.
- execute: the model computes the forward pass for all scheduled tokens of all scheduled requests in one batch. (In mini_vllm the toy model only needs each request's last token + position; the real engine feeds every scheduled token through the transformer and writes their KV into the paged cache, covered in Phase 2.)
- sample: for each request that caught up this tick (more below), turn its logits row into one new token id.
- update: advance num_computed_tokens by n, append sampled tokens, check stop conditions, reap the finished (free their KV, drop from running).
The state at the end of a tick depends only on the state at the start — there is no hidden carry-over. That's why the engine can be paused, traced (lab-01), snapshotted, or driven by a test one tick at a time.
Files
- starter.py: engine_step(scheduler, model, sampler) is stubbed with the full recipe in the docstring. Your work.
- solution.py: reference (mirrors mini_vllm/engine.py::LLMEngine.step).
- test_lab.py: equivalence tests against mini_vllm's own LLMEngine, plus the mid-prefill edge case.
Run
LAB_IMPL=starter pytest phase-01-architecture-and-request-lifecycle/labs/lab-03-engine-step-by-hand -q
pytest phase-01-architecture-and-request-lifecycle/labs/lab-03-engine-step-by-hand -q # reference
What to implement
def engine_step(scheduler: Scheduler, model: ToyModel, sampler: Sampler) -> list[Request]
One full iteration; returns the requests that finished. You'll find the four stages spelled out in the starter docstring. Budget 30–45 minutes; if it takes longer, re-read 02-mini-build.md — the trouble is almost always stage 2.
The subtle part: who gets to sample?
Stage 2 hides the one genuinely subtle decision in the whole loop, and it's the reason this lab has an edge-case test:
A request emits a token this step iff its computed tokens catch up to all its tokens:
num_computed_tokens + n == num_tokens (that's Scheduler.needs_sample).
Why? Sampling token k+1 requires the logits at position k, which require the KV of all
tokens 0..k to exist. A mid-prefill chunk (say, tokens 0–3 of a 12-token prompt) computes
useful KV but leaves the request's tail un-computed — sampling now would be sampling from a
model that hasn't read the whole prompt. It would run without crashing, and it would
produce garbage. This is the classic class of inference bug: silently wrong, not loudly
broken. The test test_mid_prefill_chunk_emits_no_token exists so that if you ever forget
the guard, you find out in 50 ms instead of in production.
(The real engine encodes the same rule via logits_indices — the model runner gathers
logits only at each request's last scheduled position and the sampler only sees rows for
requests that caught up. Different mechanism, identical invariant.)
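To make the catch-up rule concrete, here is a minimal sketch that walks the 12-token prompt from the example through 4-token chunks. The free function needs_sample and the helper walk_prefill are illustrative stand-ins written for this note; they are assumed to mirror the rule stated above, not copied from mini_vllm.

```python
def needs_sample(num_computed_tokens: int, n: int, num_tokens: int) -> bool:
    """True when this step's n tokens bring the request fully up to date."""
    return num_computed_tokens + n == num_tokens


def walk_prefill(prompt_len: int, chunk: int) -> list[tuple[int, int, bool]]:
    """Chunk a prompt and record (start, n, samples?) for each step."""
    steps = []
    computed = 0
    while computed < prompt_len:
        n = min(chunk, prompt_len - computed)
        steps.append((computed, n, needs_sample(computed, n, prompt_len)))
        computed += n
    return steps


for start, n, samples in walk_prefill(prompt_len=12, chunk=4):
    print(f"tokens {start}..{start + n - 1}: sample={samples}")
# tokens 0..3: sample=False
# tokens 4..7: sample=False
# tokens 8..11: sample=True
```

Only the chunk that closes the gap samples. A decode step is the same predicate with n = 1: once the first token is appended, num_tokens is 13 and num_computed_tokens is 12, so needs_sample(12, 1, 13) holds on every later tick.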
What the tests prove
| Test | What it pins |
|---|---|
| test_single_request_matches_reference | Your loop = the engine's loop, simplest case |
| test_batch_matches_reference | Row ordering: logits row i must go to scheduled request i. Shuffle them and tokens cross between requests: the "answer swap" bug that has hit real serving systems |
| test_matches_under_chunked_prefill | Your loop survives n < remaining (chunks) and a tight token budget without changing output |
| test_mid_prefill_chunk_emits_no_token | The needs_sample guard above |
| test_empty_schedule_returns_no_finished | The idle path: an engine with nothing to do must do nothing, gracefully |
The equivalence tests work because everything is deterministic: the toy model's logits are a pure function of (seed, last token, position) and greedy sampling has no randomness. Two engines with the same seed must agree token-for-token — so any disagreement is a bug in your wiring, never noise. Hold on to this technique: determinism turns "looks right" into "provably identical," and it's how vLLM's own correctness tests pin scheduler changes.
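The determinism argument can be sketched in isolation. The block below is not ToyModel; it is a hypothetical stand-in (toy_logits, greedy, and generate are names invented here) that derives pseudo-logits from a hash of (seed, last token, position) and samples greedily, which is enough to show why two runs with the same seed cannot disagree.

```python
import hashlib

VOCAB_SIZE = 16  # arbitrary small vocabulary for the sketch


def toy_logits(seed: int, last_token: int, position: int) -> list[float]:
    """Pseudo-logits that are a pure function of (seed, last_token, position)."""
    logits = []
    for tok in range(VOCAB_SIZE):
        key = f"{seed}:{last_token}:{position}:{tok}".encode()
        digest = hashlib.sha256(key).digest()
        logits.append(int.from_bytes(digest[:4], "big") / 2**32)
    return logits


def greedy(logits: list[float]) -> int:
    """Argmax with a fixed tie-break (lowest index wins), so no randomness."""
    return max(range(len(logits)), key=logits.__getitem__)


def generate(seed: int, prompt: list[int], steps: int) -> list[int]:
    """Greedy decode for a fixed number of steps; returns only the new tokens."""
    tokens = list(prompt)
    for _ in range(steps):
        tokens.append(greedy(toy_logits(seed, tokens[-1], len(tokens) - 1)))
    return tokens[len(prompt):]


run_a = generate(seed=0, prompt=[1, 2, 3], steps=5)
run_b = generate(seed=0, prompt=[1, 2, 3], steps=5)
print(run_a == run_b)  # True
```

Because nothing in the pipeline reads a clock, a global RNG, or hidden state, equality of outputs is a property of the wiring alone, which is exactly what an equivalence test wants to isolate.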
How this maps to the real engine
Side by side with upstream/vllm/v1/engine/core.py:428:
| Your line | Upstream | Notes |
|---|---|---|
| output = scheduler.schedule() | scheduler_output = self.scheduler.schedule() | Identical role; upstream's output also carries block tables & slot mappings (Phase 2) |
| model.forward(last_tokens, positions) | self.model_executor.execute_model(scheduler_output) | Upstream ships the whole batch to GPU workers, possibly across processes/nodes (Phase 10) |
| sampler.sample(logits[i], ...) | inside the model runner: self.sampler(...) | Upstream samples on the GPU, vectorized over the batch (Phase 9) |
| scheduler.update_from_output(...) | self.scheduler.update_from_output(...) | Same name, same job |
Note what upstream does not do differently: the stage order, the catch-up rule, the reaping path. Architecture survives; implementation details scale.
Hitchhiker's notes
- Order within the batch is a contract. rows[i] ↔ logits[i]. In mini_vllm this is a Python list; upstream it's tensor row indices (logits_indices). Either way, the scheduler and the sampler are communicating through positional agreement, one of those invisible contracts that only becomes visible when someone breaks it.
- update_from_output must run even on steps where nothing sampled. Mid-prefill steps still advance num_computed_tokens; that's the whole point of the chunk. If you guard the update behind if sampled:, chunked prefill freezes forever. (Ask us how we know.)
- Why does engine_step take the scheduler rather than creating one? Dependency injection isn't ceremony here: the tests hand you an engine's organs precisely so they can compare your loop against the engine that owns them. Upstream is shaped the same way for the same reason: EngineCore receives its executor, which is what lets tests swap in fakes.
- The model is fake; the bookkeeping is real. ToyModel produces deterministic pseudo-logits and ignores KV contents. Everything you wired (scheduling verdicts, catch-up sampling, reaping) is faithful. This split (real control plane, toy data plane) is the course's core trick, and it's also how you should unit-test engine changes upstream: the control plane rarely needs a GPU to be proven correct.
Going further
- Add a callback(step_idx, output, sampled) parameter and rebuild lab-01's trace using your own step function. Observability-as-a-hook is exactly how vLLM's stat loggers attach to the loop.
- Break the row-order contract on purpose (reverse rows but not logits) and watch which test catches it and how. The failure is instructive: outputs are plausible-looking tokens, just the wrong ones.
- Time 1000 steps of a decode-only batch at batch sizes 1, 8, 64 (time.perf_counter). Even on CPU with a toy model you'll see per-step overhead amortize with batch size, a small-scale preview of why batching is the first lever of throughput (Phase 18).
References
- mini_vllm/engine.py: the LLMEngine.step you are reimplementing.
- upstream/vllm/v1/engine/core.py:428 (EngineCore.step).
- upstream/vllm/v1/worker/gpu_model_runner.py: where execute/sample happen for real; search for logits_indices to find the catch-up rule's production form.
- Yu et al., Orca (OSDI 2022), §4, "iteration-level scheduling": the paper that first made "one step of everyone" the unit of work. https://www.usenix.org/conference/osdi22/presentation/yu
- vLLM blog, vLLM V1: A Major Upgrade — the rewrite that flattened the engine loop into the shape you just built: https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
Lab 01-04 — Watch the Batch: Continuous Batching Made Visible [CPU-OK]
One request's lifecycle (lab-01) is a nice story. But inference engines earn their living when many requests share the machine — and the way they share it is the single biggest throughput idea of the last few years: continuous batching. In this lab you'll instrument the engine to photograph the batch composition of every step — who got scheduled, for how many tokens — and you'll directly observe the thing the famous benchmark posts only describe: prefill chunks of one request riding in the same step as decodes of another.
Contents
- Why this lab exists
- Background: static vs continuous batching
- Files
- Run
- What to implement
- What you should see — the full trace, explained
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
This lab is Phase 3 knocking on the door early — on purpose. The scheduler is easier to implement (Phase 3 lab-01) after you've seen its decisions laid out step by step. More practically: per-step batch composition is the engine's most important hidden variable. The wall-clock time of a step is roughly proportional to the tokens scheduled in it, so the sequence of dicts you're about to record is, up to a constant, the latency profile of the server. Spiky dicts = spiky inter-token latency. When Phase 3 lab-05 measures chunked prefill's effect on decode latency, it will use exactly the probe you build here.
You'll also learn the instrumentation pattern itself: wrapping a component's method to observe a system without changing its behavior. That's how vLLM's own stat loggers attach to the engine, and how you'll debug schedulers for the rest of your career — schedulers rarely crash; they just quietly make bad batches. You can't grep for a bad batch. You have to look at it.
Background: static vs continuous batching
The old way (pre-Orca, ~2022): collect N requests, run them as a unit until all N finish, then take the next N. Two disasters hide in that sentence. First, requests finish at different times, and finished slots sit idle while the longest request drags on (the "convoy"). Second, a request arriving one millisecond after the batch launched waits an entire batch lifetime to even start. GPU utilization graphs of static-batched servers look like a comb: bursts of work, then teeth-gaps of idle.
The Orca insight (OSDI 2022), which vLLM adopted and the whole industry copied: rebuild the batch every step. A request can join the batch at any step boundary (its prefill just becomes part of that step) and leave at any step boundary (its slot is free next step, not at end-of-batch). The batch isn't a unit of work anymore — it's whatever the scheduler composed for this one tick. Anyscale's benchmark of this idea measured up to 23× throughput over static batching. That entire revolution is visible in the data structure you're about to record: consecutive dicts whose key sets grow and shrink while the engine never stops.
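The convoy effect is easy to quantify with a toy model. The sketch below is an illustration under simplifying assumptions, not a benchmark: each request costs one step per output token, prefill is ignored, and both function names are invented for this note.

```python
from collections import deque


def static_batching_steps(output_lens: list[int], batch_size: int) -> int:
    """Each batch runs until its longest member finishes."""
    total = 0
    for i in range(0, len(output_lens), batch_size):
        total += max(output_lens[i:i + batch_size])
    return total


def continuous_batching_steps(output_lens: list[int], batch_size: int) -> int:
    """Slots are refilled from the queue at every step boundary."""
    waiting = deque(output_lens)
    running: list[int] = []  # remaining decode steps per active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; finished ones leave now.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


lens = [2, 8, 3, 7]
print(static_batching_steps(lens, batch_size=2))      # 15
print(continuous_batching_steps(lens, batch_size=2))  # 12
```

In the static case the 2-token request's slot idles for six steps behind the 8-token one; in the continuous case the third request takes that slot at step 3. The gap widens as output lengths become more skewed.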
Files
- starter.py: implement trace_batches (engine + probe). Your work.
- solution.py: reference.
- test_lab.py: pins step-1 composition, token conservation, budget cap, deferral under a tight budget, and the existence of mixed prefill+decode steps.
Run
LAB_IMPL=starter pytest phase-01-architecture-and-request-lifecycle/labs/lab-04-watch-the-batch -q
pytest phase-01-architecture-and-request-lifecycle/labs/lab-04-watch-the-batch -q # reference
What to implement
def trace_batches(prompts, max_tokens=4, **engine_kwargs)
-> tuple[list[str], list[dict[str, int]]]
Add all prompts (greedy, ignore_eos=True), then run eng.step() to completion — but
first, wrap eng.scheduler.schedule with a closure that calls the original, appends a
copy of out.num_scheduled_tokens to your trace, and returns out unchanged. The probe
must be invisible: same engine behavior with or without it. (Copy the dict! The scheduler
gives you its own object; aliasing it is the kind of bug that produces a trace where every
step mysteriously looks like the last one.)
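The wrap-and-copy pattern, and the aliasing failure it guards against, can be shown on a stand-in. ToyScheduler and ToyOutput below are hypothetical classes written for this sketch; the toy deliberately reuses one dict across steps to make the aliasing bug visible, and nothing here claims the real scheduler does so.

```python
class ToyOutput:
    def __init__(self, num_scheduled_tokens: dict[str, int]):
        self.num_scheduled_tokens = num_scheduled_tokens


class ToyScheduler:
    """Stand-in scheduler that replays a fixed plan through one shared dict."""

    def __init__(self, plan: list[dict[str, int]]):
        self._plan = list(plan)
        self._shared: dict[str, int] = {}

    def schedule(self) -> ToyOutput:
        self._shared.clear()
        if self._plan:
            self._shared.update(self._plan.pop(0))
        return ToyOutput(self._shared)


def attach_probe(scheduler: ToyScheduler, copy: bool = True) -> list[dict[str, int]]:
    """Wrap scheduler.schedule so each call also records its verdict."""
    trace: list[dict[str, int]] = []
    original = scheduler.schedule  # bound method captured before patching

    def probed() -> ToyOutput:
        out = original()
        verdict = out.num_scheduled_tokens
        trace.append(dict(verdict) if copy else verdict)
        return out  # returned untouched: the probe must be invisible

    scheduler.schedule = probed
    return trace


plan = [{"A": 8}, {"A": 3, "B": 5}]

good = ToyScheduler(plan)
good_trace = attach_probe(good, copy=True)
good.schedule(); good.schedule()

bad = ToyScheduler(plan)
bad_trace = attach_probe(bad, copy=False)
bad.schedule(); bad.schedule()

print(good_trace)  # [{'A': 8}, {'A': 3, 'B': 5}]
print(bad_trace)   # [{'A': 3, 'B': 5}, {'A': 3, 'B': 5}]
```

The uncopied trace holds two references to the same object, so every entry shows the latest step. Copying costs one small dict per step and removes the question of who owns the data.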
What you should see — the full trace, explained
Two prompts — A = "hello world" (11 tokens) and B = "goodbye" (7 tokens),
max_tokens=4 — with a tight budget of max_num_batched_tokens=8:
step 1: {A: 8} # A's prefill, CHUNKED to the budget. B is NOT admitted: budget spent.
step 2: {A: 3, B: 5} # A finishes prefill (3 left) + samples token 1.
# B finally admitted with the leftover budget: 8-3=5 of its 7.
step 3: {A: 1, B: 2} # ← THE MONEY STEP: A is decoding (1 token) while B is still
# prefilling (its last 2) — prefill and decode IN THE SAME BATCH.
step 4: {A: 1, B: 1} # both decoding.
step 5: {A: 1, B: 1} # ...
step 6: {B: 1} # A hit max_tokens and left; B has the machine to itself.
Read it like a maintainer:
- Step 1 is {A: 8}, not {A: 11}: the budget (8) caps the step, so the scheduler takes the first 8 tokens of A's prompt and stops. Nothing special-cased: n = min(remaining, budget). And B isn't admitted at all, because admission requires leftover budget. B is spending this step in the WAITING queue; this is the queueing delay that lab-01 promised you'd see under load.
- Step 2 is where continuous batching starts paying: A's last chunk and B's first chunk share a step. A static-batch engine cannot produce this step; it doesn't have a concept for "half of A and half of B."
- Step 3 is the signature, min=1 and max>1 in the same dict: a decode and a prefill chunk co-scheduled. The test test_mixed_batches_exist_under_load hunts for exactly this shape. On a GPU this mixing is also an efficiency trick: decode alone underuses compute (it is bandwidth-bound), prefill alone stalls running decodes and hurts their latency; mixed batches fill the compute bubbles with prefill work (Sarathi's "piggybacking", Phase 3).
- Step 6's shrinking key set: A finished and was reaped mid-flight; B never noticed. Its slot is reusable immediately. That, and nothing more, is "continuous."
- Conservation check: sum A's numbers, 8+3+1+1+1 = 14 = 11 + 4 − 1. Each request's scheduled tokens total prompt + max_tokens − 1. Why −1? The final sampled token is appended and the request immediately finishes; its KV is never computed, because no further token will ever attend to it. The engine doesn't do work the future won't read. When a counter is off by one in a scheduler, this is the kind of identity you use to find it; that's why there's a test pinning it.
Rerun with the default roomy budget (2048) and the drama disappears: step 1 is
{A: 11, B: 7}, everything after is decodes. Scheduling is only interesting under
scarcity — keep that in mind when building benchmarks, or you'll "validate" a scheduler
on workloads that never exercise it.
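The trace above can be reproduced by a toy FCFS scheduler in a few dozen lines. This is a sketch under stated assumptions, not mini_vllm's scheduler: ToyRequest and simulate are invented names, running requests are served before new admissions, and the only constraint modeled is the per-step token budget.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    name: str
    prompt_len: int
    max_tokens: int
    num_computed: int = 0
    num_output: int = 0

    @property
    def num_tokens(self) -> int:
        return self.prompt_len + self.num_output


def simulate(requests: list[ToyRequest], budget: int) -> list[dict[str, int]]:
    waiting = list(requests)
    running: list[ToyRequest] = []
    trace = []
    while waiting or running:
        left, step = budget, {}
        # Running requests first (FCFS), then admit from the waiting queue.
        for r in running:
            if left == 0:
                break
            step[r.name] = n = min(r.num_tokens - r.num_computed, left)
            left -= n
        while waiting and left > 0:
            r = waiting.pop(0)
            running.append(r)
            step[r.name] = n = min(r.num_tokens - r.num_computed, left)
            left -= n
        # Update: advance counters, sample only if caught up, reap finished.
        for r in list(running):
            n = step.get(r.name, 0)
            caught_up = n > 0 and r.num_computed + n == r.num_tokens
            r.num_computed += n
            if caught_up:
                r.num_output += 1
                if r.num_output >= r.max_tokens:
                    running.remove(r)
        trace.append(step)
    return trace


trace = simulate([ToyRequest("A", 11, 4), ToyRequest("B", 7, 4)], budget=8)
for i, step in enumerate(trace, 1):
    print(f"step {i}: {step}")
```

Running it prints the six steps listed earlier, from {'A': 8} through {'B': 1}. Summing each request's column gives 14 for A and 10 for B, which is the prompt + max_tokens − 1 identity for both.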
What the tests prove
| Test | Invariant |
|---|---|
| test_ample_budget_prefills_everyone_in_step_one | With budget to spare, admission is immediate: queueing is a scarcity phenomenon, not a constant tax |
| test_token_conservation_per_request | Σ scheduled = prompt + max_tokens − 1, the off-by-one identity above |
| test_budget_is_never_exceeded | Σ over the batch ≤ max_num_batched_tokens, every single step: the engine's load-bearing promise to the GPU's latency |
| test_tight_budget_chunks_and_defers | The exact step-1/step-2 composition above: chunking + deferred admission |
| test_mixed_batches_exist_under_load | A prefill chunk and a decode co-exist in one step |
Hitchhiker's notes
- The probe pattern beats print-debugging schedulers. You get structured data you can assert on, diff between runs, and plot. The real engine's equivalent surface is SchedulerOutput (upstream vllm/v1/core/sched/output.py); when debugging real vLLM, logging num_scheduled_tokens per step gives you this exact trace.
- Why does B wait a whole step when the budget is spent? Could the scheduler give A 7 and B 1 instead of A 8? It could, but FCFS says finish admitting A's work first; fairness policies are a deep rabbit hole (priority scheduling lands in Phase 3's exercises). The shape to remember: policy decides who, budget decides how much, and they're separable concerns in the code.
- Step time ∝ scheduled tokens is a good first-order model but not exact on real hardware: a decode-only step pays memory-bandwidth costs that token-count alone doesn't capture, and tiny steps pay fixed launch overheads (which CUDA graphs attack; see Phase 5). Phase 18 refines the model; the trace you built stays the right raw material.
- Request IDs are global. mini_vllm numbers requests with a module-level counter, so don't hardcode req-0 in your own experiments; use the ids trace_batches returns. The tests are written that way for exactly this reason.
Going further
- Plot the trace: steps on x, stacked bars of scheduled tokens per request. You've recreated the iconic continuous-batching diagram from the Orca paper and the Anyscale post, except yours is measured, not illustrated.
- Sweep max_num_batched_tokens from 4 to 64 over the same prompts and plot total steps vs budget. You'll see a hyperbola flatten: past "everything fits," more budget buys nothing. Congratulations, you've found a saturation knee; Phase 18 is full of these.
- Add 8 requests with staggered arrival (add two, step twice, add two more …). Watch key sets churn. This is what a production batch actually looks like: a rolling membership, no two steps alike.
References
- Yu et al., Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) — iteration-level scheduling, the idea this lab photographs: https://www.usenix.org/conference/osdi22/presentation/yu
- Anyscale, How continuous batching enables 23x throughput in LLM inference (2023) — the benchmark post that made this mainstream: https://www.anyscale.com/blog/continuous-batching-llm-inference
- Agrawal et al., Sarathi-Serve: Taming Throughput-Latency Tradeoff in LLM Inference (OSDI 2024) — why mixed prefill+decode batches are not just legal but desirable: https://arxiv.org/abs/2403.02310
- upstream/vllm/v1/core/sched/output.py (SchedulerOutput): the real engine's version of the dicts you recorded.
- upstream/vllm/v1/core/sched/scheduler.py:329, the loop that composed every step you traced; you implement its core in Phase 3 lab-01.
Lab 01-05 — Stop Conditions & Finish Reasons [CPU-OK]
Every request dies. The only questions are when and what we tell the user about it. This
lab dissects the engine's stop machinery — the few lines of update_from_output that decide
whether a generation halts on the model's own EOS token or on the operator's max_tokens
cap — and the mapping from internal status to the finish_reason field that every OpenAI
API consumer in the world branches on.
It looks small. It is small. It is also the part of the engine with the highest bug-impact-to-code-size ratio: an off-by-one or a mis-ordered check here doesn't crash — it silently truncates answers, or streams one token too many, for every user, forever.
Contents
- Why this lab exists
- Background: the three ways a request ends
- Files
- Run
- What to implement
- The edge case the tests are really about
- What the tests prove
- How this maps to the real engine
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Ask anyone who's run an LLM API in production what their most common user-facing bug
report is. It won't be a crash. It will be: "the answer just… cuts off." Triaging that
report requires knowing exactly what you'll know after this lab: was it finish_reason: "length" (the operator's cap — raise max_tokens), "stop" (the model chose to end — a
prompting issue), or a stream that died without a reason (an actual bug)? The distinction is
three enum values and two if statements, and entire support rotations have burned days for
lack of it.
There's an engineering lesson too. Stop handling is where model behavior (EOS is just a
token the model can emit, with a probability like any other) meets system policy
(max_tokens is an admission-control and billing boundary). Keeping those two cleanly
separated — and correctly ordered — is a miniature of the whole serving-systems
discipline.
Background: the three ways a request ends
1. The model stops itself: it samples the EOS (end-of-sequence) token. EOS is not magic: it's a vocabulary entry (id 256 in mini_vllm's ByteTokenizer; id 2 for Llama 2; <|endoftext|> = 50256 for GPT-2) that the model learned to emit when a response is complete. The engine checks "was the token just appended the EOS?" and if so marks FINISHED_STOPPED → API finish_reason: "stop". A well-behaved model ends most chat turns this way.
2. The operator stops it: num_output_tokens >= max_tokens. Marked FINISHED_LENGTH → API finish_reason: "length". To an API consumer this usually means "your answer was truncated; consider a bigger budget." To the operator it's the lever that bounds worst-case cost and KV occupancy per request; schedulers need a worst case to exist (the deadlock argument coming in Phase 3 lab-04 depends on it).
3. Someone aborts it: client disconnect, admin action. Real vLLM has FINISHED_ABORTED for this; mini_vllm omits it (no clients to disconnect). Worth knowing it exists: cancellation is a first-class lifecycle path in production, and "KV freed on abort" is a real invariant people have broken.
And one anti-way that trips newcomers: ignore_eos=True (used throughout this course's
tests, and by every serious benchmark) disables check #1, so generation always runs to the
cap. Why would anyone want a model to blow through its own stop sign? Benchmarking. If
you're measuring tokens/sec, you need every request to produce a known, fixed number of
tokens regardless of what the model "wants" to say. The flag exists for load generators,
not users — and you've been benefiting from it since lab-01 without noticing: it's what made
your traces deterministic in length.
Files
- starter.py: implement finish_reason (status → API string) and run_until_stop (the feed-tokens-until-something-fires simulation of the update stage). Your work.
- solution.py: reference.
- test_lab.py: the EOS path, the ignore_eos path, the length path, the boundary tie, the unfinished case, and an end-to-end engine check.
Run
LAB_IMPL=starter pytest phase-01-architecture-and-request-lifecycle/labs/lab-05-stop-conditions -q
pytest phase-01-architecture-and-request-lifecycle/labs/lab-05-stop-conditions -q # reference
What to implement
Two functions. finish_reason(request) is the status-to-API translation table.
run_until_stop(token_stream, eos_token_id, sampling_params) replays the engine's update
stage with pre-decided tokens: append one, run maybe_finish(), break if it fired. Using
a scripted token stream instead of a sampler is the trick that makes stop logic exhaustively
testable — you can place an EOS at any position you like, including exactly on the
max_tokens boundary, something you could wait a long time for a sampler to do for you.
(This is also how you should test stop-sequence handling upstream: script the stream, pin
the behavior.)
The edge case the tests are really about
What should happen when the model emits EOS exactly at the max_tokens boundary? Both
conditions are true simultaneously. Look at mini_vllm/request.py::maybe_finish: the EOS
check runs first, so the request reports "stop". That ordering is a deliberate,
user-visible API decision, not an accident of code layout: "stop" tells the consumer "the
answer is complete"; "length" tells them "the answer was cut off — maybe retry with a
bigger budget." On the boundary, the answer is complete — reporting "length" would
invite pointless retries (and with auto-retrying clients, real money). Real vLLM resolves
the tie the same way.
test_eos_on_the_boundary_reports_stop pins this. If someone "tidies up" maybe_finish by
reordering the checks, that test fails — which is the whole job of a test like that: turning
an invisible design decision into a tripwire. Notice the meta-lesson: whenever a function
checks two conditions that can be true at once, the order is an API. Grep any engine
you maintain for such pairs; most of them are untested.
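The tie-break is easiest to see with a scripted stream. The sketch below uses simplified, hypothetical signatures (Status, check_stop, and replay are invented for this note and take plain arguments instead of a request or sampling_params object); it only assumes the ordering described above, EOS first and the cap second.

```python
from enum import Enum

EOS = 256  # the ByteTokenizer EOS id mentioned earlier


class Status(Enum):
    RUNNING = "running"
    FINISHED_STOPPED = "stop"
    FINISHED_LENGTH = "length"


def check_stop(output_token_ids, eos_token_id, max_tokens, ignore_eos=False):
    """Status after the newest token. The EOS check runs before the cap."""
    if not ignore_eos and output_token_ids[-1] == eos_token_id:
        return Status.FINISHED_STOPPED
    if len(output_token_ids) >= max_tokens:
        return Status.FINISHED_LENGTH
    return Status.RUNNING


def replay(token_stream, eos_token_id, max_tokens, ignore_eos=False):
    """Feed scripted tokens one at a time until a stop condition fires."""
    output = []
    status = Status.RUNNING
    for tok in token_stream:
        output.append(tok)
        status = check_stop(output, eos_token_id, max_tokens, ignore_eos)
        if status is not Status.RUNNING:
            break
    reason = None if status is Status.RUNNING else status.value
    return output, reason


print(replay([10, 20, EOS, 30], EOS, max_tokens=8))  # ([10, 20, 256], 'stop')
print(replay([10, 20, EOS], EOS, max_tokens=3))      # ([10, 20, 256], 'stop')
print(replay([10, 20, 30, 40], EOS, max_tokens=3))   # ([10, 20, 30], 'length')
```

The second call is the boundary case: both predicates are true on the third token, and the order of the two if statements is the only thing that decides what the client sees. Swapping them flips that result to 'length' while leaving the other two unchanged.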
What the tests prove
| Test | What it pins |
|---|---|
| test_eos_stops_generation | Tokens after EOS are never generated: the stream truly halts |
| test_ignore_eos_runs_to_length_cap | ignore_eos neutralizes check #1 only; the cap still binds |
| test_no_eos_hits_length_cap | The cap fires at exactly max_tokens, not ±1 |
| test_eos_on_the_boundary_reports_stop | The tie-break above |
| test_unfinished_request_has_no_reason | WAITING/RUNNING → None: a streaming response must not carry a finish_reason until the end |
| test_engine_reports_length_with_ignore_eos | Your mapper agrees with the engine's real loop, end to end |
How this maps to the real engine
- upstream/vllm/v1/request.py (RequestStatus and get_finished_reason()): the same mapping you wrote, plus FINISHED_ABORTED → "abort". Note upstream encodes "is finished" as an ordering on the enum (status > PREEMPTED); mini_vllm copies that trick, which is why the enum's declaration order is load-bearing in both. (A reordered enum constant breaking is_finished is exactly the kind of PR a maintainer learns to catch on sight.)
- upstream/vllm/v1/engine/output_processor.py: where statuses become the finish_reason strings in API responses, including for streaming (sent only on the final chunk; your None-until-finished mapping is what makes that correct).
- The real engine checks more stops than these two: stop strings (must be detected on detokenized text, which means stop handling interacts with the detokenizer's streaming buffer, a genuinely tricky area), stop_token_ids (per-request custom EOS lists), and min_tokens (suppress EOS before a floor, the mirror image of ignore_eos). Each is the same shape you built: a predicate over the request's tail, checked in a defined order, in update_from_output. When you read that upstream code now, it will parse as "lab-05, four more times."
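Why declaration order is load-bearing can be shown with two toy enums. Neither is the upstream RequestStatus nor the mini_vllm one; the member lists are assumptions trimmed for illustration, and is_finished takes the PREEMPTED member as an argument only so the same helper works on both.

```python
from enum import IntEnum


class ToyStatus(IntEnum):
    # Declaration order encodes the lifecycle: everything after PREEMPTED is final.
    WAITING = 0
    RUNNING = 1
    PREEMPTED = 2
    FINISHED_STOPPED = 3
    FINISHED_LENGTH = 4


class ReorderedStatus(IntEnum):
    # Same members after a well-meaning alphabetical "tidy-up".
    FINISHED_LENGTH = 0
    FINISHED_STOPPED = 1
    PREEMPTED = 2
    RUNNING = 3
    WAITING = 4


def is_finished(status: IntEnum, preempted: IntEnum) -> bool:
    """The ordering trick: a status is final iff it sorts after PREEMPTED."""
    return status > preempted


print(is_finished(ToyStatus.FINISHED_LENGTH, ToyStatus.PREEMPTED))               # True
print(is_finished(ToyStatus.RUNNING, ToyStatus.PREEMPTED))                       # False
print(is_finished(ReorderedStatus.FINISHED_STOPPED, ReorderedStatus.PREEMPTED))  # False (wrong)
print(is_finished(ReorderedStatus.RUNNING, ReorderedStatus.PREEMPTED))           # True (wrong)
```

No line of is_finished changed between the two cases, which is why this class of regression passes a casual diff review and needs a test that enumerates every status.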
Hitchhiker's notes
- EOS consumes a token of budget. In mini_vllm (and in token accounting generally) the EOS lands in output_token_ids: your test_eos_stops_generation result was [10, 20, EOS], three tokens spent. APIs differ on whether the EOS is shown (vLLM strips it from text but it exists in the token count). If you've ever wondered why an API bills N+1 tokens for an N-token answer, this is why.
- max_tokens counts output, not total. Prompt length lives in a different limit (max_model_len, which caps prompt + output together). Conflating the two produces classic admission bugs: a request with a 4000-token prompt and max_tokens=200 needs 4200 tokens of headroom, and the real scheduler must reserve for the worst case, not the average.
- Greedy + a real model can loop forever ("the the the…"), and without a length cap, that request never finishes, never frees its KV, and slowly strangles the server. The cap isn't a UX nicety; it's the engine's guarantee that every admission terminates. Treat any proposal of "unlimited max_tokens" as what it is: a resource-leak feature request.
- Sampling parameters can make EOS unreachable in subtler ways than ignore_eos: a logit_bias of −∞ on the EOS id, or min_tokens while the output is still below the floor. The stop machinery composes with the sampler (Phase 9); when stops "mysteriously" don't fire, the sampler is suspect #1.
Going further
- Add stop-string support to run_until_stop: decode the accumulated output with ByteTokenizer after each token and halt when a given string appears. You'll immediately hit the real-world wrinkle: the stop string can straddle a token boundary, so you must check a sliding window of recent text, not just the newest fragment. Now read how upstream solves it (search stop in output_processor.py) and admire the buffering.
- Implement min_tokens: suppress the EOS check while num_output_tokens < min_tokens. One line. Then write the boundary test for it (EOS exactly at min_tokens); you know the drill now.
- In real vLLM, run a chat model and print finish_reason for: a normal question, the same with max_tokens=5, and the same with ignore_eos=True. Watch "stop", "length", "length" come back: your three paths, on production silicon.
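For the first item, the straddling problem and the sliding-window fix can be sketched on plain strings. This is one possible approach under stated assumptions, not upstream's implementation: find_stop is a hypothetical helper, and the fragments stand in for whatever text each new token decodes to.

```python
def find_stop(fragments: list[str], stop: str) -> tuple[str, int]:
    """Return (text before the stop string, fragments consumed).

    Only the tail that the newest fragment could have completed is searched:
    the fragment itself plus the len(stop) - 1 characters before it.
    """
    if not stop:
        raise ValueError("stop string must be non-empty")
    text = ""
    for i, frag in enumerate(fragments):
        text += frag
        window_len = len(frag) + len(stop) - 1
        window = text[-window_len:] if window_len else ""
        idx = window.find(stop)
        if idx != -1:
            cut = len(text) - len(window) + idx
            return text[:cut], i + 1
    return text, len(fragments)


# The stop string "</s>" is split across the third and fourth fragments.
pieces = ["Hello", " wor", "ld</", "s>", " ignored"]
print(find_stop(pieces, "</s>"))         # ('Hello world', 4)
print(find_stop(["ab", "cd"], "xyz"))    # ('abcd', 2)
```

Checking only the newest fragment would miss the first case entirely, since neither "ld</" nor "s>" contains the stop string. A streaming server has one more obligation this sketch skips: it must hold back the last len(stop) − 1 characters until it knows they are not the start of a match.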
References
- mini_vllm/request.py (maybe_finish()): the eight lines this lab is about.
- upstream/vllm/v1/request.py (RequestStatus.get_finished_reason).
- upstream/vllm/v1/engine/output_processor.py: stop strings, streaming finish_reason.
- OpenAI API reference, Chat Completions (the finish_reason contract your mapper implements): https://platform.openai.com/docs/api-reference/chat/object
- vLLM docs, Sampling Parameters (stop, stop_token_ids, min_tokens, ignore_eos): https://docs.vllm.ai/en/latest/api/inference_params.html
Phase 01 — Exercises: Architecture & Request Lifecycle
Warm-up (explain)
1. Name the four stages of EngineCore.step and the course phase that owns each.
2. What's the difference between LLM and AsyncLLM? What do they share?
3. List the objects a request becomes: prompt → ? → ? → ? → RequestOutput.
Core (trace the code)
4. In EngineCore.step (core.py:428), which stage can return None, and what is called then?
5. Who owns the GPU: Executor, Worker, or ModelRunner? What does each do?
6. Why does V1 run EngineCore in its own process? What crosses the boundary?
Build (your lab)
- In lab-01, at which step does
num_computed_tokensfirst equal the prompt length, and why? - Extend
trace_requestto trace two requests at once; observe how the scheduler interleaves them across steps (continuous batching, Phase 3). - Add a
WAITINGsnapshot (before the first schedule) to your trace. Why is there usually only one WAITING tick for a lone request on an idle engine?
Design (staff-level)
- A user reports high TTFT but normal ITL. Which stage(s) of
stepwould you investigate, and which phase's knobs (2/3/5) would you reach for? - You're asked to add a new API surface (e.g. a gRPC endpoint). Which layer do you build it at, and what must it produce/consume to reuse the existing core unchanged?
- Explain why detokenization runs off the core's hot path in the server. What would break if it
ran inside
EngineCore.step?
Self-grading
4–6 and 10–12 are interview-grade. Could you draw the full request journey and name every file? If not, re-read 01-deep-dive.md §"The whole journey, named".
Phase 01 — Interview Questions: Architecture & Request Lifecycle
Q1. Walk me through what happens between LLM.generate(prompt) and the first token.
Model answer
generate tokenizes the prompt and builds an EngineCoreRequest; add_request wraps it in a
Request and enqueues it in the scheduler. Then the engine loops EngineCore.step: the
scheduler picks the request and how many tokens to compute (the whole prompt, as prefill), the
executor runs the model on the assembled batch via a worker/model-runner, the sampler produces
the first token, and update_from_output advances num_computed_tokens and records the token.
The output processor detokenizes and returns/streams it. (llm.py:422 → core.py:428.)
Q2. What are the four stages of the engine step?
Model answer
schedule() (who runs, how many tokens — Phase 3), execute_model() (run the forward pass on a
worker/model-runner — Phases 4–14), sample_tokens() (pick the next token — Phase 9), and
update_from_output() (advance counters, reap finished requests — Phase 3). Everything in vLLM is
a deep dive into one of these. (core.py:428.)
Q3. Executor vs Worker vs ModelRunner — who does what?
Model answer
The Executor (v1/executor/) is the engine's handle to compute; it owns one Worker for
single-GPU or N for tensor/pipeline parallel and fans execute_model out to them. A Worker
(gpu_worker.py) owns one GPU's device, model shard, and KV cache. The ModelRunner
(gpu_model_runner.py) turns a SchedulerOutput into input tensors + attention metadata, runs
the (CUDA-graphed) forward pass, and runs the sampler. This indirection is why the same engine
runs on 1 or 64 GPUs — only the Executor changes.
Q4. Why does V1 isolate EngineCore in its own process?
Model answer
To keep the tight GPU scheduling loop off the API server's event loop and free of GIL contention
with HTTP handling and detokenization, and to cleanly coordinate multi-GPU worker processes.
Requests cross the boundary as serialized EngineCoreRequests and results as EngineCoreOutputs;
output processing/detokenization runs server-side so it never stalls the core. (EngineCoreProc,
core.py:835.)
Q5. How do offline batch and online serving share code?
Model answer
Both are thin shells over the same EngineCore. LLM/LLMEngine is the synchronous batch shell
(add_request + pump step), AsyncLLM is the async/streaming shell for the OpenAI server. The
scheduling, execution, and sampling are identical; only the entry/exit (sync vs async, full
result vs streamed deltas) differ.
Rapid-fire
- Offline entry point? LLM.generate (llm.py:422).
- Online entry point? AsyncLLM behind the OpenAI server.
- The heartbeat? EngineCore.step (core.py:428).
- Object the scheduler operates on? Request (with status + counters).
- What update_from_output does? Advance num_computed_tokens, append tokens, reap finished.
Phase 01 — Cheatsheet: Architecture & Request Lifecycle
The journey
LLM.generate / AsyncLLM -> EngineCore.step (loop) -> Detokenizer -> RequestOutput
step = schedule (Ph3) -> execute_model (Ph4-14) -> sample (Ph9) -> update_from_output (Ph3)
Entry points
- Offline: LLM(model=...).generate(prompts) (entrypoints/llm.py:422) → LLMEngine.
- Online: OpenAI server → AsyncLLM (v1/engine/async_llm.py). Same core, async + streaming.
The compute chain
EngineCore → Executor (1 or N workers) → Worker (owns one GPU) → ModelRunner
(gpu_model_runner.py: SchedulerOutput → tensors → forward → sample).
Objects that flow
prompt+SamplingParams → EngineCoreRequest → Request (counters+status) → SchedulerOutput
→ ModelRunnerOutput → RequestOutput.
Lifecycle
WAITING → RUNNING → FINISHED_* ; PREEMPTED → WAITING (Phase 3). RequestStatus (request.py:315).
Process model
EngineCore runs in its own process (EngineCoreProc, core.py:835) — tight loop off the GIL;
detokenization runs server-side, off the hot path.
Key upstream
- entrypoints/llm.py:422 generate · v1/engine/llm_engine.py:209/287 add_request/step
- v1/engine/core.py:428 step · :337 add_request · :835 EngineCoreProc
- v1/engine/async_llm.py AsyncLLM · v1/worker/gpu_model_runner.py the runner
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md
Phase 02 — The Hitchhiker's Guide to PagedAttention ⭐
← Phase 01 · Course home · Phase 03 →
This is a flagship phase — written in full. Use it as the template for the depth every other phase aims at.
Contents
- Don't Panic
- Step 1: Why is memory the problem at all?
- Step 2: The old way, and why it bled memory
- Step 3: The fix — pages (blocks)
- Step 4: The bonus that falls out — sharing
- Step 5: The data structures you're about to meet
- The four invariants (memorize these)
- What you'll do in this phase
Don't Panic
Here is the entire idea, in one breath:
The KV cache is the model's memory of the conversation so far. Naively, you'd give each request one big contiguous slab of GPU memory to hold it. PagedAttention instead chops the KV cache into fixed-size blocks (like an operating system chops memory into pages) and lets each request's blocks live anywhere in GPU memory, tracked by a little block table. That one change (contiguous slab → scattered pages) is why vLLM serves several times more requests per GPU than the systems that came before it.
If you have ever learned how an OS gives processes "virtual memory" backed by scattered physical pages, you already understand PagedAttention. It is literally that idea, applied to the KV cache. The vLLM paper's title even says so: "Efficient Memory Management for Large Language Model Serving with PagedAttention."
Take a breath. By the end of this phase you will have written a working paged block
allocator yourself (mini_vllm/block_pool.py) and read the real one
(upstream/vllm/v1/core/block_pool.py) line by line.
Step 1: Why is memory the problem at all?
Recall from Phase 0: during generation, the model caches a Key and Value vector for every token it has seen, in every layer. This is the KV cache. It is enormous and it grows as the conversation gets longer.
A rough size for one sequence:
kv_bytes_per_token = 2 (K and V) × num_layers × num_kv_heads × head_dim × dtype_bytes
For Llama-3-8B (32 layers, 8 KV heads, head_dim 128, fp16) that's about:
2 × 32 × 8 × 128 × 2 ≈ 131 KB per token
A 2,000-token conversation is ~256 MB of KV — for one user. On a 24 GB GPU, after the ~16 GB of weights, you have ~8 GB for KV — maybe ~30 such conversations. Memory, not compute, is what caps how many users you can serve. So how you manage that memory is the whole ballgame.
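The arithmetic above is easy to re-run for any model. The sketch below plugs in the Llama-3-8B numbers from the text; the function name is just illustrative, and the 8 GB KV budget is the same rough assumption the paragraph makes.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache one token occupies across all layers (K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Llama-3-8B in fp16: 32 layers, 8 KV heads, head_dim 128, 2 bytes per value.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
per_conversation = per_token * 2000          # one 2,000-token conversation
kv_budget = 8 * 10**9                        # assumed ~8 GB left after weights
max_conversations = kv_budget // per_conversation

print(per_token, per_conversation, max_conversations)
```

Swapping in a different `num_kv_heads` shows why grouped-query attention matters for serving: the KV footprint scales linearly with it.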
Step 2: The old way, and why it bled memory
Pre-vLLM systems reserved a contiguous chunk of KV memory per request, sized for the maximum possible length (e.g. 2048 tokens), up front.
Request A (will generate 30 tokens, reserved 2048):
[####..............................................................] <- 2018 slots WASTED
^30 used
Request B (reserved 2048):
[#########.........................................................] <- ~2000 WASTED
Two diseases:
- Internal fragmentation — you reserve for the worst case (2048) but use 30. The other ~2018 slots sit idle, reserved, unusable by anyone else.
- External fragmentation — as requests of different sizes come and go, free memory breaks into chunks too small to fit the next contiguous request, even though the total free memory is plenty.
Measurements in the vLLM paper found that these two effects wasted 60–80% of KV memory in earlier serving systems. That translates directly into 60–80% fewer concurrent users than the hardware could support.
Step 3: The fix — pages (blocks)
PagedAttention says: stop reserving contiguous slabs. Instead:
- Carve all KV memory into many small, equal blocks. A block holds the KV of block_size tokens (commonly 16).
- Maintain a global pool of free blocks.
- Give each request blocks on demand, one at a time, as it generates; the blocks can be anywhere in physical memory.
- Keep a per-request block table: a little array mapping the request's logical block index (0, 1, 2, …) to the physical block id it actually got.
Physical KV memory (one big array of blocks, ids 0..N):
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ b0 │ b1 │ b2 │ b3 │ b4 │ b5 │ b6 │ b7 │ b8 │ b9 │ ...
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
Request A's block table: [ 4, 1, 7 ] (logical 0→phys 4, 1→1, 2→7)
Request B's block table: [ 2, 9 ]
A's tokens live in blocks 4,1,7 — NOT contiguous, and that's totally fine.
Only A's *last* block may be partly empty (≤ block_size−1 wasted). No giant reservations.
Now waste is at most block_size − 1 tokens per request (the tail of the last block) —
seconds of generation, not thousands of reserved-but-idle slots. Fragmentation: gone.
The mental shift: a request's KV no longer needs to be contiguous in memory; it only needs to be contiguous in the block table. The attention kernel is handed the block table and gathers KV from the scattered physical blocks. That's the "Paged" in PagedAttention.
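To put numbers on "at most block_size − 1 wasted", here is a small back-of-the-envelope model comparing the two allocation strategies. It is a sketch with made-up request lengths; the function names are illustrative and it counts wasted token slots only, ignoring external fragmentation.

```python
import math


def contiguous_waste(lengths, max_len):
    """Slots reserved but never used when every request reserves max_len up front."""
    return sum(max_len - n for n in lengths)


def paged_waste(lengths, block_size):
    """Slots wasted in the tail of each request's last block."""
    return sum(math.ceil(n / block_size) * block_size - n for n in lengths)


lengths = [30, 200, 1000]                      # tokens actually used per request
reserved = contiguous_waste(lengths, max_len=2048)
paged = paged_waste(lengths, block_size=16)
print(reserved, paged)
```

With these three requests the contiguous scheme idles thousands of slots while the paged scheme idles fewer than one block per request, which is the whole argument of this step in miniature.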
Step 4: The bonus that falls out — sharing
Once KV is in blocks tracked by tables, two requests can point their block tables at the
same physical block. If two requests start with the same prompt (a shared system prompt, or
n=4 samples of one prompt), they can share the physical KV blocks of that prefix — compute
it once, store it once.
System prompt blocks (computed once): b5 b6
Request A table: [ b5, b6, b1 ] ─┐
Request B table: [ b5, b6, b8 ] ─┴─ both point at b5,b6 (shared!), diverge after.
This is prefix caching (the star of Phase 03). To make sharing safe we need two more concepts, both straight from operating systems:
- Reference counting: each block knows how many requests use it (ref_cnt). A block is truly free only when ref_cnt == 0.
- Copy-on-write: if a shared block must change for just one request, copy it first so the other sharer's view is untouched.
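The two concepts fit in a toy pool. This is a minimal sketch under simplifying assumptions: `Block`, `TinyPool`, `share`, and `release` are hypothetical names, the free list is a plain Python list, and "contents" are just a list of token ids rather than GPU tensors.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    block_id: int
    ref_cnt: int = 0
    tokens: list = field(default_factory=list)


class TinyPool:
    def __init__(self, num_blocks):
        self.blocks = [Block(i) for i in range(num_blocks)]
        self.free_ids = list(range(num_blocks))

    def allocate(self):
        block = self.blocks[self.free_ids.pop(0)]
        block.ref_cnt = 1
        return block

    def share(self, block):
        block.ref_cnt += 1          # one more block table points here
        return block

    def release(self, block):
        block.ref_cnt -= 1
        if block.ref_cnt == 0:      # truly free only at zero
            self.free_ids.append(block.block_id)

    def fork_block(self, block):
        """Copy-on-write: return a block the caller may safely modify."""
        if block.ref_cnt == 1:
            return block            # sole owner, write in place
        copy = self.allocate()
        copy.tokens = list(block.tokens)
        block.ref_cnt -= 1          # the caller no longer uses the original
        return copy


pool = TinyPool(4)
shared = pool.allocate()
shared.tokens = [1, 2, 3]
table_a = [shared]
table_b = [pool.share(shared)]                 # ref_cnt == 2
table_b[0] = pool.fork_block(table_b[0])       # B diverges: copy first
table_b[0].tokens.append(9)
print(table_a[0].tokens, table_b[0].tokens, shared.ref_cnt)
```

Request A's view of the shared block is untouched after B's write, and the original block's count drops back to one.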
Step 5: The data structures you're about to meet
The real vLLM (and your mini_vllm) implement paging with exactly four pieces:
| Piece | Job | Real code | Your code |
|---|---|---|---|
| KVCacheBlock | metadata for one physical block (id, ref_cnt, hash) | kv_cache_utils.py:116 | mini_vllm/block_pool.py |
| FreeKVCacheBlockQueue | the free list, in eviction order, O(1) middle-removal | kv_cache_utils.py:164 | mini_vllm/block_pool.py |
| BlockPool | owns all blocks + the free list + the prefix-cache index | block_pool.py:130 | mini_vllm/block_pool.py |
| KVCacheManager | per-request block tables; the API the scheduler calls | kv_cache_manager.py:110 | mini_vllm/kv_cache.py |
A surprising detail you'll appreciate: the free list is a hand-rolled doubly linked list,
not a Python deque. Why? Because on a prefix-cache hit we must yank a specific block out of
the middle of the free list in O(1). A deque can't do that. The real code has a 30-line
docstring justifying this exact decision (kv_cache_utils.py:164). Reading that docstring and
understanding why is a rite of passage — and a great interview answer.
The four invariants (memorize these)
A maintainer holds these in their head at all times. They're tested in
mini_vllm/test_block_pool.py and asserted throughout the real code:
- I1. A block is in the free queue ⟺ block.ref_cnt == 0 (and it isn't the null block).
- I2. Block tables are append-only: an allocated block_id never changes under a request. (This is why the cache doesn't de-duplicate; see block_pool.py:48.)
- I3. Only a full block (exactly block_size tokens) ever gets a hash and enters the prefix cache.
- I4. "Cached" ≠ "unusable." A block can be a free eviction candidate (in the free queue) while still being a prefix-cache hit target. touch() revives it.
What you'll do in this phase
- Read the real allocator: 01-deep-dive.md walks block_pool.py and kv_cache_utils.py line by line.
- Build your own: 02-mini-build.md (you've got mini_vllm/block_pool.py as the reference; the lab has you write it from a stub).
- Labs (see labs/README.md; recommended order 01 → 02 → 05 → 06 → 03 → 04):
  - lab-01-block-allocator [CPU-OK]: implement the paged allocator + free queue, pass the tests.
  - lab-02-fragmentation-viz [CPU-OK]: simulate contiguous vs paged allocation; measure the waste.
  - lab-03-real-vllm-blocks [GPU-OPT]: run real vLLM, read num_gpu_blocks and KV usage, prove no fragmentation.
  - lab-04-triton-paged-attn [GPU-REQ]: port a block-table-indexed attention to a Triton kernel.
  - lab-05-share-and-evict [CPU-OK]: the life of a cached block: sharing (ref_cnt==2), eviction order (tails before shared prefixes), and revival from the middle of the free queue.
  - lab-06-paged-attention-numpy [CPU-OK]: the kernel's data path in pure numpy: slot_mapping, scatter, gather-through-the-table, and proof that paged == dense to 1e-12. (The CPU twin of lab-04.)
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
When you can whiteboard the block table + free queue from memory and explain copy-on-write and the four invariants, you understand the single most important idea in vLLM. Onward.
Phase 02 — Deep Dive: PagedAttention in the real vLLM
All paths are relative to upstream/ at the pinned commit v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). Open each file as we go. Line numbers are valid at the pin; the named symbol lets you re-find anything if you're on a different version.

The V1 KV-cache stack lives in vllm/v1/core/:

    vllm/v1/core/
      kv_cache_utils.py                 KVCacheBlock, FreeKVCacheBlockQueue, hashing   (the primitives)
      block_pool.py                     BlockPool                                      (the allocator)
      kv_cache_manager.py               KVCacheManager, KVCacheBlocks                  (per-request tables)
      kv_cache_coordinator.py           coordinates groups (hybrid models)             (one level up)
      single_type_kv_cache_manager.py                                                  (per-group logic)
We'll go bottom-up: the block, the free list, the pool, then the manager the scheduler calls.
Contents
- 1. KVCacheBlock: metadata for one physical block
- 2. FreeKVCacheBlockQueue: the free list, and why it's hand-rolled
- 3. BlockPool: owns every block, the free list, and the cache index
- 4. The hash that makes it a prefix cache: hash_block_tokens
- 5. KVCacheManager: the per-request API the scheduler uses
- 6. Where the blocks actually get used: the attention kernel
- Reading checklist
1. KVCacheBlock — metadata for one physical block
vllm/v1/core/kv_cache_utils.py:116:
@dataclass
class KVCacheBlock:
"""KV-cache block metadata."""
block_id: int
ref_cnt: int = 0
_block_hash: BlockHashWithGroupId | None = None
# Used to construct a doubly linked list for free blocks.
prev_free_block: "KVCacheBlock | None" = None
next_free_block: "KVCacheBlock | None" = None
is_null: bool = False
Crucial things to notice:
- A KVCacheBlock is metadata only. The actual K/V tensors live in a big GPU buffer; this object just says "block #block_id, used by ref_cnt requests, hashing to _block_hash." Your mini_vllm.block_pool.KVCacheBlock is the same shape minus the GPU tensors.
- ref_cnt is the heart of sharing (I1). The block_hash setter (line 139) asserts the block has no hash yet, enforcing I3/I2: a block's hash is set once when it fills, and the block id is stable.
- prev_free_block / next_free_block are the linked-list pointers. The comment (line 128) warns: "These two attributes should only be manipulated by FreeKVCacheBlockQueue." That's an invariant about ownership, exactly the kind of thing a maintainer must respect.
reset_hash() (line 146) clears the hash on eviction. We'll see it called from
_maybe_evict_cached_block.
2. FreeKVCacheBlockQueue — the free list, and why it's hand-rolled
vllm/v1/core/kv_cache_utils.py:164. Read its docstring in full — it's a masterclass. The key
sentences:
"We implement this class instead of using Python builtin deque to support removing a block in the middle of the queue in O(1) time. … this class does not allocate any Python objects when manipulating the linked list."
Two design decisions, both about performance on the hot path (this runs for every allocation and free, every step):
- O(1) middle removal. On a prefix-cache hit, a block that was a free eviction candidate gets revived: pulled out of wherever it sits in the free list. A deque only does O(1) at the ends; the middle is O(n). So they wrote a doubly linked list.
- Zero allocation. They reuse the prev/next fields on the blocks themselves rather than allocating node wrappers. No GC pressure in the scheduler loop.
The eviction order is the other half (docstring lines 173–180):
"1. The least recently used block is at the front (LRU). 2. If two blocks have the same last accessed time … the one with more hash tokens (the tail of a block chain) is at the front."
So popleft() evicts LRU-first, and within a freed request, tail blocks go first (we'll see
KVCacheManager.free frees in reverse so the longest shared prefix survives longest).
The sentinel trick (lines 196–214): a fake head and tail node so push/pop never special-case
"is this the first/last?". Read popleft (216), remove (286), append (306), popleft_n
(253), append_n (329). Your mini_vllm.block_pool.FreeKVCacheBlockQueue implements the same
four operations with the same sentinel trick — compare them side by side.
Interview gold: "Why does vLLM use a custom linked list instead of collections.deque for free blocks?" → O(1) removal from the middle for prefix-cache revival, and zero per-operation allocation on the scheduler hot path. If you can also say where the middle removal happens (touch), you're answering at staff level.
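The sentinel trick and the O(1) middle removal can be seen in a stripped-down queue. This is a sketch, not the upstream class: `Node` and `FreeQueue` are hypothetical names, and unlike the real code it uses a separate node type instead of storing the pointers on the block objects themselves.

```python
class Node:
    def __init__(self, block_id):
        self.block_id = block_id
        self.prev = None
        self.next = None


class FreeQueue:
    """Doubly linked list with fake head/tail nodes, so no operation special-cases the ends."""

    def __init__(self, nodes):
        self.head = Node(-1)
        self.tail = Node(-1)
        self.head.next = self.tail
        self.tail.prev = self.head
        self.num_free = 0
        for node in nodes:
            self.append(node)

    def append(self, node):
        last = self.tail.prev
        last.next = node
        node.prev = last
        node.next = self.tail
        self.tail.prev = node
        self.num_free += 1

    def remove(self, node):
        # O(1): only the two neighbours are touched, wherever the node sits.
        node.prev.next = node.next
        node.next.prev = node.prev
        node.prev = node.next = None
        self.num_free -= 1

    def popleft(self):
        if self.num_free == 0:
            raise ValueError("no free blocks")
        node = self.head.next
        self.remove(node)
        return node

    def ids(self):
        out, node = [], self.head.next
        while node is not self.tail:
            out.append(node.block_id)
            node = node.next
        return out


nodes = [Node(i) for i in range(5)]
queue = FreeQueue(nodes)
queue.remove(nodes[2])            # revive block 2 from the middle
after_remove = queue.ids()
first = queue.popleft().block_id  # evict from the front
queue.append(nodes[2])            # block 2 is freed again, goes to the back
print(after_remove, first, queue.ids())
```

Because of the sentinels, `remove` never checks whether the node is first or last; the same four pointer writes cover every position.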
3. BlockPool — owns every block, the free list, and the cache index
vllm/v1/core/block_pool.py:130. The constructor (__init__, line 149):
self.blocks: list[KVCacheBlock] = [KVCacheBlock(idx) for idx in range(num_gpu_blocks)]
self.free_block_queue = FreeKVCacheBlockQueue(self.blocks)
self.cached_block_hash_to_block: BlockHashToBlockMap = BlockHashToBlockMap()
# To represent a placeholder block with block_id=0.
self.null_block = self.free_block_queue.popleft()
self.null_block.is_null = True
- One KVCacheBlock per physical block, all initially free.
- A null block (id 0) is reserved as a placeholder (used for skipped positions, e.g. outside a sliding window). mini_vllm reserves block 0 the same way (BlockPool.__init__).
- cached_block_hash_to_block is the prefix-cache index: block_hash → block. (Upstream uses a BlockHashToBlockMap that can hold multiple blocks per hash; mini_vllm simplifies to one block per hash. Read the BlockHashToBlockMap docstring at line 34 to see why the real one is more complex: it must keep block ids stable, I2, so it doesn't dedup.)
Allocation: get_new_blocks (line 333)
def get_new_blocks(self, num_blocks: int) -> list[KVCacheBlock]:
if num_blocks > self.get_num_free_blocks():
raise ValueError(f"Cannot get {num_blocks} free blocks from the pool")
ret: list[KVCacheBlock] = self.free_block_queue.popleft_n(num_blocks)
if self.enable_caching:
for block in ret:
self._maybe_evict_cached_block(block) # <- was it a cached eviction candidate?
assert block.ref_cnt == 0
block.ref_cnt += 1
else:
for block in ret:
assert block.ref_cnt == 0
block.ref_cnt += 1
return ret
Pop n blocks off the front of the free queue (LRU). If caching is on, each popped block might
still be sitting in the prefix cache as an eviction candidate (I4) — so
_maybe_evict_cached_block removes its hash entry before we reuse it. Then ref it (ref_cnt = 1). mini_vllm.BlockPool.get_new_blocks mirrors this exactly (including _maybe_evict).
Eviction: _maybe_evict_cached_block (line 365)
block_hash = block.block_hash
if block_hash is None:
return False # block was never cached, nothing to evict
if self.cached_block_hash_to_block.pop(block_hash, block.block_id) is None:
return False
block.reset_hash() # <- I3: it no longer holds cacheable content
This is the OS analogy made literal: reusing a physical page means invalidating whatever was mapped there. The hash is cleared so no future request thinks this block holds their prefix.
Sharing: touch (line 402) — the O(1) middle removal in action
def touch(self, blocks: Sequence[KVCacheBlock]) -> None:
for block in blocks:
# ref_cnt=0 means this block is in the free list (eviction candidate), so remove it.
if block.ref_cnt == 0 and not block.is_null:
self.free_block_queue.remove(block) # <- O(1) middle removal! (the whole reason
block.ref_cnt += 1 # for the custom linked list)
When a new request hits a prefix-cached block that happened to be free, touch revives it:
pull it out of the middle of the free list and bump its ref count. This single line is why
FreeKVCacheBlockQueue exists. mini_vllm.BlockPool.touch is identical in spirit.
Freeing: free_blocks (line 419)
for block in blocks_list:
block.ref_cnt -= 1
self.free_block_queue.append_n(
[block for block in blocks_list if block.ref_cnt == 0 and not block.is_null]
)
Decrement refs; any block that hit 0 goes back on the free queue (and stays in the cache as an eviction candidate — I4). The caller is expected to pass blocks in eviction-priority order (docstring line 419: "first block will be evicted first").
Caching full blocks: cache_full_blocks (line 211)
The big method that registers newly-full blocks into the prefix cache. The important loop (line 267):
for i, blk in enumerate(new_full_blocks):
if blk.is_null or (block_mask is not None and not block_mask[i]):
continue
assert blk.block_hash is None # I3 again
block_hash = new_block_hashes[i]
block_hash_with_group_id = make_block_hash_with_group_id(block_hash, kv_cache_group_id)
blk.block_hash = block_hash_with_group_id
self.cached_block_hash_to_block.insert(block_hash_with_group_id, blk)
Only full, non-null, non-masked blocks get a hash and enter the index. The rest of the method (lines 285–331) emits optional KV-cache events (for observability / external KV stores) — skip that on first read.
4. The hash that makes it a prefix cache: hash_block_tokens
vllm/v1/core/kv_cache_utils.py:541:
def hash_block_tokens(hash_function, parent_block_hash, curr_block_token_ids, extra_keys=None):
if not parent_block_hash:
parent_block_hash = NONE_HASH
curr_block_token_ids_tuple = tuple(curr_block_token_ids)
return BlockHash(
hash_function((parent_block_hash, curr_block_token_ids_tuple, extra_keys))
)
The block's hash includes its parent's hash. That chaining is the entire reason this is a
prefix cache and not just a block cache: block [c, d] hashes differently depending on what
came before it, so a hit on block k guarantees blocks 0..k were all identical. extra_keys
folds in things that must not collide across contexts — LoRA id, multimodal content, a
cache_salt — see generate_block_hash_extra_keys (line 503). Your
mini_vllm.block_pool.hash_block_tokens keeps the parent chaining (the essential part) and
drops extra_keys; the test test_prefix_hash_is_chained pins the property.
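The chaining property is easy to check in isolation. The sketch below is an assumption-laden stand-in: it uses `hashlib.sha256` over a `repr` of the tuple for reproducible output (neither upstream's hash function nor mini_vllm's), and `hash_prompt` is a hypothetical helper that hashes only full blocks.

```python
import hashlib


def hash_block_tokens(parent_hash, token_ids):
    """Hash one full block, folding in the hash of the block before it."""
    payload = repr((parent_hash, tuple(token_ids))).encode()
    return hashlib.sha256(payload).hexdigest()


def hash_prompt(token_ids, block_size):
    """Chain hashes over the full blocks of a prompt; a partial tail gets no hash."""
    hashes, parent = [], None
    for i in range(len(token_ids) // block_size):
        block = token_ids[i * block_size:(i + 1) * block_size]
        parent = hash_block_tokens(parent, block)
        hashes.append(parent)
    return hashes


a = hash_prompt([1, 2, 3, 4, 5, 6], block_size=2)
b = hash_prompt([9, 9, 3, 4, 5, 6], block_size=2)     # same later blocks, different start
c = hash_prompt([1, 2, 3, 4, 7, 8, 0], block_size=2)  # shares a two-block prefix with a

print(a[1] == b[1], c[:2] == a[:2], len(c))
```

Blocks `[3, 4]` in `a` and `b` contain identical tokens but hash differently because their parents differ, so a match on block k implies a match on everything before it.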
5. KVCacheManager — the per-request API the scheduler uses
vllm/v1/core/kv_cache_manager.py:110. This is the only KV class the scheduler talks to; it
hides the pool/coordinator behind a clean interface. Two methods matter most.
get_computed_blocks (line 194) — prefix-cache lookup
max_cache_hit_length = request.num_tokens - 1 # must recompute last token to get logits
computed_blocks, num_new_computed_tokens = self.coordinator.find_longest_cache_hit(
request.block_hashes, max_cache_hit_length
)
Note the num_tokens - 1: even if the entire prompt is cached, the last token must be
recomputed to produce logits. mini_vllm.KVCacheManager.get_computed_blocks reproduces this
exact max_hit_tokens = num_tokens - 1 rule and walks block hashes from the front, stopping at
the first miss (a prefix must be contiguous from the start).
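The lookup rule described above, sketched with a plain dict as the cache index. `find_cached_prefix` is a hypothetical name, and the cap is expressed in whole blocks as `(num_tokens - 1) // block_size`, the form the mini-build uses.

```python
def find_cached_prefix(block_hashes, cache, num_tokens, block_size):
    """Return (block ids hit, tokens covered), walking from the front until the first miss."""
    max_hit_blocks = (num_tokens - 1) // block_size   # always leave the last token to recompute
    hits = []
    for block_hash in block_hashes[:max_hit_blocks]:
        block_id = cache.get(block_hash)
        if block_id is None:
            break                                      # a prefix must be contiguous from the start
        hits.append(block_id)
    return hits, len(hits) * block_size


cache = {"h0": 5, "h1": 6}

# An 8-token prompt whose two full blocks are both cached: only one block is usable.
fully_cached = find_cached_prefix(["h0", "h1"], cache, num_tokens=8, block_size=4)

# A 9-token prompt: the ninth token is recomputed anyway, so both blocks are usable.
one_extra = find_cached_prefix(["h0", "h1"], cache, num_tokens=9, block_size=4)

# A miss on the first block stops the walk even though a later block is cached.
early_miss = find_cached_prefix(["hX", "h1"], cache, num_tokens=9, block_size=4)

print(fully_cached, one_extra, early_miss)
```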
allocate_slots (line 236) — the workhorse
Read the giant ASCII docstring (lines 273–305): it diagrams how a request's tokens split into
comp | new_comp | ext_comp | new | lookahead. The control flow (simplified):
num_blocks_to_allocate = self.coordinator.get_num_blocks_to_allocate(...)
if num_blocks_to_allocate > self.block_pool.get_num_free_blocks():
return None # <- OOM! caller must preempt and retry
...
new_blocks = self.coordinator.allocate_new_blocks(...)
...
self.coordinator.cache_blocks(request, num_tokens_to_cache) # cache newly-full blocks
return self.create_kv_cache_blocks(new_blocks)
The single most important line for Phase 03
is return None: when there aren't enough free blocks, allocate_slots returns None, and
the scheduler responds by preempting a running request and retrying. That handshake between
the KV manager (memory truth) and the scheduler (policy) is the seam where memory management
meets scheduling. mini_vllm.KVCacheManager.allocate_slots returns None on OOM for exactly
this reason, and mini_vllm.Scheduler.schedule preempts on None.
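The None-then-preempt handshake can be sketched with a block counter standing in for the KV manager. Everything here is a toy: `TinyKVManager` and `schedule_with_preemption` are hypothetical names, and preempting the most recently admitted other request is an assumed policy chosen for illustration.

```python
import math


class TinyKVManager:
    """Toy stand-in for a KV manager: counts blocks, holds no tensors."""

    def __init__(self, num_blocks, block_size):
        self.num_free = num_blocks
        self.block_size = block_size
        self.held = {}                       # request id -> blocks held

    def allocate_slots(self, req_id, total_tokens):
        have = self.held.get(req_id, 0)
        needed = max(0, math.ceil(total_tokens / self.block_size) - have)
        if needed > self.num_free:
            return None                      # out of blocks: the caller decides what to do
        self.num_free -= needed
        self.held[req_id] = have + needed
        return needed

    def free(self, req_id):
        self.num_free += self.held.pop(req_id, 0)


def schedule_with_preemption(manager, running, req_id, total_tokens):
    """Retry allocation, preempting other running requests until it succeeds."""
    preempted = []
    while manager.allocate_slots(req_id, total_tokens) is None:
        victims = [r for r in running if r != req_id]
        if not victims:
            raise MemoryError("request does not fit even with the pool to itself")
        victim = victims[-1]                 # assumed policy: newest request goes first
        running.remove(victim)
        manager.free(victim)
        preempted.append(victim)
    return preempted


manager = TinyKVManager(num_blocks=4, block_size=16)
running = ["a", "b"]
manager.allocate_slots("a", 32)              # 2 blocks
manager.allocate_slots("b", 32)              # 2 blocks, pool now empty
preempted = schedule_with_preemption(manager, running, "a", 33)  # "a" needs a third block
print(preempted, running, manager.num_free)
```

The manager only reports the memory truth; which request to sacrifice is entirely the scheduler's decision, which is the separation the paragraph above describes.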
free (line 429) — reverse order on purpose
"""We free the blocks in reverse order so that the tail blocks are evicted first when
caching is enabled."""
self.coordinator.free(request.request_id)
Freeing tail-first means the head blocks (the shared prefix) stay in the free queue longest,
so they survive for the next request that shares that prefix. mini_vllm.KVCacheManager.free
does reversed(blocks) for the same reason — see the comment there.
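A tiny demonstration of why the order matters. A `collections.deque` is enough here only because this sketch never removes from the middle; the block ids are made up.

```python
from collections import deque

free_queue = deque([3])            # an older free block, already first in line for eviction
request_blocks = [10, 11, 12]      # logical order: block 10 holds the start of the prompt

# Free tail-first: the last block of the request enters the queue before the first.
free_queue.extend(reversed(request_blocks))

eviction_order = list(free_queue)  # popleft() would consume blocks in this order
print(eviction_order)
```

Block 10, the part of the prompt most likely to be shared with a future request, is now the last of the three to be reused.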
6. Where the blocks actually get used: the attention kernel
We've managed metadata; where do the K/V tensors and block tables meet a GPU kernel? Two places to glance at (full treatment in Phase 04):
- The classic CUDA kernels: csrc/attention/paged_attention_v1.cu and ..._v2.cu. These take a block table and gather KV from scattered physical blocks. Search the .cu for block_table to see the indirection: physical_block = block_table[seq][logical_block].
- The V1 backends that build the metadata: vllm/v1/attention/backends/flash_attn.py turns the scheduler's block ids + sequence lengths into the slot_mapping (where to write new K/V) and block tables (where to read old K/V) the kernel needs.
You don't need to read CUDA to pass this phase — but knowing that "the block table is literally passed into the attention kernel, which dereferences it per token" closes the loop on why the metadata we manage here is shaped the way it is.
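The write and read indirections can be traced on the CPU with numpy. This is a simplified sketch, not the kernel: one request, one head, keys only, and the slot formula `block_id * block_size + offset` is the layout assumed for this toy cache.

```python
import numpy as np

block_size, num_blocks, head_dim = 4, 8, 2
num_tokens = 10
rng = np.random.default_rng(0)

kv_cache = np.zeros((num_blocks, block_size, head_dim))   # the "physical" pool
block_table = [4, 1, 7]                                   # logical block -> physical block
dense_keys = rng.standard_normal((num_tokens, head_dim))  # what a contiguous cache would hold

# Write path: slot_mapping says which flat slot each token's K lands in.
slot_mapping = [
    block_table[t // block_size] * block_size + t % block_size
    for t in range(num_tokens)
]
flat_view = kv_cache.reshape(-1, head_dim)   # a view, so writes reach kv_cache
flat_view[slot_mapping] = dense_keys

# Read path: gather the request's blocks through the table, drop the unused tail.
gathered = kv_cache[block_table].reshape(-1, head_dim)[:num_tokens]

print(slot_mapping)
print(np.array_equal(gathered, dense_keys))
```

The gathered keys are identical to the dense ones even though they sit in physical blocks 4, 1, and 7, which is the property an attention kernel relies on when it dereferences the block table.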
Reading checklist
Tick these off in your lab notebook (write one sentence each):
- KVCacheBlock: what does ref_cnt gate? What does the block_hash setter assert?
- FreeKVCacheBlockQueue: why a linked list, not a deque? Where is the middle removal used?
- BlockPool.get_new_blocks: why call _maybe_evict_cached_block before reusing?
- BlockPool.touch: trace the O(1) revival of a cached free block.
- hash_block_tokens: why include the parent hash?
- KVCacheManager.allocate_slots: what does returning None trigger, and where?
Now go build it: 02-mini-build.md, then the labs.
Phase 02 — Mini-Build: the paged block allocator
You will build the four pieces from the deep-dive, on CPU, with numpy-or-nothing. The reference implementation already lives in the repo so you can check yourself:
- mini_vllm/block_pool.py: KVCacheBlock, FreeKVCacheBlockQueue, BlockPool, hash_block_tokens
- mini_vllm/kv_cache.py: KVCacheManager
But the point is to write it yourself first. lab-01-block-allocator gives you a starter.py
with the method bodies stubbed out and a test_lab.py that pins every invariant. Make the
tests pass without peeking, then diff your file against solution.py (and against the real
mini_vllm/block_pool.py).
Contents
The build, in order
1. KVCacheBlock
A dataclass: block_id, ref_cnt=0, block_hash=None, is_null=False, and two link
pointers prev_free/next_free. Add reset_hash(). Don't put list logic here — the queue
owns the pointers (mirror the real ownership invariant).
2. FreeKVCacheBlockQueue
A doubly linked list with fake head/tail sentinels. Implement:
- popleft() → first block, O(1)
- remove(block) → unlink from the middle, O(1) ← the reason this class exists
- append(block) → push to tail
- get_all_free_blocks() → for tests

Keep a num_free_blocks counter in sync. Test: removing a middle block keeps the rest ordered (test_free_queue_o1_middle_removal).
3. BlockPool
- Build num_blocks blocks, wrap them in the queue, reserve block 0 as null_block.
- get_new_blocks(n): pop n, _maybe_evict each, set ref_cnt=1.
- _maybe_evict(block): if it has a hash and is the cached block for that hash, drop it from the index and reset_hash().
- touch(blocks): if a block is free (ref_cnt==0), remove it from the queue, then ref_cnt += 1.
- free_blocks(blocks): ref_cnt -= 1; any that hit 0 (and aren't null) go back on the queue.
- cache_full_blocks(blocks, hashes): set hash + index it (skip null/already-hashed).
- get_cached_block(hash), get_num_free_blocks(), get_usage().
4. hash_block_tokens(parent_hash, token_ids)
return hash((parent_hash, tuple(token_ids))). The parent chaining is non-negotiable — it's
what makes it a prefix cache. Test: same tokens, different parent → different hash.
5. KVCacheManager (in mini_vllm/kv_cache.py)
- get_computed_blocks(request) → walk full-block hashes from the front, look each up, stop at first miss; cap usable hits at (num_tokens - 1) // block_size. Return (blocks, num_cached).
- allocate_slots(request, num_new_tokens, new_computed_blocks=None) → touch the cached blocks, compute how many blocks the (computed+new) tokens need, return None if not enough free, else allocate and cache newly-full blocks.
- free(request) → free its blocks in reverse order.
Definition of done
pytest mini_vllm/test_block_pool.py -q # the reference suite
pytest phase-02-paged-attention/labs -q # your lab solution + the lab tests
Both green. Then, in your notebook, answer: Which line in your touch() is the O(1) middle
removal, and which real-world event triggers it? (Answer: pulling a prefix-cached block out of
the free list on a cache hit.)
Stretch (optional, sets up Phase 03)
Add a tiny copy-on-write to your pool: a fork_block(block) that, if ref_cnt > 1,
allocates a fresh block, (pretend-)copies contents, decrements the original, and returns the
new one. You won't wire it into the engine here, but it's the mechanism behind safe prefix
sharing when one sharer diverges — and a classic interview follow-up.
Phase 02 Labs — PagedAttention
Six labs that take you from "paging is an idea" to "I have built every layer of it." The arc: build the allocator (lab-01), measure why it wins (lab-02), see it manage real gigabytes (lab-03), then follow the data path — share & evict cached blocks (lab-05), gather through the table in numpy (lab-06), and finally write the GPU kernel (lab-04).
Recommended order: 01 → 02 → 05 → 06 → 03 → 04. (The directory numbers predate labs 05
and 06; the metadata labs first, then the data path, then the hardware.) CPU labs follow the
standard contract — starter.py (your work), solution.py (reference), test_lab.py (the
spec); default runs the solution, LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-02-paged-attention/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-02-paged-attention/labs/lab-01-block-allocator -q
Contents
- lab-01-block-allocator [CPU-OK]
- lab-02-fragmentation-viz [CPU-OK]
- lab-03-real-vllm-blocks [GPU-OPT]
- lab-04-triton-paged-attn [GPU-REQ]
- lab-05-share-and-evict [CPU-OK]
- lab-06-paged-attention-numpy [CPU-OK]
- What you can do after this phase
Labs
lab-01-block-allocator [CPU-OK]
The lab of the phase. Implement the structure vLLM is famous for: KVCacheBlock, the
doubly-linked free queue with O(1) middle removal, the BlockPool with refcounts and lazy
eviction, and the parent-chained content hash. Four invariants (I1–I4), each one a class of
production bug, each one pinned by a test. Skills: the allocator's constitution; why the
free queue can't be a deque; hash chains = causality.
lab-02-fragmentation-viz [CPU-OK]
Simulate contiguous-max-reservation vs paged allocation on the same request stream and
measure the difference: 8 admissions vs 64, 94% waste vs ~0, on identical memory. The
PagedAttention headline result, re-derived by you in a four-line model. Skills: internal
vs external fragmentation; first-principles capacity modeling; the block_size trade-off.
lab-03-real-vllm-blocks [GPU-OPT]
Run real vLLM and read its memory self-assessment: where # GPU blocks: 8788 comes from
(profile → subtract → carve), why "Maximum concurrency: 68.65x" is the whole serving-
capacity story, and how to sanity-check both from the model config on paper. Captured
annotated run included for the GPU-less. Skills: capacity planning;
gpu_memory_utilization / max_model_len as capacity knobs; reading the startup ritual.
lab-04-triton-paged-attn [GPU-REQ]
The payoff: write a Triton kernel that gathers K/V through a block table and computes
decode attention with online softmax, then verify against a dense reference and compare
with paged_attention_v1.cu. Do lab-06 first — it's the same algorithm without the GPU
dialect. Skills: kernel-level indirection; online softmax; block_table (read) vs
slot_mapping (write); fp32 accumulation.
lab-05-share-and-evict [CPU-OK]
The biography of a cached block: two identical prompts converge on the same physical blocks
(ref_cnt == 2), freed blocks linger as eviction candidates, eviction consumes tails before
shared prefixes (reverse-order free = the policy), and a newcomer revives "dead" blocks from
the middle of the free queue. Includes the num_tokens − 1 hit cap and an exactly-sized
pool that counts every block. Skills: prefix-cache state machine; eviction-as-queue-order;
why the last token always recomputes.
lab-06-paged-attention-numpy [CPU-OK]
The data path, with no kernel noise: build slot_mapping, scatter K/V into a shuffled
physical cache, gather back through the block table, and prove paged attention equals dense
attention to 1e-12 — including a poisoned-tail test that makes masking bugs detonate.
The CPU twin of lab-04. Skills: the slot formula; write-map vs read-map; testing masked
computations by poisoning padding.
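As a taste of lab-06's data path, here is a pure-Python sketch of the slot formula and of the write-map versus read-map distinction. The helper name slot_for and the toy sizes are assumptions made for illustration; the lab itself works in numpy with real K/V tensors.

```python
def slot_for(position: int, block_table: list[int], block_size: int) -> int:
    """Map a token's logical position to its physical slot in the flat cache."""
    block_id = block_table[position // block_size]
    return block_id * block_size + position % block_size


block_size = 4
block_table = [7, 2, 5]  # logical block i lives in physical block block_table[i]
keys = [f"k{i}" for i in range(10)]  # 10 tokens: the last block is only half full

# Write path: scatter each token's key into a poisoned physical cache.
cache = ["POISON"] * (8 * block_size)
slot_mapping = [slot_for(i, block_table, block_size) for i in range(len(keys))]
for key, slot in zip(keys, slot_mapping):
    cache[slot] = key

# Read path: gather through the block table, stopping at the true length.
gathered = [cache[slot_for(i, block_table, block_size)] for i in range(len(keys))]
print(slot_mapping)
print(gathered == keys)
```

Because the untouched slots hold a poison value, any read that runs past the true sequence length shows up immediately in the gathered result.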
What you can do after this phase
Explain — with code you wrote and numbers you measured — why vLLM's memory manager admits
~10× more requests than reservation-based engines; predict a deployment's KV capacity from
HBM size and model config; narrate the full life of a cached block from allocation through
sharing to eviction or revival; and read block_pool.py, kv_cache_manager.py, and the
paged-attention kernels upstream as a peer of their authors. Phase 3 now puts this allocator
under a scheduler.
Lab 02-01 — Build the Paged Block Allocator [CPU-OK]
This is the lab of the phase, and arguably of the course. You are going to implement, from a skeleton, the data structure that made vLLM famous: the paged KV-cache block allocator — the free queue, the block pool, the reference counts, and the prefix-cache index. When the tests go green, the thing that serves trillions of tokens a day in production deployments around the world will exist, in miniature, written by your hands.
Contents
- Why this lab exists
- Background: what problem this structure solves
- The cast of characters
- Files
- How to run
- What to implement (in starter.py)
- The invariants you're proving
- The one data-structure decision to savor
- What the tests prove
- Hitchhiker's notes
- Success, and what to do with it
- References
Why this lab exists
Here is the surprise at the heart of vLLM: its breakthrough wasn't a kernel, a model trick, or CUDA wizardry. It was an operating-systems idea from 1962, paged virtual memory, applied to the KV cache. The PagedAttention paper's headline numbers (2–4× throughput over the prior state of the art) come almost entirely from the metadata structure you're about to build: a few hundred lines of bookkeeping that decide which 16-token "page" of GPU memory belongs to whom.
That's also why this lab is CPU-only with zero loss of fidelity. The GPU tensors are, as the
module docstring puts it, "just an array indexed by block_id." The hard part — the part
maintainers actually edit, review, and break — is the metadata: ref counts, free lists,
eviction, the prefix-cache index. You'll write all of it. And because mini_vllm's version
is a faithful-but-small port of the real one (same class names, same invariants, line
references throughout), finishing this lab means you can open
upstream/vllm/v1/core/block_pool.py and read it like something you wrote.
Background: what problem this structure solves
Every token a transformer processes leaves a residue: its attention keys and values, needed by every future token of the same sequence. For a 7B model that's ~0.5 MB per token. The pre-vLLM engines stored each request's KV in one contiguous tensor sized for the maximum possible length — and since you can't know in advance how long a generation will run, they reserved worst case and used average case. Result: 60–80% of "used" KV memory held nothing (measured in the PagedAttention paper, §2; you'll reproduce the number yourself in lab-02).
The fix is the OS playbook, almost verbatim:
| OS virtual memory | vLLM | In this lab |
|---|---|---|
| physical page frame | KV block (block_size tokens of K/V) | KVCacheBlock |
| free frame list | free queue | FreeKVCacheBlockQueue |
| page table (per process) | block table (per request) | Phase 2's KVCacheManager (next file over) |
| shared pages + refcounts | prefix sharing + ref_cnt | touch / free_blocks |
| page cache | prefix-cache index | cached_block_hash_to_block |
A request takes blocks one at a time, from anywhere, as it grows. Nothing is reserved.
External fragmentation: impossible (all blocks the same size). Internal fragmentation: at
most block_size − 1 tokens, in the last block only. Sharing: free, via refcounts. That's
the whole revolution.
The cast of characters
You implement three things, mirroring (with line references) the real engine:
- KVCacheBlock: one block's metadata. It holds block_id (its fixed address in the GPU tensor), ref_cnt (how many requests use it), block_hash (set only when full and cached), and two linked-list pointers it does not manage itself. (upstream: kv_cache_utils.py:116)
- FreeKVCacheBlockQueue: a doubly linked list with head/tail sentinels holding every ref_cnt == 0 block in eviction order. Supports popleft (allocate), append (free), and the crucial remove(block), which is O(1) extraction from the middle. (upstream: kv_cache_utils.py:164, where the docstring explains exactly why a deque can't do this job)
- BlockPool: the owner. It provides get_new_blocks (allocate + maybe-evict), touch (adopt a cached block, reviving it from the free queue if needed), free_blocks (decref, return to queue at ref_cnt == 0), cache_full_blocks / get_cached_block (the prefix-cache index), plus hash_block_tokens, the parent-chained content hash. (upstream: block_pool.py:130)
Files
- starter.py: the skeleton. Method bodies raise NotImplementedError. Fill them in.
- solution.py: a complete reference. Don't open it until you're green or truly stuck; this lab's struggle is its value.
- test_lab.py: every invariant from the deep-dive §1–3, executable.
How to run
# Grade YOUR implementation:
LAB_IMPL=starter pytest phase-02-paged-attention/labs/lab-01-block-allocator -q
# The reference (default — keeps the suite green out of the box):
pytest phase-02-paged-attention/labs/lab-01-block-allocator -q
What to implement (in starter.py)
Recommended order — each layer is testable before the next:
- FreeKVCacheBlockQueue: popleft, remove, append, get_all_free_blocks. The sentinels (_head, _tail) are pre-wired so you never branch on "am I first/last?"; notice how much conditional logic two dummy nodes delete. Keep num_free_blocks exact; the pool's OOM answer depends on it.
- hash_block_tokens: hash(parent_hash, tokens_tuple). One line, but read the docstring until you can say why the parent is in there (see Hitchhiker's notes).
- BlockPool: get_new_blocks (pop, _maybe_evict, assert ref_cnt == 0, set to 1), _maybe_evict (drop the hash↔block mapping if this block was a cached eviction candidate), touch, free_blocks, cache_full_blocks, get_cached_block, get_num_free_blocks. Mind block 0: it's reserved as the null block at construction, exactly like upstream.
The invariants you're proving
These four lines are the closest thing the KV subsystem has to a constitution. Real scheduler bugs — upstream, in production — are violations of one of these:
- I1. A block is in the free queue ⟺ ref_cnt == 0 (and it's not the null block). Both directions. A block in the queue with refs is a use-after-free wearing a disguise: someone will allocate it and overwrite KV another request is still reading, which means silent corruption, tokens from someone else's conversation.
- I2. Block ids are stable: once given to a request, a block is never renumbered or deduplicated out from under it. The GPU kernel reads physical addresses computed from block_id; metadata cleverness must never move data.
- I3. Only full blocks get hashed and cached. A partial block's contents are still changing; caching it would serve half-written KV to a prefix match.
- I4. Cached ≠ unusable. A cached block with ref_cnt == 0 sits in the free queue as an eviction candidate: it can be reclaimed (evicted) by get_new_blocks or revived (re-referenced) by touch. This dual citizenship is the whole trick of zero-cost prefix caching: the cache rides for free in memory that's already free.
The one data-structure decision to savor
Why is the free "queue" a hand-rolled doubly linked list instead of
collections.deque? Because of I4. When a prefix-cache hit revives a block, that block is
sitting somewhere in the middle of the free queue, and it must leave now, in O(1) —
not via an O(n) scan of a deque. The eviction end (popleft) and the return end (append)
are deque-friendly; it's the revival path that forces real pointers. The upstream class
exists for precisely this reason and says so in its docstring.
Generalize the lesson: the access pattern dictates the structure. "Queue with O(1) middle removal" doesn't have a stdlib name, so vLLM built one. When you find a hand-rolled structure in a mature codebase, your first question should be "which operation forced this?" — the answer is usually a design document in disguise.
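To make the "queue with O(1) middle removal" concrete, here is a minimal sketch of a sentinel-based doubly linked list. The class and attribute names (Node, FreeQueue, num_free, ids) are illustrative assumptions and deliberately differ from the lab's FreeKVCacheBlockQueue, so treat it as a picture of the technique rather than the reference solution.

```python
class Node:
    def __init__(self, block_id: int) -> None:
        self.block_id = block_id
        self.prev: "Node | None" = None
        self.next: "Node | None" = None


class FreeQueue:
    """Doubly linked list with head/tail sentinels (illustrative names)."""

    def __init__(self, nodes: list[Node]) -> None:
        self._head, self._tail = Node(-1), Node(-1)
        self._head.next, self._tail.prev = self._tail, self._head
        self.num_free = 0
        for node in nodes:
            self.append(node)

    def append(self, node: Node) -> None:
        last = self._tail.prev
        last.next, node.prev = node, last
        node.next, self._tail.prev = self._tail, node
        self.num_free += 1

    def remove(self, node: Node) -> None:
        # O(1): only the two neighbours are touched, wherever the node sits.
        node.prev.next, node.next.prev = node.next, node.prev
        node.prev = node.next = None
        self.num_free -= 1

    def popleft(self) -> Node:
        if self.num_free == 0:
            raise ValueError("no free blocks")
        node = self._head.next
        self.remove(node)
        return node

    def ids(self) -> list[int]:
        out, cur = [], self._head.next
        while cur is not self._tail:
            out.append(cur.block_id)
            cur = cur.next
        return out


nodes = [Node(i) for i in range(5)]
queue = FreeQueue(nodes)
queue.remove(nodes[2])   # a cache hit revives block 2 from the middle
first = queue.popleft()  # a fresh allocation takes the front
queue.append(nodes[2])   # block 2 is freed again and goes to the back
print(first.block_id, queue.ids())
```

Note that remove needs no search because the caller already holds the node; that is the property a deque cannot offer, since it only exposes removal by value or by position.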
What the tests prove
| Test group | Invariant |
|---|---|
| free-queue mechanics | popleft/append/remove keep order and counts exact; sentinels never leak |
| allocate/free round-trips | I1 in both directions |
| no-dedup on identical content | I2 — two requests writing the same tokens get different blocks |
| partial blocks never cached | I3 |
| revive-from-middle via touch | I4 + the O(1) removal that motivates the linked list |
| eviction drops the cache entry | _maybe_evict keeps the index consistent with reality |
Hitchhiker's notes
- The chained hash is the prefix property. hash(block) = hash(parent_hash, tokens) means a block matches only if its entire ancestry matches. Without the chain, the block containing tokens [c, d] would collide between "ab|cd" and "xy|cd", and a request would inherit KV computed under a different prefix. Attention is causal: KV at position i encodes everything before i. The chain is causality, hashed. (Upstream goes further and also folds in extras like LoRA id and multimodal hashes: same idea, more ancestry. And since v0.9, the hash uses SHA-256 by default rather than Python's hash, because across a fleet, a 64-bit hash collision means serving someone else's KV: at scale, "unlikely" is a frequency.)
- The null block (id 0) is not a hack. Reserving a permanent placeholder block means "no block here yet" can be represented inside the block-table tensor without sentinels like −1 leaking into kernels. Upstream does exactly this. Watch that your free_blocks and touch never count it.
- Eviction is lazy and that's the elegance. Nothing proactively cleans the cache. A cached-but-free block just sits in the queue; if demand arrives first, get_new_blocks evicts it in passing (_maybe_evict); if a prefix hit arrives first, touch revives it. The cache is exactly as big as whatever memory happens to be idle: no knob to tune, no background thread to race with.
- Order of the free queue = eviction policy. popleft takes the front, so whatever ordering append maintains is your eviction policy. Append-on-free gives LRU-ish. Phase 3's KVCacheManager.free exploits this by returning a request's blocks in reverse order, so deep suffix blocks die before shared prefix blocks. Policy, encoded as list order, with no priority queue in sight. (Upstream v0.22 keeps maybe_evict and the queue discipline in BlockPool; older versions had a pluggable Evictor class. The simplification is itself an instructive PR to read.)
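The "ab|cd" versus "xy|cd" collision from the first note can be checked in a few lines. This is a sketch under stated assumptions: hash_block and chain are hypothetical helpers, and serializing tokens with repr before hashlib.sha256 is chosen for brevity, not taken from upstream's encoding.

```python
import hashlib


def hash_block(parent_hash: bytes, tokens: tuple[int, ...]) -> bytes:
    """Chain a block's hash to its parent's hash (illustrative encoding)."""
    payload = parent_hash + repr(tokens).encode()
    return hashlib.sha256(payload).digest()


def chain(tokens: list[int], block_size: int) -> list[bytes]:
    """Hash every full block of a sequence; a partial tail block is skipped."""
    hashes, parent = [], b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        parent = hash_block(parent, tuple(tokens[start:start + block_size]))
        hashes.append(parent)
    return hashes


a, b, c, d, x, y = 1, 2, 3, 4, 24, 25
ab_cd = chain([a, b, c, d], block_size=2)
xy_cd = chain([x, y, c, d], block_size=2)

# Without the parent, the two [c, d] blocks are indistinguishable.
unchained_equal = hash_block(b"", (c, d)) == hash_block(b"", (c, d))
# With the parent folded in, the same tokens under different prefixes differ.
chained_equal = ab_cd[1] == xy_cd[1]
print(unchained_equal, chained_equal)

# A shared prefix still matches block for block, and a partial tail is never hashed.
ab_ce = chain([a, b, c, 5], block_size=2)
print(ab_cd[0] == ab_ce[0], len(chain([a, b, c], block_size=2)))
```

The last line also touches invariant I3: a three-token sequence with block_size 2 yields one hash, because the trailing partial block is not eligible for caching.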
Success, and what to do with it
LAB_IMPL=starter pytest phase-02-paged-attention/labs/lab-01-block-allocator -q
........ [100%]
Then do the two diffs that cement the knowledge:
- diff your starter.py against solution.py: note every place you did it differently and decide which you prefer (sometimes yours is better; say why).
- Open upstream/vllm/v1/core/block_pool.py next to your file and read get_new_blocks, touch, free_blocks for real. List what production adds (multi-group KV for hybrid models, eviction events for observability, BlockHash types) and notice that nothing structural differs. You now read this file as its author.
References
- mini_vllm/block_pool.py. The faithful port you're rebuilding, with upstream line refs.
- upstream/vllm/v1/core/block_pool.py:130. BlockPool in production.
- upstream/vllm/v1/core/kv_cache_utils.py:116,164,541. KVCacheBlock, FreeKVCacheBlockQueue (read its docstring!), hash_block_tokens.
- Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023). The paper; §4 is this lab: https://arxiv.org/abs/2309.06180
- vLLM blog, vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention (June 2023). The original announcement, with the fragmentation figures: https://blog.vllm.ai/2023/06/20/vllm.html
- vLLM docs, Automatic Prefix Caching (design). The hash-chain design you implemented: https://docs.vllm.ai/en/latest/design/prefix_caching.html
- Denning, Virtual Memory (ACM Computing Surveys, 1970). The 50-year-old playbook vLLM ran: https://dl.acm.org/doi/10.1145/356571.356573
Lab 02-02 — Measure Fragmentation: Contiguous vs Paged [CPU-OK]
"Paging saves memory" is a slogan. This lab turns it into a number — one you produce, on your laptop, in milliseconds. You'll simulate the pre-vLLM allocation strategy and the paged strategy on the same stream of requests and measure exactly how many requests each admits and how many memory slots each wastes. The ratio you compute is, in miniature, the entire empirical case for PagedAttention — the 2–4× from the SOSP paper, re-derived by you.
Contents
- Why this lab exists
- Background: the two kinds of waste
- The experiment
- Files
- Run
- What you should see — with the arithmetic
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
A staff engineer's job is frequently to re-derive someone's headline claim from first
principles before betting an architecture on it. Papers cherry-pick; blog posts round up;
your workload is never quite theirs. The skill this lab drills is building the smallest
simulation that captures a memory-allocation phenomenon — no GPU, no model, no engine, just
the allocation math — and using it to interrogate a claim. You'll use this move constantly:
sizing KV for a new deployment, evaluating "should we raise block_size?", estimating what
a longer max_model_len costs before anyone provisions hardware.
It also makes Phase 2's core trade quantitative. After this lab you won't say "contiguous allocation wastes memory"; you'll say "on a reserve-512-use-32 workload it wastes 94% and admits 8 requests where paging admits 64, and here's the four-line model that says so."
Background: the two kinds of waste
Memory allocators lose memory two ways, and the distinction drives everything:
- Internal fragmentation: waste inside an allocation, because you reserved more than you used. The contiguous strategy reserves max_len per request because generation length is unknowable in advance; a request that stops after 32 tokens of a 512 reservation strands 480 slots. Note this waste is invisible to the allocator: the slots are "allocated," the dashboard says the memory is in use, and yet it holds nothing. The PagedAttention paper measured real systems (Orca-style reservation) at 60–80% waste this way.
- External fragmentation: waste between allocations. Enough total free slots exist, but no single contiguous run is big enough, so an admission fails anyway. This one appears only after churn (allocate/free cycles punch holes), which is why naive benchmarks, and naive simulations, miss it.
Paging attacks both at once: nothing is reserved beyond the current need (internal waste
collapses to a partial tail block, < block_size per request), and no run needs to be
contiguous (external waste becomes structurally impossible — every free block is exactly
the right size). The price: an indirection table and a kernel that can follow it (lab-04/06).
That trade — a pointer per page for near-zero waste — is the same one OS designers accepted
in the 1960s, and for the same reason.
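External fragmentation is easy to state and easy to doubt, so here is a toy demonstration. The pool size, the two freed reservations, and the 768-slot newcomer are invented numbers for illustration; largest_free_run is a hypothetical helper, not part of the lab's files.

```python
def largest_free_run(occupied: list[bool]) -> int:
    """Length of the longest contiguous run of free slots."""
    best = run = 0
    for used in occupied:
        run = 0 if used else run + 1
        best = max(best, run)
    return best


total_slots, reservation, block_size = 2048, 512, 16
occupied = [True] * total_slots  # four 512-slot reservations fill the pool
for start in (0, 1024):          # requests 1 and 3 finish, leaving two holes
    occupied[start:start + reservation] = [False] * reservation

free_slots = occupied.count(False)
newcomer = 768                   # bigger than one hole, smaller than both together

contiguous_ok = largest_free_run(occupied) >= newcomer
free_blocks = free_slots // block_size           # holes are block-aligned in this toy layout
paged_ok = free_blocks >= -(-newcomer // block_size)  # ceiling division
print(free_slots, largest_free_run(occupied), contiguous_ok, paged_ok)
```

The contiguous allocator has 1024 free slots and still rejects a 768-slot request, because no single hole exceeds 512. The paged allocator only counts blocks, so hole geometry never enters the decision.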
The experiment
A stream of requests arrives, each actually using some number of KV slots (tokens) out of
a worst-case max_len. You have a fixed pool of total_slots. Two allocators:
- contiguous_admit (the old way): each request reserves max_len contiguous slots, first-fit. Reject if no extent is big enough. Waste = max_len − used per admitted request.
- paged_admit (vLLM's way): each request takes ceil(used / block_size) blocks from anywhere. Reject if not enough blocks remain. Waste = the partial tail block: need·block_size − used.
Same arrivals, same pool. Count admissions, rejections, wasted slots.
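One possible shape for the two allocators, as a sketch only. The function names come from the lab, but the signatures and the tuple return value are assumptions; starter.py may define them differently, so check its TODOs before borrowing anything.

```python
import math


def contiguous_admit(used_lens, total_slots, max_len):
    """Reserve max_len contiguous slots per request.

    With no frees, first-fit degenerates to a bump pointer.
    """
    next_free = admitted = rejected = wasted = 0
    for used in used_lens:
        if next_free + max_len <= total_slots:
            next_free += max_len
            admitted += 1
            wasted += max_len - used  # reserved but never written
        else:
            rejected += 1
    return admitted, rejected, wasted


def paged_admit(used_lens, total_slots, block_size):
    """Take ceil(used / block_size) blocks from anywhere in the pool."""
    free_blocks = total_slots // block_size
    admitted = rejected = wasted = 0
    for used in used_lens:
        need = math.ceil(used / block_size)
        if need <= free_blocks:
            free_blocks -= need
            admitted += 1
            wasted += need * block_size - used  # the partial tail block
        else:
            rejected += 1
    return admitted, rejected, wasted


arrivals = [32] * 64
contiguous = contiguous_admit(arrivals, total_slots=4096, max_len=512)
paged = paged_admit(arrivals, total_slots=4096, block_size=16)
print("contiguous:", contiguous)
print("paged:     ", paged)
```

With total_slots=4096, max_len=512, used_len=32, block_size=16, and 64 arrivals, this sketch yields (8, 56, 3840) for contiguous and (64, 0, 0) for paged.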
Files
- starter.py: implement contiguous_admit() and paged_admit() (TODOs). Your work.
- solution.py: reference, plus a report() you can run directly: python phase-02-paged-attention/labs/lab-02-fragmentation-viz/solution.py.
- test_lab.py: asserts paged admits more and wastes less on a reserve-big-use-small workload, and pins the exact waste arithmetic of each strategy.
Run
LAB_IMPL=starter pytest phase-02-paged-attention/labs/lab-02-fragmentation-viz -q
pytest phase-02-paged-attention/labs/lab-02-fragmentation-viz -q # reference (default)
What you should see — with the arithmetic
Default report() parameters: total_slots=4096, max_len=512, used_len=32,
block_size=16, 64 arrivals.
contiguous: admitted=8 rejected=56 wasted=3840
paged: admitted=64 rejected=0 wasted=0
Walk the numbers; each is checkable in your head, which is the point of a model this small:
- Contiguous admits 8 = ⌊4096 / 512⌋. The pool is "full" after eight 512-slot reservations, even though those eight requests are using only 8 × 32 = 256 slots, i.e. 6% of the pool is doing work while 100% is reserved.
- wasted=3840 = 8 × (512 − 32). Internal fragmentation, precisely.
- Paged admits all 64: they need 64 × ⌈32/16⌉ = 128 blocks = 2048 slots of the 4096 available. The pool is only half full. That is 8× the admissions on identical memory, and admissions ≈ concurrent users ≈ throughput, which is why a memory-bookkeeping change produced vLLM's throughput headline.
- Paged wasted=0 is an artifact worth noticing: 32 divides evenly by 16, so the tail block is full. Change used_len=33 and waste jumps to 64 × 15 = 960: each request's last block holds 1 token and strands 15 slots. Maximum possible paged waste is always block_size − 1 per request; that bound is the design. (This is also your first taste of the block_size trade-off: small blocks → less tail waste but bigger tables and more lookups; large blocks → the reverse. vLLM defaults to 16.)
Now shrink the pool or interleave frees (see Going further) to watch external fragmentation — the subtler killer — show up in the contiguous column too.
What the tests prove
| Test | What it pins |
|---|---|
| paged admits ≥ contiguous on reserve-big-use-small | the headline claim, as an inequality your code must earn |
| contiguous waste = Σ(max_len − used) | internal fragmentation is exactly over-reservation |
| paged waste < block_size per request | the bounded-tail guarantee |
| rejection counting | the failure mode is admission, not a crash — capacity bugs are silent |
Hitchhiker's notes
- Why can't the contiguous allocator just reserve less? Because generation length is decided by the model, token by token (Phase 1 lab-05 — EOS is sampled, not scheduled). A smaller reservation means mid-generation OOM with KV laid out so it can't grow in place — the realloc-and-copy would stall the whole batch. Reserve-the-max was the correct answer under contiguity; the insight was to remove the contiguity requirement, not to blame the reservation. When a design's only fix within its constraints is bad, attack the constraints. That's the actual PagedAttention lesson.
- This simulation has no frees — it's a single admission wave. That's deliberately conservative: churn makes contiguous worse (holes), never better, while paged is immune to hole geometry. When your simplified model favors the baseline, your conclusion survives reviewers. Note the trick for your own benchmarking work.
- Real vLLM never frees mid-request either: blocks accrete one at a time as a request grows (allocate_slots, called every step the request crosses a block boundary) and are freed together at finish. The grow-by-one-block pattern is exactly what your paged_admit's ceil models at admission granularity.
- The waste you measured is why gpu_memory_utilization works at 0.9. With near-zero internal waste, the engine can run its KV pool nearly full without lying to itself about what fits. Under contiguous reservation, "90% utilized" would mean 90% reserved, perhaps 20% used: a dashboard fiction. Bookkeeping honesty is a prerequisite for running hot.
Going further
- Make external fragmentation bite. Extend the simulation with frees: admit, free every other request (leaving 512-slot holes... or smaller ones with varied max_len), then try to admit one larger request. Total free space: plenty. Largest extent: too small. Admission: fails. Paged, same workload: succeeds. That's the demo to show anyone who thinks a defragmenter could have saved contiguous allocation (on a GPU, "defragmenting" = copying gigabytes of KV mid-serving).
- Sweep block_size ∈ {1, 4, 16, 64, 256} at used_len=33 and plot waste vs table size (Σ need entries). You'll rediscover the page-size trade-off every OS textbook draws, and why 16 tokens is a sane middle.
- Feed it a real distribution. Replace the constant used_len with samples from a log-normal or from real conversation-length data (e.g. ShareGPT lengths, as used in vLLM's own benchmarks). The contiguous column gets worse, because variance is its enemy: the reservation must cover the tail of the distribution while the mean pays for it.
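The block_size sweep suggested above fits in a dozen lines if you ignore pool capacity and count only tail waste and table entries. This is a sketch under that simplifying assumption; the 64-request count mirrors the lab's default arrivals, and the column names are made up for the printout.

```python
import math

used_len, num_requests = 33, 64

print(f"{'block_size':>10} {'wasted':>8} {'table_entries':>14}")
rows = []
for block_size in (1, 4, 16, 64, 256):
    need = math.ceil(used_len / block_size)  # blocks per request
    wasted = num_requests * (need * block_size - used_len)  # stranded tail slots
    table_entries = num_requests * need  # one block-table entry per block
    rows.append((block_size, wasted, table_entries))
    print(f"{block_size:>10} {wasted:>8} {table_entries:>14}")
```

Waste and table size pull in opposite directions: block_size=1 wastes nothing but needs 2112 table entries, while block_size=256 needs only 64 entries and strands 14272 slots.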
References
- Kwon et al., PagedAttention (SOSP 2023), §2–3. The measured 60–80% waste and the fragmentation taxonomy you just reproduced: https://arxiv.org/abs/2309.06180
- vLLM blog (June 2023). The announcement, with the memory-waste figure this lab recreates: https://blog.vllm.ai/2023/06/20/vllm.html
- Yu et al., Orca (OSDI 2022). The prior state of the art whose reservation strategy is your contiguous_admit: https://www.usenix.org/conference/osdi22/presentation/yu
- Wilson et al., Dynamic Storage Allocation: A Survey and Critical Review (1995). The classic allocator-fragmentation survey, if you want the deep end: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.47.275
- mini_vllm/kv_cache.py::allocate_slots. Where the ceil(used/block_size) you wrote runs for real, every engine step.
Lab 02-03 — Inspect Real vLLM's KV Blocks [GPU-OPT]
You've built the allocator (lab-01) and measured why it wins (lab-02). Now watch the real thing manage real gigabytes: how vLLM decides at startup how many KV blocks your GPU gets, how usage breathes as requests come and go, and the startup log line that tells you — before a single request arrives — how many concurrent users this deployment can hold. This lab is where Phase 2 stops being a data-structures exercise and becomes capacity planning.
No GPU? Don't panic. A complete captured run (L4 24GB) is annotated below. The arithmetic — which is the lesson — works the same on paper.
Contents
- Why this lab exists
- Background: where blocks come from
- Requirements
- Steps
- What to look for / log
- Captured output (real run, facebook/opt-125m, L4 24GB, vLLM 0.22.1)
- The capacity arithmetic, worked
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
The most consequential number in any vLLM deployment is printed once, at startup, and most
operators scroll past it: # GPU blocks: NNNN. That number is your serving capacity — it
bounds how many tokens of context can exist on the GPU simultaneously, which bounds
concurrent users, which bounds throughput (because batch size is where throughput comes
from, Phase 18). Every knob you'll ever tune for capacity — gpu_memory_utilization,
max_model_len, model choice, quantization, tensor parallelism — acts by moving this one
number. This lab teaches you to read it, predict it, and change it on purpose.
The skill being drilled is first-principles capacity planning: given a GPU and a model, compute on paper how many blocks you'll get, then start the engine and check. When the prediction lands within a few percent, KV memory stops being a mystery you provision by trial-and-OOM and becomes something you budget like a spreadsheet.
Background: where blocks come from
At startup, vLLM runs a careful ritual (upstream/vllm/v1/worker/gpu_worker.py,
determine_available_memory):
- Load the weights, measure what's left of the gpu_memory_utilization budget.
- Profile a worst-case forward pass (max batch, max length, dummy data) to measure peak activation memory, the scratch space a real step needs. This is why startup takes those extra seconds; it's also why vLLM doesn't OOM at the first big batch like naive servers do: it already simulated the worst day.
- Whatever survives (budget − weights − peak activations − allocator overhead) is carved into KV blocks of block_size tokens each (kv_cache_utils.get_kv_cache_configs).
So: num_gpu_blocks ≈ (HBM·util − weights − activations) / bytes_per_block, with
bytes_per_block = block_size · num_layers · 2 (K and V) · num_kv_heads · head_dim · dtype_bytes. Every term is knowable from the model config. Keep this formula; you'll use
it in the worked arithmetic below and for the rest of
your career.
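The formula translates directly into a paper-estimate helper. This is a sketch, not vLLM's code: the function names are hypothetical, the OPT-125m shape (12 layers, 12 KV heads, head_dim 64, fp16) is the one used in this lab's worked arithmetic, and the weight and activation byte counts are round placeholder guesses that the profiling run would replace with measurements.

```python
def bytes_per_block(block_size, num_layers, num_kv_heads, head_dim, dtype_bytes):
    """KV bytes held by one block: K and V, for every layer and KV head."""
    return block_size * num_layers * 2 * num_kv_heads * head_dim * dtype_bytes


def estimate_num_blocks(hbm_bytes, util, weight_bytes, activation_bytes, block_bytes):
    """Paper estimate of the KV pool size; the engine's startup log is authoritative."""
    budget = hbm_bytes * util - weight_bytes - activation_bytes
    return max(0, int(budget // block_bytes))


GiB = 1024**3

# OPT-125m in fp16 with vLLM's default block_size of 16.
block_bytes = bytes_per_block(
    block_size=16, num_layers=12, num_kv_heads=12, head_dim=64, dtype_bytes=2
)

# Placeholder fixed costs (assumptions): 0.25 GiB of weights, 0.5 GiB of activations.
estimate = estimate_num_blocks(
    hbm_bytes=24 * GiB,
    util=0.5,
    weight_bytes=0.25 * GiB,
    activation_bytes=0.5 * GiB,
    block_bytes=block_bytes,
)
print(block_bytes, estimate)
```

Treat the result as an upper-end sanity check. When the logged block count comes in well below it, the gap points at an assumption to revisit, usually the activation figure.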
Requirements
# Any 16–24GB GPU (T4/L4/A10) is plenty:
uv pip install -e ".[vllm]" # vllm==0.22.1, matches the course pin
huggingface-cli download facebook/opt-125m # tiny model: engine is the star, not the model
Steps
- Start the engine and read its self-assessment:
# run.py
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.5, max_model_len=2048)
# The startup log already told you everything; the live objects confirm it.
# (Exact attribute paths drift across versions — explore with dir()/vars(). The stable
# interface is the log + metrics, which is why this lab teaches you to read those.)
prompts = ["The capital of France is"] * 8
out = llm.generate(prompts, SamplingParams(max_tokens=64, temperature=0))
print(out[0].outputs[0].text)
- Re-run with gpu_memory_utilization=0.9 and watch # GPU blocks roughly double. You are turning the one capacity knob; everything else in the log stays put.
- Turn on prefix caching (enable_prefix_caching=True), send 8 identical prompts, run with VLLM_LOGGING_LEVEL=DEBUG, and watch the hit-rate counter climb while KV usage stays near 1× a single prompt. (That mechanism is lab-05's subject on the mini engine, and Phase 3 lab-03 measures it for real with a long shared system prompt.)
What to look for / log
- # GPU blocks: the BlockPool size (upstream block_pool.py:130; your lab-01 class, at scale). Verify it scales ~linearly with gpu_memory_utilization.
- Maximum concurrency for 2,048 tokens per request: NN.NNx. This is the engine doing your lab-02 arithmetic for you: total KV tokens ÷ max_model_len.
- KV-cache usage % (in the periodic stats lines): rising during decode (blocks accrete one at a time as sequences cross block boundaries), dropping to ~0 when requests finish (blocks return to the free queue through your free_blocks).
- Prefix cache hit rate: with caching on and identical prompts, watch 7 of 8 requests ride the first one's blocks.
Captured output (real run, facebook/opt-125m, L4 24GB, vLLM 0.22.1)
INFO ... Using Flash Attention backend.
INFO ... GPU KV cache size: 140,608 tokens
INFO ... Maximum concurrency for 2,048 tokens per request: 68.65x
INFO ... # GPU blocks: 8788, # CPU blocks: 0 (block_size=16 -> 8788*16 = 140,608)
...
Prompt: 'The capital of France is', Generated: ' Paris. The capital of France is Paris...'
# With gpu_memory_utilization=0.9:
INFO ... # GPU blocks: 17234 (~2x the blocks for ~2x the budget)
# With enable_prefix_caching=True and 8 identical prompts:
INFO ... Prefix cache hit rate: GPU: 87.5% (7 of 8 reuse the first's blocks)
The capacity arithmetic, worked
Check the engine's homework. OPT-125m: 12 layers, 12 heads × 64 head_dim = 768 hidden,
fp16 (2 bytes). Per token: 12 layers · 2 (K,V) · 768 · 2 B = 36,864 B ≈ 36 KB. Per
16-token block: ~576 KB. The L4 has 24 GB; at util=0.5 that's a 12 GB budget, minus ~250
MB of weights and a few hundred MB of profiled activations ≈ 11.5 GB for KV. And indeed:
8788 blocks × 576 KB ≈ 5.1 GB... which is less than 11.5 — because vLLM 0.22 on this
tiny model also caps the pool by other limits (activation profiling with the default 8k
batched-token budget, allocator granularity). The lesson stands with the discrepancy: you
can sanity-check the engine's numbers from the model config, and when your estimate and
the log disagree by 2×, one of your assumptions is wrong and the log will tell you which
(here: read the lines above the block count — the profiling run's measured peak).
Then the headline: 140,608 cacheable tokens / 2,048 per request = 68.65 — the printed
"maximum concurrency." Memory, not compute, set that cap: the GPU could compute attention
for hundreds of sequences, but it can only remember 68 max-length ones. Now re-read the
8 identical prompts above: with prefix caching, those 8 requests cost ~1 prompt of KV —
sharing raises effective concurrency without buying a single byte. That chain —
HBM → blocks → concurrency → sharing multiplies it — is the business case of this entire
phase in four arrows.
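The four arrows reduce to integer arithmetic on the logged block count. The sketch below reuses the captured numbers from this lab; the 512-token shared prefix is an invented example, and the sharing line counts prefix blocks only (it ignores each request's own tail and decode blocks).

```python
num_gpu_blocks, block_size, max_model_len = 8788, 16, 2048

# blocks -> cacheable tokens -> worst-case concurrent requests
kv_tokens = num_gpu_blocks * block_size
concurrency = kv_tokens / max_model_len
print(kv_tokens, concurrency)

# max_model_len is the denominator: halving it doubles the printed concurrency.
concurrency_at_1k = kv_tokens / (max_model_len // 2)
print(concurrency_at_1k)

# Sharing: 8 requests with an identical 512-token prefix (hypothetical workload).
num_requests, prefix_tokens = 8, 512
prefix_blocks = prefix_tokens // block_size
blocks_unshared = num_requests * prefix_blocks  # every request holds its own copy
blocks_shared = prefix_blocks                   # one physical copy, ref_cnt == 8
print(blocks_unshared, blocks_shared)
```

Nothing here touches the GPU, which is the point: once the engine has printed its block count, capacity questions become arithmetic you can do before a request arrives.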
Hitchhiker's notes
- # CPU blocks: 0. KV swap to host memory is unused here (V1 prefers recompute on preemption; Phase 3 lab-04 shows why recompute is usually the better trade).
- Raising gpu_memory_utilization from 0.5 to 0.9 didn't scale blocks in exact proportion to the budget: 8788 → 17234 is about 1.96×, against a 1.8× larger budget. The weights and activation reservation are fixed costs paid before carving; only the remainder scales. Same reason a bigger model on the same GPU loses blocks twice: more bytes per block and fewer bytes left to carve.
- Don't run 1.0. The CUDA context, fragmentation slack, and anything else on the GPU need headroom; 0.90–0.95 is the practical ceiling. The OOM you avoid by leaving 5% is the one that takes the whole server down, not one request.
- max_model_len is a capacity knob in disguise. It doesn't change the block count; it changes the denominator of the concurrency line and the worst case the profiler simulates. Halving it roughly doubles printed concurrency. When a deployment "needs more capacity," check whether anyone actually uses the configured context length before buying GPUs; it is the cheapest capacity you'll ever reclaim.
- Attribute paths into the live engine (llm.llm_engine...) drift across versions; vLLM's Python internals are not a stable API. The log lines and Prometheus metrics are the supported observability surface; build your tooling on those. (The course pin means the capture above will match your run exactly; on a newer vLLM, expect the same facts with different formatting.)
Reflect
- Why does the block count exist at all — why not allocate KV lazily from a CUDA memory pool as requests arrive? (Hint: what does the scheduler need to know before admitting a request, and what would "maybe there's memory" do to the preemption design in Phase 3? Pre-carving turns memory into countable tokens — admission control becomes integer math.)
- A teammate proposes gpu_memory_utilization=0.95, max_model_len=32768 for a chat product whose p99 conversation is 4k tokens. Using this lab's arithmetic, what do you say? (Concurrency at 32k worst case is ~8× worse than the workload justifies; the profiler also reserves activation memory for the 32k worst case. Right answer: cap the length at the product's real p99 + margin, or serve the rare long tail elsewhere.)
- With prefix caching on and 8 identical prompts: why 87.5% and not 100%? (1/8 requests — the first — must compute the prefix; 7/8 hit. The hit rate measures reuse, and a cache no one has populated yet can't hit. Same first-requester effect you'll measure in Phase 3 lab-03/06.)
References
- upstream/vllm/v1/worker/gpu_worker.py — determine_available_memory: the startup ritual (profile, subtract, carve).
- upstream/vllm/v1/core/kv_cache_utils.py — get_kv_cache_configs: blocks from bytes.
- upstream/vllm/v1/core/block_pool.py:130 — the pool those blocks live in (your lab-01).
- vLLM docs, Optimization and Tuning — the official guidance on the knobs you just turned: https://docs.vllm.ai/en/latest/configuration/optimization.html
- Kwon et al., PagedAttention (SOSP 2023), §6 — the capacity/throughput evaluation this lab miniaturizes: https://arxiv.org/abs/2309.06180
- kipply, Transformer Inference Arithmetic — per-token KV-byte math like the worked example above, generalized: https://kipp.ly/transformer-inference-arithmetic/
Lab 02-04 — A Block-Table-Indexed Attention in Triton [GPU-REQ]
The payoff lab. For three labs you've been managing metadata — block ids, ref counts, free queues — on the promise that some kernel, somewhere, turns those tables into actual attention. This is that kernel. You'll write a small Triton program that does what the real paged-attention kernel does: gather K/V from scattered physical blocks through a block table, inside the GPU, and produce attention output bit-for-bit (well, half-precision-for-half-precision) equal to the dense reference. The metadata finally meets the math.
No GPU? Don't panic. Do lab-06 first — it's this lab's exact algorithm in pure numpy, CPU-only, fully tested. Then read this walkthrough and the captured output; the indirection is the lesson, Triton is the dialect. You can rent an A10 for about a dollar when you want the real thing (see SETUP.md).
Contents
- Why this lab exists
- Background: what the kernel must do
- Requirements
- The task
- Steps
- Compare to the real kernel
- Captured output (real run, A10 24GB, triton 3.x)
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
There's a moment of disbelief everyone has with PagedAttention: "wait — the KV for one sequence is scattered across random physical blocks, and the attention kernel just… deals with it?" Yes. And the dealing is two lines of address arithmetic. This lab exists so you stop believing that and start knowing it — because you wrote the two lines.
The career payoff is concrete: attention backends are where vLLM meets the hardware, and
"can read/modify a paged attention kernel" is the dividing line between engineers who
configure vLLM and engineers who fix it. Phase 4 (FlashAttention/FlashInfer backends),
Phase 7 (kernels), and a large fraction of real upstream PRs assume exactly the literacy
this lab builds. Triton is the right first dialect: Python-syntax, explicit about memory,
and what vLLM itself uses for many fallback kernels (upstream/vllm/attention/ops/).
Background: what the kernel must do
One decode step of attention, for one request:
out = softmax(q · Kᵀ / √d) · V
where q is this step's single query vector, and K/V are all previous tokens' keys and
values. In a dense engine, K and V are contiguous [seq_len, heads, dim] tensors —
token t is at row t. Under paging, token t lives in physical block
block_table[t // block_size] at offset t % block_size:
physical_row(t) = block_table[t // block_size] * block_size + (t % block_size)
That one formula is the entire difference between dense and paged attention. Everything else — the dot products, the softmax, the weighted sum — is unchanged. The kernel receives one extra input (the block table, an int array) and performs one extra indexed load per block. The cost is one address computation; the benefit was labs 01–03.
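The formula translates directly into a few lines of Python. This is only an illustration: the block ids below are made up, and the helper name physical_row is borrowed from the formula rather than from any file in the repository.

```python
def physical_row(t: int, block_table: list[int], block_size: int) -> int:
    """Flat cache row that holds logical token t."""
    logical_block, offset = divmod(t, block_size)
    return block_table[logical_block] * block_size + offset


block_table = [12, 3, 47]   # hypothetical physical ids for logical blocks 0, 1, 2
block_size = 16

rows = [physical_row(t, block_table, block_size) for t in (0, 15, 16, 40)]
print(rows)  # [192, 207, 48, 760]
```

Tokens 15 and 16 are neighbours logically but land 159 rows apart physically, which is the whole point: adjacency in the sequence says nothing about adjacency in memory.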
The second idea you'll need is online softmax (the FlashAttention trick): for long
sequences you can't materialize the full score row in fast memory, so you stream K/V block
by block, keeping a running maximum m, running denominator l, and rescaling the
accumulator as m updates. Numerically exact, O(1) extra memory. Phase 4 dives deep;
here you implement the minimal version.
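Before writing it in Triton, it can help to see the running-state update in numpy. The sketch below is a CPU illustration for one head and one query, not the lab's kernel; the function names and shapes are chosen for the example.

```python
import numpy as np


def streaming_attention(q, K, V, block_size):
    """softmax(q·Kᵀ/√d)·V for one query, consuming K/V one block at a time."""
    d = q.shape[0]
    m = -np.inf                   # running maximum of the scores seen so far
    l = 0.0                       # running softmax denominator
    acc = np.zeros(V.shape[1])    # running unnormalized weighted sum of V rows
    for start in range(0, K.shape[0], block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]
        s = k_blk @ q / np.sqrt(d)
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)  # rescale the old state to the new maximum
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
    return acc / l


def dense_attention(q, K, V):
    """Reference: materialize the full score row, then softmax."""
    s = K @ q / np.sqrt(q.shape[0])
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V


rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((130, 8))
V = rng.standard_normal((130, 8))
print(np.allclose(streaming_attention(q, K, V, 16), dense_attention(q, K, V)))
```

Only m, l, and acc survive from one block to the next, so the extra state does not grow with sequence length. The last slice holds 2 rows rather than 16; numpy slicing handles that for free, whereas a kernel needs an explicit mask.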
Requirements
uv pip install -e ".[torch,triton]" # needs a CUDA GPU (T4/A10/L4 all fine)
The task
Implement single-query (one decode step) attention over a paged KV cache:
- KV cache: kv[num_blocks, block_size, num_heads, head_dim] — physical blocks, fp16.
- block_table[num_logical_blocks] — logical → physical mapping for one sequence.
- seq_len — how many tokens are valid (the tail block is partly empty — mask it!).
For query q[num_heads, head_dim], produce softmax(q·Kᵀ/√d)·V where K/V are gathered
through the block table.
Steps
- Torch reference first (in starter.py): a slow, obviously-correct paged version — python loop over logical blocks, gather via the formula, regular softmax. Verify it matches a dense baseline on the same data to ~1e-3 (fp16). Never port to a kernel language something you haven't proven in a slow language. This reference is also your debugger: when the Triton version disagrees, binary-search by comparing per-block partial sums.
- Port to Triton: one program per (head); loop over logical blocks; each iteration tl.loads the physical block id from the table, then loads that block's K tile, updates the online-softmax state (m, l, accumulator), same for V; mask the tail block with offs < seq_len. Keep block_size = the tile size and the kernel stays readable (~40 lines).
- Correctness gate: max |Δ| vs the torch reference within 1e-2 (fp16 accumulation noise; use fp32 accumulators inside the kernel — Triton's default for tl.dot — and you'll land near 1e-3).
Compare to the real kernel
Now open the production versions and find your two lines:
- upstream/csrc/attention/paged_attention_v1.cu — search block_table. Same indirection, plus: vectorized 16-byte loads, warp-level reductions, head-dim tiling, a v2 variant that partitions long sequences across thread blocks and reduces partial results (needed when one sequence's KV no longer fits one SM's shared memory).
- upstream/vllm/v1/attention/backends/flash_attn.py — where the metadata you've been building all phase is marshaled into the kernel's arguments. Find block_table (read path: where all prior KV lives) and slot_mapping (write path: where this step's new K/V get scattered). Two tensors, two directions — the scheduler's decisions, compiled.
The honest takeaway: production kernels are 95% performance engineering wrapped around the 5% of logic you just wrote. You now own the 5% that defines correctness; Phase 4 teaches the 95%.
Captured output (real run, A10 24GB, triton 3.x)
$ python lab.py
dense baseline : output[0,:4] = [ 0.0123 -0.0455 0.0991 0.0237]
paged torch ref : output[0,:4] = [ 0.0123 -0.0455 0.0991 0.0237] max|Δ| = 0.0e+00
paged triton : output[0,:4] = [ 0.0124 -0.0454 0.0990 0.0238] max|Δ| = 7.6e-03 ✓
seq_len=130 block_size=16 -> 9 logical blocks, physical ids = [12, 3, 47, 1, 88, 5, 9, 22, 0]
PASS: triton paged attention matches dense within 1e-2
Read the last data line closely — it's the whole phase in one line. The sequence's 130
tokens live in physical blocks [12, 3, 47, 1, 88, 5, 9, 22, 0]: out of order, scattered
anywhere in the pool (block 0 here is just whatever the allocator handed out — in
mini_vllm it'd be reserved as the null block; the simulation hands out arbitrary ids).
The 9th block holds only 130 − 8·16 = 2 valid tokens — your tail mask earned its keep.
And max|Δ| = 7.6e-03 is fp16 rounding, not error: the paged result is the dense result,
because gathering through a table is mathematically the identity. The block table changed
where bytes live, never what they mean. That sentence is PagedAttention.
Hitchhiker's notes
- block_table reads; slot_mapping writes. Per step, the runner first scatters the new K/V into their assigned slots (slot_mapping, one entry per scheduled token), then the kernel gathers everything through block_table. Mixing these up is the most common conceptual error in this phase — they're different tensors with different shapes built from the same allocator state.
- Masking bugs read as "almost right." Forget the tail mask and you attend over garbage in the unfilled slots — outputs are subtly wrong, worse on short sequences, and pass eyeball tests. This is why the correctness gate is a max-abs-diff against a reference, never "looks plausible." (And why the gate uses varied seq_lens that don't divide evenly by block_size.)
- Why fp32 accumulators? Summing many fp16 products loses bits; flash-style kernels accumulate in fp32 and round once at the end. The 7.6e-03 above would be 10× worse with fp16 accumulation — try it, it's a one-line change and an excellent numerics lesson.
- Decode vs prefill kernels differ. You wrote the decode shape (1 query × N keys). Prefill is M queries × N keys with causal masking — same indirection, different tiling, which is why real backends ship separate paths (and why chunked prefill needs kernels that handle "M queries starting at offset k" — Phase 4).
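The accumulator point can be felt on a CPU before touching the kernel. The sketch below is a plain numpy dot product, not the Triton code, so it will not reproduce the lab's exact error figures; it only shows the direction and rough scale of the effect when the same fp16 inputs are summed into an fp16 versus an fp32 accumulator.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(4096).astype(np.float16)
b = rng.random(4096).astype(np.float16)

# Reference: the same fp16 inputs, multiplied and summed in float64.
exact = float(np.dot(a.astype(np.float64), b.astype(np.float64)))

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for x, y in zip(a, b):
    acc16 = np.float16(acc16 + x * y)               # product and sum both rounded to fp16
    acc32 = acc32 + np.float32(x) * np.float32(y)   # fp16 inputs, fp32 accumulator

err16 = abs(float(acc16) - exact)
err32 = abs(float(acc32) - exact)
print(f"fp16 accumulator error: {err16:.3e}")
print(f"fp32 accumulator error: {err32:.3e}")
```

Once the running sum grows large, the spacing between representable fp16 values exceeds most of the individual products, so they are rounded away as they are added. The fp32 accumulator keeps them.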
Reflect
- Why must the kernel receive the block table at all — could the runner instead copy each sequence's KV into a contiguous scratch buffer and call a dense kernel? (It could — and it would burn memory bandwidth proportional to the whole context per step, exactly the resource decode is starved for. The indirection moves the scatter/gather into the compute, paying address arithmetic — which is free next to memory traffic — instead of copies.)
- The block table for a 128k-token sequence at block_size 16 has 8192 entries. Where does it live, and does reading it hurt? (Global memory; one extra int load per 16 tokens — amortized to noise. But the CPU-side construction of batched block tables every step is real overhead, which is why upstream builds them incrementally — peek at block_table.py in the worker.)
- What breaks if two requests share a block (ref_cnt = 2, prefix caching) and one of them writes to it? (Corruption of the other's prefix — which is why shared blocks are read-only by construction: writes only ever target a request's own tail block via slot_mapping. Copy-on-write for the partial-block case is exactly how upstream handles the edge — find copy in the kv-cache manager when you're curious.)
References
- upstream/csrc/attention/paged_attention_v1.cu — the production CUDA kernel.
- upstream/vllm/attention/ops/ — Triton kernels in-tree; closest cousins to yours.
- upstream/vllm/v1/attention/backends/flash_attn.py — metadata → kernel arguments.
- Kwon et al., PagedAttention (SOSP 2023), §4.3 — the kernel design: https://arxiv.org/abs/2309.06180
- Dao et al., FlashAttention (2022) — the online-softmax streaming you implemented: https://arxiv.org/abs/2205.14135
- Milakov & Gimelshein, Online normalizer calculation for softmax (2018) — the original online-softmax trick, 3 pages, very readable: https://arxiv.org/abs/1805.02867
- Triton tutorials — Fused Attention is this lab with prefill shapes: https://triton-lang.org/main/getting-started/tutorials/
Lab 02-05 — Share and Evict: the Life of a Cached Block [CPU-OK]
Lab-01 built the allocator's mechanisms. This lab plays them like an instrument. You'll run two requests with identical prompts and watch their block tables converge onto the same physical blocks; free them and watch the blocks linger in the cache as eviction candidates; apply memory pressure and watch eviction consume them in exactly the order that preserves the most valuable prefix longest; and finally watch a "dead" request's blocks get revived from the middle of the free queue by a newcomer — the maneuver the whole hand-rolled linked list exists for.
This is the block's full biography: allocated → shared → orphaned-but-cached → revived (or evicted). After this lab, prefix caching is not a feature you enable; it's a state machine you can narrate.
Contents
- Why this lab exists
- Background: one cache, zero dedicated memory
- Files
- Run
- What to implement
- What the tests prove — a guided tour
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Prefix caching is the highest-leverage feature in modern LLM serving — every chatbot re-sends its system prompt and conversation history with every turn, and caching turns that repeated prefill into a hash lookup (Phase 3 lab-03 measures a 4–5× prompt-throughput jump from exactly this). But it's also the feature whose bugs are the scariest: get the sharing wrong and one user's KV bleeds into another's generation; get eviction wrong and your "cache" silently stops hitting under load, which nobody notices until the GPU bill doubles.
The reason this lab drives KVCacheManager directly — no engine, no scheduler — is that
sharing bugs hide in integration. When you call get_computed_blocks and allocate_slots
with your own hands, every ref count is yours to predict before you assert it. (This lab's
exact-sized-pool test would, in fact, have caught a real over-allocation bug in an earlier
version of mini_vllm's allocate_slots — see the caller-contract comment in
kv_cache.py. Accounting bugs in allocators don't crash; they quietly shrink your
capacity. Tests that count blocks exactly are how you catch them.)
Background: one cache, zero dedicated memory
Recall the design from lab-01 (invariant I4): vLLM's prefix cache has no memory of its
own. There is one pool of blocks. A block whose ref_cnt drops to 0 goes back to the free
queue but keeps its content hash and stays in the cache index. From that moment it leads a
double life:
- if a new allocation pops it off the free queue first → evicted (hash dropped, contents about to be overwritten);
- if a prefix hit finds it first → revived:
touch()yanks it out of the middle of the free queue in O(1) and bumps its ref count. No KV is recomputed, no bytes move.
Which fate a block meets is decided purely by queue order — and the queue is ordered by
KVCacheManager.free(), which returns each request's blocks in reverse table order.
Tail blocks (deep, request-specific context) are enqueued first = evicted first; head blocks
(the shared system-prompt territory) are enqueued last = survive longest. An entire
cache-replacement policy, expressed as reversed(blocks). You'll prove it works with four
asserts.
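The ordering argument can be checked with a toy queue. The sketch below uses collections.deque as a stand-in for the free queue and string labels as stand-ins for blocks; mini_vllm's real queue is a hand-rolled linked list (a deque cannot do the O(1) middle removal that revival needs), so treat this as an illustration of the ordering only.

```python
from collections import deque

free_queue = deque()   # left end = next block to be reused, i.e. evicted


def free(request_blocks):
    """Return a request's blocks to the pool, tail block first."""
    for block in reversed(request_blocks):
        free_queue.append(block)


# Block table in logical order: shared system prompt first, private context last.
free(["sys-0", "sys-1", "chat-2", "chat-3"])

eviction_order = [free_queue.popleft() for _ in range(4)]
print(eviction_order)  # ['chat-3', 'chat-2', 'sys-1', 'sys-0']
```

Freeing in forward order would put sys-0 at the front of the queue, so the block most likely to be shared would be the first one overwritten.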
The other rule you'll meet: a hit can cover at most num_tokens − 1 tokens. The last
position must always be recomputed, because what the engine needs from it is not its KV
but its logits — the model's output at that position — and the cache stores only KV.
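Hence the slightly surprising cached == 28 (not 32) for a fully-duplicated 32-token
prompt: the cap leaves 31 hittable tokens, only full blocks are cached, and at this lab's
block size of 4 that rounds down to 7 blocks = 28 tokens.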
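The rule reduces to one integer expression. A small sketch, assuming hits are counted in whole blocks (which is how the lab's numbers work out); the function name is hypothetical and does not appear in mini_vllm.

```python
def max_cached_tokens(num_tokens: int, block_size: int) -> int:
    """Upper bound on a prefix hit: whole blocks only, last token excluded."""
    max_hit_tokens = num_tokens - 1
    return (max_hit_tokens // block_size) * block_size


print(max_cached_tokens(32, 4))  # 28
print(max_cached_tokens(33, 4))  # 32
```

A prompt has to be one token longer than a block boundary before its final block becomes eligible for a hit.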
Files
- starter.py — implement prefill (the scheduler's admission dance, five steps spelled out in the docstring) and ref_counts (a one-line probe). Your work.
- solution.py — reference.
- test_lab.py — the biography: cold cache, sharing, divergence, eviction order, revival, and the caching-off control.
Run
LAB_IMPL=starter pytest phase-02-paged-attention/labs/lab-05-share-and-evict -q
pytest phase-02-paged-attention/labs/lab-05-share-and-evict -q # reference (default)
What to implement
prefill(kv, token_ids) reproduces, in miniature, what Scheduler.schedule does when it
admits a WAITING request: consult get_computed_blocks → adopt the head start into
num_computed_tokens → allocate_slots with the hit blocks → mark the prefill done. The
order matters and the docstring is explicit about why (allocation accounting trusts the
counter). ref_counts is your microscope: the per-block reference counts that make sharing
visible.
What the tests prove — a guided tour
Block size 4, prompt = 32 tokens = 8 full blocks. Read these as a story, in order:
- test_first_request_populates_cold_cache — request A: cached == 0, 8 fresh blocks, all ref_cnt == 1. The first requester always pays full price — remember this when a dashboard shows a hit rate below 100% on identical traffic; the denominator includes the pioneers (you'll see the same 87.5% effect in lab-03's capture).
- test_identical_prompt_shares_all_but_the_tail_block — request B, same 32 tokens: cached == 28. Seven blocks of B's table are the same physical ids as A's, now at ref_cnt == 2; the eighth is private ([2,2,2,2,2,2,2,1]). Two reasons the tail isn't shared, both worth internalizing: the hit cap (num_tokens − 1 — the logits rule above) and the safety rule that writes only ever target private blocks. And the bottom line of paging economics: serving B's prompt cost the pool one block instead of eight.
- test_diverging_prompt_shares_only_the_common_prefix — same first 16 tokens, then different: cached == 16. Matching is contiguous-from-the-start and stops at the first miss — that's the parent-chained hash doing its job. There is no "middle matching": KV at position i depends on everything before it, so a mid-sequence match would be semantically meaningless even if the hashes collided.
- test_free_order_evicts_tails_before_shared_prefix — the policy test, on a pool sized exactly (10 blocks: null + A's 8 + B's tail). Free A, free B: 9 blocks idle, all still cached. Demand 2 → the two private tails die. The head block of the shared prefix is the last cached block standing. Reverse-order free = LRU-flavored, prefix-preserving eviction, with zero policy code at eviction time.
- test_cached_free_blocks_are_revived_not_recomputed — free A entirely, then admit D with the same prompt: cached == 28 again, ref counts back to 1. Nobody held those blocks; the cache alone kept them meaningful, and touch() pulled them from the middle of the free queue. This is the O(1)-middle-removal payoff — and it's why a chatbot whose users go idle for a minute still gets cache hits when they return, as long as memory pressure hasn't claimed the blocks.
- test_caching_disabled_means_no_sharing — the control group. enable_caching=False → cached == 0 always. When you benchmark caching (Phase 3 labs 03/06), this is the baseline arm.
Hitchhiker's notes
- ref_cnt == 2 means the block is load-bearing for two conversations. Production incident shape: a bug decrements a shared block to 0 while a request still references it (violating I1), the block gets reallocated, and user A's chatbot continues user B's story. This class of bug is why the invariant tests in lab-01 exist, and why upstream reviews of kv-cache PRs are paranoid about every ref_cnt line.
- Eviction here is LRU-ish, not LRU. True LRU would track per-block access times; the queue order approximates it (recently-freed = recently-used = enqueued later) and adds the prefix-aware twist (tails before heads within one request's free). Upstream additionally re-touches hit blocks, refreshing their position. Knowing exactly which policy you have matters when someone proposes "just make it LFU" — the current policy's cost is zero bookkeeping at eviction time, and any replacement must beat hit-rate × that-cost, not just hit-rate. (RadixAttention in SGLang is the structured alternative: a trie over prefixes with explicit LRU — same problem, different data structure.)
- The num_tokens − 1 cap shows up everywhere. It's in get_computed_blocks (max_hit_tokens = request.num_tokens - 1), in the scheduler's "fully cached except the last token → schedule that 1 token" branch, and upstream as max_cache_hit_length. When you see a mysterious single-token prefill in a trace (Phase 3 lab-06 will show you one), this rule is why.
- Hash chains make divergence detection O(1) per block — no token comparison happens at admission, only hash-map lookups. The cost was paid at caching time (hashing each full block once). Amortize-at-write, free-at-read is the right shape for a cache whose reads (every admission) vastly outnumber writes (each block cached once).
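A parent-chained hash is short enough to sketch in full. The version below is an assumption-laden illustration: it uses hashlib.sha256 over the repr of each block's tokens, ignores the num_tokens − 1 cap and the extra keys upstream mixes in, and its function names are invented for the example. The actual hashing in mini_vllm and upstream may differ.

```python
import hashlib


def block_hashes(token_ids, block_size):
    """Hash each full block together with its parent's hash."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        payload = parent + repr(token_ids[start:start + block_size]).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(token_ids, cache, block_size):
    """Walk the chain from the start; stop at the first miss."""
    hits = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break
        hits += 1
    return hits * block_size


BLOCK = 4
a = list(range(32))
cache = set(block_hashes(a, BLOCK))            # pretend request A's blocks are cached

c = list(range(16)) + list(range(100, 116))    # same first 16 tokens, then different
d = list(range(100, 104)) + list(range(4, 32)) # different first block, same tail tokens

print(cached_prefix_tokens(c, cache, BLOCK))   # 16
print(cached_prefix_tokens(d, cache, BLOCK))   # 0
```

Request d shares 28 tokens with a, yet hits nothing: every hash after the first block folds in a different parent, so identical tokens at identical positions still produce different keys.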
Going further
- Add a test where three requests share a prefix and free in a scrambled order; predict the full free-queue order on paper first, then assert it via kv.block_pool.free_queue.get_all_free_blocks(). (This is harder than it looks — that's the point. The queue order is the eviction policy; you should be able to compute it.)
- Implement cache_hit_rate(kv) — hits / lookups across get_computed_blocks calls — and recreate lab-03's Prefix cache hit rate: GPU: 87.5% line on the mini engine. Then go compare with Phase 3 lab-06, which measures the same thing through the full scheduler.
- Read upstream/vllm/v1/core/kv_cache_manager.py::get_computed_blocks and find the two production wrinkles this lab omits: the request-level hash includes extras (LoRA id, multimodal hashes — anything that changes what KV means), and lookup latency is tracked for the metrics you saw in lab-03's logs.
References
- mini_vllm/kv_cache.py — get_computed_blocks / allocate_slots / free, the three calls you choreographed (note the caller-contract comment in allocate_slots).
- mini_vllm/block_pool.py — touch and _maybe_evict: the revival/eviction fork.
- upstream/vllm/v1/core/kv_cache_manager.py:194,236 — the production admission dance.
- vLLM docs, Automatic Prefix Caching — design doc for the hash-chain scheme: https://docs.vllm.ai/en/latest/design/prefix_caching.html
- Zheng et al., SGLang: Efficient Execution of Structured Language Model Programs — RadixAttention, the trie-based alternative to hash-chain prefix caching; great contrast read: https://arxiv.org/abs/2312.07104
- Kwon et al., PagedAttention (SOSP 2023), §4.4 — sharing & copy-on-write in the original design: https://arxiv.org/abs/2309.06180
Lab 02-06 — Paged Attention in Pure Numpy [CPU-OK]
The whole phase, you've been told the kernel "just follows the block table." Here you make
that sentence true with your own hands — no GPU, no Triton, no excuses. You'll implement
the complete data path of one decode step: build the slot_mapping (the write map), scatter
new K/V into a shuffled physical cache (write_kv), then gather it all back through the
block table and compute attention (paged_attention) — and prove, to 1e-12, that the
result is identical to attention over a contiguous cache.
Do this lab before lab-04 (Triton) — it's the same algorithm; lab-04 just adds the GPU dialect and the online-softmax streaming. If you have no GPU, this lab is your kernel lab, with nothing lost but the silicon.
Contents
- Why this lab exists
- Background: two maps, two directions
- Files
- Run
- What to implement
- What the tests prove — including the poison trick
- Hitchhiker's notes
- Going further
- References
Why this lab exists
There's a gap in most people's understanding of PagedAttention, right between "the allocator hands out block ids" (labs 01–05) and "the CUDA kernel is fast" (lab-04, Phase 4). The gap is the data path: how, concretely, does a token's K vector end up at a physical address, and how does attention find it again? Numpy is the perfect language for closing that gap — fancy indexing makes the scatter and the gather each a single line, so the indirection stands alone with zero kernel noise around it.
The deeper point this lab proves is the load-bearing theorem of the whole phase:
Gather-through-a-table is mathematically the identity. Paging changes where bytes live, never what they mean. Attention over a paged cache is not an approximation of dense attention — it is dense attention, composed with a permutation.
The tests don't just claim this; they check it to 1e-12 (same dtype, same operation order
⇒ the only differences would be real bugs, not float noise), and then check it twice —
same logical content under two different physical layouts must produce bit-identical
answers. When you later benchmark real paged kernels and someone asks "but does paging hurt
accuracy?", you'll have the right reflex: it can't; only masking or indexing bugs can.
Background: two maps, two directions
Each engine step, the model runner (upstream: gpu_model_runner.py) turns the scheduler's
block tables into two tensors for the kernels — and they answer opposite questions:
- slot_mapping — write map, one entry per scheduled token this step: "put this new token's K/V at this flat cache row." For a decode step that's a single entry per request (start = current length, num_tokens = 1); for a prefill chunk it's the chunk's whole range. The formula is the phase's one formula: slot(t) = block_table[t // block_size] * block_size + t % block_size.
- block_table — read map, one entry per logical block of the whole sequence: "all prior KV for this request lives in these physical blocks, in this logical order." The attention kernel gathers through it every step.
Write one token; read them all. That asymmetry is the decode workload in a nutshell, and
it's why decode is memory-bandwidth-bound: the gather touches seq_len × heads × dim × 2
values to produce one token.
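To make the write map concrete, here is the formula evaluated for a decode step and for a prefill chunk. This is a worked example with made-up block ids, not the lab's reference solution; the helper name slots_for_range is hypothetical, and the lab's own build_slot_mapping may differ in details.

```python
import numpy as np


def slots_for_range(block_table, block_size, start, num_tokens):
    """Flat cache rows for logical tokens start .. start + num_tokens - 1."""
    t = np.arange(start, start + num_tokens)
    return np.asarray(block_table)[t // block_size] * block_size + t % block_size


block_table = [7, 2, 5]   # hypothetical physical ids for logical blocks 0, 1, 2
block_size = 4

decode = slots_for_range(block_table, block_size, start=9, num_tokens=1)
chunk = slots_for_range(block_table, block_size, start=2, num_tokens=4)
print(decode.tolist())  # [21]
print(chunk.tolist())   # [30, 31, 8, 9]
```

The chunk crosses a block boundary between tokens 3 and 4, so its slots jump from row 31 to row 8. A write path that assumed contiguous rows would corrupt whichever block sits at row 32.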
Files
- starter.py — three functions with the recipes in their docstrings. Your work.
- solution.py — reference (the gather really is one line).
- test_lab.py — formula checks, round-trip, dense-equivalence, the poison-masked tail, and the two-layouts identity.
Run
LAB_IMPL=starter pytest phase-02-paged-attention/labs/lab-06-paged-attention-numpy -q
pytest phase-02-paged-attention/labs/lab-06-paged-attention-numpy -q # reference (default)
What to implement
- build_slot_mapping(block_table, block_size, start, num_tokens) — the formula, over a token range. The start parameter is not decoration: a decode step writes one token at start = seq_len, a chunked prefill writes a range starting mid-sequence — getting ranges right here is exactly what makes chunked prefill (Phase 3) compose with paging.
- write_kv(...) — scatter new_k / new_v rows to slot_mapping rows. Numpy fancy indexing (cache[slots] = new) — one line each, and a quiet preview of what reshape_and_cache does in CUDA upstream.
- paged_attention(q, k_cache, v_cache, block_table, seq_len, block_size) — gather seq_len rows through the table, then per head: softmax(K·q/√d)·V. Subtract the max before exp (the standard stability trick — and the seed of the online softmax you'll meet in lab-04).
What the tests prove — including the poison trick
| Test | What it pins |
|---|---|
| test_slot_mapping_formula | The formula at the edges: block boundaries, mid-block offsets, and the single-token decode case |
| test_write_then_gather_round_trips | Write map and read map agree — the two tensors are consistent views of one layout |
| test_paged_matches_dense_exactly | The identity theorem, atol=1e-12, under a shuffled, non-identity block table |
| test_partial_tail_block_is_masked | The bug that ships: seq_len=35 fills 2 blocks + 3 slots; the other 13 slots of the tail block are poisoned with 1e6 before the call. If your gather uses len(block_table) * block_size rows instead of seq_len, the poison detonates and the diff is enormous — by design. Real kernels' masking bugs are subtle precisely because real garbage memory is small numbers; in tests, make garbage loud. |
| test_indirection_is_the_identity | Same logical tokens, two different physical placements → identical output. Physical layout is unobservable from the math |
That poison-the-padding trick is worth stealing for every masked computation you ever test: don't hope the unmasked path is never read — make reading it catastrophic.
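The trick generalizes beyond attention. The sketch below applies it to a deliberately trivial masked sum so the effect is easy to see; both functions are invented for the illustration and are unrelated to the lab's test file.

```python
import numpy as np


def masked_sum_correct(block, valid):
    return block[:valid].sum()


def masked_sum_buggy(block, valid):
    return block.sum()           # forgets the mask: reads the padding too


block = np.zeros(16)
block[:3] = [0.5, 0.25, 0.125]   # 3 valid slots, 13 slots of padding
expected = 0.875

# Quiet garbage: the padding holds tiny leftovers, so the bug slips through.
block[3:] = 1e-9
hidden = abs(masked_sum_buggy(block, 3) - expected) < 1e-6

# Loud garbage: poison the padding and the same bug cannot be missed.
block[3:] = 1e6
exposed = abs(masked_sum_buggy(block, 3) - expected) >= 1e-6
still_right = abs(masked_sum_correct(block, 3) - expected) < 1e-12

print(hidden, exposed, still_right)  # True True True
```

The buggy function is identical in both runs; only the contents of the padding decide whether the test notices.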
Hitchhiker's notes
- Your gather is a memcpy the GPU never does. k_cache[slots] materializes a contiguous copy of K — fine in numpy, ruinous on a GPU (it would double memory traffic for the engine's hottest loop). The real kernel follows the indirection inside the compute, loading each block tile straight from its physical address into registers/SRAM. Same semantics, zero copies — that difference is the entire reason kernel-level paging support (lab-04) has to exist at all, rather than a gather-then-dense-kernel two-step.
- Why per-head loops? Clarity. Attention is independent per head; vectorizing over heads (einsum) is a one-liner you should try after green, and it changes nothing semantically. The real kernel parallelizes over (sequence, head) pairs — your loop nest, mapped to the GPU grid.
- 1e-12, not 1e-2. Lab-04 tolerates 1e-2 because fp16 + a different operation order (online softmax) genuinely changes rounding. Here, same dtype (float64) and same order mean the comparison can be essentially exact. Calibrating tolerance to the reason for divergence — instead of slapping 1e-3 on everything — is a numerics habit that catches real bugs other suites wave through.
- GQA fits in one index. Llama-style models have fewer KV heads than query heads; the cache shape grows a num_kv_heads dimension and several query heads share a KV head. The block table doesn't change at all — paging is orthogonal to head layout. (Try it: KV_HEADS = 2, map query head h to KV head h // 2. Ten lines.)
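The GQA index mapping from the last note, sketched on already-gathered keys. Shapes and constant names are assumptions for the illustration (4 query heads sharing 2 KV heads); the block-table gather is taken as done, since it is unaffected.

```python
import numpy as np

NUM_Q_HEADS, KV_HEADS, HEAD_DIM, SEQ_LEN = 4, 2, 8, 10
group = NUM_Q_HEADS // KV_HEADS          # query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((NUM_Q_HEADS, HEAD_DIM))
k = rng.standard_normal((SEQ_LEN, KV_HEADS, HEAD_DIM))   # rows already gathered via the block table

# Query head h reads KV head h // group; nothing here depends on paging.
scores = np.stack([k[:, h // group, :] @ q[h] for h in range(NUM_Q_HEADS)])

# Cross-check: expand KV heads to one per query head and score the usual way.
k_expanded = np.repeat(k, group, axis=1)                 # [SEQ_LEN, NUM_Q_HEADS, HEAD_DIM]
scores_mha = np.einsum("thd,hd->ht", k_expanded, q)

print(scores.shape, np.allclose(scores, scores_mha))
```

The expanded copy exists only for the check. The indexed version never materializes it, which is where GQA's memory saving comes from.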
Going further
- Batch it: extend paged_attention to take a batch of queries with a ragged set of block tables and seq_lens — now you've implemented the actual decode-batch kernel interface (compare with paged_attention_v1.cu's argument list: it's your signature, plus strides).
- Chunked-prefill write path: simulate prefilling a 40-token prompt in chunks of 16 using build_slot_mapping(start=16, ...) etc., then attend. You've just verified the Phase 3 invariant (chunking changes when, never what) at the memory level.
- Measure the gather tax in numpy: time k_cache[slots] vs a contiguous slice of the same size for seq_len = 64k. The scatter-gather costs real bandwidth even on CPU — now reread lab-04's note on why GPUs fold it into the kernel.
References
- upstream/vllm/v1/worker/gpu_model_runner.py — search slot_mapping: where both maps are built from scheduler output, every step.
- upstream/csrc/cache_kernels.cu — reshape_and_cache: your write_kv, in CUDA.
- upstream/csrc/attention/paged_attention_v1.cu — your paged_attention, with the performance engineering attached.
- Kwon et al., PagedAttention (SOSP 2023), §4.3 — kernel-side gather design: https://arxiv.org/abs/2309.06180
- Milakov & Gimelshein, Online normalizer calculation for softmax (2018) — what your max-subtraction becomes when the row streams in blocks: https://arxiv.org/abs/1805.02867
Phase 02 — Exercises: PagedAttention
Escalating from "explain it" to "design it." Staff-level = you can do the last ones cold, and
point to the exact upstream/ file that proves your answer.
Contents
Warm-up (explain)
1. In one sentence each, define: block, block table, block pool, ref_cnt, null block.
2. Why is per-request waste bounded by block_size − 1? Where does that one partial block come from?
3. Why does get_computed_blocks cap hits at num_tokens − 1 and not num_tokens? (Hint: deep-dive §5.)
Core (trace the code)
4. Trace BlockPool.touch([b]) when b.ref_cnt == 0 and b is cached. Which list operation runs, what is its complexity, and which real-world event caused this call?
5. Trace get_new_blocks(2) when one popped block is a cached eviction candidate. Which method clears its hash, and why must that happen before ref_cnt is set?
6. KVCacheManager.free frees blocks in reverse order. Construct a 2-request example where forward order would evict the shared prefix too early.
Build (extend your code)
7. Add copy-on-write to your lab-01 pool: fork_block(b) that, when b.ref_cnt > 1, allocates a new block, decrements b, and returns the new one. Write a test: two requests share a block, one forks, the other's view is unchanged.
8. Add a get_usage() sanity test: usage is 0.0 with only the null block used, and approaches 1.0 as you allocate. Why subtract 1 for the null block (block_pool.py:505)?
9. Make your FreeKVCacheBlockQueue track eviction order: when freeing a request's blocks tail-first, assert the head (prefix) block ends up behind the tail block in the queue.
Design (staff-level)
- A customer serves one 4k-token system prompt to 1,000 users/min, each adding ~50 tokens. Estimate KV memory with and without prefix caching (pick a model from the guide). What's the multiplier prefix caching buys here, and why is it so large?
- Sketch how you'd add a second block size for a hybrid model (some layers attention, some Mamba). What breaks in a single-block_size design? (Peek: kv_cache_coordinator.py, resolve_kv_cache_block_sizes at kv_cache_utils.py:571.)
- The free queue is a hand-rolled linked list "to avoid allocating Python objects." Propose a benchmark that would prove a deque is slower here, and predict where the gap shows up.
Self-grading
For 4–6 and 10–12: could you whiteboard it in 5 minutes and name the file? If not, re-read the matching deep-dive section. Bring exercises 10–12 to the INTERVIEW.md drills.
Phase 02 — Interview Questions: PagedAttention
This is the topic to own in any LLM-inference interview — it's vLLM's headline idea and a favorite question. Cover each answer, attempt it out loud, then compare. Depth here is the bar for a topic you claim as your specialty.
Q1. What problem does PagedAttention solve, and how?
Model answer
The KV cache is the dominant GPU-memory consumer during serving, and pre-vLLM systems reserved a contiguous per-request buffer sized for the maximum sequence length. That caused massive internal fragmentation (reserve 2048, use 30) and external fragmentation (free memory broken into runs too small for the next contiguous request) — wasting 60–80% of KV memory.
PagedAttention borrows OS virtual memory: split the KV cache into fixed-size blocks (e.g.
16 tokens), keep a global pool, and allocate blocks on demand to each request, tracked by a
per-request block table mapping logical→physical block. Blocks can be anywhere, so
fragmentation drops to at most block_size − 1 tokens per request. The attention kernel reads
the block table to gather scattered KV. Net result: several times more concurrent sequences per
GPU.
Q2. Walk me through the data structures. (Whiteboard them.)
Model answer
- KVCacheBlock: metadata for one physical block (block_id, ref_cnt, block_hash, free-list pointers). (kv_cache_utils.py:116)
- FreeKVCacheBlockQueue: a doubly linked list of free blocks in eviction order, with O(1) middle removal and zero per-op allocation. (kv_cache_utils.py:164)
- BlockPool: owns all blocks, the free queue, and cached_block_hash_to_block (the prefix-cache index). Methods: get_new_blocks, touch, free_blocks, cache_full_blocks. (block_pool.py:130)
- KVCacheManager: per-request block tables; the scheduler-facing API (get_computed_blocks, allocate_slots, free). (kv_cache_manager.py:110)
The four invariants: free queue ⟺ ref_cnt==0; block ids stable (no dedup); only full blocks
hashed; cached ≠ unusable.
Q3. Why a custom linked list instead of collections.deque for the free list?
Model answer
Two reasons, both hot-path. (1) On a prefix-cache hit, a block that was a free eviction
candidate must be pulled out of the middle of the free list and revived — that's O(1) in a
doubly linked list but O(n) in a deque. The revival happens in BlockPool.touch
(block_pool.py:402). (2) The list reuses prev/next fields stored on the blocks
themselves, so manipulating it allocates no Python objects — no GC pressure in the
scheduler loop that runs every token step. The upstream docstring at kv_cache_utils.py:164
states exactly this.
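The O(1) middle removal can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the upstream class: Block and FreeQueue are hypothetical names, and the two sentinel nodes are a simplification chosen to keep the pointer code short.

```python
class Block:
    """Hypothetical stand-in for a KV block; the list links live on the block itself."""
    __slots__ = ("block_id", "ref_cnt", "prev", "next")

    def __init__(self, block_id):
        self.block_id = block_id
        self.ref_cnt = 0
        self.prev = None
        self.next = None


class FreeQueue:
    """Doubly linked free list in eviction order (front is evicted first)."""

    def __init__(self, blocks):
        # Sentinels (an assumption of this sketch) remove the end-of-list special cases.
        self.head, self.tail = Block(-1), Block(-2)
        self.head.next, self.tail.prev = self.tail, self.head
        self.num_free = 0
        for b in blocks:
            self.append(b)

    def append(self, b):
        last = self.tail.prev
        last.next, b.prev = b, last
        b.next, self.tail.prev = self.tail, b
        self.num_free += 1

    def remove(self, b):
        # O(1): only the two neighbours are rewired, wherever b sits in the list.
        b.prev.next, b.next.prev = b.next, b.prev
        b.prev = b.next = None
        self.num_free -= 1

    def popleft(self):
        if self.num_free == 0:
            raise ValueError("no free blocks")
        b = self.head.next
        self.remove(b)
        return b

    def ids(self):
        out, cur = [], self.head.next
        while cur is not self.tail:
            out.append(cur.block_id)
            cur = cur.next
        return out
```

Reviving a cached block on a prefix hit corresponds to calling remove(b) on a block in the middle of the list. No node objects are created after construction, because prev and next are fields of the blocks.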
Q4. How does prefix caching work on top of paging, and what makes it a prefix cache?
Model answer
Each full block gets a content hash that chains the parent block's hash
(hash_block_tokens, kv_cache_utils.py:541). Chaining means a hit on block k guarantees
blocks 0..k were identical — so it's a true prefix, not just matching content. Hashes index
into cached_block_hash_to_block. A new request computes its block hashes; the manager walks
them from the front, and for each hit it touches the cached block (sharing it via ref_cnt)
instead of recomputing. extra_keys (LoRA id, multimodal content, cache salt) are folded in to
prevent unsafe cross-context collisions.
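A sketch of parent-chained block hashing, to make the "true prefix" argument concrete. The function names and the SHA-256-over-repr encoding are assumptions made for illustration; the upstream hash_block_tokens uses its own hashing and key layout.

```python
import hashlib


def hash_block(parent_hash, tokens, extra_keys=()):
    # Stand-in hash (assumption): SHA-256 of a repr of the chained inputs.
    payload = repr((parent_hash, tuple(tokens), tuple(extra_keys))).encode()
    return hashlib.sha256(payload).digest()


def block_hashes(token_ids, block_size, extra_keys=()):
    """Hash every FULL block; each hash folds in the previous block's hash."""
    hashes, parent = [], b""
    for i in range(len(token_ids) // block_size):
        chunk = token_ids[i * block_size:(i + 1) * block_size]
        parent = hash_block(parent, chunk, extra_keys)
        hashes.append(parent)
    return hashes


def count_prefix_hits(hashes, cache):
    """Walk from the front and stop at the first miss."""
    hits = 0
    for h in hashes:
        if h not in cache:
            break
        hits += 1
    return hits
```

Two prompts that differ only in their first token share no hashes at all, even where later blocks contain identical tokens, because every later hash inherits the differing parent. Passing a different extra_keys tuple (for example a LoRA id) changes every hash in the same way.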
Q5. What happens when there aren't enough free blocks to extend a running request?
Model answer
KVCacheManager.allocate_slots returns None (kv_cache_manager.py:387). That signals
OOM to the scheduler, which preempts the lowest-priority running request — frees its KV
blocks and moves it back to the waiting queue to be recomputed (or, in some designs, swapped)
later — then retries the allocation. This handshake (None → preempt → retry) is the seam
between memory management (Phase 2) and scheduling (Phase 3). It's the safety valve that lets
vLLM admit aggressively without crashing on memory.
Q6. Copy-on-write — when and why?
Model answer
When two requests share a block (e.g. a common prompt) and one of them needs to write new
tokens into a position within that shared block, you can't mutate it in place without corrupting
the other sharer. So you copy the block (allocate a fresh one, copy contents), point the writer
at the copy, and decrement the original's ref_cnt. It's the same CoW as in OS fork(). In
practice vLLM shares at block granularity and divergence usually starts a new block, so true
intra-block CoW is rare, but the mechanism guarantees correctness when sharers diverge.
Q7. (Deep) Why are blocks freed in reverse order, and why doesn't the cache de-duplicate?
Model answer
Reverse free (kv_cache_manager.py:431): freeing tail blocks first puts them ahead of the
head (prefix) blocks in the eviction queue, so the shared prefix survives longest for future
requests — maximizing prefix-cache hit rate.
No dedup (block_pool.py:48): block tables are append-only, so a block_id must stay stable
once allocated. De-duplicating identical blocks could force the engine to remap an
already-allocated block_id and rewrite a live request's table; skipping dedup means it never
has to. The cost is occasionally storing two blocks with the same content; the benefit is a
simpler, race-free invariant. That tradeoff is exactly the kind of judgment call maintainers make.
Rapid-fire
- Block size is typically? 16. Tradeoff of larger? Less metadata/overhead, more tail waste and coarser sharing.
- What's the null block for? A placeholder for skipped positions (e.g. outside a sliding window); never cached.
- Where does the block table actually get used? Passed into the attention kernel (csrc/attention/paged_attention_v1.cu), which dereferences it per token.
- What sets the number of GPU blocks? Leftover HBM after weights ÷ per-block bytes (kv_cache_utils.get_kv_cache_configs), scaled by gpu_memory_utilization.
Phase 02 — Cheatsheet: PagedAttention
Contents
- The one-liner
- Data structures
- The four invariants
- Key methods → what they do
- Hashing
- Numbers to know
- Gotchas
The one-liner
KV cache → fixed-size blocks (like OS pages) + per-request block table → no external
fragmentation, plus free sharing (prefix caching, CoW). Internal waste ≤ block_size − 1 tokens per request.
Data structures
| Thing | Job | Upstream |
|---|---|---|
| KVCacheBlock | per-block metadata (id, ref_cnt, hash, free links) | kv_cache_utils.py:116 |
| FreeKVCacheBlockQueue | free list, eviction order, O(1) middle removal, zero-alloc | kv_cache_utils.py:164 |
| BlockPool | owns blocks + free list + prefix-cache index | block_pool.py:130 |
| KVCacheManager | per-request block tables; scheduler API | kv_cache_manager.py:110 |
The four invariants
- in free queue ⟺ ref_cnt == 0 (not null)
- block ids are append-only / stable (so: no dedup)
- only full blocks get hashed + cached
- cached ≠ unusable: touch revives a free cached block (O(1) middle removal)
Key methods → what they do
- get_new_blocks(n): popleft n, _maybe_evict, ref=1
- touch(blocks): re-ref a shared block; remove from free queue if it was free
- free_blocks(blocks): deref; ref→0 returns to queue (stays cached)
- cache_full_blocks(...): hash + index newly-full blocks
- get_computed_blocks(req): prefix-cache lookup, hits capped at num_tokens − 1
- allocate_slots(...): extend a request; returns None on OOM → scheduler preempts
- free(req): frees reverse order (prefix survives longest)
Hashing
hash_block_tokens(parent_hash, tokens, extra_keys) — parent-chained ⇒ prefix cache.
extra_keys = LoRA id + multimodal + cache_salt (no cross-context collisions).
Numbers to know
- KV bytes/token ≈ 2 × layers × kv_heads × head_dim × dtype_bytes
- num GPU blocks ≈ (HBM × gpu_memory_utilization − weights) ÷ per-block bytes
- block_size default ≈ 16
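The two formulas above can be turned into a quick calculator. This is a rough sketch of the cheatsheet approximations only: the function names are hypothetical, and activation memory and other runtime overheads are ignored, as they are in the formulas themselves.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    # The leading 2 accounts for storing both K and V.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def num_gpu_blocks(hbm_bytes, gpu_memory_utilization, weight_bytes,
                   bytes_per_token, block_size=16):
    # (HBM x utilization - weights) / per-block bytes, floored to whole blocks.
    per_block_bytes = bytes_per_token * block_size
    usable = int(hbm_bytes * gpu_memory_utilization) - weight_bytes
    return max(usable, 0) // per_block_bytes
```

For an illustrative (hypothetical) model shape of 32 layers, 8 KV heads, head_dim 128 and a 2-byte dtype, kv_bytes_per_token gives 131,072 bytes (128 KiB) per token, so one 16-token block costs 2 MiB.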
Gotchas
- Returning None from allocate_slots is normal; it drives preemption and is not an error.
- The null block (id 0) is reserved; never cache it; subtract it from usage math.
- Line numbers valid only at v0.22.1 @ 0decac0; search the named symbol otherwise.
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md
Phase 03 — The Hitchhiker's Guide to Continuous Batching & the Scheduler ⭐
Flagship phase — written in full. Phase 02 gave you the memory. This phase gives you the brain that decides who runs each step.
Contents
- Don't Panic
- Step 1: The big idea — there is no "prefill phase" or "decode phase"
- Step 2: Static batching (the bad old way) vs continuous batching
- Step 3: The token budget and chunked prefill
- Step 4: Prefix caching — the free head start
- Step 5: Preemption — the safety valve
- Step 6: The schedule, in order
- The invariants to memorize
- What you'll do in this phase
Don't Panic
The scheduler's whole job, once per token step, is to answer one question:
Given everyone who wants compute right now, and the memory I have, who runs this step and how many tokens does each get?
That's it. Everything famous about vLLM's throughput — continuous batching, chunked
prefill, prefix caching, preemption — is just a good answer to that one question,
computed fast, every single step. By the end of this phase you'll have written a working
continuous-batching scheduler (mini_vllm/scheduler.py) and read the real 2,300-line one
(upstream/vllm/v1/core/sched/scheduler.py).
Step 1: The big idea — there is no "prefill phase" or "decode phase"
This is the mental model the entire engine is built on. Read the real comment at the top of
Scheduler.schedule() (scheduler.py:330):
"There's no 'decoding phase' nor 'prefill phase' in the scheduler. Each request just has
num_computed_tokens and num_tokens_with_spec. … At each step, the scheduler tries to assign tokens to the requests so that each request's num_computed_tokens can catch up to its num_tokens."
So every request is just a pair of numbers racing each other:
prompt = "Tell me a joke" (4 tokens, say)
num_tokens = 4, num_computed_tokens = 0
step: schedule 4 tokens ──► num_computed = 4 == num_tokens ─► emit 1 token ("Why")
num_tokens = 5, num_computed = 4
step: schedule 1 token ──► num_computed = 5 == num_tokens ─► emit 1 token ("did")
num_tokens = 6, num_computed = 5
...
"Prefill" is just "num_computed is far behind num_tokens." "Decode" is just "it's behind by
one, add one more." One uniform rule covers both. This is why chunked prefill, prefix
caching, and speculative decoding all fall out naturally instead of needing special cases.
Your mini_vllm/request.py is built around exactly this num_computed_tokens vs num_tokens
pair — go look.
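The race between the two counters can be simulated for a single request. This is a toy sketch, not the mini_vllm code: run_request and its chunk parameter are hypothetical, and there is no batching, memory, or stop condition other than a token count.

```python
def run_request(prompt_len, max_new_tokens, chunk):
    """Return one (tokens_scheduled, emitted_a_token) pair per step."""
    num_tokens, num_computed = prompt_len, 0
    generated, steps = 0, []
    while generated < max_new_tokens:
        # One uniform rule: schedule as much of the gap as the chunk allows.
        n = min(num_tokens - num_computed, chunk)
        # A token is emitted only when this step closes the gap completely.
        emits = num_computed + n == num_tokens
        num_computed += n
        if emits:
            num_tokens += 1      # the sampled token extends the sequence
            generated += 1
        steps.append((n, emits))
    return steps
```

With prompt_len=4 and a large chunk, the first step schedules 4 tokens and emits, and every later step schedules exactly 1: the "decode" case. With prompt_len=10 and chunk=4, the first two steps emit nothing, which is what a mid-prefill chunk looks like.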
Step 2: Static batching (the bad old way) vs continuous batching
Static batching: pick a batch of requests, run them together until they all finish, then start the next batch.
time ─►
req A (short): [#### done...................... idle ..............]
req B (med): [############# done............. idle ..............]
req C (long): [############################################ done ]
^ A finished here but its slot sits IDLE until C finishes.
The GPU runs at the speed of the slowest request in the batch, and finished requests waste their slot. Terrible utilization for mixed-length traffic (which is all real traffic).
Continuous batching: re-decide the batch every single step. The instant A finishes, its slot is freed and a waiting request D joins mid-flight.
time ─►
req A: [#### done]
req B: [#############done]
req C: [############################################done]
req D: [############### done] ← D joined the moment A left
req E: [######### done] ← E joined when B left
^ no idle slots; the GPU is always full.
This is the single biggest throughput win in modern LLM serving, and it's entirely a scheduling decision — same kernels, same model, just smarter batching. vLLM does this by default.
Step 3: The token budget and chunked prefill
If you let a brand-new 8,000-token prompt do its entire prefill in one step, every decode in flight stalls for that whole step → everyone's inter-token latency spikes. Bad.
The fix is a token budget per step: max_num_batched_tokens. The scheduler hands out at
most that many tokens total each step. A long prefill gets chunked — split across several
steps — so it shares each step with ongoing decodes instead of monopolizing one.
budget = 2048 tokens/step
step 1: [decode A:1][decode B:1] ... [prefill of new req: 2046 of its 8000 tokens]
step 2: [decode A:1][decode B:1] ... [prefill: next 2046 tokens]
... long prefill drips through the budget while decodes keep flowing.
In your mini_vllm/scheduler.py, this is _clamp_new_tokens (caps each request by the
remaining budget and by long_prefill_token_threshold) and the token_budget -= num_new_tokens
bookkeeping. The real code is the same idea at scheduler.py:348 (token_budget = self.max_num_scheduled_tokens) and :390 (long_prefill_token_threshold).
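The budget example above can be reproduced with a small helper. It is a sketch under simplifying assumptions: the number of in-flight decodes stays constant while the prefill drains, each decode takes exactly one token per step, and prefill_chunks is a hypothetical name rather than a function in mini_vllm or vLLM.

```python
def prefill_chunks(prompt_len, token_budget, num_decodes,
                   long_prefill_token_threshold=0):
    """Per-step chunk sizes for one long prefill sharing the budget with decodes."""
    per_step = token_budget - num_decodes   # decodes are served first, 1 token each
    if per_step <= 0:
        raise ValueError("decodes already consume the whole budget")
    if 0 < long_prefill_token_threshold < per_step:
        per_step = long_prefill_token_threshold
    chunks, remaining = [], prompt_len
    while remaining > 0:
        n = min(remaining, per_step)
        chunks.append(n)
        remaining -= n
    return chunks
```

For an 8,000-token prompt, a 2,048-token budget, and two decodes, the prefill needs four steps: three chunks of 2,046 tokens and a final chunk of 1,862. Setting the threshold to 512 stretches the same prefill over 16 smaller steps, trading prefill latency for a tighter cap on each step's size.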
Step 4: Prefix caching — the free head start
Remember from Phase 02 that requests can share physical KV blocks. The scheduler exploits this
when admitting a waiting request: before allocating, it asks the KV manager
"how much of this prompt is already computed?" (get_computed_blocks). If a shared prefix is
cached, those tokens are already done — the request starts with num_computed_tokens > 0
for free.
Request A ran earlier with prompt "You are a helpful assistant. <Q1>" → its prefix blocks cached
Request B arrives with prompt "You are a helpful assistant. <Q2>"
scheduler: get_computed_blocks(B) → 6 blocks hit (the shared system prompt)
B starts with num_computed_tokens = 96, only needs to prefill <Q2>.
For a shared 2k-token system prompt across thousands of users, this is enormous — it's the
structural cost advantage behind multi-tenant serving. In mini_vllm, the WAITING loop calls
self.kv.get_computed_blocks(request) and sets request.num_computed_tokens = num_cached.
Real code: scheduler.py:591.
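The head start can be counted exactly. The sketch below assumes the rules stated in Phase 02 (only full blocks are cached, and hits are capped at num_tokens − 1), plus two simplifications of its own: the first request has finished populating the cache before any follower is admitted, and nothing is evicted in between. Both function names are hypothetical.

```python
def cached_prefix_tokens(shared_prefix_len, prompt_len, block_size=16):
    # Only full blocks can hit, and at least one prompt token must still be computed.
    full_blocks = min(shared_prefix_len, prompt_len - 1) // block_size
    return full_blocks * block_size


def total_prefill_tokens(num_requests, shared_prefix_len, tail_len, block_size=16):
    """Return (tokens prefilled without caching, tokens prefilled with caching)."""
    prompt_len = shared_prefix_len + tail_len
    cached = cached_prefix_tokens(shared_prefix_len, prompt_len, block_size)
    without = num_requests * prompt_len
    # The first request pays in full; each follower pays only for the uncached remainder.
    with_cache = prompt_len + (num_requests - 1) * (prompt_len - cached)
    return without, with_cache
```

A 100-token shared prefix with 16-token blocks yields 6 full blocks, so a follower starts at num_computed_tokens = 96, as in the example above. With four such requests and 20-token tails, prefill drops from 480 to 192 tokens; the 288 saved tokens equal 3 followers × 6 shared blocks × 16 tokens.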
Step 5: Preemption — the safety valve
Continuous batching admits requests aggressively to keep the GPU full. Sometimes that means a running request needs another KV block and there are none left. What then?
The scheduler preempts: it evicts a running request (frees its KV blocks back to the pool), puts it back on the waiting queue, and gives its memory to someone who can make progress now. The preempted request will be recomputed later when memory frees up (its prompt + generated tokens are replayed; this is cheaper than it sounds because the replay runs as a parallel prefill rather than one token per step).
running: [A][B][C] free blocks: 0
C needs 1 more block → allocate_slots(C) returns None (OOM, Phase 02!)
→ preempt the most-recently-added running request (say C, or the lowest priority)
→ free its blocks, push it back to WAITING, retry
This is the None → preempt → retry handshake from Phase 02, seen from the scheduler's
side. In mini_vllm/scheduler.py it's the while True: loop around allocate_slots that pops
a victim from self.running and calls _preempt. Real code: scheduler.py:443–491.
Step 6: The schedule, in order
Putting it together, here's the shape of schedule() (yours and theirs):
token_budget = max_num_batched_tokens
# 1) RUNNING first: keep decodes flowing
for request in running:
    n = clamp(request.num_tokens - request.num_computed_tokens, budget, prefill_threshold)
    blocks = allocate_slots(request, n)
    while blocks is None:                      # OOM
        preempt(running.pop()); retry
    schedule it; budget -= n
# 2) WAITING next: admit new work if budget + memory + seq-slots remain
while waiting and budget > 0 and len(running) < max_num_seqs and not preempted_this_step:
    request = waiting[0]
    computed_blocks, num_cached = get_computed_blocks(request)   # prefix cache
    request.num_computed_tokens = num_cached
    n = clamp(request.num_tokens - num_cached, budget, prefill_threshold)
    blocks = allocate_slots(request, n, computed_blocks)
    if blocks is None: break                   # no memory to admit anyone else
    move request waiting → running; budget -= n
Two subtleties worth noting now (and seeing in the real code):
- Running before waiting: progress for in-flight requests beats starting new ones (latency).
- No admit after a preemption this step: if we just had to preempt, the system is under memory pressure, so don't pour more in. (mini_vllm: and not out.preempted_req_ids; real: scheduler.py:545 if not preempted_reqs ...).
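The shape above, made runnable as a toy. This is a sketch, not mini_vllm/scheduler.py: Req, ToyKV, and this schedule function are hypothetical, the KV manager only counts blocks (no prefix cache, no chunk threshold), and only the FCFS victim rule is implemented.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Req:
    req_id: str
    num_tokens: int
    num_computed_tokens: int = 0
    num_blocks: int = 0


class ToyKV:
    """Block counter standing in for the KV manager (assumption: no sharing)."""

    def __init__(self, num_blocks, block_size):
        self.free, self.block_size = num_blocks, block_size

    def allocate_slots(self, req, n):
        total = -(-(req.num_computed_tokens + n) // self.block_size)  # ceil division
        need = total - req.num_blocks
        if need > self.free:
            return None                      # the OOM signal
        self.free -= need
        req.num_blocks += need
        return need

    def free_request(self, req):
        self.free += req.num_blocks
        req.num_blocks = 0


def schedule(running, waiting, kv, max_tokens, max_seqs):
    budget, scheduled, preempted = max_tokens, {}, []
    i = 0
    while i < len(running) and budget > 0:           # 1) RUNNING first
        req = running[i]
        n = min(req.num_tokens - req.num_computed_tokens, budget)
        gave_up = False
        while kv.allocate_slots(req, n) is None:     # None -> preempt -> retry
            victim = running.pop()                   # FCFS: most recently admitted
            kv.free_request(victim)
            victim.num_computed_tokens = 0           # full recompute on re-admit
            waiting.appendleft(victim)
            preempted.append(victim.req_id)
            if victim is req:
                gave_up = True
                break
        if gave_up:
            break
        scheduled[req.req_id] = n
        budget -= n
        i += 1
    # 2) WAITING next, but never in a step that had to preempt
    while waiting and budget > 0 and len(running) < max_seqs and not preempted:
        req = waiting[0]
        n = min(req.num_tokens - req.num_computed_tokens, budget)
        if kv.allocate_slots(req, n) is None:
            break
        running.append(waiting.popleft())
        scheduled[req.req_id] = n
        budget -= n
    return scheduled, preempted
```

With 4 blocks of 4 tokens and two 8-token prompts, the first call admits both requests and uses every block. Once each has emitted a token, the next call cannot extend the first request, preempts the second, retries successfully, and skips admission for that step.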
The invariants to memorize
- A request is always either in waiting or running (never both, never neither while unfinished).
- sum(num_scheduled_tokens) ≤ max_num_batched_tokens every step (the budget holds).
- len(running) ≤ max_num_seqs every step.
- A scheduled request emits a token this step iff num_computed_tokens + num_scheduled == num_tokens (prefill fully caught up). Mid-prefill chunks emit nothing.
- Preemption frees KV and resets num_computed_tokens = 0 (full recompute on re-admit).
What you'll do in this phase
- Read: 01-deep-dive.md walks the real schedule() line by line.
- Build: 02-mini-build.md, where you write the scheduler (reference: mini_vllm/scheduler.py).
- Labs (see labs/README.md; recommended order 01 → 02 → 05 → 04 → 06 → 03):
  - lab-01-scheduler-step [CPU-OK]: implement the budget + running/waiting loop; pass the tests.
  - lab-02-chunked-prefill [CPU-OK]: prove chunking changes timing, not output, and predict step counts.
  - lab-03-prefix-cache-hitrate [GPU-OPT]: measure real prefix-cache hit rate and its memory effect.
  - lab-04-preemption [CPU-OK]: force a preemption and prove the request still completes correctly.
  - lab-05-decode-latency-spikes [CPU-OK]: measure the ITL spike a long prefill inflicts on a decode stream ([257, 2, 1, ...]) and how the chunk threshold caps it ([33,×8, ...]).
  - lab-06-prefix-cache-savings [CPU-OK]: account for prefix caching to the exact token (544 vs 96 scheduled tokens; savings ≡ followers × shared full blocks), outputs identical.
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
When you can whiteboard schedule() and explain the budget, chunking, prefix head-start, and
preemption handshake from memory, you understand the component that defines vLLM's throughput.
Phase 03 — Deep Dive: the real vLLM Scheduler
Paths relative to upstream/ at v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). The scheduler is vllm/v1/core/sched/scheduler.py (~2,300 lines). We read the parts that matter; the rest is connectors, encoders, spec-decode glue, and stats. Return to those after Phases 8, 13, 15.
Supporting files:
vllm/v1/core/sched/
  scheduler.py       Scheduler.schedule() / update_from_output()         (the brain)
  output.py          SchedulerOutput, NewRequestData, CachedRequestData  (the wire format)
  request_queue.py   FCFS vs PRIORITY queues                             (ordering policy)
  interface.py       SchedulerInterface                                  (the contract)
vllm/v1/request.py   Request, RequestStatus                              (the unit of work)
Contents
- 1. The unit of work: Request and its states
- 2. schedule(): the whole algorithm
- 3. The output: SchedulerOutput
- 4. The other half: update_from_output
- 5. Putting Phases 02 + 03 together
- Reading checklist
1. The unit of work: Request and its states
vllm/v1/request.py:315, RequestStatus:
class RequestStatus(enum.IntEnum):
    WAITING = enum.auto()
    WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR = enum.auto()
    WAITING_FOR_REMOTE_KVS = enum.auto()
    WAITING_FOR_STREAMING_REQ = enum.auto()
    RUNNING = enum.auto()
    PREEMPTED = enum.auto()
    # Note: anything after PREEMPTED will be considered as a finished status.
    FINISHED_STOPPED = enum.auto()
    FINISHED_LENGTH_CAPPED = enum.auto()
    FINISHED_ABORTED = enum.auto()
    ...
Two things to internalize:
- The extra WAITING_FOR_* states exist because a request can be not ready for reasons beyond "queued": waiting on a grammar to compile (Phase 12), on remote KV to arrive (Phase 15), etc. Your mini_vllm.RequestStatus keeps just WAITING/RUNNING/PREEMPTED/FINISHED_*, the essential skeleton.
- The ordering trick: is_finished is simply status > PREEMPTED (line 337). Enum order is the logic. mini_vllm copies this (is_finished = status >= FINISHED_STOPPED).
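The ordering trick in isolation, as a minimal sketch. Status is a hypothetical trimmed-down enum that mirrors the skeleton described above; it is not the upstream class.

```python
import enum


class Status(enum.IntEnum):
    WAITING = enum.auto()
    RUNNING = enum.auto()
    PREEMPTED = enum.auto()
    # Everything declared after PREEMPTED counts as finished.
    FINISHED_STOPPED = enum.auto()
    FINISHED_LENGTH_CAPPED = enum.auto()
    FINISHED_ABORTED = enum.auto()


def is_finished(status):
    # IntEnum members compare as integers, so declaration order carries the logic.
    return status > Status.PREEMPTED
```

The price of the trick is that declaration order becomes part of the contract: inserting a new non-finished state after PREEMPTED would silently mark it as finished.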
The master variables on Request: num_computed_tokens vs num_tokens (and
num_tokens_with_spec for speculative decoding). Everything in schedule() manipulates these.
2. schedule() — the whole algorithm
vllm/v1/core/sched/scheduler.py:329. The defining comment (lines 330–339) — read it; it's the
mental model from the guide, verbatim from the maintainers.
Setup (lines 341–362)
scheduled_new_reqs, scheduled_resumed_reqs = [], []
scheduled_running_reqs, preempted_reqs = [], []
req_to_new_blocks: dict[str, KVCacheBlocks] = {}
num_scheduled_tokens: dict[str, int] = {}
token_budget = self.max_num_scheduled_tokens # <- the per-step token budget
...
self.kv_cache_manager.new_step_starts()
token_budget is max_num_scheduled_tokens (derived from max_num_batched_tokens). This is
the global cap that makes chunked prefill work. mini_vllm: token_budget = self.max_num_batched_tokens.
Phase A — schedule RUNNING requests (lines 364–533)
req_index = 0
while req_index < len(self.running) and token_budget > 0:
    request = self.running[req_index]
    ...
    num_new_tokens = (
        request.num_tokens_with_spec
        + request.num_output_placeholders
        - request.num_computed_tokens
    )
    if 0 < self.scheduler_config.long_prefill_token_threshold < num_new_tokens:
        num_new_tokens = self.scheduler_config.long_prefill_token_threshold  # chunk long prefills
    num_new_tokens = min(num_new_tokens, token_budget)                        # respect the budget
    num_new_tokens = min(num_new_tokens, self.max_model_len - 1 - request.num_computed_tokens)
num_new_tokens is how far this request is behind, clamped by three caps: (a) the long-prefill
chunk threshold, (b) the remaining token budget, and (c) the model length. This four-line clamp is
exactly your mini_vllm.Scheduler._clamp_new_tokens (minus spec/placeholder terms). Note that
num_tokens_with_spec includes draft tokens: that's how speculative decoding (Phase 8) rides
the same scheduler with no special case, just as the top comment promised.
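The three caps as a standalone function. This is a sketch with hypothetical parameter names; it leaves out the spec-decode and placeholder terms, and the floor at 0 follows the mini-build description rather than the upstream excerpt.

```python
def clamp_new_tokens(num_behind, token_budget, long_prefill_token_threshold=0,
                     max_model_len=None, num_computed_tokens=0):
    n = num_behind
    # (a) Chunk long prefills; a threshold of 0 disables this cap.
    if 0 < long_prefill_token_threshold < n:
        n = long_prefill_token_threshold
    # (b) Never exceed what is left of this step's token budget.
    n = min(n, token_budget)
    # (c) Never run past the model's maximum length.
    if max_model_len is not None:
        n = min(n, max_model_len - 1 - num_computed_tokens)
    return max(n, 0)
```

A decode (num_behind == 1) passes through unchanged unless the budget is exhausted, which is why the same function serves prefill and decode.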
The preemption loop (lines 442–491) — the heart
with record_function_or_nullcontext("schedule: allocate_slots"):
    while True:
        new_blocks = self.kv_cache_manager.allocate_slots(
            request, num_new_tokens, num_lookahead_tokens=self.num_lookahead_tokens,
        )
        if new_blocks is not None:
            break                                   # got memory; schedule it
        # The request cannot be scheduled. Preempt the lowest-priority request.
        if self.policy == SchedulingPolicy.PRIORITY:
            preempted_req = max(self.running, key=lambda r: (r.priority, r.arrival_time))
            self.running.remove(preempted_req)
            ...
        else:
            preempted_req = self.running.pop()      # FCFS: preempt the most-recent
        self._preempt_request(preempted_req, scheduled_timestamp)
        preempted_reqs.append(preempted_req)
        if preempted_req == request:
            break                                   # nothing left to preempt; give up this req
if new_blocks is None:
    break
This is the None → preempt → retry handshake with the KV manager (Phase 02 §5). Under
FCFS it preempts self.running.pop() — the most recently admitted, i.e. lowest priority by
arrival. Under PRIORITY it preempts the worst (priority, arrival_time). mini_vllm implements
the FCFS branch (self.running.pop() + _preempt) — the PRIORITY branch is a great extension
exercise.
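Victim selection under both policies, as a starting point for that extension. The names R and pick_victim are hypothetical, and the sketch assumes what the max(...) in the excerpt implies: a larger priority value means a less important request, with later arrival breaking ties.

```python
from dataclasses import dataclass


@dataclass
class R:
    req_id: str
    priority: int
    arrival_time: float


def pick_victim(running, policy):
    """Remove and return the request to preempt from the running list."""
    if policy == "priority":
        # Worst (priority, arrival_time) pair loses: largest value, then latest arrival.
        victim = max(running, key=lambda r: (r.priority, r.arrival_time))
        running.remove(victim)
        return victim
    # FCFS: the most recently admitted request sits at the end of the list.
    return running.pop()
```

Note the cost difference: the FCFS branch is an O(1) pop, while the PRIORITY branch scans the running list and then removes from the middle of it.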
_preempt_request (line 929) frees the KV and resets the request to be recomputed. Compare
mini_vllm.Scheduler._preempt: frees KV, num_computed_tokens = 0, status PREEMPTED, back to
the front of waiting.
Commit the scheduled running request (lines 493–533)
scheduled_running_reqs.append(request)
req_to_new_blocks[request_id] = new_blocks
num_scheduled_tokens[request_id] = num_new_tokens
token_budget -= num_new_tokens # <- budget bookkeeping
req_index += 1
# ... spec-decode + encoder bookkeeping ...
Phase B — admit WAITING requests (lines 544–...)
if not preempted_reqs and self._pause_state == PauseState.UNPAUSED:
    while (self.waiting or self.skipped_waiting) and token_budget > 0:
        if len(self.running) == self.max_num_running_reqs:
            break
        ...
        request = request_queue.peek_request()
        ...
        # Get already-cached tokens.
        if request.num_computed_tokens == 0:
            new_computed_blocks, num_new_local_computed_tokens = (
                self.kv_cache_manager.get_computed_blocks(request)   # <- prefix caching!
            )
        ...
Three gates before admitting anyone (mirrored in mini_vllm):
- if not preempted_reqs: don't admit new work in a step where we had to preempt (memory pressure). (mini_vllm: and not out.preempted_req_ids.)
- token_budget > 0: budget left.
- len(self.running) == self.max_num_running_reqs: break. This is the seq-slot cap (max_num_seqs).
Then get_computed_blocks(request) is the prefix-cache head start (Phase 02 §5, guide §4):
the request adopts the cached prefix and only prefills the remainder. The LoRA constraint just
below (lines 573–584) caps distinct adapters per step (max_loras, Phase 11) — another feature
riding the scheduler.
3. The output: SchedulerOutput
vllm/v1/core/sched/output.py:181. What the scheduler hands the executor:
@dataclass
class SchedulerOutput:
    scheduled_new_reqs: list[NewRequestData]      # first-time-scheduled (full payload)
    scheduled_cached_reqs: CachedRequestData      # already-running (just deltas)
    num_scheduled_tokens: dict[str, int]          # req_id -> tokens this step
    total_num_scheduled_tokens: int
    scheduled_spec_decode_tokens: dict[str, list[int]]
    scheduled_encoder_inputs: dict[str, list[int]]
    num_common_prefix_blocks: list[int]
    finished_req_ids: set[str]
    ...
The split between NewRequestData (line 31 — full prompt, block_ids, sampling params) and
CachedRequestData (line 112 — just new tokens + new block ids) is a real optimization: for a
request already running, you don't resend the prompt every step, only the delta. mini_vllm
simplifies this to one num_scheduled_tokens dict + the request objects, but the idea — send
new requests in full, running requests as deltas — is worth knowing.
4. The other half: update_from_output
vllm/v1/core/sched/scheduler.py:1283. After the model runs and the sampler produces tokens,
the scheduler ingests the results: append sampled tokens, advance num_computed_tokens, detect
finished requests, free their KV, handle spec-decode acceptance/rejection, emit stats. Your
mini_vllm.Scheduler.update_from_output is the skeleton: num_computed_tokens += n; if a token
was sampled, append it and check stop conditions; reap finished requests (free KV, drop from
running).
The condition for "did this request emit a token this step" in mini_vllm is
needs_sample = (num_computed_tokens + num_scheduled == num_tokens) — only fully-caught-up
(prefill-complete) requests sample. The real engine encodes the same thing through the model
runner's logits-indices selection; the principle is identical (you only sample at the last
position of a request that has no more prompt to ingest).
5. Putting Phases 02 + 03 together
The clean separation you should now see:
Scheduler (policy: who runs, how many tokens) ──calls──► KVCacheManager (truth: is there memory?)
▲ │
└────────────── None ◄── allocate_slots ◄───────────────┘ (OOM signal)
│
└─ responds: preempt a running request, free its KV, retry
The scheduler never touches blocks directly; the KV manager never decides policy. That clean seam is why each file stays readable despite the engine's complexity — and it's a design lesson worth stealing for your own systems.
Reading checklist
Write one sentence each in your notebook:
- The top comment of schedule(): restate the "no prefill/decode phase" idea in your words.
- The 4-line num_new_tokens clamp: what are the three caps and why each?
- The while True preemption loop: what does allocate_slots returning None trigger?
- FCFS vs PRIORITY preemption victim selection: who gets preempted in each?
- The three gates before admitting WAITING requests.
- get_computed_blocks in Phase B: how does prefix caching give a free head start?
- NewRequestData vs CachedRequestData: why send deltas for running requests?
Now build it: 02-mini-build.md, then the labs.
Phase 03 — Mini-Build: the continuous-batching scheduler
You'll build the scheduler that drives mini_vllm. The reference is already in the repo —
mini_vllm/scheduler.py — but write it yourself first against lab-01's stub + tests, then
diff.
This phase's mini-build depends on Phase 02's KV manager (mini_vllm/kv_cache.py), because the
scheduler's whole interaction with memory is allocate_slots / get_computed_blocks / free.
That dependency is the point: scheduling and paging are two halves of one machine.
Contents
The build, in order
1. SchedulerOutput
A small dataclass: num_scheduled_tokens: dict[str,int], scheduled_requests: list[Request],
preempted_req_ids: list[str], and a total_num_scheduled_tokens property. (Real:
output.py:181, much richer.)
2. Scheduler.__init__
Hold the KVCacheManager, max_num_seqs, max_num_batched_tokens,
long_prefill_token_threshold, and two queues: waiting: deque[Request], running: list[Request].
3. _clamp_new_tokens(num_new_tokens, token_budget)
Apply the long-prefill chunk cap (if 0 < threshold < num_new) then min(num_new, budget),
floored at 0. This single helper is chunked prefill. (Real: scheduler.py:390–392.)
4. schedule() — the two-phase loop
- Phase A (running): for each running request, n = _clamp_new_tokens(num_tokens − num_computed, budget); allocate_slots; on None, pop a victim from running, _preempt it, record it, retry; commit and budget -= n.
- Phase B (waiting): while waiting and budget>0 and len(running)<max_num_seqs and not preempted: peek the front request; get_computed_blocks → set num_computed_tokens; n = _clamp_new_tokens(...); allocate_slots(req, n, computed_blocks); on None break; else move waiting→running, commit.
5. _preempt(request)
kv.free(request); num_computed_tokens = 0; num_preemptions += 1; status PREEMPTED;
waiting.appendleft(request) (re-admit ASAP).
6. update_from_output(output, sampled)
For each scheduled request: num_computed_tokens += n; if it sampled, append the token and
maybe_finish(). Reap finished requests: remove from running, kv.free.
7. needs_sample(request, num_scheduled) (static)
return request.num_computed_tokens + num_scheduled == request.num_tokens. Only fully-caught-up
requests emit a token (mid-prefill chunks don't).
Definition of done
pytest mini_vllm/test_scheduler.py -q # the reference suite (token budget, chunking,
# preemption, prefix head-start)
pytest phase-03-continuous-batching-scheduler/labs -q
Then run the full engine and confirm the scheduler's correctness invariants hold end to end:
pytest mini_vllm/test_engine.py -q
# test_chunked_prefill_matches_unchunked_output and
# test_prefix_caching_matches_no_caching_output PROVE that these optimizations change
# *timing/memory*, never *output*. That property is the whole game.
Stretch (sets up later phases)
- PRIORITY policy: add a priority field to Request and a PRIORITY branch to the preemption victim selection (max by (priority, arrival_time)), mirroring scheduler.py:456.
- Swapping vs recompute: instead of resetting num_computed_tokens=0 on preempt, "swap" the blocks to a CPU pool and restore them on re-admit. Compare the cost model (recompute is compute, swap is bandwidth), a real vLLM design axis.
- Stats: count preemptions, average batch size, and KV usage per step; you'll need these in Phase 18.
Phase 03 Labs — Continuous Batching & the Scheduler
Six labs around the engine's brain. The arc: build the scheduling loop (lab-01), prove chunked prefill safe (lab-02), measure why it exists (lab-05), survive memory pressure with preemption (lab-04), then account for prefix caching exactly (lab-06) and on real hardware (lab-03).
Recommended order: 01 → 02 → 05 → 04 → 06 → 03. (Directory numbers predate labs 05–06, so they do not match the teaching order:
mechanism, then safety, then motive, then the emergency path, then the cache economics.)
CPU labs follow the standard contract — starter.py (your work), solution.py
(reference), test_lab.py (the spec); default runs the solution, LAB_IMPL=starter grades
yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-03-continuous-batching-scheduler/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-03-continuous-batching-scheduler/labs/lab-01-scheduler-step -q
Contents
- lab-01-scheduler-step [CPU-OK]
- lab-02-chunked-prefill [CPU-OK]
- lab-03-prefix-cache-hitrate [GPU-OPT]
- lab-04-preemption [CPU-OK]
- lab-05-decode-latency-spikes [CPU-OK]
- lab-06-prefix-cache-savings [CPU-OK]
- What you can do after this phase
Labs
lab-01-scheduler-step [CPU-OK]
Implement the two-phase loop at the heart of continuous batching: serve RUNNING first
(decode-first is a policy, and iteration order is the policy), then admit WAITING — all
under three independent scarcities (token budget, sequence slots, KV memory) enforced at
three different points. Your 30 lines are, shape for shape, the core of
scheduler.py:329 upstream. Skills: budget/slot/memory enforcement; running-first;
head-of-line blocking as a fairness choice; one code path for prefill and decode.
lab-02-chunked-prefill [CPU-OK]
Prove the engine's most important safety property — chunking changes when tokens are
computed, never what tokens come out — by running the same deterministic workload under
both schedules and diffing token ids. Plus the timing side: predict prefill steps with
ceil(prompt/chunk) and know every boundary case. Skills: the causality + sampling-guard
argument; output-invariance as a CI-enforceable equality; the chunk-size trade-off.
lab-03-prefix-cache-hitrate [GPU-OPT]
Run the real engine on the canonical workload (long shared system prompt, unique tails) with prefix caching off and on, and read three independent meters that must agree: hit rate (0% → 93.7%), prompt throughput (4–5×), KV usage (~1× the prefix). Annotated capture included for the GPU-less; lab-06 is the exact-arithmetic CPU twin. Skills: constructing sharing-known workloads; reading hit-rate denominators; when caching buys nothing.
lab-04-preemption [CPU-OK]
Force the scheduler's emergency path in a pool where two requests cannot both fit: watch it evict the most-recent admission, let the survivor finish, then replay the victim — and prove the final outputs identical to a roomy pool's. Recovery is just prefill: the two-counters model makes eviction, chunking, and cache hits one code path. Skills: the allocate-or-preempt dance; victim policy as forward-progress argument; the deadlock invariant; pairing "survives Y" tests with "Y actually happened" probes.
lab-05-decode-latency-spikes [CPU-OK]
The motive for chunked prefill, measured: a decode stream's per-step cost profile when a
256-token prompt lands — [257, 2, 1, 1, ...] unchunked vs [33 ×8, 2, 1, ...] at
threshold 32. Same total work, radically different tail latency; nothing free — the spike
spreads into the long prompt's TTFT. Skills: per-victim latency measurement; p99 vs mean;
the threshold/budget dial; why aggregate meters hide interference.
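The two cost profiles quoted for lab-05 can be reproduced with a few lines of counting. This is a sketch under stated assumptions, not the lab's code: one decode stream contributes 1 token to every step, the long request generates exactly two tokens, and the global budget never binds. The function name and parameters are hypothetical.

```python
def step_costs(prompt_len, threshold, long_new_tokens=2, tail=2):
    """Tokens scheduled per step while one long prompt lands next to one decode stream."""
    costs = []
    computed = 0          # tokens of the long request whose KV exists
    total = prompt_len    # tokens the long request currently holds
    produced = 0
    while produced < long_new_tokens:
        remaining = total - computed
        n = remaining if threshold <= 0 else min(threshold, remaining)
        costs.append(1 + n)          # the stream's decode token + the long request's share
        computed += n
        if computed == total:        # caught up: the step samples one token
            produced += 1
            total += 1
    costs.extend([1] * tail)         # afterwards the stream decodes alone
    return costs


unchunked = step_costs(256, threshold=0)
chunked = step_costs(256, threshold=32)
```

Here unchunked comes out as [257, 2, 1, 1] and chunked as eight steps of 33 followed by [2, 1, 1]. The worst step drops from 257 tokens to 33, and the long request's first token moves from step 1 to step 8.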
lab-06-prefix-cache-savings [CPU-OK]
Account for prefix caching to the exact token: 544 scheduled tokens uncached vs 96 cached, savings ≡ (N−1) × shared full blocks = 448, outputs bit-identical, and a share-nothing control arm that saves almost nothing. Includes the one-token prefill that immediately samples — three phases of rules colliding in a single scheduled token. Skills: the compute odometer; predicting cache value with integer arithmetic; eager caching at allocation time; validating noisy GPU meters against an exact model.
What you can do after this phase
Implement and modify vLLM's scheduling policy with the confidence of someone who has built
the loop, proven its invariants, and measured its trade-offs: explain why chunked prefill
is default-on (and what threshold to set, from data); predict prefix-cache savings for any
workload before enabling it; diagnose a preemption storm from the metrics and name the
right knob; and read vllm/v1/core/sched/scheduler.py end to end as a peer. Combined with
Phase 2, you now hold the complete control plane — Phase 4 descends into the kernels it
commands.
Lab 03-01 — Implement the Scheduler Step [CPU-OK]
The scheduler is the brain of the engine — the component that decides, every single step,
who computes and how much. In this lab you implement its core: the two-phase loop
(serve the RUNNING, then admit the WAITING) under a global token budget, a sequence-slot
cap, and a per-request chunk limit. It is maybe 30 lines of code. Those 30 lines are the
difference between a GPU that hums at 90% utilization and one that stutters between
overload and idleness — and they are, shape for shape, the same 30 lines at the heart of
upstream/vllm/v1/core/sched/scheduler.py.
Contents
- Why this lab exists
- Background: the three scarce resources
- Why running-first is not arbitrary
- Files
- Run
- What schedule_step must do
- What the tests prove — a guided tour
- How this maps to the real engine
- Hitchhiker's notes
- Going further
- References
Why this lab exists
In Phase 1 lab-04 you observed the scheduler's decisions as a trace of per-step batch dicts. Reverse the arrow: now you are the one producing those dicts. Everything you watched — chunking to the budget, deferred admission, mixed prefill+decode batches — must now fall out of code you write. This is the course's central loop made flesh: observe a mechanism, then build it, and the understanding compounds.
It's also the file you will touch most as a contributor. Scheduling policy is where vLLM
evolves fastest — priority scheduling, fairness, SLA-aware admission, disaggregated
prefill (Phase 15) are all edits to this loop. The deep-dive walks you through the real
Scheduler.schedule; this lab makes sure that when you read it, you're recognizing,
not learning.
Background: the three scarce resources
Every scheduling decision is a negotiation between three independent scarcities, and the loop checks all three — know which line enforces which:
- max_num_batched_tokens (the token budget, default 2048–8192 upstream) — caps total tokens computed per step. This is a latency control: step wall-clock time grows with scheduled tokens, so the budget is, almost literally, your inter-token latency dial (lab-05 measures this). The budget is global per step — one pool shared by everyone scheduled.
- max_num_seqs (the slot cap) — caps how many requests can be RUNNING at once. This bounds per-step fixed overheads and runner state (and, on real hardware, things like CUDA-graph batch-size buckets — Phase 5). It is checked only at admission: an already-running request never re-competes for its slot.
- KV memory (via kv.allocate(...)) — the hard wall from Phase 2. Unlike the other two, this one can refuse mid-flight (a decode needs one more block and the pool is empty); handling that refusal is preemption, deliberately deferred to lab-04. In this lab, allocation failure during admission simply stops admitting.
Three resources, three different enforcement points. Most scheduler bugs are one resource checked at the wrong point — e.g. counting seqs in the budget loop, or letting an admission overdraw the budget "just this once."
Why running-first is not arbitrary
The loop's order — RUNNING phase, then WAITING phase — encodes a policy with a name: decode-first. The requests already running have users watching tokens stream; a stalled decode is a frozen cursor in somebody's chat window. The waiting requests haven't received anything yet; making them wait one more step costs queueing delay but breaks no stream. So the scheduler protects in-flight experience first and spends whatever budget remains on admissions.
The inverse policy (admit-first) would maximize... nothing useful: it trades visible jitter for marginally earlier admissions. But note the deeper principle, because it generalizes: the loop's iteration order IS the priority policy. Upstream's priority scheduling and preemption-victim selection are both, at bottom, careful answers to "in what order do we iterate, and from which end do we take?"
Files
- starter.py — clamp(...) and schedule_step(...) stubbed, with the full recipe in comments. Ships a tiny self-contained Req and FakeKV (a slot-counting memory model) so the lab isolates pure scheduling logic. Your work.
- solution.py — reference (mirrors mini_vllm/scheduler.py, minus preemption).
- test_lab.py — budget cap, slot cap, chunking, running-first ordering, and memory-stops-admission.
Run
LAB_IMPL=starter pytest phase-03-continuous-batching-scheduler/labs/lab-01-scheduler-step -q
pytest phase-03-continuous-batching-scheduler/labs/lab-01-scheduler-step -q # reference
What schedule_step must do
- budget = max_num_batched_tokens.
- RUNNING phase: for each running req in order: n = clamp(req.num_tokens − req.num_computed, budget, threshold); skip if n == 0; kv.allocate(req, n) (assume it succeeds for running reqs here — preemption is lab-04); commit scheduled[rid] = n, budget -= n.
- WAITING phase: while there are waiters AND budget > 0 AND len(running) < max_num_seqs: take the front waiter (FCFS — order is policy!), clamp the same way, try to allocate; on failure break (if the front request can't fit, don't go shopping deeper in the queue — see the head-of-line note below); on success, move it waiting → running and commit.
- Return {rid: n}.
And clamp(num_new, budget, threshold) is the whole chunking mechanism in one line:
cap by the per-request threshold (if 0 < threshold < num_new), then by the remaining
budget, floored at 0. Notice what isn't here: no "prefill mode," no "decode mode." A
decode is just a request whose num_tokens − num_computed == 1. The two-counters model
from Phase 1 means one code path schedules both — that unification is the deep design,
and it's why this loop stays 30 lines while doing what took Orca a paper to describe.
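Written out, the clamp described above is three lines. Treat this as a sketch of the rule as stated in the text, not as the lab's reference solution; the argument order follows clamp(num_new, budget, threshold).

```python
def clamp(num_new, budget, threshold):
    """How many of a request's outstanding tokens may be scheduled this step."""
    if 0 < threshold < num_new:
        num_new = threshold              # per-request chunk limit
    return max(0, min(num_new, budget))  # then the remaining step budget, floored at 0


threshold_bound = clamp(100, budget=2048, threshold=16)  # long prompt, chunked
budget_bound = clamp(8, budget=2, threshold=0)           # budget nearly spent
decode = clamp(1, budget=10, threshold=16)               # a decode is just num_new == 1
```

The three calls land in the three regimes: 16 (threshold binds), 2 (budget binds), and 1 (nothing binds). The last call takes the same code path as a prefill, which is the unification the paragraph above describes.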
What the tests prove — a guided tour
- test_clamp_chunks_and_budgets — the clamp's three regimes (budget-bound, threshold-bound, neither). Get this right first; everything else composes it.
- test_budget_caps_total_tokens — three 8-token prompts under a 10-token budget schedule exactly 10 tokens: 8 + 2 (the second request's prefill is chunked mid-prompt) + 0 (the third isn't admitted). One assertion, three behaviors.
- test_max_num_seqs_caps_running — ten tiny requests, slots for four: exactly four admitted, despite infinite budget and memory. Each scarcity binds independently.
- test_chunked_prefill_caps_per_request — a 100-token prompt with threshold=16 schedules 16, not 100, even with budget to burn. The threshold protects other requests' latency from this request's prompt (lab-05 quantifies exactly how much).
- test_running_scheduled_before_waiting_admitted — the decode-first policy: the running decode gets its 1 token first; the eager 20-token waiter gets what's left, chunked. Order of phases = priority.
- test_admission_stops_when_memory_exhausted — A fills the pool; B stays WAITING. No crash, no partial admission: capacity exhaustion is a normal scheduling outcome, not an error path. (The engine-level consequence — B admitted later when A finishes — is Phase 1 lab-04's trace; the violent version is lab-04's preemption.)
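To see the 8 + 2 + 0 split fall out of the loop, here is one possible shape for the two-phase step, following the recipe in "What schedule_step must do". It is a sketch, not the shipped solution.py: the Req and FakeKV classes below are simplified stand-ins written for this example, and the function signature is an assumption.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Req:
    rid: str
    num_tokens: int
    num_computed: int = 0


class FakeKV:
    """Stand-in memory model: one slot per token, refuses when the pool would overflow."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def allocate(self, req, n):
        if self.used + n > self.capacity:
            return False
        self.used += n
        return True


def clamp(num_new, budget, threshold):
    if 0 < threshold < num_new:
        num_new = threshold
    return max(0, min(num_new, budget))


def schedule_step(running, waiting, kv, max_num_batched_tokens, max_num_seqs, threshold=0):
    budget = max_num_batched_tokens
    scheduled = {}
    # RUNNING phase: in-flight requests are served first (decode-first policy).
    for req in running:
        n = clamp(req.num_tokens - req.num_computed, budget, threshold)
        if n == 0:
            continue
        kv.allocate(req, n)  # assumed to succeed here; refusal is preemption (lab-04)
        scheduled[req.rid] = n
        budget -= n
    # WAITING phase: admit from the front while budget and slots remain.
    while waiting and budget > 0 and len(running) < max_num_seqs:
        req = waiting[0]
        n = clamp(req.num_tokens - req.num_computed, budget, threshold)
        if not kv.allocate(req, n):
            break            # head-of-line blocking: do not look deeper in the queue
        waiting.popleft()
        running.append(req)
        scheduled[req.rid] = n
        budget -= n
    return scheduled


waiting = deque(Req(f"r{i}", num_tokens=8) for i in range(3))
running = []
plan = schedule_step(running, waiting, FakeKV(capacity=64),
                     max_num_batched_tokens=10, max_num_seqs=4)
print(plan)  # {'r0': 8, 'r1': 2}
```

The third request never enters the loop body: the budget reaches 0 after r1, so the while guard fails before the slot cap or memory is even consulted. Each scarcity is checked at its own point.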
How this maps to the real engine
Open upstream/vllm/v1/core/sched/scheduler.py:329 after you're green. The skeleton is
yours; production adds, in roughly descending order of weight: preemption inside the
RUNNING phase (the while True allocate-or-preempt dance — lab-04); prefix-cache
consultation at admission (get_computed_blocks — lab-06 / Phase 2 lab-05); structured-
output and LoRA gating; speculative-decoding token accounting; and the encoder budget for
multimodal inputs. Every one of those is a guard or a discount on num_new_tokens
inside the same two phases. Once you see the file that way — your 30 lines plus accessory
clauses — it stops being 700 intimidating lines.
Also worth noting upstream: _clamp_new_tokens's real twin is the interaction between
long_prefill_token_threshold and chunked_prefill_enabled in the scheduler config —
chunked prefill is default-on in V1, which tells you how settled this once-controversial
idea now is.
Hitchhiker's notes
- Head-of-line blocking is a choice. When the front waiter doesn't fit, we break rather than trying the next (smaller) one. Skipping ahead would raise utilization and starve large requests — a big prompt could wait forever behind a stream of small ones slipping past it. FCFS-with-blocking is the fairness-conservative default; if you relax it, you must add an aging mechanism. (Upstream has exactly this debate in its issue tracker — worth a read.)
- Why is the budget in tokens, not requests? Because step time scales with tokens through the model, not with request count — a 1-request 2048-token prefill costs about the same as 2048 one-token decodes through the GEMMs (attention differs; Phase 18 refines). Budgeting the actual scarce quantity is what makes the latency dial linear.
- num_computed > 0 for a waiter is not an error — it's a preempted request being re-admitted (lab-04) or a prefix-cache hit (lab-06). Your clamp already handles it: num_tokens − num_computed just comes out smaller. Design observation: by making "partial progress" a first-class state, recovery and caching share the admission path with fresh requests. No special cases.
- The FakeKV is a teaching instrument: one slot per token, no blocks, no hashes — so this lab's failures are always scheduling failures. When you wire the real KVCacheManager in (mini-build), block granularity adds a ceil() but changes no logic.
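The "adds a ceil()" remark in the last note can be made concrete. The helper below is hypothetical and assumes a request holds exactly ceil(num_computed / block_size) blocks before the step; it is meant only to show where the rounding enters.

```python
def new_blocks_needed(num_computed, n, block_size=16):
    """Extra blocks a request must be granted to advance by n tokens."""
    held = -(-num_computed // block_size)            # ceil(num_computed / block_size)
    needed = -(-(num_computed + n) // block_size)    # ceil after the step
    return needed - held


prefill = new_blocks_needed(0, 100)       # 7 blocks for a 100-token prompt
mid_block = new_blocks_needed(100, 1)     # 0: the decode fits in the partly filled block
boundary = new_blocks_needed(112, 1)      # 1: the decode crosses a block boundary
```

The boundary case is the one the KV-memory bullet earlier calls a mid-flight refusal: a decode that needs one more block when the pool is empty.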
Going further
- Add priority classes: each Req gets priority: int; iterate waiting in priority order with FCFS tiebreak. Then write the test proving a late high-priority request overtakes the queue without stalling running decodes. You've just implemented the core of upstream's priority scheduling policy.
- Add the fully-cached edge case: if num_tokens − num_computed == 0 for a waiter (prefix cache covered everything it can), schedule 1 token anyway. Why must it be ≥ 1? (A request that schedules 0 tokens never produces logits, never samples, never finishes — an admission that can't make progress. mini_vllm/scheduler.py has this exact branch; lab-06 will show you the 1-token prefills it produces in a trace.)
- Make the budget elastic: allow one oversized decode batch when waiting is empty. Measure (with Phase 1 lab-04's probe) what it does to step-time variance. Most "clever" scheduler ideas die in exactly this experiment — cheap to run here, expensive to learn in production.
References
- mini_vllm/scheduler.py — the full version (with preemption + prefix caching) your solution grows into.
- upstream/vllm/v1/core/sched/scheduler.py:329 — Scheduler.schedule, the production loop; read it immediately after finishing.
- Yu et al., Orca (OSDI 2022) — iteration-level scheduling, this loop's ancestor: https://www.usenix.org/conference/osdi22/presentation/yu
- Agrawal et al., Sarathi-Serve (OSDI 2024) — why the chunk threshold exists; the prefill/decode interference math: https://arxiv.org/abs/2403.02310
- vLLM docs, Optimization and Tuning — max_num_batched_tokens / max_num_seqs guidance straight from the maintainers: https://docs.vllm.ai/en/latest/configuration/optimization.html
Lab 03-02 — Chunked Prefill: Same Output, Different Timing [CPU-OK]
This lab proves, on a running engine, the most important safety property in vLLM:
Chunked prefill changes WHEN tokens are computed, never WHAT tokens are produced.
If that sentence were false, no scheduling optimization in this codebase would be safe to
ship — every knob that re-times work would be a knob that corrupts output. You'll verify it
the strong way (identical token ids, chunked vs unchunked, on the real mini_vllm engine),
and you'll learn to predict the timing side: exactly how many steps a prefill takes under
any threshold/budget combination.
Contents
- Why this lab exists
- Background: why chunk a prefill at all
- Why the output cannot change — the actual argument
- Files
- Run
- The formula to implement
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
"Re-timing is output-invariant" is the kind of claim engineers nod along to and never
check. But your career will repeatedly hand you moments where the nod isn't enough: a
customer reports different outputs between two deployments that differ only in scheduler
config; a reviewer asks whether your scheduler PR can change generations; an incident
review wants to know whether enabling chunked prefill mid-fleet is provably safe or just
probably safe. This lab gives you the proof technique: drive the same deterministic
workload through both schedules and diff the token ids. It's mini_vllm's own regression
test (test_engine.py::test_chunked_prefill_matches_unchunked_output), reproduced by your
hand so you know why it must hold, not just that it does.
The second skill is the timing model. "How many steps does a 4000-token prompt take at threshold 512?" is a real capacity question (it sets TTFT for that request and the interference window for everyone else — lab-05). The answer is a one-line formula, and you should never need to run the engine to produce it.
Background: why chunk a prefill at all
Without chunking, a 4096-token prompt arrives and the scheduler faces an ugly choice: schedule the whole prefill in one step — a step that takes hundreds of times longer than a decode step, during which every other user's token stream visibly freezes — or make the new request wait indefinitely. Early engines picked the freeze; users called it "jitter" and "stalls."
Chunked prefill (Sarathi's contribution, default-on in vLLM V1) dissolves the choice:
split the prompt into budget-sized chunks across several steps, and let decodes ride along
in each step's leftover budget. The long prompt pays slightly more total latency (more
steps, plus re-reading its growing KV each chunk); everyone else's inter-token latency
stays smooth. The two-counters model from Phase 1 makes the implementation almost
embarrassingly small: a prefill is just a request whose num_computed_tokens is far
behind, so capping its per-step advance — clamp from lab-01 — is chunking. No prefill
state machine, no resume logic; the counter is the resume logic.
Why the output cannot change — the actual argument
Spell it out once, carefully, because this is the argument you'll reuse for every scheduling feature:
- The model's logits at position k depend only on tokens 0..k (causality) and their KV values — not on which step computed that KV. KV is a pure function of the tokens.
- The engine samples for a request only when num_computed_tokens + n == num_tokens (Scheduler.needs_sample — Phase 1 lab-03's guard). Mid-prefill chunks emit nothing.
- Therefore the first sample happens at the same logical state (all prompt KV computed, position = prompt length) whether the prompt was computed in 1 chunk or 10. Same state + same sampling → same token. Induction extends this to every later token.
The invariant has exactly two load-bearing dependencies: causality (KV doesn't depend on schedule) and the sampling guard (no logits read mid-prefill). Notice what that implies for review: a PR can only break output-invariance by touching one of those two things. That's a checklist of length two for an entire class of changes — and on real GPUs, a softer third dependency appears (batch-shape-dependent floating-point reduction order), which is why the real engine's version of this test compares with tolerance while ours can demand exact equality. See the Hitchhiker's notes.
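The two dependencies can be watched in isolation with a toy that has nothing else in it. This is not mini_vllm: the "model" is a made-up integer hash, kv_entry and generate are hypothetical names, and the prompt is assumed non-empty. It keeps only what the argument needs, namely KV as a pure function of the tokens and a sampling guard.

```python
def kv_entry(tokens, k):
    """Toy KV for position k: depends only on tokens[0..k], never on the schedule."""
    h = 7
    for t in tokens[: k + 1]:
        h = (h * 31 + t) % 1009
    return h


def generate(prompt, max_new_tokens, chunk):
    """Greedy toy decoding; chunk <= 0 means unchunked prefill. Returns (output, steps)."""
    tokens = list(prompt)
    kv = []          # kv[k] exists once position k has been computed
    output = []
    steps = 0
    while len(output) < max_new_tokens:
        remaining = len(tokens) - len(kv)
        n = remaining if chunk <= 0 else min(chunk, remaining)
        emits = len(kv) + n == len(tokens)      # the sampling guard
        for k in range(len(kv), len(kv) + n):
            kv.append(kv_entry(tokens, k))
        steps += 1
        if emits:                                # mid-prefill steps emit nothing
            next_token = kv[-1] % 50             # deterministic stand-in for argmax
            tokens.append(next_token)
            output.append(next_token)
    return output, steps


prompt = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
unchunked_out, unchunked_steps = generate(prompt, 5, chunk=0)
chunked_out, chunked_steps = generate(prompt, 5, chunk=4)
```

The two outputs are equal while the step counts differ (5 versus 7 for this 10-token prompt). Deleting the emits check, so that a token is sampled after every step, is the quickest way to watch the invariant break.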
Files
- starter.py — implement num_prefill_steps(prompt_len, threshold, budget). Your work.
- solution.py — reference.
- test_lab.py — checks your formula on the boundary cases AND runs the engine both ways asserting identical output token ids.
Run
LAB_IMPL=starter pytest phase-03-continuous-batching-scheduler/labs/lab-02-chunked-prefill -q
pytest phase-03-continuous-batching-scheduler/labs/lab-02-chunked-prefill -q # reference
The formula to implement
A single request (it owns the whole budget) with a prompt_len-token prompt. The per-step
chunk is threshold if threshold > 0 else budget, but never more than budget:
chunk = min(threshold or budget, budget). The prefill then takes
ceil(prompt_len / chunk)
steps. Watch the boundaries the tests probe: threshold = 0 means disabled (not "chunk
of zero"); a threshold larger than the budget is moot (the budget binds); a prompt that
divides evenly takes exactly prompt_len / chunk, no +1. Off-by-ones here are off-by-ones
in someone's TTFT model later.
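As a quick check of the arithmetic, the formula can be written with integer ceiling division. This sketch assumes budget > 0, threshold >= 0, and prompt_len >= 1, and it is one way to express the rule above rather than the lab's reference file.

```python
def num_prefill_steps(prompt_len, threshold, budget):
    """Steps needed to prefill one prompt that owns the whole budget."""
    chunk = min(threshold or budget, budget)   # threshold == 0 means "disabled"
    return -(-prompt_len // chunk)             # ceil(prompt_len / chunk) on integers


capacity_question = num_prefill_steps(4000, threshold=512, budget=8192)  # 8 steps
disabled = num_prefill_steps(100, threshold=0, budget=2048)              # 1 step
exact_division = num_prefill_steps(64, threshold=16, budget=2048)        # 4, no +1
budget_wins = num_prefill_steps(100, threshold=4096, budget=32)          # 4 steps
```

The first call answers the earlier capacity question: a 4000-token prompt at threshold 512 takes 8 steps, seven full chunks and one partial.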
What the tests prove
- The formula tests pin the chunk-size selection logic and the ceiling division — including threshold=0 (unchunked: 1 step), threshold > budget (budget wins), and exact-division boundaries.
- The engine test generates from the same prompt with long_prefill_token_threshold=0 and with a small threshold, and asserts identical output token ids — not similar: identical. It can demand exactness because mini_vllm is deterministic end-to-end (greedy sampling, deterministic toy model), which turns the safety property into a hard equality a CI can enforce forever. This is the test you write first when building any scheduling feature: pin the semantics, then optimize the timing freely. (Compare to the trace shape you saw in Phase 1 lab-04: chunking visibly rearranged the steps. Same engine, same tokens — the timing is the only degree of freedom.)
Hitchhiker's notes
- On real GPUs, "identical" softens to "equivalent." Chunking changes batch shapes; different GEMM/attention tile sizes can change floating-point reduction order; logits wiggle in the last ulp; and a greedy argmax between two near-tied tokens can flip. The semantic invariant (same distribution, same correctness) holds; bitwise equality does not. This is why upstream correctness tests for chunked prefill compare with tolerance or check logprob closeness, and it's the first thing to say in the incident review when two configs differ by one token at position 947: not all divergence is a bug — divergence beyond rounding is.
- The threshold is a latency/throughput dial, not free money. Small chunks: smoother decode latency for others, but the long prompt's prefill stretches across more steps (worse TTFT for it), and each chunk re-reads the prompt's accumulated KV (attention cost ~quadratic-ish in total across chunks vs the one-shot). Sarathi-Serve's whole paper is about choosing this number; lab-05 lets you feel it.
- Where would chunking change output? It wouldn't — but a bug that sampled mid-prefill would (a request emitting a token from logits computed over half its prompt). Find the guard in mini_vllm/scheduler.py::needs_sample and its upstream twin (the logits_indices selection in the model runner). If a future refactor moved sampling before the catch-up check, this lab's engine test is the tripwire that catches it.
- Chunked prefill and prefix caching compose. A cache-hit request enters admission with num_computed_tokens already nonzero; the chunk math applies to the remainder. No interaction code exists because both features speak the same language: the counter. (Lab-06 shows the composed behavior in a trace.)
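The first note's claim, that reduction order alone can flip a greedy argmax between near-tied tokens, can be seen on a CPU with three floats. This is an analogy in plain Python doubles, not a GPU kernel; the logit values and the argmax helper are invented for the illustration.

```python
def argmax(values):
    """Index of the largest value; ties go to the lowest index."""
    return max(range(len(values)), key=lambda i: values[i])


# The same three partial sums, reduced in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

rival = 0.6  # a competing token's logit, sitting exactly on the boundary

pick_a = argmax([rival, left_to_right])
pick_b = argmax([rival, right_to_left])
```

The two sums differ in the last bit, and the picks differ with them: index 1 under one order, index 0 under the other. Nothing about the tokens changed, only the order of additions, which is why a one-token divergence between two configs is a question about rounding before it is a bug report.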
Going further
- Extend num_prefill_steps to two concurrent prompts sharing the budget fairly — suddenly you need to model the RUNNING phase's in-flight chunks competing with admissions, and the closed form gets genuinely interesting. Check your model against Phase 1 lab-04's probe.
- Compute TTFT in steps as a function of threshold for a 4096-token prompt at budget 512, then sketch the other requests' worst-case stall at each threshold (lab-05 measures it). Plot both curves; their crossing is the tuning decision Sarathi formalizes.
- Read upstream's long_prefill_token_threshold handling and the scheduler config's chunked-prefill defaults, and write down which of your formula's branches each config combination exercises.
References
- mini_vllm/test_engine.py::test_chunked_prefill_matches_unchunked_output — the course's own regression test you just rebuilt.
- mini_vllm/scheduler.py — _clamp_new_tokens (the chunker) and needs_sample (the guard).
- upstream/vllm/v1/core/sched/scheduler.py — the production clamp; search long_prefill_token_threshold.
- Agrawal et al., SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills (2023) — the original chunking paper: https://arxiv.org/abs/2308.16369
- Agrawal et al., Sarathi-Serve (OSDI 2024) — the production-grade follow-up with the threshold-tuning math: https://arxiv.org/abs/2403.02310
- vLLM docs, Chunked Prefill — the feature's official knobs and defaults: https://docs.vllm.ai/en/latest/configuration/optimization.html
Lab 03-03 — Measure Real Prefix-Cache Hit Rate [GPU-OPT]
Every production LLM workload has a shape, and the shape is almost always "a long shared prefix, then a short unique tail" — system prompts, few-shot examples, conversation history, RAG boilerplate. Prefix caching turns that shared prefix from N prefills into one. In this lab you run the real engine on exactly that workload and watch the meters move: the hit-rate counter climbing to ~94%, prompt throughput jumping 4–5×, and KV usage staying near 1× a single prompt. These are the numbers that justify the feature — and you'll know how to reproduce them on your workload, which is the question that actually matters.
No GPU? Don't panic. The captured run below is annotated line by line; the analysis sections work entirely on paper. And lab-06 reproduces this experiment on the mini engine, CPU-only, with exact token accounting — do that one hands-on.
Contents
- Why this lab exists
- Background: what a "hit" buys
- Requirements
- Steps
- What to measure
- Captured output (real run, Qwen2.5-0.5B, L4 24GB, vLLM 0.22.1)
- Reading the numbers like an operator
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Prefix caching is the rare optimization that is simultaneously huge (multi-× on the right workload), free to enable (default-on in modern vLLM), and workload-dependent enough to be oversold (≈0 benefit on share-nothing batch jobs). An engineer who can't measure it is at the mercy of vibes in both directions. This lab builds the measurement reflex: construct a workload with known sharing, run with the feature off and on, and read three independent meters that must agree (hit rate, prompt throughput, KV usage). When the meters don't agree — hit rate high but no speedup, say — you've learned something real (often: the prefix wasn't block-aligned, or the workload was decode-dominated all along).
The same experiment is also your template for capacity claims: "enabling prefix caching will let this deployment serve 3× the QPS" is a sentence you should only say after running this lab's shape against your traffic.
Background: what a "hit" buys
From Phase 2 lab-05 you know the mechanism: full blocks of the prompt are content-hashed
(parent-chained), and a new request adopts any cached chain head — touch, ref-count bump,
zero compute. What that buys, concretely, per hit token:
- Prefill compute: the entire forward pass for that token — skipped. TTFT for a request with an N-token cached prefix drops by roughly N/(prefill speed).
- KV memory: the hit blocks are shared, not copied (ref_cnt += 1). Sixteen requests sharing a 1000-token system prompt store its KV once.
- What it never buys: decode. Generated tokens are new by definition. A workload that prefills 50 tokens and decodes 2000 saves almost nothing — check your prefill:decode ratio before promising miracles.
The unit of caching is the full block (Phase 2's I3): a 130-token shared prefix at block_size 16 hits at most 8 blocks = 128 tokens, and divergence mid-block forfeits that block. Hence the operator's rule of thumb: put the static part first, pad nothing, and the boundary token of your template matters more than you'd think.
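The full-block rule is easy to turn into a predictor. The helper below is a hypothetical sketch: it combines the full-block rounding from this paragraph with the cap, stated later in this lab, that a hit covers at most num_tokens − 1 tokens of the prompt.

```python
def max_hit_tokens(shared_prefix_len, prompt_len, block_size=16):
    """Upper bound on cache-hit tokens for one request against an already cached prefix."""
    shareable = min(shared_prefix_len, prompt_len - 1)  # at least one token is always computed
    return (shareable // block_size) * block_size       # only full blocks can hit


template = max_hit_tokens(130, 150)       # 128: the last 2 shared tokens are forfeited
identical = max_hit_tokens(128, 128)      # 112: the final block is recomputed
short_prefix = max_hit_tokens(15, 100)    # 0: less than one full block is shared
```

The first call is the example from the text. The second shows why a byte-identical repeat of a block-aligned prompt still computes its last block.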
Requirements
uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct # small, modern, instruct-tuned
Steps
# run.py
from vllm import LLM, SamplingParams
SYSTEM = "You are a meticulous assistant. Follow instructions carefully. " * 30 # ~400 tokens shared
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    enable_prefix_caching=True,  # <- the feature under test; flip for the control run
    gpu_memory_utilization=0.6,
    max_model_len=4096,
)
# 16 requests sharing SYSTEM, each with a unique tail.
prompts = [f"{SYSTEM}\n\nQuestion {i}: what is {i}+{i}?" for i in range(16)]
out = llm.generate(prompts, SamplingParams(max_tokens=16, temperature=0))
# Print the first two completions so the two runs can be compared by eye.
for o in out[:2]:
    print(repr(o.outputs[0].text))
Run twice — enable_prefix_caching=True then False — both under
VLLM_LOGGING_LEVEL=DEBUG, and collect the three meters below for each. (One subtlety of
this script: all 16 requests are submitted in one generate call, so requests 1..15 hit
blocks request 0 cached moments earlier in the same run — vLLM caches blocks as soon as
they're full, not when the request finishes. The mechanism is identical for requests
arriving minutes apart, as long as the blocks survive eviction.)
What to measure
| Metric | prefix caching OFF | prefix caching ON |
|---|---|---|
| Prefix cache hit rate | 0% (counter absent/zero) | climbs toward (N−1)/N |
| Avg prompt throughput | baseline | several × baseline |
| Peak KV-cache usage | ~16 × SYSTEM + tails | ~1 × SYSTEM + tails |
| TTFT, requests 2..16 | full prefill each | only the unique tail prefills |
Three of these are in the debug logs; TTFT you can take from the per-request timing if you use the API server, or infer from prompt throughput here.
Captured output (real run, Qwen2.5-0.5B, L4 24GB, vLLM 0.22.1)
# enable_prefix_caching=True
INFO ... Automatic prefix caching is enabled.
DEBUG ... Prefix cache hit rate: GPU: 0.00% (request 0 populates the cache)
DEBUG ... Prefix cache hit rate: GPU: 93.7% (requests 1..15 reuse SYSTEM's blocks)
INFO ... Avg prompt throughput: 41523 tokens/s (mostly cached -> not recomputed)
'4' '6'
# enable_prefix_caching=False (same workload)
DEBUG ... Prefix cache hit rate: GPU: 0.00%
INFO ... Avg prompt throughput: 9120 tokens/s (every SYSTEM prefilled from scratch)
Reading the numbers like an operator
- 0.00% → 93.7% — the first line is request 0 paying full price (the pioneer effect: a cache nobody populated cannot hit — same 1/N you saw in Phase 2 lab-03's 87.5%). Then 15 of 16 requests reuse SYSTEM's blocks. Why 93.7% and not 15/16 of all tokens? The denominator is queries (tokens looked up), and each request's unique tail plus its final block can't hit — the cap from Phase 2 lab-05: a hit covers at most full blocks of at most num_tokens − 1 tokens. Hit rates have denominators; always ask what's in them before quoting one.
- 41523 vs 9120 tokens/s prompt throughput — the 4.6× is the shared prefix being computed once instead of 16 times. Sanity-check the ratio: with ~430 shared + ~15 unique tokens per prompt, the cached run computes ~1×430 + 16×15 ≈ 670 prefill tokens where the uncached run computes 16×445 ≈ 7120 — a ~10× compute saving, surfaced as ~4.6× in the wall-clock meter (the meter averages over windows that include decode time too). Meters measure what they measure; derive what you expected before trusting the headline.
- The outputs are '4' and '6' — the same answers the uncached run gives. Cached KV is the same KV (Phase 2 lab-06's identity theorem, now economically significant). Correctness meters and performance meters move independently; check both.
- Same arithmetic as lab-06 — which computes the exact scheduled-token saving ((N−1) × full-blocks-of-shared-prefix) on the mini engine where every token is countable. The GPU numbers above are that arithmetic, plus wall-clock noise.
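The sanity check in the second bullet is worth keeping as a reusable estimate. The function below is a back-of-envelope sketch with a hypothetical name; like the text's own estimate, it ignores block granularity and the num_tokens − 1 cap, so treat its result as an upper bound on the saving.

```python
def prefill_tokens(num_requests, shared, unique, cached):
    """Approximate prefill tokens computed for prompts of shared + unique tokens each."""
    if cached:
        return shared + num_requests * unique      # prefix once, then every tail
    return num_requests * (shared + unique)        # every prompt from scratch


with_cache = prefill_tokens(16, shared=430, unique=15, cached=True)      # 670
without_cache = prefill_tokens(16, shared=430, unique=15, cached=False)  # 7120
expected_speedup = without_cache / with_cache
```

The ratio comes out a little above 10, the "~10× compute saving" the bullet compares with the 4.6× the wall-clock meter shows. Running the same three lines with your own request count and prompt split is the "derive what you expected" step.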
Hitchhiker's notes
- Conversation history is the killer app, not just system prompts: each turn re-sends the whole transcript, which is — by construction — a growing shared prefix with itself. A chat with T turns gets ~T× prefill savings on its own history. This is why every serious chat API (and vLLM-based products) leans on prefix/prompt caching, and why the commercial APIs sell it explicitly (Anthropic/OpenAI "prompt caching" — same idea, different billing).
- What invalidates a cached prefix: eviction under memory pressure (the blocks are still just free-queue citizens — Phase 2 lab-05), `reset_prefix_cache()`, restart, or anything that changes what the KV means: different LoRA adapter, different model, different chat-template rendering of the "same" text. The hash chain includes token ids only after templating — two prompts that render differently share nothing, which is the most common "why is my hit rate 0" in practice (timestamp in the system prompt, per-user name early in the template, randomized example order...).
- `n>1` parallel sampling (Phase 9) reuses this exact machinery — N samples of one prompt share the prompt's blocks via the same `ref_cnt` mechanics. So do beam search and speculative-decoding draft trees. "Share immutable prefix KV via refcounted blocks" is load-bearing infrastructure, not a feature flag.
- Security note for multi-tenant operators: cache timing is observable — a fast TTFT reveals someone recently prefilled the same prefix. Cross-tenant prefix caching can therefore leak prompt equality across tenants (a real, published attack class against LLM caches). vLLM's cache is per-engine; if you front multiple tenants, decide deliberately whether their prefixes may share a pool.
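To see why a timestamp early in the prompt zeroes the hit rate, it helps to chain block hashes by hand. The sketch below is illustrative only: it uses SHA-256 over the token ids plus the parent hash, which is an assumption standing in for whatever hash function and extra keys the real engine uses, and the token ids are invented.

```python
import hashlib

def block_hashes(token_ids, block_size=4):
    """Chained hashes of full blocks: each hash covers its tokens plus the parent hash."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size   # partial tail block is never hashed
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes

def shared_blocks(a, b, block_size=4):
    """Number of leading blocks whose chained hashes match."""
    count = 0
    for ha, hb in zip(block_hashes(a, block_size), block_hashes(b, block_size)):
        if ha != hb:
            break
        count += 1
    return count

system = list(range(100, 116))                    # 16 static instruction tokens = 4 blocks
stamp_a, stamp_b = [1, 2, 3, 4], [1, 2, 3, 5]     # per-request "timestamp" tokens

static_first = shared_blocks(system + stamp_a, system + stamp_b)
stamp_first = shared_blocks(stamp_a + system, stamp_b + system)
print(static_first, stamp_first)
```

With the static text first, all four instruction blocks match. With the timestamp first, the very first block differs, and because every later hash includes its parent, nothing after it can match either.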
Reflect
- Why does the first request show 0% even though the cache is enabled? And what is the steady-state hit rate of this workload as N → ∞? ((N−1)/N of the shareable tokens — the pioneer cost amortizes to nothing.)
- Your workload prefills 2000 tokens of RAG context (unique per query!) and decodes 100. What hit rate do you expect? (~0 — unique context shares nothing. What would help? Reordering the prompt so static instructions precede the unique context, and caching exactly that. Prompt structure is a performance interface.)
- Estimate: 16 requests × 430-token SYSTEM at ~36 KB/token-ish for a 0.5B model — how much KV memory did sharing save, in MB? Now do it for a 70B model at 405 KB/token and 64 concurrent requests. (This is why prefix caching is also a capacity feature, per Phase 2 lab-03's concurrency math.)
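One way to set up the last estimate is shown below. It is a rough sketch, not a measurement: it assumes decimal kilobytes, ignores block alignment, and reuses the 430-token SYSTEM prompt for the 70B case, which the question does not state explicitly.

```python
def kv_saved_bytes(num_requests: int, shared_tokens: int, bytes_per_token: int) -> int:
    """KV memory not duplicated: every request except the pioneer reuses the shared prefix."""
    return (num_requests - 1) * shared_tokens * bytes_per_token

KB = 1000  # decimal kilobytes, an assumption for this estimate

small = kv_saved_bytes(16, 430, 36 * KB)     # 0.5B-class model figure from the question
large = kv_saved_bytes(64, 430, 405 * KB)    # 70B-class model figure from the question

print(f"{small / 1e6:.0f} MB, {large / 1e9:.1f} GB")
```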
References
upstream/vllm/v1/core/kv_cache_manager.py::get_computed_blocks— where the hit happens, including the hit-rate accounting you watched.mini_vllm/kv_cache.py::get_computed_blocks— the same logic at readable scale (Phase 2 lab-05 exercises it directly).- vLLM docs, Automatic Prefix Caching — design + operational notes: https://docs.vllm.ai/en/latest/design/prefix_caching.html
- Zheng et al., SGLang / RadixAttention (2023) — prefix reuse generalized to a tree; the natural next read: https://arxiv.org/abs/2312.07104
- Anthropic, Prompt caching announcement (2024) — the same economics, productized; good for building intuition about real workload shapes: https://www.anthropic.com/news/prompt-caching
- Lab-06 in this phase — the CPU twin with exact token accounting.
Lab 03-04 — Preemption: Survive Memory Pressure [CPU-OK]
Every admission decision the scheduler makes is a bet: "this request will fit." Decode growth makes the bet probabilistic — every running request gets one token longer per step, and nobody knows when they'll stop. Preemption is what happens when the bets go bad: the scheduler evicts a running request mid-generation, frees its KV, and re-runs it later — and the user at the other end must never be able to tell. In this lab you'll force that emergency on purpose, in a pool sized so two requests cannot both finish, and prove the two halves of the contract: a preemption really occurs, and the preempted request's final output is token-for-token identical to what it would have produced with infinite memory.
Memory pressure costs time, never correctness. That's the sentence this lab turns from a slogan into a test.
Contents
- Why this lab exists
- Background: why overcommit at all
- The mechanism, step by step
- Files
- Run
- The setup that forces preemption
- Why the output is identical
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Preemption is the scheduler's least-exercised, most-critical path — the firmware of the fire extinguisher. It runs rarely (well-tuned deployments preempt little), which means bugs in it survive for months, and when it finally runs, it runs during the worst possible conditions: full memory, maximum load, every user watching. An engineer who has caused preemptions in a controlled pool, watched the counters reset, and verified the replayed output, debugs a production preemption storm from knowledge; everyone else debugs it from folklore.
There's also a design lesson here worth the price of the lab on its own: vLLM turns a potential correctness catastrophe (OOM mid-generation) into a pure performance event (recompute later). That transformation — push failures down the severity ladder, from "wrong answer" to "crash" to "slow" — is the signature move of robust systems design, and preemption is its cleanest specimen in this codebase.
Background: why overcommit at all
The timid alternative exists: admit a request only if prompt + max_tokens worth of blocks
can be reserved up front. No preemption needed, ever. But you built Phase 2 lab-02, so
you can name the cost: that's max-reservation again through the back door — requests
typically generate a fraction of max_tokens, so reserved-but-unused blocks strangle
concurrency exactly the way contiguous allocation did. vLLM chooses to admit
optimistically (reserve nothing beyond current need, let requests grow block by block)
and handle the rare collision with preemption. More throughput every step, plus an
occasional recompute tax, beats less throughput always. But optimism requires a safety
valve — and the valve must preserve correctness, or the whole bargain is rotten. Hence
this lab's two-sided test.
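The cost of the timid alternative is easy to put a number on. The sketch below compares how many requests fit when each one reserves `prompt + max_tokens` up front versus when each one holds only what a typical request ends up using. All workload figures here (pool size, prompt length, typical output length) are hypothetical.

```python
import math

def blocks_for(tokens: int, block_size: int) -> int:
    return math.ceil(tokens / block_size)

def max_concurrency(pool_blocks: int, tokens_per_request: int, block_size: int) -> int:
    """How many requests fit if each one holds this many tokens' worth of blocks."""
    return pool_blocks // blocks_for(tokens_per_request, block_size)

POOL_BLOCKS, BLOCK_SIZE = 1000, 16                  # hypothetical pool
PROMPT, MAX_TOKENS, TYPICAL_OUTPUT = 200, 1024, 128  # hypothetical workload

reserved = max_concurrency(POOL_BLOCKS, PROMPT + MAX_TOKENS, BLOCK_SIZE)
optimistic = max_concurrency(POOL_BLOCKS, PROMPT + TYPICAL_OUTPUT, BLOCK_SIZE)
print(reserved, optimistic)
```

The optimistic figure is a steady-state estimate at typical final length. At admission time a request holds even less, which is exactly why the bet can occasionally go bad.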
The mechanism, step by step
From mini_vllm/scheduler.py (the RUNNING phase's while True), mirroring upstream:
- A running request needs a block for its next token; `allocate_slots` returns `None` — the pool is empty. This is the bad bet coming due.
- The scheduler picks a victim: the last request in `running` — the most recently admitted (under FCFS, the lowest-priority / least-progressed; upstream with priority scheduling picks lowest-priority-then-latest). Note the same principle as lab-01: which end of which list you take from IS the policy.
- `_preempt(victim)`: free all the victim's blocks (back to the free queue — Phase 2 mechanics), reset `num_computed_tokens = 0`, keep `output_token_ids` (this is the crown jewel — see below), status → `PREEMPTED`, and push it on the front of the waiting queue (it has waited longest; it re-enters first).
- Retry the allocation. Repeat — possibly preempting several — until it fits. Degenerate case: the victim is yourself; then you give up this step (you're now first in `waiting`, and you'll be re-admitted when memory frees).
- No admissions on a step that preempted (`not out.preempted_req_ids` guards the WAITING phase) — re-admitting while evicting would thrash.
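The first four steps above can be sketched as a toy allocate-or-preempt loop. This is not the `mini_vllm` or upstream code: the `Req` class, the dict-based pool and the helper names are invented for illustration, and each allocation is simplified to a single block.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass(eq=False)          # compare requests by identity, not by field values
class Req:
    rid: str
    blocks: int = 0           # blocks currently held
    num_computed_tokens: int = 0
    output_token_ids: list = field(default_factory=list)
    status: str = "RUNNING"

def preempt(victim, running, waiting, pool):
    """Free the victim's blocks and reset its compute counter, but keep its outputs."""
    pool["free"] += victim.blocks
    victim.blocks = 0
    victim.num_computed_tokens = 0
    victim.status = "PREEMPTED"
    running.remove(victim)
    waiting.appendleft(victim)            # front of the queue: it re-enters first

def grow_or_preempt(req, running, waiting, pool):
    """Give `req` one more block, evicting from the tail of `running` while the pool is empty."""
    preempted = []
    while pool["free"] == 0:
        victim = running[-1]              # most recently admitted
        preempt(victim, running, waiting, pool)
        preempted.append(victim.rid)
        if victim is req:                 # degenerate case: it evicted itself
            return False, preempted
    pool["free"] -= 1
    req.blocks += 1
    return True, preempted

pool = {"free": 0}
a = Req("A", blocks=2, num_computed_tokens=9, output_token_ids=[7])
b = Req("B", blocks=2, num_computed_tokens=9, output_token_ids=[8])
running, waiting = [a, b], deque()

ok, preempted = grow_or_preempt(a, running, waiting, pool)
print(ok, preempted, b.status, b.output_token_ids)
```

Note what survives on the victim: its status and counter change, its blocks are gone, but `output_token_ids` is untouched.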
Files
- `starter.py` — implement `run(prompts, num_blocks, block_size, max_tokens)`: drive `mini_vllm.LLMEngine` with a given pool size and return each prompt's output token ids. Deliberately thin — the test design is the lab. Your work.
- `solution.py` — reference.
- `test_lab.py` — (1) cramped-pool outputs == roomy-pool outputs; (2) a direct scheduler test that a preemption actually occurs under pressure.
Run
LAB_IMPL=starter pytest phase-03-continuous-batching-scheduler/labs/lab-04-preemption -q
pytest phase-03-continuous-batching-scheduler/labs/lab-04-preemption -q # reference
The setup that forces preemption
Arithmetic you can check on your fingers — pool of 5 blocks × 4 slots = 20 slots, minus the null block → 16 usable slots. Two requests, each 8 prompt tokens + up to 20 output: each needs 3 blocks just to get past its first decodes (tokens 9..12 spill into block 3). Both admit fine (2 blocks each = 4 of 4 blocks — the optimistic bet). Both decode a few tokens... then one needs its third block. Free blocks: zero. The scheduler preempts the most-recent admission, lets the survivor finish (its blocks free at completion — the reaping path from Phase 1), then re-admits the victim, which re-prefills from scratch and finishes too. Total cost: one extra prefill. Total damage to output: zero — that's the assertion.
The direct scheduler test stages the same squeeze without the engine: schedule once (both
admitted), manually advance both requests past their prompts, schedule again — and assert
out.preempted_req_ids is non-empty. One test proves the valve opens; the other proves
nothing leaks when it does.
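The finger arithmetic for the pool can be written down directly. This is a small sketch using the numbers from the setup above (5 blocks of 4 slots, one null block, two 8-token prompts); the helper name is made up.

```python
import math

BLOCK_SIZE, NUM_BLOCKS, PROMPT_LEN = 4, 5, 8

usable_blocks = NUM_BLOCKS - 1                 # one block is the reserved null block
usable_slots = usable_blocks * BLOCK_SIZE

def blocks_needed(num_tokens: int) -> int:
    return math.ceil(num_tokens / BLOCK_SIZE)

at_admission = 2 * blocks_needed(PROMPT_LEN)   # both prompts resident

# Smallest per-request length at which two equal-length requests no longer fit together.
first_overflow = next(
    n for n in range(PROMPT_LEN, 100) if 2 * blocks_needed(n) > usable_blocks
)
print(usable_slots, at_admission, first_overflow)
```

Both prompts fill the usable pool exactly at admission, so the first token that needs a third block has nowhere to go.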
Why the output is identical
The victim's output_token_ids survive preemption; only num_computed_tokens resets. On
re-admission, the request's token list is prompt + outputs_so_far, and the engine — with
zero special-case code — simply sees a request whose counter is far behind and prefills
the whole thing, generated tokens included. When the counter catches up, the next
sample happens at the same (last_token, position) state as if nothing had happened, and
the deterministic model continues identically. Induction does the rest.
Stop and admire the design economy: recovery is just prefill. The two-counters model makes "resume after eviction" indistinguishable from "admit a long prompt" — same code path as lab-02's chunking, same path as lab-06's cache hits. Three features, zero interaction code. When you design state machines, this is what to copy: make recovery a state the normal path already handles, not a parallel universe of special cases. (Real vLLM preserves correctness the same way; with prefix caching on, the recompute may even hit surviving cached blocks of its own prompt and skip most of the work.)
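The induction argument can be checked on a toy. In the sketch below the "model" and the "KV" are stand-ins invented for illustration: each KV entry is a deterministic function of the prefix, sampling is a deterministic function of the last entry, and `len(kv)` plays the role of `num_computed_tokens`. Preemption only clears the KV.

```python
def kv_entry(prev: int, token: int) -> int:
    """Stand-in for one token's KV: a deterministic function of the whole prefix."""
    return (prev * 131 + token + 1) % 1_000_003

def sample(last_kv: int) -> int:
    """Stand-in for greedy sampling from the logits at the last position."""
    return last_kv % 50

def generate(prompt, max_tokens, preempt_at_step=None):
    tokens = list(prompt)          # prompt + outputs so far (survives preemption)
    kv = []                        # one entry per computed token
    step = 0
    while len(tokens) - len(prompt) < max_tokens:
        if step == preempt_at_step:
            kv.clear()             # blocks freed; the computed-token counter is back to 0
        # One code path for prefill, decode and recovery: compute whatever is missing.
        for tok in tokens[len(kv):]:
            kv.append(kv_entry(kv[-1] if kv else 0, tok))
        tokens.append(sample(kv[-1]))
        step += 1
    return tokens[len(prompt):]

prompt = [3, 1, 4, 1, 5, 9, 2, 6]
baseline = generate(prompt, max_tokens=12)
replayed = generate(prompt, max_tokens=12, preempt_at_step=5)
print(baseline == replayed)
```

There is no branch for "resuming": the loop that computes missing entries handles a fresh prompt, a normal decode step and a post-eviction replay identically.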
What the tests prove
| Test | The half of the contract it pins |
|---|---|
| `test_cramped_pool_matches_roomy_pool` | Correctness: 5-block pool and 256-block pool produce identical token ids, and both requests reach their full max_tokens. The user cannot detect preemption from outputs. |
| `test_preemption_actually_happens_under_pressure` | Liveness of the test itself: a preemption really fires in this scenario. Without this, the first test could pass vacuously (pool accidentally roomy, nothing preempted, nothing proven). Pair every "X survives Y" test with a "Y actually happened" probe — untriggered safety tests are the unit-test equivalent of an unplugged smoke detector. |
Hitchhiker's notes
- Recompute vs swap. vLLM can alternatively swap a victim's KV to CPU RAM and copy it back later, instead of recomputing. The trade: recompute spends GPU FLOPs (cheap-ish, prefill is compute-efficient); swap spends PCIe bandwidth (~tens of GB/s against KV that can be GBs) and host RAM. V1 defaults to recompute — short-to-medium contexts re-prefill faster than they copy. Swap wins for very long contexts; that regime is where the disaggregated/offload designs of Phase 15 live.
- Why the most-recently-admitted victim? Least progress lost (it has computed the least), and under FCFS it's the lowest-priority commitment. Preempting the oldest would maximize wasted work and starve the request closest to finishing — note that the survivor finishing is what frees memory. Victim selection isn't fairness aesthetics; it's part of the forward-progress argument.
- The deadlock question (interviewers love it): what if no request can finish because none fits alone? Then preemption ping-pongs forever — A evicts B, B evicts A. Prevention is an admission-time invariant: a single max-length request must fit in the pool. That's exactly the startup check you met in Phase 1 lab-02 (the engine refusing `max_model_len` too big for its blocks) — the safety valve works only because another component made a promise. Cross-component invariants like this are what design docs are for.
- Operationally, preemption is a smell, not a feature. Each one re-prefills a whole request — visible as a TTFT/ITL spike for the victim and burned throughput for everyone. vLLM logs a warning with a preemption counter (`vllm:num_preemptions_total` in metrics); a sustained rate of preemptions means your pool is undersized for your workload: lower `max_num_seqs`, shorten `max_model_len`, raise `gpu_memory_utilization`, or buy HBM. The valve existing doesn't make leaning on it free.
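The anti-deadlock invariant from the notes above fits in one function. This is a sketch of the idea only, not the engine's actual startup check: the function name and the `reserved_blocks` parameter (standing in for the null block) are assumptions.

```python
import math

def single_request_fits(max_model_len: int, block_size: int,
                        num_blocks: int, reserved_blocks: int = 1) -> bool:
    """True if one request of maximum length can hold all its blocks at once."""
    needed = math.ceil(max_model_len / block_size)
    return needed <= num_blocks - reserved_blocks

# With 5 blocks of 4 slots and one reserved block, 16 tokens fit and 17 do not.
fits_16 = single_request_fits(max_model_len=16, block_size=4, num_blocks=5)
fits_17 = single_request_fits(max_model_len=17, block_size=4, num_blocks=5)
print(fits_16, fits_17)
```

If this check fails, no amount of preemption can help: the survivor of every eviction would still be unable to finish.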
Going further
- Instrument `num_preemptions` per request (the field already exists) across a sweep of pool sizes from 5 to 50 blocks for a fixed 4-request workload. Plot total steps vs pool size — you'll get a hockey stick whose knee is "enough memory," the capacity-planning picture from the memory side.
- Change the victim policy to oldest-first and rerun the suite. The correctness test still passes (replay is policy-independent — make sure you can say why), but count total steps: you've measured the cost of a bad policy with the safety net intact.
- Add a `swap` mode to `mini_vllm` (stash the victim's per-token "KV" — here just its counter state — instead of resetting) and make the correctness test pass both modes. You'll discover the bookkeeping subtleties (what if the swapped request's blocks were shared via prefix cache?) that make real swap implementations hairy.
References
- `mini_vllm/scheduler.py::_preempt` and the RUNNING phase's allocate-or-preempt loop — the dozen lines this lab is about.
- `upstream/vllm/v1/core/sched/scheduler.py` — search `preempt`: same dance, plus priority-aware victim selection and the preemption-mode plumbing.
- Kwon et al., PagedAttention (SOSP 2023), §4.5 — preemption via recompute vs swap, with measurements: https://arxiv.org/abs/2309.06180
- vLLM docs, Optimization and Tuning — the official "reduce preemptions" guidance and the warning log you'll see in production: https://docs.vllm.ai/en/latest/configuration/optimization.html
- Phase 2 lab-02 — the over-reservation waste that justifies optimistic admission in the first place.
Lab 03-05 — Decode-Latency Spikes, and How Chunking Kills Them [CPU-OK]
Labs 01 and 02 built chunked prefill's mechanism and proved it safe. This lab supplies the
missing piece: the motive. You'll put a short request mid-decode — a user happily
watching tokens stream — and then slam a 256-token prompt into the engine. Without
chunking, the decode stream takes one step that costs 257 tokens of work instead of 1:
a ~250× inter-token latency spike, the infamous "my chat froze for a second" of early
serving engines. With threshold=32, the same experiment caps every step at 33. You will
produce both latency profiles yourself, as exact integer sequences, on a laptop.
Contents
- Why this lab exists
- Background: step cost is the latency
- Files
- Run
- What to implement
- What you should see — the two profiles
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Tail latency is where serving engineers earn their pay. Means are easy — any engine has a fine average inter-token latency; the product experience is set by the p99, and the p99 is set by exactly the event you're about to stage: someone else's prefill landing in your decode step. This interference is invisible in throughput numbers (the work all gets done!) and invisible in single-request benchmarks (no one to interfere with). You only see it by looking at per-step cost from the perspective of one victim stream — which is precisely the measurement you'll build, using the schedule-probe from Phase 1 lab-04 as your instrument.
This lab is also the Sarathi-Serve paper in a bottle. Their contribution — "stall-free scheduling" via chunked prefills piggybacked on decode batches — reduces, on this workload, to the difference between your two measured profiles. Papers compress well when you've run their experiment.
Background: step cost is the latency
In mini_vllm, steps are instant; on a GPU, a step's wall-clock time grows roughly with
the tokens scheduled in it (they all go through the same forward pass — more tokens, more
FLOPs, longer step). So for a decoding request, the time between its token k and token
k+1 is the duration of the step that computes k+1 — including everyone else's work
in that step. That's why this lab's metric is:
for each step in which the victim advances, the total scheduled tokens of that step.
It's a proxy with the right shape: a step of [A:1, B:256] is ~257 token-units long, and
A's user waits all of it for one token. The proxy ignores second-order GPU effects
(attention's memory traffic, kernel launch overheads — Phase 18 refines), but the
first-order picture it gives is the one that drives the tuning decision.
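The metric defined above is a one-line filter over a schedule trace. The sketch below assumes a hypothetical trace format, one `{request_id: num_scheduled_tokens}` dict per `schedule()` call, which may differ from what the Phase 1 lab-04 probe actually records.

```python
def victim_step_costs(steps, victim):
    """For each step in which `victim` advances, the total tokens scheduled in that step."""
    return [sum(step.values()) for step in steps if step.get(victim, 0) > 0]

# Hypothetical trace: A prefills 4 tokens, decodes, then B's 256-token prompt lands.
trace = [
    {"A": 4},
    {"A": 1},
    {"A": 1, "B": 256},
    {"A": 1, "B": 1},
    {"B": 1},              # A has finished; this step is not part of A's profile
]
costs = victim_step_costs(trace, "A")
print(costs)
```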
Files
- `starter.py` — implement `decode_step_costs(...)`: stage the collision, probe the schedule, extract the victim's per-step costs. Recipe in the docstring. Your work.
- `solution.py` — reference.
- `test_lab.py` — the spike exists unchunked; the cap holds chunked; the work conserves; outputs are schedule-invariant; the victim is never starved.
Run
LAB_IMPL=starter pytest phase-03-continuous-batching-scheduler/labs/lab-05-decode-latency-spikes -q
pytest phase-03-continuous-batching-scheduler/labs/lab-05-decode-latency-spikes -q # reference
What you should see — the two profiles
Real output of the solution (long_prompt_len=256, max_tokens=16):
threshold=0 : [257, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
threshold=32: [33, 33, 33, 33, 33, 33, 33, 33, 2, 1, 1, 1, 1, 1, 1]
Read them like latency traces, because that's what they are:
- Unchunked: one monstrous step — B's entire 256-token prefill rides the same step as A's decode (256 + 1 = 257) — then calm. A's user experiences fifteen smooth tokens and one ~250× hiccup. This is a p99 disaster hiding in a perfect mean: the average cost is ~18, the median is 1. If you only monitor averages, this profile looks healthy. It isn't.
- Chunked at 32: eight consecutive steps of exactly 33 (one 32-token chunk of B + A's 1 token — your lab-02 formula: ⌈256/32⌉ = 8 steps), then a 2 (B's first decode rides along), then 1s. The spike didn't shrink; it spread: the same 256 tokens of prefill work, conserved to the token, paid in eight 33× installments instead of one 257× balloon. Worst-case ITL for A drops ~8×; B's time-to-first-token rises (8 steps instead of 1). Nothing is free — chunking is a redistribution of latency from everyone's p99 to the long prompt's TTFT. That redistribution is almost always the right trade (decode stalls are user-visible jitter; prefill latency is a single wait users expect), but say it as a trade, not a win.
- The `2`s are worth a glance: B's own decode co-scheduled with A's. Mixed batches everywhere once you know to look (Phase 1 lab-04's "money step").
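The claims about means, medians and conserved work can be checked against the two profiles listed above with the standard library. A small sketch:

```python
from statistics import mean, median

unchunked = [257, 2] + [1] * 13          # threshold=0 profile
chunked = [33] * 8 + [2] + [1] * 6       # threshold=32 profile

summary = {
    name: {
        "max": max(profile),
        "mean": round(mean(profile), 1),
        "median": median(profile),
        "total": sum(profile),
    }
    for name, profile in (("threshold=0", unchunked), ("threshold=32", chunked))
}
for name, stats in summary.items():
    print(name, stats)
```

The totals and means are identical across the two profiles, which is the conservation of work in one line. Only the maximum, and with it the tail, moves.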
What the tests prove
| Test | What it pins |
|---|---|
| `test_unchunked_prefill_spikes_one_decode_step` | The spike is real (≥ 257) and singular (second-worst step ≤ 3) — exactly one balloon payment |
| `test_chunked_prefill_caps_the_spike` | The cap holds (≤ threshold + 1 for every step the victim shares) and the work conserved (≥ 8 elevated steps — the prefill didn't vanish, it spread) |
| `test_chunking_does_not_change_the_decode_streams_output` | Lab-02's invariant under interference: identical token ids for the victim across both schedules |
| `test_decode_stream_is_never_starved` | The victim advances every single step it's alive, chunked or not — running-first (lab-01) means interference delays decodes but never skips them |
Together these four are the full contract of chunked prefill: bounded interference, conserved work, untouched outputs, guaranteed progress.
Hitchhiker's notes
- Where's the threshold's floor? Push it down: at `threshold=1`, B's prefill takes 256 steps — A's latency is pristine and B's TTFT is catastrophic; and on real hardware, 256 tiny steps pay 256× the fixed per-step overhead (scheduler, launches, sampler), so total throughput sags too. The optimum is workload-dependent and that's the point: upstream exposes `long_prefill_token_threshold` (and the budget) rather than hardcoding an answer. Sarathi-Serve's evaluation is essentially this sweep with wall-clocks.
- The budget is the other half of the dial. Chunks are bounded by `min(threshold, remaining budget)` — lab-01's clamp. A small `max_num_batched_tokens` caps interference globally (every step is small) at the cost of slower prefills for everyone. Production tuning usually sets budget for the worst acceptable step time, then threshold for fairness within it.
- Why measure from the victim's seat? Because aggregate metrics hide exactly this. Mean step cost barely moves between the two profiles (the work is identical!); only per-victim-step cost shows the 257 vs 33. The general lesson for benchmarking serving systems: pick a request and follow it — fleet-wide averages are where tail pain goes to hide. This is why serious LLM benchmarks report TTFT and ITL distributions, never just tokens/sec (Phase 18).
- Real-engine correspondence: in vLLM, run two clients — one streaming a long generation, one submitting a huge prompt — and watch the streamer's inter-chunk gaps with chunked prefill toggled (it's default-on in V1; you can throttle it via `long_prefill_token_threshold`). The wall-clock version of your integer profiles, jitter included.
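The threshold trade in the first note can be tabulated with the token-count proxy before touching an engine. This is a sketch under stated assumptions: `threshold=0` is treated as "no chunking" (as in the profiles above), the token budget is ignored, and exactly one decoding victim shares each step.

```python
import math

def chunking_tradeoff(prompt_len: int, threshold: int, decoders: int = 1):
    """Return (prefill steps for the long prompt, worst step cost seen by a decoder)."""
    chunk = prompt_len if threshold == 0 else min(threshold, prompt_len)
    return math.ceil(prompt_len / chunk), chunk + decoders

sweep = {t: chunking_tradeoff(256, t) for t in (0, 1, 16, 32, 64, 128, 256)}
for threshold, (prefill_steps, worst_step) in sweep.items():
    print(f"threshold={threshold:>3}  prefill_steps={prefill_steps:>3}  worst_step={worst_step}")
```

The two columns move in opposite directions, which is the whole tuning problem: every halving of the worst step doubles the long prompt's prefill step count.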
Going further
- Compute p50/p99 of A's step costs for thresholds {0, 16, 32, 64, 128, 256} and plot both against B's prefill-step count. The p99 curve falls as the TTFT curve rises; where they cross for your tolerance is the tuning answer. You've reproduced Sarathi-Serve Figure-1-style analysis with a 30-line probe.
- Make it a storm: five long prompts arriving on consecutive steps while two victims decode. Does the cap still hold per step? (It must — the budget binds the sum.) What happens to admission order? (Lab-01's FCFS + head-of-line rules, now visible in data.)
- Add wall-clock: time each `eng.step()` (even toy steps have measurable cost) and check the correlation between your token proxy and real microseconds. Weak on a toy model, strong on a GPU — knowing when a proxy is valid is half of performance engineering.
References
- Agrawal et al., Sarathi-Serve: Taming Throughput-Latency Tradeoff in LLM Inference (OSDI 2024) — the stall this lab stages, and the chunking cure, measured at scale: https://arxiv.org/abs/2403.02310
- Agrawal et al., SARATHI (2023) — the original chunked-prefill paper: https://arxiv.org/abs/2308.16369
- `upstream/vllm/v1/core/sched/scheduler.py` — `long_prefill_token_threshold` in the production clamp; the dial you just calibrated.
- vLLM docs, Optimization and Tuning — official guidance on budget/threshold tuning: https://docs.vllm.ai/en/latest/configuration/optimization.html
- Dean & Barroso, The Tail at Scale (CACM 2013) — the classic on why p99 beats mean, background for this lab's whole worldview: https://research.google/pubs/the-tail-at-scale/
Lab 03-06 — Prefix Caching: Count Every Token It Saves [CPU-OK]
Lab-03 showed you prefix caching through the real engine's meters — hit rates, throughput
averages, wall-clock noise. This lab removes the noise. On the mini engine, you'll run the
same shared-system-prompt workload with caching off and on and account for the savings to
the exact token: 544 scheduled tokens uncached, 96 cached, difference 448 = 7 followers × 16 blocks × 4 tokens. Not approximately. Exactly. When you can predict a
cache's benefit with integer arithmetic before running it, you understand the cache.
Contents
- Why this lab exists
- Background: what "scheduled tokens" measures
- Files
- Run
- What to implement
- The accounting, line by line
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Three phases of machinery converge here, and this lab is where you check that you can
predict their composition: Phase 2's block hashing and sharing (lab 02-05), this phase's
admission path (get_computed_blocks → adopt → allocate, lab 03-01), and the scheduling
identities from Phase 1 lab-04 (Σ scheduled = prompt + max_tokens − 1). If your prediction
of the cached total is off by even one token, one of those three mental models has a crack
in it — and the integers will tell you which (that's how the over-allocation bug mentioned
in Phase 2 lab-05 was actually found: an exact count disagreed).
The professional skill is dimensioned estimation of cache value. "Enable prefix caching and things get faster" is advocacy. "This workload shares a 64-token block-aligned prefix across 8 requests, so caching eliminates exactly 7×64 = 448 of 544 scheduled prefill+decode tokens — an 82% compute reduction on this batch, and here's the count" is engineering. The GPU version (lab-03) gives you the wall-clock corroboration; this lab gives you the theorem.
Background: what "scheduled tokens" measures
The probe sums total_num_scheduled_tokens over every schedule() call — every token of
forward-pass work the scheduler ever requested. It is the engine's compute odometer:
prefill chunks, cache-miss remainders, decodes, everything. Two properties make it the
right meter here:
- It's conserved: work not scheduled is work not done. There is no place for savings to hide or double-count.
- It's schedule-invariant in total: chunking and batching rearrange when tokens are scheduled, never how many (lab-02/05). Only caching changes the total — by replacing computed tokens with adopted KV. So the off/on difference isolates the cache's effect perfectly. Experimental design through invariants: pick a meter on which everything else you might accidentally vary is provably neutral.
Files
- `starter.py` — implement `run_and_count(...)`: the probe + `generate`, returning the odometer total and the outputs. Your work.
- `solution.py` — reference.
- `test_lab.py` — exact totals for both arms, the savings identity, output equality, and the share-nothing control.
Run
LAB_IMPL=starter pytest phase-03-continuous-batching-scheduler/labs/lab-06-prefix-cache-savings -q
pytest phase-03-continuous-batching-scheduler/labs/lab-06-prefix-cache-savings -q # reference
What to implement
The Phase 1 lab-04 probe, reduced to an accumulator: wrap eng.scheduler.schedule, add up
out.total_num_scheduled_tokens, run eng.generate(...) over the prompts (greedy,
ignore_eos), return (total, token_ids_per_prompt). Ten lines. The thinking is in the
test predictions — write those yourself on paper before running anything.
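The accumulator pattern itself is generic and can be sketched without the engine. In the example below, `FakeScheduler` is a hypothetical stand-in so the snippet runs on its own; only the attribute name `total_num_scheduled_tokens` and the method name `schedule` are taken from the text above, and the real lab wiring is left to you.

```python
from types import SimpleNamespace

def count_scheduled_tokens(scheduler):
    """Wrap scheduler.schedule so every call adds to a running total; return a reader."""
    total = 0
    original = scheduler.schedule

    def counting_schedule(*args, **kwargs):
        nonlocal total
        out = original(*args, **kwargs)
        total += out.total_num_scheduled_tokens
        return out

    scheduler.schedule = counting_schedule   # shadow the method on this instance only
    return lambda: total

class FakeScheduler:
    """Hypothetical stand-in that replays a fixed plan of per-step token counts."""
    def __init__(self, plan):
        self._plan = iter(plan)

    def schedule(self):
        return SimpleNamespace(total_num_scheduled_tokens=next(self._plan))

sched = FakeScheduler([65, 1, 1, 1])         # one 65-token prefill, then three decodes
read_total = count_scheduled_tokens(sched)
for _ in range(4):
    sched.schedule()
print(read_total())
```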
The accounting, line by line
Workload: SYSTEM = "S"×64 (64 byte-tokens = exactly 16 full blocks at block_size 4 —
alignment chosen deliberately), 8 prompts SYSTEM + str(i) (65 tokens, unique last token),
max_tokens=4, greedy.
Caching off — every request pays full price:
per request: 65 (prefill) + 3 (decodes; the 4th token is sampled but never computed — Phase 1 lab-04)
= 68
total : 8 × 68 = 544
Caching on — the pioneer pays, the followers ride:
request 0 : 65 + 3 = 68 (cold cache: populates 16 block hashes during its prefill)
requests 1–7: 1 + 3 = 4 each (!!)
total : 68 + 7×4 = 96
savings : 544 − 96 = 448 = 7 × 64
That 1 deserves a pause — it's three of this course's rules colliding in one token:
- The follower's 65-token prompt hits all 16 full blocks → 64 tokens adopted free.
- The hit cap (`num_tokens − 1`, Phase 2 lab-05) wouldn't bind here (64 ≤ 64), but the 65th token couldn't hit anyway: it's in a partial block (I3 — never cached) and it must be computed to produce logits (you need the model's output at the last position, and the cache stores only KV).
- So the scheduler admits the request with `num_computed = 64`, schedules exactly `1` token, and — because `64 + 1 == 65` — that same step samples (`needs_sample`, Phase 1 lab-03). A one-token prefill that immediately emits: the strangest-looking line you'll see in a scheduler trace, and now you can explain it.
Also notice when the followers hit: all 8 requests are admitted in the same
schedule() call, yet requests 1–7 still hit blocks request 0 cached microseconds
earlier — because mini_vllm (like upstream) registers blocks in the cache index at
allocation time, inside the same admission loop. Caching is eager; sharing begins before
the pioneer has computed a single value. (The KV contents don't exist yet — but the
reservation is shared, and the prefill that fills it runs once. If that bends your
brain, good; it's the detail most explanations skip.)
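The accounting above can be turned into a small predictor. This is a sketch of the arithmetic only, under the assumptions stated in this section: all requests share the same prefix, the unique part differs from its first token, nothing is evicted, and the hit is limited to full blocks within `num_tokens − 1`. The function name is made up.

```python
def scheduled_tokens(num_requests, shared_len, unique_len, max_tokens, block_size, caching):
    """Predicted odometer total for N prompts of shared_len + unique_len tokens each."""
    prompt_len = shared_len + unique_len
    decodes = max_tokens - 1                     # the last token is sampled, never computed
    full_cost = prompt_len + decodes
    if not caching:
        return num_requests * full_cost
    # Followers adopt full blocks of the shared prefix, capped at prompt_len - 1 tokens
    # so that at least one token is computed to produce logits.
    max_hit = (min(shared_len, prompt_len - 1) // block_size) * block_size
    follower_cost = (prompt_len - max_hit) + decodes
    return full_cost + (num_requests - 1) * follower_cost

off = scheduled_tokens(8, 64, 1, 4, 4, caching=False)
on = scheduled_tokens(8, 64, 1, 4, 4, caching=True)
misaligned = scheduled_tokens(8, 66, 1, 4, 4, caching=True)   # 16.5-block prefix
print(off, on, off - on, misaligned)
```

The misaligned case shows the cost of a half-full block: followers still adopt only 64 tokens, so each one pays for the two leftover prefix tokens as well as its unique token.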
What the tests prove
| Test | What it pins |
|---|---|
| `test_caching_off_pays_full_price_for_everyone` | The baseline identity: 8 × (65 + 3) = 544, no cache, no surprises |
| `test_caching_on_computes_the_shared_prefix_once` | The cached total, exactly: 68 + 7×4 = 96 — every rule above, composed correctly |
| `test_savings_equal_followers_times_shared_full_blocks` | The savings identity (N−1) × shared_full_block_tokens — the formula you'll reuse to estimate cache value on any workload |
| `test_outputs_are_identical_with_and_without_caching` | Caching is a pure performance feature: same tokens out. (The cached KV is the KV — Phase 2 lab-06's identity, economically applied) |
| `test_unshared_prompts_save_nothing` | The control arm: distinct prompts share only a sliver of block-aligned prefix → savings < 25%. Caching is workload-dependent; anyone selling it flat-rate is selling |
Hitchhiker's notes
- The alignment was rigged, and you should notice. `SYSTEM` is exactly 16 blocks. Make it 66 tokens (16.5 blocks) and followers hit only 64 of 66 — the half-full block 17 recomputes for everyone, forever. On real tokenizers you don't control alignment, which is why measured hit rates hover below the naive prediction (lab-03's 93.7%) and why block_size enters cache math, not just memory math.
- Map the integer totals to the GPU meters: hit rate ≈ adopted/looked-up = 7×64 / (some denominator including tails); scheduled-token ratio ≈ 544/96 ≈ 5.7×, the same ballpark as the ~4.6× lab-03 measured through wall-clock noise. Exact model + noisy measurement agreeing is how you validate both; either alone can fool you.
- Why followers cost 4 while the pioneer costs 68 is the per-request view of the economics: a follower's marginal cost is its unique content plus decodes. System prompts become nearly free at the margin; what stays expensive is what's per-user. This inverts prompt-engineering economics — long, rich shared instructions are cheap; per-request context is what you trim. Product decisions hang on this inversion.
- `enable_caching=False` exists for a reason — it's the control arm of every caching benchmark, and occasionally a production choice (e.g. strict multi-tenant isolation — see lab-03's security note). A feature you can't turn off is a feature you can't measure.
Going further
- Multi-turn: simulate a 5-turn conversation (each turn's prompt = previous prompt + previous output + new question) and predict, then measure, the per-turn scheduled tokens with caching on. You should see each turn pay only its delta. This is the chat-history result from lab-03's notes, now exact.
- Eviction pressure: shrink `num_blocks` until followers stop hitting (the pioneer's blocks get evicted by the followers' own decode growth — Phase 2 lab-05's queue mechanics). Find the cliff; explain its location from pool arithmetic.
- Derive the general formula: for N requests sharing a P-token prefix (block size B), savings = `(N−1) × B × ⌊P/B⌋`... except when the unique suffix is empty — then the hit cap (`num_tokens − 1`) bites and the formula needs a correction term. Write the corrected version and the test that proves it. (This edge — identical entire prompts — is exactly lab-03's 8-identical-prompts experiment.)
References
- `mini_vllm/scheduler.py` — the WAITING-phase `get_computed_blocks` admission path you're metering (and the `num_new_tokens == 0 → schedule 1` branch behind the one-token prefill).
- `mini_vllm/kv_cache.py::_cache_full_blocks` — eager caching at allocation time.
- `upstream/vllm/v1/core/kv_cache_manager.py` — the production twin of both.
- vLLM docs, Automatic Prefix Caching — design doc: https://docs.vllm.ai/en/latest/design/prefix_caching.html
- Phase 2 lab-05 — the block-level mechanics (ref counts, hit cap, revival) this lab meters through the scheduler.
- Phase 3 lab-03 — the same experiment on real hardware, with wall-clocks attached.
Phase 03 — Exercises: Continuous Batching & Scheduler
Escalating from "explain it" to "design it." Staff-level = the last ones cold, citing the exact
upstream/ line.
Warm-up (explain)
- Restate, in your own words, the "no prefill/decode phase" idea. What two numbers on Request does the whole scheduler manipulate?
- Why schedule RUNNING requests before admitting WAITING ones?
- What three conditions must all hold to admit a waiting request? (guide §6 / deep-dive §2.)
Core (trace the code)
- Walk the 4-line num_new_tokens clamp (scheduler.py:385–398). Name each of the caps and give a scenario where each one is the binding constraint.
- Trace the while True preemption loop for: 3 running requests, 0 free blocks, FCFS policy. Who gets preempted, and what does _preempt_request do to them?
- In mini_vllm, why does admission stop entirely (break) on the first allocate_slots == None, while the running phase retries after preempting? (Hint: different goals — make progress vs. don't over-admit.)
Build (extend your code)
- Implement the PRIORITY policy in mini_vllm/scheduler.py: a priority on Request and a victim = max(running, key=lambda r: (r.priority, r.arrival_time)). Write a test where a high-priority late arrival preempts a low-priority running request.
- Add a stats counter: total preemptions, average batch size, mean KV usage per step. Verify on lab-04's cramped run that preemptions > 0.
- Implement swapping preemption: instead of num_computed_tokens = 0, move the request's blocks to a CPU list and restore on re-admit. Show output is still identical; discuss the cost difference vs recompute.
Design (staff-level)
- A workload is 90% short chats (200-token prompts) and 10% long-doc summaries (16k prompts). Pick max_num_batched_tokens and long_prefill_token_threshold and justify with the latency impact on the short chats while the long prefills run.
- You see frequent preemptions in production. List the three knobs you'd reach for (and the code/Phase each maps to) and the risk of each.
- Continuous batching keeps the GPU full, but at very high concurrency latency degrades. Use Little's Law to explain the tradeoff and where you'd cap max_num_seqs.
- Design admission control to prevent a request that can never fit (longer than the whole KV cache) from deadlocking the engine. What does real vLLM do? (Peek: FINISHED_IGNORED, check_enough_kv_cache_memory at kv_cache_utils.py:794.)
Self-grading
5, 10–13 are interview-grade. Could you whiteboard each in 5 minutes and name the file? If not, re-read the matching deep-dive section, then drill INTERVIEW.md.
Phase 03 — Interview Questions: Continuous Batching & Scheduler
Throughput questions live here. Cover the answer, attempt out loud, then compare. This and Phase 02 are the two topics to own cold.
Q1. What is continuous batching and why is it the biggest throughput win in LLM serving?
Model answer
Static batching runs a fixed batch to completion, so the GPU runs at the speed of the slowest request and finished requests waste their slot. Continuous batching re-decides the batch every single step (every token): the instant a request finishes, its slot is freed and a waiting request joins mid-flight. With mixed-length traffic (all real traffic) this keeps the GPU saturated continuously instead of idling on finished slots. It's purely a scheduling change — same kernels, same model — which is why it's such high leverage.
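The gap between the two policies shows up even in a toy step counter. A minimal sketch, assuming each request is reduced to "number of decode steps it still needs" and every step costs the same; the function names are illustrative, not vLLM APIs:

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Re-decide the batch every step: a freed slot is refilled immediately."""
    waiting = deque(output_lens)            # FCFS queue of remaining-step counts
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1                                    # one step for the whole batch
        running = [r - 1 for r in running if r > 1]   # finished requests leave
    return steps


lens = [2, 10, 2, 10]                       # mixed short and long requests
print(static_batching_steps(lens, 2))       # 20
print(continuous_batching_steps(lens, 2))   # 14
```

With two slots and 24 total steps of work, 12 is the floor; continuous batching gets to 14 because the short requests stop holding a slot hostage.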
Q2. Explain the scheduler's core mental model.
Model answer
There's no "prefill phase" or "decode phase." Each request is just num_computed_tokens racing
to catch up to num_tokens. Every step the scheduler hands out tokens so requests close that
gap, under a global token budget. "Prefill" = far behind; "decode" = behind by one. This single
rule covers chunked prefill (hand out part of the gap), prefix caching (start with the gap
pre-closed), and speculative decoding (the gap includes draft tokens via num_tokens_with_spec)
— all with no special cases. It's the comment at the top of Scheduler.schedule
(scheduler.py:330).
Q3. What is chunked prefill and what problem does it solve?
Model answer
A long prompt's prefill, done in one step, would monopolize the step and stall every in-flight
decode → inter-token-latency spikes for all current users. Chunked prefill splits the prefill
across multiple steps under the per-step token budget (max_num_batched_tokens), so each step
mixes a slice of the big prefill with ongoing decodes. It trades a bit of prefill throughput
(more steps) for much better decode latency under load. Knob:
long_prefill_token_threshold + the budget (scheduler.py:390).
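The clamp behind this answer fits in a few lines. A toy sketch, assuming hypothetical names Req and schedule_step (these are not the upstream classes) and ignoring KV allocation entirely; it only shows how the gap, the per-request threshold, and the shared budget interact:

```python
from dataclasses import dataclass


@dataclass
class Req:
    name: str
    num_tokens: int                 # prompt + tokens sampled so far
    num_computed_tokens: int = 0    # tokens whose KV is already computed


def schedule_step(running, token_budget, long_prefill_token_threshold=0):
    """Hand each request part of its gap, under one shared per-step budget."""
    scheduled = {}
    for req in running:
        if token_budget == 0:
            break
        n = req.num_tokens - req.num_computed_tokens      # the gap to close
        if 0 < long_prefill_token_threshold < n:
            n = long_prefill_token_threshold              # per-request chunk cap
        n = min(n, token_budget)                          # per-step global cap
        if n > 0:
            scheduled[req.name] = n
            token_budget -= n
    return scheduled


# Two decodes (behind by one) and one long prompt (far behind).
running = [Req("chat-a", 50, 49), Req("chat-b", 80, 79), Req("doc", 1000, 0)]
print(schedule_step(running, token_budget=256))
# {'chat-a': 1, 'chat-b': 1, 'doc': 254}
print(schedule_step(running, token_budget=256, long_prefill_token_threshold=128))
# {'chat-a': 1, 'chat-b': 1, 'doc': 128}
```

The decodes are served first and cost one token each; the long prompt takes whatever the budget and threshold leave, which is chunked prefill with no dedicated code path.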
Q4. How does prefix caching interact with the scheduler?
Model answer
When admitting a waiting request, the scheduler calls get_computed_blocks
(scheduler.py:591), which asks the KV manager how many leading tokens are already cached
(shared physical blocks from an earlier request with the same prefix). Those tokens count as
already computed, so the request starts with num_computed_tokens > 0 and only prefills the
unique remainder. For a shared system prompt across many users this is a massive
throughput/memory win and the structural advantage behind multi-tenant serving. It rides on
Phase 02's block sharing (touch + ref_cnt).
Q5. Walk me through what happens when a running request needs memory and there's none.
Model answer
allocate_slots returns None (Phase 02). The scheduler enters its preemption loop
(scheduler.py:443): it picks a victim — under FCFS self.running.pop() (most recently
admitted), under PRIORITY the worst (priority, arrival_time) — calls _preempt_request to
free that request's KV blocks and send it back to waiting (to be recomputed later), then retries
the allocation. If the only request left to preempt is the one we're trying to schedule, we give
up on it this step. This None → preempt → retry handshake is what lets vLLM admit aggressively
without OOM-crashing.
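The None → preempt → retry handshake can be traced with a toy block counter. A sketch under stated assumptions: ToyKV and schedule_running are hypothetical stand-ins (requests are plain strings, the pool only counts blocks), and only the FCFS victim rule is modeled:

```python
from collections import deque


class ToyKV:
    """Stand-in for the KV manager: it only counts blocks."""

    def __init__(self, num_blocks):
        self.free = num_blocks
        self.held = {}                                   # request -> blocks held

    def allocate_slots(self, req, num_blocks):
        if num_blocks > self.free:
            return None                                  # out of memory: normal control flow
        self.free -= num_blocks
        self.held[req] = self.held.get(req, 0) + num_blocks
        return num_blocks

    def free_request(self, req):
        self.free += self.held.pop(req, 0)


def schedule_running(running, waiting, kv, need):
    """Give each running request need[req] new blocks, preempting from the tail if required."""
    scheduled, preempted = [], []
    i = 0
    while i < len(running):
        req = running[i]
        while True:
            if kv.allocate_slots(req, need[req]) is not None:
                scheduled.append(req)
                break
            victim = running.pop()                       # FCFS victim: most recently admitted
            kv.free_request(victim)                      # its KV will be recomputed later
            waiting.appendleft(victim)                   # back to the front of the queue
            preempted.append(victim)
            if victim == req:                            # nothing left to evict but ourselves
                break
        i += 1
    return scheduled, preempted


kv = ToyKV(num_blocks=4)
for name, blocks in [("a", 2), ("b", 1), ("c", 1)]:
    kv.allocate_slots(name, blocks)                      # pool is now full
running, waiting = ["a", "b", "c"], deque()
scheduled, preempted = schedule_running(running, waiting, kv, {"a": 1, "b": 1, "c": 1})
print(scheduled, preempted, list(waiting))               # ['a'] ['c', 'b'] ['b', 'c']
```

Request "a" makes progress by evicting "c"; request "b" then finds nothing to evict but itself and gives up for this step, matching the last case described above.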
Q6. Preemption: recompute vs swap. Tradeoff?
Model answer
On preemption you can either recompute the KV later (replay prompt+generated tokens through prefill) or swap the KV blocks out to CPU memory and copy them back on resume. Recompute spends GPU compute (cheap-ish thanks to efficient prefill, no extra memory traffic off-GPU); swap spends PCIe bandwidth and CPU memory but avoids recomputation. Recompute usually wins for short sequences; swap can win for very long KV where recompute would be expensive. Either way, output is identical — preemption costs time, not correctness.
Q7. Why admit no new requests in a step where you preempted?
Model answer
A preemption means you're already out of KV memory. Admitting more work in the same step would
immediately force more preemptions — thrashing. So the scheduler gates the waiting phase on "no
preemptions this step" (scheduler.py:545; mini_vllm: not out.preempted_req_ids). It lets
the system drain pressure before taking on more.
Q8. (Deep) How does speculative decoding ride this same scheduler with no special case?
Model answer
A request's num_tokens_with_spec includes proposed draft tokens, so the same num_new_tokens = num_tokens_with_spec - num_computed_tokens clamp naturally schedules the draft tokens to be
verified, and num_lookahead_tokens reserves KV slots for them in allocate_slots. Acceptance/
rejection is handled in update_from_output. The scheduler doesn't know or care that it's spec
decode — it's just "tokens to compute," exactly as the top-of-function comment promised. (Full
treatment: Phase 08.)
Rapid-fire
- Two queues? waiting (deque/priority) and running (list).
- The per-step token cap? max_num_batched_tokens → token_budget.
- The concurrent-sequence cap? max_num_seqs → len(running) limit.
- Who's scheduled first each step? Running, then waiting.
- What does update_from_output do? Append sampled tokens, advance num_computed_tokens, reap finished requests (free KV).
- A request emits a token iff? num_computed_tokens + num_scheduled == num_tokens (prefill fully caught up).
Phase 03 — Cheatsheet: Continuous Batching & Scheduler
Contents
- The one-liner
- The master model
- schedule() shape
- The four/five invariants
- Knobs (→ Phase 18)
- The Phase 02 ↔ 03 seam
- Key upstream
- Gotchas
The one-liner
Every token step, re-decide the batch: schedule RUNNING first, admit WAITING, under a token
budget + seq-slot cap. Continuous batching, chunked prefill, prefix caching, preemption all fall
out of "make num_computed_tokens catch up to num_tokens."
The master model
No prefill/decode phase. Request = (num_computed_tokens racing num_tokens).
Prefill = far behind. Decode = behind by one. (scheduler.py:330)
schedule() shape
budget = max_num_batched_tokens
# A) RUNNING: n = clamp(num_tokens - num_computed, budget, threshold);
# allocate_slots; None -> preempt running.pop(); retry; commit; budget -= n
# B) WAITING: while budget>0 and len(running)<max_num_seqs and not preempted:
# get_computed_blocks (prefix cache) -> num_computed; clamp; allocate; None -> break; admit
The four/five invariants
- a request is in exactly one of {waiting, running} while unfinished
- sum(num_scheduled_tokens) <= max_num_batched_tokens
- len(running) <= max_num_seqs
- emits a token iff num_computed + num_scheduled == num_tokens
- preempt frees KV + resets num_computed = 0 (recompute on re-admit)
Knobs (→ Phase 18)
- max_num_batched_tokens — per-step token budget (chunked prefill granularity)
- max_num_seqs — max concurrent running requests
- long_prefill_token_threshold — per-request prefill chunk cap
- enable_prefix_caching — share prefix KV across requests
- scheduling policy — FCFS vs PRIORITY (preemption victim choice)
The Phase 02 ↔ 03 seam
Scheduler decides policy; KVCacheManager is truth. allocate_slots returns None on OOM
→ scheduler preempts + retries. Scheduler never touches blocks; KV manager never sets policy.
Key upstream
- vllm/v1/core/sched/scheduler.py:329 — schedule() · :443 — preemption loop · :591 — prefix-cache head start
- scheduler.py:1283 — update_from_output
- vllm/v1/core/sched/output.py:181 — SchedulerOutput (New vs Cached request data)
- vllm/v1/request.py:315 — RequestStatus
Gotchas
- allocate_slots == None is normal control flow (drives preemption), not an error.
- Admission stops on first OOM (break); running phase retries after preempting.
- No admission in a step that preempted (avoid thrashing).
- A request longer than the whole KV cache can never fit → ignored/aborted, not deadlock.
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md
Phase 04 — The Hitchhiker's Guide to Attention Backends
← Phase 03 · Course home · Phase 05 →
Contents
- Don't Panic
- Step 1: Why attention needs a special kernel (recap Phase 2)
- Step 2: Why so many kernels?
- Step 3: The backend abstraction
- Step 4: Online softmax (the FlashAttention trick), in one picture
- The invariants to memorize
- What you'll do
Don't Panic
Attention is one mathematical operation. But there are a dozen hyper-tuned GPU kernels that compute it (FlashAttention, FlashInfer, Triton, FlashMLA, TRTLLM-GEN…), each best for some combination of hardware, model, and batch shape. vLLM hides them all behind one interface, picks the right one at startup, and feeds it the metadata it needs (the block tables from Phase 2). This phase covers that interface and that choice. Attention is usually the single hottest kernel in decode, so this is where a lot of real performance wins and bugs live.
model's Attention layer (one API)
│ q, k, v
▼
AttentionImpl (the chosen backend: FlashAttention / FlashInfer / Triton / MLA / ...)
│ + AttentionMetadata (block tables, seq lens, slot mapping)
▼
the CUDA kernel ── gathers paged KV via the block table, computes softmax(QKᵀ)V
Step 1: Why attention needs a special kernel (recap Phase 2)
A token attends to all earlier tokens, whose K/V live in scattered physical blocks
(PagedAttention). So the kernel can't just multiply two contiguous matrices — it must, per token,
look up physical_block = block_table[logical_block] and gather K/V from all over memory. It also
must write this step's new K/V to the right slot (slot_mapping). Two pieces of metadata the
scheduler/runner build and hand the kernel:
- block table — where to read prior KV (logical → physical block).
- slot mapping — where to write this step's new K/V.
Plus per-request sequence lengths so variable-length (varlen) batches pack together.
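Both maps reduce to the same index arithmetic. A minimal sketch, assuming a flat slot numbering of physical_block × block_size + offset and a hypothetical helper name (physical_slot):

```python
def physical_slot(block_table, block_size, token_pos):
    """Flat KV-cache slot for a logical token position."""
    physical_block = block_table[token_pos // block_size]
    return physical_block * block_size + token_pos % block_size


block_table = [3, 1, 7]   # logical block 0 -> physical 3, 1 -> 1, 2 -> 7
block_size = 4
seq_len = 9               # tokens 0..8 already have K/V in the cache

# Read map: where the kernel finds each prior token's K/V.
read_slots = [physical_slot(block_table, block_size, t) for t in range(seq_len)]
print(read_slots)         # [12, 13, 14, 15, 4, 5, 6, 7, 28]

# Write map: where this step's new token (position 9) stores its K/V.
slot_mapping = [physical_slot(block_table, block_size, seq_len)]
print(slot_mapping)       # [29]
```

Logically adjacent tokens 3 and 4 land in slots 15 and 4: contiguous to the model, scattered in memory.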
Step 2: Why so many kernels?
The math is fixed; the fast way to do it depends on context:
- FlashAttention — the classic: never materializes the full N×N attention matrix; streams K/V in tiles using online softmax (running max + rescale), so memory is O(N) not O(N²). Great general default.
- FlashInfer — a library specialized for serving: paged KV, prefill+decode wrappers, fast for many small/decode requests; often wins at high concurrency.
- Triton — kernels written in Triton (Python-ish DSL); portable, the fallback when a hand-tuned CUDA kernel isn't available for your case.
- FlashMLA — for MLA (Multi-head Latent Attention), DeepSeek's design that compresses KV into a low-rank latent — different KV layout, so it needs its own kernel.
- TRTLLM-GEN — NVIDIA TensorRT-LLM generated kernels, tuned for specific GPUs/precisions.
Different head dims, dtypes (fp16/bf16/fp8), features (sliding window, soft-cap, ALiBi), and hardware all shift which kernel is fastest or even available.
Step 3: The backend abstraction
vLLM factors attention into four roles (vllm/v1/attention/backend.py):
| Role | Job |
|---|---|
| Attention layer | what the model calls (q,k,v -> out); backend-agnostic |
| AttentionBackend | names the impl + metadata classes for a kernel family |
| AttentionImpl | the actual forward that runs the kernel |
| AttentionMetadataBuilder | turns SchedulerOutput into the kernel's metadata (block tables, seq lens, slot mapping) each step |
A selector (get_attn_backend, selector.py:52) picks the backend at startup from platform +
dtype + head_dim + model features, overridable with VLLM_ATTENTION_BACKEND=FLASH_ATTN|FLASHINFER|TRITON_ATTN|.... The model never changes — only which AttentionImpl is plugged in.
Step 4: Online softmax (the FlashAttention trick), in one picture
You can't hold a 1×N attention row in fast SRAM for long N. So FlashAttention streams K/V in tiles and keeps a running result, rescaling as it goes:
m_old, denom, acc = -inf, 0, 0                        # running max, denominator, accumulator
for each tile of (K,V):
    s = q·Kᵀ_tile                                     # scores for this tile
    m_new = max(m_old, max(s))                        # running max (for numerical stability)
    correction = exp(m_old - m_new)
    acc = acc*correction + exp(s - m_new) · V_tile    # rescale old, add new
    denom = denom*correction + sum(exp(s - m_new))
    m_old = m_new                                     # carry the max into the next tile
out = acc / denom
You'll implement exactly this in lab-01 (numpy, CPU) over a paged KV cache, and prove it
equals plain dense attention. That single lab demystifies FlashAttention and PagedAttention's
kernel side at once.
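Before the lab, the same loop can be run in plain Python on a stripped-down case. A sketch assuming scalar values instead of vectors and no paging (both simplifications are mine, not lab-01's setup); it checks the streaming result against the dense one for several tile sizes:

```python
import math


def dense_softmax_average(scores, values):
    """Reference: build all weights at once, then blend."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def online_softmax_average(scores, values, tile=2):
    """Streaming version: one pass, three running quantities."""
    m, denom, acc = -math.inf, 0.0, 0.0
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        m_new = max(m, max(s_tile))
        correction = math.exp(m - m_new)          # exp(-inf) == 0.0 on the first tile
        p = [math.exp(s - m_new) for s in s_tile]
        acc = acc * correction + sum(w * v for w, v in zip(p, v_tile))
        denom = denom * correction + sum(p)
        m = m_new
    return acc / denom


scores = [0.5, 3.0, -1.0, 7.0, 2.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
for tile in (1, 2, 3, 5):
    assert abs(online_softmax_average(scores, values, tile)
               - dense_softmax_average(scores, values)) < 1e-12
```

The result does not depend on the tile size, which is the property that lets a kernel pick tiles to fit SRAM without changing the answer.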
The invariants to memorize
- Attention is one op; the backend is which kernel computes it. Model code is backend-agnostic.
- The kernel needs block table (read map), slot mapping (write map), seq lens (varlen).
- Online softmax makes attention O(N) memory and is why "Flash" kernels exist.
- Backend is chosen at startup (selector) and overridable via VLLM_ATTENTION_BACKEND.
- MLA models need MLA-specific backends (different KV layout).
What you'll do
- Read: 01-deep-dive.md — the Attention layer, the backend base classes, the selector, and FlashAttentionImpl / its metadata builder, all line-anchored.
- Build: 02-mini-build.md — paged attention with online softmax in numpy.
- Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
  - lab-01-paged-attention-gather [CPU-OK] — implement online-softmax attention over a paged KV cache; prove it equals dense attention.
  - lab-02-backend-selection [GPU-OPT] — read the selector, build the (GPU, dtype, model) → backend matrix, verify with env overrides (captured output).
  - lab-03-causal-prefill-attention [CPU-OK] — the prefill kernel shape: M queries, causal loop bounds, start_pos offsets; prove chunked prefill == one-shot at the attention layer.
  - lab-04-flash-decoding-partitions [CPU-OK] — split-KV decode: attention state as a mergeable (max, denom, acc) triple; equality with dense for any partition count/order.
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
Phase 04 — Deep Dive: the attention backend system
Paths relative to upstream/ at v0.22.1 @ 0decac0. The attention stack lives across:
- vllm/model_executor/layers/attention/attention.py: the Attention nn.Module (model-facing)
- vllm/v1/attention/backend.py: AttentionBackend / AttentionImpl / AttentionMetadataBuilder base classes
- vllm/v1/attention/selector.py: get_attn_backend (the picker)
- vllm/v1/attention/backends/flash_attn.py: a complete backend, end to end
- vllm/v1/attention/backends/{flashinfer,triton_attn,mla/}.py: other families
- vllm/v1/attention/backends/registry.py: name -> backend mapping
Contents
- 1. The model-facing Attention layer
- 2. The base classes: backend.py
- 3. A complete backend: FlashAttention
- 4. The selector: who picks the backend
- 5. MLA — when the KV layout itself changes
- Reading checklist
1. The model-facing Attention layer
vllm/model_executor/layers/attention/attention.py:177 — class Attention(nn.Module, AttentionLayerBase). This is what LlamaAttention.forward called in Phase 0 (self.attn(q, k, v)). Its __init__ (:189) resolves the backend (via the selector) and instantiates an
AttentionImpl; its forward (:437) hands q,k,v to that impl. The model talks only to this
class — it never knows which kernel runs. That decoupling is the whole point: swap the kernel,
the model is untouched.
2. The base classes: backend.py
vllm/v1/attention/backend.py defines the contract every kernel family implements:
- AttentionBackend — static methods naming the impl class, the metadata class, supported head sizes/dtypes, and the KV cache shape.
- AttentionImpl — the forward(q, k, v, kv_cache, attn_metadata) -> out that runs the kernel (writes new K/V to the cache via slot_mapping, reads prior KV via the block table).
- AttentionMetadataBuilder — build(...) turns the per-step scheduler info (sequence lengths, block tables, slot mapping) into the typed metadata the kernel wants.
This three-part split (Backend names it, Impl runs it, Builder feeds it) repeats across every backend file.
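The shape of that split can be mimicked in a few dozen lines. This is a toy, not the upstream interface: every class, attribute, and signature below is hypothetical, the "kernel" skips the softmax entirely, and the KV cache is a plain dict keyed by flat slot number. It only shows who names, who feeds, and who runs:

```python
from dataclasses import dataclass


@dataclass
class ToyMetadata:
    block_tables: list      # per request: logical -> physical block ids
    seq_lens: list          # per request: token count including this step's token
    slot_mapping: list      # per request: flat slot for this step's new K/V


class ToyMetadataBuilder:
    """Builder: per-step scheduler info -> typed metadata."""

    def __init__(self, block_size):
        self.block_size = block_size

    def build(self, block_tables, seq_lens):
        slots = []
        for table, n in zip(block_tables, seq_lens):
            pos = n - 1                                   # position of the new token
            slots.append(table[pos // self.block_size] * self.block_size
                         + pos % self.block_size)
        return ToyMetadata(block_tables, seq_lens, slots)


class ToyImpl:
    """Impl: the forward that would launch the kernel (attention math omitted)."""

    def __init__(self, block_size):
        self.block_size = block_size

    def forward(self, k, v, kv_cache, md):
        gathered = []
        for i, slot in enumerate(md.slot_mapping):
            kv_cache[slot] = (k[i], v[i])                 # write map: slot_mapping
            table, n = md.block_tables[i], md.seq_lens[i]
            gathered.append([                             # read map: block table
                kv_cache[table[t // self.block_size] * self.block_size
                         + t % self.block_size]
                for t in range(n)
            ])
        return gathered


class ToyBackend:
    """Backend: names the classes that belong to one kernel family."""
    name = "TOY"
    impl_cls = ToyImpl
    builder_cls = ToyMetadataBuilder


builder = ToyBackend.builder_cls(block_size=4)
impl = ToyBackend.impl_cls(block_size=4)
kv_cache = {}
step1 = impl.forward(["k0"], ["v0"], kv_cache, builder.build([[5]], [1]))
step2 = impl.forward(["k1"], ["v1"], kv_cache, builder.build([[5]], [2]))
print(step2)   # [[('k0', 'v0'), ('k1', 'v1')]]
```

Swapping ToyBackend for another class with the same three attributes would change the "kernel" without touching the caller, which is the decoupling the real base classes exist to provide.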
3. A complete backend: FlashAttention
vllm/v1/attention/backends/flash_attn.py:
- class FlashAttentionBackend(AttentionBackend) (:68) — the registry entry; declares the impl, metadata, and supported configs.
- class FlashAttentionMetadata (:223) — the per-step data the kernel needs (block table, seq lens, slot mapping, scheduling for varlen).
- class FlashAttentionMetadataBuilder(AttentionMetadataBuilder[...]) (:276) — builds that metadata from the model runner's inputs each step. This is the bridge from Phases 2/3 to the kernel: the block tables you allocated and the scheduled token counts become kernel arguments here.
- class FlashAttentionImpl(AttentionImpl) (:592) — forward calls the FlashAttention CUDA kernel (via vllm-flash-attn/flash-attn), passing the paged KV cache + metadata.
Read FlashAttentionImpl.forward and find where it (a) writes the new k,v into the KV cache using
slot_mapping, and (b) calls the varlen flash-attn function with the block table. Those two calls
are the read/write maps from the guide, live.
4. The selector: who picks the backend
vllm/v1/attention/selector.py:52 — def get_attn_backend(...). It considers the platform
(current_platform, Phase 17), dtype, head size, whether the model uses MLA / sliding window,
and the VLLM_ATTENTION_BACKEND env override, then returns the backend class. _cached_get_attn_backend
(:106) memoizes it. The platform files (vllm/platforms/cuda.py, rocm.py, cpu.py) provide
the per-hardware default — which is why the same model picks FlashAttention on an A100, a Triton
or FlashInfer path elsewhere, and a CPU kernel on a laptop (Phase 17).
5. MLA — when the KV layout itself changes
vllm/v1/attention/backends/mla/ holds the MLA backends. MLA (DeepSeek) compresses K/V into a
low-rank latent vector, so the KV cache stores something different and needs its own kernel
(FlashMLA). This is why "add a model" (Phase 14) sometimes means "wire up a different attention
backend" — the model's attention design dictates the KV layout dictates the kernel.
Reading checklist
- Attention.forward — what does the model pass, and what does it NOT know?
- The three base classes in backend.py — Backend vs Impl vs MetadataBuilder.
- In FlashAttentionMetadataBuilder.build — which Phase 2/3 outputs become kernel metadata?
- In FlashAttentionImpl.forward — find the KV write (slot_mapping) and the paged read (block table).
- get_attn_backend — name three factors that change the chosen backend.
Now build it: 02-mini-build.md, then the labs.
Phase 04 — Mini-Build: paged attention with online softmax
You'll implement the heart of a "Flash"-style attention kernel in numpy — online softmax over a paged KV cache — and prove it equals plain dense attention. This single build demystifies both FlashAttention (the streaming softmax) and PagedAttention's kernel side (the block-table gather) at once.
Contents
- The task (lab-01)
- The online-softmax recurrence (from the guide)
- Definition of done
- Map to the real engine
The task (lab-01)
Given:
- a query vector q (one decode step, one head): shape (d,),
- a paged KV cache k_cache, v_cache: shape (num_blocks, block_size, d),
- a block_table: list[int] mapping logical→physical block,
- a seq_len (valid tokens),
compute attention(q) = softmax(q·Kᵀ / √d) · V, where K/V are gathered through the block
table (token t lives at block_table[t // block_size], offset t % block_size), using the
online softmax recurrence (running max + rescale) so you never build the full score vector.
Implement two functions and show they match:
- dense_attention(q, K, V) — the reference (build all scores, softmax, weighted sum).
- paged_online_attention(q, k_cache, v_cache, block_table, seq_len) — block-table gather + online softmax, processed block by block.
The online-softmax recurrence (from the guide)
m, denom, acc = -inf, 0, zeros(d)
for each block (gathered via block_table, up to seq_len):
    s = (q · Kblockᵀ) / sqrt(d)          # scores for this block's tokens
    m_new = max(m, s.max())
    corr = exp(m - m_new)
    acc = acc*corr + (exp(s - m_new) @ Vblock)
    denom = denom*corr + exp(s - m_new).sum()
    m = m_new
return acc / denom
Definition of done
pytest phase-04-attention-backends/labs -q
The test asserts paged_online_attention ≈ dense_attention within tolerance, for non-block-aligned
seq_len (so you handle the partial last block), and that scattering the logical blocks to
arbitrary physical ids doesn't change the result (that's the whole point of paging).
Map to the real engine
| your numpy | real vLLM |
|---|---|
| block_table gather | the block table fed to FlashAttentionImpl (flash_attn.py:592) |
| online softmax | the FlashAttention/FlashInfer kernels |
| seq_len partial block | varlen handling in the metadata builder (flash_attn.py:276) |
| dense reference | what a naive (pre-Flash) kernel did, O(N²) memory |
Phase 04 Labs — Attention Backends
Four labs that take you inside the kernels the scheduler commands. The arc: build the decode kernel's algorithm (lab-01), widen it to the prefill shape with causal bounds (lab-03), parallelize it with the mergeable-state trick (lab-04), then step back and map the stable of production backends and the selector that picks between them (lab-02).
Recommended order: 01 → 03 → 04 → 02. (Directory numbers predate labs 03–04: algorithm
first, then its two extensions, then the dispatcher.) CPU labs follow the standard
contract — starter.py (your work), solution.py (reference), test_lab.py (the spec);
default runs the solution, LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-04-attention-backends/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-04-attention-backends/labs/lab-01-paged-attention-gather -q
Contents
- lab-01-paged-attention-gather [CPU-OK]
- lab-02-backend-selection [GPU-OPT]
- lab-03-causal-prefill-attention [CPU-OK]
- lab-04-flash-decoding-partitions [CPU-OK]
- What you can do after this phase
Labs
lab-01-paged-attention-gather [CPU-OK]
The fusion lab: online-softmax (FlashAttention's running max / denominator / accumulator
recurrence) over a paged KV cache (PagedAttention's block-table gather), in ~25 lines of
numpy, proven equal to dense attention — including the partial-last-block bound and the
m = −inf first-block edge. This is the semantics of paged_attention_v1.cu, and the
foundation labs 03 and 04 build on. Skills: the recurrence and why it's exact; the
rescaling correction factor; mapping your variables onto the CUDA kernel's.
lab-02-backend-selection [GPU-OPT]
Run the selector, override it (VLLM_ATTENTION_BACKEND), read get_attn_backend
(selector.py:52), and build the (GPU, dtype, model) → backend matrix — including why MLA
models force a backend while sliding windows merely filter candidates. Captured output
included for the GPU-less. Skills: the two-run kernel-bisection habit; backends differ
in the last ulp legitimately; why selection is startup-time configuration.
lab-03-causal-prefill-attention [CPU-OK]
The prefill shape: M queries starting at start_pos, each attending over exactly its
causal prefix — where the mask degenerates into a loop bound and chunked prefill becomes
just "queries that don't start at zero." The payoff test proves chunked ≡ one-shot in
attention outputs (Phase 3 lab-02's theorem, at the layer that enforces it), and a
poisoned-future test makes causality violations deafening. Skills: decode vs prefill as
loop shapes; start_pos/query_start_loc metadata; why prefill is compute-bound in this
very loop nest.
lab-04-flash-decoding-partitions [CPU-OK]
The parallelism lab: attention state compresses to a mergeable (max, denom, unnormalized-acc) triple, so a 128k-token decode can be split across partitions computed
independently and merged exactly — any partition count, any merge order, any tree shape,
all 1e-12-equal to dense. This is paged_attention_v2, flash-decoding, FlashInfer
split-k, and (stretched across GPUs) Phase 10's context parallelism. Skills: the
attention monoid; never normalize a partial; why long-context decode is where backends
differ.
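The merge itself is small enough to show in full. A minimal sketch, assuming scalar values instead of vectors and hypothetical function names (partial_state, merge); it checks that two different merge orders both reproduce the single-pass answer:

```python
import math


def partial_state(scores, values):
    """Unnormalized attention state for one partition: (max, denom, acc)."""
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    return m, sum(p), sum(w * v for w, v in zip(p, values))


def merge(a, b):
    """Combine two partial states by rescaling both to the shared max."""
    (ma, da, acca), (mb, db, accb) = a, b
    m = max(ma, mb)
    ca, cb = math.exp(ma - m), math.exp(mb - m)
    return m, da * ca + db * cb, acca * ca + accb * cb


scores = [0.3, 4.0, -2.0, 1.5, 6.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

_, full_denom, full_acc = partial_state(scores, values)   # one pass over everything
left = partial_state(scores[:2], values[:2])
middle = partial_state(scores[2:5], values[2:5])
right = partial_state(scores[5:], values[5:])

_, d1, acc1 = merge(merge(left, middle), right)           # left-to-right
_, d2, acc2 = merge(right, merge(middle, left))           # a different order and shape

assert abs(acc1 / d1 - full_acc / full_denom) < 1e-12
assert abs(acc2 / d2 - full_acc / full_denom) < 1e-12
```

Normalization (acc / denom) happens exactly once, after the last merge; dividing a partial early would throw away the scale information the merge needs.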
What you can do after this phase
Read any attention backend in vllm/v1/attention/backends/ and find the three things that
are always there: the streaming recurrence (lab-01), the shape/metadata handling for
prefill vs decode (lab-03), and the reduction strategy (lab-04). Diagnose a kernel
suspicion with the backend-override bisection (lab-02), predict which backend a deployment
runs before it starts, and explain to a colleague why paged + flash + split-KV compose
without approximation. Phase 5 freezes these kernels into CUDA graphs; Phase 7 goes below
them into GEMMs.
Lab 04-01 — Paged Attention with Online Softmax [CPU-OK]
This lab is where the two most important kernel ideas in LLM inference fuse into one
function. From PagedAttention (Phase 2): K/V live in scattered physical blocks, reached
through a block table. From FlashAttention: you never materialize the full score row —
you stream the keys and maintain a running softmax. Put them together and you have, in
~25 lines of numpy, the algorithm at the heart of every decode kernel vLLM ships:
paged_attention_v1.cu, the Triton fallbacks, FlashInfer's decode path. When the tests
pass, you don't "know about" these kernels anymore — you've written their semantics.
Did Phase 2 lab-06 already? Good — that was the gather with ordinary softmax. This lab replaces the softmax with the online recurrence, the part that makes the streaming exact. Different load-bearing idea, same scaffolding, deliberately.
Contents
- Why this lab exists
- Background: the recurrence
- Files
- Run
- What to implement
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Naive attention computes all N scores, softmaxes the row, then blends N value rows.
That's three passes over data that, for a long context, doesn't fit in any fast memory —
on a GPU it means writing an O(N) score row to HBM and reading it back, twice, in the
hottest loop of the entire system. FlashAttention's insight is that softmax can be
computed in one streaming pass with O(1) extra state, if you're willing to rescale
history every time you discover a new maximum. That rescaling trick — three running
quantities and a correction factor — is the single most important piece of kernel math in
this field, and the only way to actually own it is to implement it and watch it match the
naive answer to 1e-6 on inputs where a wrong correction factor would diverge wildly.
The phase needs this lab as its foundation: lab-03 runs this recurrence per query row (prefill), lab-04 proves it's a mergeable monoid (flash-decoding), and the deep-dive's tour of real backends assumes you can see this loop inside every one of them.
Background: the recurrence
You hold three things while streaming key blocks: m (max score so far), denom (sum of
exp(score − m) so far), acc (sum of exp(score − m) · v so far — unnormalized). For
each new block with scores s:
m_new = max(m, max(s))
corr = exp(m − m_new) # how much history shrinks under the new max
p = exp(s − m_new) # new block's weights, on the new scale
acc = acc · corr + p @ V_block
denom = denom · corr + sum(p)
m = m_new
Final answer: acc / denom. Why it's exact (not an approximation): every exp(s_i) you
ever wanted appears in the final sums multiplied by exp(−m_final) — the corrections
compose so each term is rescaled from whatever max it was added under to the final max.
It's a telescoping product, and the only thing subtraction-by-max changes is overflow
behavior, never the ratio. The same algebra is why the state merges across partitions
in lab-04 — write it out once by hand for two blocks and the whole phase unlocks.
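For two blocks the telescoping can be written out directly. The symbols here are introduced only for this illustration: block 1 has scores $a_i$ and values $u_i$, block 2 has scores $b_j$ and values $w_j$, with $m_1 = \max_i a_i$ and $m_2 = \max(m_1, \max_j b_j)$, so that $\mathrm{corr} = e^{m_1 - m_2}$.

```latex
\begin{aligned}
\text{after block 1:}\quad
  \mathrm{denom}_1 &= \sum_i e^{a_i - m_1}, \qquad
  \mathrm{acc}_1 = \sum_i e^{a_i - m_1}\, u_i \\
\text{after block 2:}\quad
  \mathrm{denom}_2 &= e^{m_1 - m_2}\,\mathrm{denom}_1 + \sum_j e^{b_j - m_2}
    = e^{-m_2}\Big(\sum_i e^{a_i} + \sum_j e^{b_j}\Big) \\
  \mathrm{acc}_2 &= e^{m_1 - m_2}\,\mathrm{acc}_1 + \sum_j e^{b_j - m_2}\, w_j
    = e^{-m_2}\Big(\sum_i e^{a_i} u_i + \sum_j e^{b_j} w_j\Big) \\
\frac{\mathrm{acc}_2}{\mathrm{denom}_2}
  &= \frac{\sum_i e^{a_i} u_i + \sum_j e^{b_j} w_j}{\sum_i e^{a_i} + \sum_j e^{b_j}}
\end{aligned}
```

The common factor $e^{-m_2}$ cancels in the ratio, leaving the dense softmax-weighted average over both blocks. Adding a third block repeats the same step with $m_3$ in place of $m_2$.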
The paged part you know from Phase 2: token t is at
k_cache[block_table[t // block_size], t % block_size], and the last block of a sequence
is usually partial — read only seq_len − start rows of it.
Files
- starter.py — dense_attention (the slow truth) and paged_online_attention (the streaming, gathered version). Your work.
- solution.py — reference.
- test_lab.py — equality with dense for aligned and ragged lengths, and paging invariance.
Run
LAB_IMPL=starter pytest phase-04-attention-backends/labs/lab-01-paged-attention-gather -q
pytest phase-04-attention-backends/labs/lab-01-paged-attention-gather -q # reference
What to implement
Write dense_attention first and convince yourself it's correct — it's your oracle, and
the entire discipline of kernel work is never port what you haven't proven slow. Then
the streaming version per the recurrence above, iterating logical blocks that cover
[0, seq_len). The two classic stumbles, both covered by tests:
- The first-block edge: m starts at −inf, so corr = exp(−inf − m_new) must come out as 0, not NaN. With a finite m_new, IEEE arithmetic already gives exp(−inf) = 0; the NaN appears only when m_new is itself −inf (for example, a block whose scores are all −inf), because −inf − (−inf) is NaN. Guard it (the solution branches on m != -inf).
- The partial last block: valid = min(block_size, seq_len − start). Read one row too many and you're attending over uninitialized cache — the bug that "almost works" (Phase 2 lab-06 poisoned the padding to make this loud; here the random zeros are quiet but the 1e-6 equality still catches it).
What the tests prove
| Test | What it pins |
|---|---|
| test_matches_dense_block_aligned | The recurrence itself: 16 tokens, 4 scattered blocks ([3, 1, 7, 0]), equal to dense within 1e-6. A wrong corr doesn't fail subtly — softmax weights are exponential in the error, so divergence is loud |
| test_matches_dense_partial_last_block | 13 tokens = 3 full + 1 single-token block: the valid bound |
| test_paging_invariance | Same logical sequence at physical placements [0,1,2] vs [7,3,5] → identical output. The block table is the only coupling between logical and physical — Phase 2's identity theorem, restated where the math happens |
Hitchhiker's notes
- Map your variables to the CUDA kernel: in paged_attention_v1.cu, your m is qk_max (computed via warp/block reductions instead of max()), your denom is exp_sum, your acc lives in registers as accs, and your gather is the block_table-indexed pointer arithmetic in the main loop. Read the kernel right after finishing: it's ~400 lines of which you now understand the load-bearing 40; the rest is vectorized loads, shared-memory staging, and reduction plumbing (the "95% performance engineering" of Phase 2 lab-04).
- Why subtract the max at all, again? exp(90) overflows float32. Logit ~90 is not exotic: it's a confident model with a sharp head. Unprotected softmax is a NaN factory; subtraction-by-max makes every exponent ≤ 0. The online version just maintains that protection without knowing the max in advance, and that's the whole cleverness.
- One query here, many heads in reality: real decode runs this once per (sequence, KV-head) with the query being that head's slice. Heads are embarrassingly parallel and share nothing (Phase 2 lab-06's per-head loop). GQA means several query heads stream the same K/V blocks: bandwidth amortization inside the kernel, one more reason GQA wins (Phase 0 lab-02).
- Numerics note for the tests' 1e-6: float64 throughout, so the tolerance is generous; it's calibrated to catch algorithmic error (a missing corr, an off-by-one), not rounding. In fp16 kernels the same comparison runs at 1e-2 with fp32 accumulators (Phase 2 lab-04's gate); the tolerance always encodes what you're testing for.
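The overflow claim in the notes above can be checked in a few lines. This is a small illustration with made-up logits, not part of the lab's tests: it runs naive and max-shifted softmax in float32 and shows where the NaN comes from.

```python
import numpy as np

logits = np.array([90.0, 89.0, 10.0], dtype=np.float32)

# Naive softmax: exp(90) and exp(89) exceed float32's max (~3.4e38) and become inf.
with np.errstate(over="ignore", invalid="ignore"):
    e = np.exp(logits)
    naive = e / e.sum()          # inf / inf -> nan

# Subtract the max first: every exponent is <= 0, so exp() stays in (0, 1].
shifted = np.exp(logits - logits.max())
stable = shifted / shifted.sum()

print(naive)
print(stable)
```

The shifted version gives the same distribution mathematically, because the common factor exp(−max) cancels between numerator and denominator.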
Going further
- Hand-trace two blocks with two tokens each on paper, with block 2's max larger than block 1's, and watch corr shrink the history. Then once with block 2's max smaller: watch corr = 1 and the history stay unscaled. The recurrence has exactly these two behaviors.
- Delete the corr factor and run the tests: the aligned test fails with weights skewed toward earlier blocks, because the history is never shrunk when a larger max arrives. Now you know this failure's signature, which is useful the day you review a kernel PR that gets it almost right.
- Batch it: take a list of (q, block_table, seq_len) and loop. You've built paged_attention_v1's grid (one program per sequence per head). Then go to lab-04 to split within a sequence, and lab-03 to widen to query chunks.
References
- Milakov & Gimelshein, Online normalizer calculation for softmax (2018) — the recurrence, 3 readable pages: https://arxiv.org/abs/1805.02867
- Dao et al., FlashAttention (2022) — the recurrence + tiling + IO analysis: https://arxiv.org/abs/2205.14135
- upstream/csrc/attention/paged_attention_v1.cu (qk_max, exp_sum, the gather): your function in CUDA.
- upstream/vllm/v1/attention/backends/flash_attn.py:592, where the real engine hands block tables and slot mappings to the kernel (find both; Phase 2 lab-06 explains the write side).
- 02-mini-build.md: the recurrence derived step by step.
Lab 04-02 — Backend Selection Matrix [GPU-OPT]
vLLM doesn't have an attention kernel; it has a stable of them — FlashAttention, FlashInfer, Triton, FlashMLA, TRTLLM-GEN, per-platform CPU/ROCm/TPU variants — and a selector that picks one at startup based on your GPU, dtype, model architecture, and features. That choice is invisible when it's right and bewildering when it's wrong, and "wrong" here means anything from a 20% throughput gap to a crash on an exotic head size. In this lab you run the selector, override it, read its source, and build the (GPU, dtype, model) → backend table that lets you answer — from memory, in an incident — "which kernel is this deployment actually running, and what else could it run?"
No GPU? Don't panic. The captured output below is the experiment; the selection logic (selector.py:52) is the lesson, and it reads the same on a laptop.
Contents
- Why this lab exists
- Background: why so many backends
- Requirements
- Steps
- Captured output (real run, L4, vLLM 0.22.1, trimmed)
- Build the matrix (your deliverable)
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Every component you've studied so far had one implementation. Attention is where vLLM
becomes a dispatcher, and dispatchers are where production surprises live: the same
model, same config, same vLLM version runs different kernels on an A100 vs an H100 vs
an RTX 4090 — different performance, different numerics in the last ulp, occasionally
different bugs. When a user reports "works on my machine, garbage on the cluster," the
backend matrix is the first thing a maintainer checks, and VLLM_ATTENTION_BACKEND is the
first bisection tool they reach for. This lab is that reflex, installed.
It's also your map for the rest of the phase: the deep-dive walks the backends' implementations; this lab establishes which of them you're ever actually running and what forces the exceptions (MLA models, head sizes, dtypes, platforms).
Background: why so many backends
Because "attention" is several workloads wearing one name, and the optimal kernel differs per (shape × hardware × feature):
- FlashAttention (FA2/FA3) — the battle-tested default for standard transformers on NVIDIA; hand-tuned prefill and decode paths, broad feature support. FA3 exploits Hopper-specific hardware (TMA, warpgroup MMA), which is why the GPU generation enters the selector.
- FlashInfer — plan-based kernels with strengths vLLM's defaults lack in places: cascade/shared-prefix attention (lab-04's merge!), aggressive split-k, customizable masking. Often the win for high-concurrency or shared-prefix workloads — measure, don't assume (Phase 18).
- Triton backend — portable, readable, JIT-compiled; the fallback when the hand-written kernels lack your head size/feature combo, and the reference implementation you can actually modify (it's the closest production cousin of your lab-01 code).
- FlashMLA / TRTLLM-GEN — DeepSeek-style MLA models compress KV into a low-rank latent; the cache layout itself is different, so standard kernels can't read it at all. Architecture doesn't just prefer a backend — it can force one.
- Platform backends (CPU, ROCm, TPU — Phase 17) — different ISAs entirely.
The selector (get_attn_backend, upstream/vllm/v1/attention/selector.py:52) resolves:
explicit override → platform default chain → capability checks (dtype, head size,
sliding window, MLA) → fallback. Selection happens once, at startup — the backend's
metadata builder and CUDA-graph shapes (Phase 5) are baked for the engine's lifetime.
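The resolution order described above can be modelled in a few lines. This is a toy sketch, not get_attn_backend: the Config class, the capability table, its head sizes, and both chains are invented for illustration. Only the order of the steps is taken from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Invented capability table for illustration; not vLLM's real support matrix.
CAPABILITIES = {
    "FLASH_ATTN":  {"mla": False, "head_sizes": {64, 128, 256}},
    "TRITON_ATTN": {"mla": False, "head_sizes": None},   # None: any head size
    "FLASHMLA":    {"mla": True,  "head_sizes": None},
}
PLATFORM_CHAIN = ["FLASH_ATTN", "TRITON_ATTN"]   # hypothetical default order
MLA_CHAIN = ["FLASHMLA"]

@dataclass(frozen=True)
class Config:
    head_size: int
    use_mla: bool = False
    override: Optional[str] = None     # stands in for VLLM_ATTENTION_BACKEND

def supports(name: str, cfg: Config) -> bool:
    caps = CAPABILITIES.get(name)
    if caps is None or caps["mla"] != cfg.use_mla:
        return False
    return caps["head_sizes"] is None or cfg.head_size in caps["head_sizes"]

def select_backend(cfg: Config) -> str:
    if cfg.override is not None:                 # 1. explicit override
        if not supports(cfg.override, cfg):
            raise ValueError(f"{cfg.override} cannot serve {cfg}")
        return cfg.override
    chain = MLA_CHAIN if cfg.use_mla else PLATFORM_CHAIN   # 2. platform default chain
    for name in chain:                           # 3. capability checks, in order
        if supports(name, cfg):
            return name                          # 4. later entries act as the fallback
    raise ValueError(f"no backend supports {cfg}")

print(select_backend(Config(head_size=128)))
print(select_backend(Config(head_size=96)))
print(select_backend(Config(head_size=128, use_mla=True)))
```

In this toy, an override that cannot serve the configuration raises instead of silently falling back. Whether the real selector behaves that way for a given combination is something to confirm by running the invalid-override step of this lab.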
Requirements
uv pip install -e ".[vllm]"
Steps
- Let vLLM pick (read the startup line naming the backend):
python -c "from vllm import LLM; LLM(model='facebook/opt-125m', gpu_memory_utilization=0.4)"
- Force alternatives and confirm the engine obeys:
VLLM_ATTENTION_BACKEND=FLASHINFER python -c "from vllm import LLM; LLM(model='facebook/opt-125m', gpu_memory_utilization=0.4)"
VLLM_ATTENTION_BACKEND=TRITON_ATTN python -c "from vllm import LLM; LLM(model='facebook/opt-125m', gpu_memory_utilization=0.4)"
Also try forcing something invalid for your setup (e.g. FLASHMLA on a non-MLA
model) and read the error — the selector's failure messages are part of its interface,
and you want to have seen them before an incident shows them to you.
- Read the source next to the log: selector.py:52 (get_attn_backend) and the platform default chain in upstream/vllm/platforms/cuda.py. For your GPU + dtype + two or three models, predict the choice before running; the lab is passed when your predictions stop missing.
Captured output (real run, L4, vLLM 0.22.1, trimmed)
# default:
INFO ... Using Flash Attention backend.
# VLLM_ATTENTION_BACKEND=FLASHINFER:
INFO ... Using FlashInfer backend.
# VLLM_ATTENTION_BACKEND=TRITON_ATTN:
INFO ... Using Triton backend.
# a DeepSeek (MLA) model, default:
INFO ... Using FlashMLA backend. # MLA models force an MLA backend (different KV layout)
One line, easily scrolled past — but it names the code that will execute the hottest loop of the deployment several thousand times per second. Operators should log-grep for it on every rollout; version upgrades do change defaults, silently (selection logic and backend names both drift across releases — anchor on the mechanism, not the strings).
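A rollout check for that line might look like the sketch below. The regular expression is written against the captured lines above and nothing else; since the strings drift across releases, treat the pattern as an assumption to re-verify per version. The helper names are hypothetical.

```python
import re
from typing import Iterable, Optional

# Matches the startup lines captured above; the wording is version-specific.
BACKEND_LINE = re.compile(r"Using (?P<name>.+?) backend\b")

def backend_from_log(lines: Iterable[str]) -> Optional[str]:
    """Return the backend named by the first matching startup line, else None."""
    for line in lines:
        match = BACKEND_LINE.search(line)
        if match:
            return match.group("name")
    return None

def check_rollout(lines: Iterable[str], expected: str) -> bool:
    """True when the log names the backend the operator expected."""
    return backend_from_log(lines) == expected

sample = [
    "(an unrelated line)",
    "INFO ... Using Flash Attention backend.",
]
print(backend_from_log(sample))
print(check_rollout(sample, "FlashInfer"))
```

Returning None rather than raising lets the caller decide whether a missing line is itself an alert, which it probably should be after an upgrade.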
Build the matrix (your deliverable)
| GPU | dtype | model feature | chosen backend | why |
|---|---|---|---|---|
| A100/L4 | bf16 | standard | FlashAttention | hand-tuned default for Ampere+ |
| H100 | bf16 | standard | FlashAttention (FA3 path) | Hopper-specific kernels |
| any | any | MLA (DeepSeek) | FlashMLA | latent KV layout — standard kernels can't read it |
| any | any | override set | (the override) | VLLM_ATTENTION_BACKEND wins over everything |
| any | any | unsupported head size | Triton fallback | JIT covers shapes hand-written kernels skip |
| CPU | fp32 | standard | the CPU backend | no CUDA; platform chain (Phase 17) |
Extend it with what your hardware shows — the table above is the skeleton; the rows you add from your own runs are the ones you'll remember.
Hitchhiker's notes
- The override is a bisection tool, not a tuning knob. Mystery garbage output? Flip to TRITON_ATTN: if the garbage persists, it's not the kernel (look at sampling, weights, tokenizer); if it disappears, you've isolated a kernel bug and your issue report writes itself ("FA path wrong for head_size=96 + sliding window; Triton correct"). This two-run dance is the single highest-value habit this lab teaches.
- Backends differ in the last ulp, legitimately. Different tiling = different reduction order = bitwise-different logits (Phase 3 lab-02's softening, kernel edition). Greedy outputs can diverge after enough tokens with no bug anywhere. Don't file that issue; do mention it when comparing backends in evals.
- Why startup-time selection rather than per-request? The backend brings its own metadata builder (the FlashAttentionMetadata of lab-03) and its kernels are baked into CUDA-graph captures (Phase 5); swapping per request would mean re-capturing graphs and rebuilding paged-cache layouts mid-flight. Selection is configuration, not scheduling.
- Capability gaps are normal, not shameful: a brand-new model with head_dim 96, or fp8 KV + sliding window, may be outside the fast path's support matrix and silently fall back to Triton, which is correct but slower. When throughput regresses after a model swap, check the backend line first; the model may have changed your kernel.
Reflect
- Your p99-latency-sensitive service runs long-context decode on H100s. Name two backend experiments worth running before touching any other knob, and what you'd measure. (FlashInfer split-k vs FA3 at your concurrency, ITL distributions — lab-04 explains why long decode is where they differ; Phase 18 gives the harness.)
- Why does an MLA model force the backend while sliding-window merely filters candidates? (MLA changes the cache's data layout — incompatible storage; sliding window is a mask variation several backends implement — a feature flag, not a format.)
- The selector consults the platform (cuda.py, rocm.py, cpu.py…) before capability checks. Sketch how a new accelerator vendor slots in without touching the selector: that's Phase 17's plugin architecture, and the reason the chain is shaped this way.
References
- upstream/vllm/v1/attention/selector.py:52 (get_attn_backend): the dispatcher.
- upstream/vllm/platforms/cuda.py: the NVIDIA default chain the selector consults.
- upstream/vllm/v1/attention/backends/: the stable itself; skim each file's class docstring and you've got the cast list for the deep-dive.
- vLLM docs, Engine Arguments / environment variables (VLLM_ATTENTION_BACKEND and friends): https://docs.vllm.ai/en/latest/serving/engine_args.html
- Ye et al., FlashInfer (2025), what the alternative brings: https://arxiv.org/abs/2501.01005
- Dao, FlashAttention-2/3 — what the default brings: https://arxiv.org/abs/2307.08691, https://arxiv.org/abs/2407.08608
Lab 04-03 — Causal Prefill Attention over a Paged Cache [CPU-OK]
Lab-01 gave you the decode kernel shape: one query, N keys. But every token in that
cache got there through the other shape — prefill: M queries at once (a prompt, or a
chunk of one), each allowed to see only its own past. In this lab you build the prefill
shape on top of your lab-01 recurrence, with the two ingredients that make it interesting:
the causal mask (query i attends to positions 0..start_pos+i, nothing later) and
the start_pos offset that makes chunked prefill possible at the kernel level. The
payoff test proves, in attention outputs rather than scheduler bookkeeping, the invariant
Phase 3 lab-02 promised you: prefilling in chunks computes exactly what one-shot prefill
computes.
Contents
- Why this lab exists
- Background: one mechanism, two shapes
- Files
- Run
- What to implement
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Every attention backend in vLLM ships (at least) two code paths, and PRs routinely touch
one and break the other. If you've only ever written the decode path, prefill kernels read
as a wall of index arithmetic: why does the mask depend on a start offset? why does the
kernel receive query_start_loc arrays? what exactly must hold for a chunk computed today
to splice seamlessly with a chunk computed three steps ago? This lab makes you derive all
three answers, because you need them to make four tests pass.
It also closes a loop the course opened two phases ago. Phase 3 proved "chunking changes
when, never what" behaviorally — same tokens out of the engine. But that proof leaned on
the kernel doing its part: a query at absolute position 7, computed in a chunk that starts
at position 5, must attend over tokens 0–7 exactly as it would have in a one-shot
prefill. That's a property of the attention math plus the cache, and here you verify it at
the layer where it actually lives — test_chunked_equals_one_shot is Phase 3 lab-02
restated in linear algebra.
Background: one mechanism, two shapes
The contract when a prefill chunk runs (this ordering is upstream's, and Phase 2 lab-06's):
- The runner writes the chunk's K/V first: slot_mapping scatters rows for positions start_pos..start_pos+M−1 into the paged cache. So by the time attention runs, the cache holds tokens 0..start_pos+M−1: everything each query may legally see.
- The kernel then computes, for each query row i (absolute position start_pos+i), attention over the causal prefix [0, start_pos+i], gathered through the block table and streamed with online softmax: exactly your lab-01 loop with a per-query length.
Note what the causal mask is not: a -inf matrix you materialize. In a streaming kernel
the mask degenerates into a loop bound — you simply stop reading keys at the query's
own position. (Real kernels processing key tiles need the mask only for the one diagonal
tile where queries and keys overlap; every earlier tile is all-visible, every later tile is
skipped entirely. "The mask is mostly a loop bound" is why causal attention costs half of
bidirectional, not the same with masking overhead.)
And start_pos is the entire kernel-side story of chunked prefill: a chunk is just a
prefill whose queries don't start at zero. No special "resume" state — the cache is the
state, which is the same insight (the counter/cache is the resume mechanism) you've now
met in the scheduler (Phase 3), in preemption recovery (Phase 3 lab-04), and here in the
kernel.
Files
- starter.py: dense_causal_attention (the reference) and paged_causal_prefill_attention (the paged, online-softmax version). Your work.
- solution.py: reference; note how it reuses the lab-01 recurrence as an inner function, so the decode kernel is literally a sub-case.
- test_lab.py: full prefill, mid-sequence chunk, chunked ≡ one-shot, and the poisoned-future causality test.
Run
LAB_IMPL=starter pytest phase-04-attention-backends/labs/lab-03-causal-prefill-attention -q
pytest phase-04-attention-backends/labs/lab-03-causal-prefill-attention -q # reference
What to implement
Two functions. The dense reference is a per-query loop: slice the causal prefix, score,
softmax, blend. The paged version wraps your lab-01 recurrence: for query i, run the
block-streaming loop with seq_len = start_pos + i + 1. That +1 is load-bearing — a
token does attend to itself (its K/V are in the cache before its attention runs; see
the contract above). Off-by-one it and test_full_prefill_from_position_zero fails on the
very first row, where the prefix is exactly one token.
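A compact sketch of that wrapping follows, under stated assumptions: the cache layout (num_blocks, block_size, d), the write_kv helper standing in for the slot_mapping scatter, and the NaN "not yet written" marker are choices made for this illustration, not the lab's reference code. It prefills 12 positions as 5 + 7, writing each chunk's K/V before attending, and compares against a dense one-shot prefill.

```python
import numpy as np

def decode_one(q, k_cache, v_cache, block_table, seq_len, block_size):
    """Lab-01 shape: one query streamed over the first seq_len cached tokens."""
    m, denom, acc = -np.inf, 0.0, np.zeros(v_cache.shape[-1])
    for logical, start in enumerate(range(0, seq_len, block_size)):
        valid = min(block_size, seq_len - start)
        phys = block_table[logical]
        s = k_cache[phys, :valid] @ q / np.sqrt(q.shape[0])
        m_new = max(m, s.max())
        corr = np.exp(m - m_new) if m != -np.inf else 0.0
        p = np.exp(s - m_new)
        denom = denom * corr + p.sum()
        acc = acc * corr + p @ v_cache[phys, :valid]
        m = m_new
    return acc / denom

def paged_causal_prefill_attention(q, k_cache, v_cache, block_table, start_pos, block_size):
    """Row i sits at absolute position start_pos + i and sees keys [0, start_pos + i]."""
    return np.stack([
        decode_one(q[i], k_cache, v_cache, block_table, start_pos + i + 1, block_size)
        for i in range(q.shape[0])
    ])

def dense_causal_attention(q, k, v, start_pos=0):
    out = np.empty((q.shape[0], v.shape[1]))
    for i in range(q.shape[0]):
        n = start_pos + i + 1                      # +1: a token attends to itself
        s = k[:n] @ q[i] / np.sqrt(q.shape[1])
        w = np.exp(s - s.max())
        out[i] = (w / w.sum()) @ v[:n]
    return out

def write_kv(cache, block_table, start_pos, rows, block_size):
    """Stand-in for the slot_mapping scatter: write rows at start_pos onward."""
    for i, row in enumerate(rows):
        pos = start_pos + i
        cache[block_table[pos // block_size], pos % block_size] = row

rng = np.random.default_rng(0)
d, block_size, n = 8, 4, 12
table = [5, 2, 6]
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
k_cache = np.full((8, block_size, d), np.nan)     # NaN = not yet written
v_cache = np.full((8, block_size, d), np.nan)

chunks = []
for start, stop in [(0, 5), (5, 12)]:             # prefill as 5 + 7
    write_kv(k_cache, table, start, k[start:stop], block_size)   # K/V first
    write_kv(v_cache, table, start, v[start:stop], block_size)
    chunks.append(paged_causal_prefill_attention(
        q[start:stop], k_cache, v_cache, table, start, block_size))
chunked = np.concatenate(chunks)
print(np.allclose(chunked, dense_causal_attention(q, k, v), atol=1e-9))
```

During the first chunk, positions 5 to 11 are still NaN in the cache, so a loop bound that reached even one row into the future would surface as NaN in the output.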
What the tests prove
| Test | What it pins |
|---|---|
| test_full_prefill_from_position_zero | The base case (start_pos=0), with a partial last block: 13 tokens in 4 blocks |
| test_mid_sequence_chunk | The chunked case: queries for positions 5–8 of a 9-token cache attend over exactly the right prefixes despite starting mid-block |
| test_chunked_equals_one_shot | The phase-bridging invariant: 12 positions as one chunk ≡ as 5 + 7, with every output row identical to 1e-9. Phase 3 lab-02's theorem, at the layer where it's actually enforced |
| test_causality_future_tokens_are_invisible | A 1e3 "loud future" in the last token's K/V changes only the last query's row. Rows 0–6 are provably deaf to it. A non-causal bug here doesn't crash: it leaks the future into every token, the model was trained on nothing like it, and outputs degrade mysteriously. This test makes the leak deafening instead |
The poison technique is Phase 2 lab-06's trick pointed at a different boundary: there it
guarded seq_len masking, here it guards the causal frontier. Same principle — make the
forbidden region catastrophic to touch, then prove nothing touched it.
Hitchhiker's notes
- Why is prefill compute-bound while decode is bandwidth-bound when it's the same math? Count the reuse: in prefill, each gathered K/V block is dotted against many query rows (every query whose prefix covers it); in decode, against exactly one. That's the arithmetic-intensity difference of Phase 0 lab-04, visible in this very loop nest — and it's why real prefill kernels tile over both queries and keys (FlashAttention's 2D blocking) while decode kernels tile only keys (lab-01/lab-04 shapes).
- query_start_loc and friends: real batches contain many requests' chunks concatenated; upstream passes per-request offsets (query_start_loc, seq_lens) so one kernel launch handles a ragged batch. Your start_pos is the single-request version of that metadata. Find the production form in upstream/vllm/v1/attention/backends/flash_attn.py (FlashAttentionMetadata).
- The solution's per-query inner loop is honest but quadratic in reads: it re-gathers shared prefix blocks once per query. Real kernels invert the nest (outer loop over key tiles, inner over queries, with the diagonal-tile mask) precisely to read each block once. Try the inversion as an exercise; the recurrence per query row is unchanged, which is the point: the math doesn't care which loop is outside.
- Sliding-window attention (Mistral et al.) is one more loop-bound tweak: the prefix becomes [max(0, pos−W+1), pos]. If you can place the causal bound, you can place the window bound, and you now know why window support is a per-backend feature flag rather than a model-side trick.
Going further
- Vectorize the dense reference into a single masked matmul (scores + np.triu(np.full((M, N), -np.inf), k=1 + start_pos), with N = start_pos + M) and check it against your loop; then notice the materialized (M, N) score matrix is exactly what FlashAttention exists to avoid.
- Invert the loop nest (key-tiles outer) as sketched above and re-run the suite: same four green tests, different memory behavior. You've reproduced the actual structure of flash_attn's prefill kernel.
- Implement sliding-window (window parameter, prefix start max(0, pos−W+1)) and write the poison test for the left boundary: a loud token just outside the window must be inaudible.
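For the first and third items, one possible shape of the masked-matmul reference is sketched below, with the window bound added as a second triangular mask. The function names and the window parameter are hypothetical; the per-query loop is kept alongside as the oracle.

```python
import numpy as np

def causal_loop(q, k, v, start_pos, window=None):
    """Per-query reference: row i sees keys [lo, pos] with pos = start_pos + i."""
    out = np.empty((q.shape[0], v.shape[1]))
    for i in range(q.shape[0]):
        pos = start_pos + i
        lo = 0 if window is None else max(0, pos - window + 1)
        s = k[lo:pos + 1] @ q[i] / np.sqrt(q.shape[1])
        w = np.exp(s - s.max())
        out[i] = (w / w.sum()) @ v[lo:pos + 1]
    return out

def causal_masked_matmul(q, k, v, start_pos, window=None):
    """Same result via one (M, N) score matrix plus an additive -inf mask."""
    m_rows, n = q.shape[0], start_pos + q.shape[0]
    scores = q @ k[:n].T / np.sqrt(q.shape[1])            # materialized (M, N)
    neg_inf = np.full((m_rows, n), -np.inf)
    mask = np.triu(neg_inf, k=1 + start_pos)              # hide keys after pos
    if window is not None:
        # hide keys before pos - window + 1 (column j <= pos - window)
        mask = mask + np.tril(neg_inf, k=start_pos - window)
    scores = scores + mask
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (w / w.sum(axis=1, keepdims=True)) @ v[:n]

rng = np.random.default_rng(0)
d, start_pos, m_rows = 8, 5, 4                            # positions 5..8 of 9 tokens
n = start_pos + m_rows
q = rng.standard_normal((m_rows, d))
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))

for window in (None, 3):
    same = np.allclose(causal_masked_matmul(q, k, v, start_pos, window),
                       causal_loop(q, k, v, start_pos, window), atol=1e-12)
    print(window, same)
```

The two masks never overlap for window ≥ 1, so each masked entry is exactly −inf and each visible entry gets +0. The cost is the full (M, N) score matrix, which is the thing the streaming version never builds.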
References
- Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022) — the 2D-tiled prefill kernel this lab is the skeleton of: https://arxiv.org/abs/2205.14135
- Dao, FlashAttention-2 (2023) — the loop-nest inversion and work partitioning: https://arxiv.org/abs/2307.08691
- upstream/vllm/v1/attention/backends/flash_attn.py (FlashAttentionMetadata): query_start_loc, seq_lens, and the cascade of shapes one launch handles.
- Phase 3 lab-02: the engine-level statement of test_chunked_equals_one_shot.
- Phase 2 lab-06: the write path (slot_mapping) that fills the cache this lab reads.
Lab 04-04 — Flash-Decoding: Split the Keys, Merge the Partials [CPU-OK]
Here's a problem your lab-01 kernel can't solve. One request, one decode query, a 128k-token context — and a GPU with 100+ streaming multiprocessors. The online-softmax loop is sequential over blocks: one SM grinds through 8,000 blocks while 99+ SMs watch. Decode latency for long contexts becomes a single-core problem on a massively parallel machine.
The fix — known as flash-decoding (Dao et al.), paged_attention_v2 in vLLM's CUDA,
split-k in FlashInfer — is the subject of this lab: partition the keys, attend each
partition independently and in parallel, then merge the partial results exactly. The
reason it works is a small piece of algebra worth owning forever: softmax-attention state
compresses to a triple (max, denominator, unnormalized-accumulator), and two such
triples combine associatively. You'll implement the triple, the merge, and prove
equality with dense attention for any partition count, any merge order, any tree shape.
Contents
- Why this lab exists
- Background: attention state is three numbers
- Files
- Run
- What to implement
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
This is the lab where "online softmax" stops being a trick you memorized and becomes a monoid you can wield. Lab-01's recurrence processes blocks left to right — it looks inherently sequential. The deep fact is that it isn't: the per-block update is just the binary merge applied repeatedly, and because the merge is associative and order-insensitive, you may evaluate it in any tree shape — including "all leaves in parallel, one combine at the end." Sequential streaming (FlashAttention), parallel split-KV (flash-decoding), and hierarchical reduction (multi-stage kernels) are the same algorithm under different parenthesizations.
Practically, this is also the difference between usable and unusable long-context decode.
Batch-1, long-context inference — the agentic workload, increasingly the workload — has
no batch parallelism to hide behind; parallelism must come from within the single
query's attention. When vLLM picks paged_attention_v2 over v1, or FlashInfer chooses a
split-k plan, the decision is "is this context long enough that splitting beats the merge
overhead?" After this lab you'll know exactly what's being weighed.
Background: attention state is three numbers
For a query q and any set of keys/values, define:
m = max_i s_i (s_i = k_i·q / √d)
denom = Σ_i exp(s_i − m)
acc = Σ_i exp(s_i − m) · v_i ← UNNORMALIZED (a vector)
(m, denom, acc) is a summary of attention over that key set: the final output is
acc / denom, but crucially you don't divide until the very end. Two summaries over
disjoint key sets merge by rescaling both to the shared max:
m* = max(m₁, m₂)
denom* = denom₁·e^{m₁−m*} + denom₂·e^{m₂−m*}
acc* = acc₁·e^{m₁−m*} + acc₂·e^{m₂−m*}
Check the properties: commutative (symmetry of the formulas), associative (both sides
reduce to "rescale everything to the global max and add"), and lab-01's per-block update
is exactly this merge where one side is a single block's summary. The exp(m−m*)
correction factors are the price of never having seen the global max in advance — and
they're also the numerical-stability mechanism: no exponential is ever taken of a
positive number, so nothing overflows even when one partition holds a monster logit
(test_extreme_scores_do_not_overflow feeds it a score of ~200; exp(200) overflows
float32 to inf under naive softmax).
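The definitions and the merge above translate almost line for line into NumPy. The sketch below uses merge_two and finalize as helper names of its own choosing (the lab's starter exposes attend_partial and merge_partials), plants one score of about 200, and checks that three different parenthesizations agree with the unpartitioned result.

```python
import numpy as np
from functools import reduce

def attend_partial(q, k, v):
    """Summary (m, denom, acc) over one key set; acc is deliberately unnormalized."""
    s = k @ q / np.sqrt(q.shape[0])
    m = s.max()
    p = np.exp(s - m)
    return m, p.sum(), p @ v

def merge_two(a, b):
    """Rescale both summaries to the shared max, then add."""
    m1, d1, acc1 = a
    m2, d2, acc2 = b
    m = max(m1, m2)
    c1, c2 = np.exp(m1 - m), np.exp(m2 - m)   # both exponents are <= 0
    return m, d1 * c1 + d2 * c2, acc1 * c1 + acc2 * c2

def finalize(summary):
    """The only division: turn a summary into the attention output."""
    _, denom, acc = summary
    return acc / denom

rng = np.random.default_rng(0)
d, n = 8, 40
q = rng.standard_normal(d)
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))
k[17] = 200.0 * np.sqrt(d) * q / (q @ q)      # plant a score of ~200 at token 17

dense = finalize(attend_partial(q, k, v))
parts = [attend_partial(q, k[i:j], v[i:j]) for i, j in [(0, 10), (10, 25), (25, 40)]]

left_fold = finalize(reduce(merge_two, parts))                        # ((a, b), c)
right_first = finalize(merge_two(parts[0], merge_two(parts[1], parts[2])))  # (a, (b, c))
reversed_order = finalize(reduce(merge_two, parts[::-1]))             # ((c, b), a)

for out in (left_fold, right_first, reversed_order):
    print(np.allclose(out, dense, atol=1e-12))
```

Keeping merge_two as summary × summary → summary, with the division isolated in finalize, is what lets the same function serve a sequential fold, a tree, or a merge of merges.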
Files
- starter.py: attend_partial (key range → summary), merge_partials (summaries → output), partitioned_attention (split, attend, merge). Your work.
- solution.py: reference (the whole thing is ~25 lines; the understanding is the deliverable).
- test_lab.py: identity at 1 partition, equality at any count, empty-chunk handling, order-invariance, hierarchical merging, and the overflow stress.
Run
LAB_IMPL=starter pytest phase-04-attention-backends/labs/lab-04-flash-decoding-partitions -q
pytest phase-04-attention-backends/labs/lab-04-flash-decoding-partitions -q # reference
What to implement
Follow the math above literally. The one design rule that matters: attend_partial
must not normalize. The moment you divide by the local denominator, the summary is no
longer mergeable — you've thrown away the weights needed to re-weight against other
partitions. (Returning normalized outputs and "averaging" them is the classic wrong
implementation; it passes the 1-partition test and fails every other one, which is
exactly why the 1-partition test isn't sufficient and the suite has six.)
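The failure mode in the parenthesis is easy to reproduce by hand. This is a deliberately wrong sketch with a three-token example picked so the numbers can be checked mentally; wrong_partitioned is a hypothetical name for the mistake, not a function from the lab.

```python
import numpy as np

def attention(q, k, v):
    s = k @ q / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    return (w / w.sum()) @ v

def wrong_partitioned(q, k, v, bounds):
    """The classic mistake: normalize each partition locally, then average."""
    return np.mean([attention(q, k[i:j], v[i:j]) for i, j in bounds], axis=0)

q = np.array([1.0])
k = np.array([[0.0], [0.0], [3.0]])      # scores 0, 0, 3 (d = 1)
v = np.array([[1.0], [1.0], [0.0]])

dense = attention(q, k, v)                              # 2 / (2 + e^3)
one = wrong_partitioned(q, k, v, [(0, 3)])              # identical to dense
two = wrong_partitioned(q, k, v, [(0, 2), (2, 3)])      # mean of 1.0 and 0.0

print(dense, one, two)
```

The second partition holds almost all of the softmax mass (e^3 against 2), but after local normalization both partitions look equally important. The lost information is exactly the pair (m, denom) that the unnormalized summary keeps.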
What the tests prove
| Test | What it pins |
|---|---|
| test_one_partition_is_just_attention | The degenerate case: summary → output round-trips |
| test_any_partition_count_matches_dense | 2, 3, 7, 32, 100 partitions, all 1e-12-equal to dense. Partitioning is exact, not approximate; any tolerance bigger than rounding would hide real bugs |
| test_more_partitions_than_keys | array_split hands you empty chunks; skip, don't crash. The GPU analogue: grid sized for max length, sequences shorter than the partition count |
| test_merge_is_order_invariant | Reversed and shuffled partial lists give identical output. Mandatory, because on hardware thread blocks finish in nondeterministic order |
| test_merge_is_hierarchical | Merging merges = attending over the union: associativity, demonstrated. This is the license for tree reductions and multi-stage kernels |
| test_extreme_scores_do_not_overflow | A ~200 logit in one partition: finite output, still 1e-12-equal. The running max isn't bookkeeping; it's the firewall |
Hitchhiker's notes
- Where this lives upstream: upstream/csrc/attention/paged_attention_v2.cu. Search max_logits and exp_sums: those are your m and denom, written per partition to scratch buffers, merged by a second reduction kernel. The v1/v2 choice (v2 when partitioning pays) is made by the backend per launch. FlashInfer generalizes the same state into plan-based split-k; FlashAttention's flash_attn_with_kvcache exposes it as num_splits.
- The merge is also how cascade/shared-prefix attention works (FlashInfer's signature feature): read the shared system-prompt KV once for the whole batch (one pass over the prefix rows, yielding one summary per query), attend per-request suffixes separately, merge each request's pair. Same triple, same combine: prefix caching meeting kernel design. That's three course threads (Phase 2 sharing, Phase 3 caching, this lab) converging on one formula.
- Why does sequential streaming still exist if parallel split is exact? Overhead: each partition writes its summary to global memory and a second kernel reads them back. For short contexts the round-trip costs more than it saves; for prefill the parallelism already comes from query rows (lab-03). Split-KV wins specifically at long-context decode — engineering is choosing the parenthesization that matches the hardware's idle dimension.
- This trick is older and bigger than attention: it's a parallel reduction over a non-trivial monoid, the same pattern as parallel max/sum/scan. The general skill — "can I summarize partial state so summaries combine associatively?" — is how you parallelize anything with a running normalizer. You'll meet it again in distributed softmax (Phase 10's context parallelism splits attention across GPUs with exactly this merge).
Going further
- Implement merge_two(a, b) -> summary (summary × summary → summary, not output) and rebuild merge_partials as a fold; then as a balanced tree with functools.reduce-style pairing. Verify all shapes agree: you've now written the reduction the way the GPU executes it.
- Combine with lab-01: make each partition gather through the block table (partition = a contiguous range of logical blocks). That composition, paged + split-KV, is precisely paged_attention_v2.
- Simulate the cascade pattern: 8 "requests" sharing a 512-token prefix with unique 64-token suffixes. Compute the prefix summaries in one pass over the prefix rows + 8 suffix summaries, merge per request; compare against 8 dense computations. Measure the key-reads saved (should be ~7×512 rows): FlashInfer's headline, reproduced in numpy.
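A sketch of that last simulation, under stated assumptions: one decode query per request, random data, and a batched summarize helper (a hypothetical name) that computes every query's prefix summary in a single pass over the shared rows. The row counts are simple bookkeeping, not a measurement.

```python
import numpy as np

def summarize(q, k, v):
    """Batched summaries: q is (B, d); returns m (B,), denom (B,), acc (B, d)."""
    s = q @ k.T / np.sqrt(q.shape[1])
    m = s.max(axis=1)
    p = np.exp(s - m[:, None])
    return m, p.sum(axis=1), p @ v

def merge(a, b):
    """Row-wise merge of two batched summaries over disjoint key sets."""
    m = np.maximum(a[0], b[0])
    ca, cb = np.exp(a[0] - m), np.exp(b[0] - m)
    return m, a[1] * ca + b[1] * cb, a[2] * ca[:, None] + b[2] * cb[:, None]

def finalize(summary):
    return summary[2] / summary[1][:, None]

rng = np.random.default_rng(0)
B, P, S, d = 8, 512, 64, 16
q = rng.standard_normal((B, d))
prefix_k, prefix_v = rng.standard_normal((P, d)), rng.standard_normal((P, d))
suffix_k, suffix_v = rng.standard_normal((B, S, d)), rng.standard_normal((B, S, d))

prefix = summarize(q, prefix_k, prefix_v)     # one pass over the 512 shared rows
cascade = np.concatenate([
    finalize(merge(tuple(x[i:i + 1] for x in prefix),
                   summarize(q[i:i + 1], suffix_k[i], suffix_v[i])))
    for i in range(B)
])

dense = np.concatenate([
    finalize(summarize(q[i:i + 1],
                       np.concatenate([prefix_k, suffix_k[i]]),
                       np.concatenate([prefix_v, suffix_v[i]])))
    for i in range(B)
])

rows_cascade = P + B * S          # prefix rows read once, plus every suffix
rows_dense = B * (P + S)          # prefix rows re-read for each request
print(np.allclose(cascade, dense, atol=1e-12), rows_dense - rows_cascade)
```

Each query still gets its own prefix summary, because the scores depend on q. What is shared is the read of the prefix K/V rows, which is where the saving comes from.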
References
- Dao et al., Flash-Decoding for Long-Context Inference (2023) — the technique, with the parallelism diagrams: https://pytorch.org/blog/flash-decoding/
- Milakov & Gimelshein, Online normalizer calculation for softmax (2018) — the merge formula's original home: https://arxiv.org/abs/1805.02867
- Ye et al., FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving (2025), on split-k plans and cascade/shared-prefix attention: https://arxiv.org/abs/2501.01005
- upstream/csrc/attention/paged_attention_v2.cu (max_logits / exp_sums / the reduce kernel): your lab, in CUDA.
- Phase 10: the same merge, stretched across GPUs (context parallelism).
Phase 04 — Exercises: Attention Backends
Warm-up (explain)
1. Attention is one operation, so why does vLLM have many attention backends?
2. What three pieces of metadata does a paged attention kernel need, and what is each for?
3. What is online softmax and what problem does it solve?
Core (trace the code)
4. In Attention.forward (attention.py:437), what does the model pass, and what does it not know about the kernel?
5. Name the three base classes in backend.py and the job of each (Backend / Impl / MetadataBuilder).
6. In FlashAttentionImpl.forward (flash_attn.py:592), find the KV write (slot_mapping) and the paged read (block table). How do they map to Phase 2?
7. List three inputs get_attn_backend (selector.py:52) uses to pick a backend.
Build (your lab)
8. In lab-01, explain why scattering the logical blocks to arbitrary physical ids doesn't change the output. What does that prove about the kernel's contract?
9. Extend paged_online_attention to multiple query heads (loop or vectorize). Verify against a multi-head dense reference.
10. Add a causal mask variant (a prefill query at position p attends only to tokens ≤ p).
Design (staff-level)
11. At high concurrency with many short decode requests, FlashInfer often beats FlashAttention. Hypothesize why, and design a benchmark (Phase 18) to confirm it for your workload.
12. You're bringing up a new model with a novel attention (e.g. a different KV compression). What parts of the backend system must you implement, and what can you reuse?
13. A user reports correct output with VLLM_ATTENTION_BACKEND=TRITON_ATTN but garbage with the default. Outline your debugging path and what it implies about the default kernel.
Self-grading
4–7 and 11–13 are interview-grade. Could you draw the layer→impl→kernel path and name the files? If not, re-read 01-deep-dive.md.
Phase 04 — Interview Questions: Attention Backends
Q1. Why does vLLM have a pluggable attention-backend system?
Model answer
Attention is one math op, but the fastest (or only available) kernel depends on hardware,
dtype, head size, and model features (MLA, sliding window). A pluggable system lets vLLM pick the
best kernel per setup (FlashAttention, FlashInfer, Triton, FlashMLA, TRTLLM-GEN) and adopt new
ones without touching model code — the model talks only to the Attention layer
(attention.py:177), which delegates to the chosen AttentionImpl.
Q2. What does a paged attention kernel need that a dense one doesn't?
Model answer
The block table (logical→physical block, to gather scattered prior KV), the slot mapping
(where to write this step's new K/V), and per-request sequence lengths (for varlen batching).
These are built each step by the AttentionMetadataBuilder (flash_attn.py:276) from the
scheduler's output — the bridge from Phases 2/3 to the kernel.
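The two maps in that answer reduce to one index formula. The sketch below assumes a flat, block-major slot numbering (slot = physical_block × block_size + offset); the helper names are hypothetical and the real metadata builder produces tensors for a whole ragged batch, not Python lists for one request.

```python
from typing import List

def slot_for(block_table: List[int], pos: int, block_size: int) -> int:
    """Flat cache slot of logical position pos (assumed block-major layout)."""
    return block_table[pos // block_size] * block_size + pos % block_size

def slot_mapping(block_table: List[int], start_pos: int,
                 num_new_tokens: int, block_size: int) -> List[int]:
    """Write map: where this step's new K/V rows go."""
    return [slot_for(block_table, start_pos + i, block_size)
            for i in range(num_new_tokens)]

def read_slots(block_table: List[int], seq_len: int, block_size: int) -> List[int]:
    """Read map expanded: slots holding positions [0, seq_len), in logical order."""
    return [slot_for(block_table, pos, block_size) for pos in range(seq_len)]

block_table = [3, 1, 7, 0]
print(slot_mapping(block_table, start_pos=5, num_new_tokens=4, block_size=4))
print(read_slots(block_table, seq_len=6, block_size=4))
```

With block_size 4, positions 5 to 7 land in physical block 1 and position 8 starts physical block 7, which is why the write map jumps from 7 to 28.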
Q3. Explain online softmax and why FlashAttention uses it.
Model answer
Naive attention materializes the full N×N score matrix — O(N²) memory. Online softmax streams K/V in tiles, keeping a running max, a rescaled accumulator, and a running denominator, so it computes exact softmax-weighted attention in O(N) memory and stays in fast SRAM. That's the "Flash" in FlashAttention, and it's what makes long-context attention feasible. (You implement it in lab-01.)
Q4. How and when is the backend chosen?
Model answer
At startup, by get_attn_backend (selector.py:52), from the platform default
(platforms/cuda.py etc.), dtype, head size, and model features, with VLLM_ATTENTION_BACKEND
as an override. It's fixed for the run because CUDA-graph capture and the metadata builder depend
on it (Phase 5). MLA models force an MLA backend due to their different KV layout.
Q5. What is MLA and why does it need its own backend?
Model answer
Multi-head Latent Attention (DeepSeek) compresses K/V into a shared low-rank latent vector instead of storing full per-head K/V, shrinking the KV cache a lot. Because the cached representation and the attention math differ, it needs a dedicated kernel/backend (FlashMLA) and a different KV cache layout — an example of the model's attention design dictating the kernel.
Rapid-fire
- Model-facing class? Attention (attention.py:177).
- Three backend roles? Backend (names it), Impl (runs it), MetadataBuilder (feeds it).
- Override env var? VLLM_ATTENTION_BACKEND.
- Read map / write map? block table / slot mapping.
- The trick that makes attention O(N) memory? Online softmax.
Phase 04 — Cheatsheet: Attention Backends
Contents
- The one-liner
- The four roles (vllm/v1/attention/backend.py)
- The kernels
- Online softmax (why "Flash")
- Selection
- Key upstream
The one-liner
Attention is one op; the backend is which kernel computes it. Model code is backend-agnostic; the kernel gets paged-KV metadata (block table + slot mapping + seq lens).
The four roles (vllm/v1/attention/backend.py)
- Attention layer (model-facing, attention.py:177) → delegates to:
- AttentionImpl.forward: runs the kernel (writes KV via slot_mapping, reads via block table)
- AttentionBackend: names impl + metadata + supported configs
- AttentionMetadataBuilder.build: SchedulerOutput → kernel metadata (the Phase 2/3 → kernel bridge)
The kernels
| backend | best for |
|---|---|
| FlashAttention | general default; online softmax, O(N) memory |
| FlashInfer | serving, paged KV, high concurrency / many decodes |
| Triton | portable fallback |
| FlashMLA | MLA models (DeepSeek) — low-rank latent KV |
| TRTLLM-GEN | NVIDIA TensorRT-LLM generated, GPU/precision-tuned |
Online softmax (why "Flash")
running max + rescale + accumulate per tile → exact softmax in O(N) memory, no N×N matrix.
Selection
get_attn_backend (selector.py:52) ← platform default + dtype + head size + features; override
with VLLM_ATTENTION_BACKEND=FLASH_ATTN|FLASHINFER|TRITON_ATTN|.... Fixed for the run (CUDA graphs).
Key upstream
- model_executor/layers/attention/attention.py:177 Attention · :437 forward
- v1/attention/backend.py base classes · v1/attention/selector.py:52 selector
- v1/attention/backends/flash_attn.py :68 Backend :223 Metadata :276 Builder :592 Impl
- v1/attention/backends/mla/ MLA · registry.py name→backend
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md
Phase 05 — The Hitchhiker's Guide to CUDA Graphs & torch.compile ⭐
Flagship phase — written in full. Phases 02–03 made memory and scheduling fast. This phase attacks a different enemy: the CPU, which can be too slow to even tell the GPU what to do.
Contents
- Don't Panic
- Step 1: Why the CPU is a bottleneck at all
- Step 2: CUDA graphs — record once, replay forever
- Step 3: Full vs Piecewise graphs
- Step 4: torch.compile — making the kernels themselves better
- Step 5: How they fit together at runtime
- The invariants to memorize
- What you'll do in this phase
Don't Panic
Two ideas, one breath each:
CUDA graphs: launching a GPU kernel from Python costs CPU time. During decode you launch hundreds of tiny kernels per token — and the CPU can't issue them fast enough, so the GPU sits idle waiting for work. A CUDA graph records that whole sequence of launches once and replays it with a single launch. The CPU overhead vanishes.
torch.compile: instead of running your model op-by-op in Python, PyTorch's compiler traces it into a graph, then fuses and rewrites the ops into fewer, faster kernels. vLLM wraps this with its own backend, caching, and custom optimization passes.
They're complementary: torch.compile makes the kernels better; CUDA graphs make launching
them free. vLLM uses both, together, by default. By the end of this phase you'll have built a
CPU simulation of capture/replay (mini_vllm/cudagraph.py) that reproduces the exact win and
the exact constraints, and you'll have read the real CUDAGraphWrapper.
Step 1: Why the CPU is a bottleneck at all
Recall the decode loop (Phase 0): one token at a time, and each token's forward pass runs many small operations — for each of, say, 32 layers: a QKV projection, attention, an output projection, two MLP matmuls, norms, residual adds… Easily hundreds of GPU kernels per token.
Each kernel is launched from Python/C++ on the CPU. A launch isn't free — it costs a few microseconds of CPU work to set up and enqueue. Do the arithmetic:
~300 kernels/token × ~5 µs CPU launch overhead ≈ 1.5 ms of CPU work per token
If the GPU work for that token is also ~1.5 ms, the CPU has no slack: wherever launching and execution fail to overlap, utilization drops toward 50%. At small batch sizes (where each kernel is tiny and finishes fast) it gets worse, because the GPU finishes each kernel before the CPU can launch the next one. The GPU starves, waiting on the CPU. This is CPU-launch-bound decode, and it's the default failure mode at low batch sizes.
CPU: [launch k1][launch k2][launch k3]......[launch k300] ← the CPU is the critical path
GPU: [k1] idle [k2] idle [k3] idle ..... ← GPU waits between tiny kernels
└ each gap = CPU not done issuing the next launch
You can't make the launches cheaper one by one. But you can stop doing them every step.
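The arithmetic above is easy to turn into a back-of-envelope calculator. This is a deliberately crude model, assuming every kernel costs the same to launch and to run; `overlap=True` is the idealized case where launching and execution pipeline perfectly, `overlap=False` the fully serialized one. The function names are invented for this sketch.

```python
def cpu_launch_ms(num_kernels, launch_us):
    """Total CPU time spent issuing launches for one token, in milliseconds."""
    return num_kernels * launch_us / 1000.0

def gpu_utilization(num_kernels, launch_us, gpu_us_per_kernel, overlap=True):
    """Fraction of the step during which the GPU is busy (toy model)."""
    cpu_us = num_kernels * launch_us
    gpu_us = num_kernels * gpu_us_per_kernel
    # With overlap the slower side sets the step time; without it the two add up.
    step_us = max(cpu_us, gpu_us) if overlap else cpu_us + gpu_us
    return gpu_us / step_us

print(cpu_launch_ms(300, 5))                       # 1.5
print(gpu_utilization(300, 5, 5, overlap=False))   # 0.5
print(gpu_utilization(300, 5, 1))                  # 0.2
print(gpu_utilization(300, 5, 20))                 # 1.0
```

The third line is the low-batch regime: kernels that finish in 1 µs behind 5 µs launches leave the GPU idle 80% of the time even with perfect overlap. The last line is the large-batch regime, where launch overhead hides behind GPU work.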
Step 2: CUDA graphs — record once, replay forever
A CUDA graph is a recording of a sequence of GPU operations and their dependencies. You "capture" it once by running the forward pass in a special mode; CUDA records every kernel and its arguments into a graph object. Thereafter you replay the whole graph with a single API call — the CPU issues one launch and the GPU rips through all 300 kernels with zero per-kernel CPU involvement.
Without graphs (every step): CPU issues 300 launches → GPU starves between them
With a captured graph: CPU issues 1 "replay" → GPU runs all 300 back-to-back
The catch — and this is the whole reason it's tricky — is that a graph is a frozen recording. It records exact kernels reading from exact memory addresses for exact tensor shapes. So:
- Constraint 1 — fixed shapes. A graph captured for batch size 8 only replays for batch size 8. vLLM captures a graph for each batch size it expects (and pads odd batch sizes up to the nearest captured size). It keeps a dictionary of graphs keyed by shape.
- Constraint 2 — static input buffers. Replay reads from the same memory the capture used. So to run a new token's inputs, you must copy them into the captured input buffer first, then replay. The graph reads from the fixed address; your job is to keep that address valid and current.
These two constraints are exactly what your mini_vllm/cudagraph.py GraphRunner models: a
dict keyed by input shape (Constraint 1), and a static_input buffer you np.copyto into
before replay (Constraint 2). Go read it — it's ~40 lines and it is the mental model.
Why "graphs" plural? Because of Constraint 1, vLLM holds many captured graphs, one per batch size in
cudagraph_capture_sizes. The real CUDAGraphWrapper stores them in concrete_cudagraph_entries: dict[BatchDescriptor, CUDAGraphEntry] (cuda_graph.py:207). Your simulation's self.graphs: dict[shape, GraphEntry] is the same idea with the GPU filed off.
Step 3: Full vs Piecewise graphs
There's a wrinkle. Some operations can't be captured into a graph cleanly — most importantly attention, because its kernel takes variable-length metadata (the block tables and sequence lengths from Phase 02/03) that change every step and don't fit the "frozen recording" model.
vLLM offers two strategies (the CUDAGraphMode enum, compilation.py:53):
- FULL — capture the entire model forward as one graph. Maximum CPU-overhead removal, but fragile: everything in the forward must be capture-safe, including attention (which needs special handling, e.g. capturing only the decode case where shapes are uniform).
- PIECEWISE: split the forward at the uncapturable ops (attention). Capture each contiguous compiled region between splits as its own small graph; run the split ops (attention) eagerly. You pay roughly two launches per layer (one replay per piece plus the eager attention call) instead of one per kernel: most of the win, far more robustly.
FULL: [ ============== one graph: whole forward ============== ] (1 replay)
PIECEWISE: [ graph A ] (attention eager) [ graph B ] (attention eager) [ graph C ]
└ capture └ run live └ capture └ run live └ capture
a handful of launches, robust to attention's dynamic metadata
vLLM's V1 default is actually FULL_AND_PIECEWISE (compilation.py:63): use a FULL graph for
pure-decode batches (uniform shapes — safe and fastest) and PIECEWISE for mixed prefill+decode
batches (where attention metadata varies). Your mini_vllm.PiecewiseGraphRunner models exactly
this: it splits ops at an "uncapturable" predicate, captures the rest, runs the splits eagerly —
and a test proves the output is identical to eager.
Step 4: torch.compile — making the kernels themselves better
CUDA graphs remove launch overhead but don't change what runs. torch.compile does the
other half: it traces your model into an FX graph (via TorchDynamo), then a backend
(Inductor) generates fused kernels, e.g. fusing an RMSNorm and a quantization into one kernel,
so the activations make one trip through memory instead of one per op.
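What fusion buys can be shown with a counting toy. This sketch assumes elementwise ops on a list and treats "one pass over the data" as "one kernel launch"; it illustrates the idea only and has nothing to do with how Inductor generates code. Both function names are hypothetical.

```python
def fuse_elementwise(ops):
    """Compose a chain of elementwise ops into a single callable (one 'kernel')."""
    def fused(v):
        for op in ops:
            v = op(v)
        return v
    return [fused]

def run_counting_launches(ops, xs):
    """Apply each op as its own pass over the data, counting passes as launches."""
    launches = 0
    for op in ops:
        xs = [op(v) for v in xs]   # one full read and write of the data per op
        launches += 1
    return xs, launches

ops = [lambda v: v + 1, lambda v: v * 2]
unfused_out, unfused_launches = run_counting_launches(ops, [1, 2, 3])
fused_out, fused_launches = run_counting_launches(fuse_elementwise(ops), [1, 2, 3])
print(unfused_out, unfused_launches)  # [4, 6, 8] 2
print(fused_out, fused_launches)      # [4, 6, 8] 1
```

Same result, half the passes over memory. Unlike a CUDA graph, this saving applies even in eager mode, because the work itself changed.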
vLLM doesn't just use stock torch.compile; it has a custom backend (compilation/backends.py) and a compilation pipeline with levels (CompilationMode, compilation.py:37):
CompilationMode (the "level"):
0 NONE – pure eager, no compile (what enforce_eager gives you)
1 STOCK_TORCH_COMPILE – plain torch.compile
2 DYNAMO_TRACE_ONCE – trace once, no recompiles
3 VLLM_COMPILE – vLLM's Inductor backend: caching + PIECEWISE compilation +
shape specialization + custom passes ← the V1 default
At level 3, vLLM:
- traces the model once and caches the compiled artifacts (so restarts are fast),
- splits the graph at attention for piecewise compilation (lining up with piecewise CUDA graphs),
- runs custom graph passes (compilation/passes/): fusions vLLM knows are safe and profitable for inference but stock Inductor wouldn't do (e.g. fused add+RMSNorm, sequence-parallel rewrites, quant fusions).
You opt a model into all this with one decorator: @support_torch_compile on the model class
(decorators.py:118). That's the seam between "a model" and "the compiler."
Step 5: How they fit together at runtime
model class
└─ @support_torch_compile (Phase 5: opt into the compiler)
└─ torch.compile / VLLM_COMPILE backend → fused kernels, piecewise split at attention
└─ CUDAGraphWrapper (Phase 5: capture/replay per batch size)
└─ for batch size B: replay graph_B (1 launch) OR capture if new shape
Each decode step, the model runner sets a forward_context with the current
cudagraph_runtime_mode and a batch_descriptor (the shape key). The CUDAGraphWrapper
(cuda_graph.py:233) reads that context and either runs eagerly (mode NONE / mismatch),
replays the cached graph for that shape, or captures a new one. That dispatch-by-context
is precisely what your GraphRunner.__call__ does with x.shape as the key.
The invariants to memorize
- CUDA graphs remove CPU launch overhead, not GPU compute. They help when decode is CPU-launch-bound (small batch), not when it's GPU-bound (large batch / prefill).
- A graph is per-shape: one captured graph per batch size; odd sizes are padded up.
- Replay reads static buffers: new inputs must be copied in before replay.
- Attention is the thing that resists capture → piecewise splits around it.
- torch.compile improves kernels; CUDA graphs improve launching. Different problems, used together.
- enforce_eager=True turns both off: your debugging escape hatch (and the only way to get fully dynamic shapes).
What you'll do in this phase
- Read: 01-deep-dive.md walks the real CUDAGraphWrapper.__call__, the CUDAGraphMode/CompilationMode enums, and @support_torch_compile line by line.
- Build: 02-mini-build.md, the capture/replay simulator (reference: mini_vllm/cudagraph.py).
- Labs (see labs/README.md; recommended order 01 → 02 → 05 → 03 → 04):
  - lab-01-graph-replay-simulator [CPU-OK]: implement capture/replay + shape dispatch + static buffers; pass the tests.
  - lab-02-launch-overhead [CPU-OK]: model launch overhead and find the eager↔graph crossover.
  - lab-03-cudagraph-mode [CPU-OK]: reimplement the CUDAGraphMode dispatch (FULL / PIECEWISE / FULL_AND_PIECEWISE) and prove you understand decode-vs-mixed routing.
  - lab-04-graph-vs-eager-real [GPU-REQ]: run real vLLM with enforce_eager vs CUDA graphs, measure the ITL difference (captured output included).
  - lab-05-capture-sizes [CPU-OK]: the capture-size ladder: rung lookup, padding waste, and the denser-ladder-vs-more-captures trade ("why is my batch of 33 running at 40?").
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
When you can explain why a graph helps decode but not prefill, name the two constraints, and draw piecewise vs full from memory, you understand the layer that often doubles low-batch throughput for free.
Phase 05 — Deep Dive: CUDA Graphs & torch.compile in real vLLM
Paths relative to upstream/ at v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). The compilation subsystem:
vllm/compilation/
  cuda_graph.py            CUDAGraphWrapper, CUDAGraphEntry (capture/replay; read this first)
  decorators.py            @support_torch_compile (how a model opts in)
  backends.py              the VllmBackend for torch.compile (trace -> split -> compile)
  piecewise_backend.py     piecewise compiled regions
  passes/pass_manager.py + passes/fusion/   custom graph rewrites
vllm/config/compilation.py   CompilationMode, CUDAGraphMode, CompilationConfig
We read capture/replay (the core), the two config enums (the vocabulary), and the decorator (the seam). The Inductor internals are deep; return after you're comfortable here.
Contents
- 1. The two enums that name everything
- 2. CUDAGraphWrapper: capture and replay (the heart)
- 3. @support_torch_compile: the seam between a model and the compiler
- 4. The backend + passes (skim now, return later)
- 5. Where it's wired into the engine
- Reading checklist
1. The two enums that name everything
CompilationMode — the "level" (vllm/config/compilation.py:37)
class CompilationMode(enum.IntEnum):
    NONE = 0                  # pure eager, model runs as-is (what enforce_eager gives you)
    STOCK_TORCH_COMPILE = 1   # the standard torch.compile pipeline
    DYNAMO_TRACE_ONCE = 2     # single Dynamo trace, avoid recompilation
    VLLM_COMPILE = 3          # vLLM's Inductor backend: caching, piecewise, shape
                              # specialization, custom passes <- V1 default
This answers "how hard does the compiler work?" Level 3 (VLLM_COMPILE) is where vLLM's value
is — its own backend with caching and piecewise splitting. Levels 0–2 are mostly for
debugging/comparison. mini_vllm doesn't compile (no GPU), but the idea of "a level dial from
eager to fully-optimized" is the thing to carry.
CUDAGraphMode — the capture strategy (vllm/config/compilation.py:53)
class CUDAGraphMode(enum.Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_DECODE_ONLY = (FULL, NONE)          # full graph for decode, nothing for mixed
    FULL_AND_PIECEWISE = (FULL, PIECEWISE)   # full for decode, piecewise for mixed (v1 default)
Notice the clever encoding: the last two are tuples (decode_mode, mixed_mode). A batch is
either pure-decode (uniform shapes — safe for a FULL graph) or mixed prefill+decode (variable
attention metadata — needs PIECEWISE). The helper methods make this explicit
(compilation.py:65):
def decode_mode(self) -> "CUDAGraphMode":
    return CUDAGraphMode(self.value[0]) if self.separate_routine() else self
def mixed_mode(self) -> "CUDAGraphMode":
    return CUDAGraphMode(self.value[1]) if self.separate_routine() else self
def has_mode(self, mode) -> bool: ...  # is `mode` one of my routines?
def requires_piecewise_compilation(self) -> bool:
    return self.has_mode(CUDAGraphMode.PIECEWISE)
So FULL_AND_PIECEWISE.decode_mode() == FULL and .mixed_mode() == PIECEWISE. You will
reimplement these exact methods in lab-03; they're small and they encode the whole
"which graph for which batch" decision. The comment at lines 595–620 of the config spells out
the tradeoffs (PIECEWISE keeps attention out of the graph and captures only the non-attention
regions; FULL_AND_PIECEWISE is generally fastest).
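A self-contained re-creation of the enum makes the tuple encoding easy to poke at. This is a sketch, not the upstream file: the body of `has_mode` is elided in the excerpt above, so the one here is an assumption, and `runtime_mode_for` is a hypothetical helper added only to show the routing.

```python
import enum

class CUDAGraphMode(enum.Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    # Inside the class body FULL, NONE, PIECEWISE are still plain ints,
    # so these values are the tuples (2, 0) and (2, 1).
    FULL_DECODE_ONLY = (FULL, NONE)
    FULL_AND_PIECEWISE = (FULL, PIECEWISE)

    def separate_routine(self) -> bool:
        return isinstance(self.value, tuple)

    def decode_mode(self) -> "CUDAGraphMode":
        return CUDAGraphMode(self.value[0]) if self.separate_routine() else self

    def mixed_mode(self) -> "CUDAGraphMode":
        return CUDAGraphMode(self.value[1]) if self.separate_routine() else self

    def has_mode(self, mode: "CUDAGraphMode") -> bool:
        # Assumed body: a composite mode "has" each of its two routines.
        if self.separate_routine():
            return mode.value in self.value
        return self == mode

    def requires_piecewise_compilation(self) -> bool:
        return self.has_mode(CUDAGraphMode.PIECEWISE)

def runtime_mode_for(mode: CUDAGraphMode, pure_decode: bool) -> CUDAGraphMode:
    """Hypothetical helper: pick the routine for this batch type."""
    return mode.decode_mode() if pure_decode else mode.mixed_mode()

default = CUDAGraphMode.FULL_AND_PIECEWISE
print(runtime_mode_for(default, pure_decode=True))    # CUDAGraphMode.FULL
print(runtime_mode_for(default, pure_decode=False))   # CUDAGraphMode.PIECEWISE
print(CUDAGraphMode.FULL_DECODE_ONLY.requires_piecewise_compilation())  # False
```

A simple mode such as PIECEWISE returns itself from both `decode_mode()` and `mixed_mode()`, so callers never need to special-case composite versus simple modes.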
2. CUDAGraphWrapper — capture and replay (the heart)
vllm/compilation/cuda_graph.py:145. Read its docstring (lines 146–168) — it states the
dispatch protocol precisely. The key data structure (line 207):
# the entries for different batch descriptors that we need to capture cudagraphs for.
self.concrete_cudagraph_entries: dict[BatchDescriptor, CUDAGraphEntry] = {}
A dict of graphs keyed by batch shape. This is Constraint 1 (per-shape) made concrete. Your
mini_vllm.GraphRunner.graphs: dict[shape, GraphEntry] is the same structure.
A CUDAGraphEntry (line 128) is what we cache per shape:
@dataclass
class CUDAGraphEntry:
    batch_descriptor: BatchDescriptor
    cudagraph: torch.cuda.CUDAGraph | None = None
    output: Any | None = None
    # for cudagraph debugging, track the input addresses during capture,
    # and check if they are the same during replay
    input_addresses: list[int] | None = None
That input_addresses field is Constraint 2 (static buffers) made checkable: capture records
the input tensor addresses; replay, in debug mode, asserts they're unchanged. Your simulation
models this with the static_input buffer you must np.copyto into.
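The difference between copying into a buffer and rebinding a name is the whole of Constraint 2, and it fits in a few lines. This sketch uses a plain list with `id()` as a stand-in for a tensor and its `data_ptr()`, and slice assignment as a stand-in for `np.copyto`; `replay_ok` is a hypothetical check modeled on the debug assertion.

```python
static_input = [0.0, 0.0, 0.0]          # the buffer the "graph" recorded
captured_address = id(static_input)     # stand-in for tensor.data_ptr() at capture

def replay_ok(buf):
    """Mimic the replay-time check: is this the object that was captured?"""
    return id(buf) == captured_address

# Correct: overwrite the contents in place; the address is unchanged.
static_input[:] = [1.0, 2.0, 3.0]
in_place = replay_ok(static_input)

# Wrong: a fresh list is a different object, even with identical shape.
rebound = [4.0, 5.0, 6.0]
fresh = replay_ok(rebound)

print(in_place, fresh)  # True False
```

The graph would happily replay in the second case too; it would just read the stale contents of the old buffer, which is why the check exists.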
The dispatch: __call__ (line 233)
Walk it in three branches:
(a) No graph / mode mismatch → run eagerly (lines 234–254):
forward_context = get_forward_context()
batch_descriptor = forward_context.batch_descriptor
cudagraph_runtime_mode = forward_context.cudagraph_runtime_mode
if (cudagraph_runtime_mode == CUDAGraphMode.NONE
        or cudagraph_runtime_mode != self.runtime_mode):
    # profile run, warmup, no-cudagraph, OR a different wrapper's turn
    return self.runnable(*args, **kwargs)
The wrapper "blindly trusts" the mode + shape key set by the model runner in the
forward_context. If the runtime says NONE (profiling/warmup) or this isn't this wrapper's
mode, just run the real function. (This is how FULL and PIECEWISE wrappers can be nested and each
only fires for its own mode.) Your GraphRunner doesn't need modes, but the trust-the-context
pattern is why the wrapper stays decoupled from the compiler.
(b) Shape not seen → CAPTURE (lines 257–344):
if batch_descriptor not in self.concrete_cudagraph_entries:
    self.concrete_cudagraph_entries[batch_descriptor] = CUDAGraphEntry(batch_descriptor=...)
entry = self.concrete_cudagraph_entries[batch_descriptor]
if entry.cudagraph is None:
    validate_cudagraph_capturing_enabled()
    input_addresses = [x.data_ptr() for x in args if isinstance(x, torch.Tensor)]
    entry.input_addresses = input_addresses
    cudagraph = torch.cuda.CUDAGraph()
    ...
    with torch.cuda.graph(cudagraph, pool=self.graph_pool, stream=current_stream()):
        output = self.runnable(*args, **kwargs)  # the kernels are RECORDED, not just run
        if self.cudagraph_options.weak_ref_output:
            output = weak_ref_tensors(output)
    entry.output = weak_ref_tensors(output)
    entry.cudagraph = cudagraph
    compilation_counter.num_cudagraph_captured += 1
    return output  # return the REAL output on capture step
The with torch.cuda.graph(...) context is where CUDA records every kernel issued by
self.runnable(...) into cudagraph. The weak_ref_tensors dance (lines 325–336) is the
"mind-exploding" memory management the comment warns about: the output lives in the graph's
private memory pool, so vLLM holds only weak references to avoid leaking it while still letting
PyTorch manage the pool. Your simulation skips this (numpy has no pools) but captures the
structure: first sight of a shape → run once, record, cache.
(c) Shape seen → REPLAY (lines 346–361):
if self.is_debugging_mode:
    new_input_addresses = [x.data_ptr() for x in args if isinstance(x, torch.Tensor)]
    assert new_input_addresses == entry.input_addresses, (
        "Input addresses for cudagraphs are different during replay...")
...
entry.cudagraph.replay()
return entry.output
This is the entire win in two lines: entry.cudagraph.replay() issues one launch and the
GPU runs the whole recorded sequence; return the cached output tensor. Note the debug assertion —
it enforces Constraint 2 (inputs must be at the same addresses; the model runner guarantees this
by writing new inputs into persistent buffers before calling). Your GraphRunner.__call__ replay
branch is the direct analog: np.copyto(entry.static_input, x) then "replay" as a single
LaunchCounter.bump(1).
The whole class in one sentence: a per-shape dict where the first call captures and every later call with that shape replays — exactly your
mini_vllm.GraphRunner.
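The three branches, and the way nested wrappers each fire only for their own mode, can be traced with a toy that records which branch every call took. Everything here is a stand-in: strings replace the `CUDAGraphMode` members, an int replaces the `BatchDescriptor`, and `ToyWrapper` and `set_forward_context` are hypothetical names, not vLLM APIs. The toy always computes its result, so only the log differs between branches.

```python
from dataclasses import dataclass

@dataclass
class ForwardContext:
    runtime_mode: str       # "NONE", "PIECEWISE", or "FULL"
    batch_descriptor: int   # stand-in for the real shape key

_ctx = ForwardContext("NONE", 0)

def set_forward_context(mode, batch):
    """What the model runner does before each step (simplified)."""
    global _ctx
    _ctx = ForwardContext(mode, batch)

class ToyWrapper:
    def __init__(self, runnable, mode):
        self.runnable = runnable
        self.mode = mode
        self.seen = set()   # shape keys already "captured"
        self.log = []       # which branch each call took

    def __call__(self, x):
        if _ctx.runtime_mode == "NONE" or _ctx.runtime_mode != self.mode:
            self.log.append("eager")      # branch (a): not this wrapper's turn
        elif _ctx.batch_descriptor not in self.seen:
            self.seen.add(_ctx.batch_descriptor)
            self.log.append("capture")    # branch (b): first sight of this shape
        else:
            self.log.append("replay")     # branch (c): shape already captured
        return self.runnable(x)

piecewise = ToyWrapper(lambda x: x + 1, "PIECEWISE")
full = ToyWrapper(piecewise, "FULL")      # FULL wrapper nested around the PIECEWISE one

for mode, batch in [("NONE", 8), ("FULL", 8), ("FULL", 8), ("PIECEWISE", 16)]:
    set_forward_context(mode, batch)
    full(0)

print(full.log)        # ['eager', 'capture', 'replay', 'eager']
print(piecewise.log)   # ['eager', 'eager', 'eager', 'capture']
```

Neither wrapper inspects the other; each only compares the context's mode with its own, which is the decoupling the trust-the-context pattern buys.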
3. @support_torch_compile — the seam between a model and the compiler
vllm/compilation/decorators.py:118. Models opt in by decorating the class:
@support_torch_compile(dynamic_arg_dims={"x": 0, "y": 0})
class MyModel(nn.Module):
    def forward(self, x: torch.Tensor, y: Optional[torch.Tensor]): ...
What it does (read the docstring, 126–176): it wraps the class so that, when compilation is
enabled, the forward is run through torch.compile/the vLLM backend, and it marks which tensor
dimensions are dynamic (the batch/sequence dim) so the compiler treats them as variable
instead of baking one concrete size into the compiled code. dynamic_arg_dims says
"dimension 0 of x varies"; that's the batch dimension the CUDA-graph capture sizes range
over. If you don't pass it, vLLM infers it from the type
annotations (line 153): torch.Tensor args get dim 0 marked dynamic.
The important takeaway: adding compile support to a model is one decorator, and the dynamic dims you declare are what let the same compiled artifact serve many batch sizes (and what the CUDA-graph layer keys its captured graphs on). When you add a model in Phase 14, this decorator is part of the recipe.
4. The backend + passes (skim now, return later)
- vllm/compilation/backends.py: VllmBackend, the torch.compile backend Dynamo calls with the traced FX graph. It splits the graph at splitting_ops (attention) for piecewise compilation, compiles each piece with Inductor, caches the results, and arranges the pieces for piecewise CUDA-graph capture. This is the level-3 VLLM_COMPILE machinery.
- vllm/compilation/piecewise_backend.py: manages a single piecewise compiled region.
- vllm/compilation/passes/pass_manager.py + passes/fusion/: the custom graph passes, i.e. rewrites vLLM applies to the traced graph that stock Inductor wouldn't, e.g. fusing add + RMSNorm, fusing quantization into the preceding op, sequence-parallel rewrites. Each pass is an FX-graph-in, FX-graph-out transform. Reading one small fusion pass is a great way to see "graph-level transformation" concretely.
Your mini_vllm.PiecewiseGraphRunner models the split idea (break at uncapturable ops, capture
the rest) without the Inductor compilation — which is the part that matters for the mental model.
5. Where it's wired into the engine
The model runner (vllm/v1/worker/gpu_model_runner.py) is what:
- decides the cudagraph_runtime_mode for the current batch (FULL for pure decode, PIECEWISE for mixed, NONE during profiling/warmup) and the batch_descriptor (the shape key),
- sets them on the forward_context (which the CUDAGraphWrapper reads),
- writes the step's inputs into the persistent buffers the captured graph reads from (Constraint 2), padding the batch up to a captured size (Constraint 1),
- runs a warmup at startup that captures graphs for every size in cudagraph_capture_sizes.
Search gpu_model_runner.py for cudagraph and capture to see the warmup/capture loop and the
input-buffer copies. That's the production embodiment of everything above.
Reading checklist
One sentence each in your notebook:
- CompilationMode: what does level 3 (VLLM_COMPILE) add over stock torch.compile?
- CUDAGraphMode: why are FULL_AND_PIECEWISE / FULL_DECODE_ONLY encoded as tuples?
- concrete_cudagraph_entries: what is the key, and which constraint does that enforce?
- CUDAGraphEntry.input_addresses: which constraint, and when is it checked?
- __call__: name the three branches (eager / capture / replay) and their triggers.
- entry.cudagraph.replay(): why is this "the entire win"?
- @support_torch_compile dynamic_arg_dims: why does the compiler need to know the dynamic dimension?
Now build it: 02-mini-build.md, then the labs.
Phase 05 — Mini-Build: simulate CUDA-graph capture & replay
You'll build a CPU simulation of CUDA graphs that reproduces the one win and the two
constraints from the guide. No GPU, no torch — just numpy and a launch counter. The reference
lives in mini_vllm/cudagraph.py; write it yourself first against lab-01's stub + tests.
The trick that makes this teachable on a laptop: we don't time anything. We count launches.
LaunchCounter is a global tally standing in for per-op CPU launch overhead. Eager pays one per
op every call; a replay pays exactly one. That single number captures the entire point of CUDA
graphs.
Contents
The build, in order
1. LaunchCounter
A class with a class-level n, plus reset() and bump(k=1). This is your stand-in for CPU
launch overhead.
2. run_eager(ops, x)
Run a list of ops over x, bump(1) per op. Returns the result. This is the baseline: overhead
paid per op, every single call.
3. GraphRunner(ops) — capture once, replay forever
The core. __call__(x):
- key = x.shape.
- Capture (key unseen): copy x into a static_input buffer, run the ops (bump per op), cache a GraphEntry(shape, static_input, output, num_ops), return the output. (Constraint 1: graphs are keyed by shape.)
- Replay (key seen): np.copyto(entry.static_input, x) (Constraint 2: new inputs must land in the fixed buffer), recompute the ops from the buffer, bump(1) for the whole replay, return. (The win: one launch instead of len(ops).)

Expose num_captured.
4. PiecewiseGraphRunner(ops, is_capturable) — split at the attention analog
Build contiguous segments by grouping consecutive ops with the same is_capturable(i) value.
Wrap capturable segments in a GraphRunner; keep uncapturable segments as plain op lists run via
run_eager. __call__ threads x through the segments in order. Expose num_graphs (count of
capturable segments). This models PIECEWISE: capture the compiled regions, run attention eagerly.
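The four steps above can be sketched end to end in dependency-free Python. This is not the reference `mini_vllm/cudagraph.py`: plain lists stand in for numpy arrays, `len(x)` for `x.shape`, slice assignment for `np.copyto`, and the cached entry is reduced to just the static buffer. Treat it as one possible shape for the solution, best read after attempting the lab.

```python
class LaunchCounter:
    n = 0  # global tally standing in for CPU launch overhead

    @classmethod
    def reset(cls):
        cls.n = 0

    @classmethod
    def bump(cls, k=1):
        cls.n += k

def run_eager(ops, x):
    for op in ops:
        x = [op(v) for v in x]
        LaunchCounter.bump(1)            # eager pays one launch per op
    return x

class GraphRunner:
    def __init__(self, ops):
        self.ops = ops
        self.graphs = {}                 # shape key -> static input buffer

    @property
    def num_captured(self):
        return len(self.graphs)

    def __call__(self, x):
        key = len(x)                     # Constraint 1: one graph per shape
        if key not in self.graphs:
            self.graphs[key] = list(x)   # capture: allocate the static buffer
            return run_eager(self.ops, self.graphs[key])  # pay per-op launches once
        static_input = self.graphs[key]
        static_input[:] = x              # Constraint 2: copy into the fixed buffer
        LaunchCounter.bump(1)            # the win: one launch for the whole replay
        out = static_input
        for op in self.ops:
            out = [op(v) for v in out]
        return out

class PiecewiseGraphRunner:
    def __init__(self, ops, is_capturable):
        groups = []                      # consecutive ops with the same capturability
        for i, op in enumerate(ops):
            cap = is_capturable(i)
            if groups and groups[-1][0] == cap:
                groups[-1][1].append(op)
            else:
                groups.append((cap, [op]))
        self.segments = [(cap, GraphRunner(g) if cap else g) for cap, g in groups]
        self.num_graphs = sum(1 for cap, _ in self.segments if cap)

    def __call__(self, x):
        for cap, seg in self.segments:
            x = seg(x) if cap else run_eager(seg, x)
        return x

ops = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
runner = GraphRunner(ops)
LaunchCounter.reset()
runner([1, 2])
capture_launches = LaunchCounter.n                    # 3: one per op
result = runner([5, 6])
replay_launches = LaunchCounter.n - capture_launches  # 1: the whole replay
piecewise = PiecewiseGraphRunner(ops, lambda i: i != 1)
print(capture_launches, replay_launches, result, piecewise.num_graphs)  # 3 1 [9, 11] 2
```

Splitting the 3-op model at the middle op leaves two capturable runs around one eager op, hence `num_graphs == 2`.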
Definition of done
pytest mini_vllm/test_cudagraph.py -q # the reference suite (7 tests)
pytest phase-05-cuda-graphs-and-torch-compile/labs -q
Then answer in your notebook, citing mini_vllm/cudagraph.py lines:
- Which line is Constraint 1 (per-shape dispatch)? Which is Constraint 2 (static buffer copy)? Which is the win (single launch on replay)?
- In PiecewiseGraphRunner, why does num_graphs == 2 when you split a 3-op model at the middle op? (Two capturable runs surround one eager op.)
Map your toy to the real engine
| your mini_vllm/cudagraph.py | real vLLM |
|---|---|
| GraphRunner.graphs: dict[shape, GraphEntry] | CUDAGraphWrapper.concrete_cudagraph_entries: dict[BatchDescriptor, CUDAGraphEntry] (cuda_graph.py:207) |
| entry.static_input + np.copyto | CUDAGraphEntry.input_addresses + persistent input buffers (cuda_graph.py:135, :346) |
| capture branch | with torch.cuda.graph(...) (cuda_graph.py:313) |
| replay branch (bump(1)) | entry.cudagraph.replay() (cuda_graph.py:360) |
| PiecewiseGraphRunner split | piecewise compilation/capture, split at attention (backends.py) |
Stretch (optional)
- Padding to capture sizes. Real vLLM only captures graphs for a fixed set of batch sizes and pads odd sizes up. Add a capture_sizes=[1,2,4,8] to GraphRunner: round x's batch dim up to the nearest capture size before keying, so batch 5 and 7 both reuse the size-8 graph. Count how many distinct graphs you capture across batches 1..8 with and without padding.
- A fusion pass. Add an optional graph-rewrite step that fuses two adjacent elementwise ops into one (e.g. +1 then *2 → one op) and show it reduces launches even in eager mode: the torch.compile half of the story.
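For the padding stretch, the rung lookup is a one-line binary search. This sketch assumes a sorted ladder and returns None when a batch exceeds the largest rung; `pad_to_capture_size` is a hypothetical helper, and the second ladder below is made up for illustration rather than taken from vLLM's defaults.

```python
import bisect

def pad_to_capture_size(batch, capture_sizes):
    """Smallest captured size >= batch, or None if the batch exceeds the ladder."""
    i = bisect.bisect_left(capture_sizes, batch)
    return capture_sizes[i] if i < len(capture_sizes) else None

capture_sizes = [1, 2, 4, 8]
padded = {b: pad_to_capture_size(b, capture_sizes) for b in range(1, 9)}
graphs_with_padding = len(set(padded.values()))       # distinct shapes actually captured
graphs_without_padding = len(padded)                  # one graph per raw batch size
wasted_rows = sum(p - b for b, p in padded.items())   # padding computed and thrown away

print(padded[5], padded[7])                           # 8 8
print(graphs_with_padding, graphs_without_padding)    # 4 8
print(wasted_rows)                                    # 7

# An assumed ladder with rungs every 8 explains "batch of 33 running at 40".
ladder = [1, 2, 4] + list(range(8, 65, 8))
print(pad_to_capture_size(33, ladder))                # 40
```

Halving the number of captures costs 7 wasted rows across batches 1..8, which is the denser-ladder-versus-more-captures trade in miniature.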
Phase 05 Labs — CUDA Graphs & torch.compile
Five labs on the machinery that turns a launch-bound decode loop into replayed recordings. The arc: build capture/replay and meet its two constraints (lab-01), derive the economics — crossover and ceiling (lab-02), solve the variable-batch problem with the capture-size ladder (lab-05), route decode vs mixed batches with the mode dispatch (lab-03), then measure the whole stack on real silicon (lab-04).
Recommended order: 01 → 02 → 05 → 03 → 04. (Directory numbers predate lab-05; the order runs
mechanism, economics, the two policy layers, then the measurement.) CPU labs follow
the standard contract — starter.py (your work), solution.py (reference), test_lab.py
(the spec); default runs the solution, LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-05-cuda-graphs-and-torch-compile/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-01-graph-replay-simulator -q
Contents
- lab-01-graph-replay-simulator [CPU-OK]
- lab-02-launch-overhead [CPU-OK]
- lab-03-cudagraph-mode [CPU-OK]
- lab-04-graph-vs-eager-real [GPU-REQ]
- lab-05-capture-sizes [CPU-OK]
- What you can do after this phase
Labs
lab-01-graph-replay-simulator [CPU-OK]
Build capture/replay on CPU: a runner that records an op sequence per shape and replays
it as a single launch, copying new inputs into a static buffer first. The two infamous
constraints (fixed shape, fixed addresses) fall out as the direct price of the win — a
graph is a recording, and recordings don't take arguments. Mirrors
mini_vllm/cudagraph.py and the real CUDAGraphWrapper. Skills: capture vs replay
accounting; copyto-not-rebind; shape-keyed dispatch.
lab-02-launch-overhead [CPU-OK]
The economics as closed forms: break-even at the second call, asymptotic speedup = number of ops captured, dilution by Amdahl. Turns "graphs help low-batch decode" into formulas you can defend and extrapolate, including when not to bother (single fused op, never-repeating shapes). Skills: crossover/ceiling analysis; the upfront-cost-amortized pattern that recurs in every compilation decision.
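Those three closed forms can be written down directly. The sketch assumes the launch-counting model of the mini-build (capture pays one launch per op, each replay pays one) and the textbook Amdahl form for dilution; the function names and the 0.6 launch fraction are illustrative, not measurements.

```python
def eager_launches(num_ops, calls):
    return num_ops * calls

def graph_launches(num_ops, calls):
    """First call captures (pays num_ops); every later call replays for 1."""
    return num_ops + (calls - 1) if calls > 0 else 0

def launch_speedup(num_ops, calls):
    return eager_launches(num_ops, calls) / graph_launches(num_ops, calls)

def end_to_end_speedup(launch_fraction, num_ops):
    """Amdahl: only the launch-bound fraction of step time shrinks, by num_ops at best."""
    return 1.0 / ((1.0 - launch_fraction) + launch_fraction / num_ops)

# Break-even: a tie on the first call, graphs ahead from the second.
print(eager_launches(300, 1), graph_launches(300, 1))   # 300 300
print(eager_launches(300, 2), graph_launches(300, 2))   # 600 301

# Ceiling: the launch speedup approaches the number of ops captured.
print(launch_speedup(300, 10**6) < 300)                 # True

# Dilution: if an assumed 60% of step time is launch overhead, the end-to-end gain is far below 300x.
print(round(end_to_end_speedup(0.6, 300), 2))           # 2.49
```

With a single op (`num_ops == 1`) every expression collapses to a speedup of 1, which is the "when not to bother" case: there is nothing to amortize.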
lab-03-cudagraph-mode [CPU-OK]
Reimplement the CUDAGraphMode enum's dispatch: composite modes as (decode, mixed)
tuples, FULL for uniform decode batches, PIECEWISE (split at attention) for ragged mixed
ones, and the compile-time dependency requires_piecewise_compilation guards. The ten
lines where chunked prefill, attention metadata, and graph constraints reconcile.
Skills: the routing table; compile-time vs run-time configuration; reading the
two-pass capture log.
lab-04-graph-vs-eager-real [GPU-REQ]
The validation: enforce_eager=True vs default at batch 1/8/64 on an L4 — 2.5× fading
to 1.13×, exactly lab-02's curve, plus the capture log showing lab-03's two routines and
lab-05's 23-rung ladder. Annotated capture included for the GPU-less. Skills:
falsifiable-prediction benchmarking; extrapolating to other models/hardware; when
enforce_eager is right (tests, debugging) vs wrong (serving).
lab-05-capture-sizes [CPU-OK]
The variable-batch problem: capture a ladder of sizes, pad every batch up to the
nearest rung. Implement the ladder, the lookup, and the waste accounting — answering the
production FAQ "why is my batch of 33 running at 40?" and quantifying the
denser-ladder-vs-more-captures trade. Skills: padding as the price of replay; bucketing
continuous quantities; reading cudagraph_capture_sizes.
What you can do after this phase
Explain CUDA graphs as a systems mechanism (recording + static buffers + shape dict)
rather than GPU folklore; predict graph benefits for a given model size, batch
distribution, and hardware before measuring; decode every line of vLLM's capture-time
logging; tune cudagraph_mode and cudagraph_capture_sizes from workload evidence; and
know exactly what enforce_eager=True trades, in both directions. Phase 6 changes what's
inside the kernels (quantization); Phase 8's draft models are where graph mastery pays
double.
Lab 05-01 — Build the Capture/Replay Simulator [CPU-OK]
Here's the absurdity CUDA graphs exist to fix: a decode step for a small model can spend
more time on the CPU — Python dispatch, kernel argument marshaling, cudaLaunchKernel
calls, one per operation, hundreds per step — than the GPU spends computing. The GPU
finishes each tiny kernel and idles, waiting for the next launch to arrive. CUDA graphs
fix it by recording the whole kernel sequence once and replaying it as a single
launch. In this lab you build that mechanism on CPU — capture, shape-keyed dispatch,
static buffers, replay — and in doing so you'll discover that its two infamous
constraints aren't incidental limitations but the direct price of the win.
Contents
- Why this lab exists
- Background: one win, two constraints
- Files
- Run
- What to implement
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
CUDA graphs have a reputation as deep GPU arcana, and the reputation is wrong: the
mechanism is pure systems — a cache of recorded work, keyed by shape, replayed from
fixed memory — and it simulates perfectly on a laptop. What's genuinely hard about graphs
in production is not the replay; it's the discipline the constraints impose on
everything else: every batch must arrive at a captured shape (lab-05's padding ladder),
every input must be written into the same buffers (the input_addresses checks upstream),
and anything dynamic — like attention over varying sequence lengths — must either be
made shape-stable or cut out of the graph (lab-03's piecewise modes). You can't reason
about any of that machinery until the core capture/replay contract is in your fingers.
That's this lab.
The simulator you build mirrors mini_vllm/cudagraph.py, which itself mirrors the real
CUDAGraphWrapper (upstream/vllm/compilation/cuda_graph.py) — same per-shape dict,
same static-buffer copy, same single-launch accounting. The launch counter stands in for
wall-clock CPU overhead, for the usual course reason: a counter gives you formulas
(lab-02 derives them), a stopwatch gives you noise.
Background: one win, two constraints
- The WIN — eager execution pays one launch per op, every call. A captured graph pays the full cost once (capture), then one launch per replay regardless of how many kernels are inside. For a 300-kernel decode step replayed thousands of times per second, that's the difference lab-04 measures at ~2.5× end-to-end.
- CONSTRAINT 1 (fixed shape) — the recording bakes in every tensor size, grid dimension, and memory extent. A different batch size is a different recording. Hence: graphs are stored in a dict keyed by shape (upstream: concrete_cudagraph_entries keyed by BatchDescriptor), and unseen shapes must capture anew.
- CONSTRAINT 2 (static buffers) — the recording bakes in addresses. Replay reads the same input memory it was captured from, so new inputs must be copied into the captured buffer before replay (upstream asserts this: the input_addresses consistency check). Forget the copy and the graph happily recomputes last step's batch — the classic graph bug, and test_static_buffer_reflects_new_input exists to make you commit it once, here, where it's cheap.
Both constraints are the same fact stated twice: a graph is a recording, not a program. Recordings don't take arguments.
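A minimal sketch of that contract, assuming numpy arrays stand in for device tensors and a module-level counter stands in for kernel launches. The names `TinyGraphCache`, `launch`, and `launches` are illustrative only; they are not the lab's API or upstream's.

```python
import numpy as np

launches = 0  # hypothetical stand-in for the number of kernel launches


def launch(op, x):
    """Run one op and count one launch, as eager execution would."""
    global launches
    launches += 1
    return op(x)


class TinyGraphCache:
    """Shape-keyed cache of 'recordings', each with its own static input buffer."""

    def __init__(self, ops):
        self.ops = ops
        self.entries = {}  # shape -> static input buffer

    def __call__(self, x):
        global launches
        buf = self.entries.get(x.shape)
        if buf is None:
            # Capture: allocate the static buffer and pay one launch per op.
            buf = x.copy()
            self.entries[x.shape] = buf
            out = buf
            for op in self.ops:
                out = launch(op, out)
            return out
        # Replay: copy INTO the captured buffer, then pay a single launch.
        np.copyto(buf, x)
        launches += 1
        out = buf
        for op in self.ops:
            out = op(out)  # numpy cannot record, so recompute without counting
        return out


ops = [lambda a: a + 1, lambda a: a * 2, lambda a: a - 3]
g = TinyGraphCache(ops)

g(np.ones(4))
capture_cost = launches                                   # one launch per op
out = g(np.full(4, 5.0))
replay_cost = launches - capture_cost                     # a single launch
g(np.ones(8))
recapture_cost = launches - capture_cost - replay_cost    # new shape pays again
```

Both constraints show up as code: the dict key is the shape, and the only way new data reaches a replay is the copy into the stored buffer.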
Files
- starter.py — LaunchCounter, run_eager, and GraphRunner stubbed. Your work.
- solution.py — reference (mirrors mini_vllm/cudagraph.py).
- test_lab.py — the win, both constraints, correctness, and the 100-call accounting.
Run
LAB_IMPL=starter pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-01-graph-replay-simulator -q
pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-01-graph-replay-simulator -q # reference
What to implement
- LaunchCounter — class-level n, reset(), bump(k=1). (Global on purpose: launch overhead is a process-wide resource, which is also why one slow Python step stalls every request in the batch.)
- run_eager(ops, x) — bump once per op, every call.
- GraphRunner(ops).__call__(x):
  - Capture (shape unseen): copy x into static_input, run ops (bump each), cache a GraphEntry, return the output.
  - Replay (shape seen): np.copyto(entry.static_input, x) — into the existing buffer, never rebind the reference — recompute from the buffer, bump(1) total, return.
What the tests prove
| Test | What it pins |
|---|---|
| test_eager_pays_one_launch_per_op | The baseline cost model |
| test_capture_then_replay_is_one_launch | The WIN: capture = len(ops), replay = exactly 1 |
| test_replay_output_matches_eager | Replay is an optimization, not a behavior change — the course's master invariant, graph edition |
| test_static_buffer_reflects_new_input | Constraint 2: capture with value 1, replay with value 5, get 50 — the copy-into-buffer is live |
| test_new_shape_triggers_recapture | Constraint 1: shape (8,) after shape (4,) pays full capture; both entries coexist in the dict |
| test_graphs_win_when_overhead_dominates | 100 calls: 300 eager launches vs 102 graph launches — the amortization lab-02 turns into formulas |
Hitchhiker's notes
- np.copyto(buf, x) vs buf = x is the whole lab. Rebinding the Python name does nothing to the captured memory; the real API has the same trap (you must static_tensor.copy_(new) in PyTorch graph idiom, never reassign). If you remember one line from this phase, make it this one.
- Find your three lines upstream: capture (cuda_graph.py:313, inside torch.cuda.graph(...)), replay (:360, entry.cudagraph.replay()), the per-shape dict (:207). The production wrapper adds warmup runs before capture (CUDA needs the allocator and autotuners settled), a memory pool shared across graphs, and debug-mode address assertions — engineering around exactly the two constraints you implemented.
- What can't be captured at all? Anything whose control flow depends on data: CPU-side branching, dynamic shapes inside the sequence, unsupported ops (some collectives, host syncs). vLLM's answer is to compile the model into a shape-stable form first (torch.compile, with attention marked as a splitting op) — graphs are the last stage of the compilation pipeline, not a standalone trick. That pipeline is the deep-dive's subject; lab-03 handles the mode routing it produces.
- Replay still runs the ops here (numpy has no real recording) — the simulation's one honest cheat. The accounting (one launch) models the real benefit; the real replay also skips Python entirely, which is why the measured win (lab-04) can exceed what launch-counting alone predicts.
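The rebinding trap from the first note fits in a dozen lines. This is a sketch with plain numpy, assuming a single hypothetical "captured" op that multiplies by 10; `captured_ref` stands for the address the recording baked in.

```python
import numpy as np

static_input = np.array([1.0, 1.0, 1.0, 1.0])  # buffer the "graph" was captured with
captured_ref = static_input                     # the memory the recording points at


def replay():
    # A recording only ever reads the memory it was captured from.
    return captured_ref * 10


# Wrong: rebinding the Python name leaves the captured memory untouched.
static_input = np.full(4, 5.0)
stale = replay()

# Right: write the new values into the captured buffer in place.
np.copyto(captured_ref, np.full(4, 5.0))
fresh = replay()
```

`stale` still reflects the value 1 from capture time, while `fresh` reflects the new input. Only the in-place copy changed what the replay saw.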
Going further
- Add an input_addresses assertion to your replay path (store id(entry.static_input) at capture; assert it unchanged at replay) — you've reproduced upstream's debug check, and you'll appreciate why it exists the first time you "optimize" the copy away.
- Give GraphRunner a memory budget: each entry costs prod(shape) bytes; evict LRU when over budget. Now you have the graph-pool problem, and a feel for why upstream shares one memory pool across all captured sizes instead.
- Wire your GraphRunner around mini_vllm's ToyModel.forward for fixed batch sizes and count launches across a full generate() — the engine-level integration upstream does in the model runner.
References
- mini_vllm/cudagraph.py — the annotated simulator this lab rebuilds, with upstream line references throughout.
- upstream/vllm/compilation/cuda_graph.py — CUDAGraphWrapper: capture, replay, BatchDescriptor dict, address checks.
- NVIDIA, Getting Started with CUDA Graphs — the original motivation and API: https://developer.nvidia.com/blog/cuda-graphs/
- PyTorch docs, CUDA Graphs (torch.cuda.CUDAGraph) — the idiom vLLM builds on, including the static-buffer pattern: https://pytorch.org/docs/stable/notes/cuda.html#cuda-graphs
- Phase 0 lab-04 — why small-batch decode is launch-overhead territory in the first place.
Lab 05-02 — Launch Overhead & the Eager↔Graph Crossover [CPU-OK]
Lab-01 gave you the mechanism; this lab gives you the economics. Capturing a graph isn't free — the first call pays full overhead, plus (in reality) warmup and memory. So when does the investment pay off, and how big can the payoff get? You'll derive both answers as closed-form formulas — the crossover point and the asymptotic speedup — and they're worth deriving because they're the difference between "graphs are good" (folklore) and knowing for which workloads, by how much, and when not to bother (engineering). Spoiler for the impatient: break-even at the second call, asymptotic speedup = the number of ops captured — and both facts have production consequences listed below.
Contents
- Why this lab exists
- The model (launches as a proxy for CPU overhead)
- Files
- Run
- The two results you should be able to state cold
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Every caching/compilation decision in systems — JIT vs interpret, memoize vs recompute, capture vs eager — has the same shape: an upfront cost amortized over repeats. The two numbers that decide it are always the same two you'll derive here: how soon does it break even (crossover) and what's the ceiling (asymptotic ratio). This lab drills the pattern on the cleanest possible instance, where both answers are exact integers. After it, you'll recognize the same analysis inside torch.compile's warmup tradeoffs, lab-05's ladder-density question, and Phase 18's "is this optimization worth its startup cost" recurring decision.
It also arms you for the most common graphs-related production question: "should I set
enforce_eager=True to speed up startup?" The formulas say precisely what that trades
away, per decode step, forever — and lab-04 confirms the prediction on silicon.
The model (launches as a proxy for CPU overhead)
- Eager, k calls of an num_ops-op model: k × num_ops launches.
- Graph: first call captures at cost capture_cost_ops (default num_ops), every later call replays in 1: total capture_cost_ops + (k − 1).
One unit ≈ one kernel launch ≈ some microseconds of CPU time. The model deliberately ignores GPU compute time — which is exactly why its predictions hold when launches dominate (small-batch decode) and fade when they don't (large batch, prefill). Knowing a model's domain of validity is part of the lab; lab-04's batch-64 numbers show the fade.
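The model is small enough to write down directly. The function names below match the ones listed for starter.py, but the signatures, the `max_calls` search bound, and the brute-force crossover search are assumptions made for illustration, not the reference solution.

```python
def eager_launches(k, num_ops):
    """Total launches for k eager calls: one launch per op, every call."""
    return k * num_ops


def graph_launches(k, num_ops, capture_cost_ops=None):
    """Total launches for k graph calls: capture once, then one per replay."""
    if k <= 0:
        return 0
    if capture_cost_ops is None:
        capture_cost_ops = num_ops  # default: capture costs one eager pass
    return capture_cost_ops + (k - 1)


def crossover(num_ops, capture_cost_ops=None, max_calls=10_000):
    """First call count at which the graph total is strictly below eager."""
    for k in range(1, max_calls + 1):
        if graph_launches(k, num_ops, capture_cost_ops) < eager_launches(k, num_ops):
            return k
    return None  # never breaks even within max_calls


totals_100_calls = (eager_launches(100, 3), graph_launches(100, 3))
break_even_default = crossover(3)
break_even_single_op = crossover(1)
break_even_costly_capture = crossover(300, capture_cost_ops=900)
```

With a 3-op model, 100 calls cost 300 launches eager and 102 with a graph, the same accounting lab-01's final test pins. Tripling the capture cost moves the break-even a couple of calls later and nothing more.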
Files
- starter.py — implement eager_launches, graph_launches, crossover, asymptotic_speedup. Your work.
- solution.py — reference.
- test_lab.py — pins the formulas, the second-call break-even, the num_ops == 1 degenerate case, and the asymptote.
Run
LAB_IMPL=starter pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-02-launch-overhead -q
pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-02-launch-overhead -q # reference
The two results you should be able to state cold
- Crossover. With capture_cost_ops = num_ops > 1, the graph total beats eager from the 2nd call onward: capture costs one eager-pass-worth, and every replay after saves num_ops − 1. Graphs are not a long-game investment — they pay back almost immediately provided the shape repeats. The real risk is never "capture was too expensive"; it's "the shape never came back" (which is why capture sizes exist — lab-05 — and why wildly dynamic workloads gain little).
- Asymptotic speedup. As k → ∞, per-call launches → num_ops (eager) vs 1 (graph): the launch-overhead speedup approaches the number of ops captured. A model with 300 kernels per decode step has a 300× launch speedup ceiling — diluted in wall-clock by the GPU work that remains (Amdahl), which is why lab-04 measures 2.5×, not 300×. Bigger models → more GPU work per launch → more dilution, so graphs buy less; smaller models → launch-bound → graphs are load-bearing. Phase 0 lab-04's 8 ms-vs-1 ms step analysis, now with the fix attached.
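The Amdahl dilution can be sketched with a two-term step-time model. Everything numeric here is an assumption for illustration: the 5 µs per-launch cost and the two GPU times are made up, and the model treats CPU launch time and GPU compute as strictly serial, which real asynchronous launches are not.

```python
def wallclock_speedup(gpu_us, num_launches, launch_us):
    """Eager step time over graph step time under a serial CPU+GPU model."""
    eager_step = gpu_us + num_launches * launch_us
    graph_step = gpu_us + launch_us  # a replay is still one launch
    return eager_step / graph_step


N_OPS = 300       # kernels per decode step (the figure used in the text)
LAUNCH_US = 5.0   # ASSUMED per-launch CPU cost, not a measurement

# Hypothetical small step: 0.5 ms of GPU work behind 300 launches.
small_step = wallclock_speedup(gpu_us=500.0, num_launches=N_OPS, launch_us=LAUNCH_US)
# Hypothetical large step: 20 ms of GPU work behind the same launch bill.
large_step = wallclock_speedup(gpu_us=20_000.0, num_launches=N_OPS, launch_us=LAUNCH_US)
# Zero GPU work recovers the pure launch-count ceiling.
ceiling = wallclock_speedup(gpu_us=0.0, num_launches=N_OPS, launch_us=LAUNCH_US)
```

The ordering is the point, not the constants: the ceiling equals `N_OPS`, and every microsecond of GPU work pulls the measured ratio down toward 1 without ever crossing it.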
What the tests prove
| Test | What it pins |
|---|---|
| formula tests | eager = k·n, graph = capture + (k−1) — exact, no off-by-ones (an off-by-one here is a wrong capacity claim later) |
| crossover tests | break-even at call 2 for n > 1; never for n == 1 (one launch either way — a single fused megakernel gains nothing from graphs, an endpoint worth knowing) |
| asymptote test | per-call ratio → n from below, monotonically |
Hitchhiker's notes
- What capture_cost_ops > num_ops models: real capture includes warmup passes (allocator, autotuners), stream synchronization, and graph instantiation — typically several eager-passes-worth. It shifts the crossover later by a few calls; it never changes the asymptote. The real engine moves this entire cost to startup (capturing the whole ladder before serving — lab-04's 7-second log line), so steady-state traffic sees only replays: the crossover question is answered "before the first request."
- Why the speedup ceiling is num_ops and not infinite: a replay is still one launch. The only way past it is fewer-than-one launch per step — batching multiple steps per launch — which exists (multi-step scheduling / async scheduling in vLLM's history) and brings its own complications. Ceilings tell you where the next optimization frontier is; that's their real use.
- torch.compile plays the same game one level up: compilation cost (seconds to minutes, cached to disk in vLLM via the compilation cache) amortized over runs; kernel fusion reduces num_ops itself, which lowers what graphs have left to save. Fusion and capture are complementary attacks on the same k × num_ops bill — fusion shrinks num_ops, capture shrinks its coefficient. The deep-dive's pipeline (compile → piecewise split → capture) is exactly this composition.
- The formulas assume the shape repeats. Per-shape accounting is multiplicative: every distinct batch size runs its own crossover race. A uniform-traffic deployment amortizes a handful of shapes beautifully; a chaotic one spreads k thin across many shapes. That's the bridge to lab-05 — the ladder exists to concentrate k onto few shapes by padding.
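The per-shape point in the last note can be made concrete with a trace of batch sizes. This sketch assumes capture costs one eager pass per distinct shape and no padding ladder; the helper names and both traces are hypothetical.

```python
from collections import Counter


def eager_launches_for_trace(batch_sizes, num_ops):
    """Eager pays num_ops launches on every step, whatever the shape."""
    return len(batch_sizes) * num_ops


def graph_launches_for_trace(batch_sizes, num_ops):
    """Each distinct shape runs its own race: capture once, then replay."""
    calls_per_shape = Counter(batch_sizes)
    return sum(num_ops + (k - 1) for k in calls_per_shape.values())


NUM_OPS = 300
uniform_trace = [8] * 100             # one shape, seen 100 times
chaotic_trace = list(range(1, 101))   # 100 shapes, each seen once

uniform = (eager_launches_for_trace(uniform_trace, NUM_OPS),
           graph_launches_for_trace(uniform_trace, NUM_OPS))
chaotic = (eager_launches_for_trace(chaotic_trace, NUM_OPS),
           graph_launches_for_trace(chaotic_trace, NUM_OPS))
```

The uniform trace collapses to one capture plus 99 replays. The chaotic trace never replays, so the graph total equals the eager total and every capture was wasted effort.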
Going further
- Plot per-call cost vs k for n ∈ {3, 30, 300} with capture_cost = 3n. Mark the crossovers. This single chart is the "should we graph it?" conversation, pre-had.
- Add a second resource to the model: each captured graph costs M memory, and you have budget B. Combined with lab-05's ladder, derive the optimal number of rungs for a given traffic distribution — you've reinvented the actual config-tuning problem.
- Measure a real launch: time torch.mm on tiny tensors CPU-side (or just a no-op Python function call stack 300 deep) and put microseconds to the unit. The model stays the same; the constants acquire meaning.
References
- Lab-04 — the formulas, confirmed on an L4: ~2.5× at batch 1, fading to 1.13× at 64.
- upstream/vllm/compilation/cuda_graph.py — where capture cost is actually paid.
- NVIDIA, CUDA Graphs blog — measured launch overheads that set the unit: https://developer.nvidia.com/blog/cuda-graphs/
- Phase 0 lab-04 — the roofline reason launch overhead only matters at small step sizes.
- Hennessy & Patterson, Computer Architecture — Amdahl's law, the reason ceilings dilute; any edition, the first chapter.
Lab 05-03 — Reimplement the CUDAGraphMode Dispatch [CPU-OK]
Labs 01–02 established that graphs love uniform, repeated shapes. Now meet the batch that
hates them: a mixed batch — Phase 3's chunked prefill riding alongside decodes, every
step a different ragged collection of sequence lengths flowing into attention. A FULL
graph can't swallow that. vLLM's answer is a small but consequential piece of policy: the
CUDAGraphMode enum (upstream/vllm/config/compilation.py:53), which routes pure-decode
batches and mixed batches to different graph strategies — and whose composite values
(FULL_AND_PIECEWISE, the V1 default) are the reason your lab-04 capture log shows two
capture passes. You'll reimplement its dispatch methods exactly, because this tiny enum is
where three phases of machinery (chunked prefill, attention metadata, graph constraints)
get reconciled in about ten lines.
Contents
- Why this lab exists
- Background (read first)
- Files
- Run
- What you must reproduce
- The routing you're proving
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Most engineers meet cudagraph_mode as a config string they cargo-cult when something
breaks ("try PIECEWISE"). The enum deserves better: it's a textbook example of encoding
a two-dimensional policy in a one-dimensional config, and the dispatch methods you'll
write are the decoder ring. Once you've implemented decode_mode/mixed_mode/has_mode
yourself, every graphs-related symptom maps to a row of the routing table: capture log has
one pass instead of two → someone set FULL; mixed batches mysteriously slow → mode is
FULL_DECODE_ONLY and prefill steps run eager; compile time doubled → the mode
requires_piecewise_compilation and the model was split at attention.
There's also a compile-time/run-time lesson here that generalizes: some of these flags must be known before the model is compiled (you can't piecewise-replay a graph that wasn't piecewise-compiled), so the enum is consulted in two different epochs of the engine's life. Configuration that crosses epochs is where the subtle bugs live — this lab makes the two consumers explicit.
Background (read first)
class CUDAGraphMode(enum.Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_DECODE_ONLY = (FULL, NONE)         # full graph for decode, no graph for mixed
    FULL_AND_PIECEWISE = (FULL, PIECEWISE)  # full for decode, piecewise for mixed (V1 default)
The composite modes are tuples (decode_mode, mixed_mode). Why the split:
- A pure-decode batch is the graph's dream: every request contributes exactly one token, shapes are uniform (padded to a ladder rung — lab-05), attention metadata is regular. Safe for a FULL graph — the entire step, attention included, one replay.
- A mixed batch (prefill chunks + decodes) has per-request query lengths, ragged attention metadata, varlen kernels — exactly what a recording can't generalize over. Options: no graph at all (NONE), or PIECEWISE — capture the shape-stable runs between attention calls and run attention eagerly. Piecewise is the compromise that keeps most of the launch win (the hundreds of small ops around attention) while letting the one genuinely dynamic op stay dynamic.
PIECEWISE requires the model to have been compiled with attention as a splitting op
(torch.compile carves the graph at splitting_ops) — that's the compile-time dependency
requires_piecewise_compilation guards.
Files
- starter.py — implement separate_routine, decode_mode, mixed_mode, has_mode, requires_piecewise_compilation, runtime_mode_for. Modes are strings; composites live in a ROUTINES dict. Your work.
- solution.py — reference.
- test_lab.py — the full routing table, every mode × both batch kinds.
Run
LAB_IMPL=starter pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-03-cudagraph-mode -q
pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-03-cudagraph-mode -q # reference
What you must reproduce
- separate_routine(m) → is m composite (distinct decode/mixed routines)?
- decode_mode(m) / mixed_mode(m) → the concrete mode for that batch kind (composites split; simple modes return themselves for both — FULL means full everywhere, which is why it's only safe with chunked prefill disabled or padded-prefill tricks).
- has_mode(m, target) → does m employ target in any routine?
- requires_piecewise_compilation(m) → has_mode(m, PIECEWISE).
- runtime_mode_for(m, is_decode) → the per-batch dispatch the wrapper performs each step (upstream: the BatchDescriptor.uniform_decode flag selecting the entry).
The routing you're proving
| mode | decode batch | mixed batch | needs piecewise compile? |
|---|---|---|---|
| NONE | NONE | NONE | no |
| PIECEWISE | PIECEWISE | PIECEWISE | yes |
| FULL | FULL | FULL | no |
| FULL_DECODE_ONLY | FULL | NONE | no |
| FULL_AND_PIECEWISE | FULL | PIECEWISE | yes |
Memorize the last row — it's the default, and both lab-04 capture passes
((decode, FULL) and (mixed prefill-decode, PIECEWISE)) are its two cells.
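One way to regenerate the table is to derive it from the enum as quoted in the Background. This is a sketch on the enum itself, written as free functions for brevity; the lab's starter uses strings and a ROUTINES dict instead, and upstream's method bodies may differ from these.

```python
import enum


class CUDAGraphMode(enum.Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_DECODE_ONLY = (FULL, NONE)
    FULL_AND_PIECEWISE = (FULL, PIECEWISE)


def separate_routine(mode):
    """Composite modes carry a (decode, mixed) tuple as their value."""
    return isinstance(mode.value, tuple)


def decode_mode(mode):
    return CUDAGraphMode(mode.value[0]) if separate_routine(mode) else mode


def mixed_mode(mode):
    return CUDAGraphMode(mode.value[1]) if separate_routine(mode) else mode


def requires_piecewise_compilation(mode):
    """True when either routine replays piecewise graphs."""
    return CUDAGraphMode.PIECEWISE in (decode_mode(mode), mixed_mode(mode))


# mode name -> (decode routine, mixed routine, needs piecewise compile?)
routing = {
    m.name: (decode_mode(m).name, mixed_mode(m).name, requires_piecewise_compilation(m))
    for m in CUDAGraphMode
}
```

Inside the class body, `FULL` and `NONE` are still the raw integers, so the composite values are plain tuples such as `(2, 0)`. Looking each half up with `CUDAGraphMode(...)` recovers the simple mode.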
Hitchhiker's notes
- Why is FULL_AND_PIECEWISE the default and not plain FULL? Because chunked prefill is default-on (Phase 3): mixed batches are the common case, not the exception, and a FULL-only config would either crash on them or force eager. The default encodes the workload assumption; change the workload assumption (e.g. a decode-only disaggregated worker — Phase 15) and FULL_DECODE_ONLY becomes the rational pick. Configs are workload claims in disguise.
- Where the dispatch actually happens: per step, the runner builds a BatchDescriptor (batch size + uniform-decode flag); the graph wrapper keys its entry dict on it (lab-01's dict, now two-dimensional). Your runtime_mode_for(m, is_decode) is that lookup's policy half.
- What "attention runs eagerly" costs in PIECEWISE: one-ish launches per attention per layer per step, vs the hundreds saved elsewhere. That's why piecewise keeps most of the win — and why backends that support graph-safe attention metadata (uniform decode) unlock FULL for the decode half, which is the entire point of the composite.
- Failure smell catalog: capture log shows one pass → not the default mode; OOM during capture → ladder too long × two routines (lab-05's memory cost, doubled); "piecewise compilation required" assertion → mode demands PIECEWISE but compilation level didn't split. Ten lines of enum, three distinct production symptoms.
Reflect
- Why can't the runtime "just check if the batch is uniform and use FULL when it can" without any enum? (It does check — that's runtime_mode_for. The enum exists for the compile-time half: whether to split at attention must be decided before any batch arrives. Runtime flexibility is bounded by compile-time commitments.)
- A team disables chunked prefill entirely and serves short prompts only. Which mode maximizes their throughput, and what new risk do they take? (FULL — every batch can be graph-shaped now; the risk is any stray mixed/odd batch has no graph and no piecewise fallback: eager cliffs.)
- Sketch the routing table for a hypothetical PIECEWISE_DECODE_ONLY. Why does no such mode ship? (If decode batches — the most uniform — can only manage piecewise, mixed can't do better; the composite would collapse to plain PIECEWISE.)
References
- upstream/vllm/config/compilation.py:53 — the real enum and its methods; diff your solution against it line by line.
- upstream/vllm/compilation/cuda_graph.py — BatchDescriptor and the per-entry dispatch.
- upstream/vllm/v1/worker/gpu_model_runner.py — where uniform_decode is determined per step.
- vLLM docs, Compilation Config — the user-facing knob this enum sits behind: https://docs.vllm.ai/en/latest/configuration/optimization.html
- Lab-04's capture log — both routines of the default mode, visible at startup.
Lab 05-04 — CUDA Graphs vs Eager on Real vLLM [GPU-REQ]
The payoff lab: everything you derived on paper in labs 01–03 — the launch-overhead win,
the crossover economics, the two-routine capture — measured on real silicon. You'll run
the same tiny model with graphs on (the default) and with enforce_eager=True (graphs and
compilation off) across batch sizes 1, 8, and 64, and watch the speedup do exactly what
lab-02's model predicts: ~2.5× at batch 1, fading to ~1.13× at batch 64 as the
bottleneck migrates from CPU launches to GPU compute.
No GPU? Don't panic. The captured output below is the experiment; every number in it is annotated against the labs that predicted it. Read it like a lab notebook.
Contents
- Why this lab exists
- Requirements
- Steps
- Captured output (real run, facebook/opt-125m, L4 24GB, vLLM 0.22.1)
- Reading the numbers like an engineer
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
A model that predicts is worth a hundred that explain after the fact. Labs 01–02 made three falsifiable claims: graphs help most when GPU work per step is smallest (batch 1); the help fades — never inverts — as batch grows; and the cost is a visible one-time capture at startup. This lab is the falsification attempt. When the L4 numbers land on the predicted curve, you've earned something better than a benchmark result: a validated mental model you can extrapolate to hardware you've never touched ("H100, 70B, batch 32 — graphs matter how much?") — which is what capacity planning actually requires.
The experimental design itself is the second lesson: one knob (enforce_eager), one
sweep variable (batch size), fixed everything else, and a baseline arm. The number of
production "benchmarks" that fail this bar is the reason Phase 18 exists.
Requirements
uv pip install -e ".[vllm]"
huggingface-cli download facebook/opt-125m
(OPT-125m again, deliberately: a small model maximizes the launch-overhead share of step time — Phase 0 lab-04's arithmetic — making it the best-case stage for graphs. Keep that in mind when extrapolating to 70B; see the notes.)
Steps
# run.py
import time

from vllm import LLM, SamplingParams


def bench(enforce_eager: bool, n_prompts: int):
    # Engine construction (and graph capture, when enabled) runs here, before the timer starts.
    llm = LLM(model="facebook/opt-125m", enforce_eager=enforce_eager,
              gpu_memory_utilization=0.5, max_model_len=512)
    prompts = ["The meaning of life is"] * n_prompts
    sp = SamplingParams(max_tokens=128, temperature=0)
    t0 = time.perf_counter()
    out = llm.generate(prompts, sp)
    dt = time.perf_counter() - t0
    # Throughput counts generated tokens only, summed over the batch.
    toks = sum(len(o.outputs[0].token_ids) for o in out)
    print(f"enforce_eager={enforce_eager} batch={n_prompts}: {toks/dt:8.1f} tok/s")


for bs in (1, 8, 64):
    bench(enforce_eager=True, n_prompts=bs)   # graphs + compile OFF
    bench(enforce_eager=False, n_prompts=bs)  # graphs ON (default)
- Compare the pairs at each batch size; compute the ratios.
- Watch the startup logs in the graphs-on runs: the capture progress bars are lab-02's capture_cost, paid where you can see it.
- Re-run a pair twice and note run-to-run variance before trusting any single ratio — the habit that separates measurements from numbers.
Captured output (real run, facebook/opt-125m, L4 24GB, vLLM 0.22.1)
enforce_eager=True batch=1 : 980.3 tok/s
enforce_eager=False batch=1 : 2473.6 tok/s # ~2.5x: pure launch-overhead win at bs=1
enforce_eager=True batch=8 : 6912.4 tok/s
enforce_eager=False batch=8 : 11034.8 tok/s # ~1.6x: still CPU-bound-ish
enforce_eager=True batch=64: 41560.2 tok/s
enforce_eager=False batch=64: 46883.1 tok/s # ~1.13x: GPU-bound, graphs help less
# startup, graphs ON:
INFO ... Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|####| 23/23
INFO ... Capturing CUDA graphs (decode, FULL): 100%|####| 23/23
INFO ... Graph capturing finished in 7 secs, took 0.41 GiB
# startup, enforce_eager=True: (no capture step)
Reading the numbers like an engineer
- 2.5× at batch 1 — the launch-bound regime. An OPT-125m decode step is sub-millisecond of GPU work behind a few hundred kernel launches; remove the launches (lab-01's WIN) and the step nearly collapses to its GPU time. This is the headline — and it's workload-specific: agentic, single-stream, small-model serving lives here.
- The fade, 2.5× → 1.6× → 1.13× — Amdahl in motion. Bigger batches mean more GPU work per (unchanged) launch bill; the removable fraction shrinks. Note what doesn't happen: the ratio never dips below 1. Graphs don't have a regime where they hurt steady-state throughput — the cost lives entirely at startup. That asymmetry is why they're default-on rather than a tuning option.
- 23/23 twice — the capture-size ladder (lab-05: count default_capture_sizes(512)) run once per routine of FULL_AND_PIECEWISE (lab-03's table, bottom row, live). If you ever need a one-glance config check on a running deployment, this pair of progress bars is it.
- 7 secs, 0.41 GiB — lab-02's capture_cost, in physical units: 46 captures' worth of warmup+record, and the shared graph memory pool. Amortized over millions of steps; but on a CI box that boots vLLM per test, 7 seconds × every test is real money — which is why enforce_eager=True is the standard test-suite setting upstream while being wrong for production. Same knob, opposite verdicts, both derivable from lab-02.
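The ratios quoted above come straight from the captured output. A short sketch of the arithmetic, using only the tok/s figures and capture counts printed in that log:

```python
# batch size -> (eager tok/s, graphs-on tok/s), copied from the captured output
tok_per_s = {
    1: (980.3, 2473.6),
    8: (6912.4, 11034.8),
    64: (41560.2, 46883.1),
}

# Speedup of graphs-on over enforce_eager=True at each batch size.
ratios = {bs: graphs / eager for bs, (eager, graphs) in tok_per_s.items()}

# The two checks the prose makes: the ratio fades monotonically and never inverts.
ordered = [ratios[bs] for bs in sorted(ratios)]
fades_monotonically = all(a > b for a, b in zip(ordered, ordered[1:]))
never_inverts = all(r > 1.0 for r in ordered)

# Capture accounting: 23 sizes per routine, two routines.
total_captures = 23 * 2
```

Run the same few lines over your own numbers before reading anything into them. The shape of `ordered` is the result, not any single entry.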
Hitchhiker's notes
- Extrapolating to big models: a 70B's decode step is tens of ms of GPU work — the launch bill is a far smaller fraction, so expect graph gains in single-digit percents at moderate batch, not 2.5×. Graphs matter most for small models, small batches, long generations — which, conveniently, describes draft models in speculative decoding (Phase 8), where graphs are practically mandatory.
- enforce_eager=True disables compilation too, so this A/B bundles two effects (fused kernels + graphs). For the isolated graph effect, compare cudagraph_mode=NONE with compilation on vs the default. The bundle is what operators actually toggle, hence the lab measures the bundle — but know what's in the box before attributing the delta.
- Variance discipline: tok/s from a single generate call includes engine startup effects, first-iteration warmup, and timer jitter. The captured numbers are representative, not sacred — your L4 will differ by a few percent, your 4090 by more. What must reproduce is the shape: big ratio at 1, monotone fade, no inversion. If your shape differs, that's interesting; investigate (background processes, thermal throttling, a different default mode).
- When is enforce_eager right in production? Debugging (eager stack traces point at real lines; graph replays don't), extreme memory pressure (reclaim the graph pool's GiB), or genuinely chaotic shapes beyond the ladder. Rare — but "what's the escape hatch and what does it cost" is exactly the question this lab leaves you able to answer with numbers.
Reflect
- Predict before measuring: on your hardware, will batch-8 land closer to the batch-1 or batch-64 ratio? Which parameter of lab-02's model are you implicitly estimating? (The GPU-work-per-step share — i.e. where batch 8 sits relative to the roofline ridge from Phase 0 lab-04.)
- The capture log shows 0.41 GiB for 46 graphs of a 125m model. Sketch why a 70B model with tensor parallelism captures in a similar order of memory (graphs store launch topology + workspace, not weights) — and why people are still surprised by the pool's size on memory-tight deployments.
- Your service restarts pods on every deploy, 50× a day. Quantify the capture tax and name two mitigations. (7 s × 50 = ~6 min/day of cold capacity; mitigate via fewer capture sizes — lab-05's ladder — or vLLM's compilation cache for the compile half; the capture half re-runs regardless.)
References
- Labs 01–02 — the mechanism and the formulas these numbers validate.
- Lab-03 — why the capture log has exactly two passes; lab-05 — why each pass has 23.
- upstream/vllm/v1/worker/gpu_model_runner.py — the capture loop emitting those progress bars.
- vLLM docs, Optimization and Tuning — enforce_eager, cudagraph_mode, compilation knobs: https://docs.vllm.ai/en/latest/configuration/optimization.html
- Phase 18 — the benchmarking discipline this lab previews (variance, baselines, sweeps).
Lab 05-05 — Capture Sizes: Bucketing Batches into Graphs [CPU-OK]
Lab-01 left you with a tension it didn't resolve. A CUDA graph is captured per shape
(Constraint 1) — but a decode batch's size changes every step as requests join and finish
(Phase 1 lab-04 showed you the churn). Capture a graph for every possible batch size from
1 to max_num_seqs? That's hundreds of captures: minutes of startup and gigabytes of
graph memory. Capture only a few? Then most steps have no matching graph. vLLM's answer is
the capture-size ladder: a curated list of sizes, with every batch padded up to the
nearest rung. In this lab you implement the ladder, the rung lookup, and the waste
accounting — and answer the production question this mechanism generates weekly: "why is
my batch of 33 running at size 40?"
Contents
- Why this lab exists
- Background: padding as the price of replay
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
This is the lab where CUDA graphs stop being a binary feature ("on = fast") and become a
budgeted trade you can reason about quantitatively. Every rung in the ladder costs
capture time at startup and graph-pool memory forever; every gap between rungs costs
padded rows — real FLOPs spent computing garbage that's discarded — on every step that
lands in the gap. The deliverable skill: given a workload's batch-size distribution, say
whether the default ladder fits it, and what changing cudagraph_capture_sizes (or
max capture size) would buy. That's a real tuning lever (Phase 18) hiding behind an
innocuous config list.
It also explains two log lines and one metric that otherwise mystify operators: the
Capturing CUDA graphs ... 23/23 startup progress bar (that's the ladder's length — count
the rungs in default_capture_sizes(512)), the graph-pool memory in took 0.41 GiB
(lab-04's capture pass), and the small constant gap between num_running and the batch
size the profiler shows (the padding).
Background: padding as the price of replay
Replay requires the captured shape, exactly (lab-01, Constraint 2: same buffers, same sizes). A batch of 33 with a graph captured at 40 runs as follows: the 33 real rows are copied into the static input buffer, rows 34–40 are filled with junk (typically zeros or stale data; it doesn't matter, because their outputs are never read), and the whole 40-row graph replays. The 7 padded rows cost 7/33 ≈ 21% extra compute over the real rows for that step (17.5% of the 40-row graph). That is almost always cheaper than the alternative (an eager step paying per-kernel launches), because decode steps are launch-overhead-dominated at exactly these small sizes (lab-02's regime, Phase 0 lab-04's roofline).
The ladder's shape encodes where padding hurts: rungs are dense at small sizes ([1, 2, 4], then every 8) because relative waste is worst there — padding 3 → 4 is 33%, padding 250 → 256 is 2.4%. Above the largest rung the engine just runs eagerly: at that much GPU work per step, launch overhead is amortized anyway and graphs stop mattering (lab-04's shrinking gap, measured).
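The ladder and rung lookup described above can be sketched in a few lines. This is a minimal illustration, not the lab's reference solution: the function names echo those listed for `starter.py`, but the signatures are assumptions, and `padded_rows` is a hypothetical helper introduced only for this example.

```python
from bisect import bisect_left
from typing import List, Optional

def default_capture_sizes(max_size: int) -> List[int]:
    """Ladder shape from the text: 1, 2, 4, then every multiple of 8 up to max_size."""
    small = [s for s in (1, 2, 4) if s <= max_size]
    return small + list(range(8, max_size + 1, 8))

def select_capture_size(batch: int, ladder: List[int]) -> Optional[int]:
    """Smallest rung >= batch, or None when the batch is above the top rung (eager)."""
    i = bisect_left(ladder, batch)
    return ladder[i] if i < len(ladder) else None

def padded_rows(batch: int, ladder: List[int]) -> int:
    """Rows of padding this step pays; zero on an exact rung or on the eager path."""
    rung = select_capture_size(batch, ladder)
    return 0 if rung is None else rung - batch

ladder = default_capture_sizes(64)
for batch in (3, 32, 33, 65):
    print(batch, select_capture_size(batch, ladder), padded_rows(batch, ladder))
```

A sorted ladder plus binary search keeps the per-step lookup cheap, and returning `None` for oversize batches makes the eager fallback an ordinary return value rather than an error.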
Files
- `starter.py`: `default_capture_sizes`, `select_capture_size`, `padded_tokens`, `trace_waste`. Your work.
- `solution.py`: reference.
- `test_lab.py`: the ladder's exact shape, exact-rung hits, round-up, eager fallback, trace accounting, and the density trade-off.
Run
LAB_IMPL=starter pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-05-capture-sizes -q
pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-05-capture-sizes -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| `test_ladder_shape` | `[1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64]` for max 64: dense low, every 8 above |
| `test_exact_rung_pays_nothing` | Landing on a rung is free, which is also why benchmarks at batch 1/8/64 can overstate graph benefits vs your real traffic at batch 33 |
| `test_between_rungs_rounds_up` | 33 → 40 (waste 7): the production FAQ, answered with arithmetic |
| `test_oversize_batch_runs_eager` | Above the top rung: no padding, no graph; the eager fallback is a normal path, not an error |
| `test_trace_accounting` | Waste summed over a step trace, the quantity you'd actually plot for a workload |
| `test_finer_ladder_trades_graphs_for_padding` | Denser ladder = less padding but more rungs to capture: the whole design space in one assert |
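The density trade-off in the last two rows can be made concrete with a short sketch. The `trace_waste` signature and return shape below are assumptions (the lab's own version may differ), and the trace is a made-up list of batch sizes chosen only to illustrate the accounting.

```python
from bisect import bisect_left

def trace_waste(trace, ladder):
    """Return (padded_rows, eager_steps) summed over a trace of batch sizes."""
    padded = eager = 0
    for batch in trace:
        i = bisect_left(ladder, batch)
        if i == len(ladder):
            eager += 1                  # above the top rung: runs eagerly, no padding
        else:
            padded += ladder[i] - batch
    return padded, eager

ladders = {
    "default":       [1, 2, 4] + list(range(8, 65, 8)),
    "powers-of-two": [1, 2, 4, 8, 16, 32, 64],
    "every-4":       [1, 2] + list(range(4, 65, 4)),
}
trace = [3, 9, 17, 33, 33, 34, 40, 41, 63, 70]   # hypothetical batch sizes per step

for name, ladder in ladders.items():
    padded, eager = trace_waste(trace, ladder)
    print(f"{name:14s} rungs={len(ladder):2d} padded_rows={padded:3d} eager_steps={eager}")
```

On this toy trace the every-4 ladder pays the least padding but needs the most captures, while powers-of-two inverts the trade; which one wins depends on the per-rung capture cost you assign.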
Hitchhiker's notes
- Where this lives upstream: `cudagraph_capture_sizes` in `upstream/vllm/config/compilation.py`, consumed by the model runner's dummy-run capture loop at startup and by the per-step batch padding (search `pad` in `gpu_model_runner.py`). The real ladder is the same shape as yours with a higher default ceiling (typically 512).
- Padding interacts with the sampler, not just the GEMMs. Padded rows produce logits too; the engine must not sample from them. Upstream handles it by slicing real rows out before sampling (`logits_indices` again: Phase 1 lab-03's guard doing one more job). When you see careful index plumbing around batch padding in a PR, this is what it's protecting.
- The same bucketing idea recurs everywhere shapes must be finite: torch.compile's dynamic-shape buckets, TensorRT optimization profiles, XLA padding on TPUs (Phase 17, where padding costs are far more dramatic). "Continuous quantity → discrete ladder + round up" is a pattern, and its failure mode is always the same: a workload that sits just above a rung, paying maximum waste consistently. Check the distribution, not the mean.
- Why not capture on demand — first time a size appears, capture it? Capture requires a warmup run, allocations, and stream quiescence: a multi-hundred-ms stall mid-serving the first time batch=37 shows up, and unbounded graph memory growth over a day of traffic. Startup capture converts an unpredictable runtime stall into a predictable boot cost — the same "pay it where you can see it" philosophy as the Phase 2 lab-03 memory profiling pass.
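The sampler note above can be illustrated with a CPU-only sketch. It assumes a toy "model" that is a single matmul and uses a plain slice where upstream uses `logits_indices`; the dimensions and names are hypothetical.

```python
import numpy as np

CAPTURED = 40            # rung the (imaginary) graph was captured at
HIDDEN, VOCAB = 8, 16    # tiny hypothetical dimensions

rng = np.random.default_rng(0)
lm_head = rng.standard_normal((HIDDEN, VOCAB))
static_input = np.zeros((CAPTURED, HIDDEN))   # persistent buffer, reused every step

def replay_step(real_rows: np.ndarray) -> np.ndarray:
    n = real_rows.shape[0]
    static_input[:n] = real_rows       # copy real rows in; rows n.. keep stale data
    logits = static_input @ lm_head    # "replay": always computes all CAPTURED rows
    return logits[:n]                  # keep only real rows before sampling

batch = rng.standard_normal((33, HIDDEN))
logits = replay_step(batch)
next_tokens = logits.argmax(axis=-1)   # greedy sampling over real rows only
print(logits.shape, next_tokens.shape)
```

The padded rows are still computed, which is the waste the ladder accounts for; the slice only guarantees their logits never reach the sampler.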
Going further
- Take the batch-size trace from Phase 1 lab-04's probe (lengths of each step's dict) and run `trace_waste` over it for three ladders: default, powers-of-two only, every-4. Compute waste as a fraction of real rows; for bursty traces the answer often surprises.
- Model capture cost: give each rung a fixed cost (say, 0.3 s + 8 MB) and find, for a given trace length, the ladder that minimizes total cost (capture + padded-row time). You've just turned a config knob into an optimization problem, which is Phase 18's worldview.
- Read the capture loop in `gpu_model_runner.py` (search `capture`) and find where the ladder is iterated largest first, then work out why (memory-pool reuse: the biggest graph's buffers can be shared by the smaller ones).
References
- `upstream/vllm/config/compilation.py`: `cudagraph_capture_sizes` and the mode enum (lab-03).
- `upstream/vllm/v1/worker/gpu_model_runner.py`: the capture loop and per-step padding.
- vLLM blog, vLLM V1 (the compilation + capture architecture): https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
- NVIDIA, CUDA Graphs (programming guide) — what capture/replay actually does: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#cuda-graphs
- Lab-04 — the measured shrinking-gap curve this ladder is tuned against.
Phase 05 — Exercises: CUDA Graphs & torch.compile
Escalating from "explain it" to "design it." Staff-level = the last ones cold, citing the exact
upstream/ line.
Contents
- Warm-up (explain)
- Core (trace the code)
- Build (extend your code / mini_vllm)
- Design (staff-level)
- Self-grading
Warm-up (explain)
1. In one sentence each: what does a CUDA graph remove, and what does `torch.compile` improve?
2. Why does a graph help decode at batch size 1 but barely at batch size 256?
3. Name the two constraints a captured graph imposes, and the field/structure in `CUDAGraphWrapper` that enforces each (`cuda_graph.py`).
Core (trace the code)
4. Walk the three branches of `CUDAGraphWrapper.__call__` (`cuda_graph.py:233`): name the trigger for eager / capture / replay and the one line that is the win.
5. `FULL_AND_PIECEWISE` is encoded as the tuple `(FULL, PIECEWISE)`. Using `decode_mode` / `mixed_mode` (`compilation.py:65`), state which concrete mode runs for a pure-decode batch vs a mixed batch, and why those choices are safe.
6. Why is attention the op that forces piecewise capture? What about it doesn't fit a frozen recording? (Hint: Phase 02/03 metadata.)
Build (extend your code / mini_vllm)
7. Add capture-size padding to `GraphRunner` (stretch in 02-mini-build.md): round the batch dim up to the nearest of `[1,2,4,8]` before keying. Show batches 5 and 7 both reuse the size-8 graph, and count distinct captures across batches 1..8 with vs without padding.
8. Extend `PiecewiseGraphRunner` to count launches (capturable segments replay as 1 each; eager segments pay per-op). Compare total launches of FULL (1) vs PIECEWISE (segments+eager) vs eager (all ops) for a 10-op model split at op 5.
9. Write a `crossover` table (from lab-02) for `num_ops ∈ {1, 4, 32, 300}` and `capture_cost_ops ∈ {num_ops, 5×num_ops}`. Explain the row for `num_ops=1`.
Design (staff-level)
10. A serving box shows 30% GPU utilization at batch 1–2 and a profile full of gaps between tiny kernels. Walk your diagnosis and the first fix you'd try, and predict the batch size at which the fix stops mattering.
11. You enable `torch.compile` (level `VLLM_COMPILE`) and startup time jumps from 10s to 90s. Explain where the time goes and how vLLM mitigates it across restarts (`compilation.py` caching). What do you trade if you drop to `enforce_eager`?
12. A new attention backend you wrote breaks under FULL graphs but works in PIECEWISE. Explain the likely cause and which `CUDAGraphMode` you'd ship as the default while you investigate.
13. Design a benchmark that isolates the CUDA-graph win from the `torch.compile` win (so you can attribute a speedup to the right layer). Which flags toggle each independently?
Self-grading
4–6 and 10–13 are interview-grade. Could you whiteboard each in 5 minutes and name the file? If not, re-read the matching deep-dive section, then drill INTERVIEW.md.
Phase 05 — Interview Questions: CUDA Graphs & torch.compile
Cover the answer, attempt out loud, then compare. This topic separates people who've operated a serving stack from those who've only read about it.
Q1. What is a CUDA graph and what exactly does it speed up?
Model answer
A CUDA graph is a recording of a sequence of GPU operations and their dependencies, captured once and replayed with a single launch call. It removes CPU kernel-launch overhead; it does not speed up GPU compute. In decode you issue hundreds of tiny kernels per token; at small batch the CPU can't issue them fast enough and the GPU starves between kernels. Replaying a captured graph issues one launch and the GPU runs the whole recorded sequence back-to-back, removing the per-kernel CPU cost. It does nothing for the actual math, so it helps exactly when you're CPU-launch-bound.
Q2. Why does it help decode but not prefill (or large batches)?
Model answer
Decode at small batch is launch-bound: many tiny kernels, each finishing before the CPU issues the next, repeated for thousands of steps at the same shape — ideal for graphs. Prefill (and large-batch decode) is compute-bound: kernels are large, so launch overhead is negligible relative to the GPU work, and shapes vary so a captured graph wouldn't be reused. Quantitatively (lab-02): the launch-overhead speedup approaches the number of ops per step in the limit of many same-shape repeats, and collapses to ~1 when the GPU work per step dominates.
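The two limits in that answer fall out of a very small cost model. The sketch below is an assumption-laden toy, not lab-02's exact formula: it treats CPU launch time and GPU kernel time as strictly additive (no overlap), and all parameter names and numbers are hypothetical.

```python
def graph_speedup(num_ops, launch_us, gpu_us_per_op, repeats, capture_cost_us=0.0):
    """Eager time / graphed time for `repeats` same-shape steps (serial toy model)."""
    eager = repeats * num_ops * (launch_us + gpu_us_per_op)
    graphed = capture_cost_us + repeats * (launch_us + num_ops * gpu_us_per_op)
    return eager / graphed

# Launch-bound limit: negligible GPU work, many repeats -> speedup approaches num_ops.
print(graph_speedup(num_ops=300, launch_us=5.0, gpu_us_per_op=0.0, repeats=1000))

# Compute-bound limit: GPU work per op dwarfs the launch cost -> speedup near 1.
print(graph_speedup(num_ops=300, launch_us=5.0, gpu_us_per_op=500.0, repeats=1000))

# Few repeats with a large one-time capture cost: the graph may not pay for itself.
print(graph_speedup(num_ops=300, launch_us=5.0, gpu_us_per_op=0.0,
                    repeats=1, capture_cost_us=5 * 300 * 5.0))
```

Real steps overlap CPU issue with GPU execution, so measured speedups sit below this model's launch-bound ceiling; the model is useful for the direction of the effect, not its magnitude.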
Q3. What are the constraints a captured graph imposes, and how does vLLM satisfy them?
Model answer
(1) Fixed shapes: a graph captured for batch size B only replays for B. vLLM captures one
graph per size in cudagraph_capture_sizes and pads any batch that falls between captured
sizes up to the next larger one; CUDAGraphWrapper keys graphs in
concrete_cudagraph_entries: dict[BatchDescriptor,...] (cuda_graph.py:207). (2) Static input
buffers: replay reads from the same memory the capture used, so the model runner writes each
step's inputs into persistent buffers before replay, and a debug check asserts the input
addresses are unchanged (CUDAGraphEntry.input_addresses, cuda_graph.py:135/:346).
Q4. Full vs piecewise CUDA graphs — what's the difference and why does vLLM default to both?
Model answer
FULL captures the entire model forward as one graph — maximum overhead removal but fragile,
because everything (including attention with its variable metadata) must be capture-safe.
PIECEWISE splits the forward at the uncapturable ops (attention), captures each contiguous
compiled region, and runs the split ops eagerly — most of the win, far more robust. vLLM's V1
default FULL_AND_PIECEWISE (compilation.py:63) uses a FULL graph for pure-decode batches
(uniform shapes, safe and fastest) and PIECEWISE for mixed prefill+decode batches (variable
attention metadata). It's a tuple (decode_mode=FULL, mixed_mode=PIECEWISE) and the runner picks
per batch.
Q5. How does CUDA graphing relate to torch.compile? Are they the same thing?
Model answer
No. They solve different problems and are used together. torch.compile traces the model
(TorchDynamo) and generates better/fused kernels (Inductor), reducing memory traffic and
kernel count. CUDA graphs make launching whatever kernels remain nearly free. vLLM's level-3
VLLM_COMPILE backend (compilation.py:48) additionally caches compiled artifacts, splits the
graph at attention for piecewise compilation (which lines up with piecewise CUDA-graph capture),
and runs custom fusion passes. A model opts in with @support_torch_compile
(decorators.py:118). Net: compile improves the kernels, graphs remove launch overhead.
Q6. What do the CompilationMode levels mean, and when would you lower them?
Model answer
NONE (0) = pure eager; STOCK_TORCH_COMPILE (1) = plain torch.compile; DYNAMO_TRACE_ONCE
(2) = trace once, no recompiles; VLLM_COMPILE (3) = vLLM's Inductor backend with caching,
piecewise compilation, shape specialization, and custom passes (the V1 default). You'd lower it
(or set enforce_eager=True, which disables compile and graphs) to debug a kernel, handle
genuinely dynamic shapes that defeat specialization, or cut the startup compile/capture cost when
that matters more than steady-state throughput.
Q7. (Deep) Walk the lifecycle of one decode step through the compile + graph layers.
Model answer
The model runner picks the cudagraph_runtime_mode for this batch (FULL if pure decode,
PIECEWISE if mixed, NONE during warmup/profiling) and a batch_descriptor (shape key), writes
the step's token/position tensors into persistent input buffers (padding the batch to a captured
size), and sets these on the forward_context. The compiled forward runs; inside it, each
CUDAGraphWrapper reads the context — if the mode matches and the shape is known it
replay()s that graph (one launch) and returns the cached output; if the shape is new it
captures; if mode is NONE it runs eagerly. Attention pieces run eagerly under PIECEWISE. The
sampler then produces the token. (cuda_graph.py:233, gpu_model_runner.py.)
Rapid-fire
- Flag to disable graphs + compile? `enforce_eager=True`.
- Where are captured graphs stored? `CUDAGraphWrapper.concrete_cudagraph_entries`, keyed by `BatchDescriptor`.
- What op forces piecewise? Attention (variable metadata).
- V1 default cudagraph mode? `FULL_AND_PIECEWISE`.
- Default compilation level? `VLLM_COMPILE` (3).
- One decorator to enable compile on a model? `@support_torch_compile`.
- Does a graph speed up the matmul itself? No, only the launch.
Phase 05 — Cheatsheet: CUDA Graphs & torch.compile
Contents
- The one-liner
- When graphs help
- The two constraints
- `CUDAGraphMode` (compilation.py:53)
- `CompilationMode` levels (compilation.py:37)
- Capture/replay dispatch (cuda_graph.py:233)
- Key upstream
- Gotchas
The one-liner
Two different enemies: CUDA graphs kill CPU launch overhead (record once, replay in one launch); torch.compile makes the kernels better (trace → fuse → generate). Used together, on by default.
When graphs help
- Help: decode at small batch (CPU-launch-bound, many tiny kernels, same shape, many repeats).
- Don't: prefill / large batch (GPU-bound; launch overhead negligible; shapes vary).
- Limit speedup ≈ ops-per-step (lab-02); collapses to ~1 when GPU-bound.
The two constraints
- Fixed shape: one graph per captured batch size; pad sizes that fall between rungs up to the next one. Stored in `concrete_cudagraph_entries: dict[BatchDescriptor, CUDAGraphEntry]`.
- Static buffers: replay reads the same memory; copy new inputs in first (`input_addresses` debug check).
CUDAGraphMode (compilation.py:53)
| mode | decode batch | mixed batch |
|---|---|---|
| NONE | NONE | NONE |
| PIECEWISE | PIECEWISE | PIECEWISE |
| FULL | FULL | FULL |
| FULL_DECODE_ONLY | FULL | NONE |
| FULL_AND_PIECEWISE (default) | FULL | PIECEWISE |
- Composite modes = `(decode_mode, mixed_mode)` tuples. `requires_piecewise_compilation` = `has_mode(PIECEWISE)`.
- Attention is why mixed batches go PIECEWISE (variable metadata can't be frozen).
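The tuple encoding in the table can be sketched as a small enum. `ToyCUDAGraphMode` is a hypothetical stand-in written for this cheatsheet; the member names and the `decode_mode` / `mixed_mode` / `has_mode` helpers mirror the names quoted above, but the integer values and method bodies are assumptions, not a copy of `compilation.py`.

```python
from enum import Enum

class ToyCUDAGraphMode(Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    # Composite modes: (mode for pure-decode batches, mode for mixed batches).
    FULL_DECODE_ONLY = (FULL, NONE)
    FULL_AND_PIECEWISE = (FULL, PIECEWISE)

    def decode_mode(self) -> "ToyCUDAGraphMode":
        if isinstance(self.value, tuple):
            return ToyCUDAGraphMode(self.value[0])
        return self

    def mixed_mode(self) -> "ToyCUDAGraphMode":
        if isinstance(self.value, tuple):
            return ToyCUDAGraphMode(self.value[1])
        return self

    def has_mode(self, mode: "ToyCUDAGraphMode") -> bool:
        return mode in (self.decode_mode(), self.mixed_mode())

for m in ToyCUDAGraphMode:
    print(f"{m.name:20s} decode={m.decode_mode().name:10s} mixed={m.mixed_mode().name}")
```

Running the loop reproduces the table row by row, which is a quick way to check your reading of the composite modes.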
CompilationMode levels (compilation.py:37)
0 NONE · 1 STOCK_TORCH_COMPILE · 2 DYNAMO_TRACE_ONCE · 3 VLLM_COMPILE (default: caching +
piecewise + shape specialization + custom passes).
Capture/replay dispatch (cuda_graph.py:233)
mode==NONE or mode!=mine -> run eager
shape unseen -> CAPTURE (torch.cuda.graph), cache, return real output
shape seen -> entry.cudagraph.replay(); return cached output <- the win
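The three branches can be emulated on a CPU to make the control flow tangible. This is a toy under loud assumptions: `ToyGraphWrapper` is hypothetical, modes are plain strings, the "graph" is just the original function re-run on a static buffer, and the cache is keyed by batch length rather than a `BatchDescriptor`.

```python
class ToyGraphWrapper:
    def __init__(self, fn, my_mode):
        self.fn = fn
        self.my_mode = my_mode
        self.entries = {}          # batch size -> static input buffer
        self.captures = 0
        self.replays = 0

    def __call__(self, x, runtime_mode):
        # Branch 1: graphs off for this step, or this wrapper's mode not selected.
        if runtime_mode == "NONE" or runtime_mode != self.my_mode:
            return self.fn(x)
        buf = self.entries.get(len(x))
        # Branch 2: unseen shape, so "capture" by allocating the static buffer.
        if buf is None:
            self.entries[len(x)] = list(x)
            self.captures += 1
            return self.fn(x)
        # Branch 3: seen shape, so copy inputs into the same buffer and "replay".
        buf[:] = x
        self.replays += 1
        return self.fn(buf)

double = ToyGraphWrapper(lambda xs: [2 * v for v in xs], my_mode="FULL")
print(double([1, 2], "FULL"), double([3, 4], "FULL"), double([5], "NONE"))
print("captures:", double.captures, "replays:", double.replays)
```

The in-place `buf[:] = x` is the part worth noticing: replay never sees a new buffer, only new contents in the old one.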
Key upstream
- `vllm/compilation/cuda_graph.py:145` `CUDAGraphWrapper` · `:233` `__call__` · `:128` `CUDAGraphEntry`
- `vllm/config/compilation.py:37` `CompilationMode` · `:53` `CUDAGraphMode` · `:381` `CompilationConfig`
- `vllm/compilation/decorators.py:118` `@support_torch_compile`
- `vllm/compilation/backends.py` `VllmBackend` · `passes/pass_manager.py` custom passes
Gotchas
- `enforce_eager=True` disables both graphs and compile (debug/odd-shapes escape hatch).
- Startup pays a one-time capture+compile cost (amortized; compile artifacts cached across runs).
- Piecewise needs the model compiled piecewise — you can't piecewise-replay a non-split graph.
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md
Phase 06 — The Hitchhiker's Guide to Quantization
Contents
- Don't Panic
- Step 1: The core idea — scale + round
- Step 2: The format zoo (don't memorize — recognize)
- Step 3: GPTQ vs AWQ (the two famous 4-bit methods)
- Step 4: How vLLM runs any of them — one interface
- The invariants to memorize
- What you'll do
Don't Panic
Weights are normally 16-bit floats. Quantization stores them in fewer bits (8, 4, even sub-4). Two payoffs, straight from Phase 0's physics: fewer bytes means less HBM to read each decode step (decode is memory-bandwidth-bound → faster) and less memory used (fit a bigger model, or more KV cache → higher concurrency). The whole trick is doing it without wrecking accuracy. This phase is the zoo of formats and how vLLM loads and runs them behind one clean interface.
fp16 weight W ──quantize──► int4 weights + scales (¼ the bytes)
│ GEMM kernel dequantizes on the fly
▼
same matmul result (approximately)
A 4-bit model reads ~¼ the weight bytes per step → can nearly double decode throughput and quarter weight memory. Quantization is often the single highest-leverage cost-per-token knob.
Step 1: The core idea — scale + round
To store a float tensor in int8, find a scale s so values fit in [-127, 127], then store
round(W / s) as int8 and keep s (a float) on the side. To use it: W ≈ s × int8. That's it.
The art is choosing s well so rounding error stays small:
- per-tensor scale: one `s` for the whole matrix (cheapest, least accurate).
- per-channel scale: one `s` per output channel (much better: outliers in one channel don't blow up the others).
- per-group scale: one `s` per small group of weights (e.g. 128), giving the best accuracy for 4-bit at the price of more scales to store.
You'll implement per-channel int8 fake-quant in lab-01 and measure the round-trip error and
the memory saved.
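Before the lab, the scale-and-round step can be seen end to end in a few lines of numpy. This is a minimal per-tensor sketch with a randomly generated matrix standing in for real weights; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)   # stand-in weight matrix

s = np.abs(W).max() / 127.0                           # per-tensor scale
q = np.clip(np.round(W / s), -127, 127).astype(np.int8)
W_hat = s * q.astype(np.float32)                      # W ≈ s × int8

max_err = np.abs(W - W_hat).max()
print("scale:", s)
print("max round-trip error:", max_err, "(rounding bound is s/2 =", s / 2, ")")
print("bytes: fp32", W.nbytes, "-> int8", q.nbytes, "+ one scale")
```

Rounding to the nearest grid point can be off by at most half a grid step, so the worst-case error is `s / 2`; that is why a smaller scale (finer granularity) directly means less error.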
Step 2: The format zoo (don't memorize — recognize)
Two axes organize everything:
Axis A — what gets quantized:
- weight-only (GPTQ, AWQ, most 4-bit): only weights are low-bit; activations stay fp16. Helps memory + decode bandwidth. Most common.
- weight + activation (FP8, INT8 "W8A8"): both low-bit; can use faster low-precision tensor cores for the matmul itself (helps compute too, e.g. prefill).
Axis B — the numeric format:
- FP8 (E4M3/E5M2): 8-bit float; great accuracy/speed on Hopper+; also used for the KV cache.
- INT8 / INT4: integer quant with scales.
- MXFP4 / NVFP4: 4-bit float "microscaling" formats (block-wise shared exponents) — frontier for 4-bit accuracy on Blackwell.
- GPTQ / AWQ: methods that produce 4-bit weights using calibration data (see Step 3).
- GGUF: the llama.cpp file format (various bit widths).
- compressed-tensors / ModelOpt / TorchAO: families/toolkits that emit quantized checkpoints vLLM can load.
You don't need all of them today. You need: fewer bits → less bandwidth/memory → faster decode, at some accuracy cost; the format must match the GEMM kernel that consumes it.
Step 3: GPTQ vs AWQ (the two famous 4-bit methods)
Both are post-training, weight-only 4-bit, using a little calibration data:
- GPTQ: minimizes the layer's output error using second-order (Hessian-based) information, quantizing weights column by column and compensating.
- AWQ (Activation-aware Weight Quantization): protects the most salient weight channels (those multiplied by large activations) by scaling them before rounding.
Both plug into vLLM the same way — as a LinearMethod (Step 4). The Marlin kernels make 4-bit
matmuls fast on GPU.
Step 4: How vLLM runs any of them — one interface
vLLM hides every format behind two abstractions (quantization/base_config.py):
- `QuantizationConfig`: parsed from the checkpoint; knows the format and, via `get_quant_method(layer)`, hands back the right method for a given layer.
- `LinearMethodBase` (a `QuantizeMethodBase`): `create_weights()` (allocate the int weights + scales) and `apply()` (run the quantized matmul, dequantizing as needed).
A Linear layer (Phase 14) doesn't know or care which quant method it has — it just calls
self.quant_method.apply(...). Swap FP8 for AWQ and the model code is unchanged. (Same
decoupling pattern as attention backends in Phase 4.) The matmul, though, must use a kernel
that understands the format (CUTLASS FP8, Marlin INT4, …) — Phase 7.
The invariants to memorize
- Fewer weight bits → less HBM read per step → faster decode (memory-bound); plus less memory.
- Quant = store `round(W/s)` + the scales; accuracy depends on scale granularity (per-tensor < per-channel < per-group).
- Weight-only (GPTQ/AWQ) helps bandwidth/memory; weight+activation (FP8/INT8) can also speed the matmul.
- The format must match the GEMM kernel (Phase 7). Mismatch = wrong/slow.
- vLLM dispatches via `QuantizationConfig.get_quant_method` → `LinearMethodBase.{create_weights, apply}`. Model code is format-agnostic.
- FP8 KV cache is a separate axis: halves KV bytes → ~doubles concurrency (Phase 0 lab-02).
What you'll do
- Read: 01-deep-dive.md (`QuantizationConfig` / `LinearMethodBase`, the FP8 method end to end, and where `Linear` dispatches, line-anchored).
- Build: 02-mini-build.md (a per-channel int8 fake-quant linear).
- Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
  - `lab-01-fake-quant-linear` `[CPU-OK]`: int8 per-channel quant/dequant; measure error + memory.
  - `lab-02-quantize-and-eval` `[GPU-OPT]`: fp16 vs FP8 vs AWQ-4bit throughput/memory (captured).
  - `lab-03-int4-groups-and-packing` `[CPU-OK]`: the GPTQ/AWQ storage reality: group-wise scales (why group_size=128) and two-nibbles-per-byte packing, with the error/overhead trade measured.
  - `lab-04-activation-outliers-smoothquant` `[CPU-OK]`: reproduce the activation-outlier cliff that breaks naive W8A8, then fix it with the SmoothQuant migration (an exact reparametrization).
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
Phase 06 — Deep Dive: the quantization dispatch system
Paths relative to `upstream/` at `v0.22.1 @ 0decac0`.
- `vllm/model_executor/layers/quantization/base_config.py`: `QuantizationConfig` + `QuantizeMethodBase`
- `vllm/model_executor/layers/quantization/__init__.py`: the registry of all methods
- `vllm/model_executor/layers/quantization/fp8.py`: a complete method (FP8), end to end
- `vllm/model_executor/layers/quantization/awq.py`: AWQ 4-bit weight-only
- `vllm/model_executor/layers/quantization/compressed_tensors/`: the compressed-tensors family
- `vllm/model_executor/layers/linear.py`: where a Linear layer calls its method
Contents
- 1. The two base abstractions: `base_config.py`
- 2. A complete method: FP8 (`fp8.py`)
- 3. The registry: `__init__.py`
- 4. Where a `Linear` layer uses it: `linear.py`
- Reading checklist
1. The two base abstractions: base_config.py
vllm/model_executor/layers/quantization/base_config.py:
class QuantizeMethodBase(ABC):                        # :19
    def create_weights(self, layer, ...): ...         # :28 allocate int weights + scale params
    def apply(self, layer, x, ...) -> Tensor: ...     # :37 run the (de)quantized matmul
class QuantizationConfig(ABC):                        # :70
    def get_quant_method(self, layer, prefix) -> QuantizeMethodBase | None: ...  # :151
This is the whole contract. A QuantizationConfig is parsed from the checkpoint (it knows "this
is AWQ, group size 128"); for each layer the model builds, get_quant_method returns the right
method object. LinearMethodBase is the linear-layer specialization of QuantizeMethodBase
(defined in linear.py). Two methods — create_weights and apply — are all a new format
needs. That's why vLLM supports a dozen formats: each is one config + one method class.
2. A complete method: FP8 (fp8.py)
- `class Fp8Config(QuantizationConfig)` (`:100`): parses FP8 settings from the checkpoint.
- `class Fp8LinearMethod(LinearMethodBase)` (`:261`):
  - `create_weights` (`:316`): allocates the fp8 weight tensor and its scale(s) on the layer.
  - `apply` (`:437`): runs the FP8 matmul (dequantizing / using FP8 tensor cores), with the scales.
Read Fp8LinearMethod.apply and notice it dispatches to an FP8 GEMM kernel (CUTLASS / scaled mm,
Phase 7). The method owns the numerics; the kernel does the math. FP8 is also
weight+activation capable (W8A8): it can quantize the activation x too and use FP8 tensor
cores, which is why FP8 can speed prefill, not just decode.
3. The registry: __init__.py
vllm/model_executor/layers/quantization/__init__.py maps a quant method name (from the
checkpoint's config, e.g. "fp8", "awq", "compressed-tensors", "gptq_marlin", "gguf",
"modelopt", "torchao") to its QuantizationConfig class. Adding a new format = register it
here + write the config + method. Browse the directory listing — every file (fp8.py, awq.py,
gguf.py, mxfp4.py, modelopt.py, torchao.py, compressed_tensors/…) is one entry.
4. Where a Linear layer uses it: linear.py
vllm/model_executor/layers/linear.py:
- `class UnquantizedLinearMethod(LinearMethodBase)` (`:182`): the default (no quant); `apply` (`:220`) is a plain matmul.
- `class LinearBase` (`:231`), `ColumnParallelLinear` (`:410`), `RowParallelLinear` (`:1392`): the linear layers models use (also tensor-parallel sharded, Phase 10). In `__init__` each asks its `QuantizationConfig` for a method (`get_quant_method`) and stores it as `self.quant_method`; its `forward` calls `self.quant_method.apply(self, x)`.
So the model never branches on format. It builds ColumnParallelLinear(...), which silently
becomes FP8/AWQ/INT4/unquantized depending on the checkpoint. The same LlamaAttention.qkv_proj
you saw in Phase 0 is quantized or not purely by which method got attached.
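The whole dispatch chain (registry → config → method → layer) fits in a numpy toy. Everything prefixed `Toy` is hypothetical and written for this illustration; in particular, real `create_weights` allocates parameters that a weight loader fills later, whereas this sketch passes the weight matrix in directly.

```python
import numpy as np

class ToyUnquantized:
    def create_weights(self, layer, W):
        layer.weight = W
    def apply(self, layer, x):
        return x @ layer.weight.T

class ToyInt8PerChannel:
    def create_weights(self, layer, W):
        layer.scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
        layer.qweight = np.clip(np.round(W / layer.scales), -127, 127).astype(np.int8)
    def apply(self, layer, x):
        return x @ (layer.scales * layer.qweight).T   # dequantize, then matmul

TOY_REGISTRY = {"none": ToyUnquantized, "int8": ToyInt8PerChannel}  # name -> method class

class ToyConfig:
    def __init__(self, name):
        self.method_cls = TOY_REGISTRY[name]
    def get_quant_method(self, layer, prefix):
        return self.method_cls()

class ToyLinear:
    """Never branches on format: it only talks to self.quant_method."""
    def __init__(self, W, quant_config, prefix=""):
        self.quant_method = quant_config.get_quant_method(self, prefix)
        self.quant_method.create_weights(self, W)
    def forward(self, x):
        return self.quant_method.apply(self, x)

rng = np.random.default_rng(0)
W, x = rng.standard_normal((16, 32)), rng.standard_normal((4, 32))
y_ref = ToyLinear(W, ToyConfig("none")).forward(x)
y_q = ToyLinear(W, ToyConfig("int8")).forward(x)
print("relative error:", np.linalg.norm(y_ref - y_q) / np.linalg.norm(y_ref))
```

Swapping `"none"` for `"int8"` changes no line of `ToyLinear`, which is the decoupling this section describes.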
5. The KV cache axis
vllm/model_executor/layers/quantization/kv_cache.py — FP8 KV cache is configured separately
(kv_cache_dtype="fp8"). It halves KV bytes/token → roughly doubles concurrency (Phase 0 lab-02),
at a small accuracy cost. It's orthogonal to weight quantization — you can mix (e.g. AWQ weights +
FP8 KV).
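The "halves bytes, roughly doubles concurrency" claim is plain arithmetic. The sketch below uses hypothetical model dimensions and a hypothetical KV memory budget, and it ignores any extra storage for FP8 scales, so treat the result as an upper bound on the gain.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    # Factor 2: one K vector and one V vector per layer, per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Hypothetical model shape and KV budget, for illustration only.
layers, kv_heads, head_dim = 32, 8, 128
budget_bytes = 16 * 1024**3

fp16_per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2)
fp8_per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=1)

print("fp16:", fp16_per_token, "B/token ->", budget_bytes // fp16_per_token, "tokens")
print("fp8: ", fp8_per_token, "B/token ->", budget_bytes // fp8_per_token, "tokens")
```

Because the budget is fixed, token capacity scales inversely with bytes per element; weight quantization changes the budget itself, which is why the two axes compose.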
Reading checklist
- `QuantizeMethodBase`: what do `create_weights` and `apply` each do?
- `get_quant_method`: how does a checkpoint's format become a per-layer method?
- `Fp8LinearMethod.apply`: find where scales are used and the GEMM is called.
- In `linear.py`, how does `ColumnParallelLinear` acquire and call its quant method?
- Why is FP8 "W8A8" able to speed the matmul, while AWQ (weight-only) mainly reduces bandwidth?
Now build it: 02-mini-build.md, then the labs.
Phase 06 — Mini-Build: a per-channel int8 fake-quant linear
You'll build the smallest real quantization: store a weight matrix in int8 with per-channel
scales, dequantize in the matmul, and measure the two things that matter — memory saved and
round-trip error. This is exactly what create_weights + apply do for a real method, minus
the GPU kernel.
Contents
- The task (lab-01)
- Why per-channel beats per-tensor (the key insight)
- Definition of done
- Map to the real engine
The task (lab-01)
Implement, in numpy:
- `quantize_per_channel(W)` → `(q_int8, scales)` where `W` is `(out, in)`; one scale per output channel (row). `scale[o] = max(abs(W[o])) / 127`; `q_int8[o] = round(W[o] / scale[o])` clipped to `[-127, 127]`.
- `dequantize(q_int8, scales)` → `W_approx` (`scales[:, None] * q_int8`).
- `quant_linear(x, q_int8, scales)` → `x @ dequantize(...).T` (the "apply" path).
- `memory_bytes(W)` vs `memory_bytes_quant(q_int8, scales)` to show the saving.
Then in tests:
- round-trip error `||W - dequant(quant(W))||` is small relative to `||W||`,
- per-channel beats per-tensor on a matrix with one large-magnitude row (outlier channel),
- int8 storage is ~4× smaller than fp32 (1 byte vs 4, plus a few scale floats),
- `quant_linear(x, ...)` ≈ `x @ W.T` within tolerance.
Why per-channel beats per-tensor (the key insight)
One channel with large weights forces a huge per-tensor scale, crushing the resolution of all the small channels. A per-channel scale gives each row its own dynamic range. You'll measure this — it's the reason real methods are at least per-channel, and 4-bit methods go per-group.
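The effect is easy to reproduce before you write the lab. The sketch below builds a random matrix with one artificially scaled row (an assumption standing in for a real outlier channel) and compares the round-trip error on the other rows under both granularities; `fake_quant` and `rel_err_small_rows` are illustrative helpers, not lab functions.

```python
import numpy as np

def fake_quant(W, scales):
    """Quantize to the int8 grid and dequantize back (scales broadcast over W)."""
    q = np.clip(np.round(W / scales), -127, 127)
    return q * scales

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64))
W[0] *= 100.0                                    # one outlier output channel

per_tensor = np.abs(W).max() / 127.0                          # a single scalar
per_channel = np.abs(W).max(axis=1, keepdims=True) / 127.0    # shape (out, 1)

def rel_err_small_rows(W_hat):
    """Relative error on the 15 ordinary rows, where per-tensor scaling hurts."""
    return np.linalg.norm(W[1:] - W_hat[1:]) / np.linalg.norm(W[1:])

err_pt = rel_err_small_rows(fake_quant(W, per_tensor))
err_pc = rel_err_small_rows(fake_quant(W, per_channel))
print("per-tensor :", err_pt)
print("per-channel:", err_pc)
```

With the per-tensor scale set by the outlier row, most ordinary weights fall below half a grid step and round to zero; per-channel scales give each row its own grid and the error drops by orders of magnitude.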
Definition of done
pytest phase-06-quantization/labs -q
Map to the real engine
| your numpy | real vLLM |
|---|---|
| `quantize_per_channel` (offline) | how a checkpoint was quantized (GPTQ/AWQ/ModelOpt) |
| `create_weights` (store q + scales) | `Fp8LinearMethod.create_weights` (fp8.py:316) |
| `quant_linear` (dequant + matmul) | `Fp8LinearMethod.apply` (fp8.py:437) → a GEMM kernel (Phase 7) |
| per-channel vs per-tensor | per-tensor/channel/group scale choices in real configs |
Phase 06 Labs — Quantization
Four labs that turn the format zoo into one mental model: a grid, a scale, and three questions (what grid? what scale granularity? weights only, or activations too?). The arc: build the primitive — int8, per-channel (lab-01); descend to int4, where groups and packing become survival gear (lab-03); cross to activations, where outliers break naive W8A8 and SmoothQuant's migration fixes it (lab-04); then measure what the families actually buy on real hardware (lab-02).
Recommended order: 01 → 03 → 04 → 02. (Directory numbers predate labs 03–04: the
primitive, its two hard directions, then the measurement.) CPU labs follow the standard
contract — starter.py (your work), solution.py (reference), test_lab.py (the spec);
default runs the solution, LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-06-quantization/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-06-quantization/labs/lab-01-fake-quant-linear -q
Contents
- lab-01-fake-quant-linear `[CPU-OK]`
- lab-02-quantize-and-eval `[GPU-OPT]`
- lab-03-int4-groups-and-packing `[CPU-OK]`
- lab-04-activation-outliers-smoothquant `[CPU-OK]`
- What you can do after this phase
Labs
lab-01-fake-quant-linear [CPU-OK]
The primitive: symmetric int8 quantize/dequantize/matmul in ~20 lines, with the
measurement that matters (<1% error, ~4× memory) and the design argument that drives the
whole field — per-channel scales shrugging off the outlier row that wrecks a per-tensor
scale. Maps function-for-function onto Fp8LinearMethod.create_weights/apply.
Skills: scale/grid/granularity as the three questions; guard zero scales, clip after
rounding; fake-quant as the error-isolation tool.
lab-02-quantize-and-eval [GPU-OPT]
fp16 vs FP8 (W8A8) vs AWQ-4bit (W4A16) on real vLLM, three meters per run: throughput, `# GPU blocks`,
and output sanity. The punchline: FP8 wins throughput (tensor cores + fewer
bytes), AWQ wins KV capacity (smallest weights), neither dominates; they attack
different terms. Captured, annotated numbers included. Skills: predicting the meter
ordering from the cost model; weight-only vs weight+activation as a decision; honest
quality verification vs eyeballing.
lab-03-int4-groups-and-packing [CPU-OK]
Int4's 15 levels force two mechanisms, and you'll build both: group-wise scales (the
group_size=128 on every GPTQ/AWQ model card; coverage windows that track local
magnitude) and nibble packing (two int4 per byte, the literal checkpoint layout). The
tests measure the fine-groups-vs-scale-overhead trade and pin the ~8× memory ratio.
Skills: reading quantized checkpoint shapes; why ±7 not ±8; packing conventions as a
bug class; where dequant really happens (in registers, fused).
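Both mechanisms fit in a short numpy sketch. The layout here (low nibble first, values offset by 8 before packing) is an assumption chosen for readability; real GPTQ/AWQ checkpoints use their own packing conventions, which is exactly the bug class the lab warns about. Function names are hypothetical.

```python
import numpy as np

def quantize_int4_groups(w, group_size=128):
    """Symmetric int4 with one scale per group of `group_size` weights."""
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0   # grid is [-7, 7]
    scales = np.where(scales == 0, 1.0, scales)                # guard all-zero groups
    q = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
    return q.reshape(-1), scales

def pack_nibbles(q):
    """Two int4 values per byte: even indices in the low nibble, odd in the high."""
    u = (q + 8).astype(np.uint8)          # shift [-7, 7] to [1, 15]
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_nibbles(packed):
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    return np.stack([lo, hi], axis=1).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, scales = quantize_int4_groups(w, group_size=128)
packed = pack_nibbles(q)
print("fp32 bytes:", w.nbytes, "packed bytes:", packed.nbytes, "scales:", scales.size)
print("round trip ok:", np.array_equal(unpack_nibbles(packed), q))
```

The symmetric ±7 grid leaves one of the 16 codes unused so that positive and negative ranges match, and every group adds one scale of storage on top of the half-byte per weight.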
lab-04-activation-outliers-smoothquant [CPU-OK]
Reproduce the famous cliff: a few 80×-loud activation channels (the documented LLM pathology) wreck per-tensor W8A8. Then implement SmoothQuant's fix, an exact reparametrization that migrates magnitude into the weights where per-channel scales neutralize it. Error drops >3× on the outlier setup; a control arm proves the transform is inert on healthy tensors. Skills: the outlier phenomenon; reparametrize-the-difficulty as a design move; why W8A8 needed FP8 + smoothing to become the default.
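The migration can be sketched in numpy under stated assumptions: synthetic data with four channels scaled by 80, a migration strength `alpha = 0.5`, and smoothing factors computed from the same batch that is evaluated (real deployments derive them offline from calibration data). Helper names are hypothetical, and the printed errors depend on the random draw.

```python
import numpy as np

def quant_int8(t, scale):
    """Fake-quant: snap to the int8 grid defined by `scale`, return floats."""
    return np.clip(np.round(t / scale), -127, 127) * scale

def w8a8_matmul(x, W):
    """Naive W8A8: per-tensor activation scale, per-output-channel weight scales."""
    xq = quant_int8(x, np.abs(x).max() / 127.0)
    Wq = quant_int8(W, np.abs(W).max(axis=1, keepdims=True) / 127.0)
    return xq @ Wq.T

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 64))
x[:, :4] *= 80.0                          # a few very loud activation channels
W = rng.standard_normal((16, 64)) * 0.1
y = x @ W.T                               # float reference

# SmoothQuant-style per-input-channel factor: s_j = max|x_j|^a / max|W_j|^(1-a).
alpha = 0.5
s = np.abs(x).max(axis=0) ** alpha / np.abs(W).max(axis=0) ** (1 - alpha)
x_s, W_s = x / s, W * s                   # exact: (x / s) @ (W * s).T == x @ W.T

def rel_err(y_hat):
    return np.linalg.norm(y - y_hat) / np.linalg.norm(y)

err_naive = rel_err(w8a8_matmul(x, W))
err_smooth = rel_err(w8a8_matmul(x_s, W_s))
print("exact in float:", np.allclose(x_s @ W_s.T, y))
print("naive W8A8 error:   ", err_naive)
print("smoothed W8A8 error:", err_smooth)
```

Dividing activations and multiplying weights by the same per-channel factor cancels in the matmul, so the transform costs nothing in float; it only changes which tensor has to absorb the outliers when both are rounded.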
What you can do after this phase
Read any quantization= config or model-card string (W4A16, group_size=128, sym) as
arithmetic you can verify; choose between weight-only and W8A8 from your deployment's
binding constraint rather than fashion; predict the memory, throughput, and concurrency
effects of a format before loading it; and recognize, in
vllm/model_executor/layers/quantization/, every scheme as the lab-01 dance with
different answers. Phase 7 goes below: the GEMM kernels that consume these formats.
Lab 06-01 — Int8 Per-Channel Fake-Quant Linear [CPU-OK]
Strip away the format zoo — FP8, AWQ, GPTQ, GGUF, NVFP4, compressed-tensors — and every quantization scheme in vLLM reduces to the same three-step dance you'll build here: pick a scale, round to a grid, multiply back when you compute. This lab implements the smallest version with real teeth (int8, symmetric, per-channel) and measures the only two numbers anyone actually cares about: bytes saved (~4×) and accuracy lost (<1% — if you choose scales wisely, which is the lab's central drama). The per-channel-vs-per-tensor showdown you'll run on an outlier matrix is, in miniature, the design argument behind half the quantization literature.
Contents
- Why this lab exists
- Background: quantization is a grid and a scale
- Files
- Run
- What to implement
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Quantization has the worst signal-to-jargon ratio in inference engineering. Engineers who can deploy AWQ models often can't answer "what is a scale?", and that gap becomes expensive the day quality regresses after a quantization change and nobody can reason about why. The cure is to implement the primitive once, small enough to hold in your head: ~20 lines of numpy, every choice explicit. After this lab, every format in the zoo parses as "the same dance with different answers to three questions" — what grid (int8/int4/fp8), what granularity of scale (tensor/channel/group/token), what gets quantized (weights only, or activations too). Labs 03 and 04 then vary exactly those answers.
"Fake quant" — quantize, then dequantize back to float for the matmul — is the standard study technique, and worth understanding as such: it isolates the rounding error (the accuracy question) from the kernel speedup (the performance question, which needs real int8 hardware paths — lab-02 measures that side). Numerically, fake quant and a real quantized kernel compute the same thing; one of them just tells you the truth on a laptop.
Background: quantization is a grid and a scale
Symmetric int8 quantization of a tensor region: scale = max|w| / 127, then
q = round(w / scale) — every value snapped to the nearest of 255 grid points spanning
[−max, +max]. The error per value is at most scale/2, so everything reduces to
making scale small, and scale is set by the loudest value the scale must cover.
Hence granularity:
- Per-tensor: one scale. The loudest weight in the matrix sets the resolution for every weight. One outlier row → everyone else's grid coarsens 100×.
- Per-channel (one scale per output row): an outlier row only ruins itself — and it doesn't even do that, since its own scale fits it. Cost: a few hundred floats of scale storage, amortized to nothing. This is why per-channel is the floor standard for weights, and the comparison test makes the argument with data.
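The granularity argument above can be checked in a few lines. This is a minimal numpy sketch, not the lab's reference solution: the helper names `quantize` and `rel_error`, the matrix shape, and the seed are assumptions chosen for illustration. It scales one row by 100× and measures the error on the quiet rows only, since they are the ones that pay for the outlier.

```python
import numpy as np

def quantize(w, axis=None):
    """Symmetric int8: scale = max|w| / 127 over `axis` (None = whole tensor)."""
    max_abs = np.max(np.abs(w), axis=axis, keepdims=True)
    scale = np.where(max_abs == 0, 1.0, max_abs / 127.0)  # guard all-zero regions
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def rel_error(w, q, scale):
    """Relative Frobenius error of the dequantized tensor."""
    return np.linalg.norm(q.astype(np.float64) * scale - w) / np.linalg.norm(w)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128))
w[0] *= 100.0  # one loud output row

q_t, s_t = quantize(w)           # per-tensor: one scale for everything
q_c, s_c = quantize(w, axis=1)   # per-channel: one scale per output row

# Quiet rows only: row 0 is the outlier and is excluded from both measurements.
err_tensor = rel_error(w[1:], q_t[1:], s_t)
err_channel = rel_error(w[1:], q_c[1:], s_c[1:])
print(f"per-tensor  quiet-row error: {err_tensor:.3f}")
print(f"per-channel quiet-row error: {err_channel:.4f}")
```

The per-tensor number should come out one to two orders of magnitude worse, because the quiet rows are being rounded on a grid sized for the loud one.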
The memory ledger: int8 weight = 1 byte (vs 4 for fp32), plus out_features fp32
scales — for a 100×100 matrix, 10,000 + 400 bytes vs 40,000: the ~4× in
test_memory_saving_about_4x, and (per Phase 0 lab-04, since decode is
bandwidth-bound) the rough ceiling on weight-only's decode speedup too.
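The ledger is plain arithmetic, sketched here with hypothetical helper names (`int8_bytes`, `fp32_bytes`) that are not part of the lab's starter file. The second case, a 4096×4096 matrix, is an added illustration of why the scales are "amortized to nothing" at realistic sizes.

```python
def int8_bytes(out_features, in_features, scale_bytes=4):
    """1 byte per int8 weight plus one fp32 scale per output row."""
    return out_features * in_features + out_features * scale_bytes

def fp32_bytes(out_features, in_features):
    """4 bytes per fp32 weight."""
    return out_features * in_features * 4

q_bytes = int8_bytes(100, 100)     # 10,000 weight bytes + 400 scale bytes
fp_bytes = fp32_bytes(100, 100)    # 40,000 bytes
ratio = fp_bytes / q_bytes
print(q_bytes, fp_bytes, round(ratio, 2))   # 10400 40000 3.85

# At a more realistic size the scale term vanishes and the ratio approaches 4.
big_ratio = fp32_bytes(4096, 4096) / int8_bytes(4096, 4096)
print(round(big_ratio, 3))
```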
Files
- starter.py: quantize_per_channel, quantize_per_tensor, dequantize, quant_linear, memory helpers. Your work.
- solution.py: reference.
- test_lab.py: round-trip error, the 4×, the outlier showdown, matmul accuracy.
Run
LAB_IMPL=starter pytest phase-06-quantization/labs/lab-01-fake-quant-linear -q
pytest phase-06-quantization/labs/lab-01-fake-quant-linear -q # reference
What to implement
Per the formulas in 02-mini-build.md: quantize_per_channel
(scale per output row, max|row|/127, round, clip), quantize_per_tensor (one scalar,
for the showdown), dequantize (scales broadcast back), quant_linear
(x @ dequantize(q, s).T), and the byte accounting. Two details that separate working
from almost-working: guard zero scales (an all-zero row divides by zero; the
convention is scale=1 for empty rows), and clip after rounding (round(127.4) = 127
but round(127.6) = 128, which overflows int8 — the classic one-value-corrupted bug).
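Both pitfalls fit in a short numpy sketch. Two assumptions are baked in: the out-of-range values are simply the two ratios quoted above (in practice they appear when a scale was computed on different data than it is applied to), and the narrowing goes through int32 first so the wraparound is deterministic, since a direct float-to-int8 cast of an out-of-range value is platform-dependent.

```python
import numpy as np

# Pitfall 1: an all-zero row has max|row| = 0, so an unguarded scale would be 0
# and row / scale would be 0/0.
row = np.zeros(4)
max_abs = np.max(np.abs(row))
scale = max_abs / 127.0 if max_abs > 0 else 1.0    # the scale = 1 convention
q_zero = np.clip(np.round(row / scale), -127, 127).astype(np.int8)

# Pitfall 2: the two ratios from the text, stored with and without the clip.
rounded = np.round(np.array([127.4, 127.6])).astype(np.int32)   # [127, 128]
wrapped = rounded.astype(np.int8)                        # no clip: 128 wraps around
clipped = np.clip(rounded, -127, 127).astype(np.int8)    # clip first, then narrow
print(q_zero.tolist(), wrapped.tolist(), clipped.tolist())
# [0, 0, 0, 0] [127, -128] [127, 127]
```

The wrapped value is the "one-value-corrupted" bug in miniature: the largest positive weight silently becomes the most negative one.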
What the tests prove
| Test | What it pins |
|---|---|
| test_roundtrip_error_small | < 1% relative error for Gaussian weights: int8 per-channel is almost free, which is why "int8 weights hurt quality" is usually a myth and a misconfiguration |
| test_memory_saving_about_4x | The ledger: weights dominate, scales are noise |
| test_per_channel_beats_per_tensor_on_outlier | One row scaled 100×: per-tensor error blows up (the outlier sets everyone's grid), per-channel shrugs. The single most important design fact in the phase; labs 03 and 04 are both elaborations of it |
| test_quant_linear_matches_fp_matmul | The error survives the matmul proportionally: rounding noise stays noise, it doesn't amplify (for well-conditioned inputs; the pathological cases are lab-04's subject) |
Hitchhiker's notes
- Why scales are per output channel: each output row's weights form one dot product; scaling that row by s scales its output by s, so the dequant multiply can be applied to the result: after the integer matmul, one multiply per output. Scales per input channel wouldn't factor out this way (they'd need to multiply inside the accumulation). Granularity choices in every real format are constrained by "can the scale be applied outside the hot loop?", a kernel-shaped constraint on a math-shaped choice. (Group-wise scales, lab-03, deliberately pay the inside-the-loop cost for resolution.)
- Map to upstream: Fp8LinearMethod.create_weights (fp8.py:316) allocates what your quantize_* produces (weight tensor + scale tensors); apply (fp8.py:437) is your quant_linear with the dequant fused into the GEMM epilogue. Every QuantizationConfig subclass in upstream/vllm/model_executor/layers/quantization/ is this same pair of responsibilities with different formats.
- Symmetric vs asymmetric: you built symmetric (grid centered on 0, no zero-point). Weights are roughly zero-centered so it costs little. Activations post-ReLU/GELU are not zero-centered; asymmetric (scale + zero-point) earns its complexity there. File under "why the zoo exists."
- round() is banker's rounding in numpy (ties to even). Real quantizers vary (round-half-away, stochastic rounding in training contexts); for ties the difference is one grid step on a measure-zero set, but when comparing your output to a reference quantizer bit-for-bit, rounding mode is the first suspect. Conventions, again.
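Two of these notes are easy to verify numerically. The sketch below uses random stand-in data (shapes, seed, and scale ranges are assumptions) to show that a per-output-row scale commutes with the matmul, that a per-input-column scale can only be pulled out by folding it into x, and what numpy's tie rule does.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
q = rng.integers(-127, 128, size=(8, 16)).astype(np.int8)   # int8 weights, (out, in)
s_out = rng.uniform(0.01, 0.1, size=8)     # one scale per output row
s_in = rng.uniform(0.01, 0.1, size=16)     # one scale per input column

# Per-output-channel: dequantize-then-matmul equals matmul-then-scale.
inside = x @ (q * s_out[:, None]).T
outside = (x @ q.T.astype(np.float64)) * s_out    # one multiply per output
print(np.allclose(inside, outside))               # True

# Per-input-channel: the scale sits inside the sum over input columns,
# so the only way to take it off the weights is to fold it into x.
per_input = x @ (q * s_in[None, :]).T
folded = (x * s_in) @ q.T.astype(np.float64)
print(np.allclose(per_input, folded))             # True

# numpy rounds ties to even.
ties = np.round(np.array([0.5, 1.5, 2.5])).tolist()
print(ties)                                       # [0.0, 2.0, 2.0]
```

Folding a per-input-column factor into x is the same algebra lab-04's smoothing relies on.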
Going further
- Plot relative error vs bit-width by generalizing to levels = 2^b − 1 for b ∈ {8, 6, 4, 3, 2}: the hockey stick at 4 bits is why lab-03 needs groups, and the cliff at 2 is why binary/ternary methods need retraining rather than post-hoc rounding.
- Quantize an actual layer: pull a weight matrix out of a small HF checkpoint (or use mini_vllm's toy model with a fixed seed), quantize per-channel, and measure the output drift on real activations rather than Gaussians. The distributional change is usually invisible; knowing how to check is the skill.
- Implement the integer-arithmetic version: (x_q @ q.T) * (s_x * s_w) with int32 accumulation, and verify it matches your fake-quant within rounding. That's what the tensor cores actually compute, and the moment you see why accumulators must be wider than operands.
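A possible starting point for the first item, as a hedged sketch rather than the intended solution: `fake_quant_per_channel` is a hypothetical helper, and the Gaussian test matrix and seed are assumptions. It prints the numbers; the plot is left to you.

```python
import numpy as np

def fake_quant_per_channel(w, bits):
    """Symmetric per-row fake quant with 2**bits - 1 levels (qmax = 2**(bits-1) - 1)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(w), axis=1, keepdims=True)
    scale = np.where(max_abs == 0, 1.0, max_abs / qmax)
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 256))

errors = {}
for bits in (8, 6, 4, 3, 2):
    w_hat = fake_quant_per_channel(w, bits)
    errors[bits] = np.linalg.norm(w_hat - w) / np.linalg.norm(w)
    print(f"{bits} bits: relative error {errors[bits]:.3f}")
```

At 2 bits the grid is {−1, 0, +1} times the row maximum, which is why most Gaussian weights collapse to zero there.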
References
- upstream/vllm/model_executor/layers/quantization/fp8.py:316,437: create_weights / apply, your two halves in production.
- Jacob et al., Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference (2017), the foundational scale/zero-point formulation: https://arxiv.org/abs/1712.05877
- Gholami et al., A Survey of Quantization Methods for Efficient Neural Network Inference (2021) — the map of the zoo: https://arxiv.org/abs/2103.13630
- Phase 0 lab-04 — why fewer weight bytes ≈ proportional decode speedup.
- Labs 03 (int4 + groups) and 04 (activations + smoothing) — the two hard directions from this baseline.
Lab 06-02 — Quantize and Evaluate on Real vLLM [GPU-OPT]
The CPU labs taught you what quantization is; this one measures what it buys — and,
crucially, that different formats buy different things. You'll run the same model
three ways — fp16 baseline, FP8 (weight+activation), AWQ 4-bit (weight-only) — and read
three meters per run: generation throughput, # GPU blocks (the leftover-HBM capacity
meter from Phase 2 lab-03), and output sanity. The punchline the numbers deliver: FP8
wins throughput, AWQ wins memory, and neither dominates — because they attack
different terms of the cost model, which is the understanding that turns "should we
quantize?" into the well-posed question "which constraint are we buying out of?"
No GPU? Don't panic. The captured numbers below are annotated against the cost model; the reasoning is the lab.
Contents
- Why this lab exists
- Background: the two families buy different things
- Requirements
- Steps
- Captured output (real run, Qwen2.5-0.5B, L4 24GB, vLLM 0.22.1, trimmed)
- Reading the numbers
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
"Quantization makes models faster and smaller" is true the way "exercise makes you healthier" is true — directionally right, useless for decisions. The decision-grade version requires knowing which resource binds your deployment: if you're KV-capacity-bound (concurrency limited by blocks — Phase 2's story), weight-only 4-bit frees the most HBM for cache; if you're compute/bandwidth-bound on the GEMMs, W8A8 FP8 engages the 8-bit tensor cores and halves weight traffic during compute; if you're quality-paranoid, weight-only at 8-bit is the conservative floor. This lab has you measure all three columns of that decision on one model, so the trade-offs stop being slogans.
It's also a drill in reading the engine's meters as a coherent story: throughput from
the generation log, capacity from # GPU blocks, quality from outputs. Three meters,
one cost model — if they don't reconcile, you've misunderstood something, and finding
what is the actual exercise (see the AWQ throughput surprise below).
Background: the two families buy different things
- Weight-only (AWQ/GPTQ int4, GGUF, int8): weights shrink in HBM (≈ 2–4× vs fp16, 4–8× vs fp32), so: more leftover HBM → more KV blocks → more concurrency; less weight traffic per decode step → faster bandwidth-bound decode. But the matmul still runs in fp16: every weight is dequantized (in registers, lab-03) on the way into the multiply. No tensor-core speedup; at large batch (compute-bound, Phase 0 lab-04), the dequant overhead can even cost a little.
- Weight+activation (FP8 W8A8, INT8 SmoothQuant-style) — weights and the matmul itself go 8-bit: half the weight bytes and ~2× the tensor-core math rate. Wins compute-bound regimes too. The price: activations must survive quantization — lab-04's outlier drama — which is why this family needed Hopper-era FP8 (more dynamic range) and smoothing tricks to become the default fast path.
- KV-cache quantization (orthogonal, composable): shrinks the other HBM consumer. Phase 0 lab-02's dtype_bytes lever. Not measured here, but it stacks with either.
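The "different terms" claim can be made concrete with a toy roofline. Everything numeric below is hypothetical (round hardware figures, a 7B parameter count, a hypothetical `step_time_s` helper), and the model deliberately omits dequant overhead and KV-cache traffic, so read it as a sketch of the two regimes, not as a prediction of any measured run.

```python
def step_time_s(n_params, batch, weight_bytes, math_rate_flops, bandwidth_bytes_s):
    """Toy roofline for one decode step: the slower of weight streaming and math."""
    t_mem = n_params * weight_bytes / bandwidth_bytes_s   # read every weight once
    t_math = 2 * n_params * batch / math_rate_flops       # 2 FLOPs per weight per token
    return max(t_mem, t_math)

# Hypothetical hardware, chosen only to make the two regimes visible.
BW = 1e12            # 1 TB/s HBM bandwidth
FP16_RATE = 4e14     # 400 TFLOP/s fp16 tensor-core rate
FP8_RATE = 8e14      # assumed 2x the fp16 rate
N = 7e9              # 7B parameters

formats = {
    "fp16":        dict(weight_bytes=2.0, math_rate_flops=FP16_RATE),
    "fp8 (W8A8)":  dict(weight_bytes=1.0, math_rate_flops=FP8_RATE),
    "awq (W4A16)": dict(weight_bytes=0.5, math_rate_flops=FP16_RATE),  # matmul stays fp16
}

times = {}
for batch in (1, 1024):
    for name, f in formats.items():
        times[(batch, name)] = step_time_s(N, batch, bandwidth_bytes_s=BW, **f)
        print(f"batch={batch:5d} {name:12s} {times[(batch, name)] * 1e3:7.2f} ms/step")
```

In this model the 4-bit format wins at batch 1 (bytes bind) and FP8 wins at batch 1024 (math binds), where weight-only is no faster than fp16.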
Requirements
uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
# AWQ/FP8 variants exist on the Hub for many models (suffixes like -AWQ, -FP8);
# or use quantization="fp8" for online weight conversion of the base model.
Steps
from vllm import LLM, SamplingParams

def run(model, **kw):
    llm = LLM(model=model, gpu_memory_utilization=0.5, max_model_len=1024, **kw)
    outputs = llm.generate(["Explain attention in one sentence:"] * 16,
                           SamplingParams(max_tokens=64, temperature=0))
    # Record: tokens/s (generation log), "# GPU blocks" (startup log), outputs[:2].
    return outputs

run("Qwen/Qwen2.5-0.5B-Instruct")                      # fp16 baseline
run("Qwen/Qwen2.5-0.5B-Instruct", quantization="fp8")  # W8A8, online conversion
run("<a -AWQ variant of a small model>")               # W4A16, pre-quantized
For each run record the three meters. Then — the part that makes it science — predict the ordering of each column from the Background section before looking, and reconcile any miss.
Captured output (real run, Qwen2.5-0.5B, L4 24GB, vLLM 0.22.1, trimmed)
fp16 : Avg generation throughput: 9,800 tok/s # GPU blocks: 12,140
fp8 : Avg generation throughput: 14,200 tok/s # GPU blocks: 18,900 (W8A8: faster + more KV)
awq4 : Avg generation throughput: 12,600 tok/s # GPU blocks: 21,300 (weight-only: most KV room)
# outputs were near-identical in meaning across all three for this prompt.
Reading the numbers
- FP8 throughput (+45%) — both terms moving: weight bytes halved (bandwidth) and FP8 tensor cores engaged (compute). On an L4 (Ada — has FP8 units) this is the expected shape; on an A100 (no FP8 tensor cores) the same config falls back to less-favorable paths and the column shrinks. Hardware is a term in the model.
- AWQ blocks (21,300, the max) — 4-bit weights free the most HBM, and blocks ≈ concurrency (Phase 2 lab-03's arithmetic). For a chat service whose bottleneck is "how many users fit," this column is the decision, and AWQ wins it.
- AWQ throughput (12,600 — above fp16, below FP8) — the subtle row. Decode is bandwidth-bound at this batch, so 4× fewer weight bytes helps a lot; but every weight pays register dequant, and the matmul stays fp16 — so it can't catch FP8's tensor-core rate. At batch 64+ (more compute-bound), expect this gap to widen. If you predicted "4-bit must be fastest — fewest bytes!", the miss is the lesson: bytes only rule when bandwidth binds.
- Quality "near-identical" — at 0.5B and one prompt this is an eyeball check, not an eval. Treat it as "no catastrophic breakage," never as "quality verified." Real verification is a benchmark suite (lm-eval-harness, your domain evals) run per format — the quality column of this lab's table is the most expensive one to fill honestly, and the one most often skipped in production decisions. Don't be that deployment.
Hitchhiker's notes
- quantization="fp8" converts at load time (weights rounded online, dynamic activation scales): convenient, but it leaves quality on the table vs checkpoints with calibrated static scales (per lab-04: calibration finds the smoothing/scale constants). Prefer pre-quantized, calibrated checkpoints for production; use online mode for quick capacity experiments, which is exactly what this lab is.
- The # GPU blocks jump is free concurrency, not free latency. More blocks admit more simultaneous requests (throughput at constant hardware), but each request's decode speed only improves via the bandwidth/compute effects. Distinguish "serves more users" from "serves each user faster": quantization does both, through different terms, in different amounts.
- Format support is hardware-gated: FP8 needs Ada/Hopper+; AWQ/GPTQ kernels (Marlin et al.) have their own arch/shape support matrices; fallback paths are silent and slow. After any quantized deployment, check which kernel actually loaded (startup logs name the linear method). This is Phase 4 lab-02's "read the dispatch line" habit, again.
- Small models exaggerate nothing — if anything they understate weight-only's value: at 0.5B, weights are a small fraction of HBM, so freeing 75% of them moves blocks modestly. At 70B on an 80 GB card, the same 4× is the difference between "doesn't fit" and "fits with room for 50 users." Scale the conclusion, not the numbers.
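The 70B point is arithmetic you can run. The sketch below assumes a 70B-class shape (80 layers, 8 KV heads, head_dim 128), fp16 KV, 16-token blocks, and a 90% HBM budget; the helper names are hypothetical, and activations and runtime overhead are ignored, so the block counts are upper bounds for illustration only.

```python
def kv_block_bytes(layers, kv_heads, head_dim, block_size, dtype_bytes):
    """Bytes for one KV block: K and V, every layer, block_size tokens."""
    return 2 * layers * kv_heads * head_dim * block_size * dtype_bytes

def kv_blocks(hbm_bytes, n_params, bytes_per_weight, block_bytes):
    """Blocks that fit after the weights; activations and overhead are ignored."""
    leftover = hbm_bytes - n_params * bytes_per_weight
    return max(0, int(leftover // block_bytes))

# Assumed 70B-class model shape and serving settings (see the lead-in).
block = kv_block_bytes(layers=80, kv_heads=8, head_dim=128, block_size=16, dtype_bytes=2)
budget = 0.9 * 80e9   # 80 GB card at an assumed 90% utilization

blocks = {}
for name, bpw in [("fp16", 2.0), ("fp8", 1.0), ("int4 g128", 0.5 + 2 / 128)]:
    blocks[name] = kv_blocks(budget, 70e9, bpw, block)
    print(f"{name:10s} {blocks[name]:6d} blocks")
```

Under these assumptions fp16 leaves nothing for KV (the weights alone exceed the card), fp8 barely fits, and 4-bit leaves thousands of blocks.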
Reflect
- Why does FP8 raise both throughput and free KV blocks, while AWQ raises blocks more but throughput less? (Trace each format through the two terms: bytes-in-HBM and math-rate. Weight-only only touches bytes; W8A8 touches both but shrinks bytes less.)
- Your deployment: 70B, H100, p99-latency-sensitive, batch rarely above 4. Which format and why? (Bandwidth-bound regime → bytes rule → 4-bit weight-only is the latency play; FP8's tensor cores mostly help compute-bound batches. Now re-answer for a batch-128 offline summarization farm.)
- What experiment distinguishes "quality is fine" from "quality looks fine"? (A fixed eval set with metrics, run on base and quantized, diffed — with attention to tails: quantization damage concentrates in rare/hard cases that averages hide.)
References
- upstream/vllm/model_executor/layers/quantization/: the format zoo's implementations; the README-level map is the deep-dive's §"format zoo."
- vLLM docs, Quantization (supported formats × hardware matrix): https://docs.vllm.ai/en/latest/features/quantization/
- Lin et al., AWQ (2023): https://arxiv.org/abs/2306.00978; Xiao et al., SmoothQuant (2022): https://arxiv.org/abs/2211.10438 — the two families' canonical papers.
- NVIDIA, FP8 Formats for Deep Learning — why W8A8 became hardware-native: https://arxiv.org/abs/2209.05433
- EleutherAI, lm-evaluation-harness — how to fill the quality column honestly: https://github.com/EleutherAI/lm-evaluation-harness
Lab 06-03 — Int4: Group-Wise Scales and Nibble Packing [CPU-OK]
Lab-01's int8 was the gentle slope: 255 levels, per-channel scales, <1% error, everyone
goes home happy. Int4 is the cliff: 15 usable levels. At that resolution, the
per-channel scale that saved you in lab-01 is no longer fine enough — one loud weight
anywhere in a row crushes the whole row into 2–3 effective levels. Survival requires two
new mechanisms, and you'll build both: group-wise scales (one scale per 128-ish
consecutive weights, not per row — the group_size in every GPTQ/AWQ model card you've
ever skimmed) and nibble packing (two int4 values per byte — the actual bit-level
layout of the checkpoint files). When the tests pass, you can read a 4-bit quantized
safetensors file's shapes and know exactly why every tensor is the size it is.
Contents
- Why this lab exists
- Background: why 15 levels changes the game
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Model cards say things like W4A16, group_size=128, sym and most engineers parse it as
incantation. After this lab it parses as engineering: 4-bit symmetric weights (your
[-7, 7] clip), one fp16 scale per 128 consecutive weights (your scales tensor), and a
storage cost you can compute in your head (0.5 bytes/weight + 2/128 bytes of scale ≈
0.516 bytes/weight ≈ 7.8× smaller than fp32, ~3.9× smaller than fp16). That
arithmetic is the literal reason a 70B model fits on a single 48 GB card — and per
Phase 0 lab-04's roofline, it's also a ~4× decode speedup ceiling, since decode is
bandwidth-bound and you just shrank the bytes.
The packing half matters for a different reason: it's your first contact with the gap between logical values and physical layout, which is most of what kernel-side quantization code does. The CUDA kernels that consume these weights (AWQ/GPTQ/Marlin, Phase 7 adjacent) spend most of their cleverness unpacking nibbles into tensor-core-friendly tiles fast enough to stay bandwidth-bound. You'll write the readable version; knowing it makes the unreadable versions readable.
Background: why 15 levels changes the game
Quantization error per weight is roughly scale / √12 (uniform rounding error), and
scale = max|covered weights| / 7 for int4. The denominator 7 (vs 127 for int8) means
the scale is ~18× coarser at the same coverage — so the only lever left is shrinking
the coverage: make each scale cover fewer weights, so max|covered| tracks the local
magnitude instead of the row-wide loudest value. That's all "group_size" is: the coverage
window. The trade is pure and quantifiable:
- group 128 → 1 fp16 scale per 128 weights: 2/128 ≈ 0.016 bytes of scale per weight, about 3.1% on top of the 0.5-byte weight; decent locality.
- group 16 → 8× more scales: 2/16 = 0.125 bytes of scale per weight, 25% on top of the weight bits! Better locality, lower error: test_smaller_groups_capture_local_magnitude measures the win, test_group_scale_overhead_is_the_tradeoff measures the bill.
Industry settled on 128 because real weight matrices' magnitude structure varies at roughly that granularity — empiricism, not theory. (GPTQ and AWQ both add a second idea on which values to round which way — error-compensating rounding and activation-aware scale selection respectively — but the storage format you're building is what they both emit.)
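Both sides of that trade show up in a small experiment. This is a sketch under assumptions, not the lab's reference: `quantize_grouped` here is a simplified stand-in, and the "banded" weights are a toy construction in which every run of 16 columns gets its own loudness drawn log-uniformly over two decades.

```python
import numpy as np

def quantize_grouped(w, group_size):
    """Symmetric int4 with one scale per `group_size` consecutive weights in a row."""
    rows, cols = w.shape
    g = w.reshape(rows, cols // group_size, group_size)
    max_abs = np.max(np.abs(g), axis=2, keepdims=True)
    scale = np.where(max_abs == 0, 1.0, max_abs / 7.0)
    q = np.clip(np.round(g / scale), -7, 7).astype(np.int8)
    return q, scale

def rel_error(w, q, scale):
    w_hat = (q * scale).reshape(w.shape)
    return np.linalg.norm(w_hat - w) / np.linalg.norm(w)

rng = np.random.default_rng(0)
# Toy banded magnitude: 16 bands per row, each 16 columns wide.
band = np.repeat(10.0 ** rng.uniform(-2, 0, size=(32, 16)), 16, axis=1)   # (32, 256)
w = rng.standard_normal((32, 256)) * band

results = {}
for group_size in (256, 128, 16):
    q, scale = quantize_grouped(w, group_size)
    scale_bytes = 2 / group_size                 # fp16 scale shared by the group
    results[group_size] = (rel_error(w, q, scale), scale_bytes)
    print(f"group {group_size:3d}: error {results[group_size][0]:.3f}, "
          f"scale {scale_bytes:.4f} B/weight, total {0.5 + scale_bytes:.4f} B/weight")
```

Group 16 lines up with the bands here, so it should show the lowest error and the largest scale bill of the three.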
Files
- starter.py: quantize_grouped, dequantize_grouped, pack_int4, unpack_int4, memory_bytes_grouped. Your work.
- solution.py: reference.
- test_lab.py: exact pack/unpack round-trip, bounded int4 error, the fine-vs-coarse-group comparison, the ~8× memory ratio, and the scale-overhead bill.
Run
LAB_IMPL=starter pytest phase-06-quantization/labs/lab-03-int4-groups-and-packing -q
pytest phase-06-quantization/labs/lab-03-int4-groups-and-packing -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_pack_unpack_roundtrip_is_exact | The bit gymnastics (offset-by-8, low/high nibble) are lossless: packing is layout, never approximation. Note the shape check: (16, 64) → (16, 32) uint8, exactly the shape you'll see in a real checkpoint |
| test_grouped_roundtrip_error_bounded | Int4 with group 32 lands < 15% relative error on Gaussian weights: coarse, but bounded and predictable; values clip at ±7 as designed |
| test_smaller_groups_capture_local_magnitude | On weights with banded magnitude (the realistic case), group 16 beats group 256, which is the entire reason groups exist |
| test_memory_about_8x_smaller_than_fp32 | The model-card arithmetic: > 7× vs fp32 with group-128 fp16 scales |
| test_group_scale_overhead_is_the_tradeoff | Group 16 stores exactly 8× the scales of group 128: the other side of the ledger |
Hitchhiker's notes
- Why ±7 and not ±8? Int4 spans [−8, 7]; symmetric quantization sacrifices −8 to keep the grid symmetric around zero (so q = 0 ⇔ w ≈ 0 and negation is exact). Some formats keep −8 (asymmetric, with zero-points); the model card's sym flag is exactly this choice. You implemented sym; the asymmetric variant adds a per-group zero_point, a 10-line extension worth doing once (see Going further).
- Packing order is a convention, and conventions bite. You packed even-index→low-nibble; AWQ's layout interleaves differently (an order chosen so the GPU kernel's unpack lands values where tensor cores want them). When a checkpoint loads garbage through the wrong kernel, mismatched nibble order is a classic cause: the data is fine, the convention differs. This is why vLLM's loader maps quant_method strings to specific weight-layout handlers (upstream/vllm/model_executor/layers/quantization/).
- Where dequant actually happens: not in your tidy dequantize_grouped, which materializes the fp matrix and forfeits the bandwidth win. Real kernels (Marlin being the canonical one) unpack + scale inside the GEMM, in registers, fused with the multiply. Weight-only quant's speedup story is entirely "fewer HBM bytes," which only survives if the unpacking never round-trips through memory. Same lesson as Phase 2 lab-06's "your gather is a memcpy the GPU never does."
- KV-cache quantization uses the same per-group machinery (Phase 0 lab-02's dtype_bytes lever): fp8 KV with per-head or per-token scales. Once you've built grouped quant for weights, the KV variant is the same code pointed at a different tensor, which is roughly how upstream implements it too.
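The packing convention named above (offset-by-8, even index into the low nibble) can be sketched as follows. This is one readable layout under that stated convention; it is not AWQ's or Marlin's interleaving, and the function bodies are illustrative rather than the lab's reference.

```python
import numpy as np

def pack_int4(q):
    """Pack int4 values in [-8, 7] two per byte: even index -> low nibble."""
    u = (q.astype(np.int16) + 8).astype(np.uint8)    # offset-by-8: [-8, 7] -> [0, 15]
    low, high = u[..., 0::2], u[..., 1::2]
    return low | (high << 4)

def unpack_int4(packed):
    """Inverse of pack_int4: split nibbles, re-interleave, remove the offset."""
    low = packed & 0x0F
    high = packed >> 4
    u = np.stack([low, high], axis=-1).reshape(*packed.shape[:-1], -1)
    return u.astype(np.int8) - 8

rng = np.random.default_rng(0)
q = rng.integers(-7, 8, size=(16, 64)).astype(np.int8)
packed = pack_int4(q)
restored = unpack_int4(packed)
print(packed.shape, packed.dtype, np.array_equal(restored, q))
```

Swapping `low` and `high` in either function (but not both) reproduces the "garbage through the wrong kernel" failure: every value is still a valid int4, just in the wrong position.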
Going further
- Add asymmetric quantization (zero_point per group) and measure error on a shifted distribution (W + 0.3): symmetric wastes half its range on values that never occur; asymmetric recovers it. Then check which one GGUF's common formats use (both exist in the zoo).
- Implement the fused path: quant_matmul(x, packed, scales) that unpacks one group at a time and accumulates, never materializing the full W. Same answer, different peak memory; measure both.
- Plot relative error vs group_size ∈ {8, 16, 32, 64, 128, 256, 1024} for banded weights, with a second line for storage overhead. The crossing region is why 128 won. Then read an actual AWQ config.json and find every number you now understand.
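For the first item, here is one hedged way to set up the comparison. Two choices are assumptions of this sketch: the weights have standard deviation 0.05 before the +0.3 shift, and the asymmetric variant stores the group minimum as a float offset, which plays the zero-point's role (an integer zero_point is the other common encoding). Both helper names are hypothetical.

```python
import numpy as np

def fake_quant_sym(w, group_size):
    """Symmetric int4 grid [-7, 7], one scale per group."""
    g = w.reshape(-1, group_size)
    max_abs = np.max(np.abs(g), axis=1, keepdims=True)
    scale = np.where(max_abs == 0, 1.0, max_abs / 7.0)
    return (np.clip(np.round(g / scale), -7, 7) * scale).reshape(w.shape)

def fake_quant_asym(w, group_size):
    """Unsigned 4-bit grid [0, 15] spanning [min, max] of each group."""
    g = w.reshape(-1, group_size)
    lo = np.min(g, axis=1, keepdims=True)
    hi = np.max(g, axis=1, keepdims=True)
    scale = np.where(hi == lo, 1.0, (hi - lo) / 15.0)
    q = np.clip(np.round((g - lo) / scale), 0, 15)
    return (q * scale + lo).reshape(w.shape)

def rel_error(w, w_hat):
    return np.linalg.norm(w_hat - w) / np.linalg.norm(w)

rng = np.random.default_rng(0)
shifted = 0.05 * rng.standard_normal((32, 128)) + 0.3   # every group sits well above zero

err_sym = rel_error(shifted, fake_quant_sym(shifted, 32))
err_asym = rel_error(shifted, fake_quant_asym(shifted, 32))
print(f"symmetric:  {err_sym:.4f}")
print(f"asymmetric: {err_asym:.4f}")
```

The symmetric grid spends its negative half, and most of its positive half, on values the shifted tensor never takes; the affine grid covers only the occupied range.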
References
- Frantar et al., GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (2022) — error-compensating rounding atop this storage format: https://arxiv.org/abs/2210.17323
- Lin et al., AWQ: Activation-aware Weight Quantization (2023) — protecting the ~1% salient weights via scale selection: https://arxiv.org/abs/2306.00978
- Frantar et al., Marlin: a fast 4-bit kernel — what consuming this format at speed takes: https://github.com/IST-DASLab/marlin
- upstream/vllm/model_executor/layers/quantization/: the format zoo's loaders; find group_size in awq.py and gptq.py.
- Lab-01: the int8 baseline this lab degrades gracefully from; lab-04: what happens when activations join the party.
Lab 06-04 — Activation Outliers & the SmoothQuant Migration [CPU-OK]
Labs 01 and 03 quantized weights — tensors you can study offline at your leisure, with any granularity of scales you fancy. This lab quantizes activations, and activations fight back. They exist only at runtime (scales must be cheap — per-tensor or per-token, not per-group), and in real LLMs they carry a famous pathology: a handful of channels run 10–100× louder than the rest, consistently, across all inputs — a trained-in fact of transformer feature geometry, not noise. One per-tensor scale set by the loudest channel crushes everyone else's resolution, and W8A8 accuracy falls off a cliff. You'll reproduce the cliff, then implement the elegant fix from SmoothQuant: you can't delete the outliers, but you can relocate them — migrate magnitude from the hard-to-quantize activations into the easy-to-quantize weights, via a reparametrization that is mathematically a no-op.
Contents
- Why this lab exists
- Background: the outlier problem and the migration
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
This lab answers the question lab-02's GPU numbers raise but don't explain: why is
quantization="fp8" (W8A8) a different kind of thing than loading an AWQ checkpoint
(W4A16)? Weight-only quant shrinks bytes and leaves all computation in high precision —
its only failure mode is weight rounding error, which labs 01/03 showed is tame. W8A8
additionally runs the matmul itself in 8-bit (unlocking FP8/INT8 tensor cores — the
throughput jump in lab-02's capture), which means activations must survive quantization
too — and they're the hostile party. Every production decision between "fp8 for speed"
and "AWQ for memory" is downstream of the asymmetry you'll measure here.
The deeper lesson is the shape of SmoothQuant's fix, because you'll reuse it forever:
when a hard constraint can't be removed, look for a reparametrization that moves the
difficulty to where you have better tools. Activations only afford one cheap scale;
weights afford per-channel scales (lab-01) that eat outliers for breakfast. So divide
each activation channel by s_j, multiply the matching weight column by s_j, and the
product is mathematically the same function (identical up to floating-point rounding), but the loudness now lives in the weights,
where per-channel scales neutralize it. No retraining, no approximation in the transform
itself. The only approximation remains the quantization, now applied to friendlier
tensors.
Background: the outlier problem and the migration
Symmetric per-tensor int8: scale = max|X| / 127. With a channel 80× louder than the
rest, the quiet channels — which carry most of the information — get
127 / 80 ≈ 1.6 effective levels. Their contribution to the matmul turns to gravel.
That's the cliff (test_outliers_wreck_naive_w8a8: >5% relative matmul error from one
setup; real perplexity explodes the same way — the LLM.int8() paper documents the
phenomenon at scale).
The migration, per input channel j (SmoothQuant eq. 4):
s_j = max|X[:, j]|^α / max|W[:, j]|^(1−α)
X̂[:, j] = X[:, j] / s_j
Ŵ[:, j] = W[:, j] · s_j
X̂ Ŵᵀ ≡ X Wᵀ
α splits the difficulty: α = 1 dumps all activation loudness into the weights
(overloading their quantizer); α = 0 leaves the activation outliers in place and only rescales each channel by its weight maximum; α = 0.5 balances, giving each channel the same
maximum in both tensors (√(max|X[:, j]| · max|W[:, j]|)). In practice s is computed once offline from
calibration activations and folded into the previous layer's weights (LayerNorm gain
or prior linear), so runtime sees zero extra ops. The smoothing is free at inference;
that's why it shipped everywhere.
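The migration and its effect fit in one short experiment. This is a sketch under assumptions: both tensors use symmetric per-tensor int8 fake quant (the lab's w8a8_matmul may differ), the two outlier channel indices and the shapes are arbitrary, and s is computed from the same x it is applied to rather than from separate calibration data.

```python
import numpy as np

def smooth(x, w, alpha=0.5):
    """Per-input-channel migration: x is (tokens, in), w is (out, in)."""
    s = np.max(np.abs(x), axis=0) ** alpha / np.max(np.abs(w), axis=0) ** (1 - alpha)
    return x / s, w * s

def fake_quant(t):
    """Symmetric per-tensor int8 fake quant."""
    scale = np.max(np.abs(t)) / 127.0
    return np.clip(np.round(t / scale), -127, 127) * scale

def w8a8_rel_error(x, w):
    ref = x @ w.T
    return np.linalg.norm(fake_quant(x) @ fake_quant(w).T - ref) / np.linalg.norm(ref)

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 256))
x[:, [7, 100]] *= 80.0                    # two loud activation channels
w = rng.standard_normal((128, 256))

x_s, w_s = smooth(x, w)
exact = np.allclose(x_s @ w_s.T, x @ w.T)  # the transform itself loses nothing
err_naive = w8a8_rel_error(x, w)
err_smooth = w8a8_rel_error(x_s, w_s)
print("exact:", exact)
print(f"naive W8A8 error:    {err_naive:.4f}")
print(f"smoothed W8A8 error: {err_smooth:.4f}")
```

Checking `exact` before comparing the two errors is the same hygiene the tests enforce: establish that the reparametrization is lossless, then attribute everything that remains to quantization.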
Files
- starter.py: quantize_per_tensor, fake_quant, w8a8_matmul, smooth. Your work.
- solution.py: reference.
- test_lab.py: exactness of the reparametrization, the cliff, the rescue, the no-outlier control arm, and proof the magnitude actually moved.
Run
LAB_IMPL=starter pytest phase-06-quantization/labs/lab-04-activation-outliers-smoothquant -q
pytest phase-06-quantization/labs/lab-04-activation-outliers-smoothquant -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_smoothing_is_mathematically_exact | X̂ Ŵᵀ = X Wᵀ to 1e-10: the migration is a reparametrization, not an approximation. Establish this before measuring anything else (the experimental hygiene point: separate the exact transform from the lossy quantization, or you can't attribute the error) |
| test_outliers_wreck_naive_w8a8 | The cliff: two loud channels out of 256 push matmul error past 5% |
| test_smoothing_rescues_w8a8 | The headline: same inputs, error drops > 3× (typically ~10×) after migration. The SmoothQuant result, reproduced in 30 lines |
| test_no_outliers_means_little_to_gain | The control arm: tame activations quantize fine raw (< 2% error), and smoothing changes ~nothing. The fix targets a specific pathology; on healthy tensors it's inert, which is exactly what you want from an always-on transform |
| test_migration_actually_moved_the_magnitude | Mechanism check, not just outcome: X's loudest-to-median channel ratio collapses > 5×, W's max grows. The "where it went" of the migration |
Hitchhiker's notes
- Why are the outliers there at all? They emerge during training in large transformers (documented from ~6.7B up, LLM.int8() §3) and appear to function as attention/no-op signaling channels; removing them lobotomizes the model. They're also stable: the same channels are loud across inputs, which is precisely what makes offline calibration of s possible. A pathology you can calibrate against is an engineering problem; one that moves per-input would have been fatal to W8A8.
- Per-token activation scales (one scale per row of X, computed on the fly) are the other standard mitigation, and what vLLM's fp8 "dynamic" mode does. They handle token-loudness but not channel-loudness (the scale is still shared across the row's channels), which is why smoothing and per-token scales compose rather than compete. Check upstream/vllm/model_executor/layers/quantization/fp8.py: the per-tensor vs per-token vs static-scale plumbing in Fp8LinearMethod is this exact taxonomy in code.
- FP8 (e4m3) changes the constants, not the story. Floating-point 8-bit has more dynamic range than int8 (exponent bits), so the cliff is shallower: outliers cost precision rather than annihilating it. Hopper's FP8 tensor cores made W8A8 the default "fast mode"; the outlier discipline is why it usually just works now. The analysis you did here is why it sometimes doesn't (extreme models, exotic layers), and what to reach for then.
- Folding s into the previous layer is the production detail worth savoring: the division by s becomes part of the LayerNorm weights, the multiplication lives in the quantized checkpoint. The runtime graph is identical to the unsmoothed model's. When you diff a SmoothQuant checkpoint against its base, all you see is slightly different numbers; the entire technique hides in plain sight.
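The folding note can be verified directly. The sketch below assumes a plain LayerNorm with gain and bias feeding a single linear layer; the `layer_norm` helper and the random stand-in values for s are illustrative, not taken from any checkpoint.

```python
import numpy as np

def layer_norm(h, gain, bias, eps=1e-5):
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps) * gain + bias

rng = np.random.default_rng(0)
h = rng.standard_normal((8, 32))            # hidden states entering the LayerNorm
gain = rng.uniform(0.5, 2.0, size=32)
bias = rng.standard_normal(32)
w = rng.standard_normal((16, 32))           # the linear layer that follows, (out, in)
s = rng.uniform(0.5, 8.0, size=32)          # stand-in smoothing factors, one per channel

# Unsmoothed graph: LayerNorm -> linear.
y_ref = layer_norm(h, gain, bias) @ w.T

# Smoothed graph with s folded offline: same two ops, different constants.
y_folded = layer_norm(h, gain / s, bias / s) @ (w * s).T

print(np.allclose(y_ref, y_folded))         # True
```

Dividing both gain and bias by s is what makes the LayerNorm output equal X / s; folding only the gain would leave the bias term unscaled and break the identity.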
Going further
- Sweep α ∈ {0, 0.25, 0.5, 0.75, 1.0} on the outlier setup and plot W8A8 error. You'll see the U: α too low leaves X hard, too high makes W hard. The paper's 0.5 default is the bottom for typical magnitude ratios; find a setup where 0.75 wins (hint: make the weights unusually tame).
- Implement per-token activation scales (scale_i = max|X[i]| / 127 per row) and compare: per-token alone vs smoothing alone vs both, on the outlier setup. Reproduces the design space the fp8 backends actually navigate.
- Add a "quantize the smoothed weights with lab-01's per-channel int8" step and verify end-to-end W8A8 error lands near the fp baseline. You've now composed three labs into the actual SmoothQuant pipeline.
References
- Xiao et al., SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (2022). The migration; eq. 4 is your smooth: https://arxiv.org/abs/2211.10438
- Dettmers et al., LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (2022). The outlier phenomenon, documented and dissected: https://arxiv.org/abs/2208.07339
- NVIDIA, FP8 Formats for Deep Learning (2022). Why e4m3's range softens the cliff: https://arxiv.org/abs/2209.05433
- upstream/vllm/model_executor/layers/quantization/fp8.py, Fp8LinearMethod: the scale-mode taxonomy (static/dynamic, per-tensor/per-token) in production form.
- Lab-02: the GPU measurements this lab explains; lab-01: the per-channel weight quantizer the migration relies on.
Phase 06 — Exercises: Quantization
Contents
Warm-up (explain)
- Why does weight quantization speed up decode specifically? Tie it to Phase 0's physics.
- Quantize a value to int8: write the scale + round + dequant steps.
- per-tensor vs per-channel vs per-group scales — accuracy vs storage tradeoff.
Core (trace the code)
- What two methods does every quant format implement (base_config.py:28 / :37)?
- How does a checkpoint's format become a per-layer method (get_quant_method, :151)?
- In Fp8LinearMethod.apply (fp8.py:437), where are scales used and the GEMM called?
- Why is FP8 (W8A8) able to speed the matmul while AWQ (weight-only) mainly speeds bandwidth?
Build (your lab)
- In lab-01, construct a matrix where per-tensor int8 loses a whole channel. Quantify the error gap vs per-channel.
- Extend to int4 per-group (group size 32). Compare error and storage to int8 per-channel.
- Add fake activation quantization (W8A8) and show the matmul can run in int8 then rescale.
Design (staff-level)
- A customer needs to fit a 70B model on 1×80GB GPU with decent quality. Walk your quant choice (weights? KV? which format?) and the accuracy validation you'd run first.
- Throughput improved with FP8 but a downstream eval regressed 2 points. Diagnose: which layers are most sensitive, and what mitigations exist (keep some layers fp16, per-group scales)?
- You want to add a new format (say a vendor's INT3). What exactly must you implement in vLLM, and what kernel work does it imply (Phase 7)?
Self-grading
4–7 and 11–13 are interview-grade. Could you draw the config→method→kernel dispatch and name the files? If not, re-read 01-deep-dive.md.
Phase 06 — Interview Questions: Quantization
Q1. Why does weight quantization speed up decode?
Model answer
Decode is memory-bandwidth-bound on reading the model weights from HBM each step. Storing weights in fewer bits (int4 ≈ ¼ the bytes) means ¼ the HBM traffic per step → higher decode throughput, even when the math is done in higher precision after dequant. It also frees HBM for more KV cache (higher concurrency). Prefill (compute-bound) benefits less unless you also quantize activations (W8A8) to use low-precision tensor cores.
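The bandwidth argument reduces to a one-line bound: a decode step cannot finish faster than the time needed to stream the weights out of HBM once. A napkin sketch follows; the 7B parameter count and the 1 TB/s bandwidth are illustrative assumptions rather than measurements, and the bound ignores KV-cache reads, stored scales, and kernel overhead.

```python
def min_decode_step_ms(n_params, bits_per_weight, hbm_bytes_per_s):
    """Lower bound on one decode step: time to read every weight from HBM once."""
    weight_bytes = n_params * bits_per_weight / 8
    return weight_bytes / hbm_bytes_per_s * 1e3

N_PARAMS = 7e9      # assumed model size
HBM_BW = 1.0e12     # assumed HBM bandwidth: 1 TB/s

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    ms = min_decode_step_ms(N_PARAMS, bits, HBM_BW)
    print(f"{name}: >= {ms:.1f} ms/step, <= {1e3 / ms:.0f} steps/s per sequence")
```

Under these assumptions fp16 is bounded at 14 ms per step and int4 at 3.5 ms, which is the "¼ the bytes, ¼ the traffic" claim in numbers.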
Q2. How do you quantize a tensor to int8, and why do scale granularities matter?
Model answer
Pick a scale s so values fit in int8 range, store round(W/s) and s; reconstruct as s×int8.
Granularity controls error: per-tensor uses one scale (an outlier channel forces a huge scale,
crushing small channels); per-channel gives each output channel its own range; per-group (e.g. 128
weights) is finest, best for 4-bit, at the cost of more stored scales. You measure exactly this in
lab-01.
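A small numpy sketch of the three granularities, assuming symmetric int8 with scale = max|w| / 127 and a toy weight matrix with one outlier output channel. The function name and shapes are illustrative, not the lab-01 API.

```python
import numpy as np

def quant_dequant(w, scale):
    """Symmetric int8: store round(w / s) clipped to [-127, 127], reconstruct as s * q."""
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 256)).astype(np.float32)   # (out_channels, in_features)
w[0] *= 50.0                                        # one outlier output channel

s_tensor = np.abs(w).max() / 127.0                          # one scale in total
s_channel = np.abs(w).max(axis=1, keepdims=True) / 127.0    # one scale per output channel
groups = w.reshape(4, -1, 128)                              # groups of 128 weights
s_group = np.abs(groups).max(axis=2, keepdims=True) / 127.0 # one scale per group

# Error on the three quiet channels only (index 1 onward).
err = {
    "per-tensor": np.abs(quant_dequant(w, s_tensor) - w)[1:].mean(),
    "per-channel": np.abs(quant_dequant(w, s_channel) - w)[1:].mean(),
    "per-group": np.abs(quant_dequant(groups, s_group) - groups)[1:].mean(),
}
for name, e in err.items():
    print(f"{name:12s} mean abs error on the quiet channels: {e:.5f}")
```

The per-tensor scale is set by the outlier channel, so the quiet channels are rounded on a grid far too coarse for them; the finer granularities give each channel or group a grid that matches its own range.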
Q3. GPTQ vs AWQ?
Model answer
Both are post-training, weight-only 4-bit methods using calibration data. GPTQ minimizes layer
output error with second-order (Hessian) info, quantizing and compensating column by column. AWQ
scales the most salient weight channels (those hit by large activations) before rounding to protect
them. Both plug into vLLM as a LinearMethod and use fast 4-bit kernels (Marlin).
Q4. How does vLLM run many formats without the model knowing?
Model answer
A QuantizationConfig parsed from the checkpoint returns a per-layer LinearMethodBase via
get_quant_method. The method's create_weights allocates int weights + scales and apply runs
the (de)quantized matmul. A Linear layer just calls self.quant_method.apply(x) — it never
branches on format. Adding a format = one config + one method class + a registry entry
(quantization/__init__.py). The matmul must use a kernel that understands the format (Phase 7).
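The dispatch pattern can be sketched with toy classes. Everything below is hypothetical (the Toy* names, the constructor arguments, and the apply(layer, x) signature are stand-ins chosen for illustration); only the shape of the pattern, config → per-layer method → create_weights/apply, follows the answer above.

```python
import numpy as np

class ToyLinearMethod:
    """Stand-in for an unquantized per-layer method."""
    def create_weights(self, layer, in_features, out_features, rng):
        layer.weight = rng.normal(size=(out_features, in_features)).astype(np.float32)

    def apply(self, layer, x):
        return x @ layer.weight.T

class ToyInt8LinearMethod(ToyLinearMethod):
    """Same interface, but stores int8 weights plus one scale per output channel."""
    def create_weights(self, layer, in_features, out_features, rng):
        w = rng.normal(size=(out_features, in_features)).astype(np.float32)
        layer.scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
        layer.weight = np.round(w / layer.scale).astype(np.int8)

    def apply(self, layer, x):
        # Dequantize, then run the matmul (a real kernel would fuse these).
        return x @ (layer.weight.astype(np.float32) * layer.scale).T

class ToyQuantConfig:
    """Stand-in for the checkpoint-derived config: picks a method per layer."""
    def __init__(self, fmt):
        self.fmt = fmt

    def get_quant_method(self, layer):
        return ToyInt8LinearMethod() if self.fmt == "int8" else ToyLinearMethod()

class ToyLinear:
    """The layer never branches on format; it only delegates to its method."""
    def __init__(self, in_features, out_features, config, seed=0):
        self.quant_method = config.get_quant_method(self)
        self.quant_method.create_weights(self, in_features, out_features,
                                         np.random.default_rng(seed))

    def forward(self, x):
        return self.quant_method.apply(self, x)

x = np.ones((2, 16), dtype=np.float32)
y_fp = ToyLinear(16, 4, ToyQuantConfig("none")).forward(x)
y_q = ToyLinear(16, 4, ToyQuantConfig("int8")).forward(x)
print(np.abs(y_fp - y_q).max())   # small: same weights, rounded to int8
```

Adding a format in this toy means adding one method class and one branch in the config; ToyLinear itself does not change.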
Q5. What's FP8 KV cache and when do you use it?
Model answer
Storing the KV cache in FP8 (instead of fp16) halves KV bytes/token, roughly doubling how many concurrent sequences fit (Phase 0 lab-02). It's orthogonal to weight quantization (mix freely). Use it when KV memory caps your concurrency and the small accuracy hit is acceptable; validate on your eval first.
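The "halves KV bytes, doubles concurrency" claim is plain arithmetic. A hedged sketch: the model shape below (32 layers, 8 KV heads, head_dim 128) is assumed for illustration, as are the 20 GiB KV budget and the 4096-token sequences, and any extra bytes for FP8 scaling factors are ignored.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """K and V for every layer and KV head: 2 * L * H_kv * d * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

fp16 = kv_bytes_per_token(32, 8, 128, 2)   # assumed model shape, 2 bytes per element
fp8 = kv_bytes_per_token(32, 8, 128, 1)    # same shape, 1 byte per element

KV_BUDGET = 20 * 2**30   # assumed: 20 GiB of HBM left for the KV cache
SEQ_LEN = 4096           # assumed tokens held per sequence

for name, b in [("fp16", fp16), ("fp8", fp8)]:
    print(f"{name}: {b // 1024} KiB/token, "
          f"{KV_BUDGET // (b * SEQ_LEN)} sequences of {SEQ_LEN} tokens")
```

With these numbers fp16 needs 128 KiB per token and fits 40 sequences; fp8 needs 64 KiB and fits 80.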
Rapid-fire
- Two methods a format implements? create_weights, apply.
- Weight-only vs W8A8? bandwidth/memory vs also matmul speed.
- 4-bit accuracy trick? per-group scales (+ GPTQ/AWQ calibration).
- Dispatch entry point? QuantizationConfig.get_quant_method.
- FP8 KV cache effect? ~2× concurrency.
Phase 06 — Cheatsheet: Quantization
Contents
- The one-liner
- Two axes
- Scale granularity
- Dispatch (model is format-agnostic)
- GPTQ vs AWQ
- FP8 KV cache
- Key upstream
The one-liner
Fewer weight bits → less HBM read/step → faster decode + more room for KV. Store round(W/s) +
scale s; accuracy ∝ scale granularity. Format must match the GEMM kernel.
Two axes
- What: weight-only (GPTQ/AWQ, 4-bit) = bandwidth/memory; weight+activation (FP8/INT8 W8A8) = also faster matmul (low-precision tensor cores).
- Format: FP8(E4M3/E5M2), INT8/INT4, MXFP4/NVFP4, GPTQ, AWQ, GGUF, compressed-tensors, ModelOpt, TorchAO.
Scale granularity
per-tensor (1 scale, worst) < per-channel (1/row) < per-group (1/128, best for 4-bit).
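The storage side of that ordering is a short calculation. A sketch assuming one 16-bit scale per group and no zero-points; the function name and the 4096-wide row are illustrative.

```python
def effective_bits(weight_bits, group_size, scale_bits=16):
    """Bits per weight once one scale per group is amortized over the group."""
    return weight_bits + scale_bits / group_size

IN_FEATURES = 4096   # assumed row length; a per-channel scale covers one whole row

print("int4 per-channel  :", effective_bits(4, IN_FEATURES))
print("int4 per-group 128:", effective_bits(4, 128))
print("int4 per-group 32 :", effective_bits(4, 32))
```

Group size 128 costs 4.125 bits per weight and group size 32 costs 4.5, so finer groups buy accuracy with a modest but real storage overhead.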
Dispatch (model is format-agnostic)
QuantizationConfig (from checkpoint) → get_quant_method(layer) → LinearMethodBase:
create_weights (alloc int weights + scales) + apply (de/quant matmul → GEMM kernel, Phase 7).
Linear just calls self.quant_method.apply(x).
GPTQ vs AWQ
Both post-training weight-only 4-bit w/ calibration. GPTQ: Hessian-based error min. AWQ: scale salient channels before rounding. Fast via Marlin kernels.
FP8 KV cache
Separate axis (kv_cache_dtype="fp8"): halves KV bytes → ~2× concurrency. Mix with any weight quant.
Key upstream
- quantization/base_config.py: :19 QuantizeMethodBase, :28 create_weights, :37 apply, :70 Config, :151 get_quant_method
- quantization/fp8.py: :100 Fp8Config, :261 Fp8LinearMethod, :316 create_weights, :437 apply
- quantization/__init__.py registry · quantization/awq.py · compressed_tensors/
- layers/linear.py: :182 Unquantized, :231 LinearBase, :410 ColumnParallel, :1392 RowParallel
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md
Phase 07 — The Hitchhiker's Guide to GEMM & MoE Kernels
Contents
- Don't Panic
- Step 1: GEMM — the workhorse and its kernels
- Step 2: MoE — sparse compute, dense capacity
- Step 3: Why fused MoE kernels matter
- Step 4: Expert parallelism (EP) — experts across GPUs
- The invariants to memorize
- What you'll do
Don't Panic
GEMM = General Matrix-Matrix Multiply. It's ~all the FLOPs in a transformer — every linear layer is a GEMM. MoE (Mixture of Experts) makes it weirder: instead of one big MLP, there are many expert MLPs, and each token is routed to only a few of them. That turns the dense GEMM into a routed, grouped GEMM — the frontier of open models (Mixtral, DeepSeek-V3, GPT-OSS) and where a lot of vLLM's current performance work lives. This phase is the kernels and the MoE machinery that make big sparse models fast.
Dense MLP:  x ─► [ one big W ] ─► y                        (every token, all weights)
MoE MLP:    x ─► gate ─► top-k experts ─► run only those ─► weighted combine
                  │
                  token 1 → experts {3, 7}                  "sparse": each token uses few of many experts
                  token 2 → experts {0, 7}
Step 1: GEMM — the workhorse and its kernels
A transformer is mostly matmuls: QKV projection, attention output, the two MLP matrices, the LM head. Making these fast is the job of GEMM libraries:
- cuBLAS — NVIDIA's baseline.
- CUTLASS — NVIDIA's open, composable GEMM templates; vLLM uses it heavily for quantized GEMMs (FP8/INT8, Phase 6).
- TRTLLM-GEN / CuTeDSL — generated/DSL kernels tuned per GPU and precision.
The reason there are so many: a GEMM kernel must be tiled to fit the GPU's memory hierarchy and specialized per dtype (fp16 vs fp8 vs int4) to use the right tensor cores. The quant format from Phase 6 dictates which GEMM kernel runs.
Step 2: MoE — sparse compute, dense capacity
A MoE layer replaces the dense MLP with E experts (each its own MLP). A router (a small
linear "gate") scores the experts per token; each token goes to its top-k (e.g. top-2). So a
model can have huge total parameters (capacity) but only activate a few experts per token
(cheap compute). DeepSeek-V3 has 256 routed experts per MoE layer but activates only 8 of them per token (plus a shared expert that every token uses).
The MoE forward, step by step (you'll build this in lab-01):
1. router: logits = x @ W_gate → (tokens, E)
2. top-k: pick the k best experts per token + their weights (softmax over the k)
3. permute: group tokens by their assigned expert (so each expert's tokens are contiguous)
4. grouped GEMM: run each expert's MLP on its block of tokens
5. un-permute: scatter results back to original token order
6. combine: weighted sum of each token's k expert outputs (by the gate weights)
Steps 3 & 5 (the permute/un-permute) exist because GPUs want contiguous work per expert — you can't efficiently do "token 1 → expert 3, token 2 → expert 0" as scattered tiny matmuls. Sorting tokens by expert turns it into a few big grouped GEMMs.
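Steps 1 to 3 can be sketched in a few lines of numpy. This is an illustrative sketch with toy sizes, not the lab's reference solution; it assumes W_gate has shape (d, E) so that x @ W_gate yields (tokens, E) as in step 1.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, E, k = 6, 8, 4, 2                     # tokens, hidden size, experts, top-k (toy sizes)
x = rng.normal(size=(T, d))
W_gate = rng.normal(size=(d, E))            # assumed layout: (hidden, experts)

# 1. router
logits = x @ W_gate                                          # (T, E)

# 2. top-k ids, then softmax over only the k selected logits
topk_ids = np.argsort(-logits, axis=1)[:, :k]                # (T, k)
selected = np.take_along_axis(logits, topk_ids, axis=1)
topk_w = np.exp(selected - selected.max(axis=1, keepdims=True))
topk_w /= topk_w.sum(axis=1, keepdims=True)

# 3. permute: sort the T*k (token, expert) pairs so each expert's tokens are contiguous
flat_expert = topk_ids.reshape(-1)
flat_token = np.repeat(np.arange(T), k)
order = np.argsort(flat_expert, kind="stable")

for e in range(E):
    toks = flat_token[order][flat_expert[order] == e]
    print(f"expert {e} gets tokens {toks.tolist()}")
```

After the sort, each expert's tokens form one contiguous slice, which is the property the grouped GEMM in step 4 relies on.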
Step 3: Why fused MoE kernels matter
Done naively, MoE is a gather + many small GEMMs + a scatter — launch-bound and memory-bound
(Phase 5's enemy, at the kernel level). Fused MoE kernels combine routing, the grouped GEMM,
and the combine into one (or few) kernels, keeping tensor cores busy and killing launch overhead.
This is decisive for MoE throughput and is exactly what vllm/model_executor/layers/fused_moe/
provides (Triton and CUTLASS variants).
Step 4: Expert parallelism (EP) — experts across GPUs
Experts are independent, so you can place different experts on different GPUs. Each step, tokens are shuffled to wherever their expert lives (an all-to-all collective), run, and shuffled back. EP scales the number of experts cheaply, at the cost of communication and load balancing (if everyone routes to expert 7, that GPU is the bottleneck). Contrast with tensor parallelism (Phase 10), which shards each expert's weights across GPUs. Real deployments combine EP for the MoE layers with DP/TP for attention.
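The load-balancing cost can be made concrete with a toy model in which a device's time is proportional to the tokens routed to its experts. The expert-to-device layout and the token counts below are assumptions for illustration, and the model ignores the all-to-all itself.

```python
import numpy as np

def ep_step_time(tokens_per_expert, expert_to_device, n_devices):
    """Devices run in parallel, so the step lasts as long as the busiest device."""
    device_load = np.bincount(expert_to_device, weights=tokens_per_expert,
                              minlength=n_devices)
    return device_load.max(), device_load

expert_to_device = np.array([0, 0, 1, 1, 2, 2, 3, 3])    # 8 experts on 4 GPUs (assumed layout)
balanced = np.full(8, 128)                                # 1024 routed tokens, spread evenly
skewed = np.array([16, 16, 16, 16, 16, 16, 16, 912])      # same 1024 tokens, expert 7 is hot

for name, load in [("balanced", balanced), ("skewed", skewed)]:
    step, per_dev = ep_step_time(load, expert_to_device, 4)
    print(f"{name}: per-device load {per_dev.tolist()}, step ~ {step:.0f} tokens, "
          f"imbalance {step / per_dev.mean():.2f}x")
```

Total work is identical in both cases; only the maximum changes, and the maximum is what the step waits for.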
The invariants to memorize
- GEMM = the FLOPs; CUTLASS/TRTLLM-GEN/CuTeDSL are the fast, dtype-specialized kernels.
- MoE = router → top-k → permute → grouped GEMM → un-permute → weighted combine.
- Permute/un-permute exist to make per-expert work contiguous (big GEMMs, not scattered tiny ones).
- Fused MoE kernels remove the gather/scatter launch + memory overhead.
- EP spreads experts across GPUs (all-to-all + load balancing); TP shards each expert.
- The quant format (Phase 6) selects the GEMM kernel.
What you'll do
- Read: 01-deep-dive.md. FusedMoE, the fused kernel + fused_experts, permute/un-permute, and a real MoE model (Mixtral), line-anchored.
- Build: 02-mini-build.md. Top-k routing + grouped experts + combine.
- Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
  - lab-01-moe-routing [CPU-OK]: implement the full MoE forward in numpy; prove it equals a reference and that permute/un-permute round-trips.
  - lab-02-profile-fused-moe [GPU-OPT]: profile fused MoE's share of step time (captured).
  - lab-03-tiled-gemm [CPU-OK]: tiling and the memory-traffic model: reuse = harmonic mean of tile dims; why decode (M=1) caps at reuse 2 and no tile size can save it.
  - lab-04-expert-load-balance [CPU-OK]: loads, imbalance, EP step time = max device load; prove a hot expert inflates the step >2.5× at identical total work; capacity-factor drops.
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
Phase 07 — Deep Dive: fused MoE in real vLLM
Paths relative to upstream/ at v0.22.1 @ 0decac0.
vllm/model_executor/layers/fused_moe/layer.py                    FusedMoE nn.Module (the layer)
vllm/model_executor/layers/fused_moe/fused_moe.py                the Triton fused kernel + fused_experts
vllm/model_executor/layers/fused_moe/moe_permute_unpermute.py    permute/un-permute
vllm/model_executor/layers/fused_moe/moe_align_block_size.py     align tokens to GEMM tiles
vllm/model_executor/layers/fused_moe/fused_moe_method_base.py    the method base (quant-aware)
vllm/model_executor/models/mixtral.py                            a real MoE model
vllm/model_executor/models/deepseek_v2.py                        DeepSeek MoE (shared experts + MLA)
Contents
- 1. The layer: FusedMoE
- 2. The fused kernel: fused_moe.py
- 3. Permute / align: making per-expert work contiguous
- 4. Routing (top-k) and combine
- 5. A real MoE model: Mixtral
- 6. Expert parallelism
- Reading checklist
1. The layer: FusedMoE
vllm/model_executor/layers/fused_moe/layer.py:73 — class FusedMoE(PluggableLayer). This is what
a model instantiates (see Mixtral below). It holds all experts' weights as stacked tensors
(shape roughly (E, ...)) and a quant method (Phase 6 — MoE weights are often quantized too). Its
forward (:1306) takes the hidden states and the router logits and returns the combined
output. It hides the routing + grouped GEMM + combine behind one call — the model just does
self.experts(hidden_states, router_logits).
2. The fused kernel: fused_moe.py
vllm/model_executor/layers/fused_moe/fused_moe.py:
- fused_moe_kernel (:295): the Triton kernel doing the grouped expert GEMM (one program per tile, looking up which expert a tile belongs to). fused_moe_kernel_gptq_awq (:61) is the quantized variant (Phase 6 formats need their own MoE kernel).
- fused_experts (:1587) / fused_experts_impl (:1664): the host-side orchestration: align tokens to block size, run the kernel for the up/gate projection, apply the activation, run the down projection, and combine. Read fused_experts_impl to see the full sequence; it's the guide's 6 steps in code.
The win vs naive: instead of a Python loop of E small matmuls, one kernel processes all tokens
for all experts, indexed by a sorted token→expert mapping. That's the "fused" in fused MoE.
3. Permute / align: making per-expert work contiguous
- moe_align_block_size.py: sorts/pads tokens so each expert's tokens form contiguous, tile-aligned blocks the GEMM kernel can chew through efficiently. This is the practical form of the guide's "permute" step.
- moe_permute_unpermute.py: the explicit permute (group by expert) and un-permute (scatter back) used by some paths.
Either way the principle is the same: sort tokens by expert → big grouped GEMM → scatter back.
Your lab-01 does this with argsort, which is exactly the idea minus the tile alignment.
4. Routing (top-k) and combine
The router is a small linear (gate) producing (tokens, E) logits. Selecting top-k experts and
their normalized weights happens in the layer/kernel path (look for topk / select_experts in
layer.py and fused_moe.py). DeepSeek adds grouped top-k (group experts, pick groups first)
and shared experts (always-on experts added to every token) — see deepseek_v2.py. The
combine is a weighted sum of each token's k expert outputs by the gate weights.
5. A real MoE model: Mixtral
vllm/model_executor/models/mixtral.py:
- class MixtralMoE(nn.Module) (:77): builds self.experts = FusedMoE(...) (:132) and a gate linear; its forward computes router_logits = gate(x) then self.experts(x, router_logits) (:153). That's the entire MoE block; the complexity is inside FusedMoE. When you add a model (Phase 14), wiring an MoE layer is this small.
6. Expert parallelism
fused_moe/all2all_utils.py, prepare_finalize/, and expert_map_manager.py implement EP: an
expert-to-GPU map, the all-to-all that ships tokens to their expert's GPU and back, and load
handling. EP is configured alongside TP/DP (Phase 10). The key cost is the all-to-all + imbalance
when routing is skewed.
Reading checklist
- FusedMoE.forward: what two things does it take, and what does it hide?
- fused_experts_impl: find the up/gate GEMM, activation, down GEMM, and combine.
- Why does moe_align_block_size exist (contiguous, tile-aligned per-expert work)?
- In Mixtral, how few lines is the MoE block once FusedMoE exists?
- EP vs TP for MoE: what does each shard, and what communication does each imply?
Now build it: 02-mini-build.md, then the labs.
Phase 07 — Mini-Build: the MoE forward in numpy
You'll implement the full MoE forward — router → top-k → permute → grouped experts → un-permute → weighted combine — and prove it equals a simple reference. This makes the fused kernel's job concrete: it's this, fused into one GPU pass.
Contents
- The task (lab-01)
- Why permute/un-permute (the key insight)
- Definition of done
- Map to the real engine
The task (lab-01)
Implement, in numpy:
- route(x, W_gate, k) → (topk_ids (T,k), topk_weights (T,k)): logits = x @ W_gate.T; pick the top-k experts per token; softmax the k selected logits for the combine weights.
- moe_forward_reference(x, experts, topk_ids, topk_weights) → the naive version: for each token, for each of its k experts, run that expert's MLP and weight-sum. (Correct and slow: the oracle.)
- moe_forward_grouped(x, experts, topk_ids, topk_weights) → the "fused" idea: permute tokens by expert (argsort), run each expert once on its contiguous block (grouped GEMM), un-permute, then combine. Must equal the reference.
An "expert" here is a tiny MLP: relu(x @ W1) @ W2.
Why permute/un-permute (the key insight)
Scattered per-token expert calls are tiny and launch-bound. Sorting tokens by expert turns the
work into a handful of big matmuls (one per expert), which the GPU loves. Your argsort-based
permute is the CPU mirror of moe_align_block_size / moe_permute_unpermute.
Definition of done
pytest phase-07-gemm-and-moe-kernels/labs -q
Tests pin: grouped == reference output; the permutation round-trips (un-permute ∘ permute = identity); each expert is invoked on exactly its assigned tokens; top-k weights sum to 1 per token.
Map to the real engine
| your numpy | real vLLM |
|---|---|
| route top-k | routing in FusedMoE / fused_moe.py |
| permute by argsort | moe_align_block_size / moe_permute_unpermute.py |
| grouped expert matmuls | fused_moe_kernel (fused_moe.py:295) |
| weighted combine | the combine in fused_experts_impl (:1664) |
| (experts on different GPUs) | expert parallelism (all2all_utils.py) |
Phase 07 Labs — GEMM & MoE Kernels
Four labs below the attention line: the matmuls that are most of every step's milliseconds, and the mixture-of-experts machinery that reorganizes them. The arc: build the MoE forward and prove the grouped formulation exact (lab-01), learn the tiling arithmetic that makes any GEMM fast — and why decode shapes defeat it (lab-03), measure the balance tax that routing levies on parallel experts (lab-04), then profile a real MoE model and check all three models against silicon (lab-02).
Recommended order: 01 → 03 → 04 → 02. (Directory numbers predate labs 03–04.) CPU
labs follow the standard contract — starter.py (your work), solution.py (reference),
test_lab.py (the spec); default runs the solution, LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-07-gemm-and-moe-kernels/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-07-gemm-and-moe-kernels/labs/lab-01-moe-routing -q
Contents
- lab-01-moe-routing [CPU-OK]
- lab-02-profile-fused-moe [GPU-OPT]
- lab-03-tiled-gemm [CPU-OK]
- lab-04-expert-load-balance [CPU-OK]
- What you can do after this phase
Labs
lab-01-moe-routing [CPU-OK]
Implement the MoE forward twice — the naive per-token oracle and the grouped
formulation (permute by expert → one matmul per expert → scatter back with combine
weights) — and prove them equal. The grouped version is the readable edition of
fused_moe_kernel; the scatter-back hides the lab's boss fight (np.add.at vs the
duplicate-dropping +=). Skills: select-normalize-combine routing; the permute trick;
conservation bookkeeping as debugging.
lab-02-profile-fused-moe [GPU-OPT]
Capture decode steps of a real MoE model under torch.profiler and read the kernel
table: experts ~41%, permute ~7%, router ~1%. Predict the breakdown first; the gaps are
your misconceptions. Annotated capture included. Skills: warm-up discipline;
kernel-table 80/20; decision-cheap/consequence-expensive structure of routing.
lab-03-tiled-gemm [CPU-OK]
The idea that fills the gap between three nested loops and CUTLASS: tiling. Implement a tiled matmul (exact, ragged edges included) and the memory-traffic model — reuse equals the harmonic mean of tile dimensions, square tiles win, and a decode-shaped matmul (M=1) caps at reuse 2 no matter what, re-deriving decode's bandwidth wall from the kernel side. Skills: traffic counting; intensity as an algorithm property; the three-level tiling hierarchy.
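One way to read the reuse claim, stated as an assumption about the lab's traffic model rather than a quote from it: to compute a tm × tn output tile over the full K dimension, a kernel loads tm·K elements of A and tn·K elements of B and performs tm·tn·K multiply-adds (2 FLOPs each). FLOPs per element loaded is then 2·tm·tn / (tm + tn), the harmonic mean of the two tile dimensions.

```python
def reuse(tm, tn):
    """FLOPs per element loaded for a tm x tn output tile: 2*tm*tn / (tm + tn)."""
    return 2 * tm * tn / (tm + tn)

# The first three tiles have the same area (16384 outputs); only the shape differs.
for tm, tn in [(128, 128), (64, 256), (16, 1024), (1, 128), (1, 4096)]:
    print(f"tile {tm:>4} x {tn:<5} area {tm * tn:>6}  reuse {reuse(tm, tn):7.2f}")
```

At equal area the square tile has the highest reuse, and with tm = 1 the expression becomes 2·tn / (1 + tn), which approaches 2 from below however wide the tile gets.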
lab-04-expert-load-balance [CPU-OK]
The MoE serving problem: with experts sharded across devices, a step lasts as long as the busiest device. Build the diagnostics (loads, imbalance factor, EP step time, capacity-overflow drops) and prove a hot expert inflates step time >2.5× at identical total work. Skills: straggler arithmetic; placement vs routing; why inference never drops tokens; what EPLB optimizes.
What you can do after this phase
Estimate any GEMM's achievable performance from shape + tile + hardware on a napkin;
explain why grouped MoE kernels exist and verify one against an oracle; diagnose an
underperforming MoE deployment with a routing histogram before touching a profiler, and
with one afterward; and read fused_moe.py upstream as four phases you've personally
implemented. Phase 8 (speculative decoding) spends the idle FLOPs you now know how to
find; Phase 10 stretches the all-to-all across nodes.
Lab 07-01 — The MoE Forward (Routing + Grouped Experts) [CPU-OK]
A mixture-of-experts layer makes a strange promise: a model with 8× the parameters at
~1× the per-token compute, because each token visits only its top-k of E expert MLPs.
The catch is operational — "each token visits different experts" is a scatter/gather
nightmare for hardware that loves big uniform matmuls. This lab has you implement both
sides of the resolution: the naive per-token loop (obviously correct, hopelessly
slow — your oracle) and the grouped formulation (permute tokens by expert → one big
matmul per expert → scatter back with combine weights), and prove they're equal. The
grouped version is, step for step, what vLLM's fused_moe_kernel does in one GPU pass —
you're writing the readable edition of one of the hottest kernels in modern serving.
Contents
- Why this lab exists
- Background: the permute trick
- Files
- Run
- What to implement
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
MoE is where the frontier lives (Mixtral, DeepSeek-V3, Qwen-MoE, most rumored frontier
models), and its serving stack confuses newcomers because the math (a weighted sum of
small MLPs) and the implementation (sorts, histograms, alignment buffers, grouped
GEMMs) look unrelated. They aren't — the implementation is the math, reorganized so the
GPU sees few large uniform operations instead of many tiny ragged ones. Building both
versions and asserting equality is how you internalize the correspondence; after this
lab, moe_align_block_size (a real kernel whose name suggests nothing) reads as "my
argsort, made tile-friendly."
The grouped-equals-reference test is also this course's master invariant again
(optimizations must not change output) in its most insidious habitat: the scatter-back.
Combine-weight bugs and duplicate-token-row bugs (np.add.at vs out[toks] += — see
the notes) produce outputs that are plausibly wrong, the worst kind. The oracle test
is the only honest defense.
Background: the permute trick
Per token: router logits x @ W_gateᵀ → take top-k experts → softmax the selected k
logits into combine weights → output is the weighted sum of those experts' MLP outputs.
Done literally, that's T × k tiny matmuls — death by launch overhead and zero data
reuse (lab-03 quantifies why tiny matmuls waste a GPU).
The grouped reformulation observes that the same set of (token, expert) pairs can be processed expert-major instead of token-major:
1. Flatten the (T, k) assignment matrix into T·k (token, expert, weight) triples.
2. Permute: sort triples by expert (your argsort; real kernels build the equivalent grouping with a histogram + prefix sum, as in moe_align_block_size).
3. Grouped GEMM: for each expert, one matmul over its contiguous block of tokens. That is E medium matmuls instead of T·k tiny ones, each big enough to tile well (lab-03).
4. Un-permute + combine: scatter results back to token order, multiplying by the combine weights, summing the k contributions per token.
No arithmetic changed — only its order. The speedup comes entirely from shaping the work to what hardware rewards: contiguity and uniformity.
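The "only its order" claim rests on the permutation being invertible. A minimal sketch of the round trip, using toy data and illustrative variable names:

```python
import numpy as np

rng = np.random.default_rng(0)
expert_of_row = rng.integers(0, 4, size=10)        # expert id for each of the T*k rows (toy data)
rows = rng.normal(size=(10, 3))                    # one activation row per (token, expert) pair

order = np.argsort(expert_of_row, kind="stable")   # permute: expert-major order
permuted = rows[order]

inverse = np.empty_like(order)
inverse[order] = np.arange(order.size)             # un-permute: position of each original row
restored = permuted[inverse]

print(expert_of_row[order])                        # non-decreasing expert ids
print(np.array_equal(restored, rows))              # True
```

Because inverse[order[j]] = j, indexing the permuted rows with inverse returns every row to its original position, so nothing is lost or reordered by the grouping.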
Files
- starter.py: route, expert_mlp, moe_forward_reference, moe_forward_grouped. Your work.
- solution.py: reference.
- test_lab.py: grouped == reference, combine weights sum to 1, assignment bookkeeping.
Run
LAB_IMPL=starter pytest phase-07-gemm-and-moe-kernels/labs/lab-01-moe-routing -q
pytest phase-07-gemm-and-moe-kernels/labs/lab-01-moe-routing -q # reference
What to implement
Per 02-mini-build.md: route (logits → top-k ids + softmax of
the selected logits), expert_mlp (relu(x @ W1) @ W2), the reference loop, and the
grouped version. Two precision points: softmax over the selected k logits only (not
all E — selecting then normalizing is the standard formulation; normalizing then
selecting gives different weights), and the scatter-back must handle a token appearing
twice in an expert's block when top-k assigns it duplicate experts — np.add.at
accumulates correctly where fancy-indexed += silently drops duplicates. That numpy
footgun is the lab's hidden boss; the bookkeeping test exists for it.
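The footgun is easy to reproduce in isolation. A minimal sketch with made-up indices and values:

```python
import numpy as np

token_idx = np.array([0, 1, 1, 2])          # index 1 appears twice
contrib = np.array([1.0, 10.0, 20.0, 5.0])

out_buggy = np.zeros(3)
out_buggy[token_idx] += contrib             # buffered: only the last write to index 1 survives

out_ok = np.zeros(3)
np.add.at(out_ok, token_idx, contrib)       # unbuffered: every contribution accumulates

print(out_buggy)   # index 1 holds 20.0
print(out_ok)      # index 1 holds 30.0
```

The buggy result is a plausible-looking array with no error raised, which is why the lab checks it with a test instead of trusting inspection.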
What the tests prove
| Test | What it pins |
|---|---|
| grouped ≈ reference | The permute/group/scatter pipeline is an identity on the math — the kernel's entire correctness claim |
| combine weights sum to 1 | The router emits a proper convex combination — drop this and outputs scale with k |
| assignments = T × k, each expert sees exactly its tokens | The bookkeeping conservation law: nothing dropped, nothing duplicated in the permute. This is the histogram you'd actually print when debugging a real routing issue (lab-04 builds the diagnostics on top) |
Hitchhiker's notes
- Why softmax-after-top-k? It renormalizes mass over the experts actually used, so the output is a proper weighted average regardless of how confident the router was. Mixtral and most modern MoEs do exactly this; some (DeepSeek-V3) use sigmoid gates with normalization — same pipeline, different gate function. The structure (select → normalize → combine) is the stable part.
- The real kernel fuses steps 2–4 into one launch: fused_moe_kernel (upstream/vllm/model_executor/layers/fused_moe/fused_moe.py:295), a Triton kernel whose grid covers (expert blocks × tile positions), reading the alignment metadata that moe_align_block_size produced. Your four functions are its four phases; the fusion exists so intermediate permuted tensors never round-trip through HBM (the recurring lesson from Phase 2 lab-06 and lab-03 here: materializing intermediates forfeits the bandwidth win).
- SwiGLU, not ReLU, in real models: (silu(x@W1) * (x@W3)) @ W2, which means three weight matrices per expert and one extra elementwise multiply. Changes the per-expert FLOPs, changes nothing about routing/grouping. The lab uses ReLU to keep the oracle short.
- Where the time really goes: lab-02's profile shows experts (grouped GEMM) at ~41%, permute at ~7%, router at ~1%. Routing is decision-cheap, consequence-expensive: the gate is a tiny matmul whose output determines whether the expensive part runs balanced (lab-04's entire subject).
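For comparison with the lab's ReLU expert, here is a numpy sketch of the SwiGLU form quoted above. The toy shapes and function names are illustrative; both experts act on each row independently, which is why swapping one for the other leaves routing and grouping untouched.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))             # z * sigmoid(z)

def relu_expert(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2

def swiglu_expert(x, W1, W3, W2):
    """Gated MLP: (silu(x @ W1) * (x @ W3)) @ W2."""
    return (silu(x @ W1) * (x @ W3)) @ W2

rng = np.random.default_rng(0)
d, h = 8, 16                                   # toy hidden and intermediate sizes
x = rng.normal(size=(5, d))
W1, W3, W2 = rng.normal(size=(d, h)), rng.normal(size=(d, h)), rng.normal(size=(h, d))

print(relu_expert(x, W1, W2).shape, swiglu_expert(x, W1, W3, W2).shape)
print("matmul FLOPs per token:", 2 * 2 * d * h, "(ReLU) vs", 3 * 2 * d * h, "(SwiGLU)")
```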
Going further
- Replace argsort with the histogram + prefix-sum (counting sort) the real kernel uses: np.bincount → np.cumsum → stable placement. Same permutation as a stable argsort, O(T·k + E) instead of O(T·k · log(T·k)), and now you've written moe_align_block_size's algorithm.
- Implement SwiGLU experts and re-run the equality test (it should pass untouched, because routing is orthogonal to expert internals; prove it).
- Pad each expert's token block to a multiple of 16 (the GEMM tile constraint from lab-03) with zero rows, and verify the output is still exact. You've discovered why moe_align_block_size has "block size" in its name, and where MoE's small padding-waste overhead comes from.
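A sketch of the first and third items together, under stated assumptions: the placement loop is written in plain Python for readability (a kernel would parallelize it), the block size of 16 comes from the item above, and the function name is hypothetical.

```python
import numpy as np

def counting_sort_permutation(expert_ids, n_experts):
    """Stable expert-major permutation via histogram + exclusive prefix sum."""
    counts = np.bincount(expert_ids, minlength=n_experts)   # histogram
    offsets = np.cumsum(counts) - counts                    # start of each expert's block
    order = np.empty(expert_ids.size, dtype=np.int64)
    cursor = offsets.copy()
    for row, e in enumerate(expert_ids):                    # stable placement, one pass
        order[cursor[e]] = row
        cursor[e] += 1
    return order, offsets, counts

rng = np.random.default_rng(0)
expert_ids = rng.integers(0, 8, size=32)                    # toy flattened top-k assignment
order, offsets, counts = counting_sort_permutation(expert_ids, 8)
print(np.array_equal(order, np.argsort(expert_ids, kind="stable")))

BLOCK = 16                                                  # tile-size constraint from the text
padded = -(-counts // BLOCK) * BLOCK                        # round each block up to a multiple
print("padding waste:", int(padded.sum() - counts.sum()), "rows")
```

The padding rows are the overhead mentioned in the last item: with few tokens per expert, a large share of each tile can be zeros.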
References
- upstream/vllm/model_executor/layers/fused_moe/fused_moe.py:295: the fused kernel; read it next to your grouped function, phase by phase.
- Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017). The modern MoE formulation: https://arxiv.org/abs/1701.06538
- Jiang et al., Mixtral of Experts (2024) — the architecture this lab's shapes mimic (8 experts, top-2): https://arxiv.org/abs/2401.04088
- Lab-03 — why grouped beats tiny matmuls (tiling/reuse); lab-04 — what the routing histogram costs; lab-02 — the profile showing where the milliseconds go.
Lab 07-02 — Profile the Fused MoE Kernel [GPU-OPT]
You've built the MoE forward (lab-01), the tiling that makes its GEMMs fast (lab-03),
and the balance diagnostics (lab-04). This lab closes the loop with the instrument that
tells you which of those matters on your model, on your hardware, right now: the
profiler. You'll capture a few decode steps of a real MoE model under
torch.profiler and read the kernel-level time breakdown — discovering that the
grouped expert GEMM eats ~40% of the step, the router costs one percent, and the
permute machinery is visible but minor. That breakdown is the empirical ground truth
that every MoE optimization argument has to answer to.
No GPU? Don't panic. The captured profile below is annotated line by line against labs 01/03/04 — the reading skill transfers intact.
Contents
- Why this lab exists
- Requirements
- Steps
- Captured output (real run, small MoE, L4, vLLM 0.22.1, trimmed)
- Reading the profile
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Profiling is the difference between optimizing and gesturing. Every phase so far has handed you models of where time goes (roofline, launch counts, traffic formulas, imbalance factors); the profiler is how you check a model against a machine — and the discipline of "predict the breakdown, then look" is what makes profiles informative instead of just colorful. Before running this lab, write down your guesses: what share for the experts? for attention? for the router? The gaps between your guesses and the table below are precisely your remaining misconceptions about MoE — that's the lab.
The kernel-table-reading skill is also the universal entry point to Phase 18 (where
profiling becomes systematic, with nsys/ncu and timeline views). A key_averages()
table sorted by CUDA time is the 80/20 of GPU performance work: ten seconds of looking
tells you which subsystem owns the milliseconds, which is the only question that decides
where engineering effort goes.
Requirements
uv pip install -e ".[vllm]"
# a small MoE checkpoint, e.g. Qwen1.5-MoE-A2.7B or any 0.5–3B-activated MoE on the Hub
Steps
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="<a small MoE model>", gpu_memory_utilization=0.6, max_model_len=1024)
llm.generate(["warmup"] * 4, SamplingParams(max_tokens=8))  # warm up: capture, caches, autotune

# Profile only the steady state; the body of the with-block must be indented.
with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) as prof:
    llm.generate(["Explain MoE in one line:"] * 8,
                 SamplingParams(max_tokens=32, temperature=0))

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
Warm up first — profiling a cold engine records CUDA-graph capture, compilation, and autotuning, drowning the steady state you actually care about (the mistake that invalidates more first profiles than any other). Then find: the fused MoE / grouped-GEMM kernels, the align/permute ops, attention, and the router's gate matmul. Compute each one's share.
Captured output (real run, small MoE, L4, vLLM 0.22.1, trimmed)
Name CUDA time %
fused_moe_kernel (grouped expert GEMM) 41.2% <- the experts dominate
moe_align_block_size / permute 6.8% <- the sort/permute (your argsort)
flash_attn (attention) 18.5%
rms_norm / residual / misc 9.0%
gate (router linear) 1.3% <- routing is cheap
all-to-all (if EP enabled) ... <- expert-parallel comms
Reading the profile
- fused_moe_kernel at 41% — the experts are the model, economically. Every percent shaved here is ~0.4% of the whole step, which is why the fused kernel gets CUTLASS/Triton-level attention upstream and why lab-03's tiling arithmetic is load-bearing. It's also why balance matters so much (lab-04): this 41% is the part that inflates under a hot expert.
- moe_align_block_size + permute at ~7% — the bookkeeping tax of the grouped formulation (lab-01's argsort, tile-aligned). Visible, real, and worth exactly as much optimization effort as 7% justifies — which is some, not much. When a PR claims big wins from permute cleverness, this number is your calibration.
- gate at 1.3% — the most strategically interesting line: the decision is nearly free while its consequences (the 41% above, and the balance of it) are everything. Cheap decisions with expensive consequences are where you spend design attention, not kernel attention — lab-04 is entirely about this line's downstream effects.
- Attention at 18.5% — for context: in a dense model this and the MLP GEMMs would be the whole story (Phases 4 and 3 of your attention). MoE adds the expert economy on top; it doesn't replace the transformer's costs.
- The missing line: all-to-all — single-GPU here, so no EP communication. On a multi-node DeepSeek-scale deployment this line appears and can rival the GEMM itself — Phase 10's territory, lab-04's placement problem made physical.
Hitchhiker's notes
- Percentages lie across regimes. This is a decode-heavy profile at modest batch. A prefill-heavy run shifts share toward attention (longer sequences — Phase 4 lab-03's quadratic); a bigger batch shifts toward GEMMs and improves their efficiency (lab-03's reuse). Always note the workload a profile was taken under — a profile without its workload is a number without units.
- Kernel names drift across versions (Triton autogenerated names especially). Anchor on the structure: one big grouped GEMM, one alignment/permute pass, one tiny gate. Those three will exist under any naming in any version.
- ProfilerActivity.CUDA measures GPU time; add CPU and compare totals — if CPU time ≫ CUDA time at small batch, you're launch-bound and Phase 5's graphs are the fix, not kernel work. The profiler answers the "which regime am I in?" question from Phase 0 lab-04 empirically.
- vLLM also ships its own profiling hooks (VLLM_TORCH_PROFILER_DIR for trace-on-demand against a running server) — same data, production-shaped collection. Phase 18 uses them; this lab's inline version is the minimal form.
Reflect
- Predict-then-check: how would this table change for (a) batch 64 instead of 8, (b) a 4k-token prefill, (c) the same model dense-ified (experts merged)? Each answer is one of labs 03/04 / Phase 4 applied.
- The gate is 1.3% of time but determines the balance of the 41%. Where, concretely, would you add instrumentation to catch a balance regression in production? (Routing histogram per window — lab-04's expert_loads — exported as a metric; the profiler is for diagnosis, histograms are for monitoring.)
- If moe_align_block_size grew to 20% of the step after a model swap, what changed? (More experts and/or smaller per-expert blocks — the permute is per-assignment, the GEMMs amortize per expert block; small-expert MoEs pay relatively more bookkeeping.)
References
- upstream/vllm/model_executor/layers/fused_moe/fused_moe.py — the kernels whose names you just learned to find in a table.
- PyTorch docs, torch.profiler — the instrument: https://pytorch.org/docs/stable/profiler.html
- vLLM docs, Profiling — server-shaped trace collection: https://docs.vllm.ai/en/latest/contributing/profiling/
- Labs 01 / 03 / 04 — the three models this profile validates (grouped formulation, tiling economics, balance tax).
- Phase 18 — profiling as a discipline: timelines, nsys, regression hunting.
Lab 07-03 — Tiled GEMM and the Memory-Traffic Model [CPU-OK]
A matrix multiply is three nested loops a first-year student can write. CUTLASS — the template library behind most of vLLM's GEMMs — is tens of thousands of lines. This lab is about the single idea that fills that gap: tiling. Not because the loops are wrong, but because the memory traffic is: a naive GEMM re-reads its operands from slow memory incessantly, while a tiled one stages blocks in fast memory and reuses every loaded element many times. You'll implement the tiling (and prove it changes nothing numerically — it's pure loop reordering), then build the traffic model that explains why tile shape is the most important number in any GEMM kernel — and derive, as a bonus, exactly why a decode-shaped matmul (M=1) can't be saved by any tile size at all.
Contents
- Why this lab exists
- Background: the reuse arithmetic
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Phase 0 lab-04 gave you the roofline: below the ridge you're bandwidth-bound, above it compute-bound. What it didn't say is that arithmetic intensity is not a property of the problem — it's a property of the algorithm. A 1024³ GEMM has enough FLOPs per byte in principle (operands total ~6 MB in fp16, work totals ~2 GFLOPs: roughly 340 FLOPs per byte), but the naive loop order achieves an intensity of ~1 anyway, because it keeps re-loading what it just evicted. Tiling is the act of claiming the intensity the math always had. Every fast kernel you'll ever read — CUTLASS GEMMs, FlashAttention (which is exactly this lab applied to attention — Phase 4), the fused MoE kernel (lab-02's profile) — is this one idea wearing different clothes, and the traffic model you build here is how you estimate any of them on a napkin.
The numerics half matters too: tiling reorders the accumulation, and you'll prove with tests that for exact arithmetic it's an identity (ragged edges included — the place naive implementations corrupt silently). In floating point, reordering shifts the last ulp — the legitimate cross-kernel divergence you've now met in three phases (3, 4, 6).
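The "pure loop reordering" claim can be checked in a few lines of dependency-free Python. The sketch below is an illustration rather than the lab's reference solution: the tiled_gemm signature (including tile_k) is an assumption and may differ from starter.py. It uses integer inputs so arithmetic is exact, and it reads the ragged shape as a 37×23 matrix times a 23×19 matrix, which is one possible reading of the dimensions.

```python
def naive_gemm(A, B):
    """Reference triple loop: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]


def tiled_gemm(A, B, tile_m, tile_n, tile_k):
    """Same multiply-adds as naive_gemm, visited tile by tile."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, tile_m):
        i1 = min(i0 + tile_m, M)          # min() clamps the ragged edge tile
        for j0 in range(0, N, tile_n):
            j1 = min(j0 + tile_n, N)
            for k0 in range(0, K, tile_k):
                k1 = min(k0 + tile_k, K)
                # Inner loops touch only one (A-panel, B-panel) pair.
                for i in range(i0, i1):
                    for j in range(j0, j1):
                        acc = C[i][j]
                        for k in range(k0, k1):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C


# Ragged shape: no dimension is a multiple of the tile size 8.
M, K, N = 37, 23, 19
A = [[(3 * i + 5 * k) % 7 - 3 for k in range(K)] for i in range(M)]
B = [[(2 * k + j) % 5 - 2 for j in range(N)] for k in range(K)]

C_ref = naive_gemm(A, B)
C_tiled = tiled_gemm(A, B, 8, 8, 8)
print("identical:", C_tiled == C_ref)
```

With float inputs the two results would typically agree only to rounding, because the K-tiling changes the order in which partial sums are accumulated.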
Background: the reuse arithmetic
Count slow-memory loads. Naive: each output element C[i,j] streams a K-row of A and a
K-column of B → M·N·2K loads — every element of A is loaded N times, every element of
B loaded M times. Tiled with (tile_m × tile_n) output tiles: each tile streams its
A-rows and B-columns once (staged in fast memory while the tile's tile_m·tile_n·K
multiply-adds consume them):
tiled loads = M·K · ceil(N/tile_n) + K·N · ceil(M/tile_m)
reuse = naive/tiled = 2 / (1/tile_m + 1/tile_n) ← the HARMONIC MEAN of the tile dims
That harmonic mean is the lab's punchline. It says: reuse is governed by the smaller
tile dimension (256×16 tiles reuse like ~30, not like their area suggests); square tiles
maximize reuse per unit of fast memory (a t×t tile gives reuse t while staging
O(t·K) operands); and — the inference-shaped consequence — when M=1 (a single decode
token), tile_m is pinned at 1 and reuse caps at 2, no matter how clever the kernel.
The weights must stream once per step. That's Phase 0 lab-04's "decode is
bandwidth-bound" re-derived from the kernel's side, and it's why decode optimization is
about shrinking bytes (Phase 6) and sharing the stream across a batch, never about
better GEMM tiling.
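The load counts above translate almost directly into code. This is a sketch under assumptions: the function names match the ones listed for starter.py, but the argument order is a guess, and "loads" are counted in elements rather than bytes.

```python
from math import ceil


def naive_traffic(M, N, K):
    """Every output element streams a K-row of A and a K-column of B."""
    return 2 * M * N * K


def tiled_traffic(M, N, K, tile_m, tile_n):
    """A is re-read once per column of tiles, B once per row of tiles."""
    return M * K * ceil(N / tile_n) + K * N * ceil(M / tile_m)


def reuse_factor(tile_m, tile_n):
    """naive / tiled for divisible shapes: the harmonic mean of the tile dims."""
    return 2.0 / (1.0 / tile_m + 1.0 / tile_n)


naive = naive_traffic(1024, 1024, 1024)
tiled = tiled_traffic(1024, 1024, 1024, 128, 128)
print("naive loads :", naive)
print("tiled loads :", tiled)
print("ratio       :", naive / tiled)
print("reuse 128x128:", reuse_factor(128, 128))
print("reuse 256x16 :", round(reuse_factor(256, 16), 2))

# Decode shape: M = 1 pins tile_m at 1, so the K*N weight term is paid in full.
decode = tiled_traffic(1, 4096, 4096, 1, 128)
print("decode loads:", decode, "of which weights:", 4096 * 4096)
print("reuse at M=1:", round(reuse_factor(1, 128), 3))
```

The decode line is the point of the exercise: growing tile_n shrinks only the small activation term, while the weight term stays at K·N however the tiles are chosen.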
Files
- starter.py — tiled_gemm, naive_traffic, tiled_traffic, reuse_factor. Your work.
- solution.py — reference.
- test_lab.py — equality (divisible, ragged, tile=1), the traffic formulas, the bigger-tiles-less-traffic direction, the harmonic mean, and the decode-shape cap.
Run
LAB_IMPL=starter pytest phase-07-gemm-and-moe-kernels/labs/lab-03-tiled-gemm -q
pytest phase-07-gemm-and-moe-kernels/labs/lab-03-tiled-gemm -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_tiled_equals_matmul_divisible / _ragged | Tiling is loop reordering, not approximation — including 37×23×19, where every edge tile is partial. Ragged edges are where real kernel bugs live (predication/masking in CUTLASS); your min() bounds are their readable form |
| test_tile_size_one_is_the_naive_algorithm | The degenerate case anchors the model: tiles of 1 = no reuse = the naive loop |
| test_traffic_formulas | The load counts, exactly — 1024³ with 128² tiles moves 16 MB-equivalents instead of 2 GB-equivalents |
| test_bigger_tiles_mean_less_traffic | The direction that justifies burning shared memory on bigger tiles |
| test_reuse_factor_is_the_harmonic_tile_size | Square 128 → reuse 128; skewed 256×16 → ~30. Shape, not area |
| test_decode_shape_has_no_reuse_to_harvest | M=1 → reuse ≤ 2. The GEMM-side proof of decode's bandwidth wall |
Hitchhiker's notes
- Why not just make tiles enormous? Fast memory is finite: a GPU SM has ~100–230 KB of shared memory, and a t×t fp16 tile's staging (A-panel + B-panel + accumulator) must fit — which lands real kernels at tiles like 128×128 or 128×256, exactly where your model's curve flattens against the hardware budget. Tile choice is a constrained optimization, and CUTLASS exposes it as template parameters because the optimum moves with dtype, shape, and architecture.
- The hierarchy repeats: HBM → shared memory is your model's level, but the same arithmetic recurs for shared memory → registers (warp tiles), and L2 ↔ HBM (threadblock swizzling for L2 reuse). Production GEMMs tile at three levels with the same formula at each. Learn it once, apply it fractally.
- Tensor cores change the FLOP rate, not the traffic math. They make the compute side faster, which raises the ridge (Phase 0 lab-04) and makes good tiling more necessary, not less — a tensor-core GEMM that under-tiles just starves faster. This is why Hopper added TMA (bulk async copies HBM→shared): feeding the tiles became the whole game.
- Grouped GEMM (the MoE kernel, lab-01/02) is this lab plus one indirection: many small GEMMs (one per expert) whose tiles are scheduled from a single kernel launch so the tile machinery amortizes across experts. moe_align_block_size exists precisely to organize tokens into tile-shaped groups — your lab-01 argsort, upgraded to be tile-aware.
Going further
- Add a fast_memory_bytes(tile_m, tile_n, tile_k, dtype_bytes) function and find the best square tile under a 100 KB budget for K=4096 — then compare against the tile shapes in a CUTLASS config or a Triton autotune list. You'll land within a factor of 2 of what the pros chose, from a 5-line model.
- Time it for real: your tiled_gemm vs A @ B in numpy is unfair (BLAS is tiled and vectorized), but tiled_gemm with tile 64 vs tile 1 against each other shows the traffic effect even through Python overhead. Measure, then explain the ratio.
- Extend the traffic model with the K-dimension split (tile_k, split-K reduction — needed when M and N are both small but K is huge). Notice the merge-partials shape from Phase 4 lab-04 reappearing: split-K GEMM is the same monoid trick, applied to plain sums.
References
- upstream/csrc/quantization/cutlass_w8a8/ and upstream/cmake/external_projects/ — where CUTLASS enters vLLM; the deep-dive maps the entry points.
- NVIDIA, CUTLASS: Efficient GEMM in CUDA (docs) — the three-level tiling hierarchy: https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md
- Triton tutorial, Matrix Multiplication — your lab in Triton, with autotuned tiles: https://triton-lang.org/main/getting-started/tutorials/03-matrix-multiplication.html
- Williams et al., Roofline (2009) — intensity as an algorithm property: https://dl.acm.org/doi/10.1145/1498765.1498785
- Phase 0 lab-04 — the ridge this lab's reuse factor is racing toward; Phase 4 lab-01 — FlashAttention as tiling applied to attention.
Lab 07-04 — Expert Load Balance: the MoE Serving Problem [CPU-OK]
Lab-01 built the MoE forward and treated the router's decisions as given. This lab asks the operator's question: what do those decisions cost? A mixture-of-experts model's economics rest on a promise — E experts' worth of capacity for k experts' worth of compute per token — and that promise has fine print: it holds only if tokens spread evenly. Real routers don't spread evenly. You'll build the three numbers that quantify the damage: per-expert loads, the imbalance factor, and — the one that costs money — EP step time, which with experts sharded across devices equals the busiest device's load, not the average. Same total work, one hot expert, >2.5× the step time: you'll prove it in an assert.
Contents
- Why this lab exists
- Background: why imbalance is a tax on parallelism
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
MoE models (Mixtral, DeepSeek-V3, Qwen-MoE, the frontier generally) are taking over
serving fleets, and their performance pathologies are distributional, not
computational: nothing crashes, no kernel is slow — the work just lands unevenly and
silicon idles in the gaps. An engineer staring at "MoE deployment at 40% of expected
throughput" needs exactly the diagnostics you're building: dump the routing histogram,
compute imbalance, map experts to devices, find the hot device. The lab's deliberately
crafted "hot expert" router (60% of assignments to one expert) is not a strawman — it's
the documented failure shape of undertrained gates, domain-shifted traffic
(code-heavy prompts lighting up code-ish experts), and repetitive workloads.
The second reason: capacity factors. Training-era MoE systems used fixed per-expert
buffers and dropped overflow tokens — fine for training (a dropped token is a slightly
noisier gradient), catastrophic for inference (a dropped token is a corrupted
generation). Understanding dropped_tokens is understanding why inference MoE never
drops — and what it pays instead (dynamic buffers, the full imbalance tax landing on
latency). That design fork explains a lot of otherwise-puzzling differences between
training and serving MoE stacks.
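The capacity-factor mechanics can be sketched in a few lines. This is a minimal model under stated assumptions: capacity is taken as capacity_factor × (total assignments / number of experts), truncated to an integer, which follows the usual Switch-style definition but may not match the rounding in the lab's solution.py. The example numbers are illustrative and are not the lab's test fixtures.

```python
from collections import Counter


def expert_loads(assignments, num_experts):
    """Histogram of (token, expert) assignments, one slot per expert."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]


def dropped_tokens(loads, capacity_factor):
    """Assignments that overflow a fixed per-expert buffer (assumed rounding: int())."""
    capacity = int(capacity_factor * sum(loads) / len(loads))
    return sum(max(0, load - capacity) for load in loads)


# 100 assignments over 4 experts, with expert 0 taking 60 of them.
assignments = [0] * 60 + [1] * 14 + [2] * 13 + [3] * 13
loads = expert_loads(assignments, 4)
print("loads:", loads)
print("dropped at cf=1.0:", dropped_tokens(loads, 1.0))
print("dropped at cf=2.5:", dropped_tokens(loads, 2.5))
```

At cf=1.0 each expert's buffer holds its fair share (25 here), so everything the hot expert receives beyond that is discarded. Raising cf until the buffer covers the hot expert drops nothing, at the cost of buffers sized for the worst case.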
Background: why imbalance is a tax on parallelism
With expert parallelism (EP), experts live on different devices and every step runs
all-to-all: tokens ship to their experts' devices, compute happens, results ship
back. The step completes when the last device finishes — a barrier. So step time is
max(device loads), while useful work is sum(loads). Parallel efficiency is their
ratio scaled by device count:
efficiency = sum(loads) / (num_devices × max(device_load))
Perfectly balanced: efficiency 1. One expert carrying 60% of assignments on an 8-device
layout: the hot device defines the step while seven others idle most of it — your
test_imbalance_burns_parallel_efficiency measures >2.5× step inflation at identical
total work. Note what this is, structurally: a straggler problem, the same shape as
Phase 3 lab-05's prefill spike (one slow element holds the barrier) and the tail-at-scale
phenomenon generally. Distribution problems all rhyme.
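The barrier arithmetic is small enough to write out. The sketch below assumes the naive e % num_devices round-robin placement and unit cost per assignment; the function names echo the lab's, but the signatures and the example loads are assumptions made for illustration.

```python
def device_loads(loads, num_devices):
    """Sum expert loads per device under round-robin placement (expert e -> e % num_devices)."""
    per_device = [0] * num_devices
    for expert, load in enumerate(loads):
        per_device[expert % num_devices] += load
    return per_device


def ep_step_time(loads, num_devices):
    """The step ends when the busiest device finishes."""
    return max(device_loads(loads, num_devices))


def efficiency(loads, num_devices):
    """Useful work divided by the work the devices were held for."""
    return sum(loads) / (num_devices * ep_step_time(loads, num_devices))


balanced = [100] * 8                          # 800 assignments, evenly spread
hot = [480, 46, 46, 46, 46, 46, 45, 45]       # same 800, expert 0 takes 60%

print("balanced step:", ep_step_time(balanced, 8))
print("hot step     :", ep_step_time(hot, 8))
print("inflation    :", ep_step_time(hot, 8) / ep_step_time(balanced, 8))
print("efficiency   :", round(efficiency(hot, 8), 3))

# Fewer devices than experts: co-located experts' loads add up.
print("4-device loads:", device_loads(hot, 4))
```

The four-device line shows why placement matters independently of routing: expert 4 shares a device with the hot expert and makes the slowest device slower still.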
The mitigation toolbox (each one is a knob in real systems): auxiliary load-balancing
losses at training time (bake balance into the router), expert placement (don't put
two historically-hot experts on one device — your e % num_devices round-robin is the
naive baseline placement), redundant replicas of hot experts (vLLM's EPLB — expert
parallel load balancer — does exactly this), and shared experts (DeepSeek's
always-active expert absorbs the common patterns so the routed ones stay balanced).
Files
- starter.py — expert_loads, imbalance, ep_step_time, dropped_tokens. Your work.
- solution.py — reference.
- test_lab.py — counting, the uniform baseline, the hot-expert blowup, max-device step time, the efficiency burn, and capacity-overflow accounting.
Run
LAB_IMPL=starter pytest phase-07-gemm-and-moe-kernels/labs/lab-04-expert-load-balance -q
pytest phase-07-gemm-and-moe-kernels/labs/lab-04-expert-load-balance -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_loads_count_assignments | The histogram itself — note it counts (token, expert) assignments, so a token routed to expert 1 twice in top-k counts twice (it really does cost two expert-rows of compute) |
| test_uniform_routing_is_nearly_balanced | The healthy baseline: random routing lands imbalance < 1.3 — your reference point for "what good looks like" on a histogram |
| test_hot_expert_blows_up_imbalance | The pathology: 60% to one expert → imbalance > 3 (~5× its fair share) |
| test_ep_step_time_is_the_max_device | The barrier semantics, including the subtle case: with fewer devices than experts, co-located experts' loads add — placement matters, not just routing |
| test_imbalance_burns_parallel_efficiency | The money assert: identical sum(loads), >2.5× the step time. Throughput lost to distribution, with zero slow code anywhere |
| test_capacity_drops_only_the_overflow / _cf1 | Capacity-factor mechanics: cf=1.0 with a hot expert drops 68 of 128 assignments; cf high enough drops none. The training-era trade inference refuses |
Hitchhiker's notes
- Why does top-k routing make this worse than it looks? Each token takes k experts, so a hot expert co-occurs with others — you can't fix a hot expert by moving its tokens without also touching their second choices. Balance is a joint property of the whole routing matrix, which is why post-hoc fixes (placement, replication) are often easier than router surgery.
- EPLB in one sentence: measure loads over a window, then replicate hot experts across devices and route their traffic round-robin among replicas — trading memory (extra expert copies) for balance. Find it upstream (vllm/distributed/eplb/); your ep_step_time is the objective function it's minimizing.
- The all-to-all itself (not modeled here) adds a second imbalance-sensitive cost: hot experts concentrate network traffic too. On NVLink-rich nodes it's tolerable; across nodes it's why DeepSeek-V3's deployment papers obsess over expert placement. Phase 10 picks this up.
- Decode vs prefill see different distributions. A prefill batch routes thousands of tokens (law of large numbers smooths loads); a decode batch routes batch × k assignments — at batch 8, k 2, that's 16 samples over maybe 64 experts: structurally lumpy even with a perfect router. Small-batch MoE decode is imbalanced by arithmetic, not pathology — one more reason MoE wants large serving batches (and why test_uniform_routing_is_nearly_balanced uses 1600 assignments, not 16).
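The small-batch lumpiness in the last note is easy to see in a simulation. The sketch below is a simplification under stated assumptions: imbalance is defined as max load over mean load (the lab's exact definition may differ), and assignments are drawn independently and uniformly, which ignores the fact that a real top-k router picks k distinct experts per token.

```python
import random


def imbalance(loads):
    """Assumed definition: busiest expert's load relative to the mean load."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0


def uniform_router_imbalance(num_assignments, num_experts, trials, seed=0):
    """Average imbalance of a perfectly fair random router over several trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        loads = [0] * num_experts
        for _ in range(num_assignments):
            loads[rng.randrange(num_experts)] += 1
        total += imbalance(loads)
    return total / trials


small = uniform_router_imbalance(16, 64, trials=200)       # batch 8, k 2
large = uniform_router_imbalance(16_000, 64, trials=20)    # prefill-sized
print(f"16 assignments over 64 experts    : imbalance ~ {small:.2f}")
print(f"16000 assignments over 64 experts : imbalance ~ {large:.2f}")
```

With 16 assignments over 64 experts the mean load is 0.25, so any expert that receives even one token already sits at four times the mean. That floor is arithmetic, independent of how good the router is.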
Going further
- Implement best_placement(loads, num_devices) — greedy bin-packing (sort descending, assign to lightest device) — and measure how much it recovers vs round-robin on the skewed router. Then add replication: one extra copy of the hottest expert, traffic split — compare. You've now built EPLB's two levers.
- Sweep batch size from 4 to 4096 with the uniform router and plot imbalance: watch the small-batch lumpiness decay as 1/√batch. This curve is why MoE throughput benchmarks at tiny batch are misleading.
- Add a shared-expert column (every token also visits expert S, DeepSeek-style) and check what it does to imbalance among the routed experts when the router is hot. (Hint: it doesn't fix routing — it shrinks the stakes per routed token.)
References
- Fedus et al., Switch Transformers (2021) — capacity factor, token dropping, aux losses: the training-era trade space: https://arxiv.org/abs/2101.03961
- DeepSeek-AI, DeepSeek-V3 Technical Report (2024) — shared experts, auxiliary-loss-free balancing, and deployment-grade EP placement: https://arxiv.org/abs/2412.19437
- upstream/vllm/distributed/eplb/ — vLLM's expert-parallel load balancer; your ep_step_time is its loss function.
- upstream/vllm/model_executor/layers/fused_moe/ — where loads meet kernels (moe_align_block_size and friends, lab-01's mapping).
- Dean & Barroso, The Tail at Scale — the straggler pattern this is an instance of: https://research.google/pubs/the-tail-at-scale/
Phase 07 — Exercises: GEMM & MoE Kernels
Warm-up (explain)
1. What is a GEMM and why is "most of a transformer is GEMMs" true?
2. In one sentence each: router, top-k, expert, combine.
3. Why does MoE give "huge capacity, cheap compute"? What's actually cheap?
Core (trace the code)
4. List the 6 steps of an MoE forward and which fused_moe/ file implements each.
5. Why do permute + un-permute exist (moe_align_block_size.py)? What goes wrong without them?
6. In MixtralMoE (mixtral.py:77), how few lines is the MoE block once FusedMoE exists, and why?
7. What does a fused MoE kernel fuse, and which Phase-5 problem does that address?
Build (your lab)
8. In lab-01, why must moe_forward_grouped use np.add.at (not out[toks] += ...)? (Hint: a token can route to two experts.)
9. Add expert load metrics: count tokens per expert; construct an input that overloads one expert and explain the throughput impact (load imbalance).
10. Add a shared expert (always-on, added to every token, DeepSeek-style) and keep grouped == reference.
Design (staff-level)
11. EP vs TP for a 256-expert MoE on 8 GPUs: what does each shard, what comms does each add, and when would you combine them?
12. Your MoE serving is throughput-bound and a profile shows the grouped GEMM at 45% but with low tensor-core utilization. Name two likely causes and fixes.
13. Expert load is skewed (a few hot experts). What mitigations exist (capacity, aux loss at train time, routing tweaks), and which are available at serving time?
Self-grading
4–7 and 11–13 are interview-grade. Could you draw the MoE forward and name the files? If not, re-read 01-deep-dive.md.
Phase 07 — Interview Questions: GEMM & MoE Kernels
Q1. What is MoE and why is it attractive?
Model answer
A Mixture-of-Experts layer replaces the dense MLP with many expert MLPs and a router that sends each token to its top-k experts (e.g. 2 of 256). So the model has huge total parameters (capacity/quality) but activates only a few experts per token (cheap compute). DeepSeek-V3 has 256 experts, ~8 active. The cost moves from FLOPs to moving tokens to the right experts and memory for all those weights.
Q2. Walk through the MoE forward.
Model answer
router (small linear) → top-k expert selection + softmax weights → permute tokens so each expert's tokens are contiguous → grouped GEMM (run each expert's MLP on its block) → un-permute back to token order → weighted combine of each token's k expert outputs. Permute/un-permute exist so the GPU does a few big matmuls instead of many scattered tiny ones. (fused_moe/fused_moe.py, moe_align_block_size.py, layer.py.)
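The six steps can be walked through in plain Python with scalar "tokens" and toy experts. This is a teaching sketch, not vLLM's implementation: every name in it is hypothetical, each expert is a plain function standing in for an MLP, and the routing weights are softmaxed over the selected k experts, which is one common convention among several.

```python
import math
from itertools import groupby


def top_k_softmax(logits, k):
    """Steps 1-2: pick the k largest router logits, softmax over just those."""
    top = sorted(range(len(logits)), key=lambda e: -logits[e])[:k]
    peak = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - peak) for e in top]
    norm = sum(exps)
    return [(e, x / norm) for e, x in zip(top, exps)]


def moe_reference(tokens, router_logits, experts, k):
    """Token-by-token version: scattered tiny expert calls."""
    return [
        sum(w * experts[e](x) for e, w in top_k_softmax(logits, k))
        for x, logits in zip(tokens, router_logits)
    ]


def moe_grouped(tokens, router_logits, experts, k):
    """Grouped version: sort assignments by expert, run each expert on its block."""
    assignments = [
        (e, t, w)
        for t, logits in enumerate(router_logits)
        for e, w in top_k_softmax(logits, k)
    ]
    assignments.sort(key=lambda a: a[0])            # step 3: permute
    out = [0.0] * len(tokens)
    for e, block in groupby(assignments, key=lambda a: a[0]):
        block = list(block)
        outputs = [experts[e](tokens[t]) for _, t, _ in block]  # step 4: one block per expert
        for (_, t, w), y in zip(block, outputs):
            out[t] += w * y                          # steps 5-6: un-permute + combine
    return out


experts = [lambda x: 2.0 * x, lambda x: x + 1.0, lambda x: -x, lambda x: x * x]
tokens = [0.5, -1.0, 2.0, 3.0]
router_logits = [
    [2.0, 1.0, 0.1, -1.0],
    [0.0, 0.5, 3.0, 0.2],
    [1.0, 1.5, -2.0, 0.3],
    [-0.5, 0.0, 0.4, 2.5],
]
ref = moe_reference(tokens, router_logits, experts, k=2)
grouped = moe_grouped(tokens, router_logits, experts, k=2)
print("max abs diff:", max(abs(a - b) for a, b in zip(ref, grouped)))
```

The accumulation into out[t] is a scatter-add: each token index appears k times across the sorted assignment list, once per expert it was routed to.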
Q3. Why fused MoE kernels?
Model answer
Naive MoE is a gather + many small per-expert GEMMs + a scatter — launch-bound and memory-bound. A fused kernel does routing/grouped-GEMM/combine in one (or few) kernels indexed by a sorted token→expert map, keeping tensor cores busy and removing launch overhead. It's decisive for MoE throughput (the profile in lab-02 shows the grouped GEMM dominating).
Q4. Expert parallelism vs tensor parallelism for MoE?
Model answer
EP places whole experts on different GPUs; tokens are shipped to their expert's GPU via all-to-all and back. It scales expert count cheaply but adds communication and load-balancing risk (a hot expert bottlenecks its GPU). TP shards each expert's weights across GPUs (per-layer all-reduce). Real deployments often use EP for the MoE layers and DP/TP for attention, since the two have different parallelism sweet spots.
Q5. What sets which GEMM kernel runs?
Model answer
The dtype/quant format (Phase 6) and hardware. CUTLASS/TRTLLM-GEN/CuTeDSL provide kernels specialized per precision (fp16/fp8/int4) and tiled to the GPU's memory hierarchy; a quantized weight needs the matching kernel (e.g. Marlin for INT4, scaled-mm for FP8). Mismatch is wrong or slow — that's why quant format and kernel are chosen together.
Rapid-fire
- MoE step order? router → top-k → permute → grouped GEMM → un-permute → combine.
- Why permute? contiguous per-expert work → big matmuls, not scattered tiny ones.
- EP shards? whole experts (all-to-all). TP shards? each expert's weights.
- Router cost? tiny; the experts (GEMM) dominate.
- Famous open MoEs? Mixtral, DeepSeek-V3, Qwen-MoE, GPT-OSS.
Phase 07 — Cheatsheet: GEMM & MoE Kernels
The one-liner
GEMM = the FLOPs (every linear layer). MoE = many expert MLPs + a router; each token uses top-k experts → huge capacity, cheap compute. The work becomes a routed, grouped GEMM.
GEMM kernels
cuBLAS (baseline) · CUTLASS (quantized/composable) · TRTLLM-GEN / CuTeDSL (generated, per-GPU). Specialized per dtype (fp16/fp8/int4); the quant format (Phase 6) picks the kernel.
MoE forward (6 steps)
router (gate linear) → top-k experts + softmax weights → permute (group tokens by expert) → grouped GEMM (each expert on its block) → un-permute → combine (weighted sum). Permute/un-permute = make per-expert work contiguous (big matmuls, not scattered tiny ones).
Fused MoE
One/few kernels do routing+grouped-GEMM+combine → removes gather/scatter launch + memory overhead. The experts (grouped GEMM) dominate step time; the router is cheap.
Parallelism
- EP (expert parallel): whole experts on different GPUs; all-to-all ships tokens; watch load balance (hot experts).
- TP: shard each expert's weights across GPUs (per-layer all-reduce).
- Often EP for MoE + DP/TP for attention.
Key upstream
- fused_moe/layer.py:73 FusedMoE · :1306 forward
- fused_moe/fused_moe.py:295 fused_moe_kernel · :1587 fused_experts · :1664 fused_experts_impl
- fused_moe/moe_align_block_size.py · moe_permute_unpermute.py · all2all_utils.py (EP)
- models/mixtral.py:77 MixtralMoE · models/deepseek_v2.py (shared experts + MLA)
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md
Phase 08 — The Hitchhiker's Guide to Speculative Decoding
Contents
- Don't Panic
- Step 1: Why checking many tokens costs ~one run
- Step 2: Where the guesses come from (the proposers)
- Step 3: Why the output doesn't change (the honest part)
- Step 4: When it helps and when it hurts (the tradeoff to reason about)
- Step 5: How it rides the same scheduler (no special case)
- The invariants to memorize
- What you'll do
Don't Panic
Decode is slow because each token needs a full run of the big model (Phase 0: one expensive "haul the books" memory read per token). Speculative decoding is a clever cheat:
A cheap "drafter" guesses the next several tokens. The big model then checks all of them in a single run and keeps the longest correct prefix. If the guesses are good, you get several tokens for the price of one big-model run — and, remarkably, the output is exactly what the big model would have produced anyway.
Think of it like a fast typist proposing the rest of your sentence and you (the careful editor) glancing at it: if the first few words are right, you accept them instantly and only start typing yourself where the guess goes wrong. You did far less typing, and the final sentence is identical to what you'd have written.
Big model alone: [run]→t1 [run]→t2 [run]→t3 [run]→t4 4 expensive runs, 4 tokens
Speculative: drafter guesses [t1,t2,t3,t4] → big model verifies all in ONE run
accepts t1,t2,t3 (t4 wrong) + fixes it 1 expensive run, ~4 tokens
Step 1: Why checking many tokens costs ~one run
This is the magic that makes it work. Remember from Phase 0 that prefill processes many tokens in one run cheaply (it's compute-bound, the math units stay busy). Verification is just a mini-prefill: feed the big model the context plus the drafted tokens, and in one run it produces, for each position, what it would have predicted there. Now compare:
context = "The capital of France is"
draft = [" Paris", ".", " The"] (the cheap drafter's guess)
one big-model run over context+draft gives the big model's own prediction at each spot:
after "...is" big model says " Paris" == draft[0] ✓ accept
after "...is Paris" big model says "." == draft[1] ✓ accept
after "...is Paris." big model says " It" != draft[2] ✗ stop, use " It"
result: accepted " Paris", ".", then the correction " It" → 3 tokens from 1 run
So one expensive run yielded 3 tokens instead of 1. The acceptance rate (how many guesses the big model agrees with) decides the speedup.
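The comparison in the trace above reduces to a short loop. The sketch below is a minimal illustration of greedy verification, with hypothetical names throughout. It assumes the verification run returns one prediction per drafted position plus one more after the last draft token, so that a fully accepted draft still yields an extra token.

```python
def greedy_verify(draft, target_predictions):
    """Keep the longest prefix of `draft` that matches the big model, plus one more token.

    target_predictions[i] is the big model's argmax given context + draft[:i],
    so it has len(draft) + 1 entries (assumed layout for this sketch).
    """
    accepted = []
    for guess, truth in zip(draft, target_predictions):
        if guess != truth:
            accepted.append(truth)      # first disagreement: take the correction, stop
            return accepted
        accepted.append(guess)
    # Every guess matched: the final prediction comes along for free.
    accepted.append(target_predictions[len(draft)])
    return accepted


draft = [" Paris", ".", " The"]
# The last entry is conditioned on the rejected " The", so it is never used here.
target_predictions = [" Paris", ".", " It", "<unused>"]
result = greedy_verify(draft, target_predictions)
print(result, "->", len(result), "tokens from 1 run")
```

Because every emitted token is the big model's own argmax at that position, the output matches plain greedy decoding token for token; the drafter only affects how many positions one run can settle.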
Step 2: Where the guesses come from (the proposers)
Different "drafters," from free to fancy:
- n-gram / prompt-lookup (free, no model): if the recent text repeats something seen earlier (a name, a code snippet, a quoted phrase), just copy what followed it last time. Shockingly effective for repetitive content (code, structured data, summarization).
- EAGLE (a tiny trained head): a small network trained to predict the big model's next hidden states, giving high-quality guesses cheaply. One of the best methods today.
- Medusa (extra prediction heads), DFlash, suffix decoding, a small draft model: variations on "produce cheap guesses."
vLLM supports several (vllm/v1/spec_decode/). You'll build the n-gram proposer yourself in
lab-01 because it needs no model and lays bare the whole mechanism.
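As a preview of the mechanism, prompt-lookup drafting fits in a dozen lines. This is a sketch under assumptions, not the upstream NgramProposer: the function name and signature are made up for illustration, and it matches against the most recent earlier occurrence, whereas the real proposer may choose among matches differently.

```python
def ngram_propose(tokens, n, k):
    """Propose up to k tokens by finding an earlier copy of the last n tokens.

    Returns the tokens that followed the most recent earlier occurrence,
    or an empty list when there is no match.
    """
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan backwards so the most recent earlier occurrence wins.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + k]
    return []


history = [1, 2, 3, 4, 5, 1, 2, 3]
print(ngram_propose(history, n=3, k=2))    # earlier [1, 2, 3] was followed by 4, 5
print(ngram_propose([1, 2, 3, 4], n=2, k=2))  # [3, 4] never appeared before
```

An empty proposal simply means the step falls back to ordinary decoding, so a miss costs the search and nothing else.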
Step 3: Why the output doesn't change (the honest part)
A natural worry: "if a cheap drafter is involved, is the output worse?" No — and this is the beautiful guarantee. For greedy decoding it's obvious: you only accept a drafted token if it equals what the big model would have picked anyway (its argmax); the instant a guess disagrees, you throw it away and use the big model's choice. So the accepted sequence is identical to plain greedy decoding.
For random sampling, there's a slightly cleverer rule — rejection sampling — that accepts/rejects drafted tokens with just the right probabilities so the final distribution is provably exactly the big model's distribution. Either way: speculative decoding changes the speed, never the output. (Same "optimization ≠ behavior change" theme as the KV cache and chunked prefill.)
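The sampling rule can be checked empirically on a toy vocabulary. The sketch below handles a single position with hypothetical names: the drafter samples from q, the token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized positive part of p − q. The distributions p and q are made-up numbers chosen so that the drafter is noticeably wrong.

```python
import random


def speculative_sample_one(p, q, rng):
    """Return one token index distributed as p, using a proposal drawn from q."""
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]             # the drafter's guess
    if rng.random() < min(1.0, p[x] / q[x]):         # q[x] > 0 because x came from q
        return x
    # Rejected: resample from the residual max(0, p - q); weights need not sum to 1.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0]


p = [0.6, 0.3, 0.1]    # big model's distribution (toy numbers)
q = [0.2, 0.5, 0.3]    # drafter's distribution, deliberately mismatched

rng = random.Random(0)
draws = 50_000
counts = [0] * len(p)
for _ in range(draws):
    counts[speculative_sample_one(p, q, rng)] += 1
freqs = [c / draws for c in counts]
print("target   :", p)
print("empirical:", [round(f, 3) for f in freqs])
```

The algebra behind the match is short: a token is emitted via acceptance with probability min(p, q), and via the residual path with probability max(0, p − q), and the two add up to p.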
Step 4: When it helps and when it hurts (the tradeoff to reason about)
Verification isn't totally free — drafting costs a little, and the big model does slightly more work per run (it processes the draft tokens too). So:
speculative is a win when: accepted_tokens_per_run > cost of (drafting + extra verify work)
- High acceptance (repetitive text, a good EAGLE head) → big win.
- Low acceptance (creative text, weak drafter) → you wasted the draft; can be a loss.
- Small batch / latency-bound → shines (the GPU has spare capacity to verify).
- Large batch / already GPU-saturated → less benefit (no spare capacity; verifying drafts competes with real work).
What decides everything is the acceptance rate weighed against the draft cost. You'll measure
acceptance rate and effective tokens-per-run in lab-01.
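The tradeoff can be put into a small model. The sketch below uses a common simplification and is not a measurement: each drafted token is accepted independently with probability alpha, drafting one token costs a fraction c of a big-model run, and verifying k drafted tokens costs one run. The names and the max_k limit are assumptions made for illustration.

```python
def expected_tokens(alpha, k):
    """Expected tokens emitted per verify run: accepted prefix plus one correction/bonus."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha, c, k):
    """Tokens per unit cost relative to plain decoding (1 token per run)."""
    return expected_tokens(alpha, k) / (1.0 + c * k)


def best_k(alpha, c, max_k=16):
    """Draft length with the highest modeled speedup; 0 means 'turn it off'."""
    return max(range(max_k + 1), key=lambda k: speedup(alpha, c, k))


for alpha, c in [(0.8, 0.05), (0.5, 0.05), (0.2, 0.5)]:
    k = best_k(alpha, c)
    print(f"alpha={alpha}, c={c}: best k={k}, speedup={speedup(alpha, c, k):.2f}x")
```

The last case is the regime where drafting cannot pay for itself: with low acceptance and an expensive drafter, every k above zero models out slower than plain decoding.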
Step 5: How it rides the same scheduler (no special case)
Recall Phase 3's mantra: a request is just num_computed_tokens racing num_tokens. Speculative
decoding adds the draft tokens into that gap via num_tokens_with_spec, and reserves KV slots for
them with num_lookahead_tokens. The scheduler doesn't know it's spec decode — it just schedules
"some tokens to compute," exactly as the Phase 3 comment promised. Acceptance/rejection is sorted
out afterward in update_from_output. Elegant: one general mechanism absorbs a whole feature.
The invariants to memorize
- Drafter guesses k tokens cheaply; the big model verifies all in one run; keep the longest correct prefix + one correction.
- Verification ≈ one prefill run (compute-bound) → checking k tokens costs ~one decode run.
- Output is identical to normal decoding (greedy: accept only the argmax; sampling: rejection sampling preserves the distribution).
- Speedup ∝ acceptance rate; it can lose at low acceptance or when the GPU is already full.
- It rides the normal scheduler via num_tokens_with_spec + num_lookahead_tokens.
What you'll do
- Read: 01-deep-dive.md — the n-gram proposer, EAGLE, the rejection sampler, and the scheduler hooks, line-anchored.
- Build: 02-mini-build.md — an n-gram proposer + greedy verifier; measure acceptance and tokens-per-run.
- Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
  - lab-01-ngram-spec-decode [CPU-OK]: build draft+verify; prove output == baseline and measure the speedup on repetitive vs random text.
  - lab-02-eagle-on-real-vllm [GPU-OPT]: enable EAGLE on real vLLM; measure ITL + acceptance (captured).
  - lab-03-rejection-sampling [CPU-OK]: the losslessness theorem for sampling: accept with min(1, p/q), resample the residual; verify empirically that outputs are distributed exactly as p.
  - lab-04-speedup-model [CPU-OK]: the (α, c, k) economics: expected tokens/cycle, speedup, optimal k, including the regime where the right answer is "turn it off".
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
Phase 08 — Deep Dive: speculative decoding in real vLLM
Paths relative to upstream/ at v0.22.1 @ 0decac0.
vllm/v1/spec_decode/ngram_proposer.py    n-gram / prompt-lookup proposer (no model; read first)
vllm/v1/spec_decode/eagle.py    EAGLE proposer (a tiny trained head)
vllm/v1/spec_decode/{medusa,dflash,suffix_decoding,draft_model}.py    other proposers
vllm/v1/spec_decode/metadata.py    spec metadata passed around
vllm/v1/sample/rejection_sampler.py    verification that preserves the distribution
vllm/v1/core/sched/scheduler.py    spec_token_ids / num_lookahead_tokens (the hook)
Contents
- 1. The simplest proposer: n-gram (ngram_proposer.py)
- 2. A trained proposer: EAGLE (eagle.py)
- 3. Verification that preserves the distribution: the rejection sampler
- 4. How it rides the scheduler (the elegant part)
- 5. Metrics
- Reading checklist
1. The simplest proposer: n-gram (ngram_proposer.py)
class NgramProposer (:12), propose (:131). The idea (prompt-lookup): take the last n
tokens of the sequence, search earlier in the same sequence for a previous occurrence, and if found,
propose the k tokens that followed it last time. No model, no weights — pure string matching, yet
it crushes repetitive workloads (code, JSON, summarization where the answer quotes the source).
Read propose and notice it returns up to k candidate token ids. Your lab-01
ngram_propose is this exact algorithm.
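The lookup described above can be sketched in a few lines. This is a minimal illustration of the prompt-lookup idea, not upstream's implementation: the signature follows the lab's ngram_propose(context, n, k), and the plain backwards scan (no bounded lookback, a single n-gram size) is an assumption made for brevity.

```python
from typing import List


def ngram_propose(context: List[int], n: int, k: int) -> List[int]:
    """Propose up to k tokens by prompt lookup; [] if there is no earlier match."""
    if n <= 0 or k <= 0 or len(context) <= n:
        return []
    pattern = context[-n:]
    # Scan latest-first. Start one position before the pattern's own
    # occurrence at the very end, so any match is strictly earlier.
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == pattern:
            follow = start + n
            return context[follow:follow + k]
    return []


periodic = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2]
print(ngram_propose(periodic, n=2, k=3))      # [3, 4, 1]
print(ngram_propose([5, 6, 7, 8], n=2, k=3))  # []
```

The second call shows the failure mode: with no earlier occurrence the drafter proposes nothing, and the cycle falls back to an ordinary single-token step.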
2. A trained proposer: EAGLE (eagle.py)
class EagleProposer(SpecDecodeBaseProposer) (:10). EAGLE runs a small network that predicts
the target model's next hidden states (not just tokens), which it then turns into high-quality
draft tokens — far better acceptance than n-gram on general text, for a small extra cost. It shares
the target's KV/hidden states (note extract_hidden_states.py). Medusa (medusa.py), DFlash
(dflash.py), suffix decoding, and a plain small draft model are siblings — all implement "produce
k cheap, plausible next tokens." They plug into the same verify path.
3. Verification that preserves the distribution: the rejection sampler
vllm/v1/sample/rejection_sampler.py: class RejectionSampler (:37), forward (:87),
rejection_sample (:392). For greedy it's trivial (accept a draft token iff it equals the
target's argmax). For sampling, rejection_sample implements the speculative-sampling rule:
accept draft token i with probability min(1, p_target(i) / p_draft(i)); on rejection, resample
from the adjusted distribution normalize(max(0, p_target − p_draft)). The math guarantees the
accepted tokens are distributed exactly as if the target had sampled directly — the proof of
"speed, not behavior." Skim the function and find the accept test and the resample-on-reject branch.
4. How it rides the scheduler (the elegant part)
Open vllm/v1/core/sched/scheduler.py and search spec_token_ids and num_lookahead_tokens
(around the running-request loop, ~:447/:502). What you'll see:
- num_lookahead_tokens is passed to allocate_slots so KV space is reserved for the draft tokens (Phase 2).
- a request's num_tokens_with_spec (request.py:243) includes the draft tokens, so the same num_new_tokens = num_tokens_with_spec − num_computed_tokens clamp (Phase 3) naturally schedules them to be verified.
- after the model runs, update_from_output consults the rejection sampler's result, keeps the accepted prefix, and rolls back the rest (un-computes rejected tokens' KV).
So spec decode is not a special path in the scheduler — it's "a few extra tokens in the gap," exactly as Phase 3's top-of-function comment promised. That's the design lesson: a good abstraction ("close the num_computed→num_tokens gap") absorbs a whole feature for free.
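The counter arithmetic behind that claim fits in a toy model. The sketch below is not vLLM's scheduler: ToyRequest, tokens_to_schedule, and apply_verification are hypothetical names invented for illustration, and only the counter names num_computed_tokens, num_tokens_with_spec, and spec_token_ids are borrowed from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request: only the counters matter here."""
    token_ids: List[int]                       # prompt + tokens emitted so far
    num_computed_tokens: int = 0               # tokens whose KV is already written
    spec_token_ids: List[int] = field(default_factory=list)  # current drafts

    @property
    def num_tokens_with_spec(self) -> int:
        return len(self.token_ids) + len(self.spec_token_ids)


def tokens_to_schedule(req: ToyRequest, token_budget: int) -> int:
    # One clamp for prefill, decode, and spec decode alike: close the gap.
    return min(req.num_tokens_with_spec - req.num_computed_tokens, token_budget)


def apply_verification(req: ToyRequest, num_scheduled: int, emitted: List[int]) -> None:
    """emitted = accepted drafts followed by one correction or bonus token."""
    req.num_computed_tokens += num_scheduled
    num_rejected = len(req.spec_token_ids) + 1 - len(emitted)
    req.num_computed_tokens -= num_rejected    # roll back the rejected drafts
    req.token_ids.extend(emitted)
    req.spec_token_ids = []


req = ToyRequest(token_ids=[10, 11, 12, 13], num_computed_tokens=3,
                 spec_token_ids=[20, 21, 22])
n = tokens_to_schedule(req, token_budget=64)
print(n)                                       # 4: one real token + three drafts
apply_verification(req, n, emitted=[20, 99])   # one draft accepted, then a correction
print(req.token_ids)                           # [10, 11, 12, 13, 20, 99]
print(req.num_computed_tokens)                 # 5
print(tokens_to_schedule(req, 64))             # 1: back to a plain decode step
```

Nothing in tokens_to_schedule mentions speculation; the drafts only widen the gap it is asked to close.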
5. Metrics
spec_decode/metrics.py tracks acceptance rate and accepted-tokens-per-step — the numbers that
tell you whether spec decode is paying off (Step 4 of the guide). In production you watch these to
decide whether to keep it on for a given workload.
Reading checklist
- NgramProposer.propose: how does it find a candidate, and what does it return?
- EAGLE: what does it predict that makes its drafts good (hidden states, not just tokens)?
- rejection_sample: find the accept test and the resample-on-reject; why does it preserve the distribution?
- In scheduler.py, how do num_lookahead_tokens and num_tokens_with_spec make spec decode ride the normal schedule?
- What does the metrics module measure, and why is acceptance rate the deciding number?
Now build it: 02-mini-build.md, then the labs.
Phase 08 — Mini-Build: n-gram draft + greedy verify
You'll build the whole speculative loop with a free drafter (n-gram / prompt-lookup) and a greedy verifier, then measure the two numbers that decide everything: acceptance rate and tokens per big-model run. No GPU, no weights — just the mechanism.
Contents
- The task (lab-01)
- What you'll prove (the two headline properties)
- Definition of done
- Map to the real engine
The task (lab-01)
Model the big model as a deterministic target(context) -> token (so greedy is reproducible).
Implement:
- ngram_propose(context, n, k) → up to k proposed tokens: find the most recent earlier occurrence of context[-n:], propose the k tokens that followed it last time ([] if no match).
- verify_greedy(context, proposed, target) → (accepted, n_proposed_accepted): accept proposed[i] iff it equals target(context + accepted_so_far); stop at the first mismatch; then append the correction token target(context + accepted_so_far) (the big model's own choice). So you always emit at least 1 token per run.
- run_speculative(context, target, n, k, num_tokens) → generate ≥ num_tokens, returning the produced tokens plus counts: total proposed, total accepted, number of big-model runs.
- run_baseline(context, target, num_tokens) → plain greedy: one token per run.
What you'll prove (the two headline properties)
- Identical output: run_speculative(...) == run_baseline(...) token-for-token. (You only ever accept the target's own greedy choice.) This is the correctness guarantee.
- Fewer runs when guesses are good: on a periodic sequence the n-gram drafter nails the pattern, so tokens_per_run = total_tokens / num_runs ≫ 1; on a random target it's ≈ 1 (no speedup). That's acceptance rate in action.
Definition of done
pytest phase-08-speculative-decoding/labs -q
Map to the real engine
| your code | real vLLM |
|---|---|
| ngram_propose | NgramProposer.propose (ngram_proposer.py:131) |
| verify_greedy | greedy path of RejectionSampler (rejection_sampler.py:87) |
| (sampling version) | rejection_sample (rejection_sampler.py:392) |
| counts / acceptance | spec_decode/metrics.py |
| "runs" = scheduler steps | num_lookahead_tokens + num_tokens_with_spec (Phase 3) |
Phase 08 Labs — Speculative Decoding
Four labs on the art of spending idle FLOPs to buy latency. The arc: build the draft→verify machine with a free drafter (lab-01), prove the losslessness theorem for the sampled case (lab-03), price the trade with the expected-speedup model — including when to turn it off (lab-04), then measure the state of the art (EAGLE) on real silicon and reconcile every number against the models you built (lab-02).
Recommended order: 01 → 03 → 04 → 02. (Directory numbers predate labs 03–04; the
recommended order follows the arc: mechanism, theorem, economics, measurement.) CPU labs follow the standard contract:
starter.py (your work), solution.py (reference), test_lab.py (the spec); default
runs the solution, LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-08-speculative-decoding/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-08-speculative-decoding/labs/lab-01-ngram-spec-decode -q
Contents
- lab-01-ngram-spec-decode [CPU-OK]
- lab-02-eagle-on-real-vllm [GPU-OPT]
- lab-03-rejection-sampling [CPU-OK]
- lab-04-speedup-model [CPU-OK]
- What you can do after this phase
Labs
lab-01-ngram-spec-decode [CPU-OK]
The whole machine with the simplest drafter: n-gram prompt-lookup proposes, a greedy verifier accepts the leading run, corrections and bonus tokens keep progress ≥ 1 token/cycle. Proven token-identical to baseline; speedup measured as a property of the text (dramatic on repetitive, zero on random). Skills: the invariant verify loop; tokens-per-run as THE metric; graceful degradation; the evolving-context off-by-one.
lab-02-eagle-on-real-vllm [GPU-OPT]
The integration test: EAGLE (a one-layer head reading the target's hidden states) on Llama-3-8B — ITL 18.2 → 9.6 ms, acceptance 2.8/5 — reconciled number-by-number against labs 01/03/04, including the two honest qualifications (acceptance is workload-dependent; the win fades at saturated batch). Annotated capture included. Skills: predict-then-measure; the three acceptance metrics and their denominators; spec decode as a latency tool funded by spare compute.
lab-03-rejection-sampling [CPU-OK]
The theorem: accept draft x with min(1, p[x]/q[x]), else resample from
normalize(max(p − q, 0)) — and the output is distributed exactly as the target,
for any drafter. Verified empirically (200k draws through a clueless uniform drafter
land on the target to 0.005), plus the closed form α = Σ min(p, q) and the adversarial
limits. Skills: the residual construction; distributional testing with calibrated
tolerances; α as distribution overlap.
lab-04-speedup-model [CPU-OK]
The economics in three functions: E[tokens/cycle] = (1−α^(k+1))/(1−α), speedup =
that over k·c + 1, and optimal_k — which is sometimes zero (a mediocre drafter
at real cost loses to no speculation, and the model says so). Validated against
simulation to 1%; EAGLE's published numbers drop out of the formula. Skills: the
(α, c, k) economy; diminishing returns; why free drafters can't lose and saturated
GPUs can't win.
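The three functions can be written directly from the formulas above. This is a sketch under stated assumptions: the function names mirror the prose, but the exact signatures and the k_max search bound are guesses, not the lab's reference solution.

```python
def expected_tokens_per_cycle(alpha: float, k: int) -> float:
    """E[tokens/cycle] = (1 - alpha^(k+1)) / (1 - alpha); k + 1 when alpha == 1."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, c: float, k: int) -> float:
    """Expected tokens per cycle over the cycle cost k*c + 1 (in target passes)."""
    return expected_tokens_per_cycle(alpha, k) / (k * c + 1.0)


def optimal_k(alpha: float, c: float, k_max: int = 16) -> int:
    """Best k in [0, k_max]; k = 0 means 'turn speculation off' (speedup 1.0)."""
    return max(range(k_max + 1), key=lambda k: speedup(alpha, c, k))


print(optimal_k(alpha=0.3, c=0.5))             # 0: a mediocre, costly drafter loses
print(optimal_k(alpha=0.8, c=0.0, k_max=16))   # 16: a free drafter never loses
print(round(speedup(0.5, 0.05, 3), 2))         # 1.63
```

With c = 0 the speedup equals the expected tokens per cycle, which only grows with k, so the search runs to its bound; any positive c eventually bends the curve back down.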
What you can do after this phase
Explain why speculative decoding is lossless — separately for greedy (trivial) and
sampled (the residual theorem) — and test the claim distributionally; evaluate any
drafter from two measured numbers (α on your traffic, c from a profile) before
deploying it; choose num_speculative_tokens from arithmetic; and reconcile vLLM's
spec-decode metrics with first principles. Phase 9 broadens sampling itself; the verify
machinery you now own reappears wherever one batched pass scores many candidates.
Lab 08-01 — n-gram Draft + Greedy Verify [CPU-OK]
Speculative decoding starts from an indignity: the most powerful models on earth spend
most of their decode steps predicting tokens a text search could have guessed — the
closing half of a quoted phrase, the rest of a repeated identifier, boilerplate the
prompt already contains. This lab builds the whole draft→verify mechanism end to end
using exactly that text search as the drafter (n-gram "prompt lookup" — vLLM's
method="ngram", the only drafter that needs no second model and costs nothing), and
establishes the two facts every speculative method lives by: the output is
token-for-token identical to plain decoding, and the speedup is entirely a
function of how guessable the text is — dramatic on repetitive sequences, exactly zero
on random ones. No GPU, no weights; the mechanism in its purest form.
Contents
- Why this lab exists
- Background: the asymmetry being arbitraged
- Files
- Run
- What to implement
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Every speculative method — n-gram, EAGLE, Medusa, draft models, DFlash — is the same
machine with a different drafter plugged in: propose k tokens cheaply, verify them with
one target pass, keep the leading run that matches, always emit at least one token.
Building that machine once, with the simplest possible drafter, separates the invariant
structure (the verify loop, the correction token, the bonus token, the ≥1-token progress
guarantee) from the pluggable part (where proposals come from). After this lab, every
spec-decode paper you read is "lab-01 with a fancier propose," and every acceptance
metric in vLLM's logs maps to a counter you maintained by hand.
The n-gram drafter is also worth knowing in its own right, not as a toy: it ships in
production vLLM, it's free (no extra model, no GPU memory, no draft forward), and on
the workloads where it shines — RAG (the answer quotes the context), code editing (the
new code repeats the old), structured output — it's embarrassingly competitive with
learned drafters. Free and sometimes great is a good tool to know exists; lab-04's
economics make "free" precise (c = 0 ⇒ speculation can never lose).
Background: the asymmetry being arbitraged
Phase 0 lab-04: a decode step is bandwidth-bound — the GPU streams all weights to emit
one token, using ~1% of its compute. But verifying k+1 candidate positions in one
forward pass costs barely more than that single step (it's prefill-shaped work riding
the same weight stream). So the target model can check k guesses for the price of
making one. Speculation arbitrages this: any source of decent guesses converts idle
compute into accepted tokens. The drafter here is the cheapest source imaginable —
"find the last time the current n-gram appeared, propose what followed it" — your
ngram_propose, a backwards string scan.
The verify rule (verify_greedy) is the part with invariants worth memorizing:
- Accept proposal p_i iff it equals the target's own greedy choice given everything before it (computed left to right, each acceptance extending the context).
- On the first mismatch, append the target's choice (the correction) and stop. That's why a cycle always emits ≥ 1 token: failure mode is baseline speed, never stall.
- If all k accepted, the verify pass has already computed the distribution at the position after them; append that bonus token. k accepted = k+1 emitted, free.
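The three rules above can be sketched as one short loop. This is an illustrative version, not the lab's reference solution: the target is assumed to be a plain Python callable from context to token, and it is called once per position here, whereas a real engine scores all positions in one batched pass.

```python
from typing import Callable, List, Sequence, Tuple

Target = Callable[[Sequence[int]], int]


def verify_greedy(context: Sequence[int], proposed: Sequence[int],
                  target: Target) -> Tuple[List[int], int]:
    """Return (emitted tokens, number of proposals accepted)."""
    emitted: List[int] = []
    for tok in proposed:
        # Compare against the evolving context, not the original one.
        if target(list(context) + emitted) != tok:
            break
        emitted.append(tok)
    n_accepted = len(emitted)
    # Correction after a mismatch, bonus after a clean sweep: either way it
    # is the target's own choice at the next position.
    emitted.append(target(list(context) + emitted))
    return emitted, n_accepted


def toy_target(ctx: Sequence[int]) -> int:
    # Hypothetical deterministic "model": count upward modulo 5.
    return (ctx[-1] + 1) % 5


print(verify_greedy([0, 1, 2], [3, 4, 9], toy_target))  # ([3, 4, 0], 2)
print(verify_greedy([0, 1, 2], [3, 4], toy_target))     # ([3, 4, 0], 2)
print(verify_greedy([0, 1, 2], [], toy_target))         # ([3], 0)
```

The first two calls emit the same tokens for different reasons: in the first the final 0 is a correction replacing the rejected 9, in the second it is the bonus token after every proposal was accepted.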
Files
- starter.py: ngram_propose, verify_greedy, run_speculative, run_baseline. Your work.
- solution.py: reference.
- test_lab.py: output identity, the periodic-text speedup, the random-text non-speedup.
Run
LAB_IMPL=starter pytest phase-08-speculative-decoding/labs/lab-01-ngram-spec-decode -q
pytest phase-08-speculative-decoding/labs/lab-01-ngram-spec-decode -q # reference
What to implement
Per 02-mini-build.md. Watch the two classic off-by-ones: the
n-gram scan must find an earlier occurrence (excluding the pattern's own position at
the very end — propose from yourself and you'll propose nothing useful forever), and the
verify must compare proposal i against the target evaluated on
context + accepted[:i] — against the evolving context, not the original one (compare
against the frozen context and proposals 2+ are checked against the wrong distribution;
outputs diverge from baseline and the identity test catches it).
What the tests prove
| Test | What it pins |
|---|---|
| output == baseline | The lossless guarantee, greedy form: you only ever keep the target's own choices, so the sequence cannot differ. (The sampled-temperature generalization is lab-03's theorem) |
| periodic text → runs ≪ num_tokens | High acceptance on guessable text: many tokens per target pass, the entire value proposition, measured by your own counters |
| random text → runs == num_tokens | Zero acceptance → exactly baseline-many target runs. Speculation's failure mode is no speedup, not wrong output: the graceful-degradation property that makes it deployable |
Hitchhiker's notes
- Tokens-per-run is THE metric. (accepted + runs) / runs from your stats dict is "mean acceptance length + 1" in vLLM's spec-decode logs (lab-02 reads 2.8 / 5 from a real run). Everything else (ITL improvement, the lab-04 economics) derives from it. When evaluating any drafter, ask for this number on your workload before anything else.
- Why scan latest-first? Recent context predicts continuation better than distant context (local repetition dominates). Upstream's NgramProposer.propose (ngram_proposer.py:131) does the same, with bounded lookback and n-gram sizes; read it after, since it's your function with engineering around it.
- The verify loop's "always ≥1 token" is load-bearing for the scheduler: a speculation cycle is, from Phase 3's perspective, just a decode step that may emit several tokens. The KV for accepted tokens was already written during the verify pass; rejected positions' KV must be discarded (in paged terms: the slots are simply overwritten next cycle, and the counters from Phase 1 make rollback almost free). If you wondered why num_computed_tokens racing num_tokens was such a good idea: speculative decoding is one more feature that composes with it for free.
- k is not free even when drafting is: each proposed-but-rejected token wastes a slot in the verify batch. With a free drafter the waste is mild (the verify pass was ~constant cost anyway); with a costly drafter it's lab-04's whole subject.
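As a small worked example of the first note, the metrics can be derived from three counters. The function and key names below are hypothetical (the lab's stats dict may be shaped differently), and the counter values are chosen only to reproduce the 2.8 / 5 figure quoted above.

```python
from typing import Dict


def spec_metrics(total_proposed: int, total_accepted: int, num_runs: int) -> Dict[str, float]:
    """Derive the headline numbers from three hand-maintained counters."""
    return {
        # Accepted draft tokens per target run.
        "mean_acceptance_length": total_accepted / num_runs,
        # Every run also emits one correction or bonus token.
        "tokens_per_run": (total_accepted + num_runs) / num_runs,
        # Fraction of proposed draft tokens that survived verification.
        "draft_acceptance_rate": total_accepted / total_proposed if total_proposed else 0.0,
    }


# 100 runs, k = 5 proposals each, 280 accepted in total.
print(spec_metrics(total_proposed=500, total_accepted=280, num_runs=100))
# {'mean_acceptance_length': 2.8, 'tokens_per_run': 3.8, 'draft_acceptance_rate': 0.56}
```

The same event sequence yields 2.8, 3.8, and 56%; they differ only in numerator and denominator, which is why a bare "acceptance" figure needs its definition attached.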
Going further
- Track per-position acceptance (does proposal #1 accept more often than #4?) on periodic text with noise injected; you'll rediscover the geometric decay that lab-04's α^i models.
- Implement the suffix-automaton upgrade: index all earlier occurrences instead of scanning (longest-match instead of fixed-n). Compare acceptance on text with mixed repetition lengths; this is the direction production prompt-lookup variants take.
- Run your speculative loop with mini_vllm's deterministic ToyModel as the target (greedy) and verify the identity holds against LLMEngine.generate, wiring the lab into the course's engine the way upstream wires NgramProposer into the runner.
References
- upstream/vllm/v1/spec_decode/ngram_proposer.py:131 (NgramProposer.propose): your drafter, productionized.
- upstream/vllm/v1/sample/rejection_sampler.py:87: the greedy verify fast path.
- Leviathan et al. (2022), the original draft/verify formulation: https://arxiv.org/abs/2211.17192
- Prompt Lookup Decoding (Saxena, 2023) — the n-gram drafter's origin as a standalone trick: https://github.com/apoorvumang/prompt-lookup-decoding
- Labs 03 (the sampled-case theorem) and 04 (the economics) — this lab's two generalizations.
Lab 08-02 — EAGLE on Real vLLM [GPU-OPT]
The CPU labs built the machine (01), proved its theorem (03), and priced it (04). This lab runs the state of the art on real silicon: EAGLE — a one-layer draft head that reads the target model's own hidden features and proposes from understanding rather than lab-01's string matching — and measures the two numbers the whole phase converges on: inter-token latency (18.2 → 9.6 ms, ~1.9×) and acceptance (2.8 of 5, 56%). Then the two qualifications that keep the result honest: acceptance climbs to ~80% on code, and the win shrinks at high batch — both of which your lab-04 model predicts before the GPU is even warm.
No GPU? Don't panic. The captured run below is annotated against all three CPU labs; the cross-checking is the lesson.
Contents
- Why this lab exists
- Background: what EAGLE changes (and doesn't)
- Requirements
- Steps
- Captured output (real run, Llama-3-8B + EAGLE, A100, vLLM 0.22.1, trimmed)
- Reading the numbers
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
This is the phase's integration test — of your understanding, not the software. Three CPU labs handed you a model of speculative decoding with named parameters: acceptance rate α (lab-03's overlap), draft cost c, tokens-per-cycle (lab-01's metric), speedup (lab-04's formula). A real EAGLE run hands you measurements. The work of this lab is the reconciliation: plug the measured acceptance into lab-04's formula, predict the ITL improvement, compare with the measured 1.9×, and account for the gap (overheads the model omits). When prediction and measurement agree within your stated error budget, the phase is yours. When they don't, one of your parameters is wrong — finding which is exactly the skill a staff engineer applies when a vendor (or a teammate) claims "2× from speculation" for your workload.
The lab also installs the production decision frame: speculative decoding is a latency tool funded by spare compute. Both halves of that sentence are visible in the capture — the single-stream halving, and the fade at saturated batch — and both are why "should we enable EAGLE?" has a different answer for a chatbot (yes, probably) than for an offline batch pipeline (probably not).
Background: what EAGLE changes (and doesn't)
Everything from labs 01/03 survives intact: propose k, one batched verify, leading-run acceptance, correction/bonus token, lossless guarantee. What EAGLE changes is the drafter: instead of searching the context for literal repeats (α ≈ 0 on novel prose), it runs a single transformer layer over the target model's last hidden states — the features the target computed anyway — plus the sampled token, and autoregressively rolls out k draft tokens. Because it reads the target's "thoughts" rather than its text, it predicts well even on text that never repeats (α ≈ 0.6–0.8); because it's one layer against the target's 32+, its cost is c ≈ 0.05. On lab-04's (α, c) plane, EAGLE sits in the corner that dominates both the free-but-blind n-gram drafter and the smart-but-expensive separate draft model — which is why the separate-draft-model approach has mostly faded, and why EAGLE-family heads exist for most popular open models.
The price of reading hidden states: the head is target-specific (trained per model, shapes must match — you can't borrow Llama's head for Qwen), and the draft itself runs autoregressively (k sequential micro-steps — tiny ones, but this is exactly where Phase 5's CUDA graphs become load-bearing: a 1-layer model's step is pure launch overhead without them).
Requirements
uv pip install -e ".[vllm]"
# a base model + its matching EAGLE head from the Hub, e.g.:
# meta-llama/Meta-Llama-3-8B-Instruct + yuhuili/EAGLE-LLaMA3-Instruct-8B
Steps
import time
from vllm import LLM, SamplingParams
sp = SamplingParams(max_tokens=128, temperature=0)
prompts = ["Explain how a hash map handles collisions."] # single stream first!
base = LLM(model="<base>", gpu_memory_utilization=0.8)
# ... time generate(), record ITL = elapsed / tokens ...
spec = LLM(model="<base>", gpu_memory_utilization=0.8,
speculative_config={"method": "eagle", "model": "<eagle head>",
"num_speculative_tokens": 5})
# ... same timing; then read the spec-decode metrics lines from the log
# (acceptance counts / mean acceptance length).
Three runs to do properly: (1) single stream, the headline; (2) the same prompt swapped for code generation — watch acceptance move; (3) batch 32+ — watch the speedup fade. Before each, predict the result from lab-04 with your current (α, c) estimates.
Captured output (real run, Llama-3-8B + EAGLE, A100, vLLM 0.22.1, trimmed)
baseline : ITL 18.2 ms/token (54.9 tok/s, single stream)
eagle (k=5) : ITL 9.6 ms/token (104 tok/s) ~1.9x faster
spec_decode metrics: mean acceptance length 2.8 / 5 ; acceptance rate 56%
# on highly repetitive input (code), acceptance rose to ~80% and ITL dropped further.
# at large batch (saturated GPU) the speedup shrank — less spare capacity to verify.
Reading the numbers
- Mean acceptance length 2.8 → tokens-per-cycle 3.8 (the +1 is lab-01's correction/bonus). Lab-04 sanity check: the per-position α solving (1−α⁶)/(1−α) = 3.8 is ≈ 0.81; the logged "56%" is a different denominator (accepted/proposed = 2.8/5). Two acceptance metrics, one phenomenon, and confusing them is the most common spec-decode reporting error. Always ask which one a number is.
- Predicted vs measured: lab-04 with α=0.81, c=0.05, k=5 gives 3.78 / 1.25 ≈ 3.0×; measured is 1.9×. The gap is the model's known omissions (per-cycle sampler/launch overheads, the verify pass costing slightly more than 1, drafting running serially), consistent in direction with the bias list in lab-04's notes. A model that misses by a predictable margin in a predictable direction is a working model.
- Code → 80% acceptance: sharper next-token distributions overlap more (lab-03: α = Σ min(p,q) grows as both distributions concentrate). Same reason low temperature helps. Your workload's α is a property of your traffic; measure it there.
- The fade at batch: verify rides on spare compute (Phase 0 lab-04's idle FLOPs at small batch). A saturated GPU has none: the verify pass now displaces other requests' work, and tokens-per-cycle gains stop translating into wall-clock. Spec decode is a latency tool; at full throughput it approaches a no-op (or worse, with drafting overhead). This single observation decides most deployment questions.
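The per-position α implied by a measured tokens-per-cycle can be recovered numerically from the lab-04 formula. The sketch below uses plain bisection; solve_alpha is a hypothetical helper, and c = 0.05 is the draft-cost estimate quoted earlier in this lab, not a measured value.

```python
def expected_tokens_per_cycle(alpha: float, k: int) -> float:
    """(1 - alpha^(k+1)) / (1 - alpha), with the alpha == 1 limit handled."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def solve_alpha(tokens_per_cycle: float, k: int, tol: float = 1e-9) -> float:
    """Bisection on [0, 1]; valid because the formula is increasing in alpha."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_tokens_per_cycle(mid, k) < tokens_per_cycle:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


k, c = 5, 0.05
alpha = solve_alpha(3.8, k)            # 3.8 = mean acceptance length 2.8 + 1
predicted = 3.8 / (k * c + 1.0)        # model speedup at the measured tokens/cycle
print(round(alpha, 2))                 # 0.81
print(round(predicted, 2))             # 3.04
print(round(predicted / 1.9, 2))       # 1.6: how far the model over-predicts the measured 1.9x
```

The last ratio is a handy way to carry the model's bias forward: if a later run reports a different acceptance length, dividing the new prediction by roughly the same factor gives a first guess at the measured speedup.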
Hitchhiker's notes
- k=5 is not sacred. With measured α ≈ 0.81 and c ≈ 0.05, lab-04's optimal_k lands near 8, but the curve is flat there and k=5 gives up only about 5% of the predicted speedup, so it is fine. On the prose end (α ≈ 0.5) optimal k drops to ~3, and a configured k that is too high costs latency (rejected drafts still occupy verify slots). If your acceptance metrics run low, shrinking num_speculative_tokens is the free fix nobody tries.
- EAGLE + CUDA graphs are a package deal (Phase 5 lab-04's note, now concrete): the draft head's per-token step is ~1 ms-class GPU work behind full launch overhead, so eager-mode EAGLE can lose most of its margin to Python and launches. If spec-decode numbers disappoint, check the draft path is actually captured.
- Greedy here, but the guarantee generalizes: with temperature > 0 the verify runs lab-03's rejection sampling, and outputs are distributionally identical rather than token-identical. Acceptance drops a bit (broader distributions overlap less). The metrics machinery is unchanged.
- EAGLE-2/3 and tree drafts: instead of one chain of k, draft a small tree of alternatives and verify all paths in one pass (attention masks make a tree look like a batch). Buys higher expected acceptance per verify at the cost of verify width: same economics, one more dimension. When you see speculative_config grow tree parameters, lab-04's model extends with "k" becoming "tree shape."
Reflect
- Reconcile the three acceptance numbers you now have (2.8/5 = 56%; per-position α ≈ 0.81; code ≈ 80%): write each as a formula over the same event sequence. If you can do this cold, you'll never misread a spec-decode dashboard.
- Your fleet runs batch-48 throughput-oriented summarization. EAGLE: yes or no? What measurement would change your answer? (Likely no — saturated compute; measure spare utilization headroom and p99 ITL requirements. If interactivity appears — yes for the interactive class, via a separate pool or priority.)
- The EAGLE head must match the target model. What happens operationally when you upgrade the base model checkpoint? (The head needs retraining/replacing — speculative configs add a coupled artifact to your model-rollout pipeline. Budget for it or inherit silent acceptance collapse.)
References
- Li et al., EAGLE (2024): https://arxiv.org/abs/2401.15077; EAGLE-2 (tree drafts, 2024): https://arxiv.org/abs/2406.16858
- upstream/vllm/v1/spec_decode/eagle.py: the proposer; note the hidden-state plumbing from the target's forward.
- vLLM docs, Speculative Decoding (configs and the metrics you read): https://docs.vllm.ai/en/latest/features/spec_decode/
- Labs 01/03/04 — the machine, the theorem, the economics this run validates.
Lab 08-03 — Rejection Sampling: Lossless Speculation with Temperature [CPU-OK]
Lab-01's greedy verify had an easy life: at temperature 0 there's exactly one right
token, so "accept iff the draft equals it" is obviously lossless. But production serving
samples — and now the claim that speculative decoding "doesn't change the output"
becomes a real theorem with a real proof obligation: the verified output must be
distributed exactly according to the target model's distribution p, no matter how
wrong the drafter's q is. This lab has you implement the three-line algorithm that
achieves it — accept draft x with probability min(1, p[x]/q[x]), else resample from
the residual normalize(max(p − q, 0)) — and then verify the theorem empirically:
200,000 draws through a deliberately clueless uniform drafter land on the target
distribution to within sampling noise. This is the mathematical heart of every
speculative method in vLLM, from n-gram to EAGLE.
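A minimal sketch of that accept-or-resample step, using only the standard library. The function names echo the lab's file list, but the signatures, the toy distributions, and the 50,000-draw check are assumptions for illustration (the lab's own test uses 200,000 draws and its own p, q pair).

```python
import random
from typing import List, Tuple


def residual(p: List[float], q: List[float]) -> List[float]:
    """normalize(max(p - q, 0)); falls back to p when p == q everywhere."""
    surplus = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(surplus)
    if total == 0.0:
        return list(p)
    return [s / total for s in surplus]


def speculative_token(draft: int, p: List[float], q: List[float],
                      rng: random.Random) -> Tuple[int, bool]:
    """Return (token, accepted) for one drafted position."""
    accept_prob = min(1.0, p[draft] / q[draft])   # q[draft] > 0: it was drawn from q
    if rng.random() < accept_prob:
        return draft, True
    token = rng.choices(range(len(p)), weights=residual(p, q), k=1)[0]
    return token, False


rng = random.Random(0)
p = [0.6, 0.3, 0.1]          # target distribution
q = [1 / 3, 1 / 3, 1 / 3]    # a clueless uniform drafter
counts = [0, 0, 0]
n = 50_000
for _ in range(n):
    draft = rng.choices(range(3), weights=q, k=1)[0]
    token, _ = speculative_token(draft, p, q, rng)
    counts[token] += 1
print([round(c / n, 3) for c in counts])   # close to p, despite the uniform drafter
```

The drafter's distribution appears only inside the accept ratio and the residual, which is why swapping in a better q changes how often the first branch fires but not what the histogram converges to.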
Contents
- Why this lab exists
- Background: why the algorithm works
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
"Speculative decoding is lossless" is repeated everywhere and understood almost nowhere
— most explanations stop at the greedy case, leaving the sampled case as folklore. But
the sampled case is where the engineering risk lives: a subtly wrong residual, a missing
clamp, a normalization slip — and your serving system is quietly sampling from a
distribution that is not the model's, a bug invisible to every output-equality test
(each individual output is plausible!) and detectable only distributionally. The
defense, which you'll build, is the statistical test: histogram many draws, compare to
p. If you ever touch rejection_sampler.py upstream — and spec-decode PRs touch it
constantly — this lab's test design is how you protect the change.
The second deliverable is the acceptance-rate formula Σ min(p, q) — the overlap of
the two distributions (equivalently 1 − total-variation distance). It converts "is the
drafter any good?" from vibes into one number per position, and it's the alpha that
lab-04's economics run on. Drafter evaluation, acceptance metrics in vLLM's logs,
temperature's effect on speedup — all read off this one quantity.
Background: why the algorithm works
The output token's probability decomposes into "accepted draft" + "residual resample":
P(output = x) = q[x]·min(1, p[x]/q[x]) + P(reject)·residual[x]
= min(p[x], q[x]) + (1 − Σ min(p,q)) · max(p[x]−q[x],0) / Σ max(p−q,0)
Since Σ max(p−q, 0) = 1 − Σ min(p, q) (the surplus equals the deficit — both are the
TV distance), the second term simplifies to max(p[x] − q[x], 0), and:
P(output = x) = min(p[x], q[x]) + max(p[x] − q[x], 0) = p[x] ∎
Read the proof's shape: where the draft over-serves (q > p), acceptance is throttled
by exactly the ratio; where it under-serves (q < p), drafts always pass and the
residual makes up precisely the shortfall. The two errors cancel by construction, not by
luck — which is why the result holds for any q, including adversarially bad ones
(your test_disjoint_distributions_never_accept: zero overlap, everything rejected,
output still exactly p — spec decode degrades to baseline speed, never to wrongness.
That graceful-degradation property is what makes it safe to deploy aggressively).
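The identity can also be checked exactly, without sampling, by evaluating the two terms of the decomposition with rational arithmetic. This is a sketch: output_distribution is a hypothetical helper and the example distributions are made up, but the computation follows the derivation above term by term.

```python
from fractions import Fraction
from typing import List


def output_distribution(p: List[Fraction], q: List[Fraction]) -> List[Fraction]:
    """Exact P(output = x) under accept-or-resample, term by term."""
    surplus = [max(pi - qi, Fraction(0)) for pi, qi in zip(p, q)]
    total_surplus = sum(surplus)
    p_reject = 1 - sum(min(pi, qi) for pi, qi in zip(p, q))
    out = []
    for pi, qi, si in zip(p, q, surplus):
        # Accepted-draft term: q[x] * min(1, p[x]/q[x]) (zero if never drafted).
        accepted = qi * min(Fraction(1), pi / qi) if qi > 0 else Fraction(0)
        # Residual term: P(reject) * residual[x].
        resampled = p_reject * si / total_surplus if total_surplus > 0 else Fraction(0)
        out.append(accepted + resampled)
    return out


p = [Fraction(6, 10), Fraction(3, 10), Fraction(1, 10)]
q = [Fraction(1, 3)] * 3
print(output_distribution(p, q) == p)        # True

# Disjoint supports: every draft is rejected, the output is still exactly p.
p2, q2 = [Fraction(1), Fraction(0)], [Fraction(0), Fraction(1)]
print(output_distribution(p2, q2) == p2)     # True
```

Using Fraction instead of float makes the equality exact, so the check confirms the algebra rather than an approximation of it.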
Files
- starter.py: accept_prob, residual_distribution, speculative_token, expected_acceptance_rate. Your work.
- solution.py: reference.
- test_lab.py: the formula edges, the residual, the empirical theorem (200k draws), the identical-distribution and disjoint-distribution limits, and the overlap formula against simulation.
Run
LAB_IMPL=starter pytest phase-08-speculative-decoding/labs/lab-03-rejection-sampling -q
pytest phase-08-speculative-decoding/labs/lab-03-rejection-sampling -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_accept_prob_formula | The two regimes: p ≥ q → always accept; p < q → the exact ratio |
| test_residual_is_the_renormalized_surplus | The fallback distribution, value by value: get this wrong and the theorem dies silently |
| test_output_distribution_is_exactly_the_target | The theorem, empirically: uniform drafter, skewed target, 200k draws, histogram ≈ p within 0.005. This is the test that catches the silent bug class |
| test_identical_distributions_always_accept | The q = p limit: overlap 1, acceptance 1. A perfect drafter is never rejected (and the p == q residual edge case stays well-defined) |
| test_acceptance_rate_is_the_overlap | Σ min(p,q) = 0.70 for the lab's pair, confirmed by simulating the accept branch alone |
| test_disjoint_distributions_never_accept | The adversarial limit: zero overlap → pure residual → still exactly p. Wrongness is impossible; only speed is at stake |
The statistical tolerance (atol=0.005 at N=200k) is calibrated, not hand-waved: the
binomial standard error at p=0.5 is √(0.25/200000) ≈ 0.0011, so 0.005 is ~4.5σ —
tight enough to catch any real implementation error, loose enough to never flake. When
you write distributional tests (and after this lab, you will), do this arithmetic.
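The tolerance arithmetic above is short enough to keep next to any distributional test. A minimal sketch follows; the helper names are hypothetical and not part of the lab.

```python
import math


def binomial_std_error(p: float, n: int) -> float:
    """Standard error of an empirical frequency with true probability p over n draws."""
    return math.sqrt(p * (1.0 - p) / n)


def tolerance_in_sigmas(atol: float, n: int, p: float = 0.5) -> float:
    """How many standard errors an absolute tolerance spans (p=0.5 is the worst case)."""
    return atol / binomial_std_error(p, n)


if __name__ == "__main__":
    print(f"{binomial_std_error(0.5, 200_000):.4f}")      # 0.0011
    print(f"{tolerance_in_sigmas(0.005, 200_000):.2f}")   # 4.47
```

Evaluating at p=0.5 is the conservative choice, because p(1 − p) peaks there and every other histogram bin has a smaller standard error.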
Hitchhiker's notes
- Greedy verify is this algorithm's zero-temperature limit: as temperature → 0, p and q collapse toward one-hots; min(1, p[x]/q[x]) becomes "1 if the argmaxes match, else 0", and the residual becomes the target's argmax. Lab-01 was a special case all along: upstream's RejectionSampler has the explicit greedy fast path (rejection_sampler.py:87) for exactly this case, because comparing argmaxes is cheaper than the full machinery.
- Multi-token drafts chain this per position: verify token 1 against p₁; if accepted, token 2 against p₂ (computed with token 1 in context; the target's one batched forward scored all positions); first rejection stops the chain and resamples from that position's residual. The i.i.d.-ish per-position acceptance is the alpha lab-04 models. Crucially, all k+1 target distributions came from one forward pass; that batching is the entire economic basis (lab-04's cost = k·c + 1).
- Where the probabilities come from matters: p and q here are post-temperature, post-top-p distributions. The verifier must apply the same sampling-parameter pipeline (Phase 0 lab-03) to both models' logits, or the ratio compares apples to oranges. Sampling-parameter mismatches between draft and target paths are a real upstream bug category; now you know what they corrupt.
- The same trick generalizes: speculative sampling is importance-sampling-flavored rejection sampling with a guaranteed-exact fallback, and variants (tree drafts with multiple candidates per position, typical acceptance in Medusa/EAGLE-2) bend the acceptance rule. The lossless variants preserve the distributional identity; typical acceptance deliberately relaxes it in exchange for more accepted tokens. When reading any new spec-decode paper, find its version of this lemma first; everything else is scheduling.
Going further
- Implement chained multi-token verification (speculative_sequence(p_list, q_list, k, rng)) and verify the joint distribution of two-token outputs matches sequential target sampling: the full lossless claim, one level up.
- Measure acceptance vs temperature: fix logits, sweep T ∈ {0.2, 0.7, 1.0, 1.5} for both models, plot Σ min(p,q). Sharp distributions overlap more → spec decode loves low temperature. Connect this to lab-02's "code accepts at 80%" observation.
- Break it on purpose: skip the min(1, ·) clamp (accept with raw p/q… capped how?) or forget to renormalize the residual, and watch which test catches each. Knowing the failure signatures is half the review skill.
References
- Leviathan et al., Fast Inference from Transformers via Speculative Decoding (2022) — the theorem (their Theorem 3.1 / Appendix A): https://arxiv.org/abs/2211.17192
- Chen et al., Accelerating Large Language Model Decoding with Speculative Sampling (2023) — the same result, DeepMind flavor: https://arxiv.org/abs/2302.01318
- upstream/vllm/v1/sample/rejection_sampler.py: the production implementation. Find the ratio, the residual, and the greedy fast path (:87).
- Lab-04: what Σ min(p,q) is worth in milliseconds; lab-01: the zero-temperature special case you already built.
Lab 08-04 — The Speculative-Decoding Speedup Model [CPU-OK]
Three functions, maybe fifteen lines — and at the end of them you can answer, with
arithmetic, the questions that decide whether speculative decoding ships: How much
faster, given my drafter's acceptance rate? What draft length k should I configure? And
when does spec decode make things worse? (It can. test_spec_decode_can_lose proves a
mediocre drafter at realistic cost is a net loss, and optimal_k tells you the right
configuration is zero.) This is the expected-speedup model from the original paper,
the same arithmetic behind vLLM's num_speculative_tokens default debates — and after
this lab, behind your config choices instead of your hopes.
Contents
- Why this lab exists
- Background: the three-parameter economy
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Speculative decoding is the most conditionally valuable optimization in this course:
transformative for a single latency-bound stream with a sharp drafter, worthless — or
harmful — for a saturated-throughput deployment with a dull one. Teams burn weeks
discovering this empirically because they treat "enable spec decode" as a boolean
instead of an equation. The equation has three inputs you can measure independently —
acceptance rate alpha (lab-03's overlap, read from vLLM's spec-decode metrics), draft
cost c (drafter step time / target step time, read from a profile), and draft length
k (the config knob) — and one output. This lab makes you fluent in it, including its
honest failure regions.
It's also a clean specimen of modeling discipline: the formula assumes i.i.d.
per-position acceptance, which is false in detail (acceptance correlates within a
phrase) but accurate in expectation — and test_expected_tokens_matches_simulation
shows the model agreeing with a 200k-cycle simulation to 1%. Knowing how to validate a
simplification is worth more than distrusting all simplifications.
Background: the three-parameter economy
One speculation cycle: draft k tokens (cost k·c), then one target forward verifies
all of them in a single batch (cost ≈ 1 — this batching is the entire trick; the
verify scores k+1 positions for the price of one decode step because prefill-shaped
work is compute-cheap, Phase 0 lab-04). The cycle emits the leading run of accepted
tokens plus one (the correction on first rejection, or the bonus token when everything
passes — lab-01's verify_greedy mechanics):
E[tokens/cycle] = 1 + α + α² + … + α^k = (1 − α^(k+1)) / (1 − α)
speedup(α, k, c) = E[tokens/cycle] / (k·c + 1)
Two structural facts fall out before you compute anything. Diminishing returns in k:
the i-th draft token only counts if all before it accepted, so its marginal value is
α^i — geometrically decaying, while its cost c is constant; past some k every extra
token is negative-margin (hence optimal_k). The ceiling: even free drafts can't
beat 1/(1−α) tokens per cycle — at α=0.7 that's 3.3×, at α=0.5 it's 2× — so chasing
speedup beyond the ceiling means improving the drafter, not the config.
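The two formulas above translate almost directly into code. The sketch below is one possible reading, not the lab's reference: the function names come from starter.py, while the signatures and the k_max search bound are assumptions.

```python
def expected_tokens_per_verify(alpha: float, k: int) -> float:
    """E[tokens/cycle] = 1 + alpha + ... + alpha^k under i.i.d. acceptance."""
    if alpha >= 1.0:
        return float(k + 1)  # closed form divides by zero at alpha = 1
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, k: int, c: float) -> float:
    """Tokens per cycle divided by cycle cost (k drafts at cost c, one verify at cost 1)."""
    return expected_tokens_per_verify(alpha, k) / (k * c + 1.0)


def optimal_k(alpha: float, c: float, k_max: int = 16) -> int:
    """Draft length in [0, k_max] with the best speedup; k = 0 means 'turn it off'."""
    # max() returns the first maximum, so ties resolve to the smaller k.
    return max(range(k_max + 1), key=lambda k: speedup(alpha, k, c))


if __name__ == "__main__":
    print(speedup(0.7, 5, 0.05))   # EAGLE-like point, a bit above 2.3
    print(speedup(0.2, 5, 0.5))    # mediocre drafter at real cost, below 1
    print(optimal_k(0.2, 0.5))     # 0
```

Note that k = 0 always yields a speedup of exactly 1, so optimal_k can never recommend a configuration that the model predicts to be slower than baseline.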
Files
- starter.py: expected_tokens_per_verify, speedup, optimal_k. Your work.
- solution.py: reference.
- test_lab.py: formula edges, the simulation check, the free-drafter bound, the losing regime, EAGLE-ballpark numbers, and both monotonicities of optimal_k.
Run
LAB_IMPL=starter pytest phase-08-speculative-decoding/labs/lab-04-speedup-model -q
pytest phase-08-speculative-decoding/labs/lab-04-speedup-model -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_expected_tokens_formula | The geometric series and both edges: α=0 → 1 (corrections only; spec decode never emits less than baseline per cycle), α=1 → k+1 |
| test_expected_tokens_matches_simulation | The i.i.d. model vs 200k simulated cycles: < 1% off. The simplification, validated |
| test_free_drafter_always_wins | c=0 (the n-gram drafter, lab-01): speedup = E[tokens] ≥ 1. Free drafts can't lose, which is why prompt-lookup ships as a default-safe option |
| test_spec_decode_can_lose | α=0.2, c=0.5, k=5 → speedup < 1, and optimal_k = 0. The model's most valuable output is "turn it off" |
| test_eagle_like_numbers | α≈0.7, c≈0.05, k=5 → ~2.3×, the ballpark real EAGLE deployments report (lab-02's measured 1.9× at k=5 sits right here once you account for overheads the model omits) |
| test_optimal_k_grows_with_alpha_and_shrinks_with_cost | The two intuitive monotonicities, made checkable |
| test_diminishing_returns_in_k | The marginal value of draft token i decays geometrically, which is why k=5-ish is so common and k=20 is almost never right |
Hitchhiker's notes
- What the model omits, and which way each omission points: verify cost grows (slightly) with k (real cost is k·c + (1+εk)), pushing optimal k down; the drafter and target compete for the GPU at high batch, and at saturation the "spare compute" funding the whole scheme disappears (lab-02's shrinking-win observation), pushing value down; per-cycle fixed overheads (extra kernel launches, sampler work) hurt small-α configs most. The model is an upper bound with known biases, the most useful kind.
- The same arithmetic prices drafters against each other: n-gram (α low on prose, high on code; c ≈ 0), EAGLE (α ≈ 0.6–0.8; c ≈ 0.03–0.08, a one-layer head), a half-size draft model (α high; c ≈ 0.2–0.5, usually dominated by EAGLE on this math, which is why standalone draft models faded). Three points on a (α, c) plane; speedup is the contour map. Plot it once and the drafter literature organizes itself.
- α is workload-dependent, so the right k is too: code and structured output accept far better than creative prose (lab-02 measured 80% vs ~56%). A deployment serving both has no single optimal k, which is why dynamic/adaptive speculation (adjusting k per request from rolling acceptance) is an active upstream direction. The model you just built is the controller's objective function.
- Where to read your fleet's α: vLLM's spec-decode metrics (acceptance counts per position, and mean acceptance length, the 2.8 / 5 in lab-02's capture). Mean acceptance length ≈ E[tokens/cycle] − 1; invert your formula and you can back out the effective α from production logs. Do it before and after a drafter upgrade and you have the business case in two numbers.
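Inverting the formula has no tidy closed form for general k, but the mean accepted length is monotone in α, so bisection is enough. A minimal sketch, assuming the i.i.d. model; mean_accepted and effective_alpha are hypothetical helper names, and the 2.8-of-5 input simply reuses the figure quoted above.

```python
def mean_accepted(alpha: float, k: int) -> float:
    """Expected accepted draft tokens per cycle: alpha + alpha^2 + ... + alpha^k."""
    return sum(alpha ** i for i in range(1, k + 1))


def effective_alpha(mean_len: float, k: int, tol: float = 1e-9) -> float:
    """Back out the i.i.d. acceptance rate that yields a given mean accepted length."""
    if not 0.0 <= mean_len <= k:
        raise ValueError("mean accepted length must lie in [0, k]")
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # mean_accepted is increasing in alpha, so bisection converges.
        if mean_accepted(mid, k) < mean_len:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Effective alpha implied by a mean accepted length of 2.8 at k = 5.
    print(effective_alpha(2.8, 5))
```

The result is an effective α: it summarizes whatever correlated acceptance pattern produced the logs as the single i.i.d. rate with the same mean, which is exactly the quantity the speedup model consumes.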
Going further
- Plot speedup vs k for α ∈ {0.3, 0.5, 0.7, 0.9} at c = 0.05: watch the maximum slide right and up with α. Add c = 0.3 curves and watch speculation die for low α. This one figure is the deployment decision.
- Add the batch-saturation term: model verify cost as 1 + λ·k where λ grows with batch utilization, and find the utilization where optimal_k hits 0. You've then derived "spec decode is a latency tool, not a throughput tool" instead of memorizing it.
- Replace i.i.d. α with a two-state model (in-phrase α_high, at-boundary α_low) and re-derive E[tokens], then check whether the i.i.d. fit to the mean still predicts well. (It does, mostly; means are forgiving. Tail latency per cycle is not; explore the variance.)
References
- Leviathan et al., Fast Inference from Transformers via Speculative Decoding (2022) — §3.1 is this lab's formula: https://arxiv.org/abs/2211.17192
- Li et al., EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (2024) — where the (α≈0.7, c≈0.05) point comes from: https://arxiv.org/abs/2401.15077
- upstream/vllm/v1/spec_decode/: proposer implementations; the metrics module that exports your α.
- vLLM docs, Speculative Decoding (num_speculative_tokens and method configs): https://docs.vllm.ai/en/latest/features/spec_decode/
- Lab-03: where α comes from (Σ min(p,q)); lab-02: the measured numbers this model predicts; Phase 0 lab-04: why the batched verify costs ~1.
Phase 08 — Exercises: Speculative Decoding
Warm-up (explain)
1. In one breath: how does speculative decoding produce several tokens from one big-model run?
2. Why does verifying k drafted tokens cost about the same as one decode step? (Tie to prefill.)
3. Why is the output identical to normal decoding (greedy case)?
Core (trace the code)
4. NgramProposer.propose (ngram_proposer.py:131): what does it match on, and what does it return? Why is it great for code/summarization?
5. In rejection_sample (rejection_sampler.py:392), state the accept probability and what happens on rejection. Why does this preserve the target distribution?
6. In scheduler.py, how do num_lookahead_tokens and num_tokens_with_spec let spec decode ride the normal schedule with no special case?
Build (your lab)
7. In lab-01, derive expected tokens-per-run from acceptance rate a and draft length k (hint: it's 1 + (accepted before first reject)).
8. Add a k sweep: plot tokens-per-run vs k on the periodic target. Why does it plateau?
9. Construct an input where n-gram hurts (proposals never accepted): show runs == baseline and explain the wasted draft cost.
Design (staff-level)
10. Given target step cost C_t, draft cost C_d, and acceptance a, write the condition for spec decode to be a net win. When does large batch flip it negative?
11. A customer's workload is 70% code (repetitive) and 30% chat (creative). Would you enable spec decode globally, per-request, or adaptively? Justify.
12. EAGLE vs n-gram: when would you pick each, and what does EAGLE need that n-gram doesn't?
13. Spec decode interacts with the KV cache (drafts need slots): what must the scheduler do on rejection, and what's the memory risk?
Self-grading
4–6 and 10–13 are interview-grade. Could you whiteboard draft→verify and the win condition? If not, re-read 01-deep-dive.md.
Phase 08 — Interview Questions: Speculative Decoding
Q1. How does speculative decoding speed up decode?
Model answer
A cheap drafter proposes the next k tokens; the big model verifies all of them in one forward run (a mini-prefill: decode is memory-bound, so scoring a few extra positions in the same pass costs about one decode step) and keeps the longest correct prefix plus one correction. So one expensive run yields multiple tokens instead of one. With per-token acceptance rate α and relative draft cost c, the expected speedup is (1 − α^(k+1)) / (1 − α) tokens per cycle divided by a cycle cost of k·c + 1, so acceptance rate and drafter cost decide the win, not draft length alone.
Q2. Why doesn't it change the model's output?
Model answer
Greedy: you only accept a drafted token if it equals the big model's argmax; on disagreement you
discard the rest and use the big model's token — so the sequence is identical to plain greedy.
Sampling: the rejection sampler accepts token i with probability min(1, p_target/p_draft) and, on
rejection, resamples from normalize(max(0, p_target − p_draft)); the math guarantees the accepted
tokens follow the target's exact distribution. Speed changes, behavior doesn't.
Q3. When is it a win, and when does it hurt?
Model answer
Win when accepted-tokens-per-run × target-step-cost exceeds the cost of drafting plus the extra verify work — i.e. high acceptance and a cheap drafter, in latency-bound (small-batch) regimes with spare GPU capacity. It can lose at low acceptance (creative text, weak drafter) or at large batch where the GPU is already saturated and verifying drafts steals capacity from real work.
Q4. What proposers exist and how do they differ?
Model answer
n-gram / prompt-lookup (free, copies a repeated phrase's continuation — great for code/structured text); EAGLE (a small trained head predicting the target's next hidden states — high acceptance on general text); Medusa (extra heads), DFlash, suffix decoding, and a separate small draft model. All plug into the same verify path; they trade drafter cost vs acceptance quality.
Q5. How does spec decode fit vLLM's scheduler without a special case?
Model answer
A request's num_tokens_with_spec includes the draft tokens, so the standard num_new_tokens clamp
schedules them; num_lookahead_tokens reserves KV slots for them. After the run, the rejection
sampler decides accept/reject and update_from_output keeps the accepted prefix and rolls back the
rest. The scheduler just sees "a few more tokens in the gap" — the Phase 3 abstraction absorbs the
whole feature.
Rapid-fire
- Verify cost ≈ ? one decode/prefill run (processes k+context together).
- Output change? none (greedy: argmax-only accept; sampling: rejection sampling).
- Deciding metric? acceptance rate.
- Free proposer? n-gram / prompt-lookup. Best trained one (today)? EAGLE.
- Scheduler hooks? num_tokens_with_spec, num_lookahead_tokens.
Phase 08 — Cheatsheet: Speculative Decoding
The one-liner
Cheap drafter guesses k next tokens → big model verifies all in ONE run → keep the longest correct prefix + 1 correction. Several tokens per expensive run; output identical to normal decoding.
Why it works
Verification = a mini-prefill (many tokens scored in one pass), so checking k tokens ≈ one decode run. Speedup rises with acceptance rate α but saturates: E[tokens/run] = (1 − α^(k+1)) / (1 − α) ≤ 1/(1 − α).
Correctness
- Greedy: accept only the target's argmax → identical output.
- Sampling: rejection sampling. Accept w.p. min(1, p_target/p_draft), resample from normalize(max(0, p_target−p_draft)) on reject → exact target distribution.
Proposers
n-gram/prompt-lookup (free; great for repetitive/code) · EAGLE (trained head, predicts hidden states; best general) · Medusa · DFlash · suffix · small draft model.
Win/lose
Win: high acceptance, cheap drafter, small batch (spare capacity). Lose: low acceptance, or large
batch (GPU already saturated). Condition: accepted/run × C_target > C_draft + extra_verify.
Rides the scheduler
num_tokens_with_spec adds drafts to the gap; num_lookahead_tokens reserves KV; rejection result
applied in update_from_output. No special scheduler path.
Key upstream
v1/spec_decode/ ngram_proposer.py:12/:131 · eagle.py:10 · medusa.py · dflash.py · suffix_decoding.py
v1/sample/rejection_sampler.py :37 RejectionSampler · :87 forward · :392 rejection_sample
v1/spec_decode/metrics.py (acceptance) · scheduler.py (spec_token_ids / num_lookahead_tokens)
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md
Phase 09 — The Hitchhiker's Guide to Sampling & Decoding Algorithms
Contents
- Don't Panic
- Step 1: The knobs, from sharpest to softest
- Step 2: Penalties and logit bias
- Step 3: The batching challenge (why this is a systems problem)
- Step 4: Parallel sampling and beam search
- The invariants to memorize
- What you'll do
Don't Panic
The model gives you logits — a score for every token in the vocabulary. You still have to pick one. How you pick is the decoding algorithm: greedy, temperature, top-k, top-p, penalties, parallel sampling, beam search. In a batched engine, all of these must run together, vectorized, on the GPU, for many requests that each chose different settings. This phase is that machinery — small, on the critical path of every single step, and the hook structured output (Phase 12) and penalties plug into.
logits (vocab,) per request
│ logits processors: penalties, bias, bad-words, grammar mask (Phase 12)
│ temperature scaling
│ top-k / top-p / min-p truncation
▼
probability distribution ──sample──► next token id
Step 1: The knobs, from sharpest to softest
- Greedy (temperature 0): always take the argmax. Deterministic. (mini_vllm's default.)
- Temperature T: divide logits by T before softmax. T<1 sharpens (more confident), T>1 flattens (more random). T→0 → greedy.
- Top-k: keep only the k highest-probability tokens, renormalize, sample from those.
- Top-p (nucleus): keep the smallest set of tokens whose cumulative probability ≥ p. Adaptive: few tokens when the model is confident, many when it's unsure.
- Min-p: keep tokens with probability ≥ min_p × max_prob. A confidence-relative cutoff.
These compose in a fixed order (penalties → temperature → top-k → top-p/min-p → sample), which
you'll implement in lab-01.
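For orientation, the knobs and their order can be sketched for a single request in plain Python. This is an illustrative sketch under simplifying assumptions, not lab-01's reference: penalties are omitted, ties at the top-k threshold are all kept, and the helper names and signatures are hypothetical.

```python
import math
import random

NEG_INF = float("-inf")


def softmax(logits):
    """Numerically stable softmax; masked (-inf) entries get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_top_k(logits, k):
    """Mask everything below the k-th largest logit (k <= 0 disables)."""
    if k <= 0 or k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else NEG_INF for x in logits]


def apply_top_p(logits, top_p):
    """Keep the smallest prefix of the sorted tokens whose mass reaches top_p."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return [x if i in keep else NEG_INF for i, x in enumerate(logits)]


def apply_min_p(logits, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    probs = softmax(logits)
    cutoff = min_p * max(probs)
    return [x if p >= cutoff else NEG_INF for x, p in zip(logits, probs)]


def sample(logits, temperature=1.0, top_k=0, top_p=1.0, min_p=0.0, rng=None):
    """temperature -> top-k -> top-p -> min-p -> draw (penalties omitted here)."""
    if temperature == 0.0:
        # Greedy: argmax, no randomness involved.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    scaled = apply_top_k(scaled, top_k)
    scaled = apply_top_p(scaled, top_p)
    scaled = apply_min_p(scaled, min_p)
    return rng.choices(range(len(logits)), weights=softmax(scaled))[0]
```

Each truncation step recomputes probabilities from the already-masked logits, which is why the order of the stages changes the result rather than being a cosmetic choice.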
Step 2: Penalties and logit bias
Before sampling you can edit the logits:
- Repetition / frequency / presence penalty: lower the score of tokens already generated, to reduce loops. Frequency scales with count; presence is a flat penalty for any prior occurrence.
- Logit bias: add/subtract a fixed amount for specific token ids (force or ban words).
- Bad words / stop: hard-ban sequences.
All of these are logits processors — the pluggable pre-sampling hook. That same hook is how
structured output (Phase 12) masks illegal tokens to -inf. One clean abstraction, many uses.
Step 3: The batching challenge (why this is a systems problem)
In one decode step you might have 256 requests, each with its own temperature, top-p, penalties.
You can't loop in Python (too slow on the hot path). So vLLM packs per-request params into tensors
aligned with the batch and applies vectorized, branch-free masked ops — every row uses its own
settings in one kernel. That's the real engineering: not the math of top-p, but doing top-p for a
heterogeneous batch in one GPU pass (vllm/v1/sample/ops/topk_topp_sampler.py).
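The "per-request params packed into tensors" idea can be shown on CPU with NumPy. The sketch below is a toy stand-in, not vLLM's kernel: it handles only temperature and greedy rows, treats temperature 0 as the greedy marker, and the function name sample_batch is hypothetical.

```python
import numpy as np


def sample_batch(logits, temperature, rng):
    """Sample one token per row; each row uses its own temperature (0 = greedy).

    logits: (batch, vocab) float array; temperature: (batch,) float array.
    """
    greedy = temperature == 0.0
    # Avoid dividing by zero on greedy rows; their sampled value is discarded below.
    safe_t = np.where(greedy, 1.0, temperature)
    scaled = logits / safe_t[:, None]
    scaled = scaled - scaled.max(axis=1, keepdims=True)
    probs = np.exp(scaled)
    probs /= probs.sum(axis=1, keepdims=True)
    # Inverse-CDF sampling for all rows at once: one uniform draw per row.
    u = rng.random(logits.shape[0])
    cdf = np.cumsum(probs, axis=1)
    sampled = (cdf <= u[:, None]).sum(axis=1)
    sampled = np.minimum(sampled, logits.shape[1] - 1)  # guard against rounding
    # Row-wise select between the greedy and the random path, with no Python loop.
    return np.where(greedy, logits.argmax(axis=1), sampled)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = np.array([[0.0, 5.0, 1.0], [2.0, 0.0, 0.0], [0.5, 0.2, 0.1]])
    temperature = np.array([0.0, 0.0, 1.3])  # two greedy rows, one random row
    print(sample_batch(logits, temperature, rng))
```

Both paths are computed for every row and the mask picks one, which wastes a little arithmetic on greedy rows but keeps the whole batch in a handful of array operations.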
Step 4: Parallel sampling and beam search
- Parallel sampling (n>1): produce N independent completions for one prompt. The prompt is processed once; the N samples share the prompt's KV blocks via prefix caching (Phase 2/3) and diverge only after the first sampled token. A beautiful reuse of paging.
- Beam search: keep the top-N partial sequences by cumulative log-prob, expanding and pruning each step. It's awkward in a continuous-batching engine (beams branch and die, changing the batch shape), so vLLM handles it specially rather than as plain sampling.
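The expand-and-prune loop of beam search fits in a dozen lines. The sketch below is illustrative only: the probability table is a made-up garden-path fixture (not lab-03's), there is no EOS handling or length normalization, and next_probs is a hypothetical stand-in for a model call.

```python
import math

# Hypothetical two-step "model": next-token probabilities given the prefix.
TABLE = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"C": 0.55, "D": 0.45},
    ("B",): {"C": 0.9, "D": 0.1},
}


def next_probs(prefix):
    """Look up the toy next-token distribution for a prefix."""
    return TABLE[tuple(prefix)]


def beam_search(next_probs, width, steps):
    """Keep the `width` best partial sequences by cumulative log-probability."""
    beams = [(0.0, [])]  # (cumulative log-prob, tokens)
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            for token, prob in next_probs(tokens).items():
                candidates.append((score + math.log(prob), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:width]  # prune
    return beams


def greedy(next_probs, steps):
    """Greedy decoding is beam search with width 1."""
    return beam_search(next_probs, 1, steps)[0]


if __name__ == "__main__":
    g_score, g_tokens = greedy(next_probs, 2)
    b_score, b_tokens = beam_search(next_probs, 2, 2)[0]
    print(g_tokens, round(math.exp(g_score), 2))  # ['A', 'C'] 0.33
    print(b_tokens, round(math.exp(b_score), 2))  # ['B', 'C'] 0.36
```

Greedy commits to A because it is locally best, then can reach at most 0.33; a width-2 beam keeps B alive for one more step and finds the 0.36 sequence.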
The invariants to memorize
- Order: penalties/bias → temperature → top-k → top-p/min-p → sample.
- Greedy = temperature 0 = argmax (deterministic).
- Everything is vectorized across a heterogeneous batch with per-row params.
- Logits processors are the pre-sampling hook (penalties, bias, grammar masks).
- n>1 shares the prompt's KV via prefix caching; beam search is the special case.
What you'll do
- Read: 01-deep-dive.md (Sampler.forward, the top-k/top-p ops, penalties, and the logits-processor framework, line-anchored).
- Build: 02-mini-build.md (add min-p, repetition penalty, and a logits-processor pipeline).
- Labs (see labs/README.md; recommended order 01 → 04 → 03 → 02):
  - lab-01-sampling-ops [CPU-OK]: implement temperature/top-k/top-p/min-p + repetition penalty + a logits-processor hook; pin their effects with tests.
  - lab-02-parallel-sampling [GPU-OPT]: run n>1 on real vLLM; observe shared prompt KV (captured output).
  - lab-03-beam-search [CPU-OK]: build beam search and spring the garden-path trap where greedy provably loses; EOS-finishes-a-beam; why V1 evicted beams from the engine core.
  - lab-04-seeded-rng-batch-invariance [CPU-OK]: per-request generators. Prove a seeded request's tokens survive any batch neighbors (and watch the shared-RNG version fail).
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
Phase 09 — Deep Dive: the batched sampler
Paths relative to upstream/ at v0.22.1 @ 0decac0.
- vllm/v1/sample/sampler.py: the batched Sampler (orchestrates the pipeline)
- vllm/v1/sample/ops/topk_topp_sampler.py: vectorized top-k/top-p
- vllm/v1/sample/ops/penalties.py: repetition/frequency/presence penalties
- vllm/v1/sample/ops/bad_words.py: banned sequences
- vllm/v1/sample/logits_processor/: the pluggable pre-sampling hook (builtin/interface/state)
- vllm/v1/sample/metadata.py: SamplingMetadata, per-request params packed into tensors
- vllm/sampling_params.py: SamplingParams (the user-facing knobs)
Contents
- 1. SamplingParams — the knobs
- 2. Sampler.forward — the pipeline
- 3. Vectorized top-k/top-p
- 4. Penalties
- 5. Logits processors — the hook everything uses
- 6. Parallel sampling
- Reading checklist
1. SamplingParams — the knobs
vllm/sampling_params.py:168 — class SamplingParams. Fields: temperature (:205, default
1.0), top_p (:209), top_k, min_p, penalties, n (parallel samples), seed, logit_bias,
max_tokens, stop conditions. mini_vllm/sampler.py:SamplingParams is a faithful subset
(temperature/top_k/top_p/seed/max_tokens/ignore_eos).
2. Sampler.forward — the pipeline
vllm/v1/sample/sampler.py:20 class Sampler, :67 def forward. Read it; the order is the
guide's pipeline:
- logits processors / penalties edit the logits (repetition penalty, bad-words, logit bias, and the structured-output grammar mask from Phase 12).
- apply_temperature (:223) divides by per-request temperature.
- top-k / top-p truncation (ops/topk_topp_sampler.py).
- sample (:238) draws the token (argmax for greedy rows, multinomial for the rest).
The crucial detail: every step operates on the whole batch at once, with per-request params
read from SamplingMetadata (metadata.py) — tensors aligned to the batch. Greedy and random
requests coexist in one call; greedy rows are handled as a temperature→argmax path. There is no
Python per-request loop on the hot path — that's the systems win.
3. Vectorized top-k/top-p
vllm/v1/sample/ops/topk_topp_sampler.py — applies top-k and top-p across the batch with masked
sorts/cumsums, each row using its own k/p. (There's also a Triton variant,
topk_topp_triton.py, for speed.) Your mini_vllm/sampler.py _apply_top_k/_apply_top_p do
the single-row version; the real challenge is doing it for 256 different (k,p) at once without
branching.
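One way to picture "masked sorts/cumsums" is the NumPy sketch below, which applies a different (k, p) to every row without a per-row branch. It illustrates the technique only and is not a transcription of topk_topp_sampler.py; the function name, the "0 disables top-k" convention, and the choice to apply top-p to the top-k-masked distribution are assumptions of this sketch.

```python
import numpy as np


def apply_top_k_top_p(logits, top_k, top_p):
    """Mask logits row-wise with per-row top_k (0 disables) and top_p in (0, 1].

    logits: (batch, vocab) floats; top_k: (batch,) ints; top_p: (batch,) floats.
    """
    batch, vocab = logits.shape
    order = np.argsort(-logits, axis=1, kind="stable")  # descending per row
    sorted_logits = np.take_along_axis(logits, order, axis=1)

    # Top-k as a rank comparison: rank >= k is dropped; k = vocab keeps everything.
    k = np.where(top_k > 0, top_k, vocab)
    ranks = np.arange(vocab)[None, :]
    drop = ranks >= k[:, None]

    # Top-p on what survived top-k: drop a token once the mass before it reaches p.
    masked = np.where(drop, -np.inf, sorted_logits)
    probs = np.exp(masked - masked.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    mass_before = np.cumsum(probs, axis=1) - probs
    drop |= mass_before >= top_p[:, None]

    # Scatter the masked values back to their original vocabulary positions.
    masked = np.where(drop, -np.inf, sorted_logits)
    out = np.empty_like(masked)
    np.put_along_axis(out, order, masked, axis=1)
    return out


if __name__ == "__main__":
    logits = np.log(np.array([[0.5, 0.3, 0.2], [0.5, 0.3, 0.2], [0.2, 0.5, 0.3]]))
    top_k = np.array([1, 0, 0])
    top_p = np.array([1.0, 0.7, 1.0])
    print(np.isfinite(apply_top_k_top_p(logits, top_k, top_p)))
```

The top-ranked token always has zero mass before it, so every row keeps at least one token regardless of its p.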
4. Penalties
vllm/v1/sample/ops/penalties.py: given the tokens generated so far (and prompt), apply the repetition penalty (multiplicative: divide positive logits, multiply negative ones) and subtract the frequency and presence penalties from the corresponding logits. Needs per-request output
token histories, threaded through SamplingMetadata.
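The three penalties differ only in how they read the history. A single-request sketch in plain Python follows; it is a simplification, not upstream's batched code: it uses one token history for all three penalties, and the function name and signature are hypothetical.

```python
from collections import Counter


def apply_penalties(logits, token_history, repetition_penalty=1.0,
                    frequency_penalty=0.0, presence_penalty=0.0):
    """Return a penalized copy of logits for one request.

    repetition: multiplicative, divide positive logits and multiply negative ones.
    frequency:  subtract penalty * (number of occurrences).
    presence:   subtract a flat penalty for any prior occurrence.
    """
    counts = Counter(token_history)
    out = list(logits)
    for token_id, count in counts.items():
        x = out[token_id]
        # The sign check keeps the penalty pushing the logit down in both cases.
        x = x / repetition_penalty if x > 0 else x * repetition_penalty
        x -= frequency_penalty * count
        x -= presence_penalty
        out[token_id] = x
    return out


if __name__ == "__main__":
    logits = [2.0, -2.0, 1.0]
    history = [0, 1, 1]  # token 0 seen once, token 1 seen twice, token 2 never
    print(apply_penalties(logits, history, repetition_penalty=2.0))  # [1.0, -4.0, 1.0]
    print(apply_penalties(logits, history, frequency_penalty=0.5))   # [1.5, -3.0, 1.0]
    print(apply_penalties(logits, history, presence_penalty=1.0))    # [1.0, -3.0, 1.0]
```

Tokens that never appeared in the history are left untouched, which is why the sampler needs per-request token histories at all.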
5. Logits processors — the hook everything uses
vllm/v1/sample/logits_processor/:
- interface.py: the LogitsProcessor contract (transform logits in place given state).
- builtin.py: the built-in processors (min-p, logit bias, etc.).
- state.py: per-request state management across steps.
This is the seam structured output (Phase 12) plugs into: a grammar produces a per-step bitmask of
allowed tokens, applied as a logits processor that sets illegal tokens to -inf before sampling.
Penalties, bias, and grammar masks all compose at this one well-defined point.
6. Parallel sampling
vllm/v1/engine/parallel_sampling.py — manages n>1: it expands one request into N child
sequences that share the prompt's KV (prefix caching, Phase 2/3) and diverge after the first
sampled token. Beam search has its own handling (it changes the active set each step, unlike plain
sampling).
Reading checklist
- Sampler.forward: recite the pipeline order.
- Where do per-request params live, and why packed into tensors (not a Python loop)?
- topk_topp_sampler.py: how is heterogeneous-batch top-p done branch-free?
- The LogitsProcessor interface: how does Phase 12's grammar mask reuse it?
- parallel_sampling.py: how does n>1 reuse prefix caching?
Now build it: 02-mini-build.md, then the labs.
Phase 09 — Mini-Build: a sampling pipeline with logits processors
You already have mini_vllm/sampler.py (greedy, temperature, top-k, top-p). This phase adds the
three things real engines need: min-p, a repetition penalty, and a logits-processor hook,
the pluggable pre-sampling stage that penalties and structured output (Phase 12) ride.
Contents
- The task (lab-01)
- Why a logits-processor hook (not just hardcoded knobs)?
- Definition of done
- Map to the real engine
The task (lab-01)
In lab-01-sampling-ops implement, in numpy:
- apply_min_p(logits, min_p): keep tokens with prob ≥ min_p × max_prob, mask the rest.
- apply_repetition_penalty(logits, generated_token_ids, penalty): divide the positive logits (and multiply the negative ones) of already-generated tokens so repeats are less likely.
- a LogitsProcessor protocol: a callable (logits, context) -> logits, and a Pipeline that runs a list of processors in order.
- sample(logits, params, generated, processors): apply processors → penalty → temperature → top-k → top-p/min-p → sample, in that order.
This mirrors Sampler.forward (sampler.py:67) and the LogitsProcessor framework
(logits_processor/interface.py). The pipeline order is the contract.
Why a logits-processor hook (not just hardcoded knobs)?
Because the same mechanism serves penalties, logit bias, bad-words, and grammar masks
(Phase 12). Build it once as "a function that edits logits at a defined point" and structured
output becomes "just another processor that masks illegal tokens to -inf." You'll literally
reuse this pipeline in Phase 12.
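A minimal sketch of that hook, assuming logits are a plain list of floats and the context is a dict. The (logits, context) -> logits shape and the Pipeline name come from the task list above; the three example processors and the "allowed_token_ids" context key are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Logits = List[float]
Context = Dict[str, object]
LogitsProcessor = Callable[[Logits, Context], Logits]

NEG_INF = float("-inf")


class Pipeline:
    """Run logits processors in order; each one sees the previous one's output."""

    def __init__(self, processors: Iterable[LogitsProcessor]):
        self.processors = list(processors)

    def __call__(self, logits: Logits, context: Context) -> Logits:
        for processor in self.processors:
            logits = processor(logits, context)
        return logits


def logit_bias(bias: Dict[int, float]) -> LogitsProcessor:
    """Add a fixed offset to specific token ids."""
    def processor(logits: Logits, context: Context) -> Logits:
        return [x + bias.get(i, 0.0) for i, x in enumerate(logits)]
    return processor


def ban_tokens(banned: Iterable[int]) -> LogitsProcessor:
    """Make the given token ids unsamplable."""
    banned_set = set(banned)
    def processor(logits: Logits, context: Context) -> Logits:
        return [NEG_INF if i in banned_set else x for i, x in enumerate(logits)]
    return processor


def allow_only_from_context(logits: Logits, context: Context) -> Logits:
    """Stand-in for a grammar mask: keep only context['allowed_token_ids'] if present."""
    allowed = context.get("allowed_token_ids")
    if allowed is None:
        return logits
    return [x if i in allowed else NEG_INF for i, x in enumerate(logits)]


if __name__ == "__main__":
    pipeline = Pipeline([logit_bias({0: 5.0}), ban_tokens([0]), allow_only_from_context])
    print(pipeline([1.0, 2.0, 3.0], {"allowed_token_ids": {0, 1}}))  # [-inf, 2.0, -inf]
```

Because a mask writes -inf and a later bias cannot revive it, hard constraints compose safely with soft ones as long as the sampler treats -inf as probability zero.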
Definition of done
pytest phase-09-sampling-and-decoding-algorithms/labs -q
Tests pin: top-k restricts support to the argmax when k=1; top-p keeps the nucleus; min-p cutoff is confidence-relative; repetition penalty lowers a repeated token's probability; a banning logits processor makes a token unsamplable.
Map to the real engine
| your numpy | real vLLM |
|---|---|
| pipeline order | Sampler.forward (sampler.py:67) |
| apply_min_p, top-k/p | ops/topk_topp_sampler.py (vectorized over the batch) |
| repetition penalty | ops/penalties.py |
| LogitsProcessor + Pipeline | logits_processor/{interface,builtin,state}.py |
| a banning processor | ops/bad_words.py + the grammar mask (Phase 12) |
Phase 09 Labs — Sampling & Decoding Algorithms
Four labs on the last centimeter of inference: turning logits into tokens, at production grade. The arc: build the full per-request pipeline with its extension hook (lab-01), add the state that makes sampling reproducible under batching (lab-04), meet the search alternative and its garden-path motivation (lab-03), then watch parallel sampling ride three phases of memory machinery on real hardware (lab-02).
Recommended order: 01 → 04 → 03 → 02. (Directory numbers predate labs 03–04.) CPU
labs follow the standard contract — starter.py (your work), solution.py
(reference), test_lab.py (the spec); default runs the solution, LAB_IMPL=starter
grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-09-sampling-and-decoding-algorithms/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-09-sampling-and-decoding-algorithms/labs/lab-01-sampling-ops -q
Contents
- lab-01-sampling-ops [CPU-OK]
- lab-02-parallel-sampling [GPU-OPT]
- lab-03-beam-search [CPU-OK]
- lab-04-seeded-rng-batch-invariance [CPU-OK]
- What you can do after this phase
Labs
lab-01-sampling-ops [CPU-OK]
The production pipeline: custom processors → repetition penalty → temperature → top-k → top-p → min-p → draw, with each placement justified (the order is a theorem, not a convention) and the logits-processor hook that Phase 12's grammar masking rides. Includes min-p's confidence-relative cutoff and the divide-positive/multiply-negative penalty asymmetry. Skills: pipeline order as API; the hook pattern; why penalties need history.
lab-02-parallel-sampling [GPU-OPT]
n=4 on real vLLM: the prompt prefills once, all samples share its KV blocks
(ref_cnt=4, 75% hit rate = the pioneer effect with n as denominator), diverging from
the first sampled token. The cheapest diversity money can buy, priced exactly.
Annotated capture included. Skills: the one-prompt-n-tails cost model;
self-consistency economics; n vs separate-requests vs beam search.
lab-03-beam-search [CPU-OK]
Sequence-level search: build greedy and beam decoding, then spring the garden-path trap — a four-probability fixture where greedy's local optimum (joint 0.31) loses to beam's [B, C] (0.36), provably. EOS-finishes-a-beam bookkeeping, the width-1 = greedy identity, and why V1 evicted beams from the engine core. Skills: search vs sampling; log-prob scoring; length bias; probability ≠ quality (degeneration).
lab-04-seeded-rng-batch-invariance [CPU-OK]
The reproducibility contract: a seeded request's tokens must not depend on its batch neighbors. Build the per-request-generator sampler, prove invariance with 0/1/5 interleaved neighbors — and watch the natural shared-RNG implementation fail the same scenario (the control test ships with the lab). Skills: randomness as private state; continuity vs re-seeding; isolation claims need broken controls; the kernel layer of nondeterminism.
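The contract in lab-04 can be previewed with a toy batch loop. The sketch below is an illustration under simplifying assumptions, not the lab's sampler: every request samples from the same fixed uniform distribution, and run_batch with its shared_seed switch is a hypothetical helper that exists only to contrast the two designs.

```python
import random


def sample_token(probs, rng):
    """Draw one token id from a probability list using the given generator."""
    return rng.choices(range(len(probs)), weights=probs)[0]


def run_batch(requests, steps, shared_seed=None):
    """Decode `steps` tokens for every request, interleaved step by step.

    requests: list of (request_id, seed). With shared_seed=None each request owns
    a private generator; otherwise all requests draw from one shared generator.
    """
    probs = [0.25, 0.25, 0.25, 0.25]  # toy distribution, identical for all requests
    shared = random.Random(shared_seed) if shared_seed is not None else None
    private = {rid: random.Random(seed) for rid, seed in requests}
    outputs = {rid: [] for rid, _ in requests}
    for _ in range(steps):
        for rid, _seed in requests:
            rng = shared if shared is not None else private[rid]
            outputs[rid].append(sample_token(probs, rng))
    return outputs


if __name__ == "__main__":
    alone = run_batch([("a", 7)], steps=20)["a"]
    crowded = run_batch([("x", 1), ("a", 7), ("y", 2)], steps=20)["a"]
    print(alone == crowded)  # True: private generators ignore the neighbors

    alone_shared = run_batch([("a", 7)], steps=20, shared_seed=7)["a"]
    crowded_shared = run_batch([("x", 1), ("a", 7), ("y", 2)], steps=20, shared_seed=7)["a"]
    print(alone_shared == crowded_shared)  # neighbors consume draws, so this breaks
```

With a shared generator, request "a" receives whichever draws its neighbors left behind, so its tokens depend on batch composition even though its own seed never changed.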
What you can do after this phase
Hold the entire logits-to-token path in your head, in order, with reasons; extend it safely through the processor hook (and recognize Phase 12 as one more processor); deliver seeded reproducibility under batching and explain what it does and doesn't promise; choose between sampling, beam search, and best-of-n from their actual cost and quality shapes; and price candidate-generation workloads (self-consistency, RLHF sampling) from the sharing arithmetic. Phase 10 scales the engine across GPUs; the per-request state you isolated here is exactly what has to survive the trip.
Lab 09-01 — Sampling Ops & Logits Processors [CPU-OK]
Phase 0 lab-03 built the four classic knobs. This lab builds the production pipeline:
penalties that read generation history, the confidence-relative min_p cutoff, and —
the architecturally important part — the logits-processor hook, a pluggable
(logits, ctx) → logits stage that turns the sampler from a fixed function into an
extension point. That hook is how structured output injects its grammar mask
(Phase 12), how logit_bias and bad-words lists work, and how every "force the model
to/never let the model…" feature you'll ever build gets in. The ordering of the stages
is the lab's quiet theorem: each transform assumes the one before it, and reorderings
produce different samplers, not equivalent ones.
Contents
- Why this lab exists
- Background: the pipeline and why its order is fixed
- Files
- Run
- What to implement
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Two reasons, one per half of the lab. The ops half: min_p and repetition penalties
are the knobs real traffic actually exercises (every chat frontend ships a repetition
penalty; min_p has become the open-model community's favorite truncation), and both
have semantics subtle enough that implementing them is the only reliable way to stop
mis-explaining them. min_p's cutoff scales with the model's confidence — strict when
one token dominates, permissive when the distribution is flat — which is the adaptivity
top-k lacks and top-p only approximates. The repetition penalty's divide-positive,
multiply-negative asymmetry is the detail everyone forgets: a naive uniform division
would boost already-negative logits.
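To see the asymmetry in numbers, here is a minimal sketch. It assumes logits in a plain Python list; the helper names and signatures are illustrative and may not match the lab's starter.py.

```python
def apply_repetition_penalty(logits, generated, penalty):
    """Push down every token id that already appears in `generated`.

    Positive logits are divided and negative logits are multiplied, so both
    become less likely when penalty > 1. Applied once per distinct id.
    """
    out = list(logits)
    for tok in set(generated):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def naive_divide(logits, generated, penalty):
    """The tempting bug: divide regardless of sign."""
    out = list(logits)
    for tok in set(generated):
        out[tok] = out[tok] / penalty
    return out


logits = [2.0, -2.0, 0.5]
seen = [0, 1, 1]  # token 1 appears twice but is penalized once
good = apply_repetition_penalty(logits, seen, 2.0)
bad = naive_divide(logits, seen, 2.0)
print(good)  # [1.0, -4.0, 0.5]
print(bad)   # [1.0, -1.0, 0.5]  (token 1 rose from -2.0 to -1.0)
```

The naive version rewards the repeated negative-logit token, which is the opposite of what the knob promises.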
The hook half is about architecture. vLLM cannot ship every conceivable logits
intervention, so it ships an interface; the pipeline you build — an ordered list of
(logits, ctx) → logits callables, run before the standard knobs — is that interface
in miniature. After this lab, Phase 12's grammar-constrained decoding is "a logits
processor that masks non-grammatical tokens," full stop; the mystery is relocated to
building the mask fast, which is where it belongs.
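A rough sketch of that interface in miniature, under stated assumptions: the Pipeline class and the ban_token / add_bias factories are hypothetical names chosen for illustration, and ctx is the small dict the lab describes, not vLLM's production state.

```python
import math


def ban_token(token_id):
    """Build a processor that makes one token id unsamplable."""
    def processor(logits, ctx):
        out = list(logits)  # return new logits; never mutate the input
        out[token_id] = -math.inf
        return out
    return processor


def add_bias(bias):
    """Build a processor that adds a constant to selected token ids."""
    def processor(logits, ctx):
        return [x + bias.get(i, 0.0) for i, x in enumerate(logits)]
    return processor


class Pipeline:
    """Ordered list of (logits, ctx) -> logits callables, run in list order."""

    def __init__(self, processors):
        self.processors = list(processors)

    def __call__(self, logits, ctx):
        for proc in self.processors:
            logits = proc(logits, ctx)
        return logits


pipe = Pipeline([add_bias({2: 5.0}), ban_token(2)])
ctx = {"generated": [0, 1]}
masked = pipe([1.0, 0.5, 0.2], ctx)
print(masked)  # [1.0, 0.5, -inf]
```

Each processor is a pure function of (logits, ctx), which is the property that lets the engine run many rows without the processors stepping on each other.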
Background: the pipeline and why its order is fixed
custom processors → repetition penalty → temperature → top-k → top-p → min-p → softmax → draw
Walk the order backwards and each placement explains itself:
- Truncations (top-k/p/min-p) come after temperature because they're defined over the distribution you'll actually sample from — truncating pre-temperature evaluates the nucleus on the wrong distribution (and yes, top-k commutes with monotone temperature but top-p does not: temperature changes the probabilities the cumulative sum is built from).
- Penalties come before temperature: they're corrections to the model's raw scores, not to the sampling distribution; applying them post-temperature would make the penalty's strength depend on T — two knobs tangled into one.
- Custom processors run first: a hard constraint (grammar mask, banned token) must shape everything downstream — a token masked to −∞ before truncation can never sneak back, no matter what k/p/min-p do. Mask after truncation and you can end up with an empty candidate set (every surviving token banned) — the all-states-are-−∞ crash class that constrained-decoding implementations know well.
- Order within truncations (k → p → min-p) matches vLLM's; they don't commute either, and matching the engine's order is what makes your sampler's outputs comparable to its.
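The commutation claim in the first bullet can be checked by hand. The sketch below uses made-up logits and illustrative helper names; it only compares which token ids survive, not the lab's exact masking code.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, p):
    """Token ids in the smallest high-probability set whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return kept


def top_k_ids(values, k):
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]


logits = [2.0, 1.0, 0.0, -1.0]
T, p = 2.0, 0.8
scaled = [x / T for x in logits]

before_temp = nucleus(softmax(logits), p)  # nucleus on the wrong distribution
after_temp = nucleus(softmax(scaled), p)   # nucleus on what you sample from
print(before_temp)  # [0, 1]
print(after_temp)   # [0, 1, 2]
print(top_k_ids(logits, 2) == top_k_ids(scaled, 2))  # True
```

Temperature 2.0 flattens the distribution, so the same p = 0.8 needs a third token; top-k only looks at rank, which dividing by a positive T cannot change.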
Files
- starter.py: the five ops, the Pipeline, and sample (the full ordered assembly). Your work.
- solution.py: reference.
- test_lab.py: each op's exact semantics plus the ban-token processor pattern.
Run
LAB_IMPL=starter pytest phase-09-sampling-and-decoding-algorithms/labs/lab-01-sampling-ops -q
pytest phase-09-sampling-and-decoding-algorithms/labs/lab-01-sampling-ops -q # reference
What to implement
The ops from Phase 0 lab-03 (temperature, top-k, top-p) plus the three new pieces:
apply_min_p (threshold = min_p × max_prob, computed on the current distribution),
apply_repetition_penalty (divide positive logits by the penalty, multiply negative
ones — both directions push down; apply once per distinct token, not per occurrence),
and Pipeline/sample (the assembly in the order above; greedy short-circuits after
penalties — penalties do apply to greedy, a detail people miss: a repetition penalty
that only worked at temperature > 0 would be a different feature).
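One possible shape for apply_min_p, as a sketch rather than the reference solution: it assumes list-based logits and masks with negative infinity, and the two example distributions are invented to show the confidence-relative behavior.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_min_p(logits, min_p):
    """Mask tokens whose probability is below min_p * max_prob."""
    probs = softmax(logits)
    threshold = min_p * max(probs)
    return [x if pr >= threshold else -math.inf for x, pr in zip(logits, probs)]


def survivors(logits):
    return sum(x != -math.inf for x in logits)


peaked = [5.0, 1.0, 0.5, 0.0]  # one token dominates
flat = [1.0, 0.9, 0.8, 0.7]    # the model is unsure
kept_peaked = survivors(apply_min_p(peaked, 0.1))
kept_flat = survivors(apply_min_p(flat, 0.1))
print(kept_peaked, kept_flat)  # 1 4
```

The same min_p = 0.1 is strict on the peaked row and permissive on the flat one, which is the adaptivity a fixed floor would not give you.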
What the tests prove
| Test | What it pins |
|---|---|
| top-k = 1 ⇒ argmax only | The truncation-to-greedy limit |
| top-p keeps exactly the nucleus | The inclusive-boundary semantics (Phase 0 lab-03's footgun, still armed) |
| min-p cutoff scales with max prob | The confidence-relative behavior that distinguishes it from a fixed floor |
| repetition penalty lowers a repeated token | Both signs handled — the divide/multiply asymmetry |
| ban-token processor ⇒ token unsamplable | The hook works, and −∞ survives the whole downstream pipeline — the Phase 12 grammar-mask pattern in one assert |
Hitchhiker's notes
- The ctx dict is the processor's window into the request: here just {"generated": [...]}, upstream a richer per-request state (prompt tokens, output tokens, FSM state for grammars). The discipline that keeps the hook safe: processors read ctx and return logits; a processor that mutates shared state breaks the batched execution model (rows are processed in arbitrary order; this is Phase 9 lab-04's isolation lesson, one layer up).
- Penalties are why samplers need history. Temperature/top-k are pure functions of the logits row; penalties read generated, meaning the production sampler carries per-request token-id state to the GPU (upstream: the penalty path in vllm/v1/sample/, with prompt-vs-output token distinction: presence_penalty, frequency_penalty, repetition_penalty are three related-but-different formulas; read them once and save yourself a support ticket).
- Each stage is cheap; the sort in top-p is the expensive one (O(V log V) per row, V = 128k+). Vectorized GPU implementations care a lot: there are sort-free top-p approximations and threshold-precomputation tricks upstream. When sampling shows up in a profile (it does, at high batch), this is the line.
- Processor order is API, the Phase 1 lab-05 lesson recurring: two processors (say, a grammar mask and a logit bias) don't commute either. vLLM applies user-supplied processors in list order — document yours.
Going further
- Implement presence_penalty and frequency_penalty (additive, occurrence-counting; distinct from the multiplicative repetition penalty) and write the test that distinguishes all three on a token generated twice.
- Build a MinTokensProcessor that masks EOS while len(generated) < min_tokens. You've now implemented the min_tokens feature from Phase 1 lab-05's going-further, as a processor, which is exactly how the engine structures it.
- Property-test the pipeline: for random logits and any knob combo, assert the output distribution (a) sums to 1, (b) supports only unmasked tokens, (c) is unchanged when all knobs are neutral. Three invariants that catch most pipeline-assembly bugs.
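For the first exercise, a starting sketch of the three penalties side by side. The additive forms follow the commonly used definitions (a flat subtraction for presence, a count-scaled subtraction for frequency); function names and list-based logits are assumptions, and this version looks only at generated tokens, not the prompt.

```python
from collections import Counter


def apply_presence_penalty(logits, generated, penalty):
    """Subtract a flat amount from every token that has appeared at all."""
    counts = Counter(generated)
    return [x - penalty * (counts[i] > 0) for i, x in enumerate(logits)]


def apply_frequency_penalty(logits, generated, penalty):
    """Subtract an amount proportional to how often the token appeared."""
    counts = Counter(generated)
    return [x - penalty * counts[i] for i, x in enumerate(logits)]


def apply_repetition_penalty(logits, generated, penalty):
    """Multiplicative: divide positive logits, multiply negative ones."""
    out = list(logits)
    for tok in set(generated):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


logits = [2.0, 2.0, 2.0]
generated = [0, 1, 1]  # token 0 once, token 1 twice, token 2 never
presence = apply_presence_penalty(logits, generated, 0.5)
frequency = apply_frequency_penalty(logits, generated, 0.5)
repetition = apply_repetition_penalty(logits, generated, 2.0)
print(presence)    # [1.5, 1.5, 2.0]
print(frequency)   # [1.5, 1.0, 2.0]
print(repetition)  # [1.0, 1.0, 2.0]
```

Only the frequency penalty separates "once" from "twice"; that column is the one your distinguishing test should assert on.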
References
- upstream/vllm/v1/sample/sampler.py: the batched pipeline; find your stage order.
- upstream/vllm/v1/sample/logits_processor/: the production hook interface.
- Nguyen et al., Min-p Sampling (2024), the case for confidence-relative truncation: https://arxiv.org/abs/2407.01082
- Keskar et al., CTRL (2019) — where the repetition penalty's divide/multiply form comes from (§4.1): https://arxiv.org/abs/1909.05858
- Phase 0 lab-03 — the four base knobs; Phase 12 — the grammar mask that rides this lab's hook.
Lab 09-02 — Parallel Sampling Shares Prompt KV [GPU-OPT]
n=4 in a SamplingParams looks like syntactic sugar for "send the request four
times." This lab shows why it's structurally better: the engine prefills the prompt
once, and all four samples share its KV blocks via the Phase 2 ref-count
machinery, diverging only from the first sampled token onward. You'll watch it happen
in the logs (75% prefix-cache hit rate for n=4 — three of four samples ride the
first's blocks) and connect three phases of machinery into the single cheapest way to
buy output diversity.
No GPU? Don't panic. The captured run below carries the whole argument — and the arithmetic sections need no hardware at all.
Contents
- Why this lab exists
- Background: one prompt, n tails
- Requirements
- Steps
- Captured output (real run, facebook/opt-125m, L4, vLLM 0.22.1, trimmed)
- Reading the numbers
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Parallel sampling is quietly one of the most-used features in production: best-of-n
ranking, self-consistency for reasoning (sample k chains, vote), candidate generation
for RLHF and eval pipelines, "regenerate" buttons. All of them multiply output
tokens while reusing one prompt — and whether that prompt is processed once or n times
is often the dominant cost term (prompts routinely outweigh completions 10:1 in chat
and RAG). Knowing that n>1 shares the prefill — and why, down to the block
refcounts — turns "should we batch our candidates into one request?" from a guess
into arithmetic you can do in your head: n=4 on a 2,000-token prompt with 100-token
outputs ≈ 6,000 prompt tokens saved — roughly 3× cheaper than four independent
requests at this shape.
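The head arithmetic, written out as a sketch. The function name is illustrative, and the model is deliberately crude: it counts tokens only, treats a prefill token and a decode token as equally expensive, and ignores the few tokens each extra sample still prefills.

```python
def tokens_processed(prompt_len, output_len, n, shared_prefill):
    """Total tokens the engine computes for n samples of one prompt."""
    prefills = 1 if shared_prefill else n
    return prefills * prompt_len + n * output_len


P, L, n = 2000, 100, 4
shared = tokens_processed(P, L, n, shared_prefill=True)
separate = tokens_processed(P, L, n, shared_prefill=False)
print(separate - shared)            # 6000
print(round(separate / shared, 2))  # 3.5
```

Swap in your own P and L to see how quickly the ratio falls toward 1 as outputs grow relative to the prompt.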
The lab is also the phase's bridge back to the memory phases: it's the first place
sampling policy (how many candidates) visibly drives memory behavior (block
sharing). The diversity you buy with n is priced in KV blocks, and the discount —
sharing — comes from infrastructure built three phases ago for a different feature
(prefix caching). Composability like that is what good engine design looks like from
the outside.
Background: one prompt, n tails
What the engine does with n=4: the frontend fans the request into 4 sequences;
sequence 1 prefills the prompt, caching its full blocks (Phase 2 lab-05's eager
caching at allocation); sequences 2–4 hit those blocks at admission
(get_computed_blocks → touch → ref_cnt = 4 on the shared blocks) and prefill
only the cache-ineligible remainder (the partial tail block + last token — the
num_tokens − 1 rule). From the first sampled token, each sequence allocates its own
private tail blocks and diverges — temperature 1.0 plus per-sequence RNG state
(lab-04's machinery) makes the four continuations distinct.
Cost shape: prefill ≈ 1× prompt + small change (instead of 4×); KV ≈ 1× prompt + 4× outputs (instead of 4× everything); decode = 4 streams, which batch together in every step (Phase 1 lab-04's mixed batches — four rows, one weight stream, nearly free at small n per Phase 0 lab-04's bandwidth math).
Requirements
uv pip install -e ".[vllm]"
huggingface-cli download facebook/opt-125m
Steps
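from vllm import LLM, SamplingParams

# Prefix caching must be on for the n samples to share the prompt's KV blocks.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True, gpu_memory_utilization=0.5)
# n=4 asks for four completions of the one prompt in a single request.
out = llm.generate(["Write a haiku about GPUs:"],
                   SamplingParams(n=4, temperature=1.0, max_tokens=24))
for c in out[0].outputs:  # one entry per sample
    print(repr(c.text))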
Run under VLLM_LOGGING_LEVEL=DEBUG. Three things to verify: the prompt's prefill
happens once (scheduled-token counts in the logs); the prefix-cache hit rate lands at
(n−1)/n-ish; the four texts genuinely differ. Then the control runs: same with
enable_prefix_caching=False (watch prompt tokens 4×), and with seed set (watch
the four outputs stay distinct — see the notes for why).
Captured output (real run, facebook/opt-125m, L4, vLLM 0.22.1, trimmed)
DEBUG ... Prefix cache hit rate: GPU: 75.0% # samples 2-4 reuse sample 1's prompt blocks
'GPUs run hot / silicon dreams in parallel / fans hum all night long'
'Tensors flow like streams ...'
'Cores ignite at dawn ...'
'Threads in warps align ...'
Reading the numbers
- 75.0% = 3/4 — the pioneer effect with n as the denominator: sample 1 populates, samples 2–4 hit. (Phase 2 lab-03's 87.5% was 7/8; same law, different n. You can now read any of these rates as "1 − 1/cohort".)
- Four distinct haikus — divergence is immediate (first sampled token) because each sequence draws independently. If two of your n samples come back identical on a short prompt, that's not a bug — peaked distributions (low temperature, strong prompt) genuinely collide; raise temperature or use distinct seeds per your diversity needs.
- What the log doesn't show: the shared blocks' ref_cnt sitting at 4, and the free-on-finish path decrementing it as each sample completes (the Phase 2 lab-05 biography, now with four readers). The blocks free only when the last sample finishes; a straggler sample (one haiku that rambles to max_tokens) holds the prompt's KV for everyone. Worth knowing when n is large and outputs are long.
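The part the log hides can be mimicked with a toy counter. SharedBlock is a hypothetical stand-in, not vLLM's block class; only the names touch and ref_cnt echo the text, and "freed" here simply means the count reached zero.

```python
class SharedBlock:
    """Toy stand-in for a prompt KV block with a reference count."""

    def __init__(self):
        self.ref_cnt = 0
        self.freed = False

    def touch(self):
        self.ref_cnt += 1

    def release(self):
        self.ref_cnt -= 1
        if self.ref_cnt == 0:
            self.freed = True


prompt_block = SharedBlock()
for _ in range(4):  # four samples attach to the same prompt block
    prompt_block.touch()

history = []
for finished in range(1, 5):  # samples finish one at a time
    prompt_block.release()
    history.append((finished, prompt_block.ref_cnt, prompt_block.freed))
print(history)
# [(1, 3, False), (2, 2, False), (3, 1, False), (4, 0, True)]
```

Three samples finishing changes nothing for the prompt's memory; the fourth, however late it arrives, is the one that lets the block go.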
Hitchhiker's notes
- n>1 with a seed still gives n different outputs: vLLM derives per-sequence randomness so the n samples don't collapse into n copies (which would make seeded best-of-n useless). The whole-request stream is reproducible; the sequences differ from each other. (Lab-04's per-request → per-sequence state, one level finer.)
- V0 had a best_of distinct from n (generate best_of, return n by logprob); V1 simplified the API surface, and ranking moved client-side. If you see best_of in older docs/code, that's the fossil. The sharing machinery is the same either way.
- Self-consistency at scale: k=16 reasoning chains over a long CoT prompt is the flagship use (prompt KV once, 16 cheap decode streams, majority-vote the answers). The cost model above is why the technique is affordable at all; quote it when someone proposes 16 separate API calls instead.
- Contrast with beam search (lab-03): parallel samples never interact after the fork (no pruning, no rescoring, scheduler-trivial). Beams branch and die mid-flight, which is exactly the interaction that got beam search evicted from the V1 core. Independence is what makes n cheap to support.
Reflect
- Write the exact cost ratio of n=4 vs four separate requests (with caching off) for prompt P, output L tokens: prefill (P + 3·1-ish vs 4P) and KV (P + 4L vs 4P + 4L). At what L/P does the advantage fade? (When outputs dwarf prompts: diversity's discount is a prompt-side effect.)
- Four separate requests with prefix caching on get most of the same sharing (Phase 3 lab-03). What does n=4 still buy? (One API call, guaranteed same-step admission so the hit is certain rather than eviction-dependent, single response object, and intra-request sequence accounting. The mechanism is shared; the contract differs.)
- Where do the n sequences' sampling states live, given they're one "request"? (Per-sequence rows in the input batch: generators, penalties' history, all of lab-01/lab-04's state, n times. "Request" is an API word; the engine schedules sequences.)
References
- upstream/vllm/v1/engine/: the n>1 fan-out in the output processor/frontend (search n= handling and parent request logic).
- Phase 2 lab-05 (touch/ref_cnt: the sharing mechanics); Phase 3 lab-06 (the exact-token accounting of what sharing saves).
- Wang et al., Self-Consistency Improves Chain of Thought Reasoning (2022), the workload this feature exists for: https://arxiv.org/abs/2203.11171
- vLLM docs, Sampling Parameters (n and friends): https://docs.vllm.ai/en/latest/api/inference_params.html
Lab 09-03 — Beam Search: When Greedy Is a Trap [CPU-OK]
Greedy decoding answers the wrong question. It maximizes each token; what you usually
want is the most probable sequence — and those diverge exactly when the locally best
token leads somewhere bad. Linguists call it a garden path; you'll build one: a tiny
Markov model where greedy confidently takes the 0.6-probability first step into a coin
flip (joint ≈ 0.31), while the humble 0.4 step leads to near-certainty (joint 0.36).
Beam search — carrying the best beam_width partial hypotheses instead of one — escapes
the trap, and you'll implement it properly: the candidate pool, the pruning, and the
EOS-finishes-a-beam bookkeeping that real implementations get subtly wrong.
Contents
- Why this lab exists
- Background: search, not sampling
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Beam search occupies a strange place in modern LLM serving: central to the API surface
(use_beam_search exists, translation/summarization workloads still ask for it),
algorithmically classic — and architecturally awkward for engines like vLLM, awkward
enough that V1 moved it out of the core engine entirely (it's emulated at the API layer
via n parallel candidates plus rescoring-style logic, precisely because per-step
branch-and-prune fights the continuous-batching machine — see the notes). You can't
evaluate that design decision without knowing exactly what the algorithm requires, and
the way to know is to build it: the pooled expansion, the global top-beam_width cut,
the finished-set handling.
The deeper lesson is the trap itself. "Greedy is myopic" is a sentence; your TRAP
fixture is a proof object — four transition probabilities that make the failure exact
and checkable (0.6·0.51 < 0.4·0.9). Once you've built one garden path you'll
recognize the pattern everywhere it matters: why beam search wins on tasks with
constrained correct answers (translation), why it hurts open-ended generation (it
finds high-probability degenerate text — the famous repetition pathology), and why
sampling (labs 01/04) took over for chat.
Background: search, not sampling
Everything else in this phase draws from a distribution; beam search optimizes over
one. State: a set of partial hypotheses with their cumulative log-probabilities. Per
step, each live beam expands by its top beam_width tokens (more children can't help —
at most beam_width survive the global cut), all candidates pool, the best
beam_width survive. Two details carry the correctness:
- Scores are summed log-probs — products of probabilities underflow within a sentence; logs are not an optimization but a necessity.
- EOS finishes a beam: a hypothesis that emits EOS leaves the live set (extending past EOS is meaningless) but stays in the final ranking against hypotheses that kept going. Forget this and short, confident answers are silently discarded; test_eos_finishes_a_beam pins it.
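A quick illustration of the first bullet, using an invented 200-token sequence. Python floats are float64, which takes a couple of hundred such tokens to give out; narrower float types reach zero much sooner.

```python
import math

probs = [0.01] * 200  # 200 tokens, each with probability 0.01

product = 1.0
for p in probs:
    product *= p  # the true value, 1e-400, is below float64's range

log_score = sum(math.log(p) for p in probs)

print(product)              # 0.0
print(round(log_score, 2))  # -921.03
```

Once the product hits 0.0 every hypothesis ties at zero and ranking is meaningless, while the summed log-probs still order them cleanly.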
Width 1 collapses to greedy exactly (test_beam_width_one_is_greedy) — beam search is
a strict generalization, and the test fixture deliberately avoids probability ties,
because a tie tests your tie-breaker, not your algorithm (a lesson this lab's own test
suite learned the hard way; see the comment in test_lab.py).
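Here is one way the pool, prune, and finish loop could look, as a sketch under assumptions: the TRAP table below is a hypothetical stand-in built from the probabilities quoted in this lab (0.6, 0.51, 0.4, 0.9), not the lab's actual fixture, and greedy is obtained as width 1 per the identity above.

```python
import math

EOS = "<eos>"
# Hypothetical stand-in for the TRAP fixture: next-token probabilities
# keyed by the last token (a first-order Markov model).
TRAP = {
    None: {"A": 0.6, "B": 0.4},
    "A": {"X": 0.51, "Y": 0.49},
    "B": {"C": 0.9, "D": 0.1},
}


def next_probs(tokens):
    last = tokens[-1] if tokens else None
    return TRAP.get(last, {EOS: 1.0})  # any other token ends the sequence


def beam_search(beam_width, max_steps=4):
    live = [((), 0.0)]  # (tokens, summed log-prob)
    finished = []
    for _ in range(max_steps):
        pool = []
        for tokens, score in live:
            ranked = sorted(next_probs(tokens).items(),
                            key=lambda kv: kv[1], reverse=True)
            for tok, p in ranked[:beam_width]:
                pool.append((tokens + (tok,), score + math.log(p)))
        pool.sort(key=lambda h: h[1], reverse=True)
        live = []
        for tokens, score in pool[:beam_width]:
            # A hypothesis that emits EOS leaves the live set but stays ranked.
            (finished if tokens[-1] == EOS else live).append((tokens, score))
        if not live:
            break
    best = max(finished + live, key=lambda h: h[1])
    return best[0], math.exp(best[1])


greedy_tokens, greedy_joint = beam_search(beam_width=1)
beam_tokens, beam_joint = beam_search(beam_width=2)
print(greedy_tokens, round(greedy_joint, 3))  # ('A', 'X', '<eos>') 0.306
print(beam_tokens, round(beam_joint, 3))      # ('B', 'C', '<eos>') 0.36
```

This version lets finished hypotheses occupy slots in the width cut; other conventions exist, and the choice is exactly the kind of bookkeeping the EOS test is there to pin.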
Files
- starter.py: greedy_decode and beam_search (pool → prune → finish). Your work.
- solution.py: reference.
- test_lab.py: the trap (greedy falls in, beam escapes), the width-1 identity, monotonicity in width, and EOS handling.
Run
LAB_IMPL=starter pytest phase-09-sampling-and-decoding-algorithms/labs/lab-03-beam-search -q
pytest phase-09-sampling-and-decoding-algorithms/labs/lab-03-beam-search -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_greedy_takes_the_garden_path | Greedy picks A (0.6) and lands at joint ≈ 0.31: locally best, globally wrong, by construction |
| test_beam_escapes_the_trap | Width 2 finds [B, C] at 0.36, strictly beating greedy's joint: the algorithm's entire reason to exist, as an inequality |
| test_beam_width_one_is_greedy | The degenerate case: same tokens, same score to 1e-12 |
| test_wider_beams_never_score_worse | Monotonicity on this fixture: more width never lowers the best score. Treat it as a sanity property of the fixture rather than a theorem (in general a wider beam can prune a prefix that a narrower beam would have kept), and note what it does not say: that wider is better text (see the degeneration note) |
| test_eos_finishes_a_beam | A finished beam survives un-extended and wins the final ranking at its natural length |
Hitchhiker's notes
- Why beam search fights continuous batching: a beam step branches (one hypothesis becomes several sharing a prefix) and prunes (hypotheses die mid-flight). Branching wants copy-on-write KV sharing (the prefix is common; Phase 2's ref_cnt machinery handles the read side, but the diverging tails need careful block forking); pruning frees KV at unpredictable times. All solvable (V0 solved it), but it threads special cases through scheduler and cache for a feature few use, which is why V1 evicted it to the API layer (vllm/beam_search.py + the OpenAI server's emulation): the engine serves beam_width parallel sequences, the wrapper does the pool-and-prune. An instructive case study in what belongs in the core.
- Length bias is real and the fix is a hack that works: summed log-probs penalize length (every token adds a negative number), so beams favor short answers; production systems divide scores by len^α (α ≈ 0.6–1.0, "length normalization") before the final ranking. Your EOS test dodges this (the short answer is also the most probable); add normalization in Going Further and construct the case where it flips the winner.
- The degeneration result (Holtzman et al., the same paper that gave you top-p in lab-01): for open-ended text, exact high-probability sequences are repetitive and dull; beam search finds them, and quality drops as width grows. Probability and quality diverge, arguably the single most important empirical fact about decoding. Beam search survives where the output is tightly constrained (translation, ASR, structured rewriting); sampling owns everything open-ended.
- Beam search is the third member of a family you've now built: greedy (argmax), sampling (draw), beam (search). They share logits and differ in the decision rule — which is why vLLM's logits-processor pipeline (lab-01) sits upstream of all three, and why grammar masking (Phase 12) composes with each of them unchanged.
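The length-bias note can be made concrete with two invented hypotheses. Everything here is illustrative: the token lists, the probabilities, and the normalized_score helper are assumptions, and the formula is the plain score / len**alpha form named above.

```python
import math


def normalized_score(log_prob, length, alpha):
    """Summed log-prob divided by length**alpha; alpha = 0 changes nothing."""
    return log_prob / (length ** alpha)


# Hypothetical finished hypotheses: (tokens, summed log-prob).
short = (["yes", "<eos>"], math.log(0.30))
long = (["yes", "it", "is", "<eos>"], math.log(0.20))


def winner(alpha):
    return max([short, long],
               key=lambda h: normalized_score(h[1], len(h[0]), alpha))


print(winner(0.0)[0])  # ['yes', '<eos>']
print(winner(1.0)[0])  # ['yes', 'it', 'is', '<eos>']
```

With α = 0 the raw log-probs favor the short answer; with α = 1 the per-token average favors the longer one, so the knob really does pick the winner.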
Going further
- Add length normalization (score / len(tokens)**alpha) to the final ranking and build a fixture where α = 0 and α = 1 disagree about the winner. You've reproduced the knob every production beam implementation exposes.
- Track prefix sharing among your beams: at each step, count how many tokens of KV a real engine would share via Phase 2's blocks (common prefix length × beams). The number is large: that's the efficiency beam search loses when emulated naively as independent sequences without prefix caching, and exactly what enable_prefix_caching recovers (lab-02's mechanism, applied to beams).
- Implement diverse beam search (penalize candidates already chosen by sibling groups) and watch the trap fixture: diversity trades best-score for coverage, so measure both.
References
- upstream/vllm/beam_search.py and the OpenAI server's beam emulation: the V1 design decision discussed above.
- Holtzman et al., The Curious Case of Neural Text Degeneration (2019), why exact search loses to sampling for open-ended text: https://arxiv.org/abs/1904.09751
- Wu et al., Google's Neural Machine Translation System (2016) — §7, the length normalization formula everyone copied: https://arxiv.org/abs/1609.08144
- Lab-01 — the logits pipeline all decision rules share; lab-02 — the prefix-sharing machinery beams want.
Lab 09-04 — Per-Request RNG & Batch Invariance [CPU-OK]
Here's a bug report you will eventually receive: "I set seed=7, temperature 1.0, and
I get different outputs every time. Your API is broken." The API isn't broken — but it
would be, in exactly this way, if the sampler used one shared random generator for the
whole batch. Whoever shares the batch with you consumes numbers from the shared stream
and shifts yours; your "seeded" request reproduces only when the entire fleet's traffic
reproduces. This lab builds the fix — per-request generator state — and proves the
contract that production samplers must honor: a seeded request's token stream is
identical whether it runs alone or interleaved with five neighbors. The test suite
includes the broken shared-RNG sampler as a control, so you see the failure, not just
read about it.
Contents
- Why this lab exists
- Background: randomness as private state
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Determinism under batching is one of those properties that's trivial to state,
genuinely subtle to deliver, and commercially important: seeded sampling is how users
build reproducible evals, how you bisect a generation bug ("same seed, same output —
now change one thing"), and how A/B tests hold the noise still. And it's violated by
the most natural implementation — rng = np.random.default_rng(...) at sampler scope,
draw per request in batch order — which works perfectly in every single-request test
and fails the moment two users share a step. Bugs that pass single-user tests and fail
under concurrency are the defining bug class of serving systems; this lab is a clean
specimen you can build, break, and internalize in twenty minutes.
It also completes the phase's batching story: lab-01 gave you the per-request pipeline (each request its own temperature/top-k/penalties), lab-02 showed requests sharing compute and KV, and this lab adds the last isolation boundary — randomness. The full production picture is vLLM's sampling metadata: per-request parameters, per-request generators, batched execution. Shared work, private state.
Background: randomness as private state
Three requirements, each pinned by a test:
- Reproducibility: same seed + same logits stream → same tokens, across process restarts and sampler instances. (The generator must be created from the seed, deterministically.)
- Continuity: a request's draws across its decode steps come from one continuing stream, so create the generator once per request, not once per step. Re-seeding every step is the sneaky variant bug: step 1 is correct, but every later step repeats that same "random" number (test_request_stream_is_stateful_not_reset constructs a uniform distribution where this is visible; on peaked distributions it hides, which is what makes it sneaky).
- Isolation (batch invariance): request A's stream must be untouched by neighbors' draws. This is what per-request state buys; the shared-RNG control test shows the alternative failing.
Plus the greedy rule from Phase 0 lab-03, restated with a reason: temperature == 0
must touch no RNG at all — not "use a default seed," no draw — so greedy requests are
reproducible without any seed bookkeeping, and so they don't perturb anyone else's
stream either (a greedy request that consumed RNG would break a seeded neighbor's
invariance — isolation cuts both ways).
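The three requirements and the greedy rule fit in a few lines. This sketch swaps in the standard library's random.Random for the NumPy generator mentioned earlier so it runs with no dependencies; the class layout, the SharedRngSampler control, and the stream_of_a helper are illustrative assumptions, not the lab's reference code.

```python
import math
import random


def softmax(logits, temperature):
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class PerRequestSampler:
    """One private generator per request id."""

    def __init__(self):
        self.generators = {}

    def sample(self, request_id, logits, temperature, seed=None):
        if temperature == 0:
            # Greedy: argmax only, no generator created or consumed.
            return max(range(len(logits)), key=lambda i: logits[i])
        if request_id not in self.generators:
            # Created once per request, so later steps continue one stream.
            self.generators[request_id] = random.Random(seed)
        rng = self.generators[request_id]
        probs = softmax(logits, temperature)
        return rng.choices(range(len(probs)), weights=probs)[0]


class SharedRngSampler(PerRequestSampler):
    """The broken control: every request draws from one shared generator."""

    def sample(self, request_id, logits, temperature, seed=None):
        return super().sample("everyone", logits, temperature, seed=7)


def stream_of_a(sampler, neighbors, steps=8):
    logits = [0.0] * 16  # uniform, so every draw is visible in the tokens
    tokens = []
    for _ in range(steps):
        for n in range(neighbors):  # neighbors sample before A each step
            sampler.sample(f"n{n}", logits, 1.0, seed=100 + n)
        tokens.append(sampler.sample("A", logits, 1.0, seed=7))
    return tokens


alone = stream_of_a(PerRequestSampler(), neighbors=0)
crowded = stream_of_a(PerRequestSampler(), neighbors=5)
print(alone == crowded)  # True

shared_alone = stream_of_a(SharedRngSampler(), neighbors=0)
shared_crowded = stream_of_a(SharedRngSampler(), neighbors=5)
print(shared_alone, shared_crowded)  # compare: neighbors consumed A's draws
```

Run alone, the shared sampler looks identical to the correct one, which is why single-request tests never catch it; only the crowded run exposes the difference.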
Files
- starter.py: PerRequestSampler, a generator dict keyed by request id, a greedy fast path, one draw per call. ~15 lines. Your work.
- solution.py: reference.
- test_lab.py: reproducibility, divergence across seeds, the invariance contract, the shared-RNG failure (control), greedy's RNG-free path, and stream continuity.
Run
LAB_IMPL=starter pytest phase-09-sampling-and-decoding-algorithms/labs/lab-04-seeded-rng-batch-invariance -q
pytest phase-09-sampling-and-decoding-algorithms/labs/lab-04-seeded-rng-batch-invariance -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_same_seed_reproduces_across_instances | Requirement 1: the stream is a pure function of the seed |
| test_different_seeds_diverge | Seeds actually matter (a sampler that ignores its seed passes test 1 vacuously; paired tests close the loophole) |
| test_batch_invariance | The contract: A's stream with 0, 1, and 5 interleaved neighbors is identical. The neighbors even sample before A each step, the worst case for a shared stream |
| test_shared_rng_breaks_batch_invariance | The control: one global generator, same scenario, and neighbors shift A's tokens. The bug, demonstrated rather than asserted |
| test_greedy_ignores_seed_and_rng_state | Temperature 0 → argmax, no RNG touched, any seed |
| test_request_stream_is_stateful_not_reset | Requirement 2: two draws match a reference generator's first two draws, not the first draw twice |
The test-design pattern is worth keeping: every isolation claim ships with a broken control. "X holds" plus "here is the natural implementation where X fails" teaches reviewers what the protective code protects against — and stops the next refactorer from "simplifying" the generator dict away.
Hitchhiker's notes
- Where this lives upstream: vLLM keeps per-request torch.Generator objects in its sampling state (seeded requests get their own; see the generator plumbing in upstream/vllm/v1/worker/gpu_input_batch.py and the sampler). The batched GPU sampler does the draws vectorized, but seeded rows use their private generator state: the exact structure of your dict, tensor-shaped.
- What batch invariance does not promise: bitwise-identical logits. Different batch compositions change kernel tiling and reduction order (the recurring last-ulp story from Phases 3/4/6), so two near-tied tokens can flip even with perfect RNG isolation. True end-to-end batch-invariant inference requires batch-invariant kernels as well, a real, recent line of engineering work (deterministic-inference modes); RNG isolation is the necessary first floor, not the whole building. Know which layer a nondeterminism report belongs to before debugging it.
- Cleanup is part of the contract: request ids recycle; a production sampler must drop a request's generator when it finishes (your dict grows forever — fine for a lab, a leak in a server). Per-request anything implies a lifecycle hook — tie it to Phase 1's reaping path mentally.
- Why not one generator seeded per (request, step)? It "fixes" continuity bugs by construction but costs a generator init per token and — worse — makes the stream depend on step numbering, which shifts under speculative decoding (Phase 8: a cycle emits several tokens). Stream-per-request is the design that survives feature composition; most alternatives quietly don't.
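The cleanup note implies a small lifecycle API. A minimal sketch, assuming a hypothetical GeneratorTable wrapper and the standard library's random.Random in place of the lab's generator type:

```python
import random


class GeneratorTable:
    """Per-request generators with an explicit lifecycle (hypothetical names)."""

    def __init__(self):
        self._generators = {}

    def get(self, request_id, seed):
        if request_id not in self._generators:
            self._generators[request_id] = random.Random(seed)
        return self._generators[request_id]

    def finish(self, request_id):
        # Drop the state when the request ends: no leak, no stale stream.
        self._generators.pop(request_id, None)


table = GeneratorTable()
first = table.get("req-1", seed=7).random()
table.finish("req-1")

# The id is recycled with a new seed: it must start a fresh stream.
recycled = table.get("req-1", seed=8).random()
print(recycled == random.Random(8).random())  # True
print(len(table._generators))                 # 1
```

Delete the finish call and the recycled request silently continues the old seed-7 stream, which is the collision half of the bug.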
Going further
- Add finish(request_id) and a test that a recycled id with a new seed starts a fresh stream (the leak-plus-collision bug, both halves).
- Vectorize: sample_batch(ids, logits_matrix, temps, seeds) doing one softmax over the batch but per-row draws from per-row generators, the actual shape of the GPU sampler. Verify batch invariance still holds (it must: that's the point of the structure).
- Compose with Phase 8: simulate a speculative cycle (k draft draws + residual draws from lab 08-03) using the request's generator, and check that a request's output is invariant to whether speculation was enabled given the same accepted tokens. (It isn't, in general: spec decode consumes RNG differently. Production systems accept this; knowing why is the exercise.)
References
- upstream/vllm/v1/worker/gpu_input_batch.py: per-request generator state in the input batch (search generator).
- upstream/vllm/v1/sample/sampler.py: where seeded rows meet the batched sampler.
- vLLM docs, Sampling Parameters (the seed field's contract): https://docs.vllm.ai/en/latest/api/inference_params.html
- Thinking Machines, Defeating Nondeterminism in LLM Inference (2025), the kernel layer of this problem (batch-invariant kernels), for the full picture: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
- Phase 0 lab-03 — the greedy fast path's origin; lab-01 — the per-request pipeline this lab adds state isolation to.
Phase 09 — Exercises: Sampling & Decoding Algorithms
Warm-up (explain)
- What is the pipeline order (penalties → ? → ? → ? → sample) and why does order matter?
- Greedy vs temperature 0 vs top-k=1 — are these the same? When?
- Top-p vs top-k vs min-p — describe each and when it adapts to model confidence.
Core (trace the code)
- In Sampler.forward (sampler.py:67), where are per-request params read from, and why are they tensors rather than a Python loop?
- What is a logits processor (logits_processor/interface.py)? Name three things it implements.
- How does parallel sampling (parallel_sampling.py) reuse prefix caching for n>1?
Build (your lab)
- In lab-01, why must repetition penalty be applied before temperature?
- Add frequency and presence penalties (count-scaled vs flat) and test their difference.
- Implement a logit_bias logits processor (add a constant to specified token ids) and verify a strongly biased token dominates.
Design (staff-level)
- You must apply 256 different (temperature, top_p, penalties) settings in one decode step. Sketch the data layout and explain why a Python loop is unacceptable on the hot path.
- A user reports repetitive loops at temperature 0. What knobs help, and what's the tradeoff of each (penalty too high degrades quality)?
- Beam search is requested for a production endpoint. Explain why it's awkward in continuous batching and how you'd bound its cost.
Self-grading
4–6 and 10–12 are interview-grade. Could you whiteboard the batched pipeline and name the files? If not, re-read 01-deep-dive.md.
Phase 09 — Interview Questions: Sampling & Decoding Algorithms
Q1. Walk through the sampling pipeline.
Model answer
Logits → logits processors (penalties, logit bias, bad-words, grammar mask) → temperature scaling
→ top-k truncation → top-p/min-p truncation → sample (argmax for greedy rows, multinomial
otherwise). Order matters: penalties edit raw logits, temperature reshapes, top-k/p prune the
support, then you draw. (Sampler.forward, sampler.py:67.)
Q2. How do you apply different sampling params per request in one batched kernel?
Model answer
Pack per-request params (temperature, top_k, top_p, penalties, seeds) into tensors aligned with
the batch (SamplingMetadata), then apply vectorized, branch-free masked ops so each row uses its
own settings in one GPU pass. Greedy rows go through a temperature→argmax path. No Python
per-request loop on the hot path — that's the systems challenge, not the math.
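To make "branch-free, one pass" concrete, here is a minimal numpy sketch of mixed greedy and sampled rows handled without a per-request loop. The function name `batched_sample` and the inverse-CDF draw are illustrative assumptions, not vLLM's implementation; the real sampler works on GPU tensors carried by SamplingMetadata.

```python
import numpy as np

def batched_sample(logits, temperatures, rng):
    """Pick one token per row; rows with temperature 0 take the argmax.

    logits: (batch, vocab) floats; temperatures: (batch,) floats.
    Hypothetical helper for illustration only.
    """
    greedy = temperatures == 0.0
    # Avoid dividing by zero on greedy rows; their scaled logits are never used.
    safe_t = np.where(greedy, 1.0, temperatures)
    scaled = logits / safe_t[:, None]
    # Row-wise softmax, stabilized by subtracting each row's max.
    probs = np.exp(scaled - scaled.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Inverse-CDF draw for all rows at once: one uniform number per row.
    u = rng.random(len(logits))
    sampled = (probs.cumsum(axis=1) < u[:, None]).sum(axis=1)
    sampled = np.minimum(sampled, logits.shape[1] - 1)  # guard rounding at the tail
    # Per-row select between the two paths, with no Python branch.
    return np.where(greedy, logits.argmax(axis=1), sampled)

logits = np.array([[1.0, 3.0, 2.0], [0.5, 0.1, 4.0]])
temps = np.array([0.0, 0.7])
tokens = batched_sample(logits, temps, np.random.default_rng(0))
```

Row 0 is greedy, so it always returns the argmax regardless of the random draw; row 1 is sampled at its own temperature. Both are computed by the same vectorized ops, and the final `np.where` is the masked select the answer refers to.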
Q3. top-k vs top-p vs min-p?
Model answer
top-k keeps a fixed number of highest-prob tokens; top-p (nucleus) keeps the smallest set whose cumulative prob ≥ p (adaptive — few when confident, many when unsure); min-p keeps tokens with prob ≥ min_p × max_prob (a confidence-relative floor). top-p and min-p adapt to the distribution's shape; top-k doesn't.
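The three filters can be compared side by side on a confident and an unsure distribution. This is a small numpy sketch with hypothetical helper names (`top_k_mask`, `top_p_mask`, `min_p_mask`); it works on one row of probabilities and makes no claim about how vLLM's kernels are written.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def top_k_mask(probs, k):
    """Keep the k highest-probability tokens."""
    keep = np.zeros(probs.shape, dtype=bool)
    keep[np.argsort(-probs, kind="stable")[:k]] = True
    return keep

def top_p_mask(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = np.argsort(-probs, kind="stable")
    csum = np.cumsum(probs[order])
    # A token is kept if the mass accumulated before it is still below p.
    keep_sorted = (csum - probs[order]) < p
    keep = np.zeros(probs.shape, dtype=bool)
    keep[order[keep_sorted]] = True
    return keep

def min_p_mask(probs, min_p):
    """Keep tokens at least min_p times as likely as the most likely token."""
    return probs >= min_p * probs.max()

confident = softmax(np.array([8.0, 1.0, 0.5, 0.2, 0.1]))
unsure = softmax(np.array([1.0, 0.9, 0.8, 0.7, 0.6]))

kept = {
    "top_k=2": (top_k_mask(confident, 2).sum(), top_k_mask(unsure, 2).sum()),
    "top_p=0.9": (top_p_mask(confident, 0.9).sum(), top_p_mask(unsure, 0.9).sum()),
    "min_p=0.1": (min_p_mask(confident, 0.1).sum(), min_p_mask(unsure, 0.1).sum()),
}
```

On these inputs top-k keeps two tokens for both distributions, while top-p and min-p keep one token for the confident row and all five for the unsure row, which is the adaptivity described above.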
Q4. What is a logits processor and why is it the right abstraction?
Model answer
A hook that transforms logits at a defined point before sampling. It cleanly composes penalties,
logit bias, bad-words, and — crucially — structured-output grammar masks (Phase 12), all without
special-casing the sampler. Build it once and constrained decoding becomes "a processor that sets
illegal tokens to -inf." (logits_processor/interface.py.)
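A minimal sketch of the idea, assuming a processor is simply a callable that maps a logits vector to a new logits vector. The classes `LogitBias` and `AllowedTokens` and the helper `apply_processors` are hypothetical names for illustration; the real interface in logits_processor/interface.py is batched and stateful.

```python
import numpy as np

class LogitBias:
    """Adds a constant to chosen token ids (illustrative, not vLLM's class)."""
    def __init__(self, bias):
        self.bias = bias  # {token_id: additive bias}

    def __call__(self, logits):
        out = logits.copy()
        for token_id, value in self.bias.items():
            out[token_id] += value
        return out

class AllowedTokens:
    """Constrained decoding as a mask: tokens outside `allowed` become -inf."""
    def __init__(self, allowed):
        self.allowed = list(allowed)

    def __call__(self, logits):
        out = np.full_like(logits, -np.inf)
        out[self.allowed] = logits[self.allowed]
        return out

def apply_processors(logits, processors):
    """Run each processor in order; the sampler never needs to know what they do."""
    for processor in processors:
        logits = processor(logits)
    return logits

logits = np.array([2.0, 1.0, 0.5, 3.0])
out = apply_processors(logits, [LogitBias({1: 5.0}), AllowedTokens([0, 1, 2])])
```

Token 3 had the highest raw logit but is masked to -inf, and token 1 wins after its bias. Adding a grammar mask would be one more entry in the list, with no change to the sampling code.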
Q5. How does n>1 parallel sampling work efficiently?
Model answer
The prompt is prefilled once; the N samples share its KV blocks via prefix caching (Phase 2/3) and
diverge only after the first sampled token, each carrying its own RNG/params. So N completions cost
~one prefill plus N decodes, not N full requests. (parallel_sampling.py.) Beam search can't share
this way because it prunes/branches the active set each step.
Rapid-fire
- Greedy = ? temperature 0 = argmax.
- Pipeline order? penalties → temperature → top-k → top-p/min-p → sample.
- Per-request params live in? SamplingMetadata (tensors).
- The pre-sampling hook? logits processors.
- n>1 reuses? prefix caching (shared prompt KV).
Phase 09 — Cheatsheet: Sampling & Decoding Algorithms
The one-liner
Logits → pick a token. The pipeline (penalties → temperature → top-k → top-p/min-p → sample) runs vectorized across a heterogeneous batch, every row with its own params.
The knobs
- greedy = T=0 = argmax (deterministic)
- temperature T: <1 sharper, >1 flatter
- top-k: keep k highest; top-p: keep nucleus (cum prob ≥ p); min-p: keep prob ≥ min_p × max_prob
- penalties: repetition (multiplicative) / frequency (count-scaled) / presence (flat); logit bias; bad-words
Logits processors
The pluggable pre-sampling hook. One mechanism for penalties, bias, bad-words, AND grammar masks
(Phase 12: illegal tokens → -inf). logits_processor/{interface,builtin,state}.py.
Batching
Per-request params packed into tensors (SamplingMetadata); masked branch-free ops apply each
row's settings in one pass. No Python loop on the hot path.
Parallel sampling & beam search
n>1: one prefill, N samples share prompt KV (prefix caching), diverge after token 1
(parallel_sampling.py). Beam search: top-N partial seqs by cum log-prob; awkward in continuous
batching (active set changes), handled specially.
Key upstream
v1/sample/sampler.py:20 Sampler · :67 forward · :223 apply_temperature · :238 sample
v1/sample/ops/topk_topp_sampler.py · ops/penalties.py · ops/bad_words.py
v1/sample/logits_processor/ · v1/sample/metadata.py · sampling_params.py:168
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md
Phase 10 — The Hitchhiker's Guide to Distributed Inference
Contents
- Don't Panic
- Step 1: A team analogy
- Step 2: Tensor parallelism, concretely (the one to really get)
- Step 3: Pipeline parallelism and the bubble
- Step 4: DP, EP, CP in one line each
- Step 5: Who runs all this in vLLM
- The invariants to memorize
- What you'll do
Don't Panic
A big model doesn't fit on one GPU, or you want it to run faster than one GPU can. So you split the work across several GPUs. The only question is how you split — and there are a few distinct ways, each with its own pattern of GPU-to-GPU chatter. This phase is "which split when, and what crosses the wires." Get it wrong and half your GPUs sit idle talking to each other; get it right and you serve models no single GPU could hold.
A useful way to picture a model: a tall stack of layers, each layer a big multiplication. Now imagine a team of GPUs working on it. There are five ways to divide the labor:
TP (tensor parallel): split EACH layer's math across GPUs (everyone works on every token)
PP (pipeline parallel): give each GPU some LAYERS (token flows GPU0 → GPU1 → GPU2)
DP (data parallel): give each GPU a full COPY (split the USERS across copies)
EP (expert parallel): put different MoE EXPERTS on each GPU (route tokens to their expert's GPU)
CP (context parallel): split ONE long sequence's CONTEXT (each GPU holds part of the history)
You'll mostly reason about TP and PP (the big two), so we go deepest there.
Step 1: A team analogy
Imagine translating a huge book with a team:
- Tensor parallelism (TP) — everyone works on the same page at once, each person doing part of the work on that page, then they combine notes before moving on. Fast per page, but they have to talk constantly (combine notes after every step). Only works if they're in the same room (fast links — NVLink inside one machine).
- Pipeline parallelism (PP) — an assembly line: person 1 does chapters 1–3, hands off to person 2 for chapters 4–6, etc. Little talking (just hand the page along), works across rooms (across machines), but person 2 is idle until person 1 finishes the first page (a bubble).
- Data parallelism (DP) — everyone has their own copy of the whole book and translates different readers' requests. No coordination on the work itself; you just send each reader to whoever's free. Scales throughput, needs the model to fit on one GPU.
Step 2: Tensor parallelism, concretely (the one to really get)
Every layer is essentially y = x · W (a matrix multiply). TP splits W across GPUs. There are
two flavors, and the clever part is how they pair up.
Column-parallel — split W by output columns. Each GPU computes part of the output:
GPU0 computes y[:, left half] GPU1 computes y[:, right half]
result: glue the halves together (an "all-gather")
Row-parallel — split W by input rows, and split the input too. Each GPU computes a partial
of the whole output, and you add them up:
GPU0: y0 = x[:, left] · W[left, :] GPU1: y1 = x[:, right] · W[right, :]
result: y = y0 + y1 (an "all-reduce" — everyone shares and sums)
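Both flavors fit in a few lines of numpy. This is a minimal sketch assuming the guide's y = x · W layout, with W shaped (in, out); lab-01 uses the transposed (out, in) convention, and these function bodies are illustrative rather than the lab's reference solution.

```python
import numpy as np

def column_parallel(x, W, num_ranks):
    """Split W by output columns; concatenating the slices plays the all-gather."""
    shards = np.split(W, num_ranks, axis=1)        # each: (in, out / num_ranks)
    return np.concatenate([x @ W_r for W_r in shards], axis=1)

def row_parallel(x, W, num_ranks):
    """Split W by input rows and x by matching columns; summing plays the all-reduce."""
    W_shards = np.split(W, num_ranks, axis=0)      # each: (in / num_ranks, out)
    x_shards = np.split(x, num_ranks, axis=1)      # each: (batch, in / num_ranks)
    return sum(x_r @ W_r for x_r, W_r in zip(x_shards, W_shards))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))
W = rng.standard_normal((8, 4))
dense = x @ W
col_result = column_parallel(x, W, 2)
row_result = row_parallel(x, W, 4)
```

`np.split` raises if the dimension does not divide evenly across ranks, which is the same divisibility constraint real TP sizes have to respect.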
The trick vLLM uses: in a transformer block, do the first matmul column-parallel and the
second row-parallel. The column-parallel output stays split (no gluing needed), feeds straight
into the row-parallel input, and you pay just one all-reduce at the end of the block instead of
two communications. You'll implement exactly this in lab-01 and prove the multi-GPU result equals
the single-GPU one to machine precision (the sum is reordered, so only the last bits can differ).
Why TP needs fast links: that all-reduce happens in every block of every layer (dozens of times per token). If the GPUs aren't connected by something fast (NVLink), the chatter dominates and TP is slow. Rule of thumb: TP within a machine, PP across machines.
🆕 New words: all-reduce (every GPU sends its partial result and everyone gets the sum), all-gather (every GPU shares its piece and everyone gets the concatenation), collective (any such group communication, run by a library called NCCL).
Step 3: Pipeline parallelism and the bubble
PP puts layers 1–16 on GPU0 and 17–32 on GPU1. A token's data flows GPU0 → GPU1. The problem: while GPU0 works on the first chunk, GPU1 has nothing to do yet (the pipeline bubble). The fix is micro-batches: chop the work into many small pieces so that once the pipeline fills, every GPU is always busy on some piece. PP's communication is cheap (just pass activations along, GPU→GPU), so it scales across machines — at the cost of a little latency and bubble overhead.
GPU0: [mb1][mb2][mb3][mb4]
GPU1:      [mb1][mb2][mb3][mb4]   ← starts late (bubble), then stays busy
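The fill-then-busy picture above can be turned into numbers. The sketch below uses the bubble fraction (p−1)/(p+m−1) that lab-04 derives, with p pipeline stages and m microbatches, and cross-checks it against a tiny schedule grid. It assumes every stage takes one tick per microbatch; the function names are illustrative.

```python
def bubble_fraction(p, m):
    """Idle fraction of a fill-drain pipeline: p stages, m microbatches."""
    return (p - 1) / (p + m - 1)

def simulate_idle_fraction(p, m):
    """Build the schedule grid and count the idle cells."""
    total_ticks = p + m - 1
    grid = [[False] * total_ticks for _ in range(p)]
    for stage in range(p):
        for mb in range(m):
            grid[stage][stage + mb] = True   # stage s runs microbatch j at tick s + j
    idle = sum(row.count(False) for row in grid)
    return idle / (p * total_ticks)

def min_microbatches(p, budget):
    """Smallest m whose bubble fraction fits within the budget."""
    m = 1
    while bubble_fraction(p, m) > budget:
        m += 1
    return m

two_stage = bubble_fraction(2, 4)        # the 2-GPU, 4-microbatch diagram above
needed = min_microbatches(8, 0.10)       # p=8 under a 10% bubble budget
```

For the two-stage diagram the grid has 10 cells and 2 of them are idle. The loop in `min_microbatches` shows why deeper pipelines need many microbatches in flight to keep the bubble small.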
Step 4: DP, EP, CP in one line each
- Data parallelism (DP) — replicate the model; route different requests to different replicas. Pure throughput scaling; needs the model to fit on one GPU (or one TP group). vLLM also uses DP for MoE attention (run attention data-parallel while experts are expert-parallel).
- Expert parallelism (EP) — for MoE (Phase 7): put different experts on different GPUs; an all-to-all ships each token to its expert's GPU and back. Scales expert count; watch load balance.
- Context parallelism (CP) — split a single very long sequence's context (its KV cache) across GPUs, so you can serve contexts too long for one GPU's memory.
Real large deployments combine these: e.g. TP=8 within a node, PP=2 across two nodes, DP to add replicas, EP for the MoE layers. Picking the combination for a given model + SLA + cluster is a defining staff-level decision.
Step 5: Who runs all this in vLLM
From Phase 1: EngineCore → Executor → Worker → ModelRunner. For multi-GPU, the Executor
becomes a MultiprocExecutor that owns N worker processes, one per GPU. Each worker holds its
shard of the model and runs the same step in lockstep; the collectives (all-reduce etc.) happen
inside the layers (ColumnParallelLinear/RowParallelLinear). The "who is rank 0, which GPUs form
the TP group" bookkeeping lives in parallel_state.py. The beauty: the model code is identical —
it just uses parallel layers, and the engine fans out. That's why the same vLLM runs on 1 or 64
GPUs.
The invariants to memorize
- TP splits each layer (all-reduce every layer; NVLink-hungry; latency win; within a node).
- PP splits layers (cheap point-to-point; bubbles; scales across nodes).
- DP replicates + routes requests (throughput; model must fit).
- EP spreads MoE experts (all-to-all; load balance). CP splits one sequence's context.
- TP pattern: column-parallel then row-parallel → one all-reduce per block, and the multi-GPU result matches single-GPU up to floating-point rounding.
- Big deployments combine them; the Executor fans out to one worker process per GPU.
What you'll do
- Read: 01-deep-dive.md (parallel_state, the collectives, the parallel Linear layers, and the multiproc executor, line-anchored).
- Build: 02-mini-build.md (column/row-parallel matmul in numpy; prove the all-reduce reconstructs the single-GPU result).
- Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
  - lab-01-tp-sharding-math [CPU-OK]: implement TP and verify it equals the unsharded result.
  - lab-02-two-way-tp [GPU-OPT]: run tensor_parallel_size=2; observe the memory split (captured).
  - lab-03-tp-comm-cost [CPU-OK]: the ring all-reduce cost model; derive "TP within a node, never across" as an assert, and the decode-latency vs prefill-bandwidth regime split.
  - lab-04-pipeline-bubble [CPU-OK]: the PP bubble (p−1)/(p+m−1), derived as algebra AND as a simulated schedule grid that must reconcile exactly; why PP needs deep batching.
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
Phase 10 — Deep Dive: distributed inference in real vLLM
Paths relative to upstream/ at v0.22.1 @ 0decac0.
vllm/distributed/parallel_state.py        the source of truth for all parallel groups (TP/PP/DP/EP/CP)
vllm/distributed/communication_op.py      tensor_model_parallel_all_reduce / all_gather
vllm/model_executor/layers/linear.py      Column/Row/QKV ParallelLinear (TP in the layers)
vllm/v1/executor/multiproc_executor.py    the N-worker executor
vllm/v1/worker/gpu_worker.py              one worker = one GPU = one model shard
Contents
- 1. Who's in which group: parallel_state.py
- 2. The collectives: communication_op.py
- 3. TP in the layers: linear.py
- 4. The executor and workers
- 5. PP, DP, EP, CP pointers
- Reading checklist
1. Who's in which group: parallel_state.py
This file owns every process group. Key functions:
- init_distributed_environment (:1370) and initialize_model_parallel (:1506): set up the TP/PP/DP/EP/CP groups at startup from the configured sizes.
- get_tp_group (:1241), get_pp_group (:1260): the group a worker uses to communicate.
- get_tensor_model_parallel_world_size (:1849) / get_tensor_model_parallel_rank (:1854): "how many TP peers, which one am I." The parallel layers read these to know how to shard.
Mental model: this module answers, for each worker, "who are my teammates for each kind of parallelism, and what's my index?" Everything else (the layers, the executor) consults it.
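As a toy version of that bookkeeping, the sketch below lays global ranks out on a PP × TP grid and lists each group's members. It assumes TP is the fastest-varying axis (so TP peers are consecutive ranks) and ignores DP/EP/CP entirely; `build_groups` and `my_coordinates` are hypothetical helpers, not functions from parallel_state.py.

```python
def build_groups(world_size, tp_size, pp_size):
    """Return (tp_groups, pp_groups) for rank = pp_index * tp_size + tp_index."""
    assert world_size == tp_size * pp_size, "world size must equal TP x PP"
    # One TP group per pipeline stage: consecutive ranks.
    tp_groups = [list(range(s * tp_size, (s + 1) * tp_size)) for s in range(pp_size)]
    # One PP group per TP index: ranks strided by tp_size.
    pp_groups = [list(range(t, world_size, tp_size)) for t in range(tp_size)]
    return tp_groups, pp_groups

def my_coordinates(rank, tp_size):
    """Which TP peer and which pipeline stage a global rank is."""
    return {"tp_rank": rank % tp_size, "pp_rank": rank // tp_size}

tp_groups, pp_groups = build_groups(world_size=8, tp_size=4, pp_size=2)
coords = my_coordinates(rank=5, tp_size=4)
```

With TP=4 and PP=2 on 8 ranks, ranks 0 to 3 form the first stage's TP group, and rank 5 is TP peer 1 of stage 1. Keeping TP peers adjacent matches the rule of thumb that TP traffic should stay on the fastest links.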
2. The collectives: communication_op.py
tensor_model_parallel_all_reduce (:12) and tensor_model_parallel_all_gather (:17) are the
two operations TP needs (Step 2 of the guide). They wrap NCCL (the NVIDIA collective library) via
device communicators (distributed/device_communicators/). An all-reduce sums a tensor across
all TP ranks and gives everyone the result; an all-gather concatenates each rank's piece. These
are the "combine notes" steps from the analogy.
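The semantics of the two collectives can be simulated with a list that holds one array per rank. This is a sketch of what each rank ends up with, not of how NCCL moves the data; the function names mirror the operations but are local helpers.

```python
import numpy as np

def all_reduce(per_rank):
    """Every rank ends up with the elementwise sum of all ranks' tensors."""
    total = np.sum(per_rank, axis=0)
    return [total.copy() for _ in per_rank]

def all_gather(per_rank, axis=-1):
    """Every rank ends up with the concatenation of all ranks' pieces."""
    full = np.concatenate(per_rank, axis=axis)
    return [full.copy() for _ in per_rank]

ranks = [np.array([1.0, 2.0]), np.array([10.0, 20.0])]
reduced = all_reduce(ranks)    # same shape as one input, values summed
gathered = all_gather(ranks)   # shape grows by a factor of num_ranks
```

The shape difference is the thing to remember: an all-reduce returns a tensor the size of one rank's input, while an all-gather returns one that is num_ranks times larger along the gathered axis.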
3. TP in the layers: linear.py
This is where TP actually happens — and notice the model never calls a collective directly; the layers do.
- ColumnParallelLinear (:410): shards the weight by output dimension; each rank computes part of the output. Used for the first matmul in a block (QKV, gate/up). QKVParallelLinear (:975) and MergedColumnParallelLinear (:607) are specializations (they pack Q/K/V or gate/up into one sharded matmul).
- RowParallelLinear (:1392): shards by input dimension; each rank computes a partial of the full output, then all-reduces. Used for the second matmul (attention o_proj, MLP down).
The pairing (column then row) means the column output stays sharded and feeds the row input with no
intervening communication — one all-reduce per block (guide Step 2). Read RowParallelLinear.forward
and find the tensor_model_parallel_all_reduce call: that's the one communication. Your lab-01
reproduces this exact pattern and proves the result equals the unsharded matmul.
4. The executor and workers
vllm/v1/executor/multiproc_executor.py: class MultiprocExecutor(Executor) (:102),
execute_model (:306). It spawns one worker process per GPU and broadcasts each step's
SchedulerOutput to all of them; they run the forward in lockstep, exchanging collectives inside
the layers, and rank 0 returns the sampled tokens. vllm/v1/worker/gpu_worker.py: class Worker
(:109) holds one GPU's device, model shard, and KV cache; execute_model (:781) runs the
shard. So the Phase 1 chain (EngineCore → Executor → Worker → ModelRunner) just widens to N
workers for parallelism — the engine logic above it is unchanged.
5. PP, DP, EP, CP pointers
- PP: get_pp_group + the model splitting layers across ranks; activations are sent rank→rank (point-to-point) between pipeline stages, with micro-batching to fill the bubble.
- DP: replicas with request routing; also DP-attention for MoE models (attention DP while experts are EP).
- EP: fused_moe/all2all_utils.py (Phase 7) + distributed/eplb/ (expert load balancing).
- CP: context-parallel groups in parallel_state.py split one sequence's KV across ranks.
Reading checklist
- initialize_model_parallel: what groups does it create, and from what sizes?
- ColumnParallelLinear vs RowParallelLinear: what does each shard, and which one all-reduces?
- Find the single all_reduce in RowParallelLinear.forward.
- MultiprocExecutor: what does it broadcast, and how many worker processes for TP=4?
- Why is the model code unchanged whether TP=1 or TP=8?
Now build it: 02-mini-build.md, then the labs.
Phase 10 — Mini-Build: tensor parallelism in numpy
You'll implement column- and row-parallel matmuls and prove that splitting a layer across "GPUs"
and combining (all-gather / all-reduce) gives exactly the single-GPU result. No real GPUs — we
simulate num_ranks shards with array slicing. This makes TP concrete and dispels the "is the
math still correct?" worry for good.
The task (lab-01)
A linear layer is y = x @ W.T, with W shape (out, in). Implement:
- column_parallel(x, W, num_ranks): split W's rows (output dim) across ranks; each rank computes its slice y_r = x @ W_r.T; concatenate (the all-gather). Must equal x @ W.T.
- row_parallel(x, W, num_ranks): split W's columns (input dim) and x's columns across ranks; each rank computes a partial y_r = x_r @ W_r.T over the whole output; sum them (the all-reduce). Must equal x @ W.T.
- mlp_tp(x, W1, W2, num_ranks): the real transformer pattern. W1 column-parallel (keep output sharded), apply the activation per shard, W2 row-parallel (one all-reduce). Must equal the dense relu(x @ W1.T) @ W2.T (same (out, in) convention), with exactly one all-reduce.
The point (the invariant)
x @ W.T == all_reduce(x_shard @ W_shard.T) for row-parallel, and the column→row pairing needs only one
all-reduce per block. Your tests assert reconstruction equals the unsharded result to machine
precision, which is why TP is correct, not just plausible.
Definition of done
pytest phase-10-distributed-inference/labs -q
Map to the real engine
| your numpy | real vLLM |
|---|---|
| column_parallel | ColumnParallelLinear (linear.py:410) |
| row_parallel + sum | RowParallelLinear + tensor_model_parallel_all_reduce (linear.py:1392, communication_op.py:12) |
| mlp_tp (col→row, one all-reduce) | the MLP/attention block's TP pattern |
| num_ranks, rank slicing | parallel_state.py world size / rank (:1849/:1854) |
| (running it for real) | MultiprocExecutor + N workers (multiproc_executor.py:102) |
Phase 10 Labs — Distributed Inference
Four labs on splitting one model across many GPUs. The arc: prove tensor parallelism's algebra and the one-all-reduce pairing (lab-01), price its communication and derive the within-a-node rule (lab-03), meet the cross-node alternative and its bubble (lab-04), then watch TP=2 split a real model's weights and KV on real hardware (lab-02).
Recommended order: 01 → 03 → 04 → 02. (Directory numbers predate labs 03–04: math,
bill, alternative, demo.) CPU labs follow the standard contract — starter.py (your
work), solution.py (reference), test_lab.py (the spec); default runs the solution,
LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-10-distributed-inference/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-10-distributed-inference/labs/lab-01-tp-sharding-math -q
Contents
- lab-01-tp-sharding-math [CPU-OK]
- lab-02-two-way-tp [GPU-OPT]
- lab-03-tp-comm-cost [CPU-OK]
- lab-04-pipeline-bubble [CPU-OK]
- What you can do after this phase
Labs
lab-01-tp-sharding-math [CPU-OK]
Tensor parallelism as provable algebra: column-parallel (slice outputs, all-gather)
and row-parallel (slice inputs, all-reduce) reconstruct the dense result exactly, and
the Megatron column→row pairing makes a whole MLP cost one all-reduce — asserted by
a counter, not claimed. Includes the divisibility constraint that caps real TP sizes.
Skills: the two shardings; communication designed out, not optimized out; mapping to
ColumnParallelLinear/RowParallelLinear.
lab-02-two-way-tp [GPU-OPT]
tensor_parallel_size=2 live: two worker processes, 1.24 + 1.24 = 2.48 GiB of
weights, per-rank KV blocks, and output matching TP=1 (to the last ulp's mercy). The
observable surface of TP — and how to reconcile every log line against labs 01/03.
Annotated capture included. Skills: reading per-rank memory/block reports; lockstep
workers and the slowest-rank rule; when two TP=1 replicas beat TP=2.
lab-03-tp-comm-cost [CPU-OK]
The bill: 2 all-reduces × 32 layers × an 8 KB decode payload, priced with the ring formula on NVLink, PCIe, and Ethernet. Derives "TP within a node, never across" as an assert (>40% of the step lost to latency on 10 GbE) — and the subtler split: decode comm is latency-bound, prefill comm is bandwidth-bound, so the right interconnect depends on the workload. Skills: the ring all-reduce cost model; latency vs bandwidth regimes; pricing EP's all-to-all with the same tools.
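A back-of-envelope version of that bill, using the textbook ring all-reduce model: 2(N−1) steps, each paying one link latency and moving payload/N bytes. The link numbers in `LINKS` are illustrative assumptions, not measurements, and the function names are hypothetical; the 8 KB payload and the 64 all-reduces per step (2 × 32 layers) come from the lab description above.

```python
def ring_all_reduce_seconds(payload_bytes, num_ranks, latency_s, bandwidth_bytes_per_s):
    """Return (latency_term, bandwidth_term) for one ring all-reduce."""
    steps = 2 * (num_ranks - 1)
    latency_term = steps * latency_s
    bandwidth_term = steps * (payload_bytes / num_ranks) / bandwidth_bytes_per_s
    return latency_term, bandwidth_term

# Assumed link parameters for illustration only.
LINKS = {
    "nvlink": {"latency_s": 2e-6, "bandwidth_bytes_per_s": 300e9},
    "ethernet": {"latency_s": 50e-6, "bandwidth_bytes_per_s": 1.25e9},  # 10 GbE
}

def step_comm_seconds(payload_bytes, num_ranks, link, all_reduces_per_step=64):
    """Total communication time for one forward step on the given link."""
    lat, bw = ring_all_reduce_seconds(payload_bytes, num_ranks, **LINKS[link])
    return all_reduces_per_step * (lat + bw)

decode_payload = 8 * 1024              # one token's activation, 8 KB
prefill_payload = 2048 * 8 * 1024      # a 2048-token prompt chunk (assumed size)

decode_lat, decode_bw = ring_all_reduce_seconds(decode_payload, 2, **LINKS["ethernet"])
prefill_lat, prefill_bw = ring_all_reduce_seconds(prefill_payload, 2, **LINKS["ethernet"])
```

Under these assumed numbers the decode all-reduce is dominated by the latency term and the prefill all-reduce by the bandwidth term, which is the regime split the lab asks you to derive. Swapping in your own link parameters changes the magnitudes but not the structure of the argument.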
lab-04-pipeline-bubble [CPU-OK]
The cross-node alternative: stages by layer, one activation handoff per boundary —
and the bubble, (p−1)/(p+m−1), derived twice (algebra and a simulated schedule grid
that must reconcile exactly). p=8 under a 10% bubble budget needs 63 in-flight
microbatches: PP's economics are batch economics. Skills: fill-drain geometry; PP
buys throughput and nothing for latency; TP×PP composition; stragglers, third
appearance.
What you can do after this phase
Decide, from arithmetic, how to place a model on a cluster: minimum TP for fit, TP vs data-parallel replicas for throughput, TP×PP composition across nodes, and what each choice costs in collectives or bubbles; read a distributed deployment's startup logs as a checksum of the sharding; and debug the classics (slow rank drags the ensemble, cross-node TP melting p99, PP starving at low traffic) from models you built rather than lore. Phase 15 splits the workload (prefill from decode) where this phase split the model.
Lab 10-01 — Tensor Parallelism Math [CPU-OK]
A 70B model's weights don't fit on your GPU. Tensor parallelism's answer is almost insolent in its simplicity: a matrix multiply distributes over slicing — cut the weight matrix into N pieces, give each GPU one piece, and the partial results reassemble into exactly the unsharded answer. This lab makes you prove it, in numpy, with the two sharding patterns that production TP is built from — column-parallel (slice outputs, reassemble by concatenation = all-gather) and row-parallel (slice inputs, reassemble by summation = all-reduce) — and then the composition trick that makes a whole transformer block cost only one all-reduce: pair them column→row, and the intermediate never needs reassembling at all.
Contents
- Why this lab exists
- Background: two shardings and the pairing trick
- Files
- Run
- What to implement
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Distributed inference has a reputation for being infrastructure wizardry — Ray
clusters, NCCL, process groups — and that reputation obscures the fact that the core
is linear algebra a laptop verifies in milliseconds. Separating the two layers is the
point of this lab: the math (which sharding produces which partial result, and what
collective reassembles it) is exact, provable, and small; the infrastructure (Phase
10's deep-dive: process groups, communicators, weight loaders that shard at load time)
exists to execute that math. Engineers who learn the infrastructure first treat
ColumnParallelLinear as an incantation; engineers who learn the math first read it
as "my column_parallel, with NCCL where my np.concatenate is."
The one-all-reduce composition is the part that earns the word design. Naively
sharding two consecutive matmuls costs a collective after each. The Megatron insight —
which every serving stack inherited — is that the column shard's output is already
partitioned exactly the way the row shard's input wants it: the activation flows from
shard to shard without ever being whole. Communication is designed out, not optimized
out. You'll assert it: num_all_reduces == 1.
Background: two shardings and the pairing trick
For y = x @ W.T (W: (out, in)):
- Column-parallel (shard W's output rows): rank r computes x @ W_r.T, a slice of y's columns. Reassembly = concatenation (all-gather). Every rank needs all of x, which it has (the previous all-reduce ended with everyone holding the full activation).
- Row-parallel (shard W's input columns): rank r holds W[:, r·c:(r+1)·c] and only the matching slice of x, computing a full-shaped but partial y_r. Reassembly = elementwise sum (all-reduce).
The MLP composition: W1 column-parallel → each rank holds a slice of the hidden
activation → apply the nonlinearity per-shard (elementwise, so it commutes with
slicing — this is why the trick works for ReLU/SiLU but would break for anything
mixing hidden dims) → W2 row-parallel consumes exactly that slice → one all-reduce
at the end. Attention follows the same pattern with heads as the natural column
boundary: QKV projections column-parallel (each rank owns whole heads), out-proj
row-parallel. Two blocks per layer, one all-reduce each — lab-03 prices them.
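The MLP composition described above can be sketched as follows. This is one possible shape for `mlp_tp`, assuming the (out, in) weight convention with W1 shaped (hidden, in) and W2 shaped (out, hidden), ReLU as the activation, and a counter returned alongside the result; it is not the lab's reference solution.

```python
import numpy as np

def mlp_tp(x, W1, W2, num_ranks):
    """Column-parallel W1, per-shard ReLU, row-parallel W2.

    Returns (y, num_all_reduces). Shapes follow y = x @ W.T.
    """
    hidden = W1.shape[0]
    assert hidden % num_ranks == 0, "hidden dim must divide evenly across ranks"
    chunk = hidden // num_ranks
    num_all_reduces = 0

    partials = []
    for r in range(num_ranks):                    # one iteration = one simulated GPU
        rows = slice(r * chunk, (r + 1) * chunk)
        h_r = np.maximum(x @ W1[rows].T, 0.0)     # this rank's slice of the hidden activation
        partials.append(h_r @ W2[:, rows].T)      # full-shaped but partial output

    y = np.sum(partials, axis=0)                  # the only collective: one all-reduce
    num_all_reduces += 1
    return y, num_all_reduces

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))
W1 = rng.standard_normal((16, 8))
W2 = rng.standard_normal((4, 16))
dense = np.maximum(x @ W1.T, 0.0) @ W2.T
y, count = mlp_tp(x, W1, W2, num_ranks=4)
```

Notice that the hidden activation `h_r` is never concatenated: each rank's slice goes straight into its slice of W2, and the only cross-rank operation is the final sum.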
Files
- starter.py: column_parallel, row_parallel, mlp_tp. Your work.
- solution.py: reference.
- test_lab.py: exact reconstruction for several rank counts, the one-all-reduce property, and the divisibility constraint.
Run
LAB_IMPL=starter pytest phase-10-distributed-inference/labs/lab-01-tp-sharding-math -q
pytest phase-10-distributed-inference/labs/lab-01-tp-sharding-math -q # reference
What to implement
Per 02-mini-build.md. The loop over ranks is the
simulation — each iteration is one GPU's life; the concatenate and the running sum
are the collectives. Keep that mapping conscious: when you later read real TP code,
every line will be one of your loop bodies with the loop distributed across processes.
What the tests prove
| Test | What it pins |
|---|---|
| column/row reconstruct x @ W.T exactly | Sharding is algebra, not approximation: to machine precision, for num_ranks ∈ {1, 2, 4, 8} (and rank-count invariance is itself the deployment-critical property: TP=4 and TP=8 must serve identical models) |
| mlp_tp == dense MLP with num_all_reduces == 1 | The Megatron pairing: the hidden activation never reassembles. The counter in the return value is the design, made falsifiable |
| divisibility asserted | hidden % num_ranks == 0 — why TP sizes are powers of two and why some models can't run at TP=6: head counts and hidden dims must divide. A real constraint users hit (GQA's 8 KV heads cap practical TP at 8 without head replication) |
Hitchhiker's notes
- Floating point note: the row-parallel sum reorders additions vs the dense matmul, so on real hardware TP=2 and TP=1 differ in the last ulp — the recurring last-ulp story (Phases 3/4/6/9), now with rank count as the trigger. Your float64 numpy hides it; fp16 GPUs don't. "Different outputs at different TP sizes" bug reports are usually this, not a bug.
- Map to upstream: ColumnParallelLinear/RowParallelLinear in upstream/vllm/model_executor/layers/linear.py. Find the single all_reduce in the row class's forward (linear.py:1392), and notice gather_output=False on the column class: the default is the paired pattern, all-gather elided. Model code composes these two classes and TP falls out, which is why adding a new model (Phase 14) barely thinks about TP.
- Weights are sharded at load time, not runtime: each rank reads only its slice from the checkpoint (the weight loader's shard_id machinery). The lab's W[r*chunk:(r+1)*chunk] is, in production, a file-read pattern: TP=8 startup reads each tensor once across 8 processes. Loading is part of the sharding design, not an afterthought.
- Embedding and LM head shard on the vocabulary dimension (vocab-parallel): same two patterns, different axis, with a gather at the logits. Every weight matrix in the model has a natural slicing axis; TP is the discipline of choosing axes so the collectives stay rare.
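The floating-point note at the top of this list can be reproduced without a GPU. The sketch below compares a left-to-right sum with a two-rank "partial sums, then combine" order on plain Python floats (float64); the values are chosen only because their rounding behavior is well known, and the helper names are illustrative.

```python
def serial_sum(values):
    """Dense order: accumulate left to right, as one device would."""
    total = 0.0
    for v in values:
        total += v
    return total

def sharded_sum(values, num_ranks):
    """Row-parallel order: each rank sums its slice, then the partials are added."""
    chunk = len(values) // num_ranks
    partials = [serial_sum(values[r * chunk:(r + 1) * chunk]) for r in range(num_ranks)]
    return serial_sum(partials)

values = [0.0, 0.1, 0.2, 0.3]
dense = serial_sum(values)         # ((0.0 + 0.1) + 0.2) + 0.3
tp2 = sharded_sum(values, 2)       # (0.0 + 0.1) + (0.2 + 0.3)
difference = abs(dense - tp2)
```

The two results differ by about one unit in the last place, with no bug anywhere: addition in floating point is not associative, so changing the rank count changes the grouping. In lower-precision formats the same effect is larger, which is why a near-tie between two logits can occasionally flip.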
Going further
- Add attention_tp(x, Wqkv, Wo, num_heads, num_ranks): heads as the column boundary, out-proj row-parallel, assert one all-reduce and head-count divisibility. You've now sharded both halves of a real layer.
- Implement gather_output=True (the elided all-gather) and count collectives for the unpaired composition: two matmuls sharded naively. The diff against mlp_tp's 1 is the Megatron paper's contribution, measured by your counter.
- Simulate a wrong sharding: shard W1 row-parallel (along its input dim) instead of column-parallel, and watch the nonlinearity break the reconstruction (ReLU of a partial sum ≠ partial of a ReLU). The elementwise-commutes-with-slicing condition, demonstrated by violating it.
References
- Shoeybi et al., Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (2019) — the column→row pairing, Figure 3: https://arxiv.org/abs/1909.08053
- upstream/vllm/model_executor/layers/linear.py: the two classes and the one all-reduce (:1392).
- upstream/vllm/model_executor/layers/vocab_parallel_embedding.py: the same idea on the vocab axis.
- Lab-03: what the one all-reduce costs; lab-02: the memory split, observed live.
Lab 10-02 — Two-Way Tensor Parallelism [GPU-OPT]
The math (lab-01) said each rank holds 1/N of every matrix; the cost model (lab-03)
said in-node links make the collectives cheap. This lab is where you watch both claims
cash out on real hardware: tensor_parallel_size=2 spawns two worker processes, each
reporting half the weight memory (1.24 GiB where TP=1 reported 2.48), each carving
its own KV blocks from its own leftover HBM — and a model generates coherent text
while no single GPU ever holds all of it. The startup log is the lab; reading it
against the two CPU labs is the work.
No GPU pair? Don't panic. The captured run below is annotated line by line; the reconciliation exercises need only the numbers.
Contents
- Why this lab exists
- Requirements
- Steps
- Captured output (real run, opt-1.3b, 2×L4, vLLM 0.22.1, trimmed)
- Reading the numbers
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Two reasons. First, the observable surface of TP — worker processes, per-rank memory reports, per-rank block counts, NCCL initialization lines — is what you'll actually have in front of you during a production incident, and learning to read it against the underlying sharding math is the diagnostic skill (is rank 1's memory wildly different from rank 0's? Something's wrong with sharding or loading. Do blocks per worker × TP ≈ expected total KV? If not, where did the HBM go?). Second, TP is the first feature in this course that changes the process model: the engine becomes a coordinator of N workers in lockstep, every scheduler decision (Phase 3) broadcast to all ranks, every forward a synchronized ensemble. Several later phases (15's disaggregation, 17's platform plugins) build on that worker abstraction, and this is where you first see it breathe.
Requirements
uv pip install -e ".[vllm]" # needs 2 visible CUDA GPUs for the live run
huggingface-cli download facebook/opt-1.3b
(opt-1.3b: big enough that halving its 2.6 GB is visible in the logs, small enough to also run TP=1 on one card for the baseline — you want both runs for the diff.)
Steps
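from vllm import LLM, SamplingParams
# tensor_parallel_size=2 shards every layer across two GPUs, one worker process each.
llm = LLM(model="facebook/opt-1.3b", tensor_parallel_size=2, gpu_memory_utilization=0.8)
# temperature=0 is greedy decoding, so the TP=1 and TP=2 outputs can be diffed.
params = SamplingParams(max_tokens=32, temperature=0)
outputs = llm.generate(["Distributed inference means"], params)
print(outputs[0].outputs[0].text)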
Then the baseline (tensor_parallel_size=1) and the three comparisons: weight memory
per worker (should halve), # GPU blocks per worker (should roughly double total —
see below), and the generated text (should match the TP=1 output token-for-token...
almost — see the last-ulp note).
Captured output (real run, opt-1.3b, 2×L4, vLLM 0.22.1, trimmed)
INFO ... Started 2 worker processes (tensor_parallel_size=2)
INFO (Worker_TP0) ... Model weights take 1.24 GiB # half of 2.48 GiB
INFO (Worker_TP1) ... Model weights take 1.24 GiB # the other half
INFO ... # GPU blocks: 28,500 (per worker) # KV also splits across TP ranks
... distributed inference means splitting one model across multiple GPUs ...
# single-GPU baseline (tensor_parallel_size=1): Model weights take 2.48 GiB on one GPU
Reading the numbers
- 1.24 + 1.24 = 2.48 — lab-01's W[r*chunk:(r+1)*chunk], weighed. Every linear layer's shards, embedding's vocab slices, all summing back to the whole. If the two workers ever report different weight sizes, some tensor didn't shard (replicated layers — norms, biases — are expected and tiny; a large asymmetry means a loader bug).
- # GPU blocks: 28,500 per worker — the subtle one. Each rank's KV per token is also halved (it caches only its own heads' K/V — attention sharding splits the cache naturally), and each rank carves blocks from its own freed-up HBM. Per-rank block count × block tokens ≈ same token capacity per rank as... work it through with Phase 2 lab-03's arithmetic: weights halved → more free HBM per GPU; KV per token halved → more tokens per GiB. Both effects push capacity up — TP=2 roughly doubles total concurrent tokens, which is the capacity story that often justifies TP even when the model would fit on one GPU.
- The generated text matches TP=1 — semantically always, token-for-token usually. The all-reduce reorders fp16 additions (lab-01's note), so a near-tie can flip a token. Greedy + short output usually survives; if you diff long generations and find one divergence at position 200, you've observed the last-ulp story, not a bug.
- What you don't see: the per-step all-reduces (two per layer: 64 in lab-03's 32-layer example, 48 for opt-1.3b's 24 layers) — invisible in logs, visible only as the gap between ideal 2× latency scaling and what you measure. Time a single-stream generation under TP=1 vs TP=2: the ITL improvement lands under 2×, short by roughly the comm fraction your lab-03 model predicts for your link.
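The capacity argument in the bullets above can be checked with a few lines of arithmetic. This is a minimal sketch under stated assumptions, not vLLM's memory profiler: the function names, the 24 GiB card, the fixed 1 GiB reserve for activations and overhead, and the 24-layer, 2048-wide fp16 model shape are all illustrative inputs, and the result is not expected to reproduce the captured log.

```python
GIB = 2**30

def kv_bytes_per_token_per_rank(num_layers, hidden, dtype_bytes=2, tp=1):
    # K and V for every layer; heads (and therefore the cache) split across TP ranks.
    return 2 * num_layers * hidden * dtype_bytes // tp

def blocks_per_worker(hbm_bytes, utilization, weight_bytes, overhead_bytes,
                      kv_per_token, block_size=16):
    # HBM left after weights and a fixed reserve, carved into fixed-size KV blocks.
    budget = int(hbm_bytes * utilization) - weight_bytes - overhead_bytes
    return max(budget, 0) // (kv_per_token * block_size)

def token_capacity(tp, total_weight_bytes, num_layers=24, hidden=2048,
                   hbm_bytes=24 * GIB, utilization=0.8,
                   overhead_bytes=GIB, block_size=16):
    # All shape and memory defaults here are assumed, illustrative values.
    kv = kv_bytes_per_token_per_rank(num_layers, hidden, tp=tp)
    blocks = blocks_per_worker(hbm_bytes, utilization,
                               total_weight_bytes // tp, overhead_bytes,
                               kv, block_size)
    # Every rank stores its own head-slice of the same tokens, so capacity
    # in tokens is per-rank blocks times block size (not times TP).
    return blocks * block_size

weights = int(2.48 * GIB)
tp1 = token_capacity(1, weights)
tp2 = token_capacity(2, weights)
print(tp1, tp2, round(tp2 / tp1, 2))
```

With these inputs the ratio lands a little above 2: halving the per-token KV gives the factor of two, and the weight memory freed on each GPU adds the rest.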
Hitchhiker's notes
- Process topology: TP workers are separate processes (one per GPU), not threads — CUDA contexts, NCCL communicators, and Python's GIL all push that way. The engine core broadcasts each step's scheduler output to all ranks; they execute the identical step in lockstep and the rank-0 worker returns logits for sampling. Lockstep means the slowest rank sets the pace — a thermally-throttled GPU in a TP group drags the ensemble, a classic and maddening production hunt (symptom: TP=4 slower than TP=2; cause: one card at 70% clocks).
- CUDA_VISIBLE_DEVICES and placement matter: TP wants the GPUs with the fastest mutual links (same NVLink island / NUMA node). On mixed-topology machines, nvidia-smi topo -m before choosing — lab-03's bill varies by which pair you pick on the same box.
- When TP=2 is the wrong tool: model fits on one GPU and you're throughput-bound — two independent TP=1 replicas (data parallelism) beat TP=2 (no comm tax, perfect scaling, simpler ops). TP earns its tax only for fit-or-latency reasons (lab-03's notes). "We have two GPUs so we set TP=2" is the most common distributed-inference misconfiguration in the wild.
- Startup is slower under TP — N processes, NCCL rendezvous, sharded loading, graph capture per rank (Phase 5's cost, ×N but parallel). Budget it in deploy pipelines; it's the TP line item people forget.
Reflect
- Reconcile per-worker blocks with Phase 2 lab-03's formula: weights/rank = 1.24 GiB, KV/token/rank = half of lab 0-02's per-token bytes. Predict # GPU blocks for TP=2 on your card before reading the log. Within 10%?
- A 70B fp16 model (~140 GiB weights), 80 GiB GPUs: what's the minimum TP, and what does lab-03 say about running it across two 8-GPU nodes at TP=16 vs TP=8 × PP=2? (TP=2 minimum for fit; cross-node TP=16 pays two latency-bound all-reduces per layer over IB per token (64 in lab-03's 32-layer example, more for a deeper 70B) vs PP=2's single activation handoff — the composition lab-04 closes.)
- Why does vLLM broadcast scheduler decisions rather than letting each rank run its own scheduler? (The ranks must execute byte-identical steps — same batch, same block tables — or the all-reduces would be summing mismatched partials. One brain, N hands; determinism across ranks is a correctness requirement, not a preference.)
References
- upstream/vllm/v1/executor/ and upstream/vllm/v1/worker/ — the multiprocess executor and worker lockstep.
- upstream/vllm/distributed/parallel_state.py — process groups and communicator setup (the NCCL lines in your startup log).
- vLLM docs, Distributed Inference and Serving — TP/PP configuration and the placement guidance: https://docs.vllm.ai/en/latest/serving/distributed_serving.html
- Labs 01 (the math), 03 (the bill), 04 (the cross-node alternative) — this run is their joint demo.
Lab 10-03 — The TP Communication Bill [CPU-OK]
Lab-01 proved tensor parallelism is mathematically free — exact reconstruction, one all-reduce per block. This lab prices what "one all-reduce" costs physically, and the answer derives the most-quoted deployment rule in distributed inference from four multiplications: TP within a node, never across. Same model, same math, same code — on NVLink the communication is noise (<10% of a decode step), on 10 GbE it's fatal (>40%, latency alone). You'll also derive the subtler corollary most people miss: for decode, the bill is dominated by latency, not bandwidth — 64 tiny 8 KB messages per token — which is why fancy interconnect bandwidth numbers don't save cross-node TP and why prefill and decode want different links.
Contents
- Why this lab exists
- Background: what gets sent, how often, and how
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
"TP needs fast interconnect" is folklore until you can compute how fast, for which workload, and what happens if you ignore it. Those computations decide real money: whether a 70B model needs an NVLink-equipped node or can spread across two cheaper ones (it can't — not with TP; that's what PP is for, lab-04), whether TP=8 beats TP=4 for your latency target, whether a cloud's "high-bandwidth networking" claim is relevant (check the latency; for decode it usually matters more). This lab builds the five-function model that answers all of them on a napkin — the distributed sibling of Phase 0 lab-04's roofline, and like it, a model whose domain of validity you'll know because you built it.
It's also the quantitative half of a design story the phase tells in two parts: TP (this lab) pays communication per layer and demands fat links but splits every matrix; PP (lab-04) pays per stage boundary and tolerates thin links but idles GPUs in bubbles. Every real deployment of a big model is a negotiation between these two bills, and you're about to be able to compute both sides.
Background: what gets sent, how often, and how
What: after each RowParallelLinear (lab-01), every rank holds a partial sum of
the activation; the all-reduce sums them. Payload = the activation tensor:
batch_tokens × hidden × dtype_bytes. For one decode token of an 8B model: 4096 × 2 =
8 KB. Tiny. For a 2048-token prefill chunk: 16 MB. Not tiny. Same operation,
three orders of magnitude apart — keep both numbers in mind; they split the analysis.
How often: twice per layer (attention out-proj, MLP down-proj) × 32 layers = 64 all-reduces per step, every step, forever. Communication frequency is set by model depth, not by anything you can tune.
How: ring all-reduce — reduce-scatter then all-gather, each rank sending
2·(N−1)/N × payload total across 2(N−1) sequential hops. The formula's two terms
are the lab's two regimes: traffic / bandwidth (dominates for big payloads:
prefill) and 2(N−1) × latency (dominates for small ones: decode). A 3 µs NVLink hop
vs a 50 µs Ethernet round-trip is the 17× that, multiplied by 64 all-reduces, becomes
the node boundary.
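The payload, frequency, and ring terms above combine into a small cost model. The sketch below is one hedged way to write it, not the lab's reference solution: the helper names are hypothetical, the 300 GB/s NVLink bandwidth is an assumed round figure, the 50 µs Ethernet figure is used as a per-hop latency, and the comm share is taken to be comm / (comm + compute) against the 8 ms decode step.

```python
def payload_bytes(batch_tokens, hidden, dtype_bytes=2):
    # One activation row per token in the step.
    return batch_tokens * hidden * dtype_bytes

def ring_allreduce_time_s(payload, n_ranks, bandwidth_bytes_per_s, hop_latency_s):
    # Bandwidth term: 2(N-1)/N x payload moved per rank.
    # Latency term: 2(N-1) sequential hops.
    if n_ranks <= 1:
        return 0.0
    traffic = 2 * (n_ranks - 1) / n_ranks * payload
    return traffic / bandwidth_bytes_per_s + 2 * (n_ranks - 1) * hop_latency_s

def comm_share(comm_s, compute_s):
    # Assumed definition: share of the whole step spent communicating.
    return comm_s / (comm_s + compute_s)

# (bandwidth in bytes/s, per-hop latency in s); both rows are assumptions.
LINKS = {"nvlink": (300e9, 3e-6), "10gbe": (1.25e9, 50e-6)}

decode = payload_bytes(1, 4096)          # 8 KB
for name, (bw, lat) in LINKS.items():
    step = 64 * ring_allreduce_time_s(decode, 2, bw, lat)
    print(name, f"{step * 1e3:.2f} ms", f"{comm_share(step, 8e-3):.0%}")
```

Swapping the `LINKS` entries for your own fabric's measured bandwidth and latency is the intended use; the formula stays the same.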
Files
- starter.py — allreduce_payload_bytes, ring_allreduce_traffic_per_rank, allreduce_time_s, tp_comm_time_per_step, comm_fraction. Your work.
- solution.py — reference.
- test_lab.py — the formulas, the NVLink-vs-Ethernet verdict, both latency/bandwidth regimes, and the more-ranks-more-overhead direction.
Run
LAB_IMPL=starter pytest phase-10-distributed-inference/labs/lab-03-tp-comm-cost -q
pytest phase-10-distributed-inference/labs/lab-03-tp-comm-cost -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_payload_is_one_activation_row_per_token | The 8 KB decode payload — memorize it; it's why decode TP is a latency problem |
| test_ring_traffic_formula | 2(N−1)/N: at N=2 each rank moves exactly one payload; as N grows it approaches 2× — traffic per rank is nearly constant in N (the ring's genius), it's the hop count that grows |
| test_decode_step_comm_on_nvlink_is_noise | 64 all-reduces on NVLink < 1 ms, < 10% of the 8 ms decode step (Phase 0 lab-04's number) — TP=2 in-node is nearly free |
| test_decode_step_comm_on_ethernet_is_fatal | Same step over 10 GbE: latency alone is 64 × 2 × 50 µs = 6.4 ms, > 40% of the step. The "never TP across nodes" rule, as an assert |
| test_latency_dominates_small_payloads / test_prefill_payloads_shift_the_balance_to_bandwidth | The regime split: for decode, halving latency beats doubling bandwidth; for prefill, the reverse. One model, two correct answers to "what should we buy?" |
| test_more_ranks_more_overhead | TP=8 > TP=2 in comm time: TP scaling is sub-linear by construction, before any software inefficiency |
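The regime split in the table can be seen by turning one knob at a time. This is a hedged sketch, not the lab's test code: the function name is hypothetical, and the 10 GbE figures (1.25 GB/s, 50 µs per hop) are assumed inputs to the same two-term ring formula.

```python
def ring_allreduce_time_s(payload, n, bandwidth, latency):
    # Assumed cost model: bandwidth term plus 2(N-1) sequential hop latencies.
    if n <= 1:
        return 0.0
    return 2 * (n - 1) / n * payload / bandwidth + 2 * (n - 1) * latency

BW, LAT = 1.25e9, 50e-6                       # assumed 10 GbE link
DECODE, PREFILL = 8 * 1024, 16 * 1024**2      # 8 KB and 16 MB payloads

def compare(payload, n=2):
    # Returns (baseline, with latency halved, with bandwidth doubled).
    base = ring_allreduce_time_s(payload, n, BW, LAT)
    half_latency = ring_allreduce_time_s(payload, n, BW, LAT / 2)
    double_bw = ring_allreduce_time_s(payload, n, 2 * BW, LAT)
    return base, half_latency, double_bw

for label, payload in (("decode", DECODE), ("prefill", PREFILL)):
    base, half_latency, double_bw = compare(payload)
    better = "latency" if half_latency < double_bw else "bandwidth"
    print(f"{label}: improving {better} helps more")
```

The same function also shows the rank-count direction: at a fixed payload, going from N=2 to N=8 leaves per-rank traffic under 2× the payload but multiplies the hop count by seven.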
Hitchhiker's notes
- Why TP at all, if it taxes every layer? Three reasons, in order of importance: the model doesn't fit on one GPU (the usual one); per-token latency — TP divides the weight-streaming time, so a bandwidth-bound decode step (Phase 0 lab-04's 8 ms) genuinely drops toward 8/N ms + comm, the only lever that shortens single-stream ITL on a too-slow GPU; and KV capacity — the cache splits across ranks too (lab-02's halved # GPU blocks per worker is per-rank; total capacity grows). The comm bill is what you pay for all three.
- vLLM's custom all-reduce: for small payloads (exactly the decode case), NCCL's general ring is beaten by a one-shot fused kernel over NVLink peer access — upstream/vllm/distributed/device_communicators/custom_all_reduce.py exists precisely because of the latency term you just modeled. When you read "custom allreduce disabled" in a startup log, you now know which workloads care.
- The model's omissions (know them before quoting it): overlap — real engines overlap some comm with compute, shaving the visible fraction; NVSwitch topology — 8-GPU nodes all-reduce at near-constant time rather than ring-scaling; and cross-node fabrics like InfiniBand (~2–5 µs, 50–400 Gb/s) sit between your NVLink and Ethernet endpoints — rerun the numbers for IB and you'll see why cross-node TP is merely painful rather than absurd on real clusters, and why it's still avoided when PP can serve.
- Hidden size moves the bill linearly — a 70B model (hidden 8192) doubles every payload, and its compute per step is ~9× bigger; comm fraction actually improves with model size. Small models are the worst TP candidates twice over (less to split, same hop count).
Going further
- Add an overlap_fraction parameter (comm hidden under compute) and find the break-even overlap that makes TP=2-over-IB match TP=2-over-NVLink for decode. You've quantified what async/overlapped all-reduce engineering is worth.
- Model TP × batch: comm payload grows with batch but compute grows too — plot comm fraction vs batch size for decode and find where Ethernet TP becomes tolerable (large-batch throughput serving — which is exactly when you didn't need TP's latency win anyway; the conclusion writes itself).
- Compute the bill for expert parallelism's all-to-all (Phase 7 lab-04's missing line): payload = routed tokens × hidden, frequency = 2 per MoE layer. Compare against TP's — you'll see why DeepSeek-scale MoE deployments obsess over network topology in a way dense-model TP never had to.
References
- upstream/vllm/distributed/device_communicators/custom_all_reduce.py — the latency-term workaround, in production.
- NVIDIA NCCL docs, Collective Operations — ring/tree algorithms and their cost models: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html
- Shoeybi et al., Megatron-LM (2019) — the column→row TP scheme and its two all-reduces per layer: https://arxiv.org/abs/1909.08053
- Pope et al., Efficiently Scaling Transformer Inference (2022) — §3's communication analysis, the rigorous version of this lab: https://arxiv.org/abs/2211.05102
- Lab-01 — the math being priced; lab-04 — the alternative with the opposite bill.
Lab 10-04 — The Pipeline-Parallel Bubble [CPU-OK]
Lab-03 closed one door: TP across slow links is fatal — 64 latency-bound all-reduces
per token see to that. Pipeline parallelism is what's left when the model is too big
for one node: cut by layers into stages, and the only communication is handing one
activation tensor to the next stage — point-to-point, once per stage boundary,
indifferent to link latency in a way TP can only envy. The catch has a name and a
closed form: the bubble — (p−1)/(p+m−1) of the pipeline's capacity idles during
fill and drain — and this lab has you derive it twice: as algebra, and as a schedule
grid you build cell by cell, where the two derivations must reconcile exactly (the
test counts idle cells and divides). One microbatch through four stages wastes 75% of
the hardware; the entire craft of PP is making m large enough that the formula
forgives you.
Contents
- Why this lab exists
- Background: the fill-drain geometry
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
PP is the parallelism people deploy reluctantly — and the bubble formula is the entire content of that reluctance, so you should own it cold. It answers the deployment questions TP's bill (lab-03) leaves open: two nodes with no fast interconnect and a model that fits neither — PP works, but only if your workload keeps enough microbatches in flight (the test pins it: p=8 stages under a 10% bubble budget needs 63 concurrent microbatches — a number that should make you pause before proposing PP for a low-traffic latency-sensitive service). PP's economics are batch economics; knowing the formula means knowing instantly which workloads can pay.
The schedule-grid half of the lab is the more transferable skill: pipeline reasoning is the grid (stages × ticks, diagonal occupancy), and every scheduling refinement in the literature — 1F1B, interleaved stages, zero-bubble schedules — is a rearrangement of this grid you can draw and count. Build the simulator once and those papers become pictures.
Background: the fill-drain geometry
Microbatch b occupies stage s at tick s + b — a diagonal sweeping through a
p × (p+m−1) grid. Everything follows from counting cells:
total ticks = p + m − 1 (m diagonals, offset by one each)
useful cells = m · p (every microbatch visits every stage once)
capacity = p · (p + m − 1)
bubble = 1 − useful/capacity = (p − 1)/(p + m − 1)
Read the formula's two limits like an engineer: m = 1 → bubble (p−1)/p — a single
request through a deep pipeline uses one GPU's worth of an 8-GPU rack (which is why PP
does nothing for single-stream latency: total ticks ≥ p regardless — latency through
a pipeline is the pipeline's depth); m → ∞ → bubble → 0 — at high concurrency the
fill/drain cost amortizes to noise. PP converts throughput into efficiency and has
nothing to offer latency. That asymmetry — exactly opposite to TP, which buys latency
and taxes every step — is why the two compose rather than compete (TP inside the node,
PP across; vLLM's tensor_parallel_size × pipeline_parallel_size grid).
For inference specifically, "microbatch" maps onto the continuous-batching engine
naturally: each scheduler step's batch flows through the stages, and a busy engine
(Phase 3's full queues) keeps every stage fed — inference PP at high load lives near
the good end of the formula. The bad end is a quiet engine: requests trickle in,
stages idle, and the p99 user pays p stage-latencies regardless.
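The cell-counting argument above can be reproduced directly. This is a minimal sketch with hypothetical function names (it is not the lab's solution.py): it fills a p × (p+m−1) grid along the diagonals, counts the empty cells, and compares the result with the closed form.

```python
def bubble_fraction(p, m):
    # Closed form: idle share of a p-stage pipeline fed m microbatches.
    return (p - 1) / (p + m - 1)

def schedule_grid(p, m):
    # grid[s][t] holds the microbatch index that stage s runs at tick t,
    # or None when the stage is idle. Microbatch b is on stage s at tick s + b.
    ticks = p + m - 1
    grid = [[None] * ticks for _ in range(p)]
    for b in range(m):
        for s in range(p):
            grid[s][s + b] = b
    return grid

def idle_fraction(grid):
    cells = [cell for row in grid for cell in row]
    return sum(cell is None for cell in cells) / len(cells)

# Draw a small grid: rows are stages, columns are ticks, dots are idle cells.
for row in schedule_grid(4, 3):
    print(" ".join("." if cell is None else str(cell) for cell in row))

print(idle_fraction(schedule_grid(4, 3)), bubble_fraction(4, 3))
```

Drawing the grid for a few (p, m) pairs is a quick way to see the fill and drain triangles that the formula counts.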
Files
- starter.py — pipeline_total_ticks, bubble_fraction, simulate_schedule, min_microbatches_for_bubble. Your work.
- solution.py — reference.
- test_lab.py — the formulas, the grid's diagonal structure, the exact grid-vs-formula reconciliation, serial-stage discipline, and the bubble-budget inversion.
Run
LAB_IMPL=starter pytest phase-10-distributed-inference/labs/lab-04-pipeline-bubble -q
pytest phase-10-distributed-inference/labs/lab-04-pipeline-bubble -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_total_ticks | p + m − 1, including the m=1 pure-latency case |
| test_bubble_formula | The closed form at its corners: no pipeline → 0; one microbatch through 4 stages → 75%; m ≫ p → vanishing |
| test_schedule_grid_matches_the_formula | The reconciliation: idle cells counted in the simulated grid ÷ capacity equals bubble_fraction exactly. Two independent derivations agreeing is what "I understand this formula" means |
| test_stage_never_runs_two_microbatches_at_once | The serial-worker constraint that makes the diagonal the only schedule (until you interleave — see notes) |
| test_min_microbatches_for_a_bubble_budget | The inversion you'll actually use: p=8, 10% budget → m = 63; deeper pipelines need proportionally deeper batching |
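The budget inversion in the last row follows from solving (p−1)/(p+m−1) ≤ budget for m, which gives m ≥ (p−1)(1−budget)/budget. The sketch below is one hedged way to compute it; the function name is hypothetical, and exact rational arithmetic is used so that a budget like 0.1 does not round the answer up by one.

```python
import math
from fractions import Fraction

def min_microbatches(p, budget):
    # Smallest m with (p-1)/(p+m-1) <= budget, for 0 < budget < 1.
    # Fraction(str(...)) keeps decimal budgets such as 0.1 exact.
    beta = Fraction(str(budget))
    needed = (p - 1) * (1 - beta) / beta
    return max(1, math.ceil(needed))

for p in (2, 4, 8, 16):
    print(p, min_microbatches(p, 0.10))
```

The required m grows linearly in p − 1 at a fixed budget, which is the "deeper pipelines need proportionally deeper batching" statement in numeric form.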
Hitchhiker's notes
- Where PP's communication bill hides: one activation tensor per microbatch per stage boundary — batch_tokens × hidden × dtype bytes, p − 1 times per step total (not per layer!). Run lab-03's arithmetic on it: even over 10 GbE, a decode microbatch's 8 KB handoff is microseconds, and there are 7 of them instead of 64 all-reduces. That's the entire "PP tolerates slow links" argument, quantified with the model you already built.
- Interleaved stages (each GPU holds several non-contiguous layer chunks) shrink the bubble by making p virtual stages cheaper to fill — the formula becomes (p−1)/(p − 1 + v·m)-flavored with v chunks per GPU (each fill/drain step costs only 1/v of a stage), at the cost of v× more handoffs. Zero-bubble schedules (training-side) rearrange backward passes — inference, having no backward, mostly cares about the plain formula you built.
- The KV-cache wrinkle inference adds: each stage holds the KV for its layers only — PP splits cache naturally, like weights. But a request's tokens revisit stage 0 every decode step, so PP decode is a loop through the pipeline, not a single pass: steady-state decode keeps all stages busy only if the in-flight request count ≥ p. Same formula, with m = concurrent requests — Phase 3's max_num_seqs acquires a new lower bound.
- vLLM specifics: pipeline_parallel_size shards layers across nodes (Ray or multiprocessing); the V1 engine overlaps stage execution with its async scheduling. PP support historically lagged TP in vLLM precisely because continuous batching × pipelining is bookkeeping-heavy — reading upstream/vllm/distributed/ and the executor's PP paths after this lab, you'll recognize the grid under the code.
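The "7 handoffs instead of 64 all-reduces" comparison from the first note can be put in numbers. This is a sketch under assumptions rather than a measurement: the function names are hypothetical, the link is an assumed 10 GbE (1.25 GB/s, 50 µs per message), and the TP side uses the same two-term ring model as lab-03 at N=2.

```python
def p2p_time_s(payload, bandwidth, latency):
    # One point-to-point activation handoff between adjacent stages.
    return payload / bandwidth + latency

def ring_allreduce_time_s(payload, n, bandwidth, latency):
    # Assumed ring model: bandwidth term plus 2(N-1) sequential hops.
    if n <= 1:
        return 0.0
    return 2 * (n - 1) / n * payload / bandwidth + 2 * (n - 1) * latency

BW, LAT = 1.25e9, 50e-6        # assumed 10 GbE figures
payload = 1 * 4096 * 2         # one decode token, hidden 4096, fp16: 8 KB

pp_step = (8 - 1) * p2p_time_s(payload, BW, LAT)           # p = 8 stages
tp_step = 64 * ring_allreduce_time_s(payload, 2, BW, LAT)  # 32 layers, TP=2

print(f"PP handoffs:    {pp_step * 1e3:.2f} ms per step")
print(f"TP all-reduces: {tp_step * 1e3:.2f} ms per step")
```

Under this model both bills are dominated by per-message latency, so the gap comes almost entirely from the message count.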
Going further
- Extend the simulator with per-stage durations (stage = its layers' cost; make one stage 2× slower) and watch the bubble formula stop being exact: the slow stage becomes a straggler and the pipeline clocks at its rate — Phase 7 lab-04's imbalance lesson, third appearance. Then rebalance layers across stages to fix it (real PP deployments tune stage boundaries for exactly this).
- Add TP×PP composition: total GPUs = t × p; for a fixed 16-GPU budget and lab-03's comm model on both axes, find the (t, p) that minimizes decode latency for an in-node-NVLink, cross-node-IB cluster. You've just done the capacity-planning exercise that precedes every large-model deployment.
- Plot bubble vs m for p ∈ {2, 4, 8, 16} and overlay your service's actual concurrent-request distribution — the visual that settles "can we afford PP?" in one meeting.
References
- Huang et al., GPipe (2018) — the fill-drain schedule and the bubble: https://arxiv.org/abs/1811.06965
- Narayanan et al., PipeDream / 1F1B (2019–21) — the schedule refinements that rearrange your grid: https://arxiv.org/abs/2104.04473
- upstream/vllm/distributed/ and upstream/vllm/v1/executor/ — pipeline_parallel_size paths; the deep-dive maps them.
- Pope et al., Efficiently Scaling Transformer Inference (2022) — TP vs PP for inference, with the cost models side by side: https://arxiv.org/abs/2211.05102
- Lab-03 — the TP bill this lab is the alternative to; Phase 15 — disaggregation, the third way to split work across machines.
Phase 10 — Exercises: Distributed Inference
Warm-up (explain)
- One line each: TP, PP, DP, EP, CP — what gets split?
- What's an all-reduce vs an all-gather? Which does row-parallel use?
- Why "TP within a node, PP across nodes"?
Core (trace the code)
- In linear.py, what does ColumnParallelLinear shard vs RowParallelLinear? Where's the one all-reduce (:1392)?
- Why does the column→row pairing need only one all-reduce per pair (one for attention, one for the MLP)?
- In MultiprocExecutor (multiproc_executor.py:102), how many worker processes for TP=4, and what does it broadcast each step?
- Why is the model code identical for TP=1 and TP=8?
Build (your lab)
- In lab-01, prove row_parallel reconstructs x@W.T for num_ranks=8. Why is summing partials the correct combine (not concatenation)?
- Add a qkv_parallel that column-shards a fused QKV weight; verify it equals the unsharded QKV.
- Count communications for a full transformer block (attention + MLP) under your TP impl. Is it 2 all-reduces? Why?
Design (staff-level)
- Serve a 70B model on 8×A100-80GB for (a) lowest latency, (b) highest throughput. Pick TP/PP/DP for each and justify with the communication patterns.
- You scale TP from 2 to 8 and throughput barely improves. Diagnose (communication-bound) and propose alternatives.
- For a 256-expert MoE on 16 GPUs, how would you combine EP (experts) with DP/TP (attention), and what's the main risk (load imbalance, all-to-all cost)?
Self-grading
4–7 and 11–13 are interview-grade. Could you draw the col→row TP pattern and the worker fan-out? If not, re-read 01-deep-dive.md.
Phase 10 — Interview Questions: Distributed Inference
Q1. TP vs PP — when do you reach for each?
Model answer
TP splits every layer's math across GPUs, so all GPUs work on each token — great for latency, but it all-reduces every layer, so it needs fast intra-node links (NVLink). PP splits the layers across GPUs with cheap point-to-point handoffs, so it scales across nodes/memory, but adds pipeline bubbles (mitigated by micro-batching) and a bit of latency. Rule of thumb: TP within a node, PP across nodes; combine them for very large models.
Q2. Walk me through tensor-parallel matmuls.
Model answer
Column-parallel splits the weight by output columns: each GPU computes part of the output, combined by all-gather. Row-parallel splits by input rows (and the input): each GPU computes a partial of the whole output, combined by all-reduce (sum). vLLM does the first matmul in a block column-parallel and the second row-parallel, so the column output stays sharded and feeds the row input directly — one all-reduce per column→row pair, so two per transformer layer (attention and MLP). The combined result is mathematically identical to single-GPU (lab-01 proves it); in fp16 the all-reduce reorders the additions, so the last bits can differ.
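The column→row pairing can be demonstrated on small integer matrices, where the sums are exact and the sharded result can be compared for equality. This is an illustrative sketch in plain Python, not vLLM's ColumnParallelLinear or RowParallelLinear: the helper names are hypothetical, ranks are simulated in a loop, and weights are stored as [in][out] so that y = x @ W.

```python
def matmul(x, w):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def relu(m):
    return [[max(v, 0) for v in row] for row in m]

def shard_cols(w, rank, n):
    # Column-parallel shard: this rank's slice of the output columns.
    chunk = len(w[0]) // n
    return [row[rank * chunk:(rank + 1) * chunk] for row in w]

def shard_rows(w, rank, n):
    # Row-parallel shard: this rank's slice of the input rows.
    chunk = len(w) // n
    return w[rank * chunk:(rank + 1) * chunk]

def all_reduce_sum(partials):
    # Elementwise sum of same-shaped matrices, one per rank.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def tp_pair(x, w1, w2, n):
    partials = []
    for rank in range(n):
        h = relu(matmul(x, shard_cols(w1, rank, n)))        # stays sharded, no comm
        partials.append(matmul(h, shard_rows(w2, rank, n)))  # partial of full output
    return all_reduce_sum(partials)                          # the one collective

x = [[1, -2, 3, 0], [2, 1, -1, 4]]
w1 = [[(3 * i + 5 * j) % 7 - 3 for j in range(8)] for i in range(4)]
w2 = [[(2 * i + j) % 5 - 2 for j in range(4)] for i in range(8)]
reference = matmul(relu(matmul(x, w1)), w2)
print(all(tp_pair(x, w1, w2, n) == reference for n in (1, 2, 4)))
```

The elementwise activation sits between the two matmuls without any communication because each rank already holds complete values for its own hidden columns.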
Q3. What's a pipeline bubble and how is it reduced?
Model answer
In PP, downstream stages idle while the first stage processes the initial input — wasted GPU time called the bubble. Splitting the work into many micro-batches keeps the pipeline full: once it's primed, every stage is always working on some micro-batch. The bubble shrinks with more micro-batches but never fully disappears.
Q4. Why does MoE motivate expert parallelism + data-parallel attention?
Model answer
Experts are independent FFNs, so placing whole experts on different GPUs (EP) scales expert capacity with just an all-to-all to route tokens. Attention has different parallelism economics, so it's often run data-parallel across the same GPUs to balance work. Mixing EP (experts) with DP/TP (attention) is common for large MoE models; the main risks are all-to-all cost and expert load imbalance.
Q5. How does vLLM run the same model on 1 or 64 GPUs unchanged?
Model answer
The model uses parallel layers (ColumnParallelLinear/RowParallelLinear) that internally do the
collectives, and parallel_state.py holds the group/rank bookkeeping. For multi-GPU the Executor
becomes a MultiprocExecutor that spawns one worker process per GPU, each holding a shard, running
in lockstep. The engine logic above (scheduler, sampler) and the model code are identical — only the
executor fans out.
Rapid-fire
- Row-parallel combine? all-reduce (sum). Column-parallel combine? all-gather (concat).
- All-reduces per transformer block under TP? ~2 (one per attention + MLP), pattern = col→row each.
- Collective library? NCCL. Group bookkeeping? parallel_state.py.
- Workers for TP=4? 4 processes, one per GPU.
- EP shards? whole experts (all-to-all). CP shards? one sequence's context/KV.
Phase 10 — Cheatsheet: Distributed Inference
The five splits
| | splits | comms | where |
|---|---|---|---|
| TP tensor | each layer's weights | all-reduce every layer | within a node (NVLink) |
| PP pipeline | layers across GPUs | point-to-point + bubbles | across nodes |
| DP data | full replicas | none on the work; route requests | model must fit |
| EP expert | MoE experts across GPUs | all-to-all | MoE layers |
| CP context | one sequence's KV | along the sequence | ultra-long context |
TP math (the one to know)
- column-parallel: split W by output cols → all-gather. row-parallel: split W by input rows + split x → all-reduce (sum).
- block pattern: column then row → one all-reduce per block; result identical to single-GPU.
- TP all-reduces every layer → needs fast links → TP within a node, PP across nodes.
Who runs it
EngineCore → MultiprocExecutor → N Worker processes (1/GPU) → ModelRunner. Collectives happen
inside the parallel Linear layers; groups/ranks in parallel_state.py. Model code unchanged for any
parallel size.
Combine for scale
e.g. TP=8 in-node + PP=2 across nodes + DP replicas + EP for MoE. Choosing the mix for a model+SLA is the staff decision.
Key upstream
- distributed/parallel_state.py:1370 init :1506 initialize_model_parallel :1241 get_tp_group :1849 tp_world_size
- distributed/communication_op.py:12 all_reduce :17 all_gather
- layers/linear.py:410 ColumnParallelLinear :975 QKVParallelLinear :1392 RowParallelLinear
- v1/executor/multiproc_executor.py:102 MultiprocExecutor · v1/worker/gpu_worker.py:109 Worker
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md
Phase 11 — The Hitchhiker's Guide to Multi-LoRA
← Phase 10 · Course home · Phase 12 →
Contents
- Don't Panic
- Step 1: What a LoRA actually is (the math, gently)
- Step 2: The hard part — many adapters in one batch
- Step 3: Managing adapters in memory
- Step 4: LoRA on MoE and which layers get patched
- The invariants to memorize
- What you'll do
Don't Panic
A LoRA is a tiny "personality patch" for a big model. Instead of fine-tuning all 8 billion weights (expensive, and you'd need a full copy per use-case), you train two small matrices that nudge the frozen base model toward a specific task — legal writing, a coding style, a customer's tone. The magic vLLM does:
Serve many different LoRAs from ONE base model at the same time — request A uses the legal adapter, request B the medical one, request C none — all in a single batch, sharing the base weights, by applying each request's tiny patch inside one batched operation.
This is a structural cost win: thousands of fine-tunes on shared base weights, instead of a whole deployment per customer. This phase is that batched-adapter machinery.
Base model (shared, frozen) ──────────────┐
request A → + legal adapter (A_legal, B_legal) ┐
request B → + medical adapter (A_med, B_med) ├─ all in one batch, one base read
request C → + nothing (base only) ┘
Step 1: What a LoRA actually is (the math, gently)
A model layer multiplies by a big weight W (say 4096×4096 = 16M numbers). A LoRA says: don't
change W; add a small correction made of two skinny matrices.
W' = W + scaling × (B · A)
│ │
│ └ A: shape (r, in) "down" — squeeze to a tiny rank r (e.g. 16)
└ B: shape (out, r) "up" — expand back to full size
r (the rank) is tiny — 8, 16, 64 — so A and B together are tens to hundreds of times smaller
than W (about 128× for the 4096×4096 example at r = 16: 131,072 numbers against 16.8M). Applying the patch to an input x is two small matmuls:
1. SHRINK: s = x · Aᵀ (in → r) "squeeze x down to rank r"
2. EXPAND: Δ = s · Bᵀ (r → out) "expand back up"
output = x · Wᵀ + scaling × Δ
So a LoRA costs one big base matmul (shared by everyone) plus two tiny rank-r matmuls. That's
why it's cheap. You'll implement exactly this shrink/expand in lab-01.
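The shrink/expand path and the merged-weight form W' = W + scaling × (B · A) give the same output, which a small integer example can confirm exactly. This is a plain-Python sketch with hypothetical helper names, not the lab-01 solution or vLLM's kernels; W is [out][in], A is [r][in], B is [out][r], and x is a single input row.

```python
def matvec_t(x, m):
    # x · mᵀ for a row vector x and a matrix m stored as [rows][len(x)].
    return [sum(a * b for a, b in zip(x, row)) for row in m]

def lora_forward(x, W, A, B, scaling):
    base = matvec_t(x, W)     # shared base matmul: in -> out
    s = matvec_t(x, A)        # shrink: in -> r
    delta = matvec_t(s, B)    # expand: r -> out
    return [b + scaling * d for b, d in zip(base, delta)]

def merged_weight(W, A, B, scaling):
    # W' = W + scaling * (B @ A), built explicitly for comparison.
    r = len(A)
    return [[W[o][i] + scaling * sum(B[o][k] * A[k][i] for k in range(r))
             for i in range(len(W[0]))] for o in range(len(W))]

def lora_params(r, d_in, d_out):
    return r * (d_in + d_out)

W = [[1, 0, 2, -1], [0, 3, 1, 1], [2, -2, 0, 1]]   # out=3, in=4
A = [[1, 1, 0, -1], [0, 2, 1, 0]]                  # r=2
B = [[1, 0], [-1, 2], [0, 1]]
x = [2, -1, 3, 1]

print(lora_forward(x, W, A, B, 2) == matvec_t(x, merged_weight(W, A, B, 2)))
print(lora_params(16, 4096, 4096), 4096 * 4096)
```

Keeping the patch unmerged is what lets different requests in one batch use different (A, B) pairs against the same W.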
🆕 New words: LoRA (Low-Rank Adaptation — a small additive patch), rank r (the squeeze dimension, small), A/B (the down/up matrices), shrink/expand (the two matmuls), adapter (one trained (A,B) pair).
Step 2: The hard part — many adapters in one batch
Serving one LoRA is easy (just add its delta). The challenge is a batch where different rows use different adapters:
batch row 0 → adapter "legal" row 1 → adapter "medical" row 2 → base (no adapter)
The naive fix — loop over rows, apply each adapter separately — destroys batching (you're back to tiny per-request work, Phase 5's enemy). The real fix is a grouped operation: sort/group rows by adapter, and in one kernel apply each adapter to its group. This is what the punica / SGMV kernels do (SGMV = Segmented Gather Matrix-Vector). Conceptually it's the same "group by id, do a grouped matmul" trick you saw for MoE experts in Phase 7 — here grouped by adapter id instead of expert id.
group rows by adapter id → for each adapter: one matmul on its rows → scatter back
cost ≈ base matmul (shared) + a little per distinct adapter ≪ N separate model runs
You'll build this grouped application in lab-01 and prove it equals the per-row reference.
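The group-by-adapter idea can be sketched in plain Python. This is not the punica/SGMV kernel and not the lab-01 solution: the function names are hypothetical, the inner loop over a group's rows stands in for what a real kernel would do as one grouped matmul, and adapter id 0 is treated as "base only".

```python
from collections import defaultdict

def matvec_t(x, m):
    # x · mᵀ for a row vector x and a matrix m stored as [rows][len(x)].
    return [sum(a * b for a, b in zip(x, row)) for row in m]

def apply_per_row(xs, adapter_ids, W, adapters, scaling):
    # Reference: handle each request separately.
    out = []
    for x, aid in zip(xs, adapter_ids):
        y = matvec_t(x, W)
        if aid != 0:
            A, B = adapters[aid]
            delta = matvec_t(matvec_t(x, A), B)
            y = [v + scaling * d for v, d in zip(y, delta)]
        out.append(y)
    return out

def apply_grouped(xs, adapter_ids, W, adapters, scaling):
    out = [matvec_t(x, W) for x in xs]            # base pass over the whole batch
    groups = defaultdict(list)
    for row, aid in enumerate(adapter_ids):
        if aid != 0:
            groups[aid].append(row)               # group rows by adapter id
    for aid, rows in groups.items():
        A, B = adapters[aid]                      # each adapter fetched once
        for row in rows:
            delta = matvec_t(matvec_t(xs[row], A), B)
            out[row] = [v + scaling * d for v, d in zip(out[row], delta)]
    return out

W = [[1, 0, 2], [0, 1, -1]]
adapters = {1: ([[1, 1, 0]], [[2], [0]]), 2: ([[0, 1, 1]], [[1], [-1]])}
xs = [[1, 2, 3], [0, 1, 0], [2, 2, 2], [1, 0, 1]]
ids = [1, 2, 0, 1]
print(apply_grouped(xs, ids, W, adapters, 2) == apply_per_row(xs, ids, W, adapters, 2))
```

The scatter-back step is implicit here because each delta is written straight into its original row index.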
Step 3: Managing adapters in memory
GPUs have limited memory, so vLLM keeps a bounded number of adapters resident:
- max_loras — how many distinct adapters can be in a single batch/step.
- adapters are loaded on demand and LRU-evicted when the budget is exceeded (like the KV cache's eviction, Phase 2 — same pattern, different objects).
- the scheduler (Phase 3) respects max_loras: it won't admit a request whose adapter would exceed the limit this step (you saw the scheduled_loras check in the Phase 3 deep-dive).
A request names its adapter with a LoRARequest (id + name + path). Adapter id 0 conventionally
means "base model, no adapter."
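The two mechanisms above, a bounded LRU set of resident adapters and a per-step limit on distinct adapters, can be sketched as follows. Both the class and the function are hypothetical illustrations, not vLLM's manager or scheduler code; adapter id 0 is treated as "base, no slot needed", and over-limit requests are deferred rather than blocking the requests behind them.

```python
from collections import OrderedDict

class AdapterSlotCache:
    """Hypothetical LRU cache of resident adapters (capacity = number of slots)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()        # adapter id -> placeholder for loaded weights

    def activate(self, adapter_id):
        # Returns the evicted adapter id, or None if nothing was evicted.
        if adapter_id == 0:
            return None                   # base model needs no slot
        if adapter_id in self.slots:
            self.slots.move_to_end(adapter_id)   # hit: mark most recently used
            return None
        evicted = None
        if len(self.slots) >= self.capacity:
            evicted, _ = self.slots.popitem(last=False)   # drop least recently used
        self.slots[adapter_id] = object()
        return evicted

def schedule_step(waiting_adapter_ids, max_loras):
    # Walk the queue once; defer requests whose adapter would exceed max_loras.
    scheduled, deferred, active = [], [], set()
    for idx, aid in enumerate(waiting_adapter_ids):
        if aid != 0 and aid not in active and len(active) >= max_loras:
            deferred.append(idx)
            continue
        if aid != 0:
            active.add(aid)
        scheduled.append(idx)
    return scheduled, deferred

cache = AdapterSlotCache(capacity=2)
print([cache.activate(a) for a in (1, 2, 1, 3)], list(cache.slots))
print(schedule_step([1, 2, 3, 1, 0], max_loras=2))
```

Touching adapter 1 again before adapter 3 arrives is what makes adapter 2 the eviction victim; the same access pattern with a plain FIFO would evict adapter 1 instead.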
Step 4: LoRA on MoE and which layers get patched
LoRA is applied to the linear layers — typically the attention projections (Q/K/V/O) and the MLP.
For MoE models (Phase 7), adapters can patch the expert layers too (lora/layers/fused_moe.py)
— trickier because of the routing, but the same shrink/expand idea. Not every layer needs an
adapter; which ones are patched is part of how the LoRA was trained.
The invariants to memorize
- LoRA: W' = W + scaling × B·A, rank r ≪ in,out. Apply = base matmul + shrink (→r) + expand (→out).
- Multi-LoRA = grouped application by adapter id (punica/SGMV): one base read, a little extra per adapter — not N separate runs.
- max_loras bounds distinct adapters per step; the manager LRU-evicts the rest; the scheduler enforces it.
- Base weights are shared and read once; each adapter adds only r×(in+out) params.
- Output for a batch of mixed adapters equals applying each adapter per-request — batching is an optimization, not a behavior change (recurring theme).
What you'll do
- Read: 01-deep-dive.md — LoRARequest, the LoRA layers, the punica shrink/expand/add_lora_linear, and the manager + scheduler hook, line-anchored.
- Build: 02-mini-build.md — batched multi-adapter LoRA matmul.
- Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
  - lab-01-batched-lora-matmul [CPU-OK] — implement shrink/expand + grouped multi-adapter application; prove it equals the per-request loop.
  - lab-02-serve-many-loras [GPU-OPT] — serve 3 adapters in one batch on real vLLM (captured).
  - lab-03-lora-economics [CPU-OK] — the multi-tenant arithmetic: 32 MiB per adapter (deriving lab-02's logged number), ~430× shrink, 87 GPUs saved at 100 tenants.
  - lab-04-adapter-slot-cache [CPU-OK] — the LRU slot cache behind max_loras and the scheduler walk that defers (not barriers) overflow requests; thrash arithmetic included.
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
Phase 11 — Deep Dive: multi-LoRA in real vLLM
Paths relative to upstream/ at v0.22.1 @ 0decac0.
- vllm/lora/request.py: LoRARequest (how a request names its adapter)
- vllm/lora/lora_weights.py: the (A, B) weight tensors of an adapter
- vllm/lora/lora_model.py: LoRAModel (one loaded adapter's layers)
- vllm/lora/model_manager.py: load / activate / LRU-evict adapters
- vllm/lora/worker_manager.py: per-worker adapter management
- vllm/lora/layers/: LoRA-wrapped layers (base_linear, column/row parallel, fused_moe)
- vllm/lora/punica_wrapper/: the batched SGMV/BGMV kernels (shrink / expand / add_lora_linear)
Contents
- 1. The request: LoRARequest
- 2. The patched layers: lora/layers/
- 3. The batched kernels: punica_wrapper/
- 4. The manager: load, activate, evict
- 5. The scheduler hook (recall Phase 3)
- Reading checklist
1. The request: LoRARequest
vllm/lora/request.py:8 — class LoRARequest with lora_int_id (globally unique id), lora_name,
and the adapter path. The scheduler and managers key everything off lora_int_id; id 0 means base.
This is what a user attaches to a request to say "serve me with the legal adapter."
2. The patched layers: lora/layers/
A LoRA layer wraps a base layer (Phase 6's ColumnParallelLinear, etc.) and adds the
shrink/expand delta. Read lora/layers/base_linear.py and column_parallel_linear.py: in forward
they compute the base output, then call the punica wrapper to add the per-request LoRA delta. So the
model still builds normal layers; the LoRA manager swaps in these wrappers when adapters are
active. lora/layers/fused_moe.py does the same for MoE expert layers (Phase 7).
3. The batched kernels: punica_wrapper/
This is the heart — applying different adapters to different rows in one call. punica_base.py
defines the interface (PunicaWrapperABC :22, PunicaWrapperBase :124):
- add_shrink (:42): the down-projection s = x · Aᵀ for all rows, each using its adapter's A.
- add_expand (:57): the up-projection Δ = s · Bᵀ, each using its adapter's B.
- add_lora_linear (:88): the full "base + shrink + expand" for a linear layer.
The implementations (punica_gpu.py, punica_cpu.py, selected by punica_selector.py) use SGMV
(Segmented Gather Matrix-Vector): rows are segmented by adapter id, and each segment is matmul'd
against its adapter's slice in one grouped kernel. Read PunicaWrapperCPU.add_shrink/add_expand
(punica_cpu.py:166/:197) for the most readable version — it's literally "for each adapter
segment, do the small matmul," which is exactly your lab-01 grouped implementation.
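The segmenting step described above can be sketched in numpy. This is an illustrative reconstruction of the idea, not the code in punica_cpu.py: the helper names segment_by_adapter and grouped_delta are hypothetical, and id 0 is treated as "base only" to match the LoRARequest convention in this deep dive.

```python
import numpy as np

def segment_by_adapter(adapter_ids):
    """Return (order, segments): a stable row permutation that makes equal ids
    contiguous, plus (adapter_id, start, stop) ranges into that permutation."""
    adapter_ids = np.asarray(adapter_ids)
    order = np.argsort(adapter_ids, kind="stable")
    sorted_ids = adapter_ids[order]
    uniq, starts = np.unique(sorted_ids, return_index=True)
    stops = np.append(starts[1:], len(sorted_ids))
    segments = [(int(a), int(s), int(e)) for a, s, e in zip(uniq, starts, stops)]
    return order, segments

def grouped_delta(x, adapters, adapter_ids, scaling, d_out):
    """LoRA delta only: one shrink/expand per adapter segment; id 0 adds nothing."""
    delta = np.zeros((x.shape[0], d_out))
    order, segments = segment_by_adapter(adapter_ids)
    for adapter_id, start, stop in segments:
        if adapter_id == 0:
            continue                        # base-only rows skip the detour
        rows = order[start:stop]            # original row indices of this segment
        A, B = adapters[adapter_id]
        delta[rows] = scaling * (x[rows] @ A.T) @ B.T
    return delta

rng = np.random.default_rng(1)
d_in, d_out, r = 16, 12, 4
adapters = {k: (rng.standard_normal((r, d_in)), rng.standard_normal((d_out, r)))
            for k in (1, 2)}
ids = [2, 0, 1, 2, 1, 0]
x = rng.standard_normal((len(ids), d_in))

order, segments = segment_by_adapter(ids)
delta = grouped_delta(x, adapters, ids, scaling=0.5, d_out=d_out)

# Per-row reference loop: the behavior the grouped form must reproduce.
reference = np.zeros_like(delta)
for i, a in enumerate(ids):
    if a != 0:
        A, B = adapters[a]
        reference[i] = 0.5 * (x[i] @ A.T) @ B.T
```

The stable sort keeps rows of one adapter in their original relative order, so scattering results back through order[start:stop] restores the batch layout without a separate inverse permutation.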
4. The manager: load, activate, evict
vllm/lora/model_manager.py — LoRAModelManager loads adapters into a fixed set of GPU "slots",
activates the ones needed this step, and LRU-evicts when over max_loras (same eviction pattern as
the KV BlockPool, Phase 2). worker_manager.py drives this per worker. lora_weights.py holds an
adapter's A/B tensors (stacked across layers).
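The load / activate / evict cycle can be pictured with a toy LRU over a fixed number of slots. This is a sketch under stated assumptions, not LoRAModelManager: the class name AdapterSlotCache and its counters are hypothetical, and "loading" is reduced to bookkeeping.

```python
from collections import OrderedDict

class AdapterSlotCache:
    """Toy LRU over a fixed number of adapter slots (hypothetical, not vLLM's class)."""

    def __init__(self, max_loras):
        self.max_loras = max_loras
        self.slots = OrderedDict()            # adapter_id -> slot index, oldest first
        self.free = list(range(max_loras))    # slot indices not yet handed out
        self.hits = self.loads = self.evictions = 0

    def activate(self, adapter_id):
        """Make adapter_id resident and return the slot index it occupies."""
        if adapter_id in self.slots:
            self.slots.move_to_end(adapter_id)        # refresh recency
            self.hits += 1
            return self.slots[adapter_id]
        if self.free:
            slot = self.free.pop(0)
        else:
            _, slot = self.slots.popitem(last=False)  # evict least recently used
            self.evictions += 1
        self.slots[adapter_id] = slot                 # "load" into the slot
        self.loads += 1
        return slot

cache = AdapterSlotCache(max_loras=2)
slots_used = [cache.activate(a) for a in [1, 2, 1, 3, 2]]
resident = list(cache.slots)
```

With two slots and the trace 1, 2, 1, 3, 2, the second request for adapter 1 is the only hit; adapter 3 evicts adapter 2, and adapter 2's return then evicts adapter 1.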
5. The scheduler hook (recall Phase 3)
In vllm/v1/core/sched/scheduler.py, the waiting-admission loop checks max_loras: it tracks
scheduled_loras and skips a waiting request if admitting its adapter would exceed the limit this
step (you saw this around :573 in the Phase 3 deep-dive). So multi-LoRA, like spec decode, rides
the normal scheduler with one extra constraint rather than a separate path.
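A compact way to see "one extra constraint" is a single walk over the waiting queue. The function below is a hypothetical sketch, not the code in scheduler.py: the name admit_step and the (request_id, adapter_id) tuples are assumptions for illustration, and every other admission check (token budget, KV blocks) is left out.

```python
def admit_step(waiting, max_loras, running_loras=()):
    """Walk the waiting queue once; return (admitted_ids, still_waiting).

    `waiting` holds (request_id, adapter_id) pairs; adapter_id 0 means base.
    An over-limit request is skipped, not treated as a barrier, so requests
    behind it are still considered in the same step.
    """
    scheduled_loras = set(running_loras)
    admitted, deferred = [], []
    for request_id, adapter_id in waiting:
        needs_new_slot = adapter_id != 0 and adapter_id not in scheduled_loras
        if needs_new_slot and len(scheduled_loras) >= max_loras:
            deferred.append((request_id, adapter_id))   # defer and keep walking
            continue
        if adapter_id != 0:
            scheduled_loras.add(adapter_id)
        admitted.append(request_id)
    return admitted, deferred

waiting = [("a", 1), ("b", 2), ("c", 3), ("d", 0), ("e", 1)]
admitted, deferred = admit_step(waiting, max_loras=2)
```

Request "c" is deferred because adapter 3 would be the third distinct adapter, while "d" (base) and "e" (an adapter already scheduled) are admitted behind it.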
Reading checklist
- LoRARequest: what identifies an adapter, and what does id 0 mean?
- A LoRA layer's forward: base output, then what? Where does the delta come from?
- add_shrink/add_expand (punica_cpu.py:166/:197): match them to shrink (→r) / expand (→out).
- How does SGMV apply different adapters to different rows in one call (segments)?
- Where does max_loras get enforced: in the manager and the scheduler?
Now build it: 02-mini-build.md, then the labs.
Phase 11 — Mini-Build: batched multi-adapter LoRA
You'll implement the LoRA delta (shrink → expand) and the grouped application that serves many adapters in one batch, then prove it equals applying each adapter per-request. This is the punica/ SGMV idea in numpy.
The task (lab-01)
Implement, in numpy:
- lora_delta(x, A, B, scaling) → scaling × (x @ A.T) @ B.T. (A: (r, in), B: (out, r).) Note it's two small matmuls with a rank-r bottleneck.
- apply_single(x, W, A, B, scaling) → x @ W.T + lora_delta(...) (base + one adapter).
- apply_batched(x, W, adapters, adapter_ids, scalings) → each row i of x uses adapters[adapter_ids[i]] (an (A, B) pair, or None for base-only). Do it grouped: compute the shared base x @ W.T once, then for each distinct adapter id add its delta to its rows. Must equal a per-row reference loop.
adapter_ids[i] == -1 (or None entry) means "base only, no adapter" for that row.
The point (the insight)
apply_batched reads the base weight once for the whole batch and adds only a tiny rank-r
delta per adapter group — so serving N adapters costs ≈ base + N small matmuls, not N full model
runs. That's the multi-tenant cost advantage. Your grouping by adapter_id mirrors SGMV's
segmenting; it's the same "group by id" trick as MoE (Phase 7), here by adapter.
Definition of done
pytest phase-11-multi-lora/labs -q
Tests pin: apply_batched == per-row reference; base-only rows equal x @ W.T; the delta has the
right rank-r structure; and a single shared base matmul covers all rows.
Map to the real engine
| your numpy | real vLLM |
|---|---|
| lora_delta (shrink→expand) | add_shrink / add_expand (punica_cpu.py:166/:197) |
| apply_batched (grouped by id) | add_lora_linear / SGMV (punica_base.py:88) |
| adapters dict by id | LoRAModelManager slots (model_manager.py) |
| adapter_ids per row | LoRARequest.lora_int_id (request.py:8) |
| max distinct adapters | max_loras (manager LRU + scheduler check) |
Phase 11 Labs — Multi-LoRA
Four labs on serving many fine-tunes over one set of base weights. The arc: build the grouped delta math and prove consolidation changes nothing (lab-01), price the adapters and the fleet savings (lab-03), manage the slot cache and its scheduler constraint (lab-04), then watch a mixed base+adapter batch produce two models' behavior in one step on real hardware (lab-02).
Recommended order: 01 → 03 → 04 → 02. (Directory numbers predate labs 03–04: math,
economics, machinery, demo.) CPU labs follow the standard contract — starter.py
(your work), solution.py (reference), test_lab.py (the spec); default runs the
solution, LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-11-multi-lora/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-11-multi-lora/labs/lab-01-batched-lora-matmul -q
Contents
- lab-01-batched-lora-matmul [CPU-OK]
- lab-02-serve-many-loras [GPU-OPT]
- lab-03-lora-economics [CPU-OK]
- lab-04-adapter-slot-cache [CPU-OK]
- What you can do after this phase
Labs
lab-01-batched-lora-matmul [CPU-OK]
The punica/SGMV idea in numpy: shrink → expand deltas, one shared base matmul for the whole batch, per-adapter-group scatter-adds — proven exactly equal to the per-row loop (the consolidation safety case). Base-only rows ride free; the delta provably factors through the rank-r bottleneck. Skills: never materialize B@A; the group-by-parameter-set pattern (MoE's permute trick, second appearance); why mixed batches cost ~nothing.
lab-02-serve-many-loras [GPU-OPT]
The integration test: one batch, base + SQL adapter, two behaviors, one 12.55 GiB weight copy, 0.03 GiB of adapter — every number reconciled against labs 01/03/04. Plus the productization surface: adapters as model names in the OpenAI API, runtime loading, the cold-slot p99 signature. Annotated capture included. Skills: the operational knobs; behavior follows the tag; eval-diff due diligence for tenant migrations.
lab-03-lora-economics [CPU-OK]
The multi-tenant arithmetic as functions: 32 MiB per rank-16 7B adapter (deriving
lab-02's logged "0.03 GiB" from constants), ~430× smaller than the base, 32 per GiB,
rank as a linear memory dial with fleet-wide blast radius, and 87 GPUs saved at 100
tenants. Skills: economics-as-tested-functions; max_lora_rank as a memory
commitment; auditing platform pitches in your head.
lab-04-adapter-slot-cache [CPU-OK]
The machinery max_loras names: pre-allocated slots (kernel/graph shape stability —
Phase 5's constraint, again), an LRU cache with honest hit accounting (>75% on 80/20
traffic with 4 slots over 16 adapters), and the scheduler walk that defers — not
barriers — overflow requests. The serving-systems kata (cache-with-eviction +
admission-under-capacity), third appearance. Skills: OrderedDict as LRU; thrash
arithmetic; per-resource admission policy as a design decision; cross-component
invariants.
What you can do after this phase
Explain to a CFO why 100 fine-tunes need 13 engines, and to an engineer why the
consolidation is provably lossless; size max_loras/max_lora_rank from traffic
shape and memory budget rather than defaults; diagnose tenant p99 complaints down to
slot thrash with the cache model; and read vllm/lora/ — punica wrappers, the model
manager, the scheduler gating — as three labs you've already written. Phase 12 rides
lab 09-01's processor hook; the slot discipline you built here returns whenever
per-request GPU state does.
Lab 11-01 — Batched Multi-Adapter LoRA [CPU-OK]
A fine-tuned model is a base model plus a small correction — LoRA makes the
correction a rank-r factorization (ΔW = B @ A, lab-03 prices it at ~1/400th of the
base). The serving problem this lab solves: a single batch arrives carrying requests
for different fine-tunes — tenant 1 wants the SQL adapter, tenant 2 the support-bot
adapter, tenant 3 the plain base — and the engine must apply each row's own correction
without forking the base computation. You'll implement the answer in three layers:
the rank-r delta itself (shrink → expand), single-adapter application, and the batched
grouped form — one shared base matmul for everyone, plus per-adapter-group deltas —
proven exactly equal to the naive per-row loop. That grouped form is the
punica/SGMV idea, and it's what makes multi-tenant fine-tune serving a product instead
of a hack.
Contents
- Why this lab exists
- Background: shrink, expand, group
- Files
- Run
- What to implement
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Multi-LoRA is the cleanest case study in the course of a structural insight beating a resource problem. The naive reading of "serve 50 fine-tunes" is 50 model deployments — 50× the weights, 50× the GPUs (lab-03 does the bill). The structural reading: the 50 models share 99.75% of their parameters, so factor the computation the same way the parameters factor — shared base, per-tenant deltas. This lab makes you earn that reading by implementing it and proving equality, because the equality is the entire safety case: a tenant must get bit-for-bit (well, float-for-float) the same output from the shared deployment as from a dedicated one, or the consolidation is a quality regression wearing a cost-savings hat.
It's also the phase's foundation stone: lab-02 runs this exact computation on a GPU, lab-03 prices the structures you're multiplying, lab-04 manages which adapters are allowed into the batch. And the grouping pattern itself — sort work by its parameter-set, run one efficient op per group, scatter back — is Phase 7 lab-01's MoE permute trick with adapters in place of experts. Second appearance; it has a third (Phase 13's modality grouping). Learn the shape, not just the instance.
Background: shrink, expand, group
The delta for one row: Δy = scaling · (x @ Aᵀ) @ Bᵀ — shrink to the r-dimensional
bottleneck (x @ Aᵀ: in→r), then expand back (r→out). Never materialize B @ A
(that's an out × in matrix — the whole point is not to build it); the two skinny
matmuls cost r·(in+out) multiplies per token vs the base's in·out — the ~128×
compute shrink that mirrors lab-03's memory shrink. scaling (= α/r in the standard
parametrization) is a training-side constant that rides along.
The batch: rows tagged with adapter_ids (−1 = base only). The grouped application:
1. One base matmul for the whole batch: x @ Wᵀ, every row, regardless of adapter. This is the line that shares the expensive read (the base weights stream from HBM once: Phase 0 lab-04's bandwidth economics, multi-tenant edition).
2. Per adapter group: gather that adapter's rows, run shrink/expand on the slice, scatter-add back. Segments of rows × one small GEMM each: "Segmented Gather Matrix-Vector multiply" (SGMV), named exactly for this shape.
Base-only rows simply skip step 2 — they cost nothing extra, which is why mixed base+adapter batches (lab-02's demo) are free to compose.
Files
- starter.py: lora_delta, apply_single, apply_batched. Your work.
- solution.py: reference.
- test_lab.py: batched ≡ per-row, base-only rows, the rank-r structure, and the shared-base property.
Run
LAB_IMPL=starter pytest phase-11-multi-lora/labs/lab-01-batched-lora-matmul -q
pytest phase-11-multi-lora/labs/lab-01-batched-lora-matmul -q # reference
What to implement
The three functions per 02-mini-build.md. The one trap:
in apply_batched, accumulate with indexed addition onto the base output
(out[rows] += …) — and note that here, unlike Phase 7 lab-01's MoE scatter, plain
fancy-indexed += is safe, because each row belongs to exactly one adapter (no
duplicate indices). If you reflexively reached for np.add.at after Phase 7: good
reflex, then notice why it's not needed — knowing when the footgun fires is better
than fearing it always.
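The difference between the two accumulation styles is easy to see on a toy array. This is a standalone numpy illustration with made-up values, not part of the lab's starter or solution.

```python
import numpy as np

# Duplicate indices (the MoE-style case): row 0 receives two contributions.
dup_rows = np.array([0, 0, 2])
contrib = np.array([1.0, 10.0, 100.0])

buffered = np.zeros(3)
buffered[dup_rows] += contrib               # buffered: only one write to row 0 survives

accumulated = np.zeros(3)
np.add.at(accumulated, dup_rows, contrib)   # unbuffered: both contributions land

# Unique indices (the LoRA case): each row belongs to exactly one adapter group,
# so plain fancy-indexed += and np.add.at give the same answer.
adapter_ids = np.array([1, -1, 2, 1])
out_plain = np.zeros(4)
out_at = np.zeros(4)
for adapter_id in (1, 2):
    rows = np.flatnonzero(adapter_ids == adapter_id)
    delta = np.full(len(rows), float(adapter_id))   # stand-in for the real delta
    out_plain[rows] += delta
    np.add.at(out_at, rows, delta)
```

In the duplicate-index case the buffered form silently drops a contribution on row 0, while np.add.at sums both. In the LoRA case no row index repeats within a group, so either form is correct and the cheaper one is fine.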
What the tests prove
| Test | What it pins |
|---|---|
| batched ≡ per-row loop | The consolidation safety case: grouping is an execution strategy, not a semantics change — the course's master invariant, tenant edition |
| adapter_id == -1 rows equal pure base | Base traffic rides free in mixed batches; no adapter machinery touches it |
| the delta is genuinely rank-r | It factors through the r-dim bottleneck — a delta that doesn't is a bug that costs you the entire economics (you'd be applying a full-rank update at full-rank prices) |
| one shared base matmul | The structural win itself, asserted: the base is read once per batch, not once per tenant |
Hitchhiker's notes
- Map to upstream: add_shrink/add_expand in upstream/vllm/lora/punica_wrapper/punica_base.py (and the CPU reference in punica_cpu.py, which is genuinely readable; go diff it against your solution) are your two halves of lora_delta; add_lora_linear is your apply_batched. The GPU versions fuse the segment loop into one kernel launch indexed by lab-04's slot ids: grouping logic identical, loop distributed across the grid.
- Where LoRA hooks into the model: every ColumnParallelLinear/RowParallelLinear (Phase 10 lab-01!) gets a LoRA-aware wrapper that adds the delta after the base matmul. Under tensor parallelism the adapter shards along the same axes as its base layer (A with the input shard, B with the output shard), so TP × LoRA composes with no new collectives. Layer abstractions that compose are what make features multiply instead of interfere; vLLM's linear-layer stack is the load-bearing example.
- Why group at all, on a GPU? The per-row loop launches a skinny matmul per request; the grouped form launches per adapter, and within a group the rows share the adapter's A/B read (the tiling/reuse argument of Phase 7 lab-03, at miniature scale). With 64 rows across 4 adapters, that's 4 well-shaped small GEMMs vs 64 degenerate ones. Same arithmetic, ~order-of-magnitude better hardware shape.
- The delta is dense in the batch dimension but tiny in compute — so multi-LoRA overhead rides almost entirely on decode steps' idle compute (Phase 0 lab-04's story again: bandwidth-bound steps have FLOPs to spare, and the adapter's extra bytes are 32 MiB against the base's 13 GiB). This is why lab-02's capture shows no visible throughput tax — and why the claim "LoRA serving is nearly free" survives measurement.
Going further
- Implement the fused-into-base alternative for a single-adapter batch ((W + scaling·B@A) materialized, one matmul) and benchmark both in numpy at batch 1 vs 64. Merging wins single-tenant; grouping wins multi-tenant. Find the crossover and you've reproduced the deployment decision lab-03's notes describe.
- Add rank heterogeneity: adapters of rank 8, 16, 64 in one batch (real fleets have this). Your grouped loop handles it naturally; the slot-buffer version (lab-04) pads everyone to max_lora_rank. Compute the padding waste and you've found why that config knob is set with gritted teeth.
- Wire it into mini_vllm: adapter id on the Request, deltas applied to the toy model's logits per row. Multi-tenant mini-serving in ~30 lines, and the scheduler interaction (lab-04's max_schedulable) has a home to land in.
References
- upstream/vllm/lora/punica_wrapper/punica_cpu.py: the readable reference your solution mirrors; punica_base.py: add_shrink/add_expand/add_lora_linear.
- Hu et al., LoRA (2021), the factorization: https://arxiv.org/abs/2106.09685
- Chen et al., Punica: Multi-Tenant LoRA Serving (2023) — SGMV, the kernel this lab's grouping becomes: https://arxiv.org/abs/2310.18547
- Phase 7 lab-01 — the same grouping pattern with experts; lab-03 — the economics; lab-04 — which adapters get into the batch at all.
Lab 11-02 — Serve Many LoRAs in One Batch [GPU-OPT]
The CPU labs built the machinery: the grouped delta (lab-01), the 32 MiB price tag
(lab-03), the slot cache (lab-04). This lab watches all three earn their keep on real
hardware: one batch, two requests (one wanting the plain base model, one wanting a
SQL fine-tune) served together over a single 12.55 GiB copy of Llama-2-7B, each
getting visibly different behavior ('apple, banana, orange' vs
'SELECT name FROM users;'), with the adapter adding 0.03 GiB. The multi-tenant
economics, demonstrated in four lines of API and one annotated log.
No GPU? Don't panic. The capture below carries the demonstration; the reconciliation against labs 01/03/04 is the work, and it's hardware-free.
Contents
- Why this lab exists
- Requirements
- Steps
- Captured output (real run, Llama-2-7b + SQL LoRA, A100, vLLM 0.22.1, trimmed)
- Reading the numbers
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Every GPU-OPT lab in this course is an integration test of the CPU labs' models, and
this one has the most user-visible payoff: different model behavior per request in
one batch is the kind of thing that sounds impossible until you've traced lab-01's
grouped matmul, and obvious afterward. Running it (or reading the capture) closes the
loop — and teaches the operational surface you'll actually touch: enable_lora, the
max_loras/max_lora_rank reservations (lab-04 and lab-03's knobs, now with startup-
log consequences), LoRARequest's id-and-path plumbing, and the per-request
lora_request parameter that the OpenAI-compatible server exposes as the model
field (each adapter looks like a model name to API clients — the productization detail
that makes multi-tenant serving feel like multi-model serving).
Requirements
uv pip install -e ".[vllm]"
huggingface-cli download meta-llama/Llama-2-7b-hf # the shared base
# plus a small LoRA adapter for it from the Hub (any task with visible behavior —
# SQL generation is ideal because base-vs-adapter outputs differ unmistakably)
Steps
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True,
          max_loras=2, max_lora_rank=16)
sql = LoRARequest("sql-adapter", 1, "/path/to/sql_lora")
sp = SamplingParams(max_tokens=32, temperature=0)
out = llm.generate(
    ["List 3 fruits:", "Table users(id,name). Query all names:"],
    sp,
    lora_request=[None, sql],  # request 0 = base, request 1 = SQL adapter
)
for o in out:
    print(repr(o.outputs[0].text))
Then the experiments that make it a lab rather than a demo: swap which request gets
the adapter (behavior follows the tag, not the prompt); send the SQL prompt to the
base (watch it ramble — the adapter, not the prompt, carries the behavior); and load
a third adapter with max_loras=2 to meet lab-04's slot machinery in the logs.
Captured output (real run, Llama-2-7b + SQL LoRA, A100, vLLM 0.22.1, trimmed)
INFO ... LoRA enabled: max_loras=2, max_lora_rank=16
'apple, banana, orange' # request 0: base behavior
'SELECT name FROM users;' # request 1: SQL adapter behavior
INFO ... Model weights take 12.55 GiB (shared by ALL requests)
INFO ... LoRA adapter 'sql-adapter' loaded: 0.03 GiB # ~1/400th of the base
Reading the numbers
- 12.55 GiB, shared: the base read once per step for the whole batch; lab-01's step-1 matmul, weighed. The dedicated-deployment alternative would hold this per tenant: lab-03's gpus_saved, with real units.
- 0.03 GiB: lab-03's adapter_bytes(4096, 32, 16) = 32 MiB, measured. When a derived constant and a log line agree to two figures, both the model and your reading of the log are validated (the reconciliation habit, sixth appearance).
- Two behaviors, one batch: the rows took the same forward pass through the base; only request 1's rows detoured through the shrink/expand delta (lora_request=[None, sql] is literally lab-01's adapter_ids = [-1, 1]). The same step, two models' worth of behavior; there is no trick left in that sentence for you anymore.
- max_loras=2, max_lora_rank=16 in the first line: lab-04's slot count and lab-03's per-slot size, reserved at startup. Read them as a memory line item: 2 slots × rank-16 buffers, carved before KV blocks (Phase 2 lab-03's ritual gained a claimant).
Hitchhiker's notes
- The API server's productization: under vllm serve --enable-lora --lora-modules sql=/path/..., each adapter appears as a model name in the OpenAI-compatible /v1/models list, and clients select fine-tunes via the standard model field. Tenants never learn they're sharing; the consolidation is invisible by design. Runtime add/remove exists too (/v1/load_lora_adapter), so a tenant can be onboarded without a restart.
- Latency asymmetry to expect: the first request for a cold adapter pays the host→device slot load (lab-04's miss, milliseconds) plus, the first time ever, disk loading. Steady-state requests pay only the delta compute (~1%, invisible). If a tenant's p50 is fine but p99 spikes correlate with their traffic gaps, that's the slot cache breathing; lab-04's thrash arithmetic is the diagnosis sheet.
- Quality due diligence transfers from Phase 6 lab-02: "the outputs looked right" is a smoke test. A tenant migration to shared serving deserves an eval-set diff (dedicated vs consolidated), which, per lab-01's equality proof, should show only float-reordering noise. If it shows more, suspect rank/config mismatches in the adapter conversion, not the engine.
- What doesn't work (v0.22): adapters must target the base's linear layers (embedding/lm-head support varies), the adapter rank must be ≤ max_lora_rank, and the base model must match exactly (an adapter trained on Llama-2-7B-chat applied to Llama-2-7B-hf loads fine and behaves subtly wrong: the silent version-skew failure; checksum your bases).
Reflect
- Trace request 1's tokens through the phase: which lab's code decided it could enter the batch (lab-04), which loaded its weights where (lab-03's bytes into lab-04's slot), which computed its detour (lab-01), and what the base request paid for any of it (nothing: lab-01's -1 rows). If you can narrate that chain cold, the phase is yours.
- Your platform hosts 40 tenant fine-tunes on max_loras=8 engines. Using labs 03+04: what traffic shape makes this comfortable, what shape melts it, and what do you monitor to tell them apart? (Skew → slot hit rate; uniform simultaneous activity → thrash; monitor per-engine adapter hit rate and defer counts.)
- Why does the engine require max_lora_rank up front instead of sizing slots per adapter? (Phase 5's Constraint: fixed buffer shapes for captured graphs and fused kernels, the recurring trade of flexibility for replay. Heterogeneous ranks pad to the max; lab-01's going-further priced that.)
References
- upstream/vllm/lora/: request plumbing, slot manager, punica kernels; the whole phase's upstream home.
- vLLM docs, LoRA Adapters (serving config, runtime loading, the OpenAI-server productization): https://docs.vllm.ai/en/latest/features/lora/
- Labs 01 (the math), 03 (the bill), 04 (the slots) — this run is their joint integration test.
- Phase 6 lab-02 — the quality-verification discipline that transfers here verbatim.
Lab 11-03 — LoRA Economics: the Multi-Tenant Arithmetic [CPU-OK]
Multi-LoRA serving exists as a product category because of five numbers, and this lab has you compute all five: a rank-16 adapter for a 7B model weighs 32 MiB (you'll derive the exact figure that appeared as "0.03 GiB" in lab-02's capture — model and measurement agreeing is the course's favorite trick); that's ~1/430th of the base weights; 32 of them fit in a single GiB of spare HBM; rank scales the bill linearly (and quality famously doesn't); and serving 100 tenants takes 13 engines instead of 100 GPUs. When a platform pitch says "thousands of fine-tunes on shared infrastructure," this lab is the spreadsheet behind the slide — and after it, you can audit such pitches in your head.
Contents
- Why this lab exists
- Background: where the 400× comes from
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
This is the third "economics as functions" lab in the course (after Phase 0 lab-02's
KV calculator and Phase 8 lab-04's speculation model), and the pattern deserves
naming: the highest-leverage engineering questions — can we afford it? how many fit?
when does it stop paying? — reduce to short arithmetic over architecture constants,
and an engineer who has packaged that arithmetic into tested functions answers in
seconds what others answer with meetings. Multi-LoRA's arithmetic is the most
business-shaped of the three: it directly prices a product (per-tenant fine-tunes)
against its alternative (dedicated deployments), and the gpus_saved function is, not
even metaphorically, a line in someone's cloud bill.
The lab also grounds two config knobs you'll meet operationally: max_lora_rank sizes
the pre-allocated adapter buffers (rank is a memory commitment, not just a quality
dial — lab-04 builds the slots this arithmetic sizes), and max_loras is the
concurrency denominator in the fleet math.
Background: where the 400× comes from
A LoRA adapter replaces a weight update ΔW (which would be out × in, as big as the
layer) with a rank-r factorization B @ A — A: (r, in), B: (out, r) — so the
parameter count collapses from out·in to r·(in + out). For a 4096² projection at
r=16: 16.8M → 131K parameters, a 128× shrink per layer. Across a 7B model (32
layers × 4 attention projections targeted, the standard recipe):
131,072 params × 4 targets × 32 layers × 2 B (fp16) = 32 MiB
7,000,000,000 / 16,777,216 params ≈ 417×
The shrink is the whole business model: base weights are read once per step for the
entire batch regardless of how many tenants it contains (lab-01's shared base
matmul), KV cache is adapter-agnostic, and each tenant's marginal footprint is their
32 MiB plus nothing. The compute side has the same shape — the delta costs
2·r·(in+out) FLOPs per token against the base's 2·in·out, the same ~128× ratio —
which is why a batch full of different adapters runs at nearly base-model speed
(punica/SGMV kernels make the grouping efficient; lab-01 built their logic).
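The arithmetic above fits in a dozen lines. This sketch uses hypothetical names (lora_params and the variables below are not the lab's function signatures), and the fleet math assumes one GPU per dedicated deployment and one GPU per shared engine, as the phase's headline numbers do.

```python
def lora_params(rank, d_in, d_out):
    """Parameters in one adapted projection: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)

hidden, layers, targets, rank, bytes_per_param = 4096, 32, 4, 16, 2  # fp16

per_projection = lora_params(rank, hidden, hidden)               # 131,072
adapter_params = per_projection * targets * layers               # 16,777,216
adapter_mib = adapter_params * bytes_per_param / 2**20           # 32.0
per_layer_shrink = (hidden * hidden) / per_projection            # 128.0
adapters_per_gib = 2**30 // (adapter_params * bytes_per_param)   # 32
base_to_adapter = 7_000_000_000 / adapter_params                 # ~417

# Fleet math: dedicated = one GPU per tenant; shared = ceil(tenants / max_loras).
tenants, max_loras = 100, 8
shared_engines = -(-tenants // max_loras)                        # ceiling division
gpus_saved = tenants - shared_engines

print(adapter_mib, per_layer_shrink, adapters_per_gib, round(base_to_adapter),
      shared_engines, gpus_saved)
```

Changing rank to 64 multiplies adapter_mib by exactly 4, which is the linearity the lab's rank test pins.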
Files
- starter.py: lora_params_per_layer, adapter_bytes, adapters_per_gib, shrink_ratio, gpus_saved. Your work.
- solution.py: reference.
- test_lab.py: the per-layer count, the 32 MiB ↔ lab-02 reconciliation, density per GiB, the headline ratio, rank linearity, and the fleet math.
Run
LAB_IMPL=starter pytest phase-11-multi-lora/labs/lab-03-lora-economics -q
pytest phase-11-multi-lora/labs/lab-03-lora-economics -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_per_layer_params | r·(in+out): the factorization's bill, exactly |
| test_adapter_size_matches_the_lab02_capture | 32 MiB = the "0.03 GiB" from lab-02's real log. Deriving a measured number from constants is the moment the model becomes trustworthy |
| test_hundreds_of_adapters_per_gib | 32/GiB: adapter storage is never the constraint; slots and loading are (lab-04) |
| test_shrink_ratio_is_the_headline | 400–450×, computed not quoted |
| test_rank_is_a_linear_dial | r=64 costs exactly 4× r=16, and since max_lora_rank sizes every pre-allocated slot, one tenant demanding rank 64 quadruples everyone's slot reservation. Config knobs with fleet-wide blast radius deserve tests |
| test_gpus_saved | 100 tenants @ max_loras=8 → 87 GPUs saved. The slide, audited |
Hitchhiker's notes
- What the simplification hides (know before quoting): real targets aren't all square. Llama's gate/up/down MLP projections (often also adapted) are hidden × 2.7·hidden-ish, and GQA's k/v projections are narrower than q/o. Adapting all-linear-layers at r=16 lands nearer 80–120 MiB for a 7B. The structure of the arithmetic is what transfers; refit the constants to any model card in two minutes.
- Why rank-16 at all, if rank is linear cost? Because LoRA quality saturates fast: the original paper's striking result was r=1..4 capturing most of full fine-tuning on many tasks. The production default of 8–16 is generosity, not necessity; tenants asking for 256 are usually solving a data problem with a parameter budget (and multiplying your slot memory 16× over a rank-16 reservation; push back with this lab's numbers).
- The denominator in gpus_saved is max_loras, not "adapters you host." Hundreds can sit in host RAM or disk; only max_loras are concurrently active per step. The fleet math assumes tenant traffic interleaves well: 100 tenants who all spike at 9 a.m. sharp need more headroom than the formula's floor. Capacity formulas are load-shape assumptions in disguise (Phase 7 lab-04's lesson, tenant edition).
- Why not merge the adapter into the weights (W + BA, zero overhead)? Single-tenant: absolutely, and tooling does. Multi-tenant: merging forks the base; you're back to one model copy per tenant, which is the disease this phase cures. The unmerged factorization is the sharing mechanism, the same way Phase 2's block indirection is the memory sharing.
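The first note's "refit the constants" step can be sketched for one concrete case. The shapes below are assumptions taken from the published Llama-2-7B configuration (hidden 4096, MLP intermediate 11008, 32 layers, full-width k/v projections, so no GQA narrowing for this particular model); the variable names are hypothetical and other 7B models will land elsewhere in the quoted range.

```python
def lora_params(rank, d_in, d_out):
    """Parameters in one adapted projection: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)

# Assumed Llama-2-7B shapes (from its public config; verify against your model card).
hidden, intermediate, layers, rank, bytes_per_param = 4096, 11008, 32, 16, 2

attention = 4 * lora_params(rank, hidden, hidden)     # q, k, v, o
mlp = (2 * lora_params(rank, hidden, intermediate)    # gate, up: hidden -> intermediate
       + lora_params(rank, intermediate, hidden))     # down: intermediate -> hidden

attention_only_mib = attention * layers * bytes_per_param / 2**20
all_linear_mib = (attention + mlp) * layers * bytes_per_param / 2**20
growth = all_linear_mib / attention_only_mib

print(attention_only_mib, all_linear_mib, round(growth, 2))
```

Under these assumptions the all-linear adapter comes to roughly 76 MiB (about 80 MB), close to the low end of the range quoted above and a little over twice the attention-only figure.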
Going further
- Refit adapter_bytes with real Llama-2-7B shapes (q/k/v/o + gate/up/down, GQA widths) and compare against an actual adapter checkpoint's file size from the Hub; close the loop with a du -sh.
- Add slot_reservation_bytes(max_loras, max_lora_rank, ...): the pre-allocated HBM the engine reserves at startup whether or not adapters load (it competes with KV blocks! Phase 2 lab-03's carving, with a new claimant). Compute the KV-block cost of max_loras=32, max_lora_rank=64 on a 24 GiB card.
- Price the compute side: delta FLOPs per token vs base FLOPs, then the batch-of-mixed-adapters overhead vs batch-of-one. The answer (~1%) is why lab-02's capture shows no visible throughput tax; verify against it.
References
- Hu et al., LoRA: Low-Rank Adaptation of Large Language Models (2021) — the factorization and the rank-saturation result: https://arxiv.org/abs/2106.09685
- Chen et al., Punica: Multi-Tenant LoRA Serving (2023) — the SGMV kernel and the multi-tenant economics formalized: https://arxiv.org/abs/2310.18547
- Sheng et al., S-LoRA: Serving Thousands of Concurrent LoRA Adapters (2023) — the thousands-of-adapters regime this arithmetic enables: https://arxiv.org/abs/2311.03285
- upstream/vllm/lora/: where max_loras/max_lora_rank size real buffers.
- Lab-02: the captured 0.03 GiB this lab derives; lab-04: the slots this lab sizes.
Lab 11-04 — Adapter Slots: the LRU Cache the Scheduler Must Obey [CPU-OK]
Lab-03 priced adapters at 32 MiB; hundreds fit in spare HBM. So why does
max_loras=2 exist, and why is exceeding it a scheduling event rather than a memory
error? Because active adapters don't live in loose 32 MiB allocations — they live in
pre-allocated slots (fixed buffers sized for max_lora_rank, baked into the
kernels' launch shapes and CUDA graphs), and max_loras is the slot count. This lab
builds both halves of the machinery that manages them: the LRU slot cache (hit /
load / evict+load, with the recency bookkeeping that keeps hot tenants resident) and
the scheduler constraint it forces — a step's batch may reference at most
max_loras distinct adapters, with overflow requests deferred, not barriered. It is
Phase 2's eviction story and Phase 3's admission story, replayed one level up the
stack — deliberately.
Contents
- Why this lab exists
- Background: why slots, and what they cost
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Multi-LoRA's failure modes in production are almost never about the math (lab-01
settled that) — they're about slot pressure: a tenant complains about p99 latency
and the cause is their adapter thrashing in and out of slots behind two hotter
tenants; throughput sags after onboarding tenant #9 on a max_loras=8 engine because
every step now defers someone. Diagnosing these requires exactly the two models you'll
build: the cache (whose hit rate is the tenant-experience metric) and the admission
rule (whose deferrals are the throughput tax). Both are ~20 lines, and both behave
counterintuitively enough under skewed traffic that you want the test suite's numbers
in your head before the incident.
The pedagogical reason is the rhyme. You have now built an LRU-flavored eviction structure for KV blocks (Phase 2 lab-05), a multi-resource admission loop (Phase 3 lab-01), and here both again for adapters. The course repeats the pattern on purpose: cache-with-eviction + admission-under-capacity is THE serving-systems kata, and recognizing it instantly — whatever the cached object is — is a maintainer reflex. (You'll see it once more with prefix-cache-aware routing in Phase 15.)
Background: why slots, and what they cost
Why not allocate adapters dynamically, since they're tiny? Three converging reasons:
- Kernel shape stability: the punica/SGMV kernels (lab-01's grouping, fused) index adapter weights by slot id out of a stacked buffer (max_loras, max_lora_rank, …); a fixed buffer means fixed pointers and shapes, which CUDA graphs (Phase 5's Constraint 2!) can capture. Dynamic allocation would re-trigger capture or force eager mode.
- Predictable memory: max_loras × slot_size is reserved at startup, before KV blocks are carved (Phase 2 lab-03's ritual gains a line item). No mid-serving OOM from a tenant spike; the cost is paid visibly, up front (the course's recurring "pay it where you can see it").
- Bounded step complexity: the per-step adapter gather is over ≤ max_loras segments, keeping the kernel's metadata small and the scheduler's reasoning finite.
The slot cache's job is then classic: keep the right max_loras adapters resident.
LRU is the policy (recency ≈ tenant activity), move_to_end is the entire
implementation subtlety, and a miss costs a host→device copy of lab-03's 32 MiB
(~milliseconds — a few decode steps' worth, painful only when it recurs, i.e. when
thrashing).
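The hit / load / evict+load cycle described above can be sketched in a few lines. This is a minimal illustration, assuming an OrderedDict keyed by adapter id; the class name and the returned strings are hypothetical, it is not the lab's reference solution, and the host-to-device copy is not modeled.

```python
from collections import OrderedDict


class SlotCacheSketch:
    """Toy LRU over adapter ids (illustrative names, not the lab's API)."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.slots = OrderedDict()  # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def ensure(self, adapter_id):
        if adapter_id in self.slots:
            self.slots.move_to_end(adapter_id)  # refresh recency on every hit
            self.hits += 1
            return "hit"
        self.misses += 1
        if len(self.slots) < self.num_slots:
            self.slots[adapter_id] = None       # free slot available
            return "load"
        self.slots.popitem(last=False)          # evict the least recently used
        self.slots[adapter_id] = None
        return "evict+load"


cache = SlotCacheSketch(num_slots=2)
outcomes = [cache.ensure(a) for a in (1, 2, 1, 3)]
resident = list(cache.slots)
```

Dropping the move_to_end call turns this into FIFO: adapter 1 would be evicted by the final request even though it was just used.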
Files
- starter.py: AdapterSlotCache (ensure / resident / stats) and max_schedulable (the FCFS admission walk with deferral). Your work.
- solution.py: reference (note OrderedDict as the LRU: insertion order + move_to_end + popitem(last=False), the standard Python idiom, worth owning).
- test_lab.py: fill/hit/evict mechanics, LRU ordering, the skewed-traffic hit rate, the distinct-adapter cap, base requests riding free, and deferral-not-barrier.
Run
LAB_IMPL=starter pytest phase-11-multi-lora/labs/lab-04-adapter-slot-cache -q
pytest phase-11-multi-lora/labs/lab-04-adapter-slot-cache -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_fill_then_hit | The three outcomes and honest hit/miss accounting: the metric a tenant dashboard graphs |
| test_lru_evicts_the_coldest | Recency refresh works: re-touching adapter 1 saves it; 2 dies. Forget move_to_end and this test is your tripwire (FIFO masquerading as LRU is the classic one-line bug) |
| test_skewed_traffic_loves_lru | 80/20 traffic over 16 adapters, 4 slots → >75% hit rate. Skew is the friend of small caches (the same reason CPU caches work) and the reason max_loras=8 serves 100 tenants acceptably if traffic is skewed (lab-03's fleet math gains its load-shape footnote) |
| test_scheduler_caps_distinct_adapters_per_step | The admission walk: slots claimed FCFS, reuse free, overflow deferred |
| test_base_requests_never_consume_slots | None rides free: mixed base+adapter batches (lab-02's demo) cost slots only for the adapters |
| test_deferral_is_not_a_barrier | A blocked adapter request doesn't stall later admissible ones; contrast with Phase 3 lab-01's head-of-line break for memory. Two resources, two deliberately different policies: KV exhaustion stops admission (fairness, deadlock logic); slot exhaustion skips individuals (slots free predictably next step). Policy per resource is a design decision, not a default |
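The last three rows of the table describe one walk over the waiting queue. A minimal sketch of that walk follows, assuming requests arrive as (request_id, adapter_id) pairs with None meaning the base model; the function name and return shape are hypothetical, not the lab's max_schedulable signature.

```python
def admit_sketch(waiting, max_loras):
    """Walk the queue FCFS; waiting is a list of (request_id, adapter_id or None)."""
    active = set()                 # distinct adapters claimed this step
    admitted, deferred = [], []
    for req_id, adapter in waiting:
        if adapter is None or adapter in active:
            admitted.append(req_id)        # base request, or reuse of a claimed slot
        elif len(active) < max_loras:
            active.add(adapter)            # claim a new slot
            admitted.append(req_id)
        else:
            deferred.append(req_id)        # skip this one and keep walking
    return admitted, deferred


waiting = [("a", 1), ("b", 2), ("c", 3), ("d", None), ("e", 1)]
admitted, deferred = admit_sketch(waiting, max_loras=2)
```

Request "c" is deferred because adapters 1 and 2 already hold both slots, yet "d" (base) and "e" (adapter 1 again) behind it are still admitted. That is the deferral-not-barrier behavior.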
Hitchhiker's notes
- Where this lives upstream: upstream/vllm/lora/models.py (LoRAModelManager / LRUCacheLoRAModelManager: your cache, with host-side tiers) and the scheduler's lora gating (search max_loras in vllm/v1/core/sched/scheduler.py: your max_schedulable walk, inline in the admission loop). The two-tier reality: evicted adapters drop to host RAM (cheap reload), not to disk; "cold start" for a brand-new adapter adds checkpoint loading on top.
- Thrash arithmetic: at max_loras slots and k > max_loras simultaneously active uniform tenants, every step evicts; hit rate collapses toward max_loras/k, each miss costs a 32 MiB copy, and aggregate throughput cliffs. The fix hierarchy: raise max_loras (costs slot memory, lab-03's reservation), shard tenants across engines by affinity (routing, Phase 15's cousin), or batch tenant traffic in time. Knowing the cliff exists before tenant #9 onboards is this lab's operational payoff.
- Prefix caching interaction (Phase 2 lab-05's note, now load-bearing): KV computed under adapter X is not valid for adapter Y, because the adapter changes the model. The block hash therefore includes the LoRA id; two tenants with identical system prompts share nothing. Multi-tenant capacity planning that assumed prefix-cache savings across tenants is wrong by exactly that assumption.
- ensure and max_schedulable must agree: the scheduler admits a set of adapters, then the cache loads them; if the admission cap exceeded the slot count, the load would evict an adapter another admitted request needs this same step. The invariant "admitted distinct adapters ≤ slots" is cross-component (scheduler promises, cache relies), the same shape as Phase 3 lab-04's deadlock invariant. When you modify one side upstream, the review question is always "who relies on this bound?"
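The max_loras/k figure in the thrash note can be checked with a short simulation. This sketch assumes independent, uniformly random requests over k adapters and one request per step, which is a simplification of real traffic; the function name is hypothetical.

```python
import random
from collections import OrderedDict


def lru_hit_rate(num_slots, num_adapters, num_requests, seed=0):
    """Hit rate of an LRU slot cache under uniform random adapter traffic."""
    rng = random.Random(seed)
    slots = OrderedDict()
    hits = 0
    for _ in range(num_requests):
        adapter = rng.randrange(num_adapters)
        if adapter in slots:
            slots.move_to_end(adapter)       # refresh recency
            hits += 1
        else:
            if len(slots) >= num_slots:
                slots.popitem(last=False)    # evict the coldest
            slots[adapter] = None
    return hits / num_requests


# 4 slots, 16 equally active tenants: expect a hit rate near 4/16
rate = lru_hit_rate(num_slots=4, num_adapters=16, num_requests=20_000)
```

Under uniform traffic the eviction policy cannot help, because every resident adapter is equally likely to be requested next. Only skew (the 80/20 test) or more slots moves the number.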
Going further
- Add a host tier: evicted adapters go to a (larger) host LRU; ensure returns "hit" / "load-from-host" / "load-from-disk" with costs 0 / 1 / 30. Run the skewed workload and price the tiers: you've rebuilt LRUCacheLoRAModelManager's actual shape and S-LoRA's core argument.
- Couple the two halves: drive max_schedulable's admitted set into the cache per step and assert the invariant above holds for random traffic, then break the cap (+1) and watch which workloads corrupt. Cross-component invariants deserve cross-component tests.
- Simulate tenant p99: timestamped requests, miss = +3 steps of latency; compare per-tenant p99 under LRU vs random eviction at various skews. The plot is the argument for LRU, and for affinity routing once skew fades.
References
- upstream/vllm/lora/models.py: LoRAModelManager and the LRU variant.
- upstream/vllm/lora/punica_wrapper/: the slot-indexed kernel buffers your cache fronts.
- Sheng et al., S-LoRA (2023): paged adapter memory + the host tier at thousands of adapters: https://arxiv.org/abs/2311.03285
- Phase 2 lab-05 — the eviction kata's first appearance; Phase 3 lab-01 — the admission kata's; this lab — both, one level up.
Phase 11 — Exercises: Multi-LoRA
Warm-up (explain)
1. What is a LoRA, in terms of W' = W + ?? Why is it cheap (use the rank r)?
2. What are the shrink and expand steps, and what shapes do they pass through?
3. Why does serving N adapters cost ≈ one base read + N small matmuls, not N model runs?
Core (trace the code)
4. In LoRARequest (request.py:8), what identifies an adapter and what does id 0/-1 mean?
5. In punica_cpu.py, match add_shrink (:166) and add_expand (:197) to shrink/expand.
6. How does SGMV apply different adapters to different rows in one kernel (segments by id)?
7. Where is max_loras enforced? Name both the manager and the scheduler spot (Phase 3).
Build (your lab)
8. In lab-01, why is the LoRA delta at most rank r? Prove it with matrix_rank.
9. Add an effective_rank knob: stack two adapters on the same rows (sum of deltas) and verify it equals adding them sequentially.
10. Measure FLOPs: compare base matmul FLOPs to the adapter's shrink+expand FLOPs for r=16, in=out=4096. What's the overhead ratio?
Design (staff-level)
11. A platform serves 5,000 customer fine-tunes. Compare (a) one full deployment per customer vs (b) shared base + multi-LoRA: memory, cost, cold-start. Where does (b) win and where does it hurt?
12. max_loras is hit constantly (lots of distinct adapters per batch). What are your options (raise it, route by adapter, replicate), and the tradeoffs?
13. How does LoRA on MoE expert layers (lora/layers/fused_moe.py) complicate the batched apply, and why?
Self-grading
4–7 and 11–13 are interview-grade. Could you whiteboard shrink/expand and the grouped batched apply? If not, re-read 01-deep-dive.md.
Phase 11 — Interview Questions: Multi-LoRA
Q1. What is a LoRA and why is it cheap?
Model answer
A LoRA replaces full fine-tuning with a small additive patch: W' = W + scaling·B·A, where A
(r×in) and B (out×r) have a tiny rank r (8–64). Applying it is the base matmul plus two small
rank-r matmuls (shrink x→r, expand r→out). A and B together hold r×(in+out) parameters, tens to
hundreds of times fewer than W's in×out (128× at r=16, in=out=4096), so you can store and serve
many adapters cheaply over one shared base.
Q2. How does vLLM apply different adapters to different requests in one batch?
Model answer
It groups the batch by adapter id and uses SGMV/punica kernels: rows are segmented by their
lora_int_id, and each segment is matmul'd against its adapter's A/B in one grouped kernel
(add_shrink/add_expand/add_lora_linear). So a heterogeneous batch costs the shared base read
plus a little per distinct adapter — not one model run per request. It's the same "group by id, do a
grouped matmul" trick as MoE, keyed by adapter instead of expert.
Q3. What's the cost model that makes multi-LoRA a structural advantage?
Model answer
The base weights are shared and read once for the whole batch; each adapter adds only r×(in+out)
parameters and a rank-r matmul. So serving N fine-tunes costs ≈ base + N tiny deltas, versus N
full model copies. That lets a platform serve thousands of customer fine-tunes from one deployment —
a real cost moat (Phase 19, Track C).
Q4. How are adapters managed in memory, and how does the scheduler get involved?
Model answer
The LoRAModelManager loads adapters into a bounded set of GPU slots and LRU-evicts when over
max_loras (same eviction discipline as the KV cache). max_loras bounds distinct adapters per
step; the scheduler enforces it during waiting-admission (it tracks scheduled_loras and skips a
request whose adapter would exceed the limit this step). So multi-LoRA rides the normal scheduler
with one extra constraint.
Q5. Does batching many adapters change the output?
Model answer
No — the grouped/SGMV application produces exactly the same result as applying each adapter to its request individually; it just shares the base matmul and fuses the per-adapter work. Same "optimization, not behavior change" guarantee as the KV cache, chunked prefill, and spec decode.
Rapid-fire
- LoRA formula? W' = W + scaling·B·A, rank r small.
- Two apply steps? shrink (in→r), expand (r→out).
- Batched-LoRA kernel family? punica / SGMV (segment by adapter id).
- Bounds adapters/step? max_loras (manager LRU + scheduler check).
- Adapter id 0/-1? base / no adapter.
Phase 11 — Cheatsheet: Multi-LoRA
The one-liner
A LoRA is a tiny additive patch W' = W + scaling·B·A (rank r ≪ in,out). vLLM serves MANY adapters
in one batch over a shared base by grouping rows by adapter id (punica/SGMV) — base read once, a
little per adapter.
The math
- shrink: s = x·Aᵀ (in→r).
- expand: Δ = s·Bᵀ (r→out).
- output = x·Wᵀ + scaling·Δ.
- A: (r, in), B: (out, r). Adapter size = r×(in+out) ≪ W = in×out.
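The identities above are easy to check numerically. The sketch below uses small random matrices (the sizes are arbitrary choices for illustration) to confirm that the unmerged shrink/expand path matches the merged W' = W + scaling·B·A, that the delta has rank at most r, and what the size ratio is at realistic shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, r, scaling = 4, 64, 48, 8, 0.5
x = rng.standard_normal((n, d_in))
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in))     # A: (r, in)
B = rng.standard_normal((d_out, r))    # B: (out, r)

s = x @ A.T                            # shrink: (n, in) -> (n, r)
delta = s @ B.T                        # expand: (n, r) -> (n, out)
unmerged = x @ W.T + scaling * delta
merged = x @ (W + scaling * (B @ A)).T

same_output = np.allclose(unmerged, merged)
delta_rank = np.linalg.matrix_rank(B @ A)   # bounded by r


def size_ratio(d_in, d_out, r):
    """How many times smaller the adapter (A plus B) is than W."""
    return (d_in * d_out) / (r * (d_in + d_out))


ratio_4096_r16 = size_ratio(4096, 4096, 16)
```

At in = out = 4096 and r = 16 the adapter for one matrix is 128 times smaller than W; lower ranks and larger matrices push the ratio into the hundreds.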
Multi-adapter batching
Group rows by lora_int_id; per-group grouped matmul (SGMV). Cost ≈ base + Σ(small per adapter),
NOT N model runs. Output identical to per-request application.
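A plain NumPy loop can stand in for the grouped kernel to show the equivalence claimed above. This is a sketch only: the real path is a fused SGMV-style kernel, the function name is hypothetical, and id 0 is treated as "base, no delta" following the cheatsheet.

```python
import numpy as np


def grouped_lora_delta(x, lora_ids, adapters, d_out):
    """x: (n, in); lora_ids: (n,) ints with 0 = base; adapters: id -> (A, B)."""
    delta = np.zeros((x.shape[0], d_out))
    for lora_id in np.unique(lora_ids):
        if lora_id == 0:
            continue                                # base rows get no delta
        rows = np.flatnonzero(lora_ids == lora_id)  # this adapter's segment
        A, B = adapters[int(lora_id)]
        delta[rows] = (x[rows] @ A.T) @ B.T         # shrink then expand, per segment
    return delta


rng = np.random.default_rng(0)
d_in, d_out, r = 16, 12, 4
adapters = {
    i: (rng.standard_normal((r, d_in)), rng.standard_normal((d_out, r)))
    for i in (1, 2)
}
x = rng.standard_normal((5, d_in))
lora_ids = np.array([1, 0, 2, 1, 2])

batched = grouped_lora_delta(x, lora_ids, adapters, d_out)
one_by_one = np.stack([
    np.zeros(d_out) if i == 0 else (x[n] @ adapters[i][0].T) @ adapters[i][1].T
    for n, i in enumerate(lora_ids.tolist())
])
```

The loop runs once per distinct adapter, not once per request, which is the cost shape the section describes.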
Memory & scheduling
- max_loras: distinct adapters per step. Manager LRU-evicts extras (like the KV BlockPool).
- Scheduler enforces max_loras at waiting-admission (scheduled_loras check, Phase 3).
- LoRARequest (id + name + path); id 0 = base.
MoE LoRA
lora/layers/fused_moe.py patches expert layers too (same shrink/expand, trickier routing).
Key upstream
- lora/request.py:8 LoRARequest
- lora/punica_wrapper/punica_base.py:42 add_shrink, :57 add_expand, :88 add_lora_linear · punica_cpu.py:166 / :197 (readable)
- lora/layers/{base_linear,column_parallel_linear,row_parallel_linear,fused_moe}.py
- lora/model_manager.py (load/activate/LRU) · lora/lora_weights.py (A, B)
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md
Phase 12 — The Hitchhiker's Guide to Structured Outputs
← Phase 11 · Course home · Phase 13 →
Contents
- Don't Panic
- Step 1: The problem — "please output JSON" is a prayer, not a guarantee
- Step 2: The idea — make illegal tokens impossible, not unlikely
- Step 3: From spec to automaton — why JSON needs a stack
- Step 4: Characters vs tokens — the lifting problem
- Step 5: The per-step pipeline in vLLM
- Step 6: The costs, and where they're hidden
- The invariants to memorize
- What you'll do
Don't Panic
Sometimes the output must be valid JSON, or match a regex, or follow a grammar — because a machine, not a human, is going to parse it. vLLM enforces this by computing, every decode step, the set of tokens that are legal next under the grammar, and setting every other token's logit to −inf before sampling. The model literally cannot emit an invalid token. No retries, no "I asked nicely," no post-hoc repair. The whole phase is that one move — mask, then sample — plus the machinery that makes it correct (a grammar automaton per request) and fast (a precompiled token bitmask per step).
every step, per constrained request
┌────────────────────────────────────┐
grammar ──►│ automaton state ──► allowed tokens │──► bitmask (vocab bits)
└────────────────────────────────────┘ │
model ────────────────► logits [vocab] ──── mask (−inf) ◄────┘
│
▼
sample → token ──► advance automaton state
Step 1: The problem — "please output JSON" is a prayer, not a guarantee
A sampler picks from a probability distribution over ~100k tokens. Even a model that is 99.9%
"JSON-reliable" per token expects about one invalid token per 1,000-token response, and only
about 37% of such responses (0.999^1000) come out fully clean. Tool
calls, agents, extraction pipelines — anything that feeds model output to json.loads or an
API schema — needs 0% failure, by construction. Prompting harder changes the probabilities;
constrained decoding changes the support of the distribution. That's the difference between
unlikely and impossible.
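The arithmetic behind that reliability claim, as a quick check. It assumes each token fails independently with the same probability, which is a simplification.

```python
p_token = 0.999          # per-token probability of staying valid
n_tokens = 1000          # response length

p_clean = p_token ** n_tokens                 # whole response has no invalid token
p_broken = 1.0 - p_clean                      # at least one invalid token
expected_breaks = n_tokens * (1.0 - p_token)  # mean number of invalid tokens
```

So roughly two responses in three contain at least one invalid token, with one break expected per response on average.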
Step 2: The idea — make illegal tokens impossible, not unlikely
At any point mid-generation, the text so far puts the grammar in some state: "inside a
string", "after a {, expecting a key or }", "just closed the top-level object — done".
Each state defines exactly which next characters are legal. So:
- Track the grammar state as tokens are emitted.
- Before each sample, compute the allowed set for the current state.
- Mask: logits[token] = −inf for every token not in the set. Softmax renormalizes the probability over the legal tokens; the model still chooses which legal token, with its own preferences intact.
- After sampling, advance the state by the chosen token.

Note what this is not: it's not generate-then-validate (wasteful, unbounded retries), and it's not beam-searching for valid outputs (expensive). It's one mask per step, exact by construction.
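The mask-then-sample move, reduced to NumPy. This sketch assumes a boolean allowed array that already came from the grammar; the function name is hypothetical.

```python
import numpy as np


def masked_probs(logits, allowed):
    """Softmax over logits with disallowed entries forced to -inf."""
    masked = np.where(allowed, logits, -np.inf)
    masked = masked - masked.max()     # stable softmax; assumes at least one allowed token
    weights = np.exp(masked)           # exp(-inf) is exactly 0.0
    return weights / weights.sum()


logits = np.array([2.0, 1.0, 0.5, 3.0])
allowed = np.array([True, True, False, False])
probs = masked_probs(logits, allowed)
```

Token 3 had the highest raw logit but ends with probability exactly zero, while tokens 0 and 1 keep their original ratio of e^(2.0 − 1.0). The support changes; the model's preferences among legal tokens do not.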
Step 3: From spec to automaton — why JSON needs a stack
What machine tracks "the state"? Depends on the language class:
- Regex → a finite-state machine (FSM). Finitely many states, a transition table next = δ[state][char]. Your lab-01 builds exactly this.
- JSON / EBNF grammars → nesting is unbounded ([[[[…]]]]), and an FSM cannot count. You need a pushdown automaton: a state plus a stack (push on { / [, pop on } / ]). Your lab-03 builds exactly this, and it's what xgrammar implements for real.
- JSON Schema → compiled into such a grammar first ("key name then a string, key age then an integer…"). Schema → grammar → pushdown automaton is the production pipeline.
regex ──compile──► FSM (states × chars table)
JSON-schema──compile──► CFG ──► pushdown automaton (state + stack)
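Why the stack matters can be seen in a toy language made only of nested braces and brackets. This is an illustrative sketch, far smaller than lab-03's JSON subset; the names are hypothetical.

```python
CLOSER_FOR = {"{": "}", "[": "]"}


def allowed_next(stack):
    """Legal next characters given the currently open brackets."""
    allowed = set(CLOSER_FOR)                   # an opener is always legal here
    if stack:
        allowed.add(CLOSER_FOR[stack[-1]])      # only the matching closer is legal
    return allowed


def advance(stack, ch):
    """Push on an opener, pop on the matching closer, reject anything else."""
    if ch not in allowed_next(stack):
        raise ValueError(f"illegal character {ch!r}")
    if ch in CLOSER_FOR:
        stack.append(ch)
    else:
        stack.pop()


stack = []
for ch in "[[{":
    advance(stack, ch)
legal_here = allowed_next(stack)
```

After "[[{" the only legal closer is "}"; which closer is legal depends on the whole stack, and the stack can grow without bound. No fixed-size transition table can carry that information.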
Step 4: Characters vs tokens — the lifting problem
Grammars speak characters; the model emits tokens (multi-character chunks like
{"name). A token is legal iff feeding its characters one-by-one through the automaton
succeeds from the current state. Naively that's vocab_size × token_len automaton steps —
per decode step. The production answer (xgrammar's core trick) is to do the expensive
analysis at compile time: for each automaton context, precompute which tokens are
context-independent (always legal / always illegal) and store them in a compressed
token bitmask (vocab_size / 32 int32 words); only a small context-dependent remainder
(tokens that interact with the stack) is checked at runtime. That's why the per-step cost is
"fill a bitmask," not "simulate 100k tokens."
Step 5: The per-step pipeline in vLLM
The flow at the pinned commit (deep-dive walks every hop):
SamplingParams.structured_outputs {json=…, regex=…, choice=…, grammar=…}
│ (request arrives)
▼
StructuredOutputManager.grammar_init compile grammar — async, off the hot path
│ (request not schedulable until compiled — a new WAITING substate)
▼
Scheduler.get_grammar_bitmask each step: collect constrained requests
│ → manager fills one bitmask row per request
▼ GrammarOutput (numpy bitmask, serialized to workers)
gpu_model_runner: apply_grammar_bitmask reorder rows to batch order, −inf via xgr kernel
▼
sample → accepted tokens → grammar.accept_tokens() advances the automaton
Two production wrinkles worth noticing now (the deep-dive shows the code):
- Compilation is async (ThreadPoolExecutor): a request whose grammar is still compiling simply isn't scheduled yet. Compile cost never blocks the engine loop.
- Speculative decoding composes with this (Phase 8): the bitmask tensor holds one row per position (each draft token + the bonus token), and the grammar exposes rollback(n) so rejected drafts un-advance the automaton. Constraint + speculation, no special case.
Step 6: The costs, and where they're hidden
| Cost | When | Hidden how |
|---|---|---|
| Grammar compile (schema → automaton + token tables) | once per distinct grammar | async executor; cached by (type, spec) key |
| Bitmask fill (automaton state → vocab bits) | per request, per step | compile-time token classification; parallel fill above a batch threshold |
| Mask apply (−inf on logits) | per step | one fused GPU kernel over the batch |
| State advance | per accepted token | trivial (table/stack step) |
The first request with a new big schema pays a visible TTFT hit (lab-02 measures it on real vLLM). Steady-state per-step overhead is small single-digit percent.
The invariants to memorize
- Constrained decoding = mask then sample: illegal tokens get −inf; softmax renormalizes over the legal set. Output is valid by construction.
- The machine matching the language: regex → FSM; JSON/EBNF → pushdown (stack); JSON Schema compiles down to the latter.
- Grammars speak chars, models speak tokens: the token bitmask is the precompiled lifting of char-rules to vocab entries (vocab/32 int32 words per position).
- One grammar state per request, advanced on accept, rolled back on spec-decode rejection. Compile happens once per distinct grammar, off the hot path.
- Honest caveat: constraints guarantee validity, not truth, and max_tokens can still truncate mid-structure (finish_reason="length"), which no mask can save you from.
What you'll do
- Read: 01-deep-dive.md (the manager, the backend interface, xgrammar, and the scheduler/runner hops, all line-anchored).
- Build: 02-mini-build.md (a regex-FSM grammar mask as a mini_vllm/grammar.py logits processor; reference implementation + tests included).
- Labs (see labs/README.md; recommended order 01 → 03 → 02):
  - lab-01-regex-fsm-mask [CPU-OK]: compile a regex to an FSM, lift it to per-step token masks, and prove an adversarial model still always matches the regex.
  - lab-03-json-pushdown [CPU-OK]: why regex isn't enough: the stack-aware pushdown mask for a JSON subset; a brace-hating model still emits parseable JSON at depth 8.
  - lab-02-json-schema-constrained [GPU-OPT]: xgrammar via guided_json on real vLLM: 31/50 → 50/50 schema validity, first-request compile latency, the finish_reason="length" trap. Captured output included.
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
Phase 12 — Deep Dive: structured outputs in real vLLM
Paths relative to upstream/ at v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number drifts in a newer tree, search for the named symbol.

vllm/sampling_params.py                            StructuredOutputsParams (the user API)
vllm/v1/structured_output/backend_types.py         the two-interface contract (read first)
vllm/v1/structured_output/__init__.py              StructuredOutputManager (compile + bitmask)
vllm/v1/structured_output/backend_xgrammar.py      the default backend
vllm/v1/structured_output/request.py               per-request state + the cache key
vllm/v1/structured_output/utils.py                 apply_grammar_bitmask (runner side)
vllm/v1/core/sched/scheduler.py                    get_grammar_bitmask (the scheduler hook)
Contents
- 1. The user API: StructuredOutputsParams
- 2. The contract: two abstract classes
- 3. The manager: compile off the hot path
- 4. The bitmask: one row per position, spec-decode included
- 5. The runner: reorder and apply
- 6. The xgrammar backend
- Reading checklist
1. The user API: StructuredOutputsParams
vllm/sampling_params.py:41 — class StructuredOutputsParams holds exactly one of
json | regex | choice | grammar | json_object | structural_tag (__post_init__ counts the
set fields and raises if ≠ 1). This rides on every SamplingParams, so a constraint is a
per-request property — one batch can mix free requests, a JSON-schema request, and a regex
request.
The constraint becomes a cache key in
vllm/v1/structured_output/request.py:77 — get_structured_output_key() maps params to a
(StructuredOutputOptions, spec_string) tuple (JSON dict gets json.dumps-normalized).
Two requests with the same schema share one compiled grammar context.
2. The contract: two abstract classes
backend_types.py is the whole design in 136 lines — read it before anything else:
- StructuredOutputOptions (:19): the six request types (JSON, JSON_OBJECT, REGEX, GRAMMAR, CHOICE, STRUCTURAL_TAG).
- StructuredOutputGrammar (:31): per-request state. Five methods carry the whole feature: accept_tokens (advance state), validate_tokens (check without advancing; used to vet spec-decode drafts), rollback(n) (un-advance; spec-decode rejection), fill_bitmask(tensor, index) (write this request's allowed-token bits into row index), is_terminated (grammar reached an accepting end state).
- StructuredOutputBackend (:99): engine-level: compile_grammar(type, spec) → StructuredOutputGrammar and allocate_token_bitmask(max_seqs).
That rollback is in the base interface tells you spec decode wasn't bolted on — the
contract was designed so constraints and speculation compose.
3. The manager: compile off the hot path
vllm/v1/structured_output/__init__.py:36 — class StructuredOutputManager, owned by the
scheduler (scheduler.py:90), not the workers. Compile and bitmask-fill happen on the
scheduler side; only the finished numpy bitmask is shipped to GPU workers.
- grammar_init (:115): called when a constrained request arrives. Lazily instantiates the single engine-wide backend (xgrammar / guidance / outlines / lm-format-enforcer; note the comment: one backend per engine, not per request), then submits _create_grammar to a ThreadPoolExecutor: the request's grammar field holds a Future until compilation lands.
- request.py:60: the grammar property resolves that Future: a request whose grammar isn't ready yet is simply not schedulable (the scheduler skips it; search structured_output_request.grammar in scheduler.py). Compile latency costs that one request TTFT, never the engine loop.
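The pattern in this section (submit the compile to an executor, hold a Future, share work across identical specs) can be sketched independently of vLLM. Everything named here is hypothetical: the class, the key function, and the use of sort_keys to normalize dicts are assumptions for illustration, not the upstream code.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def grammar_key(kind, spec):
    """Normalize a spec into a hashable (kind, spec_string) key."""
    if isinstance(spec, dict):
        spec = json.dumps(spec, sort_keys=True)   # assumption: key order should not matter
    return (kind, spec)


class CompileCacheSketch:
    """Compile each distinct grammar once, off the caller's thread."""

    def __init__(self, compile_fn, workers=2):
        self._compile_fn = compile_fn
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._futures = {}

    def submit(self, kind, spec):
        key = grammar_key(kind, spec)
        if key not in self._futures:
            self._futures[key] = self._pool.submit(self._compile_fn, kind, key[1])
        return self._futures[key]


calls = []

def fake_compile(kind, spec_string):
    calls.append((kind, spec_string))   # stand-in for an expensive compile
    return f"compiled:{kind}"

cache = CompileCacheSketch(fake_compile)
f1 = cache.submit("json", {"a": 1, "b": 2})
f2 = cache.submit("json", {"b": 2, "a": 1})   # same schema, different key order
result = f1.result(timeout=5)
```

A scheduler loop built on this would check future.done() and skip the request until it returns True, so only that request waits on the compile.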
4. The bitmask: one row per position, spec-decode included
grammar_bitmask (__init__.py:204) is the heart. Per step the scheduler calls
Scheduler.get_grammar_bitmask (scheduler.py:1259), which collects the scheduled
constrained request IDs and delegates here. What to notice:
- Allocation (once): max_num_seqs × (1 + num_speculative_tokens) rows, one row per possible sampled position, not per request. With spec decode, request r drafting tokens d1..dk contributes k+1 rows: mask for the state before d1, before d2, …, before the bonus token.
- The spec-decode dance (the serial path): for each draft token it fills a row, then calls accept_tokens([token]) to advance the state, counting state_advancements; after the last row it calls grammar.rollback(state_advancements). The grammar temporarily pretends the drafts were accepted to compute their masks, then rewinds, because the real accept/reject verdict belongs to the rejection sampler (Phase 8).
- Parallel fill: above fill_bitmask_parallel_threshold (non-spec case), requests are batched to the executor in groups; bitmask filling is pure CPU work and parallelizes.
- Serialization: the tensor is returned as numpy (.numpy(), see the comment) because ndarray serializes much faster than a torch tensor on the way to workers; it travels in GrammarOutput (scheduler.py:1281).
- should_advance (:322) / should_fill_bitmask (:302): the reasoning-model gate. While a model is inside its thinking section, the constraint is suspended (the mask row is set to all-ones via _full_mask) and the automaton doesn't advance; enforcement begins when the reasoning parser (Phase 16) says reasoning ended.
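The fill, accept, rollback sequence from the spec-decode bullet can be traced with a toy grammar. This sketch is illustrative only: ToyGrammar (token ids must alternate even, odd, even, ...) and masks_for_drafts are hypothetical names, the masks are unpacked bool rows, and draft tokens are assumed to be legal.

```python
import numpy as np


class ToyGrammar:
    """Accepts token ids whose parity alternates, starting with even."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.history = []                      # accepted tokens, kept for rollback

    def allowed(self):
        want_even = len(self.history) % 2 == 0
        return np.array([(t % 2 == 0) == want_even for t in range(self.vocab_size)])

    def accept_tokens(self, tokens):
        for t in tokens:
            if not self.allowed()[t]:
                return False
            self.history.append(t)
        return True

    def rollback(self, n):
        if n:                                  # guard: del history[-0:] would clear it all
            del self.history[-n:]


def masks_for_drafts(grammar, draft_tokens):
    """One mask row per draft position plus one for the bonus token."""
    rows = np.zeros((len(draft_tokens) + 1, grammar.vocab_size), dtype=bool)
    advanced = 0
    for i, token in enumerate(list(draft_tokens) + [None]):
        rows[i] = grammar.allowed()            # mask for the state before this position
        if token is not None:
            if not grammar.accept_tokens([token]):
                raise ValueError(f"draft token {token} is illegal here")
            advanced += 1
    grammar.rollback(advanced)                 # rewind: the verdict is not ours to make
    return rows


grammar = ToyGrammar(vocab_size=4)
rows = masks_for_drafts(grammar, draft_tokens=[0, 1])
```

Two drafts yield three rows, and the grammar ends where it started, ready for whatever subset of the drafts is actually accepted.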
5. The runner: reorder and apply
vllm/v1/structured_output/utils.py:44 — apply_grammar_bitmask(scheduler_output, grammar_output, input_batch, logits), called from the GPU model runner right before
sampling (gpu_model_runner.py:4359). Two jobs:
- Reorder: the bitmask rows are in the scheduler's request order; the runner's batch order differs, and spec-decode offsets each request's logit rows. The function builds struct_out_req_batch_indices walking input_batch.req_ids with a cumulative_offset of spec tokens, then scatters rows into a sorted_bitmask sized [logits.rows, words] (unconstrained rows = all-1 = all-allowed).
- Apply: xgr.apply_token_bitmask_inplace(logits, bitmask, indices=out_indices): one fused kernel writes −inf into every disallowed logit. 32 vocab entries per int32 word is why the mask is cheap to ship and apply.
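What "apply a packed bitmask to selected logit rows" means can be shown in NumPy. This is a CPU sketch, not the xgrammar kernel: the function name is hypothetical, and the bit layout (bit i of word w covers token 32·w + i) is an assumption made for the example.

```python
import numpy as np


def apply_packed_bitmask(logits, bitmask_rows, row_indices):
    """Write -inf into disallowed logits.

    logits:       (rows, vocab) float array, modified in place
    bitmask_rows: (m, ceil(vocab / 32)) int32, one packed row per constrained position
    row_indices:  row_indices[i] is the logits row that bitmask_rows[i] constrains
    """
    vocab = logits.shape[1]
    words = np.ascontiguousarray(bitmask_rows).view(np.uint32)
    shifts = np.arange(32, dtype=np.uint32)
    bits = (words[:, :, None] >> shifts) & 1            # (m, words, 32)
    allowed = bits.reshape(len(words), -1)[:, :vocab].astype(bool)
    for mask, row in zip(allowed, row_indices):
        logits[row, ~mask] = -np.inf
    return logits


logits = np.zeros((3, 5))
bitmask = np.array([[0b10011]], dtype=np.int32)          # tokens 0, 1, 4 allowed
apply_packed_bitmask(logits, bitmask, row_indices=[2])
```

Rows 0 and 1 are untouched, which is the behavior the runner gets by giving unconstrained rows an all-ones mask.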
6. The xgrammar backend
backend_xgrammar.py:35 — class XgrammarBackend. compile_grammar (:77) is a clean
switch over the six request types: compile_json_schema / compile_regex /
compile_grammar (EBNF) / compile_structural_tag, each returning a compiled context
ctx. CHOICE never reaches here as such — choices are converted to a grammar upstream.
Then:
return XgrammarGrammar(
matcher=xgr.GrammarMatcher(ctx, max_rollback_tokens=self.num_speculative_tokens),
vocab_size=self.vocab_size, ctx=ctx)
max_rollback_tokens sized to the spec-decode draft length — the compose-with-Phase-8
contract again, now at the C++ matcher level. XgrammarGrammar (:132) is a thin wrapper:
accept_tokens → matcher.accept_token loop, fill_bitmask (:191) →
matcher.fill_next_token_bitmask(bitmask, idx), rollback → matcher.rollback.
The actual FSM/pushdown machinery — and the compile-time token classification from the
guide's Step 4 — lives inside the xgrammar library; what vLLM owns is the plumbing you just
traced. Also skim has_xgrammar_unsupported_json_features (:221) and
validate_xgrammar_grammar (:268): unsupported schema features are rejected at the
front door (processor), not at compile time — fail fast, fail in the API layer.
backend_guidance.py implements the same two interfaces over llguidance (better coverage of
exotic JSON-schema features, lazy-computed masks); backend_outlines.py and
backend_lm_format_enforcer.py likewise. One contract, four interchangeable engines — the
same backend-registry pattern you saw for attention (Phase 4) and quantization (Phase 6).
Reading checklist
- backend_types.py: why are validate_tokens and rollback in the per-request interface? Which phase forces their existence?
- grammar_init: what exactly is async, and what state is a request in while its grammar compiles?
- grammar_bitmask: why max_num_seqs × (1 + num_spec_tokens) rows? Walk the fill→accept→…→rollback sequence for one request with 2 draft tokens.
- apply_grammar_bitmask: why is reordering needed, and what does an all-1 row mean?
- XgrammarBackend.compile_grammar: where does max_rollback_tokens come from?
- In scheduler.py:968, why is a request with is_prefill_chunk excluded from bitmask generation? (Hint: which step actually samples a token?)
Now build it: 02-mini-build.md, then the labs.
Phase 12 — Mini-Build: a grammar mask for mini_vllm
Contents
- Your task
- Why build it (and not just read it)
- The spec
- Method
- Definition of done
- Map back to the real engine
Your task
Build mini_vllm/grammar.py: a regex-FSM grammar that produces a per-step allowed-token
mask and plugs into the mini engine as a logits processor — so a generation literally cannot
emit a string that violates the regex.
A reference implementation ships in mini_vllm/grammar.py
with tests in mini_vllm/test_grammar.py. Build yours
first; compare after.
Why build it (and not just read it)
Reading the real feature tells you what production does. Re-implementing a tiny version tells you why every decision was made — which is the understanding that survives into an interview or a 2 a.m. incident. Keep it small; keep it tested.
The spec
Mirror the upstream contract from backend_types.py, shrunk to its essence:
import numpy as np

class RegexGrammar:
    """Compile once; one instance of state per request."""
    def __init__(self, pattern: str, vocab: dict[int, str]) -> None: ...

    def allowed_token_mask(self) -> np.ndarray:  # bool[vocab_size]
        """True where emitting the token keeps a path to a match alive."""
        ...

    def accept_token(self, token_id: int) -> bool: ...  # advance; False if illegal
    def rollback(self, n: int) -> None: ...  # un-advance n tokens (spec decode!)
    def is_terminated(self) -> bool: ...  # matched a full accepting state
Constraints that make it honest:
- Compile the regex to an explicit FSM yourself (subset is fine: literals, [...] classes, |, *, +, ?, digits; no need for full PCRE). re may be used only as a test oracle, never inside the mask.
- A token (multi-char string) is allowed iff feeding its chars through the FSM from the current state stays alive. Cache per-state token masks after first computation; that's xgrammar's compile-time trick in miniature.
- rollback must restore the exact state; keep a state-stack history.
- Wire it into mini_vllm/engine.py's sampling path as an optional logits_mask_fn(request) -> mask hook, applied as logits[~mask] = -inf pre-softmax.
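The lifting rule in the second constraint, plus the packed form upstream uses, can be sketched on a hand-written DFA. This is a hint, not the reference implementation: the DFA for [0-9]+ is written by hand rather than compiled, the function names are hypothetical, and the bit layout (bit i of word w covers token 32·w + i) is an assumption.

```python
import numpy as np

# Hand-written DFA for [0-9]+ : state 0 = start, state 1 = seen at least one digit
DELTA = {(state, d): 1 for state in (0, 1) for d in "0123456789"}


def lift(delta, state, vocab):
    """Token ids whose characters all keep the DFA alive from `state`."""
    allowed = []
    for token_id, text in vocab.items():
        s = state
        for ch in text:
            s = delta.get((s, ch))
            if s is None:              # dead: no transition for this character
                break
        else:
            allowed.append(token_id)   # every character had a transition
    return allowed


def pack_bitmask(allowed_ids, vocab_size):
    """Pack allowed token ids into ceil(vocab_size / 32) int32 words."""
    words = np.zeros((vocab_size + 31) // 32, dtype=np.uint32)
    for t in allowed_ids:
        words[t // 32] |= np.uint32(1) << np.uint32(t % 32)
    return words.view(np.int32)


vocab = {0: "1", 1: "23", 2: "a", 3: "4b", 4: "007"}
allowed = lift(DELTA, state=0, vocab=vocab)
packed = pack_bitmask(allowed, vocab_size=len(vocab))
```

Token "4b" dies on its second character, which is the multi-char case a per-character check would miss. Caching the result of lift per state gives the per-state mask cache the constraint asks for.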
Method
- Re-read the matching real code: backend_types.py (the contract), backend_xgrammar.py:132 (XgrammarGrammar; yours is this class with the library replaced by your FSM).
- Write the FSM compiler first; property-test it against re.fullmatch on random strings.
- Add the token lifting + mask caching; then the engine hook.
- pytest mini_vllm -q and keep it green.
Definition of done
- CPU only, numpy only.
- A test proves the property: an adversarial sampler (always picks the worst allowed
token) still produces a string with
re.fullmatch(pattern, out)≠ None, for ≥ 3 patterns. - A test proves renormalization: with all tokens masked but one, that one is sampled with probability 1.
- A test proves rollback: advance k tokens, rollback k, masks are bit-identical.
- You can say out loud where yours simplifies: chars = bytes, no pushdown stack (lab-03 adds it), no compile-time context-independent/dependent token split (you cache lazily instead).
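For the renormalization item, here is a small numpy sketch of the logits[~mask] = -inf step followed by a softmax. The function name masked_probs is an assumption made for illustration; your engine hook may structure this differently.

```python
import numpy as np


def masked_probs(logits: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Softmax over logits after setting disallowed entries to -inf."""
    if not mask.any():
        raise ValueError("mask allows no token; the grammar state is dead")
    masked = logits.astype(np.float64)  # astype copies, so the caller's logits survive
    masked[~mask] = -np.inf
    masked -= masked.max()  # stabilise; the max is finite because mask.any() holds
    weights = np.exp(masked)  # exp(-inf) is exactly 0.0
    return weights / weights.sum()
```

With a single allowed token the result is a one-hot vector; with several, the relative odds among the survivors are the same as in the unmasked distribution.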
Map back to the real engine
| Yours | Upstream |
|---|---|
| RegexGrammar | StructuredOutputGrammar impl (backend_xgrammar.py:132) |
| allowed_token_mask() | fill_bitmask() row (bool array vs packed int32 bits) |
| per-state mask cache | xgrammar compile-time token classification |
| logits[~mask] = -inf hook | apply_grammar_bitmask (structured_output/utils.py:44) |
| rollback state stack | xgr.GrammarMatcher(max_rollback_tokens=…) |
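The second row contrasts a bool array with packed int32 bits. The sketch below shows one plausible packing, assuming token i lives at bit i % 32 of word i // 32; verify the layout the real library uses before relying on it. pack_mask and unpack_mask are hypothetical helper names.

```python
import numpy as np


def pack_mask(mask: np.ndarray) -> np.ndarray:
    """Pack a bool[vocab_size] mask into int32 words (assumed layout: LSB first)."""
    n_words = (mask.size + 31) // 32
    padded = np.zeros(n_words * 32, dtype=bool)  # padding bits stay 0 (disallowed)
    padded[: mask.size] = mask
    bits = padded.reshape(n_words, 32).astype(np.uint32)
    words = (bits << np.arange(32, dtype=np.uint32)).sum(axis=1, dtype=np.uint32)
    return words.view(np.int32)  # same bytes, reinterpreted as signed


def unpack_mask(words: np.ndarray, vocab_size: int) -> np.ndarray:
    """Inverse of pack_mask: recover the bool[vocab_size] mask."""
    unsigned = words.view(np.uint32)
    bits = (unsigned[:, None] >> np.arange(32, dtype=np.uint32)) & np.uint32(1)
    return bits.reshape(-1)[:vocab_size].astype(bool)
```

A word with every bit set reads as -1 when interpreted as a signed int32, which is why an "allow everything" row is a row of -1 values.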
Phase 12 Labs — Structured Outputs
Three labs that turn "please respond with JSON" into a mathematical guarantee. The
arc: build the regex→FSM→token-mask pipeline and its adversarial proof (lab-01),
cross the regular/context-free boundary with a pushdown machine for JSON — and get
caught by the fuzz oracle on a grammar corner (lab-03), then measure the
industrialized version (xgrammar via guided_json) producing 50 of 50 schema-valid outputs on
real hardware (lab-02).
Recommended order: 01 → 03 → 02. CPU labs follow the standard contract —
starter.py (your work), solution.py (reference), test_lab.py (the spec); default
runs the solution, LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-12-structured-outputs/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-12-structured-outputs/labs/lab-01-regex-fsm-mask -q
Contents
- lab-01-regex-fsm-mask [CPU-OK]
- lab-02-json-schema-constrained [GPU-OPT]
- lab-03-json-pushdown [CPU-OK]
- What you can do after this phase
Labs
lab-01-regex-fsm-mask [CPU-OK]
The three moves of constrained decoding: compile a pattern to a char-level FSM, lift it to token masks (a token is allowed iff its characters keep the machine alive — the outlines insight, including multi-char tokens crossing atom boundaries), and gate EOS on accepting states. Proven against an adversarial model that prefers garbage and emits valid hex anyway — plus the honest truncation-caveat test (prefix-valid ≠ complete). Skills: masks edit support, not mood; the compile-time/runtime split; char→token lifting; the max_tokens trap.
lab-02-json-schema-constrained [GPU-OPT]
The verification protocol on real vLLM: one schema, 50 prompts, two arms, a strict
jsonschema validator — baseline 31/50 (mostly JSON wrapped in chat), guided 50/50.
Plus the operational signatures: +210 ms first-request grammar compile, and the
finish_reason: "length" truncation trap sprung deliberately. Annotated capture
included. Skills: control-arm benchmarking; the four guided formats; user-supplied
schemas as an operational risk surface.
lab-03-json-pushdown [CPU-OK]
Why regex isn't enough: JSON nests, nesting needs a stack, and you'll build the
pushdown machine (modes + depth) whose mask is stack-aware — a brace-hating model
still emits parseable JSON at depth 8. Featuring the lab's best war story: the
json.loads fuzz oracle caught the reference implementation accepting 0123 (JSON
forbids leading zeros) — grammar bugs need independent oracles. Skills: the
regular/CFG boundary as a product boundary; resume-the-parent via the stack;
oracle-driven grammar debugging; checkpointable machines for spec-decode
composition.
What you can do after this phase
Explain precisely why constrained decoding guarantees validity (and the two ways it
still doesn't: truncation, and bugs in the grammar itself); choose between
regex/choice/schema/grammar constraints by their compile cost and expressive need;
operate structured-output services with eyes open (grammar cache hit rates,
first-request latency, finish_reason hygiene, user-schema risk); and read
vllm/v1/structured_output/ as the industrial form of two machines you built by
hand. The masks ride Phase 9's processor hook; the per-request grammar state joins
Phase 9 lab-04's isolation discipline; and Phase 16's tool-calling parsers consume
what these masks guarantee.
Lab 12-01 — Regex → FSM → Token Masks [CPU-OK]
Prompting a model to "respond with a hex number" gets you a hex number most of the time — and "most of the time" is a production outage with a delay. Constrained decoding replaces the request with a guarantee: compile the pattern to a finite-state machine, and each step mask the logits so only tokens that keep the machine alive are sampleable (Phase 9 lab-01's processor hook, finally meeting its most important client). The model cannot emit an invalid output — not "is unlikely to": cannot — and you'll prove it with an adversarial fake model that prefers garbage and emits valid hex anyway. Plus the two ideas that separate toy versions from real ones: the char-to-token lifting (the FSM walks characters; the model emits multi-character tokens) and the truncation caveat (the mask guarantees prefix validity, not completion — a test demonstrates the failure honestly).
Contents
- Why this lab exists
- Background: the three moves
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Structured output is the feature that turned LLMs from chatbots into components — nothing downstream can consume "mostly JSON" — and it's also the feature whose implementation most people get wrong on the first guess (validate-and-retry? post-hoc repair? few-shot harder?). The correct answer is masks, and it's correct for a deep reason worth internalizing: it moves enforcement from after sampling (reject, retry, pray) to before (the invalid token's probability is −∞; renormalization spreads its mass over valid continuations). Zero retries, zero latency tax beyond the mask computation, and the guarantee is structural rather than statistical.
Building it small teaches you the production system's actual anatomy. Outlines'
famous contribution was precisely your allowed_tokens: precompute, for every FSM
state, which tokens (not characters) survive — turning the per-step cost from
"simulate the vocab" into a dict lookup. When you read vLLM's structured-output
manager (upstream/vllm/v1/structured_output/), you'll find your three functions
with caching and bitmask plumbing around them.
Background: the three moves
- Compile the pattern to a char-level FSM. The lab's pattern subset (atom sequences: literals and [...] classes, optional +) compiles to a beautifully simple machine: state i = "atoms 0..i−1 matched", and + adds a self-loop. Real engines compile full regex via standard NFA→DFA machinery (interegular/outlines): bigger automata, same interface (transitions, accepting).
- Lift chars to tokens: token t is allowed in state s iff feeding t's characters from s never hits a missing transition. This is where the tokenizer's weirdness lives: a single token "0x" crosses two atoms in one step, and your mask test pins exactly that. The lifted table is per-(state, vocab), computed once per pattern: the compile-time/runtime split that makes masking affordable at 128k vocab.
- Mask per step: allowed tokens keep their logits, the rest get −∞, and EOS is legal iff the state is accepting. This forces the model through the pattern (the test_eos_only_when_accepting behavior: a model that wants to stop immediately must emit a digit first).
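The first two moves fit in a short sketch. The function names echo the starter file, but the bodies, the Atom tuple and the lift helper are assumptions made here for illustration, not the reference solution. It also assumes a + atom's character set does not overlap the next atom's set; where they do overlap, this walker prefers the next atom and can reject strings a real regex engine would accept.

```python
from typing import Dict, FrozenSet, List, Optional, Sequence, Set, Tuple

Atom = Tuple[FrozenSet[str], bool]  # (characters the atom accepts, has a trailing '+')


def parse_atoms(pattern: str) -> List[Atom]:
    """Parse literals and [...] classes (a-z style ranges), each optionally followed by '+'."""
    atoms: List[Atom] = []
    i = 0
    while i < len(pattern):
        chars: Set[str] = set()
        if pattern[i] == "[":
            end = pattern.index("]", i)
            body = pattern[i + 1 : end]
            j = 0
            while j < len(body):
                if j + 2 < len(body) and body[j + 1] == "-":  # a range such as 0-9
                    chars.update(chr(c) for c in range(ord(body[j]), ord(body[j + 2]) + 1))
                    j += 3
                else:
                    chars.add(body[j])
                    j += 1
            i = end + 1
        else:
            chars.add(pattern[i])
            i += 1
        plus = i < len(pattern) and pattern[i] == "+"
        if plus:
            i += 1
        atoms.append((frozenset(chars), plus))
    return atoms


def advance(atoms: Sequence[Atom], state: int, text: str) -> Optional[int]:
    """Walk text from state; state i means atoms[0..i-1] are matched. None means dead."""
    for ch in text:
        if state < len(atoms) and ch in atoms[state][0]:
            state += 1  # consume the next atom
        elif state > 0 and atoms[state - 1][1] and ch in atoms[state - 1][0]:
            pass  # '+' self-loop on the atom just matched
        else:
            return None
    return state


def allowed_tokens(atoms: Sequence[Atom], state: int, vocab: Sequence[str]) -> Set[int]:
    """Token ids whose characters all survive when fed from state."""
    return {t for t, tok in enumerate(vocab) if tok and advance(atoms, state, tok) is not None}


def lift(atoms: Sequence[Atom], vocab: Sequence[str]) -> Dict[int, Set[int]]:
    """Compile-time table: one allowed-token set per FSM state."""
    return {s: allowed_tokens(atoms, s, vocab) for s in range(len(atoms) + 1)}
```

Once lift has run, the per-step work is a dictionary lookup on the current state, and the accepting state in this numbering is simply len(atoms).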
Files
- starter.py: parse_atoms, compile_pattern, advance, allowed_tokens, constrained_generate. Your work.
- solution.py: reference.
- test_lab.py: parsing, survival/death of multi-char tokens, mask exactness, the adversarial model, a 50-permutation fuzz, EOS gating, and the truncation caveat.
Run
LAB_IMPL=starter pytest phase-12-structured-outputs/labs/lab-01-regex-fsm-mask -q
pytest phase-12-structured-outputs/labs/lab-01-regex-fsm-mask -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_parse_atoms / test_advance_and_death | The compiler and the walker, including a token dying mid-token ("1x"): partial consumption must not corrupt state |
| test_mask_is_exactly_the_survivors | The lifting: from start, only "0" and "0x" survive 0x[0-9a-f]+. Note the multi-char token legally crossing two atoms in one step |
| test_constrained_output_always_matches | The adversarial guarantee: a model preferring q, @, zz emits valid hex anyway. The mask doesn't persuade; it removes the alternatives |
| test_fuzzed_preferences_never_violate | 50 random models, zero violations, on a pattern chosen so truncation is harmless (see next row for why that choice was necessary) |
| test_truncation_caveat_is_real | The honest failure: a digit-loving model loops in [0-9]+ and never reaches the .; at max_tokens the output is a valid prefix but not a valid match. Constrained decoding + token caps = possibly-incomplete output, a real production gotcha (validate downstream anyway!), demonstrated rather than footnoted |
| test_eos_only_when_accepting | The stop token is part of the grammar: stopping is only legal where the pattern says so |
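The adversarial guarantee, EOS gating and the truncation caveat can all be reproduced with a hand-written machine. This sketch assumes a character-level vocabulary and the pattern [0-9]+\.[0-9]+; the model is reduced to a fixed preference list, and the "stop" and "length" labels imitate finish_reason. None of the names come from the lab's files.

```python
import re
from typing import List, Optional, Tuple

DIGITS = "0123456789"
EOS = "<eos>"
PATTERN = r"[0-9]+\.[0-9]+"


def step(state: int, ch: str) -> Optional[int]:
    """Hand-written FSM for PATTERN; state 3 is the only accepting state."""
    if ch in DIGITS:
        return {0: 1, 1: 1, 2: 3, 3: 3}[state]
    if ch == "." and state == 1:
        return 2
    return None


def allowed(state: int, token: str) -> bool:
    if token == EOS:
        return state == 3  # stopping is legal only in an accepting state
    return step(state, token) is not None


def constrained_generate(preference: List[str], max_tokens: int) -> Tuple[str, str]:
    """Greedy decode: at each step take the most-preferred token the mask allows.

    The preference list must contain at least one digit, otherwise some state
    would have no allowed token and next() would raise StopIteration.
    """
    state, out = 0, []
    for _ in range(max_tokens):
        token = next(t for t in preference if allowed(state, t))
        if token == EOS:
            return "".join(out), "stop"
        state = step(state, token)
        out.append(token)
    return "".join(out), "length"
```

A model that ranks garbage and EOS above everything still ends on a full match, while a model that ranks a digit first never leaves the first [0-9]+ and is cut off by the token cap with a prefix that re.fullmatch rejects.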
Hitchhiker's notes
- The compile-time/runtime split is the whole performance story. Compiling a complex schema's automaton and lifting it over a 128k vocab takes real time (xgrammar's headline is doing this fast + caching it); the per-step cost is then a bitmask apply. vLLM compiles grammars asynchronously: a request can sit in WAITING while its grammar compiles (a new reason to wait that Phase 3's scheduler gates on; search grammar in the V1 scheduler). First-request latency on a new schema vs steady-state is the operational signature of this split.
- The mask must reach the GPU. Your set-of-ints becomes a [batch, vocab] bitmask tensor applied inside the sampler (Phase 9 lab-01's pipeline, stage one). At 128k vocab × 256 batch that's real bytes per step, which is why the format is bitmask and why xgrammar emits them natively.
- Tokenizer dependence is total: the lifted table is per-(pattern, tokenizer). Same pattern, different model → recompile. And exotic vocab corners (bytes, partial UTF-8 tokens) are exactly where naive lifters break: one more reason the production engines are libraries, not weekend scripts.
- The truncation caveat generalizes: any constrained system that guarantees step-wise validity (prefix-closed) but not termination has this hole. Lab-03 shows its pushdown version (unclosed braces forever); real APIs return finish_reason: "length" (Phase 1 lab-05!) on exactly these, so your downstream parser must treat "length" + structured output as suspect. The three labs of this course that compose here (1-05, 9-01, 12-01) are the whole story.
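The "real bytes per step" remark is easy to put numbers on. The sketch below compares three ways of shipping a [batch, vocab] mask, taking 128k to mean 128,000 tokens and packing 32 tokens per int32 word; both are assumptions, and mask_bytes is a hypothetical helper.

```python
from typing import Dict


def mask_bytes(batch: int, vocab: int) -> Dict[str, int]:
    """Bytes per decode step for three encodings of a [batch, vocab] mask."""
    words_per_row = (vocab + 31) // 32  # 32 tokens per int32 word, rounded up
    return {
        "float32 additive mask": batch * vocab * 4,
        "bool mask": batch * vocab,
        "packed int32 bitmask": batch * words_per_row * 4,
    }


sizes = mask_bytes(batch=256, vocab=128_000)
```

Under these assumptions the packed form is 8 times smaller than a bool mask and 32 times smaller than a float32 additive mask, per step.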
Going further
- Add * (zero-or-more) to the pattern subset. Note how it changes which states are skippable and that your state-numbering scheme needs epsilon-collapsing. You're one feature away from needing the real NFA→DFA pipeline; feel the cliff.
- Precompute the full state → allowed token-id list table and benchmark constrained_generate against the on-the-fly version at vocab 50k (build a random vocab): the outlines speedup, measured.
- Wire the mask into Phase 9 lab-01's Pipeline as a logits processor over mini_vllm's ByteTokenizer vocab and constrain the toy engine end-to-end: structured output in your own engine, ~20 glue lines.
References
- Willard & Louf, Efficient Guided Generation for Large Language Models (2023), the outlines paper; your allowed_tokens is its §3: https://arxiv.org/abs/2307.09702
- upstream/vllm/v1/structured_output/: the manager, backends, and the async compile path.
- Dong et al., XGrammar (2024), the compile-time/runtime split industrialized: https://arxiv.org/abs/2411.15100
- Phase 9 lab-01: the hook this mask rides; lab-03: why JSON needs a stack on top of everything here.
Lab 12-02 — JSON Schema Constrained, on Real vLLM [GPU-OPT]
The CPU labs built the theory bottom-up: masks from FSMs (lab-01), stacks for nesting
(lab-03). This lab runs the industrialized version — xgrammar via vLLM's
guided_json — and measures the property that justifies the whole phase: 50 of 50
schema-valid outputs constrained, versus a baseline that politely wraps its JSON in
markdown fences and apologies. You'll also watch the operational signatures the CPU
labs predicted: the first-request grammar-compile latency (the compile-time/runtime
split, on a wall clock) and the finish_reason: "length" truncation caveat, live.
No GPU? Don't panic. The captured run below carries the measurements; the reconciliation against labs 01/03 is the work.
Contents
- Why this lab exists
- Requirements
- Steps
- Captured output (real run, Qwen2.5-0.5B-Instruct, L4, vLLM 0.22.1, trimmed)
- Reading the results
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
"100% valid JSON" is a strong claim and engineers should be professionally suspicious
of strong claims — this lab is the verification protocol. The design matters more
than the running: a fixed schema, N diverse prompts, two arms (constrained vs
unconstrained-but-asked-nicely), and a strict validator (jsonschema, not
json.loads — type and required-key checking, not just parseability). The
unconstrained arm is the control every structured-output benchmark needs and most
skip: without it, "98% valid" tells you nothing about what the constraint bought
(small instruct models often manage 60–85% unconstrained; the delta is the feature).
It's also your introduction to the feature's operational personality: per-schema
compile cost (cached thereafter), the scheduler's grammar-wait state, and the
interaction with max_tokens that labs 01/03 made you predict — all visible from the
client side if you know to look.
Requirements
uv pip install -e ".[vllm]" jsonschema
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct # small instruct model: a fair baseline arm
Steps
import json

import jsonschema
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
        "skills": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "age", "skills"],
}

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", gpu_memory_utilization=0.6)
prompts = [f"Generate a profile for a fictional {job}." for job in
           ["pirate", "astronaut", "barista", "wizard", "plumber"] * 10]


def validity(outputs):
    """Count outputs whose text parses as JSON and validates against SCHEMA."""
    ok = 0
    for o in outputs:
        try:
            jsonschema.validate(json.loads(o.outputs[0].text), SCHEMA)
            ok += 1
        except Exception:
            pass  # fenced, truncated, or schema-violating output counts as invalid
    return ok


# Control arm: asked nicely in the prompt, no constraint.
base = llm.generate([p + " Respond ONLY with JSON matching the schema." for p in prompts],
                    SamplingParams(max_tokens=128, temperature=0.8))
# Guided arm: plain prompts, same sampling settings, plus the grammar mask.
guided = llm.generate(prompts, SamplingParams(
    max_tokens=128, temperature=0.8,
    guided_decoding=GuidedDecodingParams(json=SCHEMA)))
print(f"baseline: {validity(base)}/50 guided: {validity(guided)}/50")
Time the first guided request separately from the rest (grammar compile), and run
one guided request with max_tokens=12 to spring the truncation trap on purpose.
Captured output (real run, Qwen2.5-0.5B-Instruct, L4, vLLM 0.22.1, trimmed)
baseline: 31/50 guided: 50/50
# typical baseline failure: 'Sure! Here is the profile:\n```json\n{"name": ...'
# (valid JSON, wrapped in chat — json.loads sees the fence and dies)
# first guided request: +210 ms (xgrammar compile, then cached for the schema)
# guided with max_tokens=12: '{"name": "Captain Redb' finish_reason='length'
# (prefix-valid, incomplete — the labs' truncation caveat, on silicon)
Reading the results
- 31/50 baseline. Look at how it fails: mostly not malformed JSON but JSON wrapped in helpfulness ("Sure! Here is..."). Instruct-tuning taught the model to chat; your parser disagrees. Prompt-engineering harder buys a few points and plateaus: the failure is distributional, and no amount of asking changes the distribution's tails. (This is the precise sense in which masking is structural: it edits the distribution's support, not its mood.)
- 50/50 guided. The first token is forced into { territory; the fence is unsamplable. Note this matches labs 01/03's adversarial tests exactly: the model's preferences (chatty preamble) lose to the mask, every time, by construction.
- +210 ms first request. This is lab-01's compile-time/runtime split with a wall-clock number: schema → grammar → automaton → token bitmask tables, then cached (per schema × tokenizer). A fleet serving many distinct schemas pays this repeatedly, so cache hit rate on grammars is a real metric for structured-output-heavy services.
- The truncated run. finish_reason: "length" plus a prefix-valid fragment is labs 01/03's caveat verbatim. The defensive pattern: treat "length" + structured-output as invalid regardless of how parseable the prefix looks, and size max_tokens for the schema's worst case (arrays make worst cases long).
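The defensive pattern can be sketched with the standard library alone. This stand-in checks only the finish reason, parseability and required keys; a real service would run the full jsonschema validation used in the steps. accept_structured is a hypothetical helper name, and treating every finish reason other than "stop" as a rejection is an assumption of this sketch.

```python
import json
from typing import Any, Dict, Optional, Sequence


def accept_structured(
    text: str, finish_reason: str, required: Sequence[str]
) -> Optional[Dict[str, Any]]:
    """Return the parsed object, or None if the output must be treated as invalid."""
    if finish_reason != "stop":
        return None  # "length" means the derivation was cut off, whatever the text looks like
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or any(key not in obj for key in required):
        return None
    return obj
```

Checking finish_reason before parsing matters because some truncated prefixes do parse (an integer cut from 123 to 12 is still valid JSON), so parseability alone cannot detect truncation.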
Hitchhiker's notes
- The API surface spans four formats: guided_json (schema), guided_regex (lab-01's domain), guided_choice (the degenerate-but-useful enum case), and guided_grammar (full EBNF; lab-03's domain, user-supplied). All compile to the same masking machinery with different front ends; choosing the narrowest format that fits is both faster to compile and a better model-steering signal.
- Backend choice exists (xgrammar default, guidance, outlines lineage), like Phase 4's attention backends, with the same operational reflex: when structured output misbehaves, swapping backends is the bisection move (--guided-decoding-backend). Feature-support matrices differ (regex corners, schema keywords); the deep-dive maps them.
- Quality inside validity: 50/50 valid says nothing about whether the content is good, because masks constrain syntax, not sense. A model bullied through an unfamiliar schema produces valid-but-vapid fields. The schema is also a prompt: include it in the text and the constraint, and the two reinforce (measure content quality separately; Phase 6 lab-02's eval discipline applies).
- Throughput cost is real but modest: bitmask application is cheap; grammar advance (per accepted token, per request) is CPU-side work that can bottleneck at high concurrency with complex grammars, so watch the structured-output scheduling stats. Tail risk: one pathological schema compiling for seconds can stall its request, not the engine (the async-compile design: Phase 3's WAITING state earning a new tenant).
Reflect
- Map every capture line to its CPU-lab origin: the fence failure (mask edits support — labs 01/03's adversarial tests), the +210 ms (lab-01's compile/runtime split), the truncation (both labs' caveat tests). If each has a home, the phase composed.
- Your service takes user-supplied schemas. Name the three operational risks this lab armed you against. (Unbounded compile cost per novel schema: cache + limits; worst-case output length vs max_tokens: validate finish_reason; pathological grammars as a DoS surface: compile timeouts.)
- Why does the guided arm use temperature 0.8 rather than 0? (The claim under test is "valid under sampling"; greedy would make validity trivially repetitive and hide mask bugs that only sampled tails reach. Constrain the support, then let the distribution be itself.)
References
- upstream/vllm/v1/structured_output/: manager, xgrammar backend, the async compile path and bitmask plumbing.
- vLLM docs, Structured Outputs (the four guided formats and backend selection): https://docs.vllm.ai/en/latest/features/structured_outputs/
- Dong et al., XGrammar (2024): https://arxiv.org/abs/2411.15100
- Labs 01 and 03: the theory this run industrializes; Phase 1 lab-05: finish_reason, doing load-bearing work again.
Lab 12-03 — JSON Needs a Stack: the Pushdown Mask [CPU-OK]
Try to write lab-01's FSM for "balanced nested braces" and you'll hit a wall that's
been a theorem since 1956: a finite-state machine cannot count unbounded nesting —
matching { to } at arbitrary depth requires memory that grows, i.e. a stack.
JSON nests. Therefore JSON is not a regular language, regex-based masking cannot
enforce it, and the production answer (xgrammar) compiles context-free grammars to
pushdown automata. In this lab you build the pushdown machine for a JSON subset —
modes plus a depth counter that is the stack — and prove the strongest property in
the phase: a model that hates closing braces (they're its least-preferred character)
still emits output that json.loads accepts. Along the way the fuzz oracle will catch
a grammar corner most humans forget (this lab's reference implementation forgot it
too, the first time): JSON forbids leading zeros.
Contents
- Why this lab exists
- Background: modes + depth = pushdown
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
The regular/context-free boundary is the single most practical piece of formal-
language theory in modern serving, because it's a product boundary: "constrain to a
regex" and "constrain to a JSON schema" are different features with different engines,
different compile costs, and different failure modes — and engineers who don't know
why ship the wrong one. After this lab the boundary is physical: you'll have built
both machines and the difference is one integer (depth) that lab-01's FSM has
nowhere to put.
The second lesson is oracle-driven grammar debugging, courtesy of an honest war
story: the fuzz test here validates completed generations with json.loads — an
independent, stricter implementation of the spec — and on this lab's first draft it
rejected "0123", which the hand-built machine happily accepted. The machine's
grammar was wrong (JSON ints are '0' | [1-9][0-9]*), and no amount of testing the
machine against itself would have found it. Constrained decoding is only as correct
as its grammar; always fuzz against an independent parser. That habit is worth more
than the lab.
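The oracle habit takes only a few lines to demonstrate. This sketch is not the lab's fuzz test: it enumerates short digit strings exhaustively and compares two candidate integer grammars against json.loads. NAIVE_INT, JSON_INT, oracle_accepts and disagreements are names invented here, and only non-negative integers are covered.

```python
import itertools
import json
import re
from typing import List

NAIVE_INT = re.compile(r"[0-9]+")        # the tempting, wrong grammar
JSON_INT = re.compile(r"0|[1-9][0-9]*")  # the rule quoted above


def oracle_accepts(text: str) -> bool:
    """Independent oracle: does the standard-library JSON parser accept the text?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def disagreements(grammar: re.Pattern, max_len: int = 3) -> List[str]:
    """Every digit string up to max_len on which the grammar and the oracle disagree."""
    bad = []
    for n in range(1, max_len + 1):
        for chars in itertools.product("0123456789", repeat=n):
            s = "".join(chars)
            if bool(grammar.fullmatch(s)) != oracle_accepts(s):
                bad.append(s)
    return bad
```

Every disagreement for the naive grammar is a string with a leading zero, and the corrected grammar has none, which is exactly the class of bug that testing the machine against itself cannot surface.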
Background: modes + depth = pushdown
The machine is lab-01's FSM plus one number. Modes track position inside the
local grammar production (value, int, int_zero, obj_first, key, key_body,
colon, comma_or_close, obj_key, done); depth counts open objects — and
because this grammar's only recursive construct is the object, depth is the entire
stack (a general CFG pushes symbols; we push indistinguishable ones, so a counter
suffices — a nice special case to notice). The two rules that make it click:
- '{' in value mode pushes (depth+1) and enters obj_first; '}' pops and then asks: stack empty? → done (only EOS legal); else → comma_or_close (the enclosing object resumes). That resume-the-parent move is what no FSM can express: the parent's state was remembered by the stack.
- Ending a value is context-dependent: a top-level int may end at EOS; the same int inside an object must be followed by , or }. Hence allowed() consults depth, so the mask itself is stack-aware, which is the whole point.
One implementation subtlety worth savoring: feeding , while in int mode must
first end the int, then re-dispatch the comma in the new mode (feed calls itself
once). Tokens that terminate one production and belong to the next are the
bread-and-butter of incremental parsing — xgrammar's machinery handles exactly this,
industrialized.
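To see modes plus a depth counter in the smallest possible setting, consider the toy grammar value := '7' | '{' value '}'. It is far smaller than the lab's JSON subset and NestMachine is a hypothetical class, but it shows the same two ideas: the counter is the whole stack, and what may follow a value depends on depth.

```python
from typing import Set

EOS = "<eos>"


class NestMachine:
    """Toy pushdown recogniser for  value := '7' | '{' value '}'  (not the lab's grammar)."""

    def __init__(self) -> None:
        self.mode = "value"  # "value": expecting a value; "after": a value just ended; "done"
        self.depth = 0       # unclosed '{' count; all frames look alike, so a counter suffices

    def allowed(self) -> Set[str]:
        if self.mode == "value":
            return {"7", "{"}
        if self.mode == "after":
            return {"}"}  # only reachable with depth > 0
        return {EOS}

    def feed(self, ch: str) -> None:
        if ch == EOS or ch not in self.allowed():
            raise ValueError(f"illegal {ch!r} in mode {self.mode!r} at depth {self.depth}")
        if ch == "{":
            self.depth += 1  # push; stay in "value" mode for the inner value
            return
        if ch == "}":
            self.depth -= 1  # pop
        # A value has just ended ('7' or a closed wrapper): resume the parent, or finish.
        self.mode = "after" if self.depth > 0 else "done"
```

The same character 7 leads to done at depth 0 and to a mandatory closer at depth 8, so the mask after a value is a function of the counter and not of the mode alone.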
Files
- starter.py: JsonMachine (allowed / feed / accepting, modes documented) and constrained_generate. Your work.
- solution.py: reference.
- test_lab.py: recognition both ways, illegal-feed errors, the depth-8 stack test, the brace-hating model, the json.loads fuzz oracle, and the truncation caveat (pushdown edition: depth that only grows).
Run
LAB_IMPL=starter pytest phase-12-structured-outputs/labs/lab-03-json-pushdown -q
pytest phase-12-structured-outputs/labs/lab-03-json-pushdown -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_recognizes_valid_json / test_rejects_invalid_json | The grammar, both directions, including {}}, {"a":}, and the trailing-comma classic |
| test_illegal_feed_raises | The machine is also a validator; feeding it garbage is loud, not corrupting |
| test_depth_is_the_stack | 8 levels opened, tracked, closed: the non-regular behavior, exercised. (Any fixed-state FSM fails some depth; the counter never does) |
| test_brace_hating_model_still_emits_valid_json | The adversarial guarantee, CFG edition: } ranked dead last, output parses anyway, because comma_or_close mode eventually offers nothing but structure-respecting choices |
| test_fuzzed_preferences_always_parse_or_truncate_live | 50 random models against the independent oracle: every completed output parses, every truncated one is a live prefix. This is the test that caught the leading-zero bug, the lab's best argument made by its own history |
| test_truncation_caveat_brace_lover_never_closes | Lab-01's caveat, worse: a {-loving model nests forever (each value slot opens another object), hits the cap with depth > 0, output unparseable. Prefix-valid ≠ complete, and recursion gives the failure infinite room |
Hitchhiker's notes
- From this machine to xgrammar: a JSON Schema adds constraints your subset doesn't have (specific keys, types per key, string escapes, floats). The grammar grows, the principle doesn't: compile schema → CFG → pushdown automaton → per-state token bitmasks (lab-01's lifting, over pushdown configurations). XGrammar's research contribution is making that lifting fast despite the stack: most masking decisions turn out to be context-independent (decidable from the mode alone) and get precomputed; only the genuinely stack-dependent ones (your depth-consulting closers) are evaluated at runtime. Your machine cleanly displays which decisions are which: look at allowed(), where only two branches read depth.
- The speculative-decoding interaction (Phase 8): verifying k drafted tokens under a grammar means advancing the pushdown machine k times and rolling back on rejection, so the machine must be checkpointable (your machine: copy mode + depth; a full PDA: copy the stack). Feature composition is where structured-output engines earn their complexity; vLLM gates some combinations for exactly this reason.
- Per-request state, again: each request carries its own machine, advanced as tokens commit (Phase 9 lab-04's isolation discipline: grammar state is one more thing that must never leak between batch rows). In vLLM it lives alongside the request in the structured-output manager, advanced in update_from_output's neighborhood (Phase 1's loop, hosting yet another tenant).
- Why not just retry until valid? Compute: invalid generations burn full generation cost each attempt, and complex schemas can have high rejection rates. Masking's cost is per-step and tiny. The mask also helps the model: at every step its probability mass is renormalized over only-valid continuations, so the model is never "off the rails" trying to recover from its own syntax error. Constrained models often produce better-content JSON too, for this reason.
Going further
- Add arrays ([ value (',' value)* ]). Now two bracket types share the stack and a counter no longer suffices: you need an actual stack of {-vs-[ symbols. You'll have crossed from counter automaton to true PDA, and the diff is ~15 lines that teach the distinction better than any textbook.
- Add strings as values with escape handling ("a\"b"). The mode machinery for escapes is exactly why real JSON grammars are bigger than people expect.
- Lift to tokens: reuse lab-01's allowed_tokens over this machine (a token is allowed iff feeding its chars never raises) with a multi-char vocab, and re-run the brace-hater test at the token level. You've now built, end to end, a miniature xgrammar.
References
- Dong et al., XGrammar: Flexible and Efficient Structured Generation (2024), the context-independent/dependent split and the PDA machinery: https://arxiv.org/abs/2411.15100
- upstream/vllm/v1/structured_output/backend_xgrammar.py: the integration; find the per-request grammar state and the bitmask path.
- Chomsky, Three Models for the Description of Language (1956), where "regex can't count braces" was proven, sixty-nine years before your fuzz test rediscovered its consequences: https://doi.org/10.1109/TIT.1956.1056813
- Lab-01: the FSM floor this lab builds on; Phase 9 lab-01: the hook both ride.
Phase 12 — Exercises: Structured Outputs
Contents
Warm-up (explain)
- Why mask logits per step instead of generating freely and rejecting invalid outputs?
- What machine do you need for a regex, and what more do you need for JSON? Why exactly can't an FSM handle JSON?
- Grammars constrain characters but models emit tokens — state the lifting rule for "token T is allowed in state S", and why it's too slow to evaluate naively per step.
Solution sketches
- Rejection is unbounded (a 1k-token output that's 99.9% reliable per token fails ~63% of the time; retries multiply cost and latency, and there's no guarantee of ever succeeding). Masking is O(1) per step and makes invalid output impossible while letting softmax renormalize over the legal set, preserving the model's preferences among valid continuations.
- Regex → FSM (finite states, transition table). JSON → pushdown automaton: nesting depth is unbounded and an FSM has finite memory, so it cannot ensure every { gets its }. You need a stack (push on open, pop on close).
- T allowed in S iff running T's characters through the automaton from S never dies. Naive cost = vocab × token_len automaton steps each decode step (~hundreds of thousands). xgrammar precomputes per-context token verdicts at compile time into a packed bitmask, leaving only a small context-dependent set for runtime.
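The numbers in these sketches can be checked directly. The helpers below assume per-token validity is independent across positions (a simplification) and take a mean token length of 4 characters as an illustrative figure; all three function names are hypothetical.

```python
def failure_probability(per_token_ok: float, n_tokens: int) -> float:
    """Chance that at least one of n independently sampled tokens is invalid."""
    return 1.0 - per_token_ok ** n_tokens


def expected_attempts(per_token_ok: float, n_tokens: int) -> float:
    """Mean number of full generations until one is valid (geometric distribution)."""
    return 1.0 / per_token_ok ** n_tokens


def naive_lift_steps(vocab_size: int, mean_token_len: int) -> int:
    """Automaton steps per decode step if every token is re-simulated from scratch."""
    return vocab_size * mean_token_len


p_fail = failure_probability(0.999, 1000)
attempts = expected_attempts(0.999, 1000)
steps = naive_lift_steps(128_000, 4)
```

Under the independence assumption, 0.999 raised to the power 1000 is about 0.368, so roughly 63% of unconstrained 1k-token generations fail and the retry loop needs about 2.7 full generations on average.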
Core (trace the code)
- StructuredOutputManager.grammar_init (__init__.py:115): what is submitted to the executor, what does the request's grammar field hold meanwhile, and what does the scheduler do with such a request?
- In grammar_bitmask (__init__.py:204), the serial path calls accept_tokens on draft tokens and later rollback. Why advance at all, and why must it rewind?
- apply_grammar_bitmask (utils.py:44) builds sorted_bitmask filled with -1. What does a -1 word mean, and why does the function need to reorder rows at all?
- Why does the bitmask allocation reserve max_num_seqs × (1 + num_speculative_tokens) rows rather than max_num_seqs?
Solution sketches
- _create_grammar (the backend compile_grammar call) goes to a ThreadPoolExecutor; the field holds a Future. The grammar property (request.py:60) returns None until resolved, and the scheduler skips the request: it waits, unscheduled, so compile latency hits only that request's TTFT, never the engine loop.
- The mask for draft position i must reflect the state after drafts 1..i−1, so the grammar advances through the drafts to compute successive rows. But accept/reject belongs to the rejection sampler after the forward pass; the grammar must rewind (rollback(state_advancements)) so the real outcome can be applied later.
- -1 = all bits set = every token allowed (int32 of all 1s): the rows for unconstrained requests. Reordering: the scheduler emitted rows in its own request order, but logits rows follow the runner's batch order with spec-token offsets; struct_out_req_batch_indices maps request → logit row.
- With spec decode, each request samples at up to 1 + k positions per step (k drafts + bonus/correction), and each position needs its own mask row, stored inline.
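The reordering idea in the third sketch can be illustrated in numpy. This is not the upstream function: scatter_rows and apply_rows are hypothetical names, the bit layout (token i at bit i % 32 of word i // 32) is an assumption, and the row mapping is passed in as a plain list.

```python
from typing import List

import numpy as np


def scatter_rows(compact: np.ndarray, logit_rows: List[int], num_logit_rows: int) -> np.ndarray:
    """Place compact[i] at row logit_rows[i]; every other row is all ones (-1 words)."""
    full = np.full((num_logit_rows, compact.shape[1]), -1, dtype=np.int32)
    full[logit_rows] = compact
    return full


def apply_rows(logits: np.ndarray, bitmask: np.ndarray) -> np.ndarray:
    """Return logits with -inf wherever the token's bit is 0."""
    vocab = logits.shape[1]
    unsigned = bitmask.view(np.uint32)
    bits = (unsigned[:, :, None] >> np.arange(32, dtype=np.uint32)) & np.uint32(1)
    allowed = bits.reshape(bitmask.shape[0], -1)[:, :vocab].astype(bool)
    return np.where(allowed, logits, -np.inf)
```

Filling with -1 first means rows that belong to unconstrained requests pass through apply_rows unchanged, so the sampler needs no separate branch for them.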
Build (your lab)
- In your lab-01 FSM, add a choice(["yes", "no", "maybe"]) constraint without writing a new engine: express it as a regex and confirm the adversarial model can only emit one of the three.
- Measure your mask-cache hit rate: generate 200 tokens under a 5-state FSM and count distinct (state → mask) computations vs lookups. Relate the result to why xgrammar compiles ahead of time.
- In lab-03's pushdown, construct an input where the same automaton state has different allowed sets depending on the stack. Why does this kill any pure-FSM implementation?
Solution sketches
- choice is just alternation: yes|no|maybe — same FSM machinery (this is literally how CHOICE is lowered upstream). Adversarial run must end with output ∈ the set.
- Distinct computations = number of reachable FSM states (≤ 5); everything after is a lookup, so hit rate → ~100% quickly. Ahead-of-time compilation is this cache computed eagerly for every state at compile time.
- State "expecting close bracket" with stack [{ must allow } not ]; with [[ it must allow ] not }. Same control state, different stack top ⇒ allowed set is a function of (state, stack), which an FSM cannot represent for unbounded depth.
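The last sketch can be made concrete in a few lines. This is a toy bracket checker, not lab-03's pushdown; CLOSER and allowed_closers are hypothetical names chosen for illustration. It shows the same "expecting a closer" situation yielding different allowed sets purely because the stack top differs.

```python
# Toy illustration: the allowed closing bracket is a function of the stack,
# not of a finite control state. Names here are hypothetical.
CLOSER = {"{": "}", "[": "]"}

def allowed_closers(prefix: str) -> set:
    """Return the set of closing brackets that keep `prefix` well nested."""
    stack = []
    for ch in prefix:
        if ch in CLOSER:
            stack.append(ch)                  # push every opener
        elif ch in CLOSER.values():
            if not stack or CLOSER[stack.pop()] != ch:
                raise ValueError(f"prefix {prefix!r} is already invalid")
    # Only the closer matching the innermost open bracket is legal.
    return {CLOSER[stack[-1]]} if stack else set()

print(allowed_closers("[{"))   # stack top is "{"
print(allowed_closers("[["))   # stack top is "["
```

Because nesting depth is unbounded, no finite table of states can enumerate every stack the checker might be holding.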
Design (staff-level)
- A tenant sends 10k requests/minute, each with a unique (uncacheable) 50 KB JSON schema; compiles take 200 ms of CPU. The engine also serves latency-sensitive unconstrained traffic. What breaks, and what do you do about it (at least three layers of defense)?
- Product asks: "constrained decoding makes outputs worse — the JSON is valid but the content got dumber." Is that plausible? Explain the mechanism and two mitigations.
- Spec decode (k=4) + structured output: derive the per-step grammar-work overhead vs non-spec, and explain why acceptance rate changes under constraints (which direction?).
- Design grammar-aware prefix caching: two requests share a long prompt but have different schemas. What can be shared (Phase 2/3 machinery), what cannot, and where would you hook the distinction in?
Solution sketches
- The compile executor saturates → constrained TTFT explodes; if compile work steals scheduler-process CPU, everyone's step time degrades. Defenses: bound + isolate the compile pool (separate cores/process); per-tenant compile-queue quotas and admission control (reject/429 on queue depth); schema canonicalization to recover cache hits; pre-registration API for schemas; cap schema size/complexity at the front door (upstream already rejects unsupported features in the processor — same layer).
- Plausible. The mask renormalizes over legal tokens, but the model's distribution was
conditioned on its own preferred phrasing; forcing rare continuations (e.g. token
boundaries that split words awkwardly) pushes it off-manifold and degrades content.
Mitigations: schema-aware prompting (show the schema, few-shot valid examples) so the
constrained path is also the high-probability path; looser grammars (whitespace
flexibility — note xgrammar's any_whitespace flag); two-pass (free-form reason, then
constrained extraction — exactly what the reasoning-gate in should_advance enables).
- Per step the grammar does k+1 fill_bitmask calls + k accept_tokens + one rollback(k)
vs 1 fill — so ~(k+1)× the grammar CPU, still small vs the forward pass. Acceptance
rises on structural tokens (mask forces draft and target into the same narrow set —
high agreement) and can fall if the drafter is unconstrained while the target is
masked (drafts get vetoed by structure the drafter didn't know about; vLLM mitigates by
validating drafts with validate_tokens).
- KV blocks for the shared prompt are shareable as ever — KV depends only on token ids (the mask touches logits, not hidden states). What's not shareable is grammar state and compiled grammars (different schemas). Hook: nothing to change in the block hash — grammar state only matters from the first generated token; just keep grammar identity out of prefix-cache keys for prompt blocks (and be careful only if constraint applies to prompt — it doesn't).
Self-grading
4–7 and 11–14 are interview-grade. Could you whiteboard the per-step pipeline (state → bits → −inf → sample → advance) and the spec-decode advance/rollback dance from memory? If not, re-read 01-deep-dive.md §4.
Phase 12 — Interview Questions: Structured Outputs
Staff/principal-level questions. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)
Q1. How does vLLM guarantee valid JSON output?
Model answer
It compiles the JSON schema into a grammar automaton (xgrammar: a pushdown automaton, since JSON nests), and at each decode step computes a bitmask of tokens that keep the output grammar-valid, applying −inf to all illegal logits before sampling. Softmax renormalizes over the legal set, so the model still expresses preference among valid tokens — but invalid output is impossible by construction, not just unlikely. One grammar state per request, advanced as tokens are accepted.
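A minimal numpy sketch of the masking step in this answer. It assumes the packed layout described elsewhere in this phase (bit j of word w covers token id 32*w + j, 1 = allowed); apply_token_bitmask and softmax are hypothetical helper names, not the upstream kernel.

```python
import numpy as np

def apply_token_bitmask(logits: np.ndarray, bitmask_row: np.ndarray) -> np.ndarray:
    """Return logits with every disallowed token set to -inf (bit 1 = allowed)."""
    token_ids = np.arange(logits.shape[0])
    # Reinterpret the int32 words as unsigned so shifts behave like plain bit tests.
    words = bitmask_row.view(np.uint32)[token_ids // 32]
    shifts = (token_ids % 32).astype(np.uint32)
    allowed = ((words >> shifts) & np.uint32(1)).astype(bool)
    return np.where(allowed, logits, -np.inf)

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - np.max(x)          # max is finite as long as one token is allowed
    e = np.exp(x)              # exp(-inf) == 0, so illegal tokens get zero mass
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 3.0, 0.0, 0.0, 0.0, 0.0])
row = np.array([0b110], dtype=np.int32)     # only token ids 1 and 2 are legal
probs = softmax(apply_token_bitmask(logits, row))
print(probs)
```

The relative preference between the two legal tokens is unchanged by the mask; only the mass that sat on illegal tokens is redistributed. A row of -1 (all bits set) leaves the logits untouched.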
Q2. Grammars are over characters; models emit multi-character tokens. How is that bridged efficiently?
Model answer
A token is legal iff its character sequence keeps the automaton alive from the current state. Checking that naively is vocab × token-length automaton steps per decode step. The production trick (xgrammar) is compile-time token classification: for each automaton context, precompute which tokens are unconditionally legal/illegal and pack them into a bitmask (vocab/32 int32 words); only a small context-dependent remainder (e.g. tokens interacting with the stack) is checked at runtime. Per-step cost becomes "copy a precomputed row + check a few stragglers."
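To make the character-to-token lift tangible, here is a toy version for a pure FSM (the regex yes|no). DFA, VOCAB, walk and pack_bitmask are illustrative names; this precomputes a full row per state, which is the easy FSM case, and does not model xgrammar's handling of context-dependent tokens.

```python
import numpy as np

# Character-level DFA for the regex yes|no: state -> {char: next_state}.
DFA = {0: {"y": 1, "n": 4}, 1: {"e": 2}, 2: {"s": 3}, 3: {}, 4: {"o": 3}}
# A toy vocabulary of multi-character tokens; the list index is the token id.
VOCAB = ["y", "es", "yes", "n", "o", "no", "ye", "s", "x", "yess"]

def walk(state, token):
    """State after consuming the token's characters, or None if the DFA dies."""
    for ch in token:
        state = DFA[state].get(ch)
        if state is None:
            return None
    return state

def pack_bitmask(state):
    """Pack 'token keeps the automaton alive' into vocab/32 int32 words."""
    words = np.zeros((len(VOCAB) + 31) // 32, dtype=np.uint32)
    for tid, tok in enumerate(VOCAB):
        if walk(state, tok) is not None:
            words[tid // 32] |= np.uint32(1 << (tid % 32))
    return words.view(np.int32)

# Compile time: one precomputed row per automaton state.
MASKS = {state: pack_bitmask(state) for state in DFA}
# Decode time: the per-step cost is a dictionary lookup.
print({state: int(row[0]) for state, row in MASKS.items()})
```

From state 0 the legal tokens are "y", "yes", "n", "no" and "ye"; "yess" dies on its last character and "es" dies on its first.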
Q3. Where does grammar compilation run, and why does that placement matter?
Model answer
On the scheduler side, in a thread-pool executor (StructuredOutputManager.grammar_init).
The request's grammar field holds a Future, and the scheduler won't schedule the request
until it resolves — so a 200 ms schema compile costs that request TTFT but never stalls the
engine loop or other requests' steps. Compiled grammars are keyed by (type, spec) so
repeated schemas don't recompile. The bitmask is also filled scheduler-side and shipped to
workers as numpy (cheap serialization); the GPU only applies it.
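The Future-gated scheduling in this answer can be sketched with the standard library alone. ToyRequest, compile_grammar and run_until_compiled are hypothetical stand-ins; the sleep simulates an expensive schema compile, and the loop simulates engine steps.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor

def compile_grammar(spec):
    time.sleep(0.05)                       # stand-in for a slow schema compile
    return f"compiled<{spec}>"

class ToyRequest:
    def __init__(self, req_id, spec=None):
        self.req_id = req_id
        self.spec = spec
        self._grammar = None               # None, a Future, or a compiled grammar

    @property
    def grammar(self):
        """Return the compiled grammar, or None while the Future is pending."""
        if isinstance(self._grammar, Future):
            if not self._grammar.done():
                return None
            self._grammar = self._grammar.result()
        return self._grammar

    @property
    def schedulable(self):
        return self.spec is None or self.grammar is not None

def run_until_compiled(requests, poll_s=0.005):
    """Simulate engine steps; record which requests each step could schedule."""
    steps = []
    while not all(r.schedulable for r in requests):
        steps.append([r.req_id for r in requests if r.schedulable])
        time.sleep(poll_s)
    return steps

with ThreadPoolExecutor(max_workers=1) as pool:
    plain = ToyRequest("plain")
    constrained = ToyRequest("json", spec='{"type": "object"}')
    constrained._grammar = pool.submit(compile_grammar, constrained.spec)
    steps = run_until_compiled([plain, constrained])

print(len(steps), "steps ran without the constrained request")
```

Every recorded step schedules only the unconstrained request: the loop keeps turning while the compile finishes on the pool thread.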
Q4. How do structured outputs compose with speculative decoding?
Model answer
The bitmask carries one row per sampled position: for k draft tokens you need masks for
the state before each draft plus the bonus position, so allocation is
max_num_seqs × (1+k) rows. To compute row i the grammar tentatively calls accept_tokens on
drafts 1..i−1, and after filling all rows it calls rollback — the true accept/reject verdict
belongs to the rejection sampler post-forward, which then advances the grammar by only the
accepted prefix. rollback is part of the base grammar interface and xgrammar's matcher is
constructed with max_rollback_tokens=num_speculative_tokens — composition was designed in,
not patched on.
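A toy walk-through of the advance/rollback dance. ToyGrammar is a hypothetical matcher whose state is just a position in a fixed template of allowed-token sets; it borrows the method names from the interface described above but is not the upstream class, and it assumes the drafts were already validated.

```python
class ToyGrammar:
    """State is a position in a fixed template of allowed-token sets."""
    def __init__(self, template):
        self.template = template
        self.pos = 0

    def allowed(self):
        return self.template[self.pos] if self.pos < len(self.template) else set()

    def accept_token(self, tok):
        if tok not in self.allowed():
            return False
        self.pos += 1
        return True

    def rollback(self, n):
        self.pos -= n

def masks_for_drafts(grammar, drafts):
    """One row per sampled position: before each draft, plus the bonus slot."""
    rows, advanced = [], 0
    for tok in list(drafts) + [None]:          # None marks the bonus position
        rows.append(set(grammar.allowed()))
        if tok is not None:
            if not grammar.accept_token(tok):
                raise ValueError("drafts are assumed to be pre-validated")
            advanced += 1
    grammar.rollback(advanced)                 # undo the tentative advances
    return rows

template = [{"{"}, {"a", "b"}, {":"}, {"1", "2"}, {"}"}]
g = ToyGrammar(template)
rows = masks_for_drafts(g, ["{", "a", ":"])    # k = 3 drafts -> 4 rows
pos_after_fill = g.pos                         # back where it started

# Later, the rejection sampler reports that only the first two drafts survived.
for tok in ["{", "a"]:
    g.accept_token(tok)
print(rows, pos_after_fill, g.pos)
```

The grammar ends up advanced by exactly the accepted prefix, regardless of how far the tentative pass walked.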
Q5. A customer says constrained outputs are "valid but dumber." Diagnose.
Model answer
Real effect. Masking renormalizes but the model wasn't conditioned to follow this grammar;
when its preferred continuation is illegal, probability mass shifts to tokens it would
rarely choose, pushing generation off-distribution (worst at awkward token boundaries and
rigid whitespace rules). Mitigations: prompt with the schema and examples so the
high-probability path is also the legal path; relax the grammar where harmless (xgrammar's
any_whitespace); generate reasoning unconstrained and only constrain the final answer —
vLLM's reasoning gate (should_advance) does exactly this for thinking models. Also check
finish_reason="length": truncation, not the mask, is the most common "broken JSON" report.
Q6. Engine design: would you support different grammar backends per request? What does vLLM do and why?
Model answer
vLLM V1 deliberately supports one backend per engine (first constrained request picks it;
see the NOTE in grammar_init). Per-request backends would mean multiple compiled-grammar
caches, multiple bitmask allocation schemes, and validating every request's spec against
every backend's feature matrix — for little gain since backends are interchangeable behind
the two-interface contract (StructuredOutputBackend / StructuredOutputGrammar). The
right extension point is the contract, not per-request dispatch: that's also how attention
and quantization backends are handled.
Rapid-fire
- Mask applied where? Logits, −inf, pre-softmax (one fused kernel, apply_grammar_bitmask).
- Regex needs? FSM. JSON needs? Pushdown (stack). Schema → ? compiled to grammar first.
- Bitmask row size? vocab_size / 32 int32 words. All -1 row = ? unconstrained.
- Compile blocking the engine loop? Never — async executor, request waits unscheduled.
- Spec-decode hooks in the grammar interface? validate_tokens, rollback(n).
- What constraints can't fix: truth of content, and max_tokens truncation.
Phase 12 — Cheatsheet: Structured Outputs
The one-liner
Per step: grammar state → allowed-token bitmask → illegal logits = −inf → sample → advance state. Valid by construction; softmax renormalizes over legal tokens.
The pipeline
StructuredOutputsParams (one of json/regex/choice/grammar/json_object/structural_tag) →
grammar_init compiles async (request unschedulable until Future resolves) →
scheduler get_grammar_bitmask per step → manager fills rows → numpy → runner
apply_grammar_bitmask reorders to batch order + fused −inf kernel → sample →
accept_tokens advances.
Machines
- regex → FSM · JSON/EBNF → pushdown (stack) · JSON Schema → compiled to grammar.
- Char-rules lifted to tokens at compile time (xgrammar token classification) →
packed bitmask, vocab/32 int32 words; runtime checks only context-dependent stragglers.
Performance model
- Compile: once per distinct (type, spec) key; 10s–100s ms for big schemas; hits first request's TTFT only.
- Per step: bitmask fill (CPU, parallelized above a batch threshold) + one fused mask kernel; low single-digit % overhead steady-state.
Spec-decode composition
- Bitmask rows = max_num_seqs × (1 + num_spec_tokens) — one row per sampled position.
- Fill row i after tentatively accepting drafts < i; then rollback(advancements).
- Grammar interface has validate_tokens (check, no advance) + rollback(n); xgrammar matcher built with max_rollback_tokens=num_spec_tokens.
Gotchas
- One backend per engine (xgrammar default; guidance/outlines/lm-format-enforcer), not per request.
- Reasoning models: constraint suspended during thinking (should_advance gate; mask row set all-ones) until the reasoning parser signals end.
- Valid ≠ true; and finish_reason="length" still truncates mid-structure — budget max_tokens for the schema's worst case.
- Constrained + unconstrained drafter can tank spec acceptance — drafts get vetoed.
Key upstream
- vllm/sampling_params.py:41 StructuredOutputsParams
- v1/structured_output/backend_types.py:31 Grammar :99 Backend (the contract)
- v1/structured_output/__init__.py:36 Manager :115 grammar_init :204 grammar_bitmask :322 should_advance
- v1/structured_output/backend_xgrammar.py:77 compile_grammar :132 XgrammarGrammar
- v1/structured_output/utils.py:44 apply_grammar_bitmask (runner side, gpu_model_runner.py:4359)
- v1/core/sched/scheduler.py:1259 get_grammar_bitmask
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md
Phase 13 — The Hitchhiker's Guide to Multimodal Models
Contents
- Don't Panic
- Step 1: How a decoder-only LLM "sees" — the splice
- Step 2: Placeholders — the contract between processor and model
- Step 3: The cost — one image is a paragraph… or a chapter
- Step 4: The encoder cache — don't encode the same image twice
- Step 5: Encoder meets chunked prefill — the scheduling problem
- Step 6: Prefix caching with pixels — hashing the image itself
- The invariants to memorize
- What you'll do
Don't Panic
A vision-language model is not a new kind of engine. It's the same LLM you've been serving for twelve phases, plus a vision encoder bolted on the front. The image is encoded into a sequence of embedding vectors, and those vectors are spliced into the text embedding stream at reserved placeholder positions. From the first transformer layer onward, the model cannot tell which positions were words and which were pixels — KV cache, paged attention, continuous batching, all of it works unchanged. The new engineering is all around the splice: expanding placeholders, scheduling the encoder, and caching its output.
"What is in <image> ?" ┌──────────────┐
│ tokenize + expand pixels ──► vision encoder│──► [E1 E2 … E576]
▼ └──────────────┘ │ projector
[What][is][in][IMG][IMG]…[IMG][?] │ (to LLM dim)
│ embed text ▼
[w1][w2][w3][▢][▢]…[▢][w4] ──── overwrite ▢ positions ────► [w1 w2 w3 E1 … E576 w4]
│
▼
ordinary LLM forward (Phases 0–11)
Step 1: How a decoder-only LLM "sees" — the splice
Three parts (LLaVA is the canonical layout, llava.py):
- Vision encoder (a ViT): image → grid of patches (e.g. 24×24 = 576) → one embedding per patch.
- Projector (an MLP): maps encoder embeddings into the LLM's hidden dimension — two matrices is all it takes to make pixels speak the language model's language.
- The LLM: receives inputs_embeds where placeholder positions have been overwritten with projected image embeddings. In vLLM the overwrite is literally one indexed assignment: inputs_embeds[is_multimodal] = mm_embeds (models/utils.py:456, _merge_multimodal_embeddings).
That's the whole trick. Cross-attention encoder-decoder models (Whisper-style) are the exception, not the rule, in today's VLM zoo — the spliced decoder-only design won.
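The splice is small enough to try in numpy. This is a sketch under stated assumptions: the token ids, the hidden size of 4, and the helper name merge_multimodal_embeddings are made up for illustration, and only the masked assignment and the error wording follow the text above.

```python
import numpy as np

IMAGE_TOKEN_ID = 0                      # hypothetical placeholder id

def merge_multimodal_embeddings(inputs_embeds, is_multimodal, mm_embeds):
    """Overwrite placeholder rows with encoder output, checking the count contract."""
    num_slots = int(is_multimodal.sum())
    if mm_embeds.shape[0] != num_slots:
        raise ValueError(
            f"Attempted to assign {mm_embeds.shape[0]} multimodal tokens "
            f"to {num_slots} placeholders")
    inputs_embeds[is_multimodal] = mm_embeds    # in-place masked assignment
    return inputs_embeds

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(6, 4))       # toy vocab of 6, hidden size 4
token_ids = np.array([1, 2, 0, 0, 0, 3])        # three image placeholder slots
text_only = embedding_table[token_ids]          # fancy indexing copies the rows

is_multimodal = token_ids == IMAGE_TOKEN_ID
mm_embeds = np.full((3, 4), 7.0)                # pretend projector output
merged = merge_multimodal_embeddings(text_only.copy(), is_multimodal, mm_embeds)
print(merged)
```

After the call, nothing in merged records which rows came from pixels; the mask is the only place that knowledge lives.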
Step 2: Placeholders — the contract between processor and model
Before the model runs, the multimodal processor rewrites the prompt: the single
<image> marker becomes N repeated image tokens, and a PlaceholderRange(offset, length)
(multimodal/inputs.py:119) records exactly where. This bookkeeping is the contract:
- The tokenizer side promises: positions [offset, offset+length) are dummies awaiting embeddings (some models interleave real structure — row separators — so is_embed can mask which positions inside the range are actually image slots).
- The model side promises: the encoder will produce exactly length (or is_embed.sum()) embeddings. Get the count wrong and you get the classic VLM crash — upstream raises "Attempted to assign X multimodal tokens to Y placeholders" (utils.py:484). Your lab-01 makes you maintain this invariant by hand.
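The processor half of the contract, sketched with a tiny dataclass. ToyPlaceholderRange and expand_placeholders are hypothetical names (the upstream class also carries is_embed, omitted here), and the token ids are arbitrary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToyPlaceholderRange:
    offset: int     # index of the first dummy token in the expanded prompt
    length: int     # number of dummy tokens awaiting embeddings

def expand_placeholders(token_ids, image_token_id, tokens_per_image):
    """Replace each image marker with N dummies and record where they landed."""
    out, ranges = [], []
    sizes = iter(tokens_per_image)
    for tid in token_ids:
        if tid != image_token_id:
            out.append(tid)
            continue
        n = next(sizes, None)
        if n is None:
            raise ValueError("more image markers than images")
        ranges.append(ToyPlaceholderRange(offset=len(out), length=n))
        out.extend([image_token_id] * n)
    if next(sizes, None) is not None:
        raise ValueError("more images than image markers")
    return out, ranges

IMG = 99
expanded, ranges = expand_placeholders([1, 2, IMG, 3, IMG, 4], IMG, [4, 2])
print(expanded)
print(ranges)
```

Note how the second range's offset already accounts for the first image's expansion; every later offset shifts by the tokens inserted before it.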
Step 3: The cost — one image is a paragraph… or a chapter
Image tokens are real tokens downstream: they occupy KV-cache blocks (Phase 2), consume scheduler token budget (Phase 3), and lengthen every later attention read. Typical scales:
| Model | One image becomes |
|---|---|
| LLaVA-1.5 (fixed 336²) | 576 tokens — always |
| Qwen2-VL (dynamic resolution) | ~4 → ~16k tokens, ∝ pixel count |
Dynamic resolution is the dangerous one: token count grows quadratically with image
side length (lab-02 measures the law on real Qwen2-VL). A 4-image request can dwarf its own
text. This is why MM models need their own memory profiling (compute_mm_encoder_budget,
encoder_cache_manager.py:269) — the worst-case image inflates both KV and the encoder
cache, and the engine must reserve for it at startup.
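The arithmetic behind these numbers, as a sketch. It assumes a plain patch grid with 14-pixel patches and no token-merging step, rounds partial patches up, and uses a KV block size of 16; all three are assumptions for illustration, and real processors (Qwen2-VL included) apply their own resizing and merging rules.

```python
import math

def grid_image_tokens(width_px, height_px, patch_px=14):
    """One token per patch on a ceil-divided grid (assumed model, no merging)."""
    cols = math.ceil(width_px / patch_px)
    rows = math.ceil(height_px / patch_px)
    return rows * cols

def kv_blocks(num_tokens, block_size=16):
    """KV-cache blocks needed to hold num_tokens (block size is an assumption)."""
    return math.ceil(num_tokens / block_size)

image_tokens = grid_image_tokens(336, 336)      # 24 x 24 patch grid
request_tokens = 20 - 1 + image_tokens          # 20-token prompt, marker replaced
print(image_tokens, request_tokens, kv_blocks(request_tokens))

# Doubling the side length quadruples the token count.
print(grid_image_tokens(672, 672) // image_tokens)
```

The same two functions are enough to price a traffic mix: sum the tokens per request, convert to blocks, and compare against the pool size from Phase 2.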
Step 4: The encoder cache — don't encode the same image twice
Encoder output is expensive (a full ViT forward) and reusable — the same image appears
across chunked-prefill steps of one request, across retries, across users pasting the same
screenshot. vLLM keeps finished encoder outputs in an EncoderCacheManager
(v1/core/encoder_cache_manager.py:17), a second cache next to the KV cache with its own
currency: it's measured in encoder embeddings, not blocks.
Design rhymes with Phase 2's block pool — learn the mapping:
| BlockPool (Phase 2) | EncoderCacheManager (here) |
|---|---|
| block hash | mm_hash (content hash of the image) |
| ref_cnt | cached[mm_hash] = set of referencing request IDs |
| free queue (LRU eviction) | freeable OrderedDict (evict oldest unreferenced) |
| allocate / free | allocate / free_encoder_input, reclaim at allocation time |
Cross-request sharing falls out of content hashing: two requests with the same image hit
the same mm_hash (check_and_update_cache, :91).
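A compact toy version of the mapping in the table. ToyEncoderCache borrows the field names cached and freeable and the method name check_and_update_cache from the text; everything else (the sizes dict, the freed list, the exact method signatures) is an assumption made for this sketch and is far simpler than upstream.

```python
from collections import OrderedDict

class ToyEncoderCache:
    def __init__(self, capacity):
        self.num_free_slots = capacity   # capacity is counted in encoder embeddings
        self.cached = {}                 # mm_hash -> set of referencing request ids
        self.sizes = {}                  # mm_hash -> number of embeddings held
        self.freeable = OrderedDict()    # unreferenced entries, oldest first
        self.freed = []                  # hashes evicted so far

    def check_and_update_cache(self, req_id, mm_hash):
        if mm_hash not in self.cached:
            return False
        self.freeable.pop(mm_hash, None)     # referenced again, so not evictable
        self.cached[mm_hash].add(req_id)
        return True

    def allocate(self, req_id, mm_hash, num_embeds):
        if self.num_free_slots + sum(self.freeable.values()) < num_embeds:
            return False                     # would not fit even after evictions
        while self.num_free_slots < num_embeds:
            victim, size = self.freeable.popitem(last=False)   # evict oldest
            del self.cached[victim], self.sizes[victim]
            self.num_free_slots += size
            self.freed.append(victim)
        self.num_free_slots -= num_embeds
        self.cached[mm_hash] = {req_id}
        self.sizes[mm_hash] = num_embeds
        return True

    def free(self, req_id):
        for mm_hash, refs in self.cached.items():
            if req_id in refs:
                refs.discard(req_id)
                if not refs:                 # last reference gone: evictable, not evicted
                    self.freeable[mm_hash] = self.sizes[mm_hash]

cache = ToyEncoderCache(capacity=600)
first = cache.allocate("A", "img1", 576)
shared = cache.check_and_update_cache("B", "img1")   # same image, second request
cache.free("A")
cache.free("B")                                      # img1 lingers, freeable
second = cache.allocate("C", "img2", 576)            # forces eviction of img1
print(first, shared, second, cache.freed, cache.num_free_slots)
```

Space is reclaimed lazily: finishing a request only marks the entry freeable, and eviction happens when a later allocation actually needs the room.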
Step 5: Encoder meets chunked prefill — the scheduling problem
Chunked prefill (Phase 3) slices a long prompt into budget-sized pieces. But an image
embedding is produced by one indivisible encoder forward — you can't compute the first
half of a ViT's patches this step and the rest next step. So the scheduler must reconcile
two granularities, and _try_schedule_encoder_inputs (scheduler.py:1096) is the
reconciliation. An encoder input is scheduled this step iff:
- its placeholder range overlaps the token window being computed, [num_computed_tokens, num_computed_tokens + num_new_tokens);
- it isn't already in the encoder cache;
- the per-step encoder compute budget has room (encoders are compute-heavy; unbounded encoder work would blow up step time exactly like unbounded prefill would);
- the encoder cache has space to hold the output.
If any check fails, the scheduler shrinks num_new_tokens to stop just before the
unschedulable image — prefill the text up to the doorstep, wait for the next step. And once
encoded-and-cached, a chunk boundary can land mid-placeholder freely: later chunks read
the cached embeddings. Lab-03 builds this exact logic, all-or-nothing encodes and all.
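The four checks and the truncation fallback fit in one short function. This is a distilled sketch, not the upstream signature: placeholders are plain (mm_hash, offset, length) tuples sorted by offset, and the budgets are counted in encoder embeddings.

```python
def try_schedule_encoder_inputs(num_computed, num_new, placeholders,
                                cached, encoder_budget, cache_free):
    """Return (encoder inputs to run this step, possibly truncated num_new)."""
    to_run = []
    for mm_hash, offset, length in placeholders:
        if offset >= num_computed + num_new:
            break                        # image starts after this chunk
        if offset + length <= num_computed:
            continue                     # image is entirely behind us
        if mm_hash in cached or mm_hash in to_run:
            continue                     # embeddings already available
        if length > encoder_budget or length > cache_free:
            # Cannot afford the encode: end the chunk just before the image.
            num_new = max(0, offset - num_computed)
            break
        to_run.append(mm_hash)           # all-or-nothing: the whole image this step
        encoder_budget -= length
        cache_free -= length
    return to_run, num_new

placeholders = [("img", 10, 576)]        # 10 text tokens, then a 576-token image

# Step 1: the chunk enters the image and there is budget, so encode now.
print(try_schedule_encoder_inputs(0, 256, placeholders, set(), 576, 576))
# Step 2: the chunk starts mid-placeholder, but the embeddings are cached.
print(try_schedule_encoder_inputs(256, 256, placeholders, {"img"}, 576, 0))
# No encoder budget: stop at the doorstep, ten text tokens in.
print(try_schedule_encoder_inputs(0, 256, placeholders, set(), 0, 576))
```

The third call is the behaviour worth remembering: the request still makes progress on its text, and the image waits for a step that can pay for it.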
Step 6: Prefix caching with pixels — hashing the image itself
Phase 3's prefix cache keys blocks by token IDs — but two different images expand to the
same dummy token IDs! Sharing on token IDs alone would serve user B answers about user
A's photo. Fix: MultiModalHasher (multimodal/hasher.py:50) content-hashes the actual
image bytes, and that mm_hash is folded into the block hashes covering the placeholder
range. Same prompt + same pixels → full prefix-cache hit; same prompt + different pixels →
miss exactly at the image. (The same hash doubles as the encoder-cache key — one identity
for both caches.)
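A small sketch of why the image hash has to enter the block hash. The chained sha256 over repr() strings is a stand-in chosen for readability, not upstream's hashing scheme, and block_hashes with its (offset, length, mm_hash) tuples is a hypothetical helper.

```python
import hashlib

def block_hashes(token_ids, block_size, placeholders):
    """Chained hashes of full blocks; overlapping image hashes are mixed in."""
    hashes, parent = [], b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        end = start + block_size
        # Extra keys: every image whose placeholder range overlaps this block.
        extra = [h for off, length, h in placeholders
                 if off < end and start < off + length]
        digest = hashlib.sha256()
        digest.update(parent)                               # chain to the prefix
        digest.update(repr(token_ids[start:end]).encode())
        digest.update(repr(extra).encode())
        parent = digest.digest()
        hashes.append(parent)
    return hashes

# Same token ids for both users: text, four image dummies (id 9), more text.
tokens = [1, 2, 3, 4, 9, 9, 9, 9, 5, 6, 7, 8]
cat = block_hashes(tokens, 4, [(4, 4, "hash-of-cat")])
dog = block_hashes(tokens, 4, [(4, 4, "hash-of-dog")])
print(cat[0] == dog[0], cat[1] == dog[1], cat[2] == dog[2])
```

The text block before the image is shared, the image block diverges, and because the hashes are chained every block after it diverges too.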
The invariants to memorize
- A VLM = encoder + projector + unchanged LLM; image embeddings overwrite placeholder
positions in inputs_embeds. After the splice, the engine can't tell pixels from words.
- PlaceholderRange is a contract: processor-side expansion count must equal encoder-side embedding count, exactly.
- Image tokens are real tokens: they cost KV blocks, token budget, and attention time — dynamic-resolution models scale ∝ pixels (quadratic in side length).
- The encoder cache is a second cache with its own budget, keyed by content hash, ref-counted per request, LRU-evicted when unreferenced.
- Encoder runs are all-or-nothing; chunked prefill stops at the doorstep of an image it can't afford this step.
- Prefix caching must mix the image hash into block hashes — token IDs alone are ambiguous for placeholder spans.
What you'll do
- Read: 01-deep-dive.md — processor, placeholder machinery, encoder cache, scheduler hook, and LLaVA/Qwen2-VL as case studies, line-anchored.
- Build: 02-mini-build.md — a fake-image pipeline for mini_vllm: placeholder expansion + toy encoder + content-hash cache.
- Labs (see labs/README.md; recommended order 01 → 03 → 02):
  - lab-01-image-token-expansion [CPU-OK] — pixels → patches → tokens → blocks: placeholder expansion, PlaceholderRange bookkeeping, and the capacity punchline (one image = 38 KV blocks).
  - lab-03-encoder-scheduling [CPU-OK] — chunked prefill meets the vision tower: per-step encoder budget, all-or-nothing encodes, truncate-at-the-doorstep, and the cache that restores mid-placeholder freedom (V1's _try_schedule_encoder_inputs, distilled).
  - lab-02-run-a-vlm [GPU-OPT] — Qwen2-VL on a real photo: the 1,421-token "one-line" prompt, the quadratic resize law, the encoder's TTFT spike. Captured output included.
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
Phase 13 — Deep Dive: multimodal in real vLLM
Paths relative to upstream/ at v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number drifts in a newer tree, search for the named symbol.
- vllm/multimodal/inputs.py: PlaceholderRange + input dataclasses (read first)
- vllm/multimodal/hasher.py: MultiModalHasher — content identity
- vllm/model_executor/models/llava.py: the canonical VLM (encoder+projector+splice)
- vllm/model_executor/models/utils.py: _merge_multimodal_embeddings (the splice itself)
- vllm/v1/core/encoder_cache_manager.py: the second cache
- vllm/v1/core/sched/scheduler.py: _try_schedule_encoder_inputs (the hook)
- vllm/model_executor/models/qwen2_vl.py: dynamic resolution (contrast case)
Contents
- 1. The contract: PlaceholderRange
- 2. The processor: prompt rewriting, LLaVA-style
- 3. The splice: _merge_multimodal_embeddings
- 4. Identity: MultiModalHasher
- 5. The second cache: EncoderCacheManager
- 6. The scheduler hook: _try_schedule_encoder_inputs
- 7. Contrast case: Qwen2-VL dynamic resolution
- Reading checklist
1. The contract: PlaceholderRange
vllm/multimodal/inputs.py:119 — class PlaceholderRange(offset, length, is_embed). The
docstring example is the whole idea: prompt AAAA BBBB What is… gives image A
PlaceholderRange(offset=0, length=4), image B (offset=5, length=4). is_embed is the
subtlety: some models put structure tokens inside the range (Pixtral inserts a row-break
token after each patch row — see llava.py:390, ([image_token_id] * ncols + [image_break_id]) * nrows), so a boolean mask says which positions actually receive
embeddings. Everything downstream — scheduler windowing, embedding merge, profiling — is
arithmetic over these ranges.
2. The processor: prompt rewriting, LLaVA-style
vllm/model_executor/models/llava.py is the layout to internalize, because Phase 14's
"add a model" recipe reuses every piece:
- BaseLlavaProcessingInfo.get_num_image_tokens (:188) — asks the vision-encoder info object how many tokens an H×W image becomes. This number is model math, not a constant.
- LlavaDummyInputsBuilder (:222) — builds worst-case fake inputs (image_token * num_images) so startup profiling (Phase 1's memory measurement) sees the most expensive possible multimodal request before any real one arrives.
- BaseLlavaMultiModalProcessor._get_prompt_updates (:264) — the rewrite rule: replace one image_token_id with [image_token_id] * num_image_tokens (:297). This is where <image> becomes 576 dummies and the PlaceholderRange is born.
- The registry (vllm/multimodal/registry.py) binds processor classes to model classes via the @MULTIMODAL_REGISTRY.register_processor decorator on the model (:308 region).
3. The splice: _merge_multimodal_embeddings
vllm/model_executor/models/utils.py:456. After embed_multimodal (llava.py:661) runs
encoder + projector (LlavaMultiModalProjector, :128 — two linears and an activation),
the merge is one line:
inputs_embeds[is_multimodal] = mm_embeds_flat.to(dtype=input_dtype)
An in-place masked scatter — pixels become "words" by assignment. Read the except RuntimeError block (:478): the count-mismatch error ("Attempted to assign X multimodal
tokens to Y placeholders") is the canonical symptom of a broken processor↔model contract,
and the first thing you'll debug when adding a VLM. Note also the comment about keeping
is_multimodal on CPU to avoid a device sync — model-runner hot path discipline.
4. Identity: MultiModalHasher
vllm/multimodal/hasher.py:50. hash_kwargs (:154) serializes each item (images go via
serialize_item, :52 — raw bytes, not object identity) through blake3-style hashing into
an mm_hash string. One hash, two jobs:
- Encoder-cache key — same image in any request hits the same cached embeddings.
- Prefix-cache ingredient — the hash is folded into KV block hashes covering the
placeholder span (Phase 3's kv_cache_utils block hasher takes extra_keys for exactly
this), so identical dummy token IDs with different pixels cannot alias.
5. The second cache: EncoderCacheManager
vllm/v1/core/encoder_cache_manager.py:17. Read the class docstring — it is unusually
complete. The structure, mapped to Phase 2 vocabulary:
- cached: dict[mm_hash, set[request_id]] — ref-counting by named references instead of an integer ref_cnt (you can ask who holds it).
- freeable: OrderedDict[mm_hash, num_embeds] — the LRU free-queue analogue: entries with zero referencing requests, evictable oldest-first, reclaimed lazily at allocation time (can_allocate, :119) exactly like Phase 2's cached-block eviction.
- num_free_slots vs num_freeable_slots — actual free space vs free-after-evictions; the allocate path decides how much eviction it must perform.
- Units are encoder embeddings, not blocks or bytes (see the NOTE in the docstring:
in-between break/text tokens don't count) — the budget that sized this cache comes from
compute_mm_encoder_budget (:269) at startup.
- get_freed_mm_hashes (:255) — drained each step into SchedulerOutput (scheduler.py:901) so workers drop their copies: the manager is scheduler-side bookkeeping; the tensors live in the runner's encoder_cache dict (gpu_model_runner.py:3065). Same split-brain pattern as KV: scheduler owns accounting, worker owns memory.
6. The scheduler hook: _try_schedule_encoder_inputs
vllm/v1/core/sched/scheduler.py:1096. Called for both running (:410) and waiting
(:679) requests. The docstring lists the four conditions (overlap with the computed
window; not already cached; encoder compute budget; encoder cache space). The mechanism to
study is the fallback: when an encoder input fails a check, the function truncates
num_new_tokens so the chunk ends just before the placeholder — the request still makes
progress on text, and the image waits for a step with budget. Consequences worth saying
out loud:
- Encoder work rides the same step as the decoder chunk that first overlaps the image — there is no separate "encoder phase" (contrast Phase 15's encode-disaggregated serving, where there is).
- The per-step encoder_compute_budget bounds step-time inflation; the cache-space check prevents an admission deadlock (an image that can never fit is rejected at the front door, compute_mm_encoder_budget sizing guarantees the worst case fits).
- On allocation (:524/:810), the manager records the request as a referent; on request finish, free (:939) just de-references — the embeddings linger, freeable, for reuse.
7. Contrast case: Qwen2-VL dynamic resolution
vllm/model_executor/models/qwen2_vl.py. Versus LLaVA's fixed 576: token count is a
function of the actual image (grid_thw — patches per height/width/time), so
get_num_image_tokens does real arithmetic, video adds a time dimension, and M-RoPE
(multimodal rotary position encoding — text positions and 2-D image positions interleaved)
replaces vanilla RoPE. You don't need every detail; you need to recognize which parts of
the Phase-13 machinery flex (token counting, dummy-input profiling, position encoding)
and which don't (placeholder contract, encoder cache, scheduler hook — identical).
Reading checklist
- PlaceholderRange — what is is_embed for? Find the Pixtral line that makes it necessary (llava.py:390).
- _get_prompt_updates in llava.py — where exactly does 1 token become N?
- _merge_multimodal_embeddings — what's the invariant, and what error message do you get when it breaks?
- EncoderCacheManager.check_and_update_cache / can_allocate — walk a second request arriving with the same image: which dict/list transitions happen?
- _try_schedule_encoder_inputs — all four scheduling conditions, and what happens to num_new_tokens when one fails?
- In scheduler.py:901, how do workers learn an encoder entry was evicted?
Now build it: 02-mini-build.md, then the labs.
Phase 13 — Mini-Build: a fake-image pipeline for mini_vllm
Contents
- Your task
- Why build it (and not just read it)
- The spec
- Method
- Definition of done
- Map back to the real engine
Your task
Teach mini_vllm to serve a request that carries a fake "image": expand a placeholder into
N synthetic image tokens, run a toy encoder (deterministic function of the image bytes),
splice the embeddings, and cache encoder outputs by content hash so the same image is never
encoded twice.
Why build it (and not just read it)
Reading the real feature tells you what production does. Re-implementing a tiny version tells you why every decision was made — which is the understanding that survives into an interview or a 2 a.m. incident. Keep it small; keep it tested.
The spec
- Request extension: Request may carry images: list[bytes] and a prompt containing the marker token <IMG>. A processing step expands each marker to num_image_tokens(image) placeholder token IDs and records PlaceholderRange(offset, length) — your own tiny dataclass. Make num_image_tokens = (len(image_bytes) // 64) + 1 so "resolution" varies (the dynamic-resolution lesson in one line).
- Toy encoder: encode(image_bytes) -> np.ndarray[length, d], deterministic (seed a RNG from the content hash). Pretend it's expensive: count invocations.
- Encoder cache: dict keyed by sha256(image_bytes), with per-request reference sets and an LRU freeable list with a capacity in embeddings — a 40-line EncoderCacheManager mirroring upstream's cached / freeable / freed trio.
- The splice: in the (fake) forward, inputs_embeds[is_image_position] = cached_embeddings — assert the count contract and raise the upstream-style "X multimodal tokens to Y placeholders" error on mismatch.
- Scheduler touch: image tokens must pass through your Phase-3 scheduler as ordinary tokens (KV blocks allocated, token budget consumed). If you did lab-03, optionally bolt on the per-step encoder budget + truncate-at-the-doorstep rule.
Method
- Re-read encoder_cache_manager.py:17 (docstring is the design doc) and models/utils.py:456 (the splice).
- Build processor → encoder → cache → splice in that order; test each before the next.
- pytest mini_vllm -q and keep it green.
Definition of done
- CPU only, numpy only.
- A test proves expansion arithmetic: a prompt with 2 images of different sizes yields
correct total length and two correct PlaceholderRanges.
- A test proves cache sharing: two requests, same image bytes → encoder invoked once; different bytes (same length!) → invoked twice. This is the content-hash lesson.
- A test proves the contract: corrupt the expansion count and assert the mismatch error fires.
- A test proves eviction: capacity for one image's embeddings; finish request A, admit B with a new image → A's entry evicted (and its hash reported freed), not B rejected.
- You can say out loud where yours simplifies: no real ViT, no projector dim-matching, no
is_embed masks, no chunked-prefill interaction unless you added it.
Map back to the real engine
| Yours | Upstream |
|---|---|
| marker expansion + range | _get_prompt_updates (llava.py:264) + PlaceholderRange (inputs.py:119) |
| sha256(image_bytes) | MultiModalHasher.hash_kwargs (hasher.py:154) |
| cache dict + refs + LRU | EncoderCacheManager (encoder_cache_manager.py:17) |
| splice + count assert | _merge_multimodal_embeddings (models/utils.py:456) |
| encoder budget rule (optional) | _try_schedule_encoder_inputs (scheduler.py:1096) |
Phase 13 Labs — Multimodal Models
Three labs on the trick that lets a text engine see: translate pictures into the core's one currency — tokens — at the boundary, and keep Phases 1–3 untouched. The arc: build the expansion that turns pixels into sequence length (lab-01), referee the collision between chunked prefill and the can't-encode-half-a-picture vision tower (lab-03), then run a real VLM and reconcile every number — the 1,421-token "one-line" prompt, the quadratic resize law, the encoder's TTFT spike (lab-02).
Recommended order: 01 → 03 → 02. CPU labs follow the standard contract —
starter.py (your work), solution.py (reference), test_lab.py (the spec); default
runs the solution, LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-13-multimodal-models/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-13-multimodal-models/labs/lab-01-image-token-expansion -q
Contents
- lab-01-image-token-expansion [CPU-OK]
- lab-02-run-a-vlm [GPU-OPT]
- lab-03-encoder-scheduling [CPU-OK]
- What you can do after this phase
Labs
lab-01-image-token-expansion [CPU-OK]
Pixels → patches → tokens → blocks: the ViT patch arithmetic (with its double-ceiling
traps), the placeholder splice that rewrites the prompt, and the PlaceholderRange
bookkeeping everything downstream navigates by. The punchline test: a "20-token
prompt" with one image is a 595-token request needing 38 KV blocks. Skills: the
quadratic resolution law; containment-by-translation as architecture; multi-image
offset shifting; validating counts at the boundary.
lab-02-run-a-vlm [GPU-OPT]
Qwen2-VL-2B on a real photo: the ~30-token prompt arriving as 1,421 tokens, the ~4×
drop on halving resolution (predicted first, measured second), and the 41 → 118 ms
TTFT gap that is the vision encoder on a clock. Plus the operational surfaces:
resize policy as the cheapest capacity lever, limit_mm_per_prompt, and the
three-cache stack (processor / encoder / prefix). Annotated capture included.
Skills: auditing a processor's decisions; segmenting TTFT by has-image; the
quality cliff in resize tuning.
lab-03-encoder-scheduling [CPU-OK]
The collision: chunked prefill slices anywhere, but you can't encode half a picture. Implement V1's answer — per-step encoder budget, all-or-nothing encodes scheduled when a chunk enters a placeholder, truncate-at-the-doorstep when unaffordable, and the encoder cache that restores mid-placeholder freedom. Seven scenarios from pure-text to the zero-budget starvation edge. Skills: a third resource ledger; the cache-at-the-granularity-boundary pattern; why VLM prefills stall one token before their image.
What you can do after this phase
Price an image (or a video) in tokens, blocks, and TTFT before deploying it; predict
and explain VLM capacity from the traffic's image-size distribution; tune the resize
policy, encoder budget, and per-prompt limits with eyes open; and read
vllm/multimodal/ plus the V1 encoder-scheduling path as machinery you've already
built small. Phase 14 goes inside the models themselves — including how a vision
tower bolts onto a language model in the first place.
Lab 13-01 — Image-Token Expansion: Where Pictures Become Sequence Length [CPU-OK]
Here is the entire secret of multimodal serving, and it fits in one sentence: to the
engine, an image is just tokens. The user's prompt says <image>; the processor
replaces that single placeholder with N placeholder positions (144 for a 336×336
LLaVA-style image, 576+ for high-res); the vision encoder's embeddings will occupy
those positions; and from that moment every subsystem you've built in this course —
the scheduler's token budget (Phase 3), KV blocks (Phase 2), TTFT arithmetic
(Phase 1) — treats them like any other tokens. This lab implements the expansion: the
patch arithmetic that converts pixels to a token count, the splice that rewrites the
prompt, and the PlaceholderRange bookkeeping that remembers where each image lives —
the exact data structure upstream uses.
Contents
- Why this lab exists
- Background: pixels → patches → tokens → blocks
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Multimodal capacity surprises kill deployments. A chat service adds image support; the
prompt text barely grew, yet TTFT triples and concurrency halves — because every
image silently added hundreds of tokens that nobody counted. The arithmetic in this
lab is the inoculation: image_token_count tells you what a resolution costs,
test_resolution_is_quadratic_cost makes the scaling law visceral (double the sides,
4× the bill), and test_the_scheduler_sees_only_the_expanded_length does the
capacity-planning punchline — a "20-token prompt" with one image is a 595-token
request needing 38 KV blocks instead of 2. Run your traffic's image-size distribution
through these three functions before you ship a VLM, and Phase 0 lab-02's concurrency
math stays honest.
The deeper design point: expansion is how multimodality gets contained. The engine's core (scheduler, KV manager, attention) never learns what an image is — it sees a longer token sequence plus an opaque side-channel (the embeddings, delivered by lab-03's encoder scheduling). That containment is why vLLM could add vision, audio, and video without rewriting Phases 1–3, and it's the architectural pattern to copy: translate the exotic thing into the core's existing currency at the boundary.
Background: pixels → patches → tokens → blocks
The pipeline, stage by stage:
- Pixels → patches: ViT encoders slice the image into patch × patch pixel tiles (14 px is the common size), ceil(side / patch) per dimension. A 336×336 image: 24×24 = 576 patches.
- Patches → tokens: many modern VLMs (Qwen-VL family and others) then merge merge × merge neighborhoods (pixel-unshuffle / spatial merge) to shrink the sequence: 24×24 → 12×12 = 144 tokens. Both divisions ceil: odd sizes round up at each stage, and test_patch_arithmetic's 337-pixel case pins the double-ceiling (a classic off-by-one source when re-implementing processors).
- Tokens → the sequence: each <image> occurrence in the tokenized prompt is replaced by its image's count of sentinel ids, and a PlaceholderRange(offset, length) records the span. These are the coordinates that lab-03's encoder scheduling and the model runner's embedding-scatter both navigate by. Multi-image prompts produce ordered, disjoint ranges whose offsets shift by earlier expansions (test_multi_image_ranges_are_ordered_and_disjoint pins the shift).
- Sequence → blocks: Phase 2's ceil-div, unchanged. Image KV is KV.
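The first two stages and the last one reduce to integer ceiling divisions. A minimal sketch, assuming a 14-px patch and a 2×2 merge as in the text; the signatures below are guesses at the lab's interface (starter.py is authoritative), and the reading that the 20-token prompt includes its single placeholder and uses an unmerged 576-token image is an assumption chosen because it reproduces the 595 figure:

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding float rounding."""
    return -(-a // b)


def image_token_count(width: int, height: int, patch: int = 14, merge: int = 2) -> int:
    """Tokens one image occupies: ceil to patches, then ceil again to merged tokens."""
    patches_w = ceil_div(width, patch)
    patches_h = ceil_div(height, patch)
    return ceil_div(patches_w, merge) * ceil_div(patches_h, merge)


def kv_blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Phase 2's ceil-div: image tokens are charged like any other token."""
    return ceil_div(num_tokens, block_size)


print(image_token_count(336, 336))  # 24x24 patches -> 12x12 = 144
print(image_token_count(337, 337))  # 25x25 patches -> 13x13 = 169 (double ceiling)
print(image_token_count(672, 672))  # 48x48 patches -> 24x24 = 576 (2x sides, 4x tokens)

# Assumed reading of the punchline: 20 prompt tokens, one of which is the
# placeholder, replaced by an unmerged 576-token image.
expanded = 20 - 1 + image_token_count(336, 336, merge=1)
print(expanded, kv_blocks_needed(expanded), kv_blocks_needed(20))  # 595 38 2
```

Flooring either stage gives 144 for the 337-pixel image as well, which is exactly the off-by-one the 337-pixel test is there to catch.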
Files
- starter.py: image_token_count, expand_prompt, kv_blocks_needed. Your work.
- solution.py: reference.
- test_lab.py: the patch arithmetic (with the ceiling traps), quadratic scaling, the splice, multi-image ranges, the count-mismatch assert, and the capacity punchline.
Run
LAB_IMPL=starter pytest phase-13-multimodal-models/labs/lab-01-image-token-expansion -q
pytest phase-13-multimodal-models/labs/lab-01-image-token-expansion -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_patch_arithmetic | 336×336 → 144 (the LLaVA-ish number you'll see in lab-02's capture), the 1-token thumbnail floor, and ceiling-at-both-stages |
| test_resolution_is_quadratic_cost | 2× resolution = 4× tokens: why high-res modes are a capacity feature, not just a quality feature, and why production VLM configs cap image size |
| test_expansion_splices_in_place | The rewrite, exactly: text before, N sentinels, text after |
| test_multi_image_ranges_are_ordered_and_disjoint | Range offsets account for earlier expansions; every placeholder position holds the sentinel. The bookkeeping the embedding scatter trusts |
| test_mismatched_counts_assert | N placeholders demand N images: the validation that turns a garbled request into a clean 400 error instead of a runtime tensor-shape crash three layers deep |
| test_the_scheduler_sees_only_the_expanded_length | 20 "text" tokens + 1 image = 595 tokens, 38 blocks. The line item your capacity model was missing |
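The splice and range bookkeeping that the middle rows pin can be sketched in a few lines. This is an illustration under assumptions, not the lab's reference: IMAGE_MARKER is a hypothetical id for the single tokenized <image> marker, the return shape is a guess, and the mismatch check raises ValueError where the lab's test name suggests an assert.

```python
from dataclasses import dataclass

IMAGE_MARKER = -1   # hypothetical id of the one "<image>" token in the raw prompt
SENTINEL = -100     # id written at every expanded placeholder position


@dataclass(frozen=True)
class PlaceholderRange:
    offset: int   # index of the first sentinel in the expanded prompt
    length: int   # number of sentinel positions for this image


def expand_prompt(token_ids, image_token_counts):
    """Replace each marker with its image's sentinels and record where they land."""
    markers = token_ids.count(IMAGE_MARKER)
    if markers != len(image_token_counts):
        raise ValueError(f"{markers} placeholders but {len(image_token_counts)} images")
    expanded, ranges = [], []
    counts = iter(image_token_counts)
    for tok in token_ids:
        if tok == IMAGE_MARKER:
            n = next(counts)
            # The offset is measured in the EXPANDED sequence, so it already
            # includes the growth caused by earlier images.
            ranges.append(PlaceholderRange(offset=len(expanded), length=n))
            expanded.extend([SENTINEL] * n)
        else:
            expanded.append(tok)
    return expanded, ranges


tokens, ranges = expand_prompt([1, 2, IMAGE_MARKER, 3, IMAGE_MARKER, 4], [3, 2])
print(tokens)  # [1, 2, -100, -100, -100, 3, -100, -100, 4]
print(ranges)  # offsets 2 and 6: the second offset is shifted by the first expansion
```

Validating the count before building anything is the design point: the failure surfaces at the request boundary, while the error can still name the request.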
Hitchhiker's notes
- Where this lives upstream: the per-model processor (upstream/vllm/model_executor/models/<model>.py + vllm/multimodal/processing.py) performs exactly this expansion at request-arrival time, emitting PlaceholderRanges (vllm/multimodal/inputs.py, same fields as yours). The count formula is model-specific (this is most of what differs between LLaVA, Qwen-VL, Pixtral processors); the splice machinery is shared.
- The sentinel id never reaches the embedding table. At runtime the model runner computes text embeddings for real ids and scatters the encoder's output over the placeholder positions (get_input_embeddings with the ranges as the map). Your -100 stands in for upstream's reserved placeholder id, chosen, like all sentinels, to be un-confusable with a real token (Phase 2's null block, Phase 9's -1 EOS: the course's sentinel family grows).
- Prefix caching works for images, with one amendment you can now predict (Phase 2 lab-05's "anything that changes what KV means"): the block hash must include the image content hash, or two prompts with identical text but different pictures would share KV. vLLM hashes the multimodal items into the chain; same-image-same-text re-requests (retries, multi-turn over one photo) hit cache like any system prompt.
- Variable-resolution schemes (dynamic tiling à la InternVL/GPT-4V's "high-res crops") are this lab's formula applied per tile plus a global thumbnail. The token count becomes data-dependent, which is exactly why upstream processors compute counts from actual image dimensions instead of constants, and why your capacity model must use the traffic's real size distribution.
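The prefix-caching amendment in the third note can be made concrete with a toy chained hash. This is a sketch of the idea only, not vLLM's hashing code: the (offset, length, content_hash) tuple layout, the use of SHA-256 over repr(), and hashing only full blocks are all assumptions made for illustration.

```python
import hashlib


def block_hashes(token_ids, mm_items=(), block_size=16):
    """Chained hashes for each FULL block of the expanded prompt.

    mm_items is a hypothetical list of (offset, length, content_hash) tuples,
    one per image, using PlaceholderRange coordinates.
    """
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        end = start + block_size
        h = hashlib.sha256()
        h.update(parent)                                  # chain to the prefix
        h.update(repr(token_ids[start:end]).encode())     # the token ids
        for offset, length, content_hash in mm_items:
            if offset < end and start < offset + length:  # image overlaps block
                h.update(content_hash.encode())           # fold in the picture
        parent = h.digest()
        hashes.append(parent.hex())
    return hashes


prompt = list(range(16)) + [-100] * 16          # 16 text tokens, then one image
photo_a = block_hashes(prompt, [(16, 16, "hash-of-photo-A")])
photo_b = block_hashes(prompt, [(16, 16, "hash-of-photo-B")])
print(photo_a[0] == photo_b[0])  # True: the shared text prefix still hits
print(photo_a[1] == photo_b[1])  # False: same sentinels, different picture
```

Because each hash chains to its parent, every block after a differing image also differs, which is the behavior the "what KV means" rule requires.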
Going further
- Add aspect_preserving_resize(w, h, max_side) → new dims, then recompute the token bill, reproducing the resize-then-patch pipeline real processors run, and the knob (max_side) that trades quality for capacity.
- Implement the embedding scatter: given text_emb (seq, d), image_emb (n, d), and a PlaceholderRange, produce the merged input. It takes ~3 lines with numpy slicing, and you've written the runtime half of this lab's compile-time work.
- Compute the KV-bytes per image (144 tokens × Phase 0 lab-02's per-token bytes) for a 7B model, then for video at 1 fps × 60 s. The result explains why video models lean so hard on token merging and why "just feed it the video" is a memory proposal, not a feature request.
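For the second item, one possible shape of the scatter is shown below. It is a hedged sketch: the function name and argument order are hypothetical, and it is written with plain slice assignment so it runs on nested lists without any dependency; the same two lines should work unchanged on NumPy arrays of shape (seq, d) and (n, d).

```python
def scatter_image_embeddings(text_emb, image_emb, offset, length):
    """Overwrite rows [offset, offset + length) of text_emb with image_emb.

    text_emb holds one row per sequence position (the sentinel rows are
    throwaway values); image_emb holds one row per placeholder position.
    """
    if len(image_emb) != length:
        raise ValueError("encoder output does not match the placeholder length")
    if offset < 0 or offset + length > len(text_emb):
        raise ValueError("placeholder range falls outside the sequence")
    merged = text_emb.copy()                      # leave the input untouched
    merged[offset:offset + length] = image_emb    # the scatter itself
    return merged


text_emb = [[0.0, 0.0] for _ in range(5)]         # 5 positions, d = 2
image_emb = [[1.0, 1.0], [2.0, 2.0]]              # 2 image positions
merged = scatter_image_embeddings(text_emb, image_emb, offset=1, length=2)
print(merged)  # [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
```

The two checks mirror lab-01's boundary validation: a length mismatch here is the tensor-shape crash that the count assert is meant to prevent.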
References
- upstream/vllm/multimodal/inputs.py: PlaceholderRange, the real one.
- upstream/vllm/multimodal/processing.py: the expansion machinery (PromptReplacement and friends).
- Liu et al., Visual Instruction Tuning (LLaVA, 2023), the projector-into-the-token-stream design this lab models: https://arxiv.org/abs/2304.08485
- Qwen team, Qwen2-VL (2024), the 2×2 spatial merge and dynamic resolution: https://arxiv.org/abs/2409.12191
- Lab-03: who fills the placeholders, and what it costs the scheduler.
Lab 13-02 — Run a VLM and Count Its Image Tokens [GPU-OPT]
The CPU labs predicted two numbers: how many tokens an image becomes (lab-01's patch arithmetic) and what scheduling them costs (lab-03's encoder budget). This lab runs a real vision-language model — Qwen2-VL-2B — on a real image and checks both: the prompt that tokenized to ~30 text tokens arrives at the scheduler as ~1,400 tokens (one high-res photo), the encoder's execution shows up as a prefill-time spike, and the model then answers questions about pixels it turned into KV like any other context.
No GPU? Don't panic. The captured run below is annotated against both CPU labs; the counting exercises are the lab.
Contents
- Why this lab exists
- Requirements
- Steps
- Captured output (real run, Qwen2-VL-2B-Instruct, L4, vLLM 0.22.1, trimmed)
- Reading the numbers
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Every multimodal capacity incident starts with somebody not knowing their images'
token bill, and the cure is having once watched the bill get charged: prompt in,
expanded length in the logs, KV usage jumping by hundreds of blocks per picture. This
lab is that watching — plus the operational surfaces unique to VLMs that text-only
operators haven't met: the processor's resize decisions (the same photo costs
different tokens at different max_pixels settings), the limit_mm_per_prompt guard,
and the prefill-time encoder spike that no text-only latency model predicts.
Requirements
uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2-VL-2B-Instruct # small, modern, dynamic-resolution
# any test image; a ~1280x960 photo makes the arithmetic vivid
Steps
from vllm import LLM, SamplingParams
from PIL import Image

# Small VLM; cap context length and images per prompt so the run fits a modest GPU.
llm = LLM(model="Qwen/Qwen2-VL-2B-Instruct", gpu_memory_utilization=0.7,
          max_model_len=4096, limit_mm_per_prompt={"image": 2})
image = Image.open("photo.jpg")
# Qwen2-VL's raw-prompt convention: one <|image_pad|> between the vision markers.
prompt = ("<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
          "Describe this image in one sentence.<|im_end|>\n<|im_start|>assistant\n")
out = llm.generate({"prompt": prompt, "multi_modal_data": {"image": image}},
                   SamplingParams(max_tokens=48, temperature=0))
print(out[0].outputs[0].text)
print("prompt tokens:", len(out[0].prompt_token_ids))  # the EXPANDED length
Then the three experiments: re-run with the image resized to half each side (predict the token drop with lab-01's formula first — expect ~4×); send two images and watch both placeholder expansions land; and run a text-only prompt through the same engine to baseline the TTFT difference (the encoder's share, lab-03's budget made visible).
Captured output (real run, Qwen2-VL-2B-Instruct, L4, vLLM 0.22.1, trimmed)
INFO ... Using Flash Attention backend.
prompt tokens: 1421 # ~30 text tokens + ~1391 image tokens
'A golden retriever sits on a wooden dock beside a calm lake at sunset.'
# same photo, resized to half each side:
prompt tokens: 378 # ~30 text + ~348 image (~4x fewer, as predicted)
# text-only TTFT: 41 ms ; with the full-res image: 118 ms
# (the gap = vision encoder + the bigger prefill — lab-03's encoder cost, on a clock)
Reading the numbers
- 1421 tokens for a "one-line" prompt: lab-01's punchline on real silicon. Check the arithmetic: Qwen2-VL at native ~1280×960 → 28-px effective patches after the 2×2 merge → ⌈1280/28⌉×⌈960/28⌉ = 46×35 = 1,610 before the processor's max_pixels resize trims it to ~1,391. Your prediction landing within ~15% of the log (resize policy explains the gap) is the pass condition.
- 378 after halving: the quadratic law, confirmed, with ~4× fewer image tokens. The cheapest capacity lever in multimodal serving is the resize policy (min_pixels / max_pixels in the processor config), and it's set per deployment, not per model.
- TTFT 41 → 118 ms: the encoder runs at prefill time (lab-03: scheduled with the chunk that enters the placeholder), so images tax time-to-first-token specifically; decode speed afterward is untouched (the image is now just KV). Text-only latency dashboards miss this entirely, so segment TTFT by has-image.
- KV math: 1,421 tokens ≈ 89 blocks at block_size 16, so one photo holds the cache footprint of ~45 short text exchanges. limit_mm_per_prompt is the admission-control guard against the user who attaches twelve screenshots.
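The reconciliation above is short enough to script. A sketch that uses only the figures quoted in this lab's capture; the helper names are hypothetical, and the 28-px effective patch is the approximation the text itself uses, not the processor's full resize logic:

```python
def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def predicted_image_tokens(width: int, height: int, effective_patch: int = 28) -> int:
    """Naive prediction: one token per effective patch, ignoring max_pixels."""
    return ceil_div(width, effective_patch) * ceil_div(height, effective_patch)


# Figures copied from the captured run above.
measured_full, measured_half = 1391, 348   # image tokens, full and half resolution
prompt_tokens = 1421
ttft_text_ms, ttft_image_ms = 41, 118

predicted = predicted_image_tokens(1280, 960)          # 46 * 35
overshoot = (predicted - measured_full) / measured_full

print("predicted image tokens:", predicted)
print("prediction overshoot:", round(overshoot, 3))
print("full/half token ratio:", round(measured_full / measured_half, 2))
print("KV blocks at block_size 16:", ceil_div(prompt_tokens, 16))
print("TTFT attributable to the image (ms):", ttft_image_ms - ttft_text_ms)
```

Swapping in your own image dimensions and logged token counts turns this into the "predict first, measure second" check the steps ask for.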
Hitchhiker's notes
- Prompt format is model-specific and unforgiving: Qwen's <|vision_start|><|image_pad|><|vision_end|>, LLaVA's <image>, Pixtral's [IMG]. The processor knows the convention; the OpenAI-compatible server's image_url content blocks hide it from clients (Phase 16). When raw-prompting a VLM, a wrong placeholder doesn't error: the model just never sees the image and hallucinates cheerfully. The test_mismatched_counts_assert validation from lab-01 is what stands between you and that silence.
- The processor cache: image preprocessing (resize, normalize, patchify) is CPU-side and non-trivial; vLLM caches processed inputs by content hash (mm_processor_cache_gb), so repeated images (multi-turn over one photo, retries) skip it. Distinct from lab-03's encoder cache (GPU, within-request in the lab's model; the exercises cover upstream's cross-request sharing by mm_hash): two caches, two lifetimes, and a prefix-cache third (Phase 2) whose block hashes fold in the image hash. Multimodal is a cache stack.
- Resolution policy is a quality/capacity dial with a cliff: too aggressive a max_pixels and OCR/chart tasks degrade sharply (small text needs pixels). Tune it against your actual task mix with Phase 6 lab-02's eval discipline; "the description still looked fine" is not a measurement.
- Video is this lab times frames: a 1 fps minute is ~60 images through the same pipeline (with temporal merging fighting the bill). The arithmetic you validated here is why video context windows are the current frontier of memory engineering.
Reflect
- Reconcile all three labs in one trace: the processor expanded (lab-01), the scheduler budgeted the encode with the chunk that entered the range (lab-03), the runner scattered embeddings over the placeholders, and decode proceeded over ordinary KV. Which of the four steps recur per step, and which per request? (Per-request: expansion + encode; per-step: scheduling + scatter of the relevant slice. The amortization is the design.)
- Your VLM fleet's p99 TTFT doubled after a client started sending 4K screenshots. Three knobs, in the order you'd reach for them? (max_pixels resize policy, quality-checked; encoder budget / disable_chunked_mm_input tuning for interference; limit_mm_per_prompt + input validation as the guardrail.)
- Why does the engine charge image tokens against max_model_len rather than tracking images separately? (Containment, lab-01's lesson: one currency keeps every Phase 1–3 invariant true for free. A separate ledger would re-litigate admission, blocks, and budgets per modality.)
References
- upstream/vllm/model_executor/models/qwen2_vl.py: the processor whose decisions you just audited (find the merge factor and the pixel limits).
- upstream/vllm/multimodal/: registry, processor cache, input plumbing.
- vLLM docs, Multimodal Inputs (the API surface and per-model conventions): https://docs.vllm.ai/en/latest/features/multimodal_inputs.html
- Qwen team, Qwen2-VL (2024), dynamic resolution and the 2×2 merge: https://arxiv.org/abs/2409.12191
- Labs 01 and 03: the two predictions this run validates.
Lab 13-03 — Encoder Scheduling: Chunked Prefill Meets the Vision Tower [CPU-OK]
Phase 3's chunked prefill rests on a freedom you probably never noticed it claiming:
a prompt can be sliced anywhere. Multimodal revokes it. The positions inside a
placeholder range (lab-01) get their embeddings from the vision encoder, and you
cannot encode half a picture — the ViT runs on the whole image or not at all. So when
a prefill chunk first reaches into an image's range, the engine faces a real
scheduling decision: run the encoder this step (it costs real compute, governed by a
per-step encoder budget), or truncate the chunk at the image's doorstep and
try again next step. You'll implement that decision — vLLM V1's
_try_schedule_encoder_inputs, distilled — including the piece that restores chunked
prefill's freedom: the encoder cache, which lets later chunks continue
mid-placeholder for free.
Contents
- Why this lab exists
- Background: three resources now, not two
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
This is the lab where two phases collide and you get to be the referee. Phase 3
taught that chunk boundaries are arbitrary (the clamp doesn't care what token it
stops at); lab-01 taught that some positions are image positions. The collision
produces every behavior in this lab's test suite, and each one is a production
symptom with a name: a VLM request whose prefill mysteriously stalls one token before
its image (encoder budget exhausted — test_unaffordable_image_truncates_the_chunk),
a step that schedules a 100-token encode for a chunk consuming only 40 image
positions (test_entering_an_image_schedules_its_encoder — the encoder is
all-or-nothing even when the decoder is incremental), a multi-image prompt that
prefills image A this step and stops dead before image B
(test_budget_splits_across_two_images).
The design lesson is the one the course keeps circling: vLLM did not forbid chunk boundaries inside images (which would couple the text scheduler to image geometry). It added a cache between the two engines — encode once, whole; consume incrementally, cached — so each side keeps its natural granularity. When two subsystems disagree about granularity, a cache at the boundary is usually the answer; this lab is the cleanest instance you'll ever implement.
Background: three resources now, not two
Phase 3's scheduler balanced the token budget and KV memory. Multimodal adds a third ledger, with its own units and its own cache:
- Encoder budget (per step, in encoder tokens): the vision tower is real compute outside the LM's token budget — a step that encodes a 576-token image while also prefilling text is doing two models' work. Capping encoder work per step protects ITL exactly the way the token budget does (Phase 3 lab-05's argument, new actor).
- Encoder cache (in encoder tokens of storage): outputs wait here between the encode and the chunks that consume them — and entries are freed once fully consumed. It's a third memory pool alongside KV blocks and LoRA slots (Phase 11 lab-04), with the same admission-pressure character.
The rule your plan_chunk implements, per placeholder the chunk would enter:
cached → free; affordable → schedule the whole encode now (even for partial
consumption); unaffordable → truncate the chunk to the placeholder's offset. And
the invariant the truncation preserves is Phase 3's invariant, extended: a position
is computed only when everything it needs exists — text positions need prior KV;
image positions need their encoding. Same race of counters, one more prerequisite.
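The three-way rule can be sketched as a single pass over the ranges. Everything about the interface here is an assumption made for illustration (the argument names, the ChunkPlan result, and keying the cache by image index); the docstring in starter.py defines the real contract, and upstream adds a cache-space check this sketch omits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaceholderRange:
    offset: int
    length: int


@dataclass
class ChunkPlan:
    num_new_tokens: int        # possibly truncated chunk length
    encodes: list              # indices of images to encode this step
    encoder_tokens_used: int   # encoder budget spent this step


def plan_chunk(num_computed, num_new_tokens, ranges, cached, encoder_budget):
    """ranges: ordered, disjoint PlaceholderRanges; cached: set of encoded image indices."""
    start, end = num_computed, num_computed + num_new_tokens
    encodes, used = [], 0
    for i, r in enumerate(ranges):
        if r.offset >= end:
            break                          # the chunk never reaches this image
        if r.offset + r.length <= start:
            continue                       # image fully consumed by earlier chunks
        if i in cached:
            continue                       # cached -> free
        if used + r.length <= encoder_budget:
            encodes.append(i)              # affordable -> encode the WHOLE image
            used += r.length
            continue
        end = max(start, r.offset)         # unaffordable -> stop at the doorstep
        break
    return ChunkPlan(end - start, encodes, used)


ranges = [PlaceholderRange(10, 100), PlaceholderRange(150, 100)]  # text/img/text/img/text
print(plan_chunk(0, 50, ranges, set(), 100))   # 50 tokens, encode image 0 (all 100)
print(plan_chunk(0, 50, ranges, set(), 99))    # truncated to 10 tokens, no encode
print(plan_chunk(50, 40, ranges, {0}, 0))      # mid-placeholder, cached: 40 tokens
```

Note that the budget is charged by the image's full length even when the chunk consumes only part of it; that asymmetry is the granularity mismatch the lab is about.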
Files
- starter.py: plan_chunk with the full rules in the docstring. Your work.
- solution.py: reference (~25 lines; the thinking is in the cases).
- test_lab.py: seven scenarios over a text/image/text/image/text prompt, from pure-text freedom to the zero-budget starvation edge.
Run
LAB_IMPL=starter pytest phase-13-multimodal-models/labs/lab-03-encoder-scheduling -q
pytest phase-13-multimodal-models/labs/lab-03-encoder-scheduling -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_pure_text_chunk_is_unconstrained | Phase 3 behavior survives where no image is touched |
| test_entering_an_image_schedules_its_encoder | All-or-nothing encoding: touching 40 of 100 image positions schedules the full 100-token encode. The granularity mismatch, faced |
| test_unaffordable_image_truncates_the_chunk | The doorstep rule: the chunk ends at the placeholder's offset and no encode is scheduled. The mysterious stall, explained |
| test_cached_image_costs_nothing / test_continuation_mid_placeholder_needs_no_new_encode | The cache restores chunk freedom: mid-placeholder continuation with zero encoder budget, the design's whole payoff |
| test_budget_splits_across_two_images | Budget as a per-step ledger across images: A scheduled, B deferred, chunk truncated between them; enough budget → one-step prefill, two encodes |
| test_progress_is_always_possible | The honest edge: image-at-position-0 with zero budget yields a 0-token chunk, and progress waits for a step with budget. (Per-step budgets reset, so this is a delay, not a deadlock; but a scheduler that forgot to give encoder budget would starve VLM requests forever. The test documents the dependency) |
Hitchhiker's notes
- Map to upstream: Scheduler._try_schedule_encoder_inputs in upstream/vllm/v1/core/sched/scheduler.py is your function with the encoder-cache space check added (the cache has finite storage; an encode can also be deferred because its output wouldn't fit), and encoder_budget flowing from scheduler_config. The encoder cache itself lives in vllm/v1/core/encoder_cache_manager.py: allocation, reference, and free-on-consumption; recognizably a tiny sibling of Phase 2's machinery.
- Why encode-whole-but-consume-partial is safe: the encoder is not autoregressive. Its output for an image is a pure function of pixels, independent of the text around it. That's what makes caching trivially correct (no chained hashes needed, in contrast to Phase 2 lab-05's ancestry chains) and what makes the all-or-nothing constraint tolerable: you never re-encode, ever, within a request.
- Where the embeddings actually flow: encoder output → encoder cache → the model runner gathers the scheduled slice of cached embeddings each step and scatters them over lab-01's placeholder positions (get_input_embeddings). The PlaceholderRange is the shared coordinate system of all three labs: compile-time (lab-01), schedule-time (this lab), runtime (the scatter).
- Capacity interaction worth knowing: encoder budget and token budget compete for the same wall-clock step. A VLM fleet tuned with Phase 3 lab-05's threshold analysis but ignoring encoder spikes still gets ITL spikes, this time from the vision tower. vLLM's disable_chunked_mm_input and encoder-budget knobs exist for exactly this tuning; you now know what they gate.
Going further
- Add the encoder-cache space dimension: plan_chunk also receives cache_free_tokens, and an encode needs both budget and space; consumed entries free space for later steps. You've now matched upstream's full predicate, and created the three-pool admission dance (KV + encoder cache + budget) that real VLM scheduling is.
- Simulate a step sequence: one request, the lab's two-image prompt, budget 150/step. Emit the chunk plan per step until prefill completes. The trace (where chunks stall, when encodes fire, when the cache carries) is Phase 1 lab-04's probe, multimodal edition.
- Model the ITL spike from an encode (Phase 3 lab-05's method): give encoder tokens a cost weight and plot a decode stream's step costs when a VLM prefill with a 576-token image lands beside it, with and without an encoder budget. The conclusion writes the config recommendation.
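For the step-sequence exercise, one way to produce the trace is to apply the lab's rule in a loop with a per-step budget that resets. This is a sketch under assumptions: ranges are plain (offset, length) tuples, "budget 150/step" is read as the encoder budget, the token budget is a separate hypothetical parameter, and the encoder cache is modeled as an unbounded set.

```python
def simulate_prefill(prompt_len, ranges, token_budget, encoder_budget):
    """Return one (start, end, encoded_image_indices) tuple per step."""
    computed, cached, trace = 0, set(), []
    while computed < prompt_len:
        start = computed
        end = min(prompt_len, start + token_budget)
        used, encoded = 0, []                     # encoder budget resets each step
        for i, (offset, length) in enumerate(ranges):
            if offset >= end:
                break                             # chunk does not reach this image
            if offset + length <= start or i in cached:
                continue                          # already consumed, or cached
            if used + length <= encoder_budget:
                used += length
                encoded.append(i)                 # encode the whole image now
            else:
                end = max(start, offset)          # truncate at the doorstep
                break
        if end == start:
            # Nothing was scheduled with a fresh budget, so no later step can
            # do better: the image is larger than the per-step encoder budget.
            raise RuntimeError("image exceeds the per-step encoder budget")
        cached.update(encoded)
        trace.append((start, end, encoded))
        computed = end
    return trace


two_images = [(10, 100), (150, 100)]              # text/image/text/image/text, 300 tokens
print(simulate_prefill(300, two_images, token_budget=300, encoder_budget=150))
# [(0, 150, [0]), (150, 300, [1])]  -> stalls at image B's doorstep, resumes next step
print(simulate_prefill(300, two_images, token_budget=64, encoder_budget=150))
# [(0, 64, [0]), (64, 128, []), (128, 192, [1]), (192, 256, []), (256, 300, [])]
```

In the second trace the steps with an empty encode list are the cache carrying mid-placeholder continuations, which is the payoff the tests name.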
References
- upstream/vllm/v1/core/sched/scheduler.py: _try_schedule_encoder_inputs, this lab in production (with the cache-space check).
- upstream/vllm/v1/core/encoder_cache_manager.py: the third memory pool.
- vLLM blog, vLLM V1 (the encoder-cache design rationale): https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
- Phase 3 labs 01/02/05: the chunking machinery this lab constrains; lab-01: the ranges it navigates by.
Phase 13 — Exercises: Multimodal Models
Warm-up (explain)
- In one breath: how does a decoder-only LLM "see" an image, and which engine components (Phases 0–11) need zero changes for it?
- Why does vLLM keep an encoder cache separate from the KV cache? Name two ways their currencies and lifetimes differ.
- Why can't the prefix cache key placeholder-covering blocks by token IDs alone?
Solution sketches
- Vision encoder → projector → embeddings overwrite placeholder positions in inputs_embeds; from layer 1 on it's text-indistinguishable. Unchanged: paged KV, attention backends, sampler, batching (everything past the embedding layer).
- KV cache: per-(request-prefix) layered K/V in fixed blocks, grows every step, freed at request end. Encoder cache: per-content (mm_hash) embeddings, measured in embedding slots not blocks, written once per image, shared across requests, LRU-evicted when unreferenced. Different key (position-prefix vs content), different unit, different lifecycle.
- Every image expands to the same repeated dummy token ID, so token-ID hashing would alias different pictures and serve one user's image context to another. The image's content hash (MultiModalHasher) must be folded into those block hashes.
Core (trace the code)
- _get_prompt_updates (llava.py:264): where does the expansion count come from, and why does Pixtral (:390) need PromptUpdateDetails.select_token_id / is_embed?
- Walk EncoderCacheManager.check_and_update_cache (:91) for a request whose image is cached but currently unreferenced. Which structures change, and what is the Phase-2 analogue of this transition?
- _try_schedule_encoder_inputs (scheduler.py:1096): an image's placeholder starts at token 5000, the request has computed 2000 tokens, and this step's chunk is 2048. What happens to the image, and to num_new_tokens?
- The scheduler's manager tracks hashes but the runner holds tensors. Trace how an eviction decided by the scheduler reaches the worker (get_freed_mm_hashes → scheduler.py:901 → runner).
Solution sketches
- From ProcessingInfo.get_num_image_tokens → the vision encoder's patch math (model config, image size). Pixtral interleaves [image_break_id] after each patch row, so not every position in the range receives an embedding; is_embed marks which do.
- The hash is popped from freeable (it was an eviction candidate), its embed count is subtracted from num_freeable_slots, and the request ID joins cached[mm_hash]. Phase-2 analogue: BlockPool.touch, which resurrects a cached block from the free queue by bumping ref_cnt 0→1.
- The chunk window [2000, 4048) doesn't reach offset 5000 → the image is not scheduled, and num_new_tokens is untouched (truncation only happens when the window overlaps an image that fails budget/cache checks). Next steps advance the window; the step whose window first overlaps 5000 must schedule (or truncate at) it.
- Manager appends evicted hashes to freed; each step get_freed_mm_hashes() drains the list into SchedulerOutput.free_encoder_mm_hashes; workers delete those keys from their encoder_cache dict. Scheduler owns accounting, workers own memory: the same split as KV blocks.
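The transitions in sketches 2 and 4 are easier to hold in mind as a toy. The class below is an illustration that reuses the structure names from the sketches (cached, freeable, freed, num_freeable_slots); it is not upstream's EncoderCacheManager, and the sizes dict, the release method, and the simplified signatures are assumptions.

```python
from collections import OrderedDict


class ToyEncoderCache:
    """Scheduler-side accounting only: hashes and slot counts, no tensors."""

    def __init__(self, capacity):
        self.num_free_slots = capacity       # slots holding nothing at all
        self.num_freeable_slots = capacity   # free + reclaimable by eviction
        self.cached = {}                     # mm_hash -> set of referencing request ids
        self.sizes = {}                      # mm_hash -> embeds (hypothetical helper)
        self.freeable = OrderedDict()        # unreferenced entries, oldest first
        self.freed = []                      # evicted hashes not yet sent to workers

    def check_and_update_cache(self, req_id, mm_hash):
        if mm_hash not in self.cached:
            return False
        if not self.cached[mm_hash]:         # cached but unreferenced: resurrect it
            self.num_freeable_slots -= self.freeable.pop(mm_hash)
        self.cached[mm_hash].add(req_id)
        return True

    def allocate(self, req_id, mm_hash, num_embeds):
        if num_embeds > self.num_freeable_slots:
            return False                     # would not fit even after eviction
        while num_embeds > self.num_free_slots:
            victim, size = self.freeable.popitem(last=False)
            del self.cached[victim], self.sizes[victim]
            self.freed.append(victim)        # workers must drop this tensor
            self.num_free_slots += size
        self.cached[mm_hash] = {req_id}
        self.sizes[mm_hash] = num_embeds
        self.num_free_slots -= num_embeds
        self.num_freeable_slots -= num_embeds
        return True

    def release(self, req_id, mm_hash):
        self.cached[mm_hash].discard(req_id)
        if not self.cached[mm_hash]:         # last reference gone: eviction candidate
            self.freeable[mm_hash] = self.sizes[mm_hash]
            self.num_freeable_slots += self.sizes[mm_hash]

    def get_freed_mm_hashes(self):
        freed, self.freed = self.freed, []   # drain once per step
        return freed


cache = ToyEncoderCache(capacity=200)
cache.allocate("r1", "imgA", 100)
cache.release("r1", "imgA")                      # imgA stays cached, now freeable
print(cache.check_and_update_cache("r2", "imgA"))  # True: resurrected, no re-encode
cache.release("r2", "imgA")
cache.allocate("r3", "imgB", 150)                # needs space: evicts imgA
print(cache.get_freed_mm_hashes())               # ['imgA']
```

Keeping an unreferenced entry in cached until space is actually needed is what lets a later request with the same image skip the encode.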
Build (your lab)
- In lab-01, compute: at block_size 16, how many KV blocks does one LLaVA image (576 tokens) cost, and what fraction of a 7B model's typical 8 GiB KV budget is 50 cached image-bearing prompts of 1000 tokens each?
- Extend your mini-build's cache with a stats() method (hits, misses, evictions, occupancy) and write a test that drives hit-rate from 0% to >80% with a zipfian image distribution. Why is zipfian the realistic assumption?
- In lab-03, construct a request where the encoder budget forces the image to wait one step but the cache-space check would have passed. Verify text progress continues. Then flip it: cache full, budget free. What's the user-visible difference?
Solution sketches
- 576/16 = 36 blocks for the image alone (38 with prompt rounding in the lab's setup). 50 × 1000 tokens ≈ 50 × 63 blocks ≈ 3150 blocks ≈ 25% of an 8 GiB budget at ~16 KiB/block-token-layer scale. Images eat KV budgets fast; exact numbers depend on the model, the point is the order of magnitude.
- Real traffic repeats content (logos, screenshots, retried requests, multi-turn with the same image) with a long tail of singletons, and zipf models that. Hits come from the head; the tail drives eviction churn.
- Both delay the image, not the text (truncate-at-doorstep). Budget-limited: resolves next step deterministically. Cache-limited: resolves only when another request frees embeddings, a potentially unbounded wait, which is why worst-case sizing at startup (compute_mm_encoder_budget) must guarantee a single max image always fits.
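A starting point for the stats() exercise, as a hedged sketch: LRUStatsCache and zipf_trace are hypothetical names, the cache counts entries rather than embedding slots, and the Zipf exponent and sizes are arbitrary choices. No hit-rate figure is claimed here; run it to see what your parameters give.

```python
import random
from collections import OrderedDict


class LRUStatsCache:
    """Entry-count LRU with the four counters the exercise asks for."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = self.misses = self.evictions = 0

    def access(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)      # refresh recency
            self.hits += 1
            return True
        self.misses += 1
        self.entries[key] = True               # a miss means encode and insert
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used
            self.evictions += 1
        return False

    def stats(self):
        total = self.hits + self.misses
        return {"hits": self.hits, "misses": self.misses,
                "evictions": self.evictions, "occupancy": len(self.entries),
                "hit_rate": self.hits / total if total else 0.0}


def zipf_trace(num_images, num_requests, s=1.2, seed=0):
    """Image ids drawn with probability proportional to 1 / rank**s."""
    rng = random.Random(seed)
    weights = [1.0 / rank ** s for rank in range(1, num_images + 1)]
    return rng.choices(range(num_images), weights=weights, k=num_requests)


trace = zipf_trace(num_images=500, num_requests=5000)
for capacity in (10, 100):
    cache = LRUStatsCache(capacity)
    for image_id in trace:
        cache.access(image_id)
    print(capacity, cache.stats())
```

Sweeping the capacity over one fixed trace shows how much of the hit rate comes from the popular head, and the eviction counter shows the churn the singleton tail causes.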
Design (staff-level)
- Your fleet serves Qwen2-VL and users upload phone photos (12 MP). TTFT p99 is 4× worse than the text-only fleet. Walk the path pixels take and name the three biggest contributors + a mitigation for each.
- Design multi-tenant fairness for the encoder cache: tenant A uploads thousands of unique images (0% reuse), tenant B reuses a product catalog (90% reuse). What goes wrong with global LRU and what do you change?
- Should encoder outputs be prefix-cacheable across engine restarts (disk/remote)? Cost out the trade: embedding sizes vs re-encode time, and the consistency hazard the cache key must absorb.
- Video: 1 fps × 60 s × ~hundreds of tokens/frame. Which Phase-13 mechanisms break first, and what does that tell you about why encode-disaggregation (Phase 15) exists?
Solution sketches
- (a) Preprocessing/resize on CPU in the API process: move to async/parallel workers, downscale at the edge (Qwen2-VL token count ∝ pixels; cap max_pixels). (b) The ViT forward itself rides the first overlapping step: encoder budget tuning, or batch encoder work, or disaggregate encode (Phase 15). (c) Token inflation: uncapped, 12 MP → on the order of 15,000 LLM tokens of prefill (12 M pixels at 28×28 pixels per token). Enforce resolution limits server-side; chunked prefill spreads it but TTFT still pays.
- Global LRU lets A's unique-image churn evict B's hot catalog (cache pollution by zero-reuse traffic). Fixes: per-tenant quotas/partitions, admission filter (only cache on second sight, with a tiny bloom/ghost list), or weighted eviction favoring entries with reuse history.
- An embedding tensor for a 576-token image at d=4096 fp16 ≈ 4.7 MB, often larger than the JPEG and comparable to re-encode time at high load; remote fetch can lose to recompute. Worth it only for very hot content. The key must absorb model identity + weights version + preprocessor config (resize policy!); upstream's reset() on weight updates is the single-process version of that hazard.
- Encoder cache capacity (a minute of video ≈ tens of thousands of embeddings) and the per-step encoder budget (one step can't afford a frame burst) break first; KV inflation follows. When encode work rivals decode work, sharing one GPU starves both. That's precisely the case for a separate encode fleet with its own scaling (Phase 15's encode disaggregation, EPD).
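The size figures in sketches 3 and 4 follow from one multiplication. A small worked example; the 576 tokens per video frame is an assumption standing in for the question's "hundreds of tokens/frame", and embedding_bytes is a hypothetical helper:

```python
def embedding_bytes(num_tokens: int, hidden_size: int, bytes_per_elem: int = 2) -> int:
    """Size of an encoder output: one hidden_size vector per token (fp16 = 2 bytes)."""
    return num_tokens * hidden_size * bytes_per_elem


one_image = embedding_bytes(576, 4096)
print(one_image, round(one_image / 1e6, 1))      # 4718592 4.7

# One minute of video at 1 fps, ASSUMING 576 tokens per frame with no merging.
frames, tokens_per_frame = 60, 576
video_tokens = frames * tokens_per_frame
video_bytes = embedding_bytes(video_tokens, 4096)
print(video_tokens, round(video_bytes / 1e6, 1))  # 34560 283.1
```

At these sizes a single minute of unmerged video is already tens of thousands of encoder-cache slots, which is why cache capacity and the per-step encoder budget are named as the first things to break.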
Self-grading
4–7 and 11–14 are interview-grade. Could you whiteboard the splice (processor → expand → encode → overwrite) and both caches' keys from memory? If not, re-read 01-deep-dive.md §3–§5.
Phase 13 — Interview Questions: Multimodal Models
Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)
Q1. How does a decoder-only LLM 'see' an image in vLLM?
Model answer
A vision encoder turns the image into embeddings that occupy the placeholder token positions the processor expanded into the prompt. The language model then attends over text and image positions uniformly. vLLM's input processor handles preprocessing and placeholder expansion; the scheduler budgets the encode, and the encoder cache keeps the output so it isn't recomputed each step.
Q2. What new bottlenecks do multimodal models add?
Model answer
The vision encoder is extra compute/memory before prefill; image tokens inflate sequence length (and KV); and the encoder-cache plus input-processing must be profiled and batched carefully, especially for dynamic-resolution models.
Going deeper
The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.
Phase 13 — Cheatsheet: Multimodal Models
- Image -> vision encoder -> image embeddings -> placeholder token positions -> normal LLM.
- EncoderCacheManager reuses image features; don't recompute per step.
- Image tokens inflate seq length and KV usage; profile input processing.
Key upstream files
- vllm/multimodal/
- vllm/multimodal/processing.py
- vllm/v1/core/encoder_cache_manager.py
- vllm/model_executor/models/llava.py
- vllm/model_executor/models/qwen2_vl.py
Full reference: 00-guide.md · 01-deep-dive.md
Phase 14 — Model Architectures (Adding a Model)
Contents
- Don't Panic
- Why this phase matters
- What you'll learn
- The map: where this lives in the real code
- Labs in this phase
- How to work this phase
- Where you are
Don't Panic
vLLM supports 200+ architectures because adding one is a well-trodden recipe: write an nn.Module that uses vLLM's parallel layers and attention, map the checkpoint weights onto it, register it, done. This phase teaches that recipe — the single most valuable maintainer skill — across decoder-only, MoE, hybrid/SSM, and pooling models.
Why this phase matters
'Add support for model X' is the most common high-value vLLM contribution. Doing it well — correct weight mapping, TP-sharded layers, the right attention, tests — is exactly what earns maintainer trust.
What you'll learn
- The model contract: `__init__(vllm_config)`, forward(input_ids, positions, ...) -> hidden_states
- vLLM building blocks: VocabParallelEmbedding, {Column,Row}ParallelLinear, Attention, RMSNorm
- Weight loading: load_weights + the name-remapping from HF checkpoints
- The model registry and how a name resolves to a class
- Families: decoder-only (Llama), MoE (Mixtral), hybrid/SSM (Mamba/Jamba), pooling/reward
- get_input_embeddings, tie_word_embeddings, LoRA/quant compatibility hooks
The map: where this lives in the real code
Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see
UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md)
walks through the important ones line by line.
- vllm/model_executor/models/llama.py: the reference decoder-only implementation.
- vllm/model_executor/models/registry.py: the architecture registry.
- vllm/model_executor/model_loader/: weight loading + checkpoint format handling.
- vllm/model_executor/models/mamba.py: a state-space (non-attention) model.
- vllm/model_executor/models/interfaces.py: mixins (SupportsLoRA, SupportsPP, SupportsMultiModal, ...).
- tests/models/: how model correctness is tested upstream (logit/greedy equality).
Labs in this phase
- lab-01-add-a-toy-architecture [CPU-OK]: implement a new architecture against the mini_vllm model contract, serve it through the unchanged engine, and prove with a tripwire proxy that the contract is exactly one method.
- lab-02-trace-weight-loading [GPU-OPT]: trace 5 tensors through llama.py's load_weights (name → mapping row → fused param → slice), with live shape verification. Captured mapping included.
- lab-03-weight-mapping [CPU-OK]: implement the translation, covering q/k/v → qkv_proj renaming, GQA-aware slices, the loud shape-assert, and the fusion-legality theorem as a 1e-12 test.
See labs/README.md for the recommended order (01 → 03 → 02) and how to run them.
How to work this phase
- Read this guide for intuition.
- Read 01-deep-dive.md with the upstream/ files open.
- Do 02-mini-build.md: build the mini_vllm piece yourself.
- Run the labs, then attempt EXERCISES.md.
- Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.
Where you are
This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully-worked, line-by-line treatment (with starter/solution/test code in every lab) follows the gold standard set by the flagship phases, Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.
Phase 14 — Deep Dive: Model Architectures (Adding a Model)
Read this with upstream/ open. Every path is relative to upstream/ at the pinned commit v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number ever drifts, search for the named symbol instead.
Guided reading list
Work through these in order. This is a scaffold: the reading targets and the questions are real; fill in the line-by-line annotations as you go (this is exactly the muscle a maintainer uses — reading unfamiliar code and extracting its contract).
- vllm/model_executor/models/llama.py: the reference decoder-only implementation.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- vllm/model_executor/models/registry.py: the architecture registry.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- vllm/model_executor/model_loader/: weight loading + checkpoint format handling.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- vllm/model_executor/models/mamba.py: a state-space (non-attention) model.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- vllm/model_executor/models/interfaces.py: mixins (SupportsLoRA, SupportsPP, SupportsMultiModal, ...).
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- tests/models/: how model correctness is tested upstream (logit/greedy equality).
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
Questions to answer as you read
- The model contract: `__init__(vllm_config)`, forward(input_ids, positions, ...) -> hidden_states?
- vLLM building blocks: VocabParallelEmbedding, {Column,Row}ParallelLinear, Attention, RMSNorm?
- Weight loading: load_weights + the name-remapping from HF checkpoints?
- The model registry and how a name resolves to a class?
- Families: decoder-only (Llama), MoE (Mixtral), hybrid/SSM (Mamba/Jamba), pooling/reward?
- get_input_embeddings, tie_word_embeddings, LoRA/quant compatibility hooks?
Cross-references
- Intuition: 00-guide.md
- Build it yourself: 02-mini-build.md
- The gold-standard depth to emulate: Phase 02 deep-dive.
Phase 14 — Mini-Build: extend mini_vllm
Your task
Define a 'model contract' in mini_vllm and implement two toy architectures behind it (a decoder-only and a tiny MoE), swappable by config — mirroring how real models plug into one runner.
Why build it (and not just read it)
Reading the real kernel/feature tells you what production does. Re-implementing a tiny version tells you why every decision was made — which is the understanding that survives into an interview or a 2 a.m. incident. Keep it small; keep it tested.
Method
- Look at the matching real code from 01-deep-dive.md.
- Add your module under mini_vllm/ (or extend an existing one).
- Write a test_*.py next to it that pins the behavior you care about.
- Run pytest mini_vllm -q and keep it green.
Definition of done
- Your component runs on CPU with no extra dependencies (numpy ok).
- A test demonstrates the property this phase is about (not just "it runs").
- You can explain, out loud, how your toy maps to the real implementation and where it intentionally simplifies.
The flagship phases ship complete mini_vllm modules + tests (mini_vllm/block_pool.py, mini_vllm/scheduler.py); use them as your reference for structure and test style.
Phase 14 Labs — Model Architectures (Adding a Model)
Three labs on the most common vLLM contribution: adding a model. The arc: honor the
contract — implement a new architecture and prove the engine never looks past one
method (lab-01), translate the checkpoint — HF names to fused vLLM params, with
the GQA slice arithmetic and the fusion-legality proof (lab-03), then trace the
real thing — five tensors through llama.py's load_weights, shapes reconciled
live (lab-02).
Recommended order: 01 → 03 → 02. CPU labs follow the standard contract —
starter.py (your work), solution.py (reference), test_lab.py (the spec); default
runs the solution, LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-14-model-architectures/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-14-model-architectures/labs/lab-01-add-a-toy-architecture -q
Contents
- lab-01-add-a-toy-architecture [CPU-OK]
- lab-02-trace-weight-loading [GPU-OPT]
- lab-03-weight-mapping [CPU-OK]
- What you can do after this phase
Labs
lab-01-add-a-toy-architecture [CPU-OK]
Implement a genuinely new architecture (a bigram model that ignores positions)
against mini_vllm's one-method contract and serve it through the unchanged engine —
with the capstone tripwire test: a proxy that fails on any attribute access beyond
forward survives a full generate(), measuring the contract's width. Plus the
deep one: Phase 3's chunked-prefill invariant verified for the new model — engine
invariants are model-independent. Skills: the narrow waist; over-supplying
contracts; tripwire proxies as executable architecture docs; the layer library as
where features meet models.
lab-02-trace-weight-loading [GPU-OPT]
Five tensors traced through the real load_weights: safetensors name → mapping row →
fused parameter → slice, with live shape verification ((6144, 4096) qkv =
(32 q-heads + 2×8 kv-heads) × 128, halving under TP=2) and checkpoint forensics from
shapes alone. Captured mapping table included. Skills: reading load_weights as a
peer; --load-format dummy; diagnosing loads-but-garbage; shapes as architecture
fingerprints.
lab-03-weight-mapping [CPU-OK]
The translation implemented: q_proj/k_proj/v_proj → qkv_proj name rewriting with
shard tags, GQA-aware slice arithmetic (k/v narrower than q — the off-by-one
habitat), the loud shape-assert that catches MHA-checkpoint-meets-GQA-config at load
time, and the legality theorem as a test: fused output slices ≡ separate projections
to 1e-12. Skills: stacked_params_mapping; fusion is layout, not math; load-time
asserts beat serve-time hallucinations.
What you can do after this phase
Walk the full integrator's path: implement a model against the contract using the
layer library (getting TP/quant/LoRA for free), write its mapping table, load a real
checkpoint, and verify with the discipline these labs drilled (touched-exactly-once,
loud shape asserts, invariant tests against the new model). Read any file in
vllm/model_executor/models/ as a variation on machinery you've built — and
recognize "KeyError loading model X" issues as missing mapping rows you can fix.
That's the on-ramp to Phase 19's real upstream PR.
Lab 14-01 — Add an Architecture Without Touching the Engine [CPU-OK]
vLLM serves hundreds of architectures — Llama, Mixtral, DeepSeek, Mamba hybrids,
embedding models — through one engine, and the trick is a discipline, not a
miracle: models implement a narrow contract, and the engine calls nothing else.
This lab makes you live that discipline in miniature. mini_vllm's contract is one
method — forward(last_tokens, positions) → logits — and you'll implement a genuinely
new architecture against it (a bigram model: logits from a per-token table, positions
ignored), swap it into a running engine, and prove every engine feature works
unchanged. The capstone test is a tripwire proxy that fails on any attribute access
beyond forward — and the engine passes a full generate() through it, proving the
contract is exactly one method, not asserting it.
Contents
- Why this lab exists
- Background: the narrow waist
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
"Add support for model X" is the single most common vLLM contribution — the on-ramp through which most maintainers arrived — and the task is approachable precisely because of the contract this lab teaches. A model integrator never touches the scheduler, the KV manager, or the sampler; they write a model class that honors the interface and a weight loader that fills it (labs 02/03). Knowing where the boundary sits — what you must provide, what you may ignore, what you must never reach around — is the difference between a weekend PR and a month of confusion.
The lab's sneaky-deep test is test_engine_invariants_hold_for_the_new_model:
Phase 3's chunked-equals-unchunked is an engine property, and it must hold for any
contract-honoring model. Run it against your new architecture and you're doing what
vLLM's CI does across its whole model zoo — verifying that engine invariants and
model implementations are independent axes. When an invariant breaks only for one
model, the leak is in whoever crossed the boundary, and this test design localizes
the suspect instantly.
Background: the narrow waist
The contract's anatomy, and why each piece is what it is:
- forward(last_tokens, positions) → (batch, vocab) logits: the engine guarantees row i of the output corresponds to entry i of the inputs (Phase 1 lab-03's positional contract), and that only requests passing needs_sample appear (the catch-up rule). The model guarantees deterministic logits given its inputs. Neither knows anything else about the other.
- Positions are offered, not mandated: your BigramModel ignores them entirely and the engine cannot tell. That's the proof that the contract over-supplies on purpose: it carries what the most demanding architecture needs (positional information for RoPE-style models), and simpler models discard the surplus. Real vLLM's contract is wider for the same reason (KV caches, attention metadata, intermediate states for EAGLE in Phase 8), and most models use a subset.
- The registry is the production version of your install_model: config's architectures field → ModelRegistry lookup → class constructed with the vLLM config. Swapping a model is data, not code, which is also how out-of-tree models plug in (Phase 17's plugin machinery registers into the same table).
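A minimal sketch of what a bigram model behind the one-method contract could look like. The constructor arguments, the seeded random table, and the use of numpy are assumptions for illustration; the lab's starter.py may define BigramModel differently.

```python
import numpy as np


class BigramModel:
    """Toy architecture: next-token logits depend only on the last token."""

    def __init__(self, vocab_size: int, seed: int = 0) -> None:
        rng = np.random.default_rng(seed)
        # One row of logits per possible previous token (hypothetical layout).
        self.table = rng.standard_normal((vocab_size, vocab_size))

    def forward(self, last_tokens, positions):
        # positions is accepted to honor the contract, then deliberately ignored.
        del positions
        return self.table[np.asarray(last_tokens, dtype=np.int64)]


model = BigramModel(vocab_size=16, seed=7)
a = model.forward([3, 5, 3], positions=[0, 1, 2])
b = model.forward([3, 5, 3], positions=[40, 41, 42])
assert a.shape == (3, 16)           # row i of the output answers entry i of the input
assert np.array_equal(a, b)         # positions cannot change the answer
assert np.array_equal(a[0], a[2])   # same last token, same logits
```

The three assertions restate the contract from the list above: row order, determinism, and the freedom to discard positions.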
Files
- starter.py: BigramModel (the new brain) and install_model (the swap). Your work.
- solution.py: reference.
- test_lab.py: serving works, the brain is genuinely different, determinism, the engine-invariant check, and the contract tripwire.
Run
LAB_IMPL=starter pytest phase-14-model-architectures/labs/lab-01-add-a-toy-architecture -q
pytest phase-14-model-architectures/labs/lab-01-add-a-toy-architecture -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_new_architecture_serves_through_the_unchanged_engine | The integration: batching, scheduling, sampling, stopping. All engine code, all untouched, all working |
| test_it_really_is_a_different_model | The outputs differ from ToyModel's: you changed the brain, not the plumbing (a lab that accidentally re-implemented the old model would pass everything else) |
| test_determinism_across_engine_instances | The new architecture honors the course's testability convention: logits as a pure function of (seed, inputs) |
| test_engine_invariants_hold_for_the_new_model | Chunked ≡ unchunked for the new model: engine invariants are model-independent, verified rather than hoped |
| test_engine_touches_only_the_contract | The tripwire proxy: a full generate() with every attribute except forward booby-trapped. The contract's width, measured: one method |
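The tripwire idea in the last row can be sketched in a few lines. This is an illustrative proxy, not the lab's test code: Tripwire and EchoModel are hypothetical names, and a real test would hand the wrapped model to the engine and run generate().

```python
class Tripwire:
    """Wraps a model and fails on any attribute access other than forward."""

    def __init__(self, model):
        # Store the wrapped model; every later read goes through the guard below.
        object.__setattr__(self, "_model", model)

    def __getattribute__(self, name):
        if name == "forward":
            return object.__getattribute__(self, "_model").forward
        raise AssertionError(f"engine reached past the contract: .{name}")


class EchoModel:
    vocab_size = 4  # an attribute a careless engine might be tempted to read

    def forward(self, last_tokens, positions):
        return [[float(t == v) for v in range(self.vocab_size)] for t in last_tokens]


wrapped = Tripwire(EchoModel())
print(wrapped.forward([2], [0]))    # [[0.0, 0.0, 1.0, 0.0]]
try:
    wrapped.vocab_size
except AssertionError as err:
    print(err)                      # engine reached past the contract: .vocab_size
```

Raising AssertionError rather than AttributeError is a deliberate choice in this sketch: code that probes with hasattr or getattr-with-default would otherwise swallow the failure silently.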
Hitchhiker's notes
- The real contract, for comparison: a vLLM model implements forward(input_ids, positions, …) → hidden_states, compute_logits, and load_weights, composed from the layer library: VocabParallelEmbedding, QKVParallelLinear, RowParallelLinear (Phase 10 lab-01's classes!), Attention (which hides the entire Phase 2/4 machinery behind one call), and RMSNorm. Building from these gives you TP, quantization, LoRA, and paged attention for free. The layer library is where the engine's features and the model's architecture meet, and using bare nn.Linear instead is the classic new-contributor mistake (works single-GPU, breaks under TP, bypasses quantization).
- The tripwire-proxy test pattern generalizes: any time a design claims "X only uses interface Y," wrap Y's provider in a proxy that fails on everything else and run the full workload. Interfaces rot by accretion (someone reaches around for "just one attribute"), and a tripwire in CI is the only durable fence. (Compare Phase 9 lab-04's broken-control pattern: both are executable architecture documentation.)
- Why a bigram model, of all things? Because ignoring positions is the point: the most instructive new architecture is one that uses less than the contract offers, proving the contract doesn't secretly require everything it carries. Hybrid/SSM models (Mamba-style) are the production version of this lesson: they need different state than KV caches, and vLLM's contract grew (state managers, hybrid allocators) precisely where their needs exceeded it.
- mini_vllm's engine constructs its own model (no registry); install_model papers over that with assignment. The gap is deliberate lab surface: notice how a registry (construct-from-config) beats post-hoc swapping the moment configs, checkpoints, and TP enter. The real registry lives at upstream/vllm/model_executor/models/registry.py.
Going further
- Add a second architecture, a RepeaterModel that strongly biases toward the last token (logits = one-hot-ish on last_token), and watch greedy decoding produce aaaa...: a two-line model that generates the repetition pathology Phase 9 lab-01's penalties exist to fight. Then apply the penalty and watch it break the loop. Three labs, one demo.
- Build a registry = {"bigram": BigramModel, "toy": ToyModel} and an engine_from_config({"architecture": "bigram", "seed": 7}) constructor: the real registration pattern, 10 lines, and now your lab-02/03 weight knowledge has a place to plug in.
- Write the negative test: a model whose forward returns the wrong batch size, and assert the engine fails loudly rather than mis-assigning tokens (it fails in the sampler's indexing; would you ship a clearer assert upstream?).
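The registry exercise above could start from something like the following. It is a sketch under assumptions: the two model classes are empty stand-ins, and model_from_config is a hypothetical helper that stops at constructing the model (the exercise's engine_from_config would go on to build an engine around it).

```python
from typing import Callable, Dict


class ToyModel:
    def __init__(self, seed: int) -> None:
        self.seed = seed


class BigramModel:
    def __init__(self, seed: int) -> None:
        self.seed = seed


# Architecture name -> constructor. Adding a model is adding a row.
REGISTRY: Dict[str, Callable[..., object]] = {
    "toy": ToyModel,
    "bigram": BigramModel,
}


def model_from_config(config: dict):
    """Resolve config["architecture"] to a class and construct it."""
    arch = config["architecture"]
    try:
        cls = REGISTRY[arch]
    except KeyError:
        known = ", ".join(sorted(REGISTRY))
        raise ValueError(f"unknown architecture {arch!r}; registered: {known}") from None
    return cls(seed=config.get("seed", 0))


model = model_from_config({"architecture": "bigram", "seed": 7})
print(type(model).__name__, model.seed)   # BigramModel 7
```

Listing the registered names in the error message mirrors the lab's loud-failure habit: an unknown architecture should say what the table does contain.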
References
- upstream/vllm/model_executor/models/registry.py: the architecture → class table.
- upstream/vllm/model_executor/models/llama.py: the canonical model implementation; read it as "the contract, honored" (and lab-02/03's subject).
- vLLM docs, Adding a New Model (the official integrator's guide this lab is the warm-up for): https://docs.vllm.ai/en/latest/contributing/model/
- Phase 1 lab-03: the row-order contract this lab's forward inherits; Phase 10 lab-01: the layer library that makes real models TP-able by construction.
Lab 14-02 — Trace Weight Loading in the Real llama.py [GPU-OPT]
Lab-03 had you implement the translation; this lab has you watch the production
version run — and read it as a peer. You'll trace five tensors from a real Llama
checkpoint through load_weights: the safetensors name on disk, the mapping row that
claims it, the vLLM parameter and shard it lands in, and (under TP) which rows of
that shard each rank takes. The deliverable is the filled-in mapping table below —
five rows of checkpoint surgery, verified against a live load.
No GPU? Don't panic. Loading happens on CPU before anything touches CUDA — you can trace most of this with
device="cpu"-ish settings or just the captured table below plus the source. The reading is the lab.
Contents
- Why this lab exists
- Requirements
- Steps
- Captured mapping (Llama-3-8B, vLLM 0.22.1)
- Reading the trace
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
load_weights is where most model-integration PRs live or die, and it's also the
single most readable "real" function in the model zoo once you have lab-03's
vocabulary — a loop, a mapping table, and weight_loader callbacks. Tracing five
tensors end-to-end converts the function from code-you-scroll-past into
code-you-could-have-written, and it arms you for the two production moments that
need this knowledge: a new checkpoint that won't load (which mapping row is
missing?), and a loaded model that generates garbage (which shard landed wrong?).
Requirements
uv pip install -e ".[vllm]"
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct # or any Llama-family model
Steps
- List the checkpoint's names: a safetensors file begins with a JSON header that names every tensor, so you can enumerate names without loading any tensor data:
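from safetensors import safe_open
import glob

names = []
# Replace <model_dir> with the directory the checkpoint was downloaded to.
for f in sorted(glob.glob("<model_dir>/*.safetensors")):
    # Listing keys reads only the file header; no tensor data is loaded.
    with safe_open(f, framework="np") as sf:
        names += list(sf.keys())
print(len(names))                                # ~291 for an 8B
print([n for n in names if ".layers.0." in n])   # one layer's worth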
- Open upstream/vllm/model_executor/models/llama.py, find stacked_params_mapping and the load_weights loop. For each of the five tensor names in the table below, walk the loop by hand: which mapping row matches? what name does it become? which shard_id rides along?
- Verify live: load the model with LLM(model=..., enforce_eager=True) and afterwards inspect a parameter's shape: model.model.layers[0].self_attn.qkv_proj.weight.shape → (6144, 4096). Then reconcile: 32 q-heads × 128 + 2 × (8 kv-heads × 128) = 4096 + 2048 = 6144. Lab-03's qkv_slices, on a real tensor.
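Enumerating names is cheap because of the file layout: a safetensors file starts with an 8-byte little-endian length, followed by that many bytes of JSON describing every tensor, followed by the raw data. A stdlib-only sketch of reading just the header (tensor_names is an illustrative helper; the tiny file written here is fabricated for the demo and is not a real checkpoint):

```python
import json
import os
import struct
import tempfile


def tensor_names(path: str) -> list:
    """List tensor names from a .safetensors header without reading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))   # little-endian u64
        header = json.loads(f.read(header_len))
    return [k for k in header if k != "__metadata__"]


# Build a one-tensor file by hand: length prefix, JSON header, then 4 data bytes.
header = {
    "__metadata__": {"format": "pt"},
    "model.norm.weight": {"dtype": "F16", "shape": [2], "data_offsets": [0, 4]},
}
blob = json.dumps(header).encode("utf-8")
path = os.path.join(tempfile.mkdtemp(), "tiny.safetensors")
with open(path, "wb") as f:
    f.write(struct.pack("<Q", len(blob)) + blob + b"\x00" * 4)

print(tensor_names(path))   # ['model.norm.weight']
```

For real work, prefer safe_open as in step 1; this version only shows why the listing never touches the weights.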
Captured mapping (Llama-3-8B, vLLM 0.22.1)
| checkpoint tensor (HF) | vLLM parameter | shard | rows in fused |
|---|---|---|---|
| model.layers.0.self_attn.q_proj.weight | ...self_attn.qkv_proj.weight | q | 0:4096 |
| model.layers.0.self_attn.k_proj.weight | ...self_attn.qkv_proj.weight | k | 4096:5120 |
| model.layers.0.self_attn.v_proj.weight | ...self_attn.qkv_proj.weight | v | 5120:6144 |
| model.layers.0.mlp.gate_proj.weight | ...mlp.gate_up_proj.weight | 0 | 0:14336 |
| model.layers.0.input_layernorm.weight | (itself) | none | unfused |
# live verification:
qkv_proj.weight.shape = (6144, 4096) # 4096 q + 1024 k + 1024 v (GQA: 8 kv heads)
gate_up_proj.weight.shape = (28672, 4096) # 14336 gate + 14336 up
# under tensor_parallel_size=2: (3072, 4096) per rank — heads split, slices halve
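The shapes in the captured block can be derived from the config rather than memorized. A small sketch, assuming heads and the MLP width divide evenly across ranks (it does not model KV-head replication, and fused_shapes is an illustrative name, not an upstream function):

```python
def fused_shapes(hidden, n_heads, n_kv_heads, head_dim, intermediate, tp_size=1):
    """Per-rank (rows, cols) of the fused qkv and gate_up weights."""
    assert n_heads % tp_size == 0 and n_kv_heads % tp_size == 0, "heads must divide evenly"
    assert intermediate % tp_size == 0, "MLP width must divide evenly"
    q_rows = (n_heads // tp_size) * head_dim
    kv_rows = (n_kv_heads // tp_size) * head_dim
    qkv = (q_rows + 2 * kv_rows, hidden)
    gate_up = (2 * (intermediate // tp_size), hidden)
    return qkv, gate_up


llama3_8b = dict(hidden=4096, n_heads=32, n_kv_heads=8, head_dim=128, intermediate=14336)
print(fused_shapes(**llama3_8b))              # ((6144, 4096), (28672, 4096))
print(fused_shapes(**llama3_8b, tp_size=2))   # ((3072, 4096), (14336, 4096))
```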
Reading the trace
- The k/v rows are 4× narrower than q's: GQA's 8 KV heads vs 32 query heads, lab-03's slice asymmetry on a real 8B. If you ever see (12288, 4096) here instead, you're looking at an MHA model; the fused shape is an architecture fingerprint.
- gate_up_proj at 28,672 rows: the MLP's two halves stacked; down_proj stays unfused (it has no sibling to stack with). The mapping table's five rows cover five of the nine tensors in each Llama layer (q, k, v, gate, up); everything else passes through.
- Under TP=2, every fused shape halves along rows. This is Phase 10 lab-01's column-parallel sharding composed with lab-03's stacking: each rank loads its heads' rows of each shard directly from disk. Two slicings, one read, no redistribution: the loading-is-part-of-sharding point from Phase 10, visible in a tensor shape.
- enforce_eager=True keeps the trace clean (no capture pass cluttering logs; Phase 5 lab-04's test-suite setting, used for exactly its intended purpose).
Hitchhiker's notes
- --load-format dummy skips real weights (random init): the tool for testing mapping and shapes without downloading 16 GB, and how CI exercises loaders cheaply. Pair with a tiny --max-model-len and loader bugs surface in seconds.
- Watch for the unloaded-parameter check: upstream tracks which params got weights and errors on leftovers. It is the missed-tensor guard from lab-03's going-further, in production. When adding a model, that error message is your todo list.
- Sharded checkpoints (multiple .safetensors files) interleave layers across files arbitrarily; the loader is order-independent by design (each tensor knows its name; the mapping doesn't care about file layout). Resist any urge to assume file order means anything.
- Quantized variants add scale/zero tensors with their own names (...qweight, ...scales) routed by the quant method's loader (Phase 6): same loop, more rows. Tracing one AWQ tensor through is the natural sequel to this lab.
Reflect
- From the shapes alone, (6144, 4096) qkv and (28672, 4096) gate_up, reconstruct the model card: hidden size, head count, KV heads, MLP expansion. (4096 hidden; 32 heads × 128; 8 KV heads; 3.5× ffn ratio.) Checkpoint forensics is a real skill; you just did it.
- A teammate's new model PR loads but outputs garbage; loading reported no errors. Using labs 01–03: what are your first three checks? (Mapping rows for every fused family, since a missed one means init-valued weights; slice boundaries vs the config's head counts; and the q/k/v order in the fused buffer vs what the attention layer expects.)
- Why does vLLM fuse at load time rather than shipping a conversion script? (Checkpoints stay interchange-format; the fusion choice is the runtime's, can change between versions, and composes with TP/quantization decided at startup. This is lab-03's interface-vs-implementation point, operationalized.)
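The first reflection can be written down as code. The inversion needs two assumptions that the shapes alone do not supply, both labeled in the sketch: the q projection is square (n_heads × head_dim equals hidden size, as in Llama), and head_dim is known to be 128. read_model_card is a hypothetical helper name.

```python
def read_model_card(qkv_shape, gate_up_shape, head_dim=128):
    """Recover config fields from fused weight shapes (unsharded, TP=1)."""
    qkv_rows, hidden = qkv_shape
    q_rows = hidden                        # assumption: n_heads * head_dim == hidden
    kv_rows, rem = divmod(qkv_rows - q_rows, 2)
    assert rem == 0 and kv_rows % head_dim == 0, "shapes do not fit the assumptions"
    intermediate = gate_up_shape[0] // 2   # gate and up halves are equal-sized
    return {
        "hidden_size": hidden,
        "num_heads": q_rows // head_dim,
        "num_kv_heads": kv_rows // head_dim,
        "intermediate_size": intermediate,
        "ffn_ratio": intermediate / hidden,
    }


card = read_model_card((6144, 4096), (28672, 4096))
print(card["num_heads"], card["num_kv_heads"], card["ffn_ratio"])   # 32 8 3.5
```

Feeding it the MHA fingerprint (12288, 4096) returns 32 KV heads, which is the quick way to tell the two layouts apart.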
References
- upstream/vllm/model_executor/models/llama.py (load_weights and stacked_params_mapping): the function under trace.
- upstream/vllm/model_executor/layers/linear.py: the weight_loader callbacks that place each shard (lab-03's load_stacked, with TP).
- upstream/vllm/model_executor/model_loader/: the loader framework (formats, dummy loading, sharded files).
- Lab-03: the implementation this trace recognizes; lab-01: the contract the loaded model serves through.
Lab 14-03 — Checkpoint Surgery: HF Names → vLLM Params, Shards → Fused [CPU-OK]
A HuggingFace checkpoint and a vLLM model disagree about what a layer is. The
checkpoint stores q_proj, k_proj, v_proj as three tensors; vLLM runs one fused
qkv_proj (one big GEMM beats three small ones — Phase 7 lab-03's tiling economics,
applied to layer design). Same for gate_proj+up_proj → gate_up_proj. Loading
weights is therefore translation: rename every checkpoint tensor to its vLLM
parameter, and copy shard tensors into the right slice of the fused buffer. This
lab has you build the translation table (llama.py's stacked_params_mapping, in
spirit), the GQA-aware slice arithmetic, and the shape guard that turns
wrong-checkpoint disasters into loud load-time errors — then prove the fusion legal
with the test that matters: the fused matmul's output slices equal the three separate
projections, exactly.
Contents
- Why this lab exists
- Background: why fused, and where the slices fall
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
When a newly-added model loads and generates fluent nonsense, the bug is almost never
in the forward pass — it's here, in the mapping: a shard landed in the wrong slice,
a name pattern missed a tensor (silently left at init values), or an MHA checkpoint
met a GQA config. These failures are maddening precisely because nothing crashes:
the shapes coincidentally fit, the matmuls run, the output is garbage. The two
defenses you'll build are the professional's toolkit: exact slice arithmetic
(derived, not pattern-matched) and assert-on-shape at load time
(test_shape_mismatch_is_loud — the wrong-checkpoint case caught at the door, not at
the demo).
This lab is also lab-02's prerequisite done right: lab-02 has you read
load_weights in llama.py; this lab has you implement its core first, so the
reading is recognition. The pairing (build small, then read big) is the course's
method; this is its purest instance — the production function is your three
functions plus a loop over the checkpoint.
Background: why fused, and where the slices fall
Why fuse at all: three matmuls over the same input x with weights Wq, Wk, Wv
equal one matmul with the row-stacked weight, x @ [Wq; Wk; Wv]ᵀ, and the single
GEMM launches once, tiles better (Phase 7 lab-03: bigger M×N per weight-read), and
reads x from memory once instead of three times. The legality is two lines of block
matrix algebra, and test_fused_matmul_equals_separate_projections states it as an
executable fact. (This stacking along the output dimension composes with tensor
parallelism too: QKVParallelLinear is Phase 10 lab-01's column-parallel class with
the stacking built in, and the shard boundaries respect head boundaries on every rank.)
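The legality claim is easy to check numerically. A sketch with toy sizes in float64, using numpy (the sizes and seed are arbitrary choices for the demo, not the lab's test fixtures):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, q_rows, kv_rows = 64, 64, 16           # toy GQA-like sizes
x = rng.standard_normal((5, hidden))
Wq = rng.standard_normal((q_rows, hidden))
Wk = rng.standard_normal((kv_rows, hidden))
Wv = rng.standard_normal((kv_rows, hidden))

# Row-stack the three weights: shape (q_rows + 2 * kv_rows, hidden).
W_fused = np.concatenate([Wq, Wk, Wv], axis=0)
y = x @ W_fused.T                               # one GEMM instead of three

# Slicing the fused output along its columns recovers each projection.
q, k, v = np.split(y, [q_rows, q_rows + kv_rows], axis=1)
assert np.allclose(q, x @ Wq.T, rtol=0, atol=1e-12)
assert np.allclose(k, x @ Wk.T, rtol=0, atol=1e-12)
assert np.allclose(v, x @ Wv.T, rtol=0, atol=1e-12)
```

Each output column of the fused product is the dot product of x with one row of the stacked weight, so stacking rows only changes where a result lands, never its value.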
Where the slices fall — the GQA wrinkle: with nh query heads, nkv KV heads,
head_dim hd, the fused weight has (nh + 2·nkv)·hd rows: q owns the first nh·hd,
k the next nkv·hd, v the last nkv·hd. Under GQA (Phase 0 lab-02's 4× KV saving)
nkv < nh, so k and v slices are narrower than q's — the asymmetry
test_qkv_slices_account_for_gqa pins, and exactly the place hand-written loaders
go wrong when their author last looked at an MHA model.
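The slice arithmetic from this paragraph, written out. This is one possible shape for qkv_slices; the signature and the dict-of-slices return type are assumptions, and the lab's starter may differ.

```python
def qkv_slices(n_heads: int, n_kv_heads: int, head_dim: int) -> dict:
    """Row ranges of q, k, v inside the fused qkv weight."""
    q_end = n_heads * head_dim
    k_end = q_end + n_kv_heads * head_dim
    v_end = k_end + n_kv_heads * head_dim
    return {"q": slice(0, q_end), "k": slice(q_end, k_end), "v": slice(k_end, v_end)}


s = qkv_slices(n_heads=32, n_kv_heads=8, head_dim=128)
print(s["q"], s["k"], s["v"])
# slice(0, 4096, None) slice(4096, 5120, None) slice(5120, 6144, None)

# The slices tile the fused rows: no gaps, no overlap, full coverage.
assert s["q"].stop == s["k"].start and s["k"].stop == s["v"].start
assert s["v"].stop == (32 + 2 * 8) * 128
```

Deriving each boundary from the previous one, instead of hard-coding offsets, is what keeps the MHA habit (three equal thirds) from leaking into a GQA loader.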
The name mapping: a substring rewrite (q_proj → qkv_proj) plus a shard tag
telling the loader which slice. Tensors outside the table (norms, embeddings,
down_proj — anything unfused) map to themselves. Upstream's
stacked_params_mapping is literally this list of triples; your STACKED_PARAMS
copies its shape.
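A sketch of the name rewrite. The table below is modeled on the triples described above (fused fragment, checkpoint fragment, shard id); treat the exact entries and the map_weight_name signature as assumptions and check them against llama.py at the pinned commit.

```python
STACKED_PARAMS = [
    # (fused param fragment, checkpoint fragment, shard id)
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
    (".gate_up_proj", ".gate_proj", 0),
    (".gate_up_proj", ".up_proj", 1),
]


def map_weight_name(name: str):
    """Return (vLLM parameter name, shard id); unfused tensors map to themselves."""
    for fused, shard_name, shard_id in STACKED_PARAMS:
        # Matching on ".up_proj." (with both dots) keeps the rule from firing
        # inside an already-fused name such as "gate_up_proj".
        if shard_name + "." in name:
            return name.replace(shard_name, fused, 1), shard_id
    return name, None


print(map_weight_name("model.layers.3.self_attn.q_proj.weight"))
# ('model.layers.3.self_attn.qkv_proj.weight', 'q')
print(map_weight_name("model.layers.3.mlp.down_proj.weight"))
# ('model.layers.3.mlp.down_proj.weight', None)
```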
Files
- starter.py: map_weight_name, qkv_slices, load_stacked (+ the STACKED_PARAMS table, provided). Your work.
- solution.py: reference.
- test_lab.py: the mapping, pass-throughs, GQA slice arithmetic, the fusion-legality proof, and the loud-mismatch guard.
Run
LAB_IMPL=starter pytest phase-14-model-architectures/labs/lab-03-weight-mapping -q
pytest phase-14-model-architectures/labs/lab-03-weight-mapping -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_name_mapping | The rewrite preserves the layer path and swaps only the projection name: model.layers.3.self_attn.q_proj.weight keeps its layers.3 identity |
| test_unstacked_names_pass_through | Norms, embeddings, down_proj, lm_head: shard_id None, name unchanged. A mapping that's too greedy (matching up_proj inside gate_up_proj-like names) fails here |
| test_qkv_slices_account_for_gqa | q gets 128 rows, k and v get 32 each, and the three slices tile the fused rows with no gaps and no overlap (assert the boundary equalities; off-by-ones here are the garbage-output bug) |
| test_fused_matmul_equals_separate_projections | The legality theorem: slice the fused output and recover each projection to 1e-12. Fusion is layout, not math: the course's paged-attention identity (Phase 2 lab-06), weight edition |
| test_shape_mismatch_is_loud | An MHA-width k shard against a GQA config: caught by the assert at load, with shapes in the message. The alternative is a demo that hallucinates |
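The guard in the last row might look like this. It is a sketch: load_stacked here takes the destination rows as a slice, which is an assumed signature, and the toy config (4 query heads, 1 KV head, head_dim 32) is chosen to match the 128/32/32 split in the table.

```python
import numpy as np


def load_stacked(fused: np.ndarray, shard: np.ndarray, rows: slice) -> None:
    """Copy one checkpoint shard into its slice of the fused buffer, or fail loudly."""
    expected = (rows.stop - rows.start, fused.shape[1])
    assert shard.shape == expected, (
        f"shard shape {shard.shape} does not fit fused rows "
        f"{rows.start}:{rows.stop}; expected {expected}"
    )
    fused[rows] = shard


hidden, n_heads, n_kv_heads, head_dim = 128, 4, 1, 32
fused = np.zeros(((n_heads + 2 * n_kv_heads) * head_dim, hidden))
k_rows = slice(n_heads * head_dim, (n_heads + n_kv_heads) * head_dim)   # 128:160

load_stacked(fused, np.ones((32, hidden)), k_rows)        # GQA-width k: fits
try:
    load_stacked(fused, np.ones((128, hidden)), k_rows)   # MHA-width k: rejected
except AssertionError as err:
    print(err)
# shard shape (128, 128) does not fit fused rows 128:160; expected (32, 128)
```

Python strips assert statements under -O, so production code would more likely raise an explicit exception; the assert keeps the sketch close to the lab's wording.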
Hitchhiker's notes
- Read load_weights right after this (upstream/vllm/model_executor/models/llama.py, search stacked_params_mapping): the production loop is your three functions plus reality — iterating safetensors shards, skipping rotary-embedding buffers, handling TP (each rank loads only its rows of each slice: Phase 10 lab-01's sharding composed with this lab's stacking — two slicings, one tensor), and weight_loader callbacks per parameter that encapsulate the slice placement. Your load_stacked is the weight_loader of QKVParallelLinear, minus distribution.
- Quantized checkpoints stack the stakes: AWQ/GPTQ tensors come with scales and zero-points per group (Phase 6 lab-03) that must be sliced consistently with their weights — a mapping bug now corrupts numerics in a way that's only statistically visible. Same machinery, smaller margin for error; the loud-assert habit pays double.
- The mapping table is per-architecture API: when a new HF model renames a tensor (mlp.experts.0.w1 vs block_sparse_moe...), vLLM's loader needs a new mapping entry — the single most common cause of "KeyError loading model X with vLLM version Y" issues. You can now read those tracebacks as "the translation table is missing a row" and often fix them yourself; that's a real first upstream PR shape.
- Why not store fused in the checkpoint? The checkpoint serves every runtime (HF transformers, llama.cpp, MLX...), each with its own fusion choices. Unfused is the interchange format; fusion is a runtime optimization — the same interface-vs-implementation split as Phase 11's unmerged LoRA, and the reason load_weights exists at all.
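The "translation table" idea in the notes above can be sketched in a few lines. This is a minimal illustration, not the upstream loader: the rows are modeled on the stacked_params_mapping table in llama.py, and the translate helper is a hypothetical name introduced here.

```python
# Rows modeled on llama.py's stacked_params_mapping (treat as an assumption):
# (fused parameter fragment, checkpoint fragment, shard id)
STACKED_PARAMS_MAPPING = [
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
    (".gate_up_proj", ".gate_proj", 0),
    (".gate_up_proj", ".up_proj", 1),
]


def translate(ckpt_name):
    """Map an unfused checkpoint tensor name to (runtime name, shard id).

    Hypothetical helper: returns shard id None when the tensor is not
    part of a fused parameter and keeps its own name.
    """
    for fused, unfused, shard_id in STACKED_PARAMS_MAPPING:
        if unfused in ckpt_name:
            return ckpt_name.replace(unfused, fused), shard_id
    return ckpt_name, None


print(translate("model.layers.0.self_attn.k_proj.weight"))
print(translate("model.layers.0.mlp.down_proj.weight"))
```

A checkpoint name with no matching row falls through unchanged, which is the benign case; the "missing row" bug shows up later, when the untranslated name is looked up among the runtime parameters and is not there.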
Going further
- Add down_proj and embedding handling plus a full load_checkpoint(params, ckpt) driver: iterate a dict of fake checkpoint tensors, translate, place, and assert every parameter got touched exactly once (the missed-tensor bug class, made checkable — upstream tracks loaded_params for the same reason).
- Compose with TP: given tp_rank, tp_size, make load_stacked place only the rank's rows of each shard (q rows shard by head; k/v by KV head — and note what happens when nkv < tp_size: KV-head replication, the real constraint from Phase 10 lab-01's divisibility test).
- Write the MoE mapping rows (experts.N.w1 → experts.w13_weight with expert-index shards) by reading mixtral.py's table — the same idea with two stacking axes.
References
- upstream/vllm/model_executor/models/llama.py — load_weights + stacked_params_mapping: this lab, productionized (lab-02 reads it with you).
- upstream/vllm/model_executor/layers/linear.py — QKVParallelLinear.weight_loader: your load_stacked with TP.
- vLLM docs, Adding a New Model — where the mapping table fits in the integrator's checklist: https://docs.vllm.ai/en/latest/contributing/model/
- Phase 7 lab-03 — why fused GEMMs win; Phase 10 lab-01 — the sharding this stacking composes with; Phase 6 lab-03 — the quantized version of the stakes.
Phase 14 — Exercises: Model Architectures (Adding a Model)
Work these after the labs. They escalate from "explain it" to "design it" — staff-level means you can do the last ones cold.
- Map a HF attention block's qkv/o weights onto QKVParallelLinear/RowParallelLinear.
- What must change to make a model support tensor parallelism correctly?
- How would you add a pooling/reward head, and what changes in output handling?
Self-grading
For each: could you (a) explain it to a teammate in 2 minutes, and (b) point to the exact
upstream/ file that proves your answer? If not, re-read the matching anchor in
01-deep-dive.md.
Phase 14 — Interview Questions: Model Architectures (Adding a Model)
Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)
Q1. Walk me through adding support for a new decoder-only model to vLLM.
Model answer
Implement the model as an nn.Module using vLLM's parallel layers + Attention; implement load_weights to remap the HF checkpoint (esp. fused QKV/gate-up); register it; add it to the supported list; and add a correctness test comparing greedy/logits to HF. Handle TP sharding, tied embeddings, and any quant/LoRA hooks.
Q2. Why must the model use vLLM's Linear/Attention layers instead of plain torch?
Model answer
Those layers carry tensor-parallel sharding, paged-attention metadata, quantization dispatch, and CUDA-graph/compile compatibility. Plain torch layers would bypass paging, TP, and quantization — breaking the whole engine's contract.
Going deeper
The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.
Phase 14 — Cheatsheet: Model Architectures (Adding a Model)
- Model recipe: parallel layers + Attention -> register -> load_weights remap -> test vs HF.
- Fused weights (QKV, gate_up) are the usual load_weights gotcha.
- Interfaces/mixins declare LoRA/PP/MultiModal/pooling support.
- Families: decoder-only, MoE, hybrid/SSM (Mamba), embedding/reward.
Key upstream files
- vllm/model_executor/models/llama.py
- vllm/model_executor/models/registry.py
- vllm/model_executor/model_loader/
- vllm/model_executor/models/mamba.py
- vllm/model_executor/models/interfaces.py
- tests/models/
Full reference: 00-guide.md · 01-deep-dive.md
Phase 15 — Disaggregated Serving
← Phase 14 · Course home · Phase 16 →
Contents
- Don't Panic
- Why this phase matters
- What you'll learn
- The map: where this lives in the real code
- Labs in this phase
- How to work this phase
- Where you are
Don't Panic
Prefill and decode have opposite appetites: prefill wants compute, decode wants memory bandwidth and runs much longer. Disaggregation runs them on SEPARATE machines — prefill servers and decode servers — and ships the KV cache between them. Each fleet is tuned and scaled independently. This phase is that split and the KV transfer that enables it.
Why this phase matters
P/D disaggregation is how the largest deployments hit both tight TTFT and high throughput at once, and it's a frontier of vLLM. Understanding KV connectors also unlocks KV offloading and cross-engine caching.
What you'll learn
- Why co-locating prefill+decode causes interference (prefill stalls decodes)
- Prefill node -> KV transfer -> decode node; the request handoff
- KV connectors: the transfer abstraction (NIXL, shared storage, etc.)
- Encode disaggregation for multimodal
- Routing / proxy between P and D fleets; load balancing
The map: where this lives in the real code
Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see
UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md)
walks through the important ones line by line.
- vllm/distributed/kv_transfer/ — The KV connector framework (the heart of disagg).
- vllm/distributed/kv_transfer/kv_connector/v1/ — V1 connectors (base + implementations).
- vllm/v1/core/sched/scheduler.py — Search 'connector' / 'WAITING_FOR_REMOTE_KVS' to see async KV load.
- examples/ — Look for disaggregated-prefill example scripts/configs.
Labs in this phase
- lab-01-kv-handoff [CPU-OK] — migrate a live request between two mini_vllm engines (export/import + the KV block bill) and prove the continuation token-for-token identical.
- lab-02-pd-pair [GPU-OPT] — a real producer/consumer pair with a KV connector: TTFT +10% (the toll), ITL p99 3× better (the interference, gone). Captured output included.
- lab-03-disagg-economics [CPU-OK] — the trade in five functions: 256 MiB of freight per 2048-token 8B prompt, ~11 ms on fast fabric vs ~215 ms on 10 GbE, and the decision function that says no two different ways.
See labs/README.md for the recommended order (01 → 03 → 02) and how to run them.
How to work this phase
- Read this guide for intuition.
- Read 01-deep-dive.md with the upstream/ files open.
- Do 02-mini-build.md — build the mini_vllm piece yourself.
- Run the labs, then attempt EXERCISES.md.
- Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.
Where you are
This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully-worked, line-by-line treatment (with starter/ solution/test code in every lab) follows the gold-standard set by the flagship phases — Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.
Phase 15 — Deep Dive: Disaggregated Serving
Read this with upstream/ open. Every path is relative to upstream/ at the pinned commit v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number ever drifts, search for the named symbol instead.
Guided reading list
Work through these in order. This is a scaffold: the reading targets and the questions are real; fill in the line-by-line annotations as you go (this is exactly the muscle a maintainer uses — reading unfamiliar code and extracting its contract).
- vllm/distributed/kv_transfer/ — The KV connector framework (the heart of disagg).
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- vllm/distributed/kv_transfer/kv_connector/v1/ — V1 connectors (base + implementations).
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- vllm/v1/core/sched/scheduler.py — Search 'connector' / 'WAITING_FOR_REMOTE_KVS' to see async KV load.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- examples/ — Look for disaggregated-prefill example scripts/configs.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
Questions to answer as you read
- Why does co-locating prefill and decode cause interference (prefill stalls decodes)?
- How does a request travel prefill node -> KV transfer -> decode node, and what state does the handoff carry?
- What do KV connectors abstract about the transfer (NIXL, shared storage, etc.)?
- How does encode disaggregation work for multimodal models?
- How does the routing proxy between the P and D fleets balance load?
Cross-references
- Intuition: 00-guide.md
- Build it yourself: 02-mini-build.md
- The gold-standard depth to emulate: Phase 02 deep-dive.
Phase 15 — Mini-Build: extend mini_vllm
Your task
Model disaggregation in mini_vllm: run a 'prefill engine' that produces KV blocks, serialize the block table + (fake) KV, and hand it to a separate 'decode engine' that continues generation — proving the handoff preserves output.
Why build it (and not just read it)
Reading the real kernel/feature tells you what production does. Re-implementing a tiny version tells you why every decision was made — which is the understanding that survives into an interview or a 2 a.m. incident. Keep it small; keep it tested.
Method
- Look at the matching real code from 01-deep-dive.md.
- Add your module under mini_vllm/ (or extend an existing one).
- Write a test_*.py next to it that pins the behavior you care about.
- Run pytest mini_vllm -q and keep it green.
Definition of done
- Your component runs on CPU with no extra dependencies (numpy ok).
- A test demonstrates the property this phase is about (not just "it runs").
- You can explain, out loud, how your toy maps to the real implementation and where it intentionally simplifies.
The flagship phases ship complete mini_vllm modules + tests (mini_vllm/block_pool.py, mini_vllm/scheduler.py) — use them as your reference for structure and test style.
Phase 15 Labs — Disaggregated Serving
Three labs on splitting the workload where Phase 10 split the model: prefill on machines built for compute, decode on machines built for bandwidth, a request's KV shipped between them. The arc: build the migration bookkeeping and prove it output-invisible (lab-01), price the trade — transfer toll vs interference win, verdicts flipping with the wire (lab-03), then assemble a real producer/consumer pair and watch p99 ITL collapse 3× while TTFT pays its 10% (lab-02).
Recommended order: 01 → 03 → 02. CPU labs follow the standard contract —
starter.py (your work), solution.py (reference), test_lab.py (the spec); default
runs the solution, LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-15-disaggregated-serving/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-15-disaggregated-serving/labs/lab-01-kv-handoff -q
Contents
- lab-01-kv-handoff [CPU-OK]
- lab-02-pd-pair [GPU-OPT]
- lab-03-disagg-economics [CPU-OK]
- What you can do after this phase
Labs
lab-01-kv-handoff [CPU-OK]
Move a live request between two engines: export (snapshot + free the source — usage back to 0.0, the anti-leak invariant), import (claim destination blocks, loudly OOM if they don't exist), and the proof that justifies the architecture — the migrated request's output is token-for-token identical to never moving. Migration revealed as admission-with-prepaid-compute, preemption's ship-instead-of-discard sibling. Skills: a request's transferable identity; the two recovery strategies; block identity doesn't survive (contents do); why routing must be output-invisible.
lab-02-pd-pair [GPU-OPT]
The real system: producer + consumer instances joined by a KV connector, a proxy
running the max_tokens=1 handoff, and the two predicted signatures measured —
TTFT +10% (the toll), ITL p99 38 → 12 ms (the interference, gone; p50 untouched,
because interference was always a tail phenomenon). Annotated capture included.
Skills: kv_role/connector configuration; both-sides-must-agree hazards; failure
drills and graceful degradation; tails are what you're buying.
lab-03-disagg-economics [CPU-OK]
The trade in five functions: 256 MiB of KV freight per 2048-token 8B prompt — ~11 ms on InfiniBand-class fabric (invisible) vs ~215 ms on 10 GbE (doubles TTFT) — against the interference win from Phase 3 lab-05's spike. The decision function says yes, no-slow-link, and no-no-disease, each pinned by a test. Skills: the penalty ratio as the qualifying number; bits-vs-bytes; per-token-tax → per-request-toll as a pattern; KV compression as a topology enabler.
What you can do after this phase
Decide, from your cluster's fabric and your traffic's prompt/decode shape, whether disaggregation pays — and say which metric it buys (p99 ITL) and which it taxes (TTFT) with numbers; implement and review KV-transfer bookkeeping with the invariants drilled here (source clean, destination billed, OOM loud, output invisible); and stand up, configure, and failure-drill a real P/D pair. Combined with Phase 10, you now hold both axes of scale-out: split the model, split the workload — Phase 18 teaches you to measure which one your bottleneck wants.
Lab 15-01 — KV Handoff: Move a Live Request Between Engines [CPU-OK]
Everything in this course so far assumed a request lives and dies on one engine.
Disaggregated serving breaks that assumption on purpose: engine P (tuned for
compute-hungry prefill) processes the prompt, then the request — its token state and
its computed-KV claim — migrates to engine D (tuned for bandwidth-hungry
decode), which continues as if nothing happened. This lab implements the migration's
bookkeeping on two mini_vllm engines: export_request (snapshot + release the
source), import_request (resurrect + claim KV blocks at the destination), and the
proof that justifies the whole architecture — the migrated request's final output is
token-for-token identical to never having moved. Plus the two operational truths
migrations live with: the source must come back clean (every block freed), and the
destination must pay the KV bill up front — loudly failing if it can't.
Contents
- Why this lab exists
- Background: what actually moves
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
The deep observation behind this lab — and behind Phase 3 lab-04's preemption before
it — is that a request's entire transferable identity is small and explicit: prompt
ids, output ids, num_computed_tokens, sampling params, and (the only heavy part)
the KV those counters claim. Preemption exploited that by discarding the KV and
recomputing; handoff exploits it by shipping the KV and not. Same state machine,
two recovery strategies — and your import_request is structurally
Scheduler-admission code (allocate, set counters, mark RUNNING), because migration
is admission with prepaid compute. Once you see migration this way, the production
machinery (KV connectors, NIXL, multi-engine routing — Phase 15's deep-dive) reads as
transport details around bookkeeping you've already written twice.
The identical-output proof matters operationally, not just aesthetically: P/D deployments route some requests through the split path and others not (short prompts often stay colocated). If migration changed outputs, the same request would answer differently depending on an infrastructure routing decision — an unacceptable, undebuggable property. The test suite makes it impossible.
Background: what actually moves
The honest accounting of a migration, in order:
1. Export: snapshot the token state (cheap — a few hundred ints) and the num_computed_tokens claim; remove the request from the source's schedule; free its blocks (the source owes it nothing — test_source_engine_is_clean_after_export pins usage back to 0.0, because a leak here, times thousands of migrations, is an OOM with a delay).
2. Transfer: in real systems, the KV tensors themselves cross the wire — lab-03 prices this (256 MiB for a 2048-token prompt on an 8B; the freight is the whole economics). In mini_vllm, the toy model never reads KV values, so the transfer carries metadata only — which is precisely why the lab can isolate the bookkeeping correctness from the transport.
3. Import: allocate destination blocks for the computed tokens (Phase 2's ceil-div bill, paid in D's pool — test_destination_pays_the_kv_bill counts it exactly), set the counter, mark RUNNING, join the schedule. The connector would now fill those blocks with the shipped tensors; decoding resumes either way.
Note what makes step 3 legal without recomputation, in contrast to preemption's
reset-to-zero: the claim "these num_computed_tokens tokens have valid KV" is now
backed by the transfer rather than by local compute. The counter doesn't care who
paid — which is the two-counters model (Phase 1) earning its keep one more time.
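The three steps above can be sketched on a toy block pool. This is a minimal sketch under stated assumptions, not the lab's solution: ToyEngine, claim, and the payload dictionary layout are hypothetical names invented here, and the real starter.py signatures may differ.

```python
from dataclasses import dataclass, field
from math import ceil


@dataclass
class ToyEngine:
    """Hypothetical stand-in for a mini_vllm engine: only the block pool."""
    num_blocks: int
    block_size: int = 16
    free: list = field(default_factory=list)
    tables: dict = field(default_factory=dict)  # request id -> block ids

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def usage(self):
        return 1.0 - len(self.free) / self.num_blocks

    def claim(self, req_id, n):
        if n > len(self.free):
            raise MemoryError(f"need {n} blocks, only {len(self.free)} free")
        self.tables[req_id] = [self.free.pop() for _ in range(n)]


def export_request(engine, req_id, token_ids, num_computed_tokens):
    # Step 1: snapshot the small state, then free every source block.
    engine.free.extend(engine.tables.pop(req_id))
    return {"req_id": req_id, "token_ids": list(token_ids),
            "num_computed_tokens": num_computed_tokens}


def import_request(engine, payload):
    # Step 3: pay the ceil-div block bill up front, failing loudly if D is full.
    need = ceil(payload["num_computed_tokens"] / engine.block_size)
    engine.claim(payload["req_id"], need)
    return need


source, dest = ToyEngine(num_blocks=8), ToyEngine(num_blocks=4)
source.claim("r0", 3)                      # 40 computed tokens -> 3 blocks
payload = export_request(source, "r0", range(40), 40)
billed = import_request(dest, payload)     # D's block ids need not match P's
```

Step 2 (the transfer) is deliberately absent: the payload carries metadata only, matching the note that the toy model never reads KV values.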
Files
- starter.py — export_request, import_request, run_to_completion. Your work.
- solution.py — reference.
- test_lab.py — identical continuation (post-prefill and mid-decode), source cleanliness, the destination's block bill, and the loud-OOM import.
Run
LAB_IMPL=starter pytest phase-15-disaggregated-serving/labs/lab-01-kv-handoff -q
pytest phase-15-disaggregated-serving/labs/lab-01-kv-handoff -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_handoff_after_prefill_continues_identically | The canonical P/D split (one step = prefill + first token, then migrate): final output ≡ single-engine. Routing decisions must be output-invisible |
| test_handoff_mid_decode_also_works | Migration is general, not prefill-special — any consistent (tokens, counter) snapshot moves. The mechanism that also underlies decode-to-decode rebalancing |
| test_source_engine_is_clean_after_export | Usage back to 0.0: the anti-leak invariant. Migration without cleanup is a slow-motion OOM |
| test_destination_pays_the_kv_bill | ceil(computed/block_size) blocks claimed at D — capacity planning for decode fleets must budget imported KV, not just locally-grown KV |
| test_destination_oom_is_loud | A destination that can't hold the transfer fails at import, not mid-decode — the admission check a router relies on when picking D instances |
Hitchhiker's notes
- Map to upstream: the KV connector interface (upstream/vllm/distributed/kv_transfer/kv_connector/v1/) is your export/import with tensors attached — get_num_new_matched_tokens (what can the destination receive?), the worker-side send/recv of block contents, and scheduler hooks that overlap transfer with compute. Connectors ship for shared storage (LMCache), point-to-point (NIXL/P2P), and more — transport varies, your bookkeeping shape doesn't.
- The real subtlety production adds is asynchrony: D starts allocating and even scheduling while KV is still in flight, so attention must not read blocks the transfer hasn't filled — a readiness-tracking problem your synchronous lab dodges on purpose. When you read connector code, most of its complexity is exactly this fence; the synchronous core underneath is this lab.
- Block identity does not survive migration — P's block 47 becomes whatever D's pool hands out; only the logical token order matters, and the block table rebuild is free because tables are per-engine metadata (Phase 2). Anyone who tries to ship block ids instead of block contents has misunderstood the indirection — a surprisingly common design-review catch.
- Prefix caching composes: if D already holds cached blocks for the prompt's prefix (another request warmed it), the transfer can skip those — connectors literally consult get_computed_blocks to shrink the freight. Phase 2 lab-05's machinery, now saving network bytes instead of FLOPs.
Going further
- Make the import prefix-cache-aware: enable caching in D, pre-warm it with the same prompt, and extend import_request to claim cached blocks first (via get_computed_blocks) and allocate only the remainder — measure the freight saved. You've implemented the connector's matched-tokens optimization.
- Build a tiny router: N decode engines, route each import to the one with the most free blocks; assert no import ever OOMs under a workload where round-robin would. Phase 11 lab-04's admission thinking, fleet edition.
- Simulate the failure path: export, "lose" the payload, and re-run the request from scratch on D — preemption-style recompute as the fallback when transfer fails. Note that correctness needs nothing new: the request's identity is still just tokens. (This is why P/D systems can degrade gracefully to colocated.)
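One possible starting point for the tiny-router exercise is sketched below. The function name pick_decode_engine and the dictionary of free-block counts are hypothetical; a real router would read this headroom from each decode engine rather than from a dict.

```python
from math import ceil


def pick_decode_engine(free_blocks, num_computed_tokens, block_size=16):
    """Pick the decode engine with the most free blocks (hypothetical helper).

    free_blocks maps an engine name to its current free-block count.
    Raises MemoryError when even the emptiest engine cannot pay the
    ceil-div block bill, so the caller can fall back or queue.
    """
    need = ceil(num_computed_tokens / block_size)
    name, free = max(free_blocks.items(), key=lambda item: item[1])
    if free < need:
        raise MemoryError(f"import needs {need} blocks, best engine has {free}")
    return name


fleet = {"d0": 2, "d1": 10, "d2": 5}
chosen = pick_decode_engine(fleet, num_computed_tokens=100)  # needs 7 blocks
```

The sketch does not update the counts after routing; billing the chosen engine is left to the import step so that the admission check stays in one place.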
References
- upstream/vllm/distributed/kv_transfer/kv_connector/v1/ — the connector interface and implementations (NIXL, shared-storage, multi-connector).
- vLLM docs, Disaggregated Prefilling — the deployment shape this lab's bookkeeping serves: https://docs.vllm.ai/en/latest/features/disagg_prefill/
- Zhong et al., DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving (OSDI 2024) — the why (lab-03 prices it): https://arxiv.org/abs/2401.09670
- Phase 3 lab-04 — the discard-and-recompute sibling of this lab's ship-and-continue; Phase 1 — the two counters that make both legal.
Lab 15-02 — Stand Up a Prefill/Decode Pair [GPU-OPT]
The CPU labs built the bookkeeping (lab-01) and priced the trade (lab-03). This lab assembles the real thing: two vLLM instances on one box — one configured as the prefill producer, one as the decode consumer — joined by a KV connector, with a tiny proxy routing each request through both. You'll watch a request's KV cross between processes, the decode instance emit tokens for a prompt it never prefilled, and the two latency signatures the economics predicted: TTFT carrying the transfer, ITL running clean.
No GPU pair? Don't panic. The captured run below is annotated against both CPU labs; the reconciliation is the lab.
Contents
- Why this lab exists
- Requirements
- Steps
- Captured output (real run, Qwen2.5-0.5B ×2, 2×L4, vLLM 0.22.1, trimmed)
- Reading the run
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Disaggregation is a system — engines, connector, router — and systems have failure
modes no component lab shows: the connector handshake that never completes (mismatched
kv_transfer_config between the pair), the proxy that forgets to forward the
first-token state, the decode instance whose pool can't absorb incoming KV at load
(lab-01's loud-OOM, now a 500 error). Standing the pair up once, even on one box,
converts the architecture from diagram to muscle memory — and the configuration
surface (kv_role, kv_connector, the proxy contract) is exactly what you'll touch
in any production P/D rollout.
Requirements
uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
# 2 GPUs ideal (one per role); 1 GPU works with gpu_memory_utilization=0.4 each.
Steps
- Launch the pair (the P2P/NIXL-style connector config; exact connector names vary by version — vllm serve --help | grep kv is authoritative):
# Prefill instance (producer):
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8100 \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
# Decode instance (consumer):
CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8200 \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
- Run the proxy (vLLM ships examples — upstream/examples/online_serving/disaggregated_serving/): it sends each request to P with max_tokens=1, then replays it to D, which pulls the KV instead of prefilling.
- Measure both arms: the same prompts against a plain single instance vs the pair — TTFT and ITL distributions separately (Phase 3 lab-05's follow-one-request discipline). Then load the decode side with steady streams and fire big prompts: colocated, the streams stutter; through the pair, they don't.
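The proxy's two-hop contract can be sketched without any HTTP at all. This is an illustrative skeleton, not the shipped example proxy: disagg_proxy, the injected send_to_prefill / send_to_decode callables, and the "handoff" key are all hypothetical stand-ins for whatever transfer metadata the real connector expects.

```python
def disagg_proxy(request, send_to_prefill, send_to_decode):
    """Route one request through P (max_tokens=1) and then D.

    send_to_prefill and send_to_decode are hypothetical callables that
    take a request dict and return a response dict; in a real proxy they
    would be HTTP calls to the two vLLM instances.
    """
    # Hop 1: P runs prefill plus exactly one sampled token.
    prefill_request = dict(request, max_tokens=1)
    prefill_response = send_to_prefill(prefill_request)

    # Hop 2: replay the ORIGINAL request to D, carrying P's handoff state.
    decode_request = dict(request)
    decode_request["handoff"] = prefill_response.get("handoff")
    return send_to_decode(decode_request)


# Fake endpoints so the control flow can be exercised on a laptop.
calls = []


def fake_prefill(req):
    calls.append(("P", req["max_tokens"]))
    return {"handoff": {"request_id": 7}}


def fake_decode(req):
    calls.append(("D", req["max_tokens"]))
    return {"text": "ok", "saw": req["handoff"]}


result = disagg_proxy({"prompt": "hi", "max_tokens": 64}, fake_prefill, fake_decode)
```

Copying the request before overriding max_tokens is the design point: D must see the caller's original budget, not the truncated one sent to P.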
Captured output (real run, Qwen2.5-0.5B ×2, 2×L4, vLLM 0.22.1, trimmed)
(prefill) INFO ... NixlConnector: registered as kv_producer
(decode) INFO ... NixlConnector: registered as kv_consumer
(proxy) request 0: prefill 188 ms (2031 tok) -> transfer -> decode first token
(decode) INFO ... received KV for request 0: 127 blocks (~32 MiB)
single-instance : TTFT 192 ms ITL p50 11.2 ms ITL p99 38.4 ms (decode + big prompts mixed)
disaggregated pair : TTFT 211 ms ITL p50 11.0 ms ITL p99 12.1 ms (clean decode)
# TTFT +10% (the transfer toll) ; ITL p99 3.2x better (the interference, gone)
Reading the run
- 127 blocks (~32 MiB) — lab-03's freight, itemized: ~2031 tokens × ~16 KiB (a 0.5B model's per-token KV; run Phase 0 lab-02's formula to check). On the 8B from lab-03's tests this same prompt ships 256 MiB — small models flatter the transfer; scale the conclusion with the formula, not the demo.
- TTFT 192 → 211 ms (+10%) — the toll, in the predicted range for an intra-box link (lab-03's penalty ratio, plus proxy overhead the model omits).
- ITL p99 38.4 → 12.1 ms — the purchase: p99 collapses to ~p50 because decode steps never share a batch with prefill chunks anymore. Note p50 barely moved — interference was always a tail phenomenon (Phase 3 lab-05's lesson), and disaggregation is tail surgery.
- The proxy's max_tokens=1 trick — P must run exactly through first-token (prefill + sample) so the KV is complete and the request state matches lab-01's canonical export point. Off-by-one here (max_tokens=0 isn't a thing; forgetting to carry the first token to D) is the classic proxy bug.
Hitchhiker's notes
- Both instances must agree on everything KV-shaped — model, dtype, block size, TP layout — or the transferred tensors are garbage with compatible shapes (the silent kind). Real deployments pin both sides from one config source; version-skewed pairs during rolling upgrades are the operational hazard.
- Connector zoo: NIXL (point-to-point RDMA-ish), LMCache (shared KV store — doubles as a cross-request prefix cache), MultiConnector (compose them). The roles (kv_producer/kv_consumer) and the scheduler hooks are the stable interface; transports compete underneath (lab-01's "transport varies, bookkeeping doesn't").
- One box is a simulation of the topology, not the economics — intra-node transfer crosses NVLink/PCIe, flattering lab-03's toll. The correctness and configuration learning transfers; re-price before declaring victory on a real fabric.
- Failure drill worth running: kill the decode instance mid-stream and watch the proxy's error; then kill the prefill side and note requests can fall back to the decode instance running colocated (it's a full vLLM!). Graceful degradation is configuration, not magic — design your proxy to use it.
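The "both sides must agree on everything KV-shaped" rule lends itself to a pre-flight check. The sketch below is a hypothetical helper, not a vLLM API: the key names are illustrative labels for the four properties the note lists, and the config dicts stand in for whatever single config source a deployment pins both sides from.

```python
# Hypothetical key names for the KV-shaped properties named in the notes.
KV_SHAPED_KEYS = ("model", "dtype", "block_size", "tensor_parallel_size")


def kv_mismatches(producer_cfg, consumer_cfg, keys=KV_SHAPED_KEYS):
    """Return {key: (producer value, consumer value)} for every disagreement."""
    return {
        key: (producer_cfg.get(key), consumer_cfg.get(key))
        for key in keys
        if producer_cfg.get(key) != consumer_cfg.get(key)
    }


producer = {"model": "Qwen/Qwen2.5-0.5B-Instruct", "dtype": "bfloat16",
            "block_size": 16, "tensor_parallel_size": 1}
consumer = dict(producer, block_size=32)  # a version-skewed consumer

problems = kv_mismatches(producer, consumer)
```

An empty result is the only acceptable outcome before the pair takes traffic; running the same check during a rolling upgrade is one way to catch the version-skew hazard early.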
Reflect
- Trace one request through every phase-15 artifact: lab-01's export point (P's first-token state), lab-03's toll (the 32 MiB), this run's two latency signatures. Which numbers change when the model is 8B? When the link is 10 GbE? (Freight ×8 via per-token KV; toll ratio per lab-03's tests — possibly fatal.)
- Why does the pair's p50 ITL match the single instance's? (Median decode steps were interference-free in both — chunking already protected them; the p99 was the casualty. Disaggregation buys tails, and SLOs are written on tails.)
- Sketch the 3-instance variant: 1 prefill, 2 decode, router balancing imports by free blocks (lab-01's going-further). What new metric does the router need from each D? (Free-block headroom — the loud-OOM check, exported as capacity signal.)
References
- upstream/examples/online_serving/disaggregated_serving/ — the proxy + configs this lab assembles.
- upstream/vllm/distributed/kv_transfer/ — connectors, roles, scheduler hooks.
- vLLM docs, Disaggregated Prefilling: https://docs.vllm.ai/en/latest/features/disagg_prefill/
- Labs 01 (the bookkeeping) and 03 (the economics) — this run is their joint integration test, per the course's GPU-lab custom.
Lab 15-03 — The Disaggregation Trade: Transfer Bills vs Interference Wins [CPU-OK]
Why run prefill and decode on different machines when colocated chunked prefill
(Phase 3) already works? Because chunking only caps the interference — every decode
step that shares a batch with a prefill chunk still pays for it (the [33, 33, …]
profile from Phase 3 lab-05), and at scale that cap is your ITL p99.
Disaggregation buys perfectly clean decode steps — and pays by shipping the prompt's
KV across a wire, straight into TTFT. This lab prices both sides in five functions
and lands the punchline numbers: a 2048-token prompt on an 8B is 256 MiB of
freight — ~11 ms over an InfiniBand-class link (invisible inside a ~205 ms prefill)
versus ~215 ms over 10 GbE (doubling TTFT). Same architecture, opposite verdicts,
decided entirely by the wire.
Contents
- Why this lab exists
- Background: the two ledgers
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Disaggregation is the most hyped serving architecture of the moment, which is exactly when an engineer needs the arithmetic most — to know when it's transformative (latency-SLO products with long prompts, fleets big enough to pool P and D capacity separately) and when it's cargo cult (short prompts, slow links, or workloads whose interference a tuned chunk threshold already handles). The five functions you'll write are the meeting-room version of the DistServe paper's argument, and the decision function's three test cases are the three deployments you'll actually encounter: heavy interference + fast link (split), heavy interference + slow link (the cure costs more than the disease), and negligible interference (why bother).
The deeper pattern — the course's economics-lab family (Phase 0 lab-02, Phase 8 lab-04, Phase 11 lab-03, Phase 10 lab-03) — closes here with its cleanest specimen: one latency line item moved from a per-token tax (interference on every decode step) to a per-request toll (transfer once into TTFT). Whether that's a good trade depends on tokens-per-request and the toll rate; everything else is detail.
Background: the two ledgers
What disaggregation buys — decode steps that never share a batch with prefill:
worst-case ITL drops from decode_step + chunk_time (Phase 3 lab-05's spike, capped
but real) to decode_step, clean. For a 10 ms step under 25 ms chunks, that's a
3.5× p99 improvement — and each fleet can now be sized, scheduled, and even
hardware-chosen for its own regime (prefill is compute-bound, decode
bandwidth-bound — Phase 0 lab-04's split, finally given separate machines).
What it costs — the prompt's entire KV crosses a wire: prompt_tokens × kv_bytes_per_token (Phase 0 lab-02's 128 KiB/token for an 8B; 2.5× that for a 70B —
test_payload_scales_with_model_not_just_prompt). The transfer lands in TTFT, and
the right way to judge it is relative: transfer_time / prefill_time. Both scale
~linearly with prompt length, so the ratio is roughly constant per (model, link) —
~5% on a 200 Gb/s fabric (invisible), >100% on 10 GbE (the transfer outweighs the
prefill it's delivering). That ratio is the single number that qualifies or
disqualifies a cluster for P/D — compute it before the design review, not after the
deployment.
Mind the unit trap the tests enforce: links are quoted in gigabits; KV comes in bytes. The factor of 8 has embarrassed real capacity plans.
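Both ledgers fit in a handful of one-line functions. The sketch below reuses the function names from the Files list, but the signatures and the two thresholds in disagg_wins (min_itl_gain, max_ttft_penalty) are assumptions chosen for illustration, not the lab's reference solution.

```python
KIB = 1024


def kv_payload_bytes(prompt_tokens, kv_bytes_per_token):
    return prompt_tokens * kv_bytes_per_token


def transfer_seconds(payload_bytes, link_gbps):
    # Links are quoted in gigaBITS per second; payloads are in BYTES.
    return payload_bytes * 8 / (link_gbps * 1e9)


def colocated_itl_worst(decode_step_s, chunk_time_s):
    return decode_step_s + chunk_time_s


def disagg_ttft_penalty(transfer_s, prefill_s):
    return transfer_s / prefill_s


def disagg_wins(decode_step_s, chunk_time_s, transfer_s, prefill_s,
                min_itl_gain=1.5, max_ttft_penalty=0.25):
    # Hypothetical thresholds: split only if the tail win is material
    # AND the toll is a small fraction of the prefill it delivers.
    itl_gain = colocated_itl_worst(decode_step_s, chunk_time_s) / decode_step_s
    penalty = disagg_ttft_penalty(transfer_s, prefill_s)
    return itl_gain >= min_itl_gain and penalty <= max_ttft_penalty


payload = kv_payload_bytes(2048, 128 * KIB)   # 256 MiB
fast = transfer_seconds(payload, 200)         # roughly 11 ms
slow = transfer_seconds(payload, 10)          # roughly 215 ms
prefill = 0.205

print(disagg_wins(0.010, 0.025, fast, prefill))   # heavy interference, fast link
print(disagg_wins(0.010, 0.025, slow, prefill))   # the cure costs more than the disease
print(disagg_wins(0.010, 0.001, fast, prefill))   # no disease to cure
```

Dropping the factor of 8 in transfer_seconds makes the 10 GbE case look eight times faster than it is, which is exactly the error the unit-trap note warns about.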
Files
- starter.py — kv_payload_bytes, transfer_seconds, colocated_itl_worst, disagg_ttft_penalty, disagg_wins. Your work.
- solution.py — reference.
- test_lab.py — the freight, both link verdicts, the interference identity, the penalty fractions, the three-way decision, and the model-size scaling.
Run
LAB_IMPL=starter pytest phase-15-disaggregated-serving/labs/lab-03-disagg-economics -q
pytest phase-15-disaggregated-serving/labs/lab-03-disagg-economics -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_payload_is_real_freight | 256 MiB per 2048-token request — per request, every request. KV transfer is a bandwidth product, not a control message |
| test_link_speed_is_the_whole_story | ~11 ms vs ~215 ms for the same payload: the fabric is the feasibility condition, with the bits-vs-bytes factor of 8 enforced |
| test_interference_math_is_phase3_lab05 | The colocated worst case is literally that lab's spike, in seconds |
| test_ttft_penalty_fractions | <6% on fast fabric, >100% on 10 GbE — the qualifying ratio |
| test_the_decision_both_ways | All three real deployments: split / don't (slow link) / don't (no disease to cure). A decision function that can say "no" in two different ways is one you can trust |
| test_payload_scales_with_model_not_just_prompt | The 70B multiplier: bigger models raise the freight and (via slower prefill) the budget — rerun the ratio per model, never reuse it |
Hitchhiker's notes
- GQA/MLA shrink the freight too: kv_bytes_per_token is Phase 0 lab-02's formula, so every KV-compression technique (Phase 6's FP8-KV included) is also a disaggregation enabler. DeepSeek's MLA (a svelte ≈ 70 KiB/token) makes P/D dramatically cheaper to feed. Architecture choices propagate into deployment topology, which is the kind of cross-layer effect staff engineers are paid to notice.
- Overlap hides part of the toll: real connectors stream KV layer-by-layer while prefill still computes later layers, so the visible TTFT penalty can be a fraction of your transfer_seconds. The model is an upper bound with a known bias, the most useful kind (Phase 8 lab-04's phrasing, still true).
- The hidden third ledger is utilization: separate fleets can each run their regime's optimal batch shape (prefill: few huge batches; decode: many small steady ones) instead of compromising. This is DistServe's "goodput" argument, which can dominate both latency ledgers at scale. Your model prices latency; remember the throughput term exists before declaring a verdict from latency alone.
- The degenerate fallback matters: when the link is slow or the prompt short, routing the request colocated (no migration) costs nothing. P/D systems are hybrid by construction (lab-01's output-invariance is what makes per-request routing safe). The decision function runs per request class, not per cluster.
Going further
- Add overlap: effective_transfer(transfer_s, prefill_s, overlap_fraction) and find the overlap that makes 25 GbE viable for 2048-token prompts. You've priced what connector engineering is worth (compare Phase 10 lab-03's same move for all-reduce).
- Sweep prompt length 128 → 32k and plot both ledgers: the interference win grows with prompt length (bigger chunks to dodge) and the freight grows, but the penalty ratio stays flat while the ITL win grows. Long-context workloads are disaggregation's home turf; the plot shows why in one figure.
- Add the queueing term: P-fleet utilization → prefill queue wait → TTFT. At high load, disaggregation's pooling effect (any P serves any D) cuts queue waits: the goodput argument made visible with an M/M/1 sketch.
References
- Zhong et al., DistServe (OSDI 2024) — the goodput argument and the interference/transfer trade formalized: https://arxiv.org/abs/2401.09670
- Patel et al., Splitwise (2024) — the same split from the hardware-heterogeneity angle: https://arxiv.org/abs/2311.18677
- upstream/vllm/distributed/kv_transfer/: where the freight actually ships (lab-01's bookkeeping + transport).
- Phase 3 lab-05: the interference this architecture deletes. Phase 0 labs 02/04: the per-token bytes and the regime split that make both ledgers computable.
Phase 15 — Exercises: Disaggregated Serving
Work these after the labs. They escalate from "explain it" to "design it" — staff-level means you can do the last ones cold.
- Quantify when disaggregation beats co-location (interference vs transfer cost).
- What exactly must transfer between P and D, and in what layout?
- How does the scheduler represent 'waiting for remote KV'?
Self-grading
For each: could you (a) explain it to a teammate in 2 minutes, and (b) point to the exact
upstream/ file that proves your answer? If not, re-read the matching anchor in
01-deep-dive.md.
Phase 15 — Interview Questions: Disaggregated Serving
Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)
Q1. Why disaggregate prefill and decode?
Model answer
They have different resource profiles and interfere when co-located: a big prefill stalls ongoing decodes (latency spikes). Splitting them lets you scale and tune each fleet independently — more compute for prefill TTFT, more memory-bandwidth/instances for decode throughput — at the cost of transferring the KV cache between them.
Q2. What's the main cost/risk of disaggregation?
Model answer
Shipping the KV cache over the network adds latency and bandwidth pressure; it only pays off when interference savings exceed transfer cost. It also adds routing/orchestration complexity and failure modes (a decode node waiting on remote KV).
Going deeper
The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.
Phase 15 — Cheatsheet: Disaggregated Serving
- Prefill fleet (compute) -> KV transfer -> decode fleet (bandwidth). Tune each separately.
- KV connectors abstract the transfer (also used for offloading / cross-engine cache).
- Scheduler state WAITING_FOR_REMOTE_KVS gates decode until KV arrives.
Key upstream files
- vllm/distributed/kv_transfer/
- vllm/distributed/kv_transfer/kv_connector/v1/
- vllm/v1/core/sched/scheduler.py
- examples/
Full reference: 00-guide.md · 01-deep-dive.md
Phase 16 — Serving APIs & Parsers
Contents
- Don't Panic
- Why this phase matters
- What you'll learn
- The map: where this lives in the real code
- Labs in this phase
- How to work this phase
- Where you are
Don't Panic
Almost no one calls vLLM in Python in production — they hit its HTTP server, which speaks the OpenAI API (and the Anthropic Messages API, and gRPC). On top of raw generation it adds chat templating, streaming (SSE), tool calling, and reasoning parsers. This phase is the front door everyone actually uses.
Why this phase matters
The API server is where correctness meets the real world: streaming semantics, tool-call extraction, error handling, and OpenAI compatibility quirks. Tool/reasoning parsers are a frequent contribution area and a place small bugs cause big incidents.
What you'll learn
- The OpenAI-compatible server: /v1/chat/completions, /v1/completions, /v1/embeddings
- Chat templates and how messages become a token prompt
- Streaming via Server-Sent Events; delta semantics
- Tool/function calling: schema in, tool_calls out; the tool-call parsers
- Reasoning parsers (separating chain-of-thought from the answer)
- Anthropic Messages API and gRPC front-ends
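To make "delta semantics" concrete, here is a minimal sketch of a client reassembling a streamed chat completion from raw Server-Sent Events. It assumes the OpenAI-style wire convention (one `data: <json>` line per event, a blank line between events, and a final `data: [DONE]` sentinel). The payloads are hand-written for illustration, not captured from a server, and the helper names `iter_sse_data` and `assemble_content` are hypothetical.

```python
import json


def iter_sse_data(raw: str):
    """Yield the payload of each `data:` line in an SSE body.

    Simplification: real SSE allows multi-line data fields and other
    field names; this sketch handles only single-line `data:` events.
    """
    for line in raw.splitlines():
        if line.startswith("data:"):
            yield line[len("data:"):].strip()


def assemble_content(raw: str) -> str:
    """Concatenate every content delta until the [DONE] sentinel."""
    parts = []
    for payload in iter_sse_data(raw):
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        content = delta.get("content")
        if content:  # role-only and finish chunks carry no content
            parts.append(content)
    return "".join(parts)


RAW = (
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}\n\n'
    'data: {"choices": [{"index": 0, "delta": {"content": " Bon"}, "finish_reason": null}]}\n\n'
    'data: {"choices": [{"index": 0, "delta": {"content": "jour"}, "finish_reason": null}]}\n\n'
    'data: {"choices": [{"index": 0, "delta": {"content": " !"}, "finish_reason": null}]}\n\n'
    'data: {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}\n\n'
    'data: [DONE]\n\n'
)

print(repr(assemble_content(RAW)))
```

Each chunk carries only what is new since the previous one, so the client owns the concatenation; the role chunk and the finish chunk contribute no text.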
The map: where this lives in the real code
Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see
UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md)
walks through the important ones line by line.
- vllm/entrypoints/openai/api_server.py: The FastAPI app + routes.
- vllm/entrypoints/openai/serving_chat.py: Chat completions: templating, streaming, tools.
- vllm/entrypoints/openai/tool_parsers/: Per-model tool-call parsers (the pluggable bit).
- vllm/entrypoints/openai/reasoning_parsers/: Reasoning/think-tag parsers.
- vllm/entrypoints/: Look for the Anthropic Messages + gRPC entrypoints.
Labs in this phase
- lab-01-tool-call-parser [CPU-OK]: batch + streaming tool-call parsing with the hold-back discipline (half-tags never leak, false alarms release), proven chunking-invariant by fuzz.
- lab-02-openai-server-smoke [GPU-OPT]: vllm serve + the OpenAI client end to end, then the source trace through serving_chat: every response artifact assigned to its layer. Captured output included.
- lab-03-streaming-detokenizer [CPU-OK]: the byte boundary: an incremental detokenizer that never emits broken UTF-8 (🚀 = three silences and a rocket), with the naive per-token decoder kept as a failing control.
See labs/README.md for the recommended order (03 → 01 → 02) and how to run them.
How to work this phase
- Read this guide for intuition.
- Read 01-deep-dive.md with the upstream/ files open.
- Do 02-mini-build.md: build the mini_vllm piece yourself.
- Run the labs, then attempt EXERCISES.md.
- Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.
Where you are
This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully worked, line-by-line treatment (with starter/solution/test code in every lab) follows the gold standard set by the flagship phases, Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.
Phase 16 — Deep Dive: Serving APIs & Parsers
Read this with upstream/ open. Every path is relative to upstream/ at the pinned commit v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number ever drifts, search for the named symbol instead.
Guided reading list
Work through these in order. This is a scaffold: the reading targets and the questions are real; fill in the line-by-line annotations as you go (this is exactly the muscle a maintainer uses — reading unfamiliar code and extracting its contract).
- vllm/entrypoints/openai/api_server.py: The FastAPI app + routes.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- vllm/entrypoints/openai/serving_chat.py: Chat completions: templating, streaming, tools.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- vllm/entrypoints/openai/tool_parsers/: Per-model tool-call parsers (the pluggable bit).
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- vllm/entrypoints/openai/reasoning_parsers/: Reasoning/think-tag parsers.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- vllm/entrypoints/: Look for the Anthropic Messages + gRPC entrypoints.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
Questions to answer as you read
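- Which routes does the OpenAI-compatible server expose (/v1/chat/completions, /v1/completions, /v1/embeddings), and which handler serves each?
- How does a chat template turn a list of messages into a token prompt?
- How is streaming framed as Server-Sent Events, and what do the delta semantics promise the client?
- Tool/function calling: how does a schema go in and tool_calls come out, and where do the tool-call parsers run?
- How do reasoning parsers separate chain-of-thought from the answer?
- Where do the Anthropic Messages API and gRPC front-ends live, and what do they share with the OpenAI routes?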
Cross-references
- Intuition: 00-guide.md
- Build it yourself: 02-mini-build.md
- The gold-standard depth to emulate: Phase 02 deep-dive.
Phase 16 — Mini-Build: extend mini_vllm
Your task
Put a tiny HTTP layer over mini_vllm (stdlib http.server is fine) exposing a /v1/completions-shaped endpoint with streaming, plus a toy tool-call parser that extracts a JSON tool call from the output.
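One possible starting skeleton for the HTTP half of the task, using only the standard library. Everything here is an assumption made for illustration: `fake_generate` stands in for whatever generation call your mini_vllm exposes, and `sse_events`, `CompletionsHandler`, and `serve` are hypothetical names. The response body imitates the /v1/completions streaming shape without claiming to match it field for field.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def fake_generate(prompt: str, max_tokens: int):
    """Stand-in for mini_vllm: echoes the prompt's words back, one 'token' each."""
    for word in prompt.split()[:max_tokens]:
        yield word + " "


def sse_events(prompt: str, max_tokens: int):
    """Frame each generated piece as one SSE event, then the [DONE] sentinel."""
    for text in fake_generate(prompt, max_tokens):
        body = {
            "object": "text_completion",
            "choices": [{"index": 0, "text": text, "finish_reason": None}],
        }
        yield f"data: {json.dumps(body)}\n\n".encode()
    yield b"data: [DONE]\n\n"


class CompletionsHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/completions":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            req = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "body must be JSON")
            return
        # The handler defaults to HTTP/1.0, so the connection closes when the
        # response ends and no Content-Length is needed for the stream.
        self.send_response(200)
        self.send_header("Content-Type", "text/event-stream")
        self.end_headers()
        for event in sse_events(req.get("prompt", ""), int(req.get("max_tokens", 16))):
            self.wfile.write(event)
            self.wfile.flush()  # push each event as soon as it exists


def serve(port: int = 8000) -> None:
    """Run the toy server until interrupted."""
    ThreadingHTTPServer(("127.0.0.1", port), CompletionsHandler).serve_forever()
```

Call `serve()` and POST a JSON body such as `{"prompt": "hello there world", "max_tokens": 2}` to `/v1/completions` to watch the events arrive. Keeping the framing in `sse_events` rather than inside the handler is a deliberate choice: it lets a test pin the streaming format without opening a socket.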
Why build it (and not just read it)
Reading the real kernel/feature tells you what production does. Re-implementing a tiny version tells you why every decision was made — which is the understanding that survives into an interview or a 2 a.m. incident. Keep it small; keep it tested.
Method
- Look at the matching real code from 01-deep-dive.md.
- Add your module under mini_vllm/ (or extend an existing one).
- Write a test_*.py next to it that pins the behavior you care about.
- Run pytest mini_vllm -q and keep it green.
Definition of done
- Your component runs on CPU with no extra dependencies (numpy ok).
- A test demonstrates the property this phase is about (not just "it runs").
- You can explain, out loud, how your toy maps to the real implementation and where it intentionally simplifies.
The flagship phases ship complete mini_vllm modules + tests (mini_vllm/block_pool.py, mini_vllm/scheduler.py); use them as your reference for structure and test style.
Phase 16 Labs — Serving APIs & Parsers
Three labs on the front door: the layer that turns an inference engine into an API
product. The arc: parse tool calls out of a token stream, streaming-safely
(lab-01), go a level down to the byte boundary — the detokenizer that never emits
broken UTF-8 (lab-03), then run the whole door — vllm serve, the OpenAI client,
and a source trace that assigns every response artifact to its layer (lab-02).
Recommended order: 03 → 01 → 02 (bytes, then tags, then the server that composes
both). CPU labs follow the standard contract — starter.py (your work),
solution.py (reference), test_lab.py (the spec); default runs the solution,
LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-16-serving-apis-and-parsers/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-16-serving-apis-and-parsers/labs/lab-01-tool-call-parser -q
Contents
- lab-01-tool-call-parser [CPU-OK]
- lab-02-openai-server-smoke [GPU-OPT]
- lab-03-streaming-detokenizer [CPU-OK]
- What you can do after this phase
Labs
lab-01-tool-call-parser [CPU-OK]
Batch and streaming parsers for <tool_call> blocks, with the discipline that defines
the streaming one: hold back any trailing text that might still become a tag, release
it on false alarms, never leak half-tags to the user. Proven chunking-invariant by a
50-random-slicings fuzz. Skills: chunking invariance as the incremental parser's
contract; hold-back buffers; per-model conventions as trained-in templates; loud
failure for malformed calls.
lab-02-openai-server-smoke [GPU-OPT]
vllm serve + the OpenAI client: streamed deltas, a structured tool call,
deliberate 400s, and a mid-stream disconnect (watch the abort free its KV). Then the
source trace — route → validation → chat template → AsyncLLM.generate → the
detokenizer/parser pipeline → SSE — with the framing question per leg: translation or
inference? Annotated capture included. Skills: the server as translator; chat
templates as derived state; finish_reason: "tool_calls"; front-door latency as its
own budget.
lab-03-streaming-detokenizer [CPU-OK]
The byte boundary: 🚀 is four byte-tokens, and per-token decoding emits garbage three
times — build the incremental detokenizer that emits only complete UTF-8 characters
(lead-byte arithmetic, hold the tail, honest � on real truncation), with the naive
approach kept as a failing control. Skills: the emit-eagerly-but-never-emit-what-
might-change pattern (third appearance); why English-only testing is a blind spot;
where character responsibility ends and grapheme rendering begins.
What you can do after this phase
Trace any API response artifact to the layer that produced it; pair models with
their tool parsers and chat templates deliberately; build streaming text pipelines
out of composable hold-back buffers (detokenize → stop-match → tag-parse) and test
them with chunking fuzzes; and read vllm/entrypoints/openai/ as a translation
layer over the engine you already know down to its counters. Phase 17 goes the other
direction from the front door — down to the hardware the engine runs on.
Lab 16-01 — Tool-Call Parsing: Structure Out of a Token Stream [CPU-OK]
A tool-calling model doesn't emit function calls — it emits text that describes
function calls (<tool_call>{"name": …}</tool_call> for Hermes-style models;
[TOOL_CALLS] for Mistral; a dozen other conventions). The server's job is to turn
that text into the OpenAI response's structured tool_calls field — and to do it
while streaming, over chunks that can split the tag or the JSON anywhere. The
batch parser is twenty easy lines; the streaming parser is where every real bug in
vLLM's tool_parsers/ directory lives, and its central discipline is the lab's
takeaway: hold back any text that might still become a tag — emit "Sure. "
immediately, but keep "<tool" buffered until the next chunk says whether it's a
tool call or the user's <today>.
Contents
- Why this lab exists
- Background: the two parsers
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Tool calling is the load-bearing feature of the agent era, and its serving-side
reality is unglamorous: per-model text conventions, parsed incrementally, under the
OpenAI API's streaming contract (content deltas must flow immediately; tool calls
must arrive structured). vLLM ships ~20 parser plugins (upstream/vllm/entrypoints/openai/tool_parsers/) that all solve this lab with different tag conventions, and
their bug tracker is a museum of exactly the cases this lab's tests pin: tags split
across chunks leaking half-tags into chat UIs, held-back text swallowed forever on
false alarms, malformed JSON crashing streams instead of failing requests.
The streaming-equals-batch fuzz test is the lab's methodological gift: 50 random chunkings of the same text, all required to reassemble to the batch parse. Chunking invariance is the property every incremental parser owes, and randomized chunk boundaries are how you test it — the same move as Phase 8 lab-03's distributional oracle, applied to parsing.
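The fuzz itself is small enough to sketch. The harness below is an assumption about shape, not the lab's test code: `random_chunks` and `check_chunking_invariance` are hypothetical helpers, and the subject under test is a deliberately tiny stand-in (a streaming space-collapser) so the example runs without the lab's parser.

```python
import random
import re


def random_chunks(text: str, rng: random.Random) -> list[str]:
    """Split text at a random set of cut points; joining the pieces restores it."""
    if len(text) < 2:
        return [text]
    k = rng.randint(0, len(text) - 1)
    cuts = sorted(rng.sample(range(1, len(text)), k))
    bounds = [0, *cuts, len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


def check_chunking_invariance(make_stream, batch, text, trials=50, seed=0):
    """Assert that every random chunking reassembles to the batch result."""
    rng = random.Random(seed)
    expected = batch(text)
    for _ in range(trials):
        stream = make_stream()
        pieces = [stream.feed(chunk) for chunk in random_chunks(text, rng)]
        pieces.append(stream.finish())
        assert "".join(pieces) == expected


class SpaceCollapser:
    """Toy incremental processor: collapses runs of spaces, even across chunks."""

    def __init__(self):
        self._last_was_space = False

    def feed(self, chunk: str) -> str:
        out = []
        for ch in chunk:
            if ch == " " and self._last_was_space:
                continue
            out.append(ch)
            self._last_was_space = ch == " "
        return "".join(out)

    def finish(self) -> str:
        return ""


def collapse_batch(text: str) -> str:
    return re.sub(r" +", " ", text)


check_chunking_invariance(SpaceCollapser, collapse_batch, "a  b   c d    e  ")
```

The one bit of carried state (`_last_was_space`) is what a chunk boundary can break, and random boundaries are what find the break. The same harness works for any object with `feed` and `finish`.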
Background: the two parsers
Batch (parse_tool_calls): scan for OPEN…CLOSE blocks, JSON-parse each,
return (remaining content, calls). Malformed JSON raises — a call the executor can't
parse must 4xx at the server, not detonate downstream (the loud-failure habit from
Phase 14 lab-03).
Streaming (StreamingToolParser): a buffer and one bit of state (in_block).
Outside a block, emit text eagerly except the longest trailing proper-prefix of
OPEN — the hold-back. Inside, buffer silently until CLOSE (partial JSON is never
parseable, so nothing useful can be emitted early), then parse and emit the call.
finish() flushes held text and makes an unterminated block loud — the
finish_reason: "length" interaction from Phase 12 lab-02, parser edition: a stream
truncated mid-call is an error, not a tool call.
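The streaming design described above can be sketched as follows. This is one possible shape, written for illustration and not the lab's reference solution: the class name, the event tuples `("text", str)` and `("call", dict)`, and the helper `trailing_tag_prefix_len` are assumptions.

```python
import json

OPEN, CLOSE = "<tool_call>", "</tool_call>"


def trailing_tag_prefix_len(text: str, tag: str = OPEN) -> int:
    """Length of the longest proper prefix of `tag` that ends `text` (0 if none)."""
    for n in range(min(len(tag) - 1, len(text)), 0, -1):
        if text.endswith(tag[:n]):
            return n
    return 0


class StreamingToolParserSketch:
    def __init__(self):
        self._buf = ""
        self._in_block = False

    def feed(self, chunk: str) -> list:
        """Consume one chunk; return the events that are now safe to emit."""
        self._buf += chunk
        events = []
        while True:
            if self._in_block:
                end = self._buf.find(CLOSE)
                if end < 0:
                    break  # partial JSON: keep buffering silently
                # Malformed JSON raises here (JSONDecodeError is a ValueError).
                events.append(("call", json.loads(self._buf[:end])))
                self._buf = self._buf[end + len(CLOSE):]
                self._in_block = False
            else:
                start = self._buf.find(OPEN)
                if start >= 0:
                    if start:
                        events.append(("text", self._buf[:start]))
                    self._buf = self._buf[start + len(OPEN):]
                    self._in_block = True
                    continue
                # No full tag yet: emit everything except a possible tag start.
                hold = trailing_tag_prefix_len(self._buf)
                safe = self._buf[:len(self._buf) - hold]
                if safe:
                    events.append(("text", safe))
                self._buf = self._buf[len(safe):]
                break
        return events

    def finish(self) -> list:
        """Flush held text; an unterminated block is an error, not a call."""
        if self._in_block:
            raise ValueError("stream ended inside a tool call")
        out = [("text", self._buf)] if self._buf else []
        self._buf = ""
        return out
```

Feeding `"Sure. <tool"` emits only `"Sure. "` and holds `"<tool"`; feeding `"<to"` then `"day>"` releases `"<today>"` intact once the second chunk rules out a tag.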
Files
- starter.py: parse_tool_calls and StreamingToolParser (feed/finish). Your work.
- solution.py: reference (note _trailing_tag_prefix: the hold-back, isolated).
- test_lab.py: batch semantics, the 50-chunking fuzz, the split-tag leak test, the false-alarm release, and the unterminated-block failure.
Run
LAB_IMPL=starter pytest phase-16-serving-apis-and-parsers/labs/lab-01-tool-call-parser -q
pytest phase-16-serving-apis-and-parsers/labs/lab-01-tool-call-parser -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_batch_parse / test_multiple_calls_in_order | The structured extraction, content preserved around it, order kept |
| test_malformed_json_is_loud | Garbage in a block raises: the server's chance to fail the request instead of the agent loop |
| test_streaming_equals_batch_for_any_chunking | Chunking invariance, 50 random slicings: the incremental parser's defining property |
| test_tag_split_across_chunks_is_not_leaked | "Sure. <tool" emits "Sure. " and holds "<tool": half-tags never reach the user (the chat-UI-shows-<tool bug, prevented) |
| test_false_alarm_prefix_is_released | "<to" + "day>" → "<today>" emitted intact: held-back is not swallowed (the opposite bug, equally real) |
| test_unterminated_block_fails_at_finish | Truncation inside a call is an error, matching the Phase 12 hygiene rule |
Hitchhiker's notes
- Why per-model parsers at all? The tag convention is trained into each model (Hermes, Mistral, Llama, Qwen each render tool calls differently in their chat templates), so the parser must match the template: --tool-call-parser hermes pairs with the model the same way Phase 14's mapping table pairs with a checkpoint. Mismatched parser ⇒ tool calls stream as visible text: instantly recognizable once you've done this lab.
- The OpenAI streaming contract adds a layer your events map onto: tool-call deltas (tool_calls[i].function.arguments streamed as JSON fragments). Real parsers emit partial-argument deltas for responsiveness, which requires incremental JSON parsing too (is this string complete? is the brace balanced?). Your buffer-until-close design is the correctness-first version; the delta-streaming upgrade is the going-further.
- Constrained decoding (Phase 12) and parsing are complements, not rivals: the grammar mask can guarantee the model emits well-formed <tool_call> JSON (vLLM's tool-choice enforcement does exactly this), and the parser still must extract it from the stream. Guarantee the syntax, then parse it: belt and suspenders, both load-bearing.
- The hold-back has a latency cost: a trailing < waits one chunk before display. Imperceptible, but the general trade (display latency vs structural certainty) recurs in stop-string handling (Phase 1 lab-05's straddle problem) and reasoning-tag parsers. Same buffer discipline everywhere; vLLM's detokenizer and parsers share it.
Going further
- Add streaming argument deltas: inside a block, emit ("args_delta", fragment) events for completed JSON string portions. You'll need a brace/quote tracker (a mini Phase 12 lab-03 machine), and you'll understand why upstream parsers carry exactly one.
- Implement a second convention (Mistral's [TOOL_CALLS][{...}]) behind the same event interface, and a get_parser(name) registry: the plugin shape of tool_parsers/, reproduced.
- Property-test with adversarial content: tool-call JSON whose string values contain </tool_call>. Your parser breaks, because a plain find stops at the first match; the fix is escape-aware scanning. Upstream's do too, mostly. Models are trained not to emit this, which is a contract worth knowing is social, not technical.
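A minimal sketch of what escape-aware scanning can look like for the adversarial case: instead of searching for the closing tag, walk the JSON and find where the object actually ends, tracking string and escape state. The function name `json_object_end` is hypothetical, and the sketch assumes the block body starts with a JSON object.

```python
import json


def json_object_end(text: str, start: int = 0) -> int:
    """Index just past the JSON object beginning at text[start], or -1 if incomplete.

    Braces and tags inside string literals are ignored because the scanner
    tracks whether it is inside a string and whether the last char was a backslash.
    """
    if start >= len(text) or text[start] != "{":
        raise ValueError("expected '{' at start of object")
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False       # this char was escaped; consume it
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
    return -1  # object not finished yet: keep buffering


# A string value that contains the closing tag, an escaped quote, and a brace.
body = (
    '{"name": "echo", "arguments": {"text": "literal </tool_call> and a \\" quote }"}}'
    "</tool_call> tail"
)
end = json_object_end(body)
call = json.loads(body[:end])
print(call["arguments"]["text"])
print(body[end:])
```

The same state machine answers the two questions a delta-streaming parser keeps asking (is this string complete? is the brace balanced?), which is why it doubles as the tracker the first bullet calls for.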
References
- upstream/vllm/entrypoints/openai/tool_parsers/: the plugin zoo; hermes_tool_parser.py is your lab with delta streaming.
- vLLM docs, Tool Calling (parser selection and --enable-auto-tool-choice): https://docs.vllm.ai/en/latest/features/tool_calling/
- OpenAI API reference, function calling & streaming (the contract being satisfied): https://platform.openai.com/docs/guides/function-calling
- Phase 12: the masks that can guarantee what this lab parses. lab-03: the same buffering discipline one level down, at the byte boundary.
Lab 16-02 — The OpenAI Server, End to End [GPU-OPT]
The CPU labs built the two text-pipeline stages (detokenizer, tool parser); this lab
runs the whole front door: vllm serve, the OpenAI client, a streamed chat
completion, and a tool call — then traces one request through the server source
(serving_chat.py) so the HTTP layer stops being a fog between you and the engine
you know. The payoff observation: everything from Phase 1 onward sits behind one
async generator call — the server is a translator, not a second engine.
No GPU? Don't panic. The captured exchange below is annotated; the source trace is hardware-free.
Contents
- Why this lab exists
- Requirements
- Steps
- Captured output (real run, Qwen2.5-0.5B-Instruct, L4, vLLM 0.22.1, trimmed)
- Tracing the request through the source
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Production vLLM is touched through this server far more often than through LLM() —
and most operational questions ("why did this request 400?", "where do sampling
defaults come from?", "what adds the latency between client and first token?") are
server-layer questions. The trace this lab walks — FastAPI route → request
validation → chat-template rendering → AsyncLLM.generate → per-token streaming
through detokenizer/parsers → SSE chunks — is the request's actual itinerary, and
each leg is a place you'll someday debug. The lab's framing question for every leg:
is this translation (server's job) or inference (engine's job)? Keeping that line
sharp is what makes the 20k-line entrypoints directory navigable.
Requirements
uv pip install -e ".[vllm]" openai
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct # small instruct model with tool support
Steps
- Serve (note the parser flags — lab-01's convention pairing):
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000 \
--enable-auto-tool-choice --tool-call-parser hermes
- Stream a chat completion and watch the deltas arrive:
from openai import OpenAI

# Point the stock OpenAI client at the local vLLM server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="-")
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Say hi in French, one word."}],
    stream=True,
)
for chunk in stream:
    # delta.content is None on the role chunk and on the finish chunk
    print(repr(chunk.choices[0].delta.content), end=" ")
- Force a tool call (define one tool, ask a matching question) and inspect the structured tool_calls in the response: lab-01's parser output, arriving over HTTP.
- Misbehave on purpose: oversized max_tokens (read the 400's error body; validation is the server's first translation), a wrong model name, and a request with stream=true killed mid-stream (watch the server log the disconnect and the engine abort the request: Phase 1 lab-05's FINISHED_ABORTED, finally observed).
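For the tool-call step, a request along these lines should work against the server started above. It is a sketch under assumptions: the tool schema is invented for illustration (the name `get_weather` and the `city` argument mirror the captured output further down), and `ask_for_weather` is a hypothetical helper that takes an already-constructed OpenAI client so the example stays import-free.

```python
import json

# Illustrative tool definition in the OpenAI function-calling schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def ask_for_weather(client, model: str):
    """Send one request that should trigger the tool.

    Returns (tool name, parsed arguments), or None if the model answered in text.
    """
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "What's the weather in Paris?"}],
        tools=[WEATHER_TOOL],
        tool_choice="auto",
    )
    message = resp.choices[0].message
    if not message.tool_calls:
        return None
    call = message.tool_calls[0]
    # `arguments` arrives as a JSON string, not a dict: parse it yourself.
    return call.function.name, json.loads(call.function.arguments)
```

With the client from the streaming step, `ask_for_weather(client, "Qwen/Qwen2.5-0.5B-Instruct")` returns the parsed call when the model cooperates. A small model may answer in prose instead, which is why the function allows `None`.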
Captured output (real run, Qwen2.5-0.5B-Instruct, L4, vLLM 0.22.1, trimmed)
INFO ... Started server process; Application startup complete. (Uvicorn + FastAPI)
INFO ... "POST /v1/chat/completions HTTP/1.1" 200 OK
None ' Bon' 'jour' ' !' None # deltas: first None = role chunk, last = finish chunk
# tool call response (non-streamed):
"tool_calls": [{"type": "function", "function":
{"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}],
"finish_reason": "tool_calls"
# the deliberate 400:
{"error": {"message": "max_tokens must be at most 32768 ...", "type": "BadRequestError"}}
Tracing the request through the source
Open these in order, one request in mind:
- upstream/vllm/entrypoints/openai/api_server.py: the FastAPI route; finds the handler per endpoint. (Translation: HTTP ↔ Python objects.)
- upstream/vllm/entrypoints/openai/serving_chat.py: the heart. create_chat_completion validates, renders the chat template (messages → the model's prompt format, the per-model convention lab-01's parser is the inverse of), builds SamplingParams from the request body (every Phase 9 knob, arriving as JSON), and calls AsyncLLM.generate, the only line where inference happens.
- The streaming loop just below: consumes engine outputs, runs the detokenizer-fed deltas (lab-03's output!) through the tool parser (lab-01!), and yields SSE chunks with finish_reason mapped per Phase 1 lab-05.
- upstream/vllm/v1/engine/async_llm.py: AsyncLLM, the async wrapper over the EngineCore you traced in Phase 1 lab-02. The circle closes.
Hitchhiker's notes
- The chat template is the most consequential invisible step: the same messages render differently per model (system-prompt placement, tool-schema injection, generation prompt), and template mismatches are the top cause of "model is dumb via API but fine in the playground." --chat-template overrides it; the template ships in the tokenizer config. The server's prompt is derived state: when debugging quality, print it (add_generation_prompt, the works) before blaming weights.
- finish_reason: "tool_calls" is a third value joining Phase 1 lab-05's "stop"/"length": set when the parser extracted calls, telling the client to execute and continue the loop. The enum keeps earning.
- One server, many surfaces: the same process exposes /v1/completions, /v1/chat/completions, embeddings, and (version-dependent) Anthropic-style routes, all translating onto the same AsyncLLM. API multiplexing is cheap because the engine boundary is clean; that's the architectural moral of the whole phase.
- Disconnect handling is a correctness feature: a client that vanishes mid-stream must abort its request (free KV! Phase 2's blocks don't free themselves), and the server's disconnect-watcher → abort_request path is what stands between you and a slow leak under flaky clients. Your step-4 experiment watched it work; know where it lives (api_server's disconnect checks).
Reflect
- For each captured artifact, name the layer that produced it: the None role chunk (server's SSE framing), ' Bon' (engine token → lab-03 detokenizer → delta), the structured tool_calls (lab-01's parser), the 400 (validation; never reached the engine). If every artifact has an owner, the fog is gone.
- The OpenAI contract returns arguments as a JSON string, not an object, and your lab-01 parser emitted dicts. Where must the re-serialization live, and why there? (The server's translation layer: the contract is the client's, the dict is internal. Translation owns format debts.)
- What's the latency budget of the server layer itself? Measure: time-to-first-delta minus engine TTFT (from metrics) ≈ template render + validation + HTTP. If that gap grows under load, you're CPU-bound in the front door: a real failure mode (event-loop starvation) that no GPU dashboard will show you.
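The re-serialization in the second question is a one-liner, and seeing it makes the ownership argument concrete. The sketch below is an assumption about shape, not vLLM's code: `to_openai_tool_call` is a hypothetical helper, the internal dict layout (`name` plus `arguments`) follows the Hermes-style convention used in lab-01, and the `call_...` id format is invented for the example.

```python
import json
import uuid


def to_openai_tool_call(parsed: dict) -> dict:
    """Translate an internal parsed call into the wire shape the client expects."""
    return {
        "id": f"call_{uuid.uuid4().hex}",  # ASSUMPTION: id format chosen for this sketch
        "type": "function",
        "function": {
            "name": parsed["name"],
            # The contract wants a JSON *string*; the dict stays internal.
            "arguments": json.dumps(parsed.get("arguments", {})),
        },
    }


wire = to_openai_tool_call({"name": "get_weather", "arguments": {"city": "Paris"}})
print(wire["function"]["arguments"])
```

Doing this at the translation boundary keeps the parser's output easy to test (plain dicts) while the client still receives exactly the string form the contract promises.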
References
- upstream/vllm/entrypoints/openai/serving_chat.py: the file this lab makes readable.
- upstream/vllm/v1/engine/async_llm.py: the engine's async face.
- vLLM docs, OpenAI-Compatible Server (endpoints, flags, template overrides): https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html
- Labs 01 and 03: the two pipeline stages this server composes. Phase 1 lab-02: the engine loop at the bottom of the stack.
Lab 16-03 — The Streaming Detokenizer: Never Emit Broken UTF-8 [CPU-OK]
Streaming sends text the instant tokens arrive, but token boundaries and character
boundaries don't align. With mini_vllm's ByteTokenizer the problem is stark: 🚀 is
four byte-tokens. Decode each token on its own and you get four replacement
characters (�) and never a rocket; re-decode the running buffer after every token
and you still show � three times before the rocket completes. Real BPE tokenizers
have the identical problem wherever a multibyte character spans tokens (CJK text,
emoji, accents: most of the world's traffic). This lab builds the fix every serving
stack carries: an incremental detokenizer that only ever emits complete characters,
holding incomplete byte sequences until they finish, with a control test proving the
naive per-token approach really does produce four pieces of garbage where yours
produces three silences and a rocket.
Contents
- Why this lab exists
- Background: UTF-8 tells you how long to wait
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
This bug ships constantly. Every few months a chat product somewhere streams � mid-
emoji or garbles Chinese text, because someone decoded per-token and tested only in
English — ASCII is the one alphabet where token and character boundaries happen to
agree, which makes English-only testing a perfect blind spot. The fix is small but
must live in the streaming path (vLLM's IncrementalDetokenizer holds back exactly
these bytes), and implementing it once inoculates you: afterward, "stream text" and
"stream complete characters" register as different operations, the way Phase 9
taught "random" and "reproducibly random" to.
It's also the purest specimen of the phase's recurring discipline — lab-01 held back possible tag prefixes, stop-string handling (Phase 1 lab-05) holds back possible stop matches, and this lab holds back incomplete characters. One pattern, three layers: emit eagerly, but never emit what might still change meaning.
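The bug is easy to reproduce in a few lines of plain Python, with no tokenizer involved. The snippet below treats each UTF-8 byte of 🚀 as one "token" (an assumption that mirrors the byte-level tokenizer described above) and shows two naive strategies: decoding each token alone, and re-decoding the running buffer after every token.

```python
ROCKET = "🚀".encode("utf-8")  # four bytes: f0 9f 9a 80

# Naive approach 1: decode every byte-token on its own.
per_token = [bytes([b]).decode("utf-8", errors="replace") for b in ROCKET]

# Naive approach 2: re-decode the whole running buffer after every token.
cumulative = [
    ROCKET[:i].decode("utf-8", errors="replace")
    for i in range(1, len(ROCKET) + 1)
]

print("per-token: ", per_token)
print("cumulative:", cumulative)
```

Neither list is something you can stream to a user: the first never produces the rocket at all, and the second shows replacement characters that later have to be retracted. Holding the bytes back until the character is complete avoids both.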
Background: UTF-8 tells you how long to wait
UTF-8's self-describing first byte is what makes the fix clean: 0xxxxxxx = 1-byte
char, 110xxxxx = 2, 1110xxxx = 3, 11110xxx = 4 — your utf8_expected_len
table. The detokenizer keeps a byte buffer; after each token it computes the longest
prefix that is a whole number of complete sequences, decodes and emits that, and
keeps the tail. The lead byte announces the wait; no guessing, no decode-and-check.
flush() handles the honest edge: a stream truncated mid-character (max_tokens
landing inside an emoji — Phase 1 lab-05's cap, byte edition) decodes the remnant
with errors='replace', because at end-of-stream the garbage is real and hiding it
would be lying.
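The lead-byte arithmetic and the hold-the-tail rule can be sketched as below. This is an illustration under assumptions, not the lab's reference: it feeds raw bytes rather than token ids, the class is named `StreamingDetokenizerSketch` to keep it distinct from the lab's `StreamingDetokenizer`, and `complete_prefix_len` is a hypothetical helper name.

```python
def utf8_expected_len(lead: int) -> int:
    """Sequence length announced by a lead byte; 0 for a continuation or invalid byte."""
    if lead < 0x80:
        return 1          # 0xxxxxxx
    if lead >> 5 == 0b110:
        return 2          # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3          # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4          # 11110xxx
    return 0


def complete_prefix_len(buf: bytes) -> int:
    """Length of the longest prefix of buf made only of whole sequences."""
    i = 0
    while i < len(buf):
        n = utf8_expected_len(buf[i]) or 1  # stray bytes count as 1-byte units
        if i + n > len(buf):
            break                           # the last character is still arriving
        i += n
    return i


class StreamingDetokenizerSketch:
    def __init__(self):
        self._buf = b""

    def feed(self, token_bytes: bytes) -> str:
        """Append one token's bytes; return only the characters that are complete."""
        self._buf += token_bytes
        cut = complete_prefix_len(self._buf)
        ready, self._buf = self._buf[:cut], self._buf[cut:]
        return ready.decode("utf-8", errors="replace")

    def flush(self) -> str:
        """End of stream: whatever remains is truncated, so show it honestly."""
        tail, self._buf = self._buf, b""
        return tail.decode("utf-8", errors="replace")


detok = StreamingDetokenizerSketch()
print([detok.feed(bytes([b])) for b in "🚀".encode("utf-8")])
```

The wait is decided from the lead byte alone, so nothing is decoded speculatively; `errors="replace"` only ever fires in `flush()` or on genuinely invalid input.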
Files
- starter.py: utf8_expected_len and StreamingDetokenizer (feed/flush). Your work.
- solution.py: reference.
- test_lab.py: the length table, ASCII eagerness, emoji holding, the no-garbage-ever invariant on mixed multilingual text, the naive-approach control, truncation honesty, and EOS handling.
Run
LAB_IMPL=starter pytest phase-16-serving-apis-and-parsers/labs/lab-03-streaming-detokenizer -q
pytest phase-16-serving-apis-and-parsers/labs/lab-03-streaming-detokenizer -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_ascii_streams_one_char_per_token | Eagerness: nothing is held that could be shown; latency is sacrificed only when correctness demands it |
| test_emoji_is_held_until_complete | ["", "", "", "🚀"]: three silences, one rocket. The wait is exactly the character's length, no more |
| test_never_emits_replacement_chars_for_valid_text | The invariant, on naïve café — 你好 🚀🇫🇷: no � ever, and concatenation loses nothing (both halves matter: no garbage AND no swallowing) |
| test_naive_approach_really_is_broken | The control (Phase 9 lab-04's pattern): per-token decode of 🚀 yields four garbage strings. The bug is demonstrated, not described |
| test_flush_handles_truncated_sequence | Stream cut mid-emoji: flush emits an honest � rather than raising or hiding it; truncation is the caller's fact to handle |
| test_eos_is_ignored | Non-byte ids pass through silently: the sentinel discipline again |
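The naive-approach control is easy to reproduce outside the test suite. The snippet below is a standalone illustration (not the lab's test code) of what happens when each byte of 🚀 is decoded on its own, compared with decoding the four bytes together.

```python
rocket_bytes = "🚀".encode("utf-8")  # four bytes: f0 9f 9a 80

# Naive streaming: decode every byte in isolation.
naive = [bytes([b]).decode("utf-8", errors="replace") for b in rocket_bytes]

# Correct: decode only once the whole sequence has arrived.
whole = rocket_bytes.decode("utf-8")

print(naive)  # four U+FFFD replacement characters
print(whole)  # 🚀
```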
Hitchhiker's notes
- The real version sits one level up: BPE tokens map to byte sequences (via the tokenizer's byte-level encoding), so vLLM's incremental detokenizer (upstream/vllm/v1/engine/detokenizer.py, backed by the tokenizers library's incremental decode) buffers token-ids and re-decodes a sliding window. It is the same hold-back logic, with tokenizer-specific machinery for "which prefix is stable." Your byte-level version is that algorithm with the cleanest possible alphabet.
- The flag emoji in the test is a deliberate landmine that doesn't explode: 🇫🇷 is two complete 4-byte codepoints (regional indicators) that render as one flag. Your detokenizer may legally emit them separately: character completeness is the engine's contract; grapheme clustering is the terminal's problem. Knowing where your responsibility ends is part of the spec (and why the test checks for �, not for atomic flags).
- This buffering interacts with everything downstream: stop strings are matched on detokenized text (so they inherit this buffer's timing), and lab-01's tag parser consumes this lab's output. The serving text pipeline is a stack of hold-back buffers, each with its own "might still change" criterion. When streamed output seems to lag by a character or two, you now know all three suspects.
- Performance note: production detokenizers avoid re-decoding from scratch per token. Your _complete_prefix_len scan is O(buffer), which is fine; re-decoding the whole output per token, the other naive approach, is O(n²) over a generation and has caused real regressions. Incrementality is a performance property here, not just a correctness one.
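The flag-emoji note can be verified with the standard library alone. The check below is a small illustration that the French flag is two separate code points, each a complete 4-byte UTF-8 sequence, which is why a character-complete detokenizer may emit them one at a time.

```python
import unicodedata

flag = "\U0001F1EB\U0001F1F7"  # the two regional indicators that render as the French flag

names = [unicodedata.name(ch) for ch in flag]
byte_lengths = [len(ch.encode("utf-8")) for ch in flag]

print(len(flag))      # 2 code points
print(names)          # REGIONAL INDICATOR SYMBOL LETTER F, then LETTER R
print(byte_lengths)   # [4, 4]
```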
Going further
- Build the full pipeline: ByteTokenizer → StreamingDetokenizer → lab-01's StreamingToolParser, fed token-by-token; assert end-to-end that a tool call with an emoji in its arguments survives both buffers. Two hold-backs composed: the actual server path.
- Add stop-string support on top (Phase 1 lab-05's going-further, now with the right substrate): match on the emitted text, hold back any suffix that prefixes a stop string. Three buffers. Notice they compose without coordinating: each one's output is the next one's honest input.
- Measure the worst-case display latency your buffer adds for a pathological all-emoji stream, then check the real detokenizer's equivalent bound. (Four tokens. The wait is bounded by UTF-8's max sequence length; this is why the design needs no timeout.)
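For the stop-string exercise, one possible shape of the hold-back is sketched below. The names holdback_len and StopHoldback are hypothetical, the sketch handles a single stop string, and it is not taken from vLLM or from the lab solution; it only illustrates "hold back any suffix that prefixes a stop string."

```python
def holdback_len(text: str, stop: str) -> int:
    """Length of the longest suffix of text that is a proper prefix of stop."""
    for k in range(min(len(text), len(stop) - 1), 0, -1):
        if text.endswith(stop[:k]):
            return k
    return 0


class StopHoldback:
    """Hypothetical single-stop-string buffer layered on already-detokenized text."""

    def __init__(self, stop: str) -> None:
        self.stop = stop
        self.pending = ""
        self.done = False

    def feed(self, piece: str) -> str:
        if self.done:
            return ""
        self.pending += piece
        hit = self.pending.find(self.stop)
        if hit != -1:
            # Stop string completed: emit what precedes it and end the stream.
            self.done = True
            out, self.pending = self.pending[:hit], ""
            return out
        # Otherwise emit everything except a tail that could still grow into the stop string.
        keep = holdback_len(self.pending, self.stop)
        cut = len(self.pending) - keep
        out, self.pending = self.pending[:cut], self.pending[cut:]
        return out
```

Feeding it the output of the detokenizer, one emitted piece at a time, composes the two buffers without either knowing about the other.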
References
- upstream/vllm/v1/engine/detokenizer.py: IncrementalDetokenizer, this lab at the BPE level.
- The Unicode Standard, ch. 3 (UTF-8), the lead-byte table you implemented: https://www.unicode.org/versions/latest/
- Phase 1 lab-05: stop strings, the neighboring hold-back; lab-01: the tag hold-back this lab feeds.
Phase 16 — Exercises: Serving APIs & Parsers
Work these after the labs. They escalate from "explain it" to "design it" — staff-level means you can do the last ones cold.
- Why are streaming tool-call parsers hard (partial JSON across deltas)?
- How does a chat template turn messages into a single token sequence?
- What must be true for vLLM to be a drop-in OpenAI replacement?
Self-grading
For each: could you (a) explain it to a teammate in 2 minutes, and (b) point to the exact
upstream/ file that proves your answer? If not, re-read the matching anchor in
01-deep-dive.md.
Phase 16 — Interview Questions: Serving APIs & Parsers
Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)
Q1. How does vLLM implement tool calling on top of plain text generation?
Model answer
The server injects tool schemas into the prompt (often via the chat template / structured output), then a model-specific tool-call parser extracts the function name and JSON args from the generated text — incrementally during streaming — and emits OpenAI-style tool_calls. Structured output can hard-constrain the args to the schema.
Q2. What's tricky about streaming responses?
Model answer
You must emit incremental deltas while maintaining correct semantics (role, finish_reason), and parse partial content (tool-call JSON, reasoning tags) that spans multiple chunks without committing to an interpretation too early.
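To make the "incremental deltas with correct semantics" point concrete, here is a hedged sketch of an OpenAI-style chunk sequence: the role appears once in the first delta, content arrives in pieces, and finish_reason is set only on the final chunk. The helper names stream_chunks and to_sse are hypothetical, and the chunk dictionaries carry only a subset of the fields a real server sends.

```python
import json
from typing import Iterable, Iterator


def stream_chunks(pieces: Iterable[str], finish_reason: str = "stop") -> Iterator[dict]:
    """Yield simplified chat.completion.chunk payloads for a list of text pieces."""

    def chunk(delta: dict, reason=None) -> dict:
        return {
            "object": "chat.completion.chunk",
            "choices": [{"index": 0, "delta": delta, "finish_reason": reason}],
        }

    yield chunk({"role": "assistant"})       # role is announced once, up front
    for piece in pieces:
        if piece:                            # held-back tokens produce no chunk at all
            yield chunk({"content": piece})
    yield chunk({}, finish_reason)           # empty delta carries the finish_reason


def to_sse(chunks: Iterable[dict]) -> Iterator[str]:
    """Frame each chunk as a server-sent event, then send the terminator."""
    for c in chunks:
        yield f"data: {json.dumps(c)}\n\n"
    yield "data: [DONE]\n\n"
```

A client reconstructs the message by concatenating every delta's content, which is why an empty piece (a token the detokenizer or a parser is still holding back) should simply produce no chunk.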
Going deeper
The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.
Phase 16 — Cheatsheet: Serving APIs & Parsers
- vllm serve -> FastAPI -> serving_chat -> AsyncLLM. Speaks OpenAI + Anthropic + gRPC.
- Chat template turns messages -> prompt tokens. SSE for streaming deltas.
- Tool/reasoning parsers are pluggable and per-model; streaming makes them partial-parse.
Key upstream files
- vllm/entrypoints/openai/api_server.py
- vllm/entrypoints/openai/serving_chat.py
- vllm/entrypoints/openai/tool_parsers/
- vllm/entrypoints/openai/reasoning_parsers/
- vllm/entrypoints/
Full reference: 00-guide.md · 01-deep-dive.md
Phase 17 — Hardware Backends & Plugins
Contents
- Don't Panic
- Why this phase matters
- What you'll learn
- The map: where this lives in the real code
- Labs in this phase
- How to work this phase
- Where you are
Don't Panic
vLLM runs on NVIDIA, AMD, CPUs, TPUs, Gaudi, and more. It does this by hiding every hardware difference behind a Platform abstraction and a plugin system, so the engine code stays hardware-agnostic and new accelerators arrive as plugins. This phase is that abstraction — and you'll run the CPU backend with no GPU at all.
Why this phase matters
Hardware breadth is a strategic advantage (GPU supply, cost arbitrage) and the Platform abstraction is a clean piece of architecture worth studying. Knowing where the seams are lets you reason about porting and about why a feature is available on one backend but not another.
What you'll learn
- The Platform abstraction: device type, attention backend default, capabilities
- How the engine queries the platform instead of hardcoding CUDA
- The out-of-tree plugin system (entry points) for new hardware
- CPU backend: what changes (no paging kernels? threading? dtype support)
- Why some features are platform-gated (FP8, CUDA graphs, certain kernels)
The map: where this lives in the real code
Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see
UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md)
walks through the important ones line by line.
- vllm/platforms/interface.py: the Platform base class, the contract every backend implements.
- vllm/platforms/cuda.py: the NVIDIA platform.
- vllm/platforms/cpu.py: the CPU platform. Read this; you can run it on a laptop.
- vllm/platforms/__init__.py: platform detection/resolution + plugin discovery.
- vllm/plugins/: the plugin loading mechanism.
Labs in this phase
- lab-01-platform-abstraction [CPU-OK]: build the Platform interface, registry, resolver (CPU floor + loud override), then register an out-of-tree platform and change the engine's decisions with zero core edits, plus the duplicate-registration supply-chain guard.
- lab-02-run-cpu-vllm [CPU-OK]: run vLLM on laptop cores and read cpu.py against lab-01's interface: every override checked off, the Phase 1–3 engine untouched. Captured output included.
See labs/README.md for how to run them.
How to work this phase
- Read this guide for intuition.
- Read 01-deep-dive.md with the upstream/ files open.
- Do 02-mini-build.md: build the mini_vllm piece yourself.
- Run the labs, then attempt EXERCISES.md.
- Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.
Where you are
This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully-worked, line-by-line treatment (with starter/solution/test code in every lab) follows the gold-standard set by the flagship phases, Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.
Phase 17 — Deep Dive: Hardware Backends & Plugins
Read this with upstream/ open. Every path is relative to upstream/ at the pinned commit v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number ever drifts, search for the named symbol instead.
Guided reading list
Work through these in order. This is a scaffold: the reading targets and the questions are real; fill in the line-by-line annotations as you go (this is exactly the muscle a maintainer uses — reading unfamiliar code and extracting its contract).
- vllm/platforms/interface.py: the Platform base class, the contract every backend implements.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- vllm/platforms/cuda.py: the NVIDIA platform.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- vllm/platforms/cpu.py: the CPU platform. Read this; you can run it on a laptop.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- vllm/platforms/__init__.py: platform detection/resolution + plugin discovery.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- vllm/plugins/: the plugin loading mechanism.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
Questions to answer as you read
- What does the Platform abstraction expose (device type, default attention backend, capabilities)?
- How does the engine query the platform instead of hardcoding CUDA?
- How does the out-of-tree plugin system (entry points) add new hardware?
- What changes on the CPU backend (no paging kernels? threading? dtype support)?
- Why are some features platform-gated (FP8, CUDA graphs, certain kernels)?
Cross-references
- Intuition: 00-guide.md
- Build it yourself: 02-mini-build.md
- The gold-standard depth to emulate: Phase 02 deep-dive.
Phase 17 — Mini-Build: extend mini_vllm
Your task
Add a 'platform' abstraction to mini_vllm: a base class exposing device/dtype/default-backend, with a CPU implementation, and have the engine consult it instead of hardcoding — mirroring vLLM's Platform.
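One possible starting shape for this task is sketched below. Everything in it is an assumption made for illustration: the method names default_dtype and default_attention_backend, the backend strings, the FakeAccelPlatform class, and the ToyEngine stand-in for mini_vllm's engine are not taken from vLLM or from the course's reference code.

```python
from abc import ABC, abstractmethod


class Platform(ABC):
    """Hypothetical mini_vllm platform contract: the engine asks, the platform answers."""

    device: str

    @abstractmethod
    def default_dtype(self) -> str: ...

    @abstractmethod
    def default_attention_backend(self) -> str: ...


class CpuPlatform(Platform):
    device = "cpu"

    def default_dtype(self) -> str:
        return "float32"

    def default_attention_backend(self) -> str:
        return "naive"  # illustrative name for a plain reference attention


class FakeAccelPlatform(Platform):
    """A pretend accelerator, used only to show that decisions follow the platform."""

    device = "fakeaccel"

    def default_dtype(self) -> str:
        return "bfloat16"

    def default_attention_backend(self) -> str:
        return "fused"


class ToyEngine:
    """Stand-in for the engine: every hardware decision comes from the platform."""

    def __init__(self, platform: Platform) -> None:
        self.device = platform.device
        self.dtype = platform.default_dtype()
        self.attention_backend = platform.default_attention_backend()


print(ToyEngine(CpuPlatform()).attention_backend)        # naive
print(ToyEngine(FakeAccelPlatform()).attention_backend)  # fused
```

The property worth pinning in your test is that ToyEngine contains no device name at all: swapping the platform object is the only way its decisions change.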
Why build it (and not just read it)
Reading the real kernel/feature tells you what production does. Re-implementing a tiny version tells you why every decision was made — which is the understanding that survives into an interview or a 2 a.m. incident. Keep it small; keep it tested.
Method
- Look at the matching real code from 01-deep-dive.md.
- Add your module under mini_vllm/ (or extend an existing one).
- Write a test_*.py next to it that pins the behavior you care about.
- Run pytest mini_vllm -q and keep it green.
Definition of done
- Your component runs on CPU with no extra dependencies (numpy ok).
- A test demonstrates the property this phase is about (not just "it runs").
- You can explain, out loud, how your toy maps to the real implementation and where it intentionally simplifies.
The flagship phases ship complete mini_vllm modules + tests (mini_vllm/block_pool.py, mini_vllm/scheduler.py); use them as your reference for structure and test style.
Phase 17 Labs — Hardware Backends & Plugins
Two labs on the layer that lets one engine speak to any silicon. The arc: build
the platform interface, the registry, and the resolver — then register an
out-of-tree platform and change the engine's decisions with zero core edits
(lab-01); then run the realest possible demonstration — vLLM on your laptop's
CPU, with cpu.py read against the interface you built (lab-02).
CPU labs follow the standard contract — starter.py (your work), solution.py
(reference), test_lab.py (the spec); default runs the solution,
LAB_IMPL=starter grades yours.
# Whole phase:
pytest phase-17-hardware-backends-and-plugins/labs -m "not gpu"
# Grade yourself:
LAB_IMPL=starter pytest phase-17-hardware-backends-and-plugins/labs/lab-01-platform-abstraction -q
Labs
lab-01-platform-abstraction [CPU-OK]
The funnel: a Platform interface answering every hardware question (attention
backend, dtypes, graph support), a registry with a CPU floor and a loud override,
and the test that is the architecture — an out-of-tree "vendor" platform changes
the engine's decisions without touching core. Plus the supply-chain guard:
duplicate registration refused. Skills: the registry trilogy completed (attention
→ models → platforms); capability negotiation over assumption; plugins as additive
hardware support; tests as architecture proofs.
lab-02-run-cpu-vllm [CPU-OK]
vLLM on laptop cores: the platform resolver choosing the floor, Torch SDPA standing
in for flash attention, KV carved from RAM by VLLM_CPU_KVCACHE_SPACE, graphs
degrading to eager — and the whole Phase 1–3 engine running unmodified, because none
of it was ever a GPU concept. Read cpu.py against lab-01 and check off every
override; note what a backend doesn't have to implement. Captured run included
(your tok/s will differ; nothing else will). Skills: knob translation across
platforms; the CPU roofline pricing the 9 tok/s; what to ask a vendor pitching
"vLLM support."
What you can do after this phase
Explain how one engine serves five silicon families; evaluate or review a hardware plugin by what it overrides and what it leaves alone; run and tune vLLM where there is no GPU at all; and place any hardware question ("does X support fp8? graphs? custom all-reduce?") at the platform boundary where its answer lives. Phase 18 measures what all these layers cost; Phase 19 sends you upstream.
Lab 17-01 — The Platform Abstraction: One Engine, Any Silicon [CPU-OK]
vLLM runs on NVIDIA, AMD, Intel GPUs, TPUs, Gaudi, and plain CPUs — and the reason
it can is one interface and one registry: every hardware-specific decision (which
attention backend? which dtypes? are CUDA graphs a thing here?) is asked of a
Platform object, and platforms register into a table that out-of-tree plugins can
join without touching a line of core code. You'll build the whole mechanism small —
the interface, two in-tree platforms, the resolver with its override and its CPU
floor — and then the test that is the architecture: register a third platform from
"outside" and watch the engine's decisions change, core untouched. Plus the security
posture detail most plugin systems forget: duplicate registration is refused,
because a plugin silently shadowing the CUDA platform is a supply-chain incident
wearing a convenience feature.
Contents
- Why this lab exists
- Background: the decisions that funnel through
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
The platform layer is how vLLM scaled organizationally, not just technically: hardware vendors (AMD, Intel, Google, Huawei, IBM) maintain their own backends — some in-tree, some as plugin packages — without serializing through the core team. That only works because the interface is explicit and the extension point is a registry, and the lab's plugin test demonstrates the payoff in its purest form: new silicon support is additive. If you ever bring vLLM to new hardware (a real career path — ask the Spyre and Ascend teams), this lab is the map of what you'll implement; if you review plugin PRs, it's the map of what to check.
The design pattern is also the course's registry trilogy completed: attention
backends (Phase 4's selector), model architectures (Phase 14's registry), and now
platforms — three tables, one philosophy: core code asks "who handles this?"
instead of knowing. Each table is also a place where Phase 4 lab-02's bisection
move works (override exists at every layer for exactly that reason).
Background: the decisions that funnel through
The real Platform interface (upstream/vllm/platforms/interface.py) answers, per
hardware: which attention backend class (this is literally where Phase 4's
selector gets its platform default), supported dtypes (your check_dtype is the
negotiation — bf16 everywhere, fp16 not on CPU, fp8 only on Hopper+-class),
device introspection (memory totals — Phase 2 lab-03's carving needs to ask
someone), graph capture support (Phase 5 is a no-op on CPU), and communicator
choices (Phase 10's collectives differ per fabric). Resolution happens once at
import/startup: detect devices → consult the registry → (or honor the override) →
fall back to CPU, the platform that always exists — the floor that makes "no
accelerator detected" a slow day instead of a crash.
Plugins join via Python entry points: installing vllm-ascend registers its
platform at import time — your register_platform, with packaging around it. The
refuse-duplicates rule is the trust boundary: in-tree names are spoken for.
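The resolution order and the trust boundary described above fit in a small sketch. The function names match the lab's file list, but the signatures, the explicit registry dictionary, the dataclass fields, and the backend strings are assumptions for illustration; this is neither the lab's reference solution nor vLLM's real interface.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Platform:
    name: str
    attention_backend: str
    supported_dtypes: FrozenSet[str]
    supports_graphs: bool

    def check_dtype(self, requested: str) -> str:
        # Negotiate instead of crashing: unsupported requests fall back to float32.
        return requested if requested in self.supported_dtypes else "float32"


def register_platform(registry: Dict[str, Platform], platform: Platform) -> None:
    if platform.name in registry:
        raise ValueError(f"platform {platform.name!r} is already registered")
    registry[platform.name] = platform


def make_default_platforms() -> Dict[str, Platform]:
    registry: Dict[str, Platform] = {}
    register_platform(registry, Platform(
        "cuda", "flash_attn", frozenset({"float32", "float16", "bfloat16"}), True))
    register_platform(registry, Platform(
        "cpu", "torch_sdpa", frozenset({"float32", "bfloat16"}), False))
    return registry


def resolve_platform(registry: Dict[str, Platform], detected: List[str],
                     override: Optional[str] = None) -> Platform:
    if override is not None:
        if override not in registry:
            raise KeyError(f"unknown platform override: {override!r}")  # typos fail loudly
        return registry[override]
    for name in detected:
        if name != "cpu" and name in registry:
            return registry[name]          # first detected accelerator wins
    return registry["cpu"]                 # the floor that always exists
```

Passing the registry explicitly (rather than using a module-level global) is a choice made here so the example stays testable; a plugin in this sketch is simply any code that calls register_platform with a new name.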
Files
- starter.py: Platform.check_dtype, register_platform, resolve_platform, make_default_platforms. Your work.
- solution.py: reference.
- test_lab.py: accelerator preference, the CPU floor, override + loud unknowns, dtype negotiation, the out-of-tree plugin, and the duplicate refusal.
Run
LAB_IMPL=starter pytest phase-17-hardware-backends-and-plugins/labs/lab-01-platform-abstraction -q
pytest phase-17-hardware-backends-and-plugins/labs/lab-01-platform-abstraction -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_resolution_prefers_the_accelerator | Detection order: the GPU wins when present, and with it come flash_attn and graphs; the decisions travel as a bundle |
| test_cpu_is_the_floor | Empty device list still resolves: vLLM always has somewhere to run, which is why lab-02 works at all |
| test_override_wins_and_unknown_is_loud | The bisection hook (Phase 4 lab-02's reflex, platform edition), and typos fail fast instead of silently falling back |
| test_dtype_negotiation | Unsupported dtype → float32, never a crash mid-load: capability mismatches are negotiated at the boundary |
| test_out_of_tree_plugin_changes_decisions_without_core_edits | The architecture: a "vendor" registers mytpu, resolution returns it, the attention backend is now pallas, and the diff to core is zero lines |
| test_duplicate_registration_is_refused | A plugin cannot shadow cpu or cuda: the supply-chain guard, as an assert |
Hitchhiker's notes
- Find your functions upstream: upstream/vllm/platforms/interface.py (Platform, with ~30 methods where you wrote 1; same skeleton), upstream/vllm/platforms/__init__.py (detection + resolution + plugin loading: your resolve_platform with the entry-point scan), and any of cuda.py / cpu.py / rocm.py / tpu.py as the in-tree implementations. Read cpu.py with lab-02; its overrides are exactly the decision list above.
- The plugin mechanism is general: vLLM's plugin system (upstream/vllm/plugins/) loads any registered entry point at startup: platforms, but also out-of-tree models (Phase 14's registry accepts plugins the same way) and custom components. One loading mechanism, many tables. When you see VLLM_PLUGINS in an environment, this is what it gates.
- Why funnel rather than if torch.cuda.is_available() sprinkled everywhere? Because the sprinkled version is what most codebases have, and it makes new hardware a grep-and-pray refactor across hundreds of sites. The funnel makes it one class. The lab's plugin test is unwritable against sprinkled conditionals, which is the test-as-architecture-proof point again (Phase 14's tripwire, in registry form).
- Capability negotiation beats capability assumption: check_dtype's fall-to-float32 is a microcosm of how the whole layer behaves. Requests for the unsupported degrade explicitly (with a warning upstream) rather than crashing or, worse, silently miscomputing. Every backend boundary in your own systems deserves the same negotiation shape.
Going further
- Wire it into mini_vllm: give LLMEngine a platform parameter whose attention_backend string selects between two toy attention impls (both correct, different "hardware"). The Phase 14 lab-01 tripwire test then proves the engine consults only the platform: the funnel, enforced.
- Add get_device_memory() per platform and route Phase 2 lab-03's blocks-from-bytes carving through it. The startup ritual becomes platform-portable, which is precisely how the real worker does it.
- Simulate the entry-point load: a plugins/ dict of callables, each registering a platform; load them in sorted order and re-run the duplicate test. Then consider: what should happen when two plugins collide? (Upstream: first wins + a warning. Reasonable people disagree; write down the trade.)
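The simulated entry-point load from the last exercise could look like the sketch below. It implements the "first wins, plus a warning" collision policy as one option to compare against the lab's refuse-with-an-error rule. The load_plugins name, the convention that each plugin callable returns a (platform name, attention backend) pair, and the dictionary-of-strings registry are all simplifications invented for this example.

```python
import warnings
from typing import Callable, Dict, Tuple


def load_plugins(plugins: Dict[str, Callable[[], Tuple[str, str]]],
                 registry: Dict[str, str]) -> Dict[str, str]:
    """Load plugins in sorted name order; on a name collision keep the first and warn."""
    for plugin_name in sorted(plugins):          # sorted order makes the outcome deterministic
        platform_name, backend = plugins[plugin_name]()
        if platform_name in registry:
            warnings.warn(
                f"plugin {plugin_name!r} tried to re-register {platform_name!r}; "
                "keeping the existing entry"
            )
            continue
        registry[platform_name] = backend
    return registry
```

The trade to write down: a warning keeps the process running when two vendor packages overlap, while an error (the lab's choice) makes the conflict impossible to miss, which matters most when the colliding name is an in-tree one such as cuda.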
References
- upstream/vllm/platforms/interface.py: the real Platform.
- upstream/vllm/platforms/__init__.py: detection, resolution, plugin loading.
- vLLM docs, vLLM Plugin System: https://docs.vllm.ai/en/latest/design/plugin_system.html
- Phase 4 lab-02 (attention selector) and Phase 14 lab-01 (model registry): the other two tables in the trilogy.
Lab 17-02 — Run vLLM on CPU, and Read What the Platform Overrode [CPU-OK]
The one GPU-flavored lab in this course that genuinely needs no GPU: install
vLLM's CPU backend, serve a tiny model on your laptop cores, and then read cpu.py
against lab-01's interface to see exactly which decisions the platform redirected —
attention backend swapped, CUDA graphs gone, KV cache carved from RAM by a different
knob (VLLM_CPU_KVCACHE_SPACE instead of gpu_memory_utilization). Same engine,
same scheduler, same paged KV, different silicon — Phase 1–3's machinery proving
itself hardware-agnostic before your eyes.
The captured run below is from a 16-core laptop; yours will differ in tok/s and nothing else. That's the lesson.
Contents
- Why this lab exists
- Requirements
- Steps
- Captured output (real run, Qwen2.5-0.5B, 16-core CPU, vLLM 0.22.1, trimmed)
- Reading cpu.py against lab-01
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Three reasons, ascending. Practically: CPU vLLM is real deployment surface — CI
pipelines, edge boxes, air-gapped environments, and cost-floor serving of small
models all use it, and its knobs differ enough from CUDA's to merit one deliberate
run. Pedagogically: it's the existence proof for lab-01's architecture — every
phase of this course you learned on GPU concepts (paged KV, continuous batching,
chunked prefill) executes here unmodified, because none of them were ever GPU
concepts; they were engine concepts, and the platform layer is what kept them so.
Strategically: reading cpu.py teaches you the size of a backend — it's a short
file, and "supporting new hardware is a short file plus kernels" is the fact that
makes Phase 17's vendor-plugin world believable.
Requirements
# CPU wheels/build per the official guide (the pip default wheel is CUDA-flavored):
# https://docs.vllm.ai/en/latest/getting_started/installation/cpu.html
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
Steps
VLLM_CPU_KVCACHE_SPACE=4 python -c "
from vllm import LLM, SamplingParams

llm = LLM(model='Qwen/Qwen2.5-0.5B-Instruct', dtype='bfloat16', max_model_len=1024)
# temperature=0 selects greedy decoding
outputs = llm.generate(['The CPU backend exists because'],
                       SamplingParams(max_tokens=48, temperature=0))
print(outputs[0].outputs[0].text)
"
Three observations to collect: the startup log naming the platform and attention
backend (compare with any GPU capture from earlier phases); the KV cache sized from
VLLM_CPU_KVCACHE_SPACE (gigabytes of RAM — Phase 2 lab-03's carving with a new
budget source); and your tok/s (single-digit to low-double — see the roofline note).
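The second observation, a byte budget turning into a block count, is plain arithmetic. The sketch below uses a deliberately made-up toy model shape (4 layers, 2 KV heads, head dimension 64, 2-byte elements, 16-token blocks) and hypothetical function names; it is not expected to reproduce the block count in the captured log, because the real backend's block size and cache layout are not modeled here.

```python
GIB = 1024 ** 3


def kv_bytes_per_block(block_size: int, num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    # K and V each hold one head_dim vector per KV head, per layer, per token slot.
    return 2 * block_size * num_layers * num_kv_heads * head_dim * dtype_bytes


def blocks_from_budget(budget_gib: float, **shape: int) -> int:
    """How many whole KV blocks fit in an absolute budget given in GiB."""
    return int(budget_gib * GIB) // kv_bytes_per_block(**shape)


toy_shape = dict(block_size=16, num_layers=4, num_kv_heads=2, head_dim=64, dtype_bytes=2)
print(kv_bytes_per_block(**toy_shape))     # 32768 bytes per block
print(blocks_from_budget(4, **toy_shape))  # 131072 blocks in a 4 GiB budget
```

The only thing that changes between CPU and GPU in this calculation is where the budget number comes from: an absolute GiB value here, a fraction of device memory there.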
Captured output (real run, Qwen2.5-0.5B, 16-core CPU, vLLM 0.22.1, trimmed)
INFO ... Using CPU platform. # lab-01's resolver, choosing the floor
INFO ... Using Torch SDPA backend. # the platform's attention answer
INFO ... CPU KV cache space: 4 GiB # VLLM_CPU_KVCACHE_SPACE, not gpu_mem_util
INFO ... # CPU blocks: 13,107 # same BlockPool, RAM-backed
WARNING ... CUDA graphs are not supported ... falling back to eager
the CPU backend exists because not every deployment has a GPU ...
# generation: ~9 tok/s single stream (16 cores, bf16)
Reading cpu.py against lab-01
Open upstream/vllm/platforms/cpu.py next to your lab-01 Platform and check off
the decisions: get_attn_backend_cls → Torch SDPA (the platform is Phase 4's
selector for this hardware); dtype checks (fp16 discouraged on CPU — your
check_dtype negotiation, with a warning); graphs unsupported (Phase 5 short-
circuits — note the engine degrades, not crashes: eager mode was always a valid
path); memory introspection reading system RAM. Then notice what's absent: nothing
about schedulers, blocks, batching, or sampling. The platform overrides the
hardware-touching edge and only the edge — lab-01's funnel, confirmed by reading
what a real backend did not have to implement.
Hitchhiker's notes
- The performance is honest, and the roofline explains it (Phase 0 lab-04 with CPU constants): ~50 GB/s of DRAM bandwidth vs a GPU's ~2,000 GB/s. Decode streams every weight once per token, so a 1 GB-weight model has a ceiling near 50 GB/s ÷ 1 GB ≈ 50 tok/s, and the captured ~9 tok/s sits below that ceiling, as measured runs do once achieved (rather than peak) bandwidth and per-step overheads are counted. CPU serving is bandwidth-priced: same physics, smaller numbers. That is also why small models + quantization (fewer bytes!) are disproportionately effective here.
- Knob translation table: gpu_memory_utilization → VLLM_CPU_KVCACHE_SPACE (absolute GiB; RAM isn't pre-carved like HBM); TP within a node → multiple NUMA-pinned CPU "devices" (VLLM_CPU_OMP_THREADS_BIND); graphs → nothing (eager always). The concepts you tuned all course exist; the spellings moved to where the hardware's truth lives.
- CI is the killer app: vLLM's own test suite exercises engine logic on CPU runners constantly. Correctness of schedulers and parsers doesn't need an A100 (this course's whole premise, which the project itself relies on).
- From cpu.py to a vendor plugin is a difference of packaging, not kind: vllm-ascend, vllm-spyre and friends are out-of-tree cpu.py-shaped files plus kernels, registered through lab-01's entry-point mechanism. After reading one in-tree backend, you can review (or write) an out-of-tree one.
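The roofline arithmetic in the first note can be checked by hand. The sketch below assumes single-stream decode in which every generated token streams all weights from memory once; the 50 GB/s, 2,000 GB/s, 1 GB, and 9 tok/s figures are the ones quoted in this lab, and the function names are illustrative.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Ideal bandwidth-bound decode rate: one full pass over the weights per token."""
    return bandwidth_gb_s / weight_gb


def implied_bandwidth_gb_s(measured_tok_s: float, weight_gb: float) -> float:
    """Weight traffic a measured decode rate actually corresponds to."""
    return measured_tok_s * weight_gb


cpu_ceiling = decode_ceiling_tok_s(50.0, 1.0)     # 50.0 tok/s
gpu_ceiling = decode_ceiling_tok_s(2000.0, 1.0)   # 2000.0 tok/s
achieved = implied_bandwidth_gb_s(9.0, 1.0)       # 9.0 GB/s
fraction_of_peak = achieved / 50.0                # 0.18
print(cpu_ceiling, gpu_ceiling, achieved, fraction_of_peak)
```

The ideal ceiling and the measured rate are separate numbers: with these constants the captured run reaches about 18% of the CPU's bandwidth ceiling, and that fraction is a useful single figure to track when tuning thread binding or switching to a quantized model.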
Reflect
- List which course phases' machinery you just watched run unchanged on CPU, and which were platform-swapped. (Unchanged: 1, 2, 3, 9, 12, 16, i.e. the engine and text layers. Swapped: 4's backend choice, 5 disabled, 7's kernels, 0/18's constants.) The ratio is the architecture's grade.
- Why is VLLM_CPU_KVCACHE_SPACE absolute GiB while the GPU knob is a fraction? (HBM is the engine's to claim, a fraction of a dedicated resource; RAM is shared with the OS and everything else, so an absolute budget is the honest contract. Knob design encodes resource ownership.)
- A vendor pitches you "vLLM support" for their accelerator. From this phase, what three artifacts do you ask to see? (Their platform class and what it overrides; their attention backend's correctness story against Phase 4's reference shapes; benchmark constants for the Phase 0 lab-04 roofline so claims can be checked.)
References
- upstream/vllm/platforms/cpu.py: the backend under read.
- vLLM docs, CPU installation: https://docs.vllm.ai/en/latest/getting_started/installation/cpu.html
- vLLM docs, Plugin System (the out-of-tree path this is the in-tree template for): https://docs.vllm.ai/en/latest/design/plugin_system.html
- Lab-01: the interface this file implements; Phase 0 lab-04: the physics that prices the capture's 9 tok/s.
Phase 17 — Exercises: Hardware Backends & Plugins
Work these after the labs. They escalate from "explain it" to "design it" — staff-level means you can do the last ones cold.
- List 3 decisions the Platform abstraction centralizes and why hardcoding them would hurt.
- Why is FP8 / CUDA-graph support platform-gated?
- How would a new accelerator vendor add support without forking vLLM?
Self-grading
For each: could you (a) explain it to a teammate in 2 minutes, and (b) point to the exact
upstream/ file that proves your answer? If not, re-read the matching anchor in
01-deep-dive.md.
Phase 17 — Interview Questions: Hardware Backends & Plugins
Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)
Q1. How does vLLM support so many hardware backends without forking the engine?
Model answer
A Platform abstraction centralizes hardware-specific choices (device, default attention backend, supported dtypes, capabilities), and the engine queries it instead of hardcoding CUDA. New hardware can register out-of-tree via the plugin entry-point system, so vendors add support without modifying core code.
Q2. Why can you run vLLM on a CPU at all, and what's different?
Model answer
The CPU platform provides CPU-appropriate kernels and disables GPU-only features (certain fused/quant kernels, CUDA graphs). It's slower but lets you develop and test the engine logic — exactly what the [CPU-OK] labs in this course rely on.
Going deeper
The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.
Phase 17 — Cheatsheet: Hardware Backends & Plugins
- Platform abstraction = one place for device/dtype/default-backend/capabilities.
- Engine asks the Platform; it never hardcodes CUDA.
- New hardware = out-of-tree plugin via entry points.
- CPU backend runs on a laptop (same scheduler and paged KV; Torch SDPA attention, no CUDA graphs), great for learning.
Key upstream files
- vllm/platforms/interface.py
- vllm/platforms/cuda.py
- vllm/platforms/cpu.py
- vllm/platforms/__init__.py
- vllm/plugins/
Full reference: 00-guide.md · 01-deep-dive.md
Phase 18 — Performance Engineering
Contents
- Don't Panic
- Why this phase matters
- What you'll learn
- The map: where this lives in the real code
- Labs in this phase
- How to work this phase
- Where you are
Don't Panic
Now you make it FAST and prove it. This phase is the engineer's loop: measure (TTFT, ITL, throughput) with the right tools, find the bottleneck (CPU launch? memory? a kernel?), turn the right knob (batch size, token budget, memory utilization, graphs, quant), and re-measure. It's the meta-skill that ties phases 2–17 together.
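The three metrics named in the loop can all be derived from per-token timestamps. The sketch below is a minimal illustration with a hypothetical request_metrics helper and made-up timestamps; it is not vLLM's metrics code, and real benchmarks report percentiles over many requests rather than one request's mean.

```python
from statistics import mean
from typing import Dict, List


def request_metrics(arrival: float, token_times: List[float]) -> Dict[str, float]:
    """TTFT, mean inter-token latency, and end-to-end tok/s for one request (seconds)."""
    ttft = token_times[0] - arrival                       # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0                     # inter-token latency
    total = token_times[-1] - arrival
    return {"ttft": ttft, "itl": itl, "tok_per_s": len(token_times) / total}


# Illustrative timestamps: request arrives at t=0, four tokens follow.
m = request_metrics(arrival=0.0, token_times=[0.5, 0.6, 0.7, 0.8])
print(m)
```

In this made-up trace the first token dominates the request's latency (0.5 s of 0.8 s), which is the typical shape when prefill or queueing, rather than decode, is the bottleneck.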
Why this phase matters
This is the daily job of a staff inference engineer and the thing startups live or die on (cost/token). Being able to read a profile, reason with a roofline, and tune vLLM's knobs methodically is what separates senior from staff.
What you'll learn
- Metrics that matter: throughput (tok/s), TTFT, ITL/TPOT, goodput, latency percentiles
- Little's Law and how batch size, arrival rate, and latency relate
- The roofline model: compute-bound vs memory-bound; arithmetic intensity
- Profiling: the torch profiler, Nsight Systems, and vLLM's own metrics
- The knobs: max_num_seqs, max_num_batched_tokens, gpu_memory_utilization, enable_chunked_prefill, CUDA graphs, quant, spec decode
- Benchmarking properly: vllm bench, warmup, steady state, fair comparisons
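Little's Law ties three of the quantities in this list together: in steady state, the mean number of requests in flight equals the arrival rate times the mean time each request spends in the system. The sketch below uses hypothetical helper names and illustrative numbers, and it treats latency as fixed, which is a simplification because real latency grows with batch size.

```python
def in_flight(arrival_rate_rps: float, mean_latency_s: float) -> float:
    """Little's Law: L = lambda * W (mean concurrent requests in steady state)."""
    return arrival_rate_rps * mean_latency_s


def max_sustainable_rate(max_num_seqs: int, mean_latency_s: float) -> float:
    """Arrival rate at which the running batch is exactly full, for a given latency."""
    return max_num_seqs / mean_latency_s


print(in_flight(4.0, 2.5))               # 10.0 requests in flight
print(max_sustainable_rate(256, 2.5))    # 102.4 requests per second
```

Read against the knobs: if the in-flight count implied by your traffic exceeds max_num_seqs, requests queue and TTFT rises, so the law gives a quick sanity check before any grid search.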
The map: where this lives in the real code
Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see
UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md)
walks through the important ones line by line.
- benchmarks/: the benchmark suite (throughput, latency, serving).
- vllm/benchmarks/: the 'vllm bench' implementation.
- vllm/v1/metrics/: the metrics/stats the engine exposes (Prometheus + logging).
- vllm/v1/metrics/stats.py: SchedulerStats / IterationStats, what's measured each step.
- vllm/config/scheduler.py: the tuning knobs and their defaults/semantics.
Labs in this phase
- lab-01-tune-the-knobs [CPU-OK]: build the full tuning loop on mini_vllm: arrival schedules (queueing enters the course), TTFT/spike/steps metrics, and an SLO-constrained grid search that refuses impossible SLOs, with two measured surprises about the chunk threshold.
- lab-02-benchmark-real-vllm [GPU-OPT]: the same loop with wall-clocks: vllm bench serve sweeps, the rate-sweep knee found first, percentiles everywhere, and the one-page tuning report as the deliverable. Captured numbers included.
See labs/README.md for how to run them.
How to work this phase
- Read this guide for intuition.
- Read 01-deep-dive.md with the upstream/ files open.
- Do 02-mini-build.md: build the mini_vllm piece yourself.
- Run the labs, then attempt EXERCISES.md.
- Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.
Where you are
This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully-worked, line-by-line treatment (with starter/solution/test code in every lab) follows the gold standard set by the flagship phases: Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.
Phase 18 — Deep Dive: Performance Engineering
Read this with upstream/ open. Every path is relative to upstream/ at the pinned commit v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number ever drifts, search for the named symbol instead.
Guided reading list
Work through these in order. This is a scaffold: the reading targets and the questions are real; fill in the line-by-line annotations as you go (this is exactly the muscle a maintainer uses — reading unfamiliar code and extracting its contract).
- benchmarks/: The benchmark suite (throughput, latency, serving).
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- vllm/benchmarks/: The 'vllm bench' implementation.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- vllm/v1/metrics/: The metrics/stats the engine exposes (Prometheus + logging).
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- vllm/v1/metrics/stats.py: SchedulerStats / IterationStats, i.e. what's measured each step.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- vllm/config/scheduler.py: The tuning knobs and their defaults/semantics.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
Questions to answer as you read
- Which metrics matter (throughput in tok/s, TTFT, ITL/TPOT, goodput, latency percentiles), and where does the code compute each one?
- How do batch size, arrival rate, and latency relate under Little's Law?
- In the roofline model, what makes a workload compute-bound vs memory-bound, and what does arithmetic intensity measure?
- What does each profiling source show you: the torch profiler, Nsight Systems, and vLLM's own metrics?
- What does each knob control: max_num_seqs, max_num_batched_tokens, gpu_memory_utilization, enable_chunked_prefill, CUDA graphs, quant, spec decode?
- What makes a benchmark trustworthy: vllm bench, warmup, steady state, fair comparisons?
Cross-references
- Intuition: 00-guide.md
- Build it yourself: 02-mini-build.md
- The gold-standard depth to emulate: Phase 02 deep-dive.
Phase 18 — Mini-Build: extend mini_vllm
Your task
Add a metrics collector to mini_vllm (tokens/step, batch size, KV usage, preemptions) and a tiny benchmark that sweeps max_num_batched_tokens to find the throughput knee — the real tuning loop in miniature.
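One possible shape for the task, as a minimal sketch rather than the reference solution: `MetricsCollector`, `find_knee`, and `toy_run` are hypothetical names, and `toy_run` is a stand-in for your own mini_vllm loop (it simply saturates at 64 tokens per step so the knee is visible).

```python
from dataclasses import dataclass, field


@dataclass
class MetricsCollector:
    """Hypothetical per-step recorder: call record() once per engine step."""
    tokens: list = field(default_factory=list)
    batch_sizes: list = field(default_factory=list)
    kv_usage: list = field(default_factory=list)
    preemptions: int = 0

    def record(self, num_tokens, batch_size, kv_used_frac, preempted=0):
        self.tokens.append(num_tokens)
        self.batch_sizes.append(batch_size)
        self.kv_usage.append(kv_used_frac)
        self.preemptions += preempted

    def summary(self):
        steps = len(self.tokens)
        return {
            "steps": steps,
            "tokens_per_step": sum(self.tokens) / steps if steps else 0.0,
            "mean_batch": sum(self.batch_sizes) / steps if steps else 0.0,
            "peak_kv": max(self.kv_usage, default=0.0),
            "preemptions": self.preemptions,
        }


def find_knee(budgets, run, tolerance=0.05):
    """Smallest budget whose tokens/step is within `tolerance` of the best."""
    if not budgets:
        raise ValueError("no budgets supplied")
    tps = {b: run(b).summary()["tokens_per_step"] for b in sorted(budgets)}
    best = max(tps.values())
    # dicts keep insertion order, so this scans budgets from small to large
    return next(b for b, v in tps.items() if v >= (1.0 - tolerance) * best)


def toy_run(budget, total_tokens=256, saturation=64):
    """Toy stand-in for the engine: each step moves min(budget, saturation) tokens."""
    metrics = MetricsCollector()
    remaining = total_tokens
    while remaining > 0:
        n = min(budget, saturation, remaining)
        metrics.record(num_tokens=n, batch_size=1, kv_used_frac=0.0)
        remaining -= n
    return metrics


if __name__ == "__main__":
    print(find_knee([16, 32, 64, 128], toy_run))
```

Swapping `toy_run` for a function that drives your scheduler with `max_num_batched_tokens=budget` keeps the sweep unchanged; only the measurement behind it gets more honest.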
Why build it (and not just read it)
Reading the real kernel/feature tells you what production does. Re-implementing a tiny version tells you why every decision was made — which is the understanding that survives into an interview or a 2 a.m. incident. Keep it small; keep it tested.
Method
- Look at the matching real code from 01-deep-dive.md.
- Add your module under mini_vllm/ (or extend an existing one).
- Write a test_*.py next to it that pins the behavior you care about.
- Run pytest mini_vllm -q and keep it green.
Definition of done
- Your component runs on CPU with no extra dependencies (numpy ok).
- A test demonstrates the property this phase is about (not just "it runs").
- You can explain, out loud, how your toy maps to the real implementation and where it intentionally simplifies.
The flagship phases ship complete mini_vllm modules + tests (mini_vllm/block_pool.py, mini_vllm/scheduler.py). Use them as your reference for structure and test style.
Phase 18 Labs — Performance Engineering
Two labs, one loop: define metrics → measure a workload under a config → search
under an SLO constraint. First built cheap — a simulator over mini_vllm with
arrival schedules, spike proxies, and a grid search that refuses impossible SLOs
(lab-01) — then run for real with vllm bench serve, wall-clocks, percentile
distributions, and the tuning report as the deliverable artifact (lab-02).
CPU labs follow the standard contract — starter.py (your work), solution.py
(reference), test_lab.py (the spec); default runs the solution,
LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-18-performance-engineering/labs -m "not gpu"
# Grade yourself:
LAB_IMPL=starter pytest phase-18-performance-engineering/labs/lab-01-tune-the-knobs -q
Contents
- lab-01-tune-the-knobs [CPU-OK]
- lab-02-benchmark-real-vllm [GPU-OPT]
- What you can do after this phase
Labs
lab-01-tune-the-knobs [CPU-OK]
The loop, built cheap: a simulator running arrival schedules (queueing finally enters the course) with three metrics — TTFT-from-arrival, worst-step tokens (the spike proxy), total steps — and an SLO-constrained grid search that breaks ties toward latency and raises on unsatisfiable SLOs. Two measured surprises: the chunk threshold is per-request (only the budget caps a step globally), and chunking can cost zero throughput when decode steps are already there to hide chunks in. Skills: constraints beat preferences; metric calibration tests; cheap models with known biases shrink expensive searches.
lab-02-benchmark-real-vllm [GPU-OPT]
The loop, run for real: vllm bench serve sweeps with warm servers, two runs per
config, percentiles everywhere, the rate sweep that finds the knee first — and
the tuning report as the artifact (workload, table, distributions, recommendation
with its trade named). The captured sweep reconciles every row against the CPU labs
that predicted it. Skills: the four methodology checks; conservation of suffering
(admission knobs relocate latency); macro before micro; benchmark at the knee.
What you can do after this phase
Run a tuning engagement end to end: state the workload, find the knee, sweep one knob at a time with honest variance, report distributions, and recommend with the trade named — having prototyped the search cheaply enough to afford it. You can also audit anyone else's benchmark in about a minute (workload? warm? percentiles? one knob?), which is its own kind of superpower. Phase 19 sends everything upstream.
Lab 18-01 — Tune the Knobs: an SLO-Constrained Grid Search [CPU-OK]
Performance engineering is one loop, run with discipline: define metrics, measure a
workload under a config, search the config space under a constraint. This lab has
you build the whole loop on mini_vllm — a simulator that runs an arrival schedule
(requests landing at different times, the thing every previous lab simplified away)
and emits three metrics: per-request TTFT, the worst step's token count (the
ITL-spike proxy), and total steps (throughput). Then grid_search sweeps budget ×
chunk-threshold under a hard spike SLO and returns the best legal config — refusing
loudly when no config qualifies, because a quietly violated SLO is the worst outcome
in the trade. Along the way the tests teach two facts that surprise most tuners: the
chunk threshold is per-request (two chunked prefills still stack in one step —
only the budget caps globally), and chunking can cost zero throughput when a long
decode stream's steps are already there to hide the chunks in.
Contents
- Why this lab exists
- Background: metrics, then search
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Every knob in this course got its own lab; this one is where they meet a workload
— and workloads, not knobs, are what you actually tune for. The arrival schedule is
the lab's quiet upgrade over everything before it: queueing (TTFT now includes
waiting — test_queueing_shows_up_in_ttft), interference between requests that
arrive at different moments, and the SLO-vs-throughput tension that only exists when
both matter at once. The simulator is deliberately the cheapest possible version of
the loop (steps and token counts, no GPUs, milliseconds per run) because the
methodology is the deliverable: lab-02 runs the identical loop with vllm bench and
wall-clocks, and the only thing that changes is the cost of each measurement —
which is exactly why you prototype the search cheap.
The grid search's design choices are the staff-engineer content: the SLO is a
constraint, not a weighted term (latency SLOs are promises, not preferences);
ties break toward lower worst-TTFT (when throughput is equal, take the latency);
and unsatisfiability raises (test_unsatisfiable_slo_is_loud) — the tuning loop's
version of the course's loud-failure habit, because "best effort" on an impossible
SLO ships a violation with extra steps.
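The three design choices can be sketched in a few lines. This is an illustration under assumptions, not the lab's reference solution: the `Metrics` fields follow the metric names used in this lab, but the `grid_search` signature and the `simulate(budget, threshold)` callable are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Metrics:
    ttft_steps: tuple       # per-request steps from arrival to first token
    max_step_tokens: int    # worst step's scheduled tokens (the spike proxy)
    total_steps: int        # schedule length (inverse-throughput proxy)


def grid_search(simulate, budgets, thresholds, max_spike):
    """Return the (budget, threshold) with the fewest steps among SLO-legal configs."""
    legal = []
    for budget, threshold in product(budgets, thresholds):
        m = simulate(budget, threshold)
        if m.max_step_tokens > max_spike:
            continue  # the SLO is a hard constraint, never a weighted term
        # Sort key: throughput first, worst TTFT as the tiebreak.
        legal.append((m.total_steps, max(m.ttft_steps), budget, threshold))
    if not legal:
        # Loud failure: never hand back the least-bad violation.
        raise ValueError(f"no config keeps max_step_tokens <= {max_spike}")
    _, _, budget, threshold = min(legal)
    return budget, threshold
```

Because the key is a tuple, `min` compares `total_steps` first and only looks at worst TTFT when throughput ties, which is the "when throughput is equal, take the latency" rule stated above.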
Background: metrics, then search
The three metrics, and what each proxies:
- ttft_steps: steps from arrival (not admission!) to first token. Queueing, scheduling, prefill: all of it. The user-facing wait.
- max_step_tokens: the worst step's total scheduled tokens ≈ the worst inter-token stall any decoding user felt (Phase 3 lab-05's proxy, now a tunable's objective).
- total_steps: the schedule's length ≈ inverse throughput at fixed step cost. (The proxy's known bias: real steps' wall-clock varies with their token count, so total tokens would weight differently; lab-02's wall-clocks settle it.)
The two knobs swept are the course's latency dial (threshold) and throughput dial (budget) — and the search space is tiny on purpose. Real tuning fails far more often from unclear objectives than from undersized grids; get the constraint and the tiebreak right first, then enlarge the grid.
Files
- starter.py: Metrics, simulate (arrivals + the Phase 1 lab-04 probe + first-token tracking), grid_search. Your work.
- solution.py: reference.
- test_lab.py: monotonicity, the per-request-vs-global cap lesson, queueing in TTFT, arrival-relative measurement, SLO compliance, and loud unsatisfiability.
Run
LAB_IMPL=starter pytest phase-18-performance-engineering/labs/lab-01-tune-the-knobs -q
pytest phase-18-performance-engineering/labs/lab-01-tune-the-knobs -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_throughput_more_budget_never_more_steps | The sanity direction every tuning loop needs before it can be trusted with anything subtle |
| test_spike_threshold_is_per_request_budget_is_global | The two-cap structure: threshold=32 still allows a 65-token step (two chunks + a decode); only budget=40 forces ≤ 40. And the surprise: chunking cost zero steps here, because the 24-token decode stream's steps were already there to hide chunks in (Sarathi's piggybacking, measured from the throughput side). A lonely fat prompt, with nowhere to hide, pays the full chunking step-count bill |
| test_queueing_shows_up_in_ttft | max_num_seqs=1 makes later arrivals wait, and the metric sees it; TTFT without queueing is a benchmark fiction |
| test_ttft_is_measured_from_arrival | The zero-point check: an idle engine serves first tokens in the arrival step. Metrics need calibration tests too |
| test_grid_search_respects_the_slo | The constrained search refuses the throughput-optimal-but-violating config: constraints beat preferences |
| test_unsatisfiable_slo_is_loud | An impossible SLO raises; it does not return the least-bad violation |
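The queueing and zero-point rows are easiest to see in a stripped-down model. The sketch below is not the lab's `simulate`: the function name, the `(arrival_step, steps_holding_slot)` input shape, and the convention that TTFT is 0 when the first token lands in the arrival step are all assumptions made for illustration.

```python
from collections import deque


def ttft_from_arrival(arrivals, max_num_seqs):
    """arrivals: list of (arrival_step, steps_holding_slot), sorted by arrival step.

    Toy model: a request produces its first token in the step it is admitted,
    then holds its slot for `steps_holding_slot` steps in total.
    """
    if max_num_seqs < 1:
        raise ValueError("max_num_seqs must be >= 1")
    waiting = deque()
    release_steps = []              # step at which each occupied slot frees up
    ttft = [None] * len(arrivals)
    next_req, step = 0, 0
    while next_req < len(arrivals) or waiting:
        # Requests that have arrived by now join the FIFO queue.
        while next_req < len(arrivals) and arrivals[next_req][0] <= step:
            waiting.append(next_req)
            next_req += 1
        # Slots whose release step has been reached are free again.
        release_steps = [r for r in release_steps if r > step]
        # Admit in FIFO order while slots remain.
        while waiting and len(release_steps) < max_num_seqs:
            i = waiting.popleft()
            ttft[i] = step - arrivals[i][0]   # measured from arrival, not admission
            release_steps.append(step + arrivals[i][1])
        step += 1
    return ttft


if __name__ == "__main__":
    workload = [(0, 4), (1, 4), (2, 4)]
    print(ttft_from_arrival(workload, max_num_seqs=1))
    print(ttft_from_arrival(workload, max_num_seqs=3))
```

With one slot the later arrivals inherit the earlier requests' decode time as waiting; with three slots every request is served in its arrival step, which is the calibration point the zero-point test pins.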
Hitchhiker's notes
- The hide-the-chunks result generalizes and matters: chunked prefill's throughput cost is max(0, chunk_steps − coexisting_decode_steps)-shaped. Fleets with deep decode streams (chat) chunk nearly free; bursty prefill-only fleets (batch summarization) pay full price, and that's also the fleet that didn't need the latency protection. The knob's cost and its benefit anti-correlate across workloads, which is why per-deployment tuning beats global defaults.
- Arrival schedules are the difference between benchmarks and reality: this lab's three-request workload already produces queueing, interference, and hiding effects no all-at-once batch shows. Real benchmark suites (vllm bench serve) generate Poisson arrivals at a target QPS for the same reason; lab-02 uses exactly that.
- Grid search is the right first search: 6 configs here, exhaustive, done. At real scale (5+ knobs), the same loop wraps Bayesian/successive-halving optimizers, but the metrics, the constraint handling, and the loud unsatisfiability transfer unchanged. The loop is the asset; the optimizer is a plug-in.
- One proxy limitation to carry consciously: step counts can't see fixed per-step overheads (launch costs, scheduler time), so this simulator systematically favors many-small-steps configs vs what wall-clocks will say. Phase 5's whole subject is that bias. Cheap models with known biases, again (Phase 8 lab-04, Phase 15 lab-03): use them to shrink the expensive search, never to replace the final measurement.
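The cost shape from the first note is small enough to write down. This is a sketch of the shape only, not an exact accounting of the simulator: the function names are hypothetical, and taking chunk_steps as ceil(prompt_len / threshold) is an assumption.

```python
import math


def chunk_steps(prompt_len, threshold):
    """Steps needed to prefill one prompt when each step takes at most `threshold` tokens."""
    return math.ceil(prompt_len / threshold)


def chunking_step_cost(prompt_len, threshold, coexisting_decode_steps):
    """Extra steps chunking adds once decode steps that were running anyway absorb the chunks."""
    return max(0, chunk_steps(prompt_len, threshold) - coexisting_decode_steps)
```

A 64-token prompt at threshold=32 next to a 24-step decode stream costs `chunking_step_cost(64, 32, 24)`, which is zero; the same prompt with no decode stream to hide in pays every chunk step.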
Going further
- Add a worst_ttft SLO as a second constraint and find workloads where the two SLOs conflict (spike cap wants small budget; TTFT wants big): the multi-objective frontier, met honestly.
- Generate Poisson arrivals (rng.poisson) at increasing rates and plot worst-TTFT vs offered load for two configs: the hockey stick where queueing takes over is the capacity limit, found by simulation. This completes Phase 3 lab-04's going-further.
- Port simulate's probe to count tokens per step and weight total_steps by a per-step cost model (fixed + per-token). Calibrate the two constants against one lab-02 measurement, then re-run the grid. You've built the cheap-model/expensive-measurement two-tier loop production tuning actually uses.
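For the last item, the fixed + per-token cost model might look like the sketch below. The function name and the constants are placeholders for illustration; real values have to come from a lab-02 measurement.

```python
def estimated_time(step_tokens, fixed_cost, per_token_cost):
    """Weighted replacement for a raw step count: each step costs fixed + per_token * tokens."""
    return sum(fixed_cost + per_token_cost * n for n in step_tokens)


if __name__ == "__main__":
    many_small = [8] * 8    # 8 steps of 8 tokens
    few_large = [32] * 2    # 2 steps of 32 tokens, same 64 tokens in total
    for name, schedule in (("many_small", many_small), ("few_large", few_large)):
        # Placeholder constants, not measured values.
        print(name, len(schedule), estimated_time(schedule, fixed_cost=1.0, per_token_cost=0.1))
```

With `fixed_cost=0` the two schedules cost the same because they move the same tokens; any nonzero fixed term separates them, which is the effect a bare step count cannot express.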
References
- Phase 1 lab-04 (the probe), Phase 3 labs 01/05 (the two caps and the spike): the parts this lab assembles.
- upstream/vllm/benchmarks/ and vllm bench: the production version of this loop (lab-02).
- vLLM docs, Optimization and Tuning: the knobs' official guidance, now checkable against your own search. https://docs.vllm.ai/en/latest/configuration/optimization.html
- Agrawal et al., Sarathi-Serve (OSDI 2024): the piggybacking result your zero-cost-chunking test measured. https://arxiv.org/abs/2403.02310
Lab 18-02 — Benchmark Real vLLM and Write the Tuning Report [GPU-OPT]
Lab-01's loop, with wall-clocks: run vllm bench serve against a live server,
sweep two knobs, and produce the artifact this phase exists to teach — a tuning
report: workload stated, configs compared, distributions (not means) reported,
a recommendation with its trade named. The capture below is such a report in
miniature; your deliverable is the same table for your hardware and a workload you
choose.
No GPU? Don't panic. The captured sweep below is the worked example, and the report-writing discipline is hardware-free. (You can also run the whole lab against Phase 17 lab-02's CPU backend — slower numbers, identical methodology.)
Contents
- Why this lab exists
- Requirements
- Steps
- Captured sweep (Qwen2.5-0.5B, L4, vLLM 0.22.1)
- Reading the sweep
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Benchmark numbers without methodology are advocacy, and most published LLM serving comparisons fail one of four checks you'll practice here: stated workload (QPS, prompt/output length distributions — Phase 13 taught how much one image shifts these), warm measurement (Phase 5's capture and compile excluded), distributions (p50/p99 for TTFT and ITL — the tails are the product, Phase 3 lab-05), and one knob at a time. The phase's CPU labs built every mental model this lab's numbers will land in; the remaining skill is operational care, which only practice installs.
Requirements
uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
Steps
- Serve (one terminal):
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000
- Bench (another): sweep request rate first to find the knee, then the knobs. The openai-chat backend targets the chat-completions route, so name the endpoint explicitly:
vllm bench serve --backend openai-chat --base-url http://localhost:8000 \
--endpoint /v1/chat/completions \
--model Qwen/Qwen2.5-0.5B-Instruct \
--dataset-name random --random-input-len 512 --random-output-len 128 \
--num-prompts 200 --request-rate 8
- Re-run the same command against servers restarted with one change each: --long-prefill-token-threshold 64, then --max-num-seqs 64, then --gpu-memory-utilization 0.9. Do two runs per config (eyeball variance before trusting deltas; this is Phase 5 lab-04's discipline).
- Write the report: table, distributions, the knee, one recommendation per SLO profile.
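A small driver keeps the rate sweep honest by generating every command from one base. The sketch below only builds and prints the command lines, using the flags from the steps above; `sweep_commands` is a hypothetical helper, and the `--endpoint` value assumes the chat-completions route that the openai-chat backend expects (drop it if your version infers the path).

```python
import shlex

BASE = [
    "vllm", "bench", "serve",
    "--backend", "openai-chat",
    "--base-url", "http://localhost:8000",
    "--endpoint", "/v1/chat/completions",   # assumption: explicit chat route
    "--model", "Qwen/Qwen2.5-0.5B-Instruct",
    "--dataset-name", "random",
    "--random-input-len", "512",
    "--random-output-len", "128",
    "--num-prompts", "200",
]


def sweep_commands(rates, runs_per_rate=2):
    """One argv list per (rate, run); everything except the rate stays identical."""
    commands = []
    for rate in rates:
        for _ in range(runs_per_rate):
            commands.append(BASE + ["--request-rate", str(rate)])
    return commands


if __name__ == "__main__":
    for argv in sweep_commands([4, 8, 12]):
        # To execute instead of print: subprocess.run(argv, check=True)
        print(shlex.join(argv))
```

Building argv lists rather than shell strings avoids quoting mistakes, and the duplicated runs per rate give the variance check its second sample.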
Captured sweep (Qwen2.5-0.5B, L4, vLLM 0.22.1)
workload: 512-in/128-out random, 200 prompts, rate 8 req/s, warm server, 2 runs each
config tput tok/s TTFT p50/p99 (ms) ITL p50/p99 (ms)
baseline (defaults) 4,310 145 / 610 11.2 / 41.8
threshold=64 4,150 160 / 660 11.3 / 14.9 <- p99 ITL 2.8x better
max_num_seqs=64 (was 256) 4,290 150 / 1,240 11.1 / 13.6 <- queueing moved to TTFT
gpu_mem_util=0.9 (was 0.85) 4,420 144 / 600 11.2 / 40.9 <- more KV, small gain here
# rate sweep (baseline): 4 req/s p99 TTFT 210ms; 8 -> 610ms; 12 -> 4,900ms <- the knee is ~8-10
Reading the sweep
- threshold=64: −4% throughput, ÷2.8 p99 ITL — lab-01's trade with real units, and the per-request-vs-global subtlety still applies (check max chunk concurrency before promising the cap). For a chat product this row is the recommendation; for batch summarization it's a pure loss. The workload decides; the report must say so.
- max_num_seqs=64: ITL p99 improves (fewer co-resident decodes per step) but TTFT p99 doubles — the queue moved from inside steps to in front of them. Conservation of suffering: admission knobs relocate latency between metrics; only capacity (next row) or efficiency creates more of it.
- gpu_mem_util=0.9: +2.5% here because this workload wasn't KV-bound (0.5B, short contexts). The same knob on a 70B at long context is the difference between serving and queueing — a knob's value is workload-conditional, which is why the report states the workload first.
- The rate sweep is the most important line: the knee (~8–10 req/s) is the capacity number every other measurement is conditional on. Benchmarking at the knee shows tradeoffs; past it, everything drowns in queueing and configs look identical (all terrible). Find the knee first, always.
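The ratios quoted in this reading can be re-derived from the captured table, which is worth doing for any report before it ships. The numbers below are copied from the captured sweep; the dictionary layout and the `deltas` helper are illustrative.

```python
BASELINE = {"tput": 4310, "ttft_p99": 610, "itl_p99": 41.8}

ROWS = {
    "threshold=64": {"tput": 4150, "ttft_p99": 660, "itl_p99": 14.9},
    "max_num_seqs=64": {"tput": 4290, "ttft_p99": 1240, "itl_p99": 13.6},
    "gpu_mem_util=0.9": {"tput": 4420, "ttft_p99": 600, "itl_p99": 40.9},
}


def deltas(row, baseline=BASELINE):
    """Express one config relative to the baseline row."""
    return {
        "tput_pct": 100.0 * (row["tput"] / baseline["tput"] - 1.0),
        "ttft_p99_ratio": row["ttft_p99"] / baseline["ttft_p99"],   # >1 means worse
        "itl_p99_gain": baseline["itl_p99"] / row["itl_p99"],       # >1 means better
    }


if __name__ == "__main__":
    for name, row in ROWS.items():
        d = deltas(row)
        print(f"{name:18s} tput {d['tput_pct']:+.1f}%  "
              f"TTFT p99 x{d['ttft_p99_ratio']:.2f}  ITL p99 /{d['itl_p99_gain']:.2f}")
```

Keeping "ratio where bigger is worse" and "gain where bigger is better" as separately named fields avoids the sign confusion that creeps into hand-written summaries.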
Hitchhiker's notes
- vllm bench subsumes the old benchmark_serving.py scripts: datasets (random, sharegpt, sonnet), Poisson arrivals via --request-rate, and the percentile outputs this report needs. The server-side Prometheus metrics (vllm:time_to_first_token_seconds and friends) should agree with the client-side numbers minus network; when they don't, you've found front-door overhead (Phase 16 lab-02's gap measurement).
- Variance discipline scales with claim size: two runs to eyeball, five+ with a t-test before shipping a regression report someone will act on. The single most common benchmarking sin is one run per config and a conclusion from a 3% delta inside run-to-run noise.
- Profile only after the macro story is clear: this lab's table tells you which config to keep; Phase 7 lab-02's profiler tells you why a step costs what it does. Macro → micro, never the reverse: profiling an untuned config optimizes the wrong thing precisely.
- Report format matters more than it should: workload, configs, table, distributions, knee, recommendation-with-trade, all on one page. Decision-makers act on the page, not the runs; a perfect sweep badly reported changes nothing.
Reflect
- Reconcile each captured row with its CPU-lab prediction: threshold (lab-01 + Phase 3 lab-05), max_num_seqs (lab-01's queueing test), mem_util (Phase 2 lab-03's blocks). Any row you couldn't have predicted within 2× deserves a note in the report — that's where your model of the system is thinnest.
- Your p99 TTFT SLO is 800 ms and traffic is 10 req/s on this hardware. What does the rate sweep say, and what are the three escape routes? (You're past the knee: more replicas, a smaller/quantized model — Phase 6 — or admission control that sheds load visibly. Tuning knobs won't move a knee much; capacity does.)
- Why benchmark with random data instead of real prompts first? (Controlled lengths isolate the knobs; then confirm with a real-trace dataset such as sharegpt, because length distributions, prefix sharing, and image tokens all shift the knee. Synthetic isolates; real validates. You need both, in that order.)
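The SLO question in this list reduces to a lookup against the rate sweep. A minimal sketch, using the three baseline points from the captured sweep; the function name is hypothetical and the walk assumes p99 TTFT does not improve as rate rises.

```python
# req/s -> p99 TTFT in ms, from the baseline rate sweep captured above
RATE_SWEEP = {4: 210, 8: 610, 12: 4900}


def highest_rate_within_slo(sweep, slo_ms):
    """Largest measured rate that meets the SLO, or None if even the lowest misses it."""
    passing = None
    for rate in sorted(sweep):
        if sweep[rate] > slo_ms:
            break               # past the knee; higher rates will not recover
        passing = rate
    return passing


if __name__ == "__main__":
    print(highest_rate_within_slo(RATE_SWEEP, slo_ms=800))
```

For an 800 ms SLO the last measured rate that passes is 8 req/s, so a 10 req/s target sits in the unmeasured gap between a passing point and a badly failing one. That gap is the argument for a finer rate sweep before any knob tuning.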
References
- vllm bench serve --help and upstream/vllm/benchmarks/: the harness.
- vLLM docs, Benchmarking: official methodology notes. https://docs.vllm.ai/en/latest/contributing/benchmarks/
- Phase 3 lab-05 (the ITL story), Phase 2 lab-03 (the capacity story), Phase 5 lab-04 (warmup + variance), lab-01 (the search loop this lab runs for real).
- Dean & Barroso, The Tail at Scale — why every column here is a percentile: https://research.google/pubs/the-tail-at-scale/
Phase 18 — Exercises: Performance Engineering
Work these after the labs. They escalate from "explain it" to "design it" — staff-level means you can do the last ones cold.
- From a profile showing low GPU util at small batch, name the likely cause and fix.
- Use Little's Law to predict the batch size needed for a target throughput at a given ITL.
- Design a fair benchmark comparing two configs (warmup, steady state, same traffic).
Self-grading
For each: could you (a) explain it to a teammate in 2 minutes, and (b) point to the exact
upstream/ file that proves your answer? If not, re-read the matching anchor in
01-deep-dive.md.
Phase 18 — Interview Questions: Performance Engineering
Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)
Q1. Throughput is low and GPU utilization is ~30% at batch size 1–2. What's happening?
Model answer
Almost certainly CPU-launch-bound decode: many tiny kernels per step, CPU can't feed the GPU. Enable CUDA graphs, increase batch size (raise max_num_seqs / accept more concurrency), and check for Python overhead on the hot path. Confirm with a profile showing gaps between kernels.
Q2. How do you decide max_num_batched_tokens and gpu_memory_utilization?
Model answer
max_num_batched_tokens trades prefill chunk size vs decode latency: bigger = better prefill throughput but can stall decodes; tune to your prompt/output mix. gpu_memory_utilization is the fraction of GPU memory the vLLM instance may claim; the KV cache gets what remains of that fraction after weights and peak activations. Raise it to fit more concurrent sequences, but leave headroom for CUDA-graph buffers and anything else on the device to avoid OOM.
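A back-of-envelope version of the memory half of this answer, as a sketch under stated assumptions: the function names are hypothetical, the model shape (32 layers, 8 KV heads, head dim 128, 2-byte dtype) is an illustrative GQA configuration, and the 24 GiB / 16 GiB / 1 GiB figures are placeholders rather than measurements.

```python
GIB = 2 ** 30


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """One K and one V vector per KV head, per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_capacity(total_gpu_bytes, gpu_memory_utilization,
                      weight_bytes, activation_peak_bytes, bytes_per_token):
    """Tokens of KV cache that fit in what the utilization fraction leaves over."""
    budget = (total_gpu_bytes * gpu_memory_utilization
              - weight_bytes - activation_peak_bytes)
    return max(0, int(budget // bytes_per_token))


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    print(per_token // 1024, "KiB per token")
    for util in (0.90, 0.95):
        tokens = kv_token_capacity(24 * GIB, util, 16 * GIB, 1 * GIB, per_token)
        print(f"gpu_memory_utilization={util}: about {tokens} KV tokens")
```

The point of writing it out is that the knob acts on the leftover, not the total: when weights take most of the device, a few points of utilization can be a large relative change in KV capacity.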
Going deeper
The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.
Phase 18 — Cheatsheet: Performance Engineering
- Loop: measure (TTFT/ITL/throughput) -> find bottleneck -> turn one knob -> re-measure.
- Knobs: max_num_seqs, max_num_batched_tokens, gpu_memory_utilization, chunked prefill, CUDA graphs, quant, spec decode.
- Roofline: decode=memory-bound, prefill=compute-bound. Little's Law links batch/rate/latency.
- Benchmark with warmup + steady state + identical traffic, or it's noise.
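Two of these lines are one-line formulas, sketched here for quick reference. The Little's Law inputs reuse the baseline medians from the lab-02 capture as a stand-in for means (an assumption), and the roofline hardware numbers are placeholders, not any specific GPU.

```python
def in_flight_requests(arrival_rate_per_s, latency_s):
    """Little's Law: L = lambda * W (mean requests in the system)."""
    return arrival_rate_per_s * latency_s


def request_latency_s(ttft_s, itl_s, output_tokens):
    """First token, then one inter-token gap for each remaining token."""
    return ttft_s + (output_tokens - 1) * itl_s


def attainable_flops(arithmetic_intensity, peak_flops, mem_bandwidth):
    """Roofline: limited by memory traffic until intensity crosses peak/bandwidth."""
    return min(peak_flops, arithmetic_intensity * mem_bandwidth)


if __name__ == "__main__":
    latency = request_latency_s(ttft_s=0.145, itl_s=0.0112, output_tokens=128)
    print(in_flight_requests(8, latency))       # mean concurrency at 8 req/s
    # Placeholder hardware: 100 TFLOP/s peak, 300 GB/s bandwidth.
    print(attainable_flops(1, 100e12, 300e9))     # low intensity: memory-bound
    print(attainable_flops(1000, 100e12, 300e9))  # high intensity: compute-bound
```

Reading Little's Law backwards gives the sizing rule: the concurrency you must sustain is the arrival rate times the per-request latency, so any knob that lengthens latency also raises the batch the engine has to hold.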
Key upstream files
- benchmarks/
- vllm/benchmarks/
- vllm/v1/metrics/
- vllm/v1/metrics/stats.py
- vllm/config/scheduler.py
Full reference: 00-guide.md · 01-deep-dive.md
Phase 19 — Capstone — Maintainer & Startup
Contents
- Don't Panic
- Why this phase matters
- What you'll learn
- The map: where this lives in the real code
- Labs in this phase
- How to work this phase
- Where you are
Don't Panic
You now understand the engine. The capstone turns understanding into a track record: land a real upstream PR, pass the staff interview loop, and (optionally) sketch a startup that's actually defensible. Don't Panic — you've already done the hard part; this phase is about leverage and judgment.
Why this phase matters
Knowledge without a public artifact is invisible. A merged PR, a benchmark writeup, and the mini_vllm engine you built ARE your portfolio. This phase is how you convert the last 18 phases into a maintainer reputation, a job, or a company.
What you'll learn
- The contribution workflow: finding good-first-issues, duplicate checks, RFCs
- vLLM's actual rules for AI-assisted contributions (read upstream/AGENTS.md!)
- Writing a PR that gets merged: scope, tests, benchmarks, description
- Code review etiquette and how trust accrues to maintainers
- The staff competency map and the mock interview loop (see CAREER.md)
- The startup playbook: where cost/moats live; build vs buy vs upstream
The map: where this lives in the real code
Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see
UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md)
walks through the important ones line by line.
- AGENTS.md: vLLM's literal contribution policy (read it before any PR). Note: no pure code-agent PRs; disclose AI use; include tests + results; check for duplicates.
- docs/contributing/: The contributing guides.
- .buildkite/ and tests/: How CI is structured and what your PR must pass.
- docs/design/: Design docs / the kind of thinking RFCs require.
Labs in this phase
- lab-01-find-and-scope-a-pr [CPU-OK]: issue triage as engineering. Run the five-check disqualification gauntlet, then write the one-page implementation plan (invariants named, regression test planned, blast radius bounded, out-of-scope explicit) for the survivor.
- lab-02-mock-staff-loop [CPU-OK]: the exit exam. Four timed sessions (rapid-fire, deep-dive, a design scenario with shown arithmetic, two debugging trees), graded honestly in three layers against the model answers and CAREER.md's competency map.
See labs/README.md for the exit criteria these two labs define.
How to work this phase
- Read this guide for intuition.
- Read 01-deep-dive.md with the upstream/ files open.
- Do 02-mini-build.md: build the mini_vllm piece yourself.
- Run the labs, then attempt EXERCISES.md.
- Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.
Where you are
This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully-worked, line-by-line treatment (with starter/solution/test code in every lab) follows the gold standard set by the flagship phases: Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.
Phase 19 — Deep Dive: Capstone — Maintainer & Startup
Read this with upstream/ open. Every path is relative to upstream/ at the pinned commit v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number ever drifts, search for the named symbol instead.
Guided reading list
Work through these in order. This is a scaffold: the reading targets and the questions are real; fill in the line-by-line annotations as you go (this is exactly the muscle a maintainer uses — reading unfamiliar code and extracting its contract).
- AGENTS.md: vLLM's literal contribution policy (read it before any PR). Note: no pure code-agent PRs; disclose AI use; include tests + results; check for duplicates.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- docs/contributing/: The contributing guides.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- .buildkite/ and tests/: How CI is structured and what your PR must pass.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
- docs/design/: Design docs / the kind of thinking RFCs require.
  - Read it, then write 3 sentences in your lab notebook: what data structure, what invariant, what edge case.
Questions to answer as you read
- How does the contribution workflow run: finding good-first-issues, duplicate checks, RFCs?
- What are vLLM's actual rules for AI-assisted contributions (read upstream/AGENTS.md!)?
- What makes a PR get merged: scope, tests, benchmarks, description?
- What does code review etiquette require, and how does trust accrue to maintainers?
- What do the staff competency map and the mock interview loop cover (see CAREER.md)?
- In the startup playbook, where do cost and moats live, and when do you build vs buy vs upstream?
Cross-references
- Intuition: 00-guide.md
- Build it yourself: 02-mini-build.md
- The gold-standard depth to emulate: Phase 02 deep-dive.
Phase 19 — Mini-Build: extend mini_vllm
Your task
Capstone build: pick ONE real improvement to mini_vllm (e.g. add swapping-based preemption, beam search, or a second KV-cache group) and ship it with tests + a short design note — your dry run for an upstream PR.
Why build it (and not just read it)
Reading the real kernel/feature tells you what production does. Re-implementing a tiny version tells you why every decision was made — which is the understanding that survives into an interview or a 2 a.m. incident. Keep it small; keep it tested.
Method
- Look at the matching real code from 01-deep-dive.md.
- Add your module under mini_vllm/ (or extend an existing one).
- Write a test_*.py next to it that pins the behavior you care about.
- Run pytest mini_vllm -q and keep it green.
Definition of done
- Your component runs on CPU with no extra dependencies (numpy ok).
- A test demonstrates the property this phase is about (not just "it runs").
- You can explain, out loud, how your toy maps to the real implementation and where it intentionally simplifies.
The flagship phases ship complete mini_vllm modules + tests (mini_vllm/block_pool.py, mini_vllm/scheduler.py). Use them as your reference for structure and test style.
Phase 19 Labs — Capstone: Maintainer & Startup
Two process labs — no starters, no pytest; the graders are the vLLM review queue and your own honesty. Lab-01 turns the course's knowledge into a merged upstream PR: harvest real issues, run the disqualification gauntlet (claimed? reproducible? within your map? testable? reviewable?), and write the one-page implementation plan a maintainer would approve. Lab-02 is the exit exam: a timed four-session mock staff loop (rapid-fire, deep-dive, design, debugging) built from eighteen phases of INTERVIEW.md files, scored against CAREER.md's competency map — peeks cap at 2, skipped arithmetic caps at 2, and the low rows become your revision list.
Exit criteria for the course, per these labs:
1. A merged (or at least review-surviving) upstream PR. [lab-01]
2. A competency matrix you'd show a hiring manager. [lab-02]
Labs
lab-01-find-and-scope-a-pr [CPU-OK]
Issue triage as engineering: three candidates, five disqualification checks in cost order, and the survivor scoped with the course's move — find the load-bearing lines, name the invariants, plan the regression test, bound the blast radius (TP/quant/LoRA/spec interactions), and write the one-pager with an explicit out-of-scope line. Plus the mechanical friction-removers: claim the issue first, pre-commit hooks, DCO sign-off. Skills: selection and scoping — the actual gap between knowing and contributing.
lab-02-mock-staff-loop [CPU-OK]
Four timed sessions: rapid-fire fundamentals (phases 0–3), systems deep-dive (4–8), a design scenario with topology + knobs + shown arithmetic + named risks (choose from three realistic builds), and two debugging trees from the symptom catalog. Graded in three layers per answer — mechanism, invariant/arithmetic, operational consequence — against the model answers and the competency map. Skills: producing under pressure; committing to defended choices; calibrated self-assessment; the staff sentence ("128 KiB/token, so…").
Where to go from here
The course's final claim was made on page one: finish every lab and you can read
and modify any part of vLLM, operate it like a principal engineer, and know where
the moats are. These two labs are where you check the claim against reality — the
PR against the review queue, the matrix against the map. Whatever rows come back
weak, the phases are still there; whatever comes back strong, CAREER.md
maps the three roads it opens (maintainer, staff IC, founder). Your notebook,
your mini_vllm, your tuning reports, and your merged PR are the portfolio.
Don't Panic — you built the whole engine once already.
Lab 19-01 — Find and Scope a Real Upstream PR [CPU-OK]
Nineteen phases built the knowledge; this lab spends it. You will triage real open vLLM issues, run the checks that separate a contributable issue from a trap, and produce the lab's artifact: a one-page implementation plan for one issue — written to the standard where a maintainer reading it would say "yes, do that." This is a process lab: no starter.py, no tests. The deliverable is the plan, and the grader is eventually the vLLM review queue itself.
Contents
- Why this lab exists
- Step 1 — Harvest candidates
- Step 2 — The disqualification gauntlet
- Step 3 — Scope the survivor
- The one-page plan (the artifact)
- Hitchhiker's notes
- Going further
- References
Why this lab exists
The distance between "understands vLLM" and "has a merged vLLM PR" is not knowledge — it's selection and scoping, and most first-time contributors fail there: they pick an issue that's secretly hard, already claimed, or quietly obsolete, burn two weekends, and bounce off. The defense is treating issue triage as an engineering activity with checks, which is also — not coincidentally — what maintainers themselves do all day. Doing this lab honestly once gives you the habit; the habit gives you the merged PR; the merged PR (per CAREER.md) is the portfolio line that compounds.
Step 1 — Harvest candidates
gh issue list -R vllm-project/vllm --label "good first issue" --state open --limit 30
gh issue list -R vllm-project/vllm --label bug --state open --search "sort:created-desc" --limit 30
Pick three candidates with different shapes if you can: a model-support gap (Phase 14's mapping-row class), a parser/frontend bug (Phase 16's territory), and a docs/test gap (don't sneer — test PRs teach the review process at minimum stakes). For each, skim the issue thread fully: maintainer comments often contain the scoping ("this needs X first", "blocked on Y") that the title hides.
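The harvest commands above can also be scripted so the candidates land in one sortable list. This is a minimal sketch, assuming the gh CLI is installed and authenticated. The helper names (harvest, summarize), the chosen JSON fields, and the SAMPLE rows are illustrative and not part of the course or of vLLM.

```python
import json
import subprocess

FIELDS = "number,title,labels,updatedAt"  # JSON fields requested from gh

def harvest(label: str, limit: int = 30) -> list:
    """Run `gh issue list` and return parsed rows (needs gh, auth, and network)."""
    cmd = ["gh", "issue", "list", "-R", "vllm-project/vllm",
           "--label", label, "--state", "open",
           "--limit", str(limit), "--json", FIELDS]
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return json.loads(out)

def summarize(rows: list) -> list:
    """One line per issue, most recently updated first."""
    lines = []
    for r in sorted(rows, key=lambda r: r["updatedAt"], reverse=True):
        labels = ", ".join(label["name"] for label in r["labels"])
        lines.append(f"#{r['number']} [{labels}] {r['title']}")
    return lines

# Made-up rows in the shape gh returns, so the sketch runs offline.
SAMPLE = [
    {"number": 1, "title": "placeholder docs gap",
     "updatedAt": "2025-01-01T00:00:00Z",
     "labels": [{"name": "documentation"}]},
    {"number": 2, "title": "placeholder parser bug",
     "updatedAt": "2025-02-01T00:00:00Z",
     "labels": [{"name": "bug"}, {"name": "good first issue"}]},
]
print("\n".join(summarize(SAMPLE)))
```

To run it against the live tracker, replace SAMPLE with harvest("good first issue"). The script only builds the shortlist; reading each thread in full is still manual work.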
Step 2 — The disqualification gauntlet
Run every candidate through these checks — in this order, cheapest first:
- Already claimed/fixed? gh pr list -R vllm-project/vllm --search "<keywords>" plus a search of closed PRs and linked PRs in the issue. The most common first-timer waste is duplicating an in-flight fix. (The repo's AGENTS.md encodes exactly these checks for AI-assisted contributors; read it, since the checklist applies to humans identically.)
- Still reproducible at HEAD? Issues rot; vLLM merges dozens of PRs daily. A bug filed three weeks ago may be gone. Reproduce (or for model-support, confirm the model still errors) before writing a line.
- Is the cause within your current map? Trace it to a file. If the file is one whose machinery you've built in this course (scheduler, block pool, parsers, loaders, sampler, platform code) — green. If it's deep in a CUDA kernel and you skipped the GPU labs — pick another, or budget honestly.
- Is the fix small but the test meaningful? The ideal first PR is a ≤100-line diff with a regression test that pins the behavior forever — the shape every lab in this course drilled. Issues whose fix is one line but whose test is impossible, or vice versa, score worse than they look.
- Will anyone review it? Check the subsystem's recent merge velocity (gh pr list --search "path:vllm/<area>" --state merged --limit 10). A perfect PR into an unowned corner can sit for months; that's demoralizing precisely when momentum matters most.
Expect to disqualify two of three. That's the gauntlet working, not failing.
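The gauntlet is an ordered list of predicates that stops at the first failure, which is why the cost order matters. A minimal sketch of that structure follows. The Candidate fields and the three example issues are hypothetical, and you still fill in each boolean by hand after doing the corresponding check.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    issue: str                  # hypothetical label, not a real issue number
    claimed: bool               # an open, closed, or linked PR already covers it
    reproduces_at_head: bool    # still fails on current HEAD
    within_map: bool            # cause traced to a file you understand
    small_fix_real_test: bool   # small diff plus a meaningful regression test
    area_has_reviewers: bool    # the subsystem shows recent merges

# Checks in cost order, cheapest first. Each returns True when the candidate passes.
GAUNTLET = [
    ("already claimed/fixed", lambda c: not c.claimed),
    ("not reproducible at HEAD", lambda c: c.reproduces_at_head),
    ("cause outside your map", lambda c: c.within_map),
    ("fix/test shape is wrong", lambda c: c.small_fix_real_test),
    ("nobody will review it", lambda c: c.area_has_reviewers),
]

def first_failure(c: Candidate) -> Optional[str]:
    """Return the first disqualifying reason, or None for a survivor."""
    for reason, passes in GAUNTLET:
        if not passes(c):
            return reason   # stop early: later checks cost more
    return None

candidates = [
    Candidate("model-gap", True, True, True, True, True),
    Candidate("parser-bug", False, True, True, True, True),
    Candidate("docs-gap", False, False, True, True, True),
]
for c in candidates:
    print(c.issue, "->", first_failure(c) or "SURVIVOR")
```

With these made-up inputs, two of the three candidates are disqualified and one survives.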
Step 3 — Scope the survivor
For the survivor, do the course's move: find the load-bearing lines. Identify the function(s) to change, the invariants they maintain (you can usually name them now — I1–I4, the budget cap, chunking invariance, the contract widths), the test file where the regression test belongs, and the blast radius (who calls this? does TP/quant/LoRA/spec-decode interact? — the feature-composition questions Phases 10–12 trained).
The one-page plan (the artifact)
Issue: #NNNNN — <title> Reproduced at: <commit>
Cause (file:line): ... (one paragraph, mechanism not symptom)
Fix sketch: ... (what changes, why it preserves the invariants)
Test plan: ... (the regression test: file, fixture, the assert)
Blast radius: ... (interactions checked: TP / quant / LoRA / spec / none)
Out of scope: ... (the adjacent improvements you are NOT doing — scoping
discipline is mostly this line)
Open questions for maintainers: ... (≤2, specific — these go in the PR description)
One page. If it doesn't fit, the scope is wrong — shrink the issue, not the font.
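The template's constraints (one page, an explicit out-of-scope line, at most two open questions) can be checked mechanically before the plan is posted. The sketch below is an illustration only: the Plan class, the 50-line stand-in for "one page", and the example values are all assumptions, not course tooling.

```python
from dataclasses import dataclass, field

MAX_LINES = 50           # assumed stand-in for "one page"
MAX_OPEN_QUESTIONS = 2   # from the template's limit on open questions

@dataclass
class Plan:
    issue: str
    reproduced_at: str
    cause: str
    fix_sketch: str
    test_plan: str
    blast_radius: str
    out_of_scope: str
    open_questions: list = field(default_factory=list)

    def render(self) -> str:
        """Lay the fields out in the template's order."""
        questions = "; ".join(self.open_questions) or "none"
        return "\n".join([
            f"Issue: {self.issue}    Reproduced at: {self.reproduced_at}",
            f"Cause (file:line): {self.cause}",
            f"Fix sketch: {self.fix_sketch}",
            f"Test plan: {self.test_plan}",
            f"Blast radius: {self.blast_radius}",
            f"Out of scope: {self.out_of_scope}",
            f"Open questions for maintainers: {questions}",
        ])

    def problems(self) -> list:
        """Return scoping problems; an empty list means the plan is postable."""
        found = []
        if len(self.render().splitlines()) > MAX_LINES:
            found.append("longer than one page: shrink the issue")
        if not self.out_of_scope.strip():
            found.append("missing the out-of-scope line")
        if len(self.open_questions) > MAX_OPEN_QUESTIONS:
            found.append("too many open questions")
        return found

# Placeholder values only; a real plan names real files and commits.
draft = Plan(
    issue="#NNNNN", reproduced_at="<commit>", cause="<mechanism>",
    fix_sketch="<change>", test_plan="<regression test>",
    blast_radius="none", out_of_scope="",
    open_questions=["q1", "q2", "q3"],
)
print(draft.problems())
```

The empty out_of_scope field is flagged on purpose: the template treats that line as the core of scoping discipline.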
Hitchhiker's notes
- Comment on the issue before coding ("I'd like to take this; plan: "): it claims the work, invites early correction, and costs nothing. Maintainers redirect cheap plans gladly and expensive PRs reluctantly.
- Read docs/contributing/ and run the pre-commit hooks before the first commit: format/lint failures are the #1 cause of first-PR friction, and they're entirely avoidable mechanically.
- The DCO sign-off (git commit -s) is required and forgotten by almost every first contributor. Set git config alias.cs "commit -s" now.
- Your course artifacts are your credibility: a PR description that says "this preserves the free-queue invariant (blocks in queue ⟺ ref_cnt==0)" or "verified chunking-invariance with a randomized-slicing test" reads as a peer, not a tourist. Use the vocabulary; you've earned it.
Going further
- Do it: implement the plan, open the PR, survive review. Review is the lab's second half — expect a round or two; respond to every comment (fix or argue, never silence). The merge is Phase 19's true exit criterion.
- Then do the maintainer's side once: pick someone else's open first-PR and review it against your gauntlet — kindly, concretely. Both seats teach.
- Keep the plan template. Every nontrivial change you ever make — upstream or at work — deserves the one-pager; teams that institutionalize it ship faster and argue less.
References
- upstream/AGENTS.md and upstream/docs/contributing/: the project's own checks and process.
- vLLM good-first-issue board: https://github.com/vllm-project/vllm/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
- CAREER.md — where the merged PR fits in the maintainer path.
- Phase 14 lab-03 (mapping rows — the classic first-PR shape), Phase 16 lab-01 (parser bugs — the second-classic).
Lab 19-02 — The Mock Staff Loop [CPU-OK]
Eighteen phases of INTERVIEW.md files exist for this moment: a full, timed, self-administered staff-engineer loop of four sessions, graded against the model answers and CAREER.md's competency map, with the gaps feeding a revision list rather than an ego. There are two deliverables: your scored competency matrix and the one-pager from the design session. This is the course's exit exam, and you are both candidate and (the harder job) honest grader.
Contents
- Why this lab exists
- The loop format
- Session guide
- Grading honestly
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Knowledge you can't produce under time pressure, out of order, against follow-up questions, isn't yet yours — it's still the book's. Staff loops test exactly the transformations this course optimized for: derive rather than recall (the economics labs), name the invariant under the feature (every lab's tables), state the trade with both sides priced (every Hitchhiker's note). The mock loop is where you find which of those moves are reflexes now and which still need the page open. Run it honestly once and your revision list writes itself; run it honestly twice, a month apart, and you'll have the rare commodity of calibrated confidence going into real loops — or real design meetings, which are the same exam with stakes.
The loop format
Four sessions, strictly timed, one sitting if you can manage it (fatigue realism included), notes only AFTER each session ends:
| Session | Time | Source material |
|---|---|---|
| 1. Fundamentals rapid-fire | 30 min | 2 questions each from phases 0–3 INTERVIEW.md, randomized |
| 2. Systems deep-dive | 45 min | 1 question each from phases 4–8, with self-posed follow-ups |
| 3. Design: "serve X under SLO Y on hardware Z" | 60 min | Construct from phases 10/15/18 (3 scenarios below) |
| 4. Debugging scenario | 30 min | Pick 2 from the symptom catalog below |
Design scenarios (pick one): (a) 70B chat, p99 TTFT < 1 s / ITL < 30 ms, 16 A100s across 2 nodes; (b) 100-tenant fine-tune platform, 8B base, 8 GPUs; (c) agentic workload, 8B + heavy tool calling, single-stream-latency-obsessed, 4 H100s. Produce the one-pager: topology (TP/PP/replicas/disagg), knobs with values and reasons, capacity arithmetic shown, the two biggest risks named.
Debugging symptoms (pick two; talk through the diagnosis tree out loud): p99 ITL spikes hourly (Phase 3/18); throughput fell 30% after a model swap (Phase 4/6 — check the backend line); tenant 7 complains, dashboards green (Phase 11 — slot thrash); seeded requests not reproducing (Phase 9); outputs differ across TP sizes (Phase 10 — the last ulp); VLM TTFT doubled (Phase 13 — image sizes).
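Drawing the questions for sessions 1 and 2 by script keeps you from unconsciously picking the ones you already know. The sketch below assumes a hypothetical question bank keyed by phase number, with placeholder question IDs. In practice you would fill it from each phase's INTERVIEW.md.

```python
import random
from typing import Optional

def draw(bank: dict, phases: range, per_phase: int,
         seed: Optional[int] = None) -> list:
    """Pick per_phase questions from each phase, then shuffle the order."""
    rng = random.Random(seed)
    picks = [(p, q) for p in phases for q in rng.sample(bank[p], per_phase)]
    rng.shuffle(picks)   # out-of-order delivery is part of the test
    return picks

# Hypothetical bank: five placeholder question IDs per phase.
BANK = {p: [f"phase{p:02d}-Q{i}" for i in range(1, 6)] for p in range(0, 9)}

session1 = draw(BANK, range(0, 4), per_phase=2)   # phases 0-3, 2 each
session2 = draw(BANK, range(4, 9), per_phase=1)   # phases 4-8, 1 each

# Time budget per question, from the session lengths in the table above.
print(f"session 1: {len(session1)} questions, {30 / len(session1):.2f} min each")
print(f"session 2: {len(session2)} questions, {45 / len(session2):.2f} min each")
```

Passing a seed makes the draw reproducible, which helps when you trade loops with a colleague and both need the same question set.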
Session guide
Answer out loud or in writing — producing is the test; reading silently grades as zero. For each question, the staff-grade answer has three layers, and you should consciously hit all three: the mechanism (what happens), the invariant or arithmetic underneath (why it must be so — quote the formula, name the I-number), and the operational consequence (what you'd do about it at 3 a.m.). The model answers in the INTERVIEW.md files are written in roughly that shape; grade against the shape, not just the facts.
Grading honestly
Score each competency row from CAREER.md's map: 3 = derived it cold, follow-ups survived; 2 = got there with hesitation or one peek; 1 = knew of it; 0 = blank. The two honesty rules: a peek caps the row at 2 (that's what the peek means), and an answer that skipped the arithmetic when arithmetic existed caps at 2 (staff answers compute — the course's whole thesis). Rows at ≤1 map directly to phases; that's your revision list, and the labs are designed for exactly this kind of targeted re-entry (each phase's index lists its skills).
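The two honesty caps and the revision-list rule are simple enough to encode, which removes the temptation to round up afterwards. This is a minimal sketch; the Row fields and the example competencies are hypothetical, since the real rows come from CAREER.md's map.

```python
from dataclasses import dataclass

@dataclass
class Row:
    competency: str
    phase: int                        # phase the competency maps back to
    raw: int                          # 0..3 self-score before the caps
    peeked: bool = False
    skipped_arithmetic: bool = False  # set only when arithmetic existed

def final_score(row: Row) -> int:
    """Apply the two honesty rules: either one caps the row at 2."""
    if row.peeked or row.skipped_arithmetic:
        return min(row.raw, 2)
    return row.raw

def revision_list(rows: list) -> list:
    """Phases to revisit: every row whose final score is 1 or lower."""
    return sorted({r.phase for r in rows if final_score(r) <= 1})

# Hypothetical rows for illustration.
matrix = [
    Row("KV budget arithmetic", phase=0, raw=3),
    Row("Block pool invariants", phase=2, raw=3, peeked=True),
    Row("TP comm cost", phase=10, raw=3, skipped_arithmetic=True),
    Row("Seeded sampling", phase=9, raw=1),
    Row("Multimodal TTFT", phase=13, raw=0),
]
for r in matrix:
    print(f"{r.competency:24s} raw={r.raw} final={final_score(r)}")
print("revise phases:", revision_list(matrix))
```

Capped rows score 2 and stay off the revision list. By the rubric above, only rows at 1 or lower send you back to a phase.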
Hitchhiker's notes
- The design session is the one that decides real loops — and its failure mode is breadth without commitment. Force yourself to choose (TP=4 PP=2, not "TP or maybe PP") and defend with the lab arithmetic (Phase 10 lab-03's comm bill, Phase 0 lab-02's KV budget, Phase 15 lab-03's toll). Reviewers — real and self — reward a defended wrong choice over an undefended hedge.
- Say the numbers out loud. "128 KiB per token, so 2048-token contexts cost 256 MiB each, so 8 GiB of free HBM holds ~32 of them" is a staff sentence; "KV is big" is not. The course gave you maybe twenty such derivations — sessions 1 and 3 should each surface five.
- Interviewing the interviewer: after each model-answer comparison, ask what follow-up the answer invites and answer that too. Real loops live in the follow-ups; the INTERVIEW.md files seed them deliberately.
- A month later, rerun changed rows only. Spaced, targeted, calibrated — the same discipline as performance work (measure, change one thing, measure).
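The staff sentence in the notes above is three lines of arithmetic, worked here so each step can be checked. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is an assumption chosen because it yields 128 KiB per token; the text does not name the model behind that figure. The sketch also ignores block granularity and any other memory overhead.

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int) -> int:
    """KV-cache bytes for one token: K and V, for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Assumed shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16/bf16.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
per_context = per_token * 2048           # one 2048-token context
contexts = (8 * GIB) // per_context      # how many fit in 8 GiB of free HBM

print(f"{per_token // KIB} KiB per token")
print(f"{per_context // MIB} MiB per 2048-token context")
print(f"{contexts} contexts in 8 GiB")
```

Swapping in another model's shape or a longer context changes all three numbers. The habit being trained is the derivation, not the specific 128.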
Going further
- Trade loops with a colleague — grading someone else against the model answers teaches more than being graded, and explaining a phase you "know" is the final filter for whether you do.
- Take the design one-pager from session 3 and cost it on real cloud prices — the startup half of CAREER.md begins exactly there (capacity arithmetic × dollars = the unit economics every inference company lives or dies by).
- Publish your best answer (blog post, internal doc) — the act of writing for strangers finds the remaining gaps, and the artifact compounds the way merged PRs do.
References
- The INTERVIEW.md in every phase directory — the question bank.
- CAREER.md — the competency map you're scoring against, and the maintainer/staff/startup paths the scores feed.
- Lab-01 — the other half of the capstone: the loop proves you can explain the engine; the merged PR proves you can change it. Exit with both.
Phase 19 — Exercises: Capstone — Maintainer & Startup
Work these after the labs. They escalate from "explain it" to "design it" — staff-level means you can do the last ones cold.
- Write a merge-ready PR description for a small fix (scope, tests run, why not a duplicate).
- Pick a model from this course and propose, with numbers, the single highest-ROI optimization.
- Draft a one-paragraph startup thesis with a defensible moat per CAREER.md Track C.
Self-grading
For each: could you (a) explain it to a teammate in 2 minutes, and (b) point to the exact
upstream/ file that proves your answer? If not, re-read the matching anchor in
01-deep-dive.md.
Phase 19 — Interview Questions: Capstone — Maintainer & Startup
Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)
Q1. How would you land your first vLLM contribution?
Model answer
Find a good-first-issue or a real bug you hit; run the duplicate checks (gh issue/pr search) per AGENTS.md; reproduce it; write a minimal fix WITH a test that pins the behavior and a clear PR description stating what you ran and why it isn't a duplicate; respond to review quickly. Specialize in one area to build reviewer trust over time.
Q2. Where's the moat for an inference startup built on vLLM?
Model answer
Not in renting GPUs around vanilla vLLM (margins compress). It's in a sustained kernel/scheduling edge, workload specialization (long-context/agentic/structured), the control plane (routing, autoscaling, multi-tenancy, cost attribution), or distribution/switching costs. Contribute commodity features upstream; keep the genuine edge in-house.
Going deeper
The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.
Phase 19 — Cheatsheet: Capstone — Maintainer & Startup
- Read AGENTS.md FIRST. No pure code-agent PRs; disclose AI; include tests+results; no dupes.
- Merge-ready PR = small scope + tests that pin behavior + benchmark if perf + clear why.
- Portfolio = merged PR + benchmark writeup + your mini_vllm engine.
- Moats: kernel/scheduling edge, workload specialization, control plane, distribution.
Key upstream files
- AGENTS.md
- docs/contributing/
- .buildkite/ and tests/
- docs/design/
Full reference: 00-guide.md · 01-deep-dive.md