Lab 05-03 — Reimplement the CUDAGraphMode Dispatch [CPU-OK]
Labs 01–02 established that graphs love uniform, repeated shapes. Now meet the batch that
hates them: a mixed batch — Phase 3's chunked prefill riding alongside decodes, every
step a different ragged collection of sequence lengths flowing into attention. A FULL
graph can't swallow that. vLLM's answer is a small but consequential piece of policy: the
CUDAGraphMode enum (upstream/vllm/config/compilation.py:53), which routes pure-decode
batches and mixed batches to different graph strategies — and whose composite values
(FULL_AND_PIECEWISE, the V1 default) are the reason your lab-04 capture log shows two
capture passes. You'll reimplement its dispatch methods exactly, because this tiny enum is
where three phases of machinery (chunked prefill, attention metadata, graph constraints)
get reconciled in about ten lines.
Contents
- Why this lab exists
- Background (read first)
- Files
- Run
- What you must reproduce
- The routing you're proving
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Most engineers meet cudagraph_mode as a config string they cargo-cult when something
breaks ("try PIECEWISE"). The enum deserves better: it's a textbook example of encoding
a two-dimensional policy in a one-dimensional config, and the dispatch methods you'll
write are the decoder ring. Once you've implemented decode_mode/mixed_mode/has_mode
yourself, every graphs-related symptom maps to a row of the routing table: capture log has
one pass instead of two → someone set FULL; mixed batches mysteriously slow → mode is
FULL_DECODE_ONLY and prefill steps run eager; compile time doubled → the mode
requires_piecewise_compilation and the model was split at attention.
There's also a compile-time/run-time lesson here that generalizes: some of these flags must be known before the model is compiled (you can't piecewise-replay a graph that wasn't piecewise-compiled), so the enum is consulted in two different epochs of the engine's life. Configuration that crosses epochs is where the subtle bugs live — this lab makes the two consumers explicit.
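The two epochs can be made concrete with a toy model. This is a minimal sketch, not vLLM's code: `CompiledModel`, `compile_model`, and `replay` are hypothetical names invented for illustration, and modes are plain strings as in the starter. The only point it tries to show is that the split decision is frozen before any batch exists, and the runtime can only choose among what was compiled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompiledModel:
    """Hypothetical stand-in for a compiled model; frozen because the
    split decision cannot change once compilation has happened."""
    split_at_attention: bool


def compile_model(needs_piecewise: bool) -> CompiledModel:
    # Epoch 1 (startup): consult the mode once, before any batch arrives.
    return CompiledModel(split_at_attention=needs_piecewise)


def replay(model: CompiledModel, runtime_mode: str) -> str:
    # Epoch 2 (every step): the per-batch choice is bounded by epoch 1.
    if runtime_mode == "PIECEWISE" and not model.split_at_attention:
        raise RuntimeError("piecewise replay needs a piecewise-compiled model")
    return runtime_mode


# A model compiled with the split can serve both kinds of step.
split_model = compile_model(needs_piecewise=True)
assert replay(split_model, "FULL") == "FULL"
assert replay(split_model, "PIECEWISE") == "PIECEWISE"
```

Compiling with `needs_piecewise=False` and then asking for a `"PIECEWISE"` replay raises, which is the toy version of the cross-epoch bug the paragraph above warns about.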
Background (read first)
class CUDAGraphMode(enum.Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_DECODE_ONLY = (FULL, NONE)  # full graph for decode, no graph for mixed
    FULL_AND_PIECEWISE = (FULL, PIECEWISE)  # full for decode, piecewise for mixed (V1 default)
The composite modes are tuples (decode_mode, mixed_mode). Why the split:
- A pure-decode batch is the graph's dream: every request contributes exactly one token, shapes are uniform (padded to a ladder rung — lab-05), attention metadata is regular. Safe for a FULL graph — the entire step, attention included, one replay.
- A mixed batch (prefill chunks + decodes) has per-request query lengths, ragged attention metadata, varlen kernels — exactly what a recording can't generalize over. Options: no graph at all (NONE), or PIECEWISE — capture the shape-stable runs between attention calls and run attention eagerly. Piecewise is the compromise that keeps most of the launch win (the hundreds of small ops around attention) while letting the one genuinely dynamic op stay dynamic.
PIECEWISE requires the model to have been compiled with attention as a splitting op
(torch.compile carves the graph at splitting_ops) — that's the compile-time dependency
requires_piecewise_compilation guards.
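It may help to see why the tuple trick works before writing any dispatch logic. The sketch below restates the enum from this section and only inspects its values; nothing here is a dispatch method. Inside an `enum.Enum` class body, a previously assigned name such as `FULL` still refers to its raw value (`2`), so the composite members end up holding plain tuples of integers that can be turned back into members by value lookup.

```python
import enum


class CUDAGraphMode(enum.Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    # In the class body, FULL / NONE / PIECEWISE are still raw ints here.
    FULL_DECODE_ONLY = (FULL, NONE)
    FULL_AND_PIECEWISE = (FULL, PIECEWISE)


# The composite's value is a tuple of the raw integer values.
composite = CUDAGraphMode.FULL_AND_PIECEWISE
print(composite.value)

# Each half can be mapped back to a simple member by value lookup.
decode_value, mixed_value = composite.value
print(CUDAGraphMode(decode_value), CUDAGraphMode(mixed_value))

# Simple members carry a bare int, so "is the value a tuple?" separates
# composite modes from simple ones.
print(isinstance(CUDAGraphMode.FULL.value, tuple))
```

The lab's starter uses strings and a `ROUTINES` dict instead of tuple-valued members, but the same (decode, mixed) pairing is what you are encoding.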
Files
- starter.py: implement separate_routine, decode_mode, mixed_mode, has_mode, requires_piecewise_compilation, runtime_mode_for. Modes are strings; composites live in a ROUTINES dict. Your work.
- solution.py: reference.
- test_lab.py: the full routing table, every mode × both batch kinds.
Run
LAB_IMPL=starter pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-03-cudagraph-mode -q
pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-03-cudagraph-mode -q # reference
What you must reproduce
- separate_routine(m) → is m composite (distinct decode/mixed routines)?
- decode_mode(m) / mixed_mode(m) → the concrete mode for that batch kind. Composites split; simple modes return themselves for both. FULL means full everywhere, which is why it's only safe with chunked prefill disabled or padded-prefill tricks.
- has_mode(m, target) → does m employ target in any routine?
- requires_piecewise_compilation(m) → has_mode(m, PIECEWISE).
- runtime_mode_for(m, is_decode) → the per-batch dispatch the wrapper performs each step (upstream: the BatchDescriptor.uniform_decode flag selecting the entry).
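For orientation, here is one possible shape for these functions in the string-based style the starter describes. It is a sketch under assumptions: the exact layout of `ROUTINES` and the function signatures in starter.py are not shown in this document, so treat the names' arguments and the `ValueError` guard as guesses rather than the reference solution.

```python
# Assumed layout: composite mode name -> (decode routine, mixed routine).
ROUTINES = {
    "FULL_DECODE_ONLY": ("FULL", "NONE"),
    "FULL_AND_PIECEWISE": ("FULL", "PIECEWISE"),
}


def separate_routine(m: str) -> bool:
    """True when m routes decode and mixed batches differently."""
    return m in ROUTINES


def decode_mode(m: str) -> str:
    return ROUTINES[m][0] if separate_routine(m) else m


def mixed_mode(m: str) -> str:
    return ROUTINES[m][1] if separate_routine(m) else m


def has_mode(m: str, target: str) -> bool:
    """Does m use the simple mode `target` for either batch kind?"""
    if separate_routine(target):
        # Assumption: asking about a composite target is a caller error.
        raise ValueError("target must be a simple mode")
    return target in (decode_mode(m), mixed_mode(m))


def requires_piecewise_compilation(m: str) -> bool:
    return has_mode(m, "PIECEWISE")


def runtime_mode_for(m: str, is_decode: bool) -> str:
    """Per-step choice: decode routine for uniform decode, else mixed."""
    return decode_mode(m) if is_decode else mixed_mode(m)
```

Note that `requires_piecewise_compilation` is the only function here that belongs to the compile-time consumer; the rest serve the per-step lookup.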
The routing you're proving
| mode | decode batch | mixed batch | needs piecewise compile? |
|---|---|---|---|
| NONE | NONE | NONE | no |
| PIECEWISE | PIECEWISE | PIECEWISE | yes |
| FULL | FULL | FULL | no |
| FULL_DECODE_ONLY | FULL | NONE | no |
| FULL_AND_PIECEWISE | FULL | PIECEWISE | yes |
Memorize the last row — it's the default, and both lab-04 capture passes
((decode, FULL) and (mixed prefill-decode, PIECEWISE)) are its two cells.
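The table has internal structure you can check without any dispatch code. The sketch below transcribes the five rows as data and verifies two properties implied by this lab's text: the last column is true exactly when PIECEWISE appears in either cell, and a simple mode fills both cells with itself. `ROUTING_TABLE`, `SIMPLE_MODES`, and `check_table` are hypothetical names for illustration, not part of test_lab.py.

```python
# mode -> (decode batch, mixed batch, needs piecewise compile?)
ROUTING_TABLE = {
    "NONE": ("NONE", "NONE", False),
    "PIECEWISE": ("PIECEWISE", "PIECEWISE", True),
    "FULL": ("FULL", "FULL", False),
    "FULL_DECODE_ONLY": ("FULL", "NONE", False),
    "FULL_AND_PIECEWISE": ("FULL", "PIECEWISE", True),
}

SIMPLE_MODES = {"NONE", "PIECEWISE", "FULL"}


def check_table(table: dict) -> list:
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    for mode, (decode, mixed, needs_piecewise) in table.items():
        # Every cell must resolve to a simple mode, never a composite.
        if decode not in SIMPLE_MODES or mixed not in SIMPLE_MODES:
            problems.append(f"{mode}: cell is not a simple mode")
        # Piecewise compilation is needed iff some routine is PIECEWISE.
        if needs_piecewise != ("PIECEWISE" in (decode, mixed)):
            problems.append(f"{mode}: compile flag disagrees with cells")
        # A simple mode applies itself to both batch kinds.
        if mode in SIMPLE_MODES and not (decode == mixed == mode):
            problems.append(f"{mode}: simple mode must fill both cells")
    return problems


print(check_table(ROUTING_TABLE))
```

If your implementation and this table disagree on any row, the property that fails usually tells you which function has the bug.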
Hitchhiker's notes
- Why is FULL_AND_PIECEWISE the default and not plain FULL? Because chunked prefill is default-on (Phase 3): mixed batches are the common case, not the exception, and a FULL-only config would either crash on them or force eager. The default encodes the workload assumption; change the workload assumption (e.g. a decode-only disaggregated worker, Phase 15) and FULL_DECODE_ONLY becomes the rational pick. Configs are workload claims in disguise.
- Where the dispatch actually happens: per step, the runner builds a BatchDescriptor (batch size + uniform-decode flag); the graph wrapper keys its entry dict on it (lab-01's dict, now two-dimensional). Your runtime_mode_for(m, is_decode) is that lookup's policy half.
- What "attention runs eagerly" costs in PIECEWISE: one-ish launches per attention per layer per step, vs the hundreds saved elsewhere. That's why piecewise keeps most of the win, and why backends that support graph-safe attention metadata (uniform decode) unlock FULL for the decode half, which is the entire point of the composite.
- Failure smell catalog: capture log shows one pass → not the default mode; OOM during capture → ladder too long × two routines (lab-05's memory cost, doubled); "piecewise compilation required" assertion → mode demands PIECEWISE but compilation level didn't split. Ten lines of enum, three distinct production symptoms.
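The "two-dimensional dict" and the "ladder × two routines" cost can be sketched together. Everything below is hypothetical: `BatchKey` is a stand-in for the idea of a batch descriptor (it is not vLLM's `BatchDescriptor` class), the stored value is just the mode name rather than a captured graph, and the eager fallback on a missing key is an assumption about how a wrapper might behave.

```python
from typing import Dict, Iterable, NamedTuple


class BatchKey(NamedTuple):
    """Hypothetical key: padded batch size plus the uniform-decode flag."""
    num_tokens: int
    uniform_decode: bool


def build_entries(decode: str, mixed: str,
                  ladder: Iterable[int]) -> Dict[BatchKey, str]:
    """Pretend capture: one entry per ladder rung per routine that uses
    a graph. A routine whose mode is NONE captures nothing."""
    entries: Dict[BatchKey, str] = {}
    for size in ladder:
        if decode != "NONE":
            entries[BatchKey(size, True)] = decode
        if mixed != "NONE":
            entries[BatchKey(size, False)] = mixed
    return entries


def dispatch(entries: Dict[BatchKey, str], key: BatchKey) -> str:
    """Per-step lookup; a missing key falls back to eager (assumption)."""
    return entries.get(key, "NONE")


ladder = [1, 2, 4, 8]
default_entries = build_entries("FULL", "PIECEWISE", ladder)
decode_only_entries = build_entries("FULL", "NONE", ladder)

# Two routines double the number of captured entries for the same ladder.
print(len(default_entries), len(decode_only_entries))
print(dispatch(default_entries, BatchKey(4, True)))
print(dispatch(decode_only_entries, BatchKey(4, False)))
```

In this toy model the entry count is what scales with both the ladder length and the number of graph-using routines, which is the shape of the capture-time OOM smell in the catalog above.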
Reflect
- Why can't the runtime "just check if the batch is uniform and use FULL when it can" without any enum? (It does check: that's runtime_mode_for. The enum exists for the compile-time half: whether to split at attention must be decided before any batch arrives. Runtime flexibility is bounded by compile-time commitments.)
- A team disables chunked prefill entirely and serves short prompts only. Which mode maximizes their throughput, and what new risk do they take? (FULL: every batch can be graph-shaped now; the risk is any stray mixed/odd batch has no graph and no piecewise fallback: eager cliffs.)
- Sketch the routing table for a hypothetical PIECEWISE_DECODE_ONLY. Why does no such mode ship? (If decode batches, the most uniform kind, can only manage piecewise, mixed can't do better; the composite would collapse to plain PIECEWISE.)
References
- upstream/vllm/config/compilation.py:53: the real enum and its methods; diff your solution against it line by line.
- upstream/vllm/compilation/cuda_graph.py: BatchDescriptor and the per-entry dispatch.
- upstream/vllm/v1/worker/gpu_model_runner.py: where uniform_decode is determined per step.
- vLLM docs, Compilation Config: the user-facing knob this enum sits behind: https://docs.vllm.ai/en/latest/configuration/optimization.html
- Lab-04's capture log: both routines of the default mode, visible at startup.