Lab 05-03 — Reimplement the CUDAGraphMode Dispatch [CPU-OK]

Labs 01–02 established that graphs love uniform, repeated shapes. Now meet the batch that hates them: a mixed batch — Phase 3's chunked prefill riding alongside decodes, every step a different ragged collection of sequence lengths flowing into attention. A FULL graph can't swallow that. vLLM's answer is a small but consequential piece of policy: the CUDAGraphMode enum (upstream/vllm/config/compilation.py:53), which routes pure-decode batches and mixed batches to different graph strategies — and whose composite values (FULL_AND_PIECEWISE, the V1 default) are the reason your lab-04 capture log shows two capture passes. You'll reimplement its dispatch methods exactly, because this tiny enum is where three phases of machinery (chunked prefill, attention metadata, graph constraints) get reconciled in about ten lines.

Why this lab exists

Most engineers meet cudagraph_mode as a config string they cargo-cult when something breaks ("try PIECEWISE"). The enum deserves better: it's a textbook example of encoding a two-dimensional policy in a one-dimensional config, and the dispatch methods you'll write are the decoder ring. Once you've implemented decode_mode/mixed_mode/has_mode yourself, every graphs-related symptom maps to a row of the routing table: capture log has one pass instead of two → someone set FULL; mixed batches mysteriously slow → mode is FULL_DECODE_ONLY and prefill steps run eager; compile time doubled → the mode requires_piecewise_compilation and the model was split at attention.

There's also a compile-time/run-time lesson here that generalizes: some of these flags must be known before the model is compiled (you can't piecewise-replay a graph that wasn't piecewise-compiled), so the enum is consulted in two different epochs of the engine's life. Configuration that crosses epochs is where the subtle bugs live — this lab makes the two consumers explicit.

Background (read first)

import enum

class CUDAGraphMode(enum.Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_DECODE_ONLY = (FULL, NONE)         # full graph for decode, no graph for mixed
    FULL_AND_PIECEWISE = (FULL, PIECEWISE)  # full for decode, piecewise for mixed (V1 default)

The composite modes are tuples (decode_mode, mixed_mode). Why the split:

  • A pure-decode batch is the graph's dream: every request contributes exactly one token, shapes are uniform (padded to a ladder rung — lab-05), attention metadata is regular. Safe for a FULL graph — the entire step, attention included, one replay.
  • A mixed batch (prefill chunks + decodes) has per-request query lengths, ragged attention metadata, varlen kernels — exactly what a recording can't generalize over. Options: no graph at all (NONE), or PIECEWISE — capture the shape-stable runs between attention calls and run attention eagerly. Piecewise is the compromise that keeps most of the launch win (the hundreds of small ops around attention) while letting the one genuinely dynamic op stay dynamic.

PIECEWISE requires the model to have been compiled with attention as a splitting op (torch.compile carves the graph at splitting_ops) — that's the compile-time dependency requires_piecewise_compilation guards.
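
The tuple mechanics can be seen in a few lines. This is a minimal sketch, not the upstream implementation: split is a hypothetical helper introduced here for illustration, and it assumes only the enum definition quoted above. Inside the class body, NONE, PIECEWISE, and FULL are still plain integers, so each composite member ends up with a tuple value such as (2, 0), and looking each element up by value recovers the simple mode.

```python
import enum


class CUDAGraphMode(enum.Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    # In the class body FULL/NONE/PIECEWISE are still ints, so these are (2, 0) and (2, 1).
    FULL_DECODE_ONLY = (FULL, NONE)
    FULL_AND_PIECEWISE = (FULL, PIECEWISE)


def split(mode: CUDAGraphMode) -> tuple[CUDAGraphMode, CUDAGraphMode]:
    """Hypothetical helper: return the (decode, mixed) pair for any mode."""
    if isinstance(mode.value, tuple):
        decode_value, mixed_value = mode.value
        # Enum lookup by value turns the stored ints back into members.
        return CUDAGraphMode(decode_value), CUDAGraphMode(mixed_value)
    # A simple mode applies to both batch kinds.
    return mode, mode


if __name__ == "__main__":
    for mode in CUDAGraphMode:
        decode, mixed = split(mode)
        print(f"{mode.name}: decode={decode.name}, mixed={mixed.name}")
```

The lab's starter.py uses strings and a ROUTINES dict instead of an enum, so treat this only as a picture of what the dict has to encode.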

Files

  • starter.py — implement separate_routine, decode_mode, mixed_mode, has_mode, requires_piecewise_compilation, runtime_mode_for. Modes are strings; composites live in a ROUTINES dict. Your work.
  • solution.py — reference.
  • test_lab.py — the full routing table, every mode × both batch kinds.

Run

LAB_IMPL=starter pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-03-cudagraph-mode -q
pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-03-cudagraph-mode -q   # reference

What you must reproduce

  • separate_routine(m) → is m composite (distinct decode/mixed routines)?
  • decode_mode(m) / mixed_mode(m) → the concrete mode for that batch kind (composites split; simple modes return themselves for both — FULL means full everywhere, which is why it's only safe with chunked prefill disabled or padded-prefill tricks).
  • has_mode(m, target) → does m employ target in any routine?
  • requires_piecewise_compilation(m) → has_mode(m, PIECEWISE).
  • runtime_mode_for(m, is_decode) → the per-batch dispatch the wrapper performs each step (upstream: the BatchDescriptor.uniform_decode flag selecting the entry).

The routing you're proving

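| mode | decode batch | mixed batch | needs piecewise compile? |
| --- | --- | --- | --- |
| NONE | NONE | NONE | no |
| PIECEWISE | PIECEWISE | PIECEWISE | yes |
| FULL | FULL | FULL | no |
| FULL_DECODE_ONLY | FULL | NONE | no |
| FULL_AND_PIECEWISE | FULL | PIECEWISE | yes |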

Memorize the last row — it's the default, and both lab-04 capture passes ((decode, FULL) and (mixed prefill-decode, PIECEWISE)) are its two cells.
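
For quick self-checks while iterating, the table can be written down as data and compared against whatever runtime_mode_for you have so far. The sketch below is an illustration, not part of the lab files: ROUTING_TABLE and routing_mismatches are hypothetical names, and it assumes modes are spelled as the enum member names (adjust the strings if starter.py spells them differently).

```python
# Assumption: mode strings match the enum member names used in the table above.
ROUTING_TABLE = {
    # mode: (decode batch, mixed batch, needs piecewise compile?)
    "NONE": ("NONE", "NONE", False),
    "PIECEWISE": ("PIECEWISE", "PIECEWISE", True),
    "FULL": ("FULL", "FULL", False),
    "FULL_DECODE_ONLY": ("FULL", "NONE", False),
    "FULL_AND_PIECEWISE": ("FULL", "PIECEWISE", True),
}


def routing_mismatches(runtime_mode_for):
    """Return (mode, is_decode, expected, got) for every cell the candidate gets wrong."""
    problems = []
    for mode, (decode, mixed, _needs_piecewise) in ROUTING_TABLE.items():
        for is_decode, expected in ((True, decode), (False, mixed)):
            got = runtime_mode_for(mode, is_decode)
            if got != expected:
                problems.append((mode, is_decode, expected, got))
    return problems


if __name__ == "__main__":
    # A deliberately wrong candidate: it never uses a graph.
    def always_eager(mode, is_decode):
        return "NONE"

    for problem in routing_mismatches(always_eager):
        print(problem)
```

An empty list means every cell of the table matches; test_lab.py remains the authoritative check.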

Hitchhiker's notes

  • Why is FULL_AND_PIECEWISE the default and not plain FULL? Because chunked prefill is default-on (Phase 3): mixed batches are the common case, not the exception, and a FULL-only config would either crash on them or force eager. The default encodes the workload assumption; change the workload assumption (e.g. a decode-only disaggregated worker — Phase 15) and FULL_DECODE_ONLY becomes the rational pick. Configs are workload claims in disguise.
  • Where the dispatch actually happens: per step, the runner builds a BatchDescriptor (batch size + uniform-decode flag); the graph wrapper keys its entry dict on it (lab-01's dict, now two-dimensional). Your runtime_mode_for(m, is_decode) is that lookup's policy half.
  • What "attention runs eagerly" costs in PIECEWISE: roughly one launch per attention call per layer per step, versus the hundreds of launches saved elsewhere. That's why piecewise keeps most of the win, and why backends that support graph-safe attention metadata (uniform decode) unlock FULL for the decode half, which is the entire point of the composite.
  • Failure smell catalog: capture log shows one pass → not the default mode; OOM during capture → ladder too long × two routines (lab-05's memory cost, doubled); "piecewise compilation required" assertion → mode demands PIECEWISE but compilation level didn't split. Ten lines of enum, three distinct production symptoms.
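
The two-dimensional entry dict mentioned above can be pictured with a small stand-in. Everything here is hypothetical and simplified for illustration: BatchKey is not upstream's BatchDescriptor, GraphTable is not the real wrapper, and the "replays" are plain Python callables rather than captured CUDA graphs. The point is only the shape of the lookup: key on (batch size, uniform-decode flag), replay on a hit, run eagerly on a miss.

```python
from typing import Callable, NamedTuple


class BatchKey(NamedTuple):
    """Hypothetical stand-in for the per-step descriptor."""
    batch_size: int
    uniform_decode: bool


class GraphTable:
    """Toy wrapper: maps a BatchKey to a recorded replay callable."""

    def __init__(self) -> None:
        self._entries: dict[BatchKey, Callable[[], str]] = {}

    def capture(self, key: BatchKey, replay: Callable[[], str]) -> None:
        # In a real engine this would record a CUDA graph; here we just store a callable.
        self._entries[key] = replay

    def run(self, key: BatchKey, eager: Callable[[], str]) -> str:
        replay = self._entries.get(key)
        # No captured entry for this (size, kind) pair: fall back to eager execution.
        return replay() if replay is not None else eager()


if __name__ == "__main__":
    table = GraphTable()
    # Same batch size, two routines: one entry per batch kind.
    table.capture(BatchKey(8, True), lambda: "full replay")
    table.capture(BatchKey(8, False), lambda: "piecewise replay")

    print(table.run(BatchKey(8, True), lambda: "eager"))
    print(table.run(BatchKey(8, False), lambda: "eager"))
    print(table.run(BatchKey(16, True), lambda: "eager"))
```

Keying on the flag as well as the size is what lets one batch size carry two different recordings, which is also why the capture cost scales with ladder length times the number of routines.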

Reflect

  • Why can't the runtime "just check if the batch is uniform and use FULL when it can" without any enum? (It does check — that's runtime_mode_for. The enum exists for the compile-time half: whether to split at attention must be decided before any batch arrives. Runtime flexibility is bounded by compile-time commitments.)
  • A team disables chunked prefill entirely and serves short prompts only. Which mode maximizes their throughput, and what new risk do they take? (FULL — every batch can be graph-shaped now; the risk is any stray mixed/odd batch has no graph and no piecewise fallback: eager cliffs.)
  • Sketch the routing table for a hypothetical PIECEWISE_DECODE_ONLY. Why does no such mode ship? (If decode batches — the most uniform — can only manage piecewise, mixed can't do better; the composite would collapse to plain PIECEWISE.)

References

  • upstream/vllm/config/compilation.py:53 — the real enum and its methods; diff your solution against it line by line.
  • upstream/vllm/compilation/cuda_graph.py — BatchDescriptor and the per-entry dispatch.
  • upstream/vllm/v1/worker/gpu_model_runner.py — where uniform_decode is determined per step.
  • vLLM docs, Compilation Config — the user-facing knob this enum sits behind: https://docs.vllm.ai/en/latest/configuration/optimization.html
  • Lab-04's capture log — both routines of the default mode, visible at startup.