Phase 05 — The Hitchhiker's Guide to CUDA Graphs & torch.compile ⭐
← Phase 04 · Course home · Phase 06 →
Flagship phase — written in full. Phases 02–03 made memory and scheduling fast. This phase attacks a different enemy: the CPU, which can be too slow to even tell the GPU what to do.
Contents
- Don't Panic
- Step 1: Why the CPU is a bottleneck at all
- Step 2: CUDA graphs — record once, replay forever
- Step 3: Full vs Piecewise graphs
- Step 4: torch.compile — making the kernels themselves better
- Step 5: How they fit together at runtime
- The invariants to memorize
- What you'll do in this phase
Don't Panic
Two ideas, one breath each:
CUDA graphs: launching a GPU kernel from Python costs CPU time. During decode you launch hundreds of tiny kernels per token — and the CPU can't issue them fast enough, so the GPU sits idle waiting for work. A CUDA graph records that whole sequence of launches once and replays it with a single launch. The CPU overhead vanishes.
torch.compile: instead of running your model op-by-op in Python, PyTorch's compiler traces it into a graph, then fuses and rewrites the ops into fewer, faster kernels. vLLM wraps this with its own backend, caching, and custom optimization passes.
They're complementary: torch.compile makes the kernels better; CUDA graphs make launching
them free. vLLM uses both, together, by default. By the end of this phase you'll have built a
CPU simulation of capture/replay (mini_vllm/cudagraph.py) that reproduces the exact win and
the exact constraints, and you'll have read the real CUDAGraphWrapper.
Step 1: Why the CPU is a bottleneck at all
Recall the decode loop (Phase 0): one token at a time, and each token's forward pass runs many small operations — for each of, say, 32 layers: a QKV projection, attention, an output projection, two MLP matmuls, norms, residual adds… Easily hundreds of GPU kernels per token.
Each kernel is launched from Python/C++ on the CPU. A launch isn't free — it costs a few microseconds of CPU work to set up and enqueue. Do the arithmetic:
~300 kernels/token × ~5 µs CPU launch overhead ≈ 1.5 ms of CPU work per token
If the GPU work for that token is also ~1.5 ms, you're at best 50% utilized — and at small batch sizes (where each kernel is tiny and finishes fast) the GPU finishes each kernel before the CPU can launch the next one. The GPU starves, waiting on the CPU. This is CPU-launch-bound decode, and it's the default failure mode at low batch sizes.
CPU: [launch k1][launch k2][launch k3]......[launch k300] ← the CPU is the critical path
GPU: [k1] idle [k2] idle [k3] idle ..... ← GPU waits between tiny kernels
└ each gap = CPU not done issuing the next launch
You can't make the launches cheaper one by one. But you can stop doing them every step.
Step 2: CUDA graphs — record once, replay forever
A CUDA graph is a recording of a sequence of GPU operations and their dependencies. You "capture" it once by running the forward pass in a special mode; CUDA records every kernel and its arguments into a graph object. Thereafter you replay the whole graph with a single API call — the CPU issues one launch and the GPU rips through all 300 kernels with zero per-kernel CPU involvement.
Without graphs (every step): CPU issues 300 launches → GPU starves between them
With a captured graph: CPU issues 1 "replay" → GPU runs all 300 back-to-back
The catch — and this is the whole reason it's tricky — is that a graph is a frozen recording. It records exact kernels reading from exact memory addresses for exact tensor shapes. So:
- Constraint 1 — fixed shapes. A graph captured for batch size 8 only replays for batch size 8. vLLM captures a graph for each batch size it expects (and pads odd batch sizes up to the nearest captured size). It keeps a dictionary of graphs keyed by shape.
- Constraint 2 — static input buffers. Replay reads from the same memory the capture used. So to run a new token's inputs, you must copy them into the captured input buffer first, then replay. The graph reads from the fixed address; your job is to keep that address valid and current.
These two constraints are exactly what your mini_vllm/cudagraph.py GraphRunner models: a
dict keyed by input shape (Constraint 1), and a static_input buffer you np.copyto into
before replay (Constraint 2). Go read it — it's ~40 lines and it is the mental model.
Why "graphs" plural? Because of Constraint 1, vLLM holds many captured graphs — one per batch size in
cudagraph_capture_sizes. The realCUDAGraphWrapperstores them inconcrete_cudagraph_entries: dict[BatchDescriptor, CUDAGraphEntry](cuda_graph.py:207). Your simulation'sself.graphs: dict[shape, GraphEntry]is the same idea with the GPU filed off.
Step 3: Full vs Piecewise graphs
There's a wrinkle. Some operations can't be captured into a graph cleanly — most importantly attention, because its kernel takes variable-length metadata (the block tables and sequence lengths from Phase 02/03) that change every step and don't fit the "frozen recording" model.
vLLM offers two strategies (the CUDAGraphMode enum, compilation.py:53):
- FULL — capture the entire model forward as one graph. Maximum CPU-overhead removal, but fragile: everything in the forward must be capture-safe, including attention (which needs special handling, e.g. capturing only the decode case where shapes are uniform).
- PIECEWISE — split the forward at the uncapturable ops (attention). Capture each contiguous compiled region between splits as its own small graph; run the split ops (attention) eagerly. You pay a few launches (one per piece + the eager attention) instead of 300 — most of the win, far more robustly.
FULL: [ ============== one graph: whole forward ============== ] (1 replay)
PIECEWISE: [ graph A ] (attention eager) [ graph B ] (attention eager) [ graph C ]
└ capture └ run live └ capture └ run live └ capture
a handful of launches, robust to attention's dynamic metadata
vLLM's V1 default is actually FULL_AND_PIECEWISE (compilation.py:63): use a FULL graph for
pure-decode batches (uniform shapes — safe and fastest) and PIECEWISE for mixed prefill+decode
batches (where attention metadata varies). Your mini_vllm.PiecewiseGraphRunner models exactly
this: it splits ops at an "uncapturable" predicate, captures the rest, runs the splits eagerly —
and a test proves the output is identical to eager.
Step 4: torch.compile — making the kernels themselves better
CUDA graphs remove launch overhead but don't change what runs. torch.compile does the
other half: it traces your model into an FX graph (via TorchDynamo), then a backend
(Inductor) generates fused kernels — e.g. fusing a RMSNorm + a quantization into one kernel,
so you read memory once instead of three times.
vLLM doesn't just use stock torch.compile; it has a custom backend (compilation/ backends.py) and a compilation pipeline with levels (CompilationMode, compilation.py:37):
CompilationMode (the "level"):
0 NONE – pure eager, no compile (what enforce_eager gives you)
1 STOCK_TORCH_COMPILE – plain torch.compile
2 DYNAMO_TRACE_ONCE – trace once, no recompiles
3 VLLM_COMPILE – vLLM's Inductor backend: caching + PIECEWISE compilation +
shape specialization + custom passes ← the V1 default
At level 3, vLLM:
- traces the model once and caches the compiled artifacts (so restarts are fast),
- splits the graph at attention for piecewise compilation (lining up with piecewise CUDA graphs),
- runs custom graph passes (
compilation/passes/) — fusions vLLM knows are safe and profitable for inference but stock Inductor wouldn't do (e.g. fused add+RMSNorm, sequence- parallel rewrites, quant fusions).
You opt a model into all this with one decorator: @support_torch_compile on the model class
(decorators.py:118). That's the seam between "a model" and "the compiler."
Step 5: How they fit together at runtime
model class
└─ @support_torch_compile (Phase 5: opt into the compiler)
└─ torch.compile / VLLM_COMPILE backend → fused kernels, piecewise split at attention
└─ CUDAGraphWrapper (Phase 5: capture/replay per batch size)
└─ for batch size B: replay graph_B (1 launch) OR capture if new shape
Each decode step, the model runner sets a forward_context with the current
cudagraph_runtime_mode and a batch_descriptor (the shape key). The CUDAGraphWrapper
(cuda_graph.py:233) reads that context and either runs eagerly (mode NONE / mismatch),
replays the cached graph for that shape, or captures a new one. That dispatch-by-context
is precisely what your GraphRunner.__call__ does with x.shape as the key.
The invariants to memorize
- CUDA graphs remove CPU launch overhead, not GPU compute. They help when decode is CPU-launch-bound (small batch), not when it's GPU-bound (large batch / prefill).
- A graph is per-shape: one captured graph per batch size; odd sizes are padded up.
- Replay reads static buffers: new inputs must be copied in before replay.
- Attention is the thing that resists capture → piecewise splits around it.
torch.compileimproves kernels; CUDA graphs improve launching. Different problems, used together.enforce_eager=Trueturns both off — your debugging escape hatch (and the only way to get fully dynamic shapes).
What you'll do in this phase
- Read: 01-deep-dive.md walks the real
CUDAGraphWrapper.__call__, theCUDAGraphMode/CompilationModeenums, and@support_torch_compileline by line. - Build: 02-mini-build.md — the capture/replay simulator (reference:
mini_vllm/cudagraph.py). - Labs (see labs/README.md; recommended order 01 → 02 → 05 → 03 → 04):
lab-01-graph-replay-simulator[CPU-OK]— implement capture/replay + shape dispatch + static buffers; pass the tests.lab-02-launch-overhead[CPU-OK]— model launch overhead and find the eager↔graph crossover.lab-03-cudagraph-mode[CPU-OK]— reimplement theCUDAGraphModedispatch (FULL / PIECEWISE / FULL_AND_PIECEWISE) and prove you understand decode-vs-mixed routing.lab-04-graph-vs-eager-real[GPU-REQ]— run real vLLM withenforce_eagervs CUDA graphs, measure the ITL difference (captured output included).lab-05-capture-sizes[CPU-OK]— the capture-size ladder: rung lookup, padding waste, and the denser-ladder-vs-more-captures trade ("why is my batch of 33 running at 40?").
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
When you can explain why a graph helps decode but not prefill, name the two constraints, and draw piecewise vs full from memory, you understand the layer that often doubles low-batch throughput for free.
← Phase 04 · Course home · Phase 06 →