Phase 05 — The Hitchhiker's Guide to CUDA Graphs & torch.compile ⭐

← Phase 04 · Course home · Phase 06 →

Flagship phase — written in full. Phases 02–03 made memory and scheduling fast. This phase attacks a different enemy: the CPU, which can be too slow to even tell the GPU what to do.

Don't Panic
Step 1: Why the CPU is a bottleneck at all
Step 2: CUDA graphs — record once, replay forever
Step 3: Full vs Piecewise graphs
Step 4: torch.compile — making the kernels themselves better
Step 5: How they fit together at runtime
The invariants to memorize
What you'll do in this phase

Don't Panic

Two ideas, one breath each:

CUDA graphs: launching a GPU kernel from Python costs CPU time. During decode you launch hundreds of tiny kernels per token — and the CPU can't issue them fast enough, so the GPU sits idle waiting for work. A CUDA graph records that whole sequence of launches once and replays it with a single launch. The CPU overhead vanishes.

torch.compile: instead of running your model op-by-op in Python, PyTorch's compiler traces it into a graph, then fuses and rewrites the ops into fewer, faster kernels. vLLM wraps this with its own backend, caching, and custom optimization passes.

They're complementary: torch.compile makes the kernels better; CUDA graphs make launching them free. vLLM uses both, together, by default. By the end of this phase you'll have built a CPU simulation of capture/replay (mini_vllm/cudagraph.py) that reproduces the exact win and the exact constraints, and you'll have read the real CUDAGraphWrapper.

Step 1: Why the CPU is a bottleneck at all

Recall the decode loop (Phase 0): one token at a time, and each token's forward pass runs many small operations — for each of, say, 32 layers: a QKV projection, attention, an output projection, two MLP matmuls, norms, residual adds… Easily hundreds of GPU kernels per token.

Each kernel is launched from Python/C++ on the CPU. A launch isn't free — it costs a few microseconds of CPU work to set up and enqueue. Do the arithmetic:

~300 kernels/token × ~5 µs CPU launch overhead ≈ 1.5 ms of CPU work per token

If the GPU work for that token is also ~1.5 ms, you're at best 50% utilized — and at small batch sizes (where each kernel is tiny and finishes fast) the GPU finishes each kernel before the CPU can launch the next one. The GPU starves, waiting on the CPU. This is CPU-launch-bound decode, and it's the default failure mode at low batch sizes.

CPU:  [launch k1][launch k2][launch k3]......[launch k300]   ← the CPU is the critical path
GPU:  [k1] idle  [k2] idle  [k3] idle .....                  ← GPU waits between tiny kernels
       └ each gap = CPU not done issuing the next launch

You can't make the launches cheaper one by one. But you can stop doing them every step.

Step 2: CUDA graphs — record once, replay forever

A CUDA graph is a recording of a sequence of GPU operations and their dependencies. You "capture" it once by running the forward pass in a special mode; CUDA records every kernel and its arguments into a graph object. Thereafter you replay the whole graph with a single API call — the CPU issues one launch and the GPU rips through all 300 kernels with zero per-kernel CPU involvement.

Without graphs (every step):  CPU issues 300 launches  →  GPU starves between them
With a captured graph:        CPU issues 1 "replay"    →  GPU runs all 300 back-to-back

The catch — and this is the whole reason it's tricky — is that a graph is a frozen recording. It records exact kernels reading from exact memory addresses for exact tensor shapes. So:

Constraint 1 — fixed shapes. A graph captured for batch size 8 only replays for batch size 8. vLLM captures a graph for each batch size it expects (and pads odd batch sizes up to the nearest captured size). It keeps a dictionary of graphs keyed by shape.
Constraint 2 — static input buffers. Replay reads from the same memory the capture used. So to run a new token's inputs, you must copy them into the captured input buffer first, then replay. The graph reads from the fixed address; your job is to keep that address valid and current.

These two constraints are exactly what your mini_vllm/cudagraph.py GraphRunner models: a dict keyed by input shape (Constraint 1), and a static_input buffer you np.copyto into before replay (Constraint 2). Go read it — it's ~40 lines and it is the mental model.

Why "graphs" plural? Because of Constraint 1, vLLM holds many captured graphs — one per batch size in cudagraph_capture_sizes. The real CUDAGraphWrapper stores them in concrete_cudagraph_entries: dict[BatchDescriptor, CUDAGraphEntry] (cuda_graph.py:207). Your simulation's self.graphs: dict[shape, GraphEntry] is the same idea with the GPU filed off.

Step 3: Full vs Piecewise graphs

There's a wrinkle. Some operations can't be captured into a graph cleanly — most importantly attention, because its kernel takes variable-length metadata (the block tables and sequence lengths from Phase 02/03) that change every step and don't fit the "frozen recording" model.

vLLM offers two strategies (the CUDAGraphMode enum, compilation.py:53):

FULL — capture the entire model forward as one graph. Maximum CPU-overhead removal, but fragile: everything in the forward must be capture-safe, including attention (which needs special handling, e.g. capturing only the decode case where shapes are uniform).
PIECEWISE — split the forward at the uncapturable ops (attention). Capture each contiguous compiled region between splits as its own small graph; run the split ops (attention) eagerly. You pay a few launches (one per piece + the eager attention) instead of 300 — most of the win, far more robustly.

FULL:       [ ============== one graph: whole forward ============== ]   (1 replay)

PIECEWISE:  [ graph A ] (attention eager) [ graph B ] (attention eager) [ graph C ]
             └ capture   └ run live         └ capture   └ run live        └ capture
            a handful of launches, robust to attention's dynamic metadata

vLLM's V1 default is actually FULL_AND_PIECEWISE (compilation.py:63): use a FULL graph for pure-decode batches (uniform shapes — safe and fastest) and PIECEWISE for mixed prefill+decode batches (where attention metadata varies). Your mini_vllm.PiecewiseGraphRunner models exactly this: it splits ops at an "uncapturable" predicate, captures the rest, runs the splits eagerly — and a test proves the output is identical to eager.

Step 4: torch.compile — making the kernels themselves better

CUDA graphs remove launch overhead but don't change what runs. torch.compile does the other half: it traces your model into an FX graph (via TorchDynamo), then a backend (Inductor) generates fused kernels — e.g. fusing a RMSNorm + a quantization into one kernel, so you read memory once instead of three times.

vLLM doesn't just use stock torch.compile; it has a custom backend (compilation/ backends.py) and a compilation pipeline with levels (CompilationMode, compilation.py:37):

CompilationMode (the "level"):
  0 NONE                – pure eager, no compile (what enforce_eager gives you)
  1 STOCK_TORCH_COMPILE – plain torch.compile
  2 DYNAMO_TRACE_ONCE   – trace once, no recompiles
  3 VLLM_COMPILE        – vLLM's Inductor backend: caching + PIECEWISE compilation +
                          shape specialization + custom passes   ← the V1 default

At level 3, vLLM:

traces the model once and caches the compiled artifacts (so restarts are fast),
splits the graph at attention for piecewise compilation (lining up with piecewise CUDA graphs),
runs custom graph passes (compilation/passes/) — fusions vLLM knows are safe and profitable for inference but stock Inductor wouldn't do (e.g. fused add+RMSNorm, sequence- parallel rewrites, quant fusions).

You opt a model into all this with one decorator: @support_torch_compile on the model class (decorators.py:118). That's the seam between "a model" and "the compiler."

Step 5: How they fit together at runtime

model class
  └─ @support_torch_compile           (Phase 5: opt into the compiler)
        └─ torch.compile / VLLM_COMPILE backend  → fused kernels, piecewise split at attention
              └─ CUDAGraphWrapper      (Phase 5: capture/replay per batch size)
                    └─ for batch size B: replay graph_B  (1 launch)  OR capture if new shape

Each decode step, the model runner sets a forward_context with the current cudagraph_runtime_mode and a batch_descriptor (the shape key). The CUDAGraphWrapper (cuda_graph.py:233) reads that context and either runs eagerly (mode NONE / mismatch), replays the cached graph for that shape, or captures a new one. That dispatch-by-context is precisely what your GraphRunner.__call__ does with x.shape as the key.

The invariants to memorize

CUDA graphs remove CPU launch overhead, not GPU compute. They help when decode is CPU-launch-bound (small batch), not when it's GPU-bound (large batch / prefill).
A graph is per-shape: one captured graph per batch size; odd sizes are padded up.
Replay reads static buffers: new inputs must be copied in before replay.
Attention is the thing that resists capture → piecewise splits around it.
torch.compile improves kernels; CUDA graphs improve launching. Different problems, used together.
enforce_eager=True turns both off — your debugging escape hatch (and the only way to get fully dynamic shapes).

What you'll do in this phase

Read: 01-deep-dive.md walks the real CUDAGraphWrapper.__call__, the CUDAGraphMode/CompilationMode enums, and @support_torch_compile line by line.
Build: 02-mini-build.md — the capture/replay simulator (reference: mini_vllm/cudagraph.py).
Labs (see labs/README.md; recommended order 01 → 02 → 05 → 03 → 04):
- lab-01-graph-replay-simulator [CPU-OK] — implement capture/replay + shape dispatch + static buffers; pass the tests.
- lab-02-launch-overhead [CPU-OK] — model launch overhead and find the eager↔graph crossover.
- lab-03-cudagraph-mode [CPU-OK] — reimplement the CUDAGraphMode dispatch (FULL / PIECEWISE / FULL_AND_PIECEWISE) and prove you understand decode-vs-mixed routing.
- lab-04-graph-vs-eager-real [GPU-REQ] — run real vLLM with enforce_eager vs CUDA graphs, measure the ITL difference (captured output included).
- lab-05-capture-sizes [CPU-OK] — the capture-size ladder: rung lookup, padding waste, and the denser-ladder-vs-more-captures trade ("why is my batch of 33 running at 40?").
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.

When you can explain why a graph helps decode but not prefill, name the two constraints, and draw piecewise vs full from memory, you understand the layer that often doubles low-batch throughput for free.

← Phase 04 · Course home · Phase 06 →

vLLM Mastery — From Zero to Maintainer