Phase 05 — Cheatsheet: CUDA Graphs & torch.compile

The one-liner
When graphs help
The two constraints
CUDAGraphMode (compilation.py:53)
CompilationMode levels (compilation.py:37)
Capture/replay dispatch (cuda_graph.py:233)
Key upstream
Gotchas

The one-liner

Two tools, two different enemies: CUDA graphs kill CPU launch overhead (record the kernel sequence once, replay it with a single launch); torch.compile attacks inefficient kernels (trace → fuse → generate code). vLLM uses both together, and both are on by default.

When graphs help

Help: decode at small batch (CPU-launch-bound: many tiny kernels, the same shapes, repeated every step).
Don't help: prefill / large batch (GPU-bound; launch overhead is negligible and shapes vary).
Speedup ceiling ≈ ops-per-step (lab-02), since N launches become one; it collapses to ~1× when GPU-bound.
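
The ceiling above can be sanity-checked with a toy cost model. This is a sketch under stated assumptions, not the lab-02 code: the function names and all timings are hypothetical, and it assumes that in eager mode CPU launches overlap with GPU work, so each op costs whichever of the two is slower.

```python
def eager_step_us(n_ops: int, launch_us: float, kernel_us: float) -> float:
    """Eager: every op pays a CPU launch. Launch and GPU work overlap,
    so each op costs whichever of the two is slower (toy assumption)."""
    return n_ops * max(launch_us, kernel_us)


def graph_step_us(n_ops: int, launch_us: float, kernel_us: float) -> float:
    """Graph replay: one launch for the whole step, then the GPU work."""
    return launch_us + n_ops * kernel_us


def speedup(n_ops: int, launch_us: float, kernel_us: float) -> float:
    """Ratio of eager step time to graph-replay step time."""
    eager = eager_step_us(n_ops, launch_us, kernel_us)
    return eager / graph_step_us(n_ops, launch_us, kernel_us)


# Hypothetical numbers, chosen only to show the two regimes.
launch_bound = speedup(n_ops=200, launch_us=5.0, kernel_us=0.5)   # tiny kernels
gpu_bound = speedup(n_ops=200, launch_us=5.0, kernel_us=500.0)    # heavy kernels
print(f"launch-bound: {launch_bound:.2f}x, GPU-bound: {gpu_bound:.3f}x")
```

In this model the speedup approaches n_ops as kernel time goes to zero, and approaches 1 once kernel time dominates launch time.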

The two constraints

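Fixed shape: one graph per batch size; a batch whose size has no captured graph is padded up to the next captured size. Stored in concrete_cudagraph_entries: dict[BatchDescriptor, CUDAGraphEntry].
Static buffers: replay reads and writes the same memory addresses it was captured with, so copy new inputs into those buffers before replaying (the input_addresses debug check catches inputs that moved).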
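
Both constraints can be illustrated in plain Python. This is a minimal sketch, not vLLM code: CAPTURE_SIZES, padded_size, and StaticInput are hypothetical names, a Python list stands in for a GPU tensor, and id() stands in for a device address.

```python
import bisect

# Hypothetical capture sizes; a real engine takes these from its config.
CAPTURE_SIZES = [1, 2, 4, 8, 16, 32]


def padded_size(num_tokens, sizes=CAPTURE_SIZES):
    """Smallest captured size >= num_tokens, or None if nothing fits."""
    i = bisect.bisect_left(sizes, num_tokens)
    return sizes[i] if i < len(sizes) else None


class StaticInput:
    """A buffer allocated once; a captured graph would keep reading it."""

    def __init__(self, max_size):
        self.buf = [0] * max_size

    def load(self, values, size):
        """Copy values in place and zero the padding slots up to `size`.

        Assumes len(values) <= size <= len(self.buf)."""
        n = len(values)
        self.buf[:n] = values
        self.buf[n:size] = [0] * (size - n)
        return self.buf


static = StaticInput(max(CAPTURE_SIZES))
addr_before = id(static.buf)

static.load([7, 8, 9], padded_size(3))   # batch of 3 runs the size-4 graph
static.load([1, 2], padded_size(2))      # batch of 2 has its own graph

# The buffer object never changed, only its contents did.
assert id(static.buf) == addr_before
```

A request larger than every captured size gets None here; in that case the sketch assumes the caller falls back to eager execution.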

`CUDAGraphMode` (compilation.py:53)

mode                          decode batch   mixed batch
NONE                          NONE           NONE
PIECEWISE                     PIECEWISE      PIECEWISE
FULL                          FULL           FULL
FULL_DECODE_ONLY              FULL           NONE
FULL_AND_PIECEWISE (default)  FULL           PIECEWISE

Composite modes = (decode_mode, mixed_mode) tuples. requires_piecewise_compilation = has_mode(PIECEWISE).
Attention is why mixed batches go PIECEWISE (variable metadata can't be frozen).
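
The tuple encoding of composite modes can be sketched with a plain enum. This is a simplified re-creation for illustration, not a copy of compilation.py: the class name GraphModeSketch, the integer values, and the is_composite helper are assumptions; only the member names, the (decode_mode, mixed_mode) layout, has_mode, and requires_piecewise_compilation come from the notes above.

```python
import enum


class GraphModeSketch(enum.Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    # Composite modes: (decode_mode, mixed_mode) pairs of the values above.
    FULL_DECODE_ONLY = (FULL, NONE)
    FULL_AND_PIECEWISE = (FULL, PIECEWISE)

    def is_composite(self) -> bool:
        return isinstance(self.value, tuple)

    def decode_mode(self) -> "GraphModeSketch":
        """Mode applied to pure-decode batches."""
        return GraphModeSketch(self.value[0]) if self.is_composite() else self

    def mixed_mode(self) -> "GraphModeSketch":
        """Mode applied to mixed prefill/decode batches."""
        return GraphModeSketch(self.value[1]) if self.is_composite() else self

    def has_mode(self, mode: "GraphModeSketch") -> bool:
        """True if `mode` is used for either kind of batch."""
        return mode in (self.decode_mode(), self.mixed_mode())

    def requires_piecewise_compilation(self) -> bool:
        return self.has_mode(GraphModeSketch.PIECEWISE)


default = GraphModeSketch.FULL_AND_PIECEWISE
print(default.decode_mode().name, default.mixed_mode().name)
```

Reading the table row by row is then just a pair of lookups: decode_mode() gives the second column and mixed_mode() the third.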

`CompilationMode` levels (compilation.py:37)

0 NONE · 1 STOCK_TORCH_COMPILE · 2 DYNAMO_TRACE_ONCE · 3 VLLM_COMPILE (default: caching + piecewise + shape specialization + custom passes).

Capture/replay dispatch (cuda_graph.py:233)

mode==NONE or mode!=mine        -> run eager
shape unseen                    -> CAPTURE (torch.cuda.graph), cache, return real output
shape seen                      -> entry.cudagraph.replay(); return cached output  <- the win
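
The three branches above can be traced in a GPU-free sketch. Everything here is a hypothetical stand-in, not the upstream CUDAGraphWrapper: FakeGraph imitates a CUDA graph by storing a callable, BatchKey plays the role of BatchDescriptor, and Python lists play the role of static tensors.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class BatchKey:
    """Hashable shape key (hypothetical stand-in for BatchDescriptor)."""
    num_tokens: int


class FakeGraph:
    """Pure-Python stand-in for a CUDA graph: records one callable."""

    def __init__(self):
        self._recorded = None

    def capture(self, fn):
        self._recorded = fn
        return fn()  # the sketch runs fn once while "recording" it

    def replay(self):
        self._recorded()  # re-runs the recorded work on the same buffers


@dataclass
class Entry:
    graph: Optional[FakeGraph] = None
    output: Any = None


class GraphWrapper:
    def __init__(self, runnable, my_mode):
        self.runnable = runnable
        self.my_mode = my_mode
        self.entries = {}  # BatchKey -> Entry
        self.captures = 0
        self.replays = 0

    def __call__(self, static_in, static_out, *, mode, key):
        if mode == "NONE" or mode != self.my_mode:
            return self.runnable(static_in, static_out)      # run eager
        entry = self.entries.setdefault(key, Entry())
        if entry.graph is None:                              # shape unseen
            graph = FakeGraph()
            entry.output = graph.capture(
                lambda: self.runnable(static_in, static_out))
            entry.graph = graph
            self.captures += 1
            return entry.output
        entry.graph.replay()                                 # shape seen
        self.replays += 1
        return entry.output


def double(static_in, static_out):
    static_out[:] = [2 * x for x in static_in]  # writes in place
    return static_out


wrapper = GraphWrapper(double, my_mode="FULL")
inp, out = [1, 2], [0, 0]
first = list(wrapper(inp, out, mode="FULL", key=BatchKey(2)))   # capture
inp[:] = [5, 6]                                                 # copy new inputs in
second = list(wrapper(inp, out, mode="FULL", key=BatchKey(2)))  # replay
```

The replay path never calls the wrapper's runnable directly; it re-runs what was recorded against the same input and output objects, which is why new inputs have to be copied into inp rather than passed as a fresh list.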

Key upstream

vllm/compilation/cuda_graph.py:145 CUDAGraphWrapper · :233 __call__ · :128 CUDAGraphEntry
vllm/config/compilation.py:37 CompilationMode · :53 CUDAGraphMode · :381 CompilationConfig
vllm/compilation/decorators.py:118 @support_torch_compile
vllm/compilation/backends.py VllmBackend · passes/pass_manager.py custom passes

Gotchas

enforce_eager=True disables both graphs and compile (debug/odd-shapes escape hatch).
Startup pays a one-time capture+compile cost (amortized; compile artifacts cached across runs).
Piecewise needs the model compiled piecewise — you can't piecewise-replay a non-split graph.

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

vLLM Mastery — From Zero to Maintainer