Phase 05 — Cheatsheet: CUDA Graphs & torch.compile
Contents
- The one-liner
- When graphs help
- The two constraints
- CUDAGraphMode (compilation.py:53)
- CompilationMode levels (compilation.py:37)
- Capture/replay dispatch (cuda_graph.py:233)
- Key upstream
- Gotchas
The one-liner
Two different enemies: CUDA graphs kill CPU launch overhead (record once, replay in one launch); torch.compile makes the kernels better (trace → fuse → generate). Used together, on by default.
When graphs help
- Help: decode at small batch (CPU-launch-bound, many tiny kernels, same shape, many repeats).
- Don't: prefill / large batch (GPU-bound; launch overhead negligible; shapes vary).
- Limit: speedup ≈ ops-per-step (lab-02), since replay turns N kernel launches into one; collapses to ~1 when GPU-bound.
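A back-of-the-envelope model of that speedup limit. It assumes each kernel costs a fixed CPU launch overhead plus some GPU time, and that launching and execution overlap, so a step takes the larger of the two totals. The function name and the microsecond figures are illustrative assumptions, not measurements from lab-02.

```python
def graph_speedup(n_ops: int, launch_us: float, kernel_us: float) -> float:
    """Estimated step-time ratio: eager / graph replay.

    Toy model (assumption): CPU launches and GPU kernels overlap, so a step
    costs max(total launch time, total GPU time). Replay issues one launch.
    """
    gpu_time = n_ops * kernel_us
    eager = max(n_ops * launch_us, gpu_time)
    replay = max(launch_us, gpu_time)
    return eager / replay


# Launch-bound decode: many tiny kernels, large win.
decode_like = graph_speedup(n_ops=200, launch_us=5.0, kernel_us=0.5)
# GPU-bound prefill: kernels dominate, the win disappears.
prefill_like = graph_speedup(n_ops=200, launch_us=5.0, kernel_us=50.0)
# Kernels of negligible cost: the ratio reaches n_ops, the upper bound.
upper_bound = graph_speedup(n_ops=200, launch_us=5.0, kernel_us=0.0)
print(decode_like, prefill_like, upper_bound)
```

In this model the ratio can never exceed `n_ops`, which matches the ops-per-step bound, and it falls to 1 as soon as GPU time exceeds total launch time.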
The two constraints
- Fixed shape: one graph per batch size; pad odd sizes up. Stored in concrete_cudagraph_entries: dict[BatchDescriptor, CUDAGraphEntry].
- Static buffers: replay reads the same memory; copy new inputs in first (input_addresses debug check).
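Both constraints can be sketched in plain Python, with a list standing in for a static GPU tensor. The capture-size list and both helper names are hypothetical, chosen for illustration; they are not vLLM's actual values or functions.

```python
import bisect
from typing import List, Optional

# Hypothetical capture sizes (sorted); real deployments pick their own list.
CAPTURE_SIZES = [1, 2, 4, 8, 16, 32]


def padded_batch_size(n: int, sizes: List[int] = CAPTURE_SIZES) -> Optional[int]:
    """Smallest captured size >= n, or None if n exceeds every captured size."""
    i = bisect.bisect_left(sizes, n)
    return sizes[i] if i < len(sizes) else None


def copy_into_static(static_buf: List[int], new_inputs: List[int], pad: int = 0) -> None:
    """Overwrite the buffer in place so its identity (address) never changes."""
    n = len(new_inputs)
    static_buf[:n] = new_inputs
    static_buf[n:] = [pad] * (len(static_buf) - n)


size = padded_batch_size(5)      # 5 requests are padded up to the size-8 graph
static_buf = [0] * size          # allocated once, before capture
buf_id = id(static_buf)
copy_into_static(static_buf, [11, 12, 13, 14, 15])
print(size, static_buf, id(static_buf) == buf_id)
```

The in-place slice assignment is the point: rebinding `static_buf` to a new list would be the analogue of handing the graph a new tensor address, which a replay would never see.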
CUDAGraphMode (compilation.py:53)
| mode | decode batch | mixed batch |
|---|---|---|
| NONE | NONE | NONE |
| PIECEWISE | PIECEWISE | PIECEWISE |
| FULL | FULL | FULL |
| FULL_DECODE_ONLY | FULL | NONE |
| FULL_AND_PIECEWISE (default) | FULL | PIECEWISE |
- Composite modes = (decode_mode, mixed_mode) tuples. requires_piecewise_compilation = has_mode(PIECEWISE).
- Attention is why mixed batches go PIECEWISE (variable metadata can't be frozen).
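The table and the tuple encoding can be reproduced with a small enum. This is a sketch modeled on the names mentioned above, not a copy of compilation.py: the class name `GraphMode`, the integer values, and the `is_composite` helper are assumptions.

```python
import enum


class GraphMode(enum.Enum):
    """Illustrative stand-in for CUDAGraphMode; values are assumptions."""

    NONE = 0
    PIECEWISE = 1
    FULL = 2
    # Composite modes: (decode_mode value, mixed_mode value)
    FULL_DECODE_ONLY = (2, 0)
    FULL_AND_PIECEWISE = (2, 1)

    def is_composite(self) -> bool:
        return isinstance(self.value, tuple)

    def decode_mode(self) -> "GraphMode":
        return GraphMode(self.value[0]) if self.is_composite() else self

    def mixed_mode(self) -> "GraphMode":
        return GraphMode(self.value[1]) if self.is_composite() else self

    def has_mode(self, mode: "GraphMode") -> bool:
        return mode in (self.decode_mode(), self.mixed_mode())

    def requires_piecewise_compilation(self) -> bool:
        return self.has_mode(GraphMode.PIECEWISE)


# Print one row per mode, matching the table columns.
for m in GraphMode:
    print(m.name, m.decode_mode().name, m.mixed_mode().name,
          m.requires_piecewise_compilation())
```

Simple modes return themselves for both columns, so the first three table rows fall out of the same two accessors as the composite rows.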
CompilationMode levels (compilation.py:37)
0 NONE · 1 STOCK_TORCH_COMPILE · 2 DYNAMO_TRACE_ONCE · 3 VLLM_COMPILE (default: caching +
piecewise + shape specialization + custom passes).
Capture/replay dispatch (cuda_graph.py:233)
mode==NONE or mode!=mine -> run eager
shape unseen -> CAPTURE (torch.cuda.graph), cache, return real output
shape seen -> entry.cudagraph.replay(); return cached output <- the win
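The three branches above, as a toy model that runs without a GPU. "Capture" here just runs the function and caches its output list, and "replay" refills that same list in place; in the real wrapper those steps would be recording under torch.cuda.graph and calling replay() on the stored graph. The class name, the string mode labels, and the `log` attribute are hypothetical.

```python
from typing import Callable, Dict, List


class ToyGraphWrapper:
    """Toy capture/replay dispatcher keyed on batch size; no CUDA involved."""

    def __init__(self, fn: Callable[[List[float]], List[float]], my_mode: str):
        self.fn = fn
        self.my_mode = my_mode
        self.entries: Dict[int, List[float]] = {}  # batch size -> cached output
        self.log: List[str] = []

    def __call__(self, xs: List[float], runtime_mode: str) -> List[float]:
        if runtime_mode == "NONE" or runtime_mode != self.my_mode:
            self.log.append("eager")
            return self.fn(xs)
        key = len(xs)
        if key not in self.entries:
            self.log.append("capture")
            self.entries[key] = self.fn(xs)  # real code records a graph here
            return self.entries[key]
        self.log.append("replay")
        out = self.entries[key]
        out[:] = self.fn(xs)  # stand-in for replay refilling static outputs
        return out


wrapper = ToyGraphWrapper(lambda xs: [2 * x for x in xs], my_mode="FULL")
a = wrapper([1.0, 2.0], "FULL")    # unseen shape: capture
b = wrapper([3.0, 4.0], "FULL")    # seen shape: replay
c = wrapper([5.0], "PIECEWISE")    # mode mismatch: eager
print(wrapper.log, a is b, b, c)
```

Note that `a` and `b` are the same object: a replay overwrites the cached output, so a caller that needs the earlier result has to consume or copy it before the next call.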
Key upstream
- vllm/compilation/cuda_graph.py:145 CUDAGraphWrapper · :233 __call__ · :128 CUDAGraphEntry
- vllm/config/compilation.py:37 CompilationMode · :53 CUDAGraphMode · :381 CompilationConfig
- vllm/compilation/decorators.py:118 @support_torch_compile
- vllm/compilation/backends.py VllmBackend · passes/pass_manager.py custom passes
Gotchas
- enforce_eager=True disables both graphs and compile (debug/odd-shapes escape hatch).
- Startup pays a one-time capture+compile cost (amortized; compile artifacts cached across runs).
- Piecewise needs the model compiled piecewise — you can't piecewise-replay a non-split graph.
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md