Phase 05 — Cheatsheet: CUDA Graphs & torch.compile

The one-liner

Two tools, two different bottlenecks: CUDA graphs remove CPU launch overhead (record the kernel sequence once, replay it in one launch); torch.compile makes the kernels themselves better (trace → fuse → generate). The two are used together, and both are on by default.

When graphs help

  • Help: decode at small batch (CPU-launch-bound, many tiny kernels, same shape, many repeats).
  • Don't: prefill / large batch (GPU-bound; launch overhead negligible; shapes vary).
  • Speedup ceiling ≈ ops-per-step (lab-02); the speedup collapses to ~1× when the step is GPU-bound.
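
The ceiling above can be checked with a toy timing model. This is a minimal sketch, not a measurement: it assumes CPU launches and GPU execution overlap perfectly, so a step takes as long as the slower of the two, and every number (kernel count, microseconds) is hypothetical.

```python
def step_time_us(num_kernels: int, launch_us: float, gpu_us_per_kernel: float,
                 graph: bool = False) -> float:
    """Toy model: a step costs max(CPU launch time, GPU execution time)."""
    # Eager mode pays one CPU launch per kernel; a graph replay pays one launch total.
    cpu = launch_us if graph else num_kernels * launch_us
    gpu = num_kernels * gpu_us_per_kernel
    return max(cpu, gpu)


def graph_speedup(num_kernels: int, launch_us: float, gpu_us_per_kernel: float) -> float:
    """Ratio of eager step time to graph-replay step time under the toy model."""
    eager = step_time_us(num_kernels, launch_us, gpu_us_per_kernel, graph=False)
    graphed = step_time_us(num_kernels, launch_us, gpu_us_per_kernel, graph=True)
    return eager / graphed


# Hypothetical numbers: 200 kernels per step, 5 us per CPU launch.
decode_like = graph_speedup(200, 5.0, 1.0)     # tiny kernels: launch-bound
prefill_like = graph_speedup(200, 5.0, 500.0)  # heavy kernels: GPU-bound
ceiling = graph_speedup(200, 5.0, 0.0)         # GPU work -> 0: speedup -> num_kernels
```

In this model the decode-like case speeds up several times over, the prefill-like case does not speed up at all, and the ceiling equals the number of kernels per step.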

The two constraints

  1. Fixed shape — one graph per batch size; pad odd sizes up. Stored in concrete_cudagraph_entries: dict[BatchDescriptor, CUDAGraphEntry].
  2. Static buffers — replay reads and writes the same memory addresses it saw at capture, so copy new inputs into those buffers before replaying (input_addresses is the debug check for this).

CUDAGraphMode (compilation.py:53)

mode                          decode batch   mixed batch
NONE                          NONE           NONE
PIECEWISE                     PIECEWISE      PIECEWISE
FULL                          FULL           FULL
FULL_DECODE_ONLY              FULL           NONE
FULL_AND_PIECEWISE (default)  FULL           PIECEWISE

  • Composite modes = (decode_mode, mixed_mode) tuples. requires_piecewise_compilation = has_mode(PIECEWISE).
  • Attention is why mixed batches go PIECEWISE (variable metadata can't be frozen).
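
The tuple encoding of composite modes can be sketched as a small enum. This is an illustrative re-creation under assumptions, not the upstream class: `GraphMode`, the integer values, and the `is_composite` helper are hypothetical, and only the method names `decode_mode`, `mixed_mode`, `has_mode`, and `requires_piecewise_compilation` come from the notes above.

```python
import enum


class GraphMode(enum.Enum):
    """Hypothetical stand-in for CUDAGraphMode; values are illustrative."""
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    # Composite modes are (decode_mode, mixed_mode) tuples of the simple values.
    FULL_DECODE_ONLY = (2, 0)
    FULL_AND_PIECEWISE = (2, 1)

    def is_composite(self) -> bool:
        return isinstance(self.value, tuple)

    def decode_mode(self) -> "GraphMode":
        """Mode used for pure-decode batches."""
        return GraphMode(self.value[0]) if self.is_composite() else self

    def mixed_mode(self) -> "GraphMode":
        """Mode used for mixed prefill/decode batches."""
        return GraphMode(self.value[1]) if self.is_composite() else self

    def has_mode(self, mode: "GraphMode") -> bool:
        """True if either routine of this mode uses the given simple mode."""
        if self.is_composite():
            return mode.value in self.value
        return self is mode

    def requires_piecewise_compilation(self) -> bool:
        return self.has_mode(GraphMode.PIECEWISE)


default = GraphMode.FULL_AND_PIECEWISE
routes = (default.decode_mode().name, default.mixed_mode().name)
```

Each row of the table corresponds to one member: the two columns are what `decode_mode()` and `mixed_mode()` return.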

CompilationMode levels (compilation.py:37)

0 NONE · 1 STOCK_TORCH_COMPILE · 2 DYNAMO_TRACE_ONCE · 3 VLLM_COMPILE (default: caching + piecewise + shape specialization + custom passes).

Capture/replay dispatch (cuda_graph.py:233)

mode==NONE or mode!=mine        -> run eager
shape unseen                    -> CAPTURE (torch.cuda.graph), cache, return real output
shape seen                      -> entry.cudagraph.replay(); return cached output  <- the win
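
The three branches above can be modelled in plain Python. This is a toy sketch, not the upstream wrapper: `ToyGraph` and `ToyGraphWrapper` are hypothetical, lists stand in for static tensors, batch length stands in for the shape key, and "capture" just binds a function to fixed buffers where real code would record kernels inside `torch.cuda.graph`.

```python
from typing import Callable

Step = Callable[[list[int]], list[int]]


class ToyGraph:
    """Stand-in for a captured graph: one recorded step bound to fixed buffers."""

    def __init__(self, fn: Step, static_in: list[int], static_out: list[int]) -> None:
        self._fn, self._in, self._out = fn, static_in, static_out

    def replay(self) -> None:
        # Re-run the recorded work, writing into the same output buffer.
        self._out[:] = self._fn(self._in)


class ToyGraphWrapper:
    def __init__(self, fn: Step, my_mode: str) -> None:
        self.fn = fn
        self.my_mode = my_mode
        self.entries: dict[int, tuple[ToyGraph, list[int], list[int]]] = {}
        self.captures = 0

    def __call__(self, batch: list[int], runtime_mode: str) -> list[int]:
        # Branch 1: graphs are off, or this call belongs to another wrapper's mode.
        if runtime_mode == "NONE" or runtime_mode != self.my_mode:
            return self.fn(batch)
        key = len(batch)
        # Branch 2: unseen shape, so capture, cache, and return the real output.
        if key not in self.entries:
            static_in = list(batch)
            static_out = self.fn(static_in)
            self.entries[key] = (ToyGraph(self.fn, static_in, static_out),
                                 static_in, static_out)
            self.captures += 1
            return static_out
        # Branch 3: seen shape, so copy inputs into the static buffer and replay.
        graph, static_in, static_out = self.entries[key]
        static_in[:] = batch
        graph.replay()
        return static_out


def double(xs: list[int]) -> list[int]:
    return [2 * x for x in xs]


wrapper = ToyGraphWrapper(double, my_mode="FULL")
first = wrapper([1, 2], "FULL")    # capture for batch size 2
second = wrapper([3, 4], "FULL")   # replay into the same output buffer
```

Note that `second` is the same object as `first`: in this model the cached output is a static buffer that the next replay overwrites, so a caller would need to consume or copy it before the next step.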

Key upstream

  • vllm/compilation/cuda_graph.py:145 CUDAGraphWrapper · :233 __call__ · :128 CUDAGraphEntry
  • vllm/config/compilation.py:37 CompilationMode · :53 CUDAGraphMode · :381 CompilationConfig
  • vllm/compilation/decorators.py:118 @support_torch_compile
  • vllm/compilation/backends.py VllmBackend · passes/pass_manager.py custom passes

Gotchas

  • enforce_eager=True disables both graphs and compile (debug/odd-shapes escape hatch).
  • Startup pays a one-time capture+compile cost (amortized; compile artifacts cached across runs).
  • Piecewise needs the model compiled piecewise — you can't piecewise-replay a non-split graph.

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md