Phase 05 — Interview Questions: CUDA Graphs & torch.compile
Cover the answer, attempt out loud, then compare. This topic separates people who've operated a serving stack from those who've only read about it.
Q1. What is a CUDA graph and what exactly does it speed up?
Model answer
A CUDA graph is a recording of a sequence of GPU operations and their dependencies, captured once and replayed with a single launch call. It removes CPU kernel-launch overhead; it does not make the GPU compute any faster. In decode you issue hundreds of tiny kernels per token; at small batch sizes the CPU can't issue them fast enough, and the GPU sits idle between kernels. Replaying a captured graph issues one launch and the GPU runs the whole recorded sequence back-to-back, removing the per-kernel CPU cost. It does nothing for the actual math, so it helps only when you're CPU-launch-bound.
Q2. Why does it help decode but not prefill (or large batches)?
Model answer
Decode at small batch is launch-bound: many tiny kernels, each finishing before the CPU issues the next, repeated for thousands of steps at the same shape — ideal for graphs. Prefill (and large-batch decode) is compute-bound: kernels are large, so launch overhead is negligible relative to the GPU work, and shapes vary so a captured graph wouldn't be reused. Quantitatively (lab-02): the launch-overhead speedup approaches the number of ops per step in the limit of many same-shape repeats, and collapses to ~1 when the GPU work per step dominates.
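The two limits above can be checked with a toy cost model. This is a minimal sketch, not lab-02's actual model: the function names and all timing values are assumptions chosen for illustration, and eager execution is approximated as "each kernel costs the slower of its launch and its GPU work".

```python
def eager_step_time(n_ops: int, t_launch: float, t_kernel: float) -> float:
    # Eager: launches are asynchronous, so each kernel costs whichever is
    # slower, the CPU issuing it or the GPU running it (toy assumption).
    return n_ops * max(t_launch, t_kernel)


def graph_step_time(n_ops: int, t_launch: float, t_kernel: float) -> float:
    # Replay: one launch for the whole step, then kernels run back-to-back.
    return t_launch + n_ops * t_kernel


def graph_speedup(n_ops: int, t_launch: float, t_kernel: float,
                  repeats: int, capture_cost: float = 0.0) -> float:
    # The one-time capture cost is amortized over every same-shape replay.
    eager_total = repeats * eager_step_time(n_ops, t_launch, t_kernel)
    graph_total = capture_cost + repeats * graph_step_time(n_ops, t_launch, t_kernel)
    return eager_total / graph_total


if __name__ == "__main__":
    # Launch-bound decode: tiny kernels, many repeats at one shape.
    print(graph_speedup(300, 5e-6, 1e-7, repeats=10_000, capture_cost=1e-3))
    # Compute-bound prefill: large kernels, launch cost is noise.
    print(graph_speedup(300, 5e-6, 1e-3, repeats=10_000, capture_cost=1e-3))
```

As the kernel time goes to zero and the repeat count grows, the ratio tends to n_ops; as the kernel time grows past the launch time, it tends to 1. With only a handful of repeats the capture cost dominates, which is why shapes that are not reused are not worth capturing.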
Q3. What are the constraints a captured graph imposes, and how does vLLM satisfy them?
Model answer
(1) Fixed shapes: a graph captured for batch size B replays only for B. vLLM captures one
graph per size in cudagraph_capture_sizes and pads any other batch size up to the next captured
size; CUDAGraphWrapper keys graphs in concrete_cudagraph_entries: dict[BatchDescriptor,...]
(cuda_graph.py:207). (2) Static input buffers — replay reads from the same memory the
capture used, so the model runner writes each step's inputs into persistent buffers before
replay, and a debug check asserts the input addresses are unchanged
(CUDAGraphEntry.input_addresses, cuda_graph.py:135/:346).
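Both constraints can be illustrated in plain Python. This is a hedged sketch, not vLLM's code: pad_to_captured_size and StaticInputBuffer are hypothetical names, a Python list stands in for a persistent GPU tensor, and id() stands in for the device address that the debug check compares.

```python
import bisect
from typing import List, Optional, Sequence


def pad_to_captured_size(batch_size: int, capture_sizes: Sequence[int]) -> Optional[int]:
    """Return the smallest captured size >= batch_size, or None if none fits."""
    sizes = sorted(capture_sizes)
    i = bisect.bisect_left(sizes, batch_size)
    return sizes[i] if i < len(sizes) else None


class StaticInputBuffer:
    """Persistent buffer that is overwritten in place on every step."""

    def __init__(self, max_size: int, pad_value: int = 0) -> None:
        self.pad_value = pad_value
        self.data: List[int] = [pad_value] * max_size

    def write(self, tokens: Sequence[int], padded_size: int) -> None:
        n = len(tokens)
        if not n <= padded_size <= len(self.data):
            raise ValueError("padded_size must cover the tokens and fit the buffer")
        # Slice assignment mutates the existing list, so its identity
        # (the stand-in for a device address) never changes.
        self.data[:n] = list(tokens)
        self.data[n:padded_size] = [self.pad_value] * (padded_size - n)


if __name__ == "__main__":
    buf = StaticInputBuffer(max_size=8)
    address = id(buf.data)
    size = pad_to_captured_size(3, [1, 2, 4, 8])
    buf.write([11, 12, 13], size)
    assert id(buf.data) == address  # same memory the "capture" saw
    print(size, buf.data[:size])
```

Returning None for a batch larger than every captured size leaves the fallback to the caller; what the real runner does in that case is outside this sketch.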
Q4. Full vs piecewise CUDA graphs — what's the difference and why does vLLM default to both?
Model answer
FULL captures the entire model forward as one graph — maximum overhead removal but fragile,
because everything (including attention with its variable metadata) must be capture-safe.
PIECEWISE splits the forward at the uncapturable ops (attention), captures each contiguous
compiled region, and runs the split ops eagerly — most of the win, far more robust. vLLM's V1
default FULL_AND_PIECEWISE (compilation.py:63) uses a FULL graph for pure-decode batches
(uniform shapes, safe and fastest) and PIECEWISE for mixed prefill+decode batches (variable
attention metadata). It's a tuple (decode_mode=FULL, mixed_mode=PIECEWISE) and the runner picks
per batch.
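The per-batch choice can be sketched as a pair of modes plus a selector. The names below (RuntimeMode, ModePair, pick_runtime_mode) are hypothetical stand-ins for illustration and are not vLLM's actual enum or runner logic.

```python
from enum import Enum
from typing import NamedTuple


class RuntimeMode(Enum):
    NONE = 0       # run eagerly, no graph
    PIECEWISE = 1  # graph the compiled regions, run attention eagerly
    FULL = 2       # graph the whole forward


class ModePair(NamedTuple):
    decode_mode: RuntimeMode  # used for pure-decode batches
    mixed_mode: RuntimeMode   # used for mixed prefill+decode batches


# Illustrative counterpart of the FULL_AND_PIECEWISE default described above.
FULL_AND_PIECEWISE = ModePair(RuntimeMode.FULL, RuntimeMode.PIECEWISE)


def pick_runtime_mode(pair: ModePair, is_pure_decode: bool,
                      is_warmup: bool = False) -> RuntimeMode:
    # Warmup/profiling runs skip graphs entirely in this sketch.
    if is_warmup:
        return RuntimeMode.NONE
    return pair.decode_mode if is_pure_decode else pair.mixed_mode


if __name__ == "__main__":
    print(pick_runtime_mode(FULL_AND_PIECEWISE, is_pure_decode=True))
    print(pick_runtime_mode(FULL_AND_PIECEWISE, is_pure_decode=False))
```

Modelling the setting as a pair keeps the trade-off explicit: the fragile-but-fastest mode is used only where shapes are uniform, and the robust mode covers everything else.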
Q5. How does CUDA graphing relate to torch.compile? Are they the same thing?
Model answer
No — they solve different problems and are used together. torch.compile traces the model
(TorchDynamo) and generates better/fused kernels (Inductor), reducing memory traffic and
kernel count. CUDA graphs make launching whatever kernels you have nearly free. vLLM's level-3
VLLM_COMPILE backend (compilation.py:48) additionally caches compiled artifacts, splits the
graph at attention for piecewise compilation (which lines up with piecewise CUDA-graph capture),
and runs custom fusion passes. A model opts in with @support_torch_compile
(decorators.py:118). Net: compile improves the kernels, graphs remove launch overhead.
Q6. What do the CompilationMode levels mean, and when would you lower them?
Model answer
NONE (0) = pure eager; STOCK_TORCH_COMPILE (1) = plain torch.compile; DYNAMO_TRACE_ONCE
(2) = trace once, no recompiles; VLLM_COMPILE (3) = vLLM's Inductor backend with caching,
piecewise compilation, shape specialization, and custom passes (the V1 default). You'd lower it
(or set enforce_eager=True, which disables compile and graphs) to debug a kernel, handle
genuinely dynamic shapes that defeat specialization, or cut the startup compile/capture cost when
that matters more than steady-state throughput.
Q7. (Deep) Walk the lifecycle of one decode step through the compile + graph layers.
Model answer
The model runner picks the cudagraph_runtime_mode for this batch (FULL if pure decode,
PIECEWISE if mixed, NONE during warmup/profiling) and a batch_descriptor (shape key), writes
the step's token/position tensors into persistent input buffers (padding the batch to a captured
size), and sets these on the forward_context. The compiled forward runs; inside it, each
CUDAGraphWrapper reads the context — if the mode matches and the shape is known it
replay()s that graph (one launch) and returns the cached output; if the shape is new it
captures; if mode is NONE it runs eagerly. Attention pieces run eagerly under PIECEWISE. The
sampler then produces the token. (cuda_graph.py:233, gpu_model_runner.py.)
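The wrapper's three-way dispatch can be mimicked without a GPU. This is a toy sketch under stated assumptions: ToyGraphWrapper is a hypothetical class, modes are plain strings, "capturing" just records that a shape key has been seen, and treating a mode mismatch as "run the wrapped callable directly" is this sketch's reading of "if the mode matches".

```python
from typing import Callable, Dict, Hashable, List


class ToyGraphWrapper:
    """Dispatches each call to eager, capture, or replay and logs the path."""

    def __init__(self, fn: Callable[[List[int]], List[int]], wrapper_mode: str) -> None:
        self.fn = fn
        self.wrapper_mode = wrapper_mode
        # shape key -> recorded callable (stand-in for a captured graph)
        self.entries: Dict[Hashable, Callable[[List[int]], List[int]]] = {}
        self.log: List[str] = []

    def __call__(self, runtime_mode: str, shape_key: Hashable,
                 static_input: List[int]) -> List[int]:
        if runtime_mode == "NONE" or runtime_mode != self.wrapper_mode:
            # Graphs disabled for this batch, or this wrapper's mode is not active.
            self.log.append("eager")
            return self.fn(static_input)
        if shape_key not in self.entries:
            # First time at this shape: record it, and still produce the output.
            self.log.append("capture")
            self.entries[shape_key] = self.fn
            return self.fn(static_input)
        # Known shape: one "launch" of the recorded work on the static buffer.
        self.log.append("replay")
        return self.entries[shape_key](static_input)


if __name__ == "__main__":
    wrapper = ToyGraphWrapper(lambda xs: [2 * v for v in xs], wrapper_mode="FULL")
    buf = [1, 2, 3, 0]                 # batch of 3 padded to captured size 4
    wrapper("NONE", 4, buf)            # warmup
    wrapper("FULL", 4, buf)            # first decode step at this shape
    buf[:3] = [4, 5, 6]                # next step's inputs, written in place
    print(wrapper("FULL", 4, buf), wrapper.log)
```

Because the input list is overwritten in place between calls, the replay path sees the new step's values through the same buffer object, which mirrors why the real runner writes into persistent buffers before replay.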
Rapid-fire
- Flag to disable graphs + compile? enforce_eager=True.
- Where are captured graphs stored? CUDAGraphWrapper.concrete_cudagraph_entries, keyed by BatchDescriptor.
- What op forces piecewise? Attention (variable metadata).
- V1 default cudagraph mode? FULL_AND_PIECEWISE.
- Default compilation level? VLLM_COMPILE (3).
- One decorator to enable compile on a model? @support_torch_compile.
- Does a graph speed up the matmul itself? No, only the launch.