Phase 05 — Interview Questions: CUDA Graphs & torch.compile

Cover the answer, attempt out loud, then compare. This topic separates people who've operated a serving stack from those who've only read about it.

Q1. What is a CUDA graph and what exactly does it speed up?

Model answer

A CUDA graph is a recording of a sequence of GPU operations and their dependencies, captured once and replayed with a single launch call. It cuts CPU kernel-launch overhead, not GPU compute time. In decode you issue hundreds of tiny kernels per token; at small batch the CPU can't issue them fast enough and the GPU starves between kernels. Replaying a captured graph issues one launch and the GPU runs the whole recorded sequence back-to-back, removing the per-kernel CPU cost. It does nothing for the actual math, so it helps exactly when you're CPU-launch-bound.
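A minimal PyTorch sketch of capture and replay, assuming a CUDA-capable GPU and a CUDA build of PyTorch. The tensor sizes and the `step` function are invented for illustration, and this is not how vLLM wires capture up internally. It uses `torch.cuda.CUDAGraph` with the `torch.cuda.graph` context manager and returns `None` when no GPU is available.

```python
try:
    import torch
except ImportError:  # keep the sketch importable on machines without PyTorch
    torch = None


def capture_and_replay():
    """Capture a tiny forward once, replay it on new data, and check the result."""
    if torch is None or not torch.cuda.is_available():
        return None  # nothing to demonstrate without a GPU

    # Static buffers: replay reads and writes the same memory capture used.
    static_x = torch.zeros(8, 64, device="cuda")
    weight = torch.randn(64, 64, device="cuda")

    def step(x):
        # Illustrative workload: a few small kernels per call.
        return torch.relu(x @ weight) + 1.0

    # Warm up on a side stream before capture, as the PyTorch docs recommend.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(3):
            step(static_x)
    torch.cuda.current_stream().wait_stream(side)

    # Capture: kernels are recorded into the graph, not run as normal work.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_y = step(static_x)

    # Replay: copy new inputs into the static buffer, then issue one launch.
    new_x = torch.randn(8, 64, device="cuda")
    static_x.copy_(new_x)
    graph.replay()
    torch.cuda.synchronize()

    # The replayed output should match an eager run on the same input.
    return bool(torch.allclose(static_y, step(new_x), atol=1e-4))
```

Note that new inputs are copied into `static_x` rather than passed as a fresh tensor: the graph recorded memory addresses, not Python variables.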

Q2. Why does it help decode but not prefill (or large batches)?

Model answer

Decode at small batch is launch-bound: many tiny kernels, each finishing before the CPU issues the next, repeated for thousands of steps at the same shape. That is the ideal case for graphs. Prefill (and large-batch decode) is compute-bound: kernels are large, so launch overhead is negligible relative to the GPU work, and shapes vary from step to step, so a captured graph would rarely be reused. Quantitatively (lab-02): with many same-shape repeats to amortize capture, the launch-overhead speedup approaches the number of ops per step when each kernel is much shorter than its launch cost, and collapses to ~1 when the GPU work per step dominates.
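The two limits can be reproduced with a toy cost model. This is a sketch under simplifying assumptions, not the lab-02 measurement: CPU issue and GPU execution are assumed to overlap perfectly in eager mode, a graph replay is assumed to cost one launch, and all function names and microsecond figures are hypothetical.

```python
def eager_step_us(n_ops: int, launch_us: float, kernel_us: float) -> float:
    """Eager: issue and execution overlap, so each op costs the slower of the two."""
    return n_ops * max(launch_us, kernel_us)


def graph_step_us(n_ops: int, launch_us: float, kernel_us: float) -> float:
    """Graph replay: one launch, then the kernels run back-to-back."""
    return launch_us + n_ops * kernel_us


def speedup(n_ops: int, launch_us: float, kernel_us: float,
            repeats: int, capture_us: float = 0.0) -> float:
    """Eager time over graph time for `repeats` same-shape steps, capture included."""
    eager_total = repeats * eager_step_us(n_ops, launch_us, kernel_us)
    graph_total = capture_us + repeats * graph_step_us(n_ops, launch_us, kernel_us)
    return eager_total / graph_total


if __name__ == "__main__":
    # Launch-bound decode: tiny kernels, many repeats.
    print(speedup(n_ops=200, launch_us=5.0, kernel_us=0.01,
                  repeats=100_000, capture_us=50_000.0))
    # Compute-bound prefill: large kernels, few repeats.
    print(speedup(n_ops=200, launch_us=5.0, kernel_us=500.0,
                  repeats=10, capture_us=50_000.0))
```

In this model a single replay does not pay back the capture cost, which is why the number of same-shape repeats matters as much as the kernel size.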

Q3. What are the constraints a captured graph imposes, and how does vLLM satisfy them?

Model answer

(1) Fixed shapes: a graph captured for batch size B only replays for B. vLLM captures one graph per size in cudagraph_capture_sizes and pads any other batch size up to the nearest captured size; CUDAGraphWrapper keys graphs in concrete_cudagraph_entries: dict[BatchDescriptor,...] (cuda_graph.py:207). (2) Static input buffers: replay reads from the same memory the capture used, so the model runner writes each step's inputs into persistent buffers before replay, and a debug check asserts the input addresses are unchanged (CUDAGraphEntry.input_addresses, cuda_graph.py:135/:346).
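Both constraints can be sketched in plain Python. The helper and class below are hypothetical stand-ins, not vLLM code: a list plays the role of a persistent device buffer, `id()` plays the role of its address, and returning `None` for a batch larger than every captured size is an assumption of this sketch.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def pad_to_captured_size(batch_size: int,
                         capture_sizes: Sequence[int]) -> Optional[int]:
    """Return the smallest captured size >= batch_size, or None if none fits."""
    sizes = sorted(capture_sizes)
    i = bisect_left(sizes, batch_size)
    return sizes[i] if i < len(sizes) else None


class StaticInputBuffer:
    """Toy persistent buffer: contents change, the underlying object never does."""

    def __init__(self, max_size: int, pad_value: int = 0):
        self.data = [pad_value] * max_size
        self.pad_value = pad_value

    def write(self, tokens: Sequence[int], padded_size: int) -> None:
        # In-place slice assignment keeps the same list object,
        # mimicking a tensor whose device address must not move.
        padding = [self.pad_value] * (padded_size - len(tokens))
        self.data[:padded_size] = list(tokens) + padding


if __name__ == "__main__":
    sizes = [1, 2, 4, 8]
    buf = StaticInputBuffer(max_size=max(sizes))
    address_at_capture = id(buf.data)

    padded = pad_to_captured_size(3, sizes)   # batch of 3 is padded up
    buf.write([11, 12, 13], padded)

    # Analogue of the debug check on input addresses.
    assert id(buf.data) == address_at_capture
    print(padded, buf.data[:padded])
```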

Q4. Full vs piecewise CUDA graphs — what's the difference and why does vLLM default to both?

Model answer

FULL captures the entire model forward as one graph — maximum overhead removal but fragile, because everything (including attention with its variable metadata) must be capture-safe. PIECEWISE splits the forward at the uncapturable ops (attention), captures each contiguous compiled region, and runs the split ops eagerly — most of the win, far more robust. vLLM's V1 default FULL_AND_PIECEWISE (compilation.py:63) uses a FULL graph for pure-decode batches (uniform shapes, safe and fastest) and PIECEWISE for mixed prefill+decode batches (variable attention metadata). It's a tuple (decode_mode=FULL, mixed_mode=PIECEWISE) and the runner picks per batch.
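The per-batch choice can be sketched as a two-field tuple plus a selector. Everything here is illustrative: `GraphMode`, `ModePair`, and `pick_runtime_mode` are hypothetical names, the enum values are arbitrary, and the real logic lives in vLLM's compilation config and model runner.

```python
from enum import Enum
from typing import NamedTuple


class GraphMode(Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2


class ModePair(NamedTuple):
    decode_mode: GraphMode   # used for pure-decode batches
    mixed_mode: GraphMode    # used for mixed prefill+decode batches


# Mirrors the (decode_mode=FULL, mixed_mode=PIECEWISE) pairing described above.
FULL_AND_PIECEWISE = ModePair(GraphMode.FULL, GraphMode.PIECEWISE)


def pick_runtime_mode(pair: ModePair, num_prefill_requests: int,
                      warming_up: bool = False) -> GraphMode:
    """Choose the graph mode for one batch."""
    if warming_up:
        return GraphMode.NONE            # no capture or replay during warmup
    if num_prefill_requests == 0:
        return pair.decode_mode          # uniform shapes: whole-forward graph
    return pair.mixed_mode               # variable attention metadata


if __name__ == "__main__":
    print(pick_runtime_mode(FULL_AND_PIECEWISE, num_prefill_requests=0))
    print(pick_runtime_mode(FULL_AND_PIECEWISE, num_prefill_requests=2))
```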

Q5. How does CUDA graphing relate to torch.compile? Are they the same thing?

Model answer

No. They solve different problems and are used together. torch.compile traces the model (TorchDynamo) and generates better, fused kernels (Inductor), reducing memory traffic and kernel count. CUDA graphs make launching whatever kernels you have nearly free. vLLM's level-3 VLLM_COMPILE backend (compilation.py:48) additionally caches compiled artifacts, splits the graph at attention for piecewise compilation (which lines up with piecewise CUDA-graph capture), and runs custom fusion passes. A model opts in with @support_torch_compile (decorators.py:118). Net: compile improves the kernels, graphs remove launch overhead.

Q6. What do the CompilationMode levels mean, and when would you lower them?

Model answer

NONE (0) = pure eager; STOCK_TORCH_COMPILE (1) = plain torch.compile; DYNAMO_TRACE_ONCE (2) = trace once, no recompiles; VLLM_COMPILE (3) = vLLM's Inductor backend with caching, piecewise compilation, shape specialization, and custom passes (the V1 default). You'd lower it (or set enforce_eager=True, which disables compile and graphs) to debug a kernel, handle genuinely dynamic shapes that defeat specialization, or cut the startup compile/capture cost when that matters more than steady-state throughput.

Q7. (Deep) Walk the lifecycle of one decode step through the compile + graph layers.

Model answer

The model runner picks the cudagraph_runtime_mode for this batch (FULL if pure decode, PIECEWISE if mixed, NONE during warmup/profiling) and a batch_descriptor (the shape key). It writes the step's token/position tensors into persistent input buffers, padding the batch to a captured size, and sets the mode and descriptor on the forward_context. The compiled forward then runs; inside it, each CUDAGraphWrapper reads the context: if the mode matches and the shape is known it replay()s that graph (one launch) and returns the cached output; if the shape is new it captures; if the mode is NONE it runs eagerly. Attention pieces run eagerly under PIECEWISE. The sampler then produces the token (cuda_graph.py:233, gpu_model_runner.py).
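The wrapper's three-way decision (eager, capture, replay) can be modeled without a GPU. This is a toy, not vLLM's CUDAGraphWrapper: `ToyGraphWrapper` is a hypothetical class, modes are plain strings, the batch descriptor is reduced to an integer, and "replay" simply re-runs the function against the buffer recorded at capture.

```python
class ToyGraphWrapper:
    """Toy model of a wrapper that captures once per shape and replays after."""

    def __init__(self, fn, runtime_mode):
        self.fn = fn
        self.runtime_mode = runtime_mode   # the mode this wrapper handles
        self.entries = {}                  # batch descriptor -> captured entry
        self.log = []                      # which path each call took

    def __call__(self, ctx_mode, batch_descriptor, static_input):
        # Mode NONE, or a mode meant for a different wrapper: run eagerly.
        if ctx_mode == "NONE" or ctx_mode != self.runtime_mode:
            self.log.append("eager")
            return self.fn(static_input)

        entry = self.entries.get(batch_descriptor)
        if entry is None:
            # New shape: "capture" by recording the buffer and its output.
            self.log.append("capture")
            output = self.fn(static_input)
            self.entries[batch_descriptor] = {"input": static_input,
                                              "output": output}
            return output

        # Known shape: the input must be the very buffer seen at capture.
        assert entry["input"] is static_input, "input buffer moved since capture"
        self.log.append("replay")
        entry["output"][:] = self.fn(entry["input"])  # overwrite in place
        return entry["output"]


if __name__ == "__main__":
    wrapper = ToyGraphWrapper(lambda buf: [2 * x for x in buf], "FULL")
    buf = [1, 2, 0, 0]                 # persistent buffer, padded to size 4

    wrapper("NONE", 4, buf)            # warmup: eager
    first = wrapper("FULL", 4, buf)    # first real step: capture
    buf[:] = [5, 6, 0, 0]              # next step written into the same buffer
    second = wrapper("FULL", 4, buf)   # replay
    print(wrapper.log, second, second is first)
```

The replayed result lands in the same output object as the captured one, which is the toy analogue of returning the cached output from static memory.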

Rapid-fire

  • Flag to disable graphs + compile? enforce_eager=True.
  • Where are captured graphs stored? CUDAGraphWrapper.concrete_cudagraph_entries, keyed by BatchDescriptor.
  • What op forces piecewise? Attention (variable metadata).
  • V1 default cudagraph mode? FULL_AND_PIECEWISE.
  • Default compilation level? VLLM_COMPILE (3).
  • One decorator to enable compile on a model? @support_torch_compile.
  • Does a graph speed up the matmul itself? No — only the launch.