Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 05 — Exercises: CUDA Graphs & torch.compile

Escalating from "explain it" to "design it." Staff-level = the last ones cold, citing the exact upstream/ line.

Contents


Warm-up (explain)

  1. In one sentence each: what does a CUDA graph remove, and what does torch.compile improve?
  2. Why does a graph help decode at batch size 1 but barely at batch size 256?
  3. Name the two constraints a captured graph imposes, and the field/structure in CUDAGraphWrapper that enforces each (cuda_graph.py).

Core (trace the code)

  1. Walk the three branches of CUDAGraphWrapper.__call__ (cuda_graph.py:233): name the trigger for eager / capture / replay and the one line that is the win.
  2. FULL_AND_PIECEWISE is encoded as the tuple (FULL, PIECEWISE). Using decode_mode / mixed_mode (compilation.py:65), state which concrete mode runs for a pure-decode batch vs a mixed batch, and why those choices are safe.
  3. Why is attention the op that forces piecewise capture? What about it doesn't fit a frozen recording? (Hint: Phase 02/03 metadata.)

Build (extend your code / mini_vllm)

  1. Add capture-size padding to GraphRunner (stretch in 02-mini-build.md): round the batch dim up to the nearest of [1,2,4,8] before keying. Show batches 5 and 7 both reuse the size-8 graph, and count distinct captures across batches 1..8 with vs without padding.
  2. Extend PiecewiseGraphRunner to count launches (capturable segments replay as 1 each; eager segments pay per-op). Compare total launches of FULL (1) vs PIECEWISE (segments+eager) vs eager (all ops) for a 10-op model split at op 5.
  3. Write a crossover table (from lab-02) for num_ops ∈ {1, 4, 32, 300} and capture_cost_ops ∈ {num_ops, 5×num_ops}. Explain the row for num_ops=1.

Design (staff-level)

  1. A serving box shows 30% GPU utilization at batch 1–2 and a profile full of gaps between tiny kernels. Walk your diagnosis and the first fix you'd try, and predict the batch size at which the fix stops mattering.
  2. You enable torch.compile (level VLLM_COMPILE) and startup time jumps from 10s to 90s. Explain where the time goes and how vLLM mitigates it across restarts (compilation.py caching). What do you trade if you drop to enforce_eager?
  3. A new attention backend you wrote breaks under FULL graphs but works in PIECEWISE. Explain the likely cause and which CUDAGraphMode you'd ship as the default while you investigate.
  4. Design a benchmark that isolates the CUDA-graph win from the torch.compile win (so you can attribute a speedup to the right layer). Which flags toggle each independently?

Self-grading

4–6 and 10–13 are interview-grade. Could you whiteboard each in 5 minutes and name the file? If not, re-read the matching deep-dive section, then drill INTERVIEW.md.