Phase 05 — Exercises: CUDA Graphs & torch.compile
Escalating from "explain it" to "design it." Staff-level = the last ones cold, citing the exact
upstream/ line.
Contents
- Warm-up (explain)
- Core (trace the code)
- Build (extend your code / mini_vllm)
- Design (staff-level)
- Self-grading
Warm-up (explain)
- In one sentence each: what does a CUDA graph remove, and what does
torch.compileimprove? - Why does a graph help decode at batch size 1 but barely at batch size 256?
- Name the two constraints a captured graph imposes, and the field/structure in
CUDAGraphWrapperthat enforces each (cuda_graph.py).
Core (trace the code)
- Walk the three branches of
CUDAGraphWrapper.__call__(cuda_graph.py:233): name the trigger for eager / capture / replay and the one line that is the win. FULL_AND_PIECEWISEis encoded as the tuple(FULL, PIECEWISE). Usingdecode_mode/mixed_mode(compilation.py:65), state which concrete mode runs for a pure-decode batch vs a mixed batch, and why those choices are safe.- Why is attention the op that forces piecewise capture? What about it doesn't fit a frozen recording? (Hint: Phase 02/03 metadata.)
Build (extend your code / mini_vllm)
- Add capture-size padding to
GraphRunner(stretch in 02-mini-build.md): round the batch dim up to the nearest of[1,2,4,8]before keying. Show batches 5 and 7 both reuse the size-8 graph, and count distinct captures across batches 1..8 with vs without padding. - Extend
PiecewiseGraphRunnerto count launches (capturable segments replay as 1 each; eager segments pay per-op). Compare total launches of FULL (1) vs PIECEWISE (segments+eager) vs eager (all ops) for a 10-op model split at op 5. - Write a
crossovertable (from lab-02) fornum_ops ∈ {1, 4, 32, 300}andcapture_cost_ops ∈ {num_ops, 5×num_ops}. Explain the row fornum_ops=1.
Design (staff-level)
- A serving box shows 30% GPU utilization at batch 1–2 and a profile full of gaps between tiny kernels. Walk your diagnosis and the first fix you'd try, and predict the batch size at which the fix stops mattering.
- You enable
torch.compile(levelVLLM_COMPILE) and startup time jumps from 10s to 90s. Explain where the time goes and how vLLM mitigates it across restarts (compilation.pycaching). What do you trade if you drop toenforce_eager? - A new attention backend you wrote breaks under FULL graphs but works in PIECEWISE. Explain
the likely cause and which
CUDAGraphModeyou'd ship as the default while you investigate. - Design a benchmark that isolates the CUDA-graph win from the
torch.compilewin (so you can attribute a speedup to the right layer). Which flags toggle each independently?
Self-grading
4–6 and 10–13 are interview-grade. Could you whiteboard each in 5 minutes and name the file? If not, re-read the matching deep-dive section, then drill INTERVIEW.md.