Phase 05 — Mini-Build: simulate CUDA-graph capture & replay
You'll build a CPU simulation of CUDA graphs that reproduces the one win and the two
constraints from the guide. No GPU, no torch — just numpy and a launch counter. The reference
lives in mini_vllm/cudagraph.py; write it yourself first against lab-01's stub + tests.
The trick that makes this teachable on a laptop: we don't time anything. We count launches.
LaunchCounter is a global tally standing in for per-op CPU launch overhead. Eager pays one per
op every call; a replay pays exactly one. That single number captures the entire point of CUDA
graphs.
The build, in order
1. LaunchCounter
A class with a class-level n, plus reset() and bump(k=1). This is your stand-in for CPU
launch overhead.
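One possible shape for the counter, as a minimal sketch. Using classmethods is an assumption here; the reference in mini_vllm/cudagraph.py may use static methods or a different layout.

```python
class LaunchCounter:
    """Global tally standing in for per-op CPU launch overhead."""

    n = 0  # class-level count, shared by every caller

    @classmethod
    def reset(cls) -> None:
        """Zero the tally, typically at the start of a test."""
        cls.n = 0

    @classmethod
    def bump(cls, k: int = 1) -> None:
        """Record k simulated launches."""
        cls.n += k


LaunchCounter.reset()
LaunchCounter.bump()    # one launch
LaunchCounter.bump(3)   # three more
print(LaunchCounter.n)  # 4
```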
2. run_eager(ops, x)
Run a list of ops over x, bump(1) per op. Returns the result. This is the baseline: overhead
paid per op, every single call.
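A sketch of the eager baseline, assuming each op is a plain callable that takes a numpy array and returns one (the guide does not pin down the op type). The counter is repeated so the snippet runs on its own.

```python
import numpy as np


class LaunchCounter:
    n = 0

    @classmethod
    def reset(cls):
        cls.n = 0

    @classmethod
    def bump(cls, k=1):
        cls.n += k


def run_eager(ops, x):
    """Apply each op in order, paying one simulated launch per op."""
    for op in ops:
        x = op(x)
        LaunchCounter.bump(1)
    return x


# Three toy elementwise ops (illustrative only).
ops = [lambda a: a + 1, lambda a: a * 2, lambda a: a - 3]

LaunchCounter.reset()
out = run_eager(ops, np.array([1.0, 2.0]))
print(out, LaunchCounter.n)  # three launches for three ops
```

Calling run_eager again adds another len(ops) launches; nothing is amortized.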
3. GraphRunner(ops) — capture once, replay forever
The core. __call__(x):
- key = x.shape.
- Capture (key unseen): copy x into a static_input buffer, run the ops (bump per op), cache a GraphEntry(shape, static_input, output, num_ops), return the output. (Constraint 1: graphs are keyed by shape.)
- Replay (key seen): np.copyto(entry.static_input, x) (Constraint 2: new inputs must land in the fixed buffer), recompute the ops from the buffer, bump(1) for the whole replay, return. (The win: one launch instead of len(ops).)

Expose num_captured.
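The capture and replay branches could be sketched as follows. This is one reading of the steps above, not the reference solution: exposing num_captured as a property, the private _run helper, and the dataclass layout of GraphEntry are all assumptions.

```python
from dataclasses import dataclass

import numpy as np


class LaunchCounter:
    n = 0

    @classmethod
    def reset(cls):
        cls.n = 0

    @classmethod
    def bump(cls, k=1):
        cls.n += k


@dataclass
class GraphEntry:
    shape: tuple
    static_input: np.ndarray  # fixed buffer that every replay reads from
    output: np.ndarray
    num_ops: int


class GraphRunner:
    def __init__(self, ops):
        self.ops = list(ops)
        self.graphs = {}  # shape -> GraphEntry

    @property
    def num_captured(self):
        return len(self.graphs)

    def _run(self, buf, bump_per_op):
        # Hypothetical helper: apply the ops, optionally counting each one.
        out = buf
        for op in self.ops:
            out = op(out)
            if bump_per_op:
                LaunchCounter.bump(1)
        return out

    def __call__(self, x):
        key = x.shape  # Constraint 1: one graph per input shape
        entry = self.graphs.get(key)
        if entry is None:
            # Capture: own a private copy of the input, pay per-op launches.
            static_input = np.array(x, copy=True)
            output = self._run(static_input, bump_per_op=True)
            self.graphs[key] = GraphEntry(key, static_input, output, len(self.ops))
            return output
        # Replay. Constraint 2: the new input must land in the fixed buffer.
        np.copyto(entry.static_input, x)
        entry.output = self._run(entry.static_input, bump_per_op=False)
        LaunchCounter.bump(1)  # the win: one launch for the whole graph
        return entry.output


ops = [lambda a: a + 1, lambda a: a * 2, lambda a: a - 3]
runner = GraphRunner(ops)
LaunchCounter.reset()
runner(np.ones(4))   # capture: 3 launches
runner(np.zeros(4))  # replay: 1 launch
runner(np.ones(2))   # new shape, so a second capture: 3 launches
print(LaunchCounter.n, runner.num_captured)  # 7 2
```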
4. PiecewiseGraphRunner(ops, is_capturable) — split at the attention analog
Build contiguous segments by grouping consecutive ops with the same is_capturable(i) value.
Wrap capturable segments in a GraphRunner; keep uncapturable segments as plain op lists run via
run_eager. __call__ threads x through the segments in order. Expose num_graphs (count of
capturable segments). This models PIECEWISE: capture the compiled regions, run attention eagerly.
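A sketch of the segmenting logic, assuming is_capturable takes an op index and returns something truthy or falsy. The GraphRunner here is a deliberately compact stand-in (it stores only the static buffer per shape) so the snippet stays self-contained; itertools.groupby does the grouping of consecutive ops.

```python
from itertools import groupby

import numpy as np


class LaunchCounter:
    n = 0

    @classmethod
    def reset(cls):
        cls.n = 0

    @classmethod
    def bump(cls, k=1):
        cls.n += k


def run_eager(ops, x):
    for op in ops:
        x = op(x)
        LaunchCounter.bump(1)
    return x


class GraphRunner:
    """Compact capture/replay runner keyed by input shape."""

    def __init__(self, ops):
        self.ops = list(ops)
        self.graphs = {}  # shape -> static input buffer

    def __call__(self, x):
        buf = self.graphs.get(x.shape)
        if buf is None:  # capture: pay one launch per op
            buf = self.graphs[x.shape] = np.array(x, copy=True)
            LaunchCounter.bump(len(self.ops))
        else:  # replay: copy into the fixed buffer, pay one launch
            np.copyto(buf, x)
            LaunchCounter.bump(1)
        out = buf
        for op in self.ops:
            out = op(out)
        return out


class PiecewiseGraphRunner:
    def __init__(self, ops, is_capturable):
        self.segments = []
        # Group consecutive ops that share the same capturable flag.
        flagged = enumerate(ops)
        for capturable, group in groupby(flagged, key=lambda p: bool(is_capturable(p[0]))):
            seg_ops = [op for _, op in group]
            self.segments.append(GraphRunner(seg_ops) if capturable else seg_ops)
        self.num_graphs = sum(isinstance(s, GraphRunner) for s in self.segments)

    def __call__(self, x):
        for seg in self.segments:
            x = seg(x) if isinstance(seg, GraphRunner) else run_eager(seg, x)
        return x


# Five toy ops; index 2 plays the attention analog and stays eager.
ops = [
    lambda a: a + 1,
    lambda a: a * 2,
    lambda a: a - a.mean(),
    lambda a: a + 1,
    lambda a: a * 2,
]
model = PiecewiseGraphRunner(ops, is_capturable=lambda i: i != 2)

LaunchCounter.reset()
model(np.array([1.0, 3.0]))
first = LaunchCounter.n   # capture pass: 2 + 1 + 2 launches
LaunchCounter.reset()
model(np.array([1.0, 3.0]))
second = LaunchCounter.n  # replay pass: 1 + 1 + 1 launches
print(model.num_graphs, first, second)  # 2 5 3
```

Note that the eager segment still pays its per-op cost on every call; only the capturable runs collapse to one launch each.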
Definition of done
pytest mini_vllm/test_cudagraph.py -q # the reference suite (7 tests)
pytest phase-05-cuda-graphs-and-torch-compile/labs -q
Then answer in your notebook, citing mini_vllm/cudagraph.py lines:
- Which line is Constraint 1 (per-shape dispatch)? Which is Constraint 2 (static buffer copy)? Which is the win (single launch on replay)?
- In PiecewiseGraphRunner, why does num_graphs == 2 when you split a 3-op model at the middle op? (Two capturable runs surround one eager op.)
Map your toy to the real engine
| your mini_vllm/cudagraph.py | real vLLM |
|---|---|
| GraphRunner.graphs: dict[shape, GraphEntry] | CUDAGraphWrapper.concrete_cudagraph_entries: dict[BatchDescriptor, CUDAGraphEntry] (cuda_graph.py:207) |
| entry.static_input + np.copyto | CUDAGraphEntry.input_addresses + persistent input buffers (cuda_graph.py:135, :346) |
| capture branch | with torch.cuda.graph(...) (cuda_graph.py:313) |
| replay branch (bump(1)) | entry.cudagraph.replay() (cuda_graph.py:360) |
| PiecewiseGraphRunner split | piecewise compilation/capture, split at attention (backends.py) |
Stretch (optional)
- Padding to capture sizes. Real vLLM only captures graphs for a fixed set of batch sizes and pads odd sizes up. Add a capture_sizes=[1,2,4,8] to GraphRunner: round x's batch dim up to the nearest capture size before keying, so batch 5 and 7 both reuse the size-8 graph. Count how many distinct graphs you capture across batches 1..8 with and without padding.
- A fusion pass. Add an optional graph-rewrite step that fuses two adjacent elementwise ops into one (e.g. +1 then *2 → one op) and show it reduces launches even in eager mode: the torch.compile half of the story.
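Two small helpers that could serve as starting points for the stretch goals. All three names (round_up_to_capture_size, pad_batch, fuse) are hypothetical, zero-padding the batch dimension is an assumption about how you choose to pad, and wiring them into GraphRunner is left to you.

```python
import bisect

import numpy as np


def round_up_to_capture_size(batch, capture_sizes):
    """Return the smallest capture size that is >= batch."""
    sizes = sorted(capture_sizes)
    i = bisect.bisect_left(sizes, batch)
    if i == len(sizes):
        raise ValueError(f"batch {batch} exceeds largest capture size {sizes[-1]}")
    return sizes[i]


def pad_batch(x, capture_sizes):
    """Zero-pad x along axis 0 up to the nearest capture size."""
    target = round_up_to_capture_size(x.shape[0], capture_sizes)
    padded = np.zeros((target,) + x.shape[1:], dtype=x.dtype)
    padded[: x.shape[0]] = x
    return padded


def fuse(f, g):
    """Compose two elementwise ops into a single callable (one launch)."""
    return lambda x: g(f(x))


sizes = [1, 2, 4, 8]
print(round_up_to_capture_size(5, sizes))  # 8
print(round_up_to_capture_size(7, sizes))  # 8
print(pad_batch(np.ones((5, 3)), sizes).shape)  # (8, 3)

ops = [lambda a: a + 1, lambda a: a * 2]
fused_ops = [fuse(ops[0], ops[1])]
# Eager pays one launch per op, so the fused list costs 1 instead of 2.
print(len(ops), len(fused_ops))  # 2 1
```

A caller that pads also needs to slice the output back to the original batch size, since the padded rows carry no real data.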