Phase 05 — Mini-Build: simulate CUDA-graph capture & replay

You'll build a CPU simulation of CUDA graphs that reproduces the one win and the two constraints from the guide. No GPU, no torch — just numpy and a launch counter. The reference lives in mini_vllm/cudagraph.py; write it yourself first against lab-01's stub + tests.

The trick that makes this teachable on a laptop: we don't time anything. We count launches. LaunchCounter is a global tally standing in for per-op CPU launch overhead. Eager pays one per op every call; a replay pays exactly one. That single number captures the entire point of CUDA graphs.

The build, in order
Definition of done
Map your toy to the real engine
Stretch (optional)

The build, in order

1. `LaunchCounter`

A class with a class-level n, plus reset() and bump(k=1). This is your stand-in for CPU launch overhead.

2. `run_eager(ops, x)`

Run a list of ops over x, bump(1) per op. Returns the result. This is the baseline: overhead paid per op, every single call.

3. `GraphRunner(ops)` — capture once, replay forever

The core. __call__(x):

key = x.shape.
Capture (key unseen): copy x into a static_input buffer, run the ops (bump per op), cache a GraphEntry(shape, static_input, output, num_ops), return the output. (Constraint 1: graphs are keyed by shape.)
Replay (key seen): np.copyto(entry.static_input, x) (Constraint 2: new inputs must land in the fixed buffer), recompute the ops from the buffer, bump(1) for the whole replay, return. (The win: one launch instead of len(ops).) Expose num_captured.

4. `PiecewiseGraphRunner(ops, is_capturable)` — split at the attention analog

Build contiguous segments by grouping consecutive ops with the same is_capturable(i) value. Wrap capturable segments in a GraphRunner; keep uncapturable segments as plain op lists run via run_eager. __call__ threads x through the segments in order. Expose num_graphs (count of capturable segments). This models PIECEWISE: capture the compiled regions, run attention eagerly.

Definition of done

pytest mini_vllm/test_cudagraph.py -q          # the reference suite (7 tests)
pytest phase-05-cuda-graphs-and-torch-compile/labs -q

Then answer in your notebook, citing mini_vllm/cudagraph.py lines:

Which line is Constraint 1 (per-shape dispatch)? Which is Constraint 2 (static buffer copy)? Which is the win (single launch on replay)?
In PiecewiseGraphRunner, why does num_graphs == 2 when you split a 3-op model at the middle op? (Two capturable runs surround one eager op.)

Map your toy to the real engine

your `mini_vllm/cudagraph.py`	real vLLM
`GraphRunner.graphs: dict[shape, GraphEntry]`	`CUDAGraphWrapper.concrete_cudagraph_entries: dict[BatchDescriptor, CUDAGraphEntry]` (`cuda_graph.py:207`)
`entry.static_input` + `np.copyto`	`CUDAGraphEntry.input_addresses` + persistent input buffers (`cuda_graph.py:135`, `:346`)
capture branch	`with torch.cuda.graph(...)` (`cuda_graph.py:313`)
replay branch (`bump(1)`)	`entry.cudagraph.replay()` (`cuda_graph.py:360`)
`PiecewiseGraphRunner` split	piecewise compilation/capture, split at attention (`backends.py`)

Stretch (optional)

Padding to capture sizes. Real vLLM only captures graphs for a fixed set of batch sizes and pads odd sizes up. Add a capture_sizes=[1,2,4,8] to GraphRunner: round x's batch dim up to the nearest capture size before keying, so batch 5 and 7 both reuse the size-8 graph. Count how many distinct graphs you capture across batches 1..8 with and without padding.
A fusion pass. Add an optional graph-rewrite step that fuses two adjacent elementwise ops into one (e.g. +1 then *2 → one op) and show it reduces launches even in eager mode — the torch.compile half of the story.

vLLM Mastery — From Zero to Maintainer