Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 05 — Mini-Build: simulate CUDA-graph capture & replay

You'll build a CPU simulation of CUDA graphs that reproduces the one win and the two constraints from the guide. No GPU, no torch — just numpy and a launch counter. The reference lives in mini_vllm/cudagraph.py; write it yourself first against lab-01's stub + tests.

The trick that makes this teachable on a laptop: we don't time anything. We count launches. LaunchCounter is a global tally standing in for per-op CPU launch overhead. Eager pays one per op every call; a replay pays exactly one. That single number captures the entire point of CUDA graphs.

Contents


The build, in order

1. LaunchCounter

A class with a class-level n, plus reset() and bump(k=1). This is your stand-in for CPU launch overhead.

2. run_eager(ops, x)

Run a list of ops over x, bump(1) per op. Returns the result. This is the baseline: overhead paid per op, every single call.

3. GraphRunner(ops) — capture once, replay forever

The core. __call__(x):

  • key = x.shape.
  • Capture (key unseen): copy x into a static_input buffer, run the ops (bump per op), cache a GraphEntry(shape, static_input, output, num_ops), return the output. (Constraint 1: graphs are keyed by shape.)
  • Replay (key seen): np.copyto(entry.static_input, x) (Constraint 2: new inputs must land in the fixed buffer), recompute the ops from the buffer, bump(1) for the whole replay, return. (The win: one launch instead of len(ops).) Expose num_captured.

4. PiecewiseGraphRunner(ops, is_capturable) — split at the attention analog

Build contiguous segments by grouping consecutive ops with the same is_capturable(i) value. Wrap capturable segments in a GraphRunner; keep uncapturable segments as plain op lists run via run_eager. __call__ threads x through the segments in order. Expose num_graphs (count of capturable segments). This models PIECEWISE: capture the compiled regions, run attention eagerly.

Definition of done

pytest mini_vllm/test_cudagraph.py -q          # the reference suite (7 tests)
pytest phase-05-cuda-graphs-and-torch-compile/labs -q

Then answer in your notebook, citing mini_vllm/cudagraph.py lines:

  • Which line is Constraint 1 (per-shape dispatch)? Which is Constraint 2 (static buffer copy)? Which is the win (single launch on replay)?
  • In PiecewiseGraphRunner, why does num_graphs == 2 when you split a 3-op model at the middle op? (Two capturable runs surround one eager op.)

Map your toy to the real engine

your mini_vllm/cudagraph.pyreal vLLM
GraphRunner.graphs: dict[shape, GraphEntry]CUDAGraphWrapper.concrete_cudagraph_entries: dict[BatchDescriptor, CUDAGraphEntry] (cuda_graph.py:207)
entry.static_input + np.copytoCUDAGraphEntry.input_addresses + persistent input buffers (cuda_graph.py:135, :346)
capture branchwith torch.cuda.graph(...) (cuda_graph.py:313)
replay branch (bump(1))entry.cudagraph.replay() (cuda_graph.py:360)
PiecewiseGraphRunner splitpiecewise compilation/capture, split at attention (backends.py)

Stretch (optional)

  1. Padding to capture sizes. Real vLLM only captures graphs for a fixed set of batch sizes and pads odd sizes up. Add a capture_sizes=[1,2,4,8] to GraphRunner: round x's batch dim up to the nearest capture size before keying, so batch 5 and 7 both reuse the size-8 graph. Count how many distinct graphs you capture across batches 1..8 with and without padding.
  2. A fusion pass. Add an optional graph-rewrite step that fuses two adjacent elementwise ops into one (e.g. +1 then *2 → one op) and show it reduces launches even in eager mode — the torch.compile half of the story.