Lab 05-02 — Launch Overhead & the Eager↔Graph Crossover `[CPU-OK]`

Lab-01 gave you the mechanism; this lab gives you the economics. Capturing a graph isn't free — the first call pays full overhead, plus (in reality) warmup and memory. So when does the investment pay off, and how big can the payoff get? You'll derive both answers as closed-form formulas — the crossover point and the asymptotic speedup — and they're worth deriving because they're the difference between "graphs are good" (folklore) and knowing for which workloads, by how much, and when not to bother (engineering). Spoiler for the impatient: break-even at the second call, asymptotic speedup = the number of ops captured — and both facts have production consequences listed below.

Why this lab exists
The model (launches as a proxy for CPU overhead)
Files
Run
The two results you should be able to state cold
What the tests prove
Hitchhiker's notes
Going further
References

Why this lab exists

Every caching/compilation decision in systems — JIT vs interpret, memoize vs recompute, capture vs eager — has the same shape: an upfront cost amortized over repeats. The two numbers that decide it are always the same two you'll derive here: how soon does it break even (crossover) and what's the ceiling (asymptotic ratio). This lab drills the pattern on the cleanest possible instance, where both answers are exact integers. After it, you'll recognize the same analysis inside torch.compile's warmup tradeoffs, lab-05's ladder-density question, and Phase 18's "is this optimization worth its startup cost" recurring decision.

It also arms you for the most common graphs-related production question: "should I set enforce_eager=True to speed up startup?" The formulas say precisely what that trades away, per decode step, forever — and lab-04 confirms the prediction on silicon.

The model (launches as a proxy for CPU overhead)

Eager, k calls of an num_ops-op model: k × num_ops launches.
Graph: first call captures at cost capture_cost_ops (default num_ops), every later call replays in 1: total capture_cost_ops + (k − 1).

One unit ≈ one kernel launch ≈ some microseconds of CPU time. The model deliberately ignores GPU compute time — which is exactly why its predictions hold when launches dominate (small-batch decode) and fade when they don't (large batch, prefill). Knowing a model's domain of validity is part of the lab; lab-04's batch-64 numbers show the fade.

Files

starter.py — implement eager_launches, graph_launches, crossover, asymptotic_speedup. Your work.
solution.py — reference.
test_lab.py — pins the formulas, the second-call break-even, the num_ops == 1 degenerate case, and the asymptote.

Run

LAB_IMPL=starter pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-02-launch-overhead -q
pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-02-launch-overhead -q   # reference

The two results you should be able to state cold

Crossover. With capture_cost_ops = num_ops > 1, the graph total beats eager from the 2nd call onward: capture costs one eager-pass-worth, and every replay after saves num_ops − 1. Graphs are not a long-game investment — they pay back almost immediately provided the shape repeats. The real risk is never "capture was too expensive"; it's "the shape never came back" (which is why capture sizes exist — lab-05 — and why wildly dynamic workloads gain little).
Asymptotic speedup. As k → ∞, per-call launches → num_ops (eager) vs 1 (graph): the launch-overhead speedup approaches the number of ops captured. A model with 300 kernels per decode step has a 300× launch speedup ceiling — diluted in wall-clock by the GPU work that remains (Amdahl), which is why lab-04 measures 2.5×, not 300×. Bigger models per-step → more GPU work per launch → less dilution benefit needed; smaller models → launch-bound → graphs are load-bearing. Phase 0 lab-04's 8 ms-vs-1 ms step analysis, now with the fix attached.

What the tests prove

Test	What it pins
formula tests	`eager = k·n`, `graph = capture + (k−1)` — exact, no off-by-ones (an off-by-one here is a wrong capacity claim later)
crossover tests	break-even at call 2 for `n > 1`; never for `n == 1` (one launch either way — a single fused megakernel gains nothing from graphs, an endpoint worth knowing)
asymptote test	per-call ratio → `n` from below, monotonically

Hitchhiker's notes

What capture_cost_ops > num_ops models: real capture includes warmup passes (allocator, autotuners), stream synchronization, and graph instantiation — typically several eager-passes-worth. It shifts the crossover later by a few calls; it never changes the asymptote. The real engine moves this entire cost to startup (capturing the whole ladder before serving — lab-04's 7-second log line), so steady-state traffic sees only replays: the crossover question is answered "before the first request."
Why the speedup ceiling is num_ops and not infinite: a replay is still one launch. The only way past it is fewer-than-one launch per step — batching multiple steps per launch — which exists (multi-step scheduling / async scheduling in vLLM's history) and brings its own complications. Ceilings tell you where the next optimization frontier is; that's their real use.
torch.compile plays the same game one level up: compilation cost (seconds to minutes, cached to disk in vLLM via the compilation cache) amortized over runs; kernel fusion reduces num_ops itself, which lowers what graphs have left to save. Fusion and capture are complementary attacks on the same k × num_ops bill — fusion shrinks num_ops, capture shrinks its coefficient. The deep-dive's pipeline (compile → piecewise split → capture) is exactly this composition.
The formulas assume the shape repeats. Per-shape accounting is multiplicative: every distinct batch size runs its own crossover race. A uniform-traffic deployment amortizes a handful of shapes beautifully; a chaotic one spreads k thin across many shapes. That's the bridge to lab-05 — the ladder exists to concentrate k onto few shapes by padding.

Going further

Plot per-call cost vs k for n ∈ {3, 30, 300} with capture_cost = 3n. Mark the crossovers. This single chart is the "should we graph it?" conversation, pre-had.
Add a second resource to the model: each captured graph costs M memory, and you have budget B. Combined with lab-05's ladder, derive the optimal number of rungs for a given traffic distribution — you've reinvented the actual config-tuning problem.
Measure a real launch: time torch.mm on tiny tensors CPU-side (or just a no-op Python function call stack 300 deep) and put microseconds to the unit. The model stays the same; the constants acquire meaning.

References

Lab-04 — the formulas, confirmed on an L4: ~2.5× at batch 1, fading to 1.13× at 64.
upstream/vllm/compilation/cuda_graph.py — where capture cost is actually paid.
NVIDIA, CUDA Graphs blog — measured launch overheads that set the unit: https://developer.nvidia.com/blog/cuda-graphs/
Phase 0 lab-04 — the roofline reason launch overhead only matters at small step sizes.
Hennessy & Patterson, Computer Architecture — Amdahl's law, the reason ceilings dilute; any edition, the first chapter.

vLLM Mastery — From Zero to Maintainer