Lab 05-02 — Launch Overhead & the Eager↔Graph Crossover [CPU-OK]
Lab-01 gave you the mechanism; this lab gives you the economics. Capturing a graph isn't free — the first call pays full overhead, plus (in reality) warmup and memory. So when does the investment pay off, and how big can the payoff get? You'll derive both answers as closed-form formulas — the crossover point and the asymptotic speedup — and they're worth deriving because they're the difference between "graphs are good" (folklore) and knowing for which workloads, by how much, and when not to bother (engineering). Spoiler for the impatient: break-even at the second call, asymptotic speedup = the number of ops captured — and both facts have production consequences listed below.
Contents
- Why this lab exists
- The model (launches as a proxy for CPU overhead)
- Files
- Run
- The two results you should be able to state cold
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Every caching/compilation decision in systems — JIT vs interpret, memoize vs recompute, capture vs eager — has the same shape: an upfront cost amortized over repeats. The two numbers that decide it are always the same two you'll derive here: how soon does it break even (crossover) and what's the ceiling (asymptotic ratio). This lab drills the pattern on the cleanest possible instance, where both answers are exact integers. After it, you'll recognize the same analysis inside torch.compile's warmup tradeoffs, lab-05's ladder-density question, and Phase 18's "is this optimization worth its startup cost" recurring decision.
It also arms you for the most common graphs-related production question: "should I set
enforce_eager=True to speed up startup?" The formulas say precisely what that trades
away, per decode step, forever — and lab-04 confirms the prediction on silicon.
The model (launches as a proxy for CPU overhead)
- Eager,
kcalls of annum_ops-op model:k × num_opslaunches. - Graph: first call captures at cost
capture_cost_ops(defaultnum_ops), every later call replays in 1: totalcapture_cost_ops + (k − 1).
One unit ≈ one kernel launch ≈ some microseconds of CPU time. The model deliberately ignores GPU compute time — which is exactly why its predictions hold when launches dominate (small-batch decode) and fade when they don't (large batch, prefill). Knowing a model's domain of validity is part of the lab; lab-04's batch-64 numbers show the fade.
Files
starter.py— implementeager_launches,graph_launches,crossover,asymptotic_speedup. Your work.solution.py— reference.test_lab.py— pins the formulas, the second-call break-even, thenum_ops == 1degenerate case, and the asymptote.
Run
LAB_IMPL=starter pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-02-launch-overhead -q
pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-02-launch-overhead -q # reference
The two results you should be able to state cold
- Crossover. With
capture_cost_ops = num_ops > 1, the graph total beats eager from the 2nd call onward: capture costs one eager-pass-worth, and every replay after savesnum_ops − 1. Graphs are not a long-game investment — they pay back almost immediately provided the shape repeats. The real risk is never "capture was too expensive"; it's "the shape never came back" (which is why capture sizes exist — lab-05 — and why wildly dynamic workloads gain little). - Asymptotic speedup. As
k → ∞, per-call launches →num_ops(eager) vs 1 (graph): the launch-overhead speedup approaches the number of ops captured. A model with 300 kernels per decode step has a 300× launch speedup ceiling — diluted in wall-clock by the GPU work that remains (Amdahl), which is why lab-04 measures 2.5×, not 300×. Bigger models per-step → more GPU work per launch → less dilution benefit needed; smaller models → launch-bound → graphs are load-bearing. Phase 0 lab-04's 8 ms-vs-1 ms step analysis, now with the fix attached.
What the tests prove
| Test | What it pins |
|---|---|
| formula tests | eager = k·n, graph = capture + (k−1) — exact, no off-by-ones (an off-by-one here is a wrong capacity claim later) |
| crossover tests | break-even at call 2 for n > 1; never for n == 1 (one launch either way — a single fused megakernel gains nothing from graphs, an endpoint worth knowing) |
| asymptote test | per-call ratio → n from below, monotonically |
Hitchhiker's notes
- What
capture_cost_ops > num_opsmodels: real capture includes warmup passes (allocator, autotuners), stream synchronization, and graph instantiation — typically several eager-passes-worth. It shifts the crossover later by a few calls; it never changes the asymptote. The real engine moves this entire cost to startup (capturing the whole ladder before serving — lab-04's 7-second log line), so steady-state traffic sees only replays: the crossover question is answered "before the first request." - Why the speedup ceiling is
num_opsand not infinite: a replay is still one launch. The only way past it is fewer-than-one launch per step — batching multiple steps per launch — which exists (multi-step scheduling / async scheduling in vLLM's history) and brings its own complications. Ceilings tell you where the next optimization frontier is; that's their real use. - torch.compile plays the same game one level up: compilation cost (seconds to
minutes, cached to disk in vLLM via the compilation cache) amortized over runs; kernel
fusion reduces
num_opsitself, which lowers what graphs have left to save. Fusion and capture are complementary attacks on the samek × num_opsbill — fusion shrinksnum_ops, capture shrinks its coefficient. The deep-dive's pipeline (compile → piecewise split → capture) is exactly this composition. - The formulas assume the shape repeats. Per-shape accounting is multiplicative:
every distinct batch size runs its own crossover race. A uniform-traffic deployment
amortizes a handful of shapes beautifully; a chaotic one spreads
kthin across many shapes. That's the bridge to lab-05 — the ladder exists to concentratekonto few shapes by padding.
Going further
- Plot per-call cost vs
kforn ∈ {3, 30, 300}withcapture_cost = 3n. Mark the crossovers. This single chart is the "should we graph it?" conversation, pre-had. - Add a second resource to the model: each captured graph costs
Mmemory, and you have budgetB. Combined with lab-05's ladder, derive the optimal number of rungs for a given traffic distribution — you've reinvented the actual config-tuning problem. - Measure a real launch: time
torch.mmon tiny tensors CPU-side (or just a no-op Python function call stack 300 deep) and put microseconds to the unit. The model stays the same; the constants acquire meaning.
References
- Lab-04 — the formulas, confirmed on an L4: ~2.5× at batch 1, fading to 1.13× at 64.
upstream/vllm/compilation/cuda_graph.py— where capture cost is actually paid.- NVIDIA, CUDA Graphs blog — measured launch overheads that set the unit: https://developer.nvidia.com/blog/cuda-graphs/
- Phase 0 lab-04 — the roofline reason launch overhead only matters at small step sizes.
- Hennessy & Patterson, Computer Architecture — Amdahl's law, the reason ceilings dilute; any edition, the first chapter.