Lab 08-04 — The Speculative-Decoding Speedup Model [CPU-OK]
Three functions, maybe fifteen lines — and at the end of them you can answer, with
arithmetic, the questions that decide whether speculative decoding ships: How much
faster, given my drafter's acceptance rate? What draft length k should I configure? And
when does spec decode make things worse? (It can. test_spec_decode_can_lose proves a
mediocre drafter at realistic cost is a net loss, and optimal_k tells you the right
configuration is zero.) This is the expected-speedup model from the original paper,
the same arithmetic behind vLLM's num_speculative_tokens default debates — and after
this lab, behind your config choices instead of your hopes.
Contents
- Why this lab exists
- Background: the three-parameter economy
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Speculative decoding is the most conditionally valuable optimization in this course:
transformative for a single latency-bound stream with a sharp drafter, worthless — or
harmful — for a saturated-throughput deployment with a dull one. Teams burn weeks
discovering this empirically because they treat "enable spec decode" as a boolean
instead of an equation. The equation has three inputs you can measure independently —
acceptance rate alpha (lab-03's overlap, read from vLLM's spec-decode metrics), draft
cost c (drafter step time / target step time, read from a profile), and draft length
k (the config knob) — and one output. This lab makes you fluent in it, including its
honest failure regions.
It's also a clean specimen of modeling discipline: the formula assumes i.i.d.
per-position acceptance, which is false in detail (acceptance correlates within a
phrase) but accurate in expectation — and test_expected_tokens_matches_simulation
shows the model agreeing with a 200k-cycle simulation to 1%. Knowing how to validate a
simplification is worth more than distrusting all simplifications.
Background: the three-parameter economy
One speculation cycle: draft k tokens (cost k·c), then one target forward verifies
all of them in a single batch (cost ≈ 1 — this batching is the entire trick; the
verify scores k+1 positions for the price of one decode step because prefill-shaped
work is compute-cheap, Phase 0 lab-04). The cycle emits the leading run of accepted
tokens plus one (the correction on first rejection, or the bonus token when everything
passes — lab-01's verify_greedy mechanics):
E[tokens/cycle] = 1 + α + α² + … + α^k = (1 − α^(k+1)) / (1 − α)
speedup(α, k, c) = E[tokens/cycle] / (k·c + 1)
Two structural facts fall out before you compute anything. Diminishing returns in k:
the i-th draft token only counts if all before it accepted, so its marginal value is
α^i — geometrically decaying, while its cost c is constant; past some k every extra
token is negative-margin (hence optimal_k). The ceiling: even free drafts can't
beat 1/(1−α) tokens per cycle — at α=0.7 that's 3.3×, at α=0.5 it's 2× — so chasing
speedup beyond the ceiling means improving the drafter, not the config.
Files
starter.py—expected_tokens_per_verify,speedup,optimal_k. Your work.solution.py— reference.test_lab.py— formula edges, the simulation check, the free-drafter bound, the losing regime, EAGLE-ballpark numbers, and both monotonicities ofoptimal_k.
Run
LAB_IMPL=starter pytest phase-08-speculative-decoding/labs/lab-04-speedup-model -q
pytest phase-08-speculative-decoding/labs/lab-04-speedup-model -q # reference
What the tests prove
| Test | What it pins |
|---|---|
test_expected_tokens_formula | The geometric series and both edges: α=0 → 1 (corrections only — spec decode never emits less than baseline per cycle), α=1 → k+1 |
test_expected_tokens_matches_simulation | The i.i.d. model vs 200k simulated cycles: < 1% off. The simplification, validated |
test_free_drafter_always_wins | c=0 (the n-gram drafter — lab-01): speedup = E[tokens] ≥ 1. Free drafts can't lose, which is why prompt-lookup ships as a default-safe option |
test_spec_decode_can_lose | α=0.2, c=0.5, k=5 → speedup < 1, and optimal_k = 0. The model's most valuable output is "turn it off" |
test_eagle_like_numbers | α≈0.7, c≈0.05, k=5 → ~2.3× — the ballpark real EAGLE deployments report (lab-02's measured 1.9× at k=5 sits right here once you account for overheads the model omits) |
test_optimal_k_grows_with_alpha_and_shrinks_with_cost | The two intuitive monotonicities, made checkable |
test_diminishing_returns_in_k | The marginal value of draft token i decays geometrically — why k=5-ish is so common and k=20 is almost never right |
Hitchhiker's notes
- What the model omits, and which way each omission points: verify cost grows
(slightly) with k — real cost is
k·c + (1+εk), pushing optimal k down; the drafter and target compete for the GPU at high batch — at saturation the "spare compute" funding the whole scheme disappears (lab-02's shrinking-win observation), pushing value down; per-cycle fixed overheads (extra kernel launches, sampler work) hurt small-α configs most. The model is an upper bound with known biases — the most useful kind. - The same arithmetic prices drafters against each other: n-gram (α low on prose,
high on code; c ≈ 0), EAGLE (α ≈ 0.6–0.8; c ≈ 0.03–0.08 — a one-layer head), a
half-size draft model (α high; c ≈ 0.2–0.5 — usually dominated by EAGLE on this
math, which is why standalone draft models faded). Three points on a (α, c) plane;
speedupis the contour map. Plot it once and the drafter literature organizes itself. - α is workload-dependent, so the right k is too: code and structured output accept far better than creative prose (lab-02 measured 80% vs ~56%). A deployment serving both has no single optimal k — which is why dynamic/adaptive speculation (adjusting k per request from rolling acceptance) is an active upstream direction. The model you just built is the controller's objective function.
- Where to read your fleet's α: vLLM's spec-decode metrics (acceptance counts per
position, mean acceptance length — the
2.8 / 5in lab-02's capture). Mean acceptance length ≈E[tokens/cycle] − 1; invert your formula and you can back out the effective α from production logs. Do it before and after a drafter upgrade and you have the business case in two numbers.
Going further
- Plot speedup vs k for α ∈ {0.3, 0.5, 0.7, 0.9} at c = 0.05: watch the maximum slide right and up with α. Add c = 0.3 curves and watch speculation die for low α. This one figure is the deployment decision.
- Add the batch-saturation term: model verify cost as
1 + λ·kwhere λ grows with batch utilization, and find the utilization where optimal_k hits 0 — you've derived "spec decode is a latency tool, not a throughput tool" instead of memorizing it. - Replace i.i.d. α with a two-state model (in-phrase α_high, at-boundary α_low) and re-derive E[tokens] — then check whether the i.i.d. fit to the mean still predicts well. (It does, mostly — means are forgiving. Tail latency per cycle is not; explore the variance.)
References
- Leviathan et al., Fast Inference from Transformers via Speculative Decoding (2022) — §3.1 is this lab's formula: https://arxiv.org/abs/2211.17192
- Li et al., EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (2024) — where the (α≈0.7, c≈0.05) point comes from: https://arxiv.org/abs/2401.15077
upstream/vllm/v1/spec_decode/— proposer implementations; the metrics module that exports your α.- vLLM docs, Speculative Decoding —
num_speculative_tokensand method configs: https://docs.vllm.ai/en/latest/features/spec_decode/ - Lab-03 — where α comes from (
Σ min(p,q)); lab-02 — the measured numbers this model predicts; Phase 0 lab-04 — why the batched verify costs ~1.