Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Lab 08-04 — The Speculative-Decoding Speedup Model [CPU-OK]

Three functions, maybe fifteen lines — and at the end of them you can answer, with arithmetic, the questions that decide whether speculative decoding ships: How much faster, given my drafter's acceptance rate? What draft length k should I configure? And when does spec decode make things worse? (It can. test_spec_decode_can_lose proves a mediocre drafter at realistic cost is a net loss, and optimal_k tells you the right configuration is zero.) This is the expected-speedup model from the original paper, the same arithmetic behind vLLM's num_speculative_tokens default debates — and after this lab, behind your config choices instead of your hopes.

Contents


Why this lab exists

Speculative decoding is the most conditionally valuable optimization in this course: transformative for a single latency-bound stream with a sharp drafter, worthless — or harmful — for a saturated-throughput deployment with a dull one. Teams burn weeks discovering this empirically because they treat "enable spec decode" as a boolean instead of an equation. The equation has three inputs you can measure independently — acceptance rate alpha (lab-03's overlap, read from vLLM's spec-decode metrics), draft cost c (drafter step time / target step time, read from a profile), and draft length k (the config knob) — and one output. This lab makes you fluent in it, including its honest failure regions.

It's also a clean specimen of modeling discipline: the formula assumes i.i.d. per-position acceptance, which is false in detail (acceptance correlates within a phrase) but accurate in expectation — and test_expected_tokens_matches_simulation shows the model agreeing with a 200k-cycle simulation to 1%. Knowing how to validate a simplification is worth more than distrusting all simplifications.

Background: the three-parameter economy

One speculation cycle: draft k tokens (cost k·c), then one target forward verifies all of them in a single batch (cost ≈ 1 — this batching is the entire trick; the verify scores k+1 positions for the price of one decode step because prefill-shaped work is compute-cheap, Phase 0 lab-04). The cycle emits the leading run of accepted tokens plus one (the correction on first rejection, or the bonus token when everything passes — lab-01's verify_greedy mechanics):

E[tokens/cycle] = 1 + α + α² + … + α^k = (1 − α^(k+1)) / (1 − α)

speedup(α, k, c) = E[tokens/cycle] / (k·c + 1)

Two structural facts fall out before you compute anything. Diminishing returns in k: the i-th draft token only counts if all before it accepted, so its marginal value is α^i — geometrically decaying, while its cost c is constant; past some k every extra token is negative-margin (hence optimal_k). The ceiling: even free drafts can't beat 1/(1−α) tokens per cycle — at α=0.7 that's 3.3×, at α=0.5 it's 2× — so chasing speedup beyond the ceiling means improving the drafter, not the config.

Files

  • starter.pyexpected_tokens_per_verify, speedup, optimal_k. Your work.
  • solution.py — reference.
  • test_lab.py — formula edges, the simulation check, the free-drafter bound, the losing regime, EAGLE-ballpark numbers, and both monotonicities of optimal_k.

Run

LAB_IMPL=starter pytest phase-08-speculative-decoding/labs/lab-04-speedup-model -q
pytest phase-08-speculative-decoding/labs/lab-04-speedup-model -q   # reference

What the tests prove

TestWhat it pins
test_expected_tokens_formulaThe geometric series and both edges: α=0 → 1 (corrections only — spec decode never emits less than baseline per cycle), α=1 → k+1
test_expected_tokens_matches_simulationThe i.i.d. model vs 200k simulated cycles: < 1% off. The simplification, validated
test_free_drafter_always_winsc=0 (the n-gram drafter — lab-01): speedup = E[tokens] ≥ 1. Free drafts can't lose, which is why prompt-lookup ships as a default-safe option
test_spec_decode_can_loseα=0.2, c=0.5, k=5 → speedup < 1, and optimal_k = 0. The model's most valuable output is "turn it off"
test_eagle_like_numbersα≈0.7, c≈0.05, k=5 → ~2.3× — the ballpark real EAGLE deployments report (lab-02's measured 1.9× at k=5 sits right here once you account for overheads the model omits)
test_optimal_k_grows_with_alpha_and_shrinks_with_costThe two intuitive monotonicities, made checkable
test_diminishing_returns_in_kThe marginal value of draft token i decays geometrically — why k=5-ish is so common and k=20 is almost never right

Hitchhiker's notes

  • What the model omits, and which way each omission points: verify cost grows (slightly) with k — real cost is k·c + (1+εk), pushing optimal k down; the drafter and target compete for the GPU at high batch — at saturation the "spare compute" funding the whole scheme disappears (lab-02's shrinking-win observation), pushing value down; per-cycle fixed overheads (extra kernel launches, sampler work) hurt small-α configs most. The model is an upper bound with known biases — the most useful kind.
  • The same arithmetic prices drafters against each other: n-gram (α low on prose, high on code; c ≈ 0), EAGLE (α ≈ 0.6–0.8; c ≈ 0.03–0.08 — a one-layer head), a half-size draft model (α high; c ≈ 0.2–0.5 — usually dominated by EAGLE on this math, which is why standalone draft models faded). Three points on a (α, c) plane; speedup is the contour map. Plot it once and the drafter literature organizes itself.
  • α is workload-dependent, so the right k is too: code and structured output accept far better than creative prose (lab-02 measured 80% vs ~56%). A deployment serving both has no single optimal k — which is why dynamic/adaptive speculation (adjusting k per request from rolling acceptance) is an active upstream direction. The model you just built is the controller's objective function.
  • Where to read your fleet's α: vLLM's spec-decode metrics (acceptance counts per position, mean acceptance length — the 2.8 / 5 in lab-02's capture). Mean acceptance length ≈ E[tokens/cycle] − 1; invert your formula and you can back out the effective α from production logs. Do it before and after a drafter upgrade and you have the business case in two numbers.

Going further

  • Plot speedup vs k for α ∈ {0.3, 0.5, 0.7, 0.9} at c = 0.05: watch the maximum slide right and up with α. Add c = 0.3 curves and watch speculation die for low α. This one figure is the deployment decision.
  • Add the batch-saturation term: model verify cost as 1 + λ·k where λ grows with batch utilization, and find the utilization where optimal_k hits 0 — you've derived "spec decode is a latency tool, not a throughput tool" instead of memorizing it.
  • Replace i.i.d. α with a two-state model (in-phrase α_high, at-boundary α_low) and re-derive E[tokens] — then check whether the i.i.d. fit to the mean still predicts well. (It does, mostly — means are forgiving. Tail latency per cycle is not; explore the variance.)

References

  • Leviathan et al., Fast Inference from Transformers via Speculative Decoding (2022) — §3.1 is this lab's formula: https://arxiv.org/abs/2211.17192
  • Li et al., EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (2024) — where the (α≈0.7, c≈0.05) point comes from: https://arxiv.org/abs/2401.15077
  • upstream/vllm/v1/spec_decode/ — proposer implementations; the metrics module that exports your α.
  • vLLM docs, Speculative Decodingnum_speculative_tokens and method configs: https://docs.vllm.ai/en/latest/features/spec_decode/
  • Lab-03 — where α comes from (Σ min(p,q)); lab-02 — the measured numbers this model predicts; Phase 0 lab-04 — why the batched verify costs ~1.