Phase 08 Labs — Speculative Decoding
Four labs on the art of spending idle FLOPs to buy latency. The arc: build the draft→verify machine with a free drafter (lab-01), prove the losslessness theorem for the sampled case (lab-03), price the trade with the expected-speedup model — including when to turn it off (lab-04), then measure the state of the art (EAGLE) on real silicon and reconcile every number against the models you built (lab-02).
Recommended order: 01 → 03 → 04 → 02. Directory numbers predate labs 03–04, so
follow the arc rather than the numbering: mechanism, theorem, economics, measurement.
CPU labs follow the standard contract: starter.py (your work), solution.py (reference),
test_lab.py (the spec). By default the tests run the solution; LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-08-speculative-decoding/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-08-speculative-decoding/labs/lab-01-ngram-spec-decode -q
Contents
- lab-01-ngram-spec-decode [CPU-OK]
- lab-02-eagle-on-real-vllm [GPU-OPT]
- lab-03-rejection-sampling [CPU-OK]
- lab-04-speedup-model [CPU-OK]
- What you can do after this phase
Labs
lab-01-ngram-spec-decode [CPU-OK]
The whole machine with the simplest drafter: n-gram prompt-lookup proposes, a greedy verifier accepts the leading run, corrections and bonus tokens keep progress ≥ 1 token/cycle. Proven token-identical to baseline; speedup measured as a property of the text (dramatic on repetitive, zero on random). Skills: the invariant verify loop; tokens-per-run as THE metric; graceful degradation; the evolving-context off-by-one.
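To make the loop concrete, here is a minimal sketch of the draft→verify cycle described above. It is not the lab's reference solution: `ngram_propose`, `speculative_generate`, and the `target_next` callable are hypothetical names chosen for illustration, and the "target model" is a toy deterministic function. A real engine scores all draft positions in one batched forward pass; this sketch calls the target once per position to keep the logic visible.

```python
def ngram_propose(context, n=2, k=4):
    """Prompt-lookup drafter (illustrative): find the most recent earlier
    occurrence of the last n tokens and propose up to k tokens that followed it."""
    if len(context) < n:
        return []
    tail = context[-n:]
    # Scan right to left, skipping the tail's own position.
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == tail:
            return context[start + n:start + n + k]
    return []


def speculative_generate(target_next, prompt, max_new, n=2, k=4):
    """Greedy draft/verify loop. target_next(context) -> next token is an
    assumed stand-in for the target model's greedy choice."""
    out = list(prompt)
    limit = len(prompt) + max_new
    cycles = 0
    while len(out) < limit:
        cycles += 1
        draft = ngram_propose(out, n, k)
        for tok in draft:
            # Verify against the evolving context: each check sees the
            # tokens accepted earlier in this same cycle.
            target = target_next(out)
            out.append(target)  # either the accepted draft token or the correction
            if target != tok or len(out) >= limit:
                break
        else:
            # Every draft token matched (or the draft was empty): the verify
            # pass still yields one bonus token, so progress is >= 1 per cycle.
            out.append(target_next(out))
    return out, cycles


# Toy target: a strictly periodic "text", the best case for prompt lookup.
PATTERN = [1, 2, 3, 4, 5]

def toy_target(ctx):
    return PATTERN[len(ctx) % len(PATTERN)]

prompt = [1, 2, 3, 4, 5, 1, 2]
baseline = list(prompt)
for _ in range(20):
    baseline.append(toy_target(baseline))

out, cycles = speculative_generate(toy_target, prompt, max_new=20)
print("identical to baseline:", out == baseline)
print("tokens per cycle:", 20 / cycles)
```

Because every appended token comes from `target_next`, the output matches the baseline by construction; the drafter only decides how many target tokens land per cycle. On text with no repeated n-grams the draft is empty and the loop degrades to one token per cycle.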
lab-02-eagle-on-real-vllm [GPU-OPT]
The integration test: EAGLE (a one-layer head reading the target's hidden states) on Llama-3-8B — ITL 18.2 → 9.6 ms, acceptance 2.8/5 — reconciled number-by-number against labs 01/03/04, including the two honest qualifications (acceptance is workload-dependent; the win fades at saturated batch). Annotated capture included. Skills: predict-then-measure; the three acceptance metrics and their denominators; spec decode as a latency tool funded by spare compute.
lab-03-rejection-sampling [CPU-OK]
The theorem: accept draft x with min(1, p[x]/q[x]), else resample from
normalize(max(p − q, 0)) — and the output is distributed exactly as the target,
for any drafter. Verified empirically (200k draws through a clueless uniform drafter
land on the target to 0.005), plus the closed form α = Σ min(p, q) and the adversarial
limits. Skills: the residual construction; distributional testing with calibrated
tolerances; α as distribution overlap.
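The accept-or-resample rule above can be sketched in a few lines of standard-library Python. The function names (`overlap`, `speculative_sample`), the three-token distributions, the draw count, and the tolerance are all assumptions picked for illustration, not the lab's actual test parameters.

```python
import random


def overlap(p, q):
    """Closed-form acceptance probability: alpha = sum_x min(p[x], q[x])."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def speculative_sample(p, q, rng):
    """One speculative step for target p and drafter q (illustrative).
    Returns (token, accepted_flag)."""
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]          # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):      # accept with min(1, p[x]/q[x])
        return x, True
    # Rejected: resample from the residual max(p - q, 0).
    # rng.choices normalizes the weights, so no explicit division is needed.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0], False


rng = random.Random(0)
p = [0.6, 0.3, 0.1]          # target distribution (assumed example)
q = [1 / 3, 1 / 3, 1 / 3]    # a "clueless" uniform drafter
n = 100_000

counts = [0] * len(p)
accepted = 0
for _ in range(n):
    token, ok = speculative_sample(p, q, rng)
    counts[token] += 1
    accepted += ok

freqs = [c / n for c in counts]
accept_rate = accepted / n
print("empirical output:", [round(f, 3) for f in freqs], "target:", p)
print("acceptance rate:", round(accept_rate, 3), "alpha:", round(overlap(p, q), 3))
```

The reason it works: token x is emitted with probability min(p[x], q[x]) via acceptance plus (1 − α) · max(p[x] − q[x], 0) / (1 − α) via the residual, and those two terms sum to p[x]. The drafter's quality changes only the acceptance rate, never the output distribution.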
lab-04-speedup-model [CPU-OK]
The economics in three functions: E[tokens/cycle] = (1−α^(k+1))/(1−α), speedup =
that over k·c + 1, and optimal_k — which is sometimes zero (a mediocre drafter
at real cost loses to no speculation, and the model says so). Validated against
simulation to 1%; EAGLE's published numbers drop out of the formula. Skills: the
(α, c, k) economy; diminishing returns; why free drafters can't lose and saturated
GPUs can't win.
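The three functions might look like the sketch below. It assumes the usual simplifications: each draft token is accepted independently with probability α, c is the cost of one draft step relative to one target pass, and a cycle costs k·c + 1. The function names other than `optimal_k`, the `k_max` search bound, and the example values of α and c are illustrative assumptions, not the lab's signatures.

```python
def expected_tokens(alpha, k):
    """E[tokens/cycle] = (1 - alpha^(k+1)) / (1 - alpha), bonus token included."""
    if alpha >= 1.0:
        return float(k + 1)  # limit of the formula as alpha -> 1
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha, c, k):
    """Tokens per cycle divided by cycle cost (k draft steps plus one verify)."""
    return expected_tokens(alpha, k) / (k * c + 1.0)


def optimal_k(alpha, c, k_max=16):
    """Best draft length in [0, k_max]; k = 0 means 'do not speculate'.
    Ties resolve to the smallest k because max() keeps the first maximum."""
    return max(range(k_max + 1), key=lambda k: speedup(alpha, c, k))


# A mediocre drafter at real cost: every k >= 1 is slower than baseline.
print(optimal_k(alpha=0.3, c=0.5))
# A good, cheap drafter: speculation pays, with diminishing returns in k.
best = optimal_k(alpha=0.8, c=0.05)
print(best, round(speedup(0.8, 0.05, best), 2))
```

With k = 0 the formula gives exactly one token at cost one, so a speedup of 1.0 is always available; `optimal_k` returns a positive k only when some draft length beats that. Setting c = 0 (a free drafter) makes the speedup non-decreasing in k, which is the arithmetic behind "free drafters can't lose".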
What you can do after this phase
Explain why speculative decoding is lossless — separately for greedy (trivial) and
sampled (the residual theorem) — and test the claim distributionally; evaluate any
drafter from two measured numbers (α on your traffic, c from a profile) before
deploying it; choose num_speculative_tokens from arithmetic; and reconcile vLLM's
spec-decode metrics with first principles. Phase 9 broadens sampling itself; the verify
machinery you now own reappears wherever one batched pass scores many candidates.