Phase 08 Labs — Speculative Decoding

Four labs on the art of spending idle FLOPs to buy latency. The arc: build the draft→verify machine with a free drafter (lab-01), prove the losslessness theorem for the sampled case (lab-03), price the trade with the expected-speedup model — including when to turn it off (lab-04), then measure the state of the art (EAGLE) on real silicon and reconcile every number against the models you built (lab-02).

Recommended order: 01 → 03 → 04 → 02, which runs mechanism, theorem, economics, measurement. (The directory numbers predate labs 03–04, so they do not match this order.) CPU labs follow the standard contract: starter.py (your work), solution.py (reference), test_lab.py (the spec). By default the tests run the solution; set LAB_IMPL=starter to grade yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-08-speculative-decoding/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-08-speculative-decoding/labs/lab-01-ngram-spec-decode -q

Labs

lab-01-ngram-spec-decode [CPU-OK]

The whole machine with the simplest drafter: n-gram prompt-lookup proposes, a greedy verifier accepts the leading run, corrections and bonus tokens keep progress ≥ 1 token/cycle. Proven token-identical to baseline; speedup measured as a property of the text (dramatic on repetitive, zero on random). Skills: the invariant verify loop; tokens-per-run as THE metric; graceful degradation; the evolving-context off-by-one.
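The draft→verify cycle described above can be sketched in a few lines. This is a minimal illustration, not the lab's solution.py: the names ngram_draft, spec_decode, greedy_decode, and the toy target_next callable are all hypothetical, and the verifier is called once per position here, whereas a real engine would score all drafted positions in one batched pass.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
# Hypothetical greedy target: maps a context to its single next token.
NextFn = Callable[[Sequence[Token]], Token]


def ngram_draft(ctx: Sequence[Token], n: int, k: int) -> List[Token]:
    """Propose up to k tokens by copying what followed the most recent
    earlier occurrence of the last n tokens (prompt lookup)."""
    if len(ctx) < n + 1:
        return []
    tail = list(ctx[-n:])
    # Scan right to left, skipping the trailing occurrence itself.
    for start in range(len(ctx) - n - 1, -1, -1):
        if list(ctx[start:start + n]) == tail:
            return list(ctx[start + n:start + n + k])
    return []


def spec_decode(target_next: NextFn, prompt: Sequence[Token], max_new: int,
                n: int = 2, k: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (new tokens, verify cycles)."""
    ctx = list(prompt)
    cycles = 0
    while len(ctx) - len(prompt) < max_new:
        cycles += 1
        accepted: List[Token] = []
        for tok in ngram_draft(ctx, n, k):
            # The context evolves with each accepted token: verify position i
            # against ctx + accepted, not against ctx alone.
            want = target_next(ctx + accepted)
            if want != tok:
                accepted.append(want)  # correction token ends the run
                break
            accepted.append(tok)
        else:
            # Whole draft accepted (or draft was empty): take the bonus token.
            accepted.append(target_next(ctx + accepted))
        ctx.extend(accepted)  # always at least one token per cycle
    return ctx[len(prompt):len(prompt) + max_new], cycles


def greedy_decode(target_next: NextFn, prompt: Sequence[Token], max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    ctx = list(prompt)
    for _ in range(max_new):
        ctx.append(target_next(ctx))
    return ctx[len(prompt):]


# Toy targets (assumptions for illustration only).
def repetitive(ctx: Sequence[Token]) -> Token:
    return (ctx[-1] + 1) % 5


def scrambled(ctx: Sequence[Token]) -> Token:
    return (sum(ctx) * 7 + len(ctx)) % 11


prompt = [0, 1, 2, 3, 4, 0, 1]
out, cycles = spec_decode(repetitive, prompt, max_new=20)
print(out == greedy_decode(repetitive, prompt, 20), cycles)
```

On the repetitive toy target every draft is accepted, so each cycle yields several tokens; on the scrambled one the drafter mostly misses and the loop degrades toward one token per cycle, while the output stays identical to the baseline in both cases.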

lab-02-eagle-on-real-vllm [GPU-OPT]

The integration test: EAGLE (a one-layer head reading the target's hidden states) on Llama-3-8B — ITL 18.2 → 9.6 ms, acceptance 2.8/5 — reconciled number-by-number against labs 01/03/04, including the two honest qualifications (acceptance is workload-dependent; the win fades at saturated batch). Annotated capture included. Skills: predict-then-measure; the three acceptance metrics and their denominators; spec decode as a latency tool funded by spare compute.

lab-03-rejection-sampling [CPU-OK]

The theorem: accept draft x with min(1, p[x]/q[x]), else resample from normalize(max(p − q, 0)) — and the output is distributed exactly as the target, for any drafter. Verified empirically (200k draws through a clueless uniform drafter land on the target to 0.005), plus the closed form α = Σ min(p, q) and the adversarial limits. Skills: the residual construction; distributional testing with calibrated tolerances; α as distribution overlap.
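The accept-or-resample rule above fits in one function. The sketch below is an illustration under stated assumptions, not the lab's reference code: spec_sample and empirical are hypothetical names, p and q are plain probability lists of equal length, and only the standard library is used.

```python
import random
from typing import List, Sequence, Tuple


def spec_sample(p: Sequence[float], q: Sequence[float],
                rng: random.Random) -> Tuple[int, bool]:
    """Draw one token distributed as p using a draft from q.

    Returns (token, accepted).
    """
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]       # drafter proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):   # accept with min(1, p[x]/q[x])
        return x, True
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    # random.choices normalizes its weights, giving normalize(max(p - q, 0)).
    return rng.choices(vocab, weights=residual)[0], False


def empirical(p: Sequence[float], q: Sequence[float],
              n: int, seed: int = 0) -> Tuple[List[float], float]:
    """Return (observed token frequencies, observed acceptance rate)."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    accepted = 0
    for _ in range(n):
        tok, ok = spec_sample(p, q, rng)
        counts[tok] += 1
        accepted += ok
    return [c / n for c in counts], accepted / n


p = [0.6, 0.3, 0.1]          # target distribution
q = [1 / 3, 1 / 3, 1 / 3]    # clueless uniform drafter
freqs, accept_rate = empirical(p, q, n=200_000)
alpha = sum(min(pi, qi) for pi, qi in zip(p, q))  # closed form: sum of min(p, q)
print(freqs, accept_rate, alpha)
```

The observed frequencies should track p rather than q, and the observed acceptance rate should track the closed-form α, which is the overlap between the two distributions.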

lab-04-speedup-model [CPU-OK]

The economics in three functions: E[tokens/cycle] = (1−α^(k+1))/(1−α), speedup = E[tokens/cycle] / (k·c + 1), and optimal_k, which is sometimes zero (a mediocre drafter at real cost loses to no speculation, and the model says so). Validated against simulation to 1%; EAGLE's published numbers drop out of the formula. Skills: the (α, c, k) economy; diminishing returns; why free drafters can't lose and saturated GPUs can't win.
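The three functions can be sketched directly from the formulas. The readings of the symbols are assumptions here: α is taken as an independent per-token acceptance probability, k as the number of drafted tokens, and c as the cost of one draft step relative to one target pass. The function names other than optimal_k, and the k_max search bound, are hypothetical.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """E[tokens/cycle] = (1 - alpha^(k+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return float(k + 1)  # limit as alpha -> 1: every draft plus the bonus
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, c: float, k: int) -> float:
    """Tokens per cycle divided by cycle cost in target-pass units (k*c + 1)."""
    return expected_tokens(alpha, k) / (k * c + 1.0)


def optimal_k(alpha: float, c: float, k_max: int = 16) -> int:
    """Best draft length in [0, k_max]; 0 means do not speculate.

    max() returns the first maximum, so ties resolve to the smaller k.
    """
    return max(range(k_max + 1), key=lambda k: speedup(alpha, c, k))


print(speedup(0.8, 0.1, 4))   # good drafter at modest cost: above 1
print(optimal_k(0.3, 0.5))    # mediocre drafter at real cost: 0
print(optimal_k(0.3, 0.0))    # free drafter: longer is never worse
```

With k = 0 the model returns a speedup of exactly 1, so optimal_k only chooses to speculate when some k beats the baseline. With c = 0 the speedup is non-decreasing in k, which is the sense in which a free drafter cannot lose.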

What you can do after this phase

Explain why speculative decoding is lossless — separately for greedy (trivial) and sampled (the residual theorem) — and test the claim distributionally; evaluate any drafter from two measured numbers (α on your traffic, c from a profile) before deploying it; choose num_speculative_tokens from arithmetic; and reconcile vLLM's spec-decode metrics with first principles. Phase 9 broadens sampling itself; the verify machinery you now own reappears wherever one batched pass scores many candidates.