Phase 08 — Cheatsheet: Speculative Decoding
The one-liner
Cheap drafter guesses k next tokens → big model verifies all in ONE run → keep the longest correct prefix + 1 token from the big model (the correction, or a bonus token if every guess was accepted). Several tokens per expensive run; output identical to normal decoding.
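A minimal sketch of one draft-then-verify step under greedy decoding. The `draft_next` / `target_next` callables are hypothetical stand-ins for the two models (prefix in, greedy next token out), not any engine's API. A real engine scores all k+1 positions in one batched forward pass; the loop here is only for readability.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface: token prefix -> greedy next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(
    prefix: List[int],
    draft_next: NextToken,
    target_next: NextToken,
    k: int,
) -> List[int]:
    """Return the tokens emitted by one draft-then-verify step (greedy)."""
    # 1. Drafter proposes k tokens autoregressively.
    drafts: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafts.append(tok)
        ctx.append(tok)

    # 2. Target picks its own token at each of the k+1 positions.
    #    (Conceptually ONE forward pass over prefix + drafts.)
    target_choices = [target_next(prefix + drafts[:i]) for i in range(k + 1)]

    # 3. Keep the longest prefix where drafter and target agree.
    emitted: List[int] = []
    for i in range(k):
        if drafts[i] != target_choices[i]:
            break
        emitted.append(drafts[i])

    # 4. Append the target's token at the first mismatch (correction),
    #    or its token after the last draft if all k matched (bonus).
    emitted.append(target_choices[len(emitted)])
    return emitted
```

Every emitted token is one the target would have chosen itself, which is why the output matches plain greedy decoding.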
Why it works
Verification = a mini-prefill: it scores k tokens in parallel the way prefill does, while plain decode is memory-bound and leaves compute idle. So checking k tokens ≈ one decode run. Speedup grows with acceptance rate.
Correctness
- Greedy: accept only the target's argmax → identical output.
- Sampling: rejection sampling. Accept w.p. min(1, p_target/p_draft); on reject, resample from normalize(max(0, p_target − p_draft)) → exact target distribution.
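The sampling rule above, sketched for a single position. Distributions are plain probability lists indexed by token id; the function names are illustrative and not taken from any library. It assumes the drafted token was sampled from p_draft, so p_draft[token] > 0.

```python
import random
from typing import List, Sequence, Tuple


def residual_distribution(
    p_target: Sequence[float], p_draft: Sequence[float]
) -> List[float]:
    """normalize(max(0, p_target - p_draft)), the resampling distribution."""
    residual = [max(0.0, pt - pd) for pt, pd in zip(p_target, p_draft)]
    total = sum(residual)
    if total == 0.0:
        # p_target == p_draft: rejection never happens, fall back to target.
        return list(p_target)
    return [r / total for r in residual]


def verify_token(
    token: int,
    p_target: Sequence[float],
    p_draft: Sequence[float],
    rng: random.Random,
) -> Tuple[bool, int]:
    """Accept the drafted token w.p. min(1, p_target/p_draft), else resample."""
    accept_prob = min(1.0, p_target[token] / p_draft[token])
    if rng.random() < accept_prob:
        return True, token
    residual = residual_distribution(p_target, p_draft)
    replacement = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return False, replacement
```

Why it is exact: the probability of emitting token x is p_draft[x] · min(1, p_target[x]/p_draft[x]) plus the total rejection probability times residual[x], which sums to p_target[x].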
Proposers
n-gram/prompt-lookup (free; great for repetitive/code) · EAGLE (trained head, predicts hidden states; best general) · Medusa · DFlash · suffix · small draft model.
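A rough sketch of the n-gram / prompt-lookup idea: find an earlier occurrence of the last n tokens in the context and propose whatever followed it. This is an illustrative toy, not the code in ngram_proposer.py; the function name and the single fixed n are assumptions.

```python
from typing import List


def ngram_propose(tokens: List[int], n: int, k: int) -> List[int]:
    """Propose up to k draft tokens by matching the last n tokens earlier
    in the context. Returns [] when there is no earlier match."""
    if n <= 0 or len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan right-to-left so the most recent earlier match wins;
    # the starting index excludes the suffix matching itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + k]
    return []
```

No model is run, so drafting costs almost nothing; it pays off when the output repeats spans of the prompt (code edits, quoting, structured text).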
Win/lose
Win: high acceptance, cheap drafter, small batch (spare capacity). Lose: low acceptance, or large
batch (GPU already saturated). Condition: accepted/run × C_target > C_draft + extra_verify.
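The break-even condition as arithmetic. This sketch assumes each draft token is accepted independently with probability alpha, which is a simplification since real acceptance depends on position and content. Costs are in arbitrary units and all names are illustrative.

```python
def expected_accepted(alpha: float, k: int) -> float:
    """Expected number of accepted draft tokens per run, assuming i.i.d.
    acceptance with probability alpha: sum of alpha**i for i = 1..k."""
    if alpha >= 1.0:
        return float(k)
    return alpha * (1.0 - alpha ** k) / (1.0 - alpha)


def speculation_pays_off(
    accepted_per_run: float,
    c_target: float,      # cost of one target decode run
    c_draft: float,       # total drafting cost per run
    extra_verify: float,  # extra cost of verifying k tokens vs. decoding 1
) -> bool:
    """accepted/run x C_target > C_draft + extra_verify."""
    return accepted_per_run * c_target > c_draft + extra_verify
```

With alpha = 0.8 and k = 4, about 2.36 drafts are accepted per run, so speculation wins while drafting plus extra verification costs less than about 2.36 target runs. At large batch, extra_verify grows because the GPU has no spare capacity, and the inequality flips.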
Rides the scheduler
num_tokens_with_spec adds drafts to the gap; num_lookahead_tokens reserves KV; rejection result
applied in update_from_output. No special scheduler path.
Key upstream
- v1/spec_decode/ngram_proposer.py:12/:131 · eagle.py:10 · medusa.py · dflash.py · suffix_decoding.py
- v1/sample/rejection_sampler.py :37 RejectionSampler · :87 forward · :392 rejection_sample
- v1/spec_decode/metrics.py (acceptance) · scheduler.py (spec_token_ids / num_lookahead_tokens)
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md