Phase 08 — Cheatsheet: Speculative Decoding

The one-liner

Cheap drafter guesses k next tokens → big model verifies all in ONE run → keep the longest correct prefix + 1 correction. Several tokens per expensive run; output identical to normal decoding.
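A minimal sketch of the greedy "keep the longest correct prefix + 1 correction" rule. The function name and the list-based interface are illustrative assumptions, not the upstream API; `target_argmax` stands for the target model's greedy token at each of the k + 1 verified positions.

```python
from typing import List, Sequence


def verify_greedy(target_argmax: Sequence[int], draft: Sequence[int]) -> List[int]:
    """Return the tokens emitted by one verification run (hypothetical helper).

    target_argmax[i] is the target's greedy token for the position that
    follows context + draft[:i], so it has len(draft) + 1 entries.
    """
    if len(target_argmax) != len(draft) + 1:
        raise ValueError("need one target token per draft position, plus one")
    emitted: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target_argmax[i]:
            # First mismatch: emit the target's token as the correction, stop.
            emitted.append(target_argmax[i])
            return emitted
        emitted.append(tok)
    # Every draft matched: the extra verified position yields one more token.
    emitted.append(target_argmax[len(draft)])
    return emitted
```

Each run emits between 1 and k + 1 tokens, and every emitted token is one the target would have produced greedily.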

Why it works

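Verification = a mini-prefill: the target scores all k drafted tokens in parallel in one run. Single-token decode is memory-bandwidth-bound (each run mostly pays to read the weights), so at small batch the extra positions ride on spare compute and checking k tokens ≈ one decode run. Speedup grows with acceptance rate.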
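To see how acceptance rate drives tokens per run, here is a small sketch under a simplifying assumption that is not in the text above: each drafted token is accepted independently with the same probability `alpha`. Under that assumption the expected number of tokens emitted per run is the geometric sum 1 + alpha + ... + alpha^k. The function name is illustrative.

```python
def expected_tokens_per_run(alpha: float, k: int) -> float:
    """Expected tokens emitted per target run with k drafts.

    Assumes i.i.d. per-token acceptance probability alpha (a simplification).
    Counts accepted drafts plus the one correction/bonus token.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

For example, alpha = 0.8 with k = 4 gives about 3.36 tokens per run, while alpha = 0 gives exactly 1 (plain decoding plus wasted draft work).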

Correctness

  • Greedy: accept only the target's argmax → identical output.
  • Sampling: rejection sampling. Accept the drafted token w.p. min(1, p_target/p_draft); on reject, resample from normalize(max(0, p_target−p_draft)) → exact target distribution.
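The sampling rule above can be sketched for a single position as follows. This is a pure-Python illustration with hypothetical function names, not the upstream `RejectionSampler`; distributions are plain lists of probabilities over a toy vocabulary, and the drafted token is assumed to have been sampled from `p_draft`.

```python
import random
from typing import List, Sequence, Tuple


def residual_distribution(p_target: Sequence[float], p_draft: Sequence[float]) -> List[float]:
    """normalize(max(0, p_target - p_draft)), computed elementwise."""
    raw = [max(0.0, pt - pd) for pt, pd in zip(p_target, p_draft)]
    total = sum(raw)
    if total == 0.0:
        # Distributions are identical, so a rejection cannot actually occur;
        # fall back to the target distribution for safety.
        return list(p_target)
    return [r / total for r in raw]


def accept_or_resample(
    token: int,
    p_target: Sequence[float],
    p_draft: Sequence[float],
    rng: random.Random,
) -> Tuple[int, bool]:
    """Return (emitted_token, accepted) for one drafted token."""
    # p_draft[token] > 0 because the token was sampled from p_draft.
    accept_prob = min(1.0, p_target[token] / p_draft[token])
    if rng.random() < accept_prob:
        return token, True
    residual = residual_distribution(p_target, p_draft)
    resampled = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return resampled, False
```

With p_target = [0.5, 0.5] and p_draft = [0.9, 0.1], token 0 is drafted 90% of the time but accepted only with probability 5/9, and every rejection resamples token 1, so the emitted tokens follow the 50/50 target distribution.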

Proposers

n-gram/prompt-lookup (free; great for repetitive/code) · EAGLE (trained head, predicts hidden states; best general) · Medusa · DFlash · suffix · small draft model.
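A simplified sketch of the n-gram/prompt-lookup idea: find an earlier occurrence of the most recent tokens and propose whatever followed it. The function name, the single fixed n-gram size, and the "most recent match wins" policy are assumptions made for illustration; the upstream proposer may differ in all three.

```python
from typing import List, Sequence


def propose_by_prompt_lookup(tokens: Sequence[int], ngram: int, k: int) -> List[int]:
    """Propose up to k draft tokens by matching the last `ngram` tokens
    against earlier context (hypothetical helper)."""
    if ngram <= 0 or k <= 0 or len(tokens) <= ngram:
        return []
    suffix = list(tokens[-ngram:])
    # Scan earlier start positions from right to left, skipping the suffix itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == suffix:
            follow = start + ngram
            # Copy the tokens that followed the earlier occurrence.
            return list(tokens[follow:follow + k])
    return []  # no match: nothing to propose, fall back to normal decoding
```

Usage: `propose_by_prompt_lookup([1, 2, 3, 4, 1, 2], ngram=2, k=3)` matches the earlier `[1, 2]` and proposes `[3, 4, 1]`. No model is run, which is why this proposer is effectively free and does well on repetitive text and code.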

Win/lose

Win: high acceptance, cheap drafter, small batch (spare capacity). Lose: low acceptance, or large batch (GPU already saturated). Condition: accepted/run × C_target > C_draft + extra_verify.
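The break-even condition can be checked numerically. The sketch below reads "accepted/run" as the average number of accepted draft tokens per run (excluding the correction token) and treats all costs as per-run times in the same unit; both readings are assumptions. Under them, a run costs C_target + C_draft + extra_verify and yields accepted + 1 tokens, so the speedup exceeds 1 exactly when the condition holds.

```python
def spec_decode_pays_off(
    accepted_per_run: float, c_target: float, c_draft: float, extra_verify: float
) -> bool:
    """The cheatsheet condition: accepted/run x C_target > C_draft + extra_verify."""
    return accepted_per_run * c_target > c_draft + extra_verify


def estimated_speedup(
    accepted_per_run: float, c_target: float, c_draft: float, extra_verify: float
) -> float:
    """Tokens per unit cost with speculation, relative to plain decoding."""
    cost_per_run = c_target + c_draft + extra_verify
    if cost_per_run <= 0:
        raise ValueError("total cost per run must be positive")
    tokens_per_run = accepted_per_run + 1.0  # accepted drafts + correction/bonus
    return tokens_per_run * c_target / cost_per_run
```

With 2 accepted drafts per run, C_target = 10, C_draft = 3 and extra_verify = 1, the estimate is 30/14 ≈ 2.14×. With only 0.2 accepted drafts per run the same costs give about 0.86×, a slowdown.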

Rides the scheduler

num_tokens_with_spec adds the draft tokens to the gap the scheduler fills (tokens known minus tokens already computed); num_lookahead_tokens reserves KV for them; the rejection result is applied in update_from_output. No special scheduler path.
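A toy model of that accounting, to make the "gap" concrete. `ToyRequest`, `tokens_to_schedule`, and `apply_verification` are hypothetical names invented for this sketch; only the field names echo the identifiers mentioned above, and the real scheduler tracks much more state (including the KV reservation, which is omitted here).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyRequest:
    token_ids: List[int]              # prompt + committed output tokens
    num_computed_tokens: int = 0      # tokens already run through the model
    spec_token_ids: List[int] = field(default_factory=list)  # pending drafts

    @property
    def num_tokens_with_spec(self) -> int:
        return len(self.token_ids) + len(self.spec_token_ids)


def tokens_to_schedule(req: ToyRequest) -> int:
    # The gap: everything known (drafts included) minus what is computed.
    return req.num_tokens_with_spec - req.num_computed_tokens


def apply_verification(req: ToyRequest, num_scheduled: int, emitted: List[int]) -> None:
    """emitted = accepted drafts followed by one correction/bonus token."""
    num_rejected = len(req.spec_token_ids) + 1 - len(emitted)
    # Count the scheduled tokens as computed, then roll back the rejected drafts.
    req.num_computed_tokens += num_scheduled - num_rejected
    req.token_ids.extend(emitted)
    req.spec_token_ids = []
```

In this toy, a request with 10 tokens (9 computed) and 3 drafts schedules 4 tokens; if 2 drafts are accepted, one is rolled back and the next step's gap is a single token, exactly as in plain decoding.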

Key upstream

  • v1/spec_decode/ngram_proposer.py:12/:131 · eagle.py:10 · medusa.py · dflash.py · suffix_decoding.py
  • v1/sample/rejection_sampler.py · :37 RejectionSampler · :87 forward · :392 rejection_sample
  • v1/spec_decode/metrics.py (acceptance) · scheduler.py (spec_token_ids / num_lookahead_tokens)

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md