Phase 08 — The Hitchhiker's Guide to Speculative Decoding

← Phase 07 · Course home · Phase 09 →

Don't Panic
Step 1: Why checking many tokens costs ~one run
Step 2: Where the guesses come from (the proposers)
Step 3: Why the output doesn't change (the honest part)
Step 4: When it helps and when it hurts (the tradeoff to reason about)
Step 5: How it rides the same scheduler (no special case)
The invariants to memorize
What you'll do

Don't Panic

Decode is slow because each token needs a full run of the big model (Phase 0: one expensive "haul the books" memory read per token). Speculative decoding is a clever cheat:

A cheap "drafter" guesses the next several tokens. The big model then checks all of them in a single run and keeps the longest correct prefix. If the guesses are good, you get several tokens for the price of one big-model run — and, remarkably, the output is exactly what the big model would have produced anyway.

Think of it like a fast typist proposing the rest of your sentence and you (the careful editor) glancing at it: if the first few words are right, you accept them instantly and only start typing yourself where the guess goes wrong. You did far less typing, and the final sentence is identical to what you'd have written.

Big model alone:   [run]→t1  [run]→t2  [run]→t3  [run]→t4      4 expensive runs, 4 tokens
Speculative:       drafter guesses [t1,t2,t3,t4]  →  big model verifies all in ONE run
                   accepts t1,t2,t3 (t4 wrong) + fixes it      1 expensive run, ~4 tokens

Step 1: Why checking many tokens costs ~one run

This is the magic that makes it work. Remember from Phase 0 that prefill processes many tokens in one run cheaply (it's compute-bound, the math units stay busy). Verification is just a mini-prefill: feed the big model the context plus the drafted tokens, and in one run it produces, for each position, what it would have predicted there. Now compare:

context = "The capital of France is"
draft   = [" Paris", ".", " The"]            (the cheap drafter's guess)

one big-model run over context+draft gives the big model's own prediction at each spot:
   after "...is"        big model says " Paris"   == draft[0] ✓ accept
   after "...is Paris"  big model says "."        == draft[1] ✓ accept
   after "...is Paris." big model says " It"      != draft[2] ✗ stop, use " It"
result: accepted " Paris", ".", then the correction " It"   →  3 tokens from 1 run

So one expensive run yielded 3 tokens instead of 1. The acceptance rate (how many guesses the big model agrees with) decides the speedup.

Step 2: Where the guesses come from (the proposers)

Different "drafters," from free to fancy:

n-gram / prompt-lookup (free, no model): if the recent text repeats something seen earlier (a name, a code snippet, a quoted phrase), just copy what followed it last time. Shockingly effective for repetitive content (code, structured data, summarization).
EAGLE (a tiny trained head): a small network trained to predict the big model's next hidden states, giving high-quality guesses cheaply. One of the best methods today.
Medusa (extra prediction heads), DFlash, suffix decoding, a small draft model: variations on "produce cheap guesses."

vLLM supports several (vllm/v1/spec_decode/). You'll build the n-gram proposer yourself in lab-01 because it needs no model and lays bare the whole mechanism.

Step 3: Why the output doesn't change (the honest part)

A natural worry: "if a cheap drafter is involved, is the output worse?" No — and this is the beautiful guarantee. For greedy decoding it's obvious: you only accept a drafted token if it equals what the big model would have picked anyway (its argmax); the instant a guess disagrees, you throw it away and use the big model's choice. So the accepted sequence is identical to plain greedy decoding.

For random sampling, there's a slightly cleverer rule — rejection sampling — that accepts/ rejects drafted tokens with just the right probabilities so the final distribution is provably exactly the big model's distribution. Either way: speculative decoding changes the speed, never the output. (Same "optimization ≠ behavior change" theme as the KV cache and chunked prefill.)

Step 4: When it helps and when it hurts (the tradeoff to reason about)

Verification isn't totally free — drafting costs a little, and the big model does slightly more work per run (it processes the draft tokens too). So:

speculative is a win when:   accepted_tokens_per_run  >  cost of (drafting + extra verify work)

High acceptance (repetitive text, a good EAGLE head) → big win.
Low acceptance (creative text, weak drafter) → you wasted the draft; can be a loss.
Small batch / latency-bound → shines (the GPU has spare capacity to verify).
Large batch / already GPU-saturated → less benefit (no spare capacity; verifying drafts competes with real work).

The number that decides everything is acceptance rate × draft cost. You'll measure acceptance rate and effective tokens-per-run in lab-01.

Step 5: How it rides the same scheduler (no special case)

Recall Phase 3's mantra: a request is just num_computed_tokens racing num_tokens. Speculative decoding adds the draft tokens into that gap via num_tokens_with_spec, and reserves KV slots for them with num_lookahead_tokens. The scheduler doesn't know it's spec decode — it just schedules "some tokens to compute," exactly as the Phase 3 comment promised. Acceptance/rejection is sorted out afterward in update_from_output. Elegant: one general mechanism absorbs a whole feature.

The invariants to memorize

Drafter guesses k tokens cheaply; the big model verifies all in one run; keep the longest correct prefix + one correction.
Verification ≈ one prefill run (compute-bound) → checking k tokens costs ~one decode run.
Output is identical to normal decoding (greedy: accept only the argmax; sampling: rejection sampling preserves the distribution).
Speedup ∝ acceptance rate; it can lose at low acceptance or when the GPU is already full.
It rides the normal scheduler via num_tokens_with_spec + num_lookahead_tokens.

What you'll do

Read: 01-deep-dive.md — the n-gram proposer, EAGLE, the rejection sampler, and the scheduler hooks, line-anchored.
Build: 02-mini-build.md — an n-gram proposer + greedy verifier; measure acceptance and tokens-per-run.
Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
- lab-01-ngram-spec-decode [CPU-OK] — build draft+verify; prove output == baseline and measure the speedup on repetitive vs random text.
- lab-02-eagle-on-real-vllm [GPU-OPT] — enable EAGLE on real vLLM; measure ITL + acceptance (captured).
- lab-03-rejection-sampling [CPU-OK] — the losslessness theorem for sampling: accept with min(1, p/q), resample the residual; verify empirically that outputs are distributed exactly as p.
- lab-04-speedup-model [CPU-OK] — the (α, c, k) economics: expected tokens/cycle, speedup, optimal k — including the regime where the right answer is "turn it off".
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.