Phase 08 — Interview Questions: Speculative Decoding

Q1. How does speculative decoding speed up decode?

Model answer

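A cheap drafter proposes the next k tokens; the big model verifies all of them in one forward pass (a mini-prefill over k+1 positions, which amortizes a single read of the weights across every position and so costs about as much as one ordinary decode step) and keeps the longest accepted prefix plus one more token: its own correction at the first mismatch, or a bonus token if every draft is accepted. So one expensive pass yields between 1 and k+1 tokens instead of exactly one. The speedup is set by the number of tokens accepted per pass, which grows with acceptance rate and draft length, less the drafting and verification overhead.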
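The accept rule in this answer can be sketched for the greedy case as follows. This is a minimal illustration, not vLLM's code: `verify_greedy` and its arguments are hypothetical names, and `target_argmax` stands in for the k+1 greedy predictions the target produces in its single verification pass.

```python
def verify_greedy(draft: list[int], target_argmax: list[int]) -> list[int]:
    """Return the tokens emitted by one verification pass (greedy decoding).

    draft:         the k tokens proposed by the drafter.
    target_argmax: k+1 target predictions; target_argmax[i] is the target's
                   greedy choice given the context followed by draft[:i].
    """
    if len(target_argmax) != len(draft) + 1:
        raise ValueError("expected one target prediction per draft, plus one")

    emitted: list[int] = []
    for i, token in enumerate(draft):
        if token != target_argmax[i]:
            # First mismatch: discard the remaining drafts and emit the
            # target's own token as the correction.
            emitted.append(target_argmax[i])
            return emitted
        emitted.append(token)

    # Every draft matched, so the last target prediction is a bonus token.
    emitted.append(target_argmax[-1])
    return emitted


if __name__ == "__main__":
    print(verify_greedy([5, 6, 7], [5, 6, 9, 1]))  # two accepted + correction
    print(verify_greedy([5, 6, 7], [5, 6, 7, 8]))  # three accepted + bonus
```

Even when the very first draft is wrong, the pass still emits one token, so under these assumptions it never yields fewer tokens than a plain decode step.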

Q2. Why doesn't it change the model's output?

Model answer

Greedy: you only accept a drafted token if it equals the big model's argmax; on disagreement you discard the rest and use the big model's token, so the sequence is identical to plain greedy. Sampling: the rejection sampler accepts drafted token x with probability min(1, p_target(x)/p_draft(x)) and, on rejection, resamples from normalize(max(0, p_target − p_draft)); the math guarantees that every emitted token, accepted or resampled, follows the target's exact distribution. Speed changes, behavior doesn't.
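For the sampling case, the sketch below applies the accept/resample rule to a single position using plain Python lists, then computes the resulting output distribution in closed form so the exactness claim can be checked without sampling noise. All function names are illustrative assumptions, and a production sampler would operate on batched tensors rather than lists.

```python
import random


def residual_distribution(p_target: list[float], p_draft: list[float]) -> list[float]:
    """normalize(max(0, p_target - p_draft)): the resampling distribution."""
    raw = [max(0.0, t - d) for t, d in zip(p_target, p_draft)]
    total = sum(raw)
    if total == 0.0:
        # The two distributions are identical, so rejection never happens.
        return list(p_target)
    return [r / total for r in raw]


def accept_or_resample(draft_token: int, p_target: list[float],
                       p_draft: list[float], rng: random.Random) -> tuple[int, bool]:
    """Apply the rule to one drafted token; returns (token, was_accepted)."""
    # p_draft[draft_token] > 0 because the token was sampled from p_draft.
    accept_prob = min(1.0, p_target[draft_token] / p_draft[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    residual = residual_distribution(p_target, p_draft)
    return rng.choices(range(len(residual)), weights=residual)[0], False


def output_distribution(p_target: list[float], p_draft: list[float]) -> list[float]:
    """Exact distribution of the emitted token when the draft is sampled from p_draft."""
    residual = residual_distribution(p_target, p_draft)
    # P(draft x and accept x) = p_draft(x) * min(1, p_target(x) / p_draft(x))
    accepted = [d * min(1.0, t / d) if d > 0 else 0.0
                for t, d in zip(p_target, p_draft)]
    reject_mass = 1.0 - sum(accepted)
    return [a + reject_mass * r for a, r in zip(accepted, residual)]


if __name__ == "__main__":
    p_target = [0.5, 0.3, 0.2]
    p_draft = [0.2, 0.6, 0.2]
    # Matches p_target up to floating-point rounding.
    print(output_distribution(p_target, p_draft))
```

The accepted mass at each token is min(p_draft, p_target), and the rejected mass is redistributed in proportion to max(0, p_target − p_draft), so the two parts sum to p_target at every token.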

Q3. When is it a win, and when does it hurt?

Model answer

Win when accepted-tokens-per-run × target-step-cost exceeds the cost of drafting plus the extra verify work — i.e. high acceptance and a cheap drafter, in latency-bound (small-batch) regimes with spare GPU capacity. It can lose at low acceptance (creative text, weak drafter) or at large batch where the GPU is already saturated and verifying drafts steals capacity from real work.
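A rough break-even model makes this trade-off concrete. The sketch assumes each drafted token is accepted independently with probability `alpha` (a simplification, since real acceptance varies by position and content), and it measures costs relative to one plain target decode step. `draft_cost` and `verify_cost` are hypothetical parameters: `verify_cost` near 1.0 models a small-batch regime with spare GPU capacity, and larger values model a saturated GPU.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass with k drafts.

    Under independent acceptance the count is 1 + alpha + ... + alpha**k,
    which includes the correction or bonus token.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float,
                      verify_cost: float = 1.0) -> float:
    """Tokens per unit of cost, relative to plain decode (1 token per 1.0 cost)."""
    step_cost = k * draft_cost + verify_cost
    return expected_tokens_per_step(alpha, k) / step_cost


if __name__ == "__main__":
    # High acceptance, cheap drafter, spare capacity: about 2.8x.
    print(estimated_speedup(alpha=0.8, k=4, draft_cost=0.05))
    # Low acceptance on a saturated GPU: about 0.53x, a net loss.
    print(estimated_speedup(alpha=0.3, k=4, draft_cost=0.05, verify_cost=2.5))
```

Values above 1.0 indicate a win under this model. The numbers are only as good as the independence assumption and the cost ratios plugged in, so they are best read as a way to reason about the direction of the effect.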

Q4. What proposers exist and how do they differ?

Model answer

n-gram / prompt-lookup (free, copies a repeated phrase's continuation — great for code/structured text); EAGLE (a small trained head predicting the target's next hidden states — high acceptance on general text); Medusa (extra heads), DFlash, suffix decoding, and a separate small draft model. All plug into the same verify path; they trade drafter cost vs acceptance quality.
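The n-gram / prompt-lookup idea is simple enough to sketch directly. The function below is a hypothetical, unoptimized illustration rather than vLLM's proposer: it looks for the most recent earlier occurrence of the last `n` tokens and proposes the tokens that followed it.

```python
def propose_ngram(tokens: list[int], n: int = 3, k: int = 4) -> list[int]:
    """Propose up to k draft tokens by copying the continuation of a repeat.

    tokens: prompt plus everything generated so far.
    n:      length of the trailing n-gram to match.
    k:      maximum number of draft tokens to return.
    """
    if n <= 0 or len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan right to left so the most recent earlier match wins; the range
    # excludes the position of the trailing n-gram itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + k]
    return []  # no repeat found, so there is nothing to propose


if __name__ == "__main__":
    history = [1, 2, 3, 4, 5, 9, 1, 2, 3]
    print(propose_ngram(history, n=3, k=2))  # [4, 5]
```

No model is run to produce the drafts, which is why this proposer is effectively free; its acceptance depends entirely on how repetitive the text is.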

Q5. How does spec decode fit vLLM's scheduler without a special case?

Model answer

A request's num_tokens_with_spec includes the draft tokens, so the standard num_new_tokens clamp schedules them; num_lookahead_tokens reserves KV slots for them. After the run, the rejection sampler decides accept/reject and update_from_output keeps the accepted prefix and rolls back the rest. The scheduler just sees "a few more tokens in the gap" — the Phase 3 abstraction absorbs the whole feature.
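The bookkeeping in this answer can be mimicked with a toy model. Everything below is a hypothetical sketch, not vLLM's implementation: `ToyRequest` and `schedule` are invented for illustration, the field and method names only echo the ones mentioned in the answer, the token budget is assumed to cover every draft, and KV slot reservation via num_lookahead_tokens is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    token_ids: list[int]                      # prompt + tokens emitted so far
    spec_token_ids: list[int] = field(default_factory=list)  # pending drafts
    num_computed_tokens: int = 0              # tokens already run through the model

    @property
    def num_tokens_with_spec(self) -> int:
        return len(self.token_ids) + len(self.spec_token_ids)


def schedule(req: ToyRequest, token_budget: int) -> int:
    """The usual clamp: schedule whatever lies in the gap, drafts included."""
    num_new_tokens = min(req.num_tokens_with_spec - req.num_computed_tokens,
                         token_budget)
    req.num_computed_tokens += num_new_tokens
    return num_new_tokens


def update_from_output(req: ToyRequest, generated: list[int]) -> None:
    """Keep the accepted prefix and roll back the rejected drafts.

    generated: accepted drafts followed by one correction or bonus token.
    """
    num_rejected = len(req.spec_token_ids) + 1 - len(generated)
    req.num_computed_tokens -= num_rejected   # rejected positions are discarded
    req.token_ids.extend(generated)
    req.spec_token_ids = []


if __name__ == "__main__":
    # Mid-decode: 10 tokens known, the last not yet computed, 3 drafts pending.
    req = ToyRequest(token_ids=list(range(10)),
                     spec_token_ids=[100, 101, 102],
                     num_computed_tokens=9)
    print(schedule(req, token_budget=64))      # 4: last real token + 3 drafts
    update_from_output(req, [100, 555])        # 1 draft accepted + correction
    print(len(req.token_ids), req.num_computed_tokens)  # 12 11
```

After the rollback the request is back in the ordinary decode shape, with exactly one emitted token still waiting to be computed, which is the sense in which the scheduler needs no special case.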

Rapid-fire

  • Verify cost ≈ ? one target forward pass over the k drafts plus one position (a mini-prefill against the cached context).
  • Output change? none (greedy: argmax-only accept; sampling: rejection sampling).
  • Deciding metric? acceptance rate.
  • Free proposer? n-gram / prompt-lookup. Best trained one (today)? EAGLE.
  • Scheduler hooks? num_tokens_with_spec, num_lookahead_tokens.