Phase 08 — Exercises: Speculative Decoding

Warm-up (explain)

1. In one breath: how does speculative decoding produce several tokens from one big-model run?
2. Why does verifying k drafted tokens cost about the same as one decode step? (Tie to prefill.)
3. Why is the output identical to normal decoding (greedy case)?
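To make the draft→verify loop concrete before answering, here is a minimal toy sketch of the greedy case. Everything in it is an assumption for illustration: `target_next` and `draft_next` are made-up deterministic stand-ins for the large and small models, and the list comprehension that scores the drafted positions only simulates what a real model would do in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextFn = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Toy 'large model' (hypothetical): counts 0,1,2,3,4,0,1,..."""
    return (ctx[-1] + 1) % 5


def draft_next(ctx: List[int]) -> int:
    """Toy 'draft model' (hypothetical): agrees with the target except after a 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 5


def greedy_decode(next_fn: NextFn, prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_fn(out))
    return out


def speculative_decode(
    target: NextFn, draft: NextFn, prompt: List[int], n_new: int, k: int
) -> Tuple[List[int], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    runs = 0
    while len(out) < goal:
        # 1) Draft k tokens autoregressively with the cheap model.
        drafted: List[int] = []
        for _ in range(k):
            drafted.append(draft(out + drafted))
        # 2) One target "run": greedy prediction at each drafted position,
        #    plus one extra position after the last draft.
        runs += 1
        preds = [target(out + drafted[:i]) for i in range(k + 1)]
        # 3) Accept the longest prefix where draft and target agree.
        accepted = 0
        while accepted < k and drafted[accepted] == preds[accepted]:
            accepted += 1
        out.extend(drafted[:accepted])
        # 4) Always emit one target token: the correction on a mismatch,
        #    or a bonus token when every draft was accepted.
        out.append(preds[accepted])
    return out[:goal], runs


baseline = greedy_decode(target_next, [0], 12)
spec_out, spec_runs = speculative_decode(target_next, draft_next, [0], 12, k=3)
print(spec_out == baseline, spec_runs)
```

Every emitted token is either a draft the target agreed with or the target's own prediction, which is worth keeping in mind for question 3.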

4. NgramProposer.propose (ngram_proposer.py:131): what does it match on, and what does it return? Why does it work well for code and summarization?
5. In rejection_sample (rejection_sampler.py:392), state the acceptance probability and what happens on rejection. Why does this preserve the target distribution?
6. In scheduler.py, how do num_lookahead_tokens and num_tokens_with_spec let spec decode ride the normal scheduling path with no special case?
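For the rejection-sampling question, it can help to check a candidate rule numerically before reading the source. The sketch below assumes the standard speculative-sampling rule from the literature (accept a drafted token x with probability min(1, p(x)/q(x)); on rejection, resample from the normalized positive part of p - q). That is an assumption about the scheme, not a transcription of rejection_sampler.py, and the function name is made up. It computes the exact distribution of the emitted token with rational arithmetic.

```python
from fractions import Fraction as F
from typing import List


def emitted_distribution(p: List[F], q: List[F]) -> List[F]:
    """Exact distribution of the token emitted at one drafted position.

    p: target-model probabilities, q: draft-model probabilities.
    Assumed rule: accept x ~ q with prob min(1, p[x]/q[x]); otherwise
    resample from max(0, p - q) renormalized.
    """
    vocab = range(len(p))
    accept = [min(F(1), p[x] / q[x]) if q[x] > 0 else F(0) for x in vocab]
    # Probability that the drafted token is x AND it is accepted.
    via_accept = [q[x] * accept[x] for x in vocab]
    p_reject = 1 - sum(via_accept)
    # Residual distribution used after a rejection.
    residual = [max(F(0), p[x] - q[x]) for x in vocab]
    z = sum(residual)
    return [
        via_accept[x] + (p_reject * residual[x] / z if z > 0 else F(0))
        for x in vocab
    ]


p = [F(1, 2), F(1, 3), F(1, 6)]  # toy target distribution
q = [F(1, 6), F(1, 3), F(1, 2)]  # toy draft distribution
print(emitted_distribution(p, q) == p)
```

Trying a few other (p, q) pairs is a quick way to convince yourself the equality is not a coincidence, and then to explain why.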

7. In lab-01, derive the expected tokens-per-run from acceptance rate a and draft length k (hint: it is 1 + the number of drafted tokens accepted before the first rejection).
8. Add a k sweep: plot tokens-per-run vs k on the periodic target. Why does it plateau?
9. Construct an input where n-gram hurts (proposals are never accepted): show that runs == baseline and explain the wasted draft cost.
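One way to check a derivation for the expected tokens-per-run is to compare it against a small Monte Carlo estimate. The sketch below is independent of lab-01 and rests on a simplifying assumption: each drafted token is accepted independently with probability a, and a run stops at the first rejection. The function name and defaults are hypothetical.

```python
import random


def simulate_tokens_per_run(a: float, k: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of tokens emitted per target run.

    Assumes i.i.d. acceptance with probability `a` per drafted token.
    Each run emits the accepted prefix plus one token from the target.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        accepted = 0
        while accepted < k and rng.random() < a:
            accepted += 1
        total += 1 + accepted
    return total / trials


# Sweep k at a fixed acceptance rate and compare with your closed form.
for k in (1, 2, 4, 8, 16):
    print(k, round(simulate_tokens_per_run(0.8, k), 3))
```

The same sweep gives the data points for the plot in the k-sweep exercise if you feed it the acceptance rate you measure in the lab.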

10. Given target step cost C_t, draft cost C_d, and acceptance rate a, write the condition for spec decode to be a net win. When does a large batch flip it negative?
11. A customer's workload is 70% code (repetitive) and 30% chat (creative). Would you enable spec decode globally, per-request, or adaptively? Justify.
12. EAGLE vs n-gram: when would you pick each, and what does EAGLE need that n-gram doesn't?
13. Spec decode interacts with the KV cache (drafted tokens need slots). What must the scheduler do on rejection, and what is the memory risk?
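For the win-condition question, a small calculator makes it easier to plug in numbers and see where the sign flips. The cost model here is an assumption, not something taken from the course material: one speculative step costs k draft calls plus one verify pass, and the verify pass costs the same as a normal target step unless you pass a different `c_verify` (for example, to model a compute-bound large batch). All names are hypothetical.

```python
from typing import Optional


def spec_speedup(
    tokens_per_run: float,
    k: int,
    c_target: float,
    c_draft: float,
    c_verify: Optional[float] = None,
) -> float:
    """Per-token speedup of spec decode over plain decoding (> 1 is a win).

    Assumed cost model: baseline pays c_target per token; a speculative
    step pays k * c_draft + c_verify and yields tokens_per_run tokens.
    """
    if c_verify is None:
        c_verify = c_target  # memory-bound regime: verify ~ one decode step
    spec_cost_per_token = (k * c_draft + c_verify) / tokens_per_run
    return c_target / spec_cost_per_token


# Small batch: verifying k tokens costs about one target step.
print(spec_speedup(tokens_per_run=3.0, k=4, c_target=1.0, c_draft=0.05))
# Large batch (assumed): verify cost grows with the number of scored tokens.
print(spec_speedup(tokens_per_run=3.0, k=4, c_target=1.0, c_draft=0.05, c_verify=4.0))
```

Feeding in tokens-per-run as a function of a and k turns this into the inequality the question asks for.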

Questions 4–6 and 10–13 are interview-grade. Could you whiteboard the draft→verify loop and the win condition? If not, re-read 01-deep-dive.md.