Phase 08 — Exercises: Speculative Decoding
Warm-up (explain)
- In one breath: how does speculative decoding produce several tokens from one big-model run?
- Why does verifying k drafted tokens cost about the same as one decode step? (Tie to prefill.)
- Why is the output identical to normal decoding (greedy case)?
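To check your answers to the warm-up, the following toy sketch may help. It is not taken from any real engine: `toy_target`, `flaky_draft`, `greedy_decode`, and `speculative_greedy` are hypothetical names, and the "models" are plain functions from a token list to the next token. It assumes greedy decoding only, and it imitates the single batched verification pass with a Python loop.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def toy_target(ctx: List[int]) -> int:
    # Hypothetical target model: counts upward modulo 5.
    return (ctx[-1] + 1) % 5


def flaky_draft(ctx: List[int]) -> int:
    # Hypothetical draft model: agrees with the target except at every third context length.
    return toy_target(ctx) if len(ctx) % 3 else 0


def greedy_decode(target: NextToken, prompt: List[int], n: int) -> List[int]:
    # Baseline: one target call per generated token.
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out[len(prompt):]


def speculative_greedy(
    target: NextToken, draft: NextToken, prompt: List[int], n: int, k: int
) -> Tuple[List[int], int]:
    out = list(prompt)
    runs = 0  # number of target verification passes
    while len(out) - len(prompt) < n:
        # Draft proposes k tokens autoregressively.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # One verification pass. A real engine would score all k + 1 positions
        # in a single batched forward pass; this loop only imitates that.
        runs += 1
        for tok in proposal:
            expected = target(out)
            out.append(expected)  # always emit the target's own token
            if expected != tok:
                break  # first mismatch: discard the rest of the draft
        else:
            out.append(target(out))  # all k accepted: one bonus token
    return out[len(prompt):len(prompt) + n], runs
```

Because every emitted token is the target's own argmax for its prefix, the result matches `greedy_decode` regardless of draft quality; only `runs` changes.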
Core (trace the code)
- `NgramProposer.propose` (`ngram_proposer.py:131`): what does it match on, and what does it return? Why is it great for code/summarization?
- In `rejection_sample` (`rejection_sampler.py:392`), state the accept probability and what happens on rejection. Why does this preserve the target distribution?
- In `scheduler.py`, how do `num_lookahead_tokens` and `num_tokens_with_spec` let spec decode ride the normal schedule with no special case?
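For the rejection-sampling question, a small standalone experiment can confirm your reasoning empirically. The sketch below is not the code at `rejection_sampler.py:392`; it is the textbook single-token rule written with the standard library, and `speculative_sample_one` and `empirical` are hypothetical helpers. It assumes `p` (target) and `q` (draft) are explicit probability lists over the same small vocabulary.

```python
import random
from typing import List, Tuple


def speculative_sample_one(
    p: List[float], q: List[float], rng: random.Random
) -> Tuple[int, bool]:
    """Return (token, accepted) for one drafted position."""
    vocab = range(len(p))
    # Draft model proposes a token from q, so q[x] > 0 here.
    x = rng.choices(vocab, weights=q)[0]
    # Accept with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the normalized residual max(0, p - q).
    # rng.choices normalizes the weights for us.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0], False


def empirical(
    p: List[float], q: List[float], trials: int, seed: int = 0
) -> Tuple[List[float], float]:
    """Empirical output distribution and acceptance rate over many trials."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    accepted = 0
    for _ in range(trials):
        tok, ok = speculative_sample_one(p, q, rng)
        counts[tok] += 1
        accepted += ok
    return [c / trials for c in counts], accepted / trials
```

Try it with a draft distribution that is deliberately far from the target, for example `empirical([0.6, 0.3, 0.1], [0.2, 0.5, 0.3], 100_000)`, and compare the returned frequencies against `p` rather than `q`.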
Build (your lab)
- In lab-01, derive expected tokens-per-run from acceptance rate `a` and draft length `k` (hint: it's `1 + (accepted before first reject)`).
- Add a `k` sweep: plot tokens-per-run vs `k` on the periodic target. Why does it plateau?
- Construct an input where n-gram hurts (proposals never accepted): show runs == baseline and explain the wasted draft cost.
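If you want a numeric check for your derivation before wiring the sweep into lab-01, a Monte Carlo sketch like the one below can serve as a starting point. It is independent of the lab code: `tokens_per_run` and `sweep` are hypothetical names, and it assumes each drafted token is accepted independently with the same probability `a`, which real workloads only approximate.

```python
import random
from typing import Dict, Iterable


def tokens_per_run(a: float, k: int, rng: random.Random) -> int:
    # Count drafted tokens accepted before the first rejection (at most k),
    # then add the one token the target run always contributes.
    accepted = 0
    while accepted < k and rng.random() < a:
        accepted += 1
    return 1 + accepted


def sweep(
    a: float, ks: Iterable[int], trials: int = 20000, seed: int = 0
) -> Dict[int, float]:
    # Average tokens-per-run for each draft length k.
    rng = random.Random(seed)
    return {
        k: sum(tokens_per_run(a, k, rng) for _ in range(trials)) / trials
        for k in ks
    }


if __name__ == "__main__":
    for k, mean in sweep(0.7, [1, 2, 4, 8, 16, 32]).items():
        print(f"k={k:2d}  tokens/run={mean:.3f}")
```

Compare the printed means against your closed form, and note the two sanity limits: `a = 0` should give exactly 1 token per run, and `a = 1` should give `k + 1`.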
Design (staff-level)
- Given target step cost `C_t`, draft cost `C_d`, and acceptance `a`, write the condition for spec decode to be a net win. When does large batch flip it negative?
- A customer's workload is 70% code (repetitive) and 30% chat (creative). Would you enable spec decode globally, per-request, or adaptively? Justify.
- EAGLE vs n-gram: when would you pick each, and what does EAGLE need that n-gram doesn't?
- Spec decode interacts with the KV cache (drafts need slots) — what must the scheduler do on rejection, and what's the memory risk?
Self-grading
Exercises 4–6 (Core) and 10–13 (Design), counting from the top, are interview-grade. Could you whiteboard draft→verify and the win condition? If not, re-read 01-deep-dive.md.