Phase 08 — Exercises: Speculative Decoding

Warm-up (explain)

  1. In one breath: how does speculative decoding produce several tokens from one big-model run?
  2. Why does verifying k drafted tokens cost about the same as one decode step? (Tie to prefill.)
  3. Why is the output identical to normal decoding (greedy case)?
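As a way to check your warm-up answers, here is a minimal toy sketch of the greedy draft→verify loop. The names `greedy_decode`, `speculative_decode`, `toy_target`, and `toy_draft` are hypothetical and the "models" are plain functions, so this is an illustration under those assumptions, not the engine's implementation. A real engine scores all k+1 positions in one forward pass; the inner loop below only emulates that pass.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (sequence, number of target runs)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    runs = 0
    while len(seq) < goal:
        # 1) Draft k tokens autoregressively with the cheap proposer.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2) One target "run" (emulated here by up to k+1 function calls).
        runs += 1
        for i in range(k + 1):
            expected = target(seq)  # what plain greedy decoding would emit here
            seq.append(expected)    # always the target's token, never the draft's
            if i == k or proposal[i] != expected:
                break               # bonus token after k accepts, or first mismatch
    del seq[goal:]                  # the final run may overshoot the requested length
    return seq, runs


def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 5        # periodic target: 0, 1, 2, 3, 4, 0, ...


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except after token 3, where it guesses wrong.
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 5


baseline = greedy_decode(toy_target, [0], 20)
spec, runs = speculative_decode(toy_target, toy_draft, [0], 20, k=4)
print(spec == baseline, runs)
```

Every appended token comes from the target, which is why the output cannot differ from plain greedy decoding; the draft only determines how many target tokens a single run yields.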

Core (trace the code)

  1. NgramProposer.propose (ngram_proposer.py:131) — what does it match on, and what does it return? Why is it great for code/summarization?
  2. In rejection_sample (rejection_sampler.py:392), state the accept probability and what happens on rejection. Why does this preserve the target distribution?
  3. In scheduler.py, how do num_lookahead_tokens and num_tokens_with_spec let spec decode ride the normal schedule with no special case?
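Before tracing the real code, it may help to have a toy version of the two ideas in hand. The sketch below is an assumption-laden illustration of the general techniques (suffix matching for n-gram proposals, and the standard accept-or-resample rule from the speculative sampling literature); the function names are hypothetical and it is not the code in ngram_proposer.py or rejection_sampler.py. In particular, which occurrence the real proposer prefers and how it batches the work are things to confirm in the source.

```python
import random
from typing import List, Sequence, Tuple


def ngram_propose(tokens: Sequence[int], n: int, k: int) -> List[int]:
    """Match the last n tokens against earlier context; return up to k tokens that followed.

    This toy version picks the most recent earlier match (an assumption).
    """
    if len(tokens) <= n:
        return []
    suffix = list(tokens[-n:])
    # Scan right to left over start positions that lie before the suffix itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if list(tokens[start:start + n]) == suffix:
            return list(tokens[start + n:start + n + k])
    return []  # no match: nothing to propose


def accept_or_resample(
    p: Sequence[float], q: Sequence[float], drafted: int, rng: random.Random
) -> Tuple[int, bool]:
    """p: target probabilities, q: draft probabilities, drafted: a token sampled from q."""
    # Accept the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted, True
    # On rejection, resample from the normalised residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    token = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return token, False


print(ngram_propose([1, 2, 3, 4, 1, 2], n=2, k=3))

# Empirical check that the emitted tokens follow p, not q.
rng = random.Random(0)
p, q = [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]
counts = [0, 0, 0]
trials = 50_000
for _ in range(trials):
    drafted = rng.choices(range(3), weights=q, k=1)[0]
    token, _accepted = accept_or_resample(p, q, drafted, rng)
    counts[token] += 1
freqs = [c / trials for c in counts]
print(freqs)
```

The empirical frequencies should land close to `p` even though every candidate was drawn from `q`, which is the property question 2 asks you to justify.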

Build (your lab)

  1. In lab-01, derive the expected tokens-per-run from acceptance rate a and draft length k (hint: it's 1 + the expected number of tokens accepted before the first rejection).
  2. Add a k sweep: plot tokens-per-run vs k on the periodic target. Why does it plateau?
  3. Construct an input where n-gram hurts (proposals never accepted): show runs == baseline and explain the wasted draft cost.
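To sanity-check your derivation and your k sweep, the sketch below compares a closed form against a simulation. It assumes each drafted token is accepted independently with probability a, which is a simplification (real acceptance is correlated across positions), and the function names are hypothetical rather than taken from lab-01.

```python
import random


def expected_tokens_per_run(a: float, k: int) -> float:
    """Closed form under i.i.d. acceptance: sum of a**i for i = 0..k."""
    if a == 1.0:
        return float(k + 1)
    return (1.0 - a ** (k + 1)) / (1.0 - a)


def simulate_tokens_per_run(a: float, k: int, runs: int, seed: int = 0) -> float:
    """Monte Carlo estimate: accept drafted tokens until the first rejection or k accepts."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        accepted = 0
        while accepted < k and rng.random() < a:
            accepted += 1
        total += accepted + 1  # +1 for the token the target always contributes
    return total / runs


# k sweep at a fixed acceptance rate; the values approach 1 / (1 - a).
for k in (1, 2, 4, 8, 16):
    print(k, round(expected_tokens_per_run(0.8, k), 3))
```

Under this assumption the curve flattens toward 1 / (1 - a) because the chance of reaching the later draft positions shrinks geometrically. On the periodic target in the lab, the plateau may have a different cause, so compare your plot against this shape rather than assuming they match.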

Design (staff-level)

  1. Given target step cost C_t, draft cost C_d, and acceptance a, write the condition for spec decode to be a net win. When does large batch flip it negative?
  2. A customer's workload is 70% code (repetitive) and 30% chat (creative). Would you enable spec decode globally, per-request, or adaptively? Justify.
  3. EAGLE vs n-gram: when would you pick each, and what does EAGLE need that n-gram doesn't?
  4. Spec decode interacts with the KV cache (drafts need slots) — what must the scheduler do on rejection, and what's the memory risk?
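For question 1, one way to make the win condition concrete is a small cost model. The sketch below is built on stated assumptions, not on measurements: acceptance is i.i.d. with probability a, one speculative step costs k draft calls plus one verification pass, and `verify_factor` is a hypothetical knob for how much more a (k+1)-token verification costs than a single decode step (about 1 when memory-bound, growing toward k+1 when compute-bound at large batch).

```python
def expected_tokens_per_run(a: float, k: int) -> float:
    """Expected tokens per target run under i.i.d. acceptance probability a."""
    if a == 1.0:
        return float(k + 1)
    return (1.0 - a ** (k + 1)) / (1.0 - a)


def speculative_speedup(
    c_t: float, c_d: float, a: float, k: int, verify_factor: float = 1.0
) -> float:
    """Ratio of baseline cost per token to speculative cost per token.

    Baseline: c_t per token.
    Speculative: (k * c_d + verify_factor * c_t) per run, yielding E[tokens] per run.
    A value above 1.0 means speculative decoding is a net win in this model.
    """
    cost_per_run = k * c_d + verify_factor * c_t
    return expected_tokens_per_run(a, k) * c_t / cost_per_run


# Memory-bound regime: verification costs about one decode step.
print(round(speculative_speedup(c_t=1.0, c_d=0.05, a=0.8, k=4), 3))
# Compute-bound regime (large batch): verification costs close to k + 1 steps.
print(round(speculative_speedup(c_t=1.0, c_d=0.05, a=0.8, k=4, verify_factor=5.0), 3))
# Creative text with no accepted drafts: the draft cost is pure overhead.
print(round(speculative_speedup(c_t=1.0, c_d=0.05, a=0.0, k=4), 3))
```

Setting `c_d` near zero approximates an n-gram proposer, while a learned drafter such as EAGLE would have a larger `c_d` and, one hopes, a higher a; plugging in both is a quick way to frame question 3.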

Self-grading

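Questions 4–6 and 10–13 (the Core and Design sections, counting all exercises consecutively from 1 to 13) are interview-grade. Could you whiteboard the draft→verify loop and the win condition? If not, re-read 01-deep-dive.md.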