Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Lab 08-01 — n-gram Draft + Greedy Verify [CPU-OK]

Speculative decoding starts from an indignity: the most powerful models on earth spend most of their decode steps predicting tokens a text search could have guessed — the closing half of a quoted phrase, the rest of a repeated identifier, boilerplate the prompt already contains. This lab builds the whole draft→verify mechanism end to end using exactly that text search as the drafter (n-gram "prompt lookup" — vLLM's method="ngram", the only drafter that needs no second model and costs nothing), and establishes the two facts every speculative method lives by: the output is token-for-token identical to plain decoding, and the speedup is entirely a function of how guessable the text is — dramatic on repetitive sequences, exactly zero on random ones. No GPU, no weights; the mechanism in its purest form.

Contents


Why this lab exists

Every speculative method — n-gram, EAGLE, Medusa, draft models, DFlash — is the same machine with a different drafter plugged in: propose k tokens cheaply, verify them with one target pass, keep the leading run that matches, always emit at least one token. Building that machine once, with the simplest possible drafter, separates the invariant structure (the verify loop, the correction token, the bonus token, the ≥1-token progress guarantee) from the pluggable part (where proposals come from). After this lab, every spec-decode paper you read is "lab-01 with a fancier propose," and every acceptance metric in vLLM's logs maps to a counter you maintained by hand.

The n-gram drafter is also worth knowing in its own right, not as a toy: it ships in production vLLM, it's free (no extra model, no GPU memory, no draft forward), and on the workloads where it shines — RAG (the answer quotes the context), code editing (the new code repeats the old), structured output — it's embarrassingly competitive with learned drafters. Free and sometimes great is a good tool to know exists; lab-04's economics make "free" precise (c = 0 ⇒ speculation can never lose).

Background: the asymmetry being arbitraged

Phase 0 lab-04: a decode step is bandwidth-bound — the GPU streams all weights to emit one token, using ~1% of its compute. But verifying k+1 candidate positions in one forward pass costs barely more than that single step (it's prefill-shaped work riding the same weight stream). So the target model can check k guesses for the price of making one. Speculation arbitrages this: any source of decent guesses converts idle compute into accepted tokens. The drafter here is the cheapest source imaginable — "find the last time the current n-gram appeared, propose what followed it" — your ngram_propose, a backwards string scan.

The verify rule (verify_greedy) is the part with invariants worth memorizing:

  • Accept proposal p_i iff it equals the target's own greedy choice given everything before it (computed left to right, each acceptance extending the context).
  • On the first mismatch, append the target's choice — the correction — and stop. That's why a cycle always emits ≥ 1 token: failure mode is baseline speed, never stall.
  • If all k accepted, the verify pass has already computed the distribution at the position after them — append that bonus token. k accepted = k+1 emitted, free.

Files

  • starter.pyngram_propose, verify_greedy, run_speculative, run_baseline. Your work.
  • solution.py — reference.
  • test_lab.py — output identity, the periodic-text speedup, the random-text non-speedup.

Run

LAB_IMPL=starter pytest phase-08-speculative-decoding/labs/lab-01-ngram-spec-decode -q
pytest phase-08-speculative-decoding/labs/lab-01-ngram-spec-decode -q   # reference

What to implement

Per 02-mini-build.md. Watch the two classic off-by-ones: the n-gram scan must find an earlier occurrence (excluding the pattern's own position at the very end — propose from yourself and you'll propose nothing useful forever), and the verify must compare proposal i against the target evaluated on context + accepted[:i] — against the evolving context, not the original one (compare against the frozen context and proposals 2+ are checked against the wrong distribution; outputs diverge from baseline and the identity test catches it).

What the tests prove

TestWhat it pins
output == baselineThe lossless guarantee, greedy form: you only ever keep the target's own choices, so the sequence cannot differ. (The sampled-temperature generalization is lab-03's theorem)
periodic text → runs ≪ num_tokensHigh acceptance on guessable text: many tokens per target pass — the entire value proposition, measured by your own counters
random text → runs == num_tokensZero acceptance → exactly baseline-many target runs. Speculation's failure mode is no speedup, not wrong output — the graceful-degradation property that makes it deployable

Hitchhiker's notes

  • Tokens-per-run is THE metric. (accepted + runs) / runs from your stats dict is "mean acceptance length + 1" in vLLM's spec-decode logs (lab-02 reads 2.8 / 5 from a real run). Everything else — ITL improvement, the lab-04 economics — derives from it. When evaluating any drafter, ask for this number on your workload before anything else.
  • Why scan latest-first? Recent context predicts continuation better than distant context (local repetition dominates). Upstream's NgramProposer.propose (ngram_proposer.py:131) does the same, with bounded lookback and n-gram sizes — read it after; it's your function with engineering around it.
  • The verify loop's "always ≥1 token" is load-bearing for the scheduler: a speculation cycle is, from Phase 3's perspective, just a decode step that may emit several tokens. The KV for accepted tokens was already written during the verify pass; rejected positions' KV must be discarded (in paged terms: the slots are simply overwritten next cycle — the counters from Phase 1 make rollback almost free). If you wondered why num_computed_tokens racing num_tokens was such a good idea — speculative decoding is one more feature that composes with it for free.
  • k is not free even when drafting is — each proposed-but-rejected token wastes a slot in the verify batch. With a free drafter the waste is mild (the verify pass was ~constant cost anyway); with a costly drafter it's lab-04's whole subject.

Going further

  • Track per-position acceptance (does proposal #1 accept more often than #4?) on periodic text with noise injected — you'll rediscover the geometric decay that lab-04's α^i models.
  • Implement the suffix-automaton upgrade: index all earlier occurrences instead of scanning (longest-match instead of fixed-n). Compare acceptance on text with mixed repetition lengths — this is the direction production prompt-lookup variants take.
  • Run your speculative loop with mini_vllm's deterministic ToyModel as the target (greedy) and verify the identity holds against LLMEngine.generate — wiring the lab into the course's engine, the way upstream wires NgramProposer into the runner.

References

  • upstream/vllm/v1/spec_decode/ngram_proposer.py:131NgramProposer.propose: your drafter, productionized.
  • upstream/vllm/v1/sample/rejection_sampler.py:87 — the greedy verify fast path.
  • Leviathan et al. (2022) — the original draft/verify formulation: https://arxiv.org/abs/2211.17192
  • Prompt Lookup Decoding (Saxena, 2023) — the n-gram drafter's origin as a standalone trick: https://github.com/apoorvumang/prompt-lookup-decoding
  • Labs 03 (the sampled-case theorem) and 04 (the economics) — this lab's two generalizations.