Lab 08-01 — n-gram Draft + Greedy Verify [CPU-OK]
Speculative decoding starts from an indignity: the most powerful models on earth spend
most of their decode steps predicting tokens a text search could have guessed — the
closing half of a quoted phrase, the rest of a repeated identifier, boilerplate the
prompt already contains. This lab builds the whole draft→verify mechanism end to end
using exactly that text search as the drafter (n-gram "prompt lookup" — vLLM's
method="ngram", the only drafter that needs no second model and costs nothing), and
establishes the two facts every speculative method lives by: the output is
token-for-token identical to plain decoding, and the speedup is entirely a
function of how guessable the text is — dramatic on repetitive sequences, exactly zero
on random ones. No GPU, no weights; the mechanism in its purest form.
Contents
- Why this lab exists
- Background: the asymmetry being arbitraged
- Files
- Run
- What to implement
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Every speculative method — n-gram, EAGLE, Medusa, draft models, DFlash — is the same
machine with a different drafter plugged in: propose k tokens cheaply, verify them with
one target pass, keep the leading run that matches, always emit at least one token.
Building that machine once, with the simplest possible drafter, separates the invariant
structure (the verify loop, the correction token, the bonus token, the ≥1-token progress
guarantee) from the pluggable part (where proposals come from). After this lab, every
spec-decode paper you read is "lab-01 with a fancier propose," and every acceptance
metric in vLLM's logs maps to a counter you maintained by hand.
The n-gram drafter is also worth knowing in its own right, not as a toy: it ships in
production vLLM, it's free (no extra model, no GPU memory, no draft forward), and on
the workloads where it shines — RAG (the answer quotes the context), code editing (the
new code repeats the old), structured output — it's embarrassingly competitive with
learned drafters. Free and sometimes great is a good tool to know exists; lab-04's
economics make "free" precise (c = 0 ⇒ speculation can never lose).
Background: the asymmetry being arbitraged
Phase 0 lab-04: a decode step is bandwidth-bound — the GPU streams all weights to emit
one token, using ~1% of its compute. But verifying k+1 candidate positions in one
forward pass costs barely more than that single step (it's prefill-shaped work riding
the same weight stream). So the target model can check k guesses for the price of
making one. Speculation arbitrages this: any source of decent guesses converts idle
compute into accepted tokens. The drafter here is the cheapest source imaginable —
"find the last time the current n-gram appeared, propose what followed it" — your
ngram_propose, a backwards string scan.
The verify rule (verify_greedy) is the part with invariants worth memorizing:
- Accept proposal
p_iiff it equals the target's own greedy choice given everything before it (computed left to right, each acceptance extending the context). - On the first mismatch, append the target's choice — the correction — and stop. That's why a cycle always emits ≥ 1 token: failure mode is baseline speed, never stall.
- If all k accepted, the verify pass has already computed the distribution at the position after them — append that bonus token. k accepted = k+1 emitted, free.
Files
starter.py—ngram_propose,verify_greedy,run_speculative,run_baseline. Your work.solution.py— reference.test_lab.py— output identity, the periodic-text speedup, the random-text non-speedup.
Run
LAB_IMPL=starter pytest phase-08-speculative-decoding/labs/lab-01-ngram-spec-decode -q
pytest phase-08-speculative-decoding/labs/lab-01-ngram-spec-decode -q # reference
What to implement
Per 02-mini-build.md. Watch the two classic off-by-ones: the
n-gram scan must find an earlier occurrence (excluding the pattern's own position at
the very end — propose from yourself and you'll propose nothing useful forever), and the
verify must compare proposal i against the target evaluated on
context + accepted[:i] — against the evolving context, not the original one (compare
against the frozen context and proposals 2+ are checked against the wrong distribution;
outputs diverge from baseline and the identity test catches it).
What the tests prove
| Test | What it pins |
|---|---|
| output == baseline | The lossless guarantee, greedy form: you only ever keep the target's own choices, so the sequence cannot differ. (The sampled-temperature generalization is lab-03's theorem) |
periodic text → runs ≪ num_tokens | High acceptance on guessable text: many tokens per target pass — the entire value proposition, measured by your own counters |
random text → runs == num_tokens | Zero acceptance → exactly baseline-many target runs. Speculation's failure mode is no speedup, not wrong output — the graceful-degradation property that makes it deployable |
Hitchhiker's notes
- Tokens-per-run is THE metric.
(accepted + runs) / runsfrom your stats dict is "mean acceptance length + 1" in vLLM's spec-decode logs (lab-02 reads2.8 / 5from a real run). Everything else — ITL improvement, the lab-04 economics — derives from it. When evaluating any drafter, ask for this number on your workload before anything else. - Why scan latest-first? Recent context predicts continuation better than distant
context (local repetition dominates). Upstream's
NgramProposer.propose(ngram_proposer.py:131) does the same, with bounded lookback and n-gram sizes — read it after; it's your function with engineering around it. - The verify loop's "always ≥1 token" is load-bearing for the scheduler: a
speculation cycle is, from Phase 3's perspective, just a decode step that may emit
several tokens. The KV for accepted tokens was already written during the verify
pass; rejected positions' KV must be discarded (in paged terms: the slots are simply
overwritten next cycle — the counters from Phase 1 make rollback almost free). If
you wondered why
num_computed_tokensracingnum_tokenswas such a good idea — speculative decoding is one more feature that composes with it for free. - k is not free even when drafting is — each proposed-but-rejected token wastes a slot in the verify batch. With a free drafter the waste is mild (the verify pass was ~constant cost anyway); with a costly drafter it's lab-04's whole subject.
Going further
- Track per-position acceptance (does proposal #1 accept more often than #4?) on
periodic text with noise injected — you'll rediscover the geometric decay that
lab-04's
α^imodels. - Implement the suffix-automaton upgrade: index all earlier occurrences instead of scanning (longest-match instead of fixed-n). Compare acceptance on text with mixed repetition lengths — this is the direction production prompt-lookup variants take.
- Run your speculative loop with
mini_vllm's deterministicToyModelas the target (greedy) and verify the identity holds againstLLMEngine.generate— wiring the lab into the course's engine, the way upstream wiresNgramProposerinto the runner.
References
upstream/vllm/v1/spec_decode/ngram_proposer.py:131—NgramProposer.propose: your drafter, productionized.upstream/vllm/v1/sample/rejection_sampler.py:87— the greedy verify fast path.- Leviathan et al. (2022) — the original draft/verify formulation: https://arxiv.org/abs/2211.17192
- Prompt Lookup Decoding (Saxena, 2023) — the n-gram drafter's origin as a standalone trick: https://github.com/apoorvumang/prompt-lookup-decoding
- Labs 03 (the sampled-case theorem) and 04 (the economics) — this lab's two generalizations.