Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 08 — Mini-Build: n-gram draft + greedy verify

You'll build the whole speculative loop with a free drafter (n-gram / prompt-lookup) and a greedy verifier, then measure the two numbers that decide everything: acceptance rate and tokens per big-model run. No GPU, no weights — just the mechanism.

Contents


The task (lab-01)

Model the big model as a deterministic target(context) -> token (so greedy is reproducible). Implement:

  • ngram_propose(context, n, k) → up to k proposed tokens: find the most recent earlier occurrence of context[-n:], propose the k tokens that followed it last time ([] if no match).
  • verify_greedy(context, proposed, target)(accepted, n_proposed_accepted): accept proposed[i] iff it equals target(context + accepted_so_far); stop at the first mismatch; then append the correction token target(context + accepted_so_far) (the big model's own choice). So you always emit at least 1 token per run.
  • run_speculative(context, target, n, k, num_tokens) → generate ≥num_tokens, returning the produced tokens plus counts: total proposed, total accepted, number of big-model runs.
  • run_baseline(context, target, num_tokens) → plain greedy: one token per run.

What you'll prove (the two headline properties)

  1. Identical output: run_speculative(...) == run_baseline(...) token-for-token. (You only ever accept the target's own greedy choice.) This is the correctness guarantee.
  2. Fewer runs when guesses are good: on a periodic sequence the n-gram drafter nails the pattern, so tokens_per_run = total_tokens / num_runs ≫ 1; on a random target it's ≈ 1 (no speedup). That's acceptance rate in action.

Definition of done

pytest phase-08-speculative-decoding/labs -q

Map to the real engine

your codereal vLLM
ngram_proposeNgramProposer.propose (ngram_proposer.py:131)
verify_greedygreedy path of RejectionSampler (rejection_sampler.py:87)
(sampling version)rejection_sample (rejection_sampler.py:392)
counts / acceptancespec_decode/metrics.py
"runs" = scheduler stepsnum_lookahead_tokens + num_tokens_with_spec (Phase 3)