Phase 08 — Mini-Build: n-gram draft + greedy verify
You'll build the whole speculative loop with a free drafter (n-gram / prompt-lookup) and a greedy verifier, then measure the two numbers that decide everything: acceptance rate and tokens per big-model run. No GPU, no weights — just the mechanism.
Contents
- The task (lab-01)
- What you'll prove (the two headline properties)
- Definition of done
- Map to the real engine
The task (lab-01)
Model the big model as a deterministic target(context) -> token (so greedy is reproducible).
Implement:
ngram_propose(context, n, k)→ up tokproposed tokens: find the most recent earlier occurrence ofcontext[-n:], propose thektokens that followed it last time ([]if no match).verify_greedy(context, proposed, target)→(accepted, n_proposed_accepted): acceptproposed[i]iff it equalstarget(context + accepted_so_far); stop at the first mismatch; then append the correction tokentarget(context + accepted_so_far)(the big model's own choice). So you always emit at least 1 token per run.run_speculative(context, target, n, k, num_tokens)→ generate ≥num_tokens, returning the produced tokens plus counts: total proposed, total accepted, number of big-model runs.run_baseline(context, target, num_tokens)→ plain greedy: one token per run.
What you'll prove (the two headline properties)
- Identical output:
run_speculative(...) == run_baseline(...)token-for-token. (You only ever accept the target's own greedy choice.) This is the correctness guarantee. - Fewer runs when guesses are good: on a periodic sequence the n-gram drafter nails the
pattern, so
tokens_per_run = total_tokens / num_runs≫ 1; on a random target it's ≈ 1 (no speedup). That's acceptance rate in action.
Definition of done
pytest phase-08-speculative-decoding/labs -q
Map to the real engine
| your code | real vLLM |
|---|---|
ngram_propose | NgramProposer.propose (ngram_proposer.py:131) |
verify_greedy | greedy path of RejectionSampler (rejection_sampler.py:87) |
| (sampling version) | rejection_sample (rejection_sampler.py:392) |
| counts / acceptance | spec_decode/metrics.py |
| "runs" = scheduler steps | num_lookahead_tokens + num_tokens_with_spec (Phase 3) |