Phase 08 — Mini-Build: n-gram draft + greedy verify

You'll build the whole speculative loop with a free drafter (n-gram / prompt-lookup) and a greedy verifier, then measure the two numbers that decide everything: acceptance rate and tokens per big-model run. No GPU, no weights — just the mechanism.

The task (lab-01)
What you'll prove (the two headline properties)
Definition of done
Map to the real engine

The task (lab-01)

Model the big model as a deterministic target(context) -> token (so greedy is reproducible). Implement:

ngram_propose(context, n, k) → up to k proposed tokens: find the most recent earlier occurrence of context[-n:], propose the k tokens that followed it last time ([] if no match).
verify_greedy(context, proposed, target) → (accepted, n_proposed_accepted): accept proposed[i] iff it equals target(context + accepted_so_far); stop at the first mismatch; then append the correction token target(context + accepted_so_far) (the big model's own choice). So you always emit at least 1 token per run.
run_speculative(context, target, n, k, num_tokens) → generate ≥num_tokens, returning the produced tokens plus counts: total proposed, total accepted, number of big-model runs.
run_baseline(context, target, num_tokens) → plain greedy: one token per run.

What you'll prove (the two headline properties)

Identical output: run_speculative(...) == run_baseline(...) token-for-token. (You only ever accept the target's own greedy choice.) This is the correctness guarantee.
Fewer runs when guesses are good: on a periodic sequence the n-gram drafter nails the pattern, so tokens_per_run = total_tokens / num_runs ≫ 1; on a random target it's ≈ 1 (no speedup). That's acceptance rate in action.

Definition of done

pytest phase-08-speculative-decoding/labs -q

Map to the real engine

your code	real vLLM
`ngram_propose`	`NgramProposer.propose` (`ngram_proposer.py:131`)
`verify_greedy`	greedy path of `RejectionSampler` (`rejection_sampler.py:87`)
(sampling version)	`rejection_sample` (`rejection_sampler.py:392`)
counts / acceptance	`spec_decode/metrics.py`
"runs" = scheduler steps	`num_lookahead_tokens` + `num_tokens_with_spec` (Phase 3)

vLLM Mastery — From Zero to Maintainer

Phase 08 — Mini-Build: n-gram draft + greedy verify

Contents

The task (lab-01)

What you'll prove (the two headline properties)

Definition of done

Map to the real engine

Keyboard shortcuts

vLLM Mastery — From Zero to Maintainer

Phase 08 — Mini-Build: n-gram draft + greedy verify

Contents

The task (lab-01)

What you'll prove (the two headline properties)

Definition of done

Map to the real engine