Phase 08 — Deep Dive: speculative decoding in real vLLM
Paths relative to upstream/ at v0.22.1 @ 0decac0.
- vllm/v1/spec_decode/ngram_proposer.py: n-gram / prompt-lookup proposer (no model; read first)
- vllm/v1/spec_decode/eagle.py: EAGLE proposer (a tiny trained head)
- vllm/v1/spec_decode/{medusa,dflash,suffix_decoding,draft_model}.py: other proposers
- vllm/v1/spec_decode/metadata.py: spec metadata passed around
- vllm/v1/sample/rejection_sampler.py: verification that preserves the distribution
- vllm/v1/core/sched/scheduler.py: spec_token_ids / num_lookahead_tokens (the hook)
Contents
- 1. The simplest proposer: n-gram (ngram_proposer.py)
- 2. A trained proposer: EAGLE (eagle.py)
- 3. Verification that preserves the distribution: the rejection sampler
- 4. How it rides the scheduler (the elegant part)
- 5. Metrics
- Reading checklist
1. The simplest proposer: n-gram (ngram_proposer.py)
class NgramProposer (:12), propose (:131). The idea (prompt-lookup): take the last n
tokens of the sequence, search earlier in the same sequence for a previous occurrence, and if found,
propose the k tokens that followed it last time. No model, no weights: pure token-sequence matching, yet
it crushes repetitive workloads (code, JSON, summarization where the answer quotes the source).
Read propose and notice it returns up to k candidate token ids. Your lab-01
ngram_propose is this exact algorithm.
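The prompt-lookup idea above can be sketched in a few lines. This is a minimal illustration, not vLLM's code: the function name `ngram_propose_sketch` is hypothetical, it uses a single fixed n, and it assumes the most recent earlier occurrence wins. The real `propose` may choose and search differently.

```python
from typing import List


def ngram_propose_sketch(tokens: List[int], n: int, k: int) -> List[int]:
    """Return up to k draft tokens by matching the last n tokens against
    an earlier occurrence in the same sequence (empty list if no match)."""
    if n <= 0 or k <= 0 or len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan right to left so the most recent earlier occurrence wins
    # (an assumption of this sketch). The last start position considered
    # is len(tokens) - n - 1, so the suffix never matches itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            # Propose whatever followed the match, capped at k tokens.
            return tokens[start + n:start + n + k]
    return []


# The sequence ends in [1, 2]; the earlier [1, 2] was followed by 3, 4, 1.
print(ngram_propose_sketch([1, 2, 3, 4, 1, 2], n=2, k=3))
```

Note that the proposal can be shorter than k (the match may sit near the end of the sequence) or empty, which is why the text says "up to k".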
2. A trained proposer: EAGLE (eagle.py)
class EagleProposer(SpecDecodeBaseProposer) (:10). EAGLE runs a small network that predicts
the target model's next hidden states (not just tokens), which it then turns into high-quality
draft tokens — far better acceptance than n-gram on general text, for a small extra cost. It shares
the target's KV/hidden states (note extract_hidden_states.py). Medusa (medusa.py), DFlash
(dflash.py), suffix decoding, and a plain small draft model are siblings — all implement "produce
k cheap, plausible next tokens." They plug into the same verify path.
3. Verification that preserves the distribution: the rejection sampler
vllm/v1/sample/rejection_sampler.py: class RejectionSampler (:37), forward (:87),
rejection_sample (:392). For greedy it's trivial (accept a draft token iff it equals the
target's argmax). For sampling, rejection_sample implements the speculative-sampling rule:
accept draft token i with probability min(1, p_target(i) / p_draft(i)); on rejection, resample
from the adjusted distribution normalize(max(0, p_target − p_draft)). The math guarantees the
emitted tokens (accepted drafts plus any resampled token) are distributed exactly as if the target
had sampled directly: the proof of "speed, not behavior." Skim the function and find the accept test
and the resample-on-reject branch.
4. How it rides the scheduler (the elegant part)
Open vllm/v1/core/sched/scheduler.py and search spec_token_ids and num_lookahead_tokens
(around the running-request loop, ~:447/:502). What you'll see:
- num_lookahead_tokens is passed to allocate_slots so KV space is reserved for the draft tokens (Phase 2).
- a request's num_tokens_with_spec (request.py:243) includes the draft tokens, so the same num_new_tokens = num_tokens_with_spec − num_computed_tokens clamp (Phase 3) naturally schedules them to be verified.
- after the model runs, update_from_output consults the rejection sampler's result, keeps the accepted prefix, and rolls back the rest (un-computes rejected tokens' KV).
So spec decode is not a special path in the scheduler — it's "a few extra tokens in the gap," exactly as Phase 3's top-of-function comment promised. That's the design lesson: a good abstraction ("close the num_computed→num_tokens gap") absorbs a whole feature for free.
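The "extra tokens in the gap" bookkeeping can be sketched with a toy request. Everything here is a stand-in for illustration: `ToyRequest`, `tokens_to_schedule`, and `apply_verification` are hypothetical names, and the only things borrowed from the text are the field names and the num_tokens_with_spec − num_computed_tokens clamp. It assumes verification emits the accepted drafts plus one more token.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request; not vLLM's Request class."""
    token_ids: List[int]
    num_computed_tokens: int = 0
    spec_token_ids: List[int] = field(default_factory=list)

    @property
    def num_tokens_with_spec(self) -> int:
        # Real tokens plus the drafts waiting to be verified.
        return len(self.token_ids) + len(self.spec_token_ids)


def tokens_to_schedule(req: ToyRequest, token_budget: int) -> int:
    """The same gap clamp serves prefill, plain decode, and spec decode."""
    gap = req.num_tokens_with_spec - req.num_computed_tokens
    return max(0, min(gap, token_budget))


def apply_verification(req: ToyRequest, num_scheduled: int,
                       generated: List[int]) -> None:
    """generated = accepted drafts + one token from the verifier
    (an assumption of this sketch)."""
    num_drafts = len(req.spec_token_ids)
    req.num_computed_tokens += num_scheduled      # optimistic advance
    num_accepted = len(generated) - 1
    num_rejected = num_drafts - num_accepted
    req.num_computed_tokens -= num_rejected       # roll back rejected drafts
    req.token_ids.extend(generated)
    req.spec_token_ids = []


req = ToyRequest(token_ids=list(range(10)), num_computed_tokens=9,
                 spec_token_ids=[100, 101, 102])
n = tokens_to_schedule(req, token_budget=64)      # 1 real + 3 drafts
apply_verification(req, n, generated=[100, 7])    # 1 draft accepted
print(n, req.num_computed_tokens, len(req.token_ids))
```

After the step, the gap is back to a single token, so the next call to `tokens_to_schedule` behaves like ordinary decode until new drafts are attached.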
5. Metrics
spec_decode/metrics.py tracks acceptance rate and accepted-tokens-per-step — the numbers that
tell you whether spec decode is paying off (Step 4 of the guide). In production you watch these to
decide whether to keep it on for a given workload.
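As a rough picture of what such counters look like, here is a minimal sketch. The class `SpecDecodeCounters` and its field names are hypothetical, and the definitions used (accepted / drafted, and accepted / steps) are the plain readings of the two metric names above; the real module may define or aggregate them differently.

```python
from dataclasses import dataclass


@dataclass
class SpecDecodeCounters:
    """Hypothetical running totals for one workload."""
    num_steps: int = 0
    num_draft_tokens: int = 0
    num_accepted_tokens: int = 0

    def observe(self, num_draft: int, num_accepted: int) -> None:
        """Record one verification step."""
        if not 0 <= num_accepted <= num_draft:
            raise ValueError("accepted must be between 0 and num_draft")
        self.num_steps += 1
        self.num_draft_tokens += num_draft
        self.num_accepted_tokens += num_accepted

    @property
    def acceptance_rate(self) -> float:
        # Fraction of drafted tokens that survived verification.
        if self.num_draft_tokens == 0:
            return 0.0
        return self.num_accepted_tokens / self.num_draft_tokens

    @property
    def accepted_tokens_per_step(self) -> float:
        # Average number of draft tokens accepted per verification step.
        if self.num_steps == 0:
            return 0.0
        return self.num_accepted_tokens / self.num_steps


counters = SpecDecodeCounters()
for drafted, accepted in [(3, 2), (3, 0), (3, 3)]:
    counters.observe(drafted, accepted)
print(round(counters.acceptance_rate, 3),
      round(counters.accepted_tokens_per_step, 3))
```

A low acceptance rate means most of the verification work is spent on tokens that get thrown away, which is the signal to turn the proposer off or swap it for that workload.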
Reading checklist
- NgramProposer.propose: how does it find a candidate, and what does it return?
- EAGLE: what does it predict that makes its drafts good (hidden states, not just tokens)?
- rejection_sample: find the accept test and the resample-on-reject; why does it preserve the distribution?
- In scheduler.py, how do num_lookahead_tokens and num_tokens_with_spec make spec decode ride the normal schedule?
- What does the metrics module measure, and why is acceptance rate the deciding number?
Now build it: 02-mini-build.md, then the labs.