Phase 08 — Deep Dive: speculative decoding in real vLLM

Paths relative to upstream/ at v0.22.1 @ 0decac0.

vllm/v1/spec_decode/ngram_proposer.py    n-gram / prompt-lookup proposer (no model — read first)
vllm/v1/spec_decode/eagle.py             EAGLE proposer (a tiny trained head)
vllm/v1/spec_decode/{medusa,dflash,suffix_decoding,draft_model}.py   other proposers
vllm/v1/spec_decode/metadata.py          spec metadata passed around
vllm/v1/sample/rejection_sampler.py      verification that preserves the distribution
vllm/v1/core/sched/scheduler.py          spec_token_ids / num_lookahead_tokens (the hook)

1. The simplest proposer: n-gram (ngram_proposer.py)

class NgramProposer (:12), propose (:131). The idea (prompt lookup): take the last n tokens of the sequence, search earlier in the same sequence for a previous occurrence of that n-gram, and if one is found, propose the k tokens that followed it there. No model, no weights, just matching over token ids, yet it crushes repetitive workloads (code, JSON, summarization where the answer quotes the source). Read propose and notice it returns up to k candidate token ids. Your lab-01 ngram_propose is this exact algorithm.
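The lookup described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in ngram_proposer.py: the function name, the single fixed n, and the choice to prefer the most recent earlier match are all assumptions of the sketch.

```python
from typing import List


def ngram_propose(tokens: List[int], n: int, k: int) -> List[int]:
    """Toy prompt-lookup proposer (hypothetical helper, not vLLM's API).

    Finds an earlier occurrence of the last n tokens and returns up to k
    tokens that followed it. Returns [] when there is no earlier match.
    """
    if n <= 0 or k <= 0 or len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan start positions from most recent to oldest, skipping the
    # suffix itself (assumption: the most recent match wins).
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            # Slicing past the end simply yields fewer than k tokens.
            return tokens[start + n:start + n + k]
    return []
```

For example, with tokens = [1, 2, 3, 4, 5, 1, 2, 3], n = 3 and k = 2, the trailing [1, 2, 3] matches at position 0 and the proposal is the two tokens that followed it, [4, 5].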

2. A trained proposer: EAGLE (eagle.py)

class EagleProposer(SpecDecodeBaseProposer) (:10). EAGLE runs a small network that predicts the target model's next hidden states (not just tokens), which it then turns into high-quality draft tokens — far better acceptance than n-gram on general text, for a small extra cost. It shares the target's KV/hidden states (note extract_hidden_states.py). Medusa (medusa.py), DFlash (dflash.py), suffix decoding, and a plain small draft model are siblings — all implement "produce k cheap, plausible next tokens." They plug into the same verify path.

3. Verification that preserves the distribution: the rejection sampler

vllm/v1/sample/rejection_sampler.py: class RejectionSampler (:37), forward (:87), rejection_sample (:392). For greedy it's trivial (accept a draft token iff it equals the target's argmax). For sampling, rejection_sample implements the speculative-sampling rule: accept draft token i with probability min(1, p_target(i) / p_draft(i)); on rejection, resample from the adjusted distribution normalize(max(0, p_target − p_draft)). The math guarantees the accepted tokens are distributed exactly as if the target had sampled directly — the proof of "speed, not behavior." Skim the function and find the accept test and the resample-on-reject branch.
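As a reading aid, here is a minimal CPU sketch of the accept test and the resample-on-reject branch for a single request, using only the standard library. The function name, argument layout (one probability list per draft position), and the fallback when the residual is all zeros are assumptions of this sketch; the real rejection_sample works on batched tensors and will look quite different.

```python
import random
from typing import List, Sequence


def speculative_verify(
    draft_tokens: Sequence[int],
    p_draft: Sequence[Sequence[float]],
    p_target: Sequence[Sequence[float]],
    rng: random.Random,
) -> List[int]:
    """Toy speculative-sampling verifier (hypothetical, not vLLM's API).

    p_draft[i] and p_target[i] are distributions over the vocabulary at
    draft position i. Returns the accepted prefix, plus one resampled
    token at the first rejected position.
    """
    out: List[int] = []
    for i, tok in enumerate(draft_tokens):
        q = p_draft[i][tok]
        p = p_target[i][tok]
        # Accept test: keep the draft token with probability min(1, p / q).
        if q > 0.0 and rng.random() < min(1.0, p / q):
            out.append(tok)
            continue
        # Rejected: resample from normalize(max(0, p_target - p_draft)).
        residual = [max(0.0, pt - pd) for pt, pd in zip(p_target[i], p_draft[i])]
        # Assumption: if the residual is all zeros, fall back to p_target.
        weights = residual if sum(residual) > 0.0 else list(p_target[i])
        out.append(rng.choices(range(len(weights)), weights=weights)[0])
        break  # everything after the first rejection is discarded
    return out
```

When the two distributions agree, the ratio is 1 and every draft token is accepted; when the target gives a draft token zero probability, it is always rejected and the replacement comes from the mass the target has and the draft lacks.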

4. How it rides the scheduler (the elegant part)

Open vllm/v1/core/sched/scheduler.py and search for spec_token_ids and num_lookahead_tokens (around the running-request loop, ~:447/:502). What you'll see:

  • num_lookahead_tokens is passed to allocate_slots so KV space is reserved for the draft tokens (Phase 2).
  • a request's num_tokens_with_spec (request.py:243) includes the draft tokens, so the same num_new_tokens = num_tokens_with_spec − num_computed_tokens clamp (Phase 3) naturally schedules them to be verified.
  • after the model runs, update_from_output consults the rejection sampler's result, keeps the accepted prefix, and rolls back the rest (un-computes rejected tokens' KV).

So spec decode is not a special path in the scheduler — it's "a few extra tokens in the gap," exactly as Phase 3's top-of-function comment promised. That's the design lesson: a good abstraction ("close the num_computed→num_tokens gap") absorbs a whole feature for free.
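The bookkeeping in the bullets above can be traced with a toy request object. This is a sketch under stated assumptions: ToyRequest, num_new_tokens, and finish_step are hypothetical names, only the field names spec_token_ids, num_computed_tokens, and num_tokens_with_spec mirror the source, and the extra next_token argument (the target's own token at the first rejected position, or after the last draft) is an assumption about what gets appended.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request; not vLLM's Request class."""
    token_ids: List[int]
    spec_token_ids: List[int] = field(default_factory=list)
    num_computed_tokens: int = 0

    @property
    def num_tokens_with_spec(self) -> int:
        # Draft tokens count toward the length the scheduler tries to reach.
        return len(self.token_ids) + len(self.spec_token_ids)


def num_new_tokens(req: ToyRequest, token_budget: int) -> int:
    # The same "close the gap" clamp, with drafts included in the gap.
    return min(req.num_tokens_with_spec - req.num_computed_tokens, token_budget)


def finish_step(req: ToyRequest, num_scheduled: int,
                accepted: List[int], next_token: int) -> None:
    # The model ran over every scheduled position, drafts included.
    req.num_computed_tokens += num_scheduled
    # Roll back: rejected drafts no longer count as computed.
    req.num_computed_tokens -= len(req.spec_token_ids) - len(accepted)
    # Keep the accepted prefix, then the target's own token (assumption).
    req.token_ids.extend(accepted)
    req.token_ids.append(next_token)
    req.spec_token_ids = []
```

With three tokens of which two are computed and three drafts attached, the gap is 4, so four positions are scheduled in one step. If only the first draft is accepted, two positions are rolled back and the request ends the step with the usual gap of one token, as if nothing special had happened.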

5. Metrics

spec_decode/metrics.py tracks acceptance rate and accepted-tokens-per-step — the numbers that tell you whether spec decode is paying off (Step 4 of the guide). In production you watch these to decide whether to keep it on for a given workload.
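To make the two numbers concrete, here is a small counter that computes them from per-step observations. The class and field names are hypothetical and are not taken from spec_decode/metrics.py; the sketch only assumes that each step reports how many draft tokens were proposed and how many were accepted.

```python
from dataclasses import dataclass


@dataclass
class ToySpecStats:
    """Hypothetical running counters; not vLLM's metrics classes."""
    num_steps: int = 0
    num_draft_tokens: int = 0
    num_accepted_tokens: int = 0

    def observe(self, num_draft: int, num_accepted: int) -> None:
        # Record one verification step.
        self.num_steps += 1
        self.num_draft_tokens += num_draft
        self.num_accepted_tokens += num_accepted

    @property
    def acceptance_rate(self) -> float:
        # Fraction of proposed draft tokens that survived verification.
        if self.num_draft_tokens == 0:
            return 0.0
        return self.num_accepted_tokens / self.num_draft_tokens

    @property
    def accepted_per_step(self) -> float:
        # Average number of draft tokens gained per verification step.
        if self.num_steps == 0:
            return 0.0
        return self.num_accepted_tokens / self.num_steps
```

Two steps with 4 drafts each, accepting 3 and then 1, give an acceptance rate of 0.5 and 2.0 accepted tokens per step. A low rate means most of the verification compute is spent on tokens that get thrown away.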

Reading checklist

  • NgramProposer.propose — how does it find a candidate, and what does it return?
  • EAGLE — what does it predict that makes its drafts good (hidden states, not just tokens)?
  • rejection_sample — find the accept test and the resample-on-reject; why does it preserve the distribution?
  • In scheduler.py, how do num_lookahead_tokens and num_tokens_with_spec make spec decode ride the normal schedule?
  • What does the metrics module measure, and why is acceptance rate the deciding number?

Now build it: 02-mini-build.md, then the labs.