Lab 08-02 — EAGLE on Real vLLM `[GPU-OPT]`

The CPU labs built the machine (01), proved its theorem (03), and priced it (04). This lab runs the state of the art on real silicon: EAGLE — a one-layer draft head that reads the target model's own hidden features and proposes from understanding rather than lab-01's string matching — and measures the two numbers the whole phase converges on: inter-token latency (18.2 → 9.6 ms, ~1.9×) and acceptance (2.8 of 5, 56%). Then the two qualifications that keep the result honest: acceptance climbs to ~80% on code, and the win shrinks at high batch — both of which your lab-04 model predicts before the GPU is even warm.

No GPU? Don't panic. The captured run below is annotated against all three CPU labs; the cross-checking is the lesson.

Why this lab exists
Background: what EAGLE changes (and doesn't)
Requirements
Steps
Captured output (real run, Llama-3-8B + EAGLE, A100, vLLM 0.22.1, trimmed)
Reading the numbers
Hitchhiker's notes
Reflect
References

Why this lab exists

This is the phase's integration test — of your understanding, not the software. Three CPU labs handed you a model of speculative decoding with named parameters: acceptance rate α (lab-03's overlap), draft cost c, tokens-per-cycle (lab-01's metric), speedup (lab-04's formula). A real EAGLE run hands you measurements. The work of this lab is the reconciliation: plug the measured acceptance into lab-04's formula, predict the ITL improvement, compare with the measured 1.9×, and account for the gap (overheads the model omits). When prediction and measurement agree within your stated error budget, the phase is yours. When they don't, one of your parameters is wrong — finding which is exactly the skill a staff engineer applies when a vendor (or a teammate) claims "2× from speculation" for your workload.

The lab also installs the production decision frame: speculative decoding is a latency tool funded by spare compute. Both halves of that sentence are visible in the capture — the single-stream halving, and the fade at saturated batch — and both are why "should we enable EAGLE?" has a different answer for a chatbot (yes, probably) than for an offline batch pipeline (probably not).

Background: what EAGLE changes (and doesn't)

Everything from labs 01/03 survives intact: propose k, one batched verify, leading-run acceptance, correction/bonus token, lossless guarantee. What EAGLE changes is the drafter: instead of searching the context for literal repeats (α ≈ 0 on novel prose), it runs a single transformer layer over the target model's last hidden states — the features the target computed anyway — plus the sampled token, and autoregressively rolls out k draft tokens. Because it reads the target's "thoughts" rather than its text, it predicts well even on text that never repeats (α ≈ 0.6–0.8); because it's one layer against the target's 32+, its cost is c ≈ 0.05. On lab-04's (α, c) plane, EAGLE sits in the corner that dominates both the free-but-blind n-gram drafter and the smart-but-expensive separate draft model — which is why the separate-draft-model approach has mostly faded, and why EAGLE-family heads exist for most popular open models.
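The part that survives intact can be sketched in a few lines for the greedy (temperature 0) case. This is a minimal illustration, not vLLM's implementation: `verify_greedy` and its argument names are hypothetical, and it assumes the batched verify pass has already produced the target's greedy token at each of the k+1 positions.

```python
def verify_greedy(draft, target_choice):
    """Leading-run acceptance for greedy decoding.

    draft:         the k tokens proposed by the drafter.
    target_choice: k+1 tokens from one batched verify pass, where
                   target_choice[i] is the target's greedy pick after
                   the context plus draft[:i].
    Returns the tokens emitted this cycle (between 1 and k+1 of them).
    """
    if len(target_choice) != len(draft) + 1:
        raise ValueError("verify pass must score k+1 positions")
    accepted = 0
    for proposed, wanted in zip(draft, target_choice):
        if proposed != wanted:
            break  # first mismatch ends the accepted run
        accepted += 1
    # The target's own token at the stopping position is always emitted:
    # a correction after a mismatch, a bonus if every draft token matched.
    return list(draft[:accepted]) + [target_choice[accepted]]


print(verify_greedy([11, 22, 33], [11, 22, 99, 44]))  # two accepted + correction
print(verify_greedy([11, 22, 33], [11, 22, 33, 44]))  # all accepted + bonus
```

Every emitted token is one the target would have chosen itself, which is why the output is unchanged; only the number of tokens per verify pass varies with the drafter.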

The price of reading hidden states: the head is target-specific (trained per model, shapes must match — you can't borrow Llama's head for Qwen), and the draft itself runs autoregressively (k sequential micro-steps — tiny ones, but this is exactly where Phase 5's CUDA graphs become load-bearing: a 1-layer model's step is pure launch overhead without them).

Requirements

uv pip install -e ".[vllm]"
# a base model + its matching EAGLE head from the Hub, e.g.:
#   meta-llama/Meta-Llama-3-8B-Instruct  +  yuhuili/EAGLE-LLaMA3-Instruct-8B

Steps

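import sys
import time
from vllm import LLM, SamplingParams

sp = SamplingParams(max_tokens=128, temperature=0)
prompts = ["Explain how a hash map handles collisions."]  # single stream first!

def itl_ms(llm):
    """ITL = elapsed / tokens, in ms (wall clock, so it includes prefill)."""
    llm.generate(prompts, sp)  # warm-up pass, not timed
    t0 = time.perf_counter()
    outs = llm.generate(prompts, sp)
    elapsed = time.perf_counter() - t0
    tokens = sum(len(o.outputs[0].token_ids) for o in outs)
    return 1000 * elapsed / tokens

if __name__ == "__main__":
    # Run baseline and EAGLE as separate processes: two engines at
    # gpu_memory_utilization=0.8 do not fit on one GPU at the same time.
    kwargs = {}
    if "--eagle" in sys.argv:
        kwargs["speculative_config"] = {"method": "eagle", "model": "<eagle head>",
                                        "num_speculative_tokens": 5}
    llm = LLM(model="<base>", gpu_memory_utilization=0.8, **kwargs)
    print(f"ITL {itl_ms(llm):.1f} ms/token")
    # EAGLE run: also read the spec-decode metrics lines from the log
    #     (acceptance counts / mean acceptance length).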

Three runs to do properly: (1) single stream, the headline; (2) the same prompt swapped for code generation — watch acceptance move; (3) batch 32+ — watch the speedup fade. Before each, predict the result from lab-04 with your current (α, c) estimates.

Captured output (real run, Llama-3-8B + EAGLE, A100, vLLM 0.22.1, trimmed)

baseline      : ITL 18.2 ms/token   (54.9 tok/s, single stream)
eagle (k=5)   : ITL  9.6 ms/token   (104 tok/s)        ~1.9x faster
spec_decode metrics: mean acceptance length 2.8 / 5 ; acceptance rate 56%
# on highly repetitive input (code), acceptance rose to ~80% and ITL dropped further.
# at large batch (saturated GPU) the speedup shrank — less spare capacity to verify.

Reading the numbers

Mean acceptance length 2.8 → tokens-per-cycle 3.8 (the +1 is lab-01's correction/bonus). Lab-04 sanity check: the per-position α solving (1−α⁶)/(1−α) = 3.8 is ≈ 0.81 (α = 0.75 would give only ≈ 3.29). The logged "56%" uses a different denominator (accepted/proposed = 2.8/5). Two acceptance metrics, one phenomenon; confusing them is the most common spec-decode reporting error. Always ask which one a number is.
Predicted vs measured: lab-04 with α=0.81, c=0.05, k=5 gives 3.78 / 1.25 ≈ 3.0×; measured is 1.9×. The gap is the model's known omissions (per-cycle sampler/launch overheads, the verify pass costing slightly more than one plain decode step, drafting running serially), consistent in direction with the bias list in lab-04's notes. A model that misses by a predictable margin in a predictable direction is a working model.
Code → 80% acceptance: sharper next-token distributions overlap more (lab-03: α = Σ min(p,q) grows as both distributions concentrate on the same tokens). Same reason low temperature helps. Your workload's α is a property of your traffic; measure it there.
The fade at batch: verify rides on spare compute (Phase 0 lab-04's idle FLOPs at small batch). A saturated GPU has none — the verify pass now displaces other requests' work, and tokens-per-cycle gains stop translating into wall-clock. Spec decode is a latency tool; at full throughput it approaches a no-op (or worse, with drafting overhead). This single observation decides most deployment questions.
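The sanity check and the prediction in this list can be reproduced numerically. The sketch below assumes lab-04's model has the form quoted in this section (tokens-per-cycle (1−α^(k+1))/(1−α), cycle cost 1 + k·c); the function names are illustrative, not lab-04's actual API.

```python
def tokens_per_cycle(alpha, k):
    """Expected tokens emitted per verify cycle: 1 + alpha + ... + alpha**k."""
    return sum(alpha**i for i in range(k + 1))


def solve_alpha(target_tokens, k, tol=1e-9):
    """Invert tokens_per_cycle by bisection (it is increasing in alpha)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if tokens_per_cycle(mid, k) < target_tokens:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


k, c = 5, 0.05                      # k from the run; c is the lab's estimate
mean_accept_len = 2.8               # from the captured metrics line
measured_speedup = 18.2 / 9.6       # baseline ITL / EAGLE ITL

alpha = solve_alpha(mean_accept_len + 1, k)   # +1 for the correction/bonus token
accept_rate = mean_accept_len / k             # the logged "56%" metric
predicted = tokens_per_cycle(alpha, k) / (1 + k * c)

print(f"per-position alpha  : {alpha:.2f}")
print(f"accepted / proposed : {accept_rate:.2f}")
print(f"predicted speedup   : {predicted:.2f}x")
print(f"measured speedup    : {measured_speedup:.2f}x")
print(f"fraction realized   : {measured_speedup / predicted:.2f}")
```

The last line is a convenient single number for the model's omissions. Tracked across runs, a sudden change in it points at an overhead (eager-mode drafting, for example) rather than at acceptance.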

Hitchhiker's notes

k=5 is not sacred. With measured α ≈ 0.81 and c ≈ 0.05, lab-04's speedup curve peaks a few steps above k = 5 and is nearly flat there, so k = 5 gives up only about 5%: fine. But on the prose end (α ≈ 0.5) optimal k drops to ~3, and a configured k that is too high costs latency (rejected drafts still occupy verify slots). If your acceptance metrics run low, shrinking num_speculative_tokens is the free fix nobody tries.
EAGLE + CUDA graphs are a package deal (Phase 5 lab-04's note, now concrete): the draft head's per-token step is ~1 ms-class GPU work behind full launch overhead — eager-mode EAGLE can lose most of its margin to Python and launches. If spec-decode numbers disappoint, check the draft path is actually captured.
Greedy here, but the guarantee generalizes: with temperature > 0 the verify runs lab-03's rejection sampling, and outputs are distributionally identical rather than token-identical. Acceptance drops a bit (broader distributions overlap less). The metrics machinery is unchanged.
EAGLE-2/3 and tree drafts: instead of one chain of k, draft a small tree of alternatives and verify all paths in one pass (attention masks make a tree look like a batch). Buys higher expected acceptance per verify at the cost of verify width — same economics, one more dimension. When you see speculative_config grow tree parameters, lab-04's model extends with "k" becoming "tree shape."
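The k-tuning note at the top of this list can be checked with a brute-force sweep. A minimal sketch, assuming lab-04's speedup has the form tokens-per-cycle over 1 + k·c used in this lab; `best_k` is a hypothetical stand-in for lab-04's optimal_k, whose actual implementation may differ.

```python
def speedup(alpha, c, k):
    """Modelled speedup: expected tokens per cycle over relative cycle cost."""
    tokens = sum(alpha**i for i in range(k + 1))  # 1 + alpha + ... + alpha**k
    return tokens / (1 + k * c)


def best_k(alpha, c, k_max=16):
    """Draft length in 1..k_max that maximizes the modelled speedup."""
    return max(range(1, k_max + 1), key=lambda k: speedup(alpha, c, k))


c = 0.05  # the lab's estimate of EAGLE's relative draft cost
for alpha in (0.5, 0.65, 0.8):
    k_star = best_k(alpha, c)
    loss = 1 - speedup(alpha, c, 5) / speedup(alpha, c, k_star)
    print(f"alpha={alpha:.2f}  best k={k_star}  "
          f"model speedup={speedup(alpha, c, k_star):.2f}x  "
          f"cost of k=5: {loss:.1%}")
```

The model prices only the draft cost. Rejected drafts occupying verify slots is an extra cost it does not include, so a sensible k is likely at or below the swept optimum; feed it the α you measure on your own traffic.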

Reflect

Reconcile the three acceptance numbers you now have (2.8/5 = 56%; per-position α ≈ 0.81; code ≈ 80%): write each as a formula over the same event sequence. If you can do this cold, you'll never misread a spec-decode dashboard.
Your fleet runs batch-48 throughput-oriented summarization. EAGLE: yes or no? What measurement would change your answer? (Likely no — saturated compute; measure spare utilization headroom and p99 ITL requirements. If interactivity appears — yes for the interactive class, via a separate pool or priority.)
The EAGLE head must match the target model. What happens operationally when you upgrade the base model checkpoint? (The head needs retraining/replacing — speculative configs add a coupled artifact to your model-rollout pipeline. Budget for it or inherit silent acceptance collapse.)
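For the first question, one way to write the acceptance metrics over the same event sequence is sketched below. The per-cycle accepted counts are made up for illustration (chosen to average 2.8, not taken from the run), and the per-position estimator assumes the independent per-position acceptance model behind lab-04's formula.

```python
def acceptance_metrics(accepted_per_cycle, k):
    """Three views of one event log: accepted draft tokens per verify cycle."""
    cycles = len(accepted_per_cycle)
    accepted = sum(accepted_per_cycle)
    # A cycle that accepts a < k tokens tested a+1 positions (the last one
    # failed); a fully accepted cycle tested all k.
    tested = sum(min(a + 1, k) for a in accepted_per_cycle)
    return {
        "mean_acceptance_length": accepted / cycles,
        "acceptance_rate": accepted / (cycles * k),    # accepted / proposed
        "per_position_alpha": accepted / tested,       # accepted / tested
    }


# Hypothetical log: accepted counts for five cycles at k=5.
m = acceptance_metrics([5, 2, 0, 3, 4], k=5)
for name, value in m.items():
    print(f"{name:24s} {value:.3f}")
```

The first two metrics differ only by the constant k, while the third divides by positions actually reached, which is why it is always at least as large as the second. The code figure (~80%) is the second formula again over different traffic, assuming the log reports it the same way as the 56%.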

References

Li et al., EAGLE (2024): https://arxiv.org/abs/2401.15077; EAGLE-2 (tree drafts, 2024): https://arxiv.org/abs/2406.16858
upstream/vllm/v1/spec_decode/eagle.py — the proposer; note the hidden-state plumbing from the target's forward.
vLLM docs, Speculative Decoding — configs and the metrics you read: https://docs.vllm.ai/en/latest/features/spec_decode/
Labs 01/03/04 — the machine, the theorem, the economics this run validates.

vLLM Mastery — From Zero to Maintainer