Lab 09-02 — Parallel Sampling Shares Prompt KV [GPU-OPT]
n=4 in a SamplingParams looks like syntactic sugar for "send the request four
times." This lab shows why it's structurally better: the engine prefills the prompt
once, and all four samples share its KV blocks via the Phase 2 ref-count
machinery, diverging only from the first sampled token onward. You'll watch it happen
in the logs (75% prefix-cache hit rate for n=4 — three of four samples ride the
first's blocks) and connect three phases of machinery into the single cheapest way to
buy output diversity.
No GPU? Don't panic. The captured run below carries the whole argument — and the arithmetic sections need no hardware at all.
Contents
- Why this lab exists
- Background: one prompt, n tails
- Requirements
- Steps
- Captured output (real run, facebook/opt-125m, L4, vLLM 0.22.1, trimmed)
- Reading the numbers
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Parallel sampling is quietly one of the most-used features in production: best-of-n
ranking, self-consistency for reasoning (sample k chains, vote), candidate generation
for RLHF and eval pipelines, "regenerate" buttons. All of them multiply output
tokens while reusing one prompt — and whether that prompt is processed once or n times
is often the dominant cost term (prompts routinely outweigh completions 10:1 in chat
and RAG). Knowing that n>1 shares the prefill — and why, down to the block
refcounts — turns "should we batch our candidates into one request?" from a guess
into arithmetic you can do in your head: n=4 on a 2,000-token prompt with 100-token
outputs ≈ 6,000 prompt tokens saved — roughly 3× cheaper than four independent
requests at this shape.
The lab is also the phase's bridge back to the memory phases: it's the first place
sampling policy (how many candidates) visibly drives memory behavior (block
sharing). The diversity you buy with n is priced in KV blocks, and the discount —
sharing — comes from infrastructure built three phases ago for a different feature
(prefix caching). Composability like that is what good engine design looks like from
the outside.
Background: one prompt, n tails
What the engine does with n=4: the frontend fans the request into 4 sequences;
sequence 1 prefills the prompt, caching its full blocks (Phase 2 lab-05's eager
caching at allocation); sequences 2–4 hit those blocks at admission
(get_computed_blocks → touch → ref_cnt = 4 on the shared blocks) and prefill
only the cache-ineligible remainder (the partial tail block + last token — the
num_tokens − 1 rule). From the first sampled token, each sequence allocates its own
private tail blocks and diverges — temperature 1.0 plus per-sequence RNG state
(lab-04's machinery) makes the four continuations distinct.
Cost shape: prefill ≈ 1× prompt + small change (instead of 4×); KV ≈ 1× prompt + 4× outputs (instead of 4× everything); decode = 4 streams, which batch together in every step (Phase 1 lab-04's mixed batches — four rows, one weight stream, nearly free at small n per Phase 0 lab-04's bandwidth math).
Requirements
uv pip install -e ".[vllm]"
huggingface-cli download facebook/opt-125m
Steps
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True, gpu_memory_utilization=0.5)
out = llm.generate(["Write a haiku about GPUs:"],
SamplingParams(n=4, temperature=1.0, max_tokens=24))
for c in out[0].outputs:
print(repr(c.text))
Run under VLLM_LOGGING_LEVEL=DEBUG. Three things to verify: the prompt's prefill
happens once (scheduled-token counts in the logs); the prefix-cache hit rate lands at
(n−1)/n-ish; the four texts genuinely differ. Then the control runs: same with
enable_prefix_caching=False (watch prompt tokens 4×), and with seed set (watch
the four outputs stay distinct — see the notes for why).
Captured output (real run, facebook/opt-125m, L4, vLLM 0.22.1, trimmed)
DEBUG ... Prefix cache hit rate: GPU: 75.0% # samples 2-4 reuse sample 1's prompt blocks
'GPUs run hot / silicon dreams in parallel / fans hum all night long'
'Tensors flow like streams ...'
'Cores ignite at dawn ...'
'Threads in warps align ...'
Reading the numbers
- 75.0% = 3/4 — the pioneer effect with n as the denominator: sample 1 populates, samples 2–4 hit. (Phase 2 lab-03's 87.5% was 7/8; same law, different n. You can now read any of these rates as "1 − 1/cohort".)
- Four distinct haikus — divergence is immediate (first sampled token) because each sequence draws independently. If two of your n samples come back identical on a short prompt, that's not a bug — peaked distributions (low temperature, strong prompt) genuinely collide; raise temperature or use distinct seeds per your diversity needs.
- What the log doesn't show: the shared blocks'
ref_cntsitting at 4, and the free-on-finish path decrementing it as each sample completes — the Phase 2 lab-05 biography, now with four readers. The blocks free only when the last sample finishes; a straggler sample (one haiku that rambles to max_tokens) holds the prompt's KV for everyone. Worth knowing when n is large and outputs are long.
Hitchhiker's notes
n>1with aseedstill gives n different outputs — vLLM derives per-sequence randomness so the n samples don't collapse into n copies (which would make seeded best-of-n useless). The whole-request stream is reproducible; the sequences differ from each other. (Lab-04's per-request → per-sequence state, one level finer.)- V0 had a
best_ofdistinct fromn(generate best_of, return n by logprob); V1 simplified the API surface — ranking moved client-side. If you seebest_ofin older docs/code, that's the fossil. The sharing machinery is the same either way. - Self-consistency at scale: k=16 reasoning chains over a long CoT prompt is the flagship use — prompt KV once, 16 cheap decode streams, majority-vote the answers. The cost model above is why the technique is affordable at all; quote it when someone proposes 16 separate API calls instead.
- Contrast with beam search (lab-03): parallel samples never interact after the
fork — no pruning, no rescoring, scheduler-trivial. Beams branch and die mid
flight, which is exactly the interaction that got beam search evicted from the V1
core. Independence is what makes
ncheap to support.
Reflect
- Write the exact cost ratio of
n=4vs four separate requests (with caching off) for prompt P, output L tokens: prefill (P + 3·1-ish vs 4P) and KV (P + 4L vs 4P + 4L). At what L/P does the advantage fade? (When outputs dwarf prompts — diversity's discount is a prompt-side effect.) - Four separate requests with prefix caching on get most of the same sharing
(Phase 3 lab-03). What does
n=4still buy? (One API call, guaranteed same-step admission so the hit is certain rather than eviction-dependent, single response object — and intra-request sequence accounting. The mechanism is shared; the contract differs.) - Where do the n sequences' sampling states live, given they're one "request"? (Per-sequence rows in the input batch — generators, penalties' history, all of lab-01/lab-04's state, n times. "Request" is an API word; the engine schedules sequences.)
References
upstream/vllm/v1/engine/— the n>1 fan-out in the output processor/frontend (searchn=handling andparentrequest logic).- Phase 2 lab-05 —
touch/ref_cnt: the sharing mechanics; Phase 3 lab-06 — the exact-token accounting of what sharing saves. - Wang et al., Self-Consistency Improves Chain of Thought Reasoning (2022) — the workload this feature exists for: https://arxiv.org/abs/2203.11171
- vLLM docs, Sampling Parameters —
nand friends: https://docs.vllm.ai/en/latest/api/inference_params.html