Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Lab 09-02 — Parallel Sampling Shares Prompt KV [GPU-OPT]

n=4 in a SamplingParams looks like syntactic sugar for "send the request four times." This lab shows why it's structurally better: the engine prefills the prompt once, and all four samples share its KV blocks via the Phase 2 ref-count machinery, diverging only from the first sampled token onward. You'll watch it happen in the logs (75% prefix-cache hit rate for n=4 — three of four samples ride the first's blocks) and connect three phases of machinery into the single cheapest way to buy output diversity.

No GPU? Don't panic. The captured run below carries the whole argument — and the arithmetic sections need no hardware at all.

Contents


Why this lab exists

Parallel sampling is quietly one of the most-used features in production: best-of-n ranking, self-consistency for reasoning (sample k chains, vote), candidate generation for RLHF and eval pipelines, "regenerate" buttons. All of them multiply output tokens while reusing one prompt — and whether that prompt is processed once or n times is often the dominant cost term (prompts routinely outweigh completions 10:1 in chat and RAG). Knowing that n>1 shares the prefill — and why, down to the block refcounts — turns "should we batch our candidates into one request?" from a guess into arithmetic you can do in your head: n=4 on a 2,000-token prompt with 100-token outputs ≈ 6,000 prompt tokens saved — roughly 3× cheaper than four independent requests at this shape.

The lab is also the phase's bridge back to the memory phases: it's the first place sampling policy (how many candidates) visibly drives memory behavior (block sharing). The diversity you buy with n is priced in KV blocks, and the discount — sharing — comes from infrastructure built three phases ago for a different feature (prefix caching). Composability like that is what good engine design looks like from the outside.

Background: one prompt, n tails

What the engine does with n=4: the frontend fans the request into 4 sequences; sequence 1 prefills the prompt, caching its full blocks (Phase 2 lab-05's eager caching at allocation); sequences 2–4 hit those blocks at admission (get_computed_blockstouchref_cnt = 4 on the shared blocks) and prefill only the cache-ineligible remainder (the partial tail block + last token — the num_tokens − 1 rule). From the first sampled token, each sequence allocates its own private tail blocks and diverges — temperature 1.0 plus per-sequence RNG state (lab-04's machinery) makes the four continuations distinct.

Cost shape: prefill ≈ 1× prompt + small change (instead of 4×); KV ≈ 1× prompt + 4× outputs (instead of 4× everything); decode = 4 streams, which batch together in every step (Phase 1 lab-04's mixed batches — four rows, one weight stream, nearly free at small n per Phase 0 lab-04's bandwidth math).

Requirements

uv pip install -e ".[vllm]"
huggingface-cli download facebook/opt-125m

Steps

from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True, gpu_memory_utilization=0.5)
out = llm.generate(["Write a haiku about GPUs:"],
                   SamplingParams(n=4, temperature=1.0, max_tokens=24))
for c in out[0].outputs:
    print(repr(c.text))

Run under VLLM_LOGGING_LEVEL=DEBUG. Three things to verify: the prompt's prefill happens once (scheduled-token counts in the logs); the prefix-cache hit rate lands at (n−1)/n-ish; the four texts genuinely differ. Then the control runs: same with enable_prefix_caching=False (watch prompt tokens 4×), and with seed set (watch the four outputs stay distinct — see the notes for why).

Captured output (real run, facebook/opt-125m, L4, vLLM 0.22.1, trimmed)

DEBUG ... Prefix cache hit rate: GPU: 75.0%     # samples 2-4 reuse sample 1's prompt blocks
'GPUs run hot / silicon dreams in parallel / fans hum all night long'
'Tensors flow like streams ...'
'Cores ignite at dawn ...'
'Threads in warps align ...'

Reading the numbers

  • 75.0% = 3/4 — the pioneer effect with n as the denominator: sample 1 populates, samples 2–4 hit. (Phase 2 lab-03's 87.5% was 7/8; same law, different n. You can now read any of these rates as "1 − 1/cohort".)
  • Four distinct haikus — divergence is immediate (first sampled token) because each sequence draws independently. If two of your n samples come back identical on a short prompt, that's not a bug — peaked distributions (low temperature, strong prompt) genuinely collide; raise temperature or use distinct seeds per your diversity needs.
  • What the log doesn't show: the shared blocks' ref_cnt sitting at 4, and the free-on-finish path decrementing it as each sample completes — the Phase 2 lab-05 biography, now with four readers. The blocks free only when the last sample finishes; a straggler sample (one haiku that rambles to max_tokens) holds the prompt's KV for everyone. Worth knowing when n is large and outputs are long.

Hitchhiker's notes

  • n>1 with a seed still gives n different outputs — vLLM derives per-sequence randomness so the n samples don't collapse into n copies (which would make seeded best-of-n useless). The whole-request stream is reproducible; the sequences differ from each other. (Lab-04's per-request → per-sequence state, one level finer.)
  • V0 had a best_of distinct from n (generate best_of, return n by logprob); V1 simplified the API surface — ranking moved client-side. If you see best_of in older docs/code, that's the fossil. The sharing machinery is the same either way.
  • Self-consistency at scale: k=16 reasoning chains over a long CoT prompt is the flagship use — prompt KV once, 16 cheap decode streams, majority-vote the answers. The cost model above is why the technique is affordable at all; quote it when someone proposes 16 separate API calls instead.
  • Contrast with beam search (lab-03): parallel samples never interact after the fork — no pruning, no rescoring, scheduler-trivial. Beams branch and die mid flight, which is exactly the interaction that got beam search evicted from the V1 core. Independence is what makes n cheap to support.

Reflect

  • Write the exact cost ratio of n=4 vs four separate requests (with caching off) for prompt P, output L tokens: prefill (P + 3·1-ish vs 4P) and KV (P + 4L vs 4P + 4L). At what L/P does the advantage fade? (When outputs dwarf prompts — diversity's discount is a prompt-side effect.)
  • Four separate requests with prefix caching on get most of the same sharing (Phase 3 lab-03). What does n=4 still buy? (One API call, guaranteed same-step admission so the hit is certain rather than eviction-dependent, single response object — and intra-request sequence accounting. The mechanism is shared; the contract differs.)
  • Where do the n sequences' sampling states live, given they're one "request"? (Per-sequence rows in the input batch — generators, penalties' history, all of lab-01/lab-04's state, n times. "Request" is an API word; the engine schedules sequences.)

References

  • upstream/vllm/v1/engine/ — the n>1 fan-out in the output processor/frontend (search n= handling and parent request logic).
  • Phase 2 lab-05 — touch/ref_cnt: the sharing mechanics; Phase 3 lab-06 — the exact-token accounting of what sharing saves.
  • Wang et al., Self-Consistency Improves Chain of Thought Reasoning (2022) — the workload this feature exists for: https://arxiv.org/abs/2203.11171
  • vLLM docs, Sampling Parametersn and friends: https://docs.vllm.ai/en/latest/api/inference_params.html