Lab 15-02 — Stand Up a Prefill/Decode Pair [GPU-OPT]

The CPU labs built the bookkeeping (lab-01) and priced the trade (lab-03). This lab assembles the real thing: two vLLM instances on one box — one configured as the prefill producer, one as the decode consumer — joined by a KV connector, with a tiny proxy routing each request through both. You'll watch a request's KV cross between processes, the decode instance emit tokens for a prompt it never prefilled, and the two latency signatures the economics predicted: TTFT carrying the transfer, ITL running clean.

No GPU pair? Don't panic. The captured run below is annotated against both CPU labs; the reconciliation is the lab.

Why this lab exists

Disaggregation is a system — engines, connector, router — and systems have failure modes no component lab shows: the connector handshake that never completes (mismatched kv_transfer_config between the pair), the proxy that forgets to forward the first-token state, the decode instance whose pool can't absorb incoming KV at load (lab-01's loud-OOM, now a 500 error). Standing the pair up once, even on one box, converts the architecture from diagram to muscle memory — and the configuration surface (kv_role, kv_connector, the proxy contract) is exactly what you'll touch in any production P/D rollout.

Requirements

uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
# 2 GPUs ideal (one per role); 1 GPU works if each instance runs with --gpu-memory-utilization 0.4.

Steps

  1. Launch the pair (the P2P/NIXL-style connector config; exact connector names vary by version — vllm serve --help | grep kv is authoritative):
# Prefill instance (producer):
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8100 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'

# Decode instance (consumer):
CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8200 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
  2. Run the proxy (vLLM ships examples under upstream/examples/online_serving/disaggregated_serving/): it sends each request to P with max_tokens=1, then replays it to D, which pulls the KV instead of prefilling.

  3. Measure both arms: run the same prompts against a plain single instance and against the pair, and record the TTFT and ITL distributions separately (Phase 3 lab-05's follow-one-request discipline). Then load the decode side with steady streams and fire big prompts: colocated, the streams stutter; through the pair, they don't.
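
The proxy's contract is small enough to sketch. What follows is a minimal illustration, not the upstream proxy: the ports come from the launch commands in this lab, /v1/completions is the OpenAI-compatible endpoint that vllm serve exposes, and the `post` callable, the `route` name, and the choice to force stream off on the prefill leg are assumptions of this sketch. Connector-specific handoff metadata is deliberately left out because it varies by connector and version.

```python
import copy
from typing import Callable

# Ports match the launch commands in this lab.
PREFILL_URL = "http://localhost:8100/v1/completions"
DECODE_URL = "http://localhost:8200/v1/completions"

# Hypothetical transport: (url, JSON body) -> JSON reply.
Post = Callable[[str, dict], dict]


def route(request: dict, post: Post) -> dict:
    """Send one request through P (prefill only), then replay it to D."""
    prefill_req = copy.deepcopy(request)   # never mutate the client's request
    prefill_req["max_tokens"] = 1          # run exactly through the first token
    prefill_req["stream"] = False          # sketch choice: P's reply is not shown to the client
    post(PREFILL_URL, prefill_req)         # P computes the KV; the connector makes it available

    # D gets the original request and pulls the KV instead of prefilling.
    # Any connector-specific handoff fields are omitted here (see the upstream proxy).
    return post(DECODE_URL, request)
```

Passing the transport in as a callable keeps the routing logic testable without a GPU; in a real run it could be as simple as a wrapper around an HTTP client's POST-and-parse-JSON call.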

Captured output (real run, Qwen2.5-0.5B ×2, 2×L4, vLLM 0.22.1, trimmed)

(prefill)  INFO ... NixlConnector: registered as kv_producer
(decode)   INFO ... NixlConnector: registered as kv_consumer
(proxy)    request 0: prefill 188 ms (2031 tok) -> transfer -> decode first token
(decode)   INFO ... received KV for request 0: 127 blocks (~32 MiB)
single-instance     : TTFT 192 ms   ITL p50 11.2 ms   ITL p99 38.4 ms   (decode + big prompts mixed)
disaggregated pair  : TTFT 211 ms   ITL p50 11.0 ms   ITL p99 12.1 ms   (clean decode)
# TTFT +10% (the transfer toll) ; ITL p99 3.2x better (the interference, gone)
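
The two summary lines report TTFT and ITL as separate distributions. A minimal sketch of how such numbers can be derived from per-token arrival timestamps follows; the function names and the nearest-rank percentile are choices of this sketch, not necessarily what the lab's harness uses.

```python
import math


def ttft_ms(sent_at: float, token_times: list[float]) -> float:
    """Time to first token: request send -> first token arrival, in ms."""
    return (token_times[0] - sent_at) * 1000.0


def itl_ms(token_times: list[float]) -> list[float]:
    """Inter-token latencies: gaps between consecutive token arrivals, in ms."""
    return [(b - a) * 1000.0 for a, b in zip(token_times, token_times[1:])]


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]
```

Pool the ITL gaps across all requests in an arm before taking p50 and p99; TTFT is one sample per request and stays in its own distribution.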

Reading the run

  • 127 blocks (~32 MiB) — lab-03's freight, itemized: ~2031 tokens × ~16 KiB (a 0.5B model's per-token KV; run Phase 0 lab-02's formula to check). On the 8B from lab-03's tests this same prompt ships 256 MiB — small models flatter the transfer; scale the conclusion with the formula, not the demo.
  • TTFT 192 → 211 ms (+10%) — the toll, in the predicted range for an intra-box link (lab-03's penalty ratio, plus proxy overhead the model omits).
  • ITL p99 38.4 → 12.1 ms — the purchase: p99 collapses to ~p50 because decode steps never share a batch with prefill chunks anymore. Note p50 barely moved — interference was always a tail phenomenon (Phase 3 lab-05's lesson), and disaggregation is tail surgery.
  • The proxy's max_tokens=1 trick: P must run exactly through first-token (prefill + sample) so the KV is complete and the request state matches lab-01's canonical export point. The classic proxy bugs are off-by-one mistakes at this boundary: trying max_tokens=0 (which isn't a thing), or forgetting to carry the first token to D.
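
The freight arithmetic in the first bullet can be checked in a few lines. This sketch assumes a block size of 16 tokens (consistent with 127 blocks for a 2031-token prompt) and takes the per-token KV sizes from the text as given: ~16 KiB for the 0.5B model, and 8× that for the 8B. The function name is hypothetical.

```python
import math

KIB = 1024
MIB = 1024 * 1024


def kv_freight(prompt_tokens: int, per_token_kv_bytes: int,
               block_size: int = 16) -> tuple[int, float]:
    """Return (blocks, MiB) shipped for one prompt; a partial block ships whole."""
    blocks = math.ceil(prompt_tokens / block_size)
    mib = blocks * block_size * per_token_kv_bytes / MIB
    return blocks, mib


print(kv_freight(2031, 16 * KIB))    # (127, 31.75)  the 0.5B figure from this run
print(kv_freight(2031, 128 * KIB))   # (127, 254.0)  8x per-token KV, close to the 256 MiB quoted
```

Swap in the per-token figure from Phase 0 lab-02's formula for your own model; only that argument changes.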

Hitchhiker's notes

  • Both instances must agree on everything KV-shaped — model, dtype, block size, TP layout — or the transferred tensors are garbage with compatible shapes (the silent kind). Real deployments pin both sides from one config source; version-skewed pairs during rolling upgrades are the operational hazard.
  • Connector zoo: NIXL (point-to-point RDMA-ish), LMCache (shared KV store — doubles as a cross-request prefix cache), MultiConnector (compose them). The roles (kv_producer/kv_consumer) and the scheduler hooks are the stable interface; transports compete underneath (lab-01's "transport varies, bookkeeping doesn't").
  • One box is a simulation of the topology, not the economics — intra-node transfer crosses NVLink/PCIe, flattering lab-03's toll. The correctness and configuration learning transfers; re-price before declaring victory on a real fabric.
  • Failure drill worth running: kill the decode instance mid-stream and watch the proxy's error; then kill the prefill side and note requests can fall back to the decode instance running colocated (it's a full vLLM!). Graceful degradation is configuration, not magic — design your proxy to use it.
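
Pinning both sides from one config source can be enforced with a pre-launch check. The sketch below is illustrative only: the field names are hypothetical stand-ins for "everything KV-shaped" and are not vLLM's own config schema.

```python
# Hypothetical field names standing in for the KV-shaped settings named above.
KV_SHAPED_FIELDS = ("model", "dtype", "block_size", "tensor_parallel_size")


def kv_mismatches(prefill: dict, decode: dict) -> dict:
    """Return {field: (prefill value, decode value)} for each KV-shaped field that differs."""
    return {
        field: (prefill.get(field), decode.get(field))
        for field in KV_SHAPED_FIELDS
        if prefill.get(field) != decode.get(field)
    }


prefill_cfg = {"model": "Qwen/Qwen2.5-0.5B-Instruct", "dtype": "bfloat16",
               "block_size": 16, "tensor_parallel_size": 1}
decode_cfg = dict(prefill_cfg, block_size=32)   # simulate a version-skewed pair

print(kv_mismatches(prefill_cfg, decode_cfg))   # {'block_size': (16, 32)}
```

Refusing to start the pair when the result is non-empty turns the silent failure into a loud one.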

Reflect

  • Trace one request through every phase-15 artifact: lab-01's export point (P's first-token state), lab-03's toll (the 32 MiB), this run's two latency signatures. Which numbers change when the model is 8B? When the link is 10 GbE? (Freight ×8 via per-token KV; toll ratio per lab-03's tests — possibly fatal.)
  • Why does the pair's p50 ITL match the single instance's? (Median decode steps were interference-free in both — chunking already protected them; the p99 was the casualty. Disaggregation buys tails, and SLOs are written on tails.)
  • Sketch the 3-instance variant: 1 prefill, 2 decode, router balancing imports by free blocks (lab-01's going-further). What new metric does the router need from each D? (Free-block headroom — the loud-OOM check, exported as capacity signal.)
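
For the 3-instance variant, the routing decision might look like the sketch below. It assumes each decode instance reports a free-block count; the DecodeInstance type, the pick_decode name, the 16-token block size, and the reserve parameter are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class DecodeInstance:
    name: str
    free_blocks: int   # capacity signal exported by each D


def pick_decode(instances: list[DecodeInstance], prompt_tokens: int,
                block_size: int = 16, reserve_blocks: int = 0) -> Optional[DecodeInstance]:
    """Pick the D with the most free-block headroom that can absorb the import."""
    needed = math.ceil(prompt_tokens / block_size) + reserve_blocks
    fits = [d for d in instances if d.free_blocks >= needed]
    if not fits:
        return None   # no D can take the KV: reject or queue rather than OOM
    return max(fits, key=lambda d: d.free_blocks)
```

Returning None when nothing fits is the loud-OOM check moved to the router: the request is refused or queued before any KV ships.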

References

  • upstream/examples/online_serving/disaggregated_serving/ — the proxy + configs this lab assembles.
  • upstream/vllm/distributed/kv_transfer/ — connectors, roles, scheduler hooks.
  • vLLM docs, Disaggregated Prefilling: https://docs.vllm.ai/en/latest/features/disagg_prefill/
  • Labs 01 (the bookkeeping) and 03 (the economics) — this run is their joint integration test, per the course's GPU-lab custom.