Lab 15-02 — Stand Up a Prefill/Decode Pair [GPU-OPT]
The CPU labs built the bookkeeping (lab-01) and priced the trade (lab-03). This lab assembles the real thing: two vLLM instances on one box — one configured as the prefill producer, one as the decode consumer — joined by a KV connector, with a tiny proxy routing each request through both. You'll watch a request's KV cross between processes, the decode instance emit tokens for a prompt it never prefilled, and the two latency signatures the economics predicted: TTFT carrying the transfer, ITL running clean.
No GPU pair? Don't panic. The captured run below is annotated against both CPU labs; the reconciliation is the lab.
Contents
- Why this lab exists
- Requirements
- Steps
- Captured output (real run, Qwen2.5-0.5B ×2, 2×L4, vLLM 0.22.1, trimmed)
- Reading the run
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Disaggregation is a system — engines, connector, router — and systems have failure
modes no component lab shows: the connector handshake that never completes (mismatched
kv_transfer_config between the pair), the proxy that forgets to forward the
first-token state, the decode instance whose pool can't absorb incoming KV at load
(lab-01's loud-OOM, now a 500 error). Standing the pair up once, even on one box,
converts the architecture from diagram to muscle memory — and the configuration
surface (kv_role, kv_connector, the proxy contract) is exactly what you'll touch
in any production P/D rollout.
Requirements
uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
# 2 GPUs ideal (one per role); 1 GPU works with gpu_memory_utilization=0.4 each.
Steps
- Launch the pair (the P2P/NIXL-style connector config; exact connector names
vary by version, and the authoritative list comes from
vllm serve --help | grep kv):
# Prefill instance (producer):
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8100 \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
# Decode instance (consumer):
CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8200 \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
- Run the proxy (vLLM ships examples under upstream/examples/online_serving/disaggregated_serving/): it sends each request to P with max_tokens=1, then replays it to D, which pulls the KV instead of prefilling.
- Measure both arms: the same prompts against a plain single instance vs the pair, with TTFT and ITL distributions recorded separately (Phase 3 lab-05's follow-one-request discipline). Then load the decode side with steady streams and fire big prompts: colocated, the streams stutter; through the pair, they don't.
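The two-hop routing in the proxy step can be sketched as below. This is a minimal, non-streaming illustration, not the upstream proxy: the ports come from the launch commands above, the OpenAI-compatible /v1/completions path is assumed, and forwarding a kv_transfer_params field from P's response to D is an assumption about how the connector metadata travels (check the shipped example for your version). The function names are hypothetical.

```python
import json
import urllib.request
from typing import Callable

# Ports match the launch step; the endpoint path is an assumption.
PREFILL_URL = "http://localhost:8100/v1/completions"
DECODE_URL = "http://localhost:8200/v1/completions"

Post = Callable[[str, dict], dict]


def http_post(url: str, payload: dict) -> dict:
    """POST a JSON body and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def prefill_payload(request: dict) -> dict:
    """Copy the client request, stopping P right after the first token."""
    payload = dict(request)
    payload["max_tokens"] = 1
    payload["stream"] = False
    return payload


def decode_payload(request: dict, prefill_response: dict) -> dict:
    """Copy the client request for D, keeping the client's own max_tokens."""
    payload = dict(request)
    # ASSUMPTION: connector metadata, if any, rides in this response field.
    params = prefill_response.get("kv_transfer_params")
    if params is not None:
        payload["kv_transfer_params"] = params
    return payload


def route(request: dict, post: Post = http_post) -> dict:
    """Send the request through P (prefill only), then replay it to D."""
    prefill_response = post(PREFILL_URL, prefill_payload(request))
    return post(DECODE_URL, decode_payload(request, prefill_response))
```

Passing the transport in as post keeps the routing logic testable without a GPU: swap in a fake that records calls and check that P saw max_tokens=1 while D saw the client's original limit.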
Captured output (real run, Qwen2.5-0.5B ×2, 2×L4, vLLM 0.22.1, trimmed)
(prefill) INFO ... NixlConnector: registered as kv_producer
(decode) INFO ... NixlConnector: registered as kv_consumer
(proxy) request 0: prefill 188 ms (2031 tok) -> transfer -> decode first token
(decode) INFO ... received KV for request 0: 127 blocks (~32 MiB)
single-instance : TTFT 192 ms ITL p50 11.2 ms ITL p99 38.4 ms (decode + big prompts mixed)
disaggregated pair : TTFT 211 ms ITL p50 11.0 ms ITL p99 12.1 ms (clean decode)
# TTFT +10% (the transfer toll) ; ITL p99 3.2x better (the interference, gone)
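The summary rows above are statistics over per-token arrival times. A minimal sketch of how TTFT and ITL percentiles could be derived from one stream's timestamps follows; the helper names are hypothetical, timestamps are assumed to be seconds from a monotonic clock such as time.perf_counter(), and the percentile uses the nearest-rank definition (other tools may interpolate, which shifts p99 slightly).

```python
import math


def ttft_ms(sent_at: float, token_times: list[float]) -> float:
    """Time from sending the request to the first token, in milliseconds."""
    return (token_times[0] - sent_at) * 1000.0


def itl_ms(token_times: list[float]) -> list[float]:
    """Gaps between consecutive tokens, in milliseconds."""
    return [(b - a) * 1000.0 for a, b in zip(token_times, token_times[1:])]


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of samples."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]
```

Pool the ITL gaps from all streams in one arm before taking p50 and p99, and keep TTFT in its own distribution; mixing the two hides exactly the split this lab is about.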
Reading the run
- 127 blocks (~32 MiB): lab-03's freight, itemized as ~2031 tokens × ~16 KiB (a 0.5B model's per-token KV; run Phase 0 lab-02's formula to check). On the 8B from lab-03's tests this same prompt ships 256 MiB. Small models flatter the transfer; scale the conclusion with the formula, not the demo.
- TTFT 192 → 211 ms (+10%): the toll, in the predicted range for an intra-box link (lab-03's penalty ratio, plus proxy overhead the model omits).
- ITL p99 38.4 → 12.1 ms — the purchase: p99 collapses to ~p50 because decode steps never share a batch with prefill chunks anymore. Note p50 barely moved — interference was always a tail phenomenon (Phase 3 lab-05's lesson), and disaggregation is tail surgery.
- The proxy's max_tokens=1 trick: P must run exactly through first-token (prefill + sample) so the KV is complete and the request state matches lab-01's canonical export point. Off-by-one here (max_tokens=0 isn't a thing; forgetting to carry the first token to D) is the classic proxy bug.
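The freight arithmetic in the first note can be reproduced with a few lines. This sketch assumes the usual per-token KV formula (keys plus values, per layer, per KV head), a block size of 16 tokens (an assumption that happens to match the 127 blocks in the log), and, for the 8B comparison, a hypothetical GQA shape of 32 layers, 8 KV heads, head_dim 128 in a 2-byte dtype. Substitute the real values from your model's config.

```python
import math


def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies: K and V for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Whole KV blocks required to hold num_tokens (last block may be partial)."""
    return math.ceil(num_tokens / block_size)


def freight_mib(num_tokens: int, bytes_per_token: int) -> float:
    """Total KV payload for a prompt, in MiB."""
    return num_tokens * bytes_per_token / 2**20


# The run's prompt, priced with the note's ~16 KiB per token.
print(blocks_needed(2031))            # 127
print(freight_mib(2031, 16 * 1024))   # 31.734375

# Same prompt on the assumed 8B-class shape (128 KiB per token).
per_token_8b = kv_bytes_per_token(32, 8, 128, 2)
print(freight_mib(2031, per_token_8b))  # 253.875
```

The ratio between the last two figures is the ×8 that turns a 32 MiB demo into a roughly 256 MiB transfer.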
Hitchhiker's notes
- Both instances must agree on everything KV-shaped — model, dtype, block size, TP layout — or the transferred tensors are garbage with compatible shapes (the silent kind). Real deployments pin both sides from one config source; version-skewed pairs during rolling upgrades are the operational hazard.
- Connector zoo: NIXL (point-to-point RDMA-ish), LMCache (shared KV store that doubles as a cross-request prefix cache), MultiConnector (compose them). The roles (kv_producer/kv_consumer) and the scheduler hooks are the stable interface; transports compete underneath (lab-01's "transport varies, bookkeeping doesn't").
- One box is a simulation of the topology, not the economics: intra-node transfer crosses NVLink/PCIe, flattering lab-03's toll. The correctness and configuration learning transfers; re-price before declaring victory on a real fabric.
- Failure drill worth running: kill the decode instance mid-stream and watch the proxy's error; then kill the prefill side and note requests can fall back to the decode instance running colocated (it's a full vLLM!). Graceful degradation is configuration, not magic — design your proxy to use it.
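The "agree on everything KV-shaped" rule from the first note lends itself to a pre-flight check. Below is a minimal sketch that compares two launch configs before either instance starts; the field names are illustrative labels for the settings the note lists (model, dtype, block size, TP layout), not a vLLM API, and the config dicts are assumed to come from whatever single source renders both command lines.

```python
# Hypothetical field names for the KV-shaped settings both sides must share.
KV_SHAPED_FIELDS = ("model", "dtype", "block_size", "tensor_parallel_size")


def kv_mismatches(prefill_cfg: dict, decode_cfg: dict,
                  fields: tuple = KV_SHAPED_FIELDS) -> dict:
    """Return {field: (prefill_value, decode_value)} for every disagreement."""
    mismatches = {}
    for field in fields:
        p_val = prefill_cfg.get(field)
        d_val = decode_cfg.get(field)
        if p_val != d_val:
            mismatches[field] = (p_val, d_val)
    return mismatches


prefill = {"model": "Qwen/Qwen2.5-0.5B-Instruct", "dtype": "bfloat16",
           "block_size": 16, "tensor_parallel_size": 1, "port": 8100}
decode = {"model": "Qwen/Qwen2.5-0.5B-Instruct", "dtype": "float16",
          "block_size": 16, "tensor_parallel_size": 1, "port": 8200}

print(kv_mismatches(prefill, decode))  # {'dtype': ('bfloat16', 'float16')}
```

Failing the rollout on a non-empty result turns the silent garbage-tensor case into a loud one, which matters most during rolling upgrades when the two sides briefly run from different config revisions.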
Reflect
- Trace one request through every phase-15 artifact: lab-01's export point (P's first-token state), lab-03's toll (the 32 MiB), this run's two latency signatures. Which numbers change when the model is 8B? When the link is 10 GbE? (Freight ×8 via per-token KV; toll ratio per lab-03's tests — possibly fatal.)
- Why does the pair's p50 ITL match the single instance's? (Median decode steps were interference-free in both — chunking already protected them; the p99 was the casualty. Disaggregation buys tails, and SLOs are written on tails.)
- Sketch the 3-instance variant: 1 prefill, 2 decode, router balancing imports by free blocks (lab-01's going-further). What new metric does the router need from each D? (Free-block headroom — the loud-OOM check, exported as capacity signal.)
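For the 3-instance question, the routing decision can be sketched as a pure function over the capacity signal. This assumes each decode instance reports a free-block count to the router by some means not specified here; the DecodeInstance type and pick_decode name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DecodeInstance:
    name: str
    free_blocks: int  # capacity signal each D is assumed to export


def pick_decode(instances: list[DecodeInstance],
                blocks_needed: int) -> Optional[DecodeInstance]:
    """Choose the D with the most headroom that can absorb the import.

    Returns None when no instance fits, so the caller can queue or reject
    the request instead of triggering an OOM on the decode side.
    """
    candidates = [d for d in instances if d.free_blocks >= blocks_needed]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d.free_blocks)


pool = [DecodeInstance("d0", 90), DecodeInstance("d1", 400)]
print(pick_decode(pool, 127).name)  # d1
print(pick_decode(pool, 500))       # None
```

Picking the largest headroom is only one policy; the point is that the fit check happens in the router, before any KV moves.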
References
- upstream/examples/online_serving/disaggregated_serving/: the proxy + configs this lab assembles.
- upstream/vllm/distributed/kv_transfer/: connectors, roles, scheduler hooks.
- vLLM docs, Disaggregated Prefilling: https://docs.vllm.ai/en/latest/features/disagg_prefill/
- Labs 01 (the bookkeeping) and 03 (the economics) — this run is their joint integration test, per the course's GPU-lab custom.