Phase 15 — Disaggregated Serving


Don't Panic
Why this phase matters
What you'll learn
The map: where this lives in the real code
Labs in this phase
How to work this phase
Where you are

Don't Panic

Prefill and decode have opposite appetites: prefill is limited by compute, while decode is limited by memory bandwidth and runs for many more steps. Disaggregation runs them on separate machines (prefill servers and decode servers) and ships the KV cache between them, so each fleet can be tuned and scaled independently. This phase covers that split and the KV transfer that enables it.
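To put rough numbers on "opposite appetites", here is a back-of-the-envelope roofline sketch. It counts only weight traffic (it ignores attention over the KV cache and activation reads), and the hardware peaks are illustrative assumptions, not measurements of any particular GPU.

```python
# Rough roofline arithmetic. Hardware numbers below are assumptions.
PARAMS = 8e9           # parameters of an "8B" model
BYTES_PER_PARAM = 2    # fp16/bf16 weights

PEAK_FLOPS = 1.0e15    # assumed accelerator peak, FLOP/s
PEAK_BW = 3.0e12       # assumed memory bandwidth, bytes/s
RIDGE = PEAK_FLOPS / PEAK_BW  # FLOPs per byte needed to saturate compute


def arithmetic_intensity(tokens_per_pass: int) -> float:
    """FLOPs per byte of weight traffic for one forward pass.

    Each token costs about 2 FLOPs per parameter (multiply + add), while
    the weights are read from memory once per pass however many tokens
    share that pass.
    """
    flops = 2 * PARAMS * tokens_per_pass
    weight_bytes = PARAMS * BYTES_PER_PARAM
    return flops / weight_bytes


def bound(tokens_per_pass: int) -> str:
    """Classify a pass against the assumed ridge point."""
    if arithmetic_intensity(tokens_per_pass) >= RIDGE:
        return "compute-bound"
    return "memory-bound"


if __name__ == "__main__":
    print("prefill, 2048-token prompt:", bound(2048))
    print("decode, 1 token per step:  ", bound(1))
```

Under these assumptions a 2048-token prefill sits far above the ridge point and a single-sequence decode step sits far below it. Batching decodes raises their intensity, but each sequence still contributes only one token per step.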

Why this phase matters

P/D disaggregation is how the largest deployments hit tight TTFT and high throughput at the same time, and it is an active frontier of vLLM development. Understanding KV connectors also unlocks KV offloading and cross-engine caching.

What you'll learn

Why co-locating prefill+decode causes interference (prefill stalls decodes)
Prefill node -> KV transfer -> decode node; the request handoff
KV connectors: the transfer abstraction (NIXL, shared storage, etc.)
Encode disaggregation for multimodal
Routing / proxy between P and D fleets; load balancing
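The handoff in the list above (prefill node -> KV transfer -> decode node) can be sketched with a toy model. Everything here is hypothetical and for intuition only: `InMemoryConnector`, `toy_kv`, and `next_token` are stand-ins invented for this example, not vLLM APIs, and the "KV cache" is a list of integers rather than tensors.

```python
VOCAB = 100


def toy_kv(token: int, pos: int) -> int:
    """Stand-in for the per-token key/value tensors."""
    return (token * 31 + pos * 7) % 1009


def next_token(kv: list) -> int:
    """Toy 'model': the next token depends only on the KV cache."""
    return sum(kv) % VOCAB


class InMemoryConnector:
    """Hypothetical stand-in for a KV connector (NIXL, shared storage, ...)."""

    def __init__(self):
        self._store = {}

    def send(self, request_id: str, kv: list) -> None:
        self._store[request_id] = list(kv)  # copy, as a real transfer would

    def recv(self, request_id: str) -> list:
        return self._store.pop(request_id)


def prefill(prompt: list):
    """Build the KV cache for the whole prompt and emit the first token."""
    kv = [toy_kv(tok, pos) for pos, tok in enumerate(prompt)]
    return kv, next_token(kv)


def decode(kv: list, last_token: int, steps: int) -> list:
    """Generate `steps` tokens, extending the KV cache one entry per step."""
    kv, out, tok = list(kv), [], last_token
    for _ in range(steps):
        kv.append(toy_kv(tok, len(kv)))
        tok = next_token(kv)
        out.append(tok)
    return out


def colocated(prompt: list, n: int) -> list:
    kv, first = prefill(prompt)
    return [first] + decode(kv, first, n - 1)


def disaggregated(prompt: list, n: int, connector, request_id="req-0") -> list:
    kv, first = prefill(prompt)             # runs on the prefill node
    connector.send(request_id, kv)          # the KV transfer
    remote_kv = connector.recv(request_id)  # runs on the decode node
    return [first] + decode(remote_kv, first, n - 1)


if __name__ == "__main__":
    prompt = [1, 2, 3]
    assert colocated(prompt, 5) == disaggregated(prompt, 5, InMemoryConnector())
    print("continuation identical after handoff")
```

The point of the sketch is the invariant: if the decode node receives exactly the KV the prefill node computed, the continuation cannot differ from the co-located one.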

The map: where this lives in the real code

Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md) walks through the important ones line by line.

vllm/distributed/kv_transfer/ — The KV connector framework (the heart of disagg).
vllm/distributed/kv_transfer/kv_connector/v1/ — V1 connectors (base + implementations).
vllm/v1/core/sched/scheduler.py — Search 'connector' / 'WAITING_FOR_REMOTE_KVS' to see async KV load.
examples/ — Look for disaggregated-prefill example scripts/configs.
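Before opening scheduler.py, it may help to see the shape of an asynchronous KV load in miniature. The sketch below is not the upstream code: only the status name `WAITING_FOR_REMOTE_KVS` comes from the pointer above, while `ToyScheduler`, its methods, and the `finished_recving` argument are hypothetical names chosen for illustration.

```python
from enum import Enum, auto


class Status(Enum):
    WAITING = auto()
    WAITING_FOR_REMOTE_KVS = auto()
    RUNNING = auto()


class ToyScheduler:
    """Hypothetical scheduler that parks requests until their KV arrives."""

    def __init__(self):
        self.status = {}  # request id -> Status, in arrival order

    def add_request(self, req_id: str, needs_remote_kv: bool) -> None:
        self.status[req_id] = (
            Status.WAITING_FOR_REMOTE_KVS if needs_remote_kv else Status.WAITING
        )

    def step(self, finished_recving: set) -> list:
        """One scheduling pass.

        `finished_recving` holds the ids whose KV transfer has completed,
        as a connector might report it. Returns the ids scheduled this pass.
        """
        scheduled = []
        for req_id, st in self.status.items():
            if st is Status.WAITING_FOR_REMOTE_KVS:
                if req_id not in finished_recving:
                    continue  # transfer in flight: skip, do not block others
                st = Status.WAITING  # KV has landed; treat as schedulable
            if st is Status.WAITING:
                self.status[req_id] = Status.RUNNING
                scheduled.append(req_id)
        return scheduled


if __name__ == "__main__":
    sched = ToyScheduler()
    sched.add_request("remote", needs_remote_kv=True)
    sched.add_request("local", needs_remote_kv=False)
    print(sched.step(finished_recving=set()))       # remote is skipped
    print(sched.step(finished_recving={"remote"}))  # remote now runs
```

The design idea to look for in the real scheduler is the same one shown here: a request waiting on a remote transfer is skipped rather than blocking the queue, so local work keeps flowing while the KV is in flight.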

Labs in this phase

lab-01-kv-handoff [CPU-OK] — migrate a live request between two mini_vllm engines (export/import + the KV block bill) and prove the continuation token-for-token identical.
lab-02-pd-pair [GPU-OPT] — a real producer/consumer pair with a KV connector: TTFT +10% (the toll), ITL p99 3× better (the interference, gone). Captured output included.
lab-03-disagg-economics [CPU-OK] — the trade in five functions: 256 MiB of freight per 2048-token 8B prompt, ~11 ms on fast fabric vs ~215 ms on 10 GbE, and the decision function that says no two different ways.
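The lab-03 figures above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions, not the lab's own code: the model shape (32 layers, 8 KV heads, head dimension 128, fp16) is a typical 8B configuration with grouped-query attention, the 25 GB/s "fast fabric" figure is an assumption chosen because it matches the quoted ~11 ms, and 10 GbE is taken at its 1.25 GB/s line rate with no protocol overhead.

```python
def kv_bytes_per_token(layers: int = 32, kv_heads: int = 8,
                       head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies (the factor 2 is K plus V)."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def transfer_ms(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Ideal wire time in milliseconds, ignoring latency and overhead."""
    return num_bytes / bandwidth_bytes_per_s * 1e3


FAST_FABRIC = 25e9   # assumed: 25 GB/s, roughly a 200 Gb/s link
TEN_GBE = 1.25e9     # 10 Gb/s Ethernet at line rate

if __name__ == "__main__":
    payload = kv_bytes_per_token() * 2048
    print(f"per token:   {kv_bytes_per_token() / 2**10:.0f} KiB")
    print(f"2048 tokens: {payload / 2**20:.0f} MiB")
    print(f"fast fabric: {transfer_ms(payload, FAST_FABRIC):.1f} ms")
    print(f"10 GbE:      {transfer_ms(payload, TEN_GBE):.1f} ms")
```

With these assumptions the payload is 128 KiB per token and 256 MiB per prompt, and the wire time comes out near 10.7 ms on the fast link versus about 215 ms on 10 GbE. That 20x gap is why the interconnect, not the GPUs, often decides whether disaggregation pays off.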

See labs/README.md for the recommended order (01 → 03 → 02) and how to run them.

How to work this phase

Read this guide for intuition.
Read 01-deep-dive.md with the upstream/ files open.
Do 02-mini-build.md — build the mini_vllm piece yourself.
Run the labs, then attempt EXERCISES.md.
Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.

Where you are

This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully worked, line-by-line treatment (with starter/solution/test code in every lab) follows the gold standard set by the flagship phases, Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.