Phase 15 — Disaggregated Serving
← Phase 14 · Course home · Phase 16 →
Contents
- Don't Panic
- Why this phase matters
- What you'll learn
- The map: where this lives in the real code
- Labs in this phase
- How to work this phase
- Where you are
Don't Panic
Prefill and decode have opposite appetites: prefill wants compute, decode wants memory bandwidth and runs much longer. Disaggregation runs them on SEPARATE machines — prefill servers and decode servers — and ships the KV cache between them. Each fleet is tuned and scaled independently. This phase is that split and the KV transfer that enables it.
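The "opposite appetites" claim can be made concrete with a back-of-the-envelope roofline estimate. The sketch below is illustrative only: it assumes weight reads dominate memory traffic, ignores attention and KV-cache reads, and uses made-up accelerator numbers (`PEAK_FLOPS`, `MEM_BANDWIDTH`) rather than the specs of any real device.

```python
# Rough roofline sketch: FLOPs per byte of weights read in one forward pass.
# Assumptions (not from the course material): dense model, fp16 weights,
# weight traffic dominates, attention/KV-cache traffic ignored.

PEAK_FLOPS = 1.0e15      # hypothetical accelerator: 1000 TFLOP/s
MEM_BANDWIDTH = 3.0e12   # hypothetical accelerator: 3 TB/s


def arithmetic_intensity(params: float, tokens_per_pass: int,
                         bytes_per_param: int = 2) -> float:
    """FLOPs performed per byte of weights moved in one forward pass."""
    flops = 2.0 * params * tokens_per_pass   # one multiply-add per param per token
    bytes_moved = params * bytes_per_param   # every weight is read once per pass
    return flops / bytes_moved


def bound_by(intensity: float) -> str:
    """Compare against the ridge point peak_flops / bandwidth."""
    ridge = PEAK_FLOPS / MEM_BANDWIDTH       # about 333 FLOPs/byte here
    return "compute" if intensity >= ridge else "memory bandwidth"


prefill = arithmetic_intensity(8e9, tokens_per_pass=2048)  # whole prompt at once
decode = arithmetic_intensity(8e9, tokens_per_pass=1)      # one token per step

print(f"prefill: {prefill:.0f} FLOPs/byte, bound by {bound_by(prefill)}")
print(f"decode:  {decode:.0f} FLOPs/byte, bound by {bound_by(decode)}")
```

Under these assumptions, a 2048-token prefill reuses each weight 2048 times per read while a single decode step reuses it once. That gap is why one hardware or batching configuration rarely suits both phases.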
Why this phase matters
Prefill/decode (P/D) disaggregation is how the largest deployments hit both a tight time-to-first-token (TTFT) and high throughput at once, and it is an area of active development in vLLM. Understanding KV connectors also unlocks KV offloading and cross-engine caching.
What you'll learn
- Why co-locating prefill+decode causes interference (prefill stalls decodes)
- Prefill node -> KV transfer -> decode node; the request handoff
- KV connectors: the transfer abstraction (NIXL, shared storage, etc.)
- Encode disaggregation for multimodal
- Routing / proxy between P and D fleets; load balancing
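The handoff in the second bullet can be sketched with a toy model. Everything below is hypothetical: `ToyEngine`, `toy_kv`, and `toy_next_token` are illustrative stand-ins, not the mini_vllm or vLLM API. The only point is that if the next token depends solely on the KV cache, shipping that cache to another engine leaves the continuation unchanged.

```python
from dataclasses import dataclass, field


def toy_kv(token: int, position: int) -> int:
    """Stand-in for the K/V tensors of one token (hypothetical)."""
    return (token * 31 + position * 17) % 1009


def toy_next_token(cache: list) -> int:
    """Stand-in for attention + greedy sampling: depends only on the cache."""
    return sum((i + 1) * v for i, v in enumerate(cache)) % 50


@dataclass
class ToyEngine:
    kv: dict = field(default_factory=dict)  # request id -> per-token KV entries

    def prefill(self, req_id: str, prompt: list) -> int:
        """Build the KV cache for the whole prompt and emit the first token."""
        self.kv[req_id] = [toy_kv(t, i) for i, t in enumerate(prompt)]
        return toy_next_token(self.kv[req_id])

    def decode_step(self, req_id: str, last_token: int) -> int:
        """Append KV for the last token, then emit the next one."""
        cache = self.kv[req_id]
        cache.append(toy_kv(last_token, len(cache)))
        return toy_next_token(cache)

    def export_kv(self, req_id: str) -> list:
        """Hand the cache off; the prefill side frees its copy."""
        return list(self.kv.pop(req_id))

    def import_kv(self, req_id: str, entries: list) -> None:
        self.kv[req_id] = list(entries)


def generate(prefill_engine: ToyEngine, decode_engine: ToyEngine,
             prompt: list, n_tokens: int) -> list:
    req_id = "r0"
    token = prefill_engine.prefill(req_id, prompt)
    if decode_engine is not prefill_engine:
        # The KV transfer: the only state that crosses the P/D boundary.
        decode_engine.import_kv(req_id, prefill_engine.export_kv(req_id))
    output = [token]
    for _ in range(n_tokens - 1):
        token = decode_engine.decode_step(req_id, token)
        output.append(token)
    return output


prompt = [5, 11, 42, 7]
shared = ToyEngine()
colocated = generate(shared, shared, prompt, 8)
disaggregated = generate(ToyEngine(), ToyEngine(), prompt, 8)
print(colocated == disaggregated)  # True
```

A real connector moves tensors in blocks over a network or shared storage and has to handle scheduling while the transfer is in flight. The toy keeps only the invariant that matters for correctness.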
The map: where this lives in the real code
Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see
UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md)
walks through the important ones line by line.
- vllm/distributed/kv_transfer/: The KV connector framework (the heart of disagg).
- vllm/distributed/kv_transfer/kv_connector/v1/: V1 connectors (base + implementations).
- vllm/v1/core/sched/scheduler.py: Search 'connector' / 'WAITING_FOR_REMOTE_KVS' to see async KV load.
- examples/: Look for disaggregated-prefill example scripts/configs.
Labs in this phase
- lab-01-kv-handoff [CPU-OK]: migrate a live request between two mini_vllm engines (export/import + the KV block bill) and prove the continuation token-for-token identical.
- lab-02-pd-pair [GPU-OPT]: a real producer/consumer pair with a KV connector; TTFT +10% (the toll), ITL p99 3× better (the interference, gone). Captured output included.
- lab-03-disagg-economics [CPU-OK]: the trade in five functions; 256 MiB of freight per 2048-token 8B prompt, ~11 ms on fast fabric vs ~215 ms on 10 GbE, and the decision function that says no two different ways.
See labs/README.md for the recommended order (01 → 03 → 02) and how to run them.
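The lab-03 figures can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions, not the lab's code: the model shape (32 layers, 8 KV heads, head dimension 128, fp16) is an assumed Llama-3-8B-like configuration, and "fast fabric" is assumed to be a 200 Gb/s (25 GB/s) link. Both were chosen because they match the quoted numbers. Link overheads and protocol latency are ignored.

```python
# KV cache size and wire time for one prompt.
# Assumed shape (not stated in the guide): 32 layers, 8 KV heads,
# head_dim 128, fp16. Assumed fast fabric: 200 Gb/s = 25 GB/s.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int) -> int:
    """Bytes of K plus V stored for a single token across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes  # 2 = K and V


def transfer_ms(num_bytes: int, link_bytes_per_s: float) -> float:
    """Ideal wire time in milliseconds at full link bandwidth."""
    return 1000.0 * num_bytes / link_bytes_per_s


per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
freight = per_token * 2048                      # 2048-token prompt

print(per_token // 1024, "KiB per token")       # 128 KiB per token
print(freight // 2**20, "MiB per prompt")       # 256 MiB per prompt
print(round(transfer_ms(freight, 25e9), 1))     # 10.7  (ms, assumed 200 Gb/s fabric)
print(round(transfer_ms(freight, 10e9 / 8), 1)) # 214.7 (ms, 10 GbE)
```

The roughly 20× gap between the two links is the bandwidth ratio (25 GB/s vs 1.25 GB/s). Whether either cost is acceptable depends on how it compares with the prefill time it sits next to, which is what the lab's decision function weighs.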
How to work this phase
- Read this guide for intuition.
- Read 01-deep-dive.md with the upstream/ files open.
- Do 02-mini-build.md: build the mini_vllm piece yourself.
- Run the labs, then attempt EXERCISES.md.
- Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.
Where you are
This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully worked, line-by-line treatment (with starter/solution/test code in every lab) follows the gold standard set by the flagship phases, Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.