Phase 15 — Disaggregated Serving

Phase 14 · Course home · Phase 16


Don't Panic

Prefill and decode have opposite appetites: prefill is compute-bound, while decode is bound by memory bandwidth and runs for many more steps. Disaggregation runs them on SEPARATE machines (prefill servers and decode servers) and ships the KV cache between them, so each fleet can be tuned and scaled independently. This phase covers that split and the KV transfer that enables it.
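
As a minimal sketch of that handoff, the toy below splits one generation across a prefill worker and a decode worker, then checks the result against a single-worker run. Every name here is hypothetical (`PrefillWorker`, `DecodeWorker`, `KVPayload`, and the toy arithmetic standing in for attention and sampling); it is not vLLM or mini_vllm code.

```python
from dataclasses import dataclass


def toy_kv(position, token):
    # Hypothetical stand-in for the per-token key/value a real model computes.
    return (token * 31 + position, token * 17 + position)


def toy_next_token(kv_entries):
    # Hypothetical stand-in for attention + sampling: reads the whole cache.
    return sum(k + v for k, v in kv_entries) % 1000


@dataclass
class KVPayload:
    """What the prefill side ships to the decode side (toy version)."""
    request_id: str
    kv_entries: list   # one (key, value) pair per prompt token
    first_token: int   # produced by prefill, so TTFT is paid on the prefill side


class PrefillWorker:
    def prefill(self, request_id, prompt_tokens):
        kv = [toy_kv(i, t) for i, t in enumerate(prompt_tokens)]
        return KVPayload(request_id, kv, toy_next_token(kv))


class DecodeWorker:
    def decode(self, payload, max_new_tokens):
        kv = list(payload.kv_entries)  # "import" the transferred cache
        out = [payload.first_token]
        while len(out) < max_new_tokens:
            kv.append(toy_kv(len(kv), out[-1]))  # extend the cache locally
            out.append(toy_next_token(kv))
        return out


def colocated_generate(prompt_tokens, max_new_tokens):
    # Reference path: the same math on one worker, with no handoff.
    kv = [toy_kv(i, t) for i, t in enumerate(prompt_tokens)]
    out = [toy_next_token(kv)]
    while len(out) < max_new_tokens:
        kv.append(toy_kv(len(kv), out[-1]))
        out.append(toy_next_token(kv))
    return out


prompt = [101, 7, 42, 9]
payload = PrefillWorker().prefill("req-0", prompt)
disaggregated = DecodeWorker().decode(payload, max_new_tokens=5)
reference = colocated_generate(prompt, max_new_tokens=5)
print(disaggregated == reference)
```

The point of the sketch is the shape of the contract: the decode side needs only the cache and the first token, not the prompt's forward pass, which is why the two fleets can be sized separately.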

Why this phase matters

P/D disaggregation is how the largest deployments hit both tight time-to-first-token (TTFT) and high throughput at once, and it is an area of active development in vLLM. Understanding KV connectors also unlocks KV offloading and cross-engine caching.

What you'll learn

  • Why co-locating prefill+decode causes interference (prefill stalls decodes)
  • Prefill node -> KV transfer -> decode node; the request handoff
  • KV connectors: the transfer abstraction (NIXL, shared storage, etc.)
  • Encode disaggregation for multimodal
  • Routing / proxy between P and D fleets; load balancing
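
The first bullet, interference, can be made concrete with a toy timeline. The sketch below assumes a co-located engine in which every Nth scheduler step also carries a prefill, so the decodes in that step wait for it. All numbers (20 ms decode step, 180 ms prefill, one prefill per 50 steps) are illustrative assumptions, not measurements from the labs.

```python
import math


def inter_token_latencies(num_steps, decode_ms, prefill_ms, prefill_every):
    """Inter-token latency (ITL) per decode step.

    prefill_every=0 models a decode-only server (no prefill shares the engine).
    Otherwise every `prefill_every`-th step also pays for a prefill.
    """
    itls = []
    for step in range(1, num_steps + 1):
        stalled = prefill_every > 0 and step % prefill_every == 0
        itls.append(decode_ms + (prefill_ms if stalled else 0.0))
    return itls


def percentile(values, pct):
    # Nearest-rank percentile; enough for a toy comparison.
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


colocated = inter_token_latencies(1000, decode_ms=20.0, prefill_ms=180.0, prefill_every=50)
decode_only = inter_token_latencies(1000, decode_ms=20.0, prefill_ms=180.0, prefill_every=0)

for name, itls in [("co-located", colocated), ("decode-only", decode_only)]:
    print(name, "p50:", percentile(itls, 50), "p99:", percentile(itls, 99))
```

In this toy the median is identical for both setups and only the tail moves, which is the pattern to look for in real ITL histograms: interference shows up at p99 long before it shows up at p50.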

The map: where this lives in the real code

Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md) walks through the important ones line by line.

Labs in this phase

  • lab-01-kv-handoff [CPU-OK] — migrate a live request between two mini_vllm engines (export/import + the KV block bill) and prove the continuation token-for-token identical.
  • lab-02-pd-pair [GPU-OPT] — a real producer/consumer pair with a KV connector: TTFT +10% (the toll), ITL p99 3× better (the interference, gone). Captured output included.
  • lab-03-disagg-economics [CPU-OK] — the trade in five functions: 256 MiB of freight per 2048-token 8B prompt, ~11 ms on fast fabric vs ~215 ms on 10 GbE, and the decision function that says no two different ways.
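
The lab-03 figures can be reproduced with back-of-envelope arithmetic. The sketch below assumes a Llama-3-8B-style shape (32 layers, 8 KV heads, head dimension 128, 2-byte fp16/bf16 cache) and link speeds of 25 GB/s for the fast fabric and 1.25 GB/s for 10 GbE. Those shapes and bandwidths are assumptions chosen to match the numbers quoted above, and the model ignores latency and protocol overhead.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def transfer_ms(num_bytes, bandwidth_bytes_per_s):
    # Pure bandwidth model: no per-message latency, no overhead.
    return num_bytes / bandwidth_bytes_per_s * 1000


# Assumed model shape (Llama-3-8B-style) and prompt length.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
prompt_tokens = 2048
kv_total = per_token * prompt_tokens

kv_mib = kv_total / 2**20
fast_ms = transfer_ms(kv_total, 25e9)     # assumed ~200 Gb/s fabric
slow_ms = transfer_ms(kv_total, 1.25e9)   # 10 GbE at line rate

print(f"{per_token // 1024} KiB per token, {kv_mib:.0f} MiB per prompt")
print(f"fast fabric: {fast_ms:.1f} ms, 10 GbE: {slow_ms:.1f} ms")
```

Under these assumptions the cache costs 128 KiB per token, so the transfer bill scales linearly with prompt length and inversely with link bandwidth.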

See labs/README.md for the recommended order (01 → 03 → 02) and how to run them.

How to work this phase

  1. Read this guide for intuition.
  2. Read 01-deep-dive.md with the upstream/ files open.
  3. Do 02-mini-build.md — build the mini_vllm piece yourself.
  4. Run the labs, then attempt EXERCISES.md.
  5. Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.

Where you are

This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully worked, line-by-line treatment (with starter/solution/test code in every lab) follows the gold standard set by the flagship phases, Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.
