Phase 15 — Interview Questions: Disaggregated Serving

Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)

Q1. Why disaggregate prefill and decode?

Model answer

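Prefill and decode have different resource profiles. Prefill processes the whole prompt in one pass and is compute-bound; decode generates one token per step and is bound by memory bandwidth, because the weights and KV cache are re-read on every step. When co-located they interfere: a long prefill stalls in-flight decodes, which shows up as inter-token latency spikes. Splitting them lets you scale and tune each fleet independently (more compute for prefill to cut TTFT, more memory bandwidth and more instances for decode throughput), at the cost of transferring the KV cache between them.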
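The "different resource profiles" claim can be made concrete with a roofline-style estimate. The sketch below is a back-of-the-envelope approximation, not a measurement: it counts only weight reads and roughly 2 FLOPs per parameter per token, ignoring attention FLOPs and KV-cache traffic. The function names and the hardware numbers are hypothetical, chosen only for illustration.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """FLOPs per weight byte read in one forward pass (weights-only approximation).

    Assumes ~2 FLOPs per parameter per token and that each weight is read once
    per pass, so the parameter count cancels out of the ratio.
    """
    return 2 * tokens_per_pass / bytes_per_param


def is_compute_bound(tokens_per_pass: int, peak_flops: float,
                     mem_bandwidth: float, bytes_per_param: int = 2) -> bool:
    """True if the pass sits at or above the roofline ridge point (FLOPs per byte)."""
    ridge_point = peak_flops / mem_bandwidth
    return arithmetic_intensity(tokens_per_pass, bytes_per_param) >= ridge_point


# Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 B/s memory bandwidth.
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 3e12

# Prefill: one pass over a 2048-token prompt.
prefill_compute_bound = is_compute_bound(2048, PEAK_FLOPS, MEM_BANDWIDTH)
# Decode: one pass produces one token for each of 32 batched sequences.
decode_compute_bound = is_compute_bound(32, PEAK_FLOPS, MEM_BANDWIDTH)

print("prefill compute-bound:", prefill_compute_bound)
print("decode compute-bound:", decode_compute_bound)
```

Under these assumptions the prefill pass lands above the ridge point and the decode pass well below it, which is the asymmetry that motivates sizing the two fleets differently.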

Q2. What's the main cost/risk of disaggregation?

Model answer

Shipping the KV cache from the prefill node to the decode node adds latency and network bandwidth pressure, so disaggregation only pays off when the interference it removes costs more than the transfer. It also adds routing and orchestration complexity and new failure modes, for example a decode node stalled waiting on remote KV.
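To put numbers on the transfer cost in an interview, it helps to size the KV cache and divide by link bandwidth. The following is a minimal sketch under stated assumptions: the model shape (32 layers, 8 KV heads, head dimension 128, fp16), the 25 GB/s link, and all function names are hypothetical, and the break-even check is a deliberate simplification of the "savings exceed transfer cost" condition.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one sequence: one K and one V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def transfer_seconds(num_bytes: int, link_bytes_per_s: float,
                     fixed_overhead_s: float = 0.0) -> float:
    """Idealized transfer time: fixed setup overhead plus bytes over bandwidth."""
    return fixed_overhead_s + num_bytes / link_bytes_per_s


def disaggregation_pays_off(interference_saved_s: float, transfer_s: float) -> bool:
    """Simplified break-even: latency saved by removing interference must exceed transfer time."""
    return interference_saved_s > transfer_s


# Hypothetical model shape and a 4096-token prompt, fp16 cache.
size = kv_cache_bytes(num_tokens=4096, num_layers=32, num_kv_heads=8, head_dim=128)
# Hypothetical 25 GB/s (200 Gb/s) link between prefill and decode nodes.
t = transfer_seconds(size, link_bytes_per_s=25e9)

print(f"KV cache: {size / 2**20:.0f} MiB, transfer: {t * 1e3:.1f} ms")
print("pays off if 50 ms of stalls are avoided:", disaggregation_pays_off(0.050, t))
```

The useful takeaway is the scaling: KV size grows linearly with prompt length, so long prompts raise both the interference a co-located prefill would cause and the bytes that must cross the network.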

Going deeper

The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.

vLLM Mastery — From Zero to Maintainer
