Phase 10 — The Hitchhiker's Guide to Distributed Inference

← Phase 09 · Course home · Phase 11 →

Don't Panic
Step 1: A team analogy
Step 2: Tensor parallelism, concretely (the one to really get)
Step 3: Pipeline parallelism and the bubble
Step 4: DP, EP, CP in one line each
Step 5: Who runs all this in vLLM
The invariants to memorize
What you'll do

Don't Panic

A big model doesn't fit on one GPU, or you want it to run faster than one GPU can. So you split the work across several GPUs. The only question is how you split — and there are a few distinct ways, each with its own pattern of GPU-to-GPU chatter. This phase is "which split when, and what crosses the wires." Get it wrong and half your GPUs sit idle talking to each other; get it right and you serve models no single GPU could hold.

A useful way to picture a model: a tall stack of layers, each layer a big multiplication. Now imagine a team of GPUs working on it. There are five ways to divide the labor:

TP (tensor parallel):   split EACH layer's math across GPUs   (everyone works on every token)
PP (pipeline parallel): give each GPU some LAYERS             (token flows GPU0 → GPU1 → GPU2)
DP (data parallel):     give each GPU a full COPY             (split the USERS across copies)
EP (expert parallel):   put different MoE EXPERTS on each GPU  (route tokens to their expert's GPU)
CP (context parallel):  split ONE long sequence's CONTEXT     (each GPU holds part of the history)

You'll mostly reason about TP and PP (the big two), so we go deepest there.

Step 1: A team analogy

Imagine translating a huge book with a team:

Tensor parallelism (TP) — everyone works on the same page at once, each person doing part of the work on that page, then they combine notes before moving on. Fast per page, but they have to talk constantly (combine notes after every step). Only works if they're in the same room (fast links — NVLink inside one machine).
Pipeline parallelism (PP) — an assembly line: person 1 does chapters 1–3, hands off to person 2 for chapters 4–6, etc. Little talking (just hand the page along), works across rooms (across machines), but person 2 is idle until person 1 finishes the first page (a bubble).
Data parallelism (DP) — everyone has their own copy of the whole book and translates different readers' requests. No coordination on the work itself; you just send each reader to whoever's free. Scales throughput, needs the model to fit on one GPU.

Step 2: Tensor parallelism, concretely (the one to really get)

Every layer is essentially y = x · W (a matrix multiply). TP splits W across GPUs. There are two flavors, and the clever part is how they pair up.

Column-parallel — split W by output columns. Each GPU computes part of the output:

GPU0 computes y[:, left half]      GPU1 computes y[:, right half]
result: glue the halves together (an "all-gather")

Row-parallel — split W by input rows, and split the input too. Each GPU computes a partial of the whole output, and you add them up:

GPU0: y0 = x[:, left] · W[left, :]    GPU1: y1 = x[:, right] · W[right, :]
result: y = y0 + y1   (an "all-reduce" — everyone shares and sums)

The trick vLLM uses: in a transformer block, do the first matmul column-parallel and the second row-parallel. The column-parallel output stays split (no gluing needed), feeds straight into the row-parallel input, and you pay just one all-reduce at the end of the block instead of two communications. You'll implement exactly this in lab-01 and prove the multi-GPU result equals the single-GPU one — bit for bit.

Why TP needs fast links: that all-reduce happens every layer (dozens of times per token). If the GPUs aren't connected by something fast (NVLink), the chatter dominates and TP is slow. Rule of thumb: TP within a machine, PP across machines.

🆕 New words: all-reduce (every GPU sends its partial result and everyone gets the sum), all-gather (every GPU shares its piece and everyone gets the concatenation), collective (any such group communication, run by a library called NCCL).

Step 3: Pipeline parallelism and the bubble

PP puts layers 1–16 on GPU0 and 17–32 on GPU1. A token's data flows GPU0 → GPU1. The problem: while GPU0 works on the first chunk, GPU1 has nothing to do yet (the pipeline bubble). The fix is micro-batches: chop the work into many small pieces so that once the pipeline fills, every GPU is always busy on some piece. PP's communication is cheap (just pass activations along, GPU→GPU), so it scales across machines — at the cost of a little latency and bubble overhead.

GPU0: [mb1][mb2][mb3][mb4]
GPU1:      [mb1][mb2][mb3][mb4]   ← starts late (bubble), then stays busy

Step 4: DP, EP, CP in one line each

Data parallelism (DP) — replicate the model; route different requests to different replicas. Pure throughput scaling; needs the model to fit on one GPU (or one TP group). vLLM also uses DP for MoE attention (run attention data-parallel while experts are expert-parallel).
Expert parallelism (EP) — for MoE (Phase 7): put different experts on different GPUs; an all-to-all ships each token to its expert's GPU and back. Scales expert count; watch load balance.
Context parallelism (CP) — split a single very long sequence's context (its KV cache) across GPUs, so you can serve contexts too long for one GPU's memory.

Real large deployments combine these: e.g. TP=8 within a node, PP=2 across two nodes, DP to add replicas, EP for the MoE layers. Picking the combination for a given model + SLA + cluster is a defining staff-level decision.

Step 5: Who runs all this in vLLM

From Phase 1: EngineCore → Executor → Worker → ModelRunner. For multi-GPU, the Executor becomes a MultiprocExecutor that owns N worker processes, one per GPU. Each worker holds its shard of the model and runs the same step in lockstep; the collectives (all-reduce etc.) happen inside the layers (ColumnParallelLinear/RowParallelLinear). The "who is rank 0, which GPUs form the TP group" bookkeeping lives in parallel_state.py. The beauty: the model code is identical — it just uses parallel layers, and the engine fans out. That's why the same vLLM runs on 1 or 64 GPUs.

The invariants to memorize

TP splits each layer (all-reduce every layer; NVLink-hungry; latency win; within a node).
PP splits layers (cheap point-to-point; bubbles; scales across nodes).
DP replicates + routes requests (throughput; model must fit).
EP spreads MoE experts (all-to-all; load balance). CP splits one sequence's context.
TP pattern: column-parallel then row-parallel → one all-reduce per block, and the multi-GPU result is identical to single-GPU.
Big deployments combine them; the Executor fans out to one worker process per GPU.

What you'll do

Read: 01-deep-dive.md — parallel_state, the collectives, the parallel Linear layers, and the multiproc executor, line-anchored.
Build: 02-mini-build.md — column/row-parallel matmul in numpy; prove the all-reduce reconstructs the single-GPU result.
Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
- lab-01-tp-sharding-math [CPU-OK] — implement TP and verify it equals the unsharded result.
- lab-02-two-way-tp [GPU-OPT] — run tensor_parallel_size=2; observe the memory split (captured).
- lab-03-tp-comm-cost [CPU-OK] — the ring all-reduce cost model: derive "TP within a node, never across" as an assert, and the decode-latency vs prefill-bandwidth regime split.
- lab-04-pipeline-bubble [CPU-OK] — the PP bubble (p−1)/(p+m−1), derived as algebra AND as a simulated schedule grid that must reconcile exactly; why PP needs deep batching.
Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.