Lab 10-04 — The Pipeline-Parallel Bubble [CPU-OK]
Lab-03 closed one door: TP across slow links is fatal — 64 latency-bound all-reduces
per token see to that. Pipeline parallelism is what's left when the model is too big
for one node: cut by layers into stages, and the only communication is handing one
activation tensor to the next stage — point-to-point, once per stage boundary,
indifferent to link latency in a way TP can only envy. The catch has a name and a
closed form: the bubble — (p−1)/(p+m−1) of the pipeline's capacity idles during
fill and drain — and this lab has you derive it twice: as algebra, and as a schedule
grid you build cell by cell, where the two derivations must reconcile exactly (the
test counts idle cells and divides). One microbatch through four stages wastes 75% of
the hardware; the entire craft of PP is making m large enough that the formula
forgives you.
Contents
- Why this lab exists
- Background: the fill-drain geometry
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
PP is the parallelism people deploy reluctantly — and the bubble formula is the entire content of that reluctance, so you should own it cold. It answers the deployment questions TP's bill (lab-03) leaves open: two nodes with no fast interconnect and a model that fits neither — PP works, but only if your workload keeps enough microbatches in flight (the test pins it: p=8 stages under a 10% bubble budget needs 63 concurrent microbatches — a number that should make you pause before proposing PP for a low-traffic latency-sensitive service). PP's economics are batch economics; knowing the formula means knowing instantly which workloads can pay.
The schedule-grid half of the lab is the more transferable skill: pipeline reasoning is the grid (stages × ticks, diagonal occupancy), and every scheduling refinement in the literature — 1F1B, interleaved stages, zero-bubble schedules — is a rearrangement of this grid you can draw and count. Build the simulator once and those papers become pictures.
Background: the fill-drain geometry
Microbatch b occupies stage s at tick s + b — a diagonal sweeping through a
p × (p+m−1) grid. Everything follows from counting cells:
total ticks = p + m − 1 (m diagonals, offset by one each)
useful cells = m · p (every microbatch visits every stage once)
capacity = p · (p + m − 1)
bubble = 1 − useful/capacity = (p − 1)/(p + m − 1)
Read the formula's two limits like an engineer: m = 1 → bubble (p−1)/p — a single
request through a deep pipeline uses one GPU's worth of an 8-GPU rack (which is why PP
does nothing for single-stream latency: total ticks ≥ p regardless — latency through
a pipeline is the pipeline's depth); m → ∞ → bubble → 0 — at high concurrency the
fill/drain cost amortizes to noise. PP converts throughput into efficiency and has
nothing to offer latency. That asymmetry — exactly opposite to TP, which buys latency
and taxes every step — is why the two compose rather than compete (TP inside the node,
PP across; vLLM's tensor_parallel_size × pipeline_parallel_size grid).
For inference specifically, "microbatch" maps onto the continuous-batching engine
naturally: each scheduler step's batch flows through the stages, and a busy engine
(Phase 3's full queues) keeps every stage fed — inference PP at high load lives near
the good end of the formula. The bad end is a quiet engine: requests trickle in,
stages idle, and the p99 user pays p stage-latencies regardless.
Files
starter.py—pipeline_total_ticks,bubble_fraction,simulate_schedule,min_microbatches_for_bubble. Your work.solution.py— reference.test_lab.py— the formulas, the grid's diagonal structure, the exact grid-vs-formula reconciliation, serial-stage discipline, and the bubble-budget inversion.
Run
LAB_IMPL=starter pytest phase-10-distributed-inference/labs/lab-04-pipeline-bubble -q
pytest phase-10-distributed-inference/labs/lab-04-pipeline-bubble -q # reference
What the tests prove
| Test | What it pins |
|---|---|
test_total_ticks | p + m − 1, including the m=1 pure-latency case |
test_bubble_formula | The closed form at its corners: no pipeline → 0; one microbatch through 4 stages → 75%; m ≫ p → vanishing |
test_schedule_grid_matches_the_formula | The reconciliation: idle cells counted in the simulated grid ÷ capacity equals bubble_fraction exactly. Two independent derivations agreeing is what "I understand this formula" means |
test_stage_never_runs_two_microbatches_at_once | The serial-worker constraint that makes the diagonal the only schedule (until you interleave — see notes) |
test_min_microbatches_for_a_bubble_budget | The inversion you'll actually use: p=8, 10% budget → m = 63; deeper pipelines need proportionally deeper batching |
Hitchhiker's notes
- Where PP's communication bill hides: one activation tensor per microbatch per
stage boundary —
batch_tokens × hidden × dtypebytes,p − 1times per step total (not per layer!). Run lab-03's arithmetic on it: even over 10 GbE, a decode microbatch's 8 KB handoff is microseconds, and there are 7 of them instead of 64 all-reduces. That's the entire "PP tolerates slow links" argument, quantified with the model you already built. - Interleaved stages (each GPU holds several non-contiguous layer chunks) shrink
the bubble by making
pvirtual stages cheaper to fill — the formula becomes(p−1)/(p·v + m − 1)-flavored withvchunks per GPU, at the cost ofv×more handoffs. Zero-bubble schedules (training-side) rearrange backward passes — inference, having no backward, mostly cares about the plain formula you built. - The KV-cache wrinkle inference adds: each stage holds the KV for its layers
only — PP splits cache naturally, like weights. But a request's tokens revisit
stage 0 every decode step, so PP decode is a loop through the pipeline, not a
single pass: steady-state decode keeps all stages busy only if the in-flight
request count ≥ p. Same formula, with
m= concurrent requests — Phase 3'smax_num_seqsacquires a new lower bound. - vLLM specifics:
pipeline_parallel_sizeshards layers across nodes (Ray or multiprocessing); the V1 engine overlaps stage execution with its async scheduling. PP support historically lagged TP in vLLM precisely because continuous batching × pipelining is bookkeeping-heavy — readingupstream/vllm/distributed/and the executor's PP paths after this lab, you'll recognize the grid under the code.
Going further
- Extend the simulator with per-stage durations (stage = its layers' cost; make one stage 2× slower) and watch the bubble formula stop being exact: the slow stage becomes a straggler and the pipeline clocks at its rate — Phase 7 lab-04's imbalance lesson, third appearance. Then rebalance layers across stages to fix it (real PP deployments tune stage boundaries for exactly this).
- Add TP×PP composition: total GPUs = t × p; for a fixed 16-GPU budget and lab-03's comm model on both axes, find the (t, p) that minimizes decode latency for an in-node-NVLink, cross-node-IB cluster. You've just done the capacity-planning exercise that precedes every large-model deployment.
- Plot bubble vs m for p ∈ {2, 4, 8, 16} and overlay your service's actual concurrent- request distribution — the visual that settles "can we afford PP?" in one meeting.
References
- Huang et al., GPipe (2018) — the fill-drain schedule and the bubble: https://arxiv.org/abs/1811.06965
- Narayanan et al., PipeDream / 1F1B (2019–21) — the schedule refinements that rearrange your grid: https://arxiv.org/abs/2104.04473
upstream/vllm/distributed/andupstream/vllm/v1/executor/—pipeline_parallel_sizepaths; the deep-dive maps them.- Pope et al., Efficiently Scaling Transformer Inference (2022) — TP vs PP for inference, with the cost models side by side: https://arxiv.org/abs/2211.05102
- Lab-03 — the TP bill this lab is the alternative to; Phase 15 — disaggregation, the third way to split work across machines.