Lab 10-04 — The Pipeline-Parallel Bubble `[CPU-OK]`

Lab-03 closed one door: TP across slow links is fatal — 64 latency-bound all-reduces per token see to that. Pipeline parallelism is what's left when the model is too big for one node: cut by layers into stages, and the only communication is handing one activation tensor to the next stage — point-to-point, once per stage boundary, indifferent to link latency in a way TP can only envy. The catch has a name and a closed form: the bubble — (p−1)/(p+m−1) of the pipeline's capacity idles during fill and drain — and this lab has you derive it twice: as algebra, and as a schedule grid you build cell by cell, where the two derivations must reconcile exactly (the test counts idle cells and divides). One microbatch through four stages wastes 75% of the hardware; the entire craft of PP is making m large enough that the formula forgives you.

Why this lab exists
Background: the fill-drain geometry
Files
Run
What the tests prove
Hitchhiker's notes
Going further
References

Why this lab exists

PP is the parallelism people deploy reluctantly — and the bubble formula is the entire content of that reluctance, so you should own it cold. It answers the deployment questions TP's bill (lab-03) leaves open: two nodes with no fast interconnect and a model that fits neither — PP works, but only if your workload keeps enough microbatches in flight (the test pins it: p=8 stages under a 10% bubble budget needs 63 concurrent microbatches — a number that should make you pause before proposing PP for a low-traffic latency-sensitive service). PP's economics are batch economics; knowing the formula means knowing instantly which workloads can pay.

The schedule-grid half of the lab is the more transferable skill: pipeline reasoning is the grid (stages × ticks, diagonal occupancy), and every scheduling refinement in the literature — 1F1B, interleaved stages, zero-bubble schedules — is a rearrangement of this grid you can draw and count. Build the simulator once and those papers become pictures.

Background: the fill-drain geometry

Microbatch b occupies stage s at tick s + b — a diagonal sweeping through a p × (p+m−1) grid. Everything follows from counting cells:

total ticks   = p + m − 1          (m diagonals, offset by one each)
useful cells  = m · p              (every microbatch visits every stage once)
capacity      = p · (p + m − 1)
bubble        = 1 − useful/capacity = (p − 1)/(p + m − 1)

Read the formula's two limits like an engineer: m = 1 → bubble (p−1)/p — a single request through a deep pipeline uses one GPU's worth of an 8-GPU rack (which is why PP does nothing for single-stream latency: total ticks ≥ p regardless — latency through a pipeline is the pipeline's depth); m → ∞ → bubble → 0 — at high concurrency the fill/drain cost amortizes to noise. PP converts throughput into efficiency and has nothing to offer latency. That asymmetry — exactly opposite to TP, which buys latency and taxes every step — is why the two compose rather than compete (TP inside the node, PP across; vLLM's tensor_parallel_size × pipeline_parallel_size grid).

For inference specifically, "microbatch" maps onto the continuous-batching engine naturally: each scheduler step's batch flows through the stages, and a busy engine (Phase 3's full queues) keeps every stage fed — inference PP at high load lives near the good end of the formula. The bad end is a quiet engine: requests trickle in, stages idle, and the p99 user pays p stage-latencies regardless.

Files

starter.py — pipeline_total_ticks, bubble_fraction, simulate_schedule, min_microbatches_for_bubble. Your work.
solution.py — reference.
test_lab.py — the formulas, the grid's diagonal structure, the exact grid-vs-formula reconciliation, serial-stage discipline, and the bubble-budget inversion.

Run

LAB_IMPL=starter pytest phase-10-distributed-inference/labs/lab-04-pipeline-bubble -q
pytest phase-10-distributed-inference/labs/lab-04-pipeline-bubble -q   # reference

What the tests prove

Test	What it pins
`test_total_ticks`	`p + m − 1`, including the m=1 pure-latency case
`test_bubble_formula`	The closed form at its corners: no pipeline → 0; one microbatch through 4 stages → 75%; m ≫ p → vanishing
`test_schedule_grid_matches_the_formula`	The reconciliation: idle cells counted in the simulated grid ÷ capacity equals `bubble_fraction` exactly. Two independent derivations agreeing is what "I understand this formula" means
`test_stage_never_runs_two_microbatches_at_once`	The serial-worker constraint that makes the diagonal the only schedule (until you interleave — see notes)
`test_min_microbatches_for_a_bubble_budget`	The inversion you'll actually use: p=8, 10% budget → m = 63; deeper pipelines need proportionally deeper batching

Hitchhiker's notes

Where PP's communication bill hides: one activation tensor per microbatch per stage boundary — batch_tokens × hidden × dtype bytes, p − 1 times per step total (not per layer!). Run lab-03's arithmetic on it: even over 10 GbE, a decode microbatch's 8 KB handoff is microseconds, and there are 7 of them instead of 64 all-reduces. That's the entire "PP tolerates slow links" argument, quantified with the model you already built.
Interleaved stages (each GPU holds several non-contiguous layer chunks) shrink the bubble by making p virtual stages cheaper to fill — the formula becomes (p−1)/(p·v + m − 1)-flavored with v chunks per GPU, at the cost of v× more handoffs. Zero-bubble schedules (training-side) rearrange backward passes — inference, having no backward, mostly cares about the plain formula you built.
The KV-cache wrinkle inference adds: each stage holds the KV for its layers only — PP splits cache naturally, like weights. But a request's tokens revisit stage 0 every decode step, so PP decode is a loop through the pipeline, not a single pass: steady-state decode keeps all stages busy only if the in-flight request count ≥ p. Same formula, with m = concurrent requests — Phase 3's max_num_seqs acquires a new lower bound.
vLLM specifics: pipeline_parallel_size shards layers across nodes (Ray or multiprocessing); the V1 engine overlaps stage execution with its async scheduling. PP support historically lagged TP in vLLM precisely because continuous batching × pipelining is bookkeeping-heavy — reading upstream/vllm/distributed/ and the executor's PP paths after this lab, you'll recognize the grid under the code.

Going further

Extend the simulator with per-stage durations (stage = its layers' cost; make one stage 2× slower) and watch the bubble formula stop being exact: the slow stage becomes a straggler and the pipeline clocks at its rate — Phase 7 lab-04's imbalance lesson, third appearance. Then rebalance layers across stages to fix it (real PP deployments tune stage boundaries for exactly this).
Add TP×PP composition: total GPUs = t × p; for a fixed 16-GPU budget and lab-03's comm model on both axes, find the (t, p) that minimizes decode latency for an in-node-NVLink, cross-node-IB cluster. You've just done the capacity-planning exercise that precedes every large-model deployment.
Plot bubble vs m for p ∈ {2, 4, 8, 16} and overlay your service's actual concurrent- request distribution — the visual that settles "can we afford PP?" in one meeting.

References

Huang et al., GPipe (2018) — the fill-drain schedule and the bubble: https://arxiv.org/abs/1811.06965
Narayanan et al., PipeDream / 1F1B (2019–21) — the schedule refinements that rearrange your grid: https://arxiv.org/abs/2104.04473
upstream/vllm/distributed/ and upstream/vllm/v1/executor/ — pipeline_parallel_size paths; the deep-dive maps them.
Pope et al., Efficiently Scaling Transformer Inference (2022) — TP vs PP for inference, with the cost models side by side: https://arxiv.org/abs/2211.05102
Lab-03 — the TP bill this lab is the alternative to; Phase 15 — disaggregation, the third way to split work across machines.

vLLM Mastery — From Zero to Maintainer