Lab 10-02 — Two-Way Tensor Parallelism `[GPU-OPT]`

The math (lab-01) said each rank holds 1/N of every matrix; the cost model (lab-03) said in-node links make the collectives cheap. This lab is where you watch both claims cash out on real hardware: tensor_parallel_size=2 spawns two worker processes, each reporting half the weight memory (1.24 GiB where TP=1 reported 2.48), each carving its own KV blocks from its own leftover HBM — and a model generates coherent text while no single GPU ever holds all of it. The startup log is the lab; reading it against the two CPU labs is the work.

No GPU pair? Don't panic. The captured run below is annotated line by line; the reconciliation exercises need only the numbers.

Why this lab exists
Requirements
Steps
Captured output (real run, opt-1.3b, 2×L4, vLLM 0.22.1, trimmed)
Reading the numbers
Hitchhiker's notes
Reflect
References

Why this lab exists

Two reasons. First, the observable surface of TP — worker processes, per-rank memory reports, per-rank block counts, NCCL initialization lines — is what you'll actually have in front of you during a production incident, and learning to read it against the underlying sharding math is the diagnostic skill (is rank 1's memory wildly different from rank 0's? Something's wrong with sharding or loading. Do blocks per worker × TP ≈ expected total KV? If not, where did the HBM go?). Second, TP is the first feature in this course that changes the process model: the engine becomes a coordinator of N workers in lockstep, every scheduler decision (Phase 3) broadcast to all ranks, every forward a synchronized ensemble. Several later phases (15's disaggregation, 17's platform plugins) build on that worker abstraction, and this is where you first see it breathe.

Requirements

uv pip install -e ".[vllm]"   # needs 2 visible CUDA GPUs for the live run
huggingface-cli download facebook/opt-1.3b

(opt-1.3b: big enough that halving its 2.6 GB is visible in the logs, small enough to also run TP=1 on one card for the baseline — you want both runs for the diff.)

Steps

from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-1.3b", tensor_parallel_size=2, gpu_memory_utilization=0.8)
print(llm.generate(["Distributed inference means"],
                   SamplingParams(max_tokens=32, temperature=0))[0].outputs[0].text)

Then the baseline (tensor_parallel_size=1) and the three comparisons: weight memory per worker (should halve), # GPU blocks per worker (should roughly double total — see below), and the generated text (should match the TP=1 output token-for-token... almost — see the last-ulp note).

Captured output (real run, opt-1.3b, 2×L4, vLLM 0.22.1, trimmed)

INFO ... Started 2 worker processes (tensor_parallel_size=2)
INFO (Worker_TP0) ... Model weights take 1.24 GiB         # ~half of 2.6 GiB
INFO (Worker_TP1) ... Model weights take 1.24 GiB         # the other half
INFO ... # GPU blocks: 28,500 (per worker)                # KV also splits across TP ranks
 ... distributed inference means splitting one model across multiple GPUs ...
# single-GPU baseline (tensor_parallel_size=1): Model weights take 2.48 GiB on one GPU

Reading the numbers

1.24 + 1.24 = 2.48 — lab-01's W[r*chunk:(r+1)*chunk], weighed. Every linear layer's shards, embedding's vocab slices, all summing back to the whole. If the two workers ever report different weight sizes, some tensor didn't shard (replicated layers — norms, biases — are expected and tiny; a large asymmetry means a loader bug).
# GPU blocks: 28,500 per worker — the subtle one. Each rank's KV per token is also halved (it caches only its own heads' K/V — attention sharding splits the cache naturally), and each rank carves blocks from its own freed-up HBM. Per-rank block count × block tokens ≈ same token capacity per rank as... work it through with Phase 2 lab-03's arithmetic: weights halved → more free HBM per GPU; KV per token halved → more tokens per GiB. Both effects push capacity up — TP=2 roughly doubles total concurrent tokens, which is the capacity story that often justifies TP even when the model would fit on one GPU.
The generated text matches TP=1 — semantically always, token-for-token usually. The all-reduce reorders fp16 additions (lab-01's note), so a near-tie can flip a token. Greedy + short output usually survives; if you diff long generations and find one divergence at position 200, you've observed the last-ulp story, not a bug.
What you don't see: the 64-per-step all-reduces (lab-03) — invisible in logs, visible only as the gap between ideal 2× latency scaling and what you measure. Time a single-stream generation under TP=1 vs TP=2: the ITL improvement lands under 2× by exactly the comm fraction your lab-03 model predicts for your link.

Hitchhiker's notes

Process topology: TP workers are separate processes (one per GPU), not threads — CUDA contexts, NCCL communicators, and Python's GIL all push that way. The engine core broadcasts each step's scheduler output to all ranks; they execute the identical step in lockstep and the rank-0 worker returns logits for sampling. Lockstep means the slowest rank sets the pace — a thermally-throttled GPU in a TP group drags the ensemble, a classic and maddening production hunt (symptom: TP=4 slower than TP=2; cause: one card at 70% clocks).
CUDA_VISIBLE_DEVICES and placement matter: TP wants the GPUs with the fastest mutual links (same NVLink island / NUMA node). On mixed-topology machines, nvidia-smi topo -m before choosing — lab-03's bill varies by which pair you pick on the same box.
When TP=2 is the wrong tool: model fits on one GPU and you're throughput-bound — two independent TP=1 replicas (data parallelism) beat TP=2 (no comm tax, perfect scaling, simpler ops). TP earns its tax only for fit-or-latency reasons (lab-03's notes). "We have two GPUs so we set TP=2" is the most common distributed-inference misconfiguration in the wild.
Startup is slower under TP — N processes, NCCL rendezvous, sharded loading, graph capture per rank (Phase 5's cost, ×N but parallel). Budget it in deploy pipelines; it's the TP line item people forget.

Reflect

Reconcile per-worker blocks with Phase 2 lab-03's formula: weights/rank = 1.24 GiB, KV/token/rank = half of lab 0-02's per-token bytes. Predict # GPU blocks for TP=2 on your card before reading the log. Within 10%?
A 70B fp16 model (~140 GiB weights), 80 GiB GPUs: what's the minimum TP, and what does lab-03 say about running it across two 8-GPU nodes at TP=16 vs TP=8 × PP=2? (TP=2 minimum for fit; cross-node TP=16 pays 64 latency-bound all-reduces over IB per token vs PP=2's single activation handoff — the composition lab-04 closes.)
Why does vLLM broadcast scheduler decisions rather than letting each rank run its own scheduler? (The ranks must execute byte-identical steps — same batch, same block tables — or the all-reduces would be summing mismatched partials. One brain, N hands; determinism across ranks is a correctness requirement, not a preference.)

References

upstream/vllm/v1/executor/ and upstream/vllm/v1/worker/ — the multiprocess executor and worker lockstep.
upstream/vllm/distributed/parallel_state.py — process groups and communicator setup (the NCCL lines in your startup log).
vLLM docs, Distributed Inference and Serving — TP/PP configuration and the placement guidance: https://docs.vllm.ai/en/latest/serving/distributed_serving.html
Labs 01 (the math), 03 (the bill), 04 (the cross-node alternative) — this run is their joint demo.

vLLM Mastery — From Zero to Maintainer