Phase 10 — Interview Questions: Distributed Inference
Q1. TP vs PP — when do you reach for each?
Model answer
TP splits each layer's math across GPUs, so every GPU works on every token. That is great for latency, but it requires an all-reduce in every layer, so it needs fast intra-node links (NVLink). PP assigns contiguous groups of layers to different GPUs and hands activations between stages with cheap point-to-point sends, so it spreads model memory across nodes, at the cost of pipeline bubbles (mitigated by micro-batching) and some added latency. Rule of thumb: TP within a node, PP across nodes; combine them for very large models.
Q2. Walk me through tensor-parallel matmuls.
Model answer
Column-parallel splits the weight by output columns: each GPU computes a slice of the output, and the slices are combined by all-gather (concat). Row-parallel splits the weight by input rows (and the input along the matching dimension): each GPU computes a full-shape partial sum of the output, combined by all-reduce (sum). vLLM does the first matmul in a block column-parallel and the second row-parallel, so the column output stays sharded and feeds the row input directly, with no all-gather in between: one all-reduce per attention or MLP block. The combined result matches single-GPU (lab-01 proves it); in floating point, expect agreement up to summation-order rounding rather than strict bit-identity.
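The column-then-row pattern can be simulated on one machine with NumPy. This is a minimal sketch, not vLLM code: the "GPUs" are list entries, `world_size`, the shapes, and the ReLU between the two matmuls are illustrative assumptions, and the all-gather and all-reduce are stood in for by `np.concatenate` and `np.sum`.

```python
import numpy as np

rng = np.random.default_rng(0)
world_size = 4                       # hypothetical number of GPUs
x = rng.standard_normal((3, 8))      # 3 tokens, hidden size 8
W1 = rng.standard_normal((8, 16))    # first matmul: hidden -> intermediate
W2 = rng.standard_normal((16, 8))    # second matmul: intermediate -> hidden

# Single-GPU reference: matmul, elementwise ReLU, matmul.
reference = np.maximum(x @ W1, 0.0) @ W2

# Column-parallel: each rank owns a slice of W1's output columns.
W1_shards = np.split(W1, world_size, axis=1)
# Row-parallel: each rank owns the matching slice of W2's input rows.
W2_shards = np.split(W2, world_size, axis=0)

# Column-parallel on its own: concatenating the slices (all-gather)
# recovers the full output of the first matmul.
gathered = np.concatenate([x @ shard for shard in W1_shards], axis=1)

# Column -> row: the sharded activation feeds the row-parallel matmul
# directly, so no all-gather is needed in between.
partials = []
for rank in range(world_size):
    h_local = np.maximum(x @ W1_shards[rank], 0.0)   # stays sharded
    partials.append(h_local @ W2_shards[rank])       # full-shape partial sum

# The all-reduce (sum) is the only communication for the whole block.
output = np.sum(partials, axis=0)

assert np.allclose(gathered, x @ W1)
assert np.allclose(output, reference)
```

The elementwise nonlinearity is what makes it safe to leave the intermediate activation sharded: each rank can apply it to its own columns without seeing the others.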
Q3. What's a pipeline bubble and how is it reduced?
Model answer
In PP, downstream stages sit idle while the first input works its way through the earlier stages, and upstream stages sit idle again while the last one drains. That wasted GPU time is the bubble. Splitting the work into many micro-batches keeps the pipeline full: once it's primed, every stage is always working on some micro-batch. The bubble shrinks with more micro-batches but never fully disappears.
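To put a number on the bubble, here is a small sketch under a simplifying assumption: every stage takes one equal time slot per micro-batch and micro-batches flow through the stages in order. With p stages and m micro-batches the pipeline runs for m + p - 1 slots, each stage is busy for m of them, and the idle fraction is (p - 1) / (m + p - 1). The function names are illustrative, not part of any library.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a simple in-order pipeline with equal-cost stages."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("need at least one stage and one micro-batch")
    total_slots = num_microbatches + num_stages - 1
    return (num_stages - 1) / total_slots


def schedule(num_stages: int, num_microbatches: int) -> list:
    """grid[stage][t] is the micro-batch index running at slot t, or None if idle."""
    total_slots = num_microbatches + num_stages - 1
    grid = [[None] * total_slots for _ in range(num_stages)]
    for stage in range(num_stages):
        for mb in range(num_microbatches):
            # Micro-batch mb reaches this stage `stage` slots after it starts.
            grid[stage][stage + mb] = mb
    return grid


if __name__ == "__main__":
    for m in (1, 4, 12, 32):
        print(f"4 stages, {m:2d} micro-batches: bubble = {bubble_fraction(4, m):.3f}")
    # Render the schedule: digits are micro-batch indices, dots are idle slots.
    for row in schedule(4, 4):
        print(" ".join("." if slot is None else str(slot) for slot in row))
```

Because p - 1 stays in the numerator, the fraction approaches zero as m grows but never reaches it, which matches the answer above.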
Q4. Why does MoE motivate expert parallelism + data-parallel attention?
Model answer
Experts are independent FFNs, so placing whole experts on different GPUs (EP) scales expert capacity, with all-to-all communication to route tokens to their experts and back. Attention is dense: every token goes through the same weights, and its KV cache is per request, so it's often run data-parallel across the same GPUs, each handling its own requests, to balance work. Mixing EP (experts) with DP/TP (attention) is common for large MoE models; the main risks are all-to-all cost and expert load imbalance.
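The routing and the load-imbalance risk can be illustrated with a toy NumPy simulation. This is a sketch under stated assumptions, not how any particular engine implements it: top-1 routing, random router weights, each expert reduced to a single matrix, a contiguous expert-to-GPU placement, and all sizes are made up. The dispatch and return all-to-alls are represented by index selection and scatter.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, hidden = 64, 16
num_experts, num_gpus = 8, 4                   # illustrative sizes
experts_per_gpu = num_experts // num_gpus

tokens = rng.standard_normal((num_tokens, hidden))
router_w = rng.standard_normal((hidden, num_experts))          # hypothetical router
expert_w = rng.standard_normal((num_experts, hidden, hidden))  # one matrix per expert

# Top-1 routing: each token picks the expert with the highest router score.
expert_of_token = np.argmax(tokens @ router_w, axis=1)
# Contiguous placement: experts 0..1 on GPU 0, 2..3 on GPU 1, and so on.
gpu_of_token = expert_of_token // experts_per_gpu

output = np.zeros_like(tokens)
for gpu in range(num_gpus):
    # Dispatch (all-to-all): this GPU receives only the tokens routed to its experts.
    received = np.nonzero(gpu_of_token == gpu)[0]
    for expert in range(gpu * experts_per_gpu, (gpu + 1) * experts_per_gpu):
        rows = received[expert_of_token[received] == expert]
        # Return (all-to-all): results go back to the tokens' original positions.
        output[rows] = tokens[rows] @ expert_w[expert]

# Reference: apply each token's chosen expert with no sharding at all.
reference = np.einsum("th,thk->tk", tokens, expert_w[expert_of_token])
assert np.allclose(output, reference)

# Load imbalance: busiest GPU relative to a perfectly even split.
tokens_per_gpu = np.bincount(gpu_of_token, minlength=num_gpus)
imbalance = tokens_per_gpu.max() / tokens_per_gpu.mean()
print("tokens per GPU:", tokens_per_gpu, "imbalance:", round(float(imbalance), 2))
```

An imbalance of 1.0 would mean every GPU received the same number of tokens; anything higher means the busiest GPU sets the step time while the others wait.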
Q5. How does vLLM run the same model on 1 or 64 GPUs unchanged?
Model answer
The model uses parallel layers (ColumnParallelLinear/RowParallelLinear) that internally do the
collectives, and parallel_state.py holds the group/rank bookkeeping. For multi-GPU the Executor
becomes a MultiprocExecutor that spawns one worker process per GPU, each holding a shard, running
in lockstep. The engine logic above (scheduler, sampler) and the model code are identical — only the
executor fans out.
Rapid-fire
- Row-parallel combine? all-reduce (sum). Column-parallel combine? all-gather (concat).
- All-reduces per transformer block under TP? ~2 (one after attention, one after the MLP); each follows the col→row pattern.
- Collective library? NCCL. Group bookkeeping? parallel_state.py.
- Workers for TP=4? 4 processes, one per GPU.
- EP shards? whole experts (all-to-all). CP shards? one sequence's context/KV.