Phase 11 — Interview Questions: Multi-LoRA
Q1. What is a LoRA and why is it cheap?
Model answer
A LoRA replaces full fine-tuning with a small additive patch: W' = W + scaling·B·A, where A
(r×in) and B (out×r) have a tiny rank r (8–64). Applying it is the base matmul plus two small
rank-r matmuls (shrink in→r, expand r→out). A and B together hold r×(in+out) parameters,
typically tens to hundreds of times fewer than W's in×out, so you can store and serve many
adapters cheaply over one shared base.
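The formula above can be sketched in a few lines of plain Python. This is a toy illustration, not vLLM code: the helper names (matvec, lora_forward, merged_weight) and the tiny shapes are assumptions made for the example. It shows that the shrink/expand path gives the same output as multiplying by the merged matrix W'.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, scaling):
    """y = W x + scaling * B (A x), with A of shape r x in and B of shape out x r."""
    base = matvec(W, x)        # full-size base matmul
    shrunk = matvec(A, x)      # shrink: in -> r
    delta = matvec(B, shrunk)  # expand: r -> out
    return [b + scaling * d for b, d in zip(base, delta)]

def merged_weight(W, A, B, scaling):
    """Materialize W' = W + scaling * B A (what the adapter is equivalent to)."""
    r = len(A)
    return [
        [W[i][j] + scaling * sum(B[i][k] * A[k][j] for k in range(r))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]

# Toy shapes (hypothetical): in=3, out=2, r=1
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]
B = [[0.5], [-1.0]]
x = [1.0, 2.0, 3.0]

y_lora = lora_forward(W, A, B, x, scaling=2.0)
y_merged = matvec(merged_weight(W, A, B, scaling=2.0), x)
print(y_lora, y_merged)  # both [13.0, -10.0]
```

Keeping A and B separate, rather than merging them into W', is what lets many adapters share one copy of the base weights.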
Q2. How does vLLM apply different adapters to different requests in one batch?
Model answer
It groups the batch by adapter id and uses SGMV/punica kernels: rows are segmented by their
lora_int_id, and each segment is matmul'd against its adapter's A/B in one grouped kernel
(add_shrink/add_expand/add_lora_linear). So a heterogeneous batch costs the shared base read
plus a little per distinct adapter — not one model run per request. It's the same "group by id, do a
grouped matmul" trick as MoE, keyed by adapter instead of expert.
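The "group by id, then do a grouped matmul" idea can be sketched without any GPU code. The function below is a hypothetical pure-Python stand-in, not the punica/SGMV kernels themselves: it segments rows by adapter id and applies each adapter's A/B once per segment. Treating an id with no registered adapter as "no delta" is an assumption of this sketch.

```python
from collections import defaultdict

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def grouped_lora_delta(rows, lora_ids, adapters, scaling, out_dim):
    """Return the per-row LoRA delta, processing one segment per distinct adapter id."""
    # Segment row indices by adapter id.
    segments = defaultdict(list)
    for i, lid in enumerate(lora_ids):
        segments[lid].append(i)

    deltas = [[0.0] * out_dim for _ in rows]
    for lid, idxs in segments.items():
        if lid not in adapters:
            continue  # no adapter for this id: delta stays zero (base model only)
        A, B = adapters[lid]  # looked up once per segment, not once per row
        for i in idxs:
            shrunk = matvec(A, rows[i])                          # shrink: in -> r
            deltas[i] = [scaling * d for d in matvec(B, shrunk)]  # expand: r -> out
    return deltas

# Hypothetical toy batch: in=2, out=2, r=1, two adapters plus one base-only row.
adapters = {
    1: ([[1.0, 0.0]], [[1.0], [2.0]]),
    2: ([[0.0, 1.0]], [[-1.0], [0.5]]),
}
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
lora_ids = [1, 2, 1, 0]

deltas = grouped_lora_delta(rows, lora_ids, adapters, scaling=1.0, out_dim=2)
print(deltas)  # [[1.0, 2.0], [-4.0, 2.0], [5.0, 10.0], [0.0, 0.0]]
```

A real kernel would run each segment as one fused matmul on the GPU; the inner Python loop here only mirrors the math.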
Q3. What's the cost model that makes multi-LoRA a structural advantage?
Model answer
The base weights are shared and read once for the whole batch; each adapter adds only r×(in+out)
parameters and a rank-r matmul. So serving N fine-tunes costs ≈ base + N tiny deltas, versus N
full model copies. That lets a platform serve thousands of customer fine-tunes from one deployment —
a real cost moat (Phase 19, Track C).
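The arithmetic behind "base + N tiny deltas versus N full copies" can be checked directly. The sketch below counts parameters for a single weight matrix; the function names and the example dimensions (4096×4096, r=16, 1000 adapters) are illustrative assumptions, not figures from any particular model.

```python
def lora_params(in_dim, out_dim, r):
    """Parameters in one adapter for one weight matrix: A is r x in, B is out x r."""
    return r * (in_dim + out_dim)

def serving_params(in_dim, out_dim, r, n_adapters):
    """Return (shared-base total, full-copies total) parameter counts."""
    base = in_dim * out_dim
    shared = base + n_adapters * lora_params(in_dim, out_dim, r)
    copies = n_adapters * base
    return shared, copies

# Hypothetical example: one 4096 x 4096 projection, rank 16, 1000 fine-tunes.
per_adapter = lora_params(4096, 4096, 16)
shared, copies = serving_params(4096, 4096, 16, 1000)
print(per_adapter)  # 131072
print(shared)       # 147849216
print(copies)       # 16777216000
```

With these numbers one adapter is 128 times smaller than the matrix it patches, and the shared-base layout needs over 100 times fewer parameters than 1000 full copies.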
Q4. How are adapters managed in memory, and how does the scheduler get involved?
Model answer
The LoRAModelManager loads adapters into a bounded set of GPU slots and LRU-evicts when over
max_loras (same eviction discipline as the KV cache). max_loras bounds distinct adapters per
step; the scheduler enforces it during waiting-admission (it tracks scheduled_loras and skips a
request whose adapter would exceed the limit this step). So multi-LoRA rides the normal scheduler
with one extra constraint.
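Both halves of this answer can be sketched with the standard library. The class and function below are hypothetical simplifications written for illustration; they are not vLLM's LoRAModelManager or scheduler, and the convention that an id of 0 or less means "no adapter" is an assumption of the sketch.

```python
from collections import OrderedDict

class LoRASlotCache:
    """Toy stand-in for a bounded set of adapter slots with LRU eviction."""

    def __init__(self, max_loras):
        self.max_loras = max_loras
        self.slots = OrderedDict()  # lora_id -> placeholder for loaded weights

    def activate(self, lora_id):
        """Mark lora_id as used; return the id evicted to make room, or None."""
        if lora_id in self.slots:
            self.slots.move_to_end(lora_id)  # refresh recency
            return None
        evicted = None
        if len(self.slots) >= self.max_loras:
            evicted, _ = self.slots.popitem(last=False)  # drop least recently used
        self.slots[lora_id] = object()
        return evicted

def admit(waiting, max_loras):
    """One scheduling step: skip requests whose adapter would exceed max_loras."""
    scheduled, skipped = [], []
    scheduled_loras = set()
    for req_id, lora_id in waiting:
        is_new = lora_id > 0 and lora_id not in scheduled_loras
        if is_new and len(scheduled_loras) >= max_loras:
            skipped.append(req_id)  # stays in the waiting queue for a later step
            continue
        if lora_id > 0:
            scheduled_loras.add(lora_id)
        scheduled.append(req_id)
    return scheduled, skipped

cache = LoRASlotCache(max_loras=2)
evictions = [cache.activate(i) for i in (1, 2, 1, 3)]
print(evictions)  # [None, None, None, 2]

waiting = [("a", 1), ("b", 2), ("c", 3), ("d", 1), ("e", 0)]
print(admit(waiting, max_loras=2))  # (['a', 'b', 'd', 'e'], ['c'])
```

Request "c" is skipped because a third distinct adapter would exceed the limit for this step, while "d" (an adapter already scheduled) and "e" (no adapter) are still admitted.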
Q5. Does batching many adapters change the output?
Model answer
No. The grouped/SGMV application computes the same math as applying each adapter to its request
individually, so outputs should differ at most by floating-point rounding from a different
kernel reduction order; it just shares the base matmul and fuses the per-adapter work. Same
"optimization, not behavior change" guarantee as the KV cache, chunked prefill, and spec decode.
Rapid-fire
- LoRA formula? W' = W + scaling·B·A, rank r small.
- Two apply steps? shrink (in→r), expand (r→out).
- Batched-LoRA kernel family? punica / SGMV (segment by adapter id).
- Bounds adapters/step? max_loras (manager LRU + scheduler check).
- Adapter id 0/-1? base / no adapter.