Phase 11 — Interview Questions: Multi-LoRA

Q1. What is a LoRA and why is it cheap?

Model answer

A LoRA replaces full fine-tuning with a small additive patch: W' = W + scaling·B·A, where A (r×in) and B (out×r) share a small rank r (typically 8–64). Applying it is the base matmul plus two small rank-r matmuls (shrink in→r, then expand r→out). A and B together hold r·(in+out) parameters versus in·out for W, which is typically hundreds of times smaller (128× for r=16 on a 4096×4096 layer), so you can store and serve many adapters cheaply over one shared base.
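
To make the formula concrete, here is a minimal pure-Python sketch of the shrink/expand path. The helper names `matvec` and `lora_forward` and the toy shapes are illustrative assumptions, not part of any library; a real implementation would use batched GPU matmuls.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, scaling):
    """Compute W'x = Wx + scaling * B(Ax) without materializing W'."""
    base = matvec(W, x)        # base matmul: in -> out
    shrunk = matvec(A, x)      # shrink: in -> r
    delta = matvec(B, shrunk)  # expand: r -> out
    return [b + scaling * d for b, d in zip(base, delta)]


# Toy shapes: in=3, out=2, r=1.
W = [[1, 0, 2],
     [0, 1, 0]]
A = [[1, 1, 1]]   # r x in
B = [[2], [3]]    # out x r
x = [1, 2, 3]

y = lora_forward(W, A, B, x, scaling=0.5)
print(y)
```

Note that the delta is computed as B(Ax) rather than (BA)x: the out×in product BA is never formed, which is what keeps the extra work proportional to r.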

Q2. How does vLLM apply different adapters to different requests in one batch?

Model answer

It groups the batch by adapter id and uses SGMV/punica kernels: rows are segmented by their lora_int_id, and each segment is multiplied against its adapter's A/B in one grouped kernel (add_shrink/add_expand/add_lora_linear). So a heterogeneous batch costs the shared base read plus two small rank-r matmuls per distinct adapter, not one model run per request. It's the same "group by id, do a grouped matmul" trick as MoE, keyed by adapter instead of expert.
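
The grouping idea can be sketched in a few lines of pure Python. This is a toy CPU model of the technique, not vLLM's kernel code: `segment_by_adapter`, `batched_lora`, and the `adapters` dict are hypothetical names, and the inner per-row loop stands in for what a real grouped kernel does for a whole segment in one launch.

```python
from collections import defaultdict


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def segment_by_adapter(lora_ids):
    """Map each adapter id to the batch row indices that use it."""
    segments = defaultdict(list)
    for row, lora_id in enumerate(lora_ids):
        segments[lora_id].append(row)
    return dict(segments)


def batched_lora(W, adapters, xs, lora_ids, scaling=1.0):
    """Base output for every row, plus a per-segment low-rank delta.

    adapters maps an adapter id to its (A, B) pair. Id 0 is treated as
    "no adapter" here, so those rows keep the base output only.
    """
    out = [matvec(W, x) for x in xs]  # shared base pass over the whole batch
    for lora_id, rows in segment_by_adapter(lora_ids).items():
        if lora_id == 0:
            continue
        A, B = adapters[lora_id]
        for row in rows:  # a grouped kernel would process the segment at once
            delta = matvec(B, matvec(A, xs[row]))  # shrink, then expand
            out[row] = [o + scaling * d for o, d in zip(out[row], delta)]
    return out


W = [[1, 0], [0, 1]]
adapters = {
    1: ([[1, 1]], [[1], [0]]),   # (A, B) with r=1
    2: ([[1, -1]], [[0], [2]]),
}
xs = [[1, 2], [3, 4], [5, 6], [7, 8]]
lora_ids = [1, 0, 2, 1]          # one adapter id per batch row

ys = batched_lora(W, adapters, xs, lora_ids)
print(segment_by_adapter(lora_ids))
print(ys)
```

Rows 0 and 3 share adapter 1 and land in the same segment, so the adapter's weights are fetched once for both.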

Q3. What's the cost model that makes multi-LoRA a structural advantage?

Model answer

The base weights are shared and read once for the whole batch; each adapter adds only r×(in+out) parameters and a rank-r matmul. So serving N fine-tunes costs ≈ base + N tiny deltas, versus N full model copies. That lets a platform serve thousands of customer fine-tunes from one deployment — a real cost moat (Phase 19, Track C).
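
A back-of-the-envelope version of this cost model, as a sketch: the function names and the example sizes (a single 4096×4096 projection, rank 16, 1000 adapters) are assumptions chosen for illustration, and a real model would sum this over every adapted layer.

```python
def adapter_params(d_in, d_out, r):
    """Parameters in one adapter for one layer: A is r x in, B is out x r."""
    return r * (d_in + d_out)


def serving_params(d_in, d_out, r, n_adapters):
    """Compare one shared base plus N deltas against N full copies."""
    base = d_in * d_out
    shared = base + n_adapters * adapter_params(d_in, d_out, r)
    copies = n_adapters * base
    return shared, copies


shared, copies = serving_params(d_in=4096, d_out=4096, r=16, n_adapters=1000)
print(f"shared base + deltas: {shared:,} params")
print(f"full copies:          {copies:,} params")
print(f"ratio:                {copies / shared:.1f}x")
```

With these numbers the shared deployment holds roughly 148M parameters for the layer against roughly 16.8B for separate copies, and the gap widens as the rank shrinks.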

Q4. How are adapters managed in memory, and how does the scheduler get involved?

Model answer

The LoRAModelManager loads adapters into a bounded set of GPU slots and LRU-evicts when over max_loras (same eviction discipline as the KV cache). max_loras bounds distinct adapters per step; the scheduler enforces it during waiting-admission (it tracks scheduled_loras and skips a request whose adapter would exceed the limit this step). So multi-LoRA rides the normal scheduler with one extra constraint.
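
The two mechanisms can be modeled in miniature as below. This is a simplified sketch under stated assumptions, not vLLM's actual classes: `AdapterSlots` and `admit` are hypothetical names, adapter id 0 is taken to mean "no adapter", and the stored weights are placeholders.

```python
from collections import OrderedDict


class AdapterSlots:
    """Toy LRU cache of adapter ids, standing in for the bounded GPU slots."""

    def __init__(self, max_loras):
        self.max_loras = max_loras
        self.slots = OrderedDict()  # adapter id -> placeholder for weights

    def activate(self, lora_id):
        """Mark lora_id most recently used; evict the LRU adapter if full.

        Returns the evicted adapter id, or None if nothing was evicted.
        """
        if lora_id in self.slots:
            self.slots.move_to_end(lora_id)
            return None
        evicted = None
        if len(self.slots) >= self.max_loras:
            evicted, _ = self.slots.popitem(last=False)
        self.slots[lora_id] = object()
        return evicted


def admit(waiting, max_loras):
    """Choose requests for one step with at most max_loras distinct adapters.

    waiting is a list of (request_id, lora_id) pairs in queue order.
    """
    scheduled, skipped = [], []
    scheduled_loras = set()
    for req_id, lora_id in waiting:
        is_new = lora_id != 0 and lora_id not in scheduled_loras
        if is_new and len(scheduled_loras) >= max_loras:
            skipped.append(req_id)  # stays in the waiting queue for later
            continue
        if lora_id != 0:
            scheduled_loras.add(lora_id)
        scheduled.append(req_id)
    return scheduled, skipped


waiting = [("a", 1), ("b", 2), ("c", 3), ("d", 1), ("e", 0)]
scheduled, skipped = admit(waiting, max_loras=2)
print(scheduled, skipped)

slots = AdapterSlots(max_loras=2)
evictions = [slots.activate(i) for i in (1, 2, 1, 3)]
print(evictions, list(slots.slots))
```

Request "c" is skipped because adapter 3 would be a third distinct adapter in the step, while "d" is admitted because adapter 1 is already scheduled. In the slot cache, re-activating adapter 1 makes adapter 2 the least recently used, so adapter 2 is the one evicted when adapter 3 arrives.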

Q5. Does batching many adapters change the output?

Model answer

No. The grouped/SGMV application computes the same result as applying each adapter to its request individually (up to floating-point rounding from a different summation order); it just shares the base matmul and fuses the per-adapter work. Same "optimization, not behavior change" guarantee as the KV cache, chunked prefill, and spec decode.

Rapid-fire

  • LoRA formula? W' = W + scaling·B·A, rank r small.
  • Two apply steps? shrink (in→r), expand (r→out).
  • Batched-LoRA kernel family? punica / SGMV (segment by adapter id).
  • Bounds adapters/step? max_loras (manager LRU + scheduler check).
  • Adapter id 0/-1? base / no adapter.