Phase 11 — The Hitchhiker's Guide to Multi-LoRA
← Phase 10 · Course home · Phase 12 →
Contents
- Don't Panic
- Step 1: What a LoRA actually is (the math, gently)
- Step 2: The hard part — many adapters in one batch
- Step 3: Managing adapters in memory
- Step 4: LoRA on MoE and which layers get patched
- The invariants to memorize
- What you'll do
Don't Panic
A LoRA is a tiny "personality patch" for a big model. Instead of fine-tuning all 8 billion weights (expensive, and you'd need a full copy per use-case), you train two small matrices that nudge the frozen base model toward a specific task — legal writing, a coding style, a customer's tone. The magic vLLM does:
Serve many different LoRAs from ONE base model at the same time — request A uses the legal adapter, request B the medical one, request C none — all in a single batch, sharing the base weights, by applying each request's tiny patch inside one batched operation.
This is a structural cost win: thousands of fine-tunes on shared base weights, instead of a whole deployment per customer. This phase is that batched-adapter machinery.
Base model (shared, frozen) ──────────────┐
  request A → + legal adapter   (A_legal, B_legal) ┐
  request B → + medical adapter (A_med, B_med)     ├─ all in one batch, one base read
  request C → + nothing         (base only)        ┘
Step 1: What a LoRA actually is (the math, gently)
A model layer multiplies by a big weight W (say 4096×4096 = 16M numbers). A LoRA says: don't
change W; add a small correction made of two skinny matrices.
W' = W + scaling × (B · A)
                    │   │
                    │   └ A: shape (r, in)   "down": squeeze to a tiny rank r (e.g. 16)
                    └ B: shape (out, r)  "up": expand back to full size
r (the rank) is tiny (8, 16, 64), so A and B together are far smaller than W: at r = 16 on a
4096×4096 layer they hold 16 × (4096 + 4096) = 131,072 numbers, 128× fewer than W. Applying
the patch to an input x is two small matmuls:
1. SHRINK: s = x · Aᵀ (in → r) "squeeze x down to rank r"
2. EXPAND: Δ = s · Bᵀ (r → out) "expand back up"
output = x · Wᵀ + scaling × Δ
So a LoRA costs one big base matmul (shared by everyone) plus two tiny rank-r matmuls. That's
why it's cheap. You'll implement exactly this shrink/expand in lab-01.
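The shrink/expand recipe above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not vLLM's kernel code: the helper names (`matmul`, `transpose`, `lora_forward`, `merged_forward`) and the tiny matrices are made up for the example. It checks that the two-small-matmuls path gives the same answer as building W' explicitly.

```python
# Toy sketch of the LoRA shrink/expand path (hypothetical helpers, not vLLM code).
# Shapes follow the text: W is (out, in), A is (r, in), B is (out, r).

def matmul(x, y):
    """Multiply x (m×k) by y (k×n); both are lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(m):
    return [list(col) for col in zip(*m)]

def lora_forward(x, W, A, B, scaling):
    """output = x·Wᵀ + scaling × ((x·Aᵀ)·Bᵀ)"""
    base = matmul(x, transpose(W))       # big shared matmul: in -> out
    s = matmul(x, transpose(A))          # SHRINK: in -> r
    delta = matmul(s, transpose(B))      # EXPAND: r -> out
    return [[b + scaling * d for b, d in zip(brow, drow)]
            for brow, drow in zip(base, delta)]

def merged_forward(x, W, A, B, scaling):
    """Reference: build W' = W + scaling × (B·A) explicitly, then x·W'ᵀ."""
    BA = matmul(B, A)                    # (out, r)·(r, in) -> (out, in)
    W_prime = [[w + scaling * ba for w, ba in zip(wrow, barow)]
               for wrow, barow in zip(W, BA)]
    return matmul(x, transpose(W_prime))

# Toy sizes: in=3, out=2, r=1, one input row.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]       # (r=1, in=3)
B = [[0.5], [-1.0]]         # (out=2, r=1)
x = [[1.0, 1.0, 1.0]]

y_lora = lora_forward(x, W, A, B, scaling=2.0)
y_merged = merged_forward(x, W, A, B, scaling=2.0)
print(y_lora, y_merged)
```

Both paths agree, which is the point: the adapter never has to be merged into W, so the base weights can stay frozen and shared.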
🆕 New words: LoRA (Low-Rank Adaptation — a small additive patch), rank r (the squeeze dimension, small), A/B (the down/up matrices), shrink/expand (the two matmuls), adapter (one trained (A,B) pair).
Step 2: The hard part — many adapters in one batch
Serving one LoRA is easy (just add its delta). The challenge is a batch where different rows use different adapters:
batch row 0 → adapter "legal" row 1 → adapter "medical" row 2 → base (no adapter)
The naive fix — loop over rows, apply each adapter separately — destroys batching (you're back to tiny per-request work, Phase 5's enemy). The real fix is a grouped operation: sort/group rows by adapter, and in one kernel apply each adapter to its group. This is what the punica / SGMV kernels do (SGMV = Segmented Gather Matrix-Vector). Conceptually it's the same "group by id, do a grouped matmul" trick you saw for MoE experts in Phase 7 — here grouped by adapter id instead of expert id.
group rows by adapter id → for each adapter: one matmul on its rows → scatter back
cost ≈ base matmul (shared) + a little per distinct adapter ≪ N separate model runs
You'll build this grouped application in lab-01 and prove it equals the per-row reference.
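Here is a minimal sketch of the group, apply, scatter idea in plain Python. It is not the punica/SGMV kernel: the function names and the `adapters` dict layout are assumptions for illustration, the "one matmul per adapter" is written as a loop over that adapter's rows, and the shared base matmul is left out so only the LoRA deltas are compared.

```python
from collections import defaultdict

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def per_row_reference(X, adapter_ids, adapters, out_dim):
    """Naive path: handle each row on its own, in batch order."""
    deltas = []
    for x_row, aid in zip(X, adapter_ids):
        if aid == 0:                          # id 0 = base only, zero delta
            deltas.append([0.0] * out_dim)
            continue
        A, B, scaling = adapters[aid]
        s = matvec(A, x_row)                  # shrink
        deltas.append([scaling * d for d in matvec(B, s)])  # expand
    return deltas

def grouped_delta(X, adapter_ids, adapters, out_dim):
    """Grouped path: bucket rows by adapter id, apply once per bucket, scatter back."""
    groups = defaultdict(list)
    for row, aid in enumerate(adapter_ids):
        if aid != 0:
            groups[aid].append(row)
    deltas = [[0.0] * out_dim for _ in X]
    for aid, rows in groups.items():
        A, B, scaling = adapters[aid]         # adapter weights fetched once per group
        block = [X[r] for r in rows]          # gather this adapter's rows
        shrunk = [matvec(A, x) for x in block]
        expanded = [matvec(B, s) for s in shrunk]
        for r, d in zip(rows, expanded):      # scatter to original positions
            deltas[r] = [scaling * v for v in d]
    return deltas

# Hypothetical adapters: id -> (A, B, scaling), with in=2, out=2, r=1.
adapters = {
    1: ([[1.0, 0.0]], [[1.0], [2.0]], 1.0),
    2: ([[0.0, 1.0]], [[-1.0], [0.5]], 2.0),
}
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
adapter_ids = [1, 2, 0, 1]

ref = per_row_reference(X, adapter_ids, adapters, out_dim=2)
grouped = grouped_delta(X, adapter_ids, adapters, out_dim=2)
print(grouped)
```

Rows 0 and 3 share adapter 1, so the grouped path touches that adapter's weights once for both rows; the result still matches the per-row reference exactly.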
Step 3: Managing adapters in memory
GPUs have limited memory, so vLLM keeps a bounded number of adapters resident:
- max_loras: how many distinct adapters can be in a single batch/step.
- Adapters are loaded on demand and LRU-evicted when the budget is exceeded (like the KV cache's eviction, Phase 2: same pattern, different objects).
- The scheduler (Phase 3) respects max_loras: it won't admit a request whose adapter would exceed the limit this step (you saw the scheduled_loras check in the Phase 3 deep-dive).
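The load-on-demand plus LRU-eviction pattern from the list above can be sketched with an `OrderedDict`. The class name `AdapterSlotCache`, its `capacity` parameter (standing in for the resident-adapter budget), and the `load_fn` callback are all hypothetical; vLLM's real manager is more involved.

```python
from collections import OrderedDict

class AdapterSlotCache:
    """Toy LRU cache of resident adapters (hypothetical, not vLLM's manager)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._slots = OrderedDict()   # adapter id -> weights, least recently used first
        self.loads = 0                # how many times we had to load from storage

    def get(self, adapter_id, load_fn):
        if adapter_id in self._slots:
            self._slots.move_to_end(adapter_id)   # hit: mark as most recently used
            return self._slots[adapter_id]
        if len(self._slots) >= self.capacity:
            self._slots.popitem(last=False)       # full: evict least recently used
        self._slots[adapter_id] = load_fn(adapter_id)
        self.loads += 1
        return self._slots[adapter_id]

    def resident(self):
        return list(self._slots)

cache = AdapterSlotCache(capacity=2)
for adapter_id in [1, 2, 1, 3, 2]:
    cache.get(adapter_id, load_fn=lambda i: f"weights-{i}")

print(cache.resident(), cache.loads)
```

With two slots and the access pattern 1, 2, 1, 3, 2, adapter 2 is evicted to make room for 3 and then has to be reloaded, which is the thrash cost of a too-small budget.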
A request names its adapter with a LoRARequest (id + name + path). Adapter id 0 conventionally
means "base model, no adapter."
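To make the scheduler rule concrete, here is a rough sketch of one scheduling step that caps distinct adapters at max_loras. The `Request` dataclass and `schedule_step` function are hypothetical stand-ins (not vLLM's `LoRARequest` or scheduler), and the policy shown, skipping an overflow request and continuing down the queue, is an assumption consistent with the description above.

```python
from dataclasses import dataclass

@dataclass
class Request:              # hypothetical stand-in, not vLLM's LoRARequest
    name: str
    adapter_id: int = 0     # 0 = base model, no adapter

def schedule_step(waiting, max_loras):
    """Admit requests in order while keeping distinct adapters <= max_loras."""
    scheduled, deferred = [], []
    scheduled_loras = set()             # distinct adapter ids admitted this step
    for req in waiting:
        aid = req.adapter_id
        is_new_adapter = aid != 0 and aid not in scheduled_loras
        if is_new_adapter and len(scheduled_loras) >= max_loras:
            deferred.append(req)        # over budget: defer, but keep walking
            continue
        if aid != 0:
            scheduled_loras.add(aid)    # base-only requests never use a slot
        scheduled.append(req)
    return scheduled, deferred

waiting = [
    Request("r0", adapter_id=1),
    Request("r1", adapter_id=2),
    Request("r2", adapter_id=3),    # third distinct adapter: over the limit
    Request("r3"),                  # base only
    Request("r4", adapter_id=1),    # reuses an already-admitted adapter
]
scheduled, deferred = schedule_step(waiting, max_loras=2)
print([r.name for r in scheduled], [r.name for r in deferred])
```

Note that r3 and r4 still get in after r2 is deferred: a request that reuses an admitted adapter, or uses none, costs no extra slot.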
Step 4: LoRA on MoE and which layers get patched
LoRA is applied to the linear layers — typically the attention projections (Q/K/V/O) and the MLP.
For MoE models (Phase 7), adapters can patch the expert layers too (lora/layers/fused_moe.py)
— trickier because of the routing, but the same shrink/expand idea. Not every layer needs an
adapter; which ones are patched is part of how the LoRA was trained.
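As a small illustration of "which layers get patched", the sketch below filters a list of layer names by a set of target module names. Everything here is assumed for the example: the layer names, the `target_modules` set, and the suffix-matching rule. Real adapters record their targets in their own config.

```python
def select_patched_layers(layer_names, target_modules):
    """Keep layers whose last dotted component is a targeted module name."""
    return [name for name in layer_names
            if name.rsplit(".", 1)[-1] in target_modules]

# Hypothetical layer names for one transformer block.
layer_names = [
    "layers.0.attn.q_proj",
    "layers.0.attn.k_proj",
    "layers.0.attn.v_proj",
    "layers.0.attn.o_proj",
    "layers.0.mlp.up_proj",
    "layers.0.mlp.down_proj",
    "layers.0.input_norm",
]

# An adapter trained only on the query and value projections.
patched = select_patched_layers(layer_names, target_modules={"q_proj", "v_proj"})
print(patched)
```

Layers outside the target set run the plain base matmul with no shrink/expand step.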
The invariants to memorize
- LoRA: W' = W + scaling × B·A, rank r ≪ in,out. Apply = base matmul + shrink (→r) + expand (→out).
- Multi-LoRA = grouped application by adapter id (punica/SGMV): one base read, a little extra per adapter, not N separate runs.
- max_loras bounds distinct adapters per step; the manager LRU-evicts the rest; the scheduler enforces it.
- Base weights are shared and read once; each adapter adds only r×(in+out) params per patched layer.
- Output for a batch of mixed adapters equals applying each adapter per-request: batching is an optimization, not a behavior change (recurring theme).
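The r×(in+out) invariant is easy to check with arithmetic. The sketch below uses the 4096×4096 layer from Step 1; the function names are made up for the example, and the numbers cover a single patched layer only, not a whole model.

```python
def lora_params_per_layer(rank, in_dim, out_dim):
    """A is (rank, in_dim) and B is (out_dim, rank)."""
    return rank * in_dim + out_dim * rank

def shrink_factor(rank, in_dim, out_dim):
    """How many times smaller the adapter is than the full weight W."""
    return (in_dim * out_dim) / lora_params_per_layer(rank, in_dim, out_dim)

in_dim = out_dim = 4096
base_params = in_dim * out_dim                     # 16,777,216 numbers in W

for rank in (8, 16, 64):
    params = lora_params_per_layer(rank, in_dim, out_dim)
    factor = shrink_factor(rank, in_dim, out_dim)
    print(f"r={rank:>2}: {params:>7} adapter params, {factor:.0f}x smaller than W")
```

Doubling the rank doubles the adapter size and halves the shrink factor, so rank is the main knob for per-adapter memory.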
What you'll do
- Read: 01-deep-dive.md: LoRARequest, the LoRA layers, the punica shrink/expand/add_lora_linear, and the manager + scheduler hook, line-anchored.
- Build: 02-mini-build.md: batched multi-adapter LoRA matmul.
- Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
  - lab-01-batched-lora-matmul [CPU-OK]: implement shrink/expand + grouped multi-adapter application; prove it equals the per-request loop.
  - lab-02-serve-many-loras [GPU-OPT]: serve 3 adapters in one batch on real vLLM (captured).
  - lab-03-lora-economics [CPU-OK]: the multi-tenant arithmetic: 32 MiB per adapter (deriving lab-02's logged number), ~430× shrink, 87 GPUs saved at 100 tenants.
  - lab-04-adapter-slot-cache [CPU-OK]: the LRU slot cache behind max_loras and the scheduler walk that defers overflow requests rather than treating them as a barrier; thrash arithmetic included.
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.