Phase 11 — The Hitchhiker's Guide to Multi-LoRA

Don't Panic

A LoRA is a tiny "personality patch" for a big model. Instead of fine-tuning all 8 billion weights (expensive, and you'd need a full copy per use-case), you train two small matrices that nudge the frozen base model toward a specific task — legal writing, a coding style, a customer's tone. The magic vLLM does:

Serve many different LoRAs from ONE base model at the same time — request A uses the legal adapter, request B the medical one, request C none — all in a single batch, sharing the base weights, by applying each request's tiny patch inside one batched operation.

This is a structural cost win: thousands of fine-tunes can share one set of base weights, instead of each customer needing a whole deployment. This phase covers that batched-adapter machinery.

Base model (shared, frozen) ──────────────┐
   request A → + legal adapter (A_legal, B_legal)   ┐
   request B → + medical adapter (A_med, B_med)     ├─ all in one batch, one base read
   request C → + nothing (base only)                ┘

Step 1: What a LoRA actually is (the math, gently)

A model layer multiplies by a big weight W (say 4096×4096 = 16M numbers). A LoRA says: don't change W; add a small correction made of two skinny matrices.

W'  =  W  +  scaling × (B · A)
                         │   │
                         │   └ A: shape (r, in)    "down" — squeeze to a tiny rank r (e.g. 16)
                         └ B: shape (out, r)       "up"   — expand back to full size

r (the rank) is tiny (8, 16, 64), so A and B together are far smaller than W. At r = 16 on the 4096×4096 layer above, they hold 2 × 16 × 4096 = 131,072 numbers, about 128× fewer than W's 16M. Applying the patch to an input x is two small matmuls:

1. SHRINK:  s = x · Aᵀ          (in → r)   "squeeze x down to rank r"
2. EXPAND:  Δ = s · Bᵀ          (r → out)  "expand back up"
output = x · Wᵀ  +  scaling × Δ

So a LoRA costs one big base matmul (shared by everyone) plus two tiny rank-r matmuls. That's why it's cheap. You'll implement exactly this shrink/expand in lab-01.
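
As a quick check of the formulas above, here is a minimal NumPy sketch of shrink/expand. The sizes, the scaling value, and the function name lora_forward are made up for illustration; this is not the lab-01 solution or vLLM's code. It shows that the two-step path gives the same output as multiplying by the merged W'.

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, out_dim, r = 64, 48, 4      # toy sizes (assumed); real layers are far wider
scaling = 0.5                       # illustrative value, not taken from any real adapter

W = rng.standard_normal((out_dim, in_dim))   # frozen base weight, shape (out, in)
A = rng.standard_normal((r, in_dim))         # "down" matrix, shape (r, in)
B = rng.standard_normal((out_dim, r))        # "up" matrix, shape (out, r)
x = rng.standard_normal((3, in_dim))         # three input rows


def lora_forward(x, W, A, B, scaling):
    """Base matmul plus the two small rank-r matmuls."""
    base = x @ W.T        # (n, in) -> (n, out), the big shared matmul
    s = x @ A.T           # SHRINK: (n, in) -> (n, r)
    delta = s @ B.T       # EXPAND: (n, r) -> (n, out)
    return base + scaling * delta


y_two_step = lora_forward(x, W, A, B, scaling)

# Reference: build W' = W + scaling * (B @ A) explicitly and do one matmul.
W_merged = W + scaling * (B @ A)
y_merged = x @ W_merged.T

print(np.allclose(y_two_step, y_merged))
```

The two-step form never materializes the (out, in) product B·A, which is what keeps per-adapter work proportional to r.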

🆕 New words: LoRA (Low-Rank Adaptation — a small additive patch), rank r (the squeeze dimension, small), A/B (the down/up matrices), shrink/expand (the two matmuls), adapter (one trained (A,B) pair).


Step 2: The hard part — many adapters in one batch

Serving one LoRA is easy (just add its delta). The challenge is a batch where different rows use different adapters:

batch row 0 → adapter "legal"     row 1 → adapter "medical"    row 2 → base (no adapter)

The naive fix — loop over rows, apply each adapter separately — destroys batching (you're back to tiny per-request work, Phase 5's enemy). The real fix is a grouped operation: sort/group rows by adapter, and in one kernel apply each adapter to its group. This is what the punica / SGMV kernels do (SGMV = Segmented Gather Matrix-Vector). Conceptually it's the same "group by id, do a grouped matmul" trick you saw for MoE experts in Phase 7 — here grouped by adapter id instead of expert id.

group rows by adapter id  →  for each adapter: one matmul on its rows  →  scatter back
cost ≈ base matmul (shared)  +  a little per distinct adapter  ≪  N separate model runs

You'll build this grouped application in lab-01 and prove it equals the per-row reference.
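
To make the group/apply/scatter idea concrete, here is a hedged NumPy sketch. It uses a Python loop over distinct adapters rather than a fused kernel, so it only illustrates the grouping logic, not how punica/SGMV is actually implemented. The function names, sizes, and adapter ids are hypothetical.

```python
import numpy as np


def per_row_reference(x, W, adapters, adapter_ids, scaling):
    """Naive version: handle one row at a time (what batching wants to avoid)."""
    out = x @ W.T
    for i, aid in enumerate(adapter_ids):
        if aid == 0:                      # id 0 = base model, no adapter
            continue
        A, B = adapters[aid]
        out[i] += scaling * ((x[i] @ A.T) @ B.T)
    return out


def grouped_lora(x, W, adapters, adapter_ids, scaling):
    """Grouped version: one shrink/expand per distinct adapter, then scatter back."""
    adapter_ids = np.asarray(adapter_ids)
    out = x @ W.T                         # shared base matmul for the whole batch
    for aid in np.unique(adapter_ids):
        if aid == 0:
            continue
        rows = np.nonzero(adapter_ids == aid)[0]      # rows that use this adapter
        A, B = adapters[int(aid)]
        out[rows] += scaling * ((x[rows] @ A.T) @ B.T)
    return out


rng = np.random.default_rng(1)
in_dim, out_dim, r = 32, 24, 4            # toy sizes (assumed)
W = rng.standard_normal((out_dim, in_dim))
adapters = {                              # hypothetical adapters: id -> (A, B)
    1: (rng.standard_normal((r, in_dim)), rng.standard_normal((out_dim, r))),
    2: (rng.standard_normal((r, in_dim)), rng.standard_normal((out_dim, r))),
}
x = rng.standard_normal((5, in_dim))
adapter_ids = [1, 2, 0, 1, 2]             # mixed batch, row 2 is base only

y_ref = per_row_reference(x, W, adapters, adapter_ids, scaling=0.5)
y_grouped = grouped_lora(x, W, adapters, adapter_ids, scaling=0.5)
print(np.allclose(y_ref, y_grouped))
```

The loop count in grouped_lora depends on the number of distinct adapters in the batch, not on the number of rows.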


Step 3: Managing adapters in memory

GPUs have limited memory, so vLLM keeps a bounded number of adapters resident:

  • max_loras — how many distinct adapters can be in a single batch/step.
  • adapters are loaded on demand and LRU-evicted when the budget is exceeded (like the KV cache's eviction, Phase 2 — same pattern, different objects).
  • the scheduler (Phase 3) respects max_loras: it won't admit a request whose adapter would exceed the limit this step (you saw the scheduled_loras check in the Phase 3 deep-dive).
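
The load-on-demand plus LRU-eviction pattern can be sketched with an OrderedDict. This is a toy model under stated assumptions: AdapterSlotCache, its capacity parameter, and the placeholder "weights" strings are hypothetical and do not mirror vLLM's manager classes.

```python
from collections import OrderedDict


class AdapterSlotCache:
    """Toy LRU cache: at most `capacity` adapters stay resident."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()   # adapter_id -> loaded weights (placeholder)
        self.loads = 0
        self.evictions = 0

    def get(self, adapter_id):
        if adapter_id in self.slots:
            # Hit: mark as most recently used.
            self.slots.move_to_end(adapter_id)
            return self.slots[adapter_id]
        if len(self.slots) >= self.capacity:
            # Miss with a full cache: evict the least recently used adapter.
            self.slots.popitem(last=False)
            self.evictions += 1
        # Stand-in for the real work of loading adapter weights onto the GPU.
        self.slots[adapter_id] = f"weights-for-{adapter_id}"
        self.loads += 1
        return self.slots[adapter_id]


cache = AdapterSlotCache(capacity=2)
for adapter_id in [1, 2, 1, 3, 2]:
    cache.get(adapter_id)

print(list(cache.slots), cache.loads, cache.evictions)  # [3, 2] 4 2
```

With capacity 2, the access pattern 1, 2, 1, 3, 2 costs four loads for three distinct adapters: adapter 2 is evicted to make room for 3 and then has to be reloaded.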

A request names its adapter with a LoRARequest (id + name + path). Adapter id 0 conventionally means "base model, no adapter."
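
Putting the id convention and the max_loras limit together, the admission check might look roughly like the sketch below. Request and admit are hypothetical stand-ins, not vLLM's scheduler or LoRARequest; the one assumption carried over from the text is that adapter id 0 means "no adapter" and never counts against the limit.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    adapter_id: int   # 0 = base model, no adapter


def admit(waiting, max_loras):
    """Walk the queue once; defer requests whose adapter would exceed max_loras."""
    scheduled, deferred = [], []
    scheduled_loras = set()               # distinct adapters admitted this step
    for req in waiting:
        is_new = req.adapter_id != 0 and req.adapter_id not in scheduled_loras
        if is_new and len(scheduled_loras) >= max_loras:
            deferred.append(req)          # skip it for now, keep walking the queue
            continue
        if req.adapter_id != 0:
            scheduled_loras.add(req.adapter_id)
        scheduled.append(req)
    return scheduled, deferred


waiting = [
    Request("a", 1),
    Request("b", 2),
    Request("c", 3),   # would be a third distinct adapter
    Request("d", 1),   # reuses an adapter that is already scheduled
    Request("e", 0),   # base model only
]
scheduled, deferred = admit(waiting, max_loras=2)
print([r.name for r in scheduled])  # ['a', 'b', 'd', 'e']
print([r.name for r in deferred])   # ['c']
```

Request c waits for a later step, while d and e behind it are still admitted because they add no new adapter.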


Step 4: LoRA on MoE and which layers get patched

LoRA is applied to the linear layers — typically the attention projections (Q/K/V/O) and the MLP. For MoE models (Phase 7), adapters can patch the expert layers too (lora/layers/fused_moe.py) — trickier because of the routing, but the same shrink/expand idea. Not every layer needs an adapter; which ones are patched is part of how the LoRA was trained.
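
One way to picture "only some layers are patched" is an adapter stored as a map from layer name to its (A, B) pair, with every other layer falling through to the base path. The layer names and the dictionary layout below are illustrative assumptions, not the format vLLM or any training library uses.

```python
import numpy as np

rng = np.random.default_rng(2)
hidden, r, scaling = 32, 4, 1.0           # toy sizes (assumed)

# Base model: four linear layers, named here purely for illustration.
base = {
    name: rng.standard_normal((hidden, hidden))
    for name in ["q_proj", "k_proj", "v_proj", "o_proj"]
}

# This hypothetical adapter was trained on q_proj and v_proj only.
adapter = {
    name: (rng.standard_normal((r, hidden)), rng.standard_normal((hidden, r)))
    for name in ["q_proj", "v_proj"]
}


def linear(name, x):
    """Base matmul, plus the LoRA delta only if this layer has an (A, B) pair."""
    y = x @ base[name].T
    if name in adapter:
        A, B = adapter[name]
        y = y + scaling * ((x @ A.T) @ B.T)
    return y


x = rng.standard_normal((2, hidden))
patched = [
    name for name in base
    if not np.allclose(linear(name, x), x @ base[name].T)
]
print(patched)  # ['q_proj', 'v_proj']
```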


The invariants to memorize

  1. LoRA: W' = W + scaling × B·A, rank r ≪ in,out. Apply = base matmul + shrink (→r) + expand (→out).
  2. Multi-LoRA = grouped application by adapter id (punica/SGMV): one base read, a little extra per adapter — not N separate runs.
  3. max_loras bounds distinct adapters per step; the manager LRU-evicts the rest; the scheduler enforces it.
  4. Base weights are shared and read once; each adapter adds only r×(in+out) params per patched layer.
  5. Output for a batch of mixed adapters equals applying each adapter per-request — batching is an optimization, not a behavior change (recurring theme).
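
Invariant 4 is easy to sanity-check with arithmetic. The sketch below counts parameters for a single 4096×4096 layer at a few ranks; the helper name lora_params is hypothetical, and the numbers cover one layer only, not a whole adapter.

```python
def lora_params(in_dim, out_dim, r):
    """A is (r, in) and B is (out, r), so together they hold r * (in + out) numbers."""
    return r * (in_dim + out_dim)


in_dim = out_dim = 4096
full = in_dim * out_dim                   # 16,777,216 numbers in W

for r in (8, 16, 64):
    p = lora_params(in_dim, out_dim, r)
    print(f"r={r:>2}: {p:>7} params, {full // p}x smaller than W")
# r= 8:   65536 params, 256x smaller than W
# r=16:  131072 params, 128x smaller than W
# r=64:  524288 params, 32x smaller than W
```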

What you'll do

  • Read: 01-deep-dive.md, covering LoRARequest, the LoRA layers, the punica shrink/expand/add_lora_linear, and the manager + scheduler hook, line-anchored.
  • Build: 02-mini-build.md — batched multi-adapter LoRA matmul.
  • Labs (see labs/README.md; recommended order 01 → 03 → 04 → 02):
    • lab-01-batched-lora-matmul [CPU-OK] — implement shrink/expand + grouped multi-adapter application; prove it equals the per-request loop.
    • lab-02-serve-many-loras [GPU-OPT] — serve 3 adapters in one batch on real vLLM (captured).
    • lab-03-lora-economics [CPU-OK] — the multi-tenant arithmetic: 32 MiB per adapter (deriving lab-02's logged number), ~430× shrink, 87 GPUs saved at 100 tenants.
    • lab-04-adapter-slot-cache [CPU-OK] — the LRU slot cache behind max_loras and the scheduler walk that defers (not barriers) overflow requests; thrash arithmetic included.
  • Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
