Phase 11 — Cheatsheet: Multi-LoRA
The one-liner
A LoRA is a tiny additive patch W' = W + scaling·B·A (rank r ≪ in, out). vLLM serves MANY adapters
in one batch over a shared base by grouping rows by adapter id (punica/SGMV): base weights are read
once, plus a small extra cost per adapter.
The math
- shrink: s = x·Aᵀ (in→r)
- expand: Δ = s·Bᵀ (r→out)
- output = x·Wᵀ + scaling·Δ
- shapes: A: (r, in), B: (out, r)
- Adapter size = r×(in+out) ≪ W = in×out
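The shrink/expand path can be checked numerically against the merged-weight form W' = W + scaling·B·A. This is a minimal NumPy sketch, not vLLM code; the dimensions and the `scaling` value are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_in, d_out, r = 4, 64, 48, 8   # illustrative sizes (assumed)
scaling = 0.5                              # assumed value for the demo

x = rng.standard_normal((n_tokens, d_in))
W = rng.standard_normal((d_out, d_in))     # base weight: (out, in)
A = rng.standard_normal((r, d_in))         # LoRA A: (r, in)
B = rng.standard_normal((d_out, r))        # LoRA B: (out, r)

s = x @ A.T                                # shrink: (n, in) -> (n, r)
delta = s @ B.T                            # expand: (n, r) -> (n, out)
y = x @ W.T + scaling * delta              # base output plus scaled LoRA delta

# Same result if the patch is merged into the base weight first.
W_merged = W + scaling * (B @ A)
y_merged = x @ W_merged.T

adapter_params = r * (d_in + d_out)        # size of A plus size of B
base_params = d_in * d_out                 # size of W
```

With these sizes the adapter holds r×(in+out) = 896 parameters against 3072 for W; the gap widens quickly as in and out grow while r stays small.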
Multi-adapter batching
Group rows by lora_int_id; per-group grouped matmul (SGMV). Cost ≈ base + Σ(small per adapter),
NOT N model runs. Output identical to per-request application.
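The grouping idea can be sketched in plain NumPy: one base matmul for the whole batch, then a small shrink/expand per adapter group. This is an illustration of the principle only, not the punica/SGMV kernel; `apply_lora_grouped` and `apply_lora_per_row` are hypothetical helper names, and id 0 is treated as "base only".

```python
import numpy as np

def apply_lora_grouped(x, W, adapters, lora_ids, scaling=1.0):
    """Hypothetical helper: base matmul once, then one small matmul pair per adapter."""
    y = x @ W.T                                   # base read once for all rows
    for lora_id in np.unique(lora_ids):
        if lora_id == 0:                          # id 0 = base model, no patch
            continue
        rows = np.nonzero(lora_ids == lora_id)[0] # rows that use this adapter
        A, B = adapters[int(lora_id)]
        y[rows] += scaling * ((x[rows] @ A.T) @ B.T)
    return y

def apply_lora_per_row(x, W, adapters, lora_ids, scaling=1.0):
    """Reference: apply each request's adapter one row at a time."""
    out = []
    for row, lora_id in zip(x, lora_ids):
        y = W @ row
        if lora_id != 0:
            A, B = adapters[int(lora_id)]
            y = y + scaling * (B @ (A @ row))
        out.append(y)
    return np.stack(out)

rng = np.random.default_rng(1)
d_in, d_out, r = 32, 24, 4                        # illustrative sizes (assumed)
W = rng.standard_normal((d_out, d_in))
adapters = {
    i: (rng.standard_normal((r, d_in)), rng.standard_normal((d_out, r)))
    for i in (1, 2)
}
lora_ids = np.array([1, 0, 2, 1, 2, 0])           # one adapter id per batch row
x = rng.standard_normal((len(lora_ids), d_in))

y_grouped = apply_lora_grouped(x, W, adapters, lora_ids, scaling=0.5)
y_per_row = apply_lora_per_row(x, W, adapters, lora_ids, scaling=0.5)
```

The two results agree up to floating-point rounding, while the grouped version touches W once instead of once per request.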
Memory & scheduling
- max_loras: distinct adapters per step. Manager LRU-evicts extras (like the KV BlockPool).
- Scheduler enforces max_loras at waiting-admission (scheduled_loras check, Phase 3).
- LoRARequest (id + name + path); id 0 = base.
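Both mechanisms above can be sketched with the standard library. This is a toy model under stated assumptions, not vLLM's manager or scheduler: `LRUAdapterCache` and `admit` are hypothetical names, requests are plain `(request_id, lora_id)` tuples, and a request that would exceed the adapter limit is simply deferred to a later step.

```python
from collections import OrderedDict

class LRUAdapterCache:
    """Hypothetical stand-in for an adapter manager with LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._slots = OrderedDict()   # lora_id -> weights (placeholder)
        self.evicted = []             # eviction log, kept for illustration

    def activate(self, lora_id, weights=None):
        if lora_id in self._slots:
            self._slots.move_to_end(lora_id)          # mark as most recently used
            return
        if len(self._slots) >= self.capacity:
            old_id, _ = self._slots.popitem(last=False)  # drop least recently used
            self.evicted.append(old_id)
        self._slots[lora_id] = weights

    def resident(self):
        return list(self._slots)

def admit(waiting, max_loras):
    """Hypothetical admission check: cap distinct adapters scheduled in one step."""
    scheduled_loras = set()
    admitted, deferred = [], []
    for req_id, lora_id in waiting:
        is_new = lora_id != 0 and lora_id not in scheduled_loras
        if is_new and len(scheduled_loras) >= max_loras:
            deferred.append(req_id)   # would need one adapter too many this step
            continue
        if lora_id != 0:              # id 0 = base, uses no adapter slot
            scheduled_loras.add(lora_id)
        admitted.append(req_id)
    return admitted, deferred

cache = LRUAdapterCache(capacity=2)
for lora_id in (1, 2, 1, 3):          # adapter 2 is least recently used when 3 arrives
    cache.activate(lora_id)

waiting = [("a", 1), ("b", 2), ("c", 3), ("d", 0), ("e", 1)]
admitted, deferred = admit(waiting, max_loras=2)
```

In this run the cache evicts adapter 2 and ends with adapters 1 and 3 resident, and the admission pass schedules a, b, d, e while deferring c, whose adapter would be the third distinct one in the step.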
MoE LoRA
lora/layers/fused_moe.py patches expert layers too (same shrink/expand, trickier routing).
Key upstream
- lora/request.py:8 LoRARequest
- lora/punica_wrapper/punica_base.py:42 add_shrink, :57 add_expand, :88 add_lora_linear · punica_cpu.py:166/:197 (readable)
- lora/layers/{base_linear,column_parallel_linear,row_parallel_linear,fused_moe}.py
- lora/model_manager.py (load/activate/LRU) · lora/lora_weights.py (A, B)
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md