Phase 11 — Cheatsheet: Multi-LoRA

The one-liner

A LoRA is a small additive low-rank patch, W' = W + scaling·B·A, with rank r ≪ min(in, out). vLLM serves many adapters in one batch over a shared base model by grouping rows by adapter id (punica/SGMV): the base weights are read once, plus a small extra cost per adapter.

The math

  • shrink: s = x·Aᵀ (in→r). expand: Δ = s·Bᵀ (r→out). output = x·Wᵀ + scaling·Δ.
  • A:(r,in), B:(out,r). Adapter size = r×(in+out) parameters per layer, far smaller than W = in×out.
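
The shrink/expand formulas above can be checked numerically. The following is a minimal NumPy sketch, not vLLM code: the dimensions, the random weights, and the scaling value of 0.5 are all assumptions chosen for illustration. It shows that the two-step path (shrink, then expand) gives the same result as multiplying by the merged weight W' = W + scaling·B·A, and compares parameter counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from any real model).
in_dim, out_dim, r = 64, 48, 4
scaling = 0.5  # hypothetical value; the cheatsheet only names it "scaling"

W = rng.standard_normal((out_dim, in_dim))  # base weight, (out, in)
A = rng.standard_normal((r, in_dim))        # LoRA A, (r, in)
B = rng.standard_normal((out_dim, r))       # LoRA B, (out, r)
x = rng.standard_normal((3, in_dim))        # 3 token rows

s = x @ A.T                      # shrink: in -> r, shape (3, r)
delta = s @ B.T                  # expand: r -> out, shape (3, out)
y = x @ W.T + scaling * delta    # output = x·Wᵀ + scaling·Δ

# Equivalent path through the merged weight W' = W + scaling·B·A.
W_merged = W + scaling * (B @ A)
y_merged = x @ W_merged.T

adapter_params = r * (in_dim + out_dim)  # r×(in+out)
base_params = in_dim * out_dim           # in×out

print(np.allclose(y, y_merged))
print(adapter_params, base_params)
```

Keeping A and B separate is what makes multi-adapter serving cheap: the large x·Wᵀ product is shared, and only the two thin matmuls differ per adapter.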

Multi-adapter batching

Group rows by lora_int_id, then run one grouped matmul per adapter group (SGMV). Cost ≈ one base pass + Σ(small per-adapter term), NOT N model runs. The output matches applying each request's adapter separately.
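
The grouping idea can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the punica/SGMV kernel: the real implementation is a fused GPU kernel, while this version uses a plain Python loop over adapter ids. The helper names `batched_lora` and `per_request_lora`, the sizes, and the scaling value are hypothetical.

```python
import numpy as np

def batched_lora(x, W, adapters, lora_ids, scaling=1.0):
    """Base matmul once for the whole batch, then a small update per adapter group."""
    y = x @ W.T  # shared base pass over every row
    for lora_id in np.unique(lora_ids):
        if lora_id == 0:  # id 0 = base model, no adapter
            continue
        rows = np.flatnonzero(lora_ids == lora_id)  # rows using this adapter
        A, B = adapters[int(lora_id)]
        y[rows] += scaling * (x[rows] @ A.T) @ B.T  # shrink, then expand
    return y

def per_request_lora(x, W, adapters, lora_ids, scaling=1.0):
    """Reference: apply each row's adapter on its own via the merged weight."""
    out = []
    for xi, lora_id in zip(x, lora_ids):
        W_eff = W
        if lora_id != 0:
            A, B = adapters[int(lora_id)]
            W_eff = W + scaling * (B @ A)
        out.append(xi @ W_eff.T)
    return np.stack(out)

rng = np.random.default_rng(0)
in_dim, out_dim, r = 16, 12, 2  # illustrative sizes
W = rng.standard_normal((out_dim, in_dim))
adapters = {
    i: (rng.standard_normal((r, in_dim)), rng.standard_normal((out_dim, r)))
    for i in (1, 2)
}
x = rng.standard_normal((5, in_dim))
lora_ids = np.array([1, 0, 2, 1, 2])  # one adapter id per row

y_batched = batched_lora(x, W, adapters, lora_ids, scaling=0.5)
y_ref = per_request_lora(x, W, adapters, lora_ids, scaling=0.5)
print(np.allclose(y_batched, y_ref))
```

In floating point the two paths agree up to rounding, which is why the comparison uses `np.allclose` rather than exact equality.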

Memory & scheduling

  • max_loras: maximum number of distinct adapters per step. The manager LRU-evicts the rest (like the KV BlockPool).
  • Scheduler enforces max_loras at waiting-admission (scheduled_loras check, Phase 3).
  • LoRARequest (id+name+path); id 0 = base.
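
The two mechanisms above (LRU eviction in the manager, and the max_loras check at admission) can be sketched with the standard library alone. This is a toy model, not vLLM's code: the class `LRUAdapterCache`, the function `admit`, and the tuple-based request format are hypothetical names introduced for illustration. Only `max_loras`, `scheduled_loras`, and "id 0 = base" come from the text.

```python
from collections import OrderedDict

class LRUAdapterCache:
    """Holds at most `capacity` adapters; evicts the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._slots = OrderedDict()  # lora_int_id -> placeholder for weights

    def activate(self, lora_id):
        """Mark an adapter as used; return the evicted id, or None."""
        if lora_id in self._slots:
            self._slots.move_to_end(lora_id)  # refresh recency
            return None
        evicted = None
        if len(self._slots) >= self.capacity:
            evicted, _ = self._slots.popitem(last=False)  # drop the oldest
        self._slots[lora_id] = f"weights-{lora_id}"
        return evicted

def admit(waiting, max_loras):
    """Admit requests in order while distinct adapters stay within max_loras.

    `waiting` is a list of (request_name, lora_int_id); id 0 means base only.
    """
    scheduled_loras = set()
    scheduled, deferred = [], []
    for name, lora_id in waiting:
        is_new = lora_id != 0 and lora_id not in scheduled_loras
        if is_new and len(scheduled_loras) >= max_loras:
            deferred.append(name)  # would exceed the per-step adapter limit
            continue
        if lora_id != 0:
            scheduled_loras.add(lora_id)
        scheduled.append(name)
    return scheduled, deferred

cache = LRUAdapterCache(capacity=2)
evictions = [cache.activate(i) for i in (1, 2, 1, 3)]

waiting = [("a", 1), ("b", 2), ("c", 3), ("d", 1), ("e", 0)]
scheduled, deferred = admit(waiting, max_loras=2)
print(evictions, scheduled, deferred)
```

In this run, adapter 2 is evicted when adapter 3 arrives because adapter 1 was touched more recently, and request "c" is deferred because it would bring a third distinct adapter into the step. Base-only requests (id 0) never count against the limit.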

MoE LoRA

lora/layers/fused_moe.py patches expert layers too (same shrink/expand, trickier routing).

Key upstream

  • lora/request.py:8 LoRARequest
  • lora/punica_wrapper/punica_base.py:42 add_shrink :57 add_expand :88 add_lora_linear · punica_cpu.py:166/:197 (readable)
  • lora/layers/{base_linear,column_parallel_linear,row_parallel_linear,fused_moe}.py
  • lora/model_manager.py (load/activate/LRU) · lora/lora_weights.py (A,B)

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md