Phase 11 — Deep Dive: multi-LoRA in real vLLM
Paths relative to upstream/ at v0.22.1 @ 0decac0.
- vllm/lora/request.py: LoRARequest (how a request names its adapter)
- vllm/lora/lora_weights.py: the (A, B) weight tensors of an adapter
- vllm/lora/lora_model.py: LoRAModel (one loaded adapter's layers)
- vllm/lora/model_manager.py: load / activate / LRU-evict adapters
- vllm/lora/worker_manager.py: per-worker adapter management
- vllm/lora/layers/: LoRA-wrapped layers (base_linear, column/row parallel, fused_moe)
- vllm/lora/punica_wrapper/: the batched SGMV/BGMV kernels (shrink / expand / add_lora_linear)
Contents
- 1. The request: LoRARequest
- 2. The patched layers: lora/layers/
- 3. The batched kernels: punica_wrapper/
- 4. The manager: load, activate, evict
- 5. The scheduler hook (recall Phase 3)
- Reading checklist
1. The request: LoRARequest
vllm/lora/request.py:8 — class LoRARequest with lora_int_id (globally unique id), lora_name,
and the adapter path. The scheduler and managers key everything off lora_int_id; id 0 means base.
This is what a user attaches to a request to say "serve me with the legal adapter."
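To make the "everything keys off lora_int_id" idea concrete, here is a minimal sketch of a request-side adapter handle. MiniLoRARequest and adapter_id are hypothetical names for illustration, not vLLM's class; the field name lora_path and the "ids start at 1" check are assumptions based on the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniLoRARequest:
    """Hypothetical stand-in for a LoRA request; not vLLM's actual class."""
    lora_name: str    # human-readable label, e.g. "legal"
    lora_int_id: int  # globally unique key; 0 is reserved for the base model
    lora_path: str    # where the adapter weights live (field name assumed)

    def __post_init__(self):
        # Assumption: real adapters never use id 0, since 0 means "no adapter".
        if self.lora_int_id < 1:
            raise ValueError("lora_int_id must be >= 1; 0 means base model")


def adapter_id(req):
    """Return the lookup key for a request; no adapter maps to 0 (base)."""
    return 0 if req is None else req.lora_int_id


legal = MiniLoRARequest("legal", 1, "/adapters/legal")
```

A request with no adapter attached resolves to id 0, so downstream code can treat "base model" as just another key.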
2. The patched layers: lora/layers/
A LoRA layer wraps a base layer (Phase 6's ColumnParallelLinear, etc.) and adds the
shrink/expand delta. Read lora/layers/base_linear.py and column_parallel_linear.py: in forward
they compute the base output, then call the punica wrapper to add the per-request LoRA delta. So the
model still builds normal layers; the LoRA manager swaps in these wrappers when adapters are
active. lora/layers/fused_moe.py does the same for MoE expert layers (Phase 7).
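The wrapper pattern described above can be sketched in a few lines of NumPy. MiniLoRALinear is a hypothetical class for illustration with a single adapter and no tensor parallelism; vLLM's layers delegate the delta to the punica wrapper rather than computing it inline.

```python
import numpy as np


class MiniLoRALinear:
    """Hypothetical wrapper: base linear output plus a low-rank delta."""

    def __init__(self, base_weight):
        self.base_weight = base_weight  # shape (out_features, in_features)
        self.lora_a = None              # shape (rank, in_features)
        self.lora_b = None              # shape (out_features, rank)

    def set_lora(self, lora_a, lora_b):
        self.lora_a, self.lora_b = lora_a, lora_b

    def forward(self, x):
        y = x @ self.base_weight.T       # base output, weights untouched
        if self.lora_a is not None:
            s = x @ self.lora_a.T        # shrink: in_features -> rank
            y = y + s @ self.lora_b.T    # expand: rank -> out_features
        return y


rng = np.random.default_rng(0)
layer = MiniLoRALinear(rng.standard_normal((6, 4)))
x = rng.standard_normal((3, 4))
base_only = layer.forward(x)
layer.set_lora(rng.standard_normal((2, 4)), rng.standard_normal((6, 2)))
with_lora = layer.forward(x)
```

The base weight is never modified, which is what lets one set of base weights serve many adapters at once.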
3. The batched kernels: punica_wrapper/
This is the heart — applying different adapters to different rows in one call. punica_base.py
defines the interface (PunicaWrapperABC :22, PunicaWrapperBase :124):
- add_shrink (:42): the down-projection s = x · Aᵀ for all rows, each using its adapter's A.
- add_expand (:57): the up-projection Δ = s · Bᵀ, each using its adapter's B.
- add_lora_linear (:88): the full "base + shrink + expand" for a linear layer.
The implementations (punica_gpu.py, punica_cpu.py, selected by punica_selector.py) use SGMV
(Segmented Gather Matrix-Vector): rows are segmented by adapter id, and each segment is matmul'd
against its adapter's slice in one grouped kernel. Read PunicaWrapperCPU.add_shrink/add_expand
(punica_cpu.py:166/:197) for the most readable version — it's literally "for each adapter
segment, do the small matmul," which is exactly your lab-01 grouped implementation.
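The "for each adapter segment, do the small matmul" idea can be sketched as follows. This is a plain NumPy loop under assumed shapes, not the punica kernel; grouped_lora_delta and its dictionary-of-weights layout are hypothetical choices for illustration.

```python
import numpy as np


def grouped_lora_delta(x, adapter_ids, lora_a, lora_b):
    """Per-row LoRA delta; adapter id 0 means base model (zero delta).

    x: (num_rows, in_features); adapter_ids: (num_rows,) integers.
    lora_a[id]: (rank, in_features); lora_b[id]: (out_features, rank).
    """
    out_features = next(iter(lora_b.values())).shape[0]
    delta = np.zeros((x.shape[0], out_features))
    for aid in np.unique(adapter_ids):
        aid = int(aid)
        if aid == 0:
            continue                               # base rows get no delta
        rows = np.flatnonzero(adapter_ids == aid)  # this adapter's segment
        s = x[rows] @ lora_a[aid].T                # shrink: -> rank
        delta[rows] = s @ lora_b[aid].T            # expand: -> out_features
    return delta


rng = np.random.default_rng(1)
x = rng.standard_normal((4, 3))
ids = np.array([1, 0, 2, 1])
lora_a = {1: rng.standard_normal((2, 3)), 2: rng.standard_normal((2, 3))}
lora_b = {1: rng.standard_normal((5, 2)), 2: rng.standard_normal((5, 2))}
delta = grouped_lora_delta(x, ids, lora_a, lora_b)
```

A GPU implementation fuses the per-segment loop into one grouped kernel launch; the arithmetic per segment is the same.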
4. The manager: load, activate, evict
vllm/lora/model_manager.py — LoRAModelManager loads adapters into a fixed set of GPU "slots",
activates the ones needed this step, and LRU-evicts when over max_loras (same eviction pattern as
the KV BlockPool, Phase 2). worker_manager.py drives this per worker. lora_weights.py holds an
adapter's A/B tensors (stacked across layers).
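The slot-plus-LRU behaviour can be sketched with an OrderedDict. MiniLoRASlots and load_fn are hypothetical names, and the sketch ignores GPU memory, CPU-side caching and pinned adapters; it only shows the eviction order.

```python
from collections import OrderedDict


class MiniLoRASlots:
    """Hypothetical fixed-capacity adapter cache with LRU eviction."""

    def __init__(self, max_loras):
        self.max_loras = max_loras
        self._active = OrderedDict()  # lora_int_id -> weights, oldest first

    def activate(self, lora_int_id, load_fn):
        """Make an adapter resident; return the evicted id, or None."""
        if lora_int_id in self._active:
            self._active.move_to_end(lora_int_id)  # mark as recently used
            return None
        evicted = None
        if len(self._active) >= self.max_loras:
            evicted, _ = self._active.popitem(last=False)  # drop the oldest
        self._active[lora_int_id] = load_fn(lora_int_id)
        return evicted

    def resident(self):
        return list(self._active)


slots = MiniLoRASlots(max_loras=2)
events = [slots.activate(i, lambda i: f"weights-{i}") for i in (1, 2, 1, 3)]
# events == [None, None, None, 2]; slots.resident() == [1, 3]
```

Re-activating adapter 1 before adapter 3 arrives is what saves it: adapter 2 becomes the least recently used and is the one evicted.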
5. The scheduler hook (recall Phase 3)
In vllm/v1/core/sched/scheduler.py, the waiting-admission loop checks max_loras: it tracks
scheduled_loras and skips a waiting request if admitting its adapter would exceed the limit this
step (you saw this around :573 in the Phase 3 deep-dive). So multi-LoRA, like spec decode, rides
the normal scheduler with one extra constraint rather than a separate path.
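The extra constraint can be sketched as a small admission loop. admit_waiting and its tuple-based request format are hypothetical; the real scheduler also checks token budgets, KV blocks and more, and this sketch isolates only the max_loras check.

```python
def admit_waiting(waiting, running_loras, max_loras):
    """Admit waiting requests without exceeding max_loras distinct adapters.

    waiting: list of (request_id, lora_int_id); id 0 means base model.
    running_loras: adapter ids already in use by running requests.
    """
    scheduled_loras = set(running_loras)
    admitted, skipped = [], []
    for req_id, lora_id in waiting:
        is_new = lora_id != 0 and lora_id not in scheduled_loras
        if is_new and len(scheduled_loras) >= max_loras:
            skipped.append(req_id)   # stays in the waiting queue this step
            continue
        if lora_id != 0:
            scheduled_loras.add(lora_id)
        admitted.append(req_id)
    return admitted, skipped


waiting = [("a", 1), ("b", 2), ("c", 0), ("d", 3), ("e", 1)]
admitted, skipped = admit_waiting(waiting, running_loras=set(), max_loras=2)
# admitted == ["a", "b", "c", "e"]; skipped == ["d"]
```

Request "d" is skipped rather than blocking the queue, so "e" (which reuses adapter 1) is still admitted in the same step.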
Reading checklist
- LoRARequest: what identifies an adapter, and what does id 0 mean?
- A LoRA layer's forward: base output then what? Where does the delta come from?
- add_shrink/add_expand (punica_cpu.py:166/:197): match them to shrink (→r) / expand (→out).
- How does SGMV apply different adapters to different rows in one call (segments)?
- Where does max_loras get enforced: in the manager and the scheduler?
Now build it: 02-mini-build.md, then the labs.