Phase 11 — Deep Dive: multi-LoRA in real vLLM

Paths relative to upstream/ at v0.22.1 @ 0decac0.

vllm/lora/request.py            LoRARequest (how a request names its adapter)
vllm/lora/lora_weights.py       the (A, B) weight tensors of an adapter
vllm/lora/lora_model.py         LoRAModel (one loaded adapter's layers)
vllm/lora/model_manager.py      load / activate / LRU-evict adapters
vllm/lora/worker_manager.py     per-worker adapter management
vllm/lora/layers/               LoRA-wrapped layers (base_linear, column/row parallel, fused_moe)
vllm/lora/punica_wrapper/       the batched SGMV/BGMV kernels (shrink / expand / add_lora_linear)

1. The request: LoRARequest

vllm/lora/request.py:8 defines class LoRARequest with lora_int_id (globally unique id), lora_name, and the adapter path. The scheduler and managers key everything off lora_int_id; id 0 means base. This is what a user attaches to a request to say "serve me with the legal adapter."
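To make the "key everything off lora_int_id" idea concrete, here is a minimal sketch. ToyLoRARequest and adapter_key are hypothetical names for illustration, not vLLM's class or API, and the lora_path field name is an assumption standing in for "the adapter path".

```python
from dataclasses import dataclass
from typing import Optional

# Per the text above, id 0 means "no adapter, serve with the base model".
BASE_MODEL_ID = 0


@dataclass(frozen=True)
class ToyLoRARequest:
    """Hypothetical stand-in for LoRARequest (not the real class)."""
    lora_name: str    # human-readable name, e.g. "legal"
    lora_int_id: int  # globally unique id; must not be 0 for a real adapter
    lora_path: str    # assumed field name for the adapter path


def adapter_key(req: Optional[ToyLoRARequest]) -> int:
    """Return the integer that downstream components would key on."""
    return BASE_MODEL_ID if req is None else req.lora_int_id


legal = ToyLoRARequest(lora_name="legal", lora_int_id=1, lora_path="/adapters/legal")
```

A request with no adapter attached maps to key 0, so the same lookup code can handle base and adapter traffic.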

2. The patched layers: lora/layers/

A LoRA layer wraps a base layer (Phase 6's ColumnParallelLinear, etc.) and adds the shrink/expand delta. Read lora/layers/base_linear.py and column_parallel_linear.py: in forward they compute the base output, then call the punica wrapper to add the per-request LoRA delta. So the model still builds normal layers; the LoRA manager swaps in these wrappers when adapters are active. lora/layers/fused_moe.py does the same for MoE expert layers (Phase 7).
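The wrap-and-add pattern can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions (a single adapter for the whole batch, no tensor parallelism, no scaling factor); ToyLinear and ToyLoRALinear are hypothetical names, not the classes in lora/layers/.

```python
import numpy as np


class ToyLinear:
    """Plain linear layer: y = x @ W.T, with W of shape (out, in)."""

    def __init__(self, weight: np.ndarray):
        self.weight = weight

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x @ self.weight.T


class ToyLoRALinear:
    """Wraps a base layer and adds the low-rank delta in forward."""

    def __init__(self, base: ToyLinear, lora_a: np.ndarray, lora_b: np.ndarray):
        self.base = base
        self.lora_a = lora_a  # shape (r, in)
        self.lora_b = lora_b  # shape (out, r)

    def forward(self, x: np.ndarray) -> np.ndarray:
        y = self.base.forward(x)      # 1. base output, untouched weights
        s = x @ self.lora_a.T         # 2. shrink: (rows, in) -> (rows, r)
        return y + s @ self.lora_b.T  # 3. expand to (rows, out) and add


rng = np.random.default_rng(0)
w = rng.standard_normal((6, 8))
a = rng.standard_normal((2, 8))
b = rng.standard_normal((6, 2))
x = rng.standard_normal((3, 8))
layer = ToyLoRALinear(ToyLinear(w), a, b)
out = layer.forward(x)
```

The result equals a forward pass through the merged weight W + B·A, but the base weight is never modified, which is what lets one base model serve many adapters.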

3. The batched kernels: punica_wrapper/

This is the heart — applying different adapters to different rows in one call. punica_base.py defines the interface (PunicaWrapperABC :22, PunicaWrapperBase :124):

  • add_shrink (:42) — the down-projection s = x · Aᵀ for all rows, each using its adapter's A.
  • add_expand (:57) — the up-projection Δ = s · Bᵀ, each using its adapter's B.
  • add_lora_linear (:88) — the full "base + shrink + expand" for a linear layer.

The implementations (punica_gpu.py, punica_cpu.py, selected by punica_selector.py) use SGMV (Segmented Gather Matrix-Vector): rows are segmented by adapter id, and each segment is matmul'd against its adapter's slice in one grouped kernel. Read PunicaWrapperCPU.add_shrink/add_expand (punica_cpu.py:166/:197) for the most readable version — it's literally "for each adapter segment, do the small matmul," which is exactly your lab-01 grouped implementation.
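The "for each adapter segment, do the small matmul" idea can be sketched as follows. This is not the code in punica_cpu.py; toy_shrink and toy_expand are hypothetical helpers, the stacked (num_adapters, ...) weight layout is an assumption, and using a negative index to mean "no adapter for this row" is a convention chosen for the sketch.

```python
import numpy as np


def toy_shrink(x: np.ndarray, a_stack: np.ndarray, adapter_ids: np.ndarray) -> np.ndarray:
    """s[i] = x[i] @ A[adapter_ids[i]].T, one adapter segment at a time.

    x: (rows, in), a_stack: (num_adapters, r, in), adapter_ids: (rows,)
    """
    s = np.zeros((x.shape[0], a_stack.shape[1]), dtype=x.dtype)
    for aid in np.unique(adapter_ids):
        if aid < 0:  # assumed convention: negative means base model only
            continue
        rows = adapter_ids == aid           # the segment for this adapter
        s[rows] = x[rows] @ a_stack[aid].T  # one small matmul per segment
    return s


def toy_expand(y: np.ndarray, s: np.ndarray, b_stack: np.ndarray, adapter_ids: np.ndarray) -> np.ndarray:
    """y[i] += s[i] @ B[adapter_ids[i]].T, added in place onto the base output.

    y: (rows, out), s: (rows, r), b_stack: (num_adapters, out, r)
    """
    for aid in np.unique(adapter_ids):
        if aid < 0:
            continue
        rows = adapter_ids == aid
        y[rows] += s[rows] @ b_stack[aid].T
    return y


rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
a_stack = rng.standard_normal((2, 4, 8))  # two adapters, rank 4
b_stack = rng.standard_normal((2, 6, 4))
ids = np.array([0, 1, -1, 1, 0])          # row 2 uses no adapter
delta = toy_expand(np.zeros((5, 6)), toy_shrink(x, a_stack, ids), b_stack, ids)
```

A GPU kernel fuses the loop over segments into one launch; the arithmetic per row is the same.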

4. The manager: load, activate, evict

vllm/lora/model_manager.py: LoRAModelManager loads adapters into a fixed set of GPU "slots", activates the ones needed this step, and LRU-evicts when over max_loras (same eviction pattern as the KV BlockPool, Phase 2). worker_manager.py drives this per worker. lora_weights.py holds an adapter's A/B tensors (stacked across layers).
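The slot-plus-LRU behaviour can be sketched with an OrderedDict. ToySlotManager is a hypothetical illustration of the pattern, not LoRAModelManager; it tracks only which adapter id owns which slot and omits loading weights.

```python
from collections import OrderedDict
from typing import List


class ToySlotManager:
    """Maps adapter ids to a fixed number of slots, evicting the LRU id."""

    def __init__(self, max_loras: int):
        self.max_loras = max_loras
        # lora_int_id -> slot index, ordered from least to most recently used
        self._slots: "OrderedDict[int, int]" = OrderedDict()

    def activate(self, lora_int_id: int) -> int:
        """Return the slot for this adapter, evicting the LRU one if full."""
        if lora_int_id in self._slots:
            self._slots.move_to_end(lora_int_id)  # mark as most recently used
            return self._slots[lora_int_id]
        if len(self._slots) >= self.max_loras:
            # Evict the least recently used adapter and reuse its slot.
            _, slot = self._slots.popitem(last=False)
        else:
            slot = len(self._slots)  # next free slot
        self._slots[lora_int_id] = slot
        return slot

    def active_ids(self) -> List[int]:
        return list(self._slots)


mgr = ToySlotManager(max_loras=2)
slot_1 = mgr.activate(1)        # takes slot 0
slot_2 = mgr.activate(2)        # takes slot 1
mgr.activate(1)                 # touch adapter 1, so adapter 2 becomes LRU
slot_3 = mgr.activate(3)        # evicts adapter 2 and reuses its slot
```

Reusing the evicted slot index keeps the slot count fixed, which is the property a preallocated stacked weight buffer would need.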

5. The scheduler hook (recall Phase 3)

In vllm/v1/core/sched/scheduler.py, the waiting-admission loop checks max_loras: it tracks scheduled_loras and skips a waiting request if admitting its adapter would exceed the limit this step (you saw this around :573 in the Phase 3 deep-dive). So multi-LoRA, like spec decode, rides the normal scheduler with one extra constraint rather than a separate path.
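The extra constraint can be sketched as a single check inside the admission loop. This is a simplified illustration, not the scheduler's code: admit_waiting is a hypothetical function, requests are reduced to (request_id, lora_int_id) pairs, and token budgets and every other admission condition are ignored.

```python
from typing import Iterable, List, Set, Tuple


def admit_waiting(
    waiting: List[Tuple[str, int]],
    running_lora_ids: Iterable[int],
    max_loras: int,
) -> Tuple[List[str], List[str]]:
    """Admit waiting requests unless a new adapter would exceed max_loras.

    waiting: (request_id, lora_int_id) pairs; lora_int_id 0 means base model.
    """
    # Adapters already in use this step by running requests.
    scheduled_loras: Set[int] = {i for i in running_lora_ids if i != 0}
    admitted: List[str] = []
    skipped: List[str] = []
    for request_id, lora_int_id in waiting:
        needs_new_adapter = lora_int_id != 0 and lora_int_id not in scheduled_loras
        if needs_new_adapter and len(scheduled_loras) >= max_loras:
            skipped.append(request_id)  # stays in the waiting queue this step
            continue
        if lora_int_id != 0:
            scheduled_loras.add(lora_int_id)
        admitted.append(request_id)
    return admitted, skipped


waiting = [("a", 1), ("b", 2), ("c", 3), ("d", 0), ("e", 1)]
admitted, skipped = admit_waiting(waiting, running_lora_ids=[], max_loras=2)
```

With max_loras=2, request "c" is skipped because it would bring in a third adapter, while "d" (base model) and "e" (an adapter already scheduled) are still admitted.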

Reading checklist

  • LoRARequest — what identifies an adapter, and what does id 0 mean?
  • A LoRA layer's forward — base output then what? Where does the delta come from?
  • add_shrink/add_expand (punica_cpu.py:166/:197) — match them to shrink (→r) / expand (→out).
  • How does SGMV apply different adapters to different rows in one call (segments)?
  • Where does max_loras get enforced — in the manager and the scheduler?

Now build it: 02-mini-build.md, then the labs.