Phase 11 — Mini-Build: batched multi-adapter LoRA
You'll implement the LoRA delta (shrink → expand) and the grouped application that serves many adapters in one batch, then prove it equals applying each adapter per request. This is the punica/SGMV idea in numpy.
The task (lab-01)
Implement, in numpy:
- lora_delta(x, A, B, scaling) → scaling × (x @ A.T) @ B.T, with A: (r, in) and B: (out, r). Note that it is two small matmuls with a rank-r bottleneck.
- apply_single(x, W, A, B, scaling) → x @ W.T + lora_delta(...) (base + one adapter).
- apply_batched(x, W, adapters, adapter_ids, scalings) → each row i of x uses adapters[adapter_ids[i]] (an (A, B) pair, or None for base-only). Do it grouped: compute the shared base x @ W.T once, then for each distinct adapter id add its delta to its rows. Must equal a per-row reference loop.
adapter_ids[i] == -1 (or a None entry in adapters) means "base only, no adapter" for that row.
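The per-row reference loop that the grouped version must match could look roughly like the sketch below. It is not the lab's own reference code: the name apply_per_row_reference is hypothetical, and it assumes that adapters and scalings are both indexed by adapter id and that -1 is the base-only sentinel.

```python
import numpy as np

def lora_delta(x, A, B, scaling):
    """Shrink to rank r, then expand: scaling * (x @ A.T) @ B.T."""
    shrunk = x @ A.T                   # (n, in) @ (in, r) -> (n, r)
    return scaling * (shrunk @ B.T)    # (n, r) @ (r, out) -> (n, out)

def apply_per_row_reference(x, W, adapters, adapter_ids, scalings):
    """Slow reference (hypothetical helper): treat every row as its own request."""
    out = np.empty((x.shape[0], W.shape[0]), dtype=np.result_type(x, W))
    for i, aid in enumerate(adapter_ids):
        row = x[i:i + 1]               # keep it 2-D: shape (1, in)
        y = row @ W.T                  # base output, recomputed for each row
        # Assumption: -1 or a None entry means "base only" for this row.
        pair = None if aid == -1 else adapters[aid]
        if pair is not None:
            A, B = pair
            y = y + lora_delta(row, A, B, scalings[aid])
        out[i] = y[0]
    return out
```

The loop is deliberately naive. It recomputes the base matmul once per row, which is exactly the cost the grouped version is meant to avoid.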
The point (the insight)
apply_batched reads the base weight once for the whole batch and adds only a tiny rank-r
delta per adapter group — so serving N adapters costs ≈ base + N small matmuls, not N full model
runs. That's the multi-tenant cost advantage. Your grouping by adapter_id mirrors SGMV's
segmenting; it's the same "group by id" trick as MoE (Phase 7), here by adapter.
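One possible shape for the grouping step is sketched below; treat it as an illustration rather than the expected lab solution. The function name apply_batched_sketch is hypothetical, and the sketch assumes integer adapter ids with -1 as the base-only sentinel and scalings indexed by adapter id.

```python
import numpy as np

def lora_delta(x, A, B, scaling):
    """Rank-r delta: shrink with A, expand with B."""
    return scaling * ((x @ A.T) @ B.T)

def apply_batched_sketch(x, W, adapters, adapter_ids, scalings):
    """Grouped application (hypothetical name): one base matmul, one delta per group."""
    ids = np.asarray(adapter_ids)
    out = x @ W.T                          # shared base, computed once for all rows
    for aid in np.unique(ids):             # one iteration per distinct adapter id
        aid = int(aid)                     # plain int for dict or list lookup
        if aid == -1 or adapters[aid] is None:
            continue                       # base-only rows keep x @ W.T
        A, B = adapters[aid]
        rows = np.flatnonzero(ids == aid)  # row indices that belong to this adapter
        out[rows] += lora_delta(x[rows], A, B, scalings[aid])
    return out
```

The base matmul touches W exactly once, and each loop iteration costs only two small matmuls over that adapter's rows, so the extra work grows with the number of distinct adapters in the batch, not with full model runs.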
Definition of done
pytest phase-11-multi-lora/labs -q
Tests pin: apply_batched == per-row reference; base-only rows equal x @ W.T; the delta has the
right rank-r structure; and a single shared base matmul covers all rows.
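The rank-r property can be checked by hand before running the suite. The snippet below is a hypothetical check, not a copy of the lab's tests; the shapes and the scaling value are arbitrary choices for illustration.

```python
import numpy as np

def lora_delta(x, A, B, scaling):
    """Rank-r delta: shrink with A, expand with B."""
    return scaling * ((x @ A.T) @ B.T)

rng = np.random.default_rng(0)
n, d_in, d_out, r = 16, 32, 24, 4      # arbitrary illustrative sizes
x = rng.standard_normal((n, d_in))
A = rng.standard_normal((r, d_in))     # shrink matrix, shape (r, in)
B = rng.standard_normal((d_out, r))    # expand matrix, shape (out, r)
scaling = 0.5

delta = lora_delta(x, A, B, scaling)

# The delta passes through an r-dimensional bottleneck, so its rank is at most r.
delta_rank = np.linalg.matrix_rank(delta)

# Equivalent merged form: the same delta as one matmul with scaling * (B @ A).
merged = x @ (scaling * (B @ A)).T

print(delta.shape, delta_rank <= r, np.allclose(delta, merged))
```

The merged form is useful as a cross-check only. Materializing B @ A for every adapter would give up the small-matmul advantage that the two-step shrink and expand provides.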
Map to the real engine
| your numpy | real vLLM |
|---|---|
| lora_delta (shrink→expand) | add_shrink / add_expand (punica_cpu.py:166/:197) |
| apply_batched (grouped by id) | add_lora_linear / SGMV (punica_base.py:88) |
| adapters dict by id | LoRAModelManager slots (model_manager.py) |
| adapter_ids per row | LoRARequest.lora_int_id (request.py:8) |
| max distinct adapters | max_loras (manager LRU + scheduler check) |