Phase 11 — Mini-Build: batched multi-adapter LoRA

You'll implement the LoRA delta (shrink → expand) and the grouped application that serves many adapters in one batch, then prove that it equals applying each adapter per request. This is the punica / SGMV idea in numpy.

The task (lab-01)

Implement, in numpy:

  • lora_delta(x, A, B, scaling) → scaling × (x @ A.T) @ B.T, with A: (r, in) and B: (out, r). Note that it is two small matmuls with a rank-r bottleneck.
  • apply_single(x, W, A, B, scaling) → x @ W.T + lora_delta(...) (base + one adapter).
  • apply_batched(x, W, adapters, adapter_ids, scalings) → each row i of x uses adapters[adapter_ids[i]] (an (A,B) pair, or None for base-only). Do it grouped: compute the shared base x @ W.T once, then for each distinct adapter id add its delta to its rows. Must equal a per-row reference loop.

adapter_ids[i] == -1 (or an id whose adapters entry is None) means "base only, no adapter" for that row.
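The three functions above can be sketched as follows. This is one possible shape, not the lab's reference solution. It assumes that adapters and scalings are both indexed by adapter id (the task text does not pin down how scalings is keyed) and that adapter_ids holds integers. apply_per_row is a hypothetical helper name for the per-row reference loop.

```python
import numpy as np


def lora_delta(x, A, B, scaling):
    """LoRA delta: shrink to rank r, then expand back to the output width."""
    shrunk = x @ A.T                  # (n, in) @ (in, r)  -> (n, r)
    return scaling * (shrunk @ B.T)   # (n, r) @ (r, out) -> (n, out)


def apply_single(x, W, A, B, scaling):
    """Base projection plus one adapter's delta."""
    return x @ W.T + lora_delta(x, A, B, scaling)


def apply_batched(x, W, adapters, adapter_ids, scalings):
    """Grouped application: one shared base matmul, one delta per adapter group."""
    adapter_ids = np.asarray(adapter_ids)
    out = x @ W.T  # shared base, computed once for every row
    for aid in np.unique(adapter_ids):
        aid = int(aid)
        # Assumption: -1 or a None entry means "base only" for those rows.
        if aid < 0 or adapters[aid] is None:
            continue
        A, B = adapters[aid]
        rows = np.flatnonzero(adapter_ids == aid)  # rows served by this adapter
        out[rows] += lora_delta(x[rows], A, B, scalings[aid])
    return out


def apply_per_row(x, W, adapters, adapter_ids, scalings):
    """Hypothetical reference loop: handle each row on its own."""
    rows = []
    for i, aid in enumerate(adapter_ids):
        xi = x[i:i + 1]  # keep a 2-D (1, in) slice
        if aid < 0 or adapters[aid] is None:
            rows.append(xi @ W.T)
        else:
            A, B = adapters[aid]
            rows.append(apply_single(xi, W, A, B, scalings[aid]))
    return np.vstack(rows)


rng = np.random.default_rng(0)
n, d_in, d_out, r = 6, 8, 5, 2
x = rng.normal(size=(n, d_in))
W = rng.normal(size=(d_out, d_in))
adapters = {
    0: (rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r))),
    1: (rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r))),
    2: None,  # a registered id with no adapter: base only
}
scalings = {0: 0.5, 1: 2.0, 2: 1.0}
adapter_ids = [0, -1, 1, 0, 2, 1]

batched = apply_batched(x, W, adapters, adapter_ids, scalings)
reference = apply_per_row(x, W, adapters, adapter_ids, scalings)
print(np.allclose(batched, reference))
```

Grouping with np.unique keeps the number of delta matmuls equal to the number of distinct adapters in the batch, however the rows are interleaved.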

The point (the insight)

apply_batched reads the base weight once for the whole batch and adds only a tiny rank-r delta per adapter group — so serving N adapters costs ≈ base + N small matmuls, not N full model runs. That's the multi-tenant cost advantage. Your grouping by adapter_id mirrors SGMV's segmenting; it's the same "group by id" trick as MoE (Phase 7), here by adapter.
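To put rough numbers on that claim, here is a back-of-the-envelope count of weight elements read for a single linear layer. It is a simplified model, not a measurement: it ignores activations, caching, and kernel details. The function names and the example sizes (4096 × 4096 layer, rank 16, 8 adapters) are illustrative assumptions.

```python
def weight_reads_grouped(d_in, d_out, r, n_adapters):
    """Base weight read once, plus A (r x in) and B (out x r) per adapter."""
    base = d_in * d_out
    per_adapter = r * (d_in + d_out)
    return base + n_adapters * per_adapter


def weight_reads_separate(d_in, d_out, n_adapters):
    """One full (merged) weight read per adapter, as if each ran its own model."""
    return n_adapters * d_in * d_out


# Illustrative sizes only.
d_in, d_out, r, n_adapters = 4096, 4096, 16, 8
grouped = weight_reads_grouped(d_in, d_out, r, n_adapters)
separate = weight_reads_separate(d_in, d_out, n_adapters)
print(grouped, separate, separate / grouped)
```

Under these assumptions, each adapter adds r × (in + out) elements on top of the in × out base, which is under 1% of the base for these sizes.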

Definition of done

pytest phase-11-multi-lora/labs -q

Tests pin: apply_batched == per-row reference; base-only rows equal x @ W.T; the delta has the right rank-r structure; and a single shared base matmul covers all rows.
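The rank-r property can be checked by hand along these lines. This is a sketch of the idea, not the lab's actual test code, and it redefines lora_delta locally so it runs on its own. Feeding the identity matrix through lora_delta exposes the full delta map, which equals scaling × (B @ A).T and so has rank at most r.

```python
import numpy as np


def lora_delta(x, A, B, scaling):
    # shrink (x @ A.T), then expand (@ B.T)
    return scaling * ((x @ A.T) @ B.T)


rng = np.random.default_rng(0)
d_in, d_out, r = 16, 12, 4  # illustrative sizes
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))

# Passing the identity as x yields the whole (in, out) delta map.
delta_map = lora_delta(np.eye(d_in), A, B, scaling=1.0)
rank = np.linalg.matrix_rank(delta_map)
print(delta_map.shape, rank)
```

The map is 16 × 12, yet its rank cannot exceed r = 4 because every output passes through the r-dimensional shrink step.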

Map to the real engine

your numpy | real vLLM
lora_delta (shrink→expand) | add_shrink / add_expand (punica_cpu.py:166/:197)
apply_batched (grouped by id) | add_lora_linear / SGMV (punica_base.py:88)
adapters dict by id | LoRAModelManager slots (model_manager.py)
adapter_ids per row | LoRARequest.lora_int_id (request.py:8)
max distinct adapters | max_loras (manager LRU + scheduler check)