Phase 11 — Mini-Build: batched multi-adapter LoRA

You'll implement the LoRA delta (shrink → expand) and the grouped application that serves many adapters in one batch, then prove it matches applying each adapter request by request. This is the Punica/SGMV idea in numpy.

The task (lab-01)
The point (the insight)
Definition of done
Map to the real engine

The task (lab-01)

Implement, in numpy:

lora_delta(x, A, B, scaling) → scaling × (x @ A.T) @ B.T. (A:(r,in), B:(out,r).) Note it's two small matmuls with a rank-r bottleneck.
apply_single(x, W, A, B, scaling) → x @ W.T + lora_delta(...) (base + one adapter).
apply_batched(x, W, adapters, adapter_ids, scalings) → each row i of x uses adapters[adapter_ids[i]] (an (A,B) pair, or None for base-only). Do it grouped: compute the shared base x @ W.T once, then for each distinct adapter id add its delta to its rows. Must equal a per-row reference loop.

adapter_ids[i] == -1 (or None entry) means "base only, no adapter" for that row.
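
One way the three functions could look, as a minimal sketch rather than the lab's reference solution. It assumes adapter_ids holds integers (with -1 for base-only), that adapters and scalings are both indexed by adapter id, and that W has shape (out, in). The per-row helper apply_reference is a hypothetical name added here only to have something to compare against.

```python
import numpy as np


def lora_delta(x, A, B, scaling):
    """Shrink to rank r, then expand. x: (n, in), A: (r, in), B: (out, r)."""
    return scaling * ((x @ A.T) @ B.T)


def apply_single(x, W, A, B, scaling):
    """Base projection plus one adapter's delta. W: (out, in)."""
    return x @ W.T + lora_delta(x, A, B, scaling)


def apply_batched(x, W, adapters, adapter_ids, scalings):
    """One shared base matmul, then one delta per distinct adapter id."""
    ids = np.asarray(adapter_ids)
    out = x @ W.T  # base weight is read once for every row
    for aid in np.unique(ids):
        aid = int(aid)
        if aid == -1 or adapters[aid] is None:
            continue  # base-only rows keep the plain projection
        mask = ids == aid
        A, B = adapters[aid]
        out[mask] += lora_delta(x[mask], A, B, scalings[aid])
    return out


def apply_reference(x, W, adapters, adapter_ids, scalings):
    """Hypothetical per-row loop used only as a correctness oracle."""
    rows = []
    for i, aid in enumerate(adapter_ids):
        xi = x[i:i + 1]
        if aid == -1 or adapters[aid] is None:
            rows.append(xi @ W.T)
        else:
            A, B = adapters[aid]
            rows.append(apply_single(xi, W, A, B, scalings[aid]))
    return np.concatenate(rows, axis=0)


# Small demo: two real adapters, one None entry, and one -1 row.
rng = np.random.default_rng(0)
n, d_in, d_out, r = 6, 8, 5, 2
x = rng.standard_normal((n, d_in))
W = rng.standard_normal((d_out, d_in))
adapters = {
    0: (rng.standard_normal((r, d_in)), rng.standard_normal((d_out, r))),
    1: (rng.standard_normal((r, d_in)), rng.standard_normal((d_out, r))),
    2: None,
}
scalings = {0: 0.5, 1: 2.0, 2: 1.0}
adapter_ids = [0, -1, 1, 0, 2, 1]

batched = apply_batched(x, W, adapters, adapter_ids, scalings)
reference = apply_reference(x, W, adapters, adapter_ids, scalings)
assert np.allclose(batched, reference)
```

The boolean mask per id plays the role of a segment here; a real kernel would typically sort or index rows by adapter instead of rescanning the id array for each group.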

The point (the insight)

apply_batched reads the base weight once for the whole batch and adds only a small rank-r delta per adapter group, so serving N adapters costs roughly one base matmul plus N shrink/expand pairs, not N full model runs. That's the multi-tenant cost advantage. Your grouping by adapter_id mirrors SGMV's segmenting; it's the same "group by id" trick as MoE (Phase 7), applied per adapter instead of per expert.
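
To put a rough number on that advantage, here is a back-of-the-envelope FLOP count. The layer sizes (4096 × 4096, rank 16, 32 rows) are illustrative assumptions, not figures from the lab, and the helper names are hypothetical. It counts one multiply-add as 2 FLOPs and ignores memory traffic.

```python
def base_flops(n_rows, d_in, d_out):
    # x @ W.T: (n, in) times (in, out)
    return 2 * n_rows * d_in * d_out


def delta_flops(n_rows, d_in, d_out, r):
    # shrink (n, in) @ (in, r), then expand (n, r) @ (r, out)
    return 2 * n_rows * r * (d_in + d_out)


d_in = d_out = 4096  # assumed layer size
r = 16               # assumed LoRA rank
n_rows = 32          # assume every row uses some adapter

base = base_flops(n_rows, d_in, d_out)
delta = delta_flops(n_rows, d_in, d_out, r)
overhead = delta / base

print(f"delta overhead on top of the shared base: {overhead:.2%}")  # 0.78%
```

Under these assumptions the total delta work depends on how many rows carry an adapter, not on how many distinct adapters there are; the number of groups mainly changes how many small matmuls are launched.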

Definition of done

pytest phase-11-multi-lora/labs -q

Tests pin: apply_batched == per-row reference; base-only rows equal x @ W.T; the delta has the right rank-r structure; and a single shared base matmul covers all rows.
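
The "rank-r structure" check can be illustrated on its own. The sketch below is one plausible way to test it, not necessarily how the lab's tests do it: it confirms that the shrink → expand delta equals multiplying by the merged update scaling × B @ A, and that neither has rank above r. The shapes are arbitrary small values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, r = 16, 12, 10, 3
x = rng.standard_normal((n, d_in))
A = rng.standard_normal((r, d_in))   # shrink: in -> r
B = rng.standard_normal((d_out, r))  # expand: r -> out
scaling = 0.5

# Two-step delta, as lora_delta computes it.
delta = scaling * ((x @ A.T) @ B.T)  # (n, out)

# Equivalent merged weight update; materialised here only for the check.
delta_W = scaling * (B @ A)          # (out, in)

assert np.allclose(delta, x @ delta_W.T)
assert np.linalg.matrix_rank(delta_W) <= r
assert np.linalg.matrix_rank(delta) <= r
```

Even with 16 rows and 10 output features, the delta cannot exceed rank 3 here, because everything passes through the r-wide bottleneck.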

Map to the real engine

your numpy	real vLLM
`lora_delta` (shrink→expand)	`add_shrink` / `add_expand` (`punica_cpu.py:166`/`:197`)
`apply_batched` (grouped by id)	`add_lora_linear` / SGMV (`punica_base.py:88`)
`adapters` dict by id	`LoRAModelManager` slots (`model_manager.py`)
`adapter_ids` per row	`LoRARequest.lora_int_id` (`request.py:8`)
`max distinct adapters`	`max_loras` (manager LRU + scheduler check)

vLLM Mastery — From Zero to Maintainer
