Phase 11 Labs — Multi-LoRA
Four labs on serving many fine-tunes over one set of base weights. The arc: build the grouped delta math and prove consolidation changes nothing (lab-01), price the adapters and the fleet savings (lab-03), manage the slot cache and its scheduler constraint (lab-04), then watch a mixed base+adapter batch produce two models' behavior in one step on real hardware (lab-02).
Recommended order: 01 → 03 → 04 → 02. (Directory numbers predate labs 03–04: math,
economics, machinery, demo.) CPU labs follow the standard contract — starter.py
(your work), solution.py (reference), test_lab.py (the spec); default runs the
solution, LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-11-multi-lora/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-11-multi-lora/labs/lab-01-batched-lora-matmul -q
Contents
- lab-01-batched-lora-matmul
[CPU-OK] - lab-02-serve-many-loras
[GPU-OPT] - lab-03-lora-economics
[CPU-OK] - lab-04-adapter-slot-cache
[CPU-OK] - What you can do after this phase
Labs
lab-01-batched-lora-matmul [CPU-OK]
The punica/SGMV idea in numpy: shrink → expand deltas, one shared base matmul for the whole batch, per-adapter-group scatter-adds — proven exactly equal to the per-row loop (the consolidation safety case). Base-only rows ride free; the delta provably factors through the rank-r bottleneck. Skills: never materialize B@A; the group-by-parameter-set pattern (MoE's permute trick, second appearance); why mixed batches cost ~nothing.
lab-02-serve-many-loras [GPU-OPT]
The integration test: one batch, base + SQL adapter, two behaviors, one 12.55 GiB weight copy, 0.03 GiB of adapter — every number reconciled against labs 01/03/04. Plus the productization surface: adapters as model names in the OpenAI API, runtime loading, the cold-slot p99 signature. Annotated capture included. Skills: the operational knobs; behavior follows the tag; eval-diff due diligence for tenant migrations.
lab-03-lora-economics [CPU-OK]
The multi-tenant arithmetic as functions: 32 MiB per rank-16 7B adapter (deriving
lab-02's logged "0.03 GiB" from constants), ~430× smaller than the base, 32 per GiB,
rank as a linear memory dial with fleet-wide blast radius, and 87 GPUs saved at 100
tenants. Skills: economics-as-tested-functions; max_lora_rank as a memory
commitment; auditing platform pitches in your head.
lab-04-adapter-slot-cache [CPU-OK]
The machinery max_loras names: pre-allocated slots (kernel/graph shape stability —
Phase 5's constraint, again), an LRU cache with honest hit accounting (>75% on 80/20
traffic with 4 slots over 16 adapters), and the scheduler walk that defers — not
barriers — overflow requests. The serving-systems kata (cache-with-eviction +
admission-under-capacity), third appearance. Skills: OrderedDict as LRU; thrash
arithmetic; per-resource admission policy as a design decision; cross-component
invariants.
What you can do after this phase
Explain to a CFO why 100 fine-tunes need 13 engines, and to an engineer why the
consolidation is provably lossless; size max_loras/max_lora_rank from traffic
shape and memory budget rather than defaults; diagnose tenant p99 complaints down to
slot thrash with the cache model; and read vllm/lora/ — punica wrappers, the model
manager, the scheduler gating — as three labs you've already written. Phase 12 rides
lab 09-01's processor hook; the slot discipline you built here returns whenever
per-request GPU state does.