Phase 11 Labs — Multi-LoRA

Four labs on serving many fine-tunes over one set of base weights. The arc: build the grouped delta math and prove consolidation changes nothing (lab-01), price the adapters and the fleet savings (lab-03), manage the slot cache and its scheduler constraint (lab-04), then watch a mixed base+adapter batch produce two models' behavior in one step on real hardware (lab-02).

Recommended order: 01 → 03 → 04 → 02 (math, economics, machinery, demo). The directory numbers predate labs 03–04, so they do not match this order. CPU labs follow the standard contract: starter.py (your work), solution.py (reference), test_lab.py (the spec). By default the tests run the solution; set LAB_IMPL=starter to grade yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-11-multi-lora/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-11-multi-lora/labs/lab-01-batched-lora-matmul -q

Labs

lab-01-batched-lora-matmul [CPU-OK]

The punica/SGMV idea in numpy: shrink → expand deltas, one shared base matmul for the whole batch, per-adapter-group scatter-adds — proven exactly equal to the per-row loop (the consolidation safety case). Base-only rows ride free; the delta provably factors through the rank-r bottleneck. Skills: never materialize B@A; the group-by-parameter-set pattern (MoE's permute trick, second appearance); why mixed batches cost ~nothing.
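The grouping idea above can be sketched in a few lines of numpy. This is a minimal illustration, not the lab's reference solution: the function names, the shapes, and the use of -1 as the "base only" adapter id are assumptions made for this example. It uses the row-vector convention, so the delta is (x @ A) @ B with A of shape (d_in, r) and B of shape (r, d_out); the full-size product of the two factors is never formed.

```python
import numpy as np

def lora_per_row(x, W, A, B, adapter_ids):
    """Reference loop: each row gets the base matmul plus its own adapter delta."""
    out = np.empty((x.shape[0], W.shape[1]))
    for i, aid in enumerate(adapter_ids):
        out[i] = x[i] @ W
        if aid >= 0:  # -1 marks a base-only row (assumption of this sketch)
            out[i] += (x[i] @ A[aid]) @ B[aid]
    return out

def lora_grouped(x, W, A, B, adapter_ids):
    """One shared base matmul, then one shrink -> expand per adapter group."""
    out = x @ W  # whole batch, base-only rows are finished after this line
    for aid in np.unique(adapter_ids):
        if aid < 0:
            continue
        rows = np.flatnonzero(adapter_ids == aid)  # rows using this adapter
        shrink = x[rows] @ A[aid]                  # (n_rows, r) bottleneck
        out[rows] += shrink @ B[aid]               # expand and scatter-add
    return out

rng = np.random.default_rng(0)
d_in, d_out, r, n_adapters = 8, 6, 2, 3
adapter_ids = np.array([0, -1, 2, 0, 1, -1, 2, 2, 0, 1])
x = rng.standard_normal((len(adapter_ids), d_in))
W = rng.standard_normal((d_in, d_out))
A = rng.standard_normal((n_adapters, d_in, r))
B = rng.standard_normal((n_adapters, r, d_out))

reference = lora_per_row(x, W, A, B, adapter_ids)
grouped = lora_grouped(x, W, A, B, adapter_ids)
print("grouped matches per-row loop:", np.allclose(grouped, reference))
```

The comparison uses np.allclose because a batched matmul and a per-row matmul may round differently in the last bits. The rows of each group are unique, so the fancy-indexed += is a safe scatter-add here.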

lab-02-serve-many-loras [GPU-OPT]

The integration test: one batch, base + SQL adapter, two behaviors, one 12.55 GiB weight copy, 0.03 GiB of adapter — every number reconciled against labs 01/03/04. Plus the productization surface: adapters as model names in the OpenAI API, runtime loading, the cold-slot p99 signature. Annotated capture included. Skills: the operational knobs; behavior follows the tag; eval-diff due diligence for tenant migrations.

lab-03-lora-economics [CPU-OK]

The multi-tenant arithmetic as functions: 32 MiB per rank-16 7B adapter (deriving lab-02's logged "0.03 GiB" from constants), ~430× smaller than the base, 32 per GiB, rank as a linear memory dial with fleet-wide blast radius, and 87 GPUs saved at 100 tenants. Skills: economics-as-tested-functions; max_lora_rank as a memory commitment; auditing platform pitches in your head.
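A rough sketch of that arithmetic as functions follows. The model shape is an assumption chosen because it reproduces the 32 MiB figure: hidden size 4096, 32 layers, four square attention projections adapted per layer, 2 bytes per parameter. The lab's own constants may differ, and models with grouped-query attention have smaller k/v projections. The 8 tenants per engine and one GPU per engine in gpus_saved are likewise hypothetical values picked to match the 13-engine figure quoted for this phase.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def lora_adapter_bytes(rank, hidden=4096, layers=32, targets=4, bytes_per_param=2):
    """Bytes for one adapter: each target holds A (hidden x rank) and B (rank x hidden)."""
    params_per_target = rank * (hidden + hidden)
    return layers * targets * params_per_target * bytes_per_param

def adapters_per_gib(rank):
    """How many adapters of this rank fit in one GiB."""
    return GIB // lora_adapter_bytes(rank)

def gpus_saved(tenants, tenants_per_engine):
    """Dedicated deployment (one GPU per tenant) minus shared engines (one GPU each)."""
    shared_engines = -(-tenants // tenants_per_engine)  # ceiling division
    return tenants - shared_engines

size = lora_adapter_bytes(16)
print(size // MIB, "MiB per rank-16 adapter")
print(round(size / GIB, 2), "GiB, as a log line would round it")
print(adapters_per_gib(16), "adapters per GiB")
print(lora_adapter_bytes(32) // MIB, "MiB at rank 32 (memory is linear in rank)")
print(gpus_saved(100, 8), "GPUs saved at 100 tenants, assuming 8 tenants per engine")
```

Because slot memory is sized by the maximum rank rather than by each adapter's actual rank, doubling max_lora_rank doubles the per-slot commitment for every tenant, which is the fleet-wide blast radius the lab refers to.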

lab-04-adapter-slot-cache [CPU-OK]

The machinery max_loras names: pre-allocated slots (kernel/graph shape stability — Phase 5's constraint, again), an LRU cache with honest hit accounting (>75% on 80/20 traffic with 4 slots over 16 adapters), and the scheduler walk that defers — not barriers — overflow requests. The serving-systems kata (cache-with-eviction + admission-under-capacity), third appearance. Skills: OrderedDict as LRU; thrash arithmetic; per-resource admission policy as a design decision; cross-component invariants.
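The two pieces described above, an LRU slot cache and a scheduler walk that defers overflow, might look roughly like this. It is a simplified sketch under stated assumptions: SlotCache, acquire, and schedule_step are hypothetical names, not the lab's API or vLLM's, and a request is modeled as a (request_id, adapter_id) pair with None meaning base-only.

```python
from collections import OrderedDict

class SlotCache:
    """Fixed number of pre-allocated slots, least recently used adapter evicted first."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.slots = OrderedDict()  # adapter_id -> slot index, oldest first
        self.hits = 0
        self.misses = 0

    def acquire(self, adapter_id):
        """Return the slot for adapter_id, loading it (and evicting) on a miss."""
        if adapter_id in self.slots:
            self.slots.move_to_end(adapter_id)  # mark as most recently used
            self.hits += 1
            return self.slots[adapter_id]
        self.misses += 1
        if len(self.slots) < self.num_slots:
            slot = len(self.slots)                    # next free slot
        else:
            _, slot = self.slots.popitem(last=False)  # reuse the LRU slot
        self.slots[adapter_id] = slot
        return slot

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

def schedule_step(waiting, max_loras):
    """Walk the queue in order; admit a request if its adapter fits, else defer it.

    A deferred request does not block the requests behind it.
    """
    batch, deferred, active = [], [], set()
    for req_id, adapter_id in waiting:
        fits = adapter_id is None or adapter_id in active or len(active) < max_loras
        if fits:
            if adapter_id is not None:
                active.add(adapter_id)
            batch.append(req_id)
        else:
            deferred.append((req_id, adapter_id))
    return batch, deferred

cache = SlotCache(num_slots=2)
slots_used = [cache.acquire(a) for a in ["a", "b", "a", "c", "b"]]
print(slots_used, cache.hits, cache.misses)

waiting = [(0, "a"), (1, "b"), (2, "c"), (3, None), (4, "a")]
batch, deferred = schedule_step(waiting, max_loras=2)
print(batch, deferred)
```

In the trace, request 2 is deferred because a third distinct adapter would exceed max_loras, while the base-only request 3 and the repeat of adapter "a" behind it are still admitted. The invariant that ties the two components together is that a batch never names more distinct adapters than there are slots.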

What you can do after this phase

Explain to a CFO why 100 fine-tunes need 13 engines, and to an engineer why the consolidation is provably lossless; size max_loras/max_lora_rank from traffic shape and memory budget rather than defaults; diagnose tenant p99 complaints down to slot thrash with the cache model; and read vllm/lora/ — punica wrappers, the model manager, the scheduler gating — as three labs you've already written. Phase 12 rides lab 09-01's processor hook; the slot discipline you built here returns whenever per-request GPU state does.