Lab 11-03 — LoRA Economics: the Multi-Tenant Arithmetic [CPU-OK]
Multi-LoRA serving exists as a product category because of five numbers, and this lab has you compute all five: a rank-16 adapter for a 7B model weighs 32 MiB (you'll derive the exact figure that appeared as "0.03 GiB" in lab-02's capture; model and measurement agreeing is the course's favorite trick); that's ~1/417th of the base weights; 32 of them fit in a single GiB of spare HBM; rank scales the bill linearly (and quality famously doesn't); and serving 100 tenants at max_loras=8 takes 13 engines instead of 100 GPUs. When a platform pitch says "thousands of fine-tunes on shared infrastructure," this lab is the spreadsheet behind the slide, and after it you can audit such pitches in your head.
Contents
- Why this lab exists
- Background: where the 400× comes from
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
This is the third "economics as functions" lab in the course (after Phase 0 lab-02's
KV calculator and Phase 8 lab-04's speculation model), and the pattern deserves
naming: the highest-leverage engineering questions — can we afford it? how many fit?
when does it stop paying? — reduce to short arithmetic over architecture constants,
and an engineer who has packaged that arithmetic into tested functions answers in
seconds what others answer with meetings. Multi-LoRA's arithmetic is the most
business-shaped of the three: it directly prices a product (per-tenant fine-tunes)
against its alternative (dedicated deployments), and the gpus_saved function is, not
even metaphorically, a line in someone's cloud bill.
The lab also grounds two config knobs you'll meet operationally: max_lora_rank sizes
the pre-allocated adapter buffers (rank is a memory commitment, not just a quality
dial — lab-04 builds the slots this arithmetic sizes), and max_loras is the
concurrency denominator in the fleet math.
Background: where the 400× comes from
A LoRA adapter replaces a weight update ΔW (which would be out × in, as big as the
layer) with a rank-r factorization B @ A — A: (r, in), B: (out, r) — so the
parameter count collapses from out·in to r·(in + out). For a 4096² projection at
r=16: 16.8M → 131K parameters, a 128× shrink per layer. Across a 7B model (32
layers × 4 attention projections targeted, the standard recipe):
131,072 params × 4 targets × 32 layers × 2 B (fp16) = 32 MiB
7,000,000,000 / 16,777,216 params ≈ 417×
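
The two lines of arithmetic above can be packaged as small functions. This is a minimal sketch, not the lab's reference solution: the function names echo those listed for starter.py, but the signatures, the default shapes (hidden=4096, 32 layers, 4 square targets) and the helper adapter_params are assumptions made for illustration.

```python
MIB = 1024 ** 2


def lora_params_per_layer(rank: int, d_in: int, d_out: int) -> int:
    """Parameters in one adapted projection: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)


def adapter_params(rank: int, hidden: int = 4096, n_layers: int = 32,
                   n_targets: int = 4) -> int:
    """Whole-adapter parameter count, assuming every target is hidden x hidden."""
    return lora_params_per_layer(rank, hidden, hidden) * n_targets * n_layers


def adapter_bytes(rank: int, bytes_per_param: int = 2, **shape) -> int:
    """Adapter size in bytes; 2 bytes per parameter corresponds to fp16."""
    return adapter_params(rank, **shape) * bytes_per_param


def shrink_ratio(base_params: int, rank: int, **shape) -> float:
    """How many times smaller the adapter is than the base, by parameter count."""
    return base_params / adapter_params(rank, **shape)


if __name__ == "__main__":
    per_layer = lora_params_per_layer(16, 4096, 4096)
    print(per_layer)                                # 131072
    print(4096 * 4096 // per_layer)                 # 128 (per-layer shrink)
    print(adapter_bytes(16) / MIB)                  # 32.0
    print(round(shrink_ratio(7_000_000_000, 16)))   # 417
```

Keeping the shapes as keyword arguments means the same functions can be pointed at a different model card by changing the constants only.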
The shrink is the whole business model: base weights are read once per step for the
entire batch regardless of how many tenants it contains (lab-01's shared base
matmul), KV cache is adapter-agnostic, and each tenant's marginal footprint is their
32 MiB plus nothing. The compute side has the same shape — the delta costs
2·r·(in+out) FLOPs per token against the base's 2·in·out, the same ~128× ratio —
which is why a batch full of different adapters runs at nearly base-model speed
(punica/SGMV kernels make the grouping efficient; lab-01 built their logic).
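
The compute claim in the paragraph above has the same shape as the storage claim, and can be checked the same way. A rough sketch, counting one multiply-accumulate as 2 FLOPs and ignoring kernel launch and gather overheads; the function names here are hypothetical, not part of the lab's files.

```python
def base_flops_per_token(d_in: int, d_out: int) -> int:
    """Dense projection y = W x with W of shape (d_out, d_in)."""
    return 2 * d_in * d_out


def delta_flops_per_token(rank: int, d_in: int, d_out: int) -> int:
    """LoRA delta: A x costs 2*rank*d_in, then B (A x) costs 2*rank*d_out."""
    return 2 * rank * (d_in + d_out)


def lora_compute_overhead(rank: int, d_in: int, d_out: int) -> float:
    """Delta FLOPs as a fraction of base FLOPs for one adapted projection."""
    return delta_flops_per_token(rank, d_in, d_out) / base_flops_per_token(d_in, d_out)


if __name__ == "__main__":
    # Rank 16 on a 4096 x 4096 projection: 1/128 of the base cost.
    print(lora_compute_overhead(16, 4096, 4096))   # 0.0078125
    # Rank is a linear dial on compute too: rank 64 costs 4x as much.
    print(lora_compute_overhead(64, 4096, 4096))   # 0.03125
```

This covers only the adapted projections; unadapted layers add base FLOPs without adding delta FLOPs, so the whole-model overhead is lower still.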
Files
- starter.py: lora_params_per_layer, adapter_bytes, adapters_per_gib, shrink_ratio, gpus_saved. Your work.
- solution.py: reference.
- test_lab.py: the per-layer count, the 32 MiB ↔ lab-02 reconciliation, density per GiB, the headline ratio, rank linearity, and the fleet math.
Run
LAB_IMPL=starter pytest phase-11-multi-lora/labs/lab-03-lora-economics -q
pytest phase-11-multi-lora/labs/lab-03-lora-economics -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_per_layer_params | r·(in+out): the factorization's bill, exactly |
| test_adapter_size_matches_the_lab02_capture | 32 MiB = the "0.03 GiB" from lab-02's real log. Deriving a measured number from constants is the moment the model becomes trustworthy |
| test_hundreds_of_adapters_per_gib | 32/GiB: adapter storage is never the constraint; slots and loading are (lab-04) |
| test_shrink_ratio_is_the_headline | 400–450×, computed not quoted |
| test_rank_is_a_linear_dial | r=64 costs exactly 4× r=16, and since max_lora_rank sizes every pre-allocated slot, one tenant demanding rank 64 quadruples everyone's slot reservation. Config knobs with fleet-wide blast radius deserve tests |
| test_gpus_saved | 100 tenants @ max_loras=8 → 87 GPUs saved. The slide, audited |
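
The density, rank-linearity and fleet rows of the table reduce to a few lines each. The sketch below is one way to write them, assuming one GPU per engine, one GPU per dedicated deployment, and square 4096-wide targets; engines_needed and square_adapter_bytes are hypothetical helpers, and the signatures need not match starter.py.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def square_adapter_bytes(rank: int, hidden: int = 4096, n_layers: int = 32,
                         n_targets: int = 4, bytes_per_param: int = 2) -> int:
    """Adapter size assuming every target is a hidden x hidden projection."""
    return rank * (hidden + hidden) * n_targets * n_layers * bytes_per_param


def adapters_per_gib(adapter_size_bytes: int) -> int:
    """Whole adapters that fit in one GiB of spare memory."""
    return GIB // adapter_size_bytes


def engines_needed(tenants: int, max_loras: int) -> int:
    """Engines required if each serves at most max_loras adapters concurrently."""
    return math.ceil(tenants / max_loras)


def gpus_saved(tenants: int, max_loras: int) -> int:
    """Dedicated deployments (one GPU per tenant) minus shared engines."""
    return tenants - engines_needed(tenants, max_loras)


if __name__ == "__main__":
    print(adapters_per_gib(square_adapter_bytes(16)))            # 32
    print(square_adapter_bytes(64) // square_adapter_bytes(16))  # 4
    print(engines_needed(100, 8), gpus_saved(100, 8))            # 13 87
```

The ceiling in engines_needed is the formula's floor on capacity: it says nothing about how tenant traffic is distributed in time.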
Hitchhiker's notes
- What the simplification hides (know before quoting): real targets aren't all square. Llama's gate/up/down MLP projections (often also adapted) are hidden × 2.7·hidden-ish, and GQA's k/v projections are narrower than q/o. Adapting all linear layers at r=16 lands nearer 80–120 MiB for a 7B. The structure of the arithmetic is what transfers; refit the constants to any model card in two minutes.
- Why rank-16 at all, if rank is linear cost? Because LoRA quality saturates fast: the original paper's striking result was r=1..4 capturing most of full fine-tuning on many tasks. The production default of 8–16 is generosity, not necessity; tenants asking for 256 are usually solving a data problem with a parameter budget (and multiplying everyone's slot memory 16× relative to rank 16; push back with this lab's numbers).
- The denominator in gpus_saved is max_loras, not "adapters you host." Hundreds can sit in host RAM or disk; only max_loras are concurrently active per step. The fleet math assumes tenant traffic interleaves well: 100 tenants who all spike at 9 a.m. sharp need more headroom than the formula's floor. Capacity formulas are load-shape assumptions in disguise (Phase 7 lab-04's lesson, tenant edition).
- Why not merge the adapter into the weights (W + BA, zero overhead)? Single-tenant: absolutely, and tooling does. Multi-tenant: merging forks the base, so you're back to one model copy per tenant, which is the disease this phase cures. The unmerged factorization is the sharing mechanism, the same way Phase 2's block indirection is the memory sharing.
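
The merge question in the last note can be made concrete with a toy example. The sketch below uses tiny integer matrices (illustrative values, not real weights) to show that the merged form (W + BA)x and the unmerged form Wx + B(Ax) give the same output, and that only the unmerged form lets two tenants reuse a single base product.

```python
def matvec(m, x):
    """Matrix-vector product for lists of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def matmul(p, q):
    """Matrix-matrix product for lists of rows."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def madd(m, n):
    """Elementwise matrix sum."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(m, n)]


W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # shared base weight, (out=3, in=3)
x = [1, 2, 3]                           # one token's activations

# Two tenants, each with a rank-1 adapter: A is (r, in), B is (out, r).
tenants = {
    "t1": ([[1, 0, -1]], [[2], [0], [1]]),
    "t2": ([[0, 1, 0]], [[1], [1], [1]]),
}

base = matvec(W, x)  # computed once, reused by every tenant
for name, (A, B) in tenants.items():
    delta = matvec(B, matvec(A, x))                # rank-r path, never forms B @ A
    unmerged = [b + d for b, d in zip(base, delta)]
    merged = matvec(madd(W, matmul(B, A)), x)      # needs a private copy of W
    assert unmerged == merged
    print(name, unmerged)
```

The merged path has to materialize a full-size W + BA per tenant, which is exactly the per-tenant model copy the note warns about.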
Going further
- Refit adapter_bytes with real Llama-2-7B shapes (q/k/v/o + gate/up/down, plus GQA widths for models that use them) and compare against an actual adapter checkpoint's file size from the Hub; close the loop with a du -sh.
- Add slot_reservation_bytes(max_loras, max_lora_rank, ...): the pre-allocated HBM the engine reserves at startup whether or not adapters load (it competes with KV blocks! Phase 2 lab-03's carving, with a new claimant). Compute the KV-block cost of max_loras=32, max_lora_rank=64 on a 24 GiB card.
- Price the compute side: delta FLOPs per token vs base FLOPs, then the batch-of-mixed-adapters overhead vs batch-of-one. The answer (~1%) is why lab-02's capture shows no visible throughput tax; verify against it.
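
As a starting point for the first exercise, the refit might look like the sketch below. The shapes are assumptions taken from the publicly documented Llama-2-7B configuration (hidden 4096, MLP intermediate 11008, 32 layers, k/v as wide as q/o); check them against the model card before relying on the result. The names PROJECTIONS and refit_adapter_bytes are hypothetical.

```python
MIB = 1024 ** 2

HIDDEN = 4096          # assumed hidden size
INTERMEDIATE = 11008   # assumed MLP width (about 2.7 x hidden)
KV_WIDTH = 4096        # assumed equal to hidden; narrower under GQA
N_LAYERS = 32

# (d_in, d_out) for each adapted projection in one transformer layer.
PROJECTIONS = {
    "q": (HIDDEN, HIDDEN),
    "k": (HIDDEN, KV_WIDTH),
    "v": (HIDDEN, KV_WIDTH),
    "o": (HIDDEN, HIDDEN),
    "gate": (HIDDEN, INTERMEDIATE),
    "up": (HIDDEN, INTERMEDIATE),
    "down": (INTERMEDIATE, HIDDEN),
}


def refit_adapter_bytes(rank: int, bytes_per_param: int = 2) -> int:
    """Adapter size with per-projection shapes instead of square targets."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
    return per_layer * N_LAYERS * bytes_per_param


if __name__ == "__main__":
    size = refit_adapter_bytes(16)
    print(size, size / MIB)   # 79953920 76.25
```

Under these assumed shapes the all-linear adapter comes to about 76 MiB (roughly 80 MB in decimal units), near the low end of the range quoted in the notes; a real checkpoint file will also carry metadata and may store a different dtype.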
References
- Hu et al., LoRA: Low-Rank Adaptation of Large Language Models (2021) — the factorization and the rank-saturation result: https://arxiv.org/abs/2106.09685
- Chen et al., Punica: Multi-Tenant LoRA Serving (2023) — the SGMV kernel and the multi-tenant economics formalized: https://arxiv.org/abs/2310.18547
- Sheng et al., S-LoRA: Serving Thousands of Concurrent LoRA Adapters (2023) — the thousands-of-adapters regime this arithmetic enables: https://arxiv.org/abs/2311.03285
- upstream/vllm/lora/: where max_loras/max_lora_rank size real buffers.
- Lab-02: the captured 0.03 GiB this lab derives; lab-04: the slots this lab sizes.