Lab 11-03 — LoRA Economics: the Multi-Tenant Arithmetic [CPU-OK]

Multi-LoRA serving exists as a product category because of five numbers, and this lab has you compute all five: a rank-16 adapter for a 7B model weighs 32 MiB (you'll derive the exact figure that appeared as "0.03 GiB" in lab-02's capture — model and measurement agreeing is the course's favorite trick); that's ~1/430th of the base weights; 32 of them fit in a single GiB of spare HBM; rank scales the bill linearly (and quality famously doesn't); and serving 100 tenants takes 13 engines instead of 100 GPUs. When a platform pitch says "thousands of fine-tunes on shared infrastructure," this lab is the spreadsheet behind the slide — and after it, you can audit such pitches in your head.

Why this lab exists

This is the third "economics as functions" lab in the course (after Phase 0 lab-02's KV calculator and Phase 8 lab-04's speculation model), and the pattern deserves naming: the highest-leverage engineering questions — can we afford it? how many fit? when does it stop paying? — reduce to short arithmetic over architecture constants, and an engineer who has packaged that arithmetic into tested functions answers in seconds what others answer with meetings. Multi-LoRA's arithmetic is the most business-shaped of the three: it directly prices a product (per-tenant fine-tunes) against its alternative (dedicated deployments), and the gpus_saved function is, not even metaphorically, a line in someone's cloud bill.

The lab also grounds two config knobs you'll meet operationally: max_lora_rank sizes the pre-allocated adapter buffers (rank is a memory commitment, not just a quality dial — lab-04 builds the slots this arithmetic sizes), and max_loras is the concurrency denominator in the fleet math.

Background: where the 400× comes from

A LoRA adapter replaces a weight update ΔW (which would be out × in, as big as the layer) with a rank-r factorization B @ A, where A is (r, in) and B is (out, r), so the parameter count collapses from out·in to r·(in + out). For a 4096² projection at r=16: 16.8M → 131K parameters, a 128× shrink per layer. Across a 7B model (32 layers × 4 attention projections targeted, the standard recipe):

131,072 params × 4 targets × 32 layers × 2 B (fp16) = 32 MiB
7,000,000,000 base params / 16,777,216 adapter params ≈ 417×
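
The two lines above can be packaged as functions. The following is a minimal sketch, not the lab's reference solution: the function names follow the Files list, but the signatures, the default shapes (hidden=4096, 32 layers, 4 square targets, fp16), and the adapter_params helper are assumptions made for illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def lora_params_per_layer(rank: int, d_in: int, d_out: int) -> int:
    """Parameters in one adapted projection: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)


def adapter_params(rank: int, hidden: int = 4096, n_layers: int = 32,
                   n_targets: int = 4) -> int:
    """Whole-adapter parameter count, assuming square hidden x hidden targets."""
    return lora_params_per_layer(rank, hidden, hidden) * n_targets * n_layers


def adapter_bytes(rank: int, bytes_per_param: int = 2, **shape) -> int:
    """Adapter size in bytes; bytes_per_param=2 corresponds to fp16."""
    return adapter_params(rank, **shape) * bytes_per_param


def shrink_ratio(base_params: int, rank: int, **shape) -> float:
    """How many times smaller the adapter is than the base, counted in parameters."""
    return base_params / adapter_params(rank, **shape)


if __name__ == "__main__":
    size = adapter_bytes(16)
    print(f"{size / MIB:.0f} MiB = {size / GIB:.2f} GiB")          # 32 MiB = 0.03 GiB
    print(f"shrink ratio: {shrink_ratio(7_000_000_000, 16):.0f}x")  # shrink ratio: 417x
```

Because rank enters only as a multiplier, adapter_bytes(64) is exactly four times adapter_bytes(16) under these assumptions.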

The shrink is the whole business model: base weights are read once per step for the entire batch regardless of how many tenants it contains (lab-01's shared base matmul), KV cache is adapter-agnostic, and each tenant's marginal footprint is their 32 MiB plus nothing. The compute side has the same shape — the delta costs 2·r·(in+out) FLOPs per token against the base's 2·in·out, the same ~128× ratio — which is why a batch full of different adapters runs at nearly base-model speed (punica/SGMV kernels make the grouping efficient; lab-01 built their logic).
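
The compute-side ratio can be checked the same way. This sketch counts only the multiply-accumulate FLOPs of one projection and ignores kernel launch and gather overheads; the function names are illustrative and not part of the lab's files.

```python
def base_flops_per_token(d_in: int, d_out: int) -> int:
    """FLOPs for one token through a dense d_out x d_in projection."""
    return 2 * d_in * d_out


def delta_flops_per_token(rank: int, d_in: int, d_out: int) -> int:
    """FLOPs for the LoRA path: x -> A x (rank x d_in), then B (A x) (d_out x rank)."""
    return 2 * rank * (d_in + d_out)


def delta_overhead(rank: int, d_in: int, d_out: int) -> float:
    """Adapter FLOPs as a fraction of the base projection's FLOPs."""
    return delta_flops_per_token(rank, d_in, d_out) / base_flops_per_token(d_in, d_out)


if __name__ == "__main__":
    # Square 4096 x 4096 projection at rank 16: 1/128 of the base cost.
    print(f"{delta_overhead(16, 4096, 4096):.2%}")  # 0.78%
```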

Files

  • starter.py — lora_params_per_layer, adapter_bytes, adapters_per_gib, shrink_ratio, gpus_saved. Your work.
  • solution.py — reference.
  • test_lab.py — the per-layer count, the 32 MiB ↔ lab-02 reconciliation, density per GiB, the headline ratio, rank linearity, and the fleet math.

Run

LAB_IMPL=starter pytest phase-11-multi-lora/labs/lab-03-lora-economics -q
pytest phase-11-multi-lora/labs/lab-03-lora-economics -q   # reference

What the tests prove

| Test | What it pins |
| --- | --- |
| test_per_layer_params | r·(in+out) — the factorization's bill, exactly |
| test_adapter_size_matches_the_lab02_capture | 32 MiB = the "0.03 GiB" from lab-02's real log. Deriving a measured number from constants is the moment the model becomes trustworthy |
| test_hundreds_of_adapters_per_gib | 32/GiB — adapter storage is never the constraint; slots and loading are (lab-04) |
| test_shrink_ratio_is_the_headline | 400–450×, computed not quoted |
| test_rank_is_a_linear_dial | r=64 costs exactly 4× r=16 — and since max_lora_rank sizes every pre-allocated slot, one tenant demanding rank 64 quadruples everyone's slot reservation. Config knobs with fleet-wide blast radius deserve tests |
| test_gpus_saved | 100 tenants @ max_loras=8 → 87 GPUs saved. The slide, audited |
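
The density and fleet rows reduce to a floor division and a ceiling division. A minimal sketch, assuming one GPU per engine and one dedicated GPU per tenant in the alternative deployment; the signatures and the engines_needed helper are hypothetical and may differ from starter.py.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def adapters_per_gib(adapter_size_bytes: int) -> int:
    """Whole adapters that fit in one GiB of spare memory."""
    return GIB // adapter_size_bytes


def engines_needed(n_tenants: int, max_loras: int) -> int:
    """Engines required if each serves at most max_loras adapters concurrently."""
    return math.ceil(n_tenants / max_loras)


def gpus_saved(n_tenants: int, max_loras: int) -> int:
    """Dedicated GPUs (one per tenant) minus shared engines (one GPU each)."""
    return n_tenants - engines_needed(n_tenants, max_loras)


if __name__ == "__main__":
    print(adapters_per_gib(32 * MIB))  # 32
    print(engines_needed(100, 8))      # 13
    print(gpus_saved(100, 8))          # 87
```

The ceiling is the optimistic floor of the fleet size: it assumes tenant traffic spreads evenly across engines.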

Hitchhiker's notes

  • What the simplification hides (know before quoting): real targets aren't all square — Llama's gate/up/down MLP projections (often also adapted) are hidden × 2.7·hidden-ish, and GQA's k/v projections are narrower than q/o. Adapting all-linear-layers at r=16 lands nearer 80–120 MiB for a 7B. The structure of the arithmetic is what transfers; refit the constants to any model card in two minutes.
  • Why rank-16 at all, if rank is linear cost? Because LoRA quality saturates fast — the original paper's striking result was r=1..4 capturing most of full fine-tuning on many tasks. The production default of 8–16 is generosity, not necessity; tenants asking for 256 are usually solving a data problem with a parameter budget (and multiplying your slot memory 16× relative to rank 16 — push back with this lab's numbers).
  • The denominator in gpus_saved is max_loras, not "adapters you host." Hundreds can sit in host RAM or disk; only max_loras are concurrently active per step. The fleet math assumes tenant traffic interleaves well — 100 tenants who all spike at 9 a.m. sharp need more headroom than the formula's floor. Capacity formulas are load-shape assumptions in disguise (Phase 7 lab-04's lesson, tenant edition).
  • Why not merge the adapter into the weights (W + BA, zero overhead)? Single-tenant: absolutely, and tooling does. Multi-tenant: merging forks the base — you're back to one model copy per tenant, which is the disease this phase cures. The unmerged factorization is the sharing mechanism, the same way Phase 2's block indirection is the memory sharing.
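
Refitting the constants for non-square targets, as the first note suggests, only requires summing r·(in+out) over a list of shapes. The sketch below assumes Llama-2-7B-style dimensions (hidden 4096, MLP intermediate 11008, 32 layers, full-width k/v projections); treat these as illustrative values to check against the model card, and the function name as hypothetical.

```python
MIB = 1024 ** 2


def adapter_bytes_from_shapes(rank, shapes, n_layers, bytes_per_param=2):
    """Adapter size given the (d_in, d_out) shape of each adapted projection per layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * n_layers * bytes_per_param


# Assumed dimensions; verify against the model card before quoting.
HIDDEN, INTERMEDIATE, N_LAYERS = 4096, 11008, 32

attention = [(HIDDEN, HIDDEN)] * 4                   # q, k, v, o
mlp = [(HIDDEN, INTERMEDIATE), (HIDDEN, INTERMEDIATE),  # gate, up
       (INTERMEDIATE, HIDDEN)]                          # down

attn_only = adapter_bytes_from_shapes(16, attention, N_LAYERS)
all_linear = adapter_bytes_from_shapes(16, attention + mlp, N_LAYERS)

if __name__ == "__main__":
    print(f"attention only: {attn_only / MIB:.2f} MiB")   # 32.00 MiB
    print(f"all linear:     {all_linear / MIB:.2f} MiB")  # 76.25 MiB
```

Under these assumed shapes the all-linear figure comes out slightly below the lower end of the range quoted above; wider MLPs or fp32 storage push it up.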

Going further

  • Refit adapter_bytes with real Llama-2-7B shapes (q/k/v/o + gate/up/down, plus GQA widths where the model uses them) and compare against an actual adapter checkpoint's file size from the Hub — close the loop with a du -sh.
  • Add slot_reservation_bytes(max_loras, max_lora_rank, ...) — the pre-allocated HBM the engine reserves at startup whether or not adapters load (it competes with KV blocks! Phase 2 lab-03's carving, with a new claimant). Compute the KV-block cost of max_loras=32, max_lora_rank=64 on a 24 GiB card.
  • Price the compute side: delta FLOPs per token vs base FLOPs, then the batch-of-mixed-adapters overhead vs batch-of-one. The answer (~1%) is why lab-02's capture shows no visible throughput tax — verify against it.
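
A possible starting point for the slot_reservation_bytes exercise is sketched below. It rests on a deliberately simple assumption: the engine reserves max_loras slots, each sized for a full adapter at max_lora_rank over the same four square attention targets. A real engine may reserve additional buffers or lay slots out differently, so treat the result as a lower-bound estimate to compare against an actual startup log.

```python
GIB = 1024 ** 3


def adapter_bytes(rank, hidden=4096, n_layers=32, n_targets=4, bytes_per_param=2):
    """Adapter size for square hidden x hidden targets (fp16 by default)."""
    return rank * (hidden + hidden) * n_targets * n_layers * bytes_per_param


def slot_reservation_bytes(max_loras, max_lora_rank, **shape):
    """Assumed model: one max-rank adapter slot per concurrently active adapter."""
    return max_loras * adapter_bytes(max_lora_rank, **shape)


if __name__ == "__main__":
    reserved = slot_reservation_bytes(max_loras=32, max_lora_rank=64)
    card = 24 * GIB
    print(f"reserved: {reserved / GIB:.1f} GiB")         # 4.0 GiB
    print(f"share of 24 GiB card: {reserved / card:.1%}")  # 16.7%
```

Dividing the reserved bytes by the engine's KV block size in bytes converts this figure into the number of KV blocks given up.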

References

  • Hu et al., LoRA: Low-Rank Adaptation of Large Language Models (2021) — the factorization and the rank-saturation result: https://arxiv.org/abs/2106.09685
  • Chen et al., Punica: Multi-Tenant LoRA Serving (2023) — the SGMV kernel and the multi-tenant economics formalized: https://arxiv.org/abs/2310.18547
  • Sheng et al., S-LoRA: Serving Thousands of Concurrent LoRA Adapters (2023) — the thousands-of-adapters regime this arithmetic enables: https://arxiv.org/abs/2311.03285
  • upstream/vllm/lora/ — where max_loras / max_lora_rank size real buffers.
  • Lab-02 — the captured 0.03 GiB this lab derives; lab-04 — the slots this lab sizes.