Phase 11 — Exercises: Multi-LoRA
Contents
Warm-up (explain)
- What is a LoRA, in terms of W' = W + ? Why is it cheap (use the rank r)?
- What are the shrink and expand steps, and what shapes do they pass through?
- Why does serving N adapters cost ≈ one base read + N small matmuls, not N model runs?
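To check your warm-up answers, here is a minimal NumPy sketch of the shrink and expand steps. It assumes the row-vector convention (y = x @ W), toy sizes, and no alpha/r scaling; the names A, B, h and delta are illustrative and not taken from the codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 48, 4              # toy sizes, r << min(d_in, d_out)

W = rng.standard_normal((d_in, d_out))  # frozen base weight
A = rng.standard_normal((d_in, r))      # shrink matrix: d_in -> r
B = rng.standard_normal((r, d_out))     # expand matrix: r -> d_out
x = rng.standard_normal((5, d_in))      # 5 tokens

h = x @ A          # shrink: (5, d_in) -> (5, r)
delta = h @ B      # expand: (5, r) -> (5, d_out)
y = x @ W + delta  # base output plus low-rank correction

# Same result as merging the adapter into the weight: W' = W + A @ B
y_merged = x @ (W + A @ B)
assert np.allclose(y, y_merged)

# Parameter count: the adapter stores r * (d_in + d_out) values,
# versus d_in * d_out for the full matrix.
adapter_params = A.size + B.size
full_params = W.size
```

The unmerged form is the one that matters for multi-adapter serving: x @ W is computed once for the whole batch, and only the two small matmuls differ per adapter.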
Core (trace the code)
- In LoRARequest (request.py:8), what identifies an adapter and what does id 0/-1 mean?
- In punica_cpu.py, match add_shrink (:166) and add_expand (:197) to shrink/expand.
- How does SGMV apply different adapters to different rows in one kernel (segments by id)?
- Where is max_loras enforced? Name both the manager and the scheduler spot (Phase 3).
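As a reference point while tracing, the grouped apply can be emulated in NumPy. This is a sketch only: grouped_lora_apply is a hypothetical helper, the Python loop stands in for what a real SGMV kernel does across segments in a single launch, and treating a negative id as "base only" is this sketch's convention, not a statement about what LoRARequest does.

```python
import numpy as np

def grouped_lora_apply(x, W, adapters, ids):
    """Apply a different adapter to each row of x, grouped by adapter id.

    x: (T, d_in) tokens; W: (d_in, d_out) base weight;
    adapters: dict id -> (A, B); ids: (T,) adapter id per row.
    A negative id means "no adapter" (sketch convention only).
    """
    y = x @ W                                # one base matmul for every row
    for aid in np.unique(ids):
        if aid < 0:
            continue                         # base-only rows get no delta
        rows = np.nonzero(ids == aid)[0]     # the segment for this adapter
        A, B = adapters[int(aid)]
        y[rows] += (x[rows] @ A) @ B         # shrink, then expand, on the segment
    return y

rng = np.random.default_rng(0)
d_in, d_out, r = 32, 24, 4
W = rng.standard_normal((d_in, d_out))
adapters = {
    i: (rng.standard_normal((d_in, r)), rng.standard_normal((r, d_out)))
    for i in range(2)
}
x = rng.standard_normal((6, d_in))
ids = np.array([0, 1, -1, 1, 0, -1])

y_grouped = grouped_lora_apply(x, W, adapters, ids)

# Reference: handle each row on its own, with its own adapter.
y_ref = np.stack([
    x[t] @ W + ((x[t] @ adapters[int(a)][0]) @ adapters[int(a)][1] if a >= 0 else 0.0)
    for t, a in enumerate(ids)
])
assert np.allclose(y_grouped, y_ref)
```

Note that the base matmul never depends on the adapter id; only the per-segment delta does.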
Build (your lab)
- In lab-01, why is the LoRA delta at most rank r? Prove it with matrix_rank.
- Add an effective_rank knob: stack two adapters on the same rows (sum of deltas) and verify it equals adding them sequentially.
- Measure FLOPs: compare base matmul FLOPs to the adapter's shrink+expand FLOPs for r=16, in=out=4096. What's the overhead ratio?
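One way to sanity-check your lab results, independent of the lab-01 code, is the sketch below. It assumes NumPy's np.linalg.matrix_rank as the rank check, counts 2·m·k·n FLOPs for an (m, k) by (k, n) matmul, and uses hypothetical helper names (matmul_flops, lora_flops).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 48, 4
A1, B1 = rng.standard_normal((d_in, r)), rng.standard_normal((r, d_out))
A2, B2 = rng.standard_normal((d_in, r)), rng.standard_normal((r, d_out))
x = rng.standard_normal((5, d_in))

# Rank bound: a product through an r-wide bottleneck has rank at most r.
rank_one = int(np.linalg.matrix_rank(A1 @ B1))
assert rank_one <= r

# Two adapters on the same rows: adding the deltas one after the other ...
sequential = (x @ A1) @ B1 + (x @ A2) @ B2
# ... equals a single adapter of width 2r built by concatenation.
A_cat = np.concatenate([A1, A2], axis=1)   # (d_in, 2r)
B_cat = np.concatenate([B1, B2], axis=0)   # (2r, d_out)
stacked = (x @ A_cat) @ B_cat
assert np.allclose(sequential, stacked)
rank_sum = int(np.linalg.matrix_rank(A1 @ B1 + A2 @ B2))
assert rank_sum <= 2 * r

def matmul_flops(tokens, k, n):
    """Multiply-adds counted as 2 FLOPs each."""
    return 2 * tokens * k * n

def lora_flops(tokens, k, n, rank):
    """Shrink (k -> rank) plus expand (rank -> n)."""
    return matmul_flops(tokens, k, rank) + matmul_flops(tokens, rank, n)

base = matmul_flops(1, 4096, 4096)     # 33_554_432 per token
extra = lora_flops(1, 4096, 4096, 16)  # 262_144 per token
ratio = extra / base                   # 1/128, about 0.78%
```

The ratio simplifies to r·(in + out) / (in·out), so it scales linearly with r and does not depend on the number of tokens.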
Design (staff-level)
- A platform serves 5,000 customer fine-tunes. Compare (a) one full deployment per customer vs (b) shared base + multi-LoRA: memory, cost, cold-start. Where does (b) win and where does it hurt?
- max_loras is hit constantly (lots of distinct adapters per batch). What are your options (raise it, route by adapter, replicate), and the tradeoffs?
- How does LoRA on MoE expert layers (lora/layers/fused_moe.py) complicate the batched apply, and why?
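For the 5,000-customer comparison, a back-of-envelope weight-size estimate helps anchor the discussion. Every number below is an assumption chosen for illustration (a 7B-parameter base in 16-bit weights, 32 layers, four adapted 4096x4096 projections per layer, r=16); substitute your own model's figures.

```python
BYTES_PER_PARAM = 2              # assumption: fp16/bf16 weights
BASE_PARAMS = 7_000_000_000      # assumption: hypothetical 7B base model
LAYERS = 32                      # assumption
ADAPTED = [(4096, 4096)] * 4     # assumption: 4 square projections per layer
RANK = 16
CUSTOMERS = 5_000

def adapter_params(rank, shapes):
    """Each adapted (d_in, d_out) matrix adds rank * (d_in + d_out) parameters."""
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes)

one_adapter_bytes = LAYERS * adapter_params(RANK, ADAPTED) * BYTES_PER_PARAM
base_bytes = BASE_PARAMS * BYTES_PER_PARAM

# (a) one full deployment per customer
full_bytes = CUSTOMERS * base_bytes
# (b) one shared base plus one adapter per customer
shared_bytes = base_bytes + CUSTOMERS * one_adapter_bytes

print(f"one adapter : {one_adapter_bytes / 2**20:.0f} MiB")
print(f"(a) full    : {full_bytes / 1e12:.1f} TB")
print(f"(b) shared  : {shared_bytes / 1e9:.1f} GB")
```

This counts weights only. It ignores KV cache, how many adapters are resident on the GPU at once (the max_loras limit), and the latency of loading a cold adapter, which are exactly the places to look for where (b) hurts.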
Self-grading
Questions 4–7 (Core) and 11–13 (Design), counting from the first Warm-up question, are interview-grade. Could you whiteboard shrink/expand and the grouped batched apply? If not, re-read 01-deep-dive.md.