Phase 11 — Exercises: Multi-LoRA

Contents

Warm-up (explain)
Core (trace the code)
Build (your lab)
Design (staff-level)
Self-grading

Warm-up (explain)

What is a LoRA? Answer by completing W' = W + (what?). Why is it cheap (use the rank r)?
What are the shrink and expand steps, and what shapes do they pass through?
Why does serving N adapters cost ≈ one base read + N small matmuls, not N model runs?
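
To check your warm-up answers numerically, a minimal NumPy sketch along these lines may help. It assumes the common factorisation of the update into a tall matrix B and a wide matrix A of rank r; the function name, the toy sizes, and the `scale` parameter are illustrative choices, not taken from the codebase you are tracing.

```python
import numpy as np

def lora_forward(x, W, A, B, scale=1.0):
    """Apply the base weight plus a factored low-rank update without merging them."""
    base = x @ W.T            # base matmul:  (tokens, d_in) -> (tokens, d_out)
    shrunk = x @ A.T          # shrink step:  (tokens, d_in) -> (tokens, r)
    expanded = shrunk @ B.T   # expand step:  (tokens, r)    -> (tokens, d_out)
    return base + scale * expanded

rng = np.random.default_rng(0)
tokens, d_in, d_out, r = 4, 64, 48, 8          # toy sizes, chosen for illustration
x = rng.standard_normal((tokens, d_in))
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in))
B = rng.standard_normal((d_out, r))

y_factored = lora_forward(x, W, A, B)
y_merged = x @ (W + B @ A).T                   # same result with the update merged into W

params_full = d_in * d_out                     # parameters in a dense update
params_lora = r * (d_in + d_out)               # parameters in the factored update
```

Comparing `y_factored` with `y_merged` (for example with `np.allclose`) and `params_lora` with `params_full` should make the cost argument concrete.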

Core (trace the code)

In LoRARequest (request.py:8), what identifies an adapter and what does id 0/-1 mean?
In punica_cpu.py, match add_shrink (:166) and add_expand (:197) to shrink/expand.
How does SGMV apply different adapters to different rows in one kernel (segments by id)?
Where is max_loras enforced — name both the manager and the scheduler spot (Phase 3).
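
Before tracing the real kernel, it can help to have a slow reference for what a grouped batched apply computes. The sketch below is plain NumPy, not the SGMV kernel: `grouped_lora_apply`, the stacked weight layout, and the use of -1 for "no adapter" are all assumptions of this sketch, so check the id convention against the code you are reading.

```python
import numpy as np

def grouped_lora_apply(x, adapter_ids, A_stack, B_stack):
    """Return per-row LoRA deltas, grouping rows that share an adapter id.

    x:           (rows, d_in)
    adapter_ids: (rows,) integer ids; negative means no adapter (sketch convention)
    A_stack:     (n_adapters, r, d_in)
    B_stack:     (n_adapters, d_out, r)
    """
    out = np.zeros((x.shape[0], B_stack.shape[1]))
    for aid in np.unique(adapter_ids):
        if aid < 0:
            continue                               # these rows keep a zero delta
        rows = np.nonzero(adapter_ids == aid)[0]   # the segment for this adapter
        shrunk = x[rows] @ A_stack[aid].T          # (n_rows, r)
        out[rows] = shrunk @ B_stack[aid].T        # (n_rows, d_out)
    return out

rng = np.random.default_rng(1)
n_adapters, r, d_in, d_out = 3, 4, 16, 12
A_stack = rng.standard_normal((n_adapters, r, d_in))
B_stack = rng.standard_normal((n_adapters, d_out, r))
x = rng.standard_normal((6, d_in))
adapter_ids = np.array([0, 0, 2, -1, 1, 2])

delta = grouped_lora_apply(x, adapter_ids, A_stack, B_stack)
```

The Python loop over ids stands in for what a fused kernel would do in a single launch; the point of the reference is only to pin down the expected output per row.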

Build (your lab)

In lab-01, why is the LoRA delta at most rank r? Argue it from the factorisation, then verify it numerically with matrix_rank.
Add an effective_rank knob: stack two adapters on the same rows (sum of deltas) and verify it equals adding them sequentially.
Measure FLOPs: compare base matmul FLOPs to the adapter's shrink+expand FLOPs for r=16, in=out=4096. What's the overhead ratio?
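
One possible scaffold for the three build exercises is sketched below. It is a standalone sketch rather than the lab-01 code: the helper names and toy sizes are hypothetical, `np.linalg.matrix_rank` is assumed to be the matrix_rank meant above, and FLOPs are counted per token as 2 x (multiply-accumulates).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, tokens = 32, 32, 4, 5              # toy sizes for the rank checks

def random_adapter():
    """Return a random (A, B) pair with A: (r, d_in) and B: (d_out, r)."""
    return rng.standard_normal((r, d_in)), rng.standard_normal((d_out, r))

W = rng.standard_normal((d_out, d_in))
x = rng.standard_normal((tokens, d_in))
A1, B1 = random_adapter()
A2, B2 = random_adapter()
delta1, delta2 = B1 @ A1, B2 @ A2

# Rank of one delta, and of two deltas stacked on the same rows.
rank_single = np.linalg.matrix_rank(delta1)
rank_stacked = np.linalg.matrix_rank(delta1 + delta2)

# Summing the deltas first vs. adding each adapter's output in turn.
y_summed = x @ (W + delta1 + delta2).T
y_sequential = x @ W.T
y_sequential = y_sequential + (x @ A1.T) @ B1.T
y_sequential = y_sequential + (x @ A2.T) @ B2.T

def lora_overhead_ratio(rank, n_in, n_out):
    """Per-token adapter FLOPs (shrink + expand) divided by base matmul FLOPs."""
    base_flops = 2 * n_in * n_out
    adapter_flops = 2 * rank * n_in + 2 * rank * n_out
    return adapter_flops / base_flops

ratio = lora_overhead_ratio(16, 4096, 4096)
```

Print `rank_single`, `rank_stacked`, and `ratio`, and compare `y_summed` with `y_sequential` using `np.allclose`, then explain each result before moving on.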

Design (staff-level)

A platform serves 5,000 customer fine-tunes. Compare (a) one full deployment per customer vs (b) shared base + multi-LoRA: memory, cost, cold-start. Where does (b) win and where does it hurt?
max_loras is hit constantly (lots of distinct adapters per batch). What are your options (raise it, route by adapter, replicate), and the tradeoffs?
How does LoRA on MoE expert layers (lora/layers/fused_moe.py) complicate the batched apply, and why?
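
For the first design question, a back-of-envelope calculator can anchor the memory comparison. Every number below is a hypothetical assumption made for illustration (a 7B-parameter base in 16-bit weights, 32 layers, four adapted 4096 x 4096 matrices per layer, r = 16); substitute the figures for your own platform.

```python
def adapter_bytes(n_layers, targets_per_layer, rank, d_in, d_out, bytes_per_param):
    """Bytes for one adapter: an (A, B) pair per adapted matrix."""
    return n_layers * targets_per_layer * rank * (d_in + d_out) * bytes_per_param

def fleet_memory(n_customers, base_bytes, per_adapter_bytes):
    """Return (bytes for one full model per customer, bytes for shared base + adapters)."""
    full = n_customers * base_bytes
    shared = base_bytes + n_customers * per_adapter_bytes
    return full, shared

# Hypothetical inputs, not measurements.
BASE_BYTES = 7_000_000_000 * 2                     # 7B parameters at 2 bytes each
per_adapter = adapter_bytes(
    n_layers=32, targets_per_layer=4, rank=16,
    d_in=4096, d_out=4096, bytes_per_param=2,
)
full, shared = fleet_memory(5000, BASE_BYTES, per_adapter)

print(f"per adapter:    {per_adapter / 2**20:.0f} MiB")
print(f"full per tenant: {full / 1e12:.1f} TB")
print(f"shared + LoRA:   {shared / 1e9:.1f} GB")
```

The shared figure counts every adapter as stored somewhere, not as resident on the accelerator at once; how many are resident per batch is exactly what max_loras limits, which is where the cold-start part of the question comes in.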

Self-grading

Counting the questions consecutively from the first warm-up, questions 4–7 (Core) and 11–13 (Design) are interview-grade. Could you whiteboard shrink/expand and the grouped batched apply? If not, re-read 01-deep-dive.md.