Lab 11-04 — Adapter Slots: the LRU Cache the Scheduler Must Obey [CPU-OK]

Lab-03 priced adapters at 32 MiB; hundreds fit in spare HBM. So why does max_loras=2 exist, and why is exceeding it a scheduling event rather than a memory error? Because active adapters don't live in loose 32 MiB allocations — they live in pre-allocated slots (fixed buffers sized for max_lora_rank, baked into the kernels' launch shapes and CUDA graphs), and max_loras is the slot count. This lab builds both halves of the machinery that manages them: the LRU slot cache (hit / load / evict+load, with the recency bookkeeping that keeps hot tenants resident) and the scheduler constraint it forces — a step's batch may reference at most max_loras distinct adapters, with overflow requests deferred, not barriered. It is Phase 2's eviction story and Phase 3's admission story, replayed one level up the stack — deliberately.

Why this lab exists

Multi-LoRA's failure modes in production are almost never about the math (lab-01 settled that) — they're about slot pressure: a tenant complains about p99 latency and the cause is their adapter thrashing in and out of slots behind two hotter tenants; throughput sags after onboarding tenant #9 on a max_loras=8 engine because every step now defers someone. Diagnosing these requires exactly the two models you'll build: the cache (whose hit rate is the tenant-experience metric) and the admission rule (whose deferrals are the throughput tax). Both are ~20 lines, and both behave counterintuitively enough under skewed traffic that you want the test suite's numbers in your head before the incident.

The pedagogical reason is the rhyme. You have now built an LRU-flavored eviction structure for KV blocks (Phase 2 lab-05), a multi-resource admission loop (Phase 3 lab-01), and here both again for adapters. The course repeats the pattern on purpose: cache-with-eviction + admission-under-capacity is THE serving-systems kata, and recognizing it instantly — whatever the cached object is — is a maintainer reflex. (You'll see it once more with prefix-cache-aware routing in Phase 15.)

Background: why slots, and what they cost

Why not allocate adapters dynamically, since they're tiny? Three converging reasons:

  1. Kernel shape stability — the punica/SGMV kernels (lab-01's grouping, fused) index adapter weights by slot id out of a stacked buffer (max_loras, max_lora_rank, …); a fixed buffer means fixed pointers and shapes, which CUDA graphs (Phase 5's Constraint 2!) can capture. Dynamic allocation would re-trigger capture or force eager mode.
  2. Predictable memory — max_loras × slot_size is reserved at startup, before KV blocks are carved (Phase 2 lab-03's ritual gains a line item). No mid-serving OOM from a tenant spike; the cost is paid visibly, up front (the course's recurring "pay it where you can see it").
  3. Bounded step complexity — the per-step adapter gather is over ≤ max_loras segments, keeping the kernel's metadata small and the scheduler's reasoning finite.
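The reservation in item 2 is plain multiplication, and it comes out of the same budget the KV blocks are carved from. A small sketch, assuming a 32 MiB slot (lab-03's adapter price; a real slot is sized for max_lora_rank and may be larger) and hypothetical helper names that are not part of the lab files:

```python
MIB = 1024 * 1024


def slot_reservation_bytes(max_loras: int, slot_size_bytes: int) -> int:
    """Memory set aside at startup: one fixed buffer per slot."""
    return max_loras * slot_size_bytes


def kv_budget_bytes(free_hbm_bytes: int, max_loras: int, slot_size_bytes: int) -> int:
    """What remains for KV blocks once the slots have been reserved."""
    left = free_hbm_bytes - slot_reservation_bytes(max_loras, slot_size_bytes)
    if left < 0:
        raise ValueError("slot reservation exceeds free HBM")
    return left


SLOT = 32 * MIB  # assumption: slot size equals lab-03's 32 MiB adapter
for n in (2, 8, 64):
    print(f"max_loras={n}: {slot_reservation_bytes(n, SLOT) // MIB} MiB reserved")
```

Under that assumption, max_loras=8 reserves 256 MiB before any KV block exists, which is the "line item" the startup ritual gains.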

The slot cache's job is then classic: keep the right max_loras adapters resident. LRU is the policy (recency ≈ tenant activity), move_to_end is the entire implementation subtlety, and a miss costs a host→device copy of lab-03's 32 MiB (~milliseconds — a few decode steps' worth, painful only when it recurs, i.e. when thrashing).
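The cache described above can be sketched in a few lines with OrderedDict. This is an illustration, not the contents of solution.py: the constructor argument, the integer adapter ids, the outcome strings ("hit", "load", "evict+load") and the shape of the stats dictionary are all assumptions.

```python
from collections import OrderedDict


class AdapterSlotCache:
    """LRU over a fixed number of adapter slots (oldest entry first)."""

    def __init__(self, num_slots: int) -> None:
        if num_slots < 1:
            raise ValueError("need at least one slot")
        self.num_slots = num_slots
        self._slots = OrderedDict()  # adapter_id -> None, ordered by recency
        self.hits = 0
        self.misses = 0

    def ensure(self, adapter_id: int) -> str:
        """Make adapter_id resident and report what that required."""
        if adapter_id in self._slots:
            # The recency refresh. Omit this line and the cache is FIFO.
            self._slots.move_to_end(adapter_id)
            self.hits += 1
            return "hit"
        self.misses += 1
        if len(self._slots) < self.num_slots:
            self._slots[adapter_id] = None
            return "load"
        self._slots.popitem(last=False)  # evict the least recently used
        self._slots[adapter_id] = None
        return "evict+load"

    def resident(self) -> list:
        """Resident adapter ids, coldest first."""
        return list(self._slots)

    def stats(self) -> dict:
        total = self.hits + self.misses
        rate = self.hits / total if total else 0.0
        return {"hits": self.hits, "misses": self.misses, "hit_rate": rate}


cache = AdapterSlotCache(num_slots=2)
outcomes = [cache.ensure(a) for a in (1, 2, 1, 3)]
print(outcomes)          # ['load', 'load', 'hit', 'evict+load']
print(cache.resident())  # [1, 3]: re-touching 1 saved it, 2 was evicted
```

The four-call trace at the bottom is the recency scenario in miniature: without move_to_end, the final ensure(3) would evict adapter 1 instead of adapter 2.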

Files

  • starter.py — AdapterSlotCache (ensure/resident/stats) and max_schedulable (the FCFS admission walk with deferral). Your work.
  • solution.py — reference (note OrderedDict as the LRU: insertion order + move_to_end + popitem(last=False) — the standard Python idiom, worth owning).
  • test_lab.py — fill/hit/evict mechanics, LRU ordering, the skewed-traffic hit rate, the distinct-adapter cap, base requests riding free, and deferral-not-barrier.

Run

LAB_IMPL=starter pytest phase-11-multi-lora/labs/lab-04-adapter-slot-cache -q
pytest phase-11-multi-lora/labs/lab-04-adapter-slot-cache -q   # reference

What the tests prove

| Test | What it pins |
| --- | --- |
| test_fill_then_hit | The three outcomes and honest hit/miss accounting — the metric a tenant dashboard graphs |
| test_lru_evicts_the_coldest | Recency refresh works: re-touching adapter 1 saves it; 2 dies. Forget move_to_end and this test is your tripwire (FIFO masquerading as LRU is the classic one-line bug) |
| test_skewed_traffic_loves_lru | 80/20 traffic over 16 adapters, 4 slots → >75% hit rate. Skew is the friend of small caches — the same reason CPU caches work — and the reason max_loras=8 serves 100 tenants acceptably if traffic is skewed (lab-03's fleet math gains its load-shape footnote) |
| test_scheduler_caps_distinct_adapters_per_step | The admission walk: slots claimed FCFS, reuse free, overflow deferred |
| test_base_requests_never_consume_slots | None rides free — mixed base+adapter batches (lab-02's demo) cost slots only for the adapters |
| test_deferral_is_not_a_barrier | A blocked adapter request doesn't stall later admissible ones — contrast with Phase 3 lab-01's head-of-line break for memory. Two resources, two deliberately different policies: KV exhaustion stops admission (fairness, deadlock logic); slot exhaustion skips individuals (slots free predictably next step). Policy per resource is a design decision, not a default |
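The three scheduler tests describe one walk over the waiting queue. A minimal sketch follows, assuming each waiting request is represented only by its adapter id (None for a base-model request) and that the function returns admitted and deferred queue positions. The lab's real max_schedulable signature may differ.

```python
from typing import Optional, Sequence


def max_schedulable(waiting: Sequence[Optional[int]], max_loras: int):
    """FCFS admission under a cap on distinct adapters per step.

    Returns (admitted, deferred) as lists of positions in `waiting`.
    """
    claimed = set()  # distinct adapters already given a slot this step
    admitted, deferred = [], []
    for idx, adapter_id in enumerate(waiting):
        if adapter_id is None:
            admitted.append(idx)  # base request: consumes no slot
        elif adapter_id in claimed:
            admitted.append(idx)  # reuse of a claimed slot is free
        elif len(claimed) < max_loras:
            claimed.add(adapter_id)  # first come, first served
            admitted.append(idx)
        else:
            deferred.append(idx)  # skip this request, keep walking
    return admitted, deferred


admitted, deferred = max_schedulable([1, 2, 3, 1, None, 2], max_loras=2)
print(admitted)  # [0, 1, 3, 4, 5]
print(deferred)  # [2]
```

The request for adapter 3 is deferred, yet the requests queued behind it are still admitted. Replacing the final branch with a break would turn the deferral into the head-of-line barrier that Phase 3 lab-01 uses for memory.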

Hitchhiker's notes

  • Where this lives upstream: upstream/vllm/lora/models.py (LoRAModelManager / LRUCacheLoRAModelManager — your cache, with host-side tiers) and the scheduler's lora gating (search max_loras in vllm/v1/core/sched/scheduler.py — your max_schedulable walk, inline in the admission loop). The two-tier reality: evicted adapters drop to host RAM (cheap reload), not to disk; "cold start" for a brand-new adapter adds checkpoint loading on top.
  • Thrash arithmetic: at max_loras slots and k > max_loras simultaneously active uniform tenants, every step evicts — hit rate collapses toward max_loras/k, each miss costs a 32 MiB copy, and aggregate throughput cliffs. The fix hierarchy: raise max_loras (costs slot memory — lab-03's reservation), shard tenants across engines by affinity (routing — Phase 15's cousin), or batch tenant traffic in time. Knowing the cliff exists before tenant #9 onboards is this lab's operational payoff.
  • Prefix caching interaction (Phase 2 lab-05's note, now load-bearing): KV computed under adapter X is not valid for adapter Y — the adapter changes the model. The block hash therefore includes the LoRA id; two tenants with identical system prompts share nothing. Multi-tenant capacity planning that assumed prefix-cache savings across tenants is wrong by exactly that assumption.
  • ensure and max_schedulable must agree — the scheduler admits a set of adapters, then the cache loads them; if the admission cap exceeded the slot count, the load would evict an adapter another admitted request needs this same step. The invariant "admitted distinct adapters ≤ slots" is cross-component (scheduler promises, cache relies), the same shape as Phase 3 lab-04's deadlock invariant. When you modify one side upstream, the review question is always "who relies on this bound?"
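The thrash arithmetic can be checked with a short simulation. This sketch draws request-level traces at random with a fixed seed. The trace generators, the trace length, and the reading of "80/20" as 80% of requests going to 3 hot adapters are assumptions for illustration and are not taken from test_lab.py.

```python
import random
from collections import OrderedDict


def lru_hit_rate(trace, num_slots):
    """Replay adapter ids through an LRU of num_slots; return the hit fraction."""
    slots = OrderedDict()
    hits = 0
    for adapter_id in trace:
        if adapter_id in slots:
            slots.move_to_end(adapter_id)
            hits += 1
        else:
            if len(slots) >= num_slots:
                slots.popitem(last=False)  # evict least recently used
            slots[adapter_id] = None
    return hits / len(trace) if trace else 0.0


def uniform_trace(k, n, rng):
    """n requests spread evenly over k tenants."""
    return [rng.randrange(k) for _ in range(n)]


def skewed_trace(num_adapters, num_hot, hot_share, n, rng):
    """hot_share of requests go to the first num_hot adapters, the rest elsewhere."""
    return [
        rng.randrange(num_hot) if rng.random() < hot_share
        else rng.randrange(num_hot, num_adapters)
        for _ in range(n)
    ]


rng = random.Random(0)
MAX_LORAS = 8
for k in (8, 9, 16, 32):
    rate = lru_hit_rate(uniform_trace(k, 50_000, rng), MAX_LORAS)
    print(f"uniform k={k:2d}: hit rate {rate:.3f} (max_loras/k = {min(1.0, MAX_LORAS / k):.3f})")

skew = lru_hit_rate(skewed_trace(16, 3, 0.8, 50_000, rng), 4)
flat = lru_hit_rate(uniform_trace(16, 50_000, rng), 4)
print(f"16 adapters, 4 slots: skewed {skew:.3f} vs uniform {flat:.3f}")
```

With k equal to the slot count only the initial cold loads miss. Past that point the uniform hit rate should track max_loras/k closely. A strictly round-robin trace is worse still: lru_hit_rate([0, 1, 2, 0, 1, 2], 2) is exactly 0.0, because every request evicts the adapter that is needed next.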

Going further

  • Add a host tier: evicted adapters go to a (larger) host LRU; ensure returns "hit" / "load-from-host" / "load-from-disk" with costs 0 / 1 / 30. Run the skewed workload and price the tiers — you've rebuilt LRUCacheLoRAModelManager's actual shape and S-LoRA's core argument.
  • Couple the two halves: drive max_schedulable's admitted set into the cache per step and assert the invariant above holds for random traffic — then break the cap (+1) and watch which workloads corrupt. Cross-component invariants deserve cross-component tests.
  • Simulate tenant p99: timestamped requests, miss = +3 steps of latency; compare per-tenant p99 under LRU vs random eviction at various skews. The plot is the argument for LRU — and for affinity routing once skew fades.

References

  • upstream/vllm/lora/models.py — LoRAModelManager and the LRU variant.
  • upstream/vllm/lora/punica_wrapper/ — the slot-indexed kernel buffers your cache fronts.
  • Sheng et al., S-LoRA (2023) — paged adapter memory + the host tier at thousands of adapters: https://arxiv.org/abs/2311.03285
  • Phase 2 lab-05 — the eviction kata's first appearance; Phase 3 lab-01 — the admission kata's; this lab — both, one level up.