Lab 11-02 — Serve Many LoRAs in One Batch [GPU-OPT]
The CPU labs built the machinery: the grouped delta (lab-01), the 32 MiB price tag
(lab-03), the slot cache (lab-04). This lab watches all three earn their keep on real
hardware: one batch, two requests (one wanting the plain base model, one wanting a
SQL fine-tune) served together over a single 12.55 GiB copy of Llama-2-7B, each
getting visibly different behavior ('apple, banana, orange' vs
'SELECT name FROM users;'), with the adapter adding 0.03 GiB. The multi-tenant
economics, demonstrated in four lines of API and one annotated log.
No GPU? Don't panic. The capture below carries the demonstration; the reconciliation against labs 01/03/04 is the work, and it's hardware-free.
Contents
- Why this lab exists
- Requirements
- Steps
- Captured output (real run, Llama-2-7b + SQL LoRA, A100, vLLM 0.22.1, trimmed)
- Reading the numbers
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Every GPU-OPT lab in this course is an integration test of the CPU labs' models, and
this one has the most user-visible payoff: different model behavior per request in
one batch is the kind of thing that sounds impossible until you've traced lab-01's
grouped matmul, and obvious afterward. Running it (or reading the capture) closes the
loop — and teaches the operational surface you'll actually touch: enable_lora, the
max_loras/max_lora_rank reservations (lab-04 and lab-03's knobs, now with startup-
log consequences), LoRARequest's id-and-path plumbing, and the per-request
lora_request parameter that the OpenAI-compatible server exposes as the model
field (each adapter looks like a model name to API clients — the productization detail
that makes multi-tenant serving feel like multi-model serving).
Requirements
uv pip install -e ".[vllm]"
huggingface-cli download meta-llama/Llama-2-7b-hf # the shared base
# plus a small LoRA adapter for it from the Hub (any task with visible behavior —
# SQL generation is ideal because base-vs-adapter outputs differ unmistakably)
Steps
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True,
          max_loras=2, max_lora_rank=16)
sql = LoRARequest("sql-adapter", 1, "/path/to/sql_lora")
sp = SamplingParams(max_tokens=32, temperature=0)
out = llm.generate(
    ["List 3 fruits:", "Table users(id,name). Query all names:"],
    sp,
    lora_request=[None, sql],  # request 0 = base, request 1 = SQL adapter
)
for o in out:
    print(repr(o.outputs[0].text))
Then the experiments that make it a lab rather than a demo: swap which request gets
the adapter (behavior follows the tag, not the prompt); send the SQL prompt to the
base (watch it ramble — the adapter, not the prompt, carries the behavior); and load
a third adapter with max_loras=2 to meet lab-04's slot machinery in the logs.
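For readers without a GPU, the swap experiment can be rehearsed on a toy. The sketch below is not vLLM's kernel and not lab-01's code; it is a minimal pure-Python stand-in for a grouped delta, with made-up 2x2 weights and a hypothetical rank-1 adapter, meant only to show that the output change follows the adapter tag rather than the row.

```python
# Toy, hardware-free model of a grouped LoRA delta (illustrative only).
# All weights, shapes, and names here are assumptions made for the sketch.

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def grouped_forward(rows, adapter_ids, base_w, adapters):
    """Apply base_w to every row; add B(Ax) only where adapter_id >= 0."""
    outputs = []
    for x, aid in zip(rows, adapter_ids):
        y = matvec(base_w, x)          # shared base path, taken by every row
        if aid >= 0:                   # -1 means "plain base model"
            a, b = adapters[aid]
            shrink = matvec(a, x)      # project down to rank r
            delta = matvec(b, shrink)  # expand back to output width
            y = [yi + di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs

base_w = [[1.0, 0.0], [0.0, 1.0]]      # 2x2 identity as the "base model"
adapters = {1: ([[1.0, 1.0]],          # A: 1x2 (rank 1)
                [[10.0], [0.0]])}      # B: 2x1
rows = [[1.0, 2.0], [3.0, 4.0]]

tagged_second = grouped_forward(rows, [-1, 1], base_w, adapters)
tagged_first = grouped_forward(rows, [1, -1], base_w, adapters)
print(tagged_second)  # [[1.0, 2.0], [73.0, 4.0]]
print(tagged_first)   # [[31.0, 2.0], [3.0, 4.0]]
```

Swapping the tags moves the delta from one row to the other while the untagged row comes back as pure base output, which is the behavior the first experiment asks you to confirm on real hardware.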
Captured output (real run, Llama-2-7b + SQL LoRA, A100, vLLM 0.22.1, trimmed)
INFO ... LoRA enabled: max_loras=2, max_lora_rank=16
'apple, banana, orange' # request 0: base behavior
'SELECT name FROM users;' # request 1: SQL adapter behavior
INFO ... Model weights take 12.55 GiB (shared by ALL requests)
INFO ... LoRA adapter 'sql-adapter' loaded: 0.03 GiB # ~1/400th of the base
Reading the numbers
- 12.55 GiB, shared: the base read once per step for the whole batch, which is lab-01's step-1 matmul, weighed. The dedicated-deployment alternative would hold this per tenant: lab-03's gpus_saved, with real units.
- 0.03 GiB: lab-03's adapter_bytes(4096, 32, 16) = 32 MiB, measured. When a derived constant and a log line agree to two decimal places, both the model and your reading of the log are validated (the reconciliation habit, sixth appearance).
- Two behaviors, one batch: the rows took the same forward pass through the base; only request 1's rows detoured through the shrink/expand delta (lora_request=[None, sql] is literally lab-01's adapter_ids = [-1, 1]). The same step, two models' worth of behavior; there is no trick left in that sentence for you anymore.
- max_loras=2, max_lora_rank=16 in the first line: lab-04's slot count and lab-03's per-slot size, reserved at startup. Read them as a memory line item: 2 slots × rank-16 buffers, carved before KV blocks (Phase 2 lab-03's ritual gained a claimant).
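The 32 MiB figure can be re-derived by hand. The sketch below is a reconstruction, not lab-03's actual function: it assumes the arguments mean (hidden size, layer count, rank), that the adapter targets four hidden-by-hidden projections per layer, and that weights are stored in 2-byte floats. Under those assumptions it lands on the same number as the log.

```python
# Hypothetical reconstruction of the adapter-size arithmetic.
# Assumptions: 4 target projections per layer, fp16/bf16 weights (2 bytes).

def adapter_bytes_sketch(hidden, layers, rank, targets=4, dtype_bytes=2):
    per_module = 2 * hidden * rank  # A (rank x hidden) plus B (hidden x rank)
    return layers * targets * per_module * dtype_bytes

MIB = 1024 ** 2
GIB = 1024 ** 3

size = adapter_bytes_sketch(4096, 32, 16)
print(size / MIB)                  # 32.0
print(round(size / GIB, 2))        # 0.03
print(round(12.55 * GIB / size))   # 402, i.e. roughly 1/400th of the base
```

If your adapter targets more modules (for example the MLP projections), the targets assumption changes and so does the total, which is one reason to reconcile against the log rather than trust the formula alone.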
Hitchhiker's notes
- The API server's productization: under vllm serve --enable-lora --lora-modules sql=/path/..., each adapter appears as a model name in the OpenAI-compatible /v1/models list, and clients select fine-tunes via the standard model field. Tenants never learn they're sharing; the consolidation is invisible by design. Runtime add/remove exists too (/v1/load_lora_adapter), so a tenant can be onboarded without a restart.
- Latency asymmetry to expect: the first request for a cold adapter pays the host→device slot load (lab-04's miss, milliseconds) plus, the first time ever, the load from disk. Steady-state requests pay only the delta compute (~1%, invisible). If a tenant's p50 is fine but p99 spikes correlate with their traffic gaps, that's the slot cache breathing; lab-04's thrash arithmetic is the diagnosis sheet.
- Quality due diligence transfers from Phase 6 lab-02: "the outputs looked right" is a smoke test. A tenant migration to shared serving deserves an eval-set diff (dedicated vs consolidated), which — per lab-01's equality proof — should show only float-reordering noise. If it shows more, suspect rank/config mismatches in the adapter conversion, not the engine.
- What doesn't work (v0.22): adapters must target the base's linear layers (embedding/lm-head support varies), rank must be ≤ max_lora_rank, and the base model must match exactly (an adapter trained on Llama-2-7B-chat applied to Llama-2-7B-hf loads fine and behaves subtly wrong: the silent version-skew failure; checksum your bases).
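To make the first note concrete, here is a hedged sketch of what a client request looks like when the adapter is selected through the model field. It only builds the HTTP request and does not send it; the base URL is an assumption (a local server), and the adapter name sql is the one registered in the --lora-modules example above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: a locally running vllm serve

def completion_request(model_name, prompt, max_tokens=32):
    """Build (but do not send) a POST to the OpenAI-compatible completions route."""
    body = {
        "model": model_name,   # base model name or a registered adapter name
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

base_req = completion_request("meta-llama/Llama-2-7b-hf", "List 3 fruits:")
sql_req = completion_request("sql", "Table users(id,name). Query all names:")

# Sending would be urllib.request.urlopen(sql_req) against a live server.
print(json.loads(sql_req.data)["model"])  # sql
```

The two requests differ only in the model string, which is the whole point: from the client's side, choosing a fine-tune is indistinguishable from choosing a model.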
Reflect
- Trace request 1's tokens through the phase: which lab's code decided it could enter the batch (lab-04), which loaded its weights where (lab-03's bytes into lab-04's slot), which computed its detour (lab-01), and what the base request paid for any of it (nothing: lab-01's -1 rows). If you can narrate that chain cold, the phase is yours.
- Your platform hosts 40 tenant fine-tunes on max_loras=8 engines. Using labs 03+04: what traffic shape makes this comfortable, what shape melts it, and what do you monitor to tell them apart? (Skew → slot hit rate; uniform simultaneous activity → thrash; monitor per-engine adapter hit rate and defer counts.)
- Why does the engine require max_lora_rank up front instead of sizing slots per adapter? (Phase 5's Constraint: fixed buffer shapes for captured graphs and fused kernels, the recurring trade of flexibility for replay. Heterogeneous ranks pad to the max; lab-01's going-further priced that.)
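The 40-tenants question can be explored numerically before you answer it. The simulation below is a toy: it assumes an LRU eviction policy (lab-04's actual policy and vLLM's may differ), one adapter per request, and two made-up traffic shapes. It ignores batching and defer counts entirely and only contrasts hit rates.

```python
import random
from collections import OrderedDict

def slot_hit_rate(trace, max_loras):
    """Fraction of requests whose adapter is already resident (LRU assumed)."""
    slots = OrderedDict()
    hits = 0
    for adapter_id in trace:
        if adapter_id in slots:
            hits += 1
            slots.move_to_end(adapter_id)   # mark as most recently used
        else:
            if len(slots) >= max_loras:
                slots.popitem(last=False)   # evict least recently used
            slots[adapter_id] = True
    return hits / len(trace)

rng = random.Random(0)                      # fixed seed for repeatability
tenants = list(range(40))

# Made-up skew: 6 hot tenants carry 95% of traffic, 34 share the rest.
weights = [95 / 6 if t < 6 else 5 / 34 for t in tenants]
skewed = rng.choices(tenants, weights=weights, k=10_000)
uniform = rng.choices(tenants, k=10_000)

print(f"skewed  hit rate: {slot_hit_rate(skewed, 8):.2f}")
print(f"uniform hit rate: {slot_hit_rate(uniform, 8):.2f}")
```

With the hot set smaller than the slot count the skewed trace mostly hits, while the uniform trace hovers near the slots-to-tenants ratio. A cyclic trace wider than the cache, such as slot_hit_rate([1, 2, 3, 1, 2, 3], 2), never hits at all, which is the thrash case in miniature.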
References
- upstream/vllm/lora/ (request plumbing, slot manager, punica kernels): the whole phase's upstream home.
- vLLM docs, LoRA Adapters (serving config, runtime loading, the OpenAI-server productization): https://docs.vllm.ai/en/latest/features/lora/
- Labs 01 (the math), 03 (the bill), 04 (the slots) — this run is their joint integration test.
- Phase 6 lab-02 — the quality-verification discipline that transfers here verbatim.