Lab 11-02 — Serve Many LoRAs in One Batch [GPU-OPT]

The CPU labs built the machinery: the grouped delta (lab-01), the 32 MiB price tag (lab-03), the slot cache (lab-04). This lab watches all three earn their keep on real hardware. One batch carries two requests, one wanting the plain base model and one wanting a SQL fine-tune, served together over a single 12.55 GiB copy of Llama-2-7B. Each gets visibly different behavior ('apple, banana, orange' vs 'SELECT name FROM users;'), and the adapter adds only 0.03 GiB. That is the multi-tenant economics, demonstrated in four lines of API and one annotated log.

No GPU? Don't panic. The capture below carries the demonstration; the reconciliation against labs 01/03/04 is the work, and it's hardware-free.

Why this lab exists

Every GPU-OPT lab in this course is an integration test of the CPU labs' models, and this one has the most user-visible payoff: different model behavior per request in one batch is the kind of thing that sounds impossible until you've traced lab-01's grouped matmul, and obvious afterward. Running it (or reading the capture) closes the loop and teaches the operational surface you'll actually touch: enable_lora; the max_loras/max_lora_rank reservations (lab-04's and lab-03's knobs, now with startup-log consequences); LoRARequest's id-and-path plumbing; and the per-request lora_request parameter, which the OpenAI-compatible server exposes as the model field. Each adapter looks like a model name to API clients, the productization detail that makes multi-tenant serving feel like multi-model serving.

Requirements

uv pip install -e ".[vllm]"
huggingface-cli download meta-llama/Llama-2-7b-hf   # the shared base
# plus a small LoRA adapter for it from the Hub (any task with visible behavior —
# SQL generation is ideal because base-vs-adapter outputs differ unmistakably)

Steps

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

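# enable_lora turns on adapter support; max_loras and max_lora_rank reserve
# the slot count and the per-slot size at startup.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True,
          max_loras=2, max_lora_rank=16)
# LoRARequest(name, unique integer id, local path to the adapter weights)
sql = LoRARequest("sql-adapter", 1, "/path/to/sql_lora")
# temperature=0 means greedy decoding, so base and adapter outputs are comparable
sp = SamplingParams(max_tokens=32, temperature=0)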

out = llm.generate(
    ["List 3 fruits:", "Table users(id,name). Query all names:"],
    sp,
    lora_request=[None, sql],   # request 0 = base, request 1 = SQL adapter
)
for o in out:
    print(repr(o.outputs[0].text))

Then the experiments that make it a lab rather than a demo: swap which request gets the adapter (behavior follows the tag, not the prompt); send the SQL prompt to the base (watch it ramble — the adapter, not the prompt, carries the behavior); and load a third adapter with max_loras=2 to meet lab-04's slot machinery in the logs.
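
The first two experiments can be sketched as a small helper, assuming the llm, sql, and sp objects from the Steps snippet. The names run_batch and swap_experiments are hypothetical, not part of vLLM; the only vLLM call used is the same llm.generate(..., lora_request=[...]) shown above.

```python
def run_batch(llm, sampling_params, prompts, adapters):
    """Generate one batch and return each request's text, in order.

    Assumes `llm` behaves like a vllm.LLM built with enable_lora=True and
    `adapters` holds one entry per prompt: a LoRARequest, or None for base.
    """
    outputs = llm.generate(prompts, sampling_params, lora_request=adapters)
    return [o.outputs[0].text for o in outputs]


def swap_experiments(llm, sampling_params, sql_adapter):
    """Run the same two prompts under three different adapter taggings."""
    prompts = ["List 3 fruits:", "Table users(id,name). Query all names:"]
    return {
        # Baseline from Steps: base on the fruit prompt, adapter on the SQL prompt.
        "baseline": run_batch(llm, sampling_params, prompts, [None, sql_adapter]),
        # Swap the tags: the adapter now rides with the fruit prompt.
        "swapped": run_batch(llm, sampling_params, prompts, [sql_adapter, None]),
        # No adapter anywhere: the SQL prompt goes to the plain base model.
        "base_only": run_batch(llm, sampling_params, prompts, [None, None]),
    }
```

Calling swap_experiments(llm, sp, sql) and printing the three lists side by side is enough to see whether behavior follows the tag. The third experiment (a third adapter under max_loras=2) needs another adapter on disk, so it is left out of this sketch.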

Captured output (real run, Llama-2-7b + SQL LoRA, A100, vLLM 0.22.1, trimmed)

INFO ... LoRA enabled: max_loras=2, max_lora_rank=16
'apple, banana, orange'                          # request 0: base behavior
'SELECT name FROM users;'                         # request 1: SQL adapter behavior
INFO ... Model weights take 12.55 GiB (shared by ALL requests)
INFO ... LoRA adapter 'sql-adapter' loaded: 0.03 GiB   # ~1/400th of the base
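
The adapter line in the capture can be checked by hand. The sketch below is a reconstruction, not lab-03's actual code: it assumes adapter_bytes takes (hidden, layers, rank), that four hidden-by-hidden projections per layer carry an adapter, and that weights are stored in 2 bytes each. Those two defaults are assumptions chosen because they reproduce the 32 MiB figure.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def adapter_bytes(hidden, layers, rank, targets_per_layer=4, bytes_per_param=2):
    """Bytes held by one LoRA adapter under the stated assumptions."""
    # Each adapted projection gets A (rank x hidden) and B (hidden x rank).
    params_per_target = 2 * hidden * rank
    return layers * targets_per_layer * params_per_target * bytes_per_param


size = adapter_bytes(4096, 32, 16)
print(size / MIB)                   # 32.0
print(round(size / GIB, 2))         # 0.03
print(round(12.55 * GIB / size))    # 402, i.e. roughly 1/400th of the base
```

An adapter that targets more modules (for example the MLP projections) would come out larger, so a mismatch against your own log is a prompt to check the adapter's target list before doubting the arithmetic.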

Reading the numbers

  • 12.55 GiB, shared — the base read once per step for the whole batch: lab-01's step-1 matmul, weighed. The dedicated-deployment alternative would hold this per tenant — lab-03's gpus_saved, with real units.
  • 0.03 GiB: lab-03's adapter_bytes(4096, 32, 16) = 32 MiB, measured. 32 MiB is 0.03125 GiB, which the log rounds to 0.03. When a derived constant and a log line agree to the precision the log prints, both the model and your reading of the log are validated (the reconciliation habit, sixth appearance).
  • Two behaviors, one batch — the rows took the same forward pass through the base; only request 1's rows detoured through the shrink/expand delta (lora_request=[None, sql] is literally lab-01's adapter_ids = [-1, 1]). The same step, two models' worth of behavior — there is no trick left in that sentence for you anymore.
  • max_loras=2, max_lora_rank=16 in the first line — lab-04's slot count and lab-03's per-slot size, reserved at startup. Read them as a memory line item: 2 slots × rank-16 buffers, carved before KV blocks (Phase 2 lab-03's ritual gained a claimant).
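
The "two behaviors, one batch" point can be made concrete with a toy version of the grouped delta. This is a minimal pure-Python sketch, not lab-01's code or vLLM's kernel: it loops over rows where a real kernel would group rows by adapter, it omits the usual alpha/rank scaling, and the function names and toy matrices are made up for illustration.

```python
def matvec(matrix, vec):
    """Multiply a nested-list matrix (rows x cols) by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def grouped_lora_forward(xs, base_w, adapters, adapter_ids):
    """Compute W x for every row, plus B (A x) for rows tagged with an adapter.

    adapter_ids[i] == -1 means row i uses the base model only.
    """
    ys = []
    for x, aid in zip(xs, adapter_ids):
        y = matvec(base_w, x)                 # shared base pass, same for all rows
        if aid != -1:
            a, b = adapters[aid]              # A: rank x d_in, B: d_out x rank
            delta = matvec(b, matvec(a, x))   # shrink to rank, then expand back
            y = [yi + di for yi, di in zip(y, delta)]
        ys.append(y)
    return ys


base_w = [[1.0, 0.0], [0.0, 1.0]]             # toy 2x2 identity "base model"
adapters = {1: ([[1.0, 1.0]],                 # A: 1 x 2 (rank 1)
                [[0.5], [-0.5]])}             # B: 2 x 1
xs = [[2.0, 3.0], [2.0, 3.0]]                 # identical inputs, different tags
print(grouped_lora_forward(xs, base_w, adapters, [-1, 1]))
# [[2.0, 3.0], [4.5, 0.5]]
```

Both rows receive the same input and pass through the same base weights; only the row tagged 1 picks up the delta, which is the whole content of adapter_ids = [-1, 1].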

Hitchhiker's notes

  • The API server's productization: under vllm serve --enable-lora --lora-modules sql=/path/..., each adapter appears as a model name in the OpenAI-compatible /v1/models list, and clients select fine-tunes via the standard model field. Tenants never learn they're sharing; the consolidation is invisible by design. Runtime add/remove exists too (/v1/load_lora_adapter) — onboarding a tenant without a restart.
  • Latency asymmetry to expect: the first request for a cold adapter pays the host→device slot load (lab-04's miss, milliseconds) plus — first time ever — disk loading. Steady-state requests pay only the delta compute (~1%, invisible). If a tenant's p50 is fine but p99 spikes correlate with their traffic gaps, that's the slot cache breathing — lab-04's thrash arithmetic is the diagnosis sheet.
  • Quality due diligence transfers from Phase 6 lab-02: "the outputs looked right" is a smoke test. A tenant migration to shared serving deserves an eval-set diff (dedicated vs consolidated), which — per lab-01's equality proof — should show only float-reordering noise. If it shows more, suspect rank/config mismatches in the adapter conversion, not the engine.
  • What doesn't work (v0.22): adapters that target anything other than the base's linear layers (embedding/lm-head support varies), adapters with rank > max_lora_rank, and adapters trained on a different base. The last one is the dangerous case: an adapter trained on Llama-2-7B-chat applied to Llama-2-7B-hf loads fine and behaves subtly wrong. That is the silent version-skew failure, so checksum your bases.
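
From a client's point of view, the productization note amounts to changing one string. The sketch below builds the JSON bodies only. It assumes a server started with --lora-modules sql=..., so the adapter is named "sql"; the host and port in the usage note are hypothetical; and the lora_name/lora_path fields for /v1/load_lora_adapter follow the vLLM docs as I understand them, so verify them against your version.

```python
import json
import urllib.request


def completion_payload(model, prompt, max_tokens=32):
    """Body for POST /v1/completions; `model` picks the base or an adapter."""
    return {"model": model, "prompt": prompt,
            "max_tokens": max_tokens, "temperature": 0}


def load_adapter_payload(name, path):
    """Body for POST /v1/load_lora_adapter (field names assumed from the docs)."""
    return {"lora_name": name, "lora_path": path}


def post_json(url, payload):
    """Send a JSON body and return the decoded JSON reply (needs a live server)."""
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


prompt = "Table users(id,name). Query all names:"
base_req = completion_payload("meta-llama/Llama-2-7b-hf", prompt)
sql_req = completion_payload("sql", prompt)   # same request, different "model"
print(json.dumps(sql_req))
```

With a server running, post_json("http://localhost:8000/v1/completions", sql_req) would send the adapter request. The two payloads differ only in the model field, which is why tenants cannot tell they share a base.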

Reflect

  • Trace request 1's tokens through the phase: which lab's code decided it could enter the batch (lab-04), which loaded its weights where (lab-03's bytes into lab-04's slot), which computed its detour (lab-01), and what the base request paid for any of it (nothing — lab-01's -1 rows). If you can narrate that chain cold, the phase is yours.
  • Your platform hosts 40 tenant fine-tunes on max_loras=8 engines. Using labs 03+04: what traffic shape makes this comfortable, what shape melts it, and what do you monitor to tell them apart? (Skew → slot hit rate; uniform simultaneous activity → thrash; monitor per-engine adapter hit rate and defer counts.)
  • Why does the engine require max_lora_rank up front instead of sizing slots per adapter? (Phase 5's Constraint: fixed buffer shapes for captured graphs and fused kernels — the recurring trade of flexibility for replay. Heterogeneous ranks pad to the max; lab-01's going-further priced that.)
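
The 40-tenants-on-8-slots question can be explored with a toy simulation. This is a sketch under stated assumptions: eviction is least-recently-used (lab-04's actual policy may differ), each request touches exactly one adapter, and the two traffic shapes are deterministic patterns invented for illustration.

```python
from collections import OrderedDict


def slot_hit_rate(requests, max_loras):
    """Fraction of requests whose adapter is already resident in a slot (LRU)."""
    slots = OrderedDict()                    # adapter id -> None, oldest first
    hits = 0
    for adapter in requests:
        if adapter in slots:
            hits += 1
            slots.move_to_end(adapter)       # mark as most recently used
        else:
            if len(slots) >= max_loras:
                slots.popitem(last=False)    # evict the least recently used
            slots[adapter] = None
    return hits / len(requests)


def skewed_traffic(n, hot=6, tenants=40):
    """Nine of every ten requests cycle over `hot` tenants; the tenth is cold."""
    reqs = []
    for i in range(n):
        if i % 10 == 9:
            reqs.append(hot + (i // 10) % (tenants - hot))   # rotating cold tenant
        else:
            reqs.append(i % hot)
    return reqs


uniform = [i % 40 for i in range(4000)]      # all 40 tenants take turns
print(slot_hit_rate(uniform, max_loras=8))                       # 0.0
print(round(slot_hit_rate(skewed_traffic(4000), max_loras=8), 2))  # 0.9
```

Round-robin over 40 adapters never finds a resident slot, while a hot set that fits in the slots hits about nine times in ten. Per-engine hit rate is therefore the first number to watch when deciding which regime a deployment is in.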

References

  • upstream/vllm/lora/ — request plumbing, slot manager, punica kernels: the whole phase's upstream home.
  • vLLM docs, LoRA Adapters — serving config, runtime loading, the OpenAI-server productization: https://docs.vllm.ai/en/latest/features/lora/
  • Labs 01 (the math), 03 (the bill), 04 (the slots) — this run is their joint integration test.
  • Phase 6 lab-02 — the quality-verification discipline that transfers here verbatim.