Lab 02-03 — Inspect Real vLLM's KV Blocks [GPU-OPT]

You've built the allocator (lab-01) and measured why it wins (lab-02). Now watch the real thing manage real gigabytes: how vLLM decides at startup how many KV blocks your GPU gets, how usage breathes as requests come and go, and the startup log line that tells you — before a single request arrives — how many concurrent users this deployment can hold. This lab is where Phase 2 stops being a data-structures exercise and becomes capacity planning.

No GPU? Don't panic. A complete captured run (L4 24GB) is annotated below. The arithmetic — which is the lesson — works the same on paper.

Why this lab exists

The most consequential number in any vLLM deployment is printed once, at startup, and most operators scroll past it: # GPU blocks: NNNN. That number is your serving capacity — it bounds how many tokens of context can exist on the GPU simultaneously, which bounds concurrent users, which bounds throughput (because batch size is where throughput comes from, Phase 18). Every knob you'll ever tune for capacity — gpu_memory_utilization, max_model_len, model choice, quantization, tensor parallelism — acts by moving this one number. This lab teaches you to read it, predict it, and change it on purpose.

The skill being drilled is first-principles capacity planning: given a GPU and a model, compute on paper how many blocks you'll get, then start the engine and check. When the prediction lands within a few percent, KV memory stops being a mystery you provision by trial-and-OOM and becomes something you budget like a spreadsheet.

Background: where blocks come from

At startup, vLLM runs a careful ritual (upstream/vllm/v1/worker/gpu_worker.py, determine_available_memory):

  1. Load the weights, measure what's left of the gpu_memory_utilization budget.
  2. Profile a worst-case forward pass (max batch, max length, dummy data) to measure peak activation memory — the scratch space a real step needs. This is why startup takes those extra seconds; it's also why vLLM doesn't OOM at the first big batch like naive servers do: it already simulated the worst day.
  3. Whatever survives — budget − weights − peak activations − allocator overhead — is carved into KV blocks of block_size tokens each (kv_cache_utils.get_kv_cache_configs).

So: num_gpu_blocks ≈ (HBM·util − weights − activations) / bytes_per_block, with bytes_per_block = block_size · num_layers · 2 (K and V) · num_kv_heads · head_dim · dtype_bytes. Every term is knowable from the model config. Keep this formula; you'll use it in the worked arithmetic below and for the rest of your career.
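The formula above can be written as two small functions. This is a minimal sketch, not vLLM's code: the function names are made up for illustration, and the weight and activation sizes passed in at the bottom are hypothetical round numbers rather than measurements.

```python
GIB = 1024**3


def bytes_per_block(block_size, num_layers, num_kv_heads, head_dim, dtype_bytes):
    """KV bytes for one block: K and V, every layer, block_size tokens."""
    return block_size * num_layers * 2 * num_kv_heads * head_dim * dtype_bytes


def estimate_num_gpu_blocks(hbm_bytes, util, weight_bytes, activation_bytes, block_bytes):
    """Blocks that fit in what is left of the budget after the fixed costs."""
    kv_budget = hbm_bytes * util - weight_bytes - activation_bytes
    return max(0, int(kv_budget // block_bytes))


# OPT-125m shape used in this lab: 12 layers, 12 heads x 64 head_dim, fp16, block_size=16.
opt_block = bytes_per_block(block_size=16, num_layers=12, num_kv_heads=12,
                            head_dim=64, dtype_bytes=2)
print(opt_block)  # 589824 bytes, i.e. 576 KiB

# Hypothetical fixed costs (0.25 GiB weights, 0.5 GiB activations), for illustration only.
blocks = estimate_num_gpu_blocks(24 * GIB, 0.5, weight_bytes=GIB // 4,
                                 activation_bytes=GIB // 2, block_bytes=opt_block)
print(blocks)  # 20480
```

The estimate is only as good as the fixed-cost inputs; the real engine measures them during profiling, so treat this as an upper-bound sanity check.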

Requirements

# Any 16–24GB GPU (T4/L4/A10) is plenty:
uv pip install -e ".[vllm]"                  # vllm==0.22.1, matches the course pin
huggingface-cli download facebook/opt-125m   # tiny model: engine is the star, not the model

Steps

  1. Start the engine and read its self-assessment:
# run.py
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.5, max_model_len=2048)

# The startup log already told you everything; the live objects confirm it.
# (Exact attribute paths drift across versions — explore with dir()/vars(). The stable
#  interface is the log + metrics, which is why this lab teaches you to read those.)
prompts = ["The capital of France is"] * 8
out = llm.generate(prompts, SamplingParams(max_tokens=64, temperature=0))
print(out[0].outputs[0].text)
  2. Re-run with gpu_memory_utilization=0.9 and watch # GPU blocks roughly double. You are turning the one capacity knob; everything else in the log stays put.

  3. Turn on prefix caching (enable_prefix_caching=True), send 8 identical prompts, run with VLLM_LOGGING_LEVEL=DEBUG, and watch the hit-rate counter climb while KV usage stays near 1× a single prompt. (That mechanism is lab-05's subject on the mini engine, and Phase 3 lab-03 measures it for real with a long shared system prompt.)

What to look for / log

  • # GPU blocks — the BlockPool size (upstream block_pool.py:130; your lab-01 class, at scale). Verify it scales ~linearly with gpu_memory_utilization.
  • Maximum concurrency for 2,048 tokens per request: NN.NNx — the engine doing your lab-02 arithmetic for you: total KV tokens ÷ max_model_len.
  • KV-cache usage % (in the periodic stats lines) — rising during decode (blocks accrete one at a time as sequences cross block boundaries), dropping to ~0 when requests finish (blocks return to the free queue — your free_blocks).
  • Prefix cache hit rate — with caching on and identical prompts, watch 7 of 8 requests ride the first one's blocks.

Captured output (real run, facebook/opt-125m, L4 24GB, vLLM 0.22.1)

INFO ... Using Flash Attention backend.
INFO ... GPU KV cache size: 140,608 tokens
INFO ... Maximum concurrency for 2,048 tokens per request: 68.65x
INFO ... # GPU blocks: 8788, # CPU blocks: 0          (block_size=16 -> 8788*16 = 140,608)
...
Prompt: 'The capital of France is', Generated: ' Paris. The capital of France is Paris...'

# With gpu_memory_utilization=0.9:
INFO ... # GPU blocks: 17234                           (~1.96x the blocks for 1.8x the budget)

# With enable_prefix_caching=True and 8 identical prompts:
INFO ... Prefix cache hit rate: GPU: 87.5%             (7 of 8 reuse the first's blocks)
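
The capacity lines in the capture can be pulled out mechanically. Below is a minimal sketch that assumes the log lines look exactly as shown in this capture; the regular expressions and the parse_capacity helper are illustrative and would need adjusting if another vLLM version formats the lines differently.

```python
import re

# Lines copied from the capture above (timestamps elided as in the original).
CAPTURE = """\
INFO ... GPU KV cache size: 140,608 tokens
INFO ... Maximum concurrency for 2,048 tokens per request: 68.65x
INFO ... # GPU blocks: 8788, # CPU blocks: 0
"""

PATTERNS = {
    "kv_tokens": r"GPU KV cache size: ([\d,]+) tokens",
    "max_len": r"Maximum concurrency for ([\d,]+) tokens per request",
    "gpu_blocks": r"# GPU blocks: (\d+)",
    "cpu_blocks": r"# CPU blocks: (\d+)",
}


def parse_capacity(log_text):
    """Return the capacity numbers found in log_text; missing fields are None."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, log_text)
        found[name] = int(match.group(1).replace(",", "")) if match else None
    return found


cap = parse_capacity(CAPTURE)
print(cap)
# Cross-check: blocks * block_size should equal the advertised token capacity.
print(cap["gpu_blocks"] * 16 == cap["kv_tokens"])  # True
```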

The capacity arithmetic, worked

Check the engine's homework. OPT-125m: 12 layers, 12 heads × 64 head_dim = 768 hidden, fp16 (2 bytes). Per token: 12 layers · 2 (K,V) · 768 · 2 B = 36,864 B = 36 KiB. Per 16-token block: 589,824 B = 576 KiB. The L4 has 24 GB; at util=0.5 that's a 12 GB budget, minus ~250 MB of weights and a few hundred MB of profiled activations ≈ 11.5 GB for KV. The log disagrees: 8788 blocks × 576 KiB ≈ 5.2 GB, less than half of 11.5, because vLLM 0.22 on this tiny model also caps the pool by other limits (activation profiling with the default 8k batched-token budget, allocator granularity). The lesson stands with the discrepancy: you can sanity-check the engine's numbers from the model config, and when your estimate and the log disagree by 2×, one of your assumptions is wrong and the log will tell you which (here: read the lines above the block count, where the profiling run reports its measured peak).
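
The same check, run backwards from the log: take the printed block count, convert it to bytes, and compare against the paper estimate. This sketch uses only numbers stated in this lab; the 11.5 GB figure is the rough paper estimate, not a measurement.

```python
KIB = 1024
GB = 10**9

per_token = 12 * 2 * 768 * 2      # layers * (K, V) * hidden * fp16 bytes
per_block = 16 * per_token        # block_size = 16 tokens
pool_bytes = 8788 * per_block     # block count from the captured log

naive_kv = 11.5 * GB              # paper estimate: budget minus weights and activations

print(per_token, per_block // KIB)       # 36864 576
print(round(pool_bytes / GB, 2))         # 5.18
print(round(naive_kv / pool_bytes, 2))   # 2.22
```

A ratio above 2 is the signal to go back to the startup log and find which assumption the engine did not share.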

Then the headline: 140,608 cacheable tokens / 2,048 per request = 68.656, which the log prints as 68.65x, the "maximum concurrency." Memory, not compute, set that cap: the GPU could compute attention for hundreds of sequences, but it can only remember 68 max-length ones. Now re-read the 8 identical prompts above: with prefix caching, those 8 requests cost ~1 prompt of KV, so sharing raises effective concurrency without buying a single byte. That chain (HBM → blocks → concurrency → sharing multiplies it) is the business case of this entire phase in four arrows.
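
Both halves of that chain fit in a few lines. This is a toy model under stated assumptions: blocks_needed is a hypothetical helper, it treats only full blocks of an identical prompt as shareable, it ignores generated tokens, and the 1,000-token prompt is a made-up length chosen to span many blocks.

```python
import math

BLOCK_SIZE = 16


def max_concurrency(kv_tokens, max_model_len):
    """Max-length requests whose KV fits at once (the log's concurrency line)."""
    return kv_tokens / max_model_len


def blocks_needed(num_requests, prompt_len, shared_prefix, block_size=BLOCK_SIZE):
    """Blocks held by identical prompts; only full blocks are modeled as shared."""
    per_request = math.ceil(prompt_len / block_size)
    if not shared_prefix:
        return per_request * num_requests
    shared = prompt_len // block_size   # full blocks, stored once
    private = per_request - shared      # trailing partial block, one per request
    return shared + private * num_requests


print(max_concurrency(140_608, 2_048))                 # 68.65625
print(max_concurrency(140_608, 1_024))                 # 137.3125
print(blocks_needed(8, 1_000, shared_prefix=False))    # 504
print(blocks_needed(8, 1_000, shared_prefix=True))     # 70
```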

Hitchhiker's notes

  • # CPU blocks: 0 — KV swap to host memory is unused here (V1 prefers recompute on preemption; Phase 3 lab-04 shows why recompute is usually the better trade).
  • Going from gpu_memory_utilization=0.5 to 0.9 is a 1.8× budget increase, not 2×, yet the block count grew 1.96× (8788 → 17234). The two ratios don't match because the weights and activation reservation are fixed costs paid before carving; only the remainder scales. Same reason a bigger model on the same GPU loses blocks twice: more bytes per block and fewer bytes left to carve.
  • Don't run 1.0. The CUDA context, fragmentation slack, and anything else on the GPU need headroom; 0.90–0.95 is the practical ceiling. The OOM you avoid by leaving 5% is the one that takes the whole server down, not one request.
  • max_model_len is a capacity knob in disguise. It doesn't change bytes per block; it changes the denominator of the concurrency line and the worst case the profiler simulates, so the block count moves at most through the activation reservation. Halving it roughly doubles printed concurrency. When a deployment "needs more capacity," check whether anyone actually uses the configured context length before buying GPUs; it is the cheapest capacity you'll ever reclaim.
  • Attribute paths into the live engine (llm.llm_engine...) drift across versions — vLLM's Python internals are not a stable API. The log lines and Prometheus metrics are the supported observability surface; build your tooling on those. (The course pin means the capture above will match your run exactly; on a newer vLLM, expect the same facts with different formatting.)
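
The fixed-cost effect in the notes above is easy to see in a toy model. Everything numeric here is hypothetical: a 2 GB fixed cost and a 576 KB block are chosen for illustration, not taken from the capture, and blocks is a made-up helper rather than anything vLLM exposes.

```python
def blocks(hbm_gb, util, fixed_gb, block_kb):
    """Blocks carved from the remainder after fixed costs (weights + activations)."""
    remainder_kb = (hbm_gb * util - fixed_gb) * 1_000_000
    return int(remainder_kb // block_kb)


low = blocks(24, 0.5, fixed_gb=2, block_kb=576)
high = blocks(24, 0.9, fixed_gb=2, block_kb=576)
print(low, high)               # 17361 34027
print(round(high / low, 2))    # 1.96, although the budget only grew 1.8x

# A model with twice the weights and twice the bytes per block loses twice:
bigger = blocks(24, 0.9, fixed_gb=4, block_kb=1152)
print(bigger)                  # 15277, less than half of 34027
```

Because the fixed cost is subtracted before dividing, the block count is an affine function of gpu_memory_utilization rather than a proportional one.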

Reflect

  • Why does the block count exist at all — why not allocate KV lazily from a CUDA memory pool as requests arrive? (Hint: what does the scheduler need to know before admitting a request, and what would "maybe there's memory" do to the preemption design in Phase 3? Pre-carving turns memory into countable tokens — admission control becomes integer math.)
  • A teammate proposes gpu_memory_utilization=0.95, max_model_len=32768 for a chat product whose p99 conversation is 4k tokens. Using this lab's arithmetic, what do you say? (Concurrency at 32k worst case is ~8× worse than the workload justifies; the profiler also reserves activation memory for the 32k worst case. Right answer: cap the length at the product's real p99 + margin, or serve the rare long tail elsewhere.)
  • With prefix caching on and 8 identical prompts: why 87.5% and not 100%? (1/8 requests — the first — must compute the prefix; 7/8 hit. The hit rate measures reuse, and a cache no one has populated yet can't hit. Same first-requester effect you'll measure in Phase 3 lab-03/06.)
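
Two of the answers above reduce to one-line arithmetic. The sketch below counts hits per request, which is a simplification made for this illustration; the engine's own counter may work at a finer granularity, and request_hit_rate is a hypothetical helper.

```python
def request_hit_rate(prompts):
    """Fraction of requests whose prompt was already cached by an earlier one."""
    seen = set()
    hits = 0
    for prompt in prompts:
        if prompt in seen:
            hits += 1
        else:
            seen.add(prompt)   # first requester pays to populate the cache
    return hits / len(prompts)


print(request_hit_rate(["same prompt"] * 8))   # 0.875

# Teammate's proposal: worst-case length vs. the workload's p99.
print(32_768 / 4_096)                          # 8.0
```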

References

  • upstream/vllm/v1/worker/gpu_worker.py, determine_available_memory: the startup ritual (profile, subtract, carve).
  • upstream/vllm/v1/core/kv_cache_utils.py, get_kv_cache_configs: blocks from bytes.
  • upstream/vllm/v1/core/block_pool.py:130 — the pool those blocks live in (your lab-01).
  • vLLM docs, Optimization and Tuning — the official guidance on the knobs you just turned: https://docs.vllm.ai/en/latest/configuration/optimization.html
  • Kwon et al., PagedAttention (SOSP 2023), §6 — the capacity/throughput evaluation this lab miniaturizes: https://arxiv.org/abs/2309.06180
  • kipply, Transformer Inference Arithmetic — per-token KV-byte math like the worked example above, generalized: https://kipp.ly/transformer-inference-arithmetic/