Lab 02-03 — Inspect Real vLLM's KV Blocks [GPU-OPT]
You've built the allocator (lab-01) and measured why it wins (lab-02). Now watch the real thing manage real gigabytes: how vLLM decides at startup how many KV blocks your GPU gets, how usage breathes as requests come and go, and the startup log line that tells you — before a single request arrives — how many concurrent users this deployment can hold. This lab is where Phase 2 stops being a data-structures exercise and becomes capacity planning.
No GPU? Don't panic. A complete captured run (L4 24GB) is annotated below. The arithmetic — which is the lesson — works the same on paper.
Contents
- Why this lab exists
- Background: where blocks come from
- Requirements
- Steps
- What to look for / log
- Captured output (real run, facebook/opt-125m, L4 24GB, vLLM 0.22.1)
- The capacity arithmetic, worked
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
The most consequential number in any vLLM deployment is printed once, at startup, and most
operators scroll past it: # GPU blocks: NNNN. That number is your serving capacity — it
bounds how many tokens of context can exist on the GPU simultaneously, which bounds
concurrent users, which bounds throughput (because batch size is where throughput comes
from, Phase 18). Every knob you'll ever tune for capacity — gpu_memory_utilization,
max_model_len, model choice, quantization, tensor parallelism — acts by moving this one
number. This lab teaches you to read it, predict it, and change it on purpose.
The skill being drilled is first-principles capacity planning: given a GPU and a model, compute on paper how many blocks you'll get, then start the engine and check. When the prediction lands within a few percent, KV memory stops being a mystery you provision by trial-and-OOM and becomes something you budget like a spreadsheet.
Background: where blocks come from
At startup, vLLM runs a careful ritual (upstream/vllm/v1/worker/gpu_worker.py,
determine_available_memory):
- Load the weights, measure what's left of the gpu_memory_utilization budget.
- Profile a worst-case forward pass (max batch, max length, dummy data) to measure peak activation memory, the scratch space a real step needs. This is why startup takes those extra seconds; it's also why vLLM doesn't OOM at the first big batch like naive servers do: it already simulated the worst day.
- Whatever survives (budget − weights − peak activations − allocator overhead) is carved into KV blocks of block_size tokens each (kv_cache_utils.get_kv_cache_configs).
So: num_gpu_blocks ≈ (HBM·util − weights − activations) / bytes_per_block, with
bytes_per_block = block_size · num_layers · 2 (K and V) · num_kv_heads · head_dim · dtype_bytes.
Every term is knowable from the model config. Keep this formula; you'll use it in the
worked arithmetic below and for the rest of your career.
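The formula above can be written out as a small calculator. This is a minimal sketch, not vLLM's implementation: the function names kv_bytes_per_block and predict_num_gpu_blocks are made up for this lab, and the weight and activation figures in the example are illustrative placeholders that you would replace with values from your own startup log.

```python
def kv_bytes_per_block(block_size, num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache held by one block (the factor 2 is K and V)."""
    return block_size * num_layers * 2 * num_kv_heads * head_dim * dtype_bytes


def predict_num_gpu_blocks(hbm_bytes, util, weight_bytes, activation_bytes, bytes_per_block):
    """Paper estimate of the pool size; ignores allocator overhead and other engine limits."""
    kv_budget = hbm_bytes * util - weight_bytes - activation_bytes
    return max(0, int(kv_budget // bytes_per_block))


# OPT-125m in fp16 with block_size=16 (12 layers, 12 KV heads, head_dim 64).
per_block = kv_bytes_per_block(block_size=16, num_layers=12, num_kv_heads=12,
                               head_dim=64, dtype_bytes=2)
print(per_block)  # 589824 bytes = 576 KiB

# Placeholder inputs (assumptions, not measurements): 24 GB card, util=0.5,
# 250 MB of weights, 300 MB of profiled activations.
GB = 10**9
MB = 10**6
estimate = predict_num_gpu_blocks(24 * GB, 0.5, weight_bytes=250 * MB,
                                  activation_bytes=300 * MB, bytes_per_block=per_block)
print(estimate)
```

Treat the result as an upper-bound estimate to compare against the # GPU blocks line, not as a prediction the engine is obliged to honor.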
Requirements
# Any 16–24GB GPU (T4/L4/A10) is plenty:
uv pip install -e ".[vllm]" # vllm==0.22.1, matches the course pin
huggingface-cli download facebook/opt-125m # tiny model: engine is the star, not the model
Steps
- Start the engine and read its self-assessment:
# run.py
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.5, max_model_len=2048)
# The startup log already told you everything; the live objects confirm it.
# (Exact attribute paths drift across versions — explore with dir()/vars(). The stable
# interface is the log + metrics, which is why this lab teaches you to read those.)
prompts = ["The capital of France is"] * 8
out = llm.generate(prompts, SamplingParams(max_tokens=64, temperature=0))
print(out[0].outputs[0].text)
- Re-run with gpu_memory_utilization=0.9 and watch # GPU blocks roughly double. You are turning the one capacity knob; everything else in the log stays put.
- Turn on prefix caching (enable_prefix_caching=True), send 8 identical prompts, run with VLLM_LOGGING_LEVEL=DEBUG, and watch the hit-rate counter climb while KV usage stays near 1× a single prompt. (That mechanism is lab-05's subject on the mini engine, and Phase 3 lab-03 measures it for real with a long shared system prompt.)
What to look for / log
- # GPU blocks is the BlockPool size (upstream block_pool.py:130; your lab-01 class, at scale). Verify it scales ~linearly with gpu_memory_utilization.
- Maximum concurrency for 2,048 tokens per request: NN.NNx is the engine doing your lab-02 arithmetic for you: total KV tokens ÷ max_model_len.
- KV-cache usage % (in the periodic stats lines) rises during decode (blocks accrete one at a time as sequences cross block boundaries) and drops to ~0 when requests finish (blocks return to the free queue, your free_blocks).
- Prefix cache hit rate: with caching on and identical prompts, watch 7 of 8 requests ride the first one's blocks.
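Since the log is the stable interface, it can help to pull these figures out mechanically. The sketch below is a hypothetical helper (parse_startup_log is not part of vLLM), and its regular expressions are modelled only on the line formats shown in this lab's capture; other vLLM versions may word these lines differently.

```python
import re

# Patterns modelled on the captured lines in this lab (an assumption about format).
PATTERNS = {
    "kv_cache_tokens": re.compile(r"GPU KV cache size:\s*([\d,]+)\s*tokens"),
    "max_concurrency": re.compile(
        r"Maximum concurrency for [\d,]+ tokens per request:\s*([\d.]+)x"
    ),
    "num_gpu_blocks": re.compile(r"# GPU blocks:\s*(\d+)"),
}


def parse_startup_log(text):
    """Return the first match for each capacity figure found in the log text."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            raw = match.group(1).replace(",", "")
            found[name] = float(raw) if "." in raw else int(raw)
    return found


sample = """\
INFO ... GPU KV cache size: 140,608 tokens
INFO ... Maximum concurrency for 2,048 tokens per request: 68.65x
INFO ... # GPU blocks: 8788, # CPU blocks: 0
"""
stats = parse_startup_log(sample)
print(stats)
```

Figures that do not appear in the text are simply absent from the returned dict, so a missing key is a signal that the log format has changed.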
Captured output (real run, facebook/opt-125m, L4 24GB, vLLM 0.22.1)
INFO ... Using Flash Attention backend.
INFO ... GPU KV cache size: 140,608 tokens
INFO ... Maximum concurrency for 2,048 tokens per request: 68.65x
INFO ... # GPU blocks: 8788, # CPU blocks: 0 (block_size=16 -> 8788*16 = 140,608)
...
Prompt: 'The capital of France is', Generated: ' Paris. The capital of France is Paris...'
# With gpu_memory_utilization=0.9:
INFO ... # GPU blocks: 17234 (~1.96x the blocks for 1.8x the budget)
# With enable_prefix_caching=True and 8 identical prompts:
INFO ... Prefix cache hit rate: GPU: 87.5% (7 of 8 reuse the first's blocks)
The capacity arithmetic, worked
Check the engine's homework. OPT-125m: 12 layers, 12 heads × 64 head_dim = 768 hidden,
fp16 (2 bytes). Per token: 12 layers · 2 (K,V) · 768 · 2 B = 36,864 B = 36 KiB. Per
16-token block: 576 KiB (589,824 B). The L4 has 24 GB; at util=0.5 that's a 12 GB budget,
minus ~250 MB of weights and a few hundred MB of profiled activations ≈ 11.5 GB for KV.
The log disagrees: 8788 blocks × 576 KiB ≈ 5.2 GB, less than half of 11.5, because vLLM
0.22 on this tiny model also caps the pool by other limits (activation profiling with the
default 8k batched-token budget, allocator granularity). The lesson survives the
discrepancy: you can sanity-check the engine's numbers from the model config, and when
your estimate and the log disagree by 2×, one of your assumptions is wrong and the log
will tell you which (here: read the lines above the block count, where the profiling run
reports its measured peak).
Then the headline: 140,608 cacheable tokens / 2,048 per request = 68.65 — the printed
"maximum concurrency." Memory, not compute, set that cap: the GPU could compute attention
for hundreds of sequences, but it can only remember 68 max-length ones. Now re-read the
8 identical prompts above: with prefix caching, those 8 requests cost ~1 prompt of KV —
sharing raises effective concurrency without buying a single byte. That chain —
HBM → blocks → concurrency → sharing multiplies it — is the business case of this entire
phase in four arrows.
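The chain from blocks to concurrency to sharing can be replayed in a few lines. This is a sketch using the figures from the captured log; the 1,024-token shared prompt at the end is a hypothetical number chosen only to illustrate the sharing step.

```python
BLOCK_SIZE = 16         # tokens per block, from the log
NUM_GPU_BLOCKS = 8788   # from the captured log at util=0.5
MAX_MODEL_LEN = 2048

# OPT-125m, fp16: layers * (K and V) * hidden * bytes per element.
BYTES_PER_TOKEN = 12 * 2 * 768 * 2

kv_tokens = NUM_GPU_BLOCKS * BLOCK_SIZE
max_concurrency = kv_tokens / MAX_MODEL_LEN
pool_bytes = kv_tokens * BYTES_PER_TOKEN

print(kv_tokens)         # 140608
print(max_concurrency)   # 68.65625
print(pool_bytes / 1e9)  # roughly 5.18 (GB)

# Sharing: 8 requests with the same prompt (hypothetical length).
NUM_REQUESTS = 8
SHARED_PROMPT_TOKENS = 1024
kv_without_sharing = NUM_REQUESTS * SHARED_PROMPT_TOKENS
kv_with_sharing = SHARED_PROMPT_TOKENS  # one copy of the prompt's blocks
print(kv_without_sharing // kv_with_sharing)  # 8
```

The last line is the multiplier in the fourth arrow: the same pool holds 8 times as many copies of that prompt when the blocks are shared rather than duplicated.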
Hitchhiker's notes
- # CPU blocks: 0 means KV swap to host memory is unused here (V1 prefers recompute on preemption; Phase 3 lab-04 shows why recompute is usually the better trade).
- Raising gpu_memory_utilization from 0.5 to 0.9 is a 1.8× budget, yet blocks grew about 1.96× (8788 → 17234). The weights and activation reservation are fixed costs paid before carving; only the remainder scales, so the block count is never simply proportional to the knob. Same reason a bigger model on the same GPU loses blocks twice: more bytes per block and fewer bytes left to carve.
- Don't run 1.0. The CUDA context, fragmentation slack, and anything else on the GPU need headroom; 0.90–0.95 is the practical ceiling. The OOM you avoid by leaving 5% is the one that takes the whole server down, not one request.
- max_model_len is a capacity knob in disguise. It doesn't change the block count; it changes the denominator of the concurrency line and the worst case the profiler simulates. Halving it roughly doubles printed concurrency. When a deployment "needs more capacity," check whether anyone actually uses the configured context length before buying GPUs; it is the cheapest capacity you'll ever reclaim.
- Attribute paths into the live engine (llm.llm_engine...) drift across versions; vLLM's Python internals are not a stable API. The log lines and Prometheus metrics are the supported observability surface; build your tooling on those. (The course pin means the capture above should closely match your run on the same GPU; on a newer vLLM, expect the same facts with different formatting.)
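Two of the notes above are easy to check numerically: fixed costs make the block count an affine (not proportional) function of the utilization knob, and max_model_len only moves the denominator of the concurrency line. The sketch below uses a deliberately simplified model; the function names and the 2 GB fixed cost are assumptions for illustration, not values measured from vLLM.

```python
def blocks_for_util(util, hbm_bytes, fixed_bytes, bytes_per_block):
    """Simplified model: carve whatever the budget leaves after fixed costs."""
    return int((hbm_bytes * util - fixed_bytes) // bytes_per_block)


def printed_concurrency(num_blocks, block_size, max_model_len):
    """Total KV tokens divided by the per-request worst case."""
    return num_blocks * block_size / max_model_len


GB = 10**9
BYTES_PER_BLOCK = 589_824   # OPT-125m, fp16, block_size=16
FIXED = 2 * GB              # hypothetical weights + activation reservation

low = blocks_for_util(0.5, 24 * GB, FIXED, BYTES_PER_BLOCK)
high = blocks_for_util(0.9, 24 * GB, FIXED, BYTES_PER_BLOCK)
print(low, high, high / low)  # the ratio exceeds the 1.8x budget ratio

# Same pool, different max_model_len: halving the length doubles concurrency.
for max_len in (4096, 2048, 1024):
    print(max_len, printed_concurrency(8788, 16, max_len))
```

With any positive fixed cost, the ratio of block counts is larger than the ratio of budgets, which is the qualitative pattern the captured 8788 → 17234 step shows.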
Reflect
- Why does the block count exist at all — why not allocate KV lazily from a CUDA memory pool as requests arrive? (Hint: what does the scheduler need to know before admitting a request, and what would "maybe there's memory" do to the preemption design in Phase 3? Pre-carving turns memory into countable tokens — admission control becomes integer math.)
- A teammate proposes gpu_memory_utilization=0.95, max_model_len=32768 for a chat product whose p99 conversation is 4k tokens. Using this lab's arithmetic, what do you say? (Concurrency at 32k worst case is ~8× worse than the workload justifies; the profiler also reserves activation memory for the 32k worst case. Right answer: cap the length at the product's real p99 + margin, or serve the rare long tail elsewhere.)
- With prefix caching on and 8 identical prompts: why 87.5% and not 100%? (1/8 requests, the first, must compute the prefix; 7/8 hit. The hit rate measures reuse, and a cache no one has populated yet can't hit. Same first-requester effect you'll measure in Phase 3 lab-03/06.)
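The 87.5% answer can be reproduced with a toy cache. This is a simplified model written for this lab, not vLLM's prefix-caching code: it assumes requests are processed one after another, that only full blocks are cached, and that a block is identified by the entire token prefix up to and including it. The function name simulate_prefix_cache is hypothetical.

```python
def simulate_prefix_cache(prompts, block_size=16):
    """Return the fraction of full-block lookups served from the toy cache."""
    cached = set()
    queried = 0
    hits = 0
    for tokens in prompts:
        num_full_blocks = len(tokens) // block_size
        for i in range(num_full_blocks):
            # Key a block by its whole prefix so equal blocks in
            # different contexts are not confused.
            key = tuple(tokens[: (i + 1) * block_size])
            queried += 1
            if key in cached:
                hits += 1
            else:
                cached.add(key)  # first requester pays and populates
    return hits / queried if queried else 0.0


prompt = list(range(64))  # 64 token ids = 4 full blocks
print(simulate_prefix_cache([prompt] * 8))  # 0.875
print(simulate_prefix_cache([prompt]))      # 0.0
```

The first request misses on every block and fills the cache; the other seven hit on every block, which gives 28 hits out of 32 lookups.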
References
- upstream/vllm/v1/worker/gpu_worker.py, determine_available_memory. The startup ritual (profile, subtract, carve).
- upstream/vllm/v1/core/kv_cache_utils.py, get_kv_cache_configs. Blocks from bytes.
- upstream/vllm/v1/core/block_pool.py:130. The pool those blocks live in (your lab-01).
- vLLM docs, Optimization and Tuning. The official guidance on the knobs you just turned: https://docs.vllm.ai/en/latest/configuration/optimization.html
- Kwon et al., PagedAttention (SOSP 2023), §6. The capacity/throughput evaluation this lab miniaturizes: https://arxiv.org/abs/2309.06180
- kipply, Transformer Inference Arithmetic. Per-token KV-byte math like the worked example above, generalized: https://kipp.ly/transformer-inference-arithmetic/