Lab 02-02 — Measure Fragmentation: Contiguous vs Paged [CPU-OK]
"Paging saves memory" is a slogan. This lab turns it into a number — one you produce, on your laptop, in milliseconds. You'll simulate the pre-vLLM allocation strategy and the paged strategy on the same stream of requests and measure exactly how many requests each admits and how many memory slots each wastes. The ratio you compute is, in miniature, the entire empirical case for PagedAttention — the 2–4× from the SOSP paper, re-derived by you.
Contents
- Why this lab exists
- Background: the two kinds of waste
- The experiment
- Files
- Run
- What you should see — with the arithmetic
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
A staff engineer's job is frequently to re-derive someone's headline claim from first
principles before betting an architecture on it. Papers cherry-pick; blog posts round up;
your workload is never quite theirs. The skill this lab drills is building the smallest
simulation that captures a memory-allocation phenomenon — no GPU, no model, no engine, just
the allocation math — and using it to interrogate a claim. You'll use this move constantly:
sizing KV for a new deployment, evaluating "should we raise block_size?", estimating what
a longer max_model_len costs before anyone provisions hardware.
It also makes Phase 2's core trade quantitative. After this lab you won't say "contiguous allocation wastes memory"; you'll say "on a reserve-512-use-32 workload it wastes 94% and admits 8 requests where paging admits 120 — and here's the four-line model that says so."
Background: the two kinds of waste
Memory allocators lose memory two ways, and the distinction drives everything:
- Internal fragmentation — waste inside an allocation: you reserved more than you
used. The contiguous strategy reserves
max_lenper request because generation length is unknowable in advance; a request that stops after 32 tokens of a 512 reservation strands 480 slots. Note this waste is invisible to the allocator — the slots are "allocated," the dashboard says the memory is in use, and yet it holds nothing. The PagedAttention paper measured real systems (Orca-style reservation) at 60–80% waste this way. - External fragmentation — waste between allocations: enough total free slots exist, but no single contiguous run is big enough, so an admission fails anyway. This one appears only after churn (allocate/free cycles punch holes), which is why naive benchmarks — and naive simulations — miss it.
Paging attacks both at once: nothing is reserved beyond the current need (internal waste
collapses to a partial tail block, < block_size per request), and no run needs to be
contiguous (external waste becomes structurally impossible — every free block is exactly
the right size). The price: an indirection table and a kernel that can follow it (lab-04/06).
That trade — a pointer per page for near-zero waste — is the same one OS designers accepted
in the 1960s, and for the same reason.
The experiment
A stream of requests arrives, each actually using some number of KV slots (tokens) out of
a worst-case max_len. You have a fixed pool of total_slots. Two allocators:
contiguous_admit(the old way): each request reservesmax_lencontiguous slots, first-fit. Reject if no extent is big enough. Waste =max_len − usedper admitted request.paged_admit(vLLM's way): each request takesceil(used / block_size)blocks from anywhere. Reject if not enough blocks remain. Waste = the partial tail block:need·block_size − used.
Same arrivals, same pool. Count admissions, rejections, wasted slots.
Files
starter.py— implementcontiguous_admit()andpaged_admit()(TODOs). Your work.solution.py— reference, plus areport()you can run directly:python phase-02-paged-attention/labs/lab-02-fragmentation-viz/solution.py.test_lab.py— asserts paged admits more and wastes less on a reserve-big-use-small workload, and pins the exact waste arithmetic of each strategy.
Run
LAB_IMPL=starter pytest phase-02-paged-attention/labs/lab-02-fragmentation-viz -q
pytest phase-02-paged-attention/labs/lab-02-fragmentation-viz -q # reference (default)
What you should see — with the arithmetic
Default report() parameters: total_slots=4096, max_len=512, used_len=32,
block_size=16, 64 arrivals.
contiguous: admitted=8 rejected=56 wasted=3840
paged: admitted=64 rejected=0 wasted=0
Walk the numbers; each is checkable in your head, which is the point of a model this small:
- Contiguous admits 8 = ⌊4096 / 512⌋. The pool is "full" after eight 512-slot reservations — even though those eight requests are using only 8 × 32 = 256 slots, i.e. 6% of the pool is doing work while 100% is reserved.
wasted=3840= 8 × (512 − 32). Internal fragmentation, precisely.- Paged admits all 64 — they need 64 × ⌈32/16⌉ = 128 blocks = 2048 slots of the 4096 available. The pool isn't even half full. 15× the admissions on identical memory — and admissions ≈ concurrent users ≈ throughput, which is why a memory-bookkeeping change produced vLLM's throughput headline.
- Paged
wasted=0is an artifact worth noticing: 32 divides evenly by 16, so the tail block is full. Changeused_len=33and waste jumps to 64 × 15 = 960 — each request's last block holds 1 token and strands 15 slots. Maximum possible paged waste is alwaysblock_size − 1per request; that bound is the design. (This is also your first taste of theblock_sizetrade-off: small blocks → less tail waste but bigger tables and more lookups; large blocks → the reverse. vLLM defaults to 16.)
Now shrink the pool or interleave frees (see Going further) to watch external fragmentation — the subtler killer — show up in the contiguous column too.
What the tests prove
| Test | What it pins |
|---|---|
| paged admits ≥ contiguous on reserve-big-use-small | the headline claim, as an inequality your code must earn |
contiguous waste = Σ(max_len − used) | internal fragmentation is exactly over-reservation |
paged waste < block_size per request | the bounded-tail guarantee |
| rejection counting | the failure mode is admission, not a crash — capacity bugs are silent |
Hitchhiker's notes
- Why can't the contiguous allocator just reserve less? Because generation length is decided by the model, token by token (Phase 1 lab-05 — EOS is sampled, not scheduled). A smaller reservation means mid-generation OOM with KV laid out so it can't grow in place — the realloc-and-copy would stall the whole batch. Reserve-the-max was the correct answer under contiguity; the insight was to remove the contiguity requirement, not to blame the reservation. When a design's only fix within its constraints is bad, attack the constraints. That's the actual PagedAttention lesson.
- This simulation has no frees — it's a single admission wave. That's deliberately conservative: churn makes contiguous worse (holes), never better, while paged is immune to hole geometry. When your simplified model favors the baseline, your conclusion survives reviewers. Note the trick for your own benchmarking work.
- Real vLLM never frees mid-request either — blocks accrete one at a time as a request
grows (
allocate_slots, called every step the request crosses a block boundary) and are freed together at finish. The grow-by-one-block pattern is exactly what yourpaged_admit'sceilmodels at admission granularity. - The waste you measured is why
gpu_memory_utilizationworks at 0.9. With near-zero internal waste, the engine can run its KV pool nearly full without lying to itself about what fits. Under contiguous reservation, "90% utilized" would mean 90% reserved, perhaps 20% used — a dashboard fiction. Bookkeeping honesty is a prerequisite for running hot.
Going further
- Make external fragmentation bite. Extend the simulation with frees: admit, free every
other request (leaving 512-slot holes... or smaller ones with varied
max_len), then try to admit one larger request. Total free space: plenty. Largest extent: too small. Admission: fails. Paged, same workload: succeeds. That's the demo to show anyone who thinks a defragmenter could have saved contiguous allocation (on a GPU, "defragmenting" = copying gigabytes of KV mid-serving). - Sweep
block_size∈ {1, 4, 16, 64, 256} atused_len=33and plot waste vs table size (Σ needentries). You'll rediscover the page-size trade-off every OS textbook draws — and why 16 tokens is a sane middle. - Feed it a real distribution. Replace the constant
used_lenwith samples from a log-normal or from real conversation-length data (e.g. ShareGPT lengths, as used in vLLM's own benchmarks). The contiguous column gets worse — variance is its enemy: the reservation must cover the tail of the distribution while the mean pays for it.
References
- Kwon et al., PagedAttention (SOSP 2023), §2–3 — the measured 60–80% waste and the fragmentation taxonomy you just reproduced: https://arxiv.org/abs/2309.06180
- vLLM blog (June 2023) — the announcement, with the memory-waste figure this lab recreates: https://blog.vllm.ai/2023/06/20/vllm.html
- Yu et al., Orca (OSDI 2022) — the prior state of the art whose reservation strategy is
your
contiguous_admit: https://www.usenix.org/conference/osdi22/presentation/yu - Wilson et al., Dynamic Storage Allocation: A Survey and Critical Review (1995) — the classic allocator-fragmentation survey, if you want the deep end: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.47.275
mini_vllm/kv_cache.py::allocate_slots— where theceil(used/block_size)you wrote runs for real, every engine step.