Lab 02-02 — Measure Fragmentation: Contiguous vs Paged `[CPU-OK]`

"Paging saves memory" is a slogan. This lab turns it into a number — one you produce, on your laptop, in milliseconds. You'll simulate the pre-vLLM allocation strategy and the paged strategy on the same stream of requests and measure exactly how many requests each admits and how many memory slots each wastes. The ratio you compute is, in miniature, the entire empirical case for PagedAttention — the 2–4× from the SOSP paper, re-derived by you.

Why this lab exists
Background: the two kinds of waste
The experiment
Files
Run
What you should see — with the arithmetic
What the tests prove
Hitchhiker's notes
Going further
References

Why this lab exists

A staff engineer's job is frequently to re-derive someone's headline claim from first principles before betting an architecture on it. Papers cherry-pick; blog posts round up; your workload is never quite theirs. The skill this lab drills is building the smallest simulation that captures a memory-allocation phenomenon — no GPU, no model, no engine, just the allocation math — and using it to interrogate a claim. You'll use this move constantly: sizing KV for a new deployment, evaluating "should we raise block_size?", estimating what a longer max_model_len costs before anyone provisions hardware.

It also makes Phase 2's core trade quantitative. After this lab you won't say "contiguous allocation wastes memory"; you'll say "on a reserve-512-use-32 workload it wastes 94% and admits 8 requests where paging admits 120 — and here's the four-line model that says so."

Background: the two kinds of waste

Memory allocators lose memory two ways, and the distinction drives everything:

Internal fragmentation — waste inside an allocation: you reserved more than you used. The contiguous strategy reserves max_len per request because generation length is unknowable in advance; a request that stops after 32 tokens of a 512 reservation strands 480 slots. Note this waste is invisible to the allocator — the slots are "allocated," the dashboard says the memory is in use, and yet it holds nothing. The PagedAttention paper measured real systems (Orca-style reservation) at 60–80% waste this way.
External fragmentation — waste between allocations: enough total free slots exist, but no single contiguous run is big enough, so an admission fails anyway. This one appears only after churn (allocate/free cycles punch holes), which is why naive benchmarks — and naive simulations — miss it.

Paging attacks both at once: nothing is reserved beyond the current need (internal waste collapses to a partial tail block, < block_size per request), and no run needs to be contiguous (external waste becomes structurally impossible — every free block is exactly the right size). The price: an indirection table and a kernel that can follow it (lab-04/06). That trade — a pointer per page for near-zero waste — is the same one OS designers accepted in the 1960s, and for the same reason.

The experiment

A stream of requests arrives, each actually using some number of KV slots (tokens) out of a worst-case max_len. You have a fixed pool of total_slots. Two allocators:

contiguous_admit (the old way): each request reserves max_len contiguous slots, first-fit. Reject if no extent is big enough. Waste = max_len − used per admitted request.
paged_admit (vLLM's way): each request takes ceil(used / block_size) blocks from anywhere. Reject if not enough blocks remain. Waste = the partial tail block: need·block_size − used.

Same arrivals, same pool. Count admissions, rejections, wasted slots.

Files

starter.py — implement contiguous_admit() and paged_admit() (TODOs). Your work.
solution.py — reference, plus a report() you can run directly: python phase-02-paged-attention/labs/lab-02-fragmentation-viz/solution.py.
test_lab.py — asserts paged admits more and wastes less on a reserve-big-use-small workload, and pins the exact waste arithmetic of each strategy.

Run

LAB_IMPL=starter pytest phase-02-paged-attention/labs/lab-02-fragmentation-viz -q
pytest phase-02-paged-attention/labs/lab-02-fragmentation-viz -q     # reference (default)

What you should see — with the arithmetic

Default report() parameters: total_slots=4096, max_len=512, used_len=32, block_size=16, 64 arrivals.

contiguous: admitted=8  rejected=56 wasted=3840
paged:      admitted=64 rejected=0  wasted=0

Walk the numbers; each is checkable in your head, which is the point of a model this small:

Contiguous admits 8 = ⌊4096 / 512⌋. The pool is "full" after eight 512-slot reservations — even though those eight requests are using only 8 × 32 = 256 slots, i.e. 6% of the pool is doing work while 100% is reserved.
wasted=3840 = 8 × (512 − 32). Internal fragmentation, precisely.
Paged admits all 64 — they need 64 × ⌈32/16⌉ = 128 blocks = 2048 slots of the 4096 available. The pool isn't even half full. 15× the admissions on identical memory — and admissions ≈ concurrent users ≈ throughput, which is why a memory-bookkeeping change produced vLLM's throughput headline.
Paged wasted=0 is an artifact worth noticing: 32 divides evenly by 16, so the tail block is full. Change used_len=33 and waste jumps to 64 × 15 = 960 — each request's last block holds 1 token and strands 15 slots. Maximum possible paged waste is always block_size − 1 per request; that bound is the design. (This is also your first taste of the block_size trade-off: small blocks → less tail waste but bigger tables and more lookups; large blocks → the reverse. vLLM defaults to 16.)

Now shrink the pool or interleave frees (see Going further) to watch external fragmentation — the subtler killer — show up in the contiguous column too.

What the tests prove

Test	What it pins
paged admits ≥ contiguous on reserve-big-use-small	the headline claim, as an inequality your code must earn
contiguous waste = `Σ(max_len − used)`	internal fragmentation is exactly over-reservation
paged waste < `block_size` per request	the bounded-tail guarantee
rejection counting	the failure mode is admission, not a crash — capacity bugs are silent

Hitchhiker's notes

Why can't the contiguous allocator just reserve less? Because generation length is decided by the model, token by token (Phase 1 lab-05 — EOS is sampled, not scheduled). A smaller reservation means mid-generation OOM with KV laid out so it can't grow in place — the realloc-and-copy would stall the whole batch. Reserve-the-max was the correct answer under contiguity; the insight was to remove the contiguity requirement, not to blame the reservation. When a design's only fix within its constraints is bad, attack the constraints. That's the actual PagedAttention lesson.
This simulation has no frees — it's a single admission wave. That's deliberately conservative: churn makes contiguous worse (holes), never better, while paged is immune to hole geometry. When your simplified model favors the baseline, your conclusion survives reviewers. Note the trick for your own benchmarking work.
Real vLLM never frees mid-request either — blocks accrete one at a time as a request grows (allocate_slots, called every step the request crosses a block boundary) and are freed together at finish. The grow-by-one-block pattern is exactly what your paged_admit's ceil models at admission granularity.
The waste you measured is why gpu_memory_utilization works at 0.9. With near-zero internal waste, the engine can run its KV pool nearly full without lying to itself about what fits. Under contiguous reservation, "90% utilized" would mean 90% reserved, perhaps 20% used — a dashboard fiction. Bookkeeping honesty is a prerequisite for running hot.

Going further

Make external fragmentation bite. Extend the simulation with frees: admit, free every other request (leaving 512-slot holes... or smaller ones with varied max_len), then try to admit one larger request. Total free space: plenty. Largest extent: too small. Admission: fails. Paged, same workload: succeeds. That's the demo to show anyone who thinks a defragmenter could have saved contiguous allocation (on a GPU, "defragmenting" = copying gigabytes of KV mid-serving).
Sweep block_size ∈ {1, 4, 16, 64, 256} at used_len=33 and plot waste vs table size (Σ need entries). You'll rediscover the page-size trade-off every OS textbook draws — and why 16 tokens is a sane middle.
Feed it a real distribution. Replace the constant used_len with samples from a log-normal or from real conversation-length data (e.g. ShareGPT lengths, as used in vLLM's own benchmarks). The contiguous column gets worse — variance is its enemy: the reservation must cover the tail of the distribution while the mean pays for it.

References

Kwon et al., PagedAttention (SOSP 2023), §2–3 — the measured 60–80% waste and the fragmentation taxonomy you just reproduced: https://arxiv.org/abs/2309.06180
vLLM blog (June 2023) — the announcement, with the memory-waste figure this lab recreates: https://blog.vllm.ai/2023/06/20/vllm.html
Yu et al., Orca (OSDI 2022) — the prior state of the art whose reservation strategy is your contiguous_admit: https://www.usenix.org/conference/osdi22/presentation/yu
Wilson et al., Dynamic Storage Allocation: A Survey and Critical Review (1995) — the classic allocator-fragmentation survey, if you want the deep end: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.47.275
mini_vllm/kv_cache.py::allocate_slots — where the ceil(used/block_size) you wrote runs for real, every engine step.

vLLM Mastery — From Zero to Maintainer