Phase 15 — Mini-Build: extend mini_vllm
Your task
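Model prefill/decode disaggregation in mini_vllm: run a 'prefill engine' that produces KV blocks, serialize the block table + (fake) KV, and hand it to a separate 'decode engine' that continues generation. Then prove the handoff preserves output: the disaggregated run must emit exactly the same tokens as a single engine that does both prefill and decode.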
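One possible shape for the handoff, as a minimal sketch rather than the reference solution. Everything here is a hypothetical stand-in: `ToyEngine`, `fake_kv`, `next_token`, `export_state`, `import_state`, the JSON payload layout, and the `BLOCK_SIZE` / `VOCAB_SIZE` values are assumptions for illustration, not names from mini_vllm or vLLM. The "model" is a deterministic integer function, so any lost or reordered KV entry changes the output.

```python
import json
from dataclasses import dataclass, field

BLOCK_SIZE = 4   # tokens per KV block (toy value, an assumption)
VOCAB_SIZE = 50  # toy vocabulary size (an assumption)


def fake_kv(token: int, position: int) -> int:
    """Stand-in for the K/V a real model would compute for one token."""
    return (token * 31 + position * 7) % 997


def next_token(kv: list) -> int:
    """Toy 'attention': depends on every cached entry and on their order."""
    return sum((i + 1) * v for i, v in enumerate(kv)) % VOCAB_SIZE


@dataclass
class ToyEngine:
    first_block_id: int = 0                          # start of this engine's physical id range
    blocks: dict = field(default_factory=dict)       # physical block id -> list of fake KV
    block_table: list = field(default_factory=list)  # logical order of physical block ids
    num_tokens: int = 0

    def _append(self, token: int) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:        # last block is full (or none exists yet)
            block_id = self.first_block_id + len(self.blocks)
            self.blocks[block_id] = []
            self.block_table.append(block_id)
        self.blocks[self.block_table[-1]].append(fake_kv(token, self.num_tokens))
        self.num_tokens += 1

    def cached_kv(self) -> list:
        """Flatten the cache in logical order by walking the block table."""
        return [v for block_id in self.block_table for v in self.blocks[block_id]]

    def prefill(self, prompt: list) -> int:
        for token in prompt:
            self._append(token)
        return next_token(self.cached_kv())          # first generated token

    def decode(self, last_token: int, steps: int) -> list:
        out = []
        for _ in range(steps):
            self._append(last_token)
            last_token = next_token(self.cached_kv())
            out.append(last_token)
        return out

    def export_state(self) -> str:
        """Serialize the block table plus the KV of every block it references."""
        return json.dumps({
            "block_size": BLOCK_SIZE,
            "num_tokens": self.num_tokens,
            "block_table": self.block_table,
            "blocks": {str(b): self.blocks[b] for b in self.block_table},
        })

    @classmethod
    def import_state(cls, payload: str, first_block_id: int) -> "ToyEngine":
        """Rebuild the cache in a new engine, remapping to its own block ids."""
        state = json.loads(payload)
        if state["block_size"] != BLOCK_SIZE:
            raise ValueError("block size mismatch between engines")
        engine = cls(first_block_id=first_block_id)
        for src_id in state["block_table"]:          # preserve logical order
            dst_id = engine.first_block_id + len(engine.blocks)
            engine.blocks[dst_id] = list(state["blocks"][str(src_id)])
            engine.block_table.append(dst_id)
        engine.num_tokens = state["num_tokens"]
        return engine


def run_monolithic(prompt: list, max_new_tokens: int) -> list:
    engine = ToyEngine()
    first = engine.prefill(prompt)
    return [first] + engine.decode(first, max_new_tokens - 1)


def run_disaggregated(prompt: list, max_new_tokens: int) -> list:
    prefill_engine = ToyEngine(first_block_id=0)
    first = prefill_engine.prefill(prompt)
    payload = prefill_engine.export_state()          # the only thing that crosses the boundary
    decode_engine = ToyEngine.import_state(payload, first_block_id=100)
    return [first] + decode_engine.decode(first, max_new_tokens - 1)
```

The decode engine deliberately allocates different physical block ids (starting at 100 in this sketch), so the comparison `run_disaggregated(p, n) == run_monolithic(p, n)` only holds if the logical order and contents of the blocks survive the transfer, not merely the ids.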
Why build it (and not just read it)
Reading the real kernel/feature tells you what production does. Re-implementing a tiny version tells you why every decision was made — which is the understanding that survives into an interview or a 2 a.m. incident. Keep it small; keep it tested.
Method
- Look at the matching real code from 01-deep-dive.md.
- Add your module under mini_vllm/ (or extend an existing one).
- Write a test_*.py next to it that pins the behavior you care about.
- Run pytest mini_vllm -q and keep it green.
Definition of done
- Your component runs on CPU with no extra dependencies (numpy ok).
- A test demonstrates the property this phase is about (not just "it runs").
- You can explain, out loud, how your toy maps to the real implementation and where it intentionally simplifies.
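To illustrate the difference between a property test and an "it runs" test, here is a small self-contained sketch. The toy generator and both handoff functions (`generate`, `good_handoff`, `lossy_handoff`) are hypothetical and exist only for this example. The second test checks that the comparison is sensitive enough to fail when the handoff silently drops data.

```python
import json


def toy_next_token(kv: list) -> int:
    """Order-sensitive toy model: any missing or moved entry changes the result."""
    return sum((i + 1) * v for i, v in enumerate(kv)) % 50


def generate(kv: list, steps: int) -> list:
    kv = list(kv)                          # do not mutate the caller's cache
    out = []
    for _ in range(steps):
        token = toy_next_token(kv)
        out.append(token)
        kv.append(token * 31 + len(kv))    # fake KV entry for the new token
    return out


def good_handoff(kv: list) -> list:
    """Serialize and deserialize the whole cache."""
    return json.loads(json.dumps(kv))


def lossy_handoff(kv: list) -> list:
    """Deliberately buggy: drops the last KV entry during transfer."""
    return json.loads(json.dumps(kv[:-1]))


def test_handoff_preserves_output():
    kv = [5, 17, 42, 8, 23]
    assert generate(good_handoff(kv), 6) == generate(kv, 6)


def test_comparison_catches_a_lossy_handoff():
    kv = [5, 17, 42, 8, 23]
    assert generate(lossy_handoff(kv), 6) != generate(kv, 6)
```

A smoke test that only asserts "decode returned some tokens" would pass for both handoff functions; comparing against the single-engine output is what pins the property.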
The flagship phases ship complete mini_vllm modules + tests (mini_vllm/block_pool.py, mini_vllm/scheduler.py). Use them as your reference for structure and test style.