Lab 17-02 — Run vLLM on CPU, and Read What the Platform Overrode [CPU-OK]
The one GPU-flavored lab in this course that genuinely needs no GPU: install
vLLM's CPU backend, serve a tiny model on your laptop cores, and then read cpu.py
against lab-01's interface to see exactly which decisions the platform redirected —
attention backend swapped, CUDA graphs gone, KV cache carved from RAM by a different
knob (VLLM_CPU_KVCACHE_SPACE instead of gpu_memory_utilization). Same engine,
same scheduler, same paged KV, different silicon — Phase 1–3's machinery proving
itself hardware-agnostic before your eyes.
The captured run below is from a 16-core laptop; yours will differ in tok/s and nothing else. That's the lesson.
Contents
- Why this lab exists
- Requirements
- Steps
- Captured output (real run, Qwen2.5-0.5B, 16-core CPU, vLLM 0.22.1, trimmed)
- Reading cpu.py against lab-01
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
Three reasons, ascending. Practically: CPU vLLM is real deployment surface — CI
pipelines, edge boxes, air-gapped environments, and cost-floor serving of small
models all use it, and its knobs differ enough from CUDA's to merit one deliberate
run. Pedagogically: it's the existence proof for lab-01's architecture. Every
concept this course taught on GPUs (paged KV, continuous batching, chunked
prefill) executes here unmodified, because none of them was ever a GPU concept;
they were engine concepts, and the platform layer is what kept them so.
Strategically: reading cpu.py teaches you the size of a backend — it's a short
file, and "supporting new hardware is a short file plus kernels" is the fact that
makes Phase 17's vendor-plugin world believable.
Requirements
# CPU wheels/build per the official guide (the pip default wheel is CUDA-flavored):
# https://docs.vllm.ai/en/latest/getting_started/installation/cpu.html
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
Steps
VLLM_CPU_KVCACHE_SPACE=4 python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='Qwen/Qwen2.5-0.5B-Instruct', dtype='bfloat16', max_model_len=1024)
print(llm.generate(['The CPU backend exists because'],
                   SamplingParams(max_tokens=48, temperature=0))[0].outputs[0].text)
"
Three observations to collect: the startup log naming the platform and attention
backend (compare with any GPU capture from earlier phases); the KV cache sized from
VLLM_CPU_KVCACHE_SPACE (gigabytes of RAM — Phase 2 lab-03's carving with a new
budget source); and your tok/s (single digits to low double digits; see the roofline note).
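To make the second observation concrete, here is a minimal sketch of the arithmetic that turns a GiB budget into a block count. The model shape (24 layers, 2 KV heads, head dimension 64) is an assumption taken from Qwen2.5-0.5B's published config, and `KVShape` and `num_blocks` are hypothetical helper names, not vLLM APIs. The block size and any per-block overhead vary by release, so the result need not match the count in any particular startup log.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class KVShape:
    """Per-model numbers that determine KV cache size (hypothetical helper)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int  # 2 for bf16

    def bytes_per_token(self) -> int:
        # One K vector and one V vector per layer, per KV head.
        return 2 * self.num_layers * self.num_kv_heads * self.head_dim * self.dtype_bytes


def num_blocks(kv_space_gib: int, shape: KVShape, block_size: int) -> int:
    """How many fixed-size blocks fit in the budget (floor division)."""
    bytes_per_block = shape.bytes_per_token() * block_size
    return (kv_space_gib * GIB) // bytes_per_block


# Assumed shape for Qwen2.5-0.5B in bf16; check the model's config.json.
qwen_small = KVShape(num_layers=24, num_kv_heads=2, head_dim=64, dtype_bytes=2)

if __name__ == "__main__":
    # Block sizes here are illustrative, not the backend's confirmed default.
    for block_size in (16, 32, 128):
        print(block_size, num_blocks(4, qwen_small, block_size))
```

The point of the sketch is the shape of the calculation: the budget comes from an environment variable instead of a fraction of device memory, and everything after that is the same division Phase 2 lab-03 performed.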
Captured output (real run, Qwen2.5-0.5B, 16-core CPU, vLLM 0.22.1, trimmed)
INFO ... Using CPU platform. # lab-01's resolver, choosing the floor
INFO ... Using Torch SDPA backend. # the platform's attention answer
INFO ... CPU KV cache space: 4 GiB # VLLM_CPU_KVCACHE_SPACE, not gpu_mem_util
INFO ... # CPU blocks: 13,107 # same BlockPool, RAM-backed
WARNING ... CUDA graphs are not supported ... falling back to eager
the CPU backend exists because not every deployment has a GPU ...
# generation: ~9 tok/s single stream (16 cores, bf16)
Reading cpu.py against lab-01
Open upstream/vllm/platforms/cpu.py next to your lab-01 Platform and check off
the decisions: get_attn_backend_cls → Torch SDPA (the platform is Phase 4's
selector for this hardware); dtype checks (fp16 discouraged on CPU — your
check_dtype negotiation, with a warning); graphs unsupported (Phase 5 short-
circuits — note the engine degrades, not crashes: eager mode was always a valid
path); memory introspection reading system RAM. Then notice what's absent: nothing
about schedulers, blocks, batching, or sampling. The platform overrides the
hardware-touching edge and only the edge — lab-01's funnel, confirmed by reading
what a real backend did not have to implement.
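The checklist above can be pictured as a small class. The following is a hedged sketch, not vLLM's code: `Platform`, `CpuPlatformSketch`, `supports_graphs`, `total_memory_bytes`, and `execution_mode` are hypothetical names standing in for lab-01's interface, and the dtype policy (warn, then keep the request) is an assumption about what "negotiation with a warning" might look like. Only `get_attn_backend_cls` and `check_dtype` are names taken from the text.

```python
import os
import warnings
from abc import ABC, abstractmethod


class Platform(ABC):
    """Hypothetical stand-in for lab-01's Platform; not vLLM's real base class."""

    @abstractmethod
    def get_attn_backend_cls(self) -> str:
        """Name the attention backend for this hardware."""

    @abstractmethod
    def check_dtype(self, dtype: str) -> str:
        """Accept, warn about, or adjust the requested dtype."""

    @abstractmethod
    def supports_graphs(self) -> bool:
        """Whether graph capture is available."""

    @abstractmethod
    def total_memory_bytes(self) -> int:
        """Total memory the device can offer."""


class CpuPlatformSketch(Platform):
    def get_attn_backend_cls(self) -> str:
        return "TORCH_SDPA"  # illustrative label, not an import path

    def check_dtype(self, dtype: str) -> str:
        if dtype == "float16":
            # Assumption: warn and keep the request; a real backend may do more.
            warnings.warn("float16 is discouraged on CPU; consider bfloat16")
        return dtype

    def supports_graphs(self) -> bool:
        return False

    def total_memory_bytes(self) -> int:
        # System RAM via POSIX sysconf (Linux/macOS; not available on Windows).
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def execution_mode(platform: Platform) -> str:
    """Engine-side choice: degrade to eager instead of failing."""
    return "graphs" if platform.supports_graphs() else "eager"


cpu = CpuPlatformSketch()
print(cpu.get_attn_backend_cls(), execution_mode(cpu), cpu.check_dtype("bfloat16"))
```

Note what the sketch has no methods for: scheduling, block allocation, batching, sampling. That absence is the same one you should find when reading the real file.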
Hitchhiker's notes
- The performance is honest, and the roofline explains it (Phase 0 lab-04 with CPU constants): ~50 GB/s of DRAM bandwidth vs a GPU's ~2,000 GB/s. Decode streams every weight once per token, so a 1 GB-weight model is capped near 50 GB/s ÷ 1 GB ≈ 50 tok/s; the capture's ~9 tok/s sits under that ceiling, where a bound says it must. CPU serving is bandwidth-priced: same physics, smaller numbers. That is also why small models + quantization (fewer bytes!) are disproportionately effective here.
- Knob translation table:
  - gpu_memory_utilization → VLLM_CPU_KVCACHE_SPACE (absolute GiB; RAM isn't pre-carved like HBM)
  - TP within a node → multiple NUMA-pinned CPU "devices" (VLLM_CPU_OMP_THREADS_BIND)
  - graphs → nothing (eager always)
  The concepts you tuned all course exist; the spellings moved to where the hardware's truth lives.
- CI is the killer app: vLLM's own test suite exercises engine logic on CPU runners constantly. Correctness of schedulers and parsers doesn't need an A100 (this course's whole premise, which the project itself relies on).
- From cpu.py to a vendor plugin is a difference of packaging, not kind: vllm-ascend, vllm-spyre and friends are out-of-tree cpu.py-shaped files plus kernels, registered through lab-01's entry-point mechanism. After reading one in-tree backend, you can review (or write) an out-of-tree one.
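The roofline note can be checked with a few lines of arithmetic. This is a minimal sketch of a bandwidth-only ceiling, assuming decode reads every weight byte once per token and ignoring KV cache traffic and compute; `decode_ceiling_tok_s` is a hypothetical helper, and the 0.25 GB figure assumes 4-bit weights for the same model (a quarter of the bf16 bytes).

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on single-stream decode rate when weight streaming dominates."""
    return bandwidth_gb_s / weight_gb


# Constants from the note above: ~50 GB/s DRAM, ~2,000 GB/s GPU memory, ~1 GB of weights.
cpu_ceiling = decode_ceiling_tok_s(50, 1.0)
gpu_ceiling = decode_ceiling_tok_s(2000, 1.0)

# Assumption: 4-bit weights shrink the same model to roughly 0.25 GB.
cpu_ceiling_int4 = decode_ceiling_tok_s(50, 0.25)

# Fraction of the CPU ceiling reached by the captured ~9 tok/s run.
achieved_fraction = 9 / cpu_ceiling

print(f"CPU ceiling:        {cpu_ceiling:.0f} tok/s")
print(f"GPU ceiling:        {gpu_ceiling:.0f} tok/s")
print(f"CPU ceiling (int4): {cpu_ceiling_int4:.0f} tok/s")
print(f"capture vs ceiling: {achieved_fraction:.0%}")
```

A ceiling is an upper bound, so a measured rate at or below it is consistent with the model; plug in your own machine's bandwidth and your measured tok/s to see how much headroom remains.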
Reflect
- List which course phases' machinery you just watched run unchanged on CPU, and which were platform-swapped. (Unchanged: 1, 2, 3, 9, 12, 16 — the engine and text layers. Swapped: 4's backend choice, 5 disabled, 7's kernels, 0/18's constants.) The ratio is the architecture's grade.
- Why is VLLM_CPU_KVCACHE_SPACE absolute GiB while the GPU knob is a fraction? (HBM is the engine's to claim: a fraction of a dedicated resource. RAM is shared with the OS and everything else, so an absolute budget is the honest contract. Knob design encodes resource ownership.)
- A vendor pitches you "vLLM support" for their accelerator. From this phase, what three artifacts do you ask to see? (Their platform class and what it overrides; their attention backend's correctness story against Phase 4's reference shapes; benchmark constants for the Phase 0 lab-04 roofline so claims can be checked.)
References
- upstream/vllm/platforms/cpu.py: the backend under read.
- vLLM docs, CPU installation: https://docs.vllm.ai/en/latest/getting_started/installation/cpu.html
- vLLM docs, Plugin System — the out-of-tree path this is the in-tree template for: https://docs.vllm.ai/en/latest/design/plugin_system.html
- Lab-01 — the interface this file implements; Phase 0 lab-04 — the physics that prices the capture's 9 tok/s.