Lab 17-02 — Run vLLM on CPU, and Read What the Platform Overrode [CPU-OK]

The one GPU-flavored lab in this course that genuinely needs no GPU: install vLLM's CPU backend, serve a tiny model on your laptop cores, and then read cpu.py against lab-01's interface to see exactly which decisions the platform redirected — attention backend swapped, CUDA graphs gone, KV cache carved from RAM by a different knob (VLLM_CPU_KVCACHE_SPACE instead of gpu_memory_utilization). Same engine, same scheduler, same paged KV, different silicon — Phase 1–3's machinery proving itself hardware-agnostic before your eyes.

The captured run below is from a 16-core laptop; yours will differ in tok/s and nothing else. That's the lesson.

Why this lab exists

Three reasons, ascending. Practically: CPU vLLM is a real deployment surface. CI pipelines, edge boxes, air-gapped environments, and cost-floor serving of small models all use it, and its knobs differ enough from CUDA's to merit one deliberate run. Pedagogically: it's the existence proof for lab-01's architecture. Every concept this course taught on GPUs (paged KV, continuous batching, chunked prefill) executes here unmodified, because none of them was ever a GPU concept; they were engine concepts, and the platform layer is what kept them so. Strategically: reading cpu.py teaches you the size of a backend. It's a short file, and "supporting new hardware is a short file plus kernels" is the fact that makes Phase 17's vendor-plugin world believable.

Requirements

# CPU wheels/build per the official guide (the pip default wheel is CUDA-flavored):
# https://docs.vllm.ai/en/latest/getting_started/installation/cpu.html
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct

Steps

VLLM_CPU_KVCACHE_SPACE=4 python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='Qwen/Qwen2.5-0.5B-Instruct', dtype='bfloat16', max_model_len=1024)
print(llm.generate(['The CPU backend exists because'],
                   SamplingParams(max_tokens=48, temperature=0))[0].outputs[0].text)
"

Three observations to collect: the startup log naming the platform and attention backend (compare with any GPU capture from earlier phases); the KV cache sized from VLLM_CPU_KVCACHE_SPACE (gigabytes of RAM: Phase 2 lab-03's carving with a new budget source); and your tok/s (single-digit to low double-digit; see the roofline note).
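To collect the third observation, a small timing harness is enough. The sketch below is an illustration, not part of the lab's own code: `measure_tok_s` and `fake_generate` are hypothetical names, and the stand-in generator only exists so the snippet runs without vLLM installed. A single call times prefill and decode together, so treat the number as a rough single-stream figure.

```python
import time
from typing import Callable, Sequence


def measure_tok_s(generate: Callable[[str], Sequence[int]],
                  prompt: str,
                  clock: Callable[[], float] = time.perf_counter) -> float:
    """Time one call to `generate` and return generated tokens per second.

    `generate` is any callable that takes a prompt and returns the
    generated token ids (hypothetical adapter around your engine).
    """
    start = clock()
    token_ids = generate(prompt)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return len(token_ids) / elapsed


def fake_generate(prompt: str) -> Sequence[int]:
    """Stand-in for a real engine: sleeps briefly, returns 48 dummy ids."""
    time.sleep(0.05)
    return list(range(48))


print(f"{measure_tok_s(fake_generate, 'The CPU backend exists because'):.1f} tok/s")
```

To point it at the run from Steps, one option (assuming the `llm` object and a `SamplingParams` instance named `params` from that snippet) is to pass `lambda p: llm.generate([p], params)[0].outputs[0].token_ids` as `generate`.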

Captured output (real run, Qwen2.5-0.5B, 16-core CPU, vLLM 0.22.1, trimmed)

INFO ... Using CPU platform.                       # lab-01's resolver, choosing the floor
INFO ... Using Torch SDPA backend.                 # the platform's attention answer
INFO ... CPU KV cache space: 4 GiB                 # VLLM_CPU_KVCACHE_SPACE, not gpu_mem_util
INFO ... # CPU blocks: 13,107                      # same BlockPool, RAM-backed
WARNING ... CUDA graphs are not supported ... falling back to eager
 the CPU backend exists because not every deployment has a GPU ...
# generation: ~9 tok/s single stream (16 cores, bf16)
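
The relationship between the RAM budget and the block count in the log can be sketched with plain arithmetic. Everything below is an assumption for illustration: the layer, head, and block-size values are guesses at a small grouped-query model in bf16, and the real backend may choose a different block size or layout, so this is not expected to reproduce the captured count. Your own startup log is the authority.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The leading 2 accounts for storing both K and V.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def num_kv_blocks(space_gib: int, block_size: int, bytes_per_token: int) -> int:
    """How many fixed-size blocks fit in an absolute GiB budget."""
    return (space_gib * GIB) // (block_size * bytes_per_token)


# Assumed model shape (illustrative, not read from the model's config).
per_token = kv_bytes_per_token(num_layers=24, num_kv_heads=2,
                               head_dim=64, dtype_bytes=2)

# Assumed block size of 16 tokens; 4 GiB mirrors VLLM_CPU_KVCACHE_SPACE=4.
blocks = num_kv_blocks(space_gib=4, block_size=16, bytes_per_token=per_token)

print("bytes per token:", per_token)
print("blocks:", blocks, "-> token capacity:", blocks * 16)
```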

Reading cpu.py against lab-01

Open upstream/vllm/platforms/cpu.py next to your lab-01 Platform and check off the decisions: get_attn_backend_cls → Torch SDPA (the platform is Phase 4's selector for this hardware); dtype checks (fp16 discouraged on CPU: your check_dtype negotiation, with a warning); graphs unsupported (Phase 5 short-circuits; note that the engine degrades rather than crashes, because eager mode was always a valid path); memory introspection reading system RAM. Then notice what's absent: nothing about schedulers, blocks, batching, or sampling. The platform overrides the hardware-touching edge and only the edge. That is lab-01's funnel, confirmed by reading what a real backend did not have to implement.
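
The checklist above can be pictured as a toy override. This is a minimal sketch under stated assumptions, not vLLM's actual classes: `Platform`, `CpuPlatform`, `plan_execution`, and the simplified signatures are hypothetical, and the real get_attn_backend_cls takes more arguments. The point is only the shape: the engine-side function never branches on hardware, it asks the platform.

```python
import os


class Platform:
    """Hypothetical stand-in for lab-01's Platform; not vLLM's real class."""
    name = "base"
    supports_graphs = True

    def get_attn_backend_cls(self) -> str:
        raise NotImplementedError

    def check_dtype(self, dtype: str) -> list:
        return []  # list of warning strings; empty means no complaints

    def total_memory_bytes(self) -> int:
        raise NotImplementedError


class CpuPlatform(Platform):
    name = "cpu"
    supports_graphs = False  # graphs unsupported, so the engine runs eager

    def get_attn_backend_cls(self) -> str:
        return "TORCH_SDPA"

    def check_dtype(self, dtype: str) -> list:
        if dtype == "float16":
            return ["float16 is discouraged on CPU"]
        return []

    def total_memory_bytes(self) -> int:
        # Memory introspection reads system RAM; returns 0 where unavailable.
        try:
            return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        except (AttributeError, ValueError, OSError):
            return 0


def plan_execution(platform: Platform, dtype: str) -> dict:
    """Engine-side code: hardware decisions are delegated to the platform."""
    return {
        "platform": platform.name,
        "attention": platform.get_attn_backend_cls(),
        "mode": "graphs" if platform.supports_graphs else "eager",
        "warnings": platform.check_dtype(dtype),
    }


print(plan_execution(CpuPlatform(), "bfloat16"))
```

Note what the toy leaves out, mirroring the real file: there is no scheduler, block pool, or sampler anywhere in the platform class.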

Hitchhiker's notes

  • The performance is honest, and the roofline explains it (Phase 0 lab-04 with CPU constants): ~50 GB/s of DRAM bandwidth versus roughly 2,000 GB/s on a GPU. Decode streams the weights once per token, so a 1 GB-weight model tops out near 50 tok/s at that bandwidth, and the captured ~9 tok/s sits under that ceiling. CPU serving is bandwidth-priced: same physics, smaller numbers. That is also why small models plus quantization (fewer bytes!) are disproportionately effective here.
  • Knob translation table: gpu_memory_utilization → VLLM_CPU_KVCACHE_SPACE (absolute GiB, since RAM isn't pre-carved like HBM); TP within a node → multiple NUMA-pinned CPU "devices" (VLLM_CPU_OMP_THREADS_BIND); graphs → nothing (eager always). The concepts you tuned all course exist; the spellings moved to where the hardware's truth lives.
  • CI is the killer app: vLLM's own test suite exercises engine logic on CPU runners constantly — correctness of schedulers and parsers doesn't need an A100 (this course's whole premise, which the project itself relies on).
  • From cpu.py to a vendor plugin is a difference of packaging, not kind: vllm-ascend, vllm-spyre and friends are out-of-tree cpu.py-shaped files plus kernels, registered through lab-01's entry-point mechanism. After reading one in-tree backend, you can review (or write) an out-of-tree one.
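
The roofline note can be checked with two lines of arithmetic. The sketch below assumes the simplest weight-streaming model (every weight byte is read once per decoded token, nothing else costs time) and uses the note's round constants; the function names are hypothetical. With those constants the ceiling for a 1 GB-weight model comes out near 50 tok/s on CPU, so a ~9 tok/s capture corresponds to roughly 9 GB/s of effective bandwidth.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on single-stream decode: one full weight read per token."""
    return bandwidth_gb_s / weight_gb


def effective_bandwidth_gb_s(tok_s: float, weight_gb: float) -> float:
    """Bandwidth a measured decode rate implies under the same model."""
    return tok_s * weight_gb


WEIGHT_GB = 1.0  # assumed: ~1 GB of bf16 weights, as in the note

print("CPU ceiling (50 GB/s):  ", decode_ceiling_tok_s(50, WEIGHT_GB), "tok/s")
print("GPU ceiling (2000 GB/s):", decode_ceiling_tok_s(2000, WEIGHT_GB), "tok/s")
print("Implied by 9 tok/s:     ", effective_bandwidth_gb_s(9, WEIGHT_GB), "GB/s")

# Halving the bytes (e.g. 8-bit weights) doubles the ceiling.
print("CPU ceiling, 0.5 GB:    ", decode_ceiling_tok_s(50, 0.5), "tok/s")
```

Plugging in your own measured bandwidth and weight size turns this into a quick sanity check on any tok/s claim.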

Reflect

  • List which course phases' machinery you just watched run unchanged on CPU, and which were platform-swapped. (Unchanged: 1, 2, 3, 9, 12, 16 — the engine and text layers. Swapped: 4's backend choice, 5 disabled, 7's kernels, 0/18's constants.) The ratio is the architecture's grade.
  • Why is VLLM_CPU_KVCACHE_SPACE absolute GiB while the GPU knob is a fraction? (HBM is the engine's to claim — a fraction of a dedicated resource; RAM is shared with the OS and everything else — an absolute budget is the honest contract. Knob design encodes resource ownership.)
  • A vendor pitches you "vLLM support" for their accelerator. From this phase, what three artifacts do you ask to see? (Their platform class and what it overrides; their attention backend's correctness story against Phase 4's reference shapes; benchmark constants for the Phase 0 lab-04 roofline so claims can be checked.)
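
For the first artifact in the vendor question, one way to see which platform plugins an environment actually exposes is to list the entry points directly. This is a hedged sketch: `list_platform_plugins` is a hypothetical helper, and the group name `vllm.platform_plugins` is an assumption about how out-of-tree platforms are registered, so confirm it against the vLLM version you run. The code only uses the standard library and returns an empty list when nothing is installed.

```python
from importlib.metadata import entry_points


def list_platform_plugins(group: str = "vllm.platform_plugins") -> list:
    """Return sorted (name, target) pairs registered under `group`.

    The default group name is an assumption; pass your own if it differs.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Newer importlib.metadata API.
        selected = eps.select(group=group)
    else:
        # Older API: entry_points() returns a dict keyed by group.
        selected = eps.get(group, [])
    return sorted((ep.name, ep.value) for ep in selected)


for name, target in list_platform_plugins():
    print(f"{name} -> {target}")
```

Each target names the callable a vendor package registers, which is a reasonable starting point for locating the platform class you would then read the same way as cpu.py.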

References