Phase 17 — Hardware Backends & Plugins

Don't Panic
Why this phase matters
What you'll learn
The map: where this lives in the real code
Labs in this phase
How to work this phase
Where you are

Don't Panic

vLLM runs on NVIDIA and AMD GPUs, CPUs, TPUs, Gaudi, and more. It does this by hiding every hardware difference behind a Platform abstraction and a plugin system, so the engine code stays hardware-agnostic and new accelerators arrive as plugins. This phase covers that abstraction, and you'll run the CPU backend with no GPU at all.

Why this phase matters

Hardware breadth is a strategic advantage (it eases GPU supply constraints and allows cost arbitrage), and the Platform abstraction is a clean piece of architecture worth studying. Knowing where the seams are lets you reason about porting and about why a feature is available on one backend but not another.

What you'll learn

The Platform abstraction: device type, attention backend default, capabilities
How the engine queries the platform instead of hardcoding CUDA
The out-of-tree plugin system (entry points) for new hardware
CPU backend: what changes (no paging kernels? threading? dtype support?)
Why some features are platform-gated (FP8, CUDA graphs, certain kernels)
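
To make the first, second, and last bullets concrete, here is a minimal sketch of an engine asking a platform object for its facts instead of testing for CUDA. Everything in it is hypothetical and chosen for illustration: `ToyPlatform`, `Feature`, `plan_engine`, the backend names, and the feature sets are not the upstream API, and the real `Platform` class in vllm/platforms/interface.py is considerably richer.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Feature(Enum):
    """Optional capabilities a backend may or may not provide (illustrative)."""
    FP8 = auto()
    CUDA_GRAPHS = auto()


@dataclass(frozen=True)
class ToyPlatform:
    """Hypothetical stand-in for a Platform: the facts the engine may ask for."""
    device_type: str
    default_attention_backend: str
    features: frozenset = frozenset()

    def supports(self, feature: Feature) -> bool:
        return feature in self.features


# Backend names and feature sets below are made up for this sketch.
TOY_CUDA = ToyPlatform(
    "cuda", "toy-flash-attn", frozenset({Feature.FP8, Feature.CUDA_GRAPHS})
)
TOY_CPU = ToyPlatform("cpu", "toy-sdpa")


def plan_engine(platform: ToyPlatform, want_fp8: bool) -> dict:
    """Engine-side decisions: query the platform, never compare against 'cuda'."""
    if want_fp8 and not platform.supports(Feature.FP8):
        # Platform gating: fail early with a clear reason.
        raise ValueError(
            f"FP8 is not available on device type {platform.device_type!r}"
        )
    return {
        "device": platform.device_type,
        "attention_backend": platform.default_attention_backend,
        "use_graph_capture": platform.supports(Feature.CUDA_GRAPHS),
        "kv_cache_dtype": "fp8" if want_fp8 else "auto",
    }
```

Calling `plan_engine(TOY_CPU, want_fp8=False)` yields a CPU plan with graph capture off, while `plan_engine(TOY_CPU, want_fp8=True)` raises. The engine function contains no device-specific branch, which is the property the abstraction is meant to protect.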

The map: where this lives in the real code

Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md) walks through the important ones line by line.

vllm/platforms/interface.py — The Platform base class — the contract every backend implements.
vllm/platforms/cuda.py — The NVIDIA platform.
vllm/platforms/cpu.py — The CPU platform — read this; you can run it on a laptop.
vllm/platforms/__init__.py — Platform detection/resolution + plugin discovery.
vllm/plugins/ — The plugin loading mechanism.

Labs in this phase

lab-01-platform-abstraction [CPU-OK] — build the Platform interface, registry, resolver (CPU floor + loud override), then register an out-of-tree platform and change the engine's decisions with zero core edits — plus the duplicate-registration supply-chain guard.
lab-02-run-cpu-vllm [CPU-OK] — run vLLM on laptop cores and read cpu.py against lab-01's interface: every override checked off, the Phase 1–3 engine untouched. Captured output included.

See labs/README.md for how to run them.
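
As a preview of lab-01, the sketch below shows one possible shape for a registry with a duplicate-registration guard and a resolver with a CPU floor. It is an interpretation, not the lab solution: the function names, the `toy_npu` backend, and the reading of "loud override" as "emit a warning and fail on unknown names" are all assumptions.

```python
import warnings

_REGISTRY = {}  # name -> (factory, is_available); hypothetical layout


def register_platform(name, factory, is_available):
    """Register a backend; refuse to let a later package shadow an existing name."""
    if name in _REGISTRY:
        # Supply-chain guard: a second registration never silently wins.
        raise ValueError(f"platform {name!r} is already registered")
    _REGISTRY[name] = (factory, is_available)


def resolve_platform(override=None):
    """Pick a platform: explicit override first, then detection, then CPU."""
    if override is not None:
        if override not in _REGISTRY:
            raise LookupError(f"unknown platform override {override!r}")
        # Loud override: the user should see that detection was bypassed.
        warnings.warn(f"platform forced to {override!r}; detection skipped")
        return _REGISTRY[override][0]()
    for name, (factory, is_available) in _REGISTRY.items():
        if name != "cpu" and is_available():
            return factory()
    # The floor: CPU is always registered and always runnable.
    return _REGISTRY["cpu"][0]()


# Strings stand in for real platform objects to keep the sketch short.
register_platform("cpu", lambda: "cpu-platform", lambda: True)
register_platform("toy_npu", lambda: "toy-npu-platform", lambda: False)
```

With no accelerator available, `resolve_platform()` falls through to the CPU entry, and `resolve_platform("toy_npu")` returns the forced backend while emitting a warning. Registering `"cpu"` a second time raises rather than replacing the existing entry.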

How to work this phase

Read this guide for intuition.
Read 01-deep-dive.md with the upstream/ files open.
Do 02-mini-build.md — build the mini_vllm piece yourself.
Run the labs, then attempt EXERCISES.md.
Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.

Where you are

This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully worked, line-by-line treatment (with starter/solution/test code in every lab) follows the gold standard set by the flagship phases: Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.