Lab 17-01 — The Platform Abstraction: One Engine, Any Silicon `[CPU-OK]`

vLLM runs on NVIDIA, AMD, Intel GPUs, TPUs, Gaudi, and plain CPUs — and the reason it can is one interface and one registry: every hardware-specific decision (which attention backend? which dtypes? are CUDA graphs a thing here?) is asked of a Platform object, and platforms register into a table that out-of-tree plugins can join without touching a line of core code. You'll build the whole mechanism small — the interface, two in-tree platforms, the resolver with its override and its CPU floor — and then the test that is the architecture: register a third platform from "outside" and watch the engine's decisions change, core untouched. Plus the security posture detail most plugin systems forget: duplicate registration is refused, because a plugin silently shadowing the CUDA platform is a supply-chain incident wearing a convenience feature.

Why this lab exists
Background: the decisions that funnel through
Files
Run
What the tests prove
Hitchhiker's notes
Going further
References

Why this lab exists

The platform layer is how vLLM scaled organizationally, not just technically: hardware vendors (AMD, Intel, Google, Huawei, IBM) maintain their own backends — some in-tree, some as plugin packages — without serializing through the core team. That only works because the interface is explicit and the extension point is a registry, and the lab's plugin test demonstrates the payoff in its purest form: new silicon support is additive. If you ever bring vLLM to new hardware (a real career path — ask the Spyre and Ascend teams), this lab is the map of what you'll implement; if you review plugin PRs, it's the map of what to check.

The design pattern is also the course's registry trilogy completed: attention backends (Phase 4's selector), model architectures (Phase 14's registry), and now platforms — three tables, one philosophy: core code asks "who handles this?" instead of knowing. Each table is also a place where Phase 4 lab-02's bisection move works (override exists at every layer for exactly that reason).

Background: the decisions that funnel through

The real Platform interface (upstream/vllm/platforms/interface.py) answers, per hardware: which attention backend class to use (this is literally where Phase 4's selector gets its platform default); which dtypes are supported (your check_dtype is the negotiation: bf16 everywhere, fp16 not on CPU, fp8 only on Hopper-class hardware and newer); device introspection (memory totals, because Phase 2 lab-03's carving needs to ask someone); graph capture support (Phase 5 is a no-op on CPU); and communicator choices (Phase 10's collectives differ per fabric). Resolution happens once at import/startup: honor the override if one is set; otherwise detect devices, consult the registry, and fall back to CPU, the platform that always exists. That floor is what makes "no accelerator detected" a slow day instead of a crash.
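
The resolution order described above can be sketched in a few lines. This is a minimal illustration, not the lab's reference solution and not upstream code: the field names, the `torch_sdpa` CPU backend string, the exact dtype sets, and the function signatures are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class Platform:
    """Hypothetical, heavily reduced stand-in for the real interface."""
    name: str
    attention_backend: str
    supported_dtypes: FrozenSet[str]
    supports_graphs: bool

    def check_dtype(self, requested: str) -> str:
        # Negotiate: grant the request if supported, otherwise degrade to float32.
        return requested if requested in self.supported_dtypes else "float32"


def make_default_platforms() -> Dict[str, Platform]:
    # Two in-tree platforms; the dtype sets are illustrative assumptions.
    return {
        "cuda": Platform("cuda", "flash_attn",
                         frozenset({"float32", "bfloat16", "float16"}), True),
        "cpu": Platform("cpu", "torch_sdpa",
                        frozenset({"float32", "bfloat16"}), False),
    }


def resolve_platform(registry: Dict[str, Platform],
                     detected: Sequence[str],
                     override: Optional[str] = None) -> Platform:
    # 1. An explicit override wins, and an unknown name fails loudly.
    if override is not None:
        if override not in registry:
            raise ValueError(
                f"unknown platform override {override!r}; known: {sorted(registry)}")
        return registry[override]
    # 2. Otherwise the first detected accelerator with a registered platform wins.
    for device in detected:
        if device != "cpu" and device in registry:
            return registry[device]
    # 3. The floor: CPU always exists.
    return registry["cpu"]


registry = make_default_platforms()
print(resolve_platform(registry, ["cuda"]).attention_backend)  # flash_attn
print(resolve_platform(registry, []).name)                     # cpu
```

Note that the decisions travel together: whoever receives the `cuda` platform gets its backend, dtype set, and graph support as one object, rather than asking three separate questions.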

Plugins join via Python entry points: installing vllm-ascend registers its platform at import time — your register_platform, with packaging around it. The refuse-duplicates rule is the trust boundary: in-tree names are spoken for.
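
As a rough sketch of that trust boundary, the snippet below registers an out-of-tree platform and then shows a shadowing attempt being refused. The `Platform` tuple, the `vendor_plugin` function, and the `evil_attn` and `torch_sdpa` strings are hypothetical names for illustration; the lab's actual `register_platform` signature may differ.

```python
from typing import Dict, NamedTuple


class Platform(NamedTuple):
    """Hypothetical two-field platform, enough to show registration."""
    name: str
    attention_backend: str


def register_platform(registry: Dict[str, Platform], platform: Platform) -> None:
    # Refuse duplicates: an existing name can never be silently replaced.
    if platform.name in registry:
        raise ValueError(
            f"platform {platform.name!r} is already registered; refusing to shadow it")
    registry[platform.name] = platform


def vendor_plugin(registry: Dict[str, Platform]) -> None:
    # Stand-in for the function a packaged plugin's entry point would expose.
    register_platform(registry, Platform("mytpu", "pallas"))


# "Core" state: the in-tree platforms.
registry = {
    "cuda": Platform("cuda", "flash_attn"),
    "cpu": Platform("cpu", "torch_sdpa"),
}

# Additive extension: no core code changes, only a new table row.
vendor_plugin(registry)

# A shadowing attempt is rejected and leaves the original entry intact.
refused = None
try:
    register_platform(registry, Platform("cuda", "evil_attn"))
except ValueError as exc:
    refused = str(exc)

print(sorted(registry))  # ['cpu', 'cuda', 'mytpu']
print(refused)
```

Raising instead of overwriting is the whole guard: the failure is visible at load time, where it can be investigated, rather than at inference time, where it cannot.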

Files

starter.py — Platform.check_dtype, register_platform, resolve_platform, make_default_platforms. Your work.
solution.py — reference.
test_lab.py — accelerator preference, the CPU floor, override + loud unknowns, dtype negotiation, the out-of-tree plugin, and the duplicate refusal.

Run

LAB_IMPL=starter pytest phase-17-hardware-backends-and-plugins/labs/lab-01-platform-abstraction -q
pytest phase-17-hardware-backends-and-plugins/labs/lab-01-platform-abstraction -q   # reference

What the tests prove

| Test | What it pins |
|---|---|
| `test_resolution_prefers_the_accelerator` | Detection order: the GPU wins when present, and with it come flash_attn and graphs; the decisions travel as a bundle |
| `test_cpu_is_the_floor` | Empty device list still resolves: vLLM always has somewhere to run, which is why lab-02 works at all |
| `test_override_wins_and_unknown_is_loud` | The bisection hook (Phase 4 lab-02's reflex, platform edition), and typos fail fast instead of silently falling back |
| `test_dtype_negotiation` | Unsupported dtype → float32, never a crash mid-load: capability mismatches are negotiated at the boundary |
| `test_out_of_tree_plugin_changes_decisions_without_core_edits` | The architecture: a "vendor" registers `mytpu`, resolution returns it, the attention backend is now `pallas`, and the diff to core is zero lines |
| `test_duplicate_registration_is_refused` | A plugin cannot shadow `cpu` or `cuda`: the supply-chain guard, as an assert |

Hitchhiker's notes

Find your functions upstream: upstream/vllm/platforms/interface.py (Platform, with ~30 methods where you wrote 1 — same skeleton), upstream/vllm/platforms/__init__.py (detection + resolution + plugin loading — your resolve_platform with the entry-point scan), and any of cuda.py / cpu.py / rocm.py / tpu.py as the in-tree implementations. Read cpu.py with lab-02 — its overrides are exactly the decision list above.
The plugin mechanism is general: vLLM's plugin system (upstream/vllm/plugins/) loads any registered entry point at startup — platforms, but also out-of-tree models (Phase 14's registry accepts plugins the same way) and custom components. One loading mechanism, many tables — when you see VLLM_PLUGINS in an environment, this is what it gates.
Why funnel rather than if torch.cuda.is_available() sprinkled everywhere? Because the sprinkled version is what most codebases have, and it makes new hardware a grep-and-pray refactor across hundreds of sites. The funnel makes it one class. The lab's plugin test is unwritable against sprinkled conditionals — which is the test-as-architecture-proof point again (Phase 14's tripwire, in registry form).
Capability negotiation beats capability assumption: check_dtype's fall-to-float32 is a microcosm of how the whole layer behaves — requests for the unsupported degrade explicitly (with a warning upstream) rather than crashing or, worse, silently miscomputing. Every backend boundary in your own systems deserves the same negotiation shape.
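
The negotiation shape from the last note, with the warning made explicit, might look like the following. It is a sketch under stated assumptions: the standalone `check_dtype` signature, the `CPU_DTYPES` set, and the choice of `RuntimeWarning` are illustrative, and the snippet does not reproduce how upstream emits its warning.

```python
import warnings
from typing import FrozenSet

FALLBACK_DTYPE = "float32"  # the lab's negotiated fallback


def check_dtype(platform_name: str, supported: FrozenSet[str], requested: str) -> str:
    """Grant a supported dtype unchanged; degrade anything else, loudly."""
    if requested in supported:
        return requested
    warnings.warn(
        f"{platform_name} does not support {requested}; "
        f"falling back to {FALLBACK_DTYPE}",
        RuntimeWarning,
        stacklevel=2,
    )
    return FALLBACK_DTYPE


# Assumed CPU capability set, matching the "fp16 not on CPU" rule in this lab.
CPU_DTYPES = frozenset({"float32", "bfloat16"})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    granted = check_dtype("cpu", CPU_DTYPES, "float16")

print(granted)                 # float32
print(len(caught))             # 1
print(str(caught[0].message))
```

The caller always gets back a usable dtype and there is always a record of the downgrade, which covers the two failure modes the note warns about: crashing mid-load and silently computing in the wrong precision.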

Going further

Wire it into mini_vllm: give LLMEngine a platform parameter whose attention_backend string selects between two toy attention impls (both correct, different "hardware"). The Phase 14 lab-01 tripwire test then proves the engine consults only the platform — the funnel, enforced.
Add get_device_memory() per platform and route Phase 2 lab-03's blocks-from-bytes carving through it — the startup ritual becomes platform-portable, which is precisely how the real worker does it.
Simulate the entry-point load: a plugins/ dict of callables, each registering a platform; load them in sorted order and re-run the duplicate test. Then consider: what should happen when two plugins collide? (Upstream: first wins, plus a warning. Reasonable people disagree, so write down the trade.)
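
One possible starting point for the entry-point simulation is sketched below, with both collision policies behind a flag so the trade-off can be compared directly. Everything here is hypothetical scaffolding: `load_plugins`, the `first_wins` flag, the plugin names, and the string-valued registry are inventions for the exercise, and real entry-point discovery is not modeled.

```python
import warnings
from typing import Callable, Dict, List

Registry = Dict[str, str]  # platform name -> attention backend (toy model)


def register_platform(registry: Registry, name: str, backend: str) -> None:
    if name in registry:
        raise ValueError(f"platform {name!r} is already registered")
    registry[name] = backend


def load_plugins(plugins: Dict[str, Callable[[Registry], None]],
                 registry: Registry,
                 first_wins: bool = False) -> List[str]:
    """Run each plugin in sorted-name order; return the names that loaded.

    first_wins=False: a collision aborts loading (strict refusal).
    first_wins=True:  a colliding plugin is skipped with a warning.
    """
    loaded = []
    for plugin_name in sorted(plugins):  # deterministic order
        try:
            plugins[plugin_name](registry)
        except ValueError as exc:
            if not first_wins:
                raise
            warnings.warn(f"plugin {plugin_name!r} skipped: {exc}")
            continue
        loaded.append(plugin_name)
    return loaded


# Two hypothetical vendors that both claim the name "mytpu".
plugins = {
    "b_vendor": lambda reg: register_platform(reg, "mytpu", "other_attn"),
    "a_vendor": lambda reg: register_platform(reg, "mytpu", "pallas"),
}

registry = {"cpu": "torch_sdpa", "cuda": "flash_attn"}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loaded = load_plugins(plugins, registry, first_wins=True)

print(loaded)             # ['a_vendor']
print(registry["mytpu"])  # pallas
print(len(caught))        # 1
```

Sorting makes the winner reproducible, but it also means the winner is decided by package naming, which is one of the things worth writing down when weighing first-wins against strict refusal.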

References

upstream/vllm/platforms/interface.py — the real Platform.
upstream/vllm/platforms/__init__.py — detection, resolution, plugin loading.
vLLM docs, vLLM Plugin System: https://docs.vllm.ai/en/latest/design/plugin_system.html
Phase 4 lab-02 (attention selector) and Phase 14 lab-01 (model registry) — the other two tables in the trilogy.

vLLM Mastery — From Zero to Maintainer