Lab 17-01 — The Platform Abstraction: One Engine, Any Silicon [CPU-OK]
vLLM runs on NVIDIA, AMD, Intel GPUs, TPUs, Gaudi, and plain CPUs — and the reason
it can is one interface and one registry: every hardware-specific decision (which
attention backend? which dtypes? are CUDA graphs a thing here?) is asked of a
Platform object, and platforms register into a table that out-of-tree plugins can
join without touching a line of core code. You'll build the whole mechanism small —
the interface, two in-tree platforms, the resolver with its override and its CPU
floor — and then the test that is the architecture: register a third platform from
"outside" and watch the engine's decisions change, core untouched. Plus the security
posture detail most plugin systems forget: duplicate registration is refused,
because a plugin silently shadowing the CUDA platform is a supply-chain incident
wearing a convenience feature.
Contents
- Why this lab exists
- Background: the decisions that funnel through
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
The platform layer is how vLLM scaled organizationally, not just technically: hardware vendors (AMD, Intel, Google, Huawei, IBM) maintain their own backends — some in-tree, some as plugin packages — without serializing through the core team. That only works because the interface is explicit and the extension point is a registry, and the lab's plugin test demonstrates the payoff in its purest form: new silicon support is additive. If you ever bring vLLM to new hardware (a real career path — ask the Spyre and Ascend teams), this lab is the map of what you'll implement; if you review plugin PRs, it's the map of what to check.
The design pattern is also the course's registry trilogy completed: attention
backends (Phase 4's selector), model architectures (Phase 14's registry), and now
platforms — three tables, one philosophy: core code asks "who handles this?"
instead of knowing. Each table is also a place where Phase 4 lab-02's bisection
move works (override exists at every layer for exactly that reason).
Background: the decisions that funnel through
The real Platform interface (upstream/vllm/platforms/interface.py) answers, per
hardware: which attention backend class (this is literally where Phase 4's
selector gets its platform default), supported dtypes (your check_dtype is the
negotiation — bf16 everywhere, fp16 not on CPU, fp8 only on Hopper+-class),
device introspection (memory totals — Phase 2 lab-03's carving needs to ask
someone), graph capture support (Phase 5 is a no-op on CPU), and communicator
choices (Phase 10's collectives differ per fabric). Resolution happens once at
import/startup: if an override is set, honor it; otherwise detect devices, consult
the registry, and fall back to CPU, the platform that always exists. That floor is
what makes "no accelerator detected" a slow day instead of a crash.
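The resolution order above can be sketched in a few lines. This is a minimal illustration, not the lab's reference solution and not upstream code: the field names, the `torch_sdpa` backend string, the device strings, and the exact signatures of `make_default_platforms` and `resolve_platform` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Platform:
    """Hypothetical, trimmed-down platform: hardware decisions live here."""
    name: str
    attention_backend: str
    supported_dtypes: frozenset
    supports_graphs: bool

    def check_dtype(self, requested: str) -> str:
        # Negotiate instead of crashing: unsupported requests degrade to float32.
        return requested if requested in self.supported_dtypes else "float32"


def make_default_platforms() -> dict:
    # Two in-tree platforms. Backend and dtype strings are illustrative only.
    return {
        "cuda": Platform("cuda", "flash_attn",
                         frozenset({"float32", "bfloat16", "float16"}), True),
        "cpu": Platform("cpu", "torch_sdpa",
                        frozenset({"float32", "bfloat16"}), False),
    }


def resolve_platform(registry: dict, detected: list,
                     override: Optional[str] = None) -> Platform:
    # 1. An explicit override wins, and an unknown name fails loudly.
    if override is not None:
        if override not in registry:
            raise ValueError(
                f"unknown platform {override!r}; known: {sorted(registry)}")
        return registry[override]
    # 2. Otherwise the first detected accelerator with a registered platform wins.
    for device in detected:
        if device != "cpu" and device in registry:
            return registry[device]
    # 3. The floor: CPU is always there.
    return registry["cpu"]


registry = make_default_platforms()
print(resolve_platform(registry, ["cuda"]).attention_backend)  # flash_attn
print(resolve_platform(registry, []).name)                     # cpu
```

The decisions travel together: whichever platform is returned brings its attention backend, dtype set, and graph-capture flag with it, so callers never branch on the device themselves.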
Plugins join via Python entry points: installing vllm-ascend registers its
platform at import time — your register_platform, with packaging around it. The
refuse-duplicates rule is the trust boundary: in-tree names are spoken for.
Files
- starter.py: Platform.check_dtype, register_platform, resolve_platform, make_default_platforms. Your work.
- solution.py: reference.
- test_lab.py: accelerator preference, the CPU floor, override + loud unknowns, dtype negotiation, the out-of-tree plugin, and the duplicate refusal.
Run
LAB_IMPL=starter pytest phase-17-hardware-backends-and-plugins/labs/lab-01-platform-abstraction -q
pytest phase-17-hardware-backends-and-plugins/labs/lab-01-platform-abstraction -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_resolution_prefers_the_accelerator | Detection order: the GPU wins when present, and with it come flash_attn and graphs; the decisions travel as a bundle |
| test_cpu_is_the_floor | Empty device list still resolves: vLLM always has somewhere to run, which is why lab-02 works at all |
| test_override_wins_and_unknown_is_loud | The bisection hook (Phase 4 lab-02's reflex, platform edition); typos fail fast instead of silently falling back |
| test_dtype_negotiation | Unsupported dtype → float32, never a crash mid-load: capability mismatches are negotiated at the boundary |
| test_out_of_tree_plugin_changes_decisions_without_core_edits | The architecture: a "vendor" registers mytpu, resolution returns it, the attention backend is now pallas, and the diff to core is zero lines |
| test_duplicate_registration_is_refused | A plugin cannot shadow cpu or cuda: the supply-chain guard, as an assert |
Hitchhiker's notes
- Find your functions upstream: upstream/vllm/platforms/interface.py (Platform, with ~30 methods where you wrote 1, same skeleton), upstream/vllm/platforms/__init__.py (detection + resolution + plugin loading: your resolve_platform with the entry-point scan), and any of cuda.py / cpu.py / rocm.py / tpu.py as the in-tree implementations. Read cpu.py with lab-02; its overrides are exactly the decision list above.
- The plugin mechanism is general: vLLM's plugin system (upstream/vllm/plugins/) loads any registered entry point at startup: platforms, but also out-of-tree models (Phase 14's registry accepts plugins the same way) and custom components. One loading mechanism, many tables. When you see VLLM_PLUGINS in an environment, this is what it gates.
- Why funnel rather than if torch.cuda.is_available() sprinkled everywhere? Because the sprinkled version is what most codebases have, and it makes new hardware a grep-and-pray refactor across hundreds of sites. The funnel makes it one class. The lab's plugin test is unwritable against sprinkled conditionals, which is the test-as-architecture-proof point again (Phase 14's tripwire, in registry form).
- Capability negotiation beats capability assumption: check_dtype's fall-to-float32 is a microcosm of how the whole layer behaves. Requests for the unsupported degrade explicitly (with a warning upstream) rather than crashing or, worse, silently miscomputing. Every backend boundary in your own systems deserves the same negotiation shape.
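The last two notes can be made concrete with a small sketch. The class names, the dtype sets, the `load_weights_as` helper, and the use of Python's `warnings` module are assumptions for illustration; they are not copied from the lab or from upstream.

```python
import warnings


class Platform:
    """Hypothetical platform base: dtype support is declared, not assumed."""
    supported_dtypes = frozenset({"float32"})

    def check_dtype(self, requested: str) -> str:
        if requested in self.supported_dtypes:
            return requested
        # Degrade explicitly: the caller gets a usable dtype and a visible warning.
        warnings.warn(
            f"{type(self).__name__} does not support {requested}; using float32",
            stacklevel=2,
        )
        return "float32"


class CpuPlatform(Platform):
    supported_dtypes = frozenset({"float32", "bfloat16"})


class CudaPlatform(Platform):
    supported_dtypes = frozenset({"float32", "bfloat16", "float16"})


def load_weights_as(platform: Platform, requested: str) -> str:
    # Core code asks the platform; there is no device-specific branch here.
    return platform.check_dtype(requested)


print(load_weights_as(CudaPlatform(), "float16"))  # float16
```

Supporting a new device in this shape means adding one subclass; `load_weights_as` and every other caller stay as they are.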
Going further
- Wire it into mini_vllm: give LLMEngine a platform parameter whose attention_backend string selects between two toy attention impls (both correct, different "hardware"). The Phase 14 lab-01 tripwire test then proves the engine consults only the platform: the funnel, enforced.
- Add get_device_memory() per platform and route Phase 2 lab-03's blocks-from-bytes carving through it. The startup ritual becomes platform-portable, which is precisely how the real worker does it.
- Simulate the entry-point load: a plugins/ dict of callables, each registering a platform; load them in sorted order and re-run the duplicate test. Then consider: what should happen when two plugins collide? (Upstream: first wins, plus a warning. Reasonable people disagree; write down the trade.)
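One possible starting point for the last exercise, as a sketch only: `make_registry`, `load_plugins`, the plugin names, and the string-valued "platforms" are all hypothetical, and the collision policy shown (keep the first registration, warn about the later one) is just one of the options the exercise asks you to weigh.

```python
import warnings


def make_registry():
    """Return a fresh table and a register function that refuses duplicates."""
    table = {}

    def register_platform(name, backend):
        if name in table:
            raise ValueError(f"platform {name!r} is already registered")
        table[name] = backend

    return table, register_platform


def load_plugins(plugins: dict, register_platform) -> list:
    """Run each plugin in sorted-name order; return the names that were skipped."""
    skipped = []
    for plugin_name in sorted(plugins):
        try:
            plugins[plugin_name](register_platform)
        except ValueError as err:
            # "First wins": keep the earlier registration, report the later one.
            warnings.warn(f"plugin {plugin_name!r} skipped: {err}")
            skipped.append(plugin_name)
    return skipped


table, register_platform = make_registry()
register_platform("cpu", "torch_sdpa")  # the in-tree floor registers first

# Stand-ins for installed entry points: name -> callable that registers a platform.
plugins = {
    "b_vendor": lambda register: register("mytpu", "other_backend"),
    "a_vendor": lambda register: register("mytpu", "pallas"),
}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    skipped = load_plugins(plugins, register_platform)

print(table["mytpu"])  # pallas
print(skipped)         # ['b_vendor']
```

Sorted order makes the winner deterministic, but it also means a package name decides who wins. A stricter alternative is to let the `ValueError` propagate and stop startup; a plugin that registers several platforms and fails partway also leaves its earlier registrations in place under this sketch.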
References
- upstream/vllm/platforms/interface.py: the real Platform.
- upstream/vllm/platforms/__init__.py: detection, resolution, plugin loading.
- vLLM docs, vLLM Plugin System: https://docs.vllm.ai/en/latest/design/plugin_system.html
- Phase 4 lab-02 (attention selector) and Phase 14 lab-01 (model registry) — the other two tables in the trilogy.