Lab 17-01 — The Platform Abstraction: One Engine, Any Silicon [CPU-OK]

vLLM runs on NVIDIA, AMD, Intel GPUs, TPUs, Gaudi, and plain CPUs — and the reason it can is one interface and one registry: every hardware-specific decision (which attention backend? which dtypes? are CUDA graphs a thing here?) is asked of a Platform object, and platforms register into a table that out-of-tree plugins can join without touching a line of core code. You'll build the whole mechanism small — the interface, two in-tree platforms, the resolver with its override and its CPU floor — and then the test that is the architecture: register a third platform from "outside" and watch the engine's decisions change, core untouched. Plus the security posture detail most plugin systems forget: duplicate registration is refused, because a plugin silently shadowing the CUDA platform is a supply-chain incident wearing a convenience feature.

Why this lab exists

The platform layer is how vLLM scaled organizationally, not just technically: hardware vendors (AMD, Intel, Google, Huawei, IBM) maintain their own backends — some in-tree, some as plugin packages — without serializing through the core team. That only works because the interface is explicit and the extension point is a registry, and the lab's plugin test demonstrates the payoff in its purest form: new silicon support is additive. If you ever bring vLLM to new hardware (a real career path — ask the Spyre and Ascend teams), this lab is the map of what you'll implement; if you review plugin PRs, it's the map of what to check.

The design pattern is also the course's registry trilogy completed: attention backends (Phase 4's selector), model architectures (Phase 14's registry), and now platforms — three tables, one philosophy: core code asks "who handles this?" instead of knowing. Each table is also a place where Phase 4 lab-02's bisection move works (override exists at every layer for exactly that reason).

Background: the decisions that funnel through

The real Platform interface (upstream/vllm/platforms/interface.py) answers, per hardware: which attention backend class (this is literally where Phase 4's selector gets its platform default), supported dtypes (your check_dtype is the negotiation: bf16 everywhere, fp16 not on CPU, fp8 only on Hopper-class hardware and newer), device introspection (memory totals; Phase 2 lab-03's carving needs to ask someone), graph capture support (Phase 5 is a no-op on CPU), and communicator choices (Phase 10's collectives differ per fabric). Resolution happens once at import/startup: honor the override if one is set; otherwise detect devices, consult the registry, and fall back to CPU, the platform that always exists. That floor is what makes "no accelerator detected" a slow day instead of a crash.
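The resolution order described above can be sketched in a few lines. This is a minimal illustration, not the lab's reference solution: the Platform fields, the function signatures, and the CPU backend name "torch_sdpa" are assumptions made for the example; only the names check_dtype, register_platform, resolve_platform, and make_default_platforms come from the lab itself.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class Platform:
    """One bundle of hardware-specific answers (fields are illustrative)."""
    name: str
    attention_backend: str
    supported_dtypes: FrozenSet[str]
    supports_graphs: bool


def make_default_platforms() -> Dict[str, Platform]:
    """Two in-tree platforms keyed by device name."""
    return {
        "cuda": Platform("cuda", "flash_attn",
                         frozenset({"float32", "float16", "bfloat16"}), True),
        # "torch_sdpa" is a placeholder backend name, not taken from the lab.
        "cpu": Platform("cpu", "torch_sdpa",
                        frozenset({"float32", "bfloat16"}), False),
    }


def resolve_platform(registry: Dict[str, Platform],
                     detected: Sequence[str],
                     override: Optional[str] = None) -> Platform:
    """Override first, then the first detected accelerator, then the CPU floor."""
    if override is not None:
        if override not in registry:
            # A typo in the override fails loudly instead of falling back.
            raise ValueError(
                f"unknown platform override {override!r}; known: {sorted(registry)}")
        return registry[override]
    for device in detected:
        if device != "cpu" and device in registry:
            return registry[device]
    return registry["cpu"]  # the platform that always exists


platforms = make_default_platforms()
print(resolve_platform(platforms, ["cuda"]).attention_backend)  # flash_attn
print(resolve_platform(platforms, []).name)                     # cpu
print(resolve_platform(platforms, ["cuda"], override="cpu").supports_graphs)  # False
```

Because the backend, the dtypes, and graph support all hang off one object, switching platform switches every decision at once, which is why the override is a useful bisection hook.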

Plugins join via Python entry points: installing vllm-ascend registers its platform at import time — your register_platform, with packaging around it. The refuse-duplicates rule is the trust boundary: in-tree names are spoken for.

Files

  • starter.py: Platform.check_dtype, register_platform, resolve_platform, make_default_platforms. Your work.
  • solution.py: reference.
  • test_lab.py: accelerator preference, the CPU floor, override + loud unknowns, dtype negotiation, the out-of-tree plugin, and the duplicate refusal.

Run

LAB_IMPL=starter pytest phase-17-hardware-backends-and-plugins/labs/lab-01-platform-abstraction -q
pytest phase-17-hardware-backends-and-plugins/labs/lab-01-platform-abstraction -q   # reference

What the tests prove

| Test | What it pins |
| --- | --- |
| test_resolution_prefers_the_accelerator | Detection order: the GPU wins when present, and with it come flash_attn and graphs; the decisions travel as a bundle |
| test_cpu_is_the_floor | Empty device list still resolves: vLLM always has somewhere to run, which is why lab-02 works at all |
| test_override_wins_and_unknown_is_loud | The bisection hook (Phase 4 lab-02's reflex, platform edition); typos fail fast instead of silently falling back |
| test_dtype_negotiation | Unsupported dtype → float32, never a crash mid-load: capability mismatches are negotiated at the boundary |
| test_out_of_tree_plugin_changes_decisions_without_core_edits | The architecture: a "vendor" registers mytpu, resolution returns it, the attention backend is now pallas, and the diff to core is zero lines |
| test_duplicate_registration_is_refused | A plugin cannot shadow cpu or cuda: the supply-chain guard, as an assert |

Hitchhiker's notes

  • Find your functions upstream: upstream/vllm/platforms/interface.py (Platform, with ~30 methods where you wrote 1 — same skeleton), upstream/vllm/platforms/__init__.py (detection + resolution + plugin loading — your resolve_platform with the entry-point scan), and any of cuda.py / cpu.py / rocm.py / tpu.py as the in-tree implementations. Read cpu.py with lab-02 — its overrides are exactly the decision list above.
  • The plugin mechanism is general: vLLM's plugin system (upstream/vllm/plugins/) loads any registered entry point at startup — platforms, but also out-of-tree models (Phase 14's registry accepts plugins the same way) and custom components. One loading mechanism, many tables — when you see VLLM_PLUGINS in an environment, this is what it gates.
  • Why funnel rather than if torch.cuda.is_available() sprinkled everywhere? Because the sprinkled version is what most codebases have, and it makes new hardware a grep-and-pray refactor across hundreds of sites. The funnel makes it one class. The lab's plugin test is unwritable against sprinkled conditionals — which is the test-as-architecture-proof point again (Phase 14's tripwire, in registry form).
  • Capability negotiation beats capability assumption: check_dtype's fall-to-float32 is a microcosm of how the whole layer behaves — requests for the unsupported degrade explicitly (with a warning upstream) rather than crashing or, worse, silently miscomputing. Every backend boundary in your own systems deserves the same negotiation shape.
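The negotiation shape from the last note, as a minimal sketch. The Platform fields and the dtype strings are assumptions for illustration, and the warning uses Python's standard warnings module rather than whatever logging the upstream code uses.

```python
import warnings
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Platform:
    name: str
    supported_dtypes: FrozenSet[str]

    def check_dtype(self, requested: str) -> str:
        """Return the dtype to actually use: the request if supported, else float32."""
        if requested in self.supported_dtypes:
            return requested
        # Degrade explicitly: the caller gets a usable dtype and a visible warning.
        warnings.warn(
            f"{self.name} does not support {requested}; falling back to float32",
            stacklevel=2,
        )
        return "float32"


cpu = Platform("cpu", frozenset({"float32", "bfloat16"}))
print(cpu.check_dtype("bfloat16"))  # bfloat16
print(cpu.check_dtype("float16"))   # float32, with a UserWarning
```

The caller never has to know which dtypes a platform supports; it asks, and uses whatever comes back.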

Going further

  • Wire it into mini_vllm: give LLMEngine a platform parameter whose attention_backend string selects between two toy attention impls (both correct, different "hardware"). The Phase 14 lab-01 tripwire test then proves the engine consults only the platform — the funnel, enforced.
  • Add get_device_memory() per platform and route Phase 2 lab-03's blocks-from-bytes carving through it — the startup ritual becomes platform-portable, which is precisely how the real worker does it.
  • Simulate the entry-point load: a plugins/ dict of callables, each registering a platform; load them in sorted order and re-run the duplicate test. Then consider: what should happen when two plugins collide? (Upstream: first wins, plus a warning. Reasonable people disagree; write down the trade-off.)
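One way to start the last exercise, as a sketch under assumptions: plugins are a dict of callables keyed by a hypothetical plugin name, the registry maps platform name to an attention-backend string, and the loader implements the "first wins, plus a warning" policy by catching the refusal from register_platform. None of these signatures are taken from the lab or from upstream.

```python
import warnings
from typing import Callable, Dict, List

Registry = Dict[str, str]  # platform name -> attention backend (toy payload)


def register_platform(registry: Registry, name: str, backend: str) -> None:
    """Refuse to overwrite an existing platform name."""
    if name in registry:
        raise ValueError(f"platform {name!r} is already registered")
    registry[name] = backend


def load_plugins(registry: Registry,
                 plugins: Dict[str, Callable[[Registry], None]]) -> List[str]:
    """Run plugins in sorted-name order; a colliding plugin is skipped with a warning."""
    loaded = []
    for plugin_name in sorted(plugins):  # deterministic order decides who is "first"
        try:
            plugins[plugin_name](registry)
        except ValueError as exc:
            warnings.warn(f"plugin {plugin_name!r} skipped: {exc}")
        else:
            loaded.append(plugin_name)
    return loaded


registry: Registry = {"cuda": "flash_attn", "cpu": "torch_sdpa"}  # in-tree entries
plugins = {
    "b_vendor": lambda reg: register_platform(reg, "mytpu", "other_backend"),
    "a_vendor": lambda reg: register_platform(reg, "mytpu", "pallas"),
}

loaded = load_plugins(registry, plugins)
print(loaded)             # ['a_vendor']
print(registry["mytpu"])  # pallas
```

Note what the policy costs: the winner depends on load order, so renaming a package can change which backend serves mytpu. A stricter loader would re-raise instead of warning, trading convenience for a hard failure at startup.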

References

  • upstream/vllm/platforms/interface.py — the real Platform.
  • upstream/vllm/platforms/__init__.py — detection, resolution, plugin loading.
  • vLLM docs, vLLM Plugin System: https://docs.vllm.ai/en/latest/design/plugin_system.html
  • Phase 4 lab-02 (attention selector) and Phase 14 lab-01 (model registry) — the other two tables in the trilogy.