The Career Map: Maintainer, Staff Engineer, Founder
This course has three end-states in mind. They overlap, but each has its own "what does great look like" bar. Use this document as a compass: at any phase, ask "which of these am I building toward right now?"
Contents
- Track A — Become a vLLM maintainer
- Track B — Staff / Principal LLM-inference engineer
- Track C — Found a startup in inference
- A note on mindset
Track A — Become a vLLM maintainer
A maintainer is someone whose judgment the project trusts. You get there by a track record, not a title.
The ladder
- First contribution. A docs fix, a small bug, a test. Learn the workflow (Phase 19).
- Sustained contributions. Real features/fixes in one area (say, the scheduler or a quant method). You become "the person who knows X".
- Reviewer. You review others' PRs in your area credibly.
- Committer / maintainer. You're trusted to merge and to shape direction.
What maintainers actually do (and this course trains)
- Read code fast and correctly. Every 01-deep-dive.md is reps for this.
- Reason about invariants. "Block tables are append-only." "ref_cnt==0 ⟺ in free queue." Maintainers hold dozens of these in their head. The deep-dives name them explicitly; the CHEATSHEET.md files collect them.
- Protect the hot path. vLLM's scheduler runs every token step for every request; a Python list scan in the wrong place is a throughput regression. You learn to feel this.
- Write tests that pin behavior. Look at upstream/tests/v1/core/: that's the standard.
- Communicate. PR descriptions, RFCs, issue triage. See upstream/AGENTS.md for the project's literal rules (e.g. no pure code-agent PRs, cite that AI was used, include test commands and results).
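To make "reason about invariants" and "write tests that pin behavior" concrete, here is a minimal sketch of a reference-counted block pool with a check for the "ref_cnt==0 ⟺ in free queue" invariant. The class `ToyBlockPool` and all its method names are hypothetical and written for illustration; this is not vLLM's block pool, only a toy with the same shape of invariant.

```python
from collections import deque


class ToyBlockPool:
    """Toy KV-block pool. Illustrative only; not vLLM's implementation."""

    def __init__(self, num_blocks: int) -> None:
        self.ref_cnt = [0] * num_blocks
        # Every block starts unreferenced, so every block starts in the free queue.
        self.free_queue = deque(range(num_blocks))

    def allocate(self) -> int:
        block_id = self.free_queue.popleft()  # raises IndexError when exhausted
        self.ref_cnt[block_id] = 1
        return block_id

    def share(self, block_id: int) -> None:
        # A second holder reuses an already-allocated block (e.g. a shared prefix).
        assert self.ref_cnt[block_id] > 0, "cannot share a free block"
        self.ref_cnt[block_id] += 1

    def release(self, block_id: int) -> None:
        assert self.ref_cnt[block_id] > 0, "double free"
        self.ref_cnt[block_id] -= 1
        if self.ref_cnt[block_id] == 0:
            self.free_queue.append(block_id)

    def check_invariant(self) -> None:
        # ref_cnt == 0  <=>  the block is in the free queue, exactly once.
        free = list(self.free_queue)
        assert len(free) == len(set(free)), "block queued twice"
        expected = {b for b, c in enumerate(self.ref_cnt) if c == 0}
        assert set(free) == expected, "ref_cnt / free-queue mismatch"


def test_release_returns_block_only_at_zero_refs() -> None:
    pool = ToyBlockPool(num_blocks=4)
    block = pool.allocate()
    pool.share(block)
    pool.release(block)  # one holder left: must stay out of the free queue
    assert block not in pool.free_queue
    pool.check_invariant()
    pool.release(block)  # last holder gone: must be back in the free queue
    assert block in pool.free_queue
    pool.check_invariant()


test_release_returns_block_only_at_zero_refs()
```

The test names the behavior it pins, and `check_invariant` is cheap enough to call after every mutation in a test. Note that `block in pool.free_queue` is a linear scan: fine in a test, but exactly the kind of operation the hot-path bullet warns about in production code.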
The non-obvious advice
- Specialize, then generalize. Pick one subsystem from this course (scheduler, KV cache, a quant format, an attention backend) and go deeper than anyone. Depth in one area earns the trust that lets you touch others.
- Watch the firehose. Subscribe to the repo. Read merged PRs in your area daily. Diffing how the engine evolves (Phase 19) is the fastest way to learn the current mental model.
Track B — Staff / Principal LLM-inference engineer
This is the industry role: you own how models serve — throughput, latency, cost, reliability — at a company. The interview loops test exactly the material in this course.
The competency map
| Competency | Phases | "Staff-level" looks like |
|---|---|---|
| Transformer inference fundamentals | 0, 1 | Can derive KV-cache memory from first principles; explain prefill vs decode bottlenecks. |
| Memory management | 2 | Can size KV cache for a deployment; explain paging vs fragmentation with numbers. |
| Throughput engineering | 3, 18 | Can diagnose a throughput cliff from metrics; tune batch/token budgets; reason about Little's Law. |
| Kernels & precision | 4–7 | Knows when FlashInfer beats FlashAttention; what FP8 costs in accuracy; reads a roofline. |
| Latency techniques | 8, 9 | Knows when spec decode helps (acceptance rate × draft cost); chunked prefill tradeoffs. |
| Scale-out | 10, 15 | Picks TP vs PP vs DP vs EP for a model+SLA; understands P/D disaggregation economics. |
| Productization | 11, 12, 16 | Multi-tenant LoRA, structured output, API design, streaming, observability. |
| Hardware breadth | 17 | Reasons about NVIDIA vs AMD vs TPU tradeoffs and the plugin abstraction. |
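Three of the rows above reduce to arithmetic you should be able to do on a whiteboard. The sketch below is one way to write it down, under stated assumptions: the model shape and the 40 GiB budget are hypothetical numbers, all function names are made up for this example, and the speculative-decoding formula assumes every draft token is accepted independently with the same probability, which real workloads only approximate.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(kv_budget_bytes: int, bytes_per_token: int) -> int:
    """How many tokens fit in the memory left over for KV cache."""
    return kv_budget_bytes // bytes_per_token


def littles_law_in_flight(arrival_rate_rps: float, mean_latency_s: float) -> float:
    """Little's Law, L = lambda * W: mean number of requests in the system."""
    return arrival_rate_rps * mean_latency_s


def spec_decode_speedup(accept_rate: float, num_draft_tokens: int,
                        draft_cost_ratio: float) -> float:
    """Expected speedup under an i.i.d. per-token acceptance model.

    draft_cost_ratio is (cost of one draft pass) / (cost of one target pass).
    """
    a, k = accept_rate, num_draft_tokens
    # Expected tokens produced per target-model verification step.
    expected_tokens = k + 1 if a == 1.0 else (1 - a ** (k + 1)) / (1 - a)
    # Each step costs k draft passes plus one target pass.
    return expected_tokens / (k * draft_cost_ratio + 1)


# Hypothetical 7B-class shape: 32 layers, 32 KV heads, head_dim 128, FP16.
per_token = kv_bytes_per_token(32, 32, 128, 2)
print(per_token)  # 524288 bytes, i.e. 0.5 MiB per token

budget = 40 * 1024**3  # assumption: 40 GiB left for KV cache after weights
tokens = max_cached_tokens(budget, per_token)
in_flight = littles_law_in_flight(arrival_rate_rps=10.0, mean_latency_s=8.0)
print(tokens, in_flight, tokens / in_flight)  # cache tokens available per in-flight request

print(spec_decode_speedup(0.8, 4, 0.1))  # high acceptance, cheap draft
print(spec_decode_speedup(0.3, 4, 0.3))  # low acceptance, costly draft
```

With these inputs the second speculative-decoding case comes out below 1, meaning a slowdown: acceptance rate and draft cost have to be judged together, never separately.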
How to use the INTERVIEW.md files
Each phase ships staff-level Q&A. Treat them as a mock loop: cover the answer, attempt it out
loud, then compare. The flagship phases (2, 3) show the depth expected. A strong candidate
can whiteboard the PagedAttention block allocator and the continuous-batching step loop from
memory — which, after this course, you will have written yourself in mini_vllm/.
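As a flavor of what "whiteboard the continuous-batching step loop" can look like, here is a deliberately tiny sketch. Every name in it (`Request`, `step`, `token_budget`, `max_running`) is hypothetical, and it ignores KV-block allocation and preemption entirely; it is not vLLM's scheduler, just the core idea that each step spends a shared token budget on running requests first and then admits waiting ones.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    prompt_len: int
    max_new_tokens: int
    computed_prompt: int = 0  # prompt tokens already prefilled
    generated: int = 0        # output tokens produced so far

    @property
    def finished(self) -> bool:
        return self.generated >= self.max_new_tokens


def step(waiting: deque, running: list, token_budget: int, max_running: int) -> dict:
    """Run one toy engine step; returns {req_id: tokens scheduled this step}."""
    scheduled = {}
    # 1) Running requests go first: the rest of their prefill, or one decode token.
    for req in running:
        if token_budget == 0:
            break
        remaining = req.prompt_len - req.computed_prompt
        n = min(remaining, token_budget) if remaining > 0 else 1
        scheduled[req.req_id] = n
        token_budget -= n
    # 2) Admit waiting requests while budget and batch slots remain.
    while waiting and token_budget > 0 and len(running) < max_running:
        req = waiting.popleft()
        n = min(req.prompt_len, token_budget)  # may be only a chunk of the prompt
        scheduled[req.req_id] = n
        token_budget -= n
        running.append(req)
    # 3) "Execute" the batch by advancing counters.
    for req in running:
        n = scheduled.get(req.req_id, 0)
        if n == 0:
            continue
        if req.computed_prompt < req.prompt_len:
            req.computed_prompt += n
            if req.computed_prompt == req.prompt_len:
                req.generated += 1  # the step that completes prefill emits a token
        else:
            req.generated += 1
    # 4) Finished requests leave immediately, freeing slots for the next step.
    running[:] = [r for r in running if not r.finished]
    return scheduled


waiting = deque([Request("A", prompt_len=6, max_new_tokens=2),
                 Request("B", prompt_len=3, max_new_tokens=2)])
running: list = []
while waiting or running:
    print(step(waiting, running, token_budget=8, max_running=4))
```

In this run, request B's prompt is split across two steps because A consumed most of the first step's budget, and A leaves the batch one step before B does. Those two effects, chunking and per-step admission and exit, are what you would be expected to explain on the whiteboard.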
Your portfolio
By the end you have three artifacts that beat any résumé bullet:
- mini_vllm/: a working engine you built. Walk an interviewer through it.
- A merged upstream PR (Phase 19). Public proof you operate at the real bar.
- A tuning/benchmark writeup (Phase 18). Shows you think in numbers.
Track C — Found a startup in inference
The inference layer is one of the most valuable and contested in the AI stack. This course makes you dangerous in it.
Where the value (and the moats) are
- Cost per token. The whole game. Everything in Phases 2–7 and 10 is a lever on it. At a fixed GPU bill, a 2× throughput win halves your serving cost per token, and that saving goes straight to gross margin.
- Latency SLAs. TTFT and ITL guarantees (Phases 3, 8, 9, 15) are what enterprise buyers actually pay for.
- Multi-tenancy. Serving thousands of fine-tunes cheaply = multi-LoRA + prefix caching (Phases 3, 11). A structural cost advantage over per-customer deployments.
- Hardware arbitrage. Running well on cheaper/available silicon (Phase 17) when NVIDIA is supply-constrained.
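The cost-per-token lever is simple enough to put in a few lines. This is a back-of-the-envelope sketch with hypothetical prices and throughputs (the $2.00/hour GPU, the token rates, and the $0.40 price are all made-up inputs), and it counts GPU rental only. It shows that doubling throughput at a fixed GPU cost halves cost per token; how much gross margin moves depends on your price.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """GPU cost to serve one million tokens, ignoring all non-GPU costs."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


def gross_margin(price_per_million: float, cost_per_million: float) -> float:
    """Fraction of revenue left after serving cost."""
    return (price_per_million - cost_per_million) / price_per_million


# Hypothetical inputs: a $2.00/hour GPU, before and after a 2x throughput win.
base_cost = cost_per_million_tokens(2.00, 2500)
fast_cost = cost_per_million_tokens(2.00, 5000)
price = 0.40  # assumed selling price per million tokens

print(f"cost per 1M tokens: {base_cost:.4f} -> {fast_cost:.4f}")
print(f"gross margin: {gross_margin(price, base_cost):.1%} -> "
      f"{gross_margin(price, fast_cost):.1%}")
```

Swapping in your own GPU price, measured tokens per second, and list price turns this into a quick sanity check on whether an optimization is worth the engineering time.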
Honest take on moats
Raw "we wrap vLLM and rent GPUs" is not a moat — margins compress fast. Defensible angles:
- A genuine kernel/scheduling edge you can sustain (hard, but this course is where you'd build the expertise to try).
- Workload specialization — agentic/long-context/structured-output/RAG-shaped traffic has different optimal configs; owning a vertical's serving stack is defensible.
- The control plane — routing, autoscaling, multi-tenancy, observability, cost attribution around the engine. Often more durable than the engine itself.
- Distribution / switching costs — being embedded in customers' pipelines.
The build/buy/contribute calculus
You will almost always build on vLLM rather than replace it — that's the point of open source. The startup question is "what do we add on top, and what should we upstream?" Phase 19 covers the contribute-vs-keep-private tradeoff (upstreaming buys you maintenance leverage and credibility; hoarding a commodity feature buys you nothing).
A note on mindset
The people who reach all three end-states share one habit: they read the source. Not docs
about the source — the source. This entire course is built to make that your default
reflex. Open upstream/ now and keep it open for the next 20 phases.