The Career Map: Maintainer, Staff Engineer, Founder

This course has three end-states in mind. They overlap, but each has its own "what does great look like" bar. Use this document as a compass: at any phase, ask "which of these am I building toward right now?"


Track A — Become a vLLM maintainer

A maintainer is someone whose judgment the project trusts. You get there by a track record, not a title.

The ladder

  1. First contribution. A docs fix, a small bug, a test. Learn the workflow (Phase 19).
  2. Sustained contributions. Real features/fixes in one area (say, the scheduler or a quant method). You become "the person who knows X".
  3. Reviewer. You review others' PRs in your area credibly.
  4. Committer / maintainer. You're trusted to merge and to shape direction.

What maintainers actually do (and this course trains)

  • Read code fast and correctly. Every 01-deep-dive.md is reps for this.
  • Reason about invariants. "Block tables are append-only." "ref_cnt==0 ⟺ in free queue." Maintainers hold dozens of these in their heads. The deep-dives name them explicitly; the CHEATSHEET.md files collect them.
  • Protect the hot path. vLLM's scheduler runs every token step for every request — a Python list scan in the wrong place is a throughput regression. You learn to feel this.
  • Write tests that pin behavior. Look at upstream/tests/v1/core/ — that's the standard.
  • Communicate. PR descriptions, RFCs, issue triage. See upstream/AGENTS.md for the project's literal rules (e.g. no pure code-agent PRs, cite that AI was used, include test commands and results).
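
To make "reason about invariants" concrete, here is a minimal sketch of the ref_cnt==0 ⟺ in free queue invariant quoted above. ToyBlockPool and all of its method names are hypothetical and exist only for illustration; this is not vLLM's block pool, just a toy that keeps the same invariant and a test-style checker for it.

```python
from collections import deque


class ToyBlockPool:
    """Toy KV-block pool (hypothetical names, not vLLM's classes)."""

    def __init__(self, num_blocks: int) -> None:
        self.ref_cnt = [0] * num_blocks
        # Every block starts unreferenced, so every block starts free.
        self.free_queue = deque(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_queue:
            raise RuntimeError("out of KV blocks")
        block_id = self.free_queue.popleft()
        self.ref_cnt[block_id] = 1
        return block_id

    def share(self, block_id: int) -> None:
        # A second user of an already-allocated block (e.g. a shared prefix).
        assert self.ref_cnt[block_id] > 0, "cannot share a free block"
        self.ref_cnt[block_id] += 1

    def release(self, block_id: int) -> None:
        assert self.ref_cnt[block_id] > 0, "double free"
        self.ref_cnt[block_id] -= 1
        if self.ref_cnt[block_id] == 0:
            # Only the last release returns the block to the free queue.
            self.free_queue.append(block_id)

    def check_invariants(self) -> None:
        # O(num_blocks) scan: fine in a test, wrong on the per-step hot path.
        free = set(self.free_queue)
        assert len(free) == len(self.free_queue), "duplicate in free queue"
        for block_id, cnt in enumerate(self.ref_cnt):
            assert (cnt == 0) == (block_id in free), f"block {block_id} broken"


pool = ToyBlockPool(4)
block = pool.allocate()
pool.share(block)       # ref_cnt == 2
pool.release(block)     # ref_cnt == 1, still not free
pool.check_invariants()
pool.release(block)     # ref_cnt == 0, back in the free queue
pool.check_invariants()
```

The checker is the kind of thing a behavior-pinning test calls after every operation; the production path would rely on the operations preserving the invariant rather than re-scanning.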

The non-obvious advice

  • Specialize, then generalize. Pick one subsystem from this course (scheduler, KV cache, a quant format, an attention backend) and go deeper than anyone. Depth in one area earns the trust that lets you touch others.
  • Watch the firehose. Subscribe to the repo. Read merged PRs in your area daily. Diffing how the engine evolves (Phase 19) is the fastest way to learn the current mental model.

Track B — Staff / Principal LLM-inference engineer

This is the industry role: you own how models are served (throughput, latency, cost, reliability) at a company. The interview loops test exactly the material in this course.

The competency map

| Competency | Phases | "Staff-level" looks like |
| --- | --- | --- |
| Transformer inference fundamentals | 0, 1 | Can derive KV-cache memory from first principles; explain prefill vs decode bottlenecks. |
| Memory management | 2 | Can size KV cache for a deployment; explain paging vs fragmentation with numbers. |
| Throughput engineering | 3, 18 | Can diagnose a throughput cliff from metrics; tune batch/token budgets; reason about Little's Law. |
| Kernels & precision | 4–7 | Knows when FlashInfer beats FlashAttention; what FP8 costs in accuracy; reads a roofline. |
| Latency techniques | 8, 9 | Knows when spec decode helps (acceptance rate × draft cost); chunked prefill tradeoffs. |
| Scale-out | 10, 15 | Picks TP vs PP vs DP vs EP for a model+SLA; understands P/D disaggregation economics. |
| Productization | 11, 12, 16 | Multi-tenant LoRA, structured output, API design, streaming, observability. |
| Hardware breadth | 17 | Reasons about NVIDIA vs AMD vs TPU tradeoffs and the plugin abstraction. |
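
As a taste of what "derive KV-cache memory from first principles" and "reason about Little's Law" mean in practice, here is a small sizing sketch. The function names are hypothetical, and the model shape (32 layers, 32 KV heads, head dimension 128, 2-byte elements), the 40 GiB budget, and the traffic numbers are illustrative assumptions rather than measurements.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    # Each layer stores one K and one V vector per KV head for every token.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(kv_budget_bytes: int, bytes_per_token: int) -> int:
    # How many tokens of KV cache fit in the memory left after weights etc.
    return kv_budget_bytes // bytes_per_token


def littles_law_concurrency(arrival_rate_rps: float,
                            mean_latency_s: float) -> float:
    # Little's Law: L = lambda * W (mean in-flight requests in steady state).
    return arrival_rate_rps * mean_latency_s


per_token = kv_bytes_per_token(32, 32, 128, 2)     # 524288 bytes = 512 KiB
tokens = max_cached_tokens(40 * 1024**3, per_token)  # 81920 tokens in 40 GiB
in_flight = littles_law_concurrency(10.0, 4.0)       # 40.0 requests
print(per_token, tokens, in_flight, tokens / in_flight)
```

Under these assumed numbers, 40 requests in flight share 81,920 cached tokens, or about 2,048 tokens of context each. If typical requests need more than that, something has to give: the arrival rate, the latency target, or the memory budget.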

How to use the INTERVIEW.md files

Each phase ships staff-level Q&A. Treat them as a mock loop: cover the answer, attempt it out loud, then compare. The flagship phases (2, 3) show the depth expected. A strong candidate can whiteboard the PagedAttention block allocator and the continuous-batching step loop from memory — which, after this course, you will have written yourself in mini_vllm/.
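
For a sense of what "whiteboard the continuous-batching step loop" involves, here is a deliberately tiny sketch. It is not the course's mini_vllm/ code and not vLLM's scheduler: ToyRequest, step, and run are hypothetical names, prefill is modeled as one step that also emits the first token, and chunked prefill, preemption, and KV-block accounting are all left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ToyRequest:
    req_id: str
    prompt_len: int
    max_new_tokens: int  # assumed >= 1
    generated: int = 0


def step(waiting: deque, running: list, token_budget: int) -> list:
    """Run one engine step; return the ids of requests that finished."""
    budget = token_budget
    scheduled = []
    # 1. Requests already running decode one token each.
    for req in running:
        if budget == 0:
            break
        scheduled.append(req)
        budget -= 1
    # 2. Admit new requests (FIFO) while the whole prompt fits the budget.
    while waiting and waiting[0].prompt_len <= budget:
        req = waiting.popleft()
        budget -= req.prompt_len
        running.append(req)
        scheduled.append(req)
    # 3. "Execute": every scheduled request produces exactly one new token.
    finished = []
    for req in scheduled:
        req.generated += 1
        if req.generated >= req.max_new_tokens:
            finished.append(req.req_id)
    # 4. Retire finished requests immediately so their slots free up.
    running[:] = [r for r in running if r.generated < r.max_new_tokens]
    return finished


def run(requests: list, token_budget: int) -> dict:
    # Without chunked prefill, an over-budget prompt could never be admitted.
    if any(r.prompt_len > token_budget for r in requests):
        raise ValueError("prompt longer than token budget")
    waiting, running = deque(requests), []
    finish_step, step_no = {}, 0
    while waiting or running:
        step_no += 1
        for req_id in step(waiting, running, token_budget):
            finish_step[req_id] = step_no
    return finish_step


print(run([ToyRequest("A", 6, 2), ToyRequest("B", 6, 1),
           ToyRequest("C", 3, 3)], token_budget=8))
# {'A': 2, 'B': 2, 'C': 5}
```

The thing to notice is step 2 of the run: B is admitted while A is still decoding, instead of waiting for A's batch to drain. That per-step admission and retirement is the core idea; everything the sketch omits is where the real engineering lives.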

Your portfolio

By the end you have three artifacts that beat any résumé bullet:

  1. mini_vllm/ — a working engine you built. Walk an interviewer through it.
  2. A merged upstream PR (Phase 19). Public proof you operate at the real bar.
  3. A tuning/benchmark writeup (Phase 18). Shows you think in numbers.

Track C — Found a startup in inference

The inference layer is one of the most valuable and contested in the AI stack. This course makes you dangerous in it.

Where the value (and the moats) are

  • Cost per token. The whole game. Everything in Phases 2–7 and 10 is a lever on it. A 2× throughput win on the same hardware halves your serving cost per token; at a fixed price, that saving goes straight to gross margin.
  • Latency SLAs. TTFT and ITL guarantees (Phases 3, 8, 9, 15) are what enterprise buyers actually pay for.
  • Multi-tenancy. Serving thousands of fine-tunes cheaply = multi-LoRA + prefix caching (Phases 3, 11). A structural cost advantage over per-customer deployments.
  • Hardware arbitrage. Running well on cheaper/available silicon (Phase 17) when NVIDIA is supply-constrained.
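
The cost-per-token lever is simple arithmetic, sketched below. The function names are hypothetical and every number (GPU price, throughput, selling price) is invented for illustration; substitute your own.

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            tokens_per_second: float) -> float:
    # Serving cost of one million tokens on a single GPU at steady throughput.
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


def gross_margin(price_per_million: float, cost_per_million: float) -> float:
    # Fraction of revenue left after the GPU bill (ignores all other costs).
    return (price_per_million - cost_per_million) / price_per_million


# Assumed: a $3.60/hour GPU, $1.00 charged per million tokens.
for tps in (2000, 4000):
    cost = cost_per_million_tokens(3.60, tps)
    margin = gross_margin(1.00, cost)
    print(f"{tps} tok/s: ${cost:.2f} per 1M tokens, margin {margin:.0%}")
```

With these made-up inputs, doubling throughput from 2,000 to 4,000 tokens per second halves the cost per million tokens. How far the margin moves depends on the price you can charge, which is why throughput work and pricing power have to be considered together.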

Honest take on moats

Raw "we wrap vLLM and rent GPUs" is not a moat — margins compress fast. Defensible angles:

  • A genuine kernel/scheduling edge you can sustain (hard, but this course is where you'd build the expertise to try).
  • Workload specialization — agentic/long-context/structured-output/RAG-shaped traffic has different optimal configs; owning a vertical's serving stack is defensible.
  • The control plane — routing, autoscaling, multi-tenancy, observability, cost attribution around the engine. Often more durable than the engine itself.
  • Distribution / switching costs — being embedded in customers' pipelines.

The build/buy/contribute calculus

You will almost always build on vLLM rather than replace it — that's the point of open source. The startup question is "what do we add on top, and what should we upstream?" Phase 19 covers the contribute-vs-keep-private tradeoff (upstreaming buys you maintenance leverage and credibility; hoarding a commodity feature buys you nothing).


A note on mindset

The people who reach all three end-states share one habit: they read the source. Not docs about the source — the source. This entire course is built to make that your default reflex. Open upstream/ now and keep it open for the next 20 phases.