Phase 12 — Deep Dive: structured outputs in real vLLM

Paths relative to upstream/ at v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number drifts in a newer tree, search for the named symbol.

vllm/sampling_params.py                          StructuredOutputsParams (the user API)
vllm/v1/structured_output/backend_types.py       the two-interface contract (read first)
vllm/v1/structured_output/__init__.py            StructuredOutputManager (compile + bitmask)
vllm/v1/structured_output/backend_xgrammar.py    the default backend
vllm/v1/structured_output/request.py             per-request state + the cache key
vllm/v1/structured_output/utils.py               apply_grammar_bitmask (runner side)
vllm/v1/core/sched/scheduler.py                  get_grammar_bitmask (the scheduler hook)

1. The user API: StructuredOutputsParams

vllm/sampling_params.py:41 defines class StructuredOutputsParams, which holds exactly one of json | regex | choice | grammar | json_object | structural_tag (__post_init__ counts the set fields and raises if the count ≠ 1). This rides on every SamplingParams, so a constraint is a per-request property: one batch can mix free requests, a JSON-schema request, and a regex request.
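The exactly-one rule can be sketched as follows. This is a minimal illustration, not the upstream class: ToyStructuredOutputsParams, its field types, and the error message are assumptions; only the field names and the "count the set fields" idea come from the text above.

```python
from dataclasses import dataclass, fields
from typing import Optional, Union


@dataclass
class ToyStructuredOutputsParams:
    """Hypothetical stand-in: exactly one constraint field may be set."""
    json: Optional[Union[str, dict]] = None
    regex: Optional[str] = None
    choice: Optional[list] = None
    grammar: Optional[str] = None
    json_object: Optional[bool] = None
    structural_tag: Optional[str] = None

    def __post_init__(self) -> None:
        # Collect the names of the fields the caller actually supplied.
        set_fields = [f.name for f in fields(self)
                      if getattr(self, f.name) is not None]
        if len(set_fields) != 1:
            raise ValueError(
                f"expected exactly one constraint, got {set_fields}")


# A mixed batch: None means an unconstrained ("free") request.
batch = [
    None,
    ToyStructuredOutputsParams(json={"type": "object"}),
    ToyStructuredOutputsParams(regex=r"\d+"),
]
```

Because the check runs at construction time, a request that sets two constraints (or none) fails before it reaches the scheduler.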

The constraint becomes a cache key: get_structured_output_key() (vllm/v1/structured_output/request.py:77) maps params to a (StructuredOutputOptions, spec_string) tuple (a JSON dict is normalized with json.dumps). Two requests with the same schema share one compiled grammar context.
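A rough sketch of why a (type, string) key enables sharing. ToyOptions, toy_structured_output_key, and the dict-based cache are hypothetical; where upstream actually keeps compiled contexts is not shown here. Note that this sketch uses plain json.dumps, so two dicts with the same keys in a different insertion order would produce different keys.

```python
import json
from enum import Enum, auto


class ToyOptions(Enum):
    JSON = auto()
    REGEX = auto()


def toy_structured_output_key(kind, spec):
    """Map a constraint to a hashable (option, spec_string) tuple."""
    if kind is ToyOptions.JSON and not isinstance(spec, str):
        # Dict schemas are serialized so equal schemas give equal keys.
        spec = json.dumps(spec)
    return (kind, spec)


compiled = {}  # hypothetical cache: key -> compiled grammar context


def get_or_compile(kind, spec):
    key = toy_structured_output_key(kind, spec)
    if key not in compiled:
        compiled[key] = object()  # placeholder for an expensive compile
    return compiled[key]


ctx_a = get_or_compile(ToyOptions.JSON, {"type": "object"})
ctx_b = get_or_compile(ToyOptions.JSON, {"type": "object"})
same_context = ctx_a is ctx_b  # True: the second request reuses the first compile
```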

2. The contract: two abstract classes

backend_types.py is the whole design in 136 lines — read it before anything else:

  • StructuredOutputOptions (:19) — the six request types (JSON, JSON_OBJECT, REGEX, GRAMMAR, CHOICE, STRUCTURAL_TAG).
  • StructuredOutputGrammar (:31) — per-request state. Five methods carry the whole feature: accept_tokens (advance state), validate_tokens (check without advancing — used to vet spec-decode drafts), rollback(n) (un-advance — spec-decode rejection), fill_bitmask(tensor, index) (write this request's allowed-token bits into row index), is_terminated (grammar reached an accepting end state).
  • StructuredOutputBackend (:99) — engine-level: compile_grammar(type, spec) → StructuredOutputGrammar and allocate_token_bitmask(max_seqs).

That rollback is in the base interface tells you spec decode wasn't bolted on — the contract was designed so constraints and speculation compose.
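To make the per-request interface concrete, here is a toy version of the five methods with a trivial implementation. The method names follow the list above; the signatures, return types, and the FixedSequenceGrammar class are assumptions for illustration, and the mask row is a list of booleans rather than packed bits.

```python
from abc import ABC, abstractmethod


class ToyGrammar(ABC):
    """Hypothetical mirror of the five per-request methods."""

    @abstractmethod
    def accept_tokens(self, tokens): ...

    @abstractmethod
    def validate_tokens(self, tokens): ...

    @abstractmethod
    def rollback(self, num_tokens): ...

    @abstractmethod
    def fill_bitmask(self, bitmask, index): ...

    @abstractmethod
    def is_terminated(self): ...


class FixedSequenceGrammar(ToyGrammar):
    """Allows exactly one token sequence; the state is a position in it."""

    def __init__(self, sequence, vocab_size):
        self.sequence = list(sequence)
        self.vocab_size = vocab_size
        self.pos = 0

    def accept_tokens(self, tokens):
        # Advance the state; stop at the first token the grammar rejects.
        for tok in tokens:
            if self.is_terminated() or tok != self.sequence[self.pos]:
                return False
            self.pos += 1
        return True

    def validate_tokens(self, tokens):
        # Return the longest acceptable prefix WITHOUT advancing the state.
        pos, accepted = self.pos, []
        for tok in tokens:
            if pos >= len(self.sequence) or tok != self.sequence[pos]:
                break
            accepted.append(tok)
            pos += 1
        return accepted

    def rollback(self, num_tokens):
        self.pos -= num_tokens

    def fill_bitmask(self, bitmask, index):
        # Write this request's allowed tokens into row `index`.
        row = [False] * self.vocab_size
        if not self.is_terminated():
            row[self.sequence[self.pos]] = True
        bitmask[index] = row

    def is_terminated(self):
        return self.pos == len(self.sequence)
```

In this toy, validate_tokens and rollback are the two methods that only matter once drafts can be proposed and rejected; a grammar used without speculation would need only the other three.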

3. The manager: compile off the hot path

vllm/v1/structured_output/__init__.py:36 defines class StructuredOutputManager, which is owned by the scheduler (scheduler.py:90), not the workers. Compile and bitmask-fill happen on the scheduler side; only the finished numpy bitmask is shipped to GPU workers.

  • grammar_init (:115) — called when a constrained request arrives. Lazily instantiates the single engine-wide backend (xgrammar / guidance / outlines / lm-format-enforcer — note the comment: one backend per engine, not per request), then submits _create_grammar to a ThreadPoolExecutor: the request's grammar field holds a Future until compilation lands.
  • request.py:60, the grammar property, resolves that Future. A request whose grammar isn't ready yet is simply not schedulable (the scheduler skips it; search structured_output_request.grammar in scheduler.py). Compile latency is paid in that one request's TTFT, never in the engine loop.
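The Future handoff described in these two bullets can be sketched as below. ToyRequest, gated_compile, and schedulable are hypothetical names; the gate is a threading.Event that stands in for compile latency so the example is deterministic. Whether the upstream property ever blocks is not something this sketch claims.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class ToyRequest:
    def __init__(self, req_id):
        self.req_id = req_id
        self._grammar = None  # holds a Future first, then the compiled grammar

    @property
    def grammar(self):
        # Resolve the Future once it is done; never block the caller.
        if isinstance(self._grammar, Future):
            if not self._grammar.done():
                return None
            self._grammar = self._grammar.result()
        return self._grammar


gate = threading.Event()


def gated_compile(spec):
    gate.wait(timeout=5)  # stand-in for a slow grammar compilation
    return f"compiled({spec})"


def schedulable(requests):
    # The scheduling loop skips requests whose grammar is not ready yet.
    return [r.req_id for r in requests if r.grammar is not None]


executor = ThreadPoolExecutor(max_workers=1)
req = ToyRequest("r1")
req._grammar = executor.submit(gated_compile, "a+")
before = schedulable([req])   # [] while the compile is still blocked
gate.set()
executor.shutdown(wait=True)  # wait for compilation to land
after = schedulable([req])    # ["r1"]
```

The loop that calls schedulable never waits on the compile; only the affected request is delayed.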

4. The bitmask: one row per position, spec-decode included

grammar_bitmask (__init__.py:204) is the heart. Per step the scheduler calls Scheduler.get_grammar_bitmask (scheduler.py:1259), which collects the scheduled constrained request IDs and delegates here. What to notice:

  • Allocation (once): max_num_seqs × (1 + num_speculative_tokens) rows — one row per possible sampled position, not per request. With spec decode, request r drafting tokens d1..dk contributes k+1 rows: mask for the state before d1, before d2, …, before the bonus token.
  • The spec-decode dance (the serial path): for each draft token it fills a row, then accept_tokens([token]) to advance the state, counting state_advancements; after the last row it calls grammar.rollback(state_advancements) — the grammar temporarily pretends the drafts were accepted to compute their masks, then rewinds, because the real accept/reject verdict belongs to the rejection sampler (Phase 8).
  • Parallel fill: above fill_bitmask_parallel_threshold (non-spec case), requests are batched to the executor in groups — bitmask filling is pure CPU work and parallelizes.
  • Serialization: the tensor is returned as numpy (.numpy(), see the comment) because ndarray serializes much faster than a torch tensor on the way to workers — it travels in GrammarOutput (scheduler.py:1281).
  • should_advance (:322) / should_fill_bitmask (:302) — the reasoning-model gate: while a model is inside its thinking section, the constraint is suspended (the mask row is set to all-ones via _full_mask) and the automaton doesn't advance; enforcement begins when the reasoning parser (Phase 16) says reasoning ended.
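The allocation rule and the fill, accept, rollback sequence can be walked through with a toy grammar. CountingGrammar and fill_rows_for_request are hypothetical; rows are lists of booleans rather than packed int32 words, and the reasoning gate and the parallel path are left out.

```python
class CountingGrammar:
    """Toy grammar: in state s, only token (s % vocab_size) is allowed."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.state = 0

    def fill_bitmask(self, bitmask, index):
        allowed = self.state % self.vocab_size
        bitmask[index] = [t == allowed for t in range(self.vocab_size)]

    def accept_tokens(self, tokens):
        for t in tokens:
            if t != self.state % self.vocab_size:
                return False
            self.state += 1
        return True

    def rollback(self, n):
        self.state -= n


def fill_rows_for_request(grammar, draft_tokens, bitmask, first_row):
    """Fill len(draft_tokens) + 1 rows, then rewind the grammar."""
    state_advancements = 0
    row = first_row
    for token in list(draft_tokens) + [None]:  # None marks the bonus position
        grammar.fill_bitmask(bitmask, row)     # mask for the state BEFORE token
        row += 1
        if token is not None and grammar.accept_tokens([token]):
            state_advancements += 1
    if state_advancements:
        # The drafts were only provisionally accepted; undo them.
        grammar.rollback(state_advancements)
    return row


max_num_seqs, num_speculative_tokens, vocab_size = 2, 2, 4
# One row per possible sampled position, initialised to all-allowed.
bitmask = [[True] * vocab_size
           for _ in range(max_num_seqs * (1 + num_speculative_tokens))]

grammar = CountingGrammar(vocab_size)
next_row = fill_rows_for_request(grammar, [0, 1], bitmask, first_row=0)
# Rows 0..2 now allow tokens 0, 1, 2 respectively; grammar.state is back to 0.
```

With two draft tokens the request consumes three rows, and the grammar ends the step in the same state it started in, leaving the accept/reject decision to the sampler.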

5. The runner: reorder and apply

apply_grammar_bitmask(scheduler_output, grammar_output, input_batch, logits) in vllm/v1/structured_output/utils.py:44 is called from the GPU model runner right before sampling (gpu_model_runner.py:4359). It has two jobs:

  1. Reorder: the bitmask rows are in the scheduler's request order; the runner's batch order differs, and spec-decode offsets each request's logit rows. The function builds struct_out_req_batch_indices walking input_batch.req_ids with a cumulative_offset of spec tokens, then scatters rows into a sorted_bitmask sized [logits.rows, words] (unconstrained rows = all -1 = all-allowed).
  2. Apply: xgr.apply_token_bitmask_inplace(logits, bitmask, indices=out_indices) — one fused kernel writes −inf into every disallowed logit. 32 vocab entries per int32 word is why the mask is cheap to ship and apply.
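Both jobs can be sketched in pure Python. This is a simplified model, not the upstream function: pack_allowed, apply_row, and apply_toy_bitmask are hypothetical, each request has exactly one logit row (no spec-decode offsets), and the least-significant-bit-first layout within each word is an assumption of this sketch.

```python
import math


def pack_allowed(allowed, vocab_size):
    """Pack allowed token ids into signed 32-bit words, 32 tokens per word."""
    words = [0] * math.ceil(vocab_size / 32)
    for tok in allowed:
        words[tok // 32] |= 1 << (tok % 32)
    # Reinterpret as signed int32 so an all-ones word reads as -1.
    return [w - (1 << 32) if w >= (1 << 31) else w for w in words]


def apply_row(logits_row, words):
    """Write -inf into every logit whose bit is not set."""
    for tok in range(len(logits_row)):
        if not (words[tok // 32] >> (tok % 32)) & 1:
            logits_row[tok] = float("-inf")


def apply_toy_bitmask(logits, scheduler_rows, scheduler_order, batch_order):
    words_per_row = math.ceil(len(logits[0]) / 32)
    # Start with every row all-allowed (-1 has all 32 bits set).
    sorted_bitmask = [[-1] * words_per_row for _ in logits]
    row_of = {req_id: i for i, req_id in enumerate(batch_order)}
    # Reorder: scatter scheduler-ordered rows into the runner's batch order.
    for src, req_id in enumerate(scheduler_order):
        sorted_bitmask[row_of[req_id]] = scheduler_rows[src]
    # Apply: mask each logits row in place.
    for row, words in zip(logits, sorted_bitmask):
        apply_row(row, words)


vocab_size = 40  # needs two int32 words per row
logits = [[0.0] * vocab_size for _ in range(3)]
batch_order = ["a", "b", "c"]   # runner order; "b" is unconstrained
scheduler_order = ["c", "a"]    # scheduler order, constrained requests only
scheduler_rows = [pack_allowed([0, 33], vocab_size),
                  pack_allowed([5], vocab_size)]
apply_toy_bitmask(logits, scheduler_rows, scheduler_order, batch_order)
```

After the call, row "a" keeps only token 5, row "b" is untouched, and row "c" keeps tokens 0 and 33. A 40-token vocabulary costs two integers per row here, which is the packing that keeps the mask small.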

6. The xgrammar backend

backend_xgrammar.py:35 defines class XgrammarBackend. compile_grammar (:77) is a clean switch over the request types: compile_json_schema / compile_regex / compile_grammar (EBNF) / compile_structural_tag, each returning a compiled context ctx. CHOICE never reaches here as such, because choices are converted to a grammar upstream. Then:

return XgrammarGrammar(
    matcher=xgr.GrammarMatcher(ctx, max_rollback_tokens=self.num_speculative_tokens),
    vocab_size=self.vocab_size, ctx=ctx)

max_rollback_tokens is sized to the spec-decode draft length: the compose-with-Phase-8 contract again, now at the C++ matcher level. XgrammarGrammar (:132) is a thin wrapper: accept_tokens → matcher.accept_token loop, fill_bitmask (:191) → matcher.fill_next_token_bitmask(bitmask, idx), rollback → matcher.rollback. The actual FSM/pushdown machinery, and the compile-time token classification from the guide's Step 4, lives inside the xgrammar library; what vLLM owns is the plumbing you just traced. Also skim has_xgrammar_unsupported_json_features (:221) and validate_xgrammar_grammar (:268): unsupported schema features are rejected at the front door (processor), not at compile time. Fail fast, fail in the API layer.
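One way to see why the rollback budget is tied to the draft length is a matcher that retains only a bounded history of past states. ToyMatcher is purely illustrative: the transition function is arbitrary, and raising on an over-long rollback is a choice made for this sketch, not a statement about how xgrammar behaves.

```python
from collections import deque


class ToyMatcher:
    """Hypothetical matcher that can undo at most max_rollback_tokens steps."""

    def __init__(self, max_rollback_tokens):
        self.state = 0
        # Only the most recent states are retained; older ones are dropped.
        self._history = deque(maxlen=max_rollback_tokens)

    def accept_token(self, token):
        self._history.append(self.state)
        self.state = self.state * 31 + token  # any deterministic transition
        return True

    def rollback(self, num_tokens):
        if num_tokens > len(self._history):
            raise ValueError("rollback exceeds retained history")
        for _ in range(num_tokens):
            self.state = self._history.pop()


matcher = ToyMatcher(max_rollback_tokens=2)  # e.g. two speculative tokens
for tok in (1, 2, 3):
    matcher.accept_token(tok)
matcher.rollback(2)  # undo the last two tokens; state is back to 1
```

The bitmask step never advances a request by more than the number of draft tokens before rewinding, so a history of that length is enough and nothing older needs to be kept.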

backend_guidance.py implements the same two interfaces over llguidance (better coverage of exotic JSON-schema features, lazy-computed masks); backend_outlines.py and backend_lm_format_enforcer.py likewise. One contract, four interchangeable engines — the same backend-registry pattern you saw for attention (Phase 4) and quantization (Phase 6).

Reading checklist

  • backend_types.py — why are validate_tokens and rollback in the per-request interface? Which phase forces their existence?
  • grammar_init — what exactly is async, and what state is a request in while its grammar compiles?
  • grammar_bitmask — why max_num_seqs × (1 + num_spec_tokens) rows? Walk the fill→accept→…→rollback sequence for one request with 2 draft tokens.
  • apply_grammar_bitmask — why is reordering needed, and what does an all -1 row mean?
  • XgrammarBackend.compile_grammar — where does max_rollback_tokens come from?
  • In scheduler.py:968, why is a request with is_prefill_chunk excluded from bitmask generation? (Hint: which step actually samples a token?)

Now build it: 02-mini-build.md, then the labs.