Phase 12 — Deep Dive: structured outputs in real vLLM
Paths are relative to upstream/ at v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number drifts in a newer tree, search for the named symbol.

- vllm/sampling_params.py: StructuredOutputsParams (the user API)
- vllm/v1/structured_output/backend_types.py: the two-interface contract (read first)
- vllm/v1/structured_output/__init__.py: StructuredOutputManager (compile + bitmask)
- vllm/v1/structured_output/backend_xgrammar.py: the default backend
- vllm/v1/structured_output/request.py: per-request state + the cache key
- vllm/v1/structured_output/utils.py: apply_grammar_bitmask (runner side)
- vllm/v1/core/sched/scheduler.py: get_grammar_bitmask (the scheduler hook)
Contents
- 1. The user API: StructuredOutputsParams
- 2. The contract: two abstract classes
- 3. The manager: compile off the hot path
- 4. The bitmask: one row per position, spec-decode included
- 5. The runner: reorder and apply
- 6. The xgrammar backend
- Reading checklist
1. The user API: StructuredOutputsParams
vllm/sampling_params.py:41 — class StructuredOutputsParams holds exactly one of
json | regex | choice | grammar | json_object | structural_tag (__post_init__ counts the
set fields and raises if ≠ 1). This rides on every SamplingParams, so a constraint is a
per-request property — one batch can mix free requests, a JSON-schema request, and a regex
request.
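The exactly-one-of rule can be sketched with a small dataclass. This is a minimal illustration, not the vLLM class: `ToyStructuredOutputsParams`, its field types, and the error message are assumptions made for the example.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class ToyStructuredOutputsParams:
    """Hypothetical stand-in for StructuredOutputsParams (not the vLLM class)."""

    json: Optional[Any] = None
    regex: Optional[str] = None
    choice: Optional[list] = None
    grammar: Optional[str] = None
    json_object: Optional[bool] = None
    structural_tag: Optional[str] = None

    def __post_init__(self) -> None:
        # Count the constraint fields that were actually set.
        set_fields = [f.name for f in fields(self) if getattr(self, f.name) is not None]
        if len(set_fields) != 1:
            raise ValueError(
                f"expected exactly one constraint, got {len(set_fields)}: {set_fields}"
            )


# One constraint is fine; zero or two raise ValueError.
ok = ToyStructuredOutputsParams(regex=r"\d+")
```

Because the check lives on the params object, each request carries its own constraint and a batch can mix constrained and free requests.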
The constraint becomes a cache key in
vllm/v1/structured_output/request.py:77 — get_structured_output_key() maps params to a
(StructuredOutputOptions, spec_string) tuple (JSON dict gets json.dumps-normalized).
Two requests with the same schema share one compiled grammar context.
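A rough sketch of why a normalized key lets two requests share one compiled grammar. Everything here is hypothetical (`ToyOption`, `toy_key`, `ToyGrammarCache`), and `sort_keys=True` is a choice of this sketch; the text above only says the JSON dict is normalized through json.dumps.

```python
import json
from enum import Enum


class ToyOption(Enum):
    JSON = "json"
    REGEX = "regex"


def toy_key(option, spec):
    """Build a hashable (type, spec_string) key for a constraint."""
    if option is ToyOption.JSON and not isinstance(spec, str):
        # Assumption of this sketch: sort keys so dict ordering cannot split the cache.
        spec = json.dumps(spec, sort_keys=True)
    return (option, spec)


class ToyGrammarCache:
    """Compiles each distinct key once and reuses the result."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._cache = {}
        self.compile_calls = 0

    def get(self, option, spec):
        key = toy_key(option, spec)
        if key not in self._cache:
            self.compile_calls += 1
            self._cache[key] = self._compile_fn(key)
        return self._cache[key]


cache = ToyGrammarCache(lambda key: object())  # object() stands in for a compiled context
first = cache.get(ToyOption.JSON, {"type": "object", "required": ["x"]})
second = cache.get(ToyOption.JSON, {"required": ["x"], "type": "object"})
```

Here `first` and `second` are the same object, and the compile function ran once.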
2. The contract: two abstract classes
backend_types.py is the whole design in 136 lines — read it before anything else:
- StructuredOutputOptions (:19): the six request types (JSON, JSON_OBJECT, REGEX, GRAMMAR, CHOICE, STRUCTURAL_TAG).
- StructuredOutputGrammar (:31): per-request state. Five methods carry the whole feature: accept_tokens (advance state), validate_tokens (check without advancing; used to vet spec-decode drafts), rollback(n) (un-advance; spec-decode rejection), fill_bitmask(tensor, index) (write this request's allowed-token bits into row index), is_terminated (grammar reached an accepting end state).
- StructuredOutputBackend (:99): engine-level: compile_grammar(type, spec) → StructuredOutputGrammar and allocate_token_bitmask(max_seqs).
That rollback is in the base interface tells you spec decode wasn't bolted on — the
contract was designed so constraints and speculation compose.
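The shape of the two interfaces can be sketched with Python's abc module. The method names follow the list above; the class names `ToyGrammar` and `ToyBackend`, the argument types, and the return types are simplifying assumptions, so check backend_types.py for the real signatures.

```python
from abc import ABC, abstractmethod
from typing import Any


class ToyGrammar(ABC):
    """Per-request state (sketch of the StructuredOutputGrammar role)."""

    @abstractmethod
    def accept_tokens(self, tokens: list[int]) -> bool:
        """Advance the state; return False if a token is not allowed."""

    @abstractmethod
    def validate_tokens(self, tokens: list[int]) -> list[int]:
        """Return the longest valid prefix without advancing the state."""

    @abstractmethod
    def rollback(self, num_tokens: int) -> None:
        """Undo the last num_tokens accepted tokens."""

    @abstractmethod
    def fill_bitmask(self, bitmask: Any, index: int) -> None:
        """Write this request's allowed-token bits into row `index`."""

    @abstractmethod
    def is_terminated(self) -> bool:
        """True once the grammar has reached an accepting end state."""


class ToyBackend(ABC):
    """Engine-level factory (sketch of the StructuredOutputBackend role)."""

    @abstractmethod
    def compile_grammar(self, request_type: str, spec: str) -> ToyGrammar:
        """Compile a spec into fresh per-request grammar state."""

    @abstractmethod
    def allocate_token_bitmask(self, max_num_seqs: int) -> Any:
        """Allocate the shared bitmask buffer for a batch."""
```

A concrete backend has to implement every abstract method before it can be instantiated, which is what makes the engines interchangeable behind one contract.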
3. The manager: compile off the hot path
vllm/v1/structured_output/__init__.py:36 — class StructuredOutputManager, owned by the
scheduler (scheduler.py:90), not the workers. Compile and bitmask-fill happen on the
scheduler side; only the finished numpy bitmask is shipped to GPU workers.
- grammar_init (:115): called when a constrained request arrives. Lazily instantiates the single engine-wide backend (xgrammar / guidance / outlines / lm-format-enforcer; note the comment: one backend per engine, not per request), then submits _create_grammar to a ThreadPoolExecutor: the request's grammar field holds a Future until compilation lands.
- request.py:60, the grammar property, resolves that Future: a request whose grammar isn't ready yet is simply not schedulable (the scheduler skips it; search structured_output_request.grammar in scheduler.py). Compile latency costs that one request TTFT, never the engine loop.
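The Future-until-ready behaviour described above can be sketched with the standard library alone. `ToyRequest`, `schedulable`, and the gated compile function are hypothetical names for illustration; the real manager and scheduler carry more state than this.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class ToyRequest:
    def __init__(self, req_id: str, spec: str) -> None:
        self.req_id = req_id
        self.spec = spec
        self._grammar = None  # holds a Future first, the compiled result later

    @property
    def grammar(self):
        # Resolve the Future once it is done; until then report "not ready".
        if isinstance(self._grammar, Future):
            if not self._grammar.done():
                return None
            self._grammar = self._grammar.result()
        return self._grammar


def grammar_init(executor, request, compile_fn) -> Future:
    """Submit compilation to the pool and park the Future on the request."""
    future = executor.submit(compile_fn, request.spec)
    request._grammar = future
    return future


def schedulable(requests):
    """The scheduler side: skip requests whose grammar is not ready yet."""
    return [r.req_id for r in requests if r.grammar is not None]


def demo():
    gate = threading.Event()

    def slow_compile(spec):
        gate.wait()  # simulate a slow compile that we release explicitly
        return f"compiled:{spec}"

    with ThreadPoolExecutor(max_workers=1) as executor:
        req = ToyRequest("r1", "[0-9]+")
        future = grammar_init(executor, req, slow_compile)
        before = schedulable([req])  # compile still blocked, request is skipped
        gate.set()
        future.result(timeout=5)     # wait for the compile to finish
        after = schedulable([req])
    return before, after
```

The scheduling loop never blocks on the compile; it only asks whether the grammar is ready.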
4. The bitmask: one row per position, spec-decode included
grammar_bitmask (__init__.py:204) is the heart. Per step the scheduler calls
Scheduler.get_grammar_bitmask (scheduler.py:1259), which collects the scheduled
constrained request IDs and delegates here. What to notice:
- Allocation (once): max_num_seqs × (1 + num_speculative_tokens) rows, one row per possible sampled position, not per request. With spec decode, request r drafting tokens d1..dk contributes k+1 rows: mask for the state before d1, before d2, …, before the bonus token.
- The spec-decode dance (the serial path): for each draft token it fills a row, then accept_tokens([token]) to advance the state, counting state_advancements; after the last row it calls grammar.rollback(state_advancements). The grammar temporarily pretends the drafts were accepted to compute their masks, then rewinds, because the real accept/reject verdict belongs to the rejection sampler (Phase 8).
- Parallel fill: above fill_bitmask_parallel_threshold (non-spec case), requests are batched to the executor in groups; bitmask filling is pure CPU work and parallelizes.
- Serialization: the tensor is returned as numpy (.numpy(), see the comment) because ndarray serializes much faster than a torch tensor on the way to workers; it travels in GrammarOutput (scheduler.py:1281).
- should_advance (:322) / should_fill_bitmask (:302): the reasoning-model gate. While a model is inside its thinking section, the constraint is suspended (the mask row is set to all-ones via _full_mask) and the automaton doesn't advance; enforcement begins when the reasoning parser (Phase 16) says reasoning ended.
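The allocation rule and the fill, accept, rollback sequence can be traced on a toy grammar. This is a sketch under stated assumptions: `SeqGrammar` accepts a single fixed token sequence, and `allocate_bitmask`, `fill_rows_for_request`, and `allowed_tokens` are hypothetical helpers, not vLLM functions.

```python
import numpy as np


def allocate_bitmask(max_num_seqs, num_spec_tokens, vocab_size):
    """One row per possible sampled position, 32 vocab entries per int32 word."""
    rows = max_num_seqs * (1 + num_spec_tokens)
    words = (vocab_size + 31) // 32
    return np.zeros((rows, words), dtype=np.int32)


class SeqGrammar:
    """Toy grammar that only accepts one fixed token sequence."""

    def __init__(self, sequence):
        self.sequence = list(sequence)
        self.pos = 0

    def accept_tokens(self, tokens):
        for t in tokens:
            if self.pos >= len(self.sequence) or t != self.sequence[self.pos]:
                return False
            self.pos += 1
        return True

    def rollback(self, num_tokens):
        self.pos -= num_tokens

    def fill_bitmask(self, bitmask, index):
        row = bitmask[index].view(np.uint32)  # unsigned view of the same memory
        row[:] = 0
        if self.pos < len(self.sequence):
            t = self.sequence[self.pos]
            row[t // 32] |= np.uint32(1 << (t % 32))


def fill_rows_for_request(grammar, draft_tokens, bitmask, first_row):
    """Fill len(draft_tokens) + 1 rows, then rewind the grammar."""
    state_advancements = 0
    row = first_row
    for token in list(draft_tokens) + [None]:  # None marks the bonus position
        grammar.fill_bitmask(bitmask, row)
        row += 1
        if token is not None and grammar.accept_tokens([token]):
            state_advancements += 1
    if state_advancements:
        grammar.rollback(state_advancements)
    return row  # first free row for the next request


def allowed_tokens(bitmask, row, vocab_size):
    words = bitmask[row].view(np.uint32)
    return [t for t in range(vocab_size) if (int(words[t // 32]) >> (t % 32)) & 1]


grammar = SeqGrammar([3, 40, 5, 7])
bitmask = allocate_bitmask(max_num_seqs=2, num_spec_tokens=2, vocab_size=64)
next_row = fill_rows_for_request(grammar, [3, 40], bitmask, first_row=0)
```

With two draft tokens the request fills three rows (before the first draft, before the second, before the bonus token), and `grammar.pos` is back at its starting value afterwards.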
5. The runner: reorder and apply
vllm/v1/structured_output/utils.py:44 — apply_grammar_bitmask(scheduler_output, grammar_output, input_batch, logits), called from the GPU model runner right before
sampling (gpu_model_runner.py:4359). Two jobs:
- Reorder: the bitmask rows are in the scheduler's request order; the runner's batch order differs, and spec-decode offsets each request's logit rows. The function builds struct_out_req_batch_indices walking input_batch.req_ids with a cumulative_offset of spec tokens, then scatters rows into a sorted_bitmask sized [logits.rows, words] (unconstrained rows = all -1, every bit set = all-allowed).
- Apply: xgr.apply_token_bitmask_inplace(logits, bitmask, indices=out_indices) is one fused kernel that writes −inf into every disallowed logit. 32 vocab entries per int32 word is why the mask is cheap to ship and apply.
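Both jobs can be imitated in NumPy to see what the packed mask does. This is a slow reference sketch, not the fused kernel: `batch_rows`, `pack_row`, and `reorder_and_apply` are hypothetical helpers, and the row arithmetic assumes each request owns 1 + (its spec-token count) consecutive logit rows.

```python
import numpy as np


def batch_rows(batch_req_ids, spec_token_counts, mask_order):
    """Map bitmask rows (scheduler order) to logit rows (runner batch order)."""
    start, cumulative_offset = {}, 0
    for i, req_id in enumerate(batch_req_ids):
        start[req_id] = i + cumulative_offset  # first logit row of this request
        cumulative_offset += spec_token_counts.get(req_id, 0)
    rows = []
    for req_id in mask_order:
        n = 1 + spec_token_counts.get(req_id, 0)
        rows.extend(range(start[req_id], start[req_id] + n))
    return rows


def pack_row(allowed_ids, vocab_size):
    """Pack allowed token ids into int32 words, 32 tokens per word."""
    words = np.zeros((vocab_size + 31) // 32, dtype=np.uint32)
    for t in allowed_ids:
        words[t // 32] |= np.uint32(1 << (t % 32))
    return words.view(np.int32)


def reorder_and_apply(logits, grammar_bitmask, rows):
    num_rows, vocab = logits.shape
    # Unconstrained rows stay at -1: every bit set, every token allowed.
    sorted_bitmask = np.full((num_rows, grammar_bitmask.shape[1]), -1, dtype=np.int32)
    sorted_bitmask[rows] = grammar_bitmask  # scatter into batch order
    token_ids = np.arange(vocab)
    words = sorted_bitmask.view(np.uint32)[:, token_ids // 32]
    allowed = ((words >> (token_ids % 32).astype(np.uint32)) & 1).astype(bool)
    logits[~allowed] = -np.inf  # disallowed tokens can never be sampled
    return logits


vocab = 40
rows = batch_rows(["a", "b", "c"], {"a": 2}, mask_order=["c", "a"])
grammar_bitmask = np.stack([
    pack_row([7], vocab),      # request c
    pack_row([1, 33], vocab),  # request a, before draft 1
    pack_row([2], vocab),      # request a, before draft 2
    pack_row([39], vocab),     # request a, bonus position
])
masked = reorder_and_apply(np.zeros((5, vocab)), grammar_bitmask, rows)
```

Request b has no constraint, so its logit row keeps every token finite, while each constrained row keeps only the tokens whose bit is set.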
6. The xgrammar backend
backend_xgrammar.py:35 — class XgrammarBackend. compile_grammar (:77) is a clean
switch over the six request types: compile_json_schema / compile_regex /
compile_grammar (EBNF) / compile_structural_tag, each returning a compiled context
ctx. CHOICE never reaches here as such — choices are converted to a grammar upstream.
Then:
    return XgrammarGrammar(
        matcher=xgr.GrammarMatcher(ctx, max_rollback_tokens=self.num_speculative_tokens),
        vocab_size=self.vocab_size, ctx=ctx)
max_rollback_tokens is sized to the spec-decode draft length: the compose-with-Phase-8
contract again, now at the C++ matcher level. XgrammarGrammar (:132) is a thin wrapper:
accept_tokens → matcher.accept_token loop, fill_bitmask (:191) →
matcher.fill_next_token_bitmask(bitmask, idx), rollback → matcher.rollback.
The actual FSM/pushdown machinery — and the compile-time token classification from the
guide's Step 4 — lives inside the xgrammar library; what vLLM owns is the plumbing you just
traced. Also skim has_xgrammar_unsupported_json_features (:221) and
validate_xgrammar_grammar (:268): unsupported schema features are rejected at the
front door (processor), not at compile time — fail fast, fail in the API layer.
backend_guidance.py implements the same two interfaces over llguidance (better coverage of
exotic JSON-schema features, lazy-computed masks); backend_outlines.py and
backend_lm_format_enforcer.py likewise. One contract, four interchangeable engines — the
same backend-registry pattern you saw for attention (Phase 4) and quantization (Phase 6).
Reading checklist
- backend_types.py: why are validate_tokens and rollback in the per-request interface? Which phase forces their existence?
- grammar_init: what exactly is async, and what state is a request in while its grammar compiles?
- grammar_bitmask: why max_num_seqs × (1 + num_spec_tokens) rows? Walk the fill→accept→…→rollback sequence for one request with 2 draft tokens.
- apply_grammar_bitmask: why is reordering needed, and what does an all -1 row mean?
- XgrammarBackend.compile_grammar: where does max_rollback_tokens come from?
- In scheduler.py:968, why is a request with is_prefill_chunk excluded from bitmask generation? (Hint: which step actually samples a token?)
Now build it: 02-mini-build.md, then the labs.