Phase 12 — Interview Questions: Structured Outputs
Staff/principal-level questions. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)
Q1. How does vLLM guarantee valid JSON output?
Model answer
It compiles the JSON schema into a grammar automaton (xgrammar: a pushdown automaton, since JSON nests), and at each decode step computes a bitmask of tokens that keep the output grammar-valid, applying −inf to all illegal logits before sampling. Softmax renormalizes over the legal set, so the model still expresses preference among valid tokens — but invalid output is impossible by construction, not just unlikely. One grammar state per request, advanced as tokens are accepted.
Q2. Grammars are over characters; models emit multi-character tokens. How is that bridged efficiently?
Model answer
A token is legal iff its character sequence keeps the automaton alive from the current state. Checking that naively is vocab × token-length automaton steps per decode step. The production trick (xgrammar) is compile-time token classification: for each automaton context, precompute which tokens are unconditionally legal/illegal and pack them into a bitmask (vocab/32 int32 words); only a small context-dependent remainder (e.g. tokens interacting with the stack) is checked at runtime. Per-step cost becomes "copy a precomputed row + check a few stragglers."
Q3. Where does grammar compilation run, and why does that placement matter?
Model answer
On the scheduler side, in a thread-pool executor (StructuredOutputManager.grammar_init).
The request's grammar field holds a Future, and the scheduler won't schedule the request
until it resolves — so a 200 ms schema compile costs that request TTFT but never stalls the
engine loop or other requests' steps. Compiled grammars are keyed by (type, spec) so
repeated schemas don't recompile. The bitmask is also filled scheduler-side and shipped to
workers as numpy (cheap serialization); the GPU only applies it.
Q4. How do structured outputs compose with speculative decoding?
Model answer
The bitmask carries one row per sampled position: for k draft tokens you need masks for
the state before each draft plus the bonus position, so allocation is
max_num_seqs × (1+k) rows. To compute row i the grammar tentatively accept_tokens's
drafts 1..i−1, and after filling all rows it rollback's — the true accept/reject verdict
belongs to the rejection sampler post-forward, which then advances the grammar by only the
accepted prefix. rollback is part of the base grammar interface and xgrammar's matcher is
constructed with max_rollback_tokens=num_speculative_tokens — composition was designed in,
not patched on.
Q5. A customer says constrained outputs are "valid but dumber." Diagnose.
Model answer
Real effect. Masking renormalizes but the model wasn't conditioned to follow this grammar;
when its preferred continuation is illegal, probability mass shifts to tokens it would
rarely choose, pushing generation off-distribution (worst at awkward token boundaries and
rigid whitespace rules). Mitigations: prompt with the schema and examples so the
high-probability path is also the legal path; relax the grammar where harmless (xgrammar's
any_whitespace); generate reasoning unconstrained and only constrain the final answer —
vLLM's reasoning gate (should_advance) does exactly this for thinking models. Also check
finish_reason="length": truncation, not the mask, is the most common "broken JSON" report.
Q6. Engine design: would you support different grammar backends per request? What does vLLM do and why?
Model answer
vLLM V1 deliberately supports one backend per engine (first constrained request picks it;
see the NOTE in grammar_init). Per-request backends would mean multiple compiled-grammar
caches, multiple bitmask allocation schemes, and validating every request's spec against
every backend's feature matrix — for little gain since backends are interchangeable behind
the two-interface contract (StructuredOutputBackend / StructuredOutputGrammar). The
right extension point is the contract, not per-request dispatch: that's also how attention
and quantization backends are handled.
Rapid-fire
- Mask applied where? Logits, −inf, pre-softmax (one fused kernel,
apply_grammar_bitmask). - Regex needs? FSM. JSON needs? Pushdown (stack). Schema → ? compiled to grammar first.
- Bitmask row size? vocab_size / 32 int32 words. All
-1row = ? unconstrained. - Compile blocking the engine loop? Never — async executor, request waits unscheduled.
- Spec-decode hooks in the grammar interface?
validate_tokens,rollback(n). - What constraints can't fix: truth of content, and
max_tokenstruncation.