Phase 16 — Interview Questions: Serving APIs & Parsers
Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)
Q1. How does vLLM implement tool calling on top of plain text generation?
Model answer
The server injects tool schemas into the prompt (often via the chat template / structured output), then a model-specific tool-call parser extracts the function name and JSON args from the generated text — incrementally during streaming — and emits OpenAI-style tool_calls. Structured output can hard-constrain the args to the schema.
Q2. What's tricky about streaming responses?
Model answer
You must emit incremental deltas while maintaining correct semantics (role, finish_reason), and parse partial content (tool-call JSON, reasoning tags) that spans multiple chunks without committing to an interpretation too early.
Going deeper
The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.