Lab 18-01 — Tune the Knobs: an SLO-Constrained Grid Search [CPU-OK]
Performance engineering is one loop, run with discipline: define metrics, measure a
workload under a config, search the config space under a constraint. This lab has
you build the whole loop on mini_vllm — a simulator that runs an arrival schedule
(requests landing at different times, the thing every previous lab simplified away)
and emits three metrics: per-request TTFT, the worst step's token count (the
ITL-spike proxy), and total steps (throughput). Then grid_search sweeps budget ×
chunk-threshold under a hard spike SLO and returns the best legal config — refusing
loudly when no config qualifies, because a quietly violated SLO is the worst outcome
in the trade. Along the way the tests teach two facts that surprise most tuners: the
chunk threshold is per-request (two chunked prefills still stack in one step —
only the budget caps globally), and chunking can cost zero throughput when a long
decode stream's steps are already there to hide the chunks in.
Contents
- Why this lab exists
- Background: metrics, then search
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Every knob in this course got its own lab; this one is where they meet a workload
— and workloads, not knobs, are what you actually tune for. The arrival schedule is
the lab's quiet upgrade over everything before it: queueing (TTFT now includes
waiting — test_queueing_shows_up_in_ttft), interference between requests that
arrive at different moments, and the SLO-vs-throughput tension that only exists when
both matter at once. The simulator is deliberately the cheapest possible version of
the loop (steps and token counts, no GPUs, milliseconds per run) because the
methodology is the deliverable: lab-02 runs the identical loop with vllm bench and
wall-clocks, and the only thing that changes is the cost of each measurement —
which is exactly why you prototype the search cheap.
The grid search's design choices are the staff-engineer content: the SLO is a
constraint, not a weighted term (latency SLOs are promises, not preferences);
ties break toward lower worst-TTFT (when throughput is equal, take the better latency);
and unsatisfiability raises (test_unsatisfiable_slo_is_loud) — the tuning loop's
version of the course's loud-failure habit, because "best effort" on an impossible
SLO ships a violation with extra steps.
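The three design choices above fit in a few lines. What follows is a minimal sketch, not the lab's reference solution: the `Metrics` field names follow the metric names used in this README, but the `grid_search` signature, the `run` callable, and the toy `fake_run` workload (one 64-token prompt) are assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Sequence, Tuple


@dataclass
class Metrics:
    ttft_steps: List[int]   # per-request steps from arrival to first token
    max_step_tokens: int    # worst step's scheduled tokens (the spike proxy)
    total_steps: int        # schedule length (inverse-throughput proxy)


def grid_search(
    run: Callable[[int, int], Metrics],  # hypothetical: (budget, threshold) -> Metrics
    budgets: Sequence[int],
    thresholds: Sequence[int],
    spike_slo: int,
) -> Tuple[Tuple[int, int], Metrics]:
    legal = []
    for budget, threshold in product(budgets, thresholds):
        m = run(budget, threshold)
        # The SLO is a hard constraint: violating configs are dropped, not penalized.
        if m.max_step_tokens <= spike_slo:
            legal.append(((budget, threshold), m))
    if not legal:
        # Loud failure: never return the least-bad violation.
        raise ValueError(f"no config satisfies max_step_tokens <= {spike_slo}")
    # Fewest steps wins; ties break toward the lower worst-case TTFT.
    return min(legal, key=lambda cm: (cm[1].total_steps, max(cm[1].ttft_steps)))


def fake_run(budget: int, threshold: int) -> Metrics:
    # Toy stand-in for the simulator (an assumption): one 64-token prompt, chunked.
    chunk = min(budget, threshold)
    steps = -(-64 // chunk)  # ceiling division
    return Metrics(ttft_steps=[steps], max_step_tokens=chunk, total_steps=steps)


best_config, best_metrics = grid_search(fake_run, (16, 32, 64), (32, 64), spike_slo=32)
print(best_config, best_metrics.total_steps)  # (32, 32) 2
```

In this toy grid the unchunked config (64, 64) finishes in one step but schedules 64 tokens at once, so the search discards it and returns the fastest config that stays within the cap. Tightening `spike_slo` to 8 leaves no legal config, and the call raises instead of returning anything.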
Background: metrics, then search
The three metrics, and what each proxies:
- ttft_steps: steps from arrival (not admission!) to first token. Queueing, scheduling, prefill: all of it. The user-facing wait.
- max_step_tokens: the worst step's total scheduled tokens ≈ the worst inter-token stall any decoding user felt (Phase 3 lab-05's proxy, now a tunable's objective).
- total_steps: the schedule's length ≈ inverse throughput at fixed step cost. (The proxy's known bias: real steps' wall-clock varies with their token count, so total tokens would weight differently; lab-02's wall-clocks settle it.)
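To make the arrival-versus-admission distinction concrete, here is a small sketch that derives the three metrics from a hand-written step log. The `summarize` helper, the dictionary layout, and the numbers are all invented for this example. The convention that TTFT is 0 when the first token lands in the arrival step is also an assumption; the lab's `simulate` may count differently.

```python
from typing import Dict, List


def summarize(
    arrival_step: Dict[str, int],
    first_token_step: Dict[str, int],
    tokens_per_step: List[int],
) -> dict:
    # TTFT is measured from arrival, so time spent queueing is included.
    ttft = {r: first_token_step[r] - arrival_step[r] for r in arrival_step}
    return {
        "ttft_steps": ttft,
        "max_step_tokens": max(tokens_per_step),  # worst single step
        "total_steps": len(tokens_per_step),      # schedule length
    }


# Made-up log: one sequence runs at a time, so "b" and "c" wait in the queue.
arrivals = {"a": 0, "b": 0, "c": 3}
admitted = {"a": 0, "b": 2, "c": 5}
first_token = {"a": 0, "b": 2, "c": 5}  # assumed: first token in the admission step
step_tokens = [9, 1, 8, 1, 1, 6]

m = summarize(arrivals, first_token, step_tokens)
admission_relative = {r: first_token[r] - admitted[r] for r in arrivals}

print(m["ttft_steps"])      # {'a': 0, 'b': 2, 'c': 2}
print(admission_relative)   # {'a': 0, 'b': 0, 'c': 0}
```

Measured from admission, every request looks instant; measured from arrival, the two queued requests show the wait their users actually experienced.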
The two knobs swept are the course's latency dial (threshold) and throughput dial (budget) — and the search space is tiny on purpose. Real tuning fails far more often from unclear objectives than from undersized grids; get the constraint and the tiebreak right first, then enlarge the grid.
Files
- starter.py: Metrics, simulate (arrivals + the Phase 1 lab-04 probe + first-token tracking), grid_search. Your work.
- solution.py: reference.
- test_lab.py: monotonicity, the per-request-vs-global cap lesson, queueing in TTFT, arrival-relative measurement, SLO compliance, and loud unsatisfiability.
Run
LAB_IMPL=starter pytest phase-18-performance-engineering/labs/lab-01-tune-the-knobs -q
pytest phase-18-performance-engineering/labs/lab-01-tune-the-knobs -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| test_throughput_more_budget_never_more_steps | The sanity direction every tuning loop needs before it can be trusted with anything subtle |
| test_spike_threshold_is_per_request_budget_is_global | The two-cap structure: threshold=32 still allows a 65-token step (two chunks + a decode); only budget=40 forces ≤ 40. And the surprise: chunking cost zero steps here, because the 24-token decode stream's steps were already there to hide chunks in. This is Sarathi's piggybacking, measured from the throughput side. A lonely fat prompt, with nowhere to hide, pays the full chunking step-count bill |
| test_queueing_shows_up_in_ttft | max_num_seqs=1 makes later arrivals wait, and the metric sees it; TTFT without queueing is a benchmark fiction |
| test_ttft_is_measured_from_arrival | The zero-point check: an idle engine serves first tokens in the arrival step. Metrics need calibration tests too |
| test_grid_search_respects_the_slo | The constrained search refuses the throughput-optimal-but-violating config: constraints beat preferences |
| test_unsatisfiable_slo_is_loud | An impossible SLO raises; it does not return the least-bad violation |
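The two-cap arithmetic in the second row can be reproduced with a one-step scheduler sketch. This is an illustration under assumptions, not the lab's `simulate`: the function name, the decode-first ordering, and the two 100-token prompts are invented here, chosen so the numbers match the 65-token and 40-token cases in the table.

```python
from typing import List


def step_tokens(prefill_remaining: List[int], num_decodes: int,
                threshold: int, budget: int) -> int:
    """Tokens scheduled in one step: decodes first, then one chunk per prefill."""
    scheduled = min(num_decodes, budget)  # each decode costs one token
    left = budget - scheduled
    for remaining in prefill_remaining:
        # The threshold caps each request's chunk on its own...
        chunk = min(remaining, threshold, left)
        scheduled += chunk
        # ...while the budget is shared by everything in the step.
        left -= chunk
    return scheduled


two_prompts = [100, 100]  # hypothetical prompts still being prefilled

loose = step_tokens(two_prompts, num_decodes=1, threshold=32, budget=10_000)
tight = step_tokens(two_prompts, num_decodes=1, threshold=32, budget=40)
print(loose)  # 65: one decode token plus two 32-token chunks
print(tight)  # 40: the second chunk is trimmed to fit the budget
```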
Hitchhiker's notes
- The hide-the-chunks result generalizes and matters: chunked prefill's throughput cost is max(0, chunk_steps − coexisting_decode_steps)-shaped. Fleets with deep decode streams (chat) chunk nearly free; bursty prefill-only fleets (batch summarization) pay full price, and that's also the fleet that didn't need the latency protection. The knob's cost and its benefit anti-correlate across workloads, which is why per-deployment tuning beats global defaults.
- Arrival schedules are the difference between benchmarks and reality: this lab's three-request workload already produces queueing, interference, and hiding effects no all-at-once batch shows. Real benchmark suites (vllm bench serve) generate Poisson arrivals at a target QPS for the same reason; lab-02 uses exactly that.
- Grid search is the right first search: 6 configs here, exhaustive, done. At real scale (5+ knobs), the same loop wraps Bayesian/successive-halving optimizers, but the metrics, the constraint handling, and the loud unsatisfiability transfer unchanged. The loop is the asset; the optimizer is a plug-in.
- One proxy limitation to carry consciously: step counts can't see fixed per-step overheads (launch costs, scheduler time), so this simulator systematically favors many-small-steps configs vs what wall-clocks will say. Phase 5's whole subject is that bias. Cheap models with known biases, again (Phase 8 lab-04, Phase 15 lab-03): use them to shrink the expensive search, never to replace the final measurement.
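The cost shape named in the first note can be written out directly. The function below is a literal transcription of that expression, with a hypothetical name and made-up inputs; it shows the shape of the trade-off and is not an exact step count for any particular scheduler.

```python
import math


def extra_steps_from_chunking(prompt_tokens: int, threshold: int,
                              coexisting_decode_steps: int) -> int:
    # Number of chunks the prompt is split into at this threshold.
    chunk_steps = math.ceil(prompt_tokens / threshold)
    # Chunks that ride along with existing decode steps add nothing;
    # only the chunks left over show up as extra steps.
    return max(0, chunk_steps - coexisting_decode_steps)


hidden = extra_steps_from_chunking(96, 32, coexisting_decode_steps=24)
exposed = extra_steps_from_chunking(96, 32, coexisting_decode_steps=0)
print(hidden)   # 0: a deep decode stream absorbs all three chunks
print(exposed)  # 3: a lone prompt has nowhere to hide them
```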
Going further
- Add a worst_ttft SLO as a second constraint and find workloads where the two SLOs conflict (spike cap wants small budget; TTFT wants big). This is the multi-objective frontier, met honestly.
- Generate Poisson arrivals (rng.poisson) at increasing rates and plot worst-TTFT vs offered load for two configs: the hockey stick where queueing takes over is the capacity limit, found by simulation. That completes Phase 3 lab-04's going-further.
- Port simulate's probe to count tokens per step and weight total_steps by a per-step cost model (fixed + per-token). Calibrate the two constants against one lab-02 measurement, then re-run the grid. You've built the cheap-model/expensive-measurement two-tier loop production tuning actually uses.
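As a starting point for the last item, a fixed-plus-per-token cost model is a one-line sum over the per-step token counts. The sketch below assumes the probe already yields a list of tokens per step. The function name, the two example schedules, and the constants (arbitrary units, not calibrated against anything) are hypothetical.

```python
from typing import List


def weighted_cost(tokens_per_step: List[int], fixed_cost: int,
                  per_token_cost: int) -> int:
    # Every step pays the fixed overhead once, plus a cost per scheduled token.
    return sum(fixed_cost + per_token_cost * t for t in tokens_per_step)


many_small = [8] * 8   # 64 tokens spread over 8 steps
few_large = [32, 32]   # the same 64 tokens in 2 steps

step_counts = (len(many_small), len(few_large))
per_token_only = (weighted_cost(many_small, 0, 1), weighted_cost(few_large, 0, 1))
with_overhead = (weighted_cost(many_small, 10, 1), weighted_cost(few_large, 10, 1))

print(step_counts)     # (8, 2)
print(per_token_only)  # (64, 64)
print(with_overhead)   # (144, 84)
```

The same two schedules rank differently depending on the constants, which is why the constants need to come from a real measurement before the grid is re-run.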
References
- Phase 1 lab-04 (the probe), Phase 3 labs 01/05 (the two caps and the spike) — the parts this lab assembles.
- upstream/vllm/benchmarks/ and vllm bench: the production version of this loop (lab-02).
- vLLM docs, Optimization and Tuning: the knobs' official guidance, now checkable against your own search: https://docs.vllm.ai/en/latest/configuration/optimization.html
- Agrawal et al., Sarathi-Serve (OSDI 2024) — the piggybacking result your zero-cost-chunking test measured: https://arxiv.org/abs/2403.02310