Phase 18 Labs — Performance Engineering
Two labs, one loop: define metrics → measure a workload under a config → search
under an SLO constraint. The loop is first built cheap in lab-01, as a simulator
over mini_vllm with arrival schedules, spike proxies, and a grid search that
refuses impossible SLOs. It is then run for real in lab-02 with vllm bench serve,
wall-clock timings, percentile distributions, and the tuning report as the
deliverable artifact.
CPU labs follow the standard contract: starter.py (your work), solution.py
(reference), test_lab.py (the spec). By default the tests run against the
solution; setting LAB_IMPL=starter grades yours instead.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-18-performance-engineering/labs -m "not gpu"
# Grade yourself:
LAB_IMPL=starter pytest phase-18-performance-engineering/labs/lab-01-tune-the-knobs -q
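One way a test_lab.py could honor LAB_IMPL is sketched below. This is an assumption about the mechanism, not the repo's actual loader: `resolve_impl_name` and `load_impl` are hypothetical names, and the sketch assumes starter.py and solution.py sit next to the test file and are importable by module name.

```python
import importlib
import os

ALLOWED = ("starter", "solution")


def resolve_impl_name(env, default="solution"):
    """Pick the implementation module name from an environment mapping.

    Hypothetical helper: falls back to the reference solution when
    LAB_IMPL is unset, and rejects anything other than the two known names.
    """
    name = env.get("LAB_IMPL", default)
    if name not in ALLOWED:
        raise ValueError(f"LAB_IMPL must be one of {ALLOWED}, got {name!r}")
    return name


def load_impl():
    """Import starter.py or solution.py depending on LAB_IMPL (assumed layout)."""
    return importlib.import_module(resolve_impl_name(os.environ))
```

A test module written this way would call `load_impl()` once at import time and run every assertion against whichever module it returns, so the same spec grades both files.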
Contents
- lab-01-tune-the-knobs [CPU-OK]
- lab-02-benchmark-real-vllm [GPU-OPT]
- What you can do after this phase
Labs
lab-01-tune-the-knobs [CPU-OK]
The loop, built cheap: a simulator running arrival schedules (queueing finally enters the course) with three metrics — TTFT-from-arrival, worst-step tokens (the spike proxy), total steps — and an SLO-constrained grid search that breaks ties toward latency and raises on unsatisfiable SLOs. Two measured surprises: the chunk threshold is per-request (only the budget caps a step globally), and chunking can cost zero throughput when decode steps are already there to hide chunks in. Skills: constraints beat preferences; metric calibration tests; cheap models with known biases shrink expensive searches.
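The shape of an SLO-constrained grid search can be sketched in a few lines. This is a minimal illustration, not the lab's reference solution: `Metrics`, `UnsatisfiableSLO`, `grid_search`, and `toy_evaluate` are hypothetical names, the objective (fewest total steps among feasible configs) is an assumption, and the metric values in the table are made up rather than simulator output.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Metrics:
    ttft: float             # TTFT measured from arrival
    worst_step_tokens: int  # spike proxy
    total_steps: int        # fewer steps is treated as better throughput


class UnsatisfiableSLO(ValueError):
    """Raised when no config in the grid meets the SLO."""


def grid_search(configs, evaluate, max_ttft, max_step_tokens):
    """Return the feasible config with the fewest steps, ties broken by lower TTFT."""
    feasible = []
    for cfg in configs:
        m = evaluate(cfg)
        # The SLO is a hard constraint: infeasible configs are dropped, not penalized.
        if m.ttft <= max_ttft and m.worst_step_tokens <= max_step_tokens:
            feasible.append((m.total_steps, m.ttft, cfg))
    if not feasible:
        raise UnsatisfiableSLO(
            f"no config meets ttft<={max_ttft}, step tokens<={max_step_tokens}"
        )
    return min(feasible, key=lambda t: (t[0], t[1]))[2]


# Made-up metrics keyed by (budget, chunk); a stand-in for a real simulator run.
TOY_TABLE = {
    (256, 64): Metrics(ttft=6.0, worst_step_tokens=256, total_steps=40),
    (256, 128): Metrics(ttft=4.0, worst_step_tokens=256, total_steps=40),
    (512, 64): Metrics(ttft=5.0, worst_step_tokens=512, total_steps=36),
    (512, 128): Metrics(ttft=3.0, worst_step_tokens=512, total_steps=36),
}


def toy_evaluate(cfg):
    return TOY_TABLE[cfg]


grid = list(product([256, 512], [64, 128]))
print(grid_search(grid, toy_evaluate, max_ttft=6.0, max_step_tokens=512))
```

With the loose SLO above, both 512-budget configs tie on steps and the lower-TTFT one wins. Tightening `max_step_tokens` to 256 changes the answer, and a `max_ttft` no config can meet raises instead of returning a least-bad choice, which is the "constraints beat preferences" point.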
lab-02-benchmark-real-vllm [GPU-OPT]
The loop, run for real: vllm bench serve sweeps with warm servers, two runs per
config, percentiles everywhere, the rate sweep that finds the knee first — and
the tuning report as the artifact (workload, table, distributions, recommendation
with its trade named). The captured sweep reconciles every row against the CPU labs
that predicted it. Skills: the four methodology checks; conservation of suffering
(admission knobs relocate latency); macro before micro; benchmark at the knee.
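Two of the ideas above, reporting percentiles and finding the knee from a rate sweep, can be sketched without a GPU. This is a hedged illustration only: `percentile` and `find_knee` are hypothetical helpers, the nearest-rank method and the "p99 within a factor of the lowest-rate baseline" knee rule are assumptions rather than the lab's definitions, and every number is made up rather than taken from a captured sweep.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    xs = sorted(samples)
    rank = (p * len(xs) + 99) // 100  # integer ceil(p * n / 100)
    return xs[rank - 1]


def find_knee(sweep, factor=2.0):
    """Return the highest rate before p99 exceeds factor * the lowest-rate p99.

    sweep is an iterable of (request_rate, p99_latency) pairs.
    """
    points = sorted(sweep)
    baseline = points[0][1]
    knee = points[0][0]
    for rate, p99 in points:
        if p99 > factor * baseline:
            break
        knee = rate
    return knee


# Made-up latency samples (ms) for one config; the outlier is why means mislead.
latencies_ms = [10, 12, 11, 13, 50, 12, 11, 10, 14, 200]
summary = {p: percentile(latencies_ms, p) for p in (50, 90, 99)}
print(summary)

# Made-up rate sweep: (requests per second, p99 latency in ms).
rate_sweep = [(1, 100), (2, 105), (4, 120), (8, 400), (16, 2000)]
print(find_knee(rate_sweep))
```

In this toy data the median sits at 12 ms while p99 is 200 ms, and the knee lands at the last rate before p99 more than doubles. Benchmarking well below that rate hides queueing effects; benchmarking well above it mostly measures the queue.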
What you can do after this phase
Run a tuning engagement end to end: state the workload, find the knee, sweep one knob at a time with honest variance, report distributions, and recommend with the trade named — having prototyped the search cheaply enough to afford it. You can also audit anyone else's benchmark in about a minute (workload? warm? percentiles? one knob?), which is its own kind of superpower. Phase 19 sends everything upstream.