<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>InferenceGym — Master Build Document</title>
<style>
@import url('https://fonts.googleapis.com/css2?family=Geist+Mono:wght@300;400;500;600;700&family=Syne:wght@400;500;600;700;800&display=swap');
:root {
--bg: #070809;
--bg1: #0c0e11;
--bg2: #111418;
--bg3: #171c22;
--bg4: #1d242d;
--border: rgba(255,255,255,0.06);
--border2: rgba(255,255,255,0.10);
--border3: rgba(255,255,255,0.15);
--text: #dce3ec;
--text2: #7a8494;
--text3: #424c5c;
--text4: #2c3340;
--green: #22d3a0;
--green2: #0fa870;
--gdim: rgba(34,211,160,0.08);
--gborder: rgba(34,211,160,0.20);
--blue: #5b9cf6;
--bdim: rgba(91,156,246,0.08);
--bborder: rgba(91,156,246,0.20);
--amber: #f0a832;
--adim: rgba(240,168,50,0.08);
--aborder: rgba(240,168,50,0.20);
--red: #f05c5c;
--rdim: rgba(240,92,92,0.08);
--rborder: rgba(240,92,92,0.20);
--purple: #a78bfa;
--pdim: rgba(167,139,250,0.08);
--pborder: rgba(167,139,250,0.20);
--cyan: #38bdf8;
--cdim: rgba(56,189,248,0.08);
--cborder: rgba(56,189,248,0.20);
--mono: 'Geist Mono', monospace;
--sans: 'Syne', sans-serif;
}
*, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
html { scroll-behavior: smooth; }
body {
background: var(--bg);
color: var(--text);
font-family: var(--mono);
font-size: 13px;
line-height: 1.7;
}
/* ── LAYOUT ── */
.wrap { max-width: 1100px; margin: 0 auto; padding: 56px 40px 120px; }
/* ── COVER ── */
.cover {
position: relative;
border: 1px solid var(--border2);
border-radius: 16px;
overflow: hidden;
margin-bottom: 56px;
background: var(--bg1);
}
.cover-gradient {
position: absolute;
inset: 0;
background:
radial-gradient(ellipse 60% 50% at 10% 20%, rgba(34,211,160,0.06) 0%, transparent 70%),
radial-gradient(ellipse 50% 60% at 90% 80%, rgba(91,156,246,0.05) 0%, transparent 70%);
pointer-events: none;
}
.cover-top-bar {
height: 2px;
background: linear-gradient(90deg, var(--green), var(--blue), var(--purple), var(--amber));
}
.cover-inner { padding: 48px 52px 52px; }
.cover-eyebrow {
display: flex;
align-items: center;
gap: 12px;
margin-bottom: 24px;
}
.eyebrow-tag {
font-family: var(--mono);
font-size: 10px;
font-weight: 600;
letter-spacing: 0.12em;
text-transform: uppercase;
padding: 4px 10px;
border-radius: 4px;
}
.et-green { color: var(--green); background: var(--gdim); border: 1px solid var(--gborder); }
.et-blue { color: var(--blue); background: var(--bdim); border: 1px solid var(--bborder); }
.et-amber { color: var(--amber); background: var(--adim); border: 1px solid var(--aborder); }
.et-red { color: var(--red); background: var(--rdim); border: 1px solid var(--rborder); }
.et-purple { color: var(--purple); background: var(--pdim); border: 1px solid var(--pborder); }
.et-cyan { color: var(--cyan); background: var(--cdim); border: 1px solid var(--cborder); }
.cover h1 {
font-family: var(--sans);
font-size: 44px;
font-weight: 800;
letter-spacing: -0.03em;
line-height: 1.1;
color: #fff;
margin-bottom: 16px;
}
.cover h1 em { font-style: normal; color: var(--green); }
.cover-desc {
font-family: var(--mono);
font-size: 13px;
color: var(--text2);
max-width: 680px;
line-height: 1.75;
margin-bottom: 36px;
}
.cover-stats {
display: grid;
grid-template-columns: repeat(6, 1fr);
gap: 0;
border: 1px solid var(--border2);
border-radius: 10px;
overflow: hidden;
}
.stat-cell {
padding: 14px 18px;
border-right: 1px solid var(--border);
}
.stat-cell:last-child { border-right: none; }
.stat-label { font-size: 9px; font-weight: 600; letter-spacing: 0.10em; text-transform: uppercase; color: var(--text3); margin-bottom: 4px; }
.stat-val { font-size: 13px; font-weight: 600; color: var(--text); }
.stat-val.green { color: var(--green); }
.stat-val.amber { color: var(--amber); }
.stat-val.red { color: var(--red); }
/* ── TOC ── */
.toc-box {
background: var(--bg1);
border: 1px solid var(--border2);
border-radius: 12px;
padding: 28px 32px;
margin-bottom: 56px;
}
.toc-title {
font-family: var(--sans);
font-size: 11px;
font-weight: 700;
letter-spacing: 0.14em;
text-transform: uppercase;
color: var(--text3);
margin-bottom: 20px;
}
.toc-phases {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 6px;
}
.toc-item {
display: flex;
align-items: center;
gap: 10px;
padding: 8px 12px;
border-radius: 6px;
text-decoration: none;
transition: background 0.15s;
border: 1px solid transparent;
}
.toc-item:hover { background: var(--bg3); border-color: var(--border2); }
.toc-num {
font-size: 10px;
font-weight: 700;
color: var(--text3);
width: 28px;
flex-shrink: 0;
}
.toc-name { font-size: 12px; color: var(--text2); }
.toc-badge {
margin-left: auto;
font-size: 9px;
font-weight: 700;
padding: 2px 7px;
border-radius: 3px;
letter-spacing: 0.06em;
flex-shrink: 0;
}
/* ── SECTION ── */
.section { margin-bottom: 64px; scroll-margin-top: 32px; }
.section-header {
display: flex;
align-items: flex-start;
gap: 20px;
margin-bottom: 28px;
padding-bottom: 20px;
border-bottom: 1px solid var(--border);
}
.section-num {
font-family: var(--mono);
font-size: 11px;
font-weight: 700;
color: var(--text4);
padding-top: 4px;
flex-shrink: 0;
width: 32px;
}
.section-meta { flex: 1; }
.section-title {
font-family: var(--sans);
font-size: 24px;
font-weight: 700;
letter-spacing: -0.02em;
color: #fff;
margin-bottom: 6px;
}
.section-sub { font-size: 12px; color: var(--text2); line-height: 1.65; }
/* ── PHASE CARD ── */
.phase-card {
border-radius: 12px;
border: 1px solid var(--border2);
overflow: hidden;
margin-bottom: 20px;
}
.phase-header {
display: flex;
align-items: center;
gap: 16px;
padding: 18px 22px;
border-bottom: 1px solid var(--border);
}
.phase-icon {
font-size: 18px;
width: 36px;
text-align: center;
}
.phase-label {
font-family: var(--mono);
font-size: 10px;
font-weight: 600;
letter-spacing: 0.10em;
text-transform: uppercase;
margin-bottom: 3px;
}
.phase-name {
font-family: var(--sans);
font-size: 16px;
font-weight: 700;
color: #fff;
}
.phase-meta {
margin-left: auto;
display: flex;
flex-direction: column;
align-items: flex-end;
gap: 4px;
}
.phase-days { font-size: 11px; color: var(--text2); }
.phase-body { padding: 22px; background: var(--bg1); }
.phase-desc { font-size: 12px; color: var(--text2); line-height: 1.7; margin-bottom: 20px; }
/* ── DELIVERABLE BOX ── */
.deliverable-box {
background: var(--bg2);
border: 1px solid var(--border);
border-radius: 8px;
padding: 14px 16px;
margin-bottom: 14px;
}
.deliverable-title {
font-size: 11px;
font-weight: 700;
letter-spacing: 0.08em;
text-transform: uppercase;
color: var(--text3);
margin-bottom: 10px;
}
.deliverable-list { list-style: none; display: flex; flex-direction: column; gap: 6px; }
.deliverable-list li {
display: flex;
align-items: flex-start;
gap: 10px;
font-size: 12px;
color: var(--text2);
line-height: 1.6;
}
.dl-bullet {
width: 16px;
height: 16px;
border-radius: 3px;
flex-shrink: 0;
margin-top: 1px;
display: flex;
align-items: center;
justify-content: center;
font-size: 9px;
font-weight: 700;
}
.dl-green { background: var(--gdim); color: var(--green); border: 1px solid var(--gborder); }
.dl-blue { background: var(--bdim); color: var(--blue); border: 1px solid var(--bborder); }
.dl-amber { background: var(--adim); color: var(--amber); border: 1px solid var(--aborder); }
.dl-red { background: var(--rdim); color: var(--red); border: 1px solid var(--rborder); }
.dl-purple { background: var(--pdim); color: var(--purple);border: 1px solid var(--pborder); }
.dl-cyan { background: var(--cdim); color: var(--cyan); border: 1px solid var(--cborder); }
.dl-text strong { color: var(--text); font-weight: 600; display: block; }
/* ── MODULE SPEC ── */
.module-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 14px; margin-bottom: 14px; }
.module-card {
background: var(--bg2);
border: 1px solid var(--border);
border-radius: 10px;
overflow: hidden;
}
.module-card-header {
padding: 12px 16px;
border-bottom: 1px solid var(--border);
display: flex;
align-items: center;
justify-content: space-between;
}
.module-card-name { font-size: 13px; font-weight: 600; color: #fff; }
.module-card-file { font-size: 10px; color: var(--text3); font-family: var(--mono); }
.module-card-body { padding: 14px 16px; }
.module-card-desc { font-size: 12px; color: var(--text2); line-height: 1.65; margin-bottom: 12px; }
.spec-list { list-style: none; display: flex; flex-direction: column; gap: 5px; }
.spec-list li {
display: flex;
gap: 8px;
font-size: 11px;
color: var(--text2);
line-height: 1.5;
}
.spec-list li::before {
content: '→';
color: var(--text3);
flex-shrink: 0;
font-size: 10px;
margin-top: 1px;
}
.spec-list li code { color: var(--green); font-size: 10px; }
/* ── CODE BLOCK ── */
.code-block {
background: var(--bg3);
border: 1px solid var(--border);
border-radius: 10px;
overflow: hidden;
margin-bottom: 14px;
font-size: 11.5px;
line-height: 1.65;
}
.code-block-header {
display: flex;
align-items: center;
justify-content: space-between;
padding: 8px 14px;
border-bottom: 1px solid var(--border);
background: var(--bg4);
}
.code-lang { font-size: 9px; font-weight: 700; letter-spacing: 0.10em; text-transform: uppercase; color: var(--text3); }
.code-file { font-size: 10px; color: var(--text3); }
.code-body { padding: 16px 18px; overflow-x: auto; }
.code-body pre { margin: 0; white-space: pre; }
.kw { color: var(--purple); }
.fn { color: var(--blue); }
.st { color: var(--green); }
.cm { color: var(--text3); font-style: italic; }
.nm { color: var(--amber); }
.dc { color: var(--cyan); }
.tp { color: #e879f9; }
.op { color: var(--text2); }
/* ── TABLE ── */
.table-wrap { overflow-x: auto; border-radius: 10px; border: 1px solid var(--border); margin-bottom: 14px; }
table { width: 100%; border-collapse: collapse; font-size: 12px; }
th {
font-family: var(--mono);
font-size: 9px;
font-weight: 700;
letter-spacing: 0.10em;
text-transform: uppercase;
color: var(--text3);
padding: 10px 14px;
border-bottom: 1px solid var(--border2);
background: var(--bg3);
text-align: left;
white-space: nowrap;
}
td {
padding: 10px 14px;
border-bottom: 1px solid var(--border);
color: var(--text2);
vertical-align: top;
line-height: 1.5;
}
td strong { color: var(--text); font-weight: 600; }
td code { font-family: var(--mono); font-size: 11px; color: var(--green); }
tr:last-child td { border-bottom: none; }
/* ── ALERT ── */
.alert {
border-radius: 8px;
padding: 14px 16px;
margin-bottom: 14px;
font-size: 12px;
line-height: 1.65;
border: 1px solid;
}
.alert-title { font-weight: 700; font-size: 11px; letter-spacing: 0.06em; text-transform: uppercase; margin-bottom: 5px; }
.alert-green { background: var(--gdim); border-color: var(--gborder); color: #a3f0d8; }
.alert-amber { background: var(--adim); border-color: var(--aborder); color: #f5d49a; }
.alert-blue { background: var(--bdim); border-color: var(--bborder); color: #a8c7fa; }
.alert-red { background: var(--rdim); border-color: var(--rborder); color: #f5a0a0; }
.alert-purple { background: var(--pdim); border-color: var(--pborder); color: #d4c0ff; }
.alert-cyan { background: var(--cdim); border-color: var(--cborder); color: #a0dff5; }
/* ── TIMELINE ── */
.timeline { position: relative; }
.tl-row {
display: flex;
gap: 0;
margin-bottom: 0;
}
.tl-left {
width: 90px;
flex-shrink: 0;
text-align: right;
padding-right: 20px;
padding-top: 16px;
}
.tl-day-label { font-size: 10px; font-weight: 600; color: var(--text3); line-height: 1.4; }
.tl-day-label.today { color: var(--green); }
.tl-connector {
width: 20px;
flex-shrink: 0;
display: flex;
flex-direction: column;
align-items: center;
}
.tl-dot {
width: 10px;
height: 10px;
border-radius: 50%;
background: var(--bg4);
border: 2px solid var(--border2);
margin-top: 20px;
flex-shrink: 0;
z-index: 1;
}
.tl-dot.g { background: var(--green); border-color: var(--green); box-shadow: 0 0 8px rgba(34,211,160,0.5); }
.tl-dot.b { background: var(--blue); border-color: var(--blue); }
.tl-dot.a { background: var(--amber); border-color: var(--amber); }
.tl-dot.p { background: var(--purple);border-color: var(--purple); }
.tl-dot.r { background: var(--red); border-color: var(--red); }
.tl-line { width: 1px; flex: 1; background: var(--border2); }
.tl-right { flex: 1; padding: 8px 0 8px 16px; }
.tl-card {
background: var(--bg1);
border: 1px solid var(--border);
border-radius: 8px;
padding: 14px 16px;
margin-bottom: 8px;
}
.tl-card.green-border { border-color: var(--gborder); }
.tl-card.blue-border { border-color: var(--bborder); }
.tl-card.amber-border { border-color: var(--aborder); }
.tl-card.red-border { border-color: var(--rborder); }
.tl-card-phase {
font-size: 9px;
font-weight: 700;
letter-spacing: 0.10em;
text-transform: uppercase;
margin-bottom: 8px;
}
.tl-tasks-list { list-style: none; display: flex; flex-direction: column; gap: 5px; }
.tl-tasks-list li {
font-size: 11.5px;
color: var(--text2);
display: grid;
grid-template-columns: 70px 1fr;
gap: 8px;
line-height: 1.5;
}
.tl-person { font-size: 10px; font-weight: 700; color: var(--text3); padding-top: 1px; }
.tl-task { color: var(--text2); }
.tl-task strong { color: var(--text); font-weight: 600; }
/* ── PERSON TRACKS ── */
.person-grid { display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 12px; margin-bottom: 14px; }
.person-card {
background: var(--bg2);
border: 1px solid var(--border);
border-radius: 10px;
overflow: hidden;
}
.person-header {
padding: 12px 16px;
border-bottom: 1px solid var(--border);
}
.person-name { font-family: var(--sans); font-size: 15px; font-weight: 700; color: #fff; margin-bottom: 4px; }
.person-role { font-size: 10px; color: var(--text3); }
.person-body { padding: 14px 16px; }
.person-tasks { list-style: none; display: flex; flex-direction: column; gap: 8px; }
.person-tasks li {
font-size: 11px;
color: var(--text2);
padding-left: 10px;
border-left: 2px solid var(--border2);
line-height: 1.55;
}
.person-tasks li strong { color: var(--text); font-weight: 600; display: block; }
/* ── RISK ── */
.risk-row {
display: grid;
grid-template-columns: 2fr 80px 3fr;
gap: 0;
border-bottom: 1px solid var(--border);
padding: 12px 14px;
align-items: start;
font-size: 12px;
}
.risk-row:last-child { border-bottom: none; }
.risk-name { color: var(--text); font-weight: 500; padding-right: 12px; }
.risk-prob { text-align: center; }
.risk-mit { color: var(--text2); padding-left: 12px; border-left: 1px solid var(--border); }
/* ── CHECKLIST ── */
.checklist { list-style: none; display: flex; flex-direction: column; gap: 7px; }
.checklist li {
display: flex;
align-items: flex-start;
gap: 10px;
font-size: 12px;
color: var(--text2);
line-height: 1.6;
}
.chk {
width: 16px; height: 16px;
border-radius: 4px;
border: 1px solid var(--border2);
flex-shrink: 0;
margin-top: 1px;
display: flex;
align-items: center;
justify-content: center;
font-size: 9px;
}
/* ── GATE BOX ── */
.gate-box {
background: var(--gdim);
border: 1px solid var(--gborder);
border-radius: 8px;
padding: 14px 16px;
margin: 14px 0;
display: flex;
align-items: flex-start;
gap: 12px;
}
.gate-icon { font-size: 18px; flex-shrink: 0; }
.gate-label { font-size: 9px; font-weight: 700; letter-spacing: 0.10em; text-transform: uppercase; color: var(--green); margin-bottom: 4px; }
.gate-text { font-size: 12px; color: var(--text2); line-height: 1.65; }
.gate-text strong { color: var(--text); font-weight: 600; }
/* ── UTILS ── */
.grid2 { display: grid; grid-template-columns: 1fr 1fr; gap: 14px; }
.grid3 { display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 14px; }
.mb8 { margin-bottom: 8px; }
.mb14 { margin-bottom: 14px; }
.mb20 { margin-bottom: 20px; }
.mt20 { margin-top: 20px; }
hr.rule { border: none; border-top: 1px solid var(--border); margin: 40px 0; }
.label {
font-size: 9px;
font-weight: 700;
letter-spacing: 0.12em;
text-transform: uppercase;
color: var(--text3);
margin-bottom: 10px;
}
.mono { font-family: var(--mono); }
</style>
</head>
<body>
<div class="wrap">
<!-- ═══════════════════════════════ COVER ═══════════════════════════════ -->
<div class="cover">
<div class="cover-gradient"></div>
<div class="cover-top-bar"></div>
<div class="cover-inner">
<div class="cover-eyebrow">
<span class="eyebrow-tag et-green">MASTER BUILD DOCUMENT</span>
<span class="eyebrow-tag et-blue">PHASE-BY-PHASE</span>
<span class="eyebrow-tag et-amber">ALWAYS FUNCTIONAL</span>
</div>
<h1><em>InferenceGym</em><br>Complete Engineering Plan</h1>
<p class="cover-desc">
A modular, phase-gated engineering plan for building the first RL environment for LLM inference control.
Every phase ends with a fully functional, testable system. No phase leaves you broken.
Deadline: April 7, 2026 · 11 days · 3 people.
</p>
<div class="cover-stats">
<div class="stat-cell">
<div class="stat-label">Deadline</div>
<div class="stat-val red">Apr 7, 2026</div>
</div>
<div class="stat-cell">
<div class="stat-label">Days Left</div>
<div class="stat-val amber">11 days</div>
</div>
<div class="stat-cell">
<div class="stat-label">Team Size</div>
<div class="stat-val">3 people</div>
</div>
<div class="stat-cell">
<div class="stat-label">Phases</div>
<div class="stat-val green">6 phases</div>
</div>
<div class="stat-cell">
<div class="stat-label">Deploy Target</div>
<div class="stat-val">HF Spaces</div>
</div>
<div class="stat-cell">
<div class="stat-label">Prize Pool</div>
<div class="stat-val green">$30,000</div>
</div>
</div>
</div>
</div>
<!-- ═══════════════════════════════ TOC ═══════════════════════════════ -->
<div class="toc-box">
<div class="toc-title">Table of Contents</div>
<div class="toc-phases">
<a href="#phase0" class="toc-item">
<span class="toc-num">P0</span>
<span class="toc-name">Setup & Architecture Lock</span>
<span class="toc-badge et-cyan">Day 1</span>
</a>
<a href="#phase1" class="toc-item">
<span class="toc-num">P1</span>
<span class="toc-name">Simulator Core (MVP)</span>
<span class="toc-badge et-green">Days 2–3</span>
</a>
<a href="#phase2" class="toc-item">
<span class="toc-num">P2</span>
<span class="toc-name">Environment Logic</span>
<span class="toc-badge et-blue">Day 4</span>
</a>
<a href="#phase3" class="toc-item">
<span class="toc-num">P3</span>
<span class="toc-name">API Layer & Docker</span>
<span class="toc-badge et-blue">Day 5</span>
</a>
<a href="#phase4" class="toc-item">
<span class="toc-num">P4</span>
<span class="toc-name">Grader, Baseline & Tasks</span>
<span class="toc-badge et-amber">Days 6–7</span>
</a>
<a href="#phase5" class="toc-item">
<span class="toc-num">P5</span>
<span class="toc-name">Deployment & Demo Agent</span>
<span class="toc-badge et-amber">Days 8–9</span>
</a>
<a href="#phase6" class="toc-item">
<span class="toc-num">P6</span>
<span class="toc-name">Polish, Submission & Buffer</span>
<span class="toc-badge et-purple">Days 10–11</span>
</a>
<a href="#modules" class="toc-item">
<span class="toc-num">Β§A</span>
<span class="toc-name">Full Module Specifications</span>
<span class="toc-badge et-green">Reference</span>
</a>
<a href="#dataschema" class="toc-item">
<span class="toc-num">Β§B</span>
<span class="toc-name">Data Schemas & APIs</span>
<span class="toc-badge et-blue">Reference</span>
</a>
<a href="#risks" class="toc-item">
<span class="toc-num">Β§C</span>
<span class="toc-name">Risk Register & Mitigations</span>
<span class="toc-badge et-red">Reference</span>
</a>
<a href="#checklist" class="toc-item">
<span class="toc-num">Β§D</span>
<span class="toc-name">Final Submission Checklist</span>
<span class="toc-badge et-purple">Reference</span>
</a>
</div>
</div>
<!-- ═══════════════════════════════ PHILOSOPHY ═══════════════════════════════ -->
<div class="section">
<div class="section-header">
<div class="section-num">00</div>
<div class="section-meta">
<div class="section-title">Engineering Philosophy</div>
<div class="section-sub">Guiding principles that govern every implementation decision in this project.</div>
</div>
</div>
<div class="grid3 mb14">
<div class="module-card">
<div class="module-card-header">
<span class="module-card-name" style="color:var(--green)">Always Functional</span>
</div>
<div class="module-card-body">
<div class="module-card-desc">When a phase ends, the system must be in a state where you can run it, call it, and get a valid response. No "half-built" states that block testing. If Phase 1 is done, someone can import the simulator and call <code>simulate(action, workload)</code> right now.</div>
</div>
</div>
<div class="module-card">
<div class="module-card-header">
<span class="module-card-name" style="color:var(--blue)">Stub First, Flesh Later</span>
</div>
<div class="module-card-body">
<div class="module-card-desc">Every module gets a stub implementation on Day 1 that returns valid-shaped data. This lets Person B wire the API and Person C write the grader before Person A finishes the simulator. Real logic replaces stubs phase by phase.</div>
</div>
</div>
<div class="module-card">
<div class="module-card-header">
<span class="module-card-name" style="color:var(--amber)">Data Schema First</span>
</div>
<div class="module-card-body">
<div class="module-card-desc">All three people must agree on the exact shape of <code>ServeAction</code>, <code>ServeObservation</code>, and <code>MetricsSnapshot</code> on Day 1, before writing a single line of logic. Changing the schema mid-build is the #1 cause of integration hell.</div>
</div>
</div>
</div>
<div class="alert alert-amber">
<div class="alert-title">⚠ The Critical Path</div>
Person A's simulator core is the only hard dependency for everyone else. That is why Person A's Day 3 deliverable is a strict gate: no simulator means no environment, no API, and no demo. Everything else can be parallelised after Day 3. Protect this gate fiercely.
</div>
</div>
<!-- ═══════════════════════════════ PHASE 0 ═══════════════════════════════ -->
<div class="section" id="phase0">
<div class="section-header">
<div class="section-num">P0</div>
<div class="section-meta">
<div class="section-title">Phase 0 — Setup & Architecture Lock</div>
<div class="section-sub">Day 1 (Mar 27). Goal: every team member has a running environment, a shared repo, agreed data schemas, and a working stub server that returns valid-shaped responses.</div>
</div>
<span class="eyebrow-tag et-cyan">Day 1 · Mar 27</span>
</div>
<div class="gate-box">
<div class="gate-icon">🏁</div>
<div>
<div class="gate-label">Phase Gate — End of Day 1</div>
<div class="gate-text"><strong>You can run <code>curl http://localhost:7860/health</code> and get a 200 OK.</strong> All three people have cloned the repo, installed deps, and can run the stub server locally. The data schemas are written and committed to <code>models.py</code>. Nobody can start Day 2 until this is true.</div>
</div>
</div>
<div class="person-grid mb14">
<div class="person-card">
<div class="person-header" style="border-top: 3px solid var(--green);">
<div class="person-name">Person A β€” Simulator Lead</div>
<div class="person-role">Owns: simulator/, env/ directories</div>
</div>
<div class="person-body">
<ul class="person-tasks">
<li><strong>Read OpenEnv spec completely</strong> Clone openenv-course, run the echo example env, understand what /reset → /step → /grader looks like end to end.</li>
<li><strong>Design TraceSimulator data schema</strong> Decide the exact column names for the lookup CSV. Write it down. Share with the team. This is a decision that cannot change later.</li>
<li><strong>Write skeleton classes</strong> Create <code>simulator/trace_sim.py</code> with class stubs: <code>TraceSimulator.__init__</code>, <code>simulate(action, workload)</code> returning a hardcoded <code>MetricsSnapshot</code>.</li>
<li><strong>Write skeleton workload generator</strong> <code>simulator/workload.py</code> — stub that returns a fixed <code>WorkloadState</code> dict every time.</li>
</ul>
</div>
</div>
<div class="person-card">
<div class="person-header" style="border-top: 3px solid var(--blue);">
<div class="person-name">Person B β€” API Lead</div>
<div class="person-role">Owns: server/ directory, Dockerfile</div>
</div>
<div class="person-body">
<ul class="person-tasks">
<li><strong>Set up FastAPI project</strong> Install FastAPI, uvicorn, pydantic. Create <code>server/app.py</code> with all 8 endpoint stubs that return hardcoded valid responses.</li>
<li><strong>Install openenv CLI</strong> Run <code>openenv init</code>, understand what <code>openenv validate</code> checks. Make sure the stub server passes basic validation.</li>
<li><strong>Create Dockerfile skeleton</strong> Multi-stage build that starts the uvicorn server. Confirm it builds locally and the /health endpoint responds from inside Docker.</li>
<li><strong>Set up GitHub repo</strong> Main branch protection, agree on feature branch naming (<code>feat/simulator</code>, <code>feat/api</code>, etc.), set up <code>.gitignore</code>.</li>
</ul>
</div>
</div>
<div class="person-card">
<div class="person-header" style="border-top: 3px solid var(--amber);">
<div class="person-name">Person C β€” Grader & Demo Lead</div>
<div class="person-role">Owns: grader/, agents/, notebooks/</div>
</div>
<div class="person-body">
<ul class="person-tasks">
<li><strong>Design grader rubric on paper</strong> For each of the 3 tasks: what is the score formula? What is the theoretical optimal? What is the expected baseline score? Write this as a one-page doc.</li>
<li><strong>Decide trace data strategy</strong> Evaluate Option A (published benchmarks), B (Colab T4), C (synthetic). Download whichever dataset you're going with. Confirm it has the needed columns.</li>
<li><strong>Define workload configs</strong> Write <code>simulator/data/workload_configs.json</code> with the exact parameters for Task 1, 2, and 3 (arrival rate, SLO, prompt distribution params).</li>
<li><strong>Agree on ENV_NAME</strong> Confirm the HuggingFace Spaces org, repo name, and environment name string. Register the HF account if needed.</li>
</ul>
</div>
</div>
</div>
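<div class="label">DAY 1 STUB SKETCH (ILLUSTRATIVE)</div>
<div class="phase-desc">For reference, Person B's Day 1 stub server can be this small. The sketch below is illustrative only: it covers two of the eight endpoints, and the response shapes shown are placeholders until the team locks them alongside <code>models.py</code>.</div>
<div class="code-block">
<div class="code-block-header">
<span class="code-lang">python</span>
<span class="code-file">server/app.py · hypothetical stub, hardcoded valid-shaped responses</span>
</div>
<div class="code-body"><pre>from fastapi import FastAPI

app = FastAPI(title="InferenceGym (stub)")

# Hardcoded, valid-shaped responses: real logic replaces these phase by phase.
@app.get("/health")
def health() -> dict:
    return {"status": "ok"}

@app.get("/tasks")
def tasks() -> dict:
    # Three placeholder tasks; real params come from workload_configs.json later.
    return {"tasks": [{"id": f"task{i}", "name": f"Task {i}"} for i in (1, 2, 3)]}</pre></div>
</div>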
<div class="label">SHARED DELIVERABLE — models.py (everyone must agree before Day 2)</div>
<div class="code-block">
<div class="code-block-header">
<span class="code-lang">python</span>
<span class="code-file">inferencegym/models.py — Data schema, locked on Day 1</span>
</div>
<div class="code-body"><pre><span class="kw">from</span> dataclasses <span class="kw">import</span> dataclass, field
<span class="kw">from</span> typing <span class="kw">import</span> <span class="tp">Optional, List, Dict, Any</span>
<span class="kw">from</span> enum <span class="kw">import</span> Enum

<span class="cm"># ── Action space ─────────────────────────────────────────────────────</span>
<span class="kw">class</span> <span class="tp">QuantTier</span>(Enum):
    FP16 = <span class="nm">0</span>
    INT8 = <span class="nm">1</span>
    INT4 = <span class="nm">2</span>

<span class="dc">@dataclass</span>
<span class="kw">class</span> <span class="tp">ServeAction</span>:
    kv_budget: <span class="tp">float</span>        <span class="cm"># 0.1 – 1.0 : fraction of KV cache allocated</span>
    spec_length: <span class="tp">int</span>        <span class="cm"># 0,1,2,4,8 : speculative draft tokens</span>
    batch_size: <span class="tp">int</span>         <span class="cm"># 1–512 : max concurrent requests</span>
    prefill_disagg: <span class="tp">bool</span>    <span class="cm"># True/False : disaggregate prefill GPU</span>
    quant_tier: <span class="tp">QuantTier</span>   <span class="cm"># FP16/INT8/INT4</span>

    <span class="kw">def</span> <span class="fn">validate</span>(self) -> <span class="tp">bool</span>:
        <span class="kw">assert</span> <span class="nm">0.1</span> <= self.kv_budget <= <span class="nm">1.0</span>
        <span class="kw">assert</span> self.spec_length <span class="kw">in</span> {<span class="nm">0</span>, <span class="nm">1</span>, <span class="nm">2</span>, <span class="nm">4</span>, <span class="nm">8</span>}
        <span class="kw">assert</span> <span class="nm">1</span> <= self.batch_size <= <span class="nm">512</span>
        <span class="kw">return</span> <span class="kw">True</span>

<span class="cm"># ── Simulator output ─────────────────────────────────────────────────</span>
<span class="dc">@dataclass</span>
<span class="kw">class</span> <span class="tp">MetricsSnapshot</span>:
    ttft_p50_ms: <span class="tp">float</span>      <span class="cm"># median time to first token</span>
    ttft_p99_ms: <span class="tp">float</span>      <span class="cm"># tail latency</span>
    tpot_ms: <span class="tp">float</span>          <span class="cm"># time per output token</span>
    tokens_per_sec: <span class="tp">float</span>   <span class="cm"># throughput</span>
    gpu_memory_gb: <span class="tp">float</span>    <span class="cm"># simulated memory pressure</span>
    cost_per_1k: <span class="tp">float</span>      <span class="cm"># compute cost (normalised units)</span>
    spec_accept_rate: <span class="tp">float</span> <span class="cm"># 0.0 if spec_length == 0</span>
    eviction_events: <span class="tp">int</span>    <span class="cm"># KV cache evictions this step</span>
    slo_violations: <span class="tp">int</span>     <span class="cm"># requests that exceeded SLO this step</span>

<span class="cm"># ── Observation (what agent sees) ────────────────────────────────────</span>
<span class="dc">@dataclass</span>
<span class="kw">class</span> <span class="tp">ServeObservation</span>:
    queue_depth: <span class="tp">float</span>
    mean_prompt_len: <span class="tp">float</span>
    arrival_rate: <span class="tp">float</span>
    kv_cache_occupancy: <span class="tp">float</span>
    ttft_p50: <span class="tp">float</span>
    tpot_p50: <span class="tp">float</span>
    slo_violation_rate: <span class="tp">float</span>
    gpu_memory_used_gb: <span class="tp">float</span>
    spec_accept_rate: <span class="tp">float</span>
    priority_distribution: <span class="tp">List[float]</span>  <span class="cm"># [interactive, batch, best_effort]</span>
    timestep: <span class="tp">int</span>
    cost_so_far: <span class="tp">float</span>

<span class="cm"># ── Workload state ───────────────────────────────────────────────────</span>
<span class="dc">@dataclass</span>
<span class="kw">class</span> <span class="tp">WorkloadState</span>:
    arrival_rate: <span class="tp">float</span>
    mean_prompt_len: <span class="tp">float</span>
    prompt_len_bucket: <span class="tp">int</span>  <span class="cm"># 0–7, discrete bucket for lookup table</span>
    queue_depth: <span class="tp">int</span>
    priority_distribution: <span class="tp">List[float]</span>
    is_burst: <span class="tp">bool</span>
    phase: <span class="tp">str</span>              <span class="cm"># "warmup" | "steady" | "burst" | "cooldown"</span></pre></div>
</div>
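<div class="label">SCHEMA SANITY CHECK (ILLUSTRATIVE)</div>
<div class="phase-desc">As a quick self-check on the locked schema, the sketch below (a trimmed copy of <code>ServeAction</code>, reduced to its three range-checked fields) shows <code>validate()</code> rejecting an out-of-range action before it can ever reach the simulator.</div>
<div class="code-block">
<div class="code-block-header">
<span class="code-lang">python</span>
<span class="code-file">trimmed sketch of models.py · not the full schema</span>
</div>
<div class="code-body"><pre>from dataclasses import dataclass

@dataclass
class ServeAction:                       # trimmed copy of the schema above
    kv_budget: float
    spec_length: int
    batch_size: int

    def validate(self) -> bool:
        assert 0.1 <= self.kv_budget <= 1.0
        assert self.spec_length in {0, 1, 2, 4, 8}
        assert 1 <= self.batch_size <= 512
        return True

good = ServeAction(kv_budget=0.5, spec_length=4, batch_size=64)
assert good.validate()

bad = ServeAction(kv_budget=1.5, spec_length=4, batch_size=64)
try:
    bad.validate()
    raise RuntimeError("should have failed")
except AssertionError:
    pass                                 # out-of-range kv_budget is rejected</pre></div>
</div>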
<div class="label">PHASE 0 COMPLETION PROOF</div>
<div class="code-block">
<div class="code-block-header">
<span class="code-lang">bash</span>
<span class="code-file">These commands must all pass before Day 2 starts</span>
</div>
<div class="code-body"><pre><span class="cm"># From repo root:</span>
docker build -t inferencegym . && docker run -p 7860:7860 inferencegym &
curl http://localhost:7860/health <span class="cm"># → {"status": "ok"}</span>
curl http://localhost:7860/tasks <span class="cm"># → {"tasks": [{...}, {...}, {...}]}</span>
python -c <span class="st">"from inferencegym.models import ServeAction, ServeObservation; print('schemas OK')"</span></pre></div>
</div>
</div>
<!-- ═══════════════════════════════ PHASE 1 ═══════════════════════════════ -->
<div class="section" id="phase1">
<div class="section-header">
<div class="section-num">P1</div>
<div class="section-meta">
<div class="section-title">Phase 1 — Simulator Core</div>
<div class="section-sub">Days 2–3 (Mar 28–29). Goal: a fully working TraceSimulator that takes a real ServeAction and returns a realistic MetricsSnapshot. This is the hardest and most critical module in the entire project.</div>
</div>
<span class="eyebrow-tag et-green">Days 2–3</span>
</div>
<div class="alert alert-green">
<div class="alert-title">✅ Why This Phase Unlocks Everything</div>
Once <code>TraceSimulator.simulate(action, workload) → MetricsSnapshot</code> works, Person B can wire it into the API and Person C can build the grader. Both of those can proceed in parallel. Person A must finish this by end of Day 3 even if it means simplifying the interpolation.
</div>
<div class="gate-box">
<div class="gate-icon">🔑</div>
<div>
<div class="gate-label">Phase Gate — End of Day 3</div>
<div class="gate-text"><strong>Running <code>python tests/test_simulator.py</code> passes all tests.</strong> The simulator returns realistic-shaped numbers for a variety of (action, workload) inputs. The workload generator produces a different workload state on every call. These are the two things that need to be true before Phase 2 begins.</div>
</div>
</div>
<div class="grid2 mb14">
<div>
<div class="label">DAY 2 TASKS (Person A, primary)</div>
<div class="deliverable-box">
<div class="deliverable-title">TraceSimulator — Core Implementation</div>
<ul class="deliverable-list">
<li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Load lookup table from CSV/Parquet</strong> Read the trace data file into a dict keyed by <code>(batch_bucket, kv_bucket, spec_bucket, prompt_bucket)</code>. Each value is a <code>MetricsSnapshot</code>. The lookup table must be loaded once at startup and cached in memory.</div></li>
<li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Implement multilinear interpolation</strong> Use <code>scipy.interpolate.RegularGridInterpolator</code> over the 4-D (batch, kv, spec, prompt) grid, linearly interpolating continuous actions (kv_budget, batch_size) between discrete lookup points. The discrete <code>quant_tier</code> stays outside the grid and is applied as a per-tier correction factor.</div></li>
<li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Add Gaussian noise model</strong> Inject ±5% Gaussian noise on <code>ttft_p50_ms</code> and <code>tpot_ms</code> to simulate hardware jitter. Use <code>np.random.default_rng(seed)</code> so episodes are reproducible.</div></li>
<li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Memory overflow detection</strong> If interpolated <code>gpu_memory_gb > 40.0</code>, set a hard OOM flag, cap memory at 40 GB, and report every in-flight request as a failure (<code>slo_violations = batch_size</code>) as the penalty signal.</div></li>
</ul>
</div>
<div class="deliverable-box">
<div class="deliverable-title">WorkloadGenerator — Day 2</div>
<ul class="deliverable-list">
<li><span class="dl-bullet dl-blue">A</span><div class="dl-text"><strong>Poisson arrival generator</strong> <code>np.random.poisson(lam=arrival_rate)</code> per step. Arrival rate varies by task config loaded from <code>workload_configs.json</code>.</div></li>
<li><span class="dl-bullet dl-blue">A</span><div class="dl-text"><strong>Prompt length sampling</strong> Task 1: <code>np.random.uniform(64, 128)</code>. Task 2: <code>np.random.lognormal(5.2, 1.3)</code> clamped to [32, 8192]. Task 3: bimodal — 70% uniform(32, 128), 30% uniform(4096, 8192).</div></li>
<li><span class="dl-bullet dl-blue">A</span><div class="dl-text"><strong>Discrete prompt bucket mapping</strong> Map continuous prompt_len to an integer bucket 0–7 using <code>np.digitize</code> against <code>[64, 128, 256, 512, 1024, 2048, 4096]</code>. This is the lookup table key.</div></li>
</ul>
</div>
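<div class="label">SKETCH: PROMPT SAMPLING AND BUCKET MAPPING</div>
The Day 2 sampling and bucketing rules above fit in a few lines. A minimal sketch using the distributions and bucket edges from the bullets; the helper names are ours, not the committed API:

```python
import numpy as np

# Bucket edges from the spec. np.digitize returns 0 for lengths below 64
# and 7 for lengths >= 4096, giving exactly the 0-7 lookup-table key range.
PLEN_EDGES = [64, 128, 256, 512, 1024, 2048, 4096]

def prompt_len_bucket(prompt_len: float) -> int:
    """Map a continuous prompt length to a discrete bucket 0-7."""
    return int(np.digitize(prompt_len, PLEN_EDGES))

def sample_prompt_len(task_id: int, rng: np.random.Generator) -> float:
    """Per-task prompt-length distributions from the Day 2 bullets."""
    if task_id == 1:
        return float(rng.uniform(64, 128))
    if task_id == 2:
        return float(np.clip(rng.lognormal(5.2, 1.3), 32, 8192))
    # Task 3: bimodal, 70% short prompts / 30% long prompts
    if rng.random() < 0.7:
        return float(rng.uniform(32, 128))
    return float(rng.uniform(4096, 8192))
```

Lengths below 64 land in bucket 0, so the Task 2 clamp at 32 still maps cleanly.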
</div>
<div>
<div class="label">DAY 3 TASKS (Person A, primary)</div>
<div class="deliverable-box">
<div class="deliverable-title">WorkloadGenerator — Day 3 Completion</div>
<ul class="deliverable-list">
<li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Queue depth simulation</strong> Maintain a running <code>queue_depth</code> counter. Each step: add new arrivals, subtract <code>min(batch_size, queue_depth)</code> served requests. Queue cannot go negative.</div></li>
<li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Burst injection for Task 3</strong> Every 120 timesteps, multiply arrival_rate by 10 for 15 consecutive steps. Set <code>is_burst=True</code> in <code>WorkloadState</code> during these steps.</div></li>
<li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Priority distribution tracking</strong> Task 3: maintain a rolling 50-step window of request classes [INTERACTIVE, BATCH, BEST_EFFORT] as fractions. Pass this to <code>WorkloadState.priority_distribution</code>.</div></li>
<li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Speculative acceptance model</strong> Implement the acceptance rate formula: <code>accept_rate = base_rate * (1 - complexity_penalty) * depth_decay</code> where <code>depth_decay = 1.0 / (1 + 0.15 * spec_length)</code>. Base rate by task: Task1=0.80, Task2=0.65, Task3=0.45.</div></li>
</ul>
</div>
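<div class="label">SKETCH: ACCEPTANCE RATE AND BURST SCHEDULE</div>
The acceptance formula and burst rule above are concrete enough to pin down now. A hedged sketch using the constants from the bullets; where in each 120-step window the burst starts is our assumption:

```python
BASE_RATE = {1: 0.80, 2: 0.65, 3: 0.45}  # per-task base acceptance rates

def spec_accept_rate(task_id: int, spec_length: int,
                     complexity_penalty: float = 0.0) -> float:
    """accept_rate = base_rate * (1 - complexity_penalty) * depth_decay."""
    if spec_length == 0:
        return 0.0  # speculative decoding disabled
    depth_decay = 1.0 / (1 + 0.15 * spec_length)
    rate = BASE_RATE[task_id] * (1.0 - complexity_penalty) * depth_decay
    return max(0.0, min(1.0, rate))

def is_burst_step(t: int, period: int = 120, width: int = 15) -> bool:
    """Task 3: a 15-step x10-arrival burst in every 120-step window.
    We place the burst at the start of each window (an assumption)."""
    return (t % period) < width
```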
<div class="deliverable-box">
<div class="deliverable-title">Unit Tests — must pass by Day 3 EOD</div>
<ul class="deliverable-list">
<li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>Smoke test</strong> Call <code>simulate(action, workload)</code> with 20 random valid actions — all return a non-null <code>MetricsSnapshot</code> with values in expected ranges.</div></li>
<li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>Monotonicity test</strong> Increasing <code>batch_size</code> while holding other actions constant should increase <code>tokens_per_sec</code> monotonically until throughput saturates. This validates the lookup table is correctly loaded.</div></li>
<li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>Determinism test</strong> Two calls with the same seed and same action must produce the same noise-injected output. Tests reproducibility.</div></li>
<li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>OOM detection test</strong> Pass an action with <code>batch_size=512, kv_budget=1.0</code> — confirm <code>gpu_memory_gb</code> triggers the overflow flag.</div></li>
</ul>
</div>
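<div class="label">SKETCH: TEST PATTERNS</div>
The determinism and monotonicity checks follow one pattern: seed the generator per call, then compare outputs. A self-contained sketch against a toy stand-in (the real tests would exercise <code>TraceSimulator.simulate</code>):

```python
import numpy as np

def toy_simulate(batch_size: int, seed: int) -> float:
    """Stand-in for simulate(): seeded rng per call, mirroring the plan's
    np.random.default_rng(seed) noise model. Returns a latency in ms."""
    rng = np.random.default_rng(seed)
    base = 50.0 + 2.0 * batch_size       # latency grows with batch size
    return base * rng.normal(1.0, 0.05)  # +/-5% Gaussian jitter

# Determinism: same seed + same action -> identical output.
assert toy_simulate(16, seed=42) == toy_simulate(16, seed=42)

# Monotonicity: with the noise held fixed (same seed), outputs track the base curve.
vals = [toy_simulate(b, seed=0) for b in (1, 4, 8, 16, 32)]
assert vals == sorted(vals)
```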
</div>
</div>
<div class="label">SIMULATOR CORE IMPLEMENTATION</div>
<div class="code-block">
<div class="code-block-header">
<span class="code-lang">python</span>
<span class="code-file">simulator/trace_sim.py</span>
</div>
<div class="code-body"><pre><span class="kw">import</span> numpy <span class="kw">as</span> np
<span class="kw">import</span> pandas <span class="kw">as</span> pd
<span class="kw">from</span> scipy.interpolate <span class="kw">import</span> RegularGridInterpolator
<span class="kw">from</span> pathlib <span class="kw">import</span> Path
<span class="kw">from</span> inferencegym.models <span class="kw">import</span> ServeAction, WorkloadState, MetricsSnapshot, QuantTier
<span class="kw">class</span> <span class="tp">TraceSimulator</span>:
<span class="st">"""
CPU-only trace-driven simulator.
Loads a pre-built lookup table and interpolates (action, workload) → MetricsSnapshot.
"""</span>
BATCH_POINTS = [<span class="nm">1</span>, <span class="nm">4</span>, <span class="nm">8</span>, <span class="nm">16</span>, <span class="nm">32</span>, <span class="nm">64</span>, <span class="nm">128</span>, <span class="nm">256</span>, <span class="nm">512</span>]
KV_POINTS = [<span class="nm">0.1</span>, <span class="nm">0.25</span>, <span class="nm">0.5</span>, <span class="nm">0.75</span>, <span class="nm">1.0</span>]
PLEN_BUCKETS = [<span class="nm">64</span>, <span class="nm">128</span>, <span class="nm">256</span>, <span class="nm">512</span>, <span class="nm">1024</span>, <span class="nm">2048</span>, <span class="nm">4096</span>, <span class="nm">8192</span>]
OOM_THRESHOLD = <span class="nm">40.0</span> <span class="cm"># GB</span>
NOISE_STD = <span class="nm">0.05</span> <span class="cm"># ±5% Gaussian jitter on latency metrics</span>
<span class="kw">def</span> <span class="fn">__init__</span>(self, trace_path: <span class="tp">str</span>, seed: <span class="tp">int</span> = <span class="nm">42</span>):
self.rng = np.random.default_rng(seed)
self._load_tables(Path(trace_path))
self._build_interpolators()
<span class="kw">def</span> <span class="fn">_load_tables</span>(self, path: <span class="tp">Path</span>) -> <span class="tp">None</span>:
df = pd.read_parquet(path)
<span class="cm"># Expected columns: batch_size, kv_budget, spec_length, quant_tier,</span>
<span class="cm"># prompt_len_bucket, ttft_p50, ttft_p99, tpot, tps, gpu_mem_gb, cost_per_1k</span>
self._df = df
<span class="kw">def</span> <span class="fn">_build_interpolators</span>(self) -> <span class="tp">None</span>:
<span class="cm"># Build 4-D interpolator over (batch_size, kv_budget, spec_len, prompt_bucket)</span>
<span class="cm"># for FP16 baseline. INT8/INT4 handled via multiplicative correction factors.</span>
fp16_df = self._df[self._df[<span class="st">'quant_tier'</span>] == <span class="nm">0</span>]
grid_vals = {
<span class="st">'ttft_p50'</span>: self._reshape_for_interp(fp16_df, <span class="st">'ttft_p50'</span>),
<span class="st">'ttft_p99'</span>: self._reshape_for_interp(fp16_df, <span class="st">'ttft_p99'</span>),
<span class="st">'tpot'</span>: self._reshape_for_interp(fp16_df, <span class="st">'tpot'</span>),
<span class="st">'tps'</span>: self._reshape_for_interp(fp16_df, <span class="st">'tps'</span>),
<span class="st">'gpu_mem'</span>: self._reshape_for_interp(fp16_df, <span class="st">'gpu_mem_gb'</span>),
}
points = (self.BATCH_POINTS, self.KV_POINTS, [<span class="nm">0</span>,<span class="nm">1</span>,<span class="nm">2</span>,<span class="nm">4</span>,<span class="nm">8</span>], self.PLEN_BUCKETS)
self._interps = {k: RegularGridInterpolator(points, v, method=<span class="st">'linear'</span>,
bounds_error=<span class="kw">False</span>, fill_value=<span class="kw">None</span>) <span class="cm"># extrapolate off-grid queries instead of returning NaN</span>
<span class="kw">for</span> k, v <span class="kw">in</span> grid_vals.items()}
<span class="kw">def</span> <span class="fn">simulate</span>(self, action: <span class="tp">ServeAction</span>, workload: <span class="tp">WorkloadState</span>) -> <span class="tp">MetricsSnapshot</span>:
action.validate()
query = [[action.batch_size, action.kv_budget,
action.spec_length, workload.mean_prompt_len]]
<span class="cm"># Interpolate base metrics</span>
base = {k: float(fn(query)[<span class="nm">0</span>]) <span class="kw">for</span> k, fn <span class="kw">in</span> self._interps.items()}
<span class="cm"># Apply quant tier correction factors (from benchmark data)</span>
quant_factors = {QuantTier.FP16: <span class="nm">1.0</span>, QuantTier.INT8: <span class="nm">0.82</span>, QuantTier.INT4: <span class="nm">0.68</span>}
q_factor = quant_factors[action.quant_tier]
base[<span class="st">'ttft_p50'</span>] *= q_factor
base[<span class="st">'ttft_p99'</span>] *= q_factor <span class="cm"># scale both percentiles so p99 stays consistent with p50</span>
base[<span class="st">'tpot'</span>] *= q_factor <span class="cm"># quantised models serve faster</span>
base[<span class="st">'tps'</span>] /= q_factor
base[<span class="st">'gpu_mem'</span>] *= q_factor <span class="cm"># quantised models use less memory</span>
<span class="cm"># Apply speculative decoding acceptance bonus</span>
<span class="kw">if</span> action.spec_length > <span class="nm">0</span>:
depth_decay = <span class="nm">1.0</span> / (<span class="nm">1</span> + <span class="nm">0.15</span> * action.spec_length)
accept_rate = <span class="nm">0.75</span> * (<span class="nm">1</span> - <span class="nm">0.1</span> * workload.prompt_len_bucket) * depth_decay
accept_rate = max(<span class="nm">0.0</span>, min(<span class="nm">1.0</span>, accept_rate))
speedup = <span class="nm">1.0</span> + accept_rate * action.spec_length * <span class="nm">0.1</span>
base[<span class="st">'ttft_p50'</span>] /= speedup
<span class="kw">else</span>:
accept_rate = <span class="nm">0.0</span>
<span class="cm"># Inject Gaussian noise</span>
noise = self.rng.normal(<span class="nm">1.0</span>, self.NOISE_STD, size=<span class="nm">3</span>)
base[<span class="st">'ttft_p50'</span>] *= noise[<span class="nm">0</span>]
base[<span class="st">'ttft_p99'</span>] *= noise[<span class="nm">1</span>]
base[<span class="st">'tpot'</span>] *= noise[<span class="nm">2</span>]
<span class="cm"># OOM detection</span>
oom = base[<span class="st">'gpu_mem'</span>] > self.OOM_THRESHOLD
slo_violations = <span class="nm">0</span> <span class="cm"># computed by env, not simulator</span>
<span class="kw">if</span> oom:
base[<span class="st">'gpu_mem'</span>] = self.OOM_THRESHOLD
slo_violations = action.batch_size <span class="cm"># all requests fail on OOM</span>
<span class="kw">return</span> MetricsSnapshot(
ttft_p50_ms = max(<span class="nm">1.0</span>, base[<span class="st">'ttft_p50'</span>]),
ttft_p99_ms = max(<span class="nm">1.0</span>, base[<span class="st">'ttft_p99'</span>]),
tpot_ms = max(<span class="nm">1.0</span>, base[<span class="st">'tpot'</span>]),
tokens_per_sec = max(<span class="nm">0.0</span>, base[<span class="st">'tps'</span>]),
gpu_memory_gb = base[<span class="st">'gpu_mem'</span>],
cost_per_1k = q_factor * <span class="nm">2.0</span> / max(base[<span class="st">'tps'</span>], <span class="nm">1.0</span>), <span class="cm"># $/1k tokens is inverse to throughput; ~0.001 at 2000 tok/s, matching the reset baseline</span>
spec_accept_rate = accept_rate,
eviction_events = int(max(<span class="nm">0</span>, (<span class="nm">1.0</span> - action.kv_budget) * workload.queue_depth)),
slo_violations = slo_violations,
)</pre></div>
</div>
<div class="label">TRACE DATA — How to Build It Without a GPU</div>
<div class="grid3 mb14">
<div class="module-card">
<div class="module-card-header">
<span class="module-card-name" style="color:var(--green)">Option A (Recommended)</span>
<span class="eyebrow-tag et-green">0 GPU hrs</span>
</div>
<div class="module-card-body">
<div class="module-card-desc">Download published vLLM benchmark CSVs from <code>github.com/vllm-project/vllm/tree/main/benchmarks</code> and the HuggingFace llm-perf-leaderboard. These have real measured latencies across batch sizes. Fit a pandas pivot table to get the lookup grid.</div>
<ul class="spec-list">
<li>Already covers Llama-3-8B on A100 — your exact target model</li>
<li>Includes TTFT, TPOT, throughput, memory across batch sizes</li>
<li>Needs ~2 hours of data wrangling to reshape into your schema</li>
</ul>
</div>
</div>
<div class="module-card">
<div class="module-card-header">
<span class="module-card-name" style="color:var(--blue)">Option B (Good)</span>
<span class="eyebrow-tag et-blue">2-4 GPU hrs</span>
</div>
<div class="module-card-body">
<div class="module-card-desc">Run <code>llmperf</code> on a Colab free T4 with Llama-3.2-1B-Instruct (free tier works). Grid search over batch_size=[1,4,8,16,32] × prompt_len=[64,128,256,512] — that's 20 measurements. 2 hours of Colab time.</div>
<ul class="spec-list">
<li>Your own measurements — stronger story for judges</li>
<li>Can extrapolate to larger batch sizes analytically</li>
<li>Risk: Colab disconnects. Use checkpointing.</li>
</ul>
</div>
</div>
<div class="module-card">
<div class="module-card-header">
<span class="module-card-name" style="color:var(--amber)">Option C (Fallback)</span>
<span class="eyebrow-tag et-amber">30 min, CPU</span>
</div>
<div class="module-card-body">
<div class="module-card-desc">Generate synthetic data from a roofline model. <code>ttft = base_ms + batch_factor * batch_size + memory_factor * prompt_len</code>. These constants are documented in vLLM's OSDI paper. Fully deterministic, always works.</div>
<ul class="spec-list">
<li>Implement this FIRST as a fallback even if you use A or B</li>
<li>Guarantees you always have valid data no matter what</li>
<li>Good enough for an RL agent to learn relative improvements</li>
</ul>
</div>
</div>
</div>
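<div class="label">SKETCH: OPTION C ROOFLINE GENERATOR</div>
Option C is small enough to sketch outright. The constants below are invented placeholders; swap in calibrated values before trusting any absolute numbers:

```python
import numpy as np

# Placeholder roofline constants -- replace with calibrated values.
BASE_MS = 35.0       # fixed per-request overhead
BATCH_FACTOR = 1.8   # ms added per unit of batch size
MEM_FACTOR = 0.04    # ms added per prompt token

def synthetic_ttft_ms(batch_size: int, prompt_len: int) -> float:
    """ttft = base_ms + batch_factor * batch_size + memory_factor * prompt_len."""
    return BASE_MS + BATCH_FACTOR * batch_size + MEM_FACTOR * prompt_len

def synthetic_tps(batch_size: int, peak_tps: float = 9000.0) -> float:
    """Throughput rises with batch size but saturates toward a roofline peak."""
    return peak_tps * (1.0 - np.exp(-batch_size / 64.0))
```

Deterministic by construction, so it doubles as a fixture for the Phase 1 unit tests.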
</div>
<!-- ═══════════════════════════════ PHASE 2 ═══════════════════════════════ -->
<div class="section" id="phase2">
<div class="section-header">
<div class="section-num">P2</div>
<div class="section-meta">
<div class="section-title">Phase 2 — Environment Logic</div>
<div class="section-sub">Day 4 (Mar 30). Goal: a complete InferenceEnv class with working reset(), step(), and state(). An agent can interact with it in a loop and receive valid rewards.</div>
</div>
<span class="eyebrow-tag et-blue">Day 4 · Mar 30</span>
</div>
<div class="gate-box">
<div class="gate-icon">🎯</div>
<div>
<div class="gate-label">Phase Gate — End of Day 4</div>
<div class="gate-text"><strong>The following Python loop runs without error and completes all 200 steps:</strong> <code>env = InferenceEnv(sim, task_id=1); obs = env.reset(); [env.step(random_action()) for _ in range(200)]</code>. Rewards are floats in [-1, 1]. The episode terminates at step 200. Session IDs are unique per reset call.</div>
</div>
</div>
<div class="label">ENVIRONMENT CLASS — Full Implementation</div>
<div class="code-block">
<div class="code-block-header">
<span class="code-lang">python</span>
<span class="code-file">env/inference_env.py — Core environment (Person A, Day 4)</span>
</div>
<div class="code-body"><pre><span class="kw">import</span> uuid, json, threading
<span class="kw">import</span> numpy <span class="kw">as</span> np
<span class="kw">from</span> dataclasses <span class="kw">import</span> dataclass
<span class="kw">from</span> inferencegym.models <span class="kw">import</span> ServeAction, ServeObservation, WorkloadState, MetricsSnapshot, QuantTier
<span class="kw">from</span> simulator.trace_sim <span class="kw">import</span> TraceSimulator
<span class="kw">from</span> simulator.workload <span class="kw">import</span> WorkloadGenerator
@dataclass
<span class="kw">class</span> <span class="tp">EnvConfig</span>:
task_id: <span class="tp">int</span>
episode_len: <span class="tp">int</span> = <span class="nm">200</span>
slo_target_ms: <span class="tp">float</span> = <span class="nm">300.0</span>
max_memory_gb: <span class="tp">float</span> = <span class="nm">40.0</span>
<span class="cm"># Reward weights</span>
alpha: <span class="tp">float</span> = <span class="nm">0.40</span> <span class="cm"># throughput</span>
beta: <span class="tp">float</span> = <span class="nm">0.25</span> <span class="cm"># latency</span>
gamma: <span class="tp">float</span> = <span class="nm">0.25</span> <span class="cm"># SLO violations</span>
delta: <span class="tp">float</span> = <span class="nm">0.10</span> <span class="cm"># cost</span>
<span class="cm"># Task configs — loaded from workload_configs.json</span>
TASK_CONFIGS = {
<span class="nm">1</span>: EnvConfig(task_id=<span class="nm">1</span>, slo_target_ms=<span class="nm">500.0</span>),
<span class="nm">2</span>: EnvConfig(task_id=<span class="nm">2</span>, slo_target_ms=<span class="nm">300.0</span>, gamma=<span class="nm">0.30</span>),
<span class="nm">3</span>: EnvConfig(task_id=<span class="nm">3</span>, slo_target_ms=<span class="nm">200.0</span>, gamma=<span class="nm">0.35</span>, delta=<span class="nm">0.15</span>),
}
<span class="cm"># Max achievable throughput per task (set after running optimal solver)</span>
MAX_THROUGHPUT = {<span class="nm">1</span>: <span class="nm">8500.0</span>, <span class="nm">2</span>: <span class="nm">6200.0</span>, <span class="nm">3</span>: <span class="nm">4800.0</span>}
<span class="kw">class</span> <span class="tp">InferenceEnv</span>:
<span class="kw">def</span> <span class="fn">__init__</span>(self, simulator: <span class="tp">TraceSimulator</span>, task_id: <span class="tp">int</span>, seed: <span class="tp">int</span> = <span class="nm">42</span>):
self.sim = simulator
self.config = TASK_CONFIGS[task_id]
self.gen = WorkloadGenerator(task_id=task_id, seed=seed)
self.session_id = str(uuid.uuid4())
self._step = <span class="nm">0</span>
self._cost_so_far = <span class="nm">0.0</span>
self._workload = self.gen.reset()
self._last_metrics: MetricsSnapshot = <span class="kw">None</span>
self._episode_log: <span class="tp">list</span> = []
<span class="kw">def</span> <span class="fn">reset</span>(self) -> <span class="tp">ServeObservation</span>:
self.session_id = str(uuid.uuid4())
self._step = <span class="nm">0</span>
self._cost_so_far = <span class="nm">0.0</span>
self._workload = self.gen.reset()
self._episode_log = []
<span class="kw">return</span> self._build_obs(MetricsSnapshot(
ttft_p50_ms=<span class="nm">200.0</span>, ttft_p99_ms=<span class="nm">350.0</span>, tpot_ms=<span class="nm">20.0</span>,
tokens_per_sec=<span class="nm">2000.0</span>, gpu_memory_gb=<span class="nm">24.0</span>, cost_per_1k=<span class="nm">0.001</span>,
spec_accept_rate=<span class="nm">0.0</span>, eviction_events=<span class="nm">0</span>, slo_violations=<span class="nm">0</span>))
<span class="kw">def</span> <span class="fn">step</span>(self, action: <span class="tp">ServeAction</span>):
<span class="kw">if</span> self._step >= self.config.episode_len:
<span class="kw">raise</span> RuntimeError(<span class="st">"Episode already done. Call reset() first."</span>)
<span class="cm"># Task 1 & 2: lock certain actions</span>
action = self._enforce_action_mask(action)
<span class="cm"># Advance workload one step</span>
self._workload = self.gen.step(action)
<span class="cm"># Simulate this step</span>
metrics = self.sim.simulate(action, self._workload)
self._last_metrics = metrics
<span class="cm"># Compute SLO violations from simulator metrics + SLO target</span>
metrics.slo_violations += int(
metrics.ttft_p50_ms > self.config.slo_target_ms) * self._workload.queue_depth
<span class="cm"># Compute reward</span>
reward = self._compute_reward(metrics)
<span class="cm"># Update episode state</span>
self._cost_so_far += metrics.cost_per_1k
self._step += <span class="nm">1</span>
done = self._step >= self.config.episode_len
obs = self._build_obs(metrics)
info = {<span class="st">"timestep"</span>: self._step, <span class="st">"metrics"</span>: metrics.__dict__,
<span class="st">"workload"</span>: self._workload.__dict__}
self._episode_log.append({<span class="st">"action"</span>: action.__dict__, <span class="st">"reward"</span>: reward, <span class="st">"metrics"</span>: metrics.__dict__})
<span class="kw">return</span> obs, reward, done, info
<span class="kw">def</span> <span class="fn">_compute_reward</span>(self, m: <span class="tp">MetricsSnapshot</span>) -> <span class="tp">float</span>:
c = self.config
T = m.tokens_per_sec / MAX_THROUGHPUT[c.task_id]
L = m.ttft_p50_ms / c.slo_target_ms
V = m.slo_violations / max(self._workload.queue_depth, <span class="nm">1</span>)
C = m.cost_per_1k / <span class="nm">0.005</span> <span class="cm"># normalise against budget ceiling</span>
reward = c.alpha * T - c.beta * L - c.gamma * V - c.delta * C
<span class="kw">return</span> float(np.clip(reward, -<span class="nm">1.0</span>, <span class="nm">1.0</span>))
<span class="kw">def</span> <span class="fn">_enforce_action_mask</span>(self, action: <span class="tp">ServeAction</span>) -> <span class="tp">ServeAction</span>:
<span class="kw">if</span> self.config.task_id == <span class="nm">1</span>:
action.spec_length = <span class="nm">0</span>; action.prefill_disagg = <span class="kw">False</span>; action.quant_tier = QuantTier.FP16
<span class="kw">elif</span> self.config.task_id == <span class="nm">2</span>:
action.prefill_disagg = <span class="kw">False</span>; action.quant_tier = QuantTier.FP16
<span class="kw">return</span> action
<span class="kw">def</span> <span class="fn">_build_obs</span>(self, m: <span class="tp">MetricsSnapshot</span>) -> <span class="tp">ServeObservation</span>:
w = self._workload
<span class="kw">return</span> ServeObservation(
queue_depth = float(w.queue_depth),
mean_prompt_len = w.mean_prompt_len,
arrival_rate = w.arrival_rate,
kv_cache_occupancy = (<span class="nm">1.0</span> - (m.eviction_events / max(w.queue_depth, <span class="nm">1</span>))),
ttft_p50 = m.ttft_p50_ms,
tpot_p50 = m.tpot_ms,
slo_violation_rate = m.slo_violations / max(w.queue_depth, <span class="nm">1</span>),
gpu_memory_used_gb = m.gpu_memory_gb,
spec_accept_rate = m.spec_accept_rate,
priority_distribution = w.priority_distribution,
timestep = self._step,
cost_so_far = self._cost_so_far,
)</pre></div>
</div>
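<div class="label">SKETCH: DAY 4 GATE LOOP</div>
The gate loop can be rehearsed against a stub that mimics the <code>reset()</code>/<code>step()</code> contract. The stub is ours and demonstrates only the interaction shape, not InferenceEnv behaviour:

```python
import numpy as np

class StubEnv:
    """Minimal stand-in exposing the reset()/step() contract of InferenceEnv."""
    def __init__(self, episode_len: int = 200, seed: int = 42):
        self.episode_len = episode_len
        self.rng = np.random.default_rng(seed)
        self._step = 0
    def reset(self):
        self._step = 0
        return {"timestep": 0}
    def step(self, action: dict):
        self._step += 1
        reward = float(np.clip(self.rng.normal(0.0, 0.3), -1.0, 1.0))
        done = self._step >= self.episode_len
        return {"timestep": self._step}, reward, done, {}

def random_action(rng):
    """Random valid action in the plan's ranges."""
    return {"kv_budget": float(rng.uniform(0.1, 1.0)),
            "batch_size": int(rng.choice([1, 4, 8, 16, 32, 64, 128, 256, 512]))}

env, rng = StubEnv(), np.random.default_rng(0)
obs = env.reset()
results = [env.step(random_action(rng)) for _ in range(200)]
assert results[-1][2]                                   # done at step 200
assert all(-1.0 <= r[1] <= 1.0 for r in results)        # rewards clipped
```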
</div>
<!-- ═══════════════════════════════ PHASE 3 ═══════════════════════════════ -->
<div class="section" id="phase3">
<div class="section-header">
<div class="section-num">P3</div>
<div class="section-meta">
<div class="section-title">Phase 3 — API Layer & Docker</div>
<div class="section-sub">Day 5 (Mar 31). Goal: all 8 HTTP endpoints are live, wired to the real InferenceEnv, and the Docker image builds cleanly and passes openenv validate.</div>
</div>
<span class="eyebrow-tag et-blue">Day 5 · Mar 31</span>
</div>
<div class="gate-box">
<div class="gate-icon">🌐</div>
<div>
<div class="gate-label">Phase Gate — End of Day 5</div>
<div class="gate-text"><strong>Running the openenv CLI validation passes with no errors:</strong> <code>openenv validate --url http://localhost:7860</code>. Every endpoint returns the correct shape. The Docker image is under 2GB. A full reset→step×200→grader cycle completes in under 60 seconds.</div>
</div>
</div>
<div class="label">ALL ENDPOINTS — Implementation Spec</div>
<div class="table-wrap mb14">
<table>
<tr>
<th>Endpoint</th><th>Method</th><th>Owns</th><th>Wired to</th><th>Key Behaviour</th>
</tr>
<tr>
<td><code>/health</code></td><td>GET</td><td>Person B</td><td>Session cache count</td>
<td>Returns <code>{"status":"ok","active_sessions":N,"uptime_s":T}</code></td>
</tr>
<tr>
<td><code>/tasks</code></td><td>GET</td><td>Person B</td><td>Static task config dict</td>
<td>Returns list of 3 tasks with id, name, difficulty, description, active_actions</td>
</tr>
<tr>
<td><code>/reset</code></td><td>POST</td><td>Person B</td><td><code>InferenceEnv.reset()</code></td>
<td>Creates new session_id, instantiates InferenceEnv for that task, stores in LRU cache. Returns session_id + observation.</td>
</tr>
<tr>
<td><code>/step</code></td><td>POST</td><td>Person B</td><td><code>InferenceEnv.step()</code></td>
<td>Looks up session by session_id, validates ServeAction, calls step(), returns obs+reward+done+info. 404 if session not found.</td>
</tr>
<tr>
<td><code>/state</code></td><td>GET</td><td>Person B</td><td><code>InferenceEnv.state()</code></td>
<td>Returns current episode metadata: step_count, cumulative_reward, done, workload_phase.</td>
</tr>
<tr>
<td><code>/grader</code></td><td>POST</td><td>Person C</td><td><code>GraderModule.score()</code></td>
<td>Accepts episode_log JSON, returns score 0–1 with breakdown. Stateless — same input always same output.</td>
</tr>
<tr>
<td><code>/baseline</code></td><td>GET</td><td>Person C</td><td><code>BaselineAgent.run()</code></td>
<td>Runs the fixed-config baseline agent on all 3 tasks, returns scores. Fixed seed guarantees reproducibility.</td>
</tr>
<tr>
<td><code>/info</code></td><td>GET</td><td>Person B</td><td>Static schema</td>
<td>Returns full JSON schema for action space, observation space, reward weights. Used by agent frameworks.</td>
</tr>
</table>
</div>
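<div class="label">SKETCH: CLIENT LOOP AGAINST THE API</div>
A stdlib-only client sketch of the reset/step cycle. The response field names (<code>session_id</code>, <code>observation</code>, <code>reward</code>, <code>done</code>) are assumed from the table above and should be checked against the final schema; the network call is guarded so nothing runs without a live server:

```python
import json
import urllib.request

BASE = "http://localhost:7860"  # the server from the Phase 0 smoke test

def step_payload(session_id: str, action: dict) -> dict:
    """Body for POST /step, per the endpoint table."""
    return {"session_id": session_id, "action": action}

def post(path: str, payload: dict) -> dict:
    req = urllib.request.Request(BASE + path,
                                 data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    session = post("/reset", {"task_id": 1, "seed": 42})
    sid, total = session["session_id"], 0.0
    for _ in range(200):
        out = post("/step", step_payload(sid, {"kv_budget": 0.5, "batch_size": 32}))
        total += out["reward"]
        if out["done"]:
            break
    print(f"episode return: {total:.3f}")
```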
<div class="label">SESSION MANAGEMENT — Critical Design</div>
<div class="code-block">
<div class="code-block-header">
<span class="code-lang">python</span>
<span class="code-file">simulator/session_manager.py — Thread-safe LRU session cache</span>
</div>
<div class="code-body"><pre><span class="kw">import</span> threading
<span class="kw">from</span> collections <span class="kw">import</span> OrderedDict
<span class="kw">from</span> typing <span class="kw">import</span> <span class="tp">Optional</span>
<span class="kw">from</span> env.inference_env <span class="kw">import</span> InferenceEnv
<span class="kw">class</span> <span class="tp">SessionManager</span>:
<span class="st">"""Thread-safe LRU cache of active InferenceEnv instances."""</span>
MAX_SESSIONS = <span class="nm">50</span>
<span class="kw">def</span> <span class="fn">__init__</span>(self, simulator):
self._sim = simulator
self._lock = threading.Lock()
self._sessions: <span class="tp">OrderedDict[str, InferenceEnv]</span> = OrderedDict()
<span class="kw">def</span> <span class="fn">create</span>(self, task_id: <span class="tp">int</span>, seed: <span class="tp">int</span>) -> <span class="tp">InferenceEnv</span>:
<span class="kw">with</span> self._lock:
<span class="kw">if</span> len(self._sessions) >= self.MAX_SESSIONS:
self._sessions.popitem(last=<span class="kw">False</span>) <span class="cm"># evict oldest</span>
env = InferenceEnv(self._sim, task_id, seed)
self._sessions[env.session_id] = env
<span class="kw">return</span> env
<span class="kw">def</span> <span class="fn">get</span>(self, session_id: <span class="tp">str</span>) -> <span class="tp">Optional[InferenceEnv]</span>:
<span class="kw">with</span> self._lock:
env = self._sessions.get(session_id)
<span class="kw">if</span> env: <span class="cm"># move to end (mark as recently used)</span>
self._sessions.move_to_end(session_id)
<span class="kw">return</span> env
<span class="kw">def</span> <span class="fn">remove</span>(self, session_id: <span class="tp">str</span>) -> <span class="tp">None</span>:
<span class="kw">with</span> self._lock:
self._sessions.pop(session_id, <span class="kw">None</span>)
<span class="kw">def</span> <span class="fn">count</span>(self) -> <span class="tp">int</span>:
<span class="kw">with</span> self._lock:
<span class="kw">return</span> len(self._sessions)</pre></div>
</div>
<div class="label">FASTAPI APP SKELETON — Person B writes this on Day 4 (stubs) and wires on Day 5</div>
<div class="code-block">
<div class="code-block-header">
<span class="code-lang">python</span>
<span class="code-file">server/app.py — Main FastAPI application</span>
</div>
<div class="code-body"><pre><span class="kw">from</span> fastapi <span class="kw">import</span> FastAPI, HTTPException
<span class="kw">from</span> fastapi.middleware.cors <span class="kw">import</span> CORSMiddleware
<span class="kw">from</span> pydantic <span class="kw">import</span> BaseModel
<span class="kw">from</span> typing <span class="kw">import</span> <span class="tp">Optional</span>
<span class="kw">import</span> time
<span class="kw">from</span> simulator.trace_sim <span class="kw">import</span> TraceSimulator
<span class="kw">from</span> simulator.session_manager <span class="kw">import</span> SessionManager
<span class="kw">from</span> inferencegym.models <span class="kw">import</span> ServeAction, QuantTier
app = FastAPI(title=<span class="st">"InferenceGym"</span>, version=<span class="st">"1.0.0"</span>)
app.add_middleware(CORSMiddleware, allow_origins=[<span class="st">"*"</span>], allow_methods=[<span class="st">"*"</span>], allow_headers=[<span class="st">"*"</span>])
<span class="cm"># ── App startup: load simulator once, create session manager ─────────────────</span>
_sim = <span class="kw">None</span>
_sessions = <span class="kw">None</span>
_start_time = time.time()
@app.on_event(<span class="st">"startup"</span>)
<span class="kw">async def</span> <span class="fn">startup</span>():
<span class="kw">global</span> _sim, _sessions
_sim = TraceSimulator(<span class="st">"simulator/data/traces_llama3_8b.parquet"</span>)
_sessions = SessionManager(_sim)
<span class="cm"># ── Pydantic request/response models ────────────────────────────────────────</span>
<span class="kw">class</span> <span class="tp">ResetRequest</span>(BaseModel):
task_id: <span class="tp">int</span>
seed: <span class="tp">int</span> = <span class="nm">42</span>
config: <span class="tp">Optional[dict]</span> = <span class="kw">None</span> <span class="cm"># override alpha/beta/gamma/delta</span>
<span class="kw">class</span> <span class="tp">StepRequest</span>(BaseModel):
session_id: <span class="tp">str</span>
action: <span class="tp">dict</span>
<span class="kw">class</span> <span class="tp">GraderRequest</span>(BaseModel):
task_id: <span class="tp">int</span>
episode_log: <span class="tp">list</span>
<span class="cm"># ── Endpoints ─────────────────────────────────────────────────────────────────</span>
@app.get(<span class="st">"/health"</span>)
<span class="kw">def</span> <span class="fn">health</span>():
<span class="kw">return</span> {<span class="st">"status"</span>: <span class="st">"ok"</span>, <span class="st">"active_sessions"</span>: _sessions.count(),
<span class="st">"uptime_seconds"</span>: int(time.time() - _start_time)}
@app.get(<span class="st">"/tasks"</span>)
<span class="kw">def</span> <span class="fn">get_tasks</span>():
<span class="kw">return</span> {<span class="st">"tasks"</span>: [
{<span class="st">"id"</span>:<span class="nm">1</span>, <span class="st">"name"</span>:<span class="st">"Static Uniform"</span>, <span class="st">"difficulty"</span>:<span class="st">"easy"</span>, <span class="st">"active_actions"</span>:[<span class="st">"kv_budget"</span>,<span class="st">"batch_size"</span>]},
{<span class="st">"id"</span>:<span class="nm">2</span>, <span class="st">"name"</span>:<span class="st">"Bursty ShareGPT"</span>, <span class="st">"difficulty"</span>:<span class="st">"medium"</span>, <span class="st">"active_actions"</span>:[<span class="st">"kv_budget"</span>,<span class="st">"batch_size"</span>,<span class="st">"spec_length"</span>]},
{<span class="st">"id"</span>:<span class="nm">3</span>, <span class="st">"name"</span>:<span class="st">"Adversarial Multi-Tenant"</span>,<span class="st">"difficulty"</span>:<span class="st">"hard"</span>, <span class="st">"active_actions"</span>:[<span class="st">"kv_budget"</span>,<span class="st">"batch_size"</span>,<span class="st">"spec_length"</span>,<span class="st">"prefill_disagg"</span>,<span class="st">"quant_tier"</span>]},
]}
@app.post(<span class="st">"/reset"</span>)
<span class="kw">def</span> <span class="fn">reset</span>(req: <span class="tp">ResetRequest</span>):
<span class="kw">if</span> req.task_id <span class="kw">not in</span> {<span class="nm">1</span>, <span class="nm">2</span>, <span class="nm">3</span>}:
<span class="kw">raise</span> HTTPException(<span class="nm">422</span>, <span class="st">f"task_id must be 1, 2, or 3. Got {req.task_id}"</span>)
env = _sessions.create(req.task_id, req.seed)
obs = env.reset()
<span class="kw">return</span> {<span class="st">"session_id"</span>: env.session_id, <span class="st">"observation"</span>: obs.__dict__, <span class="st">"episode_length"</span>: <span class="nm">200</span>}
@app.post(<span class="st">"/step"</span>)
<span class="kw">def</span> <span class="fn">step</span>(req: <span class="tp">StepRequest</span>):
env = _sessions.get(req.session_id)
<span class="kw">if not</span> env:
<span class="kw">raise</span> HTTPException(<span class="nm">404</span>, <span class="st">f"Session '{req.session_id}' not found. Call /reset first."</span>)
action = ServeAction(
kv_budget = req.action.get(<span class="st">"kv_budget"</span>, <span class="nm">1.0</span>),
spec_length = req.action.get(<span class="st">"spec_length"</span>, <span class="nm">0</span>),
batch_size = req.action.get(<span class="st">"batch_size"</span>, <span class="nm">32</span>),
prefill_disagg = req.action.get(<span class="st">"prefill_disagg"</span>, <span class="kw">False</span>),
quant_tier = QuantTier(req.action.get(<span class="st">"quant_tier"</span>, <span class="nm">0</span>)),
)
obs, reward, done, info = env.step(action)
<span class="kw">if</span> done:
_sessions.remove(req.session_id)
<span class="kw">return</span> {<span class="st">"observation"</span>: obs.__dict__, <span class="st">"reward"</span>: reward, <span class="st">"done"</span>: done, <span class="st">"info"</span>: info}</pre></div>
</div>
<div class="label">DOCKERFILE β€” Multi-stage, CPU-only, &lt;2GB</div>
<div class="code-block">
<div class="code-block-header">
<span class="code-lang">dockerfile</span>
<span class="code-file">Dockerfile</span>
</div>
<div class="code-body"><pre><span class="cm"># Stage 1: Install dependencies only</span>
<span class="kw">FROM</span> python:3.11-slim <span class="kw">AS</span> builder
<span class="kw">WORKDIR</span> /build
<span class="kw">COPY</span> requirements.txt .
<span class="kw">RUN</span> pip install --no-cache-dir --user -r requirements.txt
<span class="cm"># Stage 2: Minimal runtime (no build tools)</span>
<span class="kw">FROM</span> python:3.11-slim
<span class="kw">WORKDIR</span> /app
<span class="kw">COPY</span> --from=builder /root/.local /root/.local
<span class="kw">COPY</span> . .
<span class="kw">ENV</span> PATH=/root/.local/bin:$PATH
<span class="kw">ENV</span> PYTHONPATH=/app
<span class="kw">EXPOSE</span> 7860
<span class="cm"># HuggingFace Spaces convention: port 7860</span>
<span class="kw">CMD</span> ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "2"]
<span class="cm">## requirements.txt (CPU-only β€” NO torch, NO CUDA)</span>
<span class="cm"># fastapi==0.115.0</span>
<span class="cm"># uvicorn[standard]==0.30.0</span>
<span class="cm"># pydantic==2.7.0</span>
<span class="cm"># numpy==1.26.4</span>
<span class="cm"># scipy==1.13.0</span>
<span class="cm"># pandas==2.2.0</span>
<span class="cm"># pyarrow==15.0.0 (for parquet reading)</span>
<span class="cm"># stable-baselines3==2.3.0 (PPO demo only)</span>
<span class="cm"># gymnasium==0.29.1</span>
<span class="cm"># httpx==0.27.0 (for integration tests)</span></pre></div>
</div>
</div>
<!-- ═══════════════════════════════ PHASE 4 ═══════════════════════════════ -->
<div class="section" id="phase4">
<div class="section-header">
<div class="section-num">P4</div>
<div class="section-meta">
<div class="section-title">Phase 4 β€” Grader, Baseline & Task Completion</div>
<div class="section-sub">Days 6–7 (Apr 1–2). Goal: all three tasks are complete, the /grader endpoint scores any episode log deterministically, and the baseline agent runs and produces reproducible scores around 0.22–0.35.</div>
</div>
<span class="eyebrow-tag et-amber">Days 6–7</span>
</div>
<div class="gate-box">
<div class="gate-icon">πŸ“Š</div>
<div>
<div class="gate-label">Phase Gate β€” End of Day 7</div>
<div class="gate-text"><strong>POST /grader with a handcrafted episode log returns a score between 0.0 and 1.0 with a complete breakdown dict.</strong> GET /baseline returns scores in the range [0.20, 0.40] for all 3 tasks. The grader returns the same score on repeated calls with the same input. All grader unit tests pass.</div>
</div>
</div>
<div class="label">GRADER DESIGN β€” Per-Task Formula Detail</div>
<div class="grid2 mb14">
<div class="module-card">
<div class="module-card-header">
<span class="module-card-name" style="color:var(--green)">Task 1 Grader</span>
<span class="eyebrow-tag et-green">EASY</span>
</div>
<div class="module-card-body">
<div class="module-card-desc">Pure throughput optimisation. Score is the normalised improvement over baseline on mean tokens/sec, capped at 1.0.</div>
<div class="code-block" style="margin-bottom:0; font-size:11px;">
<div class="code-body" style="padding:10px 12px;"><pre><span class="cm"># All values are means over the 200-step episode log</span>
score = (agent_tps - baseline_tps) / (optimal_tps - baseline_tps)
score = max(0.0, min(1.0, score))
<span class="cm"># baseline_tps β‰ˆ 2800 tokens/s (batch=32, kv=1.0)</span>
<span class="cm"># optimal_tps β‰ˆ 8200 tokens/s (batch=128, kv=0.5)</span></pre></div>
</div>
</div>
</div>
<div class="module-card">
<div class="module-card-header">
<span class="module-card-name" style="color:var(--blue)">Task 2 Grader</span>
<span class="eyebrow-tag et-blue">MEDIUM</span>
</div>
<div class="module-card-body">
<div class="module-card-desc">Balances TTFT and memory compliance. Both components are independently scored and averaged.</div>
<div class="code-block" style="margin-bottom:0; font-size:11px;">
<div class="code-body" style="padding:10px 12px;"><pre>ttft_score = max(0.0, 1.0 - mean_ttft_p50 / 300.0)
peak_mem = max(s['metrics']['gpu_memory_gb'] for s in episode_log)
mem_score = 1.0 if peak_mem < 36.0 else max(0.0, 1.0 - (peak_mem-36)/10)
score = 0.5 * ttft_score + 0.5 * mem_score</pre></div>
</div>
</div>
</div>
<div class="module-card">
<div class="module-card-header">
<span class="module-card-name" style="color:var(--red)">Task 3 Grader</span>
<span class="eyebrow-tag et-red">HARD</span>
</div>
<div class="module-card-body">
<div class="module-card-desc">4-component scoring with explicit weights. Stability score penalises wild action thrashing β€” rewards a smooth, learnable policy.</div>
<div class="code-block" style="margin-bottom:0; font-size:11px;">
<div class="code-body" style="padding:10px 12px;"><pre>T = mean_tps / optimal_tps <span class="cm"># throughput</span>
S = 1.0 - mean_slo_violation_rate <span class="cm"># SLO compliance</span>
C = max(0.0, 1.0 - total_cost/5.0) <span class="cm"># cost (budget=5.0)</span>
A = 1.0 - action_variance_score <span class="cm"># stability</span>
score = 0.40*T + 0.30*S + 0.20*C + 0.10*A</pre></div>
</div>
</div>
</div>
<div class="module-card">
<div class="module-card-header">
<span class="module-card-name" style="color:var(--amber)">Stability Score</span>
<span class="eyebrow-tag et-amber">Anti-Thrashing</span>
</div>
<div class="module-card-body">
<div class="module-card-desc">Computes the variance of consecutive actions taken by the agent. High variance = thrashing = unstable policy. The stability score penalises this.</div>
<div class="code-block" style="margin-bottom:0; font-size:11px;">
<div class="code-body" style="padding:10px 12px;"><pre>actions = [step['action'] for step in episode_log]
batch_diffs = np.diff([a['batch_size'] for a in actions])
kv_diffs = np.diff([a['kv_budget'] for a in actions])
variance = np.std(batch_diffs)/512 + np.std(kv_diffs)/1.0
action_variance_score = min(1.0, variance / 0.5) <span class="cm"># 0=stable, 1=chaotic</span></pre></div>
</div>
</div>
</div>
</div>
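To make the Task 3 weighting concrete, here is a minimal worked example (plain Python + NumPy; the component values are made up for illustration) that reproduces the weighted sum from the card above:

```python
import numpy as np

# Hypothetical per-component inputs for one Task 3 episode
mean_tps, optimal_tps = 2940.0, 4200.0
mean_slo_violation_rate = 0.10      # 10% of requests missed their SLO
total_cost, budget = 3.5, 5.0
action_variance_score = 0.20        # output of the anti-thrashing formula

T = float(np.clip(mean_tps / optimal_tps, 0.0, 1.0))  # throughput  -> 0.70
S = 1.0 - mean_slo_violation_rate                     # SLO         -> 0.90
C = max(0.0, 1.0 - total_cost / budget)               # cost        -> 0.30
A = 1.0 - action_variance_score                       # stability   -> 0.80

score = 0.40 * T + 0.30 * S + 0.20 * C + 0.10 * A
print(round(score, 3))  # 0.69
```

Note how the 0.40 throughput weight dominates: an agent can tolerate a mediocre cost score (0.30 here) and still land near 0.7 overall if it keeps tokens/sec high.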
<div class="label">GRADER MODULE β€” Full Implementation</div>
<div class="code-block">
<div class="code-block-header">
<span class="code-lang">python</span>
<span class="code-file">grader/grader.py β€” Deterministic episode scorer</span>
</div>
<div class="code-body"><pre><span class="kw">import</span> numpy <span class="kw">as</span> np
<span class="kw">from</span> typing <span class="kw">import</span> <span class="tp">List, Dict, Any</span>
<span class="kw">class</span> <span class="tp">GraderModule</span>:
<span class="cm">"""Deterministic grader. Same episode_log β†’ same score, always."""</span>
BASELINE_TPS = {<span class="nm">1</span>: <span class="nm">2800.0</span>, <span class="nm">2</span>: <span class="nm">2100.0</span>, <span class="nm">3</span>: <span class="nm">1600.0</span>}
OPTIMAL_TPS = {<span class="nm">1</span>: <span class="nm">8200.0</span>, <span class="nm">2</span>: <span class="nm">5800.0</span>, <span class="nm">3</span>: <span class="nm">4200.0</span>}
<span class="kw">def</span> <span class="fn">score</span>(self, task_id: <span class="tp">int</span>, episode_log: <span class="tp">List[Dict[str, Any]]</span>) -> <span class="tp">Dict</span>:
<span class="kw">if not</span> episode_log:
<span class="kw">return</span> {<span class="st">"score"</span>: <span class="nm">0.0</span>, <span class="st">"breakdown"</span>: {}, <span class="st">"feedback"</span>: <span class="st">"Empty episode log."</span>}
graders = {<span class="nm">1</span>: self._task1, <span class="nm">2</span>: self._task2, <span class="nm">3</span>: self._task3}
<span class="kw">if</span> task_id <span class="kw">not in</span> graders:
<span class="kw">raise</span> ValueError(<span class="st">f"Unknown task_id: {task_id}"</span>)
<span class="kw">return</span> graders[task_id](episode_log)
<span class="kw">def</span> <span class="fn">_task1</span>(self, log) -> <span class="tp">Dict</span>:
mean_tps = np.mean([s[<span class="st">'metrics'</span>][<span class="st">'tokens_per_sec'</span>] <span class="kw">for</span> s <span class="kw">in</span> log])
score = (mean_tps - self.BASELINE_TPS[<span class="nm">1</span>]) / (self.OPTIMAL_TPS[<span class="nm">1</span>] - self.BASELINE_TPS[<span class="nm">1</span>])
score = float(np.clip(score, <span class="nm">0.0</span>, <span class="nm">1.0</span>))
feedback = self._throughput_feedback(mean_tps, <span class="nm">1</span>)
<span class="kw">return</span> {<span class="st">"score"</span>: score, <span class="st">"breakdown"</span>: {<span class="st">"throughput"</span>: score}, <span class="st">"feedback"</span>: feedback}
<span class="kw">def</span> <span class="fn">_task2</span>(self, log) -> <span class="tp">Dict</span>:
mean_ttft = np.mean([s[<span class="st">'metrics'</span>][<span class="st">'ttft_p50_ms'</span>] <span class="kw">for</span> s <span class="kw">in</span> log])
peak_mem = max(s[<span class="st">'metrics'</span>][<span class="st">'gpu_memory_gb'</span>] <span class="kw">for</span> s <span class="kw">in</span> log)
ttft_score = float(np.clip(<span class="nm">1.0</span> - mean_ttft / <span class="nm">300.0</span>, <span class="nm">0.0</span>, <span class="nm">1.0</span>))
mem_score = <span class="nm">1.0</span> <span class="kw">if</span> peak_mem < <span class="nm">36.0</span> <span class="kw">else</span> float(np.clip(<span class="nm">1.0</span> - (peak_mem-<span class="nm">36</span>)/<span class="nm">10</span>, <span class="nm">0.0</span>, <span class="nm">1.0</span>))
score = <span class="nm">0.5</span> * ttft_score + <span class="nm">0.5</span> * mem_score
feedback = <span class="kw">f</span><span class="st">"TTFT score: {ttft_score:.2f} (mean TTFT {mean_ttft:.0f}ms vs 300ms SLO). Memory score: {mem_score:.2f} (peak {peak_mem:.1f}GB vs 36GB limit)."</span>
<span class="kw">return</span> {<span class="st">"score"</span>: score, <span class="st">"breakdown"</span>: {<span class="st">"ttft"</span>: ttft_score, <span class="st">"memory"</span>: mem_score}, <span class="st">"feedback"</span>: feedback}
<span class="kw">def</span> <span class="fn">_task3</span>(self, log) -> <span class="tp">Dict</span>:
mean_tps = np.mean([s[<span class="st">'metrics'</span>][<span class="st">'tokens_per_sec'</span>] <span class="kw">for</span> s <span class="kw">in</span> log])
mean_slo = np.mean([s[<span class="st">'metrics'</span>][<span class="st">'slo_violations'</span>] <span class="kw">for</span> s <span class="kw">in</span> log])
total_cost = sum(s[<span class="st">'metrics'</span>][<span class="st">'cost_per_1k'</span>] <span class="kw">for</span> s <span class="kw">in</span> log)
actions = [s[<span class="st">'action'</span>] <span class="kw">for</span> s <span class="kw">in</span> log]
T = float(np.clip(mean_tps / self.OPTIMAL_TPS[<span class="nm">3</span>], <span class="nm">0.0</span>, <span class="nm">1.0</span>))
S = float(np.clip(<span class="nm">1.0</span> - mean_slo / <span class="nm">100.0</span>, <span class="nm">0.0</span>, <span class="nm">1.0</span>))
C = float(np.clip(<span class="nm">1.0</span> - total_cost / <span class="nm">5.0</span>, <span class="nm">0.0</span>, <span class="nm">1.0</span>))
A = <span class="nm">1.0</span> - self._action_variance(actions)
score = <span class="nm">0.40</span>*T + <span class="nm">0.30</span>*S + <span class="nm">0.20</span>*C + <span class="nm">0.10</span>*A
feedback = self._task3_feedback(T, S, C, A, log)
<span class="kw">return</span> {<span class="st">"score"</span>: score, <span class="st">"breakdown"</span>: {<span class="st">"throughput"</span>:T,<span class="st">"slo"</span>:S,<span class="st">"cost"</span>:C,<span class="st">"stability"</span>:A}, <span class="st">"feedback"</span>: feedback}
<span class="kw">def</span> <span class="fn">_action_variance</span>(self, actions) -> <span class="tp">float</span>:
batch_vals = [a.get(<span class="st">'batch_size'</span>, <span class="nm">32</span>) <span class="kw">for</span> a <span class="kw">in</span> actions]
kv_vals = [a.get(<span class="st">'kv_budget'</span>, <span class="nm">1.0</span>) <span class="kw">for</span> a <span class="kw">in</span> actions]
variance = np.std(np.diff(batch_vals))/<span class="nm">512</span> + np.std(np.diff(kv_vals))/<span class="nm">1.0</span>
<span class="kw">return</span> float(np.clip(variance / <span class="nm">0.5</span>, <span class="nm">0.0</span>, <span class="nm">1.0</span>))
<span class="kw">def</span> <span class="fn">_throughput_feedback</span>(self, mean_tps, task_id) -> <span class="tp">str</span>:
pct = (mean_tps - self.BASELINE_TPS[task_id]) / (self.OPTIMAL_TPS[task_id] - self.BASELINE_TPS[task_id]) * <span class="nm">100</span>
<span class="kw">return</span> <span class="kw">f</span><span class="st">f"Agent achieved {mean_tps:.0f} TPS ({pct:.0f}% of way from baseline to optimal)."</span></pre></div>
</div>
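The phase gate demands determinism: same episode_log in, same score out. A quick way to sanity-check the stability component is a standalone copy of the formula (this is a sketch mirroring `_action_variance` above, not an import of the real module) driven by two synthetic action streams:

```python
import numpy as np

def action_variance_score(actions):
    # Standalone copy of the anti-thrashing formula from grader.py
    batch_vals = [a.get('batch_size', 32) for a in actions]
    kv_vals = [a.get('kv_budget', 1.0) for a in actions]
    variance = np.std(np.diff(batch_vals)) / 512 + np.std(np.diff(kv_vals)) / 1.0
    return float(np.clip(variance / 0.5, 0.0, 1.0))

stable = [{'batch_size': 32, 'kv_budget': 1.0}] * 200          # never changes
thrash = [{'batch_size': 512 if i % 2 == 0 else 32,
           'kv_budget': 0.1 if i % 2 == 0 else 1.0}
          for i in range(200)]                                  # flips every step

assert action_variance_score(stable) == 0.0                     # perfectly stable
assert action_variance_score(thrash) == 1.0                     # clipped at chaotic
assert action_variance_score(thrash) == action_variance_score(thrash)  # deterministic
```

The same pattern (fixed synthetic log, repeated calls, exact equality) is how the grader unit tests in the gate can be written for all three tasks.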
<div class="label">BASELINE AGENT β€” Fixed-config, deterministic</div>
<div class="code-block">
<div class="code-block-header">
<span class="code-lang">python</span>
<span class="code-file">agents/baseline.py β€” NaΓ―ve vLLM defaults (Person C, Day 6)</span>
</div>
<div class="code-body"><pre><span class="kw">from</span> inferencegym.models <span class="kw">import</span> ServeAction, QuantTier
<span class="kw">from</span> env.inference_env <span class="kw">import</span> InferenceEnv
<span class="kw">from</span> simulator.trace_sim <span class="kw">import</span> TraceSimulator
<span class="kw">from</span> grader.grader <span class="kw">import</span> GraderModule
<span class="cm"># The fixed action that the baseline ALWAYS takes, regardless of observation</span>
BASELINE_ACTION = ServeAction(
kv_budget = <span class="nm">1.0</span>, <span class="cm"># no eviction</span>
spec_length = <span class="nm">0</span>, <span class="cm"># speculative decoding off</span>
batch_size = <span class="nm">32</span>, <span class="cm"># vLLM default</span>
prefill_disagg = <span class="kw">False</span>, <span class="cm"># colocated</span>
quant_tier = QuantTier.FP16, <span class="cm"># full precision</span>
)
<span class="kw">def</span> <span class="fn">run_baseline</span>(task_id: <span class="tp">int</span>, seed: <span class="tp">int</span> = <span class="nm">0</span>) -> <span class="tp">dict</span>:
<span class="st">"""Runs fixed baseline agent on one task, returns grader score."""</span>
sim = TraceSimulator(<span class="st">"simulator/data/traces_llama3_8b.parquet"</span>, seed=seed)
env = InferenceEnv(sim, task_id=task_id, seed=seed)
grader = GraderModule()
env.reset()
done = <span class="kw">False</span>
<span class="kw">while not</span> done:
_, _, done, _ = env.step(BASELINE_ACTION)
result = grader.score(task_id, env._episode_log)
<span class="kw">return</span> {<span class="st">"task_id"</span>: task_id, <span class="st">"score"</span>: result[<span class="st">"score"</span>],
<span class="st">"breakdown"</span>: result[<span class="st">"breakdown"</span>], <span class="st">"action_config"</span>: BASELINE_ACTION.__dict__}
<span class="kw">def</span> <span class="fn">run_all_baselines</span>() -> <span class="tp">dict</span>:
<span class="cm"># Seed=0 guarantees identical results every run</span>
<span class="kw">return</span> {<span class="st">"scores"</span>: {<span class="kw">f</span><span class="st">"task{i}"</span>: run_baseline(i, seed=<span class="nm">0</span>)[<span class="st">"score"</span>] <span class="kw">for</span> i <span class="kw">in</span> [<span class="nm">1</span>,<span class="nm">2</span>,<span class="nm">3</span>]},
<span class="st">"expected_range"</span>: {<span class="st">"task1"</span>:[<span class="nm">0.30</span>,<span class="nm">0.40</span>], <span class="st">"task2"</span>:[<span class="nm">0.22</span>,<span class="nm">0.32</span>], <span class="st">"task3"</span>:[<span class="nm">0.18</span>,<span class="nm">0.28</span>]}}</pre></div>
</div>
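The end-of-day-7 gate check itself can be scripted against the payload shape that `run_all_baselines` returns. A small sketch (the sample numbers below are hypothetical, chosen to sit inside the expected ranges):

```python
def check_baseline_gate(result):
    # result mirrors the run_all_baselines() payload: scores + expected_range
    failures = []
    for task, (lo, hi) in result["expected_range"].items():
        score = result["scores"][task]
        if not (lo <= score <= hi):
            failures.append(f"{task}: {score:.3f} outside [{lo}, {hi}]")
    return failures

# Hypothetical run with all three tasks inside their bands
sample = {
    "scores": {"task1": 0.34, "task2": 0.27, "task3": 0.22},
    "expected_range": {"task1": [0.30, 0.40], "task2": [0.22, 0.32],
                       "task3": [0.18, 0.28]},
}
assert check_baseline_gate(sample) == []  # gate passes
```

Wiring this into CI means a simulator regression that silently shifts baseline scores fails the build instead of surfacing on demo day.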
</div>
<!-- ═══════════════════════════════ PHASE 5 ═══════════════════════════════ -->
<div class="section" id="phase5">
<div class="section-header">
<div class="section-num">P5</div>
<div class="section-meta">
<div class="section-title">Phase 5 β€” Deployment & Demo Agent</div>
<div class="section-sub">Days 8–9 (Apr 3–4). Goal: the environment is live on HuggingFace Spaces at a public URL, a PPO agent shows a rising reward curve, and the Colab demo notebook runs end-to-end.</div>
</div>
<span class="eyebrow-tag et-amber">Days 8–9</span>
</div>
<div class="gate-box">
<div class="gate-icon">πŸš€</div>
<div>
<div class="gate-label">Phase Gate β€” End of Day 9</div>
<div class="gate-text"><strong>From a fresh machine with no local setup, running the Colab notebook completes all cells without error.</strong> The HuggingFace Spaces URL is public and all endpoints respond. The PPO reward curve plot shows a statistically increasing trend from first 5k steps to last 5k steps of training.</div>
</div>
</div>
<div class="grid2 mb14">
<div>
<div class="label">HUGGINGFACE SPACES DEPLOYMENT</div>
<div class="deliverable-box">
<div class="deliverable-title">Person B β€” Days 8-9</div>
<ul class="deliverable-list">
<li><span class="dl-bullet dl-blue">B</span><div class="dl-text"><strong>Create HF Space with Docker SDK</strong> Go to huggingface.co/new-space. Select SDK: Docker. This will create a Dockerfile-based deployment where port 7860 is auto-exposed. Push your repo code.</div></li>
<li><span class="dl-bullet dl-blue">B</span><div class="dl-text"><strong>README.md HF frontmatter</strong> Add the required YAML block at the top of README.md: <code>title: InferenceGym, emoji: πŸ‹οΈ, colorFrom: green, colorTo: blue, sdk: docker, pinned: false</code>. This controls the HF Space landing page.</div></li>
<li><span class="dl-bullet dl-blue">B</span><div class="dl-text"><strong>Health check verification</strong> After push, HF Spaces shows a build log. Wait for "Running" status. Hit the public URL's /health endpoint. If it doesn't respond in 2 minutes, check build logs for import errors β€” most commonly a missing package in requirements.txt.</div></li>
<li><span class="dl-bullet dl-blue">B</span><div class="dl-text"><strong>Stress test from live URL</strong> Run 10 concurrent reset+stepΓ—5 loops against the live URL. Check /health shows active_sessions > 0 during the test. Confirm no 500 errors appear in HF Space logs.</div></li>
</ul>
</div>
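One way to script the stress test in the last bullet. This is a sketch, not the final test: the `post` callable is injected so the same code can be exercised against a stub locally before being pointed at the live Space URL with httpx.

```python
from concurrent.futures import ThreadPoolExecutor

STEP_ACTION = {"kv_budget": 1.0, "batch_size": 32, "spec_length": 0,
               "prefill_disagg": False, "quant_tier": 0}

def one_client(base_url, post):
    # post(url, json_body) -> parsed JSON dict; against the live Space use:
    #   post = lambda u, j: httpx.post(u, json=j, timeout=30).json()
    r = post(f"{base_url}/reset", {"task_id": 1, "seed": 42})
    sid = r["session_id"]
    for _ in range(5):
        s = post(f"{base_url}/step", {"session_id": sid, "action": STEP_ACTION})
        if "reward" not in s:   # a 500 or error payload surfaces as a missing field
            return False
    return True

def stress_test(base_url, post, n_clients=10):
    # 10 concurrent clients, each doing one /reset followed by 5 /step calls
    with ThreadPoolExecutor(max_workers=n_clients) as ex:
        ok = list(ex.map(lambda _: one_client(base_url, post), range(n_clients)))
    return sum(ok)  # number of clients that completed cleanly (expect n_clients)
```

While this runs, poll /health from a second terminal and confirm active_sessions > 0, then check the HF Space logs for 500s.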
</div>
<div>
<div class="label">PPO DEMO AGENT β€” Person C, Day 8</div>
<div class="deliverable-box">
<div class="deliverable-title">Gym wrapper + stable-baselines3 PPO</div>
<ul class="deliverable-list">
<li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>Write HTTPGymEnv wrapper</strong> Subclass <code>gymnasium.Env</code>. <code>reset()</code> calls POST /reset. <code>step(action)</code> calls POST /step. <code>observation_space</code> is <code>Box(low=-inf, high=inf, shape=(12,))</code>. <code>action_space</code> is <code>Box</code> for continuous knobs.</div></li>
<li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>Run PPO for 50k steps on Task 1</strong> Use <code>stable_baselines3.PPO("MlpPolicy", env, verbose=1)</code>. Train 50k steps. Plot <code>ep_rew_mean</code> over time using matplotlib. It should go from ~0.1 at start to ~0.35+ by 50k steps.</div></li>
<li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>If PPO doesn't converge</strong> Check: (1) normalise observations with <code>VecNormalize</code>, (2) reduce learning rate to 1e-4, (3) increase n_steps to 2048, (4) check reward range is [-1,1] (it should be from InferenceEnv). The environment is designed to be learnable β€” reward engineering is correct.</div></li>
</ul>
</div>
</div>
</div>
<div class="label">COLAB DEMO NOTEBOOK STRUCTURE β€” Person C, Day 9</div>
<div class="code-block">
<div class="code-block-header">
<span class="code-lang">python</span>
<span class="code-file">notebooks/InferenceGym_Demo.ipynb β€” Cell-by-cell structure</span>
</div>
<div class="code-body"><pre><span class="cm"># Cell 1: Title markdown</span>
<span class="cm"># "# InferenceGym Demo β€” Meta PyTorch Γ— Scaler Hackathon 2026"</span>
<span class="cm"># Cell 2: Install (runs in 90 seconds on Colab)</span>
!pip install stable-baselines3 gymnasium httpx pandas matplotlib -q
<span class="cm"># Cell 3: Connect to live environment</span>
HF_URL = <span class="st">"https://YOUR_ORG-inferencegym.hf.space"</span>
<span class="kw">import</span> httpx
response = httpx.get(<span class="kw">f</span><span class="st">"{HF_URL}/health"</span>)
print(<span class="st">"Environment status:"</span>, response.json())
<span class="cm"># Cell 4: Show available tasks</span>
tasks = httpx.get(<span class="kw">f</span><span class="st">"{HF_URL}/tasks"</span>).json()
<span class="kw">for</span> t <span class="kw">in</span> tasks[<span class="st">'tasks'</span>]: print(<span class="kw">f</span><span class="st">"{t['id']}: {t['name']} ({t['difficulty']})"</span>)
<span class="cm"># Cell 5: Run baseline agent, show scores</span>
baseline = httpx.get(<span class="kw">f</span><span class="st">"{HF_URL}/baseline"</span>).json()
print(<span class="st">"Baseline scores (naΓ―ve vLLM defaults):"</span>, baseline[<span class="st">'scores'</span>])
<span class="cm"># Cell 6: Manual episode β€” human in the loop</span>
res = httpx.post(<span class="kw">f</span><span class="st">"{HF_URL}/reset"</span>, json={<span class="st">"task_id"</span>: <span class="nm">1</span>, <span class="st">"seed"</span>: <span class="nm">42</span>}).json()
session_id = res[<span class="st">'session_id'</span>]; obs = res[<span class="st">'observation'</span>]
print(<span class="st">"Initial observation:"</span>, obs)
<span class="cm"># Cell 7: Run 10 manual steps with a smart action</span>
episode_log = []
<span class="kw">for</span> _ <span class="kw">in</span> range(<span class="nm">10</span>):
result = httpx.post(<span class="kw">f</span><span class="st">"{HF_URL}/step"</span>, json={<span class="st">"session_id"</span>: session_id,
<span class="st">"action"</span>: {<span class="st">"kv_budget"</span>:<span class="nm">0.6</span>, <span class="st">"batch_size"</span>:<span class="nm">128</span>, <span class="st">"spec_length"</span>:<span class="nm">0</span>, <span class="st">"prefill_disagg"</span>:<span class="kw">False</span>, <span class="st">"quant_tier"</span>:<span class="nm">0</span>}}).json()
episode_log.append(result)
<span class="cm"># Cell 8: Gym wrapper</span>
<span class="kw">import</span> gymnasium <span class="kw">as</span> gym; <span class="kw">import</span> numpy <span class="kw">as</span> np; <span class="kw">import</span> httpx
<span class="kw">class</span> <span class="tp">InferenceGymEnv</span>(gym.Env):
<span class="kw">def</span> <span class="fn">__init__</span>(self, base_url, task_id=<span class="nm">1</span>):
self.url = base_url; self.task_id = task_id; self.session_id = <span class="kw">None</span>
self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(<span class="nm">12</span>,), dtype=np.float32)
self.action_space = gym.spaces.Box(
low=np.array([<span class="nm">0.1</span>, <span class="nm">0.0</span>, <span class="nm">1.0</span>], dtype=np.float32),
high=np.array([<span class="nm">1.0</span>, <span class="nm">1.0</span>, <span class="nm">512.0</span>], dtype=np.float32))
<span class="kw">def</span> <span class="fn">obs_to_array</span>(self, obs): <span class="kw">return</span> np.array(list(obs.values())[:12], dtype=np.float32)
<span class="kw">def</span> <span class="fn">reset</span>(self, **kwargs):
r = httpx.post(<span class="kw">f</span><span class="st">"{self.url}/reset"</span>, json={<span class="st">"task_id"</span>:self.task_id}).json()
self.session_id = r[<span class="st">'session_id'</span>]; <span class="kw">return</span> self.obs_to_array(r[<span class="st">'observation'</span>]), {}
<span class="kw">def</span> <span class="fn">step</span>(self, action):
act = {<span class="st">"kv_budget"</span>:float(action[<span class="nm">0</span>]), <span class="st">"spec_length"</span>:<span class="nm">0</span>, <span class="st">"batch_size"</span>:int(action[<span class="nm">2</span>]),
<span class="st">"prefill_disagg"</span>:<span class="kw">False</span>, <span class="st">"quant_tier"</span>:<span class="nm">0</span>}
r = httpx.post(<span class="kw">f</span><span class="st">"{self.url}/step"</span>, json={<span class="st">"session_id"</span>:self.session_id,<span class="st">"action"</span>:act}).json()
<span class="kw">return</span> self.obs_to_array(r[<span class="st">'observation'</span>]), r[<span class="st">'reward'</span>], r[<span class="st">'done'</span>], <span class="kw">False</span>, {}
<span class="cm"># Cell 9: Train PPO (takes ~10 minutes on Colab T4)</span>
<span class="kw">from</span> stable_baselines3 <span class="kw">import</span> PPO
env = InferenceGymEnv(HF_URL, task_id=<span class="nm">1</span>)
model = PPO(<span class="st">"MlpPolicy"</span>, env, verbose=<span class="nm">1</span>, learning_rate=<span class="nm">3e-4</span>, n_steps=<span class="nm">512</span>)
model.learn(total_timesteps=<span class="nm">50_000</span>)
<span class="cm"># Cell 10: Plot reward curve (the money shot)</span>
<span class="kw">import</span> matplotlib.pyplot <span class="kw">as</span> plt
rewards = [ep[<span class="st">'r'</span>] <span class="kw">for</span> ep <span class="kw">in</span> model.ep_info_buffer]
plt.figure(figsize=(<span class="nm">12</span>,<span class="nm">4</span>)); plt.plot(rewards, alpha=<span class="nm">0.3</span>, label=<span class="st">'Episode reward'</span>)
plt.axhline(y=<span class="nm">0.35</span>, color=<span class="st">'r'</span>, linestyle=<span class="st">'--'</span>, label=<span class="st">'Baseline score'</span>)
plt.title(<span class="st">'PPO Agent Learning on InferenceGym Task 1'</span>); plt.legend(); plt.show()
print(<span class="st">f"Final agent score: {np.mean(rewards[-20:]):.3f} vs baseline: 0.35"</span>)</pre></div>
</div>
</div>
<!-- ═══════════════════════════════ PHASE 6 ═══════════════════════════════ -->
<div class="section" id="phase6">
<div class="section-header">
<div class="section-num">P6</div>
<div class="section-meta">
<div class="section-title">Phase 6 β€” Polish, Writeup & Submission</div>
<div class="section-sub">Days 10–11 (Apr 5–7). Goal: every submission checklist item is ticked. The repo is clean. The writeup is compelling. The video is recorded. The form is submitted.</div>
</div>
<span class="eyebrow-tag et-purple">Days 10–11</span>
</div>
<div class="gate-box">
<div class="gate-icon">πŸ†</div>
<div>
<div class="gate-label">Final Gate β€” Submit by Apr 7 11:59 PM</div>
<div class="gate-text"><strong>The submission form is filled with HF Space URL + GitHub repo URL.</strong> No code changes after submission. The repo is public, has a clean README, and contains no API keys or large binary files committed to git.</div>
</div>
</div>
<div class="grid2 mb14">
<div>
<div class="label">ENVIRONMENT.md β€” Technical spec for judges</div>
<div class="deliverable-box">
<div class="deliverable-title">Person A writes this on Day 10</div>
<ul class="deliverable-list">
<li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Observation space table</strong> Full table with field name, type, range, and description for all 12 observation fields. Copy from models.py and expand.</div></li>
<li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Action space table</strong> Full table with field name, type, valid values, default, and effect when changed for all 5 action dimensions.</div></li>
<li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Reward function derivation</strong> Show the R = Ξ±T - Ξ²L - Ξ³V - Ξ΄C formula with all constants, normalization choices, and why each weight was set the way it was.</div></li>
<li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Trace data methodology</strong> Document exactly what source data you used, how it was preprocessed, and why it's realistic. If using published benchmarks, cite them.</div></li>
</ul>
</div>
</div>
<div>
<div class="label">README.md β€” The first thing judges see</div>
<div class="deliverable-box">
<div class="deliverable-title">Person C writes this on Day 10</div>
<ul class="deliverable-list">
<li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>One-paragraph pitch first</strong> Before any technical content. Why does this environment matter? What problem does it solve? This should be the same words you'd use to pitch to a judge in 30 seconds.</div></li>
<li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>Quick start in 5 lines</strong> Show the curl commands to hit /health, /reset, /step, /grader. A judge who never reads further should still understand the API from these 5 lines.</div></li>
<li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>Baseline vs agent scores table</strong> Show a simple table: Task 1/2/3 Γ— Baseline/PPO Agent. The numbers do the talking.</div></li>
<li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>Link to Colab notebook prominently</strong> "Open in Colab" badge. Judges who click this and see the reward curve rising will be convinced.</div></li>
</ul>
</div>
</div>
</div>
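<div class="label">ILLUSTRATIVE SKETCH: REWARD FORMULA SHAPE</div>
The R = αT - βL - γV - δC derivation that ENVIRONMENT.md documents can be sketched in a few lines. The weights below are placeholders, not the calibrated values from env/reward.py; only the shape (a weighted sum of pre-normalised terms, clipped to [-1, 1]) is asserted here.

```python
import numpy as np

# Placeholder weights -- the real alpha/beta/gamma/delta live in env/reward.py's TASK_CONFIGS.
ALPHA, BETA, GAMMA, DELTA = 1.0, 0.5, 1.0, 0.25

def step_reward(throughput: float, latency: float, violations: float, cost: float) -> float:
    """R = alpha*T - beta*L - gamma*V - delta*C, each term already normalised to [0, 1]."""
    r = ALPHA * throughput - BETA * latency - GAMMA * violations - DELTA * cost
    # The Day 5 integration test expects every reward to land in [-1, 1].
    return float(np.clip(r, -1.0, 1.0))

r = step_reward(throughput=0.8, latency=0.2, violations=0.0, cost=0.4)
```

Documenting the normalisation of each term (and how the clip interacts with the weights) is exactly what the "Reward function derivation" deliverable above asks for.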
<div class="label">2-MINUTE DEMO VIDEO SCRIPT — Person C, Day 10</div>
<div class="table-wrap mb14">
<table>
<tr><th>Time</th><th>Screen</th><th>What You Say / Show</th></tr>
<tr><td><strong>0:00–0:20</strong></td><td>Slide: problem statement</td><td>"LLM inference is where most of the AI compute budget is spent. There's no RL environment for optimising it. We built one."</td></tr>
<tr><td><strong>0:20–0:40</strong></td><td>HF Space — /health → /tasks</td><td>"This is InferenceGym on HuggingFace Spaces, live right now. 3 tasks, 5 action knobs, fully CPU-only." Hit the endpoints live.</td></tr>
<tr><td><strong>0:40–1:00</strong></td><td>Colab — run baseline</td><td>"Naïve vLLM defaults score 0.35 on Task 1. That's your baseline — static config, no optimisation."</td></tr>
<tr><td><strong>1:00–1:30</strong></td><td>Colab — PPO reward curve</td><td>"A simple PPO agent trained for 50k steps hits 0.65 — almost double. No GPU, no model, just our trace-driven simulator." Show the plot.</td></tr>
<tr><td><strong>1:30–2:00</strong></td><td>Architecture diagram</td><td>"Any company can drop in their own trace data and train an agent for their specific workload. That's the value proposition. Thank you."</td></tr>
</table>
</div>
</div>
<!-- ═══════════════════════════════ FULL DAY TIMELINE ═══════════════════════════════ -->
<div class="section">
<div class="section-header">
<div class="section-num">TL</div>
<div class="section-meta">
<div class="section-title">Complete 11-Day Timeline</div>
<div class="section-sub">Every person, every day. The critical path runs through Person A's simulator — protect it above all else.</div>
</div>
</div>
<div class="timeline">
<div class="tl-row">
<div class="tl-left"><div class="tl-day-label today">Mar 27<br>Day 1<br>TODAY</div></div>
<div class="tl-connector"><div class="tl-dot g"></div><div class="tl-line"></div></div>
<div class="tl-right">
<div class="tl-card green-border">
<div class="tl-phase" style="color:var(--green)">PHASE 0 — SETUP & ARCHITECTURE LOCK</div>
<ul class="tl-tasks-list">
<li><span class="tl-person">A →</span><span class="tl-task"><strong>Design data schemas</strong> in models.py. Write skeleton TraceSimulator with hardcoded stub output. Design lookup table format.</span></li>
<li><span class="tl-person">B →</span><span class="tl-task"><strong>Create FastAPI app</strong> with all 8 endpoint stubs returning valid-shaped hardcoded JSON. Dockerfile builds. /health returns 200.</span></li>
<li><span class="tl-person">C →</span><span class="tl-task"><strong>Write grader rubric</strong> on paper for all 3 tasks. Download trace data. Write workload_configs.json. Agree on HF Space naming.</span></li>
<li><span class="tl-person">ALL →</span><span class="tl-task"><strong>Agree on and commit models.py</strong> to main. This file cannot change after today without unanimous consent.</span></li>
</ul>
</div>
</div>
</div>
<div class="tl-row">
<div class="tl-left"><div class="tl-day-label">Mar 28<br>Day 2</div></div>
<div class="tl-connector"><div class="tl-dot b"></div><div class="tl-line"></div></div>
<div class="tl-right">
<div class="tl-card blue-border">
<div class="tl-phase" style="color:var(--blue)">PHASE 1 — SIMULATOR CORE (Day 1 of 2)</div>
<ul class="tl-tasks-list">
<li><span class="tl-person">A →</span><span class="tl-task"><strong>Implement TraceSimulator</strong> — load parquet, bilinear interpolation, Gaussian noise, OOM detection. Write WorkloadGenerator (Poisson arrivals, prompt sampling).</span></li>
<li><span class="tl-person">B →</span><span class="tl-task"><strong>Wire /reset and /step</strong> endpoints to the InferenceEnv stubs (not real yet — use A's skeleton). Test with curl that responses are correctly shaped.</span></li>
<li><span class="tl-person">C →</span><span class="tl-task"><strong>Process trace data</strong> — reshape into lookup table Parquet format with correct columns. Validate at least 50 data points across the batch×prompt grid. Start grader skeleton.</span></li>
</ul>
</div>
</div>
</div>
<div class="tl-row">
<div class="tl-left"><div class="tl-day-label">Mar 29<br>Day 3</div></div>
<div class="tl-connector"><div class="tl-dot b"></div><div class="tl-line"></div></div>
<div class="tl-right">
<div class="tl-card blue-border">
<div class="tl-phase" style="color:var(--blue)">PHASE 1 — SIMULATOR CORE (Day 2 of 2) 🔑 CRITICAL GATE</div>
<ul class="tl-tasks-list">
<li><span class="tl-person">A →</span><span class="tl-task"><strong>Complete WorkloadGenerator</strong> — queue depth, burst injection, spec acceptance model. Complete InferenceEnv.reset() and step(). All simulator unit tests pass.</span></li>
<li><span class="tl-person">B →</span><span class="tl-task"><strong>Wire all endpoints</strong> to real InferenceEnv (replacing stubs). Implement SessionManager. Test full reset→step×10 cycle via HTTP.</span></li>
<li><span class="tl-person">C →</span><span class="tl-task"><strong>Implement GraderModule skeleton</strong> with correct formula shape (even if constants need tuning). Run smoke test: score a 10-step episode log. Get any finite number.</span></li>
</ul>
</div>
</div>
</div>
<div class="tl-row">
<div class="tl-left"><div class="tl-day-label">Mar 30<br>Day 4</div></div>
<div class="tl-connector"><div class="tl-dot b"></div><div class="tl-line"></div></div>
<div class="tl-right">
<div class="tl-card blue-border">
<div class="tl-phase" style="color:var(--blue)">PHASE 2 — ENVIRONMENT LOGIC COMPLETE</div>
<ul class="tl-tasks-list">
<li><span class="tl-person">A →</span><span class="tl-task"><strong>Implement all 3 task configs</strong> (action masking for T1/T2, burst injection for T3). Full reward function with α β γ δ weights. Write full unit test suite (20+ tests).</span></li>
<li><span class="tl-person">B →</span><span class="tl-task"><strong>Build Dockerfile</strong> — multi-stage, confirm image &lt;2GB. Run full Docker cycle locally. Implement /state, /info, /health endpoints. Add Pydantic request validation.</span></li>
<li><span class="tl-person">C →</span><span class="tl-task"><strong>Complete GraderModule</strong> — calibrate baseline TPS constants, write unit tests for all 3 task graders with known expected outputs. Score computation verified by hand.</span></li>
</ul>
</div>
</div>
</div>
<div class="tl-row">
<div class="tl-left"><div class="tl-day-label">Mar 31<br>Day 5</div></div>
<div class="tl-connector"><div class="tl-dot b"></div><div class="tl-line"></div></div>
<div class="tl-right">
<div class="tl-card blue-border">
<div class="tl-phase" style="color:var(--blue)">PHASE 3 — API LAYER COMPLETE & OPENENV VALIDATED</div>
<ul class="tl-tasks-list">
<li><span class="tl-person">A →</span><span class="tl-task"><strong>Full integration test</strong> — run a 200-step episode for all 3 tasks programmatically. Confirm rewards are in the [-1,1] range. Fix any edge cases (divide by zero, negative queue).</span></li>
<li><span class="tl-person">B →</span><span class="tl-task"><strong>Run openenv validate</strong> — fix any compliance issues. Implement /grader and /baseline endpoints (wiring C's modules). Add rate limiting and CORS middleware.</span></li>
<li><span class="tl-person">C →</span><span class="tl-task"><strong>Write BaselineAgent</strong> and run against all 3 tasks. Record expected scores (should be ~0.30-0.35 for T1, ~0.22-0.28 for T2, ~0.18-0.24 for T3). Adjust grader constants if needed.</span></li>
</ul>
</div>
</div>
</div>
<div class="tl-row">
<div class="tl-left"><div class="tl-day-label">Apr 1<br>Day 6</div></div>
<div class="tl-connector"><div class="tl-dot a"></div><div class="tl-line"></div></div>
<div class="tl-right">
<div class="tl-card amber-border">
<div class="tl-phase" style="color:var(--amber)">PHASE 4 — GRADER & BASELINE COMPLETE</div>
<ul class="tl-tasks-list">
<li><span class="tl-person">A →</span><span class="tl-task"><strong>Adversarial task stress test</strong> — run 1000-step Task 3 episodes, check burst injection fires at correct intervals, priority routing triggers, no state corruption.</span></li>
<li><span class="tl-person">B →</span><span class="tl-task"><strong>Concurrent session test</strong> — run 10 simultaneous reset→step×5 cycles, confirm no session leakage. Profile memory usage under load — must stay under 512MB.</span></li>
<li><span class="tl-person">C →</span><span class="tl-task"><strong>Write PPO gym wrapper</strong> (HTTPGymEnv). Start PPO training on Task 1. Set it running overnight — 50k steps should complete in ~4-6 hours on a modern CPU.</span></li>
</ul>
</div>
</div>
</div>
<div class="tl-row">
<div class="tl-left"><div class="tl-day-label">Apr 2<br>Day 7</div></div>
<div class="tl-connector"><div class="tl-dot a"></div><div class="tl-line"></div></div>
<div class="tl-right">
<div class="tl-card amber-border">
<div class="tl-phase" style="color:var(--amber)">BUFFER DAY + INTERNAL DEMO</div>
<ul class="tl-tasks-list">
<li><span class="tl-person">ALL →</span><span class="tl-task"><strong>Internal demo meeting</strong> — each person walks through the Colab notebook end to end. Find anything broken. Fix it today.</span></li>
<li><span class="tl-person">A →</span><span class="tl-task">Fix any bugs found in the internal demo. Add /info endpoint with full JSON schema. Docstrings on all public methods.</span></li>
<li><span class="tl-person">C →</span><span class="tl-task"><strong>Review PPO training results</strong> — plot reward curve, verify it's increasing. If not, debug (check normalization, learning rate, reward scale). Start writing Colab notebook.</span></li>
</ul>
</div>
</div>
</div>
<div class="tl-row">
<div class="tl-left"><div class="tl-day-label">Apr 3<br>Day 8</div></div>
<div class="tl-connector"><div class="tl-dot a"></div><div class="tl-line"></div></div>
<div class="tl-right">
<div class="tl-card amber-border">
<div class="tl-phase" style="color:var(--amber)">PHASE 5 — DEPLOYMENT</div>
<ul class="tl-tasks-list">
<li><span class="tl-person">B →</span><span class="tl-task"><strong>Deploy to HuggingFace Spaces</strong> — push, watch build logs, verify all endpoints respond from the live public URL. Document the URL in README.</span></li>
<li><span class="tl-person">C →</span><span class="tl-task"><strong>Complete Colab notebook</strong> — all 10 cells work end-to-end against the live HF Space URL. The notebook should run cold in under 15 minutes.</span></li>
<li><span class="tl-person">A →</span><span class="tl-task"><strong>Test from fresh machine</strong> — clone the repo, build Docker, run all tests. Confirm there are no hidden local dependencies. Fix whatever breaks.</span></li>
</ul>
</div>
</div>
</div>
<div class="tl-row">
<div class="tl-left"><div class="tl-day-label">Apr 4<br>Day 9</div></div>
<div class="tl-connector"><div class="tl-dot a"></div><div class="tl-line"></div></div>
<div class="tl-right">
<div class="tl-card amber-border">
<div class="tl-phase" style="color:var(--amber)">PHASE 5 — DEMO COMPLETE</div>
<ul class="tl-tasks-list">
<li><span class="tl-person">C →</span><span class="tl-task"><strong>Record 2-minute demo video</strong> using OBS or Loom. Follow the script. Upload to YouTube (unlisted) and link in README. Do not make it public until submission.</span></li>
<li><span class="tl-person">B →</span><span class="tl-task"><strong>Stress test live deployment</strong> — 50 concurrent requests, verify no 500 errors. Check HF Space memory and CPU usage stays stable.</span></li>
<li><span class="tl-person">ALL →</span><span class="tl-task"><strong>Write submission description draft</strong> (~500 words covering: problem, environment design, grader design, baseline vs agent results). Will refine on Day 10.</span></li>
</ul>
</div>
</div>
</div>
<div class="tl-row">
<div class="tl-left"><div class="tl-day-label">Apr 5–6<br>Days 10–11</div></div>
<div class="tl-connector"><div class="tl-dot" style="background:var(--purple); border-color:var(--purple);"></div><div class="tl-line"></div></div>
<div class="tl-right">
<div class="tl-card" style="border-color:var(--pborder);">
<div class="tl-phase" style="color:var(--purple)">PHASE 6 — WRITEUP, POLISH & SUBMISSION PREP</div>
<ul class="tl-tasks-list">
<li><span class="tl-person">A →</span><span class="tl-task"><strong>Write ENVIRONMENT.md</strong> — full technical spec for judges (observation space, action space, reward formula, task descriptions, simulator methodology).</span></li>
<li><span class="tl-person">C →</span><span class="tl-task"><strong>Write final README</strong> — pitch paragraph, quick start, baseline vs agent table, Colab link, video link. Run through the submission checklist line by line.</span></li>
<li><span class="tl-person">ALL →</span><span class="tl-task"><strong>Final end-to-end verification</strong> — test from a fresh browser with no cookies or local setup. Every endpoint must work. Grader must score any completed episode.</span></li>
</ul>
</div>
</div>
</div>
<div class="tl-row">
<div class="tl-left"><div class="tl-day-label" style="color:var(--red);">Apr 7<br>DEADLINE</div></div>
<div class="tl-connector"><div class="tl-dot r"></div></div>
<div class="tl-right">
<div class="tl-card" style="border-color:var(--rborder);">
<div class="tl-phase" style="color:var(--red)">SUBMIT BY 11:59 PM — NO CODE CHANGES AFTER</div>
<ul class="tl-tasks-list">
<li><span class="tl-person">ALL →</span><span class="tl-task">Submit the HF Space URL + GitHub repo URL on the hackathon portal. Fill in: env name, description, team members. Double-check the HF Space is public.</span></li>
</ul>
</div>
</div>
</div>
</div>
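<div class="label">ILLUSTRATIVE SKETCH: HTTP GYM WRAPPER (DAY 6)</div>
The HTTPGymEnv wrapper Person C writes on Day 6 could look roughly like this. The payload shapes and the fake_post transport are assumptions for illustration; a real version would POST to the live Space (e.g. via requests) and flatten the full 12-field observation.

```python
from typing import Callable, Optional

import numpy as np

class HTTPGymEnv:
    """Minimal gym-style wrapper around the InferenceGym HTTP API.

    `post` is injectable so the wrapper can be exercised without a live
    server; in production it would wrap requests.post(base_url + path).json().
    """

    def __init__(self, task_id: int, post: Callable[[str, dict], dict]):
        self.task_id = task_id
        self.post = post
        self.session_id: Optional[str] = None

    def reset(self) -> np.ndarray:
        payload = self.post("/reset", {"task_id": self.task_id})
        self.session_id = payload["session_id"]
        return self._to_vector(payload["observation"])

    def step(self, action: dict):
        payload = self.post("/step", {"session_id": self.session_id, "action": action})
        obs = self._to_vector(payload["observation"])
        return obs, float(payload["reward"]), bool(payload["done"]), payload.get("info", {})

    @staticmethod
    def _to_vector(obs: dict) -> np.ndarray:
        # Flatten the observation dict into a fixed-order float vector for SB3/PPO.
        flat = []
        for key in sorted(obs):
            value = obs[key]
            flat.extend(value if isinstance(value, list) else [value])
        return np.asarray(flat, dtype=np.float32)

# Fake transport for demonstration only: returns fixed, validly-shaped payloads.
def fake_post(path: str, body: dict) -> dict:
    if path == "/reset":
        return {"session_id": "abc", "observation": {"queue_depth": 0.0, "timestep": 0}}
    return {"observation": {"queue_depth": 1.0, "timestep": 1},
            "reward": 0.1, "done": False, "info": {}}

env = HTTPGymEnv(task_id=1, post=fake_post)
first = env.reset()
obs, reward, done, info = env.step({"batch_size": 32})
```

Keeping the transport injectable also makes test_api.py-style unit tests cheap, since the overnight PPO run then only differs in which `post` is passed in.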
</div>
<!-- ═══════════════════════════════ APPENDIX A: MODULES ═══════════════════════════════ -->
<div class="section" id="modules">
<div class="section-header">
<div class="section-num">§A</div>
<div class="section-meta">
<div class="section-title">Appendix A — Full Module Specifications</div>
<div class="section-sub">Every file in the repository, what it owns, and the exact interface it must expose.</div>
</div>
</div>
<div class="label">COMPLETE FILE TREE WITH OWNERSHIP</div>
<div class="code-block">
<div class="code-block-header"><span class="code-lang">text</span><span class="code-file">Repository structure</span></div>
<div class="code-body"><pre>inferencegym/
├── <span class="fn">models.py</span>              <span class="cm">[ALL] — Locked Day 1. ServeAction, ServeObservation, MetricsSnapshot, WorkloadState</span>
│
├── <span class="fn">env/</span>
│   ├── inference_env.py   <span class="cm">[A] — Core InferenceEnv class. reset(), step(), _compute_reward(), _enforce_action_mask()</span>
│   ├── observation.py     <span class="cm">[A] — _build_obs() helper, normalise values to [0,1] for RL agents</span>
│   ├── action.py          <span class="cm">[A] — ActionValidator, clamp continuous actions to valid ranges</span>
│   └── reward.py          <span class="cm">[A] — RewardComputer, configurable α β γ δ, TASK_CONFIGS dict</span>
│
├── <span class="fn">simulator/</span>
│   ├── trace_sim.py       <span class="cm">[A] — TraceSimulator: load parquet, interpolate, noise, OOM detection</span>
│   ├── workload.py        <span class="cm">[A] — WorkloadGenerator: Poisson, LogNormal, burst injection, queue</span>
│   ├── session_manager.py <span class="cm">[B] — SessionManager: thread-safe LRU cache of InferenceEnv instances</span>
│   └── data/
│       ├── traces_llama3_8b.parquet <span class="cm">[C] — lookup table: (batch,kv,spec,plen) → metrics</span>
│       ├── sharegpt_dist.json       <span class="cm">[C] — LogNormal params for Task 2 prompt distribution</span>
│       └── workload_configs.json    <span class="cm">[C] — Task 1/2/3 workload configuration parameters</span>
│
├── <span class="fn">grader/</span>
│   ├── grader.py          <span class="cm">[C] — GraderModule: dispatches to per-task graders, returns score+breakdown</span>
│   ├── task1_grader.py    <span class="cm">[C] — Throughput normalisation formula</span>
│   ├── task2_grader.py    <span class="cm">[C] — TTFT + memory compliance formula</span>
│   └── task3_grader.py    <span class="cm">[C] — 4-objective formula including action stability</span>
│
├── <span class="fn">agents/</span>
│   ├── baseline.py        <span class="cm">[C] — BaselineAgent: fixed BASELINE_ACTION, run_all_baselines()</span>
│   └── ppo_demo.py        <span class="cm">[C] — HTTPGymEnv wrapper + PPO training script</span>
│
├── <span class="fn">server/</span>
│   ├── app.py             <span class="cm">[B] — FastAPI application, all 8 endpoints, startup event</span>
│   ├── schemas.py         <span class="cm">[B] — Pydantic request/response models (ResetRequest, StepRequest, etc.)</span>
│   └── middleware.py      <span class="cm">[B] — CORS, rate limiting (max 100 req/min per IP), request logging</span>
│
├── <span class="fn">tests/</span>
│   ├── test_simulator.py  <span class="cm">[A] — 20+ unit tests for TraceSimulator and WorkloadGenerator</span>
│   ├── test_env.py        <span class="cm">[A] — Contract tests for step/reset/state, edge cases</span>
│   ├── test_grader.py     <span class="cm">[C] — Unit tests for all 3 grader formulas with known expected outputs</span>
│   └── test_api.py        <span class="cm">[B] — Integration tests: httpx client hitting full FastAPI stack</span>
│
├── <span class="fn">notebooks/</span>
│   └── InferenceGym_Demo.ipynb <span class="cm">[C] — 10-cell Colab demo notebook</span>
│
├── Dockerfile             <span class="cm">[B] — Multi-stage, CPU-only, port 7860, &lt;2GB image</span>
├── docker-compose.yml     <span class="cm">[B] — Local dev: volume mount source, hot reload</span>
├── requirements.txt       <span class="cm">[B] — Pinned CPU-only deps. No torch. No CUDA.</span>
├── README.md              <span class="cm">[C] — HF Spaces frontmatter + pitch + quickstart + links</span>
└── ENVIRONMENT.md         <span class="cm">[A] — Full technical spec for judges</span></pre></div>
</div>
<div class="label">MODULE INTERFACE CONTRACTS — What each module must expose</div>
<div class="module-grid">
<div class="module-card">
<div class="module-card-header">
<span class="module-card-name">TraceSimulator</span>
<span class="module-card-file">simulator/trace_sim.py</span>
</div>
<div class="module-card-body">
<ul class="spec-list">
<li><code>__init__(trace_path: str, seed: int = 42)</code> — loads parquet, builds interpolators, sets rng</li>
<li><code>simulate(action: ServeAction, workload: WorkloadState) → MetricsSnapshot</code> — the core method</li>
<li><code>reset_seed(seed: int)</code> — resets the rng for episode reproducibility</li>
<li>Must not raise exceptions on valid input. OOM conditions are returned as data, not exceptions.</li>
</ul>
</div>
</div>
<div class="module-card">
<div class="module-card-header">
<span class="module-card-name">WorkloadGenerator</span>
<span class="module-card-file">simulator/workload.py</span>
</div>
<div class="module-card-body">
<ul class="spec-list">
<li><code>__init__(task_id: int, seed: int = 42)</code> — loads workload config for this task</li>
<li><code>reset() → WorkloadState</code> — returns initial state, resets internal step counter</li>
<li><code>step(action: ServeAction) → WorkloadState</code> — advances one step, updates queue</li>
<li><code>is_burst_active() → bool</code> — True during burst windows for Task 3</li>
</ul>
</div>
</div>
<div class="module-card">
<div class="module-card-header">
<span class="module-card-name">InferenceEnv</span>
<span class="module-card-file">env/inference_env.py</span>
</div>
<div class="module-card-body">
<ul class="spec-list">
<li><code>reset() → ServeObservation</code> — starts new episode, returns initial observation</li>
<li><code>step(action) → (obs, reward, done, info)</code> — Gym-compatible signature</li>
<li><code>state() → dict</code> — returns episode metadata for /state endpoint</li>
<li><code>_episode_log: list</code> — accumulates step dicts for grader consumption</li>
<li><code>session_id: str</code> — unique UUID per episode, set on reset()</li>
</ul>
</div>
</div>
<div class="module-card">
<div class="module-card-header">
<span class="module-card-name">GraderModule</span>
<span class="module-card-file">grader/grader.py</span>
</div>
<div class="module-card-body">
<ul class="spec-list">
<li><code>score(task_id: int, episode_log: list) → dict</code> — returns <code>{score, breakdown, feedback}</code></li>
<li>Must be stateless — no internal mutable state. Same input → same output, always.</li>
<li><code>score</code> must be a float in [0.0, 1.0]</li>
<li><code>breakdown</code> must contain one float per scoring component</li>
<li><code>feedback</code> must be a human-readable string explaining the score</li>
</ul>
</div>
</div>
</div>
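<div class="label">ILLUSTRATIVE SKETCH: A STATELESS, CLIPPED GRADER</div>
A minimal instance of the GraderModule contract above, for the Task 1 throughput component only. The BASELINE_TPS/OPTIMAL_TPS constants and the episode-log <code>tps</code> key are placeholders; the real constants are calibrated against the baseline agent on Days 4–5.

```python
import numpy as np

# Hypothetical calibration constants -- the real values are set on Days 4-5.
BASELINE_TPS, OPTIMAL_TPS = 500.0, 2000.0

def score_task1(episode_log: list) -> dict:
    """Stateless Task 1 grader sketch: normalise mean throughput into [0, 1].

    Each log entry is assumed to be a dict with a 'tps' key; the actual
    episode-log schema is owned by InferenceEnv._episode_log.
    """
    if not episode_log:
        return {"score": 0.0, "breakdown": {"throughput": 0.0},
                "feedback": "empty episode log"}
    mean_tps = float(np.mean([step["tps"] for step in episode_log]))
    # Each component is clipped individually, then the final score is clipped too.
    throughput = float(np.clip((mean_tps - BASELINE_TPS) / (OPTIMAL_TPS - BASELINE_TPS), 0.0, 1.0))
    score = float(np.clip(throughput, 0.0, 1.0))
    return {
        "score": score,
        "breakdown": {"throughput": throughput},
        "feedback": f"mean TPS {mean_tps:.0f} vs baseline {BASELINE_TPS:.0f}",
    }

result = score_task1([{"tps": 800.0}, {"tps": 1200.0}])
```

Because the function holds no mutable state, repeated calls on the same log return identical results, which is what the "same input → same output" clause of the contract demands.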
</div>
<!-- ═══════════════════════════════ APPENDIX B: DATA SCHEMAS ═══════════════════════════════ -->
<div class="section" id="dataschema">
<div class="section-header">
<div class="section-num">§B</div>
<div class="section-meta">
<div class="section-title">Appendix B — Data Schemas & Complete API Reference</div>
</div>
</div>
<div class="label">LOOKUP TABLE PARQUET SCHEMA — traces_llama3_8b.parquet</div>
<div class="table-wrap mb14">
<table>
<tr><th>Column</th><th>Type</th><th>Values</th><th>Description</th></tr>
<tr><td><code>batch_size</code></td><td>int</td><td>1,4,8,16,32,64,128,256,512</td><td>Max concurrent requests served</td></tr>
<tr><td><code>kv_budget</code></td><td>float</td><td>0.1, 0.25, 0.5, 0.75, 1.0</td><td>KV cache allocation fraction</td></tr>
<tr><td><code>spec_length</code></td><td>int</td><td>0, 1, 2, 4, 8</td><td>Speculative draft tokens (0 = disabled)</td></tr>
<tr><td><code>quant_tier</code></td><td>int</td><td>0, 1, 2</td><td>0=FP16, 1=INT8, 2=INT4</td></tr>
<tr><td><code>prompt_len_bucket</code></td><td>int</td><td>0–7</td><td>Bucket index: [64,128,256,512,1024,2048,4096,8192]</td></tr>
<tr><td><code>ttft_p50_ms</code></td><td>float</td><td>&gt;0</td><td>Median time to first token (milliseconds)</td></tr>
<tr><td><code>ttft_p99_ms</code></td><td>float</td><td>&gt;0</td><td>99th percentile TTFT</td></tr>
<tr><td><code>tpot_ms</code></td><td>float</td><td>&gt;0</td><td>Time per output token</td></tr>
<tr><td><code>tps</code></td><td>float</td><td>&gt;0</td><td>Output tokens per second</td></tr>
<tr><td><code>gpu_mem_gb</code></td><td>float</td><td>0–80</td><td>GPU memory footprint in GB</td></tr>
<tr><td><code>cost_per_1k</code></td><td>float</td><td>&gt;0</td><td>Relative cost per 1000 tokens (normalised)</td></tr>
</table>
</div>
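<div class="label">ILLUSTRATIVE SKETCH: POPULATING THE LOOKUP TABLE SYNTHETICALLY</div>
The Option C fallback from the risk register can target this exact schema. The analytic formulas inside <code>synthetic_row</code> are invented for illustration; only the column names and grid values come from the table above, and the parquet write (commented out, needs pyarrow) would use snappy compression as planned.

```python
from itertools import product

import pandas as pd

# Grid axes taken directly from the schema table.
BATCH = [1, 4, 8, 16, 32, 64, 128, 256, 512]
KV = [0.1, 0.25, 0.5, 0.75, 1.0]
SPEC = [0, 1, 2, 4, 8]
QUANT = [0, 1, 2]              # 0=FP16, 1=INT8, 2=INT4
PLEN_BUCKETS = list(range(8))  # buckets for [64 .. 8192]

def synthetic_row(b, kv, s, q, p):
    # Toy analytic model (illustrative only): throughput grows sublinearly with
    # batch size, benefits from speculation/quantisation, shrinks with prompt length.
    tps = 120.0 * b ** 0.6 * (1.0 + 0.1 * s) * (1.0 + 0.25 * q) / (1.0 + 0.3 * p)
    ttft = 20.0 + 4.0 * b / max(kv, 0.1) + 15.0 * p
    return {
        "batch_size": b, "kv_budget": kv, "spec_length": s,
        "quant_tier": q, "prompt_len_bucket": p,
        "ttft_p50_ms": ttft, "ttft_p99_ms": 2.5 * ttft,
        "tpot_ms": 1000.0 * b / tps, "tps": tps,
        "gpu_mem_gb": min(80.0, 16.0 + 40.0 * kv + 0.05 * b),
        "cost_per_1k": 1.0 / (1.0 + 0.25 * q),
    }

df = pd.DataFrame([synthetic_row(*combo)
                   for combo in product(BATCH, KV, SPEC, QUANT, PLEN_BUCKETS)])
# df.to_parquet("traces_llama3_8b.parquet", compression="snappy")  # requires pyarrow
```

The full grid is 9 × 5 × 5 × 3 × 8 = 5400 rows, small enough to hold as a numpy array at startup per the HF Spaces OOM mitigation.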
<div class="label">WORKLOAD CONFIGS — workload_configs.json structure</div>
<div class="code-block">
<div class="code-block-header"><span class="code-lang">json</span><span class="code-file">simulator/data/workload_configs.json</span></div>
<div class="code-body"><pre>{
<span class="st">"tasks"</span>: {
<span class="st">"1"</span>: {
<span class="st">"name"</span>: <span class="st">"Static Uniform"</span>,
<span class="st">"arrival_rate_rps"</span>: <span class="nm">10.0</span>,
<span class="st">"arrival_dist"</span>: <span class="st">"poisson"</span>,
<span class="st">"prompt_len_dist"</span>: <span class="st">"uniform"</span>,
<span class="st">"prompt_len_min"</span>: <span class="nm">64</span>,
<span class="st">"prompt_len_max"</span>: <span class="nm">128</span>,
<span class="st">"slo_target_ms"</span>: <span class="nm">500.0</span>,
<span class="st">"burst_enabled"</span>: <span class="kw">false</span>,
<span class="st">"priority_routing"</span>: <span class="kw">false</span>,
<span class="st">"active_actions"</span>: [<span class="st">"kv_budget"</span>, <span class="st">"batch_size"</span>]
},
<span class="st">"2"</span>: {
<span class="st">"name"</span>: <span class="st">"Bursty ShareGPT"</span>,
<span class="st">"arrival_rate_rps"</span>: <span class="nm">25.0</span>,
<span class="st">"arrival_rate_burst"</span>: <span class="nm">80.0</span>,
<span class="st">"burst_period_steps"</span>: <span class="nm">30</span>,
<span class="st">"arrival_dist"</span>: <span class="st">"poisson_bursty"</span>,
<span class="st">"prompt_len_dist"</span>: <span class="st">"lognormal"</span>,
<span class="st">"prompt_len_mu"</span>: <span class="nm">5.2</span>,
<span class="st">"prompt_len_sigma"</span>: <span class="nm">1.3</span>,
<span class="st">"prompt_len_clamp_min"</span>: <span class="nm">32</span>,
<span class="st">"prompt_len_clamp_max"</span>: <span class="nm">8192</span>,
<span class="st">"memory_hard_limit_gb"</span>: <span class="nm">36.0</span>,
<span class="st">"slo_target_ms"</span>: <span class="nm">300.0</span>,
<span class="st">"burst_enabled"</span>: <span class="kw">true</span>,
<span class="st">"active_actions"</span>: [<span class="st">"kv_budget"</span>, <span class="st">"batch_size"</span>, <span class="st">"spec_length"</span>]
},
<span class="st">"3"</span>: {
<span class="st">"name"</span>: <span class="st">"Adversarial Multi-Tenant"</span>,
<span class="st">"arrival_rate_rps"</span>: <span class="nm">30.0</span>,
<span class="st">"burst_multiplier"</span>: <span class="nm">10.0</span>,
<span class="st">"burst_interval_steps"</span>: <span class="nm">120</span>,
<span class="st">"burst_duration_steps"</span>: <span class="nm">15</span>,
<span class="st">"prompt_len_dist"</span>: <span class="st">"bimodal"</span>,
<span class="st">"short_request_frac"</span>: <span class="nm">0.7</span>,
<span class="st">"short_prompt_max"</span>: <span class="nm">128</span>,
<span class="st">"long_prompt_min"</span>: <span class="nm">4096</span>,
<span class="st">"long_prompt_max"</span>: <span class="nm">8192</span>,
<span class="st">"priority_mix"</span>: [<span class="nm">0.2</span>, <span class="nm">0.5</span>, <span class="nm">0.3</span>],
<span class="st">"slo_interactive_ms"</span>: <span class="nm">200.0</span>,
<span class="st">"slo_batch_ms"</span>: <span class="nm">2000.0</span>,
<span class="st">"cost_budget_episode"</span>: <span class="nm">5.0</span>,
<span class="st">"memory_hard_limit_gb"</span>: <span class="nm">38.0</span>,
<span class="st">"active_actions"</span>: [<span class="st">"kv_budget"</span>, <span class="st">"batch_size"</span>, <span class="st">"spec_length"</span>, <span class="st">"prefill_disagg"</span>, <span class="st">"quant_tier"</span>]
}
}
}</pre></div>
</div>
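<div class="label">ILLUSTRATIVE SKETCH: SAMPLING FROM A WORKLOAD CONFIG</div>
One plausible reading of how WorkloadGenerator consumes the Task 2 fields above: a Poisson arrival count per step plus clamped LogNormal prompt lengths. The inline TASK2 dict mirrors the JSON (normally loaded with <code>json.load</code>), and the per-step sampling strategy and seed are assumptions, not the committed implementation.

```python
import numpy as np

# Inline copy of the Task 2 fields used below (normally json.load from
# simulator/data/workload_configs.json).
TASK2 = {
    "arrival_rate_rps": 25.0,
    "prompt_len_mu": 5.2,
    "prompt_len_sigma": 1.3,
    "prompt_len_clamp_min": 32,
    "prompt_len_clamp_max": 8192,
}

def sample_step(cfg: dict, rng: np.random.Generator, step_seconds: float = 1.0):
    """One simulated step: Poisson arrival count + clamped LogNormal prompt lengths."""
    n = int(rng.poisson(cfg["arrival_rate_rps"] * step_seconds))
    lens = rng.lognormal(cfg["prompt_len_mu"], cfg["prompt_len_sigma"], size=n)
    lens = np.clip(lens, cfg["prompt_len_clamp_min"], cfg["prompt_len_clamp_max"])
    return n, lens.astype(int)

rng = np.random.default_rng(42)  # seed matches the module contracts' default
n, lens = sample_step(TASK2, rng)
```

Seeding the generator once per episode is what makes the `reset_seed` reproducibility contract in Appendix A satisfiable.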
<div class="label">COMPLETE OBSERVATION & ACTION SPACE REFERENCE</div>
<div class="table-wrap mb14">
<table>
<tr><th>Field</th><th>Type</th><th>Range</th><th>Normalised?</th><th>Description</th></tr>
<tr><td><code>queue_depth</code></td><td>float</td><td>[0, 512]</td><td>No</td><td>Pending requests in serving queue</td></tr>
<tr><td><code>mean_prompt_len</code></td><td>float</td><td>[32, 8192]</td><td>No</td><td>Mean token count of current window</td></tr>
<tr><td><code>arrival_rate</code></td><td>float</td><td>[0, 200]</td><td>No</td><td>10-step EMA requests/second</td></tr>
<tr><td><code>kv_cache_occupancy</code></td><td>float</td><td>[0.0, 1.0]</td><td>Yes</td><td>Fraction of KV cache in use</td></tr>
<tr><td><code>ttft_p50</code></td><td>float</td><td>[0, 5000] ms</td><td>No</td><td>Median TTFT last 20 requests</td></tr>
<tr><td><code>tpot_p50</code></td><td>float</td><td>[0, 500] ms</td><td>No</td><td>Median time-per-output-token</td></tr>
<tr><td><code>slo_violation_rate</code></td><td>float</td><td>[0.0, 1.0]</td><td>Yes</td><td>Fraction of requests missing SLO</td></tr>
<tr><td><code>gpu_memory_used_gb</code></td><td>float</td><td>[0, 80]</td><td>No</td><td>Simulated GPU memory pressure</td></tr>
<tr><td><code>spec_accept_rate</code></td><td>float</td><td>[0.0, 1.0]</td><td>Yes</td><td>Speculative token acceptance rate</td></tr>
<tr><td><code>priority_distribution</code></td><td>float[3]</td><td>[0,1] each</td><td>Yes</td><td>[interactive, batch, best_effort] fractions</td></tr>
<tr><td><code>timestep</code></td><td>int</td><td>[0, 200]</td><td>No</td><td>Current episode step</td></tr>
<tr><td><code>cost_so_far</code></td><td>float</td><td>[0, ∞)</td><td>No</td><td>Cumulative cost this episode</td></tr>
</table>
</div>
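<div class="label">ILLUSTRATIVE SKETCH: OBSERVATION NORMALISATION</div>
The <code>_build_obs()</code> helper in observation.py has to map this table onto a flat [0,1] vector for RL agents. A sketch assuming simple min-max scaling against the tabulated ranges; the cap on the open-ended <code>cost_so_far</code> field is a hypothetical choice, and the field ordering here is illustrative rather than the locked models.py order.

```python
import numpy as np

# (low, high) per scalar field, taken from the ranges in the table above.
OBS_BOUNDS = {
    "queue_depth": (0.0, 512.0),
    "mean_prompt_len": (32.0, 8192.0),
    "arrival_rate": (0.0, 200.0),
    "kv_cache_occupancy": (0.0, 1.0),
    "ttft_p50": (0.0, 5000.0),
    "tpot_p50": (0.0, 500.0),
    "slo_violation_rate": (0.0, 1.0),
    "gpu_memory_used_gb": (0.0, 80.0),
    "spec_accept_rate": (0.0, 1.0),
    "timestep": (0.0, 200.0),
    "cost_so_far": (0.0, 10.0),  # open-ended in the table; a cap is assumed here
}

def normalise_obs(obs: dict) -> np.ndarray:
    """Map a raw observation dict to a flat [0,1] float32 vector for RL agents."""
    flat = []
    for name, (lo, hi) in OBS_BOUNDS.items():
        x = (float(obs[name]) - lo) / (hi - lo)
        flat.append(min(max(x, 0.0), 1.0))      # clamp out-of-range readings
    flat.extend(obs["priority_distribution"])   # already normalised fractions
    return np.asarray(flat, dtype=np.float32)

vec = normalise_obs({
    "queue_depth": 256, "mean_prompt_len": 4112, "arrival_rate": 100,
    "kv_cache_occupancy": 0.5, "ttft_p50": 2500, "tpot_p50": 250,
    "slo_violation_rate": 0.5, "gpu_memory_used_gb": 40, "spec_accept_rate": 0.5,
    "timestep": 100, "cost_so_far": 5.0, "priority_distribution": [0.2, 0.5, 0.3],
})
```

Flattening the 12 fields this way (11 scalars plus the 3-element priority vector) yields a 14-dimensional input, which is the number the PPO wrapper's observation space would declare.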
</div>
<!-- ═══════════════════════════════ APPENDIX C: RISKS ═══════════════════════════════ -->
<div class="section" id="risks">
<div class="section-header">
<div class="section-num">§C</div>
<div class="section-meta">
<div class="section-title">Appendix C — Risk Register</div>
<div class="section-sub">Every known failure mode, its probability, and exact mitigation steps.</div>
</div>
</div>
<div class="table-wrap mb14">
<table>
<tr>
<th style="width:200px">Risk</th>
<th style="width:80px">Prob</th>
<th>Mitigation</th>
<th style="width:80px">Owner</th>
</tr>
<tr>
<td><strong>Trace data is wrong shape</strong><br><span style="font-size:10px;color:var(--text3)">Published benchmarks don't have the exact columns needed</span></td>
<td><span class="eyebrow-tag et-amber">Medium</span></td>
<td>Implement Option C (synthetic data) on Day 1 before even trying Option A. This takes 30 minutes and gives you a valid fallback. Option A then becomes an enhancement, not a dependency.</td>
<td>C</td>
</tr>
<tr>
<td><strong>PPO doesn't converge</strong><br><span style="font-size:10px;color:var(--text3)">Reward curve is flat or decreasing</span></td>
<td><span class="eyebrow-tag et-green">Low</span></td>
<td>Task 1 is designed for easy learning. If PPO fails: (1) add VecNormalize wrapper, (2) lower learning rate to 1e-4, (3) check reward is truly in [-1,1]. If still failing, use a simple hill-climbing agent β€” just show any rising curve.</td>
<td>C</td>
</tr>
<tr>
<td><strong>HuggingFace Spaces OOM</strong><br><span style="font-size:10px;color:var(--text3)">Free tier has 16GB RAM — simulator might use too much</span></td>
<td><span class="eyebrow-tag et-green">Low</span></td>
<td>Load trace data as a numpy array, not a pandas DataFrame, at startup. Target &lt;200MB for the lookup table. Use <code>parquet</code> with snappy compression. Test memory usage locally with <code>psutil</code> before deploying.</td>
<td>B</td>
</tr>
<tr>
<td><strong>Race condition in session cache</strong><br><span style="font-size:10px;color:var(--text3)">Concurrent requests corrupt session state</span></td>
<td><span class="eyebrow-tag et-amber">Medium</span></td>
<td>All reads and writes to the <code>self._sessions</code> dict are wrapped in a <code>threading.Lock()</code>. Individual <code>InferenceEnv</code> instances are not thread-safe, and FastAPI does not serialise concurrent requests for the same session_id on its own — so guard each env with a per-session lock as well, even though each session is normally driven by a single caller stepping synchronously.</td>
<td>B</td>
</tr>
<tr>
<td><strong>Grader gives score &gt; 1.0 or &lt; 0.0</strong><br><span style="font-size:10px;color:var(--text3)">Formula constants are miscalibrated</span></td>
<td><span class="eyebrow-tag et-amber">Medium</span></td>
<td>All grader component scores are individually <code>np.clip(x, 0.0, 1.0)</code> before the weighted sum. The final score is also clipped. Calibrate BASELINE_TPS and OPTIMAL_TPS constants on Day 5 by running the actual baseline agent and verifying scores fall in [0.20, 0.40].</td>
<td>C</td>
</tr>
<tr>
<td><strong>Person A is blocked on Day 3</strong><br><span style="font-size:10px;color:var(--text3)">Simulator not done, Person B and C can't proceed</span></td>
<td><span class="eyebrow-tag et-amber">Medium</span></td>
<td>Person A prioritises the interface (a <code>simulate()</code> that returns a valid <code>MetricsSnapshot</code>) over implementation quality. A synthetic linear model with hardcoded constants is enough for Day 3. Persons B and C only need the method signature to work. Real trace data can be plugged in on Day 4.</td>
<td>A</td>
</tr>
<tr>
<td><strong>Docker image &gt;2GB</strong><br><span style="font-size:10px;color:var(--text3)">stable-baselines3 pulls large PyTorch dependency</span></td>
<td><span class="eyebrow-tag et-amber">Medium</span></td>
<td>Install <code>stable-baselines3[extra]</code> only in a separate <code>requirements-demo.txt</code> that is NOT in the Dockerfile. The server only needs the environment. The PPO demo runs from outside the container (in Colab). This keeps the image under 500MB.</td>
<td>B</td>
</tr>
<tr>
<td><strong>OpenEnv spec compliance fails</strong><br><span style="font-size:10px;color:var(--text3)">openenv validate finds schema mismatches</span></td>
<td><span class="eyebrow-tag et-green">Low</span></td>
<td>Run <code>openenv validate</code> at the end of every day starting Day 3. Validation failures are almost always JSON-schema issues — field names, types, missing fields. Fix them immediately; never defer. Keep a local copy of the openenv spec open while writing endpoint response schemas.</td>
<td>B</td>
</tr>
</table>
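The locking pattern from the session-cache mitigation row can be sketched in a few lines. <code>SessionStore</code> and its method names are illustrative, not the actual server code; the point is that every touch of the shared dict goes through one lock while each environment stays single-owner:

```python
import threading
import uuid


class SessionStore:
    """Thread-safe registry of per-session environments.

    Every read/write of the shared dict happens under one lock;
    the env objects themselves are single-owner per session, so
    they need no locking of their own.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = {}

    def create(self, env) -> str:
        # Mint a fresh session_id and register the env under the lock.
        session_id = str(uuid.uuid4())
        with self._lock:
            self._sessions[session_id] = env
        return session_id

    def get(self, session_id):
        # Lookup under the lock; raises KeyError for unknown sessions.
        with self._lock:
            return self._sessions[session_id]

    def drop(self, session_id):
        # Idempotent removal so double-deletes are harmless.
        with self._lock:
            self._sessions.pop(session_id, None)
```

Holding the lock only around dict operations (never around a whole /step call) keeps the critical section tiny, so contention stays negligible even under concurrent resets.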
</div>
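A minimal sketch of the clipping discipline described in the grader risk row (the component names and weights here are invented for illustration, not the real grader formula):

```python
import numpy as np


def grade(throughput: float, latency: float, cost: float,
          weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted grader score guaranteed to land in [0.0, 1.0].

    Each component is clipped before the weighted sum, and the
    final sum is clipped again as a belt-and-braces guard against
    miscalibrated constants.
    """
    components = np.clip([throughput, latency, cost], 0.0, 1.0)
    score = float(np.dot(weights, components))
    return float(np.clip(score, 0.0, 1.0))
```

Because the weights sum to 1.0 and every component is already bounded, the outer clip should never fire; keeping it anyway means a future weight tweak can't silently push scores out of range.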
</div>
<!-- ═══════════════════════════════ APPENDIX D: CHECKLIST ═══════════════════════════════ -->
<div class="section" id="checklist">
<div class="section-header">
<div class="section-num">§D</div>
<div class="section-meta">
<div class="section-title">Appendix D — Final Submission Checklist</div>
<div class="section-sub">Every item must be checked before submitting. Do not submit until all boxes are ticked.</div>
</div>
</div>
<div class="grid2 mb20">
<div>
<div class="label">OPENENV COMPLIANCE</div>
<ul class="checklist">
<li><div class="chk"></div>POST /reset returns <code>session_id</code> + initial <code>observation</code> dict</li>
<li><div class="chk"></div>POST /step returns <code>observation</code> + <code>reward</code> (float) + <code>done</code> (bool) + <code>info</code></li>
<li><div class="chk"></div>GET /state returns current episode metadata</li>
<li><div class="chk"></div>GET /tasks returns 3 tasks with id, name, difficulty labels</li>
<li><div class="chk"></div>POST /grader returns score 0.0–1.0 + breakdown dict + feedback string</li>
<li><div class="chk"></div>GET /baseline returns reproducible baseline scores for all 3 tasks</li>
<li><div class="chk"></div>GET /health returns <code>{"status": "ok"}</code></li>
<li><div class="chk"></div><code>openenv validate --url https://YOUR_SPACE.hf.space</code> passes with no errors</li>
<li><div class="chk"></div>3 tasks with easy/medium/hard difficulty labels present</li>
<li><div class="chk"></div>Reward function documented with partial credit design</li>
</ul>
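As a quick pre-flight before running the real <code>openenv validate</code>, the /step response shape listed above can be checked with a few lines of stdlib Python. This validator is a hypothetical helper, not part of the openenv tooling:

```python
def validate_step_payload(payload: dict) -> list:
    """Return a list of schema problems in a /step response (empty = OK).

    Mirrors the checklist item: observation (dict), reward (float),
    done (bool), info (dict). A lightweight stand-in used during
    development, not a replacement for `openenv validate`.
    """
    expected = {
        "observation": dict,
        "reward": float,
        "done": bool,
        "info": dict,
    }
    problems = []
    for field, typ in expected.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], typ):
            problems.append(f"{field} should be {typ.__name__}, "
                            f"got {type(payload[field]).__name__}")
    return problems
```

Note that an integer reward fails the <code>float</code> check on purpose: serialising rewards as ints is exactly the kind of schema drift the real validator flags.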
</div>
<div>
<div class="label">QUALITY CRITERIA</div>
<ul class="checklist">
<li><div class="chk"></div>Baseline agent runs reproducibly (fixed seed=0, same score every run)</li>
<li><div class="chk"></div>PPO reward curve plot shows a statistically significant upward trend</li>
<li><div class="chk"></div>Colab notebook runs end-to-end in &lt;15 minutes on free T4</li>
<li><div class="chk"></div>README has: pitch paragraph, quickstart, scores table, Colab link, video link</li>
<li><div class="chk"></div>ENVIRONMENT.md has full technical spec</li>
<li><div class="chk"></div>No API keys, no secrets in repository</li>
<li><div class="chk"></div>No large binary files committed to git (use .gitignore for *.parquet — serve from HF repo)</li>
<li><div class="chk"></div>Grader is deterministic (run same episode log twice, get same score)</li>
<li><div class="chk"></div>2-minute demo video recorded and linked in README</li>
<li><div class="chk"></div>HF Space is public (not private or gated)</li>
</ul>
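The two reproducibility items (seeded baseline, deterministic grader) can share one check: call the function several times and demand identical output. <code>check_reproducible</code> and <code>toy_baseline</code> below are illustrative sketches, not the project's actual code:

```python
import random


def check_reproducible(fn, runs: int = 3):
    """Call fn several times and confirm every run returns the
    same value. The same check works for the seeded baseline agent
    and for the grader replaying a fixed episode log."""
    results = [fn() for _ in range(runs)]
    if len(set(results)) != 1:
        raise AssertionError(f"non-deterministic results: {results}")
    return results[0]


def toy_baseline(seed: int = 0) -> float:
    # Stand-in for the real baseline agent: with a fixed seed,
    # every run produces the identical score.
    rng = random.Random(seed)
    return round(sum(rng.random() for _ in range(50)) / 50, 4)
```

Wiring this into CI (run the baseline twice per push, fail on mismatch) catches accidental nondeterminism, such as an unseeded RNG or dict-order dependence, before judges do.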
</div>
</div>
<div class="grid2">
<div>
<div class="label">DEPLOYMENT CHECKS</div>
<ul class="checklist">
<li><div class="chk"></div>Docker image builds locally with <code>docker build -t test .</code></li>
<li><div class="chk"></div>Image is under 2GB (<code>docker image ls</code>)</li>
<li><div class="chk"></div>Container starts and /health responds within 30s</li>
<li><div class="chk"></div>HF Spaces URL is live and all endpoints respond</li>
<li><div class="chk"></div>Tested from a fresh browser/machine with no local setup</li>
<li><div class="chk"></div>50 concurrent requests don't produce 500 errors</li>
<li><div class="chk"></div>HF Spaces shows "Running" not "Building" or "Error"</li>
</ul>
</div>
<div>
<div class="label">SUBMISSION FORM</div>
<ul class="checklist">
<li><div class="chk"></div>Environment name: <code>InferenceGym</code> (or your chosen name)</li>
<li><div class="chk"></div>Description: 500-word submission text</li>
<li><div class="chk"></div>All team member names listed</li>
<li><div class="chk"></div>HuggingFace Spaces URL submitted</li>
<li><div class="chk"></div>GitHub repository URL submitted (public)</li>
<li><div class="chk"></div>Submitted BEFORE 11:59 PM April 7</li>
<li><div class="chk"></div>No code changes pushed after submission time</li>
</ul>
</div>
</div>
<div class="alert alert-green" style="margin-top:20px;">
<div class="alert-title">🎯 The One-Line Summary for Judges</div>
InferenceGym is the first RL environment for LLM inference control. A naïve vLLM config scores 0.22 on the hardest task. A simple PPO agent trained for 50k steps reaches 0.65 — a 3× improvement in serving efficiency, no GPU, no model required. That's the pitch. Everything else in this document is how you build the thing that delivers that demo.
</div>
<hr class="rule">
<div style="text-align:center; font-family:var(--mono); font-size:10px; color:var(--text3); padding: 8px 0 20px; letter-spacing:0.08em;">
INFERENCEGYM · MASTER BUILD DOCUMENT · META PYTORCH × SCALER HACKATHON 2026 · DEADLINE APRIL 7
</div>
</div>
</div><!-- /wrap -->
</body>
</html>