| <!DOCTYPE html> |
| <html lang="en"> |
| <head> |
| <meta charset="UTF-8"> |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> |
<title>InferenceGym – Master Build Document</title>
| <style> |
| @import url('https://fonts.googleapis.com/css2?family=Geist+Mono:wght@300;400;500;600;700&family=Syne:wght@400;500;600;700;800&display=swap'); |
| |
| :root { |
| --bg: #070809; |
| --bg1: #0c0e11; |
| --bg2: #111418; |
| --bg3: #171c22; |
| --bg4: #1d242d; |
| --border: rgba(255,255,255,0.06); |
| --border2: rgba(255,255,255,0.10); |
| --border3: rgba(255,255,255,0.15); |
| --text: #dce3ec; |
| --text2: #7a8494; |
| --text3: #424c5c; |
| --text4: #2c3340; |
| --green: #22d3a0; |
| --green2: #0fa870; |
| --gdim: rgba(34,211,160,0.08); |
| --gborder: rgba(34,211,160,0.20); |
| --blue: #5b9cf6; |
| --bdim: rgba(91,156,246,0.08); |
| --bborder: rgba(91,156,246,0.20); |
| --amber: #f0a832; |
| --adim: rgba(240,168,50,0.08); |
| --aborder: rgba(240,168,50,0.20); |
| --red: #f05c5c; |
| --rdim: rgba(240,92,92,0.08); |
| --rborder: rgba(240,92,92,0.20); |
| --purple: #a78bfa; |
| --pdim: rgba(167,139,250,0.08); |
| --pborder: rgba(167,139,250,0.20); |
| --cyan: #38bdf8; |
| --cdim: rgba(56,189,248,0.08); |
| --cborder: rgba(56,189,248,0.20); |
| --mono: 'Geist Mono', monospace; |
| --sans: 'Syne', sans-serif; |
| } |
| |
| *, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; } |
| |
| html { scroll-behavior: smooth; } |
| |
| body { |
| background: var(--bg); |
| color: var(--text); |
| font-family: var(--mono); |
| font-size: 13px; |
| line-height: 1.7; |
| } |
| |
| |
| .wrap { max-width: 1100px; margin: 0 auto; padding: 56px 40px 120px; } |
| |
| |
| .cover { |
| position: relative; |
| border: 1px solid var(--border2); |
| border-radius: 16px; |
| overflow: hidden; |
| margin-bottom: 56px; |
| background: var(--bg1); |
| } |
| .cover-gradient { |
| position: absolute; |
| inset: 0; |
| background: |
| radial-gradient(ellipse 60% 50% at 10% 20%, rgba(34,211,160,0.06) 0%, transparent 70%), |
| radial-gradient(ellipse 50% 60% at 90% 80%, rgba(91,156,246,0.05) 0%, transparent 70%); |
| pointer-events: none; |
| } |
| .cover-top-bar { |
| height: 2px; |
| background: linear-gradient(90deg, var(--green), var(--blue), var(--purple), var(--amber)); |
| } |
| .cover-inner { padding: 48px 52px 52px; } |
| .cover-eyebrow { |
| display: flex; |
| align-items: center; |
| gap: 12px; |
| margin-bottom: 24px; |
| } |
| .eyebrow-tag { |
| font-family: var(--mono); |
| font-size: 10px; |
| font-weight: 600; |
| letter-spacing: 0.12em; |
| text-transform: uppercase; |
| padding: 4px 10px; |
| border-radius: 4px; |
| } |
| .et-green { color: var(--green); background: var(--gdim); border: 1px solid var(--gborder); } |
| .et-blue { color: var(--blue); background: var(--bdim); border: 1px solid var(--bborder); } |
| .et-amber { color: var(--amber); background: var(--adim); border: 1px solid var(--aborder); } |
| .et-red { color: var(--red); background: var(--rdim); border: 1px solid var(--rborder); } |
| .et-purple { color: var(--purple); background: var(--pdim); border: 1px solid var(--pborder); } |
| .et-cyan { color: var(--cyan); background: var(--cdim); border: 1px solid var(--cborder); } |
| |
| .cover h1 { |
| font-family: var(--sans); |
| font-size: 44px; |
| font-weight: 800; |
| letter-spacing: -0.03em; |
| line-height: 1.1; |
| color: #fff; |
| margin-bottom: 16px; |
| } |
| .cover h1 em { font-style: normal; color: var(--green); } |
| .cover-desc { |
| font-family: var(--mono); |
| font-size: 13px; |
| color: var(--text2); |
| max-width: 680px; |
| line-height: 1.75; |
| margin-bottom: 36px; |
| } |
| .cover-stats { |
| display: grid; |
| grid-template-columns: repeat(6, 1fr); |
| gap: 0; |
| border: 1px solid var(--border2); |
| border-radius: 10px; |
| overflow: hidden; |
| } |
| .stat-cell { |
| padding: 14px 18px; |
| border-right: 1px solid var(--border); |
| } |
| .stat-cell:last-child { border-right: none; } |
| .stat-label { font-size: 9px; font-weight: 600; letter-spacing: 0.10em; text-transform: uppercase; color: var(--text3); margin-bottom: 4px; } |
| .stat-val { font-size: 13px; font-weight: 600; color: var(--text); } |
| .stat-val.green { color: var(--green); } |
| .stat-val.amber { color: var(--amber); } |
| .stat-val.red { color: var(--red); } |
| |
| |
| .toc-box { |
| background: var(--bg1); |
| border: 1px solid var(--border2); |
| border-radius: 12px; |
| padding: 28px 32px; |
| margin-bottom: 56px; |
| } |
| .toc-title { |
| font-family: var(--sans); |
| font-size: 11px; |
| font-weight: 700; |
| letter-spacing: 0.14em; |
| text-transform: uppercase; |
| color: var(--text3); |
| margin-bottom: 20px; |
| } |
| .toc-phases { |
| display: grid; |
| grid-template-columns: 1fr 1fr; |
| gap: 6px; |
| } |
| .toc-item { |
| display: flex; |
| align-items: center; |
| gap: 10px; |
| padding: 8px 12px; |
| border-radius: 6px; |
| text-decoration: none; |
| transition: background 0.15s; |
| border: 1px solid transparent; |
| } |
| .toc-item:hover { background: var(--bg3); border-color: var(--border2); } |
| .toc-num { |
| font-size: 10px; |
| font-weight: 700; |
| color: var(--text3); |
| width: 28px; |
| flex-shrink: 0; |
| } |
| .toc-name { font-size: 12px; color: var(--text2); } |
| .toc-badge { |
| margin-left: auto; |
| font-size: 9px; |
| font-weight: 700; |
| padding: 2px 7px; |
| border-radius: 3px; |
| letter-spacing: 0.06em; |
| flex-shrink: 0; |
| } |
| |
| |
| .section { margin-bottom: 64px; scroll-margin-top: 32px; } |
| .section-header { |
| display: flex; |
| align-items: flex-start; |
| gap: 20px; |
| margin-bottom: 28px; |
| padding-bottom: 20px; |
| border-bottom: 1px solid var(--border); |
| } |
| .section-num { |
| font-family: var(--mono); |
| font-size: 11px; |
| font-weight: 700; |
| color: var(--text4); |
| padding-top: 4px; |
| flex-shrink: 0; |
| width: 32px; |
| } |
| .section-meta { flex: 1; } |
| .section-title { |
| font-family: var(--sans); |
| font-size: 24px; |
| font-weight: 700; |
| letter-spacing: -0.02em; |
| color: #fff; |
| margin-bottom: 6px; |
| } |
| .section-sub { font-size: 12px; color: var(--text2); line-height: 1.65; } |
| |
| |
| .phase-card { |
| border-radius: 12px; |
| border: 1px solid var(--border2); |
| overflow: hidden; |
| margin-bottom: 20px; |
| } |
| .phase-header { |
| display: flex; |
| align-items: center; |
| gap: 16px; |
| padding: 18px 22px; |
| border-bottom: 1px solid var(--border); |
| } |
| .phase-icon { |
| font-size: 18px; |
| width: 36px; |
| text-align: center; |
| } |
| .phase-label { |
| font-family: var(--mono); |
| font-size: 10px; |
| font-weight: 600; |
| letter-spacing: 0.10em; |
| text-transform: uppercase; |
| margin-bottom: 3px; |
| } |
| .phase-name { |
| font-family: var(--sans); |
| font-size: 16px; |
| font-weight: 700; |
| color: #fff; |
| } |
| .phase-meta { |
| margin-left: auto; |
| display: flex; |
| flex-direction: column; |
| align-items: flex-end; |
| gap: 4px; |
| } |
| .phase-days { font-size: 11px; color: var(--text2); } |
| .phase-body { padding: 22px; background: var(--bg1); } |
| .phase-desc { font-size: 12px; color: var(--text2); line-height: 1.7; margin-bottom: 20px; } |
| |
| |
| .deliverable-box { |
| background: var(--bg2); |
| border: 1px solid var(--border); |
| border-radius: 8px; |
| padding: 14px 16px; |
| margin-bottom: 14px; |
| } |
| .deliverable-title { |
| font-size: 11px; |
| font-weight: 700; |
| letter-spacing: 0.08em; |
| text-transform: uppercase; |
| color: var(--text3); |
| margin-bottom: 10px; |
| } |
| .deliverable-list { list-style: none; display: flex; flex-direction: column; gap: 6px; } |
| .deliverable-list li { |
| display: flex; |
| align-items: flex-start; |
| gap: 10px; |
| font-size: 12px; |
| color: var(--text2); |
| line-height: 1.6; |
| } |
| .dl-bullet { |
| width: 16px; |
| height: 16px; |
| border-radius: 3px; |
| flex-shrink: 0; |
| margin-top: 1px; |
| display: flex; |
| align-items: center; |
| justify-content: center; |
| font-size: 9px; |
| font-weight: 700; |
| } |
| .dl-green { background: var(--gdim); color: var(--green); border: 1px solid var(--gborder); } |
| .dl-blue { background: var(--bdim); color: var(--blue); border: 1px solid var(--bborder); } |
| .dl-amber { background: var(--adim); color: var(--amber); border: 1px solid var(--aborder); } |
| .dl-red { background: var(--rdim); color: var(--red); border: 1px solid var(--rborder); } |
| .dl-purple { background: var(--pdim); color: var(--purple);border: 1px solid var(--pborder); } |
| .dl-cyan { background: var(--cdim); color: var(--cyan); border: 1px solid var(--cborder); } |
| |
| .dl-text strong { color: var(--text); font-weight: 600; display: block; } |
| |
| |
| .module-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 14px; margin-bottom: 14px; } |
| .module-card { |
| background: var(--bg2); |
| border: 1px solid var(--border); |
| border-radius: 10px; |
| overflow: hidden; |
| } |
| .module-card-header { |
| padding: 12px 16px; |
| border-bottom: 1px solid var(--border); |
| display: flex; |
| align-items: center; |
| justify-content: space-between; |
| } |
| .module-card-name { font-size: 13px; font-weight: 600; color: #fff; } |
| .module-card-file { font-size: 10px; color: var(--text3); font-family: var(--mono); } |
| .module-card-body { padding: 14px 16px; } |
| .module-card-desc { font-size: 12px; color: var(--text2); line-height: 1.65; margin-bottom: 12px; } |
| |
| .spec-list { list-style: none; display: flex; flex-direction: column; gap: 5px; } |
| .spec-list li { |
| display: flex; |
| gap: 8px; |
| font-size: 11px; |
| color: var(--text2); |
| line-height: 1.5; |
| } |
| .spec-list li::before { |
content: '▸';
| color: var(--text3); |
| flex-shrink: 0; |
| font-size: 10px; |
| margin-top: 1px; |
| } |
| .spec-list li code { color: var(--green); font-size: 10px; } |
| |
| |
| .code-block { |
| background: var(--bg3); |
| border: 1px solid var(--border); |
| border-radius: 10px; |
| overflow: hidden; |
| margin-bottom: 14px; |
| font-size: 11.5px; |
| line-height: 1.65; |
| } |
| .code-block-header { |
| display: flex; |
| align-items: center; |
| justify-content: space-between; |
| padding: 8px 14px; |
| border-bottom: 1px solid var(--border); |
| background: var(--bg4); |
| } |
| .code-lang { font-size: 9px; font-weight: 700; letter-spacing: 0.10em; text-transform: uppercase; color: var(--text3); } |
| .code-file { font-size: 10px; color: var(--text3); } |
| .code-body { padding: 16px 18px; overflow-x: auto; } |
| .code-body pre { margin: 0; white-space: pre; } |
| .kw { color: var(--purple); } |
| .fn { color: var(--blue); } |
| .st { color: var(--green); } |
| .cm { color: var(--text3); font-style: italic; } |
| .nm { color: var(--amber); } |
| .dc { color: var(--cyan); } |
| .tp { color: #e879f9; } |
| .op { color: var(--text2); } |
| |
| |
| .table-wrap { overflow-x: auto; border-radius: 10px; border: 1px solid var(--border); margin-bottom: 14px; } |
| table { width: 100%; border-collapse: collapse; font-size: 12px; } |
| th { |
| font-family: var(--mono); |
| font-size: 9px; |
| font-weight: 700; |
| letter-spacing: 0.10em; |
| text-transform: uppercase; |
| color: var(--text3); |
| padding: 10px 14px; |
| border-bottom: 1px solid var(--border2); |
| background: var(--bg3); |
| text-align: left; |
| white-space: nowrap; |
| } |
| td { |
| padding: 10px 14px; |
| border-bottom: 1px solid var(--border); |
| color: var(--text2); |
| vertical-align: top; |
| line-height: 1.5; |
| } |
| td strong { color: var(--text); font-weight: 600; } |
| td code { font-family: var(--mono); font-size: 11px; color: var(--green); } |
| tr:last-child td { border-bottom: none; } |
| |
| |
| .alert { |
| border-radius: 8px; |
| padding: 14px 16px; |
| margin-bottom: 14px; |
| font-size: 12px; |
| line-height: 1.65; |
| border: 1px solid; |
| } |
| .alert-title { font-weight: 700; font-size: 11px; letter-spacing: 0.06em; text-transform: uppercase; margin-bottom: 5px; } |
| .alert-green { background: var(--gdim); border-color: var(--gborder); color: #a3f0d8; } |
| .alert-amber { background: var(--adim); border-color: var(--aborder); color: #f5d49a; } |
| .alert-blue { background: var(--bdim); border-color: var(--bborder); color: #a8c7fa; } |
| .alert-red { background: var(--rdim); border-color: var(--rborder); color: #f5a0a0; } |
| .alert-purple { background: var(--pdim); border-color: var(--pborder); color: #d4c0ff; } |
| .alert-cyan { background: var(--cdim); border-color: var(--cborder); color: #a0dff5; } |
| |
| |
| .timeline { position: relative; } |
| .tl-row { |
| display: flex; |
| gap: 0; |
| margin-bottom: 0; |
| } |
| .tl-left { |
| width: 90px; |
| flex-shrink: 0; |
| text-align: right; |
| padding-right: 20px; |
| padding-top: 16px; |
| } |
| .tl-day-label { font-size: 10px; font-weight: 600; color: var(--text3); line-height: 1.4; } |
| .tl-day-label.today { color: var(--green); } |
| .tl-connector { |
| width: 20px; |
| flex-shrink: 0; |
| display: flex; |
| flex-direction: column; |
| align-items: center; |
| } |
| .tl-dot { |
| width: 10px; |
| height: 10px; |
| border-radius: 50%; |
| background: var(--bg4); |
| border: 2px solid var(--border2); |
| margin-top: 20px; |
| flex-shrink: 0; |
| z-index: 1; |
| } |
| .tl-dot.g { background: var(--green); border-color: var(--green); box-shadow: 0 0 8px rgba(34,211,160,0.5); } |
| .tl-dot.b { background: var(--blue); border-color: var(--blue); } |
| .tl-dot.a { background: var(--amber); border-color: var(--amber); } |
| .tl-dot.p { background: var(--purple);border-color: var(--purple); } |
| .tl-dot.r { background: var(--red); border-color: var(--red); } |
| .tl-line { width: 1px; flex: 1; background: var(--border2); } |
| .tl-right { flex: 1; padding: 8px 0 8px 16px; } |
| .tl-card { |
| background: var(--bg1); |
| border: 1px solid var(--border); |
| border-radius: 8px; |
| padding: 14px 16px; |
| margin-bottom: 8px; |
| } |
| .tl-card.green-border { border-color: var(--gborder); } |
| .tl-card.blue-border { border-color: var(--bborder); } |
| .tl-card.amber-border { border-color: var(--aborder); } |
| .tl-card.red-border { border-color: var(--rborder); } |
| .tl-card-phase { |
| font-size: 9px; |
| font-weight: 700; |
| letter-spacing: 0.10em; |
| text-transform: uppercase; |
| margin-bottom: 8px; |
| } |
| .tl-tasks-list { list-style: none; display: flex; flex-direction: column; gap: 5px; } |
| .tl-tasks-list li { |
| font-size: 11.5px; |
| color: var(--text2); |
| display: grid; |
| grid-template-columns: 70px 1fr; |
| gap: 8px; |
| line-height: 1.5; |
| } |
| .tl-person { font-size: 10px; font-weight: 700; color: var(--text3); padding-top: 1px; } |
| .tl-task { color: var(--text2); } |
| .tl-task strong { color: var(--text); font-weight: 600; } |
| |
| |
| .person-grid { display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 12px; margin-bottom: 14px; } |
| .person-card { |
| background: var(--bg2); |
| border: 1px solid var(--border); |
| border-radius: 10px; |
| overflow: hidden; |
| } |
| .person-header { |
| padding: 12px 16px; |
| border-bottom: 1px solid var(--border); |
| } |
| .person-name { font-family: var(--sans); font-size: 15px; font-weight: 700; color: #fff; margin-bottom: 4px; } |
| .person-role { font-size: 10px; color: var(--text3); } |
| .person-body { padding: 14px 16px; } |
| .person-tasks { list-style: none; display: flex; flex-direction: column; gap: 8px; } |
| .person-tasks li { |
| font-size: 11px; |
| color: var(--text2); |
| padding-left: 10px; |
| border-left: 2px solid var(--border2); |
| line-height: 1.55; |
| } |
| .person-tasks li strong { color: var(--text); font-weight: 600; display: block; } |
| |
| |
| .risk-row { |
| display: grid; |
| grid-template-columns: 2fr 80px 3fr; |
| gap: 0; |
| border-bottom: 1px solid var(--border); |
| padding: 12px 14px; |
| align-items: start; |
| font-size: 12px; |
| } |
| .risk-row:last-child { border-bottom: none; } |
| .risk-name { color: var(--text); font-weight: 500; padding-right: 12px; } |
| .risk-prob { text-align: center; } |
| .risk-mit { color: var(--text2); padding-left: 12px; border-left: 1px solid var(--border); } |
| |
| |
| .checklist { list-style: none; display: flex; flex-direction: column; gap: 7px; } |
| .checklist li { |
| display: flex; |
| align-items: flex-start; |
| gap: 10px; |
| font-size: 12px; |
| color: var(--text2); |
| line-height: 1.6; |
| } |
| .chk { |
| width: 16px; height: 16px; |
| border-radius: 4px; |
| border: 1px solid var(--border2); |
| flex-shrink: 0; |
| margin-top: 1px; |
| display: flex; |
| align-items: center; |
| justify-content: center; |
| font-size: 9px; |
| } |
| |
| |
| .gate-box { |
| background: var(--gdim); |
| border: 1px solid var(--gborder); |
| border-radius: 8px; |
| padding: 14px 16px; |
| margin: 14px 0; |
| display: flex; |
| align-items: flex-start; |
| gap: 12px; |
| } |
| .gate-icon { font-size: 18px; flex-shrink: 0; } |
| .gate-label { font-size: 9px; font-weight: 700; letter-spacing: 0.10em; text-transform: uppercase; color: var(--green); margin-bottom: 4px; } |
| .gate-text { font-size: 12px; color: var(--text2); line-height: 1.65; } |
| .gate-text strong { color: var(--text); font-weight: 600; } |
| |
| |
| .grid2 { display: grid; grid-template-columns: 1fr 1fr; gap: 14px; } |
| .grid3 { display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 14px; } |
| .mb8 { margin-bottom: 8px; } |
| .mb14 { margin-bottom: 14px; } |
| .mb20 { margin-bottom: 20px; } |
| .mt20 { margin-top: 20px; } |
| hr.rule { border: none; border-top: 1px solid var(--border); margin: 40px 0; } |
| .label { |
| font-size: 9px; |
| font-weight: 700; |
| letter-spacing: 0.12em; |
| text-transform: uppercase; |
| color: var(--text3); |
| margin-bottom: 10px; |
| } |
| .mono { font-family: var(--mono); } |
| </style> |
| </head> |
| <body> |
| <div class="wrap"> |
|
|
| |
| <div class="cover"> |
| <div class="cover-gradient"></div> |
| <div class="cover-top-bar"></div> |
| <div class="cover-inner"> |
| <div class="cover-eyebrow"> |
| <span class="eyebrow-tag et-green">MASTER BUILD DOCUMENT</span> |
| <span class="eyebrow-tag et-blue">PHASE-BY-PHASE</span> |
| <span class="eyebrow-tag et-amber">ALWAYS FUNCTIONAL</span> |
| </div> |
| <h1><em>InferenceGym</em><br>Complete Engineering Plan</h1> |
| <p class="cover-desc"> |
| A modular, phase-gated engineering plan for building the first RL environment for LLM inference control. |
| Every phase ends with a fully functional, testable system. No phase leaves you broken. |
Deadline: April 7, 2026 · 11 days · 3 people.
| </p> |
| <div class="cover-stats"> |
| <div class="stat-cell"> |
| <div class="stat-label">Deadline</div> |
| <div class="stat-val red">Apr 7, 2026</div> |
| </div> |
| <div class="stat-cell"> |
| <div class="stat-label">Days Left</div> |
| <div class="stat-val amber">11 days</div> |
| </div> |
| <div class="stat-cell"> |
| <div class="stat-label">Team Size</div> |
| <div class="stat-val">3 people</div> |
| </div> |
| <div class="stat-cell"> |
| <div class="stat-label">Phases</div> |
| <div class="stat-val green">6 phases</div> |
| </div> |
| <div class="stat-cell"> |
| <div class="stat-label">Deploy Target</div> |
| <div class="stat-val">HF Spaces</div> |
| </div> |
| <div class="stat-cell"> |
| <div class="stat-label">Prize Pool</div> |
| <div class="stat-val green">$30,000</div> |
| </div> |
| </div> |
| </div> |
| </div> |
|
|
| |
| <div class="toc-box"> |
| <div class="toc-title">Table of Contents</div> |
| <div class="toc-phases"> |
| <a href="#phase0" class="toc-item"> |
| <span class="toc-num">P0</span> |
| <span class="toc-name">Setup & Architecture Lock</span> |
| <span class="toc-badge et-cyan">Day 1</span> |
| </a> |
| <a href="#phase1" class="toc-item"> |
| <span class="toc-num">P1</span> |
| <span class="toc-name">Simulator Core (MVP)</span> |
<span class="toc-badge et-green">Days 2–3</span>
| </a> |
| <a href="#phase2" class="toc-item"> |
| <span class="toc-num">P2</span> |
| <span class="toc-name">Environment Logic</span> |
| <span class="toc-badge et-blue">Day 4</span> |
| </a> |
| <a href="#phase3" class="toc-item"> |
| <span class="toc-num">P3</span> |
| <span class="toc-name">API Layer & Docker</span> |
| <span class="toc-badge et-blue">Day 5</span> |
| </a> |
| <a href="#phase4" class="toc-item"> |
| <span class="toc-num">P4</span> |
| <span class="toc-name">Grader, Baseline & Tasks</span> |
<span class="toc-badge et-amber">Days 6–7</span>
| </a> |
| <a href="#phase5" class="toc-item"> |
| <span class="toc-num">P5</span> |
| <span class="toc-name">Deployment & Demo Agent</span> |
<span class="toc-badge et-amber">Days 8–9</span>
| </a> |
| <a href="#phase6" class="toc-item"> |
| <span class="toc-num">P6</span> |
| <span class="toc-name">Polish, Submission & Buffer</span> |
<span class="toc-badge et-purple">Days 10–11</span>
| </a> |
| <a href="#modules" class="toc-item"> |
| <span class="toc-num">Β§A</span> |
| <span class="toc-name">Full Module Specifications</span> |
| <span class="toc-badge et-green">Reference</span> |
| </a> |
| <a href="#dataschema" class="toc-item"> |
| <span class="toc-num">Β§B</span> |
| <span class="toc-name">Data Schemas & APIs</span> |
| <span class="toc-badge et-blue">Reference</span> |
| </a> |
| <a href="#risks" class="toc-item"> |
| <span class="toc-num">Β§C</span> |
| <span class="toc-name">Risk Register & Mitigations</span> |
| <span class="toc-badge et-red">Reference</span> |
| </a> |
| <a href="#checklist" class="toc-item"> |
| <span class="toc-num">Β§D</span> |
| <span class="toc-name">Final Submission Checklist</span> |
| <span class="toc-badge et-purple">Reference</span> |
| </a> |
| </div> |
| </div> |
|
|
| |
| <div class="section"> |
| <div class="section-header"> |
| <div class="section-num">00</div> |
| <div class="section-meta"> |
| <div class="section-title">Engineering Philosophy</div> |
| <div class="section-sub">Guiding principles that govern every implementation decision in this project.</div> |
| </div> |
| </div> |
|
|
| <div class="grid3 mb14"> |
| <div class="module-card"> |
| <div class="module-card-header"> |
| <span class="module-card-name" style="color:var(--green)">Always Functional</span> |
| </div> |
| <div class="module-card-body"> |
| <div class="module-card-desc">After every phase ends, the system must be in a state where you can run it, call it, and get a valid response. No "half-built" states that block testing. If Phase 1 is done, someone can import the simulator and call <code>simulate(action)</code> right now.</div> |
| </div> |
| </div> |
| <div class="module-card"> |
| <div class="module-card-header"> |
| <span class="module-card-name" style="color:var(--blue)">Stub First, Flesh Later</span> |
| </div> |
| <div class="module-card-body"> |
| <div class="module-card-desc">Every module gets a stub implementation on Day 1 that returns valid-shaped data. This lets Person B wire the API and Person C write the grader before Person A finishes the simulator. Real logic replaces stubs phase by phase.</div> |
| </div> |
| </div> |
| <div class="module-card"> |
| <div class="module-card-header"> |
| <span class="module-card-name" style="color:var(--amber)">Data Schema First</span> |
| </div> |
| <div class="module-card-body"> |
| <div class="module-card-desc">All three people must agree on the exact shape of <code>ServeAction</code>, <code>ServeObservation</code>, and <code>MetricsSnapshot</code> on Day 1, before writing a single line of logic. Changing the schema mid-build is the #1 cause of integration hell.</div> |
| </div> |
| </div> |
| </div> |
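<div class="label">ILLUSTRATIVE SKETCH: STUB-FIRST IN PRACTICE</div>
A minimal sketch of what the stub-first principle looks like in code. The class name <code>TraceSimulatorStub</code> and every constant below are placeholders invented for illustration, not project code; only the pattern matters:

```python
from dataclasses import dataclass

# Placeholder subset of the agreed MetricsSnapshot schema (illustrative only).
@dataclass
class MetricsSnapshot:
    ttft_p50_ms: float
    tokens_per_sec: float
    slo_violations: int

class TraceSimulatorStub:
    """Day-1 stub: returns valid-shaped, hardcoded metrics so the API and
    grader can be wired up before the real simulator exists."""

    def simulate(self, action, workload) -> MetricsSnapshot:
        # Hardcoded values; the real simulator will derive these from
        # the trace lookup table.
        return MetricsSnapshot(ttft_p50_ms=120.0, tokens_per_sec=950.0,
                               slo_violations=0)

# Anyone on the team can already call it on Day 1:
snap = TraceSimulatorStub().simulate(action=None, workload=None)
```

Swapping the stub for the real simulator later changes no caller code, which is exactly what keeps every phase functional.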
|
|
| <div class="alert alert-amber"> |
<div class="alert-title">⚠ The Critical Path</div>
Person A's simulator core is the only hard dependency for everyone else. That is why Person A's Day 3 deliverable is a strict gate: no simulator means no environment, no API, and no demo. Everything else can be parallelised after Day 3. Protect this gate fiercely.
| </div> |
| </div> |
|
|
| |
| <div class="section" id="phase0"> |
| <div class="section-header"> |
| <div class="section-num">P0</div> |
| <div class="section-meta"> |
<div class="section-title">Phase 0 – Setup & Architecture Lock</div>
| <div class="section-sub">Day 1 (Mar 27). Goal: every team member has a running environment, a shared repo, agreed data schemas, and a working stub server that returns valid-shaped responses.</div> |
| </div> |
<span class="eyebrow-tag et-cyan">Day 1 · Mar 27</span>
| </div> |
|
|
| <div class="gate-box"> |
<div class="gate-icon">🔒</div>
| <div> |
| <div class="gate-label">Phase Gate β End of Day 1</div> |
| <div class="gate-text"><strong>You can run <code>curl http://localhost:7860/health</code> and get a 200 OK.</strong> All three people have cloned the repo, installed deps, and can run the stub server locally. The data schemas are written and committed to <code>models.py</code>. Nobody can start Day 2 until this is true.</div> |
| </div> |
| </div> |
|
|
| <div class="person-grid mb14"> |
| <div class="person-card"> |
| <div class="person-header" style="border-top: 3px solid var(--green);"> |
| <div class="person-name">Person A β Simulator Lead</div> |
| <div class="person-role">Owns: simulator/, env/ directories</div> |
| </div> |
| <div class="person-body"> |
| <ul class="person-tasks"> |
<li><strong>Read OpenEnv spec completely</strong> Clone openenv-course, run the echo example env, understand what /reset → /step → /grader looks like end to end.</li>
| <li><strong>Design TraceSimulator data schema</strong> Decide the exact column names for the lookup CSV. Write it down. Share with the team. This is a decision that cannot change later.</li> |
| <li><strong>Write skeleton classes</strong> Create <code>simulator/trace_sim.py</code> with class stubs: <code>TraceSimulator.__init__</code>, <code>simulate(action, workload)</code> returning a hardcoded <code>MetricsSnapshot</code>.</li> |
<li><strong>Write skeleton workload generator</strong> <code>simulator/workload.py</code> – stub that returns a fixed <code>WorkloadState</code> dict every time.</li>
| </ul> |
| </div> |
| </div> |
| <div class="person-card"> |
| <div class="person-header" style="border-top: 3px solid var(--blue);"> |
| <div class="person-name">Person B β API Lead</div> |
| <div class="person-role">Owns: server/ directory, Dockerfile</div> |
| </div> |
| <div class="person-body"> |
| <ul class="person-tasks"> |
| <li><strong>Set up FastAPI project</strong> Install FastAPI, uvicorn, pydantic. Create <code>server/app.py</code> with all 8 endpoint stubs that return hardcoded valid responses.</li> |
| <li><strong>Install openenv CLI</strong> Run <code>openenv init</code>, understand what <code>openenv validate</code> checks. Make sure the stub server passes basic validation.</li> |
| <li><strong>Create Dockerfile skeleton</strong> Multi-stage build that starts the uvicorn server. Confirm it builds locally and the /health endpoint responds from inside Docker.</li> |
| <li><strong>Set up GitHub repo</strong> Main branch protection, agree on feature branch naming (<code>feat/simulator</code>, <code>feat/api</code>, etc.), set up <code>.gitignore</code>.</li> |
| </ul> |
| </div> |
| </div> |
| <div class="person-card"> |
| <div class="person-header" style="border-top: 3px solid var(--amber);"> |
| <div class="person-name">Person C β Grader & Demo Lead</div> |
| <div class="person-role">Owns: grader/, agents/, notebooks/</div> |
| </div> |
| <div class="person-body"> |
| <ul class="person-tasks"> |
| <li><strong>Design grader rubric on paper</strong> For each of the 3 tasks: what is the score formula? What is the theoretical optimal? What is the expected baseline score? Write this as a one-page doc.</li> |
| <li><strong>Decide trace data strategy</strong> Evaluate Option A (published benchmarks), B (Colab T4), C (synthetic). Download whichever dataset you're going with. Confirm it has the needed columns.</li> |
| <li><strong>Define workload configs</strong> Write <code>simulator/data/workload_configs.json</code> with the exact parameters for Task 1, 2, and 3 (arrival rate, SLO, prompt distribution params).</li> |
| <li><strong>Agree on ENV_NAME</strong> Confirm the HuggingFace Spaces org, repo name, and environment name string. Register the HF account if needed.</li> |
| </ul> |
| </div> |
| </div> |
| </div> |
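<div class="label">ILLUSTRATIVE SKETCH: WHAT A RUBRIC SCORE FORMULA MIGHT LOOK LIKE</div>
To make Person C's "score formula" deliverable concrete, here is one hedged sketch. The weights, the normalisation constant, and the baseline numbers are invented for illustration and are not the project's actual rubric:

```python
def task_score(tokens_per_sec: float, slo_violation_rate: float,
               cost_per_1k: float) -> float:
    """Illustrative rubric: reward throughput, penalise SLO misses and cost.
    All weights here are placeholder values, not the real grading formula."""
    throughput_term = tokens_per_sec / 1000.0   # normalise to roughly 0-1
    slo_penalty = 5.0 * slo_violation_rate      # tail violations hurt most
    cost_penalty = 0.5 * cost_per_1k
    return max(0.0, throughput_term - slo_penalty - cost_penalty)

# Hypothetical baseline config, used to sanity-check the formula's range:
baseline = task_score(tokens_per_sec=800.0, slo_violation_rate=0.05,
                      cost_per_1k=0.4)
```

Writing the formula down this explicitly on Day 1 also yields the "theoretical optimal" for free: plug in the best reachable metrics and read off the score.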
|
|
| <div class="label">SHARED DELIVERABLE β models.py (everyone must agree before Day 2)</div> |
| <div class="code-block"> |
| <div class="code-block-header"> |
| <span class="code-lang">python</span> |
<span class="code-file">inferencegym/models.py – Data schema, locked on Day 1</span>
| </div> |
| <div class="code-body"><pre><span class="kw">from</span> dataclasses <span class="kw">import</span> dataclass, field |
| <span class="kw">from</span> typing <span class="kw">import</span> <span class="tp">Optional, List, Dict, Any</span> |
| <span class="kw">from</span> enum <span class="kw">import</span> Enum |
|
|
<span class="cm"># ── Action space ─────────────────────────────────────────────────────────────</span>
| <span class="kw">class</span> <span class="tp">QuantTier</span>(Enum): |
| FP16 = <span class="nm">0</span> |
| INT8 = <span class="nm">1</span> |
| INT4 = <span class="nm">2</span> |
|
|
| @dataclass |
| <span class="kw">class</span> <span class="tp">ServeAction</span>: |
    kv_budget: <span class="tp">float</span>       <span class="cm"># 0.1 – 1.0 : fraction of KV cache allocated</span>
| spec_length: <span class="tp">int</span> <span class="cm"># 0,1,2,4,8 : speculative draft tokens</span> |
    batch_size: <span class="tp">int</span>        <span class="cm"># 1–512 : max concurrent requests</span>
| prefill_disagg: <span class="tp">bool</span> <span class="cm"># True/False : disaggregate prefill GPU</span> |
| quant_tier: <span class="tp">QuantTier</span> <span class="cm"># FP16/INT8/INT4</span> |
| |
| <span class="kw">def</span> <span class="fn">validate</span>(self) -> <span class="tp">bool</span>: |
| <span class="kw">assert</span> <span class="nm">0.1</span> <= self.kv_budget <= <span class="nm">1.0</span> |
| <span class="kw">assert</span> self.spec_length <span class="kw">in</span> {<span class="nm">0</span>,<span class="nm">1</span>,<span class="nm">2</span>,<span class="nm">4</span>,<span class="nm">8</span>} |
| <span class="kw">assert</span> <span class="nm">1</span> <= self.batch_size <= <span class="nm">512</span> |
| <span class="kw">return</span> <span class="kw">True</span> |
|
|
<span class="cm"># ── Simulator output ─────────────────────────────────────────────────────────</span>
| @dataclass |
| <span class="kw">class</span> <span class="tp">MetricsSnapshot</span>: |
| ttft_p50_ms: <span class="tp">float</span> <span class="cm"># median time to first token</span> |
| ttft_p99_ms: <span class="tp">float</span> <span class="cm"># tail latency</span> |
| tpot_ms: <span class="tp">float</span> <span class="cm"># time per output token</span> |
| tokens_per_sec: <span class="tp">float</span> <span class="cm"># throughput</span> |
| gpu_memory_gb: <span class="tp">float</span> <span class="cm"># simulated memory pressure</span> |
| cost_per_1k: <span class="tp">float</span> <span class="cm"># compute cost (normalised units)</span> |
| spec_accept_rate: <span class="tp">float</span> <span class="cm"># 0.0 if spec_length == 0</span> |
| eviction_events: <span class="tp">int</span> <span class="cm"># KV cache evictions this step</span> |
| slo_violations: <span class="tp">int</span> <span class="cm"># requests that exceeded SLO this step</span> |
|
|
<span class="cm"># ── Observation (what agent sees) ────────────────────────────────────────────</span>
| @dataclass |
| <span class="kw">class</span> <span class="tp">ServeObservation</span>: |
| queue_depth: <span class="tp">float</span> |
| mean_prompt_len: <span class="tp">float</span> |
| arrival_rate: <span class="tp">float</span> |
| kv_cache_occupancy: <span class="tp">float</span> |
| ttft_p50: <span class="tp">float</span> |
| tpot_p50: <span class="tp">float</span> |
| slo_violation_rate: <span class="tp">float</span> |
| gpu_memory_used_gb: <span class="tp">float</span> |
| spec_accept_rate: <span class="tp">float</span> |
| priority_distribution: <span class="tp">List[float]</span> <span class="cm"># [interactive, batch, best_effort]</span> |
| timestep: <span class="tp">int</span> |
| cost_so_far: <span class="tp">float</span> |
|
|
| <span class="cm"># ── Workload state ────────────────────────────────────────────────────────────</span> |
| @dataclass |
| <span class="kw">class</span> <span class="tp">WorkloadState</span>: |
| arrival_rate: <span class="tp">float</span> |
| mean_prompt_len: <span class="tp">float</span> |
| prompt_len_bucket: <span class="tp">int</span> <span class="cm"># 0–7, discrete bucket for lookup table</span> |
| queue_depth: <span class="tp">int</span> |
| priority_distribution: <span class="tp">List[float]</span> |
| is_burst: <span class="tp">bool</span> |
| phase: <span class="tp">str</span> <span class="cm"># "warmup" | "steady" | "burst" | "cooldown"</span></pre></div> |
| </div> |
|
|
| <div class="label">PHASE 0 COMPLETION PROOF</div> |
| <div class="code-block"> |
| <div class="code-block-header"> |
| <span class="code-lang">bash</span> |
| <span class="code-file">These commands must all pass before Day 2 starts</span> |
| </div> |
| <div class="code-body"><pre><span class="cm"># From repo root:</span> |
| docker build -t inferencegym . && docker run -p 7860:7860 inferencegym & |
| curl http://localhost:7860/health <span class="cm"># → {"status": "ok"}</span> |
| curl http://localhost:7860/tasks <span class="cm"># → {"tasks": [{...}, {...}, {...}]}</span> |
| python -c <span class="st">"from inferencegym.models import ServeAction, ServeObservation; print('schemas OK')"</span></pre></div> |
| </div> |
| </div> |
|
|
| |
| <div class="section" id="phase1"> |
| <div class="section-header"> |
| <div class="section-num">P1</div> |
| <div class="section-meta"> |
| <div class="section-title">Phase 1 — Simulator Core</div> |
| <div class="section-sub">Days 2–3 (Mar 28–29). Goal: a fully working TraceSimulator that takes a real ServeAction and returns a realistic MetricsSnapshot. This is the hardest and most critical module in the entire project.</div> |
| </div> |
| <span class="eyebrow-tag et-green">Days 2–3</span> |
| </div> |
|
|
| <div class="alert alert-green"> |
| <div class="alert-title">✅ Why This Phase Unlocks Everything</div> |
| Once <code>TraceSimulator.simulate(action, workload) → MetricsSnapshot</code> works, Person B can wire it into the API and Person C can build the grader. Both of those can proceed in parallel. Person A must finish this by end of Day 3 even if it means simplifying the interpolation. |
| </div> |
|
|
| <div class="gate-box"> |
| <div class="gate-icon">🔒</div> |
| <div> |
| <div class="gate-label">Phase Gate — End of Day 3</div> |
| <div class="gate-text"><strong>Running <code>python tests/test_simulator.py</code> passes all tests.</strong> The simulator returns realistic-shaped numbers for a variety of (action, workload) inputs. The workload generator produces a different workload state on every call. These are the two things that need to be true before Phase 2 begins.</div> |
| </div> |
| </div> |
|
|
| <div class="grid2 mb14"> |
| <div> |
| <div class="label">DAY 2 TASKS (Person A, primary)</div> |
| <div class="deliverable-box"> |
| <div class="deliverable-title">TraceSimulator β Core Implementation</div> |
| <ul class="deliverable-list"> |
| <li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Load lookup table from CSV/Parquet</strong> Read the trace data file into a dict keyed by <code>(batch_bucket, kv_bucket, spec_bucket, prompt_bucket)</code>. Each value is a <code>MetricsSnapshot</code>. The lookup table must be loaded once at startup and cached in memory.</div></li> |
| <li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Implement bilinear interpolation</strong> Use <code>scipy.interpolate.RegularGridInterpolator</code> for continuous actions (kv_budget, batch_size) between discrete lookup points. For discrete actions (spec_length, quant_tier), use nearest-neighbor lookup.</div></li> |
| <li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Add Gaussian noise model</strong> Inject ±5% Gaussian noise on <code>ttft_p50_ms</code> and <code>tpot_ms</code> to simulate hardware jitter. Use <code>np.random.default_rng(seed)</code> so episodes are reproducible.</div></li> |
| <li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Memory overflow detection</strong> If interpolated <code>gpu_memory_gb > 40.0</code>, set a hard OOM flag, cap memory at 40GB, and multiply <code>slo_violations</code> by 5 as a penalty signal.</div></li> |
| </ul> |
| </div> |
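| The interpolation deliverable above leans on <code>scipy.interpolate.RegularGridInterpolator</code>; here is a minimal 2-D sketch of the pattern — the grid values are fabricated for illustration, not trace data: |

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Toy lookup grid: TTFT (ms) at a few (batch_size, kv_budget) points.
batch_points = np.array([1, 4, 8, 16])
kv_points = np.array([0.25, 0.5, 1.0])

# Fabricated values: TTFT grows with batch size, shrinks with KV budget.
ttft_grid = np.array([
    [120.0, 100.0,  90.0],   # batch 1
    [160.0, 130.0, 110.0],   # batch 4
    [220.0, 170.0, 140.0],   # batch 8
    [340.0, 250.0, 200.0],   # batch 16
])  # shape (len(batch_points), len(kv_points))

interp = RegularGridInterpolator(
    (batch_points, kv_points), ttft_grid,
    method="linear", bounds_error=False, fill_value=None)  # None => extrapolate at edges

# Continuous action between grid points -> linearly interpolated metric.
ttft = float(interp([[6, 0.5]])[0])  # batch 6 sits halfway between 4 and 8
print(ttft)  # -> 150.0
```

| The real simulator does the same over a 4-D grid; the call shape — a list of query points, one interpolated value per point — is identical. |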
| <div class="deliverable-box"> |
| <div class="deliverable-title">WorkloadGenerator β Day 2</div> |
| <ul class="deliverable-list"> |
| <li><span class="dl-bullet dl-blue">A</span><div class="dl-text"><strong>Poisson arrival generator</strong> <code>np.random.poisson(lam=arrival_rate)</code> per step. Arrival rate varies by task config loaded from <code>workload_configs.json</code>.</div></li> |
| <li><span class="dl-bullet dl-blue">A</span><div class="dl-text"><strong>Prompt length sampling</strong> Task 1: <code>np.random.uniform(64, 128)</code>. Task 2: <code>np.random.lognormal(5.2, 1.3)</code> clamped to [32, 8192]. Task 3: bimodal β 70% uniform(32, 128), 30% uniform(4096, 8192).</div></li> |
| <li><span class="dl-bullet dl-blue">A</span><div class="dl-text"><strong>Discrete prompt bucket mapping</strong> Map continuous prompt_len to an integer bucket 0–7 using <code>np.digitize</code> against <code>[64, 128, 256, 512, 1024, 2048, 4096]</code>. This is the lookup table key.</div></li> |
| </ul> |
| </div> |
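| The sampling and bucket mapping above fit in a few lines; a sketch with suggested (not final) function names — the distributions and bucket edges are exactly the ones in the spec: |

```python
import numpy as np

BUCKET_EDGES = [64, 128, 256, 512, 1024, 2048, 4096]  # 7 edges -> buckets 0..7

def sample_prompt_len(task_id: int, rng: np.random.Generator) -> float:
    """Per-task prompt length distributions from the Day 2 spec."""
    if task_id == 1:
        return float(rng.uniform(64, 128))
    if task_id == 2:
        return float(np.clip(rng.lognormal(5.2, 1.3), 32, 8192))
    # Task 3: bimodal -- 70% short interactive, 30% long-context prompts.
    if rng.random() < 0.7:
        return float(rng.uniform(32, 128))
    return float(rng.uniform(4096, 8192))

def prompt_bucket(prompt_len: float) -> int:
    """Map a continuous length to the discrete lookup-table bucket 0-7."""
    return int(np.digitize(prompt_len, BUCKET_EDGES))

rng = np.random.default_rng(0)
print(prompt_bucket(50), prompt_bucket(64), prompt_bucket(8000))  # -> 0 1 7
```

| Note the edge semantics: <code>np.digitize</code> with the default <code>right=False</code> puts a length equal to an edge into the bucket above it, so 64 lands in bucket 1, not 0. |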
| </div> |
| <div> |
| <div class="label">DAY 3 TASKS (Person A, primary)</div> |
| <div class="deliverable-box"> |
| <div class="deliverable-title">WorkloadGenerator β Day 3 Completion</div> |
| <ul class="deliverable-list"> |
| <li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Queue depth simulation</strong> Maintain a running <code>queue_depth</code> counter. Each step: add new arrivals, subtract <code>min(batch_size, queue_depth)</code> served requests. Queue cannot go negative.</div></li> |
| <li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Burst injection for Task 3</strong> Every 120 timesteps, multiply arrival_rate by 10 for 15 consecutive steps. Set <code>is_burst=True</code> in <code>WorkloadState</code> during these steps.</div></li> |
| <li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Priority distribution tracking</strong> Task 3: maintain a rolling 50-step window of request classes [INTERACTIVE, BATCH, BEST_EFFORT] as fractions. Pass this to <code>WorkloadState.priority_distribution</code>.</div></li> |
| <li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Speculative acceptance model</strong> Implement the acceptance rate formula: <code>accept_rate = base_rate * (1 - complexity_penalty) * depth_decay</code> where <code>depth_decay = 1.0 / (1 + 0.15 * spec_length)</code>. Base rate by task: Task1=0.80, Task2=0.65, Task3=0.45.</div></li> |
| </ul> |
| </div> |
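| The queue and burst mechanics reduce to one step function; a sketch under one stated assumption — the burst window is taken as steps 0–14 of each 120-step period, since the spec leaves the phase unspecified: |

```python
import numpy as np

BURST_PERIOD, BURST_LEN, BURST_MULT = 120, 15, 10  # Task 3 burst schedule

class QueueSketch:
    """Minimal queue-depth + burst model mirroring the Day 3 spec."""
    def __init__(self, base_rate: float, seed: int = 0):
        self.base_rate = base_rate
        self.queue_depth = 0
        self.t = 0
        self.rng = np.random.default_rng(seed)

    def step(self, batch_size: int) -> tuple[int, bool]:
        # Assumed phase: burst during steps 0-14 of every 120-step period.
        is_burst = (self.t % BURST_PERIOD) < BURST_LEN
        rate = self.base_rate * (BURST_MULT if is_burst else 1)
        self.queue_depth += int(self.rng.poisson(rate))        # new arrivals
        self.queue_depth -= min(batch_size, self.queue_depth)  # served; never negative
        self.t += 1
        return self.queue_depth, is_burst
```

| Determinism comes for free from the seeded generator: two instances constructed with the same seed replay the same arrival sequence. |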
| <div class="deliverable-box"> |
| <div class="deliverable-title">Unit Tests β must pass by Day 3 EOD</div> |
| <ul class="deliverable-list"> |
| <li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>Smoke test</strong> Call <code>simulate(action, workload)</code> with 20 random valid actions → all return a non-null <code>MetricsSnapshot</code> with values in expected ranges.</div></li> |
| <li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>Monotonicity test</strong> Increasing <code>batch_size</code> while holding other actions constant should strictly increase <code>tokens_per_sec</code> (up to a threshold). This validates the lookup table is correctly loaded.</div></li> |
| <li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>Determinism test</strong> Two calls with the same seed and same action must produce the same noise-injected output. Tests reproducibility.</div></li> |
| <li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>OOM detection test</strong> Pass an action with <code>batch_size=512, kv_budget=1.0</code> → confirm <code>gpu_memory_gb</code> triggers the overflow flag.</div></li> |
| </ul> |
| </div> |
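| The determinism and monotonicity tests can be drafted now against a stand-in; <code>fake_simulate</code> below is hypothetical and gets swapped for the real <code>TraceSimulator.simulate</code> once Phase 1 lands: |

```python
import numpy as np

def fake_simulate(batch_size: int, seed: int) -> dict:
    """Hypothetical stand-in: throughput monotone in batch size, seeded noise."""
    rng = np.random.default_rng(seed)
    tps = 90.0 * batch_size ** 0.9                     # grows with batch size
    ttft = (50.0 + 2.0 * batch_size) * rng.normal(1.0, 0.05)
    return {"tokens_per_sec": tps, "ttft_p50_ms": ttft}

def test_determinism():
    # Same seed + same action -> bit-identical noisy output.
    assert fake_simulate(32, seed=7) == fake_simulate(32, seed=7)

def test_monotonic_throughput():
    tps = [fake_simulate(b, seed=0)["tokens_per_sec"] for b in (1, 4, 8, 16, 32)]
    assert tps == sorted(tps) and len(set(tps)) == len(tps)  # strictly increasing

test_determinism()
test_monotonic_throughput()
print("simulator property tests pass")
```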
| </div> |
| </div> |
|
|
| <div class="label">SIMULATOR CORE IMPLEMENTATION</div> |
| <div class="code-block"> |
| <div class="code-block-header"> |
| <span class="code-lang">python</span> |
| <span class="code-file">simulator/trace_sim.py</span> |
| </div> |
| <div class="code-body"><pre><span class="kw">import</span> numpy <span class="kw">as</span> np |
| <span class="kw">import</span> pandas <span class="kw">as</span> pd |
| <span class="kw">from</span> scipy.interpolate <span class="kw">import</span> RegularGridInterpolator |
| <span class="kw">from</span> pathlib <span class="kw">import</span> Path |
| <span class="kw">from</span> inferencegym.models <span class="kw">import</span> ServeAction, WorkloadState, MetricsSnapshot, QuantTier |
|
|
| <span class="kw">class</span> <span class="tp">TraceSimulator</span>: |
| <span class="st">""" |
| CPU-only trace-driven simulator. |
| Loads a pre-built lookup table and interpolates (action, workload) → MetricsSnapshot. |
| """</span> |
| |
| BATCH_POINTS = [<span class="nm">1</span>, <span class="nm">4</span>, <span class="nm">8</span>, <span class="nm">16</span>, <span class="nm">32</span>, <span class="nm">64</span>, <span class="nm">128</span>, <span class="nm">256</span>, <span class="nm">512</span>] |
| KV_POINTS = [<span class="nm">0.1</span>, <span class="nm">0.25</span>, <span class="nm">0.5</span>, <span class="nm">0.75</span>, <span class="nm">1.0</span>] |
| PLEN_BUCKETS = [<span class="nm">64</span>, <span class="nm">128</span>, <span class="nm">256</span>, <span class="nm">512</span>, <span class="nm">1024</span>, <span class="nm">2048</span>, <span class="nm">4096</span>, <span class="nm">8192</span>] |
| OOM_THRESHOLD = <span class="nm">40.0</span> <span class="cm"># GB</span> |
| NOISE_STD = <span class="nm">0.05</span> <span class="cm"># ±5% Gaussian jitter on latency metrics</span> |
|
|
| <span class="kw">def</span> <span class="fn">__init__</span>(self, trace_path: <span class="tp">str</span>, seed: <span class="tp">int</span> = <span class="nm">42</span>): |
| self.rng = np.random.default_rng(seed) |
| self._load_tables(Path(trace_path)) |
| self._build_interpolators() |
|
|
| <span class="kw">def</span> <span class="fn">_load_tables</span>(self, path: <span class="tp">Path</span>) -> <span class="tp">None</span>: |
| df = pd.read_parquet(path) |
| <span class="cm"># Expected columns: batch_size, kv_budget, spec_length, quant_tier,</span> |
| <span class="cm"># prompt_len_bucket, ttft_p50, ttft_p99, tpot, tps, gpu_mem_gb, cost_per_1k</span> |
| self._df = df |
|
|
| <span class="kw">def</span> <span class="fn">_build_interpolators</span>(self) -> <span class="tp">None</span>: |
| <span class="cm"># Build 4-D interpolator over (batch_size, kv_budget, spec_len, prompt_bucket)</span> |
| <span class="cm"># for FP16 baseline. INT8/INT4 handled via multiplicative correction factors.</span> |
| fp16_df = self._df[self._df[<span class="st">'quant_tier'</span>] == <span class="nm">0</span>] |
| grid_vals = { |
| <span class="st">'ttft_p50'</span>: self._reshape_for_interp(fp16_df, <span class="st">'ttft_p50'</span>), |
| <span class="st">'ttft_p99'</span>: self._reshape_for_interp(fp16_df, <span class="st">'ttft_p99'</span>), |
| <span class="st">'tpot'</span>: self._reshape_for_interp(fp16_df, <span class="st">'tpot'</span>), |
| <span class="st">'tps'</span>: self._reshape_for_interp(fp16_df, <span class="st">'tps'</span>), |
| <span class="st">'gpu_mem'</span>: self._reshape_for_interp(fp16_df, <span class="st">'gpu_mem_gb'</span>), |
| } |
| points = (self.BATCH_POINTS, self.KV_POINTS, [<span class="nm">0</span>,<span class="nm">1</span>,<span class="nm">2</span>,<span class="nm">4</span>,<span class="nm">8</span>], self.PLEN_BUCKETS) |
| self._interps = {k: RegularGridInterpolator(points, v, method=<span class="st">'linear'</span>, bounds_error=<span class="kw">False</span>) |
| <span class="kw">for</span> k, v <span class="kw">in</span> grid_vals.items()} |
|
|
| <span class="kw">def</span> <span class="fn">simulate</span>(self, action: <span class="tp">ServeAction</span>, workload: <span class="tp">WorkloadState</span>) -> <span class="tp">MetricsSnapshot</span>: |
| action.validate() |
| query = [[action.batch_size, action.kv_budget, |
| action.spec_length, workload.mean_prompt_len]] |
| |
| <span class="cm"># Interpolate base metrics</span> |
| base = {k: float(fn(query)[<span class="nm">0</span>]) <span class="kw">for</span> k, fn <span class="kw">in</span> self._interps.items()} |
| |
| <span class="cm"># Apply quant tier correction factors (from benchmark data)</span> |
| quant_factors = {QuantTier.FP16: <span class="nm">1.0</span>, QuantTier.INT8: <span class="nm">0.82</span>, QuantTier.INT4: <span class="nm">0.68</span>} |
| q_factor = quant_factors[action.quant_tier] |
| base[<span class="st">'ttft_p50'</span>] *= q_factor |
| base[<span class="st">'tps'</span>] /= q_factor <span class="cm"># quantised models serve faster</span> |
| base[<span class="st">'gpu_mem'</span>] *= q_factor <span class="cm"># quantised models use less memory</span> |
| |
| <span class="cm"># Apply speculative decoding acceptance bonus</span> |
| <span class="kw">if</span> action.spec_length > <span class="nm">0</span>: |
| depth_decay = <span class="nm">1.0</span> / (<span class="nm">1</span> + <span class="nm">0.15</span> * action.spec_length) |
| accept_rate = <span class="nm">0.75</span> * (<span class="nm">1</span> - <span class="nm">0.1</span> * workload.prompt_len_bucket) * depth_decay |
| accept_rate = max(<span class="nm">0.0</span>, min(<span class="nm">1.0</span>, accept_rate)) |
| speedup = <span class="nm">1.0</span> + accept_rate * action.spec_length * <span class="nm">0.1</span> |
| base[<span class="st">'ttft_p50'</span>] /= speedup |
| <span class="kw">else</span>: |
| accept_rate = <span class="nm">0.0</span> |
| |
| <span class="cm"># Inject Gaussian noise</span> |
| noise = self.rng.normal(<span class="nm">1.0</span>, self.NOISE_STD, size=<span class="nm">3</span>) |
| base[<span class="st">'ttft_p50'</span>] *= noise[<span class="nm">0</span>] |
| base[<span class="st">'ttft_p99'</span>] *= noise[<span class="nm">1</span>] |
| base[<span class="st">'tpot'</span>] *= noise[<span class="nm">2</span>] |
| |
| <span class="cm"># OOM detection</span> |
| oom = base[<span class="st">'gpu_mem'</span>] > self.OOM_THRESHOLD |
| slo_violations = <span class="nm">0</span> <span class="cm"># computed by env, not simulator</span> |
| <span class="kw">if</span> oom: |
| base[<span class="st">'gpu_mem'</span>] = self.OOM_THRESHOLD |
| slo_violations = action.batch_size <span class="cm"># all requests fail on OOM</span> |
| |
| <span class="kw">return</span> MetricsSnapshot( |
| ttft_p50_ms = max(<span class="nm">1.0</span>, base[<span class="st">'ttft_p50'</span>]), |
| ttft_p99_ms = max(<span class="nm">1.0</span>, base[<span class="st">'ttft_p99'</span>]), |
| tpot_ms = max(<span class="nm">1.0</span>, base[<span class="st">'tpot'</span>]), |
| tokens_per_sec = max(<span class="nm">0.0</span>, base[<span class="st">'tps'</span>]), |
| gpu_memory_gb = base[<span class="st">'gpu_mem'</span>], |
| cost_per_1k = base[<span class="st">'tps'</span>] * q_factor * <span class="nm">0.001</span>, |
| spec_accept_rate = accept_rate, |
| eviction_events = int(max(<span class="nm">0</span>, (<span class="nm">1.0</span> - action.kv_budget) * workload.queue_depth)), |
| slo_violations = slo_violations, |
| )</pre></div> |
| </div> |
|
|
| <div class="label">TRACE DATA β How to Build It Without a GPU</div> |
| <div class="grid3 mb14"> |
| <div class="module-card"> |
| <div class="module-card-header"> |
| <span class="module-card-name" style="color:var(--green)">Option A (Recommended)</span> |
| <span class="eyebrow-tag et-green">0 GPU hrs</span> |
| </div> |
| <div class="module-card-body"> |
| <div class="module-card-desc">Download published vLLM benchmark CSVs from <code>github.com/vllm-project/vllm/tree/main/benchmarks</code> and the HuggingFace llm-perf-leaderboard. These have real measured latencies across batch sizes. Build a pandas pivot table to reshape them into the lookup grid.</div> |
| <ul class="spec-list"> |
| <li>Already covers Llama-3-8B on A100 — your exact target model</li> |
| <li>Includes TTFT, TPOT, throughput, memory across batch sizes</li> |
| <li>Needs ~2 hours of data wrangling to reshape into your schema</li> |
| </ul> |
| </div> |
| </div> |
| <div class="module-card"> |
| <div class="module-card-header"> |
| <span class="module-card-name" style="color:var(--blue)">Option B (Good)</span> |
| <span class="eyebrow-tag et-blue">2-4 GPU hrs</span> |
| </div> |
| <div class="module-card-body"> |
| <div class="module-card-desc">Run <code>llmperf</code> on a Colab free T4 with Llama-3.2-1B-Instruct (free tier works). Grid search over batch_size=[1,4,8,16,32] × prompt_len=[64,128,256,512] — that's 20 measurements. 2 hours of Colab time.</div> |
| <ul class="spec-list"> |
| <li>Your own measurements — stronger story for judges</li> |
| <li>Can extrapolate to larger batch sizes analytically</li> |
| <li>Risk: Colab disconnects. Use checkpointing.</li> |
| </ul> |
| </div> |
| </div> |
| <div class="module-card"> |
| <div class="module-card-header"> |
| <span class="module-card-name" style="color:var(--amber)">Option C (Fallback)</span> |
| <span class="eyebrow-tag et-amber">30 min, CPU</span> |
| </div> |
| <div class="module-card-body"> |
| <div class="module-card-desc">Generate synthetic data from a roofline model. <code>ttft = base_ms + batch_factor * batch_size + memory_factor * prompt_len</code>. These constants are documented in vLLM's OSDI paper. Fully deterministic, always works.</div> |
| <ul class="spec-list"> |
| <li>Implement this FIRST as a fallback even if you use A or B</li> |
| <li>Guarantees you always have valid data no matter what</li> |
| <li>Good enough for an RL agent to learn relative improvements</li> |
| </ul> |
| </div> |
| </div> |
| </div> |
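| Option C is small enough to sketch end to end. The constants below are placeholders to be replaced with figures from the vLLM paper; only the schema matches the columns the simulator loads: |

```python
import itertools
import pandas as pd

# Placeholder roofline constants -- substitute measured figures before use.
BASE_TTFT_MS, BATCH_FACTOR, MEM_FACTOR = 20.0, 1.5, 0.04

def synthetic_trace() -> pd.DataFrame:
    """Generate the full (batch, kv, spec, prompt_bucket) grid analytically."""
    rows = []
    for batch, kv, spec, plen in itertools.product(
            [1, 4, 8, 16, 32, 64, 128, 256, 512],
            [0.1, 0.25, 0.5, 0.75, 1.0],
            [0, 1, 2, 4, 8],
            [64, 128, 256, 512, 1024, 2048, 4096, 8192]):
        ttft = BASE_TTFT_MS + BATCH_FACTOR * batch + MEM_FACTOR * plen
        ttft /= kv ** 0.2          # stylised: small KV budget -> recompute cost
        tpot = 8.0 + 0.05 * batch
        rows.append({
            "batch_size": batch, "kv_budget": kv, "spec_length": spec,
            "quant_tier": 0, "prompt_len_bucket": plen,
            "ttft_p50": ttft, "ttft_p99": 1.8 * ttft, "tpot": tpot,
            "tps": batch * 1000.0 / tpot,
            "gpu_mem_gb": 14.0 + 0.002 * batch * plen * kv,
            "cost_per_1k": 0.001,
        })
    return pd.DataFrame(rows)

df = synthetic_trace()          # 9*5*5*8 = 1800 rows
# df.to_parquet("traces/synthetic.parquet")  # write in the schema the simulator loads
```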
| </div> |
|
|
| |
| <div class="section" id="phase2"> |
| <div class="section-header"> |
| <div class="section-num">P2</div> |
| <div class="section-meta"> |
| <div class="section-title">Phase 2 — Environment Logic</div> |
| <div class="section-sub">Day 4 (Mar 30). Goal: a complete InferenceEnv class with working reset(), step(), and state(). An agent can interact with it in a loop and receive valid rewards.</div> |
| </div> |
| <span class="eyebrow-tag et-blue">Day 4 · Mar 30</span> |
| </div> |
|
|
| <div class="gate-box"> |
| <div class="gate-icon">🎯</div> |
| <div> |
| <div class="gate-label">Phase Gate — End of Day 4</div> |
| <div class="gate-text"><strong>The following Python loop runs without error and completes all 200 steps:</strong> <code>env = InferenceEnv(sim, task_id=1); obs = env.reset(); [env.step(random_action()) for _ in range(200)]</code>. Rewards are floats in [-1, 1]. The episode terminates at step 200. Session IDs are unique per reset call.</div> |
| </div> |
| </div> |
|
|
| <div class="label">ENVIRONMENT CLASS β Full Implementation</div> |
| <div class="code-block"> |
| <div class="code-block-header"> |
| <span class="code-lang">python</span> |
| <span class="code-file">env/inference_env.py β Core environment (Person A, Day 4)</span> |
| </div> |
| <div class="code-body"><pre><span class="kw">import</span> uuid, json, threading |
| <span class="kw">import</span> numpy <span class="kw">as</span> np |
| <span class="kw">from</span> dataclasses <span class="kw">import</span> dataclass |
| <span class="kw">from</span> inferencegym.models <span class="kw">import</span> ServeAction, ServeObservation, WorkloadState, MetricsSnapshot, QuantTier |
| <span class="kw">from</span> simulator.trace_sim <span class="kw">import</span> TraceSimulator |
| <span class="kw">from</span> simulator.workload <span class="kw">import</span> WorkloadGenerator |
|
|
| @dataclass |
| <span class="kw">class</span> <span class="tp">EnvConfig</span>: |
| task_id: <span class="tp">int</span> |
| episode_len: <span class="tp">int</span> = <span class="nm">200</span> |
| slo_target_ms: <span class="tp">float</span> = <span class="nm">300.0</span> |
| max_memory_gb: <span class="tp">float</span> = <span class="nm">40.0</span> |
| <span class="cm"># Reward weights</span> |
| alpha: <span class="tp">float</span> = <span class="nm">0.40</span> <span class="cm"># throughput</span> |
| beta: <span class="tp">float</span> = <span class="nm">0.25</span> <span class="cm"># latency</span> |
| gamma: <span class="tp">float</span> = <span class="nm">0.25</span> <span class="cm"># SLO violations</span> |
| delta: <span class="tp">float</span> = <span class="nm">0.10</span> <span class="cm"># cost</span> |
|
|
| <span class="cm"># Task configs β loaded from workload_configs.json</span> |
| TASK_CONFIGS = { |
| <span class="nm">1</span>: EnvConfig(task_id=<span class="nm">1</span>, slo_target_ms=<span class="nm">500.0</span>), |
| <span class="nm">2</span>: EnvConfig(task_id=<span class="nm">2</span>, slo_target_ms=<span class="nm">300.0</span>, gamma=<span class="nm">0.30</span>), |
| <span class="nm">3</span>: EnvConfig(task_id=<span class="nm">3</span>, slo_target_ms=<span class="nm">200.0</span>, gamma=<span class="nm">0.35</span>, delta=<span class="nm">0.15</span>), |
| } |
| <span class="cm"># Max achievable throughput per task (set after running optimal solver)</span> |
| MAX_THROUGHPUT = {<span class="nm">1</span>: <span class="nm">8500.0</span>, <span class="nm">2</span>: <span class="nm">6200.0</span>, <span class="nm">3</span>: <span class="nm">4800.0</span>} |
|
|
| <span class="kw">class</span> <span class="tp">InferenceEnv</span>: |
| <span class="kw">def</span> <span class="fn">__init__</span>(self, simulator: <span class="tp">TraceSimulator</span>, task_id: <span class="tp">int</span>, seed: <span class="tp">int</span> = <span class="nm">42</span>): |
| self.sim = simulator |
| self.config = TASK_CONFIGS[task_id] |
| self.gen = WorkloadGenerator(task_id=task_id, seed=seed) |
| self.session_id = str(uuid.uuid4()) |
| self._step = <span class="nm">0</span> |
| self._cost_so_far = <span class="nm">0.0</span> |
| self._workload = self.gen.reset() |
| self._last_metrics: MetricsSnapshot = <span class="kw">None</span> |
| self._episode_log: <span class="tp">list</span> = [] |
|
|
| <span class="kw">def</span> <span class="fn">reset</span>(self) -> <span class="tp">ServeObservation</span>: |
| self.session_id = str(uuid.uuid4()) |
| self._step = <span class="nm">0</span> |
| self._cost_so_far = <span class="nm">0.0</span> |
| self._workload = self.gen.reset() |
| self._episode_log = [] |
| <span class="kw">return</span> self._build_obs(MetricsSnapshot( |
| ttft_p50_ms=<span class="nm">200.0</span>, ttft_p99_ms=<span class="nm">350.0</span>, tpot_ms=<span class="nm">20.0</span>, |
| tokens_per_sec=<span class="nm">2000.0</span>, gpu_memory_gb=<span class="nm">24.0</span>, cost_per_1k=<span class="nm">0.001</span>, |
| spec_accept_rate=<span class="nm">0.0</span>, eviction_events=<span class="nm">0</span>, slo_violations=<span class="nm">0</span>)) |
|
|
| <span class="kw">def</span> <span class="fn">step</span>(self, action: <span class="tp">ServeAction</span>): |
| <span class="kw">if</span> self._step >= self.config.episode_len: |
| <span class="kw">raise</span> RuntimeError(<span class="st">"Episode already done. Call reset() first."</span>) |
| |
| <span class="cm"># Task 1 & 2: lock certain actions</span> |
| action = self._enforce_action_mask(action) |
| |
| <span class="cm"># Advance workload one step</span> |
| self._workload = self.gen.step(action) |
| |
| <span class="cm"># Simulate this step</span> |
| metrics = self.sim.simulate(action, self._workload) |
| self._last_metrics = metrics |
| |
| <span class="cm"># Compute SLO violations from simulator metrics + SLO target</span> |
| metrics.slo_violations += int( |
| metrics.ttft_p50_ms > self.config.slo_target_ms) * self._workload.queue_depth |
| |
| <span class="cm"># Compute reward</span> |
| reward = self._compute_reward(metrics) |
| |
| <span class="cm"># Update episode state</span> |
| self._cost_so_far += metrics.cost_per_1k |
| self._step += <span class="nm">1</span> |
| done = self._step >= self.config.episode_len |
| |
| obs = self._build_obs(metrics) |
| info = {<span class="st">"timestep"</span>: self._step, <span class="st">"metrics"</span>: metrics.__dict__, |
| <span class="st">"workload"</span>: self._workload.__dict__} |
| self._episode_log.append({<span class="st">"action"</span>: action.__dict__, <span class="st">"reward"</span>: reward, <span class="st">"metrics"</span>: metrics.__dict__}) |
| <span class="kw">return</span> obs, reward, done, info |
|
|
| <span class="kw">def</span> <span class="fn">_compute_reward</span>(self, m: <span class="tp">MetricsSnapshot</span>) -> <span class="tp">float</span>: |
| c = self.config |
| T = m.tokens_per_sec / MAX_THROUGHPUT[c.task_id] |
| L = m.ttft_p50_ms / c.slo_target_ms |
| V = m.slo_violations / max(self._workload.queue_depth, <span class="nm">1</span>) |
| C = m.cost_per_1k / <span class="nm">0.005</span> <span class="cm"># normalise against budget ceiling</span> |
| reward = c.alpha * T - c.beta * L - c.gamma * V - c.delta * C |
| <span class="kw">return</span> float(np.clip(reward, -<span class="nm">1.0</span>, <span class="nm">1.0</span>)) |
|
|
| <span class="kw">def</span> <span class="fn">_enforce_action_mask</span>(self, action: <span class="tp">ServeAction</span>) -> <span class="tp">ServeAction</span>: |
| <span class="kw">if</span> self.config.task_id == <span class="nm">1</span>: |
| action.spec_length = <span class="nm">0</span>; action.prefill_disagg = <span class="kw">False</span>; action.quant_tier = QuantTier.FP16 |
| <span class="kw">elif</span> self.config.task_id == <span class="nm">2</span>: |
| action.prefill_disagg = <span class="kw">False</span>; action.quant_tier = QuantTier.FP16 |
| <span class="kw">return</span> action |
|
|
| <span class="kw">def</span> <span class="fn">_build_obs</span>(self, m: <span class="tp">MetricsSnapshot</span>) -> <span class="tp">ServeObservation</span>: |
| w = self._workload |
| <span class="kw">return</span> ServeObservation( |
| queue_depth = float(w.queue_depth), |
| mean_prompt_len = w.mean_prompt_len, |
| arrival_rate = w.arrival_rate, |
| kv_cache_occupancy = (<span class="nm">1.0</span> - (m.eviction_events / max(w.queue_depth, <span class="nm">1</span>))), |
| ttft_p50 = m.ttft_p50_ms, |
| tpot_p50 = m.tpot_ms, |
| slo_violation_rate = m.slo_violations / max(w.queue_depth, <span class="nm">1</span>), |
| gpu_memory_used_gb = m.gpu_memory_gb, |
| spec_accept_rate = m.spec_accept_rate, |
| priority_distribution = w.priority_distribution, |
| timestep = self._step, |
| cost_so_far = self._cost_so_far, |
| )</pre></div> |
| </div> |
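| The Day 4 gate loop, sketched against a throwaway stand-in so it runs before the real pieces exist — <code>StubEnv</code> and <code>random_action</code> here are hypothetical and only mirror the reset()/step() signature: |

```python
import numpy as np

class StubEnv:
    """Hypothetical stand-in with InferenceEnv's reset()/step() signature."""
    EPISODE_LEN = 200

    def __init__(self, seed: int = 0):
        self.rng = np.random.default_rng(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return {"timestep": 0}

    def step(self, action):
        self.t += 1
        reward = float(np.clip(self.rng.normal(0.1, 0.3), -1.0, 1.0))
        done = self.t >= self.EPISODE_LEN
        return {"timestep": self.t}, reward, done, {}

def random_action(rng):
    # Mirrors ServeAction's fields, sampled from the legal ranges.
    return {"batch_size": int(rng.integers(1, 513)),
            "kv_budget": float(rng.uniform(0.1, 1.0)),
            "spec_length": int(rng.choice([0, 1, 2, 4, 8]))}

env, rng = StubEnv(), np.random.default_rng(0)
obs, done, total = env.reset(), False, 0.0
while not done:
    obs, reward, done, info = env.step(random_action(rng))
    assert -1.0 <= reward <= 1.0          # the gate's reward-range check
    total += reward
print(f"episode finished at step {obs['timestep']}, return {total:.2f}")
```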
| </div> |
|
|
| |
| <div class="section" id="phase3"> |
| <div class="section-header"> |
| <div class="section-num">P3</div> |
| <div class="section-meta"> |
| <div class="section-title">Phase 3 — API Layer & Docker</div> |
| <div class="section-sub">Day 5 (Mar 31). Goal: all 8 HTTP endpoints are live, wired to the real InferenceEnv, and the Docker image builds cleanly and passes openenv validate.</div> |
| </div> |
| <span class="eyebrow-tag et-blue">Day 5 · Mar 31</span> |
| </div> |
|
|
| <div class="gate-box"> |
| <div class="gate-icon">🔒</div> |
| <div> |
| <div class="gate-label">Phase Gate — End of Day 5</div> |
| <div class="gate-text"><strong>Running the openenv CLI validation passes with no errors:</strong> <code>openenv validate --url http://localhost:7860</code>. Every endpoint returns the correct shape. The Docker image is under 2GB. A full reset → step×200 → grader cycle completes in under 60 seconds.</div> |
| </div> |
| </div> |
|
|
| <div class="label">ALL ENDPOINTS β Implementation Spec</div> |
| <div class="table-wrap mb14"> |
| <table> |
| <tr> |
| <th>Endpoint</th><th>Method</th><th>Owns</th><th>Wired to</th><th>Key Behaviour</th> |
| </tr> |
| <tr> |
| <td><code>/health</code></td><td>GET</td><td>Person B</td><td>Session cache count</td> |
| <td>Returns <code>{"status":"ok","active_sessions":N,"uptime_s":T}</code></td> |
| </tr> |
| <tr> |
| <td><code>/tasks</code></td><td>GET</td><td>Person B</td><td>Static task config dict</td> |
| <td>Returns list of 3 tasks with id, name, difficulty, description, active_actions</td> |
| </tr> |
| <tr> |
| <td><code>/reset</code></td><td>POST</td><td>Person B</td><td><code>InferenceEnv.reset()</code></td> |
| <td>Creates new session_id, instantiates InferenceEnv for that task, stores in LRU cache. Returns session_id + observation.</td> |
| </tr> |
| <tr> |
| <td><code>/step</code></td><td>POST</td><td>Person B</td><td><code>InferenceEnv.step()</code></td> |
| <td>Looks up session by session_id, validates ServeAction, calls step(), returns obs+reward+done+info. 422 if session not found.</td> |
| </tr> |
| <tr> |
| <td><code>/state</code></td><td>GET</td><td>Person B</td><td><code>InferenceEnv.state()</code></td> |
| <td>Returns current episode metadata: step_count, cumulative_reward, done, workload_phase.</td> |
| </tr> |
| <tr> |
| <td><code>/grader</code></td><td>POST</td><td>Person C</td><td><code>GraderModule.score()</code></td> |
| <td>Accepts episode_log JSON, returns a score 0–1 with breakdown. Stateless — the same input always yields the same output.</td> |
| </tr> |
| <tr> |
| <td><code>/baseline</code></td><td>GET</td><td>Person C</td><td><code>BaselineAgent.run()</code></td> |
| <td>Runs the fixed-config baseline agent on all 3 tasks, returns scores. Fixed seed guarantees reproducibility.</td> |
| </tr> |
| <tr> |
| <td><code>/info</code></td><td>GET</td><td>Person B</td><td>Static schema</td> |
| <td>Returns full JSON schema for action space, observation space, reward weights. Used by agent frameworks.</td> |
| </tr> |
| </table> |
| </div> |
|
|
| <div class="label">SESSION MANAGEMENT β Critical Design</div> |
| <div class="code-block"> |
| <div class="code-block-header"> |
| <span class="code-lang">python</span> |
| <span class="code-file">simulator/session_manager.py β Thread-safe LRU session cache</span> |
| </div> |
| <div class="code-body"><pre><span class="kw">import</span> threading |
| <span class="kw">from</span> collections <span class="kw">import</span> OrderedDict |
| <span class="kw">from</span> typing <span class="kw">import</span> <span class="tp">Optional</span> |
| <span class="kw">from</span> env.inference_env <span class="kw">import</span> InferenceEnv |
|
|
| <span class="kw">class</span> <span class="tp">SessionManager</span>: |
| <span class="cm">"""Thread-safe LRU cache of active InferenceEnv instances."""</span> |
| MAX_SESSIONS = <span class="nm">50</span> |
| |
| <span class="kw">def</span> <span class="fn">__init__</span>(self, simulator): |
| self._sim = simulator |
| self._lock = threading.Lock() |
| self._sessions: <span class="tp">OrderedDict[str, InferenceEnv]</span> = OrderedDict() |
| |
| <span class="kw">def</span> <span class="fn">create</span>(self, task_id: <span class="tp">int</span>, seed: <span class="tp">int</span>) -> <span class="tp">InferenceEnv</span>: |
| <span class="kw">with</span> self._lock: |
| <span class="kw">if</span> len(self._sessions) >= self.MAX_SESSIONS: |
| self._sessions.popitem(last=<span class="kw">False</span>) <span class="cm"># evict oldest</span> |
| env = InferenceEnv(self._sim, task_id, seed) |
| self._sessions[env.session_id] = env |
| <span class="kw">return</span> env |
| |
| <span class="kw">def</span> <span class="fn">get</span>(self, session_id: <span class="tp">str</span>) -> <span class="tp">Optional[InferenceEnv]</span>: |
| <span class="kw">with</span> self._lock: |
| env = self._sessions.get(session_id) |
| <span class="kw">if</span> env: <span class="cm"># move to end (mark as recently used)</span> |
| self._sessions.move_to_end(session_id) |
| <span class="kw">return</span> env |
| |
| <span class="kw">def</span> <span class="fn">remove</span>(self, session_id: <span class="tp">str</span>) -> <span class="tp">None</span>: |
| <span class="kw">with</span> self._lock: |
| self._sessions.pop(session_id, <span class="kw">None</span>) |
| |
| <span class="kw">def</span> <span class="fn">count</span>(self) -> <span class="tp">int</span>: |
| <span class="kw">with</span> self._lock: |
| <span class="kw">return</span> len(self._sessions)</pre></div> |
| </div> |
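The LRU behaviour the class relies on can be demonstrated with a self-contained sketch. `FakeEnv`, `MiniSessionManager`, and the reduced `MAX_SESSIONS` are illustration-only stand-ins for `InferenceEnv` and the real cap of 50:

```python
import itertools
import threading
from collections import OrderedDict

class FakeEnv:
    """Stand-in for InferenceEnv (illustration only)."""
    _ids = itertools.count()
    def __init__(self):
        self.session_id = f"sess-{next(self._ids)}"

class MiniSessionManager:
    MAX_SESSIONS = 3  # reduced from 50 so eviction is visible

    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = OrderedDict()

    def create(self):
        with self._lock:
            if len(self._sessions) >= self.MAX_SESSIONS:
                self._sessions.popitem(last=False)  # evict least recently used
            env = FakeEnv()
            self._sessions[env.session_id] = env
            return env

    def get(self, session_id):
        with self._lock:
            env = self._sessions.get(session_id)
            if env:
                self._sessions.move_to_end(session_id)  # mark as recently used
            return env

mgr = MiniSessionManager()
first = mgr.create()
for _ in range(3):                          # three more sessions overflow the cache...
    mgr.create()
assert mgr.get(first.session_id) is None    # ...so the oldest was evicted
```

Touching a session via `get()` moves it to the back of the eviction order, which is exactly what `move_to_end` provides over a plain dict.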
|
|
| <div class="label">FASTAPI APP SKELETON – Person B writes this on Day 4 (stubs) and wires on Day 5</div> |
| <div class="code-block"> |
| <div class="code-block-header"> |
| <span class="code-lang">python</span> |
| <span class="code-file">server/app.py – Main FastAPI application</span> |
| </div> |
| <div class="code-body"><pre><span class="kw">from</span> fastapi <span class="kw">import</span> FastAPI, HTTPException |
| <span class="kw">from</span> fastapi.middleware.cors <span class="kw">import</span> CORSMiddleware |
| <span class="kw">from</span> pydantic <span class="kw">import</span> BaseModel |
| <span class="kw">from</span> typing <span class="kw">import</span> <span class="tp">Optional</span> |
| <span class="kw">import</span> time |
|
|
| <span class="kw">from</span> simulator.trace_sim <span class="kw">import</span> TraceSimulator |
| <span class="kw">from</span> simulator.session_manager <span class="kw">import</span> SessionManager |
| <span class="kw">from</span> inferencegym.models <span class="kw">import</span> ServeAction, QuantTier |
|
|
| app = FastAPI(title=<span class="st">"InferenceGym"</span>, version=<span class="st">"1.0.0"</span>) |
| app.add_middleware(CORSMiddleware, allow_origins=[<span class="st">"*"</span>], allow_methods=[<span class="st">"*"</span>], allow_headers=[<span class="st">"*"</span>]) |
|
|
| <span class="cm"># ── App startup: load simulator once, create session manager ─────────────────</span> |
| _sim = <span class="kw">None</span> |
| _sessions = <span class="kw">None</span> |
| _start_time = time.time() |
|
|
| @app.on_event(<span class="st">"startup"</span>) |
| <span class="kw">async def</span> <span class="fn">startup</span>(): |
| <span class="kw">global</span> _sim, _sessions |
| _sim = TraceSimulator(<span class="st">"simulator/data/traces_llama3_8b.parquet"</span>) |
| _sessions = SessionManager(_sim) |
|
|
| <span class="cm"># ── Pydantic request/response models ────────────────────────────────────────</span> |
| <span class="kw">class</span> <span class="tp">ResetRequest</span>(BaseModel): |
| task_id: <span class="tp">int</span> |
| seed: <span class="tp">int</span> = <span class="nm">42</span> |
| config: <span class="tp">Optional[dict]</span> = <span class="kw">None</span> <span class="cm"># override alpha/beta/gamma/delta</span> |
|
|
| <span class="kw">class</span> <span class="tp">StepRequest</span>(BaseModel): |
| session_id: <span class="tp">str</span> |
| action: <span class="tp">dict</span> |
|
|
| <span class="kw">class</span> <span class="tp">GraderRequest</span>(BaseModel): |
| task_id: <span class="tp">int</span> |
| episode_log: <span class="tp">list</span> |
|
|
| <span class="cm"># ── Endpoints ───────────────────────────────────────────────────────────────</span> |
| @app.get(<span class="st">"/health"</span>) |
| <span class="kw">def</span> <span class="fn">health</span>(): |
| <span class="kw">return</span> {<span class="st">"status"</span>: <span class="st">"ok"</span>, <span class="st">"active_sessions"</span>: _sessions.count(), |
| <span class="st">"uptime_seconds"</span>: int(time.time() - _start_time)} |
|
|
| @app.get(<span class="st">"/tasks"</span>) |
| <span class="kw">def</span> <span class="fn">get_tasks</span>(): |
| <span class="kw">return</span> {<span class="st">"tasks"</span>: [ |
| {<span class="st">"id"</span>:<span class="nm">1</span>, <span class="st">"name"</span>:<span class="st">"Static Uniform"</span>, <span class="st">"difficulty"</span>:<span class="st">"easy"</span>, <span class="st">"active_actions"</span>:[<span class="st">"kv_budget"</span>,<span class="st">"batch_size"</span>]}, |
| {<span class="st">"id"</span>:<span class="nm">2</span>, <span class="st">"name"</span>:<span class="st">"Bursty ShareGPT"</span>, <span class="st">"difficulty"</span>:<span class="st">"medium"</span>, <span class="st">"active_actions"</span>:[<span class="st">"kv_budget"</span>,<span class="st">"batch_size"</span>,<span class="st">"spec_length"</span>]}, |
| {<span class="st">"id"</span>:<span class="nm">3</span>, <span class="st">"name"</span>:<span class="st">"Adversarial Multi-Tenant"</span>,<span class="st">"difficulty"</span>:<span class="st">"hard"</span>, <span class="st">"active_actions"</span>:[<span class="st">"kv_budget"</span>,<span class="st">"batch_size"</span>,<span class="st">"spec_length"</span>,<span class="st">"prefill_disagg"</span>,<span class="st">"quant_tier"</span>]}, |
| ]} |
|
|
| @app.post(<span class="st">"/reset"</span>) |
| <span class="kw">def</span> <span class="fn">reset</span>(req: <span class="tp">ResetRequest</span>): |
| <span class="kw">if</span> req.task_id <span class="kw">not in</span> {<span class="nm">1</span>, <span class="nm">2</span>, <span class="nm">3</span>}: |
| <span class="kw">raise</span> HTTPException(<span class="nm">422</span>, <span class="kw">f</span><span class="st">"task_id must be 1, 2, or 3. Got {req.task_id}"</span>) |
| env = _sessions.create(req.task_id, req.seed) |
| obs = env.reset() |
| <span class="kw">return</span> {<span class="st">"session_id"</span>: env.session_id, <span class="st">"observation"</span>: obs.__dict__, <span class="st">"episode_length"</span>: <span class="nm">200</span>} |
|
|
| @app.post(<span class="st">"/step"</span>) |
| <span class="kw">def</span> <span class="fn">step</span>(req: <span class="tp">StepRequest</span>): |
| env = _sessions.get(req.session_id) |
| <span class="kw">if not</span> env: |
| <span class="kw">raise</span> HTTPException(<span class="nm">404</span>, <span class="kw">f</span><span class="st">"Session '{req.session_id}' not found. Call /reset first."</span>) |
| action = ServeAction( |
| kv_budget = req.action.get(<span class="st">"kv_budget"</span>, <span class="nm">1.0</span>), |
| spec_length = req.action.get(<span class="st">"spec_length"</span>, <span class="nm">0</span>), |
| batch_size = req.action.get(<span class="st">"batch_size"</span>, <span class="nm">32</span>), |
| prefill_disagg = req.action.get(<span class="st">"prefill_disagg"</span>, <span class="kw">False</span>), |
| quant_tier = QuantTier(req.action.get(<span class="st">"quant_tier"</span>, <span class="nm">0</span>)), |
| ) |
| obs, reward, done, info = env.step(action) |
| <span class="kw">if</span> done: |
| _sessions.remove(req.session_id) |
| <span class="kw">return</span> {<span class="st">"observation"</span>: obs.__dict__, <span class="st">"reward"</span>: reward, <span class="st">"done"</span>: done, <span class="st">"info"</span>: info}</pre></div> |
| </div> |
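The `/step` handler above fills any missing action fields with defaults via `dict.get`. A dependency-free sketch of that defaulting logic follows; the `parse_action` helper is illustrative, not part of the server, and its rejection of unknown keys is an extra safeguard the handler above does not perform:

```python
# Defaults mirror the ServeAction construction inside the /step endpoint
ACTION_DEFAULTS = {
    "kv_budget": 1.0,
    "spec_length": 0,
    "batch_size": 32,
    "prefill_disagg": False,
    "quant_tier": 0,
}

def parse_action(raw: dict) -> dict:
    """Merge a partial action dict over the defaults, rejecting unknown keys."""
    unknown = set(raw) - set(ACTION_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown action fields: {sorted(unknown)}")
    return {**ACTION_DEFAULTS, **raw}

full = parse_action({"batch_size": 128, "kv_budget": 0.6})
assert full["batch_size"] == 128 and full["spec_length"] == 0
```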
|
|
| <div class="label">DOCKERFILE – Multi-stage, CPU-only, &lt;2GB</div> |
| <div class="code-block"> |
| <div class="code-block-header"> |
| <span class="code-lang">dockerfile</span> |
| <span class="code-file">Dockerfile</span> |
| </div> |
| <div class="code-body"><pre><span class="cm"># Stage 1: Install dependencies only</span> |
| <span class="kw">FROM</span> python:3.11-slim <span class="kw">AS</span> builder |
| <span class="kw">WORKDIR</span> /build |
| <span class="kw">COPY</span> requirements.txt . |
| <span class="kw">RUN</span> pip install --no-cache-dir --user -r requirements.txt |
|
|
| <span class="cm"># Stage 2: Minimal runtime (no build tools)</span> |
| <span class="kw">FROM</span> python:3.11-slim |
| <span class="kw">WORKDIR</span> /app |
| <span class="kw">COPY</span> --from=builder /root/.local /root/.local |
| <span class="kw">COPY</span> . . |
| <span class="kw">ENV</span> PATH=/root/.local/bin:$PATH |
| <span class="kw">ENV</span> PYTHONPATH=/app |
| <span class="kw">EXPOSE</span> 7860 |
|
|
| <span class="cm"># HuggingFace Spaces convention: port 7860</span> |
| <span class="cm"># Single worker: sessions live in process memory, so extra workers would not share them</span> |
| <span class="kw">CMD</span> ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"] |
|
|
| <span class="cm">## requirements.txt (CPU-only – NO torch, NO CUDA)</span> |
| <span class="cm"># fastapi==0.115.0</span> |
| <span class="cm"># uvicorn[standard]==0.30.0</span> |
| <span class="cm"># pydantic==2.7.0</span> |
| <span class="cm"># numpy==1.26.4</span> |
| <span class="cm"># scipy==1.13.0</span> |
| <span class="cm"># pandas==2.2.0</span> |
| <span class="cm"># pyarrow==15.0.0 (for parquet reading)</span> |
| <span class="cm"># stable-baselines3==2.3.0 (PPO demo only)</span> |
| <span class="cm"># gymnasium==0.29.1</span> |
| <span class="cm"># httpx==0.27.0 (for integration tests)</span></pre></div> |
| </div> |
| </div> |
|
|
| |
| <div class="section" id="phase4"> |
| <div class="section-header"> |
| <div class="section-num">P4</div> |
| <div class="section-meta"> |
| <div class="section-title">Phase 4 – Grader, Baseline & Task Completion</div> |
| <div class="section-sub">Days 6–7 (Apr 1–2). Goal: all three tasks are complete, the /grader endpoint scores any episode log deterministically, and the baseline agent runs and produces reproducible scores around 0.22–0.35.</div> |
| </div> |
| <span class="eyebrow-tag et-amber">Days 6–7</span> |
| </div> |
|
|
| <div class="gate-box"> |
| <div class="gate-icon">π</div> |
| <div> |
| <div class="gate-label">Phase Gate – End of Day 7</div> |
| <div class="gate-text"><strong>POST /grader with a handcrafted episode log returns a score between 0.0 and 1.0 with a complete breakdown dict.</strong> GET /baseline returns scores in the range [0.20, 0.40] for all 3 tasks. The grader returns the same score on repeated calls with the same input. All grader unit tests pass.</div> |
| </div> |
| </div> |
|
|
| <div class="label">GRADER DESIGN – Per-Task Formula Detail</div> |
| <div class="grid2 mb14"> |
| <div class="module-card"> |
| <div class="module-card-header"> |
| <span class="module-card-name" style="color:var(--green)">Task 1 Grader</span> |
| <span class="eyebrow-tag et-green">EASY</span> |
| </div> |
| <div class="module-card-body"> |
| <div class="module-card-desc">Pure throughput optimisation. Score is the normalised improvement over baseline on mean tokens/sec, capped at 1.0.</div> |
| <div class="code-block" style="margin-bottom:0; font-size:11px;"> |
| <div class="code-body" style="padding:10px 12px;"><pre><span class="cm"># All values are means over the 200-step episode log</span> |
| score = (agent_tps - baseline_tps) / (optimal_tps - baseline_tps) |
| score = max(0.0, min(1.0, score)) |
|
|
| <span class="cm"># baseline_tps β 2800 tokens/s (batch=32, kv=1.0)</span> |
| <span class="cm"># optimal_tps β 8200 tokens/s (batch=128, kv=0.5)</span></pre></div> |
| </div> |
| </div> |
| </div> |
| <div class="module-card"> |
| <div class="module-card-header"> |
| <span class="module-card-name" style="color:var(--blue)">Task 2 Grader</span> |
| <span class="eyebrow-tag et-blue">MEDIUM</span> |
| </div> |
| <div class="module-card-body"> |
| <div class="module-card-desc">Balances TTFT and memory compliance. Both components are independently scored and averaged.</div> |
| <div class="code-block" style="margin-bottom:0; font-size:11px;"> |
| <div class="code-body" style="padding:10px 12px;"><pre>ttft_score = max(0.0, 1.0 - mean_ttft_p50 / 300.0) |
| peak_mem = max(s['metrics']['gpu_memory_gb'] for s in episode_log) |
| mem_score = 1.0 if peak_mem < 36.0 else max(0.0, 1.0 - (peak_mem-36)/10) |
| score = 0.5 * ttft_score + 0.5 * mem_score</pre></div> |
| </div> |
| </div> |
| </div> |
| <div class="module-card"> |
| <div class="module-card-header"> |
| <span class="module-card-name" style="color:var(--red)">Task 3 Grader</span> |
| <span class="eyebrow-tag et-red">HARD</span> |
| </div> |
| <div class="module-card-body"> |
| <div class="module-card-desc">4-component scoring with explicit weights. The stability score penalises wild action thrashing and rewards a smooth, learnable policy.</div> |
| <div class="code-block" style="margin-bottom:0; font-size:11px;"> |
| <div class="code-body" style="padding:10px 12px;"><pre>T = mean_tps / optimal_tps <span class="cm"># throughput</span> |
| S = 1.0 - mean_slo_violation_rate <span class="cm"># SLO compliance</span> |
| C = max(0.0, 1.0 - total_cost/5.0) <span class="cm"># cost (budget=5.0)</span> |
| A = 1.0 - action_variance_score <span class="cm"># stability</span> |
|
|
| score = 0.40*T + 0.30*S + 0.20*C + 0.10*A</pre></div> |
| </div> |
| </div> |
| </div> |
| <div class="module-card"> |
| <div class="module-card-header"> |
| <span class="module-card-name" style="color:var(--amber)">Stability Score</span> |
| <span class="eyebrow-tag et-amber">Anti-Thrashing</span> |
| </div> |
| <div class="module-card-body"> |
| <div class="module-card-desc">Measures the spread of step-to-step action changes. High variance = thrashing = unstable policy. The stability score penalises this.</div> |
| <div class="code-block" style="margin-bottom:0; font-size:11px;"> |
| <div class="code-body" style="padding:10px 12px;"><pre>actions = [step['action'] for step in episode_log] |
| batch_diffs = np.diff([a['batch_size'] for a in actions]) |
| kv_diffs = np.diff([a['kv_budget'] for a in actions]) |
| variance = np.std(batch_diffs)/512 + np.std(kv_diffs)/1.0 |
| action_variance_score = min(1.0, variance / 0.5) <span class="cm"># 0=stable, 1=chaotic</span></pre></div> |
| </div> |
| </div> |
| </div> |
| </div> |
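The stability formula in the card above can be sketched without NumPy, since `statistics.pstdev` matches `np.std`'s default population behaviour. The perfectly steady and maximally thrashing policies below are made-up inputs chosen to hit the two extremes:

```python
from statistics import pstdev

def action_variance_score(actions):
    """0 = perfectly stable policy, 1 = chaotic thrashing (same formula as the card)."""
    diffs = lambda xs: [b - a for a, b in zip(xs, xs[1:])]
    batch_diffs = diffs([a.get("batch_size", 32) for a in actions])
    kv_diffs = diffs([a.get("kv_budget", 1.0) for a in actions])
    variance = pstdev(batch_diffs) / 512 + pstdev(kv_diffs) / 1.0
    return min(1.0, variance / 0.5)

# A constant policy has zero step-to-step change; an alternating one is pure thrash
steady = [{"batch_size": 64, "kv_budget": 0.8}] * 200
thrash = [{"batch_size": 512 if i % 2 else 1, "kv_budget": float(i % 2)}
          for i in range(200)]
assert action_variance_score(steady) == 0.0
assert action_variance_score(thrash) == 1.0
```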
|
|
| <div class="label">GRADER MODULE – Full Implementation</div> |
| <div class="code-block"> |
| <div class="code-block-header"> |
| <span class="code-lang">python</span> |
| <span class="code-file">grader/grader.py – Deterministic episode scorer</span> |
| </div> |
| <div class="code-body"><pre><span class="kw">import</span> numpy <span class="kw">as</span> np |
| <span class="kw">from</span> typing <span class="kw">import</span> <span class="tp">List, Dict, Any</span> |
|
|
| <span class="kw">class</span> <span class="tp">GraderModule</span>: |
| <span class="cm">"""Deterministic grader. Same episode_log → same score, always."""</span> |
|
|
| BASELINE_TPS = {<span class="nm">1</span>: <span class="nm">2800.0</span>, <span class="nm">2</span>: <span class="nm">2100.0</span>, <span class="nm">3</span>: <span class="nm">1600.0</span>} |
| OPTIMAL_TPS = {<span class="nm">1</span>: <span class="nm">8200.0</span>, <span class="nm">2</span>: <span class="nm">5800.0</span>, <span class="nm">3</span>: <span class="nm">4200.0</span>} |
|
|
| <span class="kw">def</span> <span class="fn">score</span>(self, task_id: <span class="tp">int</span>, episode_log: <span class="tp">List[Dict[str, Any]]</span>) -> <span class="tp">Dict</span>: |
| <span class="kw">if not</span> episode_log: |
| <span class="kw">return</span> {<span class="st">"score"</span>: <span class="nm">0.0</span>, <span class="st">"breakdown"</span>: {}, <span class="st">"feedback"</span>: <span class="st">"Empty episode log."</span>} |
| |
| graders = {<span class="nm">1</span>: self._task1, <span class="nm">2</span>: self._task2, <span class="nm">3</span>: self._task3} |
| <span class="kw">if</span> task_id <span class="kw">not in</span> graders: |
| <span class="kw">raise</span> ValueError(<span class="st">f"Unknown task_id: {task_id}"</span>) |
| <span class="kw">return</span> graders[task_id](episode_log) |
|
|
| <span class="kw">def</span> <span class="fn">_task1</span>(self, log) -> <span class="tp">Dict</span>: |
| mean_tps = np.mean([s[<span class="st">'metrics'</span>][<span class="st">'tokens_per_sec'</span>] <span class="kw">for</span> s <span class="kw">in</span> log]) |
| score = (mean_tps - self.BASELINE_TPS[<span class="nm">1</span>]) / (self.OPTIMAL_TPS[<span class="nm">1</span>] - self.BASELINE_TPS[<span class="nm">1</span>]) |
| score = float(np.clip(score, <span class="nm">0.0</span>, <span class="nm">1.0</span>)) |
| feedback = self._throughput_feedback(mean_tps, <span class="nm">1</span>) |
| <span class="kw">return</span> {<span class="st">"score"</span>: score, <span class="st">"breakdown"</span>: {<span class="st">"throughput"</span>: score}, <span class="st">"feedback"</span>: feedback} |
|
|
| <span class="kw">def</span> <span class="fn">_task2</span>(self, log) -> <span class="tp">Dict</span>: |
| mean_ttft = np.mean([s[<span class="st">'metrics'</span>][<span class="st">'ttft_p50_ms'</span>] <span class="kw">for</span> s <span class="kw">in</span> log]) |
| peak_mem = max(s[<span class="st">'metrics'</span>][<span class="st">'gpu_memory_gb'</span>] <span class="kw">for</span> s <span class="kw">in</span> log) |
| ttft_score = float(np.clip(<span class="nm">1.0</span> - mean_ttft / <span class="nm">300.0</span>, <span class="nm">0.0</span>, <span class="nm">1.0</span>)) |
| mem_score = <span class="nm">1.0</span> <span class="kw">if</span> peak_mem < <span class="nm">36.0</span> <span class="kw">else</span> float(np.clip(<span class="nm">1.0</span> - (peak_mem-<span class="nm">36</span>)/<span class="nm">10</span>, <span class="nm">0.0</span>, <span class="nm">1.0</span>)) |
| score = <span class="nm">0.5</span> * ttft_score + <span class="nm">0.5</span> * mem_score |
| feedback = <span class="kw">f</span><span class="st">"TTFT score: {ttft_score:.2f} (mean TTFT {mean_ttft:.0f}ms vs 300ms SLO). Memory score: {mem_score:.2f} (peak {peak_mem:.1f}GB vs 36GB limit)."</span> |
| <span class="kw">return</span> {<span class="st">"score"</span>: score, <span class="st">"breakdown"</span>: {<span class="st">"ttft"</span>: ttft_score, <span class="st">"memory"</span>: mem_score}, <span class="st">"feedback"</span>: feedback} |
|
|
| <span class="kw">def</span> <span class="fn">_task3</span>(self, log) -> <span class="tp">Dict</span>: |
| mean_tps = np.mean([s[<span class="st">'metrics'</span>][<span class="st">'tokens_per_sec'</span>] <span class="kw">for</span> s <span class="kw">in</span> log]) |
| mean_slo = np.mean([s[<span class="st">'metrics'</span>][<span class="st">'slo_violations'</span>] <span class="kw">for</span> s <span class="kw">in</span> log]) |
| total_cost = sum(s[<span class="st">'metrics'</span>][<span class="st">'cost_per_1k'</span>] <span class="kw">for</span> s <span class="kw">in</span> log) |
| actions = [s[<span class="st">'action'</span>] <span class="kw">for</span> s <span class="kw">in</span> log] |
| |
| T = float(np.clip(mean_tps / self.OPTIMAL_TPS[<span class="nm">3</span>], <span class="nm">0.0</span>, <span class="nm">1.0</span>)) |
| S = float(np.clip(<span class="nm">1.0</span> - mean_slo / <span class="nm">100.0</span>, <span class="nm">0.0</span>, <span class="nm">1.0</span>)) |
| C = float(np.clip(<span class="nm">1.0</span> - total_cost / <span class="nm">5.0</span>, <span class="nm">0.0</span>, <span class="nm">1.0</span>)) |
| A = <span class="nm">1.0</span> - self._action_variance(actions) |
| |
| score = <span class="nm">0.40</span>*T + <span class="nm">0.30</span>*S + <span class="nm">0.20</span>*C + <span class="nm">0.10</span>*A |
| feedback = self._task3_feedback(T, S, C, A, log) |
| <span class="kw">return</span> {<span class="st">"score"</span>: score, <span class="st">"breakdown"</span>: {<span class="st">"throughput"</span>:T,<span class="st">"slo"</span>:S,<span class="st">"cost"</span>:C,<span class="st">"stability"</span>:A}, <span class="st">"feedback"</span>: feedback} |
|
|
| <span class="kw">def</span> <span class="fn">_action_variance</span>(self, actions) -> <span class="tp">float</span>: |
| batch_vals = [a.get(<span class="st">'batch_size'</span>, <span class="nm">32</span>) <span class="kw">for</span> a <span class="kw">in</span> actions] |
| kv_vals = [a.get(<span class="st">'kv_budget'</span>, <span class="nm">1.0</span>) <span class="kw">for</span> a <span class="kw">in</span> actions] |
| variance = np.std(np.diff(batch_vals))/<span class="nm">512</span> + np.std(np.diff(kv_vals))/<span class="nm">1.0</span> |
| <span class="kw">return</span> float(np.clip(variance / <span class="nm">0.5</span>, <span class="nm">0.0</span>, <span class="nm">1.0</span>)) |
| |
| <span class="kw">def</span> <span class="fn">_throughput_feedback</span>(self, mean_tps, task_id) -> <span class="tp">str</span>: |
| pct = (mean_tps - self.BASELINE_TPS[task_id]) / (self.OPTIMAL_TPS[task_id] - self.BASELINE_TPS[task_id]) * <span class="nm">100</span> |
| <span class="kw">return</span> <span class="kw">f</span><span class="st">"Agent achieved {mean_tps:.0f} TPS ({pct:.0f}% of the way from baseline to optimal)."</span> |
| |
| <span class="kw">def</span> <span class="fn">_task3_feedback</span>(self, T, S, C, A, log) -> <span class="tp">str</span>: |
| <span class="cm"># Referenced by _task3 above; summarises the four weighted components</span> |
| <span class="kw">return</span> <span class="kw">f</span><span class="st">"Throughput {T:.2f}, SLO {S:.2f}, cost {C:.2f}, stability {A:.2f} over {len(log)} steps."</span></pre></div> |
| </div> |
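A worked Task 1 example, using the same formula and constants as `GraderModule` but in pure stdlib so it runs anywhere. The synthetic 200-step log is chosen so the mean throughput sits exactly halfway between baseline and optimal:

```python
from statistics import mean

BASELINE_TPS, OPTIMAL_TPS = 2800.0, 8200.0  # Task 1 constants from GraderModule

# Synthetic episode log: every step reports the same throughput
log = [{"metrics": {"tokens_per_sec": 5500.0}} for _ in range(200)]

mean_tps = mean(s["metrics"]["tokens_per_sec"] for s in log)
raw = (mean_tps - BASELINE_TPS) / (OPTIMAL_TPS - BASELINE_TPS)
score = max(0.0, min(1.0, raw))  # clip to [0, 1], as the grader does

# (5500 - 2800) / (8200 - 2800) = 2700 / 5400 = 0.5
assert abs(score - 0.5) < 1e-12
```

Because the score is a pure function of the log, re-scoring the same log always returns the identical value, which is the determinism property the /grader endpoint promises.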
|
|
| <div class="label">BASELINE AGENT – Fixed-config, deterministic</div> |
| <div class="code-block"> |
| <div class="code-block-header"> |
| <span class="code-lang">python</span> |
| <span class="code-file">agents/baseline.py – Naïve vLLM defaults (Person C, Day 6)</span> |
| </div> |
| <div class="code-body"><pre><span class="kw">from</span> inferencegym.models <span class="kw">import</span> ServeAction, QuantTier |
| <span class="kw">from</span> env.inference_env <span class="kw">import</span> InferenceEnv |
| <span class="kw">from</span> simulator.trace_sim <span class="kw">import</span> TraceSimulator |
| <span class="kw">from</span> grader.grader <span class="kw">import</span> GraderModule |
|
|
| <span class="cm"># The fixed action that the baseline ALWAYS takes, regardless of observation</span> |
| BASELINE_ACTION = ServeAction( |
| kv_budget = <span class="nm">1.0</span>, <span class="cm"># no eviction</span> |
| spec_length = <span class="nm">0</span>, <span class="cm"># speculative decoding off</span> |
| batch_size = <span class="nm">32</span>, <span class="cm"># vLLM default</span> |
| prefill_disagg = <span class="kw">False</span>, <span class="cm"># colocated</span> |
| quant_tier = QuantTier.FP16, <span class="cm"># full precision</span> |
| ) |
|
|
| <span class="kw">def</span> <span class="fn">run_baseline</span>(task_id: <span class="tp">int</span>, seed: <span class="tp">int</span> = <span class="nm">0</span>) -> <span class="tp">dict</span>: |
| <span class="cm">"""Runs the fixed baseline agent on one task and returns its grader score."""</span> |
| sim = TraceSimulator(<span class="st">"simulator/data/traces_llama3_8b.parquet"</span>, seed=seed) |
| env = InferenceEnv(sim, task_id=task_id, seed=seed) |
| grader = GraderModule() |
| |
| env.reset() |
| done = <span class="kw">False</span> |
| <span class="kw">while not</span> done: |
| _, _, done, _ = env.step(BASELINE_ACTION) |
| |
| result = grader.score(task_id, env._episode_log) |
| <span class="kw">return</span> {<span class="st">"task_id"</span>: task_id, <span class="st">"score"</span>: result[<span class="st">"score"</span>], |
| <span class="st">"breakdown"</span>: result[<span class="st">"breakdown"</span>], <span class="st">"action_config"</span>: BASELINE_ACTION.__dict__} |
|
|
| <span class="kw">def</span> <span class="fn">run_all_baselines</span>() -> <span class="tp">dict</span>: |
| <span class="cm"># Seed=0 guarantees identical results every run</span> |
| <span class="kw">return</span> {<span class="st">"scores"</span>: {<span class="kw">f</span><span class="st">"task{i}"</span>: run_baseline(i, seed=<span class="nm">0</span>)[<span class="st">"score"</span>] <span class="kw">for</span> i <span class="kw">in</span> [<span class="nm">1</span>,<span class="nm">2</span>,<span class="nm">3</span>]}, |
| <span class="st">"expected_range"</span>: {<span class="st">"task1"</span>:[<span class="nm">0.30</span>,<span class="nm">0.40</span>], <span class="st">"task2"</span>:[<span class="nm">0.22</span>,<span class="nm">0.32</span>], <span class="st">"task3"</span>:[<span class="nm">0.18</span>,<span class="nm">0.28</span>]}}</pre></div> |
| </div> |
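A small CI-style check that the `/baseline` scores land inside the expected bands returned by `run_all_baselines`. The `check_baseline` helper and the sample scores are illustrative stand-ins, not part of the repo:

```python
# Expected bands copied from run_all_baselines above (inclusive bounds assumed)
EXPECTED_RANGE = {"task1": (0.30, 0.40), "task2": (0.22, 0.32), "task3": (0.18, 0.28)}

def check_baseline(scores: dict) -> dict:
    """Map each task to True/False depending on whether its score is in band."""
    return {task: lo <= scores[task] <= hi
            for task, (lo, hi) in EXPECTED_RANGE.items()}

sample = {"task1": 0.34, "task2": 0.25, "task3": 0.30}  # hypothetical run
result = check_baseline(sample)
assert result == {"task1": True, "task2": True, "task3": False}
```

Wiring a check like this into the Day 7 gate turns "baseline scores are reproducible" into a one-line pass/fail.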
| </div> |
|
|
| |
| <div class="section" id="phase5"> |
| <div class="section-header"> |
| <div class="section-num">P5</div> |
| <div class="section-meta"> |
| <div class="section-title">Phase 5 – Deployment & Demo Agent</div> |
| <div class="section-sub">Days 8–9 (Apr 3–4). Goal: the environment is live on HuggingFace Spaces at a public URL, a PPO agent shows a rising reward curve, and the Colab demo notebook runs end-to-end.</div> |
| </div> |
| <span class="eyebrow-tag et-amber">Days 8–9</span> |
| </div> |
|
|
| <div class="gate-box"> |
| <div class="gate-icon">π</div> |
| <div> |
| <div class="gate-label">Phase Gate – End of Day 9</div> |
| <div class="gate-text"><strong>From a fresh machine with no local setup, running the Colab notebook completes all cells without error.</strong> The HuggingFace Spaces URL is public and all endpoints respond. The PPO reward curve plot shows a statistically increasing trend from first 5k steps to last 5k steps of training.</div> |
| </div> |
| </div> |
|
|
| <div class="grid2 mb14"> |
| <div> |
| <div class="label">HUGGINGFACE SPACES DEPLOYMENT</div> |
| <div class="deliverable-box"> |
| <div class="deliverable-title">Person B – Days 8–9</div> |
| <ul class="deliverable-list"> |
| <li><span class="dl-bullet dl-blue">B</span><div class="dl-text"><strong>Create HF Space with Docker SDK</strong> Go to huggingface.co/new-space. Select SDK: Docker. This will create a Dockerfile-based deployment where port 7860 is auto-exposed. Push your repo code.</div></li> |
| <li><span class="dl-bullet dl-blue">B</span><div class="dl-text"><strong>README.md HF frontmatter</strong> Add the required YAML block at the top of README.md: <code>title: InferenceGym, emoji: ποΈ, colorFrom: green, colorTo: blue, sdk: docker, pinned: false</code>. This controls the HF Space landing page.</div></li> |
| <li><span class="dl-bullet dl-blue">B</span><div class="dl-text"><strong>Health check verification</strong> After push, HF Spaces shows a build log. Wait for "Running" status. Hit the public URL's /health endpoint. If it doesn't respond in 2 minutes, check build logs for import errors – most commonly a missing package in requirements.txt.</div></li> |
| <li><span class="dl-bullet dl-blue">B</span><div class="dl-text"><strong>Stress test from live URL</strong> Run 10 concurrent reset+stepΓ5 loops against the live URL. Check /health shows active_sessions > 0 during the test. Confirm no 500 errors appear in HF Space logs.</div></li> |
| </ul> |
| </div> |
| </div> |
| <div> |
| <div class="label">PPO DEMO AGENT – Person C, Day 8</div> |
| <div class="deliverable-box"> |
| <div class="deliverable-title">Gym wrapper + stable-baselines3 PPO</div> |
| <ul class="deliverable-list"> |
| <li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>Write the InferenceGymEnv wrapper</strong> Subclass <code>gymnasium.Env</code>. <code>reset()</code> calls POST /reset. <code>step(action)</code> calls POST /step. <code>observation_space</code> is <code>Box(low=-inf, high=inf, shape=(12,))</code>. <code>action_space</code> is <code>Box</code> for continuous knobs.</div></li> |
| <li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>Run PPO for 50k steps on Task 1</strong> Use <code>stable_baselines3.PPO("MlpPolicy", env, verbose=1)</code>. Train 50k steps. Plot <code>ep_rew_mean</code> over time using matplotlib. It should go from ~0.1 at start to ~0.35+ by 50k steps.</div></li> |
| <li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>If PPO doesn't converge</strong> Check: (1) normalise observations with <code>VecNormalize</code>, (2) reduce learning rate to 1e-4, (3) increase n_steps to 2048, (4) check reward range is [-1,1] (it should be from InferenceEnv). The environment is designed to be learnable – the reward engineering is correct.</div></li> |
| </ul> |
| </div> |
| </div> |
| </div> |
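The stress test described above ("10 concurrent reset + step×5 loops") can be sketched as below. The transport is injected as a plain callable so the loop logic runs without a live Space; the request bodies are the /reset and /step shapes assumed earlier in this document, and `stub_post` is an illustration-only stand-in for `httpx.post(...).json()`:

```python
from concurrent.futures import ThreadPoolExecutor

def episode_loop(post, base_url, task_id=1):
    """One reset followed by five steps; `post(url, body) -> dict` is injected."""
    sid = post(f"{base_url}/reset", {"task_id": task_id, "seed": 42})["session_id"]
    last = None
    for _ in range(5):
        last = post(f"{base_url}/step",
                    {"session_id": sid,
                     "action": {"batch_size": 64, "kv_budget": 0.8}})
    return last

def stress(post, base_url, n_clients=10):
    # Run n_clients episode loops concurrently, mimicking concurrent agents
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        return list(pool.map(lambda _: episode_loop(post, base_url),
                             range(n_clients)))

# Stub transport standing in for httpx.post(...).json() against the live URL
def stub_post(url, body):
    if url.endswith("/reset"):
        return {"session_id": "s-1", "observation": {}}
    return {"observation": {}, "reward": 0.1, "done": False, "info": {}}

results = stress(stub_post, "https://example.hf.space")
assert len(results) == 10 and all(r["done"] is False for r in results)
```

Against the live deployment, replace `stub_post` with a thin wrapper over `httpx.post` and watch /health's `active_sessions` climb during the run.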
|
|
| <div class="label">COLAB DEMO NOTEBOOK STRUCTURE – Person C, Day 9</div> |
| <div class="code-block"> |
| <div class="code-block-header"> |
| <span class="code-lang">python</span> |
| <span class="code-file">notebooks/InferenceGym_Demo.ipynb – Cell-by-cell structure</span> |
| </div> |
| <div class="code-body"><pre><span class="cm"># Cell 1: Title markdown</span> |
| <span class="cm"># "# InferenceGym Demo – Meta PyTorch × Scaler Hackathon 2026"</span> |
|
|
| <span class="cm"># Cell 2: Install (runs in 90 seconds on Colab)</span> |
| !pip install stable-baselines3 gymnasium httpx pandas matplotlib -q |
|
|
| <span class="cm"># Cell 3: Connect to live environment</span> |
| HF_URL = <span class="st">"https://YOUR_ORG-inferencegym.hf.space"</span> |
| <span class="kw">import</span> httpx |
| response = httpx.get(<span class="kw">f</span><span class="st">"{HF_URL}/health"</span>) |
| print(<span class="st">"Environment status:"</span>, response.json()) |
|
|
| <span class="cm"># Cell 4: Show available tasks</span> |
| tasks = httpx.get(<span class="kw">f</span><span class="st">"{HF_URL}/tasks"</span>).json() |
| <span class="kw">for</span> t <span class="kw">in</span> tasks[<span class="st">'tasks'</span>]: print(<span class="kw">f</span><span class="st">"{t['id']}: {t['name']} ({t['difficulty']})"</span>) |
|
|
| <span class="cm"># Cell 5: Run baseline agent, show scores</span> |
| baseline = httpx.get(<span class="kw">f</span><span class="st">"{HF_URL}/baseline"</span>).json() |
| print(<span class="st">"Baseline scores (naïve vLLM defaults):"</span>, baseline[<span class="st">'scores'</span>]) |
|
|
| <span class="cm"># Cell 6: Manual episode – human in the loop</span> |
| res = httpx.post(<span class="kw">f</span><span class="st">"{HF_URL}/reset"</span>, json={<span class="st">"task_id"</span>: <span class="nm">1</span>, <span class="st">"seed"</span>: <span class="nm">42</span>}).json() |
| session_id = res[<span class="st">'session_id'</span>]; obs = res[<span class="st">'observation'</span>] |
| print(<span class="st">"Initial observation:"</span>, obs) |
|
|
| <span class="cm"># Cell 7: Run 10 manual steps with a smart action</span> |
| episode_log = [] |
| <span class="kw">for</span> _ <span class="kw">in</span> range(<span class="nm">10</span>): |
| result = httpx.post(<span class="kw">f</span><span class="st">"{HF_URL}/step"</span>, json={<span class="st">"session_id"</span>: session_id, |
| <span class="st">"action"</span>: {<span class="st">"kv_budget"</span>:<span class="nm">0.6</span>, <span class="st">"batch_size"</span>:<span class="nm">128</span>, <span class="st">"spec_length"</span>:<span class="nm">0</span>, <span class="st">"prefill_disagg"</span>:<span class="kw">False</span>, <span class="st">"quant_tier"</span>:<span class="nm">0</span>}}).json() |
| episode_log.append(result) |
|
|
| <span class="cm"># Cell 8: Gym wrapper</span> |
| <span class="kw">import</span> gymnasium <span class="kw">as</span> gym; <span class="kw">import</span> numpy <span class="kw">as</span> np; <span class="kw">import</span> httpx |
|
|
| <span class="kw">class</span> <span class="tp">InferenceGymEnv</span>(gym.Env): |
| <span class="kw">def</span> <span class="fn">__init__</span>(self, base_url, task_id=<span class="nm">1</span>): |
| self.url = base_url; self.task_id = task_id; self.session_id = <span class="kw">None</span> |
| self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(<span class="nm">12</span>,), dtype=np.float32) |
| self.action_space = gym.spaces.Box(  <span class="cm"># [kv_budget, batch_size]; other knobs stay at safe defaults</span> |
| low=np.array([<span class="nm">0.1</span>, <span class="nm">1.0</span>], dtype=np.float32), |
| high=np.array([<span class="nm">1.0</span>, <span class="nm">512.0</span>], dtype=np.float32)) |
| <span class="kw">def</span> <span class="fn">obs_to_array</span>(self, obs): <span class="kw">return</span> np.array(list(obs.values())[:12], dtype=np.float32) |
| <span class="kw">def</span> <span class="fn">reset</span>(self, **kwargs): |
| r = httpx.post(<span class="kw">f</span><span class="st">"{self.url}/reset"</span>, json={<span class="st">"task_id"</span>:self.task_id}).json() |
| self.session_id = r[<span class="st">'session_id'</span>]; <span class="kw">return</span> self.obs_to_array(r[<span class="st">'observation'</span>]), {} |
| <span class="kw">def</span> <span class="fn">step</span>(self, action): |
| act = {<span class="st">"kv_budget"</span>:float(action[<span class="nm">0</span>]), <span class="st">"spec_length"</span>:<span class="nm">0</span>, <span class="st">"batch_size"</span>:int(action[<span class="nm">1</span>]), |
| <span class="st">"prefill_disagg"</span>:<span class="kw">False</span>, <span class="st">"quant_tier"</span>:<span class="nm">0</span>} |
| r = httpx.post(<span class="kw">f</span><span class="st">"{self.url}/step"</span>, json={<span class="st">"session_id"</span>:self.session_id,<span class="st">"action"</span>:act}).json() |
| <span class="kw">return</span> self.obs_to_array(r[<span class="st">'observation'</span>]), r[<span class="st">'reward'</span>], r[<span class="st">'done'</span>], <span class="kw">False</span>, {} |
|
|
| <span class="cm"># Cell 9: Train PPO (takes ~10 minutes on Colab T4)</span> |
| <span class="kw">from</span> stable_baselines3 <span class="kw">import</span> PPO |
| env = InferenceGymEnv(HF_URL, task_id=<span class="nm">1</span>) |
| model = PPO(<span class="st">"MlpPolicy"</span>, env, verbose=<span class="nm">1</span>, learning_rate=<span class="nm">3e-4</span>, n_steps=<span class="nm">512</span>) |
| model.learn(total_timesteps=<span class="nm">50_000</span>) |
|
|
| <span class="cm"># Cell 10: Plot reward curve (the money shot; note ep_info_buffer only holds the most recent ~100 episodes)</span> |
| <span class="kw">import</span> matplotlib.pyplot <span class="kw">as</span> plt |
| rewards = [ep[<span class="st">'r'</span>] <span class="kw">for</span> ep <span class="kw">in</span> model.ep_info_buffer] |
| plt.figure(figsize=(<span class="nm">12</span>,<span class="nm">4</span>)); plt.plot(rewards, alpha=<span class="nm">0.3</span>, label=<span class="st">'Episode reward'</span>) |
| plt.axhline(y=<span class="nm">0.35</span>, color=<span class="st">'r'</span>, linestyle=<span class="st">'--'</span>, label=<span class="st">'Baseline score'</span>) |
| plt.title(<span class="st">'PPO Agent Learning on InferenceGym Task 1'</span>); plt.legend(); plt.show() |
| print(<span class="st">f"Final agent score: {np.mean(rewards[-20:]):.3f} vs baseline: 0.35"</span>)</pre></div> |
| </div> |
| </div> |
|
|
| |
| <div class="section" id="phase6"> |
| <div class="section-header"> |
| <div class="section-num">P6</div> |
| <div class="section-meta"> |
| <div class="section-title">Phase 6 — Polish, Writeup & Submission</div> |
| <div class="section-sub">Days 10–11 (Apr 5–7). Goal: every submission checklist item is ticked. The repo is clean. The writeup is compelling. The video is recorded. The form is submitted.</div> |
| </div> |
| <span class="eyebrow-tag et-purple">Days 10–11</span> |
| </div> |
|
|
| <div class="gate-box"> |
| <div class="gate-icon">🔒</div> |
| <div> |
| <div class="gate-label">Final Gate — Submit by Apr 7 11:59 PM</div> |
| <div class="gate-text"><strong>The submission form is filled with HF Space URL + GitHub repo URL.</strong> No code changes after submission. The repo is public, has a clean README, and contains no API keys or large binary files committed to git.</div> |
| </div> |
| </div> |
|
|
| <div class="grid2 mb14"> |
| <div> |
| <div class="label">ENVIRONMENT.md — Technical spec for judges</div> |
| <div class="deliverable-box"> |
| <div class="deliverable-title">Person A writes this on Day 10</div> |
| <ul class="deliverable-list"> |
| <li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Observation space table</strong> Full table with field name, type, range, and description for all 12 observation fields. Copy from models.py and expand.</div></li> |
| <li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Action space table</strong> Full table with field name, type, valid values, default, and effect when changed for all 5 action dimensions.</div></li> |
| <li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Reward function derivation</strong> Show the R = αT - βL - γV - δC formula with all constants, normalization choices, and why each weight was set the way it was.</div></li> |
| <li><span class="dl-bullet dl-green">A</span><div class="dl-text"><strong>Trace data methodology</strong> Document exactly what source data you used, how it was preprocessed, and why it's realistic. If using published benchmarks, cite them.</div></li> |
| </ul> |
| </div> |
| </div> |
| <div> |
| <div class="label">README.md — The first thing judges see</div> |
| <div class="deliverable-box"> |
| <div class="deliverable-title">Person C writes this on Day 10</div> |
| <ul class="deliverable-list"> |
| <li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>One-paragraph pitch first</strong> Before any technical content. Why does this environment matter? What problem does it solve? This should be the same words you'd use to pitch to a judge in 30 seconds.</div></li> |
| <li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>Quick start in 5 lines</strong> Show the curl commands to hit /health, /reset, /step, /grader. A judge who never reads further should still understand the API from these 5 lines.</div></li> |
| <li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>Baseline vs agent scores table</strong> Show a simple table: Task 1/2/3 Γ Baseline/PPO Agent. The numbers do the talking.</div></li> |
| <li><span class="dl-bullet dl-amber">C</span><div class="dl-text"><strong>Link to Colab notebook prominently</strong> "Open in Colab" badge. Judges who click this and see the reward curve rising will be convinced.</div></li> |
| </ul> |
| </div> |
| </div> |
| </div> |
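The reward-derivation item above is easier to write (and to sanity-check by hand) with a worked numeric example. Here is a minimal sketch of the R = αT - βL - γV - δC shape; the weights and normalisation constants below are placeholders, not the tuned values that ENVIRONMENT.md must document:

```python
# Hypothetical sketch of R = alpha*T - beta*L - gamma*V - delta*C.
# All weights and normalisers are illustrative placeholders; the real
# constants live in reward.py and must be documented in ENVIRONMENT.md.

def step_reward(tps, ttft_p99_ms, slo_violations, cost_per_1k,
                alpha=1.0, beta=0.5, gamma=0.3, delta=0.2,
                tps_norm=2000.0, latency_norm_ms=1000.0, cost_norm=1.0):
    """Return a scalar reward; each term is normalised to roughly [0, 1]."""
    T = min(tps / tps_norm, 1.0)                  # throughput (higher is better)
    L = min(ttft_p99_ms / latency_norm_ms, 1.0)   # tail-latency penalty
    V = float(slo_violations > 0)                 # SLO violation indicator
    C = min(cost_per_1k / cost_norm, 1.0)         # relative cost penalty
    return alpha * T - beta * L - gamma * V - delta * C

# 1200 tok/s, 350 ms p99, no violations, cheap config:
r = step_reward(tps=1200.0, ttft_p99_ms=350.0, slo_violations=0, cost_per_1k=0.4)
```

A table of two or three such worked rows (inputs, each term, final R) makes the "why each weight" discussion concrete for judges.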
|
|
| <div class="label">2-MINUTE DEMO VIDEO SCRIPT — Person C, Day 10</div> |
| <div class="table-wrap mb14"> |
| <table> |
| <tr><th>Time</th><th>Screen</th><th>What You Say / Show</th></tr> |
| <tr><td><strong>0:00–0:20</strong></td><td>Slide: problem statement</td><td>"LLM inference is where 80% of AI budget is spent. There's no RL environment for optimising it. We built one."</td></tr> |
| <tr><td><strong>0:20–0:40</strong></td><td>HF Space → /health → /tasks</td><td>"This is InferenceGym on HuggingFace Spaces, live right now. 3 tasks, 5 action knobs, fully CPU-only." Hit the endpoints live.</td></tr> |
| <tr><td><strong>0:40–1:00</strong></td><td>Colab → run baseline</td><td>"Naïve vLLM defaults score 0.35 on Task 1. That's your baseline — static config, no optimisation."</td></tr> |
| <tr><td><strong>1:00–1:30</strong></td><td>Colab → PPO reward curve</td><td>"A simple PPO agent trained for 50k steps hits 0.65 — almost double. No GPU, no model, just our trace-driven simulator." Show the plot.</td></tr> |
| <tr><td><strong>1:30–2:00</strong></td><td>Architecture diagram</td><td>"Any company can drop in their own trace data and train an agent for their specific workload. That's the value proposition. Thank you."</td></tr> |
| </table> |
| </div> |
| </div> |
|
|
| |
| <div class="section"> |
| <div class="section-header"> |
| <div class="section-num">TL</div> |
| <div class="section-meta"> |
| <div class="section-title">Complete 11-Day Timeline</div> |
| <div class="section-sub">Every person, every day. The critical path runs through Person A's simulator — protect it above all else.</div> |
| </div> |
| </div> |
|
|
| <div class="timeline"> |
| <div class="tl-row"> |
| <div class="tl-left"><div class="tl-day-label today">Mar 27<br>Day 1<br>TODAY</div></div> |
| <div class="tl-connector"><div class="tl-dot g"></div><div class="tl-line"></div></div> |
| <div class="tl-right"> |
| <div class="tl-card green-border"> |
| <div class="tl-phase" style="color:var(--green)">PHASE 0 — SETUP & ARCHITECTURE LOCK</div> |
| <ul class="tl-tasks-list"> |
| <li><span class="tl-person">A —</span><span class="tl-task"><strong>Design data schemas</strong> in models.py. Write skeleton TraceSimulator with hardcoded stub output. Design lookup table format.</span></li> |
| <li><span class="tl-person">B —</span><span class="tl-task"><strong>Create FastAPI app</strong> with all 8 endpoint stubs returning valid-shaped hardcoded JSON. Dockerfile builds. /health returns 200.</span></li> |
| <li><span class="tl-person">C —</span><span class="tl-task"><strong>Write grader rubric</strong> on paper for all 3 tasks. Download trace data. Write workload_configs.json. Agree on HF Space naming.</span></li> |
| <li><span class="tl-person">ALL —</span><span class="tl-task"><strong>Agree and commit models.py</strong> to main. This file cannot change after today without unanimous consent.</span></li> |
| </ul> |
| </div> |
| </div> |
| </div> |
| <div class="tl-row"> |
| <div class="tl-left"><div class="tl-day-label">Mar 28<br>Day 2</div></div> |
| <div class="tl-connector"><div class="tl-dot b"></div><div class="tl-line"></div></div> |
| <div class="tl-right"> |
| <div class="tl-card blue-border"> |
| <div class="tl-phase" style="color:var(--blue)">PHASE 1 — SIMULATOR CORE (Day 1 of 2)</div> |
| <ul class="tl-tasks-list"> |
| <li><span class="tl-person">A —</span><span class="tl-task"><strong>Implement TraceSimulator</strong> — load parquet, bilinear interpolation, Gaussian noise, OOM detection. Write WorkloadGenerator (Poisson arrivals, prompt sampling).</span></li> |
| <li><span class="tl-person">B —</span><span class="tl-task"><strong>Wire /reset and /step</strong> endpoints to the InferenceEnv stubs (not real yet — use A's skeleton). Test with curl that responses are correctly shaped.</span></li> |
| <li><span class="tl-person">C —</span><span class="tl-task"><strong>Process trace data</strong> — reshape into lookup table Parquet format with correct columns. Validate at least 50 data points across the batch×prompt grid. Start grader skeleton.</span></li> |
| </ul> |
| </div> |
| </div> |
| </div> |
| <div class="tl-row"> |
| <div class="tl-left"><div class="tl-day-label">Mar 29<br>Day 3</div></div> |
| <div class="tl-connector"><div class="tl-dot b"></div><div class="tl-line"></div></div> |
| <div class="tl-right"> |
| <div class="tl-card blue-border"> |
| <div class="tl-phase" style="color:var(--blue)">PHASE 1 — SIMULATOR CORE (Day 2 of 2) 🔒 CRITICAL GATE</div> |
| <ul class="tl-tasks-list"> |
| <li><span class="tl-person">A —</span><span class="tl-task"><strong>Complete WorkloadGenerator</strong> — queue depth, burst injection, spec acceptance model. Complete InferenceEnv.reset() and step(). All simulator unit tests pass.</span></li> |
| <li><span class="tl-person">B —</span><span class="tl-task"><strong>Wire all endpoints</strong> to real InferenceEnv (replacing stubs). Implement SessionManager. Test full reset→step×10 cycle via HTTP.</span></li> |
| <li><span class="tl-person">C —</span><span class="tl-task"><strong>Implement GraderModule skeleton</strong> with correct formula shape (even if constants need tuning). Run smoke test: score a 10-step episode log. Get any finite number.</span></li> |
| </ul> |
| </div> |
| </div> |
| </div> |
| <div class="tl-row"> |
| <div class="tl-left"><div class="tl-day-label">Mar 30<br>Day 4</div></div> |
| <div class="tl-connector"><div class="tl-dot b"></div><div class="tl-line"></div></div> |
| <div class="tl-right"> |
| <div class="tl-card blue-border"> |
| <div class="tl-phase" style="color:var(--blue)">PHASE 2 — ENVIRONMENT LOGIC COMPLETE</div> |
| <ul class="tl-tasks-list"> |
| <li><span class="tl-person">A —</span><span class="tl-task"><strong>Implement all 3 task configs</strong> (action masking for T1/T2, burst injection for T3). Full reward function with α β γ δ weights. Write full unit test suite (20+ tests).</span></li> |
| <li><span class="tl-person">B —</span><span class="tl-task"><strong>Build Dockerfile</strong> — multi-stage, confirm image <2GB. Run full Docker cycle locally. Implement /state, /info, /health endpoints. Add Pydantic request validation.</span></li> |
| <li><span class="tl-person">C —</span><span class="tl-task"><strong>Complete GraderModule</strong> — calibrate baseline TPS constants, write unit tests for all 3 task graders with known expected outputs. Score computation verified by hand.</span></li> |
| </ul> |
| </div> |
| </div> |
| </div> |
| <div class="tl-row"> |
| <div class="tl-left"><div class="tl-day-label">Mar 31<br>Day 5</div></div> |
| <div class="tl-connector"><div class="tl-dot b"></div><div class="tl-line"></div></div> |
| <div class="tl-right"> |
| <div class="tl-card blue-border"> |
| <div class="tl-phase" style="color:var(--blue)">PHASE 3 — API LAYER COMPLETE & OPENENV VALIDATED</div> |
| <ul class="tl-tasks-list"> |
| <li><span class="tl-person">A —</span><span class="tl-task"><strong>Full integration test</strong> — run 200-step episode for all 3 tasks programmatically. Confirm rewards are in [-1,1] range. Fix any edge cases (divide by zero, negative queue).</span></li> |
| <li><span class="tl-person">B —</span><span class="tl-task"><strong>Run openenv validate</strong> — fix any compliance issues. Implement /grader and /baseline endpoints (wiring C's modules). Add rate limiting and CORS middleware.</span></li> |
| <li><span class="tl-person">C —</span><span class="tl-task"><strong>Write BaselineAgent</strong> and run against all 3 tasks. Record expected scores (should be ~0.30-0.35 for T1, ~0.22-0.28 for T2, ~0.18-0.24 for T3). Adjust grader constants if needed.</span></li> |
| </ul> |
| </div> |
| </div> |
| </div> |
| <div class="tl-row"> |
| <div class="tl-left"><div class="tl-day-label">Apr 1<br>Day 6</div></div> |
| <div class="tl-connector"><div class="tl-dot a"></div><div class="tl-line"></div></div> |
| <div class="tl-right"> |
| <div class="tl-card amber-border"> |
| <div class="tl-phase" style="color:var(--amber)">PHASE 4 — GRADER & BASELINE COMPLETE</div> |
| <ul class="tl-tasks-list"> |
| <li><span class="tl-person">A —</span><span class="tl-task"><strong>Adversarial task stress test</strong> — run 1000-step Task 3 episodes, check burst injection fires at correct intervals, priority routing triggers, no state corruption.</span></li> |
| <li><span class="tl-person">B —</span><span class="tl-task"><strong>Concurrent session test</strong> — run 10 simultaneous reset→step×5 cycles, confirm no session leakage. Profile memory usage under load — must stay under 512MB.</span></li> |
| <li><span class="tl-person">C —</span><span class="tl-task"><strong>Write PPO gym wrapper</strong> (HTTPGymEnv). Start PPO training on Task 1. Set it running overnight — 50k steps should complete in ~4-6 hours on a modern CPU.</span></li> |
| </ul> |
| </div> |
| </div> |
| </div> |
| <div class="tl-row"> |
| <div class="tl-left"><div class="tl-day-label">Apr 2<br>Day 7</div></div> |
| <div class="tl-connector"><div class="tl-dot a"></div><div class="tl-line"></div></div> |
| <div class="tl-right"> |
| <div class="tl-card amber-border"> |
| <div class="tl-phase" style="color:var(--amber)">BUFFER DAY + INTERNAL DEMO</div> |
| <ul class="tl-tasks-list"> |
| <li><span class="tl-person">ALL —</span><span class="tl-task"><strong>Internal demo meeting</strong> — each person walks through the Colab notebook end to end. Find anything broken. Fix it today.</span></li> |
| <li><span class="tl-person">A —</span><span class="tl-task">Fix any bugs found in internal demo. Add /info endpoint with full JSON schema. Docstrings on all public methods.</span></li> |
| <li><span class="tl-person">C —</span><span class="tl-task"><strong>Review PPO training results</strong> — plot reward curve, verify it's increasing. If not, debug (check normalization, learning rate, reward scale). Start writing Colab notebook.</span></li> |
| </ul> |
| </div> |
| </div> |
| </div> |
| <div class="tl-row"> |
| <div class="tl-left"><div class="tl-day-label">Apr 3<br>Day 8</div></div> |
| <div class="tl-connector"><div class="tl-dot a"></div><div class="tl-line"></div></div> |
| <div class="tl-right"> |
| <div class="tl-card amber-border"> |
| <div class="tl-phase" style="color:var(--amber)">PHASE 5 — DEPLOYMENT</div> |
| <ul class="tl-tasks-list"> |
| <li><span class="tl-person">B —</span><span class="tl-task"><strong>Deploy to HuggingFace Spaces</strong> — push, watch build logs, verify all endpoints respond from live public URL. Document the URL in README.</span></li> |
| <li><span class="tl-person">C —</span><span class="tl-task"><strong>Complete Colab notebook</strong> — all 10 cells work end-to-end against the live HF Space URL. The notebook should run cold in under 15 minutes.</span></li> |
| <li><span class="tl-person">A —</span><span class="tl-task"><strong>Test from fresh machine</strong> — clone the repo, build Docker, run all tests. Confirm there are no hidden local dependencies. Fix whatever breaks.</span></li> |
| </ul> |
| </div> |
| </div> |
| </div> |
| <div class="tl-row"> |
| <div class="tl-left"><div class="tl-day-label">Apr 4<br>Day 9</div></div> |
| <div class="tl-connector"><div class="tl-dot a"></div><div class="tl-line"></div></div> |
| <div class="tl-right"> |
| <div class="tl-card amber-border"> |
| <div class="tl-phase" style="color:var(--amber)">PHASE 5 — DEMO COMPLETE</div> |
| <ul class="tl-tasks-list"> |
| <li><span class="tl-person">C —</span><span class="tl-task"><strong>Record 2-minute demo video</strong> using OBS or Loom. Follow the script. Upload to YouTube (unlisted) and link in README. Do not make it public until submission.</span></li> |
| <li><span class="tl-person">B —</span><span class="tl-task"><strong>Stress test live deployment</strong> — 50 concurrent requests, verify no 500 errors. Check HF Space memory and CPU usage stays stable.</span></li> |
| <li><span class="tl-person">ALL —</span><span class="tl-task"><strong>Write submission description draft</strong> (~500 words covering: problem, design, grader design, baseline vs agent results). Will refine on Day 10.</span></li> |
| </ul> |
| </div> |
| </div> |
| </div> |
| <div class="tl-row"> |
| <div class="tl-left"><div class="tl-day-label">Apr 5–6<br>Days 10–11</div></div> |
| <div class="tl-connector"><div class="tl-dot" style="background:var(--purple); border-color:var(--purple);"></div><div class="tl-line"></div></div> |
| <div class="tl-right"> |
| <div class="tl-card" style="border-color:var(--pborder);"> |
| <div class="tl-phase" style="color:var(--purple)">PHASE 6 — WRITEUP, POLISH & SUBMISSION PREP</div> |
| <ul class="tl-tasks-list"> |
| <li><span class="tl-person">A —</span><span class="tl-task"><strong>Write ENVIRONMENT.md</strong> — full technical spec for judges (observation space, action space, reward formula, task descriptions, simulator methodology).</span></li> |
| <li><span class="tl-person">C —</span><span class="tl-task"><strong>Write final README</strong> — pitch paragraph, quick start, baseline vs agent table, Colab link, video link. Run through the submission checklist line by line.</span></li> |
| <li><span class="tl-person">ALL —</span><span class="tl-task"><strong>Final end-to-end verification</strong> — test from a fresh browser with no cookies or local setup. Every endpoint must work. Grader must score any completed episode.</span></li> |
| </ul> |
| </div> |
| </div> |
| </div> |
| <div class="tl-row"> |
| <div class="tl-left"><div class="tl-day-label" style="color:var(--red);">Apr 7<br>DEADLINE</div></div> |
| <div class="tl-connector"><div class="tl-dot r"></div></div> |
| <div class="tl-right"> |
| <div class="tl-card" style="border-color:var(--rborder);"> |
| <div class="tl-phase" style="color:var(--red)">SUBMIT BY 11:59 PM — NO CODE CHANGES AFTER</div> |
| <ul class="tl-tasks-list"> |
| <li><span class="tl-person">ALL —</span><span class="tl-task">Submit HF Space URL + GitHub repo URL on hackathon portal. Fill in: env name, description, team members. Double check the HF Space is public.</span></li> |
| </ul> |
| </div> |
| </div> |
| </div> |
| </div> |
| </div> |
|
|
| |
| <div class="section" id="modules"> |
| <div class="section-header"> |
| <div class="section-num">§A</div> |
| <div class="section-meta"> |
| <div class="section-title">Appendix A — Full Module Specifications</div> |
| <div class="section-sub">Every file in the repository, what it owns, and the exact interface it must expose.</div> |
| </div> |
| </div> |
|
|
| <div class="label">COMPLETE FILE TREE WITH OWNERSHIP</div> |
| <div class="code-block"> |
| <div class="code-block-header"><span class="code-lang">text</span><span class="code-file">Repository structure</span></div> |
| <div class="code-body"><pre>inferencegym/ |
| ├── <span class="fn">models.py</span> <span class="cm">[ALL] — Locked Day 1. ServeAction, ServeObservation, MetricsSnapshot, WorkloadState</span> |
| │ |
| ├── <span class="fn">env/</span> |
| │   ├── inference_env.py <span class="cm">[A] — Core InferenceEnv class. reset(), step(), _compute_reward(), _enforce_action_mask()</span> |
| │   ├── observation.py <span class="cm">[A] — _build_obs() helper, normalise values to [0,1] for RL agents</span> |
| │   ├── action.py <span class="cm">[A] — ActionValidator, clamp continuous actions to valid ranges</span> |
| │   └── reward.py <span class="cm">[A] — RewardComputer, configurable α β γ δ, TASK_CONFIGS dict</span> |
| │ |
| ├── <span class="fn">simulator/</span> |
| │   ├── trace_sim.py <span class="cm">[A] — TraceSimulator: load parquet, interpolate, noise, OOM detection</span> |
| │   ├── workload.py <span class="cm">[A] — WorkloadGenerator: Poisson, LogNormal, burst injection, queue</span> |
| │   ├── session_manager.py <span class="cm">[B] — SessionManager: thread-safe LRU cache of InferenceEnv instances</span> |
| │   └── data/ |
| │       ├── traces_llama3_8b.parquet <span class="cm">[C] — lookup table: (batch,kv,spec,plen) → metrics</span> |
| │       ├── sharegpt_dist.json <span class="cm">[C] — LogNormal params for Task 2 prompt distribution</span> |
| │       └── workload_configs.json <span class="cm">[C] — Task 1/2/3 workload configuration parameters</span> |
| │ |
| ├── <span class="fn">grader/</span> |
| │   ├── grader.py <span class="cm">[C] — GraderModule: dispatches to per-task graders, returns score+breakdown</span> |
| │   ├── task1_grader.py <span class="cm">[C] — Throughput normalisation formula</span> |
| │   ├── task2_grader.py <span class="cm">[C] — TTFT + memory compliance formula</span> |
| │   └── task3_grader.py <span class="cm">[C] — 4-objective formula including action stability</span> |
| │ |
| ├── <span class="fn">agents/</span> |
| │   ├── baseline.py <span class="cm">[C] — BaselineAgent: fixed BASELINE_ACTION, run_all_baselines()</span> |
| │   └── ppo_demo.py <span class="cm">[C] — HTTPGymEnv wrapper + PPO training script</span> |
| │ |
| ├── <span class="fn">server/</span> |
| │   ├── app.py <span class="cm">[B] — FastAPI application, all 8 endpoints, startup event</span> |
| │   ├── schemas.py <span class="cm">[B] — Pydantic request/response models (ResetRequest, StepRequest, etc.)</span> |
| │   └── middleware.py <span class="cm">[B] — CORS, rate limiting (max 100 req/min per IP), request logging</span> |
| │ |
| ├── <span class="fn">tests/</span> |
| │   ├── test_simulator.py <span class="cm">[A] — 20+ unit tests for TraceSimulator and WorkloadGenerator</span> |
| │   ├── test_env.py <span class="cm">[A] — Contract tests for step/reset/state, edge cases</span> |
| │   ├── test_grader.py <span class="cm">[C] — Unit tests for all 3 grader formulas with known expected outputs</span> |
| │   └── test_api.py <span class="cm">[B] — Integration tests: httpx client hitting full FastAPI stack</span> |
| │ |
| ├── <span class="fn">notebooks/</span> |
| │   └── InferenceGym_Demo.ipynb <span class="cm">[C] — 10-cell Colab demo notebook</span> |
| │ |
| ├── Dockerfile <span class="cm">[B] — Multi-stage, CPU-only, port 7860, <2GB image</span> |
| ├── docker-compose.yml <span class="cm">[B] — Local dev: volume mount source, hot reload</span> |
| ├── requirements.txt <span class="cm">[B] — Pinned CPU-only deps. No torch. No CUDA.</span> |
| ├── README.md <span class="cm">[C] — HF Spaces frontmatter + pitch + quickstart + links</span> |
| └── ENVIRONMENT.md <span class="cm">[A] — Full technical spec for judges</span></pre></div> |
| </div> |
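The session_manager.py entry above is one of the few pieces with real concurrency risk (see the Day 6 concurrent-session test). A minimal sketch of a thread-safe LRU session cache, assuming UUID keys, an OrderedDict, and an arbitrary capacity of 64; the real capacity and eviction policy are Person B's call:

```python
# Hypothetical sketch of the SessionManager contract from the file tree:
# a thread-safe LRU cache mapping session_id -> environment instance.
import threading
import uuid
from collections import OrderedDict

class SessionManager:
    def __init__(self, capacity=64):
        self._lock = threading.Lock()
        self._sessions = OrderedDict()  # session_id -> env, oldest first
        self._capacity = capacity

    def create(self, env):
        """Register a new episode; evict the least recently used if full."""
        session_id = str(uuid.uuid4())
        with self._lock:
            if len(self._sessions) >= self._capacity:
                self._sessions.popitem(last=False)  # drop oldest session
            self._sessions[session_id] = env
        return session_id

    def get(self, session_id):
        """Fetch an env and mark it most recently used; None if evicted/unknown."""
        with self._lock:
            env = self._sessions.get(session_id)
            if env is not None:
                self._sessions.move_to_end(session_id)
            return env
```

Holding one lock around every cache operation keeps the Day 6 "10 simultaneous cycles" test simple to reason about; per-session locks are only worth adding if profiling shows contention.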
|
|
| <div class="label">MODULE INTERFACE CONTRACTS — What each module must expose</div> |
| <div class="module-grid"> |
| <div class="module-card"> |
| <div class="module-card-header"> |
| <span class="module-card-name">TraceSimulator</span> |
| <span class="module-card-file">simulator/trace_sim.py</span> |
| </div> |
| <div class="module-card-body"> |
| <ul class="spec-list"> |
| <li><code>__init__(trace_path: str, seed: int = 42)</code> — loads parquet, builds interpolators, sets rng</li> |
| <li><code>simulate(action: ServeAction, workload: WorkloadState) → MetricsSnapshot</code> — the core method</li> |
| <li><code>reset_seed(seed: int)</code> — resets the rng for episode reproducibility</li> |
| <li>Must not raise exceptions on valid input. OOM conditions are returned as data, not exceptions.</li> |
| </ul> |
| </div> |
| </div> |
| <div class="module-card"> |
| <div class="module-card-header"> |
| <span class="module-card-name">WorkloadGenerator</span> |
| <span class="module-card-file">simulator/workload.py</span> |
| </div> |
| <div class="module-card-body"> |
| <ul class="spec-list"> |
| <li><code>__init__(task_id: int, seed: int = 42)</code> — loads workload config for this task</li> |
| <li><code>reset() → WorkloadState</code> — returns initial state, resets internal step counter</li> |
| <li><code>step(action: ServeAction) → WorkloadState</code> — advances one step, updates queue</li> |
| <li><code>is_burst_active() → bool</code> — True during burst windows for Task 3</li> |
| </ul> |
| </div> |
| </div> |
| <div class="module-card"> |
| <div class="module-card-header"> |
| <span class="module-card-name">InferenceEnv</span> |
| <span class="module-card-file">env/inference_env.py</span> |
| </div> |
| <div class="module-card-body"> |
| <ul class="spec-list"> |
| <li><code>reset() → ServeObservation</code> — starts new episode, returns initial observation</li> |
| <li><code>step(action) → (obs, reward, done, info)</code> — Gym-compatible signature</li> |
| <li><code>state() → dict</code> — returns episode metadata for /state endpoint</li> |
| <li><code>_episode_log: list</code> — accumulates step dicts for grader consumption</li> |
| <li><code>session_id: str</code> — unique UUID per episode, set on reset()</li> |
| </ul> |
| </div> |
| </div> |
| <div class="module-card"> |
| <div class="module-card-header"> |
| <span class="module-card-name">GraderModule</span> |
| <span class="module-card-file">grader/grader.py</span> |
| </div> |
| <div class="module-card-body"> |
| <ul class="spec-list"> |
| <li><code>score(task_id: int, episode_log: list) → dict</code> — returns <code>{score, breakdown, feedback}</code></li> |
| <li>Must be stateless — no internal mutable state. Same input → same output always.</li> |
| <li><code>score</code> must be a float in [0.0, 1.0]</li> |
| <li><code>breakdown</code> must contain one float per scoring component</li> |
| <li><code>feedback</code> must be a human-readable string explaining the score</li> |
| </ul> |
| </div> |
| </div> |
| </div> |
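The GraderModule contract can be pinned down with a stub that satisfies every bullet: stateless, score clamped to [0.0, 1.0], a per-component breakdown, and readable feedback. The mean-reward formula here is illustrative only; the real per-task formulas belong in task1/2/3_grader.py:

```python
# Minimal sketch of the GraderModule contract. Stateless: same input,
# same output. The scoring rule (mean step reward rescaled from roughly
# [-1, 1] into [0, 1]) is a placeholder, not the real grader formula.

def score(task_id: int, episode_log: list) -> dict:
    if not episode_log:
        return {"score": 0.0, "breakdown": {}, "feedback": "Empty episode log."}
    mean_reward = sum(step["reward"] for step in episode_log) / len(episode_log)
    # Clamp into the required [0.0, 1.0] score range.
    s = min(max((mean_reward + 1.0) / 2.0, 0.0), 1.0)
    return {
        "score": s,
        "breakdown": {"mean_reward": mean_reward},
        "feedback": f"Task {task_id}: mean step reward {mean_reward:.3f} over "
                    f"{len(episode_log)} steps.",
    }
```

A stub like this is also what Day 3's "score a 10-step episode log, get any finite number" gate needs.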
| </div> |
|
|
| |
| <div class="section" id="dataschema"> |
| <div class="section-header"> |
| <div class="section-num">§B</div> |
| <div class="section-meta"> |
| <div class="section-title">Appendix B — Data Schemas & Complete API Reference</div> |
| </div> |
| </div> |
|
|
| <div class="label">LOOKUP TABLE PARQUET SCHEMA — traces_llama3_8b.parquet</div> |
| <div class="table-wrap mb14"> |
| <table> |
| <tr><th>Column</th><th>Type</th><th>Values</th><th>Description</th></tr> |
| <tr><td><code>batch_size</code></td><td>int</td><td>1,4,8,16,32,64,128,256,512</td><td>Max concurrent requests served</td></tr> |
| <tr><td><code>kv_budget</code></td><td>float</td><td>0.1, 0.25, 0.5, 0.75, 1.0</td><td>KV cache allocation fraction</td></tr> |
| <tr><td><code>spec_length</code></td><td>int</td><td>0, 1, 2, 4, 8</td><td>Speculative draft tokens (0 = disabled)</td></tr> |
| <tr><td><code>quant_tier</code></td><td>int</td><td>0, 1, 2</td><td>0=FP16, 1=INT8, 2=INT4</td></tr> |
| <tr><td><code>prompt_len_bucket</code></td><td>int</td><td>0–7</td><td>Bucket index: [64,128,256,512,1024,2048,4096,8192]</td></tr> |
| <tr><td><code>ttft_p50_ms</code></td><td>float</td><td>>0</td><td>Median time to first token (milliseconds)</td></tr> |
| <tr><td><code>ttft_p99_ms</code></td><td>float</td><td>>0</td><td>99th percentile TTFT</td></tr> |
| <tr><td><code>tpot_ms</code></td><td>float</td><td>>0</td><td>Time per output token</td></tr> |
| <tr><td><code>tps</code></td><td>float</td><td>>0</td><td>Output tokens per second</td></tr> |
| <tr><td><code>gpu_mem_gb</code></td><td>float</td><td>0–80</td><td>GPU memory footprint in GB</td></tr> |
| <tr><td><code>cost_per_1k</code></td><td>float</td><td>>0</td><td>Relative cost per 1000 tokens (normalised)</td></tr> |
| </table> |
| </div> |
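TraceSimulator has to serve actions that fall between the grid points in this table (e.g. batch_size 12, kv_budget 0.625). A minimal sketch of bilinear interpolation over two of the key columns, using a fabricated 2×2 tps grid rather than real trace values:

```python
# Sketch of interpolating between lookup-table grid points. The grid values
# below are fabricated; the real numbers come from traces_llama3_8b.parquet,
# and the full table interpolates over more dimensions than shown here.
import numpy as np

batch_axis = np.array([8.0, 16.0])       # two adjacent batch_size grid points
kv_axis = np.array([0.5, 0.75])          # two adjacent kv_budget grid points
tps_grid = np.array([[900.0, 1000.0],    # tps at (batch=8,  kv=0.5 / 0.75)
                     [1500.0, 1700.0]])  # tps at (batch=16, kv=0.5 / 0.75)

def bilinear_tps(batch, kv):
    """Bilinearly interpolate tps at an off-grid (batch_size, kv_budget)."""
    tx = (batch - batch_axis[0]) / (batch_axis[1] - batch_axis[0])
    ty = (kv - kv_axis[0]) / (kv_axis[1] - kv_axis[0])
    top = (1 - ty) * tps_grid[0, 0] + ty * tps_grid[0, 1]
    bot = (1 - ty) * tps_grid[1, 0] + ty * tps_grid[1, 1]
    return (1 - tx) * top + tx * bot

tps = bilinear_tps(batch=12.0, kv=0.625)  # midpoint of both axes
```

In the full simulator this generalises to scipy's RegularGridInterpolator over all key columns, with Gaussian noise layered on top.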
|
|
| <div class="label">WORKLOAD CONFIGS — workload_configs.json structure</div> |
| <div class="code-block"> |
| <div class="code-block-header"><span class="code-lang">json</span><span class="code-file">simulator/data/workload_configs.json</span></div> |
| <div class="code-body"><pre>{ |
| <span class="st">"tasks"</span>: { |
| <span class="st">"1"</span>: { |
| <span class="st">"name"</span>: <span class="st">"Static Uniform"</span>, |
| <span class="st">"arrival_rate_rps"</span>: <span class="nm">10.0</span>, |
| <span class="st">"arrival_dist"</span>: <span class="st">"poisson"</span>, |
| <span class="st">"prompt_len_dist"</span>: <span class="st">"uniform"</span>, |
| <span class="st">"prompt_len_min"</span>: <span class="nm">64</span>, |
| <span class="st">"prompt_len_max"</span>: <span class="nm">128</span>, |
| <span class="st">"slo_target_ms"</span>: <span class="nm">500.0</span>, |
| <span class="st">"burst_enabled"</span>: <span class="kw">false</span>, |
| <span class="st">"priority_routing"</span>: <span class="kw">false</span>, |
| <span class="st">"active_actions"</span>: [<span class="st">"kv_budget"</span>, <span class="st">"batch_size"</span>] |
| }, |
| <span class="st">"2"</span>: { |
| <span class="st">"name"</span>: <span class="st">"Bursty ShareGPT"</span>, |
| <span class="st">"arrival_rate_rps"</span>: <span class="nm">25.0</span>, |
| <span class="st">"arrival_rate_burst"</span>: <span class="nm">80.0</span>, |
| <span class="st">"burst_period_steps"</span>: <span class="nm">30</span>, |
| <span class="st">"arrival_dist"</span>: <span class="st">"poisson_bursty"</span>, |
| <span class="st">"prompt_len_dist"</span>: <span class="st">"lognormal"</span>, |
| <span class="st">"prompt_len_mu"</span>: <span class="nm">5.2</span>, |
| <span class="st">"prompt_len_sigma"</span>: <span class="nm">1.3</span>, |
| <span class="st">"prompt_len_clamp_min"</span>: <span class="nm">32</span>, |
| <span class="st">"prompt_len_clamp_max"</span>: <span class="nm">8192</span>, |
| <span class="st">"memory_hard_limit_gb"</span>: <span class="nm">36.0</span>, |
| <span class="st">"slo_target_ms"</span>: <span class="nm">300.0</span>, |
| <span class="st">"burst_enabled"</span>: <span class="kw">true</span>, |
| <span class="st">"active_actions"</span>: [<span class="st">"kv_budget"</span>, <span class="st">"batch_size"</span>, <span class="st">"spec_length"</span>] |
| }, |
| <span class="st">"3"</span>: { |
| <span class="st">"name"</span>: <span class="st">"Adversarial Multi-Tenant"</span>, |
| <span class="st">"arrival_rate_rps"</span>: <span class="nm">30.0</span>, |
| <span class="st">"burst_multiplier"</span>: <span class="nm">10.0</span>, |
| <span class="st">"burst_interval_steps"</span>: <span class="nm">120</span>, |
| <span class="st">"burst_duration_steps"</span>: <span class="nm">15</span>, |
| <span class="st">"prompt_len_dist"</span>: <span class="st">"bimodal"</span>, |
| <span class="st">"short_request_frac"</span>: <span class="nm">0.7</span>, |
| <span class="st">"short_prompt_max"</span>: <span class="nm">128</span>, |
| <span class="st">"long_prompt_min"</span>: <span class="nm">4096</span>, |
| <span class="st">"long_prompt_max"</span>: <span class="nm">8192</span>, |
| <span class="st">"priority_mix"</span>: [<span class="nm">0.2</span>, <span class="nm">0.5</span>, <span class="nm">0.3</span>], |
| <span class="st">"slo_interactive_ms"</span>: <span class="nm">200.0</span>, |
| <span class="st">"slo_batch_ms"</span>: <span class="nm">2000.0</span>, |
| <span class="st">"cost_budget_episode"</span>: <span class="nm">5.0</span>, |
| <span class="st">"memory_hard_limit_gb"</span>: <span class="nm">38.0</span>, |
| <span class="st">"active_actions"</span>: [<span class="st">"kv_budget"</span>, <span class="st">"batch_size"</span>, <span class="st">"spec_length"</span>, <span class="st">"prefill_disagg"</span>, <span class="st">"quant_tier"</span>] |
| } |
| } |
| }</pre></div> |
| </div> |
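The Task 2 prompt-length parameters above (lognormal with mu 5.2, sigma 1.3, clamped to [32, 8192]) can be sampled in a few lines of numpy. A minimal sketch: the function name <code>sample_prompt_lens</code> is illustrative, not part of the spec.

```python
import numpy as np

def sample_prompt_lens(n, mu=5.2, sigma=1.3, clamp_min=32, clamp_max=8192, seed=0):
    """Draw n prompt lengths from a clamped lognormal, per Task 2's config."""
    rng = np.random.default_rng(seed)
    lens = rng.lognormal(mean=mu, sigma=sigma, size=n)
    # Clamp to the configured token range, then round down to whole tokens.
    return np.clip(lens, clamp_min, clamp_max).astype(int)

lens = sample_prompt_lens(1000)
```

With these parameters the median draw is exp(5.2), roughly 181 tokens, with a heavy right tail that the clamp truncates at 8192.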
|
|
| <div class="label">COMPLETE OBSERVATION & ACTION SPACE REFERENCE</div> |
| <div class="table-wrap mb14"> |
| <table> |
| <tr><th>Field</th><th>Type</th><th>Range</th><th>Normalised?</th><th>Description</th></tr> |
| <tr><td><code>queue_depth</code></td><td>float</td><td>[0, 512]</td><td>No</td><td>Pending requests in serving queue</td></tr> |
| <tr><td><code>mean_prompt_len</code></td><td>float</td><td>[32, 8192]</td><td>No</td><td>Mean token count of current window</td></tr> |
| <tr><td><code>arrival_rate</code></td><td>float</td><td>[0, 200]</td><td>No</td><td>10-step EMA of requests per second</td></tr> |
| <tr><td><code>kv_cache_occupancy</code></td><td>float</td><td>[0.0, 1.0]</td><td>Yes</td><td>Fraction of KV cache in use</td></tr> |
| <tr><td><code>ttft_p50</code></td><td>float</td><td>[0, 5000] ms</td><td>No</td><td>Median TTFT last 20 requests</td></tr> |
| <tr><td><code>tpot_p50</code></td><td>float</td><td>[0, 500] ms</td><td>No</td><td>Median time-per-output-token</td></tr> |
| <tr><td><code>slo_violation_rate</code></td><td>float</td><td>[0.0, 1.0]</td><td>Yes</td><td>Fraction of requests missing SLO</td></tr> |
| <tr><td><code>gpu_memory_used_gb</code></td><td>float</td><td>[0, 80]</td><td>No</td><td>Simulated GPU memory pressure</td></tr> |
| <tr><td><code>spec_accept_rate</code></td><td>float</td><td>[0.0, 1.0]</td><td>Yes</td><td>Speculative token acceptance rate</td></tr> |
| <tr><td><code>priority_distribution</code></td><td>float[3]</td><td>[0,1] each</td><td>Yes</td><td>[interactive, batch, best_effort] fractions</td></tr> |
| <tr><td><code>timestep</code></td><td>int</td><td>[0, 200]</td><td>No</td><td>Current episode step</td></tr> |
| <tr><td><code>cost_so_far</code></td><td>float</td><td>[0, ∞)</td><td>No</td><td>Cumulative cost this episode</td></tr> |
| </table> |
| </div> |
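The table above flattens to a 14-dimensional vector (priority_distribution contributes three entries). A minimal sketch of assembling it, assuming per-field max-normalisation; the <code>BOUNDS</code> dict, <code>build_obs</code> name, and the tanh squash for the unbounded <code>cost_so_far</code> are illustrative assumptions, not the spec.

```python
import numpy as np

# Upper bounds taken from the observation table; cost_so_far is unbounded,
# so it is squashed with tanh (an assumption, not the documented behaviour).
BOUNDS = {
    "queue_depth": 512, "mean_prompt_len": 8192, "arrival_rate": 200,
    "ttft_p50": 5000, "tpot_p50": 500, "gpu_memory_used_gb": 80,
    "timestep": 200,
}

def build_obs(state):
    """Flatten a raw metrics dict into a 14-dim float32 vector in roughly [0, 1]."""
    vec = [
        state["queue_depth"] / BOUNDS["queue_depth"],
        state["mean_prompt_len"] / BOUNDS["mean_prompt_len"],
        state["arrival_rate"] / BOUNDS["arrival_rate"],
        state["kv_cache_occupancy"],          # already normalised
        state["ttft_p50"] / BOUNDS["ttft_p50"],
        state["tpot_p50"] / BOUNDS["tpot_p50"],
        state["slo_violation_rate"],          # already normalised
        state["gpu_memory_used_gb"] / BOUNDS["gpu_memory_used_gb"],
        state["spec_accept_rate"],            # already normalised
        *state["priority_distribution"],      # three entries
        state["timestep"] / BOUNDS["timestep"],
        np.tanh(state["cost_so_far"]),        # unbounded, so squash
    ]
    return np.asarray(vec, dtype=np.float32)

obs = build_obs({
    "queue_depth": 12, "mean_prompt_len": 96, "arrival_rate": 10.0,
    "kv_cache_occupancy": 0.3, "ttft_p50": 250.0, "tpot_p50": 40.0,
    "slo_violation_rate": 0.05, "gpu_memory_used_gb": 24.0,
    "spec_accept_rate": 0.7, "priority_distribution": [0.2, 0.5, 0.3],
    "timestep": 10, "cost_so_far": 0.8,
})
```

Keeping every component in roughly [0, 1] is what makes the later VecNormalize mitigation optional rather than mandatory.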
| </div> |
|
|
| |
| <div class="section" id="risks"> |
| <div class="section-header"> |
| <div class="section-num">§C</div> |
| <div class="section-meta"> |
| <div class="section-title">Appendix C – Risk Register</div> |
| <div class="section-sub">Every known failure mode, its probability, and exact mitigation steps.</div> |
| </div> |
| </div> |
|
|
| <div class="table-wrap mb14"> |
| <table> |
| <tr> |
| <th style="width:200px">Risk</th> |
| <th style="width:80px">Prob</th> |
| <th>Mitigation</th> |
| <th style="width:80px">Owner</th> |
| </tr> |
| <tr> |
| <td><strong>Trace data is wrong shape</strong><br><span style="font-size:10px;color:var(--text3)">Published benchmarks don't have the exact columns needed</span></td> |
| <td><span class="eyebrow-tag et-amber">Medium</span></td> |
| <td>Implement Option C (synthetic data) on Day 1 before even trying Option A. This takes 30 minutes and gives you a valid fallback. Option A then becomes an enhancement, not a dependency.</td> |
| <td>C</td> |
| </tr> |
| <tr> |
| <td><strong>PPO doesn't converge</strong><br><span style="font-size:10px;color:var(--text3)">Reward curve is flat or decreasing</span></td> |
| <td><span class="eyebrow-tag et-green">Low</span></td> |
| <td>Task 1 is designed for easy learning. If PPO fails: (1) add VecNormalize wrapper, (2) lower learning rate to 1e-4, (3) check reward is truly in [-1,1]. If still failing, use a simple hill-climbing agent – just show any rising curve.</td> |
| <td>C</td> |
| </tr> |
| <tr> |
| <td><strong>HuggingFace Spaces OOM</strong><br><span style="font-size:10px;color:var(--text3)">Free tier has 16GB RAM – simulator might use too much</span></td> |
| <td><span class="eyebrow-tag et-green">Low</span></td> |
| <td>Load trace data as a numpy array, not a pandas DataFrame, at startup. Target &lt;200MB for the lookup table. Use <code>parquet</code> with snappy compression. Test memory usage locally with <code>psutil</code> before deploying.</td> |
| <td>B</td> |
| </tr> |
| <tr> |
| <td><strong>Race condition in session cache</strong><br><span style="font-size:10px;color:var(--text3)">Concurrent requests corrupt session state</span></td> |
| <td><span class="eyebrow-tag et-amber">Medium</span></td> |
| <td>All reads and writes to the <code>self._sessions</code> dict are guarded by a <code>threading.Lock()</code>. Individual <code>InferenceEnv</code> instances are not thread-safe, but each session is owned by one caller at a time – this is fine because the /step endpoint is synchronous and FastAPI serialises calls per session_id.</td> |
| <td>B</td> |
| </tr> |
| <tr> |
| <td><strong>Grader gives score &gt; 1.0 or &lt; 0.0</strong><br><span style="font-size:10px;color:var(--text3)">Formula constants are miscalibrated</span></td> |
| <td><span class="eyebrow-tag et-amber">Medium</span></td> |
| <td>All grader component scores are individually clipped with <code>np.clip(x, 0.0, 1.0)</code> before the weighted sum, and the final score is clipped as well. Calibrate the BASELINE_TPS and OPTIMAL_TPS constants on Day 5 by running the actual baseline agent and verifying its scores fall in [0.20, 0.40].</td> |
| <td>C</td> |
| </tr> |
| <tr> |
| <td><strong>Person A is blocked on Day 3</strong><br><span style="font-size:10px;color:var(--text3)">Simulator not done, Person B and C can't proceed</span></td> |
| <td><span class="eyebrow-tag et-amber">Medium</span></td> |
| <td>Person A prioritises the interface (<code>simulate()</code> returns a valid <code>MetricsSnapshot</code>) over implementation quality. A synthetic linear model with hardcoded constants is enough for Day 3; Person B and C only need the method signature to work. Real trace data can be plugged in on Day 4.</td> |
| <td>A</td> |
| </tr> |
| <tr> |
| <td><strong>Docker image >2GB</strong><br><span style="font-size:10px;color:var(--text3)">stable-baselines3 pulls large PyTorch dependency</span></td> |
| <td><span class="eyebrow-tag et-amber">Medium</span></td> |
| <td>Install <code>stable-baselines3[extra]</code> only in a separate <code>requirements-demo.txt</code> that is NOT in the Dockerfile. The server only needs the environment. The PPO demo runs from outside the container (in Colab). This keeps the image under 500MB.</td> |
| <td>B</td> |
| </tr> |
| <tr> |
| <td><strong>OpenEnv spec compliance fails</strong><br><span style="font-size:10px;color:var(--text3)">openenv validate finds schema mismatches</span></td> |
| <td><span class="eyebrow-tag et-green">Low</span></td> |
| <td>Run <code>openenv validate</code> at the end of every day starting Day 3. Validation issues are almost always JSON-schema problems – field names, types, missing fields. Fix them immediately, never defer. Keep a local copy of the openenv spec open while writing endpoint response schemas.</td> |
| <td>B</td> |
| </tr> |
| </table> |
| </div> |
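The lock discipline described in the race-condition row can be sketched as follows; <code>SessionStore</code> and its method names are illustrative stand-ins, not the actual server code.

```python
import threading
import uuid

class SessionStore:
    """Minimal sketch of a lock-guarded session dict.

    Every access to the shared dict goes through one lock; the env objects
    themselves stay single-owner, matching the mitigation above.
    """
    def __init__(self):
        self._sessions = {}
        self._lock = threading.Lock()

    def create(self, env):
        sid = uuid.uuid4().hex
        with self._lock:            # guard every write to the dict
            self._sessions[sid] = env
        return sid

    def get(self, sid):
        with self._lock:            # guard every read as well
            return self._sessions[sid]

    def drop(self, sid):
        with self._lock:
            self._sessions.pop(sid, None)

store = SessionStore()
sid = store.create({"step": 0})
```

The lock only protects the dict's structure; per-session state stays safe because each session_id has exactly one synchronous caller at a time.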
| </div> |
|
|
| |
| <div class="section" id="checklist"> |
| <div class="section-header"> |
| <div class="section-num">§D</div> |
| <div class="section-meta"> |
| <div class="section-title">Appendix D – Final Submission Checklist</div> |
| <div class="section-sub">Every item must be checked before submitting. Do not submit until all boxes are ticked.</div> |
| </div> |
| </div> |
|
|
| <div class="grid2 mb20"> |
| <div> |
| <div class="label">OPENENV COMPLIANCE</div> |
| <ul class="checklist"> |
| <li><div class="chk"></div>POST /reset returns <code>session_id</code> + initial <code>observation</code> dict</li> |
| <li><div class="chk"></div>POST /step returns <code>observation</code> + <code>reward</code> (float) + <code>done</code> (bool) + <code>info</code></li> |
| <li><div class="chk"></div>GET /state returns current episode metadata</li> |
| <li><div class="chk"></div>GET /tasks returns 3 tasks with id, name, difficulty labels</li> |
| <li><div class="chk"></div>POST /grader returns score 0.0–1.0 + breakdown dict + feedback string</li> |
| <li><div class="chk"></div>GET /baseline returns reproducible baseline scores for all 3 tasks</li> |
| <li><div class="chk"></div>GET /health returns <code>{"status": "ok"}</code></li> |
| <li><div class="chk"></div><code>openenv validate --url https://YOUR_SPACE.hf.space</code> passes with no errors</li> |
| <li><div class="chk"></div>3 tasks with easy/medium/hard difficulty labels present</li> |
| <li><div class="chk"></div>Reward function documented with partial credit design</li> |
| </ul> |
| </div> |
| <div> |
| <div class="label">QUALITY CRITERIA</div> |
| <ul class="checklist"> |
| <li><div class="chk"></div>Baseline agent runs reproducibly (fixed seed=0, same score every run)</li> |
| <li><div class="chk"></div>PPO reward curve plot shows a clear upward trend</li> |
| <li><div class="chk"></div>Colab notebook runs end-to-end in &lt;15 minutes on free T4</li> |
| <li><div class="chk"></div>README has: pitch paragraph, quickstart, scores table, Colab link, video link</li> |
| <li><div class="chk"></div>ENVIRONMENT.md has full technical spec</li> |
| <li><div class="chk"></div>No API keys, no secrets in repository</li> |
| <li><div class="chk"></div>No large binary files committed to git (use .gitignore for *.parquet – serve from HF repo)</li> |
| <li><div class="chk"></div>Grader is deterministic (run same episode log twice, get same score)</li> |
| <li><div class="chk"></div>2-minute demo video recorded and linked in README</li> |
| <li><div class="chk"></div>HF Space is public (not private or gated)</li> |
| </ul> |
| </div> |
| </div> |
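The grader items above (score clipped to [0, 1], deterministic re-runs, partial credit) can be sketched in a few lines; the weights, component formulas, and constants here are illustrative placeholders, not the calibrated Day 5 values.

```python
import numpy as np

# Illustrative constants only: the real BASELINE_TPS / OPTIMAL_TPS are
# calibrated on Day 5 against the actual baseline agent.
BASELINE_TPS, OPTIMAL_TPS = 100.0, 400.0
WEIGHTS = {"throughput": 0.5, "slo": 0.3, "cost": 0.2}

def grade(episode):
    """Deterministic grader sketch: clip each component, then the weighted sum."""
    parts = {
        "throughput": (episode["tps"] - BASELINE_TPS) / (OPTIMAL_TPS - BASELINE_TPS),
        "slo": 1.0 - episode["slo_violation_rate"],
        "cost": 1.0 - episode["cost"] / episode["cost_budget"],
    }
    # Clip every component into [0, 1] before weighting, so partial credit
    # accrues smoothly instead of going negative or exceeding full marks...
    total = sum(WEIGHTS[k] * float(np.clip(v, 0.0, 1.0)) for k, v in parts.items())
    # ...and clip the final score too, so it can never leave [0, 1].
    return float(np.clip(total, 0.0, 1.0))

log = {"tps": 250.0, "slo_violation_rate": 0.1, "cost": 2.0, "cost_budget": 5.0}
score = grade(log)
```

Because the function has no randomness, grading the same episode log twice yields byte-identical scores, which is exactly the determinism check in the list above.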
|
|
| <div class="grid2"> |
| <div> |
| <div class="label">DEPLOYMENT CHECKS</div> |
| <ul class="checklist"> |
| <li><div class="chk"></div>Docker image builds locally with <code>docker build -t test .</code></li> |
| <li><div class="chk"></div>Image is under 2GB (<code>docker image ls</code>)</li> |
| <li><div class="chk"></div>Container starts and /health responds within 30s</li> |
| <li><div class="chk"></div>HF Spaces URL is live and all endpoints respond</li> |
| <li><div class="chk"></div>Tested from a fresh browser/machine with no local setup</li> |
| <li><div class="chk"></div>50 concurrent requests don't produce 500 errors</li> |
| <li><div class="chk"></div>HF Spaces shows "Running" not "Building" or "Error"</li> |
| </ul> |
| </div> |
| <div> |
| <div class="label">SUBMISSION FORM</div> |
| <ul class="checklist"> |
| <li><div class="chk"></div>Environment name: <code>InferenceGym</code> (or your chosen name)</li> |
| <li><div class="chk"></div>Description: 500-word submission text</li> |
| <li><div class="chk"></div>All team member names listed</li> |
| <li><div class="chk"></div>HuggingFace Spaces URL submitted</li> |
| <li><div class="chk"></div>GitHub repository URL submitted (public)</li> |
| <li><div class="chk"></div>Submitted BEFORE 11:59 PM April 7</li> |
| <li><div class="chk"></div>No code changes pushed after submission time</li> |
| </ul> |
| </div> |
| </div> |
|
|
| <div class="alert alert-green" style="margin-top:20px;"> |
| <div class="alert-title">🎯 The One-Line Summary for Judges</div> |
| InferenceGym is the first RL environment for LLM inference control. A naïve vLLM config scores 0.22 on the hardest task. A simple PPO agent trained for 50k steps reaches 0.65 – a 3× improvement in serving efficiency, no GPU, no model required. That's the pitch. Everything else in this document is how you build the thing that delivers that demo. |
| </div> |
|
|
| <hr class="rule"> |
| <div style="text-align:center; font-family:var(--mono); font-size:10px; color:var(--text3); padding: 8px 0 20px; letter-spacing:0.08em;"> |
| INFERENCEGYM · MASTER BUILD DOCUMENT · META PYTORCH × SCALER HACKATHON 2026 · DEADLINE APRIL 7 |
| </div> |
| </div> |
|
|
| </div> |
| </body> |
| </html> |
|
|