MASTER BUILD DOCUMENT PHASE-BY-PHASE ALWAYS FUNCTIONAL

InferenceGym
Complete Engineering Plan

A modular, phase-gated engineering plan for building the first RL environment for LLM inference control. Every phase ends with a fully functional, testable system. No phase leaves you broken. Deadline: April 7, 2026 · 11 days · 3 people.

Deadline: Apr 7, 2026
Days Left: 11 days
Team Size: 3 people
Phases: 6 phases
Deploy Target: HF Spaces
Prize Pool: $30,000
00
Always Functional
After every phase ends, the system must be in a state where you can run it, call it, and get a valid response. No "half-built" states that block testing. If Phase 1 is done, someone can import the simulator and call simulate(action, workload) right now.
Stub First, Flesh Later
Every module gets a stub implementation on Day 1 that returns valid-shaped data. This lets Person B wire the API and Person C write the grader before Person A finishes the simulator. Real logic replaces stubs phase by phase.
Data Schema First
All three people must agree on the exact shape of ServeAction, ServeObservation, and MetricsSnapshot on Day 1, before writing a single line of logic. Changing the schema mid-build is the #1 cause of integration hell.
⚠ The Critical Path
Person A's simulator core is the only hard dependency for everyone else. That is why Person A's Day 3 deliverable is a strict gate — no simulator, no env, no API, no demo. Everything else can be parallelised after Day 3. Protect this gate fiercely.
P0
Day 1 · Mar 27
🏁
Phase Gate — End of Day 1
You can run curl http://localhost:7860/health and get a 200 OK. All three people have cloned the repo, installed deps, and can run the stub server locally. The data schemas are written and committed to models.py. Nobody can start Day 2 until this is true.
Person A — Simulator Lead
Owns: simulator/, env/ directories
  • Read the OpenEnv spec completely: clone openenv-course, run the echo example env, and understand what /reset → /step → /grader looks like end to end.
  • Design the TraceSimulator data schema: decide the exact column names for the lookup CSV, write them down, and share them with the team. This is a decision that cannot change later.
  • Write skeleton classes: create simulator/trace_sim.py with class stubs TraceSimulator.__init__ and simulate(action, workload), the latter returning a hardcoded MetricsSnapshot.
  • Write a skeleton workload generator: simulator/workload.py, a stub that returns a fixed WorkloadState dict every time.
Person B — API Lead
Owns: server/ directory, Dockerfile
  • Set up the FastAPI project: install FastAPI, uvicorn, and pydantic. Create server/app.py with all 8 endpoint stubs returning hardcoded valid responses.
  • Install the openenv CLI: run openenv init and understand what openenv validate checks. Make sure the stub server passes basic validation.
  • Create the Dockerfile skeleton: a multi-stage build that starts the uvicorn server. Confirm it builds locally and that the /health endpoint responds from inside Docker.
  • Set up the GitHub repo: main-branch protection, agreed feature-branch naming (feat/simulator, feat/api, etc.), and a .gitignore.
Person C — Grader & Demo Lead
Owns: grader/, agents/, notebooks/
  • Design the grader rubric on paper: for each of the 3 tasks, what is the score formula, what is the theoretical optimal, and what is the expected baseline score? Write this up as a one-page doc.
  • Decide the trace data strategy: evaluate Option A (published benchmarks), B (Colab T4), and C (synthetic). Download whichever dataset you're going with and confirm it has the needed columns.
  • Define the workload configs: write simulator/data/workload_configs.json with the exact parameters for Tasks 1, 2, and 3 (arrival rate, SLO, prompt distribution params).
  • Agree on ENV_NAME: confirm the HuggingFace Spaces org, repo name, and environment name string. Register the HF account if needed.
SHARED DELIVERABLE — models.py (everyone must agree before Day 2)
python inferencegym/models.py — Data schema, locked on Day 1
from dataclasses import dataclass, field
from typing import Optional, List, Dict, Any
from enum import Enum

# ── Action space ─────────────────────────────────────────────────────────────
class QuantTier(Enum):
    FP16 = 0
    INT8 = 1
    INT4 = 2

@dataclass
class ServeAction:
    kv_budget:       float     # 0.1 – 1.0  : fraction of KV cache allocated
    spec_length:     int       # 0,1,2,4,8  : speculative draft tokens
    batch_size:      int       # 1–512      : max concurrent requests
    prefill_disagg:  bool      # True/False : disaggregate prefill GPU
    quant_tier:      QuantTier # FP16/INT8/INT4
    
    def validate(self) -> bool:
        assert 0.1 <= self.kv_budget <= 1.0
        assert self.spec_length in {0,1,2,4,8}
        assert 1 <= self.batch_size <= 512
        return True

# ── Simulator output ──────────────────────────────────────────────────────────
@dataclass
class MetricsSnapshot:
    ttft_p50_ms:       float  # median time to first token
    ttft_p99_ms:       float  # tail latency
    tpot_ms:           float  # time per output token
    tokens_per_sec:    float  # throughput
    gpu_memory_gb:     float  # simulated memory pressure
    cost_per_1k:       float  # compute cost (normalised units)
    spec_accept_rate:  float  # 0.0 if spec_length == 0
    eviction_events:   int    # KV cache evictions this step
    slo_violations:    int    # requests that exceeded SLO this step

# ── Observation (what agent sees) ────────────────────────────────────────────
@dataclass
class ServeObservation:
    queue_depth:            float
    mean_prompt_len:        float
    arrival_rate:           float
    kv_cache_occupancy:     float
    ttft_p50:               float
    tpot_p50:               float
    slo_violation_rate:     float
    gpu_memory_used_gb:     float
    spec_accept_rate:       float
    priority_distribution:  List[float]   # [interactive, batch, best_effort]
    timestep:               int
    cost_so_far:            float

# ── Workload state ────────────────────────────────────────────────────────────
@dataclass
class WorkloadState:
    arrival_rate:           float
    mean_prompt_len:        float
    prompt_len_bucket:      int     # 0–7, discrete bucket for lookup table
    queue_depth:            int
    priority_distribution:  List[float]
    is_burst:               bool
    phase:                  str     # "warmup" | "steady" | "burst" | "cooldown"
PHASE 0 COMPLETION PROOF
bash These commands must all pass before Day 2 starts
# From repo root:
docker build -t inferencegym . && docker run -p 7860:7860 inferencegym &
curl http://localhost:7860/health              # → {"status": "ok"}
curl http://localhost:7860/tasks              # → {"tasks": [{...}, {...}, {...}]}
python -c "from inferencegym.models import ServeAction, ServeObservation; print('schemas OK')"
P1
Days 2–3
✅ Why This Phase Unlocks Everything
Once TraceSimulator.simulate(action, workload) → MetricsSnapshot works, Person B can wire it into the API and Person C can build the grader. Both of those can proceed in parallel. Person A must finish this by end of Day 3 even if it means simplifying the interpolation.
🔑
Phase Gate — End of Day 3
Running python tests/test_simulator.py passes all tests. The simulator returns realistic-shaped numbers for a variety of (action, workload) inputs. The workload generator produces a different workload state on every call. All of this must be true before Phase 2 begins.
DAY 2 TASKS (Person A, primary)
TraceSimulator — Core Implementation
  • A
    Load the lookup table from CSV/Parquet: read the trace data file into a dict keyed by (batch_bucket, kv_bucket, spec_bucket, prompt_bucket), where each value is a MetricsSnapshot. The lookup table must be loaded once at startup and cached in memory.
  • A
    Implement bilinear interpolation: use scipy.interpolate.RegularGridInterpolator for continuous actions (kv_budget, batch_size) between discrete lookup points. For discrete actions (spec_length, quant_tier), use nearest-neighbor lookup.
  • A
    Add a Gaussian noise model: inject ±5% Gaussian noise on ttft_p50_ms and tpot_ms to simulate hardware jitter. Use np.random.default_rng(seed) so episodes are reproducible.
  • A
    Memory overflow detection: if interpolated gpu_memory_gb > 40.0, set a hard OOM flag, cap memory at 40GB, and multiply slo_violations by 5 as a penalty signal.
WorkloadGenerator — Day 2
  • A
    Poisson arrival generator: np.random.poisson(lam=arrival_rate) per step. The arrival rate varies by task config loaded from workload_configs.json.
  • A
    Prompt length sampling: Task 1 uses np.random.uniform(64, 128); Task 2 uses np.random.lognormal(5.2, 1.3) clamped to [32, 8192]; Task 3 is bimodal — 70% uniform(32, 128), 30% uniform(4096, 8192).
  • A
    Discrete prompt bucket mapping: map continuous prompt_len to an integer bucket 0–7 using np.digitize against [64, 128, 256, 512, 1024, 2048, 4096]. This is the lookup table key (see the sketch after this list).
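A minimal sketch of the bucket mapping described in the last item. The helper name prompt_len_to_bucket is illustrative, not part of the locked schema:
python — bucket mapping sketch
import numpy as np

PLEN_EDGES = [64, 128, 256, 512, 1024, 2048, 4096]   # 7 edges → 8 buckets (0–7)

def prompt_len_to_bucket(prompt_len: float) -> int:
    # np.digitize counts how many edges lie at or below prompt_len
    return int(np.digitize(prompt_len, PLEN_EDGES))

assert prompt_len_to_bucket(50) == 0      # below the first edge
assert prompt_len_to_bucket(300) == 3     # 256 <= 300 < 512
assert prompt_len_to_bucket(9000) == 7    # beyond the last edge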
DAY 3 TASKS (Person A, primary)
WorkloadGenerator — Day 3 Completion
  • A
    Queue depth simulation: maintain a running queue_depth counter. Each step, add new arrivals and subtract min(batch_size, queue_depth) served requests. The queue cannot go negative.
  • A
    Burst injection for Task 3: every 120 timesteps, multiply arrival_rate by 10 for 15 consecutive steps. Set is_burst=True in WorkloadState during these steps.
  • A
    Priority distribution tracking: for Task 3, maintain a rolling 50-step window of request classes [INTERACTIVE, BATCH, BEST_EFFORT] as fractions. Pass this to WorkloadState.priority_distribution.
  • A
    Speculative acceptance model: implement accept_rate = base_rate * (1 - complexity_penalty) * depth_decay, where depth_decay = 1.0 / (1 + 0.15 * spec_length). Base rate by task: Task 1 = 0.80, Task 2 = 0.65, Task 3 = 0.45. A worked example follows this list.
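The acceptance model is easy to sanity-check by hand. A worked example using Task 1's base rate and no complexity penalty:
python — worked example of the acceptance model
def spec_accept_rate(base_rate: float, complexity_penalty: float, spec_length: int) -> float:
    depth_decay = 1.0 / (1 + 0.15 * spec_length)
    return base_rate * (1 - complexity_penalty) * depth_decay

# Task 1, no complexity penalty, 4 draft tokens:
#   depth_decay = 1 / (1 + 0.60) = 0.625  →  0.80 * 1.0 * 0.625 = 0.50
assert abs(spec_accept_rate(0.80, 0.0, 4) - 0.50) < 1e-9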
Unit Tests — must pass by Day 3 EOD
  • C
    Smoke test: call simulate(action, workload) with 20 random valid actions — all must return a non-null MetricsSnapshot with values in the expected ranges.
  • C
    Monotonicity test: increasing batch_size while holding the other actions constant should increase tokens_per_sec until it saturates. This validates that the lookup table is loaded correctly.
  • C
    Determinism test: two calls with the same seed and the same action must produce the same noise-injected output. Tests reproducibility (see the sketch after this list).
  • C
    OOM detection test: pass an action with batch_size=512, kv_budget=1.0 — confirm gpu_memory_gb triggers the overflow flag.
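A sketch of the determinism test against the interface below; two fresh simulators built with the same seed must agree (the workload values here are arbitrary but valid):
python — determinism test sketch for tests/test_simulator.py
from simulator.trace_sim import TraceSimulator
from inferencegym.models import ServeAction, WorkloadState, QuantTier

def test_same_seed_same_output():
    action = ServeAction(kv_budget=0.5, spec_length=2, batch_size=64,
                         prefill_disagg=False, quant_tier=QuantTier.FP16)
    workload = WorkloadState(arrival_rate=10.0, mean_prompt_len=128.0,
                             prompt_len_bucket=1, queue_depth=20,
                             priority_distribution=[1.0, 0.0, 0.0],
                             is_burst=False, phase="steady")
    a = TraceSimulator("simulator/data/traces_llama3_8b.parquet", seed=7)
    b = TraceSimulator("simulator/data/traces_llama3_8b.parquet", seed=7)
    # MetricsSnapshot is a dataclass, so field-wise equality holds
    assert a.simulate(action, workload) == b.simulate(action, workload)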
SIMULATOR CORE IMPLEMENTATION
python simulator/trace_sim.py
import numpy as np
import pandas as pd
from scipy.interpolate import RegularGridInterpolator
from pathlib import Path
from inferencegym.models import ServeAction, WorkloadState, MetricsSnapshot, QuantTier

class TraceSimulator:
    """
    CPU-only trace-driven simulator.
    Loads a pre-built lookup table and interpolates (action, workload) → MetricsSnapshot.
    """
    
    BATCH_POINTS  = [1, 4, 8, 16, 32, 64, 128, 256, 512]
    KV_POINTS     = [0.1, 0.25, 0.5, 0.75, 1.0]
    PLEN_BUCKETS  = [64, 128, 256, 512, 1024, 2048, 4096, 8192]
    OOM_THRESHOLD = 40.0  # GB
    NOISE_STD     = 0.05  # ±5% Gaussian jitter on latency metrics

    def __init__(self, trace_path: str, seed: int = 42):
        self.rng = np.random.default_rng(seed)
        self._load_tables(Path(trace_path))
        self._build_interpolators()

    def _load_tables(self, path: Path) -> None:
        df = pd.read_parquet(path)
        # Expected columns: batch_size, kv_budget, spec_length, quant_tier,
        #   prompt_len_bucket, ttft_p50, ttft_p99, tpot, tps, gpu_mem_gb, cost_per_1k
        self._df = df

    def _build_interpolators(self) -> None:
        # Build 4-D interpolator over (batch_size, kv_budget, spec_len, prompt_bucket)
        # for FP16 baseline. INT8/INT4 handled via multiplicative correction factors.
        fp16_df = self._df[self._df['quant_tier'] == 0]
        grid_vals = {
            'ttft_p50': self._reshape_for_interp(fp16_df, 'ttft_p50'),
            'ttft_p99': self._reshape_for_interp(fp16_df, 'ttft_p99'),
            'tpot':     self._reshape_for_interp(fp16_df, 'tpot'),
            'tps':      self._reshape_for_interp(fp16_df, 'tps'),
            'gpu_mem':  self._reshape_for_interp(fp16_df, 'gpu_mem_gb'),
        }
        points = (self.BATCH_POINTS, self.KV_POINTS, [0,1,2,4,8], self.PLEN_BUCKETS)
        self._interps = {k: RegularGridInterpolator(points, v, method='linear', bounds_error=False)
                         for k, v in grid_vals.items()}

    def simulate(self, action: ServeAction, workload: WorkloadState) -> MetricsSnapshot:
        action.validate()
        query = [[action.batch_size, action.kv_budget,
                   action.spec_length, workload.mean_prompt_len]]
        
        # Interpolate base metrics
        base = {k: float(fn(query)[0]) for k, fn in self._interps.items()}
        
        # Apply quant tier correction factors (from benchmark data)
        quant_factors = {QuantTier.FP16: 1.0, QuantTier.INT8: 0.82, QuantTier.INT4: 0.68}
        q_factor = quant_factors[action.quant_tier]
        base['ttft_p50'] *= q_factor
        base['tps'] /= q_factor          # quantised models serve faster
        base['gpu_mem'] *= q_factor        # quantised models use less memory
        
        # Apply speculative decoding acceptance bonus
        if action.spec_length > 0:
            depth_decay = 1.0 / (1 + 0.15 * action.spec_length)
            accept_rate = 0.75 * (1 - 0.1 * workload.prompt_len_bucket) * depth_decay
            accept_rate = max(0.0, min(1.0, accept_rate))
            speedup = 1.0 + accept_rate * action.spec_length * 0.1
            base['ttft_p50'] /= speedup
        else:
            accept_rate = 0.0
        
        # Inject Gaussian noise
        noise = self.rng.normal(1.0, self.NOISE_STD, size=3)
        base['ttft_p50'] *= noise[0]
        base['ttft_p99'] *= noise[1]
        base['tpot']     *= noise[2]
        
        # OOM detection
        oom = base['gpu_mem'] > self.OOM_THRESHOLD
        slo_violations = 0  # computed by env, not simulator
        if oom:
            base['gpu_mem'] = self.OOM_THRESHOLD
            slo_violations = action.batch_size  # all requests fail on OOM
        
        return MetricsSnapshot(
            ttft_p50_ms    = max(1.0, base['ttft_p50']),
            ttft_p99_ms    = max(1.0, base['ttft_p99']),
            tpot_ms        = max(1.0, base['tpot']),
            tokens_per_sec = max(0.0, base['tps']),
            gpu_memory_gb  = base['gpu_mem'],
            cost_per_1k    = base['tps'] * q_factor * 0.001,
            spec_accept_rate = accept_rate,
            eviction_events  = int(max(0, (1.0 - action.kv_budget) * workload.queue_depth)),
            slo_violations   = slo_violations,
        )
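_reshape_for_interp is called above but never defined. A minimal sketch, assuming the FP16 slice of the parquet holds exactly one row per grid point (9 × 5 × 5 × 8 = 1800 rows):
python — assumed shape of the missing _reshape_for_interp helper
    def _reshape_for_interp(self, df: pd.DataFrame, col: str) -> np.ndarray:
        # Sort rows so their order matches the interpolator's axis order
        # (batch slowest-varying, prompt bucket fastest-varying), then reshape.
        ordered = df.sort_values(['batch_size', 'kv_budget', 'spec_length', 'prompt_len_bucket'])
        shape = (len(self.BATCH_POINTS), len(self.KV_POINTS), 5, len(self.PLEN_BUCKETS))
        return ordered[col].to_numpy(dtype=float).reshape(shape)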
TRACE DATA — How to Build It Without a GPU
Option A (Recommended) 0 GPU hrs
Download published vLLM benchmark CSVs from github.com/vllm-project/vllm/tree/main/benchmarks and the HuggingFace llm-perf-leaderboard. These have real measured latencies across batch sizes. Fit a pandas pivot table to get the lookup grid.
  • Already covers Llama-3-8B on A100 — your exact target model
  • Includes TTFT, TPOT, throughput, memory across batch sizes
  • Needs ~2 hours of data wrangling to reshape into your schema
Option B (Good) 2-4 GPU hrs
Run llmperf on a Colab free T4 with Llama-3.2-1B-Instruct (free tier works). Grid search over batch_size=[1,4,8,16,32] × prompt_len=[64,128,256,512] — that's 20 measurements. 2 hours of Colab time.
  • Your own measurements — stronger story for judges
  • Can extrapolate to larger batch sizes analytically
  • Risk: Colab disconnects. Use checkpointing.
Option C (Fallback) 30 min, CPU
Generate synthetic data from a roofline model: ttft = base_ms + batch_factor * batch_size + memory_factor * prompt_len, with constants calibrated against the published vLLM (PagedAttention, SOSP '23) numbers. Fully deterministic, always works. A sketch follows below.
  • Implement this FIRST as a fallback even if you use A or B
  • Guarantees you always have valid data no matter what
  • Good enough for an RL agent to learn relative improvements
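A sketch of the Option C generator with hypothetical roofline constants (calibrate them against published numbers before trusting absolute values); the output matches the columns trace_sim.py expects:
python — Option C synthetic trace generator (illustrative constants)
import itertools
import pandas as pd

BASE_MS, BATCH_MS, PLEN_MS = 40.0, 1.5, 0.05   # hypothetical roofline constants

rows = []
for b, kv, spec, plen in itertools.product(
        [1, 4, 8, 16, 32, 64, 128, 256, 512],          # batch points
        [0.1, 0.25, 0.5, 0.75, 1.0],                   # kv budget points
        [0, 1, 2, 4, 8],                               # spec lengths
        [64, 128, 256, 512, 1024, 2048, 4096, 8192]):  # prompt length buckets
    ttft = BASE_MS + BATCH_MS * b + PLEN_MS * plen
    rows.append(dict(batch_size=b, kv_budget=kv, spec_length=spec, quant_tier=0,
                     # NOTE: stored as the bucket's token length to match
                     # TraceSimulator.PLEN_BUCKETS grid points
                     prompt_len_bucket=plen, ttft_p50=ttft, ttft_p99=1.8 * ttft,
                     tpot=8.0 + 0.02 * b, tps=min(9000.0, 60.0 * b * kv),
                     gpu_mem_gb=10.0 + 0.06 * b + 0.002 * plen, cost_per_1k=0.001))
pd.DataFrame(rows).to_parquet('simulator/data/traces_llama3_8b.parquet')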
P2
Day 4 · Mar 30
🎯
Phase Gate — End of Day 4
The following Python loop runs without error and completes all 200 steps (spelled out in the sketch below): construct an InferenceEnv for Task 1, call obs = env.reset(), then call env.step(random_action()) 200 times. Rewards are floats in [-1, 1]. The episode terminates at step 200. Session IDs are unique per reset call.
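python — Phase 2 gate check (random_action is a hypothetical helper, not part of the locked schema)
import random
from inferencegym.models import ServeAction, QuantTier
from simulator.trace_sim import TraceSimulator
from env.inference_env import InferenceEnv

def random_action() -> ServeAction:
    return ServeAction(kv_budget=random.uniform(0.1, 1.0),
                       spec_length=random.choice([0, 1, 2, 4, 8]),
                       batch_size=random.randint(1, 512),
                       prefill_disagg=random.random() < 0.5,
                       quant_tier=random.choice(list(QuantTier)))

env = InferenceEnv(TraceSimulator("simulator/data/traces_llama3_8b.parquet"), task_id=1)
obs = env.reset()
for _ in range(200):
    obs, reward, done, info = env.step(random_action())
    assert -1.0 <= reward <= 1.0
assert done  # episode terminates exactly at step 200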
ENVIRONMENT CLASS — Full Implementation
python env/inference_env.py — Core environment (Person A, Day 4)
import uuid
import numpy as np
from dataclasses import dataclass
from typing import Optional
from inferencegym.models import ServeAction, ServeObservation, WorkloadState, MetricsSnapshot, QuantTier
from simulator.trace_sim import TraceSimulator
from simulator.workload import WorkloadGenerator

@dataclass
class EnvConfig:
    task_id:       int
    episode_len:   int   = 200
    slo_target_ms: float = 300.0
    max_memory_gb: float = 40.0
    # Reward weights
    alpha: float = 0.40  # throughput
    beta:  float = 0.25  # latency
    gamma: float = 0.25  # SLO violations
    delta: float = 0.10  # cost

# Task configs — loaded from workload_configs.json
TASK_CONFIGS = {
    1: EnvConfig(task_id=1, slo_target_ms=500.0),
    2: EnvConfig(task_id=2, slo_target_ms=300.0, gamma=0.30),
    3: EnvConfig(task_id=3, slo_target_ms=200.0, gamma=0.35, delta=0.15),
}
# Max achievable throughput per task (set after running optimal solver)
MAX_THROUGHPUT = {1: 8500.0, 2: 6200.0, 3: 4800.0}

class InferenceEnv:
    def __init__(self, simulator: TraceSimulator, task_id: int, seed: int = 42):
        self.sim     = simulator
        self.config  = TASK_CONFIGS[task_id]
        self.gen     = WorkloadGenerator(task_id=task_id, seed=seed)
        self.session_id   = str(uuid.uuid4())
        self._step        = 0
        self._cost_so_far = 0.0
        self._workload    = self.gen.reset()
        self._last_metrics: Optional[MetricsSnapshot] = None
        self._episode_log: list = []

    def reset(self) -> ServeObservation:
        self.session_id   = str(uuid.uuid4())
        self._step        = 0
        self._cost_so_far = 0.0
        self._workload    = self.gen.reset()
        self._episode_log = []
        return self._build_obs(MetricsSnapshot(
            ttft_p50_ms=200.0, ttft_p99_ms=350.0, tpot_ms=20.0,
            tokens_per_sec=2000.0, gpu_memory_gb=24.0, cost_per_1k=0.001,
            spec_accept_rate=0.0, eviction_events=0, slo_violations=0))

    def step(self, action: ServeAction):
        if self._step >= self.config.episode_len:
            raise RuntimeError("Episode already done. Call reset() first.")
        
        # Task 1 & 2: lock certain actions
        action = self._enforce_action_mask(action)
        
        # Advance workload one step
        self._workload = self.gen.step(action)
        
        # Simulate this step
        metrics = self.sim.simulate(action, self._workload)
        self._last_metrics = metrics
        
        # Compute SLO violations from simulator metrics + SLO target
        metrics.slo_violations += int(
            metrics.ttft_p50_ms > self.config.slo_target_ms) * self._workload.queue_depth
        
        # Compute reward
        reward = self._compute_reward(metrics)
        
        # Update episode state
        self._cost_so_far += metrics.cost_per_1k
        self._step += 1
        done = self._step >= self.config.episode_len
        
        obs = self._build_obs(metrics)
        info = {"timestep": self._step, "metrics": metrics.__dict__,
                "workload": self._workload.__dict__}
        self._episode_log.append({"action": action.__dict__, "reward": reward, "metrics": metrics.__dict__})
        return obs, reward, done, info

    def _compute_reward(self, m: MetricsSnapshot) -> float:
        c = self.config
        T = m.tokens_per_sec / MAX_THROUGHPUT[c.task_id]
        L = m.ttft_p50_ms / c.slo_target_ms
        V = m.slo_violations / max(self._workload.queue_depth, 1)
        C = m.cost_per_1k / 0.005   # normalise against budget ceiling
        reward = c.alpha * T - c.beta * L - c.gamma * V - c.delta * C
        return float(np.clip(reward, -1.0, 1.0))

    def _enforce_action_mask(self, action: ServeAction) -> ServeAction:
        if self.config.task_id == 1:
            action.spec_length = 0; action.prefill_disagg = False; action.quant_tier = QuantTier.FP16
        elif self.config.task_id == 2:
            action.prefill_disagg = False; action.quant_tier = QuantTier.FP16
        return action

    def _build_obs(self, m: MetricsSnapshot) -> ServeObservation:
        w = self._workload
        return ServeObservation(
            queue_depth           = float(w.queue_depth),
            mean_prompt_len       = w.mean_prompt_len,
            arrival_rate          = w.arrival_rate,
            kv_cache_occupancy    = (1.0 - (m.eviction_events / max(w.queue_depth, 1))),
            ttft_p50              = m.ttft_p50_ms,
            tpot_p50              = m.tpot_ms,
            slo_violation_rate    = m.slo_violations / max(w.queue_depth, 1),
            gpu_memory_used_gb    = m.gpu_memory_gb,
            spec_accept_rate      = m.spec_accept_rate,
            priority_distribution = w.priority_distribution,
            timestep              = self._step,
            cost_so_far           = self._cost_so_far,
        )
P3
Day 5 · Mar 31
🌐
Phase Gate — End of Day 5
Running the openenv CLI validation passes with no errors: openenv validate --url http://localhost:7860. Every endpoint returns the correct shape. The Docker image is under 2GB. A full reset→step×200→grader cycle completes in under 60 seconds.
ALL ENDPOINTS — Implementation Spec
Endpoint | Method | Owner | Wired to | Key behaviour
/health | GET | Person B | Session cache count | Returns {"status":"ok","active_sessions":N,"uptime_s":T}
/tasks | GET | Person B | Static task config dict | Returns a list of 3 tasks with id, name, difficulty, description, active_actions
/reset | POST | Person B | InferenceEnv.reset() | Creates a new session_id, instantiates an InferenceEnv for that task, stores it in the LRU cache. Returns session_id + observation.
/step | POST | Person B | InferenceEnv.step() | Looks up the session by session_id, validates the ServeAction, calls step(), returns obs+reward+done+info. 404 if the session is not found.
/state | GET | Person B | InferenceEnv.state() | Returns current episode metadata: step_count, cumulative_reward, done, workload_phase.
/grader | POST | Person C | GraderModule.score() | Accepts episode_log JSON, returns a score 0–1 with breakdown. Stateless — same input always gives the same output.
/baseline | GET | Person C | BaselineAgent.run() | Runs the fixed-config baseline agent on all 3 tasks, returns scores. A fixed seed guarantees reproducibility.
/info | GET | Person B | Static schema | Returns the full JSON schema for action space, observation space, and reward weights. Used by agent frameworks.
SESSION MANAGEMENT — Critical Design
python simulator/session_manager.py — Thread-safe LRU session cache
import threading
from collections import OrderedDict
from typing import Optional
from env.inference_env import InferenceEnv

class SessionManager:
    """Thread-safe LRU cache of active InferenceEnv instances."""
    MAX_SESSIONS = 50
    
    def __init__(self, simulator):
        self._sim  = simulator
        self._lock = threading.Lock()
        self._sessions: OrderedDict[str, InferenceEnv] = OrderedDict()
    
    def create(self, task_id: int, seed: int) -> InferenceEnv:
        with self._lock:
            if len(self._sessions) >= self.MAX_SESSIONS:
                self._sessions.popitem(last=False)  # evict oldest
            env = InferenceEnv(self._sim, task_id, seed)
            self._sessions[env.session_id] = env
            return env
    
    def get(self, session_id: str) -> Optional[InferenceEnv]:
        with self._lock:
            env = self._sessions.get(session_id)
            if env:  # move to end (mark as recently used)
                self._sessions.move_to_end(session_id)
            return env
    
    def remove(self, session_id: str) -> None:
        with self._lock:
            self._sessions.pop(session_id, None)
    
    def count(self) -> int:
        with self._lock:
            return len(self._sessions)
FASTAPI APP SKELETON — Person B writes the endpoint stubs on Day 1 and wires them to the real environment on Days 3–5
python server/app.py — Main FastAPI application
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from typing import Optional
import time

from simulator.trace_sim import TraceSimulator
from simulator.session_manager import SessionManager
from inferencegym.models import ServeAction, QuantTier

app = FastAPI(title="InferenceGym", version="1.0.0")
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"])

# ── App startup: load simulator once, create session manager ─────────────────
_sim = None
_sessions = None
_start_time = time.time()

@app.on_event("startup")
async def startup():
    global _sim, _sessions
    _sim = TraceSimulator("simulator/data/traces_llama3_8b.parquet")
    _sessions = SessionManager(_sim)

# ── Pydantic request/response models ────────────────────────────────────────
class ResetRequest(BaseModel):
    task_id: int
    seed: int = 42
    config: Optional[dict] = None   # override alpha/beta/gamma/delta

class StepRequest(BaseModel):
    session_id: str
    action: dict

class GraderRequest(BaseModel):
    task_id: int
    episode_log: list

# ── Endpoints ─────────────────────────────────────────────────────────────────
@app.get("/health")
def health():
    return {"status": "ok", "active_sessions": _sessions.count(), 
            "uptime_seconds": int(time.time() - _start_time)}

@app.get("/tasks")
def get_tasks():
    return {"tasks": [
        {"id":1, "name":"Static Uniform",    "difficulty":"easy",   "active_actions":["kv_budget","batch_size"]},
        {"id":2, "name":"Bursty ShareGPT",   "difficulty":"medium", "active_actions":["kv_budget","batch_size","spec_length"]},
        {"id":3, "name":"Adversarial Multi-Tenant","difficulty":"hard", "active_actions":["kv_budget","batch_size","spec_length","prefill_disagg","quant_tier"]},
    ]}

@app.post("/reset")
def reset(req: ResetRequest):
    if req.task_id not in {1, 2, 3}:
        raise HTTPException(422, f"task_id must be 1, 2, or 3. Got {req.task_id}")
    env = _sessions.create(req.task_id, req.seed)
    obs = env.reset()
    return {"session_id": env.session_id, "observation": obs.__dict__, "episode_length": 200}

@app.post("/step")
def step(req: StepRequest):
    env = _sessions.get(req.session_id)
    if not env:
        raise HTTPException(404, f"Session '{req.session_id}' not found. Call /reset first.")
    action = ServeAction(
        kv_budget      = req.action.get("kv_budget", 1.0),
        spec_length    = req.action.get("spec_length", 0),
        batch_size     = req.action.get("batch_size", 32),
        prefill_disagg = req.action.get("prefill_disagg", False),
        quant_tier     = QuantTier(req.action.get("quant_tier", 0)),
    )
    obs, reward, done, info = env.step(action)
    if done:
        _sessions.remove(req.session_id)
    return {"observation": obs.__dict__, "reward": reward, "done": done, "info": info}
DOCKERFILE — Multi-stage, CPU-only, <2GB
dockerfile Dockerfile
# Stage 1: Install dependencies only
FROM python:3.11-slim AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# Stage 2: Minimal runtime (no build tools)
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY . .
ENV PATH=/root/.local/bin:$PATH
ENV PYTHONPATH=/app
EXPOSE 7860

# HuggingFace Spaces convention: port 7860
CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "2"]

## requirements.txt (CPU-only — NO torch, NO CUDA)
# fastapi==0.115.0
# uvicorn[standard]==0.30.0
# pydantic==2.7.0
# numpy==1.26.4
# scipy==1.13.0
# pandas==2.2.0
# pyarrow==15.0.0    (for parquet reading)
# NOTE: stable-baselines3 and gymnasium go in a separate requirements-demo.txt,
#   NOT in this image; the PPO demo runs from Colab (see risk table, §C)
# httpx==0.27.0     (for integration tests)
P4
Days 6–7
📊
Phase Gate — End of Day 7
POST /grader with a handcrafted episode log returns a score between 0.0 and 1.0 with a complete breakdown dict. GET /baseline returns scores in the range [0.18, 0.40] for all 3 tasks. The grader returns the same score on repeated calls with the same input. All grader unit tests pass.
GRADER DESIGN — Per-Task Formula Detail
Task 1 Grader EASY
Pure throughput optimisation. Score is the normalised improvement over baseline on mean tokens/sec, capped at 1.0.
# All values are means over the 200-step episode log
score = (agent_tps - baseline_tps) / (optimal_tps - baseline_tps)
score = max(0.0, min(1.0, score))

# baseline_tps ≈ 2800 tokens/s (batch=32, kv=1.0)
# optimal_tps  ≈ 8200 tokens/s (batch=128, kv=0.5)
Task 2 Grader MEDIUM
Balances TTFT and memory compliance. Both components are independently scored and averaged.
ttft_score   = max(0.0, 1.0 - mean_ttft_p50 / 300.0)
peak_mem     = max(s['metrics']['gpu_memory_gb'] for s in episode_log)
mem_score    = 1.0 if peak_mem < 36.0 else max(0.0, 1.0 - (peak_mem - 36) / 10)
score = 0.5 * ttft_score + 0.5 * mem_score
Task 3 Grader HARD
4-component scoring with explicit weights. Stability score penalises wild action thrashing — rewards a smooth, learnable policy.
T = mean_tps / optimal_tps          # throughput
S = 1.0 - mean_slo_violation_rate   # SLO compliance
C = max(0.0, 1.0 - total_cost/5.0)  # cost (budget=5.0)
A = 1.0 - action_variance_score     # stability

score = 0.40*T + 0.30*S + 0.20*C + 0.10*A
Stability Score Anti-Thrashing
Computes the variance of consecutive actions taken by the agent. High variance = thrashing = unstable policy. The stability score penalises this.
actions = [step['action'] for step in episode_log]
batch_diffs  = np.diff([a['batch_size'] for a in actions])
kv_diffs     = np.diff([a['kv_budget'] for a in actions])
variance     = np.std(batch_diffs)/512 + np.std(kv_diffs)/1.0
action_variance_score = min(1.0, variance / 0.5)  # 0=stable, 1=chaotic
GRADER MODULE — Full Implementation
python grader/grader.py — Deterministic episode scorer
import numpy as np
from typing import List, Dict, Any

class GraderModule:
    """Deterministic grader. Same episode_log → same score, always."""

    BASELINE_TPS = {1: 2800.0, 2: 2100.0, 3: 1600.0}
    OPTIMAL_TPS  = {1: 8200.0, 2: 5800.0, 3: 4200.0}

    def score(self, task_id: int, episode_log: List[Dict[str, Any]]) -> Dict:
        if not episode_log:
            return {"score": 0.0, "breakdown": {}, "feedback": "Empty episode log."}
        
        graders = {1: self._task1, 2: self._task2, 3: self._task3}
        if task_id not in graders:
            raise ValueError(f"Unknown task_id: {task_id}")
        return graders[task_id](episode_log)

    def _task1(self, log) -> Dict:
        mean_tps = np.mean([s['metrics']['tokens_per_sec'] for s in log])
        score = (mean_tps - self.BASELINE_TPS[1]) / (self.OPTIMAL_TPS[1] - self.BASELINE_TPS[1])
        score = float(np.clip(score, 0.0, 1.0))
        feedback = self._throughput_feedback(mean_tps, 1)
        return {"score": score, "breakdown": {"throughput": score}, "feedback": feedback}

    def _task2(self, log) -> Dict:
        mean_ttft  = np.mean([s['metrics']['ttft_p50_ms'] for s in log])
        peak_mem   = max(s['metrics']['gpu_memory_gb'] for s in log)
        ttft_score = float(np.clip(1.0 - mean_ttft / 300.0, 0.0, 1.0))
        mem_score  = 1.0 if peak_mem < 36.0 else float(np.clip(1.0 - (peak_mem-36)/10, 0.0, 1.0))
        score = 0.5 * ttft_score + 0.5 * mem_score
        feedback = f"TTFT score: {ttft_score:.2f} (mean TTFT {mean_ttft:.0f}ms vs 300ms SLO). Memory score: {mem_score:.2f} (peak {peak_mem:.1f}GB vs 36GB limit)."
        return {"score": score, "breakdown": {"ttft": ttft_score, "memory": mem_score}, "feedback": feedback}

    def _task3(self, log) -> Dict:
        mean_tps     = np.mean([s['metrics']['tokens_per_sec'] for s in log])
        mean_slo     = np.mean([s['metrics']['slo_violations'] for s in log])
        total_cost   = sum(s['metrics']['cost_per_1k'] for s in log)
        actions      = [s['action'] for s in log]
        
        T = float(np.clip(mean_tps / self.OPTIMAL_TPS[3], 0.0, 1.0))
        S = float(np.clip(1.0 - mean_slo / 100.0, 0.0, 1.0))
        C = float(np.clip(1.0 - total_cost / 5.0, 0.0, 1.0))
        A = 1.0 - self._action_variance(actions)
        
        score = 0.40*T + 0.30*S + 0.20*C + 0.10*A
        feedback = self._task3_feedback(T, S, C, A, log)
        return {"score": score, "breakdown": {"throughput":T,"slo":S,"cost":C,"stability":A}, "feedback": feedback}

    def _action_variance(self, actions) -> float:
        batch_vals = [a.get('batch_size', 32) for a in actions]
        kv_vals    = [a.get('kv_budget', 1.0)   for a in actions]
        variance   = np.std(np.diff(batch_vals))/512 + np.std(np.diff(kv_vals))/1.0
        return float(np.clip(variance / 0.5, 0.0, 1.0))
    
    def _throughput_feedback(self, mean_tps, task_id) -> str:
        pct = (mean_tps - self.BASELINE_TPS[task_id]) / (self.OPTIMAL_TPS[task_id] - self.BASELINE_TPS[task_id]) * 100
        return ff"Agent achieved {mean_tps:.0f} TPS ({pct:.0f}% of way from baseline to optimal)."
BASELINE AGENT — Fixed-config, deterministic
python agents/baseline.py — Naïve vLLM defaults (Person C, Day 6)
from inferencegym.models import ServeAction, QuantTier
from env.inference_env import InferenceEnv
from simulator.trace_sim import TraceSimulator
from grader.grader import GraderModule

# The fixed action that the baseline ALWAYS takes, regardless of observation
BASELINE_ACTION = ServeAction(
    kv_budget      = 1.0,         # no eviction
    spec_length    = 0,           # speculative decoding off
    batch_size     = 32,          # vLLM default
    prefill_disagg = False,       # colocated
    quant_tier     = QuantTier.FP16, # full precision
)

def run_baseline(task_id: int, seed: int = 0) -> dict:
    """Runs fixed baseline agent on one task, returns grader score."""
    sim     = TraceSimulator("simulator/data/traces_llama3_8b.parquet", seed=seed)
    env     = InferenceEnv(sim, task_id=task_id, seed=seed)
    grader  = GraderModule()
    
    env.reset()
    done = False
    while not done:
        _, _, done, _ = env.step(BASELINE_ACTION)
    
    result = grader.score(task_id, env._episode_log)
    return {"task_id": task_id, "score": result["score"],
            "breakdown": result["breakdown"], "action_config": BASELINE_ACTION.__dict__}

def run_all_baselines() -> dict:
    # Seed=0 guarantees identical results every run
    return {"scores": {f"task{i}": run_baseline(i, seed=0)["score"] for i in [1,2,3]},
            "expected_range": {"task1":[0.30,0.40], "task2":[0.22,0.32], "task3":[0.18,0.28]}}
P5
Days 8–9
🚀
Phase Gate — End of Day 9
From a fresh machine with no local setup, running the Colab notebook completes all cells without error. The HuggingFace Spaces URL is public and all endpoints respond. The PPO reward curve plot shows a statistically increasing trend from first 5k steps to last 5k steps of training.
HUGGINGFACE SPACES DEPLOYMENT
Person B — Days 8-9
  • B
    Create the HF Space with the Docker SDK: go to huggingface.co/new-space and select SDK: Docker. This creates a Dockerfile-based deployment where port 7860 is auto-exposed. Push your repo code.
  • B
    README.md HF frontmatter: add the required YAML block at the top of README.md (shown after this list): title: InferenceGym, emoji: 🏋️, colorFrom: green, colorTo: blue, sdk: docker, pinned: false. This controls the HF Space landing page.
  • B
    Health check verification: after pushing, HF Spaces shows a build log. Wait for "Running" status, then hit the public URL's /health endpoint. If it doesn't respond within 2 minutes, check the build logs for import errors — most commonly a missing package in requirements.txt.
  • B
    Stress test from the live URL: run 10 concurrent reset+step×5 loops against the live URL. Check that /health shows active_sessions > 0 during the test. Confirm no 500 errors appear in the HF Space logs.
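The frontmatter block from the checklist above, as it sits at the top of README.md:
yaml README.md frontmatter (HF Spaces convention)
---
title: InferenceGym
emoji: 🏋️
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
---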
PPO DEMO AGENT — Person C, Day 8
Gym wrapper + stable-baselines3 PPO
  • C
    Write the HTTPGymEnv wrapper: subclass gymnasium.Env. reset() calls POST /reset; step(action) calls POST /step. observation_space is Box(low=-inf, high=inf, shape=(14,)): the 12 observation fields flatten to 14 dims once the 3-way priority distribution is unpacked. action_space is a Box over the continuous knobs.
  • C
    Run PPO for 50k steps on Task 1: use stable_baselines3.PPO("MlpPolicy", env, verbose=1) and train for 50k steps. Plot the mean per-step reward over time with matplotlib; it should go from ~0.1 at the start to ~0.35+ by 50k steps.
  • C
    If PPO doesn't converge, check: (1) normalise observations with VecNormalize, (2) reduce the learning rate to 1e-4, (3) increase n_steps to 2048, (4) confirm the reward range is [-1,1] (it should be, from InferenceEnv). The environment is designed to be learnable. A sketch of fixes (1) and (2) follows this list.
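python — PPO with observation normalisation (a sketch assuming the InferenceGymEnv wrapper from the Colab notebook below)
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

venv = DummyVecEnv([lambda: InferenceGymEnv(HF_URL, task_id=1)])
venv = VecNormalize(venv, norm_obs=True, norm_reward=False)   # obs fields span very different scales
model = PPO("MlpPolicy", venv, learning_rate=1e-4, n_steps=2048, verbose=1)
model.learn(total_timesteps=50_000)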
COLAB DEMO NOTEBOOK STRUCTURE — Person C, Day 9
python notebooks/InferenceGym_Demo.ipynb — Cell-by-cell structure
# Cell 1: Title markdown
# "# InferenceGym Demo — Meta PyTorch × Scaler Hackathon 2026"

# Cell 2: Install (runs in 90 seconds on Colab)
!pip install stable-baselines3 gymnasium httpx pandas matplotlib -q

# Cell 3: Connect to live environment
HF_URL = "https://YOUR_ORG-inferencegym.hf.space"
import httpx
response = httpx.get(f"{HF_URL}/health")
print("Environment status:", response.json())

# Cell 4: Show available tasks
tasks = httpx.get(f"{HF_URL}/tasks").json()
for t in tasks['tasks']: print(f"{t['id']}: {t['name']} ({t['difficulty']})")

# Cell 5: Run baseline agent, show scores
baseline = httpx.get(f"{HF_URL}/baseline").json()
print("Baseline scores (naïve vLLM defaults):", baseline['scores'])

# Cell 6: Manual episode — human in the loop
res = httpx.post(f"{HF_URL}/reset", json={"task_id": 1, "seed": 42}).json()
session_id = res['session_id']; obs = res['observation']
print("Initial observation:", obs)

# Cell 7: Run 10 manual steps with a smart action
episode_log = []
for _ in range(10):
    result = httpx.post(f"{HF_URL}/step", json={"session_id": session_id,
        "action": {"kv_budget":0.6, "batch_size":128, "spec_length":0, "prefill_disagg":False, "quant_tier":0}}).json()
    episode_log.append(result)

# Cell 8: Gym wrapper
import gymnasium as gym; import numpy as np; import httpx

class InferenceGymEnv(gym.Env):
    def __init__(self, base_url, task_id=1):
        self.url = base_url; self.task_id = task_id; self.session_id = None
        # 11 scalar fields + 3-way priority distribution = 14-dim flattened vector
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(14,), dtype=np.float32)
        self.action_space = gym.spaces.Box(
            low=np.array([0.1, 0.0, 1.0], dtype=np.float32),    # kv_budget, (reserved), batch_size
            high=np.array([1.0, 1.0, 512.0], dtype=np.float32))
    def obs_to_array(self, obs):
        flat = []
        for v in obs.values():  # priority_distribution arrives as a nested 3-element list
            if isinstance(v, list): flat.extend(v)
            else: flat.append(v)
        return np.array(flat, dtype=np.float32)
    def reset(self, **kwargs):
        r = httpx.post(f"{self.url}/reset", json={"task_id":self.task_id}).json()
        self.session_id = r['session_id']; return self.obs_to_array(r['observation']), {}
    def step(self, action):
        act = {"kv_budget":float(action[0]), "spec_length":0, "batch_size":int(action[2]),
               "prefill_disagg":False, "quant_tier":0}
        r = httpx.post(f"{self.url}/step", json={"session_id":self.session_id,"action":act}).json()
        return self.obs_to_array(r['observation']), r['reward'], r['done'], False, {}

# Cell 9: Train PPO (takes ~10 minutes on Colab T4)
from stable_baselines3 import PPO
env = InferenceGymEnv(HF_URL, task_id=1)
model = PPO("MlpPolicy", env, verbose=1, learning_rate=3e-4, n_steps=512)
model.learn(total_timesteps=50_000)

# Cell 10: Plot reward curve (the money shot)
import matplotlib.pyplot as plt
# ep_info_buffer stores episode returns; divide by the 200-step episode
# length to get a mean per-step reward on the same scale as the reward clip
rewards = [ep['r'] / 200 for ep in model.ep_info_buffer]
plt.figure(figsize=(12,4)); plt.plot(rewards, alpha=0.3, label='Mean per-step reward')
plt.axhline(y=0.35, color='r', linestyle='--', label='Baseline grader score (reference)')
plt.title('PPO Agent Learning on InferenceGym Task 1'); plt.legend(); plt.show()
print(f"Final agent mean step reward: {np.mean(rewards[-20:]):.3f} vs baseline score: 0.35")
P6
Days 10–11
🏆
Final Gate — Submit by Apr 7 11:59 PM
The submission form is filled with HF Space URL + GitHub repo URL. No code changes after submission. The repo is public, has a clean README, and contains no API keys or large binary files committed to git.
ENVIRONMENT.md — Technical spec for judges
Person A writes this on Day 10
  • A
    Observation space table: a full table with field name, type, range, and description for all 12 observation fields. Copy from models.py and expand.
  • A
    Action space table: a full table with field name, type, valid values, default, and effect when changed for all 5 action dimensions.
  • A
    Reward function derivation: show the R = αT - βL - γV - δC formula with all constants, normalisation choices, and why each weight was set the way it was.
  • A
    Trace data methodology: document exactly what source data you used, how it was preprocessed, and why it's realistic. If using published benchmarks, cite them.
README.md — The first thing judges see
Person C writes this on Day 10
  • C
    One-paragraph pitch first: before any technical content. Why does this environment matter? What problem does it solve? These should be the same words you'd use to pitch a judge in 30 seconds.
  • C
    Quick start in 5 lines: show the curl commands to hit /health, /reset, /step, /grader. A judge who never reads further should still understand the API from these 5 lines.
  • C
    Baseline vs agent scores table: show a simple table, Task 1/2/3 × Baseline/PPO Agent. The numbers do the talking.
  • C
    Link the Colab notebook prominently: an "Open in Colab" badge. Judges who click it and see the reward curve rising will be convinced.
2-MINUTE DEMO VIDEO SCRIPT — Person C, Day 10
Time | Screen | What you say / show
0:00–0:20 | Slide: problem statement | "LLM inference is where 80% of AI budget is spent. There's no RL environment for optimising it. We built one."
0:20–0:40 | HF Space — /health → /tasks | "This is InferenceGym on HuggingFace Spaces, live right now. 3 tasks, 5 action knobs, fully CPU-only." Hit the endpoints live.
0:40–1:00 | Colab — run baseline | "Naïve vLLM defaults score 0.35 on Task 1. That's your baseline — static config, no optimisation."
1:00–1:30 | Colab — PPO reward curve | "A simple PPO agent trained for 50k steps hits 0.65 — almost double. No GPU, no model, just our trace-driven simulator." Show the plot.
1:30–2:00 | Architecture diagram | "Any company can drop in their own trace data and train an agent for their specific workload. That's the value proposition. Thank you."
TL
Mar 27
Day 1
TODAY
PHASE 0 — SETUP & ARCHITECTURE LOCK
  • A →Design data schemas in models.py. Write skeleton TraceSimulator with hardcoded stub output. Design lookup table format.
  • B →Create FastAPI app with all 8 endpoint stubs returning valid-shaped hardcoded JSON. Dockerfile builds. /health returns 200.
  • C →Write grader rubric on paper for all 3 tasks. Download trace data. Write workload_configs.json. Agree on HF Space naming.
  • ALL →Agree and commit models.py to main. This file cannot change after today without unanimous consent.
Mar 28
Day 2
PHASE 1 — SIMULATOR CORE (Day 1 of 2)
  • A →Implement TraceSimulator — load parquet, bilinear interpolation, Gaussian noise, OOM detection. Write WorkloadGenerator (Poisson arrivals, prompt sampling).
  • B →Wire /reset and /step endpoints to the InferenceEnv stubs (not real yet — use A's skeleton). Test with curl that responses are correctly shaped.
  • C →Process trace data — reshape into lookup table Parquet format with correct columns. Validate at least 50 data points across the batch×prompt grid. Start grader skeleton.
Mar 29
Day 3
PHASE 1 — SIMULATOR CORE (Day 2 of 2) 🔑 CRITICAL GATE
  • A →Complete WorkloadGenerator — queue depth, burst injection, spec acceptance model. Complete InferenceEnv.reset() and step(). All simulator unit tests pass.
  • B →Wire all endpoints to real InferenceEnv (replacing stubs). Implement SessionManager. Test full reset→step×10 cycle via HTTP.
  • C →Implement GraderModule skeleton with correct formula shape (even if constants need tuning). Run smoke test: score a 10-step episode log. Get any finite number.
Mar 30
Day 4
PHASE 2 — ENVIRONMENT LOGIC COMPLETE
  • A →Implement all 3 task configs (action masking for T1/T2, burst injection for T3). Full reward function with α β γ δ weights. Write full unit test suite (20+ tests).
  • B →Build Dockerfile — multi-stage, confirm image <2GB. Run full Docker cycle locally. Implement /state, /info, /health endpoints. Add Pydantic request validation.
  • C →Complete GraderModule — calibrate baseline TPS constants, write unit tests for all 3 task graders with known expected outputs. Score computation verified by hand.
Mar 31
Day 5
PHASE 3 — API LAYER COMPLETE & OPENENV VALIDATED
  • A →Full integration test — run 200-step episode for all 3 tasks programmatically. Confirm rewards are in [-1,1] range. Fix any edge cases (divide by zero, negative queue).
  • B →Run openenv validate — fix any compliance issues. Implement /grader and /baseline endpoints (wiring C's modules). Add rate limiting and CORS middleware.
  • C →Write BaselineAgent and run against all 3 tasks. Record expected scores (should be ~0.30-0.35 for T1, ~0.22-0.28 for T2, ~0.18-0.24 for T3). Adjust grader constants if needed.
Apr 1
Day 6
PHASE 4 — GRADER & BASELINE COMPLETE
  • A →Adversarial task stress test — run 1000-step Task 3 episodes, check burst injection fires at correct intervals, priority routing triggers, no state corruption.
  • B →Concurrent session test — run 10 simultaneous reset→step×5 cycles, confirm no session leakage. Profile memory usage under load — must stay under 512MB.
  • C →Write PPO gym wrapper (HTTPGymEnv). Start PPO training on Task 1. Set it running overnight — 50k steps should complete in ~4-6 hours on a modern CPU.
Apr 2
Day 7
BUFFER DAY + INTERNAL DEMO
  • ALL →Internal demo meeting — each person walks through the Colab notebook end to end. Find anything broken. Fix it today.
  • A →Fix any bugs found in internal demo. Add /info endpoint with full JSON schema. Docstrings on all public methods.
  • C →Review PPO training results — plot reward curve, verify it's increasing. If not, debug (check normalization, learning rate, reward scale). Start writing Colab notebook.
Apr 3
Day 8
PHASE 5 — DEPLOYMENT
  • B →Deploy to HuggingFace Spaces — push, watch build logs, verify all endpoints respond from live public URL. Document the URL in README.
  • C →Complete Colab notebook — all 10 cells work end-to-end against the live HF Space URL. The notebook should run cold in under 15 minutes.
  • A →Test from fresh machine — clone the repo, build Docker, run all tests. Confirm there are no hidden local dependencies. Fix whatever breaks.
Apr 4
Day 9
PHASE 5 — DEMO COMPLETE
  • C →Record 2-minute demo video using OBS or Loom. Follow the script. Upload to YouTube (unlisted) and link in README. Do not make it public until submission.
  • B →Stress test live deployment — 50 concurrent requests, verify no 500 errors. Check HF Space memory and CPU usage stays stable.
  • ALL →Write submission description draft (~500 words covering: problem, design, grader design, baseline vs agent results). Will refine on Day 10.
Apr 5–6
Days 10-11
PHASE 6 — WRITEUP, POLISH & SUBMISSION PREP
  • A →Write ENVIRONMENT.md — full technical spec for judges (observation space, action space, reward formula, task descriptions, simulator methodology).
  • C →Write final README — pitch paragraph, quick start, baseline vs agent table, Colab link, video link. Run through the submission checklist line by line.
  • ALL →Final end-to-end verification — test from a fresh browser with no cookies or local setup. Every endpoint must work. Grader must score any completed episode.
Apr 7
DEADLINE
SUBMIT BY 11:59 PM — NO CODE CHANGES AFTER
  • ALL →Submit HF Space URL + GitHub repo URL on hackathon portal. Fill in: env name, description, team members. Double check the HF Space is public.
§A
COMPLETE FILE TREE WITH OWNERSHIP
text Repository structure
inferencegym/
├── models.py               [ALL] — Locked Day 1. ServeAction, ServeObservation, MetricsSnapshot, WorkloadState
│
├── env/
│   ├── inference_env.py    [A] — Core InferenceEnv class. reset(), step(), _compute_reward(), _enforce_action_mask()
│   ├── observation.py      [A] — _build_obs() helper, normalise values to [0,1] for RL agents
│   ├── action.py           [A] — ActionValidator, clamp continuous actions to valid ranges
│   └── reward.py           [A] — RewardComputer, configurable α β γ δ, TASK_CONFIGS dict
│
├── simulator/
│   ├── trace_sim.py        [A] — TraceSimulator: load parquet, interpolate, noise, OOM detection
│   ├── workload.py         [A] — WorkloadGenerator: Poisson, LogNormal, burst injection, queue
│   ├── session_manager.py  [B] — SessionManager: thread-safe LRU cache of InferenceEnv instances
│   └── data/
│       ├── traces_llama3_8b.parquet    [C] — lookup table: (batch,kv,spec,plen) → metrics
│       ├── sharegpt_dist.json          [C] — LogNormal params for Task 2 prompt distribution
│       └── workload_configs.json       [C] — Task 1/2/3 workload configuration parameters
│
├── grader/
│   ├── grader.py           [C] — GraderModule: dispatches to per-task graders, returns score+breakdown
│   ├── task1_grader.py     [C] — Throughput normalisation formula
│   ├── task2_grader.py     [C] — TTFT + memory compliance formula
│   └── task3_grader.py     [C] — 4-objective formula including action stability
│
├── agents/
│   ├── baseline.py         [C] — BaselineAgent: fixed BASELINE_ACTION, run_all_baselines()
│   └── ppo_demo.py         [C] — HTTPGymEnv wrapper + PPO training script
│
├── server/
│   ├── app.py              [B] — FastAPI application, all 8 endpoints, startup event
│   ├── schemas.py          [B] — Pydantic request/response models (ResetRequest, StepRequest, etc.)
│   └── middleware.py       [B] — CORS, rate limiting (max 100 req/min per IP), request logging
│
├── tests/
│   ├── test_simulator.py   [A] — 20+ unit tests for TraceSimulator and WorkloadGenerator
│   ├── test_env.py         [A] — Contract tests for step/reset/state, edge cases
│   ├── test_grader.py      [C] — Unit tests for all 3 grader formulas with known expected outputs
│   └── test_api.py         [B] — Integration tests: httpx client hitting full FastAPI stack
│
├── notebooks/
│   └── InferenceGym_Demo.ipynb   [C] — 10-cell Colab demo notebook
│
├── Dockerfile              [B] — Multi-stage, CPU-only, port 7860, <2GB image
├── docker-compose.yml      [B] — Local dev: volume mount source, hot reload
├── requirements.txt        [B] — Pinned CPU-only deps. No torch. No CUDA.
├── README.md               [C] — HF Spaces frontmatter + pitch + quickstart + links
└── ENVIRONMENT.md          [A] — Full technical spec for judges
MODULE INTERFACE CONTRACTS — What each module must expose
TraceSimulator simulator/trace_sim.py
  • __init__(trace_path: str, seed: int = 42) — loads parquet, builds interpolators, sets rng
  • simulate(action: ServeAction, workload: WorkloadState) → MetricsSnapshot — the core method
  • reset_seed(seed: int) — resets the rng for episode reproducibility
  • Must not raise exceptions on valid input. OOM conditions are returned as data, not exceptions.
WorkloadGenerator simulator/workload.py
  • __init__(task_id: int, seed: int = 42) — loads workload config for this task
  • reset() → WorkloadState — returns initial state, resets internal step counter
  • step(action: ServeAction) → WorkloadState — advances one step, updates queue
  • is_burst_active() → bool — True during burst windows for Task 3
InferenceEnv env/inference_env.py
  • reset() → ServeObservation — starts new episode, returns initial observation
  • step(action) → (obs, reward, done, info) — Gym-compatible signature
  • state() → dict — returns episode metadata for /state endpoint
  • _episode_log: list — accumulates step dicts for grader consumption
  • session_id: str — unique UUID per episode, set on reset()
GraderModule grader/grader.py
  • score(task_id: int, episode_log: list) → dict — returns {score, breakdown, feedback}
  • Must be stateless — no internal mutable state. Same input → same output, always (see the test sketch after this list).
  • score must be a float in [0.0, 1.0]
  • breakdown must contain one float per scoring component
  • feedback must be a human-readable string explaining the score
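A minimal contract test for those guarantees (the metric values are hypothetical but schema-valid):
python — GraderModule contract test sketch
from grader.grader import GraderModule

def test_grader_deterministic_and_bounded():
    g = GraderModule()
    step = {"metrics": {"tokens_per_sec": 3000.0, "ttft_p50_ms": 250.0,
                        "gpu_memory_gb": 30.0, "slo_violations": 0,
                        "cost_per_1k": 0.001},
            "action": {"batch_size": 32, "kv_budget": 1.0}}
    log = [step] * 10
    r1, r2 = g.score(1, log), g.score(1, log)
    assert r1 == r2                        # stateless: same input, same output
    assert 0.0 <= r1["score"] <= 1.0       # bounded
    assert isinstance(r1["feedback"], str)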
§B
LOOKUP TABLE PARQUET SCHEMA — traces_llama3_8b.parquet
Column | Type | Values | Description
batch_size | int | 1, 4, 8, 16, 32, 64, 128, 256, 512 | Max concurrent requests served
kv_budget | float | 0.1, 0.25, 0.5, 0.75, 1.0 | KV cache allocation fraction
spec_length | int | 0, 1, 2, 4, 8 | Speculative draft tokens (0 = disabled)
quant_tier | int | 0, 1, 2 | 0 = FP16, 1 = INT8, 2 = INT4
prompt_len_bucket | int | 0–7 | Bucket index over [64, 128, 256, 512, 1024, 2048, 4096, 8192]
ttft_p50 | float | >0 | Median time to first token (ms); column names match trace_sim.py
ttft_p99 | float | >0 | 99th-percentile TTFT (ms)
tpot | float | >0 | Time per output token (ms)
tps | float | >0 | Output tokens per second
gpu_mem_gb | float | 0–80 | GPU memory footprint in GB
cost_per_1k | float | >0 | Relative cost per 1000 tokens (normalised)
WORKLOAD CONFIGS — workload_configs.json structure
json simulator/data/workload_configs.json
{
  "tasks": {
    "1": {
      "name": "Static Uniform",
      "arrival_rate_rps": 10.0,
      "arrival_dist": "poisson",
      "prompt_len_dist": "uniform",
      "prompt_len_min": 64,
      "prompt_len_max": 128,
      "slo_target_ms": 500.0,
      "burst_enabled": false,
      "priority_routing": false,
      "active_actions": ["kv_budget", "batch_size"]
    },
    "2": {
      "name": "Bursty ShareGPT",
      "arrival_rate_rps": 25.0,
      "arrival_rate_burst": 80.0,
      "burst_period_steps": 30,
      "arrival_dist": "poisson_bursty",
      "prompt_len_dist": "lognormal",
      "prompt_len_mu": 5.2,
      "prompt_len_sigma": 1.3,
      "prompt_len_clamp_min": 32,
      "prompt_len_clamp_max": 8192,
      "memory_hard_limit_gb": 36.0,
      "slo_target_ms": 300.0,
      "burst_enabled": true,
      "active_actions": ["kv_budget", "batch_size", "spec_length"]
    },
    "3": {
      "name": "Adversarial Multi-Tenant",
      "arrival_rate_rps": 30.0,
      "burst_multiplier": 10.0,
      "burst_interval_steps": 120,
      "burst_duration_steps": 15,
      "prompt_len_dist": "bimodal",
      "short_request_frac": 0.7,
      "short_prompt_max": 128,
      "long_prompt_min": 4096,
      "long_prompt_max": 8192,
      "priority_mix": [0.2, 0.5, 0.3],
      "slo_interactive_ms": 200.0,
      "slo_batch_ms": 2000.0,
      "cost_budget_episode": 5.0,
      "memory_hard_limit_gb": 38.0,
      "active_actions": ["kv_budget", "batch_size", "spec_length", "prefill_disagg", "quant_tier"]
    }
  }
}
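A minimal loader for this file, matching the WorkloadGenerator contract in §A (note the task keys are strings in JSON):
python — config loader sketch
import json

def load_task_config(task_id: int,
                     path: str = "simulator/data/workload_configs.json") -> dict:
    with open(path) as f:
        return json.load(f)["tasks"][str(task_id)]

# e.g. load_task_config(3)["burst_multiplier"] → 10.0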
COMPLETE OBSERVATION & ACTION SPACE REFERENCE
Field | Type | Range | Normalised? | Description
queue_depth | float | [0, 512] | No | Pending requests in the serving queue
mean_prompt_len | float | [32, 8192] | No | Mean token count of the current window
arrival_rate | float | [0, 200] | No | 10-step EMA of requests/second
kv_cache_occupancy | float | [0.0, 1.0] | Yes | Fraction of KV cache in use
ttft_p50 | float | [0, 5000] ms | No | Median TTFT over the last 20 requests
tpot_p50 | float | [0, 500] ms | No | Median time per output token
slo_violation_rate | float | [0.0, 1.0] | Yes | Fraction of requests missing SLO
gpu_memory_used_gb | float | [0, 80] | No | Simulated GPU memory pressure
spec_accept_rate | float | [0.0, 1.0] | Yes | Speculative token acceptance rate
priority_distribution | float[3] | [0, 1] each | Yes | [interactive, batch, best_effort] fractions
timestep | int | [0, 200] | No | Current episode step
cost_so_far | float | [0, ∞) | No | Cumulative cost this episode
§C
Risk · Probability · Mitigation · Owner
Trace data is wrong shape
Published benchmarks don't have the exact columns needed.
Prob: Medium · Owner: C · Mitigation: Implement Option C (synthetic data) on Day 1 before even trying Option A. This takes 30 minutes and gives you a valid fallback; Option A then becomes an enhancement, not a dependency.
PPO doesn't converge
Reward curve is flat or decreasing.
Prob: Low · Owner: C · Mitigation: Task 1 is designed for easy learning. If PPO fails: (1) add a VecNormalize wrapper, (2) lower the learning rate to 1e-4, (3) check the reward is truly in [-1,1]. If it is still failing, use a simple hill-climbing agent — just show any rising curve.
HuggingFace Spaces OOM
The free tier has 16GB RAM — the simulator might use too much.
Prob: Low · Owner: B · Mitigation: Load trace data as a numpy array, not a pandas DataFrame, at startup. Target <200MB for the lookup table. Use parquet with snappy compression. Test memory usage locally with psutil before deploying.
Race condition in session cache
Concurrent requests corrupt session state.
Prob: Medium · Owner: B · Mitigation: All reads and writes to the self._sessions dict are wrapped in threading.Lock(). Individual InferenceEnv instances are not thread-safe, but each session is owned by one caller at a time, and a well-behaved client issues /step calls for a given session sequentially — the lock keeps the cache itself consistent.
Grader gives score > 1.0 or < 0.0
Formula constants are miscalibrated.
Prob: Medium · Owner: C · Mitigation: Every grader component score is individually clipped with np.clip(x, 0.0, 1.0) before the weighted sum, and the final score is clipped too. Calibrate the BASELINE_TPS and OPTIMAL_TPS constants on Day 5 by running the actual baseline agent and verifying scores fall in [0.18, 0.40].
Person A is blocked on Day 3
Simulator not done; Persons B and C can't proceed.
Prob: Medium · Owner: A · Mitigation: Person A prioritises the interface (simulate() returns a valid MetricsSnapshot) over implementation quality. A synthetic linear model with hardcoded constants is enough for Day 3 — B and C only need the method signature to work. Real trace data can be plugged in on Day 4.
Docker image >2GB
stable-baselines3 pulls in a large PyTorch dependency.
Prob: Medium · Owner: B · Mitigation: Put stable-baselines3[extra] in a separate requirements-demo.txt that is NOT in the Dockerfile. The server only needs the environment; the PPO demo runs outside the container (in Colab). This keeps the image under 500MB.
OpenEnv spec compliance fails
openenv validate finds schema mismatches.
Prob: Low · Owner: B · Mitigation: Run openenv validate at the end of every day starting Day 3. Validation issues are almost always JSON-schema issues — field names, types, missing fields. Fix immediately, never defer. Keep a local copy of the openenv spec open while writing endpoint response schemas.
§D
OPENENV COMPLIANCE
  • POST /reset returns session_id + initial observation dict
  • POST /step returns observation + reward (float) + done (bool) + info
  • GET /state returns current episode metadata
  • GET /tasks returns 3 tasks with id, name, difficulty labels
  • POST /grader returns score 0.0–1.0 + breakdown dict + feedback string
  • GET /baseline returns reproducible baseline scores for all 3 tasks
  • GET /health returns {"status": "ok"}
  • openenv validate --url https://YOUR_SPACE.hf.space passes with no errors
  • 3 tasks with easy/medium/hard difficulty labels present
  • Reward function documented with partial credit design
QUALITY CRITERIA
  • Baseline agent runs reproducibly (fixed seed=0, same score every run)
  • PPO reward curve plot shows statistically increasing trend
  • Colab notebook runs end-to-end in <15 minutes on free T4
  • README has: pitch paragraph, quickstart, scores table, Colab link, video link
  • ENVIRONMENT.md has full technical spec
  • No API keys, no secrets in repository
  • No large binary files committed to git (use .gitignore for *.parquet — serve from HF repo)
  • Grader is deterministic (run same episode log twice, get same score)
  • 2-minute demo video recorded and linked in README
  • HF Space is public (not private or gated)
DEPLOYMENT CHECKS
  • Docker image builds locally with docker build -t test .
  • Image is under 2GB (docker image ls)
  • Container starts and /health responds within 30s
  • HF Spaces URL is live and all endpoints respond
  • Tested from a fresh browser/machine with no local setup
  • 50 concurrent requests don't produce 500 errors
  • HF Spaces shows "Running" not "Building" or "Error"
SUBMISSION FORM
  • Environment name: InferenceGym (or your chosen name)
  • Description: 500-word submission text
  • All team member names listed
  • HuggingFace Spaces URL submitted
  • GitHub repository URL submitted (public)
  • Submitted BEFORE 11:59 PM April 7
  • No code changes pushed after submission time
🎯 The One-Line Summary for Judges
InferenceGym is the first RL environment for LLM inference control. A naïve vLLM config scores 0.22 on the hardest task. A simple PPO agent trained for 50k steps reaches 0.65 — a 3× improvement in serving efficiency, no GPU, no model required. That's the pitch. Everything else in this document is how you build the thing that delivers that demo.

INFERENCEGYM · MASTER BUILD DOCUMENT · META PYTORCH × SCALER HACKATHON 2026 · DEADLINE APRIL 7