A modular, phase-gated engineering plan for building the first RL environment for LLM inference control. Every phase ends with a fully functional, testable system. No phase leaves you broken. Deadline: April 7, 2026 · 11 days · 3 people.
Stub everything so Persons B and C can code against `simulate(action)` right now. Lock `ServeAction`, `ServeObservation`, and `MetricsSnapshot` on Day 1, before writing a single line of logic — changing the schema mid-build is the #1 cause of integration hell.

Exit criterion: anyone can `curl http://localhost:7860/health` and get a 200 OK. All three people have cloned the repo, installed deps, and can run the stub server locally. The data schemas are written and committed to `models.py`. Nobody can start Day 2 until this is true.

Day 1 tasks:

- Create `simulator/trace_sim.py` with class stubs: `TraceSimulator.__init__` and `simulate(action, workload)` returning a hardcoded `MetricsSnapshot`.
- Create `simulator/workload.py` — a stub that returns a fixed `WorkloadState` dict every time.
- Create `server/app.py` with all 8 endpoint stubs that return hardcoded valid responses.
- Run `openenv init` and understand what `openenv validate` checks. Make sure the stub server passes basic validation.
- Set up branches (`feat/simulator`, `feat/api`, etc.) and `.gitignore`.
- Write `simulator/data/workload_configs.json` with the exact parameters for Tasks 1, 2, and 3 (arrival rate, SLO, prompt distribution params).

```python
from dataclasses import dataclass, field
from typing import Optional, List, Dict, Any
from enum import Enum

# ── Action space ─────────────────────────────────────────────────────────────
class QuantTier(Enum):
    FP16 = 0
    INT8 = 1
    INT4 = 2

@dataclass
class ServeAction:
    kv_budget: float        # 0.1–1.0 : fraction of KV cache allocated
    spec_length: int        # 0,1,2,4,8 : speculative draft tokens
    batch_size: int         # 1–512 : max concurrent requests
    prefill_disagg: bool    # True/False : disaggregate prefill GPU
    quant_tier: QuantTier   # FP16/INT8/INT4

    def validate(self) -> bool:
        assert 0.1 <= self.kv_budget <= 1.0
        assert self.spec_length in {0, 1, 2, 4, 8}
        assert 1 <= self.batch_size <= 512
        return True

# ── Simulator output ─────────────────────────────────────────────────────────
@dataclass
class MetricsSnapshot:
    ttft_p50_ms: float       # median time to first token
    ttft_p99_ms: float       # tail latency
    tpot_ms: float           # time per output token
    tokens_per_sec: float    # throughput
    gpu_memory_gb: float     # simulated memory pressure
    cost_per_1k: float       # compute cost (normalised units)
    spec_accept_rate: float  # 0.0 if spec_length == 0
    eviction_events: int     # KV cache evictions this step
    slo_violations: int      # requests that exceeded SLO this step

# ── Observation (what the agent sees) ────────────────────────────────────────
@dataclass
class ServeObservation:
    queue_depth: float
    mean_prompt_len: float
    arrival_rate: float
    kv_cache_occupancy: float
    ttft_p50: float
    tpot_p50: float
    slo_violation_rate: float
    gpu_memory_used_gb: float
    spec_accept_rate: float
    priority_distribution: List[float]  # [interactive, batch, best_effort]
    timestep: int
    cost_so_far: float

# ── Workload state ───────────────────────────────────────────────────────────
@dataclass
class WorkloadState:
    arrival_rate: float
    mean_prompt_len: float
    prompt_len_bucket: int  # 0–7, discrete bucket for lookup table
    queue_depth: int
    priority_distribution: List[float]
    is_burst: bool
    phase: str              # "warmup" | "steady" | "burst" | "cooldown"
```
```bash
# From repo root:
docker build -t inferencegym . && docker run -p 7860:7860 inferencegym &
curl http://localhost:7860/health   # → {"status": "ok"}
curl http://localhost:7860/tasks    # → {"tasks": [{...}, {...}, {...}]}
python -c "from inferencegym.models import ServeAction, ServeObservation; print('schemas OK')"
```
Once `TraceSimulator.simulate(action, workload) → MetricsSnapshot` works, Person B can wire it into the API and Person C can build the grader; both can proceed in parallel. Person A must finish this by the end of Day 3, even if that means simplifying the interpolation.
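Until then, a hardcoded stub that honours the signature keeps the other two tracks unblocked — a minimal sketch, with arbitrary placeholder values:

```python
from inferencegym.models import ServeAction, WorkloadState, MetricsSnapshot

class TraceSimulator:
    """Day 1–3 stub: correct interface, hardcoded output."""

    def __init__(self, trace_path: str, seed: int = 42):
        pass  # real table loading lands on Day 3–4

    def simulate(self, action: ServeAction, workload: WorkloadState) -> MetricsSnapshot:
        # Placeholder numbers; only the shape of the return value matters here.
        return MetricsSnapshot(
            ttft_p50_ms=200.0, ttft_p99_ms=350.0, tpot_ms=20.0,
            tokens_per_sec=2000.0, gpu_memory_gb=24.0, cost_per_1k=0.001,
            spec_accept_rate=0.0, eviction_events=0, slo_violations=0,
        )
```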
Exit criteria: `python tests/test_simulator.py` passes all tests. The simulator returns realistic-shaped numbers for a variety of (action, workload) inputs. The workload generator produces a different workload state on every call. These are the two things that must be true before Phase 2 begins.

Simulator design:

- The lookup table is keyed by `(batch_bucket, kv_bucket, spec_bucket, prompt_bucket)`; each value is a `MetricsSnapshot`. The lookup table must be loaded once at startup and cached in memory.
- Use `scipy.interpolate.RegularGridInterpolator` for continuous actions (`kv_budget`, `batch_size`) between discrete lookup points. For discrete actions (`spec_length`, `quant_tier`), use nearest-neighbour lookup.
- Add ±5% Gaussian noise to `ttft_p50_ms` and `tpot_ms` to simulate hardware jitter. Use `np.random.default_rng(seed)` so episodes are reproducible.
- If `gpu_memory_gb > 40.0`, set a hard OOM flag, cap memory at 40 GB, and multiply `slo_violations` by 5 as a penalty signal.

Workload generator:

- Arrivals: `np.random.poisson(lam=arrival_rate)` per step. The arrival rate varies by task config loaded from `workload_configs.json`.
- Prompt lengths — Task 1: `np.random.uniform(64, 128)`. Task 2: `np.random.lognormal(5.2, 1.3)` clamped to [32, 8192]. Task 3: bimodal — 70% uniform(32, 128), 30% uniform(4096, 8192).
- Bucket prompt lengths with `np.digitize` against [64, 128, 256, 512, 1024, 2048, 4096]. This is the lookup table key.
- Maintain a `queue_depth` counter. Each step: add new arrivals, subtract `min(batch_size, queue_depth)` served requests. The queue cannot go negative.
- Burst windows set `is_burst=True` in `WorkloadState` during those steps.
- The priority mix is exposed via `WorkloadState.priority_distribution`.
- Speculative acceptance: `accept_rate = base_rate * (1 - complexity_penalty) * depth_decay`, where `depth_decay = 1.0 / (1 + 0.15 * spec_length)`. Base rate by task: Task 1 = 0.80, Task 2 = 0.65, Task 3 = 0.45.

Unit tests:

- Call `simulate(action, workload)` with 20 random valid actions — all return a non-null `MetricsSnapshot` with values in expected ranges.
- Increasing `batch_size` while holding other actions constant should strictly increase `tokens_per_sec` (up to a threshold). This validates that the lookup table is correctly loaded.
- Set `batch_size=512, kv_budget=1.0` — confirm `gpu_memory_gb` triggers the overflow flag.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import RegularGridInterpolator
from pathlib import Path

from inferencegym.models import ServeAction, WorkloadState, MetricsSnapshot, QuantTier

class TraceSimulator:
    """
    CPU-only trace-driven simulator. Loads a pre-built lookup table and
    interpolates (action, workload) → MetricsSnapshot.
    """
    BATCH_POINTS  = [1, 4, 8, 16, 32, 64, 128, 256, 512]
    KV_POINTS     = [0.1, 0.25, 0.5, 0.75, 1.0]
    SPEC_POINTS   = [0, 1, 2, 4, 8]
    PLEN_BUCKETS  = [64, 128, 256, 512, 1024, 2048, 4096, 8192]
    OOM_THRESHOLD = 40.0   # GB
    NOISE_STD     = 0.05   # ±5% Gaussian jitter on latency metrics

    def __init__(self, trace_path: str, seed: int = 42):
        self.rng = np.random.default_rng(seed)
        self._load_tables(Path(trace_path))
        self._build_interpolators()

    def _load_tables(self, path: Path) -> None:
        df = pd.read_parquet(path)
        # Expected columns: batch_size, kv_budget, spec_length, quant_tier,
        # prompt_len_bucket, ttft_p50, ttft_p99, tpot, tps, gpu_mem_gb, cost_per_1k
        self._df = df

    def _reshape_for_interp(self, df: pd.DataFrame, col: str) -> np.ndarray:
        # Pivot the flat table into a dense 4-D grid ordered as
        # (batch_size, kv_budget, spec_length, prompt_len_bucket).
        ordered = df.sort_values(['batch_size', 'kv_budget', 'spec_length', 'prompt_len_bucket'])
        shape = (len(self.BATCH_POINTS), len(self.KV_POINTS),
                 len(self.SPEC_POINTS), len(self.PLEN_BUCKETS))
        return ordered[col].to_numpy().reshape(shape)

    def _build_interpolators(self) -> None:
        # Build a 4-D interpolator over (batch_size, kv_budget, spec_len, prompt_bucket)
        # for the FP16 baseline. INT8/INT4 are handled via multiplicative correction factors.
        fp16_df = self._df[self._df['quant_tier'] == 0]
        grid_vals = {
            'ttft_p50': self._reshape_for_interp(fp16_df, 'ttft_p50'),
            'ttft_p99': self._reshape_for_interp(fp16_df, 'ttft_p99'),
            'tpot':     self._reshape_for_interp(fp16_df, 'tpot'),
            'tps':      self._reshape_for_interp(fp16_df, 'tps'),
            'gpu_mem':  self._reshape_for_interp(fp16_df, 'gpu_mem_gb'),
        }
        points = (self.BATCH_POINTS, self.KV_POINTS, self.SPEC_POINTS, self.PLEN_BUCKETS)
        # fill_value=None extrapolates instead of returning NaN at the grid edges
        self._interps = {k: RegularGridInterpolator(points, v, method='linear',
                                                    bounds_error=False, fill_value=None)
                         for k, v in grid_vals.items()}

    def simulate(self, action: ServeAction, workload: WorkloadState) -> MetricsSnapshot:
        action.validate()
        query = [[action.batch_size, action.kv_budget,
                  action.spec_length, workload.mean_prompt_len]]
        # Interpolate base metrics
        base = {k: float(fn(query)[0]) for k, fn in self._interps.items()}
        # Apply quant tier correction factors (from benchmark data)
        quant_factors = {QuantTier.FP16: 1.0, QuantTier.INT8: 0.82, QuantTier.INT4: 0.68}
        q_factor = quant_factors[action.quant_tier]
        base['ttft_p50'] *= q_factor
        base['tps']      /= q_factor   # quantised models serve faster
        base['gpu_mem']  *= q_factor   # quantised models use less memory
        # Apply speculative decoding acceptance bonus
        if action.spec_length > 0:
            depth_decay = 1.0 / (1 + 0.15 * action.spec_length)
            accept_rate = 0.75 * (1 - 0.1 * workload.prompt_len_bucket) * depth_decay
            accept_rate = max(0.0, min(1.0, accept_rate))
            speedup = 1.0 + accept_rate * action.spec_length * 0.1
            base['ttft_p50'] /= speedup
        else:
            accept_rate = 0.0
        # Inject Gaussian noise
        noise = self.rng.normal(1.0, self.NOISE_STD, size=3)
        base['ttft_p50'] *= noise[0]
        base['ttft_p99'] *= noise[1]
        base['tpot']     *= noise[2]
        # OOM detection
        oom = base['gpu_mem'] > self.OOM_THRESHOLD
        slo_violations = 0   # computed by env, not simulator
        if oom:
            base['gpu_mem'] = self.OOM_THRESHOLD
            slo_violations = action.batch_size   # all requests fail on OOM
        return MetricsSnapshot(
            ttft_p50_ms      = max(1.0, base['ttft_p50']),
            ttft_p99_ms      = max(1.0, base['ttft_p99']),
            tpot_ms          = max(1.0, base['tpot']),
            tokens_per_sec   = max(0.0, base['tps']),
            gpu_memory_gb    = base['gpu_mem'],
            cost_per_1k      = base['tps'] * q_factor * 0.001,
            spec_accept_rate = accept_rate,
            eviction_events  = int(max(0, (1.0 - action.kv_budget) * workload.queue_depth)),
            slo_violations   = slo_violations,
        )
```
Trace data options, in order of preference:

- Option A: Pull published benchmark data from github.com/vllm-project/vllm/tree/main/benchmarks and the HuggingFace llm-perf-leaderboard. These have real measured latencies across batch sizes. Fit a pandas pivot table to get the lookup grid.
- Option B: Run `llmperf` on a Colab free T4 with Llama-3.2-1B-Instruct (free tier works). Grid search over batch_size=[1,4,8,16,32] × prompt_len=[64,128,256,512] — that's 20 measurements, about 2 hours of Colab time.
- Option C: A fully synthetic analytic model — `ttft = base_ms + batch_factor * batch_size + memory_factor * prompt_len`. These constants are documented in vLLM's OSDI paper. Fully deterministic, always works.

Exit criteria: `obs = env.reset(task_id=1); [env.step(random_action()) for _ in range(200)]` runs cleanly. Rewards are floats in [-1, 1]. The episode terminates at step 200. Session IDs are unique per reset call.

```python
import uuid
import numpy as np
from dataclasses import dataclass
from typing import Optional

from inferencegym.models import (ServeAction, ServeObservation, WorkloadState,
                                 MetricsSnapshot, QuantTier)
from simulator.trace_sim import TraceSimulator
from simulator.workload import WorkloadGenerator

@dataclass
class EnvConfig:
    task_id: int
    episode_len: int = 200
    slo_target_ms: float = 300.0
    max_memory_gb: float = 40.0
    # Reward weights
    alpha: float = 0.40   # throughput
    beta: float = 0.25    # latency
    gamma: float = 0.25   # SLO violations
    delta: float = 0.10   # cost

# Task configs — loaded from workload_configs.json
TASK_CONFIGS = {
    1: EnvConfig(task_id=1, slo_target_ms=500.0),
    2: EnvConfig(task_id=2, slo_target_ms=300.0, gamma=0.30),
    3: EnvConfig(task_id=3, slo_target_ms=200.0, gamma=0.35, delta=0.15),
}

# Max achievable throughput per task (set after running the optimal solver)
MAX_THROUGHPUT = {1: 8500.0, 2: 6200.0, 3: 4800.0}

class InferenceEnv:
    def __init__(self, simulator: TraceSimulator, task_id: int, seed: int = 42):
        self.sim = simulator
        self.config = TASK_CONFIGS[task_id]
        self.gen = WorkloadGenerator(task_id=task_id, seed=seed)
        self.session_id = str(uuid.uuid4())
        self._step = 0
        self._cost_so_far = 0.0
        self._workload = self.gen.reset()
        self._last_metrics: Optional[MetricsSnapshot] = None
        self._episode_log: list = []

    def reset(self) -> ServeObservation:
        self.session_id = str(uuid.uuid4())
        self._step = 0
        self._cost_so_far = 0.0
        self._workload = self.gen.reset()
        self._episode_log = []
        return self._build_obs(MetricsSnapshot(
            ttft_p50_ms=200.0, ttft_p99_ms=350.0, tpot_ms=20.0,
            tokens_per_sec=2000.0, gpu_memory_gb=24.0, cost_per_1k=0.001,
            spec_accept_rate=0.0, eviction_events=0, slo_violations=0))

    def step(self, action: ServeAction):
        if self._step >= self.config.episode_len:
            raise RuntimeError("Episode already done. Call reset() first.")
        # Task 1 & 2: lock certain actions
        action = self._enforce_action_mask(action)
        # Advance workload one step
        self._workload = self.gen.step(action)
        # Simulate this step
        metrics = self.sim.simulate(action, self._workload)
        self._last_metrics = metrics
        # Compute SLO violations from simulator metrics + SLO target
        metrics.slo_violations += int(
            metrics.ttft_p50_ms > self.config.slo_target_ms) * self._workload.queue_depth
        # Compute reward
        reward = self._compute_reward(metrics)
        # Update episode state
        self._cost_so_far += metrics.cost_per_1k
        self._step += 1
        done = self._step >= self.config.episode_len
        obs = self._build_obs(metrics)
        info = {"timestep": self._step, "metrics": metrics.__dict__,
                "workload": self._workload.__dict__}
        self._episode_log.append({"action": action.__dict__, "reward": reward,
                                  "metrics": metrics.__dict__})
        return obs, reward, done, info

    def _compute_reward(self, m: MetricsSnapshot) -> float:
        c = self.config
        T = m.tokens_per_sec / MAX_THROUGHPUT[c.task_id]
        L = m.ttft_p50_ms / c.slo_target_ms
        V = m.slo_violations / max(self._workload.queue_depth, 1)
        C = m.cost_per_1k / 0.005   # normalise against the budget ceiling
        reward = c.alpha * T - c.beta * L - c.gamma * V - c.delta * C
        return float(np.clip(reward, -1.0, 1.0))

    def _enforce_action_mask(self, action: ServeAction) -> ServeAction:
        if self.config.task_id == 1:
            action.spec_length = 0
            action.prefill_disagg = False
            action.quant_tier = QuantTier.FP16
        elif self.config.task_id == 2:
            action.prefill_disagg = False
            action.quant_tier = QuantTier.FP16
        return action

    def _build_obs(self, m: MetricsSnapshot) -> ServeObservation:
        w = self._workload
        return ServeObservation(
            queue_depth=float(w.queue_depth),
            mean_prompt_len=w.mean_prompt_len,
            arrival_rate=w.arrival_rate,
            kv_cache_occupancy=(1.0 - (m.eviction_events / max(w.queue_depth, 1))),
            ttft_p50=m.ttft_p50_ms,
            tpot_p50=m.tpot_ms,
            slo_violation_rate=m.slo_violations / max(w.queue_depth, 1),
            gpu_memory_used_gb=m.gpu_memory_gb,
            spec_accept_rate=m.spec_accept_rate,
            priority_distribution=w.priority_distribution,
            timestep=self._step,
            cost_so_far=self._cost_so_far,
        )
```
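As a sanity check on the weights, here is `_compute_reward` evaluated by hand for one plausible Task 1 step (the metric values are assumed for illustration, not measured):

```python
# Task 1: alpha=0.40, beta=0.25, gamma=0.25, delta=0.10; SLO target 500 ms.
# Assume: tokens_per_sec=2800, ttft_p50=200 ms, no SLO violations, cost_per_1k=0.001.
T = 2800.0 / 8500.0    # ≈ 0.329  (normalised by MAX_THROUGHPUT[1])
L = 200.0 / 500.0      # = 0.400  (median TTFT vs SLO target)
V = 0.0                # no violations this step
C = 0.001 / 0.005      # = 0.200  (cost vs budget ceiling)
reward = 0.40 * T - 0.25 * L - 0.25 * V - 0.10 * C   # ≈ +0.012
```

A near-zero reward for baseline-ish behaviour is intentional: the agent only earns a strongly positive reward by raising throughput without breaching the SLO.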
Exit criteria: `openenv validate --url http://localhost:7860` passes. Every endpoint returns the correct shape. The Docker image is under 2GB. A full reset → step×200 → grader cycle completes in under 60 seconds.

| Endpoint | Method | Owner | Wired to | Key behaviour |
|---|---|---|---|---|
| `/health` | GET | Person B | Session cache count | Returns `{"status":"ok","active_sessions":N,"uptime_seconds":T}` |
| `/tasks` | GET | Person B | Static task config dict | Returns a list of 3 tasks with id, name, difficulty, description, active_actions |
| `/reset` | POST | Person B | `InferenceEnv.reset()` | Creates a new session_id, instantiates an InferenceEnv for that task, stores it in the LRU cache. Returns session_id + observation. |
| `/step` | POST | Person B | `InferenceEnv.step()` | Looks up the session by session_id, validates the ServeAction, calls step(), returns obs + reward + done + info. 404 if the session is not found. |
| `/state` | GET | Person B | `InferenceEnv.state()` | Returns current episode metadata: step_count, cumulative_reward, done, workload_phase. |
| `/grader` | POST | Person C | `GraderModule.score()` | Accepts episode_log JSON, returns a score 0–1 with a breakdown. Stateless — the same input always yields the same output. |
| `/baseline` | GET | Person C | `BaselineAgent.run()` | Runs the fixed-config baseline agent on all 3 tasks, returns scores. A fixed seed guarantees reproducibility. |
| `/info` | GET | Person B | Static schema | Returns the full JSON schema for the action space, observation space, and reward weights. Used by agent frameworks. |
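For reference, a minimal client loop against these endpoints — the payload shapes follow the FastAPI app code below, and the localhost URL assumes the Docker run from the Day 1 smoke test:

```python
# Minimal episode loop over the HTTP API.
import httpx

BASE = "http://localhost:7860"
r = httpx.post(f"{BASE}/reset", json={"task_id": 1, "seed": 42}).json()
sid = r["session_id"]
done = False
while not done:
    out = httpx.post(f"{BASE}/step", json={
        "session_id": sid,
        "action": {"kv_budget": 1.0, "batch_size": 32, "spec_length": 0,
                   "prefill_disagg": False, "quant_tier": 0},
    }).json()
    done = out["done"]
print("final step reward:", out["reward"])
```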
```python
import threading
from collections import OrderedDict
from typing import Optional

from env.inference_env import InferenceEnv

class SessionManager:
    """Thread-safe LRU cache of active InferenceEnv instances."""
    MAX_SESSIONS = 50

    def __init__(self, simulator):
        self._sim = simulator
        self._lock = threading.Lock()
        self._sessions: OrderedDict[str, InferenceEnv] = OrderedDict()

    def create(self, task_id: int, seed: int) -> InferenceEnv:
        with self._lock:
            if len(self._sessions) >= self.MAX_SESSIONS:
                self._sessions.popitem(last=False)  # evict oldest
            env = InferenceEnv(self._sim, task_id, seed)
            self._sessions[env.session_id] = env
            return env

    def get(self, session_id: str) -> Optional[InferenceEnv]:
        with self._lock:
            env = self._sessions.get(session_id)
            if env:
                # move to end (mark as recently used)
                self._sessions.move_to_end(session_id)
            return env

    def remove(self, session_id: str) -> None:
        with self._lock:
            self._sessions.pop(session_id, None)

    def count(self) -> int:
        return len(self._sessions)
```
```python
import time
from typing import Optional

from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel

from simulator.trace_sim import TraceSimulator
from simulator.session_manager import SessionManager
from inferencegym.models import ServeAction, QuantTier

app = FastAPI(title="InferenceGym", version="1.0.0")
app.add_middleware(CORSMiddleware, allow_origins=["*"],
                   allow_methods=["*"], allow_headers=["*"])

# ── App startup: load simulator once, create session manager ─────────────────
_sim = None
_sessions = None
_start_time = time.time()

@app.on_event("startup")
async def startup():
    global _sim, _sessions
    _sim = TraceSimulator("simulator/data/traces_llama3_8b.parquet")
    _sessions = SessionManager(_sim)

# ── Pydantic request/response models ─────────────────────────────────────────
class ResetRequest(BaseModel):
    task_id: int
    seed: int = 42
    config: Optional[dict] = None   # override alpha/beta/gamma/delta

class StepRequest(BaseModel):
    session_id: str
    action: dict

class GraderRequest(BaseModel):
    task_id: int
    episode_log: list

# ── Endpoints ─────────────────────────────────────────────────────────────────
@app.get("/health")
def health():
    return {"status": "ok", "active_sessions": _sessions.count(),
            "uptime_seconds": int(time.time() - _start_time)}

@app.get("/tasks")
def get_tasks():
    return {"tasks": [
        {"id": 1, "name": "Static Uniform", "difficulty": "easy",
         "active_actions": ["kv_budget", "batch_size"]},
        {"id": 2, "name": "Bursty ShareGPT", "difficulty": "medium",
         "active_actions": ["kv_budget", "batch_size", "spec_length"]},
        {"id": 3, "name": "Adversarial Multi-Tenant", "difficulty": "hard",
         "active_actions": ["kv_budget", "batch_size", "spec_length",
                            "prefill_disagg", "quant_tier"]},
    ]}

@app.post("/reset")
def reset(req: ResetRequest):
    if req.task_id not in {1, 2, 3}:
        raise HTTPException(422, f"task_id must be 1, 2, or 3. Got {req.task_id}")
    env = _sessions.create(req.task_id, req.seed)
    obs = env.reset()
    return {"session_id": env.session_id, "observation": obs.__dict__,
            "episode_length": 200}

@app.post("/step")
def step(req: StepRequest):
    env = _sessions.get(req.session_id)
    if not env:
        raise HTTPException(404, f"Session '{req.session_id}' not found. Call /reset first.")
    action = ServeAction(
        kv_budget      = req.action.get("kv_budget", 1.0),
        spec_length    = req.action.get("spec_length", 0),
        batch_size     = req.action.get("batch_size", 32),
        prefill_disagg = req.action.get("prefill_disagg", False),
        quant_tier     = QuantTier(req.action.get("quant_tier", 0)),
    )
    obs, reward, done, info = env.step(action)
    if done:
        _sessions.remove(req.session_id)
    return {"observation": obs.__dict__, "reward": reward, "done": done, "info": info}
```
```dockerfile
# Stage 1: Install dependencies only
FROM python:3.11-slim AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# Stage 2: Minimal runtime (no build tools)
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY . .
ENV PATH=/root/.local/bin:$PATH
ENV PYTHONPATH=/app
# HuggingFace Spaces convention: port 7860
EXPOSE 7860
CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "2"]

## requirements.txt (CPU-only — NO torch, NO CUDA)
# fastapi==0.115.0
# uvicorn[standard]==0.30.0
# pydantic==2.7.0
# numpy==1.26.4
# scipy==1.13.0
# pandas==2.2.0
# pyarrow==15.0.0     (for parquet reading)
# httpx==0.27.0       (for integration tests)

## requirements-demo.txt (Colab only — NOT installed in the image; see the risk table)
# stable-baselines3==2.3.0   (PPO demo only; pulls torch)
# gymnasium==0.29.1
```
Task 1 grader — throughput normalisation:

```python
# All values are means over the 200-step episode log
score = (agent_tps - baseline_tps) / (optimal_tps - baseline_tps)
score = max(0.0, min(1.0, score))
# baseline_tps ≈ 2800 tokens/s (batch=32, kv=1.0)
# optimal_tps  ≈ 8200 tokens/s (batch=128, kv=0.5)
```
Task 2 grader — TTFT + memory compliance:

```python
ttft_score = max(0.0, 1.0 - mean_ttft_p50 / 300.0)
peak_mem = max(x['metrics']['gpu_memory_gb'] for x in episode_log)  # scalar, not the step dict
mem_score = 1.0 if peak_mem < 36.0 else max(0.0, 1.0 - (peak_mem - 36) / 10)
score = 0.5 * ttft_score + 0.5 * mem_score
```
Task 3 grader — four objectives, including action stability:

```python
T = mean_tps / optimal_tps               # throughput
S = 1.0 - mean_slo_violation_rate        # SLO compliance
C = max(0.0, 1.0 - total_cost / 5.0)     # cost (budget = 5.0)
A = 1.0 - action_variance_score          # stability
score = 0.40*T + 0.30*S + 0.20*C + 0.10*A
```
```python
# Action stability term: variance of step-to-step action changes.
actions = [step['action'] for step in episode_log]
batch_diffs = np.diff([a['batch_size'] for a in actions])
kv_diffs = np.diff([a['kv_budget'] for a in actions])
variance = np.std(batch_diffs)/512 + np.std(kv_diffs)/1.0
action_variance_score = min(1.0, variance / 0.5)   # 0 = stable, 1 = chaotic
```

The grader module that implements all three:

```python
import numpy as np
from typing import List, Dict, Any

class GraderModule:
    """Deterministic grader. Same episode_log → same score, always."""

    BASELINE_TPS = {1: 2800.0, 2: 2100.0, 3: 1600.0}
    OPTIMAL_TPS  = {1: 8200.0, 2: 5800.0, 3: 4200.0}

    def score(self, task_id: int, episode_log: List[Dict[str, Any]]) -> Dict:
        if not episode_log:
            return {"score": 0.0, "breakdown": {}, "feedback": "Empty episode log."}
        graders = {1: self._task1, 2: self._task2, 3: self._task3}
        if task_id not in graders:
            raise ValueError(f"Unknown task_id: {task_id}")
        return graders[task_id](episode_log)

    def _task1(self, log) -> Dict:
        mean_tps = np.mean([s['metrics']['tokens_per_sec'] for s in log])
        score = (mean_tps - self.BASELINE_TPS[1]) / (self.OPTIMAL_TPS[1] - self.BASELINE_TPS[1])
        score = float(np.clip(score, 0.0, 1.0))
        feedback = self._throughput_feedback(mean_tps, 1)
        return {"score": score, "breakdown": {"throughput": score}, "feedback": feedback}

    def _task2(self, log) -> Dict:
        mean_ttft = np.mean([s['metrics']['ttft_p50_ms'] for s in log])
        peak_mem = max(s['metrics']['gpu_memory_gb'] for s in log)
        ttft_score = float(np.clip(1.0 - mean_ttft / 300.0, 0.0, 1.0))
        mem_score = 1.0 if peak_mem < 36.0 else float(np.clip(1.0 - (peak_mem - 36) / 10, 0.0, 1.0))
        score = 0.5 * ttft_score + 0.5 * mem_score
        feedback = (f"TTFT score: {ttft_score:.2f} (mean TTFT {mean_ttft:.0f}ms vs 300ms SLO). "
                    f"Memory score: {mem_score:.2f} (peak {peak_mem:.1f}GB vs 36GB limit).")
        return {"score": score, "breakdown": {"ttft": ttft_score, "memory": mem_score},
                "feedback": feedback}

    def _task3(self, log) -> Dict:
        mean_tps   = np.mean([s['metrics']['tokens_per_sec'] for s in log])
        mean_slo   = np.mean([s['metrics']['slo_violations'] for s in log])
        total_cost = sum(s['metrics']['cost_per_1k'] for s in log)
        actions = [s['action'] for s in log]
        T = float(np.clip(mean_tps / self.OPTIMAL_TPS[3], 0.0, 1.0))
        S = float(np.clip(1.0 - mean_slo / 100.0, 0.0, 1.0))
        C = float(np.clip(1.0 - total_cost / 5.0, 0.0, 1.0))
        A = 1.0 - self._action_variance(actions)
        # Final score is clipped too (see the risk table)
        score = float(np.clip(0.40*T + 0.30*S + 0.20*C + 0.10*A, 0.0, 1.0))
        feedback = self._task3_feedback(T, S, C, A, log)
        return {"score": score,
                "breakdown": {"throughput": T, "slo": S, "cost": C, "stability": A},
                "feedback": feedback}

    def _action_variance(self, actions) -> float:
        batch_vals = [a.get('batch_size', 32) for a in actions]
        kv_vals = [a.get('kv_budget', 1.0) for a in actions]
        variance = np.std(np.diff(batch_vals))/512 + np.std(np.diff(kv_vals))/1.0
        return float(np.clip(variance / 0.5, 0.0, 1.0))

    def _throughput_feedback(self, mean_tps, task_id) -> str:
        pct = (mean_tps - self.BASELINE_TPS[task_id]) / \
              (self.OPTIMAL_TPS[task_id] - self.BASELINE_TPS[task_id]) * 100
        return f"Agent achieved {mean_tps:.0f} TPS ({pct:.0f}% of the way from baseline to optimal)."

    def _task3_feedback(self, T, S, C, A, log) -> str:
        return (f"Throughput {T:.2f}, SLO compliance {S:.2f}, "
                f"cost {C:.2f}, stability {A:.2f} over {len(log)} steps.")
```
```python
from inferencegym.models import ServeAction, QuantTier
from env.inference_env import InferenceEnv
from simulator.trace_sim import TraceSimulator
from grader.grader import GraderModule

# The fixed action that the baseline ALWAYS takes, regardless of observation
BASELINE_ACTION = ServeAction(
    kv_budget      = 1.0,             # no eviction
    spec_length    = 0,               # speculative decoding off
    batch_size     = 32,              # vLLM default
    prefill_disagg = False,           # colocated
    quant_tier     = QuantTier.FP16,  # full precision
)

def run_baseline(task_id: int, seed: int = 0) -> dict:
    """Runs the fixed baseline agent on one task, returns the grader score."""
    sim = TraceSimulator("simulator/data/traces_llama3_8b.parquet", seed=seed)
    env = InferenceEnv(sim, task_id=task_id, seed=seed)
    grader = GraderModule()
    env.reset()
    done = False
    while not done:
        _, _, done, _ = env.step(BASELINE_ACTION)
    result = grader.score(task_id, env._episode_log)
    return {"task_id": task_id, "score": result["score"],
            "breakdown": result["breakdown"],
            "action_config": BASELINE_ACTION.__dict__}

def run_all_baselines() -> dict:
    # Seed=0 guarantees identical results every run
    return {"scores": {f"task{i}": run_baseline(i, seed=0)["score"] for i in [1, 2, 3]},
            "expected_range": {"task1": [0.30, 0.40], "task2": [0.22, 0.32],
                               "task3": [0.18, 0.28]}}
```
README frontmatter: `title: InferenceGym`, `emoji: 🏋️`, `colorFrom: green`, `colorTo: blue`, `sdk: docker`, `pinned: false`. This controls the HF Space landing page.

The Gym wrapper subclasses `gymnasium.Env`. `reset()` calls POST /reset; `step(action)` calls POST /step. `observation_space` is `Box(low=-inf, high=inf, shape=(12,))`; `action_space` is a `Box` over the continuous knobs.

Train with `stable_baselines3.PPO("MlpPolicy", env, verbose=1)` for 50k steps and plot `ep_rew_mean` over time with matplotlib. It should go from ~0.1 at the start to ~0.35+ by 50k steps. If it doesn't: (1) add a VecNormalize wrapper, (2) reduce the learning rate to 1e-4, (3) increase n_steps to 2048, (4) check the reward range is [-1, 1] (it should be, from InferenceEnv). The environment is designed to be learnable — the reward engineering is correct.

```python
# Cell 1: Title markdown
# "# InferenceGym Demo — Meta PyTorch × Scaler Hackathon 2026"

# Cell 2: Install (runs in ~90 seconds on Colab)
!pip install stable-baselines3 gymnasium httpx pandas matplotlib -q

# Cell 3: Connect to the live environment
HF_URL = "https://YOUR_ORG-inferencegym.hf.space"
import httpx
response = httpx.get(f"{HF_URL}/health")
print("Environment status:", response.json())

# Cell 4: Show available tasks
tasks = httpx.get(f"{HF_URL}/tasks").json()
for t in tasks['tasks']:
    print(f"{t['id']}: {t['name']} ({t['difficulty']})")

# Cell 5: Run the baseline agent, show scores
baseline = httpx.get(f"{HF_URL}/baseline").json()
print("Baseline scores (naïve vLLM defaults):", baseline['scores'])

# Cell 6: Manual episode — human in the loop
res = httpx.post(f"{HF_URL}/reset", json={"task_id": 1, "seed": 42}).json()
session_id = res['session_id']; obs = res['observation']
print("Initial observation:", obs)

# Cell 7: Run 10 manual steps with a smart action
episode_log = []
for _ in range(10):
    result = httpx.post(f"{HF_URL}/step", json={
        "session_id": session_id,
        "action": {"kv_budget": 0.6, "batch_size": 128, "spec_length": 0,
                   "prefill_disagg": False, "quant_tier": 0}}).json()
    episode_log.append(result)

# Cell 8: Gym wrapper
import gymnasium as gym
import numpy as np

class InferenceGymEnv(gym.Env):
    # Scalar observation fields, in schema order; priority_distribution is a
    # list, so it is represented by its interactive fraction to keep 12 features.
    FIELDS = ["queue_depth", "mean_prompt_len", "arrival_rate", "kv_cache_occupancy",
              "ttft_p50", "tpot_p50", "slo_violation_rate", "gpu_memory_used_gb",
              "spec_accept_rate", "timestep", "cost_so_far"]

    def __init__(self, base_url, task_id=1):
        self.url = base_url; self.task_id = task_id; self.session_id = None
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(12,), dtype=np.float32)
        # [kv_budget, reserved (unused), batch_size]
        self.action_space = gym.spaces.Box(
            low=np.array([0.1, 0.0, 1.0], dtype=np.float32),
            high=np.array([1.0, 1.0, 512.0], dtype=np.float32))

    def obs_to_array(self, obs):
        vals = [float(obs[f]) for f in self.FIELDS]
        vals.append(float(obs["priority_distribution"][0]))
        return np.array(vals, dtype=np.float32)

    def reset(self, **kwargs):
        r = httpx.post(f"{self.url}/reset", json={"task_id": self.task_id}).json()
        self.session_id = r['session_id']
        return self.obs_to_array(r['observation']), {}

    def step(self, action):
        act = {"kv_budget": float(action[0]), "spec_length": 0,
               "batch_size": int(action[2]), "prefill_disagg": False, "quant_tier": 0}
        r = httpx.post(f"{self.url}/step",
                       json={"session_id": self.session_id, "action": act}).json()
        return self.obs_to_array(r['observation']), r['reward'], r['done'], False, {}

# Cell 9: Train PPO (takes ~10 minutes on Colab T4)
from stable_baselines3 import PPO
env = InferenceGymEnv(HF_URL, task_id=1)
model = PPO("MlpPolicy", env, verbose=1, learning_rate=3e-4, n_steps=512)
model.learn(total_timesteps=50_000)

# Cell 10: Plot the reward curve (the money shot)
import matplotlib.pyplot as plt
rewards = [ep['r'] for ep in model.ep_info_buffer]
plt.figure(figsize=(12, 4)); plt.plot(rewards, alpha=0.3, label='Episode reward')
plt.axhline(y=0.35, color='r', linestyle='--', label='Baseline score')
plt.title('PPO Agent Learning on InferenceGym Task 1'); plt.legend(); plt.show()
print(f"Final agent score: {np.mean(rewards[-20:]):.3f} vs baseline: 0.35")
```
| Time | Screen | What You Say / Show |
|---|---|---|
| 0:00–0:20 | Slide: problem statement | "LLM inference is where 80% of AI budget is spent. There's no RL environment for optimising it. We built one." |
| 0:20–0:40 | HF Space — /health → /tasks | "This is InferenceGym on HuggingFace Spaces, live right now. 3 tasks, 5 action knobs, fully CPU-only." Hit the endpoints live. |
| 0:40–1:00 | Colab — run baseline | "Naïve vLLM defaults score 0.35 on Task 1. That's your baseline — static config, no optimisation." |
| 1:00–1:30 | Colab — PPO reward curve | "A simple PPO agent trained for 50k steps hits 0.65 — almost double. No GPU, no model, just our trace-driven simulator." Show the plot. |
| 1:30–2:00 | Architecture diagram | "Any company can drop in their own trace data and train an agent for their specific workload. That's the value proposition. Thank you." |
```
inferencegym/
├── models.py                [ALL] — Locked Day 1. ServeAction, ServeObservation, MetricsSnapshot, WorkloadState
│
├── env/
│   ├── inference_env.py     [A] — Core InferenceEnv class. reset(), step(), _compute_reward(), _enforce_action_mask()
│   ├── observation.py       [A] — _build_obs() helper, normalise values to [0,1] for RL agents
│   ├── action.py            [A] — ActionValidator, clamp continuous actions to valid ranges
│   └── reward.py            [A] — RewardComputer, configurable α β γ δ, TASK_CONFIGS dict
│
├── simulator/
│   ├── trace_sim.py         [A] — TraceSimulator: load parquet, interpolate, noise, OOM detection
│   ├── workload.py          [A] — WorkloadGenerator: Poisson, LogNormal, burst injection, queue
│   ├── session_manager.py   [B] — SessionManager: thread-safe LRU cache of InferenceEnv instances
│   └── data/
│       ├── traces_llama3_8b.parquet [C] — lookup table: (batch,kv,spec,plen) → metrics
│       ├── sharegpt_dist.json       [C] — LogNormal params for Task 2 prompt distribution
│       └── workload_configs.json    [C] — Task 1/2/3 workload configuration parameters
│
├── grader/
│   ├── grader.py            [C] — GraderModule: dispatches to per-task graders, returns score+breakdown
│   ├── task1_grader.py      [C] — Throughput normalisation formula
│   ├── task2_grader.py      [C] — TTFT + memory compliance formula
│   └── task3_grader.py      [C] — 4-objective formula including action stability
│
├── agents/
│   ├── baseline.py          [C] — BaselineAgent: fixed BASELINE_ACTION, run_all_baselines()
│   └── ppo_demo.py          [C] — HTTPGymEnv wrapper + PPO training script
│
├── server/
│   ├── app.py               [B] — FastAPI application, all 8 endpoints, startup event
│   ├── schemas.py           [B] — Pydantic request/response models (ResetRequest, StepRequest, etc.)
│   └── middleware.py        [B] — CORS, rate limiting (max 100 req/min per IP), request logging
│
├── tests/
│   ├── test_simulator.py    [A] — 20+ unit tests for TraceSimulator and WorkloadGenerator
│   ├── test_env.py          [A] — Contract tests for step/reset/state, edge cases
│   ├── test_grader.py       [C] — Unit tests for all 3 grader formulas with known expected outputs
│   └── test_api.py          [B] — Integration tests: httpx client hitting the full FastAPI stack
│
├── notebooks/
│   └── InferenceGym_Demo.ipynb [C] — 10-cell Colab demo notebook
│
├── Dockerfile               [B] — Multi-stage, CPU-only, port 7860, <2GB image
├── docker-compose.yml       [B] — Local dev: volume mount source, hot reload
├── requirements.txt         [B] — Pinned CPU-only deps. No torch. No CUDA.
├── README.md                [C] — HF Spaces frontmatter + pitch + quickstart + links
└── ENVIRONMENT.md           [A] — Full technical spec for judges
```
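As an illustration of the contract tests `tests/test_env.py` is responsible for, here is a sketch that asserts the Phase 2 exit criteria (test names and the random-action helper are assumptions):

```python
import numpy as np
from simulator.trace_sim import TraceSimulator
from env.inference_env import InferenceEnv
from inferencegym.models import ServeAction, QuantTier

def random_action(rng):
    # Sample uniformly from the valid action ranges in models.py.
    return ServeAction(
        kv_budget=float(rng.uniform(0.1, 1.0)),
        spec_length=int(rng.choice([0, 1, 2, 4, 8])),
        batch_size=int(rng.integers(1, 513)),
        prefill_disagg=bool(rng.integers(0, 2)),
        quant_tier=QuantTier(int(rng.integers(0, 3))),
    )

def test_episode_contract():
    sim = TraceSimulator("simulator/data/traces_llama3_8b.parquet", seed=0)
    env = InferenceEnv(sim, task_id=1, seed=0)
    env.reset()
    rng = np.random.default_rng(0)
    for t in range(200):
        obs, reward, done, info = env.step(random_action(rng))
        assert -1.0 <= reward <= 1.0    # reward bound from _compute_reward
        assert done == (t == 199)       # episode terminates at step 200

def test_session_ids_unique():
    sim = TraceSimulator("simulator/data/traces_llama3_8b.parquet", seed=0)
    env = InferenceEnv(sim, task_id=1, seed=0)
    env.reset(); first = env.session_id
    env.reset(); assert env.session_id != first
```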
`TraceSimulator`:
- `__init__(trace_path: str, seed: int = 42)` — loads parquet, builds interpolators, sets the rng
- `simulate(action: ServeAction, workload: WorkloadState) → MetricsSnapshot` — the core method
- `reset_seed(seed: int)` — resets the rng for episode reproducibility

`WorkloadGenerator`:
- `__init__(task_id: int, seed: int = 42)` — loads the workload config for this task
- `reset() → WorkloadState` — returns the initial state, resets the internal step counter
- `step(action: ServeAction) → WorkloadState` — advances one step, updates the queue
- `is_burst_active() → bool` — True during burst windows for Task 3

`InferenceEnv`:
- `reset() → ServeObservation` — starts a new episode, returns the initial observation
- `step(action) → (obs, reward, done, info)` — Gym-compatible signature
- `state() → dict` — returns episode metadata for the /state endpoint
- `_episode_log: list` — accumulates step dicts for grader consumption
- `session_id: str` — unique UUID per episode, set on reset()

`GraderModule`:
- `score(task_id: int, episode_log: list) → dict` — returns {score, breakdown, feedback}
- `score` must be a float in [0.0, 1.0]
- `breakdown` must contain one float per scoring component
- `feedback` must be a human-readable string explaining the score

Parquet lookup table schema (`traces_llama3_8b.parquet`):

| Column | Type | Values | Description |
|---|---|---|---|
| `batch_size` | int | 1, 4, 8, 16, 32, 64, 128, 256, 512 | Max concurrent requests served |
| `kv_budget` | float | 0.1, 0.25, 0.5, 0.75, 1.0 | KV cache allocation fraction |
| `spec_length` | int | 0, 1, 2, 4, 8 | Speculative draft tokens (0 = disabled) |
| `quant_tier` | int | 0, 1, 2 | 0=FP16, 1=INT8, 2=INT4 |
| `prompt_len_bucket` | int | 0–7 | Bucket index over [64, 128, 256, 512, 1024, 2048, 4096, 8192] |
| `ttft_p50` | float | >0 | Median time to first token (ms) |
| `ttft_p99` | float | >0 | 99th-percentile TTFT (ms) |
| `tpot` | float | >0 | Time per output token (ms) |
| `tps` | float | >0 | Output tokens per second |
| `gpu_mem_gb` | float | 0–80 | GPU memory footprint (GB) |
| `cost_per_1k` | float | >0 | Relative cost per 1000 tokens (normalised) |
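If Options A and B fall through, the Option C fallback can generate a parquet file matching this schema from the analytic model — a sketch, where every constant is an illustrative placeholder to be calibrated, not a measured value:

```python
# Sketch: build a synthetic lookup table (Option C) matching the schema above.
import itertools
import pandas as pd

BATCH = [1, 4, 8, 16, 32, 64, 128, 256, 512]
KV    = [0.1, 0.25, 0.5, 0.75, 1.0]
SPEC  = [0, 1, 2, 4, 8]
QUANT = [0, 1, 2]
PLEN  = range(8)   # bucket indices 0–7

rows = []
for b, kv, s, q, p in itertools.product(BATCH, KV, SPEC, QUANT, PLEN):
    plen_tokens = 64 * 2 ** p                       # bucket midpoint proxy
    ttft = 30.0 + 1.5 * b + 0.05 * plen_tokens      # base + batch + memory terms (placeholders)
    tps  = 900.0 * b / (1.0 + 0.02 * b) * kv        # saturating throughput, scaled by KV budget
    rows.append({
        "batch_size": b, "kv_budget": kv, "spec_length": s, "quant_tier": q,
        "prompt_len_bucket": p,
        "ttft_p50": ttft, "ttft_p99": 1.8 * ttft,
        "tpot": 12.0 + 0.01 * plen_tokens,
        "tps": tps,
        # Can exceed the 40 GB OOM threshold at large batch × long prompts
        "gpu_mem_gb": 6.0 + 20.0 * kv + 0.03 * b + plen_tokens / 512,
        "cost_per_1k": 0.001,
    })
pd.DataFrame(rows).to_parquet("simulator/data/traces_llama3_8b.parquet")
```

Note that TraceSimulator only interpolates the `quant_tier == 0` rows and applies INT8/INT4 as multiplicative factors, so the synthetic values for the other tiers are never read.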
`simulator/data/workload_configs.json`:

```json
{
"tasks": {
"1": {
"name": "Static Uniform",
"arrival_rate_rps": 10.0,
"arrival_dist": "poisson",
"prompt_len_dist": "uniform",
"prompt_len_min": 64,
"prompt_len_max": 128,
"slo_target_ms": 500.0,
"burst_enabled": false,
"priority_routing": false,
"active_actions": ["kv_budget", "batch_size"]
},
"2": {
"name": "Bursty ShareGPT",
"arrival_rate_rps": 25.0,
"arrival_rate_burst": 80.0,
"burst_period_steps": 30,
"arrival_dist": "poisson_bursty",
"prompt_len_dist": "lognormal",
"prompt_len_mu": 5.2,
"prompt_len_sigma": 1.3,
"prompt_len_clamp_min": 32,
"prompt_len_clamp_max": 8192,
"memory_hard_limit_gb": 36.0,
"slo_target_ms": 300.0,
"burst_enabled": true,
"active_actions": ["kv_budget", "batch_size", "spec_length"]
},
"3": {
"name": "Adversarial Multi-Tenant",
"arrival_rate_rps": 30.0,
"burst_multiplier": 10.0,
"burst_interval_steps": 120,
"burst_duration_steps": 15,
"prompt_len_dist": "bimodal",
"short_request_frac": 0.7,
"short_prompt_max": 128,
"long_prompt_min": 4096,
"long_prompt_max": 8192,
"priority_mix": [0.2, 0.5, 0.3],
"slo_interactive_ms": 200.0,
"slo_batch_ms": 2000.0,
"cost_budget_episode": 5.0,
"memory_hard_limit_gb": 38.0,
"active_actions": ["kv_budget", "batch_size", "spec_length", "prefill_disagg", "quant_tier"]
}
}
}
```

Observation space reference (fields of `ServeObservation`):

| Field | Type | Range | Normalised? | Description |
|---|---|---|---|---|
| `queue_depth` | float | [0, 512] | No | Pending requests in the serving queue |
| `mean_prompt_len` | float | [32, 8192] | No | Mean token count of the current window |
| `arrival_rate` | float | [0, 200] | No | 10-step EMA of requests/second |
| `kv_cache_occupancy` | float | [0.0, 1.0] | Yes | Fraction of KV cache in use |
| `ttft_p50` | float | [0, 5000] ms | No | Median TTFT over the last 20 requests |
| `tpot_p50` | float | [0, 500] ms | No | Median time-per-output-token |
| `slo_violation_rate` | float | [0.0, 1.0] | Yes | Fraction of requests missing the SLO |
| `gpu_memory_used_gb` | float | [0, 80] | No | Simulated GPU memory pressure |
| `spec_accept_rate` | float | [0.0, 1.0] | Yes | Speculative token acceptance rate |
| `priority_distribution` | float[3] | [0, 1] each | Yes | [interactive, batch, best_effort] fractions |
| `timestep` | int | [0, 200] | No | Current episode step |
| `cost_so_far` | float | [0, ∞) | No | Cumulative cost this episode |
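The repo layout gives `env/observation.py` the job of normalising these fields to [0, 1] for RL agents. A minimal sketch using the ranges above (the `normalise` helper name is an assumption; it flattens `priority_distribution` into the vector and squashes the unbounded `cost_so_far`):

```python
import numpy as np
from inferencegym.models import ServeObservation

# (low, high) per scalar field, taken from the ranges in the table above.
_RANGES = {
    "queue_depth": (0, 512), "mean_prompt_len": (32, 8192), "arrival_rate": (0, 200),
    "kv_cache_occupancy": (0, 1), "ttft_p50": (0, 5000), "tpot_p50": (0, 500),
    "slo_violation_rate": (0, 1), "gpu_memory_used_gb": (0, 80),
    "spec_accept_rate": (0, 1), "timestep": (0, 200),
}

def normalise(obs: ServeObservation) -> np.ndarray:
    """Scale each field to [0, 1]; already-normalised fields pass through unchanged."""
    vals = []
    for name, (lo, hi) in _RANGES.items():
        x = float(getattr(obs, name))
        vals.append(np.clip((x - lo) / (hi - lo), 0.0, 1.0))
    # cost_so_far is unbounded, so squash it instead of min-max scaling.
    vals.append(1.0 - np.exp(-obs.cost_so_far))
    return np.asarray(vals + list(obs.priority_distribution), dtype=np.float32)
```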
| Risk | Prob | Mitigation | Owner |
|---|---|---|---|
| Trace data is the wrong shape — published benchmarks don't have the exact columns needed | Medium | Implement Option C (synthetic data) on Day 1 before even trying Option A. This takes 30 minutes and gives you a valid fallback; Option A then becomes an enhancement, not a dependency. | C |
| PPO doesn't converge — the reward curve is flat or decreasing | Low | Task 1 is designed for easy learning. If PPO fails: (1) add a VecNormalize wrapper, (2) lower the learning rate to 1e-4, (3) check the reward is truly in [-1, 1]. If it still fails, use a simple hill-climbing agent — just show any rising curve. | C |
| HuggingFace Spaces OOM — the free tier has 16GB RAM and the simulator might use too much | Low | Load trace data as a numpy array, not a pandas DataFrame, at startup. Target <200MB for the lookup table. Use parquet with snappy compression. Test memory usage locally with psutil before deploying. | B |
| Race condition in the session cache — concurrent requests corrupt session state | Medium | All reads and writes to the `self._sessions` dict are wrapped in `threading.Lock()`. Individual InferenceEnv instances are not thread-safe, but each session is owned by one caller at a time — fine in practice, since `/step` is synchronous and each session is driven by a single RL client making sequential calls. | B |
| Grader gives a score > 1.0 or < 0.0 — formula constants are miscalibrated | Medium | All grader component scores are individually passed through `np.clip(x, 0.0, 1.0)` before the weighted sum, and the final score is clipped too. Calibrate the BASELINE_TPS and OPTIMAL_TPS constants on Day 5 by running the actual baseline agent and verifying scores fall in [0.20, 0.40]. | C |
| Person A is blocked on Day 3 — the simulator isn't done, so Persons B and C can't proceed | Medium | Person A prioritises the interface (`simulate()` returns a valid MetricsSnapshot) over implementation quality. A synthetic linear model with hardcoded constants is enough for Day 3; Persons B and C only need the method signature to work. Real trace data can be plugged in on Day 4. | A |
| Docker image >2GB — stable-baselines3 pulls in the large PyTorch dependency | Medium | Install stable-baselines3[extra] only via a separate requirements-demo.txt that is NOT in the Dockerfile. The server only needs the environment; the PPO demo runs outside the container (in Colab). This keeps the image under 500MB. | B |
| OpenEnv spec compliance fails — `openenv validate` finds schema mismatches | Low | Run `openenv validate` at the end of every day starting Day 3. Validation issues are always about the JSON schema — field names, types, missing fields. Fix them immediately, never defer. Keep a local copy of the openenv spec open while writing endpoint response schemas. | B |
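The Spaces-OOM mitigation above says to test memory usage locally with psutil before deploying; a quick way to do that (a throwaway script, not part of the shipped codebase):

```python
# Sketch: measure resident memory added by loading the simulator.
import os
import psutil
from simulator.trace_sim import TraceSimulator

proc = psutil.Process(os.getpid())
before = proc.memory_info().rss
sim = TraceSimulator("simulator/data/traces_llama3_8b.parquet")
after = proc.memory_info().rss
print(f"simulator resident set: {(after - before) / 1e6:.0f} MB (target < 200 MB)")
```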
Final acceptance checklist:

- `POST /reset` returns a session_id + initial observation dict
- `POST /step` returns observation + reward (float) + done (bool) + info
- `GET /health` returns `{"status": "ok"}`
- `openenv validate --url https://YOUR_SPACE.hf.space` passes with no errors
- The image builds with `docker build -t test .`
- The image is under 2GB (check with `docker image ls`)
- The Space is titled InferenceGym (or your chosen name)