---
title: FrontierLabs-Env
emoji: 🚀
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---

# FrontierLabs-Env 🚀

> **An OpenEnv-compliant AI Infrastructure Simulation Sandbox** — drops an AI agent into a failing PyTorch/GPU supercomputing environment. The agent must autonomously act as a Principal AI Infrastructure Engineer.

[![OpenEnv](https://img.shields.io/badge/OpenEnv-v1.0-blue)](https://openenv.ai)
[![HF Space](https://img.shields.io/badge/🤗-HuggingFace%20Space-yellow)](https://huggingface.co/spaces/frontierlabs/FrontierLabs-Env)
[![Docker](https://img.shields.io/badge/Docker-ready-2496ED)](https://hub.docker.com)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.111-009688)](https://fastapi.tiangolo.com)

---

## Motivation

As AI models scale to hundreds of billions of parameters, evaluating agents on elite infrastructure tasks — data security auditing, distributed training optimization, and GPU kernel engineering — is impossible without risking actual multi-million-dollar server clusters. **FrontierLabs-Env solves this** by providing a strictly deterministic, fully sandboxed simulation of these scenarios with programmatic graders and a rich partial-reward signal.

---

## Environment Description

The agent interacts with a **simulated filesystem** on a virtual GPU supercomputing cluster. It reads files, writes code, executes scripts (simulated), and submits solutions. The environment tracks progress through a state machine that rewards correct partial steps, not just final answers.

---

## Observation Space

Every step returns an `Observation` object:

| Field | Type | Description |
|---|---|---|
| `task_id` | `str` | Active task identifier |
| `step` | `int` | Current step number |
| `done` | `bool` | Whether the episode has ended |
| `message` | `str` | Human-readable task description |
| `files` | `dict[str, str]` | Simulated filesystem preview (5-line snippets) |
| `metrics` | `dict` | Live infrastructure metrics (memory, latency, etc.) |
| `partial_score` | `float` | Running score in [0.0, 1.0] for the episode |

---

## Action Space

All actions are JSON objects with these fields:

| Field | Type | Required | Description |
|---|---|---|---|
| `action_type` | `str` | ✅ | One of: `write_file`, `run_script`, `submit` |
| `filename` | `str` | For `write_file` / `run_script` | Target file on the simulated filesystem |
| `content` | `str` | For `write_file` | Complete source code to write |

**Action types:**

- **`write_file`** — Write code/content to a named file on the simulated filesystem
- **`run_script`** — Execute a script already on the filesystem (returns simulated stdout)
- **`submit`** — Mark the episode complete and trigger grading

---

## Tasks

### Task 1 — Security Audit & Self-Evaluation (🟢 Easy)

**Scenario:** A `dataset.jsonl` (200 entries) has been infected with 50 malicious backdoor prompts containing the trigger token `TRIGGER_ALPHA`.

**Agent must:**

1. Write `audit.py` to detect and remove backdoor entries → `cleaned_dataset.jsonl`
2. Write `evaluate.py` to compare against `golden_baseline.jsonl` → `metrics_report.json`
3. Run both scripts, then `submit`

**Grader criteria (deterministic):**

- File `cleaned_dataset.jsonl` exists → **+0.10**
- Cleaning F1 score (precision/recall of removed entries) → **up to +0.40**
- File `metrics_report.json` exists → **+0.10**
- Agent's self-reported F1 matches the ground-truth F1 (within 1%) → **+0.40**

**Max steps:** 20 | **Pass threshold:** ≥ 0.80

---

### Task 2 — Distributed Cluster Crash / FSDP (🟡 Medium)

**Scenario:** `train.py` crashes with a CUDA out-of-memory error because a 280 GB model is loaded onto a single 40 GB GPU.

**Agent must:**

1. Write `train_fsdp.py` using `torch.distributed.fsdp.FullyShardedDataParallel` across 8 GPUs
2. Initialize the process group correctly with `dist.init_process_group`
3. Run and submit

**Grader criteria (AST + keyword analysis):**

- File exists → **+0.10**
- 5 FSDP keywords detected → **up to +0.50**
- AST finds an `FSDP(...)` wrapper call → **+0.20**
- Simulated memory of 35 GB/GPU (≤ 40 GB limit) → **+0.20**

**Max steps:** 25 | **Pass threshold:** ≥ 0.80

---

### Task 3 — Triton Hardware Bottleneck (🔴 Hard)

**Scenario:** `slow_math.py` runs a SiLU-gated activation at 150 ms/step due to 2 separate memory round-trips.

**Agent must:**

1. Write `fast_silu_kernel.py` with a `@triton.jit` kernel
2. Use `tl.load` for both `x_ptr` and `gate_ptr`, compute SiLU + multiply in registers, and write once via `tl.store`
3. Run and submit

**Grader criteria (AST + regex):**

- File exists → **+0.10**
- `@triton.jit`, `tl.load`, `tl.store` present → **up to +0.45**
- SiLU math pattern detected (regex) → **+0.20**
- Kernel function with pointer args found in AST → **+0.10**
- Simulated latency of 11.8 ms/step (≤ 20 ms target) → **+0.15**

**Max steps:** 30 | **Pass threshold:** ≥ 0.80

---

## Reward Function

The reward signal is **non-sparse** — partial credit is awarded throughout the trajectory:

| Event | Reward |
|---|---|
| Writing a file with correct patterns | `+0.05` to `+0.20` |
| Running a script successfully | `+0.10` to `+0.30` |
| Correct strategy detected at write time | `+0.10` to `+0.20` |
| Running the original buggy script (OOM) | `−0.10` |
| Double submit | `−0.10` |
| Unknown action type | `−0.05` |
| Timeout (steps exhausted) | `−0.10` |

All per-step rewards are clipped to `[−1.0, 1.0]`.
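As a concrete example of the partial-credit scheme, Task 1's final grade can be assembled directly from its criteria list. This is a minimal sketch, not the actual grader code: `task1_score` is an illustrative name, and reading "within 1%" as an absolute tolerance of 0.01 is an assumption.

```python
def task1_score(cleaned_exists: bool, cleaning_f1: float,
                report_exists: bool, self_f1: float, true_f1: float) -> float:
    """Sketch of Task 1's partial score, per the documented criteria."""
    score = 0.0
    if cleaned_exists:                  # cleaned_dataset.jsonl exists
        score += 0.10
    score += 0.40 * cleaning_f1         # cleaning F1, scaled into up to +0.40
    if report_exists:                   # metrics_report.json exists
        score += 0.10
    if abs(self_f1 - true_f1) <= 0.01:  # self-reported F1 within 1% (assumed absolute)
        score += 0.40
    return round(score, 4)

# A perfect run earns the full 1.0; a bad self-evaluation alone drops it below
# the 0.80 pass threshold:
print(task1_score(True, 1.0, True, 1.0, 1.0))   # 1.0
print(task1_score(True, 1.0, True, 0.80, 1.0))  # 0.6
```

Note how the self-evaluation step carries as much weight (+0.40) as the cleaning itself — the task grades the agent's ability to measure its own work, not just to do it.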
---

## API Endpoints

| Method | Path | Description |
|---|---|---|
| `GET` | `/` | Mission Control Dashboard (HTML) |
| `POST` | `/reset` | Reset environment (`task_id` optional) |
| `POST` | `/step` | Take one action |
| `GET` | `/state` | Full internal state (debug) |
| `GET` | `/tasks` | List tasks + action schema |
| `GET` | `/grader` | Grade current episode |
| `POST` | `/baseline` | Trigger baseline agent |
| `GET` | `/health` | Health check |
| `GET` | `/docs` | Swagger UI |

---

## Setup & Usage

### Local Development

```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Start the server
uvicorn main:app --reload --port 7860

# 3. Open dashboard
open http://localhost:7860

# 4. Run the baseline agent (no API key needed — uses expert solutions)
python inference.py

# 5. Run baseline with an OpenAI-compatible API
HF_TOKEN=xyz API_BASE_URL=https://... MODEL_NAME=... python inference.py
```

### Docker

```bash
# Build
docker build -t frontierlabs-env .

# Run
docker run -p 7860:7860 frontierlabs-env

# With LLM API
docker run -p 7860:7860 \
  -e HF_TOKEN=xyz -e API_BASE_URL=https://... -e MODEL_NAME=... \
  frontierlabs-env
```
### Quick curl walkthrough

```bash
# Reset to Task 1
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "task1_security_audit"}'

# Write your solution
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "write_file", "filename": "audit.py", "content": "..."}'

# Run the script
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "run_script", "filename": "audit.py"}'

# Submit
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "submit"}'

# Get score
curl http://localhost:7860/grader
```

---

## Baseline Scores

Scores produced by the deterministic expert agent (no LLM — validates grader correctness):

| Task | Score | Passed |
|---|---|---|
| Task 1 — Security Audit | **1.0000** | ✅ |
| Task 2 — FSDP Cluster Fix | **1.0000** | ✅ |
| Task 3 — Triton Kernel | **1.0000** | ✅ |
| **Average** | **1.0000** | ✅ |

Reproduce with:

```bash
python inference.py
```

---

## Project Structure

```
FrontierLabs-Env/
├── openenv.yaml       # OpenEnv spec (tasks, schemas, metadata)
├── main.py            # FastAPI server + dashboard + all endpoints
├── environment.py     # State machine, simulated filesystem, action handlers
├── graders.py         # Deterministic graders (AST + regex + math)
├── inference.py       # Baseline inference script (LLM + expert mode)
├── requirements.txt   # Python dependencies
├── Dockerfile         # HF Spaces compatible container
└── README.md          # This file
```

---

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | — | API key for the baseline agent |
| `API_BASE_URL` | `https://api.openai.com/v1` | LLM API base URL |
| `MODEL_NAME` | `gpt-4o` | Model identifier for baseline inference |
| `FRONTIER_ENV_URL` | `http://localhost:7860` | Environment server URL |
| `PORT` | `7860` | Server port |

---

## OpenEnv Validation

The environment passes `openenv validate`:
```bash
openenv validate openenv.yaml
# ✅ Schema valid
# ✅ 3 tasks defined
# ✅ Endpoints responding
# ✅ Graders returning [0.0, 1.0] scores
```

---

## Hackathon Checklist

- [x] Real-world task simulation (GPU infrastructure engineering)
- [x] OpenEnv spec compliance (typed models, step/reset/state)
- [x] 3 tasks with agent graders (easy → medium → hard)
- [x] Meaningful reward function with partial progress signals
- [x] Baseline inference script with reproducible scores
- [x] `openenv.yaml` metadata
- [x] Working Dockerfile (HF Spaces compatible, port 7860)
- [x] `/baseline`, `/grader`, `/tasks` endpoints
- [x] Mission Control dashboard at `/`
- [x] README with full documentation