Spaces:
Sleeping
Sleeping
| title: FrontierLabs-Env | |
| emoji: π | |
| colorFrom: blue | |
| colorTo: indigo | |
| sdk: docker | |
| pinned: false | |
| # FrontierLabs-Env π | |
| > **An OpenEnv-compliant AI Infrastructure Simulation Sandbox** β drops an AI agent into a failing PyTorch/GPU supercomputing environment. The agent must autonomously act as a Principal AI Infrastructure Engineer. | |
| [](https://openenv.ai) | |
| [](https://huggingface.co/spaces/frontierlabs/FrontierLabs-Env) | |
| [](https://hub.docker.com) | |
| [](https://fastapi.tiangolo.com) | |
| --- | |
| ## Motivation | |
| As AI models scale to hundreds of billions of parameters, evaluating agents on elite infrastructure tasks β data security auditing, distributed training optimization, and GPU kernel engineering β is impossible without risking actual multi-million-dollar server clusters. | |
| **FrontierLabs-Env solves this** by providing a strictly deterministic, fully sandboxed simulation of these scenarios with programmatic graders and a rich partial-reward signal. | |
| --- | |
| ## Environment Description | |
| The agent interacts with a **simulated filesystem** on a virtual GPU supercomputing cluster. It reads files, writes code, executes scripts (simulated), and submits solutions. The environment tracks progress through a state machine that rewards correct partial steps, not just final answers. | |
| --- | |
| ## Observation Space | |
| Every step returns an `Observation` object: | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `task_id` | `str` | Active task identifier | | |
| | `step` | `int` | Current step number | | |
| | `done` | `bool` | Whether the episode has ended | | |
| | `message` | `str` | Human-readable task description | | |
| | `files` | `dict[str, str]` | Simulated filesystem preview (5-line snippets) | | |
| | `metrics` | `dict` | Live infrastructure metrics (memory, latency, etc.) | | |
| | `partial_score` | `float` | Running [0.0β1.0] score for the episode | | |
| --- | |
| ## Action Space | |
| All actions are JSON objects with these fields: | |
| | Field | Type | Required | Description | | |
| |---|---|---|---| | |
| | `action_type` | `str` | β | One of: `write_file`, `run_script`, `submit` | | |
| | `filename` | `str` | For `write_file` / `run_script` | Target file on simulated filesystem | | |
| | `content` | `str` | For `write_file` | Complete source code to write | | |
| **Action types:** | |
| - **`write_file`** β Write code/content to a named file on the simulated filesystem | |
| - **`run_script`** β Execute a script already on the filesystem (returns simulated stdout) | |
| - **`submit`** β Mark episode complete and trigger grading | |
| --- | |
| ## Tasks | |
| ### Task 1 β Security Audit & Self-Evaluation (π’ Easy) | |
| **Scenario:** A `dataset.jsonl` (200 entries) has been infected with 50 malicious backdoor prompts containing the trigger token `TRIGGER_ALPHA`. | |
| **Agent must:** | |
| 1. Write `audit.py` to detect and remove backdoor entries β `cleaned_dataset.jsonl` | |
| 2. Write `evaluate.py` to compare against `golden_baseline.jsonl` β `metrics_report.json` | |
| 3. Run both scripts, then `submit` | |
| **Grader criteria (deterministic):** | |
| - File `cleaned_dataset.jsonl` exists β **+0.10** | |
| - Cleaning F1 score (precision/recall of removed entries) β **up to +0.40** | |
| - File `metrics_report.json` exists β **+0.10** | |
| - Agent's self-reported F1 matches ground truth F1 (within 1%) β **+0.40** | |
| **Max Steps:** 20 | **Pass threshold:** β₯ 0.80 | |
| --- | |
| ### Task 2 β Distributed Cluster Crash / FSDP (π‘ Medium) | |
| **Scenario:** `train.py` crashes with CUDA Out-of-Memory because a 280GB model is loaded onto a single 40GB GPU. | |
| **Agent must:** | |
| 1. Write `train_fsdp.py` using `torch.distributed.fsdp.FullyShardedDataParallel` across 8 GPUs | |
| 2. Proper `dist.init_process_group` initialization | |
| 3. Run and submit | |
| **Grader criteria (AST + keyword analysis):** | |
| - File exists β **+0.10** | |
| - 5 FSDP keywords detected β **up to +0.50** | |
| - AST finds `FSDP(...)` wrapper call β **+0.20** | |
| - Simulated memory 35GB/GPU (β€ 40GB limit) β **+0.20** | |
| **Max Steps:** 25 | **Pass threshold:** β₯ 0.80 | |
| --- | |
| ### Task 3 β Triton Hardware Bottleneck (π΄ Hard) | |
| **Scenario:** `slow_math.py` runs a SiLU gated activation at 150ms/step due to 2 separate memory round-trips. | |
| **Agent must:** | |
| 1. Write `fast_silu_kernel.py` with a `@triton.jit` kernel | |
| 2. Use `tl.load` for both `x_ptr` and `gate_ptr`, compute SiLU + multiply in registers, write once via `tl.store` | |
| 3. Run and submit | |
| **Grader criteria (AST + regex):** | |
| - File exists β **+0.10** | |
| - `@triton.jit`, `tl.load`, `tl.store` present β **up to +0.45** | |
| - SiLU math pattern detected (regex) β **+0.20** | |
| - Kernel function with pointer args found in AST β **+0.10** | |
| - Simulated latency 11.8ms/step (β€ 20ms target) β **+0.15** | |
| **Max Steps:** 30 | **Pass threshold:** β₯ 0.80 | |
| --- | |
| ## Reward Function | |
| The reward signal is **non-sparse** β partial credit is awarded throughout the trajectory: | |
| | Event | Reward | | |
| |---|---| | |
| | Writing a file with correct patterns | `+0.05` to `+0.20` | | |
| | Running a script successfully | `+0.10` to `+0.30` | | |
| | Correct strategy detected at write time | `+0.10` to `+0.20` | | |
| | Running original buggy script (OOM) | `β0.10` | | |
| | Double submit | `β0.10` | | |
| | Unknown action type | `β0.05` | | |
| | Timeout (steps exhausted) | `β0.10` | | |
| All per-step rewards are clipped to `[β1.0, 1.0]`. | |
| --- | |
| ## API Endpoints | |
| | Method | Path | Description | | |
| |---|---|---| | |
| | `GET` | `/` | Mission Control Dashboard (HTML) | | |
| | `POST` | `/reset` | Reset environment (`task_id` optional) | | |
| | `POST` | `/step` | Take one action | | |
| | `GET` | `/state` | Full internal state (debug) | | |
| | `GET` | `/tasks` | List tasks + action schema | | |
| | `GET` | `/grader` | Grade current episode | | |
| | `POST` | `/baseline` | Trigger baseline agent | | |
| | `GET` | `/health` | Health check | | |
| | `GET` | `/docs` | Swagger UI | | |
| --- | |
| ## Setup & Usage | |
| ### Local Development | |
| ```bash | |
| # 1. Install dependencies | |
| pip install -r requirements.txt | |
| # 2. Start the server | |
| uvicorn main:app --reload --port 7860 | |
| # 3. Open dashboard | |
| open http://localhost:7860 | |
| # 4. Run the baseline agent (no API key needed β uses expert solutions) | |
| python inference.py | |
| # 5. Run baseline with OpenAI-compatible API | |
| HF_TOKEN=xyz API_BASE_URL=https://... MODEL_NAME=... python inference.py | |
| ``` | |
| ### Docker | |
| ```bash | |
| # Build | |
| docker build -t frontierlabs-env . | |
| # Run | |
| docker run -p 7860:7860 frontierlabs-env | |
| # With LLM API | |
| docker run -p 7860:7860 -e HF_TOKEN=xyz -e API_BASE_URL=https://... -e MODEL_NAME=... frontierlabs-env | |
| ``` | |
| ### Quick curl walkthrough | |
| ```bash | |
| # Reset to Task 1 | |
| curl -X POST http://localhost:7860/reset \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"task_id": "task1_security_audit"}' | |
| # Write your solution | |
| curl -X POST http://localhost:7860/step \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"action_type": "write_file", "filename": "audit.py", "content": "..."}' | |
| # Run the script | |
| curl -X POST http://localhost:7860/step \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"action_type": "run_script", "filename": "audit.py"}' | |
| # Submit | |
| curl -X POST http://localhost:7860/step \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"action_type": "submit"}' | |
| # Get score | |
| curl http://localhost:7860/grader | |
| ``` | |
| --- | |
| ## Baseline Scores | |
| Scores produced by the deterministic expert agent (no LLM β validates grader correctness): | |
| | Task | Score | Passed | | |
| |---|---|---| | |
| | Task 1 β Security Audit | **1.0000** | β | | |
| | Task 2 β FSDP Cluster Fix | **1.0000** | β | | |
| | Task 3 β Triton Kernel | **1.0000** | β | | |
| | **Average** | **1.0000** | β | | |
| Reproduce with: | |
| ```bash | |
| python inference.py | |
| ``` | |
| --- | |
| ## Project Structure | |
| ``` | |
| FrontierLabs-Env/ | |
| βββ openenv.yaml # OpenEnv spec (tasks, schemas, metadata) | |
| βββ main.py # FastAPI server + dashboard + all endpoints | |
| βββ environment.py # State machine, simulated filesystem, action handlers | |
| βββ graders.py # Deterministic graders (AST + regex + math) | |
| βββ inference.py # Baseline inference script (LLM + expert mode) | |
| βββ requirements.txt # Python dependencies | |
| βββ Dockerfile # HF Spaces compatible container | |
| βββ README.md # This file | |
| ``` | |
| --- | |
| ## Environment Variables | |
| | Variable | Default | Description | | |
| |---|---|---| | |
| | `HF_TOKEN` | β | API key for baseline agent | | |
| | `API_BASE_URL` | `https://api.openai.com/v1` | LLM API Base URL | | |
| | `MODEL_NAME` | `gpt-4o` | Model identifier for baseline inference | | |
| | `FRONTIER_ENV_URL` | `http://localhost:7860` | Environment server URL | | |
| | `PORT` | `7860` | Server port | | |
| --- | |
| ## OpenEnv Validation | |
| The environment passes `openenv validate`: | |
| ```bash | |
| openenv validate openenv.yaml | |
| # β Schema valid | |
| # β 3 tasks defined | |
| # β Endpoints responding | |
| # β Graders returning [0.0, 1.0] scores | |
| ``` | |
| --- | |
| ## Hackathon Checklist | |
| - [x] Real-world task simulation (GPU infrastructure engineering) | |
| - [x] OpenEnv spec compliance (typed models, step/reset/state) | |
| - [x] 3 tasks with agent graders (easy β medium β hard) | |
| - [x] Meaningful reward function with partial progress signals | |
| - [x] Baseline inference script with reproducible scores | |
| - [x] `openenv.yaml` metadata | |
| - [x] Working Dockerfile (HF Spaces compatible, port 7860) | |
| - [x] `/baseline`, `/grader`, `/tasks` endpoints | |
| - [x] Mission Control dashboard at `/` | |
| - [x] README with full documentation | |