---
title: FrontierLabs-Env
emoji: π
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---
# FrontierLabs-Env π
An OpenEnv-compliant AI infrastructure simulation sandbox that drops an AI agent into a failing PyTorch/GPU supercomputing environment. The agent must act autonomously as a Principal AI Infrastructure Engineer.
## Motivation
As AI models scale to hundreds of billions of parameters, evaluating agents on elite infrastructure tasks (data security auditing, distributed training optimization, and GPU kernel engineering) is impossible without risking actual multi-million-dollar server clusters.
FrontierLabs-Env solves this by providing a strictly deterministic, fully sandboxed simulation of these scenarios with programmatic graders and a rich partial-reward signal.
## Environment Description
The agent interacts with a simulated filesystem on a virtual GPU supercomputing cluster. It reads files, writes code, executes scripts (simulated), and submits solutions. The environment tracks progress through a state machine that rewards correct partial steps, not just final answers.
### Observation Space
Every step returns an Observation object:
| Field | Type | Description |
|---|---|---|
| `task_id` | `str` | Active task identifier |
| `step` | `int` | Current step number |
| `done` | `bool` | Whether the episode has ended |
| `message` | `str` | Human-readable task description |
| `files` | `dict[str, str]` | Simulated filesystem preview (5-line snippets) |
| `metrics` | `dict` | Live infrastructure metrics (memory, latency, etc.) |
| `partial_score` | `float` | Running score in [0.0, 1.0] for the episode |
### Action Space
All actions are JSON objects with these fields:
| Field | Type | Required | Description |
|---|---|---|---|
| `action_type` | `str` | ✅ | One of: `write_file`, `run_script`, `submit` |
| `filename` | `str` | For `write_file` / `run_script` | Target file on simulated filesystem |
| `content` | `str` | For `write_file` | Complete source code to write |
Action types:

- `write_file`: Write code/content to a named file on the simulated filesystem
- `run_script`: Execute a script already on the filesystem (returns simulated stdout)
- `submit`: Mark episode complete and trigger grading
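A lightweight client-side check mirroring this schema could look like the sketch below (the server performs its own authoritative validation; `validate_action` is an illustrative helper, not part of the API):

```python
VALID_ACTIONS = {"write_file", "run_script", "submit"}

def validate_action(action: dict) -> bool:
    """Check an action dict against the schema described above."""
    if action.get("action_type") not in VALID_ACTIONS:
        return False
    # write_file and run_script both need a target filename
    if action["action_type"] in ("write_file", "run_script") and "filename" not in action:
        return False
    # write_file additionally needs the file content
    if action["action_type"] == "write_file" and "content" not in action:
        return False
    return True
```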
## Tasks
### Task 1 – Security Audit & Self-Evaluation (🟢 Easy)
Scenario: A `dataset.jsonl` file (200 entries) has been infected with 50 malicious backdoor prompts containing the trigger token `TRIGGER_ALPHA`.
Agent must:

- Write `audit.py` to detect and remove backdoor entries → `cleaned_dataset.jsonl`
- Write `evaluate.py` to compare against `golden_baseline.jsonl` → `metrics_report.json`
- Run both scripts, then `submit`
Grader criteria (deterministic):

- File `cleaned_dataset.jsonl` exists → +0.10
- Cleaning F1 score (precision/recall of removed entries) → up to +0.40
- File `metrics_report.json` exists → +0.10
- Agent's self-reported F1 matches ground truth F1 (within 1%) → +0.40
Max Steps: 20 | Pass threshold: ≥ 0.80
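A minimal `audit.py` along these lines might look like the sketch below. The exact JSONL record schema is an assumption, and the real grader scores removal F1 against hidden ground truth; this only shows the trigger-filtering idea:

```python
TRIGGER = "TRIGGER_ALPHA"

def clean_dataset(in_path: str, out_path: str) -> int:
    """Copy a JSONL file, dropping every entry that contains the backdoor
    trigger token. Returns the number of removed entries."""
    removed = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue
            if TRIGGER in line:  # trigger may appear anywhere in the record
                removed += 1
                continue
            dst.write(line)
    return removed
```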
### Task 2 – Distributed Cluster Crash / FSDP (🟡 Medium)
Scenario: `train.py` crashes with CUDA Out-of-Memory because a 280GB model is loaded onto a single 40GB GPU.
Agent must:

- Write `train_fsdp.py` using `torch.distributed.fsdp.FullyShardedDataParallel` across 8 GPUs
- Proper `dist.init_process_group` initialization
- Run and submit
Grader criteria (AST + keyword analysis):

- File exists → +0.10
- 5 FSDP keywords detected → up to +0.50
- AST finds `FSDP(...)` wrapper call → +0.20
- Simulated memory 35GB/GPU (≤ 40GB limit) → +0.20
Max Steps: 25 | Pass threshold: ≥ 0.80
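The AST + keyword pass can be approximated like this (a sketch of the grading idea; the actual keyword list and logic in `graders.py` may differ):

```python
import ast

# Illustrative keyword list; the real grader's list may differ.
FSDP_KEYWORDS = [
    "FullyShardedDataParallel", "init_process_group",
    "torch.distributed", "fsdp", "FSDP",
]

def find_fsdp_wrapper(source: str) -> bool:
    """Return True if the AST contains a call to FSDP(...) or
    FullyShardedDataParallel(...)."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", "")
            if name in ("FSDP", "FullyShardedDataParallel"):
                return True
    return False

def count_keywords(source: str) -> int:
    """Count how many FSDP-related keywords appear in the source text."""
    return sum(1 for kw in FSDP_KEYWORDS if kw in source)
```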
### Task 3 – Triton Hardware Bottleneck (🔴 Hard)
Scenario: `slow_math.py` runs a SiLU gated activation at 150ms/step due to 2 separate memory round-trips.
Agent must:

- Write `fast_silu_kernel.py` with a `@triton.jit` kernel
- Use `tl.load` for both `x_ptr` and `gate_ptr`, compute SiLU + multiply in registers, write once via `tl.store`
- Run and submit
Grader criteria (AST + regex):

- File exists → +0.10
- `@triton.jit`, `tl.load`, `tl.store` present → up to +0.45
- SiLU math pattern detected (regex) → +0.20
- Kernel function with pointer args found in AST → +0.10
- Simulated latency 11.8ms/step (≤ 20ms target) → +0.15
Max Steps: 30 | Pass threshold: ≥ 0.80
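For reference, the math the kernel must fuse, assuming the common `silu(gate) * x` gating convention, can be sketched in pure Python (no Triton; `fused_silu_mul` is an illustrative name):

```python
import math

def fused_silu_mul(x: list[float], gate: list[float]) -> list[float]:
    """Compute silu(gate) * x element-wise in a single pass, mirroring
    what the fused Triton kernel does per element in registers."""
    out = []
    for xi, gi in zip(x, gate):
        silu = gi / (1.0 + math.exp(-gi))  # silu(g) = g * sigmoid(g)
        out.append(silu * xi)
    return out
```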
## Reward Function
The reward signal is non-sparse: partial credit is awarded throughout the trajectory:
| Event | Reward |
|---|---|
| Writing a file with correct patterns | +0.05 to +0.20 |
| Running a script successfully | +0.10 to +0.30 |
| Correct strategy detected at write time | +0.10 to +0.20 |
| Running original buggy script (OOM) | -0.10 |
| Double submit | -0.10 |
| Unknown action type | -0.05 |
| Timeout (steps exhausted) | -0.10 |
All per-step rewards are clipped to [-1.0, 1.0].
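The clipping step is trivial but worth stating precisely; a sketch:

```python
def clip_reward(r: float) -> float:
    """Clamp a raw per-step reward into [-1.0, 1.0]."""
    return max(-1.0, min(1.0, r))
```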
## API Endpoints
| Method | Path | Description |
|---|---|---|
| `GET` | `/` | Mission Control Dashboard (HTML) |
| `POST` | `/reset` | Reset environment (`task_id` optional) |
| `POST` | `/step` | Take one action |
| `GET` | `/state` | Full internal state (debug) |
| `GET` | `/tasks` | List tasks + action schema |
| `GET` | `/grader` | Grade current episode |
| `POST` | `/baseline` | Trigger baseline agent |
| `GET` | `/health` | Health check |
| `GET` | `/docs` | Swagger UI |
## Setup & Usage
### Local Development
```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Start the server
uvicorn main:app --reload --port 7860

# 3. Open dashboard
open http://localhost:7860

# 4. Run the baseline agent (no API key needed; uses expert solutions)
python inference.py

# 5. Run baseline with an OpenAI-compatible API
HF_TOKEN=xyz API_BASE_URL=https://... MODEL_NAME=... python inference.py
```
### Docker
```bash
# Build
docker build -t frontierlabs-env .

# Run
docker run -p 7860:7860 frontierlabs-env

# With LLM API
docker run -p 7860:7860 -e HF_TOKEN=xyz -e API_BASE_URL=https://... -e MODEL_NAME=... frontierlabs-env
```
### Quick curl walkthrough
```bash
# Reset to Task 1
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "task1_security_audit"}'

# Write your solution
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "write_file", "filename": "audit.py", "content": "..."}'

# Run the script
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "run_script", "filename": "audit.py"}'

# Submit
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "submit"}'

# Get score
curl http://localhost:7860/grader
```
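The same walkthrough can be driven from Python with only the standard library (a sketch; it assumes the server is running on localhost, and `post` / `write_file_action` are local helpers, not part of the API):

```python
import json
import urllib.request

BASE = "http://localhost:7860"

def post(path: str, payload: dict) -> dict:
    """POST a JSON payload to the environment server and decode the reply."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def write_file_action(filename: str, content: str) -> dict:
    """Build a write_file action body."""
    return {"action_type": "write_file", "filename": filename, "content": content}

if __name__ == "__main__":
    post("/reset", {"task_id": "task1_security_audit"})
    post("/step", write_file_action("audit.py", "..."))
    post("/step", {"action_type": "run_script", "filename": "audit.py"})
    print(post("/step", {"action_type": "submit"}))
```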
## Baseline Scores
Scores produced by the deterministic expert agent (no LLM; validates grader correctness):
| Task | Score | Passed |
|---|---|---|
| Task 1 – Security Audit | 1.0000 | ✅ |
| Task 2 – FSDP Cluster Fix | 1.0000 | ✅ |
| Task 3 – Triton Kernel | 1.0000 | ✅ |
| Average | 1.0000 | ✅ |
Reproduce with:
```bash
python inference.py
```
## Project Structure
```
FrontierLabs-Env/
├── openenv.yaml       # OpenEnv spec (tasks, schemas, metadata)
├── main.py            # FastAPI server + dashboard + all endpoints
├── environment.py     # State machine, simulated filesystem, action handlers
├── graders.py         # Deterministic graders (AST + regex + math)
├── inference.py       # Baseline inference script (LLM + expert mode)
├── requirements.txt   # Python dependencies
├── Dockerfile         # HF Spaces compatible container
└── README.md          # This file
```
## Environment Variables
| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | (none) | API key for baseline agent |
| `API_BASE_URL` | `https://api.openai.com/v1` | LLM API Base URL |
| `MODEL_NAME` | `gpt-4o` | Model identifier for baseline inference |
| `FRONTIER_ENV_URL` | `http://localhost:7860` | Environment server URL |
| `PORT` | `7860` | Server port |
## OpenEnv Validation
The environment passes `openenv validate`:
```bash
openenv validate openenv.yaml
# ✅ Schema valid
# ✅ 3 tasks defined
# ✅ Endpoints responding
# ✅ Graders returning [0.0, 1.0] scores
```
## Hackathon Checklist
- Real-world task simulation (GPU infrastructure engineering)
- OpenEnv spec compliance (typed models, step/reset/state)
- 3 tasks with agent graders (easy → medium → hard)
- Meaningful reward function with partial progress signals
- Baseline inference script with reproducible scores
- `openenv.yaml` metadata
- Working Dockerfile (HF Spaces compatible, port 7860)
- `/baseline`, `/grader`, `/tasks` endpoints
- Mission Control dashboard at `/`
- README with full documentation