---
title: FrontierLabs-Env
emoji: 🚀
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---

# FrontierLabs-Env 🚀

An OpenEnv-compliant AI Infrastructure Simulation Sandbox β€” drops an AI agent into a failing PyTorch/GPU supercomputing environment. The agent must autonomously act as a Principal AI Infrastructure Engineer.



## Motivation

As AI models scale to hundreds of billions of parameters, evaluating agents on elite infrastructure tasks — data-security auditing, distributed-training optimization, and GPU kernel engineering — is impractical without risking actual multi-million-dollar server clusters.

FrontierLabs-Env solves this by providing a strictly deterministic, fully sandboxed simulation of these scenarios with programmatic graders and a rich partial-reward signal.


## Environment Description

The agent interacts with a simulated filesystem on a virtual GPU supercomputing cluster. It reads files, writes code, executes scripts (simulated), and submits solutions. The environment tracks progress through a state machine that rewards correct partial steps, not just final answers.


## Observation Space

Every step returns an `Observation` object:

| Field | Type | Description |
|---|---|---|
| `task_id` | `str` | Active task identifier |
| `step` | `int` | Current step number |
| `done` | `bool` | Whether the episode has ended |
| `message` | `str` | Human-readable task description |
| `files` | `dict[str, str]` | Simulated filesystem preview (5-line snippets) |
| `metrics` | `dict` | Live infrastructure metrics (memory, latency, etc.) |
| `partial_score` | `float` | Running [0.0–1.0] score for the episode |
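Client code can mirror these fields with a small typed model; a minimal sketch (the server's actual Pydantic definitions may differ):

```python
from dataclasses import dataclass

@dataclass
class Observation:
    task_id: str        # active task identifier
    step: int           # current step number
    done: bool          # whether the episode has ended
    message: str        # human-readable task description
    files: dict         # filename -> 5-line snippet
    metrics: dict       # live infrastructure metrics
    partial_score: float  # running [0.0, 1.0] episode score

# Example of what a fresh episode's observation might look like
obs = Observation(
    task_id="task1_security_audit", step=0, done=False,
    message="Audit the dataset.", files={}, metrics={}, partial_score=0.0,
)
```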

## Action Space

All actions are JSON objects with these fields:

| Field | Type | Required | Description |
|---|---|---|---|
| `action_type` | `str` | ✅ | One of: `write_file`, `run_script`, `submit` |
| `filename` | `str` | For `write_file` / `run_script` | Target file on the simulated filesystem |
| `content` | `str` | For `write_file` | Complete source code to write |

Action types:

- `write_file` — Write code/content to a named file on the simulated filesystem
- `run_script` — Execute a script already on the filesystem (returns simulated stdout)
- `submit` — Mark the episode complete and trigger grading
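A minimal client-side check of an action payload against this schema might look like the following (illustrative only; the server performs its own validation):

```python
VALID_ACTIONS = {"write_file", "run_script", "submit"}

def validate_action(action: dict) -> bool:
    """Check an action dict against the schema in the table above."""
    atype = action.get("action_type")
    if atype not in VALID_ACTIONS:
        return False
    # filename is required for both file-touching actions
    if atype in ("write_file", "run_script") and "filename" not in action:
        return False
    # content is additionally required when writing a file
    if atype == "write_file" and "content" not in action:
        return False
    return True
```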

## Tasks

### Task 1 — Security Audit & Self-Evaluation (🟢 Easy)

**Scenario:** A `dataset.jsonl` file (200 entries) has been infected with 50 malicious backdoor prompts containing the trigger token `TRIGGER_ALPHA`.

Agent must:

1. Write `audit.py` to detect and remove backdoor entries → `cleaned_dataset.jsonl`
2. Write `evaluate.py` to compare against `golden_baseline.jsonl` → `metrics_report.json`
3. Run both scripts, then submit

Grader criteria (deterministic):

- File `cleaned_dataset.jsonl` exists → +0.10
- Cleaning F1 score (precision/recall of removed entries) → up to +0.40
- File `metrics_report.json` exists → +0.10
- Agent's self-reported F1 matches the ground-truth F1 (within 1%) → +0.40

**Max steps:** 20 | **Pass threshold:** ≥ 0.80
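One minimal cleaning strategy is to drop every entry whose serialized JSON contains the trigger token; a sketch that scans each entry's serialized form rather than assuming specific field names:

```python
import json

TRIGGER = "TRIGGER_ALPHA"

def clean(lines):
    """Keep only JSONL entries whose serialized form lacks the trigger token."""
    kept = []
    for line in lines:
        entry = json.loads(line)
        # search the whole serialized entry, not just one field
        if TRIGGER not in json.dumps(entry):
            kept.append(entry)
    return kept
```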


### Task 2 — Distributed Cluster Crash / FSDP (🟡 Medium)

**Scenario:** `train.py` crashes with a CUDA out-of-memory error because a 280 GB model is loaded onto a single 40 GB GPU.

Agent must:

1. Write `train_fsdp.py` using `torch.distributed.fsdp.FullyShardedDataParallel` across 8 GPUs
2. Initialize the process group correctly with `dist.init_process_group`
3. Run the script, then submit

Grader criteria (AST + keyword analysis):

- File exists → +0.10
- Five FSDP keywords detected → up to +0.50
- AST finds an `FSDP(...)` wrapper call → +0.20
- Simulated memory of 35 GB/GPU (≤ 40 GB limit) → +0.20

**Max steps:** 25 | **Pass threshold:** ≥ 0.80
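The 35 GB/GPU figure in the grader follows from even parameter sharding: 280 GB split across 8 ranks. A back-of-the-envelope check (parameters only, ignoring activations and optimizer state):

```python
def per_gpu_param_memory_gb(model_gb: float, num_gpus: int) -> float:
    """FSDP with full sharding splits parameters evenly across ranks."""
    return model_gb / num_gpus

# 280 GB model sharded across 8 GPUs -> 35 GB/GPU, under the 40 GB limit
print(per_gpu_param_memory_gb(280, 8))
```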


### Task 3 — Triton Hardware Bottleneck (🔴 Hard)

**Scenario:** `slow_math.py` runs a SiLU-gated activation at 150 ms/step due to two separate memory round-trips.

Agent must:

1. Write `fast_silu_kernel.py` with a `@triton.jit` kernel
2. Use `tl.load` for both `x_ptr` and `gate_ptr`, compute SiLU + multiply in registers, and write once via `tl.store`
3. Run the script, then submit

Grader criteria (AST + regex):

- File exists → +0.10
- `@triton.jit`, `tl.load`, `tl.store` present → up to +0.45
- SiLU math pattern detected (regex) → +0.20
- Kernel function with pointer args found in AST → +0.10
- Simulated latency of 11.8 ms/step (≤ 20 ms target) → +0.15

**Max steps:** 30 | **Pass threshold:** ≥ 0.80
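For reference, the math the kernel must fuse is `x * SiLU(gate)` computed element-wise in a single pass. A plain-Python sketch of that reference math (useful for sanity-checking kernel output; not the Triton kernel itself):

```python
import math

def silu(v: float) -> float:
    """SiLU (swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def gated_silu(x, gate):
    """Fused form: read x and gate once, write the product once."""
    return [xi * silu(gi) for xi, gi in zip(x, gate)]
```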


## Reward Function

The reward signal is non-sparse β€” partial credit is awarded throughout the trajectory:

| Event | Reward |
|---|---|
| Writing a file with correct patterns | +0.05 to +0.20 |
| Running a script successfully | +0.10 to +0.30 |
| Correct strategy detected at write time | +0.10 to +0.20 |
| Running the original buggy script (OOM) | −0.10 |
| Double submit | −0.10 |
| Unknown action type | −0.05 |
| Timeout (steps exhausted) | −0.10 |

All per-step rewards are clipped to [βˆ’1.0, 1.0].
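That clipping is a one-line clamp; a minimal sketch:

```python
def clip_reward(r: float) -> float:
    """Clamp a per-step reward into [-1.0, 1.0]."""
    return max(-1.0, min(1.0, r))
```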


## API Endpoints

| Method | Path | Description |
|---|---|---|
| GET | `/` | Mission Control Dashboard (HTML) |
| POST | `/reset` | Reset the environment (`task_id` optional) |
| POST | `/step` | Take one action |
| GET | `/state` | Full internal state (debug) |
| GET | `/tasks` | List tasks + action schema |
| GET | `/grader` | Grade the current episode |
| POST | `/baseline` | Trigger the baseline agent |
| GET | `/health` | Health check |
| GET | `/docs` | Swagger UI |

## Setup & Usage

### Local Development

```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Start the server
uvicorn main:app --reload --port 7860

# 3. Open the dashboard
open http://localhost:7860

# 4. Run the baseline agent (no API key needed — uses expert solutions)
python inference.py

# 5. Run the baseline against an OpenAI-compatible API
HF_TOKEN=xyz API_BASE_URL=https://... MODEL_NAME=... python inference.py
```

### Docker

```bash
# Build
docker build -t frontierlabs-env .

# Run
docker run -p 7860:7860 frontierlabs-env

# With an LLM API
docker run -p 7860:7860 -e HF_TOKEN=xyz -e API_BASE_URL=https://... -e MODEL_NAME=... frontierlabs-env
```

### Quick curl walkthrough

```bash
# Reset to Task 1
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "task1_security_audit"}'

# Write your solution
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "write_file", "filename": "audit.py", "content": "..."}'

# Run the script
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "run_script", "filename": "audit.py"}'

# Submit
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "submit"}'

# Get score
curl http://localhost:7860/grader
```

## Baseline Scores

Scores produced by the deterministic expert agent (no LLM β€” validates grader correctness):

| Task | Score | Passed |
|---|---|---|
| Task 1 — Security Audit | 1.0000 | ✅ |
| Task 2 — FSDP Cluster Fix | 1.0000 | ✅ |
| Task 3 — Triton Kernel | 1.0000 | ✅ |
| **Average** | **1.0000** | ✅ |

Reproduce with:

```bash
python inference.py
```

## Project Structure

```text
FrontierLabs-Env/
├── openenv.yaml      # OpenEnv spec (tasks, schemas, metadata)
├── main.py           # FastAPI server + dashboard + all endpoints
├── environment.py    # State machine, simulated filesystem, action handlers
├── graders.py        # Deterministic graders (AST + regex + math)
├── inference.py      # Baseline inference script (LLM + expert mode)
├── requirements.txt  # Python dependencies
├── Dockerfile        # HF Spaces-compatible container
└── README.md         # This file
```

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | — | API key for the baseline agent |
| `API_BASE_URL` | `https://api.openai.com/v1` | LLM API base URL |
| `MODEL_NAME` | `gpt-4o` | Model identifier for baseline inference |
| `FRONTIER_ENV_URL` | `http://localhost:7860` | Environment server URL |
| `PORT` | `7860` | Server port |
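A sketch of how a client script might resolve these variables with the defaults above (the exact lookup in `inference.py` may differ):

```python
import os

def load_config() -> dict:
    """Read environment variables, falling back to the documented defaults."""
    return {
        "hf_token": os.environ.get("HF_TOKEN"),  # no default; expert mode if unset
        "api_base_url": os.environ.get("API_BASE_URL", "https://api.openai.com/v1"),
        "model_name": os.environ.get("MODEL_NAME", "gpt-4o"),
        "env_url": os.environ.get("FRONTIER_ENV_URL", "http://localhost:7860"),
        "port": int(os.environ.get("PORT", "7860")),
    }
```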

## OpenEnv Validation

The environment passes `openenv validate`:

```bash
openenv validate openenv.yaml
# ✅ Schema valid
# ✅ 3 tasks defined
# ✅ Endpoints responding
# ✅ Graders returning [0.0, 1.0] scores
```

## Hackathon Checklist

- Real-world task simulation (GPU infrastructure engineering)
- OpenEnv spec compliance (typed models, step/reset/state)
- 3 tasks with agent graders (easy → medium → hard)
- Meaningful reward function with partial progress signals
- Baseline inference script with reproducible scores
- `openenv.yaml` metadata
- Working Dockerfile (HF Spaces compatible, port 7860)
- `/baseline`, `/grader`, `/tasks` endpoints
- Mission Control dashboard at `/`
- README with full documentation