---
title: EntropyEnv
emoji: π
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
---
# EntropyEnv: Multi-Agent Dev Tools Environment

A multi-domain RL environment for training and evaluating AI agents on real-world developer and clinical tasks. Built for the Scaler × Meta × PyTorch × Hugging Face OpenEnv Hackathon 2026.
## 💡 Why This Environment?
Most RL benchmarks test agents on static, single-turn tasks: classify this image, answer this question. Real developer workflows are multi-turn, iterative, and require revision:

- A security reviewer doesn't just find a bug: they identify → propose a fix → revise after feedback
- A DevOps engineer doesn't just flag outdated packages: they resolve version conflicts across an entire dependency graph
- A clinical coordinator doesn't just spot missing steps: they prioritize by urgency and plan a dependency-safe recovery

Few existing RL environments exercise this full identify → act → revise cycle. EntropyEnv fills that gap with 9 tasks across 3 real-world domains, progressive difficulty, rich partial-credit scoring, and iterative multi-turn episodes.
## 🎯 What Is This?

EntropyEnv is a training gym for AI agents, not the agent itself. Think of it like a driving test course: we build the course, and different AI "drivers" take the test.

An AI agent connects via API, receives a task (e.g., "find the vulnerability in this code"), sends back an action (its answer), and gets a reward score based on how good the answer is.
```
             POST /reset
AI Agent ───────────────────────► EntropyEnv
                                      │
                                      ├── Picks a task case from the dataset
         ◄───────────────────────     └── Returns: observation (the problem)

             POST /step
         ───────────────────────►
                                      ├── Validates the action (3 stages)
                                      ├── Grades it (domain-specific grader)
         ◄───────────────────────     └── Returns: reward + done + next observation

         (repeat until done)
```
## Three Domains, Nine Tasks

### Domain 1: MCP Security Auditing
Agents identify vulnerabilities in code snippets, propose secure fixes, and iteratively revise based on adversarial reviewer feedback.
| Task | Difficulty | What the Agent Does |
|---|---|---|
| `sec_easy` | 🟢 Easy | Classify a single vulnerability (type, CVSS, severity) |
| `sec_medium` | 🟡 Medium | Identify → propose a code fix |
| `sec_hard` | 🔴 Hard | Identify → fix → revise with adversarial reviewer feedback |

**Coverage:** SQL injection, XSS, IDOR, hardcoded secrets, missing auth, JWT misuse, path traversal, SSRF, XXE
### 📦 Domain 2: PyTorch Migration Time-Machine
Agents detect deprecated APIs, resolve version conflicts using compatibility matrices, and fix torch.compile graph-break patterns in dependency order.
| Task | Difficulty | What the Agent Does |
|---|---|---|
| `dep_easy` | 🟢 Easy | Flag outdated packages and deprecated API usage |
| `dep_medium` | 🟡 Medium | Resolve version conflicts across package constraints |
| `dep_hard` | 🔴 Hard | Fix `torch.compile` graph-breaks in correct dependency order |

**Coverage:** `Variable`, `cuda()`, `DataParallel`, ONNX export, `torch.compile`, `vmap`, `torch.export`
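For intuition, the `dep_medium`-style constraint resolution can be sketched with the `packaging` library (already in the install list under Getting Started). The candidate versions, constraint strings, and function names below are invented for illustration; they are not EntropyEnv's dataset or API:

```python
# Hedged sketch of dep_medium-style conflict resolution using the
# `packaging` library. Versions, constraints, and function names are
# illustrative, not EntropyEnv's ground-truth cases.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

def satisfies_all(candidate, constraints):
    """True if `candidate` satisfies every specifier string."""
    version = Version(candidate)
    return all(version in SpecifierSet(spec) for spec in constraints)

def resolve(candidates, constraints):
    """Highest candidate version compatible with all constraints, or None."""
    ok = [v for v in candidates if satisfies_all(v, constraints)]
    return max(ok, key=Version) if ok else None

# Two imaginary downstream packages pin torch differently:
print(resolve(["1.13.1", "2.0.1", "2.1.0", "2.2.2", "2.3.0"],
              [">=2.0,<2.3", "!=2.1.0"]))
# → 2.2.2
```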
### 🏥 Domain 3: Clinical Workflow Chaos Simulator
Agents detect missing steps in hospital workflows, rank them by clinical priority, and plan dependency-ordered recovery sequences.
| Task | Difficulty | What the Agent Does |
|---|---|---|
| `cli_easy` | 🟢 Easy | Detect missing workflow steps and assess risk |
| `cli_medium` | 🟡 Medium | Detect gaps → rank by clinical priority |
| `cli_hard` | 🔴 Hard | Detect → rank → plan dependency-safe recovery |

**Coverage:** Surgery prep, ER triage, chemotherapy, cardiac emergency, blood transfusion, organ transplant, stroke code
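The "dependency-safe recovery" in `cli_hard` amounts to ordering missed steps so every prerequisite runs first, i.e. a topological sort. A minimal sketch, with invented step names and an invented dependency map (not the environment's ground-truth data):

```python
# Hedged sketch of cli_hard-style recovery planning: order missed steps so
# that every step's prerequisites are completed first (topological sort).
# The step names and dependency map are invented examples.
from graphlib import TopologicalSorter

# step -> set of steps that must be completed before it
prereqs = {
    "type_and_crossmatch": set(),
    "consent": set(),
    "blood_issued": {"type_and_crossmatch", "consent"},
    "transfusion": {"blood_issued"},
}

plan = list(TopologicalSorter(prereqs).static_order())
print(plan)  # prerequisites always precede dependents; transfusion comes last
```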
## ⚡ Key Features
| Feature | Description |
|---|---|
| 🎯 Partial-Credit Scoring | F1, NDCG, and weighted multi-component grading rather than binary pass/fail |
| Multi-Turn Episodes | Agents iterate through identify → act → revise workflows |
| 🛡️ 3-Stage Validation | Schema → Domain → Consistency checks with helpful error hints |
| Score Breakdown | Per-component feedback on every step so agents learn what to improve |
| Fatal Error Handling | Automatic detection of 401/402/403 responses stops wasted API calls immediately |
| Universal LLM Support | Works with any OpenAI-compatible model (Qwen, Llama, DeepSeek, Gemini, etc.) |
| 🐳 Docker-Ready | One-command deploy to Hugging Face Spaces |
| GRPO-Compatible | Smooth reward gradients designed for policy-optimization training |
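The fatal-error row deserves a word: auth and billing failures will not heal on retry, so an agent loop should stop immediately rather than keep spending API calls. A minimal sketch of that check; the function name and constant are ours, not EntropyEnv's API:

```python
# Minimal sketch of the fatal-error short-circuit described in the table.
# 401/402/403 indicate auth or billing problems that retrying cannot fix,
# so the episode loop should abort at once. Names here are illustrative.
FATAL_STATUS = {401, 402, 403}  # unauthorized, payment required, forbidden

def should_abort(status_code):
    """True for HTTP errors that retrying cannot fix."""
    return status_code in FATAL_STATUS

print([code for code in (200, 401, 429, 500) if should_abort(code)])  # → [401]
```

Transient statuses like 429 or 500, by contrast, are worth retrying with backoff.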
## API Reference
| Method | Path | Description |
|---|---|---|
| GET | `/` | Health check: returns status and available tasks |
| POST | `/reset` | Start episode: `{"task_id": "sec_easy"}` → `{episode_id, observation}` |
| POST | `/step` | Submit action: `{episode_id, action_type, ...}` → `{reward, done, observation}` |
| GET | `/state` | Query state: `?episode_id=xxx` → current episode info |
| GET | `/debug` | Debug panel: interactive HTML benchmark runner |
| GET | `/web` | Gradio UI: full task browser with run history |
### Quick Example
```python
import requests

# 1. Start an episode
resp = requests.post("http://localhost:7860/reset", json={"task_id": "sec_easy"})
data = resp.json()
episode_id = data["episode_id"]
observation = data["observation"]

print(observation["task_description"])
# → "Identify the SQL injection vulnerability in this code snippet."

# 2. Send an action
action = {
    "episode_id": episode_id,
    "action_type": "identify_vulnerability",
    "vuln_type": "sql_injection",
    "cvss_score": 9.1,
    "severity": "critical",
    "affected_line": 3,
}
result = requests.post("http://localhost:7860/step", json=action).json()
print(f"Reward: {result['reward']}, Done: {result['done']}")
# → Reward: 0.85, Done: True
```
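Multi-turn tasks repeat the `/step` call until `done` is true. One way to structure that loop is sketched below; `reset_fn`/`step_fn` stand in for the HTTP calls above and `choose_action` for your agent's policy. All names are our sketch, not code shipped with EntropyEnv:

```python
# Sketch of a multi-turn episode driver. reset_fn/step_fn wrap the
# POST /reset and POST /step calls; choose_action maps an observation to
# an action dict (without episode_id). Names here are illustrative.
def run_episode(reset_fn, step_fn, choose_action, max_steps=8):
    data = reset_fn()
    episode_id, obs = data["episode_id"], data["observation"]
    rewards = []
    for _ in range(max_steps):  # safety cap so a stuck agent can't loop forever
        action = {"episode_id": episode_id, **choose_action(obs)}
        result = step_fn(action)
        rewards.append(result["reward"])
        if result["done"]:
            break
        obs = result["observation"]
    return rewards
```

In practice `reset_fn` would be something like `lambda: requests.post(f"{base}/reset", json={"task_id": task_id}).json()`, and `step_fn` the analogous `/step` call.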
## Getting Started

### Run Locally

```bash
# Install dependencies
pip install fastapi uvicorn openai requests packaging gradio python-dotenv

# Start the environment
uvicorn server.app:app --host 0.0.0.0 --port 7860
```

### Run with Docker

```bash
docker build -t entropyenv .
docker run -p 7860:7860 entropyenv
```

### Run the Baseline Agent

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your_token_here"
export ENV_URL="http://localhost:7860"

python inference.py
```

### Deploy to Hugging Face Spaces

```bash
huggingface-cli login
openenv push --repo-id <username>/EntropyEnv
```
## Project Structure

```
entropyenv/
├── inference.py            # Baseline agent with smart prompt engineering
├── openenv.yaml            # OpenEnv manifest (9 tasks)
├── pyproject.toml          # Package configuration
├── Dockerfile              # Multi-stage Docker build
├── server/
│   ├── app.py              # FastAPI server with session management
│   ├── router.py           # Task dispatcher with Counter-based sequence checking
│   ├── session.py          # Episode state management
│   ├── web_ui.py           # Gradio UI with performance dashboard
│   ├── demo_agent.py       # Rule-based demo agent
│   ├── benchmark_store.py  # Persistent results storage
│   ├── debug_panel.html    # Interactive debug interface
│   ├── validation/
│   │   └── validator.py    # 3-stage validation with type-casting
│   ├── graders/
│   │   ├── base_grader.py        # Universal reward pipeline
│   │   ├── security_grader.py    # Security domain grader
│   │   ├── dependency_grader.py  # Dependency domain grader
│   │   └── clinical_grader.py    # Clinical domain grader
│   └── datasets/
│       ├── security_cases.py     # 13 ground-truth security cases
│       ├── dependency_cases.py   # 13 ground-truth dependency cases
│       └── clinical_cases.py     # 13 ground-truth clinical cases
└── results/
    └── run_history.json    # Benchmark history (auto-created)
```
## Baseline Performance

**Note:** Scores below are from the latest grading revision (v3: weighted 0.60 × max + 0.40 × mean scoring, difficulty_multiplier removed, dep_hard done-condition fixed). Re-benchmarking across 14+ models is in progress.

| Model | Provider | sec_easy | sec_med | sec_hard | dep_easy | dep_med | dep_hard | cli_easy | cli_med | cli_hard | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|

*(Run `python unnecessary/run_14_models.py` to auto-populate this table.)*

Scoring formula: `score = 0.60 × max(step_rewards) + 0.40 × mean(step_rewards)`, clamped to `[0.01, 0.99]`
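For intuition, the formula fits in a few lines; this is our re-derivation, and the authoritative version lives in `server/graders/base_grader.py`:

```python
# Re-derivation of the documented v3 episode score, for intuition only;
# the actual grader is server/graders/base_grader.py and may differ.
def episode_score(step_rewards):
    best = max(step_rewards)
    mean = sum(step_rewards) / len(step_rewards)
    raw = 0.60 * best + 0.40 * mean
    return min(max(raw, 0.01), 0.99)  # clamp to [0.01, 0.99]

print(episode_score([0.85, 0.92]))  # ≈ 0.906, i.e. 0.60 * 0.92 + 0.40 * 0.885
```

The clamp keeps rewards strictly inside (0, 1), which avoids degenerate all-zero or all-one gradients during policy optimization.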
Design principles:

- 🎯 No artificial difficulty caps: scores reflect actual grader correctness
- Weighted blend: rewards consistently good episodes over single lucky steps
- 🔬 Spec-compliant: `[END]` lines follow the mandated 3-line log format
- 🔧 14+ model families tested for universal compatibility
## Inference Log Format

The baseline `inference.py` emits structured logs matching the OpenEnv spec:

```
[START] task=sec_easy env=EntropyEnv model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=identify_vulnerability reward=0.85 done=false error=null
[STEP] step=2 action=propose_fix reward=0.92 done=true error=null
[END] success=true steps=2 score=0.89 rewards=0.85,0.92
```
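When aggregating run logs, the `[STEP]` lines are easy to parse with a regex that mirrors the format above; the helper itself is ours, not part of `inference.py`:

```python
# Small parser for the [STEP] log lines shown above. The regex mirrors the
# documented format; the helper function is illustrative, not shipped code.
import re

STEP_RE = re.compile(
    r"\[STEP\] step=(\d+) action=(\S+) reward=([\d.]+) "
    r"done=(true|false) error=(\S+)"
)

def parse_step(line):
    m = STEP_RE.match(line)
    if m is None:
        return None  # not a [STEP] line
    step, action, reward, done, error = m.groups()
    return {
        "step": int(step),
        "action": action,
        "reward": float(reward),
        "done": done == "true",
        "error": None if error == "null" else error,
    }

print(parse_step("[STEP] step=2 action=propose_fix reward=0.92 done=true error=null"))
```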
## Built With

- **FastAPI**: high-performance async API framework
- **Gradio**: interactive web UI for testing and visualization
- **PyTorch**: domain expertise for the migration tasks
- **OpenEnv**: standardized RL environment specification

Built with ❤️ for the Scaler × Meta × PyTorch × Hugging Face OpenEnv Hackathon 2026