---
title: AgentDebugger-Env 🐞
emoji: 🐞
colorFrom: yellow
colorTo: green
sdk: docker
app_port: 8000
pinned: true
license: mit
---
# AgentDebuggerEnv 🐞
> **Benchmarking Agentic Reasoning through the Iterative Hypothesis-Test-Fix Loop.**
**AgentDebuggerEnv** is an OpenEnv-compliant benchmarking environment designed for the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**. Unlike static code-repair benchmarks that only measure the final output, AgentDebuggerEnv evaluates the *cognitive trajectory* of an agent: how it forms hypotheses, interprets execution failures, and iterates toward a solution in a secure, live sandbox.
---
## 🔁 The Core Philosophy
Traditional benchmarks (like HumanEval or MBPP) are "one-shot": the model sees a prompt and writes code. Real-world engineering is **iterative**.
AgentDebuggerEnv forces agents to operate in a **live feedback loop**:
1. **Observe**: Analyze existing buggy code and initial test failures.
2. **Hypothesize**: Explicitly state a theory about the root cause (scored for accuracy).
3. **Act**: Submit a surgical fix or query the environment for more context.
4. **Verify**: Observe real-time `stdout/stderr` from a sandboxed test suite execution.
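As a sketch, the four steps above might look like the loop below. `DebugEnv` is a toy stand-in for the real environment (not its actual API surface), whose "bug" is fixed once the agent submits the literal string `"patched"`:

```python
class DebugEnv:
    """Toy stand-in: the 'bug' is fixed once the agent submits 'patched'."""

    def reset(self):
        # 1. Observe: buggy code plus initial test failures
        return {"code": "def add(a, b): return a - b",
                "failures": ["test_add failed"]}

    def step(self, action):
        # 4. Verify: the real environment returns raw stdout/stderr instead
        passed = action["fix"] == "patched"
        return {"passed": passed,
                "stderr": "" if passed else "AssertionError in test_add"}


def run_episode(env, max_attempts=5):
    obs = env.reset()
    for attempt in range(1, max_attempts + 1):
        # 2. Hypothesize: state a theory about the root cause (scored)
        hypothesis = "operator flipped: '-' should be '+'"
        # 3. Act: submit a fix (a real agent would derive it from `obs`)
        fix = "patched" if attempt > 1 else "wrong guess"
        result = env.step({"hypothesis": hypothesis, "fix": fix})
        if result["passed"]:
            return attempt
    return None
```

A real agent would replace the hard-coded hypothesis and fix with model calls conditioned on the observation and the previous step's `stderr`.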
---
## 🛠️ Technical Architecture
### 1. Robust Security Sandbox
Every submission is executed in a multi-layered isolated environment:
* **AST Filtering**: An Abstract Syntax Tree (AST) pass blocks dangerous imports (`os`, `sys`, `subprocess`, etc.) and builtins before execution.
* **Process Isolation**: Executes in a separate subprocess with hard memory (256MB) and time (10s) limits.
* **Thread Safety**: A specialized "Concurrency Sandbox" allows multi-threaded tests for identifying race conditions while maintaining host security.
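As an illustration of the first layer, a single-pass AST filter might look like the sketch below. The module and builtin blocklists here are illustrative, not the environment's exact configuration:

```python
import ast

BLOCKED_MODULES = {"os", "sys", "subprocess", "socket", "shutil"}   # illustrative
BLOCKED_BUILTINS = {"eval", "exec", "open", "__import__"}           # illustrative

def check_submission(source: str) -> list[str]:
    """Walk the AST once and collect violations before any code runs."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    violations.append(f"import of '{alias.name}' blocked")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED_MODULES:
                violations.append(f"from-import of '{node.module}' blocked")
        elif isinstance(node, ast.Call):
            # Catch dangerous builtins called by name, e.g. eval("...")
            if isinstance(node.func, ast.Name) and node.func.id in BLOCKED_BUILTINS:
                violations.append(f"call to builtin '{node.func.id}' blocked")
    return violations
```

Because the check runs on the parse tree, a blocked submission never reaches the subprocess layer at all.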
### 2. High-Fidelity Feedback
Instead of a binary `Pass/Fail` bit, the environment returns the **raw execution stream**. This allows agents to:
* Read stack traces.
* See partial progress (e.g., "6 passed, 2 failed").
* Detect timeouts and resource exhaustion.
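An agent can mine that raw stream with a few regular expressions. The parser below is a hypothetical helper on the agent side, not part of the environment:

```python
import re

def summarize_feedback(raw: str) -> dict:
    """Extract structure from a raw pytest-style execution stream."""
    summary = {"passed": 0, "failed": 0, "timeout": False, "traceback": None}
    m = re.search(r"(\d+) passed", raw)
    if m:
        summary["passed"] = int(m.group(1))
    m = re.search(r"(\d+) failed", raw)
    if m:
        summary["failed"] = int(m.group(1))
    if "timed out" in raw.lower() or "timeout" in raw.lower():
        summary["timeout"] = True
    tb = raw.find("Traceback (most recent call last):")
    if tb != -1:
        # Keep the first traceback line as a pointer into the stream
        summary["traceback"] = raw[tb:].splitlines()[0]
    return summary
```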
---
## 🧩 Task Suite & Reasoning Challenges
| Task | Difficulty | Reasoning Challenge | Why it's hard |
| :--- | :--- | :--- | :--- |
| **Easy** | 🟢 Easy | **Off-by-One** | Requires basic logic verification. The error message is high-signal. |
| **Medium** | 🟡 Medium | **Red Herring** | The symptom (MD5 hashing error) manifests far from the root cause. The agent must trace data flow backward. |
| **Hard** | 🔴 Hard | **Race Condition** | **Invisible to sequential tests.** The agent must reason that passing tests do *not* mean the code is correct, and design a concurrent stress test. |
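To see why the Hard task is invisible to sequential tests, consider a toy shared counter: the lock-free version passes any sequential test, and only a concurrent stress test (the real suite uses far more threads) can distinguish it from the locked fix. Both classes below are illustrative, not the actual task code:

```python
import threading

class UnsafeCounter:
    """Buggy: the read-modify-write below is not atomic, so concurrent
    increments can be lost. A sequential test still passes."""
    def __init__(self):
        self.value = 0

    def increment(self):
        tmp = self.value
        self.value = tmp + 1

class SafeCounter:
    """Fixed: the lock makes the update atomic."""
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.value += 1

def stress(counter, n_threads=50, per_thread=20):
    """Concurrent stress test in the spirit of the Hard task."""
    def work():
        for _ in range(per_thread):
            counter.increment()
    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

A sequential loop of `increment()` calls returns the right count for *both* classes, which is exactly the trap the Hard task sets.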
---
## 📊 Professional Grading Methodology
Our graders don't just check if the code works at the end. They score the **process**:
* **Sequential Correctness (40%)**: Does the fix pass the original unit tests?
* **Hidden Strength (30%)**: Does the fix survive a high-concurrency (1000-thread) stress test? (Hard task only).
* **Hypothesis Accuracy (20%)**: Did the agent correctly identify the bug? (NLP-based keyword matching against ground truth).
* **Efficiency Bonus (10%)**: Did the agent solve it within 5 attempts?
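Combining the four components is straightforward. The helper below is a sketch of the weighting only; the grader's actual NLP matching and stress harness are reduced to booleans here:

```python
def grade(passed_original: bool, passed_stress: bool,
          hypothesis_correct: bool, attempts: int) -> float:
    """Weighted 0-1 score; weights match the rubric above."""
    return round(
        0.40 * passed_original        # Sequential Correctness
        + 0.30 * passed_stress        # Hidden Strength (Hard task only)
        + 0.20 * hypothesis_correct   # Hypothesis Accuracy
        + 0.10 * (attempts <= 5),     # Efficiency Bonus
        2,
    )
```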
---
## ⚙️ Installation & Usage
### 📦 Local Setup
```bash
git clone https://huggingface.co/spaces/shashaank0707/AgentDebugger-env
cd AgentDebugger-env
pip install -e .
```
### 🟢 Running the Environment
```bash
# Start the FastAPI server
uvicorn env.server:app --host 0.0.0.0 --port 8000
```
### 🤖 Running an Agent (OpenEnv Baseline)
```bash
export API_BASE_URL="https://api.openai.com/v1"   # any OpenAI-compatible endpoint
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your_openai_key"                 # API key for the endpoint above
export ENV_BASE_URL="http://localhost:8000"       # where the environment server runs
python inference.py
```
---
## 🔌 OpenEnv API Compliance
AgentDebuggerEnv implements the full OpenEnv specification:
* `POST /reset`: Initialize a task (`{"task_id": "medium"}`).
* `POST /step`: Submit an `Action` (supports `submit_fix`, `query_context`, `give_up`).
* `GET /state`: Retrieve full episode history and current environment state.
* `GET /health`: Standard health check for automated uptime monitoring.
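A minimal client for the two POST endpoints might look like the sketch below. The `/reset` payload shape comes from the list above; the flat `{"action": ..., **payload}` shape for `/step` is an assumption, and `session` is injectable so the functions can be exercised without a running server:

```python
import requests

ENV_BASE_URL = "http://localhost:8000"  # assumes a locally running server

def reset(task_id: str, session=requests):
    """POST /reset with {"task_id": ...} to start an episode."""
    return session.post(f"{ENV_BASE_URL}/reset", json={"task_id": task_id})

def step(action_type: str, payload: dict, session=requests):
    """POST /step with an Action: 'submit_fix', 'query_context', or 'give_up'."""
    return session.post(f"{ENV_BASE_URL}/step",
                        json={"action": action_type, **payload})
```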
---
## 📄 Metadata & License
* **License**: [MIT](LICENSE)
* **Author**: shashaank
* **Hackathon**: Meta + PyTorch + HuggingFace OpenEnv 2024