---
title: AgentDebugger-Env
emoji: π
colorFrom: yellow
colorTo: green
sdk: docker
app_port: 8000
pinned: true
license: mit
---
# AgentDebuggerEnv
> **Benchmarking Agentic Reasoning through the Iterative Hypothesis-Test-Fix Loop.**
**AgentDebuggerEnv** is an OpenEnv-compliant benchmarking environment designed for the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**. Unlike static code-repair benchmarks that only measure the final output, AgentDebuggerEnv evaluates the *cognitive trajectory* of an agent: how it forms hypotheses, interprets execution failures, and iterates toward a solution in a secure, live sandbox.
---
## Baseline Performance
Tested with **GPT-4o** using the standard `inference.py` script:
- **Easy (0.85)**: Solved in 1-2 attempts; clear signal from error output.
- **Medium (0.50)**: Solved in ~4 attempts; agents must resist a red-herring authentication error.
- **Hard (0.18)**: Rarely solved; agents must proactively design concurrent tests to surface the hidden race condition.
- **Mean Score: 0.51**
*Measurements taken over multiple runs to account for LLM variance. See `openenv.yaml` for full metadata.*
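
The reported mean appears consistent with an unweighted average of the three per-task scores. That reading is an assumption, since the aggregation rule is not stated here; a quick check:

```python
from statistics import mean

# Per-task baseline scores as listed above.
scores = {"easy": 0.85, "medium": 0.50, "hard": 0.18}

# Assumption: the mean score is the plain arithmetic mean across tasks.
mean_score = round(mean(scores.values()), 2)
print(mean_score)
```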
---
## The Core Philosophy
Traditional benchmarks (like HumanEval or MBPP) are "one-shot": the model sees a prompt and writes code. Real-world engineering is **iterative**.
AgentDebuggerEnv forces agents to operate in a **live feedback loop**:
1. **Observe**: Analyze existing buggy code and initial test failures.
2. **Hypothesize**: Explicitly state a theory about the root cause (scored for accuracy).
3. **Act**: Submit a surgical fix or query the environment for more context.
4. **Verify**: Observe real-time `stdout/stderr` from a sandboxed test suite execution.
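
The four steps above map naturally onto a small driver loop. The sketch below is illustrative only: `FakeEnv`, `propose_fix`, and the field names (`tests_passed`, `hypothesis`, `code`, `action_type`) are assumptions made for this example, not the environment's actual schema. Only the `submit_fix` action name comes from this README.

```python
from dataclasses import dataclass, field


@dataclass
class FakeEnv:
    """Hypothetical stand-in for the real environment (not the actual API)."""
    correct_fix: str
    history: list = field(default_factory=list)

    def reset(self) -> dict:
        # Observe: the agent starts from the initial failing test output.
        return {"output": "1 passed, 1 failed", "tests_passed": False}

    def step(self, action: dict) -> dict:
        # Verify: "run the tests" and report the result back to the agent.
        self.history.append(action)
        ok = action["code"] == self.correct_fix
        return {"output": "2 passed" if ok else "1 passed, 1 failed",
                "tests_passed": ok}


def propose_fix(observation: dict, attempt: int) -> dict:
    """Hypothetical policy; a real agent would call an LLM here."""
    candidates = ["return n", "return n + 1"]
    return {
        "action_type": "submit_fix",
        "hypothesis": "off-by-one in the loop bound",  # Hypothesize
        "code": candidates[min(attempt, len(candidates) - 1)],  # Act
    }


def run_episode(env: FakeEnv, max_attempts: int = 5) -> int:
    """Return the number of attempts used, or -1 if the budget ran out."""
    observation = env.reset()
    for attempt in range(max_attempts):
        observation = env.step(propose_fix(observation, attempt))
        if observation["tests_passed"]:
            return attempt + 1
    return -1


attempts_used = run_episode(FakeEnv(correct_fix="return n + 1"))
```

The attempt budget of 5 mirrors the efficiency threshold in the grading section; whether the real environment enforces a hard cap is not stated here.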
---
## 🛠️ Technical Architecture
### 1. Robust Security Sandbox
Every submission is executed in a multi-layered isolated environment:
* **AST Filtering**: A deep Abstract Syntax Tree (AST) pass analyzes submitted code before execution, blocking dangerous imports (`os`, `sys`, `subprocess`, `socket`, etc.) and preventing the override of security-critical builtins.
* **Process Isolation**: Each submission runs in a separate subprocess, with CPU and memory limits enforced by the container runtime and a 15-second execution timeout. Any attempt to hang the environment results in immediate termination.
* **Thread Safety**: A specialized "Concurrency Sandbox" allows multi-threaded tests (essential for the Hard Task) while maintaining strict host-level security boundaries.
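
As a rough illustration of the AST filtering layer, the sketch below walks a parsed module and reports blocked imports and attempts to rebind builtins. It is not the project's actual filter: `BLOCKED_MODULES` contains only the modules named above, and `PROTECTED_BUILTINS` is an assumed example set.

```python
import ast

# Assumed blocklist: only the modules named in this README; the real list may be longer.
BLOCKED_MODULES = {"os", "sys", "subprocess", "socket"}
# Assumed examples of security-critical builtins; the real set is not documented here.
PROTECTED_BUILTINS = {"__import__", "eval", "exec", "open"}


def find_violations(source: str) -> list[str]:
    """Return human-readable reasons the submitted code should be rejected."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    violations.append(f"blocked import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED_MODULES:
                violations.append(f"blocked import: {node.module}")
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            if node.name in PROTECTED_BUILTINS:
                violations.append(f"overrides builtin: {node.name}")
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            if node.id in PROTECTED_BUILTINS:
                violations.append(f"overrides builtin: {node.id}")
    return violations
```

A static pass like this can be circumvented by dynamic tricks such as `getattr` lookups, so it only makes sense as one layer alongside process isolation.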
### 2. High-Fidelity Feedback
Instead of binary `Pass/Fail` bits, the environment returns the **raw execution stream**. This allows agents to:
* Read stack traces.
* See partial progress (e.g., "6 passed, 2 failed").
* Detect timeouts and resource exhaustion.
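
A minimal sketch of how a raw execution stream with timeout detection could be produced and consumed, using only the standard library. The function names and the returned dictionary shape are hypothetical, and the `N passed, N failed` summary format is assumed from the example above. CPU and memory limits are not modeled here.

```python
import re
import subprocess
import sys


def run_with_timeout(code: str, timeout_s: float = 15.0) -> dict:
    """Run code in a child Python process and return its raw output."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed once this expires
        )
    except subprocess.TimeoutExpired:
        return {"timed_out": True, "stdout": "", "stderr": "", "returncode": None}
    return {
        "timed_out": False,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "returncode": proc.returncode,
    }


def parse_summary(output: str) -> dict:
    """Extract counts from a summary such as '6 passed, 2 failed'."""
    counts = {"passed": 0, "failed": 0}
    for number, label in re.findall(r"(\d+) (passed|failed)", output):
        counts[label] = int(number)
    return counts
```

An agent can read `stderr` for stack traces, call `parse_summary` on `stdout` to track partial progress, and treat `timed_out` as a signal that its fix introduced a hang.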
---
## Task Suite & Reasoning Challenges
| Task | Difficulty | Reasoning Challenge | Why it's hard |
| :--- | :--- | :--- | :--- |
| **Easy** | 🟢 Easy | **Off-by-One** | Requires basic logic verification. The error message is high-signal. |
| **Medium** | 🟡 Medium | **Red Herring** | The symptom (MD5 hashing error) manifests far from the root cause. Agent must trace data flow backward. |
| **Hard** | 🔴 Hard | **Race Condition** | **Invisible to sequential tests.** The agent must reason that passing tests do *not* mean the code is correct, and design a concurrent stress test. |
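
To make the Hard task's reasoning challenge concrete, the sketch below uses a hypothetical counter (not the actual task code) whose read-modify-write is correct when called sequentially but can lose updates under threads. `stress` is an assumed helper showing the kind of concurrent test an agent would need to write.

```python
import threading


class UnsafeCounter:
    """Read-modify-write without a lock: fine sequentially, racy under threads."""

    def __init__(self) -> None:
        self.value = 0

    def increment(self) -> None:
        current = self.value       # another thread may update value here
        self.value = current + 1   # ...and that update is then overwritten


class SafeCounter(UnsafeCounter):
    """Same logic, but the critical section is guarded by a lock."""

    def __init__(self) -> None:
        super().__init__()
        self._lock = threading.Lock()

    def increment(self) -> None:
        with self._lock:
            super().increment()


def stress(counter, threads: int = 8, increments: int = 10_000) -> int:
    """Hammer the counter from several threads and return the final value."""
    def worker() -> None:
        for _ in range(increments):
            counter.increment()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.value
```

With `SafeCounter`, `stress` always returns `threads * increments`. With `UnsafeCounter`, the result may fall short on some runs and not on others, which is why a single passing run is weak evidence of correctness.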
---
## Professional Grading Methodology
Our graders don't just check if the code works at the end. They score the **process**:
* **Sequential Correctness (40%)**: Does the fix pass the original unit tests?
* **Hidden Strength (30%)**: Does the fix survive a high-concurrency (1000-thread) stress test? (Hard task only).
* **Hypothesis Accuracy (20%)**: Did the agent correctly identify the bug? (NLP-based keyword matching against ground truth).
* **Efficiency Bonus (10%)**: Did the agent solve it within 5 attempts?
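
The weights above can be read as a simple weighted sum. The sketch below assumes each component is scored in [0, 1] and that hypothesis accuracy is the fraction of ground-truth keywords found; both are assumptions, and the names are hypothetical. How the 30% Hidden Strength share is treated on the Easy and Medium tasks is not stated here, so this sketch does not attempt to resolve it.

```python
# Weights as listed above.
WEIGHTS = {
    "sequential_correctness": 0.40,
    "hidden_strength": 0.30,
    "hypothesis_accuracy": 0.20,
    "efficiency_bonus": 0.10,
}


def keyword_score(hypothesis: str, keywords: list[str]) -> float:
    """Fraction of ground-truth keywords mentioned in the hypothesis (assumed rule)."""
    text = hypothesis.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords) if keywords else 0.0


def total_score(components: dict[str, float]) -> float:
    """Weighted sum of component scores; missing components count as 0."""
    return sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
```

For example, a fix that passes the original tests with a half-correct hypothesis, and nothing else, would score 0.40 + 0.20 × 0.5 = 0.50 under these assumptions.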
---
## ⚙️ Installation & Usage
### 📦 Local Setup
```bash
git clone https://huggingface.co/spaces/shashaank0707/AgentDebugger-env
cd AgentDebugger-env
pip install -e .
```
### Running the Environment
```bash
# Start the FastAPI server
uvicorn env.server:app --host 0.0.0.0 --port 8000
```
### 🤖 Running an Agent (OpenEnv Baseline)
```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your_openai_key"
export ENV_BASE_URL="http://localhost:8000"
python inference.py
```
---
### Environment Variables
| Variable | Description | Standard Fallback |
| :--- | :--- | :--- |
| `API_BASE_URL` | LLM API endpoint | `https://api.openai.com/v1` |
| `MODEL_NAME` | Model to evaluate | `gpt-4o` |
| `HF_TOKEN` | Auth token (or OpenAI key) | (none) |
| `OPENAI_API_KEY` | Alternative auth token | (none) |
| `ENV_BASE_URL` | Address of the FastAPI server | `http://localhost:8000` |
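
A script could resolve these variables roughly as follows. This is a sketch rather than the code in `inference.py`: the `load_config` name is hypothetical, and the precedence of `HF_TOKEN` over `OPENAI_API_KEY` is an assumption based on the table's wording.

```python
import os


def load_config(env=None) -> dict:
    """Resolve settings from the environment, applying the fallbacks in the table."""
    env = os.environ if env is None else env
    return {
        "api_base_url": env.get("API_BASE_URL", "https://api.openai.com/v1"),
        "model_name": env.get("MODEL_NAME", "gpt-4o"),
        # Assumed precedence: HF_TOKEN first, then OPENAI_API_KEY; None if neither is set.
        "api_key": env.get("HF_TOKEN") or env.get("OPENAI_API_KEY"),
        "env_base_url": env.get("ENV_BASE_URL", "http://localhost:8000"),
    }
```

Passing a plain dictionary instead of `os.environ` makes the function easy to test without touching the real process environment.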
---
## OpenEnv API Compliance
AgentDebuggerEnv implements the full OpenEnv specification:
* `POST /reset`: Initialize a task (`{"task_id": "medium"}`).
* `POST /step`: Submit an `Action` (supports `submit_fix`, `query_context`, `give_up`).
* `GET /state`: Retrieve full episode history and current environment state.
* `GET /health`: Standard health check for automated uptime monitoring.
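
The endpoints above can be exercised with the standard library alone. In the sketch below, `build_request` and `send` are hypothetical helpers, and the `/step` body is a guess at the `Action` schema for illustration; only the `/reset` payload and the action names come from this README.

```python
import json
import urllib.request


def build_request(base_url: str, method: str, path: str, payload=None):
    """Build (but do not send) a JSON request for one of the endpoints above."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def send(request: urllib.request.Request, timeout_s: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


BASE_URL = "http://localhost:8000"
reset_request = build_request(BASE_URL, "POST", "/reset", {"task_id": "medium"})
# Assumed Action fields; check the server's schema before relying on them.
step_request = build_request(
    BASE_URL, "POST", "/step",
    {"action_type": "submit_fix", "hypothesis": "...", "code": "..."},
)
state_request = build_request(BASE_URL, "GET", "/state")
```

With the server running, `send(reset_request)` starts an episode; the requests are built separately from sending so they can be inspected or logged first.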
---
## Metadata & License
* **License**: [MIT](LICENSE)
* **Author**: Shashaank (GitHub: @shasshaank, HF: @shashaank0707)
* **Hackathon**: Meta + PyTorch + HuggingFace OpenEnv 2024
---
### ✅ Submission Integrity
- **Commit SHA**: `159a5faf82fc1ab3709f9674becf9a3ec55cf562`
- **Last Verified Sync**: 2026-04-08
- **Platform Match**: GitHub and HF Space are identical at this HEAD.