---
title: AgentDebugger Env
emoji: 🐛
colorFrom: yellow
colorTo: green
sdk: docker
app_port: 8000
pinned: false
---
# AgentDebuggerEnv 🐛
> **Benchmarking Agentic Reasoning through the Iterative Hypothesis-Test-Fix Loop.**
An OpenEnv-compliant environment designed for the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**. Unlike static code-repair benchmarks, **AgentDebuggerEnv** focuses on the *trajectory* of an agent's reasoning: measuring how effectively an agent forms hypotheses, observes failures, and iterates toward a solution in a live execution sandbox.
---
## 🔍 Overview
Debugging is one of the highest-leverage cognitive tasks in software engineering. Modern LLM agents often struggle with:
- **Red Herrings**: Following misleading error messages to the wrong function.
- **Stagnant Iteration**: Repeating the same failed fix attempt instead of updating their hypothesis based on new output.
- **Concurrency Failures**: Failing to detect or fix non-deterministic bugs (race conditions).
**AgentDebuggerEnv** makes these failures measurable and scorable. The environment provides a live, sandboxed feedback loop where agents submit complete code fixes and receive real-time execution results.
---
## 🛠️ Core Mechanics: The Feedback Loop
The environment follows the standard OpenEnv interface (`reset`, `step`, `state`) but enforces a strict **Hypothesis-Test-Fix** cycle (a client-side sketch follows this list):
1. **Hypothesis**: The agent must state its theory about the bug before every fix attempt.
2. **Execution**: The submitted code is executed in a secure sandbox with hard timeouts.
3. **Observation**: The agent receives the actual `stdout` + `stderr` from the test suite, not just a binary pass/fail.
4. **Reward**: A dense reward signal is provided at every step, scaling with test progress and hypothesis accuracy.
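A minimal client loop against the environment's HTTP server. The `/reset` and `/step` routes and the exact JSON field names are assumptions in the spirit of the OpenEnv spec; the authoritative schema lives in `env/server.py`, so treat this as an illustrative sketch rather than the definitive API:

```python
# Illustrative agent loop -- endpoint paths ("/reset", "/step") and JSON field
# names are assumptions based on the OpenEnv spec; check env/server.py.
import requests

BASE_URL = "http://localhost:8000"

obs = requests.post(f"{BASE_URL}/reset", json={"task": "easy"}).json()

for attempt in range(10):
    # 1. Hypothesis: state a theory about the bug before fixing it.
    hypothesis = "The loop bound skips the final element (off-by-one)."
    fixed_code = obs["buggy_code"].replace("range(n - 1)", "range(n)")

    # 2-3. Execution + Observation: submit the fix, read the real stdout/stderr.
    result = requests.post(
        f"{BASE_URL}/step",
        json={"action": "submit_fix", "hypothesis": hypothesis, "code": fixed_code},
    ).json()
    print(result["reward"], result["observation"]["current_error_output"])

    # 4. Reward: dense signal at every step; stop once the suite passes.
    if result.get("done"):
        break
    obs = result["observation"]
```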
---
## 📋 Tasks & Difficulty
The environment includes three standardized tasks designed to test different facets of agentic reasoning:
| Task | Bug Type | Core Challenge |
| :--- | :--- | :--- |
| **Easy** | Off-by-One | A simple logic bug with an explicit, high-signal error message. |
| **Medium** | Red Herring | Interdependent functions where the error manifests far from the root cause. |
| **Hard** | Race Condition | A concurrency bug invisible to sequential tests; the agent must design a concurrent test to surface it (illustrated below). |
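For intuition, the **Hard** task's failure mode looks like the following (an illustrative snippet, not the actual task code): an unsynchronized counter passes a sequential test but loses increments once threads interleave.

```python
import threading

counter = 0

def increment(n: int) -> None:
    global counter
    for _ in range(n):
        counter += 1  # read-modify-write: not atomic across threads

# A sequential test passes cleanly...
increment(100_000)
assert counter == 100_000

# ...but a concurrent test can surface the race: two threads may read the
# same value and each write back value + 1, losing an increment.
counter = 0
threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # often less than 400_000 when the schedule interleaves
```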
---
## ⚙️ How It Works (Spec Compliance)
### Data Models
- **Observation**: Includes `buggy_code`, `test_suite`, `previous_attempts` (full history), and `current_error_output`.
- **Action**: Supports `submit_fix` (requires `hypothesis`), `query_context` (for deeper code analysis), and `give_up`.
- **Reward**: A multi-component reward combining `test_progress`, `hypothesis_match`, and `efficiency_bonus` (sketched below).
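A rough sketch of these models as dataclasses. The field names come from the list above, but the types and defaults are assumptions; the authoritative definitions live in the repo:

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

@dataclass
class Observation:
    buggy_code: str
    test_suite: str
    previous_attempts: list[dict] = field(default_factory=list)  # full history
    current_error_output: str = ""

@dataclass
class Action:
    kind: Literal["submit_fix", "query_context", "give_up"]
    hypothesis: Optional[str] = None  # required for submit_fix
    code: Optional[str] = None        # the complete fixed file

@dataclass
class Reward:
    test_progress: float      # fraction of the suite now passing
    hypothesis_match: float   # how well the stated theory fits the real bug
    efficiency_bonus: float   # rewards solving in fewer attempts
```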
### Infrastructure
- **FastAPI**: Exposes standard endpoints on port 8000.
- **Docker**: Fully containerized and ready for HuggingFace Spaces.
- **Security**: Robust AST-based filtering of submitted code to block sandbox-escape attempts (illustrative sketch below).
- **Baseline Script**: Includes a reference `inference.py` script that uses the OpenAI client for benchmark evaluation.
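A minimal sketch of what AST-based filtering can look like, assuming a simple module denylist; the filter shipped in this repo may differ:

```python
import ast

BLOCKED_MODULES = {"os", "subprocess", "socket", "shutil"}  # assumed denylist

def is_submission_safe(source: str) -> bool:
    """Reject code that imports denylisted modules (illustrative only)."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(a.name.split(".")[0] in BLOCKED_MODULES for a in node.names):
                return False
        if isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED_MODULES:
                return False
    return True
```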
---
## 📦 Quick Start
### Installation
```bash
git clone https://huggingface.co/spaces/shashaank/agentdebugger-env
cd agentdebugger-env
pip install -r requirements.txt
```
### Running Locally
```bash
# Start the environment server (leave this running)
uvicorn env.server:app --host 0.0.0.0 --port 8000

# In a second terminal, run the baseline inference (requires an API key)
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your_key_here"
python inference.py
```
---
## 📊 Benchmarking Results (GPT-4o Baseline)
| Task | Grader Score | Solved |
| :--- | :--- | :--- |
| Easy | 0.85 | Yes |
| Medium | 0.50 | Mixed |
| Hard | 0.18 | No |
---
## 📜 License
MIT License. Created by **shashaank** for the Meta / PyTorch / HuggingFace OpenEnv Hackathon.