---
title: AgentDebugger Env 🐛
emoji: 📈
colorFrom: yellow
colorTo: green
sdk: docker
app_port: 8000
pinned: false
---
# AgentDebuggerEnv 🐛
> **Benchmarking Agentic Reasoning through the Iterative Hypothesis-Test-Fix Loop.**
An OpenEnv-compliant environment designed for the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**. Unlike static code-repair benchmarks, **AgentDebuggerEnv** focuses on the *trajectory* of an agent's reasoning: measuring how effectively an agent forms hypotheses, observes failures, and iterates toward a solution in a live execution sandbox.
---
## 🚀 Overview
Debugging is one of the highest-leverage cognitive tasks in software engineering. Modern LLM agents often struggle with:
- **Red Herrings**: Following misleading error messages to the wrong function.
- **Stagnant Iteration**: Repeating the same failed fix attempt instead of updating their hypothesis based on new output.
- **Concurrency Failures**: Failing to detect or fix non-deterministic bugs (race conditions).
**AgentDebuggerEnv** makes these failures measurable and scorable. The environment provides a live, sandboxed feedback loop where agents submit complete code fixes and receive real-time execution results.
---
## 🛠️ Core Mechanics: The Feedback Loop
The environment follows the standard OpenEnv interface (`reset`, `step`, `state`) but enforces a strict **Hypothesis-Test-Fix** cycle (a minimal agent loop is sketched after the list):
1. **Hypothesis**: The agent must state its theory about the bug before every fix attempt.
2. **Execution**: The submitted code is executed in a secure sandbox with hard timeouts.
3. **Observation**: The agent receives the actual `stdout` + `stderr` from the test suite, not just a binary pass/fail.
4. **Reward**: A dense reward signal is provided at every step, scaling with test progress and hypothesis accuracy.
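In code, the cycle looks roughly like the sketch below. It assumes an OpenEnv-style Python client; the import path `env.client`, the `SubmitFix` action class, and the `my_agent` object are hypothetical stand-ins for illustration, not the repository's actual API.
```python
# Hypothetical client import; see the repository for the real entry point.
from env.client import AgentDebuggerEnv, SubmitFix

env = AgentDebuggerEnv(base_url="http://localhost:8000")
obs = env.reset(task="medium")  # observation carries buggy_code, test_suite, ...

done = False
while not done:
    # 1. Hypothesis: diagnose from the latest stdout/stderr.
    hypothesis = my_agent.diagnose(obs.buggy_code, obs.current_error_output)
    # 2. Fix: propose a complete corrected version of the file.
    fixed_code = my_agent.rewrite(obs.buggy_code, hypothesis)
    # 3. Execute + observe: the sandbox runs the tests and returns real output.
    obs, reward, done = env.step(SubmitFix(hypothesis=hypothesis, code=fixed_code))
    print("reward:", reward)
```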
---
## 📝 Tasks & Difficulty
The environment includes three standardized tasks designed to test different facets of agentic reasoning:
| Task | Core Challenge |
| :--- | :--- |
| **Easy** | **Off-by-One**: A simple logic bug with an explicit, high-signal error message. |
| **Medium** | **Red Herring**: Interdependent functions where the error manifests far from the root cause. |
| **Hard** | **Race Condition**: A concurrency bug that is invisible to sequential tests; the agent must design a concurrent test to surface it (illustrated below). |
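For a concrete sense of the Hard task's failure mode, here is the classic shape of a race condition. The snippet is illustrative only; it is not the actual task code.
```python
import threading

counter = 0  # shared state with no lock

def worker() -> None:
    global counter
    for _ in range(100_000):
        counter += 1  # read-modify-write is not atomic; updates can interleave

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# A sequential test never trips this; under real concurrency, lost updates
# make the assertion fail only intermittently.
assert counter == 400_000, f"lost updates: counter={counter}"
```
Because the failure is non-deterministic, a single passing run proves nothing, which is exactly why the agent must reason about interleavings instead of chasing an error message.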
---
## ⚙️ How It Works (Spec Compliance)
### Data Models
- **Observation**: Includes `buggy_code`, `test_suite`, `previous_attempts` (full history), and `current_error_output`.
- **Action**: Supports `submit_fix` (requires `hypothesis`), `query_context` (for deeper code analysis), and `give_up`.
- **Reward**: A multi-component reward combining `test_progress`, `hypothesis_match`, and `efficiency_bonus`. Illustrative shapes for these models are sketched below.
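As a rough sketch of those shapes (field names follow the list above; the exact types, and the unweighted sum in `total`, are assumptions rather than the repository's definitions):
```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    buggy_code: str                 # current (broken) source under test
    test_suite: str                 # tests the fix must pass
    previous_attempts: list = field(default_factory=list)  # full fix history
    current_error_output: str = ""  # stdout + stderr from the last run

@dataclass
class Reward:
    test_progress: float     # fraction of tests now passing
    hypothesis_match: float  # how well the stated hypothesis fits the real bug
    efficiency_bonus: float  # rewards solving in fewer steps

    @property
    def total(self) -> float:
        # Unweighted sum, purely for illustration.
        return self.test_progress + self.hypothesis_match + self.efficiency_bonus
```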
### Infrastructure
- **FastAPI**: Exposes standard endpoints on port 8000.
- **Docker**: Fully containerized and ready for HuggingFace Spaces.
- **Security**: Robust AST-based filtering of submitted code to block sandbox escapes (a simplified sketch follows this list).
- **Baseline Script**: Includes a reference `inference.py` script that uses the OpenAI client for benchmark evaluation.
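The deny-list check below is a deliberately simplified sketch of what AST-based filtering can look like; the banned module names and the checks themselves are illustrative, and the environment's actual filter lives in the repository.
```python
import ast

BANNED_IMPORTS = {"os", "subprocess", "socket"}  # illustrative deny-list

def is_safe(source: str) -> bool:
    """Reject submissions whose imports hit the deny-list."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # unparseable code never reaches the sandbox
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in BANNED_IMPORTS for alias in node.names):
                return False
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BANNED_IMPORTS:
                return False
    return True
```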
---
## 📦 Quick Start
### Installation
```bash
git clone https://huggingface.co/spaces/shashaank/agentdebugger-env
cd agentdebugger-env
pip install -r requirements.txt
```
### Running Locally
```bash
# Start the environment server
uvicorn env.server:app --host 0.0.0.0 --port 8000
# Run the baseline inference (requires API key)
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your_key_here"
python inference.py
```
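To talk to the server directly, a raw HTTP call looks roughly like this. It assumes the conventional OpenEnv routes `/reset` and `/step`; the exact paths and payload fields are defined in `env/server.py`, so treat the JSON keys here as assumptions.
```python
import requests

BASE = "http://localhost:8000"

# Start an episode (route and response schema assumed; check env/server.py).
obs = requests.post(f"{BASE}/reset").json()
print(obs["buggy_code"])

# Submit a fix with the mandatory hypothesis.
result = requests.post(
    f"{BASE}/step",
    json={
        "action_type": "submit_fix",  # or "query_context" / "give_up"
        "hypothesis": "The loop bound is off by one.",
        "code": "def fixed(): ...",   # the complete corrected file
    },
).json()
print(result)
```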
---
## 📊 Benchmarking Results (GPT-4o Baseline)
| Task | Grader Score | Solved |
| :--- | :--- | :--- |
| Easy | 0.85 | Yes |
| Medium | 0.50 | Mixed |
| Hard | 0.18 | No |
---
## 📜 License
MIT License. Created by **shashaank** for the Meta / PyTorch / HuggingFace OpenEnv Hackathon.