---
title: AgentDebugger Env
emoji: 🐛
colorFrom: yellow
colorTo: green
sdk: docker
app_port: 8000
pinned: false
---
# AgentDebuggerEnv
Benchmarking Agentic Reasoning through the Iterative Hypothesis-Test-Fix Loop.
An OpenEnv-compliant environment designed for the Meta + PyTorch + HuggingFace OpenEnv Hackathon. Unlike static code-repair benchmarks, AgentDebuggerEnv focuses on the trajectory of an agent's reasoning: measuring how effectively an agent forms hypotheses, observes failures, and iterates toward a solution in a live execution sandbox.
## Overview
Debugging is one of the highest-leverage cognitive tasks in software engineering. Modern LLM agents often struggle with:
- Red Herrings: Following misleading error messages to the wrong function.
- Stagnant Iteration: Repeating the same failed fix attempt instead of updating their hypothesis based on new output.
- Concurrency Failures: Failing to detect or fix non-deterministic bugs (race conditions).
AgentDebuggerEnv makes these failures measurable and scorable. The environment provides a live, sandboxed feedback loop where agents submit complete code fixes and receive real-time execution results.
## Core Mechanics: The Feedback Loop
The environment follows the standard OpenEnv interface (`reset`, `step`, `state`) but enforces a strict Hypothesis-Test-Fix cycle:
- Hypothesis: The agent must state its theory about the bug before every fix attempt.
- Execution: The submitted code is executed in a secure sandbox with hard timeouts.
- Observation: The agent receives the actual `stdout` and `stderr` from the test suite, not just a binary pass/fail.
- Reward: A dense reward signal is provided at every step, scaling with test progress and hypothesis accuracy.
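For concreteness, here is a minimal sketch of one Hypothesis-Test-Fix iteration over the HTTP interface. The endpoint paths (`/reset`, `/step`) and the JSON field names below are illustrative assumptions based on the data models described later, not a verbatim contract:

```python
import requests

BASE_URL = "http://localhost:8000"  # environment server started via uvicorn

# Reset the environment to get the initial observation (assumed endpoint/shape).
obs = requests.post(f"{BASE_URL}/reset", json={"task": "easy"}).json()
print(obs["buggy_code"])             # the code the agent must repair
print(obs["current_error_output"])   # stdout/stderr from the failing test run

# One Hypothesis-Test-Fix step: the hypothesis is mandatory alongside the fix.
action = {
    "action_type": "submit_fix",                    # assumed action name
    "hypothesis": "The loop bound is off by one.",  # stated *before* the fix
    # Hypothetical one-line fix, purely for illustration:
    "code": obs["buggy_code"].replace("range(n - 1)", "range(n)"),
}
result = requests.post(f"{BASE_URL}/step", json=action).json()
print(result["reward"], result["done"])
print(result["observation"]["current_error_output"])
```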
## Tasks & Difficulty
The environment includes three standardized tasks designed to test different facets of agentic reasoning:
| Task | Difficulty | Core Challenge |
|---|---|---|
| Easy | Easy | Off-by-One: Simple logic bug with an explicit, high-signal error message. |
| Medium | Medium | Red Herring: Interdependent functions where the error manifests far from the root cause. |
| Hard | Hard | Race Condition: A concurrency bug that is invisible to sequential tests. Agent must design a concurrent test to surface it. |
## How It Works (Spec Compliance)
### Data Models
- Observation: Includes `buggy_code`, `test_suite`, `previous_attempts` (the full history), and `current_error_output`.
- Action: Supports `submit_fix` (requires a `hypothesis`), `query_context` (for deeper code analysis), and `give_up`.
- Reward: A multi-component reward including `test_progress`, `hypothesis_match`, and an `efficiency_bonus`.
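A rough sketch of how these models might look as Python dataclasses; the exact field types and defaults are assumptions inferred from the descriptions above:

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

@dataclass
class Observation:
    buggy_code: str
    test_suite: str
    previous_attempts: list[dict] = field(default_factory=list)  # full fix history
    current_error_output: str = ""  # stdout/stderr from the last test run

@dataclass
class Action:
    action_type: Literal["submit_fix", "query_context", "give_up"]
    hypothesis: Optional[str] = None  # required when action_type == "submit_fix"
    code: Optional[str] = None        # the complete fixed file, not a diff

@dataclass
class Reward:
    test_progress: float      # fraction of tests now passing
    hypothesis_match: float   # how well the stated hypothesis fits the bug
    efficiency_bonus: float   # rewards converging in fewer steps
```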
### Infrastructure
- FastAPI: Exposes standard endpoints on port 8000.
- Docker: Fully containerized and ready for HuggingFace Spaces.
- Security: Robust AST-based filtering prevents submitted code from escaping the sandbox (see the sketch after this list).
- Baseline Script: Includes a reference `inference.py` script that uses the OpenAI client for benchmark evaluation.
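To illustrate the idea (not the environment's exact implementation), an AST-based filter parses the submission and rejects dangerous nodes before anything runs. The denylist below is a hypothetical example:

```python
import ast

# Hypothetical denylist; the actual environment's policy may differ.
BLOCKED_MODULES = {"os", "sys", "subprocess", "socket", "shutil"}

def is_code_safe(source: str) -> bool:
    """Reject imports of denylisted modules and calls to eval/exec
    by walking the AST before the code is ever executed."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # unparseable code is never executed
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in BLOCKED_MODULES for alias in node.names):
                return False
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED_MODULES:
                return False
        elif isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name) and node.func.id in {"eval", "exec"}:
                return False
    return True
```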
## Quick Start
### Installation
```bash
git clone https://huggingface.co/spaces/shashaank/agentdebugger-env
cd agentdebugger-env
pip install -r requirements.txt
```
### Running Locally
```bash
# Start the environment server
uvicorn env.server:app --host 0.0.0.0 --port 8000

# Run the baseline inference (requires an API key)
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your_key_here"
python inference.py
```
## Benchmarking Results (GPT-4o Baseline)
| Task | Grader Score | Solved |
|---|---|---|
| Easy | 0.85 | Yes |
| Medium | 0.50 | Mixed |
| Hard | 0.18 | No |
## License
MIT License. Created by shashaank for the Meta / PyTorch / HuggingFace OpenEnv Hackathon.