---
title: AgentDebugger Env 🐛
emoji: 📈
colorFrom: yellow
colorTo: green
sdk: docker
app_port: 8000
pinned: false
---
# AgentDebuggerEnv 🐛
> **Benchmarking Agentic Reasoning through the Iterative Hypothesis-Test-Fix Loop.**
An OpenEnv-compliant environment designed for the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**. Unlike static code-repair benchmarks, **AgentDebuggerEnv** focuses on the *trajectory* of an agent's reasoning: measuring how effectively an agent forms hypotheses, observes failures, and iterates toward a solution in a live execution sandbox.
---
## 🚀 Overview
Debugging is one of the highest-leverage cognitive tasks in software engineering. Modern LLM agents often struggle with:
- **Red Herrings**: Following misleading error messages to the wrong function.
- **Stagnant Iteration**: Repeating the same failed fix attempt instead of updating their hypothesis based on new output.
- **Concurrency Failures**: Failing to detect or fix non-deterministic bugs (race conditions).
**AgentDebuggerEnv** makes these failures measurable and scorable. The environment provides a live, sandboxed feedback loop where agents submit complete code fixes and receive real-time execution results.
---
## 🛠️ Core Mechanics: The Feedback Loop
The environment follows the standard OpenEnv interface (`reset`, `step`, `state`) but enforces a strict **Hypothesis-Test-Fix** cycle (a minimal agent loop is sketched after the list):
1. **Hypothesis**: The agent must state its theory about the bug before every fix attempt.
2. **Execution**: The submitted code is executed in a secure sandbox with hard timeouts.
3. **Observation**: The agent receives the actual `stdout` + `stderr` from the test suite, not just a binary pass/fail.
4. **Reward**: A dense reward signal is provided at every step, scaling with test progress and hypothesis accuracy.
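In code, the cycle looks roughly like the sketch below. It assumes an OpenEnv-style Python client; the import path `env.client`, the `SubmitFix` action class, and the `my_agent` object are hypothetical stand-ins for illustration, not the repository's actual API.
```python
# Hypothetical client import; see the repository for the real entry point.
from env.client import AgentDebuggerEnv, SubmitFix

env = AgentDebuggerEnv(base_url="http://localhost:8000")
obs = env.reset(task="medium")  # observation carries buggy_code, test_suite, ...

done = False
while not done:
    # 1. Hypothesis: diagnose from the latest stdout/stderr.
    hypothesis = my_agent.diagnose(obs.buggy_code, obs.current_error_output)
    # 2. Fix: propose a complete corrected version of the file.
    fixed_code = my_agent.rewrite(obs.buggy_code, hypothesis)
    # 3. Execute + observe: the sandbox runs the tests and returns real output.
    obs, reward, done = env.step(SubmitFix(hypothesis=hypothesis, code=fixed_code))
    print("reward:", reward)
```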
---
## 📝 Tasks & Difficulty
The environment includes three standardized tasks designed to test different facets of agentic reasoning:
| Task | Core Challenge |
| :--- | :--- |
| **Easy** | **Off-by-One**: A simple logic bug with an explicit, high-signal error message. |
| **Medium** | **Red Herring**: Interdependent functions where the error manifests far from the root cause. |
| **Hard** | **Race Condition**: A concurrency bug that is invisible to sequential tests; the agent must design a concurrent test to surface it (illustrated below). |
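For a concrete sense of the Hard task's failure mode, here is the classic shape of a race condition. The snippet is illustrative only; it is not the actual task code.
```python
import threading

counter = 0  # shared state with no lock

def worker() -> None:
    global counter
    for _ in range(100_000):
        counter += 1  # read-modify-write is not atomic; updates can interleave

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# A sequential test never trips this; under real concurrency, lost updates
# make the assertion fail only intermittently.
assert counter == 400_000, f"lost updates: counter={counter}"
```
Because the failure is non-deterministic, a single passing run proves nothing, which is exactly why the agent must reason about interleavings instead of chasing an error message.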
---
## ⚙️ How It Works (Spec Compliance)
### Data Models
- **Observation**: Includes `buggy_code`, `test_suite`, `previous_attempts` (full history), and `current_error_output`.
- **Action**: Supports `submit_fix` (requires `hypothesis`), `query_context` (for deeper code analysis), and `give_up`.
- **Reward**: A multi-component reward combining `test_progress`, `hypothesis_match`, and `efficiency_bonus`. Illustrative shapes for these models are sketched below.
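As a rough sketch of those shapes (field names follow the list above; the exact types, and the unweighted sum in `total`, are assumptions rather than the repository's definitions):
```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    buggy_code: str                 # current (broken) source under test
    test_suite: str                 # tests the fix must pass
    previous_attempts: list = field(default_factory=list)  # full fix history
    current_error_output: str = ""  # stdout + stderr from the last run

@dataclass
class Reward:
    test_progress: float     # fraction of tests now passing
    hypothesis_match: float  # how well the stated hypothesis fits the real bug
    efficiency_bonus: float  # rewards solving in fewer steps

    @property
    def total(self) -> float:
        # Unweighted sum, purely for illustration.
        return self.test_progress + self.hypothesis_match + self.efficiency_bonus
```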
### Infrastructure
- **FastAPI**: Exposes standard endpoints on port 8000.
- **Docker**: Fully containerized and ready for HuggingFace Spaces.
- **Security**: Robust AST-based filtering of submitted code to block sandbox escapes (a simplified sketch follows this list).
- **Baseline Script**: Includes a reference `inference.py` script that uses the OpenAI client for benchmark evaluation.
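The deny-list check below is a deliberately simplified sketch of what AST-based filtering can look like; the banned module names and the checks themselves are illustrative, and the environment's actual filter lives in the repository.
```python
import ast

BANNED_IMPORTS = {"os", "subprocess", "socket"}  # illustrative deny-list

def is_safe(source: str) -> bool:
    """Reject submissions whose imports hit the deny-list."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # unparseable code never reaches the sandbox
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in BANNED_IMPORTS for alias in node.names):
                return False
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BANNED_IMPORTS:
                return False
    return True
```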
---
## 📦 Quick Start
### Installation
```bash
git clone https://huggingface.co/spaces/shashaank/agentdebugger-env
cd agentdebugger-env
pip install -r requirements.txt
```
### Running Locally
```bash
# Start the environment server
uvicorn env.server:app --host 0.0.0.0 --port 8000
# Run the baseline inference (requires API key)
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your_key_here"
python inference.py
```
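To talk to the server directly, a raw HTTP call looks roughly like this. It assumes the conventional OpenEnv routes `/reset` and `/step`; the exact paths and payload fields are defined in `env/server.py`, so treat the JSON keys here as assumptions.
```python
import requests

BASE = "http://localhost:8000"

# Start an episode (route and response schema assumed; check env/server.py).
obs = requests.post(f"{BASE}/reset").json()
print(obs["buggy_code"])

# Submit a fix with the mandatory hypothesis.
result = requests.post(
    f"{BASE}/step",
    json={
        "action_type": "submit_fix",  # or "query_context" / "give_up"
        "hypothesis": "The loop bound is off by one.",
        "code": "def fixed(): ...",   # the complete corrected file
    },
).json()
print(result)
```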
---
## 📊 Benchmarking Results (GPT-4o Baseline)
| Task | Grader Score | Solved |
| :--- | :--- | :--- |
| Easy | 0.85 | Yes |
| Medium | 0.50 | Mixed |
| Hard | 0.18 | No |
---
## 📜 License
MIT License. Created by **shashaank** for the Meta / PyTorch / HuggingFace OpenEnv Hackathon.