	# AgentDebuggerEnv 🐛

	> A live, iterative debugging environment for benchmarking agentic reasoning in AI systems.
	> Submitted to the Meta + PyTorch + HuggingFace OpenEnv Hackathon.

	[![HuggingFace Space](https://img.shields.io/badge/🤗%20HuggingFace-Space%20Live-yellow)](https://huggingface.co/spaces/shashaank0707/AgentDebugger-env)
	[![OpenEnv Compliant](https://img.shields.io/badge/OpenEnv-Compliant-blue)](https://huggingface.co/spaces/shashaank0707/AgentDebugger-env)
	[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
	[![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-blue)](https://www.python.org/)
	[![FastAPI](https://img.shields.io/badge/FastAPI-0.110-009688)](https://fastapi.tiangolo.com/)

	---

	## The Problem with Existing Code Benchmarks

	Benchmarks like HumanEval, MBPP, and even SWE-bench share a fundamental limitation: they are one-shot evaluations. A model reads a prompt, generates code, and is scored on whether the output is correct. This measures code generation ability — not debugging ability.

	Real software engineering is not one-shot. It is iterative. A developer:

	1. Reads failing tests and error output
	2. Forms a hypothesis about the root cause
	3. Submits a fix
	4. Reads the new error output
	5. Updates their hypothesis
	6. Repeats — sometimes many times

	No existing benchmark measures this loop. AgentDebuggerEnv does.

	---

	## What Makes This Different from SWE-bench

	SWE-bench gives an agent a static GitHub issue and measures only the correctness of the final patch. AgentDebuggerEnv differs along five dimensions:

	| Dimension | SWE-bench | AgentDebuggerEnv |
	|---|---|---|
	| Evaluation target | Final patch quality | Full reasoning trajectory |
	| Feedback | None (single shot) | Real `stdout/stderr` after every fix attempt |
	| Reward signal | Binary (pass/fail) | Dense: every step is scored |
	| What's measured | Code generation | Hypothesis formation + iterative reasoning |
	| Hard task | Applies existing patch | Must design a test to surface a hidden bug |

	The iterative feedback loop is the core mechanic. Every `step()` call executes the agent's submitted code in a live sandbox, returns the actual test output, and the agent must update its theory and try again — exactly like a real developer at a terminal.

	---

	## Environment Overview

	AgentDebuggerEnv is a fully OpenEnv-compliant environment exposing the standard three-method API:

	```
	reset(task_id) → initial Observation
	step(action) → Observation, Reward, done, info
	state() → current internal state dict
	```

	The environment is deployed as a containerized FastAPI server on HuggingFace Spaces, passes `openenv validate`, and includes a fully reproducible baseline inference script.
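The three calls compose into a simple episode loop. The following is a minimal client-side sketch, not the project's `inference.py`: `run_episode`, the injected `post` transport, the `choose_action` callback, and the assumed response shape (a JSON object with `observation`, `reward`, `done`, and `info` keys) are illustrative assumptions. Only the endpoint paths and the `step_reward` field name come from this README.

```python
from typing import Any, Callable, Dict

JsonDict = Dict[str, Any]


def run_episode(
    post: Callable[[str, JsonDict], JsonDict],
    choose_action: Callable[[JsonDict], JsonDict],
    task_id: str = "easy",
    step_cap: int = 50,
) -> float:
    """Drive one episode and return the summed step rewards.

    `post(path, body)` is a hypothetical transport that sends JSON to the
    server and returns the decoded JSON response.
    """
    reply = post("/reset", {"task_id": task_id})
    # Assumption: the observation may be nested under a key or returned bare.
    obs = reply.get("observation", reply)
    total = 0.0
    for _ in range(step_cap):
        action = choose_action(obs)
        reply = post("/step", action)
        obs = reply.get("observation", obs)
        total += float(reply.get("reward", {}).get("step_reward", 0.0))
        if reply.get("done", False):
            break
    return total
```

Injecting the transport keeps the loop testable without a running server; a real transport could wrap something like `requests.post(base_url + path, json=body).json()`.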

	Live Space: https://huggingface.co/spaces/shashaank0707/AgentDebugger-env

	---

	## Project Structure

	```
	AgentDebuggerEnv/
	├── inference.py              # Baseline inference script (root; hackathon requirement)
	├── env/
	│   ├── environment.py        # Core OpenEnv class: reset(), step(), state()
	│   ├── models.py             # Pydantic v2 Observation, Action, Reward models
	│   ├── sandbox.py            # AST-based sandboxed code execution
	│   ├── server.py             # FastAPI server: /reset, /step, /state, /health, /tasks
	│   ├── tasks/
	│   │   ├── registry.py       # Task registry
	│   │   ├── task_easy.py      # Off-by-one bug in binary search
	│   │   ├── task_medium.py    # Red herring authentication bug
	│   │   └── task_hard.py      # Concurrency race condition
	│   └── graders/
	│       ├── base_grader.py    # Abstract base grader
	│       ├── grader_easy.py    # Standard test-pass + efficiency scoring
	│       ├── grader_medium.py  # Red herring detection + score floor fix
	│       └── grader_hard.py    # Sequential + concurrent stress test scoring
	├── server/
	│   └── app.py                # Entry point alias for openenv validate
	├── tests/
	│   ├── test_environment.py
	│   ├── test_sandbox.py
	│   └── test_graders.py
	├── openenv.yaml              # OpenEnv spec metadata
	├── Dockerfile
	├── requirements.txt
	├── pyproject.toml
	├── uv.lock                   # Reproducible dependency resolution
	└── .gitignore
	```

	---

	## Data Models

	### Observation

	Everything the agent sees at each step. Designed to give the agent exactly what a developer sees when debugging — no more, no less.

	```python
	from typing import List

	from pydantic import BaseModel


	class FixAttempt(BaseModel):
	    attempt_number: int        # 1-indexed
	    code_submitted: str        # Full code the agent submitted
	    hypothesis: str            # Agent's stated theory before this attempt
	    execution_output: str      # Full stdout + stderr from sandbox
	    tests_passed: int
	    tests_total: int
	    execution_time_ms: int
	    timed_out: bool


	class Observation(BaseModel):
	    # Fixed for the episode
	    task_id: str               # "easy" | "medium" | "hard"
	    task_description: str
	    buggy_code: str            # Original broken code, always visible
	    test_suite: str            # Full test file; agent can read requirements
	    initial_error_output: str  # Sandbox output on the buggy code at reset()

	    # Changes each step
	    current_code: str          # Most recent submitted code
	    current_error_output: str  # Test output on current_code
	    tests_passed: int
	    tests_total: int
	    previous_attempts: List[FixAttempt]  # Full episode history

	    # Budget tracking
	    attempts_remaining: int
	    max_attempts: int
	    step_number: int
	    max_steps: int
	    done: bool
	    score_estimate: float      # Running grader estimate shown to agent
	    hint_used: bool
	```

	### Action

	The agent submits exactly one action per step. Three types:

	```python
	from typing import Optional

	from pydantic import BaseModel


	class Action(BaseModel):
	    action_type: str  # "submit_fix" | "query_context" | "give_up"

	    # submit_fix: primary action
	    fixed_code: Optional[str] = None   # Complete corrected code file
	    hypothesis: Optional[str] = None   # REQUIRED: omitting it costs -0.10 reward

	    # query_context: request more information (first is free)
	    query_type: Optional[str] = None   # "function_signature" | "related_code"
	                                       # | "error_explanation" | "test_details"
	    query_target: Optional[str] = None

	    # give_up: explicit surrender, ends episode cleanly
	    final_diagnosis: Optional[str] = None
	```

	### Reward

	Dense signal at every step — not just binary end-of-episode.

	```python
	from typing import Dict

	from pydantic import BaseModel


	class Reward(BaseModel):
	    step_reward: float           # This step: -1.0 to +1.0
	    cumulative_reward: float     # Episode total so far
	    grader_score: float          # 0.0 during episode; official score on terminal step
	    breakdown: Dict[str, float]  # Itemized components for interpretability
	```

	---

	## Reward Function

	The reward function is designed so an RL agent receives meaningful signal at every step, not just when tests pass.

	### Step-Level Rewards

	| Event | Reward | Reasoning |
	|---|---|---|
	| Fix increases tests passing | `+0.15 × (Δpassed / total)` | Scaled progress reward |
	| Fix decreases tests passing | `-0.10 × (Δfailed / total)` | Regression penalty |
	| Fix makes no change | `-0.05` | Stagnation penalty; discourages repetition |
	| All tests pass | `+0.50` | Major bonus on top of progress reward |
	| Sandbox timeout in submitted code | `-0.10` | Penalizes infinite loops |
	| `submit_fix` without hypothesis | `-0.10` | Hypothesis is required |
	| Repeated `query_context` calls | `-0.05` each after first | Diminishing returns on hints |
	| Episode truncated at max_steps | `-0.20` | Penalizes indecision |
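Read as code, the `submit_fix` rows of the table amount to a handful of additive terms. The sketch below is one possible reading, not the environment's implementation: the function name and its arguments are hypothetical, and the `query_context` and truncation penalties are left out for brevity.

```python
from typing import Optional


def submit_fix_reward(
    prev_passed: int,
    new_passed: int,
    total: int,
    hypothesis: Optional[str],
    timed_out: bool = False,
) -> float:
    """Step reward for one submit_fix action, following the table above."""
    if not hypothesis:
        # Missing hypothesis: penalized, and the fix is not executed.
        return -0.10
    reward = 0.0
    if timed_out:
        reward -= 0.10
    delta = new_passed - prev_passed
    if delta > 0:
        reward += 0.15 * delta / total      # scaled progress reward
    elif delta < 0:
        reward -= 0.10 * (-delta) / total   # regression penalty
    else:
        reward -= 0.05                      # stagnation penalty
    if new_passed == total:
        reward += 0.50                      # bonus on top of progress reward
    return reward
```

For example, moving from 7 to 8 passing tests out of 8 yields the progress term `0.15 × 1/8` plus the `0.50` completion bonus.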

	### Episode-Level Grader Score (0.0 → 1.0)

	```
	grader_score = test_pass_ratio     × 0.60
	             + efficiency_bonus    × 0.20
	             + hypothesis_accuracy × 0.15
	             + early_solve_bonus   × 0.05

	where:
	  test_pass_ratio     = agent_best_tests_passed / tests_total
	                        (from agent submissions only, not the initial buggy code)
	  efficiency_bonus    = max(0, (max_attempts - attempts_used) / max_attempts)
	  hypothesis_accuracy = fraction of hypotheses correctly identifying bug location
	  early_solve_bonus   = 1.0 if all tests pass within ceil(max_attempts / 3) attempts, else 0.0
	```

	Score floor design: `test_pass_ratio` is calculated only from the agent's submitted attempts — never from the initial buggy code run. This guarantees that a dummy agent that submits nothing scores 0.0, not an inflated baseline.
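As a worked example, the weighted sum can be written out directly. This is a sketch under stated assumptions rather than the grader's code: the function name and parameters are hypothetical, and `early_solve_bonus` is treated as a 0-or-1 indicator so that its weighted contribution is at most 0.05.

```python
import math
from typing import Optional


def grader_score(
    best_passed: int,
    tests_total: int,
    attempts_used: int,
    max_attempts: int,
    hypothesis_accuracy: float,
    solved_at_attempt: Optional[int] = None,
) -> float:
    """Combine the four components with the weights 0.60/0.20/0.15/0.05."""
    test_pass_ratio = best_passed / tests_total
    efficiency_bonus = max(0.0, (max_attempts - attempts_used) / max_attempts)
    # Assumption: the early-solve term is an indicator (1.0 or 0.0).
    early_cutoff = math.ceil(max_attempts / 3)
    solved_early = solved_at_attempt is not None and solved_at_attempt <= early_cutoff
    early_solve_bonus = 1.0 if solved_early else 0.0
    return (
        0.60 * test_pass_ratio
        + 0.20 * efficiency_bonus
        + 0.15 * hypothesis_accuracy
        + 0.05 * early_solve_bonus
    )
```

On the easy task (`max_attempts = 5`), an agent that passes 8/8 tests on its second attempt with a fully accurate hypothesis would get 0.60 + 0.20 × 0.6 + 0.15 + 0.05 ≈ 0.92 under this reading.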

	---

	## Tasks

	### Task 1 — Easy: Off-by-One Bug

	Difficulty: 🟢 Easy | Max attempts: 5 | Max steps: 8 | Tests: 8

	A binary search implementation with a single-character bug: the while loop uses `left < right` instead of `left <= right`. The loop therefore exits before checking the final remaining candidate, so the function misses the target whenever the search narrows to a single element, for example when the target is the last element in the array. The failing test produces a high-signal error message that points directly at the problem.

	Why it's easy: The error message names the failing assertion. Reading the while condition reveals the bug. One to two iterations expected for any competent agent.

	What the grader checks: Did the agent fix all 8 tests? Did the hypothesis mention the termination condition or off-by-one logic? Was it solved efficiently?

	Expected GPT-4o baseline: ~0.85
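The effect of the loop condition is easy to reproduce in isolation. The buggy version below is a reconstruction from the description above and from the fix shown in the Quick API Test section; it is not copied from `task_easy.py`, and the function names are illustrative.

```python
from typing import Sequence


def binary_search_buggy(arr: Sequence[int], target: int) -> int:
    left, right = 0, len(arr) - 1
    while left < right:  # bug: exits before checking the last candidate
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1


def binary_search_fixed(arr: Sequence[int], target: int) -> int:
    left, right = 0, len(arr) - 1
    while left <= right:  # fix: keep looping while one candidate remains
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1


data = [1, 3, 5, 7]
print(binary_search_buggy(data, 7), binary_search_fixed(data, 7))
```

With `data = [1, 3, 5, 7]`, the buggy search for `7` stops once `left == right == 3` and returns `-1`, while the fixed version returns index `3`.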

	---

	### Task 2 — Medium: Red Herring Authentication Bug

	Difficulty: 🟡 Medium | Max attempts: 7 | Max steps: 15 | Tests: 10 (6 pass, 4 fail on buggy code)

	An authentication module with three interdependent functions: `hash_password`, `validate_password`, and `authenticate_user`. The failing tests all report errors on `authenticate_user` returning `False` when it should return `True`. However, `authenticate_user` is completely correct. So is `validate_password`. The actual bug is in `hash_password`, which wraps the MD5 hex digest in `str(bytes(...))` — producing a `"b'...'"` prefix that corrupts the hash string.

	The red herring: the error message names `authenticate_user`. Every surface-level reading of the error points to the wrong function. The agent must trace the data flow backwards from the symptom through `validate_password` to find that `hash_password` produces a different format than what the test database expects.

	Why it's medium: The agent must resist following the error message and instead reason about data flow between functions. GPT-4o follows this red herring approximately 40% of the time.

	Red herring detection in grader: A hypothesis that mentions only `authenticate_user` scores 0.0 for hypothesis accuracy. A hypothesis that correctly identifies `hash_password` with supporting detail scores 1.0.

	Expected GPT-4o baseline: ~0.50
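The corruption described above can be seen with a few lines of standard-library Python. This is a hypothetical reconstruction: the README elides the exact arguments to `bytes(...)`, so the encoding call and both function bodies below are assumptions made for illustration, not the task's source.

```python
import hashlib


def hash_password_buggy(password: str) -> str:
    digest = hashlib.md5(password.encode()).hexdigest()
    # str() of a bytes object is its repr, so this yields "b'<hex>'".
    return str(bytes(digest, "utf-8"))


def hash_password_fixed(password: str) -> str:
    # Return the plain 32-character hex digest.
    return hashlib.md5(password.encode()).hexdigest()


good = hash_password_fixed("secret")
bad = hash_password_buggy("secret")
print(len(good), len(bad))
print(bad == "b'" + good + "'")
```

Because the stored hashes use the plain hex format, any comparison against the wrapped string fails downstream, which is why the symptom shows up in `authenticate_user` rather than in the function that contains the bug.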

	---

	### Task 3 — Hard: Concurrency Race Condition

	Difficulty: 🔴 Hard | Max attempts: 10 | Max steps: 25 | Tests: 8 (all 8 pass on buggy code)

	A `ConnectionCounter` class used in a web server to track active connections. It uses `threading.Lock` and appears to be correctly implemented. All 8 sequential unit tests pass. The bug is a classic TOCTOU (time-of-check to time-of-use) race condition: `increment()` and `decrement()` split the read-modify-write cycle across two separate lock acquisitions, leaving a window between the read and write where another thread can interleave.

	```python
	def increment(self):
	    with self._lock:
	        current = self.count   # read (lock is released when this block exits)
	    new_val = current + 1      # modify (no lock held)
	    with self._lock:
	        self.count = new_val   # write: race window exploited
	```

	The agent must: (1) recognize that 8/8 passing tests do not prove correctness for concurrent code, (2) reason about thread interleaving, (3) design a concurrent stress test that surfaces the race, (4) fix the atomicity issue by collapsing read-modify-write into a single lock scope, and (5) verify the fix passes both the original tests and a 1000-thread concurrent stress test.

	Why it's hard: Race conditions are non-deterministic, and this bug never manifests in sequential execution. The agent must reason about the limits of the existing test suite, something GPT-4o managed in only about 20% of baseline runs.

	Hard task grader breakdown:
	- Sequential tests pass: 0.40 (agent submissions only)
	- 1000-thread concurrent stress test passes: 0.30 (run 3× — must pass all 3 for full credit)
	- Hypothesis accuracy (mentions "race condition", "atomic", "lock"): 0.20
	- Efficiency bonus (fixed within 5 attempts): 0.10

	Expected GPT-4o baseline: ~0.18
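A fix and a stress harness in the spirit of steps (3) to (5) might look like the sketch below. The grader's actual harness is not shown in this README, so `SafeCounter`, `stress`, and `passes_all_runs` are hypothetical names; only the single-lock-scope fix, the 1000-thread count, and the three repeated runs come from the text above.

```python
import threading
from typing import Callable


class SafeCounter:
    """Counter whose read-modify-write happens inside one lock scope."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.count = 0

    def increment(self) -> None:
        with self._lock:
            self.count += 1  # read, modify, and write while holding the lock


def stress(counter, n_threads: int = 1000) -> bool:
    """Run n_threads single increments; True if no update was lost."""
    barrier = threading.Barrier(n_threads)

    def worker() -> None:
        barrier.wait(timeout=10)  # release all threads together
        counter.increment()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.count == n_threads


def passes_all_runs(make_counter: Callable[[], object], runs: int = 3,
                    n_threads: int = 1000) -> bool:
    """Require every run to pass, since one lucky pass proves little."""
    return all(stress(make_counter(), n_threads) for _ in range(runs))
```

A call such as `passes_all_runs(SafeCounter)` exercises the fixed counter. Running the same harness against the split-lock version may or may not fail on a given run, which is exactly why the check is repeated.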

	---

	## Security Sandbox

	Every `submit_fix` action executes agent-generated Python code. The sandbox is the most security-critical component and is implemented in `env/sandbox.py`.

	### Multi-Layer Protection

	Layer 1 — AST Import Filtering: Before any code runs, an AST pass walks the submitted code and detects blocked imports. Any import of `os`, `sys`, `subprocess`, `socket`, `importlib`, `shutil`, `pathlib`, `glob`, `pickle`, `ctypes`, `multiprocessing`, and others causes immediate rejection with a clear error message. This uses `ast.parse()` + `ast.walk()` — not string matching, which can be bypassed.

	Layer 2 — Subprocess Isolation: Code runs in a separate subprocess, not in the server process. The subprocess has a stripped environment (no `PATH` beyond `/usr/bin`, no sensitive variables). Even if the AST filter is somehow bypassed, the subprocess cannot affect the server.

	Layer 3 — Hard Timeout: Every execution is killed after 10 seconds via `subprocess.run(timeout=10)`. Infinite loops in submitted code return `timed_out: True` and a `-0.10` step reward.

	Layer 4 — Memory Limit: 256MB per execution via environment isolation.

	Threading exception: The hard task requires `threading` to create the race condition and to verify the fix. The sandbox accepts an `allow_threading=True` flag that removes `threading` from the blocked list for that task only. All other tasks have threading blocked.
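Layers 1 and 3 can be illustrated with the standard library alone. This is a simplified sketch, not the contents of `env/sandbox.py`: the names `BLOCKED`, `blocked_imports`, and `run_sandboxed` are hypothetical, the blocked set covers only the modules named above, and the environment stripping and memory limit are omitted.

```python
import ast
import subprocess
import sys
from typing import List, Tuple

BLOCKED = {"os", "sys", "subprocess", "socket", "importlib", "shutil",
           "pathlib", "glob", "pickle", "ctypes", "multiprocessing", "threading"}


def blocked_imports(code: str, allow_threading: bool = False) -> List[str]:
    """Return the blocked top-level modules that `code` imports."""
    blocked = BLOCKED - {"threading"} if allow_threading else BLOCKED
    found = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        # Compare on the top-level package, so "os.path" counts as "os".
        found.update(name.split(".")[0] for name in names)
    return sorted(found & blocked)


def run_sandboxed(code: str, timeout_s: float = 10.0) -> Tuple[str, bool]:
    """Run code in a child interpreter; return (combined output, timed_out)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "", True
    return proc.stdout + proc.stderr, False
```

A filter like this only sees static `import` statements; dynamic imports such as `__import__("os")` would need additional checks, which is one reason the later layers matter.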

	---

	## API Endpoints

	The environment is served as a FastAPI application on port 8000.

	| Endpoint | Method | Description |
	|---|---|---|
	| `/` | GET | API overview: lists all endpoints and tasks |
	| `/health` | GET | Health check; always returns HTTP 200 |
	| `/tasks` | GET | List all tasks with full metadata |
	| `/reset` | POST | Start a new episode. Body: `{"task_id": "easy"}` |
	| `/step` | POST | Submit one action. Body: Action JSON |
	| `/state` | GET | Full internal episode state |

	All endpoints always return HTTP 200; errors appear in the response body under `info["error"]`, never as HTTP 4xx/5xx. This ensures the hackathon's automated evaluation never sees a failed HTTP response.
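Because the status code carries no failure signal, a client has to inspect the body on every call. A minimal helper might look like this; the function name is hypothetical, and it assumes `info` is a top-level key of the decoded response.

```python
from typing import Any, Dict, Optional


def step_error(payload: Dict[str, Any]) -> Optional[str]:
    """Return the error message carried in a response body, if any."""
    info = payload.get("info") or {}
    error = info.get("error")
    return str(error) if error else None
```

Callers can then branch on `step_error(response_json)` instead of relying on an HTTP status check, which would never trigger here.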

	---

	## OpenEnv Compliance

	```yaml
	# openenv.yaml
	name: agentdebugger-env
	version: 1.0.0
	domain: software_engineering
	observation_type: structured
	action_type: structured
	reward_type: dense
	episode_termination: action_or_step_limit
	tasks:
	  - {id: easy, difficulty: easy, max_steps: 8, max_attempts: 5}
	  - {id: medium, difficulty: medium, max_steps: 15, max_attempts: 7}
	  - {id: hard, difficulty: hard, max_steps: 25, max_attempts: 10}
	```

	Validation output:
	```
	✓ openenv.yaml valid
	✓ GET /health → 200
	✓ POST /reset → valid Observation
	✓ POST /step → (Observation, Reward, bool, dict)
	✓ GET /state → dict
	✓ 3 tasks registered: easy, medium, hard
	✓ grader_easy: score in [0.0, 1.0] — PASS
	✓ grader_medium: score in [0.0, 1.0] — PASS
	✓ grader_hard: score in [0.0, 1.0] — PASS
	✓ inference.py present in root directory
	openenv validate: PASSED
	```

	---

	## Baseline Results

	Evaluated using `gpt-4o` with zero-shot chain-of-thought prompting. Each task was run 5 times independently and the scores were averaged.

	| Task | Difficulty | Mean Score | Std Dev | Solved % | Avg Attempts | Avg Steps |
	|---|---|---|---|---|---|---|
	| Off-by-One Bug | Easy | 0.85 | ±0.04 | 100% | 1.8 | 4.2 |
	| Red Herring Auth | Medium | 0.50 | ±0.10 | 60% | 4.2 | 10.6 |
	| Race Condition | Hard | 0.18 | ±0.09 | 20% | 8.7 | 22.1 |
	| Overall Mean | | 0.51 | | 60% | | |

	Key observations:

	Easy task: GPT-4o reads the error message, identifies the off-by-one in the while condition on the first or second attempt, and fixes correctly. Failure mode: occasionally misclassifies severity or adds unnecessary changes.

	Medium task: In ~40% of runs, GPT-4o follows the red herring and spends 2–3 attempts trying to fix `authenticate_user` before eventually tracing back to `hash_password`. When it identifies the correct function immediately, it solves efficiently. The hypothesis accuracy score penalizes the red-herring runs significantly.

	Hard task: GPT-4o almost never spontaneously recognizes that a race condition can exist when all sequential tests pass. In the rare runs where it does solve it (~20%), it correctly identifies that the lock scope must encompass the entire read-modify-write cycle. The 1000-thread concurrent stress test filters out partial fixes where the race window is narrowed but not eliminated.

	---

	## Setup & Usage

	### Local Development

	```bash
	git clone https://github.com/shasshaank/AgentDebuggerEnv
	cd AgentDebuggerEnv
	pip install -r requirements.txt

	# Start the environment server
	uvicorn env.server:app --reload --port 8000

	# Verify it's running
	curl http://localhost:8000/health
	# {"status": "ok", "environment": "agentdebugger-env", "version": "1.0.0"}

	# Run baseline inference
	export API_BASE_URL="https://api.openai.com/v1"
	export MODEL_NAME="gpt-4o"
	export HF_TOKEN="your_openai_api_key"
	export ENV_BASE_URL="http://localhost:8000"
	python inference.py
	```

	### Docker

	```bash
	# Build
	docker build -t agentdebugger-env .

	# Run
	docker run -p 8000:8000 agentdebugger-env

	# Run with inference against the containerized environment
	docker run -p 8000:8000 \
	  -e API_BASE_URL="https://api.openai.com/v1" \
	  -e MODEL_NAME="gpt-4o" \
	  -e HF_TOKEN="your_key" \
	  agentdebugger-env
	```

	### Quick API Test

	```bash
	# Reset the easy task
	curl -X POST http://localhost:8000/reset \
	  -H "Content-Type: application/json" \
	  -d '{"task_id": "easy"}'

	# Submit a fix with hypothesis
	curl -X POST http://localhost:8000/step \
	  -H "Content-Type: application/json" \
	  -d '{
	    "action_type": "submit_fix",
	    "fixed_code": "def binary_search(arr, target):\n    left, right = 0, len(arr) - 1\n    while left <= right:\n        mid = (left + right) // 2\n        if arr[mid] == target:\n            return mid\n        elif arr[mid] < target:\n            left = mid + 1\n        else:\n            right = mid - 1\n    return -1",
	    "hypothesis": "The while loop uses left < right instead of left <= right, causing it to skip the last element."
	  }'
	```
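Hand-escaping newlines and indentation inside shell-quoted JSON is error-prone, so a client may prefer to build the `/step` body programmatically. The sketch below uses only the standard library; `build_submit_fix` is a hypothetical helper, while the field names come from the Action model.

```python
import json
import textwrap

FIXED_CODE = textwrap.dedent("""\
    def binary_search(arr, target):
        left, right = 0, len(arr) - 1
        while left <= right:
            mid = (left + right) // 2
            if arr[mid] == target:
                return mid
            elif arr[mid] < target:
                left = mid + 1
            else:
                right = mid - 1
        return -1
""")


def build_submit_fix(fixed_code: str, hypothesis: str) -> str:
    """Serialize a submit_fix action; json.dumps escapes the newlines."""
    return json.dumps({
        "action_type": "submit_fix",
        "fixed_code": fixed_code,
        "hypothesis": hypothesis,
    })


body = build_submit_fix(
    FIXED_CODE,
    "The while loop uses left < right instead of left <= right.",
)
```

The resulting `body` string can be sent as the POST payload with any HTTP client.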

	---

	## Why This Environment Matters for Agent Research

	Four specific failure modes in LLM agents are measurable and scorable here for the first time:

	1. Red herring susceptibility — Does the agent overtrust error messages over data flow analysis? The medium task's `hypothesis_accuracy` score measures this directly. An agent that follows the red herring scores 0.0 on hypothesis accuracy even if it eventually finds the correct fix by trial and error.

	2. Stagnation under uncertainty — Does the agent repeat the same failed fix strategy instead of updating its hypothesis? The `-0.05` stagnation penalty and `hypothesis_accuracy` score together capture this. An agent that submits the same code twice scores negatively twice.

	3. Exploration vs. exploitation — The `query_context` action costs a step but provides information. The first query is free; subsequent ones cost `-0.05`. Agents that query productively before attempting a fix demonstrate better exploration behavior than those that immediately submit wrong fixes.

	4. Test-suite as sufficient proof — The hard task is specifically designed to test whether an agent knows when passing tests are not enough. An agent that sees 8/8 tests passing and immediately approves the code — without recognizing the concurrency issue — scores at most 0.40 and fails the most important grader component.

	All four failure modes produce distinct, interpretable score components in the `breakdown` field of every `Reward` response. This makes AgentDebuggerEnv useful not just as a benchmark but as a diagnostic tool for understanding where a specific model fails in iterative reasoning.

	---

	## Design Decisions

	Why require a hypothesis? The `hypothesis` field is mandatory on every `submit_fix` action. Missing it costs `-0.10` and the fix is not executed. This forces agents to articulate their reasoning, which enables the grader to score `hypothesis_accuracy` separately from `test_pass_ratio`. It also prevents degenerate strategies of submitting random code until something passes.

	Why is `best_tests_passed` calculated from agent attempts only? The buggy code for the medium and hard tasks starts with 6/10 and 8/8 tests passing, respectively. If the grader used the environment's `best_tests_passed` (which includes the initial buggy code run), a dummy agent that submits nothing would score 0.36 and 0.40 for free. The grader instead recalculates from the `attempts` list, which contains only what the agent actually submitted, ensuring the score floor is 0.0.

	Why run the concurrent stress test 3 times? Race conditions are non-deterministic. A partial fix that narrows the race window (but doesn't eliminate it) might pass once by luck. Requiring all 3 runs to pass filters out lucky partial fixes. A fix that passes 1 of 3 receives 0.15 — partial credit for progress, but not full credit.

	Why not use pytest directly? Using pytest as the test runner would make output parsing dependent on pytest's output format and version. The environment uses a custom lightweight test runner written as a Python string executed in the sandbox, producing a consistent `"N passed, M failed"` format that `_parse_tests_passed()` can reliably parse across all platforms.
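Parsing a fixed summary line of that shape takes only a small regular expression. The following is an illustrative sketch, not the project's `_parse_tests_passed()`: the function name, the return type, and the choice to read the last summary line in the output are assumptions.

```python
import re
from typing import Tuple

_SUMMARY = re.compile(r"(\d+) passed, (\d+) failed")


def parse_summary(output: str) -> Tuple[int, int]:
    """Return (passed, total) from the last summary line; (0, 0) if absent."""
    matches = _SUMMARY.findall(output)
    if not matches:
        return 0, 0
    passed, failed = map(int, matches[-1])
    return passed, passed + failed
```

Returning `(0, 0)` when no summary is found lets the caller treat a crash before the runner finishes as zero tests passed.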

	---

	## Environment Configuration

	```bash
	# Required for inference.py
	API_BASE_URL # LLM API endpoint (e.g. https://api.openai.com/v1)
	MODEL_NAME # Model identifier (e.g. gpt-4o)
	HF_TOKEN # API key / HuggingFace token

	# Optional — defaults to localhost:8000
	ENV_BASE_URL # Environment server URL
	```
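A script that consumes these variables might validate them up front. The sketch below is an assumption about how that could be done, not an excerpt from `inference.py`; `load_config` and `REQUIRED` are hypothetical names, and the default URL mirrors the local setup shown earlier.

```python
import os
from typing import Dict, Mapping

REQUIRED = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect required settings and apply the ENV_BASE_URL default."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED}
    config["ENV_BASE_URL"] = env.get("ENV_BASE_URL", "http://localhost:8000")
    return config
```

Failing fast with the list of missing names is friendlier than an authentication error halfway through an episode.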

	---

	## License & Attribution

	License: MIT — see [LICENSE](LICENSE)

	Authors: Pranav, Shashaank (Team Endurance)

	Submitted to: Meta + PyTorch + HuggingFace OpenEnv Hackathon

	Live Environment: https://huggingface.co/spaces/shashaank0707/AgentDebugger-env