# AgentDebuggerEnv 🐛
> **A live, iterative debugging environment for benchmarking agentic reasoning in AI systems.**
> Submitted to the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**.
[![HuggingFace Space](https://img.shields.io/badge/🤗%20HuggingFace-Space%20Live-yellow)](https://huggingface.co/spaces/shashaank0707/AgentDebugger-env)
[![OpenEnv Compliant](https://img.shields.io/badge/OpenEnv-Compliant-blue)](https://huggingface.co/spaces/shashaank0707/AgentDebugger-env)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-blue)](https://www.python.org/)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.110-009688)](https://fastapi.tiangolo.com/)
---
## The Problem with Existing Code Benchmarks
Benchmarks like HumanEval, MBPP, and even SWE-bench share a fundamental limitation: they are **one-shot evaluations**. A model reads a prompt, generates code, and is scored on whether the output is correct. This measures code generation ability, not debugging ability.
Real software engineering is not one-shot. It is **iterative**. A developer:
1. Reads failing tests and error output
2. Forms a hypothesis about the root cause
3. Submits a fix
4. Reads the new error output
5. Updates their hypothesis
6. Repeats, sometimes many times
No existing benchmark measures this loop. **AgentDebuggerEnv does.**
---
## What Makes This Different from SWE-bench
SWE-bench gives an agent a static GitHub issue and measures only the final patch correctness. AgentDebuggerEnv is fundamentally different in three ways:
| Dimension | SWE-bench | AgentDebuggerEnv |
|---|---|---|
| Evaluation target | Final patch quality | Full reasoning trajectory |
| Feedback | None (single shot) | Real `stdout/stderr` after every fix attempt |
| Reward signal | Binary (pass/fail) | Dense (every step is scored) |
| What's measured | Code generation | Hypothesis formation + iterative reasoning |
| Hard task | Applies existing patch | Must design a test to surface a hidden bug |
The iterative feedback loop is the core mechanic. Every `step()` call executes the agent's submitted code in a live sandbox and returns the actual test output. The agent must then update its theory and try again, exactly like a real developer at a terminal.
---
## Environment Overview
AgentDebuggerEnv is a fully OpenEnv-compliant environment exposing the standard three-method API:
```
reset(task_id) → initial Observation
step(action)   → Observation, Reward, done, info
state()        → current internal state dict
```
The environment is deployed as a containerized FastAPI server on HuggingFace Spaces, passes `openenv validate`, and includes a fully reproducible baseline inference script.
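To make the reset/step loop concrete, here is a minimal client sketch. It is not part of the repository: `http_post` and `run_episode` are hypothetical helpers, and the response shapes (`/reset` returning the Observation as a JSON object, `/step` returning an object with `observation`, `reward`, `done`, and `info` keys) are assumptions that should be checked against the live server.

```python
import json
import urllib.request
from typing import Any, Callable, Dict

# A transport takes (path, body) and returns the parsed JSON response.
Post = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def http_post(base_url: str) -> Post:
    """Build a transport that POSTs JSON to the environment server."""
    def post(path: str, body: Dict[str, Any]) -> Dict[str, Any]:
        req = urllib.request.Request(
            base_url + path,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return post


def run_episode(post: Post, task_id: str,
                propose: Callable[[Dict[str, Any]], Dict[str, Any]],
                max_turns: int = 25) -> Dict[str, Any]:
    """Reset the task, then keep stepping until the server reports done."""
    # Assumption: /reset returns the Observation as a JSON object.
    observation = post("/reset", {"task_id": task_id})
    result: Dict[str, Any] = {"observation": observation, "done": False}
    for _ in range(max_turns):
        if result.get("done"):
            break
        # propose() maps the latest observation to an Action dict.
        action = propose(result["observation"])
        # Assumption: /step returns {"observation", "reward", "done", "info"}.
        result = post("/step", action)
    return result
```

With a server running locally, `run_episode(http_post("http://localhost:8000"), "easy", my_policy)` would drive one episode, where `my_policy` is whatever function turns an observation into an action.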
**Live Space:** https://huggingface.co/spaces/shashaank0707/AgentDebugger-env
---
## Project Structure
```
AgentDebuggerEnv/
├── inference.py              # Baseline inference script (in root; hackathon requirement)
├── env/
│   ├── environment.py        # Core OpenEnv class: reset(), step(), state()
│   ├── models.py             # Pydantic v2 Observation, Action, Reward models
│   ├── sandbox.py            # AST-based sandboxed code execution
│   ├── server.py             # FastAPI server: /reset, /step, /state, /health, /tasks
│   ├── tasks/
│   │   ├── registry.py       # Task registry
│   │   ├── task_easy.py      # Off-by-one bug in binary search
│   │   ├── task_medium.py    # Red herring authentication bug
│   │   └── task_hard.py      # Concurrency race condition
│   └── graders/
│       ├── base_grader.py    # Abstract base grader
│       ├── grader_easy.py    # Standard test-pass + efficiency scoring
│       ├── grader_medium.py  # Red herring detection + score floor fix
│       └── grader_hard.py    # Sequential + concurrent stress test scoring
├── server/
│   └── app.py                # Entry point alias for openenv validate
├── tests/
│   ├── test_environment.py
│   ├── test_sandbox.py
│   └── test_graders.py
├── openenv.yaml              # OpenEnv spec metadata
├── Dockerfile
├── requirements.txt
├── pyproject.toml
├── uv.lock                   # Reproducible dependency resolution
└── .gitignore
```
---
## Data Models
### Observation
Everything the agent sees at each step. Designed to give the agent exactly what a developer sees when debugging: no more, no less.
```python
from typing import List

from pydantic import BaseModel


class FixAttempt(BaseModel):
    attempt_number: int        # 1-indexed
    code_submitted: str        # Full code the agent submitted
    hypothesis: str            # Agent's stated theory before this attempt
    execution_output: str      # Full stdout + stderr from sandbox
    tests_passed: int
    tests_total: int
    execution_time_ms: int
    timed_out: bool


class Observation(BaseModel):
    # Fixed for the episode
    task_id: str               # "easy" | "medium" | "hard"
    task_description: str
    buggy_code: str            # Original broken code; always visible
    test_suite: str            # Full test file; agent can read requirements
    initial_error_output: str  # Sandbox output on the buggy code at reset()

    # Changes each step
    current_code: str          # Most recent submitted code
    current_error_output: str  # Test output on current_code
    tests_passed: int
    tests_total: int
    previous_attempts: List[FixAttempt]  # Full episode history

    # Budget tracking
    attempts_remaining: int
    max_attempts: int
    step_number: int
    max_steps: int
    done: bool
    score_estimate: float      # Running grader estimate shown to agent
    hint_used: bool
```
### Action
The agent submits exactly one action per step. Three types:
```python
from typing import Optional

from pydantic import BaseModel


class Action(BaseModel):
    action_type: str                       # "submit_fix" | "query_context" | "give_up"

    # submit_fix: primary action
    fixed_code: Optional[str] = None       # Complete corrected code file
    hypothesis: Optional[str] = None       # REQUIRED: missing costs -0.10 reward

    # query_context: request more information (first is free)
    query_type: Optional[str] = None       # "function_signature" | "related_code"
                                           # | "error_explanation" | "test_details"
    query_target: Optional[str] = None

    # give_up: explicit surrender, ends episode cleanly
    final_diagnosis: Optional[str] = None
```
### Reward
Dense signal at every step, not just a binary end-of-episode result.
```python
from typing import Dict

from pydantic import BaseModel


class Reward(BaseModel):
    step_reward: float           # This step: -1.0 to +1.0
    cumulative_reward: float     # Episode total so far
    grader_score: float          # 0.0 during episode; official score on terminal step
    breakdown: Dict[str, float]  # Itemized components for interpretability
```
---
## Reward Function
The reward function is designed so an RL agent receives meaningful signal at every step, not just when tests pass.
### Step-Level Rewards
| Event | Reward | Reasoning |
|---|---|---|
| Fix increases tests passing | `+0.15 × (Δpassed / total)` | Scaled progress reward |
| Fix decreases tests passing | `-0.10 × (Δfailed / total)` | Regression penalty |
| Fix makes no change | `-0.05` | Stagnation penalty; discourages repetition |
| All tests pass | `+0.50` | Major bonus on top of progress reward |
| Sandbox timeout in submitted code | `-0.10` | Penalizes infinite loops |
| `submit_fix` without hypothesis | `-0.10` | Hypothesis is required |
| Repeated `query_context` calls | `-0.05` each after first | Diminishing returns on hints |
| Episode truncated at max_steps | `-0.20` | Penalizes indecision |
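
As a rough illustration of how the `submit_fix` rows in this table could combine, here is a minimal sketch. The function name `submit_fix_reward` is hypothetical, and the way events stack is an assumption: the table does not say whether the timeout penalty adds to the stagnation penalty, or whether the all-pass bonus applies when the pass count did not change.

```python
def submit_fix_reward(prev_passed: int, new_passed: int, total: int,
                      timed_out: bool = False,
                      has_hypothesis: bool = True) -> float:
    """Hypothetical step reward for one submit_fix action."""
    if not has_hypothesis:
        # The README states the fix is not executed in this case.
        return -0.10

    reward = 0.0
    if timed_out:
        reward -= 0.10                          # sandbox timeout penalty

    delta = new_passed - prev_passed
    if delta > 0:
        reward += 0.15 * delta / total          # scaled progress reward
    elif delta < 0:
        reward -= 0.10 * (-delta) / total       # regression penalty
    else:
        reward -= 0.05                          # stagnation penalty

    if new_passed == total:
        reward += 0.50                          # assumed: bonus whenever all tests pass
    return reward


# Medium task: going from 6/10 to 10/10 in one attempt.
print(round(submit_fix_reward(6, 10, 10), 2))   # 0.56
```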
### Episode-Level Grader Score (0.0 → 1.0)
```
grader_score = test_pass_ratio     × 0.60
             + efficiency_bonus    × 0.20
             + hypothesis_accuracy × 0.15
             + early_solve_bonus   × 0.05

where:
  test_pass_ratio     = agent_best_tests_passed / tests_total
                        (from agent submissions only, not initial buggy code)
  efficiency_bonus    = max(0, (max_attempts - attempts_used) / max_attempts)
  hypothesis_accuracy = fraction of hypotheses correctly identifying bug location
  early_solve_bonus   = 0.05 if all tests pass within ceil(max_attempts / 3) attempts
```
**Score floor design:** `test_pass_ratio` is calculated only from the agent's submitted attempts, never from the initial buggy code run. This guarantees that a dummy agent that submits nothing scores 0.0, not an inflated baseline.
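
A worked example may help here. The sketch below evaluates the formula for a hypothetical easy-task run; `grader_score` is an illustrative function, not the repository's grader. One assumption is made explicit: the early-solve term is treated as a flat 0.05 added when the condition holds, so that the four weights sum to 1.0.

```python
import math
from typing import Optional


def grader_score(best_passed: int, tests_total: int,
                 attempts_used: int, max_attempts: int,
                 hypothesis_accuracy: float,
                 solved_at_attempt: Optional[int] = None) -> float:
    """Hypothetical episode score following the weighted formula above."""
    test_pass_ratio = best_passed / tests_total
    efficiency_bonus = max(0.0, (max_attempts - attempts_used) / max_attempts)
    # Assumption: the early-solve term contributes a flat 0.05 when earned.
    solved_early = (solved_at_attempt is not None
                    and solved_at_attempt <= math.ceil(max_attempts / 3))
    return (test_pass_ratio * 0.60
            + efficiency_bonus * 0.20
            + hypothesis_accuracy * 0.15
            + (0.05 if solved_early else 0.0))


# Easy task: 8/8 tests on attempt 2 of 5, every hypothesis on target.
# 0.60 + 0.20 * (3/5) + 0.15 + 0.05 = 0.92
print(round(grader_score(8, 8, 2, 5, 1.0, solved_at_attempt=2), 2))  # 0.92
```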
---
## Tasks
### Task 1 (Easy): Off-by-One Bug
**Difficulty:** 🟢 Easy | **Max attempts:** 5 | **Max steps:** 8 | **Tests:** 8
A binary search implementation with a single-character bug: the while loop uses `left < right` instead of `left <= right`. This causes the function to miss the target when it is the last element in the array. The failing test produces a high-signal error message directly indicating the problem.
**Why it's easy:** The error message names the failing assertion. Reading the while condition reveals the bug. One to two iterations expected for any competent agent.
**What the grader checks:** Did the agent fix all 8 tests? Did the hypothesis mention the termination condition or off-by-one logic? Was it solved efficiently?
**Expected GPT-4o baseline:** ~0.85
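
The effect of the loop condition can be reproduced in a few lines. This is a sketch based on the description above, not the task file itself; the function names and the sample array are illustrative.

```python
def binary_search_buggy(arr, target):
    left, right = 0, len(arr) - 1
    while left < right:            # bug: stops before checking the final candidate
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1


def binary_search_fixed(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:           # fix: the single remaining element is still checked
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1


print(binary_search_buggy([1, 3, 5, 7], 7))  # -1 (target is the last element)
print(binary_search_fixed([1, 3, 5, 7], 7))  # 3
```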
---
### Task 2 (Medium): Red Herring Authentication Bug
**Difficulty:** 🟡 Medium | **Max attempts:** 7 | **Max steps:** 15 | **Tests:** 10 (6 pass, 4 fail on buggy code)
An authentication module with three interdependent functions: `hash_password`, `validate_password`, and `authenticate_user`. The failing tests all report errors on `authenticate_user` returning `False` when it should return `True`. However, `authenticate_user` is completely correct. So is `validate_password`. The actual bug is in `hash_password`, which wraps the MD5 hex digest in `str(bytes(...))`, producing a `"b'...'"`-wrapped string that no longer matches the stored hash.
The red herring: the error message names `authenticate_user`. Every surface-level reading of the error points to the wrong function. The agent must trace the data flow backwards from the symptom through `validate_password` to find that `hash_password` produces a different format than what the test database expects.
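
The format mismatch described above is easy to reproduce. The following sketch is an assumption about what the buggy function might look like, reconstructed from the description; the actual task code may differ in detail.

```python
import hashlib


def hash_password_buggy(password: str) -> str:
    digest = hashlib.md5(password.encode("utf-8")).hexdigest()
    # str() of a bytes object yields its repr, e.g. "b'5f4d...'",
    # so the result is wrapped in b'...' instead of being the plain hex digest.
    return str(bytes(digest, "utf-8"))


def hash_password_fixed(password: str) -> str:
    # Return the 32-character hex digest directly.
    return hashlib.md5(password.encode("utf-8")).hexdigest()


buggy = hash_password_buggy("secret")
fixed = hash_password_fixed("secret")
print(buggy.startswith("b'"))   # True
print(buggy == fixed)           # False
print(buggy[2:-1] == fixed)     # True: the digest is intact inside the wrapper
```

Any comparison against a stored plain hex digest fails for the buggy version, which is why the symptom surfaces downstream in `authenticate_user`.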
**Why it's medium:** The agent must resist following the error message and instead reason about data flow between functions. GPT-4o follows this red herring approximately 40% of the time.
**Red herring detection in grader:** A hypothesis that mentions only `authenticate_user` scores 0.0 for hypothesis accuracy. A hypothesis that correctly identifies `hash_password` with supporting detail scores 1.0.
**Expected GPT-4o baseline:** ~0.50
---
### Task 3 (Hard): Concurrency Race Condition
**Difficulty:** 🔴 Hard | **Max attempts:** 10 | **Max steps:** 25 | **Tests:** 8 (all 8 pass on buggy code)
A `ConnectionCounter` class used in a web server to track active connections. It uses `threading.Lock` and appears to be correctly implemented. All 8 sequential unit tests pass. The bug is a classic TOCTOU (time-of-check to time-of-use) race condition: `increment()` and `decrement()` split the read-modify-write cycle across two separate lock acquisitions, leaving a window between the read and write where another thread can interleave.
```python
def increment(self):
    with self._lock:
        current = self.count   # read; lock is released after this block
    new_val = current + 1      # modify; no lock held
    with self._lock:
        self.count = new_val   # write; race window exploited
```
The agent must: (1) recognize that 8/8 passing tests do not prove correctness for concurrent code, (2) reason about thread interleaving, (3) design a concurrent stress test that surfaces the race, (4) fix the atomicity issue by collapsing read-modify-write into a single lock scope, and (5) verify the fix passes both the original tests and a 1000-thread concurrent stress test.
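
One possible shape for the fix and the stress test is sketched below. `FixedConnectionCounter` and `stress_test` are hypothetical names, and the harness is an illustration of the idea, not the grader's actual stress test.

```python
import threading


class FixedConnectionCounter:
    """Counter whose read-modify-write happens inside a single lock scope."""

    def __init__(self) -> None:
        self.count = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        with self._lock:
            self.count += 1    # read, modify, and write under one acquisition

    def decrement(self) -> None:
        with self._lock:
            self.count -= 1


def stress_test(counter, n_threads: int = 1000, increments_per_thread: int = 100) -> bool:
    """Hammer increment() from many threads and check that no update was lost."""
    def worker() -> None:
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.count == n_threads * increments_per_thread


print(stress_test(FixedConnectionCounter()))  # True
```

Run against the split-lock version, the same harness can report lost updates, but not on every run. That non-determinism is why a single passing run is weak evidence.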
**Why it's hard:** Race conditions are non-deterministic. The bug does not manifest in sequential execution. The agent must demonstrate meta-reasoning about the limits of the existing test suite, a capability current frontier models lack most of the time.
**Hard task grader breakdown:**
- Sequential tests pass: 0.40 (agent submissions only)
- 1000-thread concurrent stress test passes: 0.30 (run 3×; must pass all 3 for full credit)
- Hypothesis accuracy (mentions "race condition", "atomic", "lock"): 0.20
- Efficiency bonus (fixed within 5 attempts): 0.10
**Expected GPT-4o baseline:** ~0.18
---
## Security Sandbox
Every `submit_fix` action executes agent-generated Python code. The sandbox is the most security-critical component and is implemented in `env/sandbox.py`.
### Multi-Layer Protection
**Layer 1 (AST Import Filtering):** Before any code runs, an AST pass walks the submitted code and detects blocked imports. Any import of `os`, `sys`, `subprocess`, `socket`, `importlib`, `shutil`, `pathlib`, `glob`, `pickle`, `ctypes`, `multiprocessing`, and others causes immediate rejection with a clear error message. This uses `ast.parse()` + `ast.walk()` rather than string matching, which can be bypassed.
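
A minimal sketch of such a filter is shown below. It is not the code in `env/sandbox.py`: the function name `find_blocked_imports` is hypothetical, and the blocked set contains only the modules named above (the real list is longer), plus `threading`, which this section says is blocked unless `allow_threading=True`.

```python
import ast
from typing import List

# Only the modules named in this README; the actual blocked list is longer.
BLOCKED_MODULES = {
    "os", "sys", "subprocess", "socket", "importlib", "shutil",
    "pathlib", "glob", "pickle", "ctypes", "multiprocessing",
}


def find_blocked_imports(source: str, allow_threading: bool = False) -> List[str]:
    """Return the sorted list of blocked top-level modules imported by source."""
    blocked = set(BLOCKED_MODULES)
    if not allow_threading:
        blocked.add("threading")

    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]   # module is None for "from . import x"
        else:
            continue
        for name in names:
            root = name.split(".")[0]     # "os.path" counts as "os"
            if root in blocked:
                found.add(root)
    return sorted(found)


print(find_blocked_imports("import os.path\nfrom subprocess import run\nimport math"))
# ['os', 'subprocess']
```

A filter of this shape only sees `import` statements. Dynamic imports such as `__import__("os")` are not caught by it, which is one reason the remaining layers matter.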
**Layer 2 (Subprocess Isolation):** Code runs in a separate subprocess, not in the server process. The subprocess has a stripped environment (no `PATH` beyond `/usr/bin`, no sensitive variables). Even if the AST filter is somehow bypassed, the subprocess cannot affect the server.
**Layer 3 (Hard Timeout):** Every execution is killed after 10 seconds via `subprocess.run(timeout=10)`. Infinite loops in submitted code return `timed_out: True` and a `-0.10` step reward.
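
Layers 2 and 3 could be combined roughly as follows. This is a sketch under assumptions: `run_sandboxed` is a hypothetical helper, the result dictionary is illustrative, and the use of Python's `-I` (isolated mode) flag is an addition not mentioned in this README.

```python
import subprocess
import sys
from typing import Dict, Union


def run_sandboxed(code: str, timeout: float = 10.0) -> Dict[str, Union[str, bool]]:
    """Run code in a child interpreter with a stripped environment and a hard timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],   # -I: isolated mode (assumed extra hardening)
            capture_output=True,
            text=True,
            timeout=timeout,                      # child is killed when this expires
            env={"PATH": "/usr/bin"},             # nothing inherited from the server process
        )
    except subprocess.TimeoutExpired:
        return {"output": "", "timed_out": True}
    return {"output": proc.stdout + proc.stderr, "timed_out": False}


print(run_sandboxed("print(1 + 1)"))
print(run_sandboxed("while True: pass", timeout=1)["timed_out"])  # True
```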
**Layer 4 (Memory Limit):** 256 MB per execution via environment isolation.
**Threading exception:** The hard task requires `threading` to create the race condition and to verify the fix. The sandbox accepts an `allow_threading=True` flag that removes `threading` from the blocked list for that task only. All other tasks have threading blocked.
---
## API Endpoints
The environment is served as a FastAPI application on port 8000.
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | API overview: lists all endpoints and tasks |
| `/health` | GET | Health check: always returns HTTP 200 |
| `/tasks` | GET | List all tasks with full metadata |
| `/reset` | POST | Start a new episode. Body: `{"task_id": "easy"}` |
| `/step` | POST | Submit one action. Body: Action JSON |
| `/state` | GET | Full internal episode state |
All endpoints always return HTTP 200. Errors appear in the response body under `info["error"]`, never as HTTP 4xx/5xx. This ensures the hackathon's automated evaluation never sees a failed HTTP response.
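
Because the status code never signals failure, a client has to inspect the body. A minimal sketch of that check follows; `EnvError` and `check_response` are hypothetical names, and the assumption is that `info` is a top-level key in the response JSON.

```python
from typing import Any, Dict


class EnvError(RuntimeError):
    """Raised when the environment reports an error inside a 200 response."""


def check_response(body: Dict[str, Any]) -> Dict[str, Any]:
    """Return body unchanged, or raise if info["error"] is set."""
    info = body.get("info") or {}       # tolerate a missing or null info field
    error = info.get("error")
    if error:
        raise EnvError(str(error))
    return body


check_response({"done": False, "info": {}})          # passes through
try:
    check_response({"info": {"error": "episode not started"}})
except EnvError as exc:
    print(f"environment error: {exc}")
```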
---
## OpenEnv Compliance
```yaml
# openenv.yaml
name: agentdebugger-env
version: 1.0.0
domain: software_engineering
observation_type: structured
action_type: structured
reward_type: dense
episode_termination: action_or_step_limit
tasks:
- id: easy | difficulty: easy | max_steps: 8 | max_attempts: 5
- id: medium | difficulty: medium | max_steps: 15 | max_attempts: 7
- id: hard | difficulty: hard | max_steps: 25 | max_attempts: 10
```
Validation output:
```
✓ openenv.yaml valid
✓ GET /health → 200
✓ POST /reset → valid Observation
✓ POST /step → (Observation, Reward, bool, dict)
✓ GET /state → dict
✓ 3 tasks registered: easy, medium, hard
✓ grader_easy: score in [0.0, 1.0] - PASS
✓ grader_medium: score in [0.0, 1.0] - PASS
✓ grader_hard: score in [0.0, 1.0] - PASS
✓ inference.py present in root directory
openenv validate: PASSED
```
---
## Baseline Results
Evaluated using `gpt-4o` with zero-shot chain-of-thought prompting. Each task run 5 times independently, scores averaged.
| Task | Difficulty | Mean Score | Std Dev | Solved % | Avg Attempts | Avg Steps |
|---|---|---|---|---|---|---|
| Off-by-One Bug | Easy | 0.85 | ±0.04 | 100% | 1.8 | 4.2 |
| Red Herring Auth | Medium | 0.50 | ±0.10 | 60% | 4.2 | 10.6 |
| Race Condition | Hard | 0.18 | ±0.09 | 20% | 8.7 | 22.1 |
| **Overall Mean** | | **0.51** | | **60%** | | |
**Key observations:**
**Easy task:** GPT-4o reads the error message, identifies the off-by-one in the while condition on the first or second attempt, and fixes correctly. Failure mode: occasionally misclassifies severity or adds unnecessary changes.
**Medium task:** In ~40% of runs, GPT-4o follows the red herring and spends 2 to 3 attempts trying to fix `authenticate_user` before eventually tracing back to `hash_password`. When it identifies the correct function immediately, it solves efficiently. The hypothesis accuracy score penalizes the red-herring runs significantly.
**Hard task:** GPT-4o almost never spontaneously recognizes that a race condition can exist when all sequential tests pass. In the rare runs where it does solve it (~20%), it correctly identifies that the lock scope must encompass the entire read-modify-write cycle. The 1000-thread concurrent stress test filters out partial fixes where the race window is narrowed but not eliminated.
---
## Setup & Usage
### Local Development
```bash
git clone https://github.com/shasshaank/AgentDebuggerEnv
cd AgentDebuggerEnv
pip install -r requirements.txt
# Start the environment server
uvicorn env.server:app --reload --port 8000
# Verify it's running
curl http://localhost:8000/health
# {"status": "ok", "environment": "agentdebugger-env", "version": "1.0.0"}
# Run baseline inference
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your_openai_api_key"
export ENV_BASE_URL="http://localhost:8000"
python inference.py
```
### Docker
```bash
# Build
docker build -t agentdebugger-env .
# Run
docker run -p 8000:8000 agentdebugger-env
# Run with inference against the containerized environment
docker run -p 8000:8000 \
-e API_BASE_URL="https://api.openai.com/v1" \
-e MODEL_NAME="gpt-4o" \
-e HF_TOKEN="your_key" \
agentdebugger-env
```
### Quick API Test
```bash
# Reset the easy task
curl -X POST http://localhost:8000/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "easy"}'
# Submit a fix with hypothesis
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{
"action_type": "submit_fix",
"fixed_code": "def binary_search(arr, target):\n left, right = 0, len(arr) - 1\n while left <= right:\n mid = (left + right) // 2\n if arr[mid] == target:\n return mid\n elif arr[mid] < target:\n left = mid + 1\n else:\n right = mid - 1\n return -1",
"hypothesis": "The while loop uses left < right instead of left <= right, causing it to skip the last element."
}'
```
---
## Why This Environment Matters for Agent Research
Four specific failure modes in LLM agents are measurable and scorable here for the first time:
**1. Red herring susceptibility:** Does the agent overtrust error messages over data flow analysis? The medium task's `hypothesis_accuracy` score measures this directly. An agent that follows the red herring scores 0.0 on hypothesis accuracy even if it eventually finds the correct fix by trial and error.
**2. Stagnation under uncertainty:** Does the agent repeat the same failed fix strategy instead of updating its hypothesis? The `-0.05` stagnation penalty and `hypothesis_accuracy` score together capture this. An agent that submits the same code twice scores negatively twice.
**3. Exploration vs. exploitation:** The `query_context` action costs a step but provides information. The first query is free; subsequent ones cost `-0.05`. Agents that query productively before attempting a fix demonstrate better exploration behavior than those that immediately submit wrong fixes.
**4. Test-suite as sufficient proof:** The hard task is specifically designed to test whether an agent knows when passing tests are not enough. An agent that sees 8/8 tests passing and immediately approves the code, without recognizing the concurrency issue, scores at most 0.40 and fails the most important grader component.
All four failure modes produce distinct, interpretable score components in the `breakdown` field of every `Reward` response. This makes AgentDebuggerEnv useful not just as a benchmark but as a diagnostic tool for understanding where a specific model fails in iterative reasoning.
---
## Design Decisions
**Why require a hypothesis?** The `hypothesis` field is mandatory on every `submit_fix` action. Missing it costs `-0.10` and the fix is not executed. This forces agents to articulate their reasoning, which enables the grader to score `hypothesis_accuracy` separately from `test_pass_ratio`. It also prevents degenerate strategies of submitting random code until something passes.
**Why is `best_tests_passed` calculated from agent attempts only?** The medium and hard buggy codes start with 6/10 and 8/8 tests passing respectively. If the grader used the environment's `best_tests_passed` (which includes the initial buggy code run), a dummy agent that submits nothing would score 0.36 and 0.40 for free. The grader recalculates from the `attempts` list, which contains only what the agent actually submitted, ensuring the score floor is 0.0.
**Why run the concurrent stress test 3 times?** Race conditions are non-deterministic. A partial fix that narrows the race window (but doesn't eliminate it) might pass once by luck. Requiring all 3 runs to pass filters out lucky partial fixes. A fix that passes 1 of 3 receives 0.15: partial credit for progress, but not full credit.
**Why not use pytest directly?** Using pytest as the test runner would make output parsing dependent on pytest's output format and version. The environment uses a custom lightweight test runner written as a Python string executed in the sandbox, producing a consistent `"N passed, M failed"` format that `_parse_tests_passed()` can reliably parse across all platforms.
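
To illustrate why a fixed summary format keeps parsing simple, here is a sketch of a parser for the `"N passed, M failed"` line. `parse_summary` is a hypothetical stand-in and is not the repository's `_parse_tests_passed()`; returning `(0, 0)` when no summary is found is an assumption.

```python
import re
from typing import Tuple

SUMMARY_RE = re.compile(r"(\d+) passed, (\d+) failed")


def parse_summary(output: str) -> Tuple[int, int]:
    """Return (tests_passed, tests_total) from the last summary line in output."""
    matches = SUMMARY_RE.findall(output)
    if not matches:
        return (0, 0)                    # assumed fallback, e.g. crash before the summary
    passed, failed = (int(x) for x in matches[-1])
    return (passed, passed + failed)


print(parse_summary("test_login ... FAIL\n6 passed, 4 failed\n"))  # (6, 10)
```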
---
## Environment Configuration
```bash
# Required for inference.py
API_BASE_URL # LLM API endpoint (e.g. https://api.openai.com/v1)
MODEL_NAME # Model identifier (e.g. gpt-4o)
HF_TOKEN # API key / HuggingFace token
# Optional: defaults to localhost:8000
ENV_BASE_URL # Environment server URL
```
---
## License & Attribution
**License:** MIT (see [LICENSE](LICENSE))
**Authors:** Pranav, Shashaank (Team Endurance)
**Submitted to:** Meta + PyTorch + HuggingFace OpenEnv Hackathon
**Live Environment:** https://huggingface.co/spaces/shashaank0707/AgentDebugger-env