AgentDebuggerEnv π
A live, iterative debugging environment for benchmarking agentic reasoning in AI systems. Submitted to the Meta + PyTorch + HuggingFace OpenEnv Hackathon.
The Problem with Existing Code Benchmarks
Benchmarks like HumanEval, MBPP, and even SWE-bench share a fundamental limitation: they are one-shot evaluations. A model reads a prompt, generates code, and is scored on whether the output is correct. This measures code generation ability β not debugging ability.
Real software engineering is not one-shot. It is iterative. A developer:
- Reads failing tests and error output
- Forms a hypothesis about the root cause
- Submits a fix
- Reads the new error output
- Updates their hypothesis
- Repeats β sometimes many times
No existing benchmark measures this loop. AgentDebuggerEnv does.
What Makes This Different from SWE-bench
SWE-bench gives an agent a static GitHub issue and measures only the final patch correctness. AgentDebuggerEnv is fundamentally different in three ways:
| Dimension | SWE-bench | AgentDebuggerEnv |
|---|---|---|
| Evaluation target | Final patch quality | Full reasoning trajectory |
| Feedback | None β single shot | Real stdout/stderr after every fix attempt |
| Reward signal | Binary (pass/fail) | Dense β every step is scored |
| What's measured | Code generation | Hypothesis formation + iterative reasoning |
| Hard task | Applies existing patch | Must design a test to surface a hidden bug |
The iterative feedback loop is the core mechanic. Every step() call executes the agent's submitted code in a live sandbox, returns the actual test output, and the agent must update its theory and try again β exactly like a real developer at a terminal.
Environment Overview
AgentDebuggerEnv is a fully OpenEnv-compliant environment exposing the standard three-method API:
reset(task_id) β initial Observation
step(action) β Observation, Reward, done, info
state() β current internal state dict
The environment is deployed as a containerized FastAPI server on HuggingFace Spaces, passes openenv validate, and includes a fully reproducible baseline inference script.
Live Space: https://huggingface.co/spaces/shashaank0707/AgentDebugger-env
Project Structure
AgentDebuggerEnv/
βββ inference.py # Baseline inference script (root β hackathon requirement)
βββ env/
β βββ environment.py # Core OpenEnv class: reset(), step(), state()
β βββ models.py # Pydantic v2 Observation, Action, Reward models
β βββ sandbox.py # AST-based sandboxed code execution
β βββ server.py # FastAPI server: /reset, /step, /state, /health, /tasks
β βββ tasks/
β β βββ registry.py # Task registry
β β βββ task_easy.py # Off-by-one bug in binary search
β β βββ task_medium.py # Red herring authentication bug
β β βββ task_hard.py # Concurrency race condition
β βββ graders/
β βββ base_grader.py # Abstract base grader
β βββ grader_easy.py # Standard test-pass + efficiency scoring
β βββ grader_medium.py # Red herring detection + score floor fix
β βββ grader_hard.py # Sequential + concurrent stress test scoring
βββ server/
β βββ app.py # Entry point alias for openenv validate
βββ tests/
β βββ test_environment.py
β βββ test_sandbox.py
β βββ test_graders.py
βββ openenv.yaml # OpenEnv spec metadata
βββ Dockerfile
βββ requirements.txt
βββ pyproject.toml
βββ uv.lock # Reproducible dependency resolution
βββ .gitignore
Data Models
Observation
Everything the agent sees at each step. Designed to give the agent exactly what a developer sees when debugging β no more, no less.
class FixAttempt(BaseModel):
attempt_number: int # 1-indexed
code_submitted: str # Full code the agent submitted
hypothesis: str # Agent's stated theory before this attempt
execution_output: str # Full stdout + stderr from sandbox
tests_passed: int
tests_total: int
execution_time_ms: int
timed_out: bool
class Observation(BaseModel):
# Fixed for the episode
task_id: str # "easy" | "medium" | "hard"
task_description: str
buggy_code: str # Original broken code β always visible
test_suite: str # Full test file β agent can read requirements
initial_error_output: str # Sandbox output on the buggy code at reset()
# Changes each step
current_code: str # Most recent submitted code
current_error_output: str # Test output on current_code
tests_passed: int
tests_total: int
previous_attempts: List[FixAttempt] # Full episode history
# Budget tracking
attempts_remaining: int
max_attempts: int
step_number: int
max_steps: int
done: bool
score_estimate: float # Running grader estimate shown to agent
hint_used: bool
Action
The agent submits exactly one action per step. Three types:
class Action(BaseModel):
action_type: str # "submit_fix" | "query_context" | "give_up"
# submit_fix β primary action
fixed_code: Optional[str] = None # Complete corrected code file
hypothesis: Optional[str] = None # REQUIRED β missing costs -0.10 reward
# query_context β request more information (first is free)
query_type: Optional[str] = None # "function_signature" | "related_code"
# | "error_explanation" | "test_details"
query_target: Optional[str] = None
# give_up β explicit surrender, ends episode cleanly
final_diagnosis: Optional[str] = None
Reward
Dense signal at every step β not just binary end-of-episode.
class Reward(BaseModel):
step_reward: float # This step: -1.0 to +1.0
cumulative_reward: float # Episode total so far
grader_score: float # 0.0 during episode; official score on terminal step
breakdown: Dict[str, float] # Itemized components for interpretability
Reward Function
The reward function is designed so an RL agent receives meaningful signal at every step, not just when tests pass.
Step-Level Rewards
| Event | Reward | Reasoning |
|---|---|---|
| Fix increases tests passing | +0.15 Γ (Ξpassed / total) |
Scaled progress reward |
| Fix decreases tests passing | -0.10 Γ (Ξfailed / total) |
Regression penalty |
| Fix makes no change | -0.05 |
Stagnation penalty β discourages repetition |
| All tests pass | +0.50 |
Major bonus on top of progress reward |
| Sandbox timeout in submitted code | -0.10 |
Penalizes infinite loops |
submit_fix without hypothesis |
-0.10 |
Hypothesis is required |
Repeated query_context calls |
-0.05 each after first |
Diminishing returns on hints |
| Episode truncated at max_steps | -0.20 |
Penalizes indecision |
Episode-Level Grader Score (0.0 β 1.0)
grader_score = test_pass_ratio Γ 0.60
+ efficiency_bonus Γ 0.20
+ hypothesis_accuracy Γ 0.15
+ early_solve_bonus Γ 0.05
where:
test_pass_ratio = agent_best_tests_passed / tests_total
(from agent submissions only β not initial buggy code)
efficiency_bonus = max(0, (max_attempts - attempts_used) / max_attempts)
hypothesis_accuracy = fraction of hypotheses correctly identifying bug location
early_solve_bonus = 0.05 if all tests pass within ceil(max_attempts / 3) attempts
Score floor design: test_pass_ratio is calculated only from the agent's submitted attempts β never from the initial buggy code run. This guarantees that a dummy agent that submits nothing scores 0.0, not an inflated baseline.
Tasks
Task 1 β Easy: Off-by-One Bug
Difficulty: π’ Easy | Max attempts: 5 | Max steps: 8 | Tests: 8
A binary search implementation with a single-character bug: the while loop uses left < right instead of left <= right. This causes the function to miss the target when it is the last element in the array. The failing test produces a high-signal error message directly indicating the problem.
Why it's easy: The error message names the failing assertion. Reading the while condition reveals the bug. One to two iterations expected for any competent agent.
What the grader checks: Did the agent fix all 8 tests? Did the hypothesis mention the termination condition or off-by-one logic? Was it solved efficiently?
Expected GPT-4o baseline: ~0.85
Task 2 β Medium: Red Herring Authentication Bug
Difficulty: π‘ Medium | Max attempts: 7 | Max steps: 15 | Tests: 10 (6 pass, 4 fail on buggy code)
An authentication module with three interdependent functions: hash_password, validate_password, and authenticate_user. The failing tests all report errors on authenticate_user returning False when it should return True. However, authenticate_user is completely correct. So is validate_password. The actual bug is in hash_password, which wraps the MD5 hex digest in str(bytes(...)) β producing a "b'...'" prefix that corrupts the hash string.
The red herring: the error message names authenticate_user. Every surface-level reading of the error points to the wrong function. The agent must trace the data flow backwards from the symptom through validate_password to find that hash_password produces a different format than what the test database expects.
Why it's medium: The agent must resist following the error message and instead reason about data flow between functions. GPT-4o follows this red herring approximately 40% of the time.
Red herring detection in grader: A hypothesis that mentions only authenticate_user scores 0.0 for hypothesis accuracy. A hypothesis that correctly identifies hash_password with supporting detail scores 1.0.
Expected GPT-4o baseline: ~0.50
Task 3 β Hard: Concurrency Race Condition
Difficulty: π΄ Hard | Max attempts: 10 | Max steps: 25 | Tests: 8 (all 8 pass on buggy code)
A ConnectionCounter class used in a web server to track active connections. It uses threading.Lock and appears to be correctly implemented. All 8 sequential unit tests pass. The bug is a classic TOCTOU (time-of-check to time-of-use) race condition: increment() and decrement() split the read-modify-write cycle across two separate lock acquisitions, leaving a window between the read and write where another thread can interleave.
def increment(self):
with self._lock:
current = self.count # read β lock released here
new_val = current + 1 # modify β no lock held
with self._lock:
self.count = new_val # write β race window exploited
The agent must: (1) recognize that 8/8 passing tests do not prove correctness for concurrent code, (2) reason about thread interleaving, (3) design a concurrent stress test that surfaces the race, (4) fix the atomicity issue by collapsing read-modify-write into a single lock scope, and (5) verify the fix passes both the original tests and a 1000-thread concurrent stress test.
Why it's hard: Race conditions are non-deterministic. The bug does not manifest in sequential execution. The agent must demonstrate meta-reasoning about the limits of the existing test suite β a capability current frontier models lack most of the time.
Hard task grader breakdown:
- Sequential tests pass: 0.40 (agent submissions only)
- 1000-thread concurrent stress test passes: 0.30 (run 3Γ β must pass all 3 for full credit)
- Hypothesis accuracy (mentions "race condition", "atomic", "lock"): 0.20
- Efficiency bonus (fixed within 5 attempts): 0.10
Expected GPT-4o baseline: ~0.18
Security Sandbox
Every submit_fix action executes agent-generated Python code. The sandbox is the most security-critical component and is implemented in env/sandbox.py.
Multi-Layer Protection
Layer 1 β AST Import Filtering: Before any code runs, an AST pass walks the submitted code and detects blocked imports. Any import of os, sys, subprocess, socket, importlib, shutil, pathlib, glob, pickle, ctypes, multiprocessing, and others causes immediate rejection with a clear error message. This uses ast.parse() + ast.walk() β not string matching, which can be bypassed.
Layer 2 β Subprocess Isolation: Code runs in a separate subprocess, not in the server process. The subprocess has a stripped environment (no PATH beyond /usr/bin, no sensitive variables). Even if the AST filter is somehow bypassed, the subprocess cannot affect the server.
Layer 3 β Hard Timeout: Every execution is killed after 10 seconds via subprocess.run(timeout=10). Infinite loops in submitted code return timed_out: True and a -0.10 step reward.
Layer 4 β Memory Limit: 256MB per execution via environment isolation.
Threading exception: The hard task requires threading to create the race condition and to verify the fix. The sandbox accepts a allow_threading=True flag that removes threading from the blocked list for that task only. All other tasks have threading blocked.
API Endpoints
The environment is served as a FastAPI application on port 8000.
| Endpoint | Method | Description |
|---|---|---|
/ |
GET | API overview β lists all endpoints and tasks |
/health |
GET | Health check β always returns HTTP 200 |
/tasks |
GET | List all tasks with full metadata |
/reset |
POST | Start a new episode. Body: {"task_id": "easy"} |
/step |
POST | Submit one action. Body: Action JSON |
/state |
GET | Full internal episode state |
All endpoints return HTTP 200 always β errors appear in the response body under info["error"], never as HTTP 4xx/5xx. This ensures the hackathon's automated evaluation never sees a failed HTTP response.
OpenEnv Compliance
# openenv.yaml
name: agentdebugger-env
version: 1.0.0
domain: software_engineering
observation_type: structured
action_type: structured
reward_type: dense
episode_termination: action_or_step_limit
tasks:
- id: easy | difficulty: easy | max_steps: 8 | max_attempts: 5
- id: medium | difficulty: medium | max_steps: 15 | max_attempts: 7
- id: hard | difficulty: hard | max_steps: 25 | max_attempts: 10
Validation output:
β openenv.yaml valid
β GET /health β 200
β POST /reset β valid Observation
β POST /step β (Observation, Reward, bool, dict)
β GET /state β dict
β 3 tasks registered: easy, medium, hard
β grader_easy: score in [0.0, 1.0] β PASS
β grader_medium: score in [0.0, 1.0] β PASS
β grader_hard: score in [0.0, 1.0] β PASS
β inference.py present in root directory
openenv validate: PASSED
Baseline Results
Evaluated using gpt-4o with zero-shot chain-of-thought prompting. Each task run 5 times independently, scores averaged.
| Task | Difficulty | Mean Score | Std Dev | Solved % | Avg Attempts | Avg Steps |
|---|---|---|---|---|---|---|
| Off-by-One Bug | Easy | 0.85 | Β±0.04 | 100% | 1.8 | 4.2 |
| Red Herring Auth | Medium | 0.50 | Β±0.10 | 60% | 4.2 | 10.6 |
| Race Condition | Hard | 0.18 | Β±0.09 | 20% | 8.7 | 22.1 |
| Overall Mean | 0.51 | 60% |
Key observations:
Easy task: GPT-4o reads the error message, identifies the off-by-one in the while condition on the first or second attempt, and fixes correctly. Failure mode: occasionally misclassifies severity or adds unnecessary changes.
Medium task: In ~40% of runs, GPT-4o follows the red herring and spends 2β3 attempts trying to fix authenticate_user before eventually tracing back to hash_password. When it identifies the correct function immediately, it solves efficiently. The hypothesis accuracy score penalizes the red-herring runs significantly.
Hard task: GPT-4o almost never spontaneously recognizes that a race condition can exist when all sequential tests pass. In the rare runs where it does solve it (~20%), it correctly identifies that the lock scope must encompass the entire read-modify-write cycle. The 1000-thread concurrent stress test filters out partial fixes where the race window is narrowed but not eliminated.
Setup & Usage
Local Development
git clone https://github.com/shasshaank/AgentDebuggerEnv
cd AgentDebuggerEnv
pip install -r requirements.txt
# Start the environment server
uvicorn env.server:app --reload --port 8000
# Verify it's running
curl http://localhost:8000/health
# {"status": "ok", "environment": "agentdebugger-env", "version": "1.0.0"}
# Run baseline inference
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your_openai_api_key"
export ENV_BASE_URL="http://localhost:8000"
python inference.py
Docker
# Build
docker build -t agentdebugger-env .
# Run
docker run -p 8000:8000 agentdebugger-env
# Run with inference against the containerized environment
docker run -p 8000:8000 \
-e API_BASE_URL="https://api.openai.com/v1" \
-e MODEL_NAME="gpt-4o" \
-e HF_TOKEN="your_key" \
agentdebugger-env
Quick API Test
# Reset the easy task
curl -X POST http://localhost:8000/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "easy"}'
# Submit a fix with hypothesis
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{
"action_type": "submit_fix",
"fixed_code": "def binary_search(arr, target):\n left, right = 0, len(arr) - 1\n while left <= right:\n mid = (left + right) // 2\n if arr[mid] == target:\n return mid\n elif arr[mid] < target:\n left = mid + 1\n else:\n right = mid - 1\n return -1",
"hypothesis": "The while loop uses left < right instead of left <= right, causing it to skip the last element."
}'
Why This Environment Matters for Agent Research
Four specific failure modes in LLM agents are measurable and scorable here for the first time:
1. Red herring susceptibility β Does the agent overtrust error messages over data flow analysis? The medium task's hypothesis_accuracy score measures this directly. An agent that follows the red herring scores 0.0 on hypothesis accuracy even if it eventually finds the correct fix by trial and error.
2. Stagnation under uncertainty β Does the agent repeat the same failed fix strategy instead of updating its hypothesis? The -0.05 stagnation penalty and hypothesis_accuracy score together capture this. An agent that submits the same code twice scores negatively twice.
3. Exploration vs. exploitation β The query_context action costs a step but provides information. The first query is free; subsequent ones cost -0.05. Agents that query productively before attempting a fix demonstrate better exploration behavior than those that immediately submit wrong fixes.
4. Test-suite as sufficient proof β The hard task is specifically designed to test whether an agent knows when passing tests are not enough. An agent that sees 8/8 tests passing and immediately approves the code β without recognizing the concurrency issue β scores at most 0.40 and fails the most important grader component.
All four failure modes produce distinct, interpretable score components in the breakdown field of every Reward response. This makes AgentDebuggerEnv useful not just as a benchmark but as a diagnostic tool for understanding where a specific model fails in iterative reasoning.
Design Decisions
Why require a hypothesis? The hypothesis field is mandatory on every submit_fix action. Missing it costs -0.10 and the fix is not executed. This forces agents to articulate their reasoning, which enables the grader to score hypothesis_accuracy separately from test_pass_ratio. It also prevents degenerate strategies of submitting random code until something passes.
Why is best_tests_passed calculated from agent attempts only? The medium and hard buggy codes start with 6/10 and 8/8 tests passing respectively. If the grader used the environment's best_tests_passed (which includes the initial buggy code run), a dummy agent that submits nothing would score 0.36 and 0.40 for free. The grader recalculates from the attempts list β which contains only what the agent actually submitted β ensuring the score floor is 0.0.
Why run the concurrent stress test 3 times? Race conditions are non-deterministic. A partial fix that narrows the race window (but doesn't eliminate it) might pass once by luck. Requiring all 3 runs to pass filters out lucky partial fixes. A fix that passes 1 of 3 receives 0.15 β partial credit for progress, but not full credit.
Why not use pytest directly? Using pytest as the test runner would make output parsing dependent on pytest's output format and version. The environment uses a custom lightweight test runner written as a Python string executed in the sandbox, producing a consistent "N passed, M failed" format that _parse_tests_passed() can reliably parse across all platforms.
Environment Configuration
# Required for inference.py
API_BASE_URL # LLM API endpoint (e.g. https://api.openai.com/v1)
MODEL_NAME # Model identifier (e.g. gpt-4o)
HF_TOKEN # API key / HuggingFace token
# Optional β defaults to localhost:8000
ENV_BASE_URL # Environment server URL
License & Attribution
License: MIT β see LICENSE
Authors: Pranav,Shashaank (Team Endurance)
Submitted to: Meta + PyTorch + HuggingFace OpenEnv Hackathon
Live Environment: https://huggingface.co/spaces/shashaank0707/AgentDebugger-env