| # AgentDebuggerEnv |
|
|
| > **A live, iterative debugging environment for benchmarking agentic reasoning in AI systems.** |
| > Submitted to the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**. |
|
|
|
|
| --- |
|
|
| ## The Problem with Existing Code Benchmarks |
|
|
| Benchmarks like HumanEval, MBPP, and even SWE-bench share a fundamental limitation: they are **one-shot evaluations**. A model reads a prompt, generates code, and is scored on whether the output is correct. This measures code generation ability, not debugging ability. |
|
|
| Real software engineering is not one-shot. It is **iterative**. A developer: |
|
|
| 1. Reads failing tests and error output |
| 2. Forms a hypothesis about the root cause |
| 3. Submits a fix |
| 4. Reads the new error output |
| 5. Updates their hypothesis |
| 6. Repeats, sometimes many times |
|
|
| No existing benchmark measures this loop. **AgentDebuggerEnv does.** |
|
|
| --- |
|
|
| ## What Makes This Different from SWE-bench |
|
|
| SWE-bench gives an agent a static GitHub issue and measures only the correctness of the final patch. AgentDebuggerEnv differs along five dimensions: |
|
|
| | Dimension | SWE-bench | AgentDebuggerEnv | |
| |---|---|---| |
| | Evaluation target | Final patch quality | Full reasoning trajectory | |
| | Feedback | None (single shot) | Real `stdout`/`stderr` after every fix attempt | |
| | Reward signal | Binary (pass/fail) | Dense: every step is scored | |
| | What's measured | Code generation | Hypothesis formation + iterative reasoning | |
| | Hard task | Applies existing patch | Must design a test to surface a hidden bug | |
|
|
| The iterative feedback loop is the core mechanic. Every `step()` call executes the agent's submitted code in a live sandbox and returns the actual test output. The agent must then update its theory and try again, exactly like a real developer at a terminal. |
|
|
| --- |
|
|
| ## Environment Overview |
|
|
| AgentDebuggerEnv is a fully OpenEnv-compliant environment exposing the standard three-method API: |
|
|
| ``` |
| reset(task_id)  → initial Observation |
| step(action)    → Observation, Reward, done, info |
| state()         → current internal state dict |
| ``` |
|
|
| The environment is deployed as a containerized FastAPI server on HuggingFace Spaces, passes `openenv validate`, and includes a fully reproducible baseline inference script. |
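The reset/step loop can be driven over HTTP with a few lines of client code. The sketch below is illustrative only: `http_post`, `run_episode`, and `choose_action` are hypothetical names, and the assumed reply shapes (`/reset` returning the observation directly, `/step` returning a JSON object with `observation`, `reward`, `done`, and `info` keys) are assumptions rather than the documented wire format.

```python
import json
from typing import Any, Callable, Dict, List
from urllib import request


def http_post(base_url: str, path: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """POST a JSON payload to the environment server and decode the JSON reply."""
    req = request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(
    post: Callable[[str, Dict[str, Any]], Dict[str, Any]],
    task_id: str,
    choose_action: Callable[[Dict[str, Any]], Dict[str, Any]],
    max_steps: int = 8,
) -> List[Any]:
    """Reset the task, then step until the episode ends or the budget runs out."""
    observation = post("/reset", {"task_id": task_id})  # assumed: reply is the observation
    rewards = []
    for _ in range(max_steps):
        action = choose_action(observation)
        reply = post("/step", action)
        # Assumed reply shape: {"observation": ..., "reward": ..., "done": ..., "info": ...}
        observation = reply["observation"]
        rewards.append(reply["reward"])
        if reply["done"]:
            break
    return rewards
```

Against a local server this could be called as `run_episode(lambda path, body: http_post("http://localhost:8000", path, body), "easy", my_policy)`, where `my_policy` maps an observation to an Action dict. Passing the transport in as a callable keeps the loop testable without a running server.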
|
|
| **Live Space:** https://huggingface.co/spaces/shashaank0707/AgentDebugger-env |
|
|
| --- |
|
|
| ## Project Structure |
|
|
| ``` |
| AgentDebuggerEnv/ |
| ├── inference.py              # Baseline inference script (in root; hackathon requirement) |
| ├── env/ |
| │   ├── environment.py        # Core OpenEnv class: reset(), step(), state() |
| │   ├── models.py             # Pydantic v2 Observation, Action, Reward models |
| │   ├── sandbox.py            # AST-based sandboxed code execution |
| │   ├── server.py             # FastAPI server: /reset, /step, /state, /health, /tasks |
| │   ├── tasks/ |
| │   │   ├── registry.py       # Task registry |
| │   │   ├── task_easy.py      # Off-by-one bug in binary search |
| │   │   ├── task_medium.py    # Red herring authentication bug |
| │   │   └── task_hard.py      # Concurrency race condition |
| │   └── graders/ |
| │       ├── base_grader.py    # Abstract base grader |
| │       ├── grader_easy.py    # Standard test-pass + efficiency scoring |
| │       ├── grader_medium.py  # Red herring detection + score floor fix |
| │       └── grader_hard.py    # Sequential + concurrent stress test scoring |
| ├── server/ |
| │   └── app.py                # Entry point alias for openenv validate |
| ├── tests/ |
| │   ├── test_environment.py |
| │   ├── test_sandbox.py |
| │   └── test_graders.py |
| ├── openenv.yaml              # OpenEnv spec metadata |
| ├── Dockerfile |
| ├── requirements.txt |
| ├── pyproject.toml |
| ├── uv.lock                   # Reproducible dependency resolution |
| └── .gitignore |
| ``` |
|
|
| --- |
|
|
| ## Data Models |
|
|
| ### Observation |
|
|
| Everything the agent sees at each step. Designed to give the agent exactly what a developer sees when debugging, no more and no less. |
|
|
| ```python |
| from typing import List |
| |
| from pydantic import BaseModel |
| |
| |
| class FixAttempt(BaseModel): |
|     attempt_number: int          # 1-indexed |
|     code_submitted: str          # Full code the agent submitted |
|     hypothesis: str              # Agent's stated theory before this attempt |
|     execution_output: str        # Full stdout + stderr from sandbox |
|     tests_passed: int |
|     tests_total: int |
|     execution_time_ms: int |
|     timed_out: bool |
| |
| |
| class Observation(BaseModel): |
|     # Fixed for the episode |
|     task_id: str                 # "easy" | "medium" | "hard" |
|     task_description: str |
|     buggy_code: str              # Original broken code; always visible |
|     test_suite: str              # Full test file; agent can read requirements |
|     initial_error_output: str    # Sandbox output on the buggy code at reset() |
| |
|     # Changes each step |
|     current_code: str            # Most recent submitted code |
|     current_error_output: str    # Test output on current_code |
|     tests_passed: int |
|     tests_total: int |
|     previous_attempts: List[FixAttempt]  # Full episode history |
| |
|     # Budget tracking |
|     attempts_remaining: int |
|     max_attempts: int |
|     step_number: int |
|     max_steps: int |
|     done: bool |
|     score_estimate: float        # Running grader estimate shown to agent |
|     hint_used: bool |
| ``` |
|
|
| ### Action |
|
|
| The agent submits exactly one action per step. Three types: |
|
|
| ```python |
| from typing import Optional |
| |
| from pydantic import BaseModel |
| |
| |
| class Action(BaseModel): |
|     action_type: str                      # "submit_fix" | "query_context" | "give_up" |
| |
|     # submit_fix: primary action |
|     fixed_code: Optional[str] = None      # Complete corrected code file |
|     hypothesis: Optional[str] = None      # Required for submit_fix; omitting it costs -0.10 reward |
| |
|     # query_context: request more information (first is free) |
|     query_type: Optional[str] = None      # "function_signature" | "related_code" |
|                                           # | "error_explanation" | "test_details" |
|     query_target: Optional[str] = None |
| |
|     # give_up: explicit surrender, ends episode cleanly |
|     final_diagnosis: Optional[str] = None |
| ``` |
|
|
| ### Reward |
|
|
| A dense signal at every step, not just a binary end-of-episode result. |
|
|
| ```python |
| from typing import Dict |
| |
| from pydantic import BaseModel |
| |
| |
| class Reward(BaseModel): |
|     step_reward: float            # This step: -1.0 to +1.0 |
|     cumulative_reward: float      # Episode total so far |
|     grader_score: float           # 0.0 during episode; official score on terminal step |
|     breakdown: Dict[str, float]   # Itemized components for interpretability |
| ``` |
|
|
| --- |
|
|
| ## Reward Function |
|
|
| The reward function is designed so an RL agent receives meaningful signal at every step, not just when tests pass. |
|
|
| ### Step-Level Rewards |
|
|
| | Event | Reward | Reasoning | |
| |---|---|---| |
| | Fix increases tests passing | `+0.15 × (Δpassed / total)` | Scaled progress reward | |
| | Fix decreases tests passing | `-0.10 × (Δfailed / total)` | Regression penalty | |
| | Fix makes no change | `-0.05` | Stagnation penalty; discourages repetition | |
| | All tests pass | `+0.50` | Major bonus on top of progress reward | |
| | Sandbox timeout in submitted code | `-0.10` | Penalizes infinite loops | |
| | `submit_fix` without hypothesis | `-0.10` | Hypothesis is required | |
| | Repeated `query_context` calls | `-0.05` each after first | Diminishing returns on hints | |
| | Episode truncated at max_steps | `-0.20` | Penalizes indecision | |
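As a worked illustration, the rules for a single `submit_fix` step can be combined into one function. This is a sketch, not the environment's actual code: the function name `step_reward`, the additive combination of the rows, and the final clamp to the range -1.0 to +1.0 are assumptions. Only the individual magnitudes come from the table.

```python
def step_reward(prev_passed, new_passed, total, *, timed_out=False, has_hypothesis=True):
    """Combine the step-level rules for one submit_fix action (illustrative only)."""
    if not has_hypothesis:
        # A fix without a hypothesis is penalized and not executed.
        return -0.10

    reward = 0.0
    delta = new_passed - prev_passed
    if delta > 0:
        reward += 0.15 * (delta / total)    # scaled progress reward
    elif delta < 0:
        reward -= 0.10 * (-delta / total)   # regression penalty
    else:
        reward -= 0.05                      # stagnation penalty

    if new_passed == total:
        reward += 0.50                      # bonus when every test passes
    if timed_out:
        reward -= 0.10                      # sandbox timeout penalty

    # Assumed clamp, matching the documented step_reward range.
    return max(-1.0, min(1.0, reward))
```

For example, moving from 4 to 8 passing tests out of 8 yields `0.15 × 0.5 + 0.50 = 0.575` under these assumptions.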
| |
| ### Episode-Level Grader Score (0.0 to 1.0) |
| |
| ``` |
| grader_score = test_pass_ratio      × 0.60 |
|              + efficiency_bonus     × 0.20 |
|              + hypothesis_accuracy  × 0.15 |
|              + early_solve_bonus    × 0.05 |
|
|
| where: |
|   test_pass_ratio     = agent_best_tests_passed / tests_total |
|                         (from agent submissions only, not the initial buggy code) |
|   efficiency_bonus    = max(0, (max_attempts - attempts_used) / max_attempts) |
|   hypothesis_accuracy = fraction of hypotheses correctly identifying bug location |
|   early_solve_bonus   = 1.0 if all tests pass within ceil(max_attempts / 3) attempts, else 0.0 |
| ``` |
| |
| **Score floor design:** `test_pass_ratio` is calculated only from the agent's submitted attempts, never from the initial buggy code run. This guarantees that a dummy agent that submits nothing scores 0.0, not an inflated baseline. |
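The weighted sum can be checked with a small calculation. The sketch below assumes `early_solve_bonus` acts as a 0/1 indicator that is then weighted by 0.05, so the four weights sum to 1.0. The function name and the `solved_at_attempt` parameter are hypothetical.

```python
import math


def grader_score(best_passed, tests_total, attempts_used, max_attempts,
                 hypothesis_accuracy, solved_at_attempt=None):
    """Weighted episode score from the four documented components (illustrative)."""
    test_pass_ratio = best_passed / tests_total
    efficiency_bonus = max(0.0, (max_attempts - attempts_used) / max_attempts)

    # Assumption: the early-solve term is an indicator, weighted by 0.05 below.
    early_cutoff = math.ceil(max_attempts / 3)
    solved_early = solved_at_attempt is not None and solved_at_attempt <= early_cutoff
    early_solve_bonus = 1.0 if solved_early else 0.0

    return (0.60 * test_pass_ratio
            + 0.20 * efficiency_bonus
            + 0.15 * hypothesis_accuracy
            + 0.05 * early_solve_bonus)
```

Worked example for the easy task (8 tests, 5 attempts allowed): solving everything on attempt 2 with a fully accurate hypothesis gives `0.60 + 0.20 × (3/5) + 0.15 + 0.05 = 0.92`, since attempt 2 falls within `ceil(5 / 3) = 2`.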
|
|
| --- |
|
|
| ## Tasks |
|
|
| ### Task 1 (Easy): Off-by-One Bug |
|
|
| **Difficulty:** 🟢 Easy | **Max attempts:** 5 | **Max steps:** 8 | **Tests:** 8 |
|
|
| A binary search implementation with a single-character bug: the while loop uses `left < right` instead of `left <= right`. The loop therefore exits as soon as the search narrows to a single candidate, without ever checking it, so the function misses the target in cases such as the last element of the array. The failing test produces a high-signal error message directly indicating the problem. |
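The effect of the loop condition is easy to reproduce. The two functions below are a reconstruction for illustration, not the task file itself; the names `binary_search_buggy` and `binary_search_fixed` are hypothetical.

```python
def binary_search_buggy(arr, target):
    """Binary search with the off-by-one loop condition described above."""
    left, right = 0, len(arr) - 1
    while left < right:  # bug: stops when one candidate is left unchecked
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1


def binary_search_fixed(arr, target):
    """Same search with the inclusive loop condition."""
    left, right = 0, len(arr) - 1
    while left <= right:  # fix: the final single candidate is still examined
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1
```

On `[1, 3, 5, 7]` with target `7`, the buggy version narrows to `left == right == 3` and exits with `-1`, while the fixed version returns index `3`.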
|
|
| **Why it's easy:** The error message names the failing assertion. Reading the while condition reveals the bug. One to two iterations expected for any competent agent. |
|
|
| **What the grader checks:** Did the agent fix all 8 tests? Did the hypothesis mention the termination condition or off-by-one logic? Was it solved efficiently? |
|
|
| **Expected GPT-4o baseline:** ~0.85 |
|
|
| --- |
|
|
| ### Task 2 (Medium): Red Herring Authentication Bug |
|
|
| **Difficulty:** 🟡 Medium | **Max attempts:** 7 | **Max steps:** 15 | **Tests:** 10 (6 pass, 4 fail on buggy code) |
|
|
| An authentication module with three interdependent functions: `hash_password`, `validate_password`, and `authenticate_user`. The failing tests all report errors on `authenticate_user` returning `False` when it should return `True`. However, `authenticate_user` is completely correct. So is `validate_password`. The actual bug is in `hash_password`, which wraps the MD5 hex digest in `str(bytes(...))`. This produces a string of the form `"b'...'"`, so the stored hash no longer matches the plain hex digest. |
|
|
| The red herring: the error message names `authenticate_user`. Every surface-level reading of the error points to the wrong function. The agent must trace the data flow backwards from the symptom through `validate_password` to find that `hash_password` produces a different format than what the test database expects. |
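The format mismatch is easy to see in isolation. The snippet below is a guess at the shape of the bug based on the description above, not the task's actual source; `hash_password_buggy` and `hash_password_fixed` are hypothetical names.

```python
import hashlib


def hash_password_buggy(password: str) -> str:
    """Return the digest wrapped in a bytes repr, as the description suggests."""
    digest = hashlib.md5(password.encode("utf-8")).hexdigest()
    # str() of a bytes object yields its repr, e.g. "b'5f4d...'", not the text itself.
    return str(bytes(digest, "utf-8"))


def hash_password_fixed(password: str) -> str:
    """Return the plain 32-character hex digest."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()
```

Any comparison against a stored plain hex digest fails for the buggy version, which is why the symptom surfaces two calls later in `authenticate_user`.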
|
|
| **Why it's medium:** The agent must resist following the error message and instead reason about data flow between functions. GPT-4o follows this red herring approximately 40% of the time. |
|
|
| **Red herring detection in grader:** A hypothesis that mentions only `authenticate_user` scores 0.0 for hypothesis accuracy. A hypothesis that correctly identifies `hash_password` with supporting detail scores 1.0. |
|
|
| **Expected GPT-4o baseline:** ~0.50 |
|
|
| --- |
|
|
| ### Task 3 (Hard): Concurrency Race Condition |
|
|
| **Difficulty:** 🔴 Hard | **Max attempts:** 10 | **Max steps:** 25 | **Tests:** 8 (all 8 pass on buggy code) |
|
|
| A `ConnectionCounter` class used in a web server to track active connections. It uses `threading.Lock` and appears to be correctly implemented. All 8 sequential unit tests pass. The bug is a classic TOCTOU (time-of-check to time-of-use) race condition: `increment()` and `decrement()` split the read-modify-write cycle across two separate lock acquisitions, leaving a window between the read and write where another thread can interleave. |
|
|
| ```python |
| def increment(self): |
|     with self._lock: |
|         current = self.count   # read; lock is released when this block ends |
|     new_val = current + 1      # modify; no lock held |
|     with self._lock: |
|         self.count = new_val   # write; may overwrite another thread's update |
| ``` |
|
|
| The agent must: (1) recognize that 8/8 passing tests do not prove correctness for concurrent code, (2) reason about thread interleaving, (3) design a concurrent stress test that surfaces the race, (4) fix the atomicity issue by collapsing read-modify-write into a single lock scope, and (5) verify the fix passes both the original tests and a 1000-thread concurrent stress test. |
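Steps (3) and (4) can be sketched together: a counter whose read-modify-write sits inside one lock scope, plus a stress test that checks for lost updates. The class name `FixedConnectionCounter` and the helper `stress_test` are hypothetical, and the thread and iteration counts are illustrative defaults rather than the grader's actual settings.

```python
import threading


class FixedConnectionCounter:
    """Counter whose read-modify-write happens inside a single lock scope."""

    def __init__(self):
        self.count = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.count += 1  # read, modify, and write under one acquisition

    def decrement(self):
        with self._lock:
            self.count -= 1


def stress_test(counter, num_threads=1000, increments_per_thread=100):
    """Return True if no increments were lost under concurrent load."""
    def worker():
        for _ in range(increments_per_thread):
            counter.increment()

    start = counter.count
    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.count - start == num_threads * increments_per_thread
```

Running the same `stress_test` against the split-lock version is expected to fail intermittently, which is why a single passing run is weak evidence for a fix.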
|
|
| **Why it's hard:** Race conditions are non-deterministic. The bug does not manifest in sequential execution. The agent must demonstrate meta-reasoning about the limits of the existing test suite, a capability that current frontier models rarely show (GPT-4o solves this task in about 20% of baseline runs). |
|
|
| **Hard task grader breakdown:** |
| - Sequential tests pass: 0.40 (agent submissions only) |
| - 1000-thread concurrent stress test passes: 0.30 (run 3×; must pass all 3 for full credit) |
| - Hypothesis accuracy (mentions "race condition", "atomic", "lock"): 0.20 |
| - Efficiency bonus (fixed within 5 attempts): 0.10 |
|
|
| **Expected GPT-4o baseline:** ~0.18 |
|
|
| --- |
|
|
| ## Security Sandbox |
|
|
| Every `submit_fix` action executes agent-generated Python code. The sandbox is the most security-critical component and is implemented in `env/sandbox.py`. |
|
|
| ### Multi-Layer Protection |
|
|
| **Layer 1: AST Import Filtering.** Before any code runs, an AST pass walks the submitted code and detects blocked imports. Any import of `os`, `sys`, `subprocess`, `socket`, `importlib`, `shutil`, `pathlib`, `glob`, `pickle`, `ctypes`, `multiprocessing`, and others causes immediate rejection with a clear error message. This uses `ast.parse()` + `ast.walk()` rather than string matching, which can be bypassed. |
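A minimal sketch of this kind of import filter, using only `ast.parse()` and `ast.walk()` from the standard library, is shown below. The function name `find_blocked_imports` and the exact contents of `BLOCKED_MODULES` are assumptions; the real list in `env/sandbox.py` is longer ("and others").

```python
import ast

# Assumed subset of the blocked list named in the text above.
BLOCKED_MODULES = frozenset({
    "os", "sys", "subprocess", "socket", "importlib", "shutil",
    "pathlib", "glob", "pickle", "ctypes", "multiprocessing",
})


def find_blocked_imports(source, allow_threading=False):
    """Return the sorted top-level names of blocked modules imported by source."""
    blocked = set(BLOCKED_MODULES)
    if not allow_threading:
        blocked.add("threading")

    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "os.path" counts as "os"
            if root in blocked:
                found.add(root)
    return sorted(found)
```

A filter like this only sees `import` statements. Dynamic imports such as `__import__("os")` would need a separate check, which is one reason the later layers exist.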
|
|
| **Layer 2: Subprocess Isolation.** Code runs in a separate subprocess, not in the server process. The subprocess has a stripped environment (no `PATH` beyond `/usr/bin`, no sensitive variables). Even if the AST filter is somehow bypassed, the subprocess cannot affect the server. |
|
|
| **Layer 3: Hard Timeout.** Every execution is killed after 10 seconds via `subprocess.run(timeout=10)`. Infinite loops in submitted code return `timed_out: True` and a `-0.10` step reward. |
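Layers 2 and 3 can be illustrated together with `subprocess.run`. This is a sketch under stated assumptions, not the code in `env/sandbox.py`: the function name `run_sandboxed`, the returned dict shape, and the use of Python's `-I` isolated mode are all illustrative choices, and the stripped environment shown here assumes a POSIX system.

```python
import subprocess
import sys


def run_sandboxed(code, timeout_s=10.0):
    """Run code in a child interpreter with a minimal environment and a hard timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores user site and env vars
            capture_output=True,
            text=True,
            timeout=timeout_s,                   # child is killed when this expires
            env={"PATH": "/usr/bin"},            # stripped environment (assumed contents)
        )
    except subprocess.TimeoutExpired:
        return {"output": "", "timed_out": True}
    return {"output": proc.stdout + proc.stderr, "timed_out": False}
```

Calling `run_sandboxed("while True: pass", timeout_s=1)` returns with `timed_out` set to `True` after about one second instead of hanging the server.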
|
|
| **Layer 4: Memory Limit.** 256 MB per execution via environment isolation. |
|
|
| **Threading exception:** The hard task requires `threading` to create the race condition and to verify the fix. The sandbox accepts an `allow_threading=True` flag that removes `threading` from the blocked list for that task only. All other tasks have threading blocked. |
|
|
| --- |
|
|
| ## API Endpoints |
|
|
| The environment is served as a FastAPI application on port 8000. |
|
|
| | Endpoint | Method | Description | |
| |---|---|---| |
| | `/` | GET | API overview: lists all endpoints and tasks | |
| | `/health` | GET | Health check; always returns HTTP 200 | |
| | `/tasks` | GET | List all tasks with full metadata | |
| | `/reset` | POST | Start a new episode. Body: `{"task_id": "easy"}` | |
| | `/step` | POST | Submit one action. Body: Action JSON | |
| | `/state` | GET | Full internal episode state | |
|
|
| All endpoints always return HTTP 200. Errors appear in the response body under `info["error"]`, never as HTTP 4xx/5xx responses. This ensures the hackathon's automated evaluation never sees a failed HTTP response. |
|
|
| --- |
|
|
| ## OpenEnv Compliance |
|
|
| ```yaml |
| # openenv.yaml |
| name: agentdebugger-env |
| version: 1.0.0 |
| domain: software_engineering |
| observation_type: structured |
| action_type: structured |
| reward_type: dense |
| episode_termination: action_or_step_limit |
| tasks: |
|   - {id: easy, difficulty: easy, max_steps: 8, max_attempts: 5} |
|   - {id: medium, difficulty: medium, max_steps: 15, max_attempts: 7} |
|   - {id: hard, difficulty: hard, max_steps: 25, max_attempts: 10} |
| ``` |
|
|
| Validation output: |
| ``` |
| ✓ openenv.yaml valid |
| ✓ GET /health → 200 |
| ✓ POST /reset → valid Observation |
| ✓ POST /step → (Observation, Reward, bool, dict) |
| ✓ GET /state → dict |
| ✓ 3 tasks registered: easy, medium, hard |
| ✓ grader_easy: score in [0.0, 1.0] - PASS |
| ✓ grader_medium: score in [0.0, 1.0] - PASS |
| ✓ grader_hard: score in [0.0, 1.0] - PASS |
| ✓ inference.py present in root directory |
| openenv validate: PASSED |
| ``` |
|
|
| --- |
|
|
| ## Baseline Results |
|
|
| Evaluated using `gpt-4o` with zero-shot chain-of-thought prompting. Each task was run 5 times independently and the scores were averaged. |
|
|
| | Task | Difficulty | Mean Score | Std Dev | Solved % | Avg Attempts | Avg Steps | |
| |---|---|---|---|---|---|---| |
| | Off-by-One Bug | Easy | 0.85 | ±0.04 | 100% | 1.8 | 4.2 | |
| | Red Herring Auth | Medium | 0.50 | ±0.10 | 60% | 4.2 | 10.6 | |
| | Race Condition | Hard | 0.18 | ±0.09 | 20% | 8.7 | 22.1 | |
| | **Overall Mean** | | **0.51** | | **60%** | | | |
|
|
| **Key observations:** |
|
|
| **Easy task:** GPT-4o reads the error message, identifies the off-by-one in the while condition on the first or second attempt, and fixes correctly. Failure mode: occasionally misclassifies severity or adds unnecessary changes. |
|
|
| **Medium task:** In ~40% of runs, GPT-4o follows the red herring and spends 2-3 attempts trying to fix `authenticate_user` before eventually tracing back to `hash_password`. When it identifies the correct function immediately, it solves efficiently. The hypothesis accuracy score penalizes the red-herring runs significantly. |
|
|
| **Hard task:** GPT-4o almost never spontaneously recognizes that a race condition can exist when all sequential tests pass. In the rare runs where it does solve it (~20%), it correctly identifies that the lock scope must encompass the entire read-modify-write cycle. The 1000-thread concurrent stress test filters out partial fixes where the race window is narrowed but not eliminated. |
|
|
| --- |
|
|
| ## Setup & Usage |
|
|
| ### Local Development |
|
|
| ```bash |
| git clone https://github.com/shasshaank/AgentDebuggerEnv |
| cd AgentDebuggerEnv |
| pip install -r requirements.txt |
| |
| # Start the environment server |
| uvicorn env.server:app --reload --port 8000 |
| |
| # Verify it's running |
| curl http://localhost:8000/health |
| # {"status": "ok", "environment": "agentdebugger-env", "version": "1.0.0"} |
| |
| # Run baseline inference |
| export API_BASE_URL="https://api.openai.com/v1" |
| export MODEL_NAME="gpt-4o" |
| export HF_TOKEN="your_openai_api_key" |
| export ENV_BASE_URL="http://localhost:8000" |
| python inference.py |
| ``` |
|
|
| ### Docker |
|
|
| ```bash |
| # Build |
| docker build -t agentdebugger-env . |
| |
| # Run |
| docker run -p 8000:8000 agentdebugger-env |
| |
| # Run with inference against the containerized environment |
| docker run -p 8000:8000 \ |
| -e API_BASE_URL="https://api.openai.com/v1" \ |
| -e MODEL_NAME="gpt-4o" \ |
| -e HF_TOKEN="your_key" \ |
| agentdebugger-env |
| ``` |
|
|
| ### Quick API Test |
|
|
| ```bash |
| # Reset the easy task |
| curl -X POST http://localhost:8000/reset \ |
| -H "Content-Type: application/json" \ |
| -d '{"task_id": "easy"}' |
| |
| # Submit a fix with hypothesis |
| curl -X POST http://localhost:8000/step \ |
| -H "Content-Type: application/json" \ |
| -d '{ |
| "action_type": "submit_fix", |
| "fixed_code": "def binary_search(arr, target):\n left, right = 0, len(arr) - 1\n while left <= right:\n mid = (left + right) // 2\n if arr[mid] == target:\n return mid\n elif arr[mid] < target:\n left = mid + 1\n else:\n right = mid - 1\n return -1", |
| "hypothesis": "The while loop uses left < right instead of left <= right, causing it to skip the last element." |
| }' |
| ``` |
|
|
| --- |
|
|
| ## Why This Environment Matters for Agent Research |
|
|
| Four specific failure modes in LLM agents are measurable and scorable here for the first time: |
|
|
| **1. Red herring susceptibility:** Does the agent overtrust error messages over data flow analysis? The medium task's `hypothesis_accuracy` score measures this directly. An agent that follows the red herring scores 0.0 on hypothesis accuracy even if it eventually finds the correct fix by trial and error. |
|
|
| **2. Stagnation under uncertainty:** Does the agent repeat the same failed fix strategy instead of updating its hypothesis? The `-0.05` stagnation penalty and `hypothesis_accuracy` score together capture this. An agent that submits the same code twice is penalized twice. |
|
|
| **3. Exploration vs. exploitation:** The `query_context` action costs a step but provides information. The first query is free; subsequent ones cost `-0.05`. Agents that query productively before attempting a fix demonstrate better exploration behavior than those that immediately submit wrong fixes. |
|
|
| **4. Test-suite as sufficient proof:** The hard task is specifically designed to test whether an agent knows when passing tests are not enough. An agent that sees 8/8 tests passing and immediately approves the code, without recognizing the concurrency issue, scores at most 0.40 and fails the most important grader component. |
|
|
| All four failure modes produce distinct, interpretable score components in the `breakdown` field of every `Reward` response. This makes AgentDebuggerEnv useful not just as a benchmark but as a diagnostic tool for understanding where a specific model fails in iterative reasoning. |
|
|
| --- |
|
|
| ## Design Decisions |
|
|
| **Why require a hypothesis?** The `hypothesis` field is mandatory on every `submit_fix` action. Missing it costs `-0.10` and the fix is not executed. This forces agents to articulate their reasoning, which enables the grader to score `hypothesis_accuracy` separately from `test_pass_ratio`. It also prevents degenerate strategies of submitting random code until something passes. |
|
|
| **Why is `best_tests_passed` calculated from agent attempts only?** The medium and hard buggy codes start with 6/10 and 8/8 tests passing respectively. If the grader used the environment's `best_tests_passed` (which includes the initial buggy code run), a dummy agent that submits nothing would score 0.36 and 0.40 for free. The grader recalculates from the `attempts` list, which contains only what the agent actually submitted, ensuring the score floor is 0.0. |
|
|
| **Why run the concurrent stress test 3 times?** Race conditions are non-deterministic. A partial fix that narrows the race window (but doesn't eliminate it) might pass once by luck. Requiring all 3 runs to pass filters out lucky partial fixes. A fix that passes 1 of 3 receives 0.15: partial credit for progress, but not full credit. |
|
|
| **Why not use pytest directly?** Using pytest as the test runner would make output parsing dependent on pytest's output format and version. The environment uses a custom lightweight test runner written as a Python string executed in the sandbox, producing a consistent `"N passed, M failed"` format that `_parse_tests_passed()` can reliably parse across all platforms. |
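Parsing a fixed summary format takes only a small regular expression. The function below is a hypothetical stand-in for `_parse_tests_passed()`, whose real implementation is not shown here; the return shape `(passed, total)` and the `(0, 0)` fallback are assumptions.

```python
import re

# Matches the runner's summary line, e.g. "7 passed, 1 failed".
_SUMMARY_RE = re.compile(r"(\d+) passed, (\d+) failed")


def parse_tests_passed(output):
    """Return (passed, total) from the last summary line, or (0, 0) if none is found."""
    matches = _SUMMARY_RE.findall(output)
    if not matches:
        return (0, 0)  # e.g. the code crashed before the runner printed a summary
    passed, failed = (int(value) for value in matches[-1])
    return (passed, passed + failed)
```

Taking the last match guards against submitted code that prints a look-alike line before the runner's own summary.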
|
|
| --- |
|
|
| ## Environment Configuration |
|
|
| ```bash |
| # Required for inference.py |
| API_BASE_URL # LLM API endpoint (e.g. https://api.openai.com/v1) |
| MODEL_NAME # Model identifier (e.g. gpt-4o) |
| HF_TOKEN # API key / HuggingFace token |
| |
| # Optional; defaults to localhost:8000 |
| ENV_BASE_URL # Environment server URL |
| ``` |
|
|
| --- |
|
|
| ## License & Attribution |
|
|
| **License:** MIT; see [LICENSE](LICENSE) |
|
|
| **Authors:** Pranav, Shashaank (Team Endurance) |
|
|
| **Submitted to:** Meta + PyTorch + HuggingFace OpenEnv Hackathon |
|
|
| **Live Environment:** https://huggingface.co/spaces/shashaank0707/AgentDebugger-env |
|
|