
AgentDebuggerEnv 🐛

A live, iterative debugging environment for benchmarking agentic reasoning in AI systems. Submitted to the Meta + PyTorch + HuggingFace OpenEnv Hackathon.

HuggingFace Space | OpenEnv Compliant | License: MIT | Python 3.10+ | FastAPI


The Problem with Existing Code Benchmarks

Benchmarks like HumanEval, MBPP, and even SWE-bench share a fundamental limitation: they are one-shot evaluations. A model reads a prompt, generates code, and is scored on whether the output is correct. This measures code generation ability, not debugging ability.

Real software engineering is not one-shot. It is iterative. A developer:

  1. Reads failing tests and error output
  2. Forms a hypothesis about the root cause
  3. Submits a fix
  4. Reads the new error output
  5. Updates their hypothesis
  6. Repeats, sometimes many times

No existing benchmark measures this loop. AgentDebuggerEnv does.


What Makes This Different from SWE-bench

SWE-bench gives an agent a static GitHub issue and measures only the final patch correctness. AgentDebuggerEnv differs along five dimensions:

| Dimension | SWE-bench | AgentDebuggerEnv |
|---|---|---|
| Evaluation target | Final patch quality | Full reasoning trajectory |
| Feedback | None (single shot) | Real stdout/stderr after every fix attempt |
| Reward signal | Binary (pass/fail) | Dense: every step is scored |
| What's measured | Code generation | Hypothesis formation + iterative reasoning |
| Hard task | Applies existing patch | Must design a test to surface a hidden bug |

The iterative feedback loop is the core mechanic. Every step() call executes the agent's submitted code in a live sandbox and returns the actual test output; the agent must then update its theory and try again, exactly like a real developer at a terminal.


Environment Overview

AgentDebuggerEnv is a fully OpenEnv-compliant environment exposing the standard three-method API:

reset(task_id)  →  initial Observation
step(action)    →  Observation, Reward, done, info
state()         →  current internal state dict

The environment is deployed as a containerized FastAPI server on HuggingFace Spaces, passes openenv validate, and includes a fully reproducible baseline inference script.
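
The three-method contract can be sketched as a small driver loop. This is a minimal illustration, not the project's code: `StubEnv` and `run_episode` are hypothetical names, the stub returns a plain float where the real environment returns a Reward model, and its only termination rules are `give_up` or a step limit.

```python
from typing import Any, Callable, Dict, Tuple

Observation = Dict[str, Any]
StepResult = Tuple[Observation, float, bool, Dict[str, Any]]


class StubEnv:
    """Hypothetical stand-in for the real environment (illustration only)."""

    def __init__(self, max_steps: int = 3) -> None:
        self.max_steps = max_steps
        self.step_number = 0

    def reset(self, task_id: str) -> Observation:
        self.step_number = 0
        return {"task_id": task_id, "step_number": 0, "done": False}

    def step(self, action: Dict[str, Any]) -> StepResult:
        self.step_number += 1
        # Episode ends on explicit surrender or when the step budget is spent.
        done = action["action_type"] == "give_up" or self.step_number >= self.max_steps
        obs = {"step_number": self.step_number, "done": done}
        return obs, 0.0, done, {}

    def state(self) -> Dict[str, Any]:
        return {"step_number": self.step_number}


def run_episode(env: Any, policy: Callable[[Observation], Dict[str, Any]], task_id: str) -> float:
    """Reset once, then call step() until the environment reports done."""
    obs = env.reset(task_id)
    total, done = 0.0, False
    while not done:
        obs, reward, done, _info = env.step(policy(obs))
        total += reward
    return total
```

A real agent would replace the policy callable with a function that reads the observation's error output and returns a `submit_fix` action.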

Live Space: https://huggingface.co/spaces/shashaank0707/AgentDebugger-env


Project Structure

AgentDebuggerEnv/
├── inference.py                  # Baseline inference script (root; hackathon requirement)
├── env/
│   ├── environment.py            # Core OpenEnv class: reset(), step(), state()
│   ├── models.py                 # Pydantic v2 Observation, Action, Reward models
│   ├── sandbox.py                # AST-based sandboxed code execution
│   ├── server.py                 # FastAPI server: /reset, /step, /state, /health, /tasks
│   ├── tasks/
│   │   ├── registry.py           # Task registry
│   │   ├── task_easy.py          # Off-by-one bug in binary search
│   │   ├── task_medium.py        # Red herring authentication bug
│   │   └── task_hard.py          # Concurrency race condition
│   └── graders/
│       ├── base_grader.py        # Abstract base grader
│       ├── grader_easy.py        # Standard test-pass + efficiency scoring
│       ├── grader_medium.py      # Red herring detection + score floor fix
│       └── grader_hard.py        # Sequential + concurrent stress test scoring
├── server/
│   └── app.py                    # Entry point alias for openenv validate
├── tests/
│   ├── test_environment.py
│   ├── test_sandbox.py
│   └── test_graders.py
├── openenv.yaml                  # OpenEnv spec metadata
├── Dockerfile
├── requirements.txt
├── pyproject.toml
├── uv.lock                       # Reproducible dependency resolution
└── .gitignore

Data Models

Observation

Everything the agent sees at each step. Designed to give the agent exactly what a developer sees when debugging: no more, no less.

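from typing import Dict, List, Optional  # used by the models in this section

from pydantic import BaseModel

class FixAttempt(BaseModel):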
    attempt_number: int       # 1-indexed
    code_submitted: str       # Full code the agent submitted
    hypothesis: str           # Agent's stated theory before this attempt
    execution_output: str     # Full stdout + stderr from sandbox
    tests_passed: int
    tests_total: int
    execution_time_ms: int
    timed_out: bool

class Observation(BaseModel):
    # Fixed for the episode
    task_id: str              # "easy" | "medium" | "hard"
    task_description: str
    buggy_code: str           # Original broken code, always visible
    test_suite: str           # Full test file, so the agent can read requirements
    initial_error_output: str # Sandbox output on the buggy code at reset()

    # Changes each step
    current_code: str         # Most recent submitted code
    current_error_output: str # Test output on current_code
    tests_passed: int
    tests_total: int
    previous_attempts: List[FixAttempt]  # Full episode history

    # Budget tracking
    attempts_remaining: int
    max_attempts: int
    step_number: int
    max_steps: int
    done: bool
    score_estimate: float     # Running grader estimate shown to agent
    hint_used: bool

Action

The agent submits exactly one action per step. Three types:

class Action(BaseModel):
    action_type: str          # "submit_fix" | "query_context" | "give_up"

    # submit_fix: primary action
    fixed_code: Optional[str] = None      # Complete corrected code file
    hypothesis: Optional[str] = None      # REQUIRED: omitting it costs -0.10 reward

    # query_context: request more information (first is free)
    query_type: Optional[str] = None      # "function_signature" | "related_code"
                                          # | "error_explanation" | "test_details"
    query_target: Optional[str] = None

    # give_up: explicit surrender, ends episode cleanly
    final_diagnosis: Optional[str] = None
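
As a rough illustration of the three action types, the helpers below build JSON-ready payloads with the field names from the Action model. The helper names and the client-side checks are assumptions made for this sketch; the environment itself is what enforces the rules.

```python
import json
from typing import Any, Dict

# Query types listed in the Action model's comments.
QUERY_TYPES = {"function_signature", "related_code", "error_explanation", "test_details"}


def submit_fix(fixed_code: str, hypothesis: str) -> Dict[str, Any]:
    """Primary action: a complete corrected file plus the theory behind it."""
    if not hypothesis.strip():
        # Client-side guard (assumption): avoids the -0.10 missing-hypothesis penalty.
        raise ValueError("hypothesis is required for submit_fix")
    return {"action_type": "submit_fix", "fixed_code": fixed_code, "hypothesis": hypothesis}


def query_context(query_type: str, query_target: str) -> Dict[str, Any]:
    """Request more information about a named target."""
    if query_type not in QUERY_TYPES:
        raise ValueError(f"unknown query_type: {query_type!r}")
    return {"action_type": "query_context", "query_type": query_type, "query_target": query_target}


def give_up(final_diagnosis: str) -> Dict[str, Any]:
    """Explicit surrender that ends the episode."""
    return {"action_type": "give_up", "final_diagnosis": final_diagnosis}


if __name__ == "__main__":
    print(json.dumps(give_up("Could not isolate the bug"), indent=2))
```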

Reward

Dense signal at every step, not just a binary end-of-episode result.

class Reward(BaseModel):
    step_reward: float        # This step: -1.0 to +1.0
    cumulative_reward: float  # Episode total so far
    grader_score: float       # 0.0 during episode; official score on terminal step
    breakdown: Dict[str, float]  # Itemized components for interpretability

Reward Function

The reward function is designed so an RL agent receives meaningful signal at every step, not just when tests pass.

Step-Level Rewards

| Event | Reward | Reasoning |
|---|---|---|
| Fix increases tests passing | +0.15 × (Δpassed / total) | Scaled progress reward |
| Fix decreases tests passing | -0.10 × (Δfailed / total) | Regression penalty |
| Fix makes no change | -0.05 | Stagnation penalty; discourages repetition |
| All tests pass | +0.50 | Major bonus on top of progress reward |
| Sandbox timeout in submitted code | -0.10 | Penalizes infinite loops |
| submit_fix without hypothesis | -0.10 | Hypothesis is required |
| Repeated query_context calls | -0.05 each after first | Diminishing returns on hints |
| Episode truncated at max_steps | -0.20 | Penalizes indecision |
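
The submit_fix rows of the step-level rewards can be sketched as a single function. The function name and the way components are combined (summed into one breakdown dict) are assumptions for illustration; query and truncation penalties are left out to keep the sketch short.

```python
from typing import Dict


def step_reward_breakdown(
    prev_passed: int,
    new_passed: int,
    total: int,
    timed_out: bool = False,
    has_hypothesis: bool = True,
) -> Dict[str, float]:
    """Itemized reward for one submit_fix step (hypothetical helper)."""
    if not has_hypothesis:
        # Missing hypothesis: flat penalty; nothing else is scored in this sketch.
        return {"missing_hypothesis": -0.10}

    breakdown: Dict[str, float] = {}
    delta = new_passed - prev_passed
    if delta > 0:
        breakdown["progress"] = 0.15 * delta / total
    elif delta < 0:
        breakdown["regression"] = -0.10 * (-delta) / total
    else:
        breakdown["stagnation"] = -0.05

    if new_passed == total:
        breakdown["all_tests_pass"] = 0.50  # bonus on top of the progress reward
    if timed_out:
        breakdown["timeout"] = -0.10
    return breakdown


if __name__ == "__main__":
    parts = step_reward_breakdown(prev_passed=4, new_passed=8, total=8)
    print(parts, "step_reward =", sum(parts.values()))
```

Keeping the components in a dict mirrors the `breakdown` field of the Reward model, so each penalty or bonus stays individually inspectable.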

Episode-Level Grader Score (0.0 → 1.0)

grader_score = test_pass_ratio × 0.60
             + efficiency_bonus × 0.20
             + hypothesis_accuracy × 0.15
             + early_solve_bonus × 0.05

where:
  test_pass_ratio     = agent_best_tests_passed / tests_total
                        (from agent submissions only, not the initial buggy code)
  efficiency_bonus    = max(0, (max_attempts - attempts_used) / max_attempts)
  hypothesis_accuracy = fraction of hypotheses correctly identifying bug location
  early_solve_bonus   = 1.0 if all tests pass within ceil(max_attempts / 3) attempts, else 0.0

Score floor design: test_pass_ratio is calculated only from the agent's submitted attempts, never from the initial buggy code run. This guarantees that a dummy agent that submits nothing scores 0.0, not an inflated baseline.
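
A minimal sketch of the episode-level formula, under two stated assumptions: the early-solve term is treated as a 0/1 indicator weighted by 0.05 (so it contributes at most 0.05), and an episode with no submissions returns 0.0 outright to match the score floor described above. The function and parameter names are hypothetical.

```python
import math
from typing import List


def grader_score(
    attempt_tests_passed: List[int],  # tests passed by each agent submission, in order
    tests_total: int,
    max_attempts: int,
    hypothesis_hits: List[bool],      # whether each hypothesis located the bug
) -> float:
    if not attempt_tests_passed:
        return 0.0  # assumption: no submissions means no credit at all

    attempts_used = len(attempt_tests_passed)
    # Best result over agent submissions only; the initial buggy run is excluded.
    test_pass_ratio = max(attempt_tests_passed) / tests_total
    efficiency_bonus = max(0.0, (max_attempts - attempts_used) / max_attempts)
    hypothesis_accuracy = (
        sum(hypothesis_hits) / len(hypothesis_hits) if hypothesis_hits else 0.0
    )
    early_window = math.ceil(max_attempts / 3)
    solved_early = any(p == tests_total for p in attempt_tests_passed[:early_window])

    return (
        0.60 * test_pass_ratio
        + 0.20 * efficiency_bonus
        + 0.15 * hypothesis_accuracy
        + 0.05 * (1.0 if solved_early else 0.0)
    )


if __name__ == "__main__":
    # Easy task solved on the first of 5 attempts with a correct hypothesis.
    print(round(grader_score([8], tests_total=8, max_attempts=5, hypothesis_hits=[True]), 2))
```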


Tasks

Task 1 (Easy): Off-by-One Bug

Difficulty: 🟢 Easy | Max attempts: 5 | Max steps: 8 | Tests: 8

A binary search implementation with a single-character bug: the while loop uses left < right instead of left <= right. This causes the function to miss the target whenever the search range narrows to a single element, for example when the target is the last element in the array. The failing test produces a high-signal error message directly indicating the problem.

Why it's easy: The error message names the failing assertion. Reading the while condition reveals the bug. One to two iterations expected for any competent agent.
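
The effect of the loop condition can be reproduced in a few lines. The buggy variant below is a reconstruction from the description (the fixed version with `<` swapped back in), not a copy of task_easy.py.

```python
from typing import List


def binary_search_buggy(arr: List[int], target: int) -> int:
    left, right = 0, len(arr) - 1
    while left < right:  # bug: exits before checking a one-element range
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1


def binary_search_fixed(arr: List[int], target: int) -> int:
    left, right = 0, len(arr) - 1
    while left <= right:  # fix: still checks the range when left == right
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1


if __name__ == "__main__":
    data = [1, 3, 5, 7]
    print(binary_search_buggy(data, 7), binary_search_fixed(data, 7))
```

With `[1, 3, 5, 7]` and target 7, the buggy loop stops once `left == right == 3` without comparing `arr[3]`, so it reports the element as missing.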

What the grader checks: Did the agent fix all 8 tests? Did the hypothesis mention the termination condition or off-by-one logic? Was it solved efficiently?

Expected GPT-4o baseline: ~0.85


Task 2 (Medium): Red Herring Authentication Bug

Difficulty: 🟡 Medium | Max attempts: 7 | Max steps: 15 | Tests: 10 (6 pass, 4 fail on buggy code)

An authentication module with three interdependent functions: hash_password, validate_password, and authenticate_user. The failing tests all report authenticate_user returning False when it should return True. However, authenticate_user is completely correct, and so is validate_password. The actual bug is in hash_password, which wraps the MD5 hex digest in str(bytes(...)), producing a "b'...'" wrapper around the digest that corrupts the hash string.

The red herring: the error message names authenticate_user. Every surface-level reading of the error points to the wrong function. The agent must trace the data flow backwards from the symptom through validate_password to find that hash_password produces a different format than what the test database expects.
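
The data-flow problem can be reproduced in isolation. The functions below are a reconstruction from the description, not the task's actual source: the exact `bytes(...)` call and the helper signatures are assumptions.

```python
import hashlib
from typing import Callable


def hash_password_buggy(password: str) -> str:
    digest = hashlib.md5(password.encode("utf-8")).hexdigest()
    # str() of a bytes object yields its repr, e.g. "b'5f4dcc...'",
    # so the stored format no longer matches a plain hex digest.
    return str(bytes(digest, "utf-8"))


def hash_password_fixed(password: str) -> str:
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def validate_password(password: str, stored_hash: str, hasher: Callable[[str], str]) -> bool:
    """Correct on its own: it simply compares the hasher's output to the stored hash."""
    return hasher(password) == stored_hash


if __name__ == "__main__":
    stored = hash_password_fixed("hunter2")  # what the test database would hold
    print(validate_password("hunter2", stored, hash_password_buggy))  # symptom appears downstream
    print(validate_password("hunter2", stored, hash_password_fixed))
```

The comparison fails inside `validate_password`, and the failure surfaces one level higher still, which is why the symptom points away from the function that actually needs fixing.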

Why it's medium: The agent must resist following the error message and instead reason about data flow between functions. GPT-4o follows this red herring approximately 40% of the time.

Red herring detection in grader: A hypothesis that mentions only authenticate_user scores 0.0 for hypothesis accuracy. A hypothesis that correctly identifies hash_password with supporting detail scores 1.0.
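
One simple way to approximate this scoring rule is keyword matching on the hypothesis text. The sketch below is an assumption about how such a check could look, not the grader's implementation; in particular, the list of detail terms and the 0.5 score for naming hash_password without supporting detail are invented for illustration.

```python
# Hypothetical detail terms that suggest the agent understood the format bug.
DETAIL_TERMS = ("bytes", "b'", "prefix", "format", "hexdigest", "str(")


def hypothesis_accuracy_medium(hypothesis: str) -> float:
    """Score one hypothesis for the medium task (illustrative sketch)."""
    text = hypothesis.lower()
    if "hash_password" not in text:
        # Covers the red-herring case where only authenticate_user is blamed.
        return 0.0
    if any(term in text for term in DETAIL_TERMS):
        return 1.0
    return 0.5  # assumption: right function, no supporting detail


if __name__ == "__main__":
    print(hypothesis_accuracy_medium("authenticate_user compares the wrong fields"))
    print(hypothesis_accuracy_medium("hash_password wraps the digest in bytes, changing its format"))
```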

Expected GPT-4o baseline: ~0.50


Task 3 (Hard): Concurrency Race Condition

Difficulty: 🔴 Hard | Max attempts: 10 | Max steps: 25 | Tests: 8 (all 8 pass on buggy code)

A ConnectionCounter class used in a web server to track active connections. It uses threading.Lock and appears to be correctly implemented. All 8 sequential unit tests pass. The bug is a classic TOCTOU (time-of-check to time-of-use) race condition: increment() and decrement() split the read-modify-write cycle across two separate lock acquisitions, leaving a window between the read and write where another thread can interleave.

def increment(self):
    with self._lock:
        current = self.count     # read: lock released after this block
    new_val = current + 1        # modify: no lock held
    with self._lock:
        self.count = new_val     # write: race window exploited

The agent must: (1) recognize that 8/8 passing tests do not prove correctness for concurrent code, (2) reason about thread interleaving, (3) design a concurrent stress test that surfaces the race, (4) fix the atomicity issue by collapsing read-modify-write into a single lock scope, and (5) verify the fix passes both the original tests and a 1000-thread concurrent stress test.

Why it's hard: Race conditions are non-deterministic. The bug does not manifest in sequential execution. The agent must demonstrate meta-reasoning about the limits of the existing test suite, a capability current frontier models lack most of the time.
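
A sketch of what a correct fix and a concurrent stress test might look like. The class body and the `stress_test` helper are illustrative assumptions (only the class name, `increment()`, `decrement()`, and the lock attribute come from the description); the thread count defaults to the 1000 mentioned above.

```python
import threading


class ConnectionCounter:
    """Fixed variant: read-modify-write happens inside a single lock scope."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.count = 0

    def increment(self) -> None:
        with self._lock:
            self.count += 1

    def decrement(self) -> None:
        with self._lock:
            self.count -= 1


def stress_test(counter: ConnectionCounter, n_threads: int = 1000,
                increments_per_thread: int = 100) -> bool:
    """Return True if no increment was lost under concurrent access."""
    start = counter.count

    def worker() -> None:
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.count - start == n_threads * increments_per_thread


if __name__ == "__main__":
    print(stress_test(ConnectionCounter()))
```

A counter with the split-lock bug can still pass a single run by chance, so one passing run is weak evidence; only the fixed version passes every time.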

Hard task grader breakdown:

  • Sequential tests pass: 0.40 (agent submissions only)
  • 1000-thread concurrent stress test passes: 0.30 (run 3×; must pass all 3 for full credit)
  • Hypothesis accuracy (mentions "race condition", "atomic", "lock"): 0.20
  • Efficiency bonus (fixed within 5 attempts): 0.10

Expected GPT-4o baseline: ~0.18


Security Sandbox

Every submit_fix action executes agent-generated Python code. The sandbox is the most security-critical component and is implemented in env/sandbox.py.

Multi-Layer Protection

Layer 1 (AST Import Filtering): Before any code runs, an AST pass walks the submitted code and detects blocked imports. Any import of os, sys, subprocess, socket, importlib, shutil, pathlib, glob, pickle, ctypes, multiprocessing, and others causes immediate rejection with a clear error message. This uses ast.parse() + ast.walk() rather than string matching, which can be bypassed.
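
A minimal sketch of an import filter built on ast.parse() and ast.walk(). The function name and return shape are assumptions; the blocked set contains only the modules named above (the real list is longer), and this sketch only inspects import statements, so dynamic imports such as `__import__("os")` would need a separate check.

```python
import ast
from typing import List

# Modules named in the description; the real blocklist includes others.
BLOCKED_MODULES = frozenset({
    "os", "sys", "subprocess", "socket", "importlib", "shutil",
    "pathlib", "glob", "pickle", "ctypes", "multiprocessing",
})


def find_blocked_imports(source: str, allow_threading: bool = False) -> List[str]:
    """Return the sorted blocked top-level modules imported by `source`."""
    blocked = set(BLOCKED_MODULES)
    if not allow_threading:
        blocked.add("threading")

    tree = ast.parse(source)  # raises SyntaxError on invalid code
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "os.path" counts as "os"
            if root in blocked:
                found.add(root)
    return sorted(found)


if __name__ == "__main__":
    print(find_blocked_imports("import os.path\nfrom subprocess import run\nimport math"))
```

Because the check works on the parsed tree, an import hidden inside a function body or behind unusual whitespace is still found, which plain substring matching would not guarantee.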

Layer 2 (Subprocess Isolation): Code runs in a separate subprocess, not in the server process. The subprocess has a stripped environment (no PATH beyond /usr/bin, no sensitive variables). Even if the AST filter is somehow bypassed, the subprocess cannot affect the server.

Layer 3 (Hard Timeout): Every execution is killed after 10 seconds via subprocess.run(timeout=10). Infinite loops in submitted code return timed_out: True and a -0.10 step reward.
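
Layers 2 and 3 can be sketched together with the standard library. This is an illustration under assumptions: `run_sandboxed` is a hypothetical name, the child is started with Python's `-I` isolated-mode flag, and the stripped environment shown here targets POSIX systems.

```python
import subprocess
import sys
from typing import Tuple


def run_sandboxed(code: str, timeout_s: float = 10.0) -> Tuple[str, bool]:
    """Run `code` in a child interpreter; return (combined output, timed_out)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: ignore user site and PYTHON* variables
            capture_output=True,
            text=True,
            timeout=timeout_s,                   # the child is killed when this expires
            env={"PATH": "/usr/bin"},            # stripped environment (POSIX assumption)
        )
    except subprocess.TimeoutExpired:
        return "", True
    return proc.stdout + proc.stderr, False


if __name__ == "__main__":
    print(run_sandboxed("print('hello from the sandbox')"))
    print(run_sandboxed("while True: pass", timeout_s=1.0))
```

Passing an explicit `env` replaces the parent's environment entirely, so API keys and tokens set for the server are not visible to submitted code.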

Layer 4 (Memory Limit): 256MB per execution via environment isolation.

Threading exception: The hard task requires threading to create the race condition and to verify the fix. The sandbox accepts an allow_threading=True flag that removes threading from the blocked list for that task only. All other tasks have threading blocked.


API Endpoints

The environment is served as a FastAPI application on port 8000.

| Endpoint | Method | Description |
|---|---|---|
| / | GET | API overview: lists all endpoints and tasks |
| /health | GET | Health check: always returns HTTP 200 |
| /tasks | GET | List all tasks with full metadata |
| /reset | POST | Start a new episode. Body: {"task_id": "easy"} |
| /step | POST | Submit one action. Body: Action JSON |
| /state | GET | Full internal episode state |

All endpoints always return HTTP 200; errors appear in the response body under info["error"], never as HTTP 4xx/5xx. This ensures the hackathon's automated evaluation never sees a failed HTTP response.
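
Since the status code carries no error information, a client has to inspect the body. The sketch below assumes the decoded JSON has a top-level "info" object; that response shape and the names `EnvError` and `check_response` are hypothetical.

```python
from typing import Any, Dict


class EnvError(RuntimeError):
    """Raised when the environment reports an error inside a 200 response."""


def check_response(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return the payload unchanged, or raise if info["error"] is set."""
    info = payload.get("info") or {}
    error = info.get("error")
    if error:
        raise EnvError(str(error))
    return payload


if __name__ == "__main__":
    ok = check_response({"done": False, "info": {}})
    print(ok)
    try:
        check_response({"done": False, "info": {"error": "unknown task_id"}})
    except EnvError as exc:
        print("environment error:", exc)
```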


OpenEnv Compliance

# openenv.yaml
name: agentdebugger-env
version: 1.0.0
domain: software_engineering
observation_type: structured
action_type: structured
reward_type: dense
episode_termination: action_or_step_limit
tasks:
  - {id: easy, difficulty: easy, max_steps: 8, max_attempts: 5}
  - {id: medium, difficulty: medium, max_steps: 15, max_attempts: 7}
  - {id: hard, difficulty: hard, max_steps: 25, max_attempts: 10}

Validation output:

✓ openenv.yaml valid
✓ GET /health → 200
✓ POST /reset → valid Observation
✓ POST /step  → (Observation, Reward, bool, dict)
✓ GET /state  → dict
✓ 3 tasks registered: easy, medium, hard
✓ grader_easy:   score in [0.0, 1.0] - PASS
✓ grader_medium: score in [0.0, 1.0] - PASS
✓ grader_hard:   score in [0.0, 1.0] - PASS
✓ inference.py present in root directory
openenv validate: PASSED

Baseline Results

Evaluated using gpt-4o with zero-shot chain-of-thought prompting. Each task run 5 times independently, scores averaged.

| Task | Difficulty | Mean Score | Std Dev | Solved % | Avg Attempts | Avg Steps |
|---|---|---|---|---|---|---|
| Off-by-One Bug | Easy | 0.85 | ±0.04 | 100% | 1.8 | 4.2 |
| Red Herring Auth | Medium | 0.50 | ±0.10 | 60% | 4.2 | 10.6 |
| Race Condition | Hard | 0.18 | ±0.09 | 20% | 8.7 | 22.1 |
| Overall Mean | | 0.51 | | 60% | | |

Key observations:

Easy task: GPT-4o reads the error message, identifies the off-by-one in the while condition on the first or second attempt, and fixes correctly. Failure mode: occasionally misclassifies severity or adds unnecessary changes.

Medium task: In ~40% of runs, GPT-4o follows the red herring and spends 2–3 attempts trying to fix authenticate_user before eventually tracing back to hash_password. When it identifies the correct function immediately, it solves efficiently. The hypothesis accuracy score penalizes the red-herring runs significantly.

Hard task: GPT-4o almost never spontaneously recognizes that a race condition can exist when all sequential tests pass. In the rare runs where it does solve it (~20%), it correctly identifies that the lock scope must encompass the entire read-modify-write cycle. The 1000-thread concurrent stress test filters out partial fixes where the race window is narrowed but not eliminated.


Setup & Usage

Local Development

git clone https://github.com/shasshaank/AgentDebuggerEnv
cd AgentDebuggerEnv
pip install -r requirements.txt

# Start the environment server
uvicorn env.server:app --reload --port 8000

# Verify it's running
curl http://localhost:8000/health
# {"status": "ok", "environment": "agentdebugger-env", "version": "1.0.0"}

# Run baseline inference
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your_openai_api_key"
export ENV_BASE_URL="http://localhost:8000"
python inference.py

Docker

# Build
docker build -t agentdebugger-env .

# Run
docker run -p 8000:8000 agentdebugger-env

# Run with inference against the containerized environment
docker run -p 8000:8000 \
  -e API_BASE_URL="https://api.openai.com/v1" \
  -e MODEL_NAME="gpt-4o" \
  -e HF_TOKEN="your_key" \
  agentdebugger-env

Quick API Test

# Reset the easy task
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "easy"}'

# Submit a fix with hypothesis
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{
    "action_type": "submit_fix",
    "fixed_code": "def binary_search(arr, target):\n    left, right = 0, len(arr) - 1\n    while left <= right:\n        mid = (left + right) // 2\n        if arr[mid] == target:\n            return mid\n        elif arr[mid] < target:\n            left = mid + 1\n        else:\n            right = mid - 1\n    return -1",
    "hypothesis": "The while loop uses left < right instead of left <= right, causing it to skip the last element."
  }'

Why This Environment Matters for Agent Research

Four specific failure modes in LLM agents are measurable and scorable here for the first time:

1. Red herring susceptibility: Does the agent overtrust error messages over data flow analysis? The medium task's hypothesis_accuracy score measures this directly. An agent that follows the red herring scores 0.0 on hypothesis accuracy even if it eventually finds the correct fix by trial and error.

2. Stagnation under uncertainty: Does the agent repeat the same failed fix strategy instead of updating its hypothesis? The -0.05 stagnation penalty and hypothesis_accuracy score together capture this. An agent that submits the same code twice is penalized twice.

3. Exploration vs. exploitation: The query_context action costs a step but provides information. The first query is free; subsequent ones cost -0.05. Agents that query productively before attempting a fix demonstrate better exploration behavior than those that immediately submit wrong fixes.

4. Test-suite as sufficient proof: The hard task is specifically designed to test whether an agent knows when passing tests are not enough. An agent that sees 8/8 tests passing and immediately approves the code, without recognizing the concurrency issue, scores at most 0.40 and fails the most important grader component.

All four failure modes produce distinct, interpretable score components in the breakdown field of every Reward response. This makes AgentDebuggerEnv useful not just as a benchmark but as a diagnostic tool for understanding where a specific model fails in iterative reasoning.


Design Decisions

Why require a hypothesis? The hypothesis field is mandatory on every submit_fix action. Missing it costs -0.10 and the fix is not executed. This forces agents to articulate their reasoning, which enables the grader to score hypothesis_accuracy separately from test_pass_ratio. It also prevents degenerate strategies of submitting random code until something passes.

Why is best_tests_passed calculated from agent attempts only? The medium and hard buggy programs start with 6/10 and 8/8 tests passing respectively. If the grader used the environment's best_tests_passed (which includes the initial buggy code run), a dummy agent that submits nothing would score 0.36 and 0.40 for free. The grader recalculates from the attempts list, which contains only what the agent actually submitted, ensuring the score floor is 0.0.

Why run the concurrent stress test 3 times? Race conditions are non-deterministic. A partial fix that narrows the race window (but doesn't eliminate it) might pass once by luck. Requiring all 3 runs to pass filters out lucky partial fixes. A fix that passes 1 of 3 receives 0.15: partial credit for progress, but not full credit.

Why not use pytest directly? Using pytest as the test runner would make output parsing dependent on pytest's output format and version. The environment uses a custom lightweight test runner written as a Python string executed in the sandbox, producing a consistent "N passed, M failed" format that _parse_tests_passed() can reliably parse across all platforms.
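
Parsing a fixed "N passed, M failed" summary needs only a small regular expression. The sketch below is not the project's `_parse_tests_passed()`; the name `parse_test_summary`, the return shape, and the choice to read the last summary line in the output are assumptions.

```python
import re
from typing import Optional, Tuple

_SUMMARY_RE = re.compile(r"(\d+) passed, (\d+) failed")


def parse_test_summary(output: str) -> Optional[Tuple[int, int]]:
    """Return (tests_passed, tests_total) from runner output, or None if absent."""
    matches = _SUMMARY_RE.findall(output)
    if not matches:
        return None  # e.g. the code crashed before the runner printed a summary
    passed, failed = (int(n) for n in matches[-1])  # use the final summary line
    return passed, passed + failed


if __name__ == "__main__":
    sample = "test_last_element ... FAIL\n6 passed, 2 failed\n"
    print(parse_test_summary(sample))
```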


Environment Configuration

# Required for inference.py
API_BASE_URL   # LLM API endpoint (e.g. https://api.openai.com/v1)
MODEL_NAME     # Model identifier (e.g. gpt-4o)
HF_TOKEN       # API key / HuggingFace token

# Optional (defaults to localhost:8000)
ENV_BASE_URL   # Environment server URL
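
A script consuming these variables might load them as follows. This is a sketch rather than the logic in inference.py: `load_config` is a hypothetical helper, and the `http://` scheme on the default URL is taken from the local-development example.

```python
import os
from typing import Dict, Mapping

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
DEFAULT_ENV_BASE_URL = "http://localhost:8000"


def load_config(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read required settings and apply the default for ENV_BASE_URL."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise KeyError(f"missing required environment variables: {', '.join(missing)}")
    config = {name: environ[name] for name in REQUIRED_VARS}
    config["ENV_BASE_URL"] = environ.get("ENV_BASE_URL", DEFAULT_ENV_BASE_URL)
    return config


if __name__ == "__main__":
    demo = {"API_BASE_URL": "https://api.openai.com/v1", "MODEL_NAME": "gpt-4o", "HF_TOKEN": "placeholder"}
    print(load_config(demo)["ENV_BASE_URL"])
```

Failing fast on a missing variable gives a clearer error than a failed API call several steps into an episode.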

License & Attribution

License: MIT (see LICENSE)

Author: Shashaank

Submitted to: Meta + PyTorch + HuggingFace OpenEnv Hackathon

Live Environment: https://huggingface.co/spaces/shashaank0707/AgentDebugger-env