name: agentdebugger-env
version: 1.0.0
description: >
  A live, iterative debugging environment where AI agents fix broken code by
  forming hypotheses, submitting fixes, observing test output, and iterating.
  It benchmarks genuine agentic reasoning through a hypothesis-test-fix
  feedback loop.
domain: software_engineering
tags:
  - debugging
  - agentic-reasoning
  - code-repair
  - openenv
  - software-engineering
observation_type: structured
action_type: structured
reward_type: dense
episode_termination: action_or_step_limit
inference_script: inference.py

tasks:
  - id: easy
    name: Single Function Off-By-One Bug
    difficulty: easy
    max_attempts: 5
    max_steps: 8
    tests_total: 8
    description: >
      Binary search with an off-by-one termination condition. The error
      message is clear; 1-2 iterations expected.
  - id: medium
    name: Red Herring — Interdependent Function Bug
    difficulty: medium
    max_attempts: 7
    max_steps: 15
    tests_total: 10
    description: >
      Authentication module where the error points to the wrong function.
      The agent must trace data flow backwards from symptom to root cause.
  - id: hard
    name: Concurrency Race Condition
    difficulty: hard
    max_attempts: 10
    max_steps: 25
    tests_total: 8
    description: >
      Thread-safe counter with a race condition invisible to sequential tests.
      The agent must design a concurrent test to surface the bug, then fix it.

baseline:
  model: gpt-4o
  script: inference.py
  mean_score: 0.51
  scores:
    easy: 0.85
    medium: 0.50
    hard: 0.18

author: shashaank
license: MIT
huggingface_space: shashaank0707/AgentDebugger-env
api_base_url_env_var: API_BASE_URL
model_name_env_var: MODEL_NAME
hf_token_env_var: HF_TOKEN