name: agentdebugger-env
version: 1.0.0
description: >
  A live, iterative debugging environment where AI agents fix broken code
  by forming hypotheses, submitting fixes, observing test output, and
  iterating. It benchmarks genuine agentic reasoning through a
  hypothesis-test-fix feedback loop.
domain: software_engineering
tags:
- debugging
- agentic-reasoning
- code-repair
- openenv
- software-engineering
observation_type: structured
action_type: structured
reward_type: dense
episode_termination: action_or_step_limit
inference_script: inference.py
tasks:
- id: easy
  name: Single Function Off-By-One Bug
  difficulty: easy
  max_attempts: 5
  max_steps: 8
  tests_total: 8
  description: >
    Binary search with an off-by-one termination condition.
    The error message is clear; 1-2 iterations expected.
- id: medium
  name: Red Herring - Interdependent Function Bug
  difficulty: medium
  max_attempts: 7
  max_steps: 15
  tests_total: 10
  description: >
    Authentication module where the error points to the wrong function.
    The agent must trace data flow backwards from symptom to root cause.
- id: hard
  name: Concurrency Race Condition
  difficulty: hard
  max_attempts: 10
  max_steps: 25
  tests_total: 8
  description: >
    Thread-safe counter with a race condition invisible to sequential tests.
    The agent must design a concurrent test to surface the bug, then fix it.
baseline:
  model: meta-llama/Llama-3.1-70B-Instruct
  script: inference.py
  mean_score: 0.51
  scores:
    easy: 0.85
    medium: 0.50
    hard: 0.18
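  # Consistency check, assuming mean_score is the unweighted mean of the
  # three per-task scores: (0.85 + 0.50 + 0.18) / 3 = 1.53 / 3 = 0.51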
author: "Shashaank (GitHub: @shasshaank, HF: @shashaank0707)"
# Submission Integrity: SHA 159a5faf82fc1ab3709f9674becf9a3ec55cf562 | Verified 2026-04-08
license: MIT
huggingface_space: shashaank0707/AgentDebugger-env
api_base_url_env_var: API_BASE_URL
model_name_env_var: MODEL_NAME
hf_token_env_var: HF_TOKEN
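# Example shell setup, shown as a sketch only. The angle-bracket values are
# placeholders, reusing the baseline model for MODEL_NAME is an assumption,
# and so is running inference.py without arguments:
#   export API_BASE_URL=<inference-endpoint-url>
#   export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct
#   export HF_TOKEN=<your-hf-token>
#   python inference.py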