---
title: AgentDebugger-Training 🧠
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
app_file: app.py
python_version: 3.10.13
pinned: true
license: mit
---

# AgentDebuggerEnv 🐛

> **A live, iterative debugging environment for benchmarking genuine agentic reasoning in AI systems.**

[![HuggingFace Space](https://img.shields.io/badge/🤗%20Space-Live-yellow)](https://huggingface.co/spaces/shashaank0707/AgentDebugger-env)
[![OpenEnv](https://img.shields.io/badge/OpenEnv-Compliant-blue)](#openenv-api-compliance)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-blue)](https://www.python.org/)

*Submitted to the **Meta + PyTorch + HuggingFace OpenEnv Hackathon.***

---

## The Problem with Existing Code Benchmarks

Benchmarks like HumanEval, MBPP, and SWE-bench share a fundamental limitation: they are **one-shot**. A model reads a problem, generates code, and is scored on the final output. This measures code generation, not debugging ability.

Real software engineering is not one-shot. It is **iterative**. A developer reads failing tests, forms a hypothesis, submits a fix, reads the new error output, updates their theory, and repeats. No existing OpenEnv environment benchmarks this loop.

**AgentDebuggerEnv does.**

---

## How It's Different from SWE-bench

| Dimension | SWE-bench | AgentDebuggerEnv |
|---|---|---|
| Evaluation target | Final patch correctness | Full reasoning trajectory |
| Feedback to agent | None (single shot) | Real `stdout/stderr` after every attempt |
| Reward signal | Binary end-of-episode | Dense (every step scored) |
| What's measured | Code generation | Hypothesis formation + iterative reasoning |
| Hard task | Apply patch to existing issue | Must design a test to surface a hidden bug |
| Agent failure modes | Not tracked | 4 distinct measurable failure modes |

The iterative feedback loop is the core mechanic. Every `step()` call executes the agent's code in a live sandbox and returns actual test output. The agent must update its theory and try again, exactly like a real developer at a terminal.

---

## Baseline Performance

Evaluated using `gpt-4o` with zero-shot prompting. Each task was run 5 times independently; scores are averaged.

| Task | Difficulty | Mean Score | Std Dev | Solved % | Avg Attempts |
|---|---|---|---|---|---|
| Off-by-One Bug | 🟢 Easy | 0.85 | ±0.04 | 100% | 1.8 |
| Red Herring Auth Bug | 🟡 Medium | 0.50 | ±0.10 | 60% | 4.2 |
| Race Condition | 🔴 Hard | 0.18 | ±0.09 | 20% | 8.7 |
| **Overall Mean** | | **0.51** | | **60%** | |

The hard task is specifically designed so that frontier models fail most of the time. GPT-4o almost never spontaneously recognizes that a race condition can exist when all sequential tests pass, which is exactly the reasoning gap this environment is built to measure.

---

## Measured Failure Modes

All four failure modes produce distinct, interpretable score components in the `breakdown` field of every `Reward` response (an illustrative breakdown follows the list):

*   **Red Herring Susceptibility**: Does the agent overtrust surface error messages (the Medium task's trap) or trace data flow to the root cause?
*   **Stagnation**: Does the agent repeat failed fixes? Penalized by the `-0.05` stagnation penalty.
*   **Exploration/Exploitation**: Measures if agents query for context productively before attempting fixes.
*   **Test-Suite Overconfidence**: Detects if an agent fails to reason about concurrency when sequential tests pass (Hard Task).
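For concreteness, a terminal-step `breakdown` for a medium-task run that chased the red herring might look like the sketch below. The key names mirror the grader components described later and are illustrative; the actual dictionary is whatever the task's grader emits.

```python
# Illustrative values only; the exact keys come from the task's grader.
breakdown = {
    "test_pass_ratio":     0.60,   # best agent submission passed 6/10 tests
    "efficiency_bonus":    0.43,   # 4 of 7 attempts used: (7 - 4) / 7
    "hypothesis_accuracy": 0.00,   # every hypothesis blamed authenticate_user
    "early_solve_bonus":   0.00,
}
```

Combined with the weights in the grader formula below, such a run would land around 0.60 × 0.60 + 0.43 × 0.20 ≈ 0.45.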

---

## Task Suite

### 🟢 Task 1 (Easy): Off-by-One Bug

**Max attempts:** 5 | **Max steps:** 8 | **Tests:** 8

A binary search implementation with a single-character bug: the while loop uses `left < right` instead of `left <= right`. This causes the function to miss the target when it is the last element. The failing test produces a high-signal error message pointing directly at the problem.
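A minimal sketch of the shape of the bug (illustrative, not the exact task file):

```python
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left < right:              # bug: should be `left <= right`
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1

# The loop exits one step too early, so the last remaining candidate is never checked:
print(binary_search([1, 3, 5], 5))   # -1 with the bug, 2 once fixed
```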

**Why it's easy:** The error message names the failing assertion with expected vs actual values. Reading the while condition reveals the bug. 1-2 iterations expected.

**What the grader checks:** Did all 8 tests pass? Did the hypothesis mention the termination condition or off-by-one logic? Was it efficient?

---

### 🟡 Task 2 (Medium): Red Herring Authentication Bug

**Max attempts:** 7 | **Max steps:** 15 | **Tests:** 10 (6 pass, 4 fail on buggy code)

An authentication module with three interdependent functions: `hash_password`, `validate_password`, and `authenticate_user`. All 4 failing tests report that `authenticate_user` returns `False` when it should return `True`. But `authenticate_user` is completely correct. So is `validate_password`. The bug is in `hash_password`, which wraps the MD5 hex digest in `str(bytes(...))`, producing a `"b'...'"` prefix that makes the computed hash never match the stored hash.
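A minimal sketch of that corruption, assuming the structure described above (the task file differs in detail):

```python
import hashlib

def hash_password(password: str) -> str:
    digest = hashlib.md5(password.encode()).hexdigest()
    # Bug: str() of a bytes object keeps the b'...' repr, so the stored value
    # becomes "b'5f4dcc3b...'" while later comparisons compute "5f4dcc3b..."
    return str(digest.encode())

print(hash_password("password"))   # b'5f4dcc3b5aa765d61d8327deb882cf99'
```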

**The red herring:** Every surface reading of the error points to `authenticate_user`. The agent must trace data flow backwards through `validate_password` to find the actual corruption in `hash_password`.

**Red herring detection in grader:** A hypothesis mentioning only `authenticate_user` scores 0.0 for hypothesis accuracy. Correctly identifying `hash_password` with supporting detail scores 1.0. GPT-4o follows the red herring ~40% of the time.

---

### 🔴 Task 3 (Hard): Concurrency Race Condition

**Max attempts:** 10 | **Max steps:** 25 | **Tests:** 8 (ALL 8 pass on the buggy code)

A `ConnectionCounter` class used in a web server to track active connections. It uses `threading.Lock` and appears correctly implemented. All 8 sequential unit tests pass. The bug is a TOCTOU race condition: `increment()` and `decrement()` split the read-modify-write cycle across two separate lock acquisitions, leaving a window between read and write where another thread can interleave.

```python
def increment(self):
    with self._lock:
        current = self.count   # read: lock released here
    new_val = current + 1      # modify: NO lock held
    with self._lock:
        self.count = new_val   # write: race window
```

To earn full credit, the agent must:

- recognize that 8/8 passing tests do not prove correctness for concurrent code,
- reason about thread interleaving,
- design a concurrent stress test that surfaces the race,
- fix the atomicity issue by collapsing the read-modify-write into a single lock scope, and
- verify the fix survives a 1000-thread stress test.
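A sketch of the expected repair, plus the kind of concurrent stress test an agent might design (illustrative; the grader runs its own 1000-thread check):

```python
import threading

def increment(self):
    # Fix: keep the entire read-modify-write inside one lock acquisition
    with self._lock:
        self.count += 1

def stress_test(counter, n_threads=1000):
    # Hammer the counter from many threads; lost updates reveal the race.
    threads = [threading.Thread(target=counter.increment) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert counter.count == n_threads, f"lost updates: {counter.count} != {n_threads}"
```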

**Hard task grader breakdown:**
- Sequential tests pass (agent submissions only): **0.40**
- 1000-thread concurrent stress test passes (run 5×, must pass ≥4 for full credit): **0.30**
- Hypothesis accuracy (mentions "race condition", "atomic", "lock"): **0.20**
- Efficiency bonus (fixed within 5 attempts): **0.10**

---

## Reward Function Design

The reward function provides dense signal at every step so an RL agent can learn from every action, not just the final outcome.

### Step-Level Rewards

| Event | Reward | Reasoning |
|---|---|---|
| Fix increases tests passing | `+0.15 × (Δpassed / total)` | Scaled progress |
| Fix decreases tests passing | `-0.10 × (Δfailed / total)` | Regression penalty |
| Fix makes no change to passing count | `-0.05` | Stagnation penalty |
| All tests pass | `+0.50` | Major bonus on top of progress |
| Submitted code times out in sandbox | `-0.10` | Penalizes infinite loops |
| `submit_fix` without hypothesis field | `-0.10` | Hypothesis is required |
| First `query_context` use | `0.00` | Free |
| Subsequent `query_context` uses | `-0.05` each | Diminishing returns |
| Episode truncated at max_steps | `-0.20` | Penalizes indecision |
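A condensed sketch of how these step rewards compose (function and argument names are assumed for illustration; the real logic lives in `env/environment.py`):

```python
def step_reward(prev_passed: int, new_passed: int, total: int, timed_out: bool) -> float:
    if timed_out:
        return -0.10                                # sandbox timeout
    delta = new_passed - prev_passed
    if delta > 0:
        reward = 0.15 * (delta / total)             # scaled progress
    elif delta < 0:
        reward = -0.10 * (abs(delta) / total)       # regression penalty
    else:
        reward = -0.05                              # stagnation penalty
    if new_passed == total:
        reward += 0.50                              # major bonus on top of progress
    return reward
```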

### Episode-Level Grader Score

```
grader_score = test_pass_ratio     × 0.60
             + efficiency_bonus    × 0.20
             + hypothesis_accuracy × 0.15
             + early_solve_bonus   × 0.05

test_pass_ratio     = agent_best_tests_passed / tests_total
                      (from agent submissions only, never the initial buggy code run)
efficiency_bonus    = max(0, (max_attempts - attempts_used) / max_attempts)
hypothesis_accuracy = fraction of hypotheses correctly identifying the bug
early_solve_bonus   = 1.0 if solved within ceil(max_attempts / 3) attempts, else 0.0
```

**Score floor design:** `test_pass_ratio` uses only the agent's submitted attempts, never the initial buggy code run. The medium buggy code passes 6/10 tests and the hard buggy code passes 8/8 tests sequentially. Without this design, a dummy agent that submits nothing would score 0.36 and 0.40 for free, respectively. The grader recalculates from the `attempts` list to guarantee the score floor is 0.0.
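A sketch of the episode-level computation over an assumed `attempts` list of `(tests_passed, hypothesis_correct)` records (names are illustrative; see `env/graders/` for the real scoring):

```python
import math

def grade(attempts, tests_total, max_attempts, solved_attempt=None):
    # Score floor: only agent submissions count, never the initial buggy run.
    best = max((passed for passed, _ in attempts), default=0)
    test_pass_ratio = best / tests_total
    attempts_used = len(attempts)
    efficiency_bonus = max(0, (max_attempts - attempts_used) / max_attempts)
    hypothesis_accuracy = (
        sum(1 for _, correct in attempts if correct) / attempts_used if attempts else 0.0
    )
    early = solved_attempt is not None and solved_attempt <= math.ceil(max_attempts / 3)
    return (test_pass_ratio * 0.60
            + efficiency_bonus * 0.20
            + hypothesis_accuracy * 0.15
            + (0.05 if early else 0.0))
```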

---

## Security Sandbox

Every `submit_fix` action executes agent-generated Python code. All execution routes through `env/sandbox.py`; raw `exec()` is never used anywhere in the codebase.

**Layer 1 (AST Import & Attribute Filtering):** Before execution, an AST walk detects blocked imports and prevents access to any attribute starting with an underscore (`_`). This blocks private member access and dunder escapes (like `__class__`).
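A minimal sketch of that kind of AST check (the real filter in `env/sandbox.py` is more thorough):

```python
import ast

BLOCKED_IMPORTS = {"os", "sys", "subprocess", "socket"}   # illustrative list

def check_code(source: str) -> None:
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = [alias.name for alias in node.names]
            if isinstance(node, ast.ImportFrom) and node.module:
                names.append(node.module)
            if any(name.split(".")[0] in BLOCKED_IMPORTS for name in names):
                raise ValueError(f"blocked import: {names}")
        if isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            # Blocks private member access and dunder escapes like obj.__class__
            raise ValueError(f"blocked attribute access: {node.attr}")
```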

**Layer 2 (Subprocess Isolation):** Code runs in a child subprocess with a stripped environment and no network access.

**Layer 3 (Hard Timeout):** Every execution is killed after 10 seconds. Infinite loops in submitted code return `timed_out: True` and a `-0.10` step reward.

**Layer 4 (Memory Limit):** 256 MB per execution.
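Layers 2-4 amount to a subprocess call of roughly this shape (a simplified sketch with assumed names; the real runner additionally blocks network access):

```python
import resource
import subprocess
import sys

def run_isolated(code: str, timeout: int = 10) -> dict:
    def limit_memory():
        # 256 MB address-space cap for the child process (POSIX)
        resource.setrlimit(resource.RLIMIT_AS, (256 * 1024 * 1024, 256 * 1024 * 1024))

    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],     # -I: isolated interpreter mode
            capture_output=True, text=True,
            timeout=timeout, env={}, preexec_fn=limit_memory,
        )
        return {"stdout": proc.stdout, "stderr": proc.stderr, "timed_out": False}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "", "timed_out": True}
```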

**Threading exception:** The hard task requires `threading` to create and verify the race condition. The sandbox accepts `allow_threading=True` for that task only. All other tasks block threading entirely.

---

## Data Models

```python
class Observation(BaseModel):
    task_id: str                          # "easy" | "medium" | "hard"
    buggy_code: str                       # Original broken code
    test_suite: str                       # Full test file content
    current_code: str                     # Most recent submitted code
    current_error_output: str             # Sandbox stdout/stderr output
    tests_passed: int                
    attempts_remaining: int
    max_attempts: int
    done: bool
    score_estimate: float                 # Running grader estimate

class Action(BaseModel):
    action_type: str                      # "submit_fix" | "query_context" | "give_up"
    fixed_code: Optional[str]             # Complete corrected code
    hypothesis: Optional[str]             # Theory about the bug (required for submit)
    query_type: Optional[str]             # "function_signature" | "error_explanation" etc.

class Reward(BaseModel):
    step_reward: float                    # Dense signal: range -1.0 to +1.0
    cumulative_reward: float 
    grader_score: float                   # Official score (terminal step only)
    breakdown: Dict[str, float]           # Itemized components
```

---

## OpenEnv API Compliance

```yaml
name: agentdebugger-env
version: 1.0.0
domain: software_engineering
observation_type: structured
action_type: structured
reward_type: dense
episode_termination: action_or_step_limit
tasks:
  - {id: easy,   difficulty: easy,   max_steps: 8,  max_attempts: 5}
  - {id: medium, difficulty: medium, max_steps: 15, max_attempts: 7}
  - {id: hard,   difficulty: hard,   max_steps: 25, max_attempts: 10}
```

Application-level errors are returned in `info.error` inside the response body. Core evaluation endpoints avoid 4xx/5xx status codes for agent-level mistakes, so the evaluation loop is never interrupted by HTTP errors.

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | API overview: lists all endpoints and tasks |
| `/health` | GET | Health check (always HTTP 200) |
| `/tasks` | GET | All tasks with metadata |
| `/reset` | POST | Start episode. Body: `{"task_id": "easy"}` |
| `/step` | POST | Submit one action |
| `/state` | GET | Full internal episode state |
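A minimal client loop against these endpoints using `requests` (the action fields follow the data models above; the exact response envelope, shown here as `observation`/`reward` keys, is an assumption):

```python
import requests

BASE = "http://localhost:8000"

obs = requests.post(f"{BASE}/reset", json={"task_id": "easy"}).json()

while not obs.get("done"):
    action = {
        "action_type": "submit_fix",
        "fixed_code": "...complete corrected file from your agent...",
        "hypothesis": "the while loop terminates one step too early",
    }
    resp = requests.post(f"{BASE}/step", json=action).json()
    obs = resp["observation"]                       # assumed response layout
    print(resp["reward"]["step_reward"], obs["current_error_output"][:200])
```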

---

## Installation & Usage

### Local Setup

```bash
git clone https://github.com/shasshaank/AgentDebuggerEnv
cd AgentDebuggerEnv
pip install -r requirements.txt

# Start the environment server
uvicorn env.server:app --reload --port 8000

# Verification: Run the pre-submission validator
python validator.py

# Verify it's running
curl http://localhost:8000/health
```

### Docker

```bash
docker build -t agentdebugger-env .
docker run -p 8000:8000 agentdebugger-env
```

### Running the Baseline Inference Script

```bash
# With the environment server running (see Local Setup above), verify it responds:
curl http://localhost:8000/health
# {"status": "ok", "environment": "agentdebugger-env", "version": "1.0.0"}

# Run baseline inference
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your_api_key"
export ENV_BASE_URL="http://localhost:8000"
python inference.py
```

Using Meta-Llama via HuggingFace (Recommended):

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-70B-Instruct"
export HF_TOKEN="your_huggingface_token"
export ENV_BASE_URL="http://localhost:8000"
python inference.py
```

---

## Environment Variables

| Variable | Description | Default |
|---|---|---|
| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | Model identifier | `meta-llama/Llama-3.1-70B-Instruct` |
| `HF_TOKEN` | Hugging Face token (read access) | (required) |
| `ENV_BASE_URL` | Environment server address | `http://localhost:8000` |

---

## Project Structure

```
AgentDebuggerEnv/
├── inference.py                  # Baseline script (root: hackathon requirement)
├── env/
│   ├── environment.py            # Core OpenEnv: reset(), step(), state()
│   ├── models.py                 # Pydantic v2 Observation, Action, Reward
│   ├── sandbox.py                # AST-based sandboxed code execution
│   ├── server.py                 # FastAPI: /reset /step /state /health /tasks
│   ├── tasks/
│   │   ├── task_easy.py          # Off-by-one in binary search
│   │   ├── task_medium.py        # Red herring authentication bug
│   │   └── task_hard.py          # Concurrency race condition
│   └── graders/
│       ├── grader_easy.py        # Test pass + efficiency scoring
│       ├── grader_medium.py      # Red herring detection + score floor fix
│       └── grader_hard.py        # Sequential + concurrent stress test
├── openenv.yaml
├── Dockerfile
├── requirements.txt
└── uv.lock                       # Reproducible dependency resolution
```

---

## Design Decisions

**Why is hypothesis mandatory?** Requiring a hypothesis on every `submit_fix` prevents degenerate strategies of submitting random code until something passes. It also enables the grader to score `hypothesis_accuracy` independently from `test_pass_ratio`, measuring reasoning quality separately from outcome quality.

**Why recalculate `test_pass_ratio` from the attempts list?** The medium buggy code passes 6/10 tests and the hard buggy code passes 8/8 tests sequentially. If the grader used the environment's `best_tests_passed` (which includes the initial buggy code run at reset), a dummy agent that submits nothing would score 0.36 and 0.40 for free. Recalculating from the `attempts` list guarantees the score floor is 0.0.

**Why run the concurrent stress test 5 times?** Race conditions are non-deterministic. A partial fix that narrows the race window may pass once by luck. Requiring 4 of 5 runs to pass provides a robust statistical threshold that filters out lucky partial fixes while allowing for minor runner jitter. Passing 2 of 5 gives 0.15, partial credit for progress.

**Why not use pytest directly?** Using pytest as the test runner makes output parsing dependent on pytest's version and output format. The environment uses a lightweight custom test runner embedded as a Python string, producing a consistent `"N passed, M failed"` format that `_parse_tests_passed()` can reliably parse across all platforms and environments.
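A sketch of what parsing that format amounts to (the helper name comes from the codebase; the body here is illustrative):

```python
import re

def _parse_tests_passed(output: str) -> tuple[int, int]:
    # The embedded runner always ends with a summary line like "7 passed, 1 failed"
    match = re.search(r"(\d+) passed, (\d+) failed", output)
    if match is None:
        return 0, 0                 # e.g. crash or timeout before the summary printed
    return int(match.group(1)), int(match.group(2))
```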

**Why does `query_context` cost reward after the first use?** Free unlimited context queries would allow agents to trivially read all available information before attempting any fix. The cost structure forces agents to make strategic decisions about when additional information is worth spending a step on, which is a core part of real debugging under time pressure.

---

## License & Attribution

**License:** MIT (see [LICENSE](LICENSE))

**Author:** Shashaank | GitHub: [@shasshaank](https://github.com/shasshaank) | HF: [@shashaank0707](https://huggingface.co/shashaank0707)

**Live Environment:** https://huggingface.co/spaces/shashaank0707/AgentDebugger-env

**Submitted to:** Meta + PyTorch + HuggingFace OpenEnv Hackathon

---

## Submission Integrity

- **Commit SHA:** `5c507c313ff2c209d7b770af6f08cf6ed6ab1568`
- **Last Verified Sync:** 2026-04-09
- **Platform Match:** GitHub and HF Space are in sync at this HEAD