File size: 5,678 Bytes
f42fa9e
22cb7e7
f42fa9e
 
 
 
 
22cb7e7
 
f42fa9e
 
e2cf8f8
 
865526d
e2cf8f8
22cb7e7
e2cf8f8
 
 
159a5fa
 
 
 
 
 
 
 
 
 
 
 
 
22cb7e7
e2cf8f8
22cb7e7
e2cf8f8
22cb7e7
 
 
 
 
e2cf8f8
 
 
22cb7e7
e2cf8f8
22cb7e7
 
159a5fa
8807d25
159a5fa
e2cf8f8
22cb7e7
 
 
 
 
e2cf8f8
 
 
22cb7e7
e2cf8f8
22cb7e7
 
 
 
 
e2cf8f8
 
 
22cb7e7
e2cf8f8
22cb7e7
e2cf8f8
22cb7e7
 
 
 
e2cf8f8
 
 
22cb7e7
e2cf8f8
22cb7e7
e2cf8f8
22cb7e7
 
 
865526d
e2cf8f8
22cb7e7
865526d
22cb7e7
865526d
22cb7e7
e2cf8f8
22cb7e7
 
a2ff803
 
 
 
 
e2cf8f8
 
 
 
159a5fa
 
 
 
 
 
 
 
 
 
 
 
22cb7e7
 
 
e2cf8f8
22cb7e7
 
 
 
e2cf8f8
 
 
22cb7e7
 
8807d25
22cb7e7
8807d25
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
---
title: AgentDebugger-Env πŸ›
emoji: πŸ“ˆ
colorFrom: yellow
colorTo: green
sdk: docker
app_port: 8000
pinned: true
license: mit
---

# AgentDebuggerEnv πŸ›

> **Benchmarking Agentic Reasoning through the Iterative Hypothesis-Test-Fix Loop.**

**AgentDebuggerEnv** is an OpenEnv-compliant benchmarking environment designed for the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**. Unlike static code-repair benchmarks that only measure the final output, AgentDebuggerEnv evaluates the *cognitive trajectory* of an agent: how it forms hypotheses, interprets execution failures, and iterates toward a solution in a secure, live sandbox.

---

## πŸ“Š Baseline Performance

Tested with **GPT-4o** using the standard `inference.py` script:

- **Easy (0.85)**: Solved in 1-2 attempts; clear signal from error output.
- **Medium (0.50)**: Solved in ~4 attempts; agents must resist a red-herring authentication error.
- **Hard (0.18)**: Rarely solved; agents must proactively design concurrent tests to surface the hidden race condition.
- **Mean Score: 0.51**

*Measurements taken over multiple runs to account for LLM variance. See `openenv.yaml` for full metadata.*

---

## πŸš€ The Core Philosophy

Traditional benchmarks (like HumanEval or MBPP) are "one-shot": the model sees a prompt and writes code. Real-world engineering is **iterative**.

AgentDebuggerEnv forces agents to operate in a **live feedback loop**:
1.  **Observe**: Analyze existing buggy code and initial test failures.
2.  **Hypothesize**: Explicitly state a theory about the root cause (scored for accuracy).
3.  **Act**: Submit a surgical fix or query the environment for more context.
4.  **Verify**: Observe real-time `stdout/stderr` from a sandboxed test suite execution.

---

## πŸ› οΈ Technical Architecture

### 1. Robust Security Sandbox
Every submission is executed in a multi-layered isolated environment:
*   **AST Filtering**: A deep Abstract Syntax Tree (AST) pass analyzes submitted code before execution, blocking dangerous imports (`os`, `sys`, `subprocess`, `socket`, etc.) and preventing the override of security-critical builtins.
*   **Process Isolation**: Executes in a separate subprocess with strict resource limits (CPU/Memory) enforced via container runtime and execution timeouts (15s). Any attempt to hang the environment results in immediate termination.
*   **Thread Safety**: A specialized "Concurrency Sandbox" allows multi-threaded tests (essential for the Hard Task) while maintaining strict host-level security boundaries.

### 2. High-Fidelity Feedback
Instead of binary `Pass/Fail` bits, the environment returns the **raw execution stream**. This allows agents to:
*   Read stack traces.
*   See partial progress (e.g., "6 passed, 2 failed").
*   Detect timeouts and resource exhaustion.

---

## πŸ“ Task Suite & Reasoning Challenges

| Task | Difficulty | Reasoning Challenge | Why it's hard |
| :--- | :--- | :--- | :--- |
| **Easy** | 🟒 Easy | **Off-by-One** | Requires basic logic verification. The error message is high-signal. |
| **Medium** | 🟑 Medium | **Red Herring** | The symptom (MD5 hashing error) manifests far from the root cause. Agent must trace data flow backward. |
| **Hard** | πŸ”΄ Hard | **Race Condition** | **Invisible to sequential tests.** The agent must reason that passing tests do *not* mean the code is correct, and design a concurrent stress test. |

---

## πŸ“Š Professional Grading Methodology

Our graders don't just check if the code works at the end. They score the **process**:

*   **Sequential Correctness (40%)**: Does the fix pass the original unit tests?
*   **Hidden Strength (30%)**: Does the fix survive a high-concurrency (1000-thread) stress test? (Hard task only).
*   **Hypothesis Accuracy (20%)**: Did the agent correctly identify the bug? (NLP-based keyword matching against ground truth).
*   **Efficiency Bonus (10%)**: Did the agent solve it within 5 attempts?

---

## βš™οΈ Installation & Usage

### πŸ“¦ Local Setup
```bash
git clone https://huggingface.co/spaces/shashaank0707/AgentDebugger-env
cd AgentDebugger-env
pip install -e .
```

### 🚒 Running the Environment
```bash
# Start the FastAPI server
uvicorn env.server:app --host 0.0.0.0 --port 8000
```

### πŸ€– Running an Agent (OpenEnv Baseline)
```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your_openai_key"
export ENV_BASE_URL="http://localhost:8000"
python inference.py
```

---

### πŸ” Environment Variables

| Variable | Description | Standard Fallback |
| :--- | :--- | :--- |
| `API_BASE_URL` | LLM API endpoint | `https://api.openai.com/v1` |
| `MODEL_NAME` | Model to evaluate | `gpt-4o` |
| `HF_TOKEN` | Auth token (or OpenAI key) | β€” |
| `OPENAI_API_KEY` | Alternative auth token | β€” |
| `ENV_BASE_URL` | Address of the FastAPI server | `http://localhost:8000` |

---

## πŸ”— OpenEnv API Compliance

AgentDebuggerEnv implements the full OpenEnv specification:

*   `POST /reset`: Initialize a task (`{"task_id": "medium"}`).
*   `POST /step`: Submit an `Action` (supports `submit_fix`, `query_context`, `give_up`).
*   `GET /state`: Retrieve full episode history and current environment state.
*   `GET /health`: Standard health check for automated uptime monitoring.

---

## πŸ“œ Metadata & License
*   **License**: [MIT](LICENSE)
*   **Author**: Shashaank (GitHub: @shasshaank, HF: @shashaank0707)
*   **Hackathon**: Meta + PyTorch + HuggingFace OpenEnv 2024

---

### βœ… Submission Integrity
- **Commit SHA**: `159a5faf82fc1ab3709f9674becf9a3ec55cf562`
- **Last Verified Sync**: 2026-04-08
- **Platform Match**: GitHub and HF Space are identical at this HEAD.