# AgentDebuggerEnv β€” Implementation Plan

An OpenEnv-compliant debugging environment where AI agents fix broken code through iterative hypothesis-test-fix cycles. Submission for the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**.

## User Review Required

> [!IMPORTANT]
> This is a large project with **15+ files** to create. The entire codebase needs to be built from scratch (only the README exists currently). Please confirm you'd like me to proceed with the full implementation.

> [!WARNING]
> The README specifies `huggingface_space: shashaank/agentdebugger-env`. You'll need to create this HuggingFace Space and deploy the Docker container there for the hackathon submission. I'll build everything locally; deployment is a manual step.

## Proposed Changes

The implementation follows the exact order from the README's Section 14 checklist. Each step depends on the previous.

---

### Step 1: Sandbox (`env/sandbox.py`) — Build & Test First

This is the most security-critical component. Every code execution goes through here.

#### [NEW] [sandbox.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/sandbox.py)

- `execute_code(code, test_code, allow_threading=False) → (str, bool, int)`
- AST-based import detection (not string matching) to block dangerous imports
- `BLOCKED_IMPORTS` list: os, sys, subprocess, socket, importlib, shutil, pathlib, glob, pickle, shelve, dbm, sqlite3, ftplib, http, urllib, requests, httpx, asyncio, multiprocessing, threading (unless `allow_threading=True`), ctypes, cffi, resource, signal, mmap, gc
- Write code + test_code to a temp file, run in subprocess with `timeout=10`
- Capture merged stdout+stderr
- Clean up temp files in `finally` block
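
The AST-based import check could be sketched as follows (function name and the truncated `BLOCKED_IMPORTS` subset here are illustrative; the full list is in the bullet above):

```python
import ast

# Illustrative subset of the full BLOCKED_IMPORTS list
BLOCKED_IMPORTS = {"os", "sys", "subprocess", "socket", "threading"}

def find_blocked_imports(code: str, allow_threading: bool = False) -> list[str]:
    """Walk the AST and collect blocked module names, catching both
    `import x` and `from x import y` forms (string matching would miss
    aliases and catch false positives in comments/strings)."""
    blocked = set(BLOCKED_IMPORTS)
    if allow_threading:
        blocked.discard("threading")
    violations = []
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return []  # syntax errors surface later via the subprocess run
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                root = alias.name.split(".")[0]
                if root in blocked:
                    violations.append(root)
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if root in blocked:
                violations.append(root)
    return violations
```

Checking only the root package (`os.path` → `os`) closes the submodule loophole.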

#### [NEW] [test_sandbox.py](file:///Users/shashaankjain/Desktop/meta_hackathon/tests/test_sandbox.py)

- 5 required tests: timeout, os blocked, sys blocked, clean code runs, syntax error returns output

---

### Step 2: Data Models

#### [NEW] [models.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/models.py)

- `FixAttempt`, `Observation`, `Action`, `Reward` — all Pydantic v2 BaseModel subclasses
- Exact field names and types from README Section 3
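
A minimal sketch of two of these models (field names here are placeholders; the authoritative ones are in README Section 3):

```python
from typing import Optional
from pydantic import BaseModel

class FixAttempt(BaseModel):
    """One submitted fix and its outcome (illustrative fields)."""
    attempt_number: int
    code: str
    tests_passed: int = 0
    tests_total: int = 0
    output: str = ""

class Action(BaseModel):
    """An agent action, routed by `action_type`."""
    action_type: str  # "submit_fix" | "query_context" | "give_up"
    code: Optional[str] = None
    hypothesis: Optional[str] = None
```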

---

### Step 3: Task Definitions

#### [NEW] [task_easy.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/tasks/task_easy.py)

- Binary search with `<` instead of `<=` bug
- 8-test suite, 7 pass initially, 1 fails (last element)
- Ground truth: `hypothesis_keywords`: ["left <= right", "termination", "last element", "off by one", "<="]
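
The bug pattern looks roughly like this (a sketch, not the actual task code; only the loop condition differs from the fixed version):

```python
def binary_search_buggy(arr, target):
    """Return the index of target, or -1. BUG: `<` instead of `<=`
    means a search that narrows to a single remaining element exits
    the loop without ever checking it."""
    left, right = 0, len(arr) - 1
    while left < right:  # BUG: should be `left <= right`
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1
```

Searching for the last element drives the window down to a single unchecked index, which is the one failing test.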

#### [NEW] [task_medium.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/tasks/task_medium.py)

- `hash_password`, `validate_password`, `authenticate_user` — bug is in `hash_password`
- 10-test suite, 6 pass, 4 fail (edge cases with hash mismatch)
- Red herring: error points to `authenticate_user` but bug is in `hash_password`
- Hypothesis must mention "hash_password" AND at least 1 other keyword
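
The described bytes/str failure mode, sketched (this illustrates the `str()`-on-bytes bug only; the real task code and `user_db` design are still open, per the question below):

```python
import hashlib

def hash_password_buggy(password: str) -> str:
    """BUG: str() on raw bytes returns the repr (e.g. "b'\\x9f...'"),
    not a hex string, so stored and recomputed hashes can diverge."""
    digest = hashlib.sha256(password.encode()).digest()
    return str(digest)

def hash_password_fixed(password: str) -> str:
    """Correct: hexdigest() gives a stable 64-char hex string."""
    return hashlib.sha256(password.encode()).hexdigest()
```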

#### [NEW] [task_hard.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/tasks/task_hard.py)

- `ConnectionCounter` with race condition in `increment()`/`decrement()`
- 8 sequential tests all pass on buggy code
- Bug only surfaces under concurrent access
- `allow_threading=True` for this task
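
The race-condition pattern, sketched (class and method names taken from the bullets above; the unsynchronized read-modify-write is the bug):

```python
class ConnectionCounter:
    """Unsafe counter: increment/decrement are non-atomic
    read-modify-write sequences, so concurrent calls can interleave
    and lose updates. Sequential tests never expose this."""
    def __init__(self):
        self.count = 0

    def increment(self):
        current = self.count       # read
        # another thread can interleave here
        self.count = current + 1   # write (may clobber a concurrent update)

    def decrement(self):
        current = self.count
        self.count = current - 1
```

Under purely sequential use the counter behaves correctly, which is why all 8 initial tests pass.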

#### [NEW] [registry.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/tasks/registry.py)

- Maps `"easy"` / `"medium"` / `"hard"` → task config dict (buggy_code, test_suite, description, ground_truth, max_attempts, max_steps)
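
The registry can be a plain dict keyed by difficulty (a sketch; the `"..."` strings and the numeric limits are placeholders, with real values coming from the task modules):

```python
# Placeholder values; real configs come from task_easy/medium/hard
TASK_REGISTRY = {
    "easy": {
        "buggy_code": "...",
        "test_suite": "...",
        "description": "Binary search off-by-one",
        "ground_truth": {"hypothesis_keywords": ["left <= right", "<="]},
        "max_attempts": 5,
        "max_steps": 20,
    },
    # "medium" and "hard" follow the same shape
}

def get_task(task_id: str) -> dict:
    """Look up a task config, failing loudly on unknown ids."""
    if task_id not in TASK_REGISTRY:
        raise KeyError(f"Unknown task_id: {task_id!r}")
    return TASK_REGISTRY[task_id]
```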

#### [NEW] [`__init__.py` files](file:///Users/shashaankjain/Desktop/meta_hackathon/env/__init__.py)

- `env/__init__.py` and `env/tasks/__init__.py`

---

### Step 4: Graders

#### [NEW] [base_grader.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/graders/base_grader.py)

- Abstract base class with `score()` method

#### [NEW] [grader_easy.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/graders/grader_easy.py)

- Standard formula: 0.60 test_pass_ratio + 0.20 efficiency + 0.15 hypothesis + 0.05 early_solve
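
The standard formula as a sketch (assuming each component is already normalized into [0, 1]; the final clamp guarantees the range property tested in Step 4):

```python
def score_easy(test_pass_ratio: float, efficiency: float,
               hypothesis: float, early_solve: float) -> float:
    """Weighted sum per the standard formula. Deterministic by
    construction: no randomness, no external state."""
    raw = (0.60 * test_pass_ratio
           + 0.20 * efficiency
           + 0.15 * hypothesis
           + 0.05 * early_solve)
    return max(0.0, min(1.0, raw))  # clamp into [0.0, 1.0]
```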

#### [NEW] [grader_medium.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/graders/grader_medium.py)

- Same formula but with red herring detection: hypothesis mentioning only "authenticate_user" scores 0.0

#### [NEW] [grader_hard.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/graders/grader_hard.py)

- Custom weights: 0.40 original tests + 0.30 concurrent stress test + 0.20 hypothesis + 0.10 efficiency
- Runs a 1000-thread concurrent stress test against submitted code
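
The stress test might look like this (a sketch; thread and iteration counts are scaled down here for speed, and `SafeCounter` is an illustrative reference implementation, not task code):

```python
import threading

def stress_test(counter_cls, n_threads: int = 100, iters: int = 100) -> bool:
    """Hammer increment() from many threads; a correctly locked counter
    must end at exactly n_threads * iters, a racy one usually won't."""
    counter = counter_cls()

    def worker():
        for _ in range(iters):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.count == n_threads * iters

class SafeCounter:
    """Lock-protected reference implementation for comparison."""
    def __init__(self):
        self.count = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.count += 1
```

Note the racy version can pass by luck on a given run, which is why the grader should use a high thread count (the plan's 1000) and ideally repeat the test.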

#### [NEW] [test_graders.py](file:///Users/shashaankjain/Desktop/meta_hackathon/tests/test_graders.py)

- Determinism tests (same input β†’ same output)
- Range tests (output always in [0.0, 1.0])

---

### Step 5: Environment Core

#### [NEW] [environment.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/environment.py)

- `DebuggerEnvironment` class with `reset(task_id)`, `step(action)`, `state()` methods
- `reset()`: loads task, runs buggy code through sandbox to get initial error output
- `step()`: routes by `action_type` — submit_fix → sandbox, query_context → return info, give_up → run grader
- All action rules from Section 3.2 implemented exactly
- Step-level reward calculation per Section 6.1
- Episode-level grader invocation on `done=True`
- Never crashes — all errors returned in `info["error"]`
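
The `step()` routing and the never-crash rule can be sketched together (handler names and the dict-based action are assumptions; the real version uses the Pydantic `Action` model and calls the sandbox/graders):

```python
class DebuggerEnvironmentSketch:
    """Minimal routing skeleton with stubbed handlers."""
    def __init__(self):
        self.steps = 0

    # --- stub handlers (real ones call the sandbox / graders) ---
    def _run_fix(self, action):
        return {"output": "ran tests"}, 0.1, False

    def _answer_query(self, action):
        return {"output": "context info"}, 0.0, False

    def _finalize(self):
        return {"output": "episode over"}, 0.0, True

    def step(self, action: dict):
        """Route by action_type; never raise — errors go in info."""
        info = {}
        self.steps += 1
        try:
            kind = action.get("action_type")
            if kind == "submit_fix":
                obs, reward, done = self._run_fix(action)
            elif kind == "query_context":
                obs, reward, done = self._answer_query(action)
            elif kind == "give_up":
                obs, reward, done = self._finalize()
            else:
                obs, reward, done = {}, 0.0, False
                info["error"] = f"unknown action_type: {kind!r}"
        except Exception as exc:  # never crash the server
            obs, reward, done = {}, 0.0, False
            info["error"] = str(exc)
        return obs, reward, done, info
```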

#### [NEW] [test_environment.py](file:///Users/shashaankjain/Desktop/meta_hackathon/tests/test_environment.py)

- Unit tests for reset/step/state

---

### Step 6: FastAPI Server

#### [NEW] [server.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/server.py)

- `POST /reset` — body: `{"task_id": "easy"}`, returns Observation JSON
- `POST /step` — body: Action JSON, returns `{"observation", "reward", "done", "info"}`
- `GET /state` — returns full state dict
- `GET /health` — returns `{"status": "ok", "environment": "agentdebugger-env", "version": "1.0.0"}` with HTTP 200

---

### Step 7: Inference Script

#### [NEW] [inference.py](file:///Users/shashaankjain/Desktop/meta_hackathon/inference.py)

- Exact code from README Section 8 — already fully specified
- Root directory placement (not in `env/`)
- Reads env vars: `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`, `ENV_BASE_URL`
- Uses `openai` Python client
- Saves `baseline_results.json`

---

### Step 8: Configuration & Deployment

#### [NEW] [openenv.yaml](file:///Users/shashaankjain/Desktop/meta_hackathon/openenv.yaml)

- Exact content from README Section 9

#### [NEW] [Dockerfile](file:///Users/shashaankjain/Desktop/meta_hackathon/Dockerfile)

- Exact content from README Section 10

#### [NEW] [requirements.txt](file:///Users/shashaankjain/Desktop/meta_hackathon/requirements.txt)

- Exact content from README Section 11

---

## Open Questions

> [!IMPORTANT]
> **Task Medium β€” The Hash Bug:** The README describes a bytes/str conversion bug in `hash_password` where `str()` wrapping adds `"b'"` prefix. I need to carefully design the `user_db` and test setup so that 6 tests pass and exactly 4 fail. The README leaves the exact test suite design for medium to the implementer. I'll design it to match the described behavior. Any preferences?

> [!IMPORTANT]
> **Hard Task Test Count:** The README says `tests_total: 8` for hard in `openenv.yaml`, but the hard task has 8 sequential tests (all pass) and the agent needs to design a concurrent test. The grader independently runs its own 1000-thread stress test. I'll keep `tests_total: 8` as the initial suite and the grader adds its own concurrent verification separately. Correct?

## Verification Plan

### Automated Tests
1. `pytest tests/test_sandbox.py -v` — All 5 sandbox tests pass
2. `pytest tests/test_graders.py -v` — Determinism and range tests pass
3. `pytest tests/test_environment.py -v` — Reset/step/state tests pass
4. Start server with `uvicorn env.server:app --port 8000`, then:
   - `curl http://localhost:8000/health` → 200 with correct JSON
   - POST `/reset` for each task → valid Observation
   - POST `/step` with various actions → correct responses
5. Variance self-check:
   - Dummy agent (submits `pass`) → scores < 0.15
   - Perfect agent (ground truth fix + correct hypothesis) → scores > 0.85 on easy

### Manual Verification
- Docker build: `docker build -t agentdebugger-env .`
- Docker run and health check
- User deploys to HuggingFace Space and runs `openenv validate .`