# AgentDebuggerEnv – Implementation Plan
An OpenEnv-compliant debugging environment where AI agents fix broken code through iterative hypothesis-test-fix cycles. Submission for the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**.
## User Review Required
> [!IMPORTANT]
> This is a large project with **15+ files** to create. The entire codebase needs to be built from scratch (only the README exists currently). Please confirm you'd like me to proceed with the full implementation.

> [!WARNING]
> The README specifies `huggingface_space: shashaank/agentdebugger-env`. You'll need to create this HuggingFace Space and deploy the Docker container there for the hackathon submission. I'll build everything locally; deployment is a manual step.
## Proposed Changes
The implementation follows the exact order from the README's Section 14 checklist. Each step depends on the previous.
---
### Step 1: Sandbox (`env/sandbox.py`) – Build & Test First
This is the most security-critical component. Every code execution goes through here.
#### [NEW] [sandbox.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/sandbox.py)
- `execute_code(code, test_code, allow_threading=False) → (str, bool, int)`
- AST-based import detection (not string matching) to block dangerous imports
- `BLOCKED_IMPORTS` list: os, sys, subprocess, socket, importlib, shutil, pathlib, glob, pickle, shelve, dbm, sqlite3, ftplib, http, urllib, requests, httpx, asyncio, multiprocessing, threading (unless `allow_threading=True`), ctypes, cffi, resource, signal, mmap, gc
- Write code + test_code to a temp file, run in subprocess with `timeout=10`
- Capture merged stdout+stderr
- Clean up temp files in `finally` block
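The bullets above can be sketched as follows. This is illustrative only: `find_blocked_imports` is a hypothetical helper name, and the blocklist here is abbreviated relative to the full `BLOCKED_IMPORTS` list.

```python
import ast
import os
import subprocess
import sys
import tempfile

# Abbreviated blocklist for illustration; the full list is given above.
BLOCKED_IMPORTS = {"os", "sys", "subprocess", "socket", "shutil", "pickle"}

def find_blocked_imports(code, allow_threading=False):
    """Walk the AST (not string-match) and collect blocked root module names."""
    blocked = set(BLOCKED_IMPORTS)
    if not allow_threading:
        blocked.add("threading")
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return []  # let the subprocess surface the syntax error as output
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            roots = [(node.module or "").split(".")[0]]
        else:
            continue
        found.extend(r for r in roots if r in blocked)
    return found

def execute_code(code, test_code, allow_threading=False):
    """Run code + tests in a subprocess; return (output, passed, returncode)."""
    hits = find_blocked_imports(code + "\n" + test_code, allow_threading)
    if hits:
        return f"Blocked imports: {sorted(set(hits))}", False, 1
    path = None
    try:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code + "\n\n" + test_code)
            path = f.name
        proc = subprocess.run([sys.executable, path], capture_output=True,
                              text=True, timeout=10)
        return proc.stdout + proc.stderr, proc.returncode == 0, proc.returncode
    except subprocess.TimeoutExpired:
        return "Timed out after 10s", False, -1
    finally:
        if path and os.path.exists(path):
            os.unlink(path)  # always clean up the temp file
```

Catching `SyntaxError` inside the import check matters: a syntactically broken submission should still reach the subprocess so its error output is captured rather than crashing the sandbox.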
#### [NEW] [test_sandbox.py](file:///Users/shashaankjain/Desktop/meta_hackathon/tests/test_sandbox.py)
- 5 required tests: timeout, os blocked, sys blocked, clean code runs, syntax error returns output
---
### Step 2: Data Models
#### [NEW] [models.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/models.py)
- `FixAttempt`, `Observation`, `Action`, `Reward` – all Pydantic v2 `BaseModel` subclasses
- Exact field names and types from README Section 3
---
### Step 3: Task Definitions
#### [NEW] [task_easy.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/tasks/task_easy.py)
- Binary search with `<` instead of `<=` bug
- 8-test suite, 7 pass initially, 1 fails (last element)
- Ground truth: `hypothesis_keywords`: ["left <= right", "termination", "last element", "off by one", "<="]
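For illustration, a minimal version of the planted bug and its fix (the actual task file wraps these in the 8-test suite):

```python
def binary_search_buggy(arr, target):
    left, right = 0, len(arr) - 1
    while left < right:  # BUG: the last remaining index is never examined
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1

def binary_search_fixed(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:  # FIX: inclusive bound checks the final candidate
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1
```

With `<`, the loop exits as soon as the window shrinks to one index, so searching for the last element terminates without ever comparing it.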
#### [NEW] [task_medium.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/tasks/task_medium.py)
- `hash_password`, `validate_password`, `authenticate_user` – bug is in `hash_password`
- 10-test suite, 6 pass, 4 fail (edge cases with hash mismatch)
- Red herring: error points to `authenticate_user` but bug is in `hash_password`
- Hypothesis must mention "hash_password" AND at least 1 other keyword
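A hedged sketch of how the described bug could look. The README leaves the exact code to the implementer, so the function signatures, the `hasher` parameter, and the `user_db` shape here are assumptions:

```python
import hashlib

def hash_password_buggy(password):
    # BUG: str() on digest bytes stores the repr, e.g. "b'\\x12...'", not hex
    return str(hashlib.sha256(password.encode()).digest())

def hash_password_fixed(password):
    return hashlib.sha256(password.encode()).hexdigest()

def authenticate_user(user_db, username, password, hasher):
    # Red herring: failures surface here as a hash mismatch, even though this
    # comparison is correct and the actual bug lives in hash_password.
    stored = user_db.get(username)
    return stored is not None and stored == hasher(password)
```

Because the mismatch is only observed inside `authenticate_user`, a hypothesis blaming that function alone misdiagnoses the fault, which is what the medium grader penalizes.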
#### [NEW] [task_hard.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/tasks/task_hard.py)
- `ConnectionCounter` with race condition in `increment()`/`decrement()`
- 8 sequential tests all pass on buggy code
- Bug only surfaces under concurrent access
- `allow_threading=True` for this task
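The race can be illustrated like this (sketch only; the real task file and its sequential test suite may differ):

```python
import threading

class ConnectionCounter:
    """Buggy: unsynchronized read-modify-write loses updates under threads."""
    def __init__(self):
        self.count = 0

    def increment(self):
        current = self.count       # BUG: another thread can interleave here
        self.count = current + 1

    def decrement(self):
        current = self.count
        self.count = current - 1

class SafeConnectionCounter(ConnectionCounter):
    """Fixed: a lock makes each update atomic."""
    def __init__(self):
        super().__init__()
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.count += 1

    def decrement(self):
        with self._lock:
            self.count -= 1
```

Sequential calls cannot distinguish the two classes, which is why all 8 initial tests pass on the buggy code and the bug only surfaces under concurrent access.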
#### [NEW] [registry.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/tasks/registry.py)
- Maps `"easy"` / `"medium"` / `"hard"` → task config dict (buggy_code, test_suite, description, ground_truth, max_attempts, max_steps)
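The registry could take roughly this shape; the concrete values (e.g. `max_attempts=3`, `max_steps=10`) are placeholders, not numbers from the plan:

```python
TASKS = {
    "easy": {
        "buggy_code": "...",   # binary search source with the `<` bug
        "test_suite": "...",   # the 8-test suite
        "description": "Binary search fails to find the last element.",
        "ground_truth": {"hypothesis_keywords": ["left <= right", "off by one"]},
        "max_attempts": 3,     # placeholder value
        "max_steps": 10,       # placeholder value
    },
    # "medium" and "hard" entries follow the same shape
}

def get_task(task_id):
    """Return the config dict for a task, or raise on unknown ids."""
    if task_id not in TASKS:
        raise KeyError(f"unknown task_id: {task_id!r}")
    return TASKS[task_id]
```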
#### [NEW] [`__init__.py` files](file:///Users/shashaankjain/Desktop/meta_hackathon/env/__init__.py)
- `env/__init__.py` and `env/tasks/__init__.py`
---
### Step 4: Graders
#### [NEW] [base_grader.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/graders/base_grader.py)
- Abstract base class with `score()` method
#### [NEW] [grader_easy.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/graders/grader_easy.py)
- Standard formula: 0.60 test_pass_ratio + 0.20 efficiency + 0.15 hypothesis + 0.05 early_solve
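As a sketch, the weighted sum with a defensive clamp; the component inputs are assumed to already be normalized to [0, 1]:

```python
def score_easy(test_pass_ratio, efficiency, hypothesis, early_solve):
    """Weighted episode score per the standard formula above."""
    raw = (0.60 * test_pass_ratio
           + 0.20 * efficiency
           + 0.15 * hypothesis
           + 0.05 * (1.0 if early_solve else 0.0))
    return max(0.0, min(1.0, raw))  # keep the score inside [0.0, 1.0]
```

The clamp is cheap insurance for the range property that `test_graders.py` checks later.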
#### [NEW] [grader_medium.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/graders/grader_medium.py)
- Same formula but with red herring detection: hypothesis mentioning only "authenticate_user" scores 0.0
#### [NEW] [grader_hard.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/graders/grader_hard.py)
- Custom weights: 0.40 original tests + 0.30 concurrent stress test + 0.20 hypothesis + 0.10 efficiency
- Runs a 1000-thread concurrent stress test against submitted code
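The stress check could look like the sketch below. Thread-per-increment is one possible design; the grader's actual harness might instead batch many increments per thread:

```python
import threading

def passes_stress_test(counter_cls, n_threads=1000):
    """Spawn n_threads, each incrementing once; a correct counter ends at n_threads."""
    counter = counter_cls()
    threads = [threading.Thread(target=counter.increment) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.count == n_threads
```

A lock-protected counter passes deterministically; the buggy read-modify-write version can lose updates, so it fails with high probability at this thread count.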
#### [NEW] [test_graders.py](file:///Users/shashaankjain/Desktop/meta_hackathon/tests/test_graders.py)
- Determinism tests (same input → same output)
- Range tests (output always in [0.0, 1.0])
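Both properties can be checked with one helper shared across the grader tests; `score_fn` and the sample inputs are placeholders:

```python
def assert_deterministic_and_bounded(score_fn, sample_inputs):
    """Call score_fn twice per input; require identical, in-range results."""
    for args in sample_inputs:
        first, second = score_fn(*args), score_fn(*args)
        assert first == second, "same input must give same output"
        assert 0.0 <= first <= 1.0, "score must stay within [0.0, 1.0]"
```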
---
### Step 5: Environment Core
#### [NEW] [environment.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/environment.py)
- `DebuggerEnvironment` class with `reset(task_id)`, `step(action)`, `state()` methods
- `reset()`: loads task, runs buggy code through sandbox to get initial error output
- `step()`: routes by `action_type`: `submit_fix` → sandbox, `query_context` → return info, `give_up` → run grader
- All action rules from Section 3.2 implemented exactly
- Step-level reward calculation per Section 6.1
- Episode-level grader invocation on `done=True`
- Never crashes: all errors are returned in `info["error"]`
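A compressed sketch of the routing and the never-crash contract; the handlers here are stubs, while the real ones call the sandbox and graders:

```python
class DebuggerEnvironment:
    def step(self, action):
        handlers = {
            "submit_fix": self._submit_fix,        # run fix through the sandbox
            "query_context": self._query_context,  # return task information
            "give_up": self._give_up,              # end episode, invoke grader
        }
        try:
            handler = handlers.get(action.get("action_type"))
            if handler is None:
                raise ValueError(f"unknown action_type: {action.get('action_type')!r}")
            return handler(action)
        except Exception as exc:
            # Never crash: every failure is reported via info["error"]
            return {"observation": None, "reward": 0.0, "done": False,
                    "info": {"error": str(exc)}}

    def _submit_fix(self, action):
        return {"observation": "sandbox output", "reward": 0.0, "done": False, "info": {}}

    def _query_context(self, action):
        return {"observation": "task context", "reward": 0.0, "done": False, "info": {}}

    def _give_up(self, action):
        return {"observation": None, "reward": 0.0, "done": True, "info": {}}
```

Funneling every exception into the response dict is what keeps the HTTP layer simple: the server can always serialize whatever `step()` returns.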
#### [NEW] [test_environment.py](file:///Users/shashaankjain/Desktop/meta_hackathon/tests/test_environment.py)
- Unit tests for reset/step/state
---
### Step 6: FastAPI Server
#### [NEW] [server.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/server.py)
- `POST /reset` – body: `{"task_id": "easy"}`; returns Observation JSON
- `POST /step` – body: Action JSON; returns `{"observation", "reward", "done", "info"}`
- `GET /state` – returns full state dict
- `GET /health` – returns `{"status": "ok", "environment": "agentdebugger-env", "version": "1.0.0"}` with HTTP 200
---
### Step 7: Inference Script
#### [NEW] [inference.py](file:///Users/shashaankjain/Desktop/meta_hackathon/inference.py)
- Exact code from README Section 8 – already fully specified
- Root directory placement (not in `env/`)
- Reads env vars: `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`, `ENV_BASE_URL`
- Uses `openai` Python client
- Saves `baseline_results.json`
---
### Step 8: Configuration & Deployment
#### [NEW] [openenv.yaml](file:///Users/shashaankjain/Desktop/meta_hackathon/openenv.yaml)
- Exact content from README Section 9
#### [NEW] [Dockerfile](file:///Users/shashaankjain/Desktop/meta_hackathon/Dockerfile)
- Exact content from README Section 10
#### [NEW] [requirements.txt](file:///Users/shashaankjain/Desktop/meta_hackathon/requirements.txt)
- Exact content from README Section 11
---
## Open Questions
> [!IMPORTANT]
> **Task Medium – The Hash Bug:** The README describes a bytes/str conversion bug in `hash_password` where `str()` wrapping adds a `"b'"` prefix. I need to carefully design the `user_db` and test setup so that 6 tests pass and exactly 4 fail. The README leaves the exact test suite design for medium to the implementer; I'll design it to match the described behavior. Any preferences?

> [!IMPORTANT]
> **Hard Task Test Count:** The README says `tests_total: 8` for hard in `openenv.yaml`, but the hard task has 8 sequential tests (all pass) and the agent needs to design a concurrent test. The grader independently runs its own 1000-thread stress test. I'll keep `tests_total: 8` as the initial suite and the grader adds its own concurrent verification separately. Correct?
## Verification Plan
### Automated Tests
1. `pytest tests/test_sandbox.py -v` → all 5 sandbox tests pass
2. `pytest tests/test_graders.py -v` → determinism and range tests pass
3. `pytest tests/test_environment.py -v` → reset/step/state tests pass
4. Start server with `uvicorn env.server:app --port 8000`, then:
   - `curl http://localhost:8000/health` → 200 with correct JSON
   - POST `/reset` for each task → valid Observation
   - POST `/step` with various actions → correct responses
5. Variance self-check:
   - Dummy agent (submits `pass`) → scores < 0.15
   - Perfect agent (ground truth fix + correct hypothesis) → scores > 0.85 on easy
### Manual Verification
- Docker build: `docker build -t agentdebugger-env .`
- Docker run and health check
- User deploys to HuggingFace Space and runs `openenv validate .`