AgentDebuggerEnv – Implementation Plan
An OpenEnv-compliant debugging environment where AI agents fix broken code through iterative hypothesis-test-fix cycles. Submission for the Meta + PyTorch + HuggingFace OpenEnv Hackathon.
User Review Required
This is a large project with 15+ files to create. The entire codebase needs to be built from scratch (only the README exists currently). Please confirm you'd like me to proceed with the full implementation.
The README specifies `huggingface_space: shashaank/agentdebugger-env`. You'll need to create this HuggingFace Space and deploy the Docker container there for the hackathon submission. I'll build everything locally; deployment is a manual step.
Proposed Changes
The implementation follows the exact order from the README's Section 14 checklist. Each step depends on the previous.
Step 1: Sandbox (env/sandbox.py) – Build & Test First
This is the most security-critical component. Every code execution goes through here.
[NEW] sandbox.py
- `execute_code(code, test_code, allow_threading=False) -> (str, bool, int)`
- AST-based import detection (not string matching) to block dangerous imports
- `BLOCKED_IMPORTS` list: os, sys, subprocess, socket, importlib, shutil, pathlib, glob, pickle, shelve, dbm, sqlite3, ftplib, http, urllib, requests, httpx, asyncio, multiprocessing, threading (unless `allow_threading=True`), ctypes, cffi, resource, signal, mmap, gc
- Write code + test_code to a temp file, run in subprocess with `timeout=10`
- Capture merged stdout+stderr
- Clean up temp files in a `finally` block
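The steps above can be sketched as follows. This is a minimal sketch, not the final implementation: the helper name `find_blocked_imports` and the exact error messages are assumptions.

```python
import ast
import os
import subprocess
import sys
import tempfile

BLOCKED_IMPORTS = {
    "os", "sys", "subprocess", "socket", "importlib", "shutil", "pathlib",
    "glob", "pickle", "shelve", "dbm", "sqlite3", "ftplib", "http", "urllib",
    "requests", "httpx", "asyncio", "multiprocessing", "threading",
    "ctypes", "cffi", "resource", "signal", "mmap", "gc",
}

def find_blocked_imports(code: str, allow_threading: bool = False) -> list:
    """Walk the AST and collect blocked top-level module names."""
    blocked = set(BLOCKED_IMPORTS)
    if allow_threading:
        blocked.discard("threading")
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return []  # let the subprocess surface the syntax error as output
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module.split(".")[0]] if node.module else []
        else:
            continue
        found.extend(n for n in names if n in blocked)
    return found

def execute_code(code: str, test_code: str, allow_threading: bool = False):
    """Run code + tests in a subprocess; return (output, passed, returncode)."""
    bad = find_blocked_imports(code + "\n" + test_code, allow_threading)
    if bad:
        return f"Blocked imports: {sorted(set(bad))}", False, -1
    path = None
    try:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code + "\n\n" + test_code)
            path = f.name
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=10,
        )
        return proc.stdout + proc.stderr, proc.returncode == 0, proc.returncode
    except subprocess.TimeoutExpired:
        return "Execution timed out after 10s", False, -1
    finally:
        if path and os.path.exists(path):
            os.unlink(path)
```

AST walking catches `import os as o` and `from os import path`, which naive string matching misses.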
[NEW] test_sandbox.py
- 5 required tests: timeout, os blocked, sys blocked, clean code runs, syntax error returns output
Step 2: Data Models
[NEW] models.py
- `FixAttempt`, `Observation`, `Action`, `Reward` – all Pydantic v2 `BaseModel` subclasses
- Exact field names and types from README Section 3
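Assuming Pydantic v2, the models might be shaped like this; the field names shown are placeholders, since the authoritative field list is in README Section 3.

```python
from typing import Literal, Optional
from pydantic import BaseModel

class Action(BaseModel):
    action_type: Literal["submit_fix", "query_context", "give_up"]
    code: Optional[str] = None        # fixed code, for submit_fix
    hypothesis: Optional[str] = None  # agent's stated bug hypothesis

class Observation(BaseModel):
    task_id: str
    code: str           # current (buggy) code shown to the agent
    test_output: str    # merged stdout+stderr from the last run
    attempts_left: int

class FixAttempt(BaseModel):
    code: str
    passed: bool
    output: str

class Reward(BaseModel):
    value: float  # in [0.0, 1.0]
```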
Step 3: Task Definitions
[NEW] task_easy.py
- Binary search with `<` instead of `<=` bug
- 8-test suite, 7 pass initially, 1 fails (last element)
- Ground truth: `hypothesis_keywords: ["left <= right", "termination", "last element", "off by one", "<="]`
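A representative sketch of the described `<` vs `<=` termination bug (the exact task source is up to the implementer):

```python
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left < right:  # BUG: should be `left <= right`
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    # The loop exits when left == right without examining that final
    # candidate index, so boundary elements can be missed.
    return -1
```

With this version, `binary_search([1, 3, 5, 7], 7)` walks `left` up to index 3, then exits without checking it and returns -1.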
[NEW] task_medium.py
- `hash_password`, `validate_password`, `authenticate_user` – bug is in `hash_password`
- 10-test suite, 6 pass, 4 fail (edge cases with hash mismatch)
- Red herring: error points to `authenticate_user` but bug is in `hash_password`
- Hypothesis must mention "hash_password" AND at least 1 other keyword
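A sketch of the described bytes/str conversion bug, with the fix shown for contrast (`hash_password_fixed` is a hypothetical name used only for illustration):

```python
import hashlib

def hash_password(password: str) -> str:
    digest = hashlib.sha256(password.encode("utf-8")).digest()
    # BUG: str() on a bytes object produces "b'...'" with escape
    # sequences, not a hex string, so stored and freshly computed
    # hashes compare unreliably
    return str(digest)

def hash_password_fixed(password: str) -> str:
    # Fix: hexdigest() yields a stable 64-character hex string
    return hashlib.sha256(password.encode("utf-8")).hexdigest()
```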
[NEW] task_hard.py
- `ConnectionCounter` with race condition in `increment()`/`decrement()`
- 8 sequential tests all pass on buggy code
- Bug only surfaces under concurrent access
- `allow_threading=True` for this task
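A minimal sketch of the race; the `time.sleep(0)` yield is an illustrative trick to widen the race window, and the real task code may differ.

```python
import time

class ConnectionCounter:
    """Buggy: the read-modify-write on self.count is not atomic."""
    def __init__(self):
        self.count = 0

    def increment(self):
        current = self.count       # read
        time.sleep(0)              # yield: widens the race window
        self.count = current + 1   # write: may clobber another thread's update

    def decrement(self):
        current = self.count
        time.sleep(0)
        self.count = current - 1
```

Sequentially this behaves perfectly, which is why all 8 initial tests pass; only interleaved threads lose updates.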
[NEW] registry.py
- Maps `"easy"`/`"medium"`/`"hard"` → task config dict (`buggy_code`, `test_suite`, `description`, `ground_truth`, `max_attempts`, `max_steps`)
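The registry might be shaped like this; all concrete values shown (code snippets, limits, descriptions) are placeholders.

```python
# Hypothetical config shapes; real code and tests come from the task modules.
TASKS = {
    "easy": {
        "buggy_code": "def binary_search(arr, target): ...",
        "test_suite": "assert binary_search([1], 1) == 0",
        "description": "Binary search misses the last element",
        "ground_truth": {"hypothesis_keywords": ["left <= right", "<="]},
        "max_attempts": 5,   # placeholder limits
        "max_steps": 20,
    },
    # "medium" and "hard" entries follow the same schema
}

def get_task(task_id: str) -> dict:
    if task_id not in TASKS:
        raise KeyError(f"unknown task_id: {task_id!r}")
    return dict(TASKS[task_id])  # copy so callers can't mutate the registry
```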
[NEW] __init__.py files
- `env/__init__.py` and `env/tasks/__init__.py`
Step 4: Graders
[NEW] base_grader.py
- Abstract base class with a `score()` method
[NEW] grader_easy.py
- Standard formula: `0.60 * test_pass_ratio + 0.20 * efficiency + 0.15 * hypothesis + 0.05 * early_solve`
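A sketch of the standard formula; the efficiency term's exact definition (a linear decay over attempts here) is an assumption, and the README remains authoritative.

```python
def score_standard(tests_passed, tests_total, attempts_used, max_attempts,
                   hypothesis_ok, solved_first_try):
    """Weighted score in [0.0, 1.0] per the standard formula."""
    test_pass_ratio = tests_passed / tests_total
    # Assumed: full efficiency credit for 1 attempt, decaying to 0 at the cap
    efficiency = max(0.0, 1.0 - (attempts_used - 1) / max(max_attempts - 1, 1))
    score = (0.60 * test_pass_ratio
             + 0.20 * efficiency
             + 0.15 * (1.0 if hypothesis_ok else 0.0)
             + 0.05 * (1.0 if solved_first_try else 0.0))
    return min(max(score, 0.0), 1.0)  # clamp for safety
```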
[NEW] grader_medium.py
- Same formula but with red herring detection: hypothesis mentioning only "authenticate_user" scores 0.0
[NEW] grader_hard.py
- Custom weights: 0.40 original tests + 0.30 concurrent stress test + 0.20 hypothesis + 0.10 efficiency
- Runs a 1000-thread concurrent stress test against submitted code
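The stress test could look like this; the function name and the thread/iteration counts are illustrative.

```python
import threading

def concurrent_stress_test(counter_cls, n_threads=1000, increments=10):
    """Hammer the submitted counter from many threads; lost updates on
    racy code make the final count fall short of the expected total."""
    counter = counter_cls()

    def work():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.count == n_threads * increments
```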
[NEW] test_graders.py
- Determinism tests (same input → same output)
- Range tests (output always in [0.0, 1.0])
Step 5: Environment Core
[NEW] environment.py
- `DebuggerEnvironment` class with `reset(task_id)`, `step(action)`, `state()` methods
- `reset()`: loads task, runs buggy code through sandbox to get initial error output
- `step()`: routes by `action_type` → submit_fix → sandbox, query_context → return info, give_up → run grader
- All action rules from Section 3.2 implemented exactly
- Step-level reward calculation per Section 6.1
- Episode-level grader invocation on `done=True`
- Never crashes – all errors returned in `info["error"]`
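The routing and never-crash contract can be sketched as follows; the helper methods `_run_fix`, `_context`, and `_grade` are hypothetical stand-ins for the sandbox, context, and grader paths.

```python
class DebuggerEnvironment:
    """Routing skeleton only; real reward and grader logic is elided."""

    def step(self, action: dict):
        try:
            kind = action.get("action_type")
            if kind == "submit_fix":
                obs, reward, done = self._run_fix(action)
            elif kind == "query_context":
                obs, reward, done = self._context(action)
            elif kind == "give_up":
                obs, reward, done = self._grade()
            else:
                return {}, 0.0, False, {"error": f"unknown action_type: {kind!r}"}
            return obs, reward, done, {}
        except Exception as e:
            # the "never crashes" rule: every failure surfaces via info
            return {}, 0.0, False, {"error": str(e)}
```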
[NEW] test_environment.py
- Unit tests for reset/step/state
Step 6: FastAPI Server
[NEW] server.py
- `POST /reset` – body: `{"task_id": "easy"}`, returns Observation JSON
- `POST /step` – body: Action JSON, returns `{"observation", "reward", "done", "info"}`
- `GET /state` – returns full state dict
- `GET /health` – returns `{"status": "ok", "environment": "agentdebugger-env", "version": "1.0.0"}` with HTTP 200
Step 7: Inference Script
[NEW] inference.py
- Exact code from README Section 8 – already fully specified
- Root directory placement (not in `env/`)
- Reads env vars: `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`, `ENV_BASE_URL`
- Uses the `openai` Python client
- Saves `baseline_results.json`
Step 8: Configuration & Deployment
[NEW] openenv.yaml
- Exact content from README Section 9
[NEW] Dockerfile
- Exact content from README Section 10
[NEW] requirements.txt
- Exact content from README Section 11
Open Questions
Task Medium – The Hash Bug: The README describes a bytes/str conversion bug in `hash_password` where `str()` wrapping adds a `b'` prefix. I need to carefully design the `user_db` and test setup so that 6 tests pass and exactly 4 fail. The README leaves the exact test-suite design for medium to the implementer; I'll design it to match the described behavior. Any preferences?
Hard Task Test Count: The README says `tests_total: 8` for hard in `openenv.yaml`, but the hard task has 8 sequential tests (all pass) and the agent needs to design a concurrent test. The grader independently runs its own 1000-thread stress test. I'll keep `tests_total: 8` as the initial suite, and the grader adds its own concurrent verification separately. Correct?
Verification Plan
Automated Tests
- `pytest tests/test_sandbox.py -v` – all 5 sandbox tests pass
- `pytest tests/test_graders.py -v` – determinism and range tests pass
- `pytest tests/test_environment.py -v` – reset/step/state tests pass
- Start server with `uvicorn env.server:app --port 8000`, then:
  - `curl http://localhost:8000/health` → 200 with correct JSON
  - POST `/reset` for each task → valid Observation
  - POST `/step` with various actions → correct responses
- Variance self-check:
  - Dummy agent (submits `pass`) → scores < 0.15
  - Perfect agent (ground truth fix + correct hypothesis) → scores > 0.85 on easy
Manual Verification
- Docker build: `docker build -t agentdebugger-env .`
- Docker run and health check
- User deploys to HuggingFace Space and runs `openenv validate .`