AgentDebuggerEnv - Implementation Plan

An OpenEnv-compliant debugging environment where AI agents fix broken code through iterative hypothesis-test-fix cycles. Submission for the Meta + PyTorch + HuggingFace OpenEnv Hackathon.

User Review Required

This is a large project with 15+ files to create. The entire codebase needs to be built from scratch (only the README exists currently). Please confirm you'd like me to proceed with the full implementation.

The README specifies huggingface_space: shashaank/agentdebugger-env. You'll need to create this HuggingFace Space and deploy the Docker container there for the hackathon submission. I'll build everything locally; deployment is a manual step.

Proposed Changes

The implementation follows the exact order of the README's Section 14 checklist. Each step depends on the previous one.


Step 1: Sandbox (env/sandbox.py) - Build & Test First

This is the most security-critical component. Every code execution goes through here.

[NEW] sandbox.py

  • execute_code(code, test_code, allow_threading=False) → (str, bool, int); sketched below
  • AST-based import detection (not string matching) to block dangerous imports
  • BLOCKED_IMPORTS list: os, sys, subprocess, socket, importlib, shutil, pathlib, glob, pickle, shelve, dbm, sqlite3, ftplib, http, urllib, requests, httpx, asyncio, multiprocessing, threading (unless allow_threading=True), ctypes, cffi, resource, signal, mmap, gc
  • Write code + test_code to a temp file, run in subprocess with timeout=10
  • Capture merged stdout+stderr
  • Clean up temp files in finally block
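
To make the contract concrete, a minimal sketch of this flow (the helper name _find_blocked_imports and the exact error messages are placeholders, not final):

```python
import ast
import os
import subprocess
import sys
import tempfile

# Modules blocked for sandboxed code; threading is conditionally allowed.
BLOCKED_IMPORTS = {
    "os", "sys", "subprocess", "socket", "importlib", "shutil", "pathlib",
    "glob", "pickle", "shelve", "dbm", "sqlite3", "ftplib", "http", "urllib",
    "requests", "httpx", "asyncio", "multiprocessing", "ctypes", "cffi",
    "resource", "signal", "mmap", "gc",
}

def _find_blocked_imports(source: str, allow_threading: bool) -> set[str]:
    """Collect blocked top-level module names via the AST (not string matching)."""
    blocked = BLOCKED_IMPORTS | (set() if allow_threading else {"threading"})
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return set()  # let the subprocess surface the syntax error as output
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found |= {alias.name.split(".")[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom):
            found.add((node.module or "").split(".")[0])
    return found & blocked

def execute_code(code: str, test_code: str, allow_threading: bool = False) -> tuple[str, bool, int]:
    """Run code + test_code in a subprocess; return (merged output, passed, exit code)."""
    bad = _find_blocked_imports(code + "\n" + test_code, allow_threading)
    if bad:
        return f"Blocked imports: {sorted(bad)}", False, 1
    path = None
    try:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code + "\n\n" + test_code)
            path = f.name
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=10)
        return proc.stdout + proc.stderr, proc.returncode == 0, proc.returncode
    except subprocess.TimeoutExpired:
        return "Execution timed out after 10s", False, -1
    finally:
        if path and os.path.exists(path):
            os.unlink(path)  # clean up the temp file
```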

[NEW] test_sandbox.py

  • 5 required tests: timeout, os blocked, sys blocked, clean code runs, syntax error returns output
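
Two of the five tests, sketched with pytest (the asserted messages assume the sketch above and may change):

```python
from env.sandbox import execute_code

def test_timeout():
    output, passed, _ = execute_code("while True:\n    pass", "")
    assert not passed
    assert "timed out" in output.lower()

def test_os_blocked():
    output, passed, _ = execute_code("import os", "")
    assert not passed
    assert "os" in output
```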

Step 2: Data Models

[NEW] models.py

  • FixAttempt, Observation, Action, Reward - all Pydantic v2 BaseModel subclasses
  • Exact field names and types from README Section 3
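
Since the exact fields live in README Section 3, the shapes below are illustrative only:

```python
from pydantic import BaseModel

class FixAttempt(BaseModel):
    code: str
    output: str
    passed: bool

class Action(BaseModel):
    action_type: str  # "submit_fix" | "query_context" | "give_up"
    payload: dict = {}

class Observation(BaseModel):
    task_description: str
    test_output: str
    attempts_remaining: int

class Reward(BaseModel):
    value: float  # episode score in [0.0, 1.0]
```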

Step 3: Task Definitions

[NEW] task_easy.py

  • Binary search with < instead of <= bug
  • 8-test suite, 7 pass initially, 1 fails (last element)
  • Ground truth: hypothesis_keywords: ["left <= right", "termination", "last element", "off by one", "<="]
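
A plausible shape for the seeded bug (the README's exact code may differ):

```python
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left < right:  # BUG: should be `left <= right`
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1

# The loop never examines the last remaining candidate:
# binary_search([1, 3, 5, 7], 7) returns -1 instead of 3
```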

[NEW] task_medium.py

  • hash_password, validate_password, authenticate_user - the bug is in hash_password
  • 10-test suite, 6 pass, 4 fail (edge cases with hash mismatch)
  • Red herring: the error points to authenticate_user, but the bug is in hash_password
  • The hypothesis must mention "hash_password" AND at least one other keyword
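
Based on the bytes/str issue described under Open Questions, the seeded bug likely looks something like this (hypothetical shape):

```python
import hashlib

def hash_password(password: str) -> str:
    digest = hashlib.sha256(password.encode()).digest()
    # BUG: str(bytes) produces the "b'...'" repr, not the intended hex string
    return str(digest)
    # intended: hashlib.sha256(password.encode()).hexdigest()
```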

[NEW] task_hard.py

  • ConnectionCounter with race condition in increment()/decrement()
  • 8 sequential tests all pass on buggy code
  • Bug only surfaces under concurrent access
  • allow_threading=True for this task
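
An illustrative version of the race (the actual task code may differ): the read-modify-write sequence is not atomic, so sequential tests pass while concurrent updates get lost.

```python
class ConnectionCounter:
    def __init__(self):
        self.count = 0

    def increment(self):
        current = self.count       # two threads can read the same value...
        self.count = current + 1   # ...and one of the updates is lost

    def decrement(self):
        current = self.count
        self.count = current - 1
```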

[NEW] registry.py

  • Maps "easy" / "medium" / "hard" → a task config dict (buggy_code, test_suite, description, ground_truth, max_attempts, max_steps)
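
A sketch of the registry shape (the task-module attribute names and limit values are placeholders; the real values come from the README):

```python
from env.tasks import task_easy, task_medium, task_hard

TASK_REGISTRY = {
    "easy": {
        "buggy_code": task_easy.BUGGY_CODE,
        "test_suite": task_easy.TEST_SUITE,
        "description": task_easy.DESCRIPTION,
        "ground_truth": task_easy.GROUND_TRUTH,
        "max_attempts": 5,  # placeholder
        "max_steps": 20,    # placeholder
    },
    # "medium" and "hard" follow the same shape
}

def get_task(task_id: str) -> dict:
    return TASK_REGISTRY[task_id]
```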

[NEW] __init__.py files

  • env/__init__.py and env/tasks/__init__.py

Step 4: Graders

[NEW] base_grader.py

  • Abstract base class with score() method

[NEW] grader_easy.py

  • Standard formula: 0.60 test_pass_ratio + 0.20 efficiency + 0.15 hypothesis + 0.05 early_solve
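
The formula itself is simple; the work is in computing the components. A minimal sketch, assuming each component is already normalized to [0, 1]:

```python
def weighted_score(test_pass_ratio: float, efficiency: float,
                   hypothesis: float, early_solve: float) -> float:
    s = (0.60 * test_pass_ratio + 0.20 * efficiency
         + 0.15 * hypothesis + 0.05 * early_solve)
    return max(0.0, min(1.0, s))  # clamp defensively to [0.0, 1.0]
```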

[NEW] grader_medium.py

  • Same formula, but with red-herring detection: a hypothesis mentioning only "authenticate_user" scores 0.0

[NEW] grader_hard.py

  • Custom weights: 0.40 original tests + 0.30 concurrent stress test + 0.20 hypothesis + 0.10 efficiency
  • Runs a 1000-thread concurrent stress test against the submitted code
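
A sketch of that stress check (anything beyond the required 1000 threads, such as the per-thread iteration count, is my assumption):

```python
import threading

def concurrent_stress_passes(counter_cls, n_threads: int = 1000, per_thread: int = 100) -> bool:
    """Hammer increment() from many threads; lost updates leave the count short."""
    counter = counter_cls()

    def worker():
        for _ in range(per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.count == n_threads * per_thread
```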

[NEW] test_graders.py

  • Determinism tests (same input → same output)
  • Range tests (output always in [0.0, 1.0])
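
Sketched with pytest (the grader class name, module path, and sample_episode fixture are assumptions):

```python
from env.graders.grader_easy import EasyGrader

def test_determinism(sample_episode):
    grader = EasyGrader()
    assert grader.score(sample_episode) == grader.score(sample_episode)

def test_score_in_range(sample_episode):
    assert 0.0 <= EasyGrader().score(sample_episode) <= 1.0
```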

Step 5: Environment Core

[NEW] environment.py

  • DebuggerEnvironment class with reset(task_id), step(action), state() methods
  • reset(): loads task, runs buggy code through sandbox to get initial error output
  • step(): routes by action_type - submit_fix → sandbox, query_context → return info, give_up → run grader
  • All action rules from Section 3.2 implemented exactly
  • Step-level reward calculation per Section 6.1
  • Episode-level grader invocation on done=True
  • Never crashes: all errors are returned in info["error"]
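
The routing and never-crash behavior in skeleton form (the handler names are placeholders):

```python
def step(self, action):
    try:
        if action.action_type == "submit_fix":
            return self._handle_submit_fix(action)    # run through the sandbox
        if action.action_type == "query_context":
            return self._handle_query_context(action)
        if action.action_type == "give_up":
            return self._handle_give_up()             # invoke the grader, done=True
        return self._observation(), 0.0, False, {"error": f"unknown action_type: {action.action_type}"}
    except Exception as exc:
        # never crash: surface all errors via info["error"]
        return self._observation(), 0.0, False, {"error": str(exc)}
```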

[NEW] test_environment.py

  • Unit tests for reset/step/state

Step 6: FastAPI Server

[NEW] server.py

  • POST /reset - body: {"task_id": "easy"}, returns Observation JSON
  • POST /step - body: Action JSON, returns {"observation", "reward", "done", "info"}
  • GET /state - returns the full state dict
  • GET /health - returns {"status": "ok", "environment": "agentdebugger-env", "version": "1.0.0"} with HTTP 200
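
Minimal FastAPI wiring for the four endpoints (request-body handling is simplified; the real handlers validate against the Pydantic models):

```python
from fastapi import FastAPI
from env.environment import DebuggerEnvironment
from env.models import Action

app = FastAPI()
env = DebuggerEnvironment()

@app.post("/reset")
def reset(body: dict):
    return env.reset(body.get("task_id", "easy"))

@app.post("/step")
def step(action: Action):
    observation, reward, done, info = env.step(action)
    return {"observation": observation, "reward": reward, "done": done, "info": info}

@app.get("/state")
def state():
    return env.state()

@app.get("/health")
def health():
    return {"status": "ok", "environment": "agentdebugger-env", "version": "1.0.0"}
```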

Step 7: Inference Script

[NEW] inference.py

  • Exact code from README Section 8; already fully specified
  • Root directory placement (not in env/)
  • Reads env vars: API_BASE_URL, MODEL_NAME, HF_TOKEN, ENV_BASE_URL
  • Uses openai Python client
  • Saves baseline_results.json

Step 8: Configuration & Deployment

[NEW] openenv.yaml

  • Exact content from README Section 9

[NEW] Dockerfile

  • Exact content from README Section 10

[NEW] requirements.txt

  • Exact content from README Section 11

Open Questions

Task Medium - The Hash Bug: The README describes a bytes/str conversion bug in hash_password where wrapping the digest in str() adds a "b'" prefix. I need to design the user_db and test setup carefully so that 6 tests pass and exactly 4 fail. The README leaves the exact test-suite design for medium to the implementer, so I'll design it to match the described behavior. Any preferences?

Hard Task Test Count: The README sets tests_total: 8 for hard in openenv.yaml, but those 8 sequential tests all pass on the buggy code; the agent is expected to design its own concurrent test, and the grader independently runs a 1000-thread stress test. I'll keep tests_total: 8 for the initial suite and let the grader add its concurrent verification separately. Correct?

Verification Plan

Automated Tests

  1. pytest tests/test_sandbox.py -v - all 5 sandbox tests pass
  2. pytest tests/test_graders.py -v - determinism and range tests pass
  3. pytest tests/test_environment.py -v - reset/step/state tests pass
  4. Start the server with uvicorn env.server:app --port 8000, then:
    • curl http://localhost:8000/health → 200 with the correct JSON
    • POST /reset for each task → valid Observation
    • POST /step with various actions → correct responses
  5. Variance self-check:
    • Dummy agent (submits pass) → scores < 0.15
    • Perfect agent (ground-truth fix + correct hypothesis) → scores > 0.85 on easy
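
A quick end-to-end smoke script complementing step 4 (the Action JSON shape is an assumption, and it assumes give_up ends the episode):

```python
import requests

BASE = "http://localhost:8000"

assert requests.get(f"{BASE}/health").json()["status"] == "ok"

obs = requests.post(f"{BASE}/reset", json={"task_id": "easy"}).json()
print(obs)  # should be a valid Observation for the easy task

result = requests.post(f"{BASE}/step", json={"action_type": "give_up"}).json()
assert result["done"] is True  # give_up triggers the episode grader
```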

Manual Verification

  • Docker build: docker build -t agentdebugger-env .
  • Docker run and health check
  • User deploys to HuggingFace Space and runs openenv validate .