---
title: AgentDebugger-Env πŸ›
emoji: πŸ“ˆ
colorFrom: yellow
colorTo: green
sdk: docker
app_port: 8000
pinned: true
license: mit
---

# AgentDebuggerEnv πŸ›

Benchmarking Agentic Reasoning through the Iterative Hypothesis-Test-Fix Loop.

AgentDebuggerEnv is an OpenEnv-compliant benchmarking environment designed for the Meta + PyTorch + HuggingFace OpenEnv Hackathon. Unlike static code-repair benchmarks that only measure the final output, AgentDebuggerEnv evaluates the cognitive trajectory of an agent: how it forms hypotheses, interprets execution failures, and iterates toward a solution in a secure, live sandbox.


## πŸ“Š Baseline Performance

Tested with GPT-4o using the standard `inference.py` script:

- **Easy (0.85):** Solved in 1–2 attempts; clear signal from the error output.
- **Medium (0.50):** Solved in ~4 attempts; agents must resist a red-herring authentication error.
- **Hard (0.18):** Rarely solved; agents must proactively design concurrent tests to surface the hidden race condition.
- **Mean Score: 0.51**

Scores are averaged over multiple runs to account for LLM variance. See `openenv.yaml` for full metadata.


## πŸš€ The Core Philosophy

Traditional benchmarks (like HumanEval or MBPP) are "one-shot": the model sees a prompt and writes code. Real-world engineering is iterative.

AgentDebuggerEnv forces agents to operate in a live feedback loop:

1. **Observe:** Analyze the existing buggy code and initial test failures.
2. **Hypothesize:** Explicitly state a theory about the root cause (scored for accuracy).
3. **Act:** Submit a surgical fix or query the environment for more context.
4. **Verify:** Observe real-time stdout/stderr from a sandboxed test-suite execution.
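The loop above can be sketched as a small driver. This is an illustration only: `reset_fn` and `step_fn` stand in for the HTTP calls to `/reset` and `/step`, and `call_llm` is a hypothetical model client, so none of these names are part of the environment's actual API.

```python
# Sketch of the Observe -> Hypothesize -> Act -> Verify loop.
# Transport functions are injected so the loop stays independent of any
# HTTP client; in practice they would POST to /reset and /step.

def debug_loop(reset_fn, step_fn, call_llm, task_id, max_attempts=5):
    obs = reset_fn(task_id)                      # 1. Observe initial failures
    for attempt in range(1, max_attempts + 1):
        hypothesis, patch = call_llm(obs)        # 2. Hypothesize a root cause
        obs = step_fn({"action": "submit_fix",   # 3. Act: submit a fix
                       "hypothesis": hypothesis,
                       "code": patch})
        if obs.get("done"):                      # 4. Verify via test output
            return obs, attempt
    return obs, max_attempts
```

Injecting the transport also makes the loop trivial to unit-test with fake environment responses.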

πŸ› οΈ Technical Architecture

1. Robust Security Sandbox

Every submission is executed in a multi-layered isolated environment:

- **AST Filtering:** A deep Abstract Syntax Tree (AST) pass analyzes submitted code before execution, blocking dangerous imports (`os`, `sys`, `subprocess`, `socket`, etc.) and preventing the override of security-critical builtins.
- **Process Isolation:** Submissions run in a separate subprocess with strict CPU/memory limits enforced by the container runtime and a 15 s execution timeout. Any attempt to hang the environment results in immediate termination.
- **Thread Safety:** A specialized "Concurrency Sandbox" allows multi-threaded tests (essential for the Hard task) while maintaining strict host-level security boundaries.
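The AST-filtering layer can be sketched in a few lines; this is a minimal illustration with an abbreviated blocklist, and the real sandbox additionally handles builtin overrides and resource limits as described above.

```python
# Minimal sketch of AST-based import filtering: walk the parsed tree and
# flag any import whose top-level package is on the blocklist.

import ast

BLOCKED = {"os", "sys", "subprocess", "socket"}

def check_submission(source: str) -> list[str]:
    """Return a list of violations found in the submitted code."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED:
                    violations.append(f"blocked import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED:
                violations.append(f"blocked import: {node.module}")
    return violations
```

Checking the parsed tree rather than the source text catches aliased forms like `import os as o` that simple string matching would miss.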

### 2. High-Fidelity Feedback

Instead of a binary pass/fail bit, the environment returns the raw execution stream. This allows agents to:

- Read stack traces.
- See partial progress (e.g., "6 passed, 2 failed").
- Detect timeouts and resource exhaustion.
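An agent can mine this raw stream for the partial-progress signal mentioned above. The `N passed, M failed` shape follows pytest's summary line; that the environment's test runner emits this exact format is an assumption here.

```python
# Extract a (passed, failed) count from a raw execution stream, assuming a
# pytest-style summary line; a stream with no summary yields (0, 0).

import re

def parse_progress(stream: str) -> tuple[int, int]:
    passed = re.search(r"(\d+) passed", stream)
    failed = re.search(r"(\d+) failed", stream)
    return (int(passed.group(1)) if passed else 0,
            int(failed.group(1)) if failed else 0)
```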

πŸ“ Task Suite & Reasoning Challenges

| Task | Difficulty | Reasoning Challenge | Why it's hard |
| --- | --- | --- | --- |
| Easy | 🟒 Easy | Off-by-One | Requires basic logic verification. The error message is high-signal. |
| Medium | 🟑 Medium | Red Herring | The symptom (MD5 hashing error) manifests far from the root cause. The agent must trace data flow backward. |
| Hard | πŸ”΄ Hard | Race Condition | Invisible to sequential tests. The agent must reason that passing tests do not mean the code is correct, and design a concurrent stress test. |
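The Hard task's lesson can be illustrated with a toy stand-in (the task's actual hidden bug is different): a naive counter passes any sequential test, yet can lose updates under concurrency, so only a threaded stress harness exposes the race.

```python
# A counter with both a racy and a lock-guarded increment, plus a small
# stress harness that hammers one of them from many threads at once.

import threading

class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment_unsafe(self):
        self.value += 1          # read-modify-write: not atomic

    def increment_safe(self):
        with self._lock:         # the fix: guard the critical section
            self.value += 1

def stress(fn, threads=50, per_thread=1000):
    workers = [threading.Thread(target=lambda: [fn() for _ in range(per_thread)])
               for _ in range(threads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
```

A sequential loop of `increment_unsafe` calls always yields the exact count; only `stress` can reveal the lost updates, which is exactly the reasoning step the Hard task demands.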

## πŸ“Š Professional Grading Methodology

Our graders don't just check if the code works at the end. They score the process:

- **Sequential Correctness (40%):** Does the fix pass the original unit tests?
- **Hidden Strength (30%):** Does the fix survive a high-concurrency (1000-thread) stress test? (Hard task only.)
- **Hypothesis Accuracy (20%):** Did the agent correctly identify the bug? (NLP-based keyword matching against ground truth.)
- **Efficiency Bonus (10%):** Did the agent solve it within 5 attempts?
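The weighted total implied by this rubric can be written out directly. Only the weights come from the rubric above; the component key names and the assumption that each component is scored on a 0–1 scale are illustrative.

```python
# Combine the four rubric components into one score using the README's
# weights; missing components default to 0.0.

WEIGHTS = {
    "sequential_correctness": 0.40,
    "hidden_strength": 0.30,
    "hypothesis_accuracy": 0.20,
    "efficiency_bonus": 0.10,
}

def composite_score(components: dict) -> float:
    return sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())
```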

βš™οΈ Installation & Usage

πŸ“¦ Local Setup

```bash
git clone https://huggingface.co/spaces/shashaank0707/AgentDebugger-env
cd AgentDebugger-env
pip install -e .
```

### 🚒 Running the Environment

```bash
# Start the FastAPI server
uvicorn env.server:app --host 0.0.0.0 --port 8000
```

### πŸ€– Running an Agent (OpenEnv Baseline)

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your_openai_key"
export ENV_BASE_URL="http://localhost:8000"
python inference.py
```

πŸ” Environment Variables

| Variable | Description | Standard Fallback |
| --- | --- | --- |
| `API_BASE_URL` | LLM API endpoint | `https://api.openai.com/v1` |
| `MODEL_NAME` | Model to evaluate | `gpt-4o` |
| `HF_TOKEN` | Auth token (or OpenAI key) | β€” |
| `OPENAI_API_KEY` | Alternative auth token | β€” |
| `ENV_BASE_URL` | Address of the FastAPI server | `http://localhost:8000` |

## πŸ”— OpenEnv API Compliance

AgentDebuggerEnv implements the full OpenEnv specification:

- `POST /reset`: Initialize a task (`{"task_id": "medium"}`).
- `POST /step`: Submit an action (supports `submit_fix`, `query_context`, `give_up`).
- `GET /state`: Retrieve the full episode history and current environment state.
- `GET /health`: Standard health check for automated uptime monitoring.

## πŸ“œ Metadata & License

- **License:** MIT
- **Author:** Shashaank (GitHub: @shasshaank, HF: @shashaank0707)
- **Hackathon:** Meta + PyTorch + HuggingFace OpenEnv 2024

## βœ… Submission Integrity

- **Commit SHA:** `159a5faf82fc1ab3709f9674becf9a3ec55cf562`
- **Last Verified Sync:** 2026-04-08
- **Platform Match:** GitHub and HF Space are identical at this HEAD.