Chirag0123 committed on
Commit
635be3f
·
1 Parent(s): a5c1fa0

v3.0 — Gradio UI + run_agent.py + full OpenEnv compliance

Browse files
Files changed (6)
  1. Dockerfile +3 -12
  2. README.md +120 -91
  3. app.py +344 -0
  4. requirements.txt +8 -6
  5. run_agent.py +290 -0
  6. test_e2e.py +56 -0
Dockerfile CHANGED
@@ -1,31 +1,22 @@
1
  FROM python:3.11-slim
2
 
3
- # Create non-root user for security — MANDATORY for running agent code safely
4
  RUN useradd -m -u 1000 envuser
5
 
6
  WORKDIR /app
7
 
8
- # Install system dependencies
9
- RUN apt-get update && apt-get install -y \
10
- git \
11
- && rm -rf /var/lib/apt/lists/*
12
 
13
- # Copy and install Python dependencies first (layer caching)
14
  COPY requirements.txt .
15
  RUN pip install --no-cache-dir -r requirements.txt
16
 
17
- # Copy project
18
  COPY . .
19
 
20
- # Make repo_templates readable
21
  RUN chmod -R 755 repo_templates/
22
-
23
- # Create temp directory for working copies
24
  RUN mkdir -p /tmp/openenv_work && chmod 777 /tmp/openenv_work
25
 
26
- # Switch to non-root for security
27
  USER envuser
28
 
29
  EXPOSE 7860
30
 
31
- CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"]
 
 
1
  FROM python:3.11-slim
2
 
 
3
  RUN useradd -m -u 1000 envuser
4
 
5
  WORKDIR /app
6
 
7
+ RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
 
 
 
8
 
 
9
  COPY requirements.txt .
10
  RUN pip install --no-cache-dir -r requirements.txt
11
 
 
12
  COPY . .
13
 
 
14
  RUN chmod -R 755 repo_templates/
 
 
15
  RUN mkdir -p /tmp/openenv_work && chmod 777 /tmp/openenv_work
16
 
 
17
  USER envuser
18
 
19
  EXPOSE 7860
20
 
21
+ # Entry point: Gradio app that also mounts FastAPI endpoints
22
+ CMD ["python", "app.py"]
README.md CHANGED
@@ -13,50 +13,92 @@ tags:
13
  - coding-agent
14
  ---
15
 
16
- # Codebase Navigation & Repair — OpenEnv Environment v2.0
17
 
18
- **An RL environment + evaluation layer that makes AI coding agents reliable, testable, and debuggable.**
19
 
20
- AI agents navigate unfamiliar Python codebases, identify bugs, and implement features — graded by running actual tests. Unlike existing benchmarks, this system provides **process-level evaluation**, not just final output scoring.
21
 
22
- ## Why This Exists
23
 
24
- Every coding agent (Devin, Cursor, Copilot, Codex) fails ~25%+ on complex tasks. Current benchmarks tell you the agent scored 0.4 but not **why** it failed. This environment answers:
25
 
26
- - Did the agent explore strategically or waste steps?
27
- - Did it verify its fixes before submitting?
28
- - Can it resist misleading comments and prompt injection?
29
- - How efficiently does it use its context window?
30
 
31
- ## Architecture
32
 
33
  ```
34
- ┌──────────────────────────────────────────────────────────┐
35
- │ FastAPI Server │
36
- │ /reset /step /state /trajectory /evaluate /metrics │
37
- └──────────┬───────────────────────────────────────────────┘
38
-
39
- ┌──────────▼───────────────────────────────────────────────┐
40
- │ CodebaseNavEnvironment (extended) │
41
- │ │
42
- │ ┌─────────────┐ ┌──────────────┐ ┌─────────────────┐ │
43
- │ │ Trajectory │ │ Evaluator │ │ Security │ │
44
- │ │ Logger │ │ (process) │ │ Scanner │ │
45
- │ └─────────────┘ └──────────────┘ └─────────────────┘ │
46
- │ ┌─────────────┐ ┌──────────────┐ ┌─────────────────┐ │
47
- │ │ Fault │ │ Memory │ │ Grader │ │
48
- │ │ Injector │ │ Tracker │ │ (pytest) │ │
49
- │ └─────────────┘ └──────────────┘ └─────────────────┘ │
50
- └──────────────────────────────────────────────────────────┘
51
  ```
52
 
53
- ## Tasks
54
 
55
- | Task | Difficulty | Description |
56
- |------|-----------|-------------|
57
- | task1 | Easy | Single-file bug repair (5 variants) |
58
- | task2 | Medium | Cross-module interface bug + regression test (5 variants) |
59
- | task3 | Hard | Feature implementation from spec (5 variants) |
60
 
61
  ## API Endpoints
62
 
@@ -68,76 +110,63 @@ Every coding agent (Devin, Cursor, Copilot, Codex) fails ~25%+ on complex tasks.
68
  | `/state` | GET | Get current state |
69
  | `/health` | GET | Health check |
70
 
71
- ### Evaluation Layer (v2.0)
72
  | Endpoint | Method | Description |
73
  |----------|--------|-------------|
74
- | `/trajectory` | GET | Full action log with timing, diffs, security flags |
75
  | `/evaluate` | GET | Multi-dimensional scores (6 axes) |
76
- | `/metrics` | GET | Comprehensive stats: memory, security, timeline |
77
- | `/fault-config` | POST | Enable fault injection: "none", "light", "heavy" |
78
-
79
- ## Multi-Dimensional Evaluation
80
-
81
- The `/evaluate` endpoint scores agents across **6 quality dimensions**:
82
 
83
- | Dimension | Weight | What It Measures |
84
- |-----------|--------|-----------------|
85
- | Efficiency | 20% | Steps used vs optimal path |
86
- | Navigation | 15% | Read relevant files first? Explored strategically? |
87
- | Correctness | 30% | Final test pass rate + regression detection |
88
- | Reasoning | 15% | read→write→test pattern adherence |
89
- | Robustness | 10% | Error recovery + fault injection handling |
90
- | Security | 10% | Unsafe code detection + prompt injection resistance |
91
 
92
- ## Fault Injection
93
-
94
- Test agent robustness by injecting controlled faults:
95
-
96
- ```bash
97
- # Enable heavy fault injection
98
- curl -X POST http://localhost:7860/fault-config -d '{"level":"heavy"}'
99
-
100
- # Next reset will inject:
101
- # - Misleading "BUG:" comments on correct lines
102
- # - Red herring files that look buggy but aren't
103
- # - Noisy docstrings claiming code is correct
104
  ```
105
 
106
- ## Quick Start
107
 
108
- ### Local
109
- ```bash
110
- pip install -r requirements.txt
111
- uvicorn server.app:app --host 0.0.0.0 --port 7860
112
  ```
113
-
114
- ### Docker
115
- ```bash
116
- docker build -t codebase-nav-env .
117
- docker run -p 7860:7860 codebase-nav-env
118
  ```
119
 
120
- ### Run Inference
121
- ```bash
122
- export HF_TOKEN=your_token
123
- export ENV_BASE_URL=http://localhost:7860
124
- python inference.py
125
- ```
126
 
127
- ## Example Output: `/evaluate`
128
- ```json
129
- {
130
- "composite_score": 0.874,
131
- "dimensions": {
132
- "efficiency": {"score": 0.8, "evidence": ["Used 5 steps vs 4 optimal"]},
133
- "navigation": {"score": 1.0, "evidence": ["Good: first read was relevant file"]},
134
- "correctness": {"score": 0.714, "evidence": ["No test regressions"]},
135
- "reasoning": {"score": 1.0, "evidence": ["Agent tested after writing"]},
136
- "robustness": {"score": 1.0, "evidence": ["Clean execution"]},
137
- "security": {"score": 1.0, "evidence": ["No security violations"]}
138
- }
139
- }
140
- ```
141
 
142
  ## License
143
 
 
13
  - coding-agent
14
  ---
15
 
16
+ # 🔍 Codebase Navigation & Repair — OpenEnv
17
 
18
+ > **The system that makes AI coding agents reliable, testable, and debuggable.**
19
 
20
+ ## The Problem
21
 
22
+ AI coding agents (Copilot, Devin, Cursor) fail on roughly 25% or more of complex tasks. Current benchmarks tell you the score but not **why** the agent failed. Was it poor navigation? Wasted steps? Hallucinated code? There is no way to know.
23
 
24
+ ## Our Solution
25
 
26
+ An RL environment where agents navigate unfamiliar Python repos, find bugs, and fix them — graded by **actual pytest execution** with **process-level evaluation**.
 
 
 
27
 
28
+ Unlike existing benchmarks, we evaluate **how** the agent works, not just the final output:
29
 
30
+ | What We Measure | Why It Matters |
31
+ |----------------|---------------|
32
+ | Navigation efficiency | Did it read relevant files first? |
33
+ | Reasoning patterns | Did it follow read→write→test? |
34
+ | Context usage | How much of what it read was useful? |
35
+ | Security | Did it write safe code? |
36
+ | Robustness | Can it handle misleading comments? |
37
+
38
+ ## How It Works
39
+
40
+ ```
41
+ Agent resets environment → sees repo file tree (NOT contents)
42
+ → reads files one at a time (costs steps)
43
+ → identifies bugs in source code
44
+ → writes fixed code
45
+ → runs tests to verify
46
+ → submits for final grade
47
+ ```
48
+
49
+ ### Tasks
50
+
51
+ | Task | Difficulty | Description | Variants |
52
+ |------|-----------|-------------|----------|
53
+ | task1 | Easy | Single-file bug repair | 5 |
54
+ | task2 | Medium | Cross-module interface bug + regression test | 5 |
55
+ | task3 | Hard | Feature implementation from spec | 5 |
56
+
57
+ Each variant has structurally different code, so the agent can't memorize solutions.
58
+
59
+ ## Quick Start
60
+
61
+ ### 1. Run Locally (No Docker)
62
+ ```bash
63
+ pip install -r requirements.txt
64
+ python app.py # Gradio UI at http://localhost:7860
65
+ ```
66
+
67
+ ### 2. Run Agent (No LLM needed)
68
+ ```bash
69
+ python run_agent.py # deterministic agent demo
70
+ python run_agent.py --all-tasks # run all 3 tasks
71
  ```
72
+
73
+ ### 3. Run Agent with LLM
74
+ ```bash
75
+ export HF_TOKEN=hf_xxxxx
76
+ python run_agent.py --llm --task task1
77
+ ```
78
+
79
+ ### 4. Docker
80
+ ```bash
81
+ docker build -t codebase-nav-env .
82
+ docker run -p 7860:7860 codebase-nav-env
83
  ```
84
 
85
+ ### 5. API Usage
86
+ ```bash
87
+ # Reset
88
+ curl -X POST "http://localhost:7860/reset?task=task1"
89
+
90
+ # Take action
91
+ curl -X POST http://localhost:7860/step \
92
+ -H "Content-Type: application/json" \
93
+ -d '{"action_type":"read_file","path":"src/auth.py"}'
94
+
95
+ # Submit
96
+ curl -X POST http://localhost:7860/step \
97
+ -d '{"action_type":"submit"}'
98
 
99
+ # Get evaluation
100
+ curl http://localhost:7860/evaluate
101
+ ```
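The same episode can be driven from Python. A minimal sketch using only the standard library, with the URL and JSON fields taken from the curl examples above (helper names are illustrative, not part of the project):

```python
import json
import urllib.request

BASE = "http://localhost:7860"  # server from the Quick Start

def step_payload(action_type, **fields):
    """Build the JSON body for POST /step; unset optional fields are omitted."""
    return {"action_type": action_type, **{k: v for k, v in fields.items() if v}}

def post(path, payload=None):
    """POST helper that returns the decoded JSON response."""
    data = json.dumps(payload).encode() if payload is not None else b""
    req = urllib.request.Request(
        BASE + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example episode (requires the server to be running):
# post("/reset?task=task1")
# post("/step", step_payload("read_file", path="src/auth.py"))
# post("/step", step_payload("submit"))
```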
 
 
102
 
103
  ## API Endpoints
104
 
 
110
  | `/state` | GET | Get current state |
111
  | `/health` | GET | Health check |
112
 
113
+ ### Evaluation Layer
114
  | Endpoint | Method | Description |
115
  |----------|--------|-------------|
116
+ | `/trajectory` | GET | Full action log with timing and diffs |
117
  | `/evaluate` | GET | Multi-dimensional scores (6 axes) |
118
+ | `/metrics` | GET | Memory, security, timeline stats |
119
+ | `/fault-config` | POST | Enable fault injection |
 
 
 
 
120
 
121
+ ## Evaluation Dimensions
122
 
123
+ ```
124
+ efficiency [████████████████░░░░] 0.800 — 5 steps vs 4 optimal
125
+ navigation [████████████████████] 1.000 — read relevant files first
126
+ correctness [██████████████░░░░░░] 0.714 — 71.4% tests passing
127
+ reasoning [████████████████████] 1.000 — correct read→write→test pattern
128
+ robustness [████████████████████] 1.000 — no errors encountered
129
+ security [████████████████████] 1.000 — no unsafe code detected
130
  ```
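The bars above are consistent with a simple weighted sum over the six dimensions. A sketch assuming the weights from the evaluation table (Efficiency 20%, Navigation 15%, Correctness 30%, Reasoning 15%, Robustness 10%, Security 10%):

```python
# Weighted composite over the six evaluation dimensions.
WEIGHTS = {
    "efficiency": 0.20, "navigation": 0.15, "correctness": 0.30,
    "reasoning": 0.15, "robustness": 0.10, "security": 0.10,
}

def composite_score(scores: dict) -> float:
    """Combine per-dimension scores into a single weighted composite."""
    return round(sum(WEIGHTS[d] * scores[d] for d in WEIGHTS), 3)

scores = {"efficiency": 0.800, "navigation": 1.000, "correctness": 0.714,
          "reasoning": 1.000, "robustness": 1.000, "security": 1.000}
print(composite_score(scores))  # 0.874, matching the /evaluate example
```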
131
 
132
+ ## Project Structure
133
134
  ```
135
+ codebase-nav-env/
136
+ ├── app.py # Gradio UI + FastAPI (HF Space entry point)
137
+ ├── run_agent.py # Standalone HF agent (deterministic + LLM)
138
+ ├── inference.py # OpenEnv inference script ([START]/[STEP]/[END])
139
+ ├── server/
140
+ │ ├── app.py # FastAPI endpoints
141
+ │ ├── environment.py # Core RL environment
142
+ │ ├── models.py # Pydantic models
143
+ │ ├── grader.py # pytest runner
144
+ │ ├── repo_loader.py # Template loader
145
+ │ ├── sandbox.py # Secure subprocess
146
+ │ ├── trajectory.py # Full trajectory recording
147
+ │ ├── evaluator.py # 6-dimension scoring engine
148
+ │ ├── fault_injection.py # Robustness testing
149
+ │ ├── security.py # Unsafe code detection
150
+ │ └── memory.py # Context efficiency tracking
151
+ ├── repo_templates/ # 15 task variants
152
+ │ ├── task1/ # 5 single-file bug variants
153
+ │ ├── task2/ # 5 cross-module bug variants
154
+ │ └── task3/ # 5 feature implementation variants
155
+ ├── openenv.yaml # Environment metadata
156
+ ├── Dockerfile # Docker build
157
+ ├── requirements.txt # Dependencies
158
+ └── README.md # This file
159
  ```
160
 
161
+ ## Why This Is Real-World
162
 
163
+ This isn't a toy benchmark. It tests the **exact capabilities** production coding agents need:
164
+
165
+ - **Navigate unfamiliar code** — agent sees only file names, not contents
166
+ - **Budget exploration** — finite steps mean strategic reading matters
167
+ - **Verify fixes** — must run tests, not just hope the fix works
168
+ - **Handle noise** — real repos have misleading comments and dead code
169
+ - **Write safe code** — production agents can't `eval()` or `os.system()`
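The `server/security.py` module isn't shown in this diff; a minimal sketch of the kind of pattern-based unsafe-code check it could perform (the function name and pattern list are illustrative assumptions, not the actual module):

```python
import re

# Illustrative patterns only — the real server/security.py may differ.
UNSAFE_PATTERNS = {
    "eval": re.compile(r"\beval\s*\("),
    "exec": re.compile(r"\bexec\s*\("),
    "os.system": re.compile(r"\bos\.system\s*\("),
    "subprocess shell": re.compile(r"shell\s*=\s*True"),
}

def scan_for_unsafe_code(source: str) -> list:
    """Return the names of unsafe patterns found in submitted code."""
    return [name for name, pat in UNSAFE_PATTERNS.items() if pat.search(source)]

print(scan_for_unsafe_code("os.system('rm -rf /')"))  # ['os.system']
```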
170
 
171
  ## License
172
 
app.py ADDED
@@ -0,0 +1,344 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ app.py — Gradio UI + FastAPI endpoints for the OpenEnv environment.
4
+ This is the HF Space entry point.
5
+ """
6
+ import os
7
+ import json
8
+ import gradio as gr
9
+ from server.environment import CodebaseNavEnvironment
10
+ from server.models import RepoAction
11
+
12
+ # ── Global environment instance ──────────────────────────────────────────────
13
+ env = CodebaseNavEnvironment()
14
+
15
+
16
+ # ── Gradio callback functions ────────────────────────────────────────────────
17
+
18
+ def reset_environment(task: str):
19
+ """Reset environment and return initial state."""
20
+ try:
21
+ result = env.reset(task=task)
22
+ obs = result.observation
23
+ tree = "\n".join(f" 📄 {f}" for f in obs.repo_tree)
24
+ failing = ", ".join(obs.failing_tests) if obs.failing_tests else "None listed"
25
+ info_data = result.info
26
+
27
+ status_text = (
28
+ f"✅ Episode started\n"
29
+ f"━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n"
30
+ f"Task: {task}\n"
31
+ f"Variant: {info_data.get('variant_id', 'unknown')}\n"
32
+ f"Steps remaining: {obs.steps_remaining}\n"
33
+ f"━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n\n"
34
+ f"📁 Repository Files:\n{tree}\n\n"
35
+ f"🔴 Failing Tests: {failing}\n\n"
36
+ f"📋 Task: {obs.task_description}"
37
+ )
38
+ return status_text, "", "0", "0.000"
39
+ except Exception as e:
40
+ return f"❌ Error: {e}", "", "0", "0.000"
41
+
42
+
43
+ def take_step(action_type: str, path: str, query: str, content: str):
44
+ """Execute one agent step."""
45
+ if env.done:
46
+ return "❌ Episode is done. Reset first.", "", "", ""
47
+
48
+ try:
49
+ action = RepoAction(
50
+ action_type=action_type,
51
+ path=path if path.strip() else None,
52
+ query=query if query.strip() else None,
53
+ content=content if content.strip() else None,
54
+ )
55
+ result = env.step(action)
56
+ obs = result.observation
57
+
58
+ action_result = obs.last_action_result or "No output"
59
+ error = obs.last_action_error or ""
60
+ if error:
61
+ error = f"⚠️ {error}"
62
+
63
+ status = (
64
+ f"Step {result.info['steps_taken']} | "
65
+ f"Reward: {result.reward:+.3f} | "
66
+ f"Steps left: {obs.steps_remaining}"
67
+ )
68
+ if result.done:
69
+ status += f"\n\n🏁 EPISODE DONE — Final Score: {result.info['final_score']:.3f}"
70
+
71
+ flags = result.info.get("security_flags", [])
72
+ if flags:
73
+ status += f"\n🔒 Security: {flags}"
74
+
75
+ return (
76
+ status,
77
+ action_result[:3000],
78
+ str(result.info["steps_taken"]),
79
+ f"{result.info.get('cumulative_reward', 0):.3f}",
80
+ )
81
+ except Exception as e:
82
+ return f"❌ Error: {e}", "", "", ""
83
+
84
+
85
+ def get_evaluation():
86
+ """Get multi-dimensional evaluation report."""
87
+ try:
88
+ ev = env.get_evaluation()
89
+ if "error" in ev:
90
+ return "No evaluation available. Run an episode first."
91
+
92
+ lines = [
93
+ f"🎯 Composite Score: {ev['composite_score']:.3f}",
94
+ "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━",
95
+ ]
96
+ for name, dim in ev.get("dimensions", {}).items():
97
+ bar = "█" * int(dim["score"] * 20) + "░" * (20 - int(dim["score"] * 20))
98
+ lines.append(f" {name:15s} [{bar}] {dim['score']:.3f}")
99
+ for e in dim.get("evidence", []):
100
+ lines.append(f" → {e}")
101
+
102
+ if ev.get("strengths"):
103
+ lines.append("\n💪 Strengths:")
104
+ for s in ev["strengths"]:
105
+ lines.append(f" ✅ {s}")
106
+
107
+ if ev.get("failure_analysis"):
108
+ lines.append("\n⚠️ Failures:")
109
+ for f in ev["failure_analysis"]:
110
+ lines.append(f" ❌ {f}")
111
+
112
+ if ev.get("recommendations"):
113
+ lines.append("\n💡 Recommendations:")
114
+ for r in ev["recommendations"]:
115
+ lines.append(f" → {r}")
116
+
117
+ return "\n".join(lines)
118
+ except Exception as e:
119
+ return f"Error: {e}"
120
+
121
+
122
+ def get_metrics():
123
+ """Get comprehensive metrics."""
124
+ try:
125
+ m = env.get_metrics()
126
+ return json.dumps(m, indent=2, default=str)
127
+ except Exception as e:
128
+ return f"Error: {e}"
129
+
130
+
131
+ def get_trajectory():
132
+ """Get full trajectory."""
133
+ try:
134
+ t = env.get_trajectory()
135
+ if not t:
136
+ return "No trajectory available."
137
+
138
+ lines = [
139
+ f"Episode: {t.get('episode_id', 'N/A')}",
140
+ f"Task: {t.get('task', 'N/A')} | Variant: {t.get('variant_id', 'N/A')}",
141
+ f"Duration: {t.get('duration_seconds', 'N/A')}s | Score: {t.get('final_score', 0):.3f}",
142
+ "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━",
143
+ ]
144
+ for step in t.get("steps", []):
145
+ emoji = "📖" if step["action_type"] == "read_file" else \
146
+ "✏️" if step["action_type"] == "write_file" else \
147
+ "🧪" if step["action_type"] == "run_tests" else \
148
+ "🔍" if step["action_type"] == "search_code" else "🏁"
149
+ path = step.get("action_path") or step.get("action_query") or ""
150
+ err = f" ❌ {step['error']}" if step.get("error") else ""
151
+ lines.append(
152
+ f" {emoji} Step {step['step_number']:2d}: "
153
+ f"{step['action_type']:12s} {path:30s} "
154
+ f"reward={step['reward']:+.3f} "
155
+ f"({step['duration_ms']:.0f}ms){err}"
156
+ )
157
+ return "\n".join(lines)
158
+ except Exception as e:
159
+ return f"Error: {e}"
160
+
161
+
162
+ def run_builtin_agent(task: str):
163
+ """Run the built-in deterministic agent for a quick demo."""
164
+ try:
165
+ # Reset
166
+ result = env.reset(task=task)
167
+ obs = result.observation
168
+ log_lines = [f"🚀 Starting {task} (variant: {result.info.get('variant_id')})"]
169
+ log_lines.append(f" Files: {obs.repo_tree}")
170
+ log_lines.append(f" Failing: {obs.failing_tests}")
171
+
172
+ # Strategy: read test file → read source → fix → run tests → submit
173
+ test_files = [f for f in obs.repo_tree if f.startswith("tests/")]
174
+ src_files = [f for f in obs.repo_tree if f.startswith("src/") and f.endswith(".py")]
175
+ spec_files = [f for f in obs.repo_tree if f.endswith(".md")]
176
+
177
+ steps_done = 0
178
+ max_demo_steps = 15
179
+
180
+ # Step 1: read spec or test
181
+ if task == "task3" and spec_files:
182
+ target = spec_files[0]
183
+ elif test_files:
184
+ target = test_files[0]
185
+ else:
186
+ target = obs.repo_tree[0]
187
+
188
+ step_result = env.step(RepoAction(action_type="read_file", path=target))
189
+ steps_done += 1
190
+ log_lines.append(f" Step {steps_done}: read_file {target} → reward={step_result.reward:+.3f}")
191
+
192
+ # Step 2+: read all source files
193
+ for sf in src_files:
194
+ if env.done or steps_done >= max_demo_steps - 2:
195
+ break
196
+ step_result = env.step(RepoAction(action_type="read_file", path=sf))
197
+ steps_done += 1
198
+ log_lines.append(f" Step {steps_done}: read_file {sf} → reward={step_result.reward:+.3f}")
199
+
200
+ # Step N-1: run tests
201
+ if not env.done and steps_done < max_demo_steps - 1:
202
+ step_result = env.step(RepoAction(action_type="run_tests"))
203
+ steps_done += 1
204
+ log_lines.append(f" Step {steps_done}: run_tests → reward={step_result.reward:+.3f}")
205
+
206
+ # Step N: submit
207
+ if not env.done:
208
+ step_result = env.step(RepoAction(action_type="submit"))
209
+ steps_done += 1
210
+ log_lines.append(f" Step {steps_done}: submit → reward={step_result.reward:+.3f}")
211
+
212
+ log_lines.append(f"\n🏁 Final Score: {env.final_score:.3f}")
213
+ log_lines.append(f" Total Steps: {steps_done}")
214
+ log_lines.append(f" Cumulative Reward: {env.cumulative_reward:.3f}")
215
+
216
+ return "\n".join(log_lines)
217
+ except Exception as e:
218
+ return f"❌ Error: {e}"
219
+
220
+
221
+ # ── Build the Gradio UI ─────────────────────────────────────────────────────
222
+
223
+ with gr.Blocks(
224
+ title="Codebase Navigation & Repair — OpenEnv",
225
+ ) as demo:
226
+ gr.Markdown(
227
+ "# 🔍 Codebase Navigation & Repair — OpenEnv\n"
228
+ "**RL environment for testing AI coding agents.** "
229
+ "Agents navigate repos, find bugs, and fix them — graded by actual pytest execution."
230
+ )
231
+
232
+ with gr.Tabs():
233
+ # ── Tab 1: Interactive Environment ────────────────────────────────
234
+ with gr.TabItem("🎮 Interactive"):
235
+ with gr.Row():
236
+ with gr.Column(scale=1):
237
+ task_select = gr.Dropdown(
238
+ choices=["task1", "task2", "task3"],
239
+ value="task1",
240
+ label="Task",
241
+ info="task1=single-file bugs, task2=cross-module, task3=feature impl"
242
+ )
243
+ reset_btn = gr.Button("🔄 Reset Environment", variant="primary")
244
+
245
+ gr.Markdown("### Take an Action")
246
+ action_type = gr.Dropdown(
247
+ choices=["read_file", "write_file", "run_tests", "search_code", "submit"],
248
+ value="read_file",
249
+ label="Action Type",
250
+ )
251
+ action_path = gr.Textbox(label="Path (for read/write/run_tests)", placeholder="src/auth.py")
252
+ action_query = gr.Textbox(label="Query (for search_code)", placeholder="validate_token")
253
+ action_content = gr.Textbox(label="Content (for write_file)", lines=5, placeholder="# new file content...")
254
+ step_btn = gr.Button("▶️ Execute Step", variant="secondary")
255
+
256
+ with gr.Column(scale=2):
257
+ status_box = gr.Textbox(label="Status", lines=15, interactive=False)
258
+ result_box = gr.Textbox(label="Last Action Result", lines=10, interactive=False)
259
+ with gr.Row():
260
+ steps_box = gr.Textbox(label="Steps Taken", value="0", interactive=False)
261
+ reward_box = gr.Textbox(label="Cumulative Reward", value="0.000", interactive=False)
262
+
263
+ reset_btn.click(
264
+ reset_environment, inputs=[task_select],
265
+ outputs=[status_box, result_box, steps_box, reward_box],
266
+ )
267
+ step_btn.click(
268
+ take_step,
269
+ inputs=[action_type, action_path, action_query, action_content],
270
+ outputs=[status_box, result_box, steps_box, reward_box],
271
+ )
272
+
273
+ # ── Tab 2: Run Agent ─────────────────────────────────────────────
274
+ with gr.TabItem("🤖 Run Agent"):
275
+ gr.Markdown(
276
+ "### Built-in Demonstration Agent\n"
277
+ "Runs a deterministic read-all-then-submit agent. "
278
+ "For LLM-based agent, use `run_agent.py` or `inference.py`."
279
+ )
280
+ agent_task = gr.Dropdown(
281
+ choices=["task1", "task2", "task3"], value="task1", label="Task"
282
+ )
283
+ run_btn = gr.Button("🚀 Run Agent", variant="primary")
284
+ agent_output = gr.Textbox(label="Agent Log", lines=20, interactive=False)
285
+ run_btn.click(run_builtin_agent, inputs=[agent_task], outputs=[agent_output])
286
+
287
+ # ── Tab 3: Evaluation Dashboard ──────────────────────────────────
288
+ with gr.TabItem("📊 Evaluation"):
289
+ with gr.Row():
290
+ eval_btn = gr.Button("🎯 Get Evaluation", variant="primary")
291
+ metrics_btn = gr.Button("📈 Get Metrics", variant="secondary")
292
+ traj_btn = gr.Button("🗺️ Get Trajectory", variant="secondary")
293
+ eval_output = gr.Textbox(label="Evaluation Report", lines=25, interactive=False)
294
+ eval_btn.click(get_evaluation, outputs=[eval_output])
295
+ metrics_btn.click(get_metrics, outputs=[eval_output])
296
+ traj_btn.click(get_trajectory, outputs=[eval_output])
297
+
298
+ # ── Tab 4: API Docs ──────────────────────────────────────────────
299
+ with gr.TabItem("📖 API"):
300
+ gr.Markdown("""
301
+ ### REST API Endpoints
302
+
303
+ The FastAPI endpoints are mounted alongside this UI at `/api/`.
304
+
305
+ | Endpoint | Method | Description |
306
+ |----------|--------|-------------|
307
+ | `/api/reset?task=task1` | POST | Start new episode |
308
+ | `/api/step` | POST | Take action (JSON body) |
309
+ | `/api/state` | GET | Get current state |
310
+ | `/api/health` | GET | Health check |
311
+ | `/api/trajectory` | GET | Full action log |
312
+ | `/api/evaluate` | GET | Multi-dimensional scores |
313
+ | `/api/metrics` | GET | Comprehensive stats |
314
+ | `/api/fault-config` | POST | Enable fault injection |
315
+
316
+ ### Example: Reset + Read + Submit
317
+ ```bash
318
+ BASE="https://YOUR-SPACE.hf.space/api"
319
+
320
+ # Reset
321
+ curl -X POST "$BASE/reset?task=task1"
322
+
323
+ # Read a file
324
+ curl -X POST "$BASE/step" -H "Content-Type: application/json" \\
325
+ -d '{"action_type":"read_file","path":"src/auth.py"}'
326
+
327
+ # Submit
328
+ curl -X POST "$BASE/step" -H "Content-Type: application/json" \\
329
+ -d '{"action_type":"submit"}'
330
+
331
+ # Get evaluation
332
+ curl "$BASE/evaluate"
333
+ ```
334
+ """)
335
+
336
+
337
+ # ── Mount FastAPI under /api ─────────────────────────────────────────────────
338
+ from server.app import app as fastapi_app
339
+
340
+ gr_app = gr.mount_gradio_app(fastapi_app, demo, path="/")
341
+
342
+ if __name__ == "__main__":
343
+ import uvicorn
344
+ uvicorn.run(gr_app, host="0.0.0.0", port=7860)
requirements.txt CHANGED
@@ -1,6 +1,8 @@
1
- fastapi==0.111.0
2
- uvicorn[standard]==0.30.1
3
- pydantic==2.7.1
4
- openai==1.35.0
5
- httpx==0.27.0
6
- pytest==8.2.2
 
 
 
1
+ fastapi
2
+ uvicorn[standard]
3
+ pydantic
4
+ openai
5
+ httpx
6
+ pytest
7
+ gradio>=4.0
8
+ huggingface_hub
run_agent.py ADDED
@@ -0,0 +1,290 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ run_agent.py — Standalone HF Inference agent for OpenEnv.
4
+
5
+ Uses Hugging Face InferenceClient (NOT OpenAI SDK).
6
+ Runs directly against the environment in-process — no server needed.
7
+ Solves bug-fixing tasks step-by-step and prints the full execution trace.
8
+
9
+ Usage:
10
+ python run_agent.py # uses built-in env
11
+ HF_TOKEN=hf_xxx python run_agent.py # with LLM agent
12
+ HF_TOKEN=hf_xxx python run_agent.py --task task2 # specific task
13
+ """
14
+ import os
15
+ import sys
16
+ import json
17
+ import argparse
18
+ import textwrap
19
+ from typing import List, Optional
20
+
21
+ # Add project root to path
22
+ sys.path.insert(0, os.path.dirname(__file__))
23
+
24
+ from server.environment import CodebaseNavEnvironment
25
+ from server.models import RepoAction
26
+
27
+
28
+ # ── Configuration ────────────────────────────────────────────────────────────
29
+ HF_TOKEN = os.getenv("HF_TOKEN")
30
+ MODEL_ID = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
31
+ MAX_STEPS = {"task1": 12, "task2": 18, "task3": 22}
32
+
33
+
34
+ # ── HF Inference Client (lazy import) ───────────────────────────────────────
35
+ def get_hf_client():
36
+ """Create HF InferenceClient. Returns None if no token."""
37
+ if not HF_TOKEN:
38
+ return None
39
+ try:
40
+ from huggingface_hub import InferenceClient
41
+ return InferenceClient(model=MODEL_ID, token=HF_TOKEN)
42
+ except ImportError:
43
+ print("[WARN] huggingface_hub not installed. Using deterministic agent.", flush=True)
44
+ return None
45
+ except Exception as e:
46
+ print(f"[WARN] Could not create HF client: {e}. Using deterministic agent.", flush=True)
47
+ return None
48
+
49
+
50
+ # ── Prompts ──────────────────────────────────────────────────────────────────
51
+ SYSTEM_PROMPT = textwrap.dedent("""
52
+ You are an expert Python developer debugging a code repository.
53
+ You interact with the repo via JSON actions. Reply with ONLY a JSON object.
54
+
55
+ Available actions:
56
+ {"action_type": "read_file", "path": "src/file.py"}
57
+ {"action_type": "write_file", "path": "src/file.py", "content": "...full file..."}
58
+ {"action_type": "run_tests", "path": "tests/test_file.py"}
59
+ {"action_type": "search_code", "query": "keyword"}
60
+ {"action_type": "submit"}
61
+
62
+ Strategy:
63
+ 1. Read the failing test first to understand expected behavior
64
+ 2. Read the buggy source file(s) identified by test imports
65
+ 3. Fix the bug by writing the corrected file
66
+ 4. Run tests to verify your fix
67
+ 5. Submit when all tests pass
68
+
69
+ RESPOND WITH ONLY A JSON OBJECT. No markdown, no explanation.
70
+ """).strip()
71
+
72
+
73
+ def build_prompt(obs: dict, step: int, history: List[str]) -> str:
74
+ tree = "\n".join(obs.get("repo_tree", []))
75
+ read = ", ".join(obs.get("files_read", [])) or "none"
76
+ failing = ", ".join(obs.get("failing_tests", [])) or "unknown"
77
+ result = (obs.get("last_action_result") or "none")[:1500]
78
+ error = obs.get("last_action_error") or "none"
79
+ steps_left = obs.get("steps_remaining", 0)
80
+ hist = "\n".join(history[-5:]) if history else "none"
81
+
82
+ return (
83
+ f"Step {step} | Task: {obs.get('current_task')} | Steps left: {steps_left}\n\n"
84
+ f"Description: {obs.get('task_description')}\n\n"
85
+ f"Files:\n{tree}\n\n"
86
+ f"Already read: {read}\nFailing tests: {failing}\n\n"
87
+ f"Last result:\n{result}\n\nLast error: {error}\n\n"
88
+ f"History:\n{hist}\n\n"
89
+ f"Next action? Reply with ONLY a JSON object."
90
+ )
91
+
92
+
93
+ def llm_action(client, obs: dict, step: int, history: List[str]) -> dict:
94
+ """Get action from HF Inference API."""
95
+ prompt = build_prompt(obs, step, history)
96
+ try:
97
+ response = client.chat_completion(
98
+ messages=[
99
+ {"role": "system", "content": SYSTEM_PROMPT},
100
+ {"role": "user", "content": prompt},
101
+ ],
102
+ max_tokens=800,
103
+ temperature=0.2,
104
+ )
105
+ text = response.choices[0].message.content.strip()
106
+
107
+ # Strip code fences
108
+ if text.startswith("```"):
109
+ text = text.split("```")[1]
110
+ if text.startswith("json"):
111
+ text = text[4:]
112
+ text = text.strip()
113
+
114
+ return json.loads(text)
115
+ except json.JSONDecodeError:
116
+ print(f" [PARSE ERROR] Could not parse: {text[:100]}")
117
+ return {"action_type": "submit"}
118
+ except Exception as e:
119
+ print(f" [LLM ERROR] {e}")
120
+ return {"action_type": "submit"}
121
+
122
+
123
+ # ── Deterministic Agent (no LLM needed) ─────────────────────────────────────
+ def deterministic_agent(obs: dict, step: int, files_read: set) -> dict:
+     """
+     A rule-based agent that follows optimal patterns for each task type.
+     Works without any LLM — useful for testing and demos.
+     """
+     tree = obs.get("repo_tree", [])
+     task = obs.get("current_task", "task1")
+     test_files = sorted([f for f in tree if f.startswith("tests/")])
+     src_files = sorted([f for f in tree if f.startswith("src/") and f.endswith(".py")])
+     spec_files = sorted([f for f in tree if f.endswith("FEATURE_SPEC.md")])
+
+     # Phase 1: Read spec/test files first
+     if task == "task3" and spec_files:
+         for sf in spec_files:
+             if sf not in files_read:
+                 return {"action_type": "read_file", "path": sf}
+
+     for tf in test_files:
+         if tf not in files_read:
+             return {"action_type": "read_file", "path": tf}
+
+     # Phase 2: Read all source files
+     for sf in src_files:
+         if sf not in files_read:
+             return {"action_type": "read_file", "path": sf}
+
+     # Phase 3: Run tests to see current state
+     if step <= 2 + len(src_files) + len(test_files):
+         if test_files:
+             return {"action_type": "run_tests", "path": test_files[0]}
+         return {"action_type": "run_tests"}
+
+     # Phase 4: Submit
+     return {"action_type": "submit"}
+
+
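Each read phase above follows one pattern: return the first candidate file the agent has not read yet, else fall through to the next phase. That selection rule in miniature (hypothetical `next_unread` helper, toy file names):

```python
def next_unread(candidates, files_read):
    # First not-yet-read path in sorted order, else None — one "phase" above.
    for path in sorted(candidates):
        if path not in files_read:
            return path
    return None

tree = ["src/utils.py", "src/app.py", "tests/test_app.py"]
files_read = {"tests/test_app.py"}

test_files = [f for f in tree if f.startswith("tests/")]
src_files = [f for f in tree if f.startswith("src/")]

# The only test file is already read, so control falls through to the source phase.
print(next_unread(test_files, files_read) or next_unread(src_files, files_read))
# → src/app.py
```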
+ # ── Main Runner ──────────────────────────────────────────────────────────────
+ def run_episode(env: CodebaseNavEnvironment, task: str, use_llm: bool = False):
+     """Run one complete episode."""
+     hf_client = get_hf_client() if use_llm else None
+     using_llm = hf_client is not None
+
+     max_steps = MAX_STEPS.get(task, 15)
+     history = []
+     files_read = set()
+
+     print(f"\n{'='*60}")
+     print(f" [START] task={task} agent={'llm' if using_llm else 'deterministic'}")
+     print(f"{'='*60}")
+
+     # Reset
+     reset_result = env.reset(task=task)
+     obs = reset_result.observation
+     variant = reset_result.info.get("variant_id", "?")
+
+     print(f" Variant: {variant}")
+     print(f" Files: {obs.repo_tree}")
+     print(f" Failing: {obs.failing_tests}")
+     print(f" Steps budget: {obs.steps_remaining}")
+     print()
+
+     rewards = []
+     final_score = 0.0
+
+     for step_num in range(1, max_steps + 1):
+         if env.done:
+             break
+
+         # Get action from LLM or deterministic agent
+         obs_dict = obs.model_dump()
+         if using_llm:
+             action_dict = llm_action(hf_client, obs_dict, step_num, history)
+         else:
+             action_dict = deterministic_agent(obs_dict, step_num, files_read)
+
+         action_type = action_dict.get("action_type", "submit")
+         action_path = action_dict.get("path")
+
+         # Construct action
+         action = RepoAction(
+             action_type=action_type,
+             path=action_dict.get("path"),
+             query=action_dict.get("query"),
+             content=action_dict.get("content"),
+         )
+
+         # Execute step
+         result = env.step(action)
+         obs = result.observation
+         reward = result.reward
+
+         rewards.append(reward)
+         if action_path:
+             files_read.add(action_path)
+
+         # Print step log
+         detail = action_path or action_dict.get("query") or ""
+         err = f" ❌ {obs.last_action_error}" if obs.last_action_error else ""
+         print(
+             f" [STEP] step={step_num} action={action_type:12s} "
+             f"{detail:30s} reward={reward:+.3f}{err}"
+         )
+
+         history.append(f"Step {step_num}: {action_type} → {reward:+.3f}")
+
+         if result.done:
+             final_score = result.info.get("final_score", 0.0)
+             break
+
+     # Force submit if not done
+     if not env.done:
+         result = env.step(RepoAction(action_type="submit"))
+         final_score = result.info.get("final_score", 0.0)
+         rewards.append(result.reward)
+
+     # Summary
+     total_reward = sum(rewards)
+     total_steps = len(rewards)
+     success = final_score >= 0.5
+
+     print()
+     print(f" [END] success={str(success).lower()} steps={total_steps} "
+           f"score={final_score:.3f} total_reward={total_reward:.3f}")
+     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+     print(f" [END] rewards={rewards_str}")
+
+     # Evaluation summary
+     ev = env.get_evaluation()
+     if "composite_score" in ev:
+         print(f"\n 📊 Evaluation:")
+         print(f" Composite: {ev['composite_score']:.3f}")
+         for name, dim in ev.get("dimensions", {}).items():
+             print(f" {name:15s}: {dim['score']:.3f}")
+
+     return final_score, total_steps, rewards
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="Run agent against OpenEnv codebase-nav")
+     parser.add_argument("--task", default="task1", choices=["task1", "task2", "task3"])
+     parser.add_argument("--all-tasks", action="store_true", help="Run all 3 tasks")
+     parser.add_argument("--llm", action="store_true", help="Use HF LLM agent (needs HF_TOKEN)")
+     args = parser.parse_args()
+
+     env = CodebaseNavEnvironment()
+
+     if args.all_tasks:
+         tasks = ["task1", "task2", "task3"]
+     else:
+         tasks = [args.task]
+
+     all_scores = []
+     for task in tasks:
+         score, steps, rewards = run_episode(env, task, use_llm=args.llm)
+         all_scores.append(score)
+
+     if len(all_scores) > 1:
+         avg = sum(all_scores) / len(all_scores)
+         print(f"\n{'='*60}")
+         print(f" OVERALL: avg_score={avg:.3f} tasks={len(all_scores)}")
+         print(f"{'='*60}")
+
+     env.close()
+
+
+ if __name__ == "__main__":
+     main()
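For reference, `run_episode` maps whatever dict the agent returns onto the four `RepoAction` fields, with `action_type` defaulting to `submit`. That normalization in isolation (hypothetical `normalize_action` helper mirroring the construction in the loop above):

```python
def normalize_action(d: dict) -> dict:
    # Same defaults run_episode applies before building RepoAction.
    return {
        "action_type": d.get("action_type", "submit"),
        "path": d.get("path"),
        "query": d.get("query"),
        "content": d.get("content"),
    }

print(normalize_action({}))
# → {'action_type': 'submit', 'path': None, 'query': None, 'content': None}
print(normalize_action({"action_type": "read_file", "path": "src/app.py"})["path"])
# → src/app.py
```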
test_e2e.py ADDED
@@ -0,0 +1,56 @@
+ #!/usr/bin/env python3
+ """Quick E2E test for the deployed HF Space."""
+ import httpx, json
+
+ BASE = "https://Chirag0123-codebase-nav-env.hf.space"
+ client = httpx.Client(timeout=120.0)
+ ok = 0
+
+ def test(label, fn):
+     global ok
+     try:
+         result = fn()
+         ok += 1
+         print(f" ✅ {label}: {json.dumps(result)[:200]}")
+     except Exception as e:
+         print(f" ❌ {label}: {e}")
+
+ print("Testing deployed Space...")
+
+ # 1. Health
+ test("Health", lambda: client.get(f"{BASE}/health").json())
+
+ # 2. Reset
+ test("Reset task1", lambda: (r := client.post(f"{BASE}/reset", params={"task": "task1"}).json(), r["info"]["variant_id"])[1])
+
+ # 3. Read file
+ test("Read file", lambda: (r := client.post(f"{BASE}/step", json={"action_type": "read_file", "path": client.get(f"{BASE}/state").json()["observation"]["repo_tree"][0]}).json(), f"reward={r['reward']}")[1])
+
+ # 4. Run tests
+ test("Run tests", lambda: (r := client.post(f"{BASE}/step", json={"action_type": "run_tests"}).json(), f"reward={r['reward']}")[1])
+
+ # 5. Submit
+ test("Submit", lambda: (r := client.post(f"{BASE}/step", json={"action_type": "submit"}).json(), f"score={r['info']['final_score']}")[1])
+
+ # 6. Trajectory
+ test("Trajectory", lambda: (r := client.get(f"{BASE}/trajectory").json(), f"steps={r['total_steps']}")[1])
+
+ # 7. Evaluate
+ test("Evaluate", lambda: (r := client.get(f"{BASE}/evaluate").json(), f"composite={r['composite_score']}")[1])
+
+ # 8. Metrics
+ test("Metrics", lambda: (r := client.get(f"{BASE}/metrics").json(), f"efficiency={r['step_efficiency']}")[1])
+
+ # 9. Fault config
+ test("Fault config", lambda: client.post(f"{BASE}/fault-config", json={"level": "light"}).json())
+
+ # 10. Reset with faults
+ test("Reset+faults", lambda: (r := client.post(f"{BASE}/reset", params={"task": "task2"}).json(), f"faults={len(r['info'].get('fault_injection', {}).get('faults_injected', []))}")[1])
+
+ # 11. Disable faults
+ test("Disable faults", lambda: client.post(f"{BASE}/fault-config", json={"level": "none"}).json())
+
+ print(f"\n{'='*50}")
+ print(f" Result: {ok}/11 tests passed")
+ print(f"{'='*50}")
+ client.close()
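Each check in `test_e2e.py` uses a walrus-in-a-lambda one-liner: evaluate the request exactly once, bind it to `r`, and return a formatted field from it. The pattern in miniature (with a stand-in for the HTTP call):

```python
fetch = lambda: {"reward": 0.05, "done": False}   # stand-in for client.post(...).json()

# (r := fetch(), f"reward={r['reward']}") builds a 2-tuple; [1] keeps the summary string.
extract = lambda: (r := fetch(), f"reward={r['reward']}")[1]

print(extract())
# → reward=0.05
```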