# Codebase Navigation & Repair · OpenEnv

## The Problem
AI coding agents fail silently and unpredictably. Worse, no one knows WHY they fail. Current benchmarks just give a final Pass/Fail grade. Did the agent read the wrong files? Hallucinate a fix? Ignore the tests entirely? There is no way to know.
## The Solution
We track and evaluate every step of the agent's reasoning and actions. Codebase Navigation & Repair is a system that makes AI coding agents reliable in real-world scenarios. We don't just grade the final output; we grade the entire journey.
## What It Is
It is a fully OpenEnv-compliant, production-ready testing environment for AI software engineers.
You drop an AI agent into an unfamiliar Python repository with a hidden bug. The agent cannot cheat by seeing all files at once. It must explore the codebase step-by-step, find the issue, write the fix, and run actual tests to prove it works, exactly like a human engineer.
## Why It Matters
For developers building autonomous agents (like Devin, Copilot, or Cursor), reliability is the biggest unsolved problem. Our system provides a high-fidelity diagnostic layer so researchers can find the exact weak spots in their models and fix them.
## Demo Walkthrough: A Realistic Bug Scenario
Imagine an e-commerce agent tasked with fixing an order processing failure.
BEFORE: `test_process_valid_order` is failing.

- Step 1: Agent reads `tests/test_orders.py` to understand the expected behavior.
- Step 2: Agent reads `src/order_processor.py` and spots the bug: a missing `datetime` import causing the script to crash.
- Step 3: Agent writes the fix to `src/order_processor.py`.
- Step 4: Agent runs `pytest`.
- Step 5: Agent submits the fixed codebase.

AFTER: All tests pass.
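The fix in the walkthrough can be as small as one line. Here is a hypothetical sketch of the repaired `src/order_processor.py`; the file contents and function name are illustrative, not the environment's actual code:

```python
# Hypothetical contents of src/order_processor.py after the agent's fix.
# The bug: `datetime` was used below but never imported, so every call
# crashed with a NameError until the agent added the import line.
from datetime import datetime, timezone

def process_order(order: dict) -> dict:
    """Timestamp the order and mark it as processed."""
    order["processed_at"] = datetime.now(timezone.utc).isoformat()
    order["status"] = "processed"
    return order
```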
Our system records this perfect execution. But if an agent fails, our Process-Based Evaluation engine flags exactly what went wrong: e.g., "Agent wasted 14 steps reading irrelevant files and submitted without testing."
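A process-level check like the one quoted above can be sketched as a single pass over the logged trajectory. This is a minimal illustration, assuming a trajectory is a list of `(tool, path)` tuples; the project's real 6-axis grader is richer:

```python
from collections import Counter

def grade_process(trajectory):
    """Flag common failure patterns in an agent's action trajectory.

    `trajectory` is assumed to be a list of (tool, path) tuples, e.g.
    ("read_file", "src/app.py") or ("run_tests", "").
    """
    flags = []
    # Count how many times each file was read; repeats waste context tokens.
    reads = Counter(path for tool, path in trajectory if tool == "read_file")
    repeats = sum(n - 1 for n in reads.values() if n > 1)
    if repeats:
        flags.append(f"wasted {repeats} step(s) re-reading identical files")
    # Submitting without ever running the test suite is an instant red flag.
    if "run_tests" not in [tool for tool, _ in trajectory]:
        flags.append("submitted without testing")
    return flags
```

For example, a trajectory that reads the same file twice and never runs `pytest` would come back with both flags.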
## How It Works (Simplified)
- The Server: A FastAPI engine loads a Python repository with a verifiable bug.
- The Agent: An AI model (we provide a Hugging Face Inference agent) requests the current state and explores the repo tree.
- The Loop: The agent interacts via structured actions (`read_file`, `write_file`, `run_tests`).
- The Evaluation: Every action is logged, timed, and scored against our 6-axis Reliability Grader.
- The UI: A beautiful Gradio interface lets you watch the AI operate in real-time or explore its trajectory post-flight.
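The loop above can be sketched in-process. This is a toy stand-in with an in-memory repo and a trivial pass/fail check; the real environment serves the same interface over FastAPI and runs actual `pytest`:

```python
class ToyRepairEnv:
    """Minimal in-memory stand-in for the environment's action loop."""

    def __init__(self, files: dict, bug_marker: str):
        self._initial = dict(files)
        self._bug_marker = bug_marker  # tests "pass" once this string is gone

    def reset(self) -> dict:
        self.files = dict(self._initial)
        return {"tree": sorted(self.files)}

    def step(self, action: dict):
        """Apply one structured action; returns (observation, done)."""
        tool = action["tool"]
        if tool == "read_file":
            obs = self.files.get(action["path"], "<missing>")
        elif tool == "write_file":
            self.files[action["path"]] = action["content"]
            obs = "written"
        elif tool == "run_tests":
            clean = all(self._bug_marker not in s for s in self.files.values())
            obs = "PASS" if clean else "FAIL"
        else:
            raise ValueError(f"unknown tool: {tool}")
        return obs, obs == "PASS"
```

An agent would call `reset()` once, then alternate `step()` calls (read, write, test) until the episode reports done.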
## Why It's Better (Our USP)
We test Process and Reliability, not just Correctness.
- Flight Data Recorder: Full trajectory replay. Debug the AI's thought process step-by-step.
- Dynamic Fault Injection: Real code is messy. We inject misleading comments and red herring files to see if the AI gets distracted.
- Proactive Security: We scan the AI's output for dangerous patterns (like `os.system`) to prevent destructive actions.
- Context Efficiency: We penalize agents that waste API tokens by reading identical files over and over.
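The security scan can be illustrated with a simple pattern pass over agent-written code. The deny-list here is an assumption for illustration; the deployed scanner's rules may differ:

```python
import re

# Illustrative deny-list of dangerous code patterns; the project's
# real scanner may use a different (and larger) rule set.
DANGEROUS_PATTERNS = {
    "shell command execution": r"\bos\.system\s*\(",
    "recursive deletion": r"\bshutil\.rmtree\s*\(",
    "dynamic code evaluation": r"\b(eval|exec)\s*\(",
}

def scan_agent_output(code: str) -> list:
    """Return labels for every dangerous pattern found in agent output."""
    return [label for label, pat in DANGEROUS_PATTERNS.items()
            if re.search(pat, code)]
```

A patch containing `os.system('rm -rf /')` would be flagged for shell command execution before it ever touches the sandbox.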
## Why Now?
The rise of autonomous agents is here. But enterprise adoption is stalled because these agents are unpredictable. Moving from "cool toy" to "reliable teammate" requires rigorous, process-level evaluation. Our system directly solves the reliability bottleneck.
## Hackathon Alignment
We built this explicitly for the Meta OpenEnv hackathon:
- 100% OpenEnv Compliant: Implements standard `/reset`, `/step`, and `/state` APIs.
- Live & Deployed: Running live on Hugging Face Spaces with a Gradio frontend.
- Inference Ready: Built-in agent using Hugging Face inference (`run_agent.py`).
- Sandboxed: Secure, dockerized test execution.
## Why This Wins
This is not infrastructure; it is a diagnostic product for the AI era. It features immense technical depth (sandboxed execution, multi-dimensional scoring, fault injection), massive real-world relevance, and a polished user experience. It doesn't just test AI agents; it shows us how to make them better.