Commit f7185e1 (parent: e3ad9a6): Refine pitch to product-focused storytelling

HACKATHON_PITCH.md (CHANGED, +45 −53)
@@ -1,88 +1,80 @@

Removed (previous version, recoverable lines):

- # 🚀 Codebase Navigation & Repair
- We change that by evaluating the **process**:
- 1. **Efficiency:** Did it read irrelevant files and waste context window?
- 2. **Reasoning:** Did it follow best practices (e.g., reading tests before modifying source code)?
- 3. **Security:** Did it try to inject malicious code during the repair?
- This transforms agent development from guesswork into targeted, measurable engineering.
- ## 🎬 Demo Walkthrough
- **The Scenario:** A backend API has a bug where `order_processor.py` fails to handle negative inventory.
- * The agent sees a file tree (no contents) and the failing test: `test_process_valid_order`
- * *Action:* `read_file tests/test_orders.py` *(Smart move: understand expected behavior first)*
- * *Action:* `read_file src/order_processor.py` *(Finds the bug location)*
- * *Action:* `run_tests tests/test_orders.py`
- * *Result:* Tests turn green! `[100% passing]`
- * The agent submits the fix.
- * **Our Engine kicks in:** It evaluates the trajectory and gives the agent a top-tier composite score for flawless navigation, strong reasoning, and optimal step efficiency.
- - **Dynamic Fault Injection:**
- ## 🤝 Hackathon Alignment
- We built this explicitly for the Meta OpenEnv
- - **OpenEnv Compliant:** Implements standard `/reset`, `/step`, and `/state` APIs
- ## 🚀 Why This Wins
- It combines **real-world relevance** (fixing broken tests in messy, multi-file repos) with **deep technical rigor** (process-based evaluation, fault injection, secure sandboxing). We've taken the base OpenEnv standard and turned it into a completely observable, visually impressive, state-of-the-art testing layer that is impossible to ignore.
# 🚀 Codebase Navigation & Repair — OpenEnv

## 🧨 The Problem

AI coding agents fail silently and unpredictably.

Worse, **no one knows WHY they fail.**

Current benchmarks just give a final Pass/Fail grade. Did the agent read the wrong files? Hallucinate a fix? Ignore the tests entirely? There is no way to know.

## 💡 The Solution

**We track and evaluate every step of the agent’s reasoning and actions.**

Codebase Navigation & Repair is a system that makes AI coding agents reliable in real-world scenarios. We don't just grade the final output; we grade the *entire journey*.

---
## 🛠️ What It Is

It is a fully OpenEnv-compliant, production-ready testing environment for AI software engineers.

You drop an AI agent into an unfamiliar Python repository with a hidden bug. The agent cannot cheat by seeing all files at once. It must explore the codebase step-by-step, find the issue, write the fix, and run actual tests to prove it works—exactly like a human engineer.

---
## 🌎 Why It Matters

For developers building autonomous agents (like Devin, Copilot, or Cursor), **reliability** is the biggest unsolved problem. Our system provides a high-fidelity diagnostic layer so researchers can find the exact weak spots in their models and fix them.

---
## 🎬 Demo Walkthrough: A Realistic Bug Scenario

Imagine an e-commerce agent tasked with fixing an order processing failure.

**BEFORE:** ❌ `test_process_valid_order` is failing.

1. **Step 1:** Agent reads `tests/test_orders.py` to understand the expected behavior.
2. **Step 2:** Agent reads `src/order_processor.py` and spots the bug: a missing `datetime` import causing the script to crash.
3. **Step 3:** Agent writes the fix to `src/order_processor.py`.
4. **Step 4:** Agent runs `pytest`.
5. **Step 5:** Agent submits the fixed codebase.

**AFTER:** ✅ All tests pass.

Our system records this perfect execution. But if an agent *fails*, our **Process-Based Evaluation** engine flags exactly what went wrong: e.g., *"Agent wasted 14 steps reading irrelevant files and submitted without testing."*
---

## 🏗️ How It Works (Simplified)

1. **The Server:** A FastAPI engine loads a Python repository with a verifiable bug.
2. **The Agent:** An AI model (we provide a Hugging Face Inference agent) requests the current state and explores the repo tree.
3. **The Loop:** The agent interacts via structured actions (`read_file`, `write_file`, `run_tests`).
4. **The Evaluation:** Every action is logged, timed, and scored against our 6-axis Reliability Grader.
5. **The UI:** A beautiful Gradio interface lets you watch the AI operate in real-time or explore its trajectory post-flight.
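The loop in steps 2–3 can be sketched as below. This is an assumption-heavy outline, not the shipped code: the transport is injected as `reset`/`step` callables because the pitch does not specify the HTTP payload schema; in the real system those calls would hit the environment server over HTTP.

```python
# Sketch of the agent interaction loop. `policy` picks the next structured
# action (read_file, write_file, run_tests, submit); `reset` and `step`
# stand in for the environment transport (HTTP in the real system).
def run_episode(policy, reset, step, max_steps=20):
    """Run one episode: observe, act, record each step, stop on submit."""
    observation = reset()
    trajectory = []
    for _ in range(max_steps):
        # A real policy would be an LLM call (e.g. Hugging Face Inference)
        # conditioned on the observation and the history so far.
        action = policy(observation, trajectory)
        observation = step(action)
        trajectory.append((action, observation))
        if action.get("action") == "submit":
            break
    return trajectory
```

Injecting the transport keeps the control flow testable without a running server, and `max_steps` enforces the step budget that the efficiency scoring depends on.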
---

## 🥇 Why It’s Better (Our USP)

We test **Process and Reliability, not just Correctness**.

- **Flight Data Recorder:** Full trajectory replay. Debug the AI's thought process step-by-step.
- **Dynamic Fault Injection:** Real code is messy. We inject misleading comments and red herring files to see if the AI gets distracted.
- **Proactive Security:** We scan the AI's output for dangerous patterns (like `os.system`) to prevent destructive actions.
- **Context Efficiency:** We penalize agents that waste API tokens by reading identical files over and over.
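To make the last two bullets concrete, here are illustrative versions of a security scan and a redundant-read counter. The pattern list and helper names are assumptions for illustration; they are not the project's actual 6-axis grader.

```python
import re

# Hypothetical pattern list; the real scanner's rules are not shown in this pitch.
DANGEROUS_PATTERNS = [r"\bos\.system\s*\(", r"\bsubprocess\b", r"\beval\s*\("]

def security_flags(written_code):
    """Return any dangerous patterns found in code the agent wrote."""
    return [p for p in DANGEROUS_PATTERNS if re.search(p, written_code)]

def redundant_reads(trajectory):
    """Count read_file actions that re-read a path the agent already saw."""
    seen, wasted = set(), 0
    for action in trajectory:
        if action.get("action") == "read_file":
            if action["path"] in seen:
                wasted += 1
            seen.add(action["path"])
    return wasted
```

A nonzero `security_flags` result would block the submission, while `redundant_reads` feeds the context-efficiency penalty.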
---

## ⏱️ Why Now?

The rise of autonomous agents is here. But enterprise adoption is stalled because these agents are unpredictable. Moving from "cool toy" to "reliable teammate" requires rigorous, process-level evaluation. Our system directly solves the reliability bottleneck.

---

## 🤝 Hackathon Alignment

We built this explicitly for the Meta OpenEnv hackathon:

- **100% OpenEnv Compliant:** Implements standard `/reset`, `/step`, and `/state` APIs.
- **Live & Deployed:** Running live on Hugging Face Spaces with a Gradio frontend.
- **Inference Ready:** Built-in agent using Hugging Face inference (`run_agent.py`).
- **Sandboxed:** Secure, dockerized test execution.
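The three-endpoint contract can be modeled in-process as below. This is a sketch under stated assumptions: in the real system a FastAPI app serves these over HTTP, and the observation fields here (`tree`, `ok`, `steps_taken`) are placeholders, not the project's actual schema.

```python
# Minimal in-process model of the /reset, /step, and /state contract.
class NavigationEnv:
    def __init__(self, file_tree):
        self.file_tree = list(file_tree)
        self.reset()

    def reset(self):
        """POST /reset: start a fresh episode; the agent sees only the tree."""
        self.steps_taken = 0
        self.done = False
        return {"observation": {"tree": self.file_tree}}

    def step(self, action):
        """POST /step: apply one structured action, return an observation."""
        self.steps_taken += 1
        if action.get("action") == "submit":
            self.done = True
        return {"observation": {"ok": True}, "done": self.done}

    def state(self):
        """GET /state: expose progress for monitoring and the grader."""
        return {"steps_taken": self.steps_taken, "done": self.done}
```

Keeping the endpoint surface this small is what makes the environment drop-in compatible with any OpenEnv client.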
---

## 🚀 Why This Wins

This is not infrastructure; it is a **diagnostic product for the AI era**.

It features immense technical depth (sandboxed execution, multi-dimensional scoring, fault injection), massive real-world relevance, and a polished user experience. It doesn't just test AI agents—it shows us how to make them better.