Chirag0123 committed on commit f7185e1 · 1 Parent(s): e3ad9a6

Refine pitch to product-focused storytelling

Files changed (1)
  1. HACKATHON_PITCH.md +45 -53
HACKATHON_PITCH.md CHANGED
@@ -1,88 +1,80 @@
- # 🚀 Codebase Navigation & Repair

- **AI coding agents fail silently and unpredictably. And worse—no one knows *why* they fail.**

- They get lost in large codebases, hallucinate fixes, and deploy broken code. Existing benchmarks only tell you if an agent failed, not *where* or *why* it went wrong.
-
- Our solution: **The system that makes AI coding agents reliable in real-world scenarios.** We track, evaluate, and score every single step of the agent’s reasoning, navigation, and execution.

  ---

- ## 🌟 What is it?
- Codebase Navigation & Repair is a specialized process-evaluation engine for AI coding agents (like Devin, Copilot, or Cursor).

- Instead of spoon-feeding the AI the exact files it needs, we drop the agent into an unfamiliar, multi-file Python repository. The agent must independently navigate the codebase, understand the bug, write a fix, and run the test suite to verify its work, just like a human engineer.

  ---

- ## 🛠️ Why it matters
- Right now, evaluating AI agents is binary: Pass or Fail.
-
- We change that by evaluating the **process**:
- 1. **Efficiency:** Did it read irrelevant files and waste context window?
- 2. **Reasoning:** Did it follow best practices (e.g., reading tests before modifying source code)?
- 3. **Security:** Did it try to inject malicious code during the repair?
-
- This transforms agent development from guesswork into targeted, measurable engineering.

  ---

- ## 🎬 Demo Walkthrough
-
- **The Scenario:** A backend API has a bug where `order_processor.py` fails to handle negative inventory.

- **Step 1: The Reset (Agent enters the workspace)**
- * The agent sees a file tree (no contents) and the failing test: `test_process_valid_order`

- **Step 2: Investigation (Agent reads files)**
- * *Action:* `read_file tests/test_orders.py` *(Smart move: understand expected behavior first)*
- * *Action:* `read_file src/order_processor.py` *(Finds the bug location)*

- **Step 3: The Repair (Agent writes code)**
- * *Action:* `write_file src/order_processor.py` *(Modifies logic to add `if item.qty < 0: raise ValueError`)*

- **Step 4: Verification (Agent runs tests)**
- * *Action:* `run_tests tests/test_orders.py`
- * *Result:* Tests turn green! `[100% passing]`

- **Step 5: Submission & Evaluation**
- * The agent submits the fix.
- * **Our Engine kicks in:** It evaluates the trajectory and gives the agent a top-tier composite score for flawless navigation, strong reasoning, and optimal step efficiency.

  ---

- ## 🏗️ How it works (Simplified)
-
- 1. **The Server:** A FastAPI engine loads a sandboxed, hidden-bug repository.
- 2. **The Agent:** Interacts via strict API calls (`read_file`, `write_file`, `run_tests`), simulating real console usage.
- 3. **The Grader:** A sandboxed Pytest runner securely executes the agent's code.
- 4. **The UI:** A live Gradio dashboard lets you watch agents work in real time or explore dynamic evaluation metrics.

  ---

- ## 🥇 Why it’s better

- We don't just grade the outcome; we stress-test the AI:
- - **Dynamic Fault Injection:** We actively inject misleading code comments and red herring files into the codebase to see if the AI gets tricked.
- - **Trajectory Replay:** We record every API call, diff, and timestamp so you can "play back" an agent's failure.
- - **Proactive Security:** We monitor the agent's output for dangerous patterns (like `os.system("rm -rf /")`) to ensure production safety.

  ---

- ## Why Now
- Autonomous coding agents are the fastest-growing sector in AI. But **reliability is the biggest unsolved problem holding them back from enterprise adoption.** A system that can definitively evaluate *how* an agent reasons and *why* it fails is the missing infrastructure for the next generation of AI product development.

  ---

  ## 🤝 Hackathon Alignment
- We built this explicitly for the Meta OpenEnv standard:
- - **OpenEnv Compliant:** Implements standard `/reset`, `/step`, and `/state` APIs out-of-the-box.
- - **Hugging Face Ready:** Fully dockerized, sandboxed, and deployed via Gradio to HF Spaces.
- - **HF Inference Agent:** Includes a standalone Python script (`run_agent.py`) using Hugging Face inference endpoints—no OpenAI lock-in required.

  ---

  ## 🚀 Why This Wins
- This project isn't just a hackathon toy—it is a piece of **core infrastructure** the AI industry actually needs right now.
-
- It combines **real-world relevance** (fixing broken tests in messy, multi-file repos) with **deep technical rigor** (process-based evaluation, fault injection, secure sandboxing). We've taken the base OpenEnv standard and turned it into a completely observable, visually impressive, state-of-the-art testing layer that is impossible to ignore.
 
+ # 🚀 Codebase Navigation & Repair — OpenEnv

+ ## 🧨 The Problem
+ AI coding agents fail silently and unpredictably.
+ Worse, **no one knows WHY they fail.**
+ Current benchmarks just give a final Pass/Fail grade. Did the agent read the wrong files? Hallucinate a fix? Ignore the tests entirely? There is no way to know.

+ ## 💡 The Solution
+ **We track and evaluate every step of the agent’s reasoning and actions.**
+ Codebase Navigation & Repair is a system that makes AI coding agents reliable in real-world scenarios. We don't just grade the final output; we grade the *entire journey*.

  ---

+ ## 🛠️ What It Is
+ It is a fully OpenEnv-compliant, production-ready testing environment for AI software engineers.

+ You drop an AI agent into an unfamiliar Python repository with a hidden bug. The agent cannot cheat by seeing all files at once. It must explore the codebase step-by-step, find the issue, write the fix, and run actual tests to prove it works, exactly like a human engineer.

  ---

+ ## 🌎 Why It Matters
+ For developers building autonomous agents (like Devin, Copilot, or Cursor), **reliability** is the biggest unsolved problem. Our system provides a high-fidelity diagnostic layer so researchers can find the exact weak spots in their models and fix them.

  ---

+ ## 🎬 Demo Walkthrough: A Realistic Bug Scenario

+ Imagine an e-commerce agent tasked with fixing an order processing failure.

+ **BEFORE:** `test_process_valid_order` is failing.

+ 1. Agent reads `tests/test_orders.py` to understand the expected behavior.
+ 2. Agent reads `src/order_processor.py` and spots the bug: a missing `datetime` import causing the script to crash.
+ 3. Agent writes the fix to `src/order_processor.py`.
+ 4. Agent runs `pytest`.
+ 5. Agent submits the fixed codebase.

+ **AFTER:** All tests pass.

+ Our system records this perfect execution. But if an agent *fails*, our **Process-Based Evaluation** engine flags exactly what went wrong: e.g., *"Agent wasted 14 steps reading irrelevant files and submitted without testing."*
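The trace above is the kind of trajectory a process grader consumes. As a rough, hypothetical sketch (the real engine's rules and scoring are richer than this), a process check over a list of `(tool, argument)` actions might look like:

```python
from collections import Counter

def evaluate_trajectory(actions):
    """Flag process problems in an agent trajectory (simplified sketch).

    `actions` is a list of (tool, argument) tuples, e.g. ("read_file", "src/app.py").
    Returns a list of human-readable flags; an empty list means a clean run.
    """
    flags = []

    # Context efficiency: re-reading the same file wastes tokens.
    reads = Counter(arg for tool, arg in actions if tool == "read_file")
    for path, n in reads.items():
        if n > 1:
            flags.append(f"read {path} {n} times (wasted context)")

    # Reasoning: best practice is to read something before writing code.
    order = [tool for tool, _ in actions]
    if "write_file" in order and "read_file" not in order[: order.index("write_file")]:
        flags.append("wrote code before reading any file")

    # Verification: submitting without running tests is a red flag.
    if "submit" in order and "run_tests" not in order[: order.index("submit")]:
        flags.append("submitted without running tests")

    return flags

# The "perfect" demo trajectory produces no flags:
good = [("read_file", "tests/test_orders.py"),
        ("read_file", "src/order_processor.py"),
        ("write_file", "src/order_processor.py"),
        ("run_tests", "tests/test_orders.py"),
        ("submit", None)]
print(evaluate_trajectory(good))  # []

# A sloppy run gets flagged for wasted reads and skipping verification:
bad = [("read_file", "README.md"), ("read_file", "README.md"),
       ("write_file", "src/order_processor.py"), ("submit", None)]
print(evaluate_trajectory(bad))
```

The same trajectory log that produces these flags also powers replay: since every action is recorded in order, a failure report can point to the exact step where the run went wrong.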
 
 
  ---

+ ## 🏗️ How It Works (Simplified)
+ 1. **The Server:** A FastAPI engine loads a Python repository with a verifiable bug.
+ 2. **The Agent:** An AI model (we provide a Hugging Face Inference agent) requests the current state and explores the repo tree.
+ 3. **The Loop:** The agent interacts via structured actions (`read_file`, `write_file`, `run_tests`).
+ 4. **The Evaluation:** Every action is logged, timed, and scored against our 6-axis Reliability Grader.
+ 5. **The UI:** A beautiful Gradio interface lets you watch the AI operate in real time or explore its trajectory post-flight.
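The loop described above can be sketched in-process. This is a toy stand-in, not the production server: the real environment speaks these actions over HTTP via FastAPI, and the observation/action field names here are simplified assumptions.

```python
# Toy stand-in for the reset/step loop (the real server is a FastAPI app).
class MockRepairEnv:
    def __init__(self):
        # One-file "repository" with a planted bug.
        self.files = {"src/app.py": "def add(a, b):\n    return a - b  # bug\n"}
        self.steps = 0

    def reset(self):
        self.steps = 0
        # Agents see the file tree, never all contents at once.
        return {"file_tree": sorted(self.files), "failing_test": "test_add"}

    def step(self, action):
        self.steps += 1
        tool, arg = action["tool"], action.get("arg")
        if tool == "read_file":
            return {"content": self.files[arg]}
        if tool == "write_file":
            self.files[arg] = action["content"]
            return {"ok": True}
        if tool == "run_tests":
            # Stand-in for the sandboxed pytest run.
            return {"tests_passed": "a + b" in self.files["src/app.py"]}
        raise ValueError(f"unknown tool: {tool}")

# A minimal read -> fix -> verify episode:
env = MockRepairEnv()
obs = env.reset()
env.step({"tool": "read_file", "arg": "src/app.py"})
env.step({"tool": "write_file", "arg": "src/app.py",
          "content": "def add(a, b):\n    return a + b\n"})
result = env.step({"tool": "run_tests"})
print(result)  # {'tests_passed': True}
```

Keeping the action surface this narrow is what makes every step loggable and scorable: the agent can only act through the structured interface, so nothing escapes the trajectory record.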
  ---

+ ## 🥇 Why It’s Better (Our USP)
+
+ We test **Process and Reliability, not just Correctness**.
+
+ - **Flight Data Recorder:** Full trajectory replay. Debug the AI's thought process step-by-step.
+ - **Dynamic Fault Injection:** Real code is messy. We inject misleading comments and red herring files to see if the AI gets distracted.
+ - **Proactive Security:** We scan the AI's output for dangerous patterns (like `os.system`) to prevent destructive actions.
+ - **Context Efficiency:** We penalize agents that waste API tokens by reading identical files over and over.
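The security scan in the list above can be illustrated with a simple pattern matcher. The pattern list here is a hypothetical subset chosen for the example; the production checks are broader:

```python
import re

# Illustrative subset of dangerous patterns (assumed, not the full rule set).
DANGEROUS_PATTERNS = [
    r"os\.system\s*\(",
    r"subprocess\.(run|Popen|call)\s*\(",
    r"\beval\s*\(",
    r"rm\s+-rf",
]

def scan_for_dangerous_code(source: str):
    """Return the patterns the submitted code matches (empty list = clean)."""
    return [p for p in DANGEROUS_PATTERNS if re.search(p, source)]

# A benign fix passes; a hostile payload is caught before it ever runs.
safe_fix = "from datetime import datetime\n\ndef stamp():\n    return datetime.utcnow()\n"
hostile_fix = 'import os\nos.system("rm -rf /")\n'

print(scan_for_dangerous_code(safe_fix))     # []
print(scan_for_dangerous_code(hostile_fix))  # matches two patterns
```

Because the scan runs on the agent's written output before the sandboxed test run, destructive code is flagged proactively rather than after execution.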
  ---

+ ## ⏱️ Why Now?
+ Autonomous coding agents are here. But enterprise adoption is stalled because these agents are unpredictable. Moving from "cool toy" to "reliable teammate" requires rigorous, process-level evaluation. Our system directly solves the reliability bottleneck.

  ---

  ## 🤝 Hackathon Alignment
+ We built this explicitly for the Meta OpenEnv hackathon:
+ - **100% OpenEnv Compliant:** Implements standard `/reset`, `/step`, and `/state` APIs.
+ - **Live & Deployed:** Running live on Hugging Face Spaces with a Gradio frontend.
+ - **Inference Ready:** Built-in agent using Hugging Face inference (`run_agent.py`).
+ - **Sandboxed:** Secure, dockerized test execution.
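For concreteness, here is what payloads for the three endpoints listed above could look like. The endpoint names come from the OpenEnv standard; the field names are illustrative assumptions, not our exact schema:

```python
import json

# Hypothetical payload shapes for the three OpenEnv endpoints.
reset_response = {
    "observation": {
        "file_tree": ["src/order_processor.py", "tests/test_orders.py"],
        "failing_test": "test_process_valid_order",
    },
    "done": False,
}

step_request = {"action": {"tool": "read_file", "arg": "tests/test_orders.py"}}

state_response = {"steps_taken": 1, "tests_passing": False, "episode_done": False}

# All three are plain JSON, so any OpenEnv-speaking client can drive the env.
for name, payload in [("/reset", reset_response),
                      ("/step", step_request),
                      ("/state", state_response)]:
    print(name, json.dumps(payload))
```

Keeping the contract to three JSON endpoints is what makes the environment drop-in compatible with existing OpenEnv agents and harnesses.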
  ---

  ## 🚀 Why This Wins
+ This is not just infrastructure; it is a **diagnostic product for the AI era**.
+ It combines deep technical rigor (sandboxed execution, multi-dimensional scoring, fault injection) with real-world relevance and a polished user experience. It doesn't just test AI agents—it shows us how to make them better.