Commit d505d26 · Parent: d6551d3
docs: highly refine README formatting with technical badges and polished narrative structure

README.md
tags:
  - coding-agent
---

<div align="center">
  <a href="https://huggingface.co/spaces/Chirag0123/codebase-nav-env">
    <img src="https://raw.githubusercontent.com/Chirag0096/Codebase-Navigation-Repair-OpenEnv/assets/assets/demo.webp" width="100%" alt="3D Visualizer Architecture Trace">
  </a>

  <br/>

  <h1>🔍 Codebase Navigation Repair OpenEnv</h1>

  <p><strong>The ultimate diagnostic environment to end "Vibe Coding." Making AI coding agents structural, testable, and deeply debuggable.</strong></p>

  <p>
    <a href="https://huggingface.co/spaces/Chirag0123/codebase-nav-env"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Live%20Demo-blue" alt="Hugging Face Space"></a>
    <img src="https://img.shields.io/badge/Python-3.10+-blue.svg" alt="Python Version">
    <img src="https://img.shields.io/badge/FastAPI-REST_API-009688.svg" alt="FastAPI">
    <img src="https://img.shields.io/badge/Three.js-3D_Visualizer-black.svg" alt="Three.js">
    <img src="https://img.shields.io/badge/Docker-Containerized_Scoring-2496ED.svg" alt="Docker">
  </p>
</div>

---

## 🚨 The End of "Vibe Coding"

We are officially in the era of **Vibe Coding**. The volume of AI-generated code is exploding, yet developers and top-tier AI agents (Copilot, Devin, Claude Code) are increasingly writing and submitting code *blindly*.

Most agents don't actually know **where the issue exists**, what the **code flow** looks like, or how the **function dependencies** cascade. Current developer benchmarks only evaluate the final outcome. **They do not evaluate cognition.**

When an AI agent claims "I fixed the bug," how do you verify *how* it did it? Did it actually navigate to the source of the crash and trace the logical data flow, or did it just randomly change syntax until a test arbitrarily turned green?

## 💡 Our Solution: 3D Visualization & Deep Analytic Execution

This project is not just another benchmark: it is a **Full-Stack Diagnostic Platform**. It forces autonomous AI agents to explore an unknown Python repository file by file through a strictly monitored API, and then exposes their **exact cognitive layout**.

By tracking structural behavior instead of just binary pass/fail outcomes, the platform gives researchers, engineers, and hackathon judges unprecedented visibility into an AI's actual thought process and navigation footprint.

---

## 🧠 Core Intelligence Modules (v4.0)

Unlike standard environments, we evaluate **how** the agent works, using research-grade engines built specifically for this platform:

| 🧩 Module | 🎯 What It Does (The Cure for Vibe Coding) |
|-----------|--------------------------------------------|
| **`3D Trace Visualizer`** | A seamless, fully interpolated 3D engine that renders repos as geometric maps (cubes for source, prisms for tests) and visualizes agent navigation traces as glowing Catmull-Rom tube paths. |
| **`Causal Graph Probe`** | Detects "shortcut learning". Maps a directed acyclic graph to verify whether the agent actually read the test file, traced its imported module, and structurally fixed the root cause, or guessed blindly. |
| **`Confidence Calibrator`** | Infers the agent's behavioral confidence from real-time execution speeds, rewrite hesitation frequencies, and test verification ratios. |
| **`Counterfactual Engine`** | Subjects the agent to 6 robustness ablation tests (mutating the environment behind the scenes) to determine whether its strategy relies on brittle memorization. |
| **`Episodic Memory Bank`** | A cross-episode Retrieval-Augmented Generation (RAG) store that captures procedural mistakes (e.g., failing to run tests before committing) and injects the lessons into future iteration system prompts. |
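To make the behavioral signals concrete, here is a minimal sketch of how a confidence calibrator of this kind could combine them. The function name, weighting, and 0-to-1 scaling are illustrative assumptions, not the platform's actual algorithm:

```python
def calibrate_confidence(mean_step_seconds: float,
                         rewrite_count: int,
                         edit_count: int,
                         verified_edits: int) -> float:
    """Hypothetical heuristic: fast steps, little rewrite hesitation, and a
    high test-verification ratio each push behavioral confidence toward 1.0."""
    if edit_count == 0:
        return 0.0
    speed = 1.0 / (1.0 + mean_step_seconds / 10.0)            # faster steps score higher
    hesitation = 1.0 - min(rewrite_count / edit_count, 1.0)   # fewer rewrites score higher
    verification = min(verified_edits / edit_count, 1.0)      # tested edits score higher
    return round((speed + hesitation + verification) / 3.0, 3)
```

Under this toy weighting, an agent that edits quickly, rarely rewrites, and verifies every edit scores near 1.0, while heavy rewriting drags the score down.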

---

## ⚙️ How It Works (The OpenEnv Standard)

1. **Blind Start:** The agent loads an unfamiliar environment variant and sees only the repository file tree, not its contents.
2. **Step Budgeting:** The agent explores and reads files one at a time; each read costs a strictly penalized exploration step.
3. **Flow Navigation:** The agent navigates architecture dependencies and identifies structural vulnerabilities.
4. **Execution:** The agent writes the updated fix.
5. **Verification:** The agent verifies functionality through containerized `pytest` execution, safely within the RL boundary.
6. **Dynamic Scoring:** The environment scores the agent's complete step trajectory across 6 independent research axes.
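The loop above can be sketched as a thin HTTP client against the local environment. The `/step` endpoint and the `read_file`/`write_file` action names come from this README; the payload schema, file paths, and response shape below are assumptions:

```python
import json
import urllib.request

BASE = "http://localhost:7860"  # local environment from the Quick Start

def step_payload(action: str, **kwargs) -> dict:
    """Build a /step request body. The field names are an assumption;
    check the running environment's API docs for the real schema."""
    return {"action": action, **kwargs}

def step(action: str, **kwargs) -> dict:
    """POST one navigation action to the environment and return its JSON reply."""
    req = urllib.request.Request(
        f"{BASE}/step",
        data=json.dumps(step_payload(action, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Steps 1-3: blind exploration, one budgeted read at a time.
    print(step("read_file", path="src/module.py"))        # hypothetical path
    # Steps 4-5: submit a candidate fix; the env verifies it with pytest.
    print(step("write_file", path="src/module.py", content="# patched"))
```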

---

## 🚀 Quick Start

### 1. Run Locally (No Docker)

Spin up the backend and the 3D analytical dashboard.

```bash
pip install -r requirements.txt
python app.py  # Gradio UI + FastAPI at http://localhost:7860
```

### 2. Connect Your Custom LLM Agent

Wire up your own agent configuration.

```bash
export HF_TOKEN=hf_xxxxx
# Run your script against the local /step FastAPI environment
python inference.py
```

### 3. Run with Docker

```bash
docker build -t codebase-nav-env .
docker run -p 7860:7860 codebase-nav-env
```

---

## 📊 Evaluation API Layers

The environment communicates strictly via a standard RESTful architecture.

| Endpoint | Method | Operational Description |
|----------|--------|-------------------------|
| `/step` | `POST` | Takes a single OpenEnv navigation action (`read_file`, `write_file`) |
| `/evaluate` | `GET` | Fetches deterministic baseline evaluation metrics |
| `/causal-probe` | `GET` | Generates a directed acyclic graph mapping the true root-cause logic |
| `/confidence` | `GET` | Returns behavioral confidence calibration scores |
| `/counterfactual` | `POST` | Triggers the counterfactual engine's 6 robustness-ablation (hallucination detection) tests |
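The read-only layers can be polled with plain GETs. A minimal sketch: the endpoint paths come from the table above, but the response contents are unspecified here, so the example just prints raw JSON:

```python
import json
import urllib.request

BASE = "http://localhost:7860"  # local environment from the Quick Start

def diagnostic_url(endpoint: str) -> str:
    """Join the local environment base URL with a diagnostic endpoint path."""
    return f"{BASE}{endpoint}"

def fetch_metrics(endpoint: str) -> dict:
    """GET one of the read-only evaluation layers and decode its JSON body."""
    with urllib.request.urlopen(diagnostic_url(endpoint)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    for endpoint in ("/evaluate", "/causal-probe", "/confidence"):
        print(endpoint, json.dumps(fetch_metrics(endpoint), indent=2))
```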

<br/>

> *Stop trusting the vibe. Force the cognition.*