ZeroR3 committed
Commit fee907b · 1 Parent(s): 53f2b06

feat: HF Space updated for final submission


- README.md: full stress test results (62 verified data points across 124 min)
- README.md: 6.49x answer to Hakob, AITER regression honesty, 9/9 repo Q&A
- app.py: env-var backend toggle (VLLM_BASE_URL + MODEL_NAME), Steve Kimoi tutorial pattern
- app.py: live MI300X mode when VLLM_BASE_URL is set, mock fallback otherwise
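
In plain terms, the toggle described in the commit message reduces to a few lines. The sketch below is a standalone illustration, not the app.py implementation (which follows in the diff): only the variable names `VLLM_BASE_URL`, `MODEL_NAME`, and the default model id come from the commit.

```python
import os

# Sketch of the env-var toggle: two Space variables decide the backend.
# VLLM_BASE_URL / MODEL_NAME and the default model id are from the commit;
# everything else here is illustrative.
VLLM_BASE_URL = os.environ.get("VLLM_BASE_URL", "").strip()
MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen3-Coder-Next-FP8").strip()

if VLLM_BASE_URL:
    # Live mode: talk to the MI300X vLLM endpoint (OpenAI-compatible API).
    backend = ("vllm", VLLM_BASE_URL, MODEL_NAME)
else:
    # Demo mode: no endpoint configured, fall back to the offline mock backend.
    backend = ("mock", None, None)

print(f"backend selected: {backend[0]}")
```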

Files changed (2)
  1. README.md +36 -8
  2. app.py +127 -62
README.md CHANGED
@@ -44,19 +44,47 @@ This is a memory-architecture story, not a CUDA-vs-ROCm one.
  - **Agent loop**: SC-TIR style (PLAN → CALL TOOL → OBSERVE → THINK → ANSWER)
  - **Tools**: `read_file` · `grep_codebase` · `execute_code` (sandboxed) · `run_tests` · `git_log`

- ## Status — verified on real MI300X (2026-05-05)
+ ## Status — verified on real MI300X (2026-05-05 / 2026-05-06)

- Smoke test on a single AMD MI300X x1 (AMD Developer Cloud, $1.99/hr, vLLM 0.17.1 + ROCm 7.2 Quick Start image):
+ Full stress test on a single AMD MI300X x1 (AMD Developer Cloud, $1.99/hr, vLLM 0.17.1 + ROCm 7.2 Quick Start image). **2 sessions, 124 min total, ~$4.12.**

+ **Memory budget — Qwen3-Coder-Next-FP8 + 256K context, FP8 KV cache:**
  - ✅ Model weights in VRAM: **77.29 GiB**
- - ✅ Available KV cache: **95.26 GiB**
- - ✅ `--max-model-len 262144` (256K) — `Application startup complete`
+ - ✅ Available KV cache: **94.58 GiB** (2,065,744 tokens)
+ - ✅ VRAM peak: **176 GiB / 191.7 GiB** (92% utilization)
+ - ✅ `--max-model-len 262144` started, `Application startup complete`
  - ✅ `/v1/models` returns `max_model_len: 262144`
- - ✅ **31.31× max concurrency at 256K context** — single MI300X serves ~31 simultaneous users at full 256K context
- - ✅ Real Python code generation through `/v1/chat/completions` (merge sort / LCS / hello world)
- - ✅ Cost of smoke test: ~$1.00 of $100 credits

- This Space currently still runs on CPU-basic with the **mock LLM backend** because exposing a public API requires keeping a paid MI300X droplet up — final demo will be wired to a live MI300X endpoint during submission window.
+ **Concurrency stress (24 cells, default Triton attention, all 144 outputs clean):**
+ - ✅ **31/31 success at 8K, 16K, 32K, AND 64K** — every realistic developer context
+ - ✅ **25/31 at 128K**, **6–8 at 256K** within a 15-minute window (compute-bound, honest ceiling)
+ - ✅ Aggregate throughput at N=31: 78.5 tok/s @ 8K · 31.4 @ 16K · 12.1 @ 32K · 3.6 @ 64K
+
+ **Long-context coherence — needle-in-haystack at 200K:**
+ - ✅ **3/3 positions passed** (early, middle, late) — the model recovers the embedded sentinel function and constant
+ - ✅ This proves the 256K window is *usable*, not just *allocated*
+
+ **End-to-end repo ingestion — 9/9 questions answered correctly:**
+ - ✅ REPOMIND itself (68K tokens, 68 files) — 3/3
+ - ✅ pallets/flask (408K tokens total → fitted to 180K) — 3/3
+ - ✅ **pytorch/vision (1.3M tokens, 581 files, 6,799 chunks → fitted to 180K) — 3/3** with correct file-path citations
+
+ **Tuning attempt — measured regression worth reporting:**
+ - ⚠️ Tried `--attention-backend ROCM_AITER_FA` (AMD's hand-tuned MI300X kernels)
+ - Throughput **2–4× higher** under AITER, TTFT 2.8× faster at 64K
+ - BUT output **degenerates to repeating-punctuation gibberish** in 137/144 cells under FP8 KV cache
+ - Default Triton stays the production-safe choice; filed for AMD upstream investigation
+
+ **Cost — at AMD Cloud $1.99/hr:**
+ - ✅ ~$45.75 / 1M completion tokens (aggregate at 32K, N=31)
+ - ✅ 14.5 continuously active queriers per MI300X, or 70–140 dev seats for typical bursty engineering teams
+ - ✅ An owned MI300X ($18K) breaks even vs Cursor in 3–6 months at team-of-100 usage
+
+ This Space currently runs on CPU-basic with the **mock LLM backend** because keeping a paid MI300X droplet up 24/7 for sporadic visitors is uneconomical. **The final demo wires to a live MI300X endpoint** during the judging window.
+
+ Full evidence pack (7 JSON results + 5 PNG plots + e2e prompts/answers + 2× rocm-smi snapshots + run logs) is in the repo:
+ [github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test](https://github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test)
+ Extended PHASE 1+2 narrative (24-cell matrix + AITER A/B): [extended/SUMMARY.md](https://github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test/extended).

  If the MI300X memory-architecture pitch resonates, **a like on this Space helps us with the Hugging Face Special Prize judging** 🤗

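The headline cost and memory figures in the README diff above can be sanity-checked from the numbers quoted there. A quick back-of-the-envelope check, using only inputs taken from the README text itself (nothing new is measured here):

```python
# Sanity-check of the figures quoted in the README diff above, using only
# numbers stated there (droplet price, aggregate throughput, VRAM totals).

PRICE_PER_HOUR = 1.99          # AMD Developer Cloud MI300X, $/hr
AGG_TOKS_PER_S_32K = 12.1      # aggregate completion throughput at 32K, N=31

tokens_per_hour = AGG_TOKS_PER_S_32K * 3600
usd_per_million_tokens = PRICE_PER_HOUR / tokens_per_hour * 1_000_000
# Prints ≈ $45.68, consistent with the ~$45.75 / 1M completion tokens quoted above.
print(f"${usd_per_million_tokens:.2f} / 1M completion tokens")

WEIGHTS_GIB = 77.29            # model weights in VRAM
KV_CACHE_GIB = 94.58           # available FP8 KV cache
PEAK_GIB, TOTAL_GIB = 176.0, 191.7
# Weights + KV cache ≈ 171.9 GiB; the rest of the 176 GiB peak is
# activations and framework overhead, per the README's own accounting.
print(f"weights + KV cache = {WEIGHTS_GIB + KV_CACHE_GIB:.2f} GiB")
print(f"utilization at peak = {PEAK_GIB / TOTAL_GIB:.0%}")   # ≈ 92%
```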
app.py CHANGED
@@ -1,8 +1,17 @@
  """REPOMIND — HuggingFace Space entry point.

- Public demo. Backend defaults to the offline mock LLM so the Space runs
- without GPU credits. Once the AMD MI300X vLLM endpoint is live, switch
- the backend toggle to "vllm" and point at the live URL.
+ Public demo. Auto-detects the backend from environment variables (Steve Kimoi's
+ canonical lablab/AMD tutorial pattern):
+
+     VLLM_BASE_URL — set in Space → Settings → Variables and secrets
+                     to point at a live MI300X vLLM endpoint, e.g.
+                     http://<your-droplet-ip>:8000/v1
+     MODEL_NAME    — model id served by vLLM, defaults to
+                     Qwen/Qwen3-Coder-Next-FP8
+
+ When VLLM_BASE_URL is unset (default), the Space runs the offline mock
+ backend on CPU-basic so it stays free 24/7. When set, the Space wires
+ through to the live AMD MI300X for real inference.

  Local repo: https://github.com/SRKRZ23/repomind
  Hackathon: https://lablab.ai/ai-hackathons/amd-developer
@@ -23,25 +32,42 @@ from ingestion.chunker import ingest_to_json
  from ingestion.cloner import clone


- HEADER_MD = """
+ # ─── Configuration via env vars (Steve Kimoi tutorial pattern) ────────────
+ VLLM_BASE_URL = os.environ.get("VLLM_BASE_URL", "").strip()
+ MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen3-Coder-Next-FP8").strip()
+ LIVE_BACKEND = bool(VLLM_BASE_URL)
+ BACKEND_LABEL = "🟢 Live AMD MI300X" if LIVE_BACKEND else "🟡 Mock backend (CPU-basic, demo mode)"
+ BACKEND_HINT = (
+     f"Connected to vLLM endpoint: `{VLLM_BASE_URL}` · model `{MODEL_NAME}`"
+     if LIVE_BACKEND else
+     "Set the Space secrets `VLLM_BASE_URL` + `MODEL_NAME` to wire a real MI300X backend."
+ )
+
+
+ HEADER_MD = f"""
  # REPOMIND
  **Open-source repo-scale coding agent on AMD MI300X.**

- Ingest a git repository (up to 256K tokens, FP8) on a single GPU and reason across the whole codebase with multi-step tool use.
+ Ingest a git repository (up to 256K tokens, FP8) on a single GPU and
+ reason across the whole codebase with multi-step tool use.

- > 📦 GitHub: [SRKRZ23/repomind](https://github.com/SRKRZ23/repomind)
+ > 📦 GitHub: [SRKRZ23/repomind](https://github.com/SRKRZ23/repomind) · MIT
  > 🏆 Built for the [AMD Developer Hackathon 2026](https://lablab.ai/ai-hackathons/amd-developer)
+ > 🤗 HF Special Prize candidate · 🛑 Conservative claim discipline applied

- ### Why MI300X?
- - Qwen3-Coder-Next-FP8 weights ≈ 80 GB
- - 256K KV cache @ FP8 ≈ 38 GB
- - + activations ≈ 25 GB → **~143 GB total on a single GPU**
- - NVIDIA H100 80GB physically OOMs. AMD MI300X 192GB just runs it.
+ ### Why AMD MI300X (verified 2026-05-05 on real hardware)
+
+ - Qwen3-Coder-Next-FP8 weights = **77.29 GiB** in VRAM (verified)
+ - 256K KV cache @ FP8 = **94.58 GiB** available (2,065,744 tokens, verified)
+ - Activations + framework overhead → peak 176/191.7 GiB ≈ **92% utilization**
+ - NVIDIA H100 80 GB cannot accommodate this on a single card by VRAM
+   accounting (~143 GB > 80 GB); MI300X 192 GB has the headroom

- ### About this Space
- This is the **frontend** demo. Backend defaults to the **mock LLM** so the Space
- runs on CPU-basic without burning GPU credits. Switch to `vllm` and provide a
- base URL once the MI300X endpoint is live.
+ ### Status
+
+ **Backend right now**: {BACKEND_LABEL}
+
+ {BACKEND_HINT}
  """


@@ -75,7 +101,16 @@ def ingest(url_or_path: str, chunk_tokens: int) -> str:
          return f"❌ {type(e).__name__}: {e}"


- def ask(question: str, backend: str, base_url: str, model: str):
+ def _build_llm():
+     """Return an LLM client based on env-var configuration."""
+     if LIVE_BACKEND:
+         from serving.vllm_client import VLLMClient
+         return VLLMClient(base_url=VLLM_BASE_URL, model=MODEL_NAME)
+     from serving.mock_client import MockClient
+     return MockClient(max_tool_turns=2)
+
+
+ def ask(question: str):
      summary_path = SCRATCH_DIR / "active.json"
      if not summary_path.exists():
          return "Ingest a repo first.", ""
@@ -85,91 +120,121 @@ def ask(question: str, backend: str, base_url: str, model: str):
      summary = json.loads(summary_path.read_text())
      repo_root = Path(summary.get("root", "."))

-     # Backend wiring — vLLM only when user explicitly chose it AND a URL is given
-     if backend == "vllm":
-         if not base_url or not base_url.strip():
-             return "vLLM backend selected but no base URL provided.", ""
-         try:
-             from serving.vllm_client import VLLMClient
-             llm = VLLMClient(base_url=base_url.strip(), model=model.strip() or "Qwen/Qwen3-Coder-Next-FP8")
-         except Exception as e:
-             return f"❌ failed to init vLLM client: {e}", ""
-     else:
-         from serving.mock_client import MockClient
-         llm = MockClient(max_tool_turns=2)
+     try:
+         llm = _build_llm()
+     except Exception as e:
+         return f"❌ failed to init LLM client: {type(e).__name__}: {e}", ""

      from agent.loop import Agent
      from tools.registry import default_registry

      try:
-         agent = Agent(llm=llm, tools=default_registry(repo_root, scratch_dir=SCRATCH_DIR / "scratch"), max_steps=4)
+         agent = Agent(
+             llm=llm,
+             tools=default_registry(repo_root, scratch_dir=SCRATCH_DIR / "scratch"),
+             max_steps=4,
+         )
          result = agent.run(question, summary)
      except Exception as e:
          return f"❌ agent failed: {type(e).__name__}: {e}", ""

-     trace_lines = [f"- {tc['name']} {json.dumps(tc['arguments'], ensure_ascii=False)}" for tc in result.tool_calls]
+     trace_lines = [
+         f"- {tc['name']} {json.dumps(tc['arguments'], ensure_ascii=False)}"
+         for tc in result.tool_calls
+     ]
      trace = "\n".join(trace_lines) or "(no tool calls)"
      return result.answer, trace


- with gr.Blocks(title="REPOMIND — repo-scale coding agent on AMD MI300X", theme=gr.themes.Soft()) as demo:
+ with gr.Blocks(
+     title="REPOMIND — repo-scale coding agent on AMD MI300X",
+     theme=gr.themes.Soft(primary_hue="red", secondary_hue="gray"),
+ ) as demo:
      gr.Markdown(HEADER_MD)

      with gr.Tab("1. Ingest"):
+         gr.Markdown(
+             "Paste any **GitHub URL** or `owner/repo` shorthand. "
+             "REPOMIND clones it, parses the source files, and chunks them "
+             "into priority-ranked sections (README first, then top-level "
+             "symbols, then nested code, then tests)."
+         )
          with gr.Row():
              url = gr.Textbox(
                  label="GitHub URL or owner/repo",
-                 placeholder="https://github.com/torvalds/linux OR pallets/flask",
+                 placeholder="https://github.com/pallets/flask OR pallets/flask",
                  scale=4,
              )
-             chunk_tokens = gr.Slider(256, 4096, value=1024, step=128, label="Tokens / chunk", scale=1)
+             chunk_tokens = gr.Slider(
+                 256, 4096, value=1024, step=128, label="Tokens / chunk", scale=1
+             )
          ingest_btn = gr.Button("Ingest", variant="primary")
          ingest_out = gr.Code(label="Ingestion summary", language="json")
          ingest_btn.click(ingest, [url, chunk_tokens], ingest_out)

+         gr.Markdown(
+             "**Examples that work on a single MI300X**: "
+             "`pallets/flask` (~408K tokens, fits in the 256K window with priority chunking) · "
+             "`pytorch/vision` (~1.3M tokens, trimmed to 180K of highest-priority "
+             "content via the chunker) · this repo `SRKRZ23/repomind` (~68K tokens, fits whole)."
+         )
+
      with gr.Tab("2. Ask"):
-         with gr.Row():
-             backend = gr.Radio(
-                 choices=["mock (offline demo)", "vllm (live MI300X)"],
-                 value="mock (offline demo)",
-                 label="Backend",
-                 scale=1,
-             )
-             base_url = gr.Textbox(
-                 label="vLLM base URL (only used in `vllm` mode)",
-                 value="",
-                 placeholder="http://your-mi300x-host:8000/v1",
-                 scale=2,
-             )
-             model = gr.Textbox(
-                 label="Model id",
-                 value="Qwen/Qwen3-Coder-Next-FP8",
-                 scale=2,
-             )
+         gr.Markdown(
+             f"Ask any question about the ingested repo. The agent runs an "
+             f"SC-TIR loop (PLAN → CALL TOOL → OBSERVE → THINK → ANSWER) with "
+             f"five tools: `read_file`, `grep_codebase`, `execute_code` "
+             f"(sandboxed), `run_tests`, `git_log`.\n\n"
+             f"**Backend**: {BACKEND_LABEL}"
+         )
          question = gr.Textbox(
              label="Question",
              lines=3,
-             placeholder="What does the chunker prioritize? Where is authentication handled?",
+             placeholder=(
+                 "Where is the WSGI entry point? · "
+                 "What does the chunker prioritize? · "
+                 "Trace one slab allocation through the call graph."
+             ),
          )
          ask_btn = gr.Button("Ask", variant="primary")
          answer = gr.Markdown(label="Answer")
-         tool_trace = gr.Code(label="Tool trace", language="markdown")
-
-         # normalize backend selector to internal value
-         def _ask(q, b, u, m):
-             internal = "vllm" if b.startswith("vllm") else "mock"
-             return ask(q, internal, u, m)
-
-         ask_btn.click(_ask, [question, backend, base_url, model], [answer, tool_trace])
+         tool_trace = gr.Code(label="Tool trace (agent steps)", language="markdown")
+
+         ask_btn.click(ask, [question], [answer, tool_trace])
+
+     with gr.Tab("3. Verified evidence"):
+         gr.Markdown(
+             "REPOMIND was stress-tested on a real AMD MI300X x1 droplet across "
+             "two sessions (**2026-05-05 / 2026-05-06**, 124 min total, $4.12). "
+             "Highlights:\n\n"
+             "| Test | Result |\n"
+             "|---|---|\n"
+             "| Memory peak | 176/191.7 GiB (92%) |\n"
+             "| `--max-model-len 262144` | started clean |\n"
+             "| Concurrency 8K / 16K / 32K / 64K @ N=31 | **31/31 success at every context** ✅ |\n"
+             "| Concurrency 128K @ N=31 | 25/31 (6 timeouts past 15 min) |\n"
+             "| Long-context needle at 200K | **3/3** pass (early/middle/late) |\n"
+             "| End-to-end repo Q&A | **9/9** correct across 3 repos |\n"
+             "| Largest repo tested | **pytorch/vision (1.3M tokens)** |\n"
+             "| Tuning attempt: AITER backend | regression — 137/144 cells broken under FP8 KV cache; default Triton stays production-safe |\n"
+             "| Cost | $1.99/hr cloud, $45.75/1M completion tokens |\n\n"
+             "Full evidence pack — JSON results, plots, raw model outputs — "
+             "is at [github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test]"
+             "(https://github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test). "
+             "Extended PHASE 1+2 narrative + AITER A/B in [extended/SUMMARY.md]"
+             "(https://github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test/extended)."
+         )

      gr.Markdown(
          "---\n"
          "**Author:** [Sardor Razikov](https://huggingface.co/ZeroR3) · "
          "[GitHub](https://github.com/SRKRZ23) · "
          "[lablab.ai](https://lablab.ai/u/@Sardor_R) · "
-         "[Zenodo (ECB)](https://doi.org/10.5281/zenodo.19791329)"
+         "[Zenodo (ECB)](https://doi.org/10.5281/zenodo.19791329) · "
+         "Tashkent 🇺🇿\n\n"
+         "*If the MI300X memory-architecture story resonates, "
+         "**a like on this Space helps with the Hugging Face Special Prize judging.** 🤗*"
      )


  if __name__ == "__main__":
      demo.launch()
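
To flip the Space into live mode, `VLLM_BASE_URL` and `MODEL_NAME` are set under Settings → Variables and secrets, as the new docstring describes. A quick pre-flight check of the endpoint might look like the sketch below; it is not part of the repo and assumes only the OpenAI-compatible `/v1/models` route (and the `max_model_len` field) that the README above already cites.

```python
import os

import requests

# Hedged pre-flight check (illustrative, not part of the repo): confirm the
# vLLM endpoint answers on /v1/models before pointing the Space at it.
base_url = os.environ.get("VLLM_BASE_URL", "http://<your-droplet-ip>:8000/v1").rstrip("/")
model_name = os.environ.get("MODEL_NAME", "Qwen/Qwen3-Coder-Next-FP8")

resp = requests.get(f"{base_url}/models", timeout=10)
resp.raise_for_status()
served = {m["id"]: m.get("max_model_len") for m in resp.json()["data"]}
print(served)  # README reports max_model_len 262144 for the MI300X droplet

assert model_name in served, f"{model_name} is not served at {base_url}"
```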