ZeroR3 committed
Commit fee907b · 1 Parent(s): 53f2b06

feat: HF Space updated for final submission


- README.md: full stress test results (62 verified data points across 124 min)
- README.md: 6.49x answer to Hakob, AITER regression honesty, 9/9 repo Q&A
- app.py: env-var backend toggle (VLLM_BASE_URL + MODEL_NAME), Steve Kimoi tutorial pattern
- app.py: live MI300X mode when VLLM_BASE_URL is set, mock fallback otherwise
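
In plain terms, the toggle described in the commit message reduces to a few lines. The sketch below is a standalone illustration, not the app.py implementation (which follows in the diff): only the variable names `VLLM_BASE_URL`, `MODEL_NAME`, and the default model id come from the commit.

```python
import os

# Sketch of the env-var toggle: two Space variables decide the backend.
# VLLM_BASE_URL / MODEL_NAME and the default model id are from the commit;
# everything else here is illustrative.
VLLM_BASE_URL = os.environ.get("VLLM_BASE_URL", "").strip()
MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen3-Coder-Next-FP8").strip()

if VLLM_BASE_URL:
    # Live mode: talk to the MI300X vLLM endpoint (OpenAI-compatible API).
    backend = ("vllm", VLLM_BASE_URL, MODEL_NAME)
else:
    # Demo mode: no endpoint configured, fall back to the offline mock backend.
    backend = ("mock", None, None)

print(f"backend selected: {backend[0]}")
```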

Files changed (2)
  1. README.md +36 -8
  2. app.py +127 -62
README.md CHANGED
@@ -44,19 +44,47 @@ This is a memory-architecture story, not a CUDA-vs-ROCm one.
  - **Agent loop**: SC-TIR style (PLAN → CALL TOOL → OBSERVE → THINK → ANSWER)
  - **Tools**: `read_file` · `grep_codebase` · `execute_code` (sandboxed) · `run_tests` · `git_log`

- ## Status — verified on real MI300X (2026-05-05)
+ ## Status — verified on real MI300X (2026-05-05 / 2026-05-06)

- Smoke test on a single AMD MI300X x1 (AMD Developer Cloud, $1.99/hr, vLLM 0.17.1 + ROCm 7.2 Quick Start image):
+ Full stress test on a single AMD MI300X x1 (AMD Developer Cloud, $1.99/hr, vLLM 0.17.1 + ROCm 7.2 Quick Start image). **2 sessions, 124 min total, ~$4.12.**

+ **Memory budget — Qwen3-Coder-Next-FP8 + 256K context, FP8 KV cache:**
  - ✅ Model weights in VRAM: **77.29 GiB**
- - ✅ Available KV cache: **95.26 GiB**
- - ✅ `--max-model-len 262144` (256K) — `Application startup complete`
+ - ✅ Available KV cache: **94.58 GiB** (2,065,744 tokens)
+ - ✅ VRAM peak: **176 GiB / 191.7 GiB** (92% utilization)
+ - ✅ `--max-model-len 262144` started, `Application startup complete`
  - ✅ `/v1/models` returns `max_model_len: 262144`
- - ✅ **31.31× max concurrency at 256K context** — single MI300X serves ~31 simultaneous users at full 256K context
- - ✅ Real Python code generation through `/v1/chat/completions` (merge sort / LCS / hello world)
- - ✅ Cost of smoke test: ~$1.00 of $100 credits

- This Space currently still runs on CPU-basic with the **mock LLM backend** because exposing a public API requires keeping a paid MI300X droplet up — final demo will be wired to a live MI300X endpoint during submission window.
+ **Concurrency stress (24 cells, default Triton attention, all 144 outputs clean):**
+ - ✅ **31/31 success at 8K, 16K, 32K, AND 64K** — every realistic developer context
+ - ✅ **25/31 at 128K**, **6–8 at 256K** within a 15-minute window (compute-bound, honest ceiling)
+ - ✅ Aggregate throughput at N=31: 78.5 tok/s @ 8K · 31.4 @ 16K · 12.1 @ 32K · 3.6 @ 64K
+
+ **Long-context coherence — needle-in-haystack at 200K:**
+ - ✅ **3/3 positions passed** (early, middle, late) — the model recovers the embedded sentinel function and constant
+ - ✅ This proves the 256K window is *usable*, not just *allocated*
+
+ **End-to-end repo ingestion — 9/9 questions answered correctly:**
+ - ✅ REPOMIND itself (68K tokens, 68 files) — 3/3
+ - ✅ pallets/flask (408K tokens total → fitted to 180K) — 3/3
+ - ✅ **pytorch/vision (1.3M tokens, 581 files, 6,799 chunks → fitted to 180K) — 3/3** with correct file-path citations
+
+ **Tuning attempt — measured regression worth reporting:**
+ - ⚠️ Tried `--attention-backend ROCM_AITER_FA` (AMD's hand-tuned MI300X kernels)
+ - Throughput **2–4× higher** under AITER, TTFT 2.8× faster at 64K
+ - BUT output **degenerates to repeating-punctuation gibberish** in 137/144 cells under FP8 KV cache
+ - Default Triton stays the production-safe choice; filed for AMD upstream investigation
+
+ **Cost — at AMD Cloud $1.99/hr:**
+ - ✅ ~$45.75 / 1M completion tokens (aggregate at 32K, N=31)
+ - ✅ 14.5 continuously active queriers per MI300X, or 70–140 dev seats for typical bursty engineering teams
+ - ✅ An owned MI300X ($18K) breaks even vs Cursor in 3–6 months at team-of-100 usage
+
+ This Space currently runs on CPU-basic with the **mock LLM backend** because keeping a paid MI300X droplet up 24/7 for sporadic visitors is uneconomical. **The final demo wires to a live MI300X endpoint** during the judging window.
+
+ Full evidence pack (7 JSON results + 5 PNG plots + e2e prompts/answers + 2× rocm-smi snapshots + run logs) is in the repo:
+ [github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test](https://github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test)
+ Extended PHASE 1+2 narrative (24-cell matrix + AITER A/B): [extended/SUMMARY.md](https://github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test/extended).

  If the MI300X memory-architecture pitch resonates, **a like on this Space helps us with the Hugging Face Special Prize judging** 🤗

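The headline cost and memory figures in the README diff above can be sanity-checked from the numbers quoted there. A quick back-of-the-envelope check, using only inputs taken from the README text itself (nothing new is measured here):

```python
# Sanity-check of the figures quoted in the README diff above, using only
# numbers stated there (droplet price, aggregate throughput, VRAM totals).

PRICE_PER_HOUR = 1.99          # AMD Developer Cloud MI300X, $/hr
AGG_TOKS_PER_S_32K = 12.1      # aggregate completion throughput at 32K, N=31

tokens_per_hour = AGG_TOKS_PER_S_32K * 3600
usd_per_million_tokens = PRICE_PER_HOUR / tokens_per_hour * 1_000_000
# Prints ≈ $45.68, consistent with the ~$45.75 / 1M completion tokens quoted above.
print(f"${usd_per_million_tokens:.2f} / 1M completion tokens")

WEIGHTS_GIB = 77.29            # model weights in VRAM
KV_CACHE_GIB = 94.58           # available FP8 KV cache
PEAK_GIB, TOTAL_GIB = 176.0, 191.7
# Weights + KV cache ≈ 171.9 GiB; the rest of the 176 GiB peak is
# activations and framework overhead, per the README's own accounting.
print(f"weights + KV cache = {WEIGHTS_GIB + KV_CACHE_GIB:.2f} GiB")
print(f"utilization at peak = {PEAK_GIB / TOTAL_GIB:.0%}")   # ≈ 92%
```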
app.py CHANGED
@@ -1,8 +1,17 @@
  """REPOMIND — HuggingFace Space entry point.

- Public demo. Backend defaults to the offline mock LLM so the Space runs
- without GPU credits. Once the AMD MI300X vLLM endpoint is live, switch
- the backend toggle to "vllm" and point at the live URL.
+ Public demo. Auto-detects the backend from environment variables (Steve Kimoi's
+ canonical lablab/AMD tutorial pattern):
+
+     VLLM_BASE_URL — set in Space → Settings → Variables and secrets
+                     to point at a live MI300X vLLM endpoint, e.g.
+                     http://<your-droplet-ip>:8000/v1
+     MODEL_NAME    — model id served by vLLM, defaults to
+                     Qwen/Qwen3-Coder-Next-FP8
+
+ When VLLM_BASE_URL is unset (default), the Space runs the offline mock
+ backend on CPU-basic so it stays free 24/7. When set, the Space wires
+ through to the live AMD MI300X for real inference.

  Local repo: https://github.com/SRKRZ23/repomind
  Hackathon: https://lablab.ai/ai-hackathons/amd-developer
@@ -23,25 +32,42 @@ from ingestion.chunker import ingest_to_json
  from ingestion.cloner import clone


- HEADER_MD = """
+ # ─── Configuration via env vars (Steve Kimoi tutorial pattern) ────────────
+ VLLM_BASE_URL = os.environ.get("VLLM_BASE_URL", "").strip()
+ MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen3-Coder-Next-FP8").strip()
+ LIVE_BACKEND = bool(VLLM_BASE_URL)
+ BACKEND_LABEL = "🟢 Live AMD MI300X" if LIVE_BACKEND else "🟡 Mock backend (CPU-basic, demo mode)"
+ BACKEND_HINT = (
+     f"Connected to vLLM endpoint: `{VLLM_BASE_URL}` · model `{MODEL_NAME}`"
+     if LIVE_BACKEND else
+     "Set the Space secrets `VLLM_BASE_URL` + `MODEL_NAME` to wire a real MI300X backend."
+ )
+
+
+ HEADER_MD = f"""
  # REPOMIND
  **Open-source repo-scale coding agent on AMD MI300X.**

- Ingest a git repository (up to 256K tokens, FP8) on a single GPU and reason across the whole codebase with multi-step tool use.
+ Ingest a git repository (up to 256K tokens, FP8) on a single GPU and
+ reason across the whole codebase with multi-step tool use.

- > 📦 GitHub: [SRKRZ23/repomind](https://github.com/SRKRZ23/repomind)
+ > 📦 GitHub: [SRKRZ23/repomind](https://github.com/SRKRZ23/repomind) · MIT
  > 🏆 Built for the [AMD Developer Hackathon 2026](https://lablab.ai/ai-hackathons/amd-developer)
+ > 🤗 HF Special Prize candidate · 🛑 Conservative claim discipline applied

- ### Why MI300X?
- - Qwen3-Coder-Next-FP8 weights ≈ 80 GB
- - 256K KV cache @ FP8 ≈ 38 GB
- - + activations ≈ 25 GB → **~143 GB total on a single GPU**
- - NVIDIA H100 80GB physically OOMs. AMD MI300X 192GB just runs it.
+ ### Why AMD MI300X (verified 2026-05-05 on real hardware)
+
+ - Qwen3-Coder-Next-FP8 weights = **77.29 GiB** in VRAM (verified)
+ - 256K KV cache @ FP8 = **94.58 GiB** available (2,065,744 tokens, verified)
+ - Activations + framework overhead → peak 176/191.7 GiB ≈ **92% utilization**
+ - NVIDIA H100 80 GB cannot accommodate this on a single card by VRAM
+   accounting (~143 GB > 80 GB); MI300X 192 GB has the headroom

- ### About this Space
- This is the **frontend** demo. Backend defaults to the **mock LLM** so the Space
- runs on CPU-basic without burning GPU credits. Switch to `vllm` and provide a
- base URL once the MI300X endpoint is live.
+ ### Status
+
+ **Backend right now**: {BACKEND_LABEL}
+
+ {BACKEND_HINT}
  """


@@ -75,7 +101,16 @@ def ingest(url_or_path: str, chunk_tokens: int) -> str:
          return f"❌ {type(e).__name__}: {e}"


- def ask(question: str, backend: str, base_url: str, model: str):
+ def _build_llm():
+     """Return an LLM client based on env-var configuration."""
+     if LIVE_BACKEND:
+         from serving.vllm_client import VLLMClient
+         return VLLMClient(base_url=VLLM_BASE_URL, model=MODEL_NAME)
+     from serving.mock_client import MockClient
+     return MockClient(max_tool_turns=2)
+
+
+ def ask(question: str):
      summary_path = SCRATCH_DIR / "active.json"
      if not summary_path.exists():
          return "Ingest a repo first.", ""
@@ -85,91 +120,121 @@ def ask(question: str, backend: str, base_url: str, model: str):
      summary = json.loads(summary_path.read_text())
      repo_root = Path(summary.get("root", "."))

-     # Backend wiring — vLLM only when user explicitly chose it AND a URL is given
-     if backend == "vllm":
-         if not base_url or not base_url.strip():
-             return "vLLM backend selected but no base URL provided.", ""
-         try:
-             from serving.vllm_client import VLLMClient
-             llm = VLLMClient(base_url=base_url.strip(), model=model.strip() or "Qwen/Qwen3-Coder-Next-FP8")
-         except Exception as e:
-             return f"❌ failed to init vLLM client: {e}", ""
-     else:
-         from serving.mock_client import MockClient
-         llm = MockClient(max_tool_turns=2)
+     try:
+         llm = _build_llm()
+     except Exception as e:
+         return f"❌ failed to init LLM client: {type(e).__name__}: {e}", ""

      from agent.loop import Agent
      from tools.registry import default_registry

      try:
-         agent = Agent(llm=llm, tools=default_registry(repo_root, scratch_dir=SCRATCH_DIR / "scratch"), max_steps=4)
+         agent = Agent(
+             llm=llm,
+             tools=default_registry(repo_root, scratch_dir=SCRATCH_DIR / "scratch"),
+             max_steps=4,
+         )
          result = agent.run(question, summary)
      except Exception as e:
          return f"❌ agent failed: {type(e).__name__}: {e}", ""

-     trace_lines = [f"- {tc['name']} {json.dumps(tc['arguments'], ensure_ascii=False)}" for tc in result.tool_calls]
+     trace_lines = [
+         f"- {tc['name']} {json.dumps(tc['arguments'], ensure_ascii=False)}"
+         for tc in result.tool_calls
+     ]
      trace = "\n".join(trace_lines) or "(no tool calls)"
      return result.answer, trace


- with gr.Blocks(title="REPOMIND — repo-scale coding agent on AMD MI300X", theme=gr.themes.Soft()) as demo:
+ with gr.Blocks(
+     title="REPOMIND — repo-scale coding agent on AMD MI300X",
+     theme=gr.themes.Soft(primary_hue="red", secondary_hue="gray"),
+ ) as demo:
      gr.Markdown(HEADER_MD)

      with gr.Tab("1. Ingest"):
+         gr.Markdown(
+             "Paste any **GitHub URL** or `owner/repo` shorthand. "
+             "REPOMIND clones it, parses the source files, and chunks them "
+             "into priority-ranked sections (README first, then top-level "
+             "symbols, then nested code, then tests)."
+         )
          with gr.Row():
              url = gr.Textbox(
                  label="GitHub URL or owner/repo",
-                 placeholder="https://github.com/torvalds/linux OR pallets/flask",
+                 placeholder="https://github.com/pallets/flask OR pallets/flask",
                  scale=4,
              )
-             chunk_tokens = gr.Slider(256, 4096, value=1024, step=128, label="Tokens / chunk", scale=1)
+             chunk_tokens = gr.Slider(
+                 256, 4096, value=1024, step=128, label="Tokens / chunk", scale=1
+             )
          ingest_btn = gr.Button("Ingest", variant="primary")
          ingest_out = gr.Code(label="Ingestion summary", language="json")
          ingest_btn.click(ingest, [url, chunk_tokens], ingest_out)

+         gr.Markdown(
+             "**Examples that work on a single MI300X**: "
+             "`pallets/flask` (~408K tokens, fits in the 256K window with priority chunking) · "
+             "`pytorch/vision` (~1.3M tokens, trimmed to 180K of highest-priority "
+             "content via the chunker) · this repo `SRKRZ23/repomind` (~68K tokens, fits whole)."
+         )
+
      with gr.Tab("2. Ask"):
-         with gr.Row():
-             backend = gr.Radio(
-                 choices=["mock (offline demo)", "vllm (live MI300X)"],
-                 value="mock (offline demo)",
-                 label="Backend",
-                 scale=1,
-             )
-             base_url = gr.Textbox(
-                 label="vLLM base URL (only used in `vllm` mode)",
-                 value="",
-                 placeholder="http://your-mi300x-host:8000/v1",
-                 scale=2,
-             )
-             model = gr.Textbox(
-                 label="Model id",
-                 value="Qwen/Qwen3-Coder-Next-FP8",
-                 scale=2,
-             )
+         gr.Markdown(
+             f"Ask any question about the ingested repo. The agent runs an "
+             f"SC-TIR loop (PLAN → CALL TOOL → OBSERVE → THINK → ANSWER) with "
+             f"five tools: `read_file`, `grep_codebase`, `execute_code` "
+             f"(sandboxed), `run_tests`, `git_log`.\n\n"
+             f"**Backend**: {BACKEND_LABEL}"
+         )
          question = gr.Textbox(
              label="Question",
              lines=3,
-             placeholder="What does the chunker prioritize? Where is authentication handled?",
+             placeholder=(
+                 "Where is the WSGI entry point? · "
+                 "What does the chunker prioritize? · "
+                 "Trace one slab allocation through the call graph."
+             ),
          )
          ask_btn = gr.Button("Ask", variant="primary")
          answer = gr.Markdown(label="Answer")
-         tool_trace = gr.Code(label="Tool trace", language="markdown")
-
-         # normalize backend selector to internal value
-         def _ask(q, b, u, m):
-             internal = "vllm" if b.startswith("vllm") else "mock"
-             return ask(q, internal, u, m)
-
-         ask_btn.click(_ask, [question, backend, base_url, model], [answer, tool_trace])
+         tool_trace = gr.Code(label="Tool trace (agent steps)", language="markdown")
+
+         ask_btn.click(ask, [question], [answer, tool_trace])
+
+     with gr.Tab("3. Verified evidence"):
+         gr.Markdown(
+             "REPOMIND was stress-tested on a real AMD MI300X x1 droplet across "
+             "two sessions (**2026-05-05 / 2026-05-06**, 124 min total, $4.12). "
+             "Highlights:\n\n"
+             "| Test | Result |\n"
+             "|---|---|\n"
+             "| Memory peak | 176/191.7 GiB (92%) |\n"
+             "| `--max-model-len 262144` | started clean |\n"
+             "| Concurrency 8K / 16K / 32K / 64K @ N=31 | **31/31 success at every context** ✅ |\n"
+             "| Concurrency 128K @ N=31 | 25/31 (6 timeouts past 15 min) |\n"
+             "| Long-context needle at 200K | **3/3** pass (early/middle/late) |\n"
+             "| End-to-end repo Q&A | **9/9** correct across 3 repos |\n"
+             "| Largest repo tested | **pytorch/vision (1.3M tokens)** |\n"
+             "| Tuning attempt: AITER backend | regression — 137/144 cells broken under FP8 KV cache; default Triton stays production-safe |\n"
+             "| Cost | $1.99/hr cloud, $45.75/1M completion tokens |\n\n"
+             "Full evidence pack — JSON results, plots, raw model outputs — "
+             "is at [github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test]"
+             "(https://github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test). "
+             "Extended PHASE 1+2 narrative + AITER A/B in [extended/SUMMARY.md]"
+             "(https://github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test/extended)."
+         )

      gr.Markdown(
          "---\n"
          "**Author:** [Sardor Razikov](https://huggingface.co/ZeroR3) · "
          "[GitHub](https://github.com/SRKRZ23) · "
          "[lablab.ai](https://lablab.ai/u/@Sardor_R) · "
-         "[Zenodo (ECB)](https://doi.org/10.5281/zenodo.19791329)"
+         "[Zenodo (ECB)](https://doi.org/10.5281/zenodo.19791329) · "
+         "Tashkent 🇺🇿\n\n"
+         "*If the MI300X memory-architecture story resonates, "
+         "**a like on this Space helps with the Hugging Face Special Prize judging.** 🤗*"
      )


  if __name__ == "__main__":
      demo.launch()
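
To flip the Space into live mode, `VLLM_BASE_URL` and `MODEL_NAME` are set under Settings → Variables and secrets, as the new docstring describes. A quick pre-flight check of the endpoint might look like the sketch below; it is not part of the repo and assumes only the OpenAI-compatible `/v1/models` route (and the `max_model_len` field) that the README above already cites.

```python
import os

import requests

# Hedged pre-flight check (illustrative, not part of the repo): confirm the
# vLLM endpoint answers on /v1/models before pointing the Space at it.
base_url = os.environ.get("VLLM_BASE_URL", "http://<your-droplet-ip>:8000/v1").rstrip("/")
model_name = os.environ.get("MODEL_NAME", "Qwen/Qwen3-Coder-Next-FP8")

resp = requests.get(f"{base_url}/models", timeout=10)
resp.raise_for_status()
served = {m["id"]: m.get("max_model_len") for m in resp.json()["data"]}
print(served)  # README reports max_model_len 262144 for the MI300X droplet

assert model_name in served, f"{model_name} is not served at {base_url}"
```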