---
title: REPOMIND
emoji: 🧠
colorFrom: indigo
colorTo: red
sdk: gradio
sdk_version: 6.14.0
python_version: '3.13'
app_file: app.py
pinned: false
license: mit
short_description: Repo-scale coding agent, 256K context on one MI300X
tags:
- amd-hackathon-2026
- amd-developer-hackathon
- agents
- coding-agent
- long-context
- rocm
- mi300x
- qwen3-coder
- vllm
---

# REPOMIND
|
|
> Open-source, self-hosted repo-scale coding agent. It ingests an entire git repo (256K tokens, FP8) and reasons across it on a single AMD MI300X, a configuration an NVIDIA H100 80 GB cannot accommodate by VRAM accounting (~143 GB total > 80 GB).
|
|
**Built for the [AMD Developer Hackathon 2026](https://lablab.ai/ai-hackathons/amd-developer)** · MIT License · [GitHub source](https://github.com/SRKRZ23/repomind)
|
|
## Why MI300X?
|
|
- Qwen3-Coder-Next-FP8 weights ≈ 80 GB
- 256K KV cache @ FP8 ≈ 38 GB
- activations ≈ 25 GB → **~143 GB total on a single GPU**
- An NVIDIA H100 80 GB cannot hold this configuration on one card by VRAM accounting (~143 GB > 80 GB); the AMD MI300X's 192 GB has the headroom.
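The total is a straight sum of the three estimates above; as a quick sanity check (the component numbers are the estimates from the bullets, not measurements):

```python
# Back-of-envelope VRAM budget (GB), using the estimates in the bullets above.
weights_gb = 80       # Qwen3-Coder-Next-FP8 weights
kv_cache_gb = 38      # 256K-token KV cache at FP8
activations_gb = 25   # activation working set

total_gb = weights_gb + kv_cache_gb + activations_gb
print(total_gb)          # 143: over an H100's 80 GB
print(total_gb <= 192)   # True: inside the MI300X's 192 GB
```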
|
|
This is a memory-architecture story, not a CUDA-vs-ROCm one.
|
|
## Stack
|
|
- **Model**: `Qwen/Qwen3-Coder-Next-FP8` (80B params, 3B active, MoE)
- **Inference**: vLLM ROCm 7 with the `qwen3_coder` tool-call parser
- **Agent loop**: SC-TIR style (PLAN → CALL TOOL → OBSERVE → THINK → ANSWER)
- **Tools**: `read_file` · `grep_codebase` · `execute_code` (sandboxed) · `run_tests` · `git_log`
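The agent loop can be sketched as a simple dispatch cycle. The tool names below match the list above, but the handler bodies, message format, and stop condition are illustrative stand-ins, not REPOMIND's actual implementation:

```python
# Minimal SC-TIR-style loop: PLAN -> CALL TOOL -> OBSERVE -> THINK -> ANSWER.
# Tool names mirror the list above; the bodies are placeholder stubs.
TOOLS = {
    "read_file":     lambda path: f"<contents of {path}>",
    "grep_codebase": lambda pattern: f"<matches for {pattern}>",
}

def agent_loop(llm, question, max_steps=8):
    transcript = [f"QUESTION: {question}"]
    for _ in range(max_steps):
        step = llm("\n".join(transcript))   # PLAN / THINK happen inside the model
        if step["action"] == "answer":      # ANSWER ends the loop
            return step["content"]
        tool = TOOLS[step["action"]]        # CALL TOOL
        observation = tool(step["argument"])
        transcript.append(f"OBSERVATION: {observation}")  # OBSERVE, then loop
    return "max steps exceeded"

# Scripted stand-in for the model: one tool call, then an answer.
script = iter([
    {"action": "read_file", "argument": "app.py"},
    {"action": "answer", "content": "app.py defines the Gradio UI"},
])
print(agent_loop(lambda prompt: next(script), "What does app.py do?"))
```

Keeping the transcript as plain strings keeps the sketch model-agnostic; the real loop feeds it through the `qwen3_coder` tool-call parser instead.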
|
|
## Status: verified on a real MI300X (2026-05-05 / 2026-05-06)
|
|
Full stress test on a single AMD MI300X (AMD Developer Cloud, $1.99/hr, vLLM 0.17.1 + ROCm 7.2 Quick Start image). **2 sessions, 124 min total, ~$4.12.**
|
|
**Memory budget (Qwen3-Coder-Next-FP8 + 256K context, FP8 KV cache):**
- ✅ Model weights in VRAM: **77.29 GiB**
- ✅ Available KV cache: **94.58 GiB** (2,065,744 tokens)
- ✅ VRAM peak: **176 GiB / 191.7 GiB** (92% utilization)
- ✅ `--max-model-len 262144` started, `Application startup complete`
- ✅ `/v1/models` returns `max_model_len: 262144`
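Checking the served window from a client is a single field lookup. The snippet below parses the response shape of vLLM's OpenAI-compatible `/v1/models` endpoint; the values mirror the run above, and the endpoint URL is deployment-specific, so the HTTP fetch itself is elided:

```python
import json

# Shape of a vLLM OpenAI-compatible /v1/models response (fetch it with any
# HTTP client: GET http://<host>:8000/v1/models). Values mirror the run above.
body = json.loads("""
{"object": "list",
 "data": [{"id": "Qwen/Qwen3-Coder-Next-FP8",
           "object": "model",
           "max_model_len": 262144}]}
""")

max_len = body["data"][0]["max_model_len"]
print(max_len)  # 262144: the full 256K window is being served
```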
|
|
**Concurrency stress (24 cells, default Triton attention, all 144 outputs clean):**
- ✅ **31/31 success at 8K, 16K, 32K, AND 64K**: every realistic developer context
- ✅ **25/31 at 128K**, **6-8 at 256K** within a 15-minute window (compute-bound, an honest ceiling)
- ✅ Aggregate throughput at N=31: 78.5 tok/s @ 8K · 31.4 @ 16K · 12.1 @ 32K · 3.6 @ 64K
|
|
**Long-context coherence (needle-in-haystack at 200K):**
- ✅ **3/3 positions passed** (early, middle, late): the model recovers the embedded sentinel function and constant
- ✅ This shows the 256K window is *usable*, not just *allocated*
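A minimal harness for that test plants a sentinel at a fractional depth in filler code and asks the model to recover it. The sentinel, filler, and depth logic below are illustrative, and the model call itself is stubbed out:

```python
# Needle-in-haystack harness sketch: plant a sentinel function at a chosen
# fractional depth in filler code, then (in the real test) ask the model
# to recover its constant. Names and filler here are illustrative.
SENTINEL = "def _repomind_sentinel():\n    return 0xC0FFEE\n"

def build_haystack(n_chunks, depth):
    """depth in [0, 1]: 0 plants the needle early, 1 plants it late."""
    chunks = [f"def helper_{i}():\n    return {i}\n" for i in range(n_chunks)]
    chunks.insert(int(depth * n_chunks), SENTINEL)
    return "\n".join(chunks)

haystack = build_haystack(n_chunks=1000, depth=0.5)
# The real test sends `haystack` plus "What does _repomind_sentinel return?"
# to the model; here we only confirm the needle landed mid-document.
pos = haystack.find("_repomind_sentinel") / len(haystack)
print(0.4 < pos < 0.6)  # True
```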
|
|
**End-to-end repo ingestion: 9/9 questions answered correctly:**
- ✅ REPOMIND itself (68K tokens, 68 files): 3/3
- ✅ pallets/flask (408K tokens total → fitted to 180K): 3/3
- ✅ **pytorch/vision (1.3M tokens, 581 files, 6,799 chunks → fitted to 180K): 3/3** with correct file-path citations
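Fitting a 1.3M-token repo into a 180K window means selecting chunks under a token budget. The greedy selector below sketches the idea; REPOMIND's actual ranking and fitting logic may differ:

```python
# Greedy budget-fitting sketch: keep the highest-scoring chunks that fit.
def fit_to_budget(chunks, budget_tokens):
    """chunks: (relevance_score, n_tokens, text) tuples, higher score first."""
    selected, used = [], 0
    for score, n_tokens, text in sorted(chunks, key=lambda c: -c[0]):
        if used + n_tokens <= budget_tokens:
            selected.append(text)
            used += n_tokens
    return selected, used

chunks = [
    (0.9, 100_000, "core module"),
    (0.8, 70_000,  "tests"),
    (0.3, 50_000,  "docs"),
    (0.7, 10_000,  "config"),
]
picked, used = fit_to_budget(chunks, budget_tokens=180_000)
print(picked, used)  # docs dropped; 180000 tokens used
```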
|
|
**Tuning attempt: a measured regression worth reporting:**
- ⚠️ Tried `--attention-backend ROCM_AITER_FA` (AMD's hand-tuned MI300X kernels)
- Throughput was **2-4× higher** under AITER, with TTFT 2.8× faster at 64K
- BUT output **degenerates into repeating-punctuation gibberish** in 137/144 cells under the FP8 KV cache
- Default Triton remains the production-safe choice; filed for AMD upstream investigation
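Degenerate outputs of that kind can be flagged automatically. The heuristic below is illustrative (not the exact check used in the benchmark): it flags text dominated by punctuation or by long single-character runs:

```python
import string

# Heuristic degeneration check: flag outputs that collapse into
# repeated punctuation or long single-character runs.
def looks_degenerate(text, punct_ratio=0.5, repeat_run=20):
    if not text:
        return True
    punct = sum(ch in string.punctuation for ch in text)
    if punct / len(text) > punct_ratio:   # mostly punctuation
        return True
    for ch in set(text):                  # any char repeated 20+ times in a row
        if ch * repeat_run in text:
            return True
    return False

print(looks_degenerate("def add(a, b):\n    return a + b"))  # False
print(looks_degenerate("!!!!" * 50))                          # True
```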
|
|
**Cost (at AMD Cloud's $1.99/hr):**
- ✅ ~$45.75 / 1M completion tokens (aggregate at 32K, N=31)
- ✅ 14.5 continuously active queriers per MI300X, or 70-140 dev seats for typical bursty engineering teams
- ✅ An owned MI300X (~$18K) breaks even vs Cursor in 3-6 months at team-of-100 usage
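The $/1M-token figure follows from the hourly rate and the 32K aggregate throughput reported above; reproducing the arithmetic:

```python
# Cost per 1M completion tokens from the hourly rate and the measured
# aggregate throughput at 32K context, N=31 (figures from this README).
rate_per_hr = 1.99        # USD per MI300X hour (AMD Developer Cloud)
throughput_tok_s = 12.1   # aggregate completion tok/s at 32K, N=31

tokens_per_hr = throughput_tok_s * 3600
cost_per_1m = rate_per_hr / tokens_per_hr * 1_000_000
print(round(cost_per_1m, 2))  # ~45.68, consistent with the ~$45.75 above
```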
|
|
This Space currently runs on CPU-basic with the **mock LLM backend**, because keeping a paid MI300X droplet up 24/7 for sporadic visitors is uneconomical. **The final demo wires to a live MI300X endpoint** during the judging window.
|
|
The full evidence pack (7 JSON results + 5 PNG plots + e2e prompts/answers + 2× rocm-smi snapshots + run logs) is in the repo:
[github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test](https://github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test)
Extended PHASE 1+2 narrative (24-cell matrix + AITER A/B): [extended/SUMMARY.md](https://github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test/extended).
|
|
If the MI300X memory-architecture pitch resonates, **a like on this Space helps us with the Hugging Face Special Prize judging** 🤗
|
|
## Author
|
|
**Sardor Razikov**, Independent ML Engineer · Tashkent 🇺🇿
- Kaggle SPR 2026 #7/371 (Top 1.9%) · S6E3 #23/4,142 · AIMO3 39/50 (XTX $2.2M)
- [Epistemic Curie Benchmark on Zenodo](https://doi.org/10.5281/zenodo.19791329)
- [GitHub](https://github.com/SRKRZ23/repomind) · [LinkedIn](https://www.linkedin.com/in/sardor-razikov-569a5327b) · [X / Twitter](https://x.com/SardorRazi99093)
- Email: razikovsardor1@gmail.com · razikovs777@gmail.com
|
|