---
title: REPOMIND
emoji: 🧠
colorFrom: indigo
colorTo: red
sdk: gradio
sdk_version: 6.14.0
python_version: '3.13'
app_file: app.py
pinned: false
license: mit
short_description: Repo-scale coding agent - 256K context on a single MI300X
tags:
  - amd-hackathon-2026
  - amd-developer-hackathon
  - agents
  - coding-agent
  - long-context
  - rocm
  - mi300x
  - qwen3-coder
  - vllm
---

# REPOMIND

> Open-source repo-scale coding agent for self-hosted use. Designed to ingest an entire git repo (256K tokens, FP8) and reason across it on a single AMD MI300X, a configuration an NVIDIA H100 80GB cannot hold by VRAM accounting (~143 GB total > 80 GB).

**Built for the [AMD Developer Hackathon 2026](https://lablab.ai/ai-hackathons/amd-developer)** · MIT License · [GitHub source](https://github.com/SRKRZ23/repomind)

## Why MI300X?

- Qwen3-Coder-Next-FP8 weights ≈ 80 GB
- 256K KV cache @ FP8 ≈ 38 GB
- Activations ≈ 25 GB → **~143 GB total on a single GPU**
- An NVIDIA H100 80GB cannot hold this configuration on a single card (~143 GB > 80 GB); the AMD MI300X's 192 GB has the headroom.

This is a memory-architecture story, not a CUDA-vs-ROCm one; the back-of-envelope arithmetic is sketched below.
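
A quick sanity check on that budget, as a minimal Python sketch. The `layers` / `kv_heads` / `head_dim` values are placeholders calibrated to reproduce the headline ≈38 GB KV figure, not the published Qwen3-Coder-Next architecture; substitute the real values from the model's `config.json`.

```python
# Back-of-envelope VRAM budget for the list above.
# NOTE: layers / kv_heads / head_dim are PLACEHOLDERS calibrated to the
# headline ~38 GB KV number, not the real Qwen3-Coder-Next config.
GiB = 1024**3

weights_gib = 80                    # FP8 weights: ~1 byte/param for an 80B MoE
activations_gib = 25                # rough engineering allowance
ctx_tokens = 262_144                # --max-model-len 262144

# FP8 KV cache: 2 tensors (K and V) x layers x kv_heads x head_dim x 1 byte
layers, kv_heads, head_dim = 38, 16, 128          # placeholders
kv_per_token = 2 * layers * kv_heads * head_dim   # 155,648 bytes/token
kv_gib = kv_per_token * ctx_tokens / GiB          # ~38.0 GiB

total_gib = weights_gib + kv_gib + activations_gib
print(f"KV {kv_gib:.1f} GiB, total {total_gib:.1f} GiB")  # ~143 GiB
# 143 GiB > 80 GB (H100), but fits a 192 GB MI300X with headroom.
```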

## Stack

- **Model**: `Qwen/Qwen3-Coder-Next-FP8` (80B params, 3B active, MoE)
- **Inference**: vLLM ROCm 7 with `qwen3_coder` tool-call parser
- **Agent loop**: SC-TIR style (PLAN → CALL TOOL → OBSERVE → THINK → ANSWER); a minimal sketch follows this list
- **Tools**: `read_file` · `grep_codebase` · `execute_code` (sandboxed) · `run_tests` · `git_log`
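
A minimal sketch of that agent loop against vLLM's OpenAI-compatible API. The endpoint URL, the single tool schema, and the `grep_codebase` stub are illustrative assumptions, not REPOMIND's shipped implementation; the other four tools follow the same shape.

```python
# Minimal SC-TIR-style loop: PLAN -> CALL TOOL -> OBSERVE -> THINK -> ANSWER.
# Endpoint, model id, and tool schema are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "grep_codebase",
        "description": "Search the ingested repo for a regex pattern.",
        "parameters": {
            "type": "object",
            "properties": {"pattern": {"type": "string"}},
            "required": ["pattern"],
        },
    },
}]

def grep_codebase(pattern: str) -> str:
    """Stand-in for the real sandboxed repo search."""
    return f"(matches for {pattern!r} would appear here)"

TOOL_IMPLS = {"grep_codebase": grep_codebase}

messages = [{"role": "user", "content": "Where is the retry logic implemented?"}]
while True:
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-Coder-Next-FP8", messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:                      # THINK produced a final ANSWER
        print(msg.content)
        break
    for call in msg.tool_calls:                 # CALL TOOL -> OBSERVE
        args = json.loads(call.function.arguments)
        result = TOOL_IMPLS[call.function.name](**args)
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": result})
```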

## Status: verified on a real MI300X (2026-05-05 / 2026-05-06)

Full stress test on a single AMD MI300X (AMD Developer Cloud, $1.99/hr, vLLM 0.17.1 + ROCm 7.2 Quick Start image). **2 sessions, 124 min total, ~$4.12.**

**Memory budget (Qwen3-Coder-Next-FP8 + 256K context, FP8 KV cache):**
- ✅ Model weights in VRAM: **77.29 GiB**
- ✅ Available KV cache: **94.58 GiB** (2,065,744 tokens)
- ✅ VRAM peak: **176 GiB / 191.7 GiB** (92% utilization)
- ✅ Server started with `--max-model-len 262144`; logs show `Application startup complete`
- ✅ `/v1/models` returns `max_model_len: 262144`

**Concurrency stress (24 cells, default Triton attention, all 144 outputs clean; harness sketched below):**
- ✅ **31/31 success at 8K, 16K, 32K, and 64K**, i.e. every realistic developer context
- ✅ **25/31 at 128K**, **6–8 at 256K** within a 15-minute window (compute-bound; an honest ceiling)
- ✅ Aggregate throughput at N=31: 78.5 tok/s @ 8K · 31.4 @ 16K · 12.1 @ 32K · 3.6 @ 64K
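
The harness behind those numbers is, in shape, just N concurrent completions racing a fixed wall clock. A sketch under stated assumptions: the payload is the standard vLLM `/v1/completions` schema, but the padded prompt is a crude stand-in for the real benchmark cells.

```python
# Shape of a concurrency stress cell: N=31 identical requests at one
# context length, counting successes inside a 15-minute window.
import asyncio
import httpx

URL = "http://localhost:8000/v1/completions"

async def one_request(client: httpx.AsyncClient, prompt: str) -> str:
    r = await client.post(URL, json={
        "model": "Qwen/Qwen3-Coder-Next-FP8",
        "prompt": prompt,
        "max_tokens": 256,
    })
    r.raise_for_status()
    return r.json()["choices"][0]["text"]

async def stress(ctx_tokens: int, n: int = 31, window_s: int = 900) -> None:
    prompt = "x " * ctx_tokens              # ~1 token per "x "; crude filler
    async with httpx.AsyncClient(timeout=None) as client:
        tasks = [asyncio.create_task(one_request(client, prompt))
                 for _ in range(n)]
        done, pending = await asyncio.wait(tasks, timeout=window_s)
        for t in pending:
            t.cancel()                      # past the window: counts as a miss
        ok = sum(1 for t in done if t.exception() is None)
        print(f"{ctx_tokens} ctx: {ok}/{n} within {window_s}s")

asyncio.run(stress(8_192))
```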

**Long-context coherence (needle-in-haystack at 200K; probe sketched below):**
- ✅ **3/3 positions passed** (early, middle, late): the model recovers the embedded sentinel function and constant
- ✅ This proves the 256K window is *usable*, not just *allocated*
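
The probe itself is simple to reproduce. A sketch, where the sentinel function, the filler unit, and the rough tokens-per-unit estimate are all illustrative choices:

```python
# Needle-in-haystack probe: bury a sentinel at a chosen depth in synthetic
# filler code, then check the model's answer recovers its name and constant.
NEEDLE = "def __repomind_sentinel__():\n    return 0xC0FFEE\n\n"

def build_haystack(total_tokens: int, depth: float) -> str:
    filler = "def pad():\n    return 1\n\n"        # roughly 10 tokens apiece
    units = [filler] * (total_tokens // 10)
    units.insert(int(len(units) * depth), NEEDLE)  # 0.0 early ... 1.0 late
    return "".join(units)

for depth in (0.1, 0.5, 0.9):                      # early, middle, late
    prompt = (build_haystack(200_000, depth)
              + "\nWhat does __repomind_sentinel__ return? Name the constant.")
    # send `prompt` to the 256K endpoint; pass iff "0xC0FFEE" is in the reply
```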

**End-to-end repo ingestion (9/9 questions answered correctly; packing step sketched below):**
- ✅ REPOMIND itself (68K tokens, 68 files): 3/3
- ✅ pallets/flask (408K tokens total → fitted into 180K): 3/3
- ✅ **pytorch/vision (1.3M tokens, 581 files, 6,799 chunks → fitted into 180K): 3/3** with correct file-path citations
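
The fitting step is, in shape, chunk-rank-pack: split every file into chunks, rank chunks against the question, and greedily fill the 180K budget while keeping file paths for citations. A sketch, where the keyword-overlap `score` and the chars/4 token estimate are naive stand-ins for the real heuristics:

```python
# Shape of the "1.3M tokens -> 180K context" step: chunk, rank, greedily pack.
# score() and n_tokens() are naive stand-ins for the real heuristics.
from pathlib import Path

BUDGET = 180_000

def n_tokens(text: str) -> int:
    return len(text) // 4                     # crude chars/4 estimate

def chunk_file(path: Path, max_lines: int = 120):
    lines = path.read_text(errors="ignore").splitlines()
    for i in range(0, len(lines), max_lines):
        yield path, "\n".join(lines[i:i + max_lines])

def score(chunk: str, question: str) -> int:
    text = chunk.lower()
    return sum(text.count(w) for w in question.lower().split())

def fit_repo(root: Path, question: str) -> str:
    chunks = [c for p in root.rglob("*.py") for c in chunk_file(p)]
    chunks.sort(key=lambda pc: score(pc[1], question), reverse=True)
    packed, used = [], 0
    for path, text in chunks:
        cost = n_tokens(text)
        if used + cost <= BUDGET:
            packed.append(f"# file: {path}\n{text}")  # keep paths for citations
            used += cost
    return "\n\n".join(packed)
```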

**Tuning attempt (a measured regression worth reporting):**
- ⚠️ Tried `--attention-backend ROCM_AITER_FA` (AMD's hand-tuned MI300X kernels)
- Throughput was **2–4× higher** under AITER, and TTFT 2.8× faster at 64K
- But output **degenerates to repeating-punctuation gibberish** in 137/144 cells under the FP8 KV cache
- Default Triton remains the production-safe choice; filed for AMD upstream investigation

**Cost (at AMD Cloud $1.99/hr; arithmetic below):**
- ✅ ~$45.75 / 1M completion tokens (aggregate at 32K, N=31)
- ✅ 14.5 active continuous queriers per MI300X, or 70–140 dev seats for typical bursty engineering teams
- ✅ An owned MI300X ($18K) breaks even vs. Cursor in 3–6 months at team-of-100 usage
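
The arithmetic behind the first number, with the break-even line computed against an assumed $30/seat/month tooling price (actual Cursor tiers vary):

```python
# Worked cost arithmetic from the measured 32K aggregate throughput.
price_per_hr = 1.99
tok_per_s = 12.1                           # aggregate at 32K, N=31 (above)

tokens_per_hr = tok_per_s * 3600           # ~43,560 completion tokens/hr
usd_per_1m = price_per_hr / tokens_per_hr * 1e6
print(f"${usd_per_1m:.2f} / 1M completion tokens")  # ~$45.7, i.e. the ~$45.75 headline

# Break-even for an owned card, ASSUMING $30/seat/month alternative tooling.
card_usd, seats, seat_usd_mo = 18_000, 100, 30
print(f"break-even: {card_usd / (seats * seat_usd_mo):.0f} months")  # 6 months
```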

This Space currently runs on CPU-basic hardware with the **mock LLM backend**, because keeping a paid MI300X droplet up 24/7 for sporadic visitors is uneconomical. **The final demo wires up to a live MI300X endpoint** during the judging window.

Full evidence pack (7 JSON results + 5 PNG plots + e2e prompts/answers + 2× rocm-smi snapshots + run logs) is in the repo:
[github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test](https://github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test)
Extended PHASE 1+2 narrative (24-cell matrix + AITER A/B): [extended/SUMMARY.md](https://github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test/extended).

If the MI300X memory-architecture pitch resonates, **a like on this Space helps us with the Hugging Face Special Prize judging** 🤗

## Author

**Sardor Razikov**, Independent ML Engineer · Tashkent 🇺🇿
- Kaggle SPR 2026 #7/371 (Top 1.9%) · S6E3 #23/4,142 · AIMO3 39/50 (XTX $2.2M)
- [Epistemic Curie Benchmark on Zenodo](https://doi.org/10.5281/zenodo.19791329)
- [GitHub](https://github.com/SRKRZ23/repomind) · [LinkedIn](https://www.linkedin.com/in/sardor-razikov-569a5327b) · [X / Twitter](https://x.com/SardorRazi99093)
- Email: razikovsardor1@gmail.com · razikovs777@gmail.com