Anyone running this on AMD MI300X / vLLM ROCm 7 at 256K context?
Hi everyone - building an open-source repo-scale coding agent (REPOMIND) on top of Qwen3-Coder-Next-FP8 for the AMD Developer Hackathon. Submission May 11, MIT licensed.
The whole architecture relies on the MI300X's 192 GB single-GPU memory advantage: load 256K tokens of code plus the KV cache on one card, a configuration that physically can't fit on an 80 GB H100 even at FP8.
Two questions for the community:
1. Has anyone here actually run vLLM ROCm 7 with --tool-call-parser qwen3_coder at >128K context length? Any pitfalls before I burn AMD Cloud credits?
2. For long-context tool-calling, what's the recommended --max-model-len / --kv-cache-dtype combination on MI300X? I see Day-0 ROCm support announced, but no community reports yet at 256K specifically. (I've sketched the launch config I'm planning right below.)
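For context, this is roughly the launch configuration I plan to try. The model id, port, and memory fraction are placeholders, and everything beyond the two flags I'm asking about is just my best reading of the vLLM docs, not a verified recipe:

```python
# Sketch of the planned vLLM launch on a single MI300X (flag values are assumptions, not a verified recipe).
import subprocess

cmd = [
    "vllm", "serve", "Qwen/Qwen3-Coder-Next-FP8",  # placeholder model id
    "--max-model-len", "262144",                   # 256K tokens per request
    "--kv-cache-dtype", "fp8",                     # FP8 KV cache to make 256K fit in 192 GB
    "--tool-call-parser", "qwen3_coder",           # the parser this question is about
    "--enable-auto-tool-choice",                   # needed so the server emits structured tool calls
    "--gpu-memory-utilization", "0.95",            # leave headroom for activations
    "--port", "8000",
]
subprocess.run(cmd, check=True)
```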
The agent uses an SC-TIR loop (PLAN → CALL → OBSERVE → THINK → ANSWER) with 5 tools (read_file, grep, sandboxed exec, run_tests, git_log). Will publish benchmarks (H100 OOMs where the MI300X works) once credits land.
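For anyone curious what that loop looks like in practice, here is a stripped-down sketch. It is not the repo's actual code: call_model stands in for the vLLM chat endpoint and the tool bodies are stubs.

```python
# Minimal sketch of the SC-TIR-style state machine (PLAN -> CALL -> OBSERVE -> THINK -> ANSWER).
# call_model() stands in for the vLLM chat endpoint; tool bodies are stubs, not the repo's code.
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {
    "read_file": lambda path: open(path).read(),
    "grep":      lambda pattern, path=".": f"(matches for {pattern!r} under {path})",
    "exec":      lambda code: "(sandboxed execution output)",
    "run_tests": lambda target=".": "(test runner summary)",
    "git_log":   lambda n=10: f"(last {n} commits)",
}

def run_agent(task: str, call_model, max_steps: int = 8) -> str:
    transcript = [("user", task)]
    for _ in range(max_steps):
        step = call_model(transcript)                        # PLAN / THINK happen inside the model
        if step["action"] == "answer":                       # ANSWER: model is done, loop terminates
            return step["content"]
        observation = TOOLS[step["tool"]](**step["args"])    # CALL: dispatch to the tool registry
        transcript.append(("tool", observation))             # OBSERVE: result feeds the next THINK step
    return "step budget exhausted"
```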
Repo: https://github.com/SRKRZ23/repomind
HF Space: https://huggingface.co/spaces/ZeroR3/repomind
Thanks - and huge respect to the Qwen team for FP8 release + Day-0 ROCm support.
Quick update - the HF Space has been moved to the official AMD Developer Hackathon org:
https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/repomind
Likes there contribute to the HF Special Prize judging 🤗
Quick update - smoke-tested vLLM 0.17.1 + ROCm 7.2 Quick Start image with Qwen3-Coder-Next-FP8 on a single AMD MI300X (192 GB) yesterday.
Verified:
- max_model_len 262144 (256K) starts cleanly ("Application startup complete" in the logs)
- 77.29 GiB weights + 95.26 GiB KV cache available at 256K config
- 31.31× max concurrency at 256K context per request (quick math on what this means after the list)
- Cold start ~3.5 min (with model download), warm restart ~1.5 min
- Generation throughput: 30 tok/s at 8K config (warm)
- Real Python code generation verified through /v1/chat/completions (repro snippet below)
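Reading the weights/KV line and the concurrency line together gives a rough per-token cost. This is back-of-envelope only, derived purely from the numbers above:

```python
# Back-of-envelope: what "31.31x max concurrency at 256K" implies per token of FP8 KV cache.
kv_cache_gib  = 95.26      # KV cache pool reported by vLLM at the 256K config
max_model_len = 262_144    # tokens per request
concurrency   = 31.31      # vLLM's reported maximum concurrency at that length

total_kv_tokens = concurrency * max_model_len                  # ~8.2M tokens the pool can hold
kib_per_token   = kv_cache_gib * 1024**2 / total_kv_tokens
print(f"~{kib_per_token:.1f} KiB of KV cache per token")       # ~12 KiB/token on this setup
```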
Full evidence (rocm-smi, vLLM logs, JSON responses):
github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-smoke-test
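If anyone wants to repeat the endpoint check, it is nothing more exotic than this; base URL, model id, and prompt are placeholders for whatever your server reports:

```python
# Quick client-side check of the /v1/chat/completions endpoint (base URL, model id, and prompt are placeholders).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM's OpenAI-compatible server

resp = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-Next-FP8",   # use the model id your server actually serves
    messages=[{"role": "user", "content": "Write a Python function that parses a git log into dicts."}],
    max_tokens=512,
    temperature=0.2,
)
print(resp.choices[0].message.content)
print(resp.usage)   # token counts, handy for throughput math
```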
Huge thanks to the Qwen team - Day-0 ROCm support plus the FP8 release made this possible without manual quantization. The qwen3_coder tool-call parser gets wired in next for the agentic loop (SC-TIR-style, adapted from AIMO3 math).
Final update - REPOMIND submission for the AMD Developer Hackathon 2026 just landed: lablab.ai/ai-hackathons/amd-developer/repomind/repomind
Full verified results on Qwen3-Coder-Next-FP8 + single MI300X + vLLM 0.17.1 + ROCm 7.2 (124 min, $4.12 total):
Memory: 77.29 GiB weights + 94.58 GiB KV cache available, 92% peak VRAM usage.
Concurrency (24-cell matrix, default Triton): 31/31 success at 8K, 16K, 32K, AND 64K. 6.49× higher aggregate throughput at 8K vs 32K with N=31 concurrent requests.
Long-context: 3/3 needle passes at 200K tokens (context that's usable, not just allocated; probe sketch after this list).
Repo Q&A: 9/9 correct, including pytorch/vision (1.3M tokens, 5× larger than the 256K context window).
Tuning A/B: tried --attention-backend ROCM_AITER_FA. Got 2-4× higher throughput, BUT output degenerated to repeating punctuation in 137/144 cells with the FP8 KV cache. Default Triton stays production-safe (0/144 broken). Filing upstream with AMD - the vLLM startup logs flag q_scale and prob_scale as uncalibrated on the FP8 attention path.
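For the long-context line: the needle check was the usual plant-a-fact-and-retrieve probe. A stripped-down version of the shape (filler, marker string, and model id are placeholders, not the repo's actual harness):

```python
# Stripped-down needle-in-a-haystack probe at roughly 200K tokens (placeholders, not the repo's harness).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

NEEDLE = "The deployment password is mi300x-needle-7421."
parts = []
for i in range(17_000):                                   # ~200K tokens of code-like filler; tune the count to hit the target
    parts.append(f"def helper_{i}():\n    return {i}\n\n")
    if i == 8_000:                                        # bury the needle roughly mid-document
        parts.append(NEEDLE + "\n\n")
haystack = "".join(parts)

resp = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-Next-FP8",                    # placeholder model id
    messages=[{"role": "user",
               "content": haystack + "\nWhat is the deployment password stated in the text above?"}],
    max_tokens=64,
)
print("PASS" if "7421" in resp.choices[0].message.content else "FAIL")
```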
The qwen3_coder tool-call parser handled our 5-tool agent registry (read_file, grep_codebase, execute_code, run_tests, git_log) without modification. A Day-0 unlock from the Qwen team - huge thanks.
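Concretely, "without modification" means: send the OpenAI-format tool schemas and the server hands back structured tool_calls instead of raw text. A minimal check along those lines (schemas abbreviated, only one shown in full; model id is a placeholder):

```python
# Minimal check that the qwen3_coder parser returns structured tool calls (schemas abbreviated).
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [
    {"type": "function",
     "function": {"name": "read_file",
                  "description": "Read a file from the checked-out repository",
                  "parameters": {"type": "object",
                                 "properties": {"path": {"type": "string"}},
                                 "required": ["path"]}}},
    # grep_codebase, execute_code, run_tests, git_log follow the same shape
]

resp = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-Next-FP8",                     # placeholder model id
    messages=[{"role": "user", "content": "Open README.md and summarize the project."}],
    tools=tools,
    tool_choice="auto",
)
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))   # e.g. read_file {'path': 'README.md'}
```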
Full evidence pack: github.com/SRKRZ23/repomind/tree/main/benchmarks
HF Space (judged for HF Special Prize): huggingface.co/spaces/lablab-ai-amd-developer-hackathon/repomind
Demo video (1:38): youtu.be/BvSBR1QazLU
If anyone from the Qwen team wants raw vLLM logs / repro for the AITER FP8 regression - happy to share.
- Sardor / ZeroR3