Anyone running this on AMD MI300X / vLLM ROCm 7 at 256K context?

#5 opened by ZeroR3

Hi everyone - I'm building an open-source repo-scale coding agent (REPOMIND) on top of Qwen3-Coder-Next-FP8 for the AMD Developer Hackathon. Submission deadline is May 11; MIT licensed.

The whole architecture relies on the MI300X's 192 GB single-GPU memory advantage: load 256K tokens of code plus the KV cache on one card, a working set that physically can't fit on an 80 GB H100, even at FP8.
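To make the memory claim concrete, here is a back-of-envelope KV-cache sizing sketch. The model dimensions below (layer count, KV heads, head dim) are illustrative assumptions, not the real Qwen3-Coder-Next config; substitute the values from the model's config.json.

```python
def kv_cache_gib(tokens, num_layers, num_kv_heads, head_dim, dtype_bytes):
    """KV-cache footprint of one request: 2x (K and V) per layer."""
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * tokens
    return total_bytes / 2**30

# Assumed dims: 48 layers, 8 KV heads (GQA), head_dim 128, FP8 cache (1 B).
per_request = kv_cache_gib(262_144, 48, 8, 128, 1)
print(f"{per_request:.2f} GiB of KV cache per 256K-token request")  # 24.00
```

Even under these modest assumptions, one 256K request costs tens of GiB of KV cache on top of the model weights, which is why the 192 GB card matters.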

Two questions for the community:

  1. Has anyone here actually run vLLM ROCm 7 with --tool-call-parser qwen3_coder at >128K context length? Any pitfalls before I burn AMD Cloud credits?

  2. For long-context tool-calling, what's the recommended --max-model-len / --kv-cache-dtype combination on MI300X? I see Day-0 ROCm support announced but no community reports yet at 256K specifically.
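For reference, this is the flag combination I'm planning to test, written out as a command list. The flag names match vLLM's CLI; the specific values (fp8 KV cache, 262144 context) are exactly what I'm asking about, not a verified recipe, and the model id is a placeholder.

```python
MODEL = "Qwen/Qwen3-Coder-Next-FP8"  # placeholder model id

serve_cmd = [
    "vllm", "serve", MODEL,
    "--max-model-len", "262144",        # 256K context
    "--kv-cache-dtype", "fp8",          # halves KV memory vs bf16
    "--tool-call-parser", "qwen3_coder",
    "--enable-auto-tool-choice",        # needed for tool-call parsing
    "--gpu-memory-utilization", "0.95",
]
print(" ".join(serve_cmd))
```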

The agent uses an SC-TIR loop (PLAN → CALL → OBSERVE → THINK → ANSWER) with 5 tools (read_file, grep, sandboxed exec, run_tests, git_log). Will publish benchmarks (H100 OOMs where MI300X works) once credits land.
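A minimal sketch of one CALL/OBSERVE iteration of that loop, with a mock tool registry. The real agent wires these names to an LLM and sandboxed executors; everything beyond the five tool names is my own scaffolding.

```python
# Mock tool registry: the five tools from the post, with stub bodies.
TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",
    "grep": lambda pattern: f"<matches for {pattern}>",
    "exec": lambda code: "<sandboxed stdout>",
    "run_tests": lambda: "3 passed",
    "git_log": lambda: "<recent commits>",
}

def sc_tir_step(plan):
    """One iteration: execute each planned tool call, collect observations.

    `plan` is a list of (tool_name, args) pairs produced by the PLAN phase;
    the returned observations feed the THINK phase.
    """
    observations = []
    for tool_name, args in plan:            # CALL
        result = TOOLS[tool_name](*args)    # OBSERVE
        observations.append((tool_name, result))
    return observations

obs = sc_tir_step([("read_file", ("setup.py",)), ("run_tests", ())])
```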

Repo: https://github.com/SRKRZ23/repomind
HF Space: https://huggingface.co/spaces/ZeroR3/repomind

Thanks - and huge respect to the Qwen team for FP8 release + Day-0 ROCm support.

Quick update: the HF Space has been moved to the official AMD Developer Hackathon org:
https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/repomind

Likes there contribute to the HF Special Prize judging 🤗

Quick update: smoke-tested the vLLM 0.17.1 + ROCm 7.2 Quick Start image with Qwen3-Coder-Next-FP8 on a single AMD MI300X (192 GB) yesterday.

Verified:

  • max_model_len 262144 (256K) starts cleanly ("Application startup complete")
  • 77.29 GiB weights + 95.26 GiB KV cache available at 256K config
  • 31.31× max concurrency at the 256K-context-per-request config
  • Cold start ~3.5 min (with model download), warm restart ~1.5 min
  • Generation throughput: 30 tok/s at 8K config (warm)
  • Real Python code generation through /v1/chat/completions verified
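A quick sanity check on the concurrency figure above: vLLM's reported max concurrency is available KV memory divided by the footprint of one max-length request, so the per-request size can be back-derived (it isn't printed in the logs directly).

```python
# Back-derive the KV footprint of one 256K request from the logged
# numbers: 95.26 GiB available, 31.31x max concurrency.
kv_available_gib = 95.26
max_concurrency = 31.31
per_request_gib = kv_available_gib / max_concurrency
print(f"~{per_request_gib:.2f} GiB of KV cache per full 256K request")
```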

Full evidence (rocm-smi, vLLM logs, JSON responses):
github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-smoke-test

Huge thanks to the Qwen team: Day-0 ROCm support plus the FP8 release made this possible without manual quantization. The qwen3_coder tool-call parser will be wired in next for the agentic loop (SC-TIR-style, adapted from AIMO3 math).

Final update: the REPOMIND submission for the AMD Developer Hackathon 2026 just landed: lablab.ai/ai-hackathons/amd-developer/repomind/repomind

Full verified results on Qwen3-Coder-Next-FP8 + single MI300X + vLLM 0.17.1 + ROCm 7.2 (124 min, $4.12 total):

Memory: 77.29 GiB weights, 94.58 GiB KV cache available, 92% peak VRAM utilization.

Concurrency (24-cell matrix, default Triton backend): 31/31 success at 8K, 16K, 32K, and 64K. 6.49× higher aggregate throughput at 8K vs 32K at N=31.

Long-context: 3/3 needle-in-a-haystack passes at 200K tokens (the context is usable, not just allocated).
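For anyone reproducing the long-context check, this is the shape of a needle test harness: bury a unique marker at a chosen depth in ~200K tokens of filler, then ask the model to retrieve it. The filler vocabulary, marker string, and substring scoring here are illustrative, not REPOMIND's exact harness.

```python
import random

def build_haystack(n_tokens, needle, depth=0.5, seed=0):
    """Generate n_tokens of filler with `needle` inserted at `depth`."""
    random.seed(seed)
    filler = ["the", "quick", "brown", "fox", "jumps"]
    words = [random.choice(filler) for _ in range(n_tokens)]
    words.insert(int(n_tokens * depth), needle)
    return " ".join(words)

prompt = build_haystack(200_000, "NEEDLE-7f3a", depth=0.25)
# Pass criterion: the model's answer to "What is the needle code?"
# contains the marker (simple substring check).
```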

Repo Q&A: 9/9 correct including pytorch/vision (1.3M tokens, 5× larger than the 256K context window).
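Fitting a 1.3M-token repo through a 256K window implies some form of chunking or retrieval. One plausible approach (not necessarily what REPOMIND does) is overlapping windows, sketched below; the window and overlap sizes are assumptions.

```python
def chunk_tokens(token_ids, window=262_144, overlap=4_096):
    """Split a token stream larger than the context window into
    overlapping chunks that each fit in one request."""
    chunks, start = [], 0
    step = window - overlap
    while start < len(token_ids):
        chunks.append(token_ids[start:start + window])
        start += step
    return chunks

# 1.3M tokens -> 6 chunks of <=256K each (4K overlap between neighbors).
chunks = chunk_tokens(list(range(1_300_000)))
print(len(chunks))
```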

Tuning A/B: tried --attention-backend ROCM_AITER_FA. It gave 2-4× higher throughput, but output degenerated to repeating punctuation on 137/144 cells with the FP8 KV cache. Default Triton stays production-safe (0/144 broken). Filing an issue upstream with AMD; vLLM startup logs flag q_scale and prob_scale as uncalibrated for the FP8 attention path.
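The "degenerated to repeating punctuation" verdict was per-cell; a simple automatic check like the one below is enough to classify outputs. The thresholds and the heuristic itself are my own, not the exact classifier used in the benchmark.

```python
import re

def looks_degenerate(text, punct_ratio=0.5, repeat_run=20):
    """Flag broken generations: mostly punctuation, or a long run
    of one repeated character. Thresholds are illustrative."""
    if not text:
        return True
    punct = sum(1 for c in text if not c.isalnum() and not c.isspace())
    if punct / len(text) > punct_ratio:
        return True
    return re.search(r"(.)\1{%d,}" % (repeat_run - 1), text) is not None
```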

The qwen3_coder tool-call parser handled our 5-tool agent registry (read_file, grep_codebase, execute_code, run_tests, git_log) without modification. A Day-0 unlock from the Qwen team; huge thanks.
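For completeness, this is what the registry looks like as OpenAI-style tool schemas in the /v1/chat/completions request body. Only the five tool names come from the post; the descriptions and parameter shapes are illustrative.

```python
def tool(name, description, params):
    """Wrap a name/description/params triple in the OpenAI tool schema."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": params,
                "required": list(params),
            },
        },
    }

TOOLS = [
    tool("read_file", "Read a file from the repo", {"path": {"type": "string"}}),
    tool("grep_codebase", "Search the repo", {"pattern": {"type": "string"}}),
    tool("execute_code", "Run code in a sandbox", {"code": {"type": "string"}}),
    tool("run_tests", "Run the test suite", {}),
    tool("git_log", "Show recent commits", {"n": {"type": "integer"}}),
]
```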

Full evidence pack: github.com/SRKRZ23/repomind/tree/main/benchmarks
HF Space (judged for HF Special Prize): huggingface.co/spaces/lablab-ai-amd-developer-hackathon/repomind
Demo video (1:38): youtu.be/BvSBR1QazLU

If anyone from the Qwen team wants raw vLLM logs / a repro for the AITER FP8 regression, happy to share.

- Sardor / ZeroR3
