Anyone running this on AMD MI300X / vLLM ROCm 7 at 256K context?

#5 opened by ZeroR3

Hi everyone - I'm building an open-source repo-scale coding agent (REPOMIND) on top of Qwen3-Coder-Next-FP8 for the AMD Developer Hackathon. Submission deadline is May 11; MIT licensed.

The whole architecture relies on the MI300X's 192 GB single-GPU memory advantage: load 256K tokens of code plus the KV cache on one card, a working set that physically can't fit on an 80 GB H100, even at FP8.
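To make the memory claim concrete, here is a back-of-envelope KV-cache sizing sketch. The model dimensions below (layer count, KV heads, head dim) are illustrative assumptions, not the real Qwen3-Coder-Next config; substitute the values from the model's config.json.

```python
def kv_cache_gib(tokens, num_layers, num_kv_heads, head_dim, dtype_bytes):
    """KV-cache footprint of one request: 2x (K and V) per layer."""
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * tokens
    return total_bytes / 2**30

# Assumed dims: 48 layers, 8 KV heads (GQA), head_dim 128, FP8 cache (1 B).
per_request = kv_cache_gib(262_144, 48, 8, 128, 1)
print(f"{per_request:.2f} GiB of KV cache per 256K-token request")  # 24.00
```

Even under these modest assumptions, one 256K request costs tens of GiB of KV cache on top of the model weights, which is why the 192 GB card matters.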

Two questions for the community:

  1. Has anyone here actually run vLLM ROCm 7 with --tool-call-parser qwen3_coder at >128K context length? Any pitfalls before I burn AMD Cloud credits?

  2. For long-context tool-calling, what's the recommended --max-model-len / --kv-cache-dtype combination on MI300X? I see Day-0 ROCm support announced but no community reports yet at 256K specifically.
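For reference, this is the flag combination I'm planning to test, written out as a command list. The flag names match vLLM's CLI; the specific values (fp8 KV cache, 262144 context) are exactly what I'm asking about, not a verified recipe, and the model id is a placeholder.

```python
MODEL = "Qwen/Qwen3-Coder-Next-FP8"  # placeholder model id

serve_cmd = [
    "vllm", "serve", MODEL,
    "--max-model-len", "262144",        # 256K context
    "--kv-cache-dtype", "fp8",          # halves KV memory vs bf16
    "--tool-call-parser", "qwen3_coder",
    "--enable-auto-tool-choice",        # needed for tool-call parsing
    "--gpu-memory-utilization", "0.95",
]
print(" ".join(serve_cmd))
```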

The agent uses an SC-TIR loop (PLAN → CALL → OBSERVE → THINK → ANSWER) with 5 tools (read_file, grep, sandboxed exec, run_tests, git_log). Will publish benchmarks (H100 OOMs where MI300X works) once credits land.
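A minimal sketch of one CALL/OBSERVE iteration of that loop, with a mock tool registry. The real agent wires these names to an LLM and sandboxed executors; everything beyond the five tool names is my own scaffolding.

```python
# Mock tool registry: the five tools from the post, with stub bodies.
TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",
    "grep": lambda pattern: f"<matches for {pattern}>",
    "exec": lambda code: "<sandboxed stdout>",
    "run_tests": lambda: "3 passed",
    "git_log": lambda: "<recent commits>",
}

def sc_tir_step(plan):
    """One iteration: execute each planned tool call, collect observations.

    `plan` is a list of (tool_name, args) pairs produced by the PLAN phase;
    the returned observations feed the THINK phase.
    """
    observations = []
    for tool_name, args in plan:            # CALL
        result = TOOLS[tool_name](*args)    # OBSERVE
        observations.append((tool_name, result))
    return observations

obs = sc_tir_step([("read_file", ("setup.py",)), ("run_tests", ())])
```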

Repo: https://github.com/SRKRZ23/repomind
HF Space: https://huggingface.co/spaces/ZeroR3/repomind

Thanks - and huge respect to the Qwen team for FP8 release + Day-0 ROCm support.

Quick update: the HF Space has been moved to the official AMD Developer Hackathon org:
https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/repomind

Likes there contribute to the HF Special Prize judging 🤗

Quick update: smoke-tested the vLLM 0.17.1 + ROCm 7.2 Quick Start image with Qwen3-Coder-Next-FP8 on a single AMD MI300X (192 GB) yesterday.

Verified:

  • max_model_len 262144 (256K) starts cleanly ("Application startup complete")
  • 77.29 GiB weights + 95.26 GiB KV cache available at 256K config
  • 31.31× max concurrency at the 256K-context-per-request config
  • Cold start ~3.5 min (with model download), warm restart ~1.5 min
  • Generation throughput: 30 tok/s at 8K config (warm)
  • Real Python code generation through /v1/chat/completions verified
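A quick sanity check on the concurrency figure above: vLLM's reported max concurrency is available KV memory divided by the footprint of one max-length request, so the per-request size can be back-derived (it isn't printed in the logs directly).

```python
# Back-derive the KV footprint of one 256K request from the logged
# numbers: 95.26 GiB available, 31.31x max concurrency.
kv_available_gib = 95.26
max_concurrency = 31.31
per_request_gib = kv_available_gib / max_concurrency
print(f"~{per_request_gib:.2f} GiB of KV cache per full 256K request")
```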

Full evidence (rocm-smi, vLLM logs, JSON responses):
github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-smoke-test

Huge thanks to the Qwen team: Day-0 ROCm support plus the FP8 release made this possible without manual quantization. The qwen3_coder tool-call parser will be wired in next for the agentic loop (SC-TIR-style, adapted from AIMO3 math).

Final update: the REPOMIND submission for the AMD Developer Hackathon 2026 just landed: lablab.ai/ai-hackathons/amd-developer/repomind/repomind

Full verified results on Qwen3-Coder-Next-FP8 + single MI300X + vLLM 0.17.1 + ROCm 7.2 (124 min, $4.12 total):

Memory: 77.29 GiB weights, 94.58 GiB KV cache available, 92% peak VRAM utilization.

Concurrency (24-cell matrix, default Triton backend): 31/31 success at 8K, 16K, 32K, and 64K. 6.49× higher aggregate throughput at 8K vs 32K at N=31.

Long-context: 3/3 needle-in-a-haystack passes at 200K tokens (the context is usable, not just allocated).
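For anyone reproducing the long-context check, this is the shape of a needle test harness: bury a unique marker at a chosen depth in ~200K tokens of filler, then ask the model to retrieve it. The filler vocabulary, marker string, and substring scoring here are illustrative, not REPOMIND's exact harness.

```python
import random

def build_haystack(n_tokens, needle, depth=0.5, seed=0):
    """Generate n_tokens of filler with `needle` inserted at `depth`."""
    random.seed(seed)
    filler = ["the", "quick", "brown", "fox", "jumps"]
    words = [random.choice(filler) for _ in range(n_tokens)]
    words.insert(int(n_tokens * depth), needle)
    return " ".join(words)

prompt = build_haystack(200_000, "NEEDLE-7f3a", depth=0.25)
# Pass criterion: the model's answer to "What is the needle code?"
# contains the marker (simple substring check).
```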

Repo Q&A: 9/9 correct including pytorch/vision (1.3M tokens, 5× larger than the 256K context window).
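Fitting a 1.3M-token repo through a 256K window implies some form of chunking or retrieval. One plausible approach (not necessarily what REPOMIND does) is overlapping windows, sketched below; the window and overlap sizes are assumptions.

```python
def chunk_tokens(token_ids, window=262_144, overlap=4_096):
    """Split a token stream larger than the context window into
    overlapping chunks that each fit in one request."""
    chunks, start = [], 0
    step = window - overlap
    while start < len(token_ids):
        chunks.append(token_ids[start:start + window])
        start += step
    return chunks

# 1.3M tokens -> 6 chunks of <=256K each (4K overlap between neighbors).
chunks = chunk_tokens(list(range(1_300_000)))
print(len(chunks))
```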

Tuning A/B: tried --attention-backend ROCM_AITER_FA. It gave 2-4× higher throughput, but output degenerated to repeating punctuation on 137/144 cells with the FP8 KV cache. Default Triton stays production-safe (0/144 broken). Filing an issue upstream with AMD; vLLM startup logs flag q_scale and prob_scale as uncalibrated for the FP8 attention path.
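The "degenerated to repeating punctuation" verdict was per-cell; a simple automatic check like the one below is enough to classify outputs. The thresholds and the heuristic itself are my own, not the exact classifier used in the benchmark.

```python
import re

def looks_degenerate(text, punct_ratio=0.5, repeat_run=20):
    """Flag broken generations: mostly punctuation, or a long run
    of one repeated character. Thresholds are illustrative."""
    if not text:
        return True
    punct = sum(1 for c in text if not c.isalnum() and not c.isspace())
    if punct / len(text) > punct_ratio:
        return True
    return re.search(r"(.)\1{%d,}" % (repeat_run - 1), text) is not None
```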

The qwen3_coder tool-call parser handled our 5-tool agent registry (read_file, grep_codebase, execute_code, run_tests, git_log) without modification. A Day-0 unlock from the Qwen team; huge thanks.
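For completeness, this is what the registry looks like as OpenAI-style tool schemas in the /v1/chat/completions request body. Only the five tool names come from the post; the descriptions and parameter shapes are illustrative.

```python
def tool(name, description, params):
    """Wrap a name/description/params triple in the OpenAI tool schema."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": params,
                "required": list(params),
            },
        },
    }

TOOLS = [
    tool("read_file", "Read a file from the repo", {"path": {"type": "string"}}),
    tool("grep_codebase", "Search the repo", {"pattern": {"type": "string"}}),
    tool("execute_code", "Run code in a sandbox", {"code": {"type": "string"}}),
    tool("run_tests", "Run the test suite", {}),
    tool("git_log", "Show recent commits", {"n": {"type": "integer"}}),
]
```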

Full evidence pack: github.com/SRKRZ23/repomind/tree/main/benchmarks
HF Space (judged for HF Special Prize): huggingface.co/spaces/lablab-ai-amd-developer-hackathon/repomind
Demo video (1:38): youtu.be/BvSBR1QazLU

If anyone from the Qwen team wants raw vLLM logs / a repro for the AITER FP8 regression, happy to share.

- Sardor / ZeroR3
