ollama:rocm error 500

#6
by flaviocb - opened

Hello,

I am trying to run the Qwen3-VL-30B-A3B-Instruct-GGUF (Q4_K_M) specifically to avoid the forced Chain of Thought (CoT) "thinking" blocks present in the base model.

While the standard base model (from ollama.com) works fine on my setup, this Unsloth Instruct quantization causes the Ollama runner to crash immediately upon loading (500 Internal Server Error / exit status 2). I have also tried downloading from the bartowski and Qwen pages, and the Q4_0 quantization as well; all fail the same way.

My Hardware:

GPU: AMD Radeon RX 7900 XTX (24GB VRAM)
CPU: AMD Ryzen 7 9700X
RAM: 96GB DDR5
OS: Ubuntu Noble (Host) running Ollama via Docker (Official Image)

The Issue:

Base Model (qwen3-vl:30b): Loads and runs perfectly, fitting into VRAM. However, it forces "Thinking..." output that cannot be suppressed via system prompts.

Unsloth Instruct Model (Q4_K_M): When I attempt to load this model, the runner crashes instantly.

Steps Taken:

Verified Integrity: Deleted and re-pulled the model.

Context Limits: Created a custom Modelfile forcing PARAMETER num_ctx 4096 to rule out OOM from default context windows.

Layer Offloading: Tried both forcing GPU (num_gpu 99) and removing the parameter to allow CPU offload. Both result in the same hard crash.

Logs: Running docker logs shows a hard termination of the runner:

time=2025-12-29T18:45:53.099Z level=ERROR source=server.go:302 msg="llama runner terminated" error="exit status 2"
[GIN] 500 | 88.391699ms | POST "/api/chat"
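For reference, the custom Modelfile mentioned in the steps above might look like this (the FROM target is illustrative and depends on how the GGUF was pulled):

```
# Hypothetical Modelfile; adjust FROM to match the actual pulled model tag.
FROM hf.co/unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF:Q4_K_M
PARAMETER num_ctx 4096
```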

Any guidance on a compatible quantization or a workaround to strip the "Thinking" tokens from the base model would be appreciated.
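As a possible workaround on the base model side: if the thinking output is wrapped in `<think>...</think>` tags (as Qwen3-family models typically emit), it can be stripped in post-processing rather than suppressed at generation time. A minimal sketch, assuming that tag format:

```python
import re

# Remove <think>...</think> blocks (and trailing whitespace) from a response.
# Assumes the thinking content is delimited by <think> tags, which is how
# Qwen3-family models typically format it; verify against actual output.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thinking(text: str) -> str:
    return THINK_RE.sub("", text).strip()

reply = "<think>Let me reason about this...</think>\nThe answer is 4."
print(strip_thinking(reply))  # -> The answer is 4.
```

This only hides the tokens client-side; the model still spends time generating them, so it is a stopgap rather than a fix.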

Thanks
