GGUF
English
qwen3
reasoning
distillation
claude-opus
full-finetune
conversational

Request: MXFP4_MOE Quant

#2
by codyknowscode - opened

I've been running this model for a day now at Q4_K_M and I see compared to unsloth/Qwen3-Coder-Next-GGUF:MXFP4_MOE it tends to do more syntax errors and failures in tool calling. FIM works the same way in both (tested out with llama.vim in neovim/LazyVim).

What I think is happening is the extra precision from MXFP4_MOE helps out. Since I'm running this on a DGX Spark, the difference in performance is small enough (at the same model size) that I'd rather take the extra precision over the error rate.

Would it be possible to get a MXFP4_MOE version of it?

Benchmark:

# unsloth/Qwen3-Coder-Next-GGUF:MXFP4_MOE
gml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):                                                                                                                                                                        
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124610 MiB                                                                                                                                                          
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |                                                                                                  
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |           pp512 |       1378.16 ± 7.49 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |           tg128 |         50.34 ± 0.16 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |           pp512 |       1390.25 ± 8.43 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |           tg128 |         50.65 ± 0.06 |                                                                                                  
                                                                                                                                                                                                                                      
build: 25eec6f32 (8672) 

# samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-GGUF:Q4_K_M                                                                                                                                                                                                              
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):                                                                                                                                                                        
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124610 MiB                                                                                                                                                          
| model                           |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |                                                                                                  
| ------------------------------- | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |                                                                                                  
| qwen3next 80B.A3B Q4_K - Medium |  45.15 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |           pp512 |       1254.23 ± 9.04 |                                                                                                 
| qwen3next 80B.A3B Q4_K - Medium |  45.15 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |           tg128 |         58.42 ± 0.15 |                                                                                                 
| qwen3next 80B.A3B Q4_K - Medium |  45.15 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |           pp512 |      1264.55 ± 12.38 |                                                                                                 
| qwen3next 80B.A3B Q4_K - Medium |  45.15 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |           tg128 |         58.75 ± 0.11 |                                                                                                 
                                                                                                                                                                                                                                      
build: 25eec6f32 (8672) 
This comment has been hidden

MXFP4_MOE quantization is now available: Qwen3-Coder-Next-Opus-Distilled-MXFP4_MOE.gguf (43.7 GB, 4.39 BPW). Expert weights in MXFP4, routing and attention in Q8_0. Let me know how it performs on your DGX Spark!

Thanks, will do a full workday run of it tomorrow.

Initial results look very good:

unsloth/Qwen3-Coder-Next-GGUF:MXFP4_MOE

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124610 MiB
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |           pp512 |      1349.83 ± 28.80 |
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |           tg128 |         50.51 ± 0.17 |
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |           pp512 |      1376.36 ± 18.73 |
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |           tg128 |         50.89 ± 0.10 |

samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-GGUF:MXFP4_MOE

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124610 MiB
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |           pp512 |      1450.04 ± 32.45 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |           tg128 |         52.34 ± 0.07 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |           pp512 |      1465.28 ± 36.82 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |           tg128 |         52.48 ± 0.15 |

build: 5c4aae66e (8703)

Next I'll benchmark at different context lengths...

I'm still seeing the same issues:

  • Still getting occasional tool calling issues where it outputs the tool call xml instead of actually calling the tool (in OpenCode), which stops the execution sequence (not via an error, it's assuming it made the tool call and awaits it's return, which never happens).
  • Syntax errors: quite frequently tends to add extra end lines (ruby code) then spends a considerable amount of time debugging to fix it.
  • Indentation mismatches: quite frequently is off by 1 space, occasionally puts whole code blocks at completely off levels.

I'm guessing the Distillation process messes things slightly, just enough to cause these annoyances. Ultimately it fixes them but because it's trying to reason about them it's taking significantly more time than without reasoning.

Note: these issues are not unique to your model, they happen in unsloth/Qwen3-Coder-Next-GGUF:MXFP4_MOE as well, but with far less frequency (I'd say in the order of 100x more frequently based on vibes).

I'm seeing a lot less syntax errors and indentation mismatches with the following parameters:

  --temperature 0.9     \
  --top-p 0.90          \
  --min-p 0.02          \
  --top-k 30            \
  --repeat-penalty 1.05 \

I'm guessing the distillation made it more sensitive to chaos 🤷‍♂️ .

I also have some template fixes that make tool calling XML no longer bleed into thinking blocks.
PR here: https://huggingface.co/samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-GGUF/discussions/5

Speed benchmark at different context lengths, DGX Spark:

unsloth/Qwen3-Coder-Next-GGUF:MXFP4_MOE

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):                                                                                                                                                                        
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124610 MiB                                                                                                                                                          
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |                                                                                                  
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |    pp512 @ d512 |       1373.49 ± 8.04 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |    tg128 @ d512 |         50.80 ± 0.21 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |   pp512 @ d1024 |       1333.32 ± 8.06 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |   tg128 @ d1024 |         50.63 ± 0.13 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |   pp512 @ d2048 |       1301.38 ± 9.11 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |   tg128 @ d2048 |         50.10 ± 0.13 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |   pp512 @ d4096 |       1220.53 ± 7.21 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |   tg128 @ d4096 |         49.13 ± 0.14 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |   pp512 @ d8192 |       1074.86 ± 2.09 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |   tg128 @ d8192 |         46.85 ± 0.34 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |  pp512 @ d16384 |        884.87 ± 5.44 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |  tg128 @ d16384 |         43.13 ± 0.08 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |  pp512 @ d32768 |        642.72 ± 1.74 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |  tg128 @ d32768 |         36.45 ± 0.07 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |  pp512 @ d65536 |        369.06 ± 0.62 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |  tg128 @ d65536 |         13.22 ± 0.02 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 | pp512 @ d131072 |        175.41 ± 0.32 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 | tg128 @ d131072 |          7.22 ± 0.01 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |    pp512 @ d512 |       1383.85 ± 9.96 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |    tg128 @ d512 |         51.00 ± 0.16 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |   pp512 @ d1024 |       1378.92 ± 6.56 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |   tg128 @ d1024 |         50.91 ± 0.12 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |   pp512 @ d2048 |       1380.03 ± 6.53 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |   tg128 @ d2048 |         50.51 ± 0.23 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |   pp512 @ d4096 |      1362.56 ± 12.29 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |   tg128 @ d4096 |         49.85 ± 0.19 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |   pp512 @ d8192 |       1343.10 ± 3.04 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |   tg128 @ d8192 |         48.54 ± 0.22 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |  pp512 @ d16384 |       1292.46 ± 4.70 |                                                                                                  
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |  tg128 @ d16384 |         46.04 ± 0.25 |
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |  pp512 @ d32768 |       1213.80 ± 7.31 |
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |  tg128 @ d32768 |         42.13 ± 0.11 |
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |  pp512 @ d65536 |       1090.70 ± 2.46 |
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |  tg128 @ d65536 |         35.71 ± 0.08 |
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 | pp512 @ d131072 |        896.75 ± 5.50 |
| qwen3next 80B.A3B MXFP4 MoE    |  44.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 | tg128 @ d131072 |         27.28 ± 0.06 |

samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled-GGUF:MXFP4_MOE

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124610 MiB
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |    pp512 @ d512 |       1438.45 ± 6.97 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |    tg128 @ d512 |         52.01 ± 0.22 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |   pp512 @ d1024 |       1408.13 ± 5.77 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |   tg128 @ d1024 |         51.54 ± 0.35 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |   pp512 @ d2048 |      1373.98 ± 10.56 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |   tg128 @ d2048 |         51.20 ± 0.15 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |   pp512 @ d4096 |       1275.13 ± 4.46 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |   tg128 @ d4096 |         50.10 ± 0.15 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |   pp512 @ d8192 |       1126.88 ± 3.82 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |   tg128 @ d8192 |         48.05 ± 0.12 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |  pp512 @ d16384 |        915.48 ± 2.85 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |  tg128 @ d16384 |         44.11 ± 0.12 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |  pp512 @ d32768 |        661.39 ± 2.76 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |  tg128 @ d32768 |         37.05 ± 0.09 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |  pp512 @ d65536 |        369.01 ± 0.41 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 |  tg128 @ d65536 |         14.17 ± 0.03 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 | pp512 @ d131072 |        176.03 ± 0.75 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  0 |    0 | tg128 @ d131072 |          7.40 ± 0.01 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |    pp512 @ d512 |      1472.71 ± 10.73 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |    tg128 @ d512 |         52.27 ± 0.20 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |   pp512 @ d1024 |       1470.50 ± 9.31 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |   tg128 @ d1024 |         52.22 ± 0.15 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |   pp512 @ d2048 |       1464.86 ± 8.63 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |   tg128 @ d2048 |         51.79 ± 0.16 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |   pp512 @ d4096 |       1443.87 ± 7.19 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |   tg128 @ d4096 |         51.27 ± 0.21 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |   pp512 @ d8192 |       1424.75 ± 6.47 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |   tg128 @ d8192 |         49.70 ± 0.61 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |  pp512 @ d16384 |       1374.76 ± 8.43 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |  tg128 @ d16384 |         47.48 ± 0.10 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |  pp512 @ d32768 |       1284.97 ± 6.75 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |  tg128 @ d32768 |         43.21 ± 0.13 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |  pp512 @ d65536 |       1141.02 ± 4.34 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 |  tg128 @ d65536 |         36.64 ± 0.10 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 | pp512 @ d131072 |        931.34 ± 1.43 |
| qwen3next 80B.A3B MXFP4 MoE    |  40.73 GiB |    79.67 B | CUDA       | 100 |  1 |    0 | tg128 @ d131072 |         27.77 ± 0.06 |

build: 3ee9da0e4 (8726)

Conclusion: use Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled with Flash Attention!

codyknowscode changed discussion status to closed

i'll add all of this in the readme asap! thanks a lot man!

Sign up or log in to comment