Increasing the precision of some of the weights when quantizing
I am curious whether the differences in how everyone is quantizing their mxfp4 files might be impacting quality. Basically all community llama.cpp quantizations into mxfp4 use q8_0 for some weights rather than f16, while the official releases of the gpt-oss models use f16 weights in those same spots. I wonder if quantizing them to f16 rather than q8_0 might make a bit of an improvement. I get that there's post-quantization training in those models, but aside from that, the curiosity remains: if those weights were upped a little, would it improve the quality of the output (especially for coding models) while keeping roughly the same size?
If you put ollama + gpt-oss (from the ollama library) against llama.cpp + ggml/gpt-oss (from HF), there is a notable accuracy difference between the models (speed and size too, but accuracy is the focus here). ollama uses an f32/f16/mxfp4 structure, whereas llama.cpp uses f32/q8_0/mxfp4. There are a couple of other examples of this, but I'm trying to keep it simple.
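For anyone who wants to check which layout a given GGUF actually uses, here's a rough sketch. The tensor names and type assignments below are illustrative stand-ins mimicking the two structures being discussed, not dumped from a real file:

```python
from collections import Counter

def census(tensors):
    """Tally quantization types across (tensor_name, qtype) pairs."""
    return Counter(qtype for _name, qtype in tensors)

# Toy tensor lists mimicking the two layouts discussed (names illustrative,
# not a full model):
community = [("blk.0.attn_q.weight", "q8_0"),
             ("blk.0.ffn_gate_exps.weight", "mxfp4"),
             ("token_embd.weight", "q8_0"),
             ("output_norm.weight", "f32")]
official = [("blk.0.attn_q.weight", "f16"),
            ("blk.0.ffn_gate_exps.weight", "mxfp4"),
            ("token_embd.weight", "f16"),
            ("output_norm.weight", "f32")]

print("community:", census(community))
print("official: ", census(official))
```

Against a real file you could build the same pair list with the `gguf` Python package that ships with llama.cpp, something like `[(t.name, t.tensor_type.name) for t in gguf.GGUFReader(path).tensors]` (check that against your installed version).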
Size-wise there isn't much of a difference: ~3 GB on the gpt-oss:120b model between the two methods. The larger one is a little slower, especially on prompt reading, but it is also notably more coherent and accurate.
A good example is gpt-oss 20b: the ollama or the unsloth build (f16/bf16, seems identical to ollama's model) does really well with tricky riddles meant to trip it up, whereas the ggml/gpt-oss:20b gets tripped up easily and makes more mistakes.
I might be using the wrong terminology, but I mean these guys, here:
Is that something you could do, or be interested in exploring? My whole thought is around whether or not this would have a positive impact for coding models in terms of code generation and accuracy in writing code.
I don't have a spare ~300gb of disk space or I'd try it.
I rant on a little more about it here.
https://huggingface.co/unsloth/MiniMax-M2.5-GGUF/discussions/3#69934b1e30cad63e1a14b5ed
Hmm.. well yeah I could make a quant like that. Let me try it out a little and I'll upload it
Just keep in mind that the gpt-oss models were trained from the start to use FP4, so they will always perform best on this quant.
I created a quant with bf16/fp4 instead of the mainline q8/fp4 over here; it seems good so far, and the REAM method has very little quality loss:
https://huggingface.co/noctrex/Qwen3-Coder-Next-REAM-MXFP4_MOE-GGUF
Yeah, I don't expect it to be gpt oss quality for those reasons - but I do think that it might help improve the small areas where the q8_0 weights fall short.
It's things like variables, or specific code formatting, coherency, etc... the little things that make you have to work a little harder.
For example take "create pacman in javascript". The model will do a nearly perfect job most of the time, except often there will be a small issue which causes it to either not play, or makes pacman go flying off the map.
Or the "what is the nineteenth letter in this sentence" question. I find most mxfp4 and q4 quants will misspell 'nineteenth' as 'nineth' or 'nintheen' during the counting stage, whereas a higher precision of the same model will count more accurately. Then I extrapolate from the results of experiments like that to how it might impact larger projects or more complex tasks.
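As an aside, the riddle's ground truth is easy to check mechanically. This small sketch assumes "letter" means alphabetic characters only (spaces and punctuation skipped), which is one reading of the riddle:

```python
def nth_letter(sentence: str, n: int) -> str:
    """Return the n-th letter (1-indexed), counting only alphabetic characters."""
    letters = [c for c in sentence if c.isalpha()]
    return letters[n - 1]

s = "What is the nineteenth letter in this sentence"
# "What"(4) + "is"(2) + "the"(3) = 9 letters, so "nineteenth" spans
# positions 10-19 and its final 'h' is the 19th letter.
print(nth_letter(s, 19))
```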
I guess to most it's subtle and goes unnoticed. I'm just trying to theorize the best way to squeeze the most juice out of the quantization type. And if the method works, it could be a significant improvement for the full model quantized into mxfp4 versus other quants of the same model.
Downloading the REAM model now and will give it a spin.
Thanks!
EDIT: For clarity, that's why I referenced (somewhere) the difference in output quality between the ggml and unsloth gpt-oss:120b models, where the only difference is those q8_0 vs f16 weights. They make a noticeable difference there, which makes me wonder if it can work the other way around in a positive way: upping the q8_0 weights to f16 in general.
This was a tricky little experiment. I'm going to try to make sense of my findings.
The new REAM mxfp4 quant has a higher quality output in terms of accuracy of what it can do, but the REAMing has degraded the model.
For example:
counting:
When counting in spelling riddles, it counts and spells accurately.
coding:
When coding, it produces higher quality code for what it can do, but the creativeness and fullness of the code which the full model has is gone.
So it seems like the f32/bf16/mxfp4 structure is indeed creating an improvement over the quality of the output.
Also I put the code and outputs between the models (the original with f32/q8_0/mxfp4 versus the REAM with f32/bf16/mxfp4) into grok and had it write a summary of the findings.
Based on the code outputs we reviewed, the REAM model's Pac-Man implementation shows signs of higher precision in its execution (smoother basic mechanics, fewer visual quirks in the simplified elements), but it's clearly constrained by whatever REAM optimizations were applied, resulting in a less capable overall model (e.g., omitted power pellet functionality, no lives system, rudimentary AI). The original MXFP4, with its f32/q8_0/mxfp4 structure, delivers a more comprehensive and feature-rich game, aligning with the base model's strengths, but yes, it exhibits subtle artifacts that could stem from the q8_0 weights (like the malformed ghost feet paths, inconsistent scared-mode rendering, and minor logic gaps in direction handling that lead to occasional ghost stalling). Quantizing the original to MXFP4 via f32/bf16/mxfp4 should indeed mitigate those precision losses, potentially yielding output quality on par with (or better than) the REAM variant while preserving the full model's depth and intelligence at reduced size.
OK I updated this quant also, to use bf16/fp4. The gguf got a little bit larger using this method, from 40.74GB up to 42.76GB, which seems acceptable to me.
Please try this also.
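For a rough sanity check on that size jump: assuming the upgraded tensors went from q8_0 (about 8.5 bits/weight, since a 34-byte block holds 32 weights) to bf16 (16 bits/weight), the ~2 GB delta corresponds to roughly two billion parameters' worth of tensors being upgraded, with everything else staying mxfp4:

```python
# Rough bit costs per weight for the relevant GGUF formats:
q8_0_bpw = 34 * 8 / 32   # 32 weights per 34-byte block -> 8.5 bits/weight
bf16_bpw = 16.0

delta_bytes = (42.76 - 40.74) * 1e9             # reported size jump, ~2.02 GB
extra_bytes_per_weight = (bf16_bpw - q8_0_bpw) / 8
upgraded_params = delta_bytes / extra_bytes_per_weight
print(f"~{upgraded_params / 1e9:.2f}B parameters moved from q8_0 to bf16")
```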
lol. for the sake of science I have to finish the REAM test. Almost done downloading your original mxfp4 REAM model and then going to run the same prompts to compare an accurate before and after.
Then will download the full model you just did. And will report back with my findings.
Thanks for working with me on this, I really appreciate it. Hope the final results are good.
The bigger file size aligns perfectly with the difference between the gpt-oss builds, and I agree it's reasonable.
ok brb
Thank you for taking time from your precious schedule in order to look at these quantizations.
I'm testing it also right now to see if there's any difference.
At last we have a model that is of such quality that can be actually used for coding locally.
Thanks for sharing your command line! If you don't mind me asking more questions... n00b here when it comes to tuning the server arguments.
How come you have speculative decoding arguments without a draft model though, do they apply in this case?
I see your user batch size is pretty low... have you tried a more balanced 8192/8192, for example (here that gives me PP and TG of 800+ and ~21, respectively)?
If the model uses BF16 KVs, how come you are quantizing down to Q8? I thought the whole point was to keep the BF16 precision (or does your 7900XTX not support BF16)?
Couldn't you get more layers into VRAM (40 is almost everything on CPU) to get more T/s?
What is the benefit of repeat-last, slot-prompt-similarity, and context-shift in your use case?
You are leaving min-p at llama's default of 0.1, which many people think is too high vs the suggested 0.01 (or 0.05 according to some paper). I've seen some actually turn it off with 0.0?
Below are my manual layman tests of a variety of b/ub combinations on a generic prompt ("only audit the opened file blah.hxx for errors") using the parameters I posted above:
2048/2048
---------
prompt eval time = 38180.52 ms / 19777 tokens ( 1.93 ms per token, 517.99 tokens per second)
eval time = 2496.84 ms / 50 tokens ( 49.94 ms per token, 20.03 tokens per second)
prompt eval time = 13669.37 ms / 5448 tokens ( 2.51 ms per token, 398.56 tokens per second)
eval time = 48466.74 ms / 850 tokens ( 57.02 ms per token, 17.54 tokens per second)
4096/1024
---------
prompt eval time = 60083.56 ms / 19775 tokens ( 3.04 ms per token, 329.12 tokens per second)
eval time = 2862.28 ms / 50 tokens ( 57.25 ms per token, 17.47 tokens per second)
prompt eval time = 18942.54 ms / 5448 tokens ( 3.48 ms per token, 287.61 tokens per second)
eval time = 67139.57 ms / 1005 tokens ( 66.81 ms per token, 14.97 tokens per second)
4096/2048
---------
prompt eval time = 38205.18 ms / 19778 tokens ( 1.93 ms per token, 517.68 tokens per second)
eval time = 2174.93 ms / 50 tokens ( 43.50 ms per token, 22.99 tokens per second)
prompt eval time = 13614.43 ms / 5448 tokens ( 2.50 ms per token, 400.16 tokens per second)
eval time = 43261.12 ms / 758 tokens ( 57.07 ms per token, 17.52 tokens per second)
4096/4096
---------
prompt eval time = 27038.89 ms / 19777 tokens ( 1.37 ms per token, 731.43 tokens per second)
eval time = 2829.79 ms / 50 tokens ( 56.60 ms per token, 17.67 tokens per second)
prompt eval time = 11439.16 ms / 5448 tokens ( 2.10 ms per token, 476.26 tokens per second)
eval time = 47553.41 ms / 721 tokens ( 65.95 ms per token, 15.16 tokens per second)
8192/2048
---------
prompt eval time = 38968.52 ms / 19777 tokens ( 1.97 ms per token, 507.51 tokens per second)
eval time = 2740.70 ms / 50 tokens ( 54.81 ms per token, 18.24 tokens per second)
prompt eval time = 13936.96 ms / 5448 tokens ( 2.56 ms per token, 390.90 tokens per second)
eval time = 43170.25 ms / 667 tokens ( 64.72 ms per token, 15.45 tokens per second)
8192/4096
---------
prompt eval time = 26605.48 ms / 19778 tokens ( 1.35 ms per token, 743.38 tokens per second)
eval time = 2452.14 ms / 50 tokens ( 49.04 ms per token, 20.39 tokens per second)
prompt eval time = 11254.27 ms / 5448 tokens ( 2.07 ms per token, 484.08 tokens per second)
eval time = 53571.39 ms / 851 tokens ( 62.95 ms per token, 15.89 tokens per second)
8192/8192
---------
prompt eval time = 22755.85 ms / 19777 tokens ( 1.15 ms per token, 869.10 tokens per second)
eval time = 2282.77 ms / 50 tokens ( 45.66 ms per token, 21.90 tokens per second)
prompt eval time = 8729.87 ms / 5448 tokens ( 1.60 ms per token, 624.06 tokens per second)
eval time = 44432.20 ms / 755 tokens ( 58.85 ms per token, 16.99 tokens per second)
16384/8192
----------
prompt eval time = 22945.90 ms / 19776 tokens ( 1.16 ms per token, 861.85 tokens per second)
eval time = 2562.97 ms / 50 tokens ( 51.26 ms per token, 19.51 tokens per second)
prompt eval time = 8803.18 ms / 5448 tokens ( 1.62 ms per token, 618.87 tokens per second)
eval time = 50102.45 ms / 771 tokens ( 64.98 ms per token, 15.39 tokens per second)
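For anyone reading these logs, the tokens-per-second figures are just tokens divided by elapsed time; for example, the 8192/8192 run above:

```python
def tps(ms: float, tokens: int) -> float:
    """Tokens per second from a llama-server timing log line."""
    return tokens / (ms / 1000.0)

# 8192/8192 prompt-processing pass from the logs above:
print(f"{tps(22755.85, 19777):.2f} t/s prompt processing")
# 8192/8192 generation pass:
print(f"{tps(44432.20, 755):.2f} t/s generation")
```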
Sorry to jump in the middle here.
I want to share my F16 findings, as well as wrap this whole thing up from my end.
I'll start with the F16.
F16 vs BF16: 3 days of daily use
The F16 seems to run a little faster over extremely long context (e.g. 262144).
The F16 runs like a dream on my rtx 2060 6gb (which is the smallest card I have access to) - with context limitations, it's pretty amazing. I can run it with 32k context, f16 k and v cache, and get 16 tok/sec on that little gpu. So essentially, this whole experiment unlocks higher quality models for everybody. I know some people don't care about that, but I do.
For curiosity's sake: I can run this version on the 2060 with 102400 context and q4_0 for both the k and v cache. It's not as high quality... but it is basically just as fast, fully functional, and not delusional (arguably better than running a q4_0 k quant).
The F16 version (and f16 in general) is more precise and less dynamic than BF16. Incredibly precise, actually, and I think it has a significant use case for instruction following and workloads that require high precision. The lower dynamic range of the f16 variant makes it more of a 1:1 deal: what you tell it to do, it will do. Whereas, as I've explained earlier, the bf16 will do some thinking for you, making assumptions and filling in the blanks. So for coding you might need a few turns, but it will work through everything at a higher quality than a lower quality quant.
It's very good. Arguably better than q8_0 and lower, and the precision difference versus bf16 is negligible.
Dynamic range and creativity between the two is no contest: BF16 is superior.
F16 also runs on everybody's hardware and requires a little less computing power in general. Which makes it a community win.
At first I found the f16 version much more strict and less dynamic. But after using it exclusively for the past few days, I found that shifting the parameters, specifically top-k and top-p, a little bit increased the dynamic quality of the model and opened it up a little more without it going off the rails. The prompt also matters a lot.
[qwen3-coder-next]
model = /gguf_files/qwen3-coder-next-mxfp4.gguf
alias = qwen3-coder-next
n-gpu-layers = 48
n-cpu-moe = 512
ctx-size = 32768
cache-type-k = f16
cache-type-v = f16
batch-size = 256
ubatch-size = 128
jinja = true
temp = 0.1
top-p = 0.96
top-k = 40
min-p = 0.05
cache-prompt = true
cache-reuse = 256
flash-attn = true
sleep-idle-seconds = 21600
It's something that needs to be explored a little, but it does change the model's behaviour quite significantly.
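For anyone unfamiliar with the sampler knobs in that config: min-p (set to 0.05 above) keeps only tokens whose probability is at least min-p times the top token's probability. A toy sketch, not llama.cpp's actual implementation, over a made-up next-token distribution:

```python
def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * (top token's probability),
    then renormalize the survivors."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# Made-up next-token distribution:
probs = {"the": 0.50, "a": 0.30, "banana": 0.15, "xylophone": 0.05}
print(min_p_filter(probs, 0.05))  # permissive: every token survives
print(min_p_filter(probs, 0.50))  # strict: only "the" and "a" survive
```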
I don't think the F16 versions should be neglected, both for the universal compatibility and the precision of the output.
The F16, in my opinion, is universally usable, and is the highest quality that someone with the smallest of consumer hardware can run.
Conclusion
There are still a bunch of different quant types out there, and there's no one-size-fits-all for everybody's use case. But this method has shown that it's entirely possible, and usable, to significantly raise the quality of quantized models, and in my opinion that takes the standard of quality for people running quantized models to a whole new level.
And in terms of the question about whether speed is more important than smarts at a community level: my argument is no. Speed is not "more important". Look around at all the quantized models and benchmarks. They're all using significantly lower precision, and that is the community standard.
8/10 people have no idea what they're working with, or what they can have...
Many people just want a chatbot for roleplay and stuff, and they don't care how many errors it makes. Which is fine; people using q4, q3, or lower just to get a model running is fine. But what if they were shown, by any or all quantizers, that there is a higher quality format they could run? Instead of a dumb little chatbot, they'd get a more robust, higher quality model they can run locally.
as mancub said, "I see the whole experience being taken to the next level."
and as the other guy popped in and said: "a noticeable improvement in quality for < 3 GB of increased file size. I see this for logic and coherency, especially when coding."
I think this is what matters. Out of all the gguf quants, no other mxfp4 quants (or even k quants) gave this structure a shot. Everybody was happy with the conclusion of "it's faster, but it's dumber". But now we're able to run models that are almost the same size as q4_k_m, with a massive increase in quality and precision. Arguably as close to the original model's quality as a quantized model can achieve.
This whole experiment has shown that there is a way to increase the quality of the output without ballooning the model size and each variant of this method (f16 or bf16 depending on hardware or use case) is universally usable. Whether you're running an rtx 2060, or an h100.
This should matter because whether it's f16, or bf16, both are arguably better than q8_0 and lower. 16 bits is more than 8, and more than 4.
The community should not need to depend on chatgpt, or grok, or claude for everything. Most of the work can be handled by the latest smaller models, but that's not going to happen if we're having discussions like "is it really worth it to have a smarter, more precise model at the expense of token generation speed". Because as of today we have local 400b, 120b, 80b, and the new 35b models, all closing in on SOTA quality. People should be able to run these models with high quality and high precision, and their hardware limitations shouldn't be held against them. Increasing the quality of a model in a smaller format opens the door to innovation, growth, and development.
We can reduce the KV cache precision to get faster generation speed, or we can opt into f32/q8_0/mxfp4 or lower k quants.
If people want speed: reduce your k & v cache types to q4_0, run the f16, set your context to max, and let it rip.
But smarts, BF16/F16 quality, should not be kept from people with consumer level hardware.
So now we know that even for 4b models, regardless of size, there is a way to run them quantized at a much higher quality. And as I said earlier, that's what matters from all of this.
There is still room to tinker and explore. People can experiment with different layers; they can get creative. But the fact remains that increasing the precision of these weights significantly impacts the quality of the output. So everybody gets smarter local models.
- This is a new type of speculative processing applied to the KV cache. So if you have long coding sessions, for example, they get processed faster.
- Yeah, those batch sizes carried over from older tests I did. I must retest with the new models.
- I'm quantizing the KV cache so that it does not OOM on my 24GB of VRAM with that large a context size. If I go down to 128k I can leave it unquantized.
- If I put more layers in, it OOMs when the cache starts to fill up. With this, the model takes up 14GB of VRAM, and the cache takes the other half.
- The context-shift is so that it can slide over when the context fills up; it forgets the oldest tokens.
- I upped the slot-prompt-similarity a little, from the default 0.1 to 0.2, so that it can maybe catch similar tokens better, but I haven't really tried it out yet.
- The repeat-last-n is not used on this model, but I have it as a general option in my llama-swap config for models that need a little repeat-penalty. I upped it from the default 64 to 256 so that it can better catch longer repeats.
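To put rough numbers on the KV-cache/VRAM trade-off mentioned above: the cache grows linearly with context length and element width. The model dimensions below are hypothetical placeholders, not Qwen3-Coder-Next's real shape, and a q8_0 cache type would cut the 16-bit numbers roughly in half:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem):
    """K and V each store n_kv_heads * head_dim values per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# HYPOTHETICAL GQA shape for illustration only:
n_layers, n_kv_heads, head_dim = 48, 8, 128

full = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 262144, 2)  # 16-bit cache
print(f"16-bit cache @ 256k ctx: {full / 2**30:.0f} GiB")  # way past 24GB VRAM
half = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 131072, 2)  # 128k ctx
print(f"16-bit cache @ 128k ctx: {half / 2**30:.0f} GiB")
```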
Hey, forgot to add the min-p! Good catch... I only have one set of eyes... tired eyes.
@Shuasimodo Impressive write up. Thank you very much for your testing, and your dedication to this new method!
Maybe you should also post it to r/LocalLLaMA, so that more people can test it out.
If this is viable maybe from now on I will only quantize models in this particular format.
I already did the new models I posted in this FP4_BF16 variant, and cooking up the Qwen3.5 MoE's in FP4_BF16 also.
Thank you @noctrex and @Shuasimodo for your work on this!
I would say that this could be one of those monumental findings that needs to be broadcast far and wide. It definitely brings much needed quality to consumer grade systems, which was previously limited to the cloud (and paying whatever fees they wanted for access).
But, I also want to be a devil's advocate and look at the other side of the coin. I think most people want an easy recipe to get going with this, as they aren't mavericks on the forefront of innovation (or discovery).
The F16 vs BF16 is a good example: "which one do I use?"
Telling most people "it depends" will only push them back to other, possibly benchmaxed, models and water down the accomplishment we have here. Especially in today's environment of scarcity of local resources (RAM, GPU, storage...), being able to leverage what we have at home while gaining access to higher quality results is huge. Everyone needs to be educated that "faster is not better if it's dumber". I'm the first one guilty of it, until today at least.
So how does one decide which variant to select?! If I understood everything written above correctly, F16 will work on most hardware, yet it's not as diverse as BF16. Could we say then to all of those people looking for a quick start: for coding/agentic tasks use F16 variant with low Temps, and if you want to chit-chat use BF16 variant at the model defaults?
The other issue is the mere sizes of these models. Downloading 40GB+ each and testing dozens of models might not be feasible for many (most?) people. The rule before was: if the model is smaller than your VRAM you are kinda guaranteed it would fit on your GPU. But now with the recent developments all bets are off. One has to consider available VRAM, RAM and number of CPU threads in the system, only complicating things for an average user.
I guess I'm stating the obvious things above, but I'd hate to see this pass and not gain traction it deserves, IMHO.
One group I see that needs to consider this are the people/teams making all the community quants right now. If they would focus on delivering quality-tuned F16/BF16 variants, now that we know it's possible, we could have a single gold-standard quant with the best quality, without the need for so many other sub-par quants. This could free everyone's time to pursue other optimizations and improvements we've yet to discover, similar to what we've seen here, rather than constantly wondering whether to run Q1, Q3, IQ3, or Q4 (and whose model version at that) each time a new model is released.
On the other hand, it would make it easier to define a minimum standard PC configuration (be it Win/macOS/Linux) that just works with that "gold standard" quant at respectable performance levels. It would save the hundreds of questions being posted all the time about "What do I get if I want to run..."
The other group are the LLM apps purveyors: Llama.cpp, LM Studio, Ollama, Koboldcpp, vLLM etc. They need to educate their respective user bases about quality over quantity, and that we should not settle for less now that it's possible.
All of that is why it's all done out in the open in a public discussion. Up until a week ago, this was theoretically impossible by community standards, and it was laughable to even consider running a model of this size on low-end consumer hardware with any expectation of high quality output. So it's definitely something that's going to have to brew and seep its way into the community. People will figure out how to educate and sort it. Maybe some will release 3 versions (q8, f16, bf16), maybe others will just explain it, maybe others will release 10 variations of the same model; who knows at the moment. Overall I think it's more for... not so much a power user, but someone who's intellectually curious about quality in quantized models.
All GPUs support f16, which is why I pushed for it and gave it such an extensive test, even though bf16 is better.
- So I would say that f16 is better than q8 k quants and lower.
- F16 is probably where to start (unless you know you have higher end hardware)
- BF16 is all around better. Better for coding, better for creative/dynamic responses, etc...
- F16 is still very good, but if we get into the mathematical difference between the two: BF16 has a wider dynamic range than F16, while F16 is more precise than BF16, and that is where the difference lies.
- BF16 is also how most of these models are trained, which helps in that those weights aren't really being quantized or modified
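That range-vs-precision difference falls straight out of the bit layouts: fp16 is 1 sign / 5 exponent / 10 mantissa bits, bf16 is 1 / 8 / 7. A quick numpy sketch (bf16 emulated here by truncating a float32; real converters round to nearest):

```python
import numpy as np

# fp16: 1 sign / 5 exponent / 10 mantissa -> fine precision, small range (max ~65504)
# bf16: 1 sign / 8 exponent /  7 mantissa -> coarse precision, fp32-like range

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by truncating the low 16 bits of a float32."""
    bits = int(np.float32(x).view(np.uint32))
    return float(np.uint32(bits & 0xFFFF0000).view(np.float32))

print(float(np.float16(70000.0)))  # overflows fp16's range -> inf
print(to_bf16(70000.0))            # still representable in bf16
print(float(np.float16(1.001)))    # fp16 keeps most of the fraction
print(to_bf16(1.001))              # bf16 truncates this to 1.0
```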
As for the size, this is an 80b model. So it's kind of a whole new realm for running locally.
- will be a different story for the new 35b qwen model which I'm excited to try.
- will be an even different story for the new 122b-a10b (might be fine with f16, as the gpt oss 120b was)
- but we're talking less than 1/3 of the original model's size here, with exceptionally high quality.
As for which to select, and why: honestly I think noctrex nails it in the model description. The bf16 is essentially using unquantized weights and is closer to the original model, where the f16 is not.
- overall, the f16, in my opinion and from my testing so far, is better than k quants with lower bits.
- bartowski's model READMEs are a good way; they give a thorough explanation of what does what. https://huggingface.co/bartowski/openai_gpt-oss-120b-GGUF
- again, who knows. It's out of my hands; my work here is done (lol, sorry)
I think it's important to let this work its way into the community, and the community will figure it out.