Would it make sense to get Qwen3-VL MXFP4 quants?
I like to keep Qwen3-VL-4B in my VRAM for quick analysis of Frigate images. I'm currently using the Q4_K_M quant of it in llama.cpp. Would an MXFP4 version of it be faster in any way on a 5060 Ti?
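For context on what MXFP4 actually is: it packs weights into 32-element blocks, each with one shared power-of-two scale (E8M0) and 32 FP4 (E2M1) elements. Below is a minimal from-scratch sketch of the idea; this is an illustration of the format, not llama.cpp's actual kernel, and the scale-rounding rule here is just one simple choice.

```python
import math

# Representable magnitudes of an FP4 E2M1 element (sign handled separately)
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mxfp4_block(block):
    """Quantize 32 floats to MXFP4: one shared power-of-two scale (E8M0)
    plus 32 FP4 (E2M1) elements. Simplified round-to-nearest illustration."""
    assert len(block) == 32
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * 32
    # Shared scale: smallest power of two such that amax / scale <= 6 (FP4 max)
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    q = []
    for x in block:
        mag = min(FP4_VALUES, key=lambda v: abs(abs(x) / scale - v))
        q.append(math.copysign(mag, x))
    return scale, q

def dequantize(scale, q):
    # Reconstruction is a single multiply per element
    return [scale * v for v in q]
```

Because the shared scale is constrained to a power of two, dequantization is extremely cheap, which is what hardware with native FP4 support exploits.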
Well it's a dense model and not a MoE one, but lemme check if I can cook up something...
@ampersandru what size in GB are you looking for?
I'm currently using your abliterated Q4_K_M version, which comes to 2.43 GB with the mmproj-F16 at 816 MB, and it has been great for simple Frigate image analysis. MXFP4 is new to me - would it be around the same size but perform better/faster with higher quality? Gemini says "Expect a noticeable jump in prefill speed when using MXFP4—which is critical for the Vision (VL) part of Qwen3 when it's "looking" at images."
Image analysis from Frigate to llama.cpp with Huihui-Qwen3-VL-4B-Instruct-abliterated-Q4_K_M.gguf and back to Frigate is already sub-1-second, so I'm extremely curious whether Gemini is right and it could get even faster. If we could go even smaller while retaining similar quality, then I'm all for it, since I keep this model in VRAM just for Frigate (and Home Assistant for weather radar analysis) and use other models for other things.
Thanks!
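As an aside, a rough back-of-envelope for expected file sizes. The bits-per-weight figures below are approximate averages commonly quoted for llama.cpp quant types, and 4.0e9 is a nominal parameter count, so real GGUF files will differ a bit:

```python
def quant_size_gb(n_params: float, bpw: float) -> float:
    """Rough GGUF file size: parameters * bits-per-weight / 8 bits per byte.
    Ignores metadata and per-tensor type overrides, so treat as a ballpark."""
    return n_params * bpw / 8 / 1e9

# Approximate average bits-per-weight for common llama.cpp quant types
BPW = {"Q4_K_M": 4.85, "IQ4_NL": 4.5, "IQ4_XS": 4.25, "MXFP4": 4.25}

for name, bpw in BPW.items():
    print(f"{name}: ~{quant_size_gb(4.0e9, bpw):.2f} GB")
```

The Q4_K_M estimate lands right around the 2.43 GB file I'm seeing, and it suggests MXFP4 and IQ4_XS should come out a little smaller.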
So I did an experimental quant; try it and see if it works: Huihui-Qwen3-VL-4B-Instruct-abliterated-MXFP4.gguf
If not, try IQ4_NL and then IQ4_XS.
Amazing! It all seems to be working, and it seems a little faster while using very similar VRAM to Q4_K_M.
logs confirm: llama_model_loader: - type mxfp4: 131 tensors
I've only tested a few images against the model in Open WebUI so far; I'll continue testing with Frigate.
Thank you so much!
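As an aside, those loader lines are easy to tally programmatically if anyone wants to sanity-check which quant types ended up in a file. A small sketch (the f32 line and its count in the sample log are made up for illustration; only the mxfp4 line comes from my logs):

```python
import re

def tensor_type_counts(log_text):
    """Parse llama.cpp model-loader lines like
    'llama_model_loader: - type mxfp4:  131 tensors'
    into a {quant_type: tensor_count} dict."""
    pat = re.compile(r"llama_model_loader: - type\s+(\S+):\s+(\d+) tensors")
    return {m.group(1): int(m.group(2)) for m in pat.finditer(log_text)}

log = """llama_model_loader: - type  f32:  181 tensors
llama_model_loader: - type mxfp4:  131 tensors"""

print(tensor_type_counts(log))  # → {'f32': 181, 'mxfp4': 131}
```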
Actually, the logs aren't printing "BLACKWELL_NATIVE_FP4", "1", so I'll look into this more.
https://github.com/srogmann/llama.cpp/commit/11d546691e152df250bb2b6305e6a9a16a70fce5
edit: sorry, this was a different branch.
It seems I might need NVIDIA driver 590 to get CUDA 13.1 features, and MXFP4 requires the newer CUDA? Unfortunately, I have to use the 580 open-source drivers in Unraid right now; the 590 drivers don't work.
Yep, after digging around, I will need to get on driver 590 once it's released for Blackwell cards for full MXFP4 support. Hopefully someone else can provide some testing for you in the meantime; otherwise I'll be back when it's updated. Thanks again!
Try one of the I-quants, IQ4_NL or IQ4_XS; they should be better than Q4.
Thanks! I tried IQ4_XS and it seems to run just as quickly while using less memory than Q4, and I've read it can be more accurate and provide better output than Q4? Good enough for me until I can install the 590 drivers!
Thanks. Yes, the I-quants should be used; they actually use the imatrix for important tensors, so they should also perform better.
I was using your IQ4 version up until the Qwen3.5-9B release, and it ran fast and really well. Thanks again!
Sorry to bother you in this old thread; I'm not sure how else to reach you! Since huihui is having trouble abliterating Qwen3.5-4B and 9B (https://x.com/support_huihui/status/2029057869598081152), while we wait, could you create IQ4_XS and MXFP4 quants of HauhauCS's Qwen3.5-9B-Uncensored-HauhauCS-Aggressive? https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive
Thanks!
They seem to claim that it has zero refusals and is lossless, which is something I have never witnessed in abliterated models.
There is ALWAYS some loss.
Unless they use some novel way of doing it, I find that hard to believe. I will wait for some benchmarks.
Yep, I was suspicious of that when I saw it. I've been using it and it doesn't seem to have loss based on my daily usage, though that isn't indicative of others' experience. Hopefully there'll be more benchmarks, or huihui can figure out the 9B and 4B!
huihui's 4B and 9B are out!
https://huggingface.co/huihui-ai/Huihui-Qwen3.5-9B-abliterated
Thanks in advance!
I'm sure that the quantizer mradermacher will get to them, like he did with the other ones.
I don't think there's a need to duplicate the same quantizations of the same model.
100% makes sense. It seems he doesn't do MXFP4 though; would you still do that for the 9B and 4B?
Those small models are dense models, and MXFP4 is only for MoE models, like the Huihui-Qwen3.5-35B-A3B-abliterated I already did.
Ahh gotcha, didn't know that. Thanks!
