Would it make sense to get Qwen3-VL MXFP4 quants?
I like to keep Qwen3-VL-4B in my VRAM for quick analysis of Frigate images. I'm currently using the Q4_K_M quant of it in llama.cpp. Would an MXFP4 version of it be faster in any way on a 5060 Ti?
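For context on what MXFP4 actually is: it packs weights into 32-element blocks, each with one shared power-of-two scale (E8M0) and 32 FP4 (E2M1) elements. Below is a minimal from-scratch sketch of the idea; this is an illustration of the format, not llama.cpp's actual kernel, and the scale-rounding rule here is just one simple choice.

```python
import math

# Representable magnitudes of an FP4 E2M1 element (sign handled separately)
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mxfp4_block(block):
    """Quantize 32 floats to MXFP4: one shared power-of-two scale (E8M0)
    plus 32 FP4 (E2M1) elements. Simplified round-to-nearest illustration."""
    assert len(block) == 32
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * 32
    # Shared scale: smallest power of two such that amax / scale <= 6 (FP4 max)
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    q = []
    for x in block:
        mag = min(FP4_VALUES, key=lambda v: abs(abs(x) / scale - v))
        q.append(math.copysign(mag, x))
    return scale, q

def dequantize(scale, q):
    # Reconstruction is a single multiply per element
    return [scale * v for v in q]
```

Because the shared scale is constrained to a power of two, dequantization is extremely cheap, which is what hardware with native FP4 support exploits.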
Well it's a dense model and not a MoE one, but lemme check if I can cook up something...
@ampersandru what size in GB are you looking for?
I'm currently using your abliterated Q4_K_M version, which comes to 2.43 GB with the mmproj-F16 at 816 MB, and it has been great for simple Frigate image analysis. MXFP4 is new to me - would it be around the same size but perform better/faster with higher quality? Gemini says "Expect a noticeable jump in prefill speed when using MXFP4—which is critical for the Vision (VL) part of Qwen3 when it's "looking" at images."
Image analysis from Frigate to llama.cpp with Huihui-Qwen3-VL-4B-Instruct-abliterated-Q4_K_M.gguf and back to Frigate is already sub-1-second, so I'm extremely curious whether Gemini is right and it could get even faster. If we could go even smaller while retaining similar quality, then I'm all for it, since I keep this model in VRAM just for Frigate (and Home Assistant for weather radar analysis) and use other models for other things.
Thanks!
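As an aside, a rough back-of-envelope for expected file sizes. The bits-per-weight figures below are approximate averages commonly quoted for llama.cpp quant types, and 4.0e9 is a nominal parameter count, so real GGUF files will differ a bit:

```python
def quant_size_gb(n_params: float, bpw: float) -> float:
    """Rough GGUF file size: parameters * bits-per-weight / 8 bits per byte.
    Ignores metadata and per-tensor type overrides, so treat as a ballpark."""
    return n_params * bpw / 8 / 1e9

# Approximate average bits-per-weight for common llama.cpp quant types
BPW = {"Q4_K_M": 4.85, "IQ4_NL": 4.5, "IQ4_XS": 4.25, "MXFP4": 4.25}

for name, bpw in BPW.items():
    print(f"{name}: ~{quant_size_gb(4.0e9, bpw):.2f} GB")
```

The Q4_K_M estimate lands right around the 2.43 GB file I'm seeing, and it suggests MXFP4 and IQ4_XS should come out a little smaller.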
So I did an experimental quant; try it and see if it works: Huihui-Qwen3-VL-4B-Instruct-abliterated-MXFP4.gguf
If not, try IQ4_NL and then IQ4_XS.
Amazing! It all seems to be working, and it seems a little faster while using very similar VRAM to Q4_K_M.
logs confirm: llama_model_loader: - type mxfp4: 131 tensors
I've only tested a few images against the model in Open WebUI so far; I'll continue testing with Frigate.
Thank you so much!
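As an aside, those loader lines are easy to tally programmatically if anyone wants to sanity-check which quant types ended up in a file. A small sketch (the f32 line and its count in the sample log are made up for illustration; only the mxfp4 line comes from my logs):

```python
import re

def tensor_type_counts(log_text):
    """Parse llama.cpp model-loader lines like
    'llama_model_loader: - type mxfp4:  131 tensors'
    into a {quant_type: tensor_count} dict."""
    pat = re.compile(r"llama_model_loader: - type\s+(\S+):\s+(\d+) tensors")
    return {m.group(1): int(m.group(2)) for m in pat.finditer(log_text)}

log = """llama_model_loader: - type  f32:  181 tensors
llama_model_loader: - type mxfp4:  131 tensors"""

print(tensor_type_counts(log))  # → {'f32': 181, 'mxfp4': 131}
```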
Actually, the logs aren't printing "BLACKWELL_NATIVE_FP4", "1", so I'll look into this more.
https://github.com/srogmann/llama.cpp/commit/11d546691e152df250bb2b6305e6a9a16a70fce5
edit: sorry, this was a different branch.
It seems I might need NVIDIA driver 590 to get CUDA 13.1 features, and MXFP4 requires the newer CUDA? Unfortunately, I have to use the 580 open-source drivers in Unraid right now; the 590 drivers don't work.
Yep, after digging around, I will need to get on driver 590 once it's released for Blackwell cards for full MXFP4 support. Hopefully someone else can provide some testing for you in the meantime; otherwise I'll be back when it's updated. Thanks again!
Try one of the I-quants, IQ4_NL or IQ4_XS; they should be better than Q4.
Thanks! I tried IQ4_XS and it seems to run just as quickly while using less memory than Q4, and I've read it can be more accurate and provide better output than Q4? Good enough for me until I can install the 590 drivers!
Thanks. Yes, the I-quants should be used; they actually use the imatrix for important tensors, so they should also perform better.
I was using your IQ4 version up until the Qwen3.5-9B release, and it ran fast and really well. Thanks again!
Sorry to bother you in this old thread; I'm not sure how else to reach you! Since huihui is having trouble abliterating Qwen3.5-4B and 9B (https://x.com/support_huihui/status/2029057869598081152), while we wait, could you create IQ4_XS and MXFP4 quants of HauhauCS's Qwen3.5-9B-Uncensored-HauhauCS-Aggressive? https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive
Thanks!
They seem to claim that it has zero refusals and is lossless, which is something I have never witnessed in abliterated models.
There is ALWAYS some loss.
Unless they use some novel way of doing it, I find that hard to believe. I will wait for some benchmarks.
Yep, I was suspicious of that when I saw it. I've been using it and it doesn't seem to have loss based on my daily usage, though that isn't indicative of others' experience. Hopefully there'll be more benchmarks, or huihui can figure out the 9B and 4B!
huihui's 4B and 9B are out!
https://huggingface.co/huihui-ai/Huihui-Qwen3.5-9B-abliterated
Thanks in advance!
I'm sure that the quantizer mradermacher will get to them, like he did with the other ones.
I don't think there's a need to duplicate the same quantizations of the same model.
100% makes sense. It seems he doesn't do MXFP4 though; would you still do that for the 9B and 4B?
Those small models are dense models, and MXFP4 is only for MoE models, like the Huihui-Qwen3.5-35B-A3B-abliterated I already did.
Ahh gotcha, didn't know that. Thanks!
