Quick question: I noticed your MLX 3-bit variant sits at ~24GB, while GGUF’s Q3_K_S is only ~12.4GB?

#1
by realperson1234 - opened

Hi Daniel and the Unsloth team,

Thank you for your amazing efforts and consistent work on Unsloth. The recent 2-bit Qwen3.6-27B showcase (the one that made 26 tool calls, Reddit post) really highlights the potential of everything you're building. Kudos to the entire team! 👏

Quick question: I noticed your MLX 3-bit variant sits at ~24GB, while GGUF’s Q3_K_S is only ~12.4GB (UD-Q2_K_XL is ~11.8GB). I also see that some GGUF/MLX variants already match that footprint (e.g., Qwen3.6-27B-oQ2 mlx at 11.4GB, Qwen3.6-27B-oQ4 mlx at 16.7GB). I’m curious:

  1. What’s driving the ~2x size difference in Unsloth’s current MLX quantization pipeline compared to GGUF’s compression strategy?
  2. With native MLX still pending in Studio, what’s the expected timeline for tighter size/accuracy parity, and will Unsloth align its MLX quantization naming/strategy with GGUF’s more compressed families?

Looking forward to your insights. Thanks again for the incredible work pushing local AI forward!

So for the MLX quants, not every layer is in 3-bit. Some layers are sensitive to quantization, so we keep them in 8-bit or 16-bit to maintain accuracy. That is why the 3-bit model is ~24GB.
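To make that concrete, here is a minimal back-of-the-envelope sketch. The layer split and the 27B total parameter count below are illustrative assumptions, not Unsloth's actual recipe; the point is just that keeping a fraction of the weights at 8-bit/16-bit pushes the average bits per weight (and the file size) well above a pure 3-bit quant:

```python
# Hypothetical mixed-precision split for a 27B-parameter model.
# (Illustrative numbers only; the real recipe keeps whichever layers
# are most quantization-sensitive at higher precision.)
params_3bit  = 18e9   # bulk of the weights stored at 3-bit
params_8bit  = 6e9    # quantization-sensitive layers kept at 8-bit
params_16bit = 3e9    # e.g. embeddings / norms kept at 16-bit

total_params = params_3bit + params_8bit + params_16bit
total_bits   = params_3bit * 3 + params_8bit * 8 + params_16bit * 16

print(f"size: ~{total_bits / 8 / 1e9:.1f} GB")           # ~18.8 GB
print(f"avg:  ~{total_bits / total_params:.1f} bits/w")   # ~5.6 bits/weight
print(f"pure 3-bit would be ~{total_params * 3 / 8 / 1e9:.1f} GB")  # ~10.1 GB
```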

This is probably an effective ~4 bits per weight or so. I don't think there's any way I can run this on my Mac mini M4 Pro with just 24GB. ☹️

You can still try out the GGUFs using Unsloth Studio or llama.cpp.

So for the MLX quants, not every layer is in 3-bit. Some layers are sensitive to quantization, so we keep them in 8-bit or 16-bit

Yep, I understand that, but here's the thing: your 35B UD 3-bit MLX model here https://huggingface.co/unsloth/Qwen3.6-35B-A3B-UD-MLX-3bit is only 17.4GB, so why is the 27B UD MLX 3-bit bigger than that? Either it isn't really 3-bit and is more like 6-bit, or something seems off, is all I'm calling out.
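For reference, here is a quick effective bits-per-weight estimate from the sizes quoted in this thread. It assumes the 27B/35B in the model names are total parameter counts and that the sizes are decimal GB; both are assumptions on my part:

```python
def effective_bpw(file_size_gb: float, total_params_b: float) -> float:
    """Rough effective bits per weight: file size in bits over parameter count."""
    return file_size_gb * 1e9 * 8 / (total_params_b * 1e9)

# Sizes as quoted in the thread; parameter counts read off the model names (assumed).
print(effective_bpw(17.4, 35))  # Qwen3.6-35B-A3B UD-MLX-3bit -> ~4.0 bits/weight
print(effective_bpw(24.0, 27))  # Qwen3.6-27B MLX 3-bit       -> ~7.1 bits/weight
```

Under those assumptions the two "3-bit" quants land at very different effective precisions, which is essentially the discrepancy being raised here.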
