Quantization Method

#1
by x-polyglot-x - opened

Hey there,

Can you talk about your quantization method? I am curious to learn more about the mixed precision setup. Can you provide any code examples?

Thanks!

Of course! My earliest attempts "cloned" exact settings from Unsloth GGUFs (https://github.com/spicyneuron/gguf-clone), and I've also tried empirical analysis to identify sensitive weights (similar to https://github.com/baa-ai/MINT).

But I eventually found that uploading config.json and safetensors.index.json to Claude / Codex, and then discussing model architecture, arrives at 90%+ of the same recommendations.

The typical process is:

  • Create a "maximized" 4-bit version where all potentially sensitive layers are BF16
  • Incrementally drop precision (BF16 → 8 → 6) for each trial
  • Look for a clear best size / speed / quality tradeoff
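The "maximized → incrementally lower" process above can be sketched with mlx-lm's Python API, which accepts a per-layer `quant_predicate` callable during conversion. This is only an illustration: the layer-name substrings and bit choices below are hypothetical examples, not settings from any specific model.

```python
# Sketch of one trial in the mixed-precision search, assuming mlx-lm's
# convert() with a quant_predicate callable. Layer-name keys here are
# illustrative, not recommendations for a particular architecture.

# Layers commonly treated as quantization-sensitive; in the first
# "maximized" pass these stay in BF16.
SENSITIVE = ("embed_tokens", "lm_head", "norm")

def quant_predicate(path, module, config):
    """Return False to leave a layer unquantized (BF16), True to use the
    default bits, or a dict to override bits/group size for that layer."""
    if any(key in path for key in SENSITIVE):
        return False  # keep sensitive layers in BF16
    if "down_proj" in path:
        # One incremental trial: drop these from BF16 to 6-bit.
        return {"bits": 6, "group_size": 64}
    return True  # everything else gets the default 4-bit

# Hypothetical conversion call for one trial:
# from mlx_lm import convert
# convert("some-org/some-model", mlx_path="model-mixed-trial",
#         quantize=True, q_bits=4, quant_predicate=quant_predicate)
```

Each trial then just swaps which substrings map to which bit widths, and the benchmark/eval suite below compares the results.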

For evaluation, I run 500 tasks each from hellaswag, piqa, and winogrande, as well as perplexity and throughput tests:

```shell
mlx_lm.benchmark --model "$model" --prompt-tokens 1024 --generation-tokens 512 --num-trials 5
mlx_lm.perplexity --model "$model" --sequence-length 1024 --seed 123
mlx_lm.evaluate --model "$model" --task hellaswag --seed 123 --limit 500
mlx_lm.evaluate --model "$model" --task piqa --seed 123 --limit 500
mlx_lm.evaluate --model "$model" --task winogrande --seed 123 --limit 500
```

And depending on model size, I'll sometimes pick both a "best tradeoff" and a "smallest acceptable" version.

Still awaiting maintainer review, but here's the mlx-lm PR for granular quantization overrides: https://github.com/ml-explore/mlx-lm/pull/922

Let me know if you end up trying this yourself!

Thanks for such a descriptive reply! Such a clear style of communication, and an iterative process to find the best model? So scientific! I love it. I won't pretend to understand it all, but I appreciate your thoroughness and willingness to reply.

The concept makes sense: Preserve the most important layers at full precision, and then use mixed precision for the others to find the best balance of size and quality.

I'd definitely love to give it a try sometime. I'm very new to quantization, so this helps and is encouraging. I'm hoping to get more memory and compute for MLX when the new Mac Studios are released, so that I can run these huge models at decent speeds.
