Qwen3.5-35B-A3B-MLX-IQ4_M-NVFP4
Thanks to the high-quality quantization work from the likes of Unsloth and Bartowski, GGUF models are noticeably more accurate per GB than MLX models.
The default MLX quantization strategy is too general: it naively quantizes every module in the LLM, unnecessarily reducing quality and long-context fidelity.
Inspired by Qwen's approach, GG-MLX IQ quantizes select LLM modules while leaving more sensitive parts untouched, yielding remarkably high accuracy and coherence while dramatically reducing the memory footprint.
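The selective strategy above can be sketched as a quantization predicate: a function that maps each module path to a bit width, or skips quantization entirely for sensitive modules. The module names and bit choices below are illustrative assumptions, not the exact GG-MLX IQ recipe; mlx-lm's `convert` accepts a callback of this general shape for mixed quantization.

```python
# Illustrative mixed-quantization predicate (assumed recipe, NOT the exact
# GG-MLX IQ configuration). Returns a bits/group-size dict for modules to
# quantize, or False to keep a module at full precision.
def quant_predicate(path: str):
    # Keep the most sensitive modules unquantized (illustrative choices):
    # embeddings, output head, and normalization layers.
    if "embed_tokens" in path or "lm_head" in path or "norm" in path:
        return False
    # MoE router weights are tiny but accuracy-critical: give them more bits.
    if "gate" in path and "proj" not in path:
        return {"bits": 8, "group_size": 64}
    # Bulk attention/expert projections get aggressive 4-bit quantization.
    return {"bits": 4, "group_size": 64}

print(quant_predicate("model.layers.0.mlp.experts.0.up_proj"))  # 4-bit
print(quant_predicate("model.embed_tokens"))                    # False
```

The point is that most of the model's parameters sit in the expert/attention projections, so quantizing only those to 4-bit captures nearly all of the memory savings while the few sensitive modules stay at higher precision.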
Perplexity evaluated on allenai/tulu-3-sft-mixture (instruct/chat data):
Perplexity: 4.056 ± 0.026 (BF16 ~4.000)
Evaluation time: 144.19 seconds
Peak memory: 29.52 GB
Tokens per second: 907
Dataset statistics:
Total samples: 256
Total tokens: 131072
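For reference, the perplexity figure above is the exponential of the mean per-token negative log-likelihood over the evaluation set. A minimal sketch (the NLL values are made up for illustration, not taken from the actual evaluation):

```python
import math

def perplexity(token_nlls):
    # Perplexity = exp(mean negative log-likelihood per token).
    return math.exp(sum(token_nlls) / len(token_nlls))

# Toy per-token NLL values for illustration only.
nlls = [1.2, 1.6, 1.4, 1.4]
print(round(perplexity(nlls), 3))  # exp(1.4) ≈ 4.055
```

A quantized model's perplexity being within ~1.4% of the BF16 baseline (4.056 vs ~4.000) indicates very little quality loss from quantization.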
Usage with MLX
```bash
# Install MLX and dependencies
pip install mlx-lm

# Run chat interface
python -m mlx_lm.generate --model GG-MLX/Qwen3.5-35B-A3B-MLX-IQ4_M-NVFP4 --prompt "Hello, how are you?" --temp 0.7
```

Or use the Python API:

```python
from mlx_lm import load, generate

model, tokenizer = load("GG-MLX/Qwen3.5-35B-A3B-MLX-IQ4_M-NVFP4")
response = generate(model, tokenizer, prompt="Explain quantum computing simply.", max_tokens=512)
print(response)
```
Model size: 35B params
Tensor types: BF16 · U8 · U32 · F32