Qwen3.5-35B-A3B-MLX-IQ4_M-NVFP4

Thanks to the high-quality quantization work from the likes of Unsloth, Bartowski, and others, GGUF models are noticeably more accurate per GB than MLX models.
The default MLX quantization strategy is too general: it naively quantizes every module in the LLM, unnecessarily reducing quality and long-context fidelity. Inspired by Qwen, GG-MLX IQ quantizes select LLM modules while leaving more sensitive parts untouched, resulting in remarkably high accuracy and coherence while dramatically reducing the memory footprint.
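The selective approach described above can be sketched with a per-module quantization predicate, the mechanism `mlx_lm.convert` exposes for mixed-precision recipes. The module-name patterns below are illustrative assumptions; the actual GG-MLX IQ recipe is not spelled out here.

```python
# Sketch of a selective quantization predicate (hypothetical recipe).
# Modules matching SENSITIVE patterns are left unquantized; everything
# else gets 4-bit quantization with a group size of 64.

SENSITIVE = ("embed_tokens", "lm_head", "norm")  # assumed sensitive modules

def iq_predicate(path, module=None, config=None):
    """Return per-module quantization settings.

    False -> leave this module in full precision
    dict  -> quantize with the given bits / group size
    """
    if any(key in path for key in SENSITIVE):
        return False
    return {"bits": 4, "group_size": 64}

# With mlx-lm installed, a predicate like this can be passed to convert(), e.g.:
# from mlx_lm import convert
# convert("some/hf-model", quantize=True, quant_predicate=iq_predicate)
```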

Evaluation dataset: allenai/tulu-3-sft-mixture (instruct/chat data)

Perplexity: 4.056 ± 0.026 (BF16 ~4.000)

Evaluation time: 144.19 seconds

Peak memory: 29.52 GB

Tokens per second: 907

Dataset statistics:

Total samples: 256

Total tokens: 131072
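For reference, the perplexity figure above is just the exponential of the mean per-token negative log-likelihood over the evaluation tokens. A minimal sketch with made-up per-token NLL values (not the real eval data):

```python
import math

# Perplexity = exp(mean negative log-likelihood over tokens).
# Synthetic per-token NLLs for illustration only.
nlls = [1.35, 1.42, 1.38, 1.45]
ppl = math.exp(sum(nlls) / len(nlls))  # exp(1.4) ≈ 4.055, near this card's 4.056
```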

Usage with MLX

```shell
# Install MLX and dependencies
pip install mlx-lm

# Run the generation CLI
python -m mlx_lm.generate --model GG-MLX/Qwen3.5-35B-A3B-MLX-IQ4_M-NVFP4 --prompt "Hello, how are you?" --temp 0.7
```

Or use the Python API:

```python
from mlx_lm import load, generate

model, tokenizer = load("GG-MLX/Qwen3.5-35B-A3B-MLX-IQ4_M-NVFP4")
response = generate(model, tokenizer, prompt="Explain quantum computing simply.", max_tokens=512)
print(response)
```
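For multi-turn chat, the prompt should follow the model's chat template. Qwen models typically use a ChatML-style format; in practice you would call `tokenizer.apply_chat_template()`, which reads the exact template shipped with the model, but a minimal sketch of the assumed format:

```python
# Sketch of a ChatML-style prompt, the template family Qwen models typically
# use. Prefer tokenizer.apply_chat_template() with the real tokenizer.

def format_chatml(messages):
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "".join(parts)

prompt = format_chatml([{"role": "user", "content": "Hello, how are you?"}])
```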