Qwen3.5-35B-A3B-MLX-IQ4_M-NVFP4

Thanks to the high-quality quantization work from the likes of Unsloth, Bartowski, and others, GGUF models are noticeably more accurate per GB than MLX models.
The default MLX quantization strategy is too general: it naively quantizes every module in the LLM, unnecessarily reducing quality and long-context fidelity. Inspired by Qwen, GG-MLX IQ quantizes select LLM modules while leaving more sensitive parts untouched, resulting in remarkably high accuracy and coherence while dramatically reducing the memory footprint.
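The selective approach described above can be sketched with a per-module quantization predicate, the mechanism `mlx_lm.convert` exposes for mixed-precision recipes. The module-name patterns below are illustrative assumptions; the actual GG-MLX IQ recipe is not spelled out here.

```python
# Sketch of a selective quantization predicate (hypothetical recipe).
# Modules matching SENSITIVE patterns are left unquantized; everything
# else gets 4-bit quantization with a group size of 64.

SENSITIVE = ("embed_tokens", "lm_head", "norm")  # assumed sensitive modules

def iq_predicate(path, module=None, config=None):
    """Return per-module quantization settings.

    False -> leave this module in full precision
    dict  -> quantize with the given bits / group size
    """
    if any(key in path for key in SENSITIVE):
        return False
    return {"bits": 4, "group_size": 64}

# With mlx-lm installed, a predicate like this can be passed to convert(), e.g.:
# from mlx_lm import convert
# convert("some/hf-model", quantize=True, quant_predicate=iq_predicate)
```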

Evaluation dataset: allenai/tulu-3-sft-mixture (instruct/chat data)

Perplexity: 4.056 ± 0.026 (BF16 ~4.000)

Evaluation time: 144.19 seconds

Peak memory: 29.52 GB

Tokens per second: 907

Dataset statistics:

Total samples: 256

Total tokens: 131072
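For reference, the perplexity figure above is just the exponential of the mean per-token negative log-likelihood over the evaluation tokens. A minimal sketch with made-up per-token NLL values (not the real eval data):

```python
import math

# Perplexity = exp(mean negative log-likelihood over tokens).
# Synthetic per-token NLLs for illustration only.
nlls = [1.35, 1.42, 1.38, 1.45]
ppl = math.exp(sum(nlls) / len(nlls))  # exp(1.4) ≈ 4.055, near this card's 4.056
```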

Usage with MLX

```shell
# Install MLX and dependencies
pip install mlx-lm

# Run the generation CLI
python -m mlx_lm.generate --model GG-MLX/Qwen3.5-35B-A3B-MLX-IQ4_M-NVFP4 --prompt "Hello, how are you?" --temp 0.7
```

Or use the Python API:

```python
from mlx_lm import load, generate

model, tokenizer = load("GG-MLX/Qwen3.5-35B-A3B-MLX-IQ4_M-NVFP4")
response = generate(model, tokenizer, prompt="Explain quantum computing simply.", max_tokens=512)
print(response)
```
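For multi-turn chat, the prompt should follow the model's chat template. Qwen models typically use a ChatML-style format; in practice you would call `tokenizer.apply_chat_template()`, which reads the exact template shipped with the model, but a minimal sketch of the assumed format:

```python
# Sketch of a ChatML-style prompt, the template family Qwen models typically
# use. Prefer tokenizer.apply_chat_template() with the real tokenizer.

def format_chatml(messages):
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "".join(parts)

prompt = format_chatml([{"role": "user", "content": "Hello, how are you?"}])
```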