Qwen3.5-35B-A3B optimized for MLX. This quant supports image input and requires a vision-enabled MLX server.

For the non-vision model: https://huggingface.co/spicyneuron/Qwen3.5-35B-A3B-MLX-4.8bit

EDIT: Updated the chat template to improve prompt caching.

Usage

# Start server at http://localhost:8080/chat/completions
uvx --from mlx-vlm --with torchvision \
  mlx_vlm.server \
  --host 127.0.0.1 \
  --port 8080 \
  --model spicyneuron/Qwen3.5-35B-A3B-MLX-4.9bit-vision
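
Once the server is up, you can send it an image. A minimal sketch in Python, assuming the endpoint accepts the OpenAI-style chat completions payload (content parts with image_url) and using the requests library; the image path is a placeholder:

# Send a one-image chat request to the local server.
# Assumes an OpenAI-compatible payload; "example.jpg" is a placeholder path.
import base64
import requests

with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://localhost:8080/chat/completions",
    json={
        "model": "spicyneuron/Qwen3.5-35B-A3B-MLX-4.9bit-vision",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        "max_tokens": 256,
    },
    timeout=300,
)
print(response.json()["choices"][0]["message"]["content"])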

Methodology

Quantized using a custom script inspired by Unsloth/AesSedai/ubergarm-style mixed-precision GGUFs. MLX's quantization options differ from llama.cpp's, but the principles are the same (see the sketch after this list):

  • Sensitive layers like MoE routing, attention, and output embeddings get higher precision
  • More tolerant layers like MoE experts get lower precision
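
For illustration only: mixed precision like this can be expressed through mlx-lm's convert() and its quant_predicate hook, which lets a callable choose per-layer quantization settings. The layer-name patterns, bit widths, and upstream repo id below are assumptions for the sketch, not the exact recipe behind this quant:

# Sketch of mixed-precision quantization via mlx-lm's quant_predicate hook.
# Layer-name patterns and bit widths are illustrative, not this model's recipe.
from mlx_lm import convert

def mixed_precision_predicate(path, module, config):
    # Skip modules that can't be quantized (only Linear/Embedding layers can).
    if not hasattr(module, "to_quantized"):
        return False
    # Sensitive layers: MoE router, attention, output embeddings -> more bits.
    if path.endswith("mlp.gate") or "self_attn" in path or "lm_head" in path:
        return {"group_size": 64, "bits": 6}
    # Tolerant layers: MoE expert weights -> fewer bits.
    if "experts" in path:
        return {"group_size": 64, "bits": 4}
    # Everything else falls back to the default quantization settings.
    return True

convert(
    "Qwen/Qwen3.5-35B-A3B",  # placeholder upstream repo id
    mlx_path="mlx_model_mixed",
    quantize=True,
    quant_predicate=mixed_precision_predicate,
)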