Qwen3.5-9B Distilled OPUS Heretic - MLX-VLM 8bit

8-bit quantized MLX-VLM conversion of an abliterated Qwen3.5-9B model distilled from Claude Opus 4.6 reasoning, optimized for Apple Silicon.

Size: ~9.8 GB | Bits/weight: 8.864 | Trade-off: good balance of quality and size

Background

This model starts from Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled, a Qwen3.5-9B base fine-tuned via knowledge distillation from Claude Opus 4.6 to replicate its chain-of-thought reasoning style.

Abliteration was applied using the technique from Arditi et al. (2024), adapted with a custom script to handle the hybrid DeltaNet/full-attention architecture. The result is a model that retains strong reasoning and vision capabilities while removing refusal behavior.
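At its core, the Arditi et al. (2024) technique estimates a "refusal direction" in activation space and projects it out of the model's weight matrices. The NumPy sketch below is conceptual only; the actual custom script for the hybrid DeltaNet/full-attention layout is not reproduced here, and the function and variable names are illustrative.

```python
import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Project the refusal direction r out of a weight matrix's output space.

    After ablation, any output y = W @ x has zero component along r,
    so the layer can no longer write to that direction.
    """
    r = r / np.linalg.norm(r)          # unit refusal direction
    return W - np.outer(r, r) @ W      # W' = (I - r r^T) W

# Toy example: a random layer and a random "refusal direction"
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
r = rng.normal(size=8)
W_abl = ablate_direction(W, r)

# The ablated weights produce outputs orthogonal to r
r_unit = r / np.linalg.norm(r)
print(np.allclose(r_unit @ W_abl, 0.0))  # True
```

In practice the refusal direction is estimated from the difference in mean activations between harmful and harmless prompts, and the projection is applied to every layer that writes into the residual stream.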

The model was then converted to MLX-VLM format and quantized to 8-bit for Apple Silicon inference.
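For reference, an 8-bit MLX-VLM conversion is typically done with the `mlx_vlm.convert` entry point. The command below is an illustrative fragment, not the exact command used: the input path placeholder stands in for the abliterated weights, and flag names can differ across mlx-vlm versions.

```shell
# Illustrative only; exact flags depend on the installed mlx-vlm version
python -m mlx_vlm.convert \
  --hf-path <path-to-abliterated-model> \
  --mlx-path Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-8bit \
  -q --q-bits 8
```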

Architecture

  • Type: Qwen3_5ForConditionalGeneration (multimodal)
  • Layers: 32 total — 24 linear attention (DeltaNet) + 8 full attention
  • Hidden size: 4096 | Intermediate size: 12288
  • Vision encoder: 27-layer ViT
  • Inputs: Text, images, video
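The ~9.8 GB footprint follows directly from the effective bits per weight. A quick back-of-the-envelope check, assuming roughly 9B weight parameters (an approximation; the listed size suggests the true count is slightly under a flat 9e9):

```python
def estimated_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# ~9B parameters at an effective 8.864 bits/weight
print(round(estimated_size_gb(9e9, 8.864), 2))  # 9.97
```

The effective rate is above a flat 8 bits/weight because quantization metadata (scales and biases per group) and any non-quantized tensors add overhead.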

Confirmed Capabilities

  • Vision: Correctly describes image content
  • Reasoning: Step-by-step mathematical problem solving (e.g., integration by parts)
  • Uncensored: Responds to sensitive prompts without refusal

Usage

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-8bit"
model, processor = load(model_path)
config = load_config(model_path)

# Text-only
prompt = apply_chat_template(processor, config, "Your question here", num_images=0)
result = generate(model, processor, prompt, max_tokens=500)
print(result.text)

# Vision
prompt = apply_chat_template(processor, config, "Describe this image", num_images=1)
result = generate(model, processor, prompt, max_tokens=500, image=["image.jpg"])
print(result.text)

Model Family

Model                                                     Size      Bits/Weight  Notes
andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-fp16    ~4 GB     16           2B, best quality
andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-8bit    ~2.1 GB   8            2B, balanced
andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-4bit    ~1.2 GB   4            2B, smallest
andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-fp16    ~18 GB    16           9B, best quality
andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-8bit    ~9.8 GB   8.864        This model
andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit    ~5.6 GB   5.059        9B, smallest

Credits

  • Base model: Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled (Qwen3.5-9B distilled from Claude Opus 4.6 reasoning traces)
  • Abliteration: technique from Arditi et al. (2024), adapted for the hybrid DeltaNet/full-attention architecture
  • Conversion and quantization: MLX-VLM, 8-bit
