# Qwen3.5-9B Distilled OPUS Heretic - MLX-VLM 8bit
An 8-bit quantized MLX-VLM conversion of an abliterated Qwen3.5-9B model distilled from Claude Opus 4.6 reasoning traces, optimized for Apple Silicon inference.
**Size:** ~9.8 GB | **Bits/weight:** 8.864 | a good balance of quality and footprint
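The size and bits/weight figures above imply the parameter count. As a rough sanity check, using the numbers quoted in this card (and taking GB as 10^9 bytes):

```python
# Rough sanity check: on-disk size vs. bits per weight.
# Both values come from this model card.
size_bytes = 9.8e9
bits_per_weight = 8.864

params = size_bytes * 8 / bits_per_weight
print(f"~{params / 1e9:.1f}B parameters")  # consistent with a 9B-class model
```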
## Background
This model starts from Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled, a Qwen3.5-9B base fine-tuned via knowledge distillation from Claude Opus 4.6 to replicate its chain-of-thought reasoning style.
Abliteration was applied using the technique from Arditi et al. (2024), adapted with a custom script to handle the hybrid DeltaNet/full-attention architecture. The result is a model that retains strong reasoning and vision capabilities while removing refusal behavior.
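The directional-ablation idea from Arditi et al. can be sketched in a few lines of NumPy. This is an illustrative toy only, not the custom script used here (which must walk the hybrid DeltaNet/full-attention checkpoint and edit each weight matrix in place); all names and the random stand-in data are hypothetical:

```python
import numpy as np

def ablate(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the refusal direction r from the output of weight matrix W.

    W maps hidden states to hidden states (d x d); r is a direction (d,).
    W' = W - r r^T W, so W' x has zero component along r for any input x.
    """
    r = r / np.linalg.norm(r)
    return W - np.outer(r, r) @ W

# Toy refusal direction: difference of mean activations on "harmful" vs.
# "harmless" prompts (random stand-ins here).
rng = np.random.default_rng(0)
d = 16
mean_harmful, mean_harmless = rng.normal(size=d), rng.normal(size=d)
refusal_dir = mean_harmful - mean_harmless

W = rng.normal(size=(d, d))
W_ablated = ablate(W, refusal_dir)

# The ablated weights can no longer write anything along the refusal direction.
x = rng.normal(size=d)
r_hat = refusal_dir / np.linalg.norm(refusal_dir)
print(abs(r_hat @ (W_ablated @ x)))  # ~0
```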
The model was then converted to MLX-VLM format and quantized to 8-bit for Apple Silicon inference.
## Architecture
- Type: Qwen3_5ForConditionalGeneration (multimodal)
- Layers: 32 total — 24 linear attention (DeltaNet) + 8 full attention
- Hidden size: 4096 | Intermediate size: 12288
- Vision encoder: 27-layer ViT
- Inputs: Text, images, video
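The 24 + 8 split suggests a repeating interleave. Purely as an illustration (the actual pattern should be read from the checkpoint's `config.json`), assuming one full-attention layer per block of four:

```python
# Hypothetical hybrid layout: 32 layers, full attention every 4th layer.
# This just shows how a 24 DeltaNet / 8 full-attention split can arise;
# the real interleaving lives in the model config.
layer_types = [
    "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
    for i in range(32)
]
print(layer_types.count("linear_attention"), layer_types.count("full_attention"))
# 24 8
```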
## Confirmed Capabilities
- Vision: Correctly describes image content
- Reasoning: Step-by-step mathematical problem solving (e.g., integration by parts)
- Uncensored: Responds to sensitive prompts without refusal
## Usage
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Download (or load from cache) the quantized weights and processor
model_path = "andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-8bit"
model, processor = load(model_path)
config = load_config(model_path)

# Text-only generation
prompt = apply_chat_template(processor, config, "Your question here", num_images=0)
result = generate(model, processor, prompt, max_tokens=500)
print(result.text)

# Vision: build a one-image chat template and pass the image path(s)
prompt = apply_chat_template(processor, config, "Describe this image", num_images=1)
result = generate(model, processor, prompt, max_tokens=500, image=["image.jpg"])
print(result.text)
```
## Model Family
| Model | Size | Bits/Weight | Notes |
|---|---|---|---|
| andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-fp16 | ~4 GB | 16 | 2B, best quality |
| andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-8bit | ~2.1 GB | 8 | 2B, balanced |
| andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-4bit | ~1.2 GB | 4 | 2B, smallest |
| andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-fp16 | ~18 GB | 16 | 9B, best quality |
| andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-8bit | ~9.8 GB | 8.864 | This model |
| andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit | ~5.6 GB | 5.059 | 9B, smallest |
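When choosing between the variants above, a reasonable rule of thumb on Apple Silicon is to pick the largest model whose weights fit in unified memory with some headroom for the KV cache and the OS. A hypothetical helper using the on-disk sizes from the table (the 1.5x headroom factor is an assumption, not a measurement):

```python
# Sizes in GB, taken from the table above; names shortened for readability.
# Ordered largest-first so the first fit is the best-quality fit.
VARIANTS = [
    ("9B-fp16", 18.0),
    ("9B-8bit", 9.8),
    ("9B-4bit", 5.6),
    ("2B-fp16", 4.0),
    ("2B-8bit", 2.1),
    ("2B-4bit", 1.2),
]

def pick_variant(ram_gb: float, headroom: float = 1.5) -> str:
    """Pick the largest variant whose weights fit in ram_gb / headroom."""
    budget = ram_gb / headroom
    for name, size_gb in VARIANTS:
        if size_gb <= budget:
            return name
    return VARIANTS[-1][0]  # fall back to the smallest variant

print(pick_variant(16))  # a 16 GB Mac lands on 9B-8bit
```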
## Credits
- Base distillation: Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled
- Abliteration technique: Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction" (2024)
- MLX-VLM framework: Apple MLX-VLM
- Base model: Qwen/Qwen3.5-9B-Base