# Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled — GGUF
GGUF quantizations of Jongsim/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled, a reasoning-enhanced abliterated Qwen3.5-35B-A3B (35B total / 3B active MoE).
All quantizations were produced with an importance matrix (imatrix) computed from wiki.train.raw (200 chunks), which improves quality retention at lower bit widths.
## Available Quantizations
| Quantization | File | Size | BPW | Description |
|---|---|---|---|---|
| Q8_0 | Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled-Q8_0.gguf | 36.9 GB | 8.52 | Highest quality, near-lossless |
| Q6_K | Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled-Q6_K.gguf | 28.5 GB | 6.57 | Excellent quality, good balance |
| Q5_K_M | Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled-Q5_K_M.gguf | 23.2 GB | 5.33 | Good quality, recommended for most users |
### Choosing a Quantization
- Q8_0: Use when you have ample memory and want minimal quality loss. Best for evaluation and comparison.
- Q6_K: Recommended balance of quality and size. Suitable for 64GB+ systems.
- Q5_K_M: Best for memory-constrained setups. The MoE architecture (only 3B active params per token) makes this model particularly efficient at lower quantizations.
Note: Because this is a Mixture-of-Experts model with only 3B active parameters per token out of 35B total, the full weight file must still fit in memory (RAM and/or VRAM), but per-token inference speed scales with the active parameter count.
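The file sizes in the table above follow directly from the bits-per-weight (BPW) figures. A rough back-of-the-envelope estimator (an illustrative sketch: the 35e9 parameter count is taken from this card, and GGUF metadata overhead is ignored):

```python
# Rough GGUF file-size estimate from bits-per-weight (BPW).
# Assumes ~35e9 total parameters (from this model card) and
# ignores metadata/tokenizer overhead in the file.

TOTAL_PARAMS = 35e9

def estimated_size_gb(bpw: float) -> float:
    """Approximate file size in gigabytes for a given BPW."""
    return TOTAL_PARAMS * bpw / 8 / 1e9

for name, bpw in [("Q8_0", 8.52), ("Q6_K", 6.57), ("Q5_K_M", 5.33)]:
    print(f"{name}: ~{estimated_size_gb(bpw):.1f} GB")
```

The estimates land within a few percent of the table, which is a useful sanity check when deciding whether a quant will fit on your hardware.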
## Quantization Details
| Parameter | Value |
|---|---|
| Source | BF16 GGUF (65 GB) |
| Quantizer | llama-quantize (llama.cpp) |
| Importance Matrix | imatrix.dat computed from wiki.train.raw |
| imatrix Chunks | 200 |
| imatrix Source Quant | Q8_0 (intermediate) |
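The table above corresponds roughly to the following llama.cpp workflow. This is a hedged sketch, not the exact commands used: file names are placeholders, and flags can vary between llama.cpp versions.

```shell
# 1. Produce the intermediate Q8_0 from the BF16 GGUF (names are placeholders)
./llama-quantize model-bf16.gguf model-Q8_0.gguf Q8_0

# 2. Compute the importance matrix over 200 chunks of wiki.train.raw
./llama-imatrix -m model-Q8_0.gguf -f wiki.train.raw --chunks 200 -o imatrix.dat

# 3. Generate the final quant guided by the imatrix
./llama-quantize --imatrix imatrix.dat model-bf16.gguf model-Q5_K_M.gguf Q5_K_M
```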
## Model Overview

This model was built in two stages:

```
Qwen/Qwen3.5-35B-A3B (original)
        │
        │ Heretic v1.2.0 (SOMA + MPOA abliteration)
        ▼
Jongsim/Qwen3.5-35B-A3B-heretic (abliterated base)
        │
        │ Supervised Fine-Tuning (LoRA + Unsloth)
        ▼
Jongsim/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled (this model)
```
### Key Features

- Abliterated (Uncensored): Censorship removed via directional ablation (93.4% refusal reduction, KL divergence = 0.064)
- Reasoning-Enhanced: SFT on ~3,191 Chain-of-Thought samples distilled from Claude 4.6 Opus
- Structured Output: Generates reasoning within `<think>...</think>` tags before the final response
- Efficient MoE: 35B total params, only 3B active per token (256 experts, 8 routed + 1 shared)
### Architecture
| Attribute | Value |
|---|---|
| Architecture | Qwen3.5 MoE (Gated DeltaNet + Gated Attention + MoE) |
| Total Parameters | 35B |
| Active Parameters | 3B per token |
| Hidden Dimension | 2,048 |
| Layers | 40 |
| Experts | 256 total, 8 routed + 1 shared active |
| Context Length | 262,144 tokens (native) |
| Vocabulary | 248,320 tokens |
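The expert counts in the table imply a very sparse routing pattern, which is why only ~3B of the 35B parameters are active per token. A toy calculation (not llama.cpp code) making that explicit:

```python
# Per-token expert utilization implied by the architecture table.
TOTAL_EXPERTS = 256
ROUTED_ACTIVE = 8   # experts selected by the router per token
SHARED_ACTIVE = 1   # always-on shared expert

active = ROUTED_ACTIVE + SHARED_ACTIVE
fraction = active / TOTAL_EXPERTS
print(f"{active}/{TOTAL_EXPERTS} experts active per token ({fraction:.1%})")
# → 9/256 experts active per token (3.5%)
```

Only about 3.5% of the expert weights participate in each forward pass; attention and shared components account for the remainder of the 3B active parameters.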
## Usage with llama.cpp

```shell
# Interactive chat
./llama-cli -m Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled-Q5_K_M.gguf \
    -ngl 99 -c 8192 --chat-template chatml -cnv

# Server mode
./llama-server -m Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled-Q5_K_M.gguf \
    -ngl 99 -c 8192 --chat-template chatml --port 8080
```
### Recommended Sampling Parameters
| Use Case | Temperature | Top-P | Top-K | Min-P |
|---|---|---|---|---|
| Thinking (general) | 1.0 | 0.95 | 20 | 0.0 |
| Thinking (coding) | 0.6 | 0.95 | 20 | 0.0 |
| Non-thinking (general) | 0.7 | 0.8 | 20 | 0.0 |
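With `llama-server` running in server mode, these presets can be passed through its OpenAI-compatible `/v1/chat/completions` endpoint. A minimal standard-library sketch; the preset names and the `build_request`/`send` helpers are illustrative, not part of any API:

```python
import json
import urllib.request

# Sampling presets from the table above (illustrative names).
PRESETS = {
    "thinking-general": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "thinking-coding":  {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non-thinking":     {"temperature": 0.7, "top_p": 0.8,  "top_k": 20, "min_p": 0.0},
}

def build_request(prompt: str, preset: str) -> dict:
    """Assemble a chat-completions payload with the chosen sampling preset."""
    return {"messages": [{"role": "user", "content": prompt}], **PRESETS[preset]}

def send(payload: dict, url: str = "http://localhost:8080/v1/chat/completions") -> dict:
    """POST the payload to a running llama-server instance."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires a running server):
# reply = send(build_request("Explain MoE routing briefly.", "thinking-coding"))
```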
### Example Output Format

The model produces structured reasoning:

```
<think>
Let me analyze this problem step by step.
1. First, I need to identify the core question being asked.
2. Then, I'll consider the relevant constraints and conditions.
3. Next, I'll work through the logic systematically.
4. Finally, I'll verify my reasoning for consistency.
[detailed reasoning follows...]
</think>
[final answer here]
```
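When consuming this output programmatically, the reasoning block can be split from the final answer. A minimal sketch that assumes exactly the tag format shown above:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a <think>...</think> response.

    If no think block is present, reasoning is empty and the whole
    text is treated as the answer.
    """
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

r, a = split_reasoning("<think>\nStep 1...\n</think>\nThe answer is 42.")
print(a)  # → The answer is 42.
```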
## Training Details

### Stage 1: Abliteration
| Parameter | Value |
|---|---|
| Method | SOMA + MPOA (Heretic v1.2.0) |
| Refusal Reduction | 93.4% (91 → 6 refusals out of 100) |
| KL Divergence | 0.0638 |
| Selected Trial | Trial 84 / 200 (Bayesian search) |
### Stage 2: Reasoning SFT
| Parameter | Value |
|---|---|
| Framework | Unsloth 2026.3.3 + TRL SFTTrainer |
| Method | LoRA (r=16, α=32) |
| Datasets | 3,191 samples from 3 reasoning distillation datasets |
| Epochs | 5 |
| Best Checkpoint | checkpoint-1995 (12% GSM8K accuracy on a 50-sample eval) |
| Learning Rate | 2e-4 (linear decay) |
| Max Seq Length | 2,048 |
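The LoRA settings in the table correspond to adding a low-rank delta to each adapted weight matrix, scaled by α/r (here 32/16 = 2.0). A toy pure-Python illustration of that merge rule, not the actual training code:

```python
# Toy LoRA merge: W' = W + (alpha / r) * (B @ A), shown with rank 1 for brevity.
ALPHA, R = 32, 16
SCALE = ALPHA / R  # 2.0, matching r=16, alpha=32 from the table

def matmul(B, A):
    """Multiply B (m x r) by A (r x n) using plain lists."""
    m, r, n = len(B), len(A), len(A[0])
    return [[sum(B[i][k] * A[k][j] for k in range(r)) for j in range(n)] for i in range(m)]

def lora_merge(W, B, A, scale=SCALE):
    """Return W + scale * (B @ A)."""
    delta = matmul(B, A)
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))] for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight
B = [[1.0], [0.0]]            # 2 x 1 low-rank factor
A = [[0.5, 0.5]]              # 1 x 2 low-rank factor
print(lora_merge(W, B, A))    # → [[2.0, 1.0], [0.0, 1.0]]
```

Because only the small A and B factors are trained, the adapter touches a tiny fraction of the 35B parameters, which is what makes single-node SFT of a model this size practical.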
## Related Models
| Model | Description |
|---|---|
| Jongsim/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled | Full bf16 safetensors (source model) |
| Jongsim/Qwen3.5-35B-A3B-heretic | Abliterated base (no reasoning SFT) |
| Jongsim/Qwen3.5-35B-A3B-heretic-GGUF | GGUF quants of the abliterated base |