# Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled — GGUF
GGUF quantizations of Jongsim/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled, a reasoning-enhanced abliterated Qwen3.5-35B-A3B (35B total / 3B active MoE).
All quantizations were produced with an importance matrix (imatrix) computed from wiki.train.raw (200 chunks), which improves quality retention at lower bit widths.
## Available Quantizations
| Quantization | File | Size | BPW | Description |
|---|---|---|---|---|
| Q8_0 | Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled-Q8_0.gguf | 36.9 GB | 8.52 | Highest quality, near-lossless |
| Q6_K | Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled-Q6_K.gguf | 28.5 GB | 6.57 | Excellent quality, good balance |
| Q5_K_M | Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled-Q5_K_M.gguf | 23.2 GB | 5.33 | Good quality, recommended for most users |
### Choosing a Quantization
- Q8_0: Use when you have ample memory and want minimal quality loss. Best for evaluation and comparison.
- Q6_K: Recommended balance of quality and size. Suitable for 64GB+ systems.
- Q5_K_M: Best for memory-constrained setups. The MoE architecture (only 3B active params per token) makes this model particularly efficient at lower quantizations.
Note: Because this is a Mixture-of-Experts model with only 3B active parameters per token out of 35B total, the full weight file must still fit in memory (RAM and/or VRAM), but per-token inference speed scales with the active parameter count.
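The file sizes in the table above follow directly from the bits-per-weight (BPW) figures. A rough back-of-the-envelope estimator (an illustrative sketch: the 35e9 parameter count is taken from this card, and GGUF metadata overhead is ignored):

```python
# Rough GGUF file-size estimate from bits-per-weight (BPW).
# Assumes ~35e9 total parameters (from this model card) and
# ignores metadata/tokenizer overhead in the file.

TOTAL_PARAMS = 35e9

def estimated_size_gb(bpw: float) -> float:
    """Approximate file size in gigabytes for a given BPW."""
    return TOTAL_PARAMS * bpw / 8 / 1e9

for name, bpw in [("Q8_0", 8.52), ("Q6_K", 6.57), ("Q5_K_M", 5.33)]:
    print(f"{name}: ~{estimated_size_gb(bpw):.1f} GB")
```

The estimates land within a few percent of the table, which is a useful sanity check when deciding whether a quant will fit on your hardware.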
## Quantization Details
| Parameter | Value |
|---|---|
| Source | BF16 GGUF (65 GB) |
| Quantizer | llama-quantize (llama.cpp) |
| Importance Matrix | imatrix.dat computed from wiki.train.raw |
| imatrix Chunks | 200 |
| imatrix Source Quant | Q8_0 (intermediate) |
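The table above corresponds roughly to the following llama.cpp workflow. This is a hedged sketch, not the exact commands used: file names are placeholders, and flags can vary between llama.cpp versions.

```shell
# 1. Produce the intermediate Q8_0 from the BF16 GGUF (names are placeholders)
./llama-quantize model-bf16.gguf model-Q8_0.gguf Q8_0

# 2. Compute the importance matrix over 200 chunks of wiki.train.raw
./llama-imatrix -m model-Q8_0.gguf -f wiki.train.raw --chunks 200 -o imatrix.dat

# 3. Generate the final quant guided by the imatrix
./llama-quantize --imatrix imatrix.dat model-bf16.gguf model-Q5_K_M.gguf Q5_K_M
```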
## Model Overview

This model was built in two stages:

```
Qwen/Qwen3.5-35B-A3B (original)
        │
        │ Heretic v1.2.0 (SOMA + MPOA abliteration)
        ▼
Jongsim/Qwen3.5-35B-A3B-heretic (abliterated base)
        │
        │ Supervised Fine-Tuning (LoRA + Unsloth)
        ▼
Jongsim/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled (this model)
```
### Key Features

- Abliterated (Uncensored): Censorship removed via directional ablation (93.4% refusal reduction, KL divergence = 0.064)
- Reasoning-Enhanced: SFT on ~3,191 Chain-of-Thought samples distilled from Claude 4.6 Opus
- Structured Output: Generates reasoning within `<think>...</think>` tags before the final response
- Efficient MoE: 35B total params, only 3B active per token (256 experts, 8 routed + 1 shared)
### Architecture
| Attribute | Value |
|---|---|
| Architecture | Qwen3.5 MoE (Gated DeltaNet + Gated Attention + MoE) |
| Total Parameters | 35B |
| Active Parameters | 3B per token |
| Hidden Dimension | 2,048 |
| Layers | 40 |
| Experts | 256 total, 8 routed + 1 shared active |
| Context Length | 262,144 tokens (native) |
| Vocabulary | 248,320 tokens |
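The expert counts in the table imply a very sparse routing pattern, which is why only ~3B of the 35B parameters are active per token. A toy calculation (not llama.cpp code) making that explicit:

```python
# Per-token expert utilization implied by the architecture table.
TOTAL_EXPERTS = 256
ROUTED_ACTIVE = 8   # experts selected by the router per token
SHARED_ACTIVE = 1   # always-on shared expert

active = ROUTED_ACTIVE + SHARED_ACTIVE
fraction = active / TOTAL_EXPERTS
print(f"{active}/{TOTAL_EXPERTS} experts active per token ({fraction:.1%})")
# → 9/256 experts active per token (3.5%)
```

Only about 3.5% of the expert weights participate in each forward pass; attention and shared components account for the remainder of the 3B active parameters.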
## Usage with llama.cpp

```shell
# Interactive chat
./llama-cli -m Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled-Q5_K_M.gguf \
    -ngl 99 -c 8192 --chat-template chatml -cnv

# Server mode
./llama-server -m Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled-Q5_K_M.gguf \
    -ngl 99 -c 8192 --chat-template chatml --port 8080
```
### Recommended Sampling Parameters
| Use Case | Temperature | Top-P | Top-K | Min-P |
|---|---|---|---|---|
| Thinking (general) | 1.0 | 0.95 | 20 | 0.0 |
| Thinking (coding) | 0.6 | 0.95 | 20 | 0.0 |
| Non-thinking (general) | 0.7 | 0.8 | 20 | 0.0 |
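With `llama-server` running in server mode, these presets can be passed through its OpenAI-compatible `/v1/chat/completions` endpoint. A minimal standard-library sketch; the preset names and the `build_request`/`send` helpers are illustrative, not part of any API:

```python
import json
import urllib.request

# Sampling presets from the table above (illustrative names).
PRESETS = {
    "thinking-general": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "thinking-coding":  {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non-thinking":     {"temperature": 0.7, "top_p": 0.8,  "top_k": 20, "min_p": 0.0},
}

def build_request(prompt: str, preset: str) -> dict:
    """Assemble a chat-completions payload with the chosen sampling preset."""
    return {"messages": [{"role": "user", "content": prompt}], **PRESETS[preset]}

def send(payload: dict, url: str = "http://localhost:8080/v1/chat/completions") -> dict:
    """POST the payload to a running llama-server instance."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires a running server):
# reply = send(build_request("Explain MoE routing briefly.", "thinking-coding"))
```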
### Example Output Format

The model produces structured reasoning:

```
<think>
Let me analyze this problem step by step.
1. First, I need to identify the core question being asked.
2. Then, I'll consider the relevant constraints and conditions.
3. Next, I'll work through the logic systematically.
4. Finally, I'll verify my reasoning for consistency.
[detailed reasoning follows...]
</think>
[final answer here]
```
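When consuming this output programmatically, the reasoning block can be split from the final answer. A minimal sketch that assumes exactly the tag format shown above:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a <think>...</think> response.

    If no think block is present, reasoning is empty and the whole
    text is treated as the answer.
    """
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

r, a = split_reasoning("<think>\nStep 1...\n</think>\nThe answer is 42.")
print(a)  # → The answer is 42.
```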
## Training Details

### Stage 1: Abliteration
| Parameter | Value |
|---|---|
| Method | SOMA + MPOA (Heretic v1.2.0) |
| Refusal Reduction | 93.4% (91 → 6 refusals out of 100) |
| KL Divergence | 0.0638 |
| Selected Trial | Trial 84 / 200 (Bayesian search) |
### Stage 2: Reasoning SFT
| Parameter | Value |
|---|---|
| Framework | Unsloth 2026.3.3 + TRL SFTTrainer |
| Method | LoRA (r=16, α=32) |
| Datasets | 3,191 samples from 3 reasoning distillation datasets |
| Epochs | 5 |
| Best Checkpoint | checkpoint-1995 (12% GSM8K accuracy on a 50-sample eval) |
| Learning Rate | 2e-4 (linear decay) |
| Max Seq Length | 2,048 |
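The LoRA settings in the table correspond to adding a low-rank delta to each adapted weight matrix, scaled by α/r (here 32/16 = 2.0). A toy pure-Python illustration of that merge rule, not the actual training code:

```python
# Toy LoRA merge: W' = W + (alpha / r) * (B @ A), shown with rank 1 for brevity.
ALPHA, R = 32, 16
SCALE = ALPHA / R  # 2.0, matching r=16, alpha=32 from the table

def matmul(B, A):
    """Multiply B (m x r) by A (r x n) using plain lists."""
    m, r, n = len(B), len(A), len(A[0])
    return [[sum(B[i][k] * A[k][j] for k in range(r)) for j in range(n)] for i in range(m)]

def lora_merge(W, B, A, scale=SCALE):
    """Return W + scale * (B @ A)."""
    delta = matmul(B, A)
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))] for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight
B = [[1.0], [0.0]]            # 2 x 1 low-rank factor
A = [[0.5, 0.5]]              # 1 x 2 low-rank factor
print(lora_merge(W, B, A))    # → [[2.0, 1.0], [0.0, 1.0]]
```

Because only the small A and B factors are trained, the adapter touches a tiny fraction of the 35B parameters, which is what makes single-node SFT of a model this size practical.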
## Related Models
| Model | Description |
|---|---|
| Jongsim/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled | Full bf16 safetensors (source model) |
| Jongsim/Qwen3.5-35B-A3B-heretic | Abliterated base (no reasoning SFT) |
| Jongsim/Qwen3.5-35B-A3B-heretic-GGUF | GGUF quants of the abliterated base |