Qwen3.5-35B-A3B-heretic-Reasoning
A reasoning-enhanced, abliterated version of Qwen3.5-35B-A3B (35B total / 3B active parameters, Mixture of Experts). This model was built in two stages: first, censorship removal via directional ablation using Heretic, then supervised fine-tuning on high-quality Chain-of-Thought reasoning traces distilled from Claude 4.6 Opus.
The model produces structured reasoning within <think>...</think> tags before delivering final responses. All weights are in bf16 precision.
Model Introduction
This model is a fine-tuned derivative of Jongsim/Qwen3.5-35B-A3B-heretic, which itself is an abliterated (decensored) version of Qwen/Qwen3.5-35B-A3B.
The primary objective is to inject high-density structured reasoning capability from Claude 4.6 Opus while preserving the uncensored nature of the abliterated base model. Through SFT on curated reasoning distillation data, the model learns to decompose complex problems into sequential steps within a dedicated thinking block before generating the final answer.
Architecture Overview
| Property | Value |
|---|---|
| Architecture | Qwen3.5 MoE (Gated DeltaNet + Gated Attention + MoE) |
| Total Parameters | 35B |
| Active Parameters | 3B per token |
| Hidden Dimension | 2048 |
| Layers | 40 (10 repeating blocks of 3x DeltaNet-MoE + 1x Attention-MoE) |
| Experts | 256 total, 8 routed + 1 shared active |
| Expert Intermediate Dim | 512 |
| Context Length | 262,144 tokens (native) |
| Precision | bf16 |
| Vocabulary | 248,320 tokens |
Training Pipeline
```
Qwen/Qwen3.5-35B-A3B (original)
        |
        | Heretic v1.2.0 (SOMA + MPOA abliteration)
        v
Jongsim/Qwen3.5-35B-A3B-heretic (abliterated base)
        |
        | Supervised Fine-Tuning (LoRA + Unsloth)
        v
Jongsim/Qwen3.5-35B-A3B-heretic-Reasoning (this model)
```
Stage 1: Abliteration (Censorship Removal)
The base model was processed with Heretic v1.2.0, an automated censorship removal tool that applies directional ablation optimized via Bayesian hyperparameter search (Optuna TPE).
Two techniques were combined:
- SOMA (Self-Organizing Map Abliteration): Uses a 4x4 SOM to discover multiple refusal directions in activation space, then ablates the top-k directions simultaneously.
- MPOA (Magnitude-Preserving Orthogonal Ablation): Projects out the refusal direction while preserving the original weight magnitude via row normalization with low-rank correction (rank 4).
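The core of the MPOA idea can be sketched in a few lines of NumPy. This is an illustrative toy, not Heretic's actual implementation: it projects a unit refusal direction `d` out of each weight row, then rescales rows back to their original norms so weight magnitudes are preserved (the rank-4 low-rank correction used by Heretic is omitted here).

```python
import numpy as np

def ablate_mpoa(W, d):
    """Toy magnitude-preserving orthogonal ablation (illustrative only).

    Removes the component of each row of W along the refusal direction d,
    then rescales rows to their original L2 norms.
    """
    d = d / np.linalg.norm(d)                          # unit refusal direction
    orig_norms = np.linalg.norm(W, axis=1, keepdims=True)
    W_proj = W - np.outer(W @ d, d)                    # project d out of each row
    new_norms = np.linalg.norm(W_proj, axis=1, keepdims=True)
    return W_proj * (orig_norms / np.maximum(new_norms, 1e-12))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
d = rng.normal(size=16)
W_abl = ablate_mpoa(W, d)
# Row magnitudes survive the ablation to numerical precision.
print(np.allclose(np.linalg.norm(W_abl, axis=1), np.linalg.norm(W, axis=1)))  # → True
```

Note that the final rescaling reintroduces a small component along `d`; this is exactly the residual that the low-rank correction in the real method is designed to clean up.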
Abliteration Configuration
| Parameter | Value |
|---|---|
| Method | SOMA + MPOA |
| Orthogonalize Direction | true |
| Row Normalization | full |
| Full Normalization LoRA Rank | 4 |
| Winsorization Quantile | 0.95 |
| SOM Grid | 4 x 4 (16 neurons) |
| SOM Iterations | 10,000 |
| SOM Learning Rate | 0.01 |
| SOM Sigma | 0.5 |
| SOM k (directions) | 4 |
| Optimization Trials | 200 (60 startup) |
| Selected Trial | Trial 84 / 200 |
| Good Prompts | mlabonne/harmless_alpaca (train[:400]) |
| Bad Prompts | mlabonne/harmful_behaviors (train[:400]) |
| Quantization | none (bf16) |
Abliteration Results
| Metric | Original | Abliterated |
|---|---|---|
| KL Divergence | 0 (reference) | 0.0638 |
| Refusals (out of 100) | 91 | 6 |
This corresponds to a 93.4% reduction in refusals (91 → 6) with minimal distribution shift (KL = 0.0638).
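The headline figure follows directly from the table above:

```python
# Refusals dropped from 91 to 6 out of 100 test prompts.
orig_refusals, abl_refusals = 91, 6
reduction = 100 * (orig_refusals - abl_refusals) / orig_refusals
print(round(reduction, 1))  # → 93.4
```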
Stage 2: Supervised Fine-Tuning (Reasoning Distillation)
Objective
Inject structured Chain-of-Thought reasoning patterns from Claude 4.6 Opus into the abliterated model. The training enforces a strict output format where the model generates internal reasoning within <think> blocks before producing the final response.
Training Strategy
- Framework: Unsloth 2026.3.3 + TRL SFTTrainer
- Method: LoRA (Low-Rank Adaptation) applied to both attention and MoE expert layers
- Loss Computation: `train_on_responses_only` (loss is calculated exclusively on assistant responses, both thinking trace and final answer, not on user prompts)
  - Instruction boundary: `<|im_start|>user\n`
  - Response boundary: `<|im_start|>assistant\n<think>`
- Chat Template: Qwen ChatML format (`<|im_start|>` / `<|im_end|>`)
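Conceptually, response-only loss computation just replaces prompt-token labels with the cross-entropy ignore index. A minimal sketch of the effect (not TRL's actual implementation):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by PyTorch cross-entropy

def mask_non_response(labels, response_start):
    """Disable loss on everything before the assistant response
    (sketch of what train_on_responses_only achieves)."""
    return [IGNORE_INDEX] * response_start + list(labels[response_start:])

# Tokens 0-2 are the user prompt; loss is computed only on tokens 3-4.
print(mask_non_response([11, 22, 33, 44, 55], 3))  # → [-100, -100, -100, 44, 55]
```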
LoRA Configuration
| Parameter | Value |
|---|---|
| PEFT Method | LoRA |
| Rank (r) | 16 |
| Alpha | 32 (= 2 x rank) |
| Dropout | 0.0 |
| Bias | none |
| Target Modules (Attention) | q_proj, k_proj, v_proj, o_proj |
| Target Modules (FFN) | gate_proj, up_proj, down_proj |
| Target Modules (MoE) | gate_up_proj |
| Gradient Checkpointing | unsloth mode |
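The rank/alpha settings above correspond to the standard LoRA update `W x + (alpha / r) * B A x`, with `B` initialized to zero so the adapter starts as a no-op. A toy NumPy sketch (the dimensions here are illustrative, not the model's):

```python
import numpy as np

r, alpha = 16, 32                      # rank and alpha from the table (alpha = 2 x r)
d_in, d_out = 64, 64                   # toy dimensions for illustration

rng = np.random.default_rng(3407)
W = rng.normal(size=(d_out, d_in))     # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero-initialized

x = rng.normal(size=d_in)
y = W @ x + (alpha / r) * (B @ (A @ x))

# With B = 0, the adapted layer matches the base layer exactly at init.
print(np.allclose(y, W @ x))  # → True
```

Only `A` and `B` are trained, which is why a 35B-parameter MoE model can be fine-tuned on a single device.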
Training Hyperparameters
| Parameter | Value |
|---|---|
| Max Sequence Length | 2,048 |
| Per-Device Batch Size | 1 |
| Gradient Accumulation Steps | 8 |
| Effective Batch Size | 8 |
| Number of Epochs | 5 |
| Total Training Steps | 1,995 |
| Learning Rate | 2e-4 |
| LR Scheduler | Linear decay |
| Warmup Steps | 5 |
| Optimizer | AdamW 8-bit |
| Weight Decay | 0.001 |
| Precision | bf16 |
| Seed | 3407 |
| Total FLOPs | 3.56 x 10^18 |
Datasets
Three publicly available reasoning distillation datasets were combined, shuffled (seed=42), and used for training:
| Dataset | Samples | Description |
|---|---|---|
| nohurry/Opus-4.6-Reasoning-3000x-filtered | ~2,308 | Filtered reasoning trajectories from Claude 4.6 Opus. Each sample contains a problem, a detailed thinking trace, and a final solution. |
| TeichAI/claude-4.5-opus-high-reasoning-250x | ~250 | High-intensity structured reasoning instances from Claude 4.5 Opus multi-turn conversations. |
| Jackrong/Qwen3.5-reasoning-700x | ~633 | Curated reasoning samples in both conversation and instruction format, designed for step-by-step problem solving diversity. |
| Total | ~3,191 | Combined after filtering empty/invalid rows. |
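The step count in the hyperparameter table is consistent with these dataset sizes:

```python
import math

samples = 3191        # combined dataset size from the table above
eff_batch = 1 * 8     # per-device batch 1 x gradient accumulation 8
epochs = 5

steps_per_epoch = math.ceil(samples / eff_batch)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # → 399 1995
```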
Training Loss
| Epoch | Avg Loss | Steps |
|---|---|---|
| 1 | 0.4299 | 79 - 399 |
| 2 | 0.3729 | 400 - 798 |
| 3 | 0.3359 | 799 - 1197 |
| 4 | 0.3059 | 1198 - 1596 |
| 5 | 0.2958 | 1597 - 1995 |
Training loss decreased monotonically from 0.4299 to 0.2958 across 5 epochs, indicating stable convergence with no sign of overfitting at the training-loss level.
Checkpoint Selection
The best checkpoint was selected based on GSM8K accuracy (50 samples). All checkpoints were evaluated in isolated subprocesses to prevent GPU memory leaks from Unsloth's model patching.
| Checkpoint | Epoch | GSM8K Accuracy |
|---|---|---|
| checkpoint-1200 | 3.0 | 8.0% (4/50) |
| checkpoint-1400 | 3.5 | 10.0% (5/50) |
| checkpoint-1596 | 4.0 | 10.0% (5/50) |
| checkpoint-1995 | 5.0 | 12.0% (6/50) |
checkpoint-1995 (epoch 5) was selected and merged into bf16 for the final release.
Note: GSM8K measures narrow arithmetic reasoning and does not fully reflect the model's broader reasoning capabilities (code generation, logical analysis, multi-step planning) which are the primary targets of the distillation training.
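The subprocess isolation used for checkpoint evaluation can be sketched as follows. Each child process exits after its evaluation, so all GPU memory is returned to the OS regardless of Unsloth's model patching; the GSM8K scoring itself is replaced here by a trivial stand-in computation.

```python
import subprocess
import sys

def run_isolated(code: str) -> str:
    """Run a snippet in a fresh Python interpreter and return its stdout.
    When the child exits, its (GPU) allocations are fully released."""
    result = subprocess.run([sys.executable, "-c", code],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Stand-in for loading a checkpoint and scoring 50 GSM8K samples.
print(run_isolated("print(6 / 50)"))  # → 0.12
```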
Hardware and Environment
| Component | Value |
|---|---|
| Hardware | NVIDIA DGX Spark |
| GPU | NVIDIA GB10 (128GB unified memory) |
| Compute Capability | sm121 |
| Architecture | aarch64 |
| CUDA | 13.0 |
| PyTorch | 2.9.1a0 |
| Transformers | 5.2.0 |
| Unsloth | 2026.3.3 |
| TRL | 0.24.0 |
| PEFT | 0.18.1 |
| Datasets | 4.3.0 |
| Tokenizers | 0.22.2 |
DGX Spark-Specific Notes
- Flash Attention and Memory-Efficient Attention (cutlass) are disabled due to sm121 incompatibility (supported: sm80-sm100). Only Math SDP is used.
- The `flash_attn` package is fully removed to prevent FATAL errors on sm121.
- `torch.compile` / TorchInductor is disabled due to Triton ptxas compatibility issues.
- The entire model (35B parameters) fits in a single GPU's 128GB unified memory without quantization.
Usage
This model uses the standard Qwen3.5 chat template. It operates in thinking mode by default.
Inference Example
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jongsim/Qwen3.5-35B-A3B-heretic-Reasoning"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Explain the difference between TCP and UDP, and when you would choose one over the other."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=8192, temperature=0.7, top_p=0.8, top_k=20)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
```
Recommended Sampling Parameters
| Mode | temperature | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking (general) | 1.0 | 0.95 | 20 | 1.5 |
| Thinking (coding) | 0.6 | 0.95 | 20 | 0.0 |
| Non-thinking (general) | 0.7 | 0.8 | 20 | 1.5 |
Example of Learned Reasoning Format
The model produces output in the following structure:
```
<think>
Let me analyze this problem step by step.
1. First, I need to identify the core question being asked.
2. Then, I'll consider the relevant constraints and conditions.
3. Next, I'll work through the logic systematically.
4. Finally, I'll verify my reasoning for consistency.
[detailed reasoning follows...]
</think>
[final answer here]
```
This structured thinking pattern, distilled from Claude 4.6 Opus interactions, reduces redundant cognitive loops while preserving deep analytical capacity.
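When consuming model output programmatically, the thinking trace can be separated from the final answer with a simple parse. This is a sketch that assumes well-formed tags:

```python
import re

def split_reasoning(text: str):
    """Return (thinking, answer); thinking is None if no <think> block is present."""
    match = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, flags=re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return None, text.strip()

thinking, answer = split_reasoning("<think>Step 1: compare protocols...</think>Final answer.")
print(answer)  # → Final answer.
```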
Limitations
- Hallucination Risk: As an autoregressive language model, the model may generate plausible-sounding but factually incorrect statements, particularly regarding real-world events or obscure technical details.
- GSM8K Performance: The model scores 12% on GSM8K (50 samples). This is expected because the training data emphasizes broad reasoning patterns (code, logic, planning) rather than arithmetic drill. For pure math benchmarks, consider models specifically trained on mathematical datasets.
- Abliteration Residual: 6 out of 100 harmful prompts still trigger refusal. The abliteration is not exhaustive.
- Context Length Trade-off: While the architecture supports 262K tokens natively, the SFT was performed with max_seq_length=2048. Very long reasoning chains beyond the training distribution may degrade in quality.
- MoE Inference Overhead: Despite having only 3B active parameters per token, the full 35B model must be loaded into memory. Minimum ~65GB VRAM/RAM required for bf16.
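The ~65GB figure follows directly from parameter count and precision:

```python
params = 35e9            # total parameters (all experts must be resident, not just active ones)
bytes_per_param = 2      # bf16
gib = params * bytes_per_param / 2**30
print(round(gib))  # → 65
```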
Acknowledgements
- Qwen Team for the Qwen3.5-35B-A3B architecture and pretrained weights
- Heretic (p-e-w) for the automated directional ablation framework
- Unsloth AI for efficient LoRA fine-tuning of large MoE models
- nohurry, TeichAI, and Jackrong for the reasoning distillation datasets
Citation
```bibtex
@misc{jongsim_qwen35_heretic_reasoning,
  title = {Qwen3.5-35B-A3B-heretic-Reasoning},
  author = {Jongsim},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Jongsim/Qwen3.5-35B-A3B-heretic-Reasoning}}
}
```