Qwen3.5-35B-A3B-heretic-Reasoning

A reasoning-enhanced, abliterated version of Qwen3.5-35B-A3B (35B total / 3B active parameters, Mixture of Experts). This model was built in two stages: first, censorship removal via directional ablation using Heretic, then supervised fine-tuning on high-quality Chain-of-Thought reasoning traces distilled from Claude 4.6 Opus.

The model produces structured reasoning within <think>...</think> tags before delivering final responses. All weights are in bf16 precision.

Model Introduction

This model is a fine-tuned derivative of Jongsim/Qwen3.5-35B-A3B-heretic, which itself is an abliterated (decensored) version of Qwen/Qwen3.5-35B-A3B.

The primary objective is to inject high-density structured reasoning capability from Claude 4.6 Opus while preserving the uncensored nature of the abliterated base model. Through SFT on curated reasoning distillation data, the model learns to decompose complex problems into sequential steps within a dedicated thinking block before generating the final answer.

Architecture Overview

| Property | Value |
| --- | --- |
| Architecture | Qwen3.5 MoE (Gated DeltaNet + Gated Attention + MoE) |
| Total Parameters | 35B |
| Active Parameters | 3B per token |
| Hidden Dimension | 2048 |
| Layers | 40 (10 repeating blocks of 3x DeltaNet-MoE + 1x Attention-MoE) |
| Experts | 256 total, 8 routed + 1 shared active |
| Expert Intermediate Dim | 512 |
| Context Length | 262,144 tokens (native) |
| Precision | bf16 |
| Vocabulary | 248,320 tokens |

Training Pipeline

```
Qwen/Qwen3.5-35B-A3B (original)
 |
 | Heretic v1.2.0 (SOMA + MPOA abliteration)
 v
Jongsim/Qwen3.5-35B-A3B-heretic (abliterated base)
 |
 | Supervised Fine-Tuning (LoRA + Unsloth)
 v
Jongsim/Qwen3.5-35B-A3B-heretic-Reasoning (this model)
```

Stage 1: Abliteration (Censorship Removal)

The base model was processed with Heretic v1.2.0, an automated censorship removal tool that applies directional ablation optimized via Bayesian hyperparameter search (Optuna TPE).

Two techniques were combined:

  • SOMA (Self-Organizing Map Abliteration): Uses a 4x4 SOM to discover multiple refusal directions in activation space, then ablates the top-k directions simultaneously.
  • MPOA (Magnitude-Preserving Orthogonal Ablation): Projects out the refusal direction while preserving the original weight magnitude via row normalization with low-rank correction (rank 4).
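The core operation behind magnitude-preserving ablation can be illustrated with a toy projection (a minimal pure-Python sketch, not Heretic's actual implementation): each weight row has its component along the refusal direction removed, then is rescaled back to its original L2 norm.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def ablate_row(row, direction):
    """Remove the component of `row` along unit vector `direction`,
    then rescale to the row's original L2 norm (magnitude-preserving).
    Assumes `row` is not parallel to `direction`."""
    original_norm = norm(row)
    coeff = dot(row, direction)
    projected = [r - coeff * d for r, d in zip(row, direction)]
    scale = original_norm / norm(projected)
    return [p * scale for p in projected]

# Toy refusal direction (unit vector) and one weight row.
direction = [1.0, 0.0, 0.0]
row = [3.0, 4.0, 0.0]
new_row = ablate_row(row, direction)
print(new_row)                  # [0.0, 5.0, 0.0]: component along `direction` is gone
print(round(norm(new_row), 6))  # 5.0: original magnitude preserved
```

The real method operates on full weight matrices with multiple SOM-derived directions and a rank-4 correction, but the invariant is the same: zero projection onto the refusal direction, unchanged row magnitudes.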

Abliteration Configuration

| Parameter | Value |
| --- | --- |
| Method | SOMA + MPOA |
| Orthogonalize Direction | true |
| Row Normalization | full |
| Full Normalization LoRA Rank | 4 |
| Winsorization Quantile | 0.95 |
| SOM Grid | 4 x 4 (16 neurons) |
| SOM Iterations | 10,000 |
| SOM Learning Rate | 0.01 |
| SOM Sigma | 0.5 |
| SOM k (directions) | 4 |
| Optimization Trials | 200 (60 startup) |
| Selected Trial | Trial 84 / 200 |
| Good Prompts | mlabonne/harmless_alpaca (train[:400]) |
| Bad Prompts | mlabonne/harmful_behaviors (train[:400]) |
| Quantization | none (bf16) |

Abliteration Results

| Metric | Original | Abliterated |
| --- | --- | --- |
| KL Divergence | 0 (reference) | 0.0638 |
| Refusals (out of 100) | 91 | 6 |

93.4% refusal reduction (91 → 6 out of 100) with minimal distribution shift (KL = 0.0638).
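The KL figure compares the abliterated model's next-token distribution against the original's on harmless prompts. A minimal sketch of discrete KL divergence over toy token probabilities (illustrative only; Heretic's exact evaluation procedure may differ):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

original    = [0.70, 0.20, 0.10]  # toy next-token probabilities
abliterated = [0.65, 0.25, 0.10]  # slightly shifted after ablation

print(kl_divergence(original, original))     # 0.0 (the reference case)
print(kl_divergence(original, abliterated))  # small positive value
```

A KL of 0.0638 against the original model indicates the ablation changed the output distribution only slightly on benign inputs.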

Stage 2: Supervised Fine-Tuning (Reasoning Distillation)

Objective

Inject structured Chain-of-Thought reasoning patterns from Claude 4.6 Opus into the abliterated model. The training enforces a strict output format where the model generates internal reasoning within <think> blocks before producing the final response.

Training Strategy

  • Framework: Unsloth 2026.3.3 + TRL SFTTrainer
  • Method: LoRA (Low-Rank Adaptation) applied to both attention and MoE expert layers
  • Loss Computation: train_on_responses_only — loss is calculated exclusively on assistant responses (both thinking trace and final answer), not on user prompts
    • Instruction boundary: `<|im_start|>user\n`
    • Response boundary: `<|im_start|>assistant\n<think>`
  • Chat Template: Qwen ChatML format (`<|im_start|>` / `<|im_end|>`)
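What train_on_responses_only does with these boundaries can be sketched in pure Python on string tokens (the real implementation in TRL/Unsloth operates on token IDs, and the actual markers include the newlines shown above):

```python
IGNORE_INDEX = -100  # standard "ignore" label for cross-entropy loss in PyTorch

def mask_labels(tokens, instruction_marker, response_marker):
    """Set labels to IGNORE_INDEX everywhere except inside assistant responses,
    so prompts contribute no loss. Toy version on string tokens."""
    labels = []
    in_response = False
    for tok in tokens:
        if tok == instruction_marker:
            in_response = False
        elif tok == response_marker:
            in_response = True
            labels.append(IGNORE_INDEX)  # the boundary marker itself carries no loss
            continue
        labels.append(tok if in_response else IGNORE_INDEX)
    return labels

tokens = ["<|im_start|>user", "What", "is", "2+2?",
          "<|im_start|>assistant<think>", "2+2", "=", "4"]
labels = mask_labels(tokens, "<|im_start|>user", "<|im_start|>assistant<think>")
print(labels)  # [-100, -100, -100, -100, -100, '2+2', '=', '4']
```

Only the assistant's thinking trace and final answer receive supervision; everything before the response boundary is masked out.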

LoRA Configuration

| Parameter | Value |
| --- | --- |
| PEFT Method | LoRA |
| Rank (r) | 16 |
| Alpha | 32 (= 2 x rank) |
| Dropout | 0.0 |
| Bias | none |
| Target Modules (Attention) | q_proj, k_proj, v_proj, o_proj |
| Target Modules (FFN) | gate_proj, up_proj, down_proj |
| Target Modules (MoE) | gate_up_proj |
| Gradient Checkpointing | unsloth mode |
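With this configuration, each targeted weight receives the low-rank update W' = W + (alpha / r) · B · A, i.e. a scaling factor of 32 / 16 = 2.0. A minimal pure-Python sketch of the update (toy rank-1 matrices standing in for r = 16):

```python
def matmul(X, Y):
    """Naive matrix multiply for small lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_delta(B, A, rank, alpha):
    """Low-rank weight update (alpha / rank) * B @ A."""
    scale = alpha / rank
    return [[scale * x for x in row] for row in matmul(B, A)]

# B is (out_dim x rank), A is (rank x in_dim); here rank = 1 for readability,
# but alpha / rank = 2.0 matches the table's 32 / 16.
B = [[1.0], [0.0]]
A = [[0.5, 0.5]]
delta = lora_delta(B, A, rank=1, alpha=2)
print(delta)  # [[1.0, 1.0], [0.0, 0.0]]
```

Because only B and A are trained, the adapter touches a tiny fraction of the 35B parameters while still reaching all attention and MoE expert projections listed above.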

Training Hyperparameters

| Parameter | Value |
| --- | --- |
| Max Sequence Length | 2,048 |
| Per-Device Batch Size | 1 |
| Gradient Accumulation Steps | 8 |
| Effective Batch Size | 8 |
| Number of Epochs | 5 |
| Total Training Steps | 1,995 |
| Learning Rate | 2e-4 |
| LR Scheduler | Linear decay |
| Warmup Steps | 5 |
| Optimizer | AdamW 8-bit |
| Weight Decay | 0.001 |
| Precision | bf16 |
| Seed | 3407 |
| Total FLOPs | 3.56 x 10^18 |
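The step count is consistent with the dataset size and effective batch: ceil(3,191 / 8) = 399 optimizer steps per epoch, times 5 epochs, gives 1,995 total steps. As a quick arithmetic check:

```python
import math

samples = 3191           # combined dataset size after filtering (see Datasets below)
effective_batch = 1 * 8  # per-device batch 1 x gradient accumulation 8
epochs = 5

steps_per_epoch = math.ceil(samples / effective_batch)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 399 1995
```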

Datasets

Three publicly available reasoning distillation datasets were combined, shuffled (seed=42), and used for training:

| Dataset | Samples | Description |
| --- | --- | --- |
| nohurry/Opus-4.6-Reasoning-3000x-filtered | ~2,308 | Filtered reasoning trajectories from Claude 4.6 Opus. Each sample contains a problem, a detailed thinking trace, and a final solution. |
| TeichAI/claude-4.5-opus-high-reasoning-250x | ~250 | High-intensity structured reasoning instances from Claude 4.5 Opus multi-turn conversations. |
| Jackrong/Qwen3.5-reasoning-700x | ~633 | Curated reasoning samples in both conversation and instruction format, designed for step-by-step problem-solving diversity. |
| Total | ~3,191 | Combined after filtering empty/invalid rows. |
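The combine step can be sketched in pure Python (a stand-in for illustration; the actual pipeline presumably used the `datasets` library's `concatenate_datasets` and `shuffle(seed=42)`):

```python
import random

def combine_and_shuffle(*dataset_splits, seed=42):
    """Drop empty rows, concatenate all splits, and shuffle deterministically."""
    combined = [row for split in dataset_splits for row in split if row]
    random.Random(seed).shuffle(combined)
    return combined

# Toy stand-ins for the three source datasets.
opus     = [{"q": "p1", "think": "...", "a": "s1"}]
teich    = [{"q": "p2", "think": "...", "a": "s2"}]
jackrong = [{"q": "p3", "think": "...", "a": "s3"}, {}]  # empty row gets filtered

train = combine_and_shuffle(opus, teich, jackrong)
print(len(train))  # 3: the empty row was dropped
```

Fixing the shuffle seed makes the training order reproducible across runs.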

Training Loss

| Epoch | Avg Loss | Steps |
| --- | --- | --- |
| 1 | 0.4299 | 79 - 399 |
| 2 | 0.3729 | 400 - 798 |
| 3 | 0.3359 | 799 - 1197 |
| 4 | 0.3059 | 1198 - 1596 |
| 5 | 0.2958 | 1597 - 1995 |

Training loss decreased monotonically from 0.4299 to 0.2958 over five epochs, indicating stable convergence with no sign of overfitting at the loss level.

Checkpoint Selection

The best checkpoint was selected based on GSM8K accuracy (50 samples). All checkpoints were evaluated in isolated subprocesses to prevent GPU memory leaks from Unsloth's model patching.

| Checkpoint | Epoch | GSM8K Accuracy |
| --- | --- | --- |
| checkpoint-1200 | 3.0 | 8.0% (4/50) |
| checkpoint-1400 | 3.5 | 10.0% (5/50) |
| checkpoint-1596 | 4.0 | 10.0% (5/50) |
| checkpoint-1995 | 5.0 | 12.0% (6/50) |

checkpoint-1995 (epoch 5) was selected and merged into bf16 for the final release.

Note: GSM8K measures narrow arithmetic reasoning and does not fully reflect the model's broader reasoning capabilities (code generation, logical analysis, multi-step planning), which are the primary targets of the distillation training.
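The selection loop can be sketched as follows. This is a simplified stand-in: the child process here only echoes a placeholder accuracy rather than actually loading a checkpoint and scoring GSM8K, but the subprocess-isolation pattern is the same.

```python
import json
import subprocess
import sys

def eval_in_subprocess(checkpoint: str) -> float:
    """Score one checkpoint in an isolated child process so that GPU and
    library state from one evaluation cannot leak into the next. A real
    evaluator would load the checkpoint and score 50 GSM8K samples."""
    child_code = (
        "import json, sys; "
        "print(json.dumps({'checkpoint': sys.argv[1], 'accuracy': 0.0}))"
    )
    out = subprocess.run(
        [sys.executable, "-c", child_code, checkpoint],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)["accuracy"]

# Accuracies reported in the table above; argmax selects the released checkpoint.
scores = {
    "checkpoint-1200": 0.08,
    "checkpoint-1400": 0.10,
    "checkpoint-1596": 0.10,
    "checkpoint-1995": 0.12,
}
best = max(scores, key=scores.get)
print(best)  # checkpoint-1995
```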

Hardware and Environment

| Component | Value |
| --- | --- |
| Hardware | NVIDIA DGX Spark |
| GPU | NVIDIA GB10 (128GB unified memory) |
| Compute Capability | sm121 |
| Architecture | aarch64 |
| CUDA | 13.0 |
| PyTorch | 2.9.1a0 |
| Transformers | 5.2.0 |
| Unsloth | 2026.3.3 |
| TRL | 0.24.0 |
| PEFT | 0.18.1 |
| Datasets | 4.3.0 |
| Tokenizers | 0.22.2 |

DGX Spark-Specific Notes

  • Flash Attention and Memory-Efficient Attention (cutlass) are disabled due to sm121 incompatibility (supported: sm80-sm100). Only Math SDP is used.
  • flash_attn package is fully removed to prevent FATAL errors on sm121.
  • torch.compile / TorchInductor is disabled due to Triton ptxas compatibility issues.
  • The entire model (35B parameters) fits in a single GPU's 128GB unified memory without quantization.

Usage

This model uses the standard Qwen3.5 chat template. It operates in thinking mode by default.

Inference Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jongsim/Qwen3.5-35B-A3B-heretic-Reasoning"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Explain the difference between TCP and UDP, and when you would choose one over the other."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8192, temperature=0.7, top_p=0.8, top_k=20)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
```

Recommended Sampling Parameters

| Mode | temperature | top_p | top_k | presence_penalty |
| --- | --- | --- | --- | --- |
| Thinking (general) | 1.0 | 0.95 | 20 | 1.5 |
| Thinking (coding) | 0.6 | 0.95 | 20 | 0.0 |
| Non-thinking (general) | 0.7 | 0.8 | 20 | 1.5 |
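These presets can be kept as a small lookup (a hypothetical helper, not part of the model's code). Note that `transformers`' `generate` exposes `repetition_penalty` rather than `presence_penalty`; `presence_penalty` applies when serving through an OpenAI-compatible backend such as vLLM.

```python
# Sampling presets from the table above, keyed by usage mode.
SAMPLING_PRESETS = {
    "thinking_general":     {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5},
    "thinking_coding":      {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "non_thinking_general": {"temperature": 0.7, "top_p": 0.8,  "top_k": 20, "presence_penalty": 1.5},
}

def preset_for(mode: str) -> dict:
    """Return a copy of the sampling preset for the given mode."""
    return dict(SAMPLING_PRESETS[mode])

print(preset_for("thinking_coding")["temperature"])  # 0.6
```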

Example of Learned Reasoning Format

The model produces output in the following structure:

```
<think>
Let me analyze this problem step by step.

1. First, I need to identify the core question being asked.
2. Then, I'll consider the relevant constraints and conditions.
3. Next, I'll work through the logic systematically.
4. Finally, I'll verify my reasoning for consistency.

[detailed reasoning follows...]
</think>

[final answer here]
```
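Downstream code often needs to separate the reasoning trace from the final answer. A minimal parsing sketch (assumes at most one well-formed <think> block, as the training format enforces):

```python
import re

def split_thinking(text):
    """Split a model response into (reasoning_trace, final_answer).
    Returns an empty trace if no <think> block is present."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

response = "<think>\n1. Identify the question.\n2. Verify.\n</think>\nThe answer is 4."
trace, answer = split_thinking(response)
print(answer)  # The answer is 4.
```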

This structured thinking pattern, distilled from Claude 4.6 Opus interactions, reduces redundant cognitive loops while preserving deep analytical capacity.

Limitations

  • Hallucination Risk: As an autoregressive language model, the model may generate plausible-sounding but factually incorrect statements, particularly regarding real-world events or obscure technical details.
  • GSM8K Performance: The model scores 12% on GSM8K (50 samples). This is expected because the training data emphasizes broad reasoning patterns (code, logic, planning) rather than arithmetic drill. For pure math benchmarks, consider models specifically trained on mathematical datasets.
  • Abliteration Residual: 6 out of 100 harmful prompts still trigger refusal. The abliteration is not exhaustive.
  • Context Length Trade-off: While the architecture supports 262K tokens natively, the SFT was performed with max_seq_length=2048. Very long reasoning chains beyond the training distribution may degrade in quality.
  • MoE Inference Overhead: Despite having only 3B active parameters per token, the full 35B model must be loaded into memory. Minimum ~65GB VRAM/RAM required for bf16.

Acknowledgements

  • Qwen Team for the Qwen3.5-35B-A3B architecture and pretrained weights
  • Heretic (p-e-w) for the automated directional ablation framework
  • Unsloth AI for efficient LoRA fine-tuning of large MoE models
  • nohurry, TeichAI, and Jackrong for the reasoning distillation datasets

Citation

```bibtex
@misc{jongsim_qwen35_heretic_reasoning,
  title        = {Qwen3.5-35B-A3B-heretic-Reasoning},
  author       = {Jongsim},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Jongsim/Qwen3.5-35B-A3B-heretic-Reasoning}}
}
```