Qwen3.5-35B-A3B-heretic-v2-Opus-4.6-Distilled

A reasoning-enhanced, abliterated Qwen3.5-35B-A3B MoE model (35B total / 3B active parameters). Built on llmfan46/Qwen3.5-35B-A3B-heretic-v2 and fine-tuned on high-quality chain-of-thought reasoning traces distilled from Claude Opus 4.6 and Claude Opus 4.5, with the LoRA adapter merged at epoch 3 in bf16 precision.

The model produces structured reasoning within <think>...</think> tags before delivering final responses.
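When consuming the model's output programmatically, the reasoning block can be separated from the final answer. A minimal sketch (the `split_reasoning` helper is illustrative, not part of the model's tooling):

```python
import re

def split_reasoning(text):
    """Split a model response into its <think> reasoning and the final answer.
    Illustrative helper; assumes the <think>...</think> format described above."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

reasoning, answer = split_reasoning("<think>2 + 2 = 4.</think>\nThe answer is 4.")
```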

Training Pipeline

Qwen/Qwen3.5-35B-A3B (original)
 │
 │  Heretic v1.2.0 (SOMA + MPOA abliteration, v2 config)
 ▼
llmfan46/Qwen3.5-35B-A3B-heretic-v2 (abliterated base, by llmfan46)
 │
 │  LoRA SFT with Unsloth (epoch 3 / 5 merged)
 ▼
Jongsim/Qwen3.5-35B-A3B-heretic-v2-Opus-4.6-Distilled (this model)

Architecture

Property Value
Architecture Qwen3.5 MoE (Gated DeltaNet + Gated Attention + MoE)
Total Parameters 35B
Active Parameters ~3B per token
Hidden Dimension 2048
Layers 40 (10 repeating blocks: 3× DeltaNet-MoE + 1× Attention-MoE)
Experts 256 total, 8 routed + 1 shared active
Expert Intermediate Dim 512
Context Length 262,144 tokens (native)
Precision bf16
Vocabulary 248,320 tokens

Fine-Tuning Details

LoRA Configuration

Parameter Value
Framework Unsloth + PEFT
Method LoRA (Low-Rank Adaptation)
Rank (r) 16
Alpha 32
Dropout 0.0
Target Modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, gate_up_proj
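The table above maps directly onto a PEFT adapter configuration. A sketch under the stated hyperparameters (not the exact training script; `task_type` is an assumption):

```python
from peft import LoraConfig  # assumes peft is installed

# Sketch of the adapter configuration from the table above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj", "gate_up_proj",
    ],
    task_type="CAUSAL_LM",  # assumption: causal-LM SFT
)
```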

Training Configuration

Parameter Value
Trainer SFTTrainer (train_on_responses_only)
Optimizer AdamW 8-bit
Learning Rate 2e-5
LR Scheduler Cosine
Batch Size 1 (per device)
Gradient Accumulation 8
Effective Batch Size 8
Max Sequence Length 2,048 tokens
Warmup Ratio 0.03
Total Epochs 5 (merged at epoch 3)
Steps per Epoch 1,603
Merged Checkpoint Step 4,809 (epoch 3)
Precision bf16
Hardware NVIDIA DGX Spark (GB10 Blackwell GPU, 128GB unified memory)
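The run parameters above can be expressed as transformers `TrainingArguments`. A minimal sketch only: the actual run used Unsloth's `SFTTrainer` with `train_on_responses_only`, and `output_dir` plus the optimizer alias are assumptions:

```python
from transformers import TrainingArguments  # sketch, not the actual script

training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size 8
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=5,
    optim="adamw_bnb_8bit",          # 8-bit AdamW via bitsandbytes (assumed alias)
    bf16=True,
    output_dir="outputs",            # assumption
)
```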

Training Loss

Epoch Step Train Loss
1 1,603 0.3792
2 3,206 0.3602
3 (merged) 4,809 0.1715
4 6,412 0.1530
5 8,015 0.1490

Epoch 3 was selected for merging because it captures most of the convergence (train loss dropped from 0.36 to 0.17 between epochs 2 and 3) while avoiding potential overfitting in later epochs, which show diminishing returns (epoch 4 → 5 improved loss by only 0.004).
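The diminishing-returns argument can be checked directly from the loss table:

```python
# Train losses per epoch, from the table above.
losses = {1: 0.3792, 2: 0.3602, 3: 0.1715, 4: 0.1530, 5: 0.1490}

# Loss improvement contributed by each epoch.
deltas = {e: round(losses[e - 1] - losses[e], 4) for e in range(2, 6)}
print(deltas)  # epoch 3 contributes 0.1887; epoch 5 only 0.004
```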

Training Datasets

Dataset Rows Description
nohurry/Opus-4.6-Reasoning-3000x-filtered 2,308 Claude Opus 4.6 reasoning traces (filtered)
TeichAI/claude-4.5-opus-high-reasoning-250x 250 Claude Opus 4.5 high-quality reasoning
Jackrong/Qwen3.5-reasoning-700x 633 Qwen3.5 reasoning examples
Roman1111111/claude-opus-4.6-10000x 9,631 Claude Opus 4.6 large-scale reasoning
Total 12,822

All datasets use the ChatML conversation format with <think>...</think> reasoning blocks in assistant responses.
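The shape of a single training row, sketched as a Python dict (illustrative content, not an actual dataset example):

```python
# ChatML-style messages with the assistant's reasoning wrapped in <think> tags.
row = {
    "messages": [
        {"role": "user", "content": "What is 7 * 8?"},
        {
            "role": "assistant",
            "content": "<think>7 * 8 = 56.</think>\n\n7 * 8 = 56.",
        },
    ]
}
```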

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jongsim/Qwen3.5-35B-A3B-heretic-v2-Opus-4.6-Distilled"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="bfloat16", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant. Think step by step."},
    {"role": "user", "content": "Explain the proof that there are infinitely many prime numbers."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.7, top_p=0.9)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))

vLLM

from vllm import LLM, SamplingParams

llm = LLM(model="Jongsim/Qwen3.5-35B-A3B-heretic-v2-Opus-4.6-Distilled", dtype="bfloat16")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=2048)

messages = [{"role": "user", "content": "Solve this step by step: What is 23 * 47?"}]
output = llm.chat(messages, sampling_params=params)
print(output[0].outputs[0].text)

Abliteration (Stage 0)

The base model (llmfan46/Qwen3.5-35B-A3B-heretic-v2) was created by llmfan46 using Heretic v1.2.0:

  • SOMA (Self-Organizing Map Abliteration): 4×4 SOM discovering multiple refusal directions, top-4 ablated
  • MPOA (Magnitude-Preserving Orthogonal Ablation): Projected ablation with row normalization (rank 3)
  • Bayesian optimization: 200 Optuna trials for optimal hyperparameters
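A conceptual sketch of magnitude-preserving directional ablation, reduced to a single direction for clarity (Heretic's MPOA uses a rank-3 projection and different internals; `ablate_direction` below is illustrative only):

```python
import numpy as np

def ablate_direction(W, v):
    """Remove the component of W's output along refusal direction v,
    then rescale each row to its original norm (magnitude-preserving).
    Illustrative sketch, not Heretic's implementation."""
    v = v / np.linalg.norm(v)
    orig_norms = np.linalg.norm(W, axis=1, keepdims=True)
    W_proj = W - np.outer(v, v) @ W          # orthogonal projection: (I - v v^T) W
    new_norms = np.linalg.norm(W_proj, axis=1, keepdims=True)
    return W_proj * (orig_norms / np.clip(new_norms, 1e-12, None))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
v = rng.standard_normal(8)
W_abl = ablate_direction(W, v)
```

Note the tradeoff: the per-row rescaling preserves weight magnitudes but reintroduces a small component along the ablated direction.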

License

This model inherits the Apache 2.0 license from the base Qwen3.5-35B-A3B model.

Acknowledgments

  • Qwen Team for Qwen3.5-35B-A3B
  • llmfan46 for the abliterated heretic-v2 base model
  • Heretic by Philipp Emanuel Weidmann for abliteration
  • Unsloth for efficient LoRA fine-tuning
  • Dataset creators: nohurry, TeichAI, Jackrong, Roman1111111