# Qwen3.5-35B-A3B-heretic-v2-eq-v1
A DPO fine-tune of llmfan46/Qwen3.5-35B-A3B-heretic-v2 optimized for emotional intelligence and empathetic conversation quality, while preserving the uncensored nature of the heretic abliteration.
35B total parameters, ~3B active per token (256 experts, 8 routed + 1 shared). Hybrid GatedDeltaNet + full attention architecture with 262K native context.
Honest note: This is a v1 — the goal was getting this new MoE architecture working end-to-end and seeing where EQ lands, not exhaustive hyperparameter search. The results are encouraging enough to share. Expect v2 to be more deliberately optimized.
## EQ-Bench 3 Results
**Rubric Score: 83.85** (judge: `claude-3.7-sonnet`)
+6 points over the vanilla Qwen3.5-35B-A3B baseline (77.85), placing it above every Qwen model on the EQ-Bench 3 leaderboard — including models 7x its size:
| Model | Params (active) | Rubric Score | Elo |
|---|---|---|---|
| claude-sonnet-4-6 | — | — | 1865.0 |
| claude-opus-4-6 | — | — | 1846.6 |
| gpt-5.4 | — | — | 1675.8 |
| Qwen3.5-35B-A3B-heretic-v2-eq-v1 (ours) | 3B | 83.85 | — |
| Qwen3.5-27B dense | 27B | 83.05 | — |
| Qwen3-235B-A22B | 22B | 80.90 | 1271.6 |
| QwQ-32B | 32B | 79.90 | 1209.9 |
| Qwen3.5-35B-A3B (baseline) | 3B | 77.85 | — |
| Qwen3-32B | 32B | 74.30 | 1006.6 |
| Qwen3-30B-A3B | 3B | 66.00 | 733.5 |
**Note on judge model:** We use `claude-3.7-sonnet` as the judge because that is what the public EQ-Bench 3 leaderboard uses for this family of models, making our scores directly comparable to published results. We plan to publish updated benchmarks with newer judge models (including Opus) in the future.
### Per-Dimension Scores (raw, 0-20 scale)
#### Strengths
| Dimension | Score |
|---|---|
| Analytical | 18.3 |
| Demonstrated Empathy | 17.8 |
| Depth of Insight | 17.4 |
| Emotional Reasoning | 17.3 |
| Humanlike | 17.2 |
| Subtext Identification | 17.1 |
| Pragmatic EI | 17.0 |
| Validating | 16.7 |
| Social Dexterity | 16.4 |
| Message Tailoring | 16.3 |
| Theory of Mind | 16.2 |
#### Moderate
| Dimension | Score |
|---|---|
| Correctness | 15.8 |
| Conversational | 15.7 |
| Warmth | 15.5 |
| Safety Conscious | 15.0 |
| Boundary Setting | 14.2 |
| Compliant | 13.9 |
#### Future Work
| Dimension | Score | Notes |
|---|---|---|
| Challenging | 13.1 | Could push back more when appropriate |
| Intellectual Grounding | 11.3 | Room to improve factual anchoring in emotional contexts |
| Reactive | 10.1 | Under-reacts to provocative/extreme inputs |
| Moralising | 7.9 | Low is generally good — avoids lecturing |
| Sycophantic | 6.9 | Low is good, but could still improve |
Sycophancy and reactivity are the primary targets for v2.
## Training Data
Trained on `nivvis/eq-dpo` — turn-level DPO pairs from synthetic emotional support conversations:
- Chosen: Gold supporter responses from top-Elo conversations (generated by Claude Sonnet/Opus, GPT-5.4)
- Rejected: Qwen's own responses given the same conversation prefix (in-distribution)
- Source conversations: Ranked via Elo tournament with logit-probe A/B judging
- Margin filtering: Pairs where the judge couldn't distinguish (`|P(gold) - P(alt)| < 0.15`) were dropped
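The margin filter above amounts to a single pass over the judge's probabilities. A minimal sketch — the record fields here are illustrative, not the actual dataset schema:

```python
# Hypothetical pair records; "p_gold"/"p_alt" stand in for the judge's
# logit-probe probabilities for the gold vs. alternative response.
pairs = [
    {"chosen": "...", "rejected": "...", "p_gold": 0.90, "p_alt": 0.10},
    {"chosen": "...", "rejected": "...", "p_gold": 0.55, "p_alt": 0.45},  # too close
]

MARGIN = 0.15

def keep(pair):
    # Drop pairs the judge could not clearly distinguish.
    return abs(pair["p_gold"] - pair["p_alt"]) >= MARGIN

filtered = [p for p in pairs if keep(p)]  # second pair is dropped (margin 0.10)
```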
### What the DPO fixes
The base Qwen model's main failure modes in empathetic conversation:
- Repetitive "It sounds like..." openers
- Verbose, over-structured responses (reflect → validate → explore → support every turn)
- Same emotional register regardless of context
- Therapist-parody voice instead of authentic warmth
The DPO-tuned model produces more natural, varied, and authentic responses.
## Training Details
- Hardware: NVIDIA RTX PRO 6000 Blackwell 96GB
- Framework: Unsloth + TRL DPOTrainer
- Precision: bf16 LoRA (Unsloth did not apply FP8 for this architecture)
- Hyperparameters:
  - Learning rate: 5e-6
  - Beta (DPO temperature): 0.1
  - Batch size: 1, gradient accumulation: 8
  - Optimizer: AdamW 8-bit
  - Max sequence length: 4096
  - LoRA rank: 32, alpha: 64
  - Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`, `gate_up_proj`
- Training time: ~2.2 hours (1 epoch, 2,592 train pairs)
- Final loss: 0.233 (mean), rewards accuracy: 100%
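For reference, the DPO objective these hyperparameters feed (beta = 0.1, as above) reduces per pair to a log-sigmoid over the policy-vs-reference log-probability margin. A stdlib-only sketch:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy prefers the chosen response more than the reference does,
# the loss drops below log(2) ~ 0.693, its value at zero margin.
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0, beta=0.1)
```

A mean loss of 0.233 thus indicates the policy consistently assigns a much larger margin to the chosen responses than the reference does, consistent with the 100% rewards accuracy.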
## Architecture
- Base: Qwen3.5-35B-A3B MoE (256 experts, 8 active per token, ~3B active parameters)
- Fine-tune: heretic-v2 by llmfan46 (abliterated)
- DPO: 1 epoch, 2,880 turn-level preference pairs, bf16 LoRA (r=32, alpha=64)
- Layers: 40 transformer blocks
- Attention: Hybrid — 3:1 GatedDeltaNet (linear) to full attention
- MoE: 256 experts, top-8 routing + 1 shared expert per token
- Context: 262K native (1M+ with YaRN)
- Precision: bfloat16
- Format: bf16 safetensors, merged (no adapter needed)
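A toy sketch of the top-8-of-256 routing described above (the real router also adds the always-active shared expert, learned gate projections, and load balancing):

```python
import math
import random

NUM_EXPERTS, TOP_K = 256, 8

def route(logits, top_k=TOP_K):
    """Pick the top-k experts by gate logit and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return {i: w / total for i, w in zip(top, exps)}

random.seed(0)
gates = route([random.gauss(0, 1) for _ in range(NUM_EXPERTS)])
# Only 8 of 256 expert FFNs run per token, which is why ~3B of the
# 35B total parameters are active.
```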
## Usage
### vLLM
```bash
VLLM_USE_DEEP_GEMM=0 \
VLLM_USE_FLASHINFER_SAMPLER=0 \
OMP_NUM_THREADS=4 \
vllm serve nivvis/Qwen3.5-35B-A3B-heretic-v2-eq-v1 \
  --served-model-name qwen35-35b-eq-v1 \
  --max-num-seqs 32 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code
```
**Note:** `VLLM_USE_DEEP_GEMM=0` and `VLLM_USE_FLASHINFER_SAMPLER=0` are recommended for stability with the hybrid GatedDeltaNet + MoE architecture. Adjust `--max-model-len` and `--gpu-memory-utilization` to fit your VRAM.
### Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nivvis/Qwen3.5-35B-A3B-heretic-v2-eq-v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)
```
### Tips
- Supports tool calling via the Qwen3 coder format
- Thinking/reasoning via `<think>` tags (`enable_thinking` can be toggled in the chat template)
- The `tokenizer_class` has been patched from `TokenizersBackend` to `Qwen2Tokenizer` to fix compatibility with SGLang and older inference frameworks
### GGUF
GGUF quantizations (F16 and Q4_K_M) are available at `nivvis/Qwen3.5-35B-A3B-heretic-v2-eq-v1-GGUF`.
## Limitations
- Trained specifically on empathetic support conversation style — may not generalize to all EQ tasks
- 1 epoch only — further training (more epochs, margin-DPO, think-token masking) likely to improve
- EQ-Bench 3 score measured without thinking enabled — thinking may improve or change scores
- The base heretic-v2 model has not been benchmarked separately on EQ-Bench 3, so the exact improvement from DPO vs the base fine-tune is not isolated
## Credits
- Fine-tune base: llmfan46/Qwen3.5-35B-A3B-heretic-v2 (abliterated by llmfan46)
- Original model: Qwen/Qwen3.5-35B-A3B by Alibaba Qwen team
- EQBench: eqbench.com
## Tested With
| Package | Version |
|---|---|
| vLLM | 0.17.0rc1.dev151+g4497431df |
| PyTorch | 2.10.0+cu128 |
| Transformers | 5.3.0.dev0 |
Qwen3.5 support is bleeding-edge — all three packages required dev/RC builds at time of release. These are the versions we verified with.
## Citation
If you use this model, please cite EQ-Bench:
```bibtex
@article{paech2024eqbench,
  title={EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models},
  author={Paech, Samuel J},
  journal={arXiv preprint arXiv:2312.06281},
  year={2024}
}
```
## License
Apache 2.0 (following the base Qwen3.5 license)