# Qwen3.5-35B-A3B-heretic-v2-eq-v1
A DPO fine-tune of llmfan46/Qwen3.5-35B-A3B-heretic-v2 optimized for emotional intelligence and empathetic conversation quality, while preserving the uncensored nature of the heretic abliteration.
35B total parameters, ~3B active per token (256 experts, 8 routed + 1 shared). Hybrid GatedDeltaNet + full attention architecture with 262K native context.
Honest note: This is a v1 — the goal was getting this new MoE architecture working end-to-end and seeing where EQ lands, not exhaustive hyperparameter search. The results are encouraging enough to share. Expect v2 to be more deliberately optimized.
## EQ-Bench 3 Results
**Rubric Score: 83.85** (judge: `claude-3.7-sonnet`)
+6 points over the vanilla Qwen3.5-35B-A3B baseline (77.85), placing it above every Qwen model on the EQ-Bench 3 leaderboard — including models 7x its size:
| Model | Params (active) | Rubric Score | Elo |
|---|---|---|---|
| claude-sonnet-4-6 | — | — | 1865.0 |
| claude-opus-4-6 | — | — | 1846.6 |
| gpt-5.4 | — | — | 1675.8 |
| Qwen3.5-35B-A3B-heretic-v2-eq-v1 (ours) | 3B | 83.85 | — |
| Qwen3.5-27B dense | 27B | 83.05 | — |
| Qwen3-235B-A22B | 22B | 80.90 | 1271.6 |
| QwQ-32B | 32B | 79.90 | 1209.9 |
| Qwen3.5-35B-A3B (baseline) | 3B | 77.85 | — |
| Qwen3-32B | 32B | 74.30 | 1006.6 |
| Qwen3-30B-A3B | 3B | 66.00 | 733.5 |
**Note on judge model:** We use `claude-3.7-sonnet` as the judge because that is what the public EQ-Bench 3 leaderboard uses for this family of models, making our scores directly comparable to published results. We plan to publish updated benchmarks with newer judge models (including Opus) in the future.
### Per-Dimension Scores (raw, 0-20 scale)
#### Strengths
| Dimension | Score |
|---|---|
| Analytical | 18.3 |
| Demonstrated Empathy | 17.8 |
| Depth of Insight | 17.4 |
| Emotional Reasoning | 17.3 |
| Humanlike | 17.2 |
| Subtext Identification | 17.1 |
| Pragmatic EI | 17.0 |
| Validating | 16.7 |
| Social Dexterity | 16.4 |
| Message Tailoring | 16.3 |
| Theory of Mind | 16.2 |
#### Moderate
| Dimension | Score |
|---|---|
| Correctness | 15.8 |
| Conversational | 15.7 |
| Warmth | 15.5 |
| Safety Conscious | 15.0 |
| Boundary Setting | 14.2 |
| Compliant | 13.9 |
#### Future Work
| Dimension | Score | Notes |
|---|---|---|
| Challenging | 13.1 | Could push back more when appropriate |
| Intellectual Grounding | 11.3 | Room to improve factual anchoring in emotional contexts |
| Reactive | 10.1 | Under-reacts to provocative/extreme inputs |
| Moralising | 7.9 | Low is generally good — avoids lecturing |
| Sycophantic | 6.9 | Low is good, but could still improve |
Sycophancy and reactivity are the primary targets for v2.
## Training Data
Trained on `nivvis/eq-dpo` — turn-level DPO pairs from synthetic emotional support conversations:
- Chosen: Gold supporter responses from top-Elo conversations (generated by Claude Sonnet/Opus, GPT-5.4)
- Rejected: Qwen's own responses given the same conversation prefix (in-distribution)
- Source conversations: Ranked via Elo tournament with logit-probe A/B judging
- Margin filtering: Pairs where the judge couldn't distinguish (`|P(gold) - P(alt)| < 0.15`) were dropped
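The margin filter above amounts to a single pass over the judge's probabilities. A minimal sketch — the record fields here are illustrative, not the actual dataset schema:

```python
# Hypothetical pair records; "p_gold"/"p_alt" stand in for the judge's
# logit-probe probabilities for the gold vs. alternative response.
pairs = [
    {"chosen": "...", "rejected": "...", "p_gold": 0.90, "p_alt": 0.10},
    {"chosen": "...", "rejected": "...", "p_gold": 0.55, "p_alt": 0.45},  # too close
]

MARGIN = 0.15

def keep(pair):
    # Drop pairs the judge could not clearly distinguish.
    return abs(pair["p_gold"] - pair["p_alt"]) >= MARGIN

filtered = [p for p in pairs if keep(p)]  # second pair is dropped (margin 0.10)
```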
### What the DPO fixes
The base Qwen model's main failure modes in empathetic conversation:
- Repetitive "It sounds like..." openers
- Verbose, over-structured responses (reflect → validate → explore → support every turn)
- Same emotional register regardless of context
- Therapist-parody voice instead of authentic warmth
The DPO-tuned model produces more natural, varied, and authentic responses.
## Training Details
- Hardware: NVIDIA RTX PRO 6000 Blackwell 96GB
- Framework: Unsloth + TRL DPOTrainer
- Precision: bf16 LoRA (Unsloth did not apply FP8 for this architecture)
- Hyperparameters:
  - Learning rate: 5e-6
  - Beta (DPO temperature): 0.1
  - Batch size: 1, gradient accumulation: 8
  - Optimizer: AdamW 8-bit
  - Max sequence length: 4096
  - LoRA rank: 32, alpha: 64
  - Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`, `gate_up_proj`
- Training time: ~2.2 hours (1 epoch, 2,592 train pairs)
- Final loss: 0.233 (mean), rewards accuracy: 100%
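For reference, the DPO objective these hyperparameters feed (beta = 0.1, as above) reduces per pair to a log-sigmoid over the policy-vs-reference log-probability margin. A stdlib-only sketch:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy prefers the chosen response more than the reference does,
# the loss drops below log(2) ~ 0.693, its value at zero margin.
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0, beta=0.1)
```

A mean loss of 0.233 thus indicates the policy consistently assigns a much larger margin to the chosen responses than the reference does, consistent with the 100% rewards accuracy.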
## Architecture
- Base: Qwen3.5-35B-A3B MoE (256 experts, 8 active per token, ~3B active parameters)
- Fine-tune: heretic-v2 by llmfan46 (abliterated)
- DPO: 1 epoch, 2,880 turn-level preference pairs, bf16 LoRA (r=32, alpha=64)
- Layers: 40 transformer blocks
- Attention: Hybrid — 3:1 GatedDeltaNet (linear) to full attention
- MoE: 256 experts, top-8 routing + 1 shared expert per token
- Context: 262K native (1M+ with YaRN)
- Precision: bfloat16
- Format: bf16 safetensors, merged (no adapter needed)
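A toy sketch of the top-8-of-256 routing described above (the real router also adds the always-active shared expert, learned gate projections, and load balancing):

```python
import math
import random

NUM_EXPERTS, TOP_K = 256, 8

def route(logits, top_k=TOP_K):
    """Pick the top-k experts by gate logit and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return {i: w / total for i, w in zip(top, exps)}

random.seed(0)
gates = route([random.gauss(0, 1) for _ in range(NUM_EXPERTS)])
# Only 8 of 256 expert FFNs run per token, which is why ~3B of the
# 35B total parameters are active.
```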
## Usage
### vLLM
```bash
VLLM_USE_DEEP_GEMM=0 \
VLLM_USE_FLASHINFER_SAMPLER=0 \
OMP_NUM_THREADS=4 \
vllm serve nivvis/Qwen3.5-35B-A3B-heretic-v2-eq-v1 \
  --served-model-name qwen35-35b-eq-v1 \
  --max-num-seqs 32 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code
```
**Note:** `VLLM_USE_DEEP_GEMM=0` and `VLLM_USE_FLASHINFER_SAMPLER=0` are recommended for stability with the hybrid GatedDeltaNet + MoE architecture. Adjust `--max-model-len` and `--gpu-memory-utilization` to fit your VRAM.
### Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nivvis/Qwen3.5-35B-A3B-heretic-v2-eq-v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)
```
### Tips
- Supports tool calling via the Qwen3 coder format
- Thinking/reasoning via `<think>` tags (`enable_thinking` can be toggled in the chat template)
- The `tokenizer_class` has been patched from `TokenizersBackend` to `Qwen2Tokenizer` to fix compatibility with SGLang and older inference frameworks
### GGUF
GGUF quantizations (F16 and Q4_K_M) are available at `nivvis/Qwen3.5-35B-A3B-heretic-v2-eq-v1-GGUF`.
## Limitations
- Trained specifically on empathetic support conversation style — may not generalize to all EQ tasks
- 1 epoch only — further training (more epochs, margin-DPO, think-token masking) likely to improve
- EQ-Bench 3 score measured without thinking enabled — thinking may improve or change scores
- The base heretic-v2 model has not been benchmarked separately on EQ-Bench 3, so the exact improvement from DPO vs the base fine-tune is not isolated
## Credits
- Fine-tune base: llmfan46/Qwen3.5-35B-A3B-heretic-v2 (abliterated by llmfan46)
- Original model: Qwen/Qwen3.5-35B-A3B by Alibaba Qwen team
- EQBench: eqbench.com
## Tested With
| Package | Version |
|---|---|
| vLLM | 0.17.0rc1.dev151+g4497431df |
| PyTorch | 2.10.0+cu128 |
| Transformers | 5.3.0.dev0 |
Qwen3.5 support is bleeding-edge — all three packages required dev/RC builds at time of release. These are the versions we verified with.
## Citation
If you use this model, please cite EQ-Bench:
```bibtex
@article{paech2024eqbench,
  title={EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models},
  author={Paech, Samuel J},
  journal={arXiv preprint arXiv:2312.06281},
  year={2024}
}
```
## License
Apache 2.0 (following the base Qwen3.5 license)