Qwen3.5-35B-A3B-heretic-v2-eq-v1

A DPO fine-tune of llmfan46/Qwen3.5-35B-A3B-heretic-v2 optimized for emotional intelligence and empathetic conversation quality, while preserving the uncensored nature of the heretic abliteration.

35B total parameters, ~3B active per token (256 experts, 8 routed + 1 shared). Hybrid GatedDeltaNet + full attention architecture with 262K native context.

Honest note: This is a v1 — the goal was getting this new MoE architecture working end-to-end and seeing where EQ lands, not exhaustive hyperparameter search. The results are encouraging enough to share. Expect v2 to be more deliberately optimized.

EQ-Bench 3 Results

Rubric Score: 83.85 (judge: claude-3.7-sonnet)

+6 points over the vanilla Qwen3.5-35B-A3B baseline (77.85), placing it above every Qwen model on the EQ-Bench 3 leaderboard — including models 7x its size:

| Model | Params (active) | Rubric Score | Elo |
|---|---|---|---|
| claude-sonnet-4-6 | — | — | 1865.0 |
| claude-opus-4-6 | — | — | 1846.6 |
| gpt-5.4 | — | — | 1675.8 |
| Qwen3.5-35B-A3B-heretic-v2-eq-v1 (ours) | 3B | 83.85 | — |
| Qwen3.5-27B dense | 27B | 83.05 | — |
| Qwen3-235B-A22B | 22B | 80.90 | 1271.6 |
| QwQ-32B | 32B | 79.90 | 1209.9 |
| Qwen3.5-35B-A3B (baseline) | 3B | 77.85 | — |
| Qwen3-32B | 32B | 74.30 | 1006.6 |
| Qwen3-30B-A3B | 3B | 66.00 | 733.5 |

Note on judge model: We use claude-3.7-sonnet as the judge because that is what the public EQ-Bench 3 leaderboard uses for this family of models, making our scores directly comparable to published results. We plan to publish updated benchmarks with newer judge models (including Opus) in the future.

Per-Dimension Scores (raw, 0-20 scale)

Strengths

| Dimension | Score |
|---|---|
| Analytical | 18.3 |
| Demonstrated Empathy | 17.8 |
| Depth of Insight | 17.4 |
| Emotional Reasoning | 17.3 |
| Humanlike | 17.2 |
| Subtext Identification | 17.1 |
| Pragmatic EI | 17.0 |
| Validating | 16.7 |
| Social Dexterity | 16.4 |
| Message Tailoring | 16.3 |
| Theory of Mind | 16.2 |

Moderate

| Dimension | Score |
|---|---|
| Correctness | 15.8 |
| Conversational | 15.7 |
| Warmth | 15.5 |
| Safety Conscious | 15.0 |
| Boundary Setting | 14.2 |
| Compliant | 13.9 |

Future Work

| Dimension | Score | Notes |
|---|---|---|
| Challenging | 13.1 | Could push back more when appropriate |
| Intellectual Grounding | 11.3 | Room to improve factual anchoring in emotional contexts |
| Reactive | 10.1 | Under-reacts to provocative/extreme inputs |
| Moralising | 7.9 | Low is generally good — avoids lecturing |
| Sycophantic | 6.9 | Low is good, but could still improve |

Sycophancy and reactivity are the primary targets for v2.

Training Data

Trained on nivvis/eq-dpo — turn-level DPO pairs from synthetic emotional support conversations:

  • Chosen: Gold supporter responses from top-Elo conversations (generated by Claude Sonnet/Opus, GPT-5.4)
  • Rejected: Qwen's own responses given the same conversation prefix (in-distribution)
  • Source conversations: Ranked via Elo tournament with logit-probe A/B judging
  • Margin filtering: Pairs where the judge couldn't distinguish (|P(gold) - P(alt)| < 0.15) were dropped
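The margin filter above can be sketched in a few lines. This is an illustrative reconstruction from the description, not the actual pipeline code; only the 0.15 threshold comes from the dataset card:

```python
MARGIN = 0.15  # threshold stated above

def keep_pair(p_gold: float, p_alt: float, margin: float = MARGIN) -> bool:
    """Keep a DPO pair only if the judge clearly preferred one side."""
    return abs(p_gold - p_alt) >= margin

pairs = [
    {"p_gold": 0.92, "p_alt": 0.08},  # clear preference: kept
    {"p_gold": 0.55, "p_alt": 0.45},  # judge can't distinguish: dropped
]
kept = [p for p in pairs if keep_pair(p["p_gold"], p["p_alt"])]
```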

What the DPO fixes

The base Qwen model's main failure modes in empathetic conversation:

  • Repetitive "It sounds like..." openers
  • Verbose, over-structured responses (reflect → validate → explore → support every turn)
  • Same emotional register regardless of context
  • Therapist-parody voice instead of authentic warmth

The DPO-tuned model produces more natural, varied, and authentic responses.
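One crude way to quantify the first failure mode (repetitive openers) is to measure how often responses share the same opening words. This metric is our own illustration, not part of the training or evaluation pipeline; the function name and 3-word window are arbitrary choices:

```python
from collections import Counter

def opener_repetition_rate(responses: list[str], n_words: int = 3) -> float:
    """Fraction of responses that share the single most common n-word opener.
    A rough proxy for the 'It sounds like...' failure mode."""
    openers = [" ".join(r.lower().split()[:n_words]) for r in responses if r.strip()]
    if not openers:
        return 0.0
    most_common_count = Counter(openers).most_common(1)[0][1]
    return most_common_count / len(openers)

base_style = [
    "It sounds like you're under a lot of stress.",
    "It sounds like work has been overwhelming.",
    "It sounds like that conversation hurt.",
]
# every response opens identically, so the rate is 1.0
```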

Training Details

  • Hardware: NVIDIA RTX PRO 6000 Blackwell 96GB
  • Framework: Unsloth + TRL DPOTrainer
  • Precision: bf16 LoRA (Unsloth silently skipped FP8 for this architecture)
  • Hyperparameters:
    • Learning rate: 5e-6
    • Beta (DPO temperature): 0.1
    • Batch size: 1, gradient accumulation: 8
    • Optimizer: AdamW 8-bit
    • Max sequence length: 4096
    • LoRA rank: 32, alpha: 64
    • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, gate_up_proj
  • Training time: ~2.2 hours (1 epoch, 2,592 train pairs)
  • Final loss: 0.233 (mean), rewards accuracy: 100%
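As a quick sanity check on the run shape (assuming a single GPU, as listed above), the hyperparameters imply:

```python
# Derived from the hyperparameters above; single-GPU run assumed.
train_pairs = 2592
per_device_batch = 1
grad_accum = 8

# Each optimizer step consumes batch_size * grad_accum preference pairs.
effective_batch = per_device_batch * grad_accum  # 8
# One epoch over the training split (remainder batch ignored).
steps_per_epoch = train_pairs // effective_batch  # 324
```

About 324 optimizer steps total, which is consistent with a single epoch finishing in roughly two hours on one GPU.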

Architecture

  • Base: Qwen3.5-35B-A3B MoE (256 experts, 8 active per token, ~3B active parameters)
  • Fine-tune: heretic-v2 by llmfan46 (abliterated)
  • DPO: 1 epoch, 2,880 turn-level preference pairs (2,592 used for training), bf16 LoRA (r=32, alpha=64)
  • Layers: 40 transformer blocks
  • Attention: Hybrid — 3:1 GatedDeltaNet (linear) to full attention
  • MoE: 256 experts, top-8 routing + 1 shared expert per token
  • Context: 262K native (1M+ with YaRN)
  • Precision: bfloat16
  • Format: bf16 safetensors, merged (no adapter needed)

Usage

vLLM

VLLM_USE_DEEP_GEMM=0 \
VLLM_USE_FLASHINFER_SAMPLER=0 \
OMP_NUM_THREADS=4 \
vllm serve nivvis/Qwen3.5-35B-A3B-heretic-v2-eq-v1 \
  --served-model-name qwen35-35b-eq-v1 \
  --max-num-seqs 32 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code

Note: VLLM_USE_DEEP_GEMM=0 and VLLM_USE_FLASHINFER_SAMPLER=0 are recommended for stability with the hybrid GatedDeltaNet + MoE architecture. Adjust --max-model-len and --gpu-memory-utilization to fit your VRAM.

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nivvis/Qwen3.5-35B-A3B-heretic-v2-eq-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)

Tips

  • Supports tool calling via the Qwen3 coder format
  • Thinking/reasoning via <think> tags (enable_thinking can be toggled in the chat template)
  • The tokenizer_class has been patched from TokenizersBackend to Qwen2Tokenizer to fix compatibility with SGLang and older inference frameworks
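If your client does not strip reasoning automatically, the <think> spans can be removed post hoc. A minimal sketch; the regex and function name are our own convenience, not part of the model's tooling:

```python
import re

# Remove a <think>...</think> span plus any trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(text: str) -> str:
    """Return only the user-visible answer from a completion."""
    return THINK_RE.sub("", text).strip()

raw = "<think>The user sounds anxious; keep it brief.</think>That makes sense. What part worries you most?"
# strip_thinking(raw) == "That makes sense. What part worries you most?"
```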

GGUF

GGUF quantizations (F16 and Q4_K_M) are available at nivvis/Qwen3.5-35B-A3B-heretic-v2-eq-v1-GGUF.

Limitations

  • Trained specifically on empathetic support conversation style — may not generalize to all EQ tasks
  • 1 epoch only — further training (more epochs, margin-DPO, think-token masking) would likely improve results
  • EQ-Bench 3 score measured without thinking enabled — thinking may improve or change scores
  • The base heretic-v2 model has not been benchmarked separately on EQ-Bench 3, so the exact improvement from DPO vs the base fine-tune is not isolated

Tested With

Package Version
vLLM 0.17.0rc1.dev151+g4497431df
PyTorch 2.10.0+cu128
Transformers 5.3.0.dev0

Qwen3.5 support is bleeding-edge — all three packages required dev/RC builds at time of release. These are the versions we verified with.

Citation

If you use this model, please cite EQ-Bench:

@misc{paech2023eqbench,
  title={EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models},
  author={Samuel J. Paech},
  year={2023},
  eprint={2312.06281},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

License

Apache 2.0 (following the base Qwen3.5 license)
