Qwen3.5-27B Marvin V2 Stage 4: Antirep DPO

Fine-tuned Qwen3.5-VL-27B-A3B for creative writing and roleplay with targeted repetition suppression via DPO.

Training Pipeline

This model is the result of a 4-stage training pipeline:

  1. Stage 1 (CPT): Continual pre-training on 87M tokens of info-dense creative writing data
  2. Stage 2 (CPT): Second CPT pass on 38M tokens of high-quality Marvin-style creative prose
  3. Stage 2.5 (Thinking SFT, EP2): Instruction tuning with thinking traces (7,522 samples, 2 epochs)
  4. Stage 4 (Antirep DPO, this model): Pure anti-repetition DPO targeting visible text repetition

Antirep DPO Details

  • 268 DPO pairs with 53x average repetition contrast ratio
  • Chosen: Generated from Stage 4 Masked DPO model (median rep3g=51)
  • Rejected: EP2-induced repetition + Stage 3 V2 naturally repetitive outputs (median rep3g=920)
  • Filter: Only pairs where chosen_rep < rejected_rep * 0.5
  • mask_thinking: true (DPO loss applied only to visible text, not <think> blocks)
  • 96.5% repetition reduction vs EP2 baseline (rep3g: 839 → 29 on 5-prompt test)
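
The pair filter and the reported reduction can be sketched as follows. The card does not define rep3g, so this sketch assumes it counts word-level 3-gram occurrences that repeat an earlier 3-gram in the visible text. The helper names (strip_think, rep3g, keep_pair, reduction_pct) are hypothetical and are not taken from the author's pipeline.

```python
import re
from collections import Counter

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_think(text: str) -> str:
    """Drop <think>...</think> blocks so only the visible response is scored."""
    return THINK_RE.sub("", text).strip()

def rep3g(text: str) -> int:
    """ASSUMPTION: count of word-level 3-gram occurrences that repeat an
    earlier 3-gram in the same visible text (the card gives no definition)."""
    words = strip_think(text).lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    counts = Counter(trigrams)
    # Each 3-gram seen n times contributes n - 1 repeats.
    return sum(n - 1 for n in counts.values())

def keep_pair(chosen: str, rejected: str, ratio: float = 0.5) -> bool:
    """Filter from the card: keep only if chosen_rep < rejected_rep * 0.5."""
    return rep3g(chosen) < rep3g(rejected) * ratio

def reduction_pct(before: float, after: float) -> float:
    """Relative reduction in percent, e.g. for rep3g 839 -> 29."""
    return 100.0 * (before - after) / before
```

Under this definition a pair whose two sides both score zero is dropped, because the strict inequality fails. The reduction helper reproduces the reported figure: (839 - 29) / 839 rounds to 96.5%.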

Training Configuration

  • Base: Stage 2.5 Thinking SFT EP2 (NOT the masked DPO model)
  • QLoRA: r=32, alpha=32, rsLoRA, nf4 quantization
  • DPO: beta=0.1, sigmoid loss
  • LR: 5e-6 constant with warmup
  • Optimizer: paged AdamW 8-bit
  • 1 epoch, 67 steps total
  • Flash Attention 2, gradient checkpointing
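
A few numbers follow from the configuration above, and the sigmoid DPO objective can be written out in a few lines. This is a minimal sketch using only the standard library, not the author's training code. The effective batch size assumes all 268 pairs are trained on, with no held-out split. The masked sum is one plausible reading of mask_thinking, and all function names are hypothetical.

```python
import math

# Values taken from the card.
NUM_PAIRS, NUM_STEPS, EPOCHS = 268, 67, 1
LORA_R, LORA_ALPHA, BETA = 32, 32, 0.1

def effective_batch_size(pairs: int, steps: int, epochs: int = 1) -> float:
    """Pairs consumed per optimizer step, ASSUMING every pair is trained on."""
    return pairs * epochs / steps

def lora_scale(alpha: float, r: int, rslora: bool) -> float:
    """Adapter scaling: alpha / sqrt(r) with rsLoRA, alpha / r with plain LoRA."""
    return alpha / math.sqrt(r) if rslora else alpha / r

def masked_logprob(token_logps, visible_mask) -> float:
    """Sum token log-probs over visible tokens only (1 = visible, 0 = <think>).
    ASSUMPTION: this is how a mask_thinking option could restrict the loss."""
    return sum(lp for lp, m in zip(token_logps, visible_mask) if m)

def dpo_sigmoid_loss(pol_c, pol_r, ref_c, ref_r, beta: float = BETA) -> float:
    """Sigmoid DPO loss for one pair, from policy/reference sequence log-probs."""
    margin = beta * ((pol_c - ref_c) - (pol_r - ref_r))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

With these values, 268 pairs over 67 steps implies an effective batch size of 4. With r=32 and alpha=32, rsLoRA scales the adapter by about 5.66 rather than the 1.0 that plain LoRA would use.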

Usage

This is a Qwen3.5-VL-27B model. Use with any Qwen3.5-compatible inference engine.

The model uses ChatML format with <think> blocks for reasoning:

<|im_start|>system
You are a creative writing assistant.<|im_end|>
<|im_start|>user
Write a scene about...<|im_end|>
<|im_start|>assistant
<think>
[reasoning here]
</think>

[visible response here]<|im_end|>
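
The template above can be assembled and parsed with two small helpers. This is a sketch: the function names are hypothetical, and the exact whitespace is assumed from the layout shown above. Where the tokenizer ships its own chat template, that template is the safer choice.

```python
import re
from typing import Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def build_prompt(system: str, user: str) -> str:
    """Assemble a ChatML prompt that ends where the assistant turn begins."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

def split_reply(reply: str) -> Tuple[str, str]:
    """Split a raw completion into (reasoning, visible_text).
    Returns an empty reasoning string if no <think> block is present."""
    reply = reply.replace("<|im_end|>", "")
    match = THINK_RE.search(reply)
    if match is None:
        return "", reply.strip()
    reasoning = match.group(1).strip()
    # Everything after the closing </think> tag is the visible response.
    visible = reply[match.end():].strip()
    return reasoning, visible
```

A frontend would typically show only the second element of the tuple and keep the reasoning for debugging.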

Quantized Versions

A Q4_K_M GGUF is available at ToastyPigeon/Qwen3.5-Test-GGUFs.
