Qwen3.5-27B Marvin V2 Stage 4 - Antirep DPO
Fine-tuned Qwen3.5-VL-27B-A3B for creative writing and roleplay with targeted repetition suppression via DPO.
Training Pipeline
This model is the result of a 4-stage training pipeline:
- Stage 1 - CPT: Continual pre-training on 87M tokens of info-dense creative writing data
- Stage 2 - CPT: Second CPT pass on 38M tokens of high-quality Marvin-style creative prose
- Stage 2.5 - Thinking SFT (EP2): Instruction tuning with thinking traces (7,522 samples, 2 epochs)
- Stage 4 - Antirep DPO (this model): Pure anti-repetition DPO targeting repetition in the visible text
Antirep DPO Details
- 268 DPO pairs with 53x average repetition contrast ratio
- Chosen: Generated from Stage 4 Masked DPO model (median rep3g=51)
- Rejected: EP2-induced repetition + Stage 3 V2 naturally repetitive outputs (median rep3g=920)
- Filter: Only pairs where chosen_rep < rejected_rep * 0.5
- mask_thinking: true - DPO loss applied only to visible text, not <think> blocks
- 96.5% repetition reduction vs EP2 baseline (rep3g: 839 → 29 on 5-prompt test)
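The pair filter above can be sketched in a few lines of Python. This is only an illustration: the card does not define rep3g, so the sketch assumes it counts every word 3-gram occurrence beyond the first, and it assumes the metric is computed on the visible text only. The helper names (`visible_text`, `rep3g`, `keep_pair`) are hypothetical and not taken from the training code.

```python
import re
from collections import Counter

# Matches a complete <think>...</think> block, including newlines.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def visible_text(response: str) -> str:
    """Drop <think> blocks so only the visible text is scored (assumption)."""
    return THINK_RE.sub("", response).strip()


def rep3g(text: str) -> int:
    """Assumed metric: number of word 3-gram occurrences beyond the first."""
    words = text.lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    counts = Counter(trigrams)
    return sum(c - 1 for c in counts.values() if c > 1)


def keep_pair(chosen: str, rejected: str, ratio: float = 0.5) -> bool:
    """Filter from the list above: chosen_rep < rejected_rep * 0.5."""
    chosen_rep = rep3g(visible_text(chosen))
    rejected_rep = rep3g(visible_text(rejected))
    return chosen_rep < rejected_rep * ratio
```

Because the comparison is strict, a pair in which both sides score 0 is dropped, which matches the stated filter as written.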
Training Configuration
- Base: Stage 2.5 Thinking SFT EP2 (not the masked DPO model)
- QLoRA: r=32, alpha=32, rsLoRA, nf4 quantization
- DPO: beta=0.1, sigmoid loss
- LR: 5e-6 constant with warmup
- Optimizer: paged AdamW 8-bit
- 1 epoch, 67 steps total
- Flash Attention 2, gradient checkpointing
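For readers unfamiliar with the settings above, here is a minimal sketch of what beta=0.1 with sigmoid loss and rsLoRA with r=32, alpha=32 mean numerically. It uses the standard sigmoid DPO loss and the standard rsLoRA scaling (alpha / sqrt(r)), and it illustrates thinking-masking as a per-token mask over log-probabilities. The function names and the mask representation are assumptions made for illustration, not the actual training code.

```python
import math


def masked_logprob(token_logps, visible_mask):
    """Sum token log-probs over visible tokens only.

    visible_mask holds 1 for visible tokens and 0 for tokens inside <think>
    (hypothetical representation of mask_thinking).
    """
    return sum(lp for lp, m in zip(token_logps, visible_mask) if m)


def dpo_sigmoid_loss(policy_chosen, policy_rejected,
                     ref_chosen, ref_rejected, beta=0.1):
    """Standard sigmoid DPO loss: -log(sigmoid(beta * margin))."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def lora_scale(alpha, r, rslora):
    """LoRA update scaling: alpha / sqrt(r) with rsLoRA, alpha / r otherwise."""
    return alpha / math.sqrt(r) if rslora else alpha / r
```

With r=32 and alpha=32, rsLoRA scales the adapter update by sqrt(32), roughly 5.66, where plain LoRA would use 1.0. That larger effective update is worth keeping in mind when reading the 5e-6 learning rate.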
Usage
This is a Qwen3.5-VL-27B model. Use with any Qwen3.5-compatible inference engine.
The model uses ChatML format with <think> blocks for reasoning:
<|im_start|>system
You are a creative writing assistant.<|im_end|>
<|im_start|>user
Write a scene about...<|im_end|>
<|im_start|>assistant
<think>
[reasoning here]
</think>
[visible response here]<|im_end|>
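If your inference engine does not apply the chat template for you, the format above can be assembled and parsed by hand. The sketch below assumes the prompt ends right after the assistant header and that the model emits the <think> block itself; check your engine's template before relying on that. `build_prompt` and `split_response` are hypothetical helper names.

```python
import re


def build_prompt(system: str, user: str) -> str:
    """Assemble a ChatML prompt that ends at the open assistant turn."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


def split_response(raw: str):
    """Split raw model output into (reasoning, visible_text)."""
    raw = raw.replace("<|im_end|>", "").strip()
    match = re.match(r"<think>(.*?)</think>(.*)", raw, re.DOTALL)
    if match is None:
        # No complete <think> block: treat everything as visible text.
        return "", raw
    return match.group(1).strip(), match.group(2).strip()
```

Separating the reasoning from the visible text is useful for roleplay frontends, which typically show only the visible part.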
Quantized Versions
Q4_K_M GGUF available at ToastyPigeon/Qwen3.5-Test-GGUFs