exp028b-downproj-dpo-lr1e6-ep2-merged

Merged SFT + DPO model, shipped as full 16-bit weights; no adapter loading is required.
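Because the weights are already merged, the checkpoint can be loaded directly with `transformers` — no PEFT adapter step. A minimal sketch (the `load_model` helper is illustrative, not part of the repo):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO_ID = "tomofusa/exp028b-downproj-dpo-lr1e6-ep2-merged"

def load_model(repo_id: str = REPO_ID):
    """Load the merged F16 checkpoint directly; no adapter/merge step needed."""
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype=torch.float16,  # weights are stored in F16
        device_map="auto",
    )
    return tokenizer, model
```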

Training Pipeline

  1. SFT: tomofusa/exp021b-blend-h-lora
  2. DPO: u-10bei/dpo-dataset-qwen-cot (2 epochs, lr=1e-06, beta=0.1)

DPO Configuration

  • Learning rate: 1e-06
  • Beta: 0.1
  • Loss type: ipo
  • LoRA: r=64, alpha=128
  • Max length: 1024
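The hyper-parameters above can be mapped onto TRL and PEFT config objects. This is an illustrative sketch assuming the `trl.DPOConfig` and `peft.LoraConfig` APIs; the output directory is hypothetical and not from the card:

```python
from trl import DPOConfig
from peft import LoraConfig

# LoRA setup from the card: r=64, alpha=128
peft_config = LoraConfig(r=64, lora_alpha=128, task_type="CAUSAL_LM")

dpo_args = DPOConfig(
    learning_rate=1e-6,
    beta=0.1,
    loss_type="ipo",        # the card lists IPO loss despite the DPO naming
    max_length=1024,
    num_train_epochs=2,
    output_dir="dpo-out",   # hypothetical path
)
```

Note that although the run is labeled DPO, the loss type is `ipo`, i.e. the IPO objective trained through TRL's DPO trainer.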
Model Details

  • Format: Safetensors
  • Model size: 4B params
  • Tensor type: F16
