# qwen3-4b-advanced-dpo-v27-merged
This model is a DPO fine-tuned version of deepkick/qwen3-4b-advanced-sft-v13-merged.
v27: rejected samples are oversampled from the task types that failed in v21 (pick_two/examine/clean/cool/heat).
## Method
- Base: Qwen/Qwen3-4B-Instruct-2507
- SFT Base: deepkick/qwen3-4b-advanced-sft-v13-merged (ALF 27/50, score 4.0543)
- DPO: v5 dataset success/failure pairs (311 samples)
- Beta: 0.1
- LR: 5e-07
- Epochs: 1
- LoRA: r=32, alpha=128
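The pairwise objective behind these hyperparameters can be sketched in plain Python (β = 0.1 as listed above; the log-probabilities below are hypothetical placeholders, not values from this model):

```python
import math

def dpo_loss(policy_chosen_lp: float, policy_rejected_lp: float,
             ref_chosen_lp: float, ref_rejected_lp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) preference pair.

    Inputs are summed token log-probabilities of each response under
    the policy being trained and the frozen reference (SFT) model.
    """
    # Implicit reward margins relative to the reference model
    chosen_margin = policy_chosen_lp - ref_chosen_lp
    rejected_margin = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_margin - rejected_margin)
    # -log sigmoid(logits): shrinks as the chosen response is
    # preferred over the rejected one by a wider margin
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# At initialization (policy == reference) the loss is log 2
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # ≈ 0.6931
```

In an actual run these per-pair losses are computed in batches by a trainer (e.g. TRL's `DPOTrainer`) over the 311 success/failure pairs, with the LoRA adapter (r=32, alpha=128) as the only trainable parameters.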
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "deepkick/qwen3-4b-advanced-dpo-v27-merged",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("deepkick/qwen3-4b-advanced-dpo-v27-merged")
```
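The loading code alone does not run inference; a minimal generation sketch could look like the following (the prompt and generation settings are illustrative assumptions, not from the model card):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "deepkick/qwen3-4b-advanced-dpo-v27-merged"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical household-task prompt in the style of the failure
# task types listed above (heat/cool/clean/...)
messages = [
    {"role": "user",
     "content": "How would you heat an egg and put it on the countertop?"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the prompt
response = tokenizer.decode(
    output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
)
print(response)
```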