# qwen3-4b-advanced-dpo-v23-merged

DPO fine-tuned version of deepkick/qwen3-4b-advanced-sft-v13-merged.

## Method
- Base: Qwen/Qwen3-4B-Instruct-2507
- SFT Base: deepkick/qwen3-4b-advanced-sft-v13-merged (ALF 27/50, score 4.0543)
- DPO Dataset: deepkick/sft_alfworld_v5_action_format (THOUGHT/ACTION format)
- DPO Pairs: 311 samples (covering all failure patterns)
- Beta: 0.1
- LR: 5e-07
- Epochs: 1
- LoRA: r=32, alpha=128
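To show what the `Beta: 0.1` setting controls, here is a minimal sketch of the standard DPO objective for a single preference pair. This is illustrative, not this repo's actual training code (training used a DPO trainer with the hyperparameters listed above); `beta` scales the difference in log-probability ratios between the chosen and rejected responses.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen/rejected
    responses under the policy and the frozen reference model.
    beta=0.1 matches the setting listed above.
    """
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # loss = -log sigmoid(beta * margin); smaller when the policy
    # prefers the chosen response more than the reference does
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# With no preference margin the loss is log(2)
print(round(dpo_loss(0.0, 0.0, 0.0, 0.0), 3))
```

A small `beta` such as 0.1 keeps the gradient signal gentle, so the policy does not drift far from the SFT reference model.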
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "deepkick/qwen3-4b-advanced-dpo-v23-merged",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("deepkick/qwen3-4b-advanced-dpo-v23-merged")
```
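Because the DPO dataset uses a THOUGHT/ACTION response format, generated text can be split into its two parts before executing the action. A minimal sketch, assuming responses contain literal `THOUGHT:` and `ACTION:` tags; the helper name and exact regex here are illustrative, not part of this repo:

```python
import re

def parse_response(text):
    """Split a THOUGHT/ACTION-formatted response into (thought, action).

    Hypothetical helper: assumes the response uses literal
    "THOUGHT:" and "ACTION:" tags as in the DPO dataset's format.
    """
    thought = re.search(r"THOUGHT:\s*(.*?)(?=ACTION:|$)", text, re.S)
    action = re.search(r"ACTION:\s*(.*)", text, re.S)
    return (thought.group(1).strip() if thought else "",
            action.group(1).strip() if action else "")

example = "THOUGHT: The mug is likely in the cabinet.\nACTION: go to cabinet 1"
thought, action = parse_response(example)
print(action)  # → go to cabinet 1
```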