qwen3-4b-advanced-dpo-v27-merged

This model is a DPO fine-tuned version of deepkick/qwen3-4b-advanced-sft-v13-merged. v27: rejected responses are sampled with a deliberate bias toward the task types that failed in v21 (pick_two/examine/clean/cool/heat).

Method

  • Base: Qwen/Qwen3-4B-Instruct-2507
  • SFT Base: deepkick/qwen3-4b-advanced-sft-v13-merged (ALF 27/50, score 4.0543)
  • DPO: v5 dataset success/failure pairs (311 samples)
  • Beta: 0.1
  • LR: 5e-07
  • Epochs: 1
  • LoRA: r=32, alpha=128
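As a rough illustration of what the Beta hyperparameter above controls (this is not the training code; the log-ratio values are hypothetical), the DPO objective scales the gap between the chosen and rejected policy/reference log-ratios before the sigmoid:

```python
import math

def dpo_loss(beta: float, chosen_logratio: float, rejected_logratio: float) -> float:
    """DPO loss for one preference pair.

    Each logratio is log p_policy(y|x) - log p_ref(y|x) for the chosen
    or rejected completion; the loss is -log sigmoid(beta * margin).
    """
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With beta = 0.1 (as in this run) and hypothetical log-ratios:
print(round(dpo_loss(0.1, chosen_logratio=2.0, rejected_logratio=-1.0), 4))  # → 0.5544
```

A small beta such as 0.1 keeps the per-pair gradient gentle, which suits a short one-epoch pass over only 311 pairs.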

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "deepkick/qwen3-4b-advanced-dpo-v27-merged",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("deepkick/qwen3-4b-advanced-dpo-v27-merged")
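Continuing from the snippet above, the model can be prompted through the tokenizer's chat template; a minimal sketch, where the prompt text and generation settings are illustrative rather than taken from the model card:

```python
# Assumes `model` and `tokenizer` were loaded as shown above.
messages = [{"role": "user", "content": "You are in a kitchen. Find and heat the egg."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the echoed prompt.
reply = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(reply)
```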
Model size: 4B params · Tensor type: BF16 (Safetensors)
