Qwen3-4B-Instruct-V12-075-DPO-Merged

1. Model Summary

llm2025_main_v12_075_dpo_merged is a fine-tuned model for the LLM2025 Main Competition. It is designed to generate structured data with high precision while maintaining strict output "silence": it emits only the requested output, with no preamble or commentary.

This version (V12) starts from the SFT base model that scored 0.75 (trained on 4,000 samples) and applies the optimized DPO recipe from the DPO3 experiment, which achieved the best score so far (0.77). By combining this strong SFT base with targeted preference alignment, the model aims to exceed a score of 0.80.

2. Methodology: High-Learning-Rate Synergy

The V12 strategy addresses earlier alignment failures:

Challenge: Earlier attempts to apply DPO to the 0.75 base did not improve its score.
Analysis: We concluded that two factors, the use of synthetic datasets and an excessively low learning rate (5e-7), prevented the model from learning the silence patterns effectively.
Solution: V12 returns to the proven dpo_train.jsonl (160 samples) and raises the learning rate tenfold, to 5e-6. This steered the model toward the desired "Perfect Silence" behavior without sacrificing logical reasoning.
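For readers unfamiliar with DPO, below is a minimal sketch of the standard (sigmoid) DPO loss for a single preference pair. It assumes the V12 run used this standard objective; the function name `dpo_loss` and the `beta=0.1` default are illustrative assumptions, not values reported for V12. The learning rate does not appear in the loss itself: it scales each optimizer step taken on the loss's gradient, which is why raising it from 5e-7 to 5e-6 can let the same 160 pairs move the model much further.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard (sigmoid) DPO loss for one preference pair.

    Each *_logp argument is the summed log-probability of a complete response,
    under either the model being trained (policy) or the frozen SFT model (reference).
    beta=0.1 is an illustrative default, not a value reported for V12.
    """
    # How much more the policy favors the chosen response over the rejected one,
    # measured relative to the reference model.
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    x = beta * margin
    # loss = -log(sigmoid(x)) = log(1 + exp(-x)), computed without overflow.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


print(round(dpo_loss(-10.0, -10.0, -10.0, -10.0), 4))  # 0.6931, i.e. log 2
```

The loss starts at log 2 when the policy matches the reference and decreases as training shifts probability toward the chosen (silent) responses.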

3. Data Origin & Compliance

This model complies strictly with the competition's ethical guidelines:

Permitted Data Only: Only datasets officially provided for the LLM2025 competition were used.
No LLM Distillation: This model does NOT use distillation from other LLMs (Gemini, GPT-4, etc.). All rejected responses were identified from errors made by earlier models, using rule-based, mechanical cleansing rather than another LLM.
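The card does not publish the cleansing rules, so the following is only a hypothetical sketch of how rule-based (non-LLM) cleansing could turn an earlier model's error into a preference pair: the erroneous output is kept as "rejected" and its mechanically cleansed form becomes "chosen". The regex rules, the helper names (`cleanse`, `make_pair`, `to_jsonl`), and the `prompt`/`chosen`/`rejected` field names are assumptions; the field names follow a common DPO dataset convention, not a confirmed schema for dpo_train.jsonl.

```python
import json
import re
from typing import Optional

# Hypothetical cleansing rules; the actual rules used for V12 are not published.
PREAMBLE = re.compile(r"^\s*(sure|certainly|here is|here's)[^\n]*\n", re.IGNORECASE)
FENCE = re.compile(r"^\s*`{3}[a-zA-Z]*\s*\n(.*?)\n\s*`{3}\s*$", re.DOTALL)


def cleanse(output: str) -> str:
    """Mechanically strip a chatty first line and a wrapping Markdown code fence."""
    text = PREAMBLE.sub("", output, count=1)
    match = FENCE.match(text)
    if match:
        text = match.group(1)
    return text.strip()


def make_pair(prompt: str, model_output: str) -> Optional[dict]:
    """Turn one erroneous earlier output into a preference record.

    The original output becomes "rejected" and its cleansed form "chosen".
    Returns None when the rules change nothing, i.e. there is no error to learn from.
    """
    chosen = cleanse(model_output)
    if chosen == model_output.strip():
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": model_output}


def to_jsonl(records: list) -> str:
    """Serialize records one JSON object per line, as in a .jsonl file."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
```

Because the rules are deterministic string operations, every pair can be traced back to a specific earlier output, which is what makes this approach distinct from LLM distillation.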

4. How to Use

You can use this model with the transformers library. The weights are fully merged, so the example below does not need peft.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "satoyutaka/llm2025_main_v12_075_dpo_merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # weights are stored in F16
    device_map="auto",          # requires the accelerate package
)

# Example task, formatted with the tokenizer's built-in chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Generate a JSON object representing a person with name 'John' and age 30."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)  # greedy decoding

# Decode only the newly generated tokens, skipping the prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
# Expected output: {"name": "John", "age": 30}
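The competition's scoring is not described in this card, so the following is only a hypothetical local check for the "silence" property on JSON tasks. It accepts a response only if the entire string parses as a JSON object or array, with no preamble, trailing commentary, or Markdown code fence. The name `is_silent_json` is illustrative.

```python
import json


def is_silent_json(response: str) -> bool:
    """Hypothetical "silence" check for JSON tasks.

    Returns True only if the whole response (ignoring surrounding whitespace)
    is a single JSON object or array with no extra text around it.
    """
    try:
        value = json.loads(response.strip())
    except json.JSONDecodeError:
        # Preambles, trailing remarks, and code fences all fail to parse.
        return False
    return isinstance(value, (dict, list))


print(is_silent_json('{"name": "John", "age": 30}'))         # True
print(is_silent_json('Here is the JSON: {"name": "John"}'))  # False
```

Applied to the decoded string printed by the usage example, this should return True when the model produces the expected output.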

5. Metadata

Experiment ID: exp_20260227_v12_dpo_on_075
SFT Baseline: 0.75191
Target: 0.79+
Precision: float16

6. License

License: Apache-2.0, in accordance with the terms of the original datasets and the base model.



Format: Safetensors
Model size: 4B params
Tensor type: F16


Base model: Qwen/Qwen3-4B-Instruct-2507