qwen3-4b-sft-v5g-hybrid-merged

Matsuo Lab LLM Competition 2025 (StructEval)
SFT only · No DPO · 12,686 training samples

What is this model?

Qwen3-4B-Instruct-2507 をベースに、構造化データ出力（JSON / YAML / TOML / XML / CSV）の品質を最大化するための SFT を施したモデルです。

Core Idea: Empty Think Injection

本モデルの最も重要な技術的貢献は Empty Think Injection です。

Qwen3 の thinking 機能（<think>...</think>）はデフォルトで有効であり、推論時にモデルは思考プロセスを出力してから回答を生成します。構造化データ出力タスクでは、この thinking 出力がコードフェンス（```json）や前置き文（"Here is the JSON output:"）の混入を引き起こし、パースエラーの原因となっていました。

Empty Think Injection は、SFT データの全 assistant 出力を以下の形式に統一する手法です：

<think>
</think>

{raw structured data}

元データに含まれる CoT（Chain-of-Thought: Approach: ... Output: ...）を物理削除
空の <think> ブロックを先頭に付与し、モデルが thinking フェーズを即座に終了してデータ出力に移行するよう訓練
コードフェンス、前置き文、後書きを一切含まないクリーンな出力を実現

この手法により、推論時にモデルは <think>\n</think> の後に即座に構造化データを出力します。

Dataset Construction (12,686 samples)

コンペの運営提供データセット（9種）から、ルールベースの前処理のみで構築しています。LLM によるデータ合成・改変は一切使用していません。

Source	Samples	Description
u-10bei/structured_data_with_cot_dataset_512_v4	4,608	5形式の基礎データ。CoT 削除 + Empty Think Injection を適用
daichira/structured-hard-sft-4k	4,000	深い構造・大キー数の高難度データ。Empty Think Injection を適用
Deep Structure Upsampling	3,578	daichira hard-4k から深さ4+のXML・深さ5+のYAMLを選択的に3倍増幅
CSV Dot-Notation Expansion	500	daichira hard-4k の深い YAML をルールベースで平坦化し、CSV→JSON/YAML の展開ペアを合成

Deep Structure Upsampling

daichira/structured-hard-sft-4k の中から深い構造（XML depth≥4, YAML depth≥5）を持つサンプルを選択的に3倍に増幅しています。浅いデータが支配的になることを防ぎ、深い階層の末端キーを欠損なく出力する能力を強化します。

CSV Dot-Notation Expansion

daichira/structured-hard-sft-4k の深い YAML データを Python スクリプトで逆変換し、ドット記法の平坦な CSV を入力、深い JSON/YAML を出力とする合成データ 500 件を生成しています。

例: metrics.accuracy,metrics.loss → metrics:\n accuracy: 0.95\n loss: 0.02

「ドット記法 = 深いネストの平坦化表現」という変換ルールをモデルに教え込むことで、CSV 以外のタスクでも深い構造の展開能力が汎化しました。

Training Configuration

Parameter	Value
Method	SFT via Unsloth
Epochs	2
Batch size	16 (per_device=2 × grad_accum=8)
Learning rate	5e-5
LoRA	r=64, alpha=64
Max seq length	2048
Quantization	4-bit (load_in_4bit=True)
Final eval_loss	0.1552

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "beachcities/qwen3-4b-sft-v5g-hybrid-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

License

Apache-2.0 (following base model terms)

Downloads last month: 1

Safetensors

Model size

4B params

Tensor type

BF16

Model tree for beachcities/qwen3-4b-sft-v5g-hybrid-merged

Base model

Qwen/Qwen3-4B-Instruct-2507

Finetuned

(1543)

this model

Finetunes

1 model

beachcities
/

qwen3-4b-sft-v5g-hybrid-merged