Qwen3-4B-AgentBench-llm2025_advance_v5_ironguard

This is the V5 "Iron Guard" version of the specialized agent model for AgentBench-comp, based on the Qwen3-4B architecture.
This version utilizes DPO (Direct Preference Optimization) to enforce extreme discipline, eliminate unnecessary verbosity, and ensure 100% adherence to task-specific action formats.

🛡️ "Iron Guard" Strategy (V5)

After analyzing V4's failure (where the 7B model suffered from "over-thinking" and timeouts), V5 returns to the agile 4B architecture with a primary focus on Discipline.

Zero Filler (Talk-less): DPO-trained to remove all conversational filler (e.g., "Certainly!", "I can help with that," "Have a nice day").
Strict Format Enforcement: Optimized for the Thought -> Action -> Observation loop. It ensures that every response ends with a valid Action: or Final Answer: as required by the AgentBench evaluator.
Latency Optimized: By producing concise "Thoughts" and immediately jumping to "Actions," the model minimizes token usage and drastically reduces the risk of timeouts.
Improved Context Persistence: Trained with max_length=2048 and full Attention gate updates (q, k, v, o_proj) to maintain complex environment states over long trajectories.

🧠 Training Methodology: DPO (Direct Preference Optimization)

Unlike previous SFT-only versions, V5 uses DPO to directly penalize "bad behaviors":

Chosen (Desired): Concise reasoning directly followed by a valid action.
Rejected (Undesired): Verbose, chatty, or recursive reasoning that delays action execution.
Hyperparameters:
- beta=0.1 for subtle but firm style alignment.
- max_length=2048 to capture long trajectory context.
- target_modules: Full Attention layers (q_proj, k_proj, v_proj, o_proj).

Usage (Standard Transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "satoyutaka/Qwen3-4B-AgentBench-llm2025_advance_1st"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Optimized for ReAct format: Thought -> Action -> Observation

📊 Dataset Policy

100% Synthetic & Compliant: 157 manually curated DPO pairs generated using a Qwen3-30B teacher model. No original AgentBench data was used.
Domain Focus: Balanced between DBBench (SQL generation) and ALFWorld (Action planning in simulated environments).

[日本語訳] Qwen3-4B-AgentBench-llm2025_advance_v5_ironguard

本モデルは、AgentBench-compにおける最高スコア（4.5〜5.0）獲得を目指して開発された、Qwen3-4Bベースの第5世代（V5）エージェントモデルです。
DPO（Direct Preference Optimization）を適用することで、「鉄の規律（Iron Guard）」を備え、無駄な発話（お喋り）を完全に排除し、タスク遂行に特化した出力スタイルを確立しました。

🛡️ "Iron Guard"（鉄の規律）戦略

高知能ながら「考えすぎ」によるタイムアウトやフォーマット違反が多発したV4の反省を活かし、V5では軽量な4Bモデルへの回帰と、徹底した規律の定着を最優先しました。

徹底した「お喋り」の排除: 「承知いたしました」「こんにちは」といった、エージェントにとって不要な挨拶やメタ発言をDPOによって抑制。
厳格なフォーマット遵守: Thought -> Action -> Observation のループを完璧に守ります。AgentBenchの評価システムが要求する Action: や Final Answer: を、迷いなく正確な書式で出力します。
低遅延・高スループット: 思考（Thought）を簡潔にまとめ、即座にアクションに移行することで、トークン消費量を抑え、運営サーバーでのタイムアウトリスクを最小限に抑えています。
コンテキスト維持能力の向上: max_length=2048 および Attention全層（q, k, v, o_proj）の更新により、多段階の探索が必要なタスクでも環境情報を正確に保持します。

🧠 学習プロセス: DPO (Direct Preference Optimization)

従来のSFT（教師あり学習）に加え、DPOを用いることで「何が悪い行動か」を直接モデルに学習させました。

Chosen（採用された例）: 簡潔な推論の後、即座に正しいアクションが続く軌跡。
Rejected（棄却された例）: 冗長で喋りすぎ、あるいはアクションの出力が遅れる、またはループしてしまう軌跡。
ハイパーパラメータ:
- beta=0.1: 推論能力を損なわずにスタイルを鮮鋭化。
- max_length=2048: 複雑な推論過程を切り捨てずに学習。
- target_modules: Attention層すべてを対象とし、文脈保持とフォーマット形成の両面を強化。

使い方 (Standard Transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "satoyutaka/Qwen3-4B-AgentBench-llm2025_advance_1st"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# ReAct形式（Thought -> Action -> Observation）に最適化されています

📊 データ生成ポリシー

100%合成データ: 大会規約を完全に遵守し、Qwen3-30B教師モデルを用いて生成した157組のDPO用ペアデータを使用。
ドメイン: DBBench（SQL生成能力）およびALFWorld（シミュレーション環境での行動計画）の双方に最適化。

Downloads last month: 2

Safetensors

Model size

4B params

Tensor type

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support