Qwen2.5-7B-AgentBench-V4-BF16

This is the V4 Advanced model, a high-precision (BF16) variant of the agent model fine-tuned from Qwen2.5-7B-Instruct. It is designed for high accuracy and long-context understanding in the AgentBench-comp evaluation environment.

Reasoning capability has been maximized through strict data curation and an extended context length, with the aim of solving complex multi-step tasks without generation errors.

⚡ V4: What Changed from V3

Item	V3-BF16	V4-BF16
Context Length	2048	4096 (handles longer ALFWorld trajectories)
Dataset	Standard SFT	Iron Guard (171 strictly curated high-quality trajectories)
Training Method	Validation enabled	Optimized SFT (batch size 1, gradient accumulation 4, no validation overhead)
Target Focus	General agent	Aggregations and complex planning (SQL SUM/COUNT, long navigation)
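The batch settings in the table imply an effective batch size, and together with the iteration count and dataset size quoted elsewhere in this card (1500 iterations, 171 trajectories) they give a rough number of passes over the data. The sketch below works through that arithmetic. The card does not say whether one "iteration" means one micro-batch or one optimizer step, so both readings are computed; the class and field names are hypothetical and not taken from the training script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SftSchedule:
    """Schedule figures quoted in the model card (names are illustrative)."""
    micro_batch_size: int = 1   # "Batch 1"
    grad_accum_steps: int = 4   # "GradAcc 4"
    iterations: int = 1500      # "Iter 1500"
    dataset_size: int = 171     # curated trajectories

    @property
    def effective_batch_size(self) -> int:
        # Gradients from several micro-batches are accumulated before one update.
        return self.micro_batch_size * self.grad_accum_steps

    def epochs(self, iteration_is_optimizer_step: bool) -> float:
        """Approximate passes over the dataset under either reading of 'iteration'."""
        if iteration_is_optimizer_step:
            samples_per_iteration = self.effective_batch_size
        else:
            samples_per_iteration = self.micro_batch_size
        return self.iterations * samples_per_iteration / self.dataset_size


schedule = SftSchedule()
print(schedule.effective_batch_size)                                 # 4
print(round(schedule.epochs(iteration_is_optimizer_step=False), 1))  # 8.8
print(round(schedule.epochs(iteration_is_optimizer_step=True), 1))   # 35.1
```

Under either reading the model sees each curated trajectory several times, which is consistent with a small, heavily filtered dataset.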

🚀 Key Features

Extended Context (4096 tokens): Lets the model keep long trial-and-error trajectories in context all the way to the final answer.
"Iron Guard" Protocol: Trained exclusively on a strictly filtered dataset of 171 trajectories, with the aim of reducing hallucinations and formatting errors.
Targeted Logic: Focuses heavily on SQL aggregation (SUM/COUNT) and complex ALFWorld navigation patterns, with strategically added Japanese logic ("JP Spice").
Parameter-Efficient Training: LoRA fine-tuning brought the loss down to 0.192 over 1500 iterations on a Mac M4.
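For readers unfamiliar with the aggregation tasks mentioned above, here is a minimal illustration of a SUM/COUNT query. The table, its columns, and its rows are invented for this example and do not come from the training data or from the benchmark; Python's built-in sqlite3 module is used only to keep the snippet self-contained, and the database backend used in the actual evaluation is not described in this card.

```python
import sqlite3

# In-memory database with a made-up table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30), ("bob", 20), ("alice", 50), ("carol", 10)],
)

# COUNT gives the number of orders per customer; SUM gives their total spend.
rows = conn.execute(
    "SELECT customer, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 2, 80), ('bob', 1, 20), ('carol', 1, 10)]
conn.close()
```

An agent answering such a question has to choose the right aggregate, group by the right column, and return the result in the expected format.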

🛠 The "Iron Guard" Journey

V4 grew out of the need to eliminate formatting errors and improve strict instruction adherence in long tasks:

Dataset Review: Processed raw trajectories through the "Iron Guard" protocol, retaining only 171 flawless examples.
Context Extension: Discovered ALFWorld tasks were being truncated. Doubled max_seq_length to 4096.
Memory Optimization: Overcame Mac OOM errors by using batch_size=1, grad_accumulation=4, and grad_checkpointing.
Validation Removal: Disabled redundant validation steps to maximize training efficiency and focus purely on learning the curated data.
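The truncation problem described in the context-extension step can be checked with a short audit like the one below. This is a sketch under stated assumptions, not the author's tooling: `find_truncated` and `whitespace_token_count` are hypothetical helpers, and the whitespace word count is only a stand-in for the real tokenizer, which would report different lengths.

```python
from typing import Callable, Iterable, List


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words, not model tokens."""
    return len(text.split())


def find_truncated(
    trajectories: Iterable[str],
    max_seq_length: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[int]:
    """Return the indices of trajectories longer than max_seq_length tokens."""
    return [
        index
        for index, text in enumerate(trajectories)
        if count_tokens(text) > max_seq_length
    ]


# Two made-up trajectories: 2400 words and 6 words.
samples = ["go to desk 1 " * 600, "take mug 1 from desk 1"]
print(find_truncated(samples, max_seq_length=2048))  # [0]
print(find_truncated(samples, max_seq_length=4096))  # []
```

To audit real data, pass a `count_tokens` function backed by the model's own tokenizer, for example one that returns the length of the `input_ids` produced by a Hugging Face tokenizer.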

Usage

# Evaluate directly in BF16 format (if within timeout limits)
cd ~/AgentBench
./update_model.sh satoyutaka/Qwen2.5-7B-AgentBench-V4-BF16
python3 -m src.assigner -c configs/assignments/dbbench-vllm.yaml
python3 -m src.assigner -c configs/assignments/alfworld-vllm.yaml

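[Translated from the Japanese version] Qwen2.5-7B-AgentBench-V4-BF16

This model is the V4 Advanced agent model (based on Qwen2.5-7B-Instruct), designed to achieve "extreme accuracy" and "long-context understanding" in the evaluation environment of the AgentBench-comp competition.

Through strict data selection and an extended context length, it maximizes reasoning capability and aims to solve complex multi-step tasks without generation errors.

⚡ Changes from V3

Context length: 2048 → 4096 (also handles long ALFWorld trajectories)
Dataset: Standard SFT → Iron Guard protocol (only 171 strictly selected high-quality examples are used)
Training method: Validation enabled → Optimized SFT (batch size 1, gradient accumulation 4, validation omitted to concentrate training resources)
Target: General agent → Aggregation and complex planning (specialized for SQL SUM/COUNT and long-horizon ALFWorld exploration)

🛠 How "Iron Guard" Was Developed

V4 grew out of the need to completely eliminate formatting errors and improve strict instruction adherence in long tasks:

Dataset re-review: Raw trajectories were examined with the "Iron Guard" protocol, and only the 171 flawless examples were kept.
Context extension: ALFWorld tasks were found to be truncated, so max_seq_length was doubled to 4096.
Memory optimization: Mac OOM errors were overcome with batch_size=1, grad_accumulation=4, and grad_checkpointing.
Validation removal: Redundant validation steps were disabled to concentrate purely on learning the curated data (Loss 0.192).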

Author: satoyutaka
Competition: AgentBench-comp (V4-BF16)
Base Model: Qwen2.5-7B-Instruct (LoRA fine-tuned, Iter 1500)
Precision: BFloat16
