---
library_name: transformers
license: apache-2.0
base_model:
- Qwen/Qwen3-VL-8B-Instruct
- Accio-Lab/Metis-8B-ColdStart
tags:
- multimodal
- vision-language
- reinforcement-learning
- tool-use
- agentic
- qwen3_vl
- HDPO
datasets:
- Accio-Lab/Metis-RL
language:
- en
pipeline_tag: image-text-to-text
---

# Metis-8B-RL

**Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models**

Metis-8B-RL is the final RL-trained checkpoint of the **Metis** framework, trained with **Hierarchical Decoupled Policy Optimization (HDPO)** on top of [Metis-8B-ColdStart](https://huggingface.co/Accio-Lab/Metis-8B-ColdStart). It is a strategic multimodal reasoning agent that selectively invokes code execution, text search, and image search tools during multi-turn reasoning.

[[Paper (arXiv)]](https://arxiv.org/abs/2604.08545) | [[GitHub]](https://github.com/Accio-Lab/Metis) | [[ColdStart Model]](https://huggingface.co/Accio-Lab/Metis-8B-ColdStart) | [[RL Data]](https://huggingface.co/datasets/Accio-Lab/Metis-RL) | [[ColdStart Data]](https://huggingface.co/datasets/Accio-Lab/Metis-ColdStart)

## Highlights

- **98% → 2% Tool Calls** — Cuts the blind tool-invocation rate from 98% to 2%, a reduction of over an order of magnitude.
- **SOTA Performance** — Best accuracy across 13 benchmarks among open-source 8B agentic models.
- **Meta-Cognitive Wisdom** — Learns *when* to use tools, not just *how*.

## Model Details

| Attribute | Value |
|---|---|
| Base model | [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) |
| SFT checkpoint | [Metis-8B-ColdStart](https://huggingface.co/Accio-Lab/Metis-8B-ColdStart) |
| RL algorithm | HDPO (Hierarchical Decoupled Policy Optimization) |
| Training data | [Metis-RL](https://huggingface.co/datasets/Accio-Lab/Metis-RL) (~5K prompts) |
| License | Apache-2.0 |

### HDPO Training Hyperparameters

| Hyperparameter | Value |
|---|---|
| Batch size | 128 |
| Rollouts per prompt (*G*) | 16 |
| Learning rate | 1e-6 |
| KL coefficient | 0 |
| Loss weights | w_acc = 1.0, w_tool = 0.15 |
| Max response length | 16,384 tokens |

## Method: Hierarchical Decoupled Policy Optimization (HDPO)

Current agentic multimodal models suffer from **blind tool invocation** — they reflexively call external tools even when queries are directly resolvable from the visual context. Existing RL methods attempt to fix this by coupling accuracy and tool efficiency into a single scalar reward, which creates an irreconcilable optimization dilemma.

HDPO resolves this through three key components:

1. **Dual Reward Design** — An accuracy reward (r_acc) and a tool-efficiency reward (r_tool) that is conditioned on correctness.
2. **Decoupled Advantage Estimation** — Accuracy advantages are computed over all rollouts; tool-efficiency advantages are computed *exclusively* over correct rollouts (conditional GRPO).
3. **Hierarchical Policy Update** — Two independent clipped surrogate losses combined as `L_HDPO = w_acc · L_GRPO(A_acc) + w_tool · L_GRPO(A_tool)`.

This naturally induces an implicit curriculum: *first learn to be correct, then learn to be efficient*.

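The decoupled advantage estimation can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the exact form of the tool-efficiency reward (here simply the negative tool-call count), and the normalization epsilon are all assumptions for clarity.

```python
import numpy as np

def hdpo_advantages(correct, tool_calls, w_acc=1.0, w_tool=0.15):
    """Sketch of decoupled advantage estimation for one prompt's G rollouts.

    correct:    length-G booleans, whether each rollout answered correctly
    tool_calls: length-G ints, number of tool invocations per rollout
    """
    correct = np.asarray(correct, dtype=float)
    tool_calls = np.asarray(tool_calls, dtype=float)

    # Accuracy advantage: group-normalized over ALL rollouts (GRPO-style).
    r_acc = correct
    a_acc = (r_acc - r_acc.mean()) / (r_acc.std() + 1e-8)

    # Tool-efficiency reward is conditioned on correctness: only correct
    # rollouts receive it, and fewer tool calls is better (illustrative form).
    r_tool = np.where(correct > 0, -tool_calls, 0.0)

    # Conditional GRPO: normalize tool advantages EXCLUSIVELY over the
    # correct rollouts; incorrect rollouts get zero tool advantage.
    a_tool = np.zeros_like(r_tool)
    mask = correct > 0
    if mask.sum() > 1:
        sub = r_tool[mask]
        a_tool[mask] = (sub - sub.mean()) / (sub.std() + 1e-8)

    # The hierarchical update then combines two clipped surrogate losses:
    # L_HDPO = w_acc * L_GRPO(A_acc) + w_tool * L_GRPO(A_tool).
    return w_acc * a_acc, w_tool * a_tool
```

Because incorrect rollouts never contribute a tool-efficiency signal, early training is dominated by the accuracy term; the efficiency term only becomes active once correct rollouts exist, which is the implicit curriculum described above.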
## Evaluation Results

### Perception and Document Understanding

| Model | V\*Bench | HR4K | HR8K | TreeBench | MME-RW | SEED2+ | CharXiv(DQ) | CharXiv(RQ) |
|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B-Instruct | 86.4 | 78.9 | 74.6 | 40.7 | 61.9 | 71.0 | 83.0 | 46.3 |
| DeepEyesV2 | 81.8 | 77.9 | 73.8 | 42.5 | 64.9 | 70.5 | 78.6 | 48.9 |
| SenseNova-MARS-8B | **92.2** | 83.1 | 78.4 | - | 67.9 | - | - | - |
| Skywork-R1V4-30B-A3B | 88.0 | 82.8 | 79.8 | - | **71.4** | - | - | - |
| **Metis (Ours)** | 91.1 | **83.5** | **82.0** | **45.2** | 70.3 | **72.5** | **83.4** | **54.1** |

### Mathematical and Logical Reasoning

| Model | MathVista | MathVerse | WeMath | DynaMath | LogicVista | Avg. |
|---|---|---|---|---|---|---|
| Qwen3-VL-8B-Instruct | 76.3 | 61.3 | 38.8 | 65.5 | 54.9 | 59.4 |
| DeepEyesV2 | 71.9 | 52.7 | 38.1 | 57.2 | 48.7 | 53.7 |
| **Metis (Ours)** | **78.0** | **65.9** | **65.2** | **69.2** | **56.2** | **66.9** |

## Usage

Please refer to the [GitHub repository](https://github.com/Accio-Lab/Metis) for full installation and inference instructions.

### Installation

```bash
git clone https://github.com/Accio-Lab/Metis.git
cd Metis
pip install -e verl
pip install -e ".[vllm,search_tool,python_code_dep]"
```

## Citation

```bibtex
@article{yan2026metis,
  title={Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models},
  author={Yan, Shilin and Tong, Jintao and Xue, Hongwei and Tang, Xiaojun and Wang, Yangyang and Shi, Kunyu and Zhang, Guannan and Li, Ruixuan and Zou, Yixiong},
  journal={arXiv preprint arXiv:2604.08545},
  year={2026}
}
```

## Acknowledgments

Metis is built upon [verl](https://github.com/volcengine/verl), [verl-tool](https://github.com/TIGER-AI-Lab/verl-tool), and [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL).