nielsr (HF Staff) committed · Commit 0ab6a00 · verified · 1 Parent(s): 7d2715f

Add metadata and improve model card

This PR adds relevant YAML metadata (pipeline tag, library name, and license) to the model card. It also includes a reference to the base model (`Qwen3-VL-4B-Thinking`) and ensures the model is correctly linked to the research paper and GitHub repository. The sample code for loading the model via the `transformers` library has been preserved.

Files changed (1)
  1. README.md +25 -96
README.md CHANGED
@@ -1,10 +1,22 @@
1
  <h1 align="center">
2
  MAESTRO-4B: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
3
  </h1>
4
 
5
  <div align="center">
6
  <p>
7
- <a href="https://arxiv.org/pdf/2605.22177">
8
  <img src="https://img.shields.io/badge/Paper-arxiv%3A2605.22177-blue" alt="Paper"/>
9
  </a>
10
  <a href="https://huggingface.co/papers/2605.22177">
@@ -18,7 +30,7 @@ MAESTRO-4B: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensem
18
 
19
  ## Overview
20
 
21
- **MAESTRO-4B** is the lightweight multimodal orchestrator used in **MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles**.
22
 
23
  Rather than solving every task with a single monolithic model, MAESTRO frames multimodal agent execution as a sequential decision-making problem over a hierarchical model-skill registry. At each reasoning step, the 4B orchestrator decides:
24
 
@@ -27,37 +39,23 @@ Rather than solving every task with a single monolithic model, MAESTRO frames mu
27
  - which task-specific skill to use,
28
  - and when to terminate with a final answer.
29
 
30
- The full MAESTRO system is available at [jinyangwu/Maestro](https://github.com/jinyangwu/Maestro). The repository includes example train/validation data under `data/` and skill implementations under `skills/`.
31
 
32
  > **Important**
33
  > This checkpoint is an **orchestrator policy**, not a standalone all-purpose VLM. To reproduce MAESTRO-style rollout, use this model together with the skill registry and auxiliary model services provided in the GitHub repository.
34
 
35
  ## Key Features
36
 
37
- - **RL-trained orchestration policy**: Learns model-skill routing through outcome-based reinforcement learning.
38
  - **Hierarchical skill registry**: Selects coarse Level-1 skills and dispatches to fine-grained Level-2 solvers.
39
  - **Model-skill composition**: Treats expert model selection and skill invocation as a unified action.
40
- - **Plug-and-play extensibility**: Can exploit newly added experts and skills without retraining in the reported setup.
41
- - **Efficient 4B controller**: Uses a compact orchestrator to coordinate larger or specialized frozen expert models.
42
-
43
- ## Performance Highlights
44
-
45
- The MAESTRO paper evaluates the full orchestration system across representative multimodal benchmarks covering mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis.
46
-
47
- | Setting | Result |
48
- | --- | --- |
49
- | In-domain multimodal benchmarks | 70.1% average accuracy |
50
- | Closed-source reference baselines | GPT-5: 69.3%, Gemini-2.5-Pro: 68.7% |
51
- | Augmented out-of-domain registry without retraining | 59.5% average accuracy |
52
- | Average latency in the reported setup | 2.88s |
53
-
54
- These numbers describe the **full MAESTRO system** with its model-skill registry and external services, not isolated single-model inference from this checkpoint alone.
55
 
56
  ## Quickstart
57
 
58
  ### Load the orchestrator checkpoint
59
 
60
- Below is a minimal Transformers-style loading example. Full model-skill orchestration requires the MAESTRO repository and the auxiliary services described below.
61
 
62
  ```python
63
  import torch
@@ -77,61 +75,15 @@ processor = AutoProcessor.from_pretrained(
77
  )
78
  ```
79
 
80
- ### Run the full MAESTRO framework
81
-
82
- Clone the project repository:
83
-
84
- ```bash
85
- git clone https://github.com/jinyangwu/Maestro
86
- cd Maestro
87
- ```
88
-
89
- Create the Python environment and install dependencies:
90
-
91
- ```bash
92
- conda create -n maestro python=3.10 -y
93
- conda activate maestro
94
- pip install -r requirements.txt
95
- ```
96
-
97
- Set an OpenAI API key before training or rollout:
98
-
99
- ```bash
100
- export OPENAI_API_KEY=<your_api_key>
101
- ```
102
-
103
- Before training, deploy the auxiliary model services. Replace each `/path/to/<model>` placeholder with a local model directory or Hugging Face model id.
104
-
105
- Example:
106
-
107
- ```bash
108
- vllm serve /path/to/Intern-S1-mini --served-model-name Intern-S1-mini --tensor_parallel_size 1 --max-num-seqs 512 --trust-remote-code --port 2368 --gpu_memory_utilization 0.9
109
- ```
110
-
111
- Default service ports used by the skills:
112
 
113
- | Port | Model service |
114
  | --- | --- |
115
- | `2362` | `qwen3-VL-8B-Instruct` |
116
- | `2364` | `Chart-R1` |
117
- | `2368` | `Intern-S1-mini` |
118
- | `2369` | `medgemma-1.5-4b-it` |
119
- | `2370` | `DeepEyes-7B` |
120
- | `2376` | `GLM-4.6V-Flash` |
121
- | `2388` | `GLM-OCR` |
122
- | `2389` | `PR1-Qwen2.5-VL-3B-Detection` |
123
-
124
- Start training with:
125
-
126
- ```bash
127
- bash train.sh
128
- ```
129
 
130
- To train from a local checkpoint or a different model id, override `MODEL_NAME`:
131
-
132
- ```bash
133
- MODEL_NAME=/path/to/Qwen3-VL-4B-Thinking bash train.sh
134
- ```
135
 
136
  ## Model Details
137
 
@@ -140,20 +92,6 @@ MODEL_NAME=/path/to/Qwen3-VL-4B-Thinking bash train.sh
140
  - **Base model**: `Qwen3-VL-4B-Thinking`
141
  - **Training method**: outcome-based reinforcement learning with GRPO-style optimization
142
  - **Action space**: latent reasoning, model-skill search actions, and terminal answers
143
- - **Skill interface**: hierarchical skill registry from the MAESTRO repository
144
- - **Expected usage**: high-level controller for external expert models and modular skills
145
-
146
- ## Intended Use
147
-
148
- This model is intended for research on:
149
-
150
- - multimodal agent orchestration,
151
- - reinforcement learning for tool and skill use,
152
- - model routing and expert selection,
153
- - hierarchical skill libraries,
154
- - agentic evaluation across heterogeneous tasks.
155
-
156
- It is especially useful when integrated with the full MAESTRO framework, where the orchestrator can call external expert services during rollout.
157
 
158
  ## Citation
159
 
@@ -169,13 +107,4 @@ If you use this model or the MAESTRO framework in your research, please cite:
169
  primaryClass={cs.LG},
170
  url={https://arxiv.org/abs/2605.22177},
171
  }
172
- ```
173
-
174
- ## Links
175
-
176
- - Code: [https://github.com/jinyangwu/Maestro](https://github.com/jinyangwu/Maestro)
177
- - Model: [https://huggingface.co/Jinyang23/Maestro-4B](https://huggingface.co/Jinyang23/Maestro-4B)
178
-
179
- ## Acknowledgement
180
-
181
- This project builds on open-source reinforcement learning and model-serving ecosystems, including `verl` and vLLM. We thank the authors and contributors of these projects, as well as the developers of the expert models and skill implementations used by MAESTRO.
 
1
+ ---
2
+ license: apache-2.0
3
+ library_name: transformers
4
+ pipeline_tag: image-text-to-text
5
+ base_model: Qwen/Qwen3-VL-4B-Thinking
6
+ tags:
7
+ - multimodal
8
+ - reinforcement-learning
9
+ - agent
10
+ - grpo
11
+ ---
12
+
13
  <h1 align="center">
14
  MAESTRO-4B: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
15
  </h1>
16
 
17
  <div align="center">
18
  <p>
19
+ <a href="https://huggingface.co/papers/2605.22177">
20
  <img src="https://img.shields.io/badge/Paper-arxiv%3A2605.22177-blue" alt="Paper"/>
21
  </a>
22
  <a href="https://huggingface.co/papers/2605.22177">
 
30
 
31
  ## Overview
32
 
33
+ **MAESTRO-4B** is a lightweight multimodal orchestrator introduced in the paper [MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles](https://huggingface.co/papers/2605.22177).
34
 
35
  Rather than solving every task with a single monolithic model, MAESTRO frames multimodal agent execution as a sequential decision-making problem over a hierarchical model-skill registry. At each reasoning step, the 4B orchestrator decides:
36
 
 
39
  - which task-specific skill to use,
40
  - and when to terminate with a final answer.
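
The decision loop described above can be sketched roughly as follows. This is a minimal illustration, not the MAESTRO implementation: `Call`, `Answer`, `rollout`, `toy_policy`, and the registry layout are hypothetical names introduced here, and the real policy is the 4B orchestrator rather than a hand-written function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

# Hypothetical registry layout: skill name -> expert model name -> callable.
# The actual registry is defined in the MAESTRO repository.
Registry = Dict[str, Dict[str, Callable[[str], str]]]


@dataclass
class Call:
    """A model-skill action: invoke one expert model through one skill."""
    skill: str
    model: str
    query: str


@dataclass
class Answer:
    """A terminal action carrying the final answer."""
    text: str


Action = Union[Call, Answer]
History = List[Tuple[Call, str]]


def rollout(policy: Callable[[str, History], Action],
            registry: Registry, task: str, max_steps: int = 8) -> str:
    """Run the policy step by step until it answers or the budget runs out."""
    history: History = []
    for _ in range(max_steps):
        action = policy(task, history)
        if isinstance(action, Answer):
            return action.text
        expert = registry[action.skill][action.model]
        history.append((action, expert(action.query)))
    # Step budget exhausted: fall back to the last observation, if any.
    return history[-1][1] if history else ""


def toy_policy(task: str, history: History) -> Action:
    """Stand-in policy: call one expert once, then answer with its output."""
    if not history:
        return Call(skill="chart", model="chart-expert", query=task)
    return Answer(text=history[-1][1])


toy_registry: Registry = {
    "chart": {"chart-expert": lambda q: f"observation for: {q}"},
}
result = rollout(toy_policy, toy_registry, "read the bar chart")
```

Treating the skill and the expert model as one `Call` mirrors the "unified action" framing in the model card; how the released code encodes actions may differ.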
41
 
42
+ The full MAESTRO system is available at [jinyangwu/Maestro](https://github.com/jinyangwu/Maestro).
43
 
44
  > **Important**
45
  > This checkpoint is an **orchestrator policy**, not a standalone all-purpose VLM. To reproduce MAESTRO-style rollout, use this model together with the skill registry and auxiliary model services provided in the GitHub repository.
46
 
47
  ## Key Features
48
 
49
+ - **RL-trained orchestration policy**: Learns model-skill routing through outcome-based reinforcement learning (GRPO).
50
  - **Hierarchical skill registry**: Selects coarse Level-1 skills and dispatches to fine-grained Level-2 solvers.
51
  - **Model-skill composition**: Treats expert model selection and skill invocation as a unified action.
52
+ - **Efficient 4B controller**: Uses a compact orchestrator (fine-tuned from `Qwen3-VL-4B-Thinking`) to coordinate larger or specialized frozen expert models.
53
 
54
  ## Quickstart
55
 
56
  ### Load the orchestrator checkpoint
57
 
58
+ Below is a minimal Transformers-style loading example. Full model-skill orchestration requires the MAESTRO repository and the auxiliary services it describes.
59
 
60
  ```python
61
  import torch
 
75
  )
76
  ```
77
 
78
+ ## Performance Highlights
79
 
80
+ | Setting | Result |
81
  | --- | --- |
82
+ | In-domain multimodal benchmarks | 70.1% average accuracy |
83
+ | Closed-source reference baselines | GPT-5: 69.3%, Gemini-2.5-Pro: 68.7% |
84
+ | Augmented out-of-domain registry without retraining | 59.5% average accuracy |
85
 
86
+ *These numbers describe the full MAESTRO system with its model-skill registry and external services.*
87
 
88
  ## Model Details
89
 
 
92
  - **Base model**: `Qwen3-VL-4B-Thinking`
93
  - **Training method**: outcome-based reinforcement learning with GRPO-style optimization
94
  - **Action space**: latent reasoning, model-skill search actions, and terminal answers
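
For readers unfamiliar with GRPO-style optimization, the core idea is to score each rollout relative to the other rollouts sampled for the same prompt. The sketch below shows that group-relative normalization only; the function name, the use of population standard deviation, and the `eps` value are assumptions made for illustration, and the exact variant used to train MAESTRO-4B is not specified in this model card.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize outcome rewards within one group of rollouts for a prompt.

    Each advantage is (reward - group mean) / (group std + eps), so rollouts
    that beat their group get positive values and the rest get negative ones.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some variants differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt with binary outcome rewards (made-up values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group receives the same reward, all advantages are zero and that group contributes no learning signal.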
95
 
96
  ## Citation
97
 
 
107
  primaryClass={cs.LG},
108
  url={https://arxiv.org/abs/2605.22177},
109
  }
110
+ ```