File size: 6,721 Bytes
6bc5a60 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 | <h1 align="center">
MAESTRO-4B: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
</h1>
<div align="center">
<p>
<a href="https://arxiv.org/pdf/2605.22177">
<img src="https://img.shields.io/badge/Paper-arxiv%3A2605.22177-blue" alt="Paper"/>
</a>
<a href="https://huggingface.co/papers/2605.22177">
<img src="https://img.shields.io/badge/Daily%20Paper-HuggingFace-yellow" alt="HF Daily Paper"/>
</a>
<a href="https://github.com/jinyangwu/Maestro">
<img src="https://img.shields.io/badge/Code-GitHub-black" alt="Code"/>
</a>
</p>
</div>
## Overview
**MAESTRO-4B** is the lightweight multimodal orchestrator used in **MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles**.
Rather than solving every task with a single monolithic model, MAESTRO frames multimodal agent execution as a sequential decision-making problem over a hierarchical model-skill registry. At each reasoning step, the 4B orchestrator decides:
- whether to invoke an external expert,
- which expert model to call,
- which task-specific skill to use,
- and when to terminate with a final answer.
The full MAESTRO system is available at [jinyangwu/Maestro](https://github.com/jinyangwu/Maestro). The repository includes example train/validation data under `data/` and skill implementations under `skills/`.
> **Important**
> This checkpoint is an **orchestrator policy**, not a standalone all-purpose VLM. To reproduce MAESTRO-style rollout, use this model together with the skill registry and auxiliary model services provided in the GitHub repository.
## Key Features
- **RL-trained orchestration policy**: Learns model-skill routing through outcome-based reinforcement learning.
- **Hierarchical skill registry**: Selects coarse Level-1 skills and dispatches to fine-grained Level-2 solvers.
- **Model-skill composition**: Treats expert model selection and skill invocation as a unified action.
- **Plug-and-play extensibility**: Can exploit newly added experts and skills without retraining in the reported setup.
- **Efficient 4B controller**: Uses a compact orchestrator to coordinate larger or specialized frozen expert models.
## Performance Highlights
The MAESTRO paper evaluates the full orchestration system across representative multimodal benchmarks covering mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis.
| Setting | Result |
| --- | --- |
| In-domain multimodal benchmarks | 70.1% average accuracy |
| Closed-source reference baselines | GPT-5: 69.3%, Gemini-2.5-Pro: 68.7% |
| Augmented out-of-domain registry without retraining | 59.5% average accuracy |
| Average latency in the reported setup | 2.88s |
These numbers describe the **full MAESTRO system** with its model-skill registry and external services, not isolated single-model inference from this checkpoint alone.
## Quickstart
### Load the orchestrator checkpoint
Below is a minimal Transformers-style loading example. Full model-skill orchestration requires the MAESTRO repository and the auxiliary services described below.
```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
model_id = "Jinyang23/Maestro-4B"
model = AutoModelForImageTextToText.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
model_id,
trust_remote_code=True,
)
```
### Run the full MAESTRO framework
Clone the project repository:
```bash
git clone https://github.com/jinyangwu/Maestro
cd Maestro
```
Create the Python environment and install dependencies:
```bash
conda create -n maestro python=3.10 -y
conda activate maestro
pip install -r requirements.txt
```
Set an OpenAI API key before training or rollout:
```bash
export OPENAI_API_KEY=<your_api_key>
```
Before training, deploy the auxiliary model services. Replace each `/path/to/<model>` placeholder with a local model directory or Hugging Face model id.
Example:
```bash
vllm serve /path/to/Intern-S1-mini --served-model-name Intern-S1-mini --tensor_parallel_size 1 --max-num-seqs 512 --trust-remote-code --port 2368 --gpu_memory_utilization 0.9
```
Default service ports used by the skills:
| Port | Model service |
| --- | --- |
| `2362` | `qwen3-VL-8B-Instruct` |
| `2364` | `Chart-R1` |
| `2368` | `Intern-S1-mini` |
| `2369` | `medgemma-1.5-4b-it` |
| `2370` | `DeepEyes-7B` |
| `2376` | `GLM-4.6V-Flash` |
| `2388` | `GLM-OCR` |
| `2389` | `PR1-Qwen2.5-VL-3B-Detection` |
Start training with:
```bash
bash train.sh
```
To train from a local checkpoint or a different model id, override `MODEL_NAME`:
```bash
MODEL_NAME=/path/to/Qwen3-VL-4B-Thinking bash train.sh
```
## Model Details
- **Model name**: `Jinyang23/Maestro-4B`
- **Role**: MAESTRO multimodal orchestration policy
- **Base model**: `Qwen3-VL-4B-Thinking`
- **Training method**: outcome-based reinforcement learning with GRPO-style optimization
- **Action space**: latent reasoning, model-skill search actions, and terminal answers
- **Skill interface**: hierarchical skill registry from the MAESTRO repository
- **Expected usage**: high-level controller for external expert models and modular skills
## Intended Use
This model is intended for research on:
- multimodal agent orchestration,
- reinforcement learning for tool and skill use,
- model routing and expert selection,
- hierarchical skill libraries,
- agentic evaluation across heterogeneous tasks.
It is especially useful when integrated with the full MAESTRO framework, where the orchestrator can call external expert services during rollout.
## Citation
If you use this model or the MAESTRO framework in your research, please cite:
```bibtex
@misc{wu2026maestro,
title={MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles},
author={Jinyang Wu and Guocheng Zhai and Ruihan Jin and Yuhao Shen and Zhengxi Lu and Fan Zhang and Haoran Luo and Zheng Lian and Zhengqi Wen and Jianhua Tao},
year={2026},
eprint={2605.22177},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2605.22177},
}
```
## Links
- Code: [https://github.com/jinyangwu/Maestro](https://github.com/jinyangwu/Maestro)
- Model: [https://huggingface.co/Jinyang23/Maestro-4B](https://huggingface.co/Jinyang23/Maestro-4B)
## Acknowledgement
This project builds on open-source reinforcement learning and model-serving ecosystems, including `verl` and vLLM. We thank the authors and contributors of these projects, as well as the developers of the expert models and skill implementations used by MAESTRO.
|