| <h1 align="center"> |
| MAESTRO-4B: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles |
| </h1> |
|
|
| <div align="center"> |
| <p> |
| <a href="https://arxiv.org/pdf/2605.22177"> |
| <img src="https://img.shields.io/badge/Paper-arxiv%3A2605.22177-blue" alt="Paper"/> |
| </a> |
| <a href="https://huggingface.co/papers/2605.22177"> |
| <img src="https://img.shields.io/badge/Daily%20Paper-HuggingFace-yellow" alt="HF Daily Paper"/> |
| </a> |
| <a href="https://github.com/jinyangwu/Maestro"> |
| <img src="https://img.shields.io/badge/Code-GitHub-black" alt="Code"/> |
| </a> |
| </p> |
| </div> |
| |
| ## Overview |
|
|
| **MAESTRO-4B** is the lightweight multimodal orchestrator used in **MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles**. |
|
|
| Rather than solving every task with a single monolithic model, MAESTRO frames multimodal agent execution as a sequential decision-making problem over a hierarchical model-skill registry. At each reasoning step, the 4B orchestrator decides: |
|
|
| - whether to invoke an external expert, |
| - which expert model to call, |
| - which task-specific skill to use, |
| - and when to terminate with a final answer. |
|
|
| The full MAESTRO system is available at [jinyangwu/Maestro](https://github.com/jinyangwu/Maestro). The repository includes example train/validation data under `data/` and skill implementations under `skills/`. |
|
|
| > **Important** |
| > This checkpoint is an **orchestrator policy**, not a standalone all-purpose VLM. To reproduce MAESTRO-style rollout, use this model together with the skill registry and auxiliary model services provided in the GitHub repository. |
|
|
| ## Key Features |
|
|
| - **RL-trained orchestration policy**: Learns model-skill routing through outcome-based reinforcement learning. |
| - **Hierarchical skill registry**: Selects coarse Level-1 skills and dispatches to fine-grained Level-2 solvers. |
| - **Model-skill composition**: Treats expert model selection and skill invocation as a unified action. |
| - **Plug-and-play extensibility**: Can exploit newly added experts and skills without retraining in the reported setup. |
| - **Efficient 4B controller**: Uses a compact orchestrator to coordinate larger or specialized frozen expert models. |
|
|
| ## Performance Highlights |
|
|
| The MAESTRO paper evaluates the full orchestration system across representative multimodal benchmarks covering mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis. |
|
|
| | Setting | Result | |
| | --- | --- | |
| | In-domain multimodal benchmarks | 70.1% average accuracy | |
| | Closed-source reference baselines | GPT-5: 69.3%, Gemini-2.5-Pro: 68.7% | |
| | Augmented out-of-domain registry without retraining | 59.5% average accuracy | |
| | Average latency in the reported setup | 2.88s | |
|
|
| These numbers describe the **full MAESTRO system** with its model-skill registry and external services, not isolated single-model inference from this checkpoint alone. |
|
|
| ## Quickstart |
|
|
| ### Load the orchestrator checkpoint |
|
|
| Below is a minimal Transformers-style loading example. Full model-skill orchestration requires the MAESTRO repository and the auxiliary services described below. |
|
|
| ```python |
| import torch |
| from transformers import AutoProcessor, AutoModelForImageTextToText |
| |
| model_id = "Jinyang23/Maestro-4B" |
| |
| model = AutoModelForImageTextToText.from_pretrained( |
| model_id, |
| torch_dtype=torch.bfloat16, |
| device_map="auto", |
| trust_remote_code=True, |
| ) |
| processor = AutoProcessor.from_pretrained( |
| model_id, |
| trust_remote_code=True, |
| ) |
| ``` |
|
|
| ### Run the full MAESTRO framework |
|
|
| Clone the project repository: |
|
|
| ```bash |
| git clone https://github.com/jinyangwu/Maestro |
| cd Maestro |
| ``` |
|
|
| Create the Python environment and install dependencies: |
|
|
| ```bash |
| conda create -n maestro python=3.10 -y |
| conda activate maestro |
| pip install -r requirements.txt |
| ``` |
|
|
| Set an OpenAI API key before training or rollout: |
|
|
| ```bash |
| export OPENAI_API_KEY=<your_api_key> |
| ``` |
|
|
| Before training, deploy the auxiliary model services. Replace each `/path/to/<model>` placeholder with a local model directory or Hugging Face model id. |
|
|
| Example: |
|
|
| ```bash |
| vllm serve /path/to/Intern-S1-mini --served-model-name Intern-S1-mini --tensor_parallel_size 1 --max-num-seqs 512 --trust-remote-code --port 2368 --gpu_memory_utilization 0.9 |
| ``` |
|
|
| Default service ports used by the skills: |
|
|
| | Port | Model service | |
| | --- | --- | |
| | `2362` | `qwen3-VL-8B-Instruct` | |
| | `2364` | `Chart-R1` | |
| | `2368` | `Intern-S1-mini` | |
| | `2369` | `medgemma-1.5-4b-it` | |
| | `2370` | `DeepEyes-7B` | |
| | `2376` | `GLM-4.6V-Flash` | |
| | `2388` | `GLM-OCR` | |
| | `2389` | `PR1-Qwen2.5-VL-3B-Detection` | |
|
|
| Start training with: |
|
|
| ```bash |
| bash train.sh |
| ``` |
|
|
| To train from a local checkpoint or a different model id, override `MODEL_NAME`: |
|
|
| ```bash |
| MODEL_NAME=/path/to/Qwen3-VL-4B-Thinking bash train.sh |
| ``` |
|
|
| ## Model Details |
|
|
| - **Model name**: `Jinyang23/Maestro-4B` |
| - **Role**: MAESTRO multimodal orchestration policy |
| - **Base model**: `Qwen3-VL-4B-Thinking` |
| - **Training method**: outcome-based reinforcement learning with GRPO-style optimization |
| - **Action space**: latent reasoning, model-skill search actions, and terminal answers |
| - **Skill interface**: hierarchical skill registry from the MAESTRO repository |
| - **Expected usage**: high-level controller for external expert models and modular skills |
|
|
| ## Intended Use |
|
|
| This model is intended for research on: |
|
|
| - multimodal agent orchestration, |
| - reinforcement learning for tool and skill use, |
| - model routing and expert selection, |
| - hierarchical skill libraries, |
| - agentic evaluation across heterogeneous tasks. |
|
|
| It is especially useful when integrated with the full MAESTRO framework, where the orchestrator can call external expert services during rollout. |
|
|
| ## Citation |
|
|
| If you use this model or the MAESTRO framework in your research, please cite: |
|
|
| ```bibtex |
| @misc{wu2026maestro, |
| title={MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles}, |
| author={Jinyang Wu and Guocheng Zhai and Ruihan Jin and Yuhao Shen and Zhengxi Lu and Fan Zhang and Haoran Luo and Zheng Lian and Zhengqi Wen and Jianhua Tao}, |
| year={2026}, |
| eprint={2605.22177}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.LG}, |
| url={https://arxiv.org/abs/2605.22177}, |
| } |
| ``` |
|
|
| ## Links |
|
|
| - Code: [https://github.com/jinyangwu/Maestro](https://github.com/jinyangwu/Maestro) |
| - Model: [https://huggingface.co/Jinyang23/Maestro-4B](https://huggingface.co/Jinyang23/Maestro-4B) |
|
|
| ## Acknowledgement |
|
|
| This project builds on open-source reinforcement learning and model-serving ecosystems, including `verl` and vLLM. We thank the authors and contributors of these projects, as well as the developers of the expert models and skill implementations used by MAESTRO. |
|
|