| --- |
| license: apache-2.0 |
| library_name: transformers |
| pipeline_tag: image-text-to-text |
| base_model: Qwen/Qwen3-VL-4B-Thinking |
| tags: |
| - multimodal |
| - reinforcement-learning |
| - agent |
| - grpo |
| --- |
| |
| <h1 align="center"> |
| MAESTRO-4B: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles |
| </h1> |
|
|
| <div align="center"> |
| <p> |
| <a href="https://huggingface.co/papers/2605.22177"> |
| <img src="https://img.shields.io/badge/Paper-arxiv%3A2605.22177-blue" alt="Paper"/> |
| </a> |
| <a href="https://huggingface.co/papers/2605.22177"> |
| <img src="https://img.shields.io/badge/Daily%20Paper-HuggingFace-yellow" alt="HF Daily Paper"/> |
| </a> |
| <a href="https://github.com/jinyangwu/Maestro"> |
| <img src="https://img.shields.io/badge/Code-GitHub-black" alt="Code"/> |
| </a> |
| </p> |
| </div> |
| |
| ## Overview |
|
|
| **MAESTRO-4B** is a lightweight multimodal orchestrator introduced in the paper [MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles](https://huggingface.co/papers/2605.22177). |
|
|
| Rather than solving every task with a single monolithic model, MAESTRO frames multimodal agent execution as a sequential decision-making problem over a hierarchical model-skill registry. At each reasoning step, the 4B orchestrator decides: |
|
|
| - whether to invoke an external expert, |
| - which expert model to call, |
| - which task-specific skill to use, |
| - and when to terminate with a final answer. |
|
|
| The full MAESTRO system is available at [jinyangwu/Maestro](https://github.com/jinyangwu/Maestro). |
|
|
| > **Important** |
| > This checkpoint is an **orchestrator policy**, not a standalone all-purpose VLM. To reproduce MAESTRO-style rollout, use this model together with the skill registry and auxiliary model services provided in the GitHub repository. |
|
|
| ## Key Features |
|
|
| - **RL-trained orchestration policy**: Learns model-skill routing through outcome-based reinforcement learning (GRPO). |
| - **Hierarchical skill registry**: Selects coarse Level-1 skills and dispatches to fine-grained Level-2 solvers. |
| - **Model-skill composition**: Treats expert model selection and skill invocation as a unified action. |
| - **Efficient 4B controller**: Uses a compact orchestrator (finetuned from `Qwen3-VL-4B-Thinking`) to coordinate larger or specialized frozen expert models. |
|
|
| ## Quickstart |
|
|
| ### Load the orchestrator checkpoint |
|
|
| Below is a minimal Transformers-style loading example. Full model-skill orchestration requires the MAESTRO repository and the auxiliary services described in the official repository. |
|
|
| ```python |
| import torch |
| from transformers import AutoProcessor, AutoModelForImageTextToText |
| |
| model_id = "Jinyang23/Maestro-4B" |
| |
| model = AutoModelForImageTextToText.from_pretrained( |
| model_id, |
| torch_dtype=torch.bfloat16, |
| device_map="auto", |
| trust_remote_code=True, |
| ) |
| processor = AutoProcessor.from_pretrained( |
| model_id, |
| trust_remote_code=True, |
| ) |
| ``` |
|
|
| ## Performance Highlights |
|
|
| | Setting | Result | |
| | --- | --- | |
| | In-domain multimodal benchmarks | 70.1% average accuracy | |
| | Closed-source reference baselines | GPT-5: 69.3%, Gemini-2.5-Pro: 68.7% | |
| | Augmented out-of-domain registry without retraining | 59.5% average accuracy | |
|
|
| *These numbers describe the full MAESTRO system with its model-skill registry and external services.* |
|
|
| ## Model Details |
|
|
| - **Model name**: `Jinyang23/Maestro-4B` |
| - **Role**: MAESTRO multimodal orchestration policy |
| - **Base model**: `Qwen3-VL-4B-Thinking` |
| - **Training method**: outcome-based reinforcement learning with GRPO-style optimization |
| - **Action space**: latent reasoning, model-skill search actions, and terminal answers |
|
|
| ## Citation |
|
|
| If you use this model or the MAESTRO framework in your research, please cite: |
|
|
| ```bibtex |
| @misc{wu2026maestro, |
| title={MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles}, |
| author={Jinyang Wu and Guocheng Zhai and Ruihan Jin and Yuhao Shen and Zhengxi Lu and Fan Zhang and Haoran Luo and Zheng Lian and Zhengqi Wen and Jianhua Tao}, |
| year={2026}, |
| eprint={2605.22177}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.LG}, |
| url={https://arxiv.org/abs/2605.22177}, |
| } |
| ``` |