---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3-VL-4B-Thinking
tags:
- multimodal
- reinforcement-learning
- agent
- grpo
---

# MAESTRO-4B: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles


## Overview

**MAESTRO-4B** is a lightweight multimodal orchestrator introduced in the paper [MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles](https://huggingface.co/papers/2605.22177).

Rather than solving every task with a single monolithic model, MAESTRO frames multimodal agent execution as a sequential decision-making problem over a hierarchical model-skill registry. At each reasoning step, the 4B orchestrator decides:

- whether to invoke an external expert,
- which expert model to call,
- which task-specific skill to use,
- and when to terminate with a final answer.

The full MAESTRO system is available at [jinyangwu/Maestro](https://github.com/jinyangwu/Maestro).

> **Important**
>
> This checkpoint is an **orchestrator policy**, not a standalone all-purpose VLM. To reproduce MAESTRO-style rollouts, use this model together with the skill registry and auxiliary model services provided in the GitHub repository.

## Key Features

- **RL-trained orchestration policy**: Learns model-skill routing through outcome-based reinforcement learning (GRPO).
- **Hierarchical skill registry**: Selects coarse Level-1 skills and dispatches to fine-grained Level-2 solvers.
- **Model-skill composition**: Treats expert model selection and skill invocation as a unified action.
- **Efficient 4B controller**: Uses a compact orchestrator (fine-tuned from `Qwen3-VL-4B-Thinking`) to coordinate larger or specialized frozen expert models.

## Quickstart

### Load the orchestrator checkpoint

Below is a minimal Transformers-style loading example. It loads only the orchestrator checkpoint; full model-skill orchestration also requires the MAESTRO repository and the auxiliary services it describes.
```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "Jinyang23/Maestro-4B"

# Load the orchestrator weights in bfloat16 and place them on available devices.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# The processor handles both image preprocessing and text tokenization.
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)
```

## Performance Highlights

| Setting | Result |
| --- | --- |
| In-domain multimodal benchmarks | 70.1% average accuracy |
| Closed-source reference baselines | GPT-5: 69.3%, Gemini-2.5-Pro: 68.7% |
| Augmented out-of-domain registry without retraining | 59.5% average accuracy |

*These numbers describe the full MAESTRO system with its model-skill registry and external services.*

## Model Details

- **Model name**: `Jinyang23/Maestro-4B`
- **Role**: MAESTRO multimodal orchestration policy
- **Base model**: `Qwen3-VL-4B-Thinking`
- **Training method**: outcome-based reinforcement learning with GRPO-style optimization
- **Action space**: latent reasoning, model-skill search actions, and terminal answers

## Citation

If you use this model or the MAESTRO framework in your research, please cite:

```bibtex
@misc{wu2026maestro,
  title={MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles},
  author={Jinyang Wu and Guocheng Zhai and Ruihan Jin and Yuhao Shen and Zhengxi Lu and Fan Zhang and Haoran Luo and Zheng Lian and Zhengqi Wen and Jianhua Tao},
  year={2026},
  eprint={2605.22177},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2605.22177},
}
```
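
The decision loop described in the Overview (invoke an expert, choose a model and skill, or terminate with an answer) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the MAESTRO implementation: `Action`, `run_episode`, `toy_policy`, the registry layout, and the names `expert-a`, `ocr`, and `read_text` are all hypothetical, and the learned 4B policy is replaced by a hand-written stub.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical solver signature: (expert_model, query) -> observation text.
Solver = Callable[[str, str], str]
# Hypothetical two-level registry: Level-1 skill -> Level-2 solver -> callable.
Registry = Dict[str, Dict[str, Solver]]


@dataclass
class Action:
    kind: str                     # "call" (invoke an expert) or "answer" (terminate)
    model: Optional[str] = None   # which expert model to call
    skill: Optional[str] = None   # coarse Level-1 skill
    solver: Optional[str] = None  # fine-grained Level-2 solver
    content: str = ""             # sub-query for a call, or the final answer


Policy = Callable[[str, List[str]], Action]


def run_episode(
    policy: Policy, registry: Registry, question: str, max_steps: int = 4
) -> Tuple[str, List[str]]:
    """Roll out one episode: query the policy until it answers or the budget ends."""
    history: List[str] = []
    for _ in range(max_steps):
        action = policy(question, history)
        if action.kind == "answer":
            return action.content, history
        # Model and skill are chosen jointly as one action, then dispatched.
        solver = registry[action.skill][action.solver]
        history.append(solver(action.model, action.content))
    return "", history  # step budget exhausted without a terminal answer


def toy_policy(question: str, history: List[str]) -> Action:
    """Stand-in for the learned orchestrator: call one expert, then answer."""
    if not history:
        return Action("call", model="expert-a", skill="ocr",
                      solver="read_text", content=question)
    return Action("answer", content=history[-1])


registry: Registry = {
    "ocr": {"read_text": lambda model, query: f"[{model}] text for: {query}"}
}

answer, trace = run_episode(toy_policy, registry, "What does the sign say?")
print(answer)
```

In an actual rollout, the stub policy would be replaced by generations from the orchestrator checkpoint, and the registry entries by the expert services from the GitHub repository.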