Update README.md

7d2715f verified 1 day ago

6.72 kB

	<h1 align="center">
	MAESTRO-4B: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
	</h1>

	<div align="center">
	<p>
	<a href="https://arxiv.org/pdf/2605.22177">
	<img src="https://img.shields.io/badge/Paper-arxiv%3A2605.22177-blue" alt="Paper"/>
	</a>
	<a href="https://huggingface.co/papers/2605.22177">
	<img src="https://img.shields.io/badge/Daily%20Paper-HuggingFace-yellow" alt="HF Daily Paper"/>
	</a>
	<a href="https://github.com/jinyangwu/Maestro">
	<img src="https://img.shields.io/badge/Code-GitHub-black" alt="Code"/>
	</a>
	</p>
	</div>

	## Overview

	MAESTRO-4B is the lightweight multimodal orchestrator used in MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles.

	Rather than solving every task with a single monolithic model, MAESTRO frames multimodal agent execution as a sequential decision-making problem over a hierarchical model-skill registry. At each reasoning step, the 4B orchestrator decides:

	- whether to invoke an external expert,
	- which expert model to call,
	- which task-specific skill to use,
	- and when to terminate with a final answer.

	The full MAESTRO system is available at [jinyangwu/Maestro](https://github.com/jinyangwu/Maestro). The repository includes example train/validation data under `data/` and skill implementations under `skills/`.

	> Important
	> This checkpoint is an orchestrator policy, not a standalone all-purpose VLM. To reproduce MAESTRO-style rollout, use this model together with the skill registry and auxiliary model services provided in the GitHub repository.

	## Key Features

	- RL-trained orchestration policy: Learns model-skill routing through outcome-based reinforcement learning.
	- Hierarchical skill registry: Selects coarse Level-1 skills and dispatches to fine-grained Level-2 solvers.
	- Model-skill composition: Treats expert model selection and skill invocation as a unified action.
	- Plug-and-play extensibility: Can exploit newly added experts and skills without retraining in the reported setup.
	- Efficient 4B controller: Uses a compact orchestrator to coordinate larger or specialized frozen expert models.

	## Performance Highlights

	The MAESTRO paper evaluates the full orchestration system across representative multimodal benchmarks covering mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis.

	\| Setting \| Result \|
	\| --- \| --- \|
	\| In-domain multimodal benchmarks \| 70.1% average accuracy \|
	\| Closed-source reference baselines \| GPT-5: 69.3%, Gemini-2.5-Pro: 68.7% \|
	\| Augmented out-of-domain registry without retraining \| 59.5% average accuracy \|
	\| Average latency in the reported setup \| 2.88s \|

	These numbers describe the full MAESTRO system with its model-skill registry and external services, not isolated single-model inference from this checkpoint alone.

	## Quickstart

	### Load the orchestrator checkpoint

	Below is a minimal Transformers-style loading example. Full model-skill orchestration requires the MAESTRO repository and the auxiliary services described below.

	```python
	import torch
	from transformers import AutoProcessor, AutoModelForImageTextToText

	model_id = "Jinyang23/Maestro-4B"

	model = AutoModelForImageTextToText.from_pretrained(
	model_id,
	torch_dtype=torch.bfloat16,
	device_map="auto",
	trust_remote_code=True,
	)
	processor = AutoProcessor.from_pretrained(
	model_id,
	trust_remote_code=True,
	)
	```

	### Run the full MAESTRO framework

	Clone the project repository:

	```bash
	git clone https://github.com/jinyangwu/Maestro
	cd Maestro
	```

	Create the Python environment and install dependencies:

	```bash
	conda create -n maestro python=3.10 -y
	conda activate maestro
	pip install -r requirements.txt
	```

	Set an OpenAI API key before training or rollout:

	```bash
	export OPENAI_API_KEY=<your_api_key>
	```

	Before training, deploy the auxiliary model services. Replace each `/path/to/<model>` placeholder with a local model directory or Hugging Face model id.

	Example:

	```bash
	vllm serve /path/to/Intern-S1-mini --served-model-name Intern-S1-mini --tensor_parallel_size 1 --max-num-seqs 512 --trust-remote-code --port 2368 --gpu_memory_utilization 0.9
	```

	Default service ports used by the skills:

	\| Port \| Model service \|
	\| --- \| --- \|
	\| `2362` \| `qwen3-VL-8B-Instruct` \|
	\| `2364` \| `Chart-R1` \|
	\| `2368` \| `Intern-S1-mini` \|
	\| `2369` \| `medgemma-1.5-4b-it` \|
	\| `2370` \| `DeepEyes-7B` \|
	\| `2376` \| `GLM-4.6V-Flash` \|
	\| `2388` \| `GLM-OCR` \|
	\| `2389` \| `PR1-Qwen2.5-VL-3B-Detection` \|

	Start training with:

	```bash
	bash train.sh
	```

	To train from a local checkpoint or a different model id, override `MODEL_NAME`:

	```bash
	MODEL_NAME=/path/to/Qwen3-VL-4B-Thinking bash train.sh
	```

	## Model Details

	- Model name: `Jinyang23/Maestro-4B`
	- Role: MAESTRO multimodal orchestration policy
	- Base model: `Qwen3-VL-4B-Thinking`
	- Training method: outcome-based reinforcement learning with GRPO-style optimization
	- Action space: latent reasoning, model-skill search actions, and terminal answers
	- Skill interface: hierarchical skill registry from the MAESTRO repository
	- Expected usage: high-level controller for external expert models and modular skills

	## Intended Use

	This model is intended for research on:

	- multimodal agent orchestration,
	- reinforcement learning for tool and skill use,
	- model routing and expert selection,
	- hierarchical skill libraries,
	- agentic evaluation across heterogeneous tasks.

	It is especially useful when integrated with the full MAESTRO framework, where the orchestrator can call external expert services during rollout.

	## Citation

	If you use this model or the MAESTRO framework in your research, please cite:

	```bibtex
	@misc{wu2026maestro,
	title={MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles},
	author={Jinyang Wu and Guocheng Zhai and Ruihan Jin and Yuhao Shen and Zhengxi Lu and Fan Zhang and Haoran Luo and Zheng Lian and Zhengqi Wen and Jianhua Tao},
	year={2026},
	eprint={2605.22177},
	archivePrefix={arXiv},
	primaryClass={cs.LG},
	url={https://arxiv.org/abs/2605.22177},
	}
	```

	## Links

	- Code: [https://github.com/jinyangwu/Maestro](https://github.com/jinyangwu/Maestro)
	- Model: [https://huggingface.co/Jinyang23/Maestro-4B](https://huggingface.co/Jinyang23/Maestro-4B)

	## Acknowledgement

	This project builds on open-source reinforcement learning and model-serving ecosystems, including `verl` and vLLM. We thank the authors and contributors of these projects, as well as the developers of the expert models and skill implementations used by MAESTRO.