File size: 4,134 Bytes
0ab6a00
 
 
 
 
 
 
 
 
 
 
 
6bc5a60
 
 
 
 
 
0ab6a00
6bc5a60
 
 
 
 
 
 
 
 
 
 
 
 
0ab6a00
6bc5a60
 
 
 
 
 
 
 
0ab6a00
6bc5a60
 
 
 
 
 
0ab6a00
6bc5a60
 
0ab6a00
6bc5a60
 
 
 
 
0ab6a00
6bc5a60
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
0ab6a00
6bc5a60
0ab6a00
6bc5a60
0ab6a00
 
 
6bc5a60
0ab6a00
6bc5a60
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
0ab6a00
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3-VL-4B-Thinking
tags:
- multimodal
- reinforcement-learning
- agent
- grpo
---

<h1 align="center">
MAESTRO-4B: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
</h1>

<div align="center">
  <p>
    <a href="https://huggingface.co/papers/2605.22177">
      <img src="https://img.shields.io/badge/Paper-arxiv%3A2605.22177-blue" alt="Paper"/>
    </a>
    <a href="https://huggingface.co/papers/2605.22177">
      <img src="https://img.shields.io/badge/Daily%20Paper-HuggingFace-yellow" alt="HF Daily Paper"/>
    </a>
    <a href="https://github.com/jinyangwu/Maestro">
      <img src="https://img.shields.io/badge/Code-GitHub-black" alt="Code"/>
    </a>
  </p>
</div>

## Overview

**MAESTRO-4B** is a lightweight multimodal orchestrator introduced in the paper [MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles](https://huggingface.co/papers/2605.22177).

Rather than solving every task with a single monolithic model, MAESTRO frames multimodal agent execution as a sequential decision-making problem over a hierarchical model-skill registry. At each reasoning step, the 4B orchestrator decides:

- whether to invoke an external expert,
- which expert model to call,
- which task-specific skill to use,
- and when to terminate with a final answer.

The full MAESTRO system is available at [jinyangwu/Maestro](https://github.com/jinyangwu/Maestro).

> **Important**
> This checkpoint is an **orchestrator policy**, not a standalone all-purpose VLM. To reproduce MAESTRO-style rollout, use this model together with the skill registry and auxiliary model services provided in the GitHub repository.

## Key Features

- **RL-trained orchestration policy**: Learns model-skill routing through outcome-based reinforcement learning (GRPO).
- **Hierarchical skill registry**: Selects coarse Level-1 skills and dispatches to fine-grained Level-2 solvers.
- **Model-skill composition**: Treats expert model selection and skill invocation as a unified action.
- **Efficient 4B controller**: Uses a compact orchestrator (finetuned from `Qwen3-VL-4B-Thinking`) to coordinate larger or specialized frozen expert models.

## Quickstart

### Load the orchestrator checkpoint

Below is a minimal Transformers-style loading example. Full model-skill orchestration requires the MAESTRO repository and the auxiliary services described in the official repository.

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "Jinyang23/Maestro-4B"

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)
```

## Performance Highlights

| Setting | Result |
| --- | --- |
| In-domain multimodal benchmarks | 70.1% average accuracy |
| Closed-source reference baselines | GPT-5: 69.3%, Gemini-2.5-Pro: 68.7% |
| Augmented out-of-domain registry without retraining | 59.5% average accuracy |

*These numbers describe the full MAESTRO system with its model-skill registry and external services.*

## Model Details

- **Model name**: `Jinyang23/Maestro-4B`
- **Role**: MAESTRO multimodal orchestration policy
- **Base model**: `Qwen3-VL-4B-Thinking`
- **Training method**: outcome-based reinforcement learning with GRPO-style optimization
- **Action space**: latent reasoning, model-skill search actions, and terminal answers

## Citation

If you use this model or the MAESTRO framework in your research, please cite:

```bibtex
@misc{wu2026maestro,
      title={MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles},
      author={Jinyang Wu and Guocheng Zhai and Ruihan Jin and Yuhao Shen and Zhengxi Lu and Fan Zhang and Haoran Luo and Zheng Lian and Zhengqi Wen and Jianhua Tao},
      year={2026},
      eprint={2605.22177},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.22177}, 
}
```