Jinyang23
/

Maestro-4B

Safetensors

qwen3_vl

Jinyang23 committed 1 day ago

Commit

6bc5a60

verified

1 Parent(s): 260e92e

Update README.md

README.md +199 -0

@@ -1,3 +1,202 @@
 ---
 license: mit
+language:
+  - en
+tags:
+  - reinforcement-learning
+  - multimodal
+  - agent
+  - tool-use
+  - orchestration
+  - model-routing
+  - qwen3-vl
+  - grpo
+library_name: transformers
+pipeline_tag: image-text-to-text
+base_model:
+  - Qwen/Qwen3-VL-4B-Thinking
+metrics:
+  - accuracy
 ---
+<h1 align="center">
+MAESTRO-4B: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
+</h1>
+<div align="center">
+  <p>
+    <a href="https://arxiv.org/pdf/2605.22177">
+      <img src="https://img.shields.io/badge/Paper-arxiv%3A2605.22177-blue" alt="Paper"/>
+    </a>
+    <a href="https://huggingface.co/papers/2605.22177">
+      <img src="https://img.shields.io/badge/Daily%20Paper-HuggingFace-yellow" alt="HF Daily Paper"/>
+    </a>
+    <a href="https://github.com/jinyangwu/Maestro">
+      <img src="https://img.shields.io/badge/Code-GitHub-black" alt="Code"/>
+    </a>
+  </p>
+</div>
+## Overview
+**MAESTRO-4B** is the lightweight multimodal orchestrator used in **MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles**.
+Rather than solving every task with a single monolithic model, MAESTRO frames multimodal agent execution as a sequential decision-making problem over a hierarchical model-skill registry. At each reasoning step, the 4B orchestrator decides:
+- whether to invoke an external expert,
+- which expert model to call,
+- which task-specific skill to use,
+- and when to terminate with a final answer.
+The full MAESTRO system is available at [jinyangwu/Maestro](https://github.com/jinyangwu/Maestro). The repository includes example train/validation data under `data/` and skill implementations under `skills/`.
+> **Important**
+> This checkpoint is an **orchestrator policy**, not a standalone general-purpose VLM. To reproduce MAESTRO-style rollouts, use this model together with the skill registry and auxiliary model services provided in the GitHub repository.
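The per-step decisions listed above can be pictured as a small control loop. The sketch below is illustrative only: `Action`, `run_episode`, `scripted_policy`, and the skill name `chart_qa` are hypothetical names that do not come from the MAESTRO repository, and the scripted policy merely stands in for the 4B orchestrator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Action:
    """One orchestrator decision (hypothetical structure for illustration)."""
    kind: str                    # "call" invokes an expert, "answer" terminates
    model: Optional[str] = None  # which expert model to call
    skill: Optional[str] = None  # which task-specific skill to use
    content: str = ""            # query for the expert, or the final answer


Expert = Callable[[str], str]
Policy = Callable[[str, List[str]], Action]


def run_episode(policy: Policy,
                experts: Dict[Tuple[str, str], Expert],
                question: str,
                max_steps: int = 4) -> str:
    """Roll out the policy until it answers or the step budget runs out."""
    observations: List[str] = []
    for _ in range(max_steps):
        action = policy(question, observations)
        if action.kind == "answer":
            return action.content
        expert = experts[(action.model, action.skill)]
        observations.append(expert(action.content))
    # Budget exhausted: fall back to the last observation (an arbitrary choice).
    return observations[-1] if observations else ""


def scripted_policy(question: str, observations: List[str]) -> Action:
    """Stand-in for the 4B orchestrator: call one expert, then answer."""
    if not observations:
        return Action("call", model="Chart-R1", skill="chart_qa", content=question)
    return Action("answer", content=observations[-1])


# A fake expert that always returns the same string, keyed by (model, skill).
experts = {("Chart-R1", "chart_qa"): lambda query: "42"}
answer = run_episode(scripted_policy, experts, "What is the peak value?")
print(answer)
```

In the real system the policy would be the RL-trained checkpoint and each expert would be a call to one of the auxiliary model services; the loop structure here is an assumption about how such a rollout could be organized, not a description of the repository's code.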
+## Key Features
+- **RL-trained orchestration policy**: Learns model-skill routing through outcome-based reinforcement learning.
+- **Hierarchical skill registry**: Selects coarse Level-1 skills and dispatches to fine-grained Level-2 solvers.
+- **Model-skill composition**: Treats expert model selection and skill invocation as a unified action.
+- **Plug-and-play extensibility**: In the reported setup, uses newly added experts and skills without retraining.
+- **Efficient 4B controller**: Uses a compact orchestrator to coordinate larger or specialized frozen expert models.
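As a rough illustration of the hierarchical registry and plug-and-play ideas above, a two-level lookup could be sketched as follows. Every name here (`REGISTRY`, `dispatch`, `register`, and the skill and solver names) is hypothetical; the actual skill implementations live under `skills/` in the repository and may be organized differently.

```python
from typing import Callable, Dict

Solver = Callable[[str], str]

# Level-1 skill name -> Level-2 solver name -> callable (all names hypothetical).
REGISTRY: Dict[str, Dict[str, Solver]] = {
    "chart": {"value_lookup": lambda q: f"chart/value_lookup <- {q}"},
    "ocr": {"plain_text": lambda q: f"ocr/plain_text <- {q}"},
}


def dispatch(level1: str, level2: str, query: str) -> str:
    """Resolve a coarse skill, then a fine-grained solver, and run it."""
    try:
        solver = REGISTRY[level1][level2]
    except KeyError:
        raise KeyError(f"unknown skill path: {level1}/{level2}") from None
    return solver(query)


def register(level1: str, level2: str, solver: Solver) -> None:
    """Add a solver at runtime; the orchestrator policy itself is untouched."""
    REGISTRY.setdefault(level1, {})[level2] = solver


# Adding a new Level-1 skill with one solver, then calling it.
register("medical", "xray_report", lambda q: f"medical/xray_report <- {q}")
print(dispatch("medical", "xray_report", "describe the scan"))
```

Keeping the registry as plain data is one way to let new experts be added without touching the policy weights, which is the property the extensibility bullet describes.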
+## Performance Highlights
+The MAESTRO paper evaluates the full orchestration system across representative multimodal benchmarks covering mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis.
+| Setting | Result |
+| --- | --- |
+| In-domain multimodal benchmarks | 70.1% average accuracy |
+| Closed-source reference baselines | GPT-5: 69.3%, Gemini-2.5-Pro: 68.7% |
+| Augmented out-of-domain registry without retraining | 59.5% average accuracy |
+| Average latency in the reported setup | 2.88s |
+These numbers describe the **full MAESTRO system** with its model-skill registry and external services, not isolated single-model inference from this checkpoint alone.
+## Quickstart
+### Load the orchestrator checkpoint
+Below is a minimal Transformers-style loading example. Full model-skill orchestration requires the MAESTRO repository and the auxiliary services described below.
+```python
+import torch
+from transformers import AutoProcessor, AutoModelForImageTextToText
+model_id = "Jinyang23/Maestro-4B"
+# Load the orchestrator weights in bfloat16; device_map="auto" needs the accelerate package.
+model = AutoModelForImageTextToText.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True,
+)
+# The processor bundles the tokenizer and the image preprocessor.
+processor = AutoProcessor.from_pretrained(
+    model_id,
+    trust_remote_code=True,
+)
+```
+### Run the full MAESTRO framework
+Clone the project repository:
+```bash
+git clone https://github.com/jinyangwu/Maestro
+cd Maestro
+```
+Create the Python environment and install dependencies:
+```bash
+conda create -n maestro python=3.10 -y
+conda activate maestro
+pip install -r requirements.txt
+```
+Set an OpenAI API key before training or rollout:
+```bash
+export OPENAI_API_KEY="<your_api_key>"
+```
+Before training, deploy the auxiliary model services. Replace each `/path/to/<model>` placeholder with a local model directory or Hugging Face model id.
+For example, to serve `Intern-S1-mini` on port `2368`:
+```bash
+vllm serve /path/to/Intern-S1-mini \
+  --served-model-name Intern-S1-mini \
+  --tensor-parallel-size 1 \
+  --max-num-seqs 512 \
+  --trust-remote-code \
+  --port 2368 \
+  --gpu-memory-utilization 0.9
+```
+Default service ports used by the skills:
+| Port | Model service |
+| --- | --- |
+| `2362` | `qwen3-VL-8B-Instruct` |
+| `2364` | `Chart-R1` |
+| `2368` | `Intern-S1-mini` |
+| `2369` | `medgemma-1.5-4b-it` |
+| `2370` | `DeepEyes-7B` |
+| `2376` | `GLM-4.6V-Flash` |
+| `2388` | `GLM-OCR` |
+| `2389` | `PR1-Qwen2.5-VL-3B-Detection` |
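The port table above can be mirrored in a small lookup when writing client code against the services. This is a sketch under assumptions: it supposes the services run on `localhost` and are reached through the OpenAI-compatible `/v1` routes that `vllm serve` exposes; `SERVICE_PORTS` and `base_url` are illustrative names, not part of the MAESTRO repository.

```python
from typing import Dict

# Served model name -> default port, copied from the table above.
SERVICE_PORTS: Dict[str, int] = {
    "qwen3-VL-8B-Instruct": 2362,
    "Chart-R1": 2364,
    "Intern-S1-mini": 2368,
    "medgemma-1.5-4b-it": 2369,
    "DeepEyes-7B": 2370,
    "GLM-4.6V-Flash": 2376,
    "GLM-OCR": 2388,
    "PR1-Qwen2.5-VL-3B-Detection": 2389,
}


def base_url(model_name: str, host: str = "localhost") -> str:
    """Build the base URL for one service (host is an assumption)."""
    return f"http://{host}:{SERVICE_PORTS[model_name]}/v1"


for name in SERVICE_PORTS:
    print(f"{name:32s} {base_url(name)}")
```

If a service is moved to another port or machine, only the mapping or the `host` argument would need to change in this sketch; the skills in the repository may hard-code or configure their endpoints differently.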
+Start training with:
+```bash
+bash train.sh
+```
+To train from a local checkpoint or a different model id, override `MODEL_NAME`:
+```bash
+MODEL_NAME=/path/to/Qwen3-VL-4B-Thinking bash train.sh
+```
+## Model Details
+- **Model name**: `Jinyang23/Maestro-4B`
+- **Role**: MAESTRO multimodal orchestration policy
+- **Base model**: `Qwen3-VL-4B-Thinking`
+- **Training method**: outcome-based reinforcement learning with GRPO-style optimization
+- **Action space**: latent reasoning, model-skill search actions, and terminal answers
+- **Skill interface**: hierarchical skill registry from the MAESTRO repository
+- **Expected usage**: high-level controller for external expert models and modular skills
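For readers unfamiliar with GRPO, the core of the method is a group-relative advantage: several rollouts are sampled for the same prompt, and each outcome reward is normalized against the group's mean and standard deviation. The sketch below shows that standard normalization only. The function name, the use of the population standard deviation, and the `eps` value are assumptions, and the exact "GRPO-style" objective used to train MAESTRO-4B may differ.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's outcome reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations vary on this choice
    # eps keeps the division finite when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt, with binary outcome rewards (correct or not).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct rollouts receive positive advantages and incorrect ones negative, while a group whose rollouts all score the same contributes zero advantage and therefore no learning signal.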
+## Intended Use
+This model is intended for research on:
+- multimodal agent orchestration,
+- reinforcement learning for tool and skill use,
+- model routing and expert selection,
+- hierarchical skill libraries,
+- agentic evaluation across heterogeneous tasks.
+It is especially useful when integrated with the full MAESTRO framework, where the orchestrator can call external expert services during rollout.
+## Citation
+If you use this model or the MAESTRO framework in your research, please cite:
+```bibtex
+@misc{wu2026maestro,
+      title={MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles},
+      author={Jinyang Wu and Guocheng Zhai and Ruihan Jin and Yuhao Shen and Zhengxi Lu and Fan Zhang and Haoran Luo and Zheng Lian and Zhengqi Wen and Jianhua Tao},
+      year={2026},
+      eprint={2605.22177},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2605.22177},
+}
+```
+## Links
+- Code: [https://github.com/jinyangwu/Maestro](https://github.com/jinyangwu/Maestro)
+- Model: [https://huggingface.co/Jinyang23/Maestro-4B](https://huggingface.co/Jinyang23/Maestro-4B)
+## Acknowledgement
+This project builds on open-source reinforcement learning and model-serving ecosystems, including `verl` and vLLM. We thank the authors and contributors of these projects, as well as the developers of the expert models and skill implementations used by MAESTRO.