Jinyang23/Maestro-4B

Safetensors

qwen3_vl

Add metadata and improve model card

by nielsr HF Staff - opened 1 day ago

base: refs/heads/main ← from: refs/pr/1

Files changed (1)

README.md +25 -96

@@ -1,10 +1,22 @@
 <h1 align="center">
 MAESTRO-4B: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
 </h1>
 <div align="center">
   <p>
-    <a href="https://arxiv.org/pdf/2605.22177">
       <img src="https://img.shields.io/badge/Paper-arxiv%3A2605.22177-blue" alt="Paper"/>
     </a>
     <a href="https://huggingface.co/papers/2605.22177">
@@ -18,7 +30,7 @@ MAESTRO-4B: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensem
 ## Overview
-**MAESTRO-4B** is the lightweight multimodal orchestrator used in **MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles**.
 Rather than solving every task with a single monolithic model, MAESTRO frames multimodal agent execution as a sequential decision-making problem over a hierarchical model-skill registry. At each reasoning step, the 4B orchestrator decides:
@@ -27,37 +39,23 @@ Rather than solving every task with a single monolithic model, MAESTRO frames mu
 - which task-specific skill to use,
 - and when to terminate with a final answer.
-The full MAESTRO system is available at [jinyangwu/Maestro](https://github.com/jinyangwu/Maestro). The repository includes example train/validation data under `data/` and skill implementations under `skills/`.
 > **Important**
 > This checkpoint is an **orchestrator policy**, not a standalone all-purpose VLM. To reproduce MAESTRO-style rollout, use this model together with the skill registry and auxiliary model services provided in the GitHub repository.
 ## Key Features
-- **RL-trained orchestration policy**: Learns model-skill routing through outcome-based reinforcement learning.
 - **Hierarchical skill registry**: Selects coarse Level-1 skills and dispatches to fine-grained Level-2 solvers.
 - **Model-skill composition**: Treats expert model selection and skill invocation as a unified action.
-- **Plug-and-play extensibility**: Can exploit newly added experts and skills without retraining in the reported setup.
-- **Efficient 4B controller**: Uses a compact orchestrator to coordinate larger or specialized frozen expert models.
-## Performance Highlights
-The MAESTRO paper evaluates the full orchestration system across representative multimodal benchmarks covering mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis.
-| Setting | Result |
-| --- | --- |
-| In-domain multimodal benchmarks | 70.1% average accuracy |
-| Closed-source reference baselines | GPT-5: 69.3%, Gemini-2.5-Pro: 68.7% |
-| Augmented out-of-domain registry without retraining | 59.5% average accuracy |
-| Average latency in the reported setup | 2.88s |
-These numbers describe the **full MAESTRO system** with its model-skill registry and external services, not isolated single-model inference from this checkpoint alone.
 ## Quickstart
 ### Load the orchestrator checkpoint
-Below is a minimal Transformers-style loading example. Full model-skill orchestration requires the MAESTRO repository and the auxiliary services described below.
 ```python
 import torch
@@ -77,61 +75,15 @@ processor = AutoProcessor.from_pretrained(
 )
 ```
-### Run the full MAESTRO framework
-Clone the project repository:
-```bash
-git clone https://github.com/jinyangwu/Maestro
-cd Maestro
-```
-Create the Python environment and install dependencies:
-```bash
-conda create -n maestro python=3.10 -y
-conda activate maestro
-pip install -r requirements.txt
-```
-Set an OpenAI API key before training or rollout:
-```bash
-export OPENAI_API_KEY=<your_api_key>
-```
-Before training, deploy the auxiliary model services. Replace each `/path/to/<model>` placeholder with a local model directory or Hugging Face model id.
-Example:
-```bash
-vllm serve /path/to/Intern-S1-mini --served-model-name Intern-S1-mini --tensor_parallel_size 1 --max-num-seqs 512 --trust-remote-code --port 2368 --gpu_memory_utilization 0.9
-```
-Default service ports used by the skills:
-| Port | Model service |
 | --- | --- |
-| `2362` | `qwen3-VL-8B-Instruct` |
-| `2364` | `Chart-R1` |
-| `2368` | `Intern-S1-mini` |
-| `2369` | `medgemma-1.5-4b-it` |
-| `2370` | `DeepEyes-7B` |
-| `2376` | `GLM-4.6V-Flash` |
-| `2388` | `GLM-OCR` |
-| `2389` | `PR1-Qwen2.5-VL-3B-Detection` |
-Start training with:
-```bash
-bash train.sh
-```
-To train from a local checkpoint or a different model id, override `MODEL_NAME`:
-```bash
-MODEL_NAME=/path/to/Qwen3-VL-4B-Thinking bash train.sh
-```
 ## Model Details
@@ -140,20 +92,6 @@ MODEL_NAME=/path/to/Qwen3-VL-4B-Thinking bash train.sh
 - **Base model**: `Qwen3-VL-4B-Thinking`
 - **Training method**: outcome-based reinforcement learning with GRPO-style optimization
 - **Action space**: latent reasoning, model-skill search actions, and terminal answers
-- **Skill interface**: hierarchical skill registry from the MAESTRO repository
-- **Expected usage**: high-level controller for external expert models and modular skills
-## Intended Use
-This model is intended for research on:
-- multimodal agent orchestration,
-- reinforcement learning for tool and skill use,
-- model routing and expert selection,
-- hierarchical skill libraries,
-- agentic evaluation across heterogeneous tasks.
-It is especially useful when integrated with the full MAESTRO framework, where the orchestrator can call external expert services during rollout.
 ## Citation
@@ -169,13 +107,4 @@ If you use this model or the MAESTRO framework in your research, please cite:
       primaryClass={cs.LG},
       url={https://arxiv.org/abs/2605.22177},
 }
-```
-## Links
-- Code: [https://github.com/jinyangwu/Maestro](https://github.com/jinyangwu/Maestro)
-- Model: [https://huggingface.co/Jinyang23/Maestro-4B](https://huggingface.co/Jinyang23/Maestro-4B)
-## Acknowledgement
-This project builds on open-source reinforcement learning and model-serving ecosystems, including `verl` and vLLM. We thank the authors and contributors of these projects, as well as the developers of the expert models and skill implementations used by MAESTRO.

+---
+license: apache-2.0
+library_name: transformers
+pipeline_tag: image-text-to-text
+base_model: Qwen/Qwen3-VL-4B-Thinking
+tags:
+- multimodal
+- reinforcement-learning
+- agent
+- grpo
+---
 <h1 align="center">
 MAESTRO-4B: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
 </h1>
 <div align="center">
   <p>
+    <a href="https://huggingface.co/papers/2605.22177">
       <img src="https://img.shields.io/badge/Paper-arxiv%3A2605.22177-blue" alt="Paper"/>
     </a>
     <a href="https://huggingface.co/papers/2605.22177">
 ## Overview
+**MAESTRO-4B** is a lightweight multimodal orchestrator introduced in the paper [MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles](https://huggingface.co/papers/2605.22177).
 Rather than solving every task with a single monolithic model, MAESTRO frames multimodal agent execution as a sequential decision-making problem over a hierarchical model-skill registry. At each reasoning step, the 4B orchestrator decides:
 - which task-specific skill to use,
 - and when to terminate with a final answer.
+The full MAESTRO system is available at [jinyangwu/Maestro](https://github.com/jinyangwu/Maestro).
 > **Important**
 > This checkpoint is an **orchestrator policy**, not a standalone all-purpose VLM. To reproduce MAESTRO-style rollout, use this model together with the skill registry and auxiliary model services provided in the GitHub repository.
 ## Key Features
+- **RL-trained orchestration policy**: Learns model-skill routing through outcome-based reinforcement learning (GRPO).
 - **Hierarchical skill registry**: Selects coarse Level-1 skills and dispatches to fine-grained Level-2 solvers.
 - **Model-skill composition**: Treats expert model selection and skill invocation as a unified action.
+- **Efficient 4B controller**: Uses a compact orchestrator (finetuned from `Qwen3-VL-4B-Thinking`) to coordinate larger or specialized frozen expert models.
 ## Quickstart
 ### Load the orchestrator checkpoint
+Below is a minimal Transformers-style loading example. Full model-skill orchestration requires the MAESTRO repository and the auxiliary services it describes.
 ```python
 import torch
 )
 ```
+## Performance Highlights
+| Setting | Result |
 | --- | --- |
+| In-domain multimodal benchmarks | 70.1% average accuracy |
+| Closed-source reference baselines | GPT-5: 69.3%, Gemini-2.5-Pro: 68.7% |
+| Augmented out-of-domain registry without retraining | 59.5% average accuracy |
+*These numbers describe the full MAESTRO system with its model-skill registry and external services.*
 ## Model Details
 - **Base model**: `Qwen3-VL-4B-Thinking`
 - **Training method**: outcome-based reinforcement learning with GRPO-style optimization
 - **Action space**: latent reasoning, model-skill search actions, and terminal answers
 ## Citation
       primaryClass={cs.LG},
       url={https://arxiv.org/abs/2605.22177},
 }
+```
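
The Overview and the **Action space** entry in the diff above describe the orchestrator as a policy that, at each step, either issues a model-skill action against a two-level registry or terminates with a final answer. The sketch below illustrates only that control flow, under stated assumptions: `Action`, `run_episode`, `toy_policy`, and `toy_registry` are hypothetical names invented for this illustration, the policy is a stub rather than the MAESTRO-4B checkpoint, and none of this is taken from the MAESTRO repository.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical two-level registry: Level-1 skill -> Level-2 solver -> callable.
Registry = Dict[str, Dict[str, Callable[[str], str]]]
History = List[Tuple[str, str]]


@dataclass
class Action:
    """One orchestrator decision (illustrative, not the MAESTRO action format)."""
    kind: str          # "call" to invoke a solver, "answer" to terminate
    level1: str = ""   # coarse skill name, used when kind == "call"
    level2: str = ""   # fine-grained solver name, used when kind == "call"
    text: str = ""     # final answer, used when kind == "answer"


def run_episode(
    policy: Callable[[str, History], Action],
    registry: Registry,
    question: str,
    max_steps: int = 4,
) -> Tuple[str, History]:
    """Roll out the policy until it answers or the step budget runs out."""
    history: History = []
    for _ in range(max_steps):
        action = policy(question, history)
        if action.kind == "answer":
            return action.text, history
        # Dispatch: Level-1 skill first, then the Level-2 solver under it.
        solver = registry[action.level1][action.level2]
        observation = solver(question)
        history.append((f"{action.level1}/{action.level2}", observation))
    # Budget exhausted: fall back to the last observation, if any.
    return (history[-1][1] if history else ""), history


def toy_policy(question: str, history: History) -> Action:
    """Stub policy: call one solver, then answer with its observation."""
    if not history:
        return Action("call", level1="chart", level2="read_values")
    return Action("answer", text=history[-1][1])


toy_registry: Registry = {"chart": {"read_values": lambda q: "42"}}

answer, trace = run_episode(toy_policy, toy_registry, "What is the peak value?")
print(answer, trace)
```

In a real rollout the stub policy would be replaced by generation from the orchestrator checkpoint and the lambdas by calls to the expert model services; how MAESTRO encodes and parses those actions is defined in its repository, not here.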