Maestro-4B / README.md
nielsr's picture
nielsr HF Staff
Add metadata and improve model card
0ab6a00 verified
|
raw
history blame
4.13 kB
metadata
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3-VL-4B-Thinking
tags:
  - multimodal
  - reinforcement-learning
  - agent
  - grpo

MAESTRO-4B: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

Overview

MAESTRO-4B is a lightweight multimodal orchestrator introduced in the paper MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles.

Rather than solving every task with a single monolithic model, MAESTRO frames multimodal agent execution as a sequential decision-making problem over a hierarchical model-skill registry. At each reasoning step, the 4B orchestrator decides:

  • whether to invoke an external expert,
  • which expert model to call,
  • which task-specific skill to use,
  • and when to terminate with a final answer.

The full MAESTRO system is available at jinyangwu/Maestro.

Important This checkpoint is an orchestrator policy, not a standalone all-purpose VLM. To reproduce MAESTRO-style rollout, use this model together with the skill registry and auxiliary model services provided in the GitHub repository.

Key Features

  • RL-trained orchestration policy: Learns model-skill routing through outcome-based reinforcement learning (GRPO).
  • Hierarchical skill registry: Selects coarse Level-1 skills and dispatches to fine-grained Level-2 solvers.
  • Model-skill composition: Treats expert model selection and skill invocation as a unified action.
  • Efficient 4B controller: Uses a compact orchestrator (finetuned from Qwen3-VL-4B-Thinking) to coordinate larger or specialized frozen expert models.

Quickstart

Load the orchestrator checkpoint

Below is a minimal Transformers-style loading example. Full model-skill orchestration requires the MAESTRO repository and the auxiliary services described in the official repository.

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "Jinyang23/Maestro-4B"

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)

Performance Highlights

Setting Result
In-domain multimodal benchmarks 70.1% average accuracy
Closed-source reference baselines GPT-5: 69.3%, Gemini-2.5-Pro: 68.7%
Augmented out-of-domain registry without retraining 59.5% average accuracy

These numbers describe the full MAESTRO system with its model-skill registry and external services.

Model Details

  • Model name: Jinyang23/Maestro-4B
  • Role: MAESTRO multimodal orchestration policy
  • Base model: Qwen3-VL-4B-Thinking
  • Training method: outcome-based reinforcement learning with GRPO-style optimization
  • Action space: latent reasoning, model-skill search actions, and terminal answers

Citation

If you use this model or the MAESTRO framework in your research, please cite:

@misc{wu2026maestro,
      title={MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles},
      author={Jinyang Wu and Guocheng Zhai and Ruihan Jin and Yuhao Shen and Zhengxi Lu and Fan Zhang and Haoran Luo and Zheng Lian and Zhengqi Wen and Jianhua Tao},
      year={2026},
      eprint={2605.22177},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.22177}, 
}