AVSQwen-Omni-7B

Qwen2.5-Omni-7B post-trained with GRPO for Referring Audio-Visual Segmentation. This is the main model used for OmniAVS, RefAVS, MeViS, ReVOS, Ref-DAVIS, Ref-YouTube-VOS, ReasonSeg, and RefCOCO. It was trained on swift_train_bbox_grpo_balanced_full_v4_sam.jsonl with the bbox_format, sam_keyframe, and minimal_efficiency rewards.

Usage

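from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

# Load the post-trained weights from the Hub.
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Vegetabot/AVSQwen-Omni-7B",
    torch_dtype="auto",   # use the dtype stored with the checkpoint
    device_map="auto",    # place weights on available devices (requires accelerate)
)
# The processor handles tokenization and audio/video preprocessing.
processor = Qwen2_5OmniProcessor.from_pretrained("Vegetabot/AVSQwen-Omni-7B")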

The snippet above only loads the model and processor. For the full inference pipeline (frame selector + grounding + SAM2 segmenter), see inference/ and run/*.sh in the release repo.
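
As a rough illustration of the hand-off between the grounding step and the SAM2 segmenter, the sketch below parses a box from the model's text output and clamps it to the frame so it could serve as a box prompt. The output format (a `"bbox_2d": [x1, y1, x2, y2]` fragment in absolute pixels) and the helper name `parse_bbox` are assumptions made for illustration; the actual parsing lives in inference/ in the release repo and may differ.

```python
import re


def parse_bbox(text, width, height):
    """Return (x1, y1, x2, y2) clamped to the frame, or None if no valid box.

    Hypothetical helper: assumes the model emits a fragment such as
    "bbox_2d": [x1, y1, x2, y2] in absolute pixel coordinates.
    """
    match = re.search(r'"bbox_2d"\s*:\s*\[([^\]]*)\]', text)
    if match is None:
        return None
    try:
        coords = [float(v) for v in match.group(1).split(",")]
    except ValueError:
        return None
    if len(coords) != 4:
        return None

    x1, y1, x2, y2 = coords
    # Make sure the first corner is top-left and the second is bottom-right.
    x1, x2 = sorted((x1, x2))
    y1, y2 = sorted((y1, y2))
    # Clamp the box to the frame boundaries.
    x1 = min(max(x1, 0.0), float(width))
    x2 = min(max(x2, 0.0), float(width))
    y1 = min(max(y1, 0.0), float(height))
    y2 = min(max(y2, 0.0), float(height))
    # Reject boxes that collapse to zero area after clamping.
    if x2 <= x1 or y2 <= y1:
        return None
    return (x1, y1, x2, y2)


if __name__ == "__main__":
    reply = 'The target is here: {"bbox_2d": [10, 20, 700, 300]}'
    print(parse_bbox(reply, width=640, height=480))
```

Returning None for malformed or empty boxes lets the caller skip the segmenter for that frame instead of passing it a degenerate prompt.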

Training

  • Framework: ms-swift with GRPO
  • Rewards: bbox_format_reward, sam_keyframe_reward, minimal_efficiency_reward
  • Data: swift_train_bbox_grpo_balanced_full_v4_sam.jsonl (OmniAVS + 7 referring datasets, balanced mix)
  • See training/train_qwen25omni_full_grpo_FINAL_V2.sh in the release repo.
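
For readers unfamiliar with GRPO, the sketch below shows how several reward terms could be combined per completion and then normalized within a sampled group to give group-relative advantages. It is not the training code: the equal weights, the helper names `total_reward` and `group_relative_advantages`, the use of the population standard deviation, and the reward values are all assumptions chosen for illustration. The actual reward definitions and weights are in the training script in the release repo.

```python
from statistics import mean, pstdev

# Reward names taken from the list above; their definitions are not shown here.
REWARD_NAMES = (
    "bbox_format_reward",
    "sam_keyframe_reward",
    "minimal_efficiency_reward",
)


def total_reward(scores, weights=None):
    """Weighted sum of the per-completion reward terms (equal weights assumed)."""
    if weights is None:
        weights = {name: 1.0 for name in REWARD_NAMES}
    return sum(weights[name] * scores[name] for name in REWARD_NAMES)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (r - group mean) / (group std + eps); eps avoids
    division by zero when every completion receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Illustrative (made-up) scores for three completions of one prompt.
    group = [
        {"bbox_format_reward": 1.0, "sam_keyframe_reward": 0.8, "minimal_efficiency_reward": 0.5},
        {"bbox_format_reward": 1.0, "sam_keyframe_reward": 0.3, "minimal_efficiency_reward": 0.5},
        {"bbox_format_reward": 0.0, "sam_keyframe_reward": 0.0, "minimal_efficiency_reward": 0.2},
    ]
    rewards = [total_reward(s) for s in group]
    print(group_relative_advantages(rewards))
```

Completions that score above the group mean get a positive advantage and those below get a negative one, so the policy update depends on relative rather than absolute reward.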

Citation

@article{avsqwen2026,
  title   = {AVSQwen: ...},
  author  = {...},
  year    = {2026}
}