<h2 align="center"> <a href="https://arxiv.org/abs/2508.03100">AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video</a></h2>
<div align="center">

[Yogesh Kulkarni](https://yogkul2000.github.io/), &nbsp;
[Pooyan Fazli](https://www.pooyanfazli.com/) &nbsp;

<br>

<a href='https://arxiv.org/abs/2508.03100'><img src='https://img.shields.io/badge/arXiv-2508.03100-b31b1b.svg'></a> &nbsp;
<a href='https://people-robots.github.io/AVATAR/'><img src='https://img.shields.io/badge/Project-Website-blue'></a>&nbsp;
<a href='https://huggingface.co/yogkul2000/AVATAR'><img src='https://img.shields.io/badge/model-checkpoints-yellow'></a>

</div>

## Abstract
Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps.

We introduce AVATAR (Audio-Video Agent for Alignment and Reasoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing advantages by reusing past experiences with greater reward diversity, and (2) Temporal Advantage Shaping (TAS), a novel credit assignment strategy that upweights key reasoning phases during learning.

AVATAR achieves strong performance across various benchmarks, outperforming the Qwen2.5-Omni baseline by +5.4 on MMVU, +4.9 on OmniBench, and +4.5 on Video-Holmes, while demonstrating over 35% higher sample efficiency. These results show that targeted RL improvements, rather than massive architectural changes, effectively address core multimodal reasoning challenges.
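As a rough illustration of the credit-shaping idea (not the paper's implementation — the boost factor, the renormalization, and the binary reasoning mask are all hypothetical choices), advantage shaping can be sketched as reweighting per-token advantages so that tokens in key reasoning phases carry more of the learning signal:

```python
import numpy as np

def shape_advantages(advantages, reasoning_mask, boost=2.0):
    """Upweight advantages for tokens inside key reasoning spans.

    advantages: per-token advantage estimates (1-D array)
    reasoning_mask: 1 where the token belongs to a key reasoning phase
    boost: multiplicative weight for reasoning tokens (illustrative value)
    """
    weights = np.where(reasoning_mask.astype(bool), boost, 1.0)
    shaped = advantages * weights
    # Renormalize so the overall advantage magnitude is unchanged;
    # only the relative emphasis between tokens shifts.
    shaped *= np.abs(advantages).sum() / max(np.abs(shaped).sum(), 1e-8)
    return shaped

adv = np.array([0.5, -0.2, 0.8, 0.1])
mask = np.array([0, 1, 1, 0])  # tokens 1-2 are "reasoning" tokens
print(shape_advantages(adv, mask))
```

The renormalization keeps the shaped advantages on the same scale as the originals, so shaping changes where the gradient signal concentrates rather than how large it is overall.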

## 📦 Install

### Environment Setup

```bash
conda create -n avatar python=3.10
conda activate avatar

# Install PyTorch with CUDA 12.6
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu126

# Install FlashAttention
pip install flash-attn==2.7.4.post1

# If the flash-attn build fails, retry without build isolation:
pip install flash-attn==2.7.4.post1 --no-build-isolation

pip install transformers==4.54.1

# Install other dependencies
pip install decord opencv-python pillow numpy
pip install qwen-omni-utils[decord] -U
```
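As an optional sanity check (not part of the repository), the snippet below verifies that the packages above resolve in the active environment. Note the importable module names differ from the pip package names for `opencv-python` (`cv2`) and `pillow` (`PIL`):

```python
import importlib.util

def check_deps(names):
    """Report which modules resolve in the current environment."""
    return {n: importlib.util.find_spec(n) is not None for n in names}

# Module names as imported, not as installed via pip
status = check_deps(["torch", "transformers", "decord", "cv2", "PIL", "numpy"])
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```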

### Single Video Inference
```python
import torch
from transformers import (
    Qwen2_5OmniThinkerForConditionalGeneration,
    Qwen2_5OmniProcessor,
)
from qwen_omni_utils import process_mm_info


def run_inference():
    MODEL_PATH = ""  # e.g. "yogkul2000/AVATAR" or a local checkpoint directory
    VIDEO_PATH = ""  # path to the input video file
    QUESTION = (
        "Use available audio and video to answer: Why is the person doing "
        "what they are doing? Give reasoning between <think> and </think> tags."
    )

    device = "cuda:0"
    use_audio_flag = True

    print("Loading model...")
    model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
        MODEL_PATH,
        device_map=device,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    ).eval()
    processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_PATH)
    print("Model loaded.")

    content_items = [
        {"type": "video", "video": VIDEO_PATH},
        {"type": "text", "text": QUESTION},
    ]

    conv = [{"role": "user", "content": content_items}]

    prompt_text = processor.apply_chat_template(
        conv, add_generation_prompt=True, tokenize=False
    )

    try:
        audios, images, videos = process_mm_info(
            conv, use_audio_in_video=use_audio_flag
        )
    except Exception as e:
        print(f"Failed to process with audio, retrying without: {e}")
        use_audio_flag = False
        audios, images, videos = process_mm_info(conv, use_audio_in_video=False)

    inputs = processor(
        text=prompt_text,
        audio=audios,
        images=images,
        videos=videos,
        return_tensors="pt",
        padding=True,
        use_audio_in_video=use_audio_flag,
    ).to(device)

    print("Generating response...")
    with torch.no_grad():
        out_ids = model.generate(
            **inputs,
            use_audio_in_video=use_audio_flag,
            do_sample=False,
            max_new_tokens=512,
        )

    reply = processor.batch_decode(out_ids, skip_special_tokens=True)[0]

    print("\n" + "=" * 20 + " MODEL OUTPUT " + "=" * 20)
    print(reply)
    print("=" * 54 + "\n")


if __name__ == "__main__":
    run_inference()
```
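Since the prompt asks the model to place its reasoning between `<think>` and `</think>` tags, the decoded reply can be split into reasoning and final answer. The helper below is a hypothetical post-processing utility, not part of the repository:

```python
import re

def split_reasoning(reply: str):
    """Split a reply into (<think> reasoning, final answer).

    Returns (reasoning, answer); reasoning is None if no <think> block exists.
    """
    m = re.search(r"<think>(.*?)</think>", reply, flags=re.DOTALL)
    if m is None:
        return None, reply.strip()
    reasoning = m.group(1).strip()
    # The answer is everything outside the <think>...</think> span
    answer = (reply[:m.start()] + reply[m.end():]).strip()
    return reasoning, answer

demo = "<think>The person lifts a kettle, so they are making tea.</think> They are making tea."
print(split_reasoning(demo))
```

This keeps evaluation simple: score only the answer portion while logging the reasoning trace separately.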

## 📝 Citation
If you find AVATAR useful for your research, please cite our paper:
```bib
@article{kulkarni2025avatar,
  title={AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video},
  author={Kulkarni, Yogesh and Fazli, Pooyan},
  journal={arXiv preprint arXiv:2508.03100},
  year={2025}
}
```

## 📪 Contact
For questions about the paper, please contact Yogesh Kulkarni at `ykulka10@asu.edu`. You can also open an issue in this GitHub repository for bugs or specific questions related to the code.