<h2 align="center"> <a href="https://arxiv.org/abs/2508.03100">AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video</a></h2>
<div align="center">

[Yogesh Kulkarni](https://yogkul2000.github.io/), &nbsp;
[Pooyan Fazli](https://www.pooyanfazli.com/) &nbsp;

<br>

<a href='https://arxiv.org/abs/2508.03100'><img src='https://img.shields.io/badge/arXiv-2508.03100-b31b1b.svg'></a> &nbsp;
<a href='https://people-robots.github.io/AVATAR/'><img src='https://img.shields.io/badge/Project-Website-blue'></a>&nbsp;
<a href='https://huggingface.co/yogkul2000/AVATAR'><img src='https://img.shields.io/badge/model-checkpoints-yellow'></a>

</div>

## Abstract
Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps.

We introduce AVATAR (Audio-Video Agent for Alignment and Reasoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing advantages by reusing past experiences with greater reward diversity, and (2) Temporal Advantage Shaping (TAS), a novel credit assignment strategy that upweights key reasoning phases during learning.

AVATAR achieves strong performance across various benchmarks, outperforming the Qwen2.5-Omni baseline by +5.4 on MMVU, +4.9 on OmniBench, and +4.5 on Video-Holmes, while demonstrating over 35% higher sample efficiency. These results show that targeted RL improvements, rather than massive architectural changes, effectively address core multimodal reasoning challenges.
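As a rough illustration of the credit-shaping idea (not the paper's implementation — the boost factor, the renormalization, and the binary reasoning mask are all hypothetical choices), advantage shaping can be sketched as reweighting per-token advantages so that tokens in key reasoning phases carry more of the learning signal:

```python
import numpy as np

def shape_advantages(advantages, reasoning_mask, boost=2.0):
    """Upweight advantages for tokens inside key reasoning spans.

    advantages: per-token advantage estimates (1-D array)
    reasoning_mask: 1 where the token belongs to a key reasoning phase
    boost: multiplicative weight for reasoning tokens (illustrative value)
    """
    weights = np.where(reasoning_mask.astype(bool), boost, 1.0)
    shaped = advantages * weights
    # Renormalize so the overall advantage magnitude is unchanged;
    # only the relative emphasis between tokens shifts.
    shaped *= np.abs(advantages).sum() / max(np.abs(shaped).sum(), 1e-8)
    return shaped

adv = np.array([0.5, -0.2, 0.8, 0.1])
mask = np.array([0, 1, 1, 0])  # tokens 1-2 are "reasoning" tokens
print(shape_advantages(adv, mask))
```

The renormalization keeps the shaped advantages on the same scale as the originals, so shaping changes where the gradient signal concentrates rather than how large it is overall.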

## 📦 Install

### Environment Setup

```bash
conda create -n avatar python=3.10
conda activate avatar

# Install PyTorch with CUDA 12.6
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu126

# Install FlashAttention
pip install flash-attn==2.7.4.post1

# If the flash-attn build fails, retry without build isolation:
pip install flash-attn==2.7.4.post1 --no-build-isolation

pip install transformers==4.54.1

# Install other dependencies
pip install decord opencv-python pillow numpy
pip install qwen-omni-utils[decord] -U
```
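As an optional sanity check (not part of the repository), the snippet below verifies that the packages above resolve in the active environment. Note the importable module names differ from the pip package names for `opencv-python` (`cv2`) and `pillow` (`PIL`):

```python
import importlib.util

def check_deps(names):
    """Report which modules resolve in the current environment."""
    return {n: importlib.util.find_spec(n) is not None for n in names}

# Module names as imported, not as installed via pip
status = check_deps(["torch", "transformers", "decord", "cv2", "PIL", "numpy"])
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```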

### Single Video Inference
```python
import torch
from transformers import (
    Qwen2_5OmniThinkerForConditionalGeneration,
    Qwen2_5OmniProcessor,
)
from qwen_omni_utils import process_mm_info


def run_inference():
    MODEL_PATH = ""  # e.g. "yogkul2000/AVATAR" or a local checkpoint directory
    VIDEO_PATH = ""  # path to the input video file
    QUESTION = (
        "Use available audio and video to answer: Why is the person doing "
        "what they are doing? Give reasoning between <think> and </think> tags."
    )

    device = "cuda:0"
    use_audio_flag = True

    print("Loading model...")
    model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
        MODEL_PATH,
        device_map=device,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    ).eval()
    processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_PATH)
    print("Model loaded.")

    content_items = [
        {"type": "video", "video": VIDEO_PATH},
        {"type": "text", "text": QUESTION},
    ]

    conv = [{"role": "user", "content": content_items}]

    prompt_text = processor.apply_chat_template(
        conv, add_generation_prompt=True, tokenize=False
    )

    try:
        audios, images, videos = process_mm_info(
            conv, use_audio_in_video=use_audio_flag
        )
    except Exception as e:
        print(f"Failed to process with audio, retrying without: {e}")
        use_audio_flag = False
        audios, images, videos = process_mm_info(conv, use_audio_in_video=False)

    inputs = processor(
        text=prompt_text,
        audio=audios,
        images=images,
        videos=videos,
        return_tensors="pt",
        padding=True,
        use_audio_in_video=use_audio_flag,
    ).to(device)

    print("Generating response...")
    with torch.no_grad():
        out_ids = model.generate(
            **inputs,
            use_audio_in_video=use_audio_flag,
            do_sample=False,
            max_new_tokens=512,
        )

    reply = processor.batch_decode(out_ids, skip_special_tokens=True)[0]

    print("\n" + "=" * 20 + " MODEL OUTPUT " + "=" * 20)
    print(reply)
    print("=" * 54 + "\n")


if __name__ == "__main__":
    run_inference()
```
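Since the prompt asks the model to place its reasoning between `<think>` and `</think>` tags, the decoded reply can be split into reasoning and final answer. The helper below is a hypothetical post-processing utility, not part of the repository:

```python
import re

def split_reasoning(reply: str):
    """Split a reply into (<think> reasoning, final answer).

    Returns (reasoning, answer); reasoning is None if no <think> block exists.
    """
    m = re.search(r"<think>(.*?)</think>", reply, flags=re.DOTALL)
    if m is None:
        return None, reply.strip()
    reasoning = m.group(1).strip()
    # The answer is everything outside the <think>...</think> span
    answer = (reply[:m.start()] + reply[m.end():]).strip()
    return reasoning, answer

demo = "<think>The person lifts a kettle, so they are making tea.</think> They are making tea."
print(split_reasoning(demo))
```

This keeps evaluation simple: score only the answer portion while logging the reasoning trace separately.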

## 📝 Citation
If you find AVATAR useful for your research, please cite our paper:
```bib
@article{kulkarni2025avatar,
  title={AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video},
  author={Kulkarni, Yogesh and Fazli, Pooyan},
  journal={arXiv preprint arXiv:2508.03100},
  year={2025}
}
```

## 📪 Contact
For questions about the paper, please contact Yogesh Kulkarni at `ykulka10@asu.edu`. You can also open an issue in this GitHub repository for bugs or specific questions related to the code.