Add library_name, pipeline_tag and improve model card

#3
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +24 -7
README.md CHANGED
@@ -2,20 +2,23 @@
2
  license: other
3
  license_name: nvidia
4
  license_link: LICENSE
 
 
5
  ---
6
 
7
-
8
  # AutoGaze
9
 
10
  [Project Page](https://autogaze.github.io/) | [Paper](https://huggingface.co/papers/2603.12254) | [GitHub](https://github.com/NVlabs/AutoGaze) | [Models & Data & Benchmark](https://huggingface.co/collections/bfshi/autogaze) | [Demo](https://huggingface.co/spaces/bfshi/AutoGaze)
11
 
12
- AutoGaze is a ultra light-weight model that automatically removes redundant patches in a video before passing it to any Vision Transformer (ViT) or Multi-modal Large Language Model (MLLM).
 
 
13
 
14
- Speficially, AutoGaze perceives each frame and autoregressively selects ("gazing") a minimal set of patches that can reconstruct the original video (i.e., non-redundant patches) up to a reconstruction loss threhold provided by the user. AutoGaze can self-decide when to stop gazing for each frame based on user's request on the acceptable maximum reconstruction loss.<br>
15
 
16
- Empircally, AutoGaze can reduce #tokens in ViTs/MLLMs by up to 100x, reducing their latency by up to 19x/10x. This enables efficiently scaling MLLMs to 4K-resolution, 1K-frame videos, improving performance on benchmarks such as VideoMME. Especially, it improves performance by 14% on HLVid, a high-resolution long-form video benchmark proposed in this work as well.
17
 
18
- This model is for research and development only. <br>
19
 
20
  ### Quick Start:
21
 
@@ -61,7 +64,7 @@ GitHub: https://github.com/NVlabs/AutoGaze <br>
61
 
62
  **Output Parameters:** One-Dimensional (1D)
63
 
64
- **Other Properties Related to Outupt:** N/A
65
 
66
  Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>
67
 
@@ -127,6 +130,20 @@ The raw videos are collected from public dataset including Ego4D, 100DoH, Intern
127
 
128
 
129
  ## Ethical Considerations:
130
- NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). <br>
131
 
132
  Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included. <br>
 
2
  license: other
3
  license_name: nvidia
4
  license_link: LICENSE
5
+ library_name: transformers
6
+ pipeline_tag: other
7
  ---
8
 
 
9
  # AutoGaze
10
 
11
  [Project Page](https://autogaze.github.io/) | [Paper](https://huggingface.co/papers/2603.12254) | [GitHub](https://github.com/NVlabs/AutoGaze) | [Models & Data & Benchmark](https://huggingface.co/collections/bfshi/autogaze) | [Demo](https://huggingface.co/spaces/bfshi/AutoGaze)
12
 
13
+ **AutoGaze** (Autoregressive Gazing) is an ultra-lightweight model that automatically removes redundant patches in a video before passing it to any Vision Transformer (ViT) or Multi-modal Large Language Model (MLLM).
14
+
15
+ **Authors:** Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen, Aaron Reite, Boyi Li, Jan Kautz, Song Han, David M. Chan, Pavlo Molchanov, Trevor Darrell, Hongxu Yin.
16
 
17
+ Specifically, AutoGaze perceives each frame and autoregressively selects ("gazing") a minimal set of patches that can reconstruct the original video (i.e., non-redundant patches) up to a reconstruction loss threshold provided by the user. AutoGaze can self-decide when to stop gazing for each frame based on user's request on the acceptable maximum reconstruction loss.<br>
18
 
19
+ Empirically, AutoGaze can reduce the number of tokens in ViTs/MLLMs by up to 100x, reducing their latency by up to 19x/10x. This enables efficiently scaling MLLMs to 4K-resolution, 1K-frame videos, improving performance on benchmarks such as VideoMME. Especially, it improves performance by 14% on HLVid, a high-resolution long-form video benchmark proposed in this work as well.
20
 
21
+ This model is for research and development only. <br>
22
 
23
  ### Quick Start:
24
 
 
64
 
65
  **Output Parameters:** One-Dimensional (1D)
66
 
67
+ **Other Properties Related to Output:** N/A
68
 
69
  Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>
70
 
 
130
 
131
 
132
  ## Ethical Considerations:
133
+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). <br>
134
 
135
  Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included. <br>
136
+
137
+ ## Citation
138
+
139
+ ```bibtex
140
+ @misc{shi2026attendattentionefficientscalable,
141
+ title={Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing},
142
+ author={Baifeng Shi and Stephanie Fu and Long Lian and Hanrong Ye and David Eigen and Aaron Reite and Boyi Li and Jan Kautz and Song Han and David M. Chan and Pavlo Molchanov and Trevor Darrell and Hongxu Yin},
143
+ year={2026},
144
+ eprint={2603.12254},
145
+ archivePrefix={arXiv},
146
+ primaryClass={cs.CV},
147
+ url={https://arxiv.org/abs/2603.12254},
148
+ }
149
+ ```
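
Note on the stopping rule described in the updated card: the README says AutoGaze keeps gazing on a frame until the video can be reconstructed within a user-provided reconstruction loss threshold. The snippet below is only a toy illustration of that stopping criterion, not the AutoGaze implementation or its API. The names `gaze_frame` and `gaze_video` are hypothetical, each "patch" is a single float, unselected patches are assumed to be reconstructed by copying the previous reconstruction, and a greedy largest-error pick stands in for the learned autoregressive selection.

```python
from typing import List, Sequence, Tuple


def gaze_frame(
    prev_recon: Sequence[float], frame: Sequence[float], max_loss: float
) -> Tuple[List[int], List[float]]:
    """Toy stand-in for gazing on one frame (hypothetical helper).

    Starts from the previous reconstruction and keeps selecting patches
    until the mean squared reconstruction error is at most max_loss.
    """
    if max_loss < 0:
        raise ValueError("max_loss must be non-negative")
    if not frame or len(prev_recon) != len(frame):
        raise ValueError("frame must be non-empty and match prev_recon in length")

    recon = list(prev_recon)
    selected: List[int] = []

    def loss() -> float:
        # Mean squared error between the current reconstruction and the frame.
        return sum((r - f) ** 2 for r, f in zip(recon, frame)) / len(frame)

    while loss() > max_loss:
        # Greedy assumption: pick the patch with the largest remaining error.
        idx = max(range(len(frame)), key=lambda i: (recon[i] - frame[i]) ** 2)
        recon[idx] = frame[idx]  # a selected patch is reconstructed exactly
        selected.append(idx)

    return selected, recon


def gaze_video(frames: Sequence[Sequence[float]], max_loss: float) -> List[List[int]]:
    """Run the toy gazing loop over all frames, carrying the reconstruction forward."""
    prev = [0.0] * len(frames[0])  # nothing is known before the first frame
    per_frame: List[List[int]] = []
    for frame in frames:
        selected, prev = gaze_frame(prev, frame, max_loss)
        per_frame.append(selected)
    return per_frame


if __name__ == "__main__":
    video = [
        [1.0, 2.0, 3.0, 4.0],
        [1.0, 2.0, 3.0, 9.0],  # only the last patch changes
        [1.0, 2.0, 3.0, 9.0],  # fully redundant frame
    ]
    for threshold in (0.0, 5.0):
        picks = gaze_video(video, threshold)
        print(threshold, picks, sum(len(p) for p in picks))
```

In this toy setup a looser threshold lets the loop stop earlier, and a frame that repeats the previous one needs no patches at all, which is the redundancy the card describes.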