VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models Paper • 2603.22003 • Published 23 days ago • 11
OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence Paper • 2602.08683 • Published Feb 9 • 52
Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance Paper • 2512.08765 • Published Dec 9, 2025 • 134
Cambrian-S: Towards Spatial Supersensing in Video Paper • 2511.04670 • Published Nov 6, 2025 • 39
Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations Paper • 2510.23607 • Published Oct 27, 2025 • 181
DreamOmni2: Multimodal Instruction-based Editing and Generation Paper • 2510.06679 • Published Oct 8, 2025 • 74
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech Paper • 2509.25131 • Published Sep 29, 2025 • 16
LongLive: Real-time Interactive Long Video Generation Paper • 2509.22622 • Published Sep 26, 2025 • 189
MGM-Omni Collection MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech • 13 items • Updated Mar 2 • 11
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning Paper • 2507.13348 • Published Jul 17, 2025 • 79
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition Paper • 2412.09501 • Published Dec 12, 2024 • 48
VisionZip: Longer is Better but Not Necessary in Vision Language Models Paper • 2412.04467 • Published Dec 5, 2024 • 118
Llama 3.1 Collection This collection hosts the transformers and original repos of the Llama 3.1, Llama Guard 3 and Prompt Guard models • 11 items • Updated Dec 6, 2024 • 710
MGM-Data Collection Official data collection for the paper "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models" • 2 items • Updated Apr 21, 2024 • 7