You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass Paper • 2604.10966 • Published 5 days ago • 9
WildDet3D: Scaling Promptable 3D Detection in the Wild Paper • 2604.08626 • Published 9 days ago • 237
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web Paper • 2604.08516 • Published 9 days ago • 41
WildDet3D Collection A collection of WildDet3D artifacts, including demos, model checkpoints, and data. https://github.com/allenai/WildDet3D • 8 items • Updated 5 days ago • 17
VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models Paper • 2603.24575 • Published 23 days ago • 18
Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos Paper • 2602.23543 • Published Feb 26 • 9
Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration? Paper • 2602.07055 • Published Feb 4 • 23
PixMo Collection A set of vision-language datasets built by Ai2 and used to train the Molmo family of models. Read more at https://molmo.allenai.org/blog • 9 items • Updated Mar 2 • 88
Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning Paper • 2510.23473 • Published Oct 27, 2025 • 86
MolmoAct Data Mixture Collection All datasets for the MolmoAct (Multimodal Open Language Model for Action) release. • 4 items • Updated Dec 23, 2025 • 18
MolmoAct Collection All models for the MolmoAct (Multimodal Open Language Model for Action) release. • 10 items • Updated Dec 23, 2025 • 35
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models Paper • 2504.10479 • Published Apr 14, 2025 • 308
On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective Paper • 2502.14296 • Published Feb 20, 2025 • 45
TACO Models Collection This collection contains the best-performing TACO models based on LLaMA-3/Qwen2 and SigLIP/CLIP. • 3 items • Updated Oct 31, 2025 • 8
CoTA Datasets Collection This collection contains all versions of the CoTA (Chain-of-Thought-and-Action) datasets. • 4 items • Updated Mar 2 • 7
ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models Paper • 2412.07012 • Published Dec 9, 2024 • 1
TaskMeAnything Collection A collection of TaskMeAnything resources. https://github.com/JieyuZ2/TaskMeAnything • 7 items • Updated Mar 2 • 3