Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding • Paper • 2604.00528 • Published 15 days ago • 12
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset • Paper • 2505.09568 • Published May 14, 2025 • 99
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant • Paper • 2505.05467 • Published May 8, 2025 • 13
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models • Paper • 2410.03290 • Published Oct 4, 2024 • 7