CutClaw: Agentic Hours-Long Video Editing via Music Synchronization • Paper • 2603.29664 • Published 15 days ago • 48
OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence • Paper • 2602.08683 • Published Feb 9 • 52
FineVision: Open Data is All You Need • Space • Running • 221 • A new open-source dataset for training VLMs
StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation • Paper • 2512.09363 • Published Dec 10, 2025 • 74
Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation • Paper • 2510.08673 • Published Oct 9, 2025 • 127
Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 • Image-Text-to-Text • 31B • Updated Nov 26, 2025 • 3.77k • 53
Qwen/Qwen3-VL-235B-A22B-Thinking • Image-Text-to-Text • 236B • Updated Nov 26, 2025 • 327k • 387
Introducing Idefics2: A Powerful 8B Vision-Language Model for the community • Article • Apr 15, 2024 • 191
Video-T1: Test-Time Scaling for Video Generation • Paper • 2503.18942 • Published Mar 24, 2025 • 90
The Ultra-Scale Playbook • Space • Running • 3.78k • The ultimate guide to training LLM on large GPU Clusters