-
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Paper • 2510.02283 • Published • 98 -
Paper2Video: Automatic Video Generation from Scientific Papers
Paper • 2510.05096 • Published • 120 -
LongLive: Real-time Interactive Long Video Generation
Paper • 2509.22622 • Published • 189 -
HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
Paper • 2509.08519 • Published • 130
Collections
Collections including paper arxiv:2507.07966
-
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
Paper • 2506.23918 • Published • 90 -
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
Paper • 2504.16030 • Published • 37 -
Time Blindness: Why Video-Language Models Can't See What Humans Can?
Paper • 2505.24867 • Published • 82 -
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
Paper • 2507.01006 • Published • 254
-
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
Paper • 2505.15966 • Published • 53 -
GRIT: Teaching MLLMs to Think with Images
Paper • 2505.15879 • Published • 13 -
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
Paper • 2505.16854 • Published • 11 -
VLM-R^3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
Paper • 2505.16192 • Published • 12
-
Scaling RL to Long Videos
Paper • 2507.07966 • Published • 162 -
Group Sequence Policy Optimization
Paper • 2507.18071 • Published • 320 -
CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
Paper • 2507.14111 • Published • 25 -
MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
Paper • 2507.21183 • Published • 15
-
vrgamedevgirl84/Wan14BT2VFusioniX
Text-to-Video • Updated • 605 -
TheStageAI/Elastic-mochi-1-preview
Text-to-Video • Updated • 23 • 2 -
nesaorg/animatediff-base
Text-to-Video • Updated • 132 -
4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation
Paper • 2506.18839 • Published • 13
-
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning
Paper • 2506.04207 • Published • 48 -
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
Paper • 2504.11468 • Published • 30 -
RLPR: Extrapolating RLVR to General Domains without Verifiers
Paper • 2506.18254 • Published • 33 -
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
Paper • 2507.05255 • Published • 75
-
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Paper • 2503.14734 • Published • 7 -
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
Paper • 2401.02117 • Published • 33 -
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Paper • 2506.01844 • Published • 158 -
Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding
Paper • 2506.16035 • Published • 89