Emu3.5: Native Multimodal Models are World Learners
Paper • 2510.26583 • Published • 114
RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging
Paper • 2510.20479 • Published • 12
Paper • 2510.18212 • Published • 36
Video-As-Prompt: Unified Semantic Control for Video Generation
Paper • 2510.20888 • Published • 50
Reasoning with Sampling: Your Base Model is Smarter Than You Think
Paper • 2510.14901 • Published • 48
DeepAgent: A General Reasoning Agent with Scalable Toolsets
Paper • 2510.21618 • Published • 103
PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity
Paper • 2510.23603 • Published • 26
Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences
Paper • 2510.23451 • Published • 28
ACG: Action Coherence Guidance for Flow-based VLA models
Paper • 2510.22201 • Published • 37
Rethinking Visual Intelligence: Insights from Video Pretraining
Paper • 2510.24448 • Published • 6
Latent Chain-of-Thought for Visual Reasoning
Paper • 2510.23925 • Published • 10
From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
Paper • 2510.17439 • Published • 28
RoboOmni: Proactive Robot Manipulation in Omni-modal Context
Paper • 2510.23763 • Published • 62
The Principles of Diffusion Models
Paper • 2510.21890 • Published • 64
Reasoning-Aware GRPO using Process Mining
Paper • 2510.25065 • Published • 42
Scaling Latent Reasoning via Looped Language Models
Paper • 2510.25741 • Published • 229
Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning
Paper • 2510.23473 • Published • 86
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Paper • 2510.26802 • Published • 34
Exploring Conditions for Diffusion models in Robotic Control
Paper • 2510.15510 • Published • 40