Collections
Collections including paper arxiv:2512.21218
-
TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
Paper • 2512.16093 • Published • 97 -
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Paper • 2511.22699 • Published • 245 -
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
Paper • 2512.16676 • Published • 222 -
Sharp Monocular View Synthesis in Less Than a Second
Paper • 2512.10685 • Published • 29
-
Perception-Aware Policy Optimization for Multimodal Reasoning
Paper • 2507.06448 • Published • 48 -
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
Paper • 2507.05920 • Published • 12 -
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Paper • 2508.18265 • Published • 218 -
Latent Chain-of-Thought for Visual Reasoning
Paper • 2510.23925 • Published • 10
-
InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
Paper • 2502.11573 • Published • 9 -
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
Paper • 2502.02339 • Published • 23 -
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
Paper • 2502.11775 • Published • 9 -
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
Paper • 2412.18319 • Published • 39
-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 30 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 15 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 44 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 23
-
Latent Implicit Visual Reasoning
Paper • 2512.21218 • Published • 70 -
Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models
Paper • 2510.09592 • Published • 6 -
Residual Context Diffusion Language Models
Paper • 2601.22954 • Published • 35 -
ClawArena: Benchmarking AI Agents in Evolving Information Environments
Paper • 2604.04202 • Published • 36
-
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Paper • 2511.16334 • Published • 96 -
Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
Paper • 2509.07980 • Published • 105 -
ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute
Paper • 2509.04475 • Published • 3 -
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
Paper • 2512.01374 • Published • 106
-
Nuclear Norm Regularization for Deep Learning
Paper • 2405.14544 • Published • 1 -
Token embeddings violate the manifold hypothesis
Paper • 2504.01002 • Published • 1 -
Approximate Nullspace Augmented Finetuning for Robust Vision Transformers
Paper • 2403.10476 • Published • 1 -
ElaLoRA: Elastic & Learnable Low-Rank Adaptation for Efficient Model Fine-Tuning
Paper • 2504.00254 • Published • 1
-
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
Paper • 2312.15715 • Published • 20 -
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Paper • 2505.23747 • Published • 69 -
VideoPrism: A Foundational Visual Encoder for Video Understanding
Paper • 2402.13217 • Published • 40 -
Scaling RL to Long Videos
Paper • 2507.07966 • Published • 162