Collections
Discover the best community collections!
Collections including paper arxiv:2511.15661
- ARE: Scaling Up Agent Environments and Evaluations (Paper • 2509.17158 • Published • 36)
- ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation (Paper • 2510.08551 • Published • 34)
- Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention (Paper • 2510.04212 • Published • 26)
- ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning (Paper • 2510.12693 • Published • 28)

- MARS: A Multi-Agent Framework Incorporating Socratic Guidance for Automated Prompt Optimization (Paper • 2503.16874 • Published • 45)
- System Prompt Optimization with Meta-Learning (Paper • 2505.09666 • Published • 71)
- UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning (Paper • 2505.23380 • Published • 22)
- DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning (Paper • 2505.23754 • Published • 15)

- iFormer: Integrating ConvNet and Transformer for Mobile Application (Paper • 2501.15369 • Published • 13)
- VisPlay: Self-Evolving Vision-Language Models from Images (Paper • 2511.15661 • Published • 44)
- SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices (Paper • 2601.08303 • Published • 19)

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters (Paper • 2402.04252 • Published • 30)
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models (Paper • 2402.03749 • Published • 15)
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding (Paper • 2402.04615 • Published • 44)
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss (Paper • 2402.05008 • Published • 23)

- MemLoRA: Distilling Expert Adapters for On-Device Memory Systems (Paper • 2512.04763 • Published • 5)
- VisPlay: Self-Evolving Vision-Language Models from Images (Paper • 2511.15661 • Published • 44)
- VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse (Paper • 2512.14531 • Published • 15)
- Improving Recursive Transformers with Mixture of LoRAs (Paper • 2512.12880 • Published • 6)

- A Survey on Vision-Language-Action Models: An Action Tokenization Perspective (Paper • 2507.01925 • Published • 39)
- Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning (Paper • 2507.16746 • Published • 35)
- MolmoAct: Action Reasoning Models that can Reason in Space (Paper • 2508.07917 • Published • 45)
- Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies (Paper • 2508.20072 • Published • 32)

- InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning (Paper • 2502.11573 • Published • 9)
- Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking (Paper • 2502.02339 • Published • 23)
- video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model (Paper • 2502.11775 • Published • 9)
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search (Paper • 2412.18319 • Published • 39)