Collections
Collections including paper arxiv:2509.15937

- MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
  Paper • 2509.16197 • Published • 58
- InternRobotics/VLAC
  Robotics • 2B • Updated • 43 • 41
- LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence
  Paper • 2509.12203 • Published • 20
- A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning
  Paper • 2509.15937 • Published • 20

- A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
  Paper • 2507.01925 • Published • 39
- Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
  Paper • 2507.16746 • Published • 35
- MolmoAct: Action Reasoning Models that can Reason in Space
  Paper • 2508.07917 • Published • 45
- Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
  Paper • 2508.20072 • Published • 32

- microsoft/bitnet-b1.58-2B-4T
  Text Generation • 0.8B • Updated • 15.7k • 1.43k
- M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
  Paper • 2504.10449 • Published • 15
- nvidia/Llama-3.1-Nemotron-8B-UltraLong-2M-Instruct
  Text Generation • 8B • Updated • 99 • 17
- ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
  Paper • 2504.11536 • Published • 63

- openai/whisper-large-v3
  Automatic Speech Recognition • 2B • Updated • 4.82M • 5.6k
- openai/whisper-large-v3-turbo
  Automatic Speech Recognition • 0.8B • Updated • 6.68M • 2.96k
- allenai/olmOCR-7B-0225-preview
  Image-Text-to-Text • 8B • Updated • 7.17k • 701
- microsoft/Phi-4-multimodal-instruct
  Automatic Speech Recognition • 6B • Updated • 331k • 1.58k

- A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
  Paper • 2507.01925 • Published • 39
- DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
  Paper • 2507.04447 • Published • 45
- A Survey on Vision-Language-Action Models for Autonomous Driving
  Paper • 2506.24044 • Published • 14
- EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
  Paper • 2507.10548 • Published • 37

- lusxvr/nanoVLM-222M
  Image-Text-to-Text • 0.2B • Updated • 196 • 99
- Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
  Paper • 2503.09516 • Published • 39
- AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
  Paper • 2505.24863 • Published • 97
- QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning
  Paper • 2505.17667 • Published • 88

- Gemini Robotics: Bringing AI into the Physical World
  Paper • 2503.20020 • Published • 31
- Magma: A Foundation Model for Multimodal AI Agents
  Paper • 2502.13130 • Published • 58
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 51
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
  Paper • 2410.23218 • Published • 49

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 30
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 15
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23