-
Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models
Paper • 2601.22060 • Published • 155 -
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Paper • 2602.02185 • Published • 118 -
SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning
Paper • 2603.23483 • Published • 62 -
WorldAgents: Can Foundation Image Models be Agents for 3D World Models?
Paper • 2603.19708 • Published • 13
Collections
Discover the best community collections!
Collections including paper arxiv:2601.22060
-
Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models
Paper • 2602.12036 • Published • 93 -
Reinforcement Learning for Self-Improving Agent with Skill Library
Paper • 2512.17102 • Published • 42 -
Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
Paper • 2512.23705 • Published • 45 -
Schoenfeld's Anatomy of Mathematical Reasoning by Language Models
Paper • 2512.19995 • Published • 16
-
ComfyUI-R1: Exploring Reasoning Models for Workflow Generation
Paper • 2506.09790 • Published • 53 -
Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance
Paper • 2506.06444 • Published • 73 -
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
Paper • 2506.11763 • Published • 74 -
Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research
Paper • 2502.04644 • Published • 4
-
Gemini Robotics: Bringing AI into the Physical World
Paper • 2503.20020 • Published • 31 -
Magma: A Foundation Model for Multimodal AI Agents
Paper • 2502.13130 • Published • 58 -
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper • 2311.05437 • Published • 51 -
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper • 2410.23218 • Published • 49
-
Qwen3-TTS Technical Report
Paper • 2601.15621 • Published • 74 -
PaperBanana: Automating Academic Illustration for AI Scientists
Paper • 2601.23265 • Published • 224 -
Moonshine: Speech Recognition for Live Transcription and Voice Commands
Paper • 2410.15608 • Published • 12 -
PersonaLive! Expressive Portrait Image Animation for Live Streaming
Paper • 2512.11253 • Published • 40
-
GUI-G^2: Gaussian Reward Modeling for GUI Grounding
Paper • 2507.15846 • Published • 135 -
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
Paper • 2508.05748 • Published • 142 -
Mobile-Agent-v3: Foundamental Agents for GUI Automation
Paper • 2508.15144 • Published • 65 -
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
Paper • 2508.16153 • Published • 162
-
Qwen2.5-Omni Technical Report
Paper • 2503.20215 • Published • 172 -
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
Paper • 2505.22453 • Published • 46 -
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
Paper • 2505.23380 • Published • 22 -
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
Paper • 2505.21523 • Published • 13
-
facebook/w2v-bert-2.0
Feature Extraction • 0.6B • Updated • 3.1M • 213 -
facebook/metaclip-h14-fullcc2.5b
Zero-Shot Image Classification • 1.0B • Updated • 10.8k • 49 -
openai/clip-vit-large-patch14
Zero-Shot Image Classification • 0.4B • Updated • 21.1M • 1.99k -
Salesforce/blip-image-captioning-large
Image-to-Text • 0.5B • Updated • 1.4M • 1.47k
-
Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models
Paper • 2601.22060 • Published • 155 -
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Paper • 2602.02185 • Published • 118 -
SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning
Paper • 2603.23483 • Published • 62 -
WorldAgents: Can Foundation Image Models be Agents for 3D World Models?
Paper • 2603.19708 • Published • 13
-
Qwen3-TTS Technical Report
Paper • 2601.15621 • Published • 74 -
PaperBanana: Automating Academic Illustration for AI Scientists
Paper • 2601.23265 • Published • 224 -
Moonshine: Speech Recognition for Live Transcription and Voice Commands
Paper • 2410.15608 • Published • 12 -
PersonaLive! Expressive Portrait Image Animation for Live Streaming
Paper • 2512.11253 • Published • 40
-
Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models
Paper • 2602.12036 • Published • 93 -
Reinforcement Learning for Self-Improving Agent with Skill Library
Paper • 2512.17102 • Published • 42 -
Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
Paper • 2512.23705 • Published • 45 -
Schoenfeld's Anatomy of Mathematical Reasoning by Language Models
Paper • 2512.19995 • Published • 16
-
GUI-G^2: Gaussian Reward Modeling for GUI Grounding
Paper • 2507.15846 • Published • 135 -
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
Paper • 2508.05748 • Published • 142 -
Mobile-Agent-v3: Foundamental Agents for GUI Automation
Paper • 2508.15144 • Published • 65 -
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
Paper • 2508.16153 • Published • 162
-
ComfyUI-R1: Exploring Reasoning Models for Workflow Generation
Paper • 2506.09790 • Published • 53 -
Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance
Paper • 2506.06444 • Published • 73 -
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
Paper • 2506.11763 • Published • 74 -
Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research
Paper • 2502.04644 • Published • 4
-
Qwen2.5-Omni Technical Report
Paper • 2503.20215 • Published • 172 -
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
Paper • 2505.22453 • Published • 46 -
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
Paper • 2505.23380 • Published • 22 -
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
Paper • 2505.21523 • Published • 13
-
Gemini Robotics: Bringing AI into the Physical World
Paper • 2503.20020 • Published • 31 -
Magma: A Foundation Model for Multimodal AI Agents
Paper • 2502.13130 • Published • 58 -
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper • 2311.05437 • Published • 51 -
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper • 2410.23218 • Published • 49
-
facebook/w2v-bert-2.0
Feature Extraction • 0.6B • Updated • 3.1M • 213 -
facebook/metaclip-h14-fullcc2.5b
Zero-Shot Image Classification • 1.0B • Updated • 10.8k • 49 -
openai/clip-vit-large-patch14
Zero-Shot Image Classification • 0.4B • Updated • 21.1M • 1.99k -
Salesforce/blip-image-captioning-large
Image-to-Text • 0.5B • Updated • 1.4M • 1.47k