Collections

Collections including paper arxiv:2601.22060

Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Paper • 2601.22060 • Published Jan 29 • 155
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Paper • 2602.02185 • Published Feb 2 • 118
SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Paper • 2603.23483 • Published 26 days ago • 62
WorldAgents: Can Foundation Image Models be Agents for 3D World Models?

Paper • 2603.19708 • Published about 1 month ago • 13

Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

Paper • 2602.12036 • Published Feb 12 • 93
Reinforcement Learning for Self-Improving Agent with Skill Library

Paper • 2512.17102 • Published Dec 18, 2025 • 42
Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

Paper • 2512.23705 • Published Dec 29, 2025 • 45
Schoenfeld's Anatomy of Mathematical Reasoning by Language Models

Paper • 2512.19995 • Published Dec 23, 2025 • 16

Vision-DeepResearch

Osilly/Vision-DeepResearch-Toy-SFT-Data

Viewer • Updated Feb 1 • 1k • 146
Osilly/Vision-DeepResearch-Toy-RL-Data

Viewer • Updated Feb 1 • 1k • 79
Osilly/VDR-Bench

Viewer • Updated Feb 1 • 2k • 307
Osilly/VDR-Bench-testmini

Viewer • Updated Feb 1 • 500 • 410

Applications and Uses

about 23 hours ago

ComfyUI-R1: Exploring Reasoning Models for Workflow Generation

Paper • 2506.09790 • Published Jun 11, 2025 • 53
Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance

Paper • 2506.06444 • Published Jun 6, 2025 • 73
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Paper • 2506.11763 • Published Jun 13, 2025 • 74
Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research

Paper • 2502.04644 • Published Feb 7, 2025 • 4

Multimodal Agent

Gemini Robotics: Bringing AI into the Physical World

Paper • 2503.20020 • Published Mar 25, 2025 • 31
Magma: A Foundation Model for Multimodal AI Agents

Paper • 2502.13130 • Published Feb 18, 2025 • 58
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Paper • 2311.05437 • Published Nov 9, 2023 • 51
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Paper • 2410.23218 • Published Oct 30, 2024 • 49

Qwen3-TTS Technical Report

Paper • 2601.15621 • Published Jan 22 • 74
PaperBanana: Automating Academic Illustration for AI Scientists

Paper • 2601.23265 • Published Jan 30 • 224
Moonshine: Speech Recognition for Live Transcription and Voice Commands

Paper • 2410.15608 • Published Oct 21, 2024 • 12
PersonaLive! Expressive Portrait Image Animation for Live Streaming

Paper • 2512.11253 • Published Dec 12, 2025 • 40

Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Paper • 2601.22060 • Published Jan 29 • 155

GUI-G^2: Gaussian Reward Modeling for GUI Grounding

Paper • 2507.15846 • Published Jul 21, 2025 • 135
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

Paper • 2508.05748 • Published Aug 7, 2025 • 142
Mobile-Agent-v3: Foundamental Agents for GUI Automation

Paper • 2508.15144 • Published Aug 21, 2025 • 65
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs

Paper • 2508.16153 • Published Aug 22, 2025 • 162

Qwen2.5-Omni Technical Report

Paper • 2503.20215 • Published Mar 26, 2025 • 172
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO

Paper • 2505.22453 • Published May 28, 2025 • 46
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning

Paper • 2505.23380 • Published May 29, 2025 • 22
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

Paper • 2505.21523 • Published May 23, 2025 • 13

facebook/w2v-bert-2.0

Feature Extraction • 0.6B • Updated Jan 25, 2024 • 3.1M • 213
facebook/metaclip-h14-fullcc2.5b

Zero-Shot Image Classification • 1.0B • Updated Jan 11, 2024 • 10.8k • 49
openai/clip-vit-large-patch14

Zero-Shot Image Classification • 0.4B • Updated Sep 15, 2023 • 21.1M • 1.99k
Salesforce/blip-image-captioning-large

Image-to-Text • 0.5B • Updated Feb 3, 2025 • 1.4M • 1.47k
