-
CUGA Agent
🤖98Configurable Generalist Agent, leader in AppWorld Benchmark
-
MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
Paper • 2603.28407 • Published • 68 -
How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings
Paper • 2604.04323 • Published • 40
Collections
Discover the best community collections!
Collections including paper arxiv:2603.28407
-
ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
Paper • 2603.25746 • Published • 155 -
TAPS: Task Aware Proposal Distributions for Speculative Sampling
Paper • 2603.27027 • Published • 142 -
Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
Paper • 2603.25716 • Published • 154 -
LongCat-Next: Lexicalizing Modalities as Discrete Tokens
Paper • 2603.27538 • Published • 143
-
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
Paper • 2505.13227 • Published • 45 -
facebook/natural_reasoning
Viewer • Updated • 1.15M • 1.46k • 561 -
nvidia/OpenMathReasoning
Viewer • Updated • 5.68M • 17.6k • 453 -
Search Arena: Analyzing Search-Augmented LLMs
Paper • 2506.05334 • Published • 18
-
BitNet: Scaling 1-bit Transformers for Large Language Models
Paper • 2310.11453 • Published • 107 -
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Paper • 2310.11511 • Published • 80 -
In-Context Learning Creates Task Vectors
Paper • 2310.15916 • Published • 43 -
Matryoshka Diffusion Models
Paper • 2310.15111 • Published • 45
-
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
Paper • 2511.18538 • Published • 304 -
SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models
Paper • 2511.05459 • Published • 4 -
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
Paper • 2512.18470 • Published • 12 -
DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
Paper • 2601.09688 • Published • 127
-
Qwen2.5-Omni Technical Report
Paper • 2503.20215 • Published • 172 -
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
Paper • 2505.22453 • Published • 46 -
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
Paper • 2505.23380 • Published • 22 -
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
Paper • 2505.21523 • Published • 13
-
CUGA Agent
🤖98Configurable Generalist Agent, leader in AppWorld Benchmark
-
MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
Paper • 2603.28407 • Published • 68 -
How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings
Paper • 2604.04323 • Published • 40
-
ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
Paper • 2603.25746 • Published • 155 -
TAPS: Task Aware Proposal Distributions for Speculative Sampling
Paper • 2603.27027 • Published • 142 -
Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
Paper • 2603.25716 • Published • 154 -
LongCat-Next: Lexicalizing Modalities as Discrete Tokens
Paper • 2603.27538 • Published • 143
-
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
Paper • 2511.18538 • Published • 304 -
SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models
Paper • 2511.05459 • Published • 4 -
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
Paper • 2512.18470 • Published • 12 -
DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
Paper • 2601.09688 • Published • 127
-
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
Paper • 2505.13227 • Published • 45 -
facebook/natural_reasoning
Viewer • Updated • 1.15M • 1.46k • 561 -
nvidia/OpenMathReasoning
Viewer • Updated • 5.68M • 17.6k • 453 -
Search Arena: Analyzing Search-Augmented LLMs
Paper • 2506.05334 • Published • 18
-
Qwen2.5-Omni Technical Report
Paper • 2503.20215 • Published • 172 -
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
Paper • 2505.22453 • Published • 46 -
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
Paper • 2505.23380 • Published • 22 -
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
Paper • 2505.21523 • Published • 13
-
BitNet: Scaling 1-bit Transformers for Large Language Models
Paper • 2310.11453 • Published • 107 -
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Paper • 2310.11511 • Published • 80 -
In-Context Learning Creates Task Vectors
Paper • 2310.15916 • Published • 43 -
Matryoshka Diffusion Models
Paper • 2310.15111 • Published • 45