Collections including paper arxiv:2503.16416

Enterprise Agents and Benchmarks

Enterprise agent ecosystem featuring AssetOpsBench (industrial), ITBench (SRE, FinOps, CISO), and CUGA to accelerate AI automation

AssetOpsBench 🚀

Running • 19

Generate and benchmark machine learning models with ease

CUGA Agent 🤖

Running • Featured • 98

Configurable Generalist Agent, leader in AppWorld Benchmark

ITBench-Lite-Space 🚀

Running • 7

Develop and run interactive code notebooks with JupyterLab

VAKRA Leaderboard 🏆

Running • 18

Evaluate AI agents on multi‑hop, multi‑source enterprise tasks

Agent Evaluation

MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models

Paper • 2507.12806 • Published Jul 17, 2025 • 21
Survey on Evaluation of LLM-based Agents

Paper • 2503.16416 • Published Mar 20, 2025 • 96

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

Paper • 2504.01990 • Published Mar 31, 2025 • 305
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Paper • 2504.10479 • Published Apr 14, 2025 • 308
What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models

Paper • 2503.24235 • Published Mar 31, 2025 • 55
Seedream 3.0 Technical Report

Paper • 2504.11346 • Published Apr 15, 2025 • 70

Survey on Evaluation of LLM-based Agents

Paper • 2503.16416 • Published Mar 20, 2025 • 96

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Paper • 2503.16419 • Published Mar 20, 2025 • 77
Survey on Evaluation of LLM-based Agents

Paper • 2503.16416 • Published Mar 20, 2025 • 96

LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Paper • 2508.15760 • Published Aug 21, 2025 • 47
LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

Paper • 2508.01780 • Published Aug 3, 2025 • 21
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

Paper • 2304.08244 • Published Apr 14, 2023 • 1
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs

Paper • 2508.16153 • Published Aug 22, 2025 • 162

Survey on Evaluation of LLM-based Agents

Paper • 2503.16416 • Published Mar 20, 2025 • 96
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

Paper • 2505.04921 • Published May 8, 2025 • 187
Survey of User Interface Design and Interaction Techniques in Generative AI Applications

Paper • 2410.22370 • Published Oct 28, 2024 • 12
Survey of Hallucination in Natural Language Generation

Paper • 2202.03629 • Published Feb 8, 2022

Fun journal papers I've read

Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders

Paper • 2503.03601 • Published Mar 5, 2025 • 233
Transformers without Normalization

Paper • 2503.10622 • Published Mar 13, 2025 • 172
Survey on Evaluation of LLM-based Agents

Paper • 2503.16416 • Published Mar 20, 2025 • 96

Survey on Evaluation of LLM-based Agents

Paper • 2503.16416 • Published Mar 20, 2025 • 96

CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing

Paper • 2503.10613 • Published Mar 13, 2025 • 79
BrushEdit: All-In-One Image Inpainting and Editing

Paper • 2412.10316 • Published Dec 13, 2024 • 36
Survey on Evaluation of LLM-based Agents

Paper • 2503.16416 • Published Mar 20, 2025 • 96
