---
title: Agent Eval Lab
emoji: "🧪"
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
license: apache-2.0
tags:
- agents
- evaluation
- software-engineering
- tool-use
models: []
datasets:
- mukunda1729/agent-eval-scenarios
- mukunda1729/premium-agent-repo-landscape
---

# Agent Eval Lab

Agent Eval Lab is a small public demo that turns rough agent workflows into practical evaluation scenarios.

It helps builders generate:

- a scenario title
- task setup
- expected behavior
- likely failure modes
- scoring dimensions
- follow-up tests

The Space is intentionally lightweight and portfolio-friendly: fast to inspect, easy to extend, and aligned with public artifacts on Kaggle, Codeberg, and other platforms.
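
As a rough illustration of the six outputs listed above, a generated scenario could be held in a small record like the one below. This is a minimal sketch and not the Space's actual schema: the `EvalScenario` class, its field names, the `missing_fields` helper, and the example values are all hypothetical.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class EvalScenario:
    """Hypothetical container for one generated scenario (assumed field names)."""

    title: str
    task_setup: str
    expected_behavior: str
    failure_modes: list[str] = field(default_factory=list)
    scoring_dimensions: list[str] = field(default_factory=list)
    follow_up_tests: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are still empty."""
        return [name for name, value in asdict(self).items() if not value]


# Illustrative values only; they are not taken from the published datasets.
scenario = EvalScenario(
    title="Refund lookup with a flaky billing tool",
    task_setup="The agent must find an order and issue a refund; the billing tool times out once.",
    expected_behavior="Retry the tool call once, then confirm the refund amount with the user.",
    failure_modes=["gives up after the first timeout", "issues the refund twice"],
    scoring_dimensions=["task completion", "tool-call correctness", "error recovery"],
)

# follow_up_tests was left empty, so it is reported as missing.
print(scenario.missing_fields())
```

A flat record like this is easy to serialize with `asdict` and to check for completeness before a scenario is shown or stored.
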

## Associated Papers

- Primary paper: [Lightweight Evaluation and Operational Scorecards for Tool-Using AI Agents](https://doi.org/10.5281/zenodo.20034550)
- Paper landing page: [lightweight-agent-eval-paper](https://mukundakatta.github.io/lightweight-agent-eval-paper/)
- Artifact repo: [MukundaKatta/lightweight-agent-eval-paper](https://github.com/MukundaKatta/lightweight-agent-eval-paper)
- Companion evaluation harness paper: [AI Eval Forge: Mixed-Check Regression Testing for LLM and Agent Workflows](https://doi.org/10.5281/zenodo.20044318)

## Related Public Artifacts

- Hugging Face dataset: [mukunda1729/agent-eval-scenarios](https://huggingface.co/datasets/mukunda1729/agent-eval-scenarios)
- Hugging Face dataset: [mukunda1729/premium-agent-repo-landscape](https://huggingface.co/datasets/mukunda1729/premium-agent-repo-landscape)
- Hugging Face collection: [Agent Labs Portfolio](https://huggingface.co/collections/mukunda1729/agent-labs-portfolio)