metadata
title: Agent Eval Lab
emoji: 🧪
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
license: apache-2.0
tags:
- agents
- evaluation
- software-engineering
- tool-use
models: []
datasets:
- mukunda1729/agent-eval-scenarios
- mukunda1729/premium-agent-repo-landscape
Agent Eval Lab
Agent Eval Lab is a small public demo for turning rough agent workflows into practical evaluation scenarios.
It helps builders generate:
- a scenario title
- task setup
- expected behavior
- likely failure modes
- scoring dimensions
- next-step follow-up tests
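The six outputs above map naturally onto a small record type. The following is a minimal sketch, not the Space's actual implementation: the `EvalScenario` class, the `draft_scenario` function, and the fixed template strings are all hypothetical names and values chosen for illustration, and a real generator would presumably replace the template with its own logic.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvalScenario:
    """Hypothetical container mirroring the six outputs listed above."""

    title: str
    task_setup: str
    expected_behavior: str
    failure_modes: list[str] = field(default_factory=list)
    scoring_dimensions: list[str] = field(default_factory=list)
    follow_up_tests: list[str] = field(default_factory=list)


def draft_scenario(workflow: str) -> EvalScenario:
    """Turn a rough workflow description into a templated scenario draft.

    Assumption: a fixed template stands in for whatever generation logic
    the Space actually uses.
    """
    # Normalize the description: trim whitespace and a trailing period.
    summary = workflow.strip().rstrip(".")
    if not summary:
        raise ValueError("workflow description must not be empty")
    return EvalScenario(
        title=f"Scenario: {summary}",
        task_setup=f"The agent is asked to: {summary}.",
        expected_behavior="Completes the task using only the tools provided.",
        failure_modes=[
            "skips a required step",
            "calls a tool with wrong arguments",
        ],
        scoring_dimensions=["task completion", "tool-use correctness"],
        follow_up_tests=["repeat the task with one tool unavailable"],
    )


if __name__ == "__main__":
    scenario = draft_scenario("triage a failing CI build and open a fix PR.")
    # Serialize to JSON so the draft is easy to inspect or store.
    print(json.dumps(asdict(scenario), indent=2))
```

Keeping the scenario as a plain dataclass means it can be serialized with `asdict` and stored alongside the datasets listed in the metadata, though the field names here are illustrative and may not match the published schema.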
The Space is intentionally lightweight and portfolio-friendly: fast to inspect, easy to extend, and aligned with related public artifacts on Kaggle, Codeberg, and other platforms.
Associated Papers
- Primary paper: Lightweight Evaluation and Operational Scorecards for Tool-Using AI Agents
- Paper landing page: lightweight-agent-eval-paper
- Artifact repo: MukundaKatta/lightweight-agent-eval-paper
- Companion evaluation harness paper: AI Eval Forge: Mixed-Check Regression Testing for LLM and Agent Workflows
Related Public Artifacts
- Hugging Face dataset: mukunda1729/agent-eval-scenarios
- Hugging Face dataset: mukunda1729/premium-agent-repo-landscape
- Hugging Face collection: Agent Labs Portfolio