Commit a9d2c33 · Parent(s): 4c66880
eval readme update
eval/README.md · CHANGED (+14 -0)
Rubric-based evaluation pipeline implementing the [Rubrics as Rewards](https://arxiv.org/abs/2507.17746) paper (RaR-Explicit formula).

## Components

| Component | Purpose | Long Term Goal |
|-----------|---------|----------------|
| **`generate_rubrics.py`** | Generates instance-specific evaluation criteria (7-20 weighted rubrics) from QA pairs using an LLM, following the RaR paper methodology | Improve rubric quality with few-shot examples, domain-specific templates, and iterative refinement |
| **`rubric_eval.py`** | Scores responses using the RaR-Explicit formula: checks each criterion independently via an LLM judge, then computes a weighted normalized score (sketched below this table) | Support batch evaluation, caching, and alternative scoring formulas (RaR-Holistic) |
| **`task.py`** | Defines the Inspect AI task `hf-benchmark-with-rubrics` that wires dataset, solver, and rubric scorer into a single evaluation pipeline (see the wiring sketch below) | Add more task variants for different benchmarks (code generation, tool use, multi-turn) |
| **`solvers.py`** | Registry of solver implementations (`hf_agent`, `claude_code`, `claude_code+hf_mcp`) that can be swapped via CLI args | Expand solver library to benchmark more agents (OpenAI Codex, Gemini, open-source agents) |
| **`hf_agent_connector.py`** | Lightweight bridge that spins up the hf-agent stack (tools, MCP, LiteLLM loop) and returns the final assistant response | Enable streaming, intermediate step logging, and cost tracking per evaluation |
| **`leaderboard.py`** | Utilities to build records and append scores to a HuggingFace dataset for tracking performance over time | Add score breakdowns, visualizations, and automatic regression detection |
| **`run_eval_with_leaderboard.py`** | CLI wrapper that runs `inspect eval`, parses scores from logs, and pushes results to the leaderboard dataset | Support scheduled CI runs, PR-gated benchmarks, and multi-dataset aggregation |
| **`hf_io.py`** | Helper utilities for pushing DataFrames to the HuggingFace Hub (see the push sketch below) | Extend with dataset versioning and diff tracking |
| **`models.py`** | Shared Pydantic models for evaluation data structures | Centralize all eval schemas for consistency across components |
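
The aggregation step in `rubric_eval.py` is compact enough to state directly. Below is a minimal sketch, assuming each rubric carries a weight and the LLM judge returns one binary verdict per criterion; `Rubric` and `rar_explicit_score` are illustrative names, not the module's actual API.

```python
from dataclasses import dataclass

@dataclass
class Rubric:
    criterion: str  # e.g. "Answer states the time complexity of the algorithm"
    weight: float   # importance assigned when the rubric was generated

def rar_explicit_score(rubrics: list[Rubric], verdicts: list[bool]) -> float:
    """RaR-Explicit aggregation: each criterion is judged independently
    (True = satisfied), then combined as a weight-normalized sum in [0, 1]."""
    total_weight = sum(r.weight for r in rubrics)
    if total_weight == 0:
        return 0.0
    met = sum(r.weight for r, v in zip(rubrics, verdicts, strict=True) if v)
    return met / total_weight

# Two of three criteria met: (2.0 + 1.0) / (2.0 + 3.0 + 1.0) = 0.5
score = rar_explicit_score(
    [Rubric("cites the source", 2.0),
     Rubric("states limitations", 3.0),
     Rubric("gives a worked example", 1.0)],
    [True, False, True],
)
```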
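
For orientation, the wiring in `task.py` plausibly follows Inspect AI's standard `@task` pattern. The dataset id, field names, and the `rubric_scorer` import below are placeholders, not the repo's actual definitions.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import FieldSpec, hf_dataset
from inspect_ai.solver import generate

from rubric_eval import rubric_scorer  # stand-in for the scorer in rubric_eval.py

@task
def hf_benchmark_with_rubrics():
    return Task(
        # Hypothetical dataset id and field names -- substitute the real ones.
        dataset=hf_dataset(
            "org/benchmark-dataset",
            split="test",
            sample_fields=FieldSpec(input="question", target="answer"),
        ),
        solver=generate(),  # default; replaced by hf_agent / claude_code at runtime
        scorer=rubric_scorer(),
    )
```

Because the solver is just a parameter of the task, the registry in `solvers.py` can swap in `hf_agent` or `claude_code` without touching the task definition.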
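
Likewise, the Hub pushes in `hf_io.py` and `leaderboard.py` presumably reduce to the standard `datasets` upload call; the `push_df` wrapper, its signature, and the repo id in the usage line are hypothetical.

```python
import pandas as pd
from datasets import Dataset

def push_df(df: pd.DataFrame, repo_id: str, split: str = "train") -> None:
    """Convert a results frame to a datasets.Dataset and upload it to the Hub."""
    Dataset.from_pandas(df, preserve_index=False).push_to_hub(repo_id, split=split)

# e.g. push_df(scores_df, "org/eval-leaderboard")
```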
## Pipeline
```