Commit 1a8f5b2
Parent(s): fe4d128
link fix
eval/README.md +3 -5
eval/README.md
CHANGED
@@ -1,11 +1,11 @@
 # HF-Agent Eval
 
-Rubric-based evaluation pipeline implementing [Rubrics as Rewards](https://arxiv.org/abs/
+Rubric-based evaluation pipeline implementing [Rubrics as Rewards](https://arxiv.org/abs/2507.17746) paper (RaR-Explicit formula).
 
 ## Pipeline
 
 ```
-QA pairs → generate_rubrics.py → `eval/task.py@hf-benchmark-with-rubrics` → scores
+QA pairs → generate_rubrics.py → run `inspect-ai eval eval/task.py@hf-benchmark-with-rubrics` → scores
 ```
 
 ### 1. Generate Rubrics (if not already generated)
@@ -27,9 +27,7 @@ python eval/generate_rubrics.py \
 
 **Output:** 7-20 weighted criteria per question (Essential: +5, Important: +3-4, Optional: +1-2, Pitfall: -1 to -2)
 
-### 2.
-
-Load your rubric dataset, run a solver, and score with `rubric_scorer` using `inspect-ai`.
+### 2. Response evaluation
 
 Files:
 - `eval/hf_agent_connector.py` contains a lightweight bridge that spins up
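The README text in this diff describes weighted criteria (Essential: +5, Important: +3-4, Optional: +1-2, Pitfall: -1 to -2) and refers to a "RaR-Explicit formula". One plausible reading is a normalized weighted sum of judge verdicts. The sketch below illustrates that reading only; it is not taken from `eval/`. The `Criterion` class, the `explicit_score` function, the convention that a "satisfied" pitfall means the pitfall was triggered, and the clamping to [0, 1] are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    """Hypothetical rubric item; not the schema used by generate_rubrics.py."""
    description: str
    weight: int       # e.g. +5 Essential, +3 Important, +1 Optional, -2 Pitfall
    satisfied: bool   # judge verdict; for a pitfall, True means it was triggered


def explicit_score(criteria: List[Criterion]) -> float:
    """Normalized weighted sum of satisfied criteria (assumed aggregation).

    Earned weight is divided by the total achievable positive weight,
    then clamped to [0, 1] so triggered pitfalls cannot push it below zero.
    """
    positive_total = sum(c.weight for c in criteria if c.weight > 0)
    if positive_total == 0:
        return 0.0
    earned = sum(c.weight for c in criteria if c.satisfied)
    return max(0.0, min(1.0, earned / positive_total))


example = [
    Criterion("States the correct library call", 5, True),    # Essential
    Criterion("Explains the main parameter", 3, True),        # Important
    Criterion("Mentions an alternative approach", 2, False),  # Optional
    Criterion("Recommends a deprecated API", -2, True),       # Pitfall, triggered
]

score = explicit_score(example)  # (5 + 3 - 2) / (5 + 3 + 2)
print(f"{score:.2f}")
```

Normalizing by the positive weight only keeps a perfect answer at 1.0 regardless of how many pitfalls a rubric lists; other conventions are possible, and the scorer in the repository may differ.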