akseljoonas (HF Staff) committed
Commit f00b1a6 · 1 Parent(s): bacafa4
Files changed (1)
  1. eval/README.md +3 -5
eval/README.md CHANGED
@@ -1,11 +1,11 @@
 # HF-Agent Eval
 
-Rubric-based evaluation pipeline implementing [Rubrics as Rewards](https://arxiv.org/abs/2410.13254) (RaR-Explicit).
+Rubric-based evaluation pipeline implementing the [Rubrics as Rewards](https://arxiv.org/abs/2507.17746) paper (RaR-Explicit formula).
 
 ## Pipeline
 
 ```
-QA pairs → generate_rubrics.py → `eval/task.py@hf-benchmark-with-rubrics` → scores
+QA pairs → generate_rubrics.py → run `inspect-ai eval eval/task.py@hf-benchmark-with-rubrics` → scores
 ```
 
 ### 1. Generate Rubrics (if not already generated)
@@ -27,9 +27,7 @@ python eval/generate_rubrics.py \
 
 **Output:** 7-20 weighted criteria per question (Essential: +5, Important: +3-4, Optional: +1-2, Pitfall: -1 to -2)
 
-### 2. Evaluate Responses (Inspect)
-
-Load your rubric dataset, run a solver, and score with `rubric_scorer` using `inspect-ai`.
+### 2. Response evaluation
 
 Files:
 - `eval/hf_agent_connector.py` contains a lightweight bridge that spins up
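For context, the weighted-criteria scheme described in the README diff above (Essential: +5, Important: +3-4, Optional: +1-2, Pitfall: -1 to -2) can be sketched as a simple weighted aggregation. The normalization used here (sum of satisfied weights over the maximum attainable positive weight) is an illustrative assumption for this sketch, not necessarily the exact RaR-Explicit formula from the paper, and `rubric_score` is a hypothetical helper, not part of the repository.

```python
# Illustrative sketch of weighted rubric aggregation.
# Assumption: a response's score is the sum of weights for satisfied
# criteria, normalized by the maximum attainable positive weight.
# This may differ from the paper's exact RaR-Explicit definition.

def rubric_score(criteria):
    """criteria: list of (weight, satisfied) pairs.

    Positive weights reward criteria that are met; negative "pitfall"
    weights penalize the response when the pitfall is present
    (satisfied=True for a pitfall means the response fell into it).
    """
    total = sum(weight for weight, met in criteria if met)
    max_possible = sum(weight for weight, _ in criteria if weight > 0)
    return total / max_possible if max_possible else 0.0

# Example: one Essential (+5) met, one Important (+3) missed,
# one Optional (+1) met, one Pitfall (-2) triggered.
example = [(5, True), (3, False), (1, True), (-2, True)]
print(round(rubric_score(example), 3))  # (5 + 1 - 2) / (5 + 3 + 1) -> 0.444
```

Clamping the result to a [0, 1] range, or normalizing pitfalls separately, are equally plausible design choices; the repository's `rubric_scorer` in `eval/task.py` is the authoritative implementation.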