Don Rishabh committed
Commit 433bfad · 1 parent: 9867aa7
README: add Scorers section (21 scorers grouped by family)
README.md CHANGED
@@ -86,6 +86,24 @@ Multi-turn (`turn_limit > 1`) splits the test pool into a 2-example feedback sli

---

## Scorers

Each task uses one of the 21 scorers in [`server/scorer.py`](./server/scorer.py). A scorer takes the target's output and the task's expected output and returns a value in `[0, 1]`. The per-task score is the mean over the held-out test examples.
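
The scoring contract above can be sketched as follows. This is a minimal illustration, not the code in `server/scorer.py`: the `(output, expected) -> float` signature, the `task_score` helper, and the `exact_label_sketch` stand-in are all assumptions made for the example.

```python
from statistics import fmean
from typing import Callable, Sequence, Tuple

# Assumed scorer signature; the real one in server/scorer.py may differ.
Scorer = Callable[[str, str], float]


def exact_label_sketch(output: str, expected: str) -> float:
    """Hypothetical closed-vocabulary scorer: 1.0 when the trimmed,
    lower-cased output equals the expected label, otherwise 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def task_score(scorer: Scorer, examples: Sequence[Tuple[str, str]]) -> float:
    """Mean scorer value over (target_output, expected_output) pairs."""
    if not examples:
        raise ValueError("need at least one held-out test example")
    scores = []
    for output, expected in examples:
        value = scorer(output, expected)
        # Clamp defensively so every per-example score stays in [0, 1].
        scores.append(min(1.0, max(0.0, value)))
    return fmean(scores)


# Hypothetical held-out test examples: (target output, expected output).
held_out = [
    ("positive", "positive"),
    ("Negative ", "negative"),
    ("neutral", "positive"),
]
print(task_score(exact_label_sketch, held_out))
```

Because every per-example value lies in `[0, 1]`, the per-task mean does too, which keeps scores comparable across tasks that use different scorers.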

| Family | Scorers | What they check |
|---|---|---|
| **Exact / membership** | `exact_label`, `contains_label`, `contains_all_substrings`, `uppercase_match` | Closed-vocabulary classifiers; "must include these substrings"; case-strict rewrites |
| **Numeric** | `numeric_match`, `word_count_exact` | Last numeric token within tolerance; word count exactly N |
| **JSON / YAML** | `json_contains_fields`, `valid_json_object`, `json_key_order`, `valid_yaml_depth` | Structural extraction; required keys/values; key ordering; YAML nesting depth |
| **Format-strict** | `three_bullets`, `acrostic_match`, `avoid_letter`, `ends_question`, `terminal_output_pattern` | Exactly 3 bullets; first letters spell a word; output avoids a letter; ends with `?`; terminal-session shape |
| **Multi-step / language** | `stepwise_math`, `translation_match`, `selective_translate` | Numbered steps + numeric answer; token-F1 vs reference; partial-translation rules |
| **Safety** | `refusal_score` | Detects whether the output is a refusal (matches expected refuse / comply label) |
| **LLM judge** (Qwen3-8B 8-bit) | `judge_criteria`, `judge_vs_expected` | Free-form persona / reasoning / Yoda-syntax tasks where regex can't grade. Judge returns a score on the first line; deterministic decoding. |

The scorer is fixed per task and is never shown to the agent; the agent has to infer what gets graded from the train examples and the task description. New scorers are added to the `SCORERS` registry at the bottom of `scorer.py`.
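
As a rough sketch of what a name-to-function registry of this kind might look like, the example below defines three simplified scorers and registers them in a dict. Only the scorer names and the `SCORERS` identifier come from the README; the function bodies, the `(output, expected)` signature, the tolerance default, and the `score` helper are assumptions for illustration and may not match `scorer.py`.

```python
import re
from typing import Callable, Dict

# Assumed scorer signature; the real one in scorer.py may differ.
Scorer = Callable[[str, str], float]

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def ends_question(output: str, expected: str) -> float:
    """Hypothetical: 1.0 if the output ends with '?', ignoring trailing whitespace."""
    return 1.0 if output.rstrip().endswith("?") else 0.0


def three_bullets(output: str, expected: str) -> float:
    """Hypothetical: 1.0 if exactly three lines start with a bullet marker."""
    bullets = [
        line for line in output.splitlines()
        if line.lstrip().startswith(("- ", "* "))
    ]
    return 1.0 if len(bullets) == 3 else 0.0


def numeric_match(output: str, expected: str, tol: float = 1e-6) -> float:
    """Hypothetical: compare the last numeric token in the output with the
    expected number, within an assumed absolute tolerance."""
    tokens = NUMBER.findall(output)
    if not tokens:
        return 0.0
    return 1.0 if abs(float(tokens[-1]) - float(expected)) <= tol else 0.0


# Illustrative registry: a new scorer becomes available by adding one entry here.
SCORERS: Dict[str, Scorer] = {
    "ends_question": ends_question,
    "three_bullets": three_bullets,
    "numeric_match": numeric_match,
}


def score(name: str, output: str, expected: str) -> float:
    """Look up a scorer by its registry name and apply it."""
    return SCORERS[name](output, expected)


print(score("numeric_match", "First 2 + 2, so the answer is 4", "4"))
```

Keeping the lookup keyed by name means a task definition only has to store a string, and the grading logic stays out of anything the agent sees.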

---

## Variants

### Same-family control: Qwen→Qwen