Don Rishabh committed on
Commit 433bfad · 1 Parent(s): 9867aa7

README: add Scorers section (21 scorers grouped by family)

Files changed (1)
  1. README.md +18 -0
README.md CHANGED
@@ -86,6 +86,24 @@ Multi-turn (`turn_limit > 1`) splits the test pool into a 2-example feedback sli
 
  ---
 
+ ## Scorers
+
+ Each task picks one of 21 scorers from [`server/scorer.py`](./server/scorer.py). The scorer takes the target's output and the task's expected output and returns a value in `[0, 1]`. The per-task score is the mean across held-out test examples.
+
+ | Family | Scorers | What they check |
+ |---|---|---|
+ | **Exact / membership** | `exact_label`, `contains_label`, `contains_all_substrings`, `uppercase_match` | Closed-vocabulary classifiers; "must include these substrings"; case-strict rewrites |
+ | **Numeric** | `numeric_match`, `word_count_exact` | Last numeric token within tolerance; word count exactly N |
+ | **JSON / YAML** | `json_contains_fields`, `valid_json_object`, `json_key_order`, `valid_yaml_depth` | Structural extraction; required keys/values; key ordering; YAML nesting depth |
+ | **Format-strict** | `three_bullets`, `acrostic_match`, `avoid_letter`, `ends_question`, `terminal_output_pattern` | Exactly 3 bullets; first letters spell a word; output avoids a letter; ends with `?`; terminal-session shape |
+ | **Multi-step / language** | `stepwise_math`, `translation_match`, `selective_translate` | Numbered steps + numeric answer; token-F1 vs reference; partial-translation rules |
+ | **Safety** | `refusal_score` | Detects whether the output is a refusal (matches expected refuse / comply label) |
+ | **LLM judge** (Qwen3-8B 8-bit) | `judge_criteria`, `judge_vs_expected` | Free-form persona / reasoning / Yoda-syntax tasks where regex can't grade. Judge returns a score on the first line; deterministic decoding. |
+
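To make the scorer contract concrete, here is a minimal sketch of how two of the format-strict checks and the per-task mean might look. It is an illustration only, not code from `server/scorer.py`: the function names, the `(output, expected)` signature, and the assumption that `expected` carries the banned letter for the avoid-letter check are all hypothetical.

```python
from statistics import mean
from typing import Callable, List

# Assumed signature: (target_output, expected_output) -> score in [0, 1].
Scorer = Callable[[str, str], float]


def ends_question_sketch(output: str, expected: str) -> float:
    """Return 1.0 if the stripped output ends with '?', else 0.0.

    `expected` is unused here; it is kept so all scorers share one signature.
    """
    return 1.0 if output.strip().endswith("?") else 0.0


def avoid_letter_sketch(output: str, expected: str) -> float:
    """Return 1.0 if the output never uses the banned letter.

    Assumption: `expected` holds the banned letter for this task.
    """
    banned = expected.strip().lower()
    if banned and banned in output.lower():
        return 0.0
    return 1.0


def task_score(scorer: Scorer, outputs: List[str], expected: List[str]) -> float:
    """Per-task score: mean scorer value across held-out test examples."""
    return mean(scorer(o, e) for o, e in zip(outputs, expected))


if __name__ == "__main__":
    print(task_score(ends_question_sketch, ["Is it?", "No."], ["", ""]))
```

Binary checks like these return only 0.0 or 1.0, while a graded scorer such as token-F1 would return intermediate values; the per-task mean treats both the same way.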
+ The scorer is fixed per task and never shown to the agent; the agent has to infer what gets graded from the train examples and the task description. New scorers are added to the `SCORERS` registry at the bottom of `scorer.py`.
+
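The registry step could look roughly like the sketch below. The actual layout of `scorer.py` is not shown here, so the dict shape, the `score` lookup helper, and the `exact_label_sketch` function are assumptions made for illustration.

```python
from typing import Callable, Dict

# Assumed signature: (target_output, expected_output) -> score in [0, 1].
Scorer = Callable[[str, str], float]


def exact_label_sketch(output: str, expected: str) -> float:
    """Hypothetical scorer: case-insensitive exact match on a label."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


# Assumed registry shape: scorer name -> scorer function.
SCORERS: Dict[str, Scorer] = {
    "exact_label_sketch": exact_label_sketch,
}


def score(name: str, output: str, expected: str) -> float:
    """Look up the task's scorer by name and clamp its result into [0, 1]."""
    value = SCORERS[name](output, expected)
    return min(1.0, max(0.0, value))


if __name__ == "__main__":
    print(score("exact_label_sketch", " Positive ", "positive"))
```

Under this assumed layout, adding a scorer means defining one function and adding one entry to the dict; an unknown name fails with a `KeyError` at lookup rather than being scored silently.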
+ ---
+
  ## Variants
 
  ### Same-family control: Qwen→Qwen