deploy: update reports/eval_report.md
reports/eval_report.md CHANGED (+16 -0)
```diff
@@ -36,3 +36,19 @@ Distinct variant_ids: [0, 1, 2, 3, 4]
 
 Average Reward: 0.8092
 Completion Rate: 100.00%
+
+> **Note on identical rewards within a task.** Each task above shows the same
+> reward across all 5 seeds because this run is a *scripted baseline* — the
+> evaluation client follows a fixed strategy per `task_id` (e.g. always call
+> `validate_document` then `flag_fraud_signal` then `deny_claim` for
+> `contradictory_claim`). The seeds vary the generated documents (different
+> claimants, amounts, fraud-signal strengths — see the `Variant` column), but
+> the scripted strategy is invariant to that variation, so the env returns the
+> same scalar reward every time. This is intentional: the scripted baseline
+> exists to demonstrate that the environment's reward surface is deterministic
+> and reproducible across seeds, *not* to show learning.
+>
+> The trained GRPO model produces variable rewards across seeds — see the
+> held-out component-shift evaluation in
+> [`reports/component_shift_summary.json`](component_shift_summary.json) and
+> the README "Held-out evaluation" section.
```
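To make the note concrete, here is a minimal sketch of what a fixed per-`task_id` strategy could look like. Everything in it is an assumption for illustration: the `SCRIPTS` table, the `run_scripted_episode` helper, and the `env.reset`/`env.step` signatures are hypothetical, not the repo's actual evaluation client. Only the three tool names and the `contradictory_claim` task come from the note itself.

```python
# Hypothetical sketch of the scripted baseline described in the note above;
# the env API (reset/step) and the SCRIPTS table are illustrative assumptions.

# Fixed tool-call sequence per task_id. The contradictory_claim entry is the
# example named in the note; other tasks would carry their own fixed scripts.
SCRIPTS = {
    "contradictory_claim": [
        "validate_document",
        "flag_fraud_signal",
        "deny_claim",
    ],
}


def run_scripted_episode(env, task_id: str, seed: int) -> float:
    """Replay the fixed script for one task and return the scalar reward.

    The seed changes the generated documents (claimants, amounts,
    fraud-signal strengths), but the action sequence below never reads the
    observation, so the reward is identical across seeds of the same task.
    """
    obs = env.reset(task_id=task_id, seed=seed)  # hypothetical signature
    reward, done = 0.0, False
    for action in SCRIPTS[task_id]:
        if done:
            break
        obs, reward, done = env.step(action)  # hypothetical signature
    return reward
```

Running such a loop for every seed of a task would reproduce the constant per-task rewards reported above, which is exactly the determinism the note claims.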