AniketAsla committed (verified)
Commit 8a5e9ed · Parent: 6b0c0da

deploy: update reports/eval_report.md

Files changed (1): reports/eval_report.md (+16 −0)
reports/eval_report.md CHANGED
@@ -36,3 +36,19 @@ Distinct variant_ids: [0, 1, 2, 3, 4]
 
 Average Reward: 0.8092
 Completion Rate: 100.00%
+
+> **Note on identical rewards within a task.** Each task above shows the same
+> reward across all 5 seeds because this run is a *scripted baseline* — the
+> evaluation client follows a fixed strategy per `task_id` (e.g. always call
+> `validate_document` then `flag_fraud_signal` then `deny_claim` for
+> `contradictory_claim`). The seeds vary the generated documents (different
+> claimants, amounts, fraud-signal strengths — see the `Variant` column), but
+> the scripted strategy is invariant to that variation, so the env returns the
+> same scalar reward every time. This is intentional: the scripted baseline
+> exists to demonstrate that the environment's reward surface is deterministic
+> and reproducible across seeds, *not* to show learning.
+>
+> The trained GRPO model produces variable rewards across seeds — see the
+> held-out component-shift evaluation in
+> [`reports/component_shift_summary.json`](component_shift_summary.json) and
+> the README "Held-out evaluation" section.
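The seed-invariance argument in the note can be sketched in a few lines. This is a minimal illustration of the idea, not the repo's actual evaluation client: the policy table, function names, and task id are assumed/hypothetical, borrowing only the tool names the note itself mentions.

```python
# Hypothetical sketch: a scripted baseline maps each task_id to a fixed
# tool-call sequence. The seed varies the generated documents (claimants,
# amounts, fraud-signal strengths), but the chosen actions never consult
# the seed, so a deterministic env returns the same scalar reward each run.

SCRIPTED_POLICY = {
    "contradictory_claim": [
        "validate_document",
        "flag_fraud_signal",
        "deny_claim",
    ],
}


def run_scripted_episode(task_id: str, seed: int) -> list[str]:
    # `seed` would parameterize document generation in the real env;
    # the scripted strategy deliberately ignores it.
    return SCRIPTED_POLICY[task_id]


# Same action sequence for every seed -> identical reward per task.
actions = {s: run_scripted_episode("contradictory_claim", s) for s in range(5)}
assert all(a == actions[0] for a in actions.values())
```

A learned policy, by contrast, would condition its actions on the seed-varied documents, which is why the trained GRPO model shows reward variance across seeds.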