AniketAsla committed
Commit e47b2da · verified · Parent: fc09821

sync: git a42105a (a42105a80032b12e56f1f5d042a5bc97a243067d)

.gitattributes ADDED
@@ -0,0 +1 @@
+ docs/reward_curve.png filter=lfs diff=lfs merge=lfs -text
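
For context, this is the attribute line that `git lfs track` writes: the `filter`/`diff`/`merge` settings route the PNG through the LFS drivers, and `-text` disables text normalization for it. A minimal sketch of how to tell whether a checked-out copy is the real binary or an un-smudged LFS pointer (the path is just this repo's example; the spec line is the standard LFS pointer header):

```python
from pathlib import Path

# Git LFS pointer files are tiny text stubs that begin with this spec line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(path: Path) -> bool:
    """Return True if `path` holds an un-smudged LFS pointer, not the real file."""
    with path.open("rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX

print(is_lfs_pointer(Path("docs/reward_curve.png")))
```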
BLOG.md CHANGED
@@ -209,13 +209,13 @@ This produces a stable learning curve. The complex eval reward runs separately f
 
 ### Training Signals (WandB + held-out eval)
 
- The GRPO training run tracks both the reward curve and a held-out component-shift summary. Live metrics go to the [WandB project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) (entity `aniketaslaliya-lnmiit` — open the latest run named `grpo-qwen0.5b-env-connected`). **Canonical plots for judges** are always the committed files [docs/reward_curve.svg](docs/reward_curve.svg) and [docs/component_shift.svg](docs/component_shift.svg), backed by [reports/training_summary.json](reports/training_summary.json).
+ The GRPO training run tracks both the reward curve and a held-out component-shift summary. Live metrics go to the [WandB project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) (entity `aniketaslaliya-lnmiit` — open the latest run named `grpo-qwen0.5b-env-connected`). **Canonical plots for judges** are always the committed files [docs/reward_curve.png](docs/reward_curve.png) and [docs/component_shift.png](docs/component_shift.png), backed by [reports/training_summary.json](reports/training_summary.json).
 
- ![WandB reward curve - training reward rises as calibration improves](docs/reward_curve.svg)
+ ![WandB reward curve - training reward rises as calibration improves](docs/reward_curve.png)
 
 ### Component score shift
 
- ![Component score shift before vs after training](docs/component_shift.svg)
+ ![Component score shift before vs after training](docs/component_shift.png)
 
 This companion plot shows how the held-out validation sweep changes before and after training across fraud detection, decision accuracy, evidence grounding, and calibration.
 
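For anyone re-deriving the headline numbers from the committed artifact instead of the plot, here is a minimal sketch; it assumes the same `log_history` schema that `tools/regenerate_plots.py` (added in this commit) consumes, where each row may carry the reward under `reward` or `rewards/mean`:

```python
import json
from pathlib import Path

summary = json.loads(Path("reports/training_summary.json").read_text(encoding="utf-8"))

# Pull the mean-reward scalars in logged order, tolerating either key name.
rewards = [
    row["reward"] if "reward" in row else row["rewards/mean"]
    for row in summary.get("log_history", [])
    if "reward" in row or "rewards/mean" in row
]

first, last = rewards[0], rewards[-1]
print(f"reward {first:.3f} -> {last:.3f} ({last / first:.1f}x)")  # expect ~0.130 -> 0.469, 3.6x
```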
README.md CHANGED
@@ -55,14 +55,14 @@ Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore ev
 |---|---|---|
 | **Environment Innovation** | 40% | The 3×2 calibration matrix (`README` §_The Core Innovation_) is a novel reward shape — it does not exist in any prior insurance-RL work and directly attacks the calibration-degradation problem documented in the CAPO paper (April 2026). The **Court Panel** mechanic forces the agent to expose its reasoning to a programmatic adversary, which is also unexplored territory for RL on LLMs. |
 | **Storytelling & Presentation** | 30% | [`BLOG.md`](BLOG.md) — full mini-blog motivating the problem, walking the reader through one episode end-to-end, and showing the training delta in plain language. README is structured for a 3-minute read with the headline number first. |
- | **Showing Improvement in Rewards** | 20% | [`docs/reward_curve.svg`](docs/reward_curve.svg) — 2,500-step reward curve from a 5,000-episode GRPO run (0.130 → 0.469, 3.6×). [`reports/training_summary.json`](reports/training_summary.json) — raw metrics including full log history. [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — before/after on held-out eval (Decision accuracy 0 → 1.0, Calibration 0 → 1.0). WandB run linked above for reproducibility. |
+ | **Showing Improvement in Rewards** | 20% | [`docs/reward_curve.png`](docs/reward_curve.png) — 2,500-step reward curve from a 5,000-episode GRPO run (0.130 → 0.469, 3.6×). [`reports/training_summary.json`](reports/training_summary.json) — raw metrics including full log history. [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — before/after on held-out eval (Decision accuracy 0 → 1.0, Calibration 0 → 1.0). WandB run linked above for reproducibility. |
 | **Reward & Training Pipeline** | 10% | [`app/rubrics.py`](app/rubrics.py) — composable rubric (decision × confidence × evidence × format), not monolithic. [`server/calibration_grader.py`](server/calibration_grader.py) — 3×2 calibration matrix. [`train/train_minimal.py`](train/train_minimal.py) — TRL GRPO loop that calls the live HTTP env over `requests.Session` (MR-2 compliant, no static dataset). |
 
 ### Minimum-requirement checklist (for judges)
 
 - [x] Built on **OpenEnv `0.2.3`** (latest at submission time) — see `requirements.txt`
 - [x] Working **TRL training script in a Colab notebook** — [`train/train_debatefloor.ipynb`](train/train_debatefloor.ipynb)
- - [x] **Real reward + loss plots** committed to the repo — [`docs/reward_curve.svg`](docs/reward_curve.svg), [`docs/component_shift.svg`](docs/component_shift.svg)
+ - [x] **Real reward + loss plots** committed to the repo — [`docs/reward_curve.png`](docs/reward_curve.png), [`docs/component_shift.png`](docs/component_shift.png)
 - [x] **Mini-blog** at [`BLOG.md`](BLOG.md)
 - [x] **OpenEnv-compliant env hosted on HF Spaces** — https://huggingface.co/spaces/AniketAsla/debatefloor
 - [x] **README** motivates the problem, explains the env, and shows results (this file)
@@ -87,7 +87,7 @@ hand-edits, no rounded-up estimates. Source:
 | GRPO steps | 2,500 |
 | Batch / Generations | 8 / 8 |
 | Hardware | L4 GPU (HF Jobs), 3 h 3 min |
- | WandB | [Project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) — open the **latest** run named `grpo-qwen0.5b-env-connected` (5K HF Job, Apr 2026). **Canonical curves:** [`docs/reward_curve.svg`](docs/reward_curve.svg) + [`reports/training_summary.json`](reports/training_summary.json) (always match the committed training). |
+ | WandB | [Project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) — open the **latest** run named `grpo-qwen0.5b-env-connected` (5K HF Job, Apr 2026). **Canonical curves:** [`docs/reward_curve.png`](docs/reward_curve.png) + [`reports/training_summary.json`](reports/training_summary.json) (always match the committed training). |
 
 ### Headline result
 
@@ -116,12 +116,12 @@ decision-making.
 
 ### Training Plots
 
- ![Reward Curve](docs/reward_curve.svg)
+ ![Reward Curve](docs/reward_curve.png)
 *Mean training reward across 2,500 GRPO steps (5,000 episodes, 1 epoch).
 Reward climbs from 0.130 to 0.469 — a 3.6× improvement. Source:
 [`reports/training_summary.json`](reports/training_summary.json).*
 
- ![Component Shift](docs/component_shift.svg)
+ ![Component Shift](docs/component_shift.png)
 *Before vs after on held-out eval: Decision accuracy 0 → 1.0,
 Calibration 0 → 1.0, Fraud detection 0 → 0.33. Source:
 [`reports/component_shift_summary.json`](reports/component_shift_summary.json).*
@@ -335,8 +335,8 @@ PYTHONPATH=. python train/train_minimal.py
 Or open the Colab notebook: [train/train_debatefloor.ipynb](https://github.com/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb)
 
 Artifacts generated after training:
- - `docs/reward_curve.svg`
- - `docs/component_shift.svg`
+ - `docs/reward_curve.png`
+ - `docs/component_shift.png`
 - `reports/training_summary.json`
 
 ---
@@ -368,8 +368,8 @@ debateFloor/
 ├── BLOG.md ← writeup (root for visibility)
 
 ├── docs/
- │ ├── reward_curve.svg ← training reward curve (embedded above)
- │ └── component_shift.svg ← before/after component scores (embedded above)
+ │ ├── reward_curve.png ← training reward curve (embedded above)
+ │ └── component_shift.png ← before/after component scores (embedded above)
 
 └── reports/
 ├── training_summary.json
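
The `train/train_minimal.py` rubric row describes the one piece of the pipeline not shown in this commit: a GRPO reward hook that scores completions against the live HTTP env rather than a static dataset. A hedged sketch of that pattern follows; the `/step` route, payload shape, `reward` response field, and direct-Space URL are illustrative assumptions, not the env's actual API (the README links only the Space page):

```python
import requests

ENV_URL = "https://aniketasla-debatefloor.hf.space"  # assumed direct-Space URL
session = requests.Session()  # one pooled connection reused across the run

def env_reward_func(completions: list[str], **kwargs) -> list[float]:
    """TRL-style GRPO reward function: score each completion via the live env.

    NOTE: the endpoint path and JSON schema below are hypothetical
    placeholders; the real routes live in the env server.
    """
    rewards: list[float] = []
    for text in completions:
        resp = session.post(f"{ENV_URL}/step", json={"action": text}, timeout=30)
        resp.raise_for_status()
        rewards.append(float(resp.json()["reward"]))
    return rewards

# Would be wired into TRL as: GRPOTrainer(..., reward_funcs=env_reward_func)
```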
docs/VideoScript_ClaimCourt.md CHANGED
@@ -110,7 +110,7 @@ Click **Run Episode** once on **clean claim** so the audience sees the flow star
 
 ## ACT 3 — “Yes, we actually trained it” (~20 s) — keep light
 
- **Visual:** Quick montage: **WandB** project page (reward climbing) **or** `docs/reward_curve.svg` in GitHub; optional 1 clip of HF Jobs log line.
+ **Visual:** Quick montage: **WandB** project page (reward climbing) **or** `docs/reward_curve.png` in GitHub; optional 1 clip of HF Jobs log line.
 
 **Say (one breath, then slow on numbers):**
 
docs/component_shift.png ADDED
docs/component_shift.svg CHANGED
docs/reward_curve.png ADDED

Git LFS Details

  • SHA256: 94f22adbf956ff3f153e83f977808341b9ae8180ef4ad586eb09e78735a55333
  • Pointer size: 131 Bytes
  • Size of remote file: 221 kB
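
Because the PNG lives behind that pointer, a quick integrity check after checkout is to hash the local file against the SHA256 recorded above — a minimal sketch:

```python
import hashlib
from pathlib import Path

# SHA256 from the LFS pointer details above.
EXPECTED = "94f22adbf956ff3f153e83f977808341b9ae8180ef4ad586eb09e78735a55333"

digest = hashlib.sha256(Path("docs/reward_curve.png").read_bytes()).hexdigest()
print("OK" if digest == EXPECTED else f"MISMATCH: {digest}")
```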
docs/reward_curve.svg CHANGED
tools/regenerate_plots.py ADDED
@@ -0,0 +1,154 @@
+ """Regenerate canonical plots as PNG (and SVG) from committed JSON artifacts.
+
+ Reads:  reports/training_summary.json
+         reports/component_shift_summary.json
+ Writes: docs/reward_curve.png (+ docs/reward_curve.svg)
+         docs/component_shift.png (+ docs/component_shift.svg)
+
+ Run: python tools/regenerate_plots.py
+ """
+ from __future__ import annotations
+
+ import json
+ from pathlib import Path
+
+ import matplotlib.pyplot as plt
+
+ ROOT = Path(__file__).resolve().parent.parent
+ TRAIN_JSON = ROOT / "reports" / "training_summary.json"
+ COMP_JSON = ROOT / "reports" / "component_shift_summary.json"
+ REWARD_PNG = ROOT / "docs" / "reward_curve.png"
+ REWARD_SVG = ROOT / "docs" / "reward_curve.svg"
+ COMP_PNG = ROOT / "docs" / "component_shift.png"
+ COMP_SVG = ROOT / "docs" / "component_shift.svg"
+
+ _LABEL_ORDER = [
+     "Fraud detection",
+     "Decision accuracy",
+     "Evidence quality",
+     "Calibration",
+     "Reasoning quality",
+ ]
+
+
+ def regenerate_reward_curve() -> None:
+     summary = json.loads(TRAIN_JSON.read_text(encoding="utf-8"))
+     log_history = summary.get("log_history", []) or []
+
+     reward_steps, rewards, loss_steps, losses = [], [], [], []
+     for row in log_history:
+         step = row.get("step")
+         if step is None:
+             continue
+         if "loss" in row:
+             loss_steps.append(step)
+             losses.append(row["loss"])
+         rv = row.get("reward") or row.get("rewards/mean")
+         if rv is not None:
+             reward_steps.append(step)
+             rewards.append(rv)
+
+     if not (loss_steps or reward_steps):
+         print("[WARN] no log_history rows; skipping reward curve")
+         return
+
+     fig, ax1 = plt.subplots(figsize=(10, 5.5))
+     if losses:
+         ax1.plot(loss_steps, losses, color="#26547c", linewidth=2, label="Training loss")
+         ax1.set_ylabel("Loss", color="#26547c")
+         ax1.tick_params(axis="y", labelcolor="#26547c")
+     ax1.set_xlabel("Training step")
+     ax1.grid(True, alpha=0.25)
+
+     if rewards:
+         ax2 = ax1.twinx()
+         ax2.plot(
+             reward_steps,
+             rewards,
+             color="#06a77d",
+             linewidth=2,
+             label="Mean reward (training scalar)",
+         )
+         ax2.set_ylabel(
+             "Mean reward (training scalar — unbounded)", color="#06a77d"
+         )
+         ax2.tick_params(axis="y", labelcolor="#06a77d")
+         ax2.annotate(
+             "Note: training scalar is unbounded.\nSee eval table for [0,1] clamped scores.",
+             xy=(0.02, 0.05),
+             xycoords="axes fraction",
+             fontsize=9,
+             color="gray",
+         )
+
+     fig.suptitle(
+         "ClaimCourt GRPO Training Progress (training scalar — not eval score)"
+     )
+     fig.tight_layout()
+     fig.savefig(REWARD_PNG, dpi=180, format="png")
+     fig.savefig(REWARD_SVG, dpi=180, format="svg")
+     plt.close(fig)
+     print(f"[OK] {REWARD_PNG}")
+     print(f"[OK] {REWARD_SVG}")
+
+
+ def regenerate_component_shift() -> None:
+     payload = json.loads(COMP_JSON.read_text(encoding="utf-8"))
+     before = payload.get("before") or {}
+     after = payload.get("after") or {}
+     if not (before and after):
+         print("[WARN] component_shift_summary.json missing before/after; skipping")
+         return
+
+     labels = [lbl for lbl in _LABEL_ORDER if lbl in before or lbl in after]
+     before_values = [before.get(lbl, 0.0) for lbl in labels]
+     after_values = [after.get(lbl, 0.0) for lbl in labels]
+     x = list(range(len(labels)))
+     width = 0.35
+
+     fig, ax = plt.subplots(figsize=(10, 5.5))
+     ax.bar(
+         [i - width / 2 for i in x],
+         before_values,
+         width,
+         label="Before training",
+         color="#7a869a",
+     )
+     ax.bar(
+         [i + width / 2 for i in x],
+         after_values,
+         width,
+         label="After training",
+         color="#06a77d",
+     )
+     ax.set_xticks(x)
+     ax.set_xticklabels(labels)
+     ax.set_ylim(-0.1, 1.1)
+     ax.set_ylabel("Component score (eval reward — clamped to [0, 1])")
+     ax.set_xlabel("Reward component")
+     ax.set_title(
+         "ClaimCourt: component-score shift before vs after GRPO training (n=6 held-out)"
+     )
+     ax.grid(True, axis="y", alpha=0.25)
+     ax.legend(frameon=False)
+
+     for i, v in enumerate(before_values):
+         ax.text(i - width / 2, v + 0.02, f"{v:.2f}", ha="center", fontsize=9, color="#7a869a")
+     for i, v in enumerate(after_values):
+         ax.text(i + width / 2, v + 0.02, f"{v:.2f}", ha="center", fontsize=9, color="#06a77d")
+
+     fig.tight_layout()
+     fig.savefig(COMP_PNG, dpi=180, format="png")
+     fig.savefig(COMP_SVG, dpi=180, format="svg")
+     plt.close(fig)
+     print(f"[OK] {COMP_PNG}")
+     print(f"[OK] {COMP_SVG}")
+
+
+ def main() -> None:
+     regenerate_reward_curve()
+     regenerate_component_shift()
+
+
+ if __name__ == "__main__":
+     main()
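
A usage sketch for the script above: `regenerate_component_shift()` expects a top-level `before`/`after` pair of maps keyed by the `_LABEL_ORDER` names. The payload below only mirrors the held-out numbers quoted in the README and illustrates the schema; it is not the committed artifact:

```python
import json
from pathlib import Path

# Hypothetical minimal payload in the shape regenerate_component_shift() reads.
example = {
    "before": {"Fraud detection": 0.0, "Decision accuracy": 0.0, "Calibration": 0.0},
    "after": {"Fraud detection": 0.33, "Decision accuracy": 1.0, "Calibration": 1.0},
}
Path("/tmp/component_shift_summary.json").write_text(
    json.dumps(example, indent=2), encoding="utf-8"
)
# Drop a file like this into reports/ (where COMP_JSON points) and run:
#     python tools/regenerate_plots.py
```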