sync: git a42105a (a42105a80032b12e56f1f5d042a5bc97a243067d)

Changed files:
- .gitattributes +1 -0
- BLOG.md +3 -3
- README.md +9 -9
- docs/VideoScript_ClaimCourt.md +1 -1
- docs/component_shift.png +0 -0
- docs/component_shift.svg +572 -352
- docs/reward_curve.png +3 -0
- docs/reward_curve.svg +130 -142
- tools/regenerate_plots.py +154 -0
.gitattributes
ADDED
@@ -0,0 +1 @@
+docs/reward_curve.png filter=lfs diff=lfs merge=lfs -text
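
This filter line is exactly what `git lfs track "docs/reward_curve.png"` writes into `.gitattributes` (assuming git-lfs is installed locally), and it is why `docs/reward_curve.png` further down in this commit is stored as a Git LFS object rather than a regular blob.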
BLOG.md
CHANGED
@@ -209,13 +209,13 @@ This produces a stable learning curve. The complex eval reward runs separately f
 
 ### Training Signals (WandB + held-out eval)
 
-The GRPO training run tracks both the reward curve and a held-out component-shift summary. Live metrics go to the [WandB project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) (entity `aniketaslaliya-lnmiit` — open the latest run named `grpo-qwen0.5b-env-connected`). **Canonical plots for judges** are always the committed files [docs/reward_curve.
+The GRPO training run tracks both the reward curve and a held-out component-shift summary. Live metrics go to the [WandB project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) (entity `aniketaslaliya-lnmiit` — open the latest run named `grpo-qwen0.5b-env-connected`). **Canonical plots for judges** are always the committed files [docs/reward_curve.png](docs/reward_curve.png) and [docs/component_shift.png](docs/component_shift.png), backed by [reports/training_summary.json](reports/training_summary.json).
 
-![Reward curve over 2,500 GRPO steps](docs/reward_curve.png)
+![Reward curve over 2,500 GRPO steps](docs/reward_curve.png)
 
 ### Component score shift
 
+![Component score shift before vs after training](docs/component_shift.png)
 
 This companion plot shows how the held-out validation sweep changes before and after training across fraud detection, decision accuracy, evidence grounding, and calibration.
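
For orientation, the entity, project, and run name quoted in that paragraph correspond to a `wandb.init` call along the lines of the sketch below. How `train/train_minimal.py` actually initializes the run (for instance via TRL's `report_to="wandb"`) is not shown in this diff, so treat this as a sketch under assumptions:

```python
import wandb

# Sketch only: entity, project, and run name come from the BLOG.md paragraph
# above; the metric keys are assumptions, with values quoted from the README.
run = wandb.init(
    entity="aniketaslaliya-lnmiit",
    project="debatefloor-insurance-rl",
    name="grpo-qwen0.5b-env-connected",
)
run.log({"reward": 0.469, "step": 2500})
run.finish()
```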
README.md
CHANGED
@@ -55,14 +55,14 @@ Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore ev
 |---|---|---|
 | **Environment Innovation** | 40% | The 3×2 calibration matrix (`README` §_The Core Innovation_) is a novel reward shape — it does not exist in any prior insurance-RL work and directly attacks the calibration-degradation problem documented in the CAPO paper (April 2026). The **Court Panel** mechanic forces the agent to expose its reasoning to a programmatic adversary, which is also unexplored territory for RL on LLMs. |
 | **Storytelling & Presentation** | 30% | [`BLOG.md`](BLOG.md) — full mini-blog motivating the problem, walking the reader through one episode end-to-end, and showing the training delta in plain language. README is structured for a 3-minute read with the headline number first. |
-| **Showing Improvement in Rewards** | 20% | [`docs/reward_curve.
+| **Showing Improvement in Rewards** | 20% | [`docs/reward_curve.png`](docs/reward_curve.png) — 2,500-step reward curve from a 5,000-episode GRPO run (0.130 → 0.469, 3.6×). [`reports/training_summary.json`](reports/training_summary.json) — raw metrics including full log history. [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — before/after on held-out eval (Decision accuracy 0 → 1.0, Calibration 0 → 1.0). WandB run linked above for reproducibility. |
 | **Reward & Training Pipeline** | 10% | [`app/rubrics.py`](app/rubrics.py) — composable rubric (decision × confidence × evidence × format), not monolithic. [`server/calibration_grader.py`](server/calibration_grader.py) — 3×2 calibration matrix. [`train/train_minimal.py`](train/train_minimal.py) — TRL GRPO loop that calls the live HTTP env over `requests.Session` (MR-2 compliant, no static dataset). |
 
 ### Minimum-requirement checklist (for judges)
 
 - [x] Built on **OpenEnv `0.2.3`** (latest at submission time) — see `requirements.txt`
 - [x] Working **TRL training script in a Colab notebook** — [`train/train_debatefloor.ipynb`](train/train_debatefloor.ipynb)
-- [x] **Real reward + loss plots** committed to the repo — [`docs/reward_curve.
+- [x] **Real reward + loss plots** committed to the repo — [`docs/reward_curve.png`](docs/reward_curve.png), [`docs/component_shift.png`](docs/component_shift.png)
 - [x] **Mini-blog** at [`BLOG.md`](BLOG.md)
 - [x] **OpenEnv-compliant env hosted on HF Spaces** — https://huggingface.co/spaces/AniketAsla/debatefloor
 - [x] **README** motivates the problem, explains the env, and shows results (this file)

@@ -87,7 +87,7 @@ hand-edits, no rounded-up estimates. Source:
 | GRPO steps | 2,500 |
 | Batch / Generations | 8 / 8 |
 | Hardware | L4 GPU (HF Jobs), 3 h 3 min |
-| WandB | [Project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) — open the **latest** run named `grpo-qwen0.5b-env-connected` (5K HF Job, Apr 2026). **Canonical curves:** [`docs/reward_curve.
+| WandB | [Project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) — open the **latest** run named `grpo-qwen0.5b-env-connected` (5K HF Job, Apr 2026). **Canonical curves:** [`docs/reward_curve.png`](docs/reward_curve.png) + [`reports/training_summary.json`](reports/training_summary.json) (always match the committed training). |
 
 ### Headline result
 

@@ -116,12 +116,12 @@ decision-making.
 
 ### Training Plots
 
-.
+
 *Mean training reward across 2,500 GRPO steps (5,000 episodes, 1 epoch).
 Reward climbs from 0.130 to 0.469 — a 3.6× improvement. Source:
 [`reports/training_summary.json`](reports/training_summary.json).*
 
-.*
+
 *Before vs after on held-out eval: Decision accuracy 0 → 1.0,
 Calibration 0 → 1.0, Fraud detection 0 → 0.33. Source:
 [`reports/component_shift_summary.json`](reports/component_shift_summary.json).*

@@ -335,8 +335,8 @@ PYTHONPATH=. python train/train_minimal.py
 Or open the Colab notebook: [train/train_debatefloor.ipynb](https://github.com/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb)
 
 Artifacts generated after training:
-- `docs/reward_curve.
-- `docs/component_shift.
+- `docs/reward_curve.png`
+- `docs/component_shift.png`
 - `reports/training_summary.json`
 
 ---

@@ -368,8 +368,8 @@ debateFloor/
 ├── BLOG.md ← writeup (root for visibility)
 │
 ├── docs/
-│   ├── reward_curve.
-│   └── component_shift.
+│   ├── reward_curve.png ← training reward curve (embedded above)
+│   └── component_shift.png ← before/after component scores (embedded above)
 │
 └── reports/
     ├── training_summary.json
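
The "Reward & Training Pipeline" row above describes `train/train_minimal.py` as a TRL GRPO loop that scores completions by calling the live HTTP env over `requests.Session`, with no static dataset (the property the checklist calls MR-2 compliance). A minimal sketch of that pattern, assuming the Space exposes a JSON scoring route; the base URL, route name, and payload shape here are guesses, not the repo's actual API:

```python
import requests

session = requests.Session()
ENV_URL = "https://aniketasla-debatefloor.hf.space"  # assumed base URL for the Space

def env_reward(prompts, completions, **kwargs):
    """TRL-style reward function: score each completion via the hosted env."""
    rewards = []
    for prompt, completion in zip(prompts, completions):
        resp = session.post(
            f"{ENV_URL}/step",  # hypothetical route name
            json={"prompt": prompt, "response": completion},
            timeout=30,
        )
        resp.raise_for_status()
        rewards.append(float(resp.json().get("reward", 0.0)))
    return rewards

# TRL's GRPOTrainer accepts reward functions with this
# (prompts, completions, **kwargs) -> list[float] signature, e.g.:
#   GRPOTrainer(model=..., reward_funcs=[env_reward],
#               args=GRPOConfig(num_generations=8), train_dataset=...)
```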
docs/VideoScript_ClaimCourt.md
CHANGED
@@ -110,7 +110,7 @@ Click **Run Episode** once on **clean claim** so the audience sees the flow star
 
 ## ACT 3 — “Yes, we actually trained it” (~20 s) — keep light
 
-**Visual:** Quick montage: **WandB** project page (reward climbing) **or** `docs/reward_curve.
+**Visual:** Quick montage: **WandB** project page (reward climbing) **or** `docs/reward_curve.png` in GitHub; optional 1 clip of HF Jobs log line.
 
 **Say (one breath, then slow on numbers):**
 
docs/component_shift.png
ADDED

docs/component_shift.svg
CHANGED

docs/reward_curve.png
ADDED (Git LFS)

docs/reward_curve.svg
CHANGED
tools/regenerate_plots.py
ADDED
@@ -0,0 +1,154 @@
"""Regenerate canonical plots as PNG (and SVG) from committed JSON artifacts.

Reads:  reports/training_summary.json
        reports/component_shift_summary.json
Writes: docs/reward_curve.png (+ docs/reward_curve.svg)
        docs/component_shift.png (+ docs/component_shift.svg)

Run: python tools/regenerate_plots.py
"""
from __future__ import annotations

import json
from pathlib import Path

import matplotlib.pyplot as plt

ROOT = Path(__file__).resolve().parent.parent
TRAIN_JSON = ROOT / "reports" / "training_summary.json"
COMP_JSON = ROOT / "reports" / "component_shift_summary.json"
REWARD_PNG = ROOT / "docs" / "reward_curve.png"
REWARD_SVG = ROOT / "docs" / "reward_curve.svg"
COMP_PNG = ROOT / "docs" / "component_shift.png"
COMP_SVG = ROOT / "docs" / "component_shift.svg"

_LABEL_ORDER = [
    "Fraud detection",
    "Decision accuracy",
    "Evidence quality",
    "Calibration",
    "Reasoning quality",
]


def regenerate_reward_curve() -> None:
    summary = json.loads(TRAIN_JSON.read_text(encoding="utf-8"))
    log_history = summary.get("log_history", []) or []

    reward_steps, rewards, loss_steps, losses = [], [], [], []
    for row in log_history:
        step = row.get("step")
        if step is None:
            continue
        if "loss" in row:
            loss_steps.append(step)
            losses.append(row["loss"])
        rv = row.get("reward")
        if rv is None:  # fall back to TRL's aggregate key without dropping 0.0 rewards
            rv = row.get("rewards/mean")
        if rv is not None:
            reward_steps.append(step)
            rewards.append(rv)

    if not (loss_steps or reward_steps):
        print("[WARN] no log_history rows; skipping reward curve")
        return

    fig, ax1 = plt.subplots(figsize=(10, 5.5))
    if losses:
        ax1.plot(loss_steps, losses, color="#26547c", linewidth=2, label="Training loss")
        ax1.set_ylabel("Loss", color="#26547c")
        ax1.tick_params(axis="y", labelcolor="#26547c")
    ax1.set_xlabel("Training step")
    ax1.grid(True, alpha=0.25)

    if rewards:
        ax2 = ax1.twinx()
        ax2.plot(
            reward_steps,
            rewards,
            color="#06a77d",
            linewidth=2,
            label="Mean reward (training scalar)",
        )
        ax2.set_ylabel("Mean reward (training scalar — unbounded)", color="#06a77d")
        ax2.tick_params(axis="y", labelcolor="#06a77d")
        ax2.annotate(
            "Note: training scalar is unbounded.\nSee eval table for [0,1] clamped scores.",
            xy=(0.02, 0.05),
            xycoords="axes fraction",
            fontsize=9,
            color="gray",
        )

    fig.suptitle("ClaimCourt GRPO Training Progress (training scalar — not eval score)")
    fig.tight_layout()
    fig.savefig(REWARD_PNG, dpi=180, format="png")
    fig.savefig(REWARD_SVG, dpi=180, format="svg")
    plt.close(fig)
    print(f"[OK] {REWARD_PNG}")
    print(f"[OK] {REWARD_SVG}")


def regenerate_component_shift() -> None:
    payload = json.loads(COMP_JSON.read_text(encoding="utf-8"))
    before = payload.get("before") or {}
    after = payload.get("after") or {}
    if not (before and after):
        print("[WARN] component_shift_summary.json missing before/after; skipping")
        return

    labels = [lbl for lbl in _LABEL_ORDER if lbl in before or lbl in after]
    before_values = [before.get(lbl, 0.0) for lbl in labels]
    after_values = [after.get(lbl, 0.0) for lbl in labels]
    x = list(range(len(labels)))
    width = 0.35

    fig, ax = plt.subplots(figsize=(10, 5.5))
    ax.bar(
        [i - width / 2 for i in x],
        before_values,
        width,
        label="Before training",
        color="#7a869a",
    )
    ax.bar(
        [i + width / 2 for i in x],
        after_values,
        width,
        label="After training",
        color="#06a77d",
    )
    ax.set_xticks(x)
    ax.set_xticklabels(labels)
    ax.set_ylim(-0.1, 1.1)
    ax.set_ylabel("Component score (eval reward — clamped to [0, 1])")
    ax.set_xlabel("Reward component")
    ax.set_title("ClaimCourt: component-score shift before vs after GRPO training (n=6 held-out)")
    ax.grid(True, axis="y", alpha=0.25)
    ax.legend(frameon=False)

    for i, v in enumerate(before_values):
        ax.text(i - width / 2, v + 0.02, f"{v:.2f}", ha="center", fontsize=9, color="#7a869a")
    for i, v in enumerate(after_values):
        ax.text(i + width / 2, v + 0.02, f"{v:.2f}", ha="center", fontsize=9, color="#06a77d")

    fig.tight_layout()
    fig.savefig(COMP_PNG, dpi=180, format="png")
    fig.savefig(COMP_SVG, dpi=180, format="svg")
    plt.close(fig)
    print(f"[OK] {COMP_PNG}")
    print(f"[OK] {COMP_SVG}")


def main() -> None:
    regenerate_reward_curve()
    regenerate_component_shift()


if __name__ == "__main__":
    main()
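
From the two reader functions, the committed JSON artifacts are expected to look roughly like the sketch below. The key names come from the code above; reward and component values echo the numbers quoted in the README, while the loss values are placeholders:

```python
# Illustrative input shapes for tools/regenerate_plots.py (not the real files).
training_summary = {
    "log_history": [
        {"step": 10, "loss": 0.92, "reward": 0.130},   # loss values are placeholders
        {"step": 2500, "loss": 0.31, "reward": 0.469},
    ]
}
component_shift_summary = {
    "before": {"Fraud detection": 0.0, "Decision accuracy": 0.0, "Calibration": 0.0},
    "after": {"Fraud detection": 0.33, "Decision accuracy": 1.0, "Calibration": 1.0},
}
```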