AniketAsla committed
Commit e47b2da · verified · Parent: fc09821

sync: git a42105a (a42105a80032b12e56f1f5d042a5bc97a243067d)

.gitattributes ADDED
@@ -0,0 +1 @@
+ docs/reward_curve.png filter=lfs diff=lfs merge=lfs -text
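
For context, this is the attribute line that `git lfs track` writes: the `filter`/`diff`/`merge` settings route the PNG through the LFS drivers, and `-text` disables text normalization for it. A minimal sketch of how to tell whether a checked-out copy is the real binary or an un-smudged LFS pointer (the path is just this repo's example; the spec line is the standard LFS pointer header):

```python
from pathlib import Path

# Git LFS pointer files are tiny text stubs that begin with this spec line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(path: Path) -> bool:
    """Return True if `path` holds an un-smudged LFS pointer, not the real file."""
    with path.open("rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX

print(is_lfs_pointer(Path("docs/reward_curve.png")))
```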
BLOG.md CHANGED
@@ -209,13 +209,13 @@ This produces a stable learning curve. The complex eval reward runs separately f
 
 ### Training Signals (WandB + held-out eval)
 
- The GRPO training run tracks both the reward curve and a held-out component-shift summary. Live metrics go to the [WandB project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) (entity `aniketaslaliya-lnmiit` — open the latest run named `grpo-qwen0.5b-env-connected`). **Canonical plots for judges** are always the committed files [docs/reward_curve.svg](docs/reward_curve.svg) and [docs/component_shift.svg](docs/component_shift.svg), backed by [reports/training_summary.json](reports/training_summary.json).
+ The GRPO training run tracks both the reward curve and a held-out component-shift summary. Live metrics go to the [WandB project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) (entity `aniketaslaliya-lnmiit` — open the latest run named `grpo-qwen0.5b-env-connected`). **Canonical plots for judges** are always the committed files [docs/reward_curve.png](docs/reward_curve.png) and [docs/component_shift.png](docs/component_shift.png), backed by [reports/training_summary.json](reports/training_summary.json).
 
- ![WandB reward curve - training reward rises as calibration improves](docs/reward_curve.svg)
+ ![WandB reward curve - training reward rises as calibration improves](docs/reward_curve.png)
 
 ### Component score shift
 
- ![Component score shift before vs after training](docs/component_shift.svg)
+ ![Component score shift before vs after training](docs/component_shift.png)
 
 This companion plot shows how the held-out validation sweep changes before and after training across fraud detection, decision accuracy, evidence grounding, and calibration.
 
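For anyone re-deriving the headline numbers from the committed artifact instead of the plot, here is a minimal sketch; it assumes the same `log_history` schema that `tools/regenerate_plots.py` (added in this commit) consumes, where each row may carry the reward under `reward` or `rewards/mean`:

```python
import json
from pathlib import Path

summary = json.loads(Path("reports/training_summary.json").read_text(encoding="utf-8"))

# Pull the mean-reward scalars in logged order, tolerating either key name.
rewards = [
    row["reward"] if "reward" in row else row["rewards/mean"]
    for row in summary.get("log_history", [])
    if "reward" in row or "rewards/mean" in row
]

first, last = rewards[0], rewards[-1]
print(f"reward {first:.3f} -> {last:.3f} ({last / first:.1f}x)")  # expect ~0.130 -> 0.469, 3.6x
```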
README.md CHANGED
@@ -55,14 +55,14 @@ Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore ev
 |---|---|---|
 | **Environment Innovation** | 40% | The 3×2 calibration matrix (`README` §_The Core Innovation_) is a novel reward shape — it does not exist in any prior insurance-RL work and directly attacks the calibration-degradation problem documented in the CAPO paper (April 2026). The **Court Panel** mechanic forces the agent to expose its reasoning to a programmatic adversary, which is also unexplored territory for RL on LLMs. |
 | **Storytelling & Presentation** | 30% | [`BLOG.md`](BLOG.md) — full mini-blog motivating the problem, walking the reader through one episode end-to-end, and showing the training delta in plain language. README is structured for a 3-minute read with the headline number first. |
- | **Showing Improvement in Rewards** | 20% | [`docs/reward_curve.svg`](docs/reward_curve.svg) — 2,500-step reward curve from a 5,000-episode GRPO run (0.130 → 0.469, 3.6×). [`reports/training_summary.json`](reports/training_summary.json) — raw metrics including full log history. [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — before/after on held-out eval (Decision accuracy 0 → 1.0, Calibration 0 → 1.0). WandB run linked above for reproducibility. |
+ | **Showing Improvement in Rewards** | 20% | [`docs/reward_curve.png`](docs/reward_curve.png) — 2,500-step reward curve from a 5,000-episode GRPO run (0.130 → 0.469, 3.6×). [`reports/training_summary.json`](reports/training_summary.json) — raw metrics including full log history. [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — before/after on held-out eval (Decision accuracy 0 → 1.0, Calibration 0 → 1.0). WandB run linked above for reproducibility. |
 | **Reward & Training Pipeline** | 10% | [`app/rubrics.py`](app/rubrics.py) — composable rubric (decision × confidence × evidence × format), not monolithic. [`server/calibration_grader.py`](server/calibration_grader.py) — 3×2 calibration matrix. [`train/train_minimal.py`](train/train_minimal.py) — TRL GRPO loop that calls the live HTTP env over `requests.Session` (MR-2 compliant, no static dataset). |
 
 ### Minimum-requirement checklist (for judges)
 
 - [x] Built on **OpenEnv `0.2.3`** (latest at submission time) — see `requirements.txt`
 - [x] Working **TRL training script in a Colab notebook** — [`train/train_debatefloor.ipynb`](train/train_debatefloor.ipynb)
- - [x] **Real reward + loss plots** committed to the repo — [`docs/reward_curve.svg`](docs/reward_curve.svg), [`docs/component_shift.svg`](docs/component_shift.svg)
+ - [x] **Real reward + loss plots** committed to the repo — [`docs/reward_curve.png`](docs/reward_curve.png), [`docs/component_shift.png`](docs/component_shift.png)
 - [x] **Mini-blog** at [`BLOG.md`](BLOG.md)
 - [x] **OpenEnv-compliant env hosted on HF Spaces** — https://huggingface.co/spaces/AniketAsla/debatefloor
 - [x] **README** motivates the problem, explains the env, and shows results (this file)
@@ -87,7 +87,7 @@ hand-edits, no rounded-up estimates. Source:
 | GRPO steps | 2,500 |
 | Batch / Generations | 8 / 8 |
 | Hardware | L4 GPU (HF Jobs), 3 h 3 min |
- | WandB | [Project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) — open the **latest** run named `grpo-qwen0.5b-env-connected` (5K HF Job, Apr 2026). **Canonical curves:** [`docs/reward_curve.svg`](docs/reward_curve.svg) + [`reports/training_summary.json`](reports/training_summary.json) (always match the committed training). |
+ | WandB | [Project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) — open the **latest** run named `grpo-qwen0.5b-env-connected` (5K HF Job, Apr 2026). **Canonical curves:** [`docs/reward_curve.png`](docs/reward_curve.png) + [`reports/training_summary.json`](reports/training_summary.json) (always match the committed training). |
 
 ### Headline result
 
@@ -116,12 +116,12 @@ decision-making.
 
 ### Training Plots
 
- ![Reward Curve](docs/reward_curve.svg)
+ ![Reward Curve](docs/reward_curve.png)
 *Mean training reward across 2,500 GRPO steps (5,000 episodes, 1 epoch).
 Reward climbs from 0.130 to 0.469 — a 3.6× improvement. Source:
 [`reports/training_summary.json`](reports/training_summary.json).*
 
- ![Component Shift](docs/component_shift.svg)
+ ![Component Shift](docs/component_shift.png)
 *Before vs after on held-out eval: Decision accuracy 0 → 1.0,
 Calibration 0 → 1.0, Fraud detection 0 → 0.33. Source:
 [`reports/component_shift_summary.json`](reports/component_shift_summary.json).*
@@ -335,8 +335,8 @@ PYTHONPATH=. python train/train_minimal.py
 Or open the Colab notebook: [train/train_debatefloor.ipynb](https://github.com/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb)
 
 Artifacts generated after training:
- - `docs/reward_curve.svg`
- - `docs/component_shift.svg`
+ - `docs/reward_curve.png`
+ - `docs/component_shift.png`
 - `reports/training_summary.json`
 
 ---
@@ -368,8 +368,8 @@ debateFloor/
 ├── BLOG.md ← writeup (root for visibility)
 
 ├── docs/
- │ ├── reward_curve.svg ← training reward curve (embedded above)
- │ └── component_shift.svg ← before/after component scores (embedded above)
+ │ ├── reward_curve.png ← training reward curve (embedded above)
+ │ └── component_shift.png ← before/after component scores (embedded above)
 
 └── reports/
 ├── training_summary.json
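
The `train/train_minimal.py` rubric row describes the one piece of the pipeline not shown in this commit: a GRPO reward hook that scores completions against the live HTTP env rather than a static dataset. A hedged sketch of that pattern follows; the `/step` route, payload shape, `reward` response field, and direct-Space URL are illustrative assumptions, not the env's actual API (the README links only the Space page):

```python
import requests

ENV_URL = "https://aniketasla-debatefloor.hf.space"  # assumed direct-Space URL
session = requests.Session()  # one pooled connection reused across the run

def env_reward_func(completions: list[str], **kwargs) -> list[float]:
    """TRL-style GRPO reward function: score each completion via the live env.

    NOTE: the endpoint path and JSON schema below are hypothetical
    placeholders; the real routes live in the env server.
    """
    rewards: list[float] = []
    for text in completions:
        resp = session.post(f"{ENV_URL}/step", json={"action": text}, timeout=30)
        resp.raise_for_status()
        rewards.append(float(resp.json()["reward"]))
    return rewards

# Would be wired into TRL as: GRPOTrainer(..., reward_funcs=env_reward_func)
```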
docs/VideoScript_ClaimCourt.md CHANGED
@@ -110,7 +110,7 @@ Click **Run Episode** once on **clean claim** so the audience sees the flow star
 
 ## ACT 3 — “Yes, we actually trained it” (~20 s) — keep light
 
- **Visual:** Quick montage: **WandB** project page (reward climbing) **or** `docs/reward_curve.svg` in GitHub; optional 1 clip of HF Jobs log line.
+ **Visual:** Quick montage: **WandB** project page (reward climbing) **or** `docs/reward_curve.png` in GitHub; optional 1 clip of HF Jobs log line.
 
 **Say (one breath, then slow on numbers):**
 
docs/component_shift.png ADDED
docs/component_shift.svg CHANGED
docs/reward_curve.png ADDED

Git LFS Details

  • SHA256: 94f22adbf956ff3f153e83f977808341b9ae8180ef4ad586eb09e78735a55333
  • Pointer size: 131 Bytes
  • Size of remote file: 221 kB
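
Because the PNG lives behind that pointer, a quick integrity check after checkout is to hash the local file against the SHA256 recorded above — a minimal sketch:

```python
import hashlib
from pathlib import Path

# SHA256 from the LFS pointer details above.
EXPECTED = "94f22adbf956ff3f153e83f977808341b9ae8180ef4ad586eb09e78735a55333"

digest = hashlib.sha256(Path("docs/reward_curve.png").read_bytes()).hexdigest()
print("OK" if digest == EXPECTED else f"MISMATCH: {digest}")
```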
docs/reward_curve.svg CHANGED
tools/regenerate_plots.py ADDED
@@ -0,0 +1,154 @@
+ """Regenerate canonical plots as PNG (and SVG) from committed JSON artifacts.
+
+ Reads:  reports/training_summary.json
+         reports/component_shift_summary.json
+ Writes: docs/reward_curve.png (+ docs/reward_curve.svg)
+         docs/component_shift.png (+ docs/component_shift.svg)
+
+ Run: python tools/regenerate_plots.py
+ """
+ from __future__ import annotations
+
+ import json
+ from pathlib import Path
+
+ import matplotlib.pyplot as plt
+
+ ROOT = Path(__file__).resolve().parent.parent
+ TRAIN_JSON = ROOT / "reports" / "training_summary.json"
+ COMP_JSON = ROOT / "reports" / "component_shift_summary.json"
+ REWARD_PNG = ROOT / "docs" / "reward_curve.png"
+ REWARD_SVG = ROOT / "docs" / "reward_curve.svg"
+ COMP_PNG = ROOT / "docs" / "component_shift.png"
+ COMP_SVG = ROOT / "docs" / "component_shift.svg"
+
+ _LABEL_ORDER = [
+     "Fraud detection",
+     "Decision accuracy",
+     "Evidence quality",
+     "Calibration",
+     "Reasoning quality",
+ ]
+
+
+ def regenerate_reward_curve() -> None:
+     summary = json.loads(TRAIN_JSON.read_text(encoding="utf-8"))
+     log_history = summary.get("log_history", []) or []
+
+     reward_steps, rewards, loss_steps, losses = [], [], [], []
+     for row in log_history:
+         step = row.get("step")
+         if step is None:
+             continue
+         if "loss" in row:
+             loss_steps.append(step)
+             losses.append(row["loss"])
+         rv = row.get("reward") or row.get("rewards/mean")
+         if rv is not None:
+             reward_steps.append(step)
+             rewards.append(rv)
+
+     if not (loss_steps or reward_steps):
+         print("[WARN] no log_history rows; skipping reward curve")
+         return
+
+     fig, ax1 = plt.subplots(figsize=(10, 5.5))
+     if losses:
+         ax1.plot(loss_steps, losses, color="#26547c", linewidth=2, label="Training loss")
+         ax1.set_ylabel("Loss", color="#26547c")
+         ax1.tick_params(axis="y", labelcolor="#26547c")
+     ax1.set_xlabel("Training step")
+     ax1.grid(True, alpha=0.25)
+
+     if rewards:
+         ax2 = ax1.twinx()
+         ax2.plot(
+             reward_steps,
+             rewards,
+             color="#06a77d",
+             linewidth=2,
+             label="Mean reward (training scalar)",
+         )
+         ax2.set_ylabel(
+             "Mean reward (training scalar — unbounded)", color="#06a77d"
+         )
+         ax2.tick_params(axis="y", labelcolor="#06a77d")
+         ax2.annotate(
+             "Note: training scalar is unbounded.\nSee eval table for [0,1] clamped scores.",
+             xy=(0.02, 0.05),
+             xycoords="axes fraction",
+             fontsize=9,
+             color="gray",
+         )
+
+     fig.suptitle(
+         "ClaimCourt GRPO Training Progress (training scalar — not eval score)"
+     )
+     fig.tight_layout()
+     fig.savefig(REWARD_PNG, dpi=180, format="png")
+     fig.savefig(REWARD_SVG, dpi=180, format="svg")
+     plt.close(fig)
+     print(f"[OK] {REWARD_PNG}")
+     print(f"[OK] {REWARD_SVG}")
+
+
+ def regenerate_component_shift() -> None:
+     payload = json.loads(COMP_JSON.read_text(encoding="utf-8"))
+     before = payload.get("before") or {}
+     after = payload.get("after") or {}
+     if not (before and after):
+         print("[WARN] component_shift_summary.json missing before/after; skipping")
+         return
+
+     labels = [lbl for lbl in _LABEL_ORDER if lbl in before or lbl in after]
+     before_values = [before.get(lbl, 0.0) for lbl in labels]
+     after_values = [after.get(lbl, 0.0) for lbl in labels]
+     x = list(range(len(labels)))
+     width = 0.35
+
+     fig, ax = plt.subplots(figsize=(10, 5.5))
+     ax.bar(
+         [i - width / 2 for i in x],
+         before_values,
+         width,
+         label="Before training",
+         color="#7a869a",
+     )
+     ax.bar(
+         [i + width / 2 for i in x],
+         after_values,
+         width,
+         label="After training",
+         color="#06a77d",
+     )
+     ax.set_xticks(x)
+     ax.set_xticklabels(labels)
+     ax.set_ylim(-0.1, 1.1)
+     ax.set_ylabel("Component score (eval reward — clamped to [0, 1])")
+     ax.set_xlabel("Reward component")
+     ax.set_title(
+         "ClaimCourt: component-score shift before vs after GRPO training (n=6 held-out)"
+     )
+     ax.grid(True, axis="y", alpha=0.25)
+     ax.legend(frameon=False)
+
+     for i, v in enumerate(before_values):
+         ax.text(i - width / 2, v + 0.02, f"{v:.2f}", ha="center", fontsize=9, color="#7a869a")
+     for i, v in enumerate(after_values):
+         ax.text(i + width / 2, v + 0.02, f"{v:.2f}", ha="center", fontsize=9, color="#06a77d")
+
+     fig.tight_layout()
+     fig.savefig(COMP_PNG, dpi=180, format="png")
+     fig.savefig(COMP_SVG, dpi=180, format="svg")
+     plt.close(fig)
+     print(f"[OK] {COMP_PNG}")
+     print(f"[OK] {COMP_SVG}")
+
+
+ def main() -> None:
+     regenerate_reward_curve()
+     regenerate_component_shift()
+
+
+ if __name__ == "__main__":
+     main()
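
A usage sketch for the script above: `regenerate_component_shift()` expects a top-level `before`/`after` pair of maps keyed by the `_LABEL_ORDER` names. The payload below only mirrors the held-out numbers quoted in the README and illustrates the schema; it is not the committed artifact:

```python
import json
from pathlib import Path

# Hypothetical minimal payload in the shape regenerate_component_shift() reads.
example = {
    "before": {"Fraud detection": 0.0, "Decision accuracy": 0.0, "Calibration": 0.0},
    "after": {"Fraud detection": 0.33, "Decision accuracy": 1.0, "Calibration": 1.0},
}
Path("/tmp/component_shift_summary.json").write_text(
    json.dumps(example, indent=2), encoding="utf-8"
)
# Drop a file like this into reports/ (where COMP_JSON points) and run:
#     python tools/regenerate_plots.py
```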