AniketAsla committed (verified)
Commit fc09821 · Parent(s): 831c432

sync: git 3b14845 (3b14845b0efbdb46f9c41bea72744a197e3d7095)

docs/BLOG.md → BLOG.md RENAMED
@@ -209,21 +209,21 @@ This produces a stable learning curve. The complex eval reward runs separately f
 
 ### Training Signals (WandB + held-out eval)
 
- The GRPO training run tracks both the reward curve and a held-out component-shift summary. Live metrics go to the [WandB project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) (entity `aniketaslaliya-lnmiit` — open the latest run named `grpo-qwen0.5b-env-connected`). **Canonical plots for judges** are always the committed files [reward_curve.svg](reward_curve.svg) and [component_shift.svg](component_shift.svg), backed by [reports/training_summary.json](../reports/training_summary.json).
+ The GRPO training run tracks both the reward curve and a held-out component-shift summary. Live metrics go to the [WandB project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) (entity `aniketaslaliya-lnmiit` — open the latest run named `grpo-qwen0.5b-env-connected`). **Canonical plots for judges** are always the committed files [docs/reward_curve.svg](docs/reward_curve.svg) and [docs/component_shift.svg](docs/component_shift.svg), backed by [reports/training_summary.json](reports/training_summary.json).
 
- ![WandB reward curve - training reward rises as calibration improves](reward_curve.svg)
+ ![WandB reward curve - training reward rises as calibration improves](docs/reward_curve.svg)
 
 ### Component score shift
 
- ![Component score shift before vs after training](component_shift.svg)
+ ![Component score shift before vs after training](docs/component_shift.svg)
 
 This companion plot shows how the held-out validation sweep changes before and after training across fraud detection, decision accuracy, evidence grounding, and calibration.
 
- The script also writes [reports/component_shift_summary.json](../reports/component_shift_summary.json) so the before/after component means are easy to inspect.
+ The script also writes [reports/component_shift_summary.json](reports/component_shift_summary.json) so the before/after component means are easy to inspect.
 
 ### Quantitative results — 5,000-episode GRPO run
 
- All numbers are from committed JSON artifacts: [`reports/training_summary.json`](../reports/training_summary.json), [`reports/component_shift_summary.json`](../reports/component_shift_summary.json).
+ All numbers are from committed JSON artifacts: [`reports/training_summary.json`](reports/training_summary.json), [`reports/component_shift_summary.json`](reports/component_shift_summary.json).
 
 Three headline numbers tell the story:
 
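The renamed blog points readers at committed JSON artifacts rather than the WandB UI. For reviewers who want to check the before/after component means locally, here is a minimal inspection sketch. The field names (`before`, `after`, `final_reward`, and the per-component keys) are assumptions about the artifact layout, not a documented schema; print the raw dicts first and adapt the keys to whatever `reports/component_shift_summary.json` and `reports/training_summary.json` actually contain.

```python
import json
from pathlib import Path

# Hypothetical inspection helper for the committed artifacts referenced in BLOG.md.
# The key names below ("before", "after", "final_reward") are assumptions, not the
# repository's documented schema; inspect the raw JSON and adjust as needed.
reports = Path("reports")
shift = json.loads((reports / "component_shift_summary.json").read_text())
summary = json.loads((reports / "training_summary.json").read_text())

before = shift.get("before", {})  # assumed: {component_name: mean_score, ...}
after = shift.get("after", {})
for component in sorted(set(before) | set(after)):
    b = before.get(component, float("nan"))
    a = after.get(component, float("nan"))
    print(f"{component:>20}: {b:.3f} -> {a:.3f}")

print("final reward (assumed key):", summary.get("final_reward"))
```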
README.md CHANGED
@@ -45,7 +45,7 @@ Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore ev
 | **WandB (all runs)** | https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl |
 | **Trained Model** | https://huggingface.co/AniketAsla/debatefloor-grpo-qwen2.5-0.5b-instruct |
 | **Training Notebook (Colab)** | [train/train_debatefloor.ipynb](https://github.com/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb) |
- | **Mini-Blog** | [docs/BLOG.md](https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/docs/BLOG.md) |
+ | **Mini-Blog** | [BLOG.md](https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/BLOG.md) |
 
 ---
 
@@ -54,7 +54,7 @@ Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore ev
 | Criterion | Weight | Where to find the evidence |
 |---|---|---|
 | **Environment Innovation** | 40% | The 3×2 calibration matrix (`README` §_The Core Innovation_) is a novel reward shape — it does not exist in any prior insurance-RL work and directly attacks the calibration-degradation problem documented in the CAPO paper (April 2026). The **Court Panel** mechanic forces the agent to expose its reasoning to a programmatic adversary, which is also unexplored territory for RL on LLMs. |
- | **Storytelling & Presentation** | 30% | [`docs/BLOG.md`](docs/BLOG.md) — full mini-blog motivating the problem, walking the reader through one episode end-to-end, and showing the training delta in plain language. README is structured for a 3-minute read with the headline number first. |
+ | **Storytelling & Presentation** | 30% | [`BLOG.md`](BLOG.md) — full mini-blog motivating the problem, walking the reader through one episode end-to-end, and showing the training delta in plain language. README is structured for a 3-minute read with the headline number first. |
 | **Showing Improvement in Rewards** | 20% | [`docs/reward_curve.svg`](docs/reward_curve.svg) — 2,500-step reward curve from a 5,000-episode GRPO run (0.130 → 0.469, 3.6×). [`reports/training_summary.json`](reports/training_summary.json) — raw metrics including full log history. [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — before/after on held-out eval (Decision accuracy 0 → 1.0, Calibration 0 → 1.0). WandB run linked above for reproducibility. |
 | **Reward & Training Pipeline** | 10% | [`app/rubrics.py`](app/rubrics.py) — composable rubric (decision × confidence × evidence × format), not monolithic. [`server/calibration_grader.py`](server/calibration_grader.py) — 3×2 calibration matrix. [`train/train_minimal.py`](train/train_minimal.py) — TRL GRPO loop that calls the live HTTP env over `requests.Session` (MR-2 compliant, no static dataset). |
 
@@ -63,7 +63,7 @@ Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore ev
 - [x] Built on **OpenEnv `0.2.3`** (latest at submission time) — see `requirements.txt`
 - [x] Working **TRL training script in a Colab notebook** — [`train/train_debatefloor.ipynb`](train/train_debatefloor.ipynb)
 - [x] **Real reward + loss plots** committed to the repo — [`docs/reward_curve.svg`](docs/reward_curve.svg), [`docs/component_shift.svg`](docs/component_shift.svg)
- - [x] **Mini-blog** at [`docs/BLOG.md`](docs/BLOG.md)
+ - [x] **Mini-blog** at [`BLOG.md`](BLOG.md)
 - [x] **OpenEnv-compliant env hosted on HF Spaces** — https://huggingface.co/spaces/AniketAsla/debatefloor
 - [x] **README** motivates the problem, explains the env, and shows results (this file)
 - [x] **`openenv.yaml`** manifest valid — see repo root
@@ -365,10 +365,11 @@ debateFloor/
 │ ├── train_minimal.py ← Pure TRL GRPOTrainer, T4 in 15 min
 │ └── train_debatefloor.ipynb ← Colab notebook (dynamic wrapper)
 
+ ├── BLOG.md ← writeup (root for visibility)
+
 ├── docs/
 │ ├── reward_curve.svg ← training reward curve (embedded above)
- │ ├── component_shift.svg ← before/after component scores (embedded above)
- │ └── BLOG.md ← writeup
+ │ └── component_shift.svg ← before/after component scores (embedded above)
 
 └── reports/
     ├── training_summary.json
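The "Reward & Training Pipeline" row above describes `train/train_minimal.py` as a TRL GRPO loop that grades completions against the live HTTP env over `requests.Session` instead of a static dataset. A minimal sketch of what such an env-connected reward hook could look like is below. The Space base URL, the `/grade` endpoint, the request payload, and the `reward` response key are illustrative assumptions rather than the environment's documented API, and the TRL wiring is shown only in comments.

```python
import requests

# Sketch of an env-connected reward function in the spirit of train/train_minimal.py.
# ENV_URL, the "/grade" endpoint, the payload fields, and the "reward" response key
# are assumptions for illustration; check the Space's actual API before using this.
ENV_URL = "https://aniketasla-debatefloor.hf.space"
session = requests.Session()  # reuse one connection across the whole GRPO run

def env_reward(completions, **kwargs):
    """Return one scalar reward per completion by asking the hosted env to grade it."""
    rewards = []
    for completion in completions:
        # TRL passes plain strings for standard-format datasets; conversational
        # datasets pass lists of message dicts instead and would need unwrapping here.
        resp = session.post(f"{ENV_URL}/grade", json={"response": completion}, timeout=60)
        resp.raise_for_status()
        rewards.append(float(resp.json().get("reward", 0.0)))
    return rewards

# The TRL side would then look roughly like:
#   from trl import GRPOConfig, GRPOTrainer
#   trainer = GRPOTrainer(
#       model="Qwen/Qwen2.5-0.5B-Instruct",
#       reward_funcs=env_reward,
#       args=GRPOConfig(output_dir="grpo-debatefloor"),
#       train_dataset=prompt_dataset,  # dataset with a "prompt" column
#   )
#   trainer.train()
```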
docs/VideoScript_ClaimCourt.md CHANGED
@@ -139,7 +139,7 @@ Click **Run Episode** once on **clean claim** so the audience sees the flow star
 
 - **Try it:** https://huggingface.co/spaces/AniketAsla/debatefloor
 - **Code:** https://github.com/AniketAslaliya/debateFloor
- - **Mini-blog (markdown):** https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/docs/HFBlogPost.md
+ - **Mini-blog (markdown):** https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/BLOG.md
 - **Weights & Biases (all training runs):** https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl
 
 ---