sync: git 3b14845 (3b14845b0efbdb46f9c41bea72744a197e3d7095)
Files changed:
- docs/BLOG.md → BLOG.md (renamed) +5 -5
- README.md +6 -5
- docs/VideoScript_ClaimCourt.md +1 -1
docs/BLOG.md → BLOG.md (RENAMED)

@@ -209,21 +209,21 @@ This produces a stable learning curve. The complex eval reward runs separately f
 
 ### Training Signals (WandB + held-out eval)
 
-The GRPO training run tracks both the reward curve and a held-out component-shift summary. Live metrics go to the [WandB project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) (entity `aniketaslaliya-lnmiit` — open the latest run named `grpo-qwen0.5b-env-connected`). **Canonical plots for judges** are always the committed files [reward_curve.svg](reward_curve.svg) and [component_shift.svg](component_shift.svg), backed by [reports/training_summary.json](
+The GRPO training run tracks both the reward curve and a held-out component-shift summary. Live metrics go to the [WandB project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) (entity `aniketaslaliya-lnmiit` — open the latest run named `grpo-qwen0.5b-env-connected`). **Canonical plots for judges** are always the committed files [docs/reward_curve.svg](docs/reward_curve.svg) and [docs/component_shift.svg](docs/component_shift.svg), backed by [reports/training_summary.json](reports/training_summary.json).
 
-
+
 
 ### Component score shift
 
-
+
 
 This companion plot shows how the held-out validation sweep changes before and after training across fraud detection, decision accuracy, evidence grounding, and calibration.
 
-The script also writes [reports/component_shift_summary.json](
+The script also writes [reports/component_shift_summary.json](reports/component_shift_summary.json) so the before/after component means are easy to inspect.
 
 ### Quantitative results — 5,000-episode GRPO run
 
-All numbers are from committed JSON artifacts: [`reports/training_summary.json`](
+All numbers are from committed JSON artifacts: [`reports/training_summary.json`](reports/training_summary.json), [`reports/component_shift_summary.json`](reports/component_shift_summary.json).
 
 Three headline numbers tell the story:
 
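The BLOG.md change above points readers at `reports/component_shift_summary.json` for the before/after component means on the held-out eval. As a purely hypothetical illustration of how such a file could be inspected (the real JSON schema is not shown in this diff; the keys and the fraud/evidence sample values below are assumptions, while the 0 → 1.0 decision-accuracy and calibration shifts are quoted from the README), a short Python sketch:

```python
# Hypothetical sketch: the actual schema of reports/component_shift_summary.json
# is not visible in this commit, so the "before"/"after" keys and the
# fraud/evidence sample values are illustrative assumptions only.
import json

sample = json.loads("""
{
  "before": {"fraud_detection": 0.40, "decision_accuracy": 0.0,
             "evidence_grounding": 0.55, "calibration": 0.0},
  "after":  {"fraud_detection": 0.72, "decision_accuracy": 1.0,
             "evidence_grounding": 0.61, "calibration": 1.0}
}
""")

def component_deltas(summary: dict) -> dict:
    """Return the after-minus-before shift for each held-out component."""
    return {k: round(summary["after"][k] - summary["before"][k], 3)
            for k in summary["before"]}

print(component_deltas(sample))
```

The delta view makes the headline shifts (decision accuracy and calibration moving 0 → 1.0) immediately visible without opening the plots.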
README.md (CHANGED)

@@ -45,7 +45,7 @@ Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore ev
 | **WandB (all runs)** | https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl |
 | **Trained Model** | https://huggingface.co/AniketAsla/debatefloor-grpo-qwen2.5-0.5b-instruct |
 | **Training Notebook (Colab)** | [train/train_debatefloor.ipynb](https://github.com/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb) |
-| **Mini-Blog** | [
+| **Mini-Blog** | [BLOG.md](https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/BLOG.md) |
 
 ---
 

@@ -54,7 +54,7 @@ Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore ev
 | Criterion | Weight | Where to find the evidence |
 |---|---|---|
 | **Environment Innovation** | 40% | The 3×2 calibration matrix (`README` §_The Core Innovation_) is a novel reward shape — it does not exist in any prior insurance-RL work and directly attacks the calibration-degradation problem documented in the CAPO paper (April 2026). The **Court Panel** mechanic forces the agent to expose its reasoning to a programmatic adversary, which is also unexplored territory for RL on LLMs. |
-| **Storytelling & Presentation** | 30% | [`
+| **Storytelling & Presentation** | 30% | [`BLOG.md`](BLOG.md) — full mini-blog motivating the problem, walking the reader through one episode end-to-end, and showing the training delta in plain language. README is structured for a 3-minute read with the headline number first. |
 | **Showing Improvement in Rewards** | 20% | [`docs/reward_curve.svg`](docs/reward_curve.svg) — 2,500-step reward curve from a 5,000-episode GRPO run (0.130 → 0.469, 3.6×). [`reports/training_summary.json`](reports/training_summary.json) — raw metrics including full log history. [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — before/after on held-out eval (Decision accuracy 0 → 1.0, Calibration 0 → 1.0). WandB run linked above for reproducibility. |
 | **Reward & Training Pipeline** | 10% | [`app/rubrics.py`](app/rubrics.py) — composable rubric (decision × confidence × evidence × format), not monolithic. [`server/calibration_grader.py`](server/calibration_grader.py) — 3×2 calibration matrix. [`train/train_minimal.py`](train/train_minimal.py) — TRL GRPO loop that calls the live HTTP env over `requests.Session` (MR-2 compliant, no static dataset). |

@@ -63,7 +63,7 @@ Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore ev
 - [x] Built on **OpenEnv `0.2.3`** (latest at submission time) — see `requirements.txt`
 - [x] Working **TRL training script in a Colab notebook** — [`train/train_debatefloor.ipynb`](train/train_debatefloor.ipynb)
 - [x] **Real reward + loss plots** committed to the repo — [`docs/reward_curve.svg`](docs/reward_curve.svg), [`docs/component_shift.svg`](docs/component_shift.svg)
-- [x] **Mini-blog** at [`
+- [x] **Mini-blog** at [`BLOG.md`](BLOG.md)
 - [x] **OpenEnv-compliant env hosted on HF Spaces** — https://huggingface.co/spaces/AniketAsla/debatefloor
 - [x] **README** motivates the problem, explains the env, and shows results (this file)
 - [x] **`openenv.yaml`** manifest valid — see repo root

@@ -365,10 +365,11 @@ debateFloor/
 │   ├── train_minimal.py        ← Pure TRL GRPOTrainer, T4 in 15 min
 │   └── train_debatefloor.ipynb ← Colab notebook (dynamic wrapper)
 │
+├── BLOG.md                     ← writeup (root for visibility)
+│
 ├── docs/
 │   ├── reward_curve.svg        ← training reward curve (embedded above)
-│
-│   └── BLOG.md                 ← writeup
+│   └── component_shift.svg     ← before/after component scores (embedded above)
 │
 └── reports/
     ├── training_summary.json
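The README's criteria table credits `train/train_minimal.py` with a TRL GRPO loop that scores completions against the live HTTP env over `requests.Session` rather than a static dataset. A minimal sketch of that pattern follows; the base URL, the `/step` endpoint path, and the JSON field names (`prompt`, `action`, `reward`) are all assumptions for illustration, not the Space's real API, which `train/train_minimal.py` defines:

```python
# Illustrative sketch of a GRPOTrainer-style reward function that calls a
# live HTTP environment. Endpoint path, payload shape, and response field
# names are assumptions; train/train_minimal.py holds the real contract.
import requests

ENV_URL = "https://aniketasla-debatefloor.hf.space"  # assumed base URL
session = requests.Session()  # one pooled connection reused across reward calls

def parse_reward(payload: dict) -> float:
    """Extract the scalar reward from an env response (field name assumed)."""
    return float(payload.get("reward", 0.0))

def reward_fn(prompts, completions, **kwargs):
    """Return one float reward per completion, as TRL's GRPOTrainer expects."""
    rewards = []
    for prompt, completion in zip(prompts, completions):
        resp = session.post(
            f"{ENV_URL}/step",  # assumed endpoint name
            json={"prompt": prompt, "action": completion},
            timeout=30,
        )
        resp.raise_for_status()
        rewards.append(parse_reward(resp.json()))
    return rewards
```

Reusing a single `requests.Session` keeps the underlying TCP connection alive across the thousands of reward calls a 5,000-episode run makes, which matters when every rollout is graded by a remote env.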
docs/VideoScript_ClaimCourt.md (CHANGED)

@@ -139,7 +139,7 @@ Click **Run Episode** once on **clean claim** so the audience sees the flow star
 
 - **Try it:** https://huggingface.co/spaces/AniketAsla/debatefloor
 - **Code:** https://github.com/AniketAslaliya/debateFloor
-- **Mini-blog (markdown):** https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/
+- **Mini-blog (markdown):** https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/BLOG.md
 - **Weights & Biases (all training runs):** https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl
 
 ---