AniketAsla committed (verified)
Commit fc09821 · Parent(s): 831c432

sync: git 3b14845 (3b14845b0efbdb46f9c41bea72744a197e3d7095)

docs/BLOG.md → BLOG.md RENAMED
@@ -209,21 +209,21 @@ This produces a stable learning curve. The complex eval reward runs separately f
 
 ### Training Signals (WandB + held-out eval)
 
- The GRPO training run tracks both the reward curve and a held-out component-shift summary. Live metrics go to the [WandB project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) (entity `aniketaslaliya-lnmiit` — open the latest run named `grpo-qwen0.5b-env-connected`). **Canonical plots for judges** are always the committed files [reward_curve.svg](reward_curve.svg) and [component_shift.svg](component_shift.svg), backed by [reports/training_summary.json](../reports/training_summary.json).
+ The GRPO training run tracks both the reward curve and a held-out component-shift summary. Live metrics go to the [WandB project workspace](https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl) (entity `aniketaslaliya-lnmiit` — open the latest run named `grpo-qwen0.5b-env-connected`). **Canonical plots for judges** are always the committed files [docs/reward_curve.svg](docs/reward_curve.svg) and [docs/component_shift.svg](docs/component_shift.svg), backed by [reports/training_summary.json](reports/training_summary.json).
 
- ![WandB reward curve - training reward rises as calibration improves](reward_curve.svg)
+ ![WandB reward curve - training reward rises as calibration improves](docs/reward_curve.svg)
 
 ### Component score shift
 
- ![Component score shift before vs after training](component_shift.svg)
+ ![Component score shift before vs after training](docs/component_shift.svg)
 
 This companion plot shows how the held-out validation sweep changes before and after training across fraud detection, decision accuracy, evidence grounding, and calibration.
 
- The script also writes [reports/component_shift_summary.json](../reports/component_shift_summary.json) so the before/after component means are easy to inspect.
+ The script also writes [reports/component_shift_summary.json](reports/component_shift_summary.json) so the before/after component means are easy to inspect.
 
 ### Quantitative results — 5,000-episode GRPO run
 
- All numbers are from committed JSON artifacts: [`reports/training_summary.json`](../reports/training_summary.json), [`reports/component_shift_summary.json`](../reports/component_shift_summary.json).
+ All numbers are from committed JSON artifacts: [`reports/training_summary.json`](reports/training_summary.json), [`reports/component_shift_summary.json`](reports/component_shift_summary.json).
 
 Three headline numbers tell the story:
 
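The renamed blog points readers at committed JSON artifacts rather than the WandB UI. For reviewers who want to check the before/after component means locally, here is a minimal inspection sketch. The field names (`before`, `after`, `final_reward`, and the per-component keys) are assumptions about the artifact layout, not a documented schema; print the raw dicts first and adapt the keys to whatever `reports/component_shift_summary.json` and `reports/training_summary.json` actually contain.

```python
import json
from pathlib import Path

# Hypothetical inspection helper for the committed artifacts referenced in BLOG.md.
# The key names below ("before", "after", "final_reward") are assumptions, not the
# repository's documented schema; inspect the raw JSON and adjust as needed.
reports = Path("reports")
shift = json.loads((reports / "component_shift_summary.json").read_text())
summary = json.loads((reports / "training_summary.json").read_text())

before = shift.get("before", {})  # assumed: {component_name: mean_score, ...}
after = shift.get("after", {})
for component in sorted(set(before) | set(after)):
    b = before.get(component, float("nan"))
    a = after.get(component, float("nan"))
    print(f"{component:>20}: {b:.3f} -> {a:.3f}")

print("final reward (assumed key):", summary.get("final_reward"))
```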
README.md CHANGED
@@ -45,7 +45,7 @@ Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore ev
 | **WandB (all runs)** | https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl |
 | **Trained Model** | https://huggingface.co/AniketAsla/debatefloor-grpo-qwen2.5-0.5b-instruct |
 | **Training Notebook (Colab)** | [train/train_debatefloor.ipynb](https://github.com/AniketAslaliya/debateFloor/blob/main/train/train_debatefloor.ipynb) |
- | **Mini-Blog** | [docs/BLOG.md](https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/docs/BLOG.md) |
+ | **Mini-Blog** | [BLOG.md](https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/BLOG.md) |
 
 ---
 
@@ -54,7 +54,7 @@ Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore ev
 | Criterion | Weight | Where to find the evidence |
 |---|---|---|
 | **Environment Innovation** | 40% | The 3×2 calibration matrix (`README` §_The Core Innovation_) is a novel reward shape — it does not exist in any prior insurance-RL work and directly attacks the calibration-degradation problem documented in the CAPO paper (April 2026). The **Court Panel** mechanic forces the agent to expose its reasoning to a programmatic adversary, which is also unexplored territory for RL on LLMs. |
- | **Storytelling & Presentation** | 30% | [`docs/BLOG.md`](docs/BLOG.md) — full mini-blog motivating the problem, walking the reader through one episode end-to-end, and showing the training delta in plain language. README is structured for a 3-minute read with the headline number first. |
+ | **Storytelling & Presentation** | 30% | [`BLOG.md`](BLOG.md) — full mini-blog motivating the problem, walking the reader through one episode end-to-end, and showing the training delta in plain language. README is structured for a 3-minute read with the headline number first. |
 | **Showing Improvement in Rewards** | 20% | [`docs/reward_curve.svg`](docs/reward_curve.svg) — 2,500-step reward curve from a 5,000-episode GRPO run (0.130 → 0.469, 3.6×). [`reports/training_summary.json`](reports/training_summary.json) — raw metrics including full log history. [`reports/component_shift_summary.json`](reports/component_shift_summary.json) — before/after on held-out eval (Decision accuracy 0 → 1.0, Calibration 0 → 1.0). WandB run linked above for reproducibility. |
 | **Reward & Training Pipeline** | 10% | [`app/rubrics.py`](app/rubrics.py) — composable rubric (decision × confidence × evidence × format), not monolithic. [`server/calibration_grader.py`](server/calibration_grader.py) — 3×2 calibration matrix. [`train/train_minimal.py`](train/train_minimal.py) — TRL GRPO loop that calls the live HTTP env over `requests.Session` (MR-2 compliant, no static dataset). |
 
@@ -63,7 +63,7 @@ Indian health-insurance fraud, waste & abuse drains **₹8,000–10,000 crore ev
 - [x] Built on **OpenEnv `0.2.3`** (latest at submission time) — see `requirements.txt`
 - [x] Working **TRL training script in a Colab notebook** — [`train/train_debatefloor.ipynb`](train/train_debatefloor.ipynb)
 - [x] **Real reward + loss plots** committed to the repo — [`docs/reward_curve.svg`](docs/reward_curve.svg), [`docs/component_shift.svg`](docs/component_shift.svg)
- - [x] **Mini-blog** at [`docs/BLOG.md`](docs/BLOG.md)
+ - [x] **Mini-blog** at [`BLOG.md`](BLOG.md)
 - [x] **OpenEnv-compliant env hosted on HF Spaces** — https://huggingface.co/spaces/AniketAsla/debatefloor
 - [x] **README** motivates the problem, explains the env, and shows results (this file)
 - [x] **`openenv.yaml`** manifest valid — see repo root
@@ -365,10 +365,11 @@ debateFloor/
 │ ├── train_minimal.py ← Pure TRL GRPOTrainer, T4 in 15 min
 │ └── train_debatefloor.ipynb ← Colab notebook (dynamic wrapper)
 
+ ├── BLOG.md ← writeup (root for visibility)
+
 ├── docs/
 │ ├── reward_curve.svg ← training reward curve (embedded above)
- │ ├── component_shift.svg ← before/after component scores (embedded above)
- │ └── BLOG.md ← writeup
+ │ └── component_shift.svg ← before/after component scores (embedded above)
 
 └── reports/
     ├── training_summary.json
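The "Reward & Training Pipeline" row above describes `train/train_minimal.py` as a TRL GRPO loop that grades completions against the live HTTP env over `requests.Session` instead of a static dataset. A minimal sketch of what such an env-connected reward hook could look like is below. The Space base URL, the `/grade` endpoint, the request payload, and the `reward` response key are illustrative assumptions rather than the environment's documented API, and the TRL wiring is shown only in comments.

```python
import requests

# Sketch of an env-connected reward function in the spirit of train/train_minimal.py.
# ENV_URL, the "/grade" endpoint, the payload fields, and the "reward" response key
# are assumptions for illustration; check the Space's actual API before using this.
ENV_URL = "https://aniketasla-debatefloor.hf.space"
session = requests.Session()  # reuse one connection across the whole GRPO run

def env_reward(completions, **kwargs):
    """Return one scalar reward per completion by asking the hosted env to grade it."""
    rewards = []
    for completion in completions:
        # TRL passes plain strings for standard-format datasets; conversational
        # datasets pass lists of message dicts instead and would need unwrapping here.
        resp = session.post(f"{ENV_URL}/grade", json={"response": completion}, timeout=60)
        resp.raise_for_status()
        rewards.append(float(resp.json().get("reward", 0.0)))
    return rewards

# The TRL side would then look roughly like:
#   from trl import GRPOConfig, GRPOTrainer
#   trainer = GRPOTrainer(
#       model="Qwen/Qwen2.5-0.5B-Instruct",
#       reward_funcs=env_reward,
#       args=GRPOConfig(output_dir="grpo-debatefloor"),
#       train_dataset=prompt_dataset,  # dataset with a "prompt" column
#   )
#   trainer.train()
```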
docs/VideoScript_ClaimCourt.md CHANGED
@@ -139,7 +139,7 @@ Click **Run Episode** once on **clean claim** so the audience sees the flow star
 
 - **Try it:** https://huggingface.co/spaces/AniketAsla/debatefloor
 - **Code:** https://github.com/AniketAslaliya/debateFloor
- - **Mini-blog (markdown):** https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/docs/HFBlogPost.md
+ - **Mini-blog (markdown):** https://huggingface.co/spaces/AniketAsla/debatefloor/blob/main/BLOG.md
 - **Weights & Biases (all training runs):** https://wandb.ai/aniketaslaliya-lnmiit/debatefloor-insurance-rl
 
 ---