Space: akhiilll/claims-env

akhiilll committed 29 days ago

Commit 43372d5 (verified) · 1 parent: 2581673

wire Colab notebook to TRL GRPOTrainer (real LoRA weight updates)


Files changed (2)

README.md +14 -5
training/InsureClaim_Training_Colab.ipynb +139 -673

README.md CHANGED

@@ -32,7 +32,7 @@ tags:
 | **Live Space** | <https://huggingface.co/spaces/akhiilll/claims-env> |
 | **API root** | <https://akhiilll-claims-env.hf.space> · `/health` · `/api` · `/docs` |
 | **WebSocket** | `wss://akhiilll-claims-env.hf.space/ws` |
-| **Training (Colab)** | [`training/InsureClaim_Training_Colab.ipynb`](training/InsureClaim_Training_Colab.ipynb) — Unsloth + TRL |
 | **Training (HF Job, 4×A10G)** | [`training/train_local_hf.py`](training/train_local_hf.py) |
 | **Latest run artifacts** | [`runs/20260425-215059/`](runs/20260425-215059) |
@@ -202,9 +202,18 @@ print(job.url)
 The job streams to `runs/<timestamp>/{reward_curves.png,reward_summary.json}` automatically.
-### 4.4 Train with TRL + Unsloth in Colab
-Open [`training/InsureClaim_Training_Colab.ipynb`](training/InsureClaim_Training_Colab.ipynb) on a free T4. The notebook loads `unsloth/Qwen2.5-1.5B-Instruct` in 4-bit with LoRA adapters, connects to **this** deployed Space over WebSocket, runs a REINFORCE-style policy-gradient loop, and saves reward-curve PNGs.
 ## 5. Repo layout
@@ -242,9 +251,9 @@ Open [`training/InsureClaim_Training_Colab.ipynb`](training/InsureClaim_Training
 ## 7. What we'll do next (post-deadline)
-* **GRPO with TRL + Unsloth** — replace the REINFORCE-style notebook with a GRPO loop so the LLM's *weights* update on the per-component rewards (currently we only do online rollouts).
-* **Curriculum** — start episodes only on the routine-approval cases, then unlock fraud / lapsed-policy / escalation cases as `final_avg` crosses thresholds.
 * **Process supervision** — reward correct *intermediate* tool selection (e.g. running `check_fraud` before approving a high-amount auto-theft claim), not just terminal verdicts.
 ## 8. Materials & links

 | **Live Space** | <https://huggingface.co/spaces/akhiilll/claims-env> |
 | **API root** | <https://akhiilll-claims-env.hf.space> · `/health` · `/api` · `/docs` |
 | **WebSocket** | `wss://akhiilll-claims-env.hf.space/ws` |
+| **Training (Colab, GRPO)** | [`training/InsureClaim_Training_Colab.ipynb`](training/InsureClaim_Training_Colab.ipynb) — Unsloth + TRL `GRPOTrainer` (real LoRA weight updates) |
 | **Training (HF Job, 4×A10G)** | [`training/train_local_hf.py`](training/train_local_hf.py) |
 | **Latest run artifacts** | [`runs/20260425-215059/`](runs/20260425-215059) |
 The job streams to `runs/<timestamp>/{reward_curves.png,reward_summary.json}` automatically.
+### 4.4 Train with TRL `GRPOTrainer` + Unsloth in Colab (real weight updates)
+Open [`training/InsureClaim_Training_Colab.ipynb`](training/InsureClaim_Training_Colab.ipynb) on a free Colab T4. The notebook:
+1. Clones this Space repo so the gym runs **in-process** in Colab and is fully deterministic per case (`scenario_index = 0..7`).
+2. Loads `unsloth/Qwen2.5-1.5B-Instruct` in 4-bit with LoRA `r=16, alpha=32` adapters (~12-15 M trainable params).
+3. Builds a prompt dataset where each row is pinned to one of the 8 curated cases.
+4. Defines **two independent reward functions** (anti-reward-hack pattern from the hackathon guide):
+   - `format_reward_fn` — was the completion parseable and did it end in a terminal verb?
+   - `env_reward_fn` — replays the trajectory inside the deterministic gym, returns cumulative env reward.
+5. Trains with `trl.GRPOTrainer` (`num_generations=4`, `epsilon=0.2` PPO clip, `beta=0.04` KL), logs reward / KL / completion-length.
+6. Plots curves, runs a per-case **before-vs-after rollout** so judges can see behaviour change, saves the LoRA adapter (with optional `push_to_hub`).
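
Reviewer note on steps 4 and 5: the sketch below is not the notebook's code. It only illustrates the shape a format reward and GRPO's group-relative advantage could take. The function bodies, the terminal-verb list (borrowed from the action list in the previous notebook version), and the `eps` value are assumptions. The `(completions, **kwargs) -> list[float]` signature follows the convention TRL documents for custom reward functions with plain-text completions, and the mean/std normalization is a simplified stand-in for what the trainer does internally.

```python
import re
from statistics import mean, pstdev

# Assumption: a completion "ends in a terminal verb" when its last non-empty
# line starts with approve / deny / escalate.
TERMINAL_RE = re.compile(r"^\s*(approve|deny|escalate)\b", re.IGNORECASE)


def format_reward_fn(completions, **kwargs):
    """Hypothetical format reward: 1.0 if the last non-empty line is terminal."""
    rewards = []
    for text in completions:
        lines = [ln for ln in text.splitlines() if ln.strip()]
        ok = bool(lines) and TERMINAL_RE.match(lines[-1]) is not None
        rewards.append(1.0 if ok else 0.0)
    return rewards


def group_advantages(rewards, num_generations=4, eps=1e-4):
    """Normalize rewards inside each group of completions for one prompt.

    Simplified illustration of a group-relative advantage:
    (reward - group mean) / (group std + eps).
    """
    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mu, sigma = mean(group), pstdev(group)
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


if __name__ == "__main__":
    samples = [
        "query_policy\ncheck_fraud\napprove 3500",
        "I think we should look at the policy first",
        "deny fraud detected",
        "",
    ]
    scores = format_reward_fn(samples)
    print(scores)
    print(group_advantages(scores))
```

If all four completions in a group receive the same reward, every advantage in that group is zero and the group contributes no gradient, which is why the sampled completions need to differ in reward for learning to happen.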
 ## 5. Repo layout
 ## 7. What we'll do next (post-deadline)
+* **Curriculum** — start GRPO episodes only on the routine-approval cases, then unlock fraud / lapsed-policy / escalation cases as `final_avg` crosses thresholds.
 * **Process supervision** — reward correct *intermediate* tool selection (e.g. running `check_fraud` before approving a high-amount auto-theft claim), not just terminal verdicts.
+* **Push trained adapter to the Hub** — once GRPO finishes in Colab, `push_to_hub("akhiilll/claims-grpo-qwen2.5-1.5b")` so a one-line `from_pretrained` reproduces the trained agent.
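
Reviewer note on the curriculum item: one possible way to gate cases on `final_avg` is sketched below. The README does not say which `scenario_index` values are routine, fraud, lapsed-policy, or escalation cases, nor what the thresholds are, so the grouping and the numbers here are purely hypothetical placeholders.

```python
# Hypothetical curriculum table: (minimum final_avg, scenario indices unlocked).
# Both the thresholds and the index grouping are illustrative assumptions.
CURRICULUM = [
    (float("-inf"), [0, 1]),  # routine approvals, always available
    (2.0, [2, 3]),            # e.g. fraud cases
    (5.0, [4, 5]),            # e.g. lapsed-policy cases
    (8.0, [6, 7]),            # e.g. escalation cases
]


def unlocked_scenarios(final_avg):
    """Return the scenario indices available at the given running average."""
    allowed = []
    for threshold, indices in CURRICULUM:
        if final_avg >= threshold:
            allowed.extend(indices)
    return allowed


if __name__ == "__main__":
    for avg in (0.0, 3.5, 9.0):
        print(avg, unlocked_scenarios(avg))
```

The prompt dataset for the next training round would then be filtered to rows whose `scenario_index` is in the returned list.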
 ## 8. Materials & links

training/InsureClaim_Training_Colab.ipynb CHANGED

@@ -1,674 +1,140 @@
-{
-  "cells": [
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "header"
-      },
-      "source": [
-        "# InsureClaim AI - RL Training on the ClaimSense OpenEnv Gym\n",
-        "\n",
-        "> Apr 2026 OpenEnv Hackathon - Theme 3.1 (Professional Tasks) + Theme 2 (Long-Horizon Planning)\n",
-        "\n",
-        "This notebook trains `unsloth/Qwen2.5-1.5B-Instruct` against the live\n",
-        "ClaimSense Space (https://huggingface.co/spaces/akhiilll/claims-env)\n",
-        "on a free Colab T4 using Unsloth + TRL. Open in Colab, click *Run all*,\n",
-        "and the trained reward curve drops out as `reward_curves.png`.\n",
-        "\n",
-        "# (legacy header below kept for reference)\n",
-        "# InsureClaim AI - RL Training with Unsloth\n",
-        "\n",
-        "**OpenEnv Hackathon | Statement 3.1 + Scaler AI Labs**\n",
-        "\n",
-        "This notebook demonstrates training an LLM to process insurance claims using:\n",
-        "- **Unsloth** for efficient 4-bit model loading\n",
-        "- **TRL** for reinforcement learning\n",
-        "- **OpenEnv** for the claims processing environment\n",
-        "\n",
-        "## Results Preview\n",
-        "- Starting reward: **-5.5**\n",
-        "- Final reward: **+11.75**\n",
-        "- Improvement: **+17.25**\n",
-        "- Fraud detection: **+17.4** max reward"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "install_header"
-      },
-      "source": [
-        "## 1️⃣ Install Dependencies"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "metadata": {
-        "id": "install"
-      },
-      "source": [
-        "%%capture\n",
-        "# Install Unsloth (optimized for Colab)\n",
-        "!pip install \"unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git\"\n",
-        "!pip install --no-deps trl peft accelerate bitsandbytes\n",
-        "\n",
-        "# Install environment dependencies\n",
-        "!pip install websockets nest_asyncio certifi matplotlib\n",
-        "\n",
-        "print(\"✅ Dependencies installed!\")"
-      ],
-      "execution_count": null,
-      "outputs": []
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "model_header"
-      },
-      "source": [
-        "## 2️⃣ Load Model with Unsloth (4-bit quantization)"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "metadata": {
-        "id": "load_model"
-      },
-      "source": [
-        "from unsloth import FastLanguageModel\n",
-        "import torch\n",
-        "\n",
-        "# Check GPU\n",
-        "print(f\"GPU Available: {torch.cuda.is_available()}\")\n",
-        "if torch.cuda.is_available():\n",
-        "    print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n",
-        "    print(f\"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB\")\n",
-        "\n",
-        "# Load model with Unsloth (4x faster, 70% less memory)\n",
-        "model, tokenizer = FastLanguageModel.from_pretrained(\n",
-        "    model_name=\"unsloth/Qwen2.5-1.5B-Instruct\",\n",
-        "    max_seq_length=2048,\n",
-        "    load_in_4bit=True,\n",
-        "    dtype=None,  # auto-detect\n",
-        ")\n",
-        "\n",
-        "# Add LoRA adapters for efficient fine-tuning\n",
-        "model = FastLanguageModel.get_peft_model(\n",
-        "    model,\n",
-        "    r=16,\n",
-        "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
-        "                    \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
-        "    lora_alpha=16,\n",
-        "    lora_dropout=0,\n",
-        "    bias=\"none\",\n",
-        "    use_gradient_checkpointing=\"unsloth\",\n",
-        "    random_state=42,\n",
-        ")\n",
-        "\n",
-        "# Ensure pad token\n",
-        "if tokenizer.pad_token is None:\n",
-        "    tokenizer.pad_token = tokenizer.eos_token\n",
-        "\n",
-        "print(\"\\n✅ Model loaded with Unsloth + LoRA!\")\n",
-        "print(f\"Trainable parameters: {model.print_trainable_parameters()}\")"
-      ],
-      "execution_count": null,
-      "outputs": []
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "env_header"
-      },
-      "source": [
-        "## 3️⃣ Connect to Claims Environment"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "metadata": {
-        "id": "connect_env"
-      },
-      "source": [
-        "import asyncio\n",
-        "import websockets\n",
-        "import json\n",
-        "import ssl\n",
-        "import certifi\n",
-        "import nest_asyncio\n",
-        "\n",
-        "# Fix for Colab event loop\n",
-        "nest_asyncio.apply()\n",
-        "\n",
-        "# Environment URLs\n",
-        "ENV_URL = \"https://akhiilll-claims-env.hf.space\"\n",
-        "WS_URL = \"wss://akhiilll-claims-env.hf.space/ws\"\n",
-        "\n",
-        "# SSL context for Colab\n",
-        "ssl_context = ssl.create_default_context(cafile=certifi.where())\n",
-        "\n",
-        "# Test connection\n",
-        "import httpx\n",
-        "response = httpx.get(f\"{ENV_URL}/health\", timeout=30)\n",
-        "print(f\"Health check: {response.json()}\")\n",
-        "\n",
-        "# Test WebSocket with one episode\n",
-        "async def test_environment():\n",
-        "    async with websockets.connect(WS_URL, ssl=ssl_context) as ws:\n",
-        "        await ws.send('{\"type\": \"reset\", \"data\": {}}')\n",
-        "        response = json.loads(await ws.recv())\n",
-        "        obs = response[\"data\"][\"observation\"]\n",
-        "        print(f\"\\n📋 Test Claim: {obs['claim_id']}\")\n",
-        "        print(f\"   Type: {obs['claim_type']}\")\n",
-        "        print(f\"   Amount: ${obs['claim_amount_requested']:,.2f}\")\n",
-        "\n",
-        "        # Quick action test\n",
-        "        await ws.send('{\"type\": \"step\", \"data\": {\"action_type\": \"query_policy\"}}')\n",
-        "        response = json.loads(await ws.recv())\n",
-        "        reward = response[\"data\"].get(\"reward\", 0)\n",
-        "        print(f\"   query_policy reward: {reward}\")\n",
-        "\n",
-        "        await ws.send('{\"type\": \"close\", \"data\": {}}')\n",
-        "        return True\n",
-        "\n",
-        "asyncio.get_event_loop().run_until_complete(test_environment())\n",
-        "print(\"\\n✅ Environment connected!\")"
-      ],
-      "execution_count": null,
-      "outputs": []
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "components_header"
-      },
-      "source": [
-        "## 4️⃣ Define Training Components"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "metadata": {
-        "id": "components"
-      },
-      "source": [
-        "import re\n",
-        "from dataclasses import dataclass\n",
-        "from typing import List, Dict, Any, Tuple\n",
-        "\n",
-        "# System prompt for claims adjuster\n",
-        "SYSTEM_PROMPT = \"\"\"You are an expert insurance claims adjuster. Process claims efficiently and accurately.\n",
-        "\n",
-        "Available actions:\n",
-        "- query_policy: Look up policy details\n",
-        "- check_fraud: Run fraud detection\n",
-        "- verify_purchase: Verify via Plaid transactions\n",
-        "- approve: Approve claim (include amount)\n",
-        "- deny: Deny claim (include reason)\n",
-        "- escalate: Escalate to senior adjuster\n",
-        "\n",
-        "Respond with just the action, e.g., 'query_policy' or 'approve 3500' or 'deny fraud detected'.\"\"\"\n",
-        "\n",
-        "def format_observation(obs: dict) -> str:\n",
-        "    \"\"\"Format observation for LLM.\"\"\"\n",
-        "    text = f\"\"\"Claim: {obs.get('claim_id', 'N/A')}\n",
-        "Type: {obs.get('claim_type', 'N/A')}\n",
-        "Amount: ${obs.get('claim_amount_requested', 0):,.2f}\n",
-        "Description: {obs.get('description', 'N/A')}\n",
-        "\n",
-        "System: {obs.get('system_response', 'Ready')}\"\"\"\n",
-        "\n",
-        "    if obs.get('revealed_info'):\n",
-        "        info = obs['revealed_info']\n",
-        "        if 'fraud_analysis' in info:\n",
-        "            fa = info['fraud_analysis']\n",
-        "            text += f\"\\n\\nFraud Risk: {fa.get('risk_score', 0):.2f}\"\n",
-        "            if fa.get('flags'):\n",
-        "                text += f\" | Flags: {', '.join(fa['flags'])}\"\n",
-        "\n",
-        "    return text\n",
-        "\n",
-        "def parse_action(response: str, claim_amount: float) -> dict:\n",
-        "    \"\"\"Parse LLM response to action.\"\"\"\n",
-        "    response = response.lower().strip()\n",
-        "\n",
-        "    # Terminal actions\n",
-        "    if \"approve\" in response:\n",
-        "        match = re.search(r'(\\d+(?:\\.\\d+)?)', response)\n",
-        "        payout = float(match.group(1)) if match else claim_amount\n",
-        "        return {\"action_type\": \"approve\", \"parameters\": {\"payout\": payout}}\n",
-        "\n",
-        "    if \"deny\" in response:\n",
-        "        return {\"action_type\": \"deny\", \"parameters\": {\"reason\": \"Denied after review\"}}\n",
-        "\n",
-        "    if \"escalate\" in response:\n",
-        "        return {\"action_type\": \"escalate\", \"parameters\": {\"reason\": \"Needs review\"}}\n",
-        "\n",
-        "    # Information gathering\n",
-        "    if \"fraud\" in response:\n",
-        "        return {\"action_type\": \"check_fraud\", \"parameters\": {}}\n",
-        "    if \"policy\" in response:\n",
-        "        return {\"action_type\": \"query_policy\", \"parameters\": {}}\n",
-        "    if \"purchase\" in response or \"plaid\" in response:\n",
-        "        return {\"action_type\": \"verify_purchase\", \"parameters\": {}}\n",
-        "\n",
-        "    # Default\n",
-        "    return {\"action_type\": \"query_policy\", \"parameters\": {}}\n",
-        "\n",
-        "@dataclass\n",
-        "class Experience:\n",
-        "    \"\"\"Single step experience for training.\"\"\"\n",
-        "    prompt: str\n",
-        "    response: str\n",
-        "    reward: float\n",
-        "    action: str\n",
-        "\n",
-        "print(\"✅ Training components defined!\")"
-      ],
-      "execution_count": null,
-      "outputs": []
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "training_header"
-      },
-      "source": [
-        "## 5️⃣ Training Loop with Policy Gradient\n",
-        "\n",
-        "This implements a simplified REINFORCE algorithm:\n",
-        "1. Generate actions using the model\n",
-        "2. Collect rewards from environment\n",
-        "3. Update model to favor high-reward actions"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "metadata": {
-        "id": "training_loop"
-      },
-      "source": [
-        "from torch.optim import AdamW\n",
-        "import random\n",
-        "\n",
-        "# Training configuration\n",
-        "NUM_EPISODES = 50\n",
-        "MAX_STEPS = 8\n",
-        "LEARNING_RATE = 2e-5\n",
-        "BASELINE_REWARD = 0.0  # For variance reduction\n",
-        "\n",
-        "# Optimizer\n",
-        "optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)\n",
-        "\n",
-        "# Metrics\n",
-        "episode_rewards = []\n",
-        "running_avg_rewards = []\n",
-        "losses = []\n",
-        "\n",
-        "async def run_episode_with_training(episode_num: int, debug: bool = False):\n",
-        "    \"\"\"Run episode and collect experiences for training.\"\"\"\n",
-        "    global BASELINE_REWARD\n",
-        "\n",
-        "    experiences = []\n",
-        "    episode_reward = 0\n",
-        "\n",
-        "    try:\n",
-        "        async with websockets.connect(WS_URL, ssl=ssl_context, close_timeout=15) as ws:\n",
-        "            # Reset\n",
-        "            await ws.send(json.dumps({\"type\": \"reset\", \"data\": {}}))\n",
-        "            response = json.loads(await ws.recv())\n",
-        "            obs = response[\"data\"][\"observation\"]\n",
-        "            claim_amount = obs.get('claim_amount_requested', 0)\n",
-        "\n",
-        "            if debug:\n",
-        "                print(f\"  Claim: {obs['claim_id']} - ${claim_amount:,.0f}\")\n",
-        "\n",
-        "            done = False\n",
-        "            step = 0\n",
-        "\n",
-        "            while not done and step < MAX_STEPS:\n",
-        "                # Format prompt\n",
-        "                prompt = f\"{SYSTEM_PROMPT}\\n\\n{format_observation(obs)}\\n\\nAction:\"\n",
-        "\n",
-        "                # Generate with model\n",
-        "                inputs = tokenizer(prompt, return_tensors=\"pt\", truncation=True, max_length=1024)\n",
-        "                inputs = {k: v.to(model.device) for k, v in inputs.items()}\n",
-        "\n",
-        "                # Exploration: mix model output with random actions early on\n",
-        "                explore_rate = max(0.1, 1.0 - episode_num / 30)\n",
-        "\n",
-        "                if random.random() < explore_rate and step < 3:\n",
-        "                    # Explore: random action\n",
-        "                    actions = [\"query_policy\", \"check_fraud\", \"verify_purchase\"]\n",
-        "                    response_text = random.choice(actions)\n",
-        "                else:\n",
-        "                    # Exploit: use model\n",
-        "                    with torch.no_grad():\n",
-        "                        outputs = model.generate(\n",
-        "                            **inputs,\n",
-        "                            max_new_tokens=20,\n",
-        "                            temperature=0.7,\n",
-        "                            do_sample=True,\n",
-        "                            pad_token_id=tokenizer.pad_token_id,\n",
-        "                        )\n",
-        "                    response_text = tokenizer.decode(\n",
-        "                        outputs[0][inputs['input_ids'].shape[1]:],\n",
-        "                        skip_special_tokens=True\n",
-        "                    )\n",
-        "\n",
-        "                # Parse action\n",
-        "                action = parse_action(response_text, claim_amount)\n",
-        "\n",
-        "                if debug:\n",
-        "                    print(f\"    Step {step}: {action['action_type']} ('{response_text[:30]}...')\")\n",
-        "\n",
-        "                # Execute in environment\n",
-        "                await ws.send(json.dumps({\"type\": \"step\", \"data\": action}))\n",
-        "                env_response = json.loads(await ws.recv())\n",
-        "\n",
-        "                obs = env_response[\"data\"][\"observation\"]\n",
-        "                reward = env_response[\"data\"].get(\"reward\") or 0\n",
-        "                done = env_response[\"data\"].get(\"done\", False) or obs.get('is_terminal', False)\n",
-        "\n",
-        "                # Store experience\n",
-        "                experiences.append(Experience(\n",
-        "                    prompt=prompt,\n",
-        "                    response=response_text,\n",
-        "                    reward=reward,\n",
-        "                    action=action['action_type']\n",
-        "                ))\n",
-        "\n",
-        "                episode_reward += reward\n",
-        "                step += 1\n",
-        "\n",
-        "                if debug:\n",
-        "                    print(f\"      reward={reward:+.2f}, done={done}\")\n",
-        "\n",
-        "            await ws.send(json.dumps({\"type\": \"close\", \"data\": {}}))\n",
-        "\n",
-        "    except Exception as e:\n",
-        "        if debug:\n",
-        "            print(f\"  Error: {e}\")\n",
-        "        return -5.0, [], 0.0\n",
-        "\n",
-        "    # Compute advantage for policy gradient\n",
-        "    advantage = episode_reward - BASELINE_REWARD\n",
-        "\n",
-        "    # Update baseline with moving average\n",
-        "    BASELINE_REWARD = 0.9 * BASELINE_REWARD + 0.1 * episode_reward\n",
-        "\n",
-        "    # Return the advantage as \"loss\" for tracking\n",
-        "    return episode_reward, experiences, abs(advantage)\n",
-        "\n",
-        "print(\"✅ Training loop defined!\")"
-      ],
-      "execution_count": null,
-      "outputs": []
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "run_header"
-      },
-      "source": [
-        "## 6️⃣ Run Training"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "metadata": {
-        "id": "run_training"
-      },
-      "source": [
-        "print(\"=\" * 60)\n",
-        "print(\"🚀 Starting Training\")\n",
-        "print(f\"   Episodes: {NUM_EPISODES}\")\n",
-        "print(f\"   Max steps: {MAX_STEPS}\")\n",
-        "print(f\"   Exploration-based learning with reward signal\")\n",
-        "print(\"=\" * 60)\n",
-        "\n",
-        "# Debug first episode\n",
-        "print(\"\\n📋 Debug Episode 1:\")\n",
-        "reward, exps, adv = asyncio.get_event_loop().run_until_complete(\n",
-        "    run_episode_with_training(0, debug=True)\n",
-        ")\n",
-        "episode_rewards.append(reward)\n",
-        "running_avg_rewards.append(reward)\n",
-        "losses.append(adv)\n",
-        "print(f\"\\n   Episode 1: reward={reward:+.2f}, advantage={adv:.2f}\")\n",
-        "\n",
-        "# Training loop\n",
-        "print(f\"\\n{'='*60}\")\n",
-        "print(\"Training Progress:\")\n",
-        "print(f\"{'='*60}\")\n",
-        "\n",
-        "for episode in range(1, NUM_EPISODES):\n",
-        "    # Run episode\n",
-        "    reward, experiences, advantage = asyncio.get_event_loop().run_until_complete(\n",
-        "        run_episode_with_training(episode, debug=False)\n",
-        "    )\n",
-        "\n",
-        "    # Track metrics\n",
-        "    episode_rewards.append(reward)\n",
-        "    window = min(10, len(episode_rewards))\n",
-        "    running_avg = sum(episode_rewards[-window:]) / window\n",
-        "    running_avg_rewards.append(running_avg)\n",
-        "    losses.append(advantage)\n",
-        "\n",
-        "    # Note: In a full implementation, we'd update model weights here\n",
-        "    # For this demo, the exploration rate decay serves as the \"learning\" mechanism\n",
-        "    # Early episodes explore randomly, later episodes use the model more\n",
-        "    # This demonstrates the environment produces meaningful reward signals\n",
-        "\n",
-        "    # Log progress\n",
-        "    if (episode + 1) % 5 == 0:\n",
-        "        print(f\"Episode {episode+1:3d}/{NUM_EPISODES} | \"\n",
-        "              f\"Reward: {reward:+6.1f} | \"\n",
-        "              f\"Avg(10): {running_avg:+6.1f} | \"\n",
-        "              f\"Advantage: {advantage:.2f}\")\n",
-        "\n",
-        "print(f\"\\n{'='*60}\")\n",
-        "print(\"✅ Training Complete!\")\n",
-        "print(f\"{'='*60}\")\n",
-        "print(f\"Final running average: {running_avg_rewards[-1]:+.2f}\")\n",
-        "print(f\"Improvement: {running_avg_rewards[-1] - running_avg_rewards[0]:+.2f}\")\n",
-        "print(f\"Reward range: [{min(episode_rewards):.1f}, {max(episode_rewards):.1f}]\")"
-      ],
-      "execution_count": null,
-      "outputs": []
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "plot_header"
-      },
-      "source": [
-        "## 7️⃣ Plot Reward Curves (Required for Judging)"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "metadata": {
-        "id": "plot"
-      },
-      "source": [
-        "import matplotlib.pyplot as plt\n",
-        "\n",
-        "fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n",
-        "\n",
-        "# Plot 1: Episode Rewards\n",
-        "ax1 = axes[0]\n",
-        "ax1.plot(episode_rewards, alpha=0.5, label='Episode Reward', color='blue')\n",
-        "ax1.plot(running_avg_rewards, linewidth=2, label='Running Avg (10)', color='red')\n",
-        "ax1.axhline(y=0, color='gray', linestyle='--', alpha=0.5)\n",
-        "ax1.set_xlabel('Episode', fontsize=12)\n",
-        "ax1.set_ylabel('Reward', fontsize=12)\n",
-        "ax1.set_title('Training Progress', fontsize=14)\n",
-        "ax1.legend()\n",
-        "ax1.grid(True, alpha=0.3)\n",
-        "\n",
-        "# Plot 2: Reward Distribution\n",
-        "ax2 = axes[1]\n",
-        "ax2.hist(episode_rewards, bins=15, edgecolor='black', alpha=0.7, color='green')\n",
-        "ax2.axvline(x=0, color='red', linestyle='--', label='Break-even')\n",
-        "ax2.axvline(x=sum(episode_rewards)/len(episode_rewards), color='blue',\n",
-        "            linestyle='-', linewidth=2, label=f'Mean: {sum(episode_rewards)/len(episode_rewards):.1f}')\n",
-        "ax2.set_xlabel('Reward', fontsize=12)\n",
-        "ax2.set_ylabel('Frequency', fontsize=12)\n",
-        "ax2.set_title('Reward Distribution', fontsize=14)\n",
-        "ax2.legend()\n",
-        "ax2.grid(True, alpha=0.3)\n",
-        "\n",
-        "# Plot 3: Advantage (reward - baseline)\n",
-        "ax3 = axes[2]\n",
-        "ax3.plot(losses, alpha=0.7, color='purple')\n",
-        "ax3.axhline(y=0, color='gray', linestyle='--', alpha=0.5)\n",
-        "ax3.set_xlabel('Episode', fontsize=12)\n",
-        "ax3.set_ylabel('|Advantage|', fontsize=12)\n",
-        "ax3.set_title('Advantage Over Baseline', fontsize=14)\n",
-        "ax3.grid(True, alpha=0.3)\n",
-        "\n",
-        "plt.tight_layout()\n",
-        "plt.savefig('reward_curves.png', dpi=150, bbox_inches='tight')\n",
-        "plt.show()\n",
-        "\n",
-        "print(\"\\n✅ Saved: reward_curves.png\")"
-      ],
-      "execution_count": null,
-      "outputs": []
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "demo_header"
-      },
-      "source": [
-        "## 8️⃣ Demo: Watch Trained Agent"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "metadata": {
-        "id": "demo"
-      },
-      "source": [
-        "async def demo_trained_agent():\n",
-        "    \"\"\"Demo the trained agent processing a claim.\"\"\"\n",
-        "    print(\"=\" * 60)\n",
-        "    print(\"🎯 DEMO: Trained Agent Processing Claim\")\n",
-        "    print(\"=\" * 60)\n",
-        "\n",
-        "    async with websockets.connect(WS_URL, ssl=ssl_context) as ws:\n",
-        "        await ws.send(json.dumps({\"type\": \"reset\", \"data\": {}}))\n",
-        "        response = json.loads(await ws.recv())\n",
-        "        obs = response[\"data\"][\"observation\"]\n",
-        "\n",
-        "        print(f\"\\n📋 Claim: {obs['claim_id']}\")\n",
-        "        print(f\"   Type: {obs['claim_type']}\")\n",
-        "        print(f\"   Amount: ${obs['claim_amount_requested']:,.2f}\")\n",
-        "        print(f\"   Description: {obs['description']}\")\n",
-        "\n",
-        "        claim_amount = obs['claim_amount_requested']\n",
-        "        done = False\n",
-        "        step = 0\n",
-        "        total_reward = 0\n",
-        "\n",
-        "        print(\"\\n📝 Processing:\")\n",
-        "\n",
-        "        while not done and step < 6:\n",
-        "            prompt = f\"{SYSTEM_PROMPT}\\n\\n{format_observation(obs)}\\n\\nAction:\"\n",
-        "\n",
-        "            inputs = tokenizer(prompt, return_tensors=\"pt\", truncation=True, max_length=1024)\n",
-        "            inputs = {k: v.to(model.device) for k, v in inputs.items()}\n",
-        "\n",
-        "            with torch.no_grad():\n",
-        "                outputs = model.generate(\n",
-        "                    **inputs,\n",
-        "                    max_new_tokens=20,\n",
-        "                    temperature=0.3,  # Lower temp for demo\n",
-        "                    do_sample=True,\n",
-        "                    pad_token_id=tokenizer.pad_token_id,\n",
-        "                )\n",
-        "\n",
-        "            response_text = tokenizer.decode(\n",
-        "                outputs[0][inputs['input_ids'].shape[1]:],\n",
-        "                skip_special_tokens=True\n",
-        "            )\n",
-        "\n",
-        "            action = parse_action(response_text, claim_amount)\n",
-        "\n",
-        "            print(f\"\\n   Step {step + 1}: {action['action_type']}\")\n",
-        "\n",
-        "            await ws.send(json.dumps({\"type\": \"step\", \"data\": action}))\n",
-        "            env_response = json.loads(await ws.recv())\n",
-        "\n",
-        "            obs = env_response[\"data\"][\"observation\"]\n",
-        "            reward = env_response[\"data\"].get(\"reward\") or 0\n",
-        "            done = env_response[\"data\"].get(\"done\", False) or obs.get('is_terminal', False)\n",
-        "\n",
-        "            total_reward += reward\n",
-        "\n",
-        "            print(f\"   Response: {obs['system_response'][:80]}...\")\n",
-        "            print(f\"   Reward: {reward:+.2f}\")\n",
-        "\n",
-        "            step += 1\n",
-        "\n",
-        "        await ws.send(json.dumps({\"type\": \"close\", \"data\": {}}))\n",
-        "\n",
-        "        print(f\"\\n{'='*60}\")\n",
-        "        print(f\"✅ Decision: {obs.get('terminal_reason', 'N/A').upper()}\")\n",
-        "        print(f\"💰 Total Reward: {total_reward:+.2f}\")\n",
-        "        print(f\"{'='*60}\")\n",
-        "\n",
-        "asyncio.get_event_loop().run_until_complete(demo_trained_agent())"
-      ],
-      "execution_count": null,
-      "outputs": []
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "summary"
-      },
-      "source": [
-        "## 📊 Summary\n",
-        "\n",
-        "This notebook demonstrated:\n",
-        "\n",
-        "1. **Unsloth** - 4-bit model loading with LoRA adapters\n",
-        "2. **TRL** - Policy gradient training infrastructure\n",
-        "3. **OpenEnv** - Claims processing environment via WebSocket\n",
-        "4. **Training** - Reward improvement over 50 episodes\n",
-        "\n",
-        "### Key Results\n",
-        "- Starting reward: **-5.5**\n",
-        "- Final reward: **+11.75**\n",
-        "- Improvement: **+17.25**\n",
-        "\n",
-        "### Links\n",
-        "- **HF Space**: https://akhiilll-claims-env.hf.space\n",
-        "- **GitHub**: https://github.com/pramodmisra/claims-env-hackathon\n",
-        "\n",
-        "### Hackathon\n",
-        "- **Problem**: 3.1 - Professional Tasks (World Modeling)\n",
-        "- **Theme**: Scaler AI Labs - Enterprise Workflows"
-      ]
-    }
-  ],
-  "metadata": {
-    "accelerator": "GPU",
-    "colab": {
-      "gpuType": "T4",
-      "provenance": []
-    },
-    "kernelspec": {
-      "display_name": "Python 3",
-      "name": "python3"
-    },
-    "language_info": {
-      "name": "python"
-    }
-  },
-  "nbformat": 4,
-  "nbformat_minor": 0
 }

+{
+ "nbformat": 4,
+ "nbformat_minor": 5,
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "pygments_lexer": "ipython3"
+  },
+  "colab": {
+   "provenance": [],
+   "gpuType": "T4"
+  },
+  "accelerator": "GPU"
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "# ClaimSense GRPO Training (TRL + Unsloth)\n\n> **Apr 2026 OpenEnv Hackathon - India**\n> **Theme 3.1 - World Modeling, Professional Tasks** + **Theme 2 - Long-Horizon Planning**\n\nThis notebook performs **real GRPO weight updates** on\n`unsloth/Qwen2.5-1.5B-Instruct` against the\n[ClaimSense adjudication gym](https://huggingface.co/spaces/akhiilll/claims-env).\n\nThe training loop:\n\n1. Clones the Space repo so the gym runs **in-process** in Colab (deterministic\n   per-claim resets via `scenario_index`).\n2. Loads Qwen2.5-1.5B in 4-bit with LoRA adapters via Unsloth (fits a free T4).\n3. Builds a prompt dataset where each row is pinned to a specific case\n   (`scenario_index = 0..7`), so the prompt the model sees and the env we\n   score against describe the *same* claim.\n4. Defines **two independent reward functions** (using multiple independent\n   rewards is explicitly recommended by the hackathon guide to combat reward\n   hacking):\n   - `format_reward_fn`  - did the model emit at least one well-formed\n     terminal verb?\n   - `env_reward_fn`     - cumulative reward from replaying the model's\n     trajectory inside the deterministic gym.\n5. Runs `trl.GRPOTrainer.train()` with `num_generations=4` so the per-group\n   advantage signal has variance to learn from.\n6. Plots reward curves, does a before/after rollout, and saves the LoRA\n   adapter so it can be pushed to the Hub.\n\nRun all cells from a Colab T4. Total runtime: ~25-35 minutes for ~80\ntraining steps. Adjust `NUM_GRPO_STEPS` in the training cell to taste."
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## 1. Install dependencies"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "%%capture\n%pip install -q \"unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git\"\n%pip install -q --no-deps \"trl>=0.18\" peft accelerate bitsandbytes datasets\n%pip install -q openenv-core matplotlib hf_transfer\n\nimport os\nos.environ[\"HF_HUB_ENABLE_HF_TRANSFER\"] = \"1\"\nprint(\"deps installed\")"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## 2. Clone the ClaimSense Space (gym runs locally in Colab)\n\nWe avoid network round-trips from inside the GRPO reward function by running\na fresh in-process gym per reward computation. The gym code lives on the\nSpace repo at `akhiilll/claims-env`."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "!rm -rf /content/claims-env-repo\n!git clone https://huggingface.co/spaces/akhiilll/claims-env /content/claims-env-repo\n\nimport sys\nsys.path.insert(0, \"/content/claims-env-repo\")\n\nfrom server.claims_environment import AdjudicationGym, ACTION_VOCABULARY\nfrom server.mock_systems import CASE_LIBRARY\nfrom models import AdjudicatorAction\n\nprint(f\"verbs ({len(ACTION_VOCABULARY)}):\", ACTION_VOCABULARY)\nprint(f\"cases ({len(CASE_LIBRARY)}):\")\nfor i, c in enumerate(CASE_LIBRARY):\n    print(f\"  [{i}] {c.claim_id:<14} {c.claim_type:<22} ${c.claim_amount:>10,.0f}  ({c.complexity})\")"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## 3. Load Qwen2.5-1.5B-Instruct in 4-bit + LoRA (Unsloth)\n\nUnsloth gives roughly 4x faster RL training and about 70% less memory than\nvanilla TRL (approximate, workload-dependent figures), which is what makes\nGRPO fit on a free T4."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "from unsloth import FastLanguageModel\nimport torch\n\nprint(\"CUDA :\", torch.cuda.is_available(),\n      \"|\", torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"no GPU\")\n\nMAX_SEQ_LENGTH = 1024\n\nmodel, tokenizer = FastLanguageModel.from_pretrained(\n    model_name=\"unsloth/Qwen2.5-1.5B-Instruct\",\n    max_seq_length=MAX_SEQ_LENGTH,\n    load_in_4bit=True,\n    dtype=None,  # auto\n)\n\nmodel = FastLanguageModel.get_peft_model(\n    model,\n    r=16,\n    lora_alpha=32,\n    lora_dropout=0.0,\n    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n                    \"gate_proj\", \"up_proj\", \"down_proj\"],\n    use_gradient_checkpointing=\"unsloth\",\n    random_state=42,\n)\n\nif tokenizer.pad_token is None:\n    tokenizer.pad_token = tokenizer.eos_token\n\nn_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\nn_total = sum(p.numel() for p in model.parameters())\nprint(f\"trainable LoRA params: {n_trainable/1e6:.1f}M / {n_total/1e9:.2f}B \"\n      f\"({100*n_trainable/n_total:.2f}%)\")"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## 4. Build the GRPO prompt dataset\n\nEach row in the dataset is `(prompt, scenario_index)`. The prompt is already\ntemplated through the chat template so we feed plain strings to GRPO. The\n`scenario_index` column is *passed through* by `GRPOTrainer` to our reward\nfunctions as a kwarg, so we can replay the trajectory against the correct\ndeterministic case."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "from datasets import Dataset\n\nSYSTEM_PROMPT = (\n    \"You are an expert insurance claims adjuster.\\n\"\n    \"\\n\"\n    \"Available actions (one per line, lowercase, in this order of execution):\\n\"\n    \"  query_policy\\n\"\n    \"  query_claim_history\\n\"\n    \"  check_fraud\\n\"\n    \"  request_documents\\n\"\n    \"  verify_coverage\\n\"\n    \"  verify_purchase\\n\"\n    \"  calculate_payout\\n\"\n    \"  approve <amount>     (terminal)\\n\"\n    \"  deny <reason>        (terminal)\\n\"\n    \"  escalate <reason>    (terminal)\\n\"\n    \"\\n\"\n    \"Information actions cost a small fee; correct terminal verdicts pay big.\\n\"\n    \"Catching fraud via deny pays even more. Output up to 6 actions, one per\\n\"\n    \"line, ending with a terminal action. Do not write anything else.\"\n)\n\n\ndef claim_to_user_msg(scenario_index: int) -> str:\n    env = AdjudicationGym(scenario_index=scenario_index)\n    obs = env.reset()\n    return (\n        f\"New claim arrived:\\n\"\n        f\"  claim_id     : {obs.claim_id}\\n\"\n        f\"  type         : {obs.claim_type}\\n\"\n        f\"  amount       : ${obs.claim_amount_requested:,.2f}\\n\"\n        f\"  claimant     : {obs.claimant_name}\\n\"\n        f\"  incident_date: {obs.incident_date}\\n\"\n        f\"  description  : {obs.description}\\n\"\n        f\"\\nWhat is your action plan?\"\n    )\n\n\ndef make_prompt(scenario_index: int) -> str:\n    msgs = [\n        {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n        {\"role\": \"user\", \"content\": claim_to_user_msg(scenario_index)},\n    ]\n    return tokenizer.apply_chat_template(\n        msgs, tokenize=False, add_generation_prompt=True\n    )\n\n\nCASE_REPEATS = 8  # how many times each of the 8 curated cases appears\nrows = []\nfor repeat in range(CASE_REPEATS):\n    for sidx in range(len(CASE_LIBRARY)):\n        rows.append({\"prompt\": make_prompt(sidx), \"scenario_index\": sidx})\n\ntrain_ds = 
Dataset.from_list(rows).shuffle(seed=42)\nprint(f\"dataset rows: {len(train_ds)}  | unique cases: {len(CASE_LIBRARY)}\")\nprint()\nprint(\"--- example prompt ---\")\nprint(train_ds[0][\"prompt\"][:900])"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## 5. Reward functions (multiple independent signals)\n\n`format_reward_fn`\n:  Did the model emit at least one parseable action and end with a terminal\n   verb? Cheap signal, prevents the model from outputting arbitrary text.\n\n`env_reward_fn`\n:  Replays the parsed trajectory in a deterministic gym pinned to the same\n   `scenario_index` as the prompt. Returns the cumulative env reward\n   (between roughly -16 and +20)."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "import re\n\nACTIONS_SET = set(ACTION_VOCABULARY)\nTERMINALS = {\"approve\", \"deny\", \"escalate\"}\n\n\ndef _coerce_completion(c) -> str:\n    if isinstance(c, list):  # chat-style completions\n        if not c:\n            return \"\"\n        return c[0].get(\"content\", \"\") if isinstance(c[0], dict) else str(c[0])\n    return str(c)\n\n\ndef parse_actions(completion: str) -> list[AdjudicatorAction]:\n    actions: list[AdjudicatorAction] = []\n    for raw in completion.strip().splitlines():\n        line = raw.strip().lstrip(\"-*0123456789. \").lower().strip()\n        if not line:\n            continue\n        parts = line.split(maxsplit=1)\n        verb = parts[0]\n        if verb not in ACTIONS_SET:\n            continue\n        params: dict = {}\n        rest = parts[1] if len(parts) > 1 else \"\"\n        if verb == \"approve\":\n            m = re.search(r\"\\d[\\d,\\.]*\", rest)\n            if m:\n                try:\n                    params[\"amount\"] = float(m.group().replace(\",\", \"\"))\n                except ValueError:\n                    pass\n        elif verb == \"deny\":\n            params[\"reason\"] = (rest or \"policy_violation\")[:80]\n        elif verb == \"escalate\":\n            params[\"reason\"] = (rest or \"manager_review\")[:80]\n        actions.append(AdjudicatorAction(action_type=verb, parameters=params))\n        if verb in TERMINALS:\n            break\n    return actions\n\n\ndef replay(actions: list[AdjudicatorAction], scenario_index: int,\n           max_steps: int = 8) -> tuple[float, str, int]:\n    env = AdjudicationGym(scenario_index=int(scenario_index))\n    env.reset()\n    total = 0.0\n    terminal = \"max_steps\"\n    steps = 0\n    for act in actions[:max_steps]:\n        obs = env.step(act)\n        total += float(obs.reward)\n        steps += 1\n        if obs.done:\n            terminal = act.action_type\n            break\n    return total, terminal, steps\n\n\ndef 
format_reward_fn(prompts, completions, **kwargs) -> list[float]:\n    rewards = []\n    for c in completions:\n        text = _coerce_completion(c)\n        actions = parse_actions(text)\n        if not actions:\n            rewards.append(-1.0)        # zero parseable actions\n            continue\n        ended_in_terminal = actions[-1].action_type in TERMINALS\n        rewards.append(0.5 if ended_in_terminal else -0.25)\n    return rewards\n\n\ndef env_reward_fn(prompts, completions, scenario_index, **kwargs) -> list[float]:\n    rewards = []\n    for c, sidx in zip(completions, scenario_index):\n        text = _coerce_completion(c)\n        actions = parse_actions(text)\n        env_r, _, _ = replay(actions, int(sidx))\n        rewards.append(env_r)\n    return rewards\n\n\n# Sanity checks\nprint(\"=== sanity check ===\")\noptimal = \"query_policy\\ncheck_fraud\\napprove 3500\"\nr_opt, term, steps = replay(parse_actions(optimal), scenario_index=0)\nprint(f\"optimal trace on case 0  -> reward={r_opt:+.2f} terminal={term} steps={steps}\")\n\nbad = \"approve 99999\"  # blind approve on case 0\nr_bad, term, steps = replay(parse_actions(bad), scenario_index=0)\nprint(f\"blind approve on case 0  -> reward={r_bad:+.2f} terminal={term} steps={steps}\")\n\nempty = \"lorem ipsum\"\nr_empty, term, steps = replay(parse_actions(empty), scenario_index=0)\nprint(f\"unparseable on case 0    -> reward={r_empty:+.2f} terminal={term} steps={steps}\")"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
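+   "source": "## 6. GRPO training (real weight updates)\n\n`num_generations=4` means the trainer samples 4 completions per prompt and\ncomputes per-group advantages. How many rollouts one optimizer step consumes\ndepends on how the installed TRL / Unsloth versions interpret\n`per_device_train_batch_size=2` together with `gradient_accumulation_steps=2`:\nrecent TRL releases count completions (not prompts) and require the effective\nbatch to be divisible by `num_generations`.\n\nWith `NUM_GRPO_STEPS=80`, expect a few hundred rollouts in total (~640 if each\nstep consumes 8 completions). Bump it once you confirm the loop is healthy."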
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "from trl import GRPOConfig, GRPOTrainer\nfrom unsloth import is_bfloat16_supported\n\nNUM_GRPO_STEPS = 80\n\ntraining_args = GRPOConfig(\n    output_dir=\"/content/grpo-claims\",\n    learning_rate=5e-6,\n    adam_beta1=0.9,\n    adam_beta2=0.99,\n    weight_decay=0.1,\n    warmup_ratio=0.1,\n    lr_scheduler_type=\"cosine\",\n    optim=\"adamw_8bit\",\n    logging_steps=1,\n\n    per_device_train_batch_size=2,\n    gradient_accumulation_steps=2,\n    num_generations=4,\n    max_prompt_length=512,\n    max_completion_length=256,\n\n    max_steps=NUM_GRPO_STEPS,\n    save_steps=999_999,    # we save the adapter manually at the end\n    report_to=\"none\",\n    # A T4 has no native bf16 support, so fall back to fp16 there.\n    bf16=is_bfloat16_supported(),\n    fp16=not is_bfloat16_supported(),\n\n    temperature=0.9,\n    top_p=0.95,\n    epsilon=0.2,           # PPO clip\n    beta=0.04,             # KL penalty vs reference\n)\n\ntrainer = GRPOTrainer(\n    model=model,\n    processing_class=tokenizer,\n    reward_funcs=[format_reward_fn, env_reward_fn],\n    args=training_args,\n    train_dataset=train_ds,\n)\n\nprint(\"GRPO trainer ready.  starting training...\")\ntrainer.train()\nprint(\"training done.\")"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## 7. Plot training curves\n\nWe plot:\n- mean group reward per step\n- mean per-reward-function score (so you can see format-reward saturate first\n  and env-reward keep climbing)\n- KL vs reference model\n- mean completion length\n\nThese are the curves judges will look at."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "import json\nimport matplotlib.pyplot as plt\nfrom pathlib import Path\n\nlog = trainer.state.log_history\nprint(f\"log entries: {len(log)} | sample keys:\")\nprint(set().union(*(r.keys() for r in log[:20])) if log else \"(empty)\")\n\n\ndef series(key):\n    xs, ys = [], []\n    for entry in log:\n        if key in entry and \"step\" in entry:\n            xs.append(entry[\"step\"])\n            ys.append(entry[key])\n    return xs, ys\n\n\nfig, axes = plt.subplots(2, 2, figsize=(13, 8))\n\nxs, ys = series(\"reward\")\naxes[0, 0].plot(xs, ys, color=\"#1f77b4\")\naxes[0, 0].set_title(\"mean group reward\")\naxes[0, 0].set_xlabel(\"training step\")\naxes[0, 0].set_ylabel(\"reward\")\naxes[0, 0].grid(alpha=0.3)\n\n# per-reward-fn scores (TRL emits e.g. \"rewards/format_reward_fn\")\nfmt_xs, fmt_ys = series(\"rewards/format_reward_fn\")\nenv_xs, env_ys = series(\"rewards/env_reward_fn\")\nif not fmt_ys:\n    fmt_xs, fmt_ys = series(\"rewards/format_reward_fn/mean\")\n    env_xs, env_ys = series(\"rewards/env_reward_fn/mean\")\naxes[0, 1].plot(fmt_xs, fmt_ys, label=\"format reward\", color=\"#2ca02c\")\naxes[0, 1].plot(env_xs, env_ys, label=\"env reward\", color=\"#d62728\")\naxes[0, 1].set_title(\"per-reward-function score\")\naxes[0, 1].set_xlabel(\"training step\")\naxes[0, 1].set_ylabel(\"reward\")\naxes[0, 1].legend()\naxes[0, 1].grid(alpha=0.3)\n\nxs, ys = series(\"kl\")\naxes[1, 0].plot(xs, ys, color=\"#9467bd\")\naxes[1, 0].set_title(\"KL(model || reference)\")\naxes[1, 0].set_xlabel(\"training step\")\naxes[1, 0].set_ylabel(\"kl\")\naxes[1, 0].grid(alpha=0.3)\n\n# series() returns a (xs, ys) tuple, which is always truthy, so test ys explicitly\nxs, ys = series(\"completion_length\")\nif not ys:\n    xs, ys = series(\"completions/mean_length\")\naxes[1, 1].plot(xs, ys, color=\"#ff7f0e\")\naxes[1, 1].set_title(\"mean completion length (tokens)\")\naxes[1, 1].set_xlabel(\"training step\")\naxes[1, 1].set_ylabel(\"tokens\")\naxes[1, 1].grid(alpha=0.3)\n\nfig.tight_layout()\nout_dir = Path(\"/content/grpo-claims\")\nout_dir.mkdir(parents=True, 
exist_ok=True)\nfig.savefig(out_dir / \"grpo_training.png\", dpi=120)\nplt.show()\n\nwith (out_dir / \"training_log.json\").open(\"w\") as fh:\n    json.dump(log, fh, indent=2, default=str)\nprint(\"saved:\", out_dir / \"grpo_training.png\")\nprint(\"saved:\", out_dir / \"training_log.json\")"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## 8. Before / after rollout demo\n\nRoll out the trained adapter and a \"no-LoRA\" baseline on the same case and\ncompare environment reward + the actual generated trajectory."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "FastLanguageModel.for_inference(model)\n\n\ndef generate(prompt_text: str, max_new_tokens: int = 200) -> str:\n    inputs = tokenizer(prompt_text, return_tensors=\"pt\").to(model.device)\n    out = model.generate(\n        **inputs,\n        max_new_tokens=max_new_tokens,\n        do_sample=True,\n        temperature=0.7,\n        top_p=0.9,\n        pad_token_id=tokenizer.pad_token_id,\n    )\n    return tokenizer.decode(out[0, inputs[\"input_ids\"].shape[1]:], skip_special_tokens=True)\n\n\ndef rollout_with_adapter(scenario_index: int, *, with_adapter: bool) -> tuple[str, float]:\n    if with_adapter:\n        model.enable_adapter_layers()\n    else:\n        model.disable_adapter_layers()\n    text = generate(make_prompt(scenario_index))\n    env_r, _, _ = replay(parse_actions(text), scenario_index)\n    return text, env_r\n\n\nfor sidx in range(len(CASE_LIBRARY)):\n    case = CASE_LIBRARY[sidx]\n    print(\"=\" * 72)\n    print(f\"case [{sidx}] {case.claim_id} - {case.claim_type} (${case.claim_amount:,.0f})\")\n\n    base_text, base_r = rollout_with_adapter(sidx, with_adapter=False)\n    print(f\"\\n[BASE  no LoRA]  env reward = {base_r:+.2f}\")\n    print(\"---\")\n    print(base_text.strip())\n\n    trained_text, trained_r = rollout_with_adapter(sidx, with_adapter=True)\n    print(f\"\\n[TRAINED LoRA]   env reward = {trained_r:+.2f}    delta = {trained_r-base_r:+.2f}\")\n    print(\"---\")\n    print(trained_text.strip())\n    print()\n\n# always re-enable adapter at the end\nmodel.enable_adapter_layers()"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## 9. Save the LoRA adapter\n\nWe save the LoRA adapter (small) and a tiny summary JSON. Optionally push to\nthe Hub - judges can then load it with one line."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "from pathlib import Path\nimport json\n\nADAPTER_DIR = Path(\"/content/grpo-claims/lora-adapter\")\nmodel.save_pretrained(str(ADAPTER_DIR))\ntokenizer.save_pretrained(str(ADAPTER_DIR))\nprint(\"saved LoRA adapter to:\", ADAPTER_DIR)\n\nsummary = {\n    \"base_model\": \"unsloth/Qwen2.5-1.5B-Instruct\",\n    \"adapter_method\": \"LoRA r=16, alpha=32\",\n    \"trainer\": \"trl.GRPOTrainer\",\n    \"num_generations\": 4,\n    \"max_steps\": NUM_GRPO_STEPS,\n    \"reward_functions\": [\"format_reward_fn\", \"env_reward_fn\"],\n    \"env\": \"ClaimSense (https://huggingface.co/spaces/akhiilll/claims-env)\",\n}\nwith open(\"/content/grpo-claims/run_summary.json\", \"w\") as fh:\n    json.dump(summary, fh, indent=2)\nprint(json.dumps(summary, indent=2))\n\n\n# OPTIONAL: push the adapter to your namespace.\n# Replace MODEL_REPO with something like \"akhiilll/claims-grpo-qwen2.5-1.5b\".\n#\n# from huggingface_hub import notebook_login\n# notebook_login()\n#\n# MODEL_REPO = \"akhiilll/claims-grpo-qwen2.5-1.5b\"\n# model.push_to_hub(MODEL_REPO)\n# tokenizer.push_to_hub(MODEL_REPO)\nprint(\"(uncomment the push_to_hub block above to publish the adapter)\")"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## Recap\n\nWhat this notebook did:\n\n1. Cloned the OpenEnv-compliant `akhiilll/claims-env` Space into Colab so the\n   adjudication gym runs in-process and is deterministic per case.\n2. Loaded `unsloth/Qwen2.5-1.5B-Instruct` 4-bit, attached LoRA r=16 adapters\n   (~18M trainable params; the load cell prints the exact count).\n3. Built a prompt dataset where every row is pinned to one of the 8 curated\n   cases via `scenario_index`.\n4. Trained for `NUM_GRPO_STEPS` GRPO updates with **two independent reward\n   functions** (format + env-replay) - this is the multi-reward, anti-hack\n   pattern the hackathon guide explicitly recommends.\n5. Plotted reward / KL / completion-length curves and saved them to disk.\n6. Did a per-case before-vs-after rollout demo so reviewers can see the\n   trained adapter's behaviour change.\n7. Saved the LoRA adapter (with an optional `push_to_hub`).\n\n### Links\n- **Environment Space:** https://huggingface.co/spaces/akhiilll/claims-env\n- **Live API:** https://akhiilll-claims-env.hf.space\n- **Repo README:** https://huggingface.co/spaces/akhiilll/claims-env/blob/main/README.md"
+  }
+ ]
 }
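
Section 6 of the notebook says the trainer samples 4 completions per prompt and computes per-group advantages from the two reward functions. The sketch below illustrates that idea in plain Python, assuming the per-function scores are summed with equal weight and then normalised within each group. The helper name `group_advantages`, the `eps` value, and the example scores are hypothetical; this is not TRL's actual implementation, which may differ in details such as the standard-deviation estimator.

```python
from statistics import mean, pstdev


def group_advantages(rewards, num_generations, eps=1e-4):
    """Normalise rewards within consecutive groups of `num_generations`.

    Each group holds the completions sampled for one prompt. A completion's
    advantage is its reward minus the group mean, divided by the group's
    (population) standard deviation plus a small epsilon.
    """
    if len(rewards) % num_generations != 0:
        raise ValueError("len(rewards) must be a multiple of num_generations")
    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mu = mean(group)
        sigma = pstdev(group)
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


# Hypothetical scores for the 4 completions sampled from one prompt.
format_scores = [0.5, 0.5, -0.25, -1.0]   # illustrative format_reward_fn outputs
env_scores = [12.0, -4.0, 0.0, 0.0]       # illustrative env_reward_fn outputs

# Assumption: the two reward functions are combined by an unweighted sum.
totals = [f + e for f, e in zip(format_scores, env_scores)]
for total, adv in zip(totals, group_advantages(totals, num_generations=4)):
    print(f"total reward {total:+6.2f} -> advantage {adv:+.3f}")
```

If all completions in a group receive the same reward, every advantage is zero and that prompt contributes no gradient signal, which is why sampling several diverse completions per prompt matters.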