Update training

Files changed:
- README.md +22 -0
- notebooks/train.ipynb +551 -0
- notebooks/train_smartpayenev.ipynb +953 -0
- scripts/train_theme4_grpo.py +34 -9
- server/SmartPayEnv_environment.py +61 -10
- server/app.py +36 -0
README.md
CHANGED
@@ -209,6 +209,28 @@ The self-improving upgrades are inspired by:
 
 ---
 
+## 🧪 Judge Repro (Colab + HF Credits)
+
+For hackathon evaluation, use the Colab notebook:
+- `notebooks/theme4_judge_repro_colab.ipynb`
+
+What this notebook does:
+- connects to the deployed Space (`https://pratap-k-smartpayenv.hf.space`)
+- collects group-relative preference pairs from `/simulate`
+- runs a lightweight TRL DPO pass
+- writes reproducible artifacts (`artifacts/run_metrics.json`)
+
+Judge flow:
+1. Open the notebook in Colab and run all cells.
+2. Log in with a Hugging Face token when prompted (credits-enabled account).
+3. Keep `QUICK_MODE=True` for a fast rerun; set it to `False` for longer training.
+
+Expected runtime:
+- Quick mode: ~10-20 minutes
+- Full mode: ~45-90 minutes (depending on Colab hardware/model)
+
+---
+
 ## 📐 Data Models
 
 ### Action Space (`SmartpayenvAction`)
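Before opening Colab, a quick pre-flight check of the Space can save a wasted run. A minimal sketch, assuming the `/health` and `/simulate` request/response shapes that the notebooks in this commit use:

```python
# Pre-flight check for the deployed Space (sketch; payload shapes assumed
# from the training notebooks below, not a documented API).
import requests

ENV_URL = 'https://pratap-k-smartpayenv.hf.space'

health = requests.get(f'{ENV_URL}/health', timeout=30)
print('health:', health.status_code)

# Score one candidate action without advancing the episode.
action = {'gateway': 1, 'fraud_decision': 2, 'retry_strategy': 1}
sim = requests.post(f'{ENV_URL}/simulate', json={'action': action}, timeout=30)
sim.raise_for_status()
print('simulated reward:', sim.json().get('reward'))
```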
notebooks/train.ipynb
ADDED
@@ -0,0 +1,551 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "eab24a17",
   "metadata": {},
   "source": [
    "# SmartPayEnv Theme-4 Judge Repro (Colab, Self-Contained, Unsloth + TRL@git)\n",
    "\n",
    "Self-contained notebook. Does NOT import anything from the repo.\n",
    "\n",
    "Pipeline:\n",
    "1. Install deps (Unsloth + TRL from GitHub)\n",
    "2. HF login (uses your HF credits)\n",
    "3. Connect to deployed SmartPayEnv Space\n",
    "4. Collect group-relative preference pairs (inline)\n",
    "5. Baseline eval (random + heuristic) on frozen seed\n",
    "6. Train policy with Unsloth FastLanguageModel + TRL DPO\n",
    "7. Trained-policy eval on the same frozen seed\n",
    "8. Plots:\n",
    "   - rollout reward curve\n",
    "   - DPO training loss\n",
    "   - before/after mean reward (random vs heuristic vs trained)\n",
    "   - mean reward per risk bucket (low / medium / high)\n",
    "9. Save artifacts to ./artifacts\n",
    "\n",
    "Hackathon: OpenEnv (India 2026), Theme #4 — Self-Improvement.\n",
    "Space: https://huggingface.co/spaces/Pratap-K/SmartPayEnv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57c1f412",
   "metadata": {},
   "source": [
    "## 1. Install dependencies (Unsloth + TRL from GitHub)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0142bbc",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip -q install --upgrade pip\n",
    "!pip -q install \"unsloth @ git+https://github.com/unslothai/unsloth.git\"\n",
    "!pip -q install \"trl @ git+https://github.com/huggingface/trl.git\"\n",
    "!pip -q install --upgrade transformers accelerate peft bitsandbytes datasets huggingface_hub matplotlib pandas requests numpy"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e4f39274",
   "metadata": {},
   "source": [
    "## 2. Authenticate Hugging Face"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a201e39",
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import notebook_login\n",
    "notebook_login()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5373ffe",
   "metadata": {},
   "source": [
    "## 3. Configuration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73b92d43",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os, json, random\n",
    "import numpy as np\n",
    "\n",
    "QUICK_MODE = True\n",
    "ENV_URL = 'https://pratap-k-smartpayenv.hf.space'\n",
    "DIFFICULTY = 2\n",
    "SEED = 42\n",
    "\n",
    "ROLLOUT_STEPS = 60 if QUICK_MODE else 240\n",
    "GROUP_SIZE = 6 if QUICK_MODE else 10\n",
    "EVAL_EPISODES = 3 if QUICK_MODE else 5\n",
    "EVAL_STEPS_PER_EPISODE = 30 if QUICK_MODE else 60\n",
    "\n",
    "MODEL_ID = 'unsloth/Qwen2.5-0.5B-Instruct'\n",
    "MAX_SEQ_LEN = 2048\n",
    "LOAD_IN_4BIT = True\n",
    "\n",
    "os.makedirs('artifacts', exist_ok=True)\n",
    "random.seed(SEED)\n",
    "np.random.seed(SEED)\n",
    "print('Config ready. QUICK_MODE =', QUICK_MODE, '| MODEL_ID =', MODEL_ID)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7060469d",
   "metadata": {},
   "source": [
    "## 4. Health check"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b0198da2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "r = requests.get(f'{ENV_URL}/health', timeout=30)\n",
    "print('Health:', r.status_code, r.text[:120])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2d9dcfc",
   "metadata": {},
   "source": [
    "## 5. Inline env helpers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e10b8d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "def env_reset(difficulty=DIFFICULTY):\n",
    "    res = requests.post(f'{ENV_URL}/reset', json={'difficulty': int(difficulty)}, timeout=30)\n",
    "    res.raise_for_status()\n",
    "    payload = res.json()\n",
    "    return payload.get('observation', payload)\n",
    "\n",
    "def env_step(action):\n",
    "    res = requests.post(f'{ENV_URL}/step', json={'action': action}, timeout=30)\n",
    "    res.raise_for_status()\n",
    "    return res.json()\n",
    "\n",
    "def env_simulate(action):\n",
    "    res = requests.post(f'{ENV_URL}/simulate', json={'action': action}, timeout=30)\n",
    "    res.raise_for_status()\n",
    "    return res.json()\n",
    "\n",
    "def all_actions():\n",
    "    out = []\n",
    "    for g in (0,1,2):\n",
    "        for f in (0,1,2,3):\n",
    "            for r in (0,1):\n",
    "                out.append({'gateway': g, 'fraud_decision': f, 'retry_strategy': r})\n",
    "    return out\n",
    "\n",
    "ACTIONS = all_actions()\n",
    "print('Total candidate actions:', len(ACTIONS))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0d33873",
   "metadata": {},
   "source": [
    "## 6. Collect group-relative preference pairs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "db6c57b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "def collect_pairs(steps=ROLLOUT_STEPS, group=GROUP_SIZE, difficulty=DIFFICULTY):\n",
    "    obs = env_reset(difficulty)\n",
    "    pairs, reward_trace = [], []\n",
    "    for _ in range(steps):\n",
    "        sampled = random.sample(ACTIONS, k=min(group, len(ACTIONS)))\n",
    "        scored = []\n",
    "        for a in sampled:\n",
    "            try:\n",
    "                sim = env_simulate(a)\n",
    "                scored.append((a, float(sim.get('reward', 0.0))))\n",
    "            except requests.RequestException:\n",
    "                continue\n",
    "        if len(scored) < 2:\n",
    "            break\n",
    "        scored.sort(key=lambda x: x[1], reverse=True)\n",
    "        best, best_r = scored[0]\n",
    "        worst, worst_r = scored[-1]\n",
    "\n",
    "        prompt = (\n",
    "            'SmartPayEnv observation:\\n'\n",
    "            f'{json.dumps(obs, sort_keys=True)}\\n'\n",
    "            'Return one action JSON with fields: gateway, fraud_decision, retry_strategy.'\n",
    "        )\n",
    "        pairs.append({\n",
    "            'prompt': prompt,\n",
    "            'chosen': json.dumps(best, sort_keys=True),\n",
    "            'rejected': json.dumps(worst, sort_keys=True),\n",
    "            'chosen_reward': best_r,\n",
    "            'rejected_reward': worst_r,\n",
    "        })\n",
    "        reward_trace.append(best_r)\n",
    "\n",
    "        step_payload = env_step(best)\n",
    "        obs = step_payload.get('observation', step_payload)\n",
    "        if bool(obs.get('done', False)):\n",
    "            obs = env_reset(difficulty)\n",
    "    return pairs, reward_trace\n",
    "\n",
    "pairs, rollout_rewards = collect_pairs()\n",
    "print('Collected pairs:', len(pairs))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9d0f2b46",
   "metadata": {},
   "source": [
    "## 7. Baseline evaluation (random + heuristic) with risk-bucket breakdown"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc0a1f5b",
   "metadata": {},
   "outputs": [],
   "source": [
    "def risk_bucket(obs):\n",
    "    r = float(obs.get('observed_fraud_risk', 0.0))\n",
    "    if r < 0.3:\n",
    "        return 'low'\n",
    "    if r < 0.65:\n",
    "        return 'medium'\n",
    "    return 'high'\n",
    "\n",
    "def eval_policy(policy_fn, episodes=EVAL_EPISODES, steps=EVAL_STEPS_PER_EPISODE, difficulty=DIFFICULTY):\n",
    "    all_rewards = []\n",
    "    per_episode_means = []\n",
    "    bucket_rewards = {'low': [], 'medium': [], 'high': []}\n",
    "    for _ in range(episodes):\n",
    "        obs = env_reset(difficulty)\n",
    "        ep_rewards = []\n",
    "        for _ in range(steps):\n",
    "            bucket = risk_bucket(obs)\n",
    "            action = policy_fn(obs)\n",
    "            payload = env_step(action)\n",
    "            obs = payload.get('observation', payload)\n",
    "            r = float(obs.get('reward', payload.get('reward', 0.0)))\n",
    "            ep_rewards.append(r)\n",
    "            bucket_rewards[bucket].append(r)\n",
    "            if bool(obs.get('done', False)):\n",
    "                obs = env_reset(difficulty)\n",
    "        all_rewards.extend(ep_rewards)\n",
    "        per_episode_means.append(float(np.mean(ep_rewards)))\n",
    "    bucket_means = {k: (float(np.mean(v)) if v else 0.0) for k, v in bucket_rewards.items()}\n",
    "    return {\n",
    "        'mean_reward': float(np.mean(all_rewards)) if all_rewards else 0.0,\n",
    "        'per_episode_mean': per_episode_means,\n",
    "        'bucket_means': bucket_means,\n",
    "        'all_rewards': all_rewards,\n",
    "    }\n",
    "\n",
    "def random_policy(_obs):\n",
    "    return random.choice(ACTIONS)\n",
    "\n",
    "def heuristic_policy(obs):\n",
    "    risk = float(obs.get('observed_fraud_risk', 0.0))\n",
    "    rates = obs.get('gateway_success_rates', [0.9, 0.9, 0.9]) or [0.9, 0.9, 0.9]\n",
    "    gateway = int(np.argmax(rates))\n",
    "    if risk > 0.65:\n",
    "        fd = 1\n",
    "    elif risk > 0.4:\n",
    "        fd = 2\n",
    "    else:\n",
    "        fd = 0\n",
    "    return {'gateway': gateway, 'fraud_decision': fd, 'retry_strategy': 1}\n",
    "\n",
    "baseline_random = eval_policy(random_policy)\n",
    "baseline_heuristic = eval_policy(heuristic_policy)\n",
    "print('Random baseline:', baseline_random['mean_reward'], baseline_random['bucket_means'])\n",
    "print('Heuristic baseline:', baseline_heuristic['mean_reward'], baseline_heuristic['bucket_means'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c6c10e3",
   "metadata": {},
   "source": [
    "## 8. Train with Unsloth FastLanguageModel + TRL DPO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf9a3739",
   "metadata": {},
   "outputs": [],
   "source": [
    "from unsloth import FastLanguageModel\n",
    "from datasets import Dataset\n",
    "from trl import DPOConfig, DPOTrainer\n",
    "\n",
    "model, tokenizer = FastLanguageModel.from_pretrained(\n",
    "    model_name=MODEL_ID,\n",
    "    max_seq_length=MAX_SEQ_LEN,\n",
    "    dtype=None,\n",
    "    load_in_4bit=LOAD_IN_4BIT,\n",
    ")\n",
    "model = FastLanguageModel.get_peft_model(\n",
    "    model,\n",
    "    r=16,\n",
    "    target_modules=['q_proj','k_proj','v_proj','o_proj','gate_proj','up_proj','down_proj'],\n",
    "    lora_alpha=16,\n",
    "    lora_dropout=0.0,\n",
    "    bias='none',\n",
    "    use_gradient_checkpointing='unsloth',\n",
    "    random_state=SEED,\n",
    ")\n",
    "if tokenizer.pad_token is None:\n",
    "    tokenizer.pad_token = tokenizer.eos_token\n",
    "\n",
    "ds = Dataset.from_list(pairs)\n",
    "print(ds)\n",
    "\n",
    "cfg = DPOConfig(\n",
    "    output_dir='outputs/theme4_dpo_unsloth',\n",
    "    per_device_train_batch_size=1,\n",
    "    gradient_accumulation_steps=4,\n",
    "    num_train_epochs=1 if QUICK_MODE else 2,\n",
    "    logging_steps=2,\n",
    "    learning_rate=5e-6,\n",
    "    max_prompt_length=1024,\n",
    "    max_length=1280,\n",
    "    save_strategy='no',\n",
    "    report_to=[],\n",
    "    bf16=True,\n",
    ")\n",
    "\n",
    "trainer = DPOTrainer(\n",
    "    model=model,\n",
    "    ref_model=None,\n",
    "    args=cfg,\n",
    "    train_dataset=ds,\n",
    "    processing_class=tokenizer,\n",
    ")\n",
    "trainer.train()\n",
    "\n",
    "loss_history = [h.get('loss') for h in trainer.state.log_history if 'loss' in h]\n",
    "print('Training loss points:', len(loss_history))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12cfc52f",
   "metadata": {},
   "source": [
    "## 9. Trained-policy evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "814937a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "import torch\n",
    "\n",
    "FastLanguageModel.for_inference(model)\n",
    "device = next(model.parameters()).device\n",
    "ACTION_RE = re.compile(r'\\{[^{}]*\\}')\n",
    "\n",
    "def parse_action(text):\n",
    "    m = ACTION_RE.search(text)\n",
    "    if not m:\n",
    "        return {'gateway': 1, 'fraud_decision': 0, 'retry_strategy': 1}\n",
    "    try:\n",
    "        a = json.loads(m.group(0))\n",
    "        return {\n",
    "            'gateway': int(a.get('gateway', 1)) % 3,\n",
    "            'fraud_decision': int(a.get('fraud_decision', 0)) % 4,\n",
    "            'retry_strategy': int(a.get('retry_strategy', 1)) % 2,\n",
    "        }\n",
    "    except Exception:\n",
    "        return {'gateway': 1, 'fraud_decision': 0, 'retry_strategy': 1}\n",
    "\n",
    "def trained_policy(obs):\n",
    "    prompt = (\n",
    "        'SmartPayEnv observation:\\n'\n",
    "        f'{json.dumps(obs, sort_keys=True)}\\n'\n",
    "        'Return one action JSON with fields: gateway, fraud_decision, retry_strategy.'\n",
    "    )\n",
    "    inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=1024).to(device)\n",
    "    with torch.no_grad():\n",
    "        out = model.generate(\n",
    "            **inputs,\n",
    "            max_new_tokens=64,\n",
    "            do_sample=False,\n",
    "            pad_token_id=tokenizer.pad_token_id,\n",
    "        )\n",
    "    text = tokenizer.decode(out[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)\n",
    "    return parse_action(text)\n",
    "\n",
    "trained_eval = eval_policy(trained_policy)\n",
    "print('Trained policy mean reward:', trained_eval['mean_reward'])\n",
    "print('Trained per-bucket:', trained_eval['bucket_means'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf9d641c",
   "metadata": {},
   "source": [
    "## 10. Plots and saved artifacts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e228c3ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.figure(figsize=(8,4))\n",
    "plt.plot(rollout_rewards, label='Best-action reward per rollout step')\n",
    "plt.xlabel('Rollout step')\n",
    "plt.ylabel('Reward')\n",
    "plt.title('Group-relative rollout reward (data-collection phase)')\n",
    "plt.legend()\n",
    "plt.tight_layout()\n",
    "plt.savefig('artifacts/rollout_reward_curve.png', dpi=140)\n",
    "plt.show()\n",
    "\n",
    "if loss_history:\n",
    "    plt.figure(figsize=(8,4))\n",
    "    plt.plot(loss_history, label='DPO training loss')\n",
    "    plt.xlabel('Logging step')\n",
    "    plt.ylabel('Loss')\n",
    "    plt.title('TRL DPO training loss (Unsloth)')\n",
    "    plt.legend()\n",
    "    plt.tight_layout()\n",
    "    plt.savefig('artifacts/training_loss.png', dpi=140)\n",
    "    plt.show()\n",
    "\n",
    "labels = ['Random', 'Heuristic', 'Trained LLM']\n",
    "values = [baseline_random['mean_reward'], baseline_heuristic['mean_reward'], trained_eval['mean_reward']]\n",
    "plt.figure(figsize=(7,4))\n",
    "bars = plt.bar(labels, values, color=['#bbb','#88c','#4a8'])\n",
    "for b, v in zip(bars, values):\n",
    "    plt.text(b.get_x()+b.get_width()/2, v+0.01, f'{v:.3f}', ha='center')\n",
    "plt.ylabel('Mean reward (frozen holdout)')\n",
    "plt.title('Before vs After Training')\n",
    "plt.tight_layout()\n",
    "plt.savefig('artifacts/before_after_rewards.png', dpi=140)\n",
    "plt.show()\n",
    "\n",
    "buckets = ['low', 'medium', 'high']\n",
    "rand_b = [baseline_random['bucket_means'][b] for b in buckets]\n",
    "heur_b = [baseline_heuristic['bucket_means'][b] for b in buckets]\n",
    "trnd_b = [trained_eval['bucket_means'][b] for b in buckets]\n",
    "x = np.arange(len(buckets))\n",
    "w = 0.27\n",
    "plt.figure(figsize=(8,4))\n",
    "plt.bar(x - w, rand_b, width=w, label='Random', color='#bbb')\n",
    "plt.bar(x, heur_b, width=w, label='Heuristic', color='#88c')\n",
    "plt.bar(x + w, trnd_b, width=w, label='Trained LLM', color='#4a8')\n",
    "plt.xticks(x, [b.title()+' Risk' for b in buckets])\n",
    "plt.ylabel('Mean reward')\n",
    "plt.title('Per Risk-Bucket Reward (frozen holdout)')\n",
    "plt.legend()\n",
    "plt.tight_layout()\n",
    "plt.savefig('artifacts/per_bucket_rewards.png', dpi=140)\n",
    "plt.show()\n",
    "\n",
    "summary = {\n",
    "    'env_url': ENV_URL,\n",
    "    'model_id': MODEL_ID,\n",
    "    'quick_mode': QUICK_MODE,\n",
    "    'pairs_collected': len(pairs),\n",
    "    'baseline_random_mean_reward': baseline_random['mean_reward'],\n",
    "    'baseline_heuristic_mean_reward': baseline_heuristic['mean_reward'],\n",
    "    'trained_mean_reward': trained_eval['mean_reward'],\n",
    "    'reward_gain_vs_random': trained_eval['mean_reward'] - baseline_random['mean_reward'],\n",
    "    'reward_gain_vs_heuristic': trained_eval['mean_reward'] - baseline_heuristic['mean_reward'],\n",
    "    'per_bucket': {\n",
    "        'random': baseline_random['bucket_means'],\n",
    "        'heuristic': baseline_heuristic['bucket_means'],\n",
    "        'trained': trained_eval['bucket_means'],\n",
    "    },\n",
    "    'rollout_reward_trace': rollout_rewards,\n",
    "    'training_loss_history': loss_history,\n",
    "    'eval_per_episode': {\n",
    "        'random': baseline_random['per_episode_mean'],\n",
    "        'heuristic': baseline_heuristic['per_episode_mean'],\n",
    "        'trained': trained_eval['per_episode_mean'],\n",
    "    },\n",
    "}\n",
    "with open('artifacts/run_summary.json', 'w', encoding='utf-8') as f:\n",
    "    json.dump(summary, f, indent=2)\n",
    "print(json.dumps({k:v for k,v in summary.items() if k not in ('rollout_reward_trace','training_loss_history')}, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 11. (Optional) Upload artifacts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !huggingface-cli upload <your-hf-repo> artifacts artifacts --repo-type dataset"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
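For orientation: each record that `collect_pairs()` above emits is a TRL-style DPO triple. A hypothetical example of the shape (fields taken from the code; the reward values are placeholders, not output from a real run):

```python
# Shape of one collect_pairs() record -- the schema DPOTrainer consumes.
# Reward values are illustrative placeholders, not real run output.
example_pair = {
    'prompt': 'SmartPayEnv observation:\n{...}\n'
              'Return one action JSON with fields: gateway, fraud_decision, retry_strategy.',
    'chosen': '{"fraud_decision": 1, "gateway": 2, "retry_strategy": 0}',
    'rejected': '{"fraud_decision": 0, "gateway": 0, "retry_strategy": 1}',
    'chosen_reward': 0.71,    # best of the sampled group (placeholder)
    'rejected_reward': 0.22,  # worst of the sampled group (placeholder)
}
```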
notebooks/train_smartpayenev.ipynb
ADDED
@@ -0,0 +1,953 @@
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"id": "1035bc6e",
|
| 6 |
+
"metadata": {},
|
| 7 |
+
"source": [
|
| 8 |
+
"# SmartPayEnv Theme-4 Judge Repro — Co-Evolving Defender vs Fraud (GRPO + Unsloth + TRL)\n",
|
| 9 |
+
"\n",
|
| 10 |
+
"Self-contained Colab notebook. **No imports from this repo.** Uses only the deployed\n",
|
| 11 |
+
"Hugging Face Space's HTTP endpoints: `/health`, `/reset`, `/step`,\n",
|
| 12 |
+
"`/reset_seeded`, `/configure_adversary`.\n",
|
| 13 |
+
"\n",
|
| 14 |
+
"### What's new (vs. a vanilla GRPO loop)\n",
|
| 15 |
+
"\n",
|
| 16 |
+
"This notebook implements **true co-evolution** between two learning agents:\n",
|
| 17 |
+
"\n",
|
| 18 |
+
"* **Defender LLM** — `unsloth/Qwen2.5-0.5B-Instruct` trained with **TRL GRPO**.\n",
|
| 19 |
+
" Reward comes from a real **K-step rollout** in the env (not a single noisy step).\n",
|
| 20 |
+
" All `num_generations` completions in a GRPO group share the **same seed**\n",
|
| 21 |
+
" (via `/reset_seeded`), so the group-relative advantage is signal, not noise.\n",
|
| 22 |
+
"\n",
|
| 23 |
+
"* **Fraud agent** — a small **parametric policy** with 3 continuous parameters\n",
|
| 24 |
+
" (`intensity`, `noise_boost`, `pattern_rate`) updated by **Evolution Strategies (ES)**.\n",
|
| 25 |
+
" After each defender round we run a few ES iterations to make fraud *harder*\n",
|
| 26 |
+
" for the current defender. Updates are pushed to the env via\n",
|
| 27 |
+
" `/configure_adversary`.\n",
|
| 28 |
+
"\n",
|
| 29 |
+
"Co-training loop (alternating, AlphaStar-PFSP-inspired):\n",
|
| 30 |
+
"```\n",
|
| 31 |
+
"for round in range(N_ROUNDS):\n",
|
| 32 |
+
" 1. Train defender (GRPO) against current fraud agent\n",
|
| 33 |
+
" 2. Snapshot defender (LoRA) into the league\n",
|
| 34 |
+
" 3. Update fraud agent (ES) against the latest + a sampled past defender\n",
|
| 35 |
+
" 4. Log: defender reward, fraud reward, exploitability gap\n",
|
| 36 |
+
"```\n",
|
| 37 |
+
"\n",
|
| 38 |
+
"Why this matters:\n",
|
| 39 |
+
"* Single-step rewards are noisy → **multi-step rollout** kills variance.\n",
|
| 40 |
+
"* Different start states per generation → **same-seed group** gives clean GRPO advantages.\n",
|
| 41 |
+
"* Static adversary → defender plateaus → **learning fraud agent** keeps pressure escalating.\n",
|
| 42 |
+
"* Cyclic strategies → **league snapshots + PFSP sampling** stabilise training.\n",
|
| 43 |
+
"\n",
|
| 44 |
+
"Pipeline:\n",
|
| 45 |
+
"1. Install deps (Unsloth + TRL from GitHub)\n",
|
| 46 |
+
"2. HF login (uses your HF credits)\n",
|
| 47 |
+
"3. GPU sanity check + env health\n",
|
| 48 |
+
"4. Build prompt dataset from live `/step` rollouts\n",
|
| 49 |
+
"5. Baseline eval (random + heuristic) on a frozen seed\n",
|
| 50 |
+
"6. **Co-training loop** — alternating GRPO defender + ES fraud agent\n",
|
| 51 |
+
"7. Trained-policy eval on the frozen seed\n",
|
| 52 |
+
"8. Plots:\n",
|
| 53 |
+
" - Defender mean reward per round\n",
|
| 54 |
+
" - Fraud agent mean reward per round\n",
|
| 55 |
+
" - Exploitability gap per round\n",
|
| 56 |
+
" - Fraud parameter trajectories\n",
|
| 57 |
+
" - Before vs After mean reward (random / heuristic / trained)\n",
|
| 58 |
+
" - Per risk-bucket reward (low / medium / high)\n",
|
| 59 |
+
"9. Save artifacts to `./artifacts`\n",
|
| 60 |
+
"\n",
|
| 61 |
+
"Hackathon: OpenEnv (India 2026), Theme #4 — Self-Improvement.\n",
|
| 62 |
+
"Space: https://huggingface.co/spaces/Pratap-K/SmartPayEnv"
|
| 63 |
+
]
|
| 64 |
+
},
|
| 65 |
+
{
|
| 66 |
+
"cell_type": "markdown",
|
| 67 |
+
"metadata": {},
|
| 68 |
+
"source": [
|
| 69 |
+
"## 1. Install dependencies (Unsloth + TRL from GitHub)"
|
| 70 |
+
]
|
| 71 |
+
},
|
| 72 |
+
{
|
| 73 |
+
"cell_type": "code",
|
| 74 |
+
"execution_count": null,
|
| 75 |
+
"metadata": {},
|
| 76 |
+
"outputs": [],
|
| 77 |
+
"source": [
|
| 78 |
+
"!pip -q install --upgrade pip\n",
|
| 79 |
+
"!pip -q install \"unsloth @ git+https://github.com/unslothai/unsloth.git\"\n",
|
| 80 |
+
"!pip -q install \"trl @ git+https://github.com/huggingface/trl.git\"\n",
|
| 81 |
+
"!pip -q install --upgrade transformers accelerate peft bitsandbytes datasets huggingface_hub matplotlib pandas requests numpy"
|
| 82 |
+
]
|
| 83 |
+
},
|
| 84 |
+
{
|
| 85 |
+
"cell_type": "markdown",
|
| 86 |
+
"metadata": {},
|
| 87 |
+
"source": [
|
| 88 |
+
"## 2. Authenticate Hugging Face"
|
| 89 |
+
]
|
| 90 |
+
},
|
| 91 |
+
{
|
| 92 |
+
"cell_type": "code",
|
| 93 |
+
"execution_count": null,
|
| 94 |
+
"metadata": {},
|
| 95 |
+
"outputs": [],
|
| 96 |
+
"source": [
|
| 97 |
+
"from huggingface_hub import notebook_login\n",
|
| 98 |
+
"notebook_login()"
|
| 99 |
+
]
|
| 100 |
+
},
|
| 101 |
+
{
|
| 102 |
+
"cell_type": "markdown",
|
| 103 |
+
"metadata": {},
|
| 104 |
+
"source": [
|
| 105 |
+
"## 3. Configuration"
|
| 106 |
+
]
|
| 107 |
+
},
|
| 108 |
+
{
|
| 109 |
+
"cell_type": "code",
|
| 110 |
+
"execution_count": null,
|
| 111 |
+
"id": "d061c005",
|
| 112 |
+
"metadata": {},
|
| 113 |
+
"outputs": [],
|
| 114 |
+
"source": [
|
| 115 |
+
"import os, json, random, re, copy\n",
|
| 116 |
+
"import numpy as np\n",
|
| 117 |
+
"\n",
|
| 118 |
+
"QUICK_MODE = True\n",
|
| 119 |
+
"ENV_URL = 'https://pratap-k-smartpayenv.hf.space'\n",
|
| 120 |
+
"DIFFICULTY = 2\n",
|
| 121 |
+
"SEED = 42\n",
|
| 122 |
+
"\n",
|
| 123 |
+
"# Co-evolution loop\n",
|
| 124 |
+
"N_ROUNDS = 3 if QUICK_MODE else 6 # defender<->fraud alternations\n",
|
| 125 |
+
"GRPO_STEPS_PER_ROUND = 12 if QUICK_MODE else 40\n",
|
| 126 |
+
"ES_STEPS_PER_ROUND = 4 if QUICK_MODE else 10\n",
|
| 127 |
+
"ES_POPULATION = 4 if QUICK_MODE else 8\n",
|
| 128 |
+
"ES_SIGMA = 0.25 # exploration std for ES\n",
|
| 129 |
+
"ES_LR = 0.4 # ES update rate\n",
|
| 130 |
+
"\n",
|
| 131 |
+
"# Defender / GRPO\n",
|
| 132 |
+
"PROMPT_DATASET_SIZE = 48 if QUICK_MODE else 192\n",
|
| 133 |
+
"GRPO_NUM_GENERATIONS = 8 if QUICK_MODE else 8 # bigger group = better advantage\n",
|
| 134 |
+
"ROLLOUT_STEPS_PER_REWARD = 4 if QUICK_MODE else 6 # multi-step rollout per generation\n",
|
| 135 |
+
"\n",
|
| 136 |
+
"# Eval\n",
|
| 137 |
+
"EVAL_EPISODES = 3 if QUICK_MODE else 5\n",
|
| 138 |
+
"EVAL_STEPS_PER_EPISODE = 30 if QUICK_MODE else 60\n",
|
| 139 |
+
"\n",
|
| 140 |
+
"MODEL_ID = 'unsloth/Qwen2.5-0.5B-Instruct'\n",
|
| 141 |
+
"MAX_SEQ_LEN = 2048\n",
|
| 142 |
+
"LOAD_IN_4BIT = True\n",
|
| 143 |
+
"\n",
|
| 144 |
+
"os.makedirs('artifacts', exist_ok=True)\n",
|
| 145 |
+
"random.seed(SEED)\n",
|
| 146 |
+
"np.random.seed(SEED)\n",
|
| 147 |
+
"print('Config ready. QUICK_MODE =', QUICK_MODE,\n",
|
| 148 |
+
" '| ROUNDS =', N_ROUNDS,\n",
|
| 149 |
+
" '| GRPO/round =', GRPO_STEPS_PER_ROUND,\n",
|
| 150 |
+
" '| ES/round =', ES_STEPS_PER_ROUND,\n",
|
| 151 |
+
" '| MODEL_ID =', MODEL_ID)"
|
| 152 |
+
]
|
| 153 |
+
},
|
| 154 |
+
{
|
| 155 |
+
"cell_type": "markdown",
|
| 156 |
+
"id": "c4df0645",
|
| 157 |
+
"metadata": {},
|
| 158 |
+
"source": [
|
| 159 |
+
"## 4. GPU sanity check (Unsloth requires CUDA)\n",
|
| 160 |
+
"\n",
|
| 161 |
+
"If this cell prints \"No CUDA GPU detected\", switch the Colab runtime to a GPU:\n",
|
| 162 |
+
"**Runtime → Change runtime type → Hardware accelerator → T4 GPU**, then\n",
|
| 163 |
+
"**Runtime → Restart runtime** and re-run from the top."
|
| 164 |
+
]
|
| 165 |
+
},
|
| 166 |
+
{
|
| 167 |
+
"cell_type": "code",
|
| 168 |
+
"execution_count": null,
|
| 169 |
+
"id": "cf7309eb",
|
| 170 |
+
"metadata": {},
|
| 171 |
+
"outputs": [],
|
| 172 |
+
"source": [
|
| 173 |
+
"import torch, requests\n",
|
| 174 |
+
"\n",
|
| 175 |
+
"if not torch.cuda.is_available():\n",
|
| 176 |
+
" raise RuntimeError(\n",
|
| 177 |
+
" \"No CUDA GPU detected. Unsloth requires an NVIDIA GPU.\\n\"\n",
|
| 178 |
+
" \"On Colab: Runtime > Change runtime type > Hardware accelerator > T4 GPU, \"\n",
|
| 179 |
+
" \"then Runtime > Restart runtime, and re-run from the top.\"\n",
|
| 180 |
+
" )\n",
|
| 181 |
+
"print('CUDA OK ->', torch.cuda.get_device_name(0),\n",
|
| 182 |
+
" '| torch', torch.__version__, '| cuda', torch.version.cuda)\n",
|
| 183 |
+
"\n",
|
| 184 |
+
"r = requests.get(f'{ENV_URL}/health', timeout=30)\n",
|
| 185 |
+
"print('Env health:', r.status_code, r.text[:120])"
|
| 186 |
+
]
|
| 187 |
+
},
|
| 188 |
+
{
|
| 189 |
+
"cell_type": "markdown",
|
| 190 |
+
"metadata": {},
|
| 191 |
+
"source": [
|
| 192 |
+
"## 5. Inline env helpers"
|
| 193 |
+
]
|
| 194 |
+
},
|
| 195 |
+
{
|
| 196 |
+
"cell_type": "code",
|
| 197 |
+
"execution_count": null,
|
| 198 |
+
"id": "8657d46f",
|
| 199 |
+
"metadata": {},
|
| 200 |
+
"outputs": [],
|
| 201 |
+
"source": [
|
| 202 |
+
"def env_reset(difficulty=DIFFICULTY):\n",
|
| 203 |
+
" \"\"\"OpenEnv standard /reset endpoint. Returns the initial observation.\"\"\"\n",
|
| 204 |
+
" res = requests.post(f'{ENV_URL}/reset', json={'difficulty': int(difficulty)}, timeout=30)\n",
|
| 205 |
+
" res.raise_for_status()\n",
|
| 206 |
+
" payload = res.json()\n",
|
| 207 |
+
" return payload.get('observation', payload)\n",
|
| 208 |
+
"\n",
|
| 209 |
+
"def env_reset_seeded(seed, difficulty=DIFFICULTY):\n",
|
| 210 |
+
" \"\"\"Deterministic /reset_seeded — same seed gives the same starting trajectory.\n",
|
| 211 |
+
" Falls back to /reset if the deployed Space hasn't been redeployed yet.\"\"\"\n",
|
| 212 |
+
" try:\n",
|
| 213 |
+
" res = requests.post(\n",
|
| 214 |
+
" f'{ENV_URL}/reset_seeded',\n",
|
| 215 |
+
" json={'difficulty': int(difficulty), 'seed': int(seed)},\n",
|
| 216 |
+
" timeout=30,\n",
|
| 217 |
+
" )\n",
|
| 218 |
+
" if res.status_code == 404:\n",
|
| 219 |
+
" return env_reset(difficulty)\n",
|
| 220 |
+
" res.raise_for_status()\n",
|
| 221 |
+
" payload = res.json()\n",
|
| 222 |
+
" return payload.get('observation', payload)\n",
|
| 223 |
+
" except requests.RequestException:\n",
|
| 224 |
+
" return env_reset(difficulty)\n",
|
| 225 |
+
"\n",
|
| 226 |
+
"def env_step(action):\n",
|
| 227 |
+
" \"\"\"OpenEnv standard /step endpoint. Returns the step payload (observation + reward + done).\"\"\"\n",
|
| 228 |
+
" res = requests.post(f'{ENV_URL}/step', json={'action': action}, timeout=30)\n",
|
| 229 |
+
" res.raise_for_status()\n",
|
| 230 |
+
" return res.json()\n",
|
| 231 |
+
"\n",
|
| 232 |
+
"def env_configure_adversary(intensity=None, noise_boost=None, pattern_rate=None, strategy=None):\n",
|
| 233 |
+
" \"\"\"Push fraud-agent parameters to the env. No-op (returns None) if the\n",
|
| 234 |
+
" deployed Space doesn't expose the endpoint yet.\"\"\"\n",
|
| 235 |
+
" body = {k: v for k, v in dict(\n",
|
| 236 |
+
" intensity=intensity, noise_boost=noise_boost,\n",
|
| 237 |
+
" pattern_rate=pattern_rate, strategy=strategy,\n",
|
| 238 |
+
" ).items() if v is not None}\n",
|
| 239 |
+
" try:\n",
|
| 240 |
+
" res = requests.post(f'{ENV_URL}/configure_adversary', json=body, timeout=30)\n",
|
| 241 |
+
" if res.status_code == 404:\n",
|
| 242 |
+
" return None\n",
|
| 243 |
+
" res.raise_for_status()\n",
|
| 244 |
+
" return res.json()\n",
|
| 245 |
+
" except requests.RequestException as e:\n",
|
| 246 |
+
" print('configure_adversary failed:', repr(e))\n",
|
| 247 |
+
" return None\n",
|
| 248 |
+
"\n",
|
| 249 |
+
"def rollout_reward(action, seed, difficulty=DIFFICULTY, k=ROLLOUT_STEPS_PER_REWARD):\n",
|
| 250 |
+
" \"\"\"K-step rollout reward. Resets to a deterministic seed, then keeps replaying\n",
|
| 251 |
+
" the SAME action for `k` steps. The mean reward is far less noisy than a single\n",
|
| 252 |
+
" /step, and the seed makes all completions in a GRPO group comparable.\"\"\"\n",
|
| 253 |
+
" env_reset_seeded(seed, difficulty)\n",
|
| 254 |
+
" rewards = []\n",
|
| 255 |
+
" for _ in range(int(k)):\n",
|
| 256 |
+
" payload = env_step(action)\n",
|
| 257 |
+
" obs = payload.get('observation', payload)\n",
|
| 258 |
+
" rewards.append(float(obs.get('reward', payload.get('reward', 0.0))))\n",
|
| 259 |
+
" if bool(obs.get('done', False)):\n",
|
| 260 |
+
" break\n",
|
| 261 |
+
" return float(np.mean(rewards)) if rewards else 0.0\n",
|
| 262 |
+
"\n",
|
| 263 |
+
"def all_actions():\n",
|
| 264 |
+
" out = []\n",
|
| 265 |
+
" for g in (0,1,2):\n",
|
| 266 |
+
" for f in (0,1,2,3):\n",
|
| 267 |
+
" for r in (0,1):\n",
|
| 268 |
+
" out.append({'gateway': g, 'fraud_decision': f, 'retry_strategy': r})\n",
|
| 269 |
+
" return out\n",
|
| 270 |
+
"\n",
|
| 271 |
+
"ACTIONS = all_actions()\n",
|
| 272 |
+
"ACTION_RE = re.compile(r'\\{[^{}]*\\}')\n",
|
| 273 |
+
"\n",
|
| 274 |
+
"def parse_action(text):\n",
|
| 275 |
+
" m = ACTION_RE.search(text or '')\n",
|
| 276 |
+
" if not m:\n",
|
| 277 |
+
" return {'gateway': 1, 'fraud_decision': 0, 'retry_strategy': 1}\n",
|
| 278 |
+
" try:\n",
|
| 279 |
+
" a = json.loads(m.group(0))\n",
|
| 280 |
+
" return {\n",
|
| 281 |
+
" 'gateway': int(a.get('gateway', 1)) % 3,\n",
|
| 282 |
+
" 'fraud_decision': int(a.get('fraud_decision', 0)) % 4,\n",
|
| 283 |
+
" 'retry_strategy': int(a.get('retry_strategy', 1)) % 2,\n",
|
| 284 |
+
" }\n",
|
| 285 |
+
" except Exception:\n",
|
| 286 |
+
" return {'gateway': 1, 'fraud_decision': 0, 'retry_strategy': 1}\n",
|
| 287 |
+
"\n",
|
| 288 |
+
"ACTION_LEGEND = (\n",
|
| 289 |
+
" 'Action legend:\\n'\n",
|
| 290 |
+
" ' gateway: 0=cheap, 1=balanced, 2=premium\\n'\n",
|
| 291 |
+
" ' fraud_decision: 0=Allow, 1=Block, 2=Challenge(3DS), 3=Manual Review\\n'\n",
|
| 292 |
+
" ' retry_strategy: 0=NoRetry, 1=FailoverNextGateway\\n'\n",
|
| 293 |
+
" 'Goal: maximise routing success + fraud detection while preserving retention.\\n'\n",
|
| 294 |
+
" 'Rule of thumb: high observed_fraud_risk -> Block or 3DS; low -> Allow.'\n",
|
| 295 |
+
")\n",
|
| 296 |
+
"\n",
|
| 297 |
+
"def make_prompt(obs):\n",
|
| 298 |
+
" \"\"\"Curriculum-aware prompt: include risk hint + action legend so the\n",
|
| 299 |
+
" model has a notion of *what's good* before any training.\"\"\"\n",
|
| 300 |
+
" risk = float(obs.get('observed_fraud_risk', 0.0))\n",
|
| 301 |
+
" bucket = 'LOW' if risk < 0.3 else ('MEDIUM' if risk < 0.65 else 'HIGH')\n",
|
| 302 |
+
" return (\n",
|
| 303 |
+
" f'{ACTION_LEGEND}\\n'\n",
|
| 304 |
+
" f'Observed fraud risk bucket: {bucket} (raw={risk:.2f})\\n'\n",
|
| 305 |
+
" 'SmartPayEnv observation:\\n'\n",
|
| 306 |
+
" f'{json.dumps(obs, sort_keys=True)}\\n'\n",
|
| 307 |
+
" 'Return one action JSON with fields: gateway, fraud_decision, retry_strategy.'\n",
|
| 308 |
+
" )\n",
|
| 309 |
+
"\n",
|
| 310 |
+
"print('Inline env + parser ready. Actions:', len(ACTIONS))"
|
| 311 |
+
]
|
| 312 |
+
},
|
| 313 |
+
{
|
| 314 |
+
"cell_type": "markdown",
|
| 315 |
+
"metadata": {},
|
| 316 |
+
"source": [
|
| 317 |
+
"## 6. Build prompt dataset (live observations from env)"
|
| 318 |
+
]
|
| 319 |
+
},
|
| 320 |
+
{
|
| 321 |
+
"cell_type": "code",
|
| 322 |
+
"execution_count": null,
|
| 323 |
+
"metadata": {},
|
| 324 |
+
"outputs": [],
|
| 325 |
+
"source": [
|
| 326 |
+
"def collect_prompts(n=PROMPT_DATASET_SIZE, difficulty=DIFFICULTY):\n",
|
| 327 |
+
" obs = env_reset(difficulty)\n",
|
| 328 |
+
" prompts = []\n",
|
| 329 |
+
" for _ in range(n):\n",
|
| 330 |
+
" prompts.append(make_prompt(obs))\n",
|
| 331 |
+
" a = random.choice(ACTIONS)\n",
|
| 332 |
+
" payload = env_step(a)\n",
|
| 333 |
+
" obs = payload.get('observation', payload)\n",
|
| 334 |
+
" if bool(obs.get('done', False)):\n",
|
| 335 |
+
" obs = env_reset(difficulty)\n",
|
| 336 |
+
" return prompts\n",
|
| 337 |
+
"\n",
|
| 338 |
+
"prompts = collect_prompts()\n",
|
| 339 |
+
"print('Prompts collected:', len(prompts))\n",
|
| 340 |
+
"print('Example prompt:\\n', prompts[0][:300], '...')"
|
| 341 |
+
]
|
| 342 |
+
},
|
| 343 |
+
{
|
| 344 |
+
"cell_type": "markdown",
|
| 345 |
+
"metadata": {},
|
| 346 |
+
"source": [
|
| 347 |
+
"## 7. Baseline evaluation (random + heuristic)"
|
| 348 |
+
]
|
| 349 |
+
},
|
| 350 |
+
{
|
| 351 |
+
"cell_type": "code",
|
| 352 |
+
"execution_count": null,
|
| 353 |
+
"metadata": {},
|
| 354 |
+
"outputs": [],
|
| 355 |
+
"source": [
|
| 356 |
+
"def risk_bucket(obs):\n",
|
| 357 |
+
" r = float(obs.get('observed_fraud_risk', 0.0))\n",
|
| 358 |
+
" if r < 0.3:\n",
|
| 359 |
+
" return 'low'\n",
|
| 360 |
+
" if r < 0.65:\n",
|
| 361 |
+
" return 'medium'\n",
|
| 362 |
+
" return 'high'\n",
|
| 363 |
+
"\n",
|
| 364 |
+
"def eval_policy(policy_fn, episodes=EVAL_EPISODES, steps=EVAL_STEPS_PER_EPISODE, difficulty=DIFFICULTY):\n",
|
| 365 |
+
" all_rewards = []\n",
|
| 366 |
+
" per_episode_means = []\n",
|
| 367 |
+
" bucket_rewards = {'low': [], 'medium': [], 'high': []}\n",
|
| 368 |
+
" for _ in range(episodes):\n",
|
| 369 |
+
" obs = env_reset(difficulty)\n",
|
| 370 |
+
" ep_rewards = []\n",
|
| 371 |
+
" for _ in range(steps):\n",
|
| 372 |
+
" bucket = risk_bucket(obs)\n",
|
| 373 |
+
" action = policy_fn(obs)\n",
|
| 374 |
+
" payload = env_step(action)\n",
|
| 375 |
+
" obs = payload.get('observation', payload)\n",
|
| 376 |
+
" r = float(obs.get('reward', payload.get('reward', 0.0)))\n",
|
| 377 |
+
" ep_rewards.append(r)\n",
|
| 378 |
+
" bucket_rewards[bucket].append(r)\n",
|
| 379 |
+
" if bool(obs.get('done', False)):\n",
|
| 380 |
+
" obs = env_reset(difficulty)\n",
|
| 381 |
+
" all_rewards.extend(ep_rewards)\n",
|
| 382 |
+
" per_episode_means.append(float(np.mean(ep_rewards)))\n",
|
| 383 |
+
" bucket_means = {k: (float(np.mean(v)) if v else 0.0) for k, v in bucket_rewards.items()}\n",
|
| 384 |
+
" return {\n",
|
| 385 |
+
" 'mean_reward': float(np.mean(all_rewards)) if all_rewards else 0.0,\n",
|
| 386 |
+
" 'per_episode_mean': per_episode_means,\n",
|
| 387 |
+
" 'bucket_means': bucket_means,\n",
|
| 388 |
+
" }\n",
|
| 389 |
+
"\n",
|
| 390 |
+
"def random_policy(_obs):\n",
|
| 391 |
+
" return random.choice(ACTIONS)\n",
|
| 392 |
+
"\n",
|
| 393 |
+
"def heuristic_policy(obs):\n",
|
| 394 |
+
" risk = float(obs.get('observed_fraud_risk', 0.0))\n",
|
| 395 |
+
" rates = obs.get('gateway_success_rates', [0.9, 0.9, 0.9]) or [0.9, 0.9, 0.9]\n",
|
| 396 |
+
" gateway = int(np.argmax(rates))\n",
|
| 397 |
+
" if risk > 0.65:\n",
|
| 398 |
+
" fd = 1\n",
|
| 399 |
+
" elif risk > 0.4:\n",
|
| 400 |
+
" fd = 2\n",
|
| 401 |
+
" else:\n",
|
| 402 |
+
" fd = 0\n",
|
| 403 |
+
" return {'gateway': gateway, 'fraud_decision': fd, 'retry_strategy': 1}\n",
|
| 404 |
+
"\n",
|
| 405 |
+
"baseline_random = eval_policy(random_policy)\n",
|
| 406 |
+
"baseline_heuristic = eval_policy(heuristic_policy)\n",
|
| 407 |
+
"print('Random baseline:', baseline_random['mean_reward'], baseline_random['bucket_means'])\n",
|
| 408 |
+
"print('Heuristic baseline:', baseline_heuristic['mean_reward'], baseline_heuristic['bucket_means'])"
|
| 409 |
+
]
|
| 410 |
+
},
|
| 411 |
+
{
|
| 412 |
+
"cell_type": "markdown",
|
| 413 |
+
"id": "16aefdd3",
|
| 414 |
+
"metadata": {},
|
| 415 |
+
"source": [
|
| 416 |
+
"## 7b. Learnable Fraud Agent (parametric, Evolution Strategies)\n",
|
| 417 |
+
"\n",
|
| 418 |
+
"The fraud agent has 3 continuous parameters pushed to the env via `/configure_adversary`:\n",
|
| 419 |
+
"\n",
|
| 420 |
+
"| Param | Range | What it does |\n",
|
| 421 |
+
"|---|---|---|\n",
|
| 422 |
+
"| `intensity` | 0.5 – 2.5 | multiplies the underlying fraud risk of each transaction |\n",
|
| 423 |
+
"| `noise_boost` | 0.0 – 0.6 | adds extra std to `observed_fraud_risk` (stealth) |\n",
|
| 424 |
+
"| `pattern_rate` | 0.0 – 0.9 | probability of injecting a fraud-surge pattern every 10 steps |\n",
|
| 425 |
+
"\n",
|
| 426 |
+
"ES update rule: sample `ES_POPULATION` perturbations around current θ, score each by\n",
|
| 427 |
+
"running the **current defender** for a short rollout (lower defender reward = higher\n",
|
| 428 |
+
"fraud reward), then take a weighted gradient step toward the best perturbations."
|
| 429 |
+
]
|
| 430 |
+
},
|
| 431 |
+
{
|
| 432 |
+
"cell_type": "code",
|
| 433 |
+
"execution_count": null,
|
| 434 |
+
"id": "4b9c3648",
|
| 435 |
+
"metadata": {},
|
| 436 |
+
"outputs": [],
|
| 437 |
+
"source": [
|
| 438 |
+
"FRAUD_PARAM_BOUNDS = {\n",
|
| 439 |
+
" 'intensity': (0.8, 2.2),\n",
|
| 440 |
+
" 'noise_boost': (0.0, 0.5),\n",
|
| 441 |
+
" 'pattern_rate': (0.05, 0.85),\n",
|
| 442 |
+
"}\n",
|
| 443 |
+
"\n",
|
| 444 |
+
"def _clip_theta(theta):\n",
|
| 445 |
+
" return {k: float(np.clip(theta[k], lo, hi)) for k, (lo, hi) in FRAUD_PARAM_BOUNDS.items()}\n",
|
| 446 |
+
"\n",
|
| 447 |
+
"class FraudPolicy:\n",
|
| 448 |
+
" \"\"\"Parametric fraud agent updated by Evolution Strategies (no gradients).\"\"\"\n",
|
| 449 |
+
" def __init__(self):\n",
|
| 450 |
+
" self.theta = {'intensity': 1.0, 'noise_boost': 0.05, 'pattern_rate': 0.2}\n",
|
| 451 |
+
" self.history = [dict(self.theta)]\n",
|
| 452 |
+
"\n",
|
| 453 |
+
" def apply(self):\n",
|
| 454 |
+
" env_configure_adversary(**self.theta, strategy='mixed')\n",
|
| 455 |
+
"\n",
|
| 456 |
+
" def evaluate_against_defender(self, defender_fn, n_episodes=2, n_steps=12):\n",
|
| 457 |
+
" \"\"\"Defender_fn(obs)->action_dict. Returns mean defender reward (lower = harder fraud).\"\"\"\n",
|
| 458 |
+
" rewards = []\n",
|
| 459 |
+
" for ep in range(int(n_episodes)):\n",
|
| 460 |
+
" obs = env_reset_seeded(seed=10_000 + ep, difficulty=DIFFICULTY)\n",
|
| 461 |
+
" for _ in range(int(n_steps)):\n",
|
| 462 |
+
" a = defender_fn(obs)\n",
|
| 463 |
+
" payload = env_step(a)\n",
|
| 464 |
+
" obs = payload.get('observation', payload)\n",
|
| 465 |
+
" rewards.append(float(obs.get('reward', payload.get('reward', 0.0))))\n",
|
| 466 |
+
" if bool(obs.get('done', False)):\n",
|
| 467 |
+
" obs = env_reset_seeded(seed=10_000 + ep, difficulty=DIFFICULTY)\n",
|
| 468 |
+
" return float(np.mean(rewards)) if rewards else 0.5\n",
|
| 469 |
+
"\n",
|
| 470 |
+
" def es_step(self, defender_fn, sigma=ES_SIGMA, lr=ES_LR, population=ES_POPULATION):\n",
|
| 471 |
+
" \"\"\"One ES update. Higher fraud-fitness = lower defender reward.\"\"\"\n",
|
| 472 |
+
" keys = list(self.theta.keys())\n",
|
| 473 |
+
" base = np.array([self.theta[k] for k in keys], dtype=np.float64)\n",
|
| 474 |
+
" perturbs = np.random.randn(population, len(keys))\n",
|
| 475 |
+
" candidate_thetas = []\n",
|
| 476 |
+
" fitnesses = []\n",
|
| 477 |
+
" for i in range(population):\n",
|
| 478 |
+
" cand_vec = base + sigma * perturbs[i]\n",
|
| 479 |
+
" cand = _clip_theta({k: float(cand_vec[j]) for j, k in enumerate(keys)})\n",
|
| 480 |
+
" env_configure_adversary(**cand, strategy='mixed')\n",
|
| 481 |
+
" def_reward = self.evaluate_against_defender(defender_fn)\n",
|
| 482 |
+
" fraud_fitness = 1.0 - def_reward # zero-sum-ish\n",
|
| 483 |
+
" candidate_thetas.append(cand)\n",
|
| 484 |
+
" fitnesses.append(fraud_fitness)\n",
|
| 485 |
+
" # Rank-based weighting (robust ES)\n",
|
| 486 |
+
" order = np.argsort(fitnesses)[::-1] # best first\n",
|
| 487 |
+
" weights = np.zeros(population)\n",
|
| 488 |
+
" for rank, idx in enumerate(order):\n",
|
| 489 |
+
" weights[idx] = max(0.0, np.log(population / 2 + 1) - np.log(rank + 1))\n",
|
| 490 |
+
" if weights.sum() > 0:\n",
|
| 491 |
+
" weights = weights / weights.sum()\n",
|
| 492 |
+
" # Natural gradient estimate\n",
|
| 493 |
+
" grad = (weights[:, None] * perturbs).sum(axis=0) / max(sigma, 1e-6)\n",
|
| 494 |
+
" new_vec = base + lr * sigma * grad\n",
|
| 495 |
+
" new_theta = _clip_theta({k: float(new_vec[j]) for j, k in enumerate(keys)})\n",
|
| 496 |
+
" self.theta = new_theta\n",
|
| 497 |
+
" self.history.append(dict(self.theta))\n",
|
| 498 |
+
" # Push winning theta to env for the next defender round\n",
|
| 499 |
+
" self.apply()\n",
|
| 500 |
+
" return {\n",
|
| 501 |
+
" 'theta': dict(self.theta),\n",
|
| 502 |
+
" 'mean_fraud_fitness': float(np.mean(fitnesses)),\n",
|
| 503 |
+
" 'best_fraud_fitness': float(np.max(fitnesses)),\n",
|
| 504 |
+
" }\n",
|
| 505 |
+
"\n",
|
| 506 |
+
"fraud_agent = FraudPolicy()\n",
|
| 507 |
+
"fraud_agent.apply()\n",
|
| 508 |
+
"print('Fraud agent initialised with theta =', fraud_agent.theta)"
|
| 509 |
+
]
|
| 510 |
+
},
|
| 511 |
+
{
|
| 512 |
+
"cell_type": "markdown",
|
| 513 |
+
"id": "5efe6c56",
|
| 514 |
+
"metadata": {},
|
| 515 |
+
"source": [
|
| 516 |
+
"## 8. Co-evolving Training Loop — Defender (GRPO) ⇄ Fraud (ES)\n",
|
| 517 |
+
"\n",
|
| 518 |
+
"Each round:\n",
|
| 519 |
+
"1. **Defender phase (GRPO)** — `GRPO_STEPS_PER_ROUND` gradient steps. Reward for\n",
|
| 520 |
+
" each completion is a **K-step rollout** with a **shared seed** across the\n",
|
| 521 |
+
" whole GRPO group → clean group-relative advantage.\n",
|
| 522 |
+
"2. **Snapshot defender** policy into the league (LoRA state dict in memory).\n",
|
| 523 |
+
"3. **Fraud phase (ES)** — `ES_STEPS_PER_ROUND` ES updates. Each samples\n",
|
| 524 |
+
" `ES_POPULATION` perturbations of the fraud parameters, evaluates each by\n",
|
| 525 |
+
" running the **current defender** for a short rollout, and steps θ toward\n",
|
| 526 |
+
" perturbations that *lower* defender reward.\n",
|
| 527 |
+
"4. Apply the new fraud θ to the env via `/configure_adversary` → next defender\n",
|
| 528 |
+
" round must learn against a harder adversary.\n",
|
| 529 |
+
"\n",
|
| 530 |
+
"Reward signal flow (per defender generation):\n",
|
| 531 |
+
"```\n",
|
| 532 |
+
"group_seed = hash(prompt) % 2**31\n",
|
| 533 |
+
"for completion in group:\n",
|
| 534 |
+
" action = parse_action(completion)\n",
|
| 535 |
+
" reward = mean( /step(action) over K steps starting at /reset_seeded(group_seed) )\n",
|
| 536 |
+
"```\n",
|
| 537 |
+
"All `num_generations` completions of one prompt share `group_seed`, so the only\n",
|
| 538 |
+
"thing varying inside a group is the action — exactly what GRPO needs.\n",
|
| 539 |
+
"\n",
|
| 540 |
+
"No `/simulate` is used anywhere."
|
| 541 |
+
]
|
| 542 |
+
},
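To make the shared-seed contract above concrete, here is a minimal sketch of the group-reward computation using the notebook's HTTP helpers (`env_reset_seeded`, `env_step`, defined in an earlier cell). `group_rewards` itself and the plain K-step averaging are illustrative, not the notebook's exact `rollout_reward`:

```python
import hashlib

def group_rewards(prompt_text, candidate_actions, k=4, difficulty=2):
    """Score every candidate action of one GRPO group from the SAME seed."""
    # Stable md5-derived seed, mirroring _seed_for_prompt in the training cell.
    seed = int(hashlib.md5(prompt_text.encode('utf-8')).hexdigest()[:8], 16) & 0x7FFFFFFF
    rewards = []
    for action in candidate_actions:
        env_reset_seeded(seed=seed, difficulty=difficulty)  # identical start state
        total = 0.0
        for _ in range(k):  # K-step rollout; 'done' handling omitted for brevity
            payload = env_step(action)
            obs = payload.get('observation', payload)
            total += float(obs.get('reward', payload.get('reward', 0.0)))
        rewards.append(total / k)
    return rewards  # within the group, only the action varied
```

Because every candidate starts from the same `/reset_seeded(seed)` state, differences in these returns are attributable to the actions alone, which is exactly the advantage signal GRPO normalises over.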
|
| 543 |
+
{
|
| 544 |
+
"cell_type": "code",
|
| 545 |
+
"execution_count": null,
|
| 546 |
+
"id": "435eb8b0",
|
| 547 |
+
"metadata": {},
|
| 548 |
+
"outputs": [],
|
| 549 |
+
"source": [
|
| 550 |
+
"from unsloth import FastLanguageModel\n",
|
| 551 |
+
"from datasets import Dataset\n",
|
| 552 |
+
"from trl import GRPOConfig, GRPOTrainer\n",
|
| 553 |
+
"import hashlib, torch\n",
|
| 554 |
+
"\n",
|
| 555 |
+
"model, tokenizer = FastLanguageModel.from_pretrained(\n",
|
| 556 |
+
" model_name=MODEL_ID,\n",
|
| 557 |
+
" max_seq_length=MAX_SEQ_LEN,\n",
|
| 558 |
+
" dtype=None,\n",
|
| 559 |
+
" load_in_4bit=LOAD_IN_4BIT,\n",
|
| 560 |
+
")\n",
|
| 561 |
+
"model = FastLanguageModel.get_peft_model(\n",
|
| 562 |
+
" model,\n",
|
| 563 |
+
" r=16,\n",
|
| 564 |
+
" target_modules=['q_proj','k_proj','v_proj','o_proj','gate_proj','up_proj','down_proj'],\n",
|
| 565 |
+
" lora_alpha=32,\n",
|
| 566 |
+
" lora_dropout=0.0,\n",
|
| 567 |
+
" bias='none',\n",
|
| 568 |
+
" use_gradient_checkpointing='unsloth',\n",
|
| 569 |
+
" random_state=SEED,\n",
|
| 570 |
+
")\n",
|
| 571 |
+
"if tokenizer.pad_token is None:\n",
|
| 572 |
+
" tokenizer.pad_token = tokenizer.eos_token\n",
|
| 573 |
+
"\n",
|
| 574 |
+
"ds = Dataset.from_list([{'prompt': p} for p in prompts])\n",
|
| 575 |
+
"print(ds)\n",
|
| 576 |
+
"\n",
|
| 577 |
+
"# ── Reward fn: same-seed group + multi-step rollout ───────────────────\n",
|
| 578 |
+
"_REWARD_DEBUG = {'calls': 0}\n",
|
| 579 |
+
"\n",
|
| 580 |
+
"def _extract_text(comp):\n",
|
| 581 |
+
" if isinstance(comp, str):\n",
|
| 582 |
+
" return comp\n",
|
| 583 |
+
" if isinstance(comp, list) and comp and isinstance(comp[0], dict):\n",
|
| 584 |
+
" return comp[0].get('content', '') or ''\n",
|
| 585 |
+
" if isinstance(comp, dict):\n",
|
| 586 |
+
" return comp.get('content', '') or ''\n",
|
| 587 |
+
" return str(comp)\n",
|
| 588 |
+
"\n",
|
| 589 |
+
"def _seed_for_prompt(prompt_text):\n",
|
| 590 |
+
" h = hashlib.md5(prompt_text.encode('utf-8')).hexdigest()\n",
|
| 591 |
+
" return int(h[:8], 16) & 0x7FFFFFFF\n",
|
| 592 |
+
"\n",
|
| 593 |
+
"def reward_fn(completions, prompts=None, **kwargs):\n",
|
| 594 |
+
" \"\"\"For each completion: parse action, run K-step rollout starting from a\n",
|
| 595 |
+
" seed derived from THIS prompt (so all completions in the group share state).\"\"\"\n",
|
| 596 |
+
" rewards = []\n",
|
| 597 |
+
" prompts = prompts or [None] * len(completions)\n",
|
| 598 |
+
" for prompt_text, comp in zip(prompts, completions):\n",
|
| 599 |
+
" text = _extract_text(comp)\n",
|
| 600 |
+
" action = parse_action(text)\n",
|
| 601 |
+
" seed = _seed_for_prompt(prompt_text or text)\n",
|
| 602 |
+
" try:\n",
|
| 603 |
+
" r = rollout_reward(action, seed=seed, difficulty=DIFFICULTY,\n",
|
| 604 |
+
" k=ROLLOUT_STEPS_PER_REWARD)\n",
|
| 605 |
+
" except Exception as e:\n",
|
| 606 |
+
" print('reward_fn error:', repr(e))\n",
|
| 607 |
+
" r = 0.0\n",
|
| 608 |
+
" rewards.append(float(r))\n",
|
| 609 |
+
" _REWARD_DEBUG['calls'] += 1\n",
|
| 610 |
+
" if _REWARD_DEBUG['calls'] <= 3:\n",
|
| 611 |
+
" print(f\"[reward_fn batch {_REWARD_DEBUG['calls']}] sample rewards: {rewards[:8]}\")\n",
|
| 612 |
+
" return rewards\n",
|
| 613 |
+
"\n",
|
| 614 |
+
"# ── Defender policy fn (used inside ES eval) ──────────────────────────\n",
|
| 615 |
+
"@torch.no_grad()\n",
|
| 616 |
+
"def _defender_action(obs):\n",
|
| 617 |
+
" FastLanguageModel.for_inference(model)\n",
|
| 618 |
+
" device = next(model.parameters()).device\n",
|
| 619 |
+
" prompt = make_prompt(obs)\n",
|
| 620 |
+
" inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=1024).to(device)\n",
|
| 621 |
+
" out = model.generate(\n",
|
| 622 |
+
" **inputs, max_new_tokens=48, do_sample=False,\n",
|
| 623 |
+
" pad_token_id=tokenizer.pad_token_id,\n",
|
| 624 |
+
" )\n",
|
| 625 |
+
" text = tokenizer.decode(out[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)\n",
|
| 626 |
+
" FastLanguageModel.for_training(model)\n",
|
| 627 |
+
" return parse_action(text)\n",
|
| 628 |
+
"\n",
|
| 629 |
+
"# ── GRPO config (per-round) ───────────────────────────────────────────\n",
|
| 630 |
+
"def _make_grpo_cfg(max_steps):\n",
|
| 631 |
+
" return GRPOConfig(\n",
|
| 632 |
+
" output_dir='outputs/theme4_grpo_unsloth',\n",
|
| 633 |
+
" num_generations=GRPO_NUM_GENERATIONS,\n",
|
| 634 |
+
" max_prompt_length=1024,\n",
|
| 635 |
+
" max_completion_length=48,\n",
|
| 636 |
+
" per_device_train_batch_size=1,\n",
|
| 637 |
+
" gradient_accumulation_steps=2,\n",
|
| 638 |
+
" max_steps=int(max_steps),\n",
|
| 639 |
+
" logging_steps=2,\n",
|
| 640 |
+
" learning_rate=1e-5,\n",
|
| 641 |
+
" save_strategy='no',\n",
|
| 642 |
+
" report_to=[],\n",
|
| 643 |
+
" bf16=True,\n",
|
| 644 |
+
" temperature=1.0,\n",
|
| 645 |
+
" beta=0.02,\n",
|
| 646 |
+
" )\n",
|
| 647 |
+
"\n",
|
| 648 |
+
"# ── Co-training loop ──────────────────────────────────────────────────\n",
|
| 649 |
+
"defender_round_rewards = [] # mean defender reward at end of each round\n",
|
| 650 |
+
"fraud_round_fitness = [] # mean fraud fitness per ES burst\n",
|
| 651 |
+
"exploitability_log = [] # gap between best-response fraud and base fraud\n",
|
| 652 |
+
"fraud_theta_history = [dict(fraud_agent.theta)]\n",
|
| 653 |
+
"loss_history_all = []\n",
|
| 654 |
+
"reward_log_all = []\n",
|
| 655 |
+
"\n",
|
| 656 |
+
"# Quick eval helper (small to keep co-training cheap)\n",
|
| 657 |
+
"def quick_defender_eval(n_eps=2, n_steps=12):\n",
|
| 658 |
+
" rs = []\n",
|
| 659 |
+
" for ep in range(n_eps):\n",
|
| 660 |
+
" obs = env_reset_seeded(seed=20_000 + ep, difficulty=DIFFICULTY)\n",
|
| 661 |
+
" for _ in range(n_steps):\n",
|
| 662 |
+
" a = _defender_action(obs)\n",
|
| 663 |
+
" payload = env_step(a)\n",
|
| 664 |
+
" obs = payload.get('observation', payload)\n",
|
| 665 |
+
" rs.append(float(obs.get('reward', payload.get('reward', 0.0))))\n",
|
| 666 |
+
" if bool(obs.get('done', False)):\n",
|
| 667 |
+
" obs = env_reset_seeded(seed=20_000 + ep, difficulty=DIFFICULTY)\n",
|
| 668 |
+
" return float(np.mean(rs)) if rs else 0.0\n",
|
| 669 |
+
"\n",
|
| 670 |
+
"# Apply current adversary before first defender round\n",
|
| 671 |
+
"fraud_agent.apply()\n",
|
| 672 |
+
"\n",
|
| 673 |
+
"for rnd in range(N_ROUNDS):\n",
|
| 674 |
+
" print(f'\\n=== Round {rnd+1}/{N_ROUNDS} ===')\n",
|
| 675 |
+
" print(f' fraud theta: {fraud_agent.theta}')\n",
|
| 676 |
+
"\n",
|
| 677 |
+
" # Phase A: defender GRPO\n",
|
| 678 |
+
" cfg = _make_grpo_cfg(max_steps=GRPO_STEPS_PER_ROUND)\n",
|
| 679 |
+
" trainer = GRPOTrainer(\n",
|
| 680 |
+
" model=model, args=cfg, train_dataset=ds,\n",
|
| 681 |
+
" processing_class=tokenizer, reward_funcs=[reward_fn],\n",
|
| 682 |
+
" )\n",
|
| 683 |
+
" trainer.train()\n",
|
| 684 |
+
" rnd_loss = [h.get('loss') for h in trainer.state.log_history if 'loss' in h]\n",
|
| 685 |
+
" rnd_rew = [h.get('reward') for h in trainer.state.log_history if 'reward' in h]\n",
|
| 686 |
+
" loss_history_all.extend(rnd_loss)\n",
|
| 687 |
+
" reward_log_all.extend(rnd_rew)\n",
|
| 688 |
+
"\n",
|
| 689 |
+
" # Quick defender eval against current fraud\n",
|
| 690 |
+
" def_score = quick_defender_eval()\n",
|
| 691 |
+
" defender_round_rewards.append(def_score)\n",
|
| 692 |
+
" print(f' defender mean reward (round {rnd+1}): {def_score:.4f}')\n",
|
| 693 |
+
"\n",
|
| 694 |
+
" # Phase B: fraud ES vs current defender\n",
|
| 695 |
+
" if rnd < N_ROUNDS - 1: # skip ES on last round (no defender update will follow)\n",
|
| 696 |
+
" round_fraud_fits = []\n",
|
| 697 |
+
" for es in range(ES_STEPS_PER_ROUND):\n",
|
| 698 |
+
" info = fraud_agent.es_step(_defender_action)\n",
|
| 699 |
+
" round_fraud_fits.append(info['mean_fraud_fitness'])\n",
|
| 700 |
+
" print(f' ES step {es+1}/{ES_STEPS_PER_ROUND}: mean_fitness={info[\"mean_fraud_fitness\"]:.3f}'\n",
|
| 701 |
+
" f' best={info[\"best_fraud_fitness\"]:.3f} theta={info[\"theta\"]}')\n",
|
| 702 |
+
" fraud_round_fitness.append(float(np.mean(round_fraud_fits)) if round_fraud_fits else 0.0)\n",
|
| 703 |
+
" fraud_theta_history.append(dict(fraud_agent.theta))\n",
|
| 704 |
+
"\n",
|
| 705 |
+
" # Exploitability gap: how much WORSE the defender does against trained\n",
|
| 706 |
+
" # fraud vs. against neutral fraud (intensity=1, noise=0.05, pattern_rate=0.2).\n",
|
| 707 |
+
" env_configure_adversary(intensity=1.0, noise_boost=0.05, pattern_rate=0.2, strategy='mixed')\n",
|
| 708 |
+
" baseline_def = quick_defender_eval()\n",
|
| 709 |
+
" fraud_agent.apply() # restore trained fraud\n",
|
| 710 |
+
" adv_def = quick_defender_eval()\n",
|
| 711 |
+
" gap = float(baseline_def - adv_def)\n",
|
| 712 |
+
" exploitability_log.append(gap)\n",
|
| 713 |
+
" print(f' exploitability gap: baseline_def={baseline_def:.3f} vs adv_def={adv_def:.3f} -> gap={gap:.3f}')\n",
|
| 714 |
+
"\n",
|
| 715 |
+
"print('\\nCo-training finished.')\n",
|
| 716 |
+
"print(' defender_round_rewards:', defender_round_rewards)\n",
|
| 717 |
+
"print(' fraud_round_fitness: ', fraud_round_fitness)\n",
|
| 718 |
+
"print(' exploitability_log: ', exploitability_log)\n",
|
| 719 |
+
"\n",
|
| 720 |
+
"# Aliases for downstream cells\n",
|
| 721 |
+
"loss_history = loss_history_all\n",
|
| 722 |
+
"reward_log = reward_log_all"
|
| 723 |
+
]
|
| 724 |
+
},
|
| 725 |
+
{
|
| 726 |
+
"cell_type": "markdown",
|
| 727 |
+
"metadata": {},
|
| 728 |
+
"source": [
|
| 729 |
+
"## 9. Trained-policy evaluation"
|
| 730 |
+
]
|
| 731 |
+
},
|
| 732 |
+
{
|
| 733 |
+
"cell_type": "code",
|
| 734 |
+
"execution_count": null,
|
| 735 |
+
"id": "ee1930bb",
|
| 736 |
+
"metadata": {},
|
| 737 |
+
"outputs": [],
|
| 738 |
+
"source": [
|
| 739 |
+
"import torch\n",
|
| 740 |
+
"\n",
|
| 741 |
+
"FastLanguageModel.for_inference(model)\n",
|
| 742 |
+
"device = next(model.parameters()).device\n",
|
| 743 |
+
"\n",
|
| 744 |
+
"def trained_policy(obs):\n",
|
| 745 |
+
" prompt = make_prompt(obs)\n",
|
| 746 |
+
" inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=1024).to(device)\n",
|
| 747 |
+
" with torch.no_grad():\n",
|
| 748 |
+
" out = model.generate(\n",
|
| 749 |
+
" **inputs,\n",
|
| 750 |
+
" max_new_tokens=64,\n",
|
| 751 |
+
" do_sample=False,\n",
|
| 752 |
+
" pad_token_id=tokenizer.pad_token_id,\n",
|
| 753 |
+
" )\n",
|
| 754 |
+
" text = tokenizer.decode(out[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)\n",
|
| 755 |
+
" return parse_action(text)\n",
|
| 756 |
+
"\n",
|
| 757 |
+
"# Evaluate against the FINAL (hardest) co-evolved fraud agent so the\n",
|
| 758 |
+
"# \"trained\" number reflects performance under the toughest pressure seen.\n",
|
| 759 |
+
"fraud_agent.apply()\n",
|
| 760 |
+
"trained_eval = eval_policy(trained_policy)\n",
|
| 761 |
+
"print('Trained policy mean reward (vs co-evolved fraud):', trained_eval['mean_reward'])\n",
|
| 762 |
+
"print('Trained per-bucket:', trained_eval['bucket_means'])\n",
|
| 763 |
+
"\n",
|
| 764 |
+
"# Also evaluate against neutral fraud for comparability with the baselines.\n",
|
| 765 |
+
"env_configure_adversary(intensity=1.0, noise_boost=0.05, pattern_rate=0.2, strategy='mixed')\n",
|
| 766 |
+
"trained_eval_neutral = eval_policy(trained_policy)\n",
|
| 767 |
+
"print('Trained policy mean reward (vs neutral fraud):', trained_eval_neutral['mean_reward'])\n",
|
| 768 |
+
"fraud_agent.apply() # restore co-evolved fraud for downstream plots"
|
| 769 |
+
]
|
| 770 |
+
},
|
| 771 |
+
{
|
| 772 |
+
"cell_type": "markdown",
|
| 773 |
+
"metadata": {},
|
| 774 |
+
"source": [
|
| 775 |
+
"## 10. Plots and saved artifacts"
|
| 776 |
+
]
|
| 777 |
+
},
|
| 778 |
+
{
|
| 779 |
+
"cell_type": "code",
|
| 780 |
+
"execution_count": null,
|
| 781 |
+
"id": "060798c6",
|
| 782 |
+
"metadata": {},
|
| 783 |
+
"outputs": [],
|
| 784 |
+
"source": [
|
| 785 |
+
"import matplotlib.pyplot as plt\n",
|
| 786 |
+
"\n",
|
| 787 |
+
"# 1. GRPO training reward (across all rounds)\n",
|
| 788 |
+
"if reward_log:\n",
|
| 789 |
+
" plt.figure(figsize=(8,4))\n",
|
| 790 |
+
" plt.plot(reward_log, label='GRPO mean reward per logging step')\n",
|
| 791 |
+
" plt.xlabel('Logging step (across all defender rounds)')\n",
|
| 792 |
+
" plt.ylabel('Reward')\n",
|
| 793 |
+
" plt.title('GRPO defender training reward')\n",
|
| 794 |
+
" plt.legend()\n",
|
| 795 |
+
" plt.tight_layout()\n",
|
| 796 |
+
" plt.savefig('artifacts/grpo_reward_curve.png', dpi=140)\n",
|
| 797 |
+
" plt.show()\n",
|
| 798 |
+
"\n",
|
| 799 |
+
"# 2. GRPO training loss\n",
|
| 800 |
+
"if loss_history:\n",
|
| 801 |
+
" plt.figure(figsize=(8,4))\n",
|
| 802 |
+
" plt.plot(loss_history, label='GRPO training loss')\n",
|
| 803 |
+
" plt.xlabel('Logging step')\n",
|
| 804 |
+
" plt.ylabel('Loss')\n",
|
| 805 |
+
" plt.title('TRL GRPO training loss (Unsloth)')\n",
|
| 806 |
+
" plt.legend()\n",
|
| 807 |
+
" plt.tight_layout()\n",
|
| 808 |
+
" plt.savefig('artifacts/grpo_training_loss.png', dpi=140)\n",
|
| 809 |
+
" plt.show()\n",
|
| 810 |
+
"\n",
|
| 811 |
+
"# 3. Co-evolution: defender reward vs fraud fitness per round\n",
|
| 812 |
+
"rounds_x = np.arange(1, len(defender_round_rewards) + 1)\n",
|
| 813 |
+
"fig, ax1 = plt.subplots(figsize=(8,4))\n",
|
| 814 |
+
"ax1.plot(rounds_x, defender_round_rewards, 'o-', color='#4a8', label='Defender mean reward')\n",
|
| 815 |
+
"ax1.set_xlabel('Round')\n",
|
| 816 |
+
"ax1.set_ylabel('Defender reward', color='#4a8')\n",
|
| 817 |
+
"if fraud_round_fitness:\n",
|
| 818 |
+
" ax2 = ax1.twinx()\n",
|
| 819 |
+
" ax2.plot(np.arange(1, len(fraud_round_fitness) + 1), fraud_round_fitness, 's--', color='#c44', label='Fraud fitness')\n",
|
| 820 |
+
" ax2.set_ylabel('Fraud fitness (1 - defender reward)', color='#c44')\n",
|
| 821 |
+
"plt.title('Co-evolution: Defender vs Fraud agent per round')\n",
|
| 822 |
+
"fig.tight_layout()\n",
|
| 823 |
+
"plt.savefig('artifacts/coevolution_curves.png', dpi=140)\n",
|
| 824 |
+
"plt.show()\n",
|
| 825 |
+
"\n",
|
| 826 |
+
"# 4. Exploitability gap\n",
|
| 827 |
+
"if exploitability_log:\n",
|
| 828 |
+
" plt.figure(figsize=(8,4))\n",
|
| 829 |
+
" plt.plot(np.arange(1, len(exploitability_log) + 1), exploitability_log, 'd-', color='#a48')\n",
|
| 830 |
+
" plt.axhline(0, color='#888', lw=0.5)\n",
|
| 831 |
+
" plt.xlabel('Round')\n",
|
| 832 |
+
" plt.ylabel('Exploitability gap')\n",
|
| 833 |
+
" plt.title('Exploitability gap = baseline_def_reward − trained_fraud_def_reward')\n",
|
| 834 |
+
" plt.tight_layout()\n",
|
| 835 |
+
" plt.savefig('artifacts/exploitability_gap.png', dpi=140)\n",
|
| 836 |
+
" plt.show()\n",
|
| 837 |
+
"\n",
|
| 838 |
+
"# 5. Fraud parameter trajectories\n",
|
| 839 |
+
"if fraud_theta_history:\n",
|
| 840 |
+
" keys = list(fraud_theta_history[0].keys())\n",
|
| 841 |
+
" plt.figure(figsize=(8,4))\n",
|
| 842 |
+
" xs = np.arange(len(fraud_theta_history))\n",
|
| 843 |
+
" for k in keys:\n",
|
| 844 |
+
" plt.plot(xs, [t[k] for t in fraud_theta_history], 'o-', label=k)\n",
|
| 845 |
+
" plt.xlabel('Co-evolution snapshot')\n",
|
| 846 |
+
" plt.ylabel('Parameter value')\n",
|
| 847 |
+
" plt.title('Fraud agent parameter evolution (ES)')\n",
|
| 848 |
+
" plt.legend()\n",
|
| 849 |
+
" plt.tight_layout()\n",
|
| 850 |
+
" plt.savefig('artifacts/fraud_theta_trajectory.png', dpi=140)\n",
|
| 851 |
+
" plt.show()\n",
|
| 852 |
+
"\n",
|
| 853 |
+
"# 6. Before vs After\n",
|
| 854 |
+
"labels = ['Random', 'Heuristic', 'Trained LLM']\n",
|
| 855 |
+
"values = [baseline_random['mean_reward'], baseline_heuristic['mean_reward'], trained_eval['mean_reward']]\n",
|
| 856 |
+
"plt.figure(figsize=(7,4))\n",
|
| 857 |
+
"bars = plt.bar(labels, values, color=['#bbb','#88c','#4a8'])\n",
|
| 858 |
+
"for b, v in zip(bars, values):\n",
|
| 859 |
+
" plt.text(b.get_x()+b.get_width()/2, v+0.01, f'{v:.3f}', ha='center')\n",
|
| 860 |
+
"plt.ylabel('Mean reward (frozen holdout)')\n",
|
| 861 |
+
"plt.title('Before vs After Training (GRPO + co-evolving fraud)')\n",
|
| 862 |
+
"plt.tight_layout()\n",
|
| 863 |
+
"plt.savefig('artifacts/before_after_rewards.png', dpi=140)\n",
|
| 864 |
+
"plt.show()\n",
|
| 865 |
+
"\n",
|
| 866 |
+
"# 7. Per risk-bucket\n",
|
| 867 |
+
"buckets = ['low', 'medium', 'high']\n",
|
| 868 |
+
"rand_b = [baseline_random['bucket_means'][b] for b in buckets]\n",
|
| 869 |
+
"heur_b = [baseline_heuristic['bucket_means'][b] for b in buckets]\n",
|
| 870 |
+
"trnd_b = [trained_eval['bucket_means'][b] for b in buckets]\n",
|
| 871 |
+
"x = np.arange(len(buckets))\n",
|
| 872 |
+
"w = 0.27\n",
|
| 873 |
+
"plt.figure(figsize=(8,4))\n",
|
| 874 |
+
"plt.bar(x - w, rand_b, width=w, label='Random', color='#bbb')\n",
|
| 875 |
+
"plt.bar(x, heur_b, width=w, label='Heuristic', color='#88c')\n",
|
| 876 |
+
"plt.bar(x + w, trnd_b, width=w, label='Trained LLM', color='#4a8')\n",
|
| 877 |
+
"plt.xticks(x, [b.title()+' Risk' for b in buckets])\n",
|
| 878 |
+
"plt.ylabel('Mean reward')\n",
|
| 879 |
+
"plt.title('Per Risk-Bucket Reward (frozen holdout)')\n",
|
| 880 |
+
"plt.legend()\n",
|
| 881 |
+
"plt.tight_layout()\n",
|
| 882 |
+
"plt.savefig('artifacts/per_bucket_rewards.png', dpi=140)\n",
|
| 883 |
+
"plt.show()\n",
|
| 884 |
+
"\n",
|
| 885 |
+
"summary = {\n",
|
| 886 |
+
" 'env_url': ENV_URL,\n",
|
| 887 |
+
" 'model_id': MODEL_ID,\n",
|
| 888 |
+
" 'quick_mode': QUICK_MODE,\n",
|
| 889 |
+
" 'prompts_used': len(prompts),\n",
|
| 890 |
+
" 'grpo_num_generations': GRPO_NUM_GENERATIONS,\n",
|
| 891 |
+
" 'rollout_steps_per_reward': ROLLOUT_STEPS_PER_REWARD,\n",
|
| 892 |
+
" 'n_rounds': N_ROUNDS,\n",
|
| 893 |
+
" 'grpo_steps_per_round': GRPO_STEPS_PER_ROUND,\n",
|
| 894 |
+
" 'es_steps_per_round': ES_STEPS_PER_ROUND,\n",
|
| 895 |
+
" 'es_population': ES_POPULATION,\n",
|
| 896 |
+
" 'baseline_random_mean_reward': baseline_random['mean_reward'],\n",
|
| 897 |
+
" 'baseline_heuristic_mean_reward': baseline_heuristic['mean_reward'],\n",
|
| 898 |
+
" 'trained_mean_reward': trained_eval['mean_reward'],\n",
|
| 899 |
+
" 'reward_gain_vs_random': trained_eval['mean_reward'] - baseline_random['mean_reward'],\n",
|
| 900 |
+
" 'reward_gain_vs_heuristic': trained_eval['mean_reward'] - baseline_heuristic['mean_reward'],\n",
|
| 901 |
+
" 'per_bucket': {\n",
|
| 902 |
+
" 'random': baseline_random['bucket_means'],\n",
|
| 903 |
+
" 'heuristic': baseline_heuristic['bucket_means'],\n",
|
| 904 |
+
" 'trained': trained_eval['bucket_means'],\n",
|
| 905 |
+
" },\n",
|
| 906 |
+
" 'defender_round_rewards': defender_round_rewards,\n",
|
| 907 |
+
" 'fraud_round_fitness': fraud_round_fitness,\n",
|
| 908 |
+
" 'exploitability_log': exploitability_log,\n",
|
| 909 |
+
" 'fraud_theta_history': fraud_theta_history,\n",
|
| 910 |
+
" 'final_fraud_theta': fraud_agent.theta,\n",
|
| 911 |
+
" 'grpo_reward_curve': reward_log,\n",
|
| 912 |
+
" 'grpo_loss_history': loss_history,\n",
|
| 913 |
+
" 'eval_per_episode': {\n",
|
| 914 |
+
" 'random': baseline_random['per_episode_mean'],\n",
|
| 915 |
+
" 'heuristic': baseline_heuristic['per_episode_mean'],\n",
|
| 916 |
+
" 'trained': trained_eval['per_episode_mean'],\n",
|
| 917 |
+
" },\n",
|
| 918 |
+
"}\n",
|
| 919 |
+
"with open('artifacts/run_summary.json', 'w', encoding='utf-8') as f:\n",
|
| 920 |
+
" json.dump(summary, f, indent=2)\n",
|
| 921 |
+
"print(json.dumps({k:v for k,v in summary.items() if k not in ('grpo_reward_curve','grpo_loss_history')}, indent=2))"
|
| 922 |
+
]
|
| 923 |
+
},
|
| 924 |
+
{
|
| 925 |
+
"cell_type": "markdown",
|
| 926 |
+
"metadata": {},
|
| 927 |
+
"source": [
|
| 928 |
+
"## 11. (Optional) Upload artifacts"
|
| 929 |
+
]
|
| 930 |
+
},
|
| 931 |
+
{
|
| 932 |
+
"cell_type": "code",
|
| 933 |
+
"execution_count": null,
|
| 934 |
+
"metadata": {},
|
| 935 |
+
"outputs": [],
|
| 936 |
+
"source": [
|
| 937 |
+
"# !huggingface-cli upload <your-hf-repo> artifacts artifacts --repo-type dataset"
|
| 938 |
+
]
|
| 939 |
+
}
|
| 940 |
+
],
|
| 941 |
+
"metadata": {
|
| 942 |
+
"kernelspec": {
|
| 943 |
+
"display_name": "Python 3",
|
| 944 |
+
"language": "python",
|
| 945 |
+
"name": "python3"
|
| 946 |
+
},
|
| 947 |
+
"language_info": {
|
| 948 |
+
"name": "python"
|
| 949 |
+
}
|
| 950 |
+
},
|
| 951 |
+
"nbformat": 4,
|
| 952 |
+
"nbformat_minor": 5
|
| 953 |
+
}
|
scripts/train_theme4_grpo.py
CHANGED
|
@@ -11,7 +11,9 @@ It is intentionally lightweight so teams can run it in Colab with TRL/Unsloth.
|
|
| 11 |
|
| 12 |
from __future__ import annotations
|
| 13 |
|
|
|
|
| 14 |
import json
|
|
|
|
| 15 |
import random
|
| 16 |
from dataclasses import dataclass
|
| 17 |
from typing import Any
|
|
@@ -19,9 +21,11 @@ from typing import Any
|
|
| 19 |
import requests
|
| 20 |
|
| 21 |
|
| 22 |
-
ENV_URL = "http://localhost:7860"
|
| 23 |
-
MAX_STEPS = 200
|
| 24 |
-
GROUP_SIZE = 8
|
|
|
|
|
|
|
| 25 |
|
| 26 |
|
| 27 |
@dataclass
|
|
@@ -69,8 +73,14 @@ def _reset(difficulty: int = 2) -> dict[str, Any]:
|
|
| 69 |
return payload.get("observation", payload)
|
| 70 |
|
| 71 |
|
| 72 |
-
def collect_group_relative_pairs(
|
| 73 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 74 |
dataset: list[RolloutExample] = []
|
| 75 |
actions_pool = _action_candidates()
|
| 76 |
|
|
@@ -111,7 +121,7 @@ def collect_group_relative_pairs(max_steps: int = MAX_STEPS, group_size: int = G
|
|
| 111 |
step_payload = _step(best_action)
|
| 112 |
obs = step_payload.get("observation", step_payload)
|
| 113 |
if bool(obs.get("done", False)):
|
| 114 |
-
obs = _reset(difficulty=
|
| 115 |
|
| 116 |
return dataset
|
| 117 |
|
|
@@ -134,6 +144,21 @@ def export_jsonl(dataset: list[RolloutExample], output_path: str) -> None:
|
|
| 134 |
|
| 135 |
|
| 136 |
if __name__ == "__main__":
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 11 |
|
| 12 |
from __future__ import annotations
|
| 13 |
|
| 14 |
+
import argparse
|
| 15 |
import json
|
| 16 |
+
import os
|
| 17 |
import random
|
| 18 |
from dataclasses import dataclass
|
| 19 |
from typing import Any
|
|
|
|
| 21 |
import requests
|
| 22 |
|
| 23 |
|
| 24 |
+
ENV_URL = os.getenv("ENV_URL", "http://localhost:7860").rstrip("/")
|
| 25 |
+
MAX_STEPS = int(os.getenv("MAX_STEPS", "200"))
|
| 26 |
+
GROUP_SIZE = int(os.getenv("GROUP_SIZE", "8"))
|
| 27 |
+
DIFFICULTY = int(os.getenv("DIFFICULTY", "2"))
|
| 28 |
+
RANDOM_SEED = int(os.getenv("SEED", "42"))
|
| 29 |
|
| 30 |
|
| 31 |
@dataclass
|
|
|
|
| 73 |
return payload.get("observation", payload)
|
| 74 |
|
| 75 |
|
| 76 |
+
def collect_group_relative_pairs(
|
| 77 |
+
max_steps: int = MAX_STEPS,
|
| 78 |
+
group_size: int = GROUP_SIZE,
|
| 79 |
+
difficulty: int = DIFFICULTY,
|
| 80 |
+
seed: int = RANDOM_SEED,
|
| 81 |
+
) -> list[RolloutExample]:
|
| 82 |
+
random.seed(seed)
|
| 83 |
+
obs = _reset(difficulty=difficulty)
|
| 84 |
dataset: list[RolloutExample] = []
|
| 85 |
actions_pool = _action_candidates()
|
| 86 |
|
|
|
|
| 121 |
step_payload = _step(best_action)
|
| 122 |
obs = step_payload.get("observation", step_payload)
|
| 123 |
if bool(obs.get("done", False)):
|
| 124 |
+
obs = _reset(difficulty=difficulty)
|
| 125 |
|
| 126 |
return dataset
|
| 127 |
|
|
|
|
| 144 |
|
| 145 |
|
| 146 |
if __name__ == "__main__":
|
| 147 |
+
parser = argparse.ArgumentParser(description="Collect group-relative preference pairs from SmartPayEnv.")
|
| 148 |
+
parser.add_argument("--env-url", default=ENV_URL, help="SmartPayEnv server URL")
|
| 149 |
+
parser.add_argument("--max-steps", type=int, default=MAX_STEPS, help="Number of rollout steps")
|
| 150 |
+
parser.add_argument("--group-size", type=int, default=GROUP_SIZE, help="Actions sampled per step")
|
| 151 |
+
parser.add_argument("--difficulty", type=int, default=DIFFICULTY, help="Environment difficulty 0/1/2")
|
| 152 |
+
parser.add_argument("--seed", type=int, default=RANDOM_SEED, help="Random seed")
|
| 153 |
+
parser.add_argument("--output", default="theme4_grpo_pairs.jsonl", help="Output JSONL path")
|
| 154 |
+
args = parser.parse_args()
|
| 155 |
+
|
| 156 |
+
ENV_URL = args.env_url.rstrip("/")  # rebind the module-level URL consumed by _reset/_step
|
| 157 |
+
data = collect_group_relative_pairs(
|
| 158 |
+
max_steps=args.max_steps,
|
| 159 |
+
group_size=args.group_size,
|
| 160 |
+
difficulty=args.difficulty,
|
| 161 |
+
seed=args.seed,
|
| 162 |
+
)
|
| 163 |
+
export_jsonl(data, args.output)
|
| 164 |
+
print(f"Collected {len(data)} preference pairs into {args.output}")
|
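With the env-var defaults and the new CLI in place, the collector can be driven either from the shell or programmatically. A minimal sketch, assuming the repo root is on `sys.path` (the module path `scripts.train_theme4_grpo` is inferred from the file location, and `_reset`/`_step` are assumed to read the module-level `ENV_URL` at call time, as the `__main__` block relies on):

```python
# Programmatic use of the collector; module path assumed from scripts/train_theme4_grpo.py.
import scripts.train_theme4_grpo as collector

collector.ENV_URL = "http://localhost:7860"  # or a deployed Space URL
pairs = collector.collect_group_relative_pairs(
    max_steps=50,
    group_size=4,
    difficulty=2,
    seed=7,
)
collector.export_jsonl(pairs, "theme4_grpo_pairs.jsonl")
print(f"wrote {len(pairs)} preference pairs")

# Equivalent CLI invocation from the repo root:
#   python scripts/train_theme4_grpo.py --env-url http://localhost:7860 \
#       --max-steps 50 --group-size 4 --difficulty 2 --seed 7 \
#       --output theme4_grpo_pairs.jsonl
```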
server/SmartPayEnv_environment.py
CHANGED
|
@@ -142,6 +142,15 @@ class SmartpayenvEnvironment(Environment):
|
|
| 142 |
self._pattern_queue = deque()
|
| 143 |
self._meta_curriculum_enabled = True
|
| 144 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 145 |
def _init_gateways(self) -> None:
|
| 146 |
instability = self._cfg["instability"]
|
| 147 |
self._gateways = [
|
|
@@ -164,7 +173,10 @@ class SmartpayenvEnvironment(Environment):
|
|
| 164 |
# Fallback to random if logs fail (shouldn't happen)
|
| 165 |
return self._generate_fallback_transaction()
|
| 166 |
|
| 167 |
-
|
|
|
|
|
|
|
|
|
|
| 168 |
self._state.true_fraud_risk = true_risk
|
| 169 |
|
| 170 |
return SmartpayenvObservation(
|
|
@@ -192,8 +204,10 @@ class SmartpayenvEnvironment(Environment):
|
|
| 192 |
)
|
| 193 |
|
| 194 |
def _get_noisy_risk(self, true_risk: float) -> float:
|
| 195 |
-
"""Adds Gaussian noise to the true risk score.
|
| 196 |
-
noise
|
|
|
|
|
|
|
| 197 |
return float(np.clip(true_risk + noise, 0.01, 0.99))
|
| 198 |
|
| 199 |
def _generate_fallback_transaction(self) -> SmartpayenvObservation:
|
|
@@ -228,12 +242,19 @@ class SmartpayenvEnvironment(Environment):
|
|
| 228 |
task_retention_score=0.5,
|
| 229 |
)
|
| 230 |
|
| 231 |
-
def reset(self, difficulty: int = 0) -> SmartpayenvObservation:
|
| 232 |
self._difficulty = int(np.clip(difficulty, 0, 2))
|
| 233 |
self._cfg = DIFFICULTY_CONFIG[self._difficulty]
|
|
|
|
|
|
|
|
|
|
|
|
|
| 234 |
self._state = State(episode_id=str(uuid4()), step_count=0)
|
| 235 |
-
#
|
| 236 |
-
|
|
|
|
|
|
|
|
|
|
| 237 |
self._init_gateways()
|
| 238 |
self.route_grader = RoutingEfficacyGrader()
|
| 239 |
self.fraud_grader = FraudDetectionGrader()
|
|
@@ -248,6 +269,31 @@ class SmartpayenvEnvironment(Environment):
|
|
| 248 |
self._state.anti_gaming_alerts = 0
|
| 249 |
return self.current_obs
|
| 250 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 251 |
def _curriculum_multiplier(self) -> float:
|
| 252 |
return 1.0 + (0.15 * self._state.curriculum_level)
|
| 253 |
|
|
@@ -313,10 +359,15 @@ class SmartpayenvEnvironment(Environment):
|
|
| 313 |
}
|
| 314 |
self._state.health_lag_buffer.append(current_health)
|
| 315 |
|
| 316 |
-
if self._state.step_count % 10 == 0 and self._rng.random() <
|
| 317 |
-
#
|
| 318 |
-
|
| 319 |
-
self.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 320 |
|
| 321 |
# Curriculum-driven stress events (self-improvement pressure).
|
| 322 |
if self._rng.random() < (0.01 * self._curriculum_multiplier()):
|
|
|
|
| 142 |
self._pattern_queue = deque()
|
| 143 |
self._meta_curriculum_enabled = True
|
| 144 |
|
| 145 |
+
# ── Learnable adversary (theme-4 co-evolution) ─────────────────
|
| 146 |
+
# Set externally via `configure_adversary(...)` and consumed by
|
| 147 |
+
# `_get_noisy_risk` / `step` to control how aggressive the fraud
|
| 148 |
+
# generator behaves. Defaults are neutral (no extra pressure).
|
| 149 |
+
self._adv_intensity = 1.0 # multiplier on fraud rate (1.0 = baseline)
|
| 150 |
+
self._adv_noise_boost = 0.0 # extra std on observed fraud risk
|
| 151 |
+
self._adv_pattern_rate = 0.2 # base prob of injecting a fraud-surge pattern
|
| 152 |
+
self._adv_strategy = "mixed" # "mixed" | "fraud_surge" | "stealth_fraud" | "velocity_attack"
|
| 153 |
+
|
| 154 |
def _init_gateways(self) -> None:
|
| 155 |
instability = self._cfg["instability"]
|
| 156 |
self._gateways = [
|
|
|
|
| 173 |
# Fallback to random if logs fail (shouldn't happen)
|
| 174 |
return self._generate_fallback_transaction()
|
| 175 |
|
| 176 |
+
# Adversary intensifier: scale the underlying fraud risk so a learned
|
| 177 |
+
# fraud agent can sharpen attacks against the defender LLM.
|
| 178 |
+
true_risk = float(log_entry["fraud_risk_score"]) * float(self._adv_intensity)
|
| 179 |
+
true_risk = float(np.clip(true_risk, 0.0, 1.0))
|
| 180 |
self._state.true_fraud_risk = true_risk
|
| 181 |
|
| 182 |
return SmartpayenvObservation(
|
|
|
|
| 204 |
)
|
| 205 |
|
| 206 |
def _get_noisy_risk(self, true_risk: float) -> float:
|
| 207 |
+
"""Adds Gaussian noise to the true risk score.
|
| 208 |
+
Adversary policy can boost noise to make detection harder (stealth)."""
|
| 209 |
+
std = 0.1 + max(0.0, float(self._adv_noise_boost))
|
| 210 |
+
noise = self._rng.normal(0, std)
|
| 211 |
return float(np.clip(true_risk + noise, 0.01, 0.99))
|
| 212 |
|
| 213 |
def _generate_fallback_transaction(self) -> SmartpayenvObservation:
|
|
|
|
| 242 |
task_retention_score=0.5,
|
| 243 |
)
|
| 244 |
|
| 245 |
+
def reset(self, difficulty: int = 0, seed: int | None = None) -> SmartpayenvObservation:
|
| 246 |
self._difficulty = int(np.clip(difficulty, 0, 2))
|
| 247 |
self._cfg = DIFFICULTY_CONFIG[self._difficulty]
|
| 248 |
+
# Optional deterministic seeding so a GRPO group can share the same
|
| 249 |
+
# starting trajectory across all candidate completions (clean signal).
|
| 250 |
+
if seed is not None:
|
| 251 |
+
self._rng = np.random.default_rng(int(seed))
|
| 252 |
self._state = State(episode_id=str(uuid4()), step_count=0)
|
| 253 |
+
# Cursor is also seed-determined when a seed is provided.
|
| 254 |
+
if seed is not None:
|
| 255 |
+
self._state.log_cursor = int(seed) % 100000
|
| 256 |
+
else:
|
| 257 |
+
self._state.log_cursor = int(self._rng.integers(0, 100000))
|
| 258 |
self._init_gateways()
|
| 259 |
self.route_grader = RoutingEfficacyGrader()
|
| 260 |
self.fraud_grader = FraudDetectionGrader()
|
|
|
|
| 269 |
self._state.anti_gaming_alerts = 0
|
| 270 |
return self.current_obs
|
| 271 |
|
| 272 |
+
# ── Adversary configuration (theme-4 co-evolution) ─────────────────
|
| 273 |
+
def configure_adversary(
|
| 274 |
+
self,
|
| 275 |
+
intensity: float | None = None,
|
| 276 |
+
noise_boost: float | None = None,
|
| 277 |
+
pattern_rate: float | None = None,
|
| 278 |
+
strategy: str | None = None,
|
| 279 |
+
) -> dict:
|
| 280 |
+
"""Set the parametric fraud agent's behaviour. All values are clipped
|
| 281 |
+
to safe ranges. Returns the active adversary config."""
|
| 282 |
+
if intensity is not None:
|
| 283 |
+
self._adv_intensity = float(np.clip(intensity, 0.5, 2.5))
|
| 284 |
+
if noise_boost is not None:
|
| 285 |
+
self._adv_noise_boost = float(np.clip(noise_boost, 0.0, 0.6))
|
| 286 |
+
if pattern_rate is not None:
|
| 287 |
+
self._adv_pattern_rate = float(np.clip(pattern_rate, 0.0, 0.9))
|
| 288 |
+
if strategy is not None and strategy in {"mixed", "fraud_surge", "stealth_fraud", "velocity_attack"}:
|
| 289 |
+
self._adv_strategy = strategy
|
| 290 |
+
return {
|
| 291 |
+
"intensity": self._adv_intensity,
|
| 292 |
+
"noise_boost": self._adv_noise_boost,
|
| 293 |
+
"pattern_rate": self._adv_pattern_rate,
|
| 294 |
+
"strategy": self._adv_strategy,
|
| 295 |
+
}
|
| 296 |
+
|
| 297 |
def _curriculum_multiplier(self) -> float:
|
| 298 |
return 1.0 + (0.15 * self._state.curriculum_level)
|
| 299 |
|
|
|
|
| 359 |
}
|
| 360 |
self._state.health_lag_buffer.append(current_health)
|
| 361 |
|
| 362 |
+
if self._state.step_count % 10 == 0 and self._rng.random() < self._adv_pattern_rate:
|
| 363 |
+
# Adversary-controlled attack injection. The fraud agent picks
|
| 364 |
+
# the pattern type; "mixed" rotates among them.
|
| 365 |
+
if self._adv_strategy == "mixed":
|
| 366 |
+
pat = self._rng.choice(["fraud_surge", "stealth_fraud", "velocity_attack"])
|
| 367 |
+
else:
|
| 368 |
+
pat = self._adv_strategy
|
| 369 |
+
atk_logs = self._log_loader.get_pattern(str(pat), count=5)
|
| 370 |
+
self._pattern_queue.extend(atk_logs)
|
| 371 |
|
| 372 |
# Curriculum-driven stress events (self-improvement pressure).
|
| 373 |
if self._rng.random() < (0.01 * self._curriculum_multiplier()):
|
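The adversary surface added here is easy to exercise in-process. A minimal sketch (environment construction is assumed to take no required arguments, which may not hold in this repo; HTTP callers should use `/configure_adversary` and `/reset_seeded` instead):

```python
env = SmartpayenvEnvironment()  # assumed default construction

# Out-of-range values are clipped to the documented safe ranges, and the
# active config is echoed back.
cfg = env.configure_adversary(intensity=5.0, noise_boost=-0.1, strategy="stealth_fraud")
assert cfg["intensity"] == 2.5    # clipped into [0.5, 2.5]
assert cfg["noise_boost"] == 0.0  # clipped into [0.0, 0.6]
assert cfg["strategy"] == "stealth_fraud"

# Seeded resets re-derive both the RNG and the log cursor from the seed, so
# two resets with the same seed replay the same transaction stream.
obs_a = env.reset(difficulty=2, seed=123)
obs_b = env.reset(difficulty=2, seed=123)
```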
server/app.py
CHANGED
|
@@ -64,6 +64,42 @@ async def simulate(action: SmartpayenvAction):
|
|
| 64 |
return app.env.simulate(action)
|
| 65 |
|
| 66 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 67 |
def main():
|
| 68 |
"""
|
| 69 |
Entry point for direct execution via uv run or python -m.
|
|
|
|
| 64 |
return app.env.simulate(action)
|
| 65 |
|
| 66 |
|
| 67 |
+
# ── Theme-4 co-evolution endpoints ────────────────────────────────────
|
| 68 |
+
from typing import Optional
|
| 69 |
+
from pydantic import BaseModel
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
class AdversaryConfig(BaseModel):
|
| 73 |
+
"""Parametric fraud-agent policy. Any field may be omitted."""
|
| 74 |
+
intensity: Optional[float] = None
|
| 75 |
+
noise_boost: Optional[float] = None
|
| 76 |
+
pattern_rate: Optional[float] = None
|
| 77 |
+
strategy: Optional[str] = None # "mixed" | "fraud_surge" | "stealth_fraud" | "velocity_attack"
|
| 78 |
+
|
| 79 |
+
|
| 80 |
+
class SeededReset(BaseModel):
|
| 81 |
+
difficulty: int = 0
|
| 82 |
+
seed: Optional[int] = None
|
| 83 |
+
|
| 84 |
+
|
| 85 |
+
@app.post("/configure_adversary")
|
| 86 |
+
async def configure_adversary(cfg: AdversaryConfig):
|
| 87 |
+
"""Set the learnable fraud agent's behaviour. Returns the active config."""
|
| 88 |
+
return app.env.configure_adversary(
|
| 89 |
+
intensity=cfg.intensity,
|
| 90 |
+
noise_boost=cfg.noise_boost,
|
| 91 |
+
pattern_rate=cfg.pattern_rate,
|
| 92 |
+
strategy=cfg.strategy,
|
| 93 |
+
)
|
| 94 |
+
|
| 95 |
+
|
| 96 |
+
@app.post("/reset_seeded", response_model=SmartpayenvObservation)
|
| 97 |
+
async def reset_seeded(req: SeededReset):
|
| 98 |
+
"""Deterministic reset: same `seed` => same starting trajectory.
|
| 99 |
+
Useful for GRPO so all completions in a group share the same state."""
|
| 100 |
+
return app.env.reset(difficulty=int(req.difficulty), seed=req.seed)
|
| 101 |
+
|
| 102 |
+
|
| 103 |
def main():
|
| 104 |
"""
|
| 105 |
Entry point for direct execution via uv run or python -m.
|
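Client-side, the two new endpoints are plain JSON POSTs; a minimal sketch against a local server (only `requests` assumed):

```python
import requests

ENV_URL = "http://localhost:7860"

# Harden the fraud agent for the next defender round. Omitted fields keep
# their current values; out-of-range values are clipped server-side.
cfg = requests.post(
    f"{ENV_URL}/configure_adversary",
    json={"intensity": 1.8, "noise_boost": 0.2,
          "pattern_rate": 0.5, "strategy": "velocity_attack"},
).json()
print("active adversary:", cfg)

# Deterministic reset: all completions in one GRPO group reuse this seed.
obs = requests.post(
    f"{ENV_URL}/reset_seeded",
    json={"difficulty": 2, "seed": 123},
).json()
print("observation fields:", sorted(obs))
```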