Pratap-K committed
Commit 640cca9 · 1 Parent(s): 27a0d2f

Env improvement
README.md CHANGED

@@ -14,7 +14,7 @@ tags:
  - Reinforcement Learning
  ---

- # 💳 SmartPayEnv: Advanced Fintech Reality Layer
+ # 💳 SmartPayEnv: Advanced Fintech Reality Layer (Theme 4: Self-Improvement)

  **A high-fidelity, production-grade benchmark for training and evaluating AI Agents (LLMs/RL) on the messy reality of global payment orchestration.**

@@ -23,6 +23,12 @@ tags:

  SmartPayEnv bridges the gap between simple simulations and production fintech. It models the adversarial loops, infrastructure instability, and delayed feedback cycles that define modern payment systems.

+ This release is explicitly upgraded for **OpenEnv Hackathon Theme #4 (Self-Improvement)**, with a light blend of Theme #1 and Theme #2:
+ - **League-style challenger dynamics** inside the environment (the agent faces a moving opponent skill frontier).
+ - **Adaptive curriculum** that auto-escalates pressure after sustained performance and de-escalates after regressions.
+ - **Anti-reward-hacking penalties** for degenerate policies (e.g., overusing manual review without fraud/retention quality gains).
+ - **Long-horizon credit pressure** through delayed chargebacks, review queues, and temporal events.
+
  ---

  ## 🚀 Why SmartPayEnv?
 
@@ -122,6 +128,12 @@ Agents can send transactions to manual review (Action 3). Resolutions are 100% a
  - **📊 BIN-Gateway Affinity**: A hidden matrix of gateway performance across different card types. Agents must discover these affinities to optimize routing success.
  - **🧠 Preference-Based Learning (Simulation Branching)**: Supports advanced training (e.g., DPO/PPO) by allowing agents to "What-if" multiple actions from the same state via the `/simulate` endpoint. Agents can group similar contexts (BIN + Amount + Risk) and learn from relative advantages.

+ ### 5. Self-Improving Meta-Curriculum (NEW)
+ - **📈 Curriculum Level**: Each episode tracks a continuous curriculum level (0-2) that rises after sustained high rolling performance.
+ - **🥊 Challenger Skill**: A moving estimate of a challenger policy's skill is maintained and used to compute regret-style penalties when the active policy underperforms it.
+ - **🧯 Anti-Gaming Guardrails**: Repeatedly selecting costly manual review without corresponding quality gains triggers adaptive penalties.
+ - **🧠 Metadata for Training**: Step metadata exposes `curriculum_level`, `policy_skill_estimate`, `challenger_skill`, and the reward-shaping terms for richer RL diagnostics (see the sketch after this hunk).
+
  ---

  ## 🎯 Benchmark Tasks
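
For reference, a minimal sketch of how a training loop might read these diagnostics over HTTP. The metadata field names come from the environment changes in this commit; the request/response envelope is an assumption based on `scripts/train_theme4_grpo.py` and may differ in your deployment.

```python
# Sketch: reading self-improvement diagnostics during a rollout.
# Assumes the payload shapes used by scripts/train_theme4_grpo.py.
import requests

ENV_URL = "http://localhost:7860"

action = {"gateway": 0, "fraud_decision": 0, "retry_strategy": 0}
payload = requests.post(f"{ENV_URL}/step", json={"action": action}, timeout=30).json()
obs = payload.get("observation", payload)

meta = obs.get("metadata", {})
print("curriculum level:", meta.get("curriculum_level"))
print("policy vs. challenger:", meta.get("policy_skill_estimate"), meta.get("challenger_skill"))
print("shaping terms:", meta.get("gaming_penalty"), meta.get("robustness_bonus"))
```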
 
@@ -180,6 +192,22 @@ SmartPayEnv enables GRPO by providing the infrastructure for **Group Sampling**
  - **Learnable Gradients**: Unlike binary simulations, our **Deterministic Graders** (see Scoring section) map fuzzy outcomes to continuous rewards $[0, 1]$. This prevents the "sparse reward" problem and provides stable gradients for PPO clip-range optimization.
  - **Context Bucketing**: The `server/preference_utils.py` module allows agents to bundle similar (BIN, Amount, Risk) states, enabling faster convergence on preference-based objectives.

+ ### 3. Theme-4 Group-Relative Collection (NEW)
+ - Use `scripts/train_theme4_grpo.py` to build **group-relative preference pairs** from online interactions:
+   - sample a group of candidate actions for each live observation
+   - rank the group via `/simulate` rewards
+   - export best-vs-worst pairs (`theme4_grpo_pairs.jsonl`)
+ - This supports novel preference-style post-training flows in **HF TRL / Unsloth** and aligns with modern critic-free RL ideas (see the sketch after this hunk).
+
+ ---
+
+ ## 📚 Research-Inspired Design
+
+ The self-improving upgrades are inspired by:
+ - **League / PFSP dynamics** for avoiding cyclic overfitting and improving robustness: [AlphaStar (Nature, 2019)](https://www.nature.com/articles/s41586-019-1724-z)
+ - **Group-relative policy updates** for efficient critic-free optimization: [DeepSeekMath / GRPO (arXiv:2402.03300)](https://arxiv.org/abs/2402.03300)
+ - **Cross-play and equilibrium-oriented opponent diversity**: [Fictitious Cross-Play (arXiv:2310.03354)](https://arxiv.org/abs/2310.03354)
+
  ---

  ## 📐 Data Models
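
To make the group-relative signal concrete, here is a small illustrative computation (not part of the commit): GRPO-style updates score each sampled action against its group's mean and spread, so no learned critic is needed.

```python
# Illustrative only: GRPO-style group-relative advantages for one observation.
# The rewards would come from ranking a sampled action group via /simulate.
import numpy as np

group_rewards = np.array([0.82, 0.61, 0.55, 0.40])  # hypothetical /simulate scores
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
print(advantages)  # the best action in the group gets the largest positive advantage
```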
 
@@ -219,6 +247,9 @@ cd SmartPayEnv
  # Install dependencies
  uv sync

+ # (Recommended) Regenerate realistic synthetic data
+ python scripts/generate_logs.py --num-transactions 20000 --n-users 5000 --seed 42
+
  # Run the OpenEnv validation suite
  openenv validate
 
 
@@ -233,6 +264,24 @@ uv run -m SmartPayEnv.server.app
  ```
  Access the **Swagger UI** at `http://localhost:7860/` (auto-redirects to `/docs`).

+ ### 4. Synthetic Data World Generator (NEW)
+ Use this when you want realistic, evolving "real-world-like" transaction streams:
+
+ ```bash
+ python scripts/generate_logs.py \
+   --output data/transactions_log.jsonl \
+   --num-transactions 20000 \
+   --n-users 5000 \
+   --seed 42 \
+   --base-fraud-rate 0.08
+ ```
+
+ What gets generated:
+ - **Normal baseline behavior** (segment-based spend, location/device consistency, time-of-day effects)
+ - **Seed fraud templates** (`high_value_spike`, `velocity_burst`, `geo_anomaly`, `device_spoof`, `split_transactions`)
+ - **Adaptive fraud evolution** (strategy composition and stealth attacks such as `low_risk_disguise`)
+ - **Strategy labels for storytelling** via `fraud_strategy` and `event_marker`
+
  ### 3. Multi-Mode Deployment (Docker)
  ```bash
  # Build the production image
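
After regenerating, a quick way to sanity-check the fraud mix (a sketch; the field names are the ones emitted by `scripts/generate_logs.py` below):

```python
# Quick sanity check of the generated stream.
import json
from collections import Counter

strategies = Counter()
with open("data/transactions_log.jsonl", encoding="utf-8") as f:
    for line in f:
        tx = json.loads(line)
        if tx.get("is_fraud"):
            strategies[tx.get("fraud_strategy", "unknown")] += 1

print(strategies.most_common(10))  # e.g., velocity_burst, low_risk_disguise blends
```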
data/transactions_log.jsonl CHANGED
The diff for this file is too large to render. See raw diff
 
scripts/generate_logs.py CHANGED

@@ -1,68 +1,281 @@
+ import argparse
  import json
- import numpy as np
  import os
- from uuid import uuid4
+ from collections import defaultdict, deque

+ import numpy as np
+
+
+ LOCATIONS = ["Bangalore", "Mumbai", "Delhi", "Hyderabad", "Chennai", "Pune", "Kolkata", "Europe", "Singapore"]
+ SEGMENT_LABELS = {0: "new", 1: "existing", 2: "premium"}
+ BASE_MCC_DIST = [0.30, 0.20, 0.10, 0.10, 0.10, 0.20]
+ HIGH_RISK_MCCS = {2, 4, 5}
+ RISKY_HOURS = {1, 2, 3, 4, 5}
+
+
+ def _time_bucket(hour: int) -> str:
+     if 0 <= hour <= 5:
+         return "night"
+     if 6 <= hour <= 11:
+         return "morning"
+     if 12 <= hour <= 17:
+         return "afternoon"
+     return "evening"
+
+
+ def _sample_user_profiles(rng: np.random.Generator, n_users: int) -> list[dict]:
+     profiles: list[dict] = []
+     for uid in range(n_users):
+         segment = int(rng.choice([0, 1, 2], p=[0.30, 0.55, 0.15]))
+         traveler = bool(rng.random() < {0: 0.08, 1: 0.15, 2: 0.35}[segment])
+         home = str(rng.choice(LOCATIONS[:7]))
+         preferred_mcc = int(rng.choice([0, 1, 3, 5], p=[0.35, 0.25, 0.20, 0.20]))
+         profiles.append(
+             {
+                 "user_id": f"user_{uid}",
+                 "user_segment": segment,
+                 "frequent_traveler": traveler,
+                 "home_location": home,
+                 "preferred_mcc": preferred_mcc,
+                 "base_device_type": int(rng.choice([0, 1, 2], p=[0.55, 0.35, 0.10])),
+                 "base_spend_mu": {0: 3.8, 1: 4.5, 2: 5.0}[segment],
+                 "base_spend_sigma": {0: 0.70, 1: 0.75, 2: 0.85}[segment],
+                 "history_base": {0: 0.35, 1: 0.72, 2: 0.88}[segment],
+             }
+         )
+     return profiles
+
+
+ def _normal_transaction(
+     rng: np.random.Generator,
+     profile: dict,
+     hour: int,
+     user_recent_times: deque,
+     user_recent_amounts: deque,
+ ) -> dict:
+     mcc_probs = np.array(BASE_MCC_DIST, dtype=float)
+     mcc_probs[profile["preferred_mcc"]] += 0.18
+     mcc_probs = mcc_probs / mcc_probs.sum()
+     mcc = int(rng.choice([0, 1, 2, 3, 4, 5], p=mcc_probs))
+
+     amount = float(rng.lognormal(mean=profile["base_spend_mu"], sigma=profile["base_spend_sigma"]))
+     if mcc in HIGH_RISK_MCCS:
+         amount *= 1.35
+
+     location = profile["home_location"]
+     is_international = False
+     if profile["frequent_traveler"] and rng.random() < 0.10:
+         location = str(rng.choice(["Europe", "Singapore"]))
+         is_international = True
+
+     device_type = profile["base_device_type"]
+     if rng.random() < 0.07:
+         device_type = int(rng.choice([0, 1, 2]))
+
+     velocity = float(min(12, len([t for t in user_recent_times if hour - t <= 1])))
+     velocity_norm = float(np.clip(velocity / 10.0, 0.05, 0.98))
+
+     risk = 0.02
+     risk += 0.06 if hour in RISKY_HOURS else 0.0
+     risk += 0.05 if mcc in HIGH_RISK_MCCS else 0.0
+     risk += 0.04 if device_type != profile["base_device_type"] else 0.0
+     risk += 0.03 if is_international else 0.0
+     risk += 0.08 * velocity_norm
+     risk += rng.normal(0.0, 0.02)
+
+     return {
+         "amount": float(np.clip(amount, 5.0, 150000.0)),
+         "currency": "INR",
+         "time": _time_bucket(hour),
+         "merchant_category": mcc,
+         "location": location,
+         "is_international": is_international,
+         "card_present": bool(rng.random() > 0.45),
+         "user_segment": profile["user_segment"],
+         "user_history_score": float(np.clip(rng.normal(profile["history_base"], 0.12), 0.05, 1.0)),
+         "device_type": device_type,
+         "ip_risk": float(np.clip(rng.normal(0.10 if location == profile["home_location"] else 0.45, 0.08), 0.01, 0.99)),
+         "bin_category": int(rng.integers(0, 10)),
+         "time_of_day": int(hour),
+         "transaction_velocity": velocity_norm,
+         "fraud_risk_score": float(np.clip(risk, 0.01, 0.99)),
+         "fraud_strategy": "none",
+         "event_marker": None,
+         "is_fraud": False,
+     }
+
+
+ def _fraud_agent_strategy_mix(
+     rng: np.random.Generator,
+     attack_level: float,
+ ) -> list[str]:
+     templates = [
+         ("high_value_spike", 0.20),
+         ("velocity_burst", 0.22),
+         ("geo_anomaly", 0.16),
+         ("device_spoof", 0.18),
+         ("split_transactions", 0.14),
+         ("low_risk_disguise", 0.10),
+     ]
+     weights = np.array([w for _, w in templates], dtype=float)
+     # Self-improving fraud agent: shifts toward stealth blends as the defender hardens.
+     stealth_boost = min(0.18, 0.06 * attack_level)
+     weights[5] += stealth_boost
+     weights[4] += stealth_boost * 0.8
+     weights = weights / weights.sum()
+
+     k = 1 if attack_level < 1.0 else (2 if rng.random() < 0.75 else 3)
+     selected = rng.choice([name for name, _ in templates], size=k, replace=False, p=weights)
+     return list(selected)
+
+
+ def _apply_fraud_strategy(
+     rng: np.random.Generator,
+     tx: dict,
+     profile: dict,
+     strategies: list[str],
+ ) -> list[dict]:
+     tx = dict(tx)
+     event_markers = []
+
+     for s in strategies:
+         if s == "high_value_spike":
+             tx["amount"] = float(min(200000.0, tx["amount"] * rng.uniform(6.0, 18.0)))
+             event_markers.append("high_value_spike")
+         elif s == "velocity_burst":
+             tx["transaction_velocity"] = float(np.clip(tx["transaction_velocity"] + rng.uniform(0.45, 0.85), 0.1, 0.99))
+             event_markers.append("velocity_burst")
+         elif s == "geo_anomaly":
+             tx["location"] = str(rng.choice(["Europe", "Singapore"]))
+             tx["is_international"] = True
+             tx["ip_risk"] = float(np.clip(tx["ip_risk"] + rng.uniform(0.25, 0.50), 0.01, 0.99))
+             event_markers.append("geo_anomaly")
+         elif s == "device_spoof":
+             tx["device_type"] = int((profile["base_device_type"] + int(rng.integers(1, 3))) % 3)
+             tx["card_present"] = False
+             tx["ip_risk"] = float(np.clip(tx["ip_risk"] + rng.uniform(0.18, 0.35), 0.01, 0.99))
+             event_markers.append("device_spoof")
+         elif s == "split_transactions":
+             # Converted to multiple low-value events that preserve a high total.
+             pieces = int(rng.integers(4, 10))
+             each_amount = float(max(1500.0, tx["amount"] * rng.uniform(0.10, 0.22)))
+             generated = []
+             for _ in range(pieces):
+                 p = dict(tx)
+                 p["amount"] = each_amount
+                 p["transaction_velocity"] = float(np.clip(tx["transaction_velocity"] + rng.uniform(0.35, 0.55), 0.1, 0.99))
+                 p["event_marker"] = "split_transactions"
+                 p["fraud_strategy"] = "split_transactions"
+                 p["is_fraud"] = True
+                 risk = p["fraud_risk_score"] + rng.uniform(0.18, 0.32)
+                 p["fraud_risk_score"] = float(np.clip(risk, 0.01, 0.99))
+                 generated.append(p)
+             return generated
+         elif s == "low_risk_disguise":
+             # Fraud tries to look normal: lower the explicit risk while preserving anomalies elsewhere.
+             tx["amount"] = float(np.clip(tx["amount"] * rng.uniform(0.18, 0.35), 250.0, 12000.0))
+             tx["merchant_category"] = int(rng.choice([0, 1, 3], p=[0.5, 0.3, 0.2]))
+             tx["fraud_risk_score"] = float(np.clip(tx["fraud_risk_score"] - rng.uniform(0.08, 0.20), 0.02, 0.80))
+             event_markers.append("low_risk_disguise")
+
+     tx["fraud_strategy"] = "+".join(strategies)
+     tx["event_marker"] = "|".join(event_markers) if event_markers else "fraud_pattern"
+     tx["is_fraud"] = True
+     tx["fraud_risk_score"] = float(np.clip(tx["fraud_risk_score"] + rng.uniform(0.18, 0.42), 0.01, 0.99))
+     return [tx]
+
+
- def generate_logs(output_path="data/transactions_log.jsonl", num_transactions=5000):
-     rng = np.random.default_rng()
+ def generate_logs(
+     output_path: str = "data/transactions_log.jsonl",
+     num_transactions: int = 15000,
+     n_users: int = 4000,
+     seed: int = 7,
+     base_fraud_rate: float = 0.08,
+ ) -> None:
+     """
+     Generates realistic synthetic payment logs with an evolving fraud adversary.
+     """
+     rng = np.random.default_rng(seed)
      os.makedirs(os.path.dirname(output_path), exist_ok=True)
-
+
+     profiles = _sample_user_profiles(rng, n_users=n_users)
+     user_recent_times: dict[str, deque] = defaultdict(lambda: deque(maxlen=40))
+     user_recent_amounts: dict[str, deque] = defaultdict(lambda: deque(maxlen=40))
+
      current_hour = 0
-     steps_per_hour = 100  # average density
-     active_spike_countdown = 0
-
-     with open(output_path, "w") as f:
-         for i in range(num_transactions):
-             # Advance time every ~100 transactions
-             if i % steps_per_hour == 0:
+     steps_per_hour = 90
+     global_attack_level = 0.0
+     defender_pressure = 0.0
+
+     records_written = 0
+     with open(output_path, "w", encoding="utf-8") as f:
+         while records_written < num_transactions:
+             if records_written % steps_per_hour == 0:
                  current_hour = (current_hour + 1) % 24
-
-             # Randomly start a fraud spike (correlated event)
-             if active_spike_countdown <= 0 and rng.random() < 0.005:
-                 active_spike_countdown = rng.integers(20, 50)
-
-             # 1. Hour of day (Diurnal pattern)
-             hour = current_hour
-
-             # 2. Segment & MCC
-             segment = int(rng.choice([0, 1, 2], p=[0.25, 0.60, 0.15]))
-             mcc = int(rng.choice([0, 1, 2, 3, 4, 5], p=[0.3, 0.2, 0.1, 0.1, 0.1, 0.2]))
-
-             # 3. Fraud Risk with Correlation (Spikes)
-             is_night = (1 <= hour <= 5)
-             base_risk = {0: 0.02, 1: 0.05, 2: 0.15, 3: 0.08, 4: 0.25, 5: 0.12}[mcc]
-
-             risk_boost = 0.0
-             if active_spike_countdown > 0:
-                 risk_boost = 0.4  # Persistent spike
-                 active_spike_countdown -= 1
-             elif is_night:
-                 risk_boost = 0.2
-
-             final_risk = base_risk + risk_boost + rng.uniform(-0.05, 0.05)
-             fraud_risk_score = float(np.clip(final_risk * {0: 1.8, 1: 1.0, 2: 0.3}[segment], 0.01, 0.99))
-
-             # 4. Transaction Details
-             amount = float(rng.lognormal(mean={0: 4.0, 1: 4.5, 2: 6.5, 3: 7.0, 4: 5.0, 5: 3.0}[mcc], sigma=0.8))
-             bin_category = int(rng.integers(0, 10))
-             is_international = bool(rng.random() < (0.4 if mcc == 3 else 0.15))
-
-             log_entry = {
-                 "amount": amount,
-                 "merchant_category": mcc,
-                 "is_international": is_international,
-                 "card_present": bool(rng.random() > 0.5),
-                 "user_segment": segment,
-                 "user_history_score": float(np.clip(rng.normal({0: 0.3, 1: 0.7, 2: 0.9}[segment], 0.15), 0.1, 1.0)),
-                 "device_type": int(rng.choice([0, 1, 2], p=[0.5, 0.4, 0.1])),
-                 "bin_category": bin_category,
-                 "time_of_day": hour,
-                 "transaction_velocity": float(np.clip(rng.random() * 0.2 + (0.5 if active_spike_countdown > 0 else 0.0), 0.1, 0.9)),
-                 "fraud_risk_score": fraud_risk_score,
-                 "event_marker": "fraud_spike" if active_spike_countdown > 0 else None
-             }
-             f.write(json.dumps(log_entry) + "\n")
+
+             profile = profiles[int(rng.integers(0, len(profiles)))]
+             uid = profile["user_id"]
+
+             base_tx = _normal_transaction(
+                 rng=rng,
+                 profile=profile,
+                 hour=current_hour,
+                 user_recent_times=user_recent_times[uid],
+                 user_recent_amounts=user_recent_amounts[uid],
+             )
+
+             fraud_p = base_fraud_rate + (0.05 if current_hour in RISKY_HOURS else 0.0) + (0.07 * global_attack_level)
+             fraud_p = float(np.clip(fraud_p, 0.01, 0.55))
+             is_attack = bool(rng.random() < fraud_p)
+
+             if is_attack:
+                 strategies = _fraud_agent_strategy_mix(rng, attack_level=global_attack_level)
+                 txs = _apply_fraud_strategy(rng, base_tx, profile, strategies)
+             else:
+                 txs = [base_tx]
+
+             for tx in txs:
+                 tx["user_id"] = uid
+                 tx["user_profile"] = {
+                     "segment": SEGMENT_LABELS[profile["user_segment"]],
+                     "frequent_traveler": profile["frequent_traveler"],
+                     "home_location": profile["home_location"],
+                 }
+                 tx["attack_level"] = round(float(global_attack_level), 4)
+                 tx["defender_pressure"] = round(float(defender_pressure), 4)
+                 f.write(json.dumps(tx) + "\n")
+                 records_written += 1
+
+                 user_recent_times[uid].append(current_hour)
+                 user_recent_amounts[uid].append(tx["amount"])
+                 if records_written >= num_transactions:
+                     break
+
+             # Self-improvement dynamics:
+             # when stealth fraud appears, raise attack sophistication;
+             # when fraud is frequently obvious, increase defender pressure.
+             if is_attack and any("low_risk_disguise" in t.get("fraud_strategy", "") for t in txs):
+                 global_attack_level = float(np.clip(global_attack_level + 0.015, 0.0, 3.0))
+             elif is_attack:
+                 defender_pressure = float(np.clip(defender_pressure + 0.010, 0.0, 2.5))
+             else:
+                 global_attack_level = float(np.clip(global_attack_level + 0.002 - (0.001 * defender_pressure), 0.0, 3.0))

  if __name__ == "__main__":
-     generate_logs(num_transactions=5000)
-     print("Sequential logs with correlated events generated.")
+     parser = argparse.ArgumentParser(description="Generate synthetic SmartPayEnv transaction logs.")
+     parser.add_argument("--output", default="data/transactions_log.jsonl", help="Output JSONL file path")
+     parser.add_argument("--num-transactions", type=int, default=15000, help="Number of transactions")
+     parser.add_argument("--n-users", type=int, default=4000, help="Number of synthetic users")
+     parser.add_argument("--seed", type=int, default=7, help="Random seed")
+     parser.add_argument("--base-fraud-rate", type=float, default=0.08, help="Baseline fraud probability")
+     args = parser.parse_args()
+
+     generate_logs(
+         output_path=args.output,
+         num_transactions=args.num_transactions,
+         n_users=args.n_users,
+         seed=args.seed,
+         base_fraud_rate=args.base_fraud_rate,
+     )
+     print(f"Generated {args.num_transactions} synthetic transactions at {args.output}")
scripts/train_theme4_grpo.py ADDED
@@ -0,0 +1,139 @@
+ """
+ Theme-4 training starter for SmartPayEnv.
+
+ This script demonstrates a novel self-improvement loop:
+ 1) sample K candidate actions per observation
+ 2) score each candidate with /simulate rewards (group-relative signal)
+ 3) collect best/worst pairs for preference-style post-training
+
+ It is intentionally lightweight so teams can run it in Colab with TRL/Unsloth.
+ """
+
+ from __future__ import annotations
+
+ import json
+ import random
+ from dataclasses import dataclass
+ from typing import Any
+
+ import requests
+
+
+ ENV_URL = "http://localhost:7860"
+ MAX_STEPS = 200
+ GROUP_SIZE = 8
+
+
+ @dataclass
+ class RolloutExample:
+     prompt: str
+     chosen: str
+     rejected: str
+     chosen_reward: float
+     rejected_reward: float
+
+
+ def _action_candidates() -> list[dict[str, int]]:
+     all_actions: list[dict[str, int]] = []
+     for gateway in (0, 1, 2):
+         for fraud_decision in (0, 1, 2, 3):
+             for retry_strategy in (0, 1):
+                 all_actions.append(
+                     {
+                         "gateway": gateway,
+                         "fraud_decision": fraud_decision,
+                         "retry_strategy": retry_strategy,
+                     }
+                 )
+     random.shuffle(all_actions)
+     return all_actions
+
+
+ def _simulate_reward(action: dict[str, int]) -> float:
+     response = requests.post(f"{ENV_URL}/simulate", json={"action": action}, timeout=30)
+     response.raise_for_status()
+     obs = response.json()
+     return float(obs.get("reward", 0.0))
+
+
+ def _step(action: dict[str, int]) -> dict[str, Any]:
+     response = requests.post(f"{ENV_URL}/step", json={"action": action}, timeout=30)
+     response.raise_for_status()
+     return response.json()
+
+
+ def _reset(difficulty: int = 2) -> dict[str, Any]:
+     response = requests.post(f"{ENV_URL}/reset", json={"difficulty": difficulty}, timeout=30)
+     response.raise_for_status()
+     payload = response.json()
+     return payload.get("observation", payload)
+
+
+ def collect_group_relative_pairs(max_steps: int = MAX_STEPS, group_size: int = GROUP_SIZE) -> list[RolloutExample]:
+     obs = _reset(difficulty=2)
+     dataset: list[RolloutExample] = []
+     actions_pool = _action_candidates()
+
+     for _ in range(max_steps):
+         sampled = random.sample(actions_pool, k=min(group_size, len(actions_pool)))
+         scored: list[tuple[dict[str, int], float]] = []
+
+         for action in sampled:
+             try:
+                 reward = _simulate_reward(action)
+                 scored.append((action, reward))
+             except requests.RequestException:
+                 continue
+
+         if len(scored) < 2:
+             break
+
+         scored.sort(key=lambda x: x[1], reverse=True)
+         best_action, best_reward = scored[0]
+         worst_action, worst_reward = scored[-1]
+
+         prompt = (
+             "SmartPayEnv observation:\n"
+             f"{json.dumps(obs, sort_keys=True)}\n"
+             "Return one action JSON with fields: gateway, fraud_decision, retry_strategy."
+         )
+
+         dataset.append(
+             RolloutExample(
+                 prompt=prompt,
+                 chosen=json.dumps(best_action, sort_keys=True),
+                 rejected=json.dumps(worst_action, sort_keys=True),
+                 chosen_reward=best_reward,
+                 rejected_reward=worst_reward,
+             )
+         )
+
+         step_payload = _step(best_action)
+         obs = step_payload.get("observation", step_payload)
+         if bool(obs.get("done", False)):
+             obs = _reset(difficulty=2)
+
+     return dataset
+
+
+ def export_jsonl(dataset: list[RolloutExample], output_path: str) -> None:
+     with open(output_path, "w", encoding="utf-8") as f:
+         for row in dataset:
+             f.write(
+                 json.dumps(
+                     {
+                         "prompt": row.prompt,
+                         "chosen": row.chosen,
+                         "rejected": row.rejected,
+                         "chosen_reward": row.chosen_reward,
+                         "rejected_reward": row.rejected_reward,
+                     }
+                 )
+                 + "\n"
+             )
+
+
+ if __name__ == "__main__":
+     data = collect_group_relative_pairs()
+     export_jsonl(data, "theme4_grpo_pairs.jsonl")
+     print(f"Collected {len(data)} preference pairs into theme4_grpo_pairs.jsonl")
server/SmartPayEnv_environment.py CHANGED

@@ -77,6 +77,14 @@ class State:
      active_events: dict = field(default_factory=dict)  # e.g. {"fraud_spike": 10, "outage": 5}
      log_cursor: int = 0
      review_queue: list = field(default_factory=list)  # [{ 'step': int, 'is_fraud': bool, 'amount': float }]
+     curriculum_level: float = 0.0
+     policy_skill_estimate: float = 0.5
+     challenger_skill: float = 0.55
+     recent_rewards: deque = field(default_factory=lambda: deque(maxlen=25))
+     recent_route_scores: deque = field(default_factory=lambda: deque(maxlen=25))
+     recent_fraud_scores: deque = field(default_factory=lambda: deque(maxlen=25))
+     recent_retention_scores: deque = field(default_factory=lambda: deque(maxlen=25))
+     anti_gaming_alerts: int = 0


  class _GatewayState:

@@ -132,6 +140,7 @@ class SmartpayenvEnvironment(Environment):
          self.current_obs = None
          self._log_loader = LogLoader()
          self._pattern_queue = deque()
+         self._meta_curriculum_enabled = True

      def _init_gateways(self) -> None:
          instability = self._cfg["instability"]

@@ -233,8 +242,44 @@ class SmartpayenvEnvironment(Environment):
          self.current_obs = self._generate_transaction()
          # Synchronize simulation clock with the log's starting hour
          self._state.simulation_hour = self.current_obs.time_of_day
+         self._state.curriculum_level = float(self._difficulty)
+         self._state.policy_skill_estimate = 0.5
+         self._state.challenger_skill = 0.55 + (0.08 * self._difficulty)
+         self._state.anti_gaming_alerts = 0
          return self.current_obs

+     def _curriculum_multiplier(self) -> float:
+         return 1.0 + (0.15 * self._state.curriculum_level)
+
+     def _update_self_play_curriculum(self, route_score: float, fraud_score: float, retention_score: float) -> None:
+         """
+         Theme-4 core: self-improvement loop inspired by league training.
+         The policy competes against a moving challenger, and environment complexity
+         scales with sustained performance.
+         """
+         self._state.recent_route_scores.append(route_score)
+         self._state.recent_fraud_scores.append(fraud_score)
+         self._state.recent_retention_scores.append(retention_score)
+         perf = (0.45 * route_score) + (0.35 * fraud_score) + (0.20 * retention_score)
+         self._state.recent_rewards.append(perf)
+
+         if not self._state.recent_rewards:
+             return
+
+         rolling_perf = float(np.mean(self._state.recent_rewards))
+         skill_delta = 0.08 * (rolling_perf - 0.5)
+         self._state.policy_skill_estimate = float(np.clip(self._state.policy_skill_estimate + skill_delta, 0.05, 0.99))
+
+         # PFSP-inspired challenger adaptation: keep matches near the policy frontier.
+         gap = self._state.policy_skill_estimate - self._state.challenger_skill
+         self._state.challenger_skill = float(np.clip(self._state.challenger_skill + (0.06 * gap), 0.05, 0.99))
+
+         if self._meta_curriculum_enabled and len(self._state.recent_rewards) >= 8:
+             if rolling_perf > 0.72:
+                 self._state.curriculum_level = float(np.clip(self._state.curriculum_level + 0.12, 0.0, 2.0))
+             elif rolling_perf < 0.45:
+                 self._state.curriculum_level = float(np.clip(self._state.curriculum_level - 0.08, 0.0, 2.0))
+
      def step(self, action: SmartpayenvAction) -> SmartpayenvObservation:
          self._state.step_count += 1

@@ -273,6 +318,10 @@ class SmartpayenvEnvironment(Environment):
          surge_logs = self._log_loader.get_pattern("fraud_surge", count=5)
          self._pattern_queue.extend(surge_logs)

+         # Curriculum-driven stress events (self-improvement pressure).
+         if self._rng.random() < (0.01 * self._curriculum_multiplier()):
+             self._state.active_events["adversarial_shift"] = int(self._rng.integers(4, 12))
+
          for gw in self._gateways: gw.step()

          # 1. 3DS / Action Logic

@@ -402,14 +451,40 @@ class SmartpayenvEnvironment(Environment):
          fs = self.fraud_grader.evaluate()
          rs = self.retention_grader.evaluate()
          base_reward = (0.4 * route_score) + (0.4 * fs) + (0.2 * rs)
-
-         # Norm punishment for chargebacks
-         final_reward = base_reward - (cb_amt / 150.0)
+
+         # League-style regret: penalize underperforming against the moving challenger.
+         challenger_regret = max(0.0, self._state.challenger_skill - base_reward)
+         regret_penalty = 0.35 * challenger_regret
+
+         # Anti-gaming check: repeatedly overusing manual review without quality gains.
+         gaming_penalty = 0.0
+         if action.fraud_decision == 3 and fs < 0.55 and rs < 0.6:
+             self._state.anti_gaming_alerts += 1
+             gaming_penalty = min(0.12, 0.02 * self._state.anti_gaming_alerts)
+
+         # Curriculum bonus: reward robust performance under higher difficulty pressure.
+         robustness_bonus = 0.06 * self._state.curriculum_level * max(0.0, base_reward - 0.55)
+
+         # Norm punishment for delayed liabilities + self-improvement terms.
+         final_reward = base_reward - (cb_amt / 150.0) - regret_penalty - gaming_penalty + robustness_bonus
          self.current_obs.reward = float(np.clip(final_reward, 0.001, 0.999))

          self.current_obs.task_routing_score = route_score
          self.current_obs.task_fraud_mcc_score = fs
          self.current_obs.task_retention_score = rs
+         self._update_self_play_curriculum(route_score, fs, rs)
+
+         self.current_obs.metadata = {
+             "theme": "self_improvement",
+             "curriculum_level": round(self._state.curriculum_level, 4),
+             "policy_skill_estimate": round(self._state.policy_skill_estimate, 4),
+             "challenger_skill": round(self._state.challenger_skill, 4),
+             "challenger_regret": round(challenger_regret, 4),
+             "gaming_penalty": round(gaming_penalty, 4),
+             "robustness_bonus": round(robustness_bonus, 4),
+             "anti_gaming_alerts": int(self._state.anti_gaming_alerts),
+             "active_events": dict(self._state.active_events),
+         }

          return self.current_obs
server/utils.py CHANGED

@@ -39,6 +39,14 @@ class LogLoader:
          if pattern_type == "fraud_surge":
              # Filter for high fraud risk
              candidates = [l for l in self.logs if l.get("fraud_risk_score", 0) > 0.5]
+         elif pattern_type == "stealth_fraud":
+             candidates = [
+                 l for l in self.logs
+                 if l.get("is_fraud", False)
+                 and "low_risk_disguise" in str(l.get("fraud_strategy", ""))
+             ]
+         elif pattern_type == "velocity_attack":
+             candidates = [l for l in self.logs if float(l.get("transaction_velocity", 0.0)) > 0.7]
          elif pattern_type == "premium_only":
              candidates = [l for l in self.logs if l.get("user_segment") == 2]
          else:
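
With the new pattern types, surge injection can target the evolved fraud styles. A usage sketch mirroring how the environment calls the loader (`self._log_loader.get_pattern("fraud_surge", count=5)`); the import path assumes this repo's layout:

```python
# Sketch: pulling the new pattern types from the regenerated log stream.
from server.utils import LogLoader  # assumes server/ is importable in this repo

loader = LogLoader()
stealth = loader.get_pattern("stealth_fraud", count=5)
bursts = loader.get_pattern("velocity_attack", count=5)
print(len(stealth), len(bursts))
```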