Env improvement
Files changed:
- README.md +50 -1
- data/transactions_log.jsonl +0 -0
- scripts/generate_logs.py +273 -60
- scripts/train_theme4_grpo.py +139 -0
- server/SmartPayEnv_environment.py +78 -3
- server/utils.py +8 -0
README.md
CHANGED

@@ -14,7 +14,7 @@ tags:
 - Reinforcement Learning
 ---
 
-# 💳 SmartPayEnv: Advanced Fintech Reality Layer
+# 💳 SmartPayEnv: Advanced Fintech Reality Layer (Theme 4: Self-Improvement)
 
 **A high-fidelity, production-grade benchmark for training and evaluating AI Agents (LLMs/RL) on the messy reality of global payment orchestration.**
 
@@ -23,6 +23,12 @@ tags:
 
 SmartPayEnv bridges the gap between simple simulations and production fintech. It models the adversarial loops, infrastructure instability, and delayed feedback cycles that define modern payment systems.
 
+This release is explicitly upgraded for **OpenEnv Hackathon Theme #4 (Self-Improvement)**, with a light blend of Themes #1 and #2:
+- **League-style challenger dynamics** inside the environment (the agent vs. a moving opponent skill frontier).
+- **Adaptive curriculum** that auto-escalates pressure after sustained performance and de-escalates after regressions.
+- **Anti-reward-hacking penalties** for degenerate policies (e.g., overusing manual review without fraud/retention quality).
+- **Long-horizon credit pressure** through delayed chargebacks + review queues + temporal events.
+
 ---
 
 ## 🚀 Why SmartPayEnv?
@@ -122,6 +128,12 @@ Agents can send transactions to manual review (Action 3). Resolutions are 100% a
 - **📊 BIN-Gateway Affinity**: A hidden matrix of gateway performance across different card types. Agents must discover these affinities to optimize routing success.
 - **🧠 Preference-Based Learning (Simulation Branching)**: Supports advanced training (e.g., DPO/PPO) by allowing agents to "what-if" multiple actions from the same state via the `/simulate` endpoint. Agents can group similar contexts (BIN + Amount + Risk) and learn from relative advantages.
 
+### 5. Self-Improving Meta-Curriculum (NEW)
+- **📈 Curriculum Level**: Each episode now tracks a continuous curriculum level (0–2) that increases after sustained high rolling performance.
+- **🥊 Challenger Skill**: A moving challenger-policy skill estimate is maintained and used to compute regret-style penalties when the active policy underperforms.
+- **🧯 Anti-Gaming Guardrails**: Repeatedly selecting costly manual review without corresponding quality gains triggers adaptive penalties.
+- **🧠 Metadata for Training**: Step metadata exposes `curriculum_level`, `policy_skill_estimate`, `challenger_skill`, and the shaping terms to support richer RL diagnostics (a reading sketch follows this hunk).
+
 ---
 
 ## 🎯 Benchmark Tasks
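For training-loop diagnostics, these signals can be read straight off a step response. A minimal sketch, assuming the server from `uv run -m SmartPayEnv.server.app` is listening on port 7860 and that responses wrap the observation the way `scripts/train_theme4_grpo.py` expects:

```python
# Read the self-improvement metadata from one environment step.
# Endpoint paths, payload shape, and the example action follow
# scripts/train_theme4_grpo.py in this commit.
import requests

BASE = "http://localhost:7860"

requests.post(f"{BASE}/reset", json={"difficulty": 2}, timeout=30)
payload = requests.post(
    f"{BASE}/step",
    json={"action": {"gateway": 0, "fraud_decision": 0, "retry_strategy": 0}},
    timeout=30,
).json()
obs = payload.get("observation", payload)

# Keys exposed by server/SmartPayEnv_environment.py (see its section below).
meta = obs.get("metadata", {})
for key in ("curriculum_level", "policy_skill_estimate", "challenger_skill",
            "challenger_regret", "gaming_penalty", "robustness_bonus"):
    print(key, meta.get(key))
```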
@@ -180,6 +192,22 @@ SmartPayEnv enables GRPO by providing the infrastructure for **Group Sampling**
 - **Learnable Gradients**: Unlike binary simulations, our **Deterministic Graders** (see Scoring section) map fuzzy outcomes to continuous rewards $[0, 1]$. This prevents the "sparse reward" problem and provides stable gradients for PPO clip-range optimization.
 - **Context Bucketing**: The `server/preference_utils.py` module allows agents to bundle similar (BIN, Amount, Risk) states, enabling faster convergence on preference-based objectives.
 
+### 3. Theme-4 Group-Relative Collection (NEW)
+- Use `scripts/train_theme4_grpo.py` to build **group-relative preference pairs** from online interactions:
+  - sample an action group for each live observation
+  - rank the group via `/simulate` rewards
+  - export best-vs-worst pairs (`theme4_grpo_pairs.jsonl`; a loading sketch follows this hunk)
+- This supports novel post-training flows in **HF TRL / Unsloth** and aligns with modern critic-free RL ideas.
+
+---
+
+## 📚 Research-Inspired Design
+
+The self-improving upgrades are inspired by:
+- **League / PFSP dynamics** for avoiding cyclic overfitting and improving robustness: [AlphaStar (Nature, 2019)](https://www.nature.com/articles/s41586-019-1724-z)
+- **Group-relative policy updates** for efficient critic-free optimization: [DeepSeekMath / GRPO (arXiv:2402.03300)](https://arxiv.org/abs/2402.03300)
+- **Cross-play and equilibrium-oriented opponent diversity**: [Fictitious Cross-Play (arXiv:2310.03354)](https://arxiv.org/abs/2310.03354)
+
 ---
 
 ## 📐 Data Models
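For orientation, a short sketch of reading back one exported pair; the field names come from `export_jsonl` in `scripts/train_theme4_grpo.py`, and the file must first be produced by running that script:

```python
# Inspect one exported group-relative preference pair.
import json

with open("theme4_grpo_pairs.jsonl", encoding="utf-8") as f:
    pair = json.loads(f.readline())

print(pair["chosen"], "->", pair["chosen_reward"])      # best-ranked action JSON
print(pair["rejected"], "->", pair["rejected_reward"])  # worst-ranked action JSON
print(pair["prompt"].splitlines()[0])                   # serialized observation header
```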
@@ -219,6 +247,9 @@ cd SmartPayEnv
 # Install dependencies
 uv sync
 
+# (Recommended) Regenerate realistic synthetic data
+python scripts/generate_logs.py --num-transactions 20000 --n-users 5000 --seed 42
+
 # Run the OpenEnv validation suite
 openenv validate
 
@@ -233,6 +264,24 @@ uv run -m SmartPayEnv.server.app
 ```
 Access the **Swagger UI** at `http://localhost:7860/` (auto-redirects to `/docs`).
 
+### 4. Synthetic Data World Generator (NEW)
+Use this when you want realistic, evolving, "real-world-like" transaction streams:
+
+```bash
+python scripts/generate_logs.py \
+  --output data/transactions_log.jsonl \
+  --num-transactions 20000 \
+  --n-users 5000 \
+  --seed 42 \
+  --base-fraud-rate 0.08
+```
+
+What gets generated:
+- **Normal baseline behavior** (segment-based spend, location/device consistency, time-of-day effects)
+- **Seed fraud templates** (`high_value_spike`, `velocity_burst`, `geo_anomaly`, `device_spoof`, `split_transactions`)
+- **Adaptive fraud evolution** (strategy composition and stealth attacks such as `low_risk_disguise`)
+- **Strategy labels for storytelling** via `fraud_strategy` and `event_marker`
+
 ### 3. Multi-Mode Deployment (Docker)
 ```bash
 # Build the production image
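To sanity-check a generated stream, the strategy labels can be tallied directly; a minimal sketch (field names are from `scripts/generate_logs.py`):

```python
# Tally fraud strategies in the generated stream.
import json
from collections import Counter

strategies = Counter()
with open("data/transactions_log.jsonl", encoding="utf-8") as f:
    for line in f:
        tx = json.loads(line)
        if tx.get("is_fraud"):
            strategies[tx["fraud_strategy"]] += 1

# Composed strategies are joined with "+", e.g. "velocity_burst+device_spoof".
print(strategies.most_common(10))
```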
data/transactions_log.jsonl
CHANGED
The diff for this file is too large to render; see the raw diff.
scripts/generate_logs.py
CHANGED

@@ -1,68 +1,281 @@
+import argparse
 import json
-import numpy as np
 import os
-from
+from collections import defaultdict, deque
+
+import numpy as np
+
+
+LOCATIONS = ["Bangalore", "Mumbai", "Delhi", "Hyderabad", "Chennai", "Pune", "Kolkata", "Europe", "Singapore"]
+SEGMENT_LABELS = {0: "new", 1: "existing", 2: "premium"}
+BASE_MCC_DIST = [0.30, 0.20, 0.10, 0.10, 0.10, 0.20]
+HIGH_RISK_MCCS = {2, 4, 5}
+RISKY_HOURS = {1, 2, 3, 4, 5}
+
+
+def _time_bucket(hour: int) -> str:
+    if 0 <= hour <= 5:
+        return "night"
+    if 6 <= hour <= 11:
+        return "morning"
+    if 12 <= hour <= 17:
+        return "afternoon"
+    return "evening"
+
+
+def _sample_user_profiles(rng: np.random.Generator, n_users: int) -> list[dict]:
+    profiles: list[dict] = []
+    for uid in range(n_users):
+        segment = int(rng.choice([0, 1, 2], p=[0.30, 0.55, 0.15]))
+        traveler = bool(rng.random() < {0: 0.08, 1: 0.15, 2: 0.35}[segment])
+        home = str(rng.choice(LOCATIONS[:7]))
+        preferred_mcc = int(rng.choice([0, 1, 3, 5], p=[0.35, 0.25, 0.20, 0.20]))
+        profiles.append(
+            {
+                "user_id": f"user_{uid}",
+                "user_segment": segment,
+                "frequent_traveler": traveler,
+                "home_location": home,
+                "preferred_mcc": preferred_mcc,
+                "base_device_type": int(rng.choice([0, 1, 2], p=[0.55, 0.35, 0.10])),
+                "base_spend_mu": {0: 3.8, 1: 4.5, 2: 5.0}[segment],
+                "base_spend_sigma": {0: 0.70, 1: 0.75, 2: 0.85}[segment],
+                "history_base": {0: 0.35, 1: 0.72, 2: 0.88}[segment],
+            }
+        )
+    return profiles
+
+
+def _normal_transaction(
+    rng: np.random.Generator,
+    profile: dict,
+    hour: int,
+    user_recent_times: deque,
+    user_recent_amounts: deque,
+) -> dict:
+    mcc_probs = np.array(BASE_MCC_DIST, dtype=float)
+    mcc_probs[profile["preferred_mcc"]] += 0.18
+    mcc_probs = mcc_probs / mcc_probs.sum()
+    mcc = int(rng.choice([0, 1, 2, 3, 4, 5], p=mcc_probs))
+
+    amount = float(rng.lognormal(mean=profile["base_spend_mu"], sigma=profile["base_spend_sigma"]))
+    if mcc in HIGH_RISK_MCCS:
+        amount *= 1.35
+
+    location = profile["home_location"]
+    is_international = False
+    if profile["frequent_traveler"] and rng.random() < 0.10:
+        location = str(rng.choice(["Europe", "Singapore"]))
+        is_international = True
+
+    device_type = profile["base_device_type"]
+    if rng.random() < 0.07:
+        device_type = int(rng.choice([0, 1, 2]))
+
+    velocity = float(min(12, len([t for t in user_recent_times if hour - t <= 1])))
+    velocity_norm = float(np.clip(velocity / 10.0, 0.05, 0.98))
+
+    risk = 0.02
+    risk += 0.06 if hour in RISKY_HOURS else 0.0
+    risk += 0.05 if mcc in HIGH_RISK_MCCS else 0.0
+    risk += 0.04 if device_type != profile["base_device_type"] else 0.0
+    risk += 0.03 if is_international else 0.0
+    risk += 0.08 * velocity_norm
+    risk += rng.normal(0.0, 0.02)
+
+    return {
+        "amount": float(np.clip(amount, 5.0, 150000.0)),
+        "currency": "INR",
+        "time": _time_bucket(hour),
+        "merchant_category": mcc,
+        "location": location,
+        "is_international": is_international,
+        "card_present": bool(rng.random() > 0.45),
+        "user_segment": profile["user_segment"],
+        "user_history_score": float(np.clip(rng.normal(profile["history_base"], 0.12), 0.05, 1.0)),
+        "device_type": device_type,
+        "ip_risk": float(np.clip(rng.normal(0.10 if location == profile["home_location"] else 0.45, 0.08), 0.01, 0.99)),
+        "bin_category": int(rng.integers(0, 10)),
+        "time_of_day": int(hour),
+        "transaction_velocity": velocity_norm,
+        "fraud_risk_score": float(np.clip(risk, 0.01, 0.99)),
+        "fraud_strategy": "none",
+        "event_marker": None,
+        "is_fraud": False,
+    }
+
+
+def _fraud_agent_strategy_mix(
+    rng: np.random.Generator,
+    attack_level: float,
+) -> list[str]:
+    templates = [
+        ("high_value_spike", 0.20),
+        ("velocity_burst", 0.22),
+        ("geo_anomaly", 0.16),
+        ("device_spoof", 0.18),
+        ("split_transactions", 0.14),
+        ("low_risk_disguise", 0.10),
+    ]
+    weights = np.array([w for _, w in templates], dtype=float)
+    # Self-improving fraud agent: shifts toward stealth blends as the defender hardens.
+    stealth_boost = min(0.18, 0.06 * attack_level)
+    weights[5] += stealth_boost
+    weights[4] += stealth_boost * 0.8
+    weights = weights / weights.sum()
 
+    k = 1 if attack_level < 1.0 else (2 if rng.random() < 0.75 else 3)
+    selected = rng.choice([name for name, _ in templates], size=k, replace=False, p=weights)
+    return list(selected)
+
+
+def _apply_fraud_strategy(
+    rng: np.random.Generator,
+    tx: dict,
+    profile: dict,
+    strategies: list[str],
+) -> list[dict]:
+    tx = dict(tx)
+    event_markers = []
+
+    for s in strategies:
+        if s == "high_value_spike":
+            tx["amount"] = float(min(200000.0, tx["amount"] * rng.uniform(6.0, 18.0)))
+            event_markers.append("high_value_spike")
+        elif s == "velocity_burst":
+            tx["transaction_velocity"] = float(np.clip(tx["transaction_velocity"] + rng.uniform(0.45, 0.85), 0.1, 0.99))
+            event_markers.append("velocity_burst")
+        elif s == "geo_anomaly":
+            tx["location"] = str(rng.choice(["Europe", "Singapore"]))
+            tx["is_international"] = True
+            tx["ip_risk"] = float(np.clip(tx["ip_risk"] + rng.uniform(0.25, 0.50), 0.01, 0.99))
+            event_markers.append("geo_anomaly")
+        elif s == "device_spoof":
+            tx["device_type"] = int((profile["base_device_type"] + int(rng.integers(1, 3))) % 3)
+            tx["card_present"] = False
+            tx["ip_risk"] = float(np.clip(tx["ip_risk"] + rng.uniform(0.18, 0.35), 0.01, 0.99))
+            event_markers.append("device_spoof")
+        elif s == "split_transactions":
+            # Converted to multiple low-value events that preserve a high total.
+            pieces = int(rng.integers(4, 10))
+            each_amount = float(max(1500.0, tx["amount"] * rng.uniform(0.10, 0.22)))
+            generated = []
+            for _ in range(pieces):
+                p = dict(tx)
+                p["amount"] = each_amount
+                p["transaction_velocity"] = float(np.clip(tx["transaction_velocity"] + rng.uniform(0.35, 0.55), 0.1, 0.99))
+                p["event_marker"] = "split_transactions"
+                p["fraud_strategy"] = "split_transactions"
+                p["is_fraud"] = True
+                risk = p["fraud_risk_score"] + rng.uniform(0.18, 0.32)
+                p["fraud_risk_score"] = float(np.clip(risk, 0.01, 0.99))
+                generated.append(p)
+            return generated
+        elif s == "low_risk_disguise":
+            # Fraud tries to look normal: lower explicit risk while preserving anomalies elsewhere.
+            tx["amount"] = float(np.clip(tx["amount"] * rng.uniform(0.18, 0.35), 250.0, 12000.0))
+            tx["merchant_category"] = int(rng.choice([0, 1, 3], p=[0.5, 0.3, 0.2]))
+            tx["fraud_risk_score"] = float(np.clip(tx["fraud_risk_score"] - rng.uniform(0.08, 0.20), 0.02, 0.80))
+            event_markers.append("low_risk_disguise")
+
+    tx["fraud_strategy"] = "+".join(strategies)
+    tx["event_marker"] = "|".join(event_markers) if event_markers else "fraud_pattern"
+    tx["is_fraud"] = True
+    tx["fraud_risk_score"] = float(np.clip(tx["fraud_risk_score"] + rng.uniform(0.18, 0.42), 0.01, 0.99))
+    return [tx]
+
+
+def generate_logs(
+    output_path: str = "data/transactions_log.jsonl",
+    num_transactions: int = 15000,
+    n_users: int = 4000,
+    seed: int = 7,
+    base_fraud_rate: float = 0.08,
+) -> None:
+    """
+    Generates realistic synthetic payment logs with an evolving fraud adversary.
+    """
+    rng = np.random.default_rng(seed)
     os.makedirs(os.path.dirname(output_path), exist_ok=True)
+
+    profiles = _sample_user_profiles(rng, n_users=n_users)
+    user_recent_times: dict[str, deque] = defaultdict(lambda: deque(maxlen=40))
+    user_recent_amounts: dict[str, deque] = defaultdict(lambda: deque(maxlen=40))
+
     current_hour = 0
-steps_per_hour =
+    steps_per_hour = 90
+    global_attack_level = 0.0
+    defender_pressure = 0.0
+
+    records_written = 0
+    with open(output_path, "w", encoding="utf-8") as f:
+        while records_written < num_transactions:
+            if records_written % steps_per_hour == 0:
                 current_hour = (current_hour + 1) % 24
+
+            profile = profiles[int(rng.integers(0, len(profiles)))]
+            uid = profile["user_id"]
+
+            base_tx = _normal_transaction(
+                rng=rng,
+                profile=profile,
+                hour=current_hour,
+                user_recent_times=user_recent_times[uid],
+                user_recent_amounts=user_recent_amounts[uid],
+            )
+
+            fraud_p = base_fraud_rate + (0.05 if current_hour in RISKY_HOURS else 0.0) + (0.07 * global_attack_level)
+            fraud_p = float(np.clip(fraud_p, 0.01, 0.55))
+            is_attack = bool(rng.random() < fraud_p)
+
+            if is_attack:
+                strategies = _fraud_agent_strategy_mix(rng, attack_level=global_attack_level)
+                txs = _apply_fraud_strategy(rng, base_tx, profile, strategies)
+            else:
+                txs = [base_tx]
+
+            for tx in txs:
+                tx["user_id"] = uid
+                tx["user_profile"] = {
+                    "segment": SEGMENT_LABELS[profile["user_segment"]],
+                    "frequent_traveler": profile["frequent_traveler"],
+                    "home_location": profile["home_location"],
+                }
+                tx["attack_level"] = round(float(global_attack_level), 4)
+                tx["defender_pressure"] = round(float(defender_pressure), 4)
+                f.write(json.dumps(tx) + "\n")
+                records_written += 1
+
+                user_recent_times[uid].append(current_hour)
+                user_recent_amounts[uid].append(tx["amount"])
+                if records_written >= num_transactions:
+                    break
+
+            # Self-improvement dynamics:
+            # when fraud is frequently obvious, increase defender pressure;
+            # when stealth fraud appears, raise attack sophistication.
+            if is_attack and any("low_risk_disguise" in t.get("fraud_strategy", "") for t in txs):
+                global_attack_level = float(np.clip(global_attack_level + 0.015, 0.0, 3.0))
+            elif is_attack:
+                defender_pressure = float(np.clip(defender_pressure + 0.010, 0.0, 2.5))
+            else:
+                global_attack_level = float(np.clip(global_attack_level + 0.002 - (0.001 * defender_pressure), 0.0, 3.0))
 
 if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Generate synthetic SmartPayEnv transaction logs.")
+    parser.add_argument("--output", default="data/transactions_log.jsonl", help="Output JSONL file path")
+    parser.add_argument("--num-transactions", type=int, default=15000, help="Number of transactions")
+    parser.add_argument("--n-users", type=int, default=4000, help="Number of synthetic users")
+    parser.add_argument("--seed", type=int, default=7, help="Random seed")
+    parser.add_argument("--base-fraud-rate", type=float, default=0.08, help="Baseline fraud probability")
+    args = parser.parse_args()
+
+    generate_logs(
+        output_path=args.output,
+        num_transactions=args.num_transactions,
+        n_users=args.n_users,
+        seed=args.seed,
+        base_fraud_rate=args.base_fraud_rate,
+    )
+    print(f"Generated {args.num_transactions} synthetic transactions at {args.output}")
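Because `attack_level` and `defender_pressure` are stamped onto every record, the adversarial dynamics in the loop above can be inspected after a run. A small sketch:

```python
# Compare adversary/defender levels at the start and end of the stream.
import json

with open("data/transactions_log.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

def avg(rows: list[dict], key: str) -> float:
    return sum(r[key] for r in rows) / len(rows)

for key in ("attack_level", "defender_pressure"):
    print(key, round(avg(records[:500], key), 4), "->", round(avg(records[-500:], key), 4))
```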
scripts/train_theme4_grpo.py
ADDED

@@ -0,0 +1,139 @@
+"""
+Theme-4 training starter for SmartPayEnv.
+
+This script demonstrates a novel self-improvement loop:
+1) sample K candidate actions per observation
+2) score each candidate with /simulate rewards (group-relative signal)
+3) collect best/worst pairs for preference-style post-training
+
+It is intentionally lightweight so teams can run it in Colab with TRL/Unsloth.
+"""
+
+from __future__ import annotations
+
+import json
+import random
+from dataclasses import dataclass
+from typing import Any
+
+import requests
+
+
+ENV_URL = "http://localhost:7860"
+MAX_STEPS = 200
+GROUP_SIZE = 8
+
+
+@dataclass
+class RolloutExample:
+    prompt: str
+    chosen: str
+    rejected: str
+    chosen_reward: float
+    rejected_reward: float
+
+
+def _action_candidates() -> list[dict[str, int]]:
+    all_actions: list[dict[str, int]] = []
+    for gateway in (0, 1, 2):
+        for fraud_decision in (0, 1, 2, 3):
+            for retry_strategy in (0, 1):
+                all_actions.append(
+                    {
+                        "gateway": gateway,
+                        "fraud_decision": fraud_decision,
+                        "retry_strategy": retry_strategy,
+                    }
+                )
+    random.shuffle(all_actions)
+    return all_actions
+
+
+def _simulate_reward(action: dict[str, int]) -> float:
+    response = requests.post(f"{ENV_URL}/simulate", json={"action": action}, timeout=30)
+    response.raise_for_status()
+    obs = response.json()
+    return float(obs.get("reward", 0.0))
+
+
+def _step(action: dict[str, int]) -> dict[str, Any]:
+    response = requests.post(f"{ENV_URL}/step", json={"action": action}, timeout=30)
+    response.raise_for_status()
+    return response.json()
+
+
+def _reset(difficulty: int = 2) -> dict[str, Any]:
+    response = requests.post(f"{ENV_URL}/reset", json={"difficulty": difficulty}, timeout=30)
+    response.raise_for_status()
+    payload = response.json()
+    return payload.get("observation", payload)
+
+
+def collect_group_relative_pairs(max_steps: int = MAX_STEPS, group_size: int = GROUP_SIZE) -> list[RolloutExample]:
+    obs = _reset(difficulty=2)
+    dataset: list[RolloutExample] = []
+    actions_pool = _action_candidates()
+
+    for _ in range(max_steps):
+        sampled = random.sample(actions_pool, k=min(group_size, len(actions_pool)))
+        scored: list[tuple[dict[str, int], float]] = []
+
+        for action in sampled:
+            try:
+                reward = _simulate_reward(action)
+                scored.append((action, reward))
+            except requests.RequestException:
+                continue
+
+        if len(scored) < 2:
+            break
+
+        scored.sort(key=lambda x: x[1], reverse=True)
+        best_action, best_reward = scored[0]
+        worst_action, worst_reward = scored[-1]
+
+        prompt = (
+            "SmartPayEnv observation:\n"
+            f"{json.dumps(obs, sort_keys=True)}\n"
+            "Return one action JSON with fields: gateway, fraud_decision, retry_strategy."
+        )
+
+        dataset.append(
+            RolloutExample(
+                prompt=prompt,
+                chosen=json.dumps(best_action, sort_keys=True),
+                rejected=json.dumps(worst_action, sort_keys=True),
+                chosen_reward=best_reward,
+                rejected_reward=worst_reward,
+            )
+        )
+
+        step_payload = _step(best_action)
+        obs = step_payload.get("observation", step_payload)
+        if bool(obs.get("done", False)):
+            obs = _reset(difficulty=2)
+
+    return dataset
+
+
+def export_jsonl(dataset: list[RolloutExample], output_path: str) -> None:
+    with open(output_path, "w", encoding="utf-8") as f:
+        for row in dataset:
+            f.write(
+                json.dumps(
+                    {
+                        "prompt": row.prompt,
+                        "chosen": row.chosen,
+                        "rejected": row.rejected,
+                        "chosen_reward": row.chosen_reward,
+                        "rejected_reward": row.rejected_reward,
+                    }
+                )
+                + "\n"
+            )
+
+
+if __name__ == "__main__":
+    data = collect_group_relative_pairs()
+    export_jsonl(data, "theme4_grpo_pairs.jsonl")
+    print(f"Collected {len(data)} preference pairs into theme4_grpo_pairs.jsonl")
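The exported pairs drop directly into preference-based trainers. A minimal post-training sketch, assuming recent `trl`, `datasets`, and `transformers` releases; the model name and hyperparameters are placeholders, not project defaults:

```python
# Hedged sketch: fine-tune a small LM on theme4_grpo_pairs.jsonl with TRL's DPOTrainer.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

pairs = load_dataset("json", data_files="theme4_grpo_pairs.jsonl", split="train")
# DPO consumes the prompt/chosen/rejected columns; drop the bookkeeping rewards.
pairs = pairs.remove_columns(["chosen_reward", "rejected_reward"])

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder small model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="smartpay-dpo", beta=0.1, per_device_train_batch_size=2),
    train_dataset=pairs,
    processing_class=tokenizer,
)
trainer.train()
```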
server/SmartPayEnv_environment.py
CHANGED

@@ -77,6 +77,14 @@ class State:
     active_events: dict = field(default_factory=dict)  # e.g. {"fraud_spike": 10, "outage": 5}
     log_cursor: int = 0
     review_queue: list = field(default_factory=list)  # [{ 'step': int, 'is_fraud': bool, 'amount': float }]
+    curriculum_level: float = 0.0
+    policy_skill_estimate: float = 0.5
+    challenger_skill: float = 0.55
+    recent_rewards: deque = field(default_factory=lambda: deque(maxlen=25))
+    recent_route_scores: deque = field(default_factory=lambda: deque(maxlen=25))
+    recent_fraud_scores: deque = field(default_factory=lambda: deque(maxlen=25))
+    recent_retention_scores: deque = field(default_factory=lambda: deque(maxlen=25))
+    anti_gaming_alerts: int = 0
 
 
 class _GatewayState:
@@ -132,6 +140,7 @@ class SmartpayenvEnvironment(Environment):
         self.current_obs = None
         self._log_loader = LogLoader()
         self._pattern_queue = deque()
+        self._meta_curriculum_enabled = True
 
     def _init_gateways(self) -> None:
         instability = self._cfg["instability"]
@@ -233,8 +242,44 @@
         self.current_obs = self._generate_transaction()
         # Synchronize simulation clock with the log's starting hour
         self._state.simulation_hour = self.current_obs.time_of_day
+        self._state.curriculum_level = float(self._difficulty)
+        self._state.policy_skill_estimate = 0.5
+        self._state.challenger_skill = 0.55 + (0.08 * self._difficulty)
+        self._state.anti_gaming_alerts = 0
         return self.current_obs
 
+    def _curriculum_multiplier(self) -> float:
+        return 1.0 + (0.15 * self._state.curriculum_level)
+
+    def _update_self_play_curriculum(self, route_score: float, fraud_score: float, retention_score: float) -> None:
+        """
+        Theme-4 core: self-improvement loop inspired by league training.
+        The policy competes against a moving challenger, and environment
+        complexity scales with sustained performance.
+        """
+        self._state.recent_route_scores.append(route_score)
+        self._state.recent_fraud_scores.append(fraud_score)
+        self._state.recent_retention_scores.append(retention_score)
+        perf = (0.45 * route_score) + (0.35 * fraud_score) + (0.20 * retention_score)
+        self._state.recent_rewards.append(perf)
+
+        if not self._state.recent_rewards:
+            return
+
+        rolling_perf = float(np.mean(self._state.recent_rewards))
+        skill_delta = 0.08 * (rolling_perf - 0.5)
+        self._state.policy_skill_estimate = float(np.clip(self._state.policy_skill_estimate + skill_delta, 0.05, 0.99))
+
+        # PFSP-inspired challenger adaptation: keep matches near the policy frontier.
+        gap = self._state.policy_skill_estimate - self._state.challenger_skill
+        self._state.challenger_skill = float(np.clip(self._state.challenger_skill + (0.06 * gap), 0.05, 0.99))
+
+        if self._meta_curriculum_enabled and len(self._state.recent_rewards) >= 8:
+            if rolling_perf > 0.72:
+                self._state.curriculum_level = float(np.clip(self._state.curriculum_level + 0.12, 0.0, 2.0))
+            elif rolling_perf < 0.45:
+                self._state.curriculum_level = float(np.clip(self._state.curriculum_level - 0.08, 0.0, 2.0))
+
     def step(self, action: SmartpayenvAction) -> SmartpayenvObservation:
         self._state.step_count += 1
 
@@ -273,6 +318,10 @@
             surge_logs = self._log_loader.get_pattern("fraud_surge", count=5)
             self._pattern_queue.extend(surge_logs)
 
+        # Curriculum-driven stress events (self-improvement pressure).
+        if self._rng.random() < (0.01 * self._curriculum_multiplier()):
+            self._state.active_events["adversarial_shift"] = int(self._rng.integers(4, 12))
+
         for gw in self._gateways: gw.step()
 
         # 1. 3DS / Action Logic
@@ -402,14 +451,40 @@
         fs = self.fraud_grader.evaluate()
         rs = self.retention_grader.evaluate()
         base_reward = (0.4 * route_score) + (0.4 * fs) + (0.2 * rs)
+
+        # League-style regret: penalize underperforming against the moving challenger.
+        challenger_regret = max(0.0, self._state.challenger_skill - base_reward)
+        regret_penalty = 0.35 * challenger_regret
+
+        # Anti-gaming check: repeatedly overusing manual review without quality gains.
+        gaming_penalty = 0.0
+        if action.fraud_decision == 3 and fs < 0.55 and rs < 0.6:
+            self._state.anti_gaming_alerts += 1
+            gaming_penalty = min(0.12, 0.02 * self._state.anti_gaming_alerts)
+
+        # Curriculum bonus: reward robust performance under higher difficulty pressure.
+        robustness_bonus = 0.06 * self._state.curriculum_level * max(0.0, base_reward - 0.55)
+
+        # Net out delayed liabilities plus the self-improvement shaping terms.
+        final_reward = base_reward - (cb_amt / 150.0) - regret_penalty - gaming_penalty + robustness_bonus
         self.current_obs.reward = float(np.clip(final_reward, 0.001, 0.999))
 
         self.current_obs.task_routing_score = route_score
         self.current_obs.task_fraud_mcc_score = fs
         self.current_obs.task_retention_score = rs
+        self._update_self_play_curriculum(route_score, fs, rs)
+
+        self.current_obs.metadata = {
+            "theme": "self_improvement",
+            "curriculum_level": round(self._state.curriculum_level, 4),
+            "policy_skill_estimate": round(self._state.policy_skill_estimate, 4),
+            "challenger_skill": round(self._state.challenger_skill, 4),
+            "challenger_regret": round(challenger_regret, 4),
+            "gaming_penalty": round(gaming_penalty, 4),
+            "robustness_bonus": round(robustness_bonus, 4),
+            "anti_gaming_alerts": int(self._state.anti_gaming_alerts),
+            "active_events": dict(self._state.active_events),
+        }
 
         return self.current_obs
 
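To make the shaping concrete, a standalone arithmetic check of the `final_reward` expression above; the constants are copied from the hunk, and the input values are illustrative only:

```python
# Illustrative pass through the shaped-reward formula.
base_reward = 0.62        # (0.4 * route_score) + (0.4 * fs) + (0.2 * rs)
cb_amt = 9.0              # delayed chargeback amount landing this step
challenger_skill = 0.70
curriculum_level = 1.5

regret_penalty = 0.35 * max(0.0, challenger_skill - base_reward)            # 0.028
gaming_penalty = 0.0                                                        # no manual-review abuse
robustness_bonus = 0.06 * curriculum_level * max(0.0, base_reward - 0.55)   # 0.0063

final_reward = base_reward - (cb_amt / 150.0) - regret_penalty - gaming_penalty + robustness_bonus
print(round(final_reward, 4))  # 0.62 - 0.06 - 0.028 - 0.0 + 0.0063 = 0.5383
```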
server/utils.py
CHANGED

@@ -39,6 +39,14 @@ class LogLoader:
         if pattern_type == "fraud_surge":
             # Filter for high fraud risk
             candidates = [l for l in self.logs if l.get("fraud_risk_score", 0) > 0.5]
+        elif pattern_type == "stealth_fraud":
+            candidates = [
+                l for l in self.logs
+                if l.get("is_fraud", False)
+                and "low_risk_disguise" in str(l.get("fraud_strategy", ""))
+            ]
+        elif pattern_type == "velocity_attack":
+            candidates = [l for l in self.logs if float(l.get("transaction_velocity", 0.0)) > 0.7]
         elif pattern_type == "premium_only":
             candidates = [l for l in self.logs if l.get("user_segment") == 2]
         else:
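A usage sketch for the new pattern filters; the import path is an assumption, and `get_pattern(..., count=5)` mirrors the call `SmartPayEnv_environment.py` makes for `"fraud_surge"`:

```python
# Hypothetical usage of the new LogLoader patterns (import path assumed).
from server.utils import LogLoader

loader = LogLoader()
stealth = loader.get_pattern("stealth_fraud", count=5)    # disguised fraud records
bursts = loader.get_pattern("velocity_attack", count=5)   # high-velocity records
print(len(stealth), len(bursts))
```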