Pratap-K committed
Commit 640cca9 · 1 Parent(s): 27a0d2f

Env improvement
README.md CHANGED

@@ -14,7 +14,7 @@ tags:
  - Reinforcement Learning
  ---

- # 💳 SmartPayEnv: Advanced Fintech Reality Layer
+ # 💳 SmartPayEnv: Advanced Fintech Reality Layer (Theme 4: Self-Improvement)

  **A high-fidelity, production-grade benchmark for training and evaluating AI Agents (LLMs/RL) on the messy reality of global payment orchestration.**

@@ -23,6 +23,12 @@ tags:

  SmartPayEnv bridges the gap between simple simulations and production fintech. It models the adversarial loops, infrastructure instability, and delayed feedback cycles that define modern payment systems.

+ This release is explicitly upgraded for **OpenEnv Hackathon Theme #4 (Self-Improvement)**, with a light blend of Theme #1 and Theme #2:
+ - **League-style challenger dynamics** inside the environment (the agent faces a moving opponent skill frontier).
+ - **Adaptive curriculum** that auto-escalates pressure after sustained performance and de-escalates after regressions.
+ - **Anti-reward-hacking penalties** for degenerate policies (e.g., overusing manual review without fraud/retention quality gains).
+ - **Long-horizon credit pressure** through delayed chargebacks, review queues, and temporal events.
+
  ---

  ## 🚀 Why SmartPayEnv?
 
@@ -122,6 +128,12 @@ Agents can send transactions to manual review (Action 3). Resolutions are 100% a
  - **📊 BIN-Gateway Affinity**: A hidden matrix of gateway performance across different card types. Agents must discover these affinities to optimize routing success.
  - **🧠 Preference-Based Learning (Simulation Branching)**: Supports advanced training (e.g., DPO/PPO) by allowing agents to "What-if" multiple actions from the same state via the `/simulate` endpoint. Agents can group similar contexts (BIN + Amount + Risk) and learn from relative advantages.

+ ### 5. Self-Improving Meta-Curriculum (NEW)
+ - **📈 Curriculum Level**: Each episode tracks a continuous curriculum level (0-2) that rises after sustained high rolling performance.
+ - **🥊 Challenger Skill**: A moving estimate of a challenger policy's skill is maintained and used to compute regret-style penalties when the active policy underperforms it.
+ - **🧯 Anti-Gaming Guardrails**: Repeatedly selecting costly manual review without corresponding quality gains triggers adaptive penalties.
+ - **🧠 Metadata for Training**: Step metadata exposes `curriculum_level`, `policy_skill_estimate`, `challenger_skill`, and the reward-shaping terms for richer RL diagnostics (see the sketch after this hunk).
+
  ---

  ## 🎯 Benchmark Tasks
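
For reference, a minimal sketch of how a training loop might read these diagnostics over HTTP. The metadata field names come from the environment changes in this commit; the request/response envelope is an assumption based on `scripts/train_theme4_grpo.py` and may differ in your deployment.

```python
# Sketch: reading self-improvement diagnostics during a rollout.
# Assumes the payload shapes used by scripts/train_theme4_grpo.py.
import requests

ENV_URL = "http://localhost:7860"

action = {"gateway": 0, "fraud_decision": 0, "retry_strategy": 0}
payload = requests.post(f"{ENV_URL}/step", json={"action": action}, timeout=30).json()
obs = payload.get("observation", payload)

meta = obs.get("metadata", {})
print("curriculum level:", meta.get("curriculum_level"))
print("policy vs. challenger:", meta.get("policy_skill_estimate"), meta.get("challenger_skill"))
print("shaping terms:", meta.get("gaming_penalty"), meta.get("robustness_bonus"))
```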
 
@@ -180,6 +192,22 @@ SmartPayEnv enables GRPO by providing the infrastructure for **Group Sampling**
  - **Learnable Gradients**: Unlike binary simulations, our **Deterministic Graders** (see Scoring section) map fuzzy outcomes to continuous rewards $[0, 1]$. This prevents the "sparse reward" problem and provides stable gradients for PPO clip-range optimization.
  - **Context Bucketing**: The `server/preference_utils.py` module allows agents to bundle similar (BIN, Amount, Risk) states, enabling faster convergence on preference-based objectives.

+ ### 3. Theme-4 Group-Relative Collection (NEW)
+ - Use `scripts/train_theme4_grpo.py` to build **group-relative preference pairs** from online interactions:
+   - sample a group of candidate actions for each live observation
+   - rank the group via `/simulate` rewards
+   - export best-vs-worst pairs (`theme4_grpo_pairs.jsonl`)
+ - This supports novel preference-style post-training flows in **HF TRL / Unsloth** and aligns with modern critic-free RL ideas (see the sketch after this hunk).
+
+ ---
+
+ ## 📚 Research-Inspired Design
+
+ The self-improving upgrades are inspired by:
+ - **League / PFSP dynamics** for avoiding cyclic overfitting and improving robustness: [AlphaStar (Nature, 2019)](https://www.nature.com/articles/s41586-019-1724-z)
+ - **Group-relative policy updates** for efficient critic-free optimization: [DeepSeekMath / GRPO (arXiv:2402.03300)](https://arxiv.org/abs/2402.03300)
+ - **Cross-play and equilibrium-oriented opponent diversity**: [Fictitious Cross-Play (arXiv:2310.03354)](https://arxiv.org/abs/2310.03354)
+
  ---

  ## 📐 Data Models
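
To make the group-relative signal concrete, here is a small illustrative computation (not part of the commit): GRPO-style updates score each sampled action against its group's mean and spread, so no learned critic is needed.

```python
# Illustrative only: GRPO-style group-relative advantages for one observation.
# The rewards would come from ranking a sampled action group via /simulate.
import numpy as np

group_rewards = np.array([0.82, 0.61, 0.55, 0.40])  # hypothetical /simulate scores
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
print(advantages)  # the best action in the group gets the largest positive advantage
```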
 
@@ -219,6 +247,9 @@ cd SmartPayEnv
  # Install dependencies
  uv sync

+ # (Recommended) Regenerate realistic synthetic data
+ python scripts/generate_logs.py --num-transactions 20000 --n-users 5000 --seed 42
+
  # Run the OpenEnv validation suite
  openenv validate
 
 
@@ -233,6 +264,24 @@ uv run -m SmartPayEnv.server.app
  ```
  Access the **Swagger UI** at `http://localhost:7860/` (auto-redirects to `/docs`).

+ ### 4. Synthetic Data World Generator (NEW)
+ Use this when you want realistic, evolving "real-world-like" transaction streams:
+
+ ```bash
+ python scripts/generate_logs.py \
+   --output data/transactions_log.jsonl \
+   --num-transactions 20000 \
+   --n-users 5000 \
+   --seed 42 \
+   --base-fraud-rate 0.08
+ ```
+
+ What gets generated:
+ - **Normal baseline behavior** (segment-based spend, location/device consistency, time-of-day effects)
+ - **Seed fraud templates** (`high_value_spike`, `velocity_burst`, `geo_anomaly`, `device_spoof`, `split_transactions`)
+ - **Adaptive fraud evolution** (strategy composition and stealth attacks such as `low_risk_disguise`)
+ - **Strategy labels for storytelling** via `fraud_strategy` and `event_marker`
+
  ### 3. Multi-Mode Deployment (Docker)
  ```bash
  # Build the production image
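
After regenerating, a quick way to sanity-check the fraud mix (a sketch; the field names are the ones emitted by `scripts/generate_logs.py` below):

```python
# Quick sanity check of the generated stream.
import json
from collections import Counter

strategies = Counter()
with open("data/transactions_log.jsonl", encoding="utf-8") as f:
    for line in f:
        tx = json.loads(line)
        if tx.get("is_fraud"):
            strategies[tx.get("fraud_strategy", "unknown")] += 1

print(strategies.most_common(10))  # e.g., velocity_burst, low_risk_disguise blends
```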
data/transactions_log.jsonl CHANGED
The diff for this file is too large to render. See raw diff
 
scripts/generate_logs.py CHANGED

@@ -1,68 +1,281 @@
+ import argparse
  import json
- import numpy as np
  import os
- from uuid import uuid4
+ from collections import defaultdict, deque

+ import numpy as np
+
+
+ LOCATIONS = ["Bangalore", "Mumbai", "Delhi", "Hyderabad", "Chennai", "Pune", "Kolkata", "Europe", "Singapore"]
+ SEGMENT_LABELS = {0: "new", 1: "existing", 2: "premium"}
+ BASE_MCC_DIST = [0.30, 0.20, 0.10, 0.10, 0.10, 0.20]
+ HIGH_RISK_MCCS = {2, 4, 5}
+ RISKY_HOURS = {1, 2, 3, 4, 5}
+
+
+ def _time_bucket(hour: int) -> str:
+     if 0 <= hour <= 5:
+         return "night"
+     if 6 <= hour <= 11:
+         return "morning"
+     if 12 <= hour <= 17:
+         return "afternoon"
+     return "evening"
+
+
+ def _sample_user_profiles(rng: np.random.Generator, n_users: int) -> list[dict]:
+     profiles: list[dict] = []
+     for uid in range(n_users):
+         segment = int(rng.choice([0, 1, 2], p=[0.30, 0.55, 0.15]))
+         traveler = bool(rng.random() < {0: 0.08, 1: 0.15, 2: 0.35}[segment])
+         home = str(rng.choice(LOCATIONS[:7]))
+         preferred_mcc = int(rng.choice([0, 1, 3, 5], p=[0.35, 0.25, 0.20, 0.20]))
+         profiles.append(
+             {
+                 "user_id": f"user_{uid}",
+                 "user_segment": segment,
+                 "frequent_traveler": traveler,
+                 "home_location": home,
+                 "preferred_mcc": preferred_mcc,
+                 "base_device_type": int(rng.choice([0, 1, 2], p=[0.55, 0.35, 0.10])),
+                 "base_spend_mu": {0: 3.8, 1: 4.5, 2: 5.0}[segment],
+                 "base_spend_sigma": {0: 0.70, 1: 0.75, 2: 0.85}[segment],
+                 "history_base": {0: 0.35, 1: 0.72, 2: 0.88}[segment],
+             }
+         )
+     return profiles
+
+
+ def _normal_transaction(
+     rng: np.random.Generator,
+     profile: dict,
+     hour: int,
+     user_recent_times: deque,
+     user_recent_amounts: deque,
+ ) -> dict:
+     mcc_probs = np.array(BASE_MCC_DIST, dtype=float)
+     mcc_probs[profile["preferred_mcc"]] += 0.18
+     mcc_probs = mcc_probs / mcc_probs.sum()
+     mcc = int(rng.choice([0, 1, 2, 3, 4, 5], p=mcc_probs))
+
+     amount = float(rng.lognormal(mean=profile["base_spend_mu"], sigma=profile["base_spend_sigma"]))
+     if mcc in HIGH_RISK_MCCS:
+         amount *= 1.35
+
+     location = profile["home_location"]
+     is_international = False
+     if profile["frequent_traveler"] and rng.random() < 0.10:
+         location = str(rng.choice(["Europe", "Singapore"]))
+         is_international = True
+
+     device_type = profile["base_device_type"]
+     if rng.random() < 0.07:
+         device_type = int(rng.choice([0, 1, 2]))
+
+     velocity = float(min(12, len([t for t in user_recent_times if hour - t <= 1])))
+     velocity_norm = float(np.clip(velocity / 10.0, 0.05, 0.98))
+
+     risk = 0.02
+     risk += 0.06 if hour in RISKY_HOURS else 0.0
+     risk += 0.05 if mcc in HIGH_RISK_MCCS else 0.0
+     risk += 0.04 if device_type != profile["base_device_type"] else 0.0
+     risk += 0.03 if is_international else 0.0
+     risk += 0.08 * velocity_norm
+     risk += rng.normal(0.0, 0.02)
+
+     return {
+         "amount": float(np.clip(amount, 5.0, 150000.0)),
+         "currency": "INR",
+         "time": _time_bucket(hour),
+         "merchant_category": mcc,
+         "location": location,
+         "is_international": is_international,
+         "card_present": bool(rng.random() > 0.45),
+         "user_segment": profile["user_segment"],
+         "user_history_score": float(np.clip(rng.normal(profile["history_base"], 0.12), 0.05, 1.0)),
+         "device_type": device_type,
+         "ip_risk": float(np.clip(rng.normal(0.10 if location == profile["home_location"] else 0.45, 0.08), 0.01, 0.99)),
+         "bin_category": int(rng.integers(0, 10)),
+         "time_of_day": int(hour),
+         "transaction_velocity": velocity_norm,
+         "fraud_risk_score": float(np.clip(risk, 0.01, 0.99)),
+         "fraud_strategy": "none",
+         "event_marker": None,
+         "is_fraud": False,
+     }
+
+
+ def _fraud_agent_strategy_mix(
+     rng: np.random.Generator,
+     attack_level: float,
+ ) -> list[str]:
+     templates = [
+         ("high_value_spike", 0.20),
+         ("velocity_burst", 0.22),
+         ("geo_anomaly", 0.16),
+         ("device_spoof", 0.18),
+         ("split_transactions", 0.14),
+         ("low_risk_disguise", 0.10),
+     ]
+     weights = np.array([w for _, w in templates], dtype=float)
+     # Self-improving fraud agent: shifts toward stealth blends as the defender hardens.
+     stealth_boost = min(0.18, 0.06 * attack_level)
+     weights[5] += stealth_boost
+     weights[4] += stealth_boost * 0.8
+     weights = weights / weights.sum()
+
+     k = 1 if attack_level < 1.0 else (2 if rng.random() < 0.75 else 3)
+     selected = rng.choice([name for name, _ in templates], size=k, replace=False, p=weights)
+     return list(selected)
+
+
+ def _apply_fraud_strategy(
+     rng: np.random.Generator,
+     tx: dict,
+     profile: dict,
+     strategies: list[str],
+ ) -> list[dict]:
+     tx = dict(tx)
+     event_markers = []
+
+     for s in strategies:
+         if s == "high_value_spike":
+             tx["amount"] = float(min(200000.0, tx["amount"] * rng.uniform(6.0, 18.0)))
+             event_markers.append("high_value_spike")
+         elif s == "velocity_burst":
+             tx["transaction_velocity"] = float(np.clip(tx["transaction_velocity"] + rng.uniform(0.45, 0.85), 0.1, 0.99))
+             event_markers.append("velocity_burst")
+         elif s == "geo_anomaly":
+             tx["location"] = str(rng.choice(["Europe", "Singapore"]))
+             tx["is_international"] = True
+             tx["ip_risk"] = float(np.clip(tx["ip_risk"] + rng.uniform(0.25, 0.50), 0.01, 0.99))
+             event_markers.append("geo_anomaly")
+         elif s == "device_spoof":
+             tx["device_type"] = int((profile["base_device_type"] + int(rng.integers(1, 3))) % 3)
+             tx["card_present"] = False
+             tx["ip_risk"] = float(np.clip(tx["ip_risk"] + rng.uniform(0.18, 0.35), 0.01, 0.99))
+             event_markers.append("device_spoof")
+         elif s == "split_transactions":
+             # Converted to multiple low-value events that preserve a high total.
+             pieces = int(rng.integers(4, 10))
+             each_amount = float(max(1500.0, tx["amount"] * rng.uniform(0.10, 0.22)))
+             generated = []
+             for _ in range(pieces):
+                 p = dict(tx)
+                 p["amount"] = each_amount
+                 p["transaction_velocity"] = float(np.clip(tx["transaction_velocity"] + rng.uniform(0.35, 0.55), 0.1, 0.99))
+                 p["event_marker"] = "split_transactions"
+                 p["fraud_strategy"] = "split_transactions"
+                 p["is_fraud"] = True
+                 risk = p["fraud_risk_score"] + rng.uniform(0.18, 0.32)
+                 p["fraud_risk_score"] = float(np.clip(risk, 0.01, 0.99))
+                 generated.append(p)
+             return generated
+         elif s == "low_risk_disguise":
+             # Fraud tries to look normal: lower the explicit risk while preserving anomalies elsewhere.
+             tx["amount"] = float(np.clip(tx["amount"] * rng.uniform(0.18, 0.35), 250.0, 12000.0))
+             tx["merchant_category"] = int(rng.choice([0, 1, 3], p=[0.5, 0.3, 0.2]))
+             tx["fraud_risk_score"] = float(np.clip(tx["fraud_risk_score"] - rng.uniform(0.08, 0.20), 0.02, 0.80))
+             event_markers.append("low_risk_disguise")
+
+     tx["fraud_strategy"] = "+".join(strategies)
+     tx["event_marker"] = "|".join(event_markers) if event_markers else "fraud_pattern"
+     tx["is_fraud"] = True
+     tx["fraud_risk_score"] = float(np.clip(tx["fraud_risk_score"] + rng.uniform(0.18, 0.42), 0.01, 0.99))
+     return [tx]
+
+
- def generate_logs(output_path="data/transactions_log.jsonl", num_transactions=5000):
-     rng = np.random.default_rng()
+ def generate_logs(
+     output_path: str = "data/transactions_log.jsonl",
+     num_transactions: int = 15000,
+     n_users: int = 4000,
+     seed: int = 7,
+     base_fraud_rate: float = 0.08,
+ ) -> None:
+     """
+     Generates realistic synthetic payment logs with an evolving fraud adversary.
+     """
+     rng = np.random.default_rng(seed)
      os.makedirs(os.path.dirname(output_path), exist_ok=True)
-
+
+     profiles = _sample_user_profiles(rng, n_users=n_users)
+     user_recent_times: dict[str, deque] = defaultdict(lambda: deque(maxlen=40))
+     user_recent_amounts: dict[str, deque] = defaultdict(lambda: deque(maxlen=40))
+
      current_hour = 0
-     steps_per_hour = 100  # average density
-     active_spike_countdown = 0
-
-     with open(output_path, "w") as f:
-         for i in range(num_transactions):
-             # Advance time every ~100 transactions
-             if i % steps_per_hour == 0:
+     steps_per_hour = 90
+     global_attack_level = 0.0
+     defender_pressure = 0.0
+
+     records_written = 0
+     with open(output_path, "w", encoding="utf-8") as f:
+         while records_written < num_transactions:
+             if records_written % steps_per_hour == 0:
                  current_hour = (current_hour + 1) % 24
-
-             # Randomly start a fraud spike (correlated event)
-             if active_spike_countdown <= 0 and rng.random() < 0.005:
-                 active_spike_countdown = rng.integers(20, 50)
-
-             # 1. Hour of day (Diurnal pattern)
-             hour = current_hour
-
-             # 2. Segment & MCC
-             segment = int(rng.choice([0, 1, 2], p=[0.25, 0.60, 0.15]))
-             mcc = int(rng.choice([0, 1, 2, 3, 4, 5], p=[0.3, 0.2, 0.1, 0.1, 0.1, 0.2]))
-
-             # 3. Fraud Risk with Correlation (Spikes)
-             is_night = (1 <= hour <= 5)
-             base_risk = {0: 0.02, 1: 0.05, 2: 0.15, 3: 0.08, 4: 0.25, 5: 0.12}[mcc]
-
-             risk_boost = 0.0
-             if active_spike_countdown > 0:
-                 risk_boost = 0.4  # Persistent spike
-                 active_spike_countdown -= 1
-             elif is_night:
-                 risk_boost = 0.2
-
-             final_risk = base_risk + risk_boost + rng.uniform(-0.05, 0.05)
-             fraud_risk_score = float(np.clip(final_risk * {0: 1.8, 1: 1.0, 2: 0.3}[segment], 0.01, 0.99))
-
-             # 4. Transaction Details
-             amount = float(rng.lognormal(mean={0: 4.0, 1: 4.5, 2: 6.5, 3: 7.0, 4: 5.0, 5: 3.0}[mcc], sigma=0.8))
-             bin_category = int(rng.integers(0, 10))
-             is_international = bool(rng.random() < (0.4 if mcc == 3 else 0.15))
-
-             log_entry = {
-                 "amount": amount,
-                 "merchant_category": mcc,
-                 "is_international": is_international,
-                 "card_present": bool(rng.random() > 0.5),
-                 "user_segment": segment,
-                 "user_history_score": float(np.clip(rng.normal({0: 0.3, 1: 0.7, 2: 0.9}[segment], 0.15), 0.1, 1.0)),
-                 "device_type": int(rng.choice([0, 1, 2], p=[0.5, 0.4, 0.1])),
-                 "bin_category": bin_category,
-                 "time_of_day": hour,
-                 "transaction_velocity": float(np.clip(rng.random() * 0.2 + (0.5 if active_spike_countdown > 0 else 0.0), 0.1, 0.9)),
-                 "fraud_risk_score": fraud_risk_score,
-                 "event_marker": "fraud_spike" if active_spike_countdown > 0 else None
-             }
-             f.write(json.dumps(log_entry) + "\n")
+
+             profile = profiles[int(rng.integers(0, len(profiles)))]
+             uid = profile["user_id"]
+
+             base_tx = _normal_transaction(
+                 rng=rng,
+                 profile=profile,
+                 hour=current_hour,
+                 user_recent_times=user_recent_times[uid],
+                 user_recent_amounts=user_recent_amounts[uid],
+             )
+
+             fraud_p = base_fraud_rate + (0.05 if current_hour in RISKY_HOURS else 0.0) + (0.07 * global_attack_level)
+             fraud_p = float(np.clip(fraud_p, 0.01, 0.55))
+             is_attack = bool(rng.random() < fraud_p)
+
+             if is_attack:
+                 strategies = _fraud_agent_strategy_mix(rng, attack_level=global_attack_level)
+                 txs = _apply_fraud_strategy(rng, base_tx, profile, strategies)
+             else:
+                 txs = [base_tx]
+
+             for tx in txs:
+                 tx["user_id"] = uid
+                 tx["user_profile"] = {
+                     "segment": SEGMENT_LABELS[profile["user_segment"]],
+                     "frequent_traveler": profile["frequent_traveler"],
+                     "home_location": profile["home_location"],
+                 }
+                 tx["attack_level"] = round(float(global_attack_level), 4)
+                 tx["defender_pressure"] = round(float(defender_pressure), 4)
+                 f.write(json.dumps(tx) + "\n")
+                 records_written += 1
+
+                 user_recent_times[uid].append(current_hour)
+                 user_recent_amounts[uid].append(tx["amount"])
+                 if records_written >= num_transactions:
+                     break
+
+             # Self-improvement dynamics:
+             # when stealth fraud appears, raise attack sophistication;
+             # when fraud is frequently obvious, increase defender pressure.
+             if is_attack and any("low_risk_disguise" in t.get("fraud_strategy", "") for t in txs):
+                 global_attack_level = float(np.clip(global_attack_level + 0.015, 0.0, 3.0))
+             elif is_attack:
+                 defender_pressure = float(np.clip(defender_pressure + 0.010, 0.0, 2.5))
+             else:
+                 global_attack_level = float(np.clip(global_attack_level + 0.002 - (0.001 * defender_pressure), 0.0, 3.0))

  if __name__ == "__main__":
-     generate_logs(num_transactions=5000)
-     print("Sequential logs with correlated events generated.")
+     parser = argparse.ArgumentParser(description="Generate synthetic SmartPayEnv transaction logs.")
+     parser.add_argument("--output", default="data/transactions_log.jsonl", help="Output JSONL file path")
+     parser.add_argument("--num-transactions", type=int, default=15000, help="Number of transactions")
+     parser.add_argument("--n-users", type=int, default=4000, help="Number of synthetic users")
+     parser.add_argument("--seed", type=int, default=7, help="Random seed")
+     parser.add_argument("--base-fraud-rate", type=float, default=0.08, help="Baseline fraud probability")
+     args = parser.parse_args()
+
+     generate_logs(
+         output_path=args.output,
+         num_transactions=args.num_transactions,
+         n_users=args.n_users,
+         seed=args.seed,
+         base_fraud_rate=args.base_fraud_rate,
+     )
+     print(f"Generated {args.num_transactions} synthetic transactions at {args.output}")
scripts/train_theme4_grpo.py ADDED
@@ -0,0 +1,139 @@
+ """
+ Theme-4 training starter for SmartPayEnv.
+
+ This script demonstrates a novel self-improvement loop:
+ 1) sample K candidate actions per observation
+ 2) score each candidate with /simulate rewards (group-relative signal)
+ 3) collect best/worst pairs for preference-style post-training
+
+ It is intentionally lightweight so teams can run it in Colab with TRL/Unsloth.
+ """
+
+ from __future__ import annotations
+
+ import json
+ import random
+ from dataclasses import dataclass
+ from typing import Any
+
+ import requests
+
+
+ ENV_URL = "http://localhost:7860"
+ MAX_STEPS = 200
+ GROUP_SIZE = 8
+
+
+ @dataclass
+ class RolloutExample:
+     prompt: str
+     chosen: str
+     rejected: str
+     chosen_reward: float
+     rejected_reward: float
+
+
+ def _action_candidates() -> list[dict[str, int]]:
+     all_actions: list[dict[str, int]] = []
+     for gateway in (0, 1, 2):
+         for fraud_decision in (0, 1, 2, 3):
+             for retry_strategy in (0, 1):
+                 all_actions.append(
+                     {
+                         "gateway": gateway,
+                         "fraud_decision": fraud_decision,
+                         "retry_strategy": retry_strategy,
+                     }
+                 )
+     random.shuffle(all_actions)
+     return all_actions
+
+
+ def _simulate_reward(action: dict[str, int]) -> float:
+     response = requests.post(f"{ENV_URL}/simulate", json={"action": action}, timeout=30)
+     response.raise_for_status()
+     obs = response.json()
+     return float(obs.get("reward", 0.0))
+
+
+ def _step(action: dict[str, int]) -> dict[str, Any]:
+     response = requests.post(f"{ENV_URL}/step", json={"action": action}, timeout=30)
+     response.raise_for_status()
+     return response.json()
+
+
+ def _reset(difficulty: int = 2) -> dict[str, Any]:
+     response = requests.post(f"{ENV_URL}/reset", json={"difficulty": difficulty}, timeout=30)
+     response.raise_for_status()
+     payload = response.json()
+     return payload.get("observation", payload)
+
+
+ def collect_group_relative_pairs(max_steps: int = MAX_STEPS, group_size: int = GROUP_SIZE) -> list[RolloutExample]:
+     obs = _reset(difficulty=2)
+     dataset: list[RolloutExample] = []
+     actions_pool = _action_candidates()
+
+     for _ in range(max_steps):
+         sampled = random.sample(actions_pool, k=min(group_size, len(actions_pool)))
+         scored: list[tuple[dict[str, int], float]] = []
+
+         for action in sampled:
+             try:
+                 reward = _simulate_reward(action)
+                 scored.append((action, reward))
+             except requests.RequestException:
+                 continue
+
+         if len(scored) < 2:
+             break
+
+         scored.sort(key=lambda x: x[1], reverse=True)
+         best_action, best_reward = scored[0]
+         worst_action, worst_reward = scored[-1]
+
+         prompt = (
+             "SmartPayEnv observation:\n"
+             f"{json.dumps(obs, sort_keys=True)}\n"
+             "Return one action JSON with fields: gateway, fraud_decision, retry_strategy."
+         )
+
+         dataset.append(
+             RolloutExample(
+                 prompt=prompt,
+                 chosen=json.dumps(best_action, sort_keys=True),
+                 rejected=json.dumps(worst_action, sort_keys=True),
+                 chosen_reward=best_reward,
+                 rejected_reward=worst_reward,
+             )
+         )
+
+         step_payload = _step(best_action)
+         obs = step_payload.get("observation", step_payload)
+         if bool(obs.get("done", False)):
+             obs = _reset(difficulty=2)
+
+     return dataset
+
+
+ def export_jsonl(dataset: list[RolloutExample], output_path: str) -> None:
+     with open(output_path, "w", encoding="utf-8") as f:
+         for row in dataset:
+             f.write(
+                 json.dumps(
+                     {
+                         "prompt": row.prompt,
+                         "chosen": row.chosen,
+                         "rejected": row.rejected,
+                         "chosen_reward": row.chosen_reward,
+                         "rejected_reward": row.rejected_reward,
+                     }
+                 )
+                 + "\n"
+             )
+
+
+ if __name__ == "__main__":
+     data = collect_group_relative_pairs()
+     export_jsonl(data, "theme4_grpo_pairs.jsonl")
+     print(f"Collected {len(data)} preference pairs into theme4_grpo_pairs.jsonl")
server/SmartPayEnv_environment.py CHANGED

@@ -77,6 +77,14 @@ class State:
      active_events: dict = field(default_factory=dict)  # e.g. {"fraud_spike": 10, "outage": 5}
      log_cursor: int = 0
      review_queue: list = field(default_factory=list)  # [{ 'step': int, 'is_fraud': bool, 'amount': float }]
+     curriculum_level: float = 0.0
+     policy_skill_estimate: float = 0.5
+     challenger_skill: float = 0.55
+     recent_rewards: deque = field(default_factory=lambda: deque(maxlen=25))
+     recent_route_scores: deque = field(default_factory=lambda: deque(maxlen=25))
+     recent_fraud_scores: deque = field(default_factory=lambda: deque(maxlen=25))
+     recent_retention_scores: deque = field(default_factory=lambda: deque(maxlen=25))
+     anti_gaming_alerts: int = 0


  class _GatewayState:

@@ -132,6 +140,7 @@ class SmartpayenvEnvironment(Environment):
          self.current_obs = None
          self._log_loader = LogLoader()
          self._pattern_queue = deque()
+         self._meta_curriculum_enabled = True

      def _init_gateways(self) -> None:
          instability = self._cfg["instability"]

@@ -233,8 +242,44 @@ class SmartpayenvEnvironment(Environment):
          self.current_obs = self._generate_transaction()
          # Synchronize simulation clock with the log's starting hour
          self._state.simulation_hour = self.current_obs.time_of_day
+         self._state.curriculum_level = float(self._difficulty)
+         self._state.policy_skill_estimate = 0.5
+         self._state.challenger_skill = 0.55 + (0.08 * self._difficulty)
+         self._state.anti_gaming_alerts = 0
          return self.current_obs

+     def _curriculum_multiplier(self) -> float:
+         return 1.0 + (0.15 * self._state.curriculum_level)
+
+     def _update_self_play_curriculum(self, route_score: float, fraud_score: float, retention_score: float) -> None:
+         """
+         Theme-4 core: self-improvement loop inspired by league training.
+         The policy competes against a moving challenger, and environment complexity
+         scales with sustained performance.
+         """
+         self._state.recent_route_scores.append(route_score)
+         self._state.recent_fraud_scores.append(fraud_score)
+         self._state.recent_retention_scores.append(retention_score)
+         perf = (0.45 * route_score) + (0.35 * fraud_score) + (0.20 * retention_score)
+         self._state.recent_rewards.append(perf)
+
+         if not self._state.recent_rewards:
+             return
+
+         rolling_perf = float(np.mean(self._state.recent_rewards))
+         skill_delta = 0.08 * (rolling_perf - 0.5)
+         self._state.policy_skill_estimate = float(np.clip(self._state.policy_skill_estimate + skill_delta, 0.05, 0.99))
+
+         # PFSP-inspired challenger adaptation: keep matches near the policy frontier.
+         gap = self._state.policy_skill_estimate - self._state.challenger_skill
+         self._state.challenger_skill = float(np.clip(self._state.challenger_skill + (0.06 * gap), 0.05, 0.99))
+
+         if self._meta_curriculum_enabled and len(self._state.recent_rewards) >= 8:
+             if rolling_perf > 0.72:
+                 self._state.curriculum_level = float(np.clip(self._state.curriculum_level + 0.12, 0.0, 2.0))
+             elif rolling_perf < 0.45:
+                 self._state.curriculum_level = float(np.clip(self._state.curriculum_level - 0.08, 0.0, 2.0))
+
      def step(self, action: SmartpayenvAction) -> SmartpayenvObservation:
          self._state.step_count += 1

@@ -273,6 +318,10 @@ class SmartpayenvEnvironment(Environment):
          surge_logs = self._log_loader.get_pattern("fraud_surge", count=5)
          self._pattern_queue.extend(surge_logs)

+         # Curriculum-driven stress events (self-improvement pressure).
+         if self._rng.random() < (0.01 * self._curriculum_multiplier()):
+             self._state.active_events["adversarial_shift"] = int(self._rng.integers(4, 12))
+
          for gw in self._gateways: gw.step()

          # 1. 3DS / Action Logic

@@ -402,14 +451,40 @@ class SmartpayenvEnvironment(Environment):
          fs = self.fraud_grader.evaluate()
          rs = self.retention_grader.evaluate()
          base_reward = (0.4 * route_score) + (0.4 * fs) + (0.2 * rs)
-
-         # Norm punishment for chargebacks
-         final_reward = base_reward - (cb_amt / 150.0)
+
+         # League-style regret: penalize underperforming against the moving challenger.
+         challenger_regret = max(0.0, self._state.challenger_skill - base_reward)
+         regret_penalty = 0.35 * challenger_regret
+
+         # Anti-gaming check: repeatedly overusing manual review without quality gains.
+         gaming_penalty = 0.0
+         if action.fraud_decision == 3 and fs < 0.55 and rs < 0.6:
+             self._state.anti_gaming_alerts += 1
+             gaming_penalty = min(0.12, 0.02 * self._state.anti_gaming_alerts)
+
+         # Curriculum bonus: reward robust performance under higher difficulty pressure.
+         robustness_bonus = 0.06 * self._state.curriculum_level * max(0.0, base_reward - 0.55)
+
+         # Norm punishment for delayed liabilities + self-improvement terms.
+         final_reward = base_reward - (cb_amt / 150.0) - regret_penalty - gaming_penalty + robustness_bonus
          self.current_obs.reward = float(np.clip(final_reward, 0.001, 0.999))

          self.current_obs.task_routing_score = route_score
          self.current_obs.task_fraud_mcc_score = fs
          self.current_obs.task_retention_score = rs
+         self._update_self_play_curriculum(route_score, fs, rs)
+
+         self.current_obs.metadata = {
+             "theme": "self_improvement",
+             "curriculum_level": round(self._state.curriculum_level, 4),
+             "policy_skill_estimate": round(self._state.policy_skill_estimate, 4),
+             "challenger_skill": round(self._state.challenger_skill, 4),
+             "challenger_regret": round(challenger_regret, 4),
+             "gaming_penalty": round(gaming_penalty, 4),
+             "robustness_bonus": round(robustness_bonus, 4),
+             "anti_gaming_alerts": int(self._state.anti_gaming_alerts),
+             "active_events": dict(self._state.active_events),
+         }

          return self.current_obs
server/utils.py CHANGED

@@ -39,6 +39,14 @@ class LogLoader:
          if pattern_type == "fraud_surge":
              # Filter for high fraud risk
              candidates = [l for l in self.logs if l.get("fraud_risk_score", 0) > 0.5]
+         elif pattern_type == "stealth_fraud":
+             candidates = [
+                 l for l in self.logs
+                 if l.get("is_fraud", False)
+                 and "low_risk_disguise" in str(l.get("fraud_strategy", ""))
+             ]
+         elif pattern_type == "velocity_attack":
+             candidates = [l for l in self.logs if float(l.get("transaction_velocity", 0.0)) > 0.7]
          elif pattern_type == "premium_only":
              candidates = [l for l in self.logs if l.get("user_segment") == 2]
          else:
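
With the new pattern types, surge injection can target the evolved fraud styles. A usage sketch mirroring how the environment calls the loader (`self._log_loader.get_pattern("fraud_surge", count=5)`); the import path assumes this repo's layout:

```python
# Sketch: pulling the new pattern types from the regenerated log stream.
from server.utils import LogLoader  # assumes server/ is importable in this repo

loader = LogLoader()
stealth = loader.get_pattern("stealth_fraud", count=5)
bursts = loader.get_pattern("velocity_attack", count=5)
print(len(stealth), len(bursts))
```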