---
title: SmartPayEnv - Advanced Fintech Reality Layer
emoji: 💳
colorFrom: blue
colorTo: gray
sdk: docker
pinned: true
app_port: 7860
tags:
  - openenv
  - fintech
  - payment-orchestration
  - Reinforcement Learning
---

# 💳 SmartPayEnv: Advanced Fintech Reality Layer (Theme 4: Self-Improvement)

**A high-fidelity, production-grade benchmark for training and evaluating AI Agents (LLMs/RL) on the messy reality of global payment orchestration.**

[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Pratap-K/SmartPayEnv)
[![OpenEnv Compliant](https://img.shields.io/badge/OpenEnv-Compliant-green)](https://github.com/meta-pytorch/OpenEnv)

SmartPayEnv bridges the gap between simple simulations and production fintech. It models the adversarial loops, infrastructure instability, and delayed feedback cycles that define modern payment systems.

This release is explicitly upgraded for **OpenEnv Hackathon Theme #4 (Self-Improvement)** with a light blend of Theme #1 and Theme #2:
- **League-style challenger dynamics** inside the environment (agent vs moving opponent skill frontier).
- **Adaptive curriculum** that auto-escalates pressure after sustained performance and de-escalates after regressions.
- **Anti-reward-hacking penalties** for degenerate policies (e.g., overusing manual review without fraud/retention quality).
- **Long-horizon credit pressure** through delayed chargebacks + review queues + temporal events.

---

## 🚀 Why SmartPayEnv?

In the real world, payment orchestration isn't just about "Allow" or "Block." It's about optimizing for **Conversion**, **Fraud Risk**, and **Operational Cost** simultaneously. SmartPayEnv introduces:

- **Delayed Credit Assignment**: Undetected fraud today becomes a Chargeback 40 steps later.
- **Conversion Friction**: Security measures (3DS) can cause high-value users to abandon their carts.
- **Gateway Drift**: Provider success rates fluctuate based on bank-level performance and network drift.

---

## 🏗️ System Architecture

SmartPayEnv leverages the **OpenEnv** framework to provide a standardized interface for AI agents.

```mermaid
graph TD
    subgraph "Agent Layer"
        LLM[LLM Agent / RL Policy]
    end

    subgraph "Interface Layer (FastAPI)"
        Srv[server/app.py]
        WS[WebSocket /ws]
        HTTP[HTTP /step, /reset]
    end

    subgraph "Reality Engine"
        Env[SmartPayEnvironment]
        State[Persistence & Queues]
        Logic[BIN Affinity & 3DS Friction]
    end

    subgraph "Feedback Loop"
        Gr_R[RoutingEfficacyGrader]
        Gr_F[FraudDetectionGrader]
        Gr_U[UserRetentionGrader]
    end

    LLM <-->|JSON Observation/Action| Srv
    Srv <--> Env
    Env <--> State & Logic
    Env -->|Metrics| Gr_R & Gr_F & Gr_U
```

---

## 🌊 The Payment Lifecycle (The Reality Loop)

The environment models a high-frequency feedback loop where agents navigate noisy signals and delayed consequences.

```mermaid
sequenceDiagram
    autonumber
    participant Agent as AI Agent (LLM/RL)
    participant Env as Reality Engine
    participant Queue as Review/CB Queues

    Note over Env: [State] Clock advances + Events Triggered
    Env->>Agent: Observation (Noisy Risk + Lagged Health + Resolution Alerts)
    
    rect rgb(30, 30, 30)
        Note over Agent: [Optional] Simulation (GRPO/PPO)
        Agent->>Env: POST /simulate (Group Samples)
        Env-->>Agent: Branch Results (Advantage Signal)
    end

    Agent->>Env: Final Action (Gateway Strategy + Fraud Decision)

    rect rgb(30, 30, 30)
        Note over Env: [Reality] Execution & Scheduling
        Env->>Env: Success = f(Health, BIN, TrueRisk, Noise)
        Env->>Queue: Schedule Reviews (10s) and Chargebacks (40s)
    end

    Queue-->>Env: Matured Results from previous steps
    Env->>Agent: Feedback (Reward, Done, Resolved Alerts)
```

---

## 💎 Advanced Reality Features

### 1. Log-Driven Time-Series
Sequentially streams from synthetic logs to simulate real-world distributions, diurnal cycles (simulation clock), and persistent fraud surges.

### 2. Partial Observability
Forces agents to infer state by adding noise to risk signals, hiding internal user tiers, and lagging gateway health metrics by 2 steps.

### 3. Human-in-the-Loop (HITL)
Agents can send transactions to manual review (Action 3). Resolutions are 100% accurate but incur a $5.00 fee and a 10-25 step delay.

### 4. Advanced Adversarial Mechanics
- **🛡️ 3DS Friction (Action 2)**: Provides a **90% fraud reduction** but triggers a **15-25% abandonment rate**. Agents must balance security vs. customer drop-off.
- **⏳ Delayed Chargebacks**: Undetected fraud ($TrueRisk > 0.65$) matures into penalties (Tx Amount + $20 fee) **30-50 steps later**, forcing long-term liability management.
- **📊 BIN-Gateway Affinity**: A hidden matrix of gateway performance across different card types. Agents must discover these affinities to optimize routing success.
- **🧠 Preference-Based Learning (Simulation Branching)**: Supports advanced training (e.g., DPO/PPO) by allowing agents to "What-if" multiple actions from the same state via the `/simulate` endpoint. Agents can group similar contexts (BIN + Amount + Risk) and learn from relative advantages.

### 5. Self-Improving Meta-Curriculum (NEW)
- **📈 Curriculum Level**: Each episode now tracks a continuous curriculum level (0-2) that increases after sustained high rolling performance.
- **🥊 Challenger Skill**: A moving challenger policy estimate is maintained and used to compute regret-style penalties when the active policy underperforms.
- **🧯 Anti-Gaming Guardrails**: Repeatedly selecting costly manual review without corresponding quality gains triggers adaptive penalties.
- **🧠 Metadata for Training**: Step metadata exposes `curriculum_level`, `policy_skill_estimate`, `challenger_skill`, and shaping terms to support richer RL diagnostics.

---

## 🎯 Benchmark Tasks

SmartPayEnv supports four curriculum tasks, ranging from basic classification to complex joint optimization.

| Task | Level | Objective | Metrics |
|------|-------|-----------|---------|
| `routing_efficacy` | Easy | Choose the gateway (0-2) with the highest affinity for the current card BIN. | Routing Score |
| `user_retention` | Medium | Minimize customer churn by ensuring high availability for premium/existing users. | Retention Score |
| `fraud_detection` | Medium | Correctly identify and block (`action=1`) fraudulent transactions based on risk signals. | MCC Score |
| `payment_optimization`| Hard | **Joint Equilibrium**: Optimize routing success, fraud mitigation, and user retention simultaneously. | Combined Reward |

---

## 📐 Exhaustive Grader Documentation

Our graders utilize a **Deterministic Mathematical Framework** to provide stable gradients for agent training.

### 1. Routing Efficacy Grader
Grades the quality of the gateway choice and transaction outcome.
- **Formula**: $Reward = \sigma(\alpha \cdot (2E - 1) - (\beta \cdot Cost + \gamma \cdot Retries) + \delta \cdot Quality)$
- **Key Parameters**:
    - **$\alpha$ (Outcome Weight: 1.2)**: Scales the impact of the expected success.
    - **$\beta$ (Cost Multiplier: 0.15)**: Penalizes choosing expensive gateways.
    - **$\gamma$ (Retry Penalty: 0.4)**: Discourages excessive retries.
    - **$\delta$ (Decision Bonus: 0.8)**: Rewards selecting the gateway with the highest current affinity.
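A direct transcription of the formula with the parameters listed above. The exact inputs the real grader derives (`E`, `Cost`, `Retries`, `Quality`) may be computed differently internally; this sketch only shows the shape of the reward surface.

```python
import math

# Documented grader parameters (see list above).
ALPHA, BETA, GAMMA, DELTA = 1.2, 0.15, 0.4, 0.8

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def routing_reward(expected_success: float, cost: float,
                   retries: int, quality: float) -> float:
    """Reward = sigma(alpha*(2E - 1) - (beta*Cost + gamma*Retries) + delta*Quality)."""
    logit = (ALPHA * (2 * expected_success - 1)
             - (BETA * cost + GAMMA * retries)
             + DELTA * quality)
    return sigmoid(logit)
```

Because the outcome term is squashed through a sigmoid, the reward stays in (0, 1) and remains differentiable everywhere, which is what gives PPO-style learners a usable gradient.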

### 2. Fraud Detection Grader (MCC)
Uses the **Matthews Correlation Coefficient (MCC)** to handle imbalanced transaction data (fraud is rare, ~2%).
- **MCC Formula**:
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
- **Reward Mapping**: Maps MCC $[-1, 1]$ to a learnable range $[0, 1]$ using $R = \frac{MCC + 1}{2}$. A baseline of $0.5$ represents a random classifier.
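The MCC formula and the reward mapping transcribe directly to code. The grader's actual implementation may add smoothing; here an undefined MCC (zero denominator) is treated as a random classifier by convention.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient over a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # undefined MCC treated as random
    return (tp * tn - fp * fn) / denom

def mcc_reward(tp: int, tn: int, fp: int, fn: int) -> float:
    """Map MCC in [-1, 1] to the learnable range [0, 1]."""
    return (mcc(tp, tn, fp, fn) + 1) / 2
```

Note why MCC matters at a ~2% fraud rate: a degenerate "allow everything" policy (tp=0, fp=0) scores 0.5 here, while plain accuracy would misleadingly report ~98%.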

### 3. User Retention Grader
Models customer churn using an **Exponential Hazard Function** to simulate the "Trust Deficit."
- **Retention Formula**:
$$Retention = e^{-\lambda \cdot f^2}$$
where $f$ is the count of consecutive failed transactions for that user cohort.
- **Rationale**: Consecutive failures cause non-linear churn; a first failure is an annoyance, but a third consecutive failure leads to near-certain platform abandonment.
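The hazard function above, transcribed directly. The `LAMBDA` value here is an illustrative assumption; the grader's calibrated value may differ.

```python
import math

LAMBDA = 0.3  # illustrative hazard rate, not the grader's calibrated value

def retention(consecutive_failures: int) -> float:
    """Retention = exp(-lambda * f^2): churn accelerates quadratically."""
    return math.exp(-LAMBDA * consecutive_failures ** 2)

# One failure is survivable; three consecutive failures are near-fatal.
print(retention(1))  # ~0.74
print(retention(3))  # ~0.07
```

The squared exponent is what encodes the "Trust Deficit": each additional consecutive failure costs far more retention than the previous one.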

---

## 🧠 Reinforcement Learning Optimization (GRPO/PPO)

SmartPayEnv is architected to support state-of-the-art RL training algorithms like **Group Relative Policy Optimization (GRPO)** and **Proximal Policy Optimization (PPO)**.

### 1. Group Relative Policy Optimization (GRPO)
SmartPayEnv enables GRPO by providing the infrastructure for **Group Sampling** without a value model.
- **Group Signal**: Use the `POST /simulate` endpoint to generate $G$ actions for the same state.
- **Relative Advantage**: The environment computes the advantage by standardizing rewards within the group:
  $$Adv_i = \frac{R_i - \text{mean}(R_{group})}{\text{std}(R_{group}) + \epsilon}$$
- **Stability**: This eliminates the need for a separate critic/baseline, mirroring the training architecture used for **DeepSeek-V3**.
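The group-relative advantage above is a few lines of code: standardize rewards within each sampled group rather than learning a critic.

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Adv_i = (R_i - mean(R_group)) / (std(R_group) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards from G=4 sampled actions for the same observation.
advs = group_advantages([0.9, 0.6, 0.4, 0.1])
```

Advantages are zero-mean by construction, so above-average actions in the group are reinforced and below-average ones suppressed without any value network.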

### 2. PPO & Policy Gradients
- **Learnable Gradients**: Unlike binary simulations, our **Deterministic Graders** (see Scoring section) map fuzzy outcomes to continuous rewards $[0, 1]$. This prevents the "sparse reward" problem and provides stable gradients for PPO clip-range optimization.
- **Context Bucketing**: The `server/preference_utils.py` module allows agents to bundle similar (BIN, Amount, Risk) states, enabling faster convergence on preference-based objectives.

### 3. Theme-4 Group-Relative Collection (NEW)
- Use `scripts/train_theme4_grpo.py` to build **group-relative preference pairs** from online interactions:
  - sample action groups for each live observation
  - rank via `/simulate` reward
  - export best-vs-worst pairs (`theme4_grpo_pairs.jsonl`)
- This supports novel post-training flows in **HF TRL / Unsloth** and aligns with modern critic-free RL ideas.
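The pair-building step described above could be sketched as follows. The field names (`prompt`, `chosen`, `rejected`) follow the common TRL DPO dataset convention and are an assumption here, not the script's confirmed schema.

```python
import json

def make_pair(observation: dict,
              scored_actions: list[tuple[dict, float]]) -> str:
    """Rank a sampled action group by its /simulate reward and emit one
    best-vs-worst preference pair as a JSONL line."""
    ranked = sorted(scored_actions, key=lambda ar: ar[1], reverse=True)
    best, worst = ranked[0][0], ranked[-1][0]
    return json.dumps({"prompt": observation,
                       "chosen": best,
                       "rejected": worst})
```

Appending one such line per observation yields a file in the shape of `theme4_grpo_pairs.jsonl`, ready for a DPO-style trainer.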

---

## 📚 Research-Inspired Design

The self-improving upgrades are inspired by:
- **League / PFSP dynamics** for avoiding cyclic overfitting and improving robustness: [AlphaStar (Nature, 2019)](https://www.nature.com/articles/s41586-019-1724-z)
- **Group-relative policy updates** for efficient critic-free optimization: [DeepSeekMath / GRPO (arXiv:2402.03300)](https://arxiv.org/abs/2402.03300)
- **Cross-play and equilibrium-oriented opponent diversity**: [Fictitious Cross-Play (arXiv:2310.03354)](https://arxiv.org/abs/2310.03354)

---

## 🧪 Judge Repro (Colab + HF Credits)

For hackathon evaluation, use the Colab notebook:
- `notebooks/theme4_judge_repro_colab.ipynb`

What this notebook does:
- connects to the deployed Space (`https://pratap-k-smartpayenv.hf.space`)
- collects group-relative preference pairs from `/simulate`
- runs a lightweight TRL DPO pass
- writes reproducible artifacts (`artifacts/run_metrics.json`)

Judge flow:
1. Open notebook in Colab and run all cells.
2. Login with Hugging Face token when prompted (credits-enabled account).
3. Keep `QUICK_MODE=True` for fast rerun; set `False` for longer training.

Expected runtime:
- Quick mode: ~10-20 minutes
- Full mode: ~45-90 minutes (depending on Colab hardware/model)

---

## 📐 Data Models

### Action Space (`SmartpayenvAction`)
| Field | Type | Values | Description |
|-------|------|--------|-------------|
| `gateway` | `int` | `0, 1, 2` | 0=Economy, 1=Standard, 2=Premium |
| `fraud_decision`| `int` | `0, 1, 2, 3`| 0=Allow, 1=Block, 2=3DS (Challenge), 3=Manual Review |
| `retry_strategy`| `int` | `0, 1` | 0=No Retry, 1=Auto-Failover |
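An example action payload matching the table above. The range checks are a local sanity sketch, not the server's actual Pydantic validation.

```python
# Example action: route via the Premium gateway, issue a 3DS challenge,
# and auto-failover on a soft decline.
action = {
    "gateway": 2,          # 0=Economy, 1=Standard, 2=Premium
    "fraud_decision": 2,   # 0=Allow, 1=Block, 2=3DS, 3=Manual Review
    "retry_strategy": 1,   # 0=No Retry, 1=Auto-Failover
}

def is_valid_action(a: dict) -> bool:
    """Local sanity check mirroring the documented value ranges."""
    return (a["gateway"] in (0, 1, 2)
            and a["fraud_decision"] in (0, 1, 2, 3)
            and a["retry_strategy"] in (0, 1))
```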

### Observation Space (`SmartpayenvObservation`)
| Category | Field | Description |
|----------|-------|-------------|
| **Context** | `amount` | Transaction value in USD |
| | `bin_category` | Card type (0-9) |
| | `user_segment` | 0=New, 1=Existing, 2=Premium |
| **Signals** | `observed_fraud_risk`| Noisy risk probability [0,1] |
| | `time_of_day` | Current simulation hour (0-23) |
| **Reviews**| `review_resolutions`| List of matured manual review results |
| **Health** | `gateway_states` | LAGGED Health status (2 steps delay) |
| | `gateway_success_rates`| LAGGED success probabilities |
| **Tracking**| `chargeback_penalty_applied`| Penalty from a past undetected fraud |

---

## 🏗️ Step-by-Step Setup

### 1. Local Development
We recommend using [uv](https://github.com/astral-sh/uv) for fast, reliable dependency management.

```bash
# Clone and enter the repository
git clone https://github.com/pratap-nitjsr/SmartPayEnv.git
cd SmartPayEnv

# Install dependencies
uv sync

# (Recommended) Regenerate realistic synthetic data
python scripts/generate_logs.py --num-transactions 20000 --n-users 5000 --seed 42

# Run the OpenEnv validation suite
openenv validate

# Run core logic tests
python tests/test_reality_features.py
```

### 2. Starting the Server
```bash
# Run via uv
uv run -m SmartPayEnv.server.app
```
Access the **Swagger UI** at `http://localhost:7860/` (auto-redirects to `/docs`).

### 3. Multi-Mode Deployment (Docker)
```bash
# Build the production image
docker build -t smartpay-env .

# Run the container
docker run -p 7860:7860 smartpay-env
```

### 4. Synthetic Data World Generator (NEW)
Use this when you want realistic, evolving "real-world-like" transaction streams:

```bash
python scripts/generate_logs.py \
  --output data/transactions_log.jsonl \
  --num-transactions 20000 \
  --n-users 5000 \
  --seed 42 \
  --base-fraud-rate 0.08
```

What gets generated:
- **Normal baseline behavior** (segment-based spend, location/device consistency, time-of-day effects)
- **Seed fraud templates** (`high_value_spike`, `velocity_burst`, `geo_anomaly`, `device_spoof`, `split_transactions`)
- **Adaptive fraud evolution** (strategy composition and stealth attacks such as `low_risk_disguise`)
- **Strategy labels for storytelling** via `fraud_strategy` and `event_marker`

---

## 📁 Project Structure
```text
SmartPayEnv/
├── scripts/
│   ├── generate_logs.py         # Synthetic dataset generator
├── data/
│   ├── transactions_log.jsonl   # Pre-generated transaction pool
├── server/
│   ├── app.py                  # FastAPI Entry Point (Uvicorn)
│   ├── SmartPayEnv_environment.py # Core Reality Layer Logic
│   ├── graders.py               # Math models for RL Reward
│   └── utils.py                 # Log loading & sampling utilities
├── tests/
│   ├── test_graders.py         # Unit tests for scoring math
│   ├── test_reality_features.py # Reality layer verification
│   └── test_env_logs.py        # Log-driven simulation test
├── models.py                   # Pydantic Action/Observation Schemas
├── inference.py                # LLM/RL Agent Driver & Curriculum
├── pyproject.toml              # Dependency & Build Manifest
└── openenv.yaml                # OpenEnv Environment Metadata
```

## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.