ninarg committed on
Commit ee9eb6a · 1 Parent(s): a585e24

Initial: Gradio inference Space (edge fraud + node anomaly)

Files changed (5)
  1. .gitignore +4 -0
  2. README.md +48 -8
  3. app.py +403 -0
  4. models.py +265 -0
  5. requirements.txt +9 -0
.gitignore ADDED
@@ -0,0 +1,4 @@
+ __pycache__/
+ *.pyc
+ .gradio/
+ .venv/
README.md CHANGED
@@ -1,13 +1,53 @@
  ---
- title: Fraud Gnn Demo
- emoji: 🚀
- colorFrom: green
- colorTo: red
+ title: VynFi Fraud-GNN Demo
+ emoji: 🛡️
+ colorFrom: red
+ colorTo: indigo
  sdk: gradio
- sdk_version: 6.14.0
- python_version: '3.13'
+ sdk_version: 5.5.0
+ python_version: '3.11'
  app_file: app.py
- pinned: false
+ pinned: true
+ license: apache-2.0
+ short_description: GraphSAGE fraud + GAE anomaly on synthetic JE network
+ tags:
+ - vynfi
+ - graph-neural-network
+ - fraud-detection
+ - anomaly-detection
+ - synthetic-data
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # 🛡️ VynFi Fraud-GNN Demo
+
+ Interactive inference Space for the
+ [`VynFi/je-fraud-gnn`](https://huggingface.co/VynFi/je-fraud-gnn)
+ model bundle.
+
+ ## Three tabs
+
+ * **Edge fraud predictor**: pick a curated sample (clear fraud / clear
+   normal / borderline) or build your own edge from any of the 499 GL
+   accounts in the published COA. Returns fraud probability + anomaly MSE.
+ * **Node anomaly explorer**: top-K accounts ranked by GAE
+   reconstruction error on a 5,000-edge sample; surfaces accounts whose
+   attribute patterns don't fit the structural prior.
+ * **Live evaluation**: sample N edges from
+   [`VynFi/vynfi-journal-entries-1m`](https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m),
+   run the classifier, render confusion matrix + ROC against ground truth.
+
+ ## Tech
+
+ * Gradio + torch-geometric + pandas + matplotlib
+ * Loads model bundle from `VynFi/je-fraud-gnn` at cold-start (cached after).
+ * Loads dataset slices from `VynFi/vynfi-journal-entries-1m` on demand.
+
+ ## Source
+
+ * [Engine repo (`spaces/fraud-gnn-demo/`)](https://github.com/mivertowski/SyntheticData/tree/main/spaces/fraud-gnn-demo)
+ * [Model card](https://huggingface.co/VynFi/je-fraud-gnn): full training details, metrics, and an honest discussion of where the GNN helps vs. the LR baseline.
+ * [Companion paper (SSRN)](https://ssrn.com/abstract=6538639)
+
+ ## License
+
+ Apache-2.0.
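The README's "Live evaluation" tab compares thresholded fraud probabilities with ground-truth labels. The sketch below shows, in plain Python, how the two headline quantities (confusion-matrix counts and ROC AUC) can be computed. It is illustrative only: the Space itself relies on scikit-learn, and the helper names `confusion_counts` and `roc_auc` and the toy inputs are assumptions made for this example.

```python
def confusion_counts(y_true, probs, threshold=0.5):
    """Return (tn, fp, fn, tp) after thresholding the probabilities."""
    tn = fp = fn = tp = 0
    for label, p in zip(y_true, probs):
        predicted = 1 if p >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def roc_auc(y_true, probs):
    """AUC as the chance that a random positive outranks a random negative.

    Ties count as half a win. Returns None when only one class is present,
    mirroring the single-class guard in the live-check tab.
    """
    pos = [p for y, p in zip(y_true, probs) if y == 1]
    neg = [p for y, p in zip(y_true, probs) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores (made up for illustration).
labels = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.4, 0.9]
counts = confusion_counts(labels, scores, threshold=0.5)
auc = roc_auc(labels, scores)
```

The pairwise definition of AUC is quadratic in the sample size, which is fine for a sketch but is one reason a library routine is preferable for the 2,000-edge samples the slider allows.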
app.py ADDED
@@ -0,0 +1,403 @@
+ """VynFi Fraud-GNN Demo: Gradio Space.
+
+ Three tabs:
+
+ * **Edge fraud predictor**: dataset-sampled examples + manual entry.
+ * **Node anomaly explorer**: top-K accounts by GAE reconstruction MSE.
+ * **Live check**: random val sample with confusion matrix + ROC.
+ """
+ from __future__ import annotations
+
+ from functools import lru_cache
+ from typing import Any
+
+ import gradio as gr
+ import matplotlib.pyplot as plt
+ import numpy as np
+ import pandas as pd
+ import torch
+ from huggingface_hub import hf_hub_download, snapshot_download
+ from sklearn.metrics import (
+     average_precision_score,
+     confusion_matrix,
+     roc_auc_score,
+     roc_curve,
+ )
+
+ from models import BUSINESS_PROCESSES, InferenceBundle, load_bundle
+
+
+ MODEL_REPO = "VynFi/je-fraud-gnn"
+ DATA_REPO = "VynFi/vynfi-journal-entries-1m"
+
+
+ # ─── Lazy loaders (executed once at app startup; cached thereafter) ─────────
+
+
+ @lru_cache(maxsize=1)
+ def get_bundle() -> InferenceBundle:
+     local = snapshot_download(repo_id=MODEL_REPO)
+     return load_bundle(local)
+
+
+ @lru_cache(maxsize=1)
+ def get_account_catalog() -> pd.DataFrame:
+     fp = hf_hub_download(repo_id=DATA_REPO, filename="chart_of_accounts.parquet", repo_type="dataset")
+     df = pd.read_parquet(fp)[
+         ["account_number", "short_description", "account_type", "account_class", "account_class_name"]
+     ]
+     df["account_number"] = df["account_number"].astype(str)
+     df = df.drop_duplicates(subset=["account_number"], keep="first")
+     df["label"] = df["account_number"] + " — " + df["short_description"]
+     return df
+
+
+ @lru_cache(maxsize=1)
+ def get_edge_sample() -> pd.DataFrame:
+     fp = hf_hub_download(repo_id=DATA_REPO, filename="je_network.parquet", repo_type="dataset")
+     df = pd.read_parquet(fp)
+     df["from_account"] = df["from_account"].astype(str)
+     df["to_account"] = df["to_account"].astype(str)
+     return df
+
+
+ def account_choices() -> list[str]:
+     bundle = get_bundle()
+     cat = get_account_catalog()
+     cat = cat[cat["account_number"].isin(bundle.node_index)].sort_values("account_number")
+     return cat["label"].tolist()
+
+
+ def label_to_account(label: str) -> str:
+     return label.split(" — ", 1)[0]
+
+
+ # ─── Tab 1: Edge fraud predictor ─────────────────────────────────────────────
+
+
+ CURATED_SAMPLES = [
+     {
+         "label": "Clear-fraud P2P (round-dollar + weekend)",
+         "from": "1000 — Operating Cash",
+         "to": "2000 — Trade Payables",
+         "amount": 25_000.0,
+         "process": "P2P",
+         "date": "2024-08-10",
+     },
+     {
+         "label": "Clear-fraud O2C (round + Sunday)",
+         "from": "1100 — Accounts Receivable",
+         "to": "4000 — Sales Revenue",
+         "amount": 50_000.0,
+         "process": "O2C",
+         "date": "2024-09-08",
+     },
+     {
+         "label": "Clear-normal P2P (off-round amount, weekday)",
+         "from": "1000 — Operating Cash",
+         "to": "2000 — Trade Payables",
+         "amount": 7_432.89,
+         "process": "P2P",
+         "date": "2024-03-12",
+     },
+     {
+         "label": "Clear-normal O2C (mid-month, weekday)",
+         "from": "1100 — Accounts Receivable",
+         "to": "4000 — Sales Revenue",
+         "amount": 12_876.43,
+         "process": "O2C",
+         "date": "2024-04-17",
+     },
+     {
+         "label": "Borderline (round amount, weekday)",
+         "from": "1000 — Operating Cash",
+         "to": "2000 — Trade Payables",
+         "amount": 10_000.0,
+         "process": "P2P",
+         "date": "2024-05-15",
+     },
+ ]
+
+
+ def fmt_money(x: float) -> str:
+     sign = "-" if x < 0 else ""
+     x = abs(float(x))
+     if x >= 1e9:
+         return f"{sign}${x / 1e9:.2f}B"
+     if x >= 1e6:
+         return f"{sign}${x / 1e6:.2f}M"
+     if x >= 1e3:
+         return f"{sign}${x / 1e3:.2f}K"
+     return f"{sign}${x:.2f}"
+
+
+ def predict_one(
+     from_label: str,
+     to_label: str,
+     amount: float,
+     process: str,
+     date: str,
+ ) -> tuple[str, dict]:
+     bundle = get_bundle()
+     src = label_to_account(from_label)
+     dst = label_to_account(to_label)
+     fraud_p = float(
+         bundle.predict_fraud(
+             from_account=[src],
+             to_account=[dst],
+             amount=[float(amount)],
+             business_process=[process],
+             posting_date=[str(date)],
+         )[0]
+     )
+     anomaly_mse = float(
+         bundle.anomaly_score_edges(
+             from_account=[src],
+             to_account=[dst],
+             amount=[float(amount)],
+             business_process=[process],
+             posting_date=[str(date)],
+         )[0]
+     )
+     threshold = bundle.fraud_threshold
+     verdict = "🚨 FRAUD" if fraud_p >= threshold else "✓ normal"
+     summary_md = (
+         f"### {verdict}\n\n"
+         f"**Fraud probability:** `{fraud_p:.4f}` (threshold = `{threshold:.3f}`)  \n"
+         f"**Anomaly MSE:** `{anomaly_mse:.4f}` (higher = more unusual)\n\n"
+         f"**Edge:** `{src}` → `{dst}`  \n"
+         f"**Amount:** {fmt_money(amount)} · **Process:** {process} · **Date:** {date}\n"
+     )
+     feature_inspect = {
+         "is_round_dollar": any(abs(float(amount) - lv) < 1.0 for lv in [1000, 5000, 10000, 25000, 50000, 100000]),
+         "is_weekend": pd.to_datetime(date).dayofweek >= 5,
+         "amount": float(amount),
+         "process": process,
+     }
+     return summary_md, feature_inspect
+
+
+ def load_sample(sample_label: str) -> tuple[str, str, float, str, str]:
+     s = next(s for s in CURATED_SAMPLES if s["label"] == sample_label)
+     return s["from"], s["to"], s["amount"], s["process"], s["date"]
+
+
+ # ─── Tab 2: Node anomaly explorer ────────────────────────────────────────────
+
+
+ def build_node_anomaly_table(top_k: int = 50) -> pd.DataFrame:
+     bundle = get_bundle()
+     cat = get_account_catalog()
+     edges_df = get_edge_sample()
+
+     test_sample = edges_df.sample(min(5000, len(edges_df)), random_state=42)
+     test_sample = test_sample[
+         test_sample["from_account"].isin(bundle.node_index)
+         & test_sample["to_account"].isin(bundle.node_index)
+     ]
+     per_edge_mse = bundle.anomaly_score_edges(
+         from_account=test_sample["from_account"].tolist(),
+         to_account=test_sample["to_account"].tolist(),
+         amount=test_sample["amount"].tolist(),
+         business_process=test_sample["business_process"].tolist(),
+         posting_date=test_sample["posting_date"].astype(str).tolist(),
+     )
+
+     df = test_sample.copy()
+     df["mse"] = per_edge_mse
+     src_agg = df.groupby("from_account").agg(out_mse=("mse", "mean"), out_count=("mse", "count"))
+     dst_agg = df.groupby("to_account").agg(in_mse=("mse", "mean"), in_count=("mse", "count"))
+     by_node = src_agg.join(dst_agg, how="outer").fillna(0)
+     by_node["mean_mse"] = (
+         (by_node["out_mse"] * by_node["out_count"] + by_node["in_mse"] * by_node["in_count"])
+         / (by_node["out_count"] + by_node["in_count"]).replace(0, 1)
+     )
+     by_node["incident_edges"] = by_node["out_count"] + by_node["in_count"]
+     by_node = by_node.rename_axis("account_number").reset_index()
+
+     enriched = by_node.merge(cat, on="account_number", how="left")
+     enriched = enriched.sort_values("mean_mse", ascending=False).head(int(top_k))
+     enriched["mean_mse"] = enriched["mean_mse"].round(4)
+     return enriched[
+         [
+             "account_number",
+             "short_description",
+             "account_type",
+             "account_class",
+             "mean_mse",
+             "incident_edges",
+         ]
+     ].rename(
+         columns={
+             "account_number": "GL #",
+             "short_description": "Account",
+             "account_type": "Type",
+             "account_class": "Class",
+             "mean_mse": "Anomaly MSE",
+             "incident_edges": "Sample edges",
+         }
+     )
+
+
+ # ─── Tab 3: Live check ───────────────────────────────────────────────────────
+
+
+ def run_live_check(n_samples: int = 200) -> tuple[Any, Any, str]:
+     bundle = get_bundle()
+     edges_df = get_edge_sample()
+     edges_df = edges_df[
+         edges_df["from_account"].isin(bundle.node_index)
+         & edges_df["to_account"].isin(bundle.node_index)
+     ]
+     sample = edges_df.sample(min(int(n_samples), len(edges_df)), random_state=None)
+
+     probs = bundle.predict_fraud(
+         from_account=sample["from_account"].tolist(),
+         to_account=sample["to_account"].tolist(),
+         amount=sample["amount"].tolist(),
+         business_process=sample["business_process"].tolist(),
+         posting_date=sample["posting_date"].astype(str).tolist(),
+     )
+     y_true = sample["is_fraud"].astype(int).to_numpy()
+     threshold = bundle.fraud_threshold
+     y_pred = (probs >= threshold).astype(int)
+     if y_true.sum() == 0 or y_true.sum() == len(y_true):
+         return None, None, "Sampled batch had only one class; try a larger sample."
+
+     auc = roc_auc_score(y_true, probs)
+     ap = average_precision_score(y_true, probs)
+     cm = confusion_matrix(y_true, y_pred)
+
+     fig_cm = plt.figure(figsize=(4, 4), dpi=120)
+     ax = fig_cm.add_subplot(111)
+     ax.imshow(cm, cmap="Blues")
+     ax.set_xticks([0, 1])
+     ax.set_yticks([0, 1])
+     ax.set_xticklabels(["normal", "fraud"])
+     ax.set_yticklabels(["normal", "fraud"])
+     for i in range(2):
+         for j in range(2):
+             ax.text(j, i, str(cm[i, j]), ha="center", va="center", fontsize=14, color="black")
+     ax.set_xlabel("predicted")
+     ax.set_ylabel("actual")
+     ax.set_title(f"Confusion matrix (n={len(sample)})")
+     fig_cm.tight_layout()
+
+     fpr, tpr, _ = roc_curve(y_true, probs)
+     fig_roc = plt.figure(figsize=(4, 4), dpi=120)
+     ax2 = fig_roc.add_subplot(111)
+     ax2.plot(fpr, tpr, label=f"ROC AUC = {auc:.3f}")
+     ax2.plot([0, 1], [0, 1], "k--", alpha=0.4)
+     ax2.set_xlabel("false positive rate")
+     ax2.set_ylabel("true positive rate")
+     ax2.set_title("ROC")
+     ax2.legend()
+     fig_roc.tight_layout()
+
+     summary = (
+         f"### Live check on {len(sample)} sampled edges\n\n"
+         f"- AUC-ROC: **{auc:.4f}**\n"
+         f"- AUC-PR: **{ap:.4f}**\n"
+         f"- True fraud: {int(y_true.sum())} / {len(y_true)}\n"
+         f"- Predicted fraud: {int(y_pred.sum())} / {len(y_pred)}\n"
+         f"- Threshold: {threshold:.3f}\n"
+     )
+     return fig_cm, fig_roc, summary
+
+
+ # ─── Gradio UI ───────────────────────────────────────────────────────────────
+
+
+ def build_app() -> gr.Blocks:
+     with gr.Blocks(title="VynFi Fraud-GNN Demo", theme=gr.themes.Soft()) as app:
+         gr.Markdown(
+             """
+             # 🛡️ VynFi Fraud-GNN Demo
+
+             Interactive inference on the
+             [`VynFi/je-fraud-gnn`](https://huggingface.co/VynFi/je-fraud-gnn)
+             model: GraphSAGE edge fraud classifier + attribute-reconstruction
+             GAE node anomaly scorer, trained on the v5.9.0 Method-A network
+             in
+             [`VynFi/vynfi-journal-entries-1m`](https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m).
+             """
+         )
+
+         with gr.Tab("Edge fraud predictor"):
+             with gr.Row():
+                 with gr.Column():
+                     sample_picker = gr.Dropdown(
+                         label="Curated samples",
+                         choices=[s["label"] for s in CURATED_SAMPLES],
+                         value=None,
+                         info="Or fill in the form below for a custom edge.",
+                     )
+                     from_dd = gr.Dropdown(label="From account", choices=account_choices(), value=None)
+                     to_dd = gr.Dropdown(label="To account", choices=account_choices(), value=None)
+                     amount_in = gr.Number(label="Amount (USD)", value=10_000.0)
+                     process_dd = gr.Dropdown(
+                         label="Business process",
+                         choices=BUSINESS_PROCESSES,
+                         value="P2P",
+                     )
+                     date_in = gr.Textbox(label="Posting date (YYYY-MM-DD)", value="2024-06-15")
+                     predict_btn = gr.Button("Predict", variant="primary")
+
+                 with gr.Column():
+                     summary_md = gr.Markdown()
+                     feat_box = gr.JSON(label="Feature trace")
+
+             sample_picker.change(
+                 load_sample,
+                 inputs=[sample_picker],
+                 outputs=[from_dd, to_dd, amount_in, process_dd, date_in],
+             )
+             predict_btn.click(
+                 predict_one,
+                 inputs=[from_dd, to_dd, amount_in, process_dd, date_in],
+                 outputs=[summary_md, feat_box],
+             )
+
+         with gr.Tab("Node anomaly explorer"):
+             gr.Markdown(
+                 "Top accounts ranked by mean per-edge reconstruction MSE on a "
+                 "5,000-edge sample: accounts whose *attribute patterns* don't fit the "
+                 "structural prior learned by the GAE."
+             )
+             top_k_slider = gr.Slider(label="Top K", minimum=10, maximum=200, value=50, step=10)
+             anomaly_table = gr.Dataframe(value=build_node_anomaly_table(50), wrap=True)
+             refresh_btn = gr.Button("Recompute")
+             refresh_btn.click(build_node_anomaly_table, inputs=[top_k_slider], outputs=[anomaly_table])
+
+         with gr.Tab("Live check"):
+             gr.Markdown(
+                 "Sample N random edges from the published dataset, run the "
+                 "fraud classifier, show confusion matrix + ROC against ground truth."
+             )
+             n_slider = gr.Slider(label="Sample size", minimum=50, maximum=2000, value=300, step=50)
+             run_btn = gr.Button("Run", variant="primary")
+             with gr.Row():
+                 cm_plot = gr.Plot(label="Confusion matrix")
+                 roc_plot = gr.Plot(label="ROC curve")
+             check_summary = gr.Markdown()
+             run_btn.click(run_live_check, inputs=[n_slider], outputs=[cm_plot, roc_plot, check_summary])
+
+         gr.Markdown(
+             """
+             ---
+
+             **Honest caveat.** The synthetic fraud-bias model puts strong local
+             signals into edge attributes (40 % round-dollar, 30 % weekend), so a
+             simple LR on edge features already gets to AUC 0.91. GraphSAGE adds
+             +0.13 AUC pts on the supervised task; the unsupervised attribute-GAE
+             is where graph methods earn their keep here (AUC 0.65 *with no labels*).
+             See the [model card](https://huggingface.co/VynFi/je-fraud-gnn) for
+             full metrics + a discussion of where the GNN does/doesn't add value.
+             """
+         )
+
+     return app
+
+
+ if __name__ == "__main__":
+     build_app().launch()
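The node anomaly table in `app.py` ranks accounts by a count-weighted mean of the per-edge reconstruction MSE over outgoing and incoming edges, which equals the plain mean over all incident edges. A minimal pure-Python sketch of that aggregation is given below. The function name `rank_accounts_by_mean_mse` and the toy edge list are hypothetical, and the sketch uses dictionaries rather than the pandas `groupby`/`join` in the Space.

```python
from collections import defaultdict


def rank_accounts_by_mean_mse(edges, top_k=50):
    """Rank accounts by mean MSE over all incident edges.

    `edges` is an iterable of (from_account, to_account, mse) tuples.
    Each edge contributes once to its source and once to its destination,
    so a self-loop counts twice for the same account.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for src, dst, mse in edges:
        for account in (src, dst):
            totals[account] += mse
            counts[account] += 1
    ranked = [
        (account, totals[account] / counts[account], counts[account])
        for account in totals
    ]
    # Highest mean error first, as in the "Anomaly MSE" column.
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked[:top_k]


# Toy per-edge errors (made up for illustration).
toy_edges = [
    ("1000", "2000", 1.0),
    ("1000", "4000", 2.0),
    ("1100", "1000", 6.0),
]
ranking = rank_accounts_by_mean_mse(toy_edges, top_k=3)
```

Accounts with very few incident edges can dominate such a ranking, so the edge count is returned next to the mean, just as the table shows a "Sample edges" column.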
models.py ADDED
@@ -0,0 +1,265 @@
+ """Vendored model classes + inference bundle for the Gradio Space.
+
+ Self-contained: does not import from the engine repo so the Space can
+ deploy from `VynFi/je-fraud-gnn` without pulling the full SyntheticData
+ codebase.
+ """
+ from __future__ import annotations
+
+ import json
+ from dataclasses import dataclass
+ from pathlib import Path
+ from typing import Any
+
+ import numpy as np
+ import pandas as pd
+ import torch
+ import torch.nn.functional as F
+ from torch import nn
+ from torch_geometric.nn import SAGEConv
+
+ ROUND_LEVELS = np.array([1_000.0, 5_000.0, 10_000.0, 25_000.0, 50_000.0, 100_000.0])
+ BUSINESS_PROCESSES = ["P2P", "O2C", "R2R", "H2R", "A2R"]
+
+
+ # ─── Model classes (must match training scripts byte-for-byte) ───────────────
+
+
+ class EdgeFraudGNN(nn.Module):
+     def __init__(
+         self,
+         node_in: int,
+         edge_in: int,
+         hidden: int = 64,
+         out: int = 64,
+         head_hidden: int = 128,
+         dropout: float = 0.2,
+     ) -> None:
+         super().__init__()
+         self.conv1 = SAGEConv(node_in, hidden, aggr="mean")
+         self.conv2 = SAGEConv(hidden, out, aggr="mean")
+         self.dropout = dropout
+         self.head = nn.Sequential(
+             nn.Linear(2 * out + edge_in, head_hidden),
+             nn.ReLU(),
+             nn.Dropout(dropout),
+             nn.Linear(head_hidden, 1),
+         )
+
+     def encode(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
+         h = F.relu(self.conv1(x, edge_index))
+         h = F.dropout(h, p=self.dropout, training=self.training)
+         h = self.conv2(h, edge_index)
+         return h
+
+     def edge_logits(self, h, edge_index, edge_attr):
+         src, dst = edge_index
+         z = torch.cat([h[src], h[dst], edge_attr], dim=-1)
+         return self.head(z).squeeze(-1)
+
+
+ class SageEncoder(nn.Module):
+     def __init__(self, in_dim: int, hidden: int = 64, out: int = 32, dropout: float = 0.2) -> None:
+         super().__init__()
+         self.conv1 = SAGEConv(in_dim, hidden, aggr="mean")
+         self.conv2 = SAGEConv(hidden, out, aggr="mean")
+         self.dropout = dropout
+
+     def forward(self, x, edge_index):
+         h = F.relu(self.conv1(x, edge_index))
+         h = F.dropout(h, p=self.dropout, training=self.training)
+         return self.conv2(h, edge_index)
+
+
+ class AttrDecoder(nn.Module):
+     def __init__(self, z_dim: int, edge_attr_dim: int, hidden: int = 128, dropout: float = 0.2) -> None:
+         super().__init__()
+         self.net = nn.Sequential(
+             nn.Linear(2 * z_dim, hidden),
+             nn.ReLU(),
+             nn.Dropout(dropout),
+             nn.Linear(hidden, edge_attr_dim),
+         )
+
+     def forward(self, z, edge_index):
+         src, dst = edge_index
+         return self.net(torch.cat([z[src], z[dst]], dim=-1))
+
+
+ class AttrGAE(nn.Module):
+     def __init__(self, in_dim: int, edge_attr_dim: int, hidden: int = 64, out: int = 32, dropout: float = 0.2) -> None:
+         super().__init__()
+         self.encoder = SageEncoder(in_dim=in_dim, hidden=hidden, out=out, dropout=dropout)
+         self.decoder = AttrDecoder(z_dim=out, edge_attr_dim=edge_attr_dim, hidden=hidden * 2, dropout=dropout)
+
+     def forward(self, x, edge_index, target_edges):
+         z = self.encoder(x, edge_index)
+         return self.decoder(z, target_edges)
+
+
+ # ─── Inference bundle ────────────────────────────────────────────────────────
+
+
+ @dataclass
+ class InferenceBundle:
+     fraud_model: EdgeFraudGNN
+     anomaly_model: AttrGAE
+     node_index: dict[str, int]
+     edge_attr_scaler_mean: np.ndarray
+     edge_attr_scaler_scale: np.ndarray
+     node_feature_scaler_mean: np.ndarray
+     node_feature_scaler_scale: np.ndarray
+     node_features_raw: np.ndarray
+     edge_index: np.ndarray
+     feature_columns: dict[str, list[str]]
+     fraud_threshold: float
+     metadata: dict[str, Any]
+
+     @property
+     def node_features_scaled(self) -> torch.Tensor:
+         x = (self.node_features_raw - self.node_feature_scaler_mean) / self.node_feature_scaler_scale
+         return torch.from_numpy(x.astype(np.float32))
+
+     @property
+     def reverse_node_index(self) -> dict[int, str]:
+         return {v: k for k, v in self.node_index.items()}
+
+     def encode_edges(
+         self,
+         from_account,
+         to_account,
+         amount,
+         business_process,
+         posting_date,
+         confidence=None,
+     ) -> tuple[torch.Tensor, torch.Tensor]:
+         n = len(from_account)
+         if confidence is None:
+             confidence = [1.0] * n
+         df = pd.DataFrame(
+             {
+                 "from_account": [str(a) for a in from_account],
+                 "to_account": [str(a) for a in to_account],
+                 "amount": amount,
+                 "business_process": business_process,
+                 "posting_date": pd.to_datetime(posting_date, errors="coerce"),
+                 "confidence": confidence,
+             }
+         )
+
+         unknown = set(df["from_account"]) | set(df["to_account"])
+         unknown -= set(self.node_index.keys())
+         if unknown:
+             raise ValueError(f"unknown account number(s): {sorted(unknown)}")
+
+         src = df["from_account"].map(self.node_index).to_numpy(dtype=np.int64)
+         dst = df["to_account"].map(self.node_index).to_numpy(dtype=np.int64)
+         edge_index = np.stack([src, dst], axis=0)
+
+         a = df["amount"].astype(float).to_numpy()
+         log_amt = np.log1p(a).astype(np.float32)
+         diffs = np.abs(a[:, None] - ROUND_LEVELS[None, :])
+         nearest = diffs.min(axis=1)
+         is_round = (nearest < 1.0).astype(np.float32)
+         log_dist = np.log1p(nearest).astype(np.float32)
+         nearest_idx = diffs.argmin(axis=1)
+         per_level = np.zeros((n, len(ROUND_LEVELS)), dtype=np.float32)
+         is_close = nearest < 1.0
+         per_level[is_close, nearest_idx[is_close]] = 1.0
+
+         bp_oh = (
+             pd.get_dummies(df["business_process"].fillna("UNK"), prefix="bp")
+             .reindex(columns=[f"bp_{p}" for p in BUSINESS_PROCESSES], fill_value=0)
+             .astype(np.float32)
+             .to_numpy()
+         )
+
+         dt = df["posting_date"]
+         doy = dt.dt.dayofyear.fillna(1).to_numpy()
+         woy = dt.dt.isocalendar().week.fillna(1).astype(int).to_numpy()
+         dow = dt.dt.dayofweek.fillna(0).to_numpy()
+         is_weekend = (dow >= 5).astype(np.float32)
+         date_feats = np.stack(
+             [
+                 np.sin(2 * np.pi * doy / 366),
+                 np.cos(2 * np.pi * doy / 366),
+                 np.sin(2 * np.pi * woy / 53),
+                 np.cos(2 * np.pi * woy / 53),
+                 np.sin(2 * np.pi * dow / 7),
+                 np.cos(2 * np.pi * dow / 7),
+                 is_weekend,
+             ],
+             axis=1,
+         ).astype(np.float32)
+
+         confidence_arr = df["confidence"].astype(float).to_numpy().reshape(-1, 1).astype(np.float32)
+
+         edge_attr = np.concatenate(
+             [
+                 log_amt[:, None],
+                 is_round[:, None],
+                 log_dist[:, None],
+                 per_level,
+                 confidence_arr,
+                 bp_oh,
+                 date_feats,
+             ],
+             axis=1,
+         )
+         edge_attr_scaled = (
+             (edge_attr - self.edge_attr_scaler_mean) / self.edge_attr_scaler_scale
+         ).astype(np.float32)
+
+         return torch.from_numpy(edge_index), torch.from_numpy(edge_attr_scaled)
+
+     @torch.no_grad()
+     def predict_fraud(self, **kwargs) -> np.ndarray:
+         target_edge_index, target_edge_attr = self.encode_edges(**kwargs)
+         graph_edge_index = torch.from_numpy(self.edge_index)
+         x = self.node_features_scaled
+
+         self.fraud_model.train(False)
+         h = self.fraud_model.encode(x, graph_edge_index)
+         logits = self.fraud_model.edge_logits(h, target_edge_index, target_edge_attr)
+         return torch.sigmoid(logits).cpu().numpy()
+
+     @torch.no_grad()
+     def anomaly_score_edges(self, **kwargs) -> np.ndarray:
+         target_edge_index, target_edge_attr = self.encode_edges(**kwargs)
+         graph_edge_index = torch.from_numpy(self.edge_index)
+         x = self.node_features_scaled
+
+         self.anomaly_model.train(False)
+         recon = self.anomaly_model(x, graph_edge_index, target_edge_index)
+         return ((recon - target_edge_attr) ** 2).mean(dim=-1).cpu().numpy()
+
+
+ def load_bundle(model_dir: Path | str) -> InferenceBundle:
+     model_dir = Path(model_dir)
+     fraud_payload = torch.load(model_dir / "je_fraud_gnn.pt", weights_only=False, map_location="cpu")
+     anomaly_payload = torch.load(model_dir / "je_anomaly_gae.pt", weights_only=False, map_location="cpu")
+     preprocessor = torch.load(model_dir / "preprocessor.pt", weights_only=False, map_location="cpu")
+     metadata = json.loads((model_dir / "metadata.json").read_text())
+
+     fraud_model = EdgeFraudGNN(**fraud_payload["model_config"])
+     fraud_model.load_state_dict(fraud_payload["model_state_dict"])
+     fraud_model.train(False)
+
+     anomaly_model = AttrGAE(**anomaly_payload["model_config"])
+     anomaly_model.load_state_dict(anomaly_payload["model_state_dict"])
+     anomaly_model.train(False)
+
+     return InferenceBundle(
+         fraud_model=fraud_model,
+         anomaly_model=anomaly_model,
+         node_index=preprocessor["node_index"],
+         edge_attr_scaler_mean=np.asarray(preprocessor["edge_attr_scaler_mean"], dtype=np.float32),
+         edge_attr_scaler_scale=np.asarray(preprocessor["edge_attr_scaler_scale"], dtype=np.float32),
+         node_feature_scaler_mean=np.asarray(preprocessor["node_feature_scaler_mean"], dtype=np.float32),
+         node_feature_scaler_scale=np.asarray(preprocessor["node_feature_scaler_scale"], dtype=np.float32),
+         node_features_raw=np.asarray(preprocessor["node_features_raw"], dtype=np.float32),
+         edge_index=np.asarray(preprocessor["edge_index"], dtype=np.int64),
+         feature_columns=preprocessor["feature_columns"],
+         fraud_threshold=float(metadata.get("fraud_threshold", 0.5)),
+         metadata=metadata,
+     )
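`InferenceBundle.encode_edges` in `models.py` derives round-dollar and calendar features from each edge before scaling. The sketch below reproduces those two feature groups for a single edge using only the standard library, which may help when checking the "Feature trace" output by hand. The function names `round_dollar_features` and `cyclical_date_features` are hypothetical; the round levels and the periods 366, 53 and 7 are taken from the diff above, while scaling, business-process one-hot encoding and the confidence column are left out.

```python
import math
from datetime import date

# Same round-dollar levels as ROUND_LEVELS in models.py.
ROUND_LEVELS = (1_000.0, 5_000.0, 10_000.0, 25_000.0, 50_000.0, 100_000.0)


def round_dollar_features(amount: float) -> dict:
    """Amount features: log amount, round flag, log distance, per-level one-hot."""
    diffs = [abs(amount - level) for level in ROUND_LEVELS]
    nearest = min(diffs)
    is_round = nearest < 1.0  # within one dollar of a round level
    per_level = [0.0] * len(ROUND_LEVELS)
    if is_round:
        per_level[diffs.index(nearest)] = 1.0
    return {
        "log_amount": math.log1p(amount),
        "is_round": float(is_round),
        "log_dist": math.log1p(nearest),
        "per_level": per_level,
    }


def cyclical_date_features(day: date) -> dict:
    """Sine/cosine encodings of day-of-year, ISO week and weekday, plus a weekend flag."""
    doy = day.timetuple().tm_yday
    woy = day.isocalendar()[1]
    dow = day.weekday()  # Monday = 0, matching pandas dayofweek

    def sin_cos(value: int, period: int) -> tuple:
        angle = 2 * math.pi * value / period
        return math.sin(angle), math.cos(angle)

    doy_sin, doy_cos = sin_cos(doy, 366)
    woy_sin, woy_cos = sin_cos(woy, 53)
    dow_sin, dow_cos = sin_cos(dow, 7)
    return {
        "doy_sin": doy_sin, "doy_cos": doy_cos,
        "woy_sin": woy_sin, "woy_cos": woy_cos,
        "dow_sin": dow_sin, "dow_cos": dow_cos,
        "is_weekend": float(dow >= 5),
    }


# First curated sample: 25,000.00 posted on 2024-08-10.
amount_feats = round_dollar_features(25_000.0)
date_feats = cyclical_date_features(date(2024, 8, 10))
```

Encoding each calendar quantity as a sine/cosine pair keeps the last and first values of a cycle close together, which a raw integer would not.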
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ gradio==5.5.0
+ torch==2.5.1
+ torch-geometric==2.6.1
+ huggingface_hub==0.26.2
+ pandas==2.2.3
+ pyarrow==17.0.0
+ scikit-learn==1.5.2
+ numpy==2.1.3
+ matplotlib==3.9.2