---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- vynfi
- graph-neural-network
- graphsage
- fraud-detection
- anomaly-detection
- financial-data
- synthetic-data
- pyg
- torch-geometric
pipeline_tag: graph-ml
base_model: none
datasets:
- VynFi/vynfi-journal-entries-1m
metrics:
- roc_auc
- average_precision
- f1
---

# VynFi JE Fraud GNN: GraphSAGE edge classifier + GAE node anomaly scorer

Trained on the v5.9.0 Method-A accounting network from [`VynFi/vynfi-journal-entries-1m`](https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m). Two complementary models in one bundle:

| Task | Model | Test AUC-ROC | Test AUC-PR | Notes |
|---|---|---|---|---|
| Edge fraud classification (supervised) | GraphSAGE → edge head | **0.9136** | 0.7949 | Beats the LR baseline by +0.13 percentage points of AUC-ROC. LR is already strong because the weekend and round-dollar features are highly discriminative. |
| Edge anomaly scoring (unsupervised) | Attribute-reconstruction GAE | **0.6540** | 0.1434 | Purely unsupervised: no `is_fraud`/`is_anomaly` labels are seen at train time. Surfaces edges whose attributes are unusual given their structural neighborhood. |

## Per-business-process breakdown (edge fraud classifier, test split)

| Process | AUC-ROC | AUC-PR | F1 | n |
|---|---|---|---|---|
| **P2P** | 0.9289 | 0.8146 | 0.8041 | 2,835 |
| **O2C** | 0.8965 | 0.7660 | 0.7423 | 3,155 |
| **R2R** | 0.9301 | 0.8113 | 0.8000 | 1,895 |
| **H2R** | 0.8859 | 0.7517 | 0.7523 | 914 |
| **A2R** | 0.9512 | 0.9273 | 0.9565 | 450 |

## Architecture

**Fraud classifier** (`EdgeFraudGNN`):

* 2-layer **GraphSAGE** encoder (mean aggregator) → 64-dim node embeddings.
* Edge head: MLP on `concat(emb_src, emb_dst, edge_attr)` → sigmoid.
* BCE loss with positive-class weight ≈ 16.3, the negative-to-positive ratio at a 5.79 % fraud rate.

**Anomaly scorer** (`AttrGAE`):

* Same 2-layer GraphSAGE encoder.
* MLP decoder predicts `edge_attr` from `concat(z_src, z_dst)`.
* MSE loss; per-edge reconstruction error ranks anomalous edges (high error = unusual attributes given structural context).

Both models share the same node feature space (17 dims): account-type one-hot · structural flags · hierarchy level · log-aggregated in/out flows.

Edge features (22 dims): log-amount · is-round-dollar · per-level round flags · confidence · business-process one-hot · day-of-year sin/cos · week-of-year sin/cos · day-of-week sin/cos · is-weekend.

## Quick start

```python
from huggingface_hub import snapshot_download
from scripts.ml.inference import load_bundle

local_dir = snapshot_download(repo_id="VynFi/je-fraud-gnn")
bundle = load_bundle(local_dir)

# Predict fraud probability for one or more edges
probs = bundle.predict_fraud(
    from_account=["1000", "5000"],
    to_account=["2000", "4000"],
    amount=[7432.89, 25000.00],  # second is a "round" amount
    business_process=["P2P", "O2C"],
    posting_date=["2024-03-15", "2024-08-10"],
)
print(probs)  # array([0.13, 0.99]); round amount → strong fraud signal

# Per-edge anomaly score (high MSE = unusual attribute combination)
mse = bundle.anomaly_score_edges(
    from_account=["1000", "5000"],
    to_account=["2000", "4000"],
    amount=[7432.89, 25000.00],
    business_process=["P2P", "O2C"],
    posting_date=["2024-03-15", "2024-08-10"],
)
print(mse)
```

The `scripts/ml/inference.py` module is shipped in the [engine repo](https://github.com/mivertowski/SyntheticData/tree/main/scripts/ml).

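
The edge features listed under Architecture can be approximated with the standard library alone. The sketch below is not the engine's implementation: the function names, feature order, `log1p` scaling, exact-match round-dollar test, and the 366/53/7 periods are all assumptions made for illustration, and the `confidence` feature is left out because this card does not define it.

```python
import math
from datetime import date
from typing import Dict, Tuple

# Assumed constants: the card names these round-dollar levels and business
# processes, but the real feature order and encoding live in the engine repo.
ROUND_LEVELS = (1_000, 5_000, 10_000, 25_000, 50_000, 100_000)
PROCESSES = ("P2P", "O2C", "R2R", "H2R", "A2R")


def cyclical(value: float, period: float) -> Tuple[float, float]:
    """Map a periodic quantity onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def edge_features(amount: float, business_process: str, posting_date: str) -> Dict[str, float]:
    """Build a named feature dict for one journal-entry edge (illustrative only)."""
    d = date.fromisoformat(posting_date)
    feats = {"log_amount": math.log1p(amount)}

    # One flag per canonical level, plus an overall round-dollar flag.
    for level in ROUND_LEVELS:
        feats[f"round_{level}"] = float(amount == level)
    feats["is_round_dollar"] = float(amount in ROUND_LEVELS)

    # Business-process one-hot.
    for proc in PROCESSES:
        feats[f"bp_{proc}"] = float(business_process == proc)

    # Cyclical calendar encodings keep adjacent days and weeks close together.
    feats["doy_sin"], feats["doy_cos"] = cyclical(d.timetuple().tm_yday, 366)
    feats["woy_sin"], feats["woy_cos"] = cyclical(d.isocalendar()[1], 53)
    feats["dow_sin"], feats["dow_cos"] = cyclical(d.weekday(), 7)
    feats["is_weekend"] = float(d.weekday() >= 5)  # Monday is 0, so 5 and 6 are Sat/Sun
    return feats


f = edge_features(25000.00, "O2C", "2024-08-10")
print(f["is_round_dollar"], f["round_25000"], f["is_weekend"])  # 1.0 1.0 1.0
```

Note that 2024-08-10 is a Saturday, so the second edge in the quick-start example carries both a round-dollar flag and a weekend flag under this encoding.
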
## Training data

Sourced from [`VynFi/vynfi-journal-entries-1m`](https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m) v5.9.0:

* **499** GL accounts (after dedupe of 4 conflicting `account_number` rows in the COA)
* **61,656** Method-A edges (one edge per 2-line journal entry)
* **5.79 %** fraud rate (3,571 / 61,656)
* **6.52 %** anomaly rate
* Stratified 70/15/15 train/val/test split on `is_fraud` (seed = 20260509)
* Generated under the v5.9.0 release tag (ChaCha8 PRNG, platform-stable)

## Why does the GraphSAGE encoder add only marginal lift over LR?

Honest answer: the synthetic fraud bias in DataSynth v5.x writes strong, *local* signals into edge attributes:

> `fraud_bias.weekend_bias=0.30` → 41 % of fraud edges land on Sat/Sun vs 0.5 % of non-fraud (77× lift)
>
> `fraud_bias.round_dollar_bias=0.40` → 55 % of fraud edges hit a $1K/$5K/$10K/$25K/$50K/$100K canonical level vs 0.14 % (378× lift)

A LogisticRegression with day-of-week + round-dollar features already gets to **AUC 0.912**, so there is not much room left for the graph encoder to add value on the supervised task. The GraphSAGE encoder adds +0.13 percentage points of AUC-ROC and +0.84 percentage points of AUC-PR; its advantage is most visible in the per-process breakdown (A2R reaches 0.95 AUC-ROC).

Where the graph contribution **does** show up:

* **Unsupervised anomaly detection**. The attribute-reconstruction GAE reaches AUC-ROC 0.654 on edge-level anomaly *with no labels at train time*; the structural prior is doing the work.
* **Top-K anomalous accounts**. The GAE's per-node aggregated MSE (mean across incident test edges) ranks accounts by structural weirdness; precision@10 = 0.60 against the median anomaly-fraction threshold.

For deployment scenarios where you have crisp labels *and* fraud patterns are local to single transactions, an LR baseline may be competitive. For label-free or graph-context fraud (multi-hop laundering, ring transactions), the GNN signal is the differentiator.

## Limitations

* Trained on a single 1M-JE generator run.
  Generalisation to other v5.9.0 datasets has not been evaluated.
* `is_fraud` labels come from DataSynth's fraud-bias mechanism. They reflect known bias signatures (weekend / round-dollar / off-hours / post-close), not the full universe of real-world fraud patterns.
* Account vocabulary is fixed at the 499 nodes in the published COA. Inference on unseen `account_number` values raises `ValueError`.
* Per-node anomaly AUC is close to random (0.48); the per-edge signal is the load-bearing one. For ranking accounts, use precision@K instead of AUC.

## Reproducibility

```bash
git clone https://github.com/mivertowski/SyntheticData.git
cd SyntheticData
pip install -r requirements-ml.txt
python -m scripts.ml.build_je_pyg_dataset --output data/ml/je_pyg_v1.pt --seed 20260509
python -m scripts.ml.train_je_fraud_gnn --epochs 60
python -m scripts.ml.train_je_anomaly_gae --epochs 80
python -m scripts.ml.package_for_hf
```

## Citation

```bibtex
@misc{ivertowski2026datasynth,
  author       = {Ivertowski, Michael},
  title        = {{DataSynth}: Reference Knowledge Graphs for Enterprise Audit Analytics through Synthetic Data Generation with Provable Statistical Properties},
  year         = {2026},
  month        = {April},
  howpublished = {SSRN Working Paper},
  url          = {https://ssrn.com/abstract=6538639}
}
```

## License

Apache-2.0.

## Related

* [`VynFi/vynfi-journal-entries-1m`](https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m): training dataset
* [`VynFi/accounting-network-explorer`](https://huggingface.co/spaces/VynFi/accounting-network-explorer): interactive class-level network viewer
* [`VynFi/fraud-gnn-demo`](https://huggingface.co/spaces/VynFi/fraud-gnn-demo): Gradio inference Space (companion)
* [Engine repo](https://github.com/mivertowski/SyntheticData) · [SSRN paper](https://ssrn.com/abstract=6538639)