| --- |
| license: apache-2.0 |
| language: |
| - en |
| library_name: pytorch |
| tags: |
| - vynfi |
| - graph-neural-network |
| - graphsage |
| - fraud-detection |
| - anomaly-detection |
| - financial-data |
| - synthetic-data |
| - pyg |
| - torch-geometric |
| pipeline_tag: graph-ml |
| base_model: none |
| datasets: |
| - VynFi/vynfi-journal-entries-1m |
| metrics: |
| - roc_auc |
| - average_precision |
| - f1 |
| --- |
| |
| # VynFi JE Fraud GNN: GraphSAGE edge classifier + GAE edge anomaly scorer |
|
|
| Trained on the v5.9.0 Method-A accounting network from |
| [`VynFi/vynfi-journal-entries-1m`](https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m). |
| Two complementary models in one bundle: |
|
|
| | Task | Model | Test AUC-ROC | Test AUC-PR | Notes | |
| |---|---|---|---|---| |
| Edge fraud classification (supervised) | GraphSAGE → edge head | **0.9136** | 0.7949 | Beats the LR baseline by only +0.13 percentage points of AUC-ROC (about 0.001 absolute); LR is already strong because the weekend and round-dollar features are highly discriminative. |
| | Edge anomaly scoring (unsupervised) | Attribute-reconstruction GAE | **0.6540** | 0.1434 | Pure unsupervised — no `is_fraud`/`is_anomaly` labels seen at train time. Surfaces edges whose attributes are unusual given their structural neighborhood. | |
|
|
| ## Per-business-process breakdown (edge fraud classifier, test split) |
|
|
| | Process | AUC-ROC | AUC-PR | F1 | n | |
| |---|---|---|---|---| |
| | **P2P** | 0.9289 | 0.8146 | 0.8041 | 2,835 | |
| | **O2C** | 0.8965 | 0.7660 | 0.7423 | 3,155 | |
| | **R2R** | 0.9301 | 0.8113 | 0.8000 | 1,895 | |
| | **H2R** | 0.8859 | 0.7517 | 0.7523 | 914 | |
| | **A2R** | 0.9512 | 0.9273 | 0.9565 | 450 | |
|
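
A per-process breakdown like the one above can be reproduced from test-split predictions by grouping edges on their business process and scoring each group separately. The sketch below is a minimal stdlib-only illustration, not the repo's evaluation code: `roc_auc` and `per_process_auc` are hypothetical helper names, the toy labels and scores are made up, and AUC-ROC is computed through the rank-sum (Mann-Whitney) identity.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """AUC-ROC via the rank-sum identity; tied scores share an average rank."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(lbl for _, lbl in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def per_process_auc(processes, labels, scores):
    """Group edges by business process and compute AUC-ROC per group."""
    groups = defaultdict(lambda: ([], []))
    for proc, lbl, score in zip(processes, labels, scores):
        groups[proc][0].append(lbl)
        groups[proc][1].append(score)
    return {proc: roc_auc(lbls, scs) for proc, (lbls, scs) in groups.items()}


# Toy data (made up for illustration only).
processes = ["P2P", "P2P", "P2P", "P2P", "O2C", "O2C", "O2C", "O2C"]
labels = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]
result = per_process_auc(processes, labels, scores)
print(result)
```

Small groups such as A2R (n = 450) give noisier estimates than the larger ones, which is worth keeping in mind when comparing rows.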
|
| ## Architecture |
|
|
| **Fraud classifier** — `EdgeFraudGNN`: |
| * 2-layer **GraphSAGE** encoder (mean aggregator) → 64-dim node embeddings. |
| * Edge head: MLP on `concat(emb_src, emb_dst, edge_attr)` → sigmoid. |
| * BCE loss with positive-class weight ≈ 16.3 (5.79 % fraud rate). |
|
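
The positive-class weight quoted above is consistent with the usual negatives-to-positives ratio. The sketch below assumes that is how it was derived (the card does not say so explicitly) and uses the edge counts from the training-data section; `weighted_bce` is a hypothetical stand-in for the framework loss.

```python
import math

n_edges = 61_656          # Method-A edges in the dataset
n_fraud = 3_571           # edges labelled is_fraud
n_legit = n_edges - n_fraud

fraud_rate = n_fraud / n_edges      # about 0.0579
pos_weight = n_legit / n_fraud      # about 16.3 (assumed: negatives / positives)


def weighted_bce(prob, label, pos_weight):
    """Binary cross-entropy for one edge, up-weighting the positive class."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to keep log() finite
    return -(pos_weight * label * math.log(prob)
             + (1 - label) * math.log(1.0 - prob))


print(f"fraud rate = {fraud_rate:.4f}, pos_weight = {pos_weight:.2f}")
```

In PyTorch, a weight like this is typically passed as `pos_weight` to `torch.nn.BCEWithLogitsLoss`; whether the training script does exactly that is not stated here.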
|
| **Anomaly scorer** — `AttrGAE`: |
| * Same 2-layer GraphSAGE encoder. |
| * MLP decoder predicts `edge_attr` from `concat(z_src, z_dst)`. |
| * MSE loss; per-edge reconstruction error ranks anomalous edges |
| (high error = unusual attributes given structural context). |
|
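
The ranking step described above can be sketched without any model code: given the true edge attributes and the decoder's reconstruction, compute one MSE per edge and sort descending. The function names and the toy vectors below are illustrative assumptions; in the bundle the reconstruction would come from the `AttrGAE` decoder.

```python
def per_edge_mse(edge_attr, reconstructed):
    """Mean squared error between each edge's attributes and its reconstruction."""
    return [
        sum((a - r) ** 2 for a, r in zip(attr, rec)) / len(attr)
        for attr, rec in zip(edge_attr, reconstructed)
    ]


def rank_edges(errors):
    """Edge indices ordered from most to least anomalous (highest error first)."""
    return sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)


# Toy 2-dim attributes (made up); real edges carry 22 dims.
edge_attr = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
reconstructed = [[0.9, 0.1], [0.5, 0.5], [1.0, 0.0]]

errors = per_edge_mse(edge_attr, reconstructed)
ranking = rank_edges(errors)
print(errors, ranking)
```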
|
| Both models share the same node feature space (17 dims): |
| account-type one-hot · structural flags · hierarchy level · log-aggregated |
| in/out flows. |
|
|
| Edge features (22 dims): log-amount · is-round-dollar · per-level |
| round flags · confidence · business-process one-hot · day-of-year |
| sin/cos · week-of-year sin/cos · day-of-week sin/cos · is-weekend. |
|
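
To make a few of the edge features concrete, here is a minimal sketch of how the amount and calendar fields could be encoded. It is not the repo's feature builder: the function names are hypothetical, `log1p` is an assumed choice for log-amount, the round levels are the six canonical amounts named in the fraud-bias discussion further down, and treating "round" as exact equality with a level is also an assumption.

```python
import math
from datetime import date

# Canonical round-dollar levels mentioned in this card.
ROUND_LEVELS = (1_000, 5_000, 10_000, 25_000, 50_000, 100_000)


def cyclical(value, period):
    """Map a periodic value onto the unit circle so period ends stay adjacent."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def amount_and_calendar_features(amount, posting_date):
    """Encode a subset of the edge features for one journal entry."""
    d = date.fromisoformat(posting_date)
    dow = d.weekday()                      # Monday = 0 ... Sunday = 6
    dow_sin, dow_cos = cyclical(dow, 7)
    level_flags = [1.0 if amount == lvl else 0.0 for lvl in ROUND_LEVELS]
    return {
        "log_amount": math.log1p(amount),  # assumed transform
        "is_round_dollar": float(any(level_flags)),
        "round_level_flags": level_flags,
        "dow_sin": dow_sin,
        "dow_cos": dow_cos,
        "is_weekend": float(dow >= 5),
    }


f_round = amount_and_calendar_features(25000.00, "2024-08-10")
f_plain = amount_and_calendar_features(7432.89, "2024-03-15")
print(f_round["is_weekend"], f_round["is_round_dollar"])
```

The same `cyclical` helper would apply to day-of-year and week-of-year with the appropriate period.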
|
| ## Quick start |
|
|
| ```python |
| from huggingface_hub import snapshot_download |
| from scripts.ml.inference import load_bundle |
| |
| local_dir = snapshot_download(repo_id="VynFi/je-fraud-gnn") |
| bundle = load_bundle(local_dir) |
| |
| # Predict fraud probability for one or more edges |
| probs = bundle.predict_fraud( |
|     from_account=["1000", "5000"], |
|     to_account=["2000", "4000"], |
|     amount=[7432.89, 25000.00],  # second is a round amount posted on a Saturday |
|     business_process=["P2P", "O2C"], |
|     posting_date=["2024-03-15", "2024-08-10"], |
| ) |
| print(probs)  # e.g. array([0.13, 0.99]): round amount on a weekend → strong fraud signal |
| |
| # Per-edge anomaly score (high MSE = unusual attribute combination) |
| mse = bundle.anomaly_score_edges( |
|     from_account=["1000", "5000"], |
|     to_account=["2000", "4000"], |
|     amount=[7432.89, 25000.00], |
|     business_process=["P2P", "O2C"], |
|     posting_date=["2024-03-15", "2024-08-10"], |
| ) |
| print(mse) |
| ``` |
|
|
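| The `scripts/ml/inference.py` module is shipped in the |
| [engine repo](https://github.com/mivertowski/SyntheticData/tree/main/scripts/ml), |
| so the `scripts.ml.inference` import above resolves only when that repo is cloned and on the Python path (for example, when running from its root). |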
|
|
| ## Training data |
|
|
| Sourced from |
| [`VynFi/vynfi-journal-entries-1m`](https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m) |
| v5.9.0: |
|
|
| * **499** GL accounts (after deduplicating 4 conflicting `account_number` rows in the chart of accounts, COA) |
| * **61,656** Method-A edges (one edge per 2-line journal entry) |
| * **5.79 %** fraud rate (3,571 / 61,656) |
| * **6.52 %** anomaly rate |
| * Stratified 70/15/15 train/val/test split on `is_fraud` (seed = 20260509) |
| * Generated under the v5.9.0 release tag (ChaCha8 PRNG, platform-stable) |
|
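
A stratified 70/15/15 split keeps the fraud rate roughly equal across train, validation, and test by splitting each class separately. The sketch below is a stdlib-only illustration with a hypothetical `stratified_split` helper; it reuses the seed value from the list above, but it will not reproduce the published split, since the actual splitting code and PRNG are not shown in this card.

```python
import random


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=20260509):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_val = round(fractions[1] * len(idx))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])  # remainder goes to test
    return train, val, test


# Toy labels: 20 fraud edges among 200 (made up for illustration).
labels = [1] * 20 + [0] * 180
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```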
|
| ## Why does the GraphSAGE encoder add only marginal lift over LR? |
|
|
| Honest answer: the synthetic fraud bias in DataSynth v5.x writes |
| strong, *local* signals into edge attributes: |
|
|
| > `fraud_bias.weekend_bias=0.30` → 41 % of fraud edges land on Sat/Sun vs 0.5 % of non-fraud (77× lift) |
| > `fraud_bias.round_dollar_bias=0.40` → 55 % of fraud edges hit a $1K/$5K/$10K/$25K/$50K/$100K canonical level vs 0.14 % (378× lift) |
| |
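
The lift figures quoted above are the ratio of the rate among fraud edges to the rate among non-fraud edges. As a quick check, the arithmetic on the rounded percentages looks like this (the `lift` helper is just for illustration):

```python
def lift(rate_fraud_pct, rate_legit_pct):
    """Ratio of how often a flag fires on fraud vs non-fraud edges."""
    return rate_fraud_pct / rate_legit_pct


weekend_lift = lift(41.0, 0.5)        # weekend posting
round_dollar_lift = lift(55.0, 0.14)  # canonical round amount

print(f"weekend: {weekend_lift:.0f}x, round-dollar: {round_dollar_lift:.0f}x")
```

The rounded inputs give somewhat higher ratios than the quoted 77× and 378×, which would be consistent with the quoted lifts having been computed from unrounded rates, though the card does not say. Either way, both flags are orders of magnitude more common on fraud edges.
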
| A LogisticRegression with day-of-week + round-dollar features already 
| reaches **AUC 0.912**, so there is not much room left for the graph 
| encoder to add value on the supervised task. The GraphSAGE encoder 
| adds +0.13 percentage points of AUC-ROC and +0.84 percentage points of 
| AUC-PR overall; its edge shows mainly in the per-process breakdown 
| (A2R reaches 0.95 AUC). 
| |
| Where the graph contribution **does** show up: |
| |
| * **Unsupervised anomaly detection**. The attribute-reconstruction GAE |
| reaches AUC-ROC 0.654 on edge-level anomaly *with no labels at train |
| time* — the structural prior is doing the work. |
| * **Top-K anomalous accounts**. The GAE's per-node aggregated MSE |
| (mean across incident test edges) ranks accounts by structural |
| weirdness; precision@10 = 0.60 against the median anomaly-fraction |
| threshold. |
| |
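
The account-ranking idea in the second bullet can be sketched as follows: average the per-edge reconstruction error over each account's incident edges, sort accounts by that average, and measure precision@K against a set of accounts considered anomalous. The helper names and toy numbers are assumptions for illustration, as is counting both endpoints of an edge as incident; how the positive set is derived from the anomaly-fraction threshold is left to the caller.

```python
from collections import defaultdict


def node_scores(edges, edge_mse):
    """Mean reconstruction error over the edges touching each account."""
    sums, counts = defaultdict(float), defaultdict(int)
    for (src, dst), err in zip(edges, edge_mse):
        for node in (src, dst):  # assumed: both endpoints count as incident
            sums[node] += err
            counts[node] += 1
    return {node: sums[node] / counts[node] for node in sums}


def precision_at_k(scores, positives, k):
    """Fraction of the k highest-scoring accounts that are in `positives`."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(node in positives for node in top_k) / k


# Toy edges and errors (made up for illustration).
edges = [("1000", "2000"), ("1000", "3000"), ("4000", "2000"), ("4000", "5000")]
edge_mse = [0.9, 0.7, 0.1, 0.3]
positives = {"1000", "2000"}

scores = node_scores(edges, edge_mse)
p_at_2 = precision_at_k(scores, positives, k=2)
print(scores, p_at_2)
```
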
| For deployment scenarios where you have crisp labels *and* fraud 
| patterns are local to single transactions, an LR baseline may be 
| competitive. For label-free settings or graph-context fraud (multi-hop 
| laundering, ring transactions), the GNN signal is the differentiator. 
| |
| ## Limitations |
| |
| * Trained on a single 1M-JE generator run. Generalisation to other |
| v5.9.0 datasets has not been evaluated. |
| * `is_fraud` labels come from DataSynth's fraud-bias mechanism — they |
| reflect known bias signatures (weekend / round-dollar / off-hours / |
| post-close), not the full universe of real-world fraud patterns. |
| * Account vocabulary is fixed at the 499 nodes in the published COA. |
| Inference on unseen `account_number` values raises `ValueError`. |
| * Per-node anomaly AUC is close to random (0.48) — the per-edge |
| signal is the load-bearing one. For ranking accounts, use |
| precision@K instead of AUC. |
|
|
| ## Reproducibility |
|
|
| ```bash |
| git clone https://github.com/mivertowski/SyntheticData.git |
| cd SyntheticData |
| pip install -r requirements-ml.txt |
| python -m scripts.ml.build_je_pyg_dataset --output data/ml/je_pyg_v1.pt --seed 20260509 |
| python -m scripts.ml.train_je_fraud_gnn --epochs 60 |
| python -m scripts.ml.train_je_anomaly_gae --epochs 80 |
| python -m scripts.ml.package_for_hf |
| ``` |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{ivertowski2026datasynth, |
| author = {Ivertowski, Michael}, |
| title = {{DataSynth}: Reference Knowledge Graphs for Enterprise |
| Audit Analytics through Synthetic Data Generation |
| with Provable Statistical Properties}, |
| year = {2026}, |
| month = {April}, |
| howpublished = {SSRN Working Paper}, |
| url = {https://ssrn.com/abstract=6538639} |
| } |
| ``` |
|
|
| ## License |
|
|
| Apache-2.0. |
|
|
| ## Related |
|
|
| * [`VynFi/vynfi-journal-entries-1m`](https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m) — training dataset |
| * [`VynFi/accounting-network-explorer`](https://huggingface.co/spaces/VynFi/accounting-network-explorer) — interactive class-level network viewer |
| * [`VynFi/fraud-gnn-demo`](https://huggingface.co/spaces/VynFi/fraud-gnn-demo) — Gradio inference Space (companion) |
| * [Engine repo](https://github.com/mivertowski/SyntheticData) · [SSRN paper](https://ssrn.com/abstract=6538639) |
|
|