---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- vynfi
- graph-neural-network
- graphsage
- fraud-detection
- anomaly-detection
- financial-data
- synthetic-data
- pyg
- torch-geometric
pipeline_tag: graph-ml
base_model: none
datasets:
- VynFi/vynfi-journal-entries-1m
metrics:
- roc_auc
- average_precision
- f1
---
# VynFi JE Fraud GNN — GraphSAGE edge classifier + GAE node anomaly scorer
Trained on the v5.9.0 Method-A accounting network from
[`VynFi/vynfi-journal-entries-1m`](https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m).
Two complementary models in one bundle:
| Task | Model | Test AUC-ROC | Test AUC-PR | Notes |
|---|---|---|---|---|
| Edge fraud classification (supervised) | GraphSAGE → edge head | **0.9136** | 0.7949 | Beats the logistic-regression (LR) baseline by only +0.13 percentage points of AUC-ROC; LR is already strong because the weekend and round-dollar features are highly discriminative. |
| Edge anomaly scoring (unsupervised) | Attribute-reconstruction GAE | **0.6540** | 0.1434 | Pure unsupervised — no `is_fraud`/`is_anomaly` labels seen at train time. Surfaces edges whose attributes are unusual given their structural neighborhood. |
## Per-business-process breakdown (edge fraud classifier, test split)
| Process | AUC-ROC | AUC-PR | F1 | n |
|---|---|---|---|---|
| **P2P** | 0.9289 | 0.8146 | 0.8041 | 2,835 |
| **O2C** | 0.8965 | 0.7660 | 0.7423 | 3,155 |
| **R2R** | 0.9301 | 0.8113 | 0.8000 | 1,895 |
| **H2R** | 0.8859 | 0.7517 | 0.7523 | 914 |
| **A2R** | 0.9512 | 0.9273 | 0.9565 | 450 |
## Architecture
**Fraud classifier** (`EdgeFraudGNN`):
* 2-layer **GraphSAGE** encoder (mean aggregator) → 64-dim node embeddings.
* Edge head: MLP on `concat(emb_src, emb_dst, edge_attr)` → sigmoid.
* BCE loss with positive-class weight ≈ 16.3 (5.79 % fraud rate).
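
The positive-class weight quoted above follows from the class counts reported in the Training data section of this card. The arithmetic below is a minimal sketch, assuming the weight is the negative-to-positive count ratio (the convention PyTorch documents for the `pos_weight` argument of `BCEWithLogitsLoss`). Whether the training script computes it on the full set or on the training split only is not stated here; with a stratified split the two are nearly identical.

```python
# Counts taken from the "Training data" section of this card.
n_edges = 61_656
n_fraud = 3_571

fraud_rate = n_fraud / n_edges

# Assumption: weight positives by the negative/positive count ratio.
pos_weight = (n_edges - n_fraud) / n_fraud

print(f"fraud rate = {fraud_rate:.2%}")   # 5.79%
print(f"pos_weight = {pos_weight:.2f}")   # 16.27, i.e. about 16.3
```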
**Anomaly scorer** (`AttrGAE`):
* Same 2-layer GraphSAGE encoder.
* MLP decoder predicts `edge_attr` from `concat(z_src, z_dst)`.
* MSE loss; per-edge reconstruction error ranks anomalous edges
(high error = unusual attributes given structural context).
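
To make the ranking step concrete, here is a minimal pure-Python sketch of per-edge reconstruction error. It stands in for the tensor code in the engine repo and is not taken from it; the function names `per_edge_mse` and `rank_by_error` and the toy values are illustrative assumptions.

```python
def per_edge_mse(edge_attr, reconstructed):
    """Mean squared error per edge (one row per edge, one column per feature)."""
    return [
        sum((a - r) ** 2 for a, r in zip(row, rec)) / len(row)
        for row, rec in zip(edge_attr, reconstructed)
    ]


def rank_by_error(errors):
    """Edge indices sorted from most to least anomalous."""
    return sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)


# Toy data: 3 edges with 2 features each (the real model uses 22 features).
edge_attr = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
reconstructed = [[0.9, 0.1], [0.5, 0.5], [1.0, 0.0]]

errors = per_edge_mse(edge_attr, reconstructed)
ranking = rank_by_error(errors)
print(ranking)  # [2, 0, 1]: edge 2 is reconstructed worst
```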
Both models share the same node feature space (17 dims):
account-type one-hot · structural flags · hierarchy level · log-aggregated
in/out flows.
Edge features (22 dims): log-amount · is-round-dollar · per-level
round flags · confidence · business-process one-hot · day-of-year
sin/cos · week-of-year sin/cos · day-of-week sin/cos · is-weekend.
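
The sketch below illustrates how a few of the listed edge features could be derived from an amount and a posting date. It is an assumption-laden illustration, not the engine's feature builder: the use of `log1p`, the 365.25-day period, and the helper names `cyclic` and `encode_edge` are all hypothetical, and the confidence, business-process, per-level and week-of-year features are omitted.

```python
import math
from datetime import date

# Canonical round-dollar levels named in this card.
ROUND_LEVELS = (1_000, 5_000, 10_000, 25_000, 50_000, 100_000)


def cyclic(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def encode_edge(amount, posting_date):
    """Encode a subset of the edge features (hypothetical definitions)."""
    d = date.fromisoformat(posting_date)
    dow = d.weekday()  # Monday = 0 ... Sunday = 6
    return {
        "log_amount": math.log1p(amount),           # assumption: log1p
        "is_round_dollar": float(amount in ROUND_LEVELS),
        "doy": cyclic(d.timetuple().tm_yday - 1, 365.25),  # assumed period
        "dow": cyclic(dow, 7),
        "is_weekend": float(dow >= 5),
    }


features = encode_edge(25000.00, "2024-08-10")
print(features["is_round_dollar"], features["is_weekend"])  # 1.0 1.0
```

The sin/cos pairs keep neighbouring days close together across the wrap-around (Sunday to Monday, 31 December to 1 January), which a raw integer day index would not. Under these definitions the second Quick start edge is both a round amount and a Saturday posting.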
## Quick start
```python
from huggingface_hub import snapshot_download
from scripts.ml.inference import load_bundle

local_dir = snapshot_download(repo_id="VynFi/je-fraud-gnn")
bundle = load_bundle(local_dir)

# Predict fraud probability for one or more edges
probs = bundle.predict_fraud(
    from_account=["1000", "5000"],
    to_account=["2000", "4000"],
    amount=[7432.89, 25000.00],  # second is a "round" amount
    business_process=["P2P", "O2C"],
    posting_date=["2024-03-15", "2024-08-10"],
)
print(probs)  # array([0.13, 0.99]): round amount → strong fraud signal

# Per-edge anomaly score (high MSE = unusual attribute combination)
mse = bundle.anomaly_score_edges(
    from_account=["1000", "5000"],
    to_account=["2000", "4000"],
    amount=[7432.89, 25000.00],
    business_process=["P2P", "O2C"],
    posting_date=["2024-03-15", "2024-08-10"],
)
print(mse)
```
The `scripts/ml/inference.py` module is shipped in the
[engine repo](https://github.com/mivertowski/SyntheticData/tree/main/scripts/ml).
## Training data
Sourced from
[`VynFi/vynfi-journal-entries-1m`](https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m)
v5.9.0:
* **499** GL accounts (after dedupe of 4 conflicting `account_number` rows in the COA)
* **61,656** Method-A edges (one edge per 2-line journal entry)
* **5.79 %** fraud rate (3,571 / 61,656)
* **6.52 %** anomaly rate
* Stratified 70/15/15 train/val/test split on `is_fraud` (seed = 20260509)
* Generated under the v5.9.0 release tag (ChaCha8 PRNG, platform-stable)
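
For readers unfamiliar with stratified splitting, the following is a minimal sketch of a 70/15/15 split that preserves the fraud rate in each part. It uses Python's standard `random` module, so it will not reproduce the published split even with the same seed; the function name `stratified_split` and the toy labels are illustrative assumptions, and the engine repo may use a different routine.

```python
import random


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=20260509):
    """Return (train, val, test) index lists with per-class proportions kept."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to test
    return train, val, test


# Toy labels: 100 edges, 20 of them fraudulent.
labels = [1] * 20 + [0] * 80
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 70 15 15
```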
## Why does the GraphSAGE encoder add only marginal lift over LR?
Honest answer: the synthetic fraud bias in DataSynth v5.x writes
strong, *local* signals into edge attributes:
> `fraud_bias.weekend_bias=0.30` → 41 % of fraud edges land on Sat/Sun vs 0.5 % of non-fraud edges (77× lift)
>
> `fraud_bias.round_dollar_bias=0.40` → 55 % of fraud edges hit a $1K/$5K/$10K/$25K/$50K/$100K canonical level vs 0.14 % of non-fraud edges (378× lift)
A logistic regression with day-of-week and round-dollar features already
reaches **AUC 0.912**, so there is little room left for the graph
encoder to add value on the supervised task. Overall, the GraphSAGE
encoder adds +0.13 percentage points of AUC-ROC and +0.84 percentage
points of AUC-PR; the per-process breakdown is where it shines (A2R
reaches 0.95 AUC).
Where the graph contribution **does** show up:
* **Unsupervised anomaly detection**. The attribute-reconstruction GAE
reaches AUC-ROC 0.654 on edge-level anomaly *with no labels at train
time* — the structural prior is doing the work.
* **Top-K anomalous accounts**. The GAE's per-node aggregated MSE
(mean across incident test edges) ranks accounts by structural
weirdness; precision@10 = 0.60 against the median anomaly-fraction
threshold.
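
The account ranking described above can be sketched as follows: average the per-edge errors over each account's incident edges, then measure precision@K against a set of accounts treated as positive. This is an illustration, not the evaluation script; `node_scores`, `precision_at_k`, the toy errors, and the `positives` set are hypothetical. One reading of the "median anomaly-fraction threshold" is that an account counts as positive when its share of anomalous incident edges exceeds the median share, and that reading is an assumption.

```python
from collections import defaultdict


def node_scores(edges, edge_errors):
    """Mean reconstruction error over the edges incident to each account."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (src, dst), err in zip(edges, edge_errors):
        for node in {src, dst}:  # a self-loop counts once
            totals[node] += err
            counts[node] += 1
    return {node: totals[node] / counts[node] for node in totals}


def precision_at_k(scores, positives, k):
    """Fraction of the k highest-scoring accounts that are in `positives`."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(node in positives for node in top_k) / k


edges = [("1000", "2000"), ("1000", "3000"), ("4000", "2000")]
edge_errors = [0.9, 0.7, 0.1]
scores = node_scores(edges, edge_errors)

# Hypothetical ground truth: accounts 1000 and 2000 are the positives.
p_at_2 = precision_at_k(scores, {"1000", "2000"}, k=2)
print(p_at_2)  # 0.5: the top two are 1000 and 3000, one of which is positive
```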
For deployment scenarios where you have crisp labels *and* fraud
patterns are local to single transactions, an LR baseline may be
competitive. For labelless or graph-context fraud (multi-hop
laundering, ring transactions), the GNN signal is the differentiator.
## Limitations
* Trained on a single 1M-JE generator run. Generalisation to other
v5.9.0 datasets has not been evaluated.
* `is_fraud` labels come from DataSynth's fraud-bias mechanism — they
reflect known bias signatures (weekend / round-dollar / off-hours /
post-close), not the full universe of real-world fraud patterns.
* Account vocabulary is fixed at the 499 nodes in the published COA.
Inference on unseen `account_number` values raises `ValueError`.
* Per-node anomaly AUC is close to random (0.48) — the per-edge
signal is the load-bearing one. For ranking accounts, use
precision@K instead of AUC.
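
Because unseen account numbers raise `ValueError`, callers may want to screen inputs before scoring. The helper below is a hypothetical pre-check, not part of the shipped inference module; how the bundle exposes its account vocabulary is not documented in this card, so `known_accounts` is supplied by hand here.

```python
def partition_by_vocabulary(edges, known_accounts):
    """Split (from_account, to_account) pairs into scorable and rejected lists."""
    scorable, rejected = [], []
    for src, dst in edges:
        if src in known_accounts and dst in known_accounts:
            scorable.append((src, dst))
        else:
            rejected.append((src, dst))
    return scorable, rejected


# Hypothetical vocabulary; the published COA has 499 account numbers.
known_accounts = {"1000", "2000", "4000", "5000"}
edges = [("1000", "2000"), ("5000", "9999")]

scorable, rejected = partition_by_vocabulary(edges, known_accounts)
print(scorable)  # [('1000', '2000')]
print(rejected)  # [('5000', '9999')]
```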
## Reproducibility
```bash
git clone https://github.com/mivertowski/SyntheticData.git
cd SyntheticData
pip install -r requirements-ml.txt
python -m scripts.ml.build_je_pyg_dataset --output data/ml/je_pyg_v1.pt --seed 20260509
python -m scripts.ml.train_je_fraud_gnn --epochs 60
python -m scripts.ml.train_je_anomaly_gae --epochs 80
python -m scripts.ml.package_for_hf
```
## Citation
```bibtex
@misc{ivertowski2026datasynth,
author = {Ivertowski, Michael},
title = {{DataSynth}: Reference Knowledge Graphs for Enterprise
Audit Analytics through Synthetic Data Generation
with Provable Statistical Properties},
year = {2026},
month = {April},
howpublished = {SSRN Working Paper},
url = {https://ssrn.com/abstract=6538639}
}
```
## License
Apache-2.0.
## Related
* [`VynFi/vynfi-journal-entries-1m`](https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m) — training dataset
* [`VynFi/accounting-network-explorer`](https://huggingface.co/spaces/VynFi/accounting-network-explorer) — interactive class-level network viewer
* [`VynFi/fraud-gnn-demo`](https://huggingface.co/spaces/VynFi/fraud-gnn-demo) — Gradio inference Space (companion)
* [Engine repo](https://github.com/mivertowski/SyntheticData) · [SSRN paper](https://ssrn.com/abstract=6538639)