	---
	license: apache-2.0
	language:
	- en
	library_name: pytorch
	tags:
	- vynfi
	- graph-neural-network
	- graphsage
	- fraud-detection
	- anomaly-detection
	- financial-data
	- synthetic-data
	- pyg
	- torch-geometric
	pipeline_tag: graph-ml
	base_model: none
	datasets:
	- VynFi/vynfi-journal-entries-1m
	metrics:
	- roc_auc
	- average_precision
	- f1
	---

	# VynFi JE Fraud GNN — GraphSAGE edge classifier + GAE node anomaly scorer

	Trained on the v5.9.0 Method-A accounting network from
	[`VynFi/vynfi-journal-entries-1m`](https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m).
	Two complementary models in one bundle:

	| Task | Model | Test AUC-ROC | Test AUC-PR | Notes |
	|---|---|---|---|---|
	| Edge fraud classification (supervised) | GraphSAGE → edge head | 0.9136 | 0.7949 | Beats the LR baseline by 0.13 percentage points of AUC-ROC. The LR baseline is already strong because the weekend and round-dollar features are highly discriminative. |
	| Edge anomaly scoring (unsupervised) | Attribute-reconstruction GAE | 0.6540 | 0.1434 | Fully unsupervised: no `is_fraud`/`is_anomaly` labels are seen at train time. Surfaces edges whose attributes are unusual given their structural neighborhood. |

	## Per-business-process breakdown (edge fraud classifier, test split)

	| Process | AUC-ROC | AUC-PR | F1 | n |
	|---|---|---|---|---|
	| P2P | 0.9289 | 0.8146 | 0.8041 | 2,835 |
	| O2C | 0.8965 | 0.7660 | 0.7423 | 3,155 |
	| R2R | 0.9301 | 0.8113 | 0.8000 | 1,895 |
	| H2R | 0.8859 | 0.7517 | 0.7523 | 914 |
	| A2R | 0.9512 | 0.9273 | 0.9565 | 450 |

	## Architecture

	Fraud classifier — `EdgeFraudGNN`:
	* 2-layer GraphSAGE encoder (mean aggregator) → 64-dim node embeddings.
	* Edge head: MLP on `concat(emb_src, emb_dst, edge_attr)` → sigmoid.
	* BCE loss with positive-class weight ≈ 16.3 (5.79 % fraud rate).
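
The positive-class weight quoted above is consistent with the ratio of non-fraud to fraud edges. The following is a minimal sketch, assuming the weight is the negative/positive count ratio (the card does not state how it was derived) and using the edge counts reported in the Training data section. `weighted_bce` is a hypothetical helper written for illustration, not part of the shipped code.

```python
import math

N_EDGES = 61_656   # Method-A edges reported in this card
N_FRAUD = 3_571    # fraud edges reported in this card

# Assumption: pos_weight = (# negatives) / (# positives).
pos_weight = (N_EDGES - N_FRAUD) / N_FRAUD


def weighted_bce(prob: float, label: int, pos_weight: float) -> float:
    """Binary cross-entropy for one edge, with the positive term up-weighted."""
    eps = 1e-12  # guards against log(0)
    prob = min(max(prob, eps), 1.0 - eps)
    return -(pos_weight * label * math.log(prob)
             + (1 - label) * math.log(1.0 - prob))


print(f"pos_weight = {pos_weight:.2f}")

# A fraud edge scored at 0.1 is penalised pos_weight times more heavily
# than a non-fraud edge scored at 0.9 (an equally wrong prediction).
ratio = weighted_bce(0.1, 1, pos_weight) / weighted_bce(0.9, 0, pos_weight)
print(f"penalty ratio = {ratio:.2f}")
```

Up-weighting the positive term this way keeps the rare fraud class from being swamped by the roughly 94 % of edges that are non-fraud.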

	Anomaly scorer — `AttrGAE`:
	* Same 2-layer GraphSAGE encoder.
	* MLP decoder predicts `edge_attr` from `concat(z_src, z_dst)`.
	* MSE loss; per-edge reconstruction error ranks anomalous edges
	(high error = unusual attributes given structural context).
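
The ranking step described above can be sketched in plain Python. This is an illustration only: the toy attribute vectors and the "decoder output" are made up, and `per_edge_mse` and `rank_edges` are hypothetical names, not functions from the shipped code.

```python
from typing import List, Sequence


def per_edge_mse(true_attr: Sequence[Sequence[float]],
                 recon_attr: Sequence[Sequence[float]]) -> List[float]:
    """Mean squared reconstruction error, one value per edge."""
    errors = []
    for true_row, recon_row in zip(true_attr, recon_attr):
        squared = [(t - r) ** 2 for t, r in zip(true_row, recon_row)]
        errors.append(sum(squared) / len(squared))
    return errors


def rank_edges(errors: Sequence[float]) -> List[int]:
    """Edge indices sorted from most to least anomalous (highest error first)."""
    return sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)


# Toy data: 3 edges with 2 attributes each (the real model uses 22).
true_attr = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
recon_attr = [[0.9, 0.1], [0.5, 0.5], [1.0, 0.0]]  # hypothetical decoder output

errors = per_edge_mse(true_attr, recon_attr)
print(rank_edges(errors))  # [2, 0, 1]: edge 2 is reconstructed worst
```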

	Both models share the same node feature space (17 dims):
	account-type one-hot · structural flags · hierarchy level · log-aggregated
	in/out flows.

	Edge features (22 dims): log-amount · is-round-dollar · per-level
	round flags · confidence · business-process one-hot · day-of-year
	sin/cos · week-of-year sin/cos · day-of-week sin/cos · is-weekend.
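
To make the temporal and amount features concrete, here is a hedged sketch of how a few of them could be derived from a posting date and an amount. The exact encodings used in training are not given in this card, so the `log1p` scaling, the 365.25-day period, and exact-match round-level detection are all assumptions. The function name is hypothetical, and week-of-year would follow the same sin/cos pattern.

```python
import math
from datetime import date

# Canonical round-dollar levels named in this card.
ROUND_LEVELS = (1_000, 5_000, 10_000, 25_000, 50_000, 100_000)


def cyclical(value: float, period: float) -> tuple:
    """Map a periodic quantity onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def temporal_and_amount_features(amount: float, posting_date: str) -> dict:
    """Build a subset of the edge features for one journal-entry edge."""
    d = date.fromisoformat(posting_date)
    dow = d.weekday()  # Monday = 0 ... Sunday = 6
    level_flags = [float(amount == level) for level in ROUND_LEVELS]
    return {
        "log_amount": math.log1p(amount),           # assumption: log1p scaling
        "is_round_dollar": float(any(level_flags)),
        "round_level_flags": level_flags,           # one flag per canonical level
        "day_of_week_sin_cos": cyclical(dow, 7),
        "day_of_year_sin_cos": cyclical(d.timetuple().tm_yday, 365.25),
        "is_weekend": float(dow >= 5),
    }


feats = temporal_and_amount_features(25000.00, "2024-08-10")
print(feats["is_weekend"], feats["is_round_dollar"])
```

The sin/cos pairs keep neighbouring days close together in feature space, so Sunday and Monday are not treated as six units apart.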

	## Quick start

	```python
	from huggingface_hub import snapshot_download
	from scripts.ml.inference import load_bundle

	local_dir = snapshot_download(repo_id="VynFi/je-fraud-gnn")
	bundle = load_bundle(local_dir)

	# Predict fraud probability for one or more edges
	probs = bundle.predict_fraud(
	    from_account=["1000", "5000"],
	    to_account=["2000", "4000"],
	    amount=[7432.89, 25000.00],  # second is a "round" amount
	    business_process=["P2P", "O2C"],
	    posting_date=["2024-03-15", "2024-08-10"],  # second date is a Saturday
	)
	print(probs)  # e.g. array([0.13, 0.99]): round amount on a weekend → strong fraud signal

	# Per-edge anomaly score (high MSE = unusual attribute combination)
	mse = bundle.anomaly_score_edges(
	    from_account=["1000", "5000"],
	    to_account=["2000", "4000"],
	    amount=[7432.89, 25000.00],
	    business_process=["P2P", "O2C"],
	    posting_date=["2024-03-15", "2024-08-10"],
	)
	print(mse)
	```

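	The `scripts/ml/inference.py` module is shipped in the
	[engine repo](https://github.com/mivertowski/SyntheticData/tree/main/scripts/ml),
	so the `scripts.ml.inference` import above assumes the snippet runs from the
	root of a checkout of that repository.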

	## Training data

	Sourced from
	[`VynFi/vynfi-journal-entries-1m`](https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m)
	v5.9.0:

	* 499 GL accounts (after deduplicating 4 conflicting `account_number` rows in the COA)
	* 61,656 Method-A edges (one edge per 2-line journal entry)
	* 5.79 % fraud rate (3,571 / 61,656)
	* 6.52 % anomaly rate
	* Stratified 70/15/15 train/val/test split on `is_fraud` (seed = 20260509)
	* Generated under the v5.9.0 release tag (ChaCha8 PRNG, platform-stable)
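
For readers unfamiliar with stratified splitting, the sketch below shows the idea in plain Python: indices are shuffled and cut per class, so each split keeps roughly the same fraud rate. It is an illustration under stated assumptions, not the repository's implementation. `stratified_split` is a hypothetical helper, and reusing the card's seed here will not reproduce the published split.

```python
import random
from typing import Dict, List, Sequence


def stratified_split(labels: Sequence[int],
                     fractions=(0.70, 0.15, 0.15),
                     seed: int = 20260509) -> Dict[str, List[int]]:
    """Split indices into train/val/test, preserving the label ratio in each."""
    rng = random.Random(seed)
    splits: Dict[str, List[int]] = {"train": [], "val": [], "test": []}
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        splits["train"] += idx[:n_train]
        splits["val"] += idx[n_train:n_train + n_val]
        splits["test"] += idx[n_train + n_val:]  # remainder goes to test
    return splits


# Toy labels with roughly the card's fraud rate (about 5.8 %).
labels = [1] * 58 + [0] * 942
splits = stratified_split(labels)
for name, idx in splits.items():
    rate = sum(labels[i] for i in idx) / len(idx)
    print(name, len(idx), f"{rate:.3f}")
```

Stratifying on `is_fraud` matters at this class balance: a purely random 15 % test split could otherwise drift noticeably from the overall fraud rate.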

	## Why does the GraphSAGE encoder add only marginal lift over LR?

	Honest answer: the synthetic fraud bias in DataSynth v5.x writes
	strong, local signals into edge attributes:

	> `fraud_bias.weekend_bias=0.30` → 41 % of fraud edges land on Sat/Sun vs 0.5 % of non-fraud (77× lift)
	> `fraud_bias.round_dollar_bias=0.40` → 55 % of fraud edges hit a $1K/$5K/$10K/$25K/$50K/$100K canonical level vs 0.14 % (378× lift)

	A LogisticRegression with day-of-week + round-dollar features already
	reaches AUC-ROC 0.912, so there is not much room left for the graph
	encoder to add value on the supervised task. The GraphSAGE encoder
	adds 0.13 percentage points of AUC-ROC and 0.84 percentage points of
	AUC-PR; the per-process breakdown is where it shines (A2R stretches
	to 0.95 AUC-ROC).

	Where the graph contribution does show up:

	* Unsupervised anomaly detection. The attribute-reconstruction GAE
	reaches AUC-ROC 0.654 on edge-level anomaly *with no labels at train
	time* — the structural prior is doing the work.
	* Top-K anomalous accounts. The GAE's per-node aggregated MSE
	(mean across incident test edges) ranks accounts by structural
	weirdness; precision@10 = 0.60 against the median anomaly-fraction
	threshold.
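
The account-ranking metric above can be read as follows; this is a sketch under one interpretation, not the evaluation script. It assumes an account counts as a positive when its fraction of anomalous incident edges is strictly above the median across accounts. The toy edges, `node_scores`, and `precision_at_k` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Dict, List, Tuple

# Each toy edge: (from_account, to_account, reconstruction_mse, is_anomaly)
Edge = Tuple[str, str, float, int]


def node_scores(edges: List[Edge]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Per-account mean MSE and anomaly fraction over incident edges."""
    mse_by_node, flag_by_node = defaultdict(list), defaultdict(list)
    for src, dst, mse, is_anomaly in edges:
        for node in (src, dst):  # an edge is incident to both endpoints
            mse_by_node[node].append(mse)
            flag_by_node[node].append(is_anomaly)
    score = {n: mean(v) for n, v in mse_by_node.items()}
    frac = {n: mean(v) for n, v in flag_by_node.items()}
    return score, frac


def precision_at_k(score: Dict[str, float], frac: Dict[str, float], k: int) -> float:
    """Share of the k highest-scoring accounts whose anomaly fraction exceeds the median."""
    threshold = median(frac.values())
    top_k = sorted(score, key=score.get, reverse=True)[:k]
    return sum(frac[n] > threshold for n in top_k) / k


edges = [
    ("1000", "2000", 0.90, 1),
    ("1000", "3000", 0.80, 1),
    ("2000", "3000", 0.10, 0),
    ("3000", "4000", 0.05, 0),
    ("4000", "5000", 0.02, 0),
]
score, frac = node_scores(edges)
print(precision_at_k(score, frac, k=2))
```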

	For deployment scenarios where you have crisp labels and fraud
	patterns are local to single transactions, an LR baseline may be
	competitive. For labelless or graph-context fraud (multi-hop
	laundering, ring transactions), the GNN signal is the differentiator.

	## Limitations

	* Trained on a single 1M-JE generator run. Generalisation to other
	v5.9.0 datasets has not been evaluated.
	* `is_fraud` labels come from DataSynth's fraud-bias mechanism — they
	reflect known bias signatures (weekend / round-dollar / off-hours /
	post-close), not the full universe of real-world fraud patterns.
	* Account vocabulary is fixed at the 499 nodes in the published COA.
	Inference on unseen `account_number` values raises `ValueError`.
	* Per-node anomaly AUC is close to random (0.48) — the per-edge
	signal is the load-bearing one. For ranking accounts, use
	precision@K instead of AUC.

	## Reproducibility

	```bash
	git clone https://github.com/mivertowski/SyntheticData.git
	cd SyntheticData
	pip install -r requirements-ml.txt
	python -m scripts.ml.build_je_pyg_dataset --output data/ml/je_pyg_v1.pt --seed 20260509
	python -m scripts.ml.train_je_fraud_gnn --epochs 60
	python -m scripts.ml.train_je_anomaly_gae --epochs 80
	python -m scripts.ml.package_for_hf
	```

	## Citation

	```bibtex
	@misc{ivertowski2026datasynth,
	author = {Ivertowski, Michael},
	title = {{DataSynth}: Reference Knowledge Graphs for Enterprise
	Audit Analytics through Synthetic Data Generation
	with Provable Statistical Properties},
	year = {2026},
	month = {April},
	howpublished = {SSRN Working Paper},
	url = {https://ssrn.com/abstract=6538639}
	}
	```

	## License

	Apache-2.0.

	## Related

	* [`VynFi/vynfi-journal-entries-1m`](https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m) — training dataset
	* [`VynFi/accounting-network-explorer`](https://huggingface.co/spaces/VynFi/accounting-network-explorer) — interactive class-level network viewer
	* [`VynFi/fraud-gnn-demo`](https://huggingface.co/spaces/VynFi/fraud-gnn-demo) — Gradio inference Space (companion)
	* [Engine repo](https://github.com/mivertowski/SyntheticData) · [SSRN paper](https://ssrn.com/abstract=6538639)