---
license: apache-2.0
language:
  - en
library_name: pytorch
tags:
  - vynfi
  - graph-neural-network
  - graphsage
  - fraud-detection
  - anomaly-detection
  - financial-data
  - synthetic-data
  - pyg
  - torch-geometric
pipeline_tag: graph-ml
base_model: none
datasets:
  - VynFi/vynfi-journal-entries-1m
metrics:
  - roc_auc
  - average_precision
  - f1
---

# VynFi JE Fraud GNN: GraphSAGE edge classifier + GAE edge anomaly scorer

Trained on the v5.9.0 Method-A accounting network from
[`VynFi/vynfi-journal-entries-1m`](https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m).
Two complementary models in one bundle:

| Task | Model | Test AUC-ROC | Test AUC-PR | Notes |
|---|---|---|---|---|
| Edge fraud classification (supervised) | GraphSAGE → edge head | **0.9136** | 0.7949 | Beats the LR baseline by only 0.13 percentage points of AUC-ROC; LR is already strong because the weekend and round-dollar features are highly discriminative. |
| Edge anomaly scoring (unsupervised) | Attribute-reconstruction GAE | **0.6540** | 0.1434 | Pure unsupervised — no `is_fraud`/`is_anomaly` labels seen at train time. Surfaces edges whose attributes are unusual given their structural neighborhood. |

## Per-business-process breakdown (edge fraud classifier, test split)

| Process | AUC-ROC | AUC-PR | F1 | n |
|---|---|---|---|---|
| **P2P** | 0.9289 | 0.8146 | 0.8041 | 2,835 |
| **O2C** | 0.8965 | 0.7660 | 0.7423 | 3,155 |
| **R2R** | 0.9301 | 0.8113 | 0.8000 | 1,895 |
| **H2R** | 0.8859 | 0.7517 | 0.7523 | 914 |
| **A2R** | 0.9512 | 0.9273 | 0.9565 | 450 |
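
A breakdown of this kind can be computed by grouping test edges by business process and scoring each group on its own. The sketch below is illustrative only: `roc_auc` and `auc_by_process` are hypothetical helpers, not part of the shipped code, and the data is a toy example. The pairwise AUC is quadratic in the number of edges, so a library routine such as scikit-learn's `roc_auc_score` is the practical choice at scale.

```python
from collections import defaultdict

def roc_auc(labels, scores):
    """AUC-ROC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_process(processes, labels, scores):
    """Group edges by business process and compute AUC-ROC per group."""
    groups = defaultdict(lambda: ([], []))
    for proc, y, s in zip(processes, labels, scores):
        groups[proc][0].append(y)
        groups[proc][1].append(s)
    return {proc: roc_auc(ys, ss) for proc, (ys, ss) in groups.items()}

# Toy data, not taken from the model's test split.
processes = ["P2P", "P2P", "P2P", "P2P", "O2C", "O2C"]
labels = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]

per_process = auc_by_process(processes, labels, scores)
print(per_process)
```

Small groups such as A2R (n = 450) contain few positives, so their per-process metrics carry more sampling noise than those of the larger processes.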

## Architecture

**Fraud classifier** — `EdgeFraudGNN`:
* 2-layer **GraphSAGE** encoder (mean aggregator) → 64-dim node embeddings.
* Edge head: MLP on `concat(emb_src, emb_dst, edge_attr)` → sigmoid.
* BCE loss with positive-class weight ≈ 16.3 (5.79 % fraud rate).
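
The positive-class weight follows from the class balance: it is the number of non-fraud edges per fraud edge. The sketch below recomputes it from the dataset-level counts reported in this card and shows how the weight enters a binary cross-entropy term. It is a plain-Python illustration under the assumption that the weight is negatives divided by positives; `weighted_bce` is a hypothetical helper, and in PyTorch the same weighting is available through the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`.

```python
import math

n_edges = 61_656   # Method-A edges reported in this card
n_fraud = 3_571    # fraud edges reported in this card

fraud_rate = n_fraud / n_edges
pos_weight = (n_edges - n_fraud) / n_fraud  # negatives per positive

def weighted_bce(logit, label, pos_weight):
    """Binary cross-entropy on a logit, with the positive term scaled by pos_weight."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * label * math.log(p) + (1 - label) * math.log(1.0 - p))

print(f"fraud rate = {fraud_rate:.4f}")
print(f"pos_weight = {pos_weight:.2f}")

# At the same logit, a missed fraud edge costs pos_weight times more
# than a false alarm on a non-fraud edge.
print(weighted_bce(0.0, 1, pos_weight), weighted_bce(0.0, 0, pos_weight))
```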

**Anomaly scorer** — `AttrGAE`:
* Same 2-layer GraphSAGE encoder.
* MLP decoder predicts `edge_attr` from `concat(z_src, z_dst)`.
* MSE loss; per-edge reconstruction error ranks anomalous edges
  (high error = unusual attributes given structural context).
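
The ranking step can be sketched as follows, assuming the decoder output is already available as a list of reconstructed attribute vectors. `per_edge_mse` is a hypothetical helper and the numbers are toy values with 3 attribute dimensions rather than the real 22.

```python
def per_edge_mse(edge_attr, reconstructed):
    """Mean squared error per edge, averaged over the attribute dimensions."""
    return [
        sum((a - r) ** 2 for a, r in zip(attrs, recon)) / len(attrs)
        for attrs, recon in zip(edge_attr, reconstructed)
    ]

# Toy attributes for three edges and their (made-up) reconstructions.
edge_attr     = [[1.0, 0.0, 0.5], [0.2, 1.0, 0.0], [0.9, 0.1, 0.4]]
reconstructed = [[0.9, 0.1, 0.5], [0.9, 0.2, 0.6], [0.9, 0.1, 0.4]]

mse = per_edge_mse(edge_attr, reconstructed)

# Edge indices ordered from most to least anomalous.
ranking = sorted(range(len(mse)), key=lambda i: mse[i], reverse=True)
print(ranking)
```

Only the ordering matters for ranking metrics such as AUC-ROC, so the scores do not need to be calibrated or thresholded.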

Both models share the same node feature space (17 dims):
account-type one-hot · structural flags · hierarchy level · log-aggregated
in/out flows.

Edge features (22 dims): log-amount · is-round-dollar · per-level
round flags · confidence · business-process one-hot · day-of-year
sin/cos · week-of-year sin/cos · day-of-week sin/cos · is-weekend.
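
As an illustration of the date and amount features, the sketch below builds the cyclical encodings, the weekend flag, and the round-dollar flags for a single edge. It is not the shipped feature builder: `edge_time_amount_features` and `cyclical` are hypothetical names, `log1p` is an assumed choice for the log-amount, the periods (366, 53, 7) are assumptions, and a round amount is assumed to mean exact equality with one of the canonical levels quoted in this card. The confidence and business-process features are omitted.

```python
import math
from datetime import date

# Canonical round-dollar levels quoted in this card.
ROUND_LEVELS = (1_000, 5_000, 10_000, 25_000, 50_000, 100_000)

def cyclical(value, period):
    """Encode a periodic value as (sin, cos) so the wrap-around points stay adjacent."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def edge_time_amount_features(amount, posting_date):
    d = date.fromisoformat(posting_date)
    dow = d.weekday()  # Monday = 0 ... Sunday = 6
    round_flags = [float(amount == level) for level in ROUND_LEVELS]
    return {
        "log_amount": math.log1p(amount),  # assumption: log(1 + amount)
        "is_round_dollar": float(any(round_flags)),
        "round_flags": round_flags,
        "day_of_year": cyclical(d.timetuple().tm_yday - 1, 366),
        "week_of_year": cyclical(d.isocalendar()[1] - 1, 53),
        "day_of_week": cyclical(dow, 7),
        "is_weekend": float(dow >= 5),
    }

weekday_edge = edge_time_amount_features(7432.89, "2024-03-15")   # a Friday
weekend_edge = edge_time_amount_features(25000.00, "2024-08-10")  # a Saturday
print(weekday_edge["is_weekend"], weekday_edge["is_round_dollar"])
print(weekend_edge["is_weekend"], weekend_edge["is_round_dollar"])
```

The sin/cos pairs keep December 31 close to January 1 and Sunday close to Monday, which a raw integer encoding would not.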

## Quick start

```python
from huggingface_hub import snapshot_download
from scripts.ml.inference import load_bundle

local_dir = snapshot_download(repo_id="VynFi/je-fraud-gnn")
bundle = load_bundle(local_dir)

# Predict fraud probability for one or more edges
probs = bundle.predict_fraud(
    from_account=["1000", "5000"],
    to_account=["2000", "4000"],
    amount=[7432.89, 25000.00],            # second is a "round" amount
    business_process=["P2P", "O2C"],
    posting_date=["2024-03-15", "2024-08-10"],
)
print(probs)  # e.g. array([0.13, 0.99]); the second edge is a round amount posted on a Saturday, hence the high score

# Per-edge anomaly score (high MSE = unusual attribute combination)
mse = bundle.anomaly_score_edges(
    from_account=["1000", "5000"],
    to_account=["2000", "4000"],
    amount=[7432.89, 25000.00],
    business_process=["P2P", "O2C"],
    posting_date=["2024-03-15", "2024-08-10"],
)
print(mse)
```

The `scripts/ml/inference.py` module ships in the
[engine repo](https://github.com/mivertowski/SyntheticData/tree/main/scripts/ml),
so run the snippet from a clone of that repo (or with the repo root on `PYTHONPATH`).

## Training data

Sourced from
[`VynFi/vynfi-journal-entries-1m`](https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m)
v5.9.0:

* **499** GL accounts (after dedupe of 4 conflicting `account_number` rows in the COA)
* **61,656** Method-A edges (one edge per 2-line journal entry)
* **5.79 %** fraud rate (3,571 / 61,656)
* **6.52 %** anomaly rate
* Stratified 70/15/15 train/val/test split on `is_fraud` (seed = 20260509)
* Generated under the v5.9.0 release tag (ChaCha8 PRNG, platform-stable)
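
A stratified 70/15/15 split keeps the fraud rate nearly identical across the three partitions. The sketch below shows the idea with the standard-library PRNG. It is an assumption-laden illustration, not the repo's split code: `stratified_split` is a hypothetical helper, and although it reuses the seed value from this card, it will not reproduce the published partition.

```python
import random

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=20260509):
    """Shuffle each class separately, then cut it into train/val/test slices."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_val = round(fractions[1] * len(idx))
        splits["train"] += idx[:n_train]
        splits["val"] += idx[n_train:n_train + n_val]
        splits["test"] += idx[n_train + n_val:]  # remainder goes to test
    return splits

# Toy labels with the card's class balance: 3,571 fraud out of 61,656 edges.
labels = [1] * 3_571 + [0] * (61_656 - 3_571)
splits = stratified_split(labels)

for name, idx in splits.items():
    rate = sum(labels[i] for i in idx) / len(idx)
    print(name, len(idx), f"{rate:.4f}")
```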

## Why does the GraphSAGE encoder add only marginal lift over LR?

Honest answer: the synthetic fraud bias in DataSynth v5.x writes
strong, *local* signals into edge attributes:

> `fraud_bias.weekend_bias=0.30` → 41 % of fraud edges land on Sat/Sun vs 0.5 % of non-fraud (77× lift)
>
> `fraud_bias.round_dollar_bias=0.40` → 55 % of fraud edges hit a $1K/$5K/$10K/$25K/$50K/$100K canonical level vs 0.14 % (378× lift)

A logistic-regression (LR) baseline with day-of-week and round-dollar
features already reaches **AUC 0.912**, so there is little room left
for the graph encoder to add value on the supervised task. The
GraphSAGE encoder adds 0.13 percentage points of AUC-ROC and 0.84
percentage points of AUC-PR overall; its strongest results are in the
per-process breakdown (A2R reaches 0.95 AUC).

Where the graph contribution **does** show up:

* **Unsupervised anomaly detection**.  The attribute-reconstruction GAE
  reaches AUC-ROC 0.654 on edge-level anomaly *with no labels at train
  time* — the structural prior is doing the work.
* **Top-K anomalous accounts**.  The GAE's per-node aggregated MSE
  (mean across incident test edges) ranks accounts by structural
  weirdness; precision@10 = 0.60 against the median anomaly-fraction
  threshold.
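
The account-ranking step can be sketched as two small functions: average the per-edge error over each account's incident edges, then measure how many of the top-K accounts are truly anomalous. Both helpers are hypothetical and the data is made up. The `relevant` set stands in for the accounts that pass the anomaly-fraction threshold, whose exact construction is not shown here.

```python
from collections import defaultdict

def node_scores(src, dst, edge_mse):
    """Mean reconstruction error over all edges touching each account."""
    totals, counts = defaultdict(float), defaultdict(int)
    for s, d, err in zip(src, dst, edge_mse):
        for node in (s, d):
            totals[node] += err
            counts[node] += 1
    return {node: totals[node] / counts[node] for node in totals}

def precision_at_k(scores, relevant, k):
    """Fraction of the k highest-scoring nodes that are in the relevant set."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(node in relevant for node in top_k) / k

# Toy edges; account numbers and errors are made up.
src = ["1000", "1000", "5000", "2000"]
dst = ["2000", "4000", "4000", "6000"]
edge_mse = [0.10, 0.90, 0.80, 0.05]

scores = node_scores(src, dst, edge_mse)
relevant = {"4000", "5000"}  # assumed ground-truth anomalous accounts
p_at_2 = precision_at_k(scores, relevant, k=2)
print(p_at_2)
```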

For deployment scenarios where you have crisp labels *and* fraud
patterns are local to single transactions, an LR baseline may be
competitive. For label-free settings or graph-context fraud (multi-hop
laundering, ring transactions), the graph signal is expected to be the
differentiator.

## Limitations

* Trained on a single 1M-JE generator run.  Generalisation to other
  v5.9.0 datasets has not been evaluated.
* `is_fraud` labels come from DataSynth's fraud-bias mechanism — they
  reflect known bias signatures (weekend / round-dollar / off-hours /
  post-close), not the full universe of real-world fraud patterns.
* Account vocabulary is fixed at the 499 nodes in the published COA.
  Inference on unseen `account_number` values raises `ValueError`.
* Per-node anomaly AUC is close to random (0.48) — the per-edge
  signal is the load-bearing one.  For ranking accounts, use
  precision@K instead of AUC.
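
Because unseen accounts raise `ValueError`, a batch can be screened against the known vocabulary before it is scored. The sketch below assumes the vocabulary is available as a plain set of account numbers; how to obtain it from the bundle is not covered here, and `partition_by_vocabulary` is a hypothetical helper rather than part of the inference module.

```python
def partition_by_vocabulary(from_account, to_account, known_accounts):
    """Split edge indices into those with both endpoints known and the rest."""
    scorable, rejected = [], []
    for i, (src, dst) in enumerate(zip(from_account, to_account)):
        if src in known_accounts and dst in known_accounts:
            scorable.append(i)
        else:
            rejected.append(i)
    return scorable, rejected

known_accounts = {"1000", "2000", "4000", "5000"}  # stand-in for the 499-account COA
from_account = ["1000", "5000", "9999"]
to_account = ["2000", "4000", "2000"]

scorable, rejected = partition_by_vocabulary(from_account, to_account, known_accounts)
print(scorable, rejected)
```

Edges in `rejected` can then be logged or routed to a fallback rule instead of aborting the whole batch.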

## Reproducibility

```bash
git clone https://github.com/mivertowski/SyntheticData.git
cd SyntheticData
pip install -r requirements-ml.txt
python -m scripts.ml.build_je_pyg_dataset --output data/ml/je_pyg_v1.pt --seed 20260509
python -m scripts.ml.train_je_fraud_gnn --epochs 60
python -m scripts.ml.train_je_anomaly_gae --epochs 80
python -m scripts.ml.package_for_hf
```

## Citation

```bibtex
@misc{ivertowski2026datasynth,
  author       = {Ivertowski, Michael},
  title        = {{DataSynth}: Reference Knowledge Graphs for Enterprise
                  Audit Analytics through Synthetic Data Generation
                  with Provable Statistical Properties},
  year         = {2026},
  month        = {April},
  howpublished = {SSRN Working Paper},
  url          = {https://ssrn.com/abstract=6538639}
}
```

## License

Apache-2.0.

## Related

* [`VynFi/vynfi-journal-entries-1m`](https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m) — training dataset
* [`VynFi/accounting-network-explorer`](https://huggingface.co/spaces/VynFi/accounting-network-explorer) — interactive class-level network viewer
* [`VynFi/fraud-gnn-demo`](https://huggingface.co/spaces/VynFi/fraud-gnn-demo) — Gradio inference Space (companion)
* [Engine repo](https://github.com/mivertowski/SyntheticData) · [SSRN paper](https://ssrn.com/abstract=6538639)