---
license: apache-2.0
tags:
- temporal-graph-learning
- fraud-detection
- synthetic-data
- benchmark
- upi
- causal-evaluation
- matched-controls
- neurips
---

# Temporal Twins: A Matched-Control Benchmark for Temporal Fraud Detection

Temporal Twins is a synthetic UPI-style temporal transaction benchmark in which fraud and benign trajectories are matched on static and prefix-level summaries but differ in delayed event-order structure.
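To make the matched-control idea concrete, here is a minimal toy sketch. It is not the benchmark's generator: the two trajectories, the `(amount, counterparty)` event format, and both helper functions are hypothetical and exist only to show how two sequences can share every order-invariant summary while differing in event order.

```python
from collections import Counter

# Hypothetical toy trajectories; each event is (amount, counterparty).
benign = [(100, "A"), (250, "B"), (100, "A"), (900, "C")]
fraud = [(100, "A"), (900, "C"), (250, "B"), (100, "A")]


def order_invariant_summary(events):
    """Summary statistics that ignore the order of events."""
    amounts = [amount for amount, _ in events]
    return {
        "count": len(events),
        "total": sum(amounts),
        "max": max(amounts),
        "counterparties": Counter(cp for _, cp in events),
    }


def ordered_bigrams(events):
    """Consecutive counterparty pairs, which depend on event order."""
    cps = [cp for _, cp in events]
    return list(zip(cps, cps[1:]))


# The summaries match, so a model that only sees them cannot separate the pair.
print(order_invariant_summary(benign) == order_invariant_summary(fraud))  # True
# The ordered structure differs, so a sequence-aware model can.
print(ordered_bigrams(benign) == ordered_bigrams(fraud))  # False
```

In this simplified picture, a classifier restricted to the summary features has no signal to separate the pair, while a sequence model does.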

## Links

- Dataset repository: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins)
- Code repository: [https://huggingface.co/temporal-twins-benchmark/temporal-twins-code](https://huggingface.co/temporal-twins-benchmark/temporal-twins-code)
- Croissant metadata URL: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json)
- Paper or preprint: Not available during double-blind review; to be added after publication.

## Installation

Recommended Python: `3.11+`

```bash
pip install -r requirements.txt
```

If you prefer Conda:

```bash
conda env create -f environment.yml
conda activate temporal-twins
```

## Repository Structure

- `src/`: synthetic user, transaction, risk, fraud, graph, and temporal benchmark generation code
- `models/`: SeqGRU, static baselines, audit/probe models, and temporal GNN wrappers
- `experiments/`: deterministic benchmark runner and matched-prefix evaluation utilities
- `config/`: base YAML configs used by the experiment runner
- `configs/`: release-facing config snapshots for calibration and paper-suite reproduction
- `docs/`: determinism and supporting documentation
- `metadata/`: MLCommons Croissant metadata and validation notes
- `results/`: lightweight frozen paper-suite summaries and interpretation notes

## Quick Smoke Test

```bash
PYTHONPATH=. python3 experiments/run_all.py \
  --fast \
  --seed 0 \
  --benchmark-mode temporal_twins_oracle_calib \
  --experiments audit \
  --device cpu
```

## Exact Paper-Scale Reproduction

The checked-in CLI exposes `--benchmark-mode`, `--seed`, `--seeds`, `--fast`, `--device`, and `--experiments`, but not separate `--difficulty`, `--num-users`, or `--simulation-days` flags. For the exact grouped paper-scale runs, use the helper below from the repository root.

Define this shell helper once:

```bash
run_group() {
  local group="$1"
  local seed="$2"
  local out_json="$3"

  PYTHONPATH=. python3 - "$group" "$seed" "$out_json" <<'PY'
import json
import math
import sys
import time
from pathlib import Path

from src.core.config_loader import load_config
from experiments.run_all import (
    build_gate_pool_from_frames,
    gate_volume_is_sufficient,
    generate_single_difficulty,
    offset_gate_namespace,
    prepare_gate_subset,
    run_motif_validity_check,
    set_global_determinism,
)


def normalize(value):
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize(v) for v in value]
    if hasattr(value, "item"):
        try:
            value = value.item()
        except Exception:
            pass
    if isinstance(value, float) and not math.isfinite(value):
        return None
    return value


group = sys.argv[1]
seed = int(sys.argv[2])
out_json = Path(sys.argv[3])

if group == "oracle_calib":
    benchmark_mode = "temporal_twins_oracle_calib"
    difficulty = "easy"
    hard_abort = True
else:
    benchmark_mode = "temporal_twins"
    difficulty = group
    hard_abort = False

cfg = load_config("config/default.yaml")
cfg = cfg.model_copy(
    update={
        "num_users": 350,
        "simulation_days": 45,
        "benchmark_mode": benchmark_mode,
        "random_seed": seed,
    }
)

set_global_determinism(seed)
pool = generate_single_difficulty(
    cfg,
    difficulty=difficulty,
    seed=seed,
    benchmark_mode=benchmark_mode,
)
gate = prepare_gate_subset(pool, seed=seed, fast_mode=False)
pack_count = 1

# Top up the pool with extra packs (each from a derived seed) until the gate volume is sufficient.
while (not gate_volume_is_sufficient(gate["volume"], False)) and pack_count <= 6:
    extra_seed = seed + pack_count * 10007
    extra_pack = generate_single_difficulty(
        cfg,
        difficulty=difficulty,
        seed=extra_seed,
        benchmark_mode=benchmark_mode,
    )
    extra_pack = offset_gate_namespace(extra_pack, pack_count)
    pool = build_gate_pool_from_frames([pool, extra_pack])
    gate = prepare_gate_subset(pool, seed=seed, fast_mode=False)
    pack_count += 1

gate["source_pool_events"] = int(len(pool))
if "twin_pair_id" in pool.columns:
    matched_pairs = pool.loc[pool["twin_pair_id"] >= 0, "twin_pair_id"]
    gate["source_pool_pairs"] = int(matched_pairs.nunique())
else:
    gate["source_pool_pairs"] = 0
gate["source_pool_packs"] = int(pack_count)

start = time.time()
gate_pass, report = run_motif_validity_check(
    df=pool,
    config=cfg,
    seed=seed,
    device="cpu",
    num_epochs=3,
    node_epochs=150,
    n_checkpoints=8,
    hard_abort=hard_abort,
    benchmark_mode=benchmark_mode,
    fast_mode=False,
    force_temporal_models=True,
    prebuilt_gate=gate,
)
elapsed = time.time() - start

result = {
    "benchmark_group": group,
    "benchmark_mode": benchmark_mode,
    "seed": seed,
    "primary_metric_label": report["audit_metric_label"],
    "secondary_metric_label": report["raw_metric_label"],
    "gate_pass": bool(gate_pass),
    "run_wall_time_sec": float(elapsed),
    **report,
}

out_json.parent.mkdir(parents=True, exist_ok=True)
out_json.write_text(json.dumps(normalize(result), indent=2) + "\n")
print(f"Wrote {out_json}")
PY
}
```

### Reproduce `oracle_calib`

```bash
run_group oracle_calib 0 results/paper_suite_repro/jobs/oracle_calib_0.json
```

### Reproduce `easy`

```bash
run_group easy 0 results/paper_suite_repro/jobs/easy_0.json
```

### Reproduce `medium`

```bash
run_group medium 0 results/paper_suite_repro/jobs/medium_0.json
```

### Reproduce `hard`

```bash
run_group hard 0 results/paper_suite_repro/jobs/hard_0.json
```

## Reproduce the Full Paper Suite

```bash
mkdir -p results/paper_suite_repro/jobs

for group in oracle_calib easy medium hard; do
  for seed in 0 1 2 3 4; do
    run_group "$group" "$seed" "results/paper_suite_repro/jobs/${group}_${seed}.json"
  done
done
```
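Once the loop finishes, the per-job JSON files can be rolled up by group. The sketch below is an illustration, not part of the repository: `summarize_jobs` is a hypothetical helper, and it reads only the keys that the `run_group` helper writes explicitly (`benchmark_group`, `seed`, `gate_pass`, `run_wall_time_sec`). Metric keys inside the report are not assumed.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def summarize_jobs(jobs_dir):
    """Group per-job JSON files by benchmark group and summarize known keys."""
    by_group = defaultdict(list)
    for path in sorted(Path(jobs_dir).glob("*.json")):
        record = json.loads(path.read_text())
        by_group[record["benchmark_group"]].append(record)

    summary = {}
    for group, records in by_group.items():
        summary[group] = {
            "seeds": sorted(r["seed"] for r in records),
            # Fraction of seeds whose validity gate passed.
            "gate_pass_rate": mean(1.0 if r["gate_pass"] else 0.0 for r in records),
            "mean_wall_time_sec": mean(r["run_wall_time_sec"] for r in records),
        }
    return summary


if __name__ == "__main__":
    for group, stats in summarize_jobs("results/paper_suite_repro/jobs").items():
        print(group, stats)
```

This roll-up is only a quick completeness check on the job files; it does not regenerate the frozen summaries.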

The frozen reference outputs for the final deterministic suite are already included in `results/`:

- `paper_suite_summary.csv`
- `paper_suite_summary.md`
- `paper_suite_runtime.csv`
- `paper_suite_meta.json`
- `paper_suite_runs.csv`
- `PAPER_GATE_INTERPRETATION.md`

## Expected Headline Results

| Benchmark | XGBoost ROC-AUC | StaticGNN ROC-AUC | SeqGRU ROC-AUC | SeqGRU Shuffle Delta |
| --- | ---: | ---: | ---: | ---: |
| `oracle_calib` | `0.5000` | `0.5222` | `1.0000` | `-0.5032` |
| `easy` | `0.5000` | `0.4946` | `1.0000` | `-0.5003` |
| `medium` | `0.5000` | `0.4922` | `0.8391` | `-0.3337` |
| `hard` | `0.5000` | `0.5026` | `0.6876` | `-0.1883` |
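One way to read the last column is as the change in SeqGRU ROC-AUC when event order is shuffled. The source does not define the column, so the sketch below assumes that delta = shuffled ROC-AUC minus unshuffled ROC-AUC; `implied_shuffled_auc` is a hypothetical helper, and the numbers are copied from the table.

```python
# Values copied from the headline table: (SeqGRU ROC-AUC, SeqGRU Shuffle Delta).
headline = {
    "oracle_calib": (1.0000, -0.5032),
    "easy": (1.0000, -0.5003),
    "medium": (0.8391, -0.3337),
    "hard": (0.6876, -0.1883),
}


def implied_shuffled_auc(auc, delta):
    """Assumes delta = shuffled AUC - unshuffled AUC (not confirmed by the source)."""
    return round(auc + delta, 4)


for name, (auc, delta) in headline.items():
    print(f"{name}: implied shuffled ROC-AUC = {implied_shuffled_auc(auc, delta):.4f}")
```

Under that assumption, every implied shuffled score lands within 0.01 of 0.5, which would mean that destroying event order removes essentially all of the SeqGRU signal.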

## Determinism

Runs use deterministic CPU execution. Given the same seed, a run should reproduce identical matched-prefix data and metrics. Deterministic torch settings can slow execution, especially for the non-fast paper-scale suite.
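A simple way to check this claim is to run the same group and seed twice and compare the output JSON files. The sketch below is a hypothetical helper, not part of the repository. It assumes `run_wall_time_sec` (written by the `run_group` helper) is the only field expected to vary; if the report contains other timing fields, add them to `VOLATILE_KEYS`.

```python
import hashlib
import json
from pathlib import Path

# Keys assumed to vary between otherwise identical runs.
VOLATILE_KEYS = {"run_wall_time_sec"}


def strip_volatile(value):
    """Recursively drop volatile keys from nested dicts and lists."""
    if isinstance(value, dict):
        return {k: strip_volatile(v) for k, v in value.items() if k not in VOLATILE_KEYS}
    if isinstance(value, list):
        return [strip_volatile(v) for v in value]
    return value


def result_fingerprint(path):
    """SHA-256 of the canonical JSON form of a result file, minus volatile keys."""
    record = strip_volatile(json.loads(Path(path).read_text()))
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two runs with the same group and seed should then yield equal fingerprints, for example `result_fingerprint("run_a.json") == result_fingerprint("run_b.json")`, where both file names are placeholders.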

## Data Note

This code repository contains source code, metadata, documentation, and lightweight result summaries only. The generated synthetic dataset and full release artifacts are hosted separately at the dataset repository:

- [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins)

## Privacy Note

- Synthetic data only
- No real UPI transactions
- No real users
- No real bank accounts
- No personal financial records

## License

- Code: `Apache-2.0`
- Dataset and generated benchmark artifacts: `CC-BY-4.0`

## Citation

Anonymous NeurIPS 2026 submission; final citation to be added after review.