| # Temporal Twins Dataset Card |
|
|
| ## 1. Dataset Summary |
|
|
| Temporal Twins is a synthetic UPI-style transaction benchmark for temporal fraud detection. It is designed to evaluate whether a model can distinguish fraud from benign behavior using order-sensitive temporal structure rather than static aggregates such as total transaction count, account age, or prefix length. |
|
|
| The benchmark simulates users sending transactions over time and then assigns fraud labels through delayed temporal mechanisms. Its core design is a matched fraud/benign temporal-twin construction: |
|
|
| - each positive example is a fraud twin evaluated at a local event index `k` |
| - each negative example is a benign twin evaluated at the same local event index `k` |
| - both twins are matched on static and prefix-level summaries |
| - the benign twin contains the same unordered ingredients but violates the fraud-relevant temporal order |
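The matched-twin idea can be sketched with toy records (field names and events here are illustrative, not the repository's schema):

```python
# Toy twin pair (field names are illustrative, not the repository's schema).
fraud_twin = {
    "events": ["deposit", "burst", "release", "burst"],  # fraud-relevant order
    "local_event_idx": 3,
    "label": 1,
}
benign_twin = {
    "events": ["burst", "deposit", "burst", "release"],  # same multiset, reordered
    "local_event_idx": 3,
    "label": 0,
}

# Matched on unordered ingredients and prefix position...
assert sorted(fraud_twin["events"]) == sorted(benign_twin["events"])
assert fraud_twin["local_event_idx"] == benign_twin["local_event_idx"]
# ...distinguishable only through event order.
assert fraud_twin["events"] != benign_twin["events"]
```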
|
|
| Temporal Twins exposes four benchmark modes: |
|
|
| - `oracle_calib` |
| - `easy` |
| - `medium` |
| - `hard` |
|
|
| The frozen paper-suite configuration used in this repository is: |
|
|
| - `num_users = 350` |
| - `simulation_days = 45` |
| - `seeds = [0, 1, 2, 3, 4]` |
| - `fast_mode = false` |
| - `n_checkpoints = 8` |
|
|
| ## 2. Dataset Motivation |
|
|
| Many fraud datasets can be solved by static shortcuts: longer histories, later evaluation times, higher transaction counts, or other aggregate correlates can make a benchmark look temporally rich while actually rewarding non-temporal models. Temporal Twins was built to remove those shortcuts and isolate order-sensitive temporal reasoning. |
|
|
| The benchmark therefore aims to answer a narrower research question: |
|
|
| - when static summaries are matched between positives and negatives, can a model still recover delayed fraud signals from temporal order alone? |
|
|
| It is intended for benchmarking temporal representation learning, causal order sensitivity, and delayed-label detection under controlled synthetic conditions. |
|
|
| ## 3. Dataset Composition |
|
|
| Temporal Twins is generated programmatically from synthetic user and transaction processes. There is no fixed real-world corpus. Each generated artifact is an event table in which each row is a synthetic transaction. |
|
|
| At a high level, each run contains: |
|
|
| - a synthetic user population |
| - a synthetic stream of UPI-style transactions |
| - risk-engine outputs such as transaction risk scores and failures |
| - benchmark-specific fraud and audit annotations |
| - matched fraud/benign evaluation pairs extracted from the event stream |
|
|
| The paper-scale suite in this repository contains 20 deterministic runs: |
|
|
| - `oracle_calib` with seeds `0..4` |
| - `easy` with seeds `0..4` |
| - `medium` with seeds `0..4` |
| - `hard` with seeds `0..4` |
|
|
| Mean matched evaluation-pair counts in the frozen paper suite are: |
|
|
| Mode | Matched evaluation pairs (mean ± std) |
|---|---:|
| `oracle_calib` | `2606.6 ± 454.3` |
| `easy` | `2222.2 ± 128.4` |
| `medium` | `2356.6 ± 18.0` |
| `hard` | `2317.6 ± 22.0` |
|
|
| Each paper-suite run is class-balanced at evaluation time: |
|
|
- the number of positives equals the number of negatives
- positive rate = `0.5000`
|
|
| ## 4. Dataset Generation Process |
|
|
| The generation pipeline has four stages: |
|
|
| 1. Synthetic user generation |
| 2. Synthetic transaction generation |
| 3. Synthetic risk and retry generation |
| 4. Fraud-mechanism and matched-twin generation |
|
|
| More concretely: |
|
|
| 1. A synthetic user set is created with user-level behavioral parameters. |
| 2. A synthetic transaction stream is sampled with sender IDs, receiver IDs, timestamps, transaction amounts, and transaction types. |
| 3. A risk engine adds synthetic risk-related fields such as `risk_score`, `fail_prob`, `failed`, and retry-like events. |
| 4. The fraud engine applies benchmark-mode-specific temporal mechanisms and constructs matched temporal twins. |
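The first three stages can be miniaturized in a runnable sketch. All names, sizes, and distributions below are illustrative, not the repository's actual generator, and stage 4 (the fraud engine and twin construction) is omitted for brevity:

```python
import numpy as np

def generate_run(num_users: int, num_events: int, seed: int) -> dict:
    """Miniature of stages 1-3; stage 4 (fraud engine and twins) is omitted."""
    rng = np.random.default_rng(seed)
    # 1. user population: per-user activity rates
    rates = rng.gamma(2.0, 1.0, size=num_users)
    # 2. transaction stream: sender, receiver, timestamp, amount
    senders = rng.choice(num_users, size=num_events, p=rates / rates.sum())
    receivers = rng.integers(0, num_users, size=num_events)
    timestamps = np.sort(rng.uniform(0.0, 45 * 86400.0, size=num_events))
    amounts = rng.lognormal(3.0, 1.0, size=num_events)
    # 3. risk engine: synthetic risk score and failure indicator
    risk = rng.beta(2.0, 8.0, size=num_events)
    failed = (rng.uniform(size=num_events) < risk).astype(np.int8)
    return {"sender_id": senders, "receiver_id": receivers,
            "timestamp": timestamps, "amount": amounts,
            "risk_score": risk, "failed": failed}

run = generate_run(num_users=10, num_events=100, seed=0)
assert len(run["timestamp"]) == 100
assert np.all(np.diff(run["timestamp"]) >= 0)  # stream is time-ordered
```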
|
|
| For the `temporal_twins` benchmark family, the generator then: |
|
|
| - constructs fraud twins and benign twins from matched carrier users and templates |
| - preserves matched static and prefix-level summaries |
| - injects delayed fraud labels into fraud twins |
| - forces benign twins to avoid the fraud-relevant temporal motif while retaining similar unordered ingredients |
|
|
| The benchmark is deterministic under fixed configuration, seed, and runtime settings. |
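Determinism rests on stable per-run seed derivation. A minimal sketch (the repository's actual derivation scheme may differ; `hashlib` is used here because, unlike the builtin `hash()`, it is stable across processes):

```python
import hashlib

def derive_run_seed(mode: str, seed: int) -> int:
    """Hypothetical seed derivation for a (mode, seed) pair. hashlib is used
    because, unlike the builtin hash(), it is stable across processes."""
    digest = hashlib.sha256(f"{mode}:{seed}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

assert derive_run_seed("easy", 0) == derive_run_seed("easy", 0)  # deterministic
assert derive_run_seed("easy", 0) != derive_run_seed("hard", 0)  # mode-dependent
```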
|
|
| ## 5. Fraud Mechanisms |
|
|
| Temporal Twins uses delayed, order-sensitive fraud mechanisms rather than directly labeling static outliers. Important mechanisms include: |
|
|
| - velocity-like activity acceleration |
| - retry-like behavior |
| - delayed receiver revisits |
| - burst-release-burst motifs |
| - adversarial timing perturbations |
| - delayed fraud assignment |
| - hidden latent fraud-state dynamics |
|
|
| These mechanisms are combined with difficulty-dependent noise and camouflage. In the standard `easy`, `medium`, and `hard` modes, the fraud signal is intentionally imperfect and partially obscured. In `oracle_calib`, the construction is designed to validate motif and evaluation alignment under matched-prefix conditions. |
|
|
| ## 6. Matched-Control Construction |
|
|
| The central benchmark control is the fraud/benign temporal twin. |
|
|
| For every fraud twin positive label at local event index `k`: |
|
|
| - the benign twin is evaluated at the same local event index `k` |
| - both examples use the same local prefix length |
| - both examples are truncated at prefix index `k` |
| - no future events are visible to the model |
|
|
| Within each matched pair, the protocol additionally matches: |
|
|
| - total transaction count |
| - local prefix length |
| - evaluation timestamp |
| - account age |
| - active age |
| - receiver histograms |
| - static aggregate summaries |
|
|
| In words: |
|
|
| - the fraud twin contains a temporally meaningful order pattern that triggers a delayed positive label |
| - the benign twin contains comparable ingredients and prefix statistics but violates the fraud-relevant temporal order |
|
|
| This design is meant to prevent performance from arising from: |
|
|
| - longer histories |
| - older accounts |
| - later prefix positions |
| - different transaction totals |
| - unmatched prefix ages |
| - benign negatives evaluated at arbitrary or easier positions |
|
|
| ## 7. Dataset Modes and Difficulty Ladder |
|
|
| Temporal Twins provides four modes. |
|
|
| ### `oracle_calib` |
| |
| This is the calibration mode used to validate that the matched-prefix protocol is working as intended. |
| |
| - Oracle metrics remain near-perfect. |
| - Static shortcut baselines remain at chance. |
| - Benign motif hit rate remains zero. |
| - This mode is primarily for protocol validation rather than realistic difficulty. |
| |
| ### `easy` |
| |
| - strong motif signal |
| - low noise |
| - shorter delay |
| - expected SeqGRU performance near `0.90-1.00` |
| |
| ### `medium` |
| |
| - moderate motif signal |
| - moderate noise |
| - longer delay |
| - expected SeqGRU performance near `0.80-0.90` |
| |
| ### `hard` |
| |
| - weaker motif signal |
| - longer delay |
| - adversarial perturbations and decoys |
| - expected SeqGRU performance near `0.70-0.85` |
| |
| Naming convention: |
| |
| - in `oracle_calib`, `AuditOracle` and `RawMotifOracle` are true oracle-style references |
| - in standard `easy`, `medium`, and `hard`, the corresponding scores are reported as `MotifProbe` and `RawMotifProbe` because realism and noise make them probes rather than perfect oracles |
|
|
| ## 8. Data Schema |
|
|
| The event table contains model-facing fields, supervision labels, and audit/oracle-only fields. The table below lists the most important columns used in this repository. |
|
|
| | Column name | Type | Description | Exposed to ordinary models? | Notes | |
| |---|---|---|---|---| |
| | `txn_id` | `int32` | Synthetic transaction identifier | Yes | Identifier only; not a benchmark target | |
| | `sender_id` | `int32` / `int64` | Synthetic sender account ID | Yes | Node identity available to temporal models | |
| | `receiver_id` | `int32` / `int64` | Synthetic receiver account ID | Yes | Used for graph and sequence structure | |
| | `timestamp` | `float32` | Synthetic event time in seconds from simulation start | Yes | Prefix truncation is based on timestamp and local index | |
| | `amount` | `float32` | Synthetic transaction amount | Yes | Not tied to real currency records | |
| | `txn_type` | `int8` | Synthetic transaction-type code | Yes | UPI-style categorical event attribute | |
| | `risk_score` | `float32` | Synthetic risk score from the risk engine | Yes | No real production risk model is used | |
| | `fail_prob` | `float32` | Synthetic failure probability | Yes | Risk-engine output | |
| | `failed` | `int8` | Binary failure indicator | Yes | Used as a normal model-facing field | |
| | `is_retry` | `int8` / derived | Retry-like event indicator | Yes | Available to ordinary models when present | |
| | `pair_freq` | `float32` / derived | Sender-receiver interaction-frequency feature | Yes | Derived from visible event history | |
| | `risk_noisy` | `float32` | Noisy synthetic risk feature | Yes | Benchmark feature, not an audit signal | |
| | `txn_count_10` | `float32` / derived | Recent-count feature over a short window | Yes | Derived from visible history | |
| | `amount_sum_10` | `float32` / derived | Recent amount-sum feature | Yes | Derived from visible history | |
| | `is_fraud` | `int8` | Binary fraud label | No | Supervision target only, not a model input | |
| | `twin_pair_id` | `int64` | Matched fraud/benign pair identifier | No | Audit/oracle-only; not exposed to learned baselines | |
| | `twin_role` | `string` | Twin role such as `fraud`, `benign`, or `background` | No | Audit/oracle-only | |
| | `twin_label` | `int8` | Pairwise matched label for audit utilities | No | Audit/oracle-only | |
| | `template_id` | `int64` | Source template identifier used during twin construction | No | Audit/oracle-only | |
| | `dynamic_fraud_state` | `float32` | Latent synthetic fraud-state variable | No | Hidden mechanism for analysis only | |
| | `motif_source` | `int8` | Indicator for motif-source events in a sequence | No | Audit/oracle-only | |
| | `motif_hit_count` | `int32` | Count of motif hits in the sequence | No | Audit/oracle-only | |
| | `trigger_event_idx` | `int32` | Local event index of the trigger event | No | Audit/oracle-only | |
| | `label_event_idx` | `int32` | Local event index at which the fraud label becomes active | No | Audit/oracle-only | |
| | `label_delay` | `int32` | Delay between trigger and labeled event index | No | Audit/oracle-only | |
| | `fraud_source` | `string` | Cause of fraud label, e.g. motif or fallback chain | No | Audit/oracle-only | |
| | `is_fallback_label` | `int8` | Indicator that a label came from fallback logic | No | Audit/oracle-only | |
| | `motif_chain_state` | `float32` | Internal motif-chain analysis field | No | Audit/oracle-only | |
| | `motif_strength` | `float32` | Internal motif-strength analysis field | No | Audit/oracle-only | |
|
|
| Not every baseline uses every model-facing column. The important guarantee is that learned baselines do not receive the audit/oracle-only fields listed above. |
|
|
| ## 9. Model-Facing vs Audit/Oracle-Only Columns |
|
|
| Ordinary learned baselines are restricted to model-facing transaction attributes and histories. In this repository, audit/oracle-only columns are explicitly stripped before learned baselines are trained or evaluated. |
|
|
| Ordinary models may use fields such as: |
|
|
| - `sender_id` |
| - `receiver_id` |
| - `timestamp` |
| - `amount` |
| - `risk_score` |
| - `fail_prob` |
| - `failed` |
| - `txn_type` |
| - other derived non-oracle features built from visible prefix history |
|
|
| Ordinary models must not use: |
|
|
| - `motif_hit_count` |
| - `motif_source` |
| - `trigger_event_idx` |
| - `label_event_idx` |
| - `label_delay` |
| - `fraud_source` |
| - `twin_role` |
| - `twin_label` |
| - `twin_pair_id` |
| - `template_id` |
| - `dynamic_fraud_state` |
| - other oracle-only diagnostics |
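In code, the separation can be enforced by dropping the audit/oracle-only columns before training. A minimal pandas sketch (the repository's own stripping utility may differ):

```python
import pandas as pd

# Audit/oracle-only columns from the schema; learned baselines must not see them.
AUDIT_ONLY = [
    "motif_hit_count", "motif_source", "trigger_event_idx", "label_event_idx",
    "label_delay", "fraud_source", "twin_role", "twin_label", "twin_pair_id",
    "template_id", "dynamic_fraud_state", "is_fallback_label",
    "motif_chain_state", "motif_strength",
]

def strip_audit_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop whichever audit/oracle-only columns are present; the label column
    `is_fraud` is handled separately as the supervision target."""
    return df.drop(columns=[c for c in AUDIT_ONLY if c in df.columns])

df = pd.DataFrame({"amount": [12.5], "risk_score": [0.2],
                   "twin_pair_id": [7], "motif_strength": [0.9]})
assert list(strip_audit_columns(df).columns) == ["amount", "risk_score"]
```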
|
|
| This separation is necessary for the benchmark claim that performance should come from temporal reasoning rather than privileged audit information. |
|
|
| ## 10. Benchmark Tasks |
|
|
| Temporal Twins supports the following benchmark task: |
|
|
| - binary fraud detection on matched prefix examples |
|
|
| The standard evaluation protocol is: |
|
|
| - build matched fraud/benign examples |
| - truncate each sender history at the matched prefix index `k` |
| - train or score on the visible prefix only |
| - evaluate binary discrimination at the matched example level |
|
|
| Primary reported metrics include: |
|
|
| - ROC-AUC |
| - PR-AUC |
| - shuffled-order ROC-AUC |
| - shuffle delta = shuffled ROC-AUC minus clean ROC-AUC |
|
|
The shuffled-order test is the key diagnostic: it measures how much performance depends on event order rather than on unordered ingredients, so a strongly negative shuffle delta indicates genuine order sensitivity.
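A toy illustration of the shuffle test (the motif, the scorer, and the pairwise AUC implementation are all illustrative, not the benchmark's actual evaluation code):

```python
import random

def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) ROC-AUC; O(n_pos * n_neg), fine for a sketch."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def motif_score(events):
    """Toy order-sensitive scorer: fires only on the contiguous motif."""
    return 1.0 if "burst,release,burst" in ",".join(events) else 0.0

# 50 matched pairs: same unordered ingredients, different order.
pairs = [(["a", "burst", "release", "burst"], 1),
         (["burst", "a", "burst", "release"], 0)] * 50
labels = [y for _, y in pairs]

clean = roc_auc(labels, [motif_score(e) for e, _ in pairs])

rng = random.Random(0)
shuffled_scores = []
for events, _ in pairs:
    events = events[:]
    rng.shuffle(events)
    shuffled_scores.append(motif_score(events))
shuffled = roc_auc(labels, shuffled_scores)

delta = shuffled - clean  # shuffle delta; strongly negative => order-dependent
assert clean == 1.0
assert delta <= 0.0
```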
|
|
| ## 11. Baselines and Reference Results |
|
|
| The frozen 5-seed paper suite uses: |
|
|
| - `num_users = 350` |
| - `simulation_days = 45` |
| - `seeds = [0, 1, 2, 3, 4]` |
| - `fast_mode = false` |
| - `n_checkpoints = 8` |
|
|
| Compact reference results: |
|
|
| Mode | Primary reference | Secondary reference | XGBoost ROC-AUC | StaticGNN ROC-AUC | SeqGRU ROC-AUC | SeqGRU shuffled delta |
|---|---:|---:|---:|---:|---:|---:|
| `oracle_calib` | `AuditOracle 1.0000 ± 0.0000` | `RawMotifOracle 1.0000 ± 0.0000` | `0.5000 ± 0.0000` | `0.5222 ± 0.0235` | `1.0000 ± 0.0000` | `-0.5032 ± 0.0043` |
| `easy` | `MotifProbe 1.0000 ± 0.0000` | `RawMotifProbe 0.9983 ± 0.0011` | `0.5000 ± 0.0000` | `0.4946 ± 0.0128` | `1.0000 ± 0.0000` | `-0.5003 ± 0.0096` |
| `medium` | `MotifProbe 0.6374 ± 0.0069` | `RawMotifProbe 0.6482 ± 0.0058` | `0.5000 ± 0.0000` | `0.4922 ± 0.0203` | `0.8391 ± 0.0174` | `-0.3337 ± 0.0191` |
| `hard` | `MotifProbe 0.5790 ± 0.0045` | `RawMotifProbe 0.5910 ± 0.0105` | `0.5000 ± 0.0000` | `0.5026 ± 0.0198` | `0.6876 ± 0.0128` | `-0.1883 ± 0.0111` |
|
|
| Static shortcut audit across all 20 paper-suite runs: |
|
|
- `static_agg_auc = 0.5000 ± 0.0000`
- `total_txn_count AUC = 0.5000 ± 0.0000`
- `local_event_idx AUC = 0.5000 ± 0.0000`
- `prefix_txn_count AUC = 0.5000 ± 0.0000`
- `timestamp AUC = 0.5000 ± 0.0000`
- `account_age AUC = 0.5000 ± 0.0000`
- `active_age AUC = 0.5000 ± 0.0000`
- `benign_motif_hit_rate = 0.0000 ± 0.0000`
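The static-shortcut audit can be illustrated with a single matched feature used directly as a score (column values and the pairwise AUC implementation are illustrative): when positives and negatives are exactly matched on the aggregate, its AUC collapses to 0.5 by construction.

```python
def single_feature_auc(labels, feature):
    """Pairwise ROC-AUC of one static feature used directly as the score."""
    pos = [v for y, v in zip(labels, feature) if y == 1]
    neg = [v for y, v in zip(labels, feature) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Matched twins: within each fraud/benign pair the static aggregate is equal,
# so the feature cannot separate the classes.
labels = [1, 0, 1, 0, 1, 0]
total_txn_count = [7, 7, 12, 12, 9, 9]
assert single_feature_auc(labels, total_txn_count) == 0.5
```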
|
|
| These results support the intended interpretation: |
|
|
| - static shortcuts are neutralized |
| - `oracle_calib` validates matched-prefix correctness |
| - `easy` is readily learnable by order-sensitive sequence models |
| - `medium` remains learnable but meaningfully harder |
| - `hard` remains above static baselines but is substantially more challenging |
|
|
| Full paper-suite artifacts, including temporal GNN results and per-seed CSVs, are stored under: |
|
|
| - `results/paper_suite_20260503_202810/` |
|
|
| ## 12. Intended Use |
|
|
| This dataset is intended for: |
|
|
| - research on temporal fraud detection |
| - benchmarking order-sensitive sequence and temporal-graph models |
| - evaluating whether performance survives matched static controls |
| - studying delayed labels and prefix-only evaluation |
| - comparing clean-order and shuffled-order performance |
|
|
| It is appropriate for methodology papers, controlled ablation studies, and robustness checks on temporal inductive bias. |
|
|
| ## 13. Out-of-Scope Use |
|
|
| Temporal Twins is out of scope for: |
|
|
| - direct training of production fraud systems |
| - making real financial, banking, or payment decisions |
| - approving or denying transactions for real users |
| - risk-scoring real individuals or organizations |
| - regulatory, legal, or operational decisions in production financial systems |
|
|
| The dataset must not be used to train production fraud systems directly or to make real financial decisions. |
|
|
| ## 14. Limitations |
|
|
| Important limitations include: |
|
|
| - the benchmark is fully synthetic and reflects designer assumptions |
| - user behavior, fraud behavior, and benign behavior are simplified relative to real financial ecosystems |
| - the only ground truth is the generator's own labeling logic |
| - real-world fraud often depends on richer institutional, device, merchant, and social context not present here |
| - difficulty levels are benchmark design choices, not calibrated measures of real operational difficulty |
| - temporal GNN underperformance on this benchmark should not be generalized to all real fraud settings |
|
|
| ## 15. Biases and Risks |
|
|
| As a synthetic benchmark, Temporal Twins inherits the modeling biases of its generator: |
|
|
| - it emphasizes order-sensitive motifs chosen by the benchmark designers |
| - it encodes a particular notion of delayed fraud and camouflage |
| - it may reward models that are well aligned to these synthetic mechanisms |
| - it may underrepresent other real fraud styles not captured by the generator |
|
|
| There is also a scientific risk: |
|
|
- because the benchmark intentionally removes common static shortcuts, performance on Temporal Twins may differ from performance on operational datasets where those shortcuts exist, and the difference can go in either direction
|
|
| ## 16. Privacy and Sensitive Data |
|
|
| Temporal Twins contains no real financial or personal data. |
|
|
| Specifically: |
|
|
| - no real UPI data |
| - no real users |
| - no real bank accounts |
| - no real transactions |
| - no personal financial records |
| - no protected demographic attributes |
|
|
| All user IDs, receiver IDs, timestamps, amounts, and risk signals are synthetic artifacts produced by the generator. |
|
|
| ## 17. Ethical Considerations |
|
|
| Temporal Twins is safer to share than real financial logs because it does not contain real persons or institutions. However, ethical care is still needed. |
|
|
| Users of the dataset should not: |
|
|
| - present synthetic results as direct evidence of production readiness |
| - claim fairness or social validity that has not been tested on real populations |
| - use the dataset as justification for automated decisions about real people |
|
|
| The intended ethical use is research benchmarking, not operational deployment. |
|
|
| ## 18. Reproducibility |
|
|
| The repository includes deterministic generation and evaluation settings for the frozen paper suite. |
|
|
| Paper-suite configuration: |
|
|
| - `num_users = 350` |
| - `simulation_days = 45` |
| - `seeds = [0, 1, 2, 3, 4]` |
| - `fast_mode = false` |
| - `n_checkpoints = 8` |
|
|
| Reproducibility properties: |
|
|
| - stable deterministic seed derivation is used for benchmark modes and profiles |
| - Python, NumPy, and PyTorch seeds are fixed per run |
| - deterministic runtime flags are enabled where safe |
| - matched-prefix datasets are reproducible under fixed config and seed |
| - the final paper suite in this repository is stored as deterministic CSV artifacts |
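A minimal per-run seeding sketch (the helper name is hypothetical and the repository's own seeding utility may differ; PyTorch seeding and deterministic-runtime flags would be added where `torch` is installed):

```python
import os
import random
import numpy as np

def seed_everything(seed: int) -> None:
    """Fix the common RNG sources (hypothetical helper name; the repository's
    own seeding utility may differ). PyTorch seeding and deterministic-runtime
    flags would be added here where torch is installed."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)

seed_everything(0)
a = np.random.rand(3)
seed_everything(0)
b = np.random.rand(3)
assert np.array_equal(a, b)  # identical draws under the same seed
```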
|
|
| Reference artifacts: |
|
|
| - `results/paper_suite_20260503_202810/paper_suite_runs.csv` |
| - `results/paper_suite_20260503_202810/paper_suite_summary.csv` |
| - `results/paper_suite_20260503_202810/paper_suite_runtime.csv` |
| - `results/paper_suite_20260503_202810/paper_suite_failed_checks.csv` |
|
|
| ## 19. Hosting, License, and Citation |
|
|
| ### Hosting |
|
|
| The benchmark is currently generated from code in this repository rather than distributed as a fixed external archive. |
|
|
| Current status: |
|
|
| - dataset hosting location: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins) |
| - code repository: [https://huggingface.co/temporal-twins-benchmark/temporal-twins-code](https://huggingface.co/temporal-twins-benchmark/temporal-twins-code) |
| - canonical pre-generated release archive: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins) |
| - Croissant metadata URL: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json) |
| - Croissant metadata browser page: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/blob/main/metadata/temporal_twins_croissant.json](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/blob/main/metadata/temporal_twins_croissant.json) |
| - data files: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/data](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/data) |
| - results: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/results](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/results) |
| - configs: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/configs](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/configs) |
| - metadata directory: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/metadata](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/metadata) |
| - paper or preprint: Not available during double-blind review; to be added after publication. |
| - reference paper-suite results: `results/paper_suite_20260503_202810/` |
|
|
| The Croissant file passes JSON, schema, and Responsible AI validation in the official checker. The optional records-generation test currently reports a known Parquet-in-zip streaming issue; direct pandas/pyarrow loading instructions are provided in `data/README_GENERATION.md`. |
|
|
| ### License |
|
|
| - Dataset license: `CC BY 4.0` (`CC-BY-4.0`) |
| - Code license: `Apache License 2.0` (`Apache-2.0`) |
|
|
| ### Citation |
|
|
| The final paper or preprint citation is not available during double-blind review and will be added after publication. |
|
|
| `TODO` placeholder BibTeX: |
|
|
| ```bibtex |
| @dataset{temporal_twins_todo, |
| title = {Temporal Twins: A Synthetic UPI-Style Benchmark for Temporal Fraud Detection}, |
| author = {TODO}, |
| year = {TODO}, |
| howpublished = {TODO}, |
| note = {Synthetic matched-prefix temporal fraud benchmark}, |
| url = {TODO} |
| } |
| ``` |
|
|