| # Temporal Twins Dataset Card |
|
|
| ## 1. Dataset Summary |
|
|
| Temporal Twins is a synthetic UPI-style transaction benchmark for temporal fraud detection. It is designed to evaluate whether a model can distinguish fraud from benign behavior using order-sensitive temporal structure rather than static aggregates such as total transaction count, account age, or prefix length. |
|
|
| The benchmark simulates users sending transactions over time and then assigns fraud labels through delayed temporal mechanisms. Its core design is a matched fraud/benign temporal-twin construction: |
|
|
| - each positive example is a fraud twin evaluated at a local event index `k` |
| - each negative example is a benign twin evaluated at the same local event index `k` |
| - both twins are matched on static and prefix-level summaries |
| - the benign twin contains the same unordered ingredients but violates the fraud-relevant temporal order |
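The matched-twin idea can be sketched with toy records (field names and events here are illustrative, not the repository's schema):

```python
# Toy twin pair (field names are illustrative, not the repository's schema).
fraud_twin = {
    "events": ["deposit", "burst", "release", "burst"],  # fraud-relevant order
    "local_event_idx": 3,
    "label": 1,
}
benign_twin = {
    "events": ["burst", "deposit", "burst", "release"],  # same multiset, reordered
    "local_event_idx": 3,
    "label": 0,
}

# Matched on unordered ingredients and prefix position...
assert sorted(fraud_twin["events"]) == sorted(benign_twin["events"])
assert fraud_twin["local_event_idx"] == benign_twin["local_event_idx"]
# ...distinguishable only through event order.
assert fraud_twin["events"] != benign_twin["events"]
```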
|
|
| Temporal Twins exposes four benchmark modes: |
|
|
| - `oracle_calib` |
| - `easy` |
| - `medium` |
| - `hard` |
|
|
| The frozen paper-suite configuration used in this repository is: |
|
|
| - `num_users = 350` |
| - `simulation_days = 45` |
| - `seeds = [0, 1, 2, 3, 4]` |
| - `fast_mode = false` |
| - `n_checkpoints = 8` |
|
|
| ## 2. Dataset Motivation |
|
|
| Many fraud datasets can be solved by static shortcuts: longer histories, later evaluation times, higher transaction counts, or other aggregate correlates can make a benchmark look temporally rich while actually rewarding non-temporal models. Temporal Twins was built to remove those shortcuts and isolate order-sensitive temporal reasoning. |
|
|
| The benchmark therefore aims to answer a narrower research question: |
|
|
| - when static summaries are matched between positives and negatives, can a model still recover delayed fraud signals from temporal order alone? |
|
|
| It is intended for benchmarking temporal representation learning, causal order sensitivity, and delayed-label detection under controlled synthetic conditions. |
|
|
| ## 3. Dataset Composition |
|
|
| Temporal Twins is generated programmatically from synthetic user and transaction processes. There is no fixed real-world corpus. Each generated artifact is an event table in which each row is a synthetic transaction. |
|
|
| At a high level, each run contains: |
|
|
| - a synthetic user population |
| - a synthetic stream of UPI-style transactions |
| - risk-engine outputs such as transaction risk scores and failures |
| - benchmark-specific fraud and audit annotations |
| - matched fraud/benign evaluation pairs extracted from the event stream |
|
|
| The paper-scale suite in this repository contains 20 deterministic runs: |
|
|
| - `oracle_calib` with seeds `0..4` |
| - `easy` with seeds `0..4` |
| - `medium` with seeds `0..4` |
| - `hard` with seeds `0..4` |
|
|
| Mean matched evaluation-pair counts in the frozen paper suite are: |
|
|
| Mode | Matched evaluation pairs (mean ± std) |
|---|---:|
| `oracle_calib` | `2606.6 ± 454.3` |
| `easy` | `2222.2 ± 128.4` |
| `medium` | `2356.6 ± 18.0` |
| `hard` | `2317.6 ± 22.0` |
|
|
| Each paper-suite run is class-balanced at evaluation time: |
|
|
- the number of positives equals the number of negatives
- positive rate = `0.5000`
|
|
| ## 4. Dataset Generation Process |
|
|
| The generation pipeline has four stages: |
|
|
| 1. Synthetic user generation |
| 2. Synthetic transaction generation |
| 3. Synthetic risk and retry generation |
| 4. Fraud-mechanism and matched-twin generation |
|
|
| More concretely: |
|
|
| 1. A synthetic user set is created with user-level behavioral parameters. |
| 2. A synthetic transaction stream is sampled with sender IDs, receiver IDs, timestamps, transaction amounts, and transaction types. |
| 3. A risk engine adds synthetic risk-related fields such as `risk_score`, `fail_prob`, `failed`, and retry-like events. |
| 4. The fraud engine applies benchmark-mode-specific temporal mechanisms and constructs matched temporal twins. |
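The first three stages can be miniaturized in a runnable sketch. All names, sizes, and distributions below are illustrative, not the repository's actual generator, and stage 4 (the fraud engine and twin construction) is omitted for brevity:

```python
import numpy as np

def generate_run(num_users: int, num_events: int, seed: int) -> dict:
    """Miniature of stages 1-3; stage 4 (fraud engine and twins) is omitted."""
    rng = np.random.default_rng(seed)
    # 1. user population: per-user activity rates
    rates = rng.gamma(2.0, 1.0, size=num_users)
    # 2. transaction stream: sender, receiver, timestamp, amount
    senders = rng.choice(num_users, size=num_events, p=rates / rates.sum())
    receivers = rng.integers(0, num_users, size=num_events)
    timestamps = np.sort(rng.uniform(0.0, 45 * 86400.0, size=num_events))
    amounts = rng.lognormal(3.0, 1.0, size=num_events)
    # 3. risk engine: synthetic risk score and failure indicator
    risk = rng.beta(2.0, 8.0, size=num_events)
    failed = (rng.uniform(size=num_events) < risk).astype(np.int8)
    return {"sender_id": senders, "receiver_id": receivers,
            "timestamp": timestamps, "amount": amounts,
            "risk_score": risk, "failed": failed}

run = generate_run(num_users=10, num_events=100, seed=0)
assert len(run["timestamp"]) == 100
assert np.all(np.diff(run["timestamp"]) >= 0)  # stream is time-ordered
```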
|
|
| For the `temporal_twins` benchmark family, the generator then: |
|
|
| - constructs fraud twins and benign twins from matched carrier users and templates |
| - preserves matched static and prefix-level summaries |
| - injects delayed fraud labels into fraud twins |
| - forces benign twins to avoid the fraud-relevant temporal motif while retaining similar unordered ingredients |
|
|
| The benchmark is deterministic under fixed configuration, seed, and runtime settings. |
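Determinism rests on stable per-run seed derivation. A minimal sketch (the repository's actual derivation scheme may differ; `hashlib` is used here because, unlike the builtin `hash()`, it is stable across processes):

```python
import hashlib

def derive_run_seed(mode: str, seed: int) -> int:
    """Hypothetical seed derivation for a (mode, seed) pair. hashlib is used
    because, unlike the builtin hash(), it is stable across processes."""
    digest = hashlib.sha256(f"{mode}:{seed}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

assert derive_run_seed("easy", 0) == derive_run_seed("easy", 0)  # deterministic
assert derive_run_seed("easy", 0) != derive_run_seed("hard", 0)  # mode-dependent
```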
|
|
| ## 5. Fraud Mechanisms |
|
|
| Temporal Twins uses delayed, order-sensitive fraud mechanisms rather than directly labeling static outliers. Important mechanisms include: |
|
|
| - velocity-like activity acceleration |
| - retry-like behavior |
| - delayed receiver revisits |
| - burst-release-burst motifs |
| - adversarial timing perturbations |
| - delayed fraud assignment |
| - hidden latent fraud-state dynamics |
|
|
| These mechanisms are combined with difficulty-dependent noise and camouflage. In the standard `easy`, `medium`, and `hard` modes, the fraud signal is intentionally imperfect and partially obscured. In `oracle_calib`, the construction is designed to validate motif and evaluation alignment under matched-prefix conditions. |
|
|
| ## 6. Matched-Control Construction |
|
|
| The central benchmark control is the fraud/benign temporal twin. |
|
|
| For every fraud twin positive label at local event index `k`: |
|
|
| - the benign twin is evaluated at the same local event index `k` |
| - both examples use the same local prefix length |
| - both examples are truncated at prefix index `k` |
| - no future events are visible to the model |
|
|
| Within each matched pair, the protocol additionally matches: |
|
|
| - total transaction count |
| - local prefix length |
| - evaluation timestamp |
| - account age |
| - active age |
| - receiver histograms |
| - static aggregate summaries |
|
|
| In words: |
|
|
| - the fraud twin contains a temporally meaningful order pattern that triggers a delayed positive label |
| - the benign twin contains comparable ingredients and prefix statistics but violates the fraud-relevant temporal order |
|
|
| This design is meant to prevent performance from arising from: |
|
|
| - longer histories |
| - older accounts |
| - later prefix positions |
| - different transaction totals |
| - unmatched prefix ages |
| - benign negatives evaluated at arbitrary or easier positions |
|
|
| ## 7. Dataset Modes and Difficulty Ladder |
|
|
| Temporal Twins provides four modes. |
|
|
| ### `oracle_calib` |
| |
| This is the calibration mode used to validate that the matched-prefix protocol is working as intended. |
| |
| - Oracle metrics remain near-perfect. |
| - Static shortcut baselines remain at chance. |
| - Benign motif hit rate remains zero. |
| - This mode is primarily for protocol validation rather than realistic difficulty. |
| |
| ### `easy` |
| |
| - strong motif signal |
| - low noise |
| - shorter delay |
| - expected SeqGRU performance near `0.90-1.00` |
| |
| ### `medium` |
| |
| - moderate motif signal |
| - moderate noise |
| - longer delay |
| - expected SeqGRU performance near `0.80-0.90` |
| |
| ### `hard` |
| |
| - weaker motif signal |
| - longer delay |
| - adversarial perturbations and decoys |
| - expected SeqGRU performance near `0.70-0.85` |
| |
| Naming convention: |
| |
| - in `oracle_calib`, `AuditOracle` and `RawMotifOracle` are true oracle-style references |
| - in standard `easy`, `medium`, and `hard`, the corresponding scores are reported as `MotifProbe` and `RawMotifProbe` because realism and noise make them probes rather than perfect oracles |
|
|
| ## 8. Data Schema |
|
|
| The event table contains model-facing fields, supervision labels, and audit/oracle-only fields. The table below lists the most important columns used in this repository. |
|
|
| | Column name | Type | Description | Exposed to ordinary models? | Notes | |
| |---|---|---|---|---| |
| | `txn_id` | `int32` | Synthetic transaction identifier | Yes | Identifier only; not a benchmark target | |
| | `sender_id` | `int32` / `int64` | Synthetic sender account ID | Yes | Node identity available to temporal models | |
| | `receiver_id` | `int32` / `int64` | Synthetic receiver account ID | Yes | Used for graph and sequence structure | |
| | `timestamp` | `float32` | Synthetic event time in seconds from simulation start | Yes | Prefix truncation is based on timestamp and local index | |
| | `amount` | `float32` | Synthetic transaction amount | Yes | Not tied to real currency records | |
| | `txn_type` | `int8` | Synthetic transaction-type code | Yes | UPI-style categorical event attribute | |
| | `risk_score` | `float32` | Synthetic risk score from the risk engine | Yes | No real production risk model is used | |
| | `fail_prob` | `float32` | Synthetic failure probability | Yes | Risk-engine output | |
| | `failed` | `int8` | Binary failure indicator | Yes | Used as a normal model-facing field | |
| | `is_retry` | `int8` / derived | Retry-like event indicator | Yes | Available to ordinary models when present | |
| | `pair_freq` | `float32` / derived | Sender-receiver interaction-frequency feature | Yes | Derived from visible event history | |
| | `risk_noisy` | `float32` | Noisy synthetic risk feature | Yes | Benchmark feature, not an audit signal | |
| | `txn_count_10` | `float32` / derived | Recent-count feature over a short window | Yes | Derived from visible history | |
| | `amount_sum_10` | `float32` / derived | Recent amount-sum feature | Yes | Derived from visible history | |
| | `is_fraud` | `int8` | Binary fraud label | No | Supervision target only, not a model input | |
| | `twin_pair_id` | `int64` | Matched fraud/benign pair identifier | No | Audit/oracle-only; not exposed to learned baselines | |
| | `twin_role` | `string` | Twin role such as `fraud`, `benign`, or `background` | No | Audit/oracle-only | |
| | `twin_label` | `int8` | Pairwise matched label for audit utilities | No | Audit/oracle-only | |
| | `template_id` | `int64` | Source template identifier used during twin construction | No | Audit/oracle-only | |
| | `dynamic_fraud_state` | `float32` | Latent synthetic fraud-state variable | No | Hidden mechanism for analysis only | |
| | `motif_source` | `int8` | Indicator for motif-source events in a sequence | No | Audit/oracle-only | |
| | `motif_hit_count` | `int32` | Count of motif hits in the sequence | No | Audit/oracle-only | |
| | `trigger_event_idx` | `int32` | Local event index of the trigger event | No | Audit/oracle-only | |
| | `label_event_idx` | `int32` | Local event index at which the fraud label becomes active | No | Audit/oracle-only | |
| | `label_delay` | `int32` | Delay between trigger and labeled event index | No | Audit/oracle-only | |
| | `fraud_source` | `string` | Cause of fraud label, e.g. motif or fallback chain | No | Audit/oracle-only | |
| | `is_fallback_label` | `int8` | Indicator that a label came from fallback logic | No | Audit/oracle-only | |
| | `motif_chain_state` | `float32` | Internal motif-chain analysis field | No | Audit/oracle-only | |
| | `motif_strength` | `float32` | Internal motif-strength analysis field | No | Audit/oracle-only | |
|
|
| Not every baseline uses every model-facing column. The important guarantee is that learned baselines do not receive the audit/oracle-only fields listed above. |
|
|
| ## 9. Model-Facing vs Audit/Oracle-Only Columns |
|
|
| Ordinary learned baselines are restricted to model-facing transaction attributes and histories. In this repository, audit/oracle-only columns are explicitly stripped before learned baselines are trained or evaluated. |
|
|
| Ordinary models may use fields such as: |
|
|
| - `sender_id` |
| - `receiver_id` |
| - `timestamp` |
| - `amount` |
| - `risk_score` |
| - `fail_prob` |
| - `failed` |
| - `txn_type` |
| - other derived non-oracle features built from visible prefix history |
|
|
| Ordinary models must not use: |
|
|
| - `motif_hit_count` |
| - `motif_source` |
| - `trigger_event_idx` |
| - `label_event_idx` |
| - `label_delay` |
| - `fraud_source` |
| - `twin_role` |
| - `twin_label` |
| - `twin_pair_id` |
| - `template_id` |
| - `dynamic_fraud_state` |
| - other oracle-only diagnostics |
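In code, the separation can be enforced by dropping the audit/oracle-only columns before training. A minimal pandas sketch (the repository's own stripping utility may differ):

```python
import pandas as pd

# Audit/oracle-only columns from the schema; learned baselines must not see them.
AUDIT_ONLY = [
    "motif_hit_count", "motif_source", "trigger_event_idx", "label_event_idx",
    "label_delay", "fraud_source", "twin_role", "twin_label", "twin_pair_id",
    "template_id", "dynamic_fraud_state", "is_fallback_label",
    "motif_chain_state", "motif_strength",
]

def strip_audit_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop whichever audit/oracle-only columns are present; the label column
    `is_fraud` is handled separately as the supervision target."""
    return df.drop(columns=[c for c in AUDIT_ONLY if c in df.columns])

df = pd.DataFrame({"amount": [12.5], "risk_score": [0.2],
                   "twin_pair_id": [7], "motif_strength": [0.9]})
assert list(strip_audit_columns(df).columns) == ["amount", "risk_score"]
```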
|
|
| This separation is necessary for the benchmark claim that performance should come from temporal reasoning rather than privileged audit information. |
|
|
| ## 10. Benchmark Tasks |
|
|
| Temporal Twins supports the following benchmark task: |
|
|
| - binary fraud detection on matched prefix examples |
|
|
| The standard evaluation protocol is: |
|
|
| - build matched fraud/benign examples |
| - truncate each sender history at the matched prefix index `k` |
| - train or score on the visible prefix only |
| - evaluate binary discrimination at the matched example level |
|
|
| Primary reported metrics include: |
|
|
| - ROC-AUC |
| - PR-AUC |
| - shuffled-order ROC-AUC |
| - shuffle delta = shuffled ROC-AUC minus clean ROC-AUC |
|
|
The shuffled-order test is the key diagnostic: it measures how much performance depends on event order rather than on unordered ingredients, so a strongly negative shuffle delta indicates genuine order sensitivity.
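A toy illustration of the shuffle test (the motif, the scorer, and the pairwise AUC implementation are all illustrative, not the benchmark's actual evaluation code):

```python
import random

def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) ROC-AUC; O(n_pos * n_neg), fine for a sketch."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def motif_score(events):
    """Toy order-sensitive scorer: fires only on the contiguous motif."""
    return 1.0 if "burst,release,burst" in ",".join(events) else 0.0

# 50 matched pairs: same unordered ingredients, different order.
pairs = [(["a", "burst", "release", "burst"], 1),
         (["burst", "a", "burst", "release"], 0)] * 50
labels = [y for _, y in pairs]

clean = roc_auc(labels, [motif_score(e) for e, _ in pairs])

rng = random.Random(0)
shuffled_scores = []
for events, _ in pairs:
    events = events[:]
    rng.shuffle(events)
    shuffled_scores.append(motif_score(events))
shuffled = roc_auc(labels, shuffled_scores)

delta = shuffled - clean  # shuffle delta; strongly negative => order-dependent
assert clean == 1.0
assert delta <= 0.0
```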
|
|
| ## 11. Baselines and Reference Results |
|
|
| The frozen 5-seed paper suite uses: |
|
|
| - `num_users = 350` |
| - `simulation_days = 45` |
| - `seeds = [0, 1, 2, 3, 4]` |
| - `fast_mode = false` |
| - `n_checkpoints = 8` |
|
|
| Compact reference results: |
|
|
| Mode | Primary reference | Secondary reference | XGBoost ROC-AUC | StaticGNN ROC-AUC | SeqGRU ROC-AUC | SeqGRU shuffled delta |
|---|---:|---:|---:|---:|---:|---:|
| `oracle_calib` | `AuditOracle 1.0000 ± 0.0000` | `RawMotifOracle 1.0000 ± 0.0000` | `0.5000 ± 0.0000` | `0.5222 ± 0.0235` | `1.0000 ± 0.0000` | `-0.5032 ± 0.0043` |
| `easy` | `MotifProbe 1.0000 ± 0.0000` | `RawMotifProbe 0.9983 ± 0.0011` | `0.5000 ± 0.0000` | `0.4946 ± 0.0128` | `1.0000 ± 0.0000` | `-0.5003 ± 0.0096` |
| `medium` | `MotifProbe 0.6374 ± 0.0069` | `RawMotifProbe 0.6482 ± 0.0058` | `0.5000 ± 0.0000` | `0.4922 ± 0.0203` | `0.8391 ± 0.0174` | `-0.3337 ± 0.0191` |
| `hard` | `MotifProbe 0.5790 ± 0.0045` | `RawMotifProbe 0.5910 ± 0.0105` | `0.5000 ± 0.0000` | `0.5026 ± 0.0198` | `0.6876 ± 0.0128` | `-0.1883 ± 0.0111` |
|
|
| Static shortcut audit across all 20 paper-suite runs: |
|
|
- `static_agg_auc = 0.5000 ± 0.0000`
- `total_txn_count AUC = 0.5000 ± 0.0000`
- `local_event_idx AUC = 0.5000 ± 0.0000`
- `prefix_txn_count AUC = 0.5000 ± 0.0000`
- `timestamp AUC = 0.5000 ± 0.0000`
- `account_age AUC = 0.5000 ± 0.0000`
- `active_age AUC = 0.5000 ± 0.0000`
- `benign_motif_hit_rate = 0.0000 ± 0.0000`
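The static-shortcut audit can be illustrated with a single matched feature used directly as a score (column values and the pairwise AUC implementation are illustrative): when positives and negatives are exactly matched on the aggregate, its AUC collapses to 0.5 by construction.

```python
def single_feature_auc(labels, feature):
    """Pairwise ROC-AUC of one static feature used directly as the score."""
    pos = [v for y, v in zip(labels, feature) if y == 1]
    neg = [v for y, v in zip(labels, feature) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Matched twins: within each fraud/benign pair the static aggregate is equal,
# so the feature cannot separate the classes.
labels = [1, 0, 1, 0, 1, 0]
total_txn_count = [7, 7, 12, 12, 9, 9]
assert single_feature_auc(labels, total_txn_count) == 0.5
```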
|
|
| These results support the intended interpretation: |
|
|
| - static shortcuts are neutralized |
| - `oracle_calib` validates matched-prefix correctness |
| - `easy` is readily learnable by order-sensitive sequence models |
| - `medium` remains learnable but meaningfully harder |
| - `hard` remains above static baselines but is substantially more challenging |
|
|
| Full paper-suite artifacts, including temporal GNN results and per-seed CSVs, are stored under: |
|
|
| - `results/paper_suite_20260503_202810/` |
|
|
| ## 12. Intended Use |
|
|
| This dataset is intended for: |
|
|
| - research on temporal fraud detection |
| - benchmarking order-sensitive sequence and temporal-graph models |
| - evaluating whether performance survives matched static controls |
| - studying delayed labels and prefix-only evaluation |
| - comparing clean-order and shuffled-order performance |
|
|
| It is appropriate for methodology papers, controlled ablation studies, and robustness checks on temporal inductive bias. |
|
|
| ## 13. Out-of-Scope Use |
|
|
| Temporal Twins is out of scope for: |
|
|
| - direct training of production fraud systems |
| - making real financial, banking, or payment decisions |
| - approving or denying transactions for real users |
| - risk-scoring real individuals or organizations |
| - regulatory, legal, or operational decisions in production financial systems |
|
|
| The dataset must not be used to train production fraud systems directly or to make real financial decisions. |
|
|
| ## 14. Limitations |
|
|
| Important limitations include: |
|
|
| - the benchmark is fully synthetic and reflects designer assumptions |
| - user behavior, fraud behavior, and benign behavior are simplified relative to real financial ecosystems |
| - the only ground truth is the generator's own labeling logic |
| - real-world fraud often depends on richer institutional, device, merchant, and social context not present here |
| - difficulty levels are benchmark design choices, not calibrated measures of real operational difficulty |
| - temporal GNN underperformance on this benchmark should not be generalized to all real fraud settings |
|
|
| ## 15. Biases and Risks |
|
|
| As a synthetic benchmark, Temporal Twins inherits the modeling biases of its generator: |
|
|
| - it emphasizes order-sensitive motifs chosen by the benchmark designers |
| - it encodes a particular notion of delayed fraud and camouflage |
| - it may reward models that are well aligned to these synthetic mechanisms |
| - it may underrepresent other real fraud styles not captured by the generator |
|
|
| There is also a scientific risk: |
|
|
- because the benchmark intentionally removes common static shortcuts, performance on Temporal Twins may differ from performance on operational datasets where those shortcuts exist, and the difference can go in either direction
|
|
| ## 16. Privacy and Sensitive Data |
|
|
| Temporal Twins contains no real financial or personal data. |
|
|
| Specifically: |
|
|
| - no real UPI data |
| - no real users |
| - no real bank accounts |
| - no real transactions |
| - no personal financial records |
| - no protected demographic attributes |
|
|
| All user IDs, receiver IDs, timestamps, amounts, and risk signals are synthetic artifacts produced by the generator. |
|
|
| ## 17. Ethical Considerations |
|
|
| Temporal Twins is safer to share than real financial logs because it does not contain real persons or institutions. However, ethical care is still needed. |
|
|
| Users of the dataset should not: |
|
|
| - present synthetic results as direct evidence of production readiness |
| - claim fairness or social validity that has not been tested on real populations |
| - use the dataset as justification for automated decisions about real people |
|
|
| The intended ethical use is research benchmarking, not operational deployment. |
|
|
| ## 18. Reproducibility |
|
|
| The repository includes deterministic generation and evaluation settings for the frozen paper suite. |
|
|
| Paper-suite configuration: |
|
|
| - `num_users = 350` |
| - `simulation_days = 45` |
| - `seeds = [0, 1, 2, 3, 4]` |
| - `fast_mode = false` |
| - `n_checkpoints = 8` |
|
|
| Reproducibility properties: |
|
|
| - stable deterministic seed derivation is used for benchmark modes and profiles |
| - Python, NumPy, and PyTorch seeds are fixed per run |
| - deterministic runtime flags are enabled where safe |
| - matched-prefix datasets are reproducible under fixed config and seed |
| - the final paper suite in this repository is stored as deterministic CSV artifacts |
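A minimal per-run seeding sketch (the helper name is hypothetical and the repository's own seeding utility may differ; PyTorch seeding and deterministic-runtime flags would be added where `torch` is installed):

```python
import os
import random
import numpy as np

def seed_everything(seed: int) -> None:
    """Fix the common RNG sources (hypothetical helper name; the repository's
    own seeding utility may differ). PyTorch seeding and deterministic-runtime
    flags would be added here where torch is installed."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)

seed_everything(0)
a = np.random.rand(3)
seed_everything(0)
b = np.random.rand(3)
assert np.array_equal(a, b)  # identical draws under the same seed
```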
|
|
| Reference artifacts: |
|
|
| - `results/paper_suite_20260503_202810/paper_suite_runs.csv` |
| - `results/paper_suite_20260503_202810/paper_suite_summary.csv` |
| - `results/paper_suite_20260503_202810/paper_suite_runtime.csv` |
| - `results/paper_suite_20260503_202810/paper_suite_failed_checks.csv` |
|
|
| ## 19. Hosting, License, and Citation |
|
|
| ### Hosting |
|
|
| The benchmark is currently generated from code in this repository rather than distributed as a fixed external archive. |
|
|
| Current status: |
|
|
| - dataset hosting location: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins) |
| - code repository: [https://huggingface.co/temporal-twins-benchmark/temporal-twins-code](https://huggingface.co/temporal-twins-benchmark/temporal-twins-code) |
| - canonical pre-generated release archive: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins) |
| - Croissant metadata URL: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json) |
| - Croissant metadata browser page: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/blob/main/metadata/temporal_twins_croissant.json](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/blob/main/metadata/temporal_twins_croissant.json) |
| - data files: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/data](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/data) |
| - results: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/results](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/results) |
| - configs: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/configs](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/configs) |
| - metadata directory: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/metadata](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/metadata) |
| - paper or preprint: Not available during double-blind review; to be added after publication. |
| - reference paper-suite results: `results/paper_suite_20260503_202810/` |
|
|
| The Croissant file passes JSON, schema, and Responsible AI validation in the official checker. The optional records-generation test currently reports a known Parquet-in-zip streaming issue; direct pandas/pyarrow loading instructions are provided in `data/README_GENERATION.md`. |
|
|
| ### License |
|
|
| - Dataset license: `CC BY 4.0` (`CC-BY-4.0`) |
| - Code license: `Apache License 2.0` (`Apache-2.0`) |
|
|
| ### Citation |
|
|
| The final paper or preprint citation is not available during double-blind review and will be added after publication. |
|
|
| `TODO` placeholder BibTeX: |
|
|
| ```bibtex |
| @dataset{temporal_twins_todo, |
| title = {Temporal Twins: A Synthetic UPI-Style Benchmark for Temporal Fraud Detection}, |
| author = {TODO}, |
| year = {TODO}, |
| howpublished = {TODO}, |
| note = {Synthetic matched-prefix temporal fraud benchmark}, |
| url = {TODO} |
| } |
| ``` |
|
|