Temporal Twins Dataset Card

1. Dataset Summary

Temporal Twins is a synthetic UPI-style transaction benchmark for temporal fraud detection. It is designed to evaluate whether a model can distinguish fraud from benign behavior using order-sensitive temporal structure rather than static aggregates such as total transaction count, account age, or prefix length.

The benchmark simulates users sending transactions over time and then assigns fraud labels through delayed temporal mechanisms. Its core design is a matched fraud/benign temporal-twin construction:

  • each positive example is a fraud twin evaluated at a local event index k
  • each negative example is a benign twin evaluated at the same local event index k
  • both twins are matched on static and prefix-level summaries
  • the benign twin contains the same unordered ingredients but violates the fraud-relevant temporal order

Temporal Twins exposes four benchmark modes:

  • oracle_calib
  • easy
  • medium
  • hard

The frozen paper-suite configuration used in this repository is:

  • num_users = 350
  • simulation_days = 45
  • seeds = [0, 1, 2, 3, 4]
  • fast_mode = false
  • n_checkpoints = 8
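
For concreteness, the same settings can be written as a plain Python mapping. The key names below mirror the list above but are illustrative; they are not necessarily the generator's actual config schema:

```python
# Frozen paper-suite configuration (illustrative key names).
PAPER_SUITE_CONFIG = {
    "num_users": 350,
    "simulation_days": 45,
    "seeds": [0, 1, 2, 3, 4],
    "fast_mode": False,
    "n_checkpoints": 8,
}
```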

2. Dataset Motivation

Many fraud datasets can be solved by static shortcuts: longer histories, later evaluation times, higher transaction counts, or other aggregate correlates can make a benchmark look temporally rich while actually rewarding non-temporal models. Temporal Twins was built to remove those shortcuts and isolate order-sensitive temporal reasoning.

The benchmark therefore aims to answer a narrower research question:

  • when static summaries are matched between positives and negatives, can a model still recover delayed fraud signals from temporal order alone?

It is intended for benchmarking temporal representation learning, causal order sensitivity, and delayed-label detection under controlled synthetic conditions.

3. Dataset Composition

Temporal Twins is generated programmatically from synthetic user and transaction processes. There is no fixed real-world corpus. Each generated artifact is an event table in which each row is a synthetic transaction.

At a high level, each run contains:

  • a synthetic user population
  • a synthetic stream of UPI-style transactions
  • risk-engine outputs such as transaction risk scores and failures
  • benchmark-specific fraud and audit annotations
  • matched fraud/benign evaluation pairs extracted from the event stream

The paper-scale suite in this repository contains 20 deterministic runs:

  • oracle_calib with seeds 0..4
  • easy with seeds 0..4
  • medium with seeds 0..4
  • hard with seeds 0..4
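
The run grid can be enumerated directly. A minimal sketch (the constant names are illustrative):

```python
from itertools import product

MODES = ["oracle_calib", "easy", "medium", "hard"]
SEEDS = [0, 1, 2, 3, 4]

# 4 modes x 5 seeds = the 20 deterministic paper-suite runs.
runs = list(product(MODES, SEEDS))
assert len(runs) == 20
```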

Mean matched evaluation-pair counts in the frozen paper suite are:

| Mode | Matched evaluation pairs (mean ± std) |
|---|---|
| oracle_calib | 2606.6 ± 454.3 |
| easy | 2222.2 ± 128.4 |
| medium | 2356.6 ± 18.0 |
| hard | 2317.6 ± 22.0 |

Each paper-suite run is class-balanced at evaluation time:

  • positives = negatives
  • positive rate = 0.5000

4. Dataset Generation Process

The generation pipeline has four stages:

  1. Synthetic user generation
  2. Synthetic transaction generation
  3. Synthetic risk and retry generation
  4. Fraud-mechanism and matched-twin generation

More concretely:

  1. A synthetic user set is created with user-level behavioral parameters.
  2. A synthetic transaction stream is sampled with sender IDs, receiver IDs, timestamps, transaction amounts, and transaction types.
  3. A risk engine adds synthetic risk-related fields such as risk_score, fail_prob, failed, and retry-like events.
  4. The fraud engine applies benchmark-mode-specific temporal mechanisms and constructs matched temporal twins.

For the temporal_twins benchmark family, the generator then:

  • constructs fraud twins and benign twins from matched carrier users and templates
  • preserves matched static and prefix-level summaries
  • injects delayed fraud labels into fraud twins
  • forces benign twins to avoid the fraud-relevant temporal motif while retaining similar unordered ingredients

The benchmark is deterministic under fixed configuration, seed, and runtime settings.
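
As a toy illustration of the pipeline shape, the self-contained sketch below mimics stages 1-3 and notes where stage 4 would apply. All distributions, constants, and names here are invented for illustration; they are not the generator's actual mechanisms:

```python
import numpy as np

def generate_run(num_users: int, simulation_days: int, seed: int) -> dict:
    """Toy sketch of stages 1-3; the real generator (and stage 4's
    twin construction) is far richer than this."""
    rng = np.random.default_rng(seed)  # deterministic under a fixed seed
    # Stage 1: user population with per-user activity rates.
    rates = rng.gamma(2.0, 1.0, size=num_users)
    n_events = int(rates.sum() * simulation_days)
    # Stage 2: UPI-style transaction stream.
    events = {
        "sender_id": rng.integers(0, num_users, n_events),
        "receiver_id": rng.integers(0, num_users, n_events),
        "timestamp": np.sort(rng.uniform(0, simulation_days * 86400.0, n_events)),
        "amount": rng.lognormal(5.0, 1.0, n_events),
    }
    # Stage 3: synthetic risk-engine outputs.
    events["risk_score"] = rng.beta(2.0, 5.0, n_events)
    events["failed"] = (rng.random(n_events) < 0.1 * events["risk_score"]).astype(np.int8)
    # Stage 4 (not sketched): mode-specific delayed labels and matched twins.
    return events
```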

5. Fraud Mechanisms

Temporal Twins uses delayed, order-sensitive fraud mechanisms rather than directly labeling static outliers. Important mechanisms include:

  • velocity-like activity acceleration
  • retry-like behavior
  • delayed receiver revisits
  • burst-release-burst motifs
  • adversarial timing perturbations
  • delayed fraud assignment
  • hidden latent fraud-state dynamics

These mechanisms are combined with difficulty-dependent noise and camouflage. In the standard easy, medium, and hard modes, the fraud signal is intentionally imperfect and partially obscured. In oracle_calib, the construction is designed to validate motif and evaluation alignment under matched-prefix conditions.

6. Matched-Control Construction

The central benchmark control is the fraud/benign temporal twin.

For every fraud twin labeled positive at local event index k:

  • the benign twin is evaluated at the same local event index k
  • both examples use the same local prefix length
  • both examples are truncated at prefix index k
  • no future events are visible to the model

Within each matched pair, the protocol additionally matches:

  • total transaction count
  • local prefix length
  • evaluation timestamp
  • account age
  • active age
  • receiver histograms
  • static aggregate summaries

In words:

  • the fraud twin contains a temporally meaningful order pattern that triggers a delayed positive label
  • the benign twin contains comparable ingredients and prefix statistics but violates the fraud-relevant temporal order

This design is meant to prevent performance from arising from:

  • longer histories
  • older accounts
  • later prefix positions
  • different transaction totals
  • unmatched prefix ages
  • benign negatives evaluated at arbitrary or easier positions
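
A minimal sketch of how one matched pair could be sliced into its two visible prefixes, assuming a pandas event table that carries the audit fields described in the schema section and per-pair rows ordered by local event index. The actual construction lives in the generator; this is illustrative only:

```python
import pandas as pd

def matched_prefixes(events: pd.DataFrame, pair_id: int):
    """Return the visible prefixes for one matched fraud/benign pair."""
    pair = events[events["twin_pair_id"] == pair_id]
    fraud = pair[pair["twin_role"] == "fraud"]
    benign = pair[pair["twin_role"] == "benign"]
    # Both twins are evaluated at the same local event index k.
    k = int(fraud["label_event_idx"].iloc[0])
    # Truncate at prefix index k: no future events are visible.
    return fraud.iloc[: k + 1], benign.iloc[: k + 1]
```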

7. Dataset Modes and Difficulty Ladder

Temporal Twins provides four modes.

oracle_calib

This is the calibration mode used to validate that the matched-prefix protocol is working as intended.

  • Oracle metrics remain near-perfect.
  • Static shortcut baselines remain at chance.
  • Benign motif hit rate remains zero.
  • This mode is primarily for protocol validation rather than realistic difficulty.

easy

  • strong motif signal
  • low noise
  • shorter delay
  • expected SeqGRU ROC-AUC near 0.90-1.00

medium

  • moderate motif signal
  • moderate noise
  • longer delay
  • expected SeqGRU ROC-AUC near 0.80-0.90

hard

  • weaker motif signal
  • longer delay
  • adversarial perturbations and decoys
  • expected SeqGRU ROC-AUC near 0.70-0.85

Naming convention:

  • in oracle_calib, AuditOracle and RawMotifOracle are true oracle-style references
  • in standard easy, medium, and hard, the corresponding scores are reported as MotifProbe and RawMotifProbe because realism and noise make them probes rather than perfect oracles

8. Data Schema

The event table contains model-facing fields, supervision labels, and audit/oracle-only fields. The table below lists the most important columns used in this repository.

| Column name | Type | Description | Exposed to ordinary models? | Notes |
|---|---|---|---|---|
| txn_id | int32 | Synthetic transaction identifier | Yes | Identifier only; not a benchmark target |
| sender_id | int32 / int64 | Synthetic sender account ID | Yes | Node identity available to temporal models |
| receiver_id | int32 / int64 | Synthetic receiver account ID | Yes | Used for graph and sequence structure |
| timestamp | float32 | Synthetic event time in seconds from simulation start | Yes | Prefix truncation is based on timestamp and local index |
| amount | float32 | Synthetic transaction amount | Yes | Not tied to real currency records |
| txn_type | int8 | Synthetic transaction-type code | Yes | UPI-style categorical event attribute |
| risk_score | float32 | Synthetic risk score from the risk engine | Yes | No real production risk model is used |
| fail_prob | float32 | Synthetic failure probability | Yes | Risk-engine output |
| failed | int8 | Binary failure indicator | Yes | Used as a normal model-facing field |
| is_retry | int8 / derived | Retry-like event indicator | Yes | Available to ordinary models when present |
| pair_freq | float32 / derived | Sender-receiver interaction-frequency feature | Yes | Derived from visible event history |
| risk_noisy | float32 | Noisy synthetic risk feature | Yes | Benchmark feature, not an audit signal |
| txn_count_10 | float32 / derived | Recent-count feature over a short window | Yes | Derived from visible history |
| amount_sum_10 | float32 / derived | Recent amount-sum feature | Yes | Derived from visible history |
| is_fraud | int8 | Binary fraud label | No | Supervision target only, not a model input |
| twin_pair_id | int64 | Matched fraud/benign pair identifier | No | Audit/oracle-only; not exposed to learned baselines |
| twin_role | string | Twin role such as fraud, benign, or background | No | Audit/oracle-only |
| twin_label | int8 | Pairwise matched label for audit utilities | No | Audit/oracle-only |
| template_id | int64 | Source template identifier used during twin construction | No | Audit/oracle-only |
| dynamic_fraud_state | float32 | Latent synthetic fraud-state variable | No | Hidden mechanism for analysis only |
| motif_source | int8 | Indicator for motif-source events in a sequence | No | Audit/oracle-only |
| motif_hit_count | int32 | Count of motif hits in the sequence | No | Audit/oracle-only |
| trigger_event_idx | int32 | Local event index of the trigger event | No | Audit/oracle-only |
| label_event_idx | int32 | Local event index at which the fraud label becomes active | No | Audit/oracle-only |
| label_delay | int32 | Delay between trigger and labeled event index | No | Audit/oracle-only |
| fraud_source | string | Cause of fraud label, e.g. motif or fallback chain | No | Audit/oracle-only |
| is_fallback_label | int8 | Indicator that a label came from fallback logic | No | Audit/oracle-only |
| motif_chain_state | float32 | Internal motif-chain analysis field | No | Audit/oracle-only |
| motif_strength | float32 | Internal motif-strength analysis field | No | Audit/oracle-only |

Not every baseline uses every model-facing column. The important guarantee is that learned baselines do not receive the audit/oracle-only fields listed above.
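
A minimal loading sketch with pandas, assuming a CSV artifact with the hypothetical filename `events.csv`, spot-checking a few guarantees from the schema table above:

```python
import pandas as pd

# Hypothetical filename; actual artifact names differ per mode and seed.
events = pd.read_csv("events.csv")

# Model-facing identifiers and attributes are present.
assert {"txn_id", "sender_id", "receiver_id", "timestamp", "amount"} <= set(events.columns)
# The supervision target is binary.
assert set(events["is_fraud"].unique()) <= {0, 1}
```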

9. Model-Facing vs Audit/Oracle-Only Columns

Ordinary learned baselines are restricted to model-facing transaction attributes and histories. In this repository, audit/oracle-only columns are explicitly stripped before learned baselines are trained or evaluated.

Ordinary models may use fields such as:

  • sender_id
  • receiver_id
  • timestamp
  • amount
  • risk_score
  • fail_prob
  • failed
  • txn_type
  • other derived non-oracle features built from visible prefix history

Ordinary models must not use:

  • motif_hit_count
  • motif_source
  • trigger_event_idx
  • label_event_idx
  • label_delay
  • fraud_source
  • twin_role
  • twin_label
  • twin_pair_id
  • template_id
  • dynamic_fraud_state
  • other oracle-only diagnostics

This separation is necessary for the benchmark claim that performance should come from temporal reasoning rather than privileged audit information.
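
A minimal sketch of the stripping step, assuming pandas event tables. The column list mirrors the audit/oracle-only rows in the schema table; `model_facing_view` is a hypothetical helper name, not the repository's API:

```python
import pandas as pd

AUDIT_ONLY = [
    "twin_pair_id", "twin_role", "twin_label", "template_id",
    "dynamic_fraud_state", "motif_source", "motif_hit_count",
    "trigger_event_idx", "label_event_idx", "label_delay",
    "fraud_source", "is_fallback_label", "motif_chain_state",
    "motif_strength",
]

def model_facing_view(events: pd.DataFrame):
    """Split an event table into model inputs and the supervision label,
    dropping every audit/oracle-only column listed above."""
    present = [c for c in AUDIT_ONLY if c in events.columns]
    X = events.drop(columns=present + ["is_fraud"])
    y = events["is_fraud"]
    return X, y
```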

10. Benchmark Tasks

Temporal Twins supports the following benchmark task:

  • binary fraud detection on matched prefix examples

The standard evaluation protocol is:

  • build matched fraud/benign examples
  • truncate each sender history at the matched prefix index k
  • train or score on the visible prefix only
  • evaluate binary discrimination at the matched example level

Primary reported metrics include:

  • ROC-AUC
  • PR-AUC
  • shuffled-order ROC-AUC
  • shuffle delta = shuffled ROC-AUC minus clean ROC-AUC

The shuffled-order test is important: it measures how much performance depends on event order rather than on unordered ingredients. A strongly negative shuffle delta indicates that a model's signal collapses once event order is destroyed.
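
A minimal sketch of the metric computation with scikit-learn, assuming a hypothetical `score_fn` that maps one prefix (an array of events) to a fraud score:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate(score_fn, prefixes, labels, seed=0):
    """Compute clean-order and shuffled-order metrics for one run."""
    rng = np.random.default_rng(seed)
    clean = np.array([score_fn(p) for p in prefixes])
    # Permuting each prefix destroys order but keeps the unordered ingredients.
    shuffled = np.array([score_fn(rng.permutation(p)) for p in prefixes])
    clean_auc = roc_auc_score(labels, clean)
    shuffled_auc = roc_auc_score(labels, shuffled)
    return {
        "roc_auc": clean_auc,
        "pr_auc": average_precision_score(labels, clean),
        "shuffled_roc_auc": shuffled_auc,
        "shuffle_delta": shuffled_auc - clean_auc,  # shuffled minus clean
    }
```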

11. Baselines and Reference Results

The frozen 5-seed paper suite uses:

  • num_users = 350
  • simulation_days = 45
  • seeds = [0, 1, 2, 3, 4]
  • fast_mode = false
  • n_checkpoints = 8

Compact reference results:

| Mode | Primary reference | Secondary reference | XGBoost ROC-AUC | StaticGNN ROC-AUC | SeqGRU ROC-AUC | SeqGRU shuffled delta |
|---|---|---|---|---|---|---|
| oracle_calib | AuditOracle 1.0000 ± 0.0000 | RawMotifOracle 1.0000 ± 0.0000 | 0.5000 ± 0.0000 | 0.5222 ± 0.0235 | 1.0000 ± 0.0000 | -0.5032 ± 0.0043 |
| easy | MotifProbe 1.0000 ± 0.0000 | RawMotifProbe 0.9983 ± 0.0011 | 0.5000 ± 0.0000 | 0.4946 ± 0.0128 | 1.0000 ± 0.0000 | -0.5003 ± 0.0096 |
| medium | MotifProbe 0.6374 ± 0.0069 | RawMotifProbe 0.6482 ± 0.0058 | 0.5000 ± 0.0000 | 0.4922 ± 0.0203 | 0.8391 ± 0.0174 | -0.3337 ± 0.0191 |
| hard | MotifProbe 0.5790 ± 0.0045 | RawMotifProbe 0.5910 ± 0.0105 | 0.5000 ± 0.0000 | 0.5026 ± 0.0198 | 0.6876 ± 0.0128 | -0.1883 ± 0.0111 |

Static shortcut audit across all 20 paper-suite runs:

  • static_agg_auc = 0.5000 ± 0.0000
  • total_txn_count AUC = 0.5000 ± 0.0000
  • local_event_idx AUC = 0.5000 ± 0.0000
  • prefix_txn_count AUC = 0.5000 ± 0.0000
  • timestamp AUC = 0.5000 ± 0.0000
  • account_age AUC = 0.5000 ± 0.0000
  • active_age AUC = 0.5000 ± 0.0000
  • benign_motif_hit_rate = 0.0000 ± 0.0000
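
A minimal sketch of the shortcut audit, assuming a pandas table with one row per matched example that carries the static summaries above plus is_fraud; `static_shortcut_audit` is a hypothetical helper:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def static_shortcut_audit(examples: pd.DataFrame) -> dict:
    """AUC of ranking matched examples by each static summary alone.
    Values near 0.5 mean the summary carries no label signal."""
    cols = ["total_txn_count", "local_event_idx", "prefix_txn_count",
            "timestamp", "account_age", "active_age"]
    return {c: roc_auc_score(examples["is_fraud"], examples[c]) for c in cols}
```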

These results support the intended interpretation:

  • static shortcuts are neutralized
  • oracle_calib validates matched-prefix correctness
  • easy is readily learnable by order-sensitive sequence models
  • medium remains learnable but meaningfully harder
  • hard remains above static baselines but is substantially more challenging

Full paper-suite artifacts, including temporal GNN results and per-seed CSVs, are stored under:

  • results/paper_suite_20260503_202810/

12. Intended Use

This dataset is intended for:

  • research on temporal fraud detection
  • benchmarking order-sensitive sequence and temporal-graph models
  • evaluating whether performance survives matched static controls
  • studying delayed labels and prefix-only evaluation
  • comparing clean-order and shuffled-order performance

It is appropriate for methodology papers, controlled ablation studies, and robustness checks on temporal inductive bias.

13. Out-of-Scope Use

Temporal Twins is out of scope for:

  • direct training of production fraud systems
  • making real financial, banking, or payment decisions
  • approving or denying transactions for real users
  • risk-scoring real individuals or organizations
  • regulatory, legal, or operational decisions in production financial systems

The dataset must not be used to train production fraud systems directly or to make real financial decisions.

14. Limitations

Important limitations include:

  • the benchmark is fully synthetic and reflects designer assumptions
  • user behavior, fraud behavior, and benign behavior are simplified relative to real financial ecosystems
  • the only ground truth is the generator's own labeling logic
  • real-world fraud often depends on richer institutional, device, merchant, and social context not present here
  • difficulty levels are benchmark design choices, not calibrated measures of real operational difficulty
  • temporal GNN underperformance on this benchmark should not be generalized to all real fraud settings

15. Biases and Risks

As a synthetic benchmark, Temporal Twins inherits the modeling biases of its generator:

  • it emphasizes order-sensitive motifs chosen by the benchmark designers
  • it encodes a particular notion of delayed fraud and camouflage
  • it may reward models that are well aligned to these synthetic mechanisms
  • it may underrepresent other real fraud styles not captured by the generator

There is also a scientific risk:

  • because the benchmark intentionally removes common static shortcuts, performance on Temporal Twins may differ from performance on operational datasets where those shortcuts exist, for better or worse

16. Privacy and Sensitive Data

Temporal Twins contains no real financial or personal data.

Specifically:

  • no real UPI data
  • no real users
  • no real bank accounts
  • no real transactions
  • no personal financial records
  • no protected demographic attributes

All user IDs, receiver IDs, timestamps, amounts, and risk signals are synthetic artifacts produced by the generator.

17. Ethical Considerations

Temporal Twins is safer to share than real financial logs because it does not contain real persons or institutions. However, ethical care is still needed.

Users of the dataset should not:

  • present synthetic results as direct evidence of production readiness
  • claim fairness or social validity that has not been tested on real populations
  • use the dataset as justification for automated decisions about real people

The intended ethical use is research benchmarking, not operational deployment.

18. Reproducibility

The repository includes deterministic generation and evaluation settings for the frozen paper suite.

Paper-suite configuration:

  • num_users = 350
  • simulation_days = 45
  • seeds = [0, 1, 2, 3, 4]
  • fast_mode = false
  • n_checkpoints = 8

Reproducibility properties:

  • stable deterministic seed derivation is used for benchmark modes and profiles
  • Python, NumPy, and PyTorch seeds are fixed per run
  • deterministic runtime flags are enabled where safe
  • matched-prefix datasets are reproducible under fixed config and seed
  • the final paper suite in this repository is stored as deterministic CSV artifacts
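
A minimal sketch of per-run seed fixing using standard Python, NumPy, and PyTorch calls; `seed_everything` is a hypothetical helper name:

```python
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    """Fix Python, NumPy, and PyTorch RNGs for one run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic runtime flags, enabled where safe.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```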

Reference artifacts:

  • results/paper_suite_20260503_202810/paper_suite_runs.csv
  • results/paper_suite_20260503_202810/paper_suite_summary.csv
  • results/paper_suite_20260503_202810/paper_suite_runtime.csv
  • results/paper_suite_20260503_202810/paper_suite_failed_checks.csv

19. Hosting, License, and Citation

Hosting

The benchmark is currently generated from code in this repository rather than distributed as a fixed external archive.

Current status:

The Croissant file passes JSON, schema, and Responsible AI validation in the official checker. The optional records-generation test currently reports a known Parquet-in-zip streaming issue; direct pandas/pyarrow loading instructions are provided in data/README_GENERATION.md.
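
A minimal direct-loading sketch with pandas and pyarrow; the filenames below are hypothetical placeholders, and data/README_GENERATION.md documents the actual artifact layout:

```python
import pandas as pd
import pyarrow.parquet as pq

# CSV artifacts load directly with pandas.
events = pd.read_csv("events.csv")

# Parquet artifacts load directly with pyarrow, bypassing zip streaming.
table = pq.read_table("events.parquet")
events = table.to_pandas()
```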

License

  • Dataset license: CC BY 4.0 (CC-BY-4.0)
  • Code license: Apache License 2.0 (Apache-2.0)

Citation

The final paper or preprint citation is not available during double-blind review and will be added after publication.

TODO placeholder BibTeX:

```bibtex
@dataset{temporal_twins_todo,
  title        = {Temporal Twins: A Synthetic UPI-Style Benchmark for Temporal Fraud Detection},
  author       = {TODO},
  year         = {TODO},
  howpublished = {TODO},
  note         = {Synthetic matched-prefix temporal fraud benchmark},
  url          = {TODO}
}
```