Temporal Twins Dataset Card
1. Dataset Summary
Temporal Twins is a synthetic UPI-style transaction benchmark for temporal fraud detection. It is designed to evaluate whether a model can distinguish fraud from benign behavior using order-sensitive temporal structure rather than static aggregates such as total transaction count, account age, or prefix length.
The benchmark simulates users sending transactions over time and then assigns fraud labels through delayed temporal mechanisms. Its core design is a matched fraud/benign temporal-twin construction:
- each positive example is a fraud twin evaluated at a local event index `k`
- each negative example is a benign twin evaluated at the same local event index `k`
- both twins are matched on static and prefix-level summaries
- the benign twin contains the same unordered ingredients but violates the fraud-relevant temporal order
Temporal Twins exposes four benchmark modes:
- `oracle_calib`
- `easy`
- `medium`
- `hard`
The frozen paper-suite configuration used in this repository is:
- `num_users = 350`
- `simulation_days = 45`
- `seeds = [0, 1, 2, 3, 4]`
- `fast_mode = false`
- `n_checkpoints = 8`
2. Dataset Motivation
Many fraud datasets can be solved by static shortcuts: longer histories, later evaluation times, higher transaction counts, or other aggregate correlates can make a benchmark look temporally rich while actually rewarding non-temporal models. Temporal Twins was built to remove those shortcuts and isolate order-sensitive temporal reasoning.
The benchmark therefore aims to answer a narrower research question:
- when static summaries are matched between positives and negatives, can a model still recover delayed fraud signals from temporal order alone?
It is intended for benchmarking temporal representation learning, causal order sensitivity, and delayed-label detection under controlled synthetic conditions.
3. Dataset Composition
Temporal Twins is generated programmatically from synthetic user and transaction processes. There is no fixed real-world corpus. Each generated artifact is an event table in which each row is a synthetic transaction.
At a high level, each run contains:
- a synthetic user population
- a synthetic stream of UPI-style transactions
- risk-engine outputs such as transaction risk scores and failures
- benchmark-specific fraud and audit annotations
- matched fraud/benign evaluation pairs extracted from the event stream
The paper-scale suite in this repository contains 20 deterministic runs:
- `oracle_calib` with seeds `0..4`
- `easy` with seeds `0..4`
- `medium` with seeds `0..4`
- `hard` with seeds `0..4`
Mean matched evaluation-pair counts in the frozen paper suite are:
| Mode | Matched evaluation pairs (mean +- std) |
|---|---|
| `oracle_calib` | 2606.6 +- 454.3 |
| `easy` | 2222.2 +- 128.4 |
| `medium` | 2356.6 +- 18.0 |
| `hard` | 2317.6 +- 22.0 |
Each paper-suite run is class-balanced at evaluation time:
- positives = negatives
- positive rate = `0.5000`
4. Dataset Generation Process
The generation pipeline has four stages:
- Synthetic user generation
- Synthetic transaction generation
- Synthetic risk and retry generation
- Fraud-mechanism and matched-twin generation
More concretely:
- A synthetic user set is created with user-level behavioral parameters.
- A synthetic transaction stream is sampled with sender IDs, receiver IDs, timestamps, transaction amounts, and transaction types.
- A risk engine adds synthetic risk-related fields such as `risk_score`, `fail_prob`, `failed`, and retry-like events.
- The fraud engine applies benchmark-mode-specific temporal mechanisms and constructs matched temporal twins.
For the temporal_twins benchmark family, the generator then:
- constructs fraud twins and benign twins from matched carrier users and templates
- preserves matched static and prefix-level summaries
- injects delayed fraud labels into fraud twins
- forces benign twins to avoid the fraud-relevant temporal motif while retaining similar unordered ingredients
The benchmark is deterministic under fixed configuration, seed, and runtime settings.
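To make the delayed-label stage concrete, here is a minimal illustrative sketch (not the repository's actual generator code; `inject_delayed_label` and its signature are hypothetical). A fraud motif completes at a trigger index, but the positive label only becomes active some events later, mirroring the `trigger_event_idx` / `label_event_idx` / `label_delay` fields described in the schema:

```python
def inject_delayed_label(n_events: int, trigger_idx: int, delay: int):
    """Toy sketch of delayed fraud labeling: the fraud motif completes at
    trigger_idx, but the positive label only activates at
    label_idx = trigger_idx + delay (clipped to the last event)."""
    label_idx = min(trigger_idx + delay, n_events - 1)
    labels = [0] * n_events
    labels[label_idx] = 1  # is_fraud becomes visible only here
    return labels, label_idx

# Motif completes at local index 5; the label surfaces 7 events later.
labels, label_idx = inject_delayed_label(n_events=20, trigger_idx=5, delay=7)
```

The clipping step matters: without it, a delay longer than the remaining sequence would place the label outside the visible event stream.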
5. Fraud Mechanisms
Temporal Twins uses delayed, order-sensitive fraud mechanisms rather than directly labeling static outliers. Important mechanisms include:
- velocity-like activity acceleration
- retry-like behavior
- delayed receiver revisits
- burst-release-burst motifs
- adversarial timing perturbations
- delayed fraud assignment
- hidden latent fraud-state dynamics
These mechanisms are combined with difficulty-dependent noise and camouflage. In the standard easy, medium, and hard modes, the fraud signal is intentionally imperfect and partially obscured. In oracle_calib, the construction is designed to validate motif and evaluation alignment under matched-prefix conditions.
6. Matched-Control Construction
The central benchmark control is the fraud/benign temporal twin.
For every fraud twin positive label at local event index `k`:
- the benign twin is evaluated at the same local event index `k`
- both examples use the same local prefix length
- both examples are truncated at prefix index `k`
- no future events are visible to the model
Within each matched pair, the protocol additionally matches:
- total transaction count
- local prefix length
- evaluation timestamp
- account age
- active age
- receiver histograms
- static aggregate summaries
In words:
- the fraud twin contains a temporally meaningful order pattern that triggers a delayed positive label
- the benign twin contains comparable ingredients and prefix statistics but violates the fraud-relevant temporal order
This design is meant to prevent performance from arising from:
- longer histories
- older accounts
- later prefix positions
- different transaction totals
- unmatched prefix ages
- benign negatives evaluated at arbitrary or easier positions
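The pair-level invariants above can be checked mechanically. The sketch below is a hypothetical audit helper (the function name, event representation, and check names are illustrative, not the repository's actual audit API), assuming each twin prefix is a list of `(receiver_id, amount)` events truncated at index `k`:

```python
from collections import Counter

def audit_twin_pair(fraud_prefix, benign_prefix, k):
    """Minimal matched-twin audit sketch: same prefix length, same
    receiver histogram (an unordered summary), and a different event
    order -- order is the only difference the protocol permits."""
    return {
        "same_prefix_length": len(fraud_prefix) == len(benign_prefix) == k,
        "same_receiver_histogram": (
            Counter(r for r, _ in fraud_prefix)
            == Counter(r for r, _ in benign_prefix)
        ),
        "different_order": fraud_prefix != benign_prefix,
    }

fraud = [(7, 10.0), (7, 10.0), (3, 5.0)]   # fraud-relevant order
benign = [(7, 10.0), (3, 5.0), (7, 10.0)]  # same ingredients, reordered
checks = audit_twin_pair(fraud, benign, k=3)
```

A real audit would extend the same pattern to the other matched quantities (account age, evaluation timestamp, static aggregates).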
7. Dataset Modes and Difficulty Ladder
Temporal Twins provides four modes.
oracle_calib
This is the calibration mode used to validate that the matched-prefix protocol is working as intended.
- Oracle metrics remain near-perfect.
- Static shortcut baselines remain at chance.
- Benign motif hit rate remains zero.
- This mode is primarily for protocol validation rather than realistic difficulty.
easy
- strong motif signal
- low noise
- shorter delay
- expected SeqGRU performance near `0.90-1.00`
medium
- moderate motif signal
- moderate noise
- longer delay
- expected SeqGRU performance near `0.80-0.90`
hard
- weaker motif signal
- longer delay
- adversarial perturbations and decoys
- expected SeqGRU performance near `0.70-0.85`
Naming convention:
- in `oracle_calib`, `AuditOracle` and `RawMotifOracle` are true oracle-style references
- in standard `easy`, `medium`, and `hard`, the corresponding scores are reported as `MotifProbe` and `RawMotifProbe` because realism and noise make them probes rather than perfect oracles
8. Data Schema
The event table contains model-facing fields, supervision labels, and audit/oracle-only fields. The table below lists the most important columns used in this repository.
| Column name | Type | Description | Exposed to ordinary models? | Notes |
|---|---|---|---|---|
| `txn_id` | int32 | Synthetic transaction identifier | Yes | Identifier only; not a benchmark target |
| `sender_id` | int32 / int64 | Synthetic sender account ID | Yes | Node identity available to temporal models |
| `receiver_id` | int32 / int64 | Synthetic receiver account ID | Yes | Used for graph and sequence structure |
| `timestamp` | float32 | Synthetic event time in seconds from simulation start | Yes | Prefix truncation is based on timestamp and local index |
| `amount` | float32 | Synthetic transaction amount | Yes | Not tied to real currency records |
| `txn_type` | int8 | Synthetic transaction-type code | Yes | UPI-style categorical event attribute |
| `risk_score` | float32 | Synthetic risk score from the risk engine | Yes | No real production risk model is used |
| `fail_prob` | float32 | Synthetic failure probability | Yes | Risk-engine output |
| `failed` | int8 | Binary failure indicator | Yes | Used as a normal model-facing field |
| `is_retry` | int8 / derived | Retry-like event indicator | Yes | Available to ordinary models when present |
| `pair_freq` | float32 / derived | Sender-receiver interaction-frequency feature | Yes | Derived from visible event history |
| `risk_noisy` | float32 | Noisy synthetic risk feature | Yes | Benchmark feature, not an audit signal |
| `txn_count_10` | float32 / derived | Recent-count feature over a short window | Yes | Derived from visible history |
| `amount_sum_10` | float32 / derived | Recent amount-sum feature | Yes | Derived from visible history |
| `is_fraud` | int8 | Binary fraud label | No | Supervision target only, not a model input |
| `twin_pair_id` | int64 | Matched fraud/benign pair identifier | No | Audit/oracle-only; not exposed to learned baselines |
| `twin_role` | string | Twin role such as `fraud`, `benign`, or `background` | No | Audit/oracle-only |
| `twin_label` | int8 | Pairwise matched label for audit utilities | No | Audit/oracle-only |
| `template_id` | int64 | Source template identifier used during twin construction | No | Audit/oracle-only |
| `dynamic_fraud_state` | float32 | Latent synthetic fraud-state variable | No | Hidden mechanism for analysis only |
| `motif_source` | int8 | Indicator for motif-source events in a sequence | No | Audit/oracle-only |
| `motif_hit_count` | int32 | Count of motif hits in the sequence | No | Audit/oracle-only |
| `trigger_event_idx` | int32 | Local event index of the trigger event | No | Audit/oracle-only |
| `label_event_idx` | int32 | Local event index at which the fraud label becomes active | No | Audit/oracle-only |
| `label_delay` | int32 | Delay between trigger and labeled event index | No | Audit/oracle-only |
| `fraud_source` | string | Cause of fraud label, e.g. motif or fallback chain | No | Audit/oracle-only |
| `is_fallback_label` | int8 | Indicator that a label came from fallback logic | No | Audit/oracle-only |
| `motif_chain_state` | float32 | Internal motif-chain analysis field | No | Audit/oracle-only |
| `motif_strength` | float32 | Internal motif-strength analysis field | No | Audit/oracle-only |
Not every baseline uses every model-facing column. The important guarantee is that learned baselines do not receive the audit/oracle-only fields listed above.
9. Model-Facing vs Audit/Oracle-Only Columns
Ordinary learned baselines are restricted to model-facing transaction attributes and histories. In this repository, audit/oracle-only columns are explicitly stripped before learned baselines are trained or evaluated.
Ordinary models may use fields such as:
- `sender_id`
- `receiver_id`
- `timestamp`
- `amount`
- `risk_score`
- `fail_prob`
- `failed`
- `txn_type`
- other derived non-oracle features built from visible prefix history
Ordinary models must not use:
- `motif_hit_count`
- `motif_source`
- `trigger_event_idx`
- `label_event_idx`
- `label_delay`
- `fraud_source`
- `twin_role`
- `twin_label`
- `twin_pair_id`
- `template_id`
- `dynamic_fraud_state`
- other oracle-only diagnostics
This separation is necessary for the benchmark claim that performance should come from temporal reasoning rather than privileged audit information.
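A minimal sketch of this stripping step is shown below. The column list is taken from the schema table in this card, but the helper itself (`model_facing_view`) is illustrative, not the repository's actual API; the authoritative column list lives in the repository configuration:

```python
# Audit/oracle-only columns, per the schema table above (illustrative list).
AUDIT_ONLY = {
    "motif_hit_count", "motif_source", "trigger_event_idx", "label_event_idx",
    "label_delay", "fraud_source", "twin_role", "twin_label", "twin_pair_id",
    "template_id", "dynamic_fraud_state", "motif_chain_state", "motif_strength",
    "is_fallback_label",
}
LABEL = "is_fraud"

def model_facing_view(row: dict) -> dict:
    """Drop the supervision label and every audit/oracle-only field before
    a learned baseline ever sees the row."""
    return {k: v for k, v in row.items() if k != LABEL and k not in AUDIT_ONLY}

row = {"txn_id": 1, "amount": 250.0, "is_fraud": 1,
       "twin_pair_id": 42, "motif_hit_count": 3}
visible = model_facing_view(row)  # only txn_id and amount survive
```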
10. Benchmark Tasks
Temporal Twins supports the following benchmark task:
- binary fraud detection on matched prefix examples
The standard evaluation protocol is:
- build matched fraud/benign examples
- truncate each sender history at the matched prefix index `k`
- train or score on the visible prefix only
- evaluate binary discrimination at the matched example level
Primary reported metrics include:
- ROC-AUC
- PR-AUC
- shuffled-order ROC-AUC
- shuffle delta = shuffled ROC-AUC minus clean ROC-AUC
The shuffled-order test is important: it measures how much performance depends on event order rather than unordered ingredients.
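The shuffled-order test can be sketched as follows. This is an illustrative implementation under stated assumptions (a pure-Python rank-based ROC-AUC, a toy order-sensitive scorer, and prefixes represented as amount lists), not the repository's evaluation code:

```python
import random

def roc_auc(labels, scores):
    """Rank-based ROC-AUC; ties between a positive and a negative count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def shuffle_delta(model_score, prefixes, labels, rng):
    """Score each prefix in its clean order and in a random order, then
    report shuffled ROC-AUC minus clean ROC-AUC (the shuffle delta)."""
    clean = [model_score(p) for p in prefixes]
    shuffled = []
    for p in prefixes:
        q = list(p)
        rng.shuffle(q)  # destroy temporal order, keep unordered ingredients
        shuffled.append(model_score(q))
    return roc_auc(labels, shuffled) - roc_auc(labels, clean)

# Toy order-sensitive scorer: did a tiny amount immediately follow a big one?
def score(p):
    return float(any(a > 100 and b < 10 for a, b in zip(p, p[1:])))

prefixes = [[200, 5, 50], [50, 200, 5], [5, 50, 200], [50, 5, 200]]
labels = [1, 1, 0, 0]
delta = shuffle_delta(score, prefixes, labels, random.Random(0))
```

A strongly negative delta means the model's discrimination genuinely depends on event order, which is the behavior the benchmark is designed to reward.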
11. Baselines and Reference Results
The frozen 5-seed paper suite uses:
- `num_users = 350`
- `simulation_days = 45`
- `seeds = [0, 1, 2, 3, 4]`
- `fast_mode = false`
- `n_checkpoints = 8`
Compact reference results:
| Mode | Primary reference | Secondary reference | XGBoost ROC-AUC | StaticGNN ROC-AUC | SeqGRU ROC-AUC | SeqGRU shuffled delta |
|---|---|---|---|---|---|---|
| `oracle_calib` | AuditOracle 1.0000 +- 0.0000 | RawMotifOracle 1.0000 +- 0.0000 | 0.5000 +- 0.0000 | 0.5222 +- 0.0235 | 1.0000 +- 0.0000 | -0.5032 +- 0.0043 |
| `easy` | MotifProbe 1.0000 +- 0.0000 | RawMotifProbe 0.9983 +- 0.0011 | 0.5000 +- 0.0000 | 0.4946 +- 0.0128 | 1.0000 +- 0.0000 | -0.5003 +- 0.0096 |
| `medium` | MotifProbe 0.6374 +- 0.0069 | RawMotifProbe 0.6482 +- 0.0058 | 0.5000 +- 0.0000 | 0.4922 +- 0.0203 | 0.8391 +- 0.0174 | -0.3337 +- 0.0191 |
| `hard` | MotifProbe 0.5790 +- 0.0045 | RawMotifProbe 0.5910 +- 0.0105 | 0.5000 +- 0.0000 | 0.5026 +- 0.0198 | 0.6876 +- 0.0128 | -0.1883 +- 0.0111 |
Static shortcut audit across all 20 paper-suite runs:
- `static_agg_auc` = 0.5000 +- 0.0000
- `total_txn_count` AUC = 0.5000 +- 0.0000
- `local_event_idx` AUC = 0.5000 +- 0.0000
- `prefix_txn_count` AUC = 0.5000 +- 0.0000
- `timestamp` AUC = 0.5000 +- 0.0000
- `account_age` AUC = 0.5000 +- 0.0000
- `active_age` AUC = 0.5000 +- 0.0000
- `benign_motif_hit_rate` = 0.0000 +- 0.0000
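A single-feature shortcut audit of this kind is straightforward to sketch. The code below is illustrative (the `auc` helper and the toy examples are not the repository's audit implementation); it shows why matched twins force a static feature used directly as a score to an AUC of 0.5:

```python
def auc(labels, scores):
    """Rank-based ROC-AUC with ties counted 0.5 -- enough for a shortcut audit."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Each fraud/benign twin shares the same static value, so the positive and
# negative score distributions are identical and the feature can only tie
# or win/lose symmetrically across pairs, forcing AUC = 0.5.
examples = [
    {"total_txn_count": 30, "is_fraud": 1}, {"total_txn_count": 30, "is_fraud": 0},
    {"total_txn_count": 55, "is_fraud": 1}, {"total_txn_count": 55, "is_fraud": 0},
]
labels = [e["is_fraud"] for e in examples]
shortcut_auc = auc(labels, [e["total_txn_count"] for e in examples])  # -> 0.5
```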
These results support the intended interpretation:
- static shortcuts are neutralized
- `oracle_calib` validates matched-prefix correctness
- `easy` is readily learnable by order-sensitive sequence models
- `medium` remains learnable but meaningfully harder
- `hard` remains above static baselines but is substantially more challenging
Full paper-suite artifacts, including temporal GNN results and per-seed CSVs, are stored under `results/paper_suite_20260503_202810/`.
12. Intended Use
This dataset is intended for:
- research on temporal fraud detection
- benchmarking order-sensitive sequence and temporal-graph models
- evaluating whether performance survives matched static controls
- studying delayed labels and prefix-only evaluation
- comparing clean-order and shuffled-order performance
It is appropriate for methodology papers, controlled ablation studies, and robustness checks on temporal inductive bias.
13. Out-of-Scope Use
Temporal Twins is out of scope for:
- direct training of production fraud systems
- making real financial, banking, or payment decisions
- approving or denying transactions for real users
- risk-scoring real individuals or organizations
- regulatory, legal, or operational decisions in production financial systems
14. Limitations
Important limitations include:
- the benchmark is fully synthetic and reflects designer assumptions
- user behavior, fraud behavior, and benign behavior are simplified relative to real financial ecosystems
- the only ground truth is the generator's own labeling logic
- real-world fraud often depends on richer institutional, device, merchant, and social context not present here
- difficulty levels are benchmark design choices, not calibrated measures of real operational difficulty
- temporal GNN underperformance on this benchmark should not be generalized to all real fraud settings
15. Biases and Risks
As a synthetic benchmark, Temporal Twins inherits the modeling biases of its generator:
- it emphasizes order-sensitive motifs chosen by the benchmark designers
- it encodes a particular notion of delayed fraud and camouflage
- it may reward models that are well aligned to these synthetic mechanisms
- it may underrepresent other real fraud styles not captured by the generator
There is also a scientific risk:
- because the benchmark intentionally removes common static shortcuts, performance on Temporal Twins may differ from performance on operational datasets where those shortcuts exist, for better or worse
16. Privacy and Sensitive Data
Temporal Twins contains no real financial or personal data.
Specifically:
- no real UPI data
- no real users
- no real bank accounts
- no real transactions
- no personal financial records
- no protected demographic attributes
All user IDs, receiver IDs, timestamps, amounts, and risk signals are synthetic artifacts produced by the generator.
17. Ethical Considerations
Temporal Twins is safer to share than real financial logs because it does not contain real persons or institutions. However, ethical care is still needed.
Users of the dataset should not:
- present synthetic results as direct evidence of production readiness
- claim fairness or social validity that has not been tested on real populations
- use the dataset as justification for automated decisions about real people
The intended ethical use is research benchmarking, not operational deployment.
18. Reproducibility
The repository includes deterministic generation and evaluation settings for the frozen paper suite.
Paper-suite configuration:
- `num_users = 350`
- `simulation_days = 45`
- `seeds = [0, 1, 2, 3, 4]`
- `fast_mode = false`
- `n_checkpoints = 8`
Reproducibility properties:
- stable deterministic seed derivation is used for benchmark modes and profiles
- Python, NumPy, and PyTorch seeds are fixed per run
- deterministic runtime flags are enabled where safe
- matched-prefix datasets are reproducible under fixed config and seed
- the final paper suite in this repository is stored as deterministic CSV artifacts
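The stable seed-derivation property can be sketched as follows. This is a hypothetical illustration (function names and the hashing scheme are assumptions, not the repository's implementation): seeds are derived by hashing the run parameters rather than by Python's process-salted `hash()`, so the same inputs always yield the same seed across processes:

```python
import hashlib
import random

def derive_seed(base_seed: int, mode: str, profile: str = "paper") -> int:
    """Derive a stable 32-bit seed from (base_seed, mode, profile) via
    SHA-256, so derivation is deterministic across runs and machines."""
    key = f"{base_seed}:{mode}:{profile}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")

def seeded_rng(base_seed: int, mode: str) -> random.Random:
    # A real pipeline would also seed NumPy and PyTorch from the same
    # derived value, e.g. np.random.default_rng(seed) and torch.manual_seed(seed).
    return random.Random(derive_seed(base_seed, mode))

a = seeded_rng(0, "hard").random()
b = seeded_rng(0, "hard").random()  # identical draw: derivation is deterministic
```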
Reference artifacts:
- `results/paper_suite_20260503_202810/paper_suite_runs.csv`
- `results/paper_suite_20260503_202810/paper_suite_summary.csv`
- `results/paper_suite_20260503_202810/paper_suite_runtime.csv`
- `results/paper_suite_20260503_202810/paper_suite_failed_checks.csv`
19. Hosting, License, and Citation
Hosting
The benchmark is generated from code in this repository; pre-generated artifacts and metadata are also hosted on the Hugging Face Hub.
Current status:
- dataset hosting location: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins
- code repository: https://huggingface.co/temporal-twins-benchmark/temporal-twins-code
- canonical pre-generated release archive: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins
- Croissant metadata URL: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json
- Croissant metadata browser page: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/blob/main/metadata/temporal_twins_croissant.json
- data files: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/data
- results: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/results
- configs: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/configs
- metadata directory: https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/metadata
- paper or preprint: Not available during double-blind review; to be added after publication.
- reference paper-suite results: `results/paper_suite_20260503_202810/`
The Croissant file passes JSON, schema, and Responsible AI validation in the official checker. The optional records-generation test currently reports a known Parquet-in-zip streaming issue; direct pandas/pyarrow loading instructions are provided in data/README_GENERATION.md.
License
- Dataset license: CC BY 4.0 (`CC-BY-4.0`)
- Code license: Apache License 2.0 (`Apache-2.0`)
Citation
The final paper or preprint citation is not available during double-blind review and will be added after publication.
TODO placeholder BibTeX:
```bibtex
@dataset{temporal_twins_todo,
  title        = {Temporal Twins: A Synthetic UPI-Style Benchmark for Temporal Fraud Detection},
  author       = {TODO},
  year         = {TODO},
  howpublished = {TODO},
  note         = {Synthetic matched-prefix temporal fraud benchmark},
  url          = {TODO}
}
```