Temporal Twins Dataset Card

1. Dataset Summary

Temporal Twins is a synthetic UPI-style transaction benchmark for temporal fraud detection. It is designed to evaluate whether a model can distinguish fraud from benign behavior using order-sensitive temporal structure rather than static aggregates such as total transaction count, account age, or prefix length.

The benchmark simulates users sending transactions over time and then assigns fraud labels through delayed temporal mechanisms. Its core design is a matched fraud/benign temporal-twin construction:

  • each positive example is a fraud twin evaluated at a local event index k
  • each negative example is a benign twin evaluated at the same local event index k
  • both twins are matched on static and prefix-level summaries
  • the benign twin contains the same unordered ingredients but violates the fraud-relevant temporal order

Temporal Twins exposes four benchmark modes:

  • oracle_calib
  • easy
  • medium
  • hard

The frozen paper-suite configuration used in this repository is:

  • num_users = 350
  • simulation_days = 45
  • seeds = [0, 1, 2, 3, 4]
  • fast_mode = false
  • n_checkpoints = 8
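
For concreteness, the same settings can be written as a plain Python mapping. The key names below mirror the list above but are illustrative; they are not necessarily the generator's actual config schema:

```python
# Frozen paper-suite configuration (illustrative key names).
PAPER_SUITE_CONFIG = {
    "num_users": 350,
    "simulation_days": 45,
    "seeds": [0, 1, 2, 3, 4],
    "fast_mode": False,
    "n_checkpoints": 8,
}
```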

2. Dataset Motivation

Many fraud datasets can be solved by static shortcuts: longer histories, later evaluation times, higher transaction counts, or other aggregate correlates can make a benchmark look temporally rich while actually rewarding non-temporal models. Temporal Twins was built to remove those shortcuts and isolate order-sensitive temporal reasoning.

The benchmark therefore aims to answer a narrower research question:

  • when static summaries are matched between positives and negatives, can a model still recover delayed fraud signals from temporal order alone?

It is intended for benchmarking temporal representation learning, causal order sensitivity, and delayed-label detection under controlled synthetic conditions.

3. Dataset Composition

Temporal Twins is generated programmatically from synthetic user and transaction processes. There is no fixed real-world corpus. Each generated artifact is an event table in which each row is a synthetic transaction.

At a high level, each run contains:

  • a synthetic user population
  • a synthetic stream of UPI-style transactions
  • risk-engine outputs such as transaction risk scores and failures
  • benchmark-specific fraud and audit annotations
  • matched fraud/benign evaluation pairs extracted from the event stream

The paper-scale suite in this repository contains 20 deterministic runs:

  • oracle_calib with seeds 0..4
  • easy with seeds 0..4
  • medium with seeds 0..4
  • hard with seeds 0..4
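
The run grid can be enumerated directly. A minimal sketch (the constant names are illustrative):

```python
from itertools import product

MODES = ["oracle_calib", "easy", "medium", "hard"]
SEEDS = [0, 1, 2, 3, 4]

# 4 modes x 5 seeds = the 20 deterministic paper-suite runs.
runs = list(product(MODES, SEEDS))
assert len(runs) == 20
```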

Mean matched evaluation-pair counts in the frozen paper suite are:

| Mode | Matched evaluation pairs (mean ± std) |
|---|---|
| oracle_calib | 2606.6 ± 454.3 |
| easy | 2222.2 ± 128.4 |
| medium | 2356.6 ± 18.0 |
| hard | 2317.6 ± 22.0 |

Each paper-suite run is class-balanced at evaluation time:

  • positives = negatives
  • positive rate = 0.5000

4. Dataset Generation Process

The generation pipeline has four stages:

  1. Synthetic user generation
  2. Synthetic transaction generation
  3. Synthetic risk and retry generation
  4. Fraud-mechanism and matched-twin generation

More concretely:

  1. A synthetic user set is created with user-level behavioral parameters.
  2. A synthetic transaction stream is sampled with sender IDs, receiver IDs, timestamps, transaction amounts, and transaction types.
  3. A risk engine adds synthetic risk-related fields such as risk_score, fail_prob, failed, and retry-like events.
  4. The fraud engine applies benchmark-mode-specific temporal mechanisms and constructs matched temporal twins.

For the temporal_twins benchmark family, the generator then:

  • constructs fraud twins and benign twins from matched carrier users and templates
  • preserves matched static and prefix-level summaries
  • injects delayed fraud labels into fraud twins
  • forces benign twins to avoid the fraud-relevant temporal motif while retaining similar unordered ingredients

The benchmark is deterministic under fixed configuration, seed, and runtime settings.
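
As a toy illustration of the pipeline shape, the self-contained sketch below mimics stages 1-3 and notes where stage 4 would apply. All distributions, constants, and names here are invented for illustration; they are not the generator's actual mechanisms:

```python
import numpy as np

def generate_run(num_users: int, simulation_days: int, seed: int) -> dict:
    """Toy sketch of stages 1-3; the real generator (and stage 4's
    twin construction) is far richer than this."""
    rng = np.random.default_rng(seed)  # deterministic under a fixed seed
    # Stage 1: user population with per-user activity rates.
    rates = rng.gamma(2.0, 1.0, size=num_users)
    n_events = int(rates.sum() * simulation_days)
    # Stage 2: UPI-style transaction stream.
    events = {
        "sender_id": rng.integers(0, num_users, n_events),
        "receiver_id": rng.integers(0, num_users, n_events),
        "timestamp": np.sort(rng.uniform(0, simulation_days * 86400.0, n_events)),
        "amount": rng.lognormal(5.0, 1.0, n_events),
    }
    # Stage 3: synthetic risk-engine outputs.
    events["risk_score"] = rng.beta(2.0, 5.0, n_events)
    events["failed"] = (rng.random(n_events) < 0.1 * events["risk_score"]).astype(np.int8)
    # Stage 4 (not sketched): mode-specific delayed labels and matched twins.
    return events
```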

5. Fraud Mechanisms

Temporal Twins uses delayed, order-sensitive fraud mechanisms rather than directly labeling static outliers. Important mechanisms include:

  • velocity-like activity acceleration
  • retry-like behavior
  • delayed receiver revisits
  • burst-release-burst motifs
  • adversarial timing perturbations
  • delayed fraud assignment
  • hidden latent fraud-state dynamics

These mechanisms are combined with difficulty-dependent noise and camouflage. In the standard easy, medium, and hard modes, the fraud signal is intentionally imperfect and partially obscured. In oracle_calib, the construction is designed to validate motif and evaluation alignment under matched-prefix conditions.

6. Matched-Control Construction

The central benchmark control is the fraud/benign temporal twin.

For every fraud twin labeled positive at local event index k:

  • the benign twin is evaluated at the same local event index k
  • both examples use the same local prefix length
  • both examples are truncated at prefix index k
  • no future events are visible to the model

Within each matched pair, the protocol additionally matches:

  • total transaction count
  • local prefix length
  • evaluation timestamp
  • account age
  • active age
  • receiver histograms
  • static aggregate summaries

In words:

  • the fraud twin contains a temporally meaningful order pattern that triggers a delayed positive label
  • the benign twin contains comparable ingredients and prefix statistics but violates the fraud-relevant temporal order

This design is meant to prevent performance from arising from:

  • longer histories
  • older accounts
  • later prefix positions
  • different transaction totals
  • unmatched prefix ages
  • benign negatives evaluated at arbitrary or easier positions
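
A minimal sketch of how one matched pair could be sliced into its two visible prefixes, assuming a pandas event table that carries the audit fields described in the schema section and per-pair rows ordered by local event index. The actual construction lives in the generator; this is illustrative only:

```python
import pandas as pd

def matched_prefixes(events: pd.DataFrame, pair_id: int):
    """Return the visible prefixes for one matched fraud/benign pair."""
    pair = events[events["twin_pair_id"] == pair_id]
    fraud = pair[pair["twin_role"] == "fraud"]
    benign = pair[pair["twin_role"] == "benign"]
    # Both twins are evaluated at the same local event index k.
    k = int(fraud["label_event_idx"].iloc[0])
    # Truncate at prefix index k: no future events are visible.
    return fraud.iloc[: k + 1], benign.iloc[: k + 1]
```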

7. Dataset Modes and Difficulty Ladder

Temporal Twins provides four modes.

oracle_calib

This is the calibration mode used to validate that the matched-prefix protocol is working as intended.

  • Oracle metrics remain near-perfect.
  • Static shortcut baselines remain at chance.
  • Benign motif hit rate remains zero.
  • This mode is primarily for protocol validation rather than realistic difficulty.

easy

  • strong motif signal
  • low noise
  • shorter delay
  • expected SeqGRU ROC-AUC near 0.90-1.00

medium

  • moderate motif signal
  • moderate noise
  • longer delay
  • expected SeqGRU ROC-AUC near 0.80-0.90

hard

  • weaker motif signal
  • longer delay
  • adversarial perturbations and decoys
  • expected SeqGRU ROC-AUC near 0.70-0.85

Naming convention:

  • in oracle_calib, AuditOracle and RawMotifOracle are true oracle-style references
  • in standard easy, medium, and hard, the corresponding scores are reported as MotifProbe and RawMotifProbe because realism and noise make them probes rather than perfect oracles

8. Data Schema

The event table contains model-facing fields, supervision labels, and audit/oracle-only fields. The table below lists the most important columns used in this repository.

| Column name | Type | Description | Exposed to ordinary models? | Notes |
|---|---|---|---|---|
| txn_id | int32 | Synthetic transaction identifier | Yes | Identifier only; not a benchmark target |
| sender_id | int32 / int64 | Synthetic sender account ID | Yes | Node identity available to temporal models |
| receiver_id | int32 / int64 | Synthetic receiver account ID | Yes | Used for graph and sequence structure |
| timestamp | float32 | Synthetic event time in seconds from simulation start | Yes | Prefix truncation is based on timestamp and local index |
| amount | float32 | Synthetic transaction amount | Yes | Not tied to real currency records |
| txn_type | int8 | Synthetic transaction-type code | Yes | UPI-style categorical event attribute |
| risk_score | float32 | Synthetic risk score from the risk engine | Yes | No real production risk model is used |
| fail_prob | float32 | Synthetic failure probability | Yes | Risk-engine output |
| failed | int8 | Binary failure indicator | Yes | Used as a normal model-facing field |
| is_retry | int8 / derived | Retry-like event indicator | Yes | Available to ordinary models when present |
| pair_freq | float32 / derived | Sender-receiver interaction-frequency feature | Yes | Derived from visible event history |
| risk_noisy | float32 | Noisy synthetic risk feature | Yes | Benchmark feature, not an audit signal |
| txn_count_10 | float32 / derived | Recent-count feature over a short window | Yes | Derived from visible history |
| amount_sum_10 | float32 / derived | Recent amount-sum feature | Yes | Derived from visible history |
| is_fraud | int8 | Binary fraud label | No | Supervision target only, not a model input |
| twin_pair_id | int64 | Matched fraud/benign pair identifier | No | Audit/oracle-only; not exposed to learned baselines |
| twin_role | string | Twin role such as fraud, benign, or background | No | Audit/oracle-only |
| twin_label | int8 | Pairwise matched label for audit utilities | No | Audit/oracle-only |
| template_id | int64 | Source template identifier used during twin construction | No | Audit/oracle-only |
| dynamic_fraud_state | float32 | Latent synthetic fraud-state variable | No | Hidden mechanism for analysis only |
| motif_source | int8 | Indicator for motif-source events in a sequence | No | Audit/oracle-only |
| motif_hit_count | int32 | Count of motif hits in the sequence | No | Audit/oracle-only |
| trigger_event_idx | int32 | Local event index of the trigger event | No | Audit/oracle-only |
| label_event_idx | int32 | Local event index at which the fraud label becomes active | No | Audit/oracle-only |
| label_delay | int32 | Delay between trigger and labeled event index | No | Audit/oracle-only |
| fraud_source | string | Cause of fraud label, e.g. motif or fallback chain | No | Audit/oracle-only |
| is_fallback_label | int8 | Indicator that a label came from fallback logic | No | Audit/oracle-only |
| motif_chain_state | float32 | Internal motif-chain analysis field | No | Audit/oracle-only |
| motif_strength | float32 | Internal motif-strength analysis field | No | Audit/oracle-only |

Not every baseline uses every model-facing column. The important guarantee is that learned baselines do not receive the audit/oracle-only fields listed above.
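
A minimal loading sketch with pandas, assuming a CSV artifact with the hypothetical filename `events.csv`, spot-checking a few guarantees from the schema table above:

```python
import pandas as pd

# Hypothetical filename; actual artifact names differ per mode and seed.
events = pd.read_csv("events.csv")

# Model-facing identifiers and attributes are present.
assert {"txn_id", "sender_id", "receiver_id", "timestamp", "amount"} <= set(events.columns)
# The supervision target is binary.
assert set(events["is_fraud"].unique()) <= {0, 1}
```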

9. Model-Facing vs Audit/Oracle-Only Columns

Ordinary learned baselines are restricted to model-facing transaction attributes and histories. In this repository, audit/oracle-only columns are explicitly stripped before learned baselines are trained or evaluated.

Ordinary models may use fields such as:

  • sender_id
  • receiver_id
  • timestamp
  • amount
  • risk_score
  • fail_prob
  • failed
  • txn_type
  • other derived non-oracle features built from visible prefix history

Ordinary models must not use:

  • motif_hit_count
  • motif_source
  • trigger_event_idx
  • label_event_idx
  • label_delay
  • fraud_source
  • twin_role
  • twin_label
  • twin_pair_id
  • template_id
  • dynamic_fraud_state
  • other oracle-only diagnostics

This separation is necessary for the benchmark claim that performance should come from temporal reasoning rather than privileged audit information.
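
A minimal sketch of the stripping step, assuming pandas event tables. The column list mirrors the audit/oracle-only rows in the schema table; `model_facing_view` is a hypothetical helper name, not the repository's API:

```python
import pandas as pd

AUDIT_ONLY = [
    "twin_pair_id", "twin_role", "twin_label", "template_id",
    "dynamic_fraud_state", "motif_source", "motif_hit_count",
    "trigger_event_idx", "label_event_idx", "label_delay",
    "fraud_source", "is_fallback_label", "motif_chain_state",
    "motif_strength",
]

def model_facing_view(events: pd.DataFrame):
    """Split an event table into model inputs and the supervision label,
    dropping every audit/oracle-only column listed above."""
    present = [c for c in AUDIT_ONLY if c in events.columns]
    X = events.drop(columns=present + ["is_fraud"])
    y = events["is_fraud"]
    return X, y
```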

10. Benchmark Tasks

Temporal Twins supports the following benchmark task:

  • binary fraud detection on matched prefix examples

The standard evaluation protocol is:

  • build matched fraud/benign examples
  • truncate each sender history at the matched prefix index k
  • train or score on the visible prefix only
  • evaluate binary discrimination at the matched example level

Primary reported metrics include:

  • ROC-AUC
  • PR-AUC
  • shuffled-order ROC-AUC
  • shuffle delta = shuffled ROC-AUC minus clean ROC-AUC

The shuffled-order test is important: it measures how much performance depends on event order rather than on unordered ingredients. A strongly negative shuffle delta indicates that a model's signal collapses once event order is destroyed.
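
A minimal sketch of the metric computation with scikit-learn, assuming a hypothetical `score_fn` that maps one prefix (an array of events) to a fraud score:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate(score_fn, prefixes, labels, seed=0):
    """Compute clean-order and shuffled-order metrics for one run."""
    rng = np.random.default_rng(seed)
    clean = np.array([score_fn(p) for p in prefixes])
    # Permuting each prefix destroys order but keeps the unordered ingredients.
    shuffled = np.array([score_fn(rng.permutation(p)) for p in prefixes])
    clean_auc = roc_auc_score(labels, clean)
    shuffled_auc = roc_auc_score(labels, shuffled)
    return {
        "roc_auc": clean_auc,
        "pr_auc": average_precision_score(labels, clean),
        "shuffled_roc_auc": shuffled_auc,
        "shuffle_delta": shuffled_auc - clean_auc,  # shuffled minus clean
    }
```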

11. Baselines and Reference Results

The frozen 5-seed paper suite uses:

  • num_users = 350
  • simulation_days = 45
  • seeds = [0, 1, 2, 3, 4]
  • fast_mode = false
  • n_checkpoints = 8

Compact reference results:

| Mode | Primary reference | Secondary reference | XGBoost ROC-AUC | StaticGNN ROC-AUC | SeqGRU ROC-AUC | SeqGRU shuffled delta |
|---|---|---|---|---|---|---|
| oracle_calib | AuditOracle 1.0000 ± 0.0000 | RawMotifOracle 1.0000 ± 0.0000 | 0.5000 ± 0.0000 | 0.5222 ± 0.0235 | 1.0000 ± 0.0000 | -0.5032 ± 0.0043 |
| easy | MotifProbe 1.0000 ± 0.0000 | RawMotifProbe 0.9983 ± 0.0011 | 0.5000 ± 0.0000 | 0.4946 ± 0.0128 | 1.0000 ± 0.0000 | -0.5003 ± 0.0096 |
| medium | MotifProbe 0.6374 ± 0.0069 | RawMotifProbe 0.6482 ± 0.0058 | 0.5000 ± 0.0000 | 0.4922 ± 0.0203 | 0.8391 ± 0.0174 | -0.3337 ± 0.0191 |
| hard | MotifProbe 0.5790 ± 0.0045 | RawMotifProbe 0.5910 ± 0.0105 | 0.5000 ± 0.0000 | 0.5026 ± 0.0198 | 0.6876 ± 0.0128 | -0.1883 ± 0.0111 |

Static shortcut audit across all 20 paper-suite runs:

  • static_agg_auc = 0.5000 ± 0.0000
  • total_txn_count AUC = 0.5000 ± 0.0000
  • local_event_idx AUC = 0.5000 ± 0.0000
  • prefix_txn_count AUC = 0.5000 ± 0.0000
  • timestamp AUC = 0.5000 ± 0.0000
  • account_age AUC = 0.5000 ± 0.0000
  • active_age AUC = 0.5000 ± 0.0000
  • benign_motif_hit_rate = 0.0000 ± 0.0000
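
A minimal sketch of the shortcut audit, assuming a pandas table with one row per matched example that carries the static summaries above plus is_fraud; `static_shortcut_audit` is a hypothetical helper:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def static_shortcut_audit(examples: pd.DataFrame) -> dict:
    """AUC of ranking matched examples by each static summary alone.
    Values near 0.5 mean the summary carries no label signal."""
    cols = ["total_txn_count", "local_event_idx", "prefix_txn_count",
            "timestamp", "account_age", "active_age"]
    return {c: roc_auc_score(examples["is_fraud"], examples[c]) for c in cols}
```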

These results support the intended interpretation:

  • static shortcuts are neutralized
  • oracle_calib validates matched-prefix correctness
  • easy is readily learnable by order-sensitive sequence models
  • medium remains learnable but meaningfully harder
  • hard remains above static baselines but is substantially more challenging

Full paper-suite artifacts, including temporal GNN results and per-seed CSVs, are stored under:

  • results/paper_suite_20260503_202810/

12. Intended Use

This dataset is intended for:

  • research on temporal fraud detection
  • benchmarking order-sensitive sequence and temporal-graph models
  • evaluating whether performance survives matched static controls
  • studying delayed labels and prefix-only evaluation
  • comparing clean-order and shuffled-order performance

It is appropriate for methodology papers, controlled ablation studies, and robustness checks on temporal inductive bias.

13. Out-of-Scope Use

Temporal Twins is out of scope for:

  • direct training of production fraud systems
  • making real financial, banking, or payment decisions
  • approving or denying transactions for real users
  • risk-scoring real individuals or organizations
  • regulatory, legal, or operational decisions in production financial systems

The dataset must not be used to train production fraud systems directly or to make real financial decisions.

14. Limitations

Important limitations include:

  • the benchmark is fully synthetic and reflects designer assumptions
  • user behavior, fraud behavior, and benign behavior are simplified relative to real financial ecosystems
  • the only ground truth is the generator's own labeling logic
  • real-world fraud often depends on richer institutional, device, merchant, and social context not present here
  • difficulty levels are benchmark design choices, not calibrated measures of real operational difficulty
  • temporal GNN underperformance on this benchmark should not be generalized to all real fraud settings

15. Biases and Risks

As a synthetic benchmark, Temporal Twins inherits the modeling biases of its generator:

  • it emphasizes order-sensitive motifs chosen by the benchmark designers
  • it encodes a particular notion of delayed fraud and camouflage
  • it may reward models that are well aligned to these synthetic mechanisms
  • it may underrepresent other real fraud styles not captured by the generator

There is also a scientific risk:

  • because the benchmark intentionally removes common static shortcuts, performance on Temporal Twins may differ from performance on operational datasets where those shortcuts exist, for better or worse

16. Privacy and Sensitive Data

Temporal Twins contains no real financial or personal data.

Specifically:

  • no real UPI data
  • no real users
  • no real bank accounts
  • no real transactions
  • no personal financial records
  • no protected demographic attributes

All user IDs, receiver IDs, timestamps, amounts, and risk signals are synthetic artifacts produced by the generator.

17. Ethical Considerations

Temporal Twins is safer to share than real financial logs because it does not contain real persons or institutions. However, ethical care is still needed.

Users of the dataset should not:

  • present synthetic results as direct evidence of production readiness
  • claim fairness or social validity that has not been tested on real populations
  • use the dataset as justification for automated decisions about real people

The intended ethical use is research benchmarking, not operational deployment.

18. Reproducibility

The repository includes deterministic generation and evaluation settings for the frozen paper suite.

Paper-suite configuration:

  • num_users = 350
  • simulation_days = 45
  • seeds = [0, 1, 2, 3, 4]
  • fast_mode = false
  • n_checkpoints = 8

Reproducibility properties:

  • stable deterministic seed derivation is used for benchmark modes and profiles
  • Python, NumPy, and PyTorch seeds are fixed per run
  • deterministic runtime flags are enabled where safe
  • matched-prefix datasets are reproducible under fixed config and seed
  • the final paper suite in this repository is stored as deterministic CSV artifacts
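
A minimal sketch of per-run seed fixing using standard Python, NumPy, and PyTorch calls; `seed_everything` is a hypothetical helper name:

```python
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    """Fix Python, NumPy, and PyTorch RNGs for one run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic runtime flags, enabled where safe.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```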

Reference artifacts:

  • results/paper_suite_20260503_202810/paper_suite_runs.csv
  • results/paper_suite_20260503_202810/paper_suite_summary.csv
  • results/paper_suite_20260503_202810/paper_suite_runtime.csv
  • results/paper_suite_20260503_202810/paper_suite_failed_checks.csv

19. Hosting, License, and Citation

Hosting

The benchmark is currently generated from code in this repository rather than distributed as a fixed external archive.

Current status:

The Croissant file passes JSON, schema, and Responsible AI validation in the official checker. The optional records-generation test currently reports a known Parquet-in-zip streaming issue; direct pandas/pyarrow loading instructions are provided in data/README_GENERATION.md.
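
A minimal direct-loading sketch with pandas and pyarrow; the filenames below are hypothetical placeholders, and data/README_GENERATION.md documents the actual artifact layout:

```python
import pandas as pd
import pyarrow.parquet as pq

# CSV artifacts load directly with pandas.
events = pd.read_csv("events.csv")

# Parquet artifacts load directly with pyarrow, bypassing zip streaming.
table = pq.read_table("events.parquet")
events = table.to_pandas()
```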

License

  • Dataset license: CC BY 4.0 (CC-BY-4.0)
  • Code license: Apache License 2.0 (Apache-2.0)

Citation

The final paper or preprint citation is not available during double-blind review and will be added after publication.

TODO placeholder BibTeX:

```bibtex
@dataset{temporal_twins_todo,
  title        = {Temporal Twins: A Synthetic UPI-Style Benchmark for Temporal Fraud Detection},
  author       = {TODO},
  year         = {TODO},
  howpublished = {TODO},
  note         = {Synthetic matched-prefix temporal fraud benchmark},
  url          = {TODO}
}
```