# Temporal Twins Dataset Card
## 1. Dataset Summary
Temporal Twins is a synthetic UPI-style transaction benchmark for temporal fraud detection. It is designed to evaluate whether a model can distinguish fraud from benign behavior using order-sensitive temporal structure rather than static aggregates such as total transaction count, account age, or prefix length.
The benchmark simulates users sending transactions over time and then assigns fraud labels through delayed temporal mechanisms. Its core design is a matched fraud/benign temporal-twin construction:
- each positive example is a fraud twin evaluated at a local event index `k`
- each negative example is a benign twin evaluated at the same local event index `k`
- both twins are matched on static and prefix-level summaries
- the benign twin contains the same unordered ingredients but violates the fraud-relevant temporal order
Temporal Twins exposes four benchmark modes:
- `oracle_calib`
- `easy`
- `medium`
- `hard`
The frozen paper-suite configuration used in this repository is:
- `num_users = 350`
- `simulation_days = 45`
- `seeds = [0, 1, 2, 3, 4]`
- `fast_mode = false`
- `n_checkpoints = 8`
## 2. Dataset Motivation
Many fraud datasets can be solved by static shortcuts: longer histories, later evaluation times, higher transaction counts, or other aggregate correlates can make a benchmark look temporally rich while actually rewarding non-temporal models. Temporal Twins was built to remove those shortcuts and isolate order-sensitive temporal reasoning.
The benchmark therefore aims to answer a narrower research question:
- when static summaries are matched between positives and negatives, can a model still recover delayed fraud signals from temporal order alone?
It is intended for benchmarking temporal representation learning, causal order sensitivity, and delayed-label detection under controlled synthetic conditions.
## 3. Dataset Composition
Temporal Twins is generated programmatically from synthetic user and transaction processes. There is no fixed real-world corpus. Each generated artifact is an event table in which each row is a synthetic transaction.
At a high level, each run contains:
- a synthetic user population
- a synthetic stream of UPI-style transactions
- risk-engine outputs such as transaction risk scores and failures
- benchmark-specific fraud and audit annotations
- matched fraud/benign evaluation pairs extracted from the event stream
The paper-scale suite in this repository contains 20 deterministic runs:
- `oracle_calib` with seeds `0..4`
- `easy` with seeds `0..4`
- `medium` with seeds `0..4`
- `hard` with seeds `0..4`
Mean matched evaluation-pair counts in the frozen paper suite are:
| Mode | Matched evaluation pairs (mean +- std) |
|---|---:|
| `oracle_calib` | `2606.6 +- 454.3` |
| `easy` | `2222.2 +- 128.4` |
| `medium` | `2356.6 +- 18.0` |
| `hard` | `2317.6 +- 22.0` |
Each paper-suite run is class-balanced at evaluation time:
- positives = negatives
- positive rate = `0.5000`
## 4. Dataset Generation Process
The generation pipeline has four stages:
1. Synthetic user generation
2. Synthetic transaction generation
3. Synthetic risk and retry generation
4. Fraud-mechanism and matched-twin generation
More concretely:
1. A synthetic user set is created with user-level behavioral parameters.
2. A synthetic transaction stream is sampled with sender IDs, receiver IDs, timestamps, transaction amounts, and transaction types.
3. A risk engine adds synthetic risk-related fields such as `risk_score`, `fail_prob`, `failed`, and retry-like events.
4. The fraud engine applies benchmark-mode-specific temporal mechanisms and constructs matched temporal twins.
For the `temporal_twins` benchmark family, the generator then:
- constructs fraud twins and benign twins from matched carrier users and templates
- preserves matched static and prefix-level summaries
- injects delayed fraud labels into fraud twins
- forces benign twins to avoid the fraud-relevant temporal motif while retaining similar unordered ingredients
The benchmark is deterministic under fixed configuration, seed, and runtime settings.
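The determinism claim above can be illustrated with a minimal sketch. The stage functions below are hypothetical stubs covering only stages 1 and 2 (user and transaction generation), not the repository's actual generator API; the point is that a single per-run RNG fixed by the seed makes the artifact reproducible.

```python
# Hypothetical sketch of the generation pipeline's determinism.
# generate_users / generate_transactions are illustrative stubs,
# not the repository's real stage implementations.
import random

def generate_users(rng, num_users):
    # Stage 1: user-level behavioral parameters (here: an activity rate).
    return {u: rng.uniform(0.5, 2.0) for u in range(num_users)}

def generate_transactions(rng, users, simulation_days):
    # Stage 2: timestamped sender/receiver/amount events.
    horizon = simulation_days * 86400.0
    txns = []
    for sender, rate in users.items():
        for _ in range(rng.randint(1, int(3 * rate))):
            txns.append({
                "sender_id": sender,
                "receiver_id": rng.randrange(len(users)),
                "timestamp": rng.uniform(0.0, horizon),
                "amount": rng.uniform(10.0, 5000.0),
            })
    txns.sort(key=lambda t: t["timestamp"])
    return txns

def run(seed, num_users=350, simulation_days=45):
    rng = random.Random(seed)  # one fixed RNG stream per run
    return generate_transactions(rng, generate_users(rng, num_users),
                                 simulation_days)

# Identical seed -> identical artifact; different seed -> different artifact.
assert run(0) == run(0)
assert run(0) != run(1)
```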
## 5. Fraud Mechanisms
Temporal Twins uses delayed, order-sensitive fraud mechanisms rather than directly labeling static outliers. Important mechanisms include:
- velocity-like activity acceleration
- retry-like behavior
- delayed receiver revisits
- burst-release-burst motifs
- adversarial timing perturbations
- delayed fraud assignment
- hidden latent fraud-state dynamics
These mechanisms are combined with difficulty-dependent noise and camouflage. In the standard `easy`, `medium`, and `hard` modes, the fraud signal is intentionally imperfect and partially obscured. In `oracle_calib`, the construction is designed to validate motif and evaluation alignment under matched-prefix conditions.
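As one concrete illustration of an order-sensitive mechanism, a burst-release-burst motif can be recognized from inter-event gaps alone. The detector below is purely illustrative: the thresholds (`burst_gap`, `release_gap`, `burst_len`) and the state machine are assumptions, not the generator's actual logic.

```python
# Illustrative burst-release-burst motif check over inter-event gaps.
# Thresholds and logic are hypothetical, not the generator's implementation.
def has_burst_release_burst(timestamps, burst_gap=60.0,
                            release_gap=3600.0, burst_len=3):
    """True if the sequence contains a tight burst, a long quiet release,
    then another tight burst, in that order."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    state, run = "burst1", 0
    for g in gaps:
        if state == "burst1":
            run = run + 1 if g <= burst_gap else 0
            if run >= burst_len - 1:
                state, run = "release", 0
        elif state == "release":
            if g >= release_gap:
                state = "burst2"
        elif state == "burst2":
            run = run + 1 if g <= burst_gap else 0
            if run >= burst_len - 1:
                return True
    return False

fraud = [0, 10, 20, 5000, 5010, 5020]     # burst, long release, burst
benign = [0, 10, 4980, 4990, 5000, 5010]  # similar ingredients, wrong order
assert has_burst_release_burst(fraud) and not has_burst_release_burst(benign)
```

The benign sequence carries nearly the same unordered gap ingredients but never completes the first burst before the quiet period, mirroring how benign twins are meant to defeat order-insensitive models.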
## 6. Matched-Control Construction
The central benchmark control is the fraud/benign temporal twin.
For every fraud twin positive label at local event index `k`:
- the benign twin is evaluated at the same local event index `k`
- both examples use the same local prefix length
- both examples are truncated at prefix index `k`
- no future events are visible to the model
Within each matched pair, the protocol additionally matches:
- total transaction count
- local prefix length
- evaluation timestamp
- account age
- active age
- receiver histograms
- static aggregate summaries
In words:
- the fraud twin contains a temporally meaningful order pattern that triggers a delayed positive label
- the benign twin contains comparable ingredients and prefix statistics but violates the fraud-relevant temporal order
This design is meant to prevent performance from arising from:
- longer histories
- older accounts
- later prefix positions
- different transaction totals
- unmatched prefix ages
- benign negatives evaluated at arbitrary or easier positions
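The prefix-truncation rule can be sketched as follows, using the `sender_id` and `timestamp` field names from this card's schema. The helper is illustrative; whether local index `k` itself is visible is a protocol detail, and this sketch includes events at local indices `0` through `k`.

```python
# Illustrative prefix-only view: no future events are visible past
# local index k. Field names follow this card's schema; the helper
# itself is an assumption, not repository code.
def prefix_view(events, sender_id, k):
    """Return sender_id's time-ordered history up to local index k."""
    history = sorted(
        (e for e in events if e["sender_id"] == sender_id),
        key=lambda e: e["timestamp"],
    )
    return history[: k + 1]  # events at local indices 0..k only
```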
## 7. Dataset Modes and Difficulty Ladder
Temporal Twins provides four modes.
### `oracle_calib`
This is the calibration mode used to validate that the matched-prefix protocol is working as intended.
- Oracle metrics remain near-perfect.
- Static shortcut baselines remain at chance.
- Benign motif hit rate remains zero.
- This mode is primarily for protocol validation rather than realistic difficulty.
### `easy`
- strong motif signal
- low noise
- shorter delay
- expected SeqGRU ROC-AUC near `0.90-1.00`
### `medium`
- moderate motif signal
- moderate noise
- longer delay
- expected SeqGRU ROC-AUC near `0.80-0.90`
### `hard`
- weaker motif signal
- longer delay
- adversarial perturbations and decoys
- expected SeqGRU ROC-AUC near `0.70-0.85`
Naming convention:
- in `oracle_calib`, `AuditOracle` and `RawMotifOracle` are true oracle-style references
- in standard `easy`, `medium`, and `hard`, the corresponding scores are reported as `MotifProbe` and `RawMotifProbe` because realism and noise make them probes rather than perfect oracles
## 8. Data Schema
The event table contains model-facing fields, supervision labels, and audit/oracle-only fields. The table below lists the most important columns used in this repository.
| Column name | Type | Description | Exposed to ordinary models? | Notes |
|---|---|---|---|---|
| `txn_id` | `int32` | Synthetic transaction identifier | Yes | Identifier only; not a benchmark target |
| `sender_id` | `int32` / `int64` | Synthetic sender account ID | Yes | Node identity available to temporal models |
| `receiver_id` | `int32` / `int64` | Synthetic receiver account ID | Yes | Used for graph and sequence structure |
| `timestamp` | `float32` | Synthetic event time in seconds from simulation start | Yes | Prefix truncation is based on timestamp and local index |
| `amount` | `float32` | Synthetic transaction amount | Yes | Not tied to real currency records |
| `txn_type` | `int8` | Synthetic transaction-type code | Yes | UPI-style categorical event attribute |
| `risk_score` | `float32` | Synthetic risk score from the risk engine | Yes | No real production risk model is used |
| `fail_prob` | `float32` | Synthetic failure probability | Yes | Risk-engine output |
| `failed` | `int8` | Binary failure indicator | Yes | Used as a normal model-facing field |
| `is_retry` | `int8` / derived | Retry-like event indicator | Yes | Available to ordinary models when present |
| `pair_freq` | `float32` / derived | Sender-receiver interaction-frequency feature | Yes | Derived from visible event history |
| `risk_noisy` | `float32` | Noisy synthetic risk feature | Yes | Benchmark feature, not an audit signal |
| `txn_count_10` | `float32` / derived | Recent-count feature over a short window | Yes | Derived from visible history |
| `amount_sum_10` | `float32` / derived | Recent amount-sum feature | Yes | Derived from visible history |
| `is_fraud` | `int8` | Binary fraud label | No | Supervision target only, not a model input |
| `twin_pair_id` | `int64` | Matched fraud/benign pair identifier | No | Audit/oracle-only; not exposed to learned baselines |
| `twin_role` | `string` | Twin role such as `fraud`, `benign`, or `background` | No | Audit/oracle-only |
| `twin_label` | `int8` | Pairwise matched label for audit utilities | No | Audit/oracle-only |
| `template_id` | `int64` | Source template identifier used during twin construction | No | Audit/oracle-only |
| `dynamic_fraud_state` | `float32` | Latent synthetic fraud-state variable | No | Hidden mechanism for analysis only |
| `motif_source` | `int8` | Indicator for motif-source events in a sequence | No | Audit/oracle-only |
| `motif_hit_count` | `int32` | Count of motif hits in the sequence | No | Audit/oracle-only |
| `trigger_event_idx` | `int32` | Local event index of the trigger event | No | Audit/oracle-only |
| `label_event_idx` | `int32` | Local event index at which the fraud label becomes active | No | Audit/oracle-only |
| `label_delay` | `int32` | Delay between trigger and labeled event index | No | Audit/oracle-only |
| `fraud_source` | `string` | Cause of fraud label, e.g. motif or fallback chain | No | Audit/oracle-only |
| `is_fallback_label` | `int8` | Indicator that a label came from fallback logic | No | Audit/oracle-only |
| `motif_chain_state` | `float32` | Internal motif-chain analysis field | No | Audit/oracle-only |
| `motif_strength` | `float32` | Internal motif-strength analysis field | No | Audit/oracle-only |
Not every baseline uses every model-facing column. The important guarantee is that learned baselines do not receive the audit/oracle-only fields listed above.
## 9. Model-Facing vs Audit/Oracle-Only Columns
Ordinary learned baselines are restricted to model-facing transaction attributes and histories. In this repository, audit/oracle-only columns are explicitly stripped before learned baselines are trained or evaluated.
Ordinary models may use fields such as:
- `sender_id`
- `receiver_id`
- `timestamp`
- `amount`
- `risk_score`
- `fail_prob`
- `failed`
- `txn_type`
- other derived non-oracle features built from visible prefix history
Ordinary models must not use:
- `motif_hit_count`
- `motif_source`
- `trigger_event_idx`
- `label_event_idx`
- `label_delay`
- `fraud_source`
- `twin_role`
- `twin_label`
- `twin_pair_id`
- `template_id`
- `dynamic_fraud_state`
- other oracle-only diagnostics
This separation is necessary for the benchmark claim that performance should come from temporal reasoning rather than privileged audit information.
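This separation can be enforced with a simple column firewall before training. The sketch below mirrors the column lists above; it is illustrative rather than the repository's actual stripping code. Note that `is_fraud` is also removed from model inputs because it is the supervision target, handled separately.

```python
# Illustrative audit-column firewall: drop oracle-only fields (and the
# label) from any row a learned baseline sees as input. Column names
# mirror this card; the helper itself is an assumption.
AUDIT_ONLY = {
    "is_fraud",  # supervision target, never a model input
    "twin_pair_id", "twin_role", "twin_label", "template_id",
    "dynamic_fraud_state", "motif_source", "motif_hit_count",
    "trigger_event_idx", "label_event_idx", "label_delay",
    "fraud_source", "is_fallback_label",
    "motif_chain_state", "motif_strength",
}

def model_facing(row):
    """Return a copy of `row` with all audit/oracle-only fields removed."""
    return {k: v for k, v in row.items() if k not in AUDIT_ONLY}
```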
## 10. Benchmark Tasks
Temporal Twins supports the following benchmark task:
- binary fraud detection on matched prefix examples
The standard evaluation protocol is:
- build matched fraud/benign examples
- truncate each sender history at the matched prefix index `k`
- train or score on the visible prefix only
- evaluate binary discrimination at the matched example level
Primary reported metrics include:
- ROC-AUC
- PR-AUC
- shuffled-order ROC-AUC
- shuffle delta = shuffled ROC-AUC minus clean ROC-AUC
The shuffled-order test is important: it measures how much performance depends on event order rather than unordered ingredients.
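The shuffle delta can be computed with a small rank-based ROC-AUC helper (average ranks for ties, per the Mann-Whitney formulation). This is an illustrative pure-Python implementation, not the repository's evaluation code.

```python
# Illustrative shuffle-delta computation: shuffled-order ROC-AUC minus
# clean-order ROC-AUC, with a rank-based AUC helper. Not repository code.
def roc_auc(labels, scores):
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1                      # group tied scores
        avg = (i + j) / 2 + 1           # average 1-based rank over the tie
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    pos = [r for r, y in zip(ranks, labels) if y == 1]
    n_pos, n_neg = len(pos), len(labels) - len(pos)
    # Mann-Whitney U statistic normalized to [0, 1]
    return (sum(pos) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def shuffle_delta(labels, clean_scores, shuffled_scores):
    """Shuffled-order ROC-AUC minus clean-order ROC-AUC."""
    return roc_auc(labels, shuffled_scores) - roc_auc(labels, clean_scores)
```

A model whose clean-order AUC is 1.0 but whose scores collapse to chance under shuffling yields a delta near `-0.5`, matching the sign convention in the reference tables.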
## 11. Baselines and Reference Results
The frozen 5-seed paper suite uses:
- `num_users = 350`
- `simulation_days = 45`
- `seeds = [0, 1, 2, 3, 4]`
- `fast_mode = false`
- `n_checkpoints = 8`
Compact reference results:
| Mode | Primary reference | Secondary reference | XGBoost ROC-AUC | StaticGNN ROC-AUC | SeqGRU ROC-AUC | SeqGRU shuffled delta |
|---|---:|---:|---:|---:|---:|---:|
| `oracle_calib` | `AuditOracle 1.0000 +- 0.0000` | `RawMotifOracle 1.0000 +- 0.0000` | `0.5000 +- 0.0000` | `0.5222 +- 0.0235` | `1.0000 +- 0.0000` | `-0.5032 +- 0.0043` |
| `easy` | `MotifProbe 1.0000 +- 0.0000` | `RawMotifProbe 0.9983 +- 0.0011` | `0.5000 +- 0.0000` | `0.4946 +- 0.0128` | `1.0000 +- 0.0000` | `-0.5003 +- 0.0096` |
| `medium` | `MotifProbe 0.6374 +- 0.0069` | `RawMotifProbe 0.6482 +- 0.0058` | `0.5000 +- 0.0000` | `0.4922 +- 0.0203` | `0.8391 +- 0.0174` | `-0.3337 +- 0.0191` |
| `hard` | `MotifProbe 0.5790 +- 0.0045` | `RawMotifProbe 0.5910 +- 0.0105` | `0.5000 +- 0.0000` | `0.5026 +- 0.0198` | `0.6876 +- 0.0128` | `-0.1883 +- 0.0111` |
Static shortcut audit across all 20 paper-suite runs:
- `static_agg_auc = 0.5000 +- 0.0000`
- `total_txn_count AUC = 0.5000 +- 0.0000`
- `local_event_idx AUC = 0.5000 +- 0.0000`
- `prefix_txn_count AUC = 0.5000 +- 0.0000`
- `timestamp AUC = 0.5000 +- 0.0000`
- `account_age AUC = 0.5000 +- 0.0000`
- `active_age AUC = 0.5000 +- 0.0000`
- `benign_motif_hit_rate = 0.0000 +- 0.0000`
These results support the intended interpretation:
- static shortcuts are neutralized
- `oracle_calib` validates matched-prefix correctness
- `easy` is readily learnable by order-sensitive sequence models
- `medium` remains learnable but meaningfully harder
- `hard` remains above static baselines but is substantially more challenging
Full paper-suite artifacts, including temporal GNN results and per-seed CSVs, are stored under:
- `results/paper_suite_20260503_202810/`
## 12. Intended Use
This dataset is intended for:
- research on temporal fraud detection
- benchmarking order-sensitive sequence and temporal-graph models
- evaluating whether performance survives matched static controls
- studying delayed labels and prefix-only evaluation
- comparing clean-order and shuffled-order performance
It is appropriate for methodology papers, controlled ablation studies, and robustness checks on temporal inductive bias.
## 13. Out-of-Scope Use
Temporal Twins is out of scope for:
- direct training of production fraud systems
- making real financial, banking, or payment decisions
- approving or denying transactions for real users
- risk-scoring real individuals or organizations
- regulatory, legal, or operational decisions in production financial systems
The dataset must not be used to train production fraud systems directly or to make real financial decisions.
## 14. Limitations
Important limitations include:
- the benchmark is fully synthetic and reflects designer assumptions
- user behavior, fraud behavior, and benign behavior are simplified relative to real financial ecosystems
- the only ground truth is the generator's own labeling logic
- real-world fraud often depends on richer institutional, device, merchant, and social context not present here
- difficulty levels are benchmark design choices, not calibrated measures of real operational difficulty
- temporal GNN underperformance on this benchmark should not be generalized to all real fraud settings
## 15. Biases and Risks
As a synthetic benchmark, Temporal Twins inherits the modeling biases of its generator:
- it emphasizes order-sensitive motifs chosen by the benchmark designers
- it encodes a particular notion of delayed fraud and camouflage
- it may reward models that are well aligned to these synthetic mechanisms
- it may underrepresent other real fraud styles not captured by the generator
There is also a scientific risk:
- because the benchmark intentionally removes common static shortcuts, performance on Temporal Twins may differ from performance on operational datasets where those shortcuts exist, for better or worse
## 16. Privacy and Sensitive Data
Temporal Twins contains no real financial or personal data.
Specifically:
- no real UPI data
- no real users
- no real bank accounts
- no real transactions
- no personal financial records
- no protected demographic attributes
All user IDs, receiver IDs, timestamps, amounts, and risk signals are synthetic artifacts produced by the generator.
## 17. Ethical Considerations
Temporal Twins is safer to share than real financial logs because it does not contain real persons or institutions. However, ethical care is still needed.
Users of the dataset should not:
- present synthetic results as direct evidence of production readiness
- claim fairness or social validity that has not been tested on real populations
- use the dataset as justification for automated decisions about real people
The intended ethical use is research benchmarking, not operational deployment.
## 18. Reproducibility
The repository includes deterministic generation and evaluation settings for the frozen paper suite.
Paper-suite configuration:
- `num_users = 350`
- `simulation_days = 45`
- `seeds = [0, 1, 2, 3, 4]`
- `fast_mode = false`
- `n_checkpoints = 8`
Reproducibility properties:
- stable deterministic seed derivation is used for benchmark modes and profiles
- Python, NumPy, and PyTorch seeds are fixed per run
- deterministic runtime flags are enabled where safe
- matched-prefix datasets are reproducible under fixed config and seed
- the final paper suite in this repository is stored as deterministic CSV artifacts
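Stable per-(mode, seed) derivation can be sketched with a content hash so the mapping does not depend on Python's per-process hash salt. The scheme and the `temporal_twins` salt string below are assumptions for illustration, not the repository's exact derivation.

```python
# Illustrative stable seed derivation: hash (mode, base_seed) into a
# 32-bit RNG seed identically across processes. Scheme is hypothetical.
import hashlib

def derive_seed(mode: str, base_seed: int) -> int:
    digest = hashlib.sha256(
        f"temporal_twins|{mode}|{base_seed}".encode()
    ).digest()
    return int.from_bytes(digest[:4], "big")  # stable 32-bit seed

# Same inputs always map to the same seed; modes are decorrelated.
assert derive_seed("easy", 0) == derive_seed("easy", 0)
assert derive_seed("easy", 0) != derive_seed("hard", 0)
```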
Reference artifacts:
- `results/paper_suite_20260503_202810/paper_suite_runs.csv`
- `results/paper_suite_20260503_202810/paper_suite_summary.csv`
- `results/paper_suite_20260503_202810/paper_suite_runtime.csv`
- `results/paper_suite_20260503_202810/paper_suite_failed_checks.csv`
## 19. Hosting, License, and Citation
### Hosting
The benchmark is generated from code in this repository; a canonical pre-generated release archive is also hosted alongside the code.
Current status:
- dataset hosting location: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins)
- code repository: [https://huggingface.co/temporal-twins-benchmark/temporal-twins-code](https://huggingface.co/temporal-twins-benchmark/temporal-twins-code)
- canonical pre-generated release archive: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins)
- Croissant metadata URL: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/raw/main/metadata/temporal_twins_croissant.json)
- Croissant metadata browser page: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/blob/main/metadata/temporal_twins_croissant.json](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/blob/main/metadata/temporal_twins_croissant.json)
- data files: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/data](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/data)
- results: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/results](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/results)
- configs: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/configs](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/configs)
- metadata directory: [https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/metadata](https://huggingface.co/datasets/temporal-twins-benchmark/temporal-twins/tree/main/metadata)
- paper or preprint: Not available during double-blind review; to be added after publication.
- reference paper-suite results: `results/paper_suite_20260503_202810/`
The Croissant file passes JSON, schema, and Responsible AI validation in the official checker. The optional records-generation test currently reports a known Parquet-in-zip streaming issue; direct pandas/pyarrow loading instructions are provided in `data/README_GENERATION.md`.
### License
- Dataset license: `CC BY 4.0` (`CC-BY-4.0`)
- Code license: `Apache License 2.0` (`Apache-2.0`)
### Citation
The final paper or preprint citation is not available during double-blind review and will be added after publication.
`TODO` placeholder BibTeX:
```bibtex
@dataset{temporal_twins_todo,
title = {Temporal Twins: A Synthetic UPI-Style Benchmark for Temporal Fraud Detection},
author = {TODO},
year = {TODO},
howpublished = {TODO},
note = {Synthetic matched-prefix temporal fraud benchmark},
url = {TODO}
}
```