Paper Gate Interpretation for Temporal Twins

This document is the paper-facing interpretation layer for the final Temporal Twins suite. It does not alter the raw diagnostic output in results/paper_suite_20260503_202810/paper_suite_failed_checks.csv. Instead, it separates the checks that are true hard gates from those that are descriptive or advisory findings for the paper.

Gate Categories

A. Hard Gates for `oracle_calib`

These are true benchmark validity gates for temporal_twins_oracle_calib:

matched_eval_pairs >= required threshold
positive_rate = 0.5
benign_motif_hit_rate = 0
static_agg_auc near 0.5
shortcut AUCs near 0.5
XGBoost near 0.5
StaticGNN near chance
AuditOracle near 1.0
RawMotifOracle near 1.0
SeqGRU high
SeqGRU shuffle delta strongly negative

B. Hard Gates for Standard `easy` / `medium` / `hard`

For the standard temporal_twins difficulty ladder, the hard gates are the matched static-control checks:

matched_eval_pairs >= required threshold
positive_rate = 0.5
benign_motif_hit_rate = 0
static_agg_auc near 0.5
shortcut AUCs near 0.5
XGBoost near 0.5
StaticGNN near chance

These conditions verify that the benchmark remains shortcut-resistant and that fraud and benign twins are properly matched at evaluation.
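The "near 0.5" gates can be read as a tolerance band around chance. The following is a minimal sketch under that reading, applied to the mean AUCs reported in the status sections of this document. The tolerance value and the names `ASSUMED_TOLERANCE` and `near_chance` are assumptions for illustration; this document does not state the actual threshold.

```python
# Hypothetical tolerance: the source says "near 0.5" without giving a number.
CHANCE_AUC = 0.5
ASSUMED_TOLERANCE = 0.05


def near_chance(auc: float, tol: float = ASSUMED_TOLERANCE) -> bool:
    """Return True when an AUC lies within `tol` of chance (0.5)."""
    return abs(auc - CHANCE_AUC) <= tol


# Mean AUCs copied from the status sections of this document.
reported_static_aucs = {
    "oracle_calib": {"XGBoost": 0.5000, "StaticGNN": 0.5222},
    "easy": {"XGBoost": 0.5000, "StaticGNN": 0.4946},
    "medium": {"XGBoost": 0.5000, "StaticGNN": 0.4922},
    "hard": {"XGBoost": 0.5000, "StaticGNN": 0.5026},
}

# A benchmark passes this sketch when every static model is near chance.
static_gate_ok = {
    benchmark: all(near_chance(auc) for auc in models.values())
    for benchmark, models in reported_static_aucs.items()
}

for benchmark, ok in static_gate_ok.items():
    print(f"{benchmark}: static models near chance = {ok}")
```

A tighter assumed tolerance would change the outcome for individual rows, so the band should be taken from the suite configuration rather than from this sketch.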

C. Advisory / Descriptive Checks for Standard `easy` / `medium` / `hard`

The following are not hard validity gates for the standard difficulty ladder:

MotifProbe
RawMotifProbe
SeqGRU difficulty trend
SeqGRU shuffle delta
temporal-GNN performance
temporal-GNN shuffle delta

These measurements are descriptive benchmark outcomes. They characterize difficulty and inductive bias; they do not determine whether the dataset itself is valid.
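The three categories above amount to a lookup keyed on benchmark and check name. A minimal sketch of that lookup follows. The identifier spellings (for example `SeqGRU_shuffle_delta`) and the function `gate_category` are hypothetical and may not match the keys used in the raw CSV.

```python
STANDARD_BENCHMARKS = frozenset({"easy", "medium", "hard"})

# Illustrative spellings of the check names listed in this document.
STATIC_CONTROL_GATES = frozenset({
    "matched_eval_pairs", "positive_rate", "benign_motif_hit_rate",
    "static_agg_auc", "shortcut_auc", "XGBoost", "StaticGNN",
})
# Hard only for oracle_calib (category A).
ORACLE_CALIB_EXTRA_GATES = frozenset({
    "AuditOracle", "RawMotifOracle", "SeqGRU", "SeqGRU_shuffle_delta",
})
# Advisory or descriptive checks (category C, plus temporal-GNN results).
ADVISORY_CHECKS = frozenset({
    "MotifProbe", "RawMotifProbe", "SeqGRU_difficulty_trend",
    "SeqGRU_shuffle_delta", "temporal_gnn_performance",
    "temporal_gnn_shuffle_delta",
})


def gate_category(benchmark: str, check: str) -> str:
    """Return "hard" or "advisory" for a (benchmark, check) pair."""
    if benchmark != "oracle_calib" and benchmark not in STANDARD_BENCHMARKS:
        raise ValueError(f"unknown benchmark: {benchmark}")
    if check in STATIC_CONTROL_GATES:
        return "hard"
    if benchmark == "oracle_calib" and check in ORACLE_CALIB_EXTRA_GATES:
        return "hard"
    # Assumption: oracle and SeqGRU checks are treated as advisory on the
    # standard ladder, since they are not listed there as hard gates.
    if check in ADVISORY_CHECKS or check in ORACLE_CALIB_EXTRA_GATES:
        return "advisory"
    raise ValueError(f"unknown check: {check}")


print(gate_category("oracle_calib", "SeqGRU_shuffle_delta"))  # hard
print(gate_category("medium", "SeqGRU_shuffle_delta"))        # advisory
```

The same check name can fall into different categories depending on the benchmark, which is why the lookup takes both arguments.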

Reclassified Final Paper-Suite Status

`oracle_calib`

hard gate passes: 5/5
AuditOracle = 1.0000 ± 0.0000
RawMotifOracle = 1.0000 ± 0.0000
XGBoost = 0.5000 ± 0.0000
StaticGNN = 0.5222 ± 0.0235
SeqGRU = 1.0000 ± 0.0000
SeqGRU delta = -0.5032 ± 0.0043

Interpretation: oracle_calib passes the intended hard benchmark validation. The oracle/probe alignment is correct, static shortcuts are eliminated, and a causal sequence model recovers the signal with a large negative shuffle delta.

`easy`

static-control hard gates pass: 5/5
XGBoost = 0.5000 ± 0.0000
StaticGNN = 0.4946 ± 0.0128
SeqGRU = 1.0000 ± 0.0000
SeqGRU delta = -0.5003 ± 0.0096

Interpretation: easy is a valid standard benchmark slice. Static shortcuts remain suppressed, and the temporal sequence signal is strong.

`medium`

static-control hard gates pass: 5/5
XGBoost = 0.5000 ± 0.0000
StaticGNN = 0.4922 ± 0.0203
SeqGRU = 0.8391 ± 0.0174
SeqGRU delta = -0.3337 ± 0.0191
MotifProbe and RawMotifProbe are lower by design and should not be treated as hard-gate failures

Interpretation: medium is not a failed dataset. It passes the static-control hard gates and shows the intended increase in temporal difficulty.

`hard`

static-control hard gates pass: 5/5
XGBoost = 0.5000 ± 0.0000
StaticGNN = 0.5026 ± 0.0198
SeqGRU = 0.6876 ± 0.0128
SeqGRU delta = -0.1883 ± 0.0111
lower probe and SeqGRU scores reflect intended difficulty

Interpretation: hard is not a failed dataset. It passes the static-control hard gates and intentionally weakens temporal recoverability relative to easy and medium.
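As a worked check on the shuffle deltas reported above, the sketch below computes the SeqGRU AUC implied after shuffling. It assumes the delta is defined as shuffled AUC minus unshuffled AUC and that both means are taken over the same runs; this document does not state the definition, so the result is illustrative only.

```python
# (mean SeqGRU AUC, mean SeqGRU shuffle delta) copied from this document.
seqgru_reported = {
    "oracle_calib": (1.0000, -0.5032),
    "easy": (1.0000, -0.5003),
    "medium": (0.8391, -0.3337),
    "hard": (0.6876, -0.1883),
}

# Assumption: delta = AUC(shuffled order) - AUC(original order),
# so the implied shuffled AUC is the reported AUC plus the delta.
implied_shuffled_auc = {
    name: round(auc + delta, 4)
    for name, (auc, delta) in seqgru_reported.items()
}

for name, value in implied_shuffled_auc.items():
    print(f"{name}: implied shuffled SeqGRU AUC = {value:.4f}")
```

Under that assumed definition, every implied shuffled AUC lands within about 0.006 of 0.5, which is consistent with the signal residing in event order rather than in order-free features.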

Reclassified Status Table

Benchmark	Static-control hard gates	Probe/oracle status	SeqGRU status	Temporal-GNN status	Paper interpretation
`oracle_calib`	Pass `5/5`	`AuditOracle` and `RawMotifOracle` both near `1.0`; oracle behavior validated	`1.0000 ± 0.0000`, delta `-0.5032 ± 0.0043`	Underperformance is advisory only	Valid calibration benchmark with correct motif-label alignment and dead static shortcuts
`easy`	Pass `5/5`	`MotifProbe` / `RawMotifProbe` high, descriptive	`1.0000 ± 0.0000`, delta `-0.5003 ± 0.0096`	Advisory underperformance only	Valid standard benchmark with strong temporal signal
`medium`	Pass `5/5`	Lower probes are expected and descriptive, not failures	`0.8391 ± 0.0174`, delta `-0.3337 ± 0.0191`	Advisory underperformance only	Valid medium-difficulty benchmark with increased temporal challenge
`hard`	Pass `5/5`	Lower probes are expected and descriptive, not failures	`0.6876 ± 0.0128`, delta `-0.1883 ± 0.0111`	Advisory underperformance only	Valid hard benchmark with intentionally reduced temporal recoverability

Temporal-GNN shortfalls on advisory checks are not benchmark failures. They support the paper's finding that current temporal GNNs may not exploit order-sensitive temporal structure as effectively as a causal sequence model under matched static controls.

medium and hard are not failed datasets. They are intended difficulty levels in the Temporal Twins ladder. Their lower MotifProbe, RawMotifProbe, and SeqGRU values show increasing temporal difficulty rather than invalid benchmark construction.
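The difficulty trend described above can be confirmed directly from the reported means. This short sketch checks that SeqGRU AUC falls and the shuffle delta shrinks in magnitude from easy to hard. It is a descriptive check, not a gate, and the variable names are illustrative.

```python
# (mean SeqGRU AUC, mean SeqGRU shuffle delta) copied from this document.
seqgru_by_level = {
    "easy": (1.0000, -0.5003),
    "medium": (0.8391, -0.3337),
    "hard": (0.6876, -0.1883),
}
ladder = ["easy", "medium", "hard"]

aucs = [seqgru_by_level[level][0] for level in ladder]
deltas = [seqgru_by_level[level][1] for level in ladder]

# Strictly decreasing AUC along the ladder.
auc_decreasing = all(a > b for a, b in zip(aucs, aucs[1:]))
# Shuffle delta moves toward zero as difficulty increases.
delta_shrinking = all(abs(a) > abs(b) for a, b in zip(deltas, deltas[1:]))

print(f"SeqGRU AUC decreases along the ladder: {auc_decreasing}")
print(f"Shuffle delta magnitude shrinks along the ladder: {delta_shrinking}")
```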

Notes on the Raw Diagnostic File

results/paper_suite_20260503_202810/paper_suite_failed_checks.csv is retained unchanged as the raw diagnostic output.
The raw file still reflects older gate semantics in which standard-mode probe thresholds and temporal-GNN thresholds appeared in failure columns.
This document is the corrected paper-facing interpretation layer and should be cited when describing benchmark validity in the manuscript.
