Paper Gate Interpretation for Temporal Twins
This document is the paper-facing interpretation layer for the final Temporal Twins suite. It does not alter the raw diagnostic output in results/paper_suite_20260503_202810/paper_suite_failed_checks.csv. Instead, it reclassifies which checks are true hard gates versus descriptive or advisory findings for the paper.
Gate Categories
A. Hard Gates for oracle_calib
These are true benchmark validity gates for temporal_twins_oracle_calib:
matched_eval_pairs >=required thresholdpositive_rate = 0.5benign_motif_hit_rate = 0static_agg_aucnear0.5- shortcut AUCs near
0.5 XGBoostnear0.5StaticGNNnear chanceAuditOraclenear1.0RawMotifOraclenear1.0SeqGRUhighSeqGRUshuffle delta strongly negative
B. Hard Gates for Standard easy / medium / hard
For the standard temporal_twins difficulty ladder, the hard gates are the matched static-control checks:
matched_eval_pairs >=required thresholdpositive_rate = 0.5benign_motif_hit_rate = 0static_agg_aucnear0.5- shortcut AUCs near
0.5 XGBoostnear0.5StaticGNNnear chance
These conditions verify that the benchmark remains shortcut-resistant and that fraud and benign twins are properly matched at evaluation.
C. Advisory / Descriptive Checks for Standard easy / medium / hard
The following are not hard validity gates for the standard difficulty ladder:
MotifProbeRawMotifProbeSeqGRUdifficulty trendSeqGRUshuffle delta- temporal-GNN performance
- temporal-GNN shuffle delta
These measurements are descriptive benchmark outcomes. They characterize difficulty and inductive bias; they do not determine whether the dataset itself is valid.
Reclassified Final Paper-Suite Status
oracle_calib
- hard gate passes:
5/5 AuditOracle = 1.0000 ± 0.0000RawMotifOracle = 1.0000 ± 0.0000XGBoost = 0.5000 ± 0.0000StaticGNN = 0.5222 ± 0.0235SeqGRU = 1.0000 ± 0.0000SeqGRU delta = -0.5032 ± 0.0043
Interpretation: oracle_calib passes the intended hard benchmark validation. The oracle/probe alignment is correct, static shortcuts are eliminated, and a causal sequence model recovers the signal with a large negative shuffle delta.
easy
- static-control hard gates pass:
5/5 XGBoost = 0.5000 ± 0.0000StaticGNN = 0.4946 ± 0.0128SeqGRU = 1.0000 ± 0.0000SeqGRU delta = -0.5003 ± 0.0096
Interpretation: easy is a valid standard benchmark slice. Static shortcuts remain suppressed, and the temporal sequence signal is strong.
medium
- static-control hard gates pass:
5/5 XGBoost = 0.5000 ± 0.0000StaticGNN = 0.4922 ± 0.0203SeqGRU = 0.8391 ± 0.0174SeqGRU delta = -0.3337 ± 0.0191MotifProbeandRawMotifProbeare lower by design and should not be treated as hard-gate failures
Interpretation: medium is not a failed dataset. It passes the static-control hard gates and shows the intended increase in temporal difficulty.
hard
- static-control hard gates pass:
5/5 XGBoost = 0.5000 ± 0.0000StaticGNN = 0.5026 ± 0.0198SeqGRU = 0.6876 ± 0.0128SeqGRU delta = -0.1883 ± 0.0111- lower probe and SeqGRU scores reflect intended difficulty
Interpretation: hard is not a failed dataset. It passes the static-control hard gates and intentionally weakens temporal recoverability relative to easy and medium.
Reclassified Status Table
| Benchmark | Static-control hard gates | Probe/oracle status | SeqGRU status | Temporal-GNN status | Paper interpretation |
|---|---|---|---|---|---|
oracle_calib |
Pass 5/5 |
AuditOracle and RawMotifOracle both near 1.0; oracle behavior validated |
1.0000 ± 0.0000, delta -0.5032 ± 0.0043 |
Underperformance is advisory only | Valid calibration benchmark with correct motif-label alignment and dead static shortcuts |
easy |
Pass 5/5 |
MotifProbe / RawMotifProbe high, descriptive |
1.0000 ± 0.0000, delta -0.5003 ± 0.0096 |
Advisory underperformance only | Valid standard benchmark with strong temporal signal |
medium |
Pass 5/5 |
Lower probes are expected and descriptive, not failures | 0.8391 ± 0.0174, delta -0.3337 ± 0.0191 |
Advisory underperformance only | Valid medium-difficulty benchmark with increased temporal challenge |
hard |
Pass 5/5 |
Lower probes are expected and descriptive, not failures | 0.6876 ± 0.0128, delta -0.1883 ± 0.0111 |
Advisory underperformance only | Valid hard benchmark with intentionally reduced temporal recoverability |
Temporal-GNN advisory failures are not benchmark failures. They support the paper finding that current temporal GNNs may not exploit order-sensitive temporal structure as effectively as a causal sequence model under matched static controls.
medium and hard are not failed datasets. They are intended difficulty levels in the Temporal Twins ladder. Their lower MotifProbe, RawMotifProbe, and SeqGRU values show increasing temporal difficulty rather than invalid benchmark construction.
Notes on the Raw Diagnostic File
results/paper_suite_20260503_202810/paper_suite_failed_checks.csvis retained unchanged as the raw diagnostic output.- The raw file still reflects older gate semantics in which standard-mode probe thresholds and temporal-GNN thresholds appeared in failure columns.
- This document is the corrected paper-facing interpretation layer and should be cited when describing benchmark validity in the manuscript.