# Paper Gate Interpretation for Temporal Twins

This document is the paper-facing interpretation layer for the final Temporal Twins suite. It does **not** alter the raw diagnostic output in `results/paper_suite_20260503_202810/paper_suite_failed_checks.csv`. Instead, it reclassifies which checks are true hard gates versus descriptive or advisory findings for the paper.

## Gate Categories

### A. Hard Gates for `oracle_calib`

These are true benchmark validity gates for `temporal_twins_oracle_calib`:

- `matched_eval_pairs >=` required threshold
- `positive_rate = 0.5`
- `benign_motif_hit_rate = 0`
- `static_agg_auc` near `0.5`
- shortcut AUCs near `0.5`
- `XGBoost` near `0.5`
- `StaticGNN` near chance
- `AuditOracle` near `1.0`
- `RawMotifOracle` near `1.0`
- `SeqGRU` high
- `SeqGRU` shuffle delta strongly negative

### B. Hard Gates for Standard `easy` / `medium` / `hard`

For the standard `temporal_twins` difficulty ladder, the hard gates are the matched static-control checks:

- `matched_eval_pairs >=` required threshold
- `positive_rate = 0.5`
- `benign_motif_hit_rate = 0`
- `static_agg_auc` near `0.5`
- shortcut AUCs near `0.5`
- `XGBoost` near `0.5`
- `StaticGNN` near chance

These conditions verify that the benchmark remains shortcut-resistant and that fraud and benign twins are properly matched at evaluation.

### C. Advisory / Descriptive Checks for Standard `easy` / `medium` / `hard`

The following are **not** hard validity gates for the standard difficulty ladder:

- `MotifProbe`
- `RawMotifProbe`
- `SeqGRU` difficulty trend
- `SeqGRU` shuffle delta
- temporal-GNN performance
- temporal-GNN shuffle delta

These measurements are descriptive benchmark outcomes. They characterize difficulty and inductive bias; they do not determine whether the dataset itself is valid.
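The static-control hard-gate logic in categories A and B can be sketched as a simple predicate over the reported metrics. The function name, metric keys, tolerance, and pair threshold below are hypothetical illustrations, not the suite's actual implementation:

```python
# Illustrative sketch of the static-control hard gates (category B).
# All names and the tolerance/threshold values are assumptions, not the
# suite's real code or configuration.

NEAR_CHANCE_TOL = 0.05  # assumed tolerance for "near 0.5" / "near chance"


def passes_static_control_gates(metrics, min_pairs=1000):
    """Return True iff every matched static-control hard gate passes.

    `metrics` maps check names to values; `min_pairs` is the assumed
    matched-eval-pair threshold.
    """
    near = lambda value, target: abs(value - target) <= NEAR_CHANCE_TOL
    return (
        metrics["matched_eval_pairs"] >= min_pairs
        and metrics["positive_rate"] == 0.5
        and metrics["benign_motif_hit_rate"] == 0
        and near(metrics["static_agg_auc"], 0.5)
        and near(metrics["xgboost_auc"], 0.5)
        and near(metrics["static_gnn_auc"], 0.5)
    )
```

The `oracle_calib` gates (category A) would add the oracle and `SeqGRU` conditions on top of the same predicate.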
## Reclassified Final Paper-Suite Status

### `oracle_calib`

- hard gate passes: `5/5`
- `AuditOracle = 1.0000 ± 0.0000`
- `RawMotifOracle = 1.0000 ± 0.0000`
- `XGBoost = 0.5000 ± 0.0000`
- `StaticGNN = 0.5222 ± 0.0235`
- `SeqGRU = 1.0000 ± 0.0000`
- `SeqGRU delta = -0.5032 ± 0.0043`

Interpretation: `oracle_calib` passes the intended hard benchmark validation. The oracle/probe alignment is correct, static shortcuts are eliminated, and a causal sequence model recovers the signal with a large negative shuffle delta.

### `easy`

- static-control hard gates pass: `5/5`
- `XGBoost = 0.5000 ± 0.0000`
- `StaticGNN = 0.4946 ± 0.0128`
- `SeqGRU = 1.0000 ± 0.0000`
- `SeqGRU delta = -0.5003 ± 0.0096`

Interpretation: `easy` is a valid standard benchmark slice. Static shortcuts remain suppressed, and the temporal sequence signal is strong.

### `medium`

- static-control hard gates pass: `5/5`
- `XGBoost = 0.5000 ± 0.0000`
- `StaticGNN = 0.4922 ± 0.0203`
- `SeqGRU = 0.8391 ± 0.0174`
- `SeqGRU delta = -0.3337 ± 0.0191`
- `MotifProbe` and `RawMotifProbe` are lower by design and should **not** be treated as hard-gate failures

Interpretation: `medium` is **not** a failed dataset. It passes the static-control hard gates and shows the intended increase in temporal difficulty.

### `hard`

- static-control hard gates pass: `5/5`
- `XGBoost = 0.5000 ± 0.0000`
- `StaticGNN = 0.5026 ± 0.0198`
- `SeqGRU = 0.6876 ± 0.0128`
- `SeqGRU delta = -0.1883 ± 0.0111`
- lower probe and SeqGRU scores reflect intended difficulty

Interpretation: `hard` is **not** a failed dataset. It passes the static-control hard gates and intentionally weakens temporal recoverability relative to `easy` and `medium`.
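The reported shuffle deltas are consistent with defining the delta as the AUC on order-shuffled sequences minus the AUC on the original sequences; this convention is an assumption about the suite, not something stated in the raw output. Under it, a strongly negative delta means the model's score collapses toward chance once event order is destroyed:

```python
# Assumed convention: delta = AUC(shuffled order) - AUC(original order).
# A strongly negative delta means the model depends on temporal order.

def shuffle_delta(original_auc, shuffled_auc):
    return shuffled_auc - original_auc
```

For example, the `oracle_calib` `SeqGRU` numbers (`1.0000` with a mean delta of `-0.5032`) imply a shuffled AUC of roughly `0.497`, i.e. chance performance once order is removed.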
## Reclassified Status Table

| Benchmark | Static-control hard gates | Probe/oracle status | SeqGRU status | Temporal-GNN status | Paper interpretation |
|---|---|---|---|---|---|
| `oracle_calib` | Pass `5/5` | `AuditOracle` and `RawMotifOracle` both near `1.0`; oracle behavior validated | `1.0000 ± 0.0000`, delta `-0.5032 ± 0.0043` | Underperformance is advisory only | Valid calibration benchmark with correct motif-label alignment and dead static shortcuts |
| `easy` | Pass `5/5` | `MotifProbe` / `RawMotifProbe` high, descriptive | `1.0000 ± 0.0000`, delta `-0.5003 ± 0.0096` | Advisory underperformance only | Valid standard benchmark with strong temporal signal |
| `medium` | Pass `5/5` | Lower probes are expected and descriptive, not failures | `0.8391 ± 0.0174`, delta `-0.3337 ± 0.0191` | Advisory underperformance only | Valid medium-difficulty benchmark with increased temporal challenge |
| `hard` | Pass `5/5` | Lower probes are expected and descriptive, not failures | `0.6876 ± 0.0128`, delta `-0.1883 ± 0.0111` | Advisory underperformance only | Valid hard benchmark with intentionally reduced temporal recoverability |

Temporal-GNN advisory failures are **not** benchmark failures. They support the paper finding that current temporal GNNs may not exploit order-sensitive temporal structure as effectively as a causal sequence model under matched static controls.

`medium` and `hard` are **not** failed datasets. They are intended difficulty levels in the Temporal Twins ladder. Their lower `MotifProbe`, `RawMotifProbe`, and `SeqGRU` values show increasing temporal difficulty rather than invalid benchmark construction.

## Notes on the Raw Diagnostic File

- `results/paper_suite_20260503_202810/paper_suite_failed_checks.csv` is retained unchanged as the raw diagnostic output.
- The raw file still reflects older gate semantics in which standard-mode probe thresholds and temporal-GNN thresholds appeared in failure columns.
- This document is the corrected paper-facing interpretation layer and should be cited when describing benchmark validity in the manuscript.
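The reclassification applied by this document (raw diagnostic rows split into hard-gate failures versus advisory findings) can be sketched as follows. The advisory check names and the row schema (a `check` field per row) are assumptions for illustration; the real CSV's columns may differ:

```python
# Sketch of the reclassification layer: split raw failed-check rows into
# hard-gate failures vs advisory/descriptive findings (category C above).
# Check names and the "check" field are assumed, not the file's real schema.

ADVISORY_CHECKS = {
    "MotifProbe",
    "RawMotifProbe",
    "SeqGRU_difficulty_trend",
    "SeqGRU_shuffle_delta",
    "TemporalGNN",
    "TemporalGNN_shuffle_delta",
}


def reclassify(rows):
    """Partition failed-check rows: (hard-gate failures, advisory findings)."""
    hard, advisory = [], []
    for row in rows:
        (advisory if row["check"] in ADVISORY_CHECKS else hard).append(row)
    return hard, advisory
```

Under this split, a standard-ladder slice counts as a failed dataset only if `hard` is non-empty; advisory rows feed the descriptive difficulty analysis instead.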