| # Paper Gate Interpretation for Temporal Twins |
|
|
| This document is the paper-facing interpretation layer for the final Temporal Twins suite. It does **not** alter the raw diagnostic output in `results/paper_suite_20260503_202810/paper_suite_failed_checks.csv`. Instead, it reclassifies which checks are true hard gates versus descriptive or advisory findings for the paper. |
|
|
| ## Gate Categories |
|
|
| ### A. Hard Gates for `oracle_calib` |
| |
| These are true benchmark validity gates for `temporal_twins_oracle_calib`: |
|
|
| - `matched_eval_pairs >=` required threshold |
| - `positive_rate = 0.5` |
| - `benign_motif_hit_rate = 0` |
| - `static_agg_auc` near `0.5` |
| - shortcut AUCs near `0.5` |
| - `XGBoost` near `0.5` |
| - `StaticGNN` near chance |
| - `AuditOracle` near `1.0` |
| - `RawMotifOracle` near `1.0` |
| - `SeqGRU` high |
| - `SeqGRU` shuffle delta strongly negative |
|
|
| ### B. Hard Gates for Standard `easy` / `medium` / `hard` |
|
|
| For the standard `temporal_twins` difficulty ladder, the hard gates are the matched static-control checks: |
|
|
| - `matched_eval_pairs >=` required threshold |
| - `positive_rate = 0.5` |
| - `benign_motif_hit_rate = 0` |
| - `static_agg_auc` near `0.5` |
| - shortcut AUCs near `0.5` |
| - `XGBoost` near `0.5` |
| - `StaticGNN` near chance |
|
|
| These conditions verify that the benchmark remains shortcut-resistant and that fraud and benign twins are properly matched at evaluation. |
|
|
| ### C. Advisory / Descriptive Checks for Standard `easy` / `medium` / `hard` |
|
|
| The following are **not** hard validity gates for the standard difficulty ladder: |
|
|
| - `MotifProbe` |
| - `RawMotifProbe` |
| - `SeqGRU` difficulty trend |
| - `SeqGRU` shuffle delta |
| - temporal-GNN performance |
| - temporal-GNN shuffle delta |
|
|
| These measurements are descriptive benchmark outcomes. They characterize difficulty and inductive bias; they do not determine whether the dataset itself is valid. |
|
|
| ## Reclassified Final Paper-Suite Status |
|
|
| ### `oracle_calib` |
| |
| - hard gate passes: `5/5` |
| - `AuditOracle = 1.0000 ± 0.0000` |
| - `RawMotifOracle = 1.0000 ± 0.0000` |
| - `XGBoost = 0.5000 ± 0.0000` |
| - `StaticGNN = 0.5222 ± 0.0235` |
| - `SeqGRU = 1.0000 ± 0.0000` |
| - `SeqGRU delta = -0.5032 ± 0.0043` |
| |
| Interpretation: `oracle_calib` passes the intended hard benchmark validation. The oracle/probe alignment is correct, static shortcuts are eliminated, and a causal sequence model recovers the signal with a large negative shuffle delta. |
|
|
| ### `easy` |
|
|
| - static-control hard gates pass: `5/5` |
| - `XGBoost = 0.5000 ± 0.0000` |
| - `StaticGNN = 0.4946 ± 0.0128` |
| - `SeqGRU = 1.0000 ± 0.0000` |
| - `SeqGRU delta = -0.5003 ± 0.0096` |
|
|
| Interpretation: `easy` is a valid standard benchmark slice. Static shortcuts remain suppressed, and the temporal sequence signal is strong. |
|
|
| ### `medium` |
|
|
| - static-control hard gates pass: `5/5` |
| - `XGBoost = 0.5000 ± 0.0000` |
| - `StaticGNN = 0.4922 ± 0.0203` |
| - `SeqGRU = 0.8391 ± 0.0174` |
| - `SeqGRU delta = -0.3337 ± 0.0191` |
| - `MotifProbe` and `RawMotifProbe` are lower by design and should **not** be treated as hard-gate failures |
|
|
| Interpretation: `medium` is **not** a failed dataset. It passes the static-control hard gates and shows the intended increase in temporal difficulty. |
|
|
| ### `hard` |
|
|
| - static-control hard gates pass: `5/5` |
| - `XGBoost = 0.5000 ± 0.0000` |
| - `StaticGNN = 0.5026 ± 0.0198` |
| - `SeqGRU = 0.6876 ± 0.0128` |
| - `SeqGRU delta = -0.1883 ± 0.0111` |
| - lower probe and SeqGRU scores reflect intended difficulty |
|
|
| Interpretation: `hard` is **not** a failed dataset. It passes the static-control hard gates and intentionally weakens temporal recoverability relative to `easy` and `medium`. |
|
|
| ## Reclassified Status Table |
|
|
| | Benchmark | Static-control hard gates | Probe/oracle status | SeqGRU status | Temporal-GNN status | Paper interpretation | |
| |---|---|---|---|---|---| |
| | `oracle_calib` | Pass `5/5` | `AuditOracle` and `RawMotifOracle` both near `1.0`; oracle behavior validated | `1.0000 ± 0.0000`, delta `-0.5032 ± 0.0043` | Underperformance is advisory only | Valid calibration benchmark with correct motif-label alignment and dead static shortcuts | |
| | `easy` | Pass `5/5` | `MotifProbe` / `RawMotifProbe` high, descriptive | `1.0000 ± 0.0000`, delta `-0.5003 ± 0.0096` | Advisory underperformance only | Valid standard benchmark with strong temporal signal | |
| | `medium` | Pass `5/5` | Lower probes are expected and descriptive, not failures | `0.8391 ± 0.0174`, delta `-0.3337 ± 0.0191` | Advisory underperformance only | Valid medium-difficulty benchmark with increased temporal challenge | |
| | `hard` | Pass `5/5` | Lower probes are expected and descriptive, not failures | `0.6876 ± 0.0128`, delta `-0.1883 ± 0.0111` | Advisory underperformance only | Valid hard benchmark with intentionally reduced temporal recoverability | |
|
|
| Temporal-GNN advisory failures are **not** benchmark failures. They support the paper finding that current temporal GNNs may not exploit order-sensitive temporal structure as effectively as a causal sequence model under matched static controls. |
|
|
| `medium` and `hard` are **not** failed datasets. They are intended difficulty levels in the Temporal Twins ladder. Their lower `MotifProbe`, `RawMotifProbe`, and `SeqGRU` values show increasing temporal difficulty rather than invalid benchmark construction. |
|
|
| ## Notes on the Raw Diagnostic File |
|
|
| - `results/paper_suite_20260503_202810/paper_suite_failed_checks.csv` is retained unchanged as the raw diagnostic output. |
| - The raw file still reflects older gate semantics in which standard-mode probe thresholds and temporal-GNN thresholds appeared in failure columns. |
| - This document is the corrected paper-facing interpretation layer and should be cited when describing benchmark validity in the manuscript. |
|
|