# Paper Gate Interpretation for Temporal Twins
This document is the paper-facing interpretation layer for the final Temporal Twins suite. It does **not** alter the raw diagnostic output in `results/paper_suite_20260503_202810/paper_suite_failed_checks.csv`. Instead, it reclassifies which checks are true hard gates versus descriptive or advisory findings for the paper.
## Gate Categories
### A. Hard Gates for `oracle_calib`
These are true benchmark validity gates for `temporal_twins_oracle_calib`:
- `matched_eval_pairs >=` required threshold
- `positive_rate = 0.5`
- `benign_motif_hit_rate = 0`
- `static_agg_auc` near `0.5`
- shortcut AUCs near `0.5`
- `XGBoost` near `0.5`
- `StaticGNN` near chance
- `AuditOracle` near `1.0`
- `RawMotifOracle` near `1.0`
- `SeqGRU` high
- `SeqGRU` shuffle delta strongly negative
### B. Hard Gates for Standard `easy` / `medium` / `hard`
For the standard `temporal_twins` difficulty ladder, the hard gates are the matched static-control checks:
- `matched_eval_pairs >=` required threshold
- `positive_rate = 0.5`
- `benign_motif_hit_rate = 0`
- `static_agg_auc` near `0.5`
- shortcut AUCs near `0.5`
- `XGBoost` near `0.5`
- `StaticGNN` near chance
These conditions verify that the benchmark remains shortcut-resistant and that fraud and benign twins are properly matched at evaluation.
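The static-control gates listed above can be sketched as a small set of predicates. This is a minimal illustration, not the suite's actual gate code: the tolerance for "near `0.5`", the minimum pair count, the dictionary keys, and the function names are all assumptions, since this document does not state the numeric thresholds. Only the `XGBoost` and `StaticGNN` values in the example are taken from the reported `easy` results; the remaining inputs are placeholders.

```python
# Hypothetical thresholds: this document says "near 0.5" and "required threshold"
# without giving numbers, so both constants below are assumptions.
CHANCE_TOLERANCE = 0.05
MIN_MATCHED_PAIRS = 100


def near_chance(score, tol=CHANCE_TOLERANCE):
    """True when a score is within `tol` of the chance level 0.5."""
    return abs(score - 0.5) <= tol


def static_control_gates(metrics, min_pairs=MIN_MATCHED_PAIRS, tol=CHANCE_TOLERANCE):
    """Return {gate_name: passed} for the matched static-control checks."""
    return {
        "matched_eval_pairs": metrics["matched_eval_pairs"] >= min_pairs,
        "positive_rate": abs(metrics["positive_rate"] - 0.5) < 1e-9,
        "benign_motif_hit_rate": metrics["benign_motif_hit_rate"] == 0,
        "static_agg_auc": near_chance(metrics["static_agg_auc"], tol),
        "shortcut_aucs": all(near_chance(a, tol) for a in metrics["shortcut_aucs"]),
        "xgboost": near_chance(metrics["xgboost"], tol),
        "static_gnn": near_chance(metrics["static_gnn"], tol),
    }


example = {
    "matched_eval_pairs": 250,      # placeholder, not reported in this document
    "positive_rate": 0.5,           # gate target value
    "benign_motif_hit_rate": 0.0,   # gate target value
    "static_agg_auc": 0.50,         # placeholder, not reported in this document
    "shortcut_aucs": [0.49, 0.51],  # placeholder, not reported in this document
    "xgboost": 0.5000,              # reported mean for `easy`
    "static_gnn": 0.4946,           # reported mean for `easy`
}

results = static_control_gates(example)
print(all(results.values()))
```

Returning one boolean per gate, rather than a single pass/fail flag, keeps it visible which control failed when a slice does not pass.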
### C. Advisory / Descriptive Checks for Standard `easy` / `medium` / `hard`
The following are **not** hard validity gates for the standard difficulty ladder:
- `MotifProbe`
- `RawMotifProbe`
- `SeqGRU` difficulty trend
- `SeqGRU` shuffle delta
- temporal-GNN performance
- temporal-GNN shuffle delta
These measurements are descriptive benchmark outcomes. They characterize difficulty and inductive bias; they do not determine whether the dataset itself is valid.
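The split between hard gates and advisory checks in sections A through C can be expressed as a lookup keyed on benchmark and check name. The sketch below assumes hypothetical check identifiers (the real column or check names in the suite output may differ) and assumes that any check not listed as a hard gate in section A or B is treated as advisory.

```python
# Check identifiers are illustrative; the suite's real names may differ.
STATIC_CONTROL_GATES = frozenset({
    "matched_eval_pairs",
    "positive_rate",
    "benign_motif_hit_rate",
    "static_agg_auc",
    "shortcut_aucs",
    "XGBoost",
    "StaticGNN",
})

# Hard gates only for the oracle calibration benchmark (section A).
ORACLE_ONLY_GATES = frozenset({
    "AuditOracle",
    "RawMotifOracle",
    "SeqGRU",
    "SeqGRU_shuffle_delta",
})

KNOWN_BENCHMARKS = frozenset({"oracle_calib", "easy", "medium", "hard"})


def gate_category(benchmark, check):
    """Classify a check as 'hard' or 'advisory' for the given benchmark."""
    if benchmark not in KNOWN_BENCHMARKS:
        raise ValueError(f"unknown benchmark: {benchmark!r}")
    if check in STATIC_CONTROL_GATES:
        return "hard"
    if benchmark == "oracle_calib" and check in ORACLE_ONLY_GATES:
        return "hard"
    # Assumption: everything else (probes, temporal-GNN checks, and the
    # SeqGRU checks on the standard ladder) is descriptive only.
    return "advisory"


print(gate_category("oracle_calib", "SeqGRU"))  # hard
print(gate_category("medium", "SeqGRU"))        # advisory
```

The same `SeqGRU` check therefore changes category depending on the benchmark, which is the core of the reclassification described here.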
## Reclassified Final Paper-Suite Status
### `oracle_calib`
- hard gate passes: `5/5`
- `AuditOracle = 1.0000 ± 0.0000`
- `RawMotifOracle = 1.0000 ± 0.0000`
- `XGBoost = 0.5000 ± 0.0000`
- `StaticGNN = 0.5222 ± 0.0235`
- `SeqGRU = 1.0000 ± 0.0000`
- `SeqGRU delta = -0.5032 ± 0.0043`
Interpretation: `oracle_calib` passes the intended hard benchmark validation. The oracle/probe alignment is correct, static shortcuts are eliminated, and a causal sequence model recovers the signal with a large negative shuffle delta.
### `easy`
- static-control hard gates pass: `5/5`
- `XGBoost = 0.5000 ± 0.0000`
- `StaticGNN = 0.4946 ± 0.0128`
- `SeqGRU = 1.0000 ± 0.0000`
- `SeqGRU delta = -0.5003 ± 0.0096`
Interpretation: `easy` is a valid standard benchmark slice. Static shortcuts remain suppressed, and the temporal sequence signal is strong.
### `medium`
- static-control hard gates pass: `5/5`
- `XGBoost = 0.5000 ± 0.0000`
- `StaticGNN = 0.4922 ± 0.0203`
- `SeqGRU = 0.8391 ± 0.0174`
- `SeqGRU delta = -0.3337 ± 0.0191`
- `MotifProbe` and `RawMotifProbe` are lower by design and should **not** be treated as hard-gate failures
Interpretation: `medium` is **not** a failed dataset. It passes the static-control hard gates and shows the intended increase in temporal difficulty.
### `hard`
- static-control hard gates pass: `5/5`
- `XGBoost = 0.5000 ± 0.0000`
- `StaticGNN = 0.5026 ± 0.0198`
- `SeqGRU = 0.6876 ± 0.0128`
- `SeqGRU delta = -0.1883 ± 0.0111`
- lower probe and SeqGRU scores reflect intended difficulty
Interpretation: `hard` is **not** a failed dataset. It passes the static-control hard gates and intentionally weakens temporal recoverability relative to `easy` and `medium`.
## Reclassified Status Table
| Benchmark | Static-control hard gates | Probe/oracle status | SeqGRU status | Temporal-GNN status | Paper interpretation |
|---|---|---|---|---|---|
| `oracle_calib` | Pass `5/5` | `AuditOracle` and `RawMotifOracle` both near `1.0`; oracle behavior validated | `1.0000 ± 0.0000`, delta `-0.5032 ± 0.0043` | Underperformance is advisory only | Valid calibration benchmark with correct motif-label alignment and dead static shortcuts |
| `easy` | Pass `5/5` | `MotifProbe` / `RawMotifProbe` high, descriptive | `1.0000 ± 0.0000`, delta `-0.5003 ± 0.0096` | Advisory underperformance only | Valid standard benchmark with strong temporal signal |
| `medium` | Pass `5/5` | Lower probes are expected and descriptive, not failures | `0.8391 ± 0.0174`, delta `-0.3337 ± 0.0191` | Advisory underperformance only | Valid medium-difficulty benchmark with increased temporal challenge |
| `hard` | Pass `5/5` | Lower probes are expected and descriptive, not failures | `0.6876 ± 0.0128`, delta `-0.1883 ± 0.0111` | Advisory underperformance only | Valid hard benchmark with intentionally reduced temporal recoverability |
Temporal-GNN advisory failures are **not** benchmark failures. They support the paper finding that current temporal GNNs may not exploit order-sensitive temporal structure as effectively as a causal sequence model under matched static controls.
`medium` and `hard` are **not** failed datasets. They are intended difficulty levels in the Temporal Twins ladder. Their lower `MotifProbe`, `RawMotifProbe`, and `SeqGRU` values show increasing temporal difficulty rather than invalid benchmark construction.
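The difficulty trend can be checked directly from the reported `SeqGRU` means. The sketch below uses only the mean values listed in this document and ignores the reported standard deviations. It also derives an implied shuffled score under the assumption that the shuffle delta is defined as the shuffled score minus the unshuffled score; that definition is inferred from the sign of the deltas and is not stated in this document.

```python
# (SeqGRU mean, SeqGRU shuffle-delta mean) as reported in this document.
REPORTED = {
    "oracle_calib": (1.0000, -0.5032),
    "easy": (1.0000, -0.5003),
    "medium": (0.8391, -0.3337),
    "hard": (0.6876, -0.1883),
}
LADDER = ["easy", "medium", "hard"]


def implied_shuffled(score, delta):
    """Assumption: delta = shuffled_score - unshuffled_score."""
    return score + delta


def is_strictly_decreasing(values):
    """True when each value is strictly smaller than the one before it."""
    return all(a > b for a, b in zip(values, values[1:]))


ladder_scores = [REPORTED[name][0] for name in LADDER]
ladder_delta_magnitudes = [abs(REPORTED[name][1]) for name in LADDER]

print("SeqGRU decreases along the ladder:", is_strictly_decreasing(ladder_scores))
print("|delta| decreases along the ladder:", is_strictly_decreasing(ladder_delta_magnitudes))
for name, (score, delta) in REPORTED.items():
    print(f"{name}: implied shuffled score = {implied_shuffled(score, delta):.4f}")
```

Under that assumed definition, every implied shuffled score lands within about `0.01` of `0.5`, which is consistent with the reading that shuffling event order removes the signal the sequence model relies on.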
## Notes on the Raw Diagnostic File
- `results/paper_suite_20260503_202810/paper_suite_failed_checks.csv` is retained unchanged as the raw diagnostic output.
- The raw file still reflects older gate semantics in which standard-mode probe thresholds and temporal-GNN thresholds appeared in failure columns.
- This document is the corrected paper-facing interpretation layer and should be cited when describing benchmark validity in the manuscript.
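One way to read the raw file under the corrected semantics is to partition its failure rows into hard-gate failures and advisory findings. The sketch below is illustrative only: the column names `benchmark` and `check`, the check identifiers, and the sample rows are hypothetical and do not reproduce the contents of `paper_suite_failed_checks.csv`, whose schema is not described here.

```python
import csv
import io

STANDARD_BENCHMARKS = {"easy", "medium", "hard"}

# Hypothetical check identifiers; the raw file's real names may differ.
ADVISORY_EVERYWHERE = {"TemporalGNN", "TemporalGNN_shuffle_delta"}
ADVISORY_IN_STANDARD_MODE = {
    "MotifProbe",
    "RawMotifProbe",
    "SeqGRU",
    "SeqGRU_shuffle_delta",
}


def split_failures(rows):
    """Partition failure rows into (hard_failures, advisory_findings)."""
    hard, advisory = [], []
    for row in rows:
        standard = row["benchmark"] in STANDARD_BENCHMARKS
        is_advisory = row["check"] in ADVISORY_EVERYWHERE or (
            standard and row["check"] in ADVISORY_IN_STANDARD_MODE
        )
        (advisory if is_advisory else hard).append(row)
    return hard, advisory


# Invented sample rows standing in for the raw CSV; not real suite output.
SAMPLE_CSV = (
    "benchmark,check\n"
    "medium,MotifProbe\n"
    "hard,TemporalGNN\n"
    "oracle_calib,TemporalGNN\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
hard_failures, advisory_findings = split_failures(rows)
print(len(hard_failures), len(advisory_findings))
```

To apply this to the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an open file handle, and the column and check names would need to be adjusted to the actual schema.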