temporal-twins-code / results /PAPER_GATE_INTERPRETATION.md

Add anonymous Temporal Twins code release

a3682cf verified 5 days ago

5.63 kB

	# Paper Gate Interpretation for Temporal Twins

	This document is the paper-facing interpretation layer for the final Temporal Twins suite. It does not alter the raw diagnostic output in `results/paper_suite_20260503_202810/paper_suite_failed_checks.csv`. Instead, it reclassifies which checks are true hard gates versus descriptive or advisory findings for the paper.

	## Gate Categories

	### A. Hard Gates for `oracle_calib`

	These are true benchmark validity gates for `temporal_twins_oracle_calib`:

	- `matched_eval_pairs >=` required threshold
	- `positive_rate = 0.5`
	- `benign_motif_hit_rate = 0`
	- `static_agg_auc` near `0.5`
	- shortcut AUCs near `0.5`
	- `XGBoost` near `0.5`
	- `StaticGNN` near chance
	- `AuditOracle` near `1.0`
	- `RawMotifOracle` near `1.0`
	- `SeqGRU` high
	- `SeqGRU` shuffle delta strongly negative

	### B. Hard Gates for Standard `easy` / `medium` / `hard`

	For the standard `temporal_twins` difficulty ladder, the hard gates are the matched static-control checks:

	- `matched_eval_pairs >=` required threshold
	- `positive_rate = 0.5`
	- `benign_motif_hit_rate = 0`
	- `static_agg_auc` near `0.5`
	- shortcut AUCs near `0.5`
	- `XGBoost` near `0.5`
	- `StaticGNN` near chance

	These conditions verify that the benchmark remains shortcut-resistant and that fraud and benign twins are properly matched at evaluation.

	### C. Advisory / Descriptive Checks for Standard `easy` / `medium` / `hard`

	The following are not hard validity gates for the standard difficulty ladder:

	- `MotifProbe`
	- `RawMotifProbe`
	- `SeqGRU` difficulty trend
	- `SeqGRU` shuffle delta
	- temporal-GNN performance
	- temporal-GNN shuffle delta

	These measurements are descriptive benchmark outcomes. They characterize difficulty and inductive bias; they do not determine whether the dataset itself is valid.

	## Reclassified Final Paper-Suite Status

	### `oracle_calib`

	- hard gate passes: `5/5`
	- `AuditOracle = 1.0000 ± 0.0000`
	- `RawMotifOracle = 1.0000 ± 0.0000`
	- `XGBoost = 0.5000 ± 0.0000`
	- `StaticGNN = 0.5222 ± 0.0235`
	- `SeqGRU = 1.0000 ± 0.0000`
	- `SeqGRU delta = -0.5032 ± 0.0043`

	Interpretation: `oracle_calib` passes the intended hard benchmark validation. The oracle/probe alignment is correct, static shortcuts are eliminated, and a causal sequence model recovers the signal with a large negative shuffle delta.

	### `easy`

	- static-control hard gates pass: `5/5`
	- `XGBoost = 0.5000 ± 0.0000`
	- `StaticGNN = 0.4946 ± 0.0128`
	- `SeqGRU = 1.0000 ± 0.0000`
	- `SeqGRU delta = -0.5003 ± 0.0096`

	Interpretation: `easy` is a valid standard benchmark slice. Static shortcuts remain suppressed, and the temporal sequence signal is strong.

	### `medium`

	- static-control hard gates pass: `5/5`
	- `XGBoost = 0.5000 ± 0.0000`
	- `StaticGNN = 0.4922 ± 0.0203`
	- `SeqGRU = 0.8391 ± 0.0174`
	- `SeqGRU delta = -0.3337 ± 0.0191`
	- `MotifProbe` and `RawMotifProbe` are lower by design and should not be treated as hard-gate failures

	Interpretation: `medium` is not a failed dataset. It passes the static-control hard gates and shows the intended increase in temporal difficulty.

	### `hard`

	- static-control hard gates pass: `5/5`
	- `XGBoost = 0.5000 ± 0.0000`
	- `StaticGNN = 0.5026 ± 0.0198`
	- `SeqGRU = 0.6876 ± 0.0128`
	- `SeqGRU delta = -0.1883 ± 0.0111`
	- lower probe and SeqGRU scores reflect intended difficulty

	Interpretation: `hard` is not a failed dataset. It passes the static-control hard gates and intentionally weakens temporal recoverability relative to `easy` and `medium`.

	## Reclassified Status Table

	\| Benchmark \| Static-control hard gates \| Probe/oracle status \| SeqGRU status \| Temporal-GNN status \| Paper interpretation \|
	\|---\|---\|---\|---\|---\|---\|
	\| `oracle_calib` \| Pass `5/5` \| `AuditOracle` and `RawMotifOracle` both near `1.0`; oracle behavior validated \| `1.0000 ± 0.0000`, delta `-0.5032 ± 0.0043` \| Underperformance is advisory only \| Valid calibration benchmark with correct motif-label alignment and dead static shortcuts \|
	\| `easy` \| Pass `5/5` \| `MotifProbe` / `RawMotifProbe` high, descriptive \| `1.0000 ± 0.0000`, delta `-0.5003 ± 0.0096` \| Advisory underperformance only \| Valid standard benchmark with strong temporal signal \|
	\| `medium` \| Pass `5/5` \| Lower probes are expected and descriptive, not failures \| `0.8391 ± 0.0174`, delta `-0.3337 ± 0.0191` \| Advisory underperformance only \| Valid medium-difficulty benchmark with increased temporal challenge \|
	\| `hard` \| Pass `5/5` \| Lower probes are expected and descriptive, not failures \| `0.6876 ± 0.0128`, delta `-0.1883 ± 0.0111` \| Advisory underperformance only \| Valid hard benchmark with intentionally reduced temporal recoverability \|

	Temporal-GNN advisory failures are not benchmark failures. They support the paper finding that current temporal GNNs may not exploit order-sensitive temporal structure as effectively as a causal sequence model under matched static controls.

	`medium` and `hard` are not failed datasets. They are intended difficulty levels in the Temporal Twins ladder. Their lower `MotifProbe`, `RawMotifProbe`, and `SeqGRU` values show increasing temporal difficulty rather than invalid benchmark construction.

	## Notes on the Raw Diagnostic File

	- `results/paper_suite_20260503_202810/paper_suite_failed_checks.csv` is retained unchanged as the raw diagnostic output.
	- The raw file still reflects older gate semantics in which standard-mode probe thresholds and temporal-GNN thresholds appeared in failure columns.
	- This document is the corrected paper-facing interpretation layer and should be cited when describing benchmark validity in the manuscript.