
Paper Gate Interpretation for Temporal Twins

This document is the paper-facing interpretation layer for the final Temporal Twins suite. It does not alter the raw diagnostic output in results/paper_suite_20260503_202810/paper_suite_failed_checks.csv. Instead, it reclassifies which checks are true hard gates versus descriptive or advisory findings for the paper.

Gate Categories

A. Hard Gates for oracle_calib

These are true benchmark validity gates for temporal_twins_oracle_calib:

  • matched_eval_pairs >= required threshold
  • positive_rate = 0.5
  • benign_motif_hit_rate = 0
  • static_agg_auc near 0.5
  • shortcut AUCs near 0.5
  • XGBoost near 0.5
  • StaticGNN near chance
  • AuditOracle near 1.0
  • RawMotifOracle near 1.0
  • SeqGRU high
  • SeqGRU shuffle delta strongly negative
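
As a rough illustration, the oracle_calib hard gates listed above could be written as boolean checks. This is a sketch, not the suite's code: the metric key names, the `min_pairs` argument, and every tolerance (`CHANCE_TOL`, `ORACLE_MIN`, `SEQ_MIN`, `DELTA_MAX`) are assumptions, chosen only to make "near 0.5", "near 1.0", "high", and "strongly negative" concrete.

```python
# Illustrative tolerances (assumptions, not the suite's actual thresholds).
CHANCE_TOL = 0.05   # distance from 0.5 that still counts as "near chance"
ORACLE_MIN = 0.99   # "near 1.0" for the oracles
SEQ_MIN = 0.90      # "high" for SeqGRU
DELTA_MAX = -0.30   # "strongly negative" shuffle delta


def near_chance(auc, tol=CHANCE_TOL):
    """True when a score is within `tol` of 0.5."""
    return abs(auc - 0.5) <= tol


def oracle_calib_hard_gates(m, min_pairs):
    """Map each hard gate to pass/fail for one oracle_calib run.

    `m` is a dict of metrics; its key names are hypothetical.
    """
    return {
        "matched_eval_pairs": m["matched_eval_pairs"] >= min_pairs,
        "positive_rate": abs(m["positive_rate"] - 0.5) < 1e-9,
        "benign_motif_hit_rate": m["benign_motif_hit_rate"] == 0,
        "static_agg_auc": near_chance(m["static_agg_auc"]),
        "shortcut_aucs": all(near_chance(a) for a in m["shortcut_aucs"]),
        "xgboost": near_chance(m["xgboost_auc"]),
        "static_gnn": near_chance(m["static_gnn_auc"]),
        "audit_oracle": m["audit_oracle_auc"] >= ORACLE_MIN,
        "raw_motif_oracle": m["raw_motif_oracle_auc"] >= ORACLE_MIN,
        "seqgru": m["seqgru_auc"] >= SEQ_MIN,
        "seqgru_shuffle_delta": m["seqgru_shuffle_delta"] <= DELTA_MAX,
    }


example = {
    # Placeholder values: these four metrics are not reported in this document.
    "matched_eval_pairs": 100,
    "positive_rate": 0.5,
    "benign_motif_hit_rate": 0.0,
    "static_agg_auc": 0.5,
    "shortcut_aucs": [0.5, 0.5],
    # Mean values reported for oracle_calib in this document.
    "xgboost_auc": 0.5000,
    "static_gnn_auc": 0.5222,
    "audit_oracle_auc": 1.0000,
    "raw_motif_oracle_auc": 1.0000,
    "seqgru_auc": 1.0000,
    "seqgru_shuffle_delta": -0.5032,
}

gates = oracle_calib_hard_gates(example, min_pairs=100)
failed = [name for name, ok in gates.items() if not ok]
print("failed hard gates:", failed)
```

Returning a per-gate dictionary rather than a single boolean keeps the failing gate visible, which is what a failed-checks report needs.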

B. Hard Gates for Standard easy / medium / hard

For the standard temporal_twins difficulty ladder, the hard gates are the matched static-control checks:

  • matched_eval_pairs >= required threshold
  • positive_rate = 0.5
  • benign_motif_hit_rate = 0
  • static_agg_auc near 0.5
  • shortcut AUCs near 0.5
  • XGBoost near 0.5
  • StaticGNN near chance

These conditions verify that the benchmark remains shortcut-resistant and that fraud and benign twins are properly matched at evaluation.

C. Advisory / Descriptive Checks for Standard easy / medium / hard

The following are not hard validity gates for the standard difficulty ladder:

  • MotifProbe
  • RawMotifProbe
  • SeqGRU difficulty trend
  • SeqGRU shuffle delta
  • temporal-GNN performance
  • temporal-GNN shuffle delta

These measurements are descriptive benchmark outcomes. They characterize difficulty and inductive bias; they do not determine whether the dataset itself is valid.
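
The hard/advisory split for the standard ladder can be sketched as a simple lookup. The check identifiers below are hypothetical stand-ins for the categories in sections B and C; the raw diagnostic file may use different names.

```python
# Hypothetical check identifiers mirroring sections B and C.
HARD_GATES_STANDARD = frozenset({
    "matched_eval_pairs",
    "positive_rate",
    "benign_motif_hit_rate",
    "static_agg_auc",
    "shortcut_auc",
    "xgboost_auc",
    "static_gnn_auc",
})

ADVISORY_STANDARD = frozenset({
    "motif_probe",
    "raw_motif_probe",
    "seqgru_difficulty_trend",
    "seqgru_shuffle_delta",
    "temporal_gnn_auc",
    "temporal_gnn_shuffle_delta",
})


def split_failures(failed_checks):
    """Split failed check names into (hard, advisory) lists.

    Raises ValueError for names in neither category, so a new check
    cannot silently count as advisory.
    """
    hard, advisory = [], []
    for name in failed_checks:
        if name in HARD_GATES_STANDARD:
            hard.append(name)
        elif name in ADVISORY_STANDARD:
            advisory.append(name)
        else:
            raise ValueError(f"unclassified check: {name}")
    return hard, advisory


def dataset_is_valid(failed_checks):
    """A standard slice is valid when no hard gate failed."""
    hard, _ = split_failures(failed_checks)
    return not hard


# Example: only advisory checks failed, so the slice still counts as valid.
hard, advisory = split_failures(["motif_probe", "temporal_gnn_auc"])
print(hard, advisory, dataset_is_valid(["motif_probe", "temporal_gnn_auc"]))
```

Under this reading, a run with only advisory failures still counts as a valid dataset, while any single hard-gate failure invalidates it.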

Reclassified Final Paper-Suite Status

oracle_calib

  • hard gate passes: 5/5
  • AuditOracle = 1.0000 ± 0.0000
  • RawMotifOracle = 1.0000 ± 0.0000
  • XGBoost = 0.5000 ± 0.0000
  • StaticGNN = 0.5222 ± 0.0235
  • SeqGRU = 1.0000 ± 0.0000
  • SeqGRU delta = -0.5032 ± 0.0043

Interpretation: oracle_calib passes the intended hard benchmark validation. The oracle/probe alignment is correct, static shortcuts are eliminated, and a causal sequence model recovers the signal with a large negative shuffle delta.

easy

  • static-control hard gates pass: 5/5
  • XGBoost = 0.5000 ± 0.0000
  • StaticGNN = 0.4946 ± 0.0128
  • SeqGRU = 1.0000 ± 0.0000
  • SeqGRU delta = -0.5003 ± 0.0096

Interpretation: easy is a valid standard benchmark slice. Static shortcuts remain suppressed, and the temporal sequence signal is strong.

medium

  • static-control hard gates pass: 5/5
  • XGBoost = 0.5000 ± 0.0000
  • StaticGNN = 0.4922 ± 0.0203
  • SeqGRU = 0.8391 ± 0.0174
  • SeqGRU delta = -0.3337 ± 0.0191
  • MotifProbe and RawMotifProbe are lower by design and should not be treated as hard-gate failures

Interpretation: medium is not a failed dataset. It passes the static-control hard gates and shows the intended increase in temporal difficulty.

hard

  • static-control hard gates pass: 5/5
  • XGBoost = 0.5000 ± 0.0000
  • StaticGNN = 0.5026 ± 0.0198
  • SeqGRU = 0.6876 ± 0.0128
  • SeqGRU delta = -0.1883 ± 0.0111
  • lower probe and SeqGRU scores reflect intended difficulty

Interpretation: hard is not a failed dataset. It passes the static-control hard gates and intentionally weakens temporal recoverability relative to easy and medium.

Reclassified Status Table

| Benchmark | Static-control hard gates | Probe/oracle status | SeqGRU status | Temporal-GNN status | Paper interpretation |
|---|---|---|---|---|---|
| oracle_calib | Pass 5/5 | AuditOracle and RawMotifOracle both near 1.0; oracle behavior validated | 1.0000 ± 0.0000, delta -0.5032 ± 0.0043 | Underperformance is advisory only | Valid calibration benchmark with correct motif-label alignment and dead static shortcuts |
| easy | Pass 5/5 | MotifProbe / RawMotifProbe high, descriptive | 1.0000 ± 0.0000, delta -0.5003 ± 0.0096 | Advisory underperformance only | Valid standard benchmark with strong temporal signal |
| medium | Pass 5/5 | Lower probes are expected and descriptive, not failures | 0.8391 ± 0.0174, delta -0.3337 ± 0.0191 | Advisory underperformance only | Valid medium-difficulty benchmark with increased temporal challenge |
| hard | Pass 5/5 | Lower probes are expected and descriptive, not failures | 0.6876 ± 0.0128, delta -0.1883 ± 0.0111 | Advisory underperformance only | Valid hard benchmark with intentionally reduced temporal recoverability |

Temporal-GNN advisory failures are not benchmark failures. They support the paper's finding that current temporal GNNs may not exploit order-sensitive temporal structure as effectively as a causal sequence model under matched static controls.

medium and hard are not failed datasets. They are intended difficulty levels in the Temporal Twins ladder. Their lower MotifProbe, RawMotifProbe, and SeqGRU values show increasing temporal difficulty rather than invalid benchmark construction.
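
The reported SeqGRU numbers can be sanity-checked with a little arithmetic. The sketch below assumes the shuffle delta is defined as (score on order-shuffled sequences) minus (score on original sequences). This document does not state that definition, so treat it as an assumption; under it, score plus delta gives the implied shuffled score.

```python
# Mean SeqGRU score and shuffle delta as reported in this document.
REPORTED = {
    "oracle_calib": (1.0000, -0.5032),
    "easy": (1.0000, -0.5003),
    "medium": (0.8391, -0.3337),
    "hard": (0.6876, -0.1883),
}


def implied_shuffled_score(score, delta):
    """Shuffled score implied by delta = shuffled - original (assumed)."""
    return round(score + delta, 4)


implied = {
    name: implied_shuffled_score(score, delta)
    for name, (score, delta) in REPORTED.items()
}

# Difficulty ladder: the SeqGRU score should fall and the delta should
# shrink toward zero from easy to medium to hard.
ladder = ["easy", "medium", "hard"]
scores = [REPORTED[name][0] for name in ladder]
deltas = [REPORTED[name][1] for name in ladder]
score_decreasing = all(a > b for a, b in zip(scores, scores[1:]))
delta_shrinking = all(abs(a) > abs(b) for a, b in zip(deltas, deltas[1:]))

for name, value in implied.items():
    print(f"{name}: implied shuffled score = {value:.4f}")
print("score decreasing:", score_decreasing)
print("delta shrinking:", delta_shrinking)
```

Under that assumption, every implied shuffled score lands within 0.01 of 0.5. That pattern is what one would expect if the signal SeqGRU recovers at each difficulty level depends on event order.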

Notes on the Raw Diagnostic File

  • results/paper_suite_20260503_202810/paper_suite_failed_checks.csv is retained unchanged as the raw diagnostic output.
  • The raw file still reflects older gate semantics in which standard-mode probe thresholds and temporal-GNN thresholds appeared in failure columns.
  • This document is the corrected paper-facing interpretation layer and should be cited when describing benchmark validity in the manuscript.