
🔬 Event-Aware Data Splitting: A New Rule for Fraud Detection

"Before building complex AI models, split your data correctly."

📌 Abstract

This study introduces Event-Aware Data Splitting, a new paradigm for evaluating fraud detection systems that accounts for the temporal and contextual nature of real-world events. We demonstrate empirically that how you split your data has a larger impact on reported performance than which model you choose, and that standard random splitting creates dangerously optimistic evaluations that collapse under real-world event conditions.

Our key contribution: a formal splitting methodology that preserves event boundaries (holidays, weekends, month-end periods, night-time surges) and provides realistic performance estimates that mirror production deployment.


🎯 The Problem

Most fraud detection research uses random train/test splitting, which:

  • ❌ Leaks future event patterns into training data
  • ❌ Creates artificially balanced test sets that don't reflect production
  • ❌ Produces over-optimistic performance metrics (up to 30% inflation)
  • ❌ Hides critical vulnerabilities during event periods (holidays, weekends)

In production, fraud detection systems face events they haven't seen before. A model trained on normal days must handle Black Friday. Random splitting hides this reality.
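
To make the leakage concrete, here is a minimal sketch (the column names `unix_time` and `event_type` and all values are hypothetical): a random split scatters event rows across both sides, while a temporal split leaves late events genuinely unseen.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ~20% of rows fall in a holiday event window.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "unix_time": np.sort(rng.integers(1_600_000_000, 1_631_536_000, size=1000)),
    "event_type": rng.choice(["normal", "holiday_weekday"], size=1000, p=[0.8, 0.2]),
})

# Random split: holiday rows end up in BOTH train and test (leakage).
train_r, test_r = train_test_split(df, test_size=0.2, random_state=0)
print("holiday rows in random train:", (train_r.event_type == "holiday_weekday").any())

# Temporal split: train strictly precedes test, so late events stay unseen.
cutoff = df["unix_time"].quantile(0.8)
train_t, test_t = df[df.unix_time <= cutoff], df[df.unix_time > cutoff]
print("train ends before test begins:", train_t.unix_time.max() < test_t.unix_time.min())
```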


🔑 Key Findings

Finding 1: The Adversarial Collapse

When models trained only on "normal" periods are tested on event periods:

| Model | Random Split AUROC | Adversarial-Event AUROC | Drop |
|---|---|---|---|
| Logistic Regression | 0.9415 | 0.8609 | -8.6% |
| Random Forest | 0.9944 | 0.9676 | -2.7% |
| XGBoost | 0.9990 | 0.9833 | -1.6% |
| LightGBM | 0.8813 | 0.6154 | -30.2% 🔴 |

LightGBM loses 30% of its AUROC when the split respects event boundaries, collapsing from 0.8813 to 0.6154, barely better than random ranking.

Finding 2: Every Event Type is a Vulnerability

Under adversarial-event splitting, ALL event types show critical AUROC degradation for LightGBM:

| Event Type | AUROC | Status |
|---|---|---|
| holiday_weekday | 0.5997 | 🔴 CRITICAL |
| holiday_weekend | 0.6103 | 🔴 CRITICAL |
| weekday_night | 0.6223 | 🔴 CRITICAL |
| weekend_night | 0.6288 | 🔴 CRITICAL |
| month_end | 0.6541 | 🔴 CRITICAL |
| weekend_day | 0.6600 | 🔴 CRITICAL |

Finding 3: Fraud Rates Vary Dramatically by Event Type

The dataset reveals fundamentally different fraud landscapes across events:

| Event Type | Fraud Rate | Transactions |
|---|---|---|
| weekday_night | 1.65% | 140,147 |
| weekend_night | 1.61% | 79,629 |
| holiday_weekday | 0.62% | 202,411 |
| holiday_weekend | 0.51% | 113,797 |
| normal | 0.12% | 291,093 |
| weekend_day | 0.10% | 186,746 |
| month_end | 0.10% | 34,752 |

Night-time fraud rates are roughly 14x higher than in normal periods (1.65% vs 0.12%). Random splitting masks this entirely.
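
The per-event rates in the table reduce to a one-line groupby. A minimal sketch, assuming hypothetical `event_type` and `is_fraud` columns:

```python
import pandas as pd

# Hypothetical mini-frame with the columns the analysis assumes.
df = pd.DataFrame({
    "event_type": ["normal"] * 6 + ["weekday_night"] * 4,
    "is_fraud":   [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
})

# Fraud rate and transaction volume per event type (mirrors the table above).
summary = (df.groupby("event_type")["is_fraud"]
             .agg(fraud_rate="mean", transactions="size")
             .sort_values("fraud_rate", ascending=False))
print(summary)
```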

Finding 4: Event-Aware vs Random - AUPRC Tells the Real Story

While AUROC drops are modest for robust models, AUPRC reveals dramatic differences:

| Model | Random AUPRC | Event-Aware AUPRC | Adversarial AUPRC |
|---|---|---|---|
| XGBoost | 0.9512 | 0.8892 | 0.8385 |
| Random Forest | 0.8400 | 0.8057 | 0.6218 |
| LightGBM | 0.3108 | 0.4828 | 0.0428 🔴 |

📊 Experimental Design

Five Splitting Strategies

| # | Strategy | Description | Purpose |
|---|---|---|---|
| 1 | Random (Naive) | Standard `train_test_split` with stratification | Baseline: what most papers use |
| 2 | Temporal | Train on past, test on future | Standard time-series practice |
| 3 | Event-Aware (Ours) | Split respecting event window boundaries | Our contribution: realistic evaluation |
| 4 | Stratified-Temporal | Temporal split with per-event-type stratification | Balanced temporal evaluation |
| 5 | Adversarial-Event | Train on normal only, test on events only | Worst-case stress test |
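
Strategy 5 (the stress test) is the simplest to implement. A sketch, assuming an `event_type` column and an `is_fraud` target; the names are illustrative:

```python
import pandas as pd

def split_adversarial_event(df, features, target="is_fraud"):
    """Strategy 5 sketch: train on 'normal' rows only, test on event rows only."""
    train = df[df["event_type"] == "normal"]
    test = df[df["event_type"] != "normal"]
    return train[features], test[features], train[target], test[target]

# Hypothetical four-row frame to show the mechanics.
df = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0],
    "event_type": ["normal", "normal", "weekend_day", "month_end"],
    "is_fraud": [0, 1, 0, 1],
})
X_tr, X_te, y_tr, y_te = split_adversarial_event(df, ["amount"])
print(len(X_tr), len(X_te))  # 2 train rows (normal), 2 test rows (events)
```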

Four Models (Deliberately Simple)

We use simple models to prove the point; it's not about model complexity:

  • Logistic Regression
  • Random Forest (200 trees)
  • XGBoost (200 boosting rounds)
  • LightGBM (200 boosting rounds)

Dataset

  • Source: CIS435-CreditCardFraudDetection
  • Size: 1,048,575 transactions
  • Features: 23 engineered features (temporal, geographic, categorical, amount-based)
  • Fraud Rate: 0.57% (6,006 fraud cases)
  • Temporal Span: Full year with rich temporal metadata

📈 Results Visualization

Figure 1: Performance Heatmap Across All Strategies and Models

Performance Heatmap

Figure 2: AUROC Comparison

AUROC Comparison

Figure 3: Event-Type Analysis - Different Events = Different Fraud Landscapes

Event Analysis

Figure 4: Performance Stability Across Event Types

Event Stability

Figure 5: Performance Degradation vs Random Split

Degradation Heatmap

Figure 6: Temporal Fraud Patterns - Why Random Splits Fail

Temporal Patterns

Figure 7: The Key Insight - Split Strategy Matters More Than Model Complexity

Complexity vs Splitting


🧠 The Event-Aware Splitting Algorithm

```python
def split_event_aware(df, X, y, test_ratio=0.2):
    """
    EVENT-AWARE SPLIT - Our Contribution

    Principle: Split data such that complete "event windows" are kept
    intact. Never split within an event period. Ensure the test set
    contains representative event periods the model has NOT seen.

    Algorithm:
    1. Group transactions into event windows (year-month x event_type)
    2. Sort windows temporally
    3. Assign the LAST ~20% of windows to the test set
    4. The model must generalize to unseen event instances
    """
    df = df.copy()  # avoid mutating the caller's frame
    df['event_window'] = df['year_month'] + '_' + df['event_type']
    window_times = df.groupby('event_window')['unix_time'].mean().sort_values()

    n_test = int(len(window_times) * test_ratio)
    test_windows = set(window_times.index[-n_test:])

    train_mask = ~df['event_window'].isin(test_windows)
    test_mask = df['event_window'].isin(test_windows)

    return X[train_mask], X[test_mask], y[train_mask], y[test_mask]
```
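
A toy usage example (the function is repeated here, lightly adapted, so the snippet runs standalone; the frame's values are hypothetical):

```python
import pandas as pd

def split_event_aware(df, X, y, test_ratio=0.2):
    # Repeated from the listing above so this snippet is self-contained;
    # copies the frame first and masks with a plain array for safety.
    df = df.copy()
    df['event_window'] = df['year_month'] + '_' + df['event_type']
    window_times = df.groupby('event_window')['unix_time'].mean().sort_values()
    n_test = max(1, int(len(window_times) * test_ratio))
    test_mask = df['event_window'].isin(set(window_times.index[-n_test:])).to_numpy()
    return X[~test_mask], X[test_mask], y[~test_mask], y[test_mask]

# Toy frame: five monthly 'normal' windows, two rows each (values hypothetical).
df = pd.DataFrame({
    "year_month": ["2024-0%d" % m for m in range(1, 6)] * 2,
    "event_type": ["normal"] * 10,
    "unix_time":  list(range(1, 6)) * 2,
})
X, y = df[["unix_time"]], pd.Series([0, 1] * 5)
X_tr, X_te, y_tr, y_te = split_event_aware(df, X, y)
print(len(X_tr), len(X_te))  # the last window (2024-05) is held out: 8 2
```

Note that whole windows move to the test set together; no window is ever split across train and test.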

Event Types Defined

| Event | Condition |
|---|---|
| holiday_weekend | Holiday season (Nov-Jan) + weekend |
| holiday_weekday | Holiday season (Nov-Jan) + weekday |
| weekend_night | Weekend + night (10PM-5AM) |
| weekend_day | Weekend + day |
| weekday_night | Weekday + night (10PM-5AM) |
| month_end | Day 28+ of month |
| normal | None of the above |
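
One way to derive these labels from a timestamp. The precedence order between overlapping conditions is an assumption, as is treating Nov-Jan as the holiday season:

```python
import pandas as pd

def label_event_type(ts: pd.Timestamp) -> str:
    """Sketch of the event taxonomy above; precedence is an assumption."""
    is_holiday = ts.month in (11, 12, 1)      # Nov-Jan holiday season
    is_weekend = ts.dayofweek >= 5            # Saturday or Sunday
    is_night = ts.hour >= 22 or ts.hour < 5   # 10PM-5AM
    if is_holiday:
        return "holiday_weekend" if is_weekend else "holiday_weekday"
    if is_night:
        return "weekend_night" if is_weekend else "weekday_night"
    if is_weekend:
        return "weekend_day"
    if ts.day >= 28:
        return "month_end"
    return "normal"

print(label_event_type(pd.Timestamp("2024-12-25 14:00")))  # holiday_weekday
print(label_event_type(pd.Timestamp("2024-06-29 23:30")))  # weekend_night
```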

💡 Practical Recommendations

Based on our findings, we propose the Event-Aware Data Splitting Rule:

For Researchers

  1. Never use random splits for temporal fraud data; at minimum, use temporal splitting
  2. Report performance per event type; overall metrics hide critical vulnerabilities
  3. Include adversarial-event evaluation as a stress test alongside standard metrics
  4. Use AUPRC as the primary metric; AUROC is too forgiving for imbalanced fraud data
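
Recommendation 4 in scikit-learn terms: AUPRC is `average_precision_score`. A toy comparison on synthetic imbalanced scores (all values hypothetical) showing why AUROC is the more forgiving metric:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Synthetic scores: a weak ranker on ~2% positives.
rng = np.random.default_rng(42)
y_true = (rng.random(5000) < 0.02).astype(int)
scores = y_true * 0.3 + rng.random(5000)  # small additive signal

print(f"AUROC: {roc_auc_score(y_true, scores):.3f}")
print(f"AUPRC: {average_precision_score(y_true, scores):.3f}")
# On heavily imbalanced data, AUROC can look decent while AUPRC stays low.
```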

For Practitioners

  1. Test your production model on event periods it hasn't seen; this is the true test
  2. Monitor per-event-type performance in production with rolling windows
  3. Retrain before major event periods (holiday season, year-end)
  4. Simple models with correct splitting > Complex models with random splitting
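
Recommendation 2 can be sketched with a rolling window over a (hypothetical) production log; the column names and the 0.5 alert threshold are illustrative:

```python
import pandas as pd

# Hypothetical production log: timestamp, event label, model score, outcome.
log = pd.DataFrame({
    "ts": pd.to_datetime(["2024-06-01", "2024-06-08", "2024-06-15", "2024-06-22"] * 2),
    "event_type": ["normal", "weekend_day"] * 4,
    "score":    [0.1, 0.9, 0.2, 0.4, 0.7, 0.3, 0.9, 0.2],
    "is_fraud": [0,   1,   0,   1,   1,   0,   1,   0],
})

# Rolling 30-day window ending at the latest observation.
window = log[log["ts"] >= log["ts"].max() - pd.Timedelta(days=30)]

# Recall per event type at a 0.5 alert threshold.
stats = (window.assign(tp=(window["score"] >= 0.5) & window["is_fraud"].eq(1))
               .groupby("event_type")[["tp", "is_fraud"]].sum())
stats["recall"] = stats["tp"] / stats["is_fraud"]
print(stats["recall"])  # weekend_day recall lags normal in this toy log
```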

For the ML Community

  • Standardize event-aware splitting in fraud detection benchmarks
  • Publish per-event breakdowns alongside overall metrics
  • Treat data splitting as a first-class research contribution, not an afterthought

πŸ“ Repository Structure

```
├── README.md                              # This file
├── event_aware_splitting.py               # Complete experiment code (reproducible)
├── figures/
│   ├── fig1_performance_heatmap.png       # Overall performance comparison
│   ├── fig2_auroc_comparison.png          # AUROC by model and strategy
│   ├── fig3_event_analysis.png            # Fraud rates by event type
│   ├── fig4_event_stability.png           # Per-event performance stability
│   ├── fig5_degradation_heatmap.png       # Performance degradation vs random
│   ├── fig6_temporal_patterns.png         # Temporal fraud patterns
│   └── fig7_complexity_vs_splitting.png   # The key insight visualization
└── results/
    ├── experiment_results.csv             # All results in tabular format
    ├── detailed_event_results.json        # Per-event-type breakdown
    └── experiment_summary.json            # Summary statistics
```

🔄 Reproducibility

```bash
pip install pandas numpy scikit-learn matplotlib seaborn datasets lightgbm xgboost
python event_aware_splitting.py
```

All experiments run on CPU in ~20 minutes. No GPU required.


📚 Related Work

This work builds on insights from:

  • TabReD (Yandex, 2024): first systematic study of time-based splits in tabular benchmarks; found 11/100 datasets have data leakage. Paper
  • Fraud Dataset Benchmark (2022): standardized fraud detection datasets showing feature engineering matters more than model choice. Paper
  • Comparative Evaluation of AD Methods for Fraud (2023): demonstrated distribution shift between 2018-2020 fraud patterns (COVID impact). Paper
  • The Window Dilemma (2026): showed concept drift detection is fundamentally ill-posed due to windowing artifacts. Paper

📖 Citation

If you use this methodology in your research, please cite:

```bibtex
@misc{event_aware_splitting_2026,
  title={Event-Aware Data Splitting: A New Rule for Fraud Detection Evaluation},
  author={Moco22},
  year={2026},
  howpublished={\url{https://huggingface.co/Moco22/event-aware-data-splitting-fraud-detection}},
  note={Demonstrates that data splitting strategy impacts fraud detection evaluation more than model complexity}
}
```

⚠️ Key Takeaway

A simple Logistic Regression with event-aware splitting gives you a more honest picture of production performance than a state-of-the-art XGBoost with random splitting.

The gap between split strategies (up to 30% AUROC) is often larger than the gap between models (typically 5-10%).

Split your data right. Then worry about your model.


Built with 🤗 Hugging Face
