YAML Metadata Warning:empty or missing yaml metadata in repo card

Check out the documentation for more information.

🔬 Event-Aware Data Splitting: A New Rule for Fraud Detection

"Before building complex AI models, split your data correctly."

📌 Abstract

This study introduces Event-Aware Data Splitting — a new paradigm for evaluating fraud detection systems that accounts for the temporal and contextual nature of real-world events. We demonstrate empirically that how you split your data has a larger impact on reported performance than which model you choose, and that standard random splitting creates dangerously optimistic evaluations that collapse under real-world event conditions.

Our key contribution: a formal splitting methodology that preserves event boundaries (holidays, weekends, month-end periods, night-time surges) and provides realistic performance estimates that mirror production deployment.

🎯 The Problem

Most fraud detection research uses random train/test splitting, which:

❌ Leaks future event patterns into training data
❌ Creates artificially balanced test sets that don't reflect production
❌ Produces over-optimistic performance metrics (up to 30% inflation)
❌ Hides critical vulnerabilities during event periods (holidays, weekends)

In production, fraud detection systems face events they haven't seen before. A model trained on normal days must handle Black Friday. Random splitting hides this reality.

🔑 Key Findings

Finding 1: The Adversarial Collapse

When models trained only on "normal" periods are tested on event periods:

Model	Random Split AUROC	Adversarial-Event AUROC	Drop
Logistic Regression	0.9415	0.8609	-8.6%
Random Forest	0.9944	0.9676	-2.7%
XGBoost	0.9990	0.9833	-1.6%
LightGBM	0.8813	0.6154	-30.2% 🔴

LightGBM loses 30% of its AUROC when the split respects event boundaries — it collapses to near-random performance.

Finding 2: Every Event Type is a Vulnerability

Under adversarial-event splitting, ALL event types show critical AUROC degradation for LightGBM:

Event Type	AUROC	Status
holiday_weekday	0.5997	🔴 CRITICAL
holiday_weekend	0.6103	🔴 CRITICAL
weekday_night	0.6223	🔴 CRITICAL
weekend_night	0.6288	🔴 CRITICAL
month_end	0.6541	🔴 CRITICAL
weekend_day	0.6600	🔴 CRITICAL

Finding 3: Fraud Rates Vary Dramatically by Event Type

The dataset reveals fundamentally different fraud landscapes across events:

Event Type	Fraud Rate	Transactions
weekday_night	1.65%	140,147
weekend_night	1.61%	79,629
holiday_weekday	0.62%	202,411
holiday_weekend	0.51%	113,797
normal	0.12%	291,093
weekend_day	0.10%	186,746
month_end	0.10%	34,752

Night-time fraud rates are 16x higher than normal periods. Random splitting masks this entirely.

Finding 4: Event-Aware vs Random — AUPRC Tells the Real Story

While AUROC drops are modest for robust models, AUPRC reveals dramatic differences:

Model	Random AUPRC	Event-Aware AUPRC	Adversarial AUPRC
XGBoost	0.9512	0.8892	0.8385
Random Forest	0.8400	0.8057	0.6218
LightGBM	0.3108	0.4828	0.0428 🔴

📊 Experimental Design

Five Splitting Strategies

#	Strategy	Description	Purpose
1	Random (Naive)	Standard `train_test_split` with stratification	Baseline — what most papers use
2	Temporal	Train on past, test on future	Standard time-series practice
3	Event-Aware (Ours)	Split respecting event window boundaries	Our contribution — realistic evaluation
4	Stratified-Temporal	Temporal split with per-event-type stratification	Balanced temporal evaluation
5	Adversarial-Event	Train on normal only, test on events only	Worst-case stress test

Four Models (Deliberately Simple)

We use simple models to prove the point — it's not about model complexity:

Logistic Regression
Random Forest (200 trees)
XGBoost (200 boosters)
LightGBM (200 boosters)

Dataset

Source: CIS435-CreditCardFraudDetection
Size: 1,048,575 transactions
Features: 23 engineered features (temporal, geographic, categorical, amount-based)
Fraud Rate: 0.57% (6,006 fraud cases)
Temporal Span: Full year with rich temporal metadata

📈 Results Visualization

Figure 1: Performance Heatmap Across All Strategies and Models

Figure 2: AUROC Comparison

Figure 3: Event-Type Analysis — Different Events = Different Fraud Landscapes

Figure 4: Performance Stability Across Event Types

Figure 5: Performance Degradation vs Random Split

Figure 6: Temporal Fraud Patterns — Why Random Splits Fail

Figure 7: The Key Insight — Split Strategy Matters More Than Model Complexity

🧠 The Event-Aware Splitting Algorithm

def split_event_aware(df, X, y, test_ratio=0.2):
    """
    EVENT-AWARE SPLIT — Our Contribution
    
    Principle: Split data such that complete "event windows" are kept 
    intact. Never split within an event period. Ensure test set contains 
    representative event periods the model has NOT seen.
    
    Algorithm:
    1. Group transactions into event windows (year-month × event_type)
    2. Sort windows temporally 
    3. Assign the LAST ~20% of windows to test set
    4. Model must generalize to unseen event instances
    """
    df['event_window'] = df['year_month'] + '_' + df['event_type']
    window_times = df.groupby('event_window')['unix_time'].mean().sort_values()
    
    n_test = int(len(window_times) * test_ratio)
    test_windows = set(window_times.index[-n_test:])
    
    train_mask = ~df['event_window'].isin(test_windows)
    test_mask = df['event_window'].isin(test_windows)
    
    return X[train_mask], X[test_mask], y[train_mask], y[test_mask]

Event Types Defined

Event	Condition
`holiday_weekend`	Holiday season (Nov-Jan) + Weekend
`holiday_weekday`	Holiday season (Nov-Jan) + Weekday
`weekend_night`	Weekend + Night (10PM-5AM)
`weekend_day`	Weekend + Day
`weekday_night`	Weekday + Night (10PM-5AM)
`month_end`	Day 28+ of month
`normal`	None of the above

💡 Practical Recommendations

Based on our findings, we propose the Event-Aware Data Splitting Rule:

For Researchers

Never use random splits for temporal fraud data — always use at minimum temporal splitting
Report performance per event type — overall metrics hide critical vulnerabilities
Include adversarial-event evaluation as a stress test alongside standard metrics
Use AUPRC as the primary metric — AUROC is too forgiving for imbalanced fraud data

For Practitioners

Test your production model on event periods it hasn't seen — this is the true test
Monitor per-event-type performance in production with rolling windows
Retrain before major event periods (holiday season, year-end)
Simple models with correct splitting > Complex models with random splitting

For the ML Community

Standardize event-aware splitting in fraud detection benchmarks
Publish per-event breakdowns alongside overall metrics
Treat data splitting as a first-class research contribution, not an afterthought

📁 Repository Structure

├── README.md                              # This file
├── event_aware_splitting.py               # Complete experiment code (reproducible)
├── figures/
│   ├── fig1_performance_heatmap.png       # Overall performance comparison
│   ├── fig2_auroc_comparison.png          # AUROC by model and strategy
│   ├── fig3_event_analysis.png            # Fraud rates by event type
│   ├── fig4_event_stability.png           # Per-event performance stability
│   ├── fig5_degradation_heatmap.png       # Performance degradation vs random
│   ├── fig6_temporal_patterns.png         # Temporal fraud patterns
│   └── fig7_complexity_vs_splitting.png   # The key insight visualization
├── results/
│   ├── experiment_results.csv             # All results in tabular format
│   ├── detailed_event_results.json        # Per-event-type breakdown
│   └── experiment_summary.json            # Summary statistics

🔄 Reproducibility

pip install pandas numpy scikit-learn matplotlib seaborn datasets lightgbm xgboost
python event_aware_splitting.py

All experiments run on CPU in ~20 minutes. No GPU required.

📚 Related Work

This work builds on insights from:

TabReD (Yandex, 2024) — First systematic study of time-based splits in tabular benchmarks. Found 11/100 datasets have data leakage. Paper
Fraud Dataset Benchmark (2022) — Standardized fraud detection datasets showing feature engineering matters more than model choice. Paper
Comparative Evaluation of AD Methods for Fraud (2023) — Demonstrated distribution shift between 2018-2020 fraud patterns (COVID impact). Paper
The Window Dilemma (2026) — Showed concept drift detection is fundamentally ill-posed due to windowing artifacts. Paper

📖 Citation

If you use this methodology in your research, please cite:

@misc{event_aware_splitting_2026,
  title={Event-Aware Data Splitting: A New Rule for Fraud Detection Evaluation},
  author={Moco22},
  year={2026},
  howpublished={\url{https://huggingface.co/Moco22/event-aware-data-splitting-fraud-detection}},
  note={Demonstrates that data splitting strategy impacts fraud detection evaluation more than model complexity}
}

⚠️ Key Takeaway

A simple Logistic Regression with event-aware splitting gives you a more honest picture of production performance than a state-of-the-art XGBoost with random splitting.

The gap between split strategies (up to 30% AUROC) is often larger than the gap between models (typically 5-10%).

Split your data right. Then worry about your model.

Built with 🤗 Hugging Face

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Papers for MucahitSylmz/event-aware-data-splitting-fraud-detection

The Window Dilemma: Why Concept Drift Detection is Ill-Posed

Paper • 2602.06456 • Published Feb 6

TabReD: A Benchmark of Tabular Machine Learning in-the-Wild

Paper • 2406.19380 • Published Jun 27, 2024 • 49

Comparative Evaluation of Anomaly Detection Methods for Fraud Detection in Online Credit Card Payments

Paper • 2312.13896 • Published Dec 21, 2023

Fraud Dataset Benchmark and Applications

Paper • 2208.14417 • Published Aug 30, 2022