YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
π¬ Event-Aware Data Splitting: A New Rule for Fraud Detection
"Before building complex AI models, split your data correctly."
π Abstract
This study introduces Event-Aware Data Splitting β a new paradigm for evaluating fraud detection systems that accounts for the temporal and contextual nature of real-world events. We demonstrate empirically that how you split your data has a larger impact on reported performance than which model you choose, and that standard random splitting creates dangerously optimistic evaluations that collapse under real-world event conditions.
Our key contribution: a formal splitting methodology that preserves event boundaries (holidays, weekends, month-end periods, night-time surges) and provides realistic performance estimates that mirror production deployment.
π― The Problem
Most fraud detection research uses random train/test splitting, which:
- β Leaks future event patterns into training data
- β Creates artificially balanced test sets that don't reflect production
- β Produces over-optimistic performance metrics (up to 30% inflation)
- β Hides critical vulnerabilities during event periods (holidays, weekends)
In production, fraud detection systems face events they haven't seen before. A model trained on normal days must handle Black Friday. Random splitting hides this reality.
π Key Findings
Finding 1: The Adversarial Collapse
When models trained only on "normal" periods are tested on event periods:
| Model | Random Split AUROC | Adversarial-Event AUROC | Drop |
|---|---|---|---|
| Logistic Regression | 0.9415 | 0.8609 | -8.6% |
| Random Forest | 0.9944 | 0.9676 | -2.7% |
| XGBoost | 0.9990 | 0.9833 | -1.6% |
| LightGBM | 0.8813 | 0.6154 | -30.2% π΄ |
LightGBM loses 30% of its AUROC when the split respects event boundaries β it collapses to near-random performance.
Finding 2: Every Event Type is a Vulnerability
Under adversarial-event splitting, ALL event types show critical AUROC degradation for LightGBM:
| Event Type | AUROC | Status |
|---|---|---|
| holiday_weekday | 0.5997 | π΄ CRITICAL |
| holiday_weekend | 0.6103 | π΄ CRITICAL |
| weekday_night | 0.6223 | π΄ CRITICAL |
| weekend_night | 0.6288 | π΄ CRITICAL |
| month_end | 0.6541 | π΄ CRITICAL |
| weekend_day | 0.6600 | π΄ CRITICAL |
Finding 3: Fraud Rates Vary Dramatically by Event Type
The dataset reveals fundamentally different fraud landscapes across events:
| Event Type | Fraud Rate | Transactions |
|---|---|---|
| weekday_night | 1.65% | 140,147 |
| weekend_night | 1.61% | 79,629 |
| holiday_weekday | 0.62% | 202,411 |
| holiday_weekend | 0.51% | 113,797 |
| normal | 0.12% | 291,093 |
| weekend_day | 0.10% | 186,746 |
| month_end | 0.10% | 34,752 |
Night-time fraud rates are 16x higher than normal periods. Random splitting masks this entirely.
Finding 4: Event-Aware vs Random β AUPRC Tells the Real Story
While AUROC drops are modest for robust models, AUPRC reveals dramatic differences:
| Model | Random AUPRC | Event-Aware AUPRC | Adversarial AUPRC |
|---|---|---|---|
| XGBoost | 0.9512 | 0.8892 | 0.8385 |
| Random Forest | 0.8400 | 0.8057 | 0.6218 |
| LightGBM | 0.3108 | 0.4828 | 0.0428 π΄ |
π Experimental Design
Five Splitting Strategies
| # | Strategy | Description | Purpose |
|---|---|---|---|
| 1 | Random (Naive) | Standard train_test_split with stratification |
Baseline β what most papers use |
| 2 | Temporal | Train on past, test on future | Standard time-series practice |
| 3 | Event-Aware (Ours) | Split respecting event window boundaries | Our contribution β realistic evaluation |
| 4 | Stratified-Temporal | Temporal split with per-event-type stratification | Balanced temporal evaluation |
| 5 | Adversarial-Event | Train on normal only, test on events only | Worst-case stress test |
Four Models (Deliberately Simple)
We use simple models to prove the point β it's not about model complexity:
- Logistic Regression
- Random Forest (200 trees)
- XGBoost (200 boosters)
- LightGBM (200 boosters)
Dataset
- Source: CIS435-CreditCardFraudDetection
- Size: 1,048,575 transactions
- Features: 23 engineered features (temporal, geographic, categorical, amount-based)
- Fraud Rate: 0.57% (6,006 fraud cases)
- Temporal Span: Full year with rich temporal metadata
π Results Visualization
Figure 1: Performance Heatmap Across All Strategies and Models
Figure 2: AUROC Comparison
Figure 3: Event-Type Analysis β Different Events = Different Fraud Landscapes
Figure 4: Performance Stability Across Event Types
Figure 5: Performance Degradation vs Random Split
Figure 6: Temporal Fraud Patterns β Why Random Splits Fail
Figure 7: The Key Insight β Split Strategy Matters More Than Model Complexity
π§ The Event-Aware Splitting Algorithm
def split_event_aware(df, X, y, test_ratio=0.2):
"""
EVENT-AWARE SPLIT β Our Contribution
Principle: Split data such that complete "event windows" are kept
intact. Never split within an event period. Ensure test set contains
representative event periods the model has NOT seen.
Algorithm:
1. Group transactions into event windows (year-month Γ event_type)
2. Sort windows temporally
3. Assign the LAST ~20% of windows to test set
4. Model must generalize to unseen event instances
"""
df['event_window'] = df['year_month'] + '_' + df['event_type']
window_times = df.groupby('event_window')['unix_time'].mean().sort_values()
n_test = int(len(window_times) * test_ratio)
test_windows = set(window_times.index[-n_test:])
train_mask = ~df['event_window'].isin(test_windows)
test_mask = df['event_window'].isin(test_windows)
return X[train_mask], X[test_mask], y[train_mask], y[test_mask]
Event Types Defined
| Event | Condition |
|---|---|
holiday_weekend |
Holiday season (Nov-Jan) + Weekend |
holiday_weekday |
Holiday season (Nov-Jan) + Weekday |
weekend_night |
Weekend + Night (10PM-5AM) |
weekend_day |
Weekend + Day |
weekday_night |
Weekday + Night (10PM-5AM) |
month_end |
Day 28+ of month |
normal |
None of the above |
π‘ Practical Recommendations
Based on our findings, we propose the Event-Aware Data Splitting Rule:
For Researchers
- Never use random splits for temporal fraud data β always use at minimum temporal splitting
- Report performance per event type β overall metrics hide critical vulnerabilities
- Include adversarial-event evaluation as a stress test alongside standard metrics
- Use AUPRC as the primary metric β AUROC is too forgiving for imbalanced fraud data
For Practitioners
- Test your production model on event periods it hasn't seen β this is the true test
- Monitor per-event-type performance in production with rolling windows
- Retrain before major event periods (holiday season, year-end)
- Simple models with correct splitting > Complex models with random splitting
For the ML Community
- Standardize event-aware splitting in fraud detection benchmarks
- Publish per-event breakdowns alongside overall metrics
- Treat data splitting as a first-class research contribution, not an afterthought
π Repository Structure
βββ README.md # This file
βββ event_aware_splitting.py # Complete experiment code (reproducible)
βββ figures/
β βββ fig1_performance_heatmap.png # Overall performance comparison
β βββ fig2_auroc_comparison.png # AUROC by model and strategy
β βββ fig3_event_analysis.png # Fraud rates by event type
β βββ fig4_event_stability.png # Per-event performance stability
β βββ fig5_degradation_heatmap.png # Performance degradation vs random
β βββ fig6_temporal_patterns.png # Temporal fraud patterns
β βββ fig7_complexity_vs_splitting.png # The key insight visualization
βββ results/
β βββ experiment_results.csv # All results in tabular format
β βββ detailed_event_results.json # Per-event-type breakdown
β βββ experiment_summary.json # Summary statistics
π Reproducibility
pip install pandas numpy scikit-learn matplotlib seaborn datasets lightgbm xgboost
python event_aware_splitting.py
All experiments run on CPU in ~20 minutes. No GPU required.
π Related Work
This work builds on insights from:
- TabReD (Yandex, 2024) β First systematic study of time-based splits in tabular benchmarks. Found 11/100 datasets have data leakage. Paper
- Fraud Dataset Benchmark (2022) β Standardized fraud detection datasets showing feature engineering matters more than model choice. Paper
- Comparative Evaluation of AD Methods for Fraud (2023) β Demonstrated distribution shift between 2018-2020 fraud patterns (COVID impact). Paper
- The Window Dilemma (2026) β Showed concept drift detection is fundamentally ill-posed due to windowing artifacts. Paper
π Citation
If you use this methodology in your research, please cite:
@misc{event_aware_splitting_2026,
title={Event-Aware Data Splitting: A New Rule for Fraud Detection Evaluation},
author={Moco22},
year={2026},
howpublished={\url{https://huggingface.co/Moco22/event-aware-data-splitting-fraud-detection}},
note={Demonstrates that data splitting strategy impacts fraud detection evaluation more than model complexity}
}
β οΈ Key Takeaway
A simple Logistic Regression with event-aware splitting gives you a more honest picture of production performance than a state-of-the-art XGBoost with random splitting.
The gap between split strategies (up to 30% AUROC) is often larger than the gap between models (typically 5-10%).
Split your data right. Then worry about your model.
Built with π€ Hugging Face






