MucahitSylmz/event-aware-data-splitting-fraud-detection

Mucahit S. committed 30 days ago

Commit 346eefd (verified) · 1 parent: b936675

Upload README.md

README.md +263 -0

	@@ -0,0 +1,263 @@

+# 🔬 Event-Aware Data Splitting: A New Rule for Fraud Detection
+> **"Before building complex AI models, split your data correctly."**
+## 📌 Abstract
+This study introduces **Event-Aware Data Splitting** — a new paradigm for evaluating fraud detection systems that accounts for the temporal and contextual nature of real-world events. We demonstrate empirically that **how you split your data has a larger impact on reported performance than which model you choose**, and that standard random splitting creates dangerously optimistic evaluations that collapse under real-world event conditions.
+Our key contribution: **a formal splitting methodology that preserves event boundaries** (holidays, weekends, month-end periods, night-time surges) and provides realistic performance estimates that mirror production deployment.
+---
+## 🎯 The Problem
+Most fraud detection research uses **random train/test splitting**, which:
+- ❌ Leaks future event patterns into training data
+- ❌ Creates artificially balanced test sets that don't reflect production
+- ❌ Produces over-optimistic performance metrics (relative AUROC inflation of up to **30%** in our experiments)
+- ❌ Hides critical vulnerabilities during event periods (holidays, weekends)
+**In production, fraud detection systems face events they haven't seen before.** A model trained on normal days must handle Black Friday. Random splitting hides this reality.
+---
+## 🔑 Key Findings
+### Finding 1: The Adversarial Collapse
+When models trained only on "normal" periods are tested on event periods:
+| Model | Random Split AUROC | Adversarial-Event AUROC | **Relative Drop** |
+|-------|-------------------|------------------------|----------|
+| Logistic Regression | 0.9415 | 0.8609 | **-8.6%** |
+| Random Forest | 0.9944 | 0.9676 | **-2.7%** |
+| XGBoost | 0.9990 | 0.9833 | **-1.6%** |
+| LightGBM | 0.8813 | 0.6154 | **-30.2%** 🔴 |
+**LightGBM loses about 30% of its AUROC in relative terms** (0.8813 to 0.6154) when it is trained on normal periods and tested on event periods, which leaves it only modestly above the 0.5 of a random classifier.
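The percentages in the table are consistent with a relative change measured against the random-split score. The short sketch below recomputes them from the AUROC values in the table; the helper name `relative_drop` is ours, for illustration only.

```python
# AUROC values copied from the table above: (random split, adversarial-event split).
auroc = {
    "Logistic Regression": (0.9415, 0.8609),
    "Random Forest": (0.9944, 0.9676),
    "XGBoost": (0.9990, 0.9833),
    "LightGBM": (0.8813, 0.6154),
}


def relative_drop(random_split: float, adversarial: float) -> float:
    """Relative change in percent, measured against the random-split score."""
    return 100.0 * (adversarial - random_split) / random_split


drops = {name: round(relative_drop(r, a), 1) for name, (r, a) in auroc.items()}
for name, drop in drops.items():
    print(f"{name}: {drop:+.1f}%")
```

In absolute terms the LightGBM gap is 0.2659 AUROC points, so it is worth stating which convention is used whenever a drop is reported.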
+### Finding 2: Every Event Type is a Vulnerability
+Under adversarial-event splitting, **ALL event types** show critical AUROC degradation for LightGBM:
+| Event Type | AUROC | Status |
+|-----------|-------|--------|
+| holiday_weekday | 0.5997 | 🔴 CRITICAL |
+| holiday_weekend | 0.6103 | 🔴 CRITICAL |
+| weekday_night | 0.6223 | 🔴 CRITICAL |
+| weekend_night | 0.6288 | 🔴 CRITICAL |
+| month_end | 0.6541 | 🔴 CRITICAL |
+| weekend_day | 0.6600 | 🔴 CRITICAL |
+### Finding 3: Fraud Rates Vary Dramatically by Event Type
+The dataset reveals fundamentally different fraud landscapes across events:
+| Event Type | Fraud Rate | Transactions |
+|-----------|-----------|-------------|
+| weekday_night | **1.65%** | 140,147 |
+| weekend_night | **1.61%** | 79,629 |
+| holiday_weekday | 0.62% | 202,411 |
+| holiday_weekend | 0.51% | 113,797 |
+| normal | 0.12% | 291,093 |
+| weekend_day | 0.10% | 186,746 |
+| month_end | 0.10% | 34,752 |
+Night-time fraud rates are roughly **14x higher** than in normal periods (1.65% and 1.61% vs 0.12%). Random splitting masks this entirely.
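A table like the one above can be produced with a single group-by. This is a minimal sketch on synthetic rows, assuming a binary label column named `is_fraud` (a hypothetical name; the `event_type` column is the one used elsewhere in this README).

```python
import pandas as pd


def fraud_rate_by_event(df: pd.DataFrame, label_col: str = "is_fraud") -> pd.DataFrame:
    """Fraud rate (percent) and transaction count per event type, highest rate first."""
    summary = (
        df.groupby("event_type")[label_col]
        .agg(fraud_rate="mean", transactions="size")
        .sort_values("fraud_rate", ascending=False)
    )
    summary["fraud_rate"] = 100.0 * summary["fraud_rate"]
    return summary


# Tiny synthetic example, not the real dataset.
toy = pd.DataFrame({
    "event_type": ["normal"] * 4 + ["weekday_night"] * 4,
    "is_fraud": [0, 0, 0, 0, 1, 0, 0, 0],
})
summary = fraud_rate_by_event(toy)
print(summary)
```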
+### Finding 4: Event-Aware vs Random — AUPRC Tells the Real Story
+While AUROC drops are modest for robust models, **AUPRC reveals dramatic differences**:
+| Model | Random AUPRC | Event-Aware AUPRC | Adversarial AUPRC |
+|-------|-------------|-------------------|-------------------|
+| XGBoost | **0.9512** | 0.8892 | 0.8385 |
+| Random Forest | 0.8400 | 0.8057 | 0.6218 |
+| LightGBM | 0.3108 | 0.4828 | **0.0428** 🔴 |
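To see why the two metrics can diverge at a fraud rate near 0.5%, the sketch below scores synthetic data with scikit-learn. The score distributions are invented for illustration and are unrelated to the models above; `average_precision_score` is used as a common estimate of AUPRC.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_neg, n_pos = 10_000, 50  # about 0.5% positives, close to the dataset's fraud rate

y_true = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
# Synthetic scores: fraud cases score higher on average, with substantial overlap.
scores = np.concatenate([rng.normal(0.0, 1.0, n_neg), rng.normal(2.0, 1.0, n_pos)])

auroc = roc_auc_score(y_true, scores)
auprc = average_precision_score(y_true, scores)
prevalence = y_true.mean()  # a random ranker's AUPRC is roughly the positive rate

print(f"AUROC={auroc:.3f}  AUPRC={auprc:.3f}  random-ranker AUPRC~{prevalence:.4f}")
```

AUROC is computed from pairwise rankings and is insensitive to class balance, while precision is diluted by every high-scoring legitimate transaction, so the same scores look far weaker under AUPRC.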
+---
+## 📊 Experimental Design
+### Five Splitting Strategies
+| # | Strategy | Description | Purpose |
+|---|----------|-------------|---------|
+| 1 | **Random (Naive)** | Standard `train_test_split` with stratification | Baseline — what most papers use |
+| 2 | **Temporal** | Train on past, test on future | Standard time-series practice |
+| 3 | **Event-Aware (Ours)** | Split respecting event window boundaries | Our contribution — realistic evaluation |
+| 4 | **Stratified-Temporal** | Temporal split with per-event-type stratification | Balanced temporal evaluation |
+| 5 | **Adversarial-Event** | Train on normal only, test on events only | Worst-case stress test |
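Strategies 2 and 5 are simple enough to sketch in a few lines. The version below is an illustration under our own assumptions (row-level cut for the temporal split, the `unix_time` and `event_type` columns used elsewhere in this README) and may differ from the code in `event_aware_splitting.py`.

```python
import pandas as pd


def split_temporal(df: pd.DataFrame, test_ratio: float = 0.2):
    """Train on the earliest transactions, test on the latest ones."""
    ordered = df.sort_values("unix_time")
    n_test = max(1, int(len(ordered) * test_ratio))
    return ordered.iloc[:-n_test], ordered.iloc[-n_test:]


def split_adversarial_event(df: pd.DataFrame, normal_label: str = "normal"):
    """Train on normal-period rows only, test on every event-period row."""
    is_normal = df["event_type"] == normal_label
    return df[is_normal], df[~is_normal]


# Tiny synthetic example, not the real dataset.
toy = pd.DataFrame({
    "unix_time": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "event_type": ["normal", "weekend_day", "normal", "normal", "month_end",
                   "normal", "weekday_night", "normal", "normal", "weekend_night"],
})
train_t, test_t = split_temporal(toy)
train_a, test_a = split_adversarial_event(toy)
print(len(train_t), len(test_t), len(train_a), len(test_a))
```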
+### Four Models (Deliberately Simple)
+We use simple models to prove the point — **it's not about model complexity**:
+- Logistic Regression
+- Random Forest (200 trees)
+- XGBoost (200 boosting rounds)
+- LightGBM (200 boosting rounds)
+### Dataset
+- **Source:** [CIS435-CreditCardFraudDetection](https://huggingface.co/datasets/dazzle-nu/CIS435-CreditCardFraudDetection)
+- **Size:** 1,048,575 transactions
+- **Features:** 23 engineered features (temporal, geographic, categorical, amount-based)
+- **Fraud Rate:** 0.57% (6,006 fraud cases)
+- **Temporal Span:** Full year with rich temporal metadata
+---
+## 📈 Results Visualization
+### Figure 1: Performance Heatmap Across All Strategies and Models
+![Performance Heatmap](figures/fig1_performance_heatmap.png)
+### Figure 2: AUROC Comparison
+![AUROC Comparison](figures/fig2_auroc_comparison.png)
+### Figure 3: Event-Type Analysis — Different Events = Different Fraud Landscapes
+![Event Analysis](figures/fig3_event_analysis.png)
+### Figure 4: Performance Stability Across Event Types
+![Event Stability](figures/fig4_event_stability.png)
+### Figure 5: Performance Degradation vs Random Split
+![Degradation Heatmap](figures/fig5_degradation_heatmap.png)
+### Figure 6: Temporal Fraud Patterns — Why Random Splits Fail
+![Temporal Patterns](figures/fig6_temporal_patterns.png)
+### Figure 7: The Key Insight — Split Strategy Matters More Than Model Complexity
+![Complexity vs Splitting](figures/fig7_complexity_vs_splitting.png)
+---
+## 🧠 The Event-Aware Splitting Algorithm
+```python
+def split_event_aware(df, X, y, test_ratio=0.2):
+    """
+    EVENT-AWARE SPLIT (our contribution)
+    Principle: Split data such that complete "event windows" are kept
+    intact. Never split within an event period. Ensure test set contains
+    representative event periods the model has NOT seen.
+    Algorithm:
+    1. Group transactions into event windows (year-month × event_type)
+    2. Sort windows temporally (by mean transaction time)
+    3. Assign the LAST ~20% of windows to test set
+    4. Model must generalize to unseen event instances
+    """
+    # Build window labels without adding a column to the caller's DataFrame.
+    event_window = df['year_month'].astype(str) + '_' + df['event_type'].astype(str)
+    window_times = df['unix_time'].groupby(event_window).mean().sort_values()
+    # Keep at least one test window: with n_test == 0 the slice [-0:] selects ALL windows.
+    n_test = max(1, int(len(window_times) * test_ratio))
+    test_windows = set(window_times.index[-n_test:])
+    # Positional boolean masks work for pandas objects and NumPy arrays alike.
+    test_mask = event_window.isin(test_windows).to_numpy()
+    train_mask = ~test_mask
+    return X[train_mask], X[test_mask], y[train_mask], y[test_mask]
+```
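Whatever split is used, the core property (no event window shared between train and test) can be checked directly. The helper below is a hypothetical sanity check, not part of the published code; it assumes the `year_month` and `event_type` columns and a boolean test mask aligned with the rows of `df`.

```python
import pandas as pd


def windows_straddling_split(df: pd.DataFrame, test_mask) -> list:
    """Return event windows that have rows in both train and test (ideally none)."""
    window = df["year_month"].astype(str) + "_" + df["event_type"].astype(str)
    in_test = pd.Series(test_mask, index=df.index).astype(bool)
    # Share of each window's rows that landed in the test set: 0.0 or 1.0 when intact.
    share_in_test = in_test.groupby(window).mean()
    straddling = share_in_test[(share_in_test > 0) & (share_in_test < 1)]
    return sorted(straddling.index)


# Tiny synthetic example, not the real dataset.
toy = pd.DataFrame({
    "year_month": ["2020-01", "2020-01", "2020-02", "2020-02"],
    "event_type": ["normal", "normal", "normal", "weekend_day"],
})
clean = windows_straddling_split(toy, [False, False, True, True])
leaky = windows_straddling_split(toy, [False, True, True, True])
print(clean, leaky)
```

Note that windows are ordered by mean timestamp and different event types interleave within a month, so an intact-window split can still place some training rows later in time than some test rows.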
+### Event Types Defined
+| Event | Condition |
+|-------|-----------|
+| `holiday_weekend` | Holiday season (Nov-Jan) + Weekend |
+| `holiday_weekday` | Holiday season (Nov-Jan) + Weekday |
+| `weekend_night` | Weekend + Night (10PM-5AM) |
+| `weekend_day` | Weekend + Day |
+| `weekday_night` | Weekday + Night (10PM-5AM) |
+| `month_end` | Day 28+ of month |
+| `normal` | None of the above |
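Several of these conditions overlap (a Saturday night in December matches three rows), so a labeling function needs a priority order. The sketch below assumes the first matching row of the table wins and that "night" means 22:00 up to, but not including, 05:00; both are our assumptions, and the function name is hypothetical.

```python
import numpy as np
import pandas as pd


def label_event_type(ts: pd.Series) -> pd.Series:
    """Label each timestamp; the first matching rule wins (assumed priority)."""
    holiday = ts.dt.month.isin([11, 12, 1])
    weekend = ts.dt.dayofweek >= 5                 # Saturday=5, Sunday=6
    night = (ts.dt.hour >= 22) | (ts.dt.hour < 5)  # assumed: 22:00 to 04:59
    month_end = ts.dt.day >= 28
    conditions = [
        holiday & weekend,
        holiday & ~weekend,
        weekend & night,
        weekend & ~night,
        ~weekend & night,
        month_end,
    ]
    names = ["holiday_weekend", "holiday_weekday", "weekend_night",
             "weekend_day", "weekday_night", "month_end"]
    values = np.select([c.to_numpy() for c in conditions], names, default="normal")
    return pd.Series(values, index=ts.index)


stamps = pd.to_datetime(pd.Series([
    "2020-12-25 12:00",  # Friday in holiday season
    "2020-12-26 12:00",  # Saturday in holiday season
    "2020-06-06 23:00",  # Saturday night
    "2020-06-07 12:00",  # Sunday daytime
    "2020-06-03 02:00",  # Wednesday night
    "2020-06-29 12:00",  # Monday, day 29
    "2020-06-10 12:00",  # Wednesday daytime
]))
labels = label_event_type(stamps).tolist()
print(labels)
```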
+---
+## 💡 Practical Recommendations
+Based on our findings, we propose the **Event-Aware Data Splitting Rule**:
+### For Researchers
+1. **Never use random splits** for temporal fraud data — always use at minimum temporal splitting
+2. **Report performance per event type** — overall metrics hide critical vulnerabilities
+3. **Include adversarial-event evaluation** as a stress test alongside standard metrics
+4. **Use AUPRC as the primary metric** — AUROC is too forgiving for imbalanced fraud data
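Recommendation 2 can be implemented as a small reporting helper. This is a sketch under our own assumptions (the function name and argument layout are hypothetical); it uses scikit-learn's `roc_auc_score` and `average_precision_score` and skips groups that contain only one class, where AUROC is undefined.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score, roc_auc_score


def metrics_by_event(y_true, y_score, event_type) -> pd.DataFrame:
    """AUROC and AUPRC per event type; groups with a single class are skipped."""
    frame = pd.DataFrame({
        "y": np.asarray(y_true),
        "score": np.asarray(y_score),
        "event_type": np.asarray(event_type),
    })
    rows = []
    for name, group in frame.groupby("event_type"):
        if group["y"].nunique() < 2:
            continue  # AUROC is undefined without both classes
        rows.append({
            "event_type": name,
            "auroc": roc_auc_score(group["y"], group["score"]),
            "auprc": average_precision_score(group["y"], group["score"]),
            "n": len(group),
        })
    return pd.DataFrame(rows).set_index("event_type")


# Tiny synthetic example, not the real dataset.
report = metrics_by_event(
    y_true=[0, 0, 1, 1, 0, 1, 0, 0],
    y_score=[0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.2, 0.1],
    event_type=["normal"] * 4 + ["weekend_night"] * 3 + ["month_end"],
)
print(report)
```

Reporting `n` next to each score helps readers judge how much weight a per-event number can bear, since small event groups give noisy estimates.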
+### For Practitioners
+1. **Test your production model on event periods it hasn't seen** — this is the true test
+2. **Monitor per-event-type performance** in production with rolling windows
+3. **Retrain before major event periods** (holiday season, year-end)
+4. **Simple models with correct splitting > Complex models with random splitting**
+### For the ML Community
+- Standardize event-aware splitting in fraud detection benchmarks
+- Publish per-event breakdowns alongside overall metrics
+- Treat data splitting as a first-class research contribution, not an afterthought
+---
+## 📁 Repository Structure
+```
+├── README.md                              # This file
+├── event_aware_splitting.py               # Complete experiment code (reproducible)
+├── figures/
+│   ├── fig1_performance_heatmap.png       # Overall performance comparison
+│   ├── fig2_auroc_comparison.png          # AUROC by model and strategy
+│   ├── fig3_event_analysis.png            # Fraud rates by event type
+│   ├── fig4_event_stability.png           # Per-event performance stability
+│   ├── fig5_degradation_heatmap.png       # Performance degradation vs random
+│   ├── fig6_temporal_patterns.png         # Temporal fraud patterns
+│   └── fig7_complexity_vs_splitting.png   # The key insight visualization
+├── results/
+│   ├── experiment_results.csv             # All results in tabular format
+│   ├── detailed_event_results.json        # Per-event-type breakdown
+│   └── experiment_summary.json            # Summary statistics
+```
+---
+## 🔄 Reproducibility
+```bash
+pip install pandas numpy scikit-learn matplotlib seaborn datasets lightgbm xgboost
+python event_aware_splitting.py
+```
+All experiments run on CPU in ~20 minutes. No GPU required.
+---
+## 📚 Related Work
+This work builds on insights from:
+- **TabReD** (Yandex, 2024) — First systematic study of time-based splits in tabular benchmarks. Found 11/100 datasets have data leakage. [Paper](https://arxiv.org/abs/2406.19380)
+- **Fraud Dataset Benchmark** (2022) — Standardized fraud detection datasets showing feature engineering matters more than model choice. [Paper](https://arxiv.org/abs/2208.14417)
+- **Comparative Evaluation of AD Methods for Fraud** (2023) — Demonstrated distribution shift between 2018-2020 fraud patterns (COVID impact). [Paper](https://arxiv.org/abs/2312.13896)
+- **The Window Dilemma** (2026) — Showed concept drift detection is fundamentally ill-posed due to windowing artifacts. [Paper](https://arxiv.org/abs/2602.06456)
+---
+## 📖 Citation
+If you use this methodology in your research, please cite:
+```bibtex
+@misc{event_aware_splitting_2026,
+  title={Event-Aware Data Splitting: A New Rule for Fraud Detection Evaluation},
+  author={Moco22},
+  year={2026},
+  howpublished={\url{https://huggingface.co/Moco22/event-aware-data-splitting-fraud-detection}},
+  note={Demonstrates that data splitting strategy impacts fraud detection evaluation more than model complexity}
+}
+```
+---
+## ⚠️ Key Takeaway
+> **A simple Logistic Regression with event-aware splitting gives you a more honest picture of production performance than a state-of-the-art XGBoost with random splitting.**
+>
+> The gap between split strategies (a relative AUROC drop of up to 30%) is often larger than the gap between models (typically 5-10%).
+>
+> **Split your data right. Then worry about your model.**
+---
+*Built with 🤗 Hugging Face*