Mucahit S. committed on
Commit 346eefd · verified · 1 Parent(s): b936675

Upload README.md

  1. README.md +263 -0
README.md ADDED
@@ -0,0 +1,263 @@
+ # 🔬 Event-Aware Data Splitting: A New Rule for Fraud Detection
+
+ > **"Before building complex AI models, split your data correctly."**
+
+ ## 📌 Abstract
+
+ This study introduces **Event-Aware Data Splitting**, a new paradigm for evaluating fraud detection systems that accounts for the temporal and contextual nature of real-world events. We demonstrate empirically that **how you split your data has a larger impact on reported performance than which model you choose**, and that standard random splitting creates dangerously optimistic evaluations that collapse under real-world event conditions.
+
+ Our key contribution: **a formal splitting methodology that preserves event boundaries** (holidays, weekends, month-end periods, night-time surges) and provides realistic performance estimates that mirror production deployment.
+
+ ---
+
+ ## 🎯 The Problem
+
+ Most fraud detection research uses **random train/test splitting**, which:
+ - ❌ Leaks future event patterns into training data
+ - ❌ Creates artificially balanced test sets that don't reflect production
+ - ❌ Produces over-optimistic performance metrics (AUROC falls by up to **30%**, in relative terms, under adversarial-event splitting)
+ - ❌ Hides critical vulnerabilities during event periods (holidays, weekends)
+
+ **In production, fraud detection systems face events they haven't seen before.** A model trained on normal days must handle Black Friday. Random splitting hides this reality.
+
+ ---
+
+ ## 🔑 Key Findings
+
+ ### Finding 1: The Adversarial Collapse
+ When models trained only on "normal" periods are tested on event periods:
+
+ | Model | Random Split AUROC | Adversarial-Event AUROC | **Relative Drop** |
+ |-------|-------------------|------------------------|----------|
+ | Logistic Regression | 0.9415 | 0.8609 | **-8.6%** |
+ | Random Forest | 0.9944 | 0.9676 | **-2.7%** |
+ | XGBoost | 0.9990 | 0.9833 | **-1.6%** |
+ | LightGBM | 0.8813 | 0.6154 | **-30.2%** 🔴 |
+
+ **LightGBM loses 30% of its random-split AUROC** under the adversarial-event split (0.8813 to 0.6154), ending up not far above chance level (0.5).
+
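The percentages in the last column of the table above match a relative change measured against the random-split AUROC, not a difference in absolute AUROC points. A minimal sketch that recomputes them from the table; the dictionary layout and the helper name `relative_drop` are illustrative assumptions, not part of the repository.

```python
# AUROC values copied from the table above: (random split, adversarial-event split).
auroc = {
    "Logistic Regression": (0.9415, 0.8609),
    "Random Forest": (0.9944, 0.9676),
    "XGBoost": (0.9990, 0.9833),
    "LightGBM": (0.8813, 0.6154),
}


def relative_drop(random_auroc: float, adversarial_auroc: float) -> float:
    """Relative change in percent, measured against the random-split score."""
    return 100.0 * (adversarial_auroc - random_auroc) / random_auroc


# Hypothetical helper output: one rounded percentage per model.
drops = {name: round(relative_drop(r, a), 1) for name, (r, a) in auroc.items()}
for name, drop in drops.items():
    print(f"{name:<20} {drop:+.1f}%")
```

In absolute terms the LightGBM gap is about 0.27 AUROC points, so it helps to state which convention a reported drop uses.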
+ ### Finding 2: Every Event Type is a Vulnerability
+ Under adversarial-event splitting, **ALL event types** show critical AUROC degradation for LightGBM:
+
+ | Event Type | AUROC | Status |
+ |-----------|-------|--------|
+ | holiday_weekday | 0.5997 | 🔴 CRITICAL |
+ | holiday_weekend | 0.6103 | 🔴 CRITICAL |
+ | weekday_night | 0.6223 | 🔴 CRITICAL |
+ | weekend_night | 0.6288 | 🔴 CRITICAL |
+ | month_end | 0.6541 | 🔴 CRITICAL |
+ | weekend_day | 0.6600 | 🔴 CRITICAL |
+
+ ### Finding 3: Fraud Rates Vary Dramatically by Event Type
+ The dataset reveals fundamentally different fraud landscapes across events:
+
+ | Event Type | Fraud Rate | Transactions |
+ |-----------|-----------|-------------|
+ | weekday_night | **1.65%** | 140,147 |
+ | weekend_night | **1.61%** | 79,629 |
+ | holiday_weekday | 0.62% | 202,411 |
+ | holiday_weekend | 0.51% | 113,797 |
+ | normal | 0.12% | 291,093 |
+ | weekend_day | 0.10% | 186,746 |
+ | month_end | 0.10% | 34,752 |
+
+ Night-time fraud rates (1.61-1.65%) are roughly **14x** the normal-period rate (0.12%) and about **16x** the lowest event-type rate (0.10%). Random splitting masks this entirely.
+
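A per-event breakdown like the fraud-rate table above can be produced with a single group-by. The sketch below is illustrative only: the `event_type` column name follows the splitting code in this README, while the label column name `is_fraud`, the helper name, and the toy rows are assumptions, not the study's data.

```python
import pandas as pd


def fraud_rate_by_event(df: pd.DataFrame, label_col: str = "is_fraud") -> pd.DataFrame:
    """Transactions, fraud count, and fraud rate (%) per event type."""
    summary = df.groupby("event_type")[label_col].agg(
        transactions="size", frauds="sum"
    )
    summary["fraud_rate_pct"] = 100.0 * summary["frauds"] / summary["transactions"]
    # Highest-risk event types first.
    return summary.sort_values("fraud_rate_pct", ascending=False)


# Tiny made-up frame, only to show the call.
toy = pd.DataFrame({
    "event_type": ["normal"] * 4 + ["weekday_night"] * 4,
    "is_fraud": [0, 0, 0, 0, 1, 0, 0, 1],
})
rates = fraud_rate_by_event(toy)
print(rates)
```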
+ ### Finding 4: Event-Aware vs Random: AUPRC Tells the Real Story
+ While AUROC drops are modest for robust models, **AUPRC reveals dramatic differences**:
+
+ | Model | Random AUPRC | Event-Aware AUPRC | Adversarial AUPRC |
+ |-------|-------------|-------------------|-------------------|
+ | XGBoost | **0.9512** | 0.8892 | 0.8385 |
+ | Random Forest | 0.8400 | 0.8057 | 0.6218 |
+ | LightGBM | 0.3108 | 0.4828 | **0.0428** 🔴 |
+
+ ---
+
+ ## 📊 Experimental Design
+
+ ### Five Splitting Strategies
+
+ | # | Strategy | Description | Purpose |
+ |---|----------|-------------|---------|
+ | 1 | **Random (Naive)** | Standard `train_test_split` with stratification | Baseline: what most papers use |
+ | 2 | **Temporal** | Train on past, test on future | Standard time-series practice |
+ | 3 | **Event-Aware (Ours)** | Split respecting event window boundaries | Our contribution: realistic evaluation |
+ | 4 | **Stratified-Temporal** | Temporal split with per-event-type stratification | Balanced temporal evaluation |
+ | 5 | **Adversarial-Event** | Train on normal only, test on events only | Worst-case stress test |
+
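Two of the baseline strategies in the table above reduce to simple boolean masks. The following is a rough sketch, not the repository's implementation: the function names are hypothetical, the `unix_time` and `event_type` columns are taken from the splitting code in this README, and the quantile-based cutoff is one possible way to realize "train on past, test on future".

```python
import numpy as np
import pandas as pd


def temporal_masks(df: pd.DataFrame, test_ratio: float = 0.2):
    """Strategy 2: rows later than the (1 - test_ratio) time quantile form the test set."""
    cutoff = df["unix_time"].quantile(1.0 - test_ratio)
    test = (df["unix_time"] > cutoff).to_numpy()
    return ~test, test


def adversarial_event_masks(df: pd.DataFrame):
    """Strategy 5: train on 'normal' rows only, test on every other event type."""
    train = (df["event_type"] == "normal").to_numpy()
    return train, ~train


# Made-up rows, only to show the shapes involved.
toy = pd.DataFrame({
    "unix_time": np.arange(10),
    "event_type": ["normal"] * 6 + ["holiday_weekday"] * 2 + ["weekend_night"] * 2,
})
tr_t, te_t = temporal_masks(toy)
tr_a, te_a = adversarial_event_masks(toy)
print(te_t.sum(), tr_a.sum(), te_a.sum())
```

Returning masks rather than sliced arrays keeps the same helper usable for features, labels, and the event metadata needed for per-event reporting.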
+ ### Four Models (Deliberately Simple)
+ We use simple models to prove the point: **it's not about model complexity**.
+ - Logistic Regression
+ - Random Forest (200 trees)
+ - XGBoost (200 boosting rounds)
+ - LightGBM (200 boosting rounds)
+
+ ### Dataset
+ - **Source:** [CIS435-CreditCardFraudDetection](https://huggingface.co/datasets/dazzle-nu/CIS435-CreditCardFraudDetection)
+ - **Size:** 1,048,575 transactions
+ - **Features:** 23 engineered features (temporal, geographic, categorical, amount-based)
+ - **Fraud Rate:** 0.57% (6,006 fraud cases)
+ - **Temporal Span:** Full year with rich temporal metadata
+
+ ---
+
+ ## 📈 Results Visualization
+
+ ### Figure 1: Performance Heatmap Across All Strategies and Models
+ ![Performance Heatmap](figures/fig1_performance_heatmap.png)
+
+ ### Figure 2: AUROC Comparison
+ ![AUROC Comparison](figures/fig2_auroc_comparison.png)
+
+ ### Figure 3: Event-Type Analysis: Different Events = Different Fraud Landscapes
+ ![Event Analysis](figures/fig3_event_analysis.png)
+
+ ### Figure 4: Performance Stability Across Event Types
+ ![Event Stability](figures/fig4_event_stability.png)
+
+ ### Figure 5: Performance Degradation vs Random Split
+ ![Degradation Heatmap](figures/fig5_degradation_heatmap.png)
+
+ ### Figure 6: Temporal Fraud Patterns: Why Random Splits Fail
+ ![Temporal Patterns](figures/fig6_temporal_patterns.png)
+
+ ### Figure 7: The Key Insight: Split Strategy Matters More Than Model Complexity
+ ![Complexity vs Splitting](figures/fig7_complexity_vs_splitting.png)
+
+ ---
+
130
+ ## 🧠 The Event-Aware Splitting Algorithm
131
+
132
+ ```python
133
+ def split_event_aware(df, X, y, test_ratio=0.2):
134
+ """
135
+ EVENT-AWARE SPLIT β€” Our Contribution
136
+
137
+ Principle: Split data such that complete "event windows" are kept
138
+ intact. Never split within an event period. Ensure test set contains
139
+ representative event periods the model has NOT seen.
140
+
141
+ Algorithm:
142
+ 1. Group transactions into event windows (year-month Γ— event_type)
143
+ 2. Sort windows temporally
144
+ 3. Assign the LAST ~20% of windows to test set
145
+ 4. Model must generalize to unseen event instances
146
+ """
147
+ df['event_window'] = df['year_month'] + '_' + df['event_type']
148
+ window_times = df.groupby('event_window')['unix_time'].mean().sort_values()
149
+
150
+ n_test = int(len(window_times) * test_ratio)
151
+ test_windows = set(window_times.index[-n_test:])
152
+
153
+ train_mask = ~df['event_window'].isin(test_windows)
154
+ test_mask = df['event_window'].isin(test_windows)
155
+
156
+ return X[train_mask], X[test_mask], y[train_mask], y[test_mask]
157
+ ```
158
+
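The "never split within an event period" principle can be verified after any split. The check below is a small sketch under stated assumptions: `check_window_integrity` is a hypothetical helper, and the window labels merely imitate the `year_month + '_' + event_type` pattern used above.

```python
import numpy as np


def check_window_integrity(window_ids, test_mask) -> bool:
    """Return True if no event window has rows on both sides of the split."""
    window_ids = np.asarray(window_ids)
    test_mask = np.asarray(test_mask, dtype=bool)
    train_windows = set(window_ids[~test_mask])
    test_windows = set(window_ids[test_mask])
    return train_windows.isdisjoint(test_windows)


# Made-up labels: three rows in one window, two rows in a later one.
windows = np.array(["2020-11_holiday_weekday"] * 3 + ["2020-12_holiday_weekend"] * 2)
good_split = np.array([False, False, False, True, True])  # whole windows only
bad_split = np.array([False, False, True, True, True])    # cuts the first window
ok_good = check_window_integrity(windows, good_split)
ok_bad = check_window_integrity(windows, bad_split)
print(ok_good, ok_bad)
```

A random split of the same rows will usually fail this check, which is a quick way to confirm that a pipeline really uses window-level assignment.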
+ ### Event Types Defined
+ | Event | Condition |
+ |-------|-----------|
+ | `holiday_weekend` | Holiday season (Nov-Jan) + Weekend |
+ | `holiday_weekday` | Holiday season (Nov-Jan) + Weekday |
+ | `weekend_night` | Weekend + Night (10PM-5AM) |
+ | `weekend_day` | Weekend + Day |
+ | `weekday_night` | Weekday + Night (10PM-5AM) |
+ | `month_end` | Day 28+ of month |
+ | `normal` | None of the above |
+
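The conditions in the table above overlap (a holiday Saturday night matches several rows), so some precedence is needed. The sketch below assumes the rows are applied top to bottom with the first match winning, and that night means hour >= 22 or hour < 5; both are assumptions, since the README does not state them, and `assign_event_type` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def assign_event_type(timestamps: pd.Series) -> pd.Series:
    """Label each timestamp with one event type; the first matching rule wins."""
    ts = pd.to_datetime(timestamps)
    holiday = ts.dt.month.isin([11, 12, 1]).to_numpy()            # Nov-Jan
    weekend = (ts.dt.dayofweek >= 5).to_numpy()                   # Saturday=5, Sunday=6
    night = ((ts.dt.hour >= 22) | (ts.dt.hour < 5)).to_numpy()    # 10PM-5AM (assumed bounds)
    month_end = (ts.dt.day >= 28).to_numpy()

    # Same order as the table; np.select picks the first true condition per row.
    conditions = [
        holiday & weekend,
        holiday & ~weekend,
        weekend & night,
        weekend & ~night,
        ~weekend & night,
        month_end,
    ]
    names = [
        "holiday_weekend", "holiday_weekday", "weekend_night",
        "weekend_day", "weekday_night", "month_end",
    ]
    return pd.Series(np.select(conditions, names, default="normal"), index=ts.index)


stamps = pd.Series(pd.to_datetime([
    "2020-12-25 10:00",  # Friday in the holiday season
    "2020-06-06 23:00",  # Saturday night
    "2020-06-03 02:00",  # Wednesday night
    "2020-06-30 12:00",  # Tuesday, day 30
    "2020-06-03 12:00",  # ordinary Wednesday noon
]))
labels = assign_event_type(stamps)
print(labels.tolist())
```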
+ ---
+
+ ## 💡 Practical Recommendations
+
+ Based on our findings, we propose the **Event-Aware Data Splitting Rule**:
+
+ ### For Researchers
+ 1. **Never use random splits** for temporal fraud data; at minimum, use temporal splitting
+ 2. **Report performance per event type**: overall metrics hide critical vulnerabilities
+ 3. **Include adversarial-event evaluation** as a stress test alongside standard metrics
+ 4. **Use AUPRC as the primary metric**: AUROC is too forgiving for imbalanced fraud data
+
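Per-event reporting, as recommended above, amounts to computing the usual metrics within each event group. The following is a hedged sketch, not the repository's evaluation code: `per_event_report` is a hypothetical helper, AUPRC is estimated with scikit-learn's `average_precision_score`, and groups containing a single class are skipped because AUROC is undefined for them.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score, roc_auc_score


def per_event_report(y_true, y_score, event_type) -> pd.DataFrame:
    """AUROC and AUPRC per event type, worst AUROC first."""
    frame = pd.DataFrame({
        "y": np.asarray(y_true),
        "score": np.asarray(y_score),
        "event": np.asarray(event_type),
    })
    rows = []
    for event, grp in frame.groupby("event"):
        if grp["y"].nunique() < 2:
            continue  # need both fraud and non-fraud rows to rank
        rows.append({
            "event_type": event,
            "n": len(grp),
            "fraud_rate": grp["y"].mean(),
            "auroc": roc_auc_score(grp["y"], grp["score"]),
            "auprc": average_precision_score(grp["y"], grp["score"]),
        })
    return pd.DataFrame(rows).sort_values("auroc").reset_index(drop=True)


# Made-up labels and scores, only to show the call.
y = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
s = [0.1, 0.2, 0.8, 0.9, 0.6, 0.4, 0.3, 0.7, 0.2, 0.1]
e = ["normal"] * 4 + ["weekend_night"] * 4 + ["month_end"] * 2
report = per_event_report(y, s, e)
print(report)
```

Sorting by AUROC puts the weakest event type at the top, which is the row an overall metric tends to hide.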
+ ### For Practitioners
+ 1. **Test your production model on event periods it hasn't seen**: this is the true test
+ 2. **Monitor per-event-type performance** in production with rolling windows
+ 3. **Retrain before major event periods** (holiday season, year-end)
+ 4. **Simple models with correct splitting > Complex models with random splitting**
+
+ ### For the ML Community
+ - Standardize event-aware splitting in fraud detection benchmarks
+ - Publish per-event breakdowns alongside overall metrics
+ - Treat data splitting as a first-class research contribution, not an afterthought
+
+ ---
+
+ ## 📁 Repository Structure
+
+ ```
+ ├── README.md                              # This file
+ ├── event_aware_splitting.py               # Complete experiment code (reproducible)
+ ├── figures/
+ │   ├── fig1_performance_heatmap.png       # Overall performance comparison
+ │   ├── fig2_auroc_comparison.png          # AUROC by model and strategy
+ │   ├── fig3_event_analysis.png            # Fraud rates by event type
+ │   ├── fig4_event_stability.png           # Per-event performance stability
+ │   ├── fig5_degradation_heatmap.png       # Performance degradation vs random
+ │   ├── fig6_temporal_patterns.png         # Temporal fraud patterns
+ │   └── fig7_complexity_vs_splitting.png   # The key insight visualization
+ ├── results/
+ │   ├── experiment_results.csv             # All results in tabular format
+ │   ├── detailed_event_results.json        # Per-event-type breakdown
+ │   └── experiment_summary.json            # Summary statistics
+ ```
+
+ ---
+
+ ## 🔄 Reproducibility
+
+ ```bash
+ pip install pandas numpy scikit-learn matplotlib seaborn datasets lightgbm xgboost
+ python event_aware_splitting.py
+ ```
+
+ All experiments run on CPU in ~20 minutes. No GPU required.
+
+ ---
+
+ ## 📚 Related Work
+
+ This work builds on insights from:
+ - **TabReD** (Yandex, 2024): First systematic study of time-based splits in tabular benchmarks. Found 11/100 datasets have data leakage. [Paper](https://arxiv.org/abs/2406.19380)
+ - **Fraud Dataset Benchmark** (2022): Standardized fraud detection datasets showing feature engineering matters more than model choice. [Paper](https://arxiv.org/abs/2208.14417)
+ - **Comparative Evaluation of AD Methods for Fraud** (2023): Demonstrated distribution shift between 2018-2020 fraud patterns (COVID impact). [Paper](https://arxiv.org/abs/2312.13896)
+ - **The Window Dilemma** (2026): Showed concept drift detection is fundamentally ill-posed due to windowing artifacts. [Paper](https://arxiv.org/abs/2602.06456)
+
+ ---
+
+ ## 📖 Citation
+
+ If you use this methodology in your research, please cite:
+
+ ```bibtex
+ @misc{event_aware_splitting_2026,
+   title={Event-Aware Data Splitting: A New Rule for Fraud Detection Evaluation},
+   author={Moco22},
+   year={2026},
+   howpublished={\url{https://huggingface.co/Moco22/event-aware-data-splitting-fraud-detection}},
+   note={Demonstrates that data splitting strategy impacts fraud detection evaluation more than model complexity}
+ }
+ ```
+
+ ---
+
+ ## ⚠️ Key Takeaway
+
+ > **A simple Logistic Regression with event-aware splitting gives you a more honest picture of production performance than a state-of-the-art XGBoost with random splitting.**
+ >
+ > The gap between split strategies (up to a 30% relative drop in AUROC) is often larger than the gap between models (typically 5-10%).
+ >
+ > **Split your data right. Then worry about your model.**
+
+ ---
+
+ *Built with 🤗 Hugging Face*