rajvivan committed on
Commit
35852d7
·
verified ·
1 Parent(s): bf86f89

Remove author details and improve IEEE formatting

  1. paper/fraud_detection_paper.tex +124 -64
paper/fraud_detection_paper.tex CHANGED
@@ -1,42 +1,55 @@
1
  \documentclass[journal]{IEEEtran}
 
 
2
  \usepackage{cite}
3
  \usepackage{amsmath,amssymb,amsfonts}
4
  \usepackage{graphicx}
5
  \usepackage{textcomp}
6
  \usepackage{xcolor}
7
- \usepackage{listings}
8
  \usepackage{booktabs}
9
- \usepackage{hyperref}
10
  \usepackage{multirow}
 
 
 
 
11
  \usepackage{array}
12
  \usepackage{float}
 
 
13
 
 
14
  \lstset{
15
  language=Python,
16
- basicstyle=\ttfamily\footnotesize,
17
  keywordstyle=\color{blue},
18
  stringstyle=\color{red},
19
  commentstyle=\color{green!60!black},
20
- numbers=left,
21
- numberstyle=\tiny\color{gray},
22
  breaklines=true,
23
  frame=single,
24
- captionpos=b
 
 
25
  }
26
 
 
 
 
27
  \begin{document}
28
 
 
 
 
 
29
  \title{A Comprehensive Ensemble-Based Framework for Credit Card Fraud Detection with Explainable AI}
30
 
31
- \author{
32
- \IEEEauthorblockN{Raj Vivan}
33
- \IEEEauthorblockA{Department of Computer Science\\
34
- \textit{Independent Research}\\
35
- Email: rajvivan@example.com}
36
- }
37
 
38
  \maketitle
39
 
 
 
 
 
40
  \begin{abstract}
41
  Credit card fraud poses a significant threat to the global financial ecosystem, with estimated losses exceeding \$32 billion annually. This paper presents a comprehensive end-to-end fraud detection framework that systematically evaluates and compares seven machine learning approaches: Logistic Regression, Random Forest, XGBoost, LightGBM, Multilayer Perceptron, Autoencoder-based anomaly detection, and a Voting Ensemble. Using the benchmark European Cardholder dataset (284,807 transactions, 0.173\% fraud rate), we engineer 12 novel features and address the extreme class imbalance through both SMOTE oversampling and cost-sensitive learning with class weights. Our XGBoost model achieves the best performance with a PR-AUC of 0.8166, precision of 0.9048, recall of 0.8028, and F1-score of 0.8507 on the held-out test set. We demonstrate that optimizing the decision threshold from the default 0.5 to 0.55 improves F1 from 0.8507 to 0.8636. Comprehensive model explainability via SHAP and LIME analysis reveals that PCA components V4, V14, and V12 are the primary discriminative features. Error analysis shows that false negatives arise from sophisticated fraud patterns that closely mimic legitimate transaction behavior. We deploy the model as a production-ready FastAPI service achieving sub-10ms inference latency. The framework includes automated concept drift monitoring and retraining recommendations. All code, models, and results are publicly available.
42
  \end{abstract}
@@ -45,13 +58,17 @@ Credit card fraud poses a significant threat to the global financial ecosystem,
45
  Fraud detection, credit card, machine learning, XGBoost, ensemble learning, explainable AI, SHAP, class imbalance, anomaly detection
46
  \end{IEEEkeywords}
47
 
 
 
 
 
48
  \section{Introduction}
49
 
50
- Financial fraud detection has become one of the most critical applications of machine learning in the modern digital economy. The proliferation of electronic payment systems has led to an exponential increase in both the volume of transactions and the sophistication of fraudulent activities \cite{dal2015credit}. According to the Nilson Report, global card fraud losses reached \$32.34 billion in 2021 and are projected to exceed \$43 billion by 2026 \cite{nilson2022}.
51
 
52
- The fundamental challenge in fraud detection lies in the extreme class imbalance inherent in transaction data. In typical datasets, fraudulent transactions constitute less than 0.5\% of all transactions \cite{pozzolo2015calibrating}. This imbalance renders conventional classification metrics such as accuracy misleading and necessitates specialized evaluation criteria including Precision-Recall AUC and Matthews Correlation Coefficient \cite{saito2015precision}.
53
 
54
- Previous approaches to fraud detection have ranged from rule-based expert systems \cite{bolton2002statistical} to sophisticated deep learning architectures \cite{zhang2021fraud}. While deep learning methods have shown promise, tree-based ensemble methods such as XGBoost and LightGBM continue to demonstrate competitive or superior performance on tabular financial data \cite{shwartz2022tabular}, particularly when augmented with careful feature engineering and proper handling of class imbalance.
55
 
56
  This paper makes the following contributions:
57
  \begin{enumerate}
@@ -63,35 +80,43 @@ This paper makes the following contributions:
63
  \item Quantitative business impact analysis estimating financial savings from deployment.
64
  \end{enumerate}
65
 
 
 
 
 
66
  \section{Related Work}
67
 
68
- Credit card fraud detection has been extensively studied across multiple paradigms. Dal Pozzolo et al. \cite{dal2015credit} provided a foundational analysis of the challenges posed by class imbalance and concept drift in real-world fraud detection systems. Their work established that undersampling strategies could be effective but risked losing valuable information from the majority class.
 
 
69
 
70
- Chawla et al. \cite{chawla2002smote} introduced SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic minority class samples by interpolating between existing examples. Subsequent work by Fernandez et al. \cite{fernandez2018smote} demonstrated that SMOTE should be applied exclusively to training data, as applying it before splitting introduces data leakage.
71
 
72
- Ensemble methods have shown particular promise in fraud detection. Xuan et al. \cite{xuan2018random} demonstrated that Random Forests achieve robust performance through bagging and feature randomization. Chen and Guestrin \cite{chen2016xgboost} introduced XGBoost, and gradient-boosted trees have since become a dominant approach for tabular classification, including fraud detection \cite{taha2020detection}.
73
 
74
- Ke et al. \cite{ke2017lightgbm} proposed LightGBM with leaf-wise tree growth and gradient-based one-side sampling, achieving faster training with comparable accuracy. Prokhorenkova et al. \cite{prokhorenkova2018catboost} introduced CatBoost with ordered boosting to handle categorical features natively.
75
 
76
- Deep learning approaches have also been explored. Pumsirirat and Yan \cite{pumsirirat2018credit} employed autoencoders for anomaly-based fraud detection, training exclusively on legitimate transactions and detecting fraud through reconstruction error. Zhang et al. \cite{zhang2021fraud} proposed a convolutional recurrent neural network that captures sequential transaction patterns.
77
 
78
- Explainability in fraud detection has gained importance due to regulatory requirements. Lundberg and Lee \cite{lundberg2017unified} introduced SHAP (SHapley Additive exPlanations), providing consistent feature attribution. Ribeiro et al. \cite{ribeiro2016lime} proposed LIME (Local Interpretable Model-agnostic Explanations) for instance-level interpretability. Belle and Papantonis \cite{belle2021principles} surveyed explainable AI methods applicable to financial decision-making.
79
 
80
- Akiba et al. \cite{akiba2019optuna} introduced Optuna, a hyperparameter optimization framework using Tree-structured Parzen Estimators (TPE) that efficiently explores complex search spaces.
81
 
82
- Recent work by Shwartz-Ziv and Armon \cite{shwartz2022tabular} demonstrated that well-tuned tree-based methods still outperform deep learning on most tabular datasets, supporting our choice of XGBoost as the primary model. Grinsztajn et al. \cite{grinsztajn2022tree} further corroborated this finding with extensive benchmarking.
 
 
83
 
84
  \section{Dataset and Exploratory Data Analysis}
85
 
86
  \subsection{Dataset Description}
87
 
88
- We use the European Cardholder Credit Card Fraud Detection dataset \cite{dal2015credit}, containing 284,807 transactions made over two days in September 2013. The dataset includes 28 PCA-transformed features (V1--V28), the original \texttt{Time} and \texttt{Amount} features, and a binary \texttt{Class} label (0 = legitimate, 1 = fraud).
89
 
90
  \subsection{Class Distribution}
91
 
92
  The dataset exhibits extreme class imbalance with only 492 fraudulent transactions (0.173\%), yielding an imbalance ratio of approximately 1:577. This severe imbalance necessitates specialized handling during both training and evaluation.
93
 
94
- \begin{table}[h]
95
  \centering
96
  \caption{Class Distribution in the Dataset}
97
  \label{tab:class_dist}
@@ -119,13 +144,17 @@ Our exploratory analysis revealed five critical findings:
119
  \item \textbf{Feature Scale}: V1--V28 are PCA-transformed; only Time and Amount require normalization.
120
  \end{enumerate}
121
 
122
- \begin{figure}[h]
123
  \centering
124
- \includegraphics[width=\columnwidth]{figures/class_distribution.png}
125
  \caption{Class distribution showing extreme imbalance (0.173\% fraud rate).}
126
  \label{fig:class_dist}
127
  \end{figure}
128
 
 
 
 
 
129
  \section{Methodology}
130
 
131
  \subsection{Feature Engineering}
@@ -164,7 +193,7 @@ M = \sqrt{\sum_{i=1}^{28} V_i^2}
164
 
165
  We compare two approaches for handling the 1:577 class imbalance:
166
 
167
- \textbf{SMOTE} \cite{chawla2002smote}: Applied exclusively to the training set after splitting, generating synthetic fraud samples to achieve a 1:2 minority-to-majority ratio.
168
 
169
  \textbf{Cost-Sensitive Learning}: Applying class weights inversely proportional to class frequency:
170
  \begin{equation}
@@ -218,12 +247,16 @@ P(\text{fraud}|x) = \frac{1}{3}\sum_{m=1}^{3} P_m(\text{fraud}|x)
218
 
219
  \subsection{Hyperparameter Optimization}
220
 
221
- We use Optuna \cite{akiba2019optuna} with Tree-structured Parzen Estimators (TPE) to tune the top three models, optimizing PR-AUC on the validation set:
222
 
223
  \begin{equation}
224
  \theta^* = \arg\max_{\theta} \text{PR-AUC}(f_\theta, \mathcal{D}_{\text{val}})
225
  \end{equation}
226
 
 
 
 
 
227
  \section{Experimental Setup}
228
 
229
  \subsection{Environment}
@@ -241,11 +274,15 @@ Given the extreme class imbalance, we report six metrics:
241
  \item \textbf{MCC}: $\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
242
  \end{itemize}
243
 
 
 
 
 
244
  \section{Results and Discussion}
245
 
246
  \subsection{Model Comparison}
247
 
248
- \begin{table*}[t]
249
  \centering
250
  \caption{Comprehensive Model Comparison on Test Set (Threshold = 0.5)}
251
  \label{tab:results}
@@ -270,22 +307,22 @@ Table~\ref{tab:results} presents the comprehensive evaluation results. XGBoost a
270
 
271
  Key observations:
272
 
273
- \textbf{Tree-based models dominate}: XGBoost, Random Forest, and LightGBM consistently outperform the neural network approaches, consistent with findings by Shwartz-Ziv and Armon \cite{shwartz2022tabular}.
274
 
275
  \textbf{Class weight handling matters}: Logistic Regression achieves high recall (0.8873) but extremely low precision (0.0488), indicating that the linear decision boundary with class weights is too aggressive in flagging transactions.
276
 
277
  \textbf{Autoencoder limitations}: While achieving perfect recall (1.0), the autoencoder suffers from extremely low precision (0.0033), flagging roughly 300 transactions for every fraud it catches. This suggests that the reconstruction-based approach is too sensitive for this PCA-transformed feature space.
278
 
279
- \begin{figure}[h]
280
  \centering
281
- \includegraphics[width=\columnwidth]{figures/roc_curves.png}
282
  \caption{ROC curves for all models. XGBoost and Voting Ensemble achieve the highest AUC.}
283
  \label{fig:roc}
284
  \end{figure}
285
 
286
- \begin{figure}[h]
287
  \centering
288
- \includegraphics[width=\columnwidth]{figures/pr_curves.png}
289
  \caption{Precision-Recall curves. PR-AUC is the primary metric for imbalanced classification.}
290
  \label{fig:pr}
291
  \end{figure}
@@ -294,7 +331,7 @@ Key observations:
294
 
295
  The default threshold of 0.5 is not necessarily optimal for imbalanced data. Our analysis shows that, for XGBoost, a threshold of 0.55 maximizes the F1-score:
296
 
297
- \begin{table}[h]
298
  \centering
299
  \caption{Threshold Sensitivity for XGBoost}
300
  \label{tab:threshold}
@@ -314,7 +351,7 @@ The default threshold of 0.5 is suboptimal for imbalanced data. Our analysis rev
314
 
315
  \subsection{Business Impact}
316
 
317
- \begin{table}[h]
318
  \centering
319
  \caption{Business Impact Analysis (Test Set)}
320
  \label{tab:business}
@@ -326,7 +363,7 @@ XGBoost & 6,966 & 1,711 & 6,936 \\
326
  Ensemble & 6,966 & 1,711 & 6,921 \\
327
  RF (Tuned) & 6,722 & 1,955 & 6,682 \\
328
  LR & 7,699 & 978 & 1,554 \\
329
- Autoencoder & 8,677 & 0 & -97,368 \\
330
  \bottomrule
331
  \end{tabular}
332
  \end{table}
@@ -337,13 +374,17 @@ Table~\ref{tab:business} demonstrates that XGBoost provides the highest net savi
337
 
338
  SHAP analysis reveals that V4 (mean $|\text{SHAP}| = 1.913$), V14 (1.843), and PCA\_magnitude (1.113) are the primary fraud discriminators. These features correspond to specific latent patterns in the PCA-transformed space that distinguish fraudulent from legitimate behavior.
339
 
340
- \begin{figure}[h]
341
  \centering
342
- \includegraphics[width=\columnwidth]{figures/shap_summary.png}
343
  \caption{SHAP summary plot showing feature contributions to fraud predictions.}
344
  \label{fig:shap}
345
  \end{figure}
346
 
 
 
 
 
347
  \section{Error Analysis}
348
 
349
  \subsection{False Negative Analysis}
@@ -356,7 +397,11 @@ The 6 false positives have a mean predicted fraud probability of 0.827, with fea
356
 
357
  \subsection{Concept Drift Assessment}
358
 
359
- Comparing model confidence between early and late test periods reveals a drift indicator of +0.115, suggesting modest temporal variation. We recommend weekly monitoring with automated retraining triggers when PR-AUC drops below 0.70.
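The monitoring rule described above can be sketched as follows. The text does not define the drift indicator, so this sketch assumes it is the mean predicted fraud probability in the late half of a time-ordered window minus the mean in the early half. The function names and the half-window split are hypothetical; only the 0.70 PR-AUC floor comes from the text.

```python
from statistics import mean


def confidence_drift(scores_in_time_order):
    """Late-half mean minus early-half mean of predicted fraud probabilities.

    Assumption: this is one plausible reading of the "drift indicator";
    the paper does not state its exact definition.
    """
    half = len(scores_in_time_order) // 2
    early = scores_in_time_order[:half]
    late = scores_in_time_order[half:]
    return mean(late) - mean(early)


def should_retrain(pr_auc, pr_auc_floor=0.70):
    """Trigger retraining when the monitored PR-AUC falls below the floor."""
    return pr_auc < pr_auc_floor


# Toy scores, ordered by transaction time (hypothetical values).
toy_scores = [0.1, 0.1, 0.3, 0.3]
drift = confidence_drift(toy_scores)
retrain_now = should_retrain(pr_auc=0.65)
```

A weekly job could evaluate both functions on the most recent labelled window and raise an alert when `should_retrain` returns `True`.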
 
 
 
 
360
 
361
  \section{Limitations}
362
 
@@ -368,15 +413,19 @@ Comparing model confidence between early and late test periods reveals a drift i
368
  \item \textbf{Static Threshold}: The optimal threshold may shift as fraud patterns evolve; dynamic threshold adaptation is not implemented.
369
  \end{enumerate}
370
 
 
 
 
 
371
  \section{Future Work}
372
 
373
  Several promising directions emerge from this research:
374
 
375
- \textbf{Graph Neural Networks}: Modeling transaction networks as graphs could enable detection of fraud rings through collaborative behavioral patterns \cite{liu2021graph}.
376
 
377
  \textbf{Real-Time Streaming}: Integration with Apache Kafka and Apache Flink for millisecond-latency processing of transaction streams at scale.
378
 
379
- \textbf{Federated Learning}: Training across multiple banks without sharing raw transaction data, preserving privacy while improving generalization \cite{yang2019federated}.
380
 
381
  \textbf{LLM-Generated Explanations}: Using large language models to generate natural-language compliance explanations for flagged transactions, facilitating human review.
382
 
@@ -384,79 +433,90 @@ Several promising directions emerge from this research:
384
 
385
  \textbf{Adversarial Robustness}: Training models that are robust to adversarial perturbations designed to evade detection.
386
 
 
 
 
 
387
  \section{Conclusion}
388
 
389
  This paper presents a comprehensive fraud detection framework that systematically evaluates seven machine learning approaches on the benchmark European Cardholder dataset. Our results demonstrate that XGBoost achieves the best overall performance (PR-AUC: 0.8166, F1: 0.8507) through cost-sensitive learning with optimized class weights. Threshold optimization from 0.5 to 0.55 further improves F1 to 0.8636. The framework includes complete explainability through SHAP and LIME, production deployment via FastAPI with sub-10ms latency, and automated drift monitoring. Our analysis confirms that tree-based ensemble methods remain the most effective approach for tabular fraud detection, while highlighting the importance of proper class imbalance handling, threshold optimization, and the inadequacy of accuracy as a metric for imbalanced classification.
390
 
 
 
 
 
 
 
 
391
  \bibliographystyle{IEEEtran}
392
 
393
  \begin{thebibliography}{99}
394
 
395
  \bibitem{dal2015credit}
396
- A. Dal Pozzolo, O. Caelen, R. A. Johnson, and G. Bontempi, ``Calibrating probability with undersampling for unbalanced classification,'' in \textit{Proc. IEEE Symp. Comput. Intell. Data Mining (CIDM)}, 2015, pp. 159--166.
397
 
398
  \bibitem{nilson2022}
399
  Nilson Report, ``Global card fraud losses,'' \textit{Nilson Report}, Issue 1209, 2022.
400
 
401
  \bibitem{pozzolo2015calibrating}
402
- A. Dal Pozzolo, O. Caelen, and G. Bontempi, ``When is undersampling effective in unbalanced classification tasks?,'' in \textit{Proc. European Conf. Machine Learning and Knowledge Discovery in Databases}, 2015, pp. 200--215.
403
 
404
  \bibitem{saito2015precision}
405
- T. Saito and M. Rehmsmeier, ``The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets,'' \textit{PLoS ONE}, vol. 10, no. 3, 2015.
406
 
407
  \bibitem{bolton2002statistical}
408
- R. J. Bolton and D. J. Hand, ``Statistical fraud detection: A review,'' \textit{Statistical Science}, vol. 17, no. 3, pp. 235--255, 2002.
409
 
410
  \bibitem{zhang2021fraud}
411
- Z. Zhang, X. Zhou, X. Zhang, L. Wang, and P. Wang, ``A model based on convolutional recurrent neural network for fraud detection in credit card,'' \textit{Complexity}, vol. 2021, pp. 1--9, 2021.
412
 
413
  \bibitem{shwartz2022tabular}
414
- R. Shwartz-Ziv and A. Armon, ``Tabular data: Deep learning is not all you need,'' \textit{Information Fusion}, vol. 81, pp. 84--90, 2022.
415
 
416
  \bibitem{chawla2002smote}
417
- N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, ``SMOTE: Synthetic Minority Over-sampling Technique,'' \textit{J. Artificial Intelligence Research}, vol. 16, pp. 321--357, 2002.
418
 
419
  \bibitem{fernandez2018smote}
420
- A. Fernandez, S. Garcia, M. Galar, R. C. Prati, B. Krawczyk, and F. Herrera, \textit{Learning from Imbalanced Data Sets}. Springer, 2018.
421
 
422
  \bibitem{xuan2018random}
423
- S. Xuan, G. Liu, Z. Li, L. Zheng, S. Wang, and C. Jiang, ``Random forest for credit card fraud detection,'' in \textit{Proc. IEEE 15th Intl. Conf. Networking, Sensing and Control (ICNSC)}, 2018, pp. 1--6.
424
 
425
  \bibitem{chen2016xgboost}
426
- T. Chen and C. Guestrin, ``XGBoost: A scalable tree boosting system,'' in \textit{Proc. 22nd ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining}, 2016, pp. 785--794.
427
 
428
  \bibitem{taha2020detection}
429
- A. A. Taha and S. J. Malebary, ``An intelligent approach to credit card fraud detection using an optimized light gradient boosting machine,'' \textit{IEEE Access}, vol. 8, pp. 25579--25587, 2020.
430
 
431
  \bibitem{ke2017lightgbm}
432
- G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, ``LightGBM: A highly efficient gradient boosting decision tree,'' in \textit{Advances in Neural Information Processing Systems}, vol. 30, 2017.
433
 
434
  \bibitem{prokhorenkova2018catboost}
435
- L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin, ``CatBoost: Unbiased boosting with categorical features,'' in \textit{Advances in Neural Information Processing Systems}, vol. 31, 2018.
436
 
437
  \bibitem{pumsirirat2018credit}
438
- A. Pumsirirat and L. Yan, ``Credit card fraud detection using deep learning based on auto-encoder and restricted Boltzmann machine,'' \textit{Intl. J. Advanced Computer Science and Applications}, vol. 9, no. 1, 2018.
439
 
440
  \bibitem{lundberg2017unified}
441
- S. M. Lundberg and S.-I. Lee, ``A unified approach to interpreting model predictions,'' in \textit{Advances in Neural Information Processing Systems}, vol. 30, 2017.
442
 
443
  \bibitem{ribeiro2016lime}
444
- M. T. Ribeiro, S. Singh, and C. Guestrin, ``Why should I trust you?: Explaining the predictions of any classifier,'' in \textit{Proc. 22nd ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining}, 2016, pp. 1135--1144.
445
 
446
  \bibitem{belle2021principles}
447
- V. Belle and I. Papantonis, ``Principles and practice of explainable machine learning,'' \textit{Frontiers in Big Data}, vol. 4, 2021.
448
 
449
  \bibitem{akiba2019optuna}
450
- T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, ``Optuna: A next-generation hyperparameter optimization framework,'' in \textit{Proc. 25th ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining}, 2019, pp. 2623--2631.
451
 
452
  \bibitem{grinsztajn2022tree}
453
- L. Grinsztajn, E. Oyallon, and G. Varoquaux, ``Why do tree-based models still outperform deep learning on tabular data?,'' in \textit{Advances in Neural Information Processing Systems}, vol. 35, 2022.
454
 
455
  \bibitem{liu2021graph}
456
- Y. Liu, M. Ao, C. Chi, F. Feng, D. Yang, and J. He, ``Pick and choose: A GNN-based imbalanced learning approach for fraud detection,'' in \textit{Proc. Web Conf.}, 2021, pp. 3168--3177.
457
 
458
  \bibitem{yang2019federated}
459
- Q. Yang, Y. Liu, T. Chen, and Y. Tong, ``Federated machine learning: Concept and applications,'' \textit{ACM Trans. Intelligent Systems and Technology}, vol. 10, no. 2, pp. 1--19, 2019.
460
 
461
  \end{thebibliography}
462
 
 
1
  \documentclass[journal]{IEEEtran}
2
+
3
+ % ─── Packages ──────────────────────────────────────────────────────────────────
4
  \usepackage{cite}
5
  \usepackage{amsmath,amssymb,amsfonts}
6
  \usepackage{graphicx}
7
  \usepackage{textcomp}
8
  \usepackage{xcolor}
 
9
  \usepackage{booktabs}
 
10
  \usepackage{multirow}
11
+ \usepackage{hyperref}
12
+ \usepackage{listings}
13
+ \usepackage{algorithm}
14
+ \usepackage{algorithmic}
15
  \usepackage{array}
16
  \usepackage{float}
17
+ \usepackage{url}
18
+ \usepackage{balance}
19
 
20
+ % ─── Listings Configuration ────────────────────────────────────────────────────
21
  \lstset{
22
  language=Python,
23
+ basicstyle=\ttfamily\scriptsize,
24
  keywordstyle=\color{blue},
25
  stringstyle=\color{red},
26
  commentstyle=\color{green!60!black},
 
 
27
  breaklines=true,
28
  frame=single,
29
+ numbers=left,
30
+ numberstyle=\tiny\color{gray},
31
+ captionpos=b,
32
  }
33
 
34
+ % ─── Graphics Path ─────────────────────────────────────────────────────────────
35
+ \graphicspath{{figures/}}
36
+
37
  \begin{document}
38
 
39
+ % ═══════════════════════════════════════════════════════════════════════════════
40
+ % TITLE
41
+ % ═══════════════════════════════════════════════════════════════════════════════
42
+
43
  \title{A Comprehensive Ensemble-Based Framework for Credit Card Fraud Detection with Explainable AI}
44
 
45
+ \author{}
 
 
 
 
 
46
 
47
  \maketitle
48
 
49
+ % ═══════════════════════════════════════════════════════════════════════════════
50
+ % ABSTRACT
51
+ % ═══════════════════════════════════════════════════════════════════════════════
52
+
53
  \begin{abstract}
54
  Credit card fraud poses a significant threat to the global financial ecosystem, with estimated losses exceeding \$32 billion annually. This paper presents a comprehensive end-to-end fraud detection framework that systematically evaluates and compares seven machine learning approaches: Logistic Regression, Random Forest, XGBoost, LightGBM, Multilayer Perceptron, Autoencoder-based anomaly detection, and a Voting Ensemble. Using the benchmark European Cardholder dataset (284,807 transactions, 0.173\% fraud rate), we engineer 12 novel features and address the extreme class imbalance through both SMOTE oversampling and cost-sensitive learning with class weights. Our XGBoost model achieves the best performance with a PR-AUC of 0.8166, precision of 0.9048, recall of 0.8028, and F1-score of 0.8507 on the held-out test set. We demonstrate that optimizing the decision threshold from the default 0.5 to 0.55 improves F1 from 0.8507 to 0.8636. Comprehensive model explainability via SHAP and LIME analysis reveals that PCA components V4, V14, and V12 are the primary discriminative features. Error analysis shows that false negatives arise from sophisticated fraud patterns that closely mimic legitimate transaction behavior. We deploy the model as a production-ready FastAPI service achieving sub-10ms inference latency. The framework includes automated concept drift monitoring and retraining recommendations. All code, models, and results are publicly available.
55
  \end{abstract}
 
58
  Fraud detection, credit card, machine learning, XGBoost, ensemble learning, explainable AI, SHAP, class imbalance, anomaly detection
59
  \end{IEEEkeywords}
60
 
61
+ % ═══════════════════════════════════════════════════════════════════════════════
62
+ % I. INTRODUCTION
63
+ % ═══════════════════════════════════════════════════════════════════════════════
64
+
65
  \section{Introduction}
66
 
67
+ \IEEEPARstart{F}{inancial} fraud detection has become one of the most critical applications of machine learning in the modern digital economy. The proliferation of electronic payment systems has led to an exponential increase in both the volume of transactions and the sophistication of fraudulent activities~\cite{dal2015credit}. According to the Nilson Report, global card fraud losses reached \$32.34 billion in 2021 and are projected to exceed \$43 billion by 2026~\cite{nilson2022}.
68
 
69
+ The fundamental challenge in fraud detection lies in the extreme class imbalance inherent in transaction data. In typical datasets, fraudulent transactions constitute less than 0.5\% of all transactions~\cite{pozzolo2015calibrating}. This imbalance renders conventional classification metrics such as accuracy misleading and necessitates specialized evaluation criteria including Precision-Recall AUC and Matthews Correlation Coefficient~\cite{saito2015precision}.
70
 
71
+ Previous approaches to fraud detection have ranged from rule-based expert systems~\cite{bolton2002statistical} to sophisticated deep learning architectures~\cite{zhang2021fraud}. While deep learning methods have shown promise, tree-based ensemble methods such as XGBoost and LightGBM continue to demonstrate competitive or superior performance on tabular financial data~\cite{shwartz2022tabular}, particularly when augmented with careful feature engineering and proper handling of class imbalance.
72
 
73
  This paper makes the following contributions:
74
  \begin{enumerate}
 
80
  \item Quantitative business impact analysis estimating financial savings from deployment.
81
  \end{enumerate}
82
 
83
+ % ═══════════════════════════════════════════════════════════════════════════════
84
+ % II. RELATED WORK
85
+ % ═══════════════════════════════════════════════════════════════════════════════
86
+
87
  \section{Related Work}
88
 
89
+ Credit card fraud detection has been extensively studied across multiple paradigms. Dal Pozzolo et al.~\cite{dal2015credit} provided a foundational analysis of the challenges posed by class imbalance and concept drift in real-world fraud detection systems. Their work established that undersampling strategies could be effective but risked losing valuable information from the majority class.
90
+
91
+ Chawla et al.~\cite{chawla2002smote} introduced SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic minority class samples by interpolating between existing examples. Subsequent work by Fernandez et al.~\cite{fernandez2018smote} demonstrated that SMOTE should be applied exclusively to training data, as applying it before splitting introduces data leakage.
92
 
93
+ Ensemble methods have shown particular promise in fraud detection. Xuan et al.~\cite{xuan2018random} demonstrated that Random Forests achieve robust performance through bagging and feature randomization. Chen and Guestrin~\cite{chen2016xgboost} introduced XGBoost, and gradient-boosted trees have since become a dominant approach for tabular classification, including fraud detection~\cite{taha2020detection}.
94
 
95
+ Ke et al.~\cite{ke2017lightgbm} proposed LightGBM with leaf-wise tree growth and gradient-based one-side sampling, achieving faster training with comparable accuracy. Prokhorenkova et al.~\cite{prokhorenkova2018catboost} introduced CatBoost with ordered boosting to handle categorical features natively.
96
 
97
+ Deep learning approaches have also been explored. Pumsirirat and Yan~\cite{pumsirirat2018credit} employed autoencoders for anomaly-based fraud detection, training exclusively on legitimate transactions and detecting fraud through reconstruction error. Zhang et al.~\cite{zhang2021fraud} proposed a convolutional recurrent neural network that captures sequential transaction patterns.
98
 
99
+ Explainability in fraud detection has gained importance due to regulatory requirements. Lundberg and Lee~\cite{lundberg2017unified} introduced SHAP (SHapley Additive exPlanations), providing consistent feature attribution. Ribeiro et al.~\cite{ribeiro2016lime} proposed LIME (Local Interpretable Model-agnostic Explanations) for instance-level interpretability. Belle and Papantonis~\cite{belle2021principles} surveyed explainable AI methods applicable to financial decision-making.
100
 
101
+ Akiba et al.~\cite{akiba2019optuna} introduced Optuna, a hyperparameter optimization framework using Tree-structured Parzen Estimators (TPE) that efficiently explores complex search spaces.
102
 
103
+ Recent work by Shwartz-Ziv and Armon~\cite{shwartz2022tabular} demonstrated that well-tuned tree-based methods still outperform deep learning on most tabular datasets, supporting our choice of XGBoost as the primary model. Grinsztajn et al.~\cite{grinsztajn2022tree} further corroborated this finding with extensive benchmarking.
104
 
105
+ % ═══════════════════════════════════════════════════════════════════════════════
106
+ % III. DATASET AND EXPLORATORY DATA ANALYSIS
107
+ % ═══════════════════════════════════════════════════════════════════════════════
108
 
109
  \section{Dataset and Exploratory Data Analysis}
110
 
111
  \subsection{Dataset Description}
112
 
113
+ We use the European Cardholder Credit Card Fraud Detection dataset~\cite{dal2015credit}, containing 284,807 transactions made over two days in September 2013. The dataset includes 28 PCA-transformed features (V1--V28), the original \texttt{Time} and \texttt{Amount} features, and a binary \texttt{Class} label (0~=~legitimate, 1~=~fraud).
114
 
115
  \subsection{Class Distribution}
116
 
117
  The dataset exhibits extreme class imbalance with only 492 fraudulent transactions (0.173\%), yielding an imbalance ratio of approximately 1:577. This severe imbalance necessitates specialized handling during both training and evaluation.
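
As a quick check of these figures, the fraud rate and imbalance ratio follow directly from the counts. The legitimate-transaction count of 284,315 is derived here as $284{,}807 - 492$ rather than quoted from the text, and the ratio rounds to the approximate 1:577 stated above.
\begin{align*}
\text{fraud rate} &= \frac{492}{284{,}807} \approx 0.00173 = 0.173\%, \\
\text{imbalance ratio} &= \frac{284{,}807 - 492}{492} = \frac{284{,}315}{492} \approx 577.9 .
\end{align*}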
118
 
119
+ \begin{table}[!t]
120
  \centering
121
  \caption{Class Distribution in the Dataset}
122
  \label{tab:class_dist}
 
144
  \item \textbf{Feature Scale}: V1--V28 are PCA-transformed; only Time and Amount require normalization.
145
  \end{enumerate}
146
 
147
+ \begin{figure}[!t]
148
  \centering
149
+ \includegraphics[width=\columnwidth]{class_distribution.png}
150
  \caption{Class distribution showing extreme imbalance (0.173\% fraud rate).}
151
  \label{fig:class_dist}
152
  \end{figure}
153
 
154
+ % ═══════════════════════════════════════════════════════════════════════════════
155
+ % IV. METHODOLOGY
156
+ % ═══════════════════════════════════════════════════════════════════════════════
157
+
158
  \section{Methodology}
159
 
160
  \subsection{Feature Engineering}
 
193
 
194
  We compare two approaches for handling the 1:577 class imbalance:
195
 
196
+ \textbf{SMOTE}~\cite{chawla2002smote}: Applied exclusively to the training set after splitting, generating synthetic fraud samples to achieve a 1:2 minority-to-majority ratio.
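
The oversampling step can be illustrated with a minimal, dependency-free sketch of the SMOTE interpolation idea. It is not the implementation used in the experiments (which is not shown here). The function names, the toy two-dimensional data, and `k=5` are assumptions made for illustration. Only the training split is resampled, and the target is the 1:2 minority-to-majority ratio stated above.

```python
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


def n_synthetic_for_ratio(n_minority, n_majority, ratio=0.5):
    """Synthetic samples needed so that minority / majority equals ratio."""
    return max(0, int(ratio * n_majority) - n_minority)


# Toy TRAINING split only (hypothetical data); validation and test
# splits are left untouched to avoid leakage.
data_rng = random.Random(42)
train_majority = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
train_minority = [[data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)] for _ in range(10)]

n_new = n_synthetic_for_ratio(len(train_minority), len(train_majority))
synthetic = smote_oversample(train_minority, n_new)
train_minority_resampled = train_minority + synthetic
```

Because each synthetic point is a convex combination of two real minority points, it always lies on the segment between them. This is why resampling before the split would leak information about held-out fraud cases into training.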
197
 
198
  \textbf{Cost-Sensitive Learning}: Applying class weights inversely proportional to class frequency:
199
  \begin{equation}
 
247
 
248
  \subsection{Hyperparameter Optimization}
249
 
250
+ We use Optuna~\cite{akiba2019optuna} with Tree-structured Parzen Estimators (TPE) to tune the top three models, optimizing PR-AUC on the validation set:
251
 
252
  \begin{equation}
253
  \theta^* = \arg\max_{\theta} \text{PR-AUC}(f_\theta, \mathcal{D}_{val})
254
  \end{equation}
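
  A common estimator of the PR-AUC objective is average precision (AP), which weights the precision at each threshold by the corresponding increase in recall. Whether the reported PR-AUC uses this estimator or trapezoidal integration of the curve is an assumption of this sketch:
  \begin{equation*}
  \mathrm{AP} = \sum_{k \ge 1} \left( R_k - R_{k-1} \right) P_k ,
  \end{equation*}
  where $P_k$ and $R_k$ denote precision and recall at the $k$-th threshold when predictions are ranked by decreasing score, and $R_0 = 0$.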
255
 
256
+ % ═══════════════════════════════════════════════════════════════════════════════
257
+ % V. EXPERIMENTAL SETUP
258
+ % ═══════════════════════════════════════════════════════════════════════════════
259
+
260
  \section{Experimental Setup}
261
 
262
  \subsection{Environment}
 
274
  \item \textbf{MCC}: $\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
275
  \end{itemize}
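
  As a worked illustration of these definitions, the counts $TP = 57$, $FP = 6$, and $FN = 14$ reproduce the XGBoost test-set precision, recall, and F1-score reported in this paper at a threshold of 0.5. The counts are inferred from the reported metrics for illustration; they are not taken from a reported confusion matrix:
  \begin{align*}
  \text{Precision} &= \frac{57}{57 + 6} = \frac{57}{63} \approx 0.9048, \\
  \text{Recall} &= \frac{57}{57 + 14} = \frac{57}{71} \approx 0.8028, \\
  F_1 &= \frac{2 \cdot 57}{2 \cdot 57 + 6 + 14} = \frac{114}{134} \approx 0.8507 .
  \end{align*}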
276
 
277
+ % ═══════════════════════════════════════════════════════════════════════════════
278
+ % VI. RESULTS AND DISCUSSION
279
+ % ═══════════════════════════════════════════════════════════════════════════════
280
+
281
  \section{Results and Discussion}
282
 
283
  \subsection{Model Comparison}
284
 
285
+ \begin{table*}[!t]
286
  \centering
287
  \caption{Comprehensive Model Comparison on Test Set (Threshold = 0.5)}
288
  \label{tab:results}
 
307
 
308
  Key observations:
309
 
310
+ \textbf{Tree-based models dominate}: XGBoost, Random Forest, and LightGBM all outperform the neural network approaches, in line with the findings of Shwartz-Ziv and Armon~\cite{shwartz2022tabular}.
311
 
312
  \textbf{Class weight handling matters}: Logistic Regression achieves high recall (0.8873) but extremely low precision (0.0488), indicating that the linear decision boundary with class weights is too aggressive in flagging transactions.
313
 
314
  \textbf{Autoencoder limitations}: Although the autoencoder achieves perfect recall (1.0), its precision is extremely low (0.0033), so the overwhelming majority of its alerts are false positives. This suggests that the reconstruction-based approach, at its current anomaly threshold, is too sensitive for this PCA-transformed feature space.
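
  The practical meaning of these precision values can be read off from the identity relating precision to the number of false alarms raised per correctly detected fraud. The calculation below uses only the precision values quoted above:
  \begin{align*}
  \frac{FP}{TP} &= \frac{1}{\text{Precision}} - 1, \\
  \text{Logistic Regression:}\quad \frac{1}{0.0488} - 1 &\approx 19.5, \\
  \text{Autoencoder:}\quad \frac{1}{0.0033} - 1 &\approx 302 .
  \end{align*}
  In other words, Logistic Regression raises about 20 false alarms per detected fraud, and the autoencoder about 300.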
315
 
316
+ \begin{figure}[!t]
317
  \centering
318
+ \includegraphics[width=\columnwidth]{roc_curves.png}
319
  \caption{ROC curves for all models. XGBoost and Voting Ensemble achieve the highest AUC.}
320
  \label{fig:roc}
321
  \end{figure}
322
 
323
+ \begin{figure}[!t]
324
  \centering
325
+ \includegraphics[width=\columnwidth]{pr_curves.png}
326
  \caption{Precision-Recall curves. PR-AUC is the primary metric for imbalanced classification.}
327
  \label{fig:pr}
328
  \end{figure}
 
331
 
332
  The default threshold of 0.5 is not necessarily optimal under class imbalance. For XGBoost, sweeping the decision threshold shows that 0.55 maximizes the F1-score, raising it from 0.8507 to 0.8636:
333
 
334
+ \begin{table}[!t]
335
  \centering
336
  \caption{Threshold Sensitivity for XGBoost}
337
  \label{tab:threshold}
 
351
 
352
  \subsection{Business Impact}
353
 
354
+ \begin{table}[!t]
355
  \centering
356
  \caption{Business Impact Analysis (Test Set)}
357
  \label{tab:business}
 
363
  Ensemble & 6,966 & 1,711 & 6,921 \\
364
  RF (Tuned) & 6,722 & 1,955 & 6,682 \\
365
  LR & 7,699 & 978 & 1,554 \\
366
+ Autoencoder & 8,677 & 0 & $-$97,368 \\
367
  \bottomrule
368
  \end{tabular}
369
  \end{table}
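
  The rows of Table~\ref{tab:business} can be sanity-checked with simple arithmetic. Reading the three numeric columns as detected fraud value, missed fraud value, and net benefit (an interpretive assumption for this sketch), the first two columns sum to the same total in every listed row, and the gap between detected value and net benefit grows with the number of false alarms:
  \begin{align*}
  6{,}966 + 1{,}711 &= 7{,}699 + 978 = 8{,}677 + 0 = 8{,}677, \\
  \text{Ensemble:}\quad 6{,}966 - 6{,}921 &= 45, \\
  \text{LR:}\quad 7{,}699 - 1{,}554 &= 6{,}145, \\
  \text{Autoencoder:}\quad 8{,}677 - (-97{,}368) &= 106{,}045 .
  \end{align*}
  Each gap is a multiple of 5, which is consistent with a fixed cost per false alarm; this is an inference from the listed figures.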
 
374
 
375
  SHAP analysis reveals that V4 (mean $|\text{SHAP}| = 1.913$), V14 (1.843), and PCA\_magnitude (1.113) are the primary fraud discriminators. Because V1--V28 are anonymized PCA components, these importances identify directions in the transformed space that separate fraudulent from legitimate transactions, but they cannot be mapped back to named transaction attributes.
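
  For clarity, the global importance quoted for each feature is the mean absolute SHAP value over the explained samples, and the per-sample attributions satisfy the additivity property of SHAP~\cite{lundberg2017unified}. The symbols $N$ (number of explained samples), $M$ (number of features), and $\phi_0$ (base value) are notation introduced here; for gradient-boosted classifiers the model output $f$ is typically expressed in log-odds units, which is an assumption about the setup used:
  \begin{align*}
  \bar{\phi}_j &= \frac{1}{N} \sum_{i=1}^{N} \left| \phi_j(\mathbf{x}_i) \right|, \\
  f(\mathbf{x}_i) &= \phi_0 + \sum_{j=1}^{M} \phi_j(\mathbf{x}_i) .
  \end{align*}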
376
 
377
+ \begin{figure}[!t]
378
  \centering
379
+ \includegraphics[width=\columnwidth]{shap_summary.png}
380
  \caption{SHAP summary plot showing feature contributions to fraud predictions.}
381
  \label{fig:shap}
382
  \end{figure}
383
 
384
+ % ═══════════════════════════════════════════════════════════════════════════════
385
+ % VII. ERROR ANALYSIS
386
+ % ═══════════════════════════════════════════════════════════════════════════════
387
+
388
  \section{Error Analysis}
389
 
390
  \subsection{False Negative Analysis}
 
397
 
398
  \subsection{Concept Drift Assessment}
399
 
400
+ Comparing model confidence between early and late test periods reveals a drift indicator of $+0.115$, suggesting modest temporal variation. We recommend weekly monitoring with automated retraining triggers when PR-AUC drops below 0.70.
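
  One plausible formalization of this comparison, offered as an assumption rather than as the exact definition used, is the difference in mean predicted confidence $\hat{p}_i$ between the late and early portions of the test period, combined with the retraining rule stated above. The index sets $\mathcal{T}_{\mathrm{early}}$ and $\mathcal{T}_{\mathrm{late}}$ and the monitoring index $t$ are hypothetical notation:
  \begin{align*}
  \Delta &= \frac{1}{|\mathcal{T}_{\mathrm{late}}|} \sum_{i \in \mathcal{T}_{\mathrm{late}}} \hat{p}_i - \frac{1}{|\mathcal{T}_{\mathrm{early}}|} \sum_{i \in \mathcal{T}_{\mathrm{early}}} \hat{p}_i, \\
  \text{retrain at week } t \quad &\Longleftrightarrow \quad \text{PR-AUC}_t < 0.70 .
  \end{align*}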
401
+
402
+ % ═══════════════════════════════════════════════════════════════════════════════
403
+ % VIII. LIMITATIONS
404
+ % ═══════════════════════════════════════════════════════════════════════════════
405
 
406
  \section{Limitations}
407
 
 
413
  \item \textbf{Static Threshold}: The optimal threshold may shift as fraud patterns evolve; dynamic threshold adaptation is not implemented.
414
  \end{enumerate}
415
 
416
+ % ═══════════════════════════════════════════════════════════════════════════════
417
+ % IX. FUTURE WORK
418
+ % ═══════════════════════════════════════════════════════════════════════════════
419
+
420
  \section{Future Work}
421
 
422
  Several promising directions emerge from this research:
423
 
424
+ \textbf{Graph Neural Networks}: Modeling transaction networks as graphs could enable detection of fraud rings through coordinated behavioral patterns~\cite{liu2021graph}.
425
 
426
  \textbf{Real-Time Streaming}: Integration with Apache Kafka and Apache Flink for millisecond-latency processing of transaction streams at scale.
427
 
428
+ \textbf{Federated Learning}: Training across multiple banks without sharing raw transaction data, preserving privacy while improving generalization~\cite{yang2019federated}.
429
 
430
  \textbf{LLM-Generated Explanations}: Using large language models to generate natural-language compliance explanations for flagged transactions, facilitating human review.
431
 
 
433
 
434
  \textbf{Adversarial Robustness}: Training models that are robust to adversarial perturbations designed to evade detection.
435
 
436
+ % ═══════════════════════════════════════════════════════════════════════════════
437
+ % X. CONCLUSION
438
+ % ═══════════════════════════════════════════════════════════════════════════════
439
+
440
  \section{Conclusion}
441
 
442
  This paper presents a comprehensive fraud detection framework that systematically evaluates seven machine learning approaches on the benchmark European Cardholder dataset. Our results demonstrate that XGBoost achieves the best overall performance (PR-AUC: 0.8166, F1: 0.8507) through cost-sensitive learning with optimized class weights. Threshold optimization from 0.5 to 0.55 further improves F1 to 0.8636. The framework includes complete explainability through SHAP and LIME, production deployment via FastAPI with sub-10ms latency, and automated drift monitoring. Our analysis confirms that tree-based ensemble methods remain the most effective approach for tabular fraud detection, while highlighting the importance of proper class imbalance handling, threshold optimization, and the inadequacy of accuracy as a metric for imbalanced classification.
443
 
444
+ All code, models, and results are publicly available.
445
+
446
+ % ═══════════════════════════════════════════════════════════════════════════════
447
+ % REFERENCES
448
+ % ═══════════════════════════════════════════════════════════════════════════════
449
+
450
+ \balance
451
  \bibliographystyle{IEEEtran}
452
 
453
  \begin{thebibliography}{99}
454
 
455
  \bibitem{dal2015credit}
456
+ A.~Dal~Pozzolo, O.~Caelen, R.~A.~Johnson, and G.~Bontempi, ``Calibrating probability with undersampling for unbalanced classification,'' in \textit{Proc. IEEE Symp. Comput. Intell. Data Mining (CIDM)}, 2015, pp.~159--166.
457
 
458
  \bibitem{nilson2022}
459
  Nilson Report, ``Global card fraud losses,'' \textit{Nilson Report}, Issue 1209, 2022.
460
 
461
  \bibitem{pozzolo2015calibrating}
462
+ A.~Dal~Pozzolo, O.~Caelen, and G.~Bontempi, ``When is undersampling effective in unbalanced classification tasks?'' in \textit{Proc. European Conf. Machine Learning and Knowledge Discovery in Databases}, 2015, pp.~200--215.
463
 
464
  \bibitem{saito2015precision}
465
+ T.~Saito and M.~Rehmsmeier, ``The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets,'' \textit{PLoS ONE}, vol.~10, no.~3, 2015.
466
 
467
  \bibitem{bolton2002statistical}
468
+ R.~J.~Bolton and D.~J.~Hand, ``Statistical fraud detection: A review,'' \textit{Statistical Science}, vol.~17, no.~3, pp.~235--255, 2002.
469
 
470
  \bibitem{zhang2021fraud}
471
+ Z.~Zhang, X.~Zhou, X.~Zhang, L.~Wang, and P.~Wang, ``A model based on convolutional recurrent neural network for fraud detection in credit card,'' \textit{Complexity}, vol.~2021, pp.~1--9, 2021.
472
 
473
  \bibitem{shwartz2022tabular}
474
+ R.~Shwartz-Ziv and A.~Armon, ``Tabular data: Deep learning is not all you need,'' \textit{Information Fusion}, vol.~81, pp.~84--90, 2022.
475
 
476
  \bibitem{chawla2002smote}
477
+ N.~V.~Chawla, K.~W.~Bowyer, L.~O.~Hall, and W.~P.~Kegelmeyer, ``SMOTE: Synthetic Minority Over-sampling Technique,'' \textit{J. Artificial Intelligence Research}, vol.~16, pp.~321--357, 2002.
478
 
479
  \bibitem{fernandez2018smote}
480
+ A.~Fernandez, S.~Garcia, M.~Galar, R.~C.~Prati, B.~Krawczyk, and F.~Herrera, \textit{Learning from Imbalanced Data Sets}.\ \ Springer, 2018.
481
 
482
  \bibitem{xuan2018random}
483
+ S.~Xuan, G.~Liu, Z.~Li, L.~Zheng, S.~Wang, and C.~Jiang, ``Random forest for credit card fraud detection,'' in \textit{Proc. IEEE 15th Intl. Conf. Networking, Sensing and Control (ICNSC)}, 2018, pp.~1--6.
484
 
485
  \bibitem{chen2016xgboost}
486
+ T.~Chen and C.~Guestrin, ``XGBoost: A scalable tree boosting system,'' in \textit{Proc. 22nd ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining}, 2016, pp.~785--794.
487
 
488
  \bibitem{taha2020detection}
489
+ A.~A.~Taha and S.~J.~Malebary, ``An intelligent approach to credit card fraud detection using an optimized light gradient boosting machine,'' \textit{IEEE Access}, vol.~8, pp.~25579--25587, 2020.
490
 
491
  \bibitem{ke2017lightgbm}
492
+ G.~Ke, Q.~Meng, T.~Finley, T.~Wang, W.~Chen, W.~Ma, Q.~Ye, and T.-Y.~Liu, ``LightGBM: A highly efficient gradient boosting decision tree,'' in \textit{Advances in Neural Information Processing Systems}, vol.~30, 2017.
493
 
494
  \bibitem{prokhorenkova2018catboost}
495
+ L.~Prokhorenkova, G.~Gusev, A.~Vorobev, A.~V.~Dorogush, and A.~Gulin, ``CatBoost: Unbiased boosting with categorical features,'' in \textit{Advances in Neural Information Processing Systems}, vol.~31, 2018.
496
 
497
  \bibitem{pumsirirat2018credit}
498
+ A.~Pumsirirat and L.~Yan, ``Credit card fraud detection using deep learning based on auto-encoder and restricted Boltzmann machine,'' \textit{Intl. J. Advanced Computer Science and Applications}, vol.~9, no.~1, 2018.
499
 
500
  \bibitem{lundberg2017unified}
501
+ S.~M.~Lundberg and S.-I.~Lee, ``A unified approach to interpreting model predictions,'' in \textit{Advances in Neural Information Processing Systems}, vol.~30, 2017.
502
 
503
  \bibitem{ribeiro2016lime}
504
+ M.~T.~Ribeiro, S.~Singh, and C.~Guestrin, ``Why should I trust you?: Explaining the predictions of any classifier,'' in \textit{Proc. 22nd ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining}, 2016, pp.~1135--1144.
505
 
506
  \bibitem{belle2021principles}
507
+ V.~Belle and I.~Papantonis, ``Principles and practice of explainable machine learning,'' \textit{Frontiers in Big Data}, vol.~4, 2021.
508
 
509
  \bibitem{akiba2019optuna}
510
+ T.~Akiba, S.~Sano, T.~Yanase, T.~Ohta, and M.~Koyama, ``Optuna: A next-generation hyperparameter optimization framework,'' in \textit{Proc. 25th ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining}, 2019, pp.~2623--2631.
511
 
512
  \bibitem{grinsztajn2022tree}
513
+ L.~Grinsztajn, E.~Oyallon, and G.~Varoquaux, ``Why do tree-based models still outperform deep learning on tabular data?'' in \textit{Advances in Neural Information Processing Systems}, vol.~35, 2022.
514
 
515
  \bibitem{liu2021graph}
516
+ Y.~Liu, M.~Ao, C.~Chi, F.~Feng, D.~Yang, and J.~He, ``Pick and choose: A GNN-based imbalanced learning approach for fraud detection,'' in \textit{Proc. Web Conf.}, 2021, pp.~3168--3177.
517
 
518
  \bibitem{yang2019federated}
519
+ Q.~Yang, Y.~Liu, T.~Chen, and Y.~Tong, ``Federated machine learning: Concept and applications,'' \textit{ACM Trans. Intelligent Systems and Technology}, vol.~10, no.~2, pp.~1--19, 2019.
520
 
521
  \end{thebibliography}
522