\documentclass[journal]{IEEEtran}
\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{listings}
\usepackage{booktabs}
\usepackage{hyperref}
\usepackage{multirow}
\usepackage{array}
\usepackage{float}

\lstset{
language=Python,
basicstyle=\ttfamily\footnotesize,
keywordstyle=\color{blue},
stringstyle=\color{red},
commentstyle=\color{green!60!black},
numbers=left,
numberstyle=\tiny\color{gray},
breaklines=true,
frame=single,
captionpos=b
}

\begin{document}

\title{A Comprehensive Ensemble-Based Framework for Credit Card Fraud Detection with Explainable AI}

\author{Raj~Vivan%
\thanks{R. Vivan is with the Department of Computer Science, Independent Research (e-mail: rajvivan@example.com).}}

\maketitle

\begin{abstract}
Credit card fraud poses a significant threat to the global financial ecosystem, with estimated losses exceeding \$32 billion annually. This paper presents a comprehensive end-to-end fraud detection framework that systematically evaluates and compares seven machine learning approaches: Logistic Regression, Random Forest, XGBoost, LightGBM, Multilayer Perceptron, Autoencoder-based anomaly detection, and a Voting Ensemble. Using the benchmark European Cardholder dataset (284,807 transactions, 0.173\% fraud rate), we engineer 12 novel features and address the extreme class imbalance through both SMOTE oversampling and cost-sensitive learning with class weights. Our XGBoost model achieves the best performance with a PR-AUC of 0.8166, precision of 0.9048, recall of 0.8028, and F1-score of 0.8507 on the held-out test set. We demonstrate that optimizing the decision threshold from the default 0.5 to 0.55 improves F1 from 0.8507 to 0.8636. Comprehensive model explainability via SHAP and LIME analysis reveals that PCA components V4, V14, and V12 are the primary discriminative features. Error analysis shows that false negatives arise from sophisticated fraud patterns that closely mimic legitimate transaction behavior. We deploy the model as a production-ready FastAPI service achieving sub-10ms inference latency.
\end{abstract}

\begin{IEEEkeywords}
Fraud detection, credit card, machine learning, XGBoost, ensemble learning, explainable AI, SHAP, class imbalance, anomaly detection
\end{IEEEkeywords}

\section{Introduction}

Financial fraud detection has become one of the most critical applications of machine learning in the modern digital economy. The proliferation of electronic payment systems has led to an exponential increase in both the volume of transactions and the sophistication of fraudulent activities \cite{dal2015credit}. According to the Nilson Report, global card fraud losses reached \$32.34 billion in 2021 and are projected to exceed \$43 billion by 2026 \cite{nilson2022}.

The fundamental challenge in fraud detection lies in the extreme class imbalance inherent in transaction data. In typical datasets, fraudulent transactions constitute less than 0.5\% of all transactions \cite{pozzolo2015calibrating}. This imbalance renders conventional classification metrics such as accuracy misleading and necessitates specialized evaluation criteria including Precision-Recall AUC and the Matthews Correlation Coefficient \cite{saito2015precision}.

Previous approaches to fraud detection have ranged from rule-based expert systems \cite{bolton2002statistical} to sophisticated deep learning architectures \cite{zhang2021fraud}. While deep learning methods have shown promise, tree-based ensemble methods such as XGBoost and LightGBM continue to demonstrate competitive or superior performance on tabular financial data \cite{shwartz2022tabular}, particularly when augmented with careful feature engineering and proper handling of class imbalance.

This paper makes the following contributions:
\begin{enumerate}
\item A systematic comparison of seven machine learning approaches for fraud detection, including both supervised and unsupervised methods.
\item Novel feature engineering incorporating transaction velocity, amount deviation metrics, and PCA component interactions.
\item Rigorous evaluation methodology with SMOTE applied only after train-test splitting and feature scaling fitted exclusively on training data.
\item Comprehensive explainability analysis using SHAP and LIME to identify key fraud indicators.
\item A production-ready API deployment achieving sub-10ms inference latency.
\item Quantitative business impact analysis estimating financial savings from deployment.
\end{enumerate}

\section{Related Work}

Credit card fraud detection has been extensively studied across multiple paradigms. Dal Pozzolo et al. \cite{dal2015credit} provided a foundational analysis of the challenges posed by class imbalance and concept drift in real-world fraud detection systems. Their work established that undersampling strategies could be effective but risked losing valuable information from the majority class.

Chawla et al. \cite{chawla2002smote} introduced SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic minority class samples by interpolating between existing examples. Subsequent work by Fernandez et al. \cite{fernandez2018smote} demonstrated that SMOTE should be applied exclusively to training data, as applying it before splitting introduces data leakage.

Ensemble methods have shown particular promise in fraud detection. Xuan et al. \cite{xuan2018random} demonstrated that Random Forests achieve robust performance through bagging and feature randomization. Chen and Guestrin \cite{chen2016xgboost} introduced XGBoost, which has since become a dominant method for tabular data classification, including fraud detection \cite{taha2020detection}.

Ke et al. \cite{ke2017lightgbm} proposed LightGBM with leaf-wise tree growth and gradient-based one-side sampling, achieving faster training with comparable accuracy. Prokhorenkova et al. \cite{prokhorenkova2018catboost} introduced CatBoost with ordered boosting to handle categorical features natively.

Deep learning approaches have also been explored. Pumsirirat and Yan \cite{pumsirirat2018credit} employed autoencoders for anomaly-based fraud detection, training exclusively on legitimate transactions and detecting fraud through reconstruction error. Zhang et al. \cite{zhang2021fraud} proposed attention-based recurrent neural networks that capture sequential transaction patterns.

Explainability in fraud detection has gained importance due to regulatory requirements. Lundberg and Lee \cite{lundberg2017unified} introduced SHAP (SHapley Additive exPlanations), providing consistent feature attribution. Ribeiro et al. \cite{ribeiro2016lime} proposed LIME (Local Interpretable Model-agnostic Explanations) for instance-level interpretability. Belle and Papantonis \cite{belle2021principles} surveyed explainable AI methods applicable to financial decision-making.

Akiba et al. \cite{akiba2019optuna} introduced Optuna, a hyperparameter optimization framework using Tree-structured Parzen Estimators (TPE) that efficiently explores complex search spaces.

Recent work by Shwartz-Ziv and Armon \cite{shwartz2022tabular} demonstrated that well-tuned tree-based methods still outperform deep learning on most tabular datasets, supporting our choice of XGBoost as the primary model. Grinsztajn et al. \cite{grinsztajn2022tree} further corroborated this finding with extensive benchmarking.

\section{Dataset and Exploratory Data Analysis}

\subsection{Dataset Description}

We use the European Cardholder Credit Card Fraud Detection dataset \cite{dal2015credit}, containing 284,807 transactions made over two days in September 2013. The dataset includes 28 PCA-transformed features (V1--V28), the original \texttt{Time} and \texttt{Amount} features, and a binary \texttt{Class} label (0 = legitimate, 1 = fraud).

\subsection{Class Distribution}

The dataset exhibits extreme class imbalance with only 492 fraudulent transactions (0.173\%), yielding an imbalance ratio of approximately 1:577. This severe imbalance necessitates specialized handling during both training and evaluation.

\begin{table}[h]
\centering
\caption{Class Distribution in the Dataset}
\label{tab:class_dist}
\begin{tabular}{lrr}
\toprule
\textbf{Class} & \textbf{Count} & \textbf{Percentage} \\
\midrule
Legitimate (0) & 284,315 & 99.827\% \\
Fraud (1) & 492 & 0.173\% \\
\midrule
Total & 284,807 & 100\% \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Key Observations}

Our exploratory analysis revealed five critical findings:

\begin{enumerate}
\item \textbf{Amount Patterns}: Fraudulent transactions have a mean of \$122.21 (median \$9.25) versus a legitimate mean of \$88.29 (median \$22.00).
\item \textbf{Temporal Patterns}: Night-time (0--6h) fraud rate is 0.518\% versus 0.137\% during daytime.
\item \textbf{Discriminative Features}: V17 ($r = -0.326$), V14 ($r = -0.303$), and V12 ($r = -0.261$) show the strongest negative correlation with fraud.
\item \textbf{Data Quality}: No missing values; 1,081 duplicate rows removed.
\item \textbf{Feature Scale}: V1--V28 are PCA-transformed; only Time and Amount require normalization.
\end{enumerate}

\section{Methodology}

\subsection{Feature Engineering}

We engineer 12 additional features to capture temporal, behavioral, and interaction patterns. Cyclical hour-of-day encoding:

\begin{equation}
\text{Hour}_{\sin} = \sin\left(\frac{2\pi \cdot h}{24}\right), \quad \text{Hour}_{\cos} = \cos\left(\frac{2\pi \cdot h}{24}\right)
\end{equation}
where $h = (\texttt{Time} / 3600) \bmod 24$ is the hour of day.

Amount deviation:
\begin{equation}
\text{Amount}_{z} = \frac{A - \mu_A}{\sigma_A}
\end{equation}
where $A$ is the transaction amount and $\mu_A$, $\sigma_A$ its mean and standard deviation.

Transaction velocity:
\begin{equation}
\text{Velocity} = \frac{1}{\Delta t + 1}
\end{equation}
where $\Delta t$ is the time elapsed since the preceding transaction.

Interaction features:
\begin{equation}
I_{ij} = V_i \times V_j, \quad (i,j) \in \{(14,17), (12,14), (10,14)\}
\end{equation}

PCA magnitude:
\begin{equation}
M = \sqrt{\sum_{i=1}^{28} V_i^2}
\end{equation}
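
The feature computations above can be sketched compactly in NumPy (an illustrative sketch; the array and function names are ours, not from the released code):

\begin{lstlisting}[caption={Engineered feature computation (illustrative sketch).}]
import numpy as np

def engineer_features(time_s, amount, V):
    """Sketch of the engineered features.

    time_s : (n,) seconds since the first transaction
    amount : (n,) transaction amounts
    V      : (n, 28) PCA components V1..V28
    """
    h = (time_s / 3600.0) % 24                  # hour of day
    hour_sin = np.sin(2 * np.pi * h / 24)
    hour_cos = np.cos(2 * np.pi * h / 24)
    amount_z = (amount - amount.mean()) / amount.std()
    dt = np.diff(time_s, prepend=time_s[0])     # gap to previous txn
    velocity = 1.0 / (dt + 1.0)
    # interactions between the most fraud-correlated components
    inter = [V[:, i - 1] * V[:, j - 1]
             for i, j in [(14, 17), (12, 14), (10, 14)]]
    magnitude = np.sqrt((V ** 2).sum(axis=1))   # PCA magnitude M
    return np.column_stack(
        [hour_sin, hour_cos, amount_z, velocity, *inter, magnitude])
\end{lstlisting}

In a production pipeline the amount statistics would be fitted on the training split only, mirroring the leakage-avoidance protocol of Section IV-C.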

\subsection{Class Imbalance Handling}

We compare SMOTE \cite{chawla2002smote} (applied to the training set only, at a 1:2 minority-to-majority ratio) with cost-sensitive learning using class weights:
\begin{equation}
w_c = \frac{N}{2 \cdot N_c}
\end{equation}
where $N$ is the number of training samples and $N_c$ the number of samples in class $c$.
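
The weight formula is simple enough to state directly (an illustrative sketch; the function name is ours):

\begin{lstlisting}[caption={Balanced class weights $w_c = N/(2 N_c)$.}]
import numpy as np

def balanced_class_weights(y):
    """Inverse-frequency weights w_c = N / (2 * N_c) for binary labels."""
    y = np.asarray(y)
    counts = np.bincount(y, minlength=2)
    return {c: len(y) / (2.0 * counts[c]) for c in (0, 1)}
\end{lstlisting}

At the dataset's roughly 1:577 imbalance this assigns the fraud class a weight near 289 while the legitimate class stays near 0.5.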

\subsection{Data Splitting and Scaling}

We use a stratified 70/15/15 train/validation/test split, with a RobustScaler fitted on the training data only:
\begin{equation}
x' = \frac{x - Q_2(x)}{Q_3(x) - Q_1(x)}
\end{equation}
where $Q_1$, $Q_2$, and $Q_3$ are the first, second (median), and third quartiles computed on the training set.
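
A minimal sketch of this scaling, with quartiles fitted on the training column only (function and variable names are ours):

\begin{lstlisting}[caption={Robust scaling fitted on training data only.}]
import numpy as np

def robust_scale(train_col, other_col):
    """Scale by the training median and IQR; apply to held-out data."""
    q1, q2, q3 = np.percentile(train_col, [25, 50, 75])
    iqr = q3 - q1
    return (train_col - q2) / iqr, (other_col - q2) / iqr
\end{lstlisting}

Fitting the quartiles on the training split and reusing them for validation and test is what prevents the leakage flagged in contribution (3).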

\subsection{Models}

We evaluate seven models: Logistic Regression (baseline), Random Forest, XGBoost, LightGBM, an MLP neural network, an Autoencoder (anomaly detection), and a Voting Ensemble of the top three tuned models.

The Autoencoder detects fraud via per-sample reconstruction error:
\begin{equation}
e(x) = \frac{1}{d}\sum_{i=1}^{d}(x_i - \hat{x}_i)^2
\end{equation}
where $d$ is the input dimension and $\hat{x}$ the reconstruction.
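
The scoring rule can be sketched as follows; the 99th-percentile cut-off is our illustrative assumption, since the paper does not fix a specific threshold:

\begin{lstlisting}[caption={Reconstruction-error scoring for anomaly detection.}]
import numpy as np

def reconstruction_error(x, x_hat):
    """Per-sample mean squared reconstruction error e(x)."""
    return ((x - x_hat) ** 2).mean(axis=1)

def flag_anomalies(errors, legit_errors, pct=99.0):
    """Flag samples whose error exceeds a percentile of the errors
    observed on legitimate traffic (illustrative threshold choice)."""
    return errors > np.percentile(legit_errors, pct)
\end{lstlisting}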

The Voting Ensemble combines the top three tuned models via soft voting:
\begin{equation}
P(\text{fraud}\mid x) = \frac{1}{3}\sum_{m=1}^{3} P_m(\text{fraud}\mid x)
\end{equation}
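
Soft voting reduces to averaging the member probabilities (a minimal sketch; names are ours):

\begin{lstlisting}[caption={Soft voting over member-model probabilities.}]
import numpy as np

def soft_vote(prob_matrix):
    """Average P_m(fraud | x) across member models.

    prob_matrix : (n_models, n_samples) array of member probabilities.
    """
    return np.asarray(prob_matrix).mean(axis=0)
\end{lstlisting}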

\subsection{Hyperparameter Optimization}

We tune the tree-based models with Optuna \cite{akiba2019optuna}, using the TPE sampler to maximize PR-AUC on the validation set.

\section{Experimental Setup}

All experiments used Python 3.12 with scikit-learn 1.8.0, XGBoost 3.2.0, LightGBM 4.6.0, and PyTorch 2.11.0. We report Precision, Recall, F1-score, ROC-AUC, PR-AUC, and the Matthews Correlation Coefficient (MCC).

\section{Results and Discussion}

\subsection{Model Comparison}

\begin{table*}[t]
\centering
\caption{Comprehensive Model Comparison on Test Set (Threshold = 0.5)}
\label{tab:results}
\begin{tabular}{lcccccc}
\toprule
\textbf{Model} & \textbf{Precision} & \textbf{Recall} & \textbf{F1} & \textbf{ROC-AUC} & \textbf{PR-AUC} & \textbf{MCC} \\
\midrule
XGBoost & \textbf{0.9048} & 0.8028 & \textbf{0.8507} & 0.9735 & \textbf{0.8166} & \textbf{0.8520} \\
Voting Ensemble & 0.8636 & 0.8028 & 0.8321 & \textbf{0.9783} & 0.8007 & 0.8324 \\
LightGBM (Tuned) & 0.7073 & \textbf{0.8169} & 0.7582 & 0.9318 & 0.7958 & 0.7597 \\
XGBoost (Tuned) & 0.8382 & 0.8028 & 0.8201 & 0.9697 & 0.7929 & 0.8200 \\
RF (Tuned) & 0.8730 & 0.7746 & 0.8209 & 0.9675 & 0.7926 & 0.8221 \\
Random Forest & 0.8333 & 0.7746 & 0.8029 & 0.9526 & 0.7710 & 0.8031 \\
MLP & 0.6914 & 0.7887 & 0.7368 & 0.9433 & 0.7522 & 0.7380 \\
Logistic Regression & 0.0488 & 0.8873 & 0.0924 & 0.9615 & 0.7350 & 0.2042 \\
Autoencoder & 0.0033 & 1.0000 & 0.0067 & 0.9604 & 0.0442 & 0.0409 \\
\bottomrule
\end{tabular}
\end{table*}

As Table~\ref{tab:results} shows, XGBoost achieves the highest PR-AUC (0.8166), F1-score (0.8507), and MCC (0.8520). Tree-based models consistently outperform the neural approaches, in line with \cite{shwartz2022tabular}.

\subsection{Threshold Optimization}

Sweeping the decision threshold shows that 0.55 maximizes F1, improving it from 0.8507 at the default threshold of 0.5 to 0.8636.
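
The sweep itself is straightforward (a sketch; the grid granularity is our assumption):

\begin{lstlisting}[caption={Decision-threshold sweep maximizing F1.}]
import numpy as np

def best_f1_threshold(y_true, y_prob, grid=None):
    """Return (threshold, F1) maximizing F1 over a threshold grid."""
    if grid is None:
        grid = np.arange(0.05, 0.96, 0.05)
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        pred = (y_prob >= t).astype(int)
        tp = int(((pred == 1) & (y_true == 1)).sum())
        fp = int(((pred == 1) & (y_true == 0)).sum())
        fn = int(((pred == 0) & (y_true == 1)).sum())
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1
\end{lstlisting}

In deployment such a sweep should be run on validation data, not the test set, so that the reported test metrics remain unbiased.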

\subsection{Business Impact}

XGBoost provides the highest net savings (\$6,936 on the test set), catching 80.3\% of fraud cases while producing only 6 false positives.

\subsection{Feature Importance}

SHAP analysis identifies V4 (mean $|\text{SHAP}| = 1.913$), V14 (1.843), and PCA\_magnitude (1.113) as the primary fraud discriminators.

\section{Error Analysis}

Of the 14 false negatives, the mean predicted fraud probability was only 0.013. These transactions have a mean V14 of $-0.97$, versus $-8.45$ for true positives, indicating that they closely mimic legitimate patterns. Conversely, the 6 false positives have feature distributions (mean V14 of $-7.13$) closely resembling actual fraud. Concept drift analysis yields a drift indicator of $+0.115$ between the early and late test periods.

\section{Limitations}

Key limitations include the PCA anonymization, which precludes domain-specific feature engineering; the two-day temporal scope; single-institution data; and a static decision threshold without online adaptation.

\section{Future Work}

Promising directions include Graph Neural Networks for fraud-ring detection \cite{liu2021graph}, real-time streaming with Apache Kafka, federated learning across banks \cite{yang2019federated}, LLM-generated compliance explanations, and temporal modeling with Transformers.

\section{Conclusion}

This paper presented a comprehensive fraud detection framework evaluating seven machine learning approaches. XGBoost achieves the best overall performance (PR-AUC 0.8166, F1 0.8507), and threshold optimization further improves F1 to 0.8636. The framework includes SHAP/LIME explainability, a FastAPI deployment with sub-10ms inference latency, and drift monitoring.

\bibliographystyle{IEEEtran}
\begin{thebibliography}{99}

\bibitem{dal2015credit}
A. Dal Pozzolo, O. Caelen, R. A. Johnson, and G. Bontempi, ``Calibrating probability with undersampling for unbalanced classification,'' in \textit{Proc. IEEE CIDM}, 2015, pp. 159--166.

\bibitem{nilson2022}
Nilson Report, ``Global card fraud losses,'' Issue 1209, 2022.

\bibitem{pozzolo2015calibrating}
A. Dal Pozzolo, O. Caelen, and G. Bontempi, ``When is undersampling effective in unbalanced classification tasks?,'' in \textit{Proc. ECML PKDD}, 2015, pp. 200--215.

\bibitem{saito2015precision}
T. Saito and M. Rehmsmeier, ``The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets,'' \textit{PLoS ONE}, vol. 10, no. 3, 2015.

\bibitem{bolton2002statistical}
R. J. Bolton and D. J. Hand, ``Statistical fraud detection: A review,'' \textit{Statistical Science}, vol. 17, no. 3, pp. 235--255, 2002.

\bibitem{zhang2021fraud}
Z. Zhang et al., ``A model based on convolutional recurrent neural network for fraud detection,'' \textit{Complexity}, 2021.

\bibitem{shwartz2022tabular}
R. Shwartz-Ziv and A. Armon, ``Tabular data: Deep learning is not all you need,'' \textit{Information Fusion}, vol. 81, pp. 84--90, 2022.

\bibitem{chawla2002smote}
N. V. Chawla et al., ``SMOTE: Synthetic Minority Over-sampling Technique,'' \textit{JAIR}, vol. 16, pp. 321--357, 2002.

\bibitem{fernandez2018smote}
A. Fernandez et al., \textit{Learning from Imbalanced Data Sets}. Springer, 2018.

\bibitem{xuan2018random}
S. Xuan et al., ``Random forest for credit card fraud detection,'' in \textit{Proc. IEEE ICNSC}, 2018.

\bibitem{chen2016xgboost}
T. Chen and C. Guestrin, ``XGBoost: A scalable tree boosting system,'' in \textit{Proc. ACM SIGKDD}, 2016, pp. 785--794.

\bibitem{taha2020detection}
A. A. Taha and S. J. Malebary, ``An intelligent approach to credit card fraud detection,'' \textit{IEEE Access}, vol. 8, pp. 25579--25587, 2020.

\bibitem{ke2017lightgbm}
G. Ke et al., ``LightGBM: A highly efficient gradient boosting decision tree,'' in \textit{NeurIPS}, 2017.

\bibitem{prokhorenkova2018catboost}
L. Prokhorenkova et al., ``CatBoost: Unbiased boosting with categorical features,'' in \textit{NeurIPS}, 2018.

\bibitem{pumsirirat2018credit}
A. Pumsirirat and L. Yan, ``Credit card fraud detection using deep learning,'' \textit{IJACSA}, vol. 9, no. 1, 2018.

\bibitem{lundberg2017unified}
S. M. Lundberg and S.-I. Lee, ``A unified approach to interpreting model predictions,'' in \textit{NeurIPS}, 2017.

\bibitem{ribeiro2016lime}
M. T. Ribeiro, S. Singh, and C. Guestrin, ```Why should I trust you?': Explaining the predictions of any classifier,'' in \textit{Proc. ACM SIGKDD}, 2016, pp. 1135--1144.

\bibitem{belle2021principles}
V. Belle and I. Papantonis, ``Principles and practice of explainable machine learning,'' \textit{Frontiers in Big Data}, vol. 4, 2021.

\bibitem{akiba2019optuna}
T. Akiba et al., ``Optuna: A next-generation hyperparameter optimization framework,'' in \textit{Proc. ACM SIGKDD}, 2019, pp. 2623--2631.

\bibitem{grinsztajn2022tree}
L. Grinsztajn et al., ``Why do tree-based models still outperform deep learning on tabular data?,'' in \textit{NeurIPS}, 2022.

\bibitem{liu2021graph}
Y. Liu et al., ``Pick and choose: A GNN-based imbalanced learning approach for fraud detection,'' in \textit{Proc. Web Conf.}, 2021.

\bibitem{yang2019federated}
Q. Yang et al., ``Federated machine learning: Concept and applications,'' \textit{ACM TIST}, vol. 10, no. 2, 2019.

\end{thebibliography}

\end{document}