| # Fraud Detection System for Financial Transactions |
|
|
| A comprehensive end-to-end fraud detection system using machine learning, featuring 10 models, explainability analysis, and a production-ready API. |
|
|
| ## Results Summary |
|
|
| | Model | Precision | Recall | F1 | ROC-AUC | PR-AUC | MCC | |
| |---|---|---|---|---|---|---| |
| | **XGBoost** ⭐ | **0.9048** | 0.8028 | **0.8507** | 0.9735 | **0.8166** | **0.8520** | |
| | Voting Ensemble | 0.8636 | 0.8028 | 0.8321 | **0.9783** | 0.8007 | 0.8324 | |
| | LightGBM (Tuned) | 0.7073 | **0.8169** | 0.7582 | 0.9318 | 0.7958 | 0.7597 | |
| | XGBoost (Tuned) | 0.8382 | 0.8028 | 0.8201 | 0.9697 | 0.7929 | 0.8200 | |
| | RF (Tuned) | 0.8730 | 0.7746 | 0.8209 | 0.9675 | 0.7926 | 0.8221 | |
| | Random Forest | 0.8333 | 0.7746 | 0.8029 | 0.9526 | 0.7710 | 0.8031 | |
| | MLP | 0.6914 | 0.7887 | 0.7368 | 0.9433 | 0.7522 | 0.7380 | |
| | Logistic Regression | 0.0488 | 0.8873 | 0.0924 | 0.9615 | 0.7350 | 0.2042 | |
| | Autoencoder | 0.0033 | 1.0000 | 0.0067 | 0.9604 | 0.0442 | 0.0409 | |
|
|
| **Best Model: XGBoost** (PR-AUC: 0.8166, F1: 0.8507 at the default threshold; F1 rises to 0.8636 at threshold = 0.55) |
|
|
| ## System Architecture |
|
|
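| The architecture diagram is stored as `figures/architecture_diagram.*` and is generated by `architecture.py`. |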
|
|
| ## Project Structure |
|
|
| ``` |
| fraud_detection/ |
| ├── config.py                      # Configuration settings |
| ├── eda.py                         # Exploratory Data Analysis |
| ├── preprocessing.py               # Feature engineering & splitting |
| ├── train_all.py                   # Model training pipeline |
| ├── evaluation.py                  # Comprehensive evaluation |
| ├── explainability.py              # SHAP & LIME analysis |
| ├── error_analysis.py              # FN/FP & drift analysis |
| ├── ae_model.py                    # Autoencoder model classes |
| ├── architecture.py                # Architecture diagram generator |
| ├── generate_pdf.py                # PDF paper generator |
| ├── requirements.txt               # Python dependencies |
| ├── api/ |
| │   └── app.py                     # FastAPI production endpoint |
| ├── models/ |
| │   ├── all_models.joblib          # All trained models |
| │   ├── all_models_with_ae.joblib |
| │   ├── autoencoder.pt             # PyTorch autoencoder weights |
| │   ├── scaler.joblib              # Fitted RobustScaler |
| │   └── tuning_results.joblib      # Optuna best params |
| ├── figures/                       # All figures (PNG + PDF, 300 DPI) |
| │   ├── class_distribution.* |
| │   ├── amount_analysis.* |
| │   ├── time_analysis.* |
| │   ├── correlation_heatmap.* |
| │   ├── feature_distributions.* |
| │   ├── roc_curves.* |
| │   ├── pr_curves.* |
| │   ├── confusion_matrices.* |
| │   ├── threshold_analysis.* |
| │   ├── feature_importance.* |
| │   ├── shap_summary.* |
| │   ├── shap_top10.* |
| │   ├── lime_explanation.* |
| │   ├── error_analysis.* |
| │   ├── architecture_diagram.* |
| │   ├── model_comparison.csv |
| │   ├── business_impact.csv |
| │   └── shap_feature_importance.csv |
| ├── paper/ |
| │   ├── fraud_detection_paper.tex  # IEEE LaTeX source |
| │   └── fraud_detection_paper.pdf  # Compiled PDF |
| └── data/ |
|     ├── creditcard.csv             # Raw dataset |
|     ├── processed_data.joblib      # Preprocessed data |
|     └── evaluation_results.joblib  # Evaluation results |
| ``` |
|
|
| ## Quick Start |
|
|
| ### Installation |
| ```bash |
| pip install -r requirements.txt |
| ``` |
|
|
| ### Run Full Pipeline |
| ```bash |
| # 1. EDA |
| python eda.py |
| |
| # 2. Preprocessing |
| python preprocessing.py |
| |
| # 3. Training |
| python train_all.py |
| |
| # 4. Evaluation |
| python evaluation.py |
| |
| # 5. Explainability |
| python explainability.py |
| |
| # 6. Error Analysis |
| python error_analysis.py |
| ``` |
|
|
| ### Run API |
| ```bash |
| cd fraud_detection |
| uvicorn api.app:app --host 0.0.0.0 --port 8000 |
| ``` |
|
|
| ### API Usage |
| ```bash |
| curl -X POST http://localhost:8000/predict \ |
| -H "Content-Type: application/json" \ |
| -d '{ |
| "Time": 406.0, |
| "V1": -2.312, "V2": 1.951, "V3": -1.609, "V4": 3.997, |
| "V5": -0.522, "V6": -1.426, "V7": -2.537, "V8": 1.391, |
| "V9": -2.770, "V10": -2.772, "V11": 3.202, "V12": -2.899, |
| "V13": -0.595, "V14": -4.289, "V15": 0.389, "V16": -1.140, |
| "V17": -2.830, "V18": -0.016, "V19": 0.416, "V20": 0.126, |
| "V21": 0.517, "V22": -0.035, "V23": -0.465, "V24": -0.018, |
| "V25": -0.010, "V26": -0.002, "V27": -0.154, "V28": -0.048, |
| "Amount": 239.93 |
| }' |
| ``` |
|
|
| **Response:** |
| ```json |
| { |
| "transaction_id": "TXN-1714297654321", |
| "fraud_probability": 0.999943, |
| "decision": "BLOCKED - SUSPECTED FRAUD", |
| "risk_level": "CRITICAL", |
| "top_risk_factors": [...], |
| "response_time_ms": 5.62, |
| "threshold_used": 0.55, |
| "model_used": "XGBoost (Optimized)" |
| } |
| ``` |
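The same request can be sent from Python. The sketch below uses only the standard library and assumes the endpoint and field names shown in the curl example (`/predict`, `Time`, `V1` to `V28`, `Amount`). The helper names `build_payload` and `predict` are hypothetical and are not part of `api/app.py`.

```python
import json
import urllib.request

# Field names taken from the curl example: Time, V1..V28, Amount.
FEATURE_NAMES = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount"]


def build_payload(values):
    """Map a sequence of 30 numbers onto the field names the endpoint expects."""
    if len(values) != len(FEATURE_NAMES):
        raise ValueError(f"expected {len(FEATURE_NAMES)} values, got {len(values)}")
    return {name: float(v) for name, v in zip(FEATURE_NAMES, values)}


def predict(payload, url="http://localhost:8000/predict", timeout=5.0):
    """POST one transaction as JSON and return the decoded response body.

    The default URL assumes the API was started locally as shown under "Run API".
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `predict(build_payload(row))` returns a dictionary with the keys shown in the response above, for example `result["fraud_probability"]` and `result["decision"]`.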
|
|
| ## Key Findings |
|
|
| ### 5 Key Observations from EDA |
| 1. **Extreme Class Imbalance**: Only 0.173% fraud (1:577 ratio) |
| 2. **Amount Patterns**: Fraud mean $122.21 (median $9.25) vs legitimate mean $88.29 |
| 3. **Temporal Patterns**: Night fraud rate 0.518% vs day 0.137% |
| 4. **Key Features**: V17, V14, V12 most negatively correlated with fraud |
| 5. **Data Quality**: No missing values, 1,081 duplicates removed |
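The class-imbalance figures in the first observation follow from the raw counts quoted in the Dataset section (284,807 transactions, 492 fraud cases). A short check, assuming those counts are the ones the percentage was computed from:

```python
total_transactions = 284_807  # raw count from the Dataset section
fraud_cases = 492             # raw count from the Dataset section

# Share of fraud among all transactions, in percent.
fraud_rate_pct = 100 * fraud_cases / total_transactions

# Number of legitimate transactions for each fraud case.
legit_per_fraud = (total_transactions - fraud_cases) / fraud_cases

print(f"fraud rate: {fraud_rate_pct:.3f}%")                  # fraud rate: 0.173%
print(f"legitimate per fraud case: {legit_per_fraud:.1f}")   # legitimate per fraud case: 577.9
```

The 1:577 ratio quoted above matches this value truncated to an integer.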
|
|
| ### Business Impact (Test Set) |
| - **XGBoost catches 80.3% of fraud** with only 6 false positives |
| - Net savings: $6,936 on test set |
| - API response time: **<10ms average** (P95: 9.27ms) |
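The 80.3% recall and the 6 false positives can be tied back to the XGBoost row of the results table. The 6 false positives are stated above. The other two counts in the sketch below (57 true positives, 14 false negatives) are back-calculated from the reported precision and recall, so treat them as an inference rather than figures taken from the evaluation output.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# fp = 6 is stated in the text; tp and fn are inferred (assumption).
tp, fp, fn = 57, 6, 14
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# precision=0.9048 recall=0.8028 f1=0.8507
```

These values match the XGBoost row (0.9048, 0.8028, 0.8507), which suggests the test set contains 71 fraud cases.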
|
|
| ### Threshold Optimization |
| - Default threshold (0.5): F1 = 0.8507 |
| - **Optimal threshold (0.55): F1 = 0.8636** (+1.5% relative improvement) |
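A threshold such as 0.55 is typically found by sweeping candidate cut-offs and keeping the one with the highest F1. The sketch below shows that sweep on toy data. It is not the code in `evaluation.py`, and the function names and sample values are made up for illustration. The source does not say which split was used for the search; choosing the threshold on a validation split rather than the test set avoids an optimistic estimate.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when probabilities >= threshold are flagged as fraud."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, y_prob, candidates):
    """Return (threshold, f1) with the highest F1; ties keep the earliest candidate."""
    scored = ((t, f1_at_threshold(y_true, y_prob, t)) for t in candidates)
    return max(scored, key=lambda pair: pair[1])


# Toy labels and predicted fraud probabilities (illustrative only).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_prob = [0.05, 0.10, 0.52, 0.30, 0.90, 0.70, 0.40, 0.20]

# Candidate thresholds 0.05, 0.10, ..., 0.95 in ascending order.
candidates = [round(0.05 * k, 2) for k in range(1, 20)]
threshold, f1 = best_threshold(y_true, y_prob, candidates)
print(f"best threshold={threshold:.2f} f1={f1:.4f}")
```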
|
|
| ## Explainability |
|
|
| ### Top 10 Features (SHAP Analysis) |
| 1. V4 (Mean |SHAP| = 1.913) |
| 2. V14 (1.843) |
| 3. PCA_magnitude (1.113) |
| 4. V12 (0.834) |
| 5. V3 (0.749) |
| 6. V11 (0.638) |
| 7. V10 (0.582) |
| 8. V8 (0.516) |
| 9. V10_V14_interaction (0.513) |
| 10. V15 (0.454) |
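The ranking above is ordered by mean |SHAP|: the absolute SHAP value of each feature, averaged over all explained samples. A minimal sketch of that aggregation follows, using a small made-up matrix instead of real SHAP output. The function name `mean_abs_shap` is hypothetical and the numbers are not from this project.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value across samples (descending)."""
    n_samples = len(shap_rows)
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)  # sign is dropped: only magnitude counts
    scores = {name: total / n_samples for name, total in zip(feature_names, totals)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy SHAP matrix: one row per sample, one column per feature (illustrative only).
feature_names = ["V12", "V4", "V14"]
shap_rows = [
    [0.5, 2.0, -1.5],
    [-0.5, -1.0, 1.5],
    [1.0, 3.0, -2.0],
]

ranking = mean_abs_shap(shap_rows, feature_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With NumPy, the same per-feature score is `np.abs(shap_values).mean(axis=0)` for a 2-D array of SHAP values.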
| |
| ## Future Scope
| - Graph Neural Networks for fraud ring detection |
| - Real-time streaming with Apache Kafka |
| - Federated Learning across banks |
| - LLM-generated compliance explanations |
| - Temporal modeling with Transformers |
| |
| ## IEEE Paper
| The full research paper is available in the `paper/` directory:
| - LaTeX source: `paper/fraud_detection_paper.tex` |
| - Compiled PDF: `paper/fraud_detection_paper.pdf` |
| |
| ## Dataset
| [European Cardholder Credit Card Fraud Detection](https://huggingface.co/datasets/David-Egea/Creditcard-fraud-detection): 284,807 transactions with 492 fraud cases (0.173%).
| |
| ## License
| MIT License |
| |