# Fraud Detection System for Financial Transactions
A comprehensive end-to-end fraud detection system using machine learning, featuring 10 models, explainability analysis, and a production-ready API.
## Results Summary
| Model | Precision | Recall | F1 | ROC-AUC | PR-AUC | MCC |
|---|---|---|---|---|---|---|
| **XGBoost** ⭐ | **0.9048** | 0.8028 | **0.8507** | 0.9735 | **0.8166** | **0.8520** |
| Voting Ensemble | 0.8636 | 0.8028 | 0.8321 | **0.9783** | 0.8007 | 0.8324 |
| LightGBM (Tuned) | 0.7073 | **0.8169** | 0.7582 | 0.9318 | 0.7958 | 0.7597 |
| XGBoost (Tuned) | 0.8382 | 0.8028 | 0.8201 | 0.9697 | 0.7929 | 0.8200 |
| RF (Tuned) | 0.8730 | 0.7746 | 0.8209 | 0.9675 | 0.7926 | 0.8221 |
| Random Forest | 0.8333 | 0.7746 | 0.8029 | 0.9526 | 0.7710 | 0.8031 |
| MLP | 0.6914 | 0.7887 | 0.7368 | 0.9433 | 0.7522 | 0.7380 |
| Logistic Regression | 0.0488 | 0.8873 | 0.0924 | 0.9615 | 0.7350 | 0.2042 |
| Autoencoder | 0.0033 | 1.0000 | 0.0067 | 0.9604 | 0.0442 | 0.0409 |
**Best Model: XGBoost** (PR-AUC: 0.8166; F1: 0.8507 at the default threshold of 0.5, rising to 0.8636 at threshold 0.55)
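The XGBoost row can be cross-checked by hand. The sketch below recomputes precision, recall, and F1 from confusion-matrix counts. The counts themselves are an assumption: they do not appear in this README and were inferred as the smallest integers consistent with the reported precision and recall and with the 6 false positives quoted under Business Impact.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed counts (inferred from the reported metrics, not stated in the README):
# 57 true positives, 6 false positives, 14 false negatives.
TP, FP, FN = 57, 6, 14

precision, recall, f1 = precision_recall_f1(TP, FP, FN)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# prints: precision=0.9048 recall=0.8028 f1=0.8507
```

MCC is left out because it also needs the true-negative count, which cannot be inferred from the table alone.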
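## System Architecture
The architecture diagram is generated by `architecture.py` and saved under `figures/` as `architecture_diagram.*`.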
## Project Structure
```
fraud_detection/
├── config.py                  # Configuration settings
├── eda.py                     # Exploratory Data Analysis
├── preprocessing.py           # Feature engineering & splitting
├── train_all.py               # Model training pipeline
├── evaluation.py              # Comprehensive evaluation
├── explainability.py          # SHAP & LIME analysis
├── error_analysis.py          # FN/FP & drift analysis
├── ae_model.py                # Autoencoder model classes
├── architecture.py            # Architecture diagram generator
├── generate_pdf.py            # PDF paper generator
├── requirements.txt           # Python dependencies
├── api/
│   └── app.py                 # FastAPI production endpoint
├── models/
│   ├── all_models.joblib      # All trained models
│   ├── all_models_with_ae.joblib
│   ├── autoencoder.pt         # PyTorch autoencoder weights
│   ├── scaler.joblib          # Fitted RobustScaler
│   └── tuning_results.joblib  # Optuna best params
├── figures/                   # All figures (PNG + PDF, 300 DPI)
│   ├── class_distribution.*
│   ├── amount_analysis.*
│   ├── time_analysis.*
│   ├── correlation_heatmap.*
│   ├── feature_distributions.*
│   ├── roc_curves.*
│   ├── pr_curves.*
│   ├── confusion_matrices.*
│   ├── threshold_analysis.*
│   ├── feature_importance.*
│   ├── shap_summary.*
│   ├── shap_top10.*
│   ├── lime_explanation.*
│   ├── error_analysis.*
│   ├── architecture_diagram.*
│   ├── model_comparison.csv
│   ├── business_impact.csv
│   └── shap_feature_importance.csv
├── paper/
│   ├── fraud_detection_paper.tex  # IEEE LaTeX source
│   └── fraud_detection_paper.pdf  # Compiled PDF
└── data/
    ├── creditcard.csv             # Raw dataset
    ├── processed_data.joblib      # Preprocessed data
    └── evaluation_results.joblib  # Evaluation results
```
## Quick Start
### Installation
```bash
pip install -r requirements.txt
```
### Run Full Pipeline
```bash
# 1. EDA
python eda.py
# 2. Preprocessing
python preprocessing.py
# 3. Training
python train_all.py
# 4. Evaluation
python evaluation.py
# 5. Explainability
python explainability.py
# 6. Error Analysis
python error_analysis.py
```
### Run API
```bash
cd fraud_detection
uvicorn api.app:app --host 0.0.0.0 --port 8000
```
### API Usage
```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "Time": 406.0,
    "V1": -2.312, "V2": 1.951, "V3": -1.609, "V4": 3.997,
    "V5": -0.522, "V6": -1.426, "V7": -2.537, "V8": 1.391,
    "V9": -2.770, "V10": -2.772, "V11": 3.202, "V12": -2.899,
    "V13": -0.595, "V14": -4.289, "V15": 0.389, "V16": -1.140,
    "V17": -2.830, "V18": -0.016, "V19": 0.416, "V20": 0.126,
    "V21": 0.517, "V22": -0.035, "V23": -0.465, "V24": -0.018,
    "V25": -0.010, "V26": -0.002, "V27": -0.154, "V28": -0.048,
    "Amount": 239.93
  }'
```
**Response:**
```json
{
  "transaction_id": "TXN-1714297654321",
  "fraud_probability": 0.999943,
  "decision": "BLOCKED - SUSPECTED FRAUD",
  "risk_level": "CRITICAL",
  "top_risk_factors": [...],
  "response_time_ms": 5.62,
  "threshold_used": 0.55,
  "model_used": "XGBoost (Optimized)"
}
```
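The same request can be sent from Python. This is a minimal client sketch using only the standard library; the helper names `build_payload` and `predict` are hypothetical and not part of the project, and the endpoint and field names are taken from the curl example above.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/predict"  # same endpoint as the curl example


def build_payload(time_s, pca_features, amount):
    """Build the request body with the keys Time, V1..V28, Amount."""
    if len(pca_features) != 28:
        raise ValueError("expected 28 PCA features (V1..V28)")
    payload = {"Time": float(time_s)}
    payload.update({f"V{i}": float(v) for i, v in enumerate(pca_features, start=1)})
    payload["Amount"] = float(amount)
    return payload


def predict(payload, url=API_URL, timeout=5.0):
    """POST the payload as JSON and return the decoded response as a dict."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the API running locally, `predict(build_payload(406.0, features, 239.93))["fraud_probability"]` would return the score for one transaction, where `features` is a list of the 28 `V` values.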
## Key Findings
### 5 Key Observations from EDA
1. **Extreme Class Imbalance**: Only 0.173% fraud (1:577 ratio)
2. **Amount Patterns**: Fraud mean $122.21 (median $9.25) vs legit mean $88.29
3. **Temporal Patterns**: Night fraud rate 0.518% vs day 0.137%
4. **Key Features**: V17, V14, V12 most negatively correlated with fraud
5. **Data Quality**: No missing values, 1,081 duplicates removed
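The imbalance figures in the first observation follow directly from the raw dataset counts (284,807 transactions, 492 fraud cases). The short calculation below reproduces them and also shows why accuracy is a poor metric here: a model that labels every transaction as legitimate would already score above 99.8%. Variable names are illustrative only.

```python
TOTAL = 284_807  # transactions in the raw dataset
FRAUD = 492      # fraud cases in the raw dataset

fraud_rate = FRAUD / TOTAL
legit_per_fraud = (TOTAL - FRAUD) / FRAUD
# Accuracy of a trivial classifier that always predicts "legitimate".
majority_baseline_accuracy = 1 - fraud_rate

print(f"fraud rate: {fraud_rate:.3%}")
print(f"legitimate transactions per fraud case: {legit_per_fraud:.1f}")
print(f"always-legitimate accuracy: {majority_baseline_accuracy:.3%}")
```

These counts are for the raw file, before the duplicates mentioned in the fifth observation are removed.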
### Business Impact (Test Set)
- **XGBoost catches 80.3% of fraud** with only 6 false positives
- Net savings: $6,936 on test set
- API response time: **<10ms average** (P95: 9.27ms)
### Threshold Optimization
- Default threshold (0.5): F1 = 0.8507
- **Optimal threshold (0.55): F1 = 0.8636** (+0.0129 absolute, about +1.5% relative)
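A threshold search of this kind can be sketched as a sweep over candidate cutoffs, keeping the one with the highest F1. The code below is an illustration on synthetic toy scores, not the project's tuning code; the function names and data are assumptions. In practice the sweep is usually run on a validation split so the test-set F1 stays an unbiased estimate.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 when every score >= threshold is classified as fraud (label 1)."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores, candidates):
    """Return (threshold, f1) with the highest F1; ties keep the first candidate."""
    pairs = ((t, f1_at_threshold(y_true, scores, t)) for t in candidates)
    return max(pairs, key=lambda pair: pair[1])


# Synthetic toy data for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 0]
scores = [0.05, 0.10, 0.20, 0.52, 0.60, 0.30, 0.80, 0.95, 0.40, 0.53]
candidates = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95

threshold, f1 = best_threshold(y_true, scores, candidates)
print(f"best threshold={threshold:.2f} f1={f1:.3f}")
# prints: best threshold=0.55 f1=0.857
```

On this toy data, raising the cutoff from 0.5 to 0.55 removes two false positives without losing any true positives, which is the same kind of trade the bullet points above describe.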
## Explainability
### Top 10 Features (SHAP Analysis)
1. V4 (Mean |SHAP| = 1.913)
2. V14 (1.843)
3. PCA_magnitude (1.113)
4. V12 (0.834)
5. V3 (0.749)
6. V11 (0.638)
7. V10 (0.582)
8. V8 (0.516)
9. V10_V14_interaction (0.513)
10. V15 (0.454)
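The ranking above uses the mean absolute SHAP value per feature: take the SHAP value of each feature for every sample, drop the sign, and average over samples. A minimal sketch of that aggregation follows, using a small made-up matrix in place of real SHAP output; the function name and toy values are assumptions.

```python
def mean_abs_shap(shap_values, feature_names):
    """Rank features by mean absolute SHAP value, largest first.

    shap_values: list of rows, one per sample, one value per feature.
    """
    n_rows = len(shap_values)
    totals = [0.0] * len(feature_names)
    for row in shap_values:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    ranked = [(name, total / n_rows) for name, total in zip(feature_names, totals)]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Made-up SHAP values for three samples and three features (illustration only).
toy_shap = [
    [2.0, -1.5, 0.2],
    [-1.0, 2.5, -0.1],
    [3.0, -1.0, 0.3],
]
ranking = mean_abs_shap(toy_shap, ["V4", "V14", "V15"])
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With the `shap` library, the input matrix would typically come from an explainer such as `shap.TreeExplainer(model).shap_values(X)`; whether `explainability.py` does exactly this is not stated here.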
## Future Scope
- Graph Neural Networks for fraud ring detection
- Real-time streaming with Apache Kafka
- Federated Learning across banks
- LLM-generated compliance explanations
- Temporal modeling with Transformers
## IEEE Paper
The full research paper is available in the `paper/` directory:
- LaTeX source: `paper/fraud_detection_paper.tex`
- Compiled PDF: `paper/fraud_detection_paper.pdf`
## Dataset
[European Cardholder Credit Card Fraud Detection](https://huggingface.co/datasets/David-Egea/Creditcard-fraud-detection): 284,807 transactions with 492 fraud cases (0.173%).
## License
MIT License