# 🔍 Fraud Detection System for Financial Transactions

A comprehensive end-to-end fraud detection system using machine learning, featuring 10 models, explainability analysis, and a production-ready API.

## 📊 Results Summary

| Model | Precision | Recall | F1 | ROC-AUC | PR-AUC | MCC |
|---|---|---|---|---|---|---|
| **XGBoost** ⭐ | **0.9048** | 0.8028 | **0.8507** | 0.9735 | **0.8166** | **0.8520** |
| Voting Ensemble | 0.8636 | 0.8028 | 0.8321 | **0.9783** | 0.8007 | 0.8324 |
| LightGBM (Tuned) | 0.7073 | **0.8169** | 0.7582 | 0.9318 | 0.7958 | 0.7597 |
| XGBoost (Tuned) | 0.8382 | 0.8028 | 0.8201 | 0.9697 | 0.7929 | 0.8200 |
| RF (Tuned) | 0.8730 | 0.7746 | 0.8209 | 0.9675 | 0.7926 | 0.8221 |
| Random Forest | 0.8333 | 0.7746 | 0.8029 | 0.9526 | 0.7710 | 0.8031 |
| MLP | 0.6914 | 0.7887 | 0.7368 | 0.9433 | 0.7522 | 0.7380 |
| Logistic Regression | 0.0488 | 0.8873 | 0.0924 | 0.9615 | 0.7350 | 0.2042 |
| Autoencoder | 0.0033 | 1.0000 | 0.0067 | 0.9604 | 0.0442 | 0.0409 |

**Best Model: XGBoost** (PR-AUC: 0.8166, F1: 0.8507 at the default threshold, 0.8636 with threshold=0.55)

## 🏗️ System Architecture

![Architecture](figures/architecture_diagram.png)

## 📁 Project Structure

```
fraud_detection/
├── config.py                    # Configuration settings
├── eda.py                       # Exploratory Data Analysis
├── preprocessing.py             # Feature engineering & splitting
├── train_all.py                 # Model training pipeline
├── evaluation.py                # Comprehensive evaluation
├── explainability.py            # SHAP & LIME analysis
├── error_analysis.py            # FN/FP & drift analysis
├── ae_model.py                  # Autoencoder model classes
├── architecture.py              # Architecture diagram generator
├── generate_pdf.py              # PDF paper generator
├── requirements.txt             # Python dependencies
├── api/
│   └── app.py                   # FastAPI production endpoint
├── models/
│   ├── all_models.joblib        # All trained models
│   ├── all_models_with_ae.joblib
│   ├── autoencoder.pt           # PyTorch autoencoder weights
│   ├── scaler.joblib            # Fitted RobustScaler
│   └── tuning_results.joblib    # Optuna best params
├── figures/                     # All figures (PNG + PDF, 300 DPI)
│   ├── class_distribution.*
│   ├── amount_analysis.*
│   ├── time_analysis.*
│   ├── correlation_heatmap.*
│   ├── feature_distributions.*
│   ├── roc_curves.*
│   ├── pr_curves.*
│   ├── confusion_matrices.*
│   ├── threshold_analysis.*
│   ├── feature_importance.*
│   ├── shap_summary.*
│   ├── shap_top10.*
│   ├── lime_explanation.*
│   ├── error_analysis.*
│   ├── architecture_diagram.*
│   ├── model_comparison.csv
│   ├── business_impact.csv
│   └── shap_feature_importance.csv
├── paper/
│   ├── fraud_detection_paper.tex  # IEEE LaTeX source
│   └── fraud_detection_paper.pdf  # Compiled PDF
└── data/
    ├── creditcard.csv             # Raw dataset
    ├── processed_data.joblib      # Preprocessed data
    └── evaluation_results.joblib  # Evaluation results
```

## 🚀 Quick Start

### Installation
```bash
pip install -r requirements.txt
```

### Run Full Pipeline
```bash
# 1. EDA
python eda.py

# 2. Preprocessing
python preprocessing.py

# 3. Training
python train_all.py

# 4. Evaluation
python evaluation.py

# 5. Explainability
python explainability.py

# 6. Error Analysis
python error_analysis.py
```

### Run API
```bash
cd fraud_detection
uvicorn api.app:app --host 0.0.0.0 --port 8000
```

### API Usage
```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "Time": 406.0,
    "V1": -2.312, "V2": 1.951, "V3": -1.609, "V4": 3.997,
    "V5": -0.522, "V6": -1.426, "V7": -2.537, "V8": 1.391,
    "V9": -2.770, "V10": -2.772, "V11": 3.202, "V12": -2.899,
    "V13": -0.595, "V14": -4.289, "V15": 0.389, "V16": -1.140,
    "V17": -2.830, "V18": -0.016, "V19": 0.416, "V20": 0.126,
    "V21": 0.517, "V22": -0.035, "V23": -0.465, "V24": -0.018,
    "V25": -0.010, "V26": -0.002, "V27": -0.154, "V28": -0.048,
    "Amount": 239.93
  }'
```

**Response:**
```json
{
  "transaction_id": "TXN-1714297654321",
  "fraud_probability": 0.999943,
  "decision": "BLOCKED - SUSPECTED FRAUD",
  "risk_level": "CRITICAL",
  "top_risk_factors": [...],
  "response_time_ms": 5.62,
  "threshold_used": 0.55,
  "model_used": "XGBoost (Optimized)"
}
```
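
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl example: the URL and the 30 field names come from that example, while `build_request`, `predict`, and the assumption that the endpoint requires every field are illustrative and not taken from `api/app.py`.

```python
import json
import urllib.request

# URL and field names mirror the curl example; helper names are hypothetical.
API_URL = "http://localhost:8000/predict"
FEATURE_NAMES = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount"]


def build_request(features: dict, url: str = API_URL) -> urllib.request.Request:
    """Check that all expected fields are present and wrap them in a JSON POST."""
    missing = [name for name in FEATURE_NAMES if name not in features]
    if missing:
        raise ValueError(f"missing features: {missing}")
    body = json.dumps({name: float(features[name]) for name in FEATURE_NAMES})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(features: dict, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON reply (needs the API to be running)."""
    with urllib.request.urlopen(build_request(features), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server started as in "Run API", `predict(features)["fraud_probability"]` should return the score shown in the response above, assuming the response schema is unchanged.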

## 📈 Key Findings

### 5 Key Observations from EDA
1. **Extreme Class Imbalance**: Only 0.173% fraud (roughly 1:578 ratio)
2. **Amount Patterns**: Fraud mean $122.21 (median $9.25) vs legit mean $88.29
3. **Temporal Patterns**: Night fraud rate 0.518% vs day 0.137%
4. **Key Features**: V17, V14, V12 most negatively correlated with fraud
5. **Data Quality**: No missing values, 1,081 duplicates removed
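
The imbalance figure in the first observation can be checked with a few lines of arithmetic. This sketch uses the raw counts quoted in the Dataset section (284,807 transactions, 492 fraud cases); whether the reported percentage was computed before or after duplicate removal is not stated, so treat the raw counts as an assumption.

```python
# Raw counts as quoted in the Dataset section of this README.
total_transactions = 284_807
fraud_cases = 492

legit_cases = total_transactions - fraud_cases
fraud_rate_pct = 100 * fraud_cases / total_transactions
legit_per_fraud = legit_cases / fraud_cases

print(f"fraud rate: {fraud_rate_pct:.3f}%")  # fraud rate: 0.173%
print(f"legitimate transactions per fraud case: {legit_per_fraud:.1f}")  # 577.9
```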

### Business Impact (Test Set)
- **XGBoost catches 80.3% of fraud** with only 6 false positives
- Net savings: $6,936 on test set
- API response time: **<10ms average** (P95: 9.27ms)
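
The first bullet can be cross-checked against the XGBoost row of the results table. The sketch below assumes 57 true positives and 14 false negatives (71 fraud cases in the test set). These two counts are inferred from the reported recall and are not stated in this README; only the 6 false positives come from the text above.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard definitions of precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# fp=6 is reported above; tp=57 and fn=14 are inferred, not reported.
precision, recall, f1 = precision_recall_f1(tp=57, fp=6, fn=14)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# precision=0.9048 recall=0.8028 f1=0.8507
```

Under those assumed counts the output matches the table's Precision, Recall, and F1 for XGBoost to four decimals.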

### Threshold Optimization
- Default threshold (0.5): F1 = 0.8507
- **Optimal threshold (0.55): F1 = 0.8636** (+0.0129 absolute, about a 1.5% relative improvement)
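
A threshold search of this kind can be sketched as a simple sweep that scores each candidate cut-off by F1. The labels and scores below are made-up toy values, and `f1_at_threshold` and `best_threshold` are hypothetical helpers; the project's actual procedure in `evaluation.py` is not shown here and may differ.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 when every score >= threshold is flagged as fraud."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores, candidates):
    """Return the candidate with the highest F1 (first one wins on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, scores, t))


# Toy data for illustration only; not taken from the project.
y_true = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.52, 0.58, 0.70, 0.90, 0.95]
candidates = [round(0.05 * k, 2) for k in range(2, 19)]  # 0.10, 0.15, ..., 0.90

best = best_threshold(y_true, scores, candidates)
print(best, round(f1_at_threshold(y_true, scores, best), 3))  # 0.55 0.889
```

In this toy case, raising the cut-off from 0.5 to 0.55 removes the one false positive scored 0.52 without losing any true positive. Selecting the threshold on a validation split rather than the test set avoids an optimistic estimate.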

## 🔬 Explainability

### Top 10 Features (SHAP Analysis)
1. V4 (Mean |SHAP| = 1.913)
2. V14 (1.843)
3. PCA_magnitude (1.113)
4. V12 (0.834)
5. V3 (0.749)
6. V11 (0.638)
7. V10 (0.582)
8. V8 (0.516)
9. V10_V14_interaction (0.513)
10. V15 (0.454)
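
The "Mean |SHAP|" score used for this ranking is the average absolute attribution of a feature across samples. A minimal sketch of that aggregation follows, using plain lists and made-up attribution values; in the project the per-sample values would presumably come from the `shap` library in `explainability.py`, which is not reproduced here.

```python
def rank_by_mean_abs(values, feature_names):
    """Rank features by the mean absolute value of their per-sample attributions."""
    n_rows = len(values)
    means = [
        sum(abs(row[j]) for row in values) / n_rows
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, means), key=lambda pair: pair[1], reverse=True)


# Made-up attributions (rows = samples, columns = features); illustration only.
feature_names = ["V12", "V4", "V14"]
toy_values = [
    [0.5, 2.0, -1.5],
    [-0.5, -1.0, 2.5],
    [1.5, 3.0, -1.0],
]

for name, score in rank_by_mean_abs(toy_values, feature_names):
    print(f"{name}: {score:.3f}")
```

Taking absolute values before averaging matters: a feature that pushes some predictions up and others down would otherwise cancel out and look unimportant.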

## 🔮 Future Scope
- Graph Neural Networks for fraud ring detection
- Real-time streaming with Apache Kafka
- Federated Learning across banks
- LLM-generated compliance explanations
- Temporal modeling with Transformers

## 📝 IEEE Paper
The full research paper is available in the `paper/` directory:
- LaTeX source: `paper/fraud_detection_paper.tex`
- Compiled PDF: `paper/fraud_detection_paper.pdf`

## 📊 Dataset
[European Cardholder Credit Card Fraud Detection](https://huggingface.co/datasets/David-Egea/Creditcard-fraud-detection): 284,807 transactions with 492 fraud cases (0.173%).

## 📜 License
MIT License