# 🔍 Fraud Detection System for Financial Transactions

An end-to-end machine learning system for detecting fraud in financial transactions, featuring 10 models, SHAP and LIME explainability analysis, and a FastAPI prediction endpoint.

## 📊 Results Summary

| Model | Precision | Recall | F1 | ROC-AUC | PR-AUC | MCC |
|---|---|---|---|---|---|---|
| **XGBoost** ⭐ | **0.9048** | 0.8028 | **0.8507** | 0.9735 | **0.8166** | **0.8520** |
| Voting Ensemble | 0.8636 | 0.8028 | 0.8321 | **0.9783** | 0.8007 | 0.8324 |
| LightGBM (Tuned) | 0.7073 | 0.8169 | 0.7582 | 0.9318 | 0.7958 | 0.7597 |
| XGBoost (Tuned) | 0.8382 | 0.8028 | 0.8201 | 0.9697 | 0.7929 | 0.8200 |
| RF (Tuned) | 0.8730 | 0.7746 | 0.8209 | 0.9675 | 0.7926 | 0.8221 |
| Random Forest | 0.8333 | 0.7746 | 0.8029 | 0.9526 | 0.7710 | 0.8031 |
| MLP | 0.6914 | 0.7887 | 0.7368 | 0.9433 | 0.7522 | 0.7380 |
| Logistic Regression | 0.0488 | 0.8873 | 0.0924 | 0.9615 | 0.7350 | 0.2042 |
| Autoencoder | 0.0033 | **1.0000** | 0.0067 | 0.9604 | 0.0442 | 0.0409 |

**Best Model: XGBoost** (PR-AUC: 0.8166; F1: 0.8507 at the default 0.5 threshold, 0.8636 at threshold 0.55)

## 🏗️ System Architecture

![Architecture](figures/architecture_diagram.png)

## 📁 Project Structure

```
fraud_detection/
├── config.py                    # Configuration settings
├── eda.py                       # Exploratory Data Analysis
├── preprocessing.py             # Feature engineering & splitting
├── train_all.py                 # Model training pipeline
├── evaluation.py                # Comprehensive evaluation
├── explainability.py            # SHAP & LIME analysis
├── error_analysis.py            # FN/FP & drift analysis
├── ae_model.py                  # Autoencoder model classes
├── architecture.py              # Architecture diagram generator
├── generate_pdf.py              # PDF paper generator
├── requirements.txt             # Python dependencies
├── api/
│   └── app.py                   # FastAPI production endpoint
├── models/
│   ├── all_models.joblib        # All trained models
│   ├── all_models_with_ae.joblib
│   ├── autoencoder.pt           # PyTorch autoencoder weights
│   ├── scaler.joblib            # Fitted RobustScaler
│   └── tuning_results.joblib    # Optuna best params
├── figures/                     # All figures (PNG + PDF, 300 DPI)
│   ├── class_distribution.*
│   ├── amount_analysis.*
│   ├── time_analysis.*
│   ├── correlation_heatmap.*
│   ├── feature_distributions.*
│   ├── roc_curves.*
│   ├── pr_curves.*
│   ├── confusion_matrices.*
│   ├── threshold_analysis.*
│   ├── feature_importance.*
│   ├── shap_summary.*
│   ├── shap_top10.*
│   ├── lime_explanation.*
│   ├── error_analysis.*
│   ├── architecture_diagram.*
│   ├── model_comparison.csv
│   ├── business_impact.csv
│   └── shap_feature_importance.csv
├── paper/
│   ├── fraud_detection_paper.tex  # IEEE LaTeX source
│   └── fraud_detection_paper.pdf  # Compiled PDF
└── data/
    ├── creditcard.csv             # Raw dataset
    ├── processed_data.joblib      # Preprocessed data
    └── evaluation_results.joblib  # Evaluation results
```

## 🚀 Quick Start

### Installation
```bash
pip install -r requirements.txt
```

### Run Full Pipeline
```bash
# 1. EDA
python eda.py

# 2. Preprocessing
python preprocessing.py

# 3. Training
python train_all.py

# 4. Evaluation
python evaluation.py

# 5. Explainability
python explainability.py

# 6. Error Analysis
python error_analysis.py
```
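
The six steps above can also be chained from a single script. The following is a minimal sketch, not part of the repository: the `run_pipeline` helper and its `runner` parameter are hypothetical names, and it assumes the scripts are run from the `fraud_detection/` directory in the order listed.

```python
import subprocess
import sys

# Script names are taken from the steps listed above, in the same order.
PIPELINE_STEPS = [
    "eda.py",
    "preprocessing.py",
    "train_all.py",
    "evaluation.py",
    "explainability.py",
    "error_analysis.py",
]


def run_pipeline(steps=PIPELINE_STEPS, runner=subprocess.run):
    """Run each script in order and stop at the first failure.

    `runner` is injectable (hypothetical convenience) so the function can be
    exercised without actually launching the scripts.
    Returns the list of steps that completed successfully.
    """
    completed = []
    for script in steps:
        print(f"Running {script} ...")
        result = runner([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} failed with exit code {result.returncode}")
            break
        completed.append(script)
    return completed
```

Calling `run_pipeline()` launches the scripts with the current Python interpreter. Stopping at the first non-zero exit code avoids, for example, evaluating models when training did not finish.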

### Run API
```bash
cd fraud_detection
uvicorn api.app:app --host 0.0.0.0 --port 8000
```

### API Usage
```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "Time": 406.0,
    "V1": -2.312, "V2": 1.951, "V3": -1.609, "V4": 3.997,
    "V5": -0.522, "V6": -1.426, "V7": -2.537, "V8": 1.391,
    "V9": -2.770, "V10": -2.772, "V11": 3.202, "V12": -2.899,
    "V13": -0.595, "V14": -4.289, "V15": 0.389, "V16": -1.140,
    "V17": -2.830, "V18": -0.016, "V19": 0.416, "V20": 0.126,
    "V21": 0.517, "V22": -0.035, "V23": -0.465, "V24": -0.018,
    "V25": -0.010, "V26": -0.002, "V27": -0.154, "V28": -0.048,
    "Amount": 239.93
  }'
```

**Response:**
```json
{
  "transaction_id": "TXN-1714297654321",
  "fraud_probability": 0.999943,
  "decision": "BLOCKED - SUSPECTED FRAUD",
  "risk_level": "CRITICAL",
  "top_risk_factors": [...],
  "response_time_ms": 5.62,
  "threshold_used": 0.55,
  "model_used": "XGBoost (Optimized)"
}
```
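
The same request can be sent from Python using only the standard library. This is a sketch under assumptions: the URL and field names are copied from the curl example above, while `build_request`, `predict`, and the client-side check for missing features are illustrative additions rather than part of the project's API.

```python
import json
import urllib.request

# Endpoint from the curl example above; adjust host/port if the API runs elsewhere.
PREDICT_URL = "http://localhost:8000/predict"

# The 30 input fields used in the curl example: Time, V1..V28, Amount.
FEATURE_NAMES = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount"]


def build_request(transaction, url=PREDICT_URL):
    """Check that all features are present and build a JSON POST request."""
    missing = [name for name in FEATURE_NAMES if name not in transaction]
    if missing:
        raise ValueError(f"missing features: {missing}")
    body = json.dumps({name: float(transaction[name]) for name in FEATURE_NAMES})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(transaction, url=PREDICT_URL, timeout=5.0):
    """Send one transaction to the API and return the decoded JSON response."""
    request = build_request(transaction, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running, `predict(txn)["fraud_probability"]` returns the score for a dict `txn` holding the 30 fields. How the server itself handles missing or malformed fields is not documented here, so the local check is only a convenience.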

## 📈 Key Findings

### 5 Key Observations from EDA
1. **Extreme Class Imbalance**: Only 0.173% of transactions are fraud (about 1 fraud per 578 legitimate transactions)
2. **Amount Patterns**: Fraud mean $122.21 (median $9.25) vs. legitimate mean $88.29
3. **Temporal Patterns**: Night fraud rate 0.518% vs. day fraud rate 0.137%
4. **Key Features**: V17, V14, and V12 are the features most negatively correlated with fraud
5. **Data Quality**: No missing values; 1,081 duplicate rows removed
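
The class-imbalance figures in the first observation can be reproduced from the raw counts given in the Dataset section. The snippet below is a small arithmetic check, assuming those counts (284,807 transactions, 492 fraud cases) refer to the data before duplicate removal.

```python
# Counts from the Dataset section (assumed to be the raw data,
# before the 1,081 duplicates were removed).
total_transactions = 284_807
fraud_cases = 492

legit_cases = total_transactions - fraud_cases
fraud_rate_pct = 100 * fraud_cases / total_transactions
legit_per_fraud = legit_cases / fraud_cases

print(f"Fraud rate: {fraud_rate_pct:.3f}%")
print(f"Legitimate transactions per fraud case: {legit_per_fraud:.1f}")
```

A ratio of this kind is also what XGBoost's `scale_pos_weight` parameter is commonly set to for imbalanced data; whether this project does so is not stated here.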

### Business Impact (Test Set)
- **XGBoost catches 80.3% of fraud** with only 6 false positives
- Net savings: $6,936 on test set
- API response time: **<10ms average** (P95: 9.27ms)
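
As a consistency check, the XGBoost precision, recall, and F1 in the results table can be recomputed from confusion-matrix counts. This is a sketch under a stated assumption: only the 6 false positives are reported above, while the 57 true positives and 14 false negatives are inferred from the reported recall of 0.8028 and are not taken from the project's outputs.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


# ASSUMPTION: tp=57 and fn=14 are inferred from the reported recall and the
# 6 false positives; they are not stated in the project outputs.
p, r, f1 = precision_recall_f1(tp=57, fp=6, fn=14)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

Under these inferred counts the three values round to the 0.9048, 0.8028, and 0.8507 shown for XGBoost in the results table.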

### Threshold Optimization
- Default threshold (0.5): F1 = 0.8507
- **Optimal threshold (0.55): F1 = 0.8636** (+0.0129 absolute, about +1.5% relative)
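
A threshold like this is typically found by sweeping candidate values and keeping the one with the highest F1. The following is a minimal sketch of that idea, not the project's actual procedure: the function names are hypothetical, and the labels and probabilities are toy values constructed so that the optimum sits above 0.5.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when predicting fraud for probabilities >= threshold."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = prob >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif not predicted and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, y_prob, candidates):
    """Return (threshold, f1) with the highest F1.

    Ties keep the candidate that appears first in `candidates`.
    """
    best = (None, -1.0)
    for t in candidates:
        score = f1_at_threshold(y_true, y_prob, t)
        if score > best[1]:
            best = (t, score)
    return best


# Toy data for illustration only; not taken from the project.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.05, 0.10, 0.52, 0.30, 0.90, 0.75, 0.60, 0.53, 0.40, 0.20]
candidates = [round(0.05 * k, 2) for k in range(1, 20)]  # 0.05 ... 0.95
threshold, score = best_threshold(y_true, y_prob, candidates)
print(f"best threshold={threshold} f1={score:.4f}")
```

In practice the sweep is usually run on a validation split rather than the test set, so that the reported test F1 is not tuned on the data it is measured on.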

## 🔬 Explainability

### Top 10 Features (SHAP Analysis)
1. V4 (Mean |SHAP| = 1.913)
2. V14 (1.843)
3. PCA_magnitude (1.113)
4. V12 (0.834)
5. V3 (0.749)
6. V11 (0.638)
7. V10 (0.582)
8. V8 (0.516)
9. V10_V14_interaction (0.513)
10. V15 (0.454)
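
The ranking above uses mean |SHAP|, the average absolute SHAP value of each feature across samples. The sketch below shows how such a ranking can be computed from a matrix of SHAP values; the function name is hypothetical and the small matrix is made up for illustration, so its numbers do not correspond to the values listed above.

```python
def rank_by_mean_abs_shap(shap_values, feature_names):
    """Rank features by mean absolute SHAP value across samples.

    shap_values: list of rows, one per sample, each with one value per feature.
    Returns a list of (feature_name, mean_abs_shap), largest first.
    """
    n_samples = len(shap_values)
    if n_samples == 0:
        raise ValueError("shap_values is empty")
    totals = [0.0] * len(feature_names)
    for row in shap_values:
        if len(row) != len(feature_names):
            raise ValueError("row length does not match feature_names")
        for j, value in enumerate(row):
            totals[j] += abs(value)
    means = [(name, total / n_samples) for name, total in zip(feature_names, totals)]
    return sorted(means, key=lambda item: item[1], reverse=True)


# Toy SHAP matrix for illustration only; the values are made up.
feature_names = ["V4", "V14", "V12"]
shap_values = [
    [2.0, -1.5, 0.5],
    [-1.0, 2.5, -0.5],
    [3.0, -1.0, 1.5],
]
ranking = rank_by_mean_abs_shap(shap_values, feature_names)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

Taking the absolute value before averaging matters: a feature that pushes some predictions strongly toward fraud and others strongly away would otherwise average out near zero despite being influential.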

## 🔮 Future Scope
- Graph Neural Networks for fraud ring detection
- Real-time streaming with Apache Kafka
- Federated Learning across banks
- LLM-generated compliance explanations
- Temporal modeling with Transformers

## 📝 IEEE Paper
The full research paper is available in the `paper/` directory:
- LaTeX source: `paper/fraud_detection_paper.tex`
- Compiled PDF: `paper/fraud_detection_paper.pdf`

## 📊 Dataset
[European Cardholder Credit Card Fraud Detection](https://huggingface.co/datasets/David-Egea/Creditcard-fraud-detection) — 284,807 transactions with 492 fraud cases (0.173%).

## 📜 License
MIT License