Fraud Detection System for Financial Transactions
A comprehensive end-to-end fraud detection system using machine learning, featuring 10 models, explainability analysis, and a production-ready API.
Results Summary
| Model | Precision | Recall | F1 | ROC-AUC | PR-AUC | MCC |
|---|---|---|---|---|---|---|
| XGBoost ⭐ | 0.9048 | 0.8028 | 0.8507 | 0.9735 | 0.8166 | 0.8520 |
| Voting Ensemble | 0.8636 | 0.8028 | 0.8321 | 0.9783 | 0.8007 | 0.8324 |
| LightGBM (Tuned) | 0.7073 | 0.8169 | 0.7582 | 0.9318 | 0.7958 | 0.7597 |
| XGBoost (Tuned) | 0.8382 | 0.8028 | 0.8201 | 0.9697 | 0.7929 | 0.8200 |
| RF (Tuned) | 0.8730 | 0.7746 | 0.8209 | 0.9675 | 0.7926 | 0.8221 |
| Random Forest | 0.8333 | 0.7746 | 0.8029 | 0.9526 | 0.7710 | 0.8031 |
| MLP | 0.6914 | 0.7887 | 0.7368 | 0.9433 | 0.7522 | 0.7380 |
| Logistic Regression | 0.0488 | 0.8873 | 0.0924 | 0.9615 | 0.7350 | 0.2042 |
| Autoencoder | 0.0033 | 1.0000 | 0.0067 | 0.9604 | 0.0442 | 0.0409 |
Best Model: XGBoost, with PR-AUC 0.8166 and F1 0.8507 (F1 0.8636 with threshold=0.55)
System Architecture
Project Structure
fraud_detection/
├── config.py                      # Configuration settings
├── eda.py                         # Exploratory Data Analysis
├── preprocessing.py               # Feature engineering & splitting
├── train_all.py                   # Model training pipeline
├── evaluation.py                  # Comprehensive evaluation
├── explainability.py              # SHAP & LIME analysis
├── error_analysis.py              # FN/FP & drift analysis
├── ae_model.py                    # Autoencoder model classes
├── architecture.py                # Architecture diagram generator
├── generate_pdf.py                # PDF paper generator
├── requirements.txt               # Python dependencies
├── api/
│   └── app.py                     # FastAPI production endpoint
├── models/
│   ├── all_models.joblib          # All trained models
│   ├── all_models_with_ae.joblib
│   ├── autoencoder.pt             # PyTorch autoencoder weights
│   ├── scaler.joblib              # Fitted RobustScaler
│   └── tuning_results.joblib      # Optuna best params
├── figures/                       # All figures (PNG + PDF, 300 DPI)
│   ├── class_distribution.*
│   ├── amount_analysis.*
│   ├── time_analysis.*
│   ├── correlation_heatmap.*
│   ├── feature_distributions.*
│   ├── roc_curves.*
│   ├── pr_curves.*
│   ├── confusion_matrices.*
│   ├── threshold_analysis.*
│   ├── feature_importance.*
│   ├── shap_summary.*
│   ├── shap_top10.*
│   ├── lime_explanation.*
│   ├── error_analysis.*
│   ├── architecture_diagram.*
│   ├── model_comparison.csv
│   ├── business_impact.csv
│   └── shap_feature_importance.csv
├── paper/
│   ├── fraud_detection_paper.tex  # IEEE LaTeX source
│   └── fraud_detection_paper.pdf  # Compiled PDF
└── data/
    ├── creditcard.csv             # Raw dataset
    ├── processed_data.joblib      # Preprocessed data
    └── evaluation_results.joblib  # Evaluation results
Quick Start
Installation
pip install -r requirements.txt
Run Full Pipeline
# 1. EDA
python eda.py
# 2. Preprocessing
python preprocessing.py
# 3. Training
python train_all.py
# 4. Evaluation
python evaluation.py
# 5. Explainability
python explainability.py
# 6. Error Analysis
python error_analysis.py
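The six steps depend on each other's outputs, so it can help to run them from one script that stops at the first failure. The following is a minimal sketch, not part of the repository: `run_pipeline` and `PIPELINE_STEPS` are hypothetical names, and it assumes the scripts are invoked from inside `fraud_detection/` with no arguments, as in the commands above.

```python
import subprocess
import sys

# Script names taken from the numbered steps above, in execution order.
PIPELINE_STEPS = (
    "eda.py",
    "preprocessing.py",
    "train_all.py",
    "evaluation.py",
    "explainability.py",
    "error_analysis.py",
)


def run_pipeline(steps=PIPELINE_STEPS, python=sys.executable):
    """Run each script in order; raise CalledProcessError on the first failure."""
    completed = []
    for script in steps:
        print(f"Running {script} ...")
        # check=True aborts the pipeline if a step exits with a non-zero status.
        subprocess.run([python, script], check=True)
        completed.append(script)
    return completed
```

Calling `run_pipeline()` from the project directory runs all six steps; passing a shorter tuple such as `("evaluation.py", "explainability.py")` reruns only the later stages.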
Run API
cd fraud_detection
uvicorn api.app:app --host 0.0.0.0 --port 8000
API Usage
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{
"Time": 406.0,
"V1": -2.312, "V2": 1.951, "V3": -1.609, "V4": 3.997,
"V5": -0.522, "V6": -1.426, "V7": -2.537, "V8": 1.391,
"V9": -2.770, "V10": -2.772, "V11": 3.202, "V12": -2.899,
"V13": -0.595, "V14": -4.289, "V15": 0.389, "V16": -1.140,
"V17": -2.830, "V18": -0.016, "V19": 0.416, "V20": 0.126,
"V21": 0.517, "V22": -0.035, "V23": -0.465, "V24": -0.018,
"V25": -0.010, "V26": -0.002, "V27": -0.154, "V28": -0.048,
"Amount": 239.93
}'
Response:
{
"transaction_id": "TXN-1714297654321",
"fraud_probability": 0.999943,
"decision": "BLOCKED - SUSPECTED FRAUD",
"risk_level": "CRITICAL",
"top_risk_factors": [...],
"response_time_ms": 5.62,
"threshold_used": 0.55,
"model_used": "XGBoost (Optimized)"
}
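The same request can be sent from Python. This is a minimal sketch using only the standard library; `build_payload` and `predict` are hypothetical helper names, and the field names (`Time`, `V1` to `V28`, `Amount`) and the `/predict` URL are taken from the curl example above rather than from the API source.

```python
import json
import urllib.request


def build_payload(time, pca_features, amount):
    """Build the JSON body: Time, V1..V28, Amount (field names as in the curl example)."""
    if len(pca_features) != 28:
        raise ValueError("expected 28 PCA features (V1..V28)")
    payload = {"Time": float(time)}
    payload.update({f"V{i}": float(v) for i, v in enumerate(pca_features, start=1)})
    payload["Amount"] = float(amount)
    return payload


def predict(payload, url="http://localhost:8000/predict", timeout=5.0):
    """POST the payload to the running API and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example (requires the uvicorn server from "Run API" to be running):
# result = predict(build_payload(406.0, [0.0] * 28, 239.93))
# print(result["decision"], result["fraud_probability"])
```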
Key Findings
5 Key Observations from EDA
- Extreme Class Imbalance: Only 0.173% fraud (1:577 ratio)
- Amount Patterns: Fraud mean $122.21 (median $9.25) vs legit mean $88.29
- Temporal Patterns: Night fraud rate 0.518% vs day 0.137%
- Key Features: V17, V14, V12 most negatively correlated with fraud
- Data Quality: No missing values, 1,081 duplicates removed
Business Impact (Test Set)
- XGBoost catches 80.3% of fraud with only 6 false positives
- Net savings: $6,936 on test set
- API response time: <10ms average (P95: 9.27ms)
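The XGBoost row in the results table can be cross-checked from confusion-matrix counts. The sketch below assumes 57 true positives and 14 false negatives; those two counts are not stated in this README and are inferred from the reported recall of 0.8028 together with the 6 false positives above.

```python
def classification_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# fp=6 is reported above; tp=57 and fn=14 are assumptions inferred from the
# reported precision and recall, not values taken from the repository.
metrics = classification_metrics(tp=57, fp=6, fn=14)
print({name: round(value, 4) for name, value in metrics.items()})
# {'precision': 0.9048, 'recall': 0.8028, 'f1': 0.8507}
```

Under these assumed counts, the three values agree with the XGBoost row of the results table.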
Threshold Optimization
- Default threshold (0.5): F1 = 0.8507
- Optimal threshold (0.55): F1 = 0.8636 (+0.0129 absolute, about +1.5% relative)
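A threshold search of this kind can be sketched as a sweep over candidate cut-offs that keeps the one with the highest F1. This is an illustration on made-up toy data, not the project's tuning code; `f1_at_threshold` and `best_threshold` are hypothetical names. In practice the sweep would normally be run on a validation split so that the test-set score stays unbiased.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when predicting fraud for every probability >= threshold."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted_fraud = prob >= threshold
        if predicted_fraud and label == 1:
            tp += 1
        elif predicted_fraud and label == 0:
            fp += 1
        elif not predicted_fraud and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, y_prob, t))


# Toy labels and scores, invented for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_prob = [0.05, 0.10, 0.52, 0.30, 0.90, 0.60, 0.58, 0.20]
candidates = [round(0.05 * i, 2) for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

best = best_threshold(y_true, y_prob, candidates)
print(best, f1_at_threshold(y_true, y_prob, best))
```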
Explainability
Top 10 Features (SHAP Analysis)
- V4 (Mean |SHAP| = 1.913)
- V14 (1.843)
- PCA_magnitude (1.113)
- V12 (0.834)
- V3 (0.749)
- V11 (0.638)
- V10 (0.582)
- V8 (0.516)
- V10_V14_interaction (0.513)
- V15 (0.454)
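The ranking above uses the mean absolute SHAP value per feature, which averages the magnitude of each feature's contribution across samples. The sketch below shows only that aggregation step on a tiny invented matrix; `rank_by_mean_abs_shap` is a hypothetical helper, and the numbers are toy values, not output from `explainability.py`.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean |SHAP| over all rows (one row per sample)."""
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    means = [total / len(shap_rows) for total in totals]
    return sorted(zip(feature_names, means), key=lambda pair: pair[1], reverse=True)


# Toy SHAP values for three samples and three features (made up for illustration).
toy_shap = [
    [2.0, -1.5, 0.1],
    [-1.8, 2.1, -0.2],
    [1.9, -1.9, 0.3],
]
ranking = rank_by_mean_abs_shap(toy_shap, ["V4", "V14", "V15"])
for name, mean_abs in ranking:
    print(f"{name}: {mean_abs:.3f}")
```

Taking the absolute value before averaging matters: a feature that pushes some predictions strongly up and others strongly down would otherwise average out near zero.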
Future Scope
- Graph Neural Networks for fraud ring detection
- Real-time streaming with Apache Kafka
- Federated Learning across banks
- LLM-generated compliance explanations
- Temporal modeling with Transformers
IEEE Paper
Full research paper available in paper/ directory:
- LaTeX source: paper/fraud_detection_paper.tex
- Compiled PDF: paper/fraud_detection_paper.pdf
Dataset
European Cardholder Credit Card Fraud Detection: 284,807 transactions with 492 fraud cases (0.173%).
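The imbalance figures follow directly from these two counts. The short sketch below reproduces the arithmetic on the raw dataset; `imbalance_summary` is a hypothetical helper, and the numbers after duplicate removal or on the test split may differ slightly.

```python
def imbalance_summary(n_total: int, n_fraud: int) -> dict:
    """Fraud rate in percent and number of legitimate transactions per fraud case."""
    n_legit = n_total - n_fraud
    return {
        "fraud_rate_pct": 100 * n_fraud / n_total,
        "legit_per_fraud": n_legit / n_fraud,
    }


summary = imbalance_summary(n_total=284_807, n_fraud=492)
print(round(summary["fraud_rate_pct"], 3))   # 0.173
print(round(summary["legit_per_fraud"], 1))  # 577.9
```

A classifier that scores transactions at random has an expected PR-AUC close to the fraud rate (about 0.0017 here), which is why PR-AUC is a more demanding summary than ROC-AUC for data this imbalanced.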
License
MIT License
