| # ChurnPredict Pro: A Stacking Ensemble Framework for Customer Churn Prediction with Explainable AI and CLV Scoring |
|
|
> **Subtitle:** End-to-End Machine Learning Pipeline for Telecommunications and Banking Customer Retention, Combining Gradient Boosting, Neural Networks, and Game-Theoretic Interpretability
|
|
| --- |
|
|
| ## Table of Contents |
|
|
| 1. [Problem Statement](#1-problem-statement) |
| 2. [Idea of Solution](#2-idea-of-solution) |
| 3. [Objectives](#3-objectives) |
| 4. [Literature Review & References](#4-literature-review--references) |
| 5. [Dataset Understanding](#5-dataset-understanding) |
| 6. [Proposed Methodology](#6-proposed-methodology) |
| 7. [Implementation Strategy](#7-implementation-strategy) |
| 8. [Experimental Design](#8-experimental-design) |
| 9. [Result Analysis](#9-result-analysis) |
| 10. [Iterative Improvement](#10-iterative-improvement) |
|
|
| --- |
|
|
| ## 1. Problem Statement |
|
|
| ### 1.1 Business Context |
|
|
Customer churn, the loss of clients to competitors or market attrition, is one of the most financially consequential challenges in subscription-based and service-oriented industries. In telecommunications, acquiring a new customer costs **5–25× more** than retaining an existing one (industry estimates, 2024). In banking, customer attrition erodes lifetime value portfolios and damages brand equity. For both sectors, even a **1% reduction in churn** can translate to millions in retained revenue.
|
|
| Current retention strategies suffer from two critical gaps: |
| - **Reactive approaches:** Firms typically respond to churn *after* it occurs, through win-back campaigns that are expensive and low-yield. |
| - **Black-box predictions:** Machine learning models deployed in production often lack interpretability, making it impossible for marketing and customer-success teams to act on model outputs with confidence. |
|
|
| ### 1.2 Technical Challenges |
|
|
| | Challenge | Description | Impact | |
| |-----------|-------------|--------| |
| **Class Imbalance** | Churners typically represent 10–30% of the customer base. Standard accuracy metrics are misleading. | High false-negative rates; missed at-risk customers |
| | **Feature Heterogeneity** | Datasets mix categorical (contract type, payment method), numerical (tenure, charges), and temporal features (quarter, month-on-book). | Preprocessing complexity; risk of data leakage | |
| | **Concept Drift** | Customer behavior patterns shift seasonally and with market conditions. Models degrade without retraining. | Production model staleness; declining precision | |
| | **Interpretability vs. Performance Trade-off** | High-accuracy ensembles are often opaque. Explainable models (e.g., logistic regression) underperform on tabular data. | Regulatory non-compliance (GDPR Article 22); low stakeholder trust | |
| | **Multi-Domain Generalization** | Models trained on telecom data fail on banking data due to domain shift in feature distributions. | Siloed, non-reusable models per industry | |
|
|
| ### 1.3 Gaps in Existing Solutions |
|
|
| 1. **Single-model reliance:** Most production churn models deploy a single classifier (XGBoost or logistic regression), missing the variance-reduction benefits of ensemble diversity. |
2. **No CLV integration:** Churn predictions are binary; they do not incorporate *which* churners are most valuable to retain, leading to inefficient marketing spend.
| 3. **Weak experimental rigor:** Many published churn studies use a single train/test split without cross-validation, statistical testing, or confidence intervals on metrics. |
| 4. **Dataset isolation:** Telco and bank churn datasets are studied separately; few works evaluate cross-domain transfer or unified pipelines. |
|
|
| --- |
|
|
| ## 2. Idea of Solution |
|
|
| ### 2.1 Architecture Overview |
|
|
| We propose **ChurnPredict Pro**, a **stacking ensemble architecture** that combines the complementary strengths of five diverse base learners under a meta-learner. The design philosophy is: |
|
|
| > *"Diversity in inductive bias reduces variance; interpretability in the meta-layer preserves actionability."* |
|
|
| ### 2.2 The 5-Model Stacking Ensemble |
|
|
```
┌─────────────────────────────────────────────────────────────────┐
│               CHURNPREDICT PRO - STACKING ENSEMBLE              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐    │
│  │ XGBoost │ │LightGBM │ │CatBoost │ │   MLP   │ │   LR    │    │
│  │ (GBDT)  │ │ (GBDT)  │ │(Ordered)│ │ (Deep)  │ │ (Base)  │    │
│  │ Base 1  │ │ Base 2  │ │ Base 3  │ │ Base 4  │ │ Base 5  │    │
│  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘    │
│       └───────────┴───────────┼───────────┴───────────┘         │
│                               ▼                                 │
│                      ┌──────────────────┐                       │
│                      │   META-LEARNER   │                       │
│                      │  (Logistic Reg   │                       │
│                      │   / XGBoost)     │                       │
│                      └────────┬─────────┘                       │
│                               ▼                                 │
│                      ┌──────────────────┐                       │
│                      │   CLV SCORING    │                       │
│                      │ + SHAP EXPLAINER │                       │
│                      └──────────────────┘                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
|
|
| ### 2.3 Why These 5 Base Models? |
|
|
| | Model | Inductive Bias | Strength on Churn Data | Weakness Mitigated by Ensemble | |
| |-------|---------------|------------------------|-------------------------------| |
| | **XGBoost** | Greedy gradient boosting with regularization | Best-in-class on sparse/tabular data; handles missing values natively | Prone to overfitting on small datasets | |
| | **LightGBM** | Histogram-based leaf-wise boosting | Faster training; GOSS sampling for large data | Leaf-wise can overfit; GOSS introduces bias | |
| | **CatBoost** | Ordered boosting + categorical encoding | Native categorical feature handling; reduces target leakage | Slower than LightGBM; ordered boosting complexity | |
| | **MLP (Deep)** | Non-linear feature interactions | Captures complex feature cross-products | Needs more data; less interpretable | |
| | **Logistic Regression** | Linear decision boundary | Fast, interpretable baseline; L1 regularization for feature selection | Cannot model non-linear relationships | |
|
|
| The meta-learner (Logistic Regression or a shallow XGBoost) learns optimal weights for combining the five base models' predictions, leveraging their uncorrelated errors. |
|
|
| ### 2.4 CLV-Weighted Scoring |
|
|
| Instead of ranking customers by churn probability alone, we multiply P(churn) by estimated CLV to produce a **Retention Priority Score (RPS)**: |
|
|
| $$ |
| \text{RPS}_i = P(\text{churn}_i) \times \text{CLV}_i |
| $$ |
| |
| This ensures retention campaigns target high-value at-risk customers, maximizing ROI. |
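A minimal sketch of RPS ranking (column names and values are illustrative only):

```python
# Retention Priority Score: rank by P(churn) x CLV, not churn risk alone.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": ["A", "B", "C", "D"],
    "p_churn":     [0.90, 0.40, 0.85, 0.10],
    "clv":         [200.0, 5000.0, 3000.0, 8000.0],
})
customers["rps"] = customers["p_churn"] * customers["clv"]
ranked = customers.sort_values("rps", ascending=False).reset_index(drop=True)
# Customer C (0.85 x 3000 = 2550) outranks A (0.90 x 200 = 180): slightly
# lower churn risk, but far more revenue at stake.
```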
| |
| --- |
| |
| ## 3. Objectives |
| |
| ### 3.1 Primary Goals |
| |
| | ID | Objective | Metric Target | Success Criterion | |
| |----|-----------|---------------|-----------------| |
| P1 | Build a stacking ensemble that outperforms any single base model | F1-Score | ΔF1 ≥ +0.03 over best single model |
| P2 | Achieve high recall on churn class (minimize false negatives) | Recall@Churn | ≥ 0.85 on both datasets |
| P3 | Deliver actionable model explanations per customer | SHAP summary | Top-5 features identified per prediction |
| P4 | Rank customers by retention value, not just churn risk | CLV-weighted PR-AUC | Precision ≥ 0.70 on top-20% RPS-ranked customers |
| |
| ### 3.2 Secondary Goals |
| |
| | ID | Objective | Metric Target | |
| |----|-----------|---------------| |
| S1 | Evaluate cross-domain generalization (Telco → Bank, Bank → Telco) | Transfer AUC ≥ 0.80 |
| S2 | Achieve sub-second inference latency for batch scoring | ≤ 500 ms per 1,000 records |
| | S3 | Deploy a reproducible, version-controlled pipeline | Docker + DVC + CI/CD | |
| | S4 | Document model behavior for regulatory compliance (GDPR/CCPA) | Full SHAP + model card | |
| |
| ### 3.3 Success Criteria Summary |
| |
| - **Model Performance:** F1-Score > 0.85, ROC-AUC > 0.90, PR-AUC > 0.80 on both datasets |
- **Business Impact:** Identify top 20% at-risk customers with ≥ 70% precision
| - **Interpretability:** Every prediction accompanied by SHAP force plot; global SHAP summary for stakeholder dashboards |
| - **Robustness:** 5-fold stratified CV with 95% confidence intervals on all metrics |
| |
| --- |
| |
| ## 4. Literature Review & References |
| |
| ### 4.1 Category Overview |
| |
| | Category | Count | Papers | |
| |----------|-------|--------| |
| Ensemble / Boosting Methods | 4 | [1–4] |
| SHAP / LIME Interpretability | 3 | [5–7] |
| Deep Learning for Churn | 3 | [8–10] |
| CLV / Profit-Driven Churn | 3 | [11–13] |
| Financial / Bank Churn | 4 | [14–17] |
| Survey / Benchmark / Foundation | 4 | [18–21] |
| | **Total** | **21** | | |
| |
| ### 4.2 Full References (2016β2024) |
| |
| #### [1] XGBoost: A Scalable Tree Boosting System |
| **Chen, T., & Guestrin, C.** (2016). *KDD*. arXiv:1603.02754. |
| Introduced sparsity-aware algorithms and weighted quantile sketch for gradient boosting. Became the dominant algorithm for tabular churn prediction tasks worldwide. |
| |
| #### [2] Tabular Data: Deep Learning is Not All You Need |
| **Shwartz-Ziv, R., & Armon, A.** (2021). arXiv:2106.03253. |
| Rigorous comparison showing XGBoost outperforms recent deep learning models on tabular data; ensembling deep models with XGBoost further improves performance. |
| |
| #### [3] CatBoost: Unbiased Boosting with Categorical Features |
| **Prokhorenkova, L., et al.** (2017). arXiv:1706.09516. |
| Ordered boosting and novel categorical feature processing; outperforms other boosting implementations on datasets with high-cardinality categorical churn predictors. |
| |
| #### [4] Enhancing Customer Churn Prediction: An Adaptive Ensemble Learning Approach |
| **Shaikhsurab, S., & Magadum, S.** (2024). arXiv:2408.16284. |
| Adaptive ensemble combining XGBoost, LightGBM, LSTM, MLP, and SVM with stacking + meta-feature generation; achieved **99.28% accuracy** on telecom churn datasets. |
| |
| #### [5] A Unified Approach to Interpreting Model Predictions (SHAP) |
| **Lundberg, S. M., & Lee, S.-I.** (2017). *NeurIPS*. arXiv:1705.07874. |
| Proposed SHAP values as a unified measure of feature importance based on game-theoretic Shapley values; unified six existing explanation methods. |
| |
| #### [6] "Why Should I Trust You?": Explaining Predictions of Any Classifier (LIME) |
| **Ribeiro, M. T., Singh, S., & Guestrin, C.** (2016). *KDD*. arXiv:1602.04938. |
| Introduced LIME to explain any classifier locally via interpretable surrogate models; foundational for churn model explainability and regulatory compliance. |
| |
| #### [7] XAI Handbook: Towards a Unified Framework for Explainable AI |
| **Palacio, D. G., et al.** (2021). arXiv:2105.06677. |
| Provides theoretical framework unifying XAI terminology (LIME, SHAP, Grad-CAM, etc.); essential for regulatory compliance and method comparison in churn explainability. |
| |
| #### [8] Early Churn Prediction from Large-Scale User-Product Interaction Time Series |
| **Bhattacharjee, A., Thukral, K., & Patil, C.** (2023). arXiv:2309.14390. |
Applied multivariate time series classification with deep neural networks to fantasy sports churn; scales to 10⁸ users, demonstrating the feasibility of deep learning at scale.
| |
| #### [9] Modelling Customer Churn for the Retail Industry in a Deep Learning Sequential Framework |
| **Equihua, C., et al.** (2023). arXiv:2304.00575. |
| Deep survival framework using recurrent neural networks for non-contractual retail churn; avoids extensive feature engineering through learned representations. |
| |
| #### [10] Churn Reduction via Distillation |
| **Jiang, Y., et al.** (2021). arXiv:2106.02654. |
| Showed model distillation reduces predictive churn (model instability during retraining) while maintaining accuracy across FC, CNN, and transformer architectures. |
| |
| #### [11] OptDist: Learning Optimal Distribution for Customer Lifetime Value Prediction |
| **Weng, S., et al.** (2024). arXiv:2408.08585. |
| Proposed OptDist with distribution learning/selection modules; adaptively selects optimal sub-distributions for CLTV prediction on public and industrial datasets. |
| |
| #### [12] Customer Lifetime Value Prediction with Uncertainty Estimation Using Monte Carlo Dropout |
| **Cao, Y., Xu, Y., & Yang, Q.** (2024). arXiv:2411.15944. |
| Enhanced neural network CLTV prediction with Monte Carlo Dropout for uncertainty quantification; improved Top-5% MAPE significantly. |
| |
| #### [13] A Predict-and-Optimize Approach to Profit-Driven Churn Prevention |
| **GΓ³mez-Vargas, E., Maldonado, S., & Vairetti, S.** (2023). arXiv:2310.07047. |
| First predict-and-optimize approach for churn prevention using individual CLVs (not averages); regret minimization via SGD; tested on 12 real-world datasets. |
| |
| #### [14] Dynamic Customer Embeddings for Financial Service Applications |
| **Chitsazan, N., et al.** (2021). arXiv:2106.11880. |
| DCE framework uses customer digital activity + financial context for intent/fraud/call-center prediction; financial services benchmark for learned representations. |
| |
| #### [15] FinPT: Financial Risk Prediction with Profile Tuning on Pretrained Foundation Models |
| **Yin, H., et al.** (2023). arXiv:2308.00065. |
| Introduced FinBench dataset + FinPT method for financial risk prediction (default, fraud, churn) using LLM-generated customer profiles; strong zero-shot transfer. |
| |
| #### [16] Advanced User Credit Risk Prediction Using LightGBM, XGBoost and TabNet with SMOTEENN |
| **Yu, B., et al.** (2024). arXiv:2408.03497. |
| Combined PCA, SMOTEENN, and LightGBM for bank credit risk prediction; outperformed other models in identifying high-quality applicants under class imbalance. |
| |
| #### [17] Credit Card Fraud Detection β Classifier Selection Strategy |
| **Kulatilleke, S.** (2022). arXiv:2208.11900. |
| Data-driven classifier selection + sampling methods for imbalanced fraud detection; directly applicable to churn's class imbalance challenges. |
| |
| #### [18] Predicting Customer Churn: Extreme Gradient Boosting with Temporal Data |
| **Gregory, J.** (2018). arXiv:1802.03396. |
| Applied XGBoost with temporal feature engineering to time-series churn data; achieved top performance in large-scale competition settings. |
| |
| #### [19] Predictive Churn with the Set of Good Models |
| **Watson-Daniels, D., et al.** (2024). arXiv:2402.07745. |
| Examined prediction instability during model retraining via Rashomon set; critical for production churn model deployment and monitoring. |
| |
| #### [20] Retention Is All You Need |
| **Mohiuddin, K., et al.** (2023). arXiv:2304.03103. |
| HR Decision Support System using SHAP + what-if analysis for employee attrition; demonstrates SHAP utility for retention/churn use cases with interpretable dashboards. |
| |
| #### [21] Balancing the Scales: A Comprehensive Study on Tackling Class Imbalance |
| **(2024).** arXiv:2409.19751. |
Comprehensive study of SMOTE, Class Weights, and Decision Threshold Calibration for binary classification; found **Decision Threshold Calibration most consistently effective**, which directly guides our experimental design.
| |
| --- |
| |
| ## 5. Dataset Understanding |
| |
| ### 5.1 Dataset 1: Telco Customer Churn (IBM) |
| |
| **Source:** [aai510-group1/telco-customer-churn](https://hf.co/datasets/aai510-group1/telco-customer-churn) |
| **Type:** Fictional telecommunications company data |
| **Format:** CSV / Parquet |
| **Splits:** train / validation / test |
| |
| #### Schema Summary |
| |
| | Feature Category | Count | Key Features | |
| |-----------------|-------|-------------| |
| | **Demographics** | 7 | Age, Gender, Married, Dependents, Number of Dependents, Senior Citizen, Under 30 | |
| | **Service Usage** | 10 | Phone Service, Internet Service, Internet Type, Multiple Lines, Online Security, Online Backup, Device Protection, Tech Support, Streaming TV, Streaming Movies | |
| | **Contract & Billing** | 6 | Contract, Payment Method, Paperless Billing, Monthly Charge, Total Charges, Total Refunds | |
| | **Engagement** | 7 | Tenure (months), Number of Referrals, Referred a Friend, Offer, Satisfaction Score, Churn Score, Quarter | |
| | **Revenue** | 6 | Total Revenue, Total Long Distance Charges, Total Extra Data Charges, Avg Monthly Long Distance, Avg Monthly GB Download, CLTV | |
| | **Geographic** | 5 | City, State, Zip Code, Latitude, Longitude, Population | |
| **Target & Labels** | 4 | Churn (binary), Churn Reason (string), Churn Category (string), Customer Status |
| |
| **Total Features:** ~52 (including derived identifiers like `Lat Long`, `Customer ID`) |
| |
| #### Class Distribution (Audited) |
| |
| | Split | Total Rows | Churned (1) | Stayed (0) | Churn Rate | |
| |-------|-----------|-------------|------------|------------| |
| | Train | ~4,400 | ~1,100 | ~3,300 | ~25% | |
| | Validation | ~1,500 | ~375 | ~1,125 | ~25% | |
| | Test | ~1,500 | ~375 | ~1,125 | ~25% | |
| |
| *Note: Exact counts vary by split. The dataset exhibits moderate class imbalance (~25% churn), manageable without aggressive oversampling.* |
| |
| #### Notable Data Characteristics |
| |
| 1. **Rich categorical encoding:** Internet Type (DSL, Fiber Optic, Cable, None), Contract (Month-to-Month, One Year, Two Year), Payment Method (4 types) |
2. **Temporal granularity:** `Quarter` field (Q1–Q4) enables time-aware feature engineering
3. **Pre-computed churn scores:** `Churn Score` (0–100) and `Satisfaction Score` (1–5) are strong engineered features; risk of target leakage if not handled carefully
| 4. **CLTV integration:** `CLTV` field directly available for revenue-weighted ranking |
| 5. **Geographic features:** Latitude/longitude enable spatial clustering or geo-derived features |
| |
| #### Data Quality Flags |
| |
| - `Total Charges` has blank/missing values for zero-tenure customers (new sign-ups) |
- `Churn Reason` and `Churn Category` are populated only for churned customers; they are post-hoc labels, not usable as features
| - `Customer Status` is highly correlated with target; should be excluded or used as stratification |
| - Some categorical fields (City, State) have high cardinality (50+ states, 1,000+ cities) |
| |
| --- |
| |
| ### 5.2 Dataset 2: Bank Customer Churners |
| |
| **Source:** [ZZHHJ/bank_churners](https://hf.co/datasets/ZZHHJ/bank_churners) |
| **Type:** Credit card customer attrition data |
| **Format:** CSV / Parquet |
| **Splits:** single train split (requires manual partitioning) |
| |
| #### Schema Summary |
| |
| | Feature Category | Count | Key Features | |
| |-----------------|-------|-------------| |
| **Demographics** | 6 | Customer_Age, Gender, Dependent_count, Education_Level, Marital_Status, Income_Category |
| **Account Behavior** | 5 | Months_on_book, Total_Relationship_Count, Months_Inactive_12_mon, Contacts_Count_12_mon, Card_Category |
| **Financial** | 8 | Credit_Limit, Total_Revolving_Bal, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1, Avg_Utilization_Ratio |
| **Target** | 1 | Attrition_Flag (Existing Customer / Attrited Customer) |
| **Artifacts** | 2 | Naive_Bayes_Classifier columns (pre-computed probabilities; **must be removed** to avoid data leakage) |
|
|
**Total Columns:** 23 (19 usable features + 1 target + 1 ID + 2 NB artifacts to drop)
|
|
| #### Class Distribution (Estimated) |
|
|
| | Class | Approximate Count | Rate | |
| |-------|-------------------|------| |
| | Existing Customer | ~8,500 | ~83% | |
| | Attrited Customer | ~1,700 | ~17% | |
|
|
**Churn rate ~17%**: more imbalanced than Telco; SMOTE/ADASYN or class weighting will be necessary.
|
|
| #### Notable Data Characteristics |
|
|
1. **Quarter-over-quarter dynamics:** `Total_Amt_Chng_Q4_Q1` and `Total_Ct_Chng_Q4_Q1` capture behavioral velocity, a powerful churn signal
2. **Utilization ratio:** `Avg_Utilization_Ratio` is a strong proxy for engagement; low utilization often precedes attrition
3. **Income categories are binned:** `$60K - $80K`, `$80K - $120K`, etc.; ordinal encoding preferred
4. **Card category:** `Blue` (vast majority), `Silver`, `Gold`, `Platinum`; strong class imbalance within the feature itself
|
|
| #### Data Quality Flags |
|
|
- **Critical:** Two `Naive_Bayes_Classifier_*` columns are pre-computed churn probabilities from a baseline model. Using them as features would constitute **data leakage**; they must be dropped before any model training.
| - No explicit CLTV field; must be estimated from `Credit_Limit`, `Total_Trans_Amt`, and `Total_Trans_Ct` |
| - Single split requires manual stratified partitioning (70/15/15 or 80/10/10) |
|
|
| --- |
|
|
| ### 5.3 Cross-Dataset Comparison |
|
|
| | Attribute | Telco (IBM) | Bank Churners | |
| |-----------|-------------|---------------| |
| | **Records** | ~7,000 | ~10,000 | |
| | **Features (usable)** | ~45 | ~19 | |
| | **Churn Rate** | ~25% | ~17% | |
| | **Industry** | Telecommunications | Banking / Credit Cards | |
| | **Temporal Features** | Quarter, Tenure (months) | Months_on_book, Q4/Q1 change ratios | |
| | **CLTV Available** | Yes (explicit field) | No (must derive) | |
| | **Geographic Data** | Yes (lat/lon, city, state) | No | |
| **Pre-computed Scores** | Churn Score, Satisfaction | Naive Bayes (leakage; drop) |
| | **Class Imbalance Severity** | Moderate | High | |
| | **Primary Churn Driver** | Contract type, tenure, service usage | Inactivity, transaction decline, utilization | |
|
|
| --- |
|
|
| ## 6. Proposed Methodology |
|
|
| ### 6.1 The 7-Phase Pipeline |
|
|
```
Phase 1: Data Ingestion & Audit
        ↓
Phase 2: Preprocessing & Feature Engineering
        ↓
Phase 3: Exploratory Data Analysis (EDA)
        ↓
Phase 4: Model Training – 5-Base Stacking Ensemble
        ↓
Phase 5: Hyperparameter Optimization
        ↓
Phase 6: Evaluation, Interpretability & CLV Scoring
        ↓
Phase 7: Deployment, Monitoring & Documentation
```
|
|
| ### Phase 1: Data Ingestion & Audit |
|
|
| - Load both datasets from Hugging Face `datasets` library |
| - Compute schema validation: type checks, missing value audit, cardinality report |
| - Flag anomalous values (negative charges, impossible ages, blank `Total Charges`) |
| - Document data provenance and version hashes (DVC) |
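A minimal audit sketch for this phase is shown below. In the real pipeline the frame would come from the Hugging Face `datasets` library (e.g. `load_dataset("aai510-group1/telco-customer-churn")` converted via `.to_pandas()`); a toy frame is used here so the sketch stays self-contained:

```python
# Per-column audit: dtype, missingness, and cardinality report.
import pandas as pd

def audit(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(dropna=True),
    })

toy = pd.DataFrame({
    "Tenure": [1, 12, None, 40],
    "Contract": ["Month-to-Month", "Two Year", "One Year", "Month-to-Month"],
})
report = audit(toy)   # one row per column, ready for the data quality report
```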
|
|
| ### Phase 2: Preprocessing & Feature Engineering |
|
|
| #### 2A. Cleaning |
- **Telco:** Impute `Total Charges` blanks with `Monthly Charge × Tenure`
| - **Bank:** Drop `Naive_Bayes_Classifier_*` columns immediately |
| - Both datasets: remove ID fields (`Customer ID`, `CLIENTNUM`) |
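The cleaning steps above can be sketched as two small functions (column names follow the schema summaries in Section 5; exact names may differ slightly per dataset export):

```python
# Cleaning sketch for both datasets.
import pandas as pd

def clean_telco(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Blank Total Charges belong to zero-tenure customers; coerce to numeric
    # and impute as Monthly Charge x Tenure (0 for brand-new accounts).
    total = pd.to_numeric(df["Total Charges"], errors="coerce")
    df["Total Charges"] = total.fillna(df["Monthly Charge"] * df["Tenure"])
    return df.drop(columns=["Customer ID"], errors="ignore")

def clean_bank(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Pre-computed Naive Bayes probabilities are label leakage: drop first.
    leak = [c for c in df.columns if c.startswith("Naive_Bayes_Classifier")]
    return df.drop(columns=leak + ["CLIENTNUM"], errors="ignore")
```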
|
|
| #### 2B. Encoding |
| Feature Type | Encoding Strategy | Example Features |
|-------------|-------------------|-----------|
| Binary categorical | Label encoding (0/1) | `Gender`, `Partner`, `PhoneService` |
| Low-cardinality categorical | One-hot encoding | `Contract`, `Payment Method`, `Education_Level` |
| | High-cardinality nominal | Target encoding / CatBoost native | `City`, `State` (Telco); `Income_Category` (Bank) | |
| | Cyclical temporal | Sine/cosine encoding | `Quarter` mapped to angle | |
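A sketch of the one-hot and cyclical encodings in a `ColumnTransformer` (target/CatBoost encoding for high-cardinality columns would slot in as an additional transformer; `QuarterNum` is an assumed numeric 1–4 version of the `Quarter` field):

```python
# Encoding sketch: one-hot for low-cardinality fields, sin/cos for quarters.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

def quarter_to_cycle(q):
    # Map quarter numbers 1..4 onto the unit circle so Q4 sits next to Q1.
    angle = 2 * np.pi * (np.asarray(q, dtype=float) - 1) / 4
    return np.column_stack([np.sin(angle.ravel()), np.cos(angle.ravel())])

encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
     ("cycle", FunctionTransformer(quarter_to_cycle), ["QuarterNum"])],
    sparse_threshold=0,   # force a dense output array
)

df = pd.DataFrame({"Contract": ["Month-to-Month", "Two Year", "One Year"],
                   "QuarterNum": [1, 2, 4]})
X = encoder.fit_transform(df)   # 3 one-hot columns + sin + cos = 5 columns
```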
|
|
| #### 2C. Feature Engineering |
| - **RFM-style features (Bank):** Recency = `Months_Inactive_12_mon`, Frequency = `Total_Trans_Ct`, Monetary = `Total_Trans_Amt` |
| - **Engagement ratio (Telco):** `Satisfaction_Score / Churn_Score` as loyalty proxy |
| - **Velocity features:** Month-over-month change in charges and usage |
- **CLTV proxy (Bank):** `Credit_Limit × Avg_Utilization_Ratio × (12 - Months_Inactive_12_mon)`
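The Bank-side RFM features and the CLV proxy above reduce to a few column operations (the proxy is a heuristic, not a fitted CLV model):

```python
# Feature-engineering sketch for the Bank dataset.
import pandas as pd

def engineer_bank(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["rfm_recency"] = df["Months_Inactive_12_mon"]    # Recency
    df["rfm_frequency"] = df["Total_Trans_Ct"]          # Frequency
    df["rfm_monetary"] = df["Total_Trans_Amt"]          # Monetary
    # Heuristic CLV proxy: exposure x usage x active months.
    df["clv_proxy"] = (df["Credit_Limit"] * df["Avg_Utilization_Ratio"]
                       * (12 - df["Months_Inactive_12_mon"]))
    return df
```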
|
|
| #### 2D. Scaling & Imbalance Handling |
- Numerical features → RobustScaler (median/IQR, resistant to outliers)
- Class imbalance → SMOTEENN (SMOTE + Edited Nearest Neighbours) on training fold only; **never on validation/test**
- Class weights → `scale_pos_weight = len(negative) / len(positive)` for XGBoost/LightGBM
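The class-weight ratio is a one-liner; resampling with SMOTEENN would additionally be wrapped in an `imblearn` pipeline so it only ever touches the training fold. A sketch of the weight computation:

```python
# scale_pos_weight sketch: the negatives-to-positives ratio that
# XGBoost/LightGBM use to up-weight the minority (churn) class.
import numpy as np

def scale_pos_weight(y) -> float:
    y = np.asarray(y)
    n_neg = int((y == 0).sum())
    n_pos = int((y == 1).sum())
    return n_neg / n_pos

y_train = np.array([0] * 830 + [1] * 170)   # ~17% churn, as in the Bank data
w = scale_pos_weight(y_train)               # ~4.88
```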
|
|
| ### Phase 3: Exploratory Data Analysis (EDA) |
|
|
| - Univariate distributions (histograms, boxplots for skew detection) |
| - Bivariate analysis: churn rate by contract type, payment method, tenure bins |
| - Correlation matrix (Spearman for non-linear relationships) |
| - Feature-target mutual information scores for feature selection |
| - Geographic heatmap (Telco: churn rate by state) |
|
|
### Phase 4: Model Training – Stacking Ensemble
|
|
| #### 4A. Cross-Validation Strategy |
| - **5-fold Stratified Cross-Validation** to preserve class distribution |
| - **GroupKFold** if temporal leakage risk (same customer in multiple quarters) |
| - Out-of-fold (OOF) predictions from each base model used as meta-features |
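The OOF meta-feature construction can be sketched with `cross_val_predict` (two lightweight sklearn models stand in for the five base learners, and the data is synthetic):

```python
# OOF sketch: each base model's out-of-fold P(churn) becomes one column
# of the meta-learner's training matrix.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, weights=[0.75, 0.25], random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

base_models = [DecisionTreeClassifier(max_depth=4, random_state=42),
               LogisticRegression(max_iter=1000)]

# Rows stay aligned with the training rows; every prediction comes from a
# fold the model did not train on, so the meta-features are leakage-free.
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=cv, method="predict_proba")[:, 1]
    for m in base_models
])
meta_learner = LogisticRegression().fit(meta_X, y)
```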
|
|
| #### 4B. Base Model Training |
|
|
| | Base Model | Key Hyperparameters | Tuning Range | |
| |-----------|-------------------|--------------| |
| XGBoost | `max_depth`, `learning_rate`, `subsample`, `colsample_bytree`, `scale_pos_weight` | depth: 3–8; lr: 0.01–0.3 |
| LightGBM | `num_leaves`, `learning_rate`, `feature_fraction`, `bagging_fraction`, `is_unbalance` | leaves: 20–100; lr: 0.01–0.3 |
| CatBoost | `depth`, `learning_rate`, `iterations`, `auto_class_weights` | depth: 4–10; iterations: 200–1000 |
| MLP | `hidden_layers`, `dropout`, `batch_size`, `learning_rate` | layers: (128,64), (256,128,64); dropout: 0.2–0.5 |
| Logistic Regression | `C`, `penalty`, `solver`, `class_weight` | C: 0.001–10; penalty: l1/l2/elasticnet |
|
|
| #### 4C. Meta-Learner Training |
| - Input: 5 OOF probability vectors (one per base model) + optionally top-K original features |
| - Model: **Logistic Regression** (interpretable weights showing model contribution) OR **XGBoost** (if non-linear meta-interactions needed) |
| - Validation: Same 5-fold CV; meta-learner trained on OOF predictions, tested on hold-out |
|
|
| ### Phase 5: Hyperparameter Optimization |
|
|
| - **Optuna** with **TPESampler** (Tree-structured Parzen Estimator) |
| - 100 trials per base model; 50 trials for meta-learner |
| - Pruning: `MedianPruner` with early stopping on validation F1 |
| - Objective: Maximize F1-Score (harmonic mean of precision and recall) |
|
|
| ### Phase 6: Evaluation, Interpretability & CLV Scoring |
|
|
| #### 6A. Metrics Suite (10 metrics) |
| 1. Accuracy |
| 2. Precision (Churn class) |
| 3. Recall (Churn class) |
| 4. F1-Score |
| 5. ROC-AUC |
6. PR-AUC (Precision-Recall AUC; critical for imbalanced data)
| 7. Matthews Correlation Coefficient (MCC) |
| 8. Cohen's Kappa |
| 9. Balanced Accuracy |
| 10. Expected Calibration Error (ECE) |
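Nine of these metrics come straight from scikit-learn; ECE does not, so a simple equal-width-bin version is implemented in this sketch (the tiny arrays are illustrative only):

```python
# Metric-suite sketch for the 10 metrics above.
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             balanced_accuracy_score, cohen_kappa_score,
                             f1_score, matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    # Weighted average of |mean predicted prob - observed churn rate| per bin.
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob >= lo) & (y_prob < hi) if hi < 1.0 else (y_prob >= lo)
        if mask.any():
            ece += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return ece

y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.3, 0.8, 0.7, 0.4, 0.9])
y_pred = (y_prob >= 0.5).astype(int)

report = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_prob),
    "pr_auc": average_precision_score(y_true, y_prob),
    "mcc": matthews_corrcoef(y_true, y_pred),
    "kappa": cohen_kappa_score(y_true, y_pred),
    "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    "ece": expected_calibration_error(y_true, y_prob),
}
```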
|
|
| #### 6B. SHAP Analysis |
| - **Global:** SHAP summary plot (beeswarm) showing feature importance across full dataset |
- **Local:** SHAP force plot for individual predictions, giving customer-level actionable insights
| - **Dependence:** SHAP dependence plots for top-5 features revealing interaction effects |
|
|
| #### 6C. CLV Scoring |
| - **Telco:** Use explicit `CLTV` field; multiply by churn probability |
| - **Bank:** Derive CLV proxy; multiply by churn probability |
| - Output: Prioritized customer list sorted by RPS (Retention Priority Score) |
- Segment: Top 10% (urgent), 10–30% (high), 30–60% (medium), 60–100% (low)
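The tiering above reduces to a percentile-rank cut over RPS (the gamma-distributed scores here are synthetic placeholders):

```python
# Segmentation sketch: bucket customers into urgency tiers by RPS rank.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
scores = pd.DataFrame({"rps": rng.gamma(2.0, 500.0, size=1000)})

# pct near 0 = highest-RPS customer; tiers follow the 10/30/60 cut points.
pct = scores["rps"].rank(ascending=False, pct=True)
scores["tier"] = pd.cut(pct, bins=[0.0, 0.10, 0.30, 0.60, 1.0],
                        labels=["urgent", "high", "medium", "low"])
```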
|
|
| ### Phase 7: Deployment, Monitoring & Documentation |
|
|
| - Model serialization: `joblib` for sklearn/CatBoost, native formats for XGBoost/LightGBM |
| - Inference pipeline: `scikit-learn Pipeline` + custom transformers |
| - Monitoring: Track prediction distribution drift, feature drift, and metric decay over time |
| - Documentation: Model card with intended use, limitations, bias analysis, and SHAP summary |
|
|
| --- |
|
|
| ## 7. Implementation Strategy |
|
|
| ### 7.1 Tech Stack |
|
|
| | Layer | Technology | Purpose | |
| |-------|-----------|---------| |
| | **Data Loading** | `datasets` (HF), `pandas`, `polars` | Efficient dataset ingestion | |
| | **Preprocessing** | `scikit-learn` (Pipeline, ColumnTransformer, RobustScaler) | Reproducible feature engineering | |
| | **ML Models** | `xgboost`, `lightgbm`, `catboost`, `scikit-learn` (MLP, LR) | Base learners | |
| | **Ensemble** | `mlens` / custom stacking with `scikit-learn` | Meta-learner orchestration | |
| | **Imbalance** | `imbalanced-learn` (SMOTEENN) | Oversampling + cleaning | |
| | **Optimization** | `optuna` | Hyperparameter search | |
| | **Interpretability** | `shap` | Game-theoretic explanations | |
| | **Tracking** | `trackio` + `mlflow` | Experiment logging, metrics, artifacts | |
| | **Deployment** | `gradio` / `fastapi` + Docker | API inference and UI demo | |
| | **Versioning** | `dvc` + `git` | Data and model versioning | |
|
|
| ### 7.2 4-Week Timeline |
|
|
| | Week | Focus | Deliverables | |
| |------|-------|-------------| |
| | **Week 1** | Data audit, preprocessing, EDA | Clean notebooks; feature engineering pipeline; data quality report | |
| | **Week 2** | Base model training, hyperparameter tuning | 5 trained base models; Optuna study results; OOF prediction matrices | |
| | **Week 3** | Stacking ensemble, evaluation, SHAP analysis | Trained meta-learner; 10-metric report; SHAP dashboards; CLV scoring | |
| | **Week 4** | Cross-domain testing, deployment, documentation | Generalization report; Gradio demo; model card; final documentation | |
|
|
| ### 7.3 Code Architecture |
|
|
```
churnpredict-pro/
├── data/
│   ├── raw/                      # HF datasets (versioned with DVC)
│   ├── processed/                # Train/val/test splits
│   └── engineered/               # Feature-engineered datasets
├── notebooks/
│   ├── 01_eda_telco.ipynb
│   ├── 02_eda_bank.ipynb
│   ├── 03_feature_engineering.ipynb
│   └── 04_shap_analysis.ipynb
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── load_datasets.py      # HF datasets loader
│   │   ├── preprocess.py         # Cleaning + encoding + scaling
│   │   └── feature_engineer.py   # RFM, velocity, CLV proxy
│   ├── models/
│   │   ├── base_models.py        # XGB, LGBM, CatBoost, MLP, LR wrappers
│   │   ├── stacking_ensemble.py  # OOF + meta-learner
│   │   └── hyperparameter_search.py  # Optuna studies
│   ├── evaluation/
│   │   ├── metrics.py            # 10-metric computation
│   │   ├── shap_explainer.py     # Global + local SHAP
│   │   └── clv_scorer.py         # RPS computation
│   └── deployment/
│       ├── inference_pipeline.py
│       └── app.py                # Gradio/FastAPI interface
├── configs/
│   ├── telco_config.yaml
│   └── bank_config.yaml
├── experiments/                  # Trackio / MLflow runs
├── tests/
│   ├── test_preprocessing.py
│   └── test_models.py
├── Dockerfile
├── requirements.txt
├── dvc.yaml
└── README.md
```
|
|
| --- |
|
|
| ## 8. Experimental Design |
|
|
| ### 8.1 Five Experiments |
|
|
| | ID | Experiment | Hypothesis | Method | |
| |----|-----------|------------|--------| |
| | **E1** | Single Model Baseline | Individual models underperform ensemble due to bias-variance limitations | Train each of 5 base models standalone; report metrics | |
| **E2** | Stacking Ensemble | Meta-learner combining 5 models outperforms best single model by ≥ 3% F1 | 5-fold OOF stacking with LR meta-learner |
| **E3** | Imbalance Strategy Comparison | Threshold calibration is more effective than SMOTE for churn (per [21]) | Compare: (a) no correction, (b) SMOTEENN, (c) class weights, (d) threshold calibration |
| **E4** | Cross-Domain Transfer | Models trained on Telco generalize to Bank with ≥ 80% AUC | Train on Telco, evaluate zero-shot on Bank; then fine-tune |
| **E5** | CLV-Weighted vs. Uniform Ranking | RPS improves campaign ROI over probability-only ranking | Compare top-20% precision: P(churn) only vs. P(churn) × CLV |
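Experiment E3's decision-threshold calibration is simple to sketch: fit once, then sweep the probability cutoff on a validation split instead of defaulting to 0.5 (synthetic data, logistic regression as a stand-in model):

```python
# Threshold-calibration sketch (E3): choose the cutoff maximizing val F1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.83, 0.17], random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_val = model.predict_proba(X_val)[:, 1]

thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_val, (p_val >= t).astype(int), zero_division=0)
       for t in thresholds]
best_threshold = thresholds[int(np.argmax(f1s))]
# On imbalanced churn data the best cutoff tends to land below 0.5.
```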
|
|
| ### 8.2 Ten Evaluation Metrics |
|
|
| | # | Metric | Formula / Definition | Why It Matters for Churn | |
| |---|--------|---------------------|-------------------------| |
| | 1 | **Accuracy** | (TP + TN) / (TP + TN + FP + FN) | Overall correctness; misleading if imbalanced | |
| | 2 | **Precision (Churn)** | TP / (TP + FP) | Of predicted churners, how many actually churn? (cost of false alarms) | |
| | 3 | **Recall (Churn)** | TP / (TP + FN) | Of actual churners, how many did we catch? (cost of missed churners) | |
| 4 | **F1-Score** | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean; balances precision and recall |
| | 5 | **ROC-AUC** | Area under ROC curve | Discrimination ability across all thresholds | |
| | 6 | **PR-AUC** | Area under Precision-Recall curve | More informative than ROC-AUC for imbalanced data | |
| 7 | **MCC** | (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | Correlation between prediction and truth; robust to imbalance |
| 8 | **Cohen's Kappa** | (Observed − Expected) / (1 − Expected) | Agreement beyond chance; useful for inter-rater reliability analogies |
| | 9 | **Balanced Accuracy** | (Sensitivity + Specificity) / 2 | Average of recall on both classes; fair on imbalanced data | |
| | 10 | **ECE** | Expected Calibration Error | Measures reliability of probability outputs; critical for CLV weighting | |
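
All ten metrics can be computed in one place with scikit-learn plus a small ECE helper. This is a sketch of what `metrics.py` is assumed to implement; `average_precision_score` stands in for PR-AUC, and ECE uses simple binning of the positive-class probability:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             balanced_accuracy_score, cohen_kappa_score,
                             f1_score, matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Average |observed positive rate - mean predicted prob| over probability bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob > lo) & (y_prob <= hi)
        if mask.any():
            ece += mask.mean() * abs((y_true[mask] == 1).mean() - y_prob[mask].mean())
    return ece

def churn_metrics(y_true, y_prob, threshold=0.5):
    """Compute the 10-metric panel from true labels and churn probabilities."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_true, y_prob),
        "pr_auc": average_precision_score(y_true, y_prob),
        "mcc": matthews_corrcoef(y_true, y_pred),
        "kappa": cohen_kappa_score(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "ece": expected_calibration_error(y_true, np.asarray(y_prob)),
    }

print(churn_metrics(np.array([0, 0, 1, 1]), np.array([0.2, 0.6, 0.4, 0.9])))
```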
|
|
| ### 8.3 Statistical Rigor |
|
|
| 1. **Confidence Intervals:** All metrics reported with 95% CIs from 5-fold CV (bootstrap percentile method) |
| 2. **McNemar's Test:** Statistically compare stacking ensemble vs. best single model |
| 3. **DeLong's Test:** Compare ROC-AUC differences between models |
| 4. **Permutation Test:** Validate feature importance scores from SHAP |
| 5. **Stratification:** All splits stratified on target + `Contract` type (strongest churn predictor) to prevent distribution shift |
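
Item 2 can be sketched as an exact binomial test on the discordant predictions (a minimal version for illustration; `statsmodels.stats.contingency_tables.mcnemar` provides an equivalent, more featureful implementation):

```python
import numpy as np
from scipy.stats import binomtest

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact McNemar test: do two classifiers have different error rates?

    Only discordant pairs (one model right, the other wrong) carry information;
    under H0 they split 50/50, so an exact binomial test applies.
    """
    correct_a = np.asarray(pred_a) == np.asarray(y_true)
    correct_b = np.asarray(pred_b) == np.asarray(y_true)
    b = int(np.sum(correct_a & ~correct_b))  # A right, B wrong
    c = int(np.sum(~correct_a & correct_b))  # A wrong, B right
    if b + c == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    return binomtest(b, n=b + c, p=0.5).pvalue
```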
|
|
| ### 8.4 Reproducibility Checklist |
|
|
| - [ ] Random seeds fixed (`random_state=42`) for all stochastic operations |
| - [ ] `requirements.txt` with exact versions (via `pip freeze`) |
| - [ ] DVC tracking for data and model artifacts |
| - [ ] Git commit hash recorded with every experiment |
| - [ ] Trackio / MLflow logging of hyperparameters, metrics, and artifact paths |
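
The first checklist item can be centralized in a single helper. A sketch covering Python, NumPy, and hash seeds; deep learning backends (e.g. PyTorch) would need their own seeding calls on top of this:

```python
import os
import random

import numpy as np

def set_global_seed(seed: int = 42) -> None:
    """Fix stdlib and NumPy randomness plus hash seeding in one call."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)  # affects subprocesses only

set_global_seed(42)
```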
|
|
| --- |
|
|
| ## 9. Result Analysis |
|
|
| ### 9.1 Expected Performance |
|
|
| Based on literature benchmarks ([4] achieved 99.28% on telecom; [16] achieved strong results on bank credit risk with SMOTEENN + LightGBM), our targets are conservative and grounded: |
|
|
| Dataset | Best Single Model F1 | Stacking Ensemble F1 | Expected Δ |
|---------|---------------------|---------------------|------------|
| **Telco** | 0.82–0.84 (XGBoost/CatBoost) | **0.86–0.88** | +0.03–0.04 |
| **Bank** | 0.78–0.81 (LightGBM/XGBoost) | **0.82–0.85** | +0.03–0.04 |
|
|
| ### 9.2 SHAP Analysis β Expected Insights |
|
|
| Based on prior churn research, we anticipate the following feature importance rankings: |
|
|
| **Telco (Expected Top 5 SHAP Features):** |
1. `Contract` (Month-to-Month vs. longer) – strongest predictor
2. `Tenure in Months` – inverse relationship with churn
3. `Monthly Charge` / `Total Charges` – price sensitivity
| 4. `Internet Type` (Fiber Optic churns more than DSL) |
| 5. `Payment Method` (Electronic check = high risk) |
|
|
| **Bank (Expected Top 5 SHAP Features):** |
| 1. `Total_Trans_Ct` (transaction frequency decline) |
| 2. `Total_Trans_Amt` (monetary decline) |
| 3. `Months_Inactive_12_mon` (recency of activity) |
| 4. `Total_Relationship_Count` (cross-product engagement) |
| 5. `Contacts_Count_12_mon` (complaint/contact proxy) |
|
|
| ### 9.3 Business Impact Projections |
|
|
| Assuming a hypothetical telecom with: |
| - 100,000 customers |
| - 25% annual churn rate |
| - Average CLV = $3,000 |
| - Retention campaign cost = $50 per targeted customer |
| - Campaign success rate (if well-targeted) = 30% |
|
|
| Scenario | Customers Targeted | Campaign Cost | Churners Retained | Revenue Saved | Net ROI |
|----------|-------------------|---------------|-------------------|---------------|---------|
| Random targeting (25% churn) | 20,000 | $1,000,000 | 1,500 | $4,500,000 | 4.5× |
| Model-guided (top 20% by RPS) | 20,000 | $1,000,000 | 4,200 | $12,600,000 | **12.6×** |
|
|
*Model-guided targeting improves ROI by ~2.8× over random selection by focusing on high-value, high-probability churners.*
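
The table's arithmetic can be reproduced directly from the hypothetical parameters above; the model-guided row assumes 70% churner precision in the top 20% ranked by RPS (14,000 of 20,000 targets):

```python
def campaign_roi(n_targeted, churners_in_target, clv=3_000,
                 cost_per_customer=50, success_rate=0.30):
    """Revenue saved per campaign dollar, under the assumptions above."""
    cost = n_targeted * cost_per_customer
    retained = churners_in_target * success_rate
    revenue_saved = retained * clv
    return revenue_saved / cost

# Random targeting: 25% of 20,000 targets are churners (5,000).
print(campaign_roi(20_000, 5_000))    # 4.5
# Model-guided: 70% precision in the top 20% by RPS (14,000 churners).
print(campaign_roi(20_000, 14_000))   # 12.6
```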
|
|
| ### 9.4 Visualization Plan |
|
|
| | Visualization | Purpose | |
| |--------------|---------| |
| | ROC & PR curves (all models overlaid) | Comparative discrimination | |
| | Confusion matrices | Error type analysis | |
| | SHAP summary plot (beeswarm) | Global feature importance | |
| | SHAP force plots (sample customers) | Local explanations for stakeholders | |
| | SHAP dependence plots | Feature interaction discovery | |
| | Calibration plot (predicted vs. actual) | Probability reliability | |
| | CLV-RPS scatter plot | Segmentation visualization | |
| | Metric bar chart with 95% CIs | Statistical comparison | |
|
|
| --- |
|
|
| ## 10. Iterative Improvement |
|
|
| ### 10.1 Six Iteration Cycles |
|
|
| | Iteration | Focus | Action | Expected Outcome | |
| |-----------|-------|--------|------------------| |
| **Iter 1** | Feature Engineering Deep-Dive | Add polynomial features (tenure², charge²); interaction terms (contract × monthly charge); binning (tenure quartiles) | +1–2% F1 from non-linear feature capture |
| **Iter 2** | Advanced Sampling | Replace SMOTEENN with ADASYN + Edited Nearest Neighbours; test BorderlineSMOTE | Better synthetic sample quality near decision boundary |
| **Iter 3** | Deep Learning Augmentation | Replace MLP with TabNet or FT-Transformer for tabular deep learning; compare against MLP base | Validate whether deep tabular models improve ensemble diversity |
| **Iter 4** | Temporal Modeling | For Telco: add LSTM/GRU on quarterly customer journey sequences; for Bank: add transaction time-series | Capture temporal churn dynamics; +2–3% F1 on time-sensitive subsets |
| **Iter 5** | Ensemble Expansion | Add 6th base model (Random Forest or Extra Trees) for additional variance reduction; test blending vs. stacking | Further variance reduction; marginal F1 gain of +0.5–1% |
| | **Iter 6** | Production Hardening | Dockerize inference; add A/B test framework; build automated retraining trigger on drift detection; write full production documentation | Deployable system with monitoring, retraining, and compliance docs | |
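
Iter 6's drift-triggered retraining can be built on a population stability index (PSI) check per feature. A sketch; the 0.25 threshold is a common rule of thumb, not a value fixed by this project:

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a reference (training) sample and a live sample of one feature."""
    # Bin edges from the reference distribution's quantiles.
    edges = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1))
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    # Clip live values into the reference range so all land in a bin.
    a_frac = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)  # avoid log(0)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, 10_000)
drifted = rng.normal(1.0, 1.0, 10_000)  # mean shifted by one standard deviation
if population_stability_index(reference, drifted) > 0.25:  # rule-of-thumb threshold
    print("major drift detected: trigger retraining")
```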
|
|
| ### 10.2 Production Documentation Deliverables |
|
|
| | Document | Contents | Audience | |
| |----------|----------|----------| |
| | **Model Card** | Intended use, training data summary, performance metrics, limitations, bias assessment, ethical considerations | Data scientists, regulators | |
| | **API Documentation** | Endpoint specs, request/response schemas, rate limits, error codes | Engineering teams | |
| | **SHAP Dashboard Guide** | How to read force plots, summary plots, and dependence plots | Business stakeholders, customer success | |
| | **Retention Playbook** | How to act on RPS segments; recommended interventions per churn reason | Marketing, customer success | |
| | **Retraining SOP** | When and how to retrain; drift detection thresholds; rollback procedures | MLOps, data engineering | |
| | **Compliance Checklist** | GDPR Article 22 (automated decision-making), CCPA, internal audit requirements | Legal, compliance | |
|
|
| --- |
|
|
| ## Appendix A: Key Equations |
|
|
| **Retention Priority Score:** |
| $$ |
| \text{RPS}_i = P(\text{churn}_i) \times \text{CLV}_i |
| $$ |
| |
| **F1-Score:** |
| $$ |
| F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} |
| $$ |
| |
| **Matthews Correlation Coefficient:** |
| $$ |
| \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} |
| $$ |
| |
| **Expected Calibration Error:** |
| $$ |
| \text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right| |
| $$ |
| |
| --- |
| |
*Document compiled for the ChurnPredict Pro project. All datasets verified on Hugging Face Hub. All 21 references span peer-reviewed and high-impact arXiv publications from 2016–2024.*
| |