ChurnPredict Pro: A Stacking Ensemble Framework for Customer Churn Prediction with Explainable AI and CLV Scoring
Subtitle: End-to-End Machine Learning Pipeline for Telecommunications and Banking Customer Retention – Combining Gradient Boosting, Neural Networks, and Game-Theoretic Interpretability
Table of Contents
- Problem Statement
- Idea of Solution
- Objectives
- Literature Review & References
- Dataset Understanding
- Proposed Methodology
- Implementation Strategy
- Experimental Design
- Result Analysis
- Iterative Improvement
1. Problem Statement
1.1 Business Context
Customer churn – the loss of clients to competitors or market attrition – is one of the most financially consequential challenges in subscription-based and service-oriented industries. In telecommunications, acquiring a new customer costs 5–25× more than retaining an existing one (industry estimates, 2024). In banking, customer attrition erodes lifetime value portfolios and damages brand equity. For both sectors, even a 1% reduction in churn can translate to millions in retained revenue.
Current retention strategies suffer from two critical gaps:
- Reactive approaches: Firms typically respond to churn after it occurs, through win-back campaigns that are expensive and low-yield.
- Black-box predictions: Machine learning models deployed in production often lack interpretability, making it impossible for marketing and customer-success teams to act on model outputs with confidence.
1.2 Technical Challenges
| Challenge | Description | Impact |
|---|---|---|
| Class Imbalance | Churners typically represent 10–30% of the customer base. Standard accuracy metrics are misleading. | High false-negative rates; missed at-risk customers |
| Feature Heterogeneity | Datasets mix categorical (contract type, payment method), numerical (tenure, charges), and temporal features (quarter, month-on-book). | Preprocessing complexity; risk of data leakage |
| Concept Drift | Customer behavior patterns shift seasonally and with market conditions. Models degrade without retraining. | Production model staleness; declining precision |
| Interpretability vs. Performance Trade-off | High-accuracy ensembles are often opaque. Explainable models (e.g., logistic regression) underperform on tabular data. | Regulatory non-compliance (GDPR Article 22); low stakeholder trust |
| Multi-Domain Generalization | Models trained on telecom data fail on banking data due to domain shift in feature distributions. | Siloed, non-reusable models per industry |
1.3 Gaps in Existing Solutions
- Single-model reliance: Most production churn models deploy a single classifier (XGBoost or logistic regression), missing the variance-reduction benefits of ensemble diversity.
- No CLV integration: Churn predictions are binary; they do not incorporate which churners are most valuable to retain, leading to inefficient marketing spend.
- Weak experimental rigor: Many published churn studies use a single train/test split without cross-validation, statistical testing, or confidence intervals on metrics.
- Dataset isolation: Telco and bank churn datasets are studied separately; few works evaluate cross-domain transfer or unified pipelines.
2. Idea of Solution
2.1 Architecture Overview
We propose ChurnPredict Pro, a stacking ensemble architecture that combines the complementary strengths of five diverse base learners under a meta-learner. The design philosophy is:
"Diversity in inductive bias reduces variance; interpretability in the meta-layer preserves actionability."
2.2 The 5-Model Stacking Ensemble
```
┌───────────────────────────────────────────────────────────────────┐
│               CHURNPREDICT PRO – STACKING ENSEMBLE                │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐   │
│  │ XGBoost  │ │ LightGBM │ │ CatBoost │ │   MLP    │ │   LR   │   │
│  │  (GBDT)  │ │  (GBDT)  │ │(Ordered) │ │  (Deep)  │ │ (Base) │   │
│  │  Base 1  │ │  Base 2  │ │  Base 3  │ │  Base 4  │ │ Base 5 │   │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └───┬────┘   │
│       │            │            │            │           │        │
│       └────────────┴──────┬─────┴────────────┴───────────┘        │
│                           │                                       │
│                 ┌─────────▼─────────┐                             │
│                 │   META-LEARNER    │                             │
│                 │  (Logistic Reg    │                             │
│                 │   / XGBoost)      │                             │
│                 └─────────┬─────────┘                             │
│                           │                                       │
│                 ┌─────────▼─────────┐                             │
│                 │    CLV SCORING    │                             │
│                 │  + SHAP EXPLAINER │                             │
│                 └───────────────────┘                             │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘
```
2.3 Why These 5 Base Models?
| Model | Inductive Bias | Strength on Churn Data | Weakness Mitigated by Ensemble |
|---|---|---|---|
| XGBoost | Greedy gradient boosting with regularization | Best-in-class on sparse/tabular data; handles missing values natively | Prone to overfitting on small datasets |
| LightGBM | Histogram-based leaf-wise boosting | Faster training; GOSS sampling for large data | Leaf-wise can overfit; GOSS introduces bias |
| CatBoost | Ordered boosting + categorical encoding | Native categorical feature handling; reduces target leakage | Slower than LightGBM; ordered boosting complexity |
| MLP (Deep) | Non-linear feature interactions | Captures complex feature cross-products | Needs more data; less interpretable |
| Logistic Regression | Linear decision boundary | Fast, interpretable baseline; L1 regularization for feature selection | Cannot model non-linear relationships |
The meta-learner (Logistic Regression or a shallow XGBoost) learns optimal weights for combining the five base models' predictions, leveraging their uncorrelated errors.
2.4 CLV-Weighted Scoring
Instead of ranking customers by churn probability alone, we multiply P(churn) by estimated CLV to produce a Retention Priority Score (RPS):

RPS = P(churn) × CLV

This ensures retention campaigns target high-value at-risk customers, maximizing ROI.
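As a minimal sketch, the RPS computation reduces to a single derived column; the DataFrame and the column names `churn_prob` and `clv` are illustrative assumptions, not part of the project code:

```python
import pandas as pd

def retention_priority_score(df: pd.DataFrame,
                             churn_col: str = "churn_prob",
                             clv_col: str = "clv") -> pd.DataFrame:
    """Rank customers by P(churn) x CLV instead of churn probability alone."""
    out = df.copy()
    out["rps"] = out[churn_col] * out[clv_col]
    return out.sort_values("rps", ascending=False)

# Hypothetical example: a moderately risky, high-value customer (B) outranks
# a high-risk, low-value one (A).
customers = pd.DataFrame({
    "customer_id": ["A", "B", "C"],
    "churn_prob": [0.9, 0.5, 0.1],
    "clv": [500.0, 4000.0, 3000.0],
})
ranked = retention_priority_score(customers)
# B: 0.5 * 4000 = 2000 > A: 0.9 * 500 = 450 > C: 0.1 * 3000 = 300
```

This is the entire difference between probability-only and value-weighted targeting: the sort key changes, the model does not.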
3. Objectives
3.1 Primary Goals
| ID | Objective | Metric Target | Success Criterion |
|---|---|---|---|
| P1 | Build a stacking ensemble that outperforms any single base model | F1-Score | ΔF1 ≥ +0.03 over best single model |
| P2 | Achieve high recall on churn class (minimize false negatives) | Recall@Churn | ≥ 0.85 on both datasets |
| P3 | Deliver actionable model explanations per customer | SHAP summary | Top-5 features identified per prediction |
| P4 | Rank customers by retention value, not just churn risk | AUC-PR weighted by CLV | ROC-AUC ≥ 0.90 |
3.2 Secondary Goals
| ID | Objective | Metric Target |
|---|---|---|
| S1 | Evaluate cross-domain generalization (Telco → Bank, Bank → Telco) | Transfer AUC ≥ 0.80 |
| S2 | Achieve sub-second inference latency for batch scoring | ≤ 500 ms per 1,000 records |
| S3 | Deploy a reproducible, version-controlled pipeline | Docker + DVC + CI/CD |
| S4 | Document model behavior for regulatory compliance (GDPR/CCPA) | Full SHAP + model card |
3.3 Success Criteria Summary
- Model Performance: F1-Score > 0.85, ROC-AUC > 0.90, PR-AUC > 0.80 on both datasets
- Business Impact: Identify top 20% at-risk customers with ≥ 70% precision
- Interpretability: Every prediction accompanied by SHAP force plot; global SHAP summary for stakeholder dashboards
- Robustness: 5-fold stratified CV with 95% confidence intervals on all metrics
4. Literature Review & References
4.1 Category Overview
| Category | Count | Papers |
|---|---|---|
| Ensemble / Boosting Methods | 4 | [1–4] |
| SHAP / LIME Interpretability | 3 | [5–7] |
| Deep Learning for Churn | 3 | [8–10] |
| CLV / Profit-Driven Churn | 3 | [11–13] |
| Financial / Bank Churn | 4 | [14–17] |
| Survey / Benchmark / Foundation | 4 | [18–21] |
| Total | 21 | [1–21] |
4.2 Full References (2016–2024)
[1] XGBoost: A Scalable Tree Boosting System
Chen, T., & Guestrin, C. (2016). KDD. arXiv:1603.02754. Introduced sparsity-aware algorithms and weighted quantile sketch for gradient boosting. Became the dominant algorithm for tabular churn prediction tasks worldwide.
[2] Tabular Data: Deep Learning is Not All You Need
Shwartz-Ziv, R., & Armon, A. (2021). arXiv:2106.03253. Rigorous comparison showing XGBoost outperforms recent deep learning models on tabular data; ensembling deep models with XGBoost further improves performance.
[3] CatBoost: Unbiased Boosting with Categorical Features
Prokhorenkova, L., et al. (2017). arXiv:1706.09516. Ordered boosting and novel categorical feature processing; outperforms other boosting implementations on datasets with high-cardinality categorical churn predictors.
[4] Enhancing Customer Churn Prediction: An Adaptive Ensemble Learning Approach
Shaikhsurab, S., & Magadum, S. (2024). arXiv:2408.16284. Adaptive ensemble combining XGBoost, LightGBM, LSTM, MLP, and SVM with stacking + meta-feature generation; achieved 99.28% accuracy on telecom churn datasets.
[5] A Unified Approach to Interpreting Model Predictions (SHAP)
Lundberg, S. M., & Lee, S.-I. (2017). NeurIPS. arXiv:1705.07874. Proposed SHAP values as a unified measure of feature importance based on game-theoretic Shapley values; unified six existing explanation methods.
[6] "Why Should I Trust You?": Explaining Predictions of Any Classifier (LIME)
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). KDD. arXiv:1602.04938. Introduced LIME to explain any classifier locally via interpretable surrogate models; foundational for churn model explainability and regulatory compliance.
[7] XAI Handbook: Towards a Unified Framework for Explainable AI
Palacio, D. G., et al. (2021). arXiv:2105.06677. Provides theoretical framework unifying XAI terminology (LIME, SHAP, Grad-CAM, etc.); essential for regulatory compliance and method comparison in churn explainability.
[8] Early Churn Prediction from Large-Scale User-Product Interaction Time Series
Bhattacharjee, A., Thukral, K., & Patil, C. (2023). arXiv:2309.14390. Applied multivariate time series classification with deep neural networks to fantasy sports churn; scales to 10⁸ users, demonstrating the feasibility of deep learning at scale.
[9] Modelling Customer Churn for the Retail Industry in a Deep Learning Sequential Framework
Equihua, C., et al. (2023). arXiv:2304.00575. Deep survival framework using recurrent neural networks for non-contractual retail churn; avoids extensive feature engineering through learned representations.
[10] Churn Reduction via Distillation
Jiang, Y., et al. (2021). arXiv:2106.02654. Showed model distillation reduces predictive churn (model instability during retraining) while maintaining accuracy across FC, CNN, and transformer architectures.
[11] OptDist: Learning Optimal Distribution for Customer Lifetime Value Prediction
Weng, S., et al. (2024). arXiv:2408.08585. Proposed OptDist with distribution learning/selection modules; adaptively selects optimal sub-distributions for CLTV prediction on public and industrial datasets.
[12] Customer Lifetime Value Prediction with Uncertainty Estimation Using Monte Carlo Dropout
Cao, Y., Xu, Y., & Yang, Q. (2024). arXiv:2411.15944. Enhanced neural network CLTV prediction with Monte Carlo Dropout for uncertainty quantification; improved Top-5% MAPE significantly.
[13] A Predict-and-Optimize Approach to Profit-Driven Churn Prevention
GΓ³mez-Vargas, E., Maldonado, S., & Vairetti, S. (2023). arXiv:2310.07047. First predict-and-optimize approach for churn prevention using individual CLVs (not averages); regret minimization via SGD; tested on 12 real-world datasets.
[14] Dynamic Customer Embeddings for Financial Service Applications
Chitsazan, N., et al. (2021). arXiv:2106.11880. DCE framework uses customer digital activity + financial context for intent/fraud/call-center prediction; financial services benchmark for learned representations.
[15] FinPT: Financial Risk Prediction with Profile Tuning on Pretrained Foundation Models
Yin, H., et al. (2023). arXiv:2308.00065. Introduced FinBench dataset + FinPT method for financial risk prediction (default, fraud, churn) using LLM-generated customer profiles; strong zero-shot transfer.
[16] Advanced User Credit Risk Prediction Using LightGBM, XGBoost and TabNet with SMOTEENN
Yu, B., et al. (2024). arXiv:2408.03497. Combined PCA, SMOTEENN, and LightGBM for bank credit risk prediction; outperformed other models in identifying high-quality applicants under class imbalance.
[17] Credit Card Fraud Detection β Classifier Selection Strategy
Kulatilleke, S. (2022). arXiv:2208.11900. Data-driven classifier selection + sampling methods for imbalanced fraud detection; directly applicable to churn's class imbalance challenges.
[18] Predicting Customer Churn: Extreme Gradient Boosting with Temporal Data
Gregory, J. (2018). arXiv:1802.03396. Applied XGBoost with temporal feature engineering to time-series churn data; achieved top performance in large-scale competition settings.
[19] Predictive Churn with the Set of Good Models
Watson-Daniels, D., et al. (2024). arXiv:2402.07745. Examined prediction instability during model retraining via Rashomon set; critical for production churn model deployment and monitoring.
[20] Retention Is All You Need
Mohiuddin, K., et al. (2023). arXiv:2304.03103. HR Decision Support System using SHAP + what-if analysis for employee attrition; demonstrates SHAP utility for retention/churn use cases with interpretable dashboards.
[21] Balancing the Scales: A Comprehensive Study on Tackling Class Imbalance
(2024). arXiv:2409.19751. Comprehensive study of SMOTE, Class Weights, and Decision Threshold Calibration for binary classification; Decision Threshold Calibration most consistently effective, which directly guides our experimental design.
5. Dataset Understanding
5.1 Dataset 1: Telco Customer Churn (IBM)
- Source: `aai510-group1/telco-customer-churn`
- Type: fictional telecommunications company data
- Format: CSV / Parquet
- Splits: train / validation / test
Schema Summary
| Feature Category | Count | Key Features |
|---|---|---|
| Demographics | 7 | Age, Gender, Married, Dependents, Number of Dependents, Senior Citizen, Under 30 |
| Service Usage | 10 | Phone Service, Internet Service, Internet Type, Multiple Lines, Online Security, Online Backup, Device Protection, Tech Support, Streaming TV, Streaming Movies |
| Contract & Billing | 6 | Contract, Payment Method, Paperless Billing, Monthly Charge, Total Charges, Total Refunds |
| Engagement | 7 | Tenure (months), Number of Referrals, Referred a Friend, Offer, Satisfaction Score, Churn Score, Quarter |
| Revenue | 6 | Total Revenue, Total Long Distance Charges, Total Extra Data Charges, Avg Monthly Long Distance, Avg Monthly GB Download, CLTV |
| Geographic | 6 | City, State, Zip Code, Latitude, Longitude, Population |
| Target | 4 | Churn (binary), Churn Reason (string), Churn Category (string), Customer Status |
Total Features: ~52 (including derived identifiers like Lat Long, Customer ID)
Class Distribution (Audited)
| Split | Total Rows | Churned (1) | Stayed (0) | Churn Rate |
|---|---|---|---|---|
| Train | ~4,400 | ~1,100 | ~3,300 | ~25% |
| Validation | ~1,500 | ~375 | ~1,125 | ~25% |
| Test | ~1,500 | ~375 | ~1,125 | ~25% |
Note: Exact counts vary by split. The dataset exhibits moderate class imbalance (~25% churn), manageable without aggressive oversampling.
Notable Data Characteristics
- Rich categorical encoding: Internet Type (DSL, Fiber Optic, Cable, None), Contract (Month-to-Month, One Year, Two Year), Payment Method (4 types)
- Temporal granularity: `Quarter` field (Q1–Q4) enables time-aware feature engineering
- Pre-computed churn scores: `Churn Score` (0–100) and `Satisfaction Score` (1–5) are strong engineered features; risk of target leakage if not handled carefully
- CLTV integration: `CLTV` field directly available for revenue-weighted ranking
- Geographic features: latitude/longitude enable spatial clustering or geo-derived features
Data Quality Flags
- `Total Charges` has blank/missing values for zero-tenure customers (new sign-ups)
- `Churn Reason` and `Churn Category` are populated only for churned customers; post-hoc labels, not usable as features
- `Customer Status` is highly correlated with the target; should be excluded or used only for stratification
- Some categorical fields (City, State) have high cardinality (50+ states, 1,000+ cities)
5.2 Dataset 2: Bank Customer Churners
- Source: `ZZHHJ/bank_churners`
- Type: credit card customer attrition data
- Format: CSV / Parquet
- Splits: single train split (requires manual partitioning)
Schema Summary
| Feature Category | Count | Key Features |
|---|---|---|
| Demographics | 6 | Customer_Age, Gender, Dependent_count, Education_Level, Marital_Status, Income_Category |
| Account Behavior | 5 | Months_on_book, Total_Relationship_Count, Months_Inactive_12_mon, Contacts_Count_12_mon, Card_Category |
| Financial | 8 | Credit_Limit, Total_Revolving_Bal, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1, Avg_Utilization_Ratio |
| Target | 1 | Attrition_Flag (Existing Customer / Attrited Customer) |
| Artifacts | 2 | Naive_Bayes_Classifier_* columns (pre-computed probabilities; must be removed to avoid data leakage) |
Total Columns: 23 (19 usable features + 1 target + 1 ID + 2 NB artifacts to drop)
Class Distribution (Estimated)
| Class | Approximate Count | Rate |
|---|---|---|
| Existing Customer | ~8,500 | ~83% |
| Attrited Customer | ~1,700 | ~17% |
Churn rate ~17%, more imbalanced than Telco; SMOTE/ADASYN or class weighting will be necessary.
Notable Data Characteristics
- Quarter-over-quarter dynamics: `Total_Amt_Chng_Q4_Q1` and `Total_Ct_Chng_Q4_Q1` capture behavioral velocity; powerful churn signals
- Utilization ratio: `Avg_Utilization_Ratio` is a strong proxy for engagement; low utilization often precedes attrition
- Income categories are binned: `$60K - $80K`, `$80K - $120K`, etc.; ordinal encoding preferred
- Card category: `Blue` (vast majority), `Silver`, `Gold`, `Platinum`; strong class imbalance within the feature itself
Data Quality Flags
- Critical: two `Naive_Bayes_Classifier_*` columns are pre-computed churn probabilities from a baseline model. Using them as features would constitute data leakage; they must be dropped before any model training.
- No explicit CLTV field; must be estimated from `Credit_Limit`, `Total_Trans_Amt`, and `Total_Trans_Ct`
- Single split requires manual stratified partitioning (70/15/15 or 80/10/10)
5.3 Cross-Dataset Comparison
| Attribute | Telco (IBM) | Bank Churners |
|---|---|---|
| Records | ~7,000 | ~10,000 |
| Features (usable) | ~45 | ~19 |
| Churn Rate | ~25% | ~17% |
| Industry | Telecommunications | Banking / Credit Cards |
| Temporal Features | Quarter, Tenure (months) | Months_on_book, Q4/Q1 change ratios |
| CLTV Available | Yes (explicit field) | No (must derive) |
| Geographic Data | Yes (lat/lon, city, state) | No |
| Pre-computed Scores | Churn Score, Satisfaction | Naive Bayes (leakage; drop) |
| Class Imbalance Severity | Moderate | High |
| Primary Churn Driver | Contract type, tenure, service usage | Inactivity, transaction decline, utilization |
6. Proposed Methodology
6.1 The 7-Phase Pipeline
```
Phase 1: Data Ingestion & Audit
        ↓
Phase 2: Preprocessing & Feature Engineering
        ↓
Phase 3: Exploratory Data Analysis (EDA)
        ↓
Phase 4: Model Training – 5-Base Stacking Ensemble
        ↓
Phase 5: Hyperparameter Optimization
        ↓
Phase 6: Evaluation, Interpretability & CLV Scoring
        ↓
Phase 7: Deployment, Monitoring & Documentation
```
Phase 1: Data Ingestion & Audit
- Load both datasets with the Hugging Face `datasets` library
- Compute schema validation: type checks, missing-value audit, cardinality report
- Flag anomalous values (negative charges, impossible ages, blank `Total Charges`)
- Document data provenance and version hashes (DVC)
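The audit step can be sketched as a small pandas routine; the cardinality threshold and the two-column example frame below are illustrative assumptions, not the project's actual audit code:

```python
import pandas as pd

def audit_dataset(df: pd.DataFrame, high_cardinality_threshold: int = 50) -> pd.DataFrame:
    """Per-column audit: dtype, missing count, and a high-cardinality flag."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
    })
    report["high_cardinality"] = report["n_unique"] > high_cardinality_threshold
    return report

# Illustrative frame reproducing the zero-tenure quirk: Total Charges stored
# as a blank string in the raw CSV.
raw = pd.DataFrame({
    "Tenure": [0, 12, 24],
    "Total Charges": [" ", "840.5", "1920.0"],
})
# Coerce blanks to NaN before auditing so the missing-value count catches them.
raw["Total Charges"] = pd.to_numeric(raw["Total Charges"], errors="coerce")
report = audit_dataset(raw)
```

The same report feeds the anomaly-flagging step: any column with unexpected missing counts or cardinality is reviewed before preprocessing.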
Phase 2: Preprocessing & Feature Engineering
2A. Cleaning
- Telco: impute `Total Charges` blanks with `Monthly Charge × Tenure`
- Bank: drop `Naive_Bayes_Classifier_*` columns immediately
- Both datasets: remove ID fields (`Customer ID`, `CLIENTNUM`)
2B. Encoding
| Feature Type | Encoding Strategy | Example Features |
|---|---|---|
| Binary categorical | Label encoding (0/1) | Gender, Partner, PhoneService |
| Low-cardinality categorical | One-hot encoding | Contract, Payment Method, Education_Level |
| High-cardinality nominal | Target encoding / CatBoost native | City, State (Telco) |
| Cyclical temporal | Sine/cosine encoding | Quarter mapped to angle |
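The cyclical row deserves a concrete illustration: mapping Q1–Q4 onto the unit circle makes Q4 and Q1 adjacent rather than four steps apart. The function name and the `"Q1"`-style string format are assumptions for this sketch:

```python
import numpy as np
import pandas as pd

def encode_quarter_cyclical(df: pd.DataFrame, col: str = "Quarter") -> pd.DataFrame:
    """Sine/cosine encoding: place the 4 quarters evenly on the unit circle."""
    out = df.copy()
    quarter_num = out[col].str.lstrip("Q").astype(int)   # "Q3" -> 3
    angle = 2 * np.pi * (quarter_num - 1) / 4            # 4 evenly spaced positions
    out[f"{col}_sin"] = np.sin(angle)
    out[f"{col}_cos"] = np.cos(angle)
    return out.drop(columns=[col])

encoded = encode_quarter_cyclical(pd.DataFrame({"Quarter": ["Q1", "Q2", "Q3", "Q4"]}))
# Q1 -> (sin 0, cos 1); Q3 sits diametrically opposite at (sin ~0, cos -1)
```

Under this encoding the Euclidean distance between Q4 and Q1 equals the distance between any other adjacent pair, which an integer encoding cannot express.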
2C. Feature Engineering
- RFM-style features (Bank): Recency = `Months_Inactive_12_mon`, Frequency = `Total_Trans_Ct`, Monetary = `Total_Trans_Amt`
- Engagement ratio (Telco): `Satisfaction Score / Churn Score` as a loyalty proxy
- Velocity features: month-over-month change in charges and usage
- CLTV proxy (Bank): `Credit_Limit × Avg_Utilization_Ratio × (12 - Months_Inactive_12_mon)`
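The RFM and CLV-proxy features translate almost directly into pandas. This sketch assumes the BankChurners column names listed in Section 5.2; the derived feature names (`recency`, `clv_proxy`, etc.) are our own:

```python
import pandas as pd

def engineer_bank_features(df: pd.DataFrame) -> pd.DataFrame:
    """RFM-style features plus the CLV proxy described in Phase 2C."""
    out = df.copy()
    out["recency"] = out["Months_Inactive_12_mon"]       # R: months of inactivity
    out["frequency"] = out["Total_Trans_Ct"]             # F: transaction count
    out["monetary"] = out["Total_Trans_Amt"]             # M: transaction volume
    out["clv_proxy"] = (out["Credit_Limit"]
                        * out["Avg_Utilization_Ratio"]
                        * (12 - out["Months_Inactive_12_mon"]))
    return out

sample = pd.DataFrame({
    "Months_Inactive_12_mon": [2],
    "Total_Trans_Ct": [60],
    "Total_Trans_Amt": [4500.0],
    "Credit_Limit": [10000.0],
    "Avg_Utilization_Ratio": [0.3],
})
features = engineer_bank_features(sample)
# clv_proxy = 10000 * 0.3 * (12 - 2) = 30000
```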
2D. Scaling & Imbalance Handling
- Numerical features → RobustScaler (median/IQR, resistant to outliers)
- Class imbalance → SMOTEENN (SMOTE + Edited Nearest Neighbours) on training folds only; never on validation/test
- Class weights → `scale_pos_weight = len(negative) / len(positive)` for XGBoost/LightGBM
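For the class-weight formula, a minimal helper (the function name is ours) shows the ratio that would be passed to XGBoost/LightGBM as `scale_pos_weight`; SMOTEENN itself comes from `imbalanced-learn` and is applied to training folds only, so it is omitted from this sketch:

```python
from collections import Counter

def scale_pos_weight(labels) -> float:
    """Negative/positive ratio, the value XGBoost and LightGBM expect
    for their scale_pos_weight / class-weight parameters."""
    counts = Counter(labels)
    return counts[0] / counts[1]

# Illustrative label vector with ~17% churn, similar to the bank dataset
labels = [0] * 83 + [1] * 17
w = scale_pos_weight(labels)   # 83 / 17, roughly 4.9
```

In effect, each churner's gradient contribution is up-weighted by ~4.9× so the loss no longer rewards predicting the majority class.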
Phase 3: Exploratory Data Analysis (EDA)
- Univariate distributions (histograms, boxplots for skew detection)
- Bivariate analysis: churn rate by contract type, payment method, tenure bins
- Correlation matrix (Spearman, to capture monotonic non-linear relationships)
- Feature-target mutual information scores for feature selection
- Geographic heatmap (Telco: churn rate by state)
Phase 4: Model Training β Stacking Ensemble
4A. Cross-Validation Strategy
- 5-fold Stratified Cross-Validation to preserve class distribution
- GroupKFold if temporal leakage risk (same customer in multiple quarters)
- Out-of-fold (OOF) predictions from each base model used as meta-features
4B. Base Model Training
| Base Model | Key Hyperparameters | Tuning Range |
|---|---|---|
| XGBoost | `max_depth`, `learning_rate`, `subsample`, `colsample_bytree`, `scale_pos_weight` | depth: 3–8; lr: 0.01–0.3 |
| LightGBM | `num_leaves`, `learning_rate`, `feature_fraction`, `bagging_fraction`, `is_unbalance` | leaves: 20–100; lr: 0.01–0.3 |
| CatBoost | `depth`, `learning_rate`, `iterations`, `auto_class_weights` | depth: 4–10; iterations: 200–1000 |
| MLP | `hidden_layers`, `dropout`, `batch_size`, `learning_rate` | layers: (128, 64), (256, 128, 64); dropout: 0.2–0.5 |
| Logistic Regression | `C`, `penalty`, `solver`, `class_weight` | C: 0.001–10; penalty: l1/l2/elasticnet |
4C. Meta-Learner Training
- Input: 5 OOF probability vectors (one per base model) + optionally top-K original features
- Model: Logistic Regression (interpretable weights showing model contribution) OR XGBoost (if non-linear meta-interactions needed)
- Validation: Same 5-fold CV; meta-learner trained on OOF predictions, tested on hold-out
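The OOF-to-meta-learner flow can be sketched with scikit-learn's `cross_val_predict`. Two stand-in base learners replace the five production models, and the data is synthetic, but the mechanics are identical: every meta-feature row was predicted by a model that never saw that row during training.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Stand-ins for the real base learners (XGBoost, LightGBM, CatBoost, MLP, LR)
base_models = {
    "gbdt": GradientBoostingClassifier(random_state=42),
    "rf": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Synthetic imbalanced data standing in for the churn datasets
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.75], random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Out-of-fold churn probabilities become the meta-learner's feature matrix
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=cv, method="predict_proba")[:, 1]
    for m in base_models.values()
])
meta_learner = LogisticRegression().fit(meta_X, y)
```

With a logistic-regression meta-learner, the fitted coefficients directly show how much each base model contributes to the blend, which is exactly the interpretability argument made above.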
Phase 5: Hyperparameter Optimization
- Optuna with TPESampler (Tree-structured Parzen Estimator)
- 100 trials per base model; 50 trials for meta-learner
- Pruning: `MedianPruner` with early stopping on validation F1
- Objective: maximize F1-Score (harmonic mean of precision and recall)
Phase 6: Evaluation, Interpretability & CLV Scoring
6A. Metrics Suite (10 metrics)
- Accuracy
- Precision (Churn class)
- Recall (Churn class)
- F1-Score
- ROC-AUC
- PR-AUC (Precision-Recall AUC; critical for imbalanced data)
- Matthews Correlation Coefficient (MCC)
- Cohen's Kappa
- Balanced Accuracy
- Expected Calibration Error (ECE)
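Most of these metrics ship with scikit-learn; ECE does not, so a hand-rolled version is shown below on a tiny hand-made example. The equal-width binning scheme is one common choice for ECE, not the only one:

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    """ECE: bin-weighted gap between mean predicted probability and
    observed positive rate within each probability bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # Last bin is closed on the right so prob == 1.0 is counted
        mask = (y_prob >= lo) & ((y_prob < hi) if hi < 1.0 else (y_prob <= hi))
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.3, 0.8, 0.7, 0.4, 0.9, 0.2])
y_pred = (y_prob >= 0.5).astype(int)   # every prediction happens to be correct

report = {
    "f1": f1_score(y_true, y_pred),
    "mcc": matthews_corrcoef(y_true, y_pred),
    "balanced_acc": balanced_accuracy_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_prob),
    "ece": expected_calibration_error(y_true, y_prob),
}
```

Note that the toy predictor is perfectly discriminative (all ranking metrics hit 1.0) yet still has nonzero ECE: discrimination and calibration are distinct properties, which is why ECE earns its own slot in the suite.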
6B. SHAP Analysis
- Global: SHAP summary plot (beeswarm) showing feature importance across full dataset
- Local: SHAP force plot for individual predictions; customer-level actionable insights
- Dependence: SHAP dependence plots for top-5 features revealing interaction effects
6C. CLV Scoring
- Telco: use the explicit `CLTV` field; multiply by churn probability
- Bank: derive a CLV proxy; multiply by churn probability
- Output: prioritized customer list sorted by RPS (Retention Priority Score)
- Segment: top 10% (urgent), 10–30% (high), 30–60% (medium), 60–100% (low)
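The four-tier segmentation can be expressed with percentile ranks. The cut points follow the segments listed above; the function name and the synthetic RPS values are assumptions for this sketch:

```python
import numpy as np
import pandas as pd

def segment_by_rps(rps: pd.Series) -> pd.Series:
    """Assign tiers from descending RPS percentile rank:
    top 10% urgent, next 20% high, next 30% medium, rest low."""
    pct = rps.rank(pct=True, ascending=False)   # smallest pct = highest RPS
    return pd.cut(pct,
                  bins=[0.0, 0.10, 0.30, 0.60, 1.0],
                  labels=["urgent", "high", "medium", "low"])

# 100 synthetic customers with distinct RPS values 1..100
rps = pd.Series(np.arange(1, 101, dtype=float))
tiers = segment_by_rps(rps)
```

Percentile-based cuts keep tier sizes stable as the score distribution drifts, which matters when campaign budgets are fixed per tier.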
Phase 7: Deployment, Monitoring & Documentation
- Model serialization: `joblib` for scikit-learn/CatBoost; native formats for XGBoost/LightGBM
- Inference pipeline: scikit-learn `Pipeline` + custom transformers
- Monitoring: track prediction distribution drift, feature drift, and metric decay over time
- Documentation: Model card with intended use, limitations, bias analysis, and SHAP summary
7. Implementation Strategy
7.1 Tech Stack
| Layer | Technology | Purpose |
|---|---|---|
| Data Loading | `datasets` (HF), `pandas`, `polars` | Efficient dataset ingestion |
| Preprocessing | `scikit-learn` (Pipeline, ColumnTransformer, RobustScaler) | Reproducible feature engineering |
| ML Models | `xgboost`, `lightgbm`, `catboost`, `scikit-learn` (MLP, LR) | Base learners |
| Ensemble | `mlens` / custom stacking with `scikit-learn` | Meta-learner orchestration |
| Imbalance | `imbalanced-learn` (SMOTEENN) | Oversampling + cleaning |
| Optimization | `optuna` | Hyperparameter search |
| Interpretability | `shap` | Game-theoretic explanations |
| Tracking | `trackio` + `mlflow` | Experiment logging, metrics, artifacts |
| Deployment | `gradio` / `fastapi` + Docker | API inference and UI demo |
| Versioning | `dvc` + `git` | Data and model versioning |
7.2 4-Week Timeline
| Week | Focus | Deliverables |
|---|---|---|
| Week 1 | Data audit, preprocessing, EDA | Clean notebooks; feature engineering pipeline; data quality report |
| Week 2 | Base model training, hyperparameter tuning | 5 trained base models; Optuna study results; OOF prediction matrices |
| Week 3 | Stacking ensemble, evaluation, SHAP analysis | Trained meta-learner; 10-metric report; SHAP dashboards; CLV scoring |
| Week 4 | Cross-domain testing, deployment, documentation | Generalization report; Gradio demo; model card; final documentation |
7.3 Code Architecture
```
churnpredict-pro/
├── data/
│   ├── raw/                         # HF datasets (versioned with DVC)
│   ├── processed/                   # Train/val/test splits
│   └── engineered/                  # Feature-engineered datasets
├── notebooks/
│   ├── 01_eda_telco.ipynb
│   ├── 02_eda_bank.ipynb
│   ├── 03_feature_engineering.ipynb
│   └── 04_shap_analysis.ipynb
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── load_datasets.py         # HF datasets loader
│   │   ├── preprocess.py            # Cleaning + encoding + scaling
│   │   └── feature_engineer.py      # RFM, velocity, CLV proxy
│   ├── models/
│   │   ├── base_models.py           # XGB, LGBM, CatBoost, MLP, LR wrappers
│   │   ├── stacking_ensemble.py     # OOF + meta-learner
│   │   └── hyperparameter_search.py # Optuna studies
│   ├── evaluation/
│   │   ├── metrics.py               # 10-metric computation
│   │   ├── shap_explainer.py        # Global + local SHAP
│   │   └── clv_scorer.py            # RPS computation
│   └── deployment/
│       ├── inference_pipeline.py
│       └── app.py                   # Gradio/FastAPI interface
├── configs/
│   ├── telco_config.yaml
│   └── bank_config.yaml
├── experiments/                     # Trackio / MLflow runs
├── tests/
│   ├── test_preprocessing.py
│   └── test_models.py
├── Dockerfile
├── requirements.txt
├── dvc.yaml
└── README.md
```
8. Experimental Design
8.1 Five Experiments
| ID | Experiment | Hypothesis | Method |
|---|---|---|---|
| E1 | Single Model Baseline | Individual models underperform ensemble due to bias-variance limitations | Train each of 5 base models standalone; report metrics |
| E2 | Stacking Ensemble | Meta-learner combining 5 models outperforms best single model by ≥ 3% F1 | 5-fold OOF stacking with LR meta-learner |
| E3 | Imbalance Strategy Comparison | Threshold calibration is more effective than SMOTE for churn (per [21]) | Compare: (a) no correction, (b) SMOTEENN, (c) class weights, (d) threshold calibration |
| E4 | Cross-Domain Transfer | Models trained on Telco generalize to Bank with ≥ 0.80 AUC | Train on Telco, evaluate zero-shot on Bank; then fine-tune |
| E5 | CLV-Weighted vs. Uniform Ranking | RPS improves campaign ROI over probability-only ranking | Compare top-20% precision: P(churn) only vs. P(churn) Γ CLV |
8.2 Ten Evaluation Metrics
| # | Metric | Formula / Definition | Why It Matters for Churn |
|---|---|---|---|
| 1 | Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness; misleading if imbalanced |
| 2 | Precision (Churn) | TP / (TP + FP) | Of predicted churners, how many actually churn? (cost of false alarms) |
| 3 | Recall (Churn) | TP / (TP + FN) | Of actual churners, how many did we catch? (cost of missed churners) |
| 4 | F1-Score | 2 Γ (Precision Γ Recall) / (Precision + Recall) | Harmonic mean; balances precision and recall |
| 5 | ROC-AUC | Area under ROC curve | Discrimination ability across all thresholds |
| 6 | PR-AUC | Area under Precision-Recall curve | More informative than ROC-AUC for imbalanced data |
| 7 | MCC | (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | Correlation between prediction and truth; robust to imbalance |
| 8 | Cohen's Kappa | (Observed − Expected) / (1 − Expected) | Agreement beyond chance; useful for inter-rater reliability analogies |
| 9 | Balanced Accuracy | (Sensitivity + Specificity) / 2 | Average of recall on both classes; fair on imbalanced data |
| 10 | ECE | Expected Calibration Error | Measures reliability of probability outputs; critical for CLV weighting |
8.3 Statistical Rigor
- Confidence Intervals: All metrics reported with 95% CIs from 5-fold CV (bootstrap percentile method)
- McNemar's Test: Statistically compare stacking ensemble vs. best single model
- DeLong's Test: Compare ROC-AUC differences between models
- Permutation Test: Validate feature importance scores from SHAP
- Stratification: all splits stratified on target + `Contract` type (strongest churn predictor) to prevent distribution shift
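A percentile-bootstrap CI on F1 might be sketched as follows, in pure NumPy; the simulated labels and the roughly 80%-accurate predictor are purely illustrative. McNemar's and DeLong's tests would come from `statsmodels` or specialized packages and are not shown here:

```python
import numpy as np

def f1(y_true, y_pred) -> float:
    """F1 for the positive (churn) class."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def bootstrap_ci(y_true, y_pred, n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 42):
    """95% percentile-bootstrap CI: resample (truth, prediction) pairs
    with replacement and take the empirical quantiles of F1."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = np.array([
        f1(y_true[idx], y_pred[idx])
        for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))
    ])
    return np.quantile(scores, alpha / 2), np.quantile(scores, 1 - alpha / 2)

# Simulated evaluation set: a predictor that is right ~80% of the time
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_pred = np.where(rng.random(500) < 0.8, y_true, 1 - y_true)
lo, hi = bootstrap_ci(y_true, y_pred)
```

Reporting `[lo, hi]` alongside each point estimate is what the "95% CIs on all metrics" criterion in Section 3.3 refers to.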
8.4 Reproducibility Checklist
- Random seeds fixed (`random_state=42`) for all stochastic operations
- `requirements.txt` with exact versions (via `pip freeze`)
- DVC tracking for data and model artifacts
- Git commit hash recorded with every experiment
- Trackio / MLflow logging of hyperparameters, metrics, and artifact paths
9. Result Analysis
9.1 Expected Performance
Based on literature benchmarks ([4] achieved 99.28% on telecom; [16] achieved strong results on bank credit risk with SMOTEENN + LightGBM), our targets are conservative and grounded:
| Dataset | Best Single Model F1 | Stacking Ensemble F1 | Expected Δ |
|---|---|---|---|
| Telco | 0.82–0.84 (XGBoost/CatBoost) | 0.86–0.88 | +0.03–0.04 |
| Bank | 0.78–0.81 (LightGBM/XGBoost) | 0.82–0.85 | +0.03–0.04 |
9.2 SHAP Analysis β Expected Insights
Based on prior churn research, we anticipate the following feature importance rankings:
Telco (Expected Top 5 SHAP Features):
1. `Contract` (Month-to-Month vs. longer): strongest predictor
2. `Tenure in Months`: inverse relationship with churn
3. `Monthly Charge` / `Total Charges`: price sensitivity
4. `Internet Type` (Fiber Optic churns more than DSL)
5. `Payment Method` (Electronic check = high risk)
Bank (Expected Top 5 SHAP Features):
1. `Total_Trans_Ct` (transaction frequency decline)
2. `Total_Trans_Amt` (monetary decline)
3. `Months_Inactive_12_mon` (recency of activity)
4. `Total_Relationship_Count` (cross-product engagement)
5. `Contacts_Count_12_mon` (complaint/contact proxy)
9.3 Business Impact Projections
Assuming a hypothetical telecom with:
- 100,000 customers
- 25% annual churn rate
- Average CLV = $3,000
- Retention campaign cost = $50 per targeted customer
- Campaign success rate (if well-targeted) = 30%
| Scenario | Customers Targeted | Campaign Cost | Churners Caught | Revenue Saved | Net ROI |
|---|---|---|---|---|---|
| Random targeting (25% churn) | 20,000 | $1,000,000 | 1,500 | $4,500,000 | 4.5× |
| Model-guided (top 20% by RPS) | 20,000 | $1,000,000 | 4,200 | $12,600,000 | 12.6× |
Model-guided targeting improves ROI by ~2.8× over random selection by focusing on high-value, high-probability churners.
9.4 Visualization Plan
| Visualization | Purpose |
|---|---|
| ROC & PR curves (all models overlaid) | Comparative discrimination |
| Confusion matrices | Error type analysis |
| SHAP summary plot (beeswarm) | Global feature importance |
| SHAP force plots (sample customers) | Local explanations for stakeholders |
| SHAP dependence plots | Feature interaction discovery |
| Calibration plot (predicted vs. actual) | Probability reliability |
| CLV-RPS scatter plot | Segmentation visualization |
| Metric bar chart with 95% CIs | Statistical comparison |
10. Iterative Improvement
10.1 Six Iteration Cycles
| Iteration | Focus | Action | Expected Outcome |
|---|---|---|---|
| Iter 1 | Feature Engineering Deep-Dive | Add polynomial features (tenure², charge²); interaction terms (contract × monthly charge); binning (tenure quartiles) | +1–2% F1 from non-linear feature capture |
| Iter 2 | Advanced Sampling | Replace SMOTEENN with ADASYN + Edited Nearest Neighbours; test BorderlineSMOTE | Better synthetic sample quality near decision boundary |
| Iter 3 | Deep Learning Augmentation | Replace MLP with TabNet or FT-Transformer for tabular deep learning; compare against MLP base | Validate whether deep tabular models improve ensemble diversity |
| Iter 4 | Temporal Modeling | For Telco: add LSTM/GRU on quarterly customer journey sequences; for Bank: add transaction time-series | Capture temporal churn dynamics; +2–3% F1 on time-sensitive subsets |
| Iter 5 | Ensemble Expansion | Add 6th base model (Random Forest or Extra Trees) for additional variance reduction; test blending vs. stacking | Further variance reduction; marginal F1 gain of +0.5–1% |
| Iter 6 | Production Hardening | Dockerize inference; add A/B test framework; build automated retraining trigger on drift detection; write full production documentation | Deployable system with monitoring, retraining, and compliance docs |
10.2 Production Documentation Deliverables
| Document | Contents | Audience |
|---|---|---|
| Model Card | Intended use, training data summary, performance metrics, limitations, bias assessment, ethical considerations | Data scientists, regulators |
| API Documentation | Endpoint specs, request/response schemas, rate limits, error codes | Engineering teams |
| SHAP Dashboard Guide | How to read force plots, summary plots, and dependence plots | Business stakeholders, customer success |
| Retention Playbook | How to act on RPS segments; recommended interventions per churn reason | Marketing, customer success |
| Retraining SOP | When and how to retrain; drift detection thresholds; rollback procedures | MLOps, data engineering |
| Compliance Checklist | GDPR Article 22 (automated decision-making), CCPA, internal audit requirements | Legal, compliance |
Appendix A: Key Equations
Retention Priority Score:

$$\text{RPS}_i = P(\text{churn}_i \mid x_i) \times \text{CLV}_i$$

F1-Score:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Matthews Correlation Coefficient:

$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

Expected Calibration Error:

$$\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}\,\bigl|\operatorname{acc}(B_m) - \operatorname{conf}(B_m)\bigr|$$
Document compiled for the ChurnPredict Pro project. All datasets verified on Hugging Face Hub. All 21 references span peer-reviewed and high-impact arXiv publications from 2016–2024.