
ChurnPredict Pro: A Stacking Ensemble Framework for Customer Churn Prediction with Explainable AI and CLV Scoring

Subtitle: End-to-End Machine Learning Pipeline for Telecommunications and Banking Customer Retention — Combining Gradient Boosting, Neural Networks, and Game-Theoretic Interpretability


Table of Contents

  1. Problem Statement
  2. Idea of Solution
  3. Objectives
  4. Literature Review & References
  5. Dataset Understanding
  6. Proposed Methodology
  7. Implementation Strategy
  8. Experimental Design
  9. Result Analysis
  10. Iterative Improvement

1. Problem Statement

1.1 Business Context

Customer churn — the loss of clients to competitors or market attrition — is one of the most financially consequential challenges in subscription-based and service-oriented industries. In telecommunications, acquiring a new customer costs 5–25× more than retaining an existing one (industry estimates, 2024). In banking, customer attrition erodes lifetime value portfolios and damages brand equity. For both sectors, even a 1% reduction in churn can translate to millions in retained revenue.

Current retention strategies suffer from two critical gaps:

  • Reactive approaches: Firms typically respond to churn after it occurs, through win-back campaigns that are expensive and low-yield.
  • Black-box predictions: Machine learning models deployed in production often lack interpretability, making it impossible for marketing and customer-success teams to act on model outputs with confidence.

1.2 Technical Challenges

| Challenge | Description | Impact |
|---|---|---|
| Class Imbalance | Churners typically represent 10–30% of the customer base. Standard accuracy metrics are misleading. | High false-negative rates; missed at-risk customers |
| Feature Heterogeneity | Datasets mix categorical (contract type, payment method), numerical (tenure, charges), and temporal features (quarter, month-on-book). | Preprocessing complexity; risk of data leakage |
| Concept Drift | Customer behavior patterns shift seasonally and with market conditions. Models degrade without retraining. | Production model staleness; declining precision |
| Interpretability vs. Performance Trade-off | High-accuracy ensembles are often opaque. Explainable models (e.g., logistic regression) underperform on tabular data. | Regulatory non-compliance (GDPR Article 22); low stakeholder trust |
| Multi-Domain Generalization | Models trained on telecom data fail on banking data due to domain shift in feature distributions. | Siloed, non-reusable models per industry |

1.3 Gaps in Existing Solutions

  1. Single-model reliance: Most production churn models deploy a single classifier (XGBoost or logistic regression), missing the variance-reduction benefits of ensemble diversity.
  2. No CLV integration: Churn predictions are binary — they do not incorporate which churners are most valuable to retain, leading to inefficient marketing spend.
  3. Weak experimental rigor: Many published churn studies use a single train/test split without cross-validation, statistical testing, or confidence intervals on metrics.
  4. Dataset isolation: Telco and bank churn datasets are studied separately; few works evaluate cross-domain transfer or unified pipelines.

2. Idea of Solution

2.1 Architecture Overview

We propose ChurnPredict Pro, a stacking ensemble architecture that combines the complementary strengths of five diverse base learners under a meta-learner. The design philosophy is:

"Diversity in inductive bias reduces variance; interpretability in the meta-layer preserves actionability."

2.2 The 5-Model Stacking Ensemble

┌─────────────────────────────────────────────────────────────────────┐
│                CHURNPREDICT PRO — STACKING ENSEMBLE                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────┐  │
│   │ XGBoost  │  │ LightGBM │  │ CatBoost │  │   MLP    │  │  LR  │  │
│   │  (GBDT)  │  │  (GBDT)  │  │  (OGB)   │  │  (Deep)  │  │(Base)│  │
│   │  Base 1  │  │  Base 2  │  │  Base 3  │  │  Base 4  │  │Base 5│  │
│   └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘  └──┬───┘  │
│        │             │             │             │           │      │
│        └─────────────┴─────────────┼─────────────┴───────────┘      │
│                                    │                                │
│                          ┌─────────▼─────────┐                      │
│                          │   META-LEARNER    │                      │
│                          │  (Logistic Reg    │                      │
│                          │   / XGBoost)      │                      │
│                          └─────────┬─────────┘                      │
│                                    │                                │
│                          ┌─────────▼─────────┐                      │
│                          │   CLV SCORING     │                      │
│                          │ + SHAP EXPLAINER  │                      │
│                          └───────────────────┘                      │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

2.3 Why These 5 Base Models?

| Model | Inductive Bias | Strength on Churn Data | Weakness Mitigated by Ensemble |
|---|---|---|---|
| XGBoost | Greedy gradient boosting with regularization | Best-in-class on sparse/tabular data; handles missing values natively | Prone to overfitting on small datasets |
| LightGBM | Histogram-based leaf-wise boosting | Faster training; GOSS sampling for large data | Leaf-wise growth can overfit; GOSS introduces bias |
| CatBoost | Ordered boosting + categorical encoding | Native categorical feature handling; reduces target leakage | Slower than LightGBM; ordered boosting complexity |
| MLP (Deep) | Non-linear feature interactions | Captures complex feature cross-products | Needs more data; less interpretable |
| Logistic Regression | Linear decision boundary | Fast, interpretable baseline; L1 regularization for feature selection | Cannot model non-linear relationships |

The meta-learner (Logistic Regression or a shallow XGBoost) learns optimal weights for combining the five base models' predictions, leveraging their uncorrelated errors.

2.4 CLV-Weighted Scoring

Instead of ranking customers by churn probability alone, we multiply P(churn) by estimated CLV to produce a Retention Priority Score (RPS):

RPS_i = P(churn_i) × CLV_i

This ensures retention campaigns target high-value at-risk customers, maximizing ROI.
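A minimal sketch of the scoring step on hypothetical customers; note that a high-CLV customer at moderate risk can outrank a near-certain churner with little remaining value:

```python
import pandas as pd

# Hypothetical scored customers: churn probability from the ensemble plus CLV
# (the Telco CLTV field, or the derived proxy for Bank).
scored = pd.DataFrame({
    "customer_id": ["A", "B", "C", "D"],
    "p_churn":     [0.90, 0.40, 0.85, 0.10],
    "clv":         [500.0, 6000.0, 4000.0, 8000.0],
})

# RPS_i = P(churn_i) x CLV_i
scored["rps"] = scored["p_churn"] * scored["clv"]
ranked = scored.sort_values("rps", ascending=False)
print(ranked[["customer_id", "rps"]].to_string(index=False))
```

Customer C (0.85 x 4000) outranks customer A (0.90 x 500): a high-probability churner with little remaining value is a poor campaign target.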


3. Objectives

3.1 Primary Goals

| ID | Objective | Metric | Target / Success Criterion |
|---|---|---|---|
| P1 | Build a stacking ensemble that outperforms any single base model | F1-Score | ΔF1 ≥ +0.03 over best single model |
| P2 | Achieve high recall on the churn class (minimize false negatives) | Recall@Churn | ≥ 0.85 on both datasets |
| P3 | Deliver actionable model explanations per customer | SHAP summary | Top-5 features identified per prediction |
| P4 | Rank customers by retention value, not just churn risk | AUC-PR weighted by CLV | ROC-AUC ≥ 0.90 |

3.2 Secondary Goals

| ID | Objective | Target |
|---|---|---|
| S1 | Evaluate cross-domain generalization (Telco → Bank, Bank → Telco) | Transfer AUC ≥ 0.80 |
| S2 | Achieve sub-second inference latency for batch scoring | ≤ 500 ms per 1,000 records |
| S3 | Deploy a reproducible, version-controlled pipeline | Docker + DVC + CI/CD |
| S4 | Document model behavior for regulatory compliance (GDPR/CCPA) | Full SHAP + model card |

3.3 Success Criteria Summary

  • Model Performance: F1-Score > 0.85, ROC-AUC > 0.90, PR-AUC > 0.80 on both datasets
  • Business Impact: Identify top 20% at-risk customers with ≥ 70% precision
  • Interpretability: Every prediction accompanied by SHAP force plot; global SHAP summary for stakeholder dashboards
  • Robustness: 5-fold stratified CV with 95% confidence intervals on all metrics

4. Literature Review & References

4.1 Category Overview

| Category | Count | Papers |
|---|---|---|
| Ensemble / Boosting Methods | 4 | [1–4] |
| SHAP / LIME Interpretability | 3 | [5–7] |
| Deep Learning for Churn | 3 | [8–10] |
| CLV / Profit-Driven Churn | 3 | [11–13] |
| Financial / Bank Churn | 4 | [14–17] |
| Survey / Benchmark / Foundation | 4 | [18–21] |
| Total | 21 | |

4.2 Full References (2016–2024)

[1] XGBoost: A Scalable Tree Boosting System

Chen, T., & Guestrin, C. (2016). KDD. arXiv:1603.02754. Introduced sparsity-aware algorithms and weighted quantile sketch for gradient boosting. Became the dominant algorithm for tabular churn prediction tasks worldwide.

[2] Tabular Data: Deep Learning is Not All You Need

Shwartz-Ziv, R., & Armon, A. (2021). arXiv:2106.03253. Rigorous comparison showing XGBoost outperforms recent deep learning models on tabular data; ensembling deep models with XGBoost further improves performance.

[3] CatBoost: Unbiased Boosting with Categorical Features

Prokhorenkova, L., et al. (2017). arXiv:1706.09516. Ordered boosting and novel categorical feature processing; outperforms other boosting implementations on datasets with high-cardinality categorical churn predictors.

[4] Enhancing Customer Churn Prediction: An Adaptive Ensemble Learning Approach

Shaikhsurab, S., & Magadum, S. (2024). arXiv:2408.16284. Adaptive ensemble combining XGBoost, LightGBM, LSTM, MLP, and SVM with stacking + meta-feature generation; achieved 99.28% accuracy on telecom churn datasets.

[5] A Unified Approach to Interpreting Model Predictions (SHAP)

Lundberg, S. M., & Lee, S.-I. (2017). NeurIPS. arXiv:1705.07874. Proposed SHAP values as a unified measure of feature importance based on game-theoretic Shapley values; unified six existing explanation methods.

[6] "Why Should I Trust You?": Explaining Predictions of Any Classifier (LIME)

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). KDD. arXiv:1602.04938. Introduced LIME to explain any classifier locally via interpretable surrogate models; foundational for churn model explainability and regulatory compliance.

[7] XAI Handbook: Towards a Unified Framework for Explainable AI

Palacio, D. G., et al. (2021). arXiv:2105.06677. Provides theoretical framework unifying XAI terminology (LIME, SHAP, Grad-CAM, etc.); essential for regulatory compliance and method comparison in churn explainability.

[8] Early Churn Prediction from Large-Scale User-Product Interaction Time Series

Bhattacharjee, A., Thukral, K., & Patil, C. (2023). arXiv:2309.14390. Applied multivariate time series classification with deep neural networks to fantasy sports churn; scales to 10⁸ users — demonstrates feasibility of deep learning at scale.

[9] Modelling Customer Churn for the Retail Industry in a Deep Learning Sequential Framework

Equihua, C., et al. (2023). arXiv:2304.00575. Deep survival framework using recurrent neural networks for non-contractual retail churn; avoids extensive feature engineering through learned representations.

[10] Churn Reduction via Distillation

Jiang, Y., et al. (2021). arXiv:2106.02654. Showed model distillation reduces predictive churn (model instability during retraining) while maintaining accuracy across FC, CNN, and transformer architectures.

[11] OptDist: Learning Optimal Distribution for Customer Lifetime Value Prediction

Weng, S., et al. (2024). arXiv:2408.08585. Proposed OptDist with distribution learning/selection modules; adaptively selects optimal sub-distributions for CLTV prediction on public and industrial datasets.

[12] Customer Lifetime Value Prediction with Uncertainty Estimation Using Monte Carlo Dropout

Cao, Y., Xu, Y., & Yang, Q. (2024). arXiv:2411.15944. Enhanced neural network CLTV prediction with Monte Carlo Dropout for uncertainty quantification; improved Top-5% MAPE significantly.

[13] A Predict-and-Optimize Approach to Profit-Driven Churn Prevention

Gómez-Vargas, E., Maldonado, S., & Vairetti, S. (2023). arXiv:2310.07047. First predict-and-optimize approach for churn prevention using individual CLVs (not averages); regret minimization via SGD; tested on 12 real-world datasets.

[14] Dynamic Customer Embeddings for Financial Service Applications

Chitsazan, N., et al. (2021). arXiv:2106.11880. DCE framework uses customer digital activity + financial context for intent/fraud/call-center prediction; financial services benchmark for learned representations.

[15] FinPT: Financial Risk Prediction with Profile Tuning on Pretrained Foundation Models

Yin, H., et al. (2023). arXiv:2308.00065. Introduced FinBench dataset + FinPT method for financial risk prediction (default, fraud, churn) using LLM-generated customer profiles; strong zero-shot transfer.

[16] Advanced User Credit Risk Prediction Using LightGBM, XGBoost and TabNet with SMOTEENN

Yu, B., et al. (2024). arXiv:2408.03497. Combined PCA, SMOTEENN, and LightGBM for bank credit risk prediction; outperformed other models in identifying high-quality applicants under class imbalance.

[17] Credit Card Fraud Detection β€” Classifier Selection Strategy

Kulatilleke, S. (2022). arXiv:2208.11900. Data-driven classifier selection + sampling methods for imbalanced fraud detection; directly applicable to churn's class imbalance challenges.

[18] Predicting Customer Churn: Extreme Gradient Boosting with Temporal Data

Gregory, J. (2018). arXiv:1802.03396. Applied XGBoost with temporal feature engineering to time-series churn data; achieved top performance in large-scale competition settings.

[19] Predictive Churn with the Set of Good Models

Watson-Daniels, D., et al. (2024). arXiv:2402.07745. Examined prediction instability during model retraining via Rashomon set; critical for production churn model deployment and monitoring.

[20] Retention Is All You Need

Mohiuddin, K., et al. (2023). arXiv:2304.03103. HR Decision Support System using SHAP + what-if analysis for employee attrition; demonstrates SHAP utility for retention/churn use cases with interpretable dashboards.

[21] Balancing the Scales: A Comprehensive Study on Tackling Class Imbalance

(2024). arXiv:2409.19751. Comprehensive study of SMOTE, Class Weights, and Decision Threshold Calibration for binary classification; Decision Threshold Calibration was most consistently effective — directly guides our experimental design.


5. Dataset Understanding

5.1 Dataset 1: Telco Customer Churn (IBM)

Source: aai510-group1/telco-customer-churn
Type: Fictional telecommunications company data
Format: CSV / Parquet
Splits: train / validation / test

Schema Summary

| Feature Category | Count | Key Features |
|---|---|---|
| Demographics | 7 | Age, Gender, Married, Dependents, Number of Dependents, Senior Citizen, Under 30 |
| Service Usage | 10 | Phone Service, Internet Service, Internet Type, Multiple Lines, Online Security, Online Backup, Device Protection, Tech Support, Streaming TV, Streaming Movies |
| Contract & Billing | 6 | Contract, Payment Method, Paperless Billing, Monthly Charge, Total Charges, Total Refunds |
| Engagement | 7 | Tenure (months), Number of Referrals, Referred a Friend, Offer, Satisfaction Score, Churn Score, Quarter |
| Revenue | 6 | Total Revenue, Total Long Distance Charges, Total Extra Data Charges, Avg Monthly Long Distance, Avg Monthly GB Download, CLTV |
| Geographic | 6 | City, State, Zip Code, Latitude, Longitude, Population |
| Target | 4 | Churn (binary), Churn Reason (string), Churn Category (string), Customer Status |

Total Features: ~52 (including derived identifiers like Lat Long, Customer ID)

Class Distribution (Audited)

| Split | Total Rows | Churned (1) | Stayed (0) | Churn Rate |
|---|---|---|---|---|
| Train | ~4,400 | ~1,100 | ~3,300 | ~25% |
| Validation | ~1,500 | ~375 | ~1,125 | ~25% |
| Test | ~1,500 | ~375 | ~1,125 | ~25% |

Note: Exact counts vary by split. The dataset exhibits moderate class imbalance (~25% churn), manageable without aggressive oversampling.

Notable Data Characteristics

  1. Rich categorical encoding: Internet Type (DSL, Fiber Optic, Cable, None), Contract (Month-to-Month, One Year, Two Year), Payment Method (4 types)
  2. Temporal granularity: Quarter field (Q1–Q4) enables time-aware feature engineering
  3. Pre-computed churn scores: Churn Score (0–100) and Satisfaction Score (1–5) are strong engineered features — risk of target leakage if not handled carefully
  4. CLTV integration: CLTV field directly available for revenue-weighted ranking
  5. Geographic features: Latitude/longitude enable spatial clustering or geo-derived features

Data Quality Flags

  • Total Charges has blank/missing values for zero-tenure customers (new sign-ups)
  • Churn Reason and Churn Category are populated only for churned customers — post-hoc labels, not usable as features
  • Customer Status is highly correlated with the target; it should be excluded or used only for stratification
  • Some categorical fields (City, State) have high cardinality (50+ states, 1,000+ cities)

5.2 Dataset 2: Bank Customer Churners

Source: ZZHHJ/bank_churners
Type: Credit card customer attrition data
Format: CSV / Parquet
Splits: single train split (requires manual partitioning)

Schema Summary

| Feature Category | Count | Key Features |
|---|---|---|
| Demographics | 6 | Customer_Age, Gender, Dependent_count, Education_Level, Marital_Status, Income_Category |
| Account Behavior | 5 | Months_on_book, Total_Relationship_Count, Months_Inactive_12_mon, Contacts_Count_12_mon, Card_Category |
| Financial | 8 | Credit_Limit, Total_Revolving_Bal, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1, Avg_Utilization_Ratio |
| Target | 1 | Attrition_Flag (Existing Customer / Attrited Customer) |
| Artifacts | 2 | Naive_Bayes_Classifier columns (pre-computed probabilities — must be removed to avoid data leakage) |

Total Columns: 23 (19 usable features + 1 ID + 1 target + 2 NB artifacts to drop)

Class Distribution (Estimated)

| Class | Approximate Count | Rate |
|---|---|---|
| Existing Customer | ~8,500 | ~83% |
| Attrited Customer | ~1,700 | ~17% |

Churn rate ~17% — more imbalanced than Telco; SMOTE/ADASYN or class weighting will be necessary.

Notable Data Characteristics

  1. Quarter-over-quarter dynamics: Total_Amt_Chng_Q4_Q1 and Total_Ct_Chng_Q4_Q1 capture behavioral velocity — powerful churn signals
  2. Utilization ratio: Avg_Utilization_Ratio is a strong proxy for engagement; low utilization often precedes attrition
  3. Income categories are binned: $60K - $80K, $80K - $120K, etc. — ordinal encoding preferred
  4. Card category: Blue (vast majority), Silver, Gold, Platinum — strong class imbalance within the feature itself

Data Quality Flags

  • Critical: Two Naive_Bayes_Classifier_* columns are pre-computed churn probabilities from a baseline model. Using them as features would constitute data leakage — they must be dropped before any model training.
  • No explicit CLTV field; must be estimated from Credit_Limit, Total_Trans_Amt, and Total_Trans_Ct
  • Single split requires manual stratified partitioning (70/15/15 or 80/10/10)

5.3 Cross-Dataset Comparison

| Attribute | Telco (IBM) | Bank Churners |
|---|---|---|
| Records | ~7,000 | ~10,000 |
| Features (usable) | ~45 | ~19 |
| Churn Rate | ~25% | ~17% |
| Industry | Telecommunications | Banking / Credit Cards |
| Temporal Features | Quarter, Tenure (months) | Months_on_book, Q4/Q1 change ratios |
| CLTV Available | Yes (explicit field) | No (must derive) |
| Geographic Data | Yes (lat/lon, city, state) | No |
| Pre-computed Scores | Churn Score, Satisfaction | Naive Bayes (leakage — drop) |
| Class Imbalance Severity | Moderate | High |
| Primary Churn Drivers | Contract type, tenure, service usage | Inactivity, transaction decline, utilization |

6. Proposed Methodology

6.1 The 7-Phase Pipeline

Phase 1: Data Ingestion & Audit
    ↓
Phase 2: Preprocessing & Feature Engineering
    ↓
Phase 3: Exploratory Data Analysis (EDA)
    ↓
Phase 4: Model Training β€” 5-Base Stacking Ensemble
    ↓
Phase 5: Hyperparameter Optimization
    ↓
Phase 6: Evaluation, Interpretability & CLV Scoring
    ↓
Phase 7: Deployment, Monitoring & Documentation

Phase 1: Data Ingestion & Audit

  • Load both datasets from Hugging Face datasets library
  • Compute schema validation: type checks, missing value audit, cardinality report
  • Flag anomalous values (negative charges, impossible ages, blank Total Charges)
  • Document data provenance and version hashes (DVC)

Phase 2: Preprocessing & Feature Engineering

2A. Cleaning

  • Telco: Impute Total Charges blanks with Monthly Charge × Tenure
  • Bank: Drop Naive_Bayes_Classifier_* columns immediately
  • Both datasets: remove ID fields (Customer ID, CLIENTNUM)
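A pandas sketch of these cleaning steps on toy rows; the column names are assumed from the schema audit above, and the Bank column name is shortened for illustration:

```python
import pandas as pd

# Toy rows with column names assumed from the Telco schema audit.
telco = pd.DataFrame({
    "Customer ID": ["001", "002"],
    "Tenure in Months": [0, 12],
    "Monthly Charge": [70.0, 50.0],
    "Total Charges": [" ", "600.0"],   # blank for a zero-tenure sign-up
})

# Coerce blanks to NaN, then impute as Monthly Charge x Tenure.
telco["Total Charges"] = pd.to_numeric(telco["Total Charges"], errors="coerce")
telco["Total Charges"] = telco["Total Charges"].fillna(
    telco["Monthly Charge"] * telco["Tenure in Months"])

# Drop identifier columns before modelling.
telco = telco.drop(columns=["Customer ID"])

# Bank: drop the leaked Naive_Bayes_Classifier_* columns up front
# (illustrative short column name; the real names are much longer).
bank = pd.DataFrame({"CLIENTNUM": [1], "Naive_Bayes_Classifier_p": [0.9]})
bank = bank.drop(columns=[c for c in bank.columns
                          if c.startswith("Naive_Bayes_Classifier")])
print(telco, list(bank.columns))
```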

2B. Encoding

| Feature Type | Encoding Strategy | Example Features |
|---|---|---|
| Binary categorical | Label encoding (0/1) | Gender, Partner, PhoneService |
| Low-cardinality ordinal | One-hot encoding | Contract, Payment Method, Education_Level |
| High-cardinality nominal | Target encoding / CatBoost native | City, State (Telco); Income_Category (Bank) |
| Cyclical temporal | Sine/cosine encoding | Quarter mapped to an angle |
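The cyclical row can be illustrated directly: mapping Quarter onto the unit circle keeps Q4 adjacent to Q1, which a plain ordinal 1–4 encoding would lose. A minimal sketch:

```python
import numpy as np
import pandas as pd

# Quarter mapped onto the unit circle so Q4 is adjacent to Q1.
df = pd.DataFrame({"Quarter": ["Q1", "Q2", "Q3", "Q4"]})
q = df["Quarter"].str[1].astype(int)        # 1..4
angle = 2 * np.pi * (q - 1) / 4
df["quarter_sin"] = np.sin(angle)
df["quarter_cos"] = np.cos(angle)
print(df.round(3))
```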

2C. Feature Engineering

  • RFM-style features (Bank): Recency = Months_Inactive_12_mon, Frequency = Total_Trans_Ct, Monetary = Total_Trans_Amt
  • Engagement ratio (Telco): Satisfaction_Score / Churn_Score as loyalty proxy
  • Velocity features: Month-over-month change in charges and usage
  • CLTV proxy (Bank): Credit_Limit × Avg_Utilization_Ratio × (12 - Months_Inactive_12_mon)
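A small pandas sketch of the RFM-style features and the CLV proxy above (toy values; column names follow the Bank schema):

```python
import pandas as pd

# Toy Bank rows; column names follow the dataset schema.
bank = pd.DataFrame({
    "Credit_Limit": [10000.0, 3000.0],
    "Avg_Utilization_Ratio": [0.30, 0.05],
    "Months_Inactive_12_mon": [1, 4],
    "Total_Trans_Ct": [80, 20],
    "Total_Trans_Amt": [4000.0, 600.0],
})

# RFM-style features.
bank["recency"] = bank["Months_Inactive_12_mon"]
bank["frequency"] = bank["Total_Trans_Ct"]
bank["monetary"] = bank["Total_Trans_Amt"]

# CLV proxy as defined above.
bank["clv_proxy"] = (bank["Credit_Limit"]
                     * bank["Avg_Utilization_Ratio"]
                     * (12 - bank["Months_Inactive_12_mon"]))
print(bank["clv_proxy"].tolist())
```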

2D. Scaling & Imbalance Handling

  • Numerical features → RobustScaler (median/IQR, resistant to outliers)
  • Class imbalance → SMOTEENN (SMOTE + Edited Nearest Neighbours) on the training fold only; never on validation/test
  • Class weights → scale_pos_weight = len(negative) / len(positive) for XGBoost/LightGBM

Phase 3: Exploratory Data Analysis (EDA)

  • Univariate distributions (histograms, boxplots for skew detection)
  • Bivariate analysis: churn rate by contract type, payment method, tenure bins
  • Correlation matrix (Spearman for non-linear relationships)
  • Feature-target mutual information scores for feature selection
  • Geographic heatmap (Telco: churn rate by state)

Phase 4: Model Training β€” Stacking Ensemble

4A. Cross-Validation Strategy

  • 5-fold Stratified Cross-Validation to preserve class distribution
  • GroupKFold if temporal leakage risk (same customer in multiple quarters)
  • Out-of-fold (OOF) predictions from each base model used as meta-features
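The OOF construction can be sketched with cross_val_predict; two scikit-learn models stand in for the five base learners:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=1000, weights=[0.75], random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# OOF churn probabilities from two example base models; the full pipeline
# runs this loop over all five base learners.
base_models = [GradientBoostingClassifier(random_state=42),
               LogisticRegression(max_iter=1000)]
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=cv, method="predict_proba")[:, 1]
    for m in base_models
])

# The OOF matrix is the meta-learner's training input: every meta-feature
# comes from a model that never saw that row during fitting.
meta = LogisticRegression().fit(oof, y)
print("meta-feature matrix shape:", oof.shape)
```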

4B. Base Model Training

| Base Model | Key Hyperparameters | Tuning Range |
|---|---|---|
| XGBoost | max_depth, learning_rate, subsample, colsample_bytree, scale_pos_weight | depth: 3–8; lr: 0.01–0.3 |
| LightGBM | num_leaves, learning_rate, feature_fraction, bagging_fraction, is_unbalance | leaves: 20–100; lr: 0.01–0.3 |
| CatBoost | depth, learning_rate, iterations, auto_class_weights | depth: 4–10; iterations: 200–1000 |
| MLP | hidden_layers, dropout, batch_size, learning_rate | layers: (128,64), (256,128,64); dropout: 0.2–0.5 |
| Logistic Regression | C, penalty, solver, class_weight | C: 0.001–10; penalty: l1/l2/elasticnet |

4C. Meta-Learner Training

  • Input: 5 OOF probability vectors (one per base model) + optionally top-K original features
  • Model: Logistic Regression (interpretable weights showing model contribution) OR XGBoost (if non-linear meta-interactions needed)
  • Validation: Same 5-fold CV; meta-learner trained on OOF predictions, tested on hold-out

Phase 5: Hyperparameter Optimization

  • Optuna with TPESampler (Tree-structured Parzen Estimator)
  • 100 trials per base model; 50 trials for meta-learner
  • Pruning: MedianPruner with early stopping on validation F1
  • Objective: Maximize F1-Score (harmonic mean of precision and recall)

Phase 6: Evaluation, Interpretability & CLV Scoring

6A. Metrics Suite (10 metrics)

  1. Accuracy
  2. Precision (Churn class)
  3. Recall (Churn class)
  4. F1-Score
  5. ROC-AUC
  6. PR-AUC (Precision-Recall AUC — critical for imbalanced data)
  7. Matthews Correlation Coefficient (MCC)
  8. Cohen's Kappa
  9. Balanced Accuracy
  10. Expected Calibration Error (ECE)
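A sketch of the metric suite on a toy prediction vector; scikit-learn covers metrics 1–9, and ECE is computed here with simple equal-width binning (one of several ECE variants):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             balanced_accuracy_score, cohen_kappa_score,
                             f1_score, matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)

# Toy hold-out labels and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.2, 0.6, 0.4, 0.8, 0.7, 0.4, 0.9])
y_pred = (y_prob >= 0.5).astype(int)

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_prob),
    "pr_auc": average_precision_score(y_true, y_prob),
    "mcc": matthews_corrcoef(y_true, y_pred),
    "kappa": cohen_kappa_score(y_true, y_pred),
    "balanced_acc": balanced_accuracy_score(y_true, y_pred),
}

# ECE with equal-width probability bins: weighted mean |confidence - accuracy|.
edges = np.linspace(0, 1, 6)
bin_idx = np.digitize(y_prob, edges[1:-1])
metrics["ece"] = sum(abs(y_prob[bin_idx == b].mean() - y_true[bin_idx == b].mean())
                     * (bin_idx == b).mean() for b in np.unique(bin_idx))
print({k: round(float(v), 3) for k, v in metrics.items()})
```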

6B. SHAP Analysis

  • Global: SHAP summary plot (beeswarm) showing feature importance across full dataset
  • Local: SHAP force plot for individual predictions — customer-level actionable insights
  • Dependence: SHAP dependence plots for top-5 features revealing interaction effects

6C. CLV Scoring

  • Telco: Use explicit CLTV field; multiply by churn probability
  • Bank: Derive CLV proxy; multiply by churn probability
  • Output: Prioritized customer list sorted by RPS (Retention Priority Score)
  • Segment: Top 10% (urgent), 10–30% (high), 30–60% (medium), 60–100% (low)
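The percentile cut-offs above can be applied with pandas; a sketch on simulated scores:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
scored = pd.DataFrame({"p_churn": rng.uniform(0, 1, 1000),
                       "clv": rng.uniform(500, 8000, 1000)})
scored["rps"] = scored["p_churn"] * scored["clv"]

# Percentile position in the RPS ranking (smallest pct = highest RPS).
pct = scored["rps"].rank(pct=True, ascending=False)
scored["segment"] = pd.cut(pct, bins=[0, 0.10, 0.30, 0.60, 1.0],
                           labels=["urgent", "high", "medium", "low"])
print(scored["segment"].value_counts().sort_index())
```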

Phase 7: Deployment, Monitoring & Documentation

  • Model serialization: joblib for sklearn/CatBoost, native formats for XGBoost/LightGBM
  • Inference pipeline: scikit-learn Pipeline + custom transformers
  • Monitoring: Track prediction distribution drift, feature drift, and metric decay over time
  • Documentation: Model card with intended use, limitations, bias analysis, and SHAP summary

7. Implementation Strategy

7.1 Tech Stack

| Layer | Technology | Purpose |
|---|---|---|
| Data Loading | datasets (HF), pandas, polars | Efficient dataset ingestion |
| Preprocessing | scikit-learn (Pipeline, ColumnTransformer, RobustScaler) | Reproducible feature engineering |
| ML Models | xgboost, lightgbm, catboost, scikit-learn (MLP, LR) | Base learners |
| Ensemble | mlens / custom stacking with scikit-learn | Meta-learner orchestration |
| Imbalance | imbalanced-learn (SMOTEENN) | Oversampling + cleaning |
| Optimization | optuna | Hyperparameter search |
| Interpretability | shap | Game-theoretic explanations |
| Tracking | trackio + mlflow | Experiment logging, metrics, artifacts |
| Deployment | gradio / fastapi + Docker | API inference and UI demo |
| Versioning | dvc + git | Data and model versioning |

7.2 4-Week Timeline

| Week | Focus | Deliverables |
|---|---|---|
| Week 1 | Data audit, preprocessing, EDA | Clean notebooks; feature engineering pipeline; data quality report |
| Week 2 | Base model training, hyperparameter tuning | 5 trained base models; Optuna study results; OOF prediction matrices |
| Week 3 | Stacking ensemble, evaluation, SHAP analysis | Trained meta-learner; 10-metric report; SHAP dashboards; CLV scoring |
| Week 4 | Cross-domain testing, deployment, documentation | Generalization report; Gradio demo; model card; final documentation |

7.3 Code Architecture

churnpredict-pro/
├── data/
│   ├── raw/                    # HF datasets (versioned with DVC)
│   ├── processed/              # Train/val/test splits
│   └── engineered/             # Feature-engineered datasets
├── notebooks/
│   ├── 01_eda_telco.ipynb
│   ├── 02_eda_bank.ipynb
│   ├── 03_feature_engineering.ipynb
│   └── 04_shap_analysis.ipynb
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── load_datasets.py    # HF datasets loader
│   │   ├── preprocess.py       # Cleaning + encoding + scaling
│   │   └── feature_engineer.py # RFM, velocity, CLV proxy
│   ├── models/
│   │   ├── base_models.py      # XGB, LGBM, CatBoost, MLP, LR wrappers
│   │   ├── stacking_ensemble.py # OOF + meta-learner
│   │   └── hyperparameter_search.py # Optuna studies
│   ├── evaluation/
│   │   ├── metrics.py          # 10-metric computation
│   │   ├── shap_explainer.py   # Global + local SHAP
│   │   └── clv_scorer.py       # RPS computation
│   └── deployment/
│       ├── inference_pipeline.py
│       └── app.py              # Gradio/FastAPI interface
├── configs/
│   ├── telco_config.yaml
│   └── bank_config.yaml
├── experiments/                # Trackio / MLflow runs
├── tests/
│   ├── test_preprocessing.py
│   └── test_models.py
├── Dockerfile
├── requirements.txt
├── dvc.yaml
└── README.md

8. Experimental Design

8.1 Five Experiments

| ID | Experiment | Hypothesis | Method |
|---|---|---|---|
| E1 | Single Model Baseline | Individual models underperform the ensemble due to bias-variance limitations | Train each of the 5 base models standalone; report metrics |
| E2 | Stacking Ensemble | Meta-learner combining 5 models outperforms the best single model by ≥ 3% F1 | 5-fold OOF stacking with LR meta-learner |
| E3 | Imbalance Strategy Comparison | Threshold calibration is more effective than SMOTE for churn (per [21]) | Compare: (a) no correction, (b) SMOTEENN, (c) class weights, (d) threshold calibration |
| E4 | Cross-Domain Transfer | Models trained on Telco generalize to Bank with ≥ 80% AUC | Train on Telco, evaluate zero-shot on Bank; then fine-tune |
| E5 | CLV-Weighted vs. Uniform Ranking | RPS improves campaign ROI over probability-only ranking | Compare top-20% precision: P(churn) only vs. P(churn) × CLV |

8.2 Ten Evaluation Metrics

| # | Metric | Formula / Definition | Why It Matters for Churn |
|---|---|---|---|
| 1 | Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness; misleading if imbalanced |
| 2 | Precision (Churn) | TP / (TP + FP) | Of predicted churners, how many actually churn? (cost of false alarms) |
| 3 | Recall (Churn) | TP / (TP + FN) | Of actual churners, how many did we catch? (cost of missed churners) |
| 4 | F1-Score | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean; balances precision and recall |
| 5 | ROC-AUC | Area under the ROC curve | Discrimination ability across all thresholds |
| 6 | PR-AUC | Area under the Precision-Recall curve | More informative than ROC-AUC for imbalanced data |
| 7 | MCC | (TP×TN − FP×FN) / √(product of marginals) | Correlation between prediction and truth; robust to imbalance |
| 8 | Cohen's Kappa | (Observed − Expected) / (1 − Expected) | Agreement beyond chance; useful for inter-rater reliability analogies |
| 9 | Balanced Accuracy | (Sensitivity + Specificity) / 2 | Average of recall on both classes; fair on imbalanced data |
| 10 | ECE | Expected Calibration Error | Measures reliability of probability outputs; critical for CLV weighting |

8.3 Statistical Rigor

  1. Confidence Intervals: All metrics reported with 95% CIs from 5-fold CV (bootstrap percentile method)
  2. McNemar's Test: Statistically compare stacking ensemble vs. best single model
  3. DeLong's Test: Compare ROC-AUC differences between models
  4. Permutation Test: Validate feature importance scores from SHAP
  5. Stratification: All splits stratified on target + Contract type (strongest churn predictor) to prevent distribution shift
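A sketch of the percentile-bootstrap CI and an exact McNemar's test on simulated predictions (scipy's binomtest applied to the discordant pairs; all numbers are illustrative):

```python
import numpy as np
from scipy.stats import binomtest
from sklearn.metrics import f1_score
from sklearn.utils import resample

rng = np.random.default_rng(42)

# Simulated hold-out labels and two models' predictions (~80% / ~75% correct).
y_true = rng.integers(0, 2, 500)
pred_a = np.where(rng.uniform(size=500) < 0.80, y_true, 1 - y_true)
pred_b = np.where(rng.uniform(size=500) < 0.75, y_true, 1 - y_true)

# 1) Bootstrap percentile 95% CI on F1 for model A.
boot = [f1_score(y_true[idx], pred_a[idx])
        for idx in (resample(np.arange(500), random_state=b)
                    for b in range(1000))]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"F1 = {f1_score(y_true, pred_a):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")

# 2) Exact McNemar's test on the discordant predictions of A vs. B.
a_ok, b_ok = pred_a == y_true, pred_b == y_true
n01 = int((a_ok & ~b_ok).sum())   # A right, B wrong
n10 = int((~a_ok & b_ok).sum())   # A wrong, B right
p_value = binomtest(n01, n01 + n10, 0.5).pvalue
print(f"McNemar exact p = {p_value:.4f} (discordant pairs: {n01 + n10})")
```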

8.4 Reproducibility Checklist

  • Random seeds fixed (random_state=42) for all stochastic operations
  • requirements.txt with exact versions (via pip freeze)
  • DVC tracking for data and model artifacts
  • Git commit hash recorded with every experiment
  • Trackio / MLflow logging of hyperparameters, metrics, and artifact paths

9. Result Analysis

9.1 Expected Performance

Based on literature benchmarks ([4] achieved 99.28% on telecom; [16] achieved strong results on bank credit risk with SMOTEENN + LightGBM), our targets are conservative and grounded:

| Dataset | Best Single Model F1 | Stacking Ensemble F1 | Expected Δ |
|---|---|---|---|
| Telco | 0.82–0.84 (XGBoost/CatBoost) | 0.86–0.88 | +0.03–0.04 |
| Bank | 0.78–0.81 (LightGBM/XGBoost) | 0.82–0.85 | +0.03–0.04 |

9.2 SHAP Analysis β€” Expected Insights

Based on prior churn research, we anticipate the following feature importance rankings:

Telco (Expected Top 5 SHAP Features):

  1. Contract (Month-to-Month vs. longer) — strongest predictor
  2. Tenure in Months — inverse relationship with churn
  3. Monthly Charge / Total Charges — price sensitivity
  4. Internet Type (Fiber Optic churns more than DSL)
  5. Payment Method (Electronic check = high risk)

Bank (Expected Top 5 SHAP Features):

  1. Total_Trans_Ct (transaction frequency decline)
  2. Total_Trans_Amt (monetary decline)
  3. Months_Inactive_12_mon (recency of activity)
  4. Total_Relationship_Count (cross-product engagement)
  5. Contacts_Count_12_mon (complaint/contact proxy)

9.3 Business Impact Projections

Assuming a hypothetical telecom with:

  • 100,000 customers
  • 25% annual churn rate
  • Average CLV = $3,000
  • Retention campaign cost = $50 per targeted customer
  • Campaign success rate (if well-targeted) = 30%

| Scenario | Customers Targeted | Campaign Cost | Churners Caught | Revenue Saved | Net ROI |
|---|---|---|---|---|---|
| Random targeting (25% churn) | 20,000 | $1,000,000 | 1,500 | $4,500,000 | 4.5× |
| Model-guided (top 20% by RPS) | 20,000 | $1,000,000 | 4,200 | $12,600,000 | 12.6× |

Model-guided targeting improves ROI by ~2.8Γ— over random selection by focusing on high-value, high-probability churners.
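The table's arithmetic can be reproduced directly. One figure is implied rather than stated: 4,200 caught at a 30% success rate means an assumed 14,000 actual churners inside the model-guided target set, i.e. 70% precision in the top-20% RPS segment:

```python
# Hypothetical campaign economics from the scenario above.
targeted = 20_000
cost_per_contact = 50
clv = 3_000
success_rate = 0.30

campaign_cost = targeted * cost_per_contact  # $1,000,000

def scenario(churners_in_target):
    caught = round(churners_in_target * success_rate)
    saved = caught * clv
    return caught, saved, saved / campaign_cost  # gross ROI = revenue saved / cost

# Random targeting: target set mirrors the 25% base churn rate.
random_caught, random_saved, random_roi = scenario(targeted * 0.25)
# Model-guided: assumed 70% precision in the top-20% RPS segment (implied by table).
model_caught, model_saved, model_roi = scenario(14_000)
```

Note the ROI column is gross (revenue saved over campaign cost); netting out the $1M cost would give 3.5× and 11.6× respectively, leaving the ~2.8× relative advantage unchanged.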

9.4 Visualization Plan

| Visualization | Purpose |
|---|---|
| ROC & PR curves (all models overlaid) | Comparative discrimination |
| Confusion matrices | Error type analysis |
| SHAP summary plot (beeswarm) | Global feature importance |
| SHAP force plots (sample customers) | Local explanations for stakeholders |
| SHAP dependence plots | Feature interaction discovery |
| Calibration plot (predicted vs. actual) | Probability reliability |
| CLV-RPS scatter plot | Segmentation visualization |
| Metric bar chart with 95% CIs | Statistical comparison |

10. Iterative Improvement

10.1 Six Iteration Cycles

| Iteration | Focus | Action | Expected Outcome |
|---|---|---|---|
| Iter 1 | Feature Engineering Deep-Dive | Add polynomial features (tenure², charge²); interaction terms (contract × monthly charge); binning (tenure quartiles) | +1–2% F1 from non-linear feature capture |
| Iter 2 | Advanced Sampling | Replace SMOTEENN with ADASYN + Edited Nearest Neighbours; test BorderlineSMOTE | Better synthetic sample quality near decision boundary |
| Iter 3 | Deep Learning Augmentation | Replace MLP with TabNet or FT-Transformer for tabular deep learning; compare against MLP base | Validate whether deep tabular models improve ensemble diversity |
| Iter 4 | Temporal Modeling | For Telco: add LSTM/GRU on quarterly customer journey sequences; for Bank: add transaction time-series | Capture temporal churn dynamics; +2–3% F1 on time-sensitive subsets |
| Iter 5 | Ensemble Expansion | Add 6th base model (Random Forest or Extra Trees) for additional variance reduction; test blending vs. stacking | Further variance reduction; marginal F1 gain of +0.5–1% |
| Iter 6 | Production Hardening | Dockerize inference; add A/B test framework; build automated retraining trigger on drift detection; write full production documentation | Deployable system with monitoring, retraining, and compliance docs |
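One common choice for the Iter 6 drift-detection trigger is the Population Stability Index (PSI) per feature, with retraining fired above a rule-of-thumb threshold. The helper names and the 0.2 threshold below are illustrative assumptions, not part of the pipeline spec:

```python
import numpy as np

def psi(reference, current, n_bins=10):
    """Population Stability Index between two 1-D feature samples.
    Bins are set by reference-sample quantiles, then widened to cover
    out-of-range values in the current sample."""
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    eps = 1e-6  # guard against log(0) in empty bins
    ref_pct = np.clip(ref_pct, eps, None)
    cur_pct = np.clip(cur_pct, eps, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

RETRAIN_THRESHOLD = 0.2  # common rule of thumb: <0.1 stable, 0.1-0.2 watch, >0.2 act

def should_retrain(reference, current):
    return psi(reference, current) > RETRAIN_THRESHOLD
```

In production this check would run per feature (and on the model's score distribution) on a schedule, with the Retraining SOP below documenting thresholds and rollback.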

10.2 Production Documentation Deliverables

| Document | Contents | Audience |
|---|---|---|
| Model Card | Intended use, training data summary, performance metrics, limitations, bias assessment, ethical considerations | Data scientists, regulators |
| API Documentation | Endpoint specs, request/response schemas, rate limits, error codes | Engineering teams |
| SHAP Dashboard Guide | How to read force plots, summary plots, and dependence plots | Business stakeholders, customer success |
| Retention Playbook | How to act on RPS segments; recommended interventions per churn reason | Marketing, customer success |
| Retraining SOP | When and how to retrain; drift detection thresholds; rollback procedures | MLOps, data engineering |
| Compliance Checklist | GDPR Article 22 (automated decision-making), CCPA, internal audit requirements | Legal, compliance |

Appendix A: Key Equations

Retention Priority Score:

$$\text{RPS}_i = P(\text{churn}_i) \times \text{CLV}_i$$

F1-Score:

$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Matthews Correlation Coefficient:

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

Expected Calibration Error:

$$\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right|$$
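The RPS equation translates directly into a targeting helper: multiply calibrated churn probability by CLV, then rank. `top_k_customers` is a hypothetical name for this sketch:

```python
import numpy as np

def retention_priority_score(churn_prob, clv):
    """RPS_i = P(churn_i) * CLV_i, per Appendix A."""
    return np.asarray(churn_prob) * np.asarray(clv)

def top_k_customers(churn_prob, clv, k):
    """Indices of the k customers with the highest RPS (campaign targets)."""
    rps = retention_priority_score(churn_prob, clv)
    return np.argsort(rps)[::-1][:k]
```

Note how a high-CLV customer with moderate churn risk can outrank a low-value customer who is almost certain to leave, which is exactly the behavior the RPS weighting is meant to produce.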


Document compiled for the ChurnPredict Pro project. All datasets verified on Hugging Face Hub. All 21 references span peer-reviewed and high-impact arXiv publications from 2016–2024.
