ChurnPredict Pro: A Stacking Ensemble Framework for Customer Churn Prediction with Explainable AI and CLV Scoring
Subtitle: End-to-End Machine Learning Pipeline for Telecommunications and Banking Customer Retention – Combining Gradient Boosting, Neural Networks, and Game-Theoretic Interpretability
Table of Contents
- Problem Statement
- Idea of Solution
- Objectives
- Literature Review & References
- Dataset Understanding
- Proposed Methodology
- Implementation Strategy
- Experimental Design
- Result Analysis
- Iterative Improvement
1. Problem Statement
1.1 Business Context
Customer churn – the loss of clients to competitors or market attrition – is one of the most financially consequential challenges in subscription-based and service-oriented industries. In telecommunications, acquiring a new customer costs 5–25× more than retaining an existing one (industry estimates, 2024). In banking, customer attrition erodes lifetime value portfolios and damages brand equity. For both sectors, even a 1% reduction in churn can translate to millions in retained revenue.
Current retention strategies suffer from two critical gaps:
- Reactive approaches: Firms typically respond to churn after it occurs, through win-back campaigns that are expensive and low-yield.
- Black-box predictions: Machine learning models deployed in production often lack interpretability, making it impossible for marketing and customer-success teams to act on model outputs with confidence.
1.2 Technical Challenges
| Challenge | Description | Impact |
|---|---|---|
| Class Imbalance | Churners typically represent 10–30% of the customer base. Standard accuracy metrics are misleading. | High false-negative rates; missed at-risk customers |
| Feature Heterogeneity | Datasets mix categorical (contract type, payment method), numerical (tenure, charges), and temporal features (quarter, month-on-book). | Preprocessing complexity; risk of data leakage |
| Concept Drift | Customer behavior patterns shift seasonally and with market conditions. Models degrade without retraining. | Production model staleness; declining precision |
| Interpretability vs. Performance Trade-off | High-accuracy ensembles are often opaque. Explainable models (e.g., logistic regression) underperform on tabular data. | Regulatory non-compliance (GDPR Article 22); low stakeholder trust |
| Multi-Domain Generalization | Models trained on telecom data fail on banking data due to domain shift in feature distributions. | Siloed, non-reusable models per industry |
1.3 Gaps in Existing Solutions
- Single-model reliance: Most production churn models deploy a single classifier (XGBoost or logistic regression), missing the variance-reduction benefits of ensemble diversity.
- No CLV integration: Churn predictions are binary; they do not incorporate which churners are most valuable to retain, leading to inefficient marketing spend.
- Weak experimental rigor: Many published churn studies use a single train/test split without cross-validation, statistical testing, or confidence intervals on metrics.
- Dataset isolation: Telco and bank churn datasets are studied separately; few works evaluate cross-domain transfer or unified pipelines.
2. Idea of Solution
2.1 Architecture Overview
We propose ChurnPredict Pro, a stacking ensemble architecture that combines the complementary strengths of five diverse base learners under a meta-learner. The design philosophy is:
"Diversity in inductive bias reduces variance; interpretability in the meta-layer preserves actionability."
2.2 The 5-Model Stacking Ensemble
```
┌───────────────────────────────────────────────────────────────────┐
│               CHURNPREDICT PRO – STACKING ENSEMBLE                │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐   │
│  │ XGBoost  │ │ LightGBM │ │ CatBoost │ │   MLP    │ │   LR   │   │
│  │  (GBDT)  │ │  (GBDT)  │ │(Ordered) │ │  (Deep)  │ │ (Base) │   │
│  │  Base 1  │ │  Base 2  │ │  Base 3  │ │  Base 4  │ │ Base 5 │   │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └───┬────┘   │
│       │            │            │            │           │        │
│       └────────────┴──────┬─────┴────────────┴───────────┘        │
│                           │                                       │
│                 ┌─────────▼─────────┐                             │
│                 │   META-LEARNER    │                             │
│                 │  (Logistic Reg    │                             │
│                 │   / XGBoost)      │                             │
│                 └─────────┬─────────┘                             │
│                           │                                       │
│                 ┌─────────▼─────────┐                             │
│                 │    CLV SCORING    │                             │
│                 │  + SHAP EXPLAINER │                             │
│                 └───────────────────┘                             │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘
```
2.3 Why These 5 Base Models?
| Model | Inductive Bias | Strength on Churn Data | Weakness Mitigated by Ensemble |
|---|---|---|---|
| XGBoost | Greedy gradient boosting with regularization | Best-in-class on sparse/tabular data; handles missing values natively | Prone to overfitting on small datasets |
| LightGBM | Histogram-based leaf-wise boosting | Faster training; GOSS sampling for large data | Leaf-wise can overfit; GOSS introduces bias |
| CatBoost | Ordered boosting + categorical encoding | Native categorical feature handling; reduces target leakage | Slower than LightGBM; ordered boosting complexity |
| MLP (Deep) | Non-linear feature interactions | Captures complex feature cross-products | Needs more data; less interpretable |
| Logistic Regression | Linear decision boundary | Fast, interpretable baseline; L1 regularization for feature selection | Cannot model non-linear relationships |
The meta-learner (Logistic Regression or a shallow XGBoost) learns optimal weights for combining the five base models' predictions, leveraging their uncorrelated errors.
2.4 CLV-Weighted Scoring
Instead of ranking customers by churn probability alone, we multiply P(churn) by estimated CLV to produce a Retention Priority Score (RPS):

RPS = P(churn) × CLV

This ensures retention campaigns target high-value at-risk customers, maximizing ROI.
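As a minimal sketch, the RPS computation reduces to a single derived column; the DataFrame and the column names `churn_prob` and `clv` are illustrative assumptions, not part of the project code:

```python
import pandas as pd

def retention_priority_score(df: pd.DataFrame,
                             churn_col: str = "churn_prob",
                             clv_col: str = "clv") -> pd.DataFrame:
    """Rank customers by P(churn) x CLV instead of churn probability alone."""
    out = df.copy()
    out["rps"] = out[churn_col] * out[clv_col]
    return out.sort_values("rps", ascending=False)

# Hypothetical example: a moderately risky, high-value customer (B) outranks
# a high-risk, low-value one (A).
customers = pd.DataFrame({
    "customer_id": ["A", "B", "C"],
    "churn_prob": [0.9, 0.5, 0.1],
    "clv": [500.0, 4000.0, 3000.0],
})
ranked = retention_priority_score(customers)
# B: 0.5 * 4000 = 2000 > A: 0.9 * 500 = 450 > C: 0.1 * 3000 = 300
```

This is the entire difference between probability-only and value-weighted targeting: the sort key changes, the model does not.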
3. Objectives
3.1 Primary Goals
| ID | Objective | Metric Target | Success Criterion |
|---|---|---|---|
| P1 | Build a stacking ensemble that outperforms any single base model | F1-Score | ΔF1 ≥ +0.03 over best single model |
| P2 | Achieve high recall on churn class (minimize false negatives) | Recall@Churn | ≥ 0.85 on both datasets |
| P3 | Deliver actionable model explanations per customer | SHAP summary | Top-5 features identified per prediction |
| P4 | Rank customers by retention value, not just churn risk | AUC-PR weighted by CLV | ROC-AUC ≥ 0.90 |
3.2 Secondary Goals
| ID | Objective | Metric Target |
|---|---|---|
| S1 | Evaluate cross-domain generalization (Telco → Bank, Bank → Telco) | Transfer AUC ≥ 0.80 |
| S2 | Achieve sub-second inference latency for batch scoring | ≤ 500 ms per 1,000 records |
| S3 | Deploy a reproducible, version-controlled pipeline | Docker + DVC + CI/CD |
| S4 | Document model behavior for regulatory compliance (GDPR/CCPA) | Full SHAP + model card |
3.3 Success Criteria Summary
- Model Performance: F1-Score > 0.85, ROC-AUC > 0.90, PR-AUC > 0.80 on both datasets
- Business Impact: Identify top 20% at-risk customers with ≥ 70% precision
- Interpretability: Every prediction accompanied by SHAP force plot; global SHAP summary for stakeholder dashboards
- Robustness: 5-fold stratified CV with 95% confidence intervals on all metrics
4. Literature Review & References
4.1 Category Overview
| Category | Count | Papers |
|---|---|---|
| Ensemble / Boosting Methods | 4 | [1–4] |
| SHAP / LIME Interpretability | 3 | [5–7] |
| Deep Learning for Churn | 3 | [8–10] |
| CLV / Profit-Driven Churn | 3 | [11–13] |
| Financial / Bank Churn | 4 | [14–17] |
| Survey / Benchmark / Foundation | 4 | [18–21] |
| Total | 21 | [1–21] |
4.2 Full References (2016–2024)
[1] XGBoost: A Scalable Tree Boosting System
Chen, T., & Guestrin, C. (2016). KDD. arXiv:1603.02754. Introduced sparsity-aware algorithms and weighted quantile sketch for gradient boosting. Became the dominant algorithm for tabular churn prediction tasks worldwide.
[2] Tabular Data: Deep Learning is Not All You Need
Shwartz-Ziv, R., & Armon, A. (2021). arXiv:2106.03253. Rigorous comparison showing XGBoost outperforms recent deep learning models on tabular data; ensembling deep models with XGBoost further improves performance.
[3] CatBoost: Unbiased Boosting with Categorical Features
Prokhorenkova, L., et al. (2017). arXiv:1706.09516. Ordered boosting and novel categorical feature processing; outperforms other boosting implementations on datasets with high-cardinality categorical churn predictors.
[4] Enhancing Customer Churn Prediction: An Adaptive Ensemble Learning Approach
Shaikhsurab, S., & Magadum, S. (2024). arXiv:2408.16284. Adaptive ensemble combining XGBoost, LightGBM, LSTM, MLP, and SVM with stacking + meta-feature generation; achieved 99.28% accuracy on telecom churn datasets.
[5] A Unified Approach to Interpreting Model Predictions (SHAP)
Lundberg, S. M., & Lee, S.-I. (2017). NeurIPS. arXiv:1705.07874. Proposed SHAP values as a unified measure of feature importance based on game-theoretic Shapley values; unified six existing explanation methods.
[6] "Why Should I Trust You?": Explaining Predictions of Any Classifier (LIME)
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). KDD. arXiv:1602.04938. Introduced LIME to explain any classifier locally via interpretable surrogate models; foundational for churn model explainability and regulatory compliance.
[7] XAI Handbook: Towards a Unified Framework for Explainable AI
Palacio, D. G., et al. (2021). arXiv:2105.06677. Provides theoretical framework unifying XAI terminology (LIME, SHAP, Grad-CAM, etc.); essential for regulatory compliance and method comparison in churn explainability.
[8] Early Churn Prediction from Large-Scale User-Product Interaction Time Series
Bhattacharjee, A., Thukral, K., & Patil, C. (2023). arXiv:2309.14390. Applied multivariate time series classification with deep neural networks to fantasy sports churn; scales to 10⁸ users, demonstrating the feasibility of deep learning at scale.
[9] Modelling Customer Churn for the Retail Industry in a Deep Learning Sequential Framework
Equihua, C., et al. (2023). arXiv:2304.00575. Deep survival framework using recurrent neural networks for non-contractual retail churn; avoids extensive feature engineering through learned representations.
[10] Churn Reduction via Distillation
Jiang, Y., et al. (2021). arXiv:2106.02654. Showed model distillation reduces predictive churn (model instability during retraining) while maintaining accuracy across FC, CNN, and transformer architectures.
[11] OptDist: Learning Optimal Distribution for Customer Lifetime Value Prediction
Weng, S., et al. (2024). arXiv:2408.08585. Proposed OptDist with distribution learning/selection modules; adaptively selects optimal sub-distributions for CLTV prediction on public and industrial datasets.
[12] Customer Lifetime Value Prediction with Uncertainty Estimation Using Monte Carlo Dropout
Cao, Y., Xu, Y., & Yang, Q. (2024). arXiv:2411.15944. Enhanced neural network CLTV prediction with Monte Carlo Dropout for uncertainty quantification; improved Top-5% MAPE significantly.
[13] A Predict-and-Optimize Approach to Profit-Driven Churn Prevention
GΓ³mez-Vargas, E., Maldonado, S., & Vairetti, S. (2023). arXiv:2310.07047. First predict-and-optimize approach for churn prevention using individual CLVs (not averages); regret minimization via SGD; tested on 12 real-world datasets.
[14] Dynamic Customer Embeddings for Financial Service Applications
Chitsazan, N., et al. (2021). arXiv:2106.11880. DCE framework uses customer digital activity + financial context for intent/fraud/call-center prediction; financial services benchmark for learned representations.
[15] FinPT: Financial Risk Prediction with Profile Tuning on Pretrained Foundation Models
Yin, H., et al. (2023). arXiv:2308.00065. Introduced FinBench dataset + FinPT method for financial risk prediction (default, fraud, churn) using LLM-generated customer profiles; strong zero-shot transfer.
[16] Advanced User Credit Risk Prediction Using LightGBM, XGBoost and TabNet with SMOTEENN
Yu, B., et al. (2024). arXiv:2408.03497. Combined PCA, SMOTEENN, and LightGBM for bank credit risk prediction; outperformed other models in identifying high-quality applicants under class imbalance.
[17] Credit Card Fraud Detection β Classifier Selection Strategy
Kulatilleke, S. (2022). arXiv:2208.11900. Data-driven classifier selection + sampling methods for imbalanced fraud detection; directly applicable to churn's class imbalance challenges.
[18] Predicting Customer Churn: Extreme Gradient Boosting with Temporal Data
Gregory, J. (2018). arXiv:1802.03396. Applied XGBoost with temporal feature engineering to time-series churn data; achieved top performance in large-scale competition settings.
[19] Predictive Churn with the Set of Good Models
Watson-Daniels, D., et al. (2024). arXiv:2402.07745. Examined prediction instability during model retraining via Rashomon set; critical for production churn model deployment and monitoring.
[20] Retention Is All You Need
Mohiuddin, K., et al. (2023). arXiv:2304.03103. HR Decision Support System using SHAP + what-if analysis for employee attrition; demonstrates SHAP utility for retention/churn use cases with interpretable dashboards.
[21] Balancing the Scales: A Comprehensive Study on Tackling Class Imbalance
(2024). arXiv:2409.19751. Comprehensive study of SMOTE, Class Weights, and Decision Threshold Calibration for binary classification; Decision Threshold Calibration most consistently effective, which directly guides our experimental design.
5. Dataset Understanding
5.1 Dataset 1: Telco Customer Churn (IBM)
- Source: `aai510-group1/telco-customer-churn`
- Type: fictional telecommunications company data
- Format: CSV / Parquet
- Splits: train / validation / test
Schema Summary
| Feature Category | Count | Key Features |
|---|---|---|
| Demographics | 7 | Age, Gender, Married, Dependents, Number of Dependents, Senior Citizen, Under 30 |
| Service Usage | 10 | Phone Service, Internet Service, Internet Type, Multiple Lines, Online Security, Online Backup, Device Protection, Tech Support, Streaming TV, Streaming Movies |
| Contract & Billing | 6 | Contract, Payment Method, Paperless Billing, Monthly Charge, Total Charges, Total Refunds |
| Engagement | 7 | Tenure (months), Number of Referrals, Referred a Friend, Offer, Satisfaction Score, Churn Score, Quarter |
| Revenue | 6 | Total Revenue, Total Long Distance Charges, Total Extra Data Charges, Avg Monthly Long Distance, Avg Monthly GB Download, CLTV |
| Geographic | 6 | City, State, Zip Code, Latitude, Longitude, Population |
| Target | 4 | Churn (binary), Churn Reason (string), Churn Category (string), Customer Status |
Total Features: ~52 (including derived identifiers like Lat Long, Customer ID)
Class Distribution (Audited)
| Split | Total Rows | Churned (1) | Stayed (0) | Churn Rate |
|---|---|---|---|---|
| Train | ~4,400 | ~1,100 | ~3,300 | ~25% |
| Validation | ~1,500 | ~375 | ~1,125 | ~25% |
| Test | ~1,500 | ~375 | ~1,125 | ~25% |
Note: Exact counts vary by split. The dataset exhibits moderate class imbalance (~25% churn), manageable without aggressive oversampling.
Notable Data Characteristics
- Rich categorical encoding: Internet Type (DSL, Fiber Optic, Cable, None), Contract (Month-to-Month, One Year, Two Year), Payment Method (4 types)
- Temporal granularity: `Quarter` field (Q1–Q4) enables time-aware feature engineering
- Pre-computed churn scores: `Churn Score` (0–100) and `Satisfaction Score` (1–5) are strong engineered features; risk of target leakage if not handled carefully
- CLTV integration: `CLTV` field directly available for revenue-weighted ranking
- Geographic features: latitude/longitude enable spatial clustering or geo-derived features
Data Quality Flags
- `Total Charges` has blank/missing values for zero-tenure customers (new sign-ups)
- `Churn Reason` and `Churn Category` are populated only for churned customers; post-hoc labels, not usable as features
- `Customer Status` is highly correlated with the target; should be excluded or used only for stratification
- Some categorical fields (City, State) have high cardinality (50+ states, 1,000+ cities)
5.2 Dataset 2: Bank Customer Churners
- Source: `ZZHHJ/bank_churners`
- Type: credit card customer attrition data
- Format: CSV / Parquet
- Splits: single train split (requires manual partitioning)
Schema Summary
| Feature Category | Count | Key Features |
|---|---|---|
| Demographics | 6 | Customer_Age, Gender, Dependent_count, Education_Level, Marital_Status, Income_Category |
| Account Behavior | 5 | Months_on_book, Total_Relationship_Count, Months_Inactive_12_mon, Contacts_Count_12_mon, Card_Category |
| Financial | 8 | Credit_Limit, Total_Revolving_Bal, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1, Avg_Utilization_Ratio |
| Target | 1 | Attrition_Flag (Existing Customer / Attrited Customer) |
| Artifacts | 2 | Naive_Bayes_Classifier_* columns (pre-computed probabilities; must be removed to avoid data leakage) |
Total Columns: 23 (19 usable features + 1 target + 1 ID + 2 NB artifacts to drop)
Class Distribution (Estimated)
| Class | Approximate Count | Rate |
|---|---|---|
| Existing Customer | ~8,500 | ~83% |
| Attrited Customer | ~1,700 | ~17% |
Churn rate ~17%, more imbalanced than Telco; SMOTE/ADASYN or class weighting will be necessary.
Notable Data Characteristics
- Quarter-over-quarter dynamics: `Total_Amt_Chng_Q4_Q1` and `Total_Ct_Chng_Q4_Q1` capture behavioral velocity; powerful churn signals
- Utilization ratio: `Avg_Utilization_Ratio` is a strong proxy for engagement; low utilization often precedes attrition
- Income categories are binned: `$60K - $80K`, `$80K - $120K`, etc.; ordinal encoding preferred
- Card category: `Blue` (vast majority), `Silver`, `Gold`, `Platinum`; strong class imbalance within the feature itself
Data Quality Flags
- Critical: two `Naive_Bayes_Classifier_*` columns are pre-computed churn probabilities from a baseline model. Using them as features would constitute data leakage; they must be dropped before any model training.
- No explicit CLTV field; must be estimated from `Credit_Limit`, `Total_Trans_Amt`, and `Total_Trans_Ct`
- Single split requires manual stratified partitioning (70/15/15 or 80/10/10)
5.3 Cross-Dataset Comparison
| Attribute | Telco (IBM) | Bank Churners |
|---|---|---|
| Records | ~7,000 | ~10,000 |
| Features (usable) | ~45 | ~19 |
| Churn Rate | ~25% | ~17% |
| Industry | Telecommunications | Banking / Credit Cards |
| Temporal Features | Quarter, Tenure (months) | Months_on_book, Q4/Q1 change ratios |
| CLTV Available | Yes (explicit field) | No (must derive) |
| Geographic Data | Yes (lat/lon, city, state) | No |
| Pre-computed Scores | Churn Score, Satisfaction | Naive Bayes (leakage; drop) |
| Class Imbalance Severity | Moderate | High |
| Primary Churn Driver | Contract type, tenure, service usage | Inactivity, transaction decline, utilization |
6. Proposed Methodology
6.1 The 7-Phase Pipeline
```
Phase 1: Data Ingestion & Audit
        ↓
Phase 2: Preprocessing & Feature Engineering
        ↓
Phase 3: Exploratory Data Analysis (EDA)
        ↓
Phase 4: Model Training – 5-Base Stacking Ensemble
        ↓
Phase 5: Hyperparameter Optimization
        ↓
Phase 6: Evaluation, Interpretability & CLV Scoring
        ↓
Phase 7: Deployment, Monitoring & Documentation
```
Phase 1: Data Ingestion & Audit
- Load both datasets with the Hugging Face `datasets` library
- Compute schema validation: type checks, missing-value audit, cardinality report
- Flag anomalous values (negative charges, impossible ages, blank `Total Charges`)
- Document data provenance and version hashes (DVC)
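The audit step can be sketched as a small pandas routine; the cardinality threshold and the two-column example frame below are illustrative assumptions, not the project's actual audit code:

```python
import pandas as pd

def audit_dataset(df: pd.DataFrame, high_cardinality_threshold: int = 50) -> pd.DataFrame:
    """Per-column audit: dtype, missing count, and a high-cardinality flag."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
    })
    report["high_cardinality"] = report["n_unique"] > high_cardinality_threshold
    return report

# Illustrative frame reproducing the zero-tenure quirk: Total Charges stored
# as a blank string in the raw CSV.
raw = pd.DataFrame({
    "Tenure": [0, 12, 24],
    "Total Charges": [" ", "840.5", "1920.0"],
})
# Coerce blanks to NaN before auditing so the missing-value count catches them.
raw["Total Charges"] = pd.to_numeric(raw["Total Charges"], errors="coerce")
report = audit_dataset(raw)
```

The same report feeds the anomaly-flagging step: any column with unexpected missing counts or cardinality is reviewed before preprocessing.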
Phase 2: Preprocessing & Feature Engineering
2A. Cleaning
- Telco: impute `Total Charges` blanks with `Monthly Charge × Tenure`
- Bank: drop `Naive_Bayes_Classifier_*` columns immediately
- Both datasets: remove ID fields (`Customer ID`, `CLIENTNUM`)
2B. Encoding
| Feature Type | Encoding Strategy | Example Features |
|---|---|---|
| Binary categorical | Label encoding (0/1) | Gender, Partner, PhoneService |
| Low-cardinality categorical | One-hot encoding | Contract, Payment Method, Education_Level |
| High-cardinality nominal | Target encoding / CatBoost native | City, State (Telco) |
| Cyclical temporal | Sine/cosine encoding | Quarter mapped to angle |
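The cyclical row deserves a concrete illustration: mapping Q1–Q4 onto the unit circle makes Q4 and Q1 adjacent rather than four steps apart. The function name and the `"Q1"`-style string format are assumptions for this sketch:

```python
import numpy as np
import pandas as pd

def encode_quarter_cyclical(df: pd.DataFrame, col: str = "Quarter") -> pd.DataFrame:
    """Sine/cosine encoding: place the 4 quarters evenly on the unit circle."""
    out = df.copy()
    quarter_num = out[col].str.lstrip("Q").astype(int)   # "Q3" -> 3
    angle = 2 * np.pi * (quarter_num - 1) / 4            # 4 evenly spaced positions
    out[f"{col}_sin"] = np.sin(angle)
    out[f"{col}_cos"] = np.cos(angle)
    return out.drop(columns=[col])

encoded = encode_quarter_cyclical(pd.DataFrame({"Quarter": ["Q1", "Q2", "Q3", "Q4"]}))
# Q1 -> (sin 0, cos 1); Q3 sits diametrically opposite at (sin ~0, cos -1)
```

Under this encoding the Euclidean distance between Q4 and Q1 equals the distance between any other adjacent pair, which an integer encoding cannot express.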
2C. Feature Engineering
- RFM-style features (Bank): Recency = `Months_Inactive_12_mon`, Frequency = `Total_Trans_Ct`, Monetary = `Total_Trans_Amt`
- Engagement ratio (Telco): `Satisfaction Score / Churn Score` as a loyalty proxy
- Velocity features: month-over-month change in charges and usage
- CLTV proxy (Bank): `Credit_Limit × Avg_Utilization_Ratio × (12 - Months_Inactive_12_mon)`
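The RFM and CLV-proxy features translate almost directly into pandas. This sketch assumes the BankChurners column names listed in Section 5.2; the derived feature names (`recency`, `clv_proxy`, etc.) are our own:

```python
import pandas as pd

def engineer_bank_features(df: pd.DataFrame) -> pd.DataFrame:
    """RFM-style features plus the CLV proxy described in Phase 2C."""
    out = df.copy()
    out["recency"] = out["Months_Inactive_12_mon"]       # R: months of inactivity
    out["frequency"] = out["Total_Trans_Ct"]             # F: transaction count
    out["monetary"] = out["Total_Trans_Amt"]             # M: transaction volume
    out["clv_proxy"] = (out["Credit_Limit"]
                        * out["Avg_Utilization_Ratio"]
                        * (12 - out["Months_Inactive_12_mon"]))
    return out

sample = pd.DataFrame({
    "Months_Inactive_12_mon": [2],
    "Total_Trans_Ct": [60],
    "Total_Trans_Amt": [4500.0],
    "Credit_Limit": [10000.0],
    "Avg_Utilization_Ratio": [0.3],
})
features = engineer_bank_features(sample)
# clv_proxy = 10000 * 0.3 * (12 - 2) = 30000
```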
2D. Scaling & Imbalance Handling
- Numerical features → RobustScaler (median/IQR, resistant to outliers)
- Class imbalance → SMOTEENN (SMOTE + Edited Nearest Neighbours) on training folds only; never on validation/test
- Class weights → `scale_pos_weight = len(negative) / len(positive)` for XGBoost/LightGBM
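For the class-weight formula, a minimal helper (the function name is ours) shows the ratio that would be passed to XGBoost/LightGBM as `scale_pos_weight`; SMOTEENN itself comes from `imbalanced-learn` and is applied to training folds only, so it is omitted from this sketch:

```python
from collections import Counter

def scale_pos_weight(labels) -> float:
    """Negative/positive ratio, the value XGBoost and LightGBM expect
    for their scale_pos_weight / class-weight parameters."""
    counts = Counter(labels)
    return counts[0] / counts[1]

# Illustrative label vector with ~17% churn, similar to the bank dataset
labels = [0] * 83 + [1] * 17
w = scale_pos_weight(labels)   # 83 / 17, roughly 4.9
```

In effect, each churner's gradient contribution is up-weighted by ~4.9× so the loss no longer rewards predicting the majority class.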
Phase 3: Exploratory Data Analysis (EDA)
- Univariate distributions (histograms, boxplots for skew detection)
- Bivariate analysis: churn rate by contract type, payment method, tenure bins
- Correlation matrix (Spearman, to capture monotonic non-linear relationships)
- Feature-target mutual information scores for feature selection
- Geographic heatmap (Telco: churn rate by state)
Phase 4: Model Training β Stacking Ensemble
4A. Cross-Validation Strategy
- 5-fold Stratified Cross-Validation to preserve class distribution
- GroupKFold if temporal leakage risk (same customer in multiple quarters)
- Out-of-fold (OOF) predictions from each base model used as meta-features
4B. Base Model Training
| Base Model | Key Hyperparameters | Tuning Range |
|---|---|---|
| XGBoost | `max_depth`, `learning_rate`, `subsample`, `colsample_bytree`, `scale_pos_weight` | depth: 3–8; lr: 0.01–0.3 |
| LightGBM | `num_leaves`, `learning_rate`, `feature_fraction`, `bagging_fraction`, `is_unbalance` | leaves: 20–100; lr: 0.01–0.3 |
| CatBoost | `depth`, `learning_rate`, `iterations`, `auto_class_weights` | depth: 4–10; iterations: 200–1000 |
| MLP | `hidden_layers`, `dropout`, `batch_size`, `learning_rate` | layers: (128, 64), (256, 128, 64); dropout: 0.2–0.5 |
| Logistic Regression | `C`, `penalty`, `solver`, `class_weight` | C: 0.001–10; penalty: l1/l2/elasticnet |
4C. Meta-Learner Training
- Input: 5 OOF probability vectors (one per base model) + optionally top-K original features
- Model: Logistic Regression (interpretable weights showing model contribution) OR XGBoost (if non-linear meta-interactions needed)
- Validation: Same 5-fold CV; meta-learner trained on OOF predictions, tested on hold-out
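The OOF-to-meta-learner flow can be sketched with scikit-learn's `cross_val_predict`. Two stand-in base learners replace the five production models, and the data is synthetic, but the mechanics are identical: every meta-feature row was predicted by a model that never saw that row during training.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Stand-ins for the real base learners (XGBoost, LightGBM, CatBoost, MLP, LR)
base_models = {
    "gbdt": GradientBoostingClassifier(random_state=42),
    "rf": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Synthetic imbalanced data standing in for the churn datasets
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.75], random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Out-of-fold churn probabilities become the meta-learner's feature matrix
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=cv, method="predict_proba")[:, 1]
    for m in base_models.values()
])
meta_learner = LogisticRegression().fit(meta_X, y)
```

With a logistic-regression meta-learner, the fitted coefficients directly show how much each base model contributes to the blend, which is exactly the interpretability argument made above.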
Phase 5: Hyperparameter Optimization
- Optuna with TPESampler (Tree-structured Parzen Estimator)
- 100 trials per base model; 50 trials for meta-learner
- Pruning: `MedianPruner` with early stopping on validation F1
- Objective: maximize F1-Score (harmonic mean of precision and recall)
Phase 6: Evaluation, Interpretability & CLV Scoring
6A. Metrics Suite (10 metrics)
- Accuracy
- Precision (Churn class)
- Recall (Churn class)
- F1-Score
- ROC-AUC
- PR-AUC (Precision-Recall AUC; critical for imbalanced data)
- Matthews Correlation Coefficient (MCC)
- Cohen's Kappa
- Balanced Accuracy
- Expected Calibration Error (ECE)
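Most of these metrics ship with scikit-learn; ECE does not, so a hand-rolled version is shown below on a tiny hand-made example. The equal-width binning scheme is one common choice for ECE, not the only one:

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    """ECE: bin-weighted gap between mean predicted probability and
    observed positive rate within each probability bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # Last bin is closed on the right so prob == 1.0 is counted
        mask = (y_prob >= lo) & ((y_prob < hi) if hi < 1.0 else (y_prob <= hi))
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.3, 0.8, 0.7, 0.4, 0.9, 0.2])
y_pred = (y_prob >= 0.5).astype(int)   # every prediction happens to be correct

report = {
    "f1": f1_score(y_true, y_pred),
    "mcc": matthews_corrcoef(y_true, y_pred),
    "balanced_acc": balanced_accuracy_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_prob),
    "ece": expected_calibration_error(y_true, y_prob),
}
```

Note that the toy predictor is perfectly discriminative (all ranking metrics hit 1.0) yet still has nonzero ECE: discrimination and calibration are distinct properties, which is why ECE earns its own slot in the suite.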
6B. SHAP Analysis
- Global: SHAP summary plot (beeswarm) showing feature importance across full dataset
- Local: SHAP force plot for individual predictions; customer-level actionable insights
- Dependence: SHAP dependence plots for top-5 features revealing interaction effects
6C. CLV Scoring
- Telco: use the explicit `CLTV` field; multiply by churn probability
- Bank: derive a CLV proxy; multiply by churn probability
- Output: prioritized customer list sorted by RPS (Retention Priority Score)
- Segment: top 10% (urgent), 10–30% (high), 30–60% (medium), 60–100% (low)
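The four-tier segmentation can be expressed with percentile ranks. The cut points follow the segments listed above; the function name and the synthetic RPS values are assumptions for this sketch:

```python
import numpy as np
import pandas as pd

def segment_by_rps(rps: pd.Series) -> pd.Series:
    """Assign tiers from descending RPS percentile rank:
    top 10% urgent, next 20% high, next 30% medium, rest low."""
    pct = rps.rank(pct=True, ascending=False)   # smallest pct = highest RPS
    return pd.cut(pct,
                  bins=[0.0, 0.10, 0.30, 0.60, 1.0],
                  labels=["urgent", "high", "medium", "low"])

# 100 synthetic customers with distinct RPS values 1..100
rps = pd.Series(np.arange(1, 101, dtype=float))
tiers = segment_by_rps(rps)
```

Percentile-based cuts keep tier sizes stable as the score distribution drifts, which matters when campaign budgets are fixed per tier.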
Phase 7: Deployment, Monitoring & Documentation
- Model serialization: `joblib` for scikit-learn/CatBoost; native formats for XGBoost/LightGBM
- Inference pipeline: scikit-learn `Pipeline` + custom transformers
- Monitoring: track prediction distribution drift, feature drift, and metric decay over time
- Documentation: Model card with intended use, limitations, bias analysis, and SHAP summary
7. Implementation Strategy
7.1 Tech Stack
| Layer | Technology | Purpose |
|---|---|---|
| Data Loading | `datasets` (HF), `pandas`, `polars` | Efficient dataset ingestion |
| Preprocessing | `scikit-learn` (Pipeline, ColumnTransformer, RobustScaler) | Reproducible feature engineering |
| ML Models | `xgboost`, `lightgbm`, `catboost`, `scikit-learn` (MLP, LR) | Base learners |
| Ensemble | `mlens` / custom stacking with `scikit-learn` | Meta-learner orchestration |
| Imbalance | `imbalanced-learn` (SMOTEENN) | Oversampling + cleaning |
| Optimization | `optuna` | Hyperparameter search |
| Interpretability | `shap` | Game-theoretic explanations |
| Tracking | `trackio` + `mlflow` | Experiment logging, metrics, artifacts |
| Deployment | `gradio` / `fastapi` + Docker | API inference and UI demo |
| Versioning | `dvc` + `git` | Data and model versioning |
7.2 4-Week Timeline
| Week | Focus | Deliverables |
|---|---|---|
| Week 1 | Data audit, preprocessing, EDA | Clean notebooks; feature engineering pipeline; data quality report |
| Week 2 | Base model training, hyperparameter tuning | 5 trained base models; Optuna study results; OOF prediction matrices |
| Week 3 | Stacking ensemble, evaluation, SHAP analysis | Trained meta-learner; 10-metric report; SHAP dashboards; CLV scoring |
| Week 4 | Cross-domain testing, deployment, documentation | Generalization report; Gradio demo; model card; final documentation |
7.3 Code Architecture
```
churnpredict-pro/
├── data/
│   ├── raw/                         # HF datasets (versioned with DVC)
│   ├── processed/                   # Train/val/test splits
│   └── engineered/                  # Feature-engineered datasets
├── notebooks/
│   ├── 01_eda_telco.ipynb
│   ├── 02_eda_bank.ipynb
│   ├── 03_feature_engineering.ipynb
│   └── 04_shap_analysis.ipynb
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── load_datasets.py         # HF datasets loader
│   │   ├── preprocess.py            # Cleaning + encoding + scaling
│   │   └── feature_engineer.py      # RFM, velocity, CLV proxy
│   ├── models/
│   │   ├── base_models.py           # XGB, LGBM, CatBoost, MLP, LR wrappers
│   │   ├── stacking_ensemble.py     # OOF + meta-learner
│   │   └── hyperparameter_search.py # Optuna studies
│   ├── evaluation/
│   │   ├── metrics.py               # 10-metric computation
│   │   ├── shap_explainer.py        # Global + local SHAP
│   │   └── clv_scorer.py            # RPS computation
│   └── deployment/
│       ├── inference_pipeline.py
│       └── app.py                   # Gradio/FastAPI interface
├── configs/
│   ├── telco_config.yaml
│   └── bank_config.yaml
├── experiments/                     # Trackio / MLflow runs
├── tests/
│   ├── test_preprocessing.py
│   └── test_models.py
├── Dockerfile
├── requirements.txt
├── dvc.yaml
└── README.md
```
8. Experimental Design
8.1 Five Experiments
| ID | Experiment | Hypothesis | Method |
|---|---|---|---|
| E1 | Single Model Baseline | Individual models underperform ensemble due to bias-variance limitations | Train each of 5 base models standalone; report metrics |
| E2 | Stacking Ensemble | Meta-learner combining 5 models outperforms best single model by ≥ 3% F1 | 5-fold OOF stacking with LR meta-learner |
| E3 | Imbalance Strategy Comparison | Threshold calibration is more effective than SMOTE for churn (per [21]) | Compare: (a) no correction, (b) SMOTEENN, (c) class weights, (d) threshold calibration |
| E4 | Cross-Domain Transfer | Models trained on Telco generalize to Bank with ≥ 0.80 AUC | Train on Telco, evaluate zero-shot on Bank; then fine-tune |
| E5 | CLV-Weighted vs. Uniform Ranking | RPS improves campaign ROI over probability-only ranking | Compare top-20% precision: P(churn) only vs. P(churn) Γ CLV |
8.2 Ten Evaluation Metrics
| # | Metric | Formula / Definition | Why It Matters for Churn |
|---|---|---|---|
| 1 | Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness; misleading if imbalanced |
| 2 | Precision (Churn) | TP / (TP + FP) | Of predicted churners, how many actually churn? (cost of false alarms) |
| 3 | Recall (Churn) | TP / (TP + FN) | Of actual churners, how many did we catch? (cost of missed churners) |
| 4 | F1-Score | 2 Γ (Precision Γ Recall) / (Precision + Recall) | Harmonic mean; balances precision and recall |
| 5 | ROC-AUC | Area under ROC curve | Discrimination ability across all thresholds |
| 6 | PR-AUC | Area under Precision-Recall curve | More informative than ROC-AUC for imbalanced data |
| 7 | MCC | (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | Correlation between prediction and truth; robust to imbalance |
| 8 | Cohen's Kappa | (Observed − Expected) / (1 − Expected) | Agreement beyond chance; useful for inter-rater reliability analogies |
| 9 | Balanced Accuracy | (Sensitivity + Specificity) / 2 | Average of recall on both classes; fair on imbalanced data |
| 10 | ECE | Expected Calibration Error | Measures reliability of probability outputs; critical for CLV weighting |
8.3 Statistical Rigor
- Confidence Intervals: All metrics reported with 95% CIs from 5-fold CV (bootstrap percentile method)
- McNemar's Test: Statistically compare stacking ensemble vs. best single model
- DeLong's Test: Compare ROC-AUC differences between models
- Permutation Test: Validate feature importance scores from SHAP
- Stratification: all splits stratified on target + `Contract` type (strongest churn predictor) to prevent distribution shift
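A percentile-bootstrap CI on F1 might be sketched as follows, in pure NumPy; the simulated labels and the roughly 80%-accurate predictor are purely illustrative. McNemar's and DeLong's tests would come from `statsmodels` or specialized packages and are not shown here:

```python
import numpy as np

def f1(y_true, y_pred) -> float:
    """F1 for the positive (churn) class."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def bootstrap_ci(y_true, y_pred, n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 42):
    """95% percentile-bootstrap CI: resample (truth, prediction) pairs
    with replacement and take the empirical quantiles of F1."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = np.array([
        f1(y_true[idx], y_pred[idx])
        for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))
    ])
    return np.quantile(scores, alpha / 2), np.quantile(scores, 1 - alpha / 2)

# Simulated evaluation set: a predictor that is right ~80% of the time
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_pred = np.where(rng.random(500) < 0.8, y_true, 1 - y_true)
lo, hi = bootstrap_ci(y_true, y_pred)
```

Reporting `[lo, hi]` alongside each point estimate is what the "95% CIs on all metrics" criterion in Section 3.3 refers to.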
8.4 Reproducibility Checklist
- Random seeds fixed (`random_state=42`) for all stochastic operations
- `requirements.txt` with exact versions (via `pip freeze`)
- DVC tracking for data and model artifacts
- Git commit hash recorded with every experiment
- Trackio / MLflow logging of hyperparameters, metrics, and artifact paths
9. Result Analysis
9.1 Expected Performance
Based on literature benchmarks ([4] achieved 99.28% on telecom; [16] achieved strong results on bank credit risk with SMOTEENN + LightGBM), our targets are conservative and grounded:
| Dataset | Best Single Model F1 | Stacking Ensemble F1 | Expected Δ |
|---|---|---|---|
| Telco | 0.82–0.84 (XGBoost/CatBoost) | 0.86–0.88 | +0.03–0.04 |
| Bank | 0.78–0.81 (LightGBM/XGBoost) | 0.82–0.85 | +0.03–0.04 |
9.2 SHAP Analysis β Expected Insights
Based on prior churn research, we anticipate the following feature importance rankings:
Telco (Expected Top 5 SHAP Features):
1. `Contract` (Month-to-Month vs. longer): strongest predictor
2. `Tenure in Months`: inverse relationship with churn
3. `Monthly Charge` / `Total Charges`: price sensitivity
4. `Internet Type` (Fiber Optic churns more than DSL)
5. `Payment Method` (Electronic check = high risk)
Bank (Expected Top 5 SHAP Features):
1. `Total_Trans_Ct` (transaction frequency decline)
2. `Total_Trans_Amt` (monetary decline)
3. `Months_Inactive_12_mon` (recency of activity)
4. `Total_Relationship_Count` (cross-product engagement)
5. `Contacts_Count_12_mon` (complaint/contact proxy)
9.3 Business Impact Projections
Assuming a hypothetical telecom with:
- 100,000 customers
- 25% annual churn rate
- Average CLV = $3,000
- Retention campaign cost = $50 per targeted customer
- Campaign success rate (if well-targeted) = 30%
| Scenario | Customers Targeted | Campaign Cost | Churners Caught | Revenue Saved | Net ROI |
|---|---|---|---|---|---|
| Random targeting (25% churn) | 20,000 | $1,000,000 | 1,500 | $4,500,000 | 4.5× |
| Model-guided (top 20% by RPS) | 20,000 | $1,000,000 | 4,200 | $12,600,000 | 12.6× |
Model-guided targeting improves ROI by ~2.8× over random selection by focusing on high-value, high-probability churners.
9.4 Visualization Plan
| Visualization | Purpose |
|---|---|
| ROC & PR curves (all models overlaid) | Comparative discrimination |
| Confusion matrices | Error type analysis |
| SHAP summary plot (beeswarm) | Global feature importance |
| SHAP force plots (sample customers) | Local explanations for stakeholders |
| SHAP dependence plots | Feature interaction discovery |
| Calibration plot (predicted vs. actual) | Probability reliability |
| CLV-RPS scatter plot | Segmentation visualization |
| Metric bar chart with 95% CIs | Statistical comparison |
10. Iterative Improvement
10.1 Six Iteration Cycles
| Iteration | Focus | Action | Expected Outcome |
|---|---|---|---|
| Iter 1 | Feature Engineering Deep-Dive | Add polynomial features (tenure², charge²); interaction terms (contract × monthly charge); binning (tenure quartiles) | +1–2% F1 from non-linear feature capture |
| Iter 2 | Advanced Sampling | Replace SMOTEENN with ADASYN + Edited Nearest Neighbours; test BorderlineSMOTE | Better synthetic sample quality near decision boundary |
| Iter 3 | Deep Learning Augmentation | Replace MLP with TabNet or FT-Transformer for tabular deep learning; compare against MLP base | Validate whether deep tabular models improve ensemble diversity |
| Iter 4 | Temporal Modeling | For Telco: add LSTM/GRU on quarterly customer journey sequences; for Bank: add transaction time-series | Capture temporal churn dynamics; +2–3% F1 on time-sensitive subsets |
| Iter 5 | Ensemble Expansion | Add 6th base model (Random Forest or Extra Trees) for additional variance reduction; test blending vs. stacking | Further variance reduction; marginal F1 gain of +0.5–1% |
| Iter 6 | Production Hardening | Dockerize inference; add A/B test framework; build automated retraining trigger on drift detection; write full production documentation | Deployable system with monitoring, retraining, and compliance docs |
10.2 Production Documentation Deliverables
| Document | Contents | Audience |
|---|---|---|
| Model Card | Intended use, training data summary, performance metrics, limitations, bias assessment, ethical considerations | Data scientists, regulators |
| API Documentation | Endpoint specs, request/response schemas, rate limits, error codes | Engineering teams |
| SHAP Dashboard Guide | How to read force plots, summary plots, and dependence plots | Business stakeholders, customer success |
| Retention Playbook | How to act on RPS segments; recommended interventions per churn reason | Marketing, customer success |
| Retraining SOP | When and how to retrain; drift detection thresholds; rollback procedures | MLOps, data engineering |
| Compliance Checklist | GDPR Article 22 (automated decision-making), CCPA, internal audit requirements | Legal, compliance |
Appendix A: Key Equations
Retention Priority Score:

$$\text{RPS}_i = P(\text{churn}_i \mid x_i) \times \text{CLV}_i$$

F1-Score:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Matthews Correlation Coefficient:

$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

Expected Calibration Error:

$$\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}\,\bigl|\operatorname{acc}(B_m) - \operatorname{conf}(B_m)\bigr|$$
Document compiled for the ChurnPredict Pro project. All datasets verified on Hugging Face Hub. All 21 references span peer-reviewed and high-impact arXiv publications from 2016–2024.