| # ChurnPredict Pro: A Stacking Ensemble Framework for Customer Churn Prediction with Explainable AI and CLV Scoring |
|
|
> **Subtitle:** End-to-End Machine Learning Pipeline for Telecommunications and Banking Customer Retention, Combining Gradient Boosting, Neural Networks, and Game-Theoretic Interpretability
|
|
| --- |
|
|
| ## Table of Contents |
|
|
| 1. [Problem Statement](#1-problem-statement) |
| 2. [Idea of Solution](#2-idea-of-solution) |
| 3. [Objectives](#3-objectives) |
| 4. [Literature Review & References](#4-literature-review--references) |
| 5. [Dataset Understanding](#5-dataset-understanding) |
| 6. [Proposed Methodology](#6-proposed-methodology) |
| 7. [Implementation Strategy](#7-implementation-strategy) |
| 8. [Experimental Design](#8-experimental-design) |
| 9. [Result Analysis](#9-result-analysis) |
| 10. [Iterative Improvement](#10-iterative-improvement) |
|
|
| --- |
|
|
| ## 1. Problem Statement |
|
|
| ### 1.1 Business Context |
|
|
Customer churn, the loss of clients to competitors or market attrition, is one of the most financially consequential challenges in subscription-based and service-oriented industries. In telecommunications, acquiring a new customer costs **5–25× more** than retaining an existing one (industry estimates, 2024). In banking, customer attrition erodes lifetime value portfolios and damages brand equity. For both sectors, even a **1% reduction in churn** can translate to millions in retained revenue.
|
|
| Current retention strategies suffer from two critical gaps: |
| - **Reactive approaches:** Firms typically respond to churn *after* it occurs, through win-back campaigns that are expensive and low-yield. |
| - **Black-box predictions:** Machine learning models deployed in production often lack interpretability, making it impossible for marketing and customer-success teams to act on model outputs with confidence. |
|
|
| ### 1.2 Technical Challenges |
|
|
| | Challenge | Description | Impact | |
| |-----------|-------------|--------| |
| **Class Imbalance** | Churners typically represent 10–30% of the customer base. Standard accuracy metrics are misleading. | High false-negative rates; missed at-risk customers |
| | **Feature Heterogeneity** | Datasets mix categorical (contract type, payment method), numerical (tenure, charges), and temporal features (quarter, month-on-book). | Preprocessing complexity; risk of data leakage | |
| | **Concept Drift** | Customer behavior patterns shift seasonally and with market conditions. Models degrade without retraining. | Production model staleness; declining precision | |
| | **Interpretability vs. Performance Trade-off** | High-accuracy ensembles are often opaque. Explainable models (e.g., logistic regression) underperform on tabular data. | Regulatory non-compliance (GDPR Article 22); low stakeholder trust | |
| | **Multi-Domain Generalization** | Models trained on telecom data fail on banking data due to domain shift in feature distributions. | Siloed, non-reusable models per industry | |
|
|
| ### 1.3 Gaps in Existing Solutions |
|
|
| 1. **Single-model reliance:** Most production churn models deploy a single classifier (XGBoost or logistic regression), missing the variance-reduction benefits of ensemble diversity. |
2. **No CLV integration:** Churn predictions are binary; they do not incorporate *which* churners are most valuable to retain, leading to inefficient marketing spend.
| 3. **Weak experimental rigor:** Many published churn studies use a single train/test split without cross-validation, statistical testing, or confidence intervals on metrics. |
| 4. **Dataset isolation:** Telco and bank churn datasets are studied separately; few works evaluate cross-domain transfer or unified pipelines. |
|
|
| --- |
|
|
| ## 2. Idea of Solution |
|
|
| ### 2.1 Architecture Overview |
|
|
| We propose **ChurnPredict Pro**, a **stacking ensemble architecture** that combines the complementary strengths of five diverse base learners under a meta-learner. The design philosophy is: |
|
|
| > *"Diversity in inductive bias reduces variance; interpretability in the meta-layer preserves actionability."* |
|
|
| ### 2.2 The 5-Model Stacking Ensemble |
|
|
```
┌─────────────────────────────────────────────────────────────────┐
│               CHURNPREDICT PRO - STACKING ENSEMBLE              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐    │
│  │ XGBoost │ │LightGBM │ │CatBoost │ │   MLP   │ │   LR    │    │
│  │ (GBDT)  │ │ (GBDT)  │ │(Ordered)│ │ (Deep)  │ │ (Base)  │    │
│  │ Base 1  │ │ Base 2  │ │ Base 3  │ │ Base 4  │ │ Base 5  │    │
│  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘    │
│       └───────────┴───────────┼───────────┴───────────┘         │
│                               ▼                                 │
│                      ┌──────────────────┐                       │
│                      │   META-LEARNER   │                       │
│                      │  (Logistic Reg   │                       │
│                      │   / XGBoost)     │                       │
│                      └────────┬─────────┘                       │
│                               ▼                                 │
│                      ┌──────────────────┐                       │
│                      │   CLV SCORING    │                       │
│                      │ + SHAP EXPLAINER │                       │
│                      └──────────────────┘                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
|
|
| ### 2.3 Why These 5 Base Models? |
|
|
| | Model | Inductive Bias | Strength on Churn Data | Weakness Mitigated by Ensemble | |
| |-------|---------------|------------------------|-------------------------------| |
| | **XGBoost** | Greedy gradient boosting with regularization | Best-in-class on sparse/tabular data; handles missing values natively | Prone to overfitting on small datasets | |
| | **LightGBM** | Histogram-based leaf-wise boosting | Faster training; GOSS sampling for large data | Leaf-wise can overfit; GOSS introduces bias | |
| | **CatBoost** | Ordered boosting + categorical encoding | Native categorical feature handling; reduces target leakage | Slower than LightGBM; ordered boosting complexity | |
| | **MLP (Deep)** | Non-linear feature interactions | Captures complex feature cross-products | Needs more data; less interpretable | |
| | **Logistic Regression** | Linear decision boundary | Fast, interpretable baseline; L1 regularization for feature selection | Cannot model non-linear relationships | |
|
|
| The meta-learner (Logistic Regression or a shallow XGBoost) learns optimal weights for combining the five base models' predictions, leveraging their uncorrelated errors. |
|
|
| ### 2.4 CLV-Weighted Scoring |
|
|
| Instead of ranking customers by churn probability alone, we multiply P(churn) by estimated CLV to produce a **Retention Priority Score (RPS)**: |
|
|
| $$ |
| \text{RPS}_i = P(\text{churn}_i) \times \text{CLV}_i |
| $$ |
| |
| This ensures retention campaigns target high-value at-risk customers, maximizing ROI. |
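A minimal sketch of RPS ranking (column names and values are illustrative only):

```python
# Retention Priority Score: rank by P(churn) x CLV, not churn risk alone.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": ["A", "B", "C", "D"],
    "p_churn":     [0.90, 0.40, 0.85, 0.10],
    "clv":         [200.0, 5000.0, 3000.0, 8000.0],
})
customers["rps"] = customers["p_churn"] * customers["clv"]
ranked = customers.sort_values("rps", ascending=False).reset_index(drop=True)
# Customer C (0.85 x 3000 = 2550) outranks A (0.90 x 200 = 180): slightly
# lower churn risk, but far more revenue at stake.
```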
| |
| --- |
| |
| ## 3. Objectives |
| |
| ### 3.1 Primary Goals |
| |
| | ID | Objective | Metric Target | Success Criterion | |
| |----|-----------|---------------|-----------------| |
| P1 | Build a stacking ensemble that outperforms any single base model | F1-Score | ΔF1 ≥ +0.03 over best single model |
| P2 | Achieve high recall on churn class (minimize false negatives) | Recall@Churn | ≥ 0.85 on both datasets |
| P3 | Deliver actionable model explanations per customer | SHAP summary | Top-5 features identified per prediction |
| P4 | Rank customers by retention value, not just churn risk | CLV-weighted PR-AUC | Precision ≥ 0.70 on top-20% RPS-ranked customers |
| |
| ### 3.2 Secondary Goals |
| |
| | ID | Objective | Metric Target | |
| |----|-----------|---------------| |
| S1 | Evaluate cross-domain generalization (Telco → Bank, Bank → Telco) | Transfer AUC ≥ 0.80 |
| S2 | Achieve sub-second inference latency for batch scoring | ≤ 500 ms per 1,000 records |
| | S3 | Deploy a reproducible, version-controlled pipeline | Docker + DVC + CI/CD | |
| | S4 | Document model behavior for regulatory compliance (GDPR/CCPA) | Full SHAP + model card | |
| |
| ### 3.3 Success Criteria Summary |
| |
| - **Model Performance:** F1-Score > 0.85, ROC-AUC > 0.90, PR-AUC > 0.80 on both datasets |
- **Business Impact:** Identify top 20% at-risk customers with ≥ 70% precision
| - **Interpretability:** Every prediction accompanied by SHAP force plot; global SHAP summary for stakeholder dashboards |
| - **Robustness:** 5-fold stratified CV with 95% confidence intervals on all metrics |
| |
| --- |
| |
| ## 4. Literature Review & References |
| |
| ### 4.1 Category Overview |
| |
| | Category | Count | Papers | |
| |----------|-------|--------| |
| Ensemble / Boosting Methods | 4 | [1–4] |
| SHAP / LIME Interpretability | 3 | [5–7] |
| Deep Learning for Churn | 3 | [8–10] |
| CLV / Profit-Driven Churn | 3 | [11–13] |
| Financial / Bank Churn | 4 | [14–17] |
| Survey / Benchmark / Foundation | 4 | [18–21] |
| | **Total** | **21** | | |
| |
| ### 4.2 Full References (2016β2024) |
| |
| #### [1] XGBoost: A Scalable Tree Boosting System |
| **Chen, T., & Guestrin, C.** (2016). *KDD*. arXiv:1603.02754. |
| Introduced sparsity-aware algorithms and weighted quantile sketch for gradient boosting. Became the dominant algorithm for tabular churn prediction tasks worldwide. |
| |
| #### [2] Tabular Data: Deep Learning is Not All You Need |
| **Shwartz-Ziv, R., & Armon, A.** (2021). arXiv:2106.03253. |
| Rigorous comparison showing XGBoost outperforms recent deep learning models on tabular data; ensembling deep models with XGBoost further improves performance. |
| |
| #### [3] CatBoost: Unbiased Boosting with Categorical Features |
| **Prokhorenkova, L., et al.** (2017). arXiv:1706.09516. |
| Ordered boosting and novel categorical feature processing; outperforms other boosting implementations on datasets with high-cardinality categorical churn predictors. |
| |
| #### [4] Enhancing Customer Churn Prediction: An Adaptive Ensemble Learning Approach |
| **Shaikhsurab, S., & Magadum, S.** (2024). arXiv:2408.16284. |
| Adaptive ensemble combining XGBoost, LightGBM, LSTM, MLP, and SVM with stacking + meta-feature generation; achieved **99.28% accuracy** on telecom churn datasets. |
| |
| #### [5] A Unified Approach to Interpreting Model Predictions (SHAP) |
| **Lundberg, S. M., & Lee, S.-I.** (2017). *NeurIPS*. arXiv:1705.07874. |
| Proposed SHAP values as a unified measure of feature importance based on game-theoretic Shapley values; unified six existing explanation methods. |
| |
| #### [6] "Why Should I Trust You?": Explaining Predictions of Any Classifier (LIME) |
| **Ribeiro, M. T., Singh, S., & Guestrin, C.** (2016). *KDD*. arXiv:1602.04938. |
| Introduced LIME to explain any classifier locally via interpretable surrogate models; foundational for churn model explainability and regulatory compliance. |
| |
| #### [7] XAI Handbook: Towards a Unified Framework for Explainable AI |
| **Palacio, D. G., et al.** (2021). arXiv:2105.06677. |
| Provides theoretical framework unifying XAI terminology (LIME, SHAP, Grad-CAM, etc.); essential for regulatory compliance and method comparison in churn explainability. |
| |
| #### [8] Early Churn Prediction from Large-Scale User-Product Interaction Time Series |
| **Bhattacharjee, A., Thukral, K., & Patil, C.** (2023). arXiv:2309.14390. |
Applied multivariate time series classification with deep neural networks to fantasy sports churn; scales to 10⁸ users, demonstrating the feasibility of deep learning at scale.
| |
| #### [9] Modelling Customer Churn for the Retail Industry in a Deep Learning Sequential Framework |
| **Equihua, C., et al.** (2023). arXiv:2304.00575. |
| Deep survival framework using recurrent neural networks for non-contractual retail churn; avoids extensive feature engineering through learned representations. |
| |
| #### [10] Churn Reduction via Distillation |
| **Jiang, Y., et al.** (2021). arXiv:2106.02654. |
| Showed model distillation reduces predictive churn (model instability during retraining) while maintaining accuracy across FC, CNN, and transformer architectures. |
| |
| #### [11] OptDist: Learning Optimal Distribution for Customer Lifetime Value Prediction |
| **Weng, S., et al.** (2024). arXiv:2408.08585. |
| Proposed OptDist with distribution learning/selection modules; adaptively selects optimal sub-distributions for CLTV prediction on public and industrial datasets. |
| |
| #### [12] Customer Lifetime Value Prediction with Uncertainty Estimation Using Monte Carlo Dropout |
| **Cao, Y., Xu, Y., & Yang, Q.** (2024). arXiv:2411.15944. |
| Enhanced neural network CLTV prediction with Monte Carlo Dropout for uncertainty quantification; improved Top-5% MAPE significantly. |
| |
| #### [13] A Predict-and-Optimize Approach to Profit-Driven Churn Prevention |
| **GΓ³mez-Vargas, E., Maldonado, S., & Vairetti, S.** (2023). arXiv:2310.07047. |
| First predict-and-optimize approach for churn prevention using individual CLVs (not averages); regret minimization via SGD; tested on 12 real-world datasets. |
| |
| #### [14] Dynamic Customer Embeddings for Financial Service Applications |
| **Chitsazan, N., et al.** (2021). arXiv:2106.11880. |
| DCE framework uses customer digital activity + financial context for intent/fraud/call-center prediction; financial services benchmark for learned representations. |
| |
| #### [15] FinPT: Financial Risk Prediction with Profile Tuning on Pretrained Foundation Models |
| **Yin, H., et al.** (2023). arXiv:2308.00065. |
| Introduced FinBench dataset + FinPT method for financial risk prediction (default, fraud, churn) using LLM-generated customer profiles; strong zero-shot transfer. |
| |
| #### [16] Advanced User Credit Risk Prediction Using LightGBM, XGBoost and TabNet with SMOTEENN |
| **Yu, B., et al.** (2024). arXiv:2408.03497. |
| Combined PCA, SMOTEENN, and LightGBM for bank credit risk prediction; outperformed other models in identifying high-quality applicants under class imbalance. |
| |
| #### [17] Credit Card Fraud Detection β Classifier Selection Strategy |
| **Kulatilleke, S.** (2022). arXiv:2208.11900. |
| Data-driven classifier selection + sampling methods for imbalanced fraud detection; directly applicable to churn's class imbalance challenges. |
| |
| #### [18] Predicting Customer Churn: Extreme Gradient Boosting with Temporal Data |
| **Gregory, J.** (2018). arXiv:1802.03396. |
| Applied XGBoost with temporal feature engineering to time-series churn data; achieved top performance in large-scale competition settings. |
| |
| #### [19] Predictive Churn with the Set of Good Models |
| **Watson-Daniels, D., et al.** (2024). arXiv:2402.07745. |
| Examined prediction instability during model retraining via Rashomon set; critical for production churn model deployment and monitoring. |
| |
| #### [20] Retention Is All You Need |
| **Mohiuddin, K., et al.** (2023). arXiv:2304.03103. |
| HR Decision Support System using SHAP + what-if analysis for employee attrition; demonstrates SHAP utility for retention/churn use cases with interpretable dashboards. |
| |
| #### [21] Balancing the Scales: A Comprehensive Study on Tackling Class Imbalance |
| **(2024).** arXiv:2409.19751. |
Comprehensive study of SMOTE, Class Weights, and Decision Threshold Calibration for binary classification; found **Decision Threshold Calibration most consistently effective**, which directly guides our experimental design.
| |
| --- |
| |
| ## 5. Dataset Understanding |
| |
| ### 5.1 Dataset 1: Telco Customer Churn (IBM) |
| |
| **Source:** [aai510-group1/telco-customer-churn](https://hf.co/datasets/aai510-group1/telco-customer-churn) |
| **Type:** Fictional telecommunications company data |
| **Format:** CSV / Parquet |
| **Splits:** train / validation / test |
| |
| #### Schema Summary |
| |
| | Feature Category | Count | Key Features | |
| |-----------------|-------|-------------| |
| | **Demographics** | 7 | Age, Gender, Married, Dependents, Number of Dependents, Senior Citizen, Under 30 | |
| | **Service Usage** | 10 | Phone Service, Internet Service, Internet Type, Multiple Lines, Online Security, Online Backup, Device Protection, Tech Support, Streaming TV, Streaming Movies | |
| | **Contract & Billing** | 6 | Contract, Payment Method, Paperless Billing, Monthly Charge, Total Charges, Total Refunds | |
| | **Engagement** | 7 | Tenure (months), Number of Referrals, Referred a Friend, Offer, Satisfaction Score, Churn Score, Quarter | |
| | **Revenue** | 6 | Total Revenue, Total Long Distance Charges, Total Extra Data Charges, Avg Monthly Long Distance, Avg Monthly GB Download, CLTV | |
| | **Geographic** | 5 | City, State, Zip Code, Latitude, Longitude, Population | |
| **Target & Labels** | 4 | Churn (binary), Churn Reason (string), Churn Category (string), Customer Status |
| |
| **Total Features:** ~52 (including derived identifiers like `Lat Long`, `Customer ID`) |
| |
| #### Class Distribution (Audited) |
| |
| | Split | Total Rows | Churned (1) | Stayed (0) | Churn Rate | |
| |-------|-----------|-------------|------------|------------| |
| | Train | ~4,400 | ~1,100 | ~3,300 | ~25% | |
| | Validation | ~1,500 | ~375 | ~1,125 | ~25% | |
| | Test | ~1,500 | ~375 | ~1,125 | ~25% | |
| |
| *Note: Exact counts vary by split. The dataset exhibits moderate class imbalance (~25% churn), manageable without aggressive oversampling.* |
| |
| #### Notable Data Characteristics |
| |
| 1. **Rich categorical encoding:** Internet Type (DSL, Fiber Optic, Cable, None), Contract (Month-to-Month, One Year, Two Year), Payment Method (4 types) |
2. **Temporal granularity:** `Quarter` field (Q1–Q4) enables time-aware feature engineering
3. **Pre-computed churn scores:** `Churn Score` (0–100) and `Satisfaction Score` (1–5) are strong engineered features; risk of target leakage if not handled carefully
| 4. **CLTV integration:** `CLTV` field directly available for revenue-weighted ranking |
| 5. **Geographic features:** Latitude/longitude enable spatial clustering or geo-derived features |
| |
| #### Data Quality Flags |
| |
| - `Total Charges` has blank/missing values for zero-tenure customers (new sign-ups) |
- `Churn Reason` and `Churn Category` are populated only for churned customers; they are post-hoc labels, not usable as features
| - `Customer Status` is highly correlated with target; should be excluded or used as stratification |
| - Some categorical fields (City, State) have high cardinality (50+ states, 1,000+ cities) |
| |
| --- |
| |
| ### 5.2 Dataset 2: Bank Customer Churners |
| |
| **Source:** [ZZHHJ/bank_churners](https://hf.co/datasets/ZZHHJ/bank_churners) |
| **Type:** Credit card customer attrition data |
| **Format:** CSV / Parquet |
| **Splits:** single train split (requires manual partitioning) |
| |
| #### Schema Summary |
| |
| | Feature Category | Count | Key Features | |
| |-----------------|-------|-------------| |
| **Demographics** | 6 | Customer_Age, Gender, Dependent_count, Education_Level, Marital_Status, Income_Category |
| **Account Behavior** | 5 | Months_on_book, Total_Relationship_Count, Months_Inactive_12_mon, Contacts_Count_12_mon, Card_Category |
| **Financial** | 8 | Credit_Limit, Total_Revolving_Bal, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1, Avg_Utilization_Ratio |
| **Target** | 1 | Attrition_Flag (Existing Customer / Attrited Customer) |
| **Artifacts** | 2 | Naive_Bayes_Classifier columns (pre-computed probabilities; **must be removed** to avoid data leakage) |
|
|
**Total Columns:** 23 (19 usable features + 1 target + 1 ID + 2 NB artifacts to drop)
|
|
| #### Class Distribution (Estimated) |
|
|
| | Class | Approximate Count | Rate | |
| |-------|-------------------|------| |
| | Existing Customer | ~8,500 | ~83% | |
| | Attrited Customer | ~1,700 | ~17% | |
|
|
**Churn rate ~17%**: more imbalanced than Telco; SMOTE/ADASYN or class weighting will be necessary.
|
|
| #### Notable Data Characteristics |
|
|
1. **Quarter-over-quarter dynamics:** `Total_Amt_Chng_Q4_Q1` and `Total_Ct_Chng_Q4_Q1` capture behavioral velocity, a powerful churn signal
2. **Utilization ratio:** `Avg_Utilization_Ratio` is a strong proxy for engagement; low utilization often precedes attrition
3. **Income categories are binned:** `$60K - $80K`, `$80K - $120K`, etc.; ordinal encoding preferred
4. **Card category:** `Blue` (vast majority), `Silver`, `Gold`, `Platinum`; strong class imbalance within the feature itself
|
|
| #### Data Quality Flags |
|
|
- **Critical:** Two `Naive_Bayes_Classifier_*` columns are pre-computed churn probabilities from a baseline model. Using them as features would constitute **data leakage**; they must be dropped before any model training.
| - No explicit CLTV field; must be estimated from `Credit_Limit`, `Total_Trans_Amt`, and `Total_Trans_Ct` |
| - Single split requires manual stratified partitioning (70/15/15 or 80/10/10) |
|
|
| --- |
|
|
| ### 5.3 Cross-Dataset Comparison |
|
|
| | Attribute | Telco (IBM) | Bank Churners | |
| |-----------|-------------|---------------| |
| | **Records** | ~7,000 | ~10,000 | |
| | **Features (usable)** | ~45 | ~19 | |
| | **Churn Rate** | ~25% | ~17% | |
| | **Industry** | Telecommunications | Banking / Credit Cards | |
| | **Temporal Features** | Quarter, Tenure (months) | Months_on_book, Q4/Q1 change ratios | |
| | **CLTV Available** | Yes (explicit field) | No (must derive) | |
| | **Geographic Data** | Yes (lat/lon, city, state) | No | |
| **Pre-computed Scores** | Churn Score, Satisfaction | Naive Bayes (leakage; drop) |
| | **Class Imbalance Severity** | Moderate | High | |
| | **Primary Churn Driver** | Contract type, tenure, service usage | Inactivity, transaction decline, utilization | |
|
|
| --- |
|
|
| ## 6. Proposed Methodology |
|
|
| ### 6.1 The 7-Phase Pipeline |
|
|
```
Phase 1: Data Ingestion & Audit
        ↓
Phase 2: Preprocessing & Feature Engineering
        ↓
Phase 3: Exploratory Data Analysis (EDA)
        ↓
Phase 4: Model Training – 5-Base Stacking Ensemble
        ↓
Phase 5: Hyperparameter Optimization
        ↓
Phase 6: Evaluation, Interpretability & CLV Scoring
        ↓
Phase 7: Deployment, Monitoring & Documentation
```
|
|
| ### Phase 1: Data Ingestion & Audit |
|
|
| - Load both datasets from Hugging Face `datasets` library |
| - Compute schema validation: type checks, missing value audit, cardinality report |
| - Flag anomalous values (negative charges, impossible ages, blank `Total Charges`) |
| - Document data provenance and version hashes (DVC) |
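A minimal audit sketch for this phase is shown below. In the real pipeline the frame would come from the Hugging Face `datasets` library (e.g. `load_dataset("aai510-group1/telco-customer-churn")` converted via `.to_pandas()`); a toy frame is used here so the sketch stays self-contained:

```python
# Per-column audit: dtype, missingness, and cardinality report.
import pandas as pd

def audit(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(dropna=True),
    })

toy = pd.DataFrame({
    "Tenure": [1, 12, None, 40],
    "Contract": ["Month-to-Month", "Two Year", "One Year", "Month-to-Month"],
})
report = audit(toy)   # one row per column, ready for the data quality report
```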
|
|
| ### Phase 2: Preprocessing & Feature Engineering |
|
|
| #### 2A. Cleaning |
- **Telco:** Impute `Total Charges` blanks with `Monthly Charge × Tenure`
| - **Bank:** Drop `Naive_Bayes_Classifier_*` columns immediately |
| - Both datasets: remove ID fields (`Customer ID`, `CLIENTNUM`) |
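The cleaning steps above can be sketched as two small functions (column names follow the schema summaries in Section 5; exact names may differ slightly per dataset export):

```python
# Cleaning sketch for both datasets.
import pandas as pd

def clean_telco(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Blank Total Charges belong to zero-tenure customers; coerce to numeric
    # and impute as Monthly Charge x Tenure (0 for brand-new accounts).
    total = pd.to_numeric(df["Total Charges"], errors="coerce")
    df["Total Charges"] = total.fillna(df["Monthly Charge"] * df["Tenure"])
    return df.drop(columns=["Customer ID"], errors="ignore")

def clean_bank(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Pre-computed Naive Bayes probabilities are label leakage: drop first.
    leak = [c for c in df.columns if c.startswith("Naive_Bayes_Classifier")]
    return df.drop(columns=leak + ["CLIENTNUM"], errors="ignore")
```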
|
|
| #### 2B. Encoding |
| Feature Type | Encoding Strategy | Example Features |
|-------------|-------------------|-----------|
| Binary categorical | Label encoding (0/1) | `Gender`, `Partner`, `PhoneService` |
| Low-cardinality categorical | One-hot encoding | `Contract`, `Payment Method`, `Education_Level` |
| | High-cardinality nominal | Target encoding / CatBoost native | `City`, `State` (Telco); `Income_Category` (Bank) | |
| | Cyclical temporal | Sine/cosine encoding | `Quarter` mapped to angle | |
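A sketch of the one-hot and cyclical encodings in a `ColumnTransformer` (target/CatBoost encoding for high-cardinality columns would slot in as an additional transformer; `QuarterNum` is an assumed numeric 1–4 version of the `Quarter` field):

```python
# Encoding sketch: one-hot for low-cardinality fields, sin/cos for quarters.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

def quarter_to_cycle(q):
    # Map quarter numbers 1..4 onto the unit circle so Q4 sits next to Q1.
    angle = 2 * np.pi * (np.asarray(q, dtype=float) - 1) / 4
    return np.column_stack([np.sin(angle.ravel()), np.cos(angle.ravel())])

encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
     ("cycle", FunctionTransformer(quarter_to_cycle), ["QuarterNum"])],
    sparse_threshold=0,   # force a dense output array
)

df = pd.DataFrame({"Contract": ["Month-to-Month", "Two Year", "One Year"],
                   "QuarterNum": [1, 2, 4]})
X = encoder.fit_transform(df)   # 3 one-hot columns + sin + cos = 5 columns
```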
|
|
| #### 2C. Feature Engineering |
| - **RFM-style features (Bank):** Recency = `Months_Inactive_12_mon`, Frequency = `Total_Trans_Ct`, Monetary = `Total_Trans_Amt` |
| - **Engagement ratio (Telco):** `Satisfaction_Score / Churn_Score` as loyalty proxy |
| - **Velocity features:** Month-over-month change in charges and usage |
- **CLTV proxy (Bank):** `Credit_Limit × Avg_Utilization_Ratio × (12 - Months_Inactive_12_mon)`
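The Bank-side RFM features and the CLV proxy above reduce to a few column operations (the proxy is a heuristic, not a fitted CLV model):

```python
# Feature-engineering sketch for the Bank dataset.
import pandas as pd

def engineer_bank(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["rfm_recency"] = df["Months_Inactive_12_mon"]    # Recency
    df["rfm_frequency"] = df["Total_Trans_Ct"]          # Frequency
    df["rfm_monetary"] = df["Total_Trans_Amt"]          # Monetary
    # Heuristic CLV proxy: exposure x usage x active months.
    df["clv_proxy"] = (df["Credit_Limit"] * df["Avg_Utilization_Ratio"]
                       * (12 - df["Months_Inactive_12_mon"]))
    return df
```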
|
|
| #### 2D. Scaling & Imbalance Handling |
- Numerical features → RobustScaler (median/IQR, resistant to outliers)
- Class imbalance → SMOTEENN (SMOTE + Edited Nearest Neighbours) on training fold only; **never on validation/test**
- Class weights → `scale_pos_weight = len(negative) / len(positive)` for XGBoost/LightGBM
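The class-weight ratio is a one-liner; resampling with SMOTEENN would additionally be wrapped in an `imblearn` pipeline so it only ever touches the training fold. A sketch of the weight computation:

```python
# scale_pos_weight sketch: the negatives-to-positives ratio that
# XGBoost/LightGBM use to up-weight the minority (churn) class.
import numpy as np

def scale_pos_weight(y) -> float:
    y = np.asarray(y)
    n_neg = int((y == 0).sum())
    n_pos = int((y == 1).sum())
    return n_neg / n_pos

y_train = np.array([0] * 830 + [1] * 170)   # ~17% churn, as in the Bank data
w = scale_pos_weight(y_train)               # ~4.88
```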
|
|
| ### Phase 3: Exploratory Data Analysis (EDA) |
|
|
| - Univariate distributions (histograms, boxplots for skew detection) |
| - Bivariate analysis: churn rate by contract type, payment method, tenure bins |
| - Correlation matrix (Spearman for non-linear relationships) |
| - Feature-target mutual information scores for feature selection |
| - Geographic heatmap (Telco: churn rate by state) |
|
|
### Phase 4: Model Training – Stacking Ensemble
|
|
| #### 4A. Cross-Validation Strategy |
| - **5-fold Stratified Cross-Validation** to preserve class distribution |
| - **GroupKFold** if temporal leakage risk (same customer in multiple quarters) |
| - Out-of-fold (OOF) predictions from each base model used as meta-features |
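The OOF meta-feature construction can be sketched with `cross_val_predict` (two lightweight sklearn models stand in for the five base learners, and the data is synthetic):

```python
# OOF sketch: each base model's out-of-fold P(churn) becomes one column
# of the meta-learner's training matrix.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, weights=[0.75, 0.25], random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

base_models = [DecisionTreeClassifier(max_depth=4, random_state=42),
               LogisticRegression(max_iter=1000)]

# Rows stay aligned with the training rows; every prediction comes from a
# fold the model did not train on, so the meta-features are leakage-free.
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=cv, method="predict_proba")[:, 1]
    for m in base_models
])
meta_learner = LogisticRegression().fit(meta_X, y)
```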
|
|
| #### 4B. Base Model Training |
|
|
| | Base Model | Key Hyperparameters | Tuning Range | |
| |-----------|-------------------|--------------| |
| XGBoost | `max_depth`, `learning_rate`, `subsample`, `colsample_bytree`, `scale_pos_weight` | depth: 3–8; lr: 0.01–0.3 |
| LightGBM | `num_leaves`, `learning_rate`, `feature_fraction`, `bagging_fraction`, `is_unbalance` | leaves: 20–100; lr: 0.01–0.3 |
| CatBoost | `depth`, `learning_rate`, `iterations`, `auto_class_weights` | depth: 4–10; iterations: 200–1000 |
| MLP | `hidden_layers`, `dropout`, `batch_size`, `learning_rate` | layers: (128,64), (256,128,64); dropout: 0.2–0.5 |
| Logistic Regression | `C`, `penalty`, `solver`, `class_weight` | C: 0.001–10; penalty: l1/l2/elasticnet |
|
|
| #### 4C. Meta-Learner Training |
| - Input: 5 OOF probability vectors (one per base model) + optionally top-K original features |
| - Model: **Logistic Regression** (interpretable weights showing model contribution) OR **XGBoost** (if non-linear meta-interactions needed) |
| - Validation: Same 5-fold CV; meta-learner trained on OOF predictions, tested on hold-out |
|
|
| ### Phase 5: Hyperparameter Optimization |
|
|
| - **Optuna** with **TPESampler** (Tree-structured Parzen Estimator) |
| - 100 trials per base model; 50 trials for meta-learner |
| - Pruning: `MedianPruner` with early stopping on validation F1 |
| - Objective: Maximize F1-Score (harmonic mean of precision and recall) |
|
|
| ### Phase 6: Evaluation, Interpretability & CLV Scoring |
|
|
| #### 6A. Metrics Suite (10 metrics) |
| 1. Accuracy |
| 2. Precision (Churn class) |
| 3. Recall (Churn class) |
| 4. F1-Score |
| 5. ROC-AUC |
6. PR-AUC (Precision-Recall AUC; critical for imbalanced data)
| 7. Matthews Correlation Coefficient (MCC) |
| 8. Cohen's Kappa |
| 9. Balanced Accuracy |
| 10. Expected Calibration Error (ECE) |
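Nine of these metrics come straight from scikit-learn; ECE does not, so a simple equal-width-bin version is implemented in this sketch (the tiny arrays are illustrative only):

```python
# Metric-suite sketch for the 10 metrics above.
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             balanced_accuracy_score, cohen_kappa_score,
                             f1_score, matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    # Weighted average of |mean predicted prob - observed churn rate| per bin.
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob >= lo) & (y_prob < hi) if hi < 1.0 else (y_prob >= lo)
        if mask.any():
            ece += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return ece

y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.3, 0.8, 0.7, 0.4, 0.9])
y_pred = (y_prob >= 0.5).astype(int)

report = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_prob),
    "pr_auc": average_precision_score(y_true, y_prob),
    "mcc": matthews_corrcoef(y_true, y_pred),
    "kappa": cohen_kappa_score(y_true, y_pred),
    "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    "ece": expected_calibration_error(y_true, y_prob),
}
```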
|
|
| #### 6B. SHAP Analysis |
| - **Global:** SHAP summary plot (beeswarm) showing feature importance across full dataset |
- **Local:** SHAP force plot for individual predictions, giving customer-level actionable insights
| - **Dependence:** SHAP dependence plots for top-5 features revealing interaction effects |
|
|
| #### 6C. CLV Scoring |
| - **Telco:** Use explicit `CLTV` field; multiply by churn probability |
| - **Bank:** Derive CLV proxy; multiply by churn probability |
| - Output: Prioritized customer list sorted by RPS (Retention Priority Score) |
- Segment: Top 10% (urgent), 10–30% (high), 30–60% (medium), 60–100% (low)
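The tiering above reduces to a percentile-rank cut over RPS (the gamma-distributed scores here are synthetic placeholders):

```python
# Segmentation sketch: bucket customers into urgency tiers by RPS rank.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
scores = pd.DataFrame({"rps": rng.gamma(2.0, 500.0, size=1000)})

# pct near 0 = highest-RPS customer; tiers follow the 10/30/60 cut points.
pct = scores["rps"].rank(ascending=False, pct=True)
scores["tier"] = pd.cut(pct, bins=[0.0, 0.10, 0.30, 0.60, 1.0],
                        labels=["urgent", "high", "medium", "low"])
```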
|
|
| ### Phase 7: Deployment, Monitoring & Documentation |
|
|
| - Model serialization: `joblib` for sklearn/CatBoost, native formats for XGBoost/LightGBM |
| - Inference pipeline: `scikit-learn Pipeline` + custom transformers |
| - Monitoring: Track prediction distribution drift, feature drift, and metric decay over time |
| - Documentation: Model card with intended use, limitations, bias analysis, and SHAP summary |
|
|
| --- |
|
|
| ## 7. Implementation Strategy |
|
|
| ### 7.1 Tech Stack |
|
|
| | Layer | Technology | Purpose | |
| |-------|-----------|---------| |
| | **Data Loading** | `datasets` (HF), `pandas`, `polars` | Efficient dataset ingestion | |
| | **Preprocessing** | `scikit-learn` (Pipeline, ColumnTransformer, RobustScaler) | Reproducible feature engineering | |
| | **ML Models** | `xgboost`, `lightgbm`, `catboost`, `scikit-learn` (MLP, LR) | Base learners | |
| | **Ensemble** | `mlens` / custom stacking with `scikit-learn` | Meta-learner orchestration | |
| | **Imbalance** | `imbalanced-learn` (SMOTEENN) | Oversampling + cleaning | |
| | **Optimization** | `optuna` | Hyperparameter search | |
| | **Interpretability** | `shap` | Game-theoretic explanations | |
| | **Tracking** | `trackio` + `mlflow` | Experiment logging, metrics, artifacts | |
| | **Deployment** | `gradio` / `fastapi` + Docker | API inference and UI demo | |
| | **Versioning** | `dvc` + `git` | Data and model versioning | |
|
|
| ### 7.2 4-Week Timeline |
|
|
| | Week | Focus | Deliverables | |
| |------|-------|-------------| |
| | **Week 1** | Data audit, preprocessing, EDA | Clean notebooks; feature engineering pipeline; data quality report | |
| | **Week 2** | Base model training, hyperparameter tuning | 5 trained base models; Optuna study results; OOF prediction matrices | |
| | **Week 3** | Stacking ensemble, evaluation, SHAP analysis | Trained meta-learner; 10-metric report; SHAP dashboards; CLV scoring | |
| | **Week 4** | Cross-domain testing, deployment, documentation | Generalization report; Gradio demo; model card; final documentation | |
|
|
| ### 7.3 Code Architecture |
|
|
```
churnpredict-pro/
├── data/
│   ├── raw/                      # HF datasets (versioned with DVC)
│   ├── processed/                # Train/val/test splits
│   └── engineered/               # Feature-engineered datasets
├── notebooks/
│   ├── 01_eda_telco.ipynb
│   ├── 02_eda_bank.ipynb
│   ├── 03_feature_engineering.ipynb
│   └── 04_shap_analysis.ipynb
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── load_datasets.py      # HF datasets loader
│   │   ├── preprocess.py         # Cleaning + encoding + scaling
│   │   └── feature_engineer.py   # RFM, velocity, CLV proxy
│   ├── models/
│   │   ├── base_models.py        # XGB, LGBM, CatBoost, MLP, LR wrappers
│   │   ├── stacking_ensemble.py  # OOF + meta-learner
│   │   └── hyperparameter_search.py  # Optuna studies
│   ├── evaluation/
│   │   ├── metrics.py            # 10-metric computation
│   │   ├── shap_explainer.py     # Global + local SHAP
│   │   └── clv_scorer.py         # RPS computation
│   └── deployment/
│       ├── inference_pipeline.py
│       └── app.py                # Gradio/FastAPI interface
├── configs/
│   ├── telco_config.yaml
│   └── bank_config.yaml
├── experiments/                  # Trackio / MLflow runs
├── tests/
│   ├── test_preprocessing.py
│   └── test_models.py
├── Dockerfile
├── requirements.txt
├── dvc.yaml
└── README.md
```
|
|
| --- |
|
|
| ## 8. Experimental Design |
|
|
| ### 8.1 Five Experiments |
|
|
| | ID | Experiment | Hypothesis | Method | |
| |----|-----------|------------|--------| |
| | **E1** | Single Model Baseline | Individual models underperform ensemble due to bias-variance limitations | Train each of 5 base models standalone; report metrics | |
| **E2** | Stacking Ensemble | Meta-learner combining 5 models outperforms best single model by ≥ 3% F1 | 5-fold OOF stacking with LR meta-learner |
| **E3** | Imbalance Strategy Comparison | Threshold calibration is more effective than SMOTE for churn (per [21]) | Compare: (a) no correction, (b) SMOTEENN, (c) class weights, (d) threshold calibration |
| **E4** | Cross-Domain Transfer | Models trained on Telco generalize to Bank with ≥ 80% AUC | Train on Telco, evaluate zero-shot on Bank; then fine-tune |
| **E5** | CLV-Weighted vs. Uniform Ranking | RPS improves campaign ROI over probability-only ranking | Compare top-20% precision: P(churn) only vs. P(churn) × CLV |
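Experiment E3's decision-threshold calibration is simple to sketch: fit once, then sweep the probability cutoff on a validation split instead of defaulting to 0.5 (synthetic data, logistic regression as a stand-in model):

```python
# Threshold-calibration sketch (E3): choose the cutoff maximizing val F1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.83, 0.17], random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_val = model.predict_proba(X_val)[:, 1]

thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_val, (p_val >= t).astype(int), zero_division=0)
       for t in thresholds]
best_threshold = thresholds[int(np.argmax(f1s))]
# On imbalanced churn data the best cutoff tends to land below 0.5.
```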
|
|
| ### 8.2 Ten Evaluation Metrics |
|
|
| | # | Metric | Formula / Definition | Why It Matters for Churn | |
| |---|--------|---------------------|-------------------------| |
| | 1 | **Accuracy** | (TP + TN) / (TP + TN + FP + FN) | Overall correctness; misleading if imbalanced | |
| | 2 | **Precision (Churn)** | TP / (TP + FP) | Of predicted churners, how many actually churn? (cost of false alarms) | |
| | 3 | **Recall (Churn)** | TP / (TP + FN) | Of actual churners, how many did we catch? (cost of missed churners) | |
| 4 | **F1-Score** | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean; balances precision and recall |
| | 5 | **ROC-AUC** | Area under ROC curve | Discrimination ability across all thresholds | |
| | 6 | **PR-AUC** | Area under Precision-Recall curve | More informative than ROC-AUC for imbalanced data | |
| 7 | **MCC** | (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | Correlation between prediction and truth; robust to imbalance |
| 8 | **Cohen's Kappa** | (Observed − Expected) / (1 − Expected) | Agreement beyond chance; useful for inter-rater reliability analogies |
| | 9 | **Balanced Accuracy** | (Sensitivity + Specificity) / 2 | Average of recall on both classes; fair on imbalanced data | |
| | 10 | **ECE** | Expected Calibration Error | Measures reliability of probability outputs; critical for CLV weighting | |
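
All ten metrics can be computed in one place with scikit-learn plus a small ECE helper. This is a sketch of what `metrics.py` is assumed to implement; `average_precision_score` stands in for PR-AUC, and ECE uses simple binning of the positive-class probability:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             balanced_accuracy_score, cohen_kappa_score,
                             f1_score, matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Average |observed positive rate - mean predicted prob| over probability bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob > lo) & (y_prob <= hi)
        if mask.any():
            ece += mask.mean() * abs((y_true[mask] == 1).mean() - y_prob[mask].mean())
    return ece

def churn_metrics(y_true, y_prob, threshold=0.5):
    """Compute the 10-metric panel from true labels and churn probabilities."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_true, y_prob),
        "pr_auc": average_precision_score(y_true, y_prob),
        "mcc": matthews_corrcoef(y_true, y_pred),
        "kappa": cohen_kappa_score(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "ece": expected_calibration_error(y_true, np.asarray(y_prob)),
    }

print(churn_metrics(np.array([0, 0, 1, 1]), np.array([0.2, 0.6, 0.4, 0.9])))
```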
|
|
| ### 8.3 Statistical Rigor |
|
|
| 1. **Confidence Intervals:** All metrics reported with 95% CIs from 5-fold CV (bootstrap percentile method) |
| 2. **McNemar's Test:** Statistically compare stacking ensemble vs. best single model |
| 3. **DeLong's Test:** Compare ROC-AUC differences between models |
| 4. **Permutation Test:** Validate feature importance scores from SHAP |
| 5. **Stratification:** All splits stratified on target + `Contract` type (strongest churn predictor) to prevent distribution shift |
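
Item 2 can be sketched as an exact binomial test on the discordant predictions (a minimal version for illustration; `statsmodels.stats.contingency_tables.mcnemar` provides an equivalent, more featureful implementation):

```python
import numpy as np
from scipy.stats import binomtest

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact McNemar test: do two classifiers have different error rates?

    Only discordant pairs (one model right, the other wrong) carry information;
    under H0 they split 50/50, so an exact binomial test applies.
    """
    correct_a = np.asarray(pred_a) == np.asarray(y_true)
    correct_b = np.asarray(pred_b) == np.asarray(y_true)
    b = int(np.sum(correct_a & ~correct_b))  # A right, B wrong
    c = int(np.sum(~correct_a & correct_b))  # A wrong, B right
    if b + c == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    return binomtest(b, n=b + c, p=0.5).pvalue
```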
|
|
| ### 8.4 Reproducibility Checklist |
|
|
| - [ ] Random seeds fixed (`random_state=42`) for all stochastic operations |
| - [ ] `requirements.txt` with exact versions (via `pip freeze`) |
| - [ ] DVC tracking for data and model artifacts |
| - [ ] Git commit hash recorded with every experiment |
| - [ ] Trackio / MLflow logging of hyperparameters, metrics, and artifact paths |
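
The first checklist item can be centralized in a single helper. A sketch covering Python, NumPy, and hash seeds; deep learning backends (e.g. PyTorch) would need their own seeding calls on top of this:

```python
import os
import random

import numpy as np

def set_global_seed(seed: int = 42) -> None:
    """Fix stdlib and NumPy randomness plus hash seeding in one call."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)  # affects subprocesses only

set_global_seed(42)
```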
|
|
| --- |
|
|
| ## 9. Result Analysis |
|
|
| ### 9.1 Expected Performance |
|
|
| Based on literature benchmarks ([4] achieved 99.28% on telecom; [16] achieved strong results on bank credit risk with SMOTEENN + LightGBM), our targets are conservative and grounded: |
|
|
| Dataset | Best Single Model F1 | Stacking Ensemble F1 | Expected Δ |
|---------|---------------------|---------------------|------------|
| **Telco** | 0.82–0.84 (XGBoost/CatBoost) | **0.86–0.88** | +0.03–0.04 |
| **Bank** | 0.78–0.81 (LightGBM/XGBoost) | **0.82–0.85** | +0.03–0.04 |
|
|
| ### 9.2 SHAP Analysis β Expected Insights |
|
|
| Based on prior churn research, we anticipate the following feature importance rankings: |
|
|
| **Telco (Expected Top 5 SHAP Features):** |
1. `Contract` (Month-to-Month vs. longer) – strongest predictor
2. `Tenure in Months` – inverse relationship with churn
3. `Monthly Charge` / `Total Charges` – price sensitivity
| 4. `Internet Type` (Fiber Optic churns more than DSL) |
| 5. `Payment Method` (Electronic check = high risk) |
|
|
| **Bank (Expected Top 5 SHAP Features):** |
| 1. `Total_Trans_Ct` (transaction frequency decline) |
| 2. `Total_Trans_Amt` (monetary decline) |
| 3. `Months_Inactive_12_mon` (recency of activity) |
| 4. `Total_Relationship_Count` (cross-product engagement) |
| 5. `Contacts_Count_12_mon` (complaint/contact proxy) |
|
|
| ### 9.3 Business Impact Projections |
|
|
| Assuming a hypothetical telecom with: |
| - 100,000 customers |
| - 25% annual churn rate |
| - Average CLV = $3,000 |
| - Retention campaign cost = $50 per targeted customer |
| - Campaign success rate (if well-targeted) = 30% |
|
|
| Scenario | Customers Targeted | Campaign Cost | Churners Retained | Revenue Saved | Net ROI |
|----------|-------------------|---------------|-------------------|---------------|---------|
| Random targeting (25% churn) | 20,000 | $1,000,000 | 1,500 | $4,500,000 | 4.5× |
| Model-guided (top 20% by RPS) | 20,000 | $1,000,000 | 4,200 | $12,600,000 | **12.6×** |
|
|
*Model-guided targeting improves ROI by ~2.8× over random selection by focusing on high-value, high-probability churners.*
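
The table's arithmetic can be reproduced directly from the hypothetical parameters above; the model-guided row assumes 70% churner precision in the top 20% ranked by RPS (14,000 of 20,000 targets):

```python
def campaign_roi(n_targeted, churners_in_target, clv=3_000,
                 cost_per_customer=50, success_rate=0.30):
    """Revenue saved per campaign dollar, under the assumptions above."""
    cost = n_targeted * cost_per_customer
    retained = churners_in_target * success_rate
    revenue_saved = retained * clv
    return revenue_saved / cost

# Random targeting: 25% of 20,000 targets are churners (5,000).
print(campaign_roi(20_000, 5_000))    # 4.5
# Model-guided: 70% precision in the top 20% by RPS (14,000 churners).
print(campaign_roi(20_000, 14_000))   # 12.6
```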
|
|
| ### 9.4 Visualization Plan |
|
|
| | Visualization | Purpose | |
| |--------------|---------| |
| | ROC & PR curves (all models overlaid) | Comparative discrimination | |
| | Confusion matrices | Error type analysis | |
| | SHAP summary plot (beeswarm) | Global feature importance | |
| | SHAP force plots (sample customers) | Local explanations for stakeholders | |
| | SHAP dependence plots | Feature interaction discovery | |
| | Calibration plot (predicted vs. actual) | Probability reliability | |
| | CLV-RPS scatter plot | Segmentation visualization | |
| | Metric bar chart with 95% CIs | Statistical comparison | |
|
|
| --- |
|
|
| ## 10. Iterative Improvement |
|
|
| ### 10.1 Six Iteration Cycles |
|
|
| | Iteration | Focus | Action | Expected Outcome | |
| |-----------|-------|--------|------------------| |
| **Iter 1** | Feature Engineering Deep-Dive | Add polynomial features (tenure², charge²); interaction terms (contract × monthly charge); binning (tenure quartiles) | +1–2% F1 from non-linear feature capture |
| **Iter 2** | Advanced Sampling | Replace SMOTEENN with ADASYN + Edited Nearest Neighbours; test BorderlineSMOTE | Better synthetic sample quality near decision boundary |
| **Iter 3** | Deep Learning Augmentation | Replace MLP with TabNet or FT-Transformer for tabular deep learning; compare against MLP base | Validate whether deep tabular models improve ensemble diversity |
| **Iter 4** | Temporal Modeling | For Telco: add LSTM/GRU on quarterly customer journey sequences; for Bank: add transaction time-series | Capture temporal churn dynamics; +2–3% F1 on time-sensitive subsets |
| **Iter 5** | Ensemble Expansion | Add 6th base model (Random Forest or Extra Trees) for additional variance reduction; test blending vs. stacking | Further variance reduction; marginal F1 gain of +0.5–1% |
| | **Iter 6** | Production Hardening | Dockerize inference; add A/B test framework; build automated retraining trigger on drift detection; write full production documentation | Deployable system with monitoring, retraining, and compliance docs | |
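
Iter 6's drift-triggered retraining can be built on a population stability index (PSI) check per feature. A sketch; the 0.25 threshold is a common rule of thumb, not a value fixed by this project:

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a reference (training) sample and a live sample of one feature."""
    # Bin edges from the reference distribution's quantiles.
    edges = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1))
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    # Clip live values into the reference range so all land in a bin.
    a_frac = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)  # avoid log(0)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, 10_000)
drifted = rng.normal(1.0, 1.0, 10_000)  # mean shifted by one standard deviation
if population_stability_index(reference, drifted) > 0.25:  # rule-of-thumb threshold
    print("major drift detected: trigger retraining")
```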
|
|
| ### 10.2 Production Documentation Deliverables |
|
|
| | Document | Contents | Audience | |
| |----------|----------|----------| |
| | **Model Card** | Intended use, training data summary, performance metrics, limitations, bias assessment, ethical considerations | Data scientists, regulators | |
| | **API Documentation** | Endpoint specs, request/response schemas, rate limits, error codes | Engineering teams | |
| | **SHAP Dashboard Guide** | How to read force plots, summary plots, and dependence plots | Business stakeholders, customer success | |
| | **Retention Playbook** | How to act on RPS segments; recommended interventions per churn reason | Marketing, customer success | |
| | **Retraining SOP** | When and how to retrain; drift detection thresholds; rollback procedures | MLOps, data engineering | |
| | **Compliance Checklist** | GDPR Article 22 (automated decision-making), CCPA, internal audit requirements | Legal, compliance | |
|
|
| --- |
|
|
| ## Appendix A: Key Equations |
|
|
| **Retention Priority Score:** |
| $$ |
| \text{RPS}_i = P(\text{churn}_i) \times \text{CLV}_i |
| $$ |
| |
| **F1-Score:** |
| $$ |
| F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} |
| $$ |
| |
| **Matthews Correlation Coefficient:** |
| $$ |
| \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} |
| $$ |
| |
| **Expected Calibration Error:** |
| $$ |
| \text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right| |
| $$ |
| |
| --- |
| |
*Document compiled for the ChurnPredict Pro project. All datasets verified on Hugging Face Hub. All 21 references span peer-reviewed and high-impact arXiv publications from 2016–2024.*
| |