# ChurnPredict Pro: A Stacking Ensemble Framework for Customer Churn Prediction with Explainable AI and CLV Scoring
> **Subtitle:** End-to-End Machine Learning Pipeline for Telecommunications and Banking Customer Retention – Combining Gradient Boosting, Neural Networks, and Game-Theoretic Interpretability
---
## Table of Contents
1. [Problem Statement](#1-problem-statement)
2. [Idea of Solution](#2-idea-of-solution)
3. [Objectives](#3-objectives)
4. [Literature Review & References](#4-literature-review--references)
5. [Dataset Understanding](#5-dataset-understanding)
6. [Proposed Methodology](#6-proposed-methodology)
7. [Implementation Strategy](#7-implementation-strategy)
8. [Experimental Design](#8-experimental-design)
9. [Result Analysis](#9-result-analysis)
10. [Iterative Improvement](#10-iterative-improvement)
---
## 1. Problem Statement
### 1.1 Business Context
Customer churn – the loss of clients to competitors or market attrition – is one of the most financially consequential challenges in subscription-based and service-oriented industries. In telecommunications, acquiring a new customer costs **5–25× more** than retaining an existing one (industry estimates, 2024). In banking, customer attrition erodes lifetime value portfolios and damages brand equity. For both sectors, even a **1% reduction in churn** can translate to millions in retained revenue.
Current retention strategies suffer from two critical gaps:
- **Reactive approaches:** Firms typically respond to churn *after* it occurs, through win-back campaigns that are expensive and low-yield.
- **Black-box predictions:** Machine learning models deployed in production often lack interpretability, making it impossible for marketing and customer-success teams to act on model outputs with confidence.
### 1.2 Technical Challenges
| Challenge | Description | Impact |
|-----------|-------------|--------|
| **Class Imbalance** | Churners typically represent 10–30% of the customer base. Standard accuracy metrics are misleading. | High false-negative rates; missed at-risk customers |
| **Feature Heterogeneity** | Datasets mix categorical (contract type, payment method), numerical (tenure, charges), and temporal features (quarter, month-on-book). | Preprocessing complexity; risk of data leakage |
| **Concept Drift** | Customer behavior patterns shift seasonally and with market conditions. Models degrade without retraining. | Production model staleness; declining precision |
| **Interpretability vs. Performance Trade-off** | High-accuracy ensembles are often opaque. Explainable models (e.g., logistic regression) underperform on tabular data. | Regulatory non-compliance (GDPR Article 22); low stakeholder trust |
| **Multi-Domain Generalization** | Models trained on telecom data fail on banking data due to domain shift in feature distributions. | Siloed, non-reusable models per industry |
### 1.3 Gaps in Existing Solutions
1. **Single-model reliance:** Most production churn models deploy a single classifier (XGBoost or logistic regression), missing the variance-reduction benefits of ensemble diversity.
2. **No CLV integration:** Churn predictions are binary – they do not incorporate *which* churners are most valuable to retain, leading to inefficient marketing spend.
3. **Weak experimental rigor:** Many published churn studies use a single train/test split without cross-validation, statistical testing, or confidence intervals on metrics.
4. **Dataset isolation:** Telco and bank churn datasets are studied separately; few works evaluate cross-domain transfer or unified pipelines.
---
## 2. Idea of Solution
### 2.1 Architecture Overview
We propose **ChurnPredict Pro**, a **stacking ensemble architecture** that combines the complementary strengths of five diverse base learners under a meta-learner. The design philosophy is:
> *"Diversity in inductive bias reduces variance; interpretability in the meta-layer preserves actionability."*
### 2.2 The 5-Model Stacking Ensemble
```
┌─────────────────────────────────────────────────────────────────────┐
│                CHURNPREDICT PRO – STACKING ENSEMBLE                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────┐       │
│  │ XGBoost  │ │ LightGBM │ │ CatBoost │ │   MLP    │ │  LR  │       │
│  │  (GBDT)  │ │  (GBDT)  │ │ (Ordered)│ │  (Deep)  │ │(Base)│       │
│  │  Base 1  │ │  Base 2  │ │  Base 3  │ │  Base 4  │ │Base 5│       │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └──┬───┘       │
│       │            │            │            │          │           │
│       └────────────┴────────────┼────────────┴──────────┘           │
│                                 │                                   │
│                       ┌─────────▼─────────┐                         │
│                       │   META-LEARNER    │                         │
│                       │   (Logistic Reg   │                         │
│                       │    / XGBoost)     │                         │
│                       └─────────┬─────────┘                         │
│                                 │                                   │
│                       ┌─────────▼─────────┐                         │
│                       │    CLV SCORING    │                         │
│                       │ + SHAP EXPLAINER  │                         │
│                       └───────────────────┘                         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
### 2.3 Why These 5 Base Models?
| Model | Inductive Bias | Strength on Churn Data | Weakness Mitigated by Ensemble |
|-------|---------------|------------------------|-------------------------------|
| **XGBoost** | Greedy gradient boosting with regularization | Best-in-class on sparse/tabular data; handles missing values natively | Prone to overfitting on small datasets |
| **LightGBM** | Histogram-based leaf-wise boosting | Faster training; GOSS sampling for large data | Leaf-wise can overfit; GOSS introduces bias |
| **CatBoost** | Ordered boosting + categorical encoding | Native categorical feature handling; reduces target leakage | Slower than LightGBM; ordered boosting complexity |
| **MLP (Deep)** | Non-linear feature interactions | Captures complex feature cross-products | Needs more data; less interpretable |
| **Logistic Regression** | Linear decision boundary | Fast, interpretable baseline; L1 regularization for feature selection | Cannot model non-linear relationships |
The meta-learner (Logistic Regression or a shallow XGBoost) learns optimal weights for combining the five base models' predictions, leveraging their uncorrelated errors.
### 2.4 CLV-Weighted Scoring
Instead of ranking customers by churn probability alone, we multiply P(churn) by estimated CLV to produce a **Retention Priority Score (RPS)**:
$$
\text{RPS}_i = P(\text{churn}_i) \times \text{CLV}_i
$$
This ensures retention campaigns target high-value at-risk customers, maximizing ROI.
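As a minimal sketch of the RPS computation (customer IDs, probabilities, and CLV values below are invented for illustration):

```python
import pandas as pd

# Toy scored-customer table; values are illustrative, not from the datasets.
customers = pd.DataFrame({
    "customer_id": ["A", "B", "C", "D"],
    "p_churn": [0.90, 0.30, 0.85, 0.10],     # calibrated churn probability
    "clv": [200.0, 5000.0, 3000.0, 8000.0],  # estimated lifetime value
})

# RPS_i = P(churn_i) * CLV_i
customers["rps"] = customers["p_churn"] * customers["clv"]

# Campaign list: rank by RPS rather than raw churn probability.
ranked = customers.sort_values("rps", ascending=False).reset_index(drop=True)
```

Note that customer C (P = 0.85, CLV = 3,000, RPS = 2,550) outranks customer A (P = 0.90, CLV = 200, RPS = 180): the likeliest churner is not necessarily the most valuable retention target.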
---
## 3. Objectives
### 3.1 Primary Goals
| ID | Objective | Metric Target | Success Criterion |
|----|-----------|---------------|-----------------|
| P1 | Build a stacking ensemble that outperforms any single base model | F1-Score | ΔF1 ≥ +0.03 over best single model |
| P2 | Achieve high recall on churn class (minimize false negatives) | Recall@Churn | ≥ 0.85 on both datasets |
| P3 | Deliver actionable model explanations per customer | SHAP summary | Top-5 features identified per prediction |
| P4 | Rank customers by retention value, not just churn risk | AUC-PR weighted by CLV | ROC-AUC ≥ 0.90 |
### 3.2 Secondary Goals
| ID | Objective | Metric Target |
|----|-----------|---------------|
| S1 | Evaluate cross-domain generalization (Telco → Bank, Bank → Telco) | Transfer AUC ≥ 0.80 |
| S2 | Achieve sub-second inference latency for batch scoring | ≤ 500 ms per 1,000 records |
| S3 | Deploy a reproducible, version-controlled pipeline | Docker + DVC + CI/CD |
| S4 | Document model behavior for regulatory compliance (GDPR/CCPA) | Full SHAP + model card |
### 3.3 Success Criteria Summary
- **Model Performance:** F1-Score > 0.85, ROC-AUC > 0.90, PR-AUC > 0.80 on both datasets
- **Business Impact:** Identify top 20% at-risk customers with ≥ 70% precision
- **Interpretability:** Every prediction accompanied by SHAP force plot; global SHAP summary for stakeholder dashboards
- **Robustness:** 5-fold stratified CV with 95% confidence intervals on all metrics
---
## 4. Literature Review & References
### 4.1 Category Overview
| Category | Count | Papers |
|----------|-------|--------|
| Ensemble / Boosting Methods | 4 | [1–4] |
| SHAP / LIME Interpretability | 3 | [5–7] |
| Deep Learning for Churn | 3 | [8–10] |
| CLV / Profit-Driven Churn | 3 | [11–13] |
| Financial / Bank Churn | 4 | [14–17] |
| Survey / Benchmark / Foundation | 4 | [18–21] |
| **Total** | **21** | |
### 4.2 Full References (2016–2024)
#### [1] XGBoost: A Scalable Tree Boosting System
**Chen, T., & Guestrin, C.** (2016). *KDD*. arXiv:1603.02754.
Introduced sparsity-aware algorithms and weighted quantile sketch for gradient boosting. Became the dominant algorithm for tabular churn prediction tasks worldwide.
#### [2] Tabular Data: Deep Learning is Not All You Need
**Shwartz-Ziv, R., & Armon, A.** (2021). arXiv:2106.03253.
Rigorous comparison showing XGBoost outperforms recent deep learning models on tabular data; ensembling deep models with XGBoost further improves performance.
#### [3] CatBoost: Unbiased Boosting with Categorical Features
**Prokhorenkova, L., et al.** (2017). arXiv:1706.09516.
Ordered boosting and novel categorical feature processing; outperforms other boosting implementations on datasets with high-cardinality categorical churn predictors.
#### [4] Enhancing Customer Churn Prediction: An Adaptive Ensemble Learning Approach
**Shaikhsurab, S., & Magadum, S.** (2024). arXiv:2408.16284.
Adaptive ensemble combining XGBoost, LightGBM, LSTM, MLP, and SVM with stacking + meta-feature generation; achieved **99.28% accuracy** on telecom churn datasets.
#### [5] A Unified Approach to Interpreting Model Predictions (SHAP)
**Lundberg, S. M., & Lee, S.-I.** (2017). *NeurIPS*. arXiv:1705.07874.
Proposed SHAP values as a unified measure of feature importance based on game-theoretic Shapley values; unified six existing explanation methods.
#### [6] "Why Should I Trust You?": Explaining Predictions of Any Classifier (LIME)
**Ribeiro, M. T., Singh, S., & Guestrin, C.** (2016). *KDD*. arXiv:1602.04938.
Introduced LIME to explain any classifier locally via interpretable surrogate models; foundational for churn model explainability and regulatory compliance.
#### [7] XAI Handbook: Towards a Unified Framework for Explainable AI
**Palacio, D. G., et al.** (2021). arXiv:2105.06677.
Provides theoretical framework unifying XAI terminology (LIME, SHAP, Grad-CAM, etc.); essential for regulatory compliance and method comparison in churn explainability.
#### [8] Early Churn Prediction from Large-Scale User-Product Interaction Time Series
**Bhattacharjee, A., Thukral, K., & Patil, C.** (2023). arXiv:2309.14390.
Applied multivariate time series classification with deep neural networks to fantasy sports churn; scales to 10⁸ users, demonstrating the feasibility of deep learning at scale.
#### [9] Modelling Customer Churn for the Retail Industry in a Deep Learning Sequential Framework
**Equihua, C., et al.** (2023). arXiv:2304.00575.
Deep survival framework using recurrent neural networks for non-contractual retail churn; avoids extensive feature engineering through learned representations.
#### [10] Churn Reduction via Distillation
**Jiang, Y., et al.** (2021). arXiv:2106.02654.
Showed model distillation reduces predictive churn (model instability during retraining) while maintaining accuracy across FC, CNN, and transformer architectures.
#### [11] OptDist: Learning Optimal Distribution for Customer Lifetime Value Prediction
**Weng, S., et al.** (2024). arXiv:2408.08585.
Proposed OptDist with distribution learning/selection modules; adaptively selects optimal sub-distributions for CLTV prediction on public and industrial datasets.
#### [12] Customer Lifetime Value Prediction with Uncertainty Estimation Using Monte Carlo Dropout
**Cao, Y., Xu, Y., & Yang, Q.** (2024). arXiv:2411.15944.
Enhanced neural network CLTV prediction with Monte Carlo Dropout for uncertainty quantification; improved Top-5% MAPE significantly.
#### [13] A Predict-and-Optimize Approach to Profit-Driven Churn Prevention
**GΓ³mez-Vargas, E., Maldonado, S., & Vairetti, S.** (2023). arXiv:2310.07047.
First predict-and-optimize approach for churn prevention using individual CLVs (not averages); regret minimization via SGD; tested on 12 real-world datasets.
#### [14] Dynamic Customer Embeddings for Financial Service Applications
**Chitsazan, N., et al.** (2021). arXiv:2106.11880.
DCE framework uses customer digital activity + financial context for intent/fraud/call-center prediction; financial services benchmark for learned representations.
#### [15] FinPT: Financial Risk Prediction with Profile Tuning on Pretrained Foundation Models
**Yin, H., et al.** (2023). arXiv:2308.00065.
Introduced FinBench dataset + FinPT method for financial risk prediction (default, fraud, churn) using LLM-generated customer profiles; strong zero-shot transfer.
#### [16] Advanced User Credit Risk Prediction Using LightGBM, XGBoost and TabNet with SMOTEENN
**Yu, B., et al.** (2024). arXiv:2408.03497.
Combined PCA, SMOTEENN, and LightGBM for bank credit risk prediction; outperformed other models in identifying high-quality applicants under class imbalance.
#### [17] Credit Card Fraud Detection β Classifier Selection Strategy
**Kulatilleke, S.** (2022). arXiv:2208.11900.
Data-driven classifier selection + sampling methods for imbalanced fraud detection; directly applicable to churn's class imbalance challenges.
#### [18] Predicting Customer Churn: Extreme Gradient Boosting with Temporal Data
**Gregory, J.** (2018). arXiv:1802.03396.
Applied XGBoost with temporal feature engineering to time-series churn data; achieved top performance in large-scale competition settings.
#### [19] Predictive Churn with the Set of Good Models
**Watson-Daniels, D., et al.** (2024). arXiv:2402.07745.
Examined prediction instability during model retraining via Rashomon set; critical for production churn model deployment and monitoring.
#### [20] Retention Is All You Need
**Mohiuddin, K., et al.** (2023). arXiv:2304.03103.
HR Decision Support System using SHAP + what-if analysis for employee attrition; demonstrates SHAP utility for retention/churn use cases with interpretable dashboards.
#### [21] Balancing the Scales: A Comprehensive Study on Tackling Class Imbalance
**(2024).** arXiv:2409.19751.
Comprehensive study of SMOTE, Class Weights, and Decision Threshold Calibration for binary classification; **Decision Threshold Calibration was most consistently effective**, which directly guides our experimental design.
---
## 5. Dataset Understanding
### 5.1 Dataset 1: Telco Customer Churn (IBM)
**Source:** [aai510-group1/telco-customer-churn](https://hf.co/datasets/aai510-group1/telco-customer-churn)
**Type:** Fictional telecommunications company data
**Format:** CSV / Parquet
**Splits:** train / validation / test
#### Schema Summary
| Feature Category | Count | Key Features |
|-----------------|-------|-------------|
| **Demographics** | 7 | Age, Gender, Married, Dependents, Number of Dependents, Senior Citizen, Under 30 |
| **Service Usage** | 10 | Phone Service, Internet Service, Internet Type, Multiple Lines, Online Security, Online Backup, Device Protection, Tech Support, Streaming TV, Streaming Movies |
| **Contract & Billing** | 6 | Contract, Payment Method, Paperless Billing, Monthly Charge, Total Charges, Total Refunds |
| **Engagement** | 7 | Tenure (months), Number of Referrals, Referred a Friend, Offer, Satisfaction Score, Churn Score, Quarter |
| **Revenue** | 6 | Total Revenue, Total Long Distance Charges, Total Extra Data Charges, Avg Monthly Long Distance, Avg Monthly GB Download, CLTV |
| **Geographic** | 5 | City, State, Zip Code, Latitude, Longitude, Population |
| **Target** | 4 | Churn (binary), Churn Reason (string), Churn Category (string), Customer Status |
**Total Features:** ~52 (including derived identifiers like `Lat Long`, `Customer ID`)
#### Class Distribution (Audited)
| Split | Total Rows | Churned (1) | Stayed (0) | Churn Rate |
|-------|-----------|-------------|------------|------------|
| Train | ~4,400 | ~1,100 | ~3,300 | ~25% |
| Validation | ~1,500 | ~375 | ~1,125 | ~25% |
| Test | ~1,500 | ~375 | ~1,125 | ~25% |
*Note: Exact counts vary by split. The dataset exhibits moderate class imbalance (~25% churn), manageable without aggressive oversampling.*
#### Notable Data Characteristics
1. **Rich categorical encoding:** Internet Type (DSL, Fiber Optic, Cable, None), Contract (Month-to-Month, One Year, Two Year), Payment Method (4 types)
2. **Temporal granularity:** `Quarter` field (Q1βQ4) enables time-aware feature engineering
3. **Pre-computed churn scores:** `Churn Score` (0–100) and `Satisfaction Score` (1–5) are strong engineered features – risk of target leakage if not handled carefully
4. **CLTV integration:** `CLTV` field directly available for revenue-weighted ranking
5. **Geographic features:** Latitude/longitude enable spatial clustering or geo-derived features
#### Data Quality Flags
- `Total Charges` has blank/missing values for zero-tenure customers (new sign-ups)
- `Churn Reason` and `Churn Category` are populated only for churned customers – post-hoc labels, not usable as features
- `Customer Status` is highly correlated with target; should be excluded or used as stratification
- Some categorical fields (City, State) have high cardinality (50+ states, 1,000+ cities)
---
### 5.2 Dataset 2: Bank Customer Churners
**Source:** [ZZHHJ/bank_churners](https://hf.co/datasets/ZZHHJ/bank_churners)
**Type:** Credit card customer attrition data
**Format:** CSV / Parquet
**Splits:** single train split (requires manual partitioning)
#### Schema Summary
| Feature Category | Count | Key Features |
|-----------------|-------|-------------|
| **Demographics** | 6 | Customer_Age, Gender, Dependent_count, Education_Level, Marital_Status, Income_Category |
| **Account Behavior** | 5 | Months_on_book, Total_Relationship_Count, Months_Inactive_12_mon, Contacts_Count_12_mon, Card_Category |
| **Financial** | 8 | Credit_Limit, Total_Revolving_Bal, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1, Avg_Utilization_Ratio |
| **Target** | 1 | Attrition_Flag (Existing Customer / Attrited Customer) |
| **Artifacts** | 2 | Naive_Bayes_Classifier columns (pre-computed probabilities – **must be removed** to avoid data leakage) |

**Total Columns:** 23 (19 usable features + 1 ID + 1 target + 2 NB artifacts to drop)
#### Class Distribution (Estimated)
| Class | Approximate Count | Rate |
|-------|-------------------|------|
| Existing Customer | ~8,500 | ~83% |
| Attrited Customer | ~1,700 | ~17% |
**Churn rate ~17%** – more imbalanced than Telco; SMOTE/ADASYN or class weighting will be necessary.
#### Notable Data Characteristics
1. **Quarter-over-quarter dynamics:** `Total_Amt_Chng_Q4_Q1` and `Total_Ct_Chng_Q4_Q1` capture behavioral velocity – powerful churn signals
2. **Utilization ratio:** `Avg_Utilization_Ratio` is a strong proxy for engagement; low utilization often precedes attrition
3. **Income categories are binned:** `$60K - $80K`, `$80K - $120K`, etc. – ordinal encoding preferred
4. **Card category:** `Blue` (vast majority), `Silver`, `Gold`, `Platinum` – strong class imbalance within the feature itself
#### Data Quality Flags
- **Critical:** Two `Naive_Bayes_Classifier_*` columns are pre-computed churn probabilities from a baseline model. Using them as features would constitute **data leakage** – they must be dropped before any model training.
- No explicit CLTV field; must be estimated from `Credit_Limit`, `Total_Trans_Amt`, and `Total_Trans_Ct`
- Single split requires manual stratified partitioning (70/15/15 or 80/10/10)
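A sketch of the two hygiene steps above on a toy frame (the real `Naive_Bayes_Classifier_*` columns carry much longer names, but the prefix-based drop works the same way; row counts are invented to mimic the ~17% churn rate):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the bank_churners table.
df = pd.DataFrame({
    "CLIENTNUM": range(1000),
    "Total_Trans_Ct": [40 + i % 60 for i in range(1000)],
    "Naive_Bayes_Classifier_1": [0.1] * 1000,  # leakage artifact
    "Naive_Bayes_Classifier_2": [0.9] * 1000,  # leakage artifact
    "Attrition_Flag": ["Attrited Customer"] * 170 + ["Existing Customer"] * 830,
})

# 1) Drop the pre-computed NB probabilities and the ID before any training.
leak_cols = [c for c in df.columns if c.startswith("Naive_Bayes_Classifier")]
df = df.drop(columns=leak_cols + ["CLIENTNUM"])

# 2) Binarize the target and make a stratified 70/15/15 partition.
y = (df["Attrition_Flag"] == "Attrited Customer").astype(int)
X = df.drop(columns=["Attrition_Flag"])
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
```

Stratification keeps the ~17% churn rate intact in all three partitions.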
---
### 5.3 Cross-Dataset Comparison
| Attribute | Telco (IBM) | Bank Churners |
|-----------|-------------|---------------|
| **Records** | ~7,000 | ~10,000 |
| **Features (usable)** | ~45 | ~19 |
| **Churn Rate** | ~25% | ~17% |
| **Industry** | Telecommunications | Banking / Credit Cards |
| **Temporal Features** | Quarter, Tenure (months) | Months_on_book, Q4/Q1 change ratios |
| **CLTV Available** | Yes (explicit field) | No (must derive) |
| **Geographic Data** | Yes (lat/lon, city, state) | No |
| **Pre-computed Scores** | Churn Score, Satisfaction | Naive Bayes (leakage – drop) |
| **Class Imbalance Severity** | Moderate | High |
| **Primary Churn Driver** | Contract type, tenure, service usage | Inactivity, transaction decline, utilization |
---
## 6. Proposed Methodology
### 6.1 The 7-Phase Pipeline
```
Phase 1: Data Ingestion & Audit
        ↓
Phase 2: Preprocessing & Feature Engineering
        ↓
Phase 3: Exploratory Data Analysis (EDA)
        ↓
Phase 4: Model Training – 5-Base Stacking Ensemble
        ↓
Phase 5: Hyperparameter Optimization
        ↓
Phase 6: Evaluation, Interpretability & CLV Scoring
        ↓
Phase 7: Deployment, Monitoring & Documentation
```
### Phase 1: Data Ingestion & Audit
- Load both datasets from Hugging Face `datasets` library
- Compute schema validation: type checks, missing value audit, cardinality report
- Flag anomalous values (negative charges, impossible ages, blank `Total Charges`)
- Document data provenance and version hashes (DVC)
### Phase 2: Preprocessing & Feature Engineering
#### 2A. Cleaning
- **Telco:** Impute `Total Charges` blanks with `Monthly Charge × Tenure`
- **Bank:** Drop `Naive_Bayes_Classifier_*` columns immediately
- Both datasets: remove ID fields (`Customer ID`, `CLIENTNUM`)
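The Telco side of step 2A can be sketched as follows (toy rows; in the IBM CSV the zero-tenure `Total Charges` values typically arrive as whitespace strings rather than NaN):

```python
import pandas as pd

# Toy Telco rows mirroring the schema described in section 5.1.
telco = pd.DataFrame({
    "Customer ID": ["0001", "0002", "0003"],
    "Tenure": [0, 12, 24],
    "Monthly Charge": [70.0, 50.0, 90.0],
    "Total Charges": [" ", "600.0", "2160.0"],  # blank for a zero-tenure sign-up
})

# Coerce blanks to NaN, then impute with Monthly Charge * Tenure.
telco["Total Charges"] = pd.to_numeric(telco["Total Charges"], errors="coerce")
blank = telco["Total Charges"].isna()
telco.loc[blank, "Total Charges"] = (
    telco.loc[blank, "Monthly Charge"] * telco.loc[blank, "Tenure"])

# Remove the identifier so it never reaches a model.
telco = telco.drop(columns=["Customer ID"])
```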
#### 2B. Encoding
| Feature Type | Encoding Strategy | Example Features |
|-------------|-------------------|------------------|
| Binary categorical | Label encoding (0/1) | `Gender`, `Partner`, `PhoneService` |
| Low-cardinality nominal | One-hot encoding | `Contract`, `Payment Method`, `Education_Level` |
| High-cardinality nominal | Target encoding / CatBoost native | `City`, `State` (Telco); `Income_Category` (Bank) |
| Cyclical temporal | Sine/cosine encoding | `Quarter` mapped to angle |
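The cyclical row can be illustrated for `Quarter` (a minimal sketch; the `quarter_sin`/`quarter_cos` column names are our own):

```python
import numpy as np
import pandas as pd

# Sine/cosine encoding of `Quarter` so Q4 and Q1 are neighbours in
# feature space instead of four "units" apart.
quarters = pd.Series(["Q1", "Q2", "Q3", "Q4"])
q = quarters.str[1].astype(int)      # 1..4
angle = 2 * np.pi * (q - 1) / 4      # map onto the unit circle

encoded = pd.DataFrame({
    "quarter_sin": np.sin(angle),
    "quarter_cos": np.cos(angle),
})
```

Every adjacent pair of quarters, including Q4 → Q1, now sits the same Euclidean distance apart.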
#### 2C. Feature Engineering
- **RFM-style features (Bank):** Recency = `Months_Inactive_12_mon`, Frequency = `Total_Trans_Ct`, Monetary = `Total_Trans_Amt`
- **Engagement ratio (Telco):** `Satisfaction_Score / Churn_Score` as loyalty proxy
- **Velocity features:** Month-over-month change in charges and usage
- **CLTV proxy (Bank):** `Credit_Limit × Avg_Utilization_Ratio × (12 - Months_Inactive_12_mon)`
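The CLV proxy above, applied to two invented customers (an engaged high-limit account versus a drifting low-utilization one):

```python
import pandas as pd

# Sample values are invented; column names follow the bank schema.
bank = pd.DataFrame({
    "Credit_Limit": [12000.0, 3000.0],
    "Avg_Utilization_Ratio": [0.40, 0.05],
    "Months_Inactive_12_mon": [1, 5],
})

# CLTV proxy = Credit_Limit * Avg_Utilization_Ratio * (12 - Months_Inactive_12_mon)
bank["clv_proxy"] = (
    bank["Credit_Limit"]
    * bank["Avg_Utilization_Ratio"]
    * (12 - bank["Months_Inactive_12_mon"])
)
```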
#### 2D. Scaling & Imbalance Handling
- Numerical features → RobustScaler (median/IQR, resistant to outliers)
- Class imbalance → SMOTEENN (SMOTE + Edited Nearest Neighbours) on the training fold only; **never on validation/test**
- Class weights → `scale_pos_weight = len(negative) / len(positive)` for XGBoost/LightGBM
### Phase 3: Exploratory Data Analysis (EDA)
- Univariate distributions (histograms, boxplots for skew detection)
- Bivariate analysis: churn rate by contract type, payment method, tenure bins
- Correlation matrix (Spearman for non-linear relationships)
- Feature-target mutual information scores for feature selection
- Geographic heatmap (Telco: churn rate by state)
### Phase 4: Model Training – Stacking Ensemble
#### 4A. Cross-Validation Strategy
- **5-fold Stratified Cross-Validation** to preserve class distribution
- **GroupKFold** if temporal leakage risk (same customer in multiple quarters)
- Out-of-fold (OOF) predictions from each base model used as meta-features
#### 4B. Base Model Training
| Base Model | Key Hyperparameters | Tuning Range |
|-----------|-------------------|--------------|
| XGBoost | `max_depth`, `learning_rate`, `subsample`, `colsample_bytree`, `scale_pos_weight` | depth: 3–8; lr: 0.01–0.3 |
| LightGBM | `num_leaves`, `learning_rate`, `feature_fraction`, `bagging_fraction`, `is_unbalance` | leaves: 20–100; lr: 0.01–0.3 |
| CatBoost | `depth`, `learning_rate`, `iterations`, `auto_class_weights` | depth: 4–10; iterations: 200–1000 |
| MLP | `hidden_layers`, `dropout`, `batch_size`, `learning_rate` | layers: (128,64), (256,128,64); dropout: 0.2–0.5 |
| Logistic Regression | `C`, `penalty`, `solver`, `class_weight` | C: 0.001–10; penalty: l1/l2/elasticnet |
#### 4C. Meta-Learner Training
- Input: 5 OOF probability vectors (one per base model) + optionally top-K original features
- Model: **Logistic Regression** (interpretable weights showing model contribution) OR **XGBoost** (if non-linear meta-interactions needed)
- Validation: Same 5-fold CV; meta-learner trained on OOF predictions, tested on hold-out
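The OOF stacking step can be sketched with `sklearn.ensemble.StackingClassifier`, whose `cv` parameter makes the meta-features out-of-fold; scikit-learn models stand in here for the actual XGBoost/LightGBM/CatBoost/MLP/LR base slots:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1500, weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("gbdt", GradientBoostingClassifier(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # interpretable meta-weights
    cv=5,                           # base predictions fed to the meta-learner are OOF
    stack_method="predict_proba",
)
stack.fit(X_tr, y_tr)
p_churn = stack.predict_proba(X_te)[:, 1]
```

With a logistic meta-learner, `stack.final_estimator_.coef_` exposes how much each base model contributes.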
### Phase 5: Hyperparameter Optimization
- **Optuna** with **TPESampler** (Tree-structured Parzen Estimator)
- 100 trials per base model; 50 trials for meta-learner
- Pruning: `MedianPruner` with early stopping on validation F1
- Objective: Maximize F1-Score (harmonic mean of precision and recall)
### Phase 6: Evaluation, Interpretability & CLV Scoring
#### 6A. Metrics Suite (10 metrics)
1. Accuracy
2. Precision (Churn class)
3. Recall (Churn class)
4. F1-Score
5. ROC-AUC
6. PR-AUC (Precision-Recall AUC – critical for imbalanced data)
7. Matthews Correlation Coefficient (MCC)
8. Cohen's Kappa
9. Balanced Accuracy
10. Expected Calibration Error (ECE)
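Most of these metrics ship with scikit-learn; ECE does not, so here is one common binary formulation as a sketch:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE (binary form): bin predictions by probability, then average
    |observed positive rate - mean predicted probability| per bin,
    weighted by bin population."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Bin index 0..n_bins-1; clip so p = 1.0 lands in the top bin.
    idx = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # mask.mean() = fraction of samples in bin
    return ece
```

A perfectly calibrated scorer yields 0; a model that says 0.8 for customers who never churn is penalized by the full 0.8 gap.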
#### 6B. SHAP Analysis
- **Global:** SHAP summary plot (beeswarm) showing feature importance across full dataset
- **Local:** SHAP force plot for individual predictions – customer-level actionable insights
- **Dependence:** SHAP dependence plots for top-5 features revealing interaction effects
#### 6C. CLV Scoring
- **Telco:** Use explicit `CLTV` field; multiply by churn probability
- **Bank:** Derive CLV proxy; multiply by churn probability
- Output: Prioritized customer list sorted by RPS (Retention Priority Score)
- Segment: Top 10% (urgent), 10–30% (high), 30–60% (medium), 60–100% (low)
### Phase 7: Deployment, Monitoring & Documentation
- Model serialization: `joblib` for sklearn/CatBoost, native formats for XGBoost/LightGBM
- Inference pipeline: `scikit-learn Pipeline` + custom transformers
- Monitoring: Track prediction distribution drift, feature drift, and metric decay over time
- Documentation: Model card with intended use, limitations, bias analysis, and SHAP summary
---
## 7. Implementation Strategy
### 7.1 Tech Stack
| Layer | Technology | Purpose |
|-------|-----------|---------|
| **Data Loading** | `datasets` (HF), `pandas`, `polars` | Efficient dataset ingestion |
| **Preprocessing** | `scikit-learn` (Pipeline, ColumnTransformer, RobustScaler) | Reproducible feature engineering |
| **ML Models** | `xgboost`, `lightgbm`, `catboost`, `scikit-learn` (MLP, LR) | Base learners |
| **Ensemble** | `mlens` / custom stacking with `scikit-learn` | Meta-learner orchestration |
| **Imbalance** | `imbalanced-learn` (SMOTEENN) | Oversampling + cleaning |
| **Optimization** | `optuna` | Hyperparameter search |
| **Interpretability** | `shap` | Game-theoretic explanations |
| **Tracking** | `trackio` + `mlflow` | Experiment logging, metrics, artifacts |
| **Deployment** | `gradio` / `fastapi` + Docker | API inference and UI demo |
| **Versioning** | `dvc` + `git` | Data and model versioning |
### 7.2 4-Week Timeline
| Week | Focus | Deliverables |
|------|-------|-------------|
| **Week 1** | Data audit, preprocessing, EDA | Clean notebooks; feature engineering pipeline; data quality report |
| **Week 2** | Base model training, hyperparameter tuning | 5 trained base models; Optuna study results; OOF prediction matrices |
| **Week 3** | Stacking ensemble, evaluation, SHAP analysis | Trained meta-learner; 10-metric report; SHAP dashboards; CLV scoring |
| **Week 4** | Cross-domain testing, deployment, documentation | Generalization report; Gradio demo; model card; final documentation |
### 7.3 Code Architecture
```
churnpredict-pro/
├── data/
│   ├── raw/                       # HF datasets (versioned with DVC)
│   ├── processed/                 # Train/val/test splits
│   └── engineered/                # Feature-engineered datasets
├── notebooks/
│   ├── 01_eda_telco.ipynb
│   ├── 02_eda_bank.ipynb
│   ├── 03_feature_engineering.ipynb
│   └── 04_shap_analysis.ipynb
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── load_datasets.py       # HF datasets loader
│   │   ├── preprocess.py          # Cleaning + encoding + scaling
│   │   └── feature_engineer.py    # RFM, velocity, CLV proxy
│   ├── models/
│   │   ├── base_models.py         # XGB, LGBM, CatBoost, MLP, LR wrappers
│   │   ├── stacking_ensemble.py   # OOF + meta-learner
│   │   └── hyperparameter_search.py  # Optuna studies
│   ├── evaluation/
│   │   ├── metrics.py             # 10-metric computation
│   │   ├── shap_explainer.py      # Global + local SHAP
│   │   └── clv_scorer.py          # RPS computation
│   └── deployment/
│       ├── inference_pipeline.py
│       └── app.py                 # Gradio/FastAPI interface
├── configs/
│   ├── telco_config.yaml
│   └── bank_config.yaml
├── experiments/                   # Trackio / MLflow runs
├── tests/
│   ├── test_preprocessing.py
│   └── test_models.py
├── Dockerfile
├── requirements.txt
├── dvc.yaml
└── README.md
```
---
## 8. Experimental Design
### 8.1 Five Experiments
| ID | Experiment | Hypothesis | Method |
|----|-----------|------------|--------|
| **E1** | Single Model Baseline | Individual models underperform ensemble due to bias-variance limitations | Train each of 5 base models standalone; report metrics |
| **E2** | Stacking Ensemble | Meta-learner combining 5 models outperforms best single model by ≥ 3% F1 | 5-fold OOF stacking with LR meta-learner |
| **E3** | Imbalance Strategy Comparison | Threshold calibration is more effective than SMOTE for churn (per [21]) | Compare: (a) no correction, (b) SMOTEENN, (c) class weights, (d) threshold calibration |
| **E4** | Cross-Domain Transfer | Models trained on Telco generalize to Bank with ≥ 80% AUC | Train on Telco, evaluate zero-shot on Bank; then fine-tune |
| **E5** | CLV-Weighted vs. Uniform Ranking | RPS improves campaign ROI over probability-only ranking | Compare top-20% precision: P(churn) only vs. P(churn) × CLV |
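E5's comparison boils down to sorting the same customer pool by two different keys. A minimal sketch with hypothetical customers and CLV values (all numbers are illustrative, not from the datasets):

```python
# Toy illustration of E5: rank customers by P(churn) alone vs. by
# RPS = P(churn) x CLV, then take the top-k for the retention campaign.
customers = [
    # (customer_id, p_churn, clv_dollars) -- hypothetical values
    ("C1", 0.90,  500),
    ("C2", 0.85, 4000),
    ("C3", 0.60, 9000),
    ("C4", 0.40, 1200),
    ("C5", 0.20, 8000),
]

def top_k(rows, key, k):
    """Return ids of the k customers ranked highest by `key`."""
    return [r[0] for r in sorted(rows, key=key, reverse=True)[:k]]

by_prob = top_k(customers, key=lambda c: c[1], k=2)         # probability only
by_rps  = top_k(customers, key=lambda c: c[1] * c[2], k=2)  # revenue at risk

print(by_prob)  # ['C1', 'C2']: likeliest churners, regardless of value
print(by_rps)   # ['C3', 'C2']: C1 drops out (low CLV), C3 rises (high CLV)
```

The two rankings diverge exactly where the experiment expects: RPS deprioritizes low-value customers who are almost certain to churn in favor of high-value customers at moderate risk.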
### 8.2 Ten Evaluation Metrics
| # | Metric | Formula / Definition | Why It Matters for Churn |
|---|--------|---------------------|-------------------------|
| 1 | **Accuracy** | (TP + TN) / (TP + TN + FP + FN) | Overall correctness; misleading if imbalanced |
| 2 | **Precision (Churn)** | TP / (TP + FP) | Of predicted churners, how many actually churn? (cost of false alarms) |
| 3 | **Recall (Churn)** | TP / (TP + FN) | Of actual churners, how many did we catch? (cost of missed churners) |
| 4 | **F1-Score** | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean; balances precision and recall |
| 5 | **ROC-AUC** | Area under ROC curve | Discrimination ability across all thresholds |
| 6 | **PR-AUC** | Area under Precision-Recall curve | More informative than ROC-AUC for imbalanced data |
| 7 | **MCC** | (TP×TN − FP×FN) / √(product of marginals) | Correlation between prediction and truth; robust to imbalance |
| 8 | **Cohen's Kappa** | (Observed − Expected) / (1 − Expected) | Agreement beyond chance; useful for inter-rater reliability analogies |
| 9 | **Balanced Accuracy** | (Sensitivity + Specificity) / 2 | Average of recall on both classes; fair on imbalanced data |
| 10 | **ECE** | Expected Calibration Error | Measures reliability of probability outputs; critical for CLV weighting |
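The threshold-dependent metrics above (all except the two AUCs and ECE, which need raw scores) fall directly out of the four confusion-matrix counts. A self-contained sketch with made-up counts, not project results:

```python
import math

def churn_metrics(tp, tn, fp, fn):
    """Threshold-dependent metrics from confusion-matrix counts
    (positive class = churn)."""
    n = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                     # sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Cohen's kappa: observed agreement vs. chance agreement from marginals
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (accuracy - p_e) / (1 - p_e)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "mcc": mcc, "kappa": kappa,
            "balanced_accuracy": (recall + specificity) / 2}

# Hypothetical counts: 50 actual churners, 150 non-churners
m = churn_metrics(tp=40, tn=130, fp=20, fn=10)
print({k: round(v, 3) for k, v in m.items()})
```

Note how accuracy (0.85) flatters this classifier relative to F1 (~0.73) and MCC (~0.63), which is exactly why the project reports all ten metrics rather than accuracy alone.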
### 8.3 Statistical Rigor
1. **Confidence Intervals:** All metrics reported with 95% CIs from 5-fold CV (bootstrap percentile method)
2. **McNemar's Test:** Statistically compare stacking ensemble vs. best single model
3. **DeLong's Test:** Compare ROC-AUC differences between models
4. **Permutation Test:** Validate feature importance scores from SHAP
5. **Stratification:** All splits stratified on the target (and, for Telco, additionally on `Contract` type, its strongest churn predictor) to prevent distribution shift
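For item 1, the percentile-bootstrap CI is simple enough to sketch end to end. The example below resamples five hypothetical per-fold F1 scores for brevity; in practice the resampling would be over per-sample test predictions, which gives much tighter intervals:

```python
import random

def bootstrap_ci(values, stat=lambda xs: sum(xs) / len(xs),
                 n_boot=2000, alpha=0.05, seed=42):
    """95% percentile-bootstrap CI: resample with replacement, recompute
    the statistic, and take the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    boots = sorted(
        stat([rng.choice(values) for _ in values]) for _ in range(n_boot)
    )
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical F1 scores from 5-fold CV (illustrative numbers only)
fold_f1 = [0.84, 0.86, 0.85, 0.87, 0.83]
lo, hi = bootstrap_ci(fold_f1)
print(f"F1 = {sum(fold_f1)/5:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```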
### 8.4 Reproducibility Checklist
- [ ] Random seeds fixed (`random_state=42`) for all stochastic operations
- [ ] `requirements.txt` with exact versions (via `pip freeze`)
- [ ] DVC tracking for data and model artifacts
- [ ] Git commit hash recorded with every experiment
- [ ] Trackio / MLflow logging of hyperparameters, metrics, and artifact paths
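A minimal sketch of the seed-fixing item; in the full project the same seed would also be passed to NumPy, the deep-learning framework, and every model's `random_state` (omitted here to keep the snippet stdlib-only):

```python
import os
import random

def set_global_seed(seed: int = 42) -> None:
    """Fix stdlib RNG state for reproducibility."""
    random.seed(seed)
    # Note: PYTHONHASHSEED only affects hash randomization if set before
    # the interpreter starts; recorded here so subprocesses inherit it.
    os.environ["PYTHONHASHSEED"] = str(seed)

set_global_seed(42)
a = random.random()
set_global_seed(42)
b = random.random()
assert a == b  # identical draw after reseeding
```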
---
## 9. Result Analysis
### 9.1 Expected Performance
Based on literature benchmarks ([4] achieved 99.28% on telecom; [16] achieved strong results on bank credit risk with SMOTEENN + LightGBM), our targets are conservative and grounded:
| Dataset | Best Single Model F1 | Stacking Ensemble F1 | Expected Δ |
|---------|---------------------|---------------------|------------|
| **Telco** | 0.82–0.84 (XGBoost/CatBoost) | **0.86–0.88** | +0.03–0.04 |
| **Bank** | 0.78–0.81 (LightGBM/XGBoost) | **0.82–0.85** | +0.03–0.04 |
### 9.2 SHAP Analysis: Expected Insights
Based on prior churn research, we anticipate the following feature importance rankings:
**Telco (Expected Top 5 SHAP Features):**
1. `Contract` (Month-to-Month vs. longer) – strongest predictor
2. `Tenure in Months` – inverse relationship with churn
3. `Monthly Charge` / `Total Charges` – price sensitivity
4. `Internet Type` (Fiber Optic churns more than DSL)
5. `Payment Method` (Electronic check = high risk)
**Bank (Expected Top 5 SHAP Features):**
1. `Total_Trans_Ct` (transaction frequency decline)
2. `Total_Trans_Amt` (monetary decline)
3. `Months_Inactive_12_mon` (recency of activity)
4. `Total_Relationship_Count` (cross-product engagement)
5. `Contacts_Count_12_mon` (complaint/contact proxy)
### 9.3 Business Impact Projections
Assuming a hypothetical telecom with:
- 100,000 customers
- 25% annual churn rate
- Average CLV = $3,000
- Retention campaign cost = $50 per targeted customer
- Campaign success rate (if well-targeted) = 30%
| Scenario | Customers Targeted | Campaign Cost | Churners Retained | Revenue Saved | Net ROI |
|----------|-------------------|---------------|-------------------|---------------|---------|
| Random targeting (25% base churn) | 20,000 | $1,000,000 | 1,500 | $4,500,000 | 4.5× |
| Model-guided (top 20% by RPS) | 20,000 | $1,000,000 | 4,200 | $12,600,000 | **12.6×** |
*Model-guided targeting improves ROI by ~2.8× over random selection by focusing on high-value, high-probability churners.*
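The table's figures follow directly from the stated assumptions; a quick sanity check (note the 14,000-churner capture in the top 20% is inferred from the 4,200 figure, not stated independently):

```python
# Reproduce the ROI table from its stated assumptions.
CUSTOMERS = 100_000
CHURN_RATE = 0.25
AVG_CLV = 3_000            # revenue saved per retained customer, $
COST_PER_TARGET = 50       # $ per targeted customer
SUCCESS_RATE = 0.30        # retention offers that succeed when well targeted
TARGET_FRACTION = 0.20

targeted = int(CUSTOMERS * TARGET_FRACTION)       # 20,000 customers
campaign_cost = targeted * COST_PER_TARGET        # $1,000,000

# Random targeting: the targeted pool churns at the base rate.
random_saved = targeted * CHURN_RATE * SUCCESS_RATE   # ~1,500 retained
random_roi = random_saved * AVG_CLV / campaign_cost   # revenue saved / cost

# Model-guided: assume 14,000 true churners land in the top 20% by RPS
# (the value implied by the table's 4,200 retained).
model_saved = 14_000 * SUCCESS_RATE                   # ~4,200 retained
model_roi = model_saved * AVG_CLV / campaign_cost

print(random_saved, random_roi)
print(model_saved, model_roi)
```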
### 9.4 Visualization Plan
| Visualization | Purpose |
|--------------|---------|
| ROC & PR curves (all models overlaid) | Comparative discrimination |
| Confusion matrices | Error type analysis |
| SHAP summary plot (beeswarm) | Global feature importance |
| SHAP force plots (sample customers) | Local explanations for stakeholders |
| SHAP dependence plots | Feature interaction discovery |
| Calibration plot (predicted vs. actual) | Probability reliability |
| CLV-RPS scatter plot | Segmentation visualization |
| Metric bar chart with 95% CIs | Statistical comparison |
---
## 10. Iterative Improvement
### 10.1 Six Iteration Cycles
| Iteration | Focus | Action | Expected Outcome |
|-----------|-------|--------|------------------|
| **Iter 1** | Feature Engineering Deep-Dive | Add polynomial features (tenure², charge²); interaction terms (contract × monthly charge); binning (tenure quartiles) | +1–2% F1 from non-linear feature capture |
| **Iter 2** | Advanced Sampling | Replace SMOTEENN with ADASYN + Edited Nearest Neighbours; test BorderlineSMOTE | Better synthetic sample quality near decision boundary |
| **Iter 3** | Deep Learning Augmentation | Replace MLP with TabNet or FT-Transformer for tabular deep learning; compare against MLP base | Validate whether deep tabular models improve ensemble diversity |
| **Iter 4** | Temporal Modeling | For Telco: add LSTM/GRU on quarterly customer journey sequences; for Bank: add transaction time-series | Capture temporal churn dynamics; +2–3% F1 on time-sensitive subsets |
| **Iter 5** | Ensemble Expansion | Add 6th base model (Random Forest or Extra Trees) for additional variance reduction; test blending vs. stacking | Further variance reduction; marginal F1 gain of +0.5–1% |
| **Iter 6** | Production Hardening | Dockerize inference; add A/B test framework; build automated retraining trigger on drift detection; write full production documentation | Deployable system with monitoring, retraining, and compliance docs |
### 10.2 Production Documentation Deliverables
| Document | Contents | Audience |
|----------|----------|----------|
| **Model Card** | Intended use, training data summary, performance metrics, limitations, bias assessment, ethical considerations | Data scientists, regulators |
| **API Documentation** | Endpoint specs, request/response schemas, rate limits, error codes | Engineering teams |
| **SHAP Dashboard Guide** | How to read force plots, summary plots, and dependence plots | Business stakeholders, customer success |
| **Retention Playbook** | How to act on RPS segments; recommended interventions per churn reason | Marketing, customer success |
| **Retraining SOP** | When and how to retrain; drift detection thresholds; rollback procedures | MLOps, data engineering |
| **Compliance Checklist** | GDPR Article 22 (automated decision-making), CCPA, internal audit requirements | Legal, compliance |
---
## Appendix A: Key Equations
**Retention Priority Score:**
$$
\text{RPS}_i = P(\text{churn}_i) \times \text{CLV}_i
$$
**F1-Score:**
$$
F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$
**Matthews Correlation Coefficient:**
$$
\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
$$
**Expected Calibration Error:**
$$
\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right|
$$
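A concrete sketch of the ECE formula for the binary churn-probability case, using the reliability-diagram convention that acc(B_m) is the observed churn fraction in the bin and conf(B_m) is the mean predicted probability, with M equal-width bins assumed:

```python
def ece(probs, labels, n_bins=10):
    """Expected Calibration Error over equal-width confidence bins.
    probs: predicted P(churn); labels: 1 = churned, 0 = retained."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in last bin
        bins[idx].append((p, y))
    n = len(probs)
    total = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)     # mean predicted probability
        acc = sum(y for _, y in b) / len(b)      # observed churn fraction
        total += len(b) / n * abs(acc - conf)    # |B_m|/n weighting
    return total

# Perfectly calibrated toy scores: 20% of the 0.2-bin churns, 80% of the 0.8-bin
probs  = [0.2] * 5 + [0.8] * 5
labels = [1, 0, 0, 0, 0] + [1, 1, 1, 1, 0]
print(ece(probs, labels))  # ~0.0
```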
---
*Document compiled for the ChurnPredict Pro project. All datasets verified on Hugging Face Hub. All 21 references span peer-reviewed and high-impact arXiv publications from 2016–2024.*