Lead Scoring with Bank Marketing Dataset
Overview
This notebook demonstrates building a lead scoring model using the Bank Marketing dataset. The goal is to predict whether a client will convert (sign up for a service) based on various features.
We cover:
- Data preparation and handling missing values.
- Feature importance using ROC AUC for numerical variables.
- Logistic regression modeling with one-hot encoding.
- Precision, recall, and F1 score analysis to select thresholds.
- 5-fold cross-validation to check model stability.
- Hyperparameter tuning to select the best regularization parameter.
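The feature-ranking and modeling steps listed above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: the column names (`signal`, `noise`, `source`, `converted`), the split sizes, and the model settings are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; every column name here is hypothetical.
rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "source": rng.choice(["ads", "referral", "organic"], size=n),
})
# The target depends on `signal` only, so `signal` should rank highest.
prob = 1 / (1 + np.exp(-2 * df["signal"].to_numpy()))
df["converted"] = (rng.random(n) < prob).astype(int)

df_train, df_val = train_test_split(df, test_size=0.2, random_state=1)
y_train = df_train["converted"].to_numpy()
y_val = df_val["converted"].to_numpy()

# 1) Rank numerical features by using each one directly as a score.
feature_auc = {}
for col in ["signal", "noise"]:
    auc = roc_auc_score(y_train, df_train[col])
    # An AUC below 0.5 means the feature is negatively related to the
    # target, so negate the feature before scoring it.
    if auc < 0.5:
        auc = roc_auc_score(y_train, -df_train[col])
    feature_auc[col] = auc
best_feature = max(feature_auc, key=feature_auc.get)

# 2) One-hot encode categorical columns and fit logistic regression.
features = ["signal", "noise", "source"]
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train[features].to_dict(orient="records"))
X_val = dv.transform(df_val[features].to_dict(orient="records"))

model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

print(best_feature, round(val_auc, 3))
```

`DictVectorizer` passes numeric values through unchanged and expands string values into one indicator column per category, which is why a single call handles both kinds of feature.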
Key Results
- Best numerical feature (ROC AUC): number_of_courses_viewed
- Validation AUC: 0.794
- Threshold where precision ≈ recall: 0.59
- Threshold with max F1: 0.47
- Standard deviation of AUC across folds: 0.01
- Best regularization parameter C: 0.001
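The two threshold results come from sweeping a decision threshold over the predicted probabilities. A minimal sketch of such a sweep is shown below, assuming a grid of 101 thresholds from 0 to 1 and synthetic scores; the helper name `precision_recall_f1` is hypothetical, and the values it produces here are unrelated to the results reported above.

```python
import numpy as np

def precision_recall_f1(y_true, y_score, thresholds):
    """Return rows of (threshold, precision, recall, f1)."""
    rows = []
    for t in thresholds:
        pred = y_score >= t
        tp = int(np.sum(pred & (y_true == 1)))
        fp = int(np.sum(pred & (y_true == 0)))
        fn = int(np.sum(~pred & (y_true == 1)))
        # Undefined ratios (zero denominator) are reported as 0.0.
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom > 0 else 0.0
        rows.append((t, precision, recall, f1))
    return np.array(rows)

# Synthetic labels and scores, used only to exercise the function.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)
noise = rng.normal(scale=0.2, size=500)
y_score = np.clip(0.5 + 0.25 * (2 * y_true - 1) + noise, 0, 1)

thresholds = np.linspace(0, 1, 101)
table = precision_recall_f1(y_true, y_score, thresholds)

# Threshold where precision and recall are closest, skipping degenerate
# rows in which either metric is zero.
gap = np.abs(table[:, 1] - table[:, 2])
gap[(table[:, 1] == 0) | (table[:, 2] == 0)] = np.inf
t_equal = table[np.argmin(gap), 0]

# Threshold with the highest F1 (first one wins on ties).
t_best_f1 = table[np.argmax(table[:, 3]), 0]

print(t_equal, t_best_f1)
```

Which of the two thresholds to use depends on the relative cost of false positives and false negatives in the business setting.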
Lessons Learned
- ROC AUC can help identify predictive features even before modeling.
- Logistic regression combined with one-hot encoding provides a strong baseline.
- Threshold tuning is crucial for balancing precision and recall based on business needs.
- Cross-validation shows how stable the model is across folds and helps detect overfitting.
- Tuning the regularization parameter C can improve validation performance and stability.
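The cross-validation and tuning steps behind the last two lessons can be sketched as below. This is an illustration under assumptions, not the notebook's code: the data is synthetic, the helper `cv_auc` is hypothetical, and the candidate grid for C and the tie-breaking rule are choices made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic data: the target depends on the first two columns only.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 4))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(600) < 1 / (1 + np.exp(-logits))).astype(int)

def cv_auc(X, y, C, n_splits=5, seed=1):
    """Mean and standard deviation of validation AUC over K folds."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in kfold.split(X):
        model = LogisticRegression(C=C, max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict_proba(X[val_idx])[:, 1]
        scores.append(roc_auc_score(y[val_idx], y_pred))
    return float(np.mean(scores)), float(np.std(scores))

# Illustrative grid of regularization strengths (smaller C = stronger).
candidates = [0.000001, 0.001, 1]
results = {C: cv_auc(X, y, C) for C in candidates}

# Pick the highest mean AUC; break ties by lower std, then by smaller C.
best_C = min(
    results,
    key=lambda C: (-round(results[C][0], 3), round(results[C][1], 3), C),
)

for C, (mean_auc, std_auc) in results.items():
    print(f"C={C}: {mean_auc:.3f} +- {std_auc:.3f}")
print("best C:", best_C)
```

A small standard deviation across folds suggests the score does not depend much on which rows land in the validation fold.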
Environment
- Python 3.12
- Jupyter Notebook
- Libraries: pandas, numpy, scikit-learn, matplotlib, seaborn
Dataset
The Bank Marketing dataset used in this project is publicly available:
Bank Marketing Dataset CSV
Author
Created as part of ML Zoomcamp 2025 Homework 4.