# Lead Scoring with Bank Marketing Dataset
---
## Overview
This notebook demonstrates building a **lead scoring model** using the Bank Marketing dataset. The goal is to predict whether a client will **convert** (sign up for a service) based on various features.
We cover:
1. Data preparation and handling missing values.
2. Feature importance using ROC AUC for numerical variables.
3. Logistic regression modeling with **one-hot encoding**.
4. Precision, recall, and F1 score analysis to select thresholds.
5. 5-fold cross-validation to check model stability.
6. Hyperparameter tuning to select the best regularization parameter.
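
Step 2 can be illustrated with a small, dependency-free sketch. It treats each numerical column as if it were a model score and computes ROC AUC as the probability that a randomly chosen converting client has a higher value than a randomly chosen non-converting one. The target column name `converted`, the `noise` column, and the toy rows are assumptions made for illustration; the notebook itself most likely relies on `sklearn.metrics.roc_auc_score`, which should give the same number on the same data.

```python
def roc_auc(labels, scores):
    """Pairwise ROC AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_numeric_features(rows, target, features):
    """Return (feature, auc) pairs sorted from most to least predictive."""
    labels = [row[target] for row in rows]
    aucs = {f: roc_auc(labels, [row[f] for row in rows]) for f in features}
    return sorted(aucs.items(), key=lambda kv: kv[1], reverse=True)


# Toy rows, invented for illustration only (not taken from the dataset).
toy_rows = [
    {"number_of_courses_viewed": 0, "noise": 3, "converted": 0},
    {"number_of_courses_viewed": 1, "noise": 1, "converted": 0},
    {"number_of_courses_viewed": 2, "noise": 2, "converted": 1},
    {"number_of_courses_viewed": 4, "noise": 1, "converted": 1},
]

ranking = rank_numeric_features(
    toy_rows, "converted", ["number_of_courses_viewed", "noise"]
)
for name, auc in ranking:
    print(f"{name}: {auc:.3f}")
```

The pairwise loop is quadratic in the number of rows, so it is only meant to show what the metric measures. An AUC well below 0.5 means the feature is predictive in the opposite direction, which is why such a feature is often negated before comparison.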
---
## Key Results
- **Best numerical feature (ROC AUC):** `number_of_courses_viewed`
- **Validation AUC:** `0.794`
- **Threshold where precision ≈ recall:** `0.59`
- **Threshold with max F1:** `0.47`
- **Standard deviation of AUC across folds:** `0.01`
- **Best regularization parameter C:** `0.001`
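
The two threshold results above can be found by sweeping a grid of thresholds over the validation predictions and computing precision, recall, and F1 at each one. The following is a minimal sketch of that idea, not the notebook's code: the function names, the grid step, the toy labels and probabilities, and the rule of predicting positive when the probability is greater than or equal to the threshold are all assumptions.

```python
def precision_recall_f1(labels, probs, threshold):
    """Precision, recall, and F1 when predicting positive for probs >= threshold."""
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(labels, probs) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def sweep_thresholds(labels, probs, step=0.01):
    """Return (threshold, precision, recall, f1) for thresholds 0.0 to 1.0."""
    n_steps = int(round(1 / step))
    table = []
    for i in range(n_steps + 1):
        t = round(i * step, 10)  # avoid float drift such as 0.30000000000000004
        table.append((t, *precision_recall_f1(labels, probs, t)))
    return table


# Toy validation labels and predicted probabilities (illustrative only).
toy_labels = [0, 0, 1, 0, 1, 1]
toy_probs = [0.10, 0.35, 0.40, 0.55, 0.70, 0.90]

table = sweep_thresholds(toy_labels, toy_probs, step=0.1)

# Threshold with the highest F1 (first one wins on ties).
best_f1 = max(table, key=lambda row: row[3])

# Threshold where precision and recall are closest, ignoring the degenerate
# case where nothing is predicted positive and both are reported as 0.
balanced = min(
    (row for row in table if row[1] > 0 and row[2] > 0),
    key=lambda row: abs(row[1] - row[2]),
)

print("max F1 at threshold", best_f1[0])
print("precision ~ recall at threshold", balanced[0])
```

On real predictions a finer grid, such as a step of 0.01, gives thresholds at the resolution reported in the results.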
---
## Lessons Learned
- ROC AUC can help identify predictive features even before modeling.
- Logistic regression combined with one-hot encoding provides a strong baseline.
- Threshold tuning is crucial for balancing precision and recall based on business needs.
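- Cross-validation checks the robustness of the model: a small spread in AUC across folds (standard deviation of about `0.01` here) suggests the validation score does not depend on one particular split. It helps detect overfitting rather than prevent it.
- Tuning the regularization parameter `C` with cross-validation helps select a model that performs well and consistently across folds.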
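
The cross-validation and tuning steps follow one pattern: split the rows into folds, evaluate each fold, and summarize the scores by their mean and standard deviation. Below is a standard-library sketch of that pattern. The helper names, the seed, and the `evaluate_fold` callback are hypothetical; in the notebook the callback would train a logistic regression on the training rows (for a given `C`) and return the validation AUC, whereas here it is replaced by a stand-in so the example runs on its own.

```python
import random
import statistics


def k_fold_indices(n_rows, n_splits=5, seed=1):
    """Yield (train_idx, val_idx) pairs; each row is in exactly one val fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, val_idx


def cross_validate(n_rows, evaluate_fold, n_splits=5, seed=1):
    """Return (mean, std, scores) of evaluate_fold over all folds.

    pstdev is the population standard deviation, the same convention as
    NumPy's default np.std.
    """
    scores = [
        evaluate_fold(train_idx, val_idx)
        for train_idx, val_idx in k_fold_indices(n_rows, n_splits, seed)
    ]
    return statistics.mean(scores), statistics.pstdev(scores), scores


# Stand-in callback (an assumption): returns the validation share of rows.
# A real callback would fit a model on train_idx and score it on val_idx.
def demo_evaluate(train_idx, val_idx):
    return len(val_idx) / (len(train_idx) + len(val_idx))


mean_score, std_score, fold_scores = cross_validate(10, demo_evaluate)
print(f"mean={mean_score:.3f} std={std_score:.3f}")
```

To tune `C`, the same `cross_validate` call can be repeated once per candidate value, keeping the candidate with the best mean score and using the lower standard deviation to break ties.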
---
## Environment
- Python 3.12
- Jupyter Notebook
- Libraries: `pandas`, `numpy`, `scikit-learn`, `matplotlib`, `seaborn`
---
## Dataset
The Bank Marketing dataset used in this project is publicly available:
[Bank Marketing Dataset CSV](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv)
---
## Author
Created as part of **ML Zoomcamp 2025 Homework 4**.