
Lead Scoring with Bank Marketing Dataset



Overview

This notebook builds a lead scoring model on the Bank Marketing dataset. The goal is to predict whether a client will convert (sign up for a service) from the available features.

We cover:

  1. Data preparation and handling missing values.
  2. Feature importance using ROC AUC for numerical variables.
  3. Logistic regression modeling with one-hot encoding.
  4. Precision, recall, and F1 score analysis to select thresholds.
  5. 5-fold cross-validation to check model stability.
  6. Hyperparameter tuning to select the best regularization parameter.
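
As a rough illustration of step 2, the sketch below scores each numerical feature by using its raw values as the prediction and computing ROC AUC by hand, as the probability that a random converted client ranks above a random non-converted one. The notebook presumably relies on scikit-learn for this; the helper names, the toy data, the `hypothetical_noise` column, and the choice to flip AUCs below 0.5 are all assumptions made for the example.

```python
def roc_auc(labels, scores):
    """ROC AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def feature_auc(labels, values):
    """AUC of a feature used directly as a score, ignoring its direction.

    Assumption: a feature with AUC below 0.5 is treated as negatively
    correlated with the target, so its negation (AUC = 1 - auc) is used.
    """
    auc = roc_auc(labels, values)
    return max(auc, 1 - auc)


# Toy data, made up for illustration only.
converted = [0, 0, 1, 0, 1, 1]
features = {
    "number_of_courses_viewed": [0, 1, 2, 1, 3, 4],
    "hypothetical_noise": [5, 1, 4, 2, 3, 0],
}

# Rank features from most to least predictive on their own.
ranking = sorted(
    features,
    key=lambda name: feature_auc(converted, features[name]),
    reverse=True,
)
for name in ranking:
    print(name, round(feature_auc(converted, features[name]), 3))
```

The pairwise loop is quadratic in the number of rows, which is fine for a sketch but is a reason to prefer a library implementation on real data.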

Key Results

  • Best numerical feature (ROC AUC): number_of_courses_viewed
  • Validation AUC: 0.794
  • Threshold where precision ≈ recall: 0.59
  • Threshold with max F1: 0.47
  • Standard deviation of AUC across folds: 0.01
  • Best regularization parameter C: 0.001
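
The two threshold figures above come from sweeping a decision threshold over the predicted probabilities. A minimal sketch of such a sweep follows; the labels and probabilities are invented, the 0.01 step is an assumption, and ties are resolved by taking the lowest threshold, so the numbers it produces have no relation to the results reported here.

```python
def precision_recall_f1(labels, probs, threshold):
    """Precision, recall and F1 when predicting 1 for probs >= threshold."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    # Convention for this sketch: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented validation labels and predicted probabilities.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.10, 0.35, 0.40, 0.55, 0.60, 0.80, 0.20, 0.90]

# Sweep thresholds 0.00, 0.01, ..., 1.00 (step size is an assumption).
thresholds = [i / 100 for i in range(101)]
rows = [(t, *precision_recall_f1(labels, probs, t)) for t in thresholds]

# Threshold with the highest F1 (max keeps the first, i.e. lowest, on ties).
best_f1_threshold = max(rows, key=lambda r: r[3])[0]

# Threshold where precision and recall are closest, skipping the degenerate
# region where nothing is predicted positive and both are 0.
usable = [r for r in rows if r[1] > 0 and r[2] > 0]
balance_threshold = min(usable, key=lambda r: abs(r[1] - r[2]))[0]

print("max F1 at", best_f1_threshold)
print("precision ~ recall at", balance_threshold)
```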

Lessons Learned

  • ROC AUC can help identify predictive features even before modeling.
  • Logistic regression combined with one-hot encoding provides a strong baseline.
  • Threshold tuning is crucial for balancing precision and recall based on business needs.
  • Cross-validation shows how stable the model is across data splits and helps detect overfitting; it does not prevent overfitting by itself.
  • Tuning the regularization parameter C selects the model that generalizes best across folds.
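
The cross-validation and tuning points can be sketched with scikit-learn as below. This is not the notebook's code: the data is synthetic, `hypothetical_source` is a made-up categorical column, and the solver, random seed, candidate values other than 0.001, and tie-breaking rule are assumptions chosen for the example.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic stand-in for the real dataset (assumption, for illustration only).
rng = np.random.default_rng(1)
n = 300
views = rng.integers(0, 10, size=n)
source = rng.choice(["ads", "referral", "organic"], size=n)
logit = 0.6 * views - 3.0 + 1.0 * (source == "referral")
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
records = [
    {"number_of_courses_viewed": int(v), "hypothetical_source": str(s)}
    for v, s in zip(views, source)
]


def cv_auc(records, y, C, n_splits=5, seed=1):
    """Mean and standard deviation of validation AUC over k folds."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in kfold.split(records):
        # Fit the one-hot encoder on the training fold only to avoid leakage.
        dv = DictVectorizer(sparse=False)
        X_train = dv.fit_transform([records[i] for i in train_idx])
        X_val = dv.transform([records[i] for i in val_idx])
        model = LogisticRegression(solver="liblinear", C=C, max_iter=1000)
        model.fit(X_train, y[train_idx])
        val_probs = model.predict_proba(X_val)[:, 1]
        scores.append(roc_auc_score(y[val_idx], val_probs))
    return float(np.mean(scores)), float(np.std(scores))


# Candidate regularization strengths, in ascending order (assumed grid).
candidates = [0.000001, 0.001, 1]
results = {C: cv_auc(records, y, C) for C in candidates}

# Pick the highest rounded mean AUC; break ties by lower std, then smaller C
# (max returns the first maximal element, and candidates are ascending).
best_C = max(candidates, key=lambda C: (round(results[C][0], 3), -results[C][1]))

for C in candidates:
    mean_auc, std_auc = results[C]
    print(f"C={C}: {mean_auc:.3f} +- {std_auc:.3f}")
print("best C:", best_C)
```

A small standard deviation across folds suggests the score does not depend much on which rows land in the validation split; a large one would be a signal to look for overfitting or too little data.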

Environment

  • Python 3.12
  • Jupyter Notebook
  • Libraries: pandas, numpy, scikit-learn, matplotlib, seaborn

Dataset

The Bank Marketing dataset used in this project is publicly available:
Bank Marketing Dataset CSV


Author

Created as part of ML Zoomcamp 2025 Homework 4.