# Machine Learning Zoomcamp 2025 - Homework 3

---

## Homework 3: Machine Learning for Classification

This repository contains solutions for **Homework 3** of **Machine Learning Zoomcamp 2025**, focused on **classification tasks** using the course lead scoring dataset (`course_lead_scoring.csv`).

---

## Project Overview

- **Dataset:** [Course Lead Scoring Dataset](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv)
- **Target variable:** `converted` (whether the client signed up)
- **Objective:** Preprocess the data, explore it, select features, and train logistic regression models (regularized and unregularized).

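
As a starting point, the dataset can be read straight from the URL above with pandas. The snippet below is only a sketch: the helper name `load_leads` is hypothetical, and the three-row CSV is made-up data that lets the example run offline.

```python
import io

import pandas as pd

# URL taken from the Dataset link in this README.
DATA_URL = (
    "https://raw.githubusercontent.com/alexeygrigorev/datasets/"
    "master/course_lead_scoring.csv"
)


def load_leads(source=DATA_URL, target="converted"):
    """Read the CSV and return (DataFrame, share of each target class)."""
    df = pd.read_csv(source)
    balance = df[target].value_counts(normalize=True).to_dict()
    return df, balance


# Offline demo on a made-up three-row CSV (illustrative values only).
# Call load_leads() with no arguments to fetch the real file instead.
sample_csv = io.StringIO(
    "lead_source,annual_income,converted\n"
    "events,50000,1\n"
    "paid_ads,,0\n"
    "events,62000,1\n"
)
sample_df, sample_balance = load_leads(sample_csv)
print(sample_df.shape)
print(sample_balance)
```
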
**Tech Stack:**
- **Python 3.11** - core programming language
- **Pandas** - data manipulation
- **NumPy** - numerical operations
- **Scikit-Learn** - machine learning models, feature selection, evaluation
- **Jupyter Notebook** - interactive coding and documentation

---

## Questions & Answers

| Question | Task | Answer |
|----------|------|--------|
| 1 | Mode of `industry` | `retail` |
| 2 | Biggest correlation (numerical features) | `annual_income` and `interaction_count` |
| 3 | Biggest mutual information (categorical features) | `lead_source` |
| 4 | Logistic regression validation accuracy | 0.74 |
| 5 | Least useful feature (feature elimination) | `lead_score` |
| 6 | Best `C` value for regularized logistic regression | 1 |

---

## Approach / Key Steps

1. **Data Cleaning & Preparation**
   - Filled missing values: categorical → `'NA'`, numerical → `0.0`
   - Verified feature types and correlations

2. **Exploratory Analysis**
   - Mode of categorical variables
   - Correlation matrix for numerical features

3. **Feature Selection**
   - Calculated mutual information for categorical variables using `mutual_info_score`
   - Identified least useful features via feature elimination

4. **Model Training**
   - Logistic Regression with one-hot encoded categorical variables
   - Regularized logistic regression with hyperparameter tuning (`C` values)

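
The preparation, exploration, and mutual information steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names `fill_missing` and `explore` are hypothetical, the toy rows are invented (only the column names come from this README), and the train/validation split is left out for brevity.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.metrics import mutual_info_score

TARGET = "converted"


def fill_missing(df):
    """Return a copy with 'NA' in categorical gaps and 0.0 in numerical gaps."""
    df = df.copy()
    for col in df.columns.drop(TARGET):
        fill = 0.0 if is_numeric_dtype(df[col]) else "NA"
        df[col] = df[col].fillna(fill)
    return df


def explore(df):
    """Return categorical modes, the numerical correlation matrix, and MI scores."""
    features = df.columns.drop(TARGET)
    numerical = [c for c in features if is_numeric_dtype(df[c])]
    categorical = [c for c in features if c not in numerical]
    modes = {c: df[c].mode()[0] for c in categorical}
    corr = df[numerical].corr()
    # Mutual information between each categorical feature and the target.
    mi = {c: mutual_info_score(df[TARGET], df[c]) for c in categorical}
    return modes, corr, mi


# Made-up toy rows, for illustration only.
toy = pd.DataFrame({
    "industry": ["retail", "retail", None, "finance"],
    "lead_source": ["events", "paid_ads", "events", "paid_ads"],
    "annual_income": [50000.0, None, 62000.0, 48000.0],
    "interaction_count": [3, 1, 4, 2],
    "converted": [1, 0, 1, 0],
})

clean = fill_missing(toy)
modes, corr, mi = explore(clean)
print(modes["industry"])
print(corr.round(2))
print(max(mi, key=mi.get))
```
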
---

## Results

- Baseline logistic regression accuracy: **0.74**
- Least useful feature: **`lead_score`**
- Best regularization parameter `C`: **1**

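
One way figures like these could be computed is sketched below: one-hot encode with `DictVectorizer`, fit `LogisticRegression`, score on a validation set, then repeat with one feature dropped or with a different `C`. The helper names, the `C` grid, the toy frames, and the use of scikit-learn's default solver are all assumptions made for illustration. The notebook's actual settings may differ, and the toy data will not reproduce the numbers above.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def validation_accuracy(df_train, df_val, features, target="converted", C=1.0):
    """Fit on df_train with the given features and return accuracy on df_val."""
    dv = DictVectorizer(sparse=False)  # one-hot encodes string-valued fields
    X_train = dv.fit_transform(df_train[features].to_dict(orient="records"))
    X_val = dv.transform(df_val[features].to_dict(orient="records"))
    model = LogisticRegression(C=C, max_iter=1000)  # default solver (assumption)
    model.fit(X_train, df_train[target])
    return accuracy_score(df_val[target], model.predict(X_val))


def elimination_diffs(df_train, df_val, features, target="converted"):
    """Accuracy change when each feature is dropped (needs >= 2 features)."""
    baseline = validation_accuracy(df_train, df_val, features, target)
    diffs = {}
    for dropped in features:
        reduced = [f for f in features if f != dropped]
        diffs[dropped] = baseline - validation_accuracy(df_train, df_val, reduced, target)
    return diffs


def tune_c(df_train, df_val, features, grid=(0.01, 0.1, 1, 10, 100), target="converted"):
    """Return (best C, scores per C); the grid is an assumed example."""
    scores = {
        c: round(validation_accuracy(df_train, df_val, features, target, C=c), 3)
        for c in grid
    }
    # max() returns the first maximum, so ties go to the earliest grid value.
    return max(scores, key=scores.get), scores


# Made-up toy frames, for illustration only.
train = pd.DataFrame({
    "lead_source": ["events", "paid_ads"] * 6,
    "lead_score": [0.9, 0.2, 0.8, 0.1, 0.7, 0.3, 0.95, 0.15, 0.85, 0.25, 0.75, 0.05],
    "converted": [1, 0] * 6,
})
val = pd.DataFrame({
    "lead_source": ["events", "paid_ads", "events", "paid_ads"],
    "lead_score": [0.6, 0.4, 0.9, 0.1],
    "converted": [1, 0, 1, 0],
})
features = ["lead_source", "lead_score"]

acc = validation_accuracy(train, val, features)
diffs = elimination_diffs(train, val, features)
# "Least useful" is read here as the smallest absolute change in accuracy.
least_useful = min(diffs, key=lambda f: abs(diffs[f]))
best_c, scores = tune_c(train, val, features)
print(acc, least_useful, best_c)
```
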
---

## How to Run

1. Clone the repository (replace `yourusername` with the repository owner) and enter it:
   ```bash
   git clone https://github.com/yourusername/ml-zoomcamp-hw3.git
   cd ml-zoomcamp-hw3
   ```

2. Install requirements:
   ```bash
   pip install -r requirements.txt
   ```

3. Start Jupyter, then open the notebook and run the cells in order:
   ```bash
   jupyter notebook
   ```

---

## References

- [Course Lead Scoring Dataset](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv)
- [Scikit-Learn Documentation](https://scikit-learn.org/stable/)
- [Pandas Documentation](https://pandas.pydata.org/)
- [NumPy Documentation](https://numpy.org/)
- [Jupyter Notebook Documentation](https://jupyter.org/)

---
