# Machine Learning Zoomcamp 2025 - Homework 3
---
## Homework 3: Machine Learning for Classification
This repository contains solutions for **Homework 3** of **Machine Learning Zoomcamp 2025**, focused on **classification tasks** using the course lead scoring dataset (`course_lead_scoring.csv`).
---
## Project Overview
- **Dataset:** [Course Lead Scoring Dataset](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv)
- **Target variable:** `converted` (whether the client signed up)
- **Objective:** Data preprocessing, exploratory analysis, feature selection, and training logistic regression models (regularized and unregularized).
**Tech Stack:**
- **Python 3.11**: core programming language
- **Pandas**: data manipulation
- **NumPy**: numerical operations
- **Scikit-Learn**: machine learning models, feature selection, evaluation
- **Jupyter Notebook**: interactive coding and documentation
---
## Questions & Answers
| Question | Task | Answer |
|----------|------|--------|
| 1 | Mode of `industry` | `retail` |
| 2 | Biggest correlation (numerical features) | `annual_income` and `interaction_count` |
| 3 | Biggest mutual information (categorical features) | `lead_source` |
| 4 | Logistic regression validation accuracy | 0.74 |
| 5 | Least useful feature (feature elimination) | `lead_score` |
| 6 | Best `C` value for regularized logistic regression | 1 |
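
The first two rows of the table can be reproduced along these lines. This is a minimal sketch, not the notebook's actual code: the rows below are invented toy values, `lead_score` is the only numerical column name taken from elsewhere in this README, and comparing absolute correlation values is an assumption.

```python
import pandas as pd

# Toy stand-in for the dataset; all values are invented for illustration.
df = pd.DataFrame({
    "industry": ["retail", "finance", "retail", None, "education", "retail"],
    "annual_income": [50000.0, 62000.0, None, 48000.0, 75000.0, 58000.0],
    "interaction_count": [2, 4, 3, 1, 6, 3],
    "lead_score": [0.4, 0.1, 0.9, 0.5, 0.3, 0.7],
})

# Question 1 style: most frequent value of a categorical column (NaN is ignored).
industry_mode = df["industry"].mode()[0]

# Question 2 style: Pearson correlation matrix of the numerical columns,
# with missing values filled with 0.0 as described in the approach.
numerical = ["annual_income", "interaction_count", "lead_score"]
corr = df[numerical].fillna(0.0).corr()

# Pick the pair of distinct columns with the largest absolute correlation
# (using the absolute value is an assumption, not stated in this README).
pairs = [(a, b) for i, a in enumerate(numerical) for b in numerical[i + 1:]]
best_pair = max(pairs, key=lambda p: abs(corr.loc[p[0], p[1]]))

print(industry_mode)
print(best_pair)
```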
---
## Approach / Key Steps
1. **Data Cleaning & Preparation**
- Filled missing values: categorical → `'NA'`, numerical → `0.0`
- Verified feature types and correlations
2. **Exploratory Analysis**
- Mode of categorical variables
- Correlation matrix for numerical features
3. **Feature Selection**
- Calculated mutual information for categorical variables using `mutual_info_score`
- Identified least useful features via feature elimination
4. **Model Training**
- Logistic Regression with one-hot encoded categorical variables
- Regularized logistic regression with hyperparameter tuning (`C` values)
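
The steps above can be sketched end to end as follows. This is an illustrative outline under stated assumptions rather than the notebook itself: the rows are invented toy data, the 75/25 split and `random_state=42` are arbitrary choices, and `DictVectorizer` is one common way to one-hot encode (this README does not say which encoder was used).

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mutual_info_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the dataset; all values are invented for illustration.
n = 10
df = pd.DataFrame({
    "lead_source": ["referral", "paid_ads", None, "events"] * n,
    "industry": ["retail", "finance", "retail", None] * n,
    "annual_income": [52000.0, None, 61000.0, 47000.0] * n,
    "interaction_count": [5, 1, 4, 0] * n,
    "converted": [1, 0, 1, 0] * n,
})
categorical = ["lead_source", "industry"]
numerical = ["annual_income", "interaction_count"]

# Step 1: fill missing values ('NA' for categorical, 0.0 for numerical).
df[categorical] = df[categorical].fillna("NA")
df[numerical] = df[numerical].fillna(0.0)

# Hold out a validation set (split proportions are an assumption).
df_train, df_val = train_test_split(df, test_size=0.25, random_state=42)
y_train = df_train["converted"].to_numpy()
y_val = df_val["converted"].to_numpy()

# Step 3: mutual information between each categorical feature and the target.
mi = {c: mutual_info_score(df_train[c], y_train) for c in categorical}

# Step 4: one-hot encode via DictVectorizer, then fit logistic regression.
features = categorical + numerical
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train[features].to_dict(orient="records"))
X_val = dv.transform(df_val[features].to_dict(orient="records"))

model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_val, model.predict(X_val))

print(mi)
print(round(accuracy, 2))
```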
---
## Results
- Baseline logistic regression accuracy: **0.74**
- Least useful feature: **`lead_score`**
- Best regularization parameter `C`: **1**
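
For context, feature elimination and the search over `C` might look roughly like the sketch below. Several parts are assumptions, not taken from this README: the toy rows, the helper name `validation_accuracy`, the candidate grid `[0.01, 0.1, 1, 10, 100]`, the rule "least useful = smallest absolute change in accuracy when dropped", and breaking ties in favor of the smallest `C`.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the already-cleaned dataset; values are invented.
n = 10
df = pd.DataFrame({
    "lead_source": ["referral", "paid_ads", "NA", "events"] * n,
    "interaction_count": [5, 1, 4, 0] * n,
    "lead_score": [0.3, 0.8, 0.5, 0.2] * n,
    "converted": [1, 0, 1, 0] * n,
})
features = ["lead_source", "interaction_count", "lead_score"]
df_train, df_val = train_test_split(df, test_size=0.25, random_state=42)


def validation_accuracy(feature_names, C=1.0):
    """Hypothetical helper: fit on the training rows, score on validation."""
    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(df_train[feature_names].to_dict(orient="records"))
    X_val = dv.transform(df_val[feature_names].to_dict(orient="records"))
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, df_train["converted"].to_numpy())
    return accuracy_score(df_val["converted"].to_numpy(), model.predict(X_val))


# Feature elimination: retrain without each feature and record the change.
baseline = validation_accuracy(features)
diffs = {}
for name in features:
    remaining = [f for f in features if f != name]
    diffs[name] = baseline - validation_accuracy(remaining)

# Assumed rule: the least useful feature changes accuracy the least.
least_useful = min(diffs, key=lambda f: abs(diffs[f]))

# Regularization sweep over an assumed grid of C values.
c_grid = [0.01, 0.1, 1, 10, 100]
c_scores = {C: round(validation_accuracy(features, C), 3) for C in c_grid}
# max() keeps the first maximum, so iterating in ascending order
# resolves ties in favor of the smallest C.
best_C = max(sorted(c_scores), key=c_scores.get)

print(diffs)
print(c_scores, best_C)
```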
---
## How to Run
1. Clone the repository:
```bash
git clone https://github.com/yourusername/ml-zoomcamp-hw3.git
```
2. Install requirements:
```bash
pip install -r requirements.txt
```
3. Open the Jupyter Notebook and run cells sequentially:
```bash
jupyter notebook
```
---
## References
- [Course Lead Scoring Dataset](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv)
- [Scikit-Learn Documentation](https://scikit-learn.org/stable/)
- [Pandas Documentation](https://pandas.pydata.org/)
- [NumPy Documentation](https://numpy.org/)
- [Jupyter Notebook Documentation](https://jupyter.org/)
---