# Machine Learning Zoomcamp 2025 - Homework 3

[![Python](https://img.shields.io/badge/Python-3.11-blue?logo=python&logoColor=white)](https://www.python.org/)
[![Pandas](https://img.shields.io/badge/Pandas-1.5.3-orange?logo=pandas&logoColor=white)](https://pandas.pydata.org/)
[![Scikit-Learn](https://img.shields.io/badge/Scikit--Learn-1.3.1-green?logo=scikit-learn&logoColor=white)](https://scikit-learn.org/stable/)
[![Jupyter](https://img.shields.io/badge/Jupyter-Notebook-yellow?logo=jupyter&logoColor=white)](https://jupyter.org/)

---

## Homework 3: Machine Learning for Classification

This repository contains solutions for **Homework 3** of **Machine Learning Zoomcamp 2025**, focused on **classification tasks** using the Bank Marketing dataset.

---

## 📂 Project Overview

- **Dataset:** [Bank Marketing Dataset](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv)  
- **Target variable:** `converted` (whether the client signed up)  
- **Objective:** Data preprocessing, exploratory analysis, feature selection, and training logistic regression models (regularized and unregularized).  
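
As a minimal sketch of the first step, the snippet below loads a CSV with pandas and separates categorical from numerical features. The helper name `split_columns` is hypothetical, and the inline CSV is a made-up sample so the snippet runs offline; to work with the real data, pass the dataset URL above to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Tiny made-up sample so the snippet runs offline (not the homework data).
# For the real data, pass the dataset URL from this README to pd.read_csv.
SAMPLE_CSV = """lead_source,industry,annual_income,interaction_count,lead_score,converted
paid_ads,retail,52000,3,0.41,1
referral,,61000,5,0.77,1
events,finance,,1,0.12,0
paid_ads,retail,48000,2,0.30,0
"""


def split_columns(df, target="converted"):
    """Return (categorical, numerical) feature names, excluding the target."""
    features = df.drop(columns=[target])
    numerical = features.select_dtypes(include="number").columns.tolist()
    # Everything that is not numeric is treated as categorical.
    categorical = [c for c in features.columns if c not in numerical]
    return categorical, numerical


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
categorical, numerical = split_columns(df)

print("categorical:", categorical)
print("numerical:", numerical)
print(df.isnull().sum())  # missing values per column
```

Deriving the column lists from dtypes rather than hard-coding them keeps the later steps in sync if columns are added or removed.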

**Tech Stack:**  
- **Python 3.11** – core programming language  
- **Pandas** – data manipulation  
- **NumPy** – numerical operations  
- **Scikit-Learn** – machine learning models, feature selection, evaluation  
- **Jupyter Notebook** – interactive coding and documentation  

---

## 🔹 Questions & Answers

| Question | Task | Answer |
|----------|------|--------|
| 1 | Mode of `industry` | `retail` |
| 2 | Biggest correlation (numerical features) | `annual_income` and `interaction_count` |
| 3 | Biggest mutual information (categorical features) | `lead_source` |
| 4 | Logistic regression validation accuracy | 0.74 |
| 5 | Least useful feature (feature elimination) | `lead_score` |
| 6 | Best `C` value for regularized logistic regression | 1 |
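
Questions 1 and 2 come down to two short pandas computations. The sketch below shows one way to do them; the rows are made-up toy values, so its output says nothing about the real answers in the table. The column names are taken from the table above.

```python
from itertools import combinations

import pandas as pd

# Made-up toy rows for illustration only (not the homework data).
df = pd.DataFrame({
    "industry": ["retail", "finance", "retail", "other", "retail"],
    "annual_income": [40000.0, 52000.0, 61000.0, 75000.0, 90000.0],
    "interaction_count": [1, 2, 2, 4, 5],
    "lead_score": [0.9, 0.2, 0.6, 0.1, 0.5],
})

# Question 1 style: the most frequent category in a column.
industry_mode = df["industry"].mode()[0]

# Question 2 style: the pair of numerical features with the largest
# absolute Pearson correlation.
numerical = ["annual_income", "interaction_count", "lead_score"]
corr = df[numerical].corr()
best_pair = max(
    combinations(numerical, 2),
    key=lambda pair: abs(corr.loc[pair[0], pair[1]]),
)

print(industry_mode)
print(best_pair)
```

Comparing absolute values matters here: a strong negative correlation is just as informative as a strong positive one.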

---

## 📌 Approach / Key Steps

1. **Data Cleaning & Preparation**  
   - Filled missing values: categorical → `'NA'`, numerical → `0.0`  
   - Verified feature types and correlations  

2. **Exploratory Analysis**  
   - Mode of categorical variables  
   - Correlation matrix for numerical features  

3. **Feature Selection**  
   - Calculated mutual information for categorical variables using `mutual_info_score`  
   - Identified least useful features via feature elimination  

4. **Model Training**  
   - Logistic Regression with one-hot encoded categorical variables  
   - Regularized logistic regression with hyperparameter tuning (`C` values)  
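
The steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the notebook's exact code: the data is randomly generated, the 80/20 train/validation split and `random_state` values are assumptions, and `solver="liblinear"` is one reasonable choice rather than something this README specifies.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mutual_info_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (random values, not the homework dataset).
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "lead_source": rng.choice(["paid_ads", "referral", "events"], size=n),
    "industry": rng.choice(["retail", "finance", "other"], size=n),
    "annual_income": rng.normal(60000, 15000, size=n),
    "interaction_count": rng.integers(0, 10, size=n),
})
df["converted"] = (
    df["interaction_count"] + rng.normal(0, 2, size=n) > 4.5
).astype(int)
# Knock out some values so there is something to fill.
df.loc[rng.random(n) < 0.1, "lead_source"] = np.nan
df.loc[rng.random(n) < 0.1, "annual_income"] = np.nan

categorical = ["lead_source", "industry"]
numerical = ["annual_income", "interaction_count"]
features = categorical + numerical

# Step 1: fill missing values ('NA' for categorical, 0.0 for numerical).
df[categorical] = df[categorical].fillna("NA")
df[numerical] = df[numerical].fillna(0.0)

# Assumed split: 80% train, 20% validation.
df_train, df_val = train_test_split(df, test_size=0.2, random_state=42)
y_train = df_train["converted"].to_numpy()
y_val = df_val["converted"].to_numpy()

# Step 3: mutual information between each categorical feature and the target.
mi = {col: mutual_info_score(df_train[col], y_train) for col in categorical}

# Step 4: one-hot encode via DictVectorizer, then fit logistic regression.
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train[features].to_dict(orient="records"))
X_val = dv.transform(df_val[features].to_dict(orient="records"))

model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)
val_accuracy = accuracy_score(y_val, model.predict(X_val))

print(mi)
print(round(val_accuracy, 2))
```

`DictVectorizer` one-hot encodes string values and passes numbers through unchanged, which is why the categorical and numerical columns can share one list of records. Fitting it on the training split only keeps validation categories from leaking into the encoding.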

---

## 📈 Results

- Baseline logistic regression validation accuracy: **0.74**  
- Least useful feature: **`lead_score`**  
- Best regularization parameter `C`: **1**  
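
For readers who want to see how results like these can be produced, here is a hedged sketch of feature elimination and a sweep over `C`. Several details are assumptions not stated in this README: the data is synthetic, the helper `validation_accuracy` is hypothetical, the grid `[0.01, 0.1, 1, 10, 100]` is an assumed set of candidates, "least useful" is taken to mean the feature whose removal changes accuracy the least, and ties between `C` values go to the smallest one.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (random values, not the homework dataset).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "industry": rng.choice(["retail", "finance", "other"], size=n),
    "lead_score": rng.random(n),
    "interaction_count": rng.integers(0, 10, size=n),
})
df["converted"] = (
    df["interaction_count"] + rng.normal(0, 2, size=n) > 4.5
).astype(int)

df_train, df_val = train_test_split(df, test_size=0.2, random_state=42)


def validation_accuracy(features, C=1.0):
    """Train on the given features and return accuracy on the validation split."""
    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(df_train[features].to_dict(orient="records"))
    X_val = dv.transform(df_val[features].to_dict(orient="records"))
    model = LogisticRegression(solver="liblinear", C=C, max_iter=1000, random_state=42)
    model.fit(X_train, df_train["converted"])
    return accuracy_score(df_val["converted"], model.predict(X_val))


all_features = ["industry", "lead_score", "interaction_count"]
baseline = validation_accuracy(all_features)

# Feature elimination: drop one feature at a time and compare to the baseline.
diffs = {
    f: baseline - validation_accuracy([c for c in all_features if c != f])
    for f in all_features
}
least_useful = min(diffs, key=lambda f: abs(diffs[f]))

# Sweep the inverse regularization strength C (assumed candidate grid).
c_scores = {
    C: round(validation_accuracy(all_features, C=C), 3)
    for C in [0.01, 0.1, 1, 10, 100]
}
# Highest accuracy wins; on ties, prefer the smallest C.
best_C = max(c_scores, key=lambda C: (c_scores[C], -C))

print(diffs)
print(c_scores, best_C)
```

In scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization.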

---

## ⚙ How to Run

1. Clone the repository:  
   ```bash
   git clone https://github.com/yourusername/ml-zoomcamp-hw3.git
   ```

2. Install requirements:  
   ```bash
   pip install -r requirements.txt
   ```

3. Open the Jupyter Notebook and run cells sequentially:  
   ```bash
   jupyter notebook
   ```

---

## 📚 References

- [Bank Marketing Dataset](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv)  
- [Scikit-Learn Documentation](https://scikit-learn.org/stable/)  
- [Pandas Documentation](https://pandas.pydata.org/)  
- [NumPy Documentation](https://numpy.org/)  
- [Jupyter Notebook Documentation](https://jupyter.org/)

---