Spaces:

sks01dev
/

Customer-Conversion-Prediction

Sleeping

App Files Files Community

Customer-Conversion-Prediction / Week 3 /readme.md

sks01dev

Create readme.md

4108ad2 6 months ago

preview code

raw

history blame

3.43 kB

	# Machine Learning Zoomcamp 2025 - Homework 3

	[![Python](https://img.shields.io/badge/Python-3.11-blue?logo=python&logoColor=white)](https://www.python.org/)
	[![Pandas](https://img.shields.io/badge/Pandas-1.5.3-orange?logo=pandas&logoColor=white)](https://pandas.pydata.org/)
	[![Scikit-Learn](https://img.shields.io/badge/Scikit--Learn-1.3.1-green?logo=scikit-learn&logoColor=white)](https://scikit-learn.org/stable/)
	[![Jupyter](https://img.shields.io/badge/Jupyter-Notebook-yellow?logo=jupyter&logoColor=white)](https://jupyter.org/)

	---

	## Homework 3: Machine Learning for Classification

	This repository contains solutions for Homework 3 of Machine Learning Zoomcamp 2025, focused on classification tasks using the Bank Marketing dataset.

	---

	## 📂 Project Overview

	- Dataset: [Bank Marketing Dataset](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv)
	- Target variable: `converted` (whether the client signed up)
	- Objective: Data preprocessing, exploratory analysis, feature selection, and training logistic regression models (regularized and unregularized).

	Tech Stack:
	- Python 3.11 – core programming language
	- Pandas – data manipulation
	- NumPy – numerical operations
	- Scikit-Learn – machine learning models, feature selection, evaluation
	- Jupyter Notebook – interactive coding and documentation

	---

	## 🔹 Questions & Answers

	\| Question \| Task \| Answer \|
	\|----------\|------\|--------\|
	\| 1 \| Mode of `industry` \| `retail` \|
	\| 2 \| Biggest correlation (numerical features) \| `annual_income` and `interaction_count` \|
	\| 3 \| Biggest mutual information (categorical features) \| `lead_source` \|
	\| 4 \| Logistic regression validation accuracy \| 0.74 \|
	\| 5 \| Least useful feature (feature elimination) \| `lead_score` \|
	\| 6 \| Best `C` value for regularized logistic regression \| 1 \|

	---

	## 📌 Approach / Key Steps

	1. Data Cleaning & Preparation
	- Filled missing values: categorical → `'NA'`, numerical → `0.0`
	- Verified feature types and correlations

	2. Exploratory Analysis
	- Mode of categorical variables
	- Correlation matrix for numerical features

	3. Feature Selection
	- Calculated mutual information for categorical variables using `mutual_info_score`
	- Identified least useful features via feature elimination

	4. Model Training
	- Logistic Regression with one-hot encoded categorical variables
	- Regularized logistic regression with hyperparameter tuning (`C` values)

	---

	## 📈 Results

	- Baseline logistic regression accuracy: 0.74
	- Least useful feature: `lead_score`
	- Best regularization parameter `C`: 1

	---

	## ⚙ How to Run

	1. Clone the repository:
	```bash
	git clone https://github.com/yourusername/ml-zoomcamp-hw3.git
	```

	2. Install requirements:
	```bash
	pip install -r requirements.txt
	```

	3. Open the Jupyter Notebook and run cells sequentially:
	```bash
	jupyter notebook
	```

	---

	## 📚 References

	- [Bank Marketing Dataset](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv)
	- [Scikit-Learn Documentation](https://scikit-learn.org/stable/)
	- [Pandas Documentation](https://pandas.pydata.org/)
	- [NumPy Documentation](https://numpy.org/)
	- [Jupyter Notebook Documentation](https://jupyter.org/)

	---