🚗 Car Fuel Efficiency Predictor (ML Zoomcamp W2)

✍️ Description

This project develops a Linear Regression model to predict car fuel efficiency in Miles Per Gallon (fuel_efficiency_mpg) using a subset of car features. The primary focus is on mastering the fundamental machine learning workflow: handling missing data, proper train/validation/test splitting, and understanding model stability and regularization.

Key Objectives:

Data Preprocessing: Analyze missing values and compare imputation strategies (mean vs. zero).
Model Training: Implement a Linear Regression model using scikit-learn.
Model Evaluation & Stability: Use RMSE to evaluate model performance and assess the impact of different random seeds on model stability.
Regularization: Test whether Ridge (L2) regularization reduces overfitting.
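
For reference, the RMSE used throughout can be computed with NumPy as in the minimal sketch below. The function name `rmse` and the sample values are illustrative assumptions, not taken from the notebook.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up MPG values, only to show the call.
example = rmse([15.0, 14.0, 16.0], [15.5, 13.5, 16.0])
print(round(example, 3))
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.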

⚙️ Installation

To run this notebook, you need a standard Python environment with the following dependencies.

Clone the repository:

git clone https://github.com/sks01dev/data-science-lab/
cd data-science-lab

Install dependencies:

pip install numpy pandas scikit-learn matplotlib seaborn

Download Data: The dataset is fetched directly within the notebook from the official source:

wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
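
Once the CSV is available, a first look at missing values and the median horsepower might resemble the sketch below. The helper names `load_cars` and `missing_value_report` are hypothetical, and the tiny in-memory CSV holds made-up rows so the example runs offline; passing no argument to `load_cars` would read from the URL above instead.

```python
import io

import pandas as pd

DATA_URL = (
    "https://raw.githubusercontent.com/alexeygrigorev/datasets/"
    "master/car_fuel_efficiency.csv"
)


def load_cars(source=DATA_URL):
    """Read the CSV from a URL, a local path, or a file-like object."""
    return pd.read_csv(source)


def missing_value_report(df):
    """Return the number of missing values per column, omitting complete columns."""
    counts = df.isnull().sum()
    return counts[counts > 0]


# Made-up rows standing in for the real file (assumption, for offline use).
demo_csv = io.StringIO(
    "engine_displacement,horsepower,vehicle_weight,model_year,fuel_efficiency_mpg\n"
    "170,150,3000,2005,14.5\n"
    "210,,3400,2011,13.1\n"
    "190,140,2800,2018,15.9\n"
)
demo = load_cars(demo_csv)
report = missing_value_report(demo)
median_hp = demo["horsepower"].median()  # NaN values are skipped by default
```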

🏃 Usage

Execute the week_2.ipynb notebook cell-by-cell to follow the full machine learning workflow.

Data Selection: Only the following columns are used: 'engine_displacement', 'horsepower', 'vehicle_weight', 'model_year', and 'fuel_efficiency_mpg'.
Splitting: The data is shuffled (seed 42) and split into Train (60%), Validation (20%), and Test (20%) sets.
Model Comparison: Linear Regression models are trained to compare performance after filling missing 'horsepower' values with 0 vs. the training set mean.
Final Evaluation: The best configuration is tested on the final, unseen test set.
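
The split and imputation comparison described above can be sketched roughly as follows. This is not the notebook's code: the data is a synthetic stand-in generated by a hypothetical `make_synthetic_cars` helper, the 60/20/20 split is assumed to be done with two `train_test_split` calls, and the helper names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

FEATURES = ["engine_displacement", "horsepower", "vehicle_weight", "model_year"]
TARGET = "fuel_efficiency_mpg"


def make_synthetic_cars(n=1000, seed=0):
    """Hypothetical stand-in for car_fuel_efficiency.csv (all values made up)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "engine_displacement": rng.uniform(100, 400, n),
        "horsepower": rng.uniform(60, 250, n),
        "vehicle_weight": rng.uniform(1500, 4500, n),
        "model_year": rng.integers(2000, 2024, n),
    })
    noise = rng.normal(0, 0.5, n)
    df[TARGET] = 40 - 0.005 * df["vehicle_weight"] - 0.02 * df["horsepower"] + noise
    df.loc[rng.random(n) < 0.07, "horsepower"] = np.nan  # inject missing values
    return df


def split_60_20_20(df, seed=42):
    """Shuffle and split into train/validation/test (60/20/20)."""
    full_train, test = train_test_split(df, test_size=0.2, random_state=seed)
    # 25% of the remaining 80% equals 20% of the full data.
    train, val = train_test_split(full_train, test_size=0.25, random_state=seed)
    return train, val, test


def validation_rmse(train, val, fill_value):
    """Fit on train with missing horsepower filled, then score on validation."""
    fill = {"horsepower": fill_value}
    model = LinearRegression()
    model.fit(train[FEATURES].fillna(fill), train[TARGET])
    pred = model.predict(val[FEATURES].fillna(fill))
    return float(np.sqrt(np.mean((val[TARGET].to_numpy() - pred) ** 2)))


df = make_synthetic_cars()
train, val, test = split_60_20_20(df)
scores = {
    "zero": validation_rmse(train, val, 0.0),
    # The mean comes from the training set only, to avoid leaking validation data.
    "mean": validation_rmse(train, val, train["horsepower"].mean()),
}
print({name: round(score, 2) for name, score in scores.items()})
```

The test set is left untouched here; it is reserved for the final evaluation step.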

🛠️ Technologies Used

| Technology | Purpose |
| --- | --- |
| Python | Core programming and analysis language |
| NumPy | Numerical operations and RMSE calculation |
| Pandas | Data loading, manipulation, and missing value handling |
| scikit-learn | Linear Regression, Ridge, and `train_test_split` |
| Seaborn/Matplotlib | Data visualization and distribution checks |

📊 Dataset Used

Car Fuel Efficiency Dataset: A collection of vehicle attributes (engine displacement, horsepower, weight, model year) used to predict fuel efficiency (fuel_efficiency_mpg). [Image of Fuel Efficiency Distribution]
Source: Alexey Grigorev's public datasets repository.

🧠 Key Learnings

Imputation Strategy: Filling missing 'horsepower' values with the training set mean gave a lower validation RMSE (0.46) than filling with zero (0.51). A mean fill keeps imputed rows near the center of the feature's distribution, whereas 0 is an extreme outlier for horsepower.
Model Stability: Repeating the split with 10 different random seeds produced very similar RMSE scores ($\text{std} \approx 0.006$), indicating that model performance does not depend strongly on which rows land in each split.
Regularization Impact: Ridge regularization (L2) had a negligible effect on the final RMSE scores, suggesting the Linear Regression model was not heavily overfitting the data.
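
A seed-sensitivity check of the kind described above could be sketched as follows. The seeds 0 through 9, the synthetic data, and the helper names are assumptions made for illustration; the source states only that 10 random seeds were used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


def val_rmse_for_seed(X, y, seed):
    """Re-split 60/20/20 with the given seed and return the validation RMSE."""
    X_full, _, y_full, _ = train_test_split(X, y, test_size=0.2, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_full, y_full, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    return rmse(y_val, model.predict(X_val))


# Synthetic stand-in data (made up); the real features come from the CSV.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=500)

seed_scores = [val_rmse_for_seed(X, y, seed) for seed in range(10)]
rmse_std = float(np.std(seed_scores))
print(round(rmse_std, 3))
```

A standard deviation that is small relative to the RMSE itself suggests the score is not an artifact of one lucky or unlucky split.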

✨ Results

| Question | Task | Result |
| --- | --- | --- |
| Q1 | Column with missing values | `horsepower` |
| Q2 | Median horsepower | 149.0 |
| Q3 | Best imputation (validation RMSE) | With mean (0.46) |
| Q4 | Best regularization $\text{r}$ (alpha) | 0 (RMSE: 0.51) |
| Q5 | Standard deviation of RMSEs | 0.006 |
| Q6 | Final test RMSE ($\text{r}=0.001$, seed 9) | $\approx 0.520$ (closest to 0.515) |
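
The regularization sweep and final test evaluation behind Q4 and Q6 might be structured as in this sketch. The candidate `alphas` grid is hypothetical, the data is synthetic, and refitting on train plus validation before scoring the test set is an assumed (though common) practice rather than something the source states.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


# Synthetic stand-in data (made up); the real features come from the CSV.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=500)

X_full, X_test, y_full, y_test = train_test_split(X, y, test_size=0.2, random_state=9)
X_train, X_val, y_train, y_val = train_test_split(
    X_full, y_full, test_size=0.25, random_state=9
)

alphas = [0.0, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical candidate grid
val_scores = {}
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_scores[alpha] = round(rmse(y_val, model.predict(X_val)), 2)

# min() returns the first minimum, so ties go to the smallest alpha.
best_alpha = min(alphas, key=lambda a: val_scores[a])

# Assumed final step: refit on train + validation, then score the held-out test set.
final_model = Ridge(alpha=best_alpha).fit(X_full, y_full)
test_rmse = rmse(y_test, final_model.predict(X_test))
print(best_alpha, round(test_rmse, 3))
```

Rounding the validation scores before comparing them means near-identical results resolve to the least regularized model.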

🚀 Future Work

Feature Engineering: Engineer new features (e.g., age of car from model_year) to potentially improve the model's predictive power.
Categorical Features: Incorporate the original categorical columns (origin, fuel_type, drivetrain) using one-hot encoding or other techniques.
Advanced Regression: Test other models like Elastic Net or Random Forest for improved accuracy.
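
As a starting point for the first two items, a car-age feature and one-hot encoding could be added with pandas along these lines. The reference year, the helper name `add_engineered_features`, and the sample category values are all assumptions; only the column names come from this README.

```python
import pandas as pd

REFERENCE_YEAR = 2024  # assumption: any fixed year at or after the newest model_year
CATEGORICAL = ["origin", "fuel_type", "drivetrain"]


def add_engineered_features(df, reference_year=REFERENCE_YEAR):
    """Add an age column and one-hot encode the categorical columns."""
    out = df.copy()
    out["age"] = reference_year - out["model_year"]
    # get_dummies replaces each listed column with one 0/1 column per category.
    return pd.get_dummies(out, columns=CATEGORICAL, dtype=int)


# Made-up rows; the category labels are illustrative, not taken from the dataset.
sample = pd.DataFrame({
    "model_year": [2010, 2020],
    "origin": ["Europe", "Asia"],
    "fuel_type": ["Gasoline", "Diesel"],
    "drivetrain": ["Front-wheel drive", "All-wheel drive"],
})
features = add_engineered_features(sample)
print(sorted(features.columns))
```

With a linear model, one dummy column per categorical feature is redundant given the intercept, so passing `drop_first=True` to `get_dummies` may be worth considering.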

📚 References

Alexey Grigorev - ML Zoomcamp Week 2 Materials