Car Fuel Efficiency Predictor (ML Zoomcamp W2)
Description
This project develops a Linear Regression model to predict car fuel efficiency in Miles Per Gallon (fuel_efficiency_mpg) using a subset of car features. The primary focus is on mastering the fundamental machine learning workflow: handling missing data, proper train/validation/test splitting, and understanding model stability and regularization.
Key Objectives:
- Data Preprocessing: Analyze missing values and compare imputation strategies (mean vs. zero).
- Model Training: Implement a Linear Regression model using scikit-learn.
- Model Evaluation & Stability: Use RMSE to evaluate model performance and assess the impact of different random seeds on model stability.
- Regularization: Test the effect of Ridge regularization (L2) to prevent overfitting.
Installation
To run this notebook, you need a standard Python environment with the following dependencies.
Clone the repository:

    git clone https://github.com/sks01dev/data-science-lab/
    cd data-science-lab

Install dependencies:

    pip install numpy pandas scikit-learn matplotlib seaborn

Download Data: The dataset is fetched directly within the notebook from the official source:

    wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
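Once the file is available, a first look at missing values might go as follows. This is a minimal sketch rather than the notebook's exact code: the helper name `summarize_missing` is hypothetical, and the small in-memory CSV contains made-up rows (same column names as the real file) so the snippet runs offline.

```python
import io

import pandas as pd

# Tiny stand-in for car_fuel_efficiency.csv: made-up rows, real column names.
SAMPLE_CSV = """engine_displacement,horsepower,vehicle_weight,model_year,fuel_efficiency_mpg
170,159.0,3413.4,2003,13.2
130,97.0,3149.7,2007,13.7
170,,3079.0,2018,14.2
220,,2542.4,2009,16.9
210,140.0,3460.9,2009,12.5
"""


def summarize_missing(df: pd.DataFrame) -> pd.Series:
    """Return missing-value counts for the columns that have any (hypothetical helper)."""
    counts = df.isnull().sum()
    return counts[counts > 0]


# With the real file this would be: df = pd.read_csv("car_fuel_efficiency.csv")
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

missing = summarize_missing(df)
median_hp = df["horsepower"].median()  # NaN values are skipped by default

print(missing)
print("median horsepower:", median_hp)
```

On the real dataset the same two calls would answer the first two questions in the Results table; the numbers printed here come from the made-up rows only.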
Usage
Execute the week_2.ipynb notebook cell-by-cell to follow the full machine learning workflow.
- Data Selection: Only the following columns are used: 'engine_displacement', 'horsepower', 'vehicle_weight', 'model_year', and 'fuel_efficiency_mpg'.
- Splitting: The data is shuffled (seed 42) and split into Train (60%), Validation (20%), and Test (20%) sets.
- Model Comparison: Linear Regression models are trained to compare performance after filling missing 'horsepower' values with 0 vs. the training set mean.
- Final Evaluation: The best configuration is tested on the final, unseen test set.
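The splitting and imputation steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's code: the data is synthetic (generated by the hypothetical `make_synthetic_cars`), the shuffle uses `np.random.default_rng`, which may differ from the notebook's shuffling call, and the helper names are invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["engine_displacement", "horsepower", "vehicle_weight", "model_year"]
TARGET = "fuel_efficiency_mpg"


def make_synthetic_cars(n=1000, seed=0):
    """Generate made-up rows with the same column names as the real dataset."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "engine_displacement": rng.normal(200, 50, n),
        "horsepower": rng.normal(150, 30, n),
        "vehicle_weight": rng.normal(3000, 500, n),
        "model_year": rng.integers(2000, 2024, n),
    })
    df[TARGET] = (
        30 - 0.005 * df["vehicle_weight"] - 0.01 * df["horsepower"]
        + rng.normal(0, 0.5, n)
    )
    # Blank out roughly 7% of horsepower values to mimic missing data.
    df.loc[rng.random(n) < 0.07, "horsepower"] = np.nan
    return df


def split_60_20_20(df, seed=42):
    """Shuffle row positions with a fixed seed, then slice 60/20/20."""
    n = len(df)
    n_val = int(n * 0.2)
    n_test = int(n * 0.2)
    n_train = n - n_val - n_test
    idx = np.random.default_rng(seed).permutation(n)
    train = df.iloc[idx[:n_train]]
    val = df.iloc[idx[n_train:n_train + n_val]]
    test = df.iloc[idx[n_train + n_val:]]
    return train, val, test


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


def validation_rmse(train, val, fill_value):
    """Fit on train and score on val after filling missing horsepower."""
    X_train = train[FEATURES].fillna({"horsepower": fill_value})
    X_val = val[FEATURES].fillna({"horsepower": fill_value})
    model = LinearRegression().fit(X_train, train[TARGET])
    return rmse(val[TARGET], model.predict(X_val))


df = make_synthetic_cars()
train, val, test = split_60_20_20(df, seed=42)

# The mean comes from the training split only, so no validation
# information leaks into the imputation.
hp_mean = train["horsepower"].mean()
results = {
    "zero": validation_rmse(train, val, 0.0),
    "mean": validation_rmse(train, val, hp_mean),
}
print({name: round(score, 2) for name, score in results.items()})
```

The test split is deliberately left untouched here; it would only be used once, after the imputation strategy has been chosen on the validation set.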
Technologies Used
Dataset Used
- Car Fuel Efficiency Dataset: A collection of vehicle attributes (displacement, horsepower, weight, model year) used to predict fuel efficiency (fuel_efficiency_mpg).
- Source: Alexey Grigorev's public datasets repository.
[Image of Fuel Efficiency Distribution]
Key Learnings
- Imputation Strategy: Filling missing 'horsepower' values with the training-set mean gave a lower validation RMSE (0.46) than filling with zero (0.51). The mean preserves the feature's distribution better than an extreme outlier (0).
- Model Stability: Testing split sensitivity across 10 random seeds showed the model to be stable ($\text{std} \approx 0.006$), indicating that the validation RMSE depends little on how the data happens to be shuffled.
- Regularization Impact: Ridge regularization (L2) had a negligible effect on the final RMSE scores, suggesting the Linear Regression model was not heavily overfitting the data.
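The stability and regularization checks described above could look roughly like this. It is a sketch on synthetic data, so the printed numbers are not the project's results. It assumes scikit-learn's `Ridge(alpha=...)` plays the role of the regularization strength `r`, and the grid of values is illustrative; the helper names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression problem standing in for the car data (made-up numbers).
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 4))
y = X @ np.array([0.5, -1.0, -2.0, 0.3]) + 15 + rng.normal(scale=0.5, size=n)


def split_indices(n, seed):
    """Return train/val/test index arrays for a seeded 60/20/20 shuffle."""
    idx = np.random.default_rng(seed).permutation(n)
    n_val = n_test = int(n * 0.2)
    n_train = n - n_val - n_test
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Split sensitivity: same model, ten different shuffles.
scores = []
for seed in range(10):
    tr, va, _ = split_indices(n, seed)
    model = LinearRegression().fit(X[tr], y[tr])
    scores.append(rmse(y[va], model.predict(X[va])))
print("std of validation RMSE:", round(float(np.std(scores)), 3))

# Regularization sweep on one fixed split; alpha=0 falls back to plain
# LinearRegression, since Ridge with alpha=0 is discouraged by scikit-learn.
tr, va, _ = split_indices(n, 42)
ridge_scores = {}
for alpha in [0, 0.01, 0.1, 1, 5, 10, 100]:
    model = LinearRegression() if alpha == 0 else Ridge(alpha=alpha)
    model.fit(X[tr], y[tr])
    ridge_scores[alpha] = rmse(y[va], model.predict(X[va]))
print({alpha: round(score, 3) for alpha, score in ridge_scores.items()})
```

A small spread across seeds and nearly flat scores across `alpha` are the patterns the two bullets above describe.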
Results
| Question | Task | Result |
|---|---|---|
| Q1 | Column with missing values | horsepower |
| Q2 | Median Horsepower | 149.0 |
| Q3 | Best Imputation (Validation RMSE) | With mean (0.46) |
| Q4 | Best Regularization $\text{r}$ (alpha) | 0 (RMSE: 0.51) |
| Q5 | Standard Deviation of RMSEs | 0.006 |
| Q6 | Final Test RMSE ($\text{r}=0.001$, seed 9) | $\approx 0.520$ (Closest to 0.515) |
Future Work
- Feature Engineering: Engineer new features (e.g., age of car from model_year) to potentially improve the model's predictive power.
- Categorical Features: Incorporate the original categorical columns (origin, fuel_type, drivetrain) using one-hot encoding or other techniques.
- Advanced Regression: Test other models like Elastic Net or Random Forest for improved accuracy.
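As a rough idea of what the first two items might involve, the sketch below derives a car-age feature and one-hot encodes the categorical columns with `pandas.get_dummies`. The rows and category values are made up, the reference year is an arbitrary assumption, and `add_engineered_features` is a hypothetical helper.

```python
import pandas as pd

# Assumption: any fixed reference year works, it only shifts the feature.
REFERENCE_YEAR = 2025

# Made-up rows; the category values are illustrative, not taken from the dataset.
cars = pd.DataFrame({
    "model_year": [2003, 2018, 2009],
    "origin": ["Europe", "USA", "Asia"],
    "fuel_type": ["Gasoline", "Diesel", "Gasoline"],
    "drivetrain": ["All-wheel drive", "Front-wheel drive", "All-wheel drive"],
})


def add_engineered_features(df, reference_year=REFERENCE_YEAR):
    """Add car_age and replace categorical columns with 0/1 indicator columns."""
    out = df.copy()
    out["car_age"] = reference_year - out["model_year"]
    out = pd.get_dummies(
        out, columns=["origin", "fuel_type", "drivetrain"], dtype=int
    )
    return out


features = add_engineered_features(cars)
print(features.columns.tolist())
```

When used with a train/validation/test split, the set of indicator columns would need to be fixed from the training data so that all three splits share the same feature layout.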
References
- Alexey Grigorev - ML Zoomcamp Week 2 Materials