| # π Car Fuel Efficiency Predictor (ML Zoomcamp W2) |
|
|
| ## βοΈ Description |
|
|
| This project develops a **Linear Regression model** to predict car fuel efficiency in Miles Per Gallon (`fuel_efficiency_mpg`) using a subset of car features. The primary focus is on mastering the fundamental machine learning workflow: handling missing data, proper train/validation/test splitting, and understanding model stability and regularization. |
|
|
| ### Key Objectives: |
|
|
| * **Data Preprocessing:** Analyze missing values and compare imputation strategies (mean vs. zero). |
| * **Model Training:** Implement a Linear Regression model using `scikit-learn`. |
| * **Model Evaluation & Stability:** Use RMSE to evaluate model performance and assess the impact of different random seeds on model stability. |
| * **Regularization:** Test the effect of Ridge regularization (L2) to prevent overfitting. |
|
|
| --- |
|
|
| ## βοΈ Installation |
|
|
| To run this notebook, you need a standard Python environment with the following dependencies. |
|
|
| 1. **Clone the repository:** |
| ```bash |
| git clone [https://github.com/sks01dev/data-science-lab/](https://github.com/sks01dev/data-science-lab/) |
| cd data-science-lab |
| ``` |
| |
| 2. **Install dependencies:** |
| ```bash |
| pip install numpy pandas scikit-learn matplotlib seaborn |
| ``` |
| |
| 3. **Download Data:** The dataset is fetched directly within the notebook from the official source: |
| ```bash |
| wget [https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv) |
| ``` |
| |
| --- |
|
|
| ## π Usage |
|
|
| Execute the `week_2.ipynb` notebook cell-by-cell to follow the full machine learning workflow. |
|
|
| 1. **Data Selection:** Only the following columns are used: `'engine_displacement'`, `'horsepower'`, `'vehicle_weight'`, `'model_year'`, and `'fuel_efficiency_mpg'`. |
| 2. **Splitting:** The data is shuffled (seed 42) and split into Train (60%), Validation (20%), and Test (20%) sets. |
| 3. **Model Comparison:** Linear Regression models are trained to compare performance after filling missing `'horsepower'` values with 0 vs. the training set mean. |
| 4. **Final Evaluation:** The best configuration is tested on the final, unseen test set. |
|
|
| --- |
|
|
| ## π οΈ Technologies Used |
|
|
| | Technology | Purpose | Badge/Icon | |
| | :--- | :--- | :--- | |
| | **Python** | Core programming and analysis language | [](https://www.python.org/doc/) | |
| | **NumPy** | Numerical operations and RMSE calculation | [](https://numpy.org/doc/) | |
| | **Pandas** | Data loading, manipulation, and missing value handling | [](https://pandas.pydata.org/docs/) | |
| | **scikit-learn** | Linear Regression, Ridge, and `train_test_split` | [](https://scikit-learn.org/stable/documentation.html) | |
| | **Seaborn/Matplotlib** | Data visualization and distribution checks | [](https://seaborn.pydata.org/) | |
|
|
| --- |
|
|
| ## π Dataset Used |
|
|
| * **Car Fuel Efficiency Dataset:** A collection of vehicle attributes (displacement, horsepower, weight, model year) used to predict fuel consumption (`fuel_efficiency_mpg`). [Image of Fuel Efficiency Distribution] |
| * **Source:** Alexey Grigorev's public datasets repository. |
|
|
| --- |
|
|
| ## π§ Key Learnings |
|
|
| 1. **Imputation Strategy:** The **mean imputation** of missing `'horsepower'` values (RMSE: 0.46) significantly outperformed filling with zero (RMSE: 0.51). Using the mean preserves the feature's distribution better than using an extreme outlier (0). |
| 2. **Model Stability:** By testing the split sensitivity across 10 random seeds, the model was determined to be **stable** ($\text{std} \approx 0.006$), confirming the reliability of the chosen split ratio. |
| 3. **Regularization Impact:** Ridge regularization (L2) had a negligible effect on the final RMSE scores, suggesting the Linear Regression model was not heavily overfitting the data. |
|
|
| --- |
|
|
| ## β¨ Results |
|
|
| | Question | Task | Result | |
| | :--- | :--- | :--- | |
| | **Q1** | Column with missing values | `horsepower` | |
| | **Q2** | Median Horsepower | 149.0 | |
| | **Q3** | Best Imputation (Validation RMSE) | **With mean** (0.46) | |
| | **Q4** | Best Regularization $\text{r}$ (alpha) | **0** (RMSE: 0.51) | |
| | **Q5** | Standard Deviation of RMSEs | **0.006** | |
| | **Q6** | Final Test RMSE ($\text{r}=0.001$, seed 9) | $\approx 0.520$ (Closest to 0.515) | |
|
|
| --- |
|
|
| ## π Future Work |
|
|
| * **Feature Engineering:** Engineer new features (e.g., age of car from `model_year`) to potentially improve the model's predictive power. |
| * **Categorical Features:** Incorporate the original categorical columns (`origin`, `fuel_type`, `drivetrain`) using one-hot encoding or other techniques. |
| * **Advanced Regression:** Test other models like Elastic Net or Random Forest for improved accuracy. |
|
|
| --- |
|
|
| ## π References |
|
|
| * *Alexey Grigorev - ML Zoomcamp Week 2 Materials* |
|
|