🚗 Car Fuel Efficiency Predictor (ML Zoomcamp W2)

✍️ Description

This project develops a Linear Regression model to predict car fuel efficiency in Miles Per Gallon (fuel_efficiency_mpg) using a subset of car features. The primary focus is on mastering the fundamental machine learning workflow: handling missing data, proper train/validation/test splitting, and understanding model stability and regularization.

Key Objectives:

  • Data Preprocessing: Analyze missing values and compare imputation strategies (mean vs. zero).
  • Model Training: Implement a Linear Regression model using scikit-learn.
  • Model Evaluation & Stability: Use RMSE to evaluate model performance and assess the impact of different random seeds on model stability.
  • Regularization: Test the effect of Ridge regularization (L2) to prevent overfitting.

⚙️ Installation

To run this notebook, you need a standard Python environment with the following dependencies.

  1. Clone the repository:

    git clone https://github.com/sks01dev/data-science-lab/
    cd data-science-lab
    
  2. Install dependencies:

    pip install numpy pandas scikit-learn matplotlib seaborn
    
  3. Download Data: The dataset is fetched directly within the notebook from the official source:

    wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
    

🏃 Usage

Execute the week_2.ipynb notebook cell-by-cell to follow the full machine learning workflow.

  1. Data Selection: Only the following columns are used: 'engine_displacement', 'horsepower', 'vehicle_weight', 'model_year', and 'fuel_efficiency_mpg'.
  2. Splitting: The data is shuffled (seed 42) and split into Train (60%), Validation (20%), and Test (20%) sets.
  3. Model Comparison: Linear Regression models are trained to compare performance after filling missing 'horsepower' values with 0 vs. the training set mean.
  4. Final Evaluation: The best configuration is tested on the final, unseen test set.
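
The split and imputation comparison in steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the helper names (`split_60_20_20`, `validation_rmse`) are hypothetical, the data is synthetic stand-in data rather than the real CSV, and the shuffle uses NumPy's `default_rng`, which may differ from the notebook's shuffling method.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

TARGET = "fuel_efficiency_mpg"


def split_60_20_20(df, seed=42):
    """Shuffle row positions with a seeded RNG, then cut 60/20/20."""
    n = len(df)
    n_val = int(n * 0.2)
    n_test = int(n * 0.2)
    n_train = n - n_val - n_test
    idx = np.random.default_rng(seed).permutation(n)
    train = df.iloc[idx[:n_train]]
    val = df.iloc[idx[n_train:n_train + n_val]]
    test = df.iloc[idx[n_train + n_val:]]
    return train, val, test


def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


def validation_rmse(train, val, fill_value):
    """Fill missing horsepower with fill_value, fit on train, score on val."""
    features = [c for c in train.columns if c != TARGET]
    X_train = train[features].fillna({"horsepower": fill_value})
    X_val = val[features].fillna({"horsepower": fill_value})
    model = LinearRegression().fit(X_train, train[TARGET])
    return rmse(val[TARGET], model.predict(X_val))


# Synthetic stand-in data (NOT the real dataset), with some horsepower values missing.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "engine_displacement": rng.uniform(100, 400, n),
    "horsepower": rng.uniform(60, 250, n),
    "vehicle_weight": rng.uniform(1500, 4500, n),
    "model_year": rng.integers(2000, 2024, n).astype(float),
})
df[TARGET] = (30 - 0.004 * df["vehicle_weight"]
              - 0.01 * df["horsepower"] + rng.normal(0, 0.5, n))
df.loc[rng.choice(n, 40, replace=False), "horsepower"] = np.nan

train, val, test = split_60_20_20(df, seed=42)
rmse_zero = validation_rmse(train, val, 0)
# The mean is computed on the training split only, to avoid leaking validation data.
rmse_mean = validation_rmse(train, val, train["horsepower"].mean())
print(f"fill with 0: {rmse_zero:.3f}, fill with mean: {rmse_mean:.3f}")
```

Computing the fill value from the training split alone keeps the validation score an honest estimate; the numbers printed here come from the synthetic data and are not comparable to the results reported below.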

🛠️ Technologies Used

| Technology | Purpose |
|---|---|
| Python | Core programming and analysis language |
| NumPy | Numerical operations and RMSE calculation |
| Pandas | Data loading, manipulation, and missing value handling |
| scikit-learn | Linear Regression, Ridge, and train_test_split |
| Seaborn/Matplotlib | Data visualization and distribution checks |

📊 Dataset Used

  • Car Fuel Efficiency Dataset: A collection of vehicle attributes (engine displacement, horsepower, weight, model year) used to predict fuel efficiency (fuel_efficiency_mpg). [Image of Fuel Efficiency Distribution]
  • Source: Alexey Grigorev's public datasets repository.

🧠 Key Learnings

  1. Imputation Strategy: Filling missing 'horsepower' values with the training-set mean (validation RMSE: 0.46) outperformed filling with zero (validation RMSE: 0.51). The mean keeps imputed rows close to the feature's typical values, whereas 0 acts as an extreme outlier.
  2. Model Stability: Validation RMSE was compared across 10 random seeds for the shuffle. The scores varied little ($\text{std} \approx 0.006$), indicating that performance is not sensitive to which rows land in each split.
  3. Regularization Impact: Ridge regularization (L2) had a negligible effect on the final RMSE scores, suggesting the Linear Regression model was not heavily overfitting the data.
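
The stability and regularization checks in points 2 and 3 might look something like the sketch below. It is an illustration under assumptions: the data is synthetic, the helper names (`split_indices`, `val_rmse`) are hypothetical, the seed range and alpha grid are placeholders rather than the notebook's values, and plain `LinearRegression` stands in for the unregularized (alpha = 0) case.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def split_indices(n, seed):
    """Index arrays for a seeded 60/20/20 shuffle split."""
    idx = np.random.default_rng(seed).permutation(n)
    n_val = n_test = int(n * 0.2)
    n_train = n - n_val - n_test
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def val_rmse(model, X, y, seed):
    """Fit on the train split for this seed and score on its validation split."""
    tr, va, _ = split_indices(len(y), seed)
    model.fit(X[tr], y[tr])
    return rmse(y[va], model.predict(X[va]))


# Synthetic stand-in data (NOT the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + 15.0 + rng.normal(0, 0.5, 600)

# Stability: how much does validation RMSE move when only the shuffle seed changes?
seed_scores = [val_rmse(LinearRegression(), X, y, seed) for seed in range(10)]
seed_std = float(np.std(seed_scores))
print(f"std of RMSE across seeds: {seed_std:.3f}")

# Regularization: compare Ridge at several strengths against the unregularized baseline.
alphas = [0.01, 0.1, 1, 10, 100]  # placeholder grid
baseline = val_rmse(LinearRegression(), X, y, seed=42)
ridge_scores = {a: val_rmse(Ridge(alpha=a), X, y, seed=42) for a in alphas}
for a, score in ridge_scores.items():
    print(f"alpha={a}: RMSE={score:.4f} (baseline {baseline:.4f})")
```

A small standard deviation across seeds and Ridge scores that track the baseline closely at small alpha are the patterns the two learnings above describe.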

✨ Results

| Question | Task | Result |
|---|---|---|
| Q1 | Column with missing values | horsepower |
| Q2 | Median Horsepower | 149.0 |
| Q3 | Best Imputation (Validation RMSE) | With mean (0.46) |
| Q4 | Best Regularization $\text{r}$ (alpha) | 0 (RMSE: 0.51) |
| Q5 | Standard Deviation of RMSEs | 0.006 |
| Q6 | Final Test RMSE ($\text{r}=0.001$, seed 9) | $\approx 0.520$ (Closest to 0.515) |
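
For the final evaluation (Q6), one common approach is to refit on the combined train and validation rows and score once on the held-out test rows. The sketch below assumes that approach, maps $\text{r}$ onto scikit-learn's `alpha`, and uses synthetic data; `final_test_rmse` is a hypothetical helper, and the resulting number is unrelated to the value in the table.

```python
import numpy as np
from sklearn.linear_model import Ridge


def final_test_rmse(X, y, seed=9, alpha=0.001):
    """Fit Ridge on the first 80% of a seeded shuffle and score on the last 20%."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    n_test = int(n * 0.2)
    full_train, test = idx[:n - n_test], idx[n - n_test:]
    model = Ridge(alpha=alpha).fit(X[full_train], y[full_train])
    pred = model.predict(X[test])
    return float(np.sqrt(np.mean((y[test] - pred) ** 2)))


# Synthetic stand-in data (NOT the real dataset).
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + 15.0 + rng.normal(0, 0.5, 600)

score = final_test_rmse(X, y, seed=9, alpha=0.001)
print(f"test RMSE: {score:.3f}")
```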

🚀 Future Work

  • Feature Engineering: Engineer new features (e.g., age of car from model_year) to potentially improve the model's predictive power.
  • Categorical Features: Incorporate the original categorical columns (origin, fuel_type, drivetrain) using one-hot encoding or other techniques.
  • Advanced Regression: Test other models like Elastic Net or Random Forest for improved accuracy.
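
As a rough starting point for the first two items, the sketch below derives a car-age feature and one-hot encodes the categorical columns. The reference year, the helper name `add_features`, and the sample category values are all assumptions for illustration; the actual values in the dataset may differ.

```python
import pandas as pd

REFERENCE_YEAR = 2025  # assumption: any fixed year at or after the newest model_year works

CATEGORICAL = ["origin", "fuel_type", "drivetrain"]


def add_features(df, reference_year=REFERENCE_YEAR):
    """Add a car_age column and one-hot encode the categorical columns."""
    out = df.copy()
    out["car_age"] = reference_year - out["model_year"]
    # get_dummies replaces each listed column with one 0/1 column per category.
    return pd.get_dummies(out, columns=CATEGORICAL, dtype=int)


# Hypothetical sample rows; the category labels are made up.
sample = pd.DataFrame({
    "model_year": [2003, 2015, 2020],
    "origin": ["Europe", "USA", "Asia"],
    "fuel_type": ["Gasoline", "Diesel", "Gasoline"],
    "drivetrain": ["Front-wheel drive", "All-wheel drive", "Front-wheel drive"],
})
engineered = add_features(sample)
print(engineered.columns.tolist())
```

When applied to separate train and validation frames, the encoded columns should be aligned (for example, by fitting an encoder on the training set only) so that both frames share the same feature set.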

📚 References

  • Alexey Grigorev - ML Zoomcamp Week 2 Materials