Used Cars Price Prediction & Classification

Regression • Classification • Feature Engineering • Clustering

This project builds a complete machine learning workflow for predicting and classifying used car prices, using a large real-world dataset of Craigslist listings.
The winning regression and classification models are both available in this repository.


📌 Presentation Video

https://drive.google.com/file/d/1sYohB3E4Sdb8ewkVP3S7uZXy2G06tS3l/view?usp=sharing


📌 Dataset Overview

Source: Kaggle, Craigslist Cars & Trucks Data
Rows: 426,880 → 339,346 after cleaning
Features: numeric, categorical, geolocation
Targets:

  • Regression: price
  • Classification: 3 price tiers (Low / Mid / High)

📌 Exploratory Data Analysis (EDA)

Key Questions & Insights

🔹 What is the distribution of prices?

Price Histogram

Prices are heavily right-skewed, with most cars priced under $20,000.

🔹 Do newer cars cost more?

Year vs Price

Clear positive trend: newer cars command higher prices.

🔹 How does mileage affect price?

Odometer vs Price

Strong negative correlation: more mileage → lower value.

🔹 How does manufacturer affect price?

Manufacturer Price Boxplot

Significant variation across manufacturers (e.g., BMW > Toyota > Chevrolet).


📌 Feature Engineering

To improve model performance, multiple new features were created:

  • car_age = 2023 - year
  • mileage_per_year
  • manufacturer_popularity (frequency-based)
  • Engineered flags: is_electric, is_hybrid
  • Season extracted from posting date
  • One-hot encoding for all categorical variables
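The features listed above could be derived along the following lines. This is a minimal sketch, not the project's actual code: the column names (`year`, `odometer`, `manufacturer`, `fuel`, `posting_date`), the `mileage_per_year` formula, the season mapping, and the helper names `season_from_month` and `engineer_features` are all assumptions made for illustration.

```python
import pandas as pd

REFERENCE_YEAR = 2023  # matches car_age = 2023 - year


def season_from_month(month: int) -> str:
    """Map a month (1-12) to a season. The mapping is an assumption."""
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "autumn"


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that adds the engineered columns and one-hot encodes."""
    out = df.copy()
    out["car_age"] = REFERENCE_YEAR - out["year"]
    # Assumed formula: odometer / age, with age floored at 1 to avoid dividing by zero.
    out["mileage_per_year"] = out["odometer"] / out["car_age"].clip(lower=1)
    # Frequency-based popularity: share of listings per manufacturer.
    shares = out["manufacturer"].value_counts(normalize=True)
    out["manufacturer_popularity"] = out["manufacturer"].map(shares)
    out["is_electric"] = (out["fuel"] == "electric").astype(int)
    out["is_hybrid"] = (out["fuel"] == "hybrid").astype(int)
    months = pd.to_datetime(out["posting_date"]).dt.month
    out["season"] = months.map(season_from_month)
    # One-hot encode the categorical columns.
    out = out.drop(columns=["posting_date"])
    return pd.get_dummies(out, columns=["manufacturer", "fuel", "season"])


# Tiny made-up sample, for illustration only.
sample = pd.DataFrame({
    "year": [2015, 2020, 2023],
    "odometer": [80000, 30000, 500],
    "manufacturer": ["toyota", "bmw", "toyota"],
    "fuel": ["gas", "hybrid", "electric"],
    "posting_date": ["2021-04-10", "2021-05-02", "2021-12-15"],
})
features = engineer_features(sample)
```

Flooring the age at 1 is one way to keep `mileage_per_year` finite for cars from the reference year; the project may have handled that case differently.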

Clustering

  • Applied K-Means (k=5) on numeric features
  • Added cluster_kmeans to capture structural patterns
  • Visualized using PCA (2D)

The cluster_kmeans feature contributed strongly to the performance of the non-linear models.
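A minimal sketch of the clustering step, assuming scikit-learn and synthetic data in place of the real numeric columns. The scaling step, `n_init`, and `random_state` values are assumptions; only k=5, the `cluster_kmeans` name, and the 2D PCA projection come from the description above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for numeric columns such as car_age and odometer.
X_numeric = rng.normal(size=(200, 3))

# Assumption: scale first, since K-Means is distance-based and
# large-valued columns would otherwise dominate the clustering.
X_scaled = StandardScaler().fit_transform(X_numeric)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
cluster_kmeans = kmeans.fit_predict(X_scaled)  # one label in 0..4 per row

# Append the cluster label as an extra feature column.
X_with_cluster = np.column_stack([X_numeric, cluster_kmeans])

# 2D projection for plotting: scatter coords[:, 0] against coords[:, 1],
# colored by cluster_kmeans.
coords = PCA(n_components=2).fit_transform(X_scaled)
```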


📌 Regression Modeling

Baseline Model: Linear Regression

Performance (no feature engineering):

  • MAE: 5775
  • RMSE: 8811
  • R²: 0.645
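For reference, the three metrics above follow their standard definitions, sketched here in plain Python. The helper name `regression_metrics` and the toy prices are illustrative only and are not taken from the project.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of prices."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


# Toy prices in dollars, purely illustrative.
y_true = [5000, 12000, 20000, 8000]
y_pred = [6000, 11000, 18000, 9000]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
```

RMSE penalizes large errors more heavily than MAE, which is why the baseline's RMSE (8811) sits well above its MAE (5775) on a right-skewed price distribution.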

Improved Models (After Feature Engineering)

| Model | R² | RMSE | Winner |
|---|---|---|---|
| Linear Regression + FE | ~0.70 | ↓ | |
| Gradient Boosting | ~0.78 | ↓ | |
| Random Forest Regressor | 0.783 | 6818 | 🏆 Yes |

Winner: Random Forest Regressor
It captured non-linear relationships and benefited the most from engineered features.

📥 Model file: random_forest_used_cars_model.pkl
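A sketch of training a Random Forest regressor and round-tripping it through pickle, which is what the .pkl extension suggests. The data here is synthetic, the hyperparameters are assumptions, and the README does not say whether the published file was written with `pickle` or `joblib`, nor which feature columns the model expects.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic stand-ins: the two columns play the role of car_age and odometer.
X = rng.uniform(low=[0, 0], high=[25, 250_000], size=(300, 2))
# Non-linear toy price that decays with age and mileage, plus noise.
y = (30_000 * np.exp(-0.1 * X[:, 0])
     - 0.03 * X[:, 1]
     + rng.normal(0, 500, size=300))

# n_estimators and random_state are illustrative choices.
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X, y)

# Serialize and restore in memory; writing to a file works the same way
# with pickle.dump / pickle.load.
restored = pickle.loads(pickle.dumps(model))

new_car = np.array([[5, 60_000]])  # 5 years old, 60,000 miles
predicted_price = restored.predict(new_car)[0]
```

Pickle files can execute arbitrary code when loaded, so only load model files from sources you trust.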


📌 Classification (Regression → Classes)

Price was converted into three balanced classes using quantile binning:

  • 0 = Low (bottom 33%)
  • 1 = Mid (middle 33%)
  • 2 = High (top 33%)
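Quantile binning of this kind can be sketched with `pandas.qcut`. The prices below are made up, and whether the project used `qcut` or another method is an assumption.

```python
import pandas as pd

# Made-up prices in dollars, for illustration only.
prices = pd.Series([1500, 4000, 7500, 9000, 12000, 15000, 22000, 30000, 45000])

# q=3 cuts at the 1/3 and 2/3 quantiles, so each class receives roughly
# a third of the rows; labels follow the 0 / 1 / 2 coding above.
price_class = pd.qcut(prices, q=3, labels=[0, 1, 2]).astype(int)

# Class counts, to confirm the tiers are balanced.
class_counts = price_class.value_counts().sort_index()
```

Because the cut points are quantiles of the data, the classes stay balanced regardless of how skewed the price distribution is.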

Why prioritize recall over precision?

Missing a high-value car (a false negative) is more harmful than mistakenly predicting a car as high-value (a false positive).
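To make the trade-off concrete, precision and recall for the High class can be computed as below. This is a plain-Python sketch with made-up labels; the helper name `precision_recall` is hypothetical.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


HIGH = 2  # class coding: 0 = Low, 1 = Mid, 2 = High
y_true = [2, 2, 2, 2, 1, 1, 0, 0]
y_pred = [2, 2, 2, 1, 2, 2, 0, 0]
precision_high, recall_high = precision_recall(y_true, y_pred, HIGH)
```

In this toy case three of the four High cars are found (recall 0.75), while two Mid cars are wrongly flagged as High (precision 0.6). That is the kind of trade-off the reasoning above accepts.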


📌 Classification Models

| Model | Macro F1 | Accuracy |
|---|---|---|
| Logistic Regression | 0.62 | 0.63 |
| Gradient Boosting | 0.78 | 0.79 |
| Random Forest Classifier | 0.82 | 0.82 |

Winner: Random Forest Classifier
Best F1 across all classes and fewest severe misclassifications (especially between Low ↔ High).

📥 Model file: used_cars_price_classifier_rf.pkl
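The selection criteria above (macro F1, accuracy, and the number of severe Low ↔ High confusions) can be computed with scikit-learn as sketched here. The labels are toy values, not the project's predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy labels: 0 = Low, 1 = Mid, 2 = High.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 0]

# Macro F1 averages the per-class F1 scores with equal weight per class.
macro_f1 = f1_score(y_true, y_pred, average="macro")
accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
# Severe errors skip a tier: Low predicted as High, or High predicted as Low.
severe_errors = int(cm[0, 2] + cm[2, 0])
```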


📌 Repository Structure

yoav-omer-used-cars-price-random-forest
│
├── random_forest_used_cars_model.pkl        # Regression model
├── used_cars_price_classifier_rf.pkl        # Classification model
│
├── price_hist.png
├── year_price.png
├── odometer_price.png
├── manufacturer_price.png
│
└── README.md

📌 Summary

This project demonstrates a full machine learning pipeline on a complex real-world dataset, including:

  • Data cleaning and robust EDA
  • Strong feature engineering & clustering
  • Regression modeling
  • Classification modeling
  • Model export & deployment

Random Forest delivered the best results in both regression and classification tasks.
