Used Cars Price Prediction & Classification
Regression • Classification • Feature Engineering • Clustering
This project builds a complete Machine Learning workflow for predicting and classifying used car prices using a large real-world dataset from Craigslist.
Both the regression and classification winning models are available in this repository.
Presentation Video
https://drive.google.com/file/d/1sYohB3E4Sdb8ewkVP3S7uZXy2G06tS3l/view?usp=sharing
Dataset Overview
Source: Kaggle, Craigslist Cars & Trucks Data
Rows: 426,880 → 339,346 after cleaning
Features: numeric, categorical, geolocation
Targets:
- Regression: price
- Classification: 3 price tiers (Low / Mid / High)
Exploratory Data Analysis (EDA)
Key Questions & Insights
What is the distribution of prices?
Prices are heavily right-skewed, with most cars priced under $20,000.
Do newer cars cost more?
Clear positive trend: newer cars command higher prices.
How does mileage affect price?
Strong negative correlation: more mileage → lower value.
How does manufacturer affect price?
Significant variation across manufacturers (e.g., BMW > Toyota > Chevrolet).
Feature Engineering
To improve model performance, several new features were created:
- car_age = 2023 - year
- mileage_per_year
- manufacturer_popularity (frequency-based)
- Engineered flags: is_electric, is_hybrid
- Season extracted from posting date
- One-hot encoding for all categorical variables
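The features listed above could be derived roughly as follows. This is a minimal sketch on made-up rows, not the project's actual code: the column names (year, odometer, manufacturer, fuel, posting_date), the definition of mileage_per_year as odometer divided by car_age, the fuel-based flags, and the month-to-season mapping are all assumptions.

```python
import pandas as pd

# Toy rows standing in for the cleaned data (values are invented).
df = pd.DataFrame({
    "year": [2015, 2010, 2020, 2018],
    "odometer": [80000, 130000, 15000, 50000],
    "manufacturer": ["toyota", "bmw", "toyota", "tesla"],
    "fuel": ["gas", "gas", "hybrid", "electric"],
    "posting_date": ["2021-04-12", "2021-12-01", "2021-07-30", "2021-10-05"],
})

REFERENCE_YEAR = 2023  # from the car_age formula above

df["car_age"] = REFERENCE_YEAR - df["year"]
# Assumed definition; clip avoids dividing by zero for very new cars.
df["mileage_per_year"] = df["odometer"] / df["car_age"].clip(lower=1)
# Frequency-based popularity: share of listings per manufacturer.
df["manufacturer_popularity"] = df["manufacturer"].map(
    df["manufacturer"].value_counts(normalize=True)
)
# Assumed source of the flags: the fuel column.
df["is_electric"] = (df["fuel"] == "electric").astype(int)
df["is_hybrid"] = (df["fuel"] == "hybrid").astype(int)

# Assumed convention: meteorological seasons by posting month.
month_to_season = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}
df["season"] = pd.to_datetime(df["posting_date"]).dt.month.map(month_to_season)

# One-hot encode the categorical columns; the raw date string is dropped.
features = pd.get_dummies(
    df.drop(columns=["posting_date"]),
    columns=["manufacturer", "fuel", "season"],
)
```

Computing manufacturer_popularity on the training split only, and then mapping it onto the test split, would avoid leaking test-set frequencies into the feature.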
Clustering
- Applied K-Means (k=5) on numeric features
- Added cluster_kmeans to capture structural patterns
- Visualized using PCA (2D)
Adding the cluster label as a feature contributed strongly to the performance of the non-linear models.
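The clustering step can be sketched as below, using synthetic numeric features. Standardizing before K-Means, the choice of the three input columns, and the random seeds are assumptions; only k=5, the cluster_kmeans name, and the 2D PCA projection come from the description above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for car_age, odometer, mileage_per_year (invented values).
X = np.column_stack([
    rng.integers(1, 25, size=200),
    rng.integers(5_000, 250_000, size=200),
    rng.integers(500, 20_000, size=200),
]).astype(float)

# Assumption: scale first so the odometer does not dominate the distances.
X_scaled = StandardScaler().fit_transform(X)

# k=5 as stated above; the label becomes an extra feature column.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
cluster_kmeans = kmeans.fit_predict(X_scaled)

# Project to 2D for plotting (e.g. a scatter colored by cluster label).
X_2d = PCA(n_components=2).fit_transform(X_scaled)
```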
Regression Modeling
Baseline Model: Linear Regression
Performance (no feature engineering):
- MAE: 5775
- RMSE: 8811
- R²: 0.645
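For reference, the three metrics reported above follow from their standard definitions. The sketch below computes them in plain Python on invented prices; the function name regression_metrics is hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

# Invented example prices, not results from this project.
y_true = [10000, 20000, 30000, 40000]
y_pred = [12000, 18000, 33000, 39000]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
```

RMSE is never smaller than MAE, and a wide gap between them (as in the baseline, 8811 vs. 5775) usually indicates a minority of large errors, which would be consistent with the right-skewed price distribution.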
Improved Models (After Feature Engineering)
| Model | R² | RMSE | Winner |
|---|---|---|---|
| Linear Regression + FE | ~0.70 | - | |
| Gradient Boosting | ~0.78 | - | |
| Random Forest Regressor | 0.783 | 6818 | Yes |
Winner: Random Forest Regressor
It captured non-linear relationships and benefited the most from engineered features.
Model file: random_forest_used_cars_model.pkl
Classification (Regression → Classes)
Price was converted into three balanced classes using quantile binning:
- 0 = Low (bottom 33%)
- 1 = Mid (middle 33%)
- 2 = High (top 33%)
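Quantile binning of this kind is available in pandas as qcut. The following is a small sketch on invented prices; whether the project used qcut or another method is not stated, so treat it as one possible implementation.

```python
import pandas as pd

# Invented prices, sorted only to make the result easy to read.
prices = pd.Series([3000, 5500, 8000, 12000, 15000, 19000, 24000, 31000, 45000])

# q=3 splits at the 1/3 and 2/3 quantiles; labels follow the 0/1/2 scheme above.
price_class = pd.qcut(prices, q=3, labels=[0, 1, 2]).astype(int)
```

Passing retbins=True to qcut also returns the bin edges, which makes it possible to compute the edges on training data and reuse them on new listings with pd.cut.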
Why recall > precision?
Missing a high-value car (a false negative) is more harmful than mistakenly labeling a car as high-value (a false positive), so recall is prioritized over precision.
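To make the trade-off concrete, the sketch below computes precision and recall for the High class (label 2) on invented labels. The helper name precision_recall is hypothetical.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented labels: 0 = Low, 1 = Mid, 2 = High.
y_true = [2, 2, 2, 2, 1, 1, 0, 0]
y_pred = [2, 2, 2, 1, 2, 2, 0, 0]

# One High car is missed (false negative); two Mid cars are over-called (false positives).
precision_high, recall_high = precision_recall(y_true, y_pred, positive=2)
```

Here recall for High (0.75) is higher than precision (0.6): the model misses few high-value cars at the cost of some false alarms, which is the preference described above.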
Classification Models
| Model | Macro F1 | Accuracy |
|---|---|---|
| Logistic Regression | 0.62 | 0.63 |
| Gradient Boosting | 0.78 | 0.79 |
| Random Forest Classifier | 0.82 | 0.82 |
Winner: Random Forest Classifier
Best F1 across all classes and fewest severe misclassifications (especially between Low ↔ High).
Model file: used_cars_price_classifier_rf.pkl
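Macro F1 and the count of severe (Low ↔ High) errors can both be read from scikit-learn's metrics. This is a sketch on invented labels; treating "severe" as the two corner cells of the confusion matrix is an interpretation, not something the project defines.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Invented labels: 0 = Low, 1 = Mid, 2 = High.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 0]

# Macro F1: unweighted mean of the per-class F1 scores.
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# Assumed definition of "severe": Low predicted as High, or High predicted as Low.
severe_errors = int(cm[0, 2] + cm[2, 0])
```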
Repository Structure
yoav-omer-used-cars-price-random-forest
│
├── random_forest_used_cars_model.pkl   # Regression model
├── used_cars_price_classifier_rf.pkl   # Classification model
│
├── price_hist.png
├── year_price.png
├── odometer_price.png
├── manufacturer_price.png
│
└── README.md
Summary
This project demonstrates a full machine learning pipeline on a complex real-world dataset, including:
- Data cleaning and robust EDA
- Strong feature engineering & clustering
- Regression modeling
- Classification modeling
- Model export & deployment
Random Forest delivered the best results in both regression and classification tasks.



