Used Cars Price Prediction & Classification

Regression • Classification • Feature Engineering • Clustering

This project builds a complete machine learning workflow for predicting and classifying used car prices, using a large real-world dataset of Craigslist listings.
The winning regression and classification models are both available in this repository.

📌 Presentation Video

https://drive.google.com/file/d/1sYohB3E4Sdb8ewkVP3S7uZXy2G06tS3l/view?usp=sharing

📌 Dataset Overview

Source: Kaggle — Craigslist Cars & Trucks Data
Rows: 426,880 → 339,346 after cleaning
Features: numeric, categorical, geolocation
Targets:

Regression: price
Classification: 3 price tiers (Low / Mid / High)

📌 Exploratory Data Analysis (EDA)

Key Questions & Insights

🔹 What is the distribution of prices?

Prices are heavily right-skewed, with most cars priced under $20,000.

🔹 Do newer cars cost more?

Clear positive trend: newer cars command higher prices.

🔹 How does mileage affect price?

Strong negative correlation: more mileage → lower value.

🔹 How does manufacturer affect price?

Significant variation across manufacturers (e.g., BMW > Toyota > Chevrolet).

📌 Feature Engineering

To improve model performance, multiple new features were created:

car_age = 2023 - year
mileage_per_year
manufacturer_popularity (frequency-based)
Engineered flags: is_electric, is_hybrid
Season extracted from posting date
One-hot encoding for all categorical variables
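
The features listed above could be derived roughly as follows. This is a minimal sketch, not the project's actual code: the column names (year, odometer, manufacturer, fuel, posting_date), the fuel labels, the month-to-season mapping, and the clipping of car_age to avoid division by zero are all assumptions made for illustration.

```python
import pandas as pd

REFERENCE_YEAR = 2023  # matches car_age = 2023 - year


def season_from_month(month: int) -> str:
    """Map a calendar month to a season (assumed meteorological mapping)."""
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "fall"


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the engineered columns and one-hot encode the categoricals."""
    out = df.copy()
    out["car_age"] = REFERENCE_YEAR - out["year"]
    # Assumption: treat cars younger than one year as one year old
    # so the division below never hits zero.
    out["mileage_per_year"] = out["odometer"] / out["car_age"].clip(lower=1)
    # Frequency-based popularity: share of listings per manufacturer.
    shares = out["manufacturer"].value_counts(normalize=True)
    out["manufacturer_popularity"] = out["manufacturer"].map(shares)
    # Assumption: the fuel column uses the labels "electric" and "hybrid".
    out["is_electric"] = (out["fuel"] == "electric").astype(int)
    out["is_hybrid"] = (out["fuel"] == "hybrid").astype(int)
    months = pd.to_datetime(out["posting_date"], utc=True).dt.month
    out["season"] = months.map(season_from_month)
    out = out.drop(columns=["posting_date"])
    return pd.get_dummies(out, columns=["manufacturer", "fuel", "season"])


# Two made-up listings, only to show the shape of the output.
sample = pd.DataFrame({
    "year": [2013, 2023],
    "odometer": [100000.0, 5000.0],
    "manufacturer": ["toyota", "toyota"],
    "fuel": ["gas", "electric"],
    "posting_date": ["2021-04-15", "2021-12-01"],
})
feats = engineer_features(sample)
print(feats.columns.tolist())
```

In a real train/test workflow, manufacturer_popularity would normally be computed on the training split only and then mapped onto the test split.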

Clustering

Applied K-Means (k=5) on numeric features
Added cluster_kmeans to capture structural patterns
Visualized using PCA (2D)

The cluster_kmeans feature contributed strongly to the performance of the non-linear models.
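
A sketch of the clustering step with scikit-learn is shown below. The choice of numeric columns, the standardization before K-Means, the random seed, and the synthetic data are assumptions. Only k=5, the cluster_kmeans column name, and the 2D PCA projection come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def add_kmeans_cluster(df, numeric_cols, k=5, seed=42):
    """Return a copy of df with cluster_kmeans labels and 2D PCA coordinates."""
    # Assumption: standardize first so large-scale columns such as
    # odometer do not dominate the Euclidean distances.
    scaled = StandardScaler().fit_transform(df[numeric_cols])
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(scaled)
    coords = PCA(n_components=2).fit_transform(scaled)
    out = df.copy()
    out["cluster_kmeans"] = labels
    out["pca_1"] = coords[:, 0]
    out["pca_2"] = coords[:, 1]
    return out


# Synthetic stand-in for the cleaned numeric features.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "car_age": rng.integers(1, 25, size=200),
    "odometer": rng.uniform(5_000, 250_000, size=200),
    "mileage_per_year": rng.uniform(2_000, 20_000, size=200),
})
clustered = add_kmeans_cluster(demo, ["car_age", "odometer", "mileage_per_year"])
print(clustered["cluster_kmeans"].value_counts().sort_index())
```

The pca_1 and pca_2 columns can be passed to any scatter plot, colored by cluster_kmeans, to reproduce a 2D view of the clusters.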

📌 Regression Modeling

Baseline Model: Linear Regression

Performance (no feature engineering):

MAE: 5775
RMSE: 8811
R²: 0.645
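
For reference, the three metrics reported above can be computed as in the sketch below. The function name and the toy prices are illustrative and are not taken from the project.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R² for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))            # average absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))     # penalizes large errors more
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "RMSE": rmse, "R2": 1.0 - ss_res / ss_tot}


# Toy prices in dollars, not taken from the dataset.
metrics = regression_metrics([10000, 20000, 30000], [12000, 18000, 33000])
print(metrics)
```

RMSE is never smaller than MAE, so the gap between the two (8811 vs. 5775 for the baseline) suggests that a minority of listings carry large errors.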

Improved Models (After Feature Engineering)

Model	R²	RMSE	Winner
Linear Regression + FE	~0.70	↓	No
Gradient Boosting	~0.78	↓	No
Random Forest Regressor	0.783	6818	🏆 Yes

Winner: Random Forest Regressor
It captured non-linear relationships and benefited the most from engineered features.
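
To illustrate why a tree ensemble can outperform a linear model when the target is non-linear, here is a small sketch on synthetic data. The exponential depreciation curve, the noise level, and the hyperparameters are assumptions chosen for the demonstration and do not reproduce the project's training setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
age = rng.uniform(0, 25, size=1000)
# Synthetic price: exponential depreciation with age plus Gaussian noise.
price = 30000 * np.exp(-0.15 * age) + rng.normal(0, 500, size=1000)

X = age.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=0
)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

r2_linear = r2_score(y_test, linear.predict(X_test))
r2_forest = r2_score(y_test, forest.predict(X_test))
print(f"Linear R²: {r2_linear:.3f}, Random Forest R²: {r2_forest:.3f}")
```

The straight line cannot follow the curved depreciation, while the forest approximates it with piecewise-constant splits.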

📥 Model file: random_forest_used_cars_model.pkl

📌 Classification (Regression → Classes)

Price was converted into three balanced classes using quantile binning:

0 = Low (bottom 33%)
1 = Mid (middle 33%)
2 = High (top 33%)
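
Quantile binning of this kind can be done with pandas.qcut. The sketch below uses made-up prices. Reusing the learned bin edges for new data, with open-ended outer bins, is an assumption about how one might avoid leaking test-set information, not a description of the project's code.

```python
import numpy as np
import pandas as pd

# Made-up prices, only to show the mechanics.
prices = pd.Series([2500, 4000, 6500, 9000, 12000, 15000, 21000, 28000, 40000])

# Split into three equally sized tiers: 0 = Low, 1 = Mid, 2 = High.
tiers, edges = pd.qcut(prices, q=3, labels=[0, 1, 2], retbins=True)
tiers = tiers.astype(int)

# Reuse the inner edges for unseen prices; open the outer bins so that
# values outside the training range still receive a tier.
open_edges = [-np.inf, *edges[1:-1], np.inf]
new_prices = pd.Series([3000, 10000, 50000])
new_tiers = pd.cut(new_prices, bins=open_edges, labels=[0, 1, 2]).astype(int)

print(tiers.tolist())
print(new_tiers.tolist())
```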

Why prioritize recall over precision?

Missing a high-value car (a false negative) is more harmful than mistakenly predicting a car as high-value (a false positive).
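
The trade-off can be made concrete by computing precision and recall for the High class alone. The helper name and the toy labels below are illustrative.

```python
import numpy as np

HIGH = 2  # class index for the High tier


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treating it as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return float(precision), float(recall)


# Toy labels: every High car is found, at the cost of two false alarms.
y_true = [2, 2, 2, 2, 1, 1, 1, 0, 0, 0]
y_pred = [2, 2, 2, 2, 2, 2, 1, 0, 0, 0]
precision_high, recall_high = precision_recall(y_true, y_pred, HIGH)
print(f"High tier precision: {precision_high:.2f}, recall: {recall_high:.2f}")
```

Here recall for the High tier is perfect while precision drops, which is the direction of error the reasoning above prefers.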

📌 Classification Models

Model	Macro F1	Accuracy
Logistic Regression	0.62	0.63
Gradient Boosting	0.78	0.79
Random Forest Classifier	0.82	0.82

Winner: Random Forest Classifier
Best F1 across all classes and fewest severe misclassifications (especially between Low ↔ High).
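
The two selection criteria mentioned above, macro F1 and the number of Low ↔ High confusions, can be read off the predictions as sketched here. The toy labels and the helper name are assumptions for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

LOW, MID, HIGH = 0, 1, 2


def summarize_classifier(y_true, y_pred):
    """Return macro F1, accuracy, and the count of Low <-> High confusions."""
    cm = confusion_matrix(y_true, y_pred, labels=[LOW, MID, HIGH])
    # Rows are true classes, columns are predicted classes.
    severe = int(cm[LOW, HIGH] + cm[HIGH, LOW])
    return {
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "accuracy": accuracy_score(y_true, y_pred),
        "severe_errors": severe,
    }


# Toy labels, not model output.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 0]
summary = summarize_classifier(y_true, y_pred)
print(summary)
```

Macro F1 averages the per-class F1 scores with equal weight, which suits the balanced three-tier target.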

📥 Model file: used_cars_price_classifier_rf.pkl

📌 Repository Structure

yoav-omer-used-cars-price-random-forest
│
├── random_forest_used_cars_model.pkl        # Regression model
├── used_cars_price_classifier_rf.pkl        # Classification model
│
├── price_hist.png
├── year_price.png
├── odometer_price.png
├── manufacturer_price.png
│
└── README.md
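
Either .pkl file could be loaded along the lines of the sketch below. It assumes the models were saved with Python's pickle module. If they were written with joblib instead, joblib.load would be the matching call. The stand-in model and its three features are hypothetical and exist only so the example runs without the real file. The real models expect the same feature columns, in the same order, that were used during training.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestRegressor


def load_model(path):
    """Load a pickled estimator. Only unpickle files from a trusted source."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical stand-in model trained on random data.
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 3))
y = X @ np.array([1.0, 2.0, 3.0])
stand_in = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Round trip through a temporary file that mimics the repository file name.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "random_forest_used_cars_model.pkl"
    with open(path, "wb") as fh:
        pickle.dump(stand_in, fh)
    restored = load_model(path)

same_predictions = bool(np.allclose(stand_in.predict(X), restored.predict(X)))
print(same_predictions)
```

Pickled scikit-learn models are generally tied to the library version that produced them, so loading with a different version may raise warnings or fail.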

📌 Summary

This project demonstrates a full machine learning pipeline on a complex real-world dataset, including:

Data cleaning and robust EDA
Strong feature engineering & clustering
Regression modeling
Classification modeling
Model export & deployment

Random Forest delivered the best results in both regression and classification tasks.
