	---
	language:
	- en
	license: mit
	tags:
	- tabular-regression
	- tabular-classification
	- doordash
	- delivery-time
	---

	<div align="center">
	<video width="100%" controls autoplay loop muted>
	<source src="https://huggingface.co/almador2002/doordash-delivery-predictor/resolve/main/d08aecd4dc994e8b9fae392148fa0c6d.mp4" type="video/mp4">
	Your browser does not support the video tag.
	</video>
	</div>

	# 🚀 DoorDash Delivery Time Predictor

	## 📋 Project Overview
	This project analyzes and predicts DoorDash delivery times using historical order data.
	We tackle the problem as both a regression task (predict exact duration) and a
	classification task (predict Fast / Normal / Slow delivery).

	---

	## 📦 Dataset
	| Property | Value |
	|----------|-------|
	| Source | Kaggle: DoorDash Historical Delivery Data |
	| Rows | 197,428 orders |
	| Features | 16 columns (numeric + categorical) |
	| Target | `delivery_duration_seconds` |

	Key Features:
	- `total_onshift_dashers` — how many dashers are available
	- `total_busy_dashers` — how many dashers are currently occupied
	- `total_outstanding_orders` — current order backlog
	- `store_primary_category` — type of restaurant
	- `estimated_store_to_consumer_driving_duration` — estimated driving time

	![EDA](eda_plots.png)

	---

	## 🔍 Research Question
	> *Can we predict how long a DoorDash delivery will take based on order details,
	> restaurant category, and real-time dasher availability?*

	---

	## ⚙️ Feature Engineering
	We derived nine features from the raw data:
	- `dasher_util_ratio`: share of on-shift dashers who are busy (`total_busy_dashers` / `total_onshift_dashers`)
	- `order_pressure`: outstanding orders per on-shift dasher
	- `is_peak_lunch` / `is_peak_dinner`: peak-hour flags
	- `is_weekend`: weekend flag
	- `log_subtotal`: log-transformed order value
	- `price_spread`: difference between the most and least expensive item in the order
	- `dasher_idle`: dashers who are on shift but not currently busy
	- `cluster`: KMeans cluster ID (5 clusters)
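The list above could be computed along the following lines. This is a minimal sketch, not the notebook's actual code. The columns `created_at`, `subtotal`, `min_item_price` and `max_item_price`, the peak-hour windows, the use of `log1p`, and the inputs to KMeans are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the nine engineered columns added."""
    out = df.copy()
    # Mask zero dasher counts (become NaN) so the ratios never divide by zero.
    onshift = out["total_onshift_dashers"].where(out["total_onshift_dashers"] > 0)

    out["dasher_util_ratio"] = out["total_busy_dashers"] / onshift
    out["order_pressure"] = out["total_outstanding_orders"] / onshift
    out["dasher_idle"] = out["total_onshift_dashers"] - out["total_busy_dashers"]

    # Assumed timestamp column and peak windows (11-13h lunch, 18-20h dinner).
    created = pd.to_datetime(out["created_at"])
    out["is_peak_lunch"] = created.dt.hour.between(11, 13).astype(int)
    out["is_peak_dinner"] = created.dt.hour.between(18, 20).astype(int)
    out["is_weekend"] = (created.dt.dayofweek >= 5).astype(int)

    # Assumed price columns; log1p keeps a zero subtotal finite.
    out["log_subtotal"] = np.log1p(out["subtotal"])
    out["price_spread"] = out["max_item_price"] - out["min_item_price"]

    # Assumed clustering inputs; missing ratios are filled with 0 for this sketch.
    cluster_inputs = out[["dasher_util_ratio", "order_pressure", "log_subtotal"]].fillna(0)
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
    out["cluster"] = kmeans.fit_predict(cluster_inputs)
    return out


# Toy rows for illustration only; these are not taken from the dataset.
toy = pd.DataFrame({
    "created_at": ["2015-02-06 12:15:00", "2015-02-07 19:30:00", "2015-02-09 03:00:00",
                   "2015-02-10 12:45:00", "2015-02-14 18:05:00", "2015-02-11 22:10:00"],
    "total_onshift_dashers": [10, 20, 0, 8, 30, 5],
    "total_busy_dashers": [8, 20, 0, 2, 27, 1],
    "total_outstanding_orders": [12, 35, 0, 3, 40, 2],
    "subtotal": [1500, 4200, 900, 2300, 6100, 1200],
    "min_item_price": [500, 700, 900, 300, 450, 1200],
    "max_item_price": [1000, 1800, 900, 1100, 2500, 1200],
})
features = add_engineered_features(toy)
```

Masking zero on-shift counts before dividing leaves the two ratio features as missing values instead of infinities, so a later imputation step can handle them explicitly.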

	---

	## 🤖 Models

	### Regression
	| Model | MAE | RMSE | R² |
	|-------|-----|------|----|
	| Baseline Linear Regression | 11.2 min | 14.7 min | 0.233 |
	| Linear Regression (Engineered) | 11.2 min | 14.7 min | 0.240 |
	| Random Forest | 10.8 min | 14.1 min | 0.294 |
	| Gradient Boosting ✓ | 10.5 min | 13.9 min | 0.319 |

	<img src="model_comparison.png" width="800"/>
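Because the target is stored in seconds while the table reports minutes, the metrics can be computed on durations converted to minutes. The sketch below shows one way to do that with scikit-learn. The helper name `regression_report` and the toy values are illustrative and are not taken from the notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true_sec, y_pred_sec):
    """Return MAE and RMSE in minutes plus R² for durations given in seconds."""
    y_true_min = np.asarray(y_true_sec, dtype=float) / 60.0
    y_pred_min = np.asarray(y_pred_sec, dtype=float) / 60.0
    mae = mean_absolute_error(y_true_min, y_pred_min)
    rmse = np.sqrt(mean_squared_error(y_true_min, y_pred_min))
    r2 = r2_score(y_true_min, y_pred_min)  # unit-free, same in seconds or minutes
    return {"mae_min": mae, "rmse_min": rmse, "r2": r2}


# Toy values in seconds, for illustration only.
report = regression_report([1800, 2400, 3000, 3600], [2100, 2400, 2700, 3900])
```

RMSE weights large misses more heavily than MAE does, which is consistent with every row of the table showing an RMSE above its MAE.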

	### Classification (Fast / Normal / Slow)
	| Model | Macro F1 |
	|-------|----------|
	| Logistic Regression | 0.502 |
	| Random Forest | 0.518 |
	| Gradient Boosting ✓ | 0.533 |

	<img src="confusion_matrices.png" width="800"/>
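Macro F1 is the unweighted mean of the per-class F1 scores, so Fast, Normal and Slow each count equally. A small illustration with scikit-learn follows. The labels are toy values, not model output.

```python
from sklearn.metrics import f1_score

# Toy labels for illustration: 0 = Fast, 1 = Normal, 2 = Slow.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# One F1 score per class, then their unweighted mean.
per_class_f1 = f1_score(y_true, y_pred, average=None)
macro_f1 = f1_score(y_true, y_pred, average="macro")
```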

	---

	## 🏆 Winners
	- Regression: `GradientBoostingRegressor` — R² = 0.319
	- Classification: `GradientBoostingClassifier` — Macro F1 = 0.533

	---

	## 📁 Repository Files
	| File | Description |
	|------|-------------|
	| `regression_model.pkl` | Trained `GradientBoostingRegressor`, bundled with its scaler and feature list |
	| `classification_model.pkl` | Trained `GradientBoostingClassifier`, bundled with the class thresholds |
	| `notebook.ipynb` | Full analysis, EDA, training and evaluation code |
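The internal layout of the `.pkl` files is not documented here, so the following is only a sketch of how a bundled model, scaler and feature list might be used together. It assumes a dictionary with the hypothetical keys `"model"`, `"scaler"` and `"features"`, and it builds a small stand-in bundle in memory so the example runs without the real file.

```python
import pickle

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler

# Stand-in bundle trained on synthetic data; the dictionary keys are hypothetical.
rng = np.random.default_rng(0)
feature_names = ["order_pressure", "dasher_util_ratio"]
X_train = rng.random((50, len(feature_names)))
y_train = 1800 + 1200 * X_train[:, 0] + rng.normal(0, 60, size=50)  # seconds

scaler = StandardScaler().fit(X_train)
model = GradientBoostingRegressor(random_state=0).fit(scaler.transform(X_train), y_train)
payload = pickle.dumps({"model": model, "scaler": scaler, "features": feature_names})

# For a file on disk, pickle.load on the opened file would replace this call.
bundle = pickle.loads(payload)


def predict_minutes(bundle, rows):
    """Scale rows (columns ordered as bundle['features']) and predict minutes."""
    X = np.asarray(rows, dtype=float)
    return bundle["model"].predict(bundle["scaler"].transform(X)) / 60.0


pred_min = predict_minutes(bundle, [[0.5, 0.8]])
```

Keeping the feature list next to the model makes it possible to check column order before predicting, which matters because the scaler and the model both depend on it.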

	---

	## 🏷️ Classification Strategy
	Delivery times are split into three equal-sized quantile bins (terciles):
	- 🟢 Fast (0): under 38.1 minutes
	- 🟡 Normal (1): 38.1 – 51.4 minutes
	- 🔴 Slow (2): over 51.4 minutes
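A tercile split like the one above can be sketched as follows. The helper name `label_by_terciles` is illustrative, and assigning a value that falls exactly on a threshold to the higher class is an assumption. The 38.1 and 51.4 minute cut points come from this README, while the toy durations do not.

```python
import numpy as np
import pandas as pd


def label_by_terciles(duration_min: pd.Series):
    """Return (labels, thresholds) using the 1/3 and 2/3 quantiles as cut points."""
    low, high = duration_min.quantile([1 / 3, 2 / 3]).to_numpy()
    # 0 = Fast (below low), 1 = Normal (low up to high), 2 = Slow (high and above).
    labels = np.digitize(duration_min.to_numpy(), [low, high])
    return pd.Series(labels, index=duration_min.index, name="speed_class"), (low, high)


# Toy durations in minutes, for illustration only.
toy_minutes = pd.Series([20, 30, 35, 40, 45, 50, 55, 60, 70], dtype=float)
labels, (low, high) = label_by_terciles(toy_minutes)

# Applying the fixed thresholds reported above to new durations.
new_labels = np.digitize([30.0, 45.0, 60.0], [38.1, 51.4])
```

Fitting the thresholds once on training data and reusing them for new orders keeps the class definitions stable, which is presumably why the classifier file is described as storing its thresholds.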

	---

	## 🛠️ Tech Stack
	![Python](https://img.shields.io/badge/Python-3.12-blue)
	![scikit-learn](https://img.shields.io/badge/scikit--learn-latest-orange)
	![pandas](https://img.shields.io/badge/pandas-latest-green)
	![numpy](https://img.shields.io/badge/numpy-latest-lightblue)