---
language:
- en
license: mit
tags:
- tabular-regression
- tabular-classification
- doordash
- delivery-time
---
# 🚀 DoorDash Delivery Time Predictor
## 📋 Project Overview
This project analyzes historical DoorDash order data and predicts how long a delivery will take.
We frame the problem as both a **regression task** (predict the exact delivery duration) and a
**classification task** (predict whether a delivery is Fast, Normal, or Slow).

---
## 📦 Dataset
| Property | Value |
|----------|-------|
| Source | Kaggle — DoorDash Historical Delivery Data |
| Rows | 197,428 orders |
| Features | 16 columns (numeric + categorical) |
| Target | `delivery_duration_seconds` |

**Key Features:**
- `total_onshift_dashers` — how many dashers are available
- `total_busy_dashers` — how many dashers are currently occupied
- `total_outstanding_orders` — current order backlog
- `store_primary_category` — type of restaurant
- `estimated_store_to_consumer_driving_duration` — estimated driving time

---
## 🔍 Research Question
> *Can we predict how long a DoorDash delivery will take based on order details,
> restaurant category, and real-time dasher availability?*
---
## ⚙️ Feature Engineering
We created 9 new features from the raw data:
- **`dasher_util_ratio`** — how busy are available dashers (busy/onshift)
- **`order_pressure`** — outstanding orders per available dasher
- **`is_peak_lunch`** / **`is_peak_dinner`** — peak hour flags
- **`is_weekend`** — weekend flag
- **`log_subtotal`** — log-transformed order value
- **`price_spread`** — range between cheapest and most expensive item
- **`dasher_idle`** — dashers available but not working
- **`cluster`** — KMeans cluster ID (5 clusters)
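
The features listed above could be derived roughly as follows. This is a minimal sketch, not the notebook's actual code: the column names `created_at`, `subtotal`, `min_item_price` and `max_item_price`, the peak-hour windows, and the use of `log1p` are all assumptions made for illustration. The `cluster` feature is left out because the columns that were clustered are not stated here.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with eight of the nine engineered features added."""
    out = df.copy()

    # Treat "no dashers on shift" as missing so the ratios do not divide by zero.
    onshift = out["total_onshift_dashers"].replace(0, np.nan)
    out["dasher_util_ratio"] = out["total_busy_dashers"] / onshift
    out["order_pressure"] = out["total_outstanding_orders"] / onshift
    out["dasher_idle"] = out["total_onshift_dashers"] - out["total_busy_dashers"]

    # ASSUMPTION: order timestamps live in a `created_at` column.
    created = pd.to_datetime(out["created_at"])
    hour = created.dt.hour
    # ASSUMPTION: illustrative peak windows, 11:00-13:59 and 17:00-20:59.
    out["is_peak_lunch"] = hour.between(11, 13).astype(int)
    out["is_peak_dinner"] = hour.between(17, 20).astype(int)
    out["is_weekend"] = (created.dt.dayofweek >= 5).astype(int)  # Sat=5, Sun=6

    # ASSUMPTION: `subtotal`, `min_item_price`, `max_item_price` columns exist;
    # log1p is used so that a zero subtotal stays finite.
    out["log_subtotal"] = np.log1p(out["subtotal"])
    out["price_spread"] = out["max_item_price"] - out["min_item_price"]
    return out
```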
---
## 🤖 Models
### Regression
| Model | MAE | RMSE | R² |
|-------|-----|------|----|
| Baseline Linear Regression | 11.2 min | 14.7 min | 0.233 |
| Linear Regression (Engineered) | 11.2 min | 14.7 min | 0.240 |
| Random Forest | 10.8 min | 14.1 min | 0.294 |
| **Gradient Boosting ✓** | **10.5 min** | **13.9 min** | **0.319** |
### Classification (Fast / Normal / Slow)
| Model | Macro F1 |
|-------|----------|
| Logistic Regression | 0.502 |
| Random Forest | 0.518 |
| **Gradient Boosting ✓** | **0.533** |
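
For readers who want to reproduce the metric columns, the sketch below shows how MAE, RMSE, R² and macro F1 are defined. It assumes predictions are made in seconds (the unit of the target) and converted to minutes for reporting; the function names are hypothetical and the notebook may compute these differently, for example with `sklearn.metrics`.

```python
import numpy as np


def regression_metrics_minutes(y_true_sec, y_pred_sec):
    """MAE and RMSE in minutes, plus R², from durations given in seconds."""
    y_true = np.asarray(y_true_sec, dtype=float) / 60.0
    y_pred = np.asarray(y_pred_sec, dtype=float) / 60.0
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "mae_min": float(np.mean(np.abs(err))),
        "rmse_min": float(np.sqrt(np.mean(err ** 2))),
        "r2": float(1.0 - ss_res / ss_tot),
    }


def macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Unweighted mean of per-class F1 scores (Fast=0, Normal=1, Slow=2)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for label in labels:
        tp = np.sum((y_true == label) & (y_pred == label))
        fp = np.sum((y_true != label) & (y_pred == label))
        fn = np.sum((y_true == label) & (y_pred != label))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support and no predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))
```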
---
## 🏆 Winners
- **Regression:** `GradientBoostingRegressor` — R² = 0.319
- **Classification:** `GradientBoostingClassifier` — Macro F1 = 0.533
---
## 📁 Repository Files
| File | Description |
|------|-------------|
| `regression_model.pkl` | Trained GradientBoostingRegressor + scaler + features |
| `classification_model.pkl` | Trained GradientBoostingClassifier + thresholds |
| `notebook.ipynb` | Full analysis, EDA, training and evaluation code |
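
A minimal sketch for loading and inspecting the `.pkl` artifacts follows. The internal layout of each file (for example, whether it is a dict and which keys it uses) is not documented here, so the helper only reports what it finds instead of assuming key names. `load_bundle` and `describe_bundle` are hypothetical helpers. If the files were written with `joblib`, `joblib.load` would be the matching loader. Unpickling the models also requires a compatible scikit-learn installation.

```python
import pickle
from pathlib import Path


def load_bundle(path):
    """Load a pickled artifact.

    Only unpickle files you trust: pickle can execute arbitrary code on load.
    """
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def describe_bundle(bundle):
    """Summarize a loaded artifact without assuming its structure."""
    if isinstance(bundle, dict):
        # Map each top-level key to the type name of its value.
        return {key: type(value).__name__ for key, value in bundle.items()}
    return {"<object>": type(bundle).__name__}


# Example usage (run from the repository root):
# bundle = load_bundle("regression_model.pkl")
# print(describe_bundle(bundle))
```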
---
## 🏷️ Classification Strategy
Delivery times are split into three equal-sized classes at the 1/3 and 2/3 quantiles (terciles):
- 🟢 **Fast (0):** under 38.1 minutes
- 🟡 **Normal (1):** 38.1 – 51.4 minutes
- 🔴 **Slow (2):** over 51.4 minutes
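
The labeling rule above can be sketched in plain Python. The cut points 38.1 and 51.4 minutes are the rounded values listed here; the exact thresholds are stored in `classification_model.pkl`. The function names, the handling of values that fall exactly on a boundary, and the `inclusive` quantile method are assumptions.

```python
import statistics

FAST_MAX_MIN = 38.1    # rounded lower tercile from the list above
NORMAL_MAX_MIN = 51.4  # rounded upper tercile from the list above


def tercile_thresholds(durations_min):
    """Two cut points that split the data into three equal-sized groups."""
    # ASSUMPTION: the quantile method used in the notebook is unknown.
    return statistics.quantiles(durations_min, n=3, method="inclusive")


def label_delivery(duration_seconds, low=FAST_MAX_MIN, high=NORMAL_MAX_MIN):
    """Map a duration in seconds to 0 (Fast), 1 (Normal) or 2 (Slow)."""
    minutes = duration_seconds / 60.0
    if minutes < low:
        return 0
    if minutes <= high:  # ASSUMPTION: boundary values count as Normal
        return 1
    return 2


# Example usage:
# label_delivery(1800) -> 0 (30 min), label_delivery(2700) -> 1 (45 min),
# label_delivery(3600) -> 2 (60 min)
```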
---
## 🛠️ Tech Stack



