---
language:
- en
license: mit
tags:
- tabular-regression
- tabular-classification
- doordash
- delivery-time
---
<div align="center">
<video width="100%" controls autoplay loop muted>
<source src="https://huggingface.co/almador2002/doordash-delivery-predictor/resolve/main/d08aecd4dc994e8b9fae392148fa0c6d.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</div>
# 🚀 DoorDash Delivery Time Predictor
## 📋 Project Overview
This project analyzes and predicts DoorDash delivery times using historical order data.
We tackle the problem as both a **regression task** (predict exact duration) and a
**classification task** (predict Fast / Normal / Slow delivery).
---
## 📦 Dataset
| Property | Value |
|----------|-------|
| Source | Kaggle: DoorDash Historical Delivery Data |
| Rows | 197,428 orders |
| Features | 16 columns (numeric + categorical) |
| Target | `delivery_duration_seconds` |
**Key Features:**
- `total_onshift_dashers`: how many dashers are available
- `total_busy_dashers`: how many dashers are currently occupied
- `total_outstanding_orders`: current order backlog
- `store_primary_category`: type of restaurant
- `estimated_store_to_consumer_driving_duration`: estimated driving time
![EDA](eda_plots.png)
---
## 🔍 Research Question
> *Can we predict how long a DoorDash delivery will take based on order details,
> restaurant category, and real-time dasher availability?*
---
## ⚙️ Feature Engineering
We created nine new features from the raw data:
- **`dasher_util_ratio`**: share of on-shift dashers who are busy (busy/onshift)
- **`order_pressure`**: outstanding orders per available dasher
- **`is_peak_lunch`** / **`is_peak_dinner`**: peak-hour flags
- **`is_weekend`**: weekend flag
- **`log_subtotal`**: log-transformed order value
- **`price_spread`**: range between the cheapest and the most expensive item
- **`dasher_idle`**: dashers who are on shift but not currently busy
- **`cluster`**: KMeans cluster ID (5 clusters)
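
The list above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's code: the columns `created_at`, `subtotal`, `min_item_price` and `max_item_price` are assumed to be present as in the Kaggle dataset, the peak-hour windows are assumptions (the exact hours are not stated here), and `log1p` is one possible choice of log transform. The `cluster` feature is left out because the KMeans inputs are not described.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with eight of the engineered columns added."""
    out = df.copy()
    # Treat zero on-shift dashers as missing so the ratios do not divide by zero.
    onshift = out["total_onshift_dashers"].replace(0, np.nan)

    out["dasher_util_ratio"] = out["total_busy_dashers"] / onshift
    out["order_pressure"] = out["total_outstanding_orders"] / onshift
    out["dasher_idle"] = out["total_onshift_dashers"] - out["total_busy_dashers"]

    created = pd.to_datetime(out["created_at"])
    hour = created.dt.hour
    # Assumed peak windows (inclusive hours); adjust to match the notebook.
    out["is_peak_lunch"] = hour.between(11, 13).astype(int)
    out["is_peak_dinner"] = hour.between(17, 20).astype(int)
    # dayofweek: Monday=0 ... Sunday=6
    out["is_weekend"] = (created.dt.dayofweek >= 5).astype(int)

    # Assumption: log1p keeps a zero subtotal finite.
    out["log_subtotal"] = np.log1p(out["subtotal"].clip(lower=0))
    out["price_spread"] = out["max_item_price"] - out["min_item_price"]
    return out


# Two made-up orders, for illustration only.
sample = pd.DataFrame(
    {
        "created_at": ["2015-02-06 12:15:00", "2015-02-07 19:30:00"],
        "total_onshift_dashers": [20, 0],
        "total_busy_dashers": [15, 0],
        "total_outstanding_orders": [30, 4],
        "subtotal": [3441, 1900],
        "min_item_price": [557, 1400],
        "max_item_price": [1239, 1400],
    }
)
features = add_engineered_features(sample)
print(features[["dasher_util_ratio", "order_pressure", "is_peak_lunch", "is_weekend"]])
```

Rows with no on-shift dashers get `NaN` ratios in this sketch, so they would need imputing or dropping before model training.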
---
## 🤖 Models
### Regression
| Model | MAE | RMSE | R² |
|-------|-----|------|----|
| Baseline Linear Regression | 11.2 min | 14.7 min | 0.233 |
| Linear Regression (Engineered) | 11.2 min | 14.7 min | 0.240 |
| Random Forest | 10.8 min | 14.1 min | 0.294 |
| **Gradient Boosting ✓** | **10.5 min** | **13.9 min** | **0.319** |
<img src="model_comparison.png" width="800"/>
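
For reference, the three regression metrics in the table can be computed as in the sketch below. It uses plain Python and made-up durations, and it assumes errors are measured in seconds (the unit of the target) and divided by 60 for reporting in minutes; the notebook may compute them differently, for example with `sklearn.metrics`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(ss_res / n)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Made-up durations in seconds, for illustration only.
y_true = [1800, 2400, 3000, 3600]
y_pred = [2100, 2400, 2700, 3300]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE {mae / 60:.2f} min, RMSE {rmse / 60:.2f} min, R2 {r2:.3f}")
```

An R² of 0.319 means the best model explains about a third of the variance in delivery time, which is why the MAE stays above 10 minutes.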
### Classification (Fast / Normal / Slow)
| Model | Macro F1 |
|-------|----------|
| Logistic Regression | 0.502 |
| Random Forest | 0.518 |
| **Gradient Boosting ✓** | **0.533** |
<img src="confusion_matrices.png" width="800"/>
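
Macro F1 is the unweighted mean of the per-class F1 scores, so Fast, Normal and Slow count equally. The sketch below shows the calculation in plain Python on made-up labels; it is an illustration of the metric, not the notebook's evaluation code.

```python
def macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Unweighted mean of per-class F1 scores."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); define it as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: 0 = Fast, 1 = Normal, 2 = Slow.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred), 3))
```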
---
## 🏆 Winners
- **Regression:** `GradientBoostingRegressor` (R² = 0.319)
- **Classification:** `GradientBoostingClassifier` (Macro F1 = 0.533)
---
## 📁 Repository Files
| File | Description |
|------|-------------|
| `regression_model.pkl` | Trained GradientBoostingRegressor + scaler + features |
| `classification_model.pkl` | Trained GradientBoostingClassifier + thresholds |
| `notebook.ipynb` | Full analysis, EDA, training and evaluation code |
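
A possible way to fetch and inspect the model files is sketched below. It assumes the files were written with Python's `pickle` module (if they were saved with `joblib`, use `joblib.load` instead) and that `huggingface_hub` and scikit-learn are installed. The internal layout of each file is not documented here, so the helper `describe` (a name introduced for this example) only reports what it finds rather than assuming specific keys.

```python
import pickle

REPO_ID = "almador2002/doordash-delivery-predictor"


def load_artifact(filename: str):
    """Download one file from the model repo and unpickle it."""
    # Imported lazily so that defining this helper does not require the package.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(repo_id=REPO_ID, filename=filename)
    with open(path, "rb") as fh:
        return pickle.load(fh)


def describe(obj) -> str:
    """Summarize an unpickled object without assuming its layout."""
    if isinstance(obj, dict):
        return "dict with keys: " + ", ".join(sorted(map(str, obj)))
    return type(obj).__name__


# Example (needs network access):
# bundle = load_artifact("regression_model.pkl")
# print(describe(bundle))
```

Unpickling runs arbitrary code, so only load these files if you trust the repository.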
---
## 🏷️ Classification Strategy
Delivery times split into 3 equal quantiles:
- 🟒 **Fast (0):** under 38.1 minutes
- 🟑 **Normal (1):** 38.1 – 51.4 minutes
- πŸ”΄ **Slow (2):** over 51.4 minutes
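
Applied to a single duration, the thresholds above amount to the small mapping sketched here. The function name and the handling of values exactly on a boundary are assumptions made for this example; the input is in seconds, as in `delivery_duration_seconds`.

```python
# Cut points in minutes, as listed above.
FAST_BELOW_MIN = 38.1
SLOW_ABOVE_MIN = 51.4
LABELS = ("Fast", "Normal", "Slow")


def delivery_class(duration_seconds: float) -> int:
    """Map a delivery duration in seconds to 0 (Fast), 1 (Normal) or 2 (Slow)."""
    minutes = duration_seconds / 60
    if minutes < FAST_BELOW_MIN:
        return 0
    if minutes <= SLOW_ABOVE_MIN:
        return 1
    return 2


for seconds in (1800, 2700, 3600):
    print(seconds, LABELS[delivery_class(seconds)])
```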
---
## 🛠️ Tech Stack
![Python](https://img.shields.io/badge/Python-3.12-blue)
![scikit-learn](https://img.shields.io/badge/scikit--learn-latest-orange)
![pandas](https://img.shields.io/badge/pandas-latest-green)
![numpy](https://img.shields.io/badge/numpy-latest-lightblue)