---
language:
- en
license: mit
tags:
- tabular-regression
- tabular-classification
- doordash
- delivery-time
---
# 🚀 DoorDash Delivery Time Predictor

## 📋 Project Overview

This project analyzes and predicts DoorDash delivery times using historical order data. We treat the problem as both a **regression task** (predict the exact duration) and a **classification task** (predict Fast / Normal / Slow delivery).

---

## 📦 Dataset

| Property | Value |
|----------|-------|
| Source | Kaggle: DoorDash Historical Delivery Data |
| Rows | 197,428 orders |
| Features | 16 columns (numeric + categorical) |
| Target | `delivery_duration_seconds` |

**Key Features:**

- `total_onshift_dashers`: number of dashers on shift
- `total_busy_dashers`: number of dashers currently occupied
- `total_outstanding_orders`: current order backlog
- `store_primary_category`: type of restaurant
- `estimated_store_to_consumer_driving_duration`: estimated driving time

![EDA](eda_plots.png)

---

## 🔍 Research Question

> *Can we predict how long a DoorDash delivery will take based on order details,
> restaurant category, and real-time dasher availability?*

---

## ⚙️ Feature Engineering

We created 9 new features from the raw data:

- **`dasher_util_ratio`**: share of on-shift dashers who are busy (busy / on-shift)
- **`order_pressure`**: outstanding orders per available dasher
- **`is_peak_lunch`** / **`is_peak_dinner`**: peak-hour flags
- **`is_weekend`**: weekend flag
- **`log_subtotal`**: log-transformed order value
- **`price_spread`**: range between the cheapest and most expensive item
- **`dasher_idle`**: on-shift dashers who are not busy
- **`cluster`**: KMeans cluster ID (5 clusters)

---

## 🤖 Models

### Regression

| Model | MAE | RMSE | R² |
|-------|-----|------|----|
| Baseline Linear Regression | 11.2 min | 14.7 min | 0.233 |
| Linear Regression (Engineered) | 11.2 min | 14.7 min | 0.240 |
| Random Forest | 10.8 min | 14.1 min | 0.294 |
| **Gradient Boosting ✓** | **10.5 min** | **13.9 min** | **0.319** |

### Classification (Fast / Normal / Slow)

| Model | Macro F1 |
|-------|----------|
| Logistic Regression | 0.502 |
| Random Forest | 0.518 |
| **Gradient Boosting ✓** | **0.533** |

---

## 🏆 Winners

- **Regression:** `GradientBoostingRegressor` (R² = 0.319)
- **Classification:** `GradientBoostingClassifier` (Macro F1 = 0.533)

---

## 📁 Repository Files

| File | Description |
|------|-------------|
| `regression_model.pkl` | Trained GradientBoostingRegressor + scaler + features |
| `classification_model.pkl` | Trained GradientBoostingClassifier + thresholds |
| `notebook.ipynb` | Full analysis, EDA, training and evaluation code |

---

## 🏷️ Classification Strategy

Delivery times are split into three equal-sized quantile bins (terciles):

- 🟢 **Fast (0):** under 38.1 minutes
- 🟡 **Normal (1):** 38.1 – 51.4 minutes
- 🔴 **Slow (2):** over 51.4 minutes

---

## 🛠️ Tech Stack

![Python](https://img.shields.io/badge/Python-3.12-blue)
![scikit-learn](https://img.shields.io/badge/scikit--learn-latest-orange)
![pandas](https://img.shields.io/badge/pandas-latest-green)
![numpy](https://img.shields.io/badge/numpy-latest-lightblue)
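
## 🧪 Sketch: Engineered Features

The features listed under Feature Engineering could be derived roughly as follows. This is a minimal sketch, not the code from `notebook.ipynb`. The column names `created_at`, `subtotal`, `min_item_price` and `max_item_price`, the peak-hour windows, and the use of `log1p` are all assumptions made for illustration. The `cluster` feature is left out because this README does not say which inputs KMeans was fit on.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with eight of the nine engineered columns added."""
    out = df.copy()

    # Treat zero on-shift dashers as missing so the ratios become NaN, not inf.
    onshift = out["total_onshift_dashers"].where(out["total_onshift_dashers"] > 0)
    out["dasher_util_ratio"] = out["total_busy_dashers"] / onshift
    out["order_pressure"] = out["total_outstanding_orders"] / onshift
    out["dasher_idle"] = out["total_onshift_dashers"] - out["total_busy_dashers"]

    # ASSUMPTION: the order timestamp lives in `created_at`, and the
    # lunch (11:00-13:59) and dinner (17:00-20:59) windows are guesses.
    created = pd.to_datetime(out["created_at"])
    hour = created.dt.hour
    out["is_peak_lunch"] = hour.between(11, 13).astype(int)
    out["is_peak_dinner"] = hour.between(17, 20).astype(int)
    out["is_weekend"] = (created.dt.dayofweek >= 5).astype(int)  # Sat=5, Sun=6

    # ASSUMPTION: `subtotal`, `min_item_price`, `max_item_price` column names;
    # log1p is used here so that a zero subtotal stays finite.
    out["log_subtotal"] = np.log1p(out["subtotal"].clip(lower=0))
    out["price_spread"] = out["max_item_price"] - out["min_item_price"]
    return out


# Two made-up orders, only to show the function running end to end.
orders = pd.DataFrame({
    "created_at": ["2015-02-06 12:30:00", "2015-02-07 19:05:00"],
    "total_onshift_dashers": [20, 0],
    "total_busy_dashers": [15, 0],
    "total_outstanding_orders": [30, 4],
    "subtotal": [3000, 1500],
    "min_item_price": [500, 700],
    "max_item_price": [1200, 800],
})
features = add_engineered_features(orders)
print(features[["dasher_util_ratio", "order_pressure", "is_weekend"]])
```

Masking zero on-shift dashers before dividing keeps infinite values out of the ratio columns; the resulting NaNs would still need imputing before model training.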
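
## 🧪 Sketch: Fast / Normal / Slow Labels

The tercile split described under Classification Strategy can be illustrated with the standard library alone. This is a sketch under stated assumptions: the function names are hypothetical, the boundary handling (both cut points count as Normal) follows the wording of the list above, and the notebook may compute its thresholds differently.

```python
from statistics import quantiles

LABELS = ("Fast", "Normal", "Slow")


def tercile_thresholds(durations_min):
    """Return the two cut points that split the durations into three equal groups."""
    lower, upper = quantiles(durations_min, n=3, method="inclusive")
    return lower, upper


def speed_class(duration_min, thresholds=(38.1, 51.4)):
    """Map a duration in minutes to 0 (Fast), 1 (Normal) or 2 (Slow).

    The default thresholds are the values reported in this README.
    """
    lower, upper = thresholds
    if duration_min < lower:
        return 0
    if duration_min <= upper:
        return 1
    return 2


# Made-up durations in minutes, only to show how cut points are derived.
sample = [30, 35, 40, 45, 50, 55, 60]
print(tercile_thresholds(sample))

# Classify a few durations with the README thresholds.
for minutes in (30.0, 45.0, 60.0):
    print(minutes, LABELS[speed_class(minutes)])
```

Because the target column is `delivery_duration_seconds`, durations would need to be divided by 60 before being compared with these minute-based thresholds.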