🚕 NYC Taxi Fare Regression (PyTorch ANN)

A small PyTorch tabular regression model that predicts NYC taxi fare (USD) from pickup time, pickup/dropoff coordinates, and passenger count.

This Hugging Face repo stores the trained model weights + preprocessing schema used by a Streamlit inference app.

Training → Model → Inference

Training notebook (Colab): https://colab.research.google.com/drive/1474z1TaWSAZFNShG3msPQfMgH_VU3Ol_
Inference app (Streamlit): https://github.com/sparklerz/Deep-Learning-Fundamentals-Suite
(page: pages/02_NYC_Taxi_Fare_Regression.py)

What’s in this repo

artifacts/model_state.pt — PyTorch model state dict (ANN with embeddings)
artifacts/schema.json — feature schema, category lists, embedding sizes, cont mean/std
artifacts/metrics.json — validation metrics (RMSE/MAE)
artifacts/sample_rows.csv — small sample used by the Streamlit UI “load a row”
NYCTaxiFares.csv — dataset file used in training

Inputs

The model uses:

Categorical (embedded)

Hour (0–23)
AMorPM (am / pm)
Weekday (Mon…Sun)

Continuous

pickup_latitude, pickup_longitude
dropoff_latitude, dropoff_longitude
passenger_count (1–6)
dist_km (computed using haversine distance)
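
The dist_km feature can be sketched with the standard haversine formula. This is a minimal illustration, not the notebook's exact code: the function name haversine_km, the Earth radius constant, and the example coordinates are assumptions.

```python
import math

# Mean Earth radius in km (assumed; the training notebook may use a slightly different value).
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative pickup/dropoff pair (roughly Times Square to JFK airport).
dist_km = haversine_km(40.7580, -73.9855, 40.6413, -73.7781)
```

Because the model was trained on distances computed this way, inference code should compute dist_km from the same four coordinates rather than accept a user-supplied distance.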

Preprocessing (same as training / Streamlit app)

dist_km is computed from pickup/dropoff lat/lon using the haversine formula.
Continuous features are standardized using cont_mean / cont_std stored in schema.json.
Categorical values are converted to integer codes using the fixed category lists in schema.json (cat_categories).
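
The standardization and category-encoding steps above can be sketched as follows. This is a hedged example: the helper names are hypothetical, the toy schema values are made up, and it assumes cat_categories is a dict mapping each categorical column to its ordered list of categories, with the integer code being the position in that list. Check the real schema.json before relying on this layout.

```python
def standardize(values, mean, std):
    """Apply (x - mean) / std element-wise, in the order given by cont_cols."""
    return [(x - m) / s for x, m, s in zip(values, mean, std)]


def encode_categories(row, cat_cols, cat_categories):
    """Map each categorical value to its index in the fixed category list."""
    return [cat_categories[col].index(row[col]) for col in cat_cols]


# Toy schema with made-up numbers; the real values and category order come from schema.json.
toy_schema = {
    "cat_cols": ["Hour", "AMorPM", "Weekday"],
    "cat_categories": {
        "Hour": list(range(24)),
        "AMorPM": ["am", "pm"],
        "Weekday": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    },
    "cont_cols": ["passenger_count", "dist_km"],
    "cont_mean": [1.5, 3.0],
    "cont_std": [1.0, 2.0],
}

row = {"Hour": 14, "AMorPM": "pm", "Weekday": "Wed", "passenger_count": 2, "dist_km": 5.0}

cat_codes = encode_categories(row, toy_schema["cat_cols"], toy_schema["cat_categories"])
cont_values = standardize(
    [row[c] for c in toy_schema["cont_cols"]],
    toy_schema["cont_mean"],
    toy_schema["cont_std"],
)
```

A value that is missing from the category list raises ValueError here, which is usually preferable to silently feeding an out-of-range index to an embedding layer.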

Training notes (from notebook):

Hour / AMorPM / Weekday were derived from pickup_datetime after a ~4-hour timezone shift (stored as timezone_shift_hours=4 in schema.json).
Basic trimming was applied during training (fare, passenger_count, distance ranges) to reduce outliers.
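
One way to derive the three time features is sketched below. It assumes pickup_datetime is in UTC and that the shift is subtracted to approximate New York local time; the direction of the shift and the helper name time_features are assumptions, so confirm them against the notebook.

```python
from datetime import datetime, timedelta

# Index matches datetime.weekday(), where Monday is 0.
WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def time_features(pickup_utc, shift_hours=4):
    """Derive Hour, AMorPM and Weekday from a pickup timestamp.

    Assumption: the shift is subtracted (UTC to approximate local time).
    """
    local = pickup_utc - timedelta(hours=shift_hours)
    return {
        "Hour": local.hour,
        "AMorPM": "am" if local.hour < 12 else "pm",
        "Weekday": WEEKDAYS[local.weekday()],
    }


# 02:00 on Monday 2010-04-19 becomes 22:00 on Sunday after the shift.
features = time_features(datetime(2010, 4, 19, 2, 0, 0))
```

The example shows why the shift matters: without it, a late Sunday evening ride would be encoded as an early Monday morning one.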

Output

A single float: predicted taxi fare (USD).

Metrics

From artifacts/metrics.json (validation split):

RMSE: 2.8648
MAE: 1.4056
Rows used (after cleaning): 119,602
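
For reference, the two metrics follow the usual definitions, sketched here on made-up fares (not taken from the dataset). RMSE is never smaller than MAE, and a gap like the one reported above typically means a minority of trips have much larger errors than the rest.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: the average size of the error in USD."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only.
y_true = [10.0, 6.5, 20.0]
y_pred = [9.0, 7.5, 16.0]

example_rmse = rmse(y_true, y_pred)
example_mae = mae(y_true, y_pred)
```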

Quickstart (load model + schema)

import json
import torch
import numpy as np
from huggingface_hub import hf_hub_download

REPO_ID = "ash001/nyc-taxi-fare-regression-ann"

schema_path = hf_hub_download(REPO_ID, "artifacts/schema.json")
state_path  = hf_hub_download(REPO_ID, "artifacts/model_state.pt")

with open(schema_path, "r", encoding="utf-8") as f:
    schema = json.load(f)

cat_cols = schema["cat_cols"]
cont_cols = schema["cont_cols"]
cat_categories = schema["cat_categories"]
cont_mean = np.array(schema["cont_mean"], dtype=np.float32)
cont_std  = np.array(schema["cont_std"], dtype=np.float32)

# Define the same model class used in the inference app, instantiate it as `model`, then:
# model.load_state_dict(torch.load(state_path, map_location="cpu"))
# model.eval()

For an end-to-end prediction example (feature building, distance, standardization), see the Streamlit inference implementation in the app repository linked above (pages/02_NYC_Taxi_Fare_Regression.py).
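
Since the model class itself is not included in this repo, it can help to list the parameter names and shapes in the state dict before re-creating the class, because load_state_dict expects them to match. The helper below is a hypothetical convenience, not part of the repo; it works on any mapping whose values expose a shape attribute, which is true of the tensors returned by torch.load.

```python
def summarize_state_dict(state):
    """Return (parameter name, shape) pairs for a loaded state dict."""
    return [(name, tuple(tensor.shape)) for name, tensor in state.items()]


# Usage, continuing from the quickstart (state_path comes from hf_hub_download):
# state = torch.load(state_path, map_location="cpu")
# for name, shape in summarize_state_dict(state):
#     print(name, shape)
```

Embedding weights appear as 2-D entries whose first dimension is the number of categories, which should agree with the category lists and embedding sizes in schema.json.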

license: apache-2.0
