Surfe Diem Wave Height Forecasting Model v1 (USA Southwest)

Model Description

A LightGBM regression model trained to predict ocean wave heights 6 hours in advance using real-time buoy observations from NOAA's National Data Buoy Center (NDBC).

Developed by: Surfe Diem
Model type: Gradient Boosted Decision Trees (LightGBM)
Language: Python
License: MIT

Intended Use

Primary Use Case

Provide 6-hour wave height forecasts for surf forecasting applications along the California coast (the USA Southwest Pacific region as defined by NDBC, https://www.ndbc.noaa.gov/).

Out-of-Scope Use

  • Forecast horizons beyond 6 hours
  • Regions outside the California coast (the training set is limited to USA Southwest buoys on the California coast)
  • Real-time safety-critical applications without human oversight

Training Data

Source: NOAA NDBC Buoy Spectral Wave Density Data

Stations: 15 NDBC buoys along the California coast: 46011, 46012, 46013, 46014, 46022, 46025, 46026, 46027, 46028, 46042, 46047, 46053, 46054, 46069, 46086

Records: ~1.8M observations (226 Parquet files with aligned stdmet and spectral columns)

Features:

  • Meteorological: wave height, period, direction, wind speed/direction, pressure, temperature
  • Spectral compression: 9 physics-informed features derived from ~150 raw spectral bands
    • Ground swell energy, direction, quality (< 0.07 Hz)
    • Mid-range energy, direction, quality (0.07-0.14 Hz)
    • Wind wave energy, direction, quality (> 0.14 Hz)
  • Temporal: 1h, 3h, 6h, 12h lag features
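The spectral compression step above can be illustrated with a minimal sketch. It assumes each observation arrives as arrays of band frequencies (Hz), spectral density (m^2/Hz), and per-band mean direction (degrees); the function and key names are hypothetical and are not taken from the Surfe Diem pipeline. Energy is approximated by trapezoid integration over each band, and direction by an energy-weighted circular mean. The "quality" feature is omitted because the model card does not define it.

```python
import numpy as np

# Band edges in Hz, taken from the feature list above.
BANDS = {
    "ground_swell": (0.0, 0.07),
    "mid_range": (0.07, 0.14),
    "wind_wave": (0.14, np.inf),
}


def band_energy(freqs_hz, density, lo, hi):
    """Integrate spectral density over lo <= f < hi with the trapezoid rule."""
    mask = (freqs_hz >= lo) & (freqs_hz < hi)
    if mask.sum() < 2:
        return 0.0
    f, s = freqs_hz[mask], density[mask]
    return float(np.sum(0.5 * (s[1:] + s[:-1]) * np.diff(f)))


def band_direction(freqs_hz, density, direction_deg, lo, hi):
    """Energy-weighted circular mean direction (degrees) within the band."""
    mask = (freqs_hz >= lo) & (freqs_hz < hi)
    weights = density[mask]
    if weights.sum() == 0:
        return float("nan")
    theta = np.deg2rad(direction_deg[mask])
    mean = np.arctan2(np.sum(weights * np.sin(theta)),
                      np.sum(weights * np.cos(theta)))
    return float(np.rad2deg(mean) % 360.0)


def compress_spectrum(freqs_hz, density, direction_deg):
    """Reduce a raw spectrum to per-band energy and direction features."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    density = np.asarray(density, dtype=float)
    direction_deg = np.asarray(direction_deg, dtype=float)
    features = {}
    for name, (lo, hi) in BANDS.items():
        features[f"{name}_energy"] = band_energy(freqs_hz, density, lo, hi)
        features[f"{name}_direction"] = band_direction(
            freqs_hz, density, direction_deg, lo, hi)
    return features
```

Working per band rather than per raw frequency bin keeps the feature count fixed even if buoys report slightly different frequency grids, which is one plausible reason for compressing ~150 bands before training.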

Split: 80/20 train/test, time-series ordered (no shuffle)
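A time-ordered split of this kind can be sketched as follows. The column name `timestamp` and the helper `chronological_split` are assumptions for illustration; the model card does not say how the split is implemented.

```python
import numpy as np
import pandas as pd


def chronological_split(df, time_col="timestamp", train_frac=0.8):
    """Sort by time and cut once, so no test row is earlier than any training row."""
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Ten hourly observations, deliberately given in reverse order.
times = pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(10), unit="h")
demo = pd.DataFrame({"timestamp": times, "wvht": np.linspace(1.0, 2.0, 10)})
train_df, test_df = chronological_split(demo.iloc[::-1])
```

Because the rows are never shuffled, the held-out portion mimics real use: the model is scored only on observations that come after everything it was trained on.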

Model Performance

Test MAE: 0.19 meters (~7.5 inches)

Evaluated on held-out 2024 data with proper time-series validation (train on past, test on future).
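For reference, mean absolute error is the average of |observed - predicted| over the test set. The sketch below uses made-up numbers purely to show the calculation and the meters-to-inches conversion; it does not reproduce the reported result.

```python
import numpy as np

METERS_TO_INCHES = 39.3701


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Illustrative values only (meters), not taken from the test set.
observed = [1.0, 2.0, 1.5]
predicted = [1.2, 1.8, 1.7]
mae_m = mean_absolute_error(observed, predicted)

# Converting the reported test MAE of 0.19 m to inches.
reported_mae_in = 0.19 * METERS_TO_INCHES
```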

Training Details

Algorithm: LightGBM
Objective: Regression (L2 loss)
Learning rate: 0.05
Max depth: 5
Num iterations: 170
Early stopping: 10 rounds
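The settings above map onto LightGBM's Python API roughly as follows. This is a sketch, not the project's training script: the `train` helper and its argument names are hypothetical, and it assumes that 170 is the iteration cap passed as `num_boost_round` (the model card does not say whether 170 is the cap or the iteration at which early stopping ended training).

```python
# Hyperparameters as listed in the model card.
PARAMS = {
    "objective": "regression",  # L2 loss
    "learning_rate": 0.05,
    "max_depth": 5,
}
NUM_ITERATIONS = 170          # assumed to be the cap on boosting rounds
EARLY_STOPPING_ROUNDS = 10


def train(X_train, y_train, X_valid, y_valid):
    """Fit a booster with early stopping on a validation set."""
    # Imported here so the settings above can be read without LightGBM installed.
    import lightgbm as lgb

    train_set = lgb.Dataset(X_train, label=y_train)
    valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)
    return lgb.train(
        PARAMS,
        train_set,
        num_boost_round=NUM_ITERATIONS,
        valid_sets=[valid_set],
        callbacks=[lgb.early_stopping(stopping_rounds=EARLY_STOPPING_ROUNDS)],
    )
```

With early stopping, training halts once the validation metric has not improved for 10 consecutive rounds, so the final model may contain fewer trees than the cap.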

Feature engineering:

  • Station IDs encoded as a fixed categorical dtype, so category codes match between training and inference
  • Lag features filled with 0 for single-observation inference
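Both steps can be sketched with pandas. The lag column names (for example `wvht_lag_1h`) and the helper `add_station_and_lags` are hypothetical, and the sketch assumes one row per station per hour, already sorted by time within each station, so that shifting by n rows corresponds to an n-hour lag.

```python
import pandas as pd

# The 15 training stations, in a fixed order so category codes never change.
STATION_IDS = [
    "46011", "46012", "46013", "46014", "46022",
    "46025", "46026", "46027", "46028", "46042",
    "46047", "46053", "46054", "46069", "46086",
]
STATION_DTYPE = pd.CategoricalDtype(categories=STATION_IDS)
LAG_HOURS = (1, 3, 6, 12)


def add_station_and_lags(df, value_col="wvht"):
    """Encode station_id with the fixed dtype and add zero-filled lag columns."""
    out = df.copy()
    out["station_id"] = out["station_id"].astype(STATION_DTYPE)
    grouped = out.groupby("station_id", observed=True)[value_col]
    for hours in LAG_HOURS:
        # Rows without enough history get NaN from shift(); fill with 0.
        out[f"{value_col}_lag_{hours}h"] = grouped.shift(hours).fillna(0.0)
    return out


# Single-observation inference: no history, so every lag is 0.
single = add_station_and_lags(pd.DataFrame({"station_id": ["46025"], "wvht": [2.5]}))
```

Declaring the categories up front matters because a categorical inferred from a single-row DataFrame would contain only one category, and its integer code would not match the code the model saw for that station during training.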

How to Use

import lightgbm as lgb
import pandas as pd

# Load model
model = lgb.Booster(model_file="surfe_diem_v1_usa_southwest_model.txt")

# Prepare observation (example)
obs = pd.DataFrame({
    'wvht': [2.5], 'dpd': [12.0], 'apd': [8.5],
    'mwd': [270], 'wspd': [15.0], 'wdir': [280],
    'pres': [1013.0], 'atmp': [18.0], 'wtmp': [16.0],
    # ... + spectral features + lag features + station_id
})

# Predict
wave_height_6h = model.predict(obs)[0]

The full inference pipeline is available in the GitHub repo.

Limitations

  • No history for single observations: Lag features set to 0 for real-time inference (degrades accuracy slightly)
  • Regional specificity: Trained only on California buoys
  • Forecast horizon: 6 hours only (not optimized for longer/shorter horizons)
  • Spectral dependency: Requires spectral data (not all buoys provide this)

Citation

@misc{surfediem2025wave,
  author = {Surfe Diem},
  title = {Wave Height Forecasting Model v1 - USA Southwest},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/surfe-diem/wave-archive-USA-southwest}}
}

Model Card Contact

For questions or issues, please open an issue in the GitHub repository.


Dataset used to train scroobio/surfe-diem-wave-forecast-v1-usa-southwest