Surfe Diem Wave Height Forecasting Model v1 (USA Southwest)
Model Description
A LightGBM regression model trained to predict ocean wave heights 6 hours in advance using real-time buoy observations from NOAA's National Data Buoy Center (NDBC).
Developed by: Surfe Diem
Model type: Gradient Boosted Decision Trees (LightGBM)
Language: Python
License: MIT
Intended Use
Primary Use Case
Provide 6-hour wave height forecasts for surf forecasting applications along the California coast (the USA Southwest Pacific region as defined by https://www.ndbc.noaa.gov/).
Out-of-Scope Use
- Forecasts beyond the 6-hour horizon
- Regions outside the California coast (the model was trained only on USA Southwest buoys along the California coast)
- Real-time safety-critical applications without human oversight
Training Data
Source: NOAA NDBC Buoy Spectral Wave Density Data
Stations: 15 NDBC buoys along the California coast: 46011, 46012, 46013, 46014, 46022, 46025, 46026, 46027, 46028, 46042, 46047, 46053, 46054, 46069, 46086
Records: ~1.8M observations (226 Parquet files with aligned standard meteorological (stdmet) and spectral columns)
Features:
- Meteorological: wave height, period, direction, wind speed/direction, pressure, temperature
- Spectral compression: 9 physics-informed features derived from ~150 raw spectral bands
  - Ground swell energy, direction, quality (< 0.07 Hz)
  - Mid-range energy, direction, quality (0.07-0.14 Hz)
  - Wind wave energy, direction, quality (> 0.14 Hz)
- Temporal: 1h, 3h, 6h, 12h lag features
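The spectral compression step can be illustrated with a minimal sketch. It assumes that "energy" means the spectral density summed over each frequency band (density times bandwidth); this card does not give the exact formulas, and the direction and quality features are not reproduced here because they are not defined above. The function and variable names are hypothetical.

```python
# Band edges in Hz, taken from the feature list above.
GROUND_SWELL_MAX_HZ = 0.07
WIND_WAVE_MIN_HZ = 0.14


def band_energies(freqs_hz, density_m2_per_hz, bandwidths_hz):
    """Sum spectral density times bandwidth within each band (units: m^2).

    Assumption: this is one plausible reading of "energy" per band,
    not necessarily the definition used to train the model.
    """
    energy = {"ground_swell": 0.0, "mid_range": 0.0, "wind_wave": 0.0}
    for f, s, df in zip(freqs_hz, density_m2_per_hz, bandwidths_hz):
        if f < GROUND_SWELL_MAX_HZ:
            band = "ground_swell"
        elif f <= WIND_WAVE_MIN_HZ:
            band = "mid_range"
        else:
            band = "wind_wave"
        energy[band] += s * df
    return energy


# Toy spectrum with six bands (real NDBC spectra have many more).
freqs = [0.05, 0.06, 0.10, 0.12, 0.20, 0.30]
density = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
bandwidths = [0.01] * 6
energies = band_energies(freqs, density, bandwidths)
```

Collapsing the raw bands into a few band totals keeps the feature count small while preserving the split between long-period swell and short-period wind waves.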
Split: 80/20 train/test, time-series ordered (no shuffle)
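As an illustration of a time-ordered split, the sketch below sorts by timestamp and cuts at the 80% mark so that no test row precedes a training row. The column name `time` and the function name are assumptions, and the split is made on global time across all stations, which may differ from the actual pipeline.

```python
import pandas as pd


def time_ordered_split(df: pd.DataFrame, time_col: str = "time", train_frac: float = 0.8):
    """Return (train, test) with the earliest rows in train and the latest in test."""
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    cutoff = int(len(ordered) * train_frac)
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Ten hourly observations as a toy example.
times = pd.Timestamp("2024-01-01") + pd.to_timedelta(list(range(10)), unit="h")
obs = pd.DataFrame({
    "time": times,
    "wvht": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9],
})
train, test = time_ordered_split(obs)
```

Shuffling before the split would let the model see observations that come after the ones it is tested on, which inflates the measured accuracy of a forecasting model.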
Model Performance
Test MAE: 0.19 meters (~7.5 inches)
Evaluated on held-out 2024 data with proper time-series validation (train on past, test on future).
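For reference, mean absolute error is the average of the absolute differences between observed and predicted wave heights. The sketch below uses made-up numbers to show the calculation and the meters-to-inches conversion; it is not the evaluation script used for this model.

```python
METERS_PER_INCH = 0.0254


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only (meters).
observed = [1.0, 2.0, 3.0]
predicted = [1.5, 2.0, 2.0]
mae_m = mean_absolute_error(observed, predicted)

# Converting the reported test MAE of 0.19 m to inches.
reported_mae_in = 0.19 / METERS_PER_INCH
```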
Training Details
Algorithm: LightGBM
Objective: Regression (L2 loss)
Learning rate: 0.05
Max depth: 5
Num iterations: 170
Early stopping: 10 rounds
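The hyperparameters above map onto LightGBM's parameter names roughly as sketched below. This is an assumption-laden illustration, not the training script: the `metric` setting, the function name, and the reading of 170 as the number of boosting rounds passed to the trainer are all guesses, since the card does not say whether 170 is the configured cap or the round at which early stopping halted.

```python
# Hyperparameters from this card, written with LightGBM's parameter names.
PARAMS = {
    "objective": "regression",  # LightGBM's L2 (squared error) objective
    "learning_rate": 0.05,
    "max_depth": 5,
    "metric": "l1",             # assumption: monitor MAE on the validation set
    "verbose": -1,
}
NUM_BOOST_ROUND = 170
EARLY_STOPPING_ROUNDS = 10


def train_booster(X_train, y_train, X_valid, y_valid):
    """Train a booster with early stopping on a validation set (hypothetical helper)."""
    # Imported here so the configuration can be inspected without LightGBM installed.
    import lightgbm as lgb

    train_set = lgb.Dataset(X_train, label=y_train)
    valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)
    return lgb.train(
        PARAMS,
        train_set,
        num_boost_round=NUM_BOOST_ROUND,
        valid_sets=[valid_set],
        callbacks=[lgb.early_stopping(stopping_rounds=EARLY_STOPPING_ROUNDS)],
    )
```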
Feature engineering:
- Station IDs encoded as fixed categorical dtype for inference consistency
- Lag features filled with 0 for single-observation inference
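The two feature engineering steps above can be sketched as follows. The column names (`time`, `wvht_lag_1h`, and so on) and the helper name are hypothetical, and the lag computation assumes one row per station per hour so that shifting by h rows corresponds to an h-hour lag; the actual pipeline may align lags by timestamp instead.

```python
import pandas as pd

# Station list copied from the Training Data section of this card.
STATIONS = ["46011", "46012", "46013", "46014", "46022", "46025", "46026", "46027",
            "46028", "46042", "46047", "46053", "46054", "46069", "46086"]
# A fixed category list keeps the integer codes identical at training and inference time.
STATION_DTYPE = pd.CategoricalDtype(categories=STATIONS)
LAG_HOURS = (1, 3, 6, 12)


def add_feature_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Cast station_id to the fixed categorical dtype and add zero-filled lag columns."""
    out = df.sort_values(["station_id", "time"]).copy()
    out["station_id"] = out["station_id"].astype(STATION_DTYPE)
    for h in LAG_HOURS:
        # Missing history (including the single-observation case) is filled with 0.
        out[f"wvht_lag_{h}h"] = (
            out.groupby("station_id", observed=True)["wvht"].shift(h).fillna(0.0)
        )
    return out


# A single real-time observation has no history, so every lag column is 0.
single = pd.DataFrame({
    "station_id": ["46025"],
    "time": pd.to_datetime(["2024-06-01 12:00"]),
    "wvht": [2.5],
})
features = add_feature_columns(single)
```

Without the fixed category list, pandas would infer categories from whatever stations appear in a given frame, so the same station could receive a different code at inference time than it had during training.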
How to Use
import lightgbm as lgb
import pandas as pd
# Load model
model = lgb.Booster(model_file="surfe_diem_v1_usa_southwest_model.txt")
# Prepare observation (example)
obs = pd.DataFrame({
'wvht': [2.5], 'dpd': [12.0], 'apd': [8.5],
'mwd': [270], 'wspd': [15.0], 'wdir': [280],
'pres': [1013.0], 'atmp': [18.0], 'wtmp': [16.0],
# ... + spectral features + lag features + station_id
})
# Predict
wave_height_6h = model.predict(obs)[0]
The full inference pipeline is available in the GitHub repo.
Limitations
- No history for single observations: Lag features are set to 0 when only one observation is available at inference time, which degrades accuracy slightly
- Regional specificity: Trained only on California buoys
- Forecast horizon: 6 hours only (not optimized for longer/shorter horizons)
- Spectral dependency: Requires spectral data (not all buoys provide this)
Citation
@misc{surfediem2025wave,
author = {Surfe Diem},
title = {Wave Height Forecasting Model v1 - USA Southwest},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/datasets/surfe-diem/wave-archive-USA-southwest}}
}
Model Card Contact
For questions or issues, please open an issue in the GitHub repository.