Toto inference: how the demo makes its forecasts

This is a precise spec of every Toto-related knob used by app.py so the post / footnote can quote it accurately.

Model


Model ID	`Datadog/Toto-2.0-22m`
Parameters	~22 M
Source	https://huggingface.co/Datadog/Toto-2.0-22m
Loaded via	`Toto2Model.from_pretrained(...)` (the `toto-2` package from DataDog/toto's `toto2/` subdir, pinned in `requirements.txt`)
Hardware	CPU (HF Space free tier — no GPU)
Patch size	`model.config.patch_size` (read at runtime; the context-length truncate/pad logic adapts automatically if the value differs across variants)

We started on the 4 M variant for the demo's "weakest model still says something useful" angle, then bumped to 22 M for visibly tighter confidence bands and lower scoreboard MAE — still small enough to run in sub-second CPU latency on the free HF Space tier.

Input data


Source	Ecowitt Cloud API v3 (`/device/history`)
Station	Ecowitt GW3000B, Westhampton Beach NY
Channels forecasted	`outdoor.temperature` (°F), `outdoor.humidity` (%), `pressure.relative` (inHg), `rainfall_piezo.rain_rate` (in/hr)
Native storage cadence	5 min at `cycle_type=5min` (the device is configured to upload at ~1-min intervals; Ecowitt buckets to 5 min for the 90-day tier). Earlier defaults of 30 min were the device's out-of-box upload schedule.
`cycle_type` requested	`5min` — finest tier the API exposes. The data lives in `data/ecowitt.db` (synced incrementally every 15 min); each refresh reads from the archive rather than hitting Ecowitt live.
History window pulled from the archive	7 days
Resampling for the chart	`df.resample("5min").mean()` — fine-grained display
Resampling for Toto inference	`df.resample("1h").mean()`: a coarser series, so the model receives about 168 hourly points (truncated to a 160-point context, see Context length) plus a 48-step horizon, the regime where it forecasts cleanly. Decoupling the chart cadence from the model cadence keeps the visual fine and the model output honest.
Cleaning	`Series.interpolate(limit_direction="both")` fills resample gaps before the tensor goes to Toto
NWS comparison	`https://api.weather.gov/points/{lat},{lon}` → `forecastHourly` (point forecast, no distribution)
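
The two resampling paths and the cleaning step can be sketched with pandas. This is a minimal illustration on synthetic data: the `temperature` column name, the values, and the simulated outage are assumptions, not the contents of `data/ecowitt.db`.

```python
import numpy as np
import pandas as pd

# Synthetic 7-day archive at 5-min cadence (hypothetical values, not station data).
index = pd.date_range("2024-01-01", periods=7 * 24 * 12, freq="5min")
df = pd.DataFrame({"temperature": np.linspace(30.0, 50.0, len(index))}, index=index)

# Simulate a sync outage by dropping three hours of readings.
df = df.drop(df.index[100:136])

chart_df = df.resample("5min").mean()   # fine-grained series for the chart
model_df = df.resample("1h").mean()     # coarser series for Toto

# Fill the gaps the resample exposed, including any leading or trailing ones.
model_series = model_df["temperature"].interpolate(limit_direction="both")
```

In this sketch the outage leaves two hourly buckets empty, and the interpolation fills both before anything would be handed to the model.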

Context length

Toto requires the time axis of the input tensor to be a multiple of patch_size, which is 32 here (app.py reads it from model.config.patch_size at runtime).

raw = list(history_series_after_resample_and_interpolate)
n_raw = len(raw)
if n_raw >= 32:
    n_ctx = (n_raw // 32) * 32                 # round down to whole patches
    target = raw[-n_ctx:]                      # truncate oldest points
    target_mask = [True]*n_ctx                 # all valid
else:
    n_ctx = 32                                 # pad up to one patch
    pad = 32 - n_raw
    target = [raw[0]]*pad + raw                # left-pad with the first value
    target_mask = [False]*pad + [True]*n_raw   # tell Toto to ignore the padded steps

With ~7 days of archive history and the hourly resample we use for inference, this gives a context of 160 hourly points (5 patches). The chart shows the same 7 days at 5-min cadence (≈2 016 points) — but those raw points only feed the chart, not the model.
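
As a worked check of the arithmetic above, here is a small NumPy sketch of the truncate/pad rule with the patch size as a parameter. The function name `build_context` is hypothetical and NumPy arrays stand in for the torch tensors; it is not the code in app.py.

```python
import numpy as np

def build_context(raw, patch_size=32):
    """Return (target, target_mask) whose length is a multiple of patch_size."""
    raw = np.asarray(raw, dtype=np.float32)
    n_raw = len(raw)
    if n_raw == 0:
        raise ValueError("need at least one observation")
    if n_raw >= patch_size:
        n_ctx = (n_raw // patch_size) * patch_size
        target = raw[-n_ctx:]                     # keep the newest n_ctx points
        mask = np.ones(n_ctx, dtype=bool)         # every step is a real reading
    else:
        pad = patch_size - n_raw
        target = np.concatenate([np.full(pad, raw[0], dtype=np.float32), raw])
        mask = np.concatenate([np.zeros(pad, dtype=bool), np.ones(n_raw, dtype=bool)])
    return target, mask

# 7 days of hourly points: 168 values are truncated to 160 (5 patches of 32).
target, mask = build_context(np.arange(168))

# A very short history is left-padded up to a single patch.
short_target, short_mask = build_context([5.0, 6.0, 7.0])
```

With 168 inputs the eight oldest points are dropped; with three inputs the result has 32 steps, of which only the last three are marked valid.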

Tensor shape

target:      torch.float32, shape (batch=1, n_variates=1, time=n_ctx)
target_mask: torch.bool,    shape (batch=1, n_variates=1, time=n_ctx)
series_ids:  torch.long,    shape (batch=1, n_variates=1)            (all zeros — univariate)

We forecast each metric independently (univariate). Multivariate inference is a follow-up; the inference cost is comparable but the chart gets noisier and the post hook is easier to read one metric at a time.

Prediction length

horizon_steps = round(horizon_hours / step_hours), where:

- step_hours is the forecast cadence (1 h), not the chart cadence (5 min).
- horizon_hours = 48, so horizon_steps = 48.

The 48 hourly predictions are drawn on top of the 5-min historical line. The forecast line is therefore sparser than the actuals, but visually continuous because Plotly connects the anchor points.
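
A short sketch of the step count and the hourly timestamps the predictions would land on. The anchor timestamp `last_context_ts` is a made-up value for illustration, and the assumption that the first prediction falls one step after the last context point is mine.

```python
import pandas as pd

horizon_hours = 48
step_hours = 1            # forecast cadence, not the 5-min chart cadence
horizon_steps = round(horizon_hours / step_hours)

# Hypothetical timestamp of the last hourly context point.
last_context_ts = pd.Timestamp("2024-01-08 00:00")

# One timestamp per predicted step, starting one step after the context ends.
forecast_index = pd.date_range(
    start=last_context_ts + pd.Timedelta(hours=step_hours),
    periods=horizon_steps,
    freq=pd.Timedelta(hours=step_hours),
)
```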

Distribution → quantiles

We do not Monte-Carlo sample. Toto's output head is a parametric Student-t mixture (see the Toto 2.0 paper), and model.forecast() returns analytical quantiles directly:

quantiles = model.forecast(
    {"target": target, "target_mask": target_mask, "series_ids": series_ids},
    horizon=horizon_steps,
)
# quantiles shape: (9, batch=1, n_variates=1, horizon)
# quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

We pluck three of them for the chart and the scoreboard:

Display	Index	Quantile
Lower edge of the shaded band	`quantiles[0, 0, 0]`	p10
Toto median line	`quantiles[4, 0, 0]`	p50
Upper edge of the shaded band	`quantiles[8, 0, 0]`	p90

The shaded band is therefore the 80 % central interval (p10–p90). The "±X°F at +24 h" chip on the hero is half of (p90 − p10) at the last forecast step.
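
The indexing in the table can be illustrated on a stand-in array. The values below are synthetic (a flat 40 °F median with a band that widens over the horizon), and a NumPy array is used in place of the tensor returned by the model.

```python
import numpy as np

horizon = 48
levels = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

# Stand-in for the forecast output: shape (9, batch=1, n_variates=1, horizon).
spread = np.linspace(1.0, 5.0, horizon)
quantiles = 40.0 + (levels - 0.5)[:, None, None, None] * spread[None, None, None, :]

p10 = quantiles[0, 0, 0]   # lower edge of the shaded band
p50 = quantiles[4, 0, 0]   # median line
p90 = quantiles[8, 0, 0]   # upper edge of the shaded band

# Half of (p90 - p10) at the last forecast step, as used for the "±X" chip.
half_width_last = (p90[-1] - p10[-1]) / 2
```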

Inference cadence


Trigger	a daemon thread inside the Space (`_autorefresh_loop`) and `demo.load` on a visitor's first request
Interval	every 15 minutes
Cache TTL	14 minutes — slightly less than the autorefresh interval so the next tick always misses the cache and refetches Ecowitt + NWS
Per-tick cost	one `/device/real_time` + one `/device/history` per cycle_type touched + one NWS `/points` + one NWS `forecastHourly` + four univariate Toto forwards (one per metric)
CPU forward time	~hundreds of milliseconds per metric on the free CPU tier; total wallclock per refresh is dominated by the network calls, not the model
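
To see why a 14-minute TTL pairs with a 15-minute tick, here is a minimal single-value cache sketch with an injectable clock. The `TTLCache` class is hypothetical and only illustrates the timing; the caching code in app.py may be structured differently.

```python
import time

class TTLCache:
    """Hold one value and report it stale once ttl_seconds have elapsed."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._value = None
        self._stored_at = None

    def get(self):
        if self._stored_at is None:
            return None
        if self.clock() - self._stored_at >= self.ttl_seconds:
            return None                  # expired: the caller should refetch
        return self._value

    def set(self, value):
        self._value = value
        self._stored_at = self.clock()

# Fake clock (seconds) so the example runs instantly.
now = [0.0]
cache = TTLCache(ttl_seconds=14 * 60, clock=lambda: now[0])
cache.set("forecast")

now[0] = 5 * 60
visitor_hit = cache.get()    # a visitor 5 min later is served from the cache

now[0] = 15 * 60
tick_hit = cache.get()       # the next 15-min tick finds the entry expired
```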

Persistence

Every refresh writes to data/forecasts.db (SQLite):

forecast_snapshots(forecast_made_at, target_ts, source, metric, p10, p50, p90)
actuals(target_ts, metric, value)

source ∈ {toto, nws}. NWS rows store the point forecast in p50 and leave p10/p90 NULL.
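
A possible shape for these two tables, sketched with the standard-library `sqlite3` module against an in-memory database. The column types, the `CHECK` constraint, the timestamp format, and the `temperature` metric name are assumptions; only the column names come from the spec above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # the demo writes to data/forecasts.db instead
conn.executescript("""
CREATE TABLE forecast_snapshots (
    forecast_made_at TEXT NOT NULL,
    target_ts        TEXT NOT NULL,
    source           TEXT NOT NULL CHECK (source IN ('toto', 'nws')),
    metric           TEXT NOT NULL,
    p10              REAL,            -- NULL for NWS rows
    p50              REAL NOT NULL,   -- Toto median or NWS point forecast
    p90              REAL             -- NULL for NWS rows
);
CREATE TABLE actuals (
    target_ts TEXT NOT NULL,
    metric    TEXT NOT NULL,
    value     REAL NOT NULL
);
""")

rows = [
    ("2024-01-08T00:00", "2024-01-08T06:00", "toto", "temperature", 38.0, 40.0, 42.0),
    ("2024-01-08T00:00", "2024-01-08T06:00", "nws", "temperature", None, 41.0, None),
]
conn.executemany("INSERT INTO forecast_snapshots VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

nws_row = conn.execute(
    "SELECT p10, p50, p90 FROM forecast_snapshots WHERE source = 'nws'"
).fetchone()
```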

A second SQLite, data/ecowitt.db, is the all-channel raw archive (populated by src/sync.py). Both DBs are pushed to a private HF Dataset (bitsofchris/toto-weather-forecast-log) on every autorefresh tick so they survive Space rebuilds.

Scoreboard — how the accuracy is calculated

The scoreboard answers one question: over the last 48 hours, which model was closer to the actual reading at each hour? The rules are designed so neither model gets to peek at data that wasn't available when the forecast was made.

Inputs

actuals(target_ts, metric, value) — Ecowitt readings resampled to the current display cadence (hourly by default).
forecast_snapshots(forecast_made_at, target_ts, source, metric, p10, p50, p90) — every forecast either model has ever issued, with two timestamps:
- forecast_made_at — when we ran inference / fetched NWS
- target_ts — the hour the forecast is predicting

Rule for picking which forecast counts

For each (target_ts, source, metric):

1. Filter to forecasts whose forecast_made_at <= target_ts, i.e. the model didn't yet know the actual value.
2. Of those, pick the one with the largest forecast_made_at: the most recent prediction issued before the target hour.

So for any past hour, Toto and NWS are both scored on their latest pre-target opinion. Both models always have the same information cutoff; no foresight, no stale snapshots.
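
The selection rule can be sketched in pandas on a handful of made-up snapshot rows. The data is hypothetical and the demo may express the same rule as a SQL query instead.

```python
import pandas as pd

snapshots = pd.DataFrame(
    {
        "forecast_made_at": pd.to_datetime(
            ["2024-01-08 00:00", "2024-01-08 03:00", "2024-01-08 07:00", "2024-01-08 02:00"]
        ),
        "target_ts": pd.to_datetime(["2024-01-08 06:00"] * 4),
        "source": ["toto", "toto", "toto", "nws"],
        "metric": ["temperature"] * 4,
        "p50": [40.0, 41.0, 45.0, 42.0],
    }
)

# Step 1: keep forecasts issued at or before the hour they predict.
eligible = snapshots[snapshots["forecast_made_at"] <= snapshots["target_ts"]]

# Step 2: per (target_ts, source, metric), keep the most recently issued row.
scored = (
    eligible.sort_values("forecast_made_at")
    .groupby(["target_ts", "source", "metric"], as_index=False)
    .last()
)
```

Here the Toto forecast issued at 07:00 is excluded because it postdates the 06:00 target, so the 03:00 forecast is the one that gets scored.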

Aggregation

Once each (target_ts, source, metric) triple has been pinned to a single forecast row:

abs_err = |p50 − actual|
MAE_source = mean(abs_err)   over target_ts in the last 48 h
n          = count(matching pairs)

The "lower is better" winner is whichever source has the smaller MAE for that metric. NWS doesn't expose pressure in forecastHourly, so the pressure scoreboard reports Toto only.
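
A minimal pandas sketch of the aggregation, assuming the forecasts have already been pinned to one row per target hour and source. All values and the `now` timestamp are invented for illustration.

```python
import pandas as pd

# Hypothetical pinned forecasts (one row per target hour and source) and actuals.
pinned = pd.DataFrame(
    {
        "target_ts": pd.to_datetime(["2024-01-08 06:00", "2024-01-08 07:00"] * 2),
        "source": ["toto", "toto", "nws", "nws"],
        "metric": ["temperature"] * 4,
        "p50": [41.0, 43.0, 42.0, 45.0],
    }
)
actuals = pd.DataFrame(
    {
        "target_ts": pd.to_datetime(["2024-01-08 06:00", "2024-01-08 07:00"]),
        "metric": ["temperature"] * 2,
        "value": [40.0, 44.0],
    }
)

# Restrict to the rolling 48 h window ending at a hypothetical "now".
now = pd.Timestamp("2024-01-08 08:00")
window = pinned[pinned["target_ts"] > now - pd.Timedelta(hours=48)]

# Pair each forecast with its actual, then average the absolute errors.
pairs = window.merge(actuals, on=["target_ts", "metric"], how="inner")
pairs["abs_err"] = (pairs["p50"] - pairs["value"]).abs()
scoreboard = (
    pairs.groupby(["metric", "source"])["abs_err"]
    .agg(mae="mean", n="count")
    .reset_index()
)

# Single-metric example: the winner is the source with the smaller MAE.
winner = scoreboard.loc[scoreboard["mae"].idxmin(), "source"]
```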

Caveats / what this scoreboard is NOT

We score the point prediction (p50) for both models. That throws away Toto's uncertainty — a wider interval doesn't hurt or help its MAE. A more Toto-flattering scoring would be CRPS or pinball loss, which credits well-calibrated intervals. We can layer that in later; MAE is what most people intuit by "accuracy", which is why it's on the headline scoreboard.
The window is rolling 48 h, so the number you see depends on the last two days, not all history.
We score every horizon distance lumped together. Toto's p50 at +6 h vs at +24 h are both folded into the same MAE. Splitting by horizon (1 h MAE, 6 h MAE, 24 h MAE, …) is a likely next iteration.
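
For reference, the pinball loss mentioned in the first caveat can be sketched as below. This is not part of the current scoreboard; the helper name `pinball_loss` and the sample numbers are illustrative. At level 0.5 the pinball loss equals half the absolute error, so the p50 term carries the same information as the MAE.

```python
import numpy as np

def pinball_loss(actual, predicted, level):
    """Mean pinball (quantile) loss of `predicted` at quantile `level`."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    diff = actual - predicted
    # Under-prediction is weighted by level, over-prediction by (1 - level).
    return float(np.mean(np.maximum(level * diff, (level - 1.0) * diff)))

# Hypothetical actuals and quantile forecasts for two target hours.
actual = np.array([40.0, 44.0])
p10 = np.array([38.0, 41.0])
p50 = np.array([41.0, 43.0])
p90 = np.array([43.0, 45.0])

loss_p10 = pinball_loss(actual, p10, 0.1)
loss_p50 = pinball_loss(actual, p50, 0.5)
loss_p90 = pinball_loss(actual, p90, 0.9)
mean_pinball = (loss_p10 + loss_p50 + loss_p90) / 3
```

Averaging over more quantile levels gives a closer approximation to CRPS, which is one way the interval quality could be credited.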

The "Past forecasts" overlay on the chart uses the same query so the scoreboard number and the chart line refer to identical predictions.