Toto inference: how the demo makes its forecasts

This is a precise spec of every Toto-related knob used by app.py so the post / footnote can quote it accurately.

Model

| Field | Value |
| --- | --- |
| Model ID | Datadog/Toto-2.0-22m |
| Parameters | ~22 M |
| Source | https://huggingface.co/Datadog/Toto-2.0-22m |
| Loaded via | Toto2Model.from_pretrained(...) (the toto-2 package from DataDog/toto's toto2/ subdir, pinned in requirements.txt) |
| Hardware | CPU (HF Space free tier, no GPU) |
| Patch size | model.config.patch_size (read at runtime; the context-length truncate/pad logic adapts automatically if the value differs across variants) |

We started on the 4 M variant for the demo's "weakest model still says something useful" angle, then bumped to 22 M for visibly tighter confidence bands and lower scoreboard MAE. The 22 M model is still small enough to run with sub-second CPU latency on the free HF Space tier.

Input data

| Field | Value |
| --- | --- |
| Source | Ecowitt Cloud API v3 (/device/history) |
| Station | Ecowitt GW3000B, Westhampton Beach NY |
| Channels forecasted | outdoor.temperature (°F), outdoor.humidity (%), pressure.relative (inHg), rainfall_piezo.rain_rate (in/hr) |
| Native storage cadence | 5 min at cycle_type=5min (the device is configured to upload at ~1-min intervals; Ecowitt buckets to 5 min for the 90-day tier). The earlier 30-min default was the device's out-of-box upload schedule. |
| cycle_type requested | 5min, the finest tier the API exposes. The data lives in data/ecowitt.db (synced incrementally every 15 min); each refresh reads from the archive rather than hitting Ecowitt live. |
| History window pulled from the archive | 7 days |
| Resampling for the chart | df.resample("5min").mean(), for a fine-grained display |
| Resampling for Toto inference | df.resample("1h").mean(), a coarser series so the model receives about 168 hourly points (truncated to a 160-point context by the patch alignment described under Context length) plus a 48-step horizon, the regime where it forecasts cleanly. Decoupling the chart cadence from the model cadence keeps the visual fine and the model output honest. |
| Cleaning | Series.interpolate(limit_direction="both") fills resample gaps before the tensor goes to Toto |
| NWS comparison | https://api.weather.gov/points/{lat},{lon} → forecastHourly (point forecast, no distribution) |

Context length

Toto requires the time axis of the input tensor to be a multiple of patch_size = 32.

raw = history_series_after_resample_and_interpolate
n_raw = len(raw)
if n_raw >= 32:
    n_ctx = (n_raw // 32) * 32                 # largest multiple of 32 that fits
    target = raw[-n_ctx:]                      # truncate oldest points
    target_mask = [True]*n_ctx                 # all valid
else:
    n_ctx = 32                                 # pad up to one patch
    pad = 32 - n_raw
    target = [raw[0]]*pad + raw                # left-pad with the first value
    target_mask = [False]*pad + [True]*n_raw   # tell Toto to ignore the padded steps

With ~7 days of archive history and the hourly resample we use for inference, this gives a context of 160 hourly points (5 patches): the ~168 hourly points are truncated to the largest multiple of 32. The chart shows the same 7 days at 5-min cadence (≈2,016 points), but those 5-min points only feed the chart, not the model.
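The truncate-or-pad rule above can be sketched as a small runnable function. This is a minimal illustration on plain Python lists, not the code in app.py; the name `align_context` and the hardcoded `PATCH_SIZE` are assumptions made for the example (the app reads `model.config.patch_size` at runtime).

```python
from typing import List, Tuple

PATCH_SIZE = 32  # assumption for this sketch; the app reads model.config.patch_size


def align_context(raw: List[float], patch_size: int = PATCH_SIZE) -> Tuple[List[float], List[bool]]:
    """Return (target, target_mask) whose length is a multiple of patch_size."""
    if not raw:
        raise ValueError("need at least one observation")
    n_raw = len(raw)
    if n_raw >= patch_size:
        n_ctx = (n_raw // patch_size) * patch_size
        target = list(raw[-n_ctx:])          # keep the newest n_ctx points
        mask = [True] * n_ctx                # every step is a real observation
    else:
        pad = patch_size - n_raw
        target = [raw[0]] * pad + list(raw)  # left-pad with the first value
        mask = [False] * pad + [True] * n_raw
    return target, mask


# Seven days of hourly points: 168 values, truncated to 160 (5 patches).
week = [float(i) for i in range(168)]
target, mask = align_context(week)
print(len(target), target[0], all(mask))  # 160 8.0 True
```

The eight oldest hours are dropped rather than the newest, so the context always ends at the most recent observation.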

Tensor shape

target:      torch.float32, shape (batch=1, n_variates=1, time=n_ctx)
target_mask: torch.bool,    shape (batch=1, n_variates=1, time=n_ctx)
series_ids:  torch.long,    shape (batch=1, n_variates=1)            (all zeros: univariate)

We forecast each metric independently (univariate). Multivariate inference is a follow-up; the inference cost is comparable but the chart gets noisier and the post hook is easier to read one metric at a time.

Prediction length

horizon_steps = round(horizon_hours / step_hours) where:

step_hours is the forecast cadence (1 h), not the chart cadence (5 min). horizon_hours = 48, so horizon_steps = 48. The 48 hourly predictions are drawn on top of the 5-min historical line, so the forecast line is sparser than the actuals but visually continuous because Plotly connects the anchor points.
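As a worked example of the step arithmetic, the sketch below computes horizon_steps and the timestamps the hourly predictions would land on. The helper `forecast_timestamps` is hypothetical, and it assumes the first forecast step falls one step after the last observation; the source does not state how app.py anchors the forecast.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def horizon_steps(horizon_hours: float, step_hours: float) -> int:
    """Number of model steps needed to cover the horizon."""
    return round(horizon_hours / step_hours)


def forecast_timestamps(last_obs: datetime, horizon_hours: float, step_hours: float) -> List[datetime]:
    """Timestamps of each forecast step (assumed to start one step after last_obs)."""
    n = horizon_steps(horizon_hours, step_hours)
    return [last_obs + timedelta(hours=step_hours * (i + 1)) for i in range(n)]


last_obs = datetime(2025, 1, 1, 12, tzinfo=timezone.utc)
stamps = forecast_timestamps(last_obs, horizon_hours=48, step_hours=1)
print(len(stamps), stamps[0].isoformat(), stamps[-1].isoformat())
# 48 2025-01-01T13:00:00+00:00 2025-01-03T12:00:00+00:00
```

Running the same horizon at the 5-min chart cadence would need 576 steps, which is the cost the hourly resample avoids.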

Distribution → quantiles

We do not Monte-Carlo sample. Toto's output head is a parametric Student-t mixture (see the Toto 2.0 paper), and model.forecast() returns analytical quantiles directly:

quantiles = model.forecast(
    {"target": target, "target_mask": target_mask, "series_ids": series_ids},
    horizon=horizon_steps,
)
# quantiles shape: (9, batch=1, n_variates=1, horizon)
# quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

We pluck three of them for the chart and the scoreboard:

| Display | Index | Quantile |
| --- | --- | --- |
| Lower edge of the shaded band | quantiles[0, 0, 0] | p10 |
| Toto median line | quantiles[4, 0, 0] | p50 |
| Upper edge of the shaded band | quantiles[8, 0, 0] | p90 |

The shaded band is therefore the 80 % central interval (p10–p90). The "±X°F at +24 h" chip on the hero is half of (p90 − p10) at the last forecast step.
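To make the indexing concrete, here is a small sketch that plucks p10/p50/p90 and computes the chip's half-width. It uses nested Python lists shaped (9, 1, 1, horizon) as a stand-in for the tensor, so `quantiles[k][0][0]` plays the role of `quantiles[k, 0, 0]`; the function names and the toy numbers are assumptions for illustration only.

```python
from typing import Dict, List

QUANTILE_LEVELS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def band_from_quantiles(quantiles: List[List[List[List[float]]]]) -> Dict[str, List[float]]:
    """Pick the three series drawn on the chart from a (9, 1, 1, horizon) array."""
    return {
        "p10": list(quantiles[0][0][0]),  # lower edge of the shaded band
        "p50": list(quantiles[4][0][0]),  # median line
        "p90": list(quantiles[8][0][0]),  # upper edge of the shaded band
    }


def half_width(band: Dict[str, List[float]], step: int = -1) -> float:
    """Half of (p90 - p10) at one forecast step; defaults to the last step."""
    return (band["p90"][step] - band["p10"][step]) / 2


# Toy forecast: median fixed at 60, spread growing by one unit per quantile per step.
horizon = 3
toy = [[[[60.0 + (k - 4) * (t + 1) for t in range(horizon)]]] for k in range(9)]

band = band_from_quantiles(toy)
print(band["p10"], band["p90"], half_width(band))
# [56.0, 52.0, 48.0] [64.0, 68.0, 72.0] 12.0
```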

Inference cadence

| Field | Value |
| --- | --- |
| Trigger | a daemon thread inside the Space (_autorefresh_loop) and demo.load on a visitor's first request |
| Interval | every 15 minutes |
| Cache TTL | 14 minutes, slightly less than the autorefresh interval so the next tick always misses the cache and refetches Ecowitt + NWS |
| Per-tick cost | one /device/real_time + one /device/history per cycle_type touched + one NWS /points + one NWS forecastHourly + four univariate Toto forwards (one per metric) |
| CPU forward time | ~hundreds of milliseconds per metric on the free CPU tier; total wallclock per refresh is dominated by the network calls, not the model |
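The interplay between the 15-minute tick and the 14-minute TTL can be illustrated with a tiny single-entry cache. This is a sketch under assumptions: `TTLCache` and `refresh` are hypothetical names, and the real cache in app.py may be structured differently. The clock is injectable so the behaviour can be checked without waiting.

```python
import time
from typing import Any, Callable, List, Optional, Tuple


class TTLCache:
    """Single-entry cache that recomputes once the stored value is older than ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entry: Optional[Tuple[float, Any]] = None  # (stored_at, value)

    def get_or_compute(self, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        if self._entry is not None and now - self._entry[0] < self.ttl:
            return self._entry[1]  # still fresh: cache hit
        value = compute()          # expired or empty: refetch
        self._entry = (now, value)
        return value


fake_now = [0.0]          # seconds, advanced by hand
calls: List[float] = []   # records when a real refresh happened


def refresh() -> int:
    calls.append(fake_now[0])
    return len(calls)


cache = TTLCache(ttl_seconds=14 * 60, clock=lambda: fake_now[0])
cache.get_or_compute(refresh)   # t = 0 min: first tick, miss
fake_now[0] = 5 * 60
cache.get_or_compute(refresh)   # t = 5 min: visitor request, hit
fake_now[0] = 15 * 60
cache.get_or_compute(refresh)   # t = 15 min: next tick, entry is 15 min old, miss
print(calls)  # [0.0, 900.0]
```

Because the TTL is shorter than the tick interval, every autorefresh tick recomputes, while visitor requests between ticks are served from the cache.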

Persistence

Every refresh writes to data/forecasts.db (SQLite):

  • forecast_snapshots(forecast_made_at, target_ts, source, metric, p10, p50, p90)
  • actuals(target_ts, metric, value)

source ∈ {toto, nws}. NWS rows store the point forecast in p50 and leave p10/p90 NULL.

A second SQLite, data/ecowitt.db, is the all-channel raw archive (populated by src/sync.py). Both DBs are pushed to a private HF Dataset (bitsofchris/toto-weather-forecast-log) on every autorefresh tick so they survive Space rebuilds.
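For readers who want to reproduce the forecasts.db layout, the sketch below creates the two tables with the column names listed above. The column types, the NOT NULL and CHECK constraints, and the ISO-8601 text timestamps are assumptions; the source only gives the column names and the rule that NWS rows leave p10/p90 NULL.

```python
import sqlite3

# Column names come from the spec; types and constraints are assumed for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS forecast_snapshots (
    forecast_made_at TEXT NOT NULL,
    target_ts        TEXT NOT NULL,
    source           TEXT NOT NULL CHECK (source IN ('toto', 'nws')),
    metric           TEXT NOT NULL,
    p10              REAL,
    p50              REAL NOT NULL,
    p90              REAL
);
CREATE TABLE IF NOT EXISTS actuals (
    target_ts TEXT NOT NULL,
    metric    TEXT NOT NULL,
    value     REAL NOT NULL
);
"""

conn = sqlite3.connect(":memory:")  # the app writes to data/forecasts.db instead
conn.executescript(SCHEMA)

insert = "INSERT INTO forecast_snapshots VALUES (?, ?, ?, ?, ?, ?, ?)"
# Toto row: full p10/p50/p90 band.
conn.execute(insert, ("2025-01-01T12:00:00Z", "2025-01-01T18:00:00Z", "toto",
                      "outdoor.temperature", 38.5, 40.0, 42.0))
# NWS row: point forecast stored in p50, p10/p90 left NULL.
conn.execute(insert, ("2025-01-01T12:00:00Z", "2025-01-01T18:00:00Z", "nws",
                      "outdoor.temperature", None, 41.0, None))

rows = conn.execute(
    "SELECT source, p10, p50, p90 FROM forecast_snapshots ORDER BY source"
).fetchall()
print(rows)  # [('nws', None, 41.0, None), ('toto', 38.5, 40.0, 42.0)]
```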

Scoreboard: how the accuracy is calculated

The scoreboard answers one question: over the last 48 hours, which model was closer to the actual reading at each hour? The rules are designed so neither model gets to peek at data that wasn't available when the forecast was made.

Inputs

  • actuals(target_ts, metric, value): Ecowitt readings resampled to the current display cadence (hourly by default).
  • forecast_snapshots(forecast_made_at, target_ts, source, metric, p10, p50, p90): every forecast either model has ever issued, with two timestamps:
    • forecast_made_at: when we ran inference / fetched NWS
    • target_ts: the hour the forecast is predicting

Rule for picking which forecast counts

For each (target_ts, source, metric):

  1. Filter to forecasts whose forecast_made_at <= target_ts, i.e. forecasts issued no later than the target hour, when the model didn't yet know the actual value.
  2. Of those, pick the one with the largest forecast_made_at: the most recent prediction issued before the target hour.

So for any past hour, Toto and NWS are both scored on their latest pre-target opinion. Both models always have the same information cutoff; no foresight, no stale snapshots.
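The two-step selection rule can be sketched in plain Python as follows. The `Snapshot` tuple and `latest_pre_target` function are hypothetical names for illustration; the app runs the equivalent logic as a query against forecasts.db, whose exact form is not shown in this document.

```python
from datetime import datetime
from typing import Dict, Iterable, NamedTuple, Tuple


class Snapshot(NamedTuple):
    forecast_made_at: datetime
    target_ts: datetime
    source: str
    metric: str
    p50: float


Key = Tuple[datetime, str, str]  # (target_ts, source, metric)


def latest_pre_target(snapshots: Iterable[Snapshot]) -> Dict[Key, Snapshot]:
    """For each (target_ts, source, metric), keep the newest forecast made at or before target_ts."""
    chosen: Dict[Key, Snapshot] = {}
    for s in snapshots:
        if s.forecast_made_at > s.target_ts:
            continue  # step 1: drop forecasts issued after the target hour
        key = (s.target_ts, s.source, s.metric)
        best = chosen.get(key)
        if best is None or s.forecast_made_at > best.forecast_made_at:
            chosen[key] = s  # step 2: keep the most recent eligible forecast
    return chosen


noon = datetime(2025, 1, 1, 12, 0)
history = [
    Snapshot(datetime(2025, 1, 1, 9, 0), noon, "toto", "outdoor.temperature", 40.0),
    Snapshot(datetime(2025, 1, 1, 11, 45), noon, "toto", "outdoor.temperature", 41.0),
    Snapshot(datetime(2025, 1, 1, 12, 15), noon, "toto", "outdoor.temperature", 42.0),  # too late
]
picked = latest_pre_target(history)
print(picked[(noon, "toto", "outdoor.temperature")].p50)  # 41.0
```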

Aggregation

Once each (target_ts, source, metric) triple has been pinned to a single forecast row:

abs_err = |p50 − actual|
MAE_source = mean(abs_err)   over target_ts in the last 48 h
n          = count(matching pairs)

The "lower is better" winner is whichever source has the smaller MAE for that metric. NWS doesn't expose pressure in forecastHourly, so the pressure scoreboard reports Toto only.
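A minimal sketch of the aggregation for one metric, assuming the rows have already been pinned to a single forecast per target hour and filtered to the last 48 h. The function name `mae_by_source` and the toy numbers are illustrative, not taken from app.py.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def mae_by_source(pairs: Iterable[Tuple[str, float, float]]) -> Dict[str, Tuple[float, int]]:
    """Map each source to (MAE, n) given (source, p50, actual) rows for one metric."""
    errors: Dict[str, List[float]] = defaultdict(list)
    for source, p50, actual in pairs:
        errors[source].append(abs(p50 - actual))
    return {src: (sum(errs) / len(errs), len(errs)) for src, errs in errors.items()}


pairs = [
    ("toto", 41.0, 40.0),  # abs_err 1.0
    ("toto", 43.0, 44.0),  # abs_err 1.0
    ("nws", 43.0, 40.0),   # abs_err 3.0
    ("nws", 44.0, 44.0),   # abs_err 0.0
]
scores = mae_by_source(pairs)
winner = min(scores, key=lambda src: scores[src][0])  # lower MAE wins
print(scores, winner)
# {'toto': (1.0, 2), 'nws': (1.5, 2)} toto
```

For pressure, the input would contain only toto rows, so the result has a single entry and no winner comparison is meaningful.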

Caveats / what this scoreboard is NOT

  • We score the point prediction (p50) for both models. That throws away Toto's uncertainty: a wider interval neither hurts nor helps its MAE. A more Toto-flattering score would be CRPS or pinball loss, which credit well-calibrated intervals. We can layer that in later; MAE is what most people mean by "accuracy", which is why it's on the headline scoreboard.
  • The window is rolling 48 h, so the number you see depends on the last two days, not all history.
  • We score every horizon distance lumped together. Toto's p50 at +6 h vs at +24 h are both folded into the same MAE. Splitting by horizon (1 h MAE, 6 h MAE, 24 h MAE, …) is a likely next iteration.
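For reference, the pinball loss mentioned in the first caveat can be sketched in a few lines. This is not part of the current scoreboard; the function names and the example numbers are assumptions, and averaging over three quantile levels is just one possible way to summarise a p10/p50/p90 band.

```python
from typing import Dict


def pinball_loss(actual: float, predicted: float, q: float) -> float:
    """Quantile (pinball) loss for one prediction at quantile level q."""
    diff = actual - predicted
    return max(q * diff, (q - 1) * diff)


def mean_pinball(actual: float, quantile_preds: Dict[float, float]) -> float:
    """Average pinball loss across the supplied quantile levels."""
    losses = [pinball_loss(actual, pred, q) for q, pred in quantile_preds.items()]
    return sum(losses) / len(losses)


actual = 40.0
tight = {0.1: 38.0, 0.5: 40.0, 0.9: 42.0}  # narrow band that contains the actual
wide = {0.1: 30.0, 0.5: 40.0, 0.9: 50.0}   # same median, much wider band

# Both bands have identical MAE on p50 (zero), but the tight one scores better here.
print(mean_pinball(actual, tight) < mean_pinball(actual, wide))  # True
```

Unlike MAE on p50, this score separates two forecasts with the same median, which is why it would reward a band that is both narrow and well placed.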

The "Past forecasts" overlay on the chart uses the same query so the scoreboard number and the chart line refer to identical predictions.