Toto inference: how the demo makes its forecasts
This is a precise spec of every Toto-related knob used by app.py so the
post / footnote can quote it accurately.
Model
| Setting | Value |
|---|---|
| Model ID | Datadog/Toto-2.0-22m |
| Parameters | ~22 M |
| Source | https://huggingface.co/Datadog/Toto-2.0-22m |
| Loaded via | Toto2Model.from_pretrained(...) (the toto-2 package from DataDog/toto's toto2/ subdir, pinned in requirements.txt) |
| Hardware | CPU (HF Space free tier, no GPU) |
| Patch size | model.config.patch_size (read at runtime; the context-length truncate/pad logic adapts automatically if the value differs across variants) |
We started on the 4 M variant for the demo's "weakest model still says something useful" angle, then bumped to 22 M for visibly tighter confidence bands and lower scoreboard MAE. The 22 M model is still small enough to run with sub-second CPU latency on the free HF Space tier.
Input data
| Setting | Value |
|---|---|
| Source | Ecowitt Cloud API v3 (/device/history) |
| Station | Ecowitt GW3000B, Westhampton Beach NY |
| Channels forecasted | outdoor.temperature (°F), outdoor.humidity (%), pressure.relative (inHg), rainfall_piezo.rain_rate (in/hr) |
| Native storage cadence | 5 min at cycle_type=5min (the device is configured to upload at ~1-min intervals; Ecowitt buckets to 5 min for the 90-day tier). Earlier defaults of 30 min were the device's out-of-box upload schedule. |
| cycle_type requested | 5min, the finest tier the API exposes. The data lives in data/ecowitt.db (synced incrementally every 15 min); each refresh reads from the archive rather than hitting Ecowitt live. |
| History window pulled from the archive | 7 days |
| Resampling for the chart | df.resample("5min").mean() for a fine-grained display |
| Resampling for Toto inference | df.resample("1h").mean() for a coarser series, so the 22 M model receives a 160-point context (168 hourly points, truncated to a multiple of the patch size) plus a 48-step horizon, the regime where it forecasts cleanly. Decoupling the chart cadence from the model cadence keeps the visual fine and the model output honest. |
| Cleaning | Series.interpolate(limit_direction="both") fills resample gaps before the tensor goes to Toto |
| NWS comparison | https://api.weather.gov/points/{lat},{lon} → forecastHourly (point forecast, no distribution) |
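
To make the two cadences concrete, here is a minimal stdlib-only sketch of what an hourly mean does to a 5-minute series. The helper name hourly_mean and the synthetic readings are assumptions for illustration; app.py itself uses the pandas calls listed in the table.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import fmean

def hourly_mean(samples):
    """Average (timestamp, value) samples into hourly buckets, oldest first."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Floor each timestamp to the start of its hour.
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    return [(hour, fmean(vals)) for hour, vals in sorted(buckets.items())]

# Synthetic 7-day history at 5-min cadence (made-up values, not station data).
start = datetime(2024, 1, 1)
five_min = [(start + timedelta(minutes=5 * i), 50.0 + (i % 12))
            for i in range(7 * 24 * 12)]
hourly = hourly_mean(five_min)
print(len(five_min), len(hourly))  # 2016 168
```

The chart would consume the 2,016-point series, while only the 168-point hourly series would go on to the model.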
Context length
Toto requires the time axis of the input tensor to be a multiple of patch_size = 32.

    n_raw = len(raw)                      # raw = history series after resample + interpolate
    if n_raw >= 32:
        n_ctx = (n_raw // 32) * 32        # largest multiple of 32 that fits
        target = raw[-n_ctx:]             # truncate oldest points
        target_mask = [True] * n_ctx      # all valid
    else:
        n_ctx = 32                        # pad up to one patch
        pad = 32 - n_raw
        target = [raw[0]] * pad + raw     # left-pad with the first value
        target_mask = [False] * pad + [True] * n_raw   # tell Toto to ignore the padded steps

With ~7 days of archive history and the hourly resample we use for inference, the series has 7 × 24 = 168 points, which truncates to a context of 160 hourly points (5 patches). The chart shows the same 7 days at 5-min cadence (≈2,016 points), but those points only feed the chart, not the model.
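
The truncate-or-pad rule can be checked with a small runnable sketch on plain Python lists. The function name build_context and its patch_size argument are assumptions for illustration; the real code builds torch tensors and reads the patch size from model.config.patch_size.

```python
def build_context(raw, patch_size=32):
    """Return (target, target_mask) with len(target) a multiple of patch_size."""
    if not raw:
        raise ValueError("need at least one observation")
    n_raw = len(raw)
    if n_raw >= patch_size:
        n_ctx = (n_raw // patch_size) * patch_size
        target = list(raw[-n_ctx:])            # drop the oldest points
        mask = [True] * n_ctx                  # every step is a real observation
    else:
        pad = patch_size - n_raw
        target = [raw[0]] * pad + list(raw)    # left-pad with the first value
        mask = [False] * pad + [True] * n_raw  # padded steps are marked invalid
    return target, mask

week = [float(i) for i in range(168)]          # 7 days of hourly points
target, mask = build_context(week)
print(len(target), target[0])                  # 160 8.0

short, short_mask = build_context([1.0, 2.0, 3.0])
print(len(short), sum(short_mask))             # 32 3
```

Truncating from the front keeps the most recent observations, which matter most for a short-range forecast.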
Tensor shape
target: torch.float32, shape (batch=1, n_variates=1, time=n_ctx)
target_mask: torch.bool, shape (batch=1, n_variates=1, time=n_ctx)
series_ids: torch.long, shape (batch=1, n_variates=1) (all zeros: univariate)
We forecast each metric independently (univariate). Multivariate inference is a follow-up; the inference cost is comparable but the chart gets noisier and the post hook is easier to read one metric at a time.
Prediction length
horizon_steps = round(horizon_hours / step_hours), where:
- step_hours is the forecast cadence (1 h), not the chart cadence (5 min)
- horizon_hours = 48, so horizon_steps = 48

The 48 hourly predictions are drawn on top of the 5-min historical line. The forecast line is therefore sparser than the actuals, but visually continuous because Plotly connects the anchor points.
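
As a quick check of the arithmetic, the sketch below computes the step count and the timestamps the hourly predictions would land on. The helper forecast_timestamps, and the choice to place the first step one cadence after the last observation, are assumptions for illustration rather than a description of app.py.

```python
from datetime import datetime, timedelta

def horizon_steps(horizon_hours, step_hours):
    """Number of forecast steps for a given horizon and forecast cadence."""
    return round(horizon_hours / step_hours)

def forecast_timestamps(last_observed, horizon_hours=48, step_hours=1):
    """Timestamps of each forecast step (assumed to start one step after the last observation)."""
    n = horizon_steps(horizon_hours, step_hours)
    return [last_observed + timedelta(hours=step_hours * (i + 1)) for i in range(n)]

ts = forecast_timestamps(datetime(2024, 1, 1, 12))
print(len(ts), ts[0], ts[-1])
# 48 2024-01-01 13:00:00 2024-01-03 12:00:00
```

Passing the chart cadence by mistake (step_hours = 5/60) would request 576 steps instead of 48, which is why the two cadences are kept separate.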
Distribution: quantiles
We do not Monte-Carlo sample. Toto's output head is a parametric
Student-t mixture (see the Toto 2.0 paper), and model.forecast()
returns analytical quantiles directly:

    quantiles = model.forecast(
        {"target": target, "target_mask": target_mask, "series_ids": series_ids},
        horizon=horizon_steps,
    )
    # quantiles shape: (9, batch=1, n_variates=1, horizon)
    # quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

We pluck three of them for the chart and the scoreboard:
| Display | Index | Quantile |
|---|---|---|
| Lower edge of the shaded band | quantiles[0, 0, 0] | p10 |
| Toto median line | quantiles[4, 0, 0] | p50 |
| Upper edge of the shaded band | quantiles[8, 0, 0] | p90 |
The shaded band is therefore the 80 % central interval (p10–p90).
The "±X°F at +24 h" chip on the hero is half of (p90 − p10) at the
last forecast step.
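
The index-to-quantile mapping and the half-width arithmetic can be sketched without torch by using a nested list with the same (9, batch, n_variates, horizon) layout. The helper names and the fake numbers below are assumptions for illustration; on the real tensor the equivalent lookup is quantiles[0, 0, 0] and so on.

```python
LEVELS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def band(quantiles, batch=0, variate=0):
    """Pull the p10/p50/p90 series out of a (9, batch, variate, horizon) nested list."""
    p10 = quantiles[LEVELS.index(0.1)][batch][variate]
    p50 = quantiles[LEVELS.index(0.5)][batch][variate]
    p90 = quantiles[LEVELS.index(0.9)][batch][variate]
    return p10, p50, p90

def half_width(p10, p90, step):
    """Half of the p10-to-p90 spread at one forecast step."""
    return (p90[step] - p10[step]) / 2

# Fake output with a 3-step horizon whose spread widens with lead time.
fake = [[[[60.0 + (q - 4) * (h + 1) for h in range(3)]]] for q in range(9)]
p10, p50, p90 = band(fake)
print(p50, half_width(p10, p90, step=-1))  # [60.0, 60.0, 60.0] 12.0
```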
Inference cadence
| Setting | Value |
|---|---|
| Trigger | a daemon thread inside the Space (_autorefresh_loop) and demo.load on a visitor's first request |
| Interval | every 15 minutes |
| Cache TTL | 14 minutes, slightly less than the autorefresh interval so the next tick always misses the cache and refetches Ecowitt + NWS |
| Per-tick cost | one /device/real_time + one /device/history per cycle_type touched + one NWS /points + one NWS forecastHourly + four univariate Toto forwards (one per metric) |
| CPU forward time | ~hundreds of milliseconds per metric on the free CPU tier; total wallclock per refresh is dominated by the network calls, not the model |
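
The interplay between the 14-minute TTL and the 15-minute tick can be illustrated with a tiny time-based cache driven by a simulated clock. The TTLCache class is a hypothetical stand-in, not the caching code in app.py.

```python
import time

class TTLCache:
    """Single-value cache that recomputes once the stored value is older than the TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._value = None
        self._stored_at = None

    def get(self, compute):
        now = self.clock()
        if self._stored_at is None or now - self._stored_at >= self.ttl:
            self._value = compute()      # cache miss: refetch and re-run inference
            self._stored_at = now
        return self._value

now = [0.0]                              # simulated clock, in seconds
calls = []                               # times at which a real refresh ran

def refresh():
    calls.append(now[0])
    return len(calls)

cache = TTLCache(14 * 60, clock=lambda: now[0])
cache.get(refresh)                       # t = 0: miss, computes
now[0] = 5 * 60
cache.get(refresh)                       # t = 5 min: visitor request, served from cache
now[0] = 15 * 60
cache.get(refresh)                       # t = 15 min: autorefresh tick, cache has expired
print(calls)                             # [0.0, 900.0]
```

With a TTL equal to or longer than the tick interval, the 15-minute tick could land on a still-valid entry and skip the refetch.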
Persistence
Every refresh writes to data/forecasts.db (SQLite):
- forecast_snapshots(forecast_made_at, target_ts, source, metric, p10, p50, p90)
- actuals(target_ts, metric, value)

source ∈ {toto, nws}. NWS rows store the point forecast in p50 and
leave p10/p90 NULL.
A second SQLite database, data/ecowitt.db, is the all-channel raw archive
(populated by src/sync.py). Both DBs are pushed to a private HF Dataset
(bitsofchris/toto-weather-forecast-log) on every autorefresh tick so
they survive Space rebuilds.
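
For readers who want to picture the two tables, the sketch below creates them in an in-memory SQLite database. Only the table and column names come from this document; the column types, the CHECK constraint, and the sample rows are assumptions.

```python
import sqlite3

# Column types and constraints are guesses; the names match the spec above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS forecast_snapshots (
    forecast_made_at TEXT NOT NULL,
    target_ts        TEXT NOT NULL,
    source           TEXT NOT NULL CHECK (source IN ('toto', 'nws')),
    metric           TEXT NOT NULL,
    p10              REAL,
    p50              REAL NOT NULL,
    p90              REAL
);
CREATE TABLE IF NOT EXISTS actuals (
    target_ts TEXT NOT NULL,
    metric    TEXT NOT NULL,
    value     REAL NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
rows = [
    ("2024-01-01T12:00", "2024-01-01T13:00", "toto", "temperature", 40.1, 41.0, 42.3),
    ("2024-01-01T12:00", "2024-01-01T13:00", "nws", "temperature", None, 40.0, None),
]
conn.executemany("INSERT INTO forecast_snapshots VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
nws_row = conn.execute(
    "SELECT p10, p50, p90 FROM forecast_snapshots WHERE source = 'nws'"
).fetchone()
print(nws_row)  # (None, 40.0, None)
```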
Scoreboard: how the accuracy is calculated
The scoreboard answers one question: over the last 48 hours, which model was closer to the actual reading at each hour? The rules are designed so neither model gets to peek at data that wasn't available when the forecast was made.
Inputs
- actuals(target_ts, metric, value): Ecowitt readings resampled to the current display cadence (hourly by default).
- forecast_snapshots(forecast_made_at, target_ts, source, metric, p10, p50, p90): every forecast either model has ever issued, with two timestamps:
  - forecast_made_at: when we ran inference / fetched NWS
  - target_ts: the hour the forecast is predicting
Rule for picking which forecast counts
For each (target_ts, source, metric):
- Filter to forecasts whose forecast_made_at <= target_ts, i.e. the model didn't yet know the actual value.
- Of those, pick the one with the largest forecast_made_at: the most recent prediction issued before the target hour.
So for any past hour, Toto and NWS are both scored on their latest pre-target opinion. Both models always have the same information cutoff; no foresight, no stale snapshots.
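
One way to express the selection rule is a correlated subquery. This is a sketch, not necessarily the query app.py runs, and it assumes timestamps are stored as ISO-8601 text in a single timezone so that string comparison matches time order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE forecast_snapshots (forecast_made_at TEXT, target_ts TEXT, "
    "source TEXT, metric TEXT, p10 REAL, p50 REAL, p90 REAL)"
)
conn.executemany(
    "INSERT INTO forecast_snapshots VALUES (?, ?, ?, ?, NULL, ?, NULL)",
    [
        ("2024-01-01T10:00", "2024-01-01T12:00", "toto", "temperature", 40.0),  # older
        ("2024-01-01T11:45", "2024-01-01T12:00", "toto", "temperature", 41.0),  # latest pre-target
        ("2024-01-01T12:15", "2024-01-01T12:00", "toto", "temperature", 99.0),  # issued too late
    ],
)

LATEST_PRE_TARGET = """
SELECT f.target_ts, f.source, f.metric, f.forecast_made_at, f.p50
FROM forecast_snapshots AS f
WHERE f.forecast_made_at <= f.target_ts
  AND f.forecast_made_at = (
      SELECT MAX(g.forecast_made_at)
      FROM forecast_snapshots AS g
      WHERE g.target_ts = f.target_ts
        AND g.source = f.source
        AND g.metric = f.metric
        AND g.forecast_made_at <= g.target_ts
  )
"""
picked = conn.execute(LATEST_PRE_TARGET).fetchall()
print(picked)
# [('2024-01-01T12:00', 'toto', 'temperature', '2024-01-01T11:45', 41.0)]
```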
Aggregation
Once each (target_ts, source, metric) triple has been pinned to a
single forecast row:

    abs_err    = |p50 - actual|
    MAE_source = mean(abs_err) over target_ts in the last 48 h
    n          = count(matching pairs)

The "lower is better" winner is whichever source has the smaller MAE for
that metric. NWS doesn't expose pressure in forecastHourly, so the
pressure scoreboard reports Toto only.
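
The aggregation for one metric is small enough to sketch in full. The function name mae_by_source and the sample numbers are assumptions for illustration; the input is the set of (source, p50, actual) pairs left after the selection rule has pinned one forecast per target hour.

```python
from statistics import fmean

def mae_by_source(pairs):
    """pairs: iterable of (source, p50, actual). Returns {source: (mae, n)}."""
    errors = {}
    for source, p50, actual in pairs:
        errors.setdefault(source, []).append(abs(p50 - actual))
    return {source: (fmean(errs), len(errs)) for source, errs in errors.items()}

# Made-up pairs for one metric over two target hours.
pairs = [
    ("toto", 41.0, 40.0), ("toto", 43.0, 44.0),
    ("nws", 43.0, 40.0), ("nws", 44.0, 44.0),
]
scores = mae_by_source(pairs)
winner = min(scores, key=lambda source: scores[source][0])  # lower MAE wins
print(scores, winner)
# {'toto': (1.0, 2), 'nws': (1.5, 2)} toto
```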
Caveats / what this scoreboard is NOT
- We score the point prediction (p50) for both models. That throws away Toto's uncertainty: a wider interval neither hurts nor helps its MAE. A more Toto-flattering scoring would be CRPS or pinball loss, which credit well-calibrated intervals. We can layer that in later; MAE is what most people intuit by "accuracy", which is why it's on the headline scoreboard.
- The window is rolling 48 h, so the number you see depends on the last two days, not all history.
- We score every horizon distance lumped together. Toto's p50 at +6 h vs at +24 h are both folded into the same MAE. Splitting by horizon (1 h MAE, 6 h MAE, 24 h MAE, …) is a likely next iteration.
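
To show why pinball loss would reward a tighter interval where MAE cannot, here is a minimal sketch of the standard quantile (pinball) loss. It is not part of the current scoreboard, and the two example forecasts are made up.

```python
def pinball_loss(actual, predicted, level):
    """Quantile (pinball) loss for one prediction at quantile `level` in (0, 1)."""
    diff = actual - predicted
    return level * diff if diff >= 0 else (level - 1) * diff

def mean_pinball(actual, quantile_preds):
    """Average pinball loss over a {level: predicted_value} mapping."""
    losses = [pinball_loss(actual, pred, level) for level, pred in quantile_preds.items()]
    return sum(losses) / len(losses)

# Two forecasts with the same median but different interval widths.
narrow = {0.1: 39.0, 0.5: 40.0, 0.9: 41.0}
wide = {0.1: 30.0, 0.5: 40.0, 0.9: 50.0}
actual = 40.0

print(abs(narrow[0.5] - actual), abs(wide[0.5] - actual))  # identical absolute error on p50
print(mean_pinball(actual, narrow) < mean_pinball(actual, wide))  # True
```

Both forecasts score the same on MAE because their medians match, while the pinball loss is lower for the narrower band that still brackets the actual value.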
The "Past forecasts" overlay on the chart uses the same query so the scoreboard number and the chart line refer to identical predictions.