# Toto inference: how the demo makes its forecasts

This is a precise spec of every Toto-related knob used by `app.py`, so the post / footnote can quote it accurately.

## Model

| | |
|---|---|
| Model ID | `Datadog/Toto-2.0-22m` |
| Parameters | ~22 M |
| Source | https://huggingface.co/Datadog/Toto-2.0-22m |
| Loaded via | `Toto2Model.from_pretrained(...)` (the `toto-2` package from DataDog/toto's `toto2/` subdir, pinned in `requirements.txt`) |
| Hardware | CPU (HF Space free tier, no GPU) |
| Patch size | `model.config.patch_size` (read at runtime; the context-length truncate/pad logic adapts automatically if the value differs across variants) |

We started on the 4 M variant for the demo's "weakest model still says something useful" angle, then bumped to 22 M for visibly tighter confidence bands and lower scoreboard MAE. It is still small enough to run with sub-second CPU latency on the free HF Space tier.

## Input data

| | |
|---|---|
| Source | Ecowitt Cloud API v3 (`/device/history`) |
| Station | Ecowitt GW3000B, Westhampton Beach NY |
| Channels forecasted | `outdoor.temperature` (°F), `outdoor.humidity` (%), `pressure.relative` (inHg), `rainfall_piezo.rain_rate` (in/hr) |
| Native storage cadence | **5 min** at `cycle_type=5min` (the device is configured to upload at ~1-min intervals; Ecowitt buckets to 5 min for the 90-day tier). Earlier defaults of 30 min were the device's out-of-box upload schedule. |
| `cycle_type` requested | `5min`, the finest tier the API exposes. The data lives in `data/ecowitt.db` (synced incrementally every 15 min); each refresh reads from the archive rather than hitting Ecowitt live. |
| History window pulled from the archive | 7 days |
| Resampling for the chart | `df.resample("5min").mean()` for a fine-grained display |
| Resampling for Toto inference | `df.resample("1h").mean()`: a coarser series, so the model receives about a week of hourly context (168 points before patch alignment) plus a 48-step horizon, the regime where it forecasts cleanly. Decoupling the chart cadence from the model cadence keeps the visual fine and the model output honest. |
| Cleaning | `Series.interpolate(limit_direction="both")` fills resample gaps before the tensor goes to Toto |
| NWS comparison | `https://api.weather.gov/points/{lat},{lon}` → `forecastHourly` (point forecast, no distribution) |

## Context length

Toto requires the time axis of the input tensor to be a multiple of `patch_size` (32 here, written out as a literal in the pseudocode below).

```text
raw   = history_series_after_resample_and_interpolate
n_raw = len(raw)
if n_raw >= 32:
    n_ctx       = (n_raw // 32) * 32          # truncate oldest points
    target      = raw[-n_ctx:]                # keep the newest n_ctx points
    target_mask = ones(n_ctx)                 # all valid
else:
    n_ctx       = 32                          # pad up to one patch
    pad         = 32 - n_raw
    target      = [first_value]*pad + raw     # left-pad with the first value
    target_mask = [False]*pad + [True]*n_raw  # tell Toto to ignore the padded steps
```

With ~7 days of archive history and the hourly resample we use for inference, 7 × 24 = 168 hourly points are truncated to a context of 160 (5 patches of 32). The chart shows the same 7 days at 5-min cadence (≈2 016 points), but those raw points only feed the chart, not the model.

## Tensor shape

```text
target:       torch.float32, shape (batch=1, n_variates=1, time=n_ctx)
target_mask:  torch.bool,    shape (batch=1, n_variates=1, time=n_ctx)
series_ids:   torch.long,    shape (batch=1, n_variates=1)   (all zeros; univariate)
```

We forecast each metric **independently** (univariate). Multivariate inference is a follow-up; the inference cost is comparable but the chart gets noisier and the post hook is easier to read one metric at a time.

## Prediction length

`horizon_steps = round(horizon_hours / step_hours)`, where:

- `step_hours` is the **forecast cadence** (1 h), not the chart cadence (5 min).
- `horizon_hours = 48`, so `horizon_steps = 48`.

The 48 hourly predictions get drawn on top of the 5-min historical line. The forecast line is therefore sparser than the actuals, but visually continuous because Plotly connects the anchor points.

## Distribution → quantiles

We do **not** Monte-Carlo sample.
Toto's output head is a parametric Student-t mixture (see the Toto 2.0 paper), and `model.forecast()` returns analytical quantiles directly:

```python
quantiles = model.forecast(
    {"target": target, "target_mask": target_mask, "series_ids": series_ids},
    horizon=horizon_steps,
)
# quantiles shape: (9, batch=1, n_variates=1, horizon)
# quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
```

We pluck three of them for the chart and the scoreboard:

| Display | Index | Quantile |
|---|---|---|
| Lower edge of the shaded band | `quantiles[0, 0, 0]` | p10 |
| Toto median line | `quantiles[4, 0, 0]` | p50 |
| Upper edge of the shaded band | `quantiles[8, 0, 0]` | p90 |

The shaded band is therefore the **80 % central interval** (p10–p90). The "±X°F at +24 h" chip on the hero is half of `(p90 − p10)` at the last forecast step.

## Inference cadence

| | |
|---|---|
| Trigger | a daemon thread inside the Space (`_autorefresh_loop`) and `demo.load` on a visitor's first request |
| Interval | every 15 minutes |
| Cache TTL | 14 minutes, slightly less than the autorefresh interval so the next tick always misses the cache and refetches Ecowitt + NWS |
| Per-tick cost | one `/device/real_time` + one `/device/history` per cycle_type touched + one NWS `/points` + one NWS `forecastHourly` + four univariate Toto forwards (one per metric) |
| CPU forward time | ~hundreds of milliseconds per metric on the free CPU tier; total wallclock per refresh is dominated by the network calls, not the model |

## Persistence

Every refresh writes to `data/forecasts.db` (SQLite):

- `forecast_snapshots(forecast_made_at, target_ts, source, metric, p10, p50, p90)`
- `actuals(target_ts, metric, value)`

`source ∈ {toto, nws}`. NWS rows store the point forecast in `p50` and leave `p10/p90` NULL.

A second SQLite, `data/ecowitt.db`, is the all-channel raw archive (populated by `src/sync.py`).
Both DBs are pushed to a private HF Dataset (`bitsofchris/toto-weather-forecast-log`) on every autorefresh tick so they survive Space rebuilds.

## Scoreboard: how the accuracy is calculated

The scoreboard answers one question: **over the last 48 hours, which model was closer to the actual reading at each hour?** The rules are designed so neither model gets to peek at data that wasn't available when the forecast was made.

### Inputs

- `actuals(target_ts, metric, value)`: Ecowitt readings resampled to the current display cadence (hourly by default).
- `forecast_snapshots(forecast_made_at, target_ts, source, metric, p10, p50, p90)`: every forecast either model has ever issued, with two timestamps:
  - `forecast_made_at`: when we ran inference / fetched NWS
  - `target_ts`: the hour the forecast is predicting

### Rule for picking which forecast counts

For each `(target_ts, source, metric)`:

1. Filter to forecasts whose `forecast_made_at <= target_ts`, i.e. forecasts issued while the model didn't yet know the actual value.
2. Of those, pick the one with the **largest** `forecast_made_at`: the most recent prediction issued *at or before* the target hour.

So for any past hour, Toto and NWS are both scored on their *latest pre-target opinion*. Both models are held to the same cutoff rule: no foresight, no stale snapshots.

### Aggregation

Once each `(target_ts, source, metric)` triple has been pinned to a single forecast row:

```text
abs_err    = |p50 − actual|
MAE_source = mean(abs_err) over target_ts in the last 48 h
n          = count(matching pairs)
```

The "lower is better" winner is whichever source has the smaller MAE for that metric. NWS doesn't expose pressure in `forecastHourly`, so the pressure scoreboard reports Toto only.

### Caveats / what this scoreboard is NOT

- We score the **point prediction** (`p50`) for both models. That throws away Toto's uncertainty: a wider interval neither hurts nor helps its MAE.
  A more Toto-flattering score would be CRPS or pinball loss, which credit well-calibrated intervals. We can layer that in later; MAE is what most people mean by "accuracy", which is why it's on the headline scoreboard.
- The window is a rolling 48 h, so the number you see depends on the last two days, not all history.
- We score every horizon distance lumped together. Toto's p50 at +6 h and at +24 h are both folded into the same MAE. Splitting by horizon (1 h MAE, 6 h MAE, 24 h MAE, …) is a likely next iteration.

The "Past forecasts" overlay on the chart uses the same query, so the scoreboard number and the chart line refer to identical predictions.
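
The selection rule and the MAE aggregation from the scoreboard section can be sketched as a single SQLite query. This is a minimal illustration, not the query `app.py` actually runs: the table and column names come from the Persistence section, but the column types, the ISO-8601 text timestamps, the inclusive/exclusive window bounds, the helper names (`scoreboard`, `build_demo_db`), and all demo rows are assumptions made for the example.

```python
import sqlite3
from datetime import datetime, timedelta

# Table/column names follow the document; column types and ISO-8601 text
# timestamps are assumptions made for this sketch.
SCHEMA = """
CREATE TABLE forecast_snapshots (
    forecast_made_at TEXT, target_ts TEXT, source TEXT, metric TEXT,
    p10 REAL, p50 REAL, p90 REAL
);
CREATE TABLE actuals (target_ts TEXT, metric TEXT, value REAL);
"""

SCOREBOARD_SQL = """
SELECT f.metric, f.source,
       AVG(ABS(f.p50 - a.value)) AS mae,
       COUNT(*)                  AS n
FROM forecast_snapshots AS f
JOIN actuals AS a
  ON a.target_ts = f.target_ts AND a.metric = f.metric
WHERE f.forecast_made_at = (
        -- latest forecast issued at or before the hour it predicts
        SELECT MAX(g.forecast_made_at)
        FROM forecast_snapshots AS g
        WHERE g.target_ts = f.target_ts
          AND g.source    = f.source
          AND g.metric    = f.metric
          AND g.forecast_made_at <= g.target_ts
      )
  AND f.target_ts >  :window_start
  AND f.target_ts <= :now
GROUP BY f.metric, f.source
ORDER BY f.metric, f.source
"""


def scoreboard(conn, now, window_hours=48):
    """Return (metric, source, mae, n) rows for the rolling window ending at `now`."""
    window_start = now - timedelta(hours=window_hours)
    params = {"window_start": window_start.isoformat(), "now": now.isoformat()}
    return conn.execute(SCOREBOARD_SQL, params).fetchall()


def build_demo_db():
    """In-memory database holding a few made-up rows (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO forecast_snapshots VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            # Older Toto forecast for 12:00: superseded by the 11:00 run.
            ("2024-01-02T06:00:00", "2024-01-02T12:00:00", "toto", "temperature", 47.0, 50.0, 53.0),
            # Latest pre-target Toto forecast for 12:00: this one is scored.
            ("2024-01-02T11:00:00", "2024-01-02T12:00:00", "toto", "temperature", 50.0, 52.0, 54.0),
            # Issued after the target hour: excluded by the cutoff rule.
            ("2024-01-02T13:00:00", "2024-01-02T12:00:00", "toto", "temperature", 53.0, 55.0, 57.0),
            # NWS point forecast: p10/p90 stay NULL.
            ("2024-01-02T10:00:00", "2024-01-02T12:00:00", "nws", "temperature", None, 58.0, None),
            ("2024-01-02T11:00:00", "2024-01-02T13:00:00", "toto", "temperature", 51.0, 54.0, 57.0),
        ],
    )
    conn.executemany(
        "INSERT INTO actuals VALUES (?, ?, ?)",
        [
            ("2024-01-02T12:00:00", "temperature", 53.0),
            ("2024-01-02T13:00:00", "temperature", 53.0),
        ],
    )
    return conn


rows = scoreboard(build_demo_db(), now=datetime(2024, 1, 2, 14, 0, 0))
for metric, source, mae, n in rows:
    print(f"{metric:12s} {source:5s} MAE={mae:.2f} n={n}")
```

The correlated `MAX(forecast_made_at)` subquery is one way to pin each `(target_ts, source, metric)` triple to a single row; a window function such as `ROW_NUMBER()` would work equally well on SQLite builds that support it. Text comparison of timestamps only behaves chronologically if every row uses the same ISO-8601 format and time zone.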