| # Toto inference: how the demo makes its forecasts |
|
|
| This is a precise spec of every Toto-related knob used by `app.py` so the |
| post / footnote can quote it accurately. |
|
|
| ## Model |
|
|
| | | | |
| |---|---| |
| | Model ID | `Datadog/Toto-2.0-22m` | |
| | Parameters | ~22 M | |
| | Source | https://huggingface.co/Datadog/Toto-2.0-22m | |
| | Loaded via | `Toto2Model.from_pretrained(...)` (the `toto-2` package from DataDog/toto's `toto2/` subdir, pinned in `requirements.txt`) | |
| Hardware | CPU (HF Space free tier, no GPU) | |
| | Patch size | `model.config.patch_size` (read at runtime; the context-length truncate/pad logic adapts automatically if the value differs across variants) | |
|
|
| We started on the 4 M variant for the demo's "weakest model still says |
| something useful" angle, then bumped to 22 M for visibly tighter |
| confidence bands and lower scoreboard MAE. It is still small enough to run |
| with sub-second CPU latency on the free HF Space tier. |
|
|
| ## Input data |
|
|
| | | | |
| |---|---| |
| | Source | Ecowitt Cloud API v3 (`/device/history`) | |
| | Station | Ecowitt GW3000B, Westhampton Beach NY | |
| Channels forecasted | `outdoor.temperature` (°F), `outdoor.humidity` (%), `pressure.relative` (inHg), `rainfall_piezo.rain_rate` (in/hr) | |
| | Native storage cadence | **5 min** at `cycle_type=5min` (the device is configured to upload at ~1-min intervals; Ecowitt buckets to 5 min for the 90-day tier). Earlier defaults of 30 min were the device's out-of-box upload schedule. | |
| `cycle_type` requested | `5min`, the finest tier the API exposes. The data lives in `data/ecowitt.db` (synced incrementally every 15 min); each refresh reads from the archive rather than hitting Ecowitt live. | |
| | History window pulled from the archive | 7 days | |
| Resampling for the chart | `df.resample("5min").mean()` → fine-grained display | |
| Resampling for Toto inference | `df.resample("1h").mean()` → coarser series, so the model receives about 168 hourly points (truncated to a 160-point context) plus a 48-step horizon, the regime where it forecasts cleanly. Decoupling the chart cadence from the model cadence keeps the visual fine and the model output honest. | |
| | Cleaning | `Series.interpolate(limit_direction="both")` fills resample gaps before the tensor goes to Toto | |
| NWS comparison | `https://api.weather.gov/points/{lat},{lon}` → `forecastHourly` (point forecast, no distribution) | |
|
|
| ## Context length |
|
|
| Toto requires the time axis of the input tensor to be a multiple of `patch_size = 32`. |
|
|
| ```text |
| raw = history_series_after_resample_and_interpolate |
| n_raw = len(raw) |
| if n_raw >= 32: |
|     n_ctx = (n_raw // 32) * 32                # largest multiple of 32 that fits |
|     target = raw[-n_ctx:]                     # truncate oldest points |
|     target_mask = ones(n_ctx)                 # all valid |
| else: |
|     n_ctx = 32                                # pad up to one patch |
|     pad = 32 - n_raw |
|     target = [first_value]*pad + raw          # left-pad with the first value |
|     target_mask = [False]*pad + [True]*n_raw  # tell Toto to ignore the padded steps |
| ``` |
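
The truncate-or-pad rule can also be written as a small runnable function. This is only a sketch: the name `fit_context` is hypothetical, plain Python lists stand in for the tensors that `app.py` builds, and `patch_size` defaults to 32 rather than being read from `model.config.patch_size`.

```python
from typing import List, Tuple


def fit_context(raw: List[float], patch_size: int = 32) -> Tuple[List[float], List[bool]]:
    """Truncate or left-pad `raw` so its length is a multiple of `patch_size`.

    Returns (target, target_mask); False in the mask marks padded steps.
    """
    if not raw:
        raise ValueError("need at least one observation")
    n_raw = len(raw)
    if n_raw >= patch_size:
        # Keep the newest n_ctx points, dropping the oldest remainder.
        n_ctx = (n_raw // patch_size) * patch_size
        return raw[-n_ctx:], [True] * n_ctx
    # Fewer points than one patch: left-pad with the first value and mask the padding.
    pad = patch_size - n_raw
    return [raw[0]] * pad + raw, [False] * pad + [True] * n_raw


# Seven days of hourly data: 168 points are cut down to 160 (5 patches).
hourly = [float(i) for i in range(168)]
target, target_mask = fit_context(hourly)
print(len(target), target[0], all(target_mask))  # 160 8.0 True
```

Keeping the newest points when truncating means the context always ends at the latest observation, which is where the forecast starts.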
|
|
| With ~7 days of archive history and the hourly resample we use for |
| inference, this gives a context of 160 hourly points (5 patches). The |
| chart shows the same 7 days at 5-min cadence (≈2 016 points), but |
| those raw points only feed the chart, not the model. |
|
|
| ## Tensor shape |
|
|
| ```text |
| target: torch.float32, shape (batch=1, n_variates=1, time=n_ctx) |
| target_mask: torch.bool, shape (batch=1, n_variates=1, time=n_ctx) |
| series_ids: torch.long, shape (batch=1, n_variates=1) (all zeros, univariate) |
| ``` |
|
|
| We forecast each metric **independently** (univariate). Multivariate |
| inference is a follow-up; the inference cost is comparable but the chart |
| gets noisier and the post hook is easier to read one metric at a time. |
|
|
| ## Prediction length |
|
|
| `horizon_steps = round(horizon_hours / step_hours)`, where |
| `step_hours` is the **forecast cadence** (1 h), not the chart cadence |
| (5 min). `horizon_hours = 48`, so `horizon_steps = 48`. The 48 hourly |
| predictions are drawn on top of the 5-min historical line. The |
| forecast line is therefore sparser than the actuals, but visually |
| continuous because Plotly connects the anchor points. |
|
|
| ## Distribution: quantiles |
|
|
| We do **not** Monte-Carlo sample. Toto's output head is a parametric |
| Student-t mixture (see the Toto 2.0 paper), and `model.forecast()` |
| returns analytical quantiles directly: |
|
|
| ```python |
| quantiles = model.forecast( |
| {"target": target, "target_mask": target_mask, "series_ids": series_ids}, |
| horizon=horizon_steps, |
| ) |
| # quantiles shape: (9, batch=1, n_variates=1, horizon) |
| # quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] |
| ``` |
|
|
| We pluck three of them for the chart and the scoreboard: |
|
|
| | Display | Index | Quantile | |
| |---|---|---| |
| | Lower edge of the shaded band | `quantiles[0, 0, 0]` | p10 | |
| | Toto median line | `quantiles[4, 0, 0]` | p50 | |
| | Upper edge of the shaded band | `quantiles[8, 0, 0]` | p90 | |
|
|
| The shaded band is therefore the **80 % central interval** (p10–p90). |
| The "±X°F at +24 h" chip on the hero is half of `(p90 - p10)` at the |
| last forecast step. |
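
As an illustration of the indexing in the table above, the sketch below pulls the three series out of a quantile array and computes the half-width used by the chip. The helper names `band_from_quantiles` and `half_width` are hypothetical, the numbers are synthetic, and nested Python lists stand in for the tensor that `model.forecast()` returns (with a real tensor one would typically convert each slice with `.tolist()` first).

```python
from typing import Dict, List, Sequence


def band_from_quantiles(quantiles: Sequence) -> Dict[str, List[float]]:
    """Pick p10 / p50 / p90 from an array shaped (9, batch=1, n_variates=1, horizon)."""
    return {
        "p10": list(quantiles[0][0][0]),  # index 0 -> quantile level 0.1
        "p50": list(quantiles[4][0][0]),  # index 4 -> quantile level 0.5
        "p90": list(quantiles[8][0][0]),  # index 8 -> quantile level 0.9
    }


def half_width(band: Dict[str, List[float]], step: int = -1) -> float:
    """Half of (p90 - p10) at one forecast step; defaults to the last step."""
    return (band["p90"][step] - band["p10"][step]) / 2.0


# Synthetic quantiles for a 3-step horizon: the spread widens with lead time.
horizon = 3
quantiles = [
    [[[60.0 + t + (q - 4) * 0.5 * (t + 1) for t in range(horizon)]]]
    for q in range(9)
]
band = band_from_quantiles(quantiles)
print(band["p50"])       # [60.0, 61.0, 62.0]
print(half_width(band))  # 6.0
```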
|
|
| ## Inference cadence |
|
|
| | | | |
| |---|---| |
| | Trigger | a daemon thread inside the Space (`_autorefresh_loop`) and `demo.load` on a visitor's first request | |
| | Interval | every 15 minutes | |
| Cache TTL | 14 minutes, slightly less than the autorefresh interval so the next tick always misses the cache and refetches Ecowitt + NWS | |
| | Per-tick cost | one `/device/real_time` + one `/device/history` per cycle_type touched + one NWS `/points` + one NWS `forecastHourly` + four univariate Toto forwards (one per metric) | |
| | CPU forward time | ~hundreds of milliseconds per metric on the free CPU tier; total wallclock per refresh is dominated by the network calls, not the model | |
| |
| ## Persistence |
| |
| Every refresh writes to `data/forecasts.db` (SQLite): |
| |
| - `forecast_snapshots(forecast_made_at, target_ts, source, metric, p10, p50, p90)` |
| - `actuals(target_ts, metric, value)` |
|
|
| `source ∈ {toto, nws}`. NWS rows store the point forecast in `p50` and |
| leave `p10/p90` NULL. |
|
|
| A second SQLite, `data/ecowitt.db`, is the all-channel raw archive |
| (populated by `src/sync.py`). Both DBs are pushed to a private HF Dataset |
| (`bitsofchris/toto-weather-forecast-log`) on every autorefresh tick so |
| they survive Space rebuilds. |
|
|
| ## Scoreboard: how the accuracy is calculated |
|
|
| The scoreboard answers one question: **over the last 48 hours, which model |
| was closer to the actual reading at each hour?** The rules are designed so |
| neither model gets to peek at data that wasn't available when the forecast |
| was made. |
|
|
| ### Inputs |
|
|
| - `actuals(target_ts, metric, value)`: Ecowitt readings resampled to the |
|   current display cadence (hourly by default). |
| - `forecast_snapshots(forecast_made_at, target_ts, source, metric, p10, p50, p90)`: |
|   every forecast either model has ever issued, with two timestamps: |
|   - `forecast_made_at`: when we ran inference / fetched NWS |
|   - `target_ts`: the hour the forecast is predicting |
|
|
| ### Rule for picking which forecast counts |
|
|
| For each `(target_ts, source, metric)`: |
|
|
| 1. Filter to forecasts whose `forecast_made_at <= target_ts`, i.e. the |
|    model didn't yet know the actual value. |
| 2. Of those, pick the one with the **largest** `forecast_made_at`: the |
|    most recent prediction issued *at or before* the target hour. |
|
|
| So for any past hour, Toto and NWS are both scored on their *latest |
| pre-target opinion*. Both models always have the same information cutoff; |
| no foresight, no stale snapshots. |
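
The selection rule maps naturally onto a grouped SQL query. The sketch below runs one against an in-memory SQLite table with the `forecast_snapshots` columns listed under Inputs. The query text, the function name `pinned_forecasts`, the sample rows, and the assumption that timestamps are stored as ISO-8601 strings (so string comparison matches time order) are all illustrative, not taken from `app.py`.

```python
import sqlite3

# For each (target_ts, source, metric), keep the snapshot with the largest
# forecast_made_at that is still <= target_ts.
LATEST_PRE_TARGET_SQL = """
SELECT f.target_ts, f.source, f.metric, f.forecast_made_at, f.p50
FROM forecast_snapshots AS f
JOIN (
    SELECT target_ts, source, metric, MAX(forecast_made_at) AS latest
    FROM forecast_snapshots
    WHERE forecast_made_at <= target_ts
    GROUP BY target_ts, source, metric
) AS pick
  ON  f.target_ts = pick.target_ts
  AND f.source = pick.source
  AND f.metric = pick.metric
  AND f.forecast_made_at = pick.latest
ORDER BY f.target_ts, f.source, f.metric
"""


def pinned_forecasts(conn: sqlite3.Connection) -> list:
    """Return one row per (target_ts, source, metric): the latest pre-target forecast."""
    return conn.execute(LATEST_PRE_TARGET_SQL).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE forecast_snapshots ("
    "forecast_made_at TEXT, target_ts TEXT, source TEXT, metric TEXT, "
    "p10 REAL, p50 REAL, p90 REAL)"
)
target = "2024-01-02T12:00"
conn.executemany(
    "INSERT INTO forecast_snapshots VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("2024-01-02T06:00", target, "toto", "temperature", 38.0, 40.0, 42.0),
        ("2024-01-02T11:45", target, "toto", "temperature", 39.5, 41.0, 42.5),
        ("2024-01-02T12:15", target, "toto", "temperature", 44.0, 45.0, 46.0),  # issued after target
        ("2024-01-02T10:00", target, "nws", "temperature", None, 43.0, None),
    ],
)
for row in pinned_forecasts(conn):
    print(row)
```

In this sample the 12:15 Toto snapshot is ignored because it was issued after the target hour, so Toto is scored on its 11:45 forecast and NWS on its 10:00 forecast. If two snapshots for the same key share an identical `forecast_made_at`, this query returns both; a real implementation would need a tie-break.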
|
|
| ### Aggregation |
|
|
| Once each `(target_ts, source, metric)` triple has been pinned to a |
| single forecast row: |
|
|
| ```text |
| abs_err = |p50 - actual| |
| MAE_source = mean(abs_err) over target_ts in the last 48 h |
| n = count(matching pairs) |
| ``` |
|
|
| The "lower is better" winner is whichever source has the smaller MAE for |
| that metric. NWS doesn't expose pressure in `forecastHourly`, so the |
| pressure scoreboard reports Toto only. |
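
A worked example of the aggregation, assuming the rows have already been pinned to one forecast per `(target_ts, source, metric)` and joined to their actuals. The function name `rolling_mae`, the row layout, and the numbers are hypothetical; rows with a missing forecast or actual are skipped, which is one reasonable reading of "matching pairs".

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Optional, Tuple

# (target_ts, source, p50, actual) for a single metric
Row = Tuple[datetime, str, Optional[float], Optional[float]]


def rolling_mae(
    rows: Iterable[Row],
    now: datetime,
    window: timedelta = timedelta(hours=48),
) -> Dict[str, Tuple[float, int]]:
    """Return {source: (MAE, n)} over target_ts values inside the rolling window."""
    totals: Dict[str, Tuple[float, int]] = {}
    for target_ts, source, p50, actual in rows:
        if p50 is None or actual is None:
            continue  # not a matching pair
        if not (now - window <= target_ts <= now):
            continue  # outside the rolling window
        err_sum, n = totals.get(source, (0.0, 0))
        totals[source] = (err_sum + abs(p50 - actual), n + 1)
    return {src: (err_sum / n, n) for src, (err_sum, n) in totals.items()}


now = datetime(2024, 1, 3, 12, 0)
rows = [
    (datetime(2024, 1, 2, 12, 0), "toto", 41.0, 42.0),
    (datetime(2024, 1, 2, 12, 0), "nws", 43.0, 42.0),
    (datetime(2024, 1, 3, 6, 0), "toto", 50.0, 49.0),
    (datetime(2024, 1, 3, 6, 0), "nws", 52.0, 49.0),
    (datetime(2023, 12, 30, 12, 0), "toto", 0.0, 100.0),  # older than 48 h, ignored
]
scores = rolling_mae(rows, now)
winner = min(scores, key=lambda src: scores[src][0])
print(scores)  # {'toto': (1.0, 2), 'nws': (2.0, 2)}
print(winner)  # toto
```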
|
|
| ### Caveats / what this scoreboard is NOT |
|
|
| - We score the **point prediction** (`p50`) for both models. That throws |
|   away Toto's uncertainty: a wider interval neither hurts nor helps its |
|   MAE. A more Toto-flattering scoring would be CRPS or pinball loss, |
|   which credits well-calibrated intervals. We can layer that in later; |
|   MAE is what most people mean by "accuracy", which is why it's |
|   on the headline scoreboard. |
| - The window is rolling 48 h, so the number you see depends on the last |
| two days, not all history. |
| - We score every horizon distance lumped together. Toto's p50 at +6 h |
| vs at +24 h are both folded into the same MAE. Splitting by horizon |
|   (1 h MAE, 6 h MAE, 24 h MAE, …) is a likely next iteration. |
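
For reference, the pinball (quantile) loss mentioned in the first caveat is short to write down. This is a sketch of the standard definition applied to the three stored quantiles, not something the demo currently computes; the function names and sample values are illustrative.

```python
from typing import Dict


def pinball_loss(actual: float, forecast: float, q: float) -> float:
    """Standard pinball loss for one quantile forecast at level q (0 < q < 1)."""
    diff = actual - forecast
    # Under-forecasts are weighted by q, over-forecasts by (1 - q).
    return max(q * diff, (q - 1.0) * diff)


def mean_pinball(actual: float, quantile_forecasts: Dict[float, float]) -> float:
    """Average the pinball loss over the available quantile levels."""
    losses = [pinball_loss(actual, f, q) for q, f in quantile_forecasts.items()]
    return sum(losses) / len(losses)


# One target hour: actual 70.0, with stored p10 / p50 / p90 forecasts.
score = mean_pinball(70.0, {0.1: 66.0, 0.5: 69.0, 0.9: 74.0})
print(round(score, 4))
```

At `q = 0.5` the pinball loss is half the absolute error, so a scoreboard built on it would rank `p50` forecasts the same way MAE does while also rewarding well-placed `p10` and `p90` edges. NWS rows store only `p50`, so a comparison on this metric would need a convention for the missing quantiles.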
|
|
| The "Past forecasts" overlay on the chart uses the same query so the |
| scoreboard number and the chart line refer to identical predictions. |
|
|