# Toto inference: how the demo makes its forecasts
This is a precise spec of every Toto-related knob used by `app.py` so the
post / footnote can quote it accurately.
## Model
| | |
|---|---|
| Model ID | `Datadog/Toto-2.0-22m` |
| Parameters | ~22 M |
| Source | https://huggingface.co/Datadog/Toto-2.0-22m |
| Loaded via | `Toto2Model.from_pretrained(...)` (the `toto-2` package from DataDog/toto's `toto2/` subdir, pinned in `requirements.txt`) |
| Hardware | CPU (HF Space free tier, no GPU) |
| Patch size | `model.config.patch_size` (read at runtime; the context-length truncate/pad logic adapts automatically if the value differs across variants) |
We started on the 4 M variant for the demo's "weakest model still says
something useful" angle, then bumped to 22 M for visibly tighter
confidence bands and lower scoreboard MAE. It is still small enough to
run with sub-second CPU latency on the free HF Space tier.
## Input data
| | |
|---|---|
| Source | Ecowitt Cloud API v3 (`/device/history`) |
| Station | Ecowitt GW3000B, Westhampton Beach NY |
| Channels forecasted | `outdoor.temperature` (°F), `outdoor.humidity` (%), `pressure.relative` (inHg), `rainfall_piezo.rain_rate` (in/hr) |
| Native storage cadence | **5 min** at `cycle_type=5min` (the device is configured to upload at ~1-min intervals; Ecowitt buckets to 5 min for the 90-day tier). Earlier defaults of 30 min were the device's out-of-box upload schedule. |
| `cycle_type` requested | `5min`, the finest tier the API exposes. The data lives in `data/ecowitt.db` (synced incrementally every 15 min); each refresh reads from the archive rather than hitting Ecowitt live. |
| History window pulled from the archive | 7 days |
| Resampling for the chart | `df.resample("5min").mean()` for a fine-grained display |
| Resampling for Toto inference | `df.resample("1h").mean()`: a coarser series, so the model receives about 168 hourly points (truncated to a 160-point context, see "Context length") plus a 48-step horizon, the regime where a small model forecasts cleanly. Decoupling the chart cadence from the model cadence keeps the visual fine and the model output honest. |
| Cleaning | `Series.interpolate(limit_direction="both")` fills resample gaps before the tensor goes to Toto |
| NWS comparison | `https://api.weather.gov/points/{lat},{lon}` → `forecastHourly` (point forecast, no distribution) |
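The two resampling paths in the table can be sketched as follows. This is a minimal illustration on synthetic data, not the code in `app.py`: the column name `temperature`, the start date, and the simulated outage are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute readings standing in for 7 days of rows from the archive.
index = pd.date_range("2024-01-01", periods=7 * 24 * 12, freq="5min")
df = pd.DataFrame({"temperature": np.linspace(30.0, 40.0, len(index))}, index=index)
df.iloc[100:130] = np.nan  # simulate a sensor outage of 2.5 hours

# Chart path: keep the fine cadence, gaps stay visible as NaN.
chart_series = df["temperature"].resample("5min").mean()

# Model path: coarsen to hourly, then fill any empty buckets before inference.
model_series = (
    df["temperature"]
    .resample("1h")
    .mean()
    .interpolate(limit_direction="both")
)
```

Here the hourly series has 168 points and no missing values, while the chart series keeps all 2,016 five-minute slots.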
## Context length
Toto requires the time axis of the input tensor to be a multiple of `patch_size = 32`.
```text
raw = history_series_after_resample_and_interpolate
n_raw = len(raw)
if n_raw >= 32:
    n_ctx = (n_raw // 32) * 32                  # truncate oldest points
    target = raw[-n_ctx:]                       # keep the most recent n_ctx points
    target_mask = ones(n_ctx)                   # all valid
else:
    n_ctx = 32                                  # pad up to one patch
    pad = 32 - n_raw
    target = [first_value]*pad + raw            # left-pad with the first value
    target_mask = [False]*pad + [True]*n_raw    # tell Toto to ignore the padded steps
```
With ~7 days of archive history and the hourly resample we use for
inference, the raw series has about 168 hourly points, which truncates
to a context of 160 (5 patches of 32). The chart shows the same 7 days
at 5-min cadence (≈2,016 points), but those raw points only feed the
chart, not the model.
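The truncate-or-pad rule above can be written as a small runnable function. This is a sketch using plain Python lists rather than tensors; the name `fit_context` is hypothetical and not taken from `app.py`.

```python
from typing import List, Tuple


def fit_context(raw: List[float], patch_size: int = 32) -> Tuple[List[float], List[bool]]:
    """Return (target, target_mask) whose length is a multiple of patch_size."""
    if not raw:
        raise ValueError("need at least one observation")
    n_raw = len(raw)
    if n_raw >= patch_size:
        # Enough history: drop the oldest points so whole patches remain.
        n_ctx = (n_raw // patch_size) * patch_size
        target = raw[-n_ctx:]
        target_mask = [True] * n_ctx
    else:
        # Too little history: left-pad with the first value and mask the padding.
        pad = patch_size - n_raw
        target = [raw[0]] * pad + raw
        target_mask = [False] * pad + [True] * n_raw
    return target, target_mask


# 7 days of hourly points: 168 raw values become a 160-point context.
hourly = [float(i) for i in range(168)]
target, target_mask = fit_context(hourly)

# A very short series is padded up to one full patch.
short_target, short_mask = fit_context([1.0, 2.0, 3.0])
```

Passing `model.config.patch_size` as `patch_size` would keep the same logic valid if a variant uses a different patch size.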
## Tensor shape
```text
target:      torch.float32, shape (batch=1, n_variates=1, time=n_ctx)
target_mask: torch.bool,    shape (batch=1, n_variates=1, time=n_ctx)
series_ids:  torch.long,    shape (batch=1, n_variates=1)   (all zeros, univariate)
```
We forecast each metric **independently** (univariate). Multivariate
inference is a follow-up; the inference cost is comparable but the chart
gets noisier and the post hook is easier to read one metric at a time.
## Prediction length
`horizon_steps = round(horizon_hours / step_hours)`, where `step_hours`
is the **forecast cadence** (1 h), not the chart cadence (5 min). With
`horizon_hours = 48`, `horizon_steps = 48`. The 48 hourly predictions
are drawn on top of the 5-min historical line, so the forecast line is
sparser than the actuals but visually continuous because Plotly
connects the anchor points.
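As a worked example of the step arithmetic, the sketch below computes `horizon_steps` and the timestamps the forecast points would land on. The helper `forecast_timestamps` is hypothetical, and it assumes the first forecast step falls one `step_hours` after the last observation.

```python
from datetime import datetime, timedelta
from typing import List


def horizon_steps(horizon_hours: float, step_hours: float) -> int:
    """Number of forecast steps needed to cover the horizon at the given cadence."""
    return round(horizon_hours / step_hours)


def forecast_timestamps(
    last_obs: datetime, horizon_hours: float = 48, step_hours: float = 1.0
) -> List[datetime]:
    """Timestamps of each forecast step (assumed to start one step after last_obs)."""
    n = horizon_steps(horizon_hours, step_hours)
    return [last_obs + timedelta(hours=step_hours * (i + 1)) for i in range(n)]


stamps = forecast_timestamps(datetime(2024, 1, 1, 12, 0))
steps_at_chart_cadence = horizon_steps(48, 5 / 60)  # what a 5-min model cadence would need
```

At the 1 h cadence this yields 48 timestamps; the same 48 h horizon at the 5-min chart cadence would need 576 steps, which is why the two cadences are kept separate.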
## Distribution → quantiles
We do **not** Monte-Carlo sample. Toto's output head is a parametric
Student-t mixture (see the Toto 2.0 paper), and `model.forecast()`
returns analytical quantiles directly:
```python
quantiles = model.forecast(
    {"target": target, "target_mask": target_mask, "series_ids": series_ids},
    horizon=horizon_steps,
)
# quantiles shape: (9, batch=1, n_variates=1, horizon)
# quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
```
We pluck three of them for the chart and the scoreboard:
| Display | Index | Quantile |
|---|---|---|
| Lower edge of the shaded band | `quantiles[0, 0, 0]` | p10 |
| Toto median line | `quantiles[4, 0, 0]` | p50 |
| Upper edge of the shaded band | `quantiles[8, 0, 0]` | p90 |
The shaded band is therefore the **80 % central interval** (p10 to p90).
The "±X°F at +24 h" chip on the hero is half of `(p90 − p10)` at the
last forecast step.
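The band and chip arithmetic can be illustrated without the model. The sketch below uses nested Python lists as a stand-in for the `(9, batch, n_variates, horizon)` output, so indexing is `quantiles[q][0][0]` instead of `quantiles[q, 0, 0]`; the numbers are made up and the helper name `band_half_width` is hypothetical.

```python
from typing import List

LEVELS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
OFFSETS = [-4, -3, -2, -1, 0, 1, 2, 3, 4]  # made-up spread per quantile level
HORIZON = 4

# Fake quantile output: the spread widens linearly with the forecast step.
quantiles = [[[[50.0 + off * (t + 1) for t in range(HORIZON)]]] for off in OFFSETS]


def band_half_width(q: List, step: int) -> float:
    """Half of (p90 - p10) at the given forecast step."""
    p10 = q[LEVELS.index(0.1)][0][0][step]
    p90 = q[LEVELS.index(0.9)][0][0][step]
    return (p90 - p10) / 2


median_line = quantiles[LEVELS.index(0.5)][0][0]
chip_value = band_half_width(quantiles, step=HORIZON - 1)
```

With these synthetic values the median stays at 50.0 and the half-width at the last step is 16.0.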
## Inference cadence
| | |
|---|---|
| Trigger | a daemon thread inside the Space (`_autorefresh_loop`) and `demo.load` on a visitor's first request |
| Interval | every 15 minutes |
| Cache TTL | 14 minutes, slightly less than the autorefresh interval, so the next tick always misses the cache and refetches Ecowitt + NWS |
| Per-tick cost | one `/device/real_time` + one `/device/history` per cycle_type touched + one NWS `/points` + one NWS `forecastHourly` + four univariate Toto forwards (one per metric) |
| CPU forward time | ~hundreds of milliseconds per metric on the free CPU tier; total wallclock per refresh is dominated by the network calls, not the model |
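The interplay between the 14-minute TTL and the 15-minute tick can be shown with a tiny cache. This is an illustrative sketch, not the Space's implementation: the `TTLCache` class and the injected fake clock are assumptions made so the behaviour can be checked without waiting.

```python
import time
from typing import Any, Callable, Optional, Tuple


class TTLCache:
    """Single-value cache that recomputes once the entry is older than ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._value: Any = None
        self._stored_at: Optional[float] = None

    def get_or_compute(self, compute: Callable[[], Any]) -> Tuple[Any, bool]:
        """Return (value, was_cache_hit)."""
        now = self.clock()
        if self._stored_at is not None and now - self._stored_at < self.ttl:
            return self._value, True
        self._value = compute()
        self._stored_at = now
        return self._value, False


# Fake clock (seconds) so the example runs instantly.
now = [0.0]
cache = TTLCache(ttl_seconds=14 * 60, clock=lambda: now[0])
refresh_count = [0]


def refresh() -> int:
    refresh_count[0] += 1
    return refresh_count[0]


_, hit_first = cache.get_or_compute(refresh)    # t = 0: cold cache, recompute
now[0] = 5 * 60
_, hit_visitor = cache.get_or_compute(refresh)  # t = 5 min: visitor gets cached result
now[0] = 15 * 60
_, hit_tick = cache.get_or_compute(refresh)     # t = 15 min: autorefresh tick misses
```

A visitor arriving mid-interval is served from the cache, while the tick at 15 minutes finds the entry expired and triggers a fresh fetch.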
## Persistence
Every refresh writes to `data/forecasts.db` (SQLite):
- `forecast_snapshots(forecast_made_at, target_ts, source, metric, p10, p50, p90)`
- `actuals(target_ts, metric, value)`
`source ∈ {toto, nws}`. NWS rows store the point forecast in `p50` and
leave `p10/p90` NULL.
A second SQLite, `data/ecowitt.db`, is the all-channel raw archive
(populated by `src/sync.py`). Both DBs are pushed to a private HF Dataset
(`bitsofchris/toto-weather-forecast-log`) on every autorefresh tick so
they survive Space rebuilds.
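For reference, the two tables in `data/forecasts.db` could be declared roughly as below. Only the table and column names come from this spec; the column types, the `CHECK` constraint, and the ISO-8601 text timestamps are assumptions for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for data/forecasts.db
conn.executescript(
    """
    CREATE TABLE forecast_snapshots (
        forecast_made_at TEXT NOT NULL,   -- assumed ISO-8601 text
        target_ts        TEXT NOT NULL,
        source           TEXT NOT NULL CHECK (source IN ('toto', 'nws')),
        metric           TEXT NOT NULL,
        p10              REAL,            -- NULL for NWS rows
        p50              REAL NOT NULL,
        p90              REAL             -- NULL for NWS rows
    );
    CREATE TABLE actuals (
        target_ts TEXT NOT NULL,
        metric    TEXT NOT NULL,
        value     REAL NOT NULL
    );
    """
)

# One Toto row with a full band, one NWS row with only the point forecast.
conn.executemany(
    "INSERT INTO forecast_snapshots VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("2024-01-02T11:45", "2024-01-02T12:00", "toto", "temperature", 40.5, 42.0, 43.5),
        ("2024-01-02T11:30", "2024-01-02T12:00", "nws", "temperature", None, 41.0, None),
    ],
)
nws_row = conn.execute(
    "SELECT p10, p50, p90 FROM forecast_snapshots WHERE source = 'nws'"
).fetchone()
```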
## Scoreboard: how the accuracy is calculated
The scoreboard answers one question: **over the last 48 hours, which model
was closer to the actual reading at each hour?** The rules are designed so
neither model gets to peek at data that wasn't available when the forecast
was made.
### Inputs
- `actuals(target_ts, metric, value)`: Ecowitt readings resampled to the
  current display cadence (hourly by default).
- `forecast_snapshots(forecast_made_at, target_ts, source, metric, p10, p50, p90)`:
  every forecast either model has ever issued, with two timestamps:
  - `forecast_made_at`: when we ran inference / fetched NWS
  - `target_ts`: the hour the forecast is predicting
### Rule for picking which forecast counts
For each `(target_ts, source, metric)`:
1. Filter to forecasts whose `forecast_made_at <= target_ts`, i.e. the
   model didn't yet know the actual value.
2. Of those, pick the one with the **largest** `forecast_made_at`: the
   most recent prediction issued no later than the target hour.
So for any past hour, Toto and NWS are both scored on their *latest
pre-target opinion*. Both models always have the same information cutoff;
no foresight, no stale snapshots.
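The two-step rule above maps onto a correlated subquery. The sketch below is one possible formulation, not necessarily the query the app runs; it assumes timestamps are stored as ISO-8601 text so that string comparison matches chronological order, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE forecast_snapshots ("
    "forecast_made_at TEXT, target_ts TEXT, source TEXT, metric TEXT,"
    " p10 REAL, p50 REAL, p90 REAL)"
)
target = "2024-01-02T12:00"
conn.executemany(
    "INSERT INTO forecast_snapshots VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("2024-01-01T12:00", target, "toto", "temperature", 38.0, 40.0, 42.0),  # older
        ("2024-01-02T11:45", target, "toto", "temperature", 41.0, 42.0, 43.0),  # latest valid
        ("2024-01-02T12:15", target, "toto", "temperature", 42.5, 43.0, 43.5),  # after target
        ("2024-01-02T11:30", target, "nws", "temperature", None, 41.0, None),
    ],
)

PINNED_SQL = """
SELECT f.target_ts, f.source, f.metric, f.p50
FROM forecast_snapshots AS f
WHERE f.forecast_made_at <= f.target_ts              -- step 1: no foresight
  AND f.forecast_made_at = (                         -- step 2: latest such forecast
      SELECT MAX(g.forecast_made_at)
      FROM forecast_snapshots AS g
      WHERE g.target_ts = f.target_ts
        AND g.source = f.source
        AND g.metric = f.metric
        AND g.forecast_made_at <= g.target_ts
  )
ORDER BY f.target_ts, f.source
"""
pinned = conn.execute(PINNED_SQL).fetchall()
```

For the sample rows, Toto is pinned to its 11:45 forecast (p50 of 42.0) and NWS to its 11:30 forecast (p50 of 41.0); the 12:15 Toto row is excluded because it was issued after the target hour.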
### Aggregation
Once each `(target_ts, source, metric)` triple has been pinned to a
single forecast row:
```text
abs_err    = |p50 − actual|
MAE_source = mean(abs_err) over target_ts in the last 48 h
n          = count(matching pairs)
```
The "lower is better" winner is whichever source has the smaller MAE for
that metric. NWS doesn't expose pressure in `forecastHourly`, so the
pressure scoreboard reports Toto only.
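A minimal sketch of the aggregation for a single metric follows. It assumes the pinned rows have already been selected and restricted to the last 48 h; the function name `mae_by_source` and the sample values are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def mae_by_source(
    pinned: List[Tuple[str, str, float]], actuals: Dict[str, float]
) -> Dict[str, Tuple[float, int]]:
    """Map each source to (MAE, n) over target hours that have an actual reading.

    pinned: (target_ts, source, p50) rows, one per target hour and source.
    actuals: target_ts -> observed value.
    """
    errors: Dict[str, List[float]] = defaultdict(list)
    for target_ts, source, p50 in pinned:
        if target_ts in actuals:  # skip hours with no matching actual
            errors[source].append(abs(p50 - actuals[target_ts]))
    return {src: (sum(errs) / len(errs), len(errs)) for src, errs in errors.items()}


actuals = {"2024-01-02T12:00": 40.0, "2024-01-02T13:00": 42.0}
pinned = [
    ("2024-01-02T12:00", "toto", 41.0),
    ("2024-01-02T13:00", "toto", 41.0),
    ("2024-01-02T12:00", "nws", 43.0),
    ("2024-01-02T13:00", "nws", 42.0),
]
scores = mae_by_source(pinned, actuals)
winner = min(scores, key=lambda src: scores[src][0])  # lower MAE wins
```

In this toy case Toto has errors of 1 and 1 (MAE 1.0) and NWS has errors of 3 and 0 (MAE 1.5), so Toto wins even though NWS was exactly right for one hour.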
### Caveats / what this scoreboard is NOT
- We score the **point prediction** (`p50`) for both models. That throws
  away Toto's uncertainty: a wider interval doesn't hurt or help its
  MAE. A more Toto-flattering scoring would be CRPS or pinball loss,
  which credits well-calibrated intervals. We can layer that in later;
  MAE is what most people intuit by "accuracy", which is why it's
  on the headline scoreboard.
- The window is rolling 48 h, so the number you see depends on the last
two days, not all history.
- We score every horizon distance lumped together. Toto's p50 at +6 h
  vs at +24 h are both folded into the same MAE. Splitting by horizon
  (1 h MAE, 6 h MAE, 24 h MAE, …) is a likely next iteration.
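To make the pinball-loss caveat concrete, here is the standard definition applied to a stored `(p10, p50, p90)` triple. The scoreboard does not compute this today; averaging over just these three levels is an assumption for the sketch, and the numbers are invented.

```python
from typing import Dict


def pinball_loss(actual: float, predicted: float, level: float) -> float:
    """Quantile (pinball) loss for one prediction at quantile level in (0, 1)."""
    diff = actual - predicted
    return max(level * diff, (level - 1) * diff)


def mean_pinball(actual: float, quantile_preds: Dict[float, float]) -> float:
    """Average pinball loss across the available quantile levels."""
    losses = [pinball_loss(actual, pred, lvl) for lvl, pred in quantile_preds.items()]
    return sum(losses) / len(losses)


# Two forecasts with the same median but different band widths.
tight = {0.1: 41.0, 0.5: 42.0, 0.9: 43.0}
wide = {0.1: 32.0, 0.5: 42.0, 0.9: 52.0}
actual = 42.5

tight_score = mean_pinball(actual, tight)
wide_score = mean_pinball(actual, wide)
```

Both forecasts have the same absolute error at `p50`, so MAE cannot tell them apart, but the needlessly wide band gets the larger pinball loss. At level 0.5 the pinball loss equals half the absolute error, so it stays comparable to the existing MAE column.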
The "Past forecasts" overlay on the chart uses the same query so the
scoreboard number and the chart line refer to identical predictions.