time-series-ai-weather-forecast / docs /toto-inference.md

bitsofchris

Bump Toto from 4 M to 22 M (small) — same data path, larger model

2b7eb68 12 days ago


	# Toto inference: how the demo makes its forecasts

	This is a precise spec of every Toto-related knob used by `app.py` so the
	post / footnote can quote it accurately.

	## Model

	| | |
	|---|---|
	| Model ID | `Datadog/Toto-2.0-22m` |
	| Parameters | ~22 M |
	| Source | https://huggingface.co/Datadog/Toto-2.0-22m |
	| Loaded via | `Toto2Model.from_pretrained(...)` (the `toto-2` package from DataDog/toto's `toto2/` subdir, pinned in `requirements.txt`) |
	| Hardware | CPU (HF Space free tier, no GPU) |
	| Patch size | `model.config.patch_size` (read at runtime; the context-length truncate/pad logic adapts automatically if the value differs across variants) |

	We started on the 4 M variant for the demo's "weakest model still says
	something useful" angle, then moved to 22 M for visibly tighter
	confidence bands and lower scoreboard MAE. The 22 M model is still small
	enough to run with sub-second CPU latency on the free HF Space tier.

	## Input data

	| | |
	|---|---|
	| Source | Ecowitt Cloud API v3 (`/device/history`) |
	| Station | Ecowitt GW3000B, Westhampton Beach NY |
	| Channels forecasted | `outdoor.temperature` (°F), `outdoor.humidity` (%), `pressure.relative` (inHg), `rainfall_piezo.rain_rate` (in/hr) |
	| Native storage cadence | 5 min at `cycle_type=5min` (the device is configured to upload at ~1-min intervals; Ecowitt buckets to 5 min for the 90-day tier). Earlier defaults of 30 min were the device's out-of-box upload schedule. |
	| `cycle_type` requested | `5min`, the finest tier the API exposes. The data lives in `data/ecowitt.db` (synced incrementally every 15 min); each refresh reads from the archive rather than hitting Ecowitt live. |
	| History window pulled from the archive | 7 days |
	| Resampling for the chart | `df.resample("5min").mean()` for a fine-grained display |
	| Resampling for Toto inference | `df.resample("1h").mean()`, a coarser series so the model receives about 168 hourly points (truncated to a 160-point context, see "Context length") and predicts a 48-step horizon, the regime where it forecasts cleanly. Decoupling the chart cadence from the model cadence keeps the visual fine and the model output honest. |
	| Cleaning | `Series.interpolate(limit_direction="both")` fills resample gaps before the tensor goes to Toto |
	| NWS comparison | `https://api.weather.gov/points/{lat},{lon}` → `forecastHourly` (point forecast, no distribution) |

	## Context length

	Toto requires the time axis of the input tensor to be a multiple of `patch_size = 32`.

	```text
	raw = history_series_after_resample_and_interpolate
	n_raw = len(raw)
	if n_raw >= 32:
	    n_ctx = (n_raw // 32) * 32              # truncate oldest points
	    target = raw[-n_ctx:]                   # keep the newest n_ctx values
	    target_mask = ones(n_ctx)               # all valid
	else:
	    n_ctx = 32                              # pad up to one patch
	    pad = 32 - n_raw
	    target = [first_value]*pad + raw        # left-pad with the first value
	    target_mask = [False]*pad + [True]*n_raw  # tell Toto to ignore the padded steps
	```

	With ~7 days of archive history and the hourly resample we use for
	inference, the series has about 168 points, which truncates to a
	context of 160 hourly points (5 patches of 32). The chart shows the
	same 7 days at 5-min cadence (≈2 016 points), but those raw points
	only feed the chart, not the model.
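
The truncate-or-pad rule above can be sketched in plain Python. This is a minimal illustration, not the code in `app.py`: the function name `fit_context` is hypothetical, and it works on lists rather than the torch tensors the app sends to Toto.

```python
def fit_context(values, patch_size=32):
    """Return (target, target_mask) whose length is a multiple of patch_size.

    Hypothetical helper: mirrors the pseudocode above using plain lists.
    """
    n_raw = len(values)
    if n_raw == 0:
        raise ValueError("need at least one observation")
    if n_raw >= patch_size:
        # Keep the newest points and drop the oldest remainder.
        n_ctx = (n_raw // patch_size) * patch_size
        target = list(values[-n_ctx:])
        mask = [True] * n_ctx
    else:
        # Left-pad up to one patch with the first value and mask the padding.
        pad = patch_size - n_raw
        target = [values[0]] * pad + list(values)
        mask = [False] * pad + [True] * n_raw
    return target, mask


# Seven days of hourly points: 168 values become a 160-point context.
week = [float(i) for i in range(168)]
target, mask = fit_context(week)
print(len(target), target[0], all(mask))
```

The eight oldest hourly points are dropped in the 7-day case, so the context always ends at the most recent observation.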

	## Tensor shape

	```text
	target: torch.float32, shape (batch=1, n_variates=1, time=n_ctx)
	target_mask: torch.bool, shape (batch=1, n_variates=1, time=n_ctx)
	series_ids: torch.long, shape (batch=1, n_variates=1) (all zeros — univariate)
	```

	We forecast each metric independently (univariate). Multivariate
	inference is a follow-up; the inference cost is comparable but the chart
	gets noisier and the post hook is easier to read one metric at a time.

	## Prediction length

	`horizon_steps = round(horizon_hours / step_hours)`, where `step_hours`
	is the forecast cadence (1 h), not the chart cadence (5 min). With
	`horizon_hours = 48`, `horizon_steps = 48`. The 48 hourly predictions
	are drawn on top of the 5-min historical line, so the forecast line is
	sparser than the actuals but visually continuous because Plotly
	connects the anchor points.
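
As a rough sketch of the arithmetic, the snippet below computes `horizon_steps` and the timestamps the forecast points would land on. The function name `forecast_timestamps` and the example date are assumptions for illustration; the app may build its time axis differently.

```python
from datetime import datetime, timedelta, timezone


def forecast_timestamps(last_obs, horizon_hours=48, step_hours=1.0):
    """Timestamps of each forecast step, starting one step after last_obs."""
    horizon_steps = round(horizon_hours / step_hours)
    step = timedelta(hours=step_hours)
    return [last_obs + step * (i + 1) for i in range(horizon_steps)]


# Hypothetical last observation time, used only for the example.
last_obs = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
stamps = forecast_timestamps(last_obs)
print(len(stamps), stamps[0].isoformat(), stamps[-1].isoformat())
```

Using `round` rather than integer division keeps the step count stable when `step_hours` is a fraction such as 5/60.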

	## Distribution → quantiles

	We do not Monte-Carlo sample. Toto's output head is a parametric
	Student-t mixture (see the Toto 2.0 paper), and `model.forecast()`
	returns analytical quantiles directly:

	```python
	# One call returns all nine quantile tracks for the requested horizon.
	quantiles = model.forecast(
	    {"target": target, "target_mask": target_mask, "series_ids": series_ids},
	    horizon=horizon_steps,
	)
	# quantiles shape: (9, batch=1, n_variates=1, horizon)
	# quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
	```

	We pluck three of them for the chart and the scoreboard:

	| Display | Index | Quantile |
	|---|---|---|
	| Lower edge of the shaded band | `quantiles[0, 0, 0]` | p10 |
	| Toto median line | `quantiles[4, 0, 0]` | p50 |
	| Upper edge of the shaded band | `quantiles[8, 0, 0]` | p90 |

	The shaded band is therefore the 80 % central interval (p10–p90).
	The "±X°F at +24 h" chip on the hero is half of `(p90 − p10)` at the
	last forecast step.
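
A small sketch of the indexing and the half-width arithmetic, using nested lists in place of the tensor so it runs without torch. The helper name `band_from_quantiles` and the mock values are made up for illustration; with a real tensor the chained indexing `quantiles[0][0][0]` is equivalent to `quantiles[0, 0, 0]`.

```python
def band_from_quantiles(quantiles, batch=0, variate=0):
    """Pick p10 / p50 / p90 tracks and the half-width of the 80 % band.

    `quantiles` is indexed as [quantile_level][batch][variate][step],
    with levels 0.1 ... 0.9 at indices 0 ... 8.
    """
    p10 = list(quantiles[0][batch][variate])
    p50 = list(quantiles[4][batch][variate])
    p90 = list(quantiles[8][batch][variate])
    half_width = [(hi - lo) / 2 for lo, hi in zip(p10, p90)]
    return p10, p50, p90, half_width


# Mock output with horizon = 3: level index k at step t has value 60 + t + (k - 4).
horizon = 3
mock = [[[[60.0 + t + (k - 4) for t in range(horizon)]]] for k in range(9)]

p10, p50, p90, half_width = band_from_quantiles(mock)
print(p10, p50, p90, half_width)
```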

	## Inference cadence

	| | |
	|---|---|
	| Trigger | a daemon thread inside the Space (`_autorefresh_loop`) and `demo.load` on a visitor's first request |
	| Interval | every 15 minutes |
	| Cache TTL | 14 minutes, slightly less than the autorefresh interval so the next tick always misses the cache and refetches Ecowitt + NWS |
	| Per-tick cost | one `/device/real_time` + one `/device/history` per cycle_type touched + one NWS `/points` + one NWS `forecastHourly` + four univariate Toto forwards (one per metric) |
	| CPU forward time | ~hundreds of milliseconds per metric on the free CPU tier; total wallclock per refresh is dominated by the network calls, not the model |
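
The interaction between the 15-minute tick and the 14-minute TTL can be illustrated with a tiny cache and a simulated clock. This is a sketch under assumptions: `TTLCache` and `refresh` are hypothetical names, and the real app's cache may be structured differently.

```python
import time


class TTLCache:
    """Cache one value and recompute it once it is older than ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._stored_at = None

    def get(self, compute):
        now = self._clock()
        expired = self._stored_at is None or now - self._stored_at >= self._ttl
        if expired:
            self._value = compute()
            self._stored_at = now
        return self._value


# Simulated clock so the example runs instantly.
fake_now = [0.0]
calls = []


def refresh():
    """Stand-in for the expensive fetch + inference step."""
    calls.append(fake_now[0])
    return f"forecast@{fake_now[0]:.0f}s"


cache = TTLCache(ttl_seconds=14 * 60, clock=lambda: fake_now[0])
results = []
# Autorefresh ticks at 0, 15 and 30 min, plus a visitor arriving at 5 min.
for t in (0, 5 * 60, 15 * 60, 30 * 60):
    fake_now[0] = float(t)
    results.append(cache.get(refresh))
print(calls)
```

The visitor at 5 minutes is served from the cache, while each 15-minute tick finds the entry expired and recomputes.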

	## Persistence

	Every refresh writes to `data/forecasts.db` (SQLite):

	- `forecast_snapshots(forecast_made_at, target_ts, source, metric, p10, p50, p90)`
	- `actuals(target_ts, metric, value)`

	`source ∈ {toto, nws}`. NWS rows store the point forecast in `p50` and
	leave `p10/p90` NULL.
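
For orientation, here is one way the two tables could be declared with Python's `sqlite3`. The column names come from the list above; the column types, the `CHECK` constraint, the ISO-8601 text timestamps, and the sample values are assumptions, not the app's actual schema.

```python
import sqlite3

# Assumed schema: only the column names are taken from the spec.
SCHEMA = """
CREATE TABLE IF NOT EXISTS forecast_snapshots (
    forecast_made_at TEXT NOT NULL,
    target_ts        TEXT NOT NULL,
    source           TEXT NOT NULL CHECK (source IN ('toto', 'nws')),
    metric           TEXT NOT NULL,
    p10              REAL,
    p50              REAL NOT NULL,
    p90              REAL
);
CREATE TABLE IF NOT EXISTS actuals (
    target_ts TEXT NOT NULL,
    metric    TEXT NOT NULL,
    value     REAL NOT NULL
);
"""

conn = sqlite3.connect(":memory:")  # the app writes to data/forecasts.db
conn.executescript(SCHEMA)

insert = "INSERT INTO forecast_snapshots VALUES (?, ?, ?, ?, ?, ?, ?)"
# Toto row: full p10 / p50 / p90 band (made-up numbers).
conn.execute(insert, ("2025-01-01T11:45:00Z", "2025-01-01T12:00:00Z",
                      "toto", "temperature", 38.5, 40.2, 42.1))
# NWS row: point forecast in p50, band left NULL.
conn.execute(insert, ("2025-01-01T11:45:00Z", "2025-01-01T12:00:00Z",
                      "nws", "temperature", None, 41.0, None))

rows = conn.execute(
    "SELECT source, p10, p50, p90 FROM forecast_snapshots ORDER BY source"
).fetchall()
print(rows)
```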

	A second SQLite, `data/ecowitt.db`, is the all-channel raw archive
	(populated by `src/sync.py`). Both DBs are pushed to a private HF Dataset
	(`bitsofchris/toto-weather-forecast-log`) on every autorefresh tick so
	they survive Space rebuilds.

	## Scoreboard — how the accuracy is calculated

	The scoreboard answers one question: **over the last 48 hours, which model
	was closer to the actual reading at each hour?** The rules are designed so
	neither model gets to peek at data that wasn't available when the forecast
	was made.

	### Inputs

	- `actuals(target_ts, metric, value)`: Ecowitt readings resampled to the
	  current display cadence (hourly by default).
	- `forecast_snapshots(forecast_made_at, target_ts, source, metric, p10, p50, p90)`:
	  every forecast either model has ever issued, with two timestamps:
	  - `forecast_made_at`: when we ran inference / fetched NWS
	  - `target_ts`: the hour the forecast is predicting

	### Rule for picking which forecast counts

	For each `(target_ts, source, metric)`:

	1. Filter to forecasts whose `forecast_made_at <= target_ts` — i.e. the
	model didn't yet know the actual value.
	2. Of those, pick the one with the largest `forecast_made_at` — the
	most recent prediction issued before the target hour.

	So for any past hour, Toto and NWS are both scored on their *latest
	pre-target opinion*. Both models are held to the same cutoff rule, so
	neither gets foresight and neither is scored on a stale snapshot when a
	fresher one exists.

	### Aggregation

	Once each `(target_ts, source, metric)` triple has been pinned to a
	single forecast row:

	```text
	abs_err = |p50 − actual|
	MAE_source = mean(abs_err) over target_ts in the last 48 h
	n = count(matching pairs)
	```

	The "lower is better" winner is whichever source has the smaller MAE for
	that metric. NWS doesn't expose pressure in `forecastHourly`, so the
	pressure scoreboard reports Toto only.
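
A minimal sketch of the aggregation, assuming the forecasts have already been pinned to one row per target hour and filtered to the 48 h window. The function name `mae_by_source`, the integer hour keys, and the numbers are made up for the example.

```python
def mae_by_source(pinned_p50, actuals):
    """Return {(source, metric): (MAE, n)} over matching forecast/actual pairs.

    pinned_p50: {(target_ts, source, metric): p50}
    actuals:    {(target_ts, metric): value}
    """
    sums = {}
    for (target_ts, source, metric), p50 in pinned_p50.items():
        actual = actuals.get((target_ts, metric))
        if actual is None:
            continue  # no reading for that hour, so the pair is not counted
        total, n = sums.get((source, metric), (0.0, 0))
        sums[(source, metric)] = (total + abs(p50 - actual), n + 1)
    return {key: (total / n, n) for key, (total, n) in sums.items()}


# Two target hours (keys 0 and 1) with hypothetical temperatures.
actuals = {(0, "temperature"): 40.5, (1, "temperature"): 43.0}
pinned_p50 = {
    (0, "toto", "temperature"): 40.0,
    (1, "toto", "temperature"): 42.0,
    (0, "nws", "temperature"): 41.0,
    (1, "nws", "temperature"): 45.0,
}

scores = mae_by_source(pinned_p50, actuals)
winner = min(scores, key=lambda key: scores[key][0])[0]
print(scores, winner)
```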

	### Caveats / what this scoreboard is NOT

	- We score the point prediction (`p50`) for both models. That throws
	  away Toto's uncertainty: a wider interval neither hurts nor helps its
	  MAE. A more Toto-flattering score would be CRPS or pinball loss,
	  which credits well-calibrated intervals. We can layer that in later;
	  MAE is what most people mean by "accuracy", which is why it's
	  on the headline scoreboard.
	- The window is rolling 48 h, so the number you see depends on the last
	two days, not all history.
	- We score every horizon distance lumped together. Toto's p50 at +6 h
	vs at +24 h are both folded into the same MAE. Splitting by horizon
	(1 h MAE, 6 h MAE, 24 h MAE, …) is a likely next iteration.
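
To make the pinball-loss alternative concrete, here is a short sketch that scores the three stored quantiles against one actual reading. It is not part of the current scoreboard; the function name and the sample numbers are assumptions.

```python
def pinball_loss(actual, predicted, level):
    """Pinball (quantile) loss for one prediction at quantile `level` in (0, 1)."""
    diff = actual - predicted
    return max(level * diff, (level - 1) * diff)


# Hypothetical forecast band and reading for one target hour.
actual = 45.0
band = {0.1: 40.0, 0.5: 42.0, 0.9: 44.0}

losses = {level: pinball_loss(actual, q, level) for level, q in band.items()}
mean_loss = sum(losses.values()) / len(losses)
print(losses, mean_loss)
```

At level 0.5 the pinball loss is half the absolute error, so the p50 term carries the same information as the MAE while the p10 and p90 terms reward a well-placed band.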

	The "Past forecasts" overlay on the chart uses the same query so the
	scoreboard number and the chart line refer to identical predictions.