# Toto inference: how the demo makes its forecasts

This is a precise spec of every Toto-related knob used by `app.py` so the
post / footnote can quote it accurately.

## Model

| | |
|---|---|
| Model ID | `Datadog/Toto-2.0-22m` |
| Parameters | ~22 M |
| Source | https://huggingface.co/Datadog/Toto-2.0-22m |
| Loaded via | `Toto2Model.from_pretrained(...)` (the `toto-2` package from DataDog/toto's `toto2/` subdir, pinned in `requirements.txt`) |
| Hardware | CPU (HF Space free tier — no GPU) |
| Patch size | `model.config.patch_size` (read at runtime; the context-length truncate/pad logic adapts automatically if the value differs across variants) |

We started on the 4 M variant for the demo's "weakest model still says
something useful" angle, then moved to the 22 M variant for visibly
tighter confidence bands and a lower scoreboard MAE. It is still small
enough to run with sub-second CPU latency on the free HF Space tier.

## Input data

| | |
|---|---|
| Source | Ecowitt Cloud API v3 (`/device/history`) |
| Station | Ecowitt GW3000B, Westhampton Beach NY |
| Channels forecasted | `outdoor.temperature` (°F), `outdoor.humidity` (%), `pressure.relative` (inHg), `rainfall_piezo.rain_rate` (in/hr) |
| Native storage cadence | **5 min** at `cycle_type=5min` (the device is configured to upload at ~1-min intervals; Ecowitt buckets to 5 min for the 90-day tier). Earlier defaults of 30 min were the device's out-of-box upload schedule. |
| `cycle_type` requested | `5min` — finest tier the API exposes. The data lives in `data/ecowitt.db` (synced incrementally every 15 min); each refresh reads from the archive rather than hitting Ecowitt live. |
| History window pulled from the archive | 7 days |
| Resampling for the chart | `df.resample("5min").mean()` — fine-grained display |
| Resampling for Toto inference | `df.resample("1h").mean()`: a coarser series, so 7 days yields 168 hourly points (truncated to a 160-point context, see "Context length") plus a 48-step horizon, the regime where the model forecasts cleanly. Decoupling the chart cadence from the model cadence keeps the visual fine and the model output honest. |
| Cleaning | `Series.interpolate(limit_direction="both")` fills resample gaps before the tensor goes to Toto |
| NWS comparison | `https://api.weather.gov/points/{lat},{lon}` → `forecastHourly` (point forecast, no distribution) |
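
The two resampling paths and the gap fill from the table above can be sketched with pandas as follows. The data is synthetic and the column name `temperature` is a placeholder, not the app's actual schema.

```python
import numpy as np
import pandas as pd

# Synthetic 5-min readings for one day with one hour missing (made-up values).
idx = pd.date_range("2024-01-01", periods=288, freq="5min")
df = pd.DataFrame({"temperature": np.linspace(30.0, 40.0, len(idx))}, index=idx)
df.loc["2024-01-01 03:00":"2024-01-01 03:55", "temperature"] = np.nan

chart_df = df.resample("5min").mean()   # fine-grained series for the chart
model_df = df.resample("1h").mean()     # coarser series for inference

# The all-NaN hour becomes one NaN bucket; fill it before building the tensor.
model_series = model_df["temperature"].interpolate(limit_direction="both")
```

Here `limit_direction="both"` also fills gaps at the very start or end of the series, which a forward-only interpolation would leave as NaN.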

## Context length

Toto requires the length of the input tensor's time axis to be a multiple of `patch_size` (32 for this model, read from `model.config.patch_size` at runtime).

```text
raw = history_series_after_resample_and_interpolate
n_raw = len(raw)
if n_raw >= 32:
    n_ctx = (n_raw // 32) * 32                 # largest multiple of 32 that fits
    target = raw[-n_ctx:]                      # truncate oldest points
    target_mask = [True]*n_ctx                 # all valid
else:
    n_ctx = 32                                 # pad up to one patch
    pad = 32 - n_raw
    target = [first_value]*pad + raw           # left-pad with the first value
    target_mask = [False]*pad + [True]*n_raw   # tell Toto to ignore the padded steps
```

With ~7 days of archive history and the hourly resample we use for
inference, `n_raw` is 168 (7 × 24), which truncates to a context of
160 hourly points (5 patches of 32). The chart shows the same 7 days at
5-min cadence (≈2 016 points), but those points only feed the chart,
not the model.
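
As a runnable illustration of the truncate-or-pad rule, here is a minimal pure-Python sketch. The function name `build_context` is hypothetical, and plain lists stand in for the tensors the app builds.

```python
PATCH_SIZE = 32  # assumption: the app reads this from model.config.patch_size


def build_context(raw, patch_size=PATCH_SIZE):
    """Return (target, target_mask), both with a length that is a multiple of patch_size."""
    n_raw = len(raw)
    if n_raw == 0:
        raise ValueError("need at least one observation")
    if n_raw >= patch_size:
        n_ctx = (n_raw // patch_size) * patch_size
        target = list(raw[-n_ctx:])            # keep the newest n_ctx points
        mask = [True] * n_ctx                  # every step is a real observation
    else:
        pad = patch_size - n_raw
        target = [raw[0]] * pad + list(raw)    # left-pad with the first value
        mask = [False] * pad + [True] * n_raw  # padded steps are masked out
    return target, mask


# 7 days of hourly data: 168 points in, 160 points (5 patches) out.
target, target_mask = build_context([float(i) for i in range(168)])
```

With 168 input points the eight oldest are dropped, so `target[0]` is the ninth observation.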

## Tensor shape

```text
target:      torch.float32, shape (batch=1, n_variates=1, time=n_ctx)
target_mask: torch.bool,    shape (batch=1, n_variates=1, time=n_ctx)
series_ids:  torch.long,    shape (batch=1, n_variates=1)            (all zeros — univariate)
```

We forecast each metric **independently** (univariate). Multivariate
inference is a possible follow-up: the inference cost is comparable, but
the chart gets noisier, and the post hook is easier to read one metric
at a time.

## Prediction length

`horizon_steps = round(horizon_hours / step_hours)`, where `step_hours`
is the **forecast cadence** (1 h), not the chart cadence (5 min). With
`horizon_hours = 48`, `horizon_steps = 48`. The 48 hourly predictions
are drawn on top of the 5-min historical line, so the forecast line is
sparser than the actuals but visually continuous because Plotly
connects the anchor points.
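
The step arithmetic can be sketched as below. The helper `forecast_timestamps` is hypothetical, and it assumes the first forecast step lands one `step_hours` after the last observed timestamp, which this document does not state.

```python
from datetime import datetime, timedelta


def forecast_timestamps(last_obs, horizon_hours=48.0, step_hours=1.0):
    """Return one timestamp per forecast step, starting one step after last_obs."""
    horizon_steps = round(horizon_hours / step_hours)
    step = timedelta(hours=step_hours)
    return [last_obs + step * (i + 1) for i in range(horizon_steps)]


stamps = forecast_timestamps(datetime(2024, 1, 1, 12, 0))
```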

## Distribution → quantiles

We do **not** Monte-Carlo sample. Toto's output head is a parametric
Student-t mixture (see the Toto 2.0 paper), and `model.forecast()`
returns analytical quantiles directly:

```python
quantiles = model.forecast(
    {"target": target, "target_mask": target_mask, "series_ids": series_ids},
    horizon=horizon_steps,
)
# quantiles shape: (9, batch=1, n_variates=1, horizon)
# quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
```

We pluck three of them for the chart and the scoreboard:

| Display | Index | Quantile |
|---|---|---|
| Lower edge of the shaded band | `quantiles[0, 0, 0]` | p10 |
| Toto median line | `quantiles[4, 0, 0]` | p50 |
| Upper edge of the shaded band | `quantiles[8, 0, 0]` | p90 |

The shaded band is therefore the **80 % central interval** (p10–p90).
The "±X°F at +24 h" chip on the hero is half of `(p90 − p10)` at the
last forecast step.
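
The indexing in the table can be sketched without a model. The helpers `pick_band` and `half_width` are hypothetical names, and nested lists with made-up values stand in for the `(9, 1, 1, horizon)` tensor.

```python
QUANTILE_LEVELS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def pick_band(quantiles, batch=0, variate=0):
    """Return (p10, p50, p90) series from a (9, batch, variate, horizon) array-like."""
    p10 = quantiles[QUANTILE_LEVELS.index(0.1)][batch][variate]
    p50 = quantiles[QUANTILE_LEVELS.index(0.5)][batch][variate]
    p90 = quantiles[QUANTILE_LEVELS.index(0.9)][batch][variate]
    return p10, p50, p90


def half_width(p10, p90, step=-1):
    """Half of the p10-p90 spread at one forecast step (default: the last one)."""
    return (p90[step] - p10[step]) / 2


# Made-up quantiles for a 3-step horizon; the spread widens with lead time.
horizon = 3
fake = [[[[50.0 + (q - 4) * (t + 1) for t in range(horizon)]]] for q in range(9)]
p10, p50, p90 = pick_band(fake)
chip = half_width(p10, p90)
```

The same indexing pattern works on a tensor, since `quantiles[0][0][0]` and `quantiles[0, 0, 0]` select the same slice.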

## Inference cadence

| | |
|---|---|
| Trigger | a daemon thread inside the Space (`_autorefresh_loop`) and `demo.load` on a visitor's first request |
| Interval | every 15 minutes |
| Cache TTL | 14 minutes — slightly less than the autorefresh interval so the next tick always misses the cache and refetches Ecowitt + NWS |
| Per-tick cost | one `/device/real_time` + one `/device/history` per cycle_type touched + one NWS `/points` + one NWS `forecastHourly` + four univariate Toto forwards (one per metric) |
| CPU forward time | ~hundreds of milliseconds per metric on the free CPU tier; total wallclock per refresh is dominated by the network calls, not the model |
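
The relationship between the 15-minute tick and the 14-minute TTL can be sketched with a tiny single-value cache. The class `TTLCache` is hypothetical and is not the app's actual caching code.

```python
import time

REFRESH_INTERVAL_S = 15 * 60   # autorefresh tick
CACHE_TTL_S = 14 * 60          # slightly shorter, so every tick recomputes


class TTLCache:
    """Hold one value and recompute it once it is at least ttl_s seconds old."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self._ttl = ttl_s
        self._clock = clock      # injectable so the behaviour can be tested
        self._stamp = None
        self._value = None

    def get(self, compute):
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self._ttl:
            self._value = compute()   # cache miss: run the expensive refresh
            self._stamp = now
        return self._value
```

A visitor arriving 10 minutes after a tick gets the cached result, while the next tick at 15 minutes finds the entry expired and refreshes it.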

## Persistence

Every refresh writes to `data/forecasts.db` (SQLite):

- `forecast_snapshots(forecast_made_at, target_ts, source, metric, p10, p50, p90)`
- `actuals(target_ts, metric, value)`

`source ∈ {toto, nws}`. NWS rows store the point forecast in `p50` and
leave `p10/p90` NULL.
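
A possible shape for these two tables is sketched below. The column names come from the list above; the column types, the `CHECK` constraint, and the ISO-8601 text timestamps are assumptions, since the actual schema is not shown here.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS forecast_snapshots (
    forecast_made_at TEXT NOT NULL,   -- assumed ISO-8601 text
    target_ts        TEXT NOT NULL,
    source           TEXT NOT NULL CHECK (source IN ('toto', 'nws')),
    metric           TEXT NOT NULL,
    p10              REAL,            -- NULL for NWS rows
    p50              REAL NOT NULL,
    p90              REAL             -- NULL for NWS rows
);
CREATE TABLE IF NOT EXISTS actuals (
    target_ts TEXT NOT NULL,
    metric    TEXT NOT NULL,
    value     REAL NOT NULL
);
"""

conn = sqlite3.connect(":memory:")   # in-memory stand-in for data/forecasts.db
conn.executescript(SCHEMA)
insert = "INSERT INTO forecast_snapshots VALUES (?, ?, ?, ?, ?, ?, ?)"
conn.execute(insert, ("2024-01-01T11:45:00", "2024-01-01T12:00:00",
                      "toto", "temperature", 39.5, 41.0, 42.5))
conn.execute(insert, ("2024-01-01T11:45:00", "2024-01-01T12:00:00",
                      "nws", "temperature", None, 43.0, None))
conn.commit()
```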

A second SQLite, `data/ecowitt.db`, is the all-channel raw archive
(populated by `src/sync.py`). Both DBs are pushed to a private HF Dataset
(`bitsofchris/toto-weather-forecast-log`) on every autorefresh tick so
they survive Space rebuilds.

## Scoreboard — how the accuracy is calculated

The scoreboard answers one question: **over the last 48 hours, which model
was closer to the actual reading at each hour?** The rules are designed so
neither model gets to peek at data that wasn't available when the forecast
was made.

### Inputs

- `actuals(target_ts, metric, value)` — Ecowitt readings resampled to the
  current display cadence (hourly by default).
- `forecast_snapshots(forecast_made_at, target_ts, source, metric, p10, p50, p90)`
  — every forecast either model has ever issued, with two timestamps:
    - `forecast_made_at` — when we ran inference / fetched NWS
    - `target_ts` — the hour the forecast is predicting

### Rule for picking which forecast counts

For each `(target_ts, source, metric)`:

1. Filter to forecasts whose `forecast_made_at <= target_ts`, i.e. those
   issued before the actual value was known.
2. Of those, pick the one with the **largest** `forecast_made_at`: the
   most recent prediction issued *at or before* the target hour.

So for any past hour, Toto and NWS are both scored on their *latest
pre-target opinion*. Both models are held to the same cutoff rule: no
foresight, and no older snapshot when a newer pre-target one exists.
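
The selection rule can be expressed as a single SQL query. This is a sketch rather than the app's actual query: it assumes timestamps are stored as ISO-8601 text (so string order matches time order), and the table is recreated in memory with made-up rows.

```python
import sqlite3

LATEST_PRE_TARGET_SQL = """
SELECT f.target_ts, f.source, f.metric, f.forecast_made_at, f.p50
FROM forecast_snapshots AS f
WHERE f.forecast_made_at = (
    SELECT MAX(g.forecast_made_at)            -- rule 2: most recent
    FROM forecast_snapshots AS g
    WHERE g.target_ts = f.target_ts
      AND g.source = f.source
      AND g.metric = f.metric
      AND g.forecast_made_at <= g.target_ts   -- rule 1: no foresight
)
ORDER BY f.target_ts, f.source, f.metric
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE forecast_snapshots ("
    "forecast_made_at TEXT, target_ts TEXT, source TEXT, metric TEXT, "
    "p10 REAL, p50 REAL, p90 REAL)"
)
target = "2024-01-01T12:00:00"
conn.executemany(
    "INSERT INTO forecast_snapshots VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("2024-01-01T10:00:00", target, "toto", "temperature", 38.0, 40.0, 42.0),
        ("2024-01-01T11:45:00", target, "toto", "temperature", 39.5, 41.0, 42.5),
        ("2024-01-01T12:15:00", target, "toto", "temperature", 40.0, 99.0, 101.0),
        ("2024-01-01T11:45:00", target, "nws", "temperature", None, 43.0, None),
    ],
)
pinned = conn.execute(LATEST_PRE_TARGET_SQL).fetchall()
```

The 12:15 Toto row is ignored because it was issued after the target hour, and the 10:00 row loses to the more recent 11:45 one.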

### Aggregation

Once each `(target_ts, source, metric)` triple has been pinned to a
single forecast row:

```text
abs_err = |p50 − actual|
MAE_source = mean(abs_err)   over target_ts in the last 48 h
n          = count(matching pairs)
```

The "lower is better" winner is whichever source has the smaller MAE for
that metric. NWS doesn't expose pressure in `forecastHourly`, so the
pressure scoreboard reports Toto only.
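
A minimal sketch of the aggregation for one metric follows. The function `mae_by_source` is hypothetical, and it assumes the input has already been pinned to one forecast per target and filtered to the 48 h window.

```python
def mae_by_source(pinned, actuals):
    """pinned: iterable of (target_ts, source, p50); actuals: {target_ts: value}.

    Returns {source: (mae, n)} over the pairs that have a matching actual.
    """
    errors = {}
    for target_ts, source, p50 in pinned:
        if target_ts in actuals:   # skip targets with no reading yet
            errors.setdefault(source, []).append(abs(p50 - actuals[target_ts]))
    return {s: (sum(e) / len(e), len(e)) for s, e in errors.items()}


# Made-up readings and forecasts; the 14:00 target has no actual yet.
actuals = {"12:00": 40.0, "13:00": 42.0}
pinned = [
    ("12:00", "toto", 41.0), ("13:00", "toto", 42.5), ("14:00", "toto", 44.0),
    ("12:00", "nws", 43.0), ("13:00", "nws", 41.0),
]
scores = mae_by_source(pinned, actuals)
winner = min(scores, key=lambda s: scores[s][0])   # lower MAE wins
```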

### Caveats / what this scoreboard is NOT

- We score the **point prediction** (`p50`) for both models. That throws
  away Toto's uncertainty: a wider interval neither hurts nor helps its
  MAE. A more Toto-flattering scoring rule would be CRPS or pinball
  loss, both of which credit well-calibrated intervals. We can layer
  that in later; MAE is what most people mean by "accuracy", which is
  why it's on the headline scoreboard.
- The window is rolling 48 h, so the number you see depends on the last
  two days, not all history.
- We score every horizon distance lumped together. Toto's p50 at +6 h
  and at +24 h are both folded into the same MAE. Splitting by horizon
  (1 h MAE, 6 h MAE, 24 h MAE, …) is a likely next iteration.

The "Past forecasts" overlay on the chart uses the same query so the
scoreboard number and the chart line refer to identical predictions.
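
For reference, the pinball loss mentioned in the caveats can be sketched as follows. This is not part of the current scoreboard, and averaging over the three stored quantiles (p10, p50, p90) is an assumption about how it might be applied.

```python
def pinball_loss(actual, predicted, q):
    """Quantile (pinball) loss for one prediction at quantile level q."""
    diff = actual - predicted
    return max(q * diff, (q - 1) * diff)


def mean_pinball(actual, quantile_preds):
    """quantile_preds: {level: predicted value}, e.g. {0.1: p10, 0.5: p50, 0.9: p90}."""
    losses = [pinball_loss(actual, v, q) for q, v in quantile_preds.items()]
    return sum(losses) / len(losses)


# Two made-up forecasts with the same median but different band widths.
narrow = mean_pinball(40.0, {0.1: 39.0, 0.5: 40.0, 0.9: 41.0})
wide = mean_pinball(40.0, {0.1: 30.0, 0.5: 40.0, 0.9: 50.0})
```

Both forecasts have zero absolute error at `p50`, so MAE cannot tell them apart, while the mean pinball loss is lower for the narrow band that still contains the actual value.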