---
tags:
- time-series-forecasting
- foundation-models
- pretrained-models
- time-series
- timeseries
- forecasting
- observability
- safetensors
- pytorch_model_hub_mixin
license: apache-2.0
pipeline_tag: time-series-forecasting
thumbnail: https://web-assets.dd-static.net/42588/1778691695-toto-2-hero.png
model-index:
- name: Toto-2.0-22m
  results:
  - task:
      type: time-series-forecasting
    dataset:
      name: BOOM
      type: BOOM
    metrics:
    - name: CRPS
      type: CRPS
      value: 0.363
    - name: MASE
      type: MASE
      value: 0.601
    source:
      name: BOOM 💥 Observability Time-Series Forecasting Leaderboard
      url: https://huggingface.co/spaces/Datadog/BOOM
  - task:
      type: time-series-forecasting
    dataset:
      name: GIFT-Eval
      type: GIFT-Eval
    metrics:
    - name: CRPS
      type: CRPS
      value: 0.496
    - name: MASE
      type: MASE
      value: 0.719
    source:
      name: GIFT-Eval Time Series Forecasting Leaderboard
      url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
  - task:
      type: time-series-forecasting
    dataset:
      name: TIME
      type: TIME
    metrics:
    - name: CRPS
      type: CRPS
      value: 0.556
    - name: MASE
      type: MASE
      value: 0.668
    source:
      name: TIME Benchmark Leaderboard
      url: https://huggingface.co/spaces/Real-TSF/TIME-leaderboard
---
# Toto-2.0-22m
Toto (Time Series Optimized Transformer for [Observability](https://www.datadoghq.com/knowledge-center/observability/)) is a family of time series foundation models for multivariate forecasting developed by [Datadog](https://www.datadoghq.com/). Toto 2.0 is the current generation, featuring u-μP-scaled transformers ranging from 4m to 2.5B parameters, all trained from a single recipe. Forecast quality improves reliably with parameter count across the family.
The family sets a new state of the art on three forecasting benchmarks: [BOOM](https://huggingface.co/spaces/Datadog/BOOM), our observability benchmark; [GIFT-Eval](https://huggingface.co/spaces/Salesforce/GIFT-Eval), the standard general-purpose benchmark; and the recent contamination-resistant [TIME](https://arxiv.org/abs/2602.12147) benchmark.
## 📊 Performance
<figure>
<img src="assets/pareto.png" alt="Pareto frontier on BOOM and GIFT-Eval">
<figcaption>Every Toto 2.0 size sits on or near the Pareto frontier on both BOOM and GIFT-Eval. The three largest sizes rank first, second, and third among foundation models on GIFT-Eval CRPS rank. On TIME, Toto 2.0 sizes take the top three spots on every metric, ahead of every other external foundation model evaluated.</figcaption>
</figure>
## ⚡ Quick Start
Inference code is available on [GitHub](https://github.com/DataDog/toto).
### Installation
```bash
pip install "toto-2 @ git+https://github.com/DataDog/toto.git#subdirectory=toto2"
```
### Inference Example
```python
import torch
from toto2 import Toto2Model
model = Toto2Model.from_pretrained("Datadog/Toto-2.0-22m")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()
# (batch, n_variates, time_steps)
target = torch.randn(1, 1, 512, device=device)
target_mask = torch.ones_like(target, dtype=torch.bool)
series_ids = torch.zeros(1, 1, dtype=torch.long, device=device)
# Returns quantiles of shape (9, batch, n_variates, horizon)
# Quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
quantiles = model.forecast(
{"target": target, "target_mask": target_mask, "series_ids": series_ids},
horizon=96,
decode_block_size=768,
has_missing_values=False,
)
```
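The returned tensor stacks the nine quantile levels along its first axis, so point forecasts and prediction intervals fall out by indexing. Continuing the example above (the index positions follow directly from the documented quantile levels):

```python
# Axis 0 indexes the quantile levels [0.1, 0.2, ..., 0.9],
# so index 4 is the 0.5 quantile, i.e. the median point forecast.
median_forecast = quantiles[4]               # (batch, n_variates, horizon)
# The outermost levels give an 80% central prediction interval.
lower, upper = quantiles[0], quantiles[-1]   # 0.1 and 0.9 quantiles
print(median_forecast.shape)                 # torch.Size([1, 1, 96])
```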
For more examples, see the [Quick Start notebook](https://github.com/DataDog/toto/blob/main/toto2/notebooks/quick_start.ipynb) and [GluonTS integration notebook](https://github.com/DataDog/toto/blob/main/toto2/notebooks/gluonts_integration.ipynb).
## 💾 Available Checkpoints
All five Toto 2.0 sizes share the same training recipe; pick a size based on your accuracy/latency budget. Latency is forward-pass time for a 1,024-step single-pass forecast at batch size 8 on a single A100.
| Model | Params | Weights (fp32) | Latency | Recommended for |
|:---:|:---:|:---:|---|---|
| [Toto‑2.0‑4m](https://huggingface.co/Datadog/Toto-2.0-4m) | 4m | 16 MB | ~3.8 ms | Edge / CPU deployment; tightest latency or memory budgets. |
| [Toto‑2.0‑22m](https://huggingface.co/Datadog/Toto-2.0-22m) | 22m | 84 MB | ~5.0 ms | Efficient default — matches or beats Toto 1.0 quality with ~7× fewer parameters. |
| [Toto‑2.0‑313m](https://huggingface.co/Datadog/Toto-2.0-313m) | 313m | 1.2 GB | ~15.4 ms | Strong general-purpose checkpoint; top-3 foundation model on GIFT-Eval. |
| [Toto‑2.0‑1B](https://huggingface.co/Datadog/Toto-2.0-1B) | 1B | 3.9 GB | ~20.9 ms | Best quality / cost tradeoff for production workloads. |
| [Toto‑2.0‑2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) | 2.5B | 9.1 GB | ~36.2 ms | Highest accuracy; #1 foundation model on every benchmark. |
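Assuming every checkpoint exposes the same `from_pretrained` interface shown in the Quick Start, switching sizes is a one-line change:

```python
from toto2 import Toto2Model

# Swap in any hub id from the table above.
model = Toto2Model.from_pretrained("Datadog/Toto-2.0-313m")
```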
## ✨ Key Features
- **Zero-Shot Forecasting:** Forecast without fine-tuning on your specific time series.
- **Multi-Variate Support:** Efficiently process multiple variables using alternating time/variate attention (see the sketch after this list).
- **Probabilistic Predictions:** Generate point forecasts and uncertainty estimates via a quantile output head.
- **Decoder-Only Architecture:** Support for variable prediction horizons and context lengths.
- **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4m → 2.5B).
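As referenced in the multi-variate bullet above, here is a minimal multivariate sketch reusing the `forecast` call from the Quick Start. The interpretation of `series_ids` as grouping all variates of one series under a shared id is an assumption inferred from the single-series example:

```python
import torch
from toto2 import Toto2Model

model = Toto2Model.from_pretrained("Datadog/Toto-2.0-22m").eval()

# Three variates observed jointly over the same 512 steps.
target = torch.randn(1, 3, 512)
target_mask = torch.ones_like(target, dtype=torch.bool)
# Assumption: all three variates belong to the same series, so they
# share series id 0 (inferred from the single-variate Quick Start).
series_ids = torch.zeros(1, 3, dtype=torch.long)

quantiles = model.forecast(
    {"target": target, "target_mask": target_mask, "series_ids": series_ids},
    horizon=96,
    decode_block_size=768,
    has_missing_values=False,
)
print(quantiles.shape)  # (9, 1, 3, 96): quantiles, batch, variates, horizon
```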
## 🏗️ Architecture
<figure>
<img src="assets/architecture.png" alt="Overview of the Toto 2.0 architecture.">
<figcaption>A decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds <b>contiguous patch masking (CPM)</b> for single-pass parallel decoding, a <b>quantile output head</b> trained with pinball loss, a robust arcsinh input scaler, and residual MLP patch projections, and is trained with the NorMuon optimizer. See the <a href="#-additional-resources">technical report</a> for details.</figcaption>
</figure>
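The pinball (quantile) loss mentioned in the caption has a standard closed form; the sketch below illustrates that formula in PyTorch for a single quantile level, and is not the repository's training code:

```python
import torch

def pinball_loss(y: torch.Tensor, y_hat: torch.Tensor, q: float) -> torch.Tensor:
    """Standard pinball loss: under-prediction is weighted by q,
    over-prediction by (1 - q)."""
    diff = y - y_hat
    return torch.maximum(q * diff, (q - 1.0) * diff).mean()

# At q = 0.9 the head is penalized 9x more for forecasting below the
# observed value than above it, which pushes it toward the 0.9 quantile.
y = torch.tensor([10.0])
print(pinball_loss(y, torch.tensor([8.0]), 0.9))   # tensor(1.8000)
print(pinball_loss(y, torch.tensor([12.0]), 0.9))  # tensor(0.2000)
```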
## 🔗 Additional Resources
- **Technical Report** — *(coming soon)*
- [Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)
- [GitHub Repository](https://github.com/DataDog/toto)
- [Toto 2.0 Collection](https://huggingface.co/collections/Datadog/toto-20) — all five base checkpoints
- [BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM) — Datadog's observability time-series benchmark
- [Toto 1.0 Weights](https://huggingface.co/Datadog/Toto-Open-Base-1.0)
## 📖 Citation
```bibtex
(citation coming soon)
```