---
tags:
- time-series-forecasting
- foundation-models
- pretrained-models
- time-series
- timeseries
- forecasting
- observability
- safetensors
- pytorch_model_hub_mixin
license: apache-2.0
pipeline_tag: time-series-forecasting
thumbnail: https://web-assets.dd-static.net/42588/1778691695-toto-2-hero.png
model-index:
- name: Toto-2.0-313m
  results:
    - task:
        type: time-series-forecasting
      dataset:
        name: BOOM
        type: BOOM
      metrics:
        - name: CRPS
          type: CRPS
          value: 0.351
        - name: MASE
          type: MASE
          value: 0.585
      source:
        name: BOOM 💥 Observability Time-Series Forecasting Leaderboard
        url: https://huggingface.co/spaces/Datadog/BOOM
    - task:
        type: time-series-forecasting
      dataset:
        name: GIFT-Eval
        type: GIFT-Eval
      metrics:
        - name: CRPS
          type: CRPS
          value: 0.481
        - name: MASE
          type: MASE
          value: 0.703
      source:
        name: GIFT-Eval Time Series Forecasting Leaderboard
        url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
    - task:
        type: time-series-forecasting
      dataset:
        name: TIME
        type: TIME
      metrics:
        - name: CRPS
          type: CRPS
          value: 0.535
        - name: MASE
          type: MASE
          value: 0.642
      source:
        name: TIME Benchmark Leaderboard
        url: https://huggingface.co/spaces/Real-TSF/TIME-leaderboard
---

# Toto-2.0-313m

Toto (Time Series Optimized Transformer for [Observability](https://www.datadoghq.com/knowledge-center/observability/)) is a family of time series foundation models for multivariate forecasting developed by [Datadog](https://www.datadoghq.com/). Toto 2.0 is the current generation, featuring u-μP-scaled transformers ranging from 4m to 2.5B parameters, all trained from a single recipe. Forecast quality improves reliably with parameter count across the family.

The family sets a new state of the art on three forecasting benchmarks: [BOOM](https://huggingface.co/spaces/Datadog/BOOM), our observability benchmark; [GIFT-Eval](https://huggingface.co/spaces/Salesforce/GIFT-Eval), the standard general-purpose benchmark; and the recent contamination-resistant [TIME](https://arxiv.org/abs/2602.12147) benchmark.

## 📊 Performance

<figure>
<img src="assets/pareto.png" alt="Pareto frontier on BOOM and GIFT-Eval">
<figcaption>Every Toto 2.0 size sits on or near the Pareto frontier on both BOOM and GIFT-Eval. The three largest sizes rank first, second, and third among foundation models on GIFT-Eval CRPS rank. On TIME, Toto 2.0 sizes take the top three spots on every metric, ahead of every other external foundation model evaluated.</figcaption>
</figure>

## ⚡ Quick Start

Inference code is available on [GitHub](https://github.com/DataDog/toto).

### Installation

```bash
pip install "toto-2 @ git+https://github.com/DataDog/toto.git#subdirectory=toto2"
```

### Inference Example

```python
import torch
from toto2 import Toto2Model

model = Toto2Model.from_pretrained("Datadog/Toto-2.0-313m")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()

# (batch, n_variates, time_steps)
target = torch.randn(1, 1, 512, device=device)
target_mask = torch.ones_like(target, dtype=torch.bool)
series_ids = torch.zeros(1, 1, dtype=torch.long, device=device)

# Returns quantiles of shape (9, batch, n_variates, horizon)
# Quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
quantiles = model.forecast(
    {"target": target, "target_mask": target_mask, "series_ids": series_ids},
    horizon=96,
    decode_block_size=768,
    has_missing_values=False,
)
```

For more examples, see the [Quick Start notebook](https://github.com/DataDog/toto/blob/main/toto2/notebooks/quick_start.ipynb) and [GluonTS integration notebook](https://github.com/DataDog/toto/blob/main/toto2/notebooks/gluonts_integration.ipynb).
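The `forecast` call returns the nine quantiles stacked along the first axis, in the order given in the comment above. A point forecast and prediction intervals can be read off by index; here is a minimal, tensor-free sketch of that indexing logic (the level-to-index mapping is taken from the inference example, everything else is illustrative):

```python
# Quantile levels produced by model.forecast, in output order
# (per the comment in the inference example above).
QUANTILE_LEVELS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def quantile_index(level: float) -> int:
    """Map a quantile level to its index along the first output axis."""
    return QUANTILE_LEVELS.index(level)

# The median (0.5 quantile) is a natural point forecast:
median_idx = quantile_index(0.5)

# An 80% central prediction interval spans the 0.1 and 0.9 quantiles:
lo_idx, hi_idx = quantile_index(0.1), quantile_index(0.9)

# With the `quantiles` tensor from the example above, this would read:
#   point = quantiles[median_idx]                # (batch, n_variates, horizon)
#   lower, upper = quantiles[lo_idx], quantiles[hi_idx]
```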

## 💾 Available Checkpoints

All five Toto 2.0 sizes share the same training recipe; pick a size based on your accuracy/latency budget. Latency is forward-pass time for a 1,024-step single-pass forecast at batch size 8 on a single A100.

| Model | Params | Weights (fp32) | Latency | Recommended for |
|:---:|:---:|:---:|---|---|
| [Toto‑2.0‑4m](https://huggingface.co/Datadog/Toto-2.0-4m)     | 4m   | 16 MB  | ~3.8&nbsp;ms  | Edge / CPU deployment; tightest latency or memory budgets. |
| [Toto‑2.0‑22m](https://huggingface.co/Datadog/Toto-2.0-22m)   | 22m  | 84 MB  | ~5.0&nbsp;ms  | Efficient default — matches or beats Toto 1.0 quality with ~7× fewer parameters. |
| [Toto‑2.0‑313m](https://huggingface.co/Datadog/Toto-2.0-313m) | 313m | 1.2 GB | ~15.4&nbsp;ms | Strong general-purpose checkpoint; top-3 foundation model on GIFT-Eval. |
| [Toto‑2.0‑1B](https://huggingface.co/Datadog/Toto-2.0-1B)     | 1B   | 3.9 GB | ~20.9&nbsp;ms | Best quality / cost tradeoff for production workloads. |
| [Toto‑2.0‑2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) | 2.5B | 9.1 GB | ~36.2&nbsp;ms | Highest accuracy; #1 foundation model on every benchmark. |

## ✨ Key Features

- **Zero-Shot Forecasting:** Forecast without fine-tuning on your specific time series.
- **Multi-Variate Support:** Efficiently process multiple variables using alternating time/variate attention.
- **Probabilistic Predictions:** Generate point forecasts and uncertainty estimates via a quantile output head.
- **Decoder-Only Architecture:** Support for variable prediction horizons and context lengths.
- **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4m → 2.5B).

## 🏗️ Architecture

<figure>
<img src="assets/architecture.png" alt="Overview of the Toto 2.0 architecture.">
<figcaption>A decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds <b>contiguous patch masking (CPM)</b> for single-pass parallel decoding, a <b>quantile output head</b> trained with pinball loss, a robust arcsinh input scaler, residual MLP patch projections, and is trained with NorMuon. See the <a href="#-additional-resources">technical report</a> for details.</figcaption>
</figure>
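Two components named in the caption are easy to state concretely: the robust arcsinh input scaler, and the pinball (quantile) loss that trains the quantile head. The sketch below is illustrative only, not Toto's actual implementation; the function names and the choice of centering/scale statistics are assumptions:

```python
import math

def arcsinh_scale(x: list[float], loc: float, scale: float) -> list[float]:
    """Robust input scaling: center and rescale, then squash heavy tails
    with arcsinh (which is ~linear near zero, ~logarithmic for large values).

    `loc` and `scale` would typically be robust statistics of the context
    window (e.g. a median and a robust spread); they are taken as given here.
    """
    return [math.asinh((v - loc) / scale) for v in x]

def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Pinball (quantile) loss for one prediction at quantile level q.

    Under-prediction is penalized with weight q and over-prediction with
    weight (1 - q), so minimizing it pushes y_pred toward the q-th
    quantile of the target distribution.
    """
    diff = y_true - y_pred
    return q * diff if diff >= 0 else (q - 1) * diff
```

For example, at `q = 0.9` under-predicting by 1 costs `0.9` while over-predicting by 1 costs only `0.1`, which is why a head trained on this loss learns a high quantile rather than the mean.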

## 🔗 Additional Resources

- **Technical Report** *(coming soon)*
- [Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)
- [GitHub Repository](https://github.com/DataDog/toto)
- [Toto 2.0 Collection](https://huggingface.co/collections/Datadog/toto-20) — all five base checkpoints
- [BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM) — Datadog's observability time-series benchmark
- [Toto 1.0 Weights](https://huggingface.co/Datadog/Toto-Open-Base-1.0)

## 📖 Citation

```bibtex
(citation coming soon)
```