Emaad committed on
Commit
9fe0b8a
·
verified ·
1 Parent(s): d6727ec

Refresh model card: add Pareto + architecture figures, TIME metrics, latency table

Files changed (4)
  1. .gitattributes +2 -0
  2. README.md +65 -30
  3. assets/architecture.png +3 -0
  4. assets/pareto.png +3 -0
.gitattributes CHANGED
@@ -34,3 +34,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  figures/architecture.png filter=lfs diff=lfs merge=lfs -text
+ assets/architecture.png filter=lfs diff=lfs merge=lfs -text
+ assets/pareto.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -15,7 +15,7 @@ thumbnail: https://corp.dd-static.net/img/about/presskit/kit/press_kit.png
  model-index:
  - name: Toto-2.0-313m
  results:
- - task:
  type: time-series-forecasting
  dataset:
  name: BOOM
@@ -30,9 +30,9 @@ model-index:
  source:
  name: BOOM 💥 Observability Time-Series Forecasting Leaderboard
  url: https://huggingface.co/spaces/Datadog/BOOM
- - task:
  type: time-series-forecasting
- dataset:
  name: GIFT-Eval
  type: GIFT-Eval
  metrics:
@@ -45,28 +45,54 @@ model-index:
  source:
  name: GIFT-Eval Time Series Forecasting Leaderboard
  url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
-
  ---
  # Toto-2.0-313m

- Toto (**T**ime Series **O**ptimized **T**ransformer for [**O**bservability](https://www.datadoghq.com/knowledge-center/observability/)) is a family of time series foundation models for multivariate forecasting developed by [Datadog](https://www.datadoghq.com/). **Toto 2.0** is the current generation, featuring u-μP-scaled transformers ranging from 4M to 2.5B parameters.

  ---

  ## ✨ Key Features

- - **Zero-Shot Forecasting**: Forecast without fine-tuning on your specific time series.
- - **Multi-Variate Support**: Efficiently process multiple variables using alternating time/variate attention.
- - **Probabilistic Predictions**: Generate point forecasts and uncertainty estimates via a quantile output head.
- - **Decoder-Only Architecture**: Support for variable prediction horizons and context lengths.
- - **u-μP Scaling**: Stable training transfer across all model sizes.

- <div style="width: 100%; margin: auto; padding: 1rem;">
- <img src="figures/architecture.png" alt="Toto 2.0 architecture" style="width: 100%; height: auto;" />
- <em style="display: block; margin-top: 0.5rem; text-align: center;">
- Overview of the Toto 2.0 architecture.
- </em>
- </div>

  ---

@@ -87,18 +113,21 @@ import torch
  from toto2 import Toto2Model

  model = Toto2Model.from_pretrained("Datadog/Toto-2.0-313m")
- model = model.to("cuda").eval()

  # (batch, n_variates, time_steps)
- target = torch.randn(1, 1, 512, device="cuda")
  target_mask = torch.ones_like(target, dtype=torch.bool)
- series_ids = torch.zeros(1, 1, dtype=torch.long, device="cuda")

  # Returns quantiles of shape (9, batch, n_variates, horizon)
  # Quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
  quantiles = model.forecast(
      {"target": target, "target_mask": target_mask, "series_ids": series_ids},
      horizon=96,
  )
  ```

@@ -108,22 +137,28 @@ For more examples, see the [Quick Start notebook](https://github.com/DataDog/tot

  ## 💾 Available Checkpoints

- | Checkpoint | Parameters |
- |---|---|
- | [Toto-2.0-4m](https://huggingface.co/Datadog/Toto-2.0-4m) | 4M |
- | [Toto-2.0-22m](https://huggingface.co/Datadog/Toto-2.0-22m) | 22M |
- | [Toto-2.0-313m](https://huggingface.co/Datadog/Toto-2.0-313m) | 313M |
- | [Toto-2.0-1B](https://huggingface.co/Datadog/Toto-2.0-1B) | 1B |
- | [Toto-2.0-2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) | 2.5B |

  ---

  ## 🔗 Additional Resources

- - **[Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)**
- - **[GitHub Repository](https://github.com/DataDog/toto)**
- - **[BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM)**
- - **[Toto 1.0 Weights](https://huggingface.co/Datadog/Toto-Open-Base-1.0)**

  ---

@@ -15,7 +15,7 @@ thumbnail: https://corp.dd-static.net/img/about/presskit/kit/press_kit.png
  model-index:
  - name: Toto-2.0-313m
  results:
+ - task:
  type: time-series-forecasting
  dataset:
  name: BOOM
@@ -30,9 +30,9 @@ model-index:
  source:
  name: BOOM 💥 Observability Time-Series Forecasting Leaderboard
  url: https://huggingface.co/spaces/Datadog/BOOM
+ - task:
  type: time-series-forecasting
+ dataset:
  name: GIFT-Eval
  type: GIFT-Eval
  metrics:
@@ -45,28 +45,54 @@ model-index:
  source:
  name: GIFT-Eval Time Series Forecasting Leaderboard
  url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
+ - task:
+ type: time-series-forecasting
+ dataset:
+ name: TIME
+ type: TIME
+ metrics:
+ - name: CRPS
+ type: CRPS
+ value: 0.535
+ - name: MASE
+ type: MASE
+ value: 0.642
+ source:
+ name: TIME Benchmark Leaderboard
+ url: https://huggingface.co/spaces/Real-TSF/TIME-leaderboard
  ---
+
  # Toto-2.0-313m

+ Toto (Time Series Optimized Transformer for [Observability](https://www.datadoghq.com/knowledge-center/observability/)) is a family of time series foundation models for multivariate forecasting developed by [Datadog](https://www.datadoghq.com/). Toto 2.0 is the current generation, featuring u-μP-scaled transformers ranging from 4M to 2.5B parameters, all trained from a single recipe. Forecast quality improves reliably with parameter count across the family, with no sign of saturation at 2.5B.
+
+ The family sets a new state of the art on three forecasting benchmarks: [BOOM](https://huggingface.co/spaces/Datadog/BOOM), our observability benchmark; [GIFT-Eval](https://huggingface.co/spaces/Salesforce/GIFT-Eval), the standard general-purpose benchmark; and the recent contamination-resistant [TIME](https://arxiv.org/abs/2602.12147) benchmark.

  ---

  ## ✨ Key Features

+ - **Zero-Shot Forecasting:** Forecast without fine-tuning on your specific time series.
+ - **Multi-Variate Support:** Efficiently process multiple variables using alternating time/variate attention.
+ - **Probabilistic Predictions:** Generate point forecasts and uncertainty estimates via a quantile output head.
+ - **Decoder-Only Architecture:** Support for variable prediction horizons and context lengths.
+ - **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4M → 2.5B).

+ ---
+
+ ## 🏗️ Architecture
+
+ ![Overview of the Toto 2.0 architecture.](assets/architecture.png)
+
+ A decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds **contiguous patch masking (CPM)** for single-pass parallel decoding, a **quantile output head** trained with pinball loss, a robust arcsinh input scaler, residual MLP patch projections, and is trained with NorMuon. See the [technical report](#-additional-resources) for details.
+
+ ---
+
+ ## 📊 Performance
+
+ ![Pareto frontier on BOOM and GIFT-Eval](assets/pareto.png)
+
+ Every Toto 2.0 size sits on or near the Pareto frontier on both BOOM and GIFT-Eval. The three largest sizes rank first, second, and third among foundation models on GIFT-Eval CRPS rank. On TIME, Toto 2.0 sizes take the top three spots on every metric, ahead of every other external foundation model evaluated.

  ---

@@ -87,18 +113,21 @@ import torch
  from toto2 import Toto2Model

  model = Toto2Model.from_pretrained("Datadog/Toto-2.0-313m")
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model = model.to(device).eval()

  # (batch, n_variates, time_steps)
+ target = torch.randn(1, 1, 512, device=device)
  target_mask = torch.ones_like(target, dtype=torch.bool)
+ series_ids = torch.zeros(1, 1, dtype=torch.long, device=device)

  # Returns quantiles of shape (9, batch, n_variates, horizon)
  # Quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
  quantiles = model.forecast(
      {"target": target, "target_mask": target_mask, "series_ids": series_ids},
      horizon=96,
+     decode_block_size=768,
+     has_missing_values=False,
  )
  ```

@@ -108,22 +137,28 @@ For more examples, see the [Quick Start notebook](https://github.com/DataDog/tot

  ## 💾 Available Checkpoints

+ All five Toto 2.0 sizes share the same training recipe; pick a size based on your accuracy/latency budget. Latencies are forward-pass time for a 1,024-step forecast at batch size 8 on a single A100.
+
+ | Model | Params | Single-pass latency<br>(1,024 horizon) | Block decoding<br>(block=768) | Recommended for |
+ |---|---|---|---|---|
+ | [Toto-2.0-4m](https://huggingface.co/Datadog/Toto-2.0-4m) | 4m | ~3.8 ms | ~10.0 ms | Edge / CPU deployment; tightest latency or memory budgets. |
+ | [Toto-2.0-22m](https://huggingface.co/Datadog/Toto-2.0-22m) | 22m | ~5.0 ms | ~12.8 ms | Efficient default - matches or beats Toto 1.0 quality with ~7× fewer parameters. |
+ | [Toto-2.0-313m](https://huggingface.co/Datadog/Toto-2.0-313m) | 313m | ~15.4 ms | ~32.4 ms | Strong general-purpose checkpoint; top-3 foundation model on GIFT-Eval. |
+ | [Toto-2.0-1B](https://huggingface.co/Datadog/Toto-2.0-1B) | 1B | ~20.9 ms | ~46.3 ms | Best quality / cost tradeoff for production workloads. |
+ | [Toto-2.0-2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) | 2.5B | ~36.2 ms | ~78.0 ms | Highest accuracy; #1 foundation model on every benchmark. |
+
+ > Single-pass decoding fills the entire horizon in one forward pass and is recommended up to ~768 steps. Block decoding generates the horizon in 768-step segments conditioned on the previous segment's median (with KV caching); it is slower but more stable at long horizons. Both modes use the same checkpoint.

  ---

  ## 🔗 Additional Resources

+ - **Technical Report** - *(coming soon)*
+ - [Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)
+ - [GitHub Repository](https://github.com/DataDog/toto)
+ - [Toto 2.0 Collection](https://huggingface.co/collections/Datadog/toto-20) - all five base checkpoints
+ - [BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM) - Datadog's observability time-series benchmark
+ - [Toto 1.0 Weights](https://huggingface.co/Datadog/Toto-Open-Base-1.0)

  ---
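Note on the README change: the new Architecture section says the quantile output head is trained with pinball loss, and the usage snippet lists nine quantile levels from 0.1 to 0.9. The following is a minimal sketch of the standard pinball (quantile) loss over those levels, in plain Python. It illustrates the textbook definition only; the function names are hypothetical and this is not taken from the `toto2` training code.

```python
from typing import Sequence

# Quantile levels as listed in the README usage snippet.
QUANTILE_LEVELS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Pinball loss for one observation at quantile level q (0 < q < 1).

    Under-prediction (y_true > y_pred) is weighted by q,
    over-prediction by (1 - q).
    """
    diff = y_true - y_pred
    return max(q * diff, (q - 1.0) * diff)


def mean_pinball_loss(
    y_true: Sequence[float],
    quantile_preds: Sequence[Sequence[float]],
    levels: Sequence[float] = QUANTILE_LEVELS,
) -> float:
    """Average pinball loss over all quantile levels and time steps.

    quantile_preds[i][t] is the predicted quantile at levels[i] for step t
    (hypothetical layout chosen for this illustration).
    """
    if len(quantile_preds) != len(levels):
        raise ValueError("need one prediction row per quantile level")
    total, count = 0.0, 0
    for q, preds in zip(levels, quantile_preds):
        if len(preds) != len(y_true):
            raise ValueError("each prediction row must match y_true in length")
        for y, p in zip(y_true, preds):
            total += pinball_loss(y, p, q)
            count += 1
    return total / count
```

For a high quantile such as 0.9, predicting too low is penalised more than predicting too high, which is what pushes each output of a quantile head toward its own level.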
 
assets/architecture.png ADDED

Git LFS Details

  • SHA256: 973196289f6036b880ec7fdb00fe0b1078215232bf58f0bdd6a27eeebfca46ef
  • Pointer size: 131 Bytes
  • Size of remote file: 437 kB
assets/pareto.png ADDED

Git LFS Details

  • SHA256: 756a059027357cf224effc92995755b2c50ca8396b68918135ef0e5226798294
  • Pointer size: 131 Bytes
  • Size of remote file: 302 kB
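
Note on the README change: the blockquote under the new checkpoint table describes block decoding as generating the horizon in 768-step segments, and the usage snippet passes `decode_block_size=768`. The sketch below shows only the segment arithmetic for a given horizon, assuming contiguous segments with a shorter final one. The helper name is hypothetical, and how `toto2` actually schedules its blocks is not confirmed by the diff.

```python
from typing import List, Tuple


def plan_decode_blocks(horizon: int, block_size: int = 768) -> List[Tuple[int, int]]:
    """Split a forecast horizon into contiguous (start, end) step ranges.

    Assumption for illustration: every block has block_size steps except
    possibly the last, which covers whatever remains of the horizon.
    """
    if horizon <= 0 or block_size <= 0:
        raise ValueError("horizon and block_size must be positive")
    return [
        (start, min(start + block_size, horizon))
        for start in range(0, horizon, block_size)
    ]


if __name__ == "__main__":
    # The latency table uses a 1,024-step horizon with block=768.
    print(plan_decode_blocks(1024, 768))
    # The usage snippet's horizon of 96 fits in a single block.
    print(plan_decode_blocks(96, 768))
```

Under this assumption, a 1,024-step horizon takes two segments (768 steps, then 256), while horizons up to 768 steps need only one.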