Refresh model card: add Pareto + architecture figures, TIME metrics, latency table

Files changed (4)

.gitattributes +2 -0
README.md +65 -30
assets/architecture.png +3 -0
assets/pareto.png +3 -0

.gitattributes CHANGED

@@ -34,3 +34,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 figures/architecture.png filter=lfs diff=lfs merge=lfs -text
+assets/architecture.png filter=lfs diff=lfs merge=lfs -text
+assets/pareto.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -15,7 +15,7 @@ thumbnail: https://corp.dd-static.net/img/about/presskit/kit/press_kit.png
 model-index:
 - name: Toto-2.0-313m
   results:
-    - task:
         type: time-series-forecasting
       dataset:
         name: BOOM
@@ -30,9 +30,9 @@ model-index:
       source:
         name: BOOM 💥 Observability Time-Series Forecasting Leaderboard
         url: https://huggingface.co/spaces/Datadog/BOOM
-    - task:
         type: time-series-forecasting
-      dataset:
         name: GIFT-Eval
         type: GIFT-Eval
       metrics:
@@ -45,28 +45,54 @@ model-index:
       source:
         name: GIFT-Eval Time Series Forecasting Leaderboard
         url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
 ---
 # Toto-2.0-313m
-Toto (**T**ime Series **O**ptimized **T**ransformer for [**O**bservability](https://www.datadoghq.com/knowledge-center/observability/)) is a family of time series foundation models for multivariate forecasting developed by [Datadog](https://www.datadoghq.com/). **Toto 2.0** is the current generation, featuring u-μP-scaled transformers ranging from 4M to 2.5B parameters.
 ---
 ## ✨ Key Features
-- **Zero-Shot Forecasting**: Forecast without fine-tuning on your specific time series.
-- **Multi-Variate Support**: Efficiently process multiple variables using alternating time/variate attention.
-- **Probabilistic Predictions**: Generate point forecasts and uncertainty estimates via a quantile output head.
-- **Decoder-Only Architecture**: Support for variable prediction horizons and context lengths.
-- **u-μP Scaling**: Stable training transfer across all model sizes.
-<div style="width: 100%; margin: auto; padding: 1rem;">
-  <img src="figures/architecture.png" alt="Toto 2.0 architecture" style="width: 100%; height: auto;" />
-  <em style="display: block; margin-top: 0.5rem; text-align: center;">
-    Overview of the Toto 2.0 architecture.
-  </em>
-</div>
 ---
@@ -87,18 +113,21 @@ import torch
 from toto2 import Toto2Model
 model = Toto2Model.from_pretrained("Datadog/Toto-2.0-313m")
-model = model.to("cuda").eval()
 # (batch, n_variates, time_steps)
-target = torch.randn(1, 1, 512, device="cuda")
 target_mask = torch.ones_like(target, dtype=torch.bool)
-series_ids = torch.zeros(1, 1, dtype=torch.long, device="cuda")
 # Returns quantiles of shape (9, batch, n_variates, horizon)
 # Quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
 quantiles = model.forecast(
     {"target": target, "target_mask": target_mask, "series_ids": series_ids},
     horizon=96,
 )
 ```
@@ -108,22 +137,28 @@ For more examples, see the [Quick Start notebook](https://github.com/DataDog/tot
 ## 💾 Available Checkpoints
-| Checkpoint | Parameters |
-|---|---|
-| [Toto-2.0-4m](https://huggingface.co/Datadog/Toto-2.0-4m) | 4M |
-| [Toto-2.0-22m](https://huggingface.co/Datadog/Toto-2.0-22m) | 22M |
-| [Toto-2.0-313m](https://huggingface.co/Datadog/Toto-2.0-313m) | 313M |
-| [Toto-2.0-1B](https://huggingface.co/Datadog/Toto-2.0-1B) | 1B |
-| [Toto-2.0-2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) | 2.5B |
 ---
 ## 🔗 Additional Resources
-- **[Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)**
-- **[GitHub Repository](https://github.com/DataDog/toto)**
-- **[BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM)**
-- **[Toto 1.0 Weights](https://huggingface.co/Datadog/Toto-Open-Base-1.0)**
 ---

 model-index:
 - name: Toto-2.0-313m
   results:
+    - task:
         type: time-series-forecasting
       dataset:
         name: BOOM
       source:
         name: BOOM 💥 Observability Time-Series Forecasting Leaderboard
         url: https://huggingface.co/spaces/Datadog/BOOM
+    - task:
         type: time-series-forecasting
+      dataset:
         name: GIFT-Eval
         type: GIFT-Eval
       metrics:
       source:
         name: GIFT-Eval Time Series Forecasting Leaderboard
         url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
+    - task:
+        type: time-series-forecasting
+      dataset:
+        name: TIME
+        type: TIME
+      metrics:
+        - name: CRPS
+          type: CRPS
+          value: 0.535
+        - name: MASE
+          type: MASE
+          value: 0.642
+      source:
+        name: TIME Benchmark Leaderboard
+        url: https://huggingface.co/spaces/Real-TSF/TIME-leaderboard
 ---
 # Toto-2.0-313m
+Toto (Time Series Optimized Transformer for [Observability](https://www.datadoghq.com/knowledge-center/observability/)) is a family of time series foundation models for multivariate forecasting developed by [Datadog](https://www.datadoghq.com/). Toto 2.0 is the current generation, featuring u-μP-scaled transformers ranging from 4M to 2.5B parameters, all trained from a single recipe. Forecast quality improves reliably with parameter count across the family, with no sign of saturation at 2.5B.
+The family sets a new state of the art on three forecasting benchmarks: [BOOM](https://huggingface.co/spaces/Datadog/BOOM), our observability benchmark; [GIFT-Eval](https://huggingface.co/spaces/Salesforce/GIFT-Eval), the standard general-purpose benchmark; and the recent contamination-resistant [TIME](https://arxiv.org/abs/2602.12147) benchmark.
 ---
 ## ✨ Key Features
+- **Zero-Shot Forecasting:** Forecast without fine-tuning on your specific time series.
+- **Multi-Variate Support:** Efficiently process multiple variables using alternating time/variate attention.
+- **Probabilistic Predictions:** Generate point forecasts and uncertainty estimates via a quantile output head.
+- **Decoder-Only Architecture:** Support for variable prediction horizons and context lengths.
+- **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4M → 2.5B).
+---
+## 🏗️ Architecture
+![Overview of the Toto 2.0 architecture.](assets/architecture.png)
+Toto 2.0 is a decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. It adds **contiguous patch masking (CPM)** for single-pass parallel decoding, a **quantile output head** trained with pinball loss, a robust arcsinh input scaler, and residual MLP patch projections, and it is trained with NorMuon. See the [technical report](#-additional-resources) for details.
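For readers unfamiliar with the pinball loss mentioned above, here is a minimal pure-Python sketch of the standard definition. It is not Toto's training code: the function names `pinball_loss` and `mean_pinball_loss` are hypothetical, and the model's actual loss (reduction, weighting, scaling) may differ.

```python
# Illustrative only: the textbook pinball (quantile) loss, not Toto's implementation.
QUANTILE_LEVELS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def pinball_loss(y_true, y_pred, level):
    """Pinball loss for one observation at one quantile level in (0, 1)."""
    diff = y_true - y_pred
    # Under-prediction (diff > 0) is weighted by `level`,
    # over-prediction (diff < 0) by `1 - level`.
    return max(level * diff, (level - 1.0) * diff)


def mean_pinball_loss(y_true, quantile_preds, levels=QUANTILE_LEVELS):
    """Average the pinball loss over all predicted quantile levels."""
    if len(quantile_preds) != len(levels):
        raise ValueError("need one prediction per quantile level")
    losses = [pinball_loss(y_true, q, tau) for q, tau in zip(quantile_preds, levels)]
    return sum(losses) / len(losses)
```

At level 0.9, predicting too low costs nine times as much as predicting too high by the same amount, which is what pushes that output toward the 90th percentile of the target distribution.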
+---
+## 📊 Performance
+![Pareto frontier on BOOM and GIFT-Eval](assets/pareto.png)
+Every Toto 2.0 size sits on or near the Pareto frontier on both BOOM and GIFT-Eval. The three largest sizes take first, second, and third place among foundation models by GIFT-Eval CRPS rank. On TIME, Toto 2.0 sizes take the top three spots on every metric, ahead of every external foundation model evaluated.
 ---
 from toto2 import Toto2Model
 model = Toto2Model.from_pretrained("Datadog/Toto-2.0-313m")
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model = model.to(device).eval()
 # (batch, n_variates, time_steps)
+target = torch.randn(1, 1, 512, device=device)
 target_mask = torch.ones_like(target, dtype=torch.bool)
+series_ids = torch.zeros(1, 1, dtype=torch.long, device=device)
 # Returns quantiles of shape (9, batch, n_variates, horizon)
 # Quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
 quantiles = model.forecast(
     {"target": target, "target_mask": target_mask, "series_ids": series_ids},
     horizon=96,
+    decode_block_size=768,
+    has_missing_values=False,
 )
 ```
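Because the quantile axis comes first in the returned tensor, picking out a median or a prediction interval is an index lookup on dimension 0. The sketch below shows one way to map quantile levels to indices. The helpers `quantile_index` and `interval_indices` are hypothetical and not part of the `toto2` package; the levels are the nine listed in the comment above.

```python
# Illustrative helpers (not part of toto2) for indexing the quantile axis.
QUANTILE_LEVELS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def quantile_index(level, levels=QUANTILE_LEVELS, tol=1e-9):
    """Return the position of `level` along the leading quantile axis."""
    for i, candidate in enumerate(levels):
        if abs(candidate - level) <= tol:
            return i
    raise ValueError(f"quantile level {level} is not one of {levels}")


def interval_indices(coverage, levels=QUANTILE_LEVELS):
    """Return (lower, upper) indices of a central interval, e.g. 0.8 -> 10%..90%."""
    lower = (1.0 - coverage) / 2.0
    upper = 1.0 - lower
    return quantile_index(lower, levels), quantile_index(upper, levels)
```

With the `quantiles` tensor from the snippet above, `quantiles[quantile_index(0.5)]` would then be the median forecast with shape `(batch, n_variates, horizon)`, and `interval_indices(0.8)` gives the indices of the 10% and 90% bands.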
 ## 💾 Available Checkpoints
+All five Toto 2.0 sizes share the same training recipe; pick a size based on your accuracy/latency budget. Latencies are forward-pass time for a 1,024-step forecast at batch size 8 on a single A100.
+| Model | Params | Single-pass latency<br>(1,024 horizon) | Block decoding<br>(block=768) | Recommended for |
+|---|---|---|---|---|
+| [Toto-2.0-4m](https://huggingface.co/Datadog/Toto-2.0-4m)     | 4M     | ~3.8 ms  | ~10.0 ms | Edge / CPU deployment; tightest latency or memory budgets. |
+| [Toto-2.0-22m](https://huggingface.co/Datadog/Toto-2.0-22m)   | 22M    | ~5.0 ms  | ~12.8 ms | Efficient default: matches or beats Toto 1.0 quality with ~7× fewer parameters. |
+| [Toto-2.0-313m](https://huggingface.co/Datadog/Toto-2.0-313m) | 313M   | ~15.4 ms | ~32.4 ms | Strong general-purpose checkpoint; top-3 foundation model on GIFT-Eval. |
+| [Toto-2.0-1B](https://huggingface.co/Datadog/Toto-2.0-1B)     | 1B     | ~20.9 ms | ~46.3 ms | Best quality / cost tradeoff for production workloads. |
+| [Toto-2.0-2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) | 2.5B   | ~36.2 ms | ~78.0 ms | Highest accuracy; #1 foundation model on every benchmark. |
+> Single-pass decoding fills the entire horizon in one forward pass and is recommended up to ~768 steps. Block decoding generates the horizon in 768-step segments conditioned on the previous segment's median (with KV caching); it is slower but more stable at long horizons. Both modes use the same checkpoint.
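To make the block-decoding note concrete, the sketch below shows how a horizon might be split into segments of at most 768 steps. It is only an illustration: `block_schedule` is a hypothetical helper, and the assumption that the final segment is simply shorter may not match how the library partitions the horizon internally.

```python
# Illustrative only: one plausible way to partition a horizon into decode blocks.
def block_schedule(horizon, block_size=768):
    """Split `horizon` steps into consecutive (start, end) segments of at most `block_size`."""
    if horizon <= 0 or block_size <= 0:
        raise ValueError("horizon and block_size must be positive")
    segments = []
    start = 0
    while start < horizon:
        end = min(start + block_size, horizon)  # last block may be shorter
        segments.append((start, end))
        start = end
    return segments
```

Under this assumption a 1,024-step forecast needs two blocks, `(0, 768)` and `(768, 1024)`, while the 96-step horizon from the quick-start example fits in one.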
 ---
 ## 🔗 Additional Resources
+- **Technical Report** — *(coming soon)*
+- [Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)
+- [GitHub Repository](https://github.com/DataDog/toto)
+- [Toto 2.0 Collection](https://huggingface.co/collections/Datadog/toto-20) — all five base checkpoints
+- [BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM) — Datadog's observability time-series benchmark
+- [Toto 1.0 Weights](https://huggingface.co/Datadog/Toto-Open-Base-1.0)
 ---

assets/architecture.png ADDED

Git LFS Details

SHA256: 973196289f6036b880ec7fdb00fe0b1078215232bf58f0bdd6a27eeebfca46ef
Pointer size: 131 Bytes
Size of remote file: 437 kB

assets/pareto.png ADDED

Git LFS Details

SHA256: 756a059027357cf224effc92995755b2c50ca8396b68918135ef0e5226798294
Pointer size: 131 Bytes
Size of remote file: 302 kB