Refresh model card: add Pareto + architecture figures, TIME metrics, latency table

Files changed (4)

.gitattributes +2 -0
README.md +62 -30
assets/architecture.png +3 -0
assets/pareto.png +3 -0

.gitattributes CHANGED

@@ -34,3 +34,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 figures/architecture.png filter=lfs diff=lfs merge=lfs -text

+assets/architecture.png filter=lfs diff=lfs merge=lfs -text
+assets/pareto.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -15,7 +15,7 @@ thumbnail: https://corp.dd-static.net/img/about/presskit/kit/press_kit.png
 model-index:
 - name: Toto-2.0-4m
   results:
-    - task:
         type: time-series-forecasting
       dataset:
         name: BOOM
@@ -23,16 +23,16 @@ model-index:
       metrics:
         - name: CRPS
           type: CRPS
-          value: 0.717
         - name: MASE
           type: MASE
           value: 0.624
       source:
         name: BOOM 💥 Observability Time-Series Forecasting Leaderboard
-        url: https://huggingface.co/spaces/Datadog/BOOM
-    - task:
         type: time-series-forecasting
-      dataset:
         name: GIFT-Eval
         type: GIFT-Eval
       metrics:
@@ -45,28 +45,54 @@ model-index:
       source:
         name: GIFT-Eval Time Series Forecasting Leaderboard
         url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
 ---
 # Toto-2.0-4m
-Toto (**T**ime Series **O**ptimized **T**ransformer for [**O**bservability](https://www.datadoghq.com/knowledge-center/observability/)) is a family of time series foundation models for multivariate forecasting developed by [Datadog](https://www.datadoghq.com/). **Toto 2.0** is the current generation, featuring u-μP-scaled transformers ranging from 4M to 2.5B parameters.
 ---
 ## ✨ Key Features
-- **Zero-Shot Forecasting**: Forecast without fine-tuning on your specific time series.
-- **Multi-Variate Support**: Efficiently process multiple variables using alternating time/variate attention.
-- **Probabilistic Predictions**: Generate point forecasts and uncertainty estimates via a quantile output head.
-- **Decoder-Only Architecture**: Support for variable prediction horizons and context lengths.
-- **u-μP Scaling**: Stable training transfer across all model sizes.
-<div style="width: 100%; margin: auto; padding: 1rem;">
-  <img src="figures/architecture.png" alt="Toto 2.0 architecture" style="width: 100%; height: auto;" />
-  <em style="display: block; margin-top: 0.5rem; text-align: center;">
-    Overview of the Toto 2.0 architecture.
-  </em>
-</div>
 ---
@@ -86,7 +112,7 @@ pip install "toto-2 @ git+https://github.com/DataDog/toto.git#subdirectory=toto2
 import torch
 from toto2 import Toto2Model
-model = Toto2Model.from_pretrained("Datadog/Toto-2.0-22m")
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 model = model.to(device).eval()
@@ -111,22 +137,28 @@ For more examples, see the [Quick Start notebook](https://github.com/DataDog/tot
 ## 💾 Available Checkpoints
-| Checkpoint | Parameters |
-|---|---|
-| [Toto-2.0-4m](https://huggingface.co/Datadog/Toto-2.0-4m) | 4M |
-| [Toto-2.0-22m](https://huggingface.co/Datadog/Toto-2.0-22m) | 22M |
-| [Toto-2.0-313m](https://huggingface.co/Datadog/Toto-2.0-313m) | 313M |
-| [Toto-2.0-1B](https://huggingface.co/Datadog/Toto-2.0-1B) | 1B |
-| [Toto-2.0-2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) | 2.5B |
 ---
 ## 🔗 Additional Resources
-- **[Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)**
-- **[GitHub Repository](https://github.com/DataDog/toto)**
-- **[BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM)**
-- **[Toto 1.0 Weights](https://huggingface.co/Datadog/Toto-Open-Base-1.0)**
 ---

 model-index:
 - name: Toto-2.0-4m
   results:
+    - task:
         type: time-series-forecasting
       dataset:
         name: BOOM
       metrics:
         - name: CRPS
           type: CRPS
+          value: 0.377
         - name: MASE
           type: MASE
           value: 0.624
       source:
         name: BOOM 💥 Observability Time-Series Forecasting Leaderboard
+        url: https://huggingface.co/spaces/Datadog/BOOM
+    - task:
         type: time-series-forecasting
+      dataset:
         name: GIFT-Eval
         type: GIFT-Eval
       metrics:
       source:
         name: GIFT-Eval Time Series Forecasting Leaderboard
         url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
+    - task:
+        type: time-series-forecasting
+      dataset:
+        name: TIME
+        type: TIME
+      metrics:
+        - name: CRPS
+          type: CRPS
+          value: 0.574
+        - name: MASE
+          type: MASE
+          value: 0.689
+      source:
+        name: TIME Benchmark Leaderboard
+        url: https://huggingface.co/spaces/Real-TSF/TIME-leaderboard
 ---
 # Toto-2.0-4m
+Toto (Time Series Optimized Transformer for [Observability](https://www.datadoghq.com/knowledge-center/observability/)) is a family of time series foundation models for multivariate forecasting developed by [Datadog](https://www.datadoghq.com/). Toto 2.0 is the current generation, featuring u-μP-scaled transformers ranging from 4M to 2.5B parameters, all trained from a single recipe. Forecast quality improves reliably with parameter count across the family, with no sign of saturation at 2.5B.
+The family sets a new state of the art on three forecasting benchmarks: [BOOM](https://huggingface.co/spaces/Datadog/BOOM), our observability benchmark; [GIFT-Eval](https://huggingface.co/spaces/Salesforce/GIFT-Eval), the standard general-purpose benchmark; and the recent contamination-resistant [TIME](https://arxiv.org/abs/2602.12147) benchmark.
 ---
 ## ✨ Key Features
+- **Zero-Shot Forecasting:** Forecast without fine-tuning on your specific time series.
+- **Multi-Variate Support:** Efficiently process multiple variables using alternating time/variate attention.
+- **Probabilistic Predictions:** Generate point forecasts and uncertainty estimates via a quantile output head.
+- **Decoder-Only Architecture:** Support for variable prediction horizons and context lengths.
+- **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4M → 2.5B).
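A quantile output head is usually trained with the pinball (quantile) loss, which penalizes under- and over-prediction asymmetrically for each quantile level. The snippet below is a minimal pure-Python sketch of that loss for illustration only; the function names are hypothetical and it is not taken from the Toto 2.0 training code.

```python
from typing import Sequence


def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Pinball loss for one target and one predicted q-quantile."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    diff = y_true - y_pred
    # Under-prediction (diff > 0) costs q per unit of error;
    # over-prediction (diff < 0) costs (1 - q) per unit of error.
    return max(q * diff, (q - 1.0) * diff)


def mean_pinball_loss(
    y_true: float,
    quantile_preds: Sequence[float],
    quantiles: Sequence[float],
) -> float:
    """Average the pinball loss over several quantile levels for one target."""
    if len(quantile_preds) != len(quantiles) or not quantiles:
        raise ValueError("need one prediction per quantile level")
    total = sum(pinball_loss(y_true, p, q) for p, q in zip(quantile_preds, quantiles))
    return total / len(quantiles)
```

Minimizing this loss for a given `q` pushes the prediction toward the `q`-th quantile of the target distribution, which is why one head can emit a median alongside uncertainty bands.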
+---
+## 🏗️ Architecture
+![Overview of the Toto 2.0 architecture.](assets/architecture.png)
+Toto 2.0 is a decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds **contiguous patch masking (CPM)** for single-pass parallel decoding, a **quantile output head** trained with pinball loss, a robust arcsinh input scaler, and residual MLP patch projections; it is trained with the NorMuon optimizer. See the [technical report](#-additional-resources) for details.
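To give a feel for what an arcsinh input scaler does, here is a rough sketch. The card does not specify the exact formulation, so the choice of median centering and median-absolute-deviation scaling is an assumption, and all function names are hypothetical. The point is only that `asinh` behaves linearly near zero and logarithmically for large magnitudes, so outliers are compressed while the transform stays invertible.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def fit_robust_stats(x: Sequence[float]) -> Tuple[float, float]:
    """Return (center, scale) from the median and median absolute deviation.

    Assumption: the real scaler may use different robust statistics.
    """
    center = statistics.median(x)
    mad = statistics.median([abs(v - center) for v in x])
    # Fall back to unit scale for constant series to avoid division by zero.
    return center, (mad if mad > 0 else 1.0)


def arcsinh_scale(x: Sequence[float], center: float, scale: float) -> List[float]:
    """Standardize robustly, then compress large magnitudes with asinh."""
    return [math.asinh((v - center) / scale) for v in x]


def arcsinh_unscale(z: Sequence[float], center: float, scale: float) -> List[float]:
    """Invert arcsinh_scale using sinh."""
    return [math.sinh(v) * scale + center for v in z]
```

With a series such as `[1, 2, 3, 4, 1000]`, the spike is mapped to a single-digit value instead of dominating the input range, and `arcsinh_unscale` recovers the original values up to floating-point error.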
+---
+## 📊 Performance
+![Pareto frontier on BOOM and GIFT-Eval](assets/pareto.png)
+Every Toto 2.0 size sits on or near the Pareto frontier on both BOOM and GIFT-Eval. The three largest sizes place first, second, and third among foundation models by GIFT-Eval CRPS rank. On TIME, Toto 2.0 sizes take the top three spots on every metric, ahead of every external foundation model evaluated.
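For readers unfamiliar with the term, a model is on the Pareto frontier when no other model is both cheaper and more accurate. The sketch below shows one simple way to compute such a frontier for (cost, error) pairs where lower is better on both axes. The function name is hypothetical, and the values in the docstring example are made-up placeholders, not benchmark results.

```python
from typing import List, Sequence, Tuple

# (label, cost, error); lower is better for both cost and error.
Point = Tuple[str, float, float]


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Return the points not dominated by any cheaper-or-equal point.

    Example with placeholder values:
        pareto_front([("a", 1, 5), ("b", 2, 3), ("c", 3, 4), ("d", 4, 1)])
        keeps a, b and d; c is dominated by b.
    """
    front: List[Point] = []
    best_error = float("inf")
    # Sweep in order of increasing cost (ties broken by error); a point
    # survives only if it strictly improves on the best error seen so far.
    for label, cost, error in sorted(points, key=lambda p: (p[1], p[2])):
        if error < best_error:
            front.append((label, cost, error))
            best_error = error
    return front
```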
 ---
 import torch
 from toto2 import Toto2Model
+model = Toto2Model.from_pretrained("Datadog/Toto-2.0-4m")
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 model = model.to(device).eval()
 ## 💾 Available Checkpoints
+All five Toto 2.0 sizes share the same training recipe; pick a size based on your accuracy/latency budget. Latencies are forward-pass times for a 1,024-step forecast at batch size 8 on a single A100.
+| Model | Params | Single-pass latency<br>(1,024 horizon) | Block decoding<br>(block=768) | Recommended for |
+|---|---|---|---|---|
+| [Toto-2.0-4m](https://huggingface.co/Datadog/Toto-2.0-4m)     | 4M     | ~3.8 ms  | ~10.0 ms | Edge / CPU deployment; tightest latency or memory budgets. |
+| [Toto-2.0-22m](https://huggingface.co/Datadog/Toto-2.0-22m)   | 22M    | ~5.0 ms  | ~12.8 ms | Efficient default: matches or beats Toto 1.0 quality with ~7× fewer parameters. |
+| [Toto-2.0-313m](https://huggingface.co/Datadog/Toto-2.0-313m) | 313M   | ~15.4 ms | ~32.4 ms | Strong general-purpose checkpoint; top-3 foundation model on GIFT-Eval. |
+| [Toto-2.0-1B](https://huggingface.co/Datadog/Toto-2.0-1B)     | 1B     | ~20.9 ms | ~46.3 ms | Best quality / cost tradeoff for production workloads. |
+| [Toto-2.0-2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) | 2.5B   | ~36.2 ms | ~78.0 ms | Highest accuracy; #1 foundation model on every benchmark. |
+> Single-pass decoding fills the entire horizon in one forward pass and is recommended up to ~768 steps. Block decoding generates the horizon in 768-step segments conditioned on the previous segment's median (with KV caching); it is slower but more stable at long horizons. Both modes use the same checkpoint.
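The block-decoding loop described in the note can be sketched as follows. This is an illustration of the control flow only: `forecast_median` is a hypothetical stand-in for a model call that returns a median forecast, it is not part of the `toto2` API, and KV caching is omitted.

```python
from typing import Callable, List, Sequence, Tuple


def block_segments(horizon: int, block: int = 768) -> List[Tuple[int, int]]:
    """Split a horizon into consecutive [start, end) segments of at most `block` steps."""
    if horizon <= 0 or block <= 0:
        raise ValueError("horizon and block must be positive")
    return [(start, min(start + block, horizon)) for start in range(0, horizon, block)]


def block_decode(
    context: Sequence[float],
    horizon: int,
    forecast_median: Callable[[Sequence[float], int], Sequence[float]],
    block: int = 768,
) -> List[float]:
    """Forecast segment by segment, feeding each segment's median back as context."""
    history = list(context)
    forecast: List[float] = []
    for start, end in block_segments(horizon, block):
        # Hypothetical model call: median forecast for the next (end - start) steps.
        median = list(forecast_median(history, end - start))
        forecast.extend(median)
        # Condition the next segment on the median just produced.
        history.extend(median)
    return forecast
```

For a 1,024-step horizon with the default block size, `block_segments` yields two segments, `(0, 768)` and `(768, 1024)`, so the model is called twice instead of once.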
 ---
 ## 🔗 Additional Resources
+- **Technical Report** — *(coming soon)*
+- [Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)
+- [GitHub Repository](https://github.com/DataDog/toto)
+- [Toto 2.0 Collection](https://huggingface.co/collections/Datadog/toto-20) — all five base checkpoints
+- [BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM) — Datadog's observability time-series benchmark
+- [Toto 1.0 Weights](https://huggingface.co/Datadog/Toto-Open-Base-1.0)
 ---

assets/architecture.png ADDED

Git LFS Details

SHA256: 973196289f6036b880ec7fdb00fe0b1078215232bf58f0bdd6a27eeebfca46ef
Pointer size: 131 Bytes
Size of remote file: 437 kB

assets/pareto.png ADDED

Git LFS Details

SHA256: 756a059027357cf224effc92995755b2c50ca8396b68918135ef0e5226798294
Pointer size: 131 Bytes
Size of remote file: 302 kB