Emaad committed
Commit 95a0771 · verified · 1 Parent(s): ed7afe9

Add checkpoint size (fp32 weights) column to Available Checkpoints

Files changed (1)
  1. README.md +9 -9
README.md CHANGED
@@ -64,7 +64,7 @@ model-index:
 
 # Toto-2.0-1B
 
-Toto (Time Series Optimized Transformer for [Observability](https://www.datadoghq.com/knowledge-center/observability/)) is a family of time series foundation models for multivariate forecasting developed by [Datadog](https://www.datadoghq.com/). Toto 2.0 is the current generation, featuring u-μP-scaled transformers ranging from 4M to 2.5B parameters, all trained from a single recipe. Forecast quality improves reliably with parameter count across the family, with no sign of saturation at 2.5B.
+Toto (Time Series Optimized Transformer for [Observability](https://www.datadoghq.com/knowledge-center/observability/)) is a family of time series foundation models for multivariate forecasting developed by [Datadog](https://www.datadoghq.com/). Toto 2.0 is the current generation, featuring u-μP-scaled transformers ranging from 4m to 2.5B parameters, all trained from a single recipe. Forecast quality improves reliably with parameter count across the family.
 
 The family sets a new state of the art on three forecasting benchmarks: [BOOM](https://huggingface.co/spaces/Datadog/BOOM), our observability benchmark; [GIFT-Eval](https://huggingface.co/spaces/Salesforce/GIFT-Eval), the standard general-purpose benchmark; and the recent contamination-resistant [TIME](https://arxiv.org/abs/2602.12147) benchmark.
 
@@ -121,13 +121,13 @@ For more examples, see the [Quick Start notebook](https://github.com/DataDog/tot
 
 All five Toto 2.0 sizes share the same training recipe; pick a size based on your accuracy/latency budget. Latency is forward-pass time for a 1,024-step single-pass forecast at batch size 8 on a single A100.
 
-| Model | Params | Latency | Recommended for |
-|---|---|---|---|
-| [Toto‑2.0‑4m](https://huggingface.co/Datadog/Toto-2.0-4m) | 4m | ~3.8 ms | Edge / CPU deployment; tightest latency or memory budgets. |
-| [Toto‑2.0‑22m](https://huggingface.co/Datadog/Toto-2.0-22m) | 22m | ~5.0 ms | Efficient default — matches or beats Toto 1.0 quality with ~7× fewer parameters. |
-| [Toto‑2.0‑313m](https://huggingface.co/Datadog/Toto-2.0-313m) | 313m | ~15.4 ms | Strong general-purpose checkpoint; top-3 foundation model on GIFT-Eval. |
-| [Toto‑2.0‑1B](https://huggingface.co/Datadog/Toto-2.0-1B) | 1B | ~20.9 ms | Best quality / cost tradeoff for production workloads. |
-| [Toto‑2.0‑2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) | 2.5B | ~36.2 ms | Highest accuracy; #1 foundation model on every benchmark. |
+| Model | Params | Weights (fp32) | Latency | Recommended for |
+|---|---|---|---|---|
+| [Toto‑2.0‑4m](https://huggingface.co/Datadog/Toto-2.0-4m) | 4m | 16 MB | ~3.8 ms | Edge / CPU deployment; tightest latency or memory budgets. |
+| [Toto‑2.0‑22m](https://huggingface.co/Datadog/Toto-2.0-22m) | 22m | 84 MB | ~5.0 ms | Efficient default — matches or beats Toto 1.0 quality with ~7× fewer parameters. |
+| [Toto‑2.0‑313m](https://huggingface.co/Datadog/Toto-2.0-313m) | 313m | 1.2 GB | ~15.4 ms | Strong general-purpose checkpoint; top-3 foundation model on GIFT-Eval. |
+| [Toto‑2.0‑1B](https://huggingface.co/Datadog/Toto-2.0-1B) | 1B | 3.9 GB | ~20.9 ms | Best quality / cost tradeoff for production workloads. |
+| [Toto‑2.0‑2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) | 2.5B | 9.1 GB | ~36.2 ms | Highest accuracy; #1 foundation model on every benchmark. |
 
 ---
 
@@ -137,7 +137,7 @@ All five Toto 2.0 sizes share the same training recipe; pick a size based on you
 - **Multi-Variate Support:** Efficiently process multiple variables using alternating time/variate attention.
 - **Probabilistic Predictions:** Generate point forecasts and uncertainty estimates via a quantile output head.
 - **Decoder-Only Architecture:** Support for variable prediction horizons and context lengths.
-- **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4M → 2.5B).
+- **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4m → 2.5B).
 
 ---
 
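As a sanity check on the new Weights (fp32) column: fp32 stores each parameter in 4 bytes, so checkpoint size is roughly the parameter count times four. A minimal sketch, assuming the rounded parameter labels from the table (the listed sizes reflect the exact, unrounded counts, so these estimates only roughly agree):

```python
# Rough fp32 checkpoint sizes from the table's rounded parameter labels.
# fp32 = 32 bits = 4 bytes per parameter; the sizes in the README differ
# slightly because the real parameter counts are not round numbers.
BYTES_PER_PARAM = 4

params = {
    "Toto-2.0-4m": 4e6,
    "Toto-2.0-22m": 22e6,
    "Toto-2.0-313m": 313e6,
    "Toto-2.0-1B": 1e9,
    "Toto-2.0-2.5B": 2.5e9,
}

for name, n in params.items():
    size_mb = n * BYTES_PER_PARAM / 1e6
    label = f"{size_mb / 1e3:.1f} GB" if size_mb >= 1e3 else f"{size_mb:.0f} MB"
    print(f"{name}: ~{label}")
```

The residual gaps (e.g., ~10 GB estimated vs 9.1 GB listed for the 2.5B model) are consistent with rounded parameter labels rather than exact counts.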
 
 
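The sizes in that column matter mostly for download and memory planning. A minimal sketch for fetching one checkpoint's weights and measuring its footprint on disk, assuming only the repo IDs linked in the table (inference itself goes through the Toto package and its Quick Start notebook, not shown here):

```python
from pathlib import Path

from huggingface_hub import snapshot_download

# Download one of the checkpoints listed in the table; swap the repo ID
# for a larger or smaller size as your latency/memory budget dictates.
local_dir = snapshot_download(repo_id="Datadog/Toto-2.0-1B")

# Total on-disk size of the snapshot (weights plus small config/README
# files), for comparison against the Weights (fp32) column.
total = sum(p.stat().st_size for p in Path(local_dir).rglob("*") if p.is_file())
print(f"{local_dir}: {total / 1e9:.1f} GB")
```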