Add checkpoint size (fp32 weights) column to Available Checkpoints
README.md (changed)
@@ -121,13 +121,13 @@ For more examples, see the [Quick Start notebook](https://github.com/DataDog/tot

 All five Toto 2.0 sizes share the same training recipe; pick a size based on your accuracy/latency budget. Latency is forward-pass time for a 1,024-step single-pass forecast at batch size 8 on a single A100.

-| Model | Params | Latency | Recommended for |
-|---|---|---|---|
-| [Toto‑2.0‑4m](https://huggingface.co/Datadog/Toto-2.0-4m) | 4m | ~3.8 ms | Edge / CPU deployment; tightest latency or memory budgets. |
-| [Toto‑2.0‑22m](https://huggingface.co/Datadog/Toto-2.0-22m) | 22m | ~5.0 ms | Efficient default — matches or beats Toto 1.0 quality with ~7× fewer parameters. |
-| [Toto‑2.0‑313m](https://huggingface.co/Datadog/Toto-2.0-313m) | 313m | ~15.4 ms | Strong general-purpose checkpoint; top-3 foundation model on GIFT-Eval. |
-| [Toto‑2.0‑1B](https://huggingface.co/Datadog/Toto-2.0-1B) | 1B | ~20.9 ms | Best quality / cost tradeoff for production workloads. |
-| [Toto‑2.0‑2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) | 2.5B | ~36.2 ms | Highest accuracy; #1 foundation model on every benchmark. |
+| Model | Params | Weights (fp32) | Latency | Recommended for |
+|---|---|---|---|---|
+| [Toto‑2.0‑4m](https://huggingface.co/Datadog/Toto-2.0-4m) | 4m | 16 MB | ~3.8 ms | Edge / CPU deployment; tightest latency or memory budgets. |
+| [Toto‑2.0‑22m](https://huggingface.co/Datadog/Toto-2.0-22m) | 22m | 84 MB | ~5.0 ms | Efficient default — matches or beats Toto 1.0 quality with ~7× fewer parameters. |
+| [Toto‑2.0‑313m](https://huggingface.co/Datadog/Toto-2.0-313m) | 313m | 1.2 GB | ~15.4 ms | Strong general-purpose checkpoint; top-3 foundation model on GIFT-Eval. |
+| [Toto‑2.0‑1B](https://huggingface.co/Datadog/Toto-2.0-1B) | 1B | 3.9 GB | ~20.9 ms | Best quality / cost tradeoff for production workloads. |
+| [Toto‑2.0‑2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) | 2.5B | 9.1 GB | ~36.2 ms | Highest accuracy; #1 foundation model on every benchmark. |

 ---

@@ -137,7 +137,7 @@ All five Toto 2.0 sizes share the same training recipe; pick a size based on you
 - **Multi-Variate Support:** Efficiently process multiple variables using alternating time/variate attention.
 - **Probabilistic Predictions:** Generate point forecasts and uncertainty estimates via a quantile output head.
 - **Decoder-Only Architecture:** Support for variable prediction horizons and context lengths.
-- **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (
+- **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4m → 2.5B).

 ---
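
As a sanity check on the new column, fp32 checkpoints weigh in at roughly 4 bytes per parameter; the small mismatches against the table values (for example 84 MB vs. a naive 88 MB for the 22m model) are plausibly due to rounded parameter counts and GB/GiB conventions. A minimal sketch of the rule of thumb, where the dictionary of rounded counts is made up for illustration and is not part of the repo:

```python
# Rule of thumb: an fp32 checkpoint stores ~4 bytes per parameter
# (plus a small amount of metadata). Parameter counts below are the
# rounded values from the table, so the estimates only roughly match
# the file sizes listed in the README.

ROUNDED_PARAM_COUNTS = {  # illustrative values, not exact tensor counts
    "Toto-2.0-4m": 4e6,
    "Toto-2.0-22m": 22e6,
    "Toto-2.0-313m": 313e6,
    "Toto-2.0-1B": 1e9,
    "Toto-2.0-2.5B": 2.5e9,
}

def fp32_size_bytes(num_params: float) -> float:
    """Approximate size of an fp32 weights file: 4 bytes per parameter."""
    return num_params * 4

for name, n in ROUNDED_PARAM_COUNTS.items():
    print(f"{name}: ~{fp32_size_bytes(n) / 1e9:.2f} GB")
```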
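
The "quantile output head" mentioned in the feature list is standard quantile regression; the sketch below is a generic PyTorch illustration of that idea (a linear head trained with pinball loss), not Toto's actual implementation, and the class name, dimensions, and quantile levels are invented for the example.

```python
import torch
import torch.nn as nn

# Generic quantile output head: one linear projection produces a forecast
# per requested quantile, trained with the pinball (quantile) loss.
# Names and sizes are illustrative only; this is not Toto's code.

QUANTILES = torch.tensor([0.1, 0.5, 0.9])

class QuantileHead(nn.Module):
    def __init__(self, d_model: int, num_quantiles: int):
        super().__init__()
        self.proj = nn.Linear(d_model, num_quantiles)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, time, d_model) -> (batch, time, num_quantiles)
        return self.proj(hidden)

def pinball_loss(pred: torch.Tensor, target: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    # pred: (..., num_quantiles); target: (...); q: (num_quantiles,)
    diff = target.unsqueeze(-1) - pred
    return torch.maximum(q * diff, (q - 1) * diff).mean()

# Toy usage: the 0.5-quantile column is the point forecast; the
# 0.1/0.9 columns give an 80% uncertainty band.
head = QuantileHead(d_model=64, num_quantiles=len(QUANTILES))
hidden = torch.randn(8, 1024, 64)
target = torch.randn(8, 1024)
loss = pinball_loss(head(hidden), target, QUANTILES)
```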