Emaad committed
Commit 95a0771 · verified · 1 Parent(s): ed7afe9

Add checkpoint size (fp32 weights) column to Available Checkpoints

Files changed (1)
  1. README.md +9 -9
README.md CHANGED
@@ -64,7 +64,7 @@ model-index:
 
 # Toto-2.0-1B
 
-Toto (Time Series Optimized Transformer for [Observability](https://www.datadoghq.com/knowledge-center/observability/)) is a family of time series foundation models for multivariate forecasting developed by [Datadog](https://www.datadoghq.com/). Toto 2.0 is the current generation, featuring u-μP-scaled transformers ranging from 4M to 2.5B parameters, all trained from a single recipe. Forecast quality improves reliably with parameter count across the family, with no sign of saturation at 2.5B.
+Toto (Time Series Optimized Transformer for [Observability](https://www.datadoghq.com/knowledge-center/observability/)) is a family of time series foundation models for multivariate forecasting developed by [Datadog](https://www.datadoghq.com/). Toto 2.0 is the current generation, featuring u-μP-scaled transformers ranging from 4m to 2.5B parameters, all trained from a single recipe. Forecast quality improves reliably with parameter count across the family.
 
 The family sets a new state of the art on three forecasting benchmarks: [BOOM](https://huggingface.co/spaces/Datadog/BOOM), our observability benchmark; [GIFT-Eval](https://huggingface.co/spaces/Salesforce/GIFT-Eval), the standard general-purpose benchmark; and the recent contamination-resistant [TIME](https://arxiv.org/abs/2602.12147) benchmark.
 
@@ -121,13 +121,13 @@ For more examples, see the [Quick Start notebook](https://github.com/DataDog/tot
 
 All five Toto 2.0 sizes share the same training recipe; pick a size based on your accuracy/latency budget. Latency is forward-pass time for a 1,024-step single-pass forecast at batch size 8 on a single A100.
 
-| Model | Params | Latency | Recommended for |
-|---|---|---|---|
-| [Toto‑2.0‑4m](https://huggingface.co/Datadog/Toto-2.0-4m) | 4m | ~3.8 ms | Edge / CPU deployment; tightest latency or memory budgets. |
-| [Toto‑2.0‑22m](https://huggingface.co/Datadog/Toto-2.0-22m) | 22m | ~5.0 ms | Efficient default — matches or beats Toto 1.0 quality with ~7× fewer parameters. |
-| [Toto‑2.0‑313m](https://huggingface.co/Datadog/Toto-2.0-313m) | 313m | ~15.4 ms | Strong general-purpose checkpoint; top-3 foundation model on GIFT-Eval. |
-| [Toto‑2.0‑1B](https://huggingface.co/Datadog/Toto-2.0-1B) | 1B | ~20.9 ms | Best quality / cost tradeoff for production workloads. |
-| [Toto‑2.0‑2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) | 2.5B | ~36.2 ms | Highest accuracy; #1 foundation model on every benchmark. |
+| Model | Params | Weights (fp32) | Latency | Recommended for |
+|---|---|---|---|---|
+| [Toto‑2.0‑4m](https://huggingface.co/Datadog/Toto-2.0-4m) | 4m | 16 MB | ~3.8 ms | Edge / CPU deployment; tightest latency or memory budgets. |
+| [Toto‑2.0‑22m](https://huggingface.co/Datadog/Toto-2.0-22m) | 22m | 84 MB | ~5.0 ms | Efficient default — matches or beats Toto 1.0 quality with ~7× fewer parameters. |
+| [Toto‑2.0‑313m](https://huggingface.co/Datadog/Toto-2.0-313m) | 313m | 1.2 GB | ~15.4 ms | Strong general-purpose checkpoint; top-3 foundation model on GIFT-Eval. |
+| [Toto‑2.0‑1B](https://huggingface.co/Datadog/Toto-2.0-1B) | 1B | 3.9 GB | ~20.9 ms | Best quality / cost tradeoff for production workloads. |
+| [Toto‑2.0‑2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) | 2.5B | 9.1 GB | ~36.2 ms | Highest accuracy; #1 foundation model on every benchmark. |
 
 ---
 
@@ -137,7 +137,7 @@ All five Toto 2.0 sizes share the same training recipe; pick a size based on you
 - **Multi-Variate Support:** Efficiently process multiple variables using alternating time/variate attention.
 - **Probabilistic Predictions:** Generate point forecasts and uncertainty estimates via a quantile output head.
 - **Decoder-Only Architecture:** Support for variable prediction horizons and context lengths.
-- **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4M → 2.5B).
+- **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4m → 2.5B).
 
 ---
 
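As a sanity check on the new Weights (fp32) column: fp32 stores each parameter in 4 bytes, so checkpoint size is roughly the parameter count times four. A minimal sketch, assuming the rounded parameter labels from the table (the listed sizes reflect the exact, unrounded counts, so these estimates only roughly agree):

```python
# Rough fp32 checkpoint sizes from the table's rounded parameter labels.
# fp32 = 32 bits = 4 bytes per parameter; the sizes in the README differ
# slightly because the real parameter counts are not round numbers.
BYTES_PER_PARAM = 4

params = {
    "Toto-2.0-4m": 4e6,
    "Toto-2.0-22m": 22e6,
    "Toto-2.0-313m": 313e6,
    "Toto-2.0-1B": 1e9,
    "Toto-2.0-2.5B": 2.5e9,
}

for name, n in params.items():
    size_mb = n * BYTES_PER_PARAM / 1e6
    label = f"{size_mb / 1e3:.1f} GB" if size_mb >= 1e3 else f"{size_mb:.0f} MB"
    print(f"{name}: ~{label}")
```

The residual gaps (e.g., ~10 GB estimated vs 9.1 GB listed for the 2.5B model) are consistent with rounded parameter labels rather than exact counts.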
 
 
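The sizes in that column matter mostly for download and memory planning. A minimal sketch for fetching one checkpoint's weights and measuring its footprint on disk, assuming only the repo IDs linked in the table (inference itself goes through the Toto package and its Quick Start notebook, not shown here):

```python
from pathlib import Path

from huggingface_hub import snapshot_download

# Download one of the checkpoints listed in the table; swap the repo ID
# for a larger or smaller size as your latency/memory budget dictates.
local_dir = snapshot_download(repo_id="Datadog/Toto-2.0-1B")

# Total on-disk size of the snapshot (weights plus small config/README
# files), for comparison against the Weights (fp32) column.
total = sum(p.stat().st_size for p in Path(local_dir).rglob("*") if p.is_file())
print(f"{local_dir}: {total / 1e9:.1f} GB")
```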