---
tags:
- time-series-forecasting
- foundation-models
- pretrained-models
- time-series
- timeseries
- forecasting
- observability
- safetensors
- pytorch_model_hub_mixin
license: apache-2.0
pipeline_tag: time-series-forecasting
thumbnail: https://web-assets.dd-static.net/42588/1778691695-toto-2-hero.png
model-index:
- name: Toto-2.0-4m
  results:
  - task:
      type: time-series-forecasting
    dataset:
      name: BOOM
      type: BOOM
    metrics:
    - name: CRPS
      type: CRPS
      value: 0.377
    - name: MASE
      type: MASE
      value: 0.624
    source:
      name: BOOM 💥 Observability Time-Series Forecasting Leaderboard
      url: https://huggingface.co/spaces/Datadog/BOOM
  - task:
      type: time-series-forecasting
    dataset:
      name: GIFT-Eval
      type: GIFT-Eval
    metrics:
    - name: CRPS
      type: CRPS
      value: 0.524
    - name: MASE
      type: MASE
      value: 0.757
    source:
      name: GIFT-Eval Time Series Forecasting Leaderboard
      url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
  - task:
      type: time-series-forecasting
    dataset:
      name: TIME
      type: TIME
    metrics:
    - name: CRPS
      type: CRPS
      value: 0.574
    - name: MASE
      type: MASE
      value: 0.689
    source:
      name: TIME Benchmark Leaderboard
      url: https://huggingface.co/spaces/Real-TSF/TIME-leaderboard
---

# Toto-2.0-4m


Toto (Time Series Optimized Transformer for [Observability](https://www.datadoghq.com/knowledge-center/observability/)) is a family of time series foundation models for multivariate forecasting developed by [Datadog](https://www.datadoghq.com/). Toto 2.0 is the current generation, featuring u-μP-scaled transformers ranging from 4m to 2.5B parameters, all trained from a single recipe. Forecast quality improves reliably with parameter count across the family.


The family sets a new state of the art on three forecasting benchmarks: [BOOM](https://huggingface.co/spaces/Datadog/BOOM), our observability benchmark; [GIFT-Eval](https://huggingface.co/spaces/Salesforce/GIFT-Eval), the standard general-purpose benchmark; and the recent contamination-resistant [TIME](https://arxiv.org/abs/2602.12147) benchmark.


## 📊 Performance


<figure>
<img src="assets/pareto.png" alt="Pareto frontier on BOOM and GIFT-Eval">
<figcaption>Every Toto 2.0 size sits on or near the Pareto frontier on both BOOM and GIFT-Eval. The three largest sizes rank first, second, and third among foundation models by CRPS on GIFT-Eval. On TIME, Toto 2.0 sizes take the top three spots on every metric, ahead of every external foundation model evaluated.</figcaption>
</figure>


## ⚡ Quick Start


Inference code is available on [GitHub](https://github.com/DataDog/toto).


### Installation


```bash
pip install "toto-2 @ git+https://github.com/DataDog/toto.git#subdirectory=toto2"
```


### Inference Example


```python
import torch
from toto2 import Toto2Model

model = Toto2Model.from_pretrained("Datadog/Toto-2.0-4m")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()

# (batch, n_variates, time_steps)
target = torch.randn(1, 1, 512, device=device)
target_mask = torch.ones_like(target, dtype=torch.bool)
series_ids = torch.zeros(1, 1, dtype=torch.long, device=device)

# Returns quantiles of shape (9, batch, n_variates, horizon)
# Quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
quantiles = model.forecast(
    {"target": target, "target_mask": target_mask, "series_ids": series_ids},
    horizon=96,
    decode_block_size=768,
    has_missing_values=False,
)
```
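
Point forecasts and prediction intervals fall out of the returned tensor by indexing the quantile axis. For example, continuing the snippet above (the index-to-level mapping follows the comment on the quantile levels):

```python
# Index 4 is the 0.5 quantile, i.e. the median point forecast.
point_forecast = quantiles[4]  # (batch, n_variates, horizon)

# Indices 0 and 8 (the 0.1 and 0.9 quantiles) bound an 80% prediction interval.
lower, upper = quantiles[0], quantiles[8]
```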


For more examples, see the [Quick Start notebook](https://github.com/DataDog/toto/blob/main/toto2/notebooks/quick_start.ipynb) and [GluonTS integration notebook](https://github.com/DataDog/toto/blob/main/toto2/notebooks/gluonts_integration.ipynb).


## 💾 Available Checkpoints


All five Toto 2.0 sizes share the same training recipe; pick a size based on your accuracy/latency budget. Latency is forward-pass time for a 1,024-step single-pass forecast at batch size 8 on a single A100.


| Model | Params | Weights (fp32) | Latency | Recommended for |
|:---:|:---:|:---:|:---:|---|
| [Toto‑2.0‑4m](https://huggingface.co/Datadog/Toto-2.0-4m) | 4m | 16 MB | ~3.8 ms | Edge / CPU deployment; tightest latency or memory budgets. |
| [Toto‑2.0‑22m](https://huggingface.co/Datadog/Toto-2.0-22m) | 22m | 84 MB | ~5.0 ms | Efficient default; matches or beats Toto 1.0 quality with ~7× fewer parameters. |
| [Toto‑2.0‑313m](https://huggingface.co/Datadog/Toto-2.0-313m) | 313m | 1.2 GB | ~15.4 ms | Strong general-purpose checkpoint; top-3 foundation model on GIFT-Eval. |
| [Toto‑2.0‑1B](https://huggingface.co/Datadog/Toto-2.0-1B) | 1B | 3.9 GB | ~20.9 ms | Best quality/cost tradeoff for production workloads. |
| [Toto‑2.0‑2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) | 2.5B | 9.1 GB | ~36.2 ms | Highest accuracy; #1 foundation model on every benchmark. |
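
Every size loads through the same `Toto2Model.from_pretrained` call shown in the Quick Start; only the repository id changes:

```python
from toto2 import Toto2Model

# Swap in any checkpoint id from the table above.
model = Toto2Model.from_pretrained("Datadog/Toto-2.0-1B")
```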


## ✨ Key Features


- **Zero-Shot Forecasting:** Forecast without fine-tuning on your specific time series.
- **Multivariate Support:** Efficiently process multiple variables using alternating time/variate attention.
- **Probabilistic Predictions:** Generate point forecasts and uncertainty estimates via a quantile output head (see the loss sketch after this list).
- **Decoder-Only Architecture:** Support for variable prediction horizons and context lengths.
- **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4m → 2.5B).
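
The quantile head behind the probabilistic predictions is trained with a pinball (quantile) loss (see the Architecture section below). As a minimal, self-contained sketch of that loss, matching the nine quantile levels and the `(n_quantiles, batch, n_variates, horizon)` output shape from the Quick Start (an illustration, not Toto's actual training code):

```python
import torch

QUANTILE_LEVELS = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

def pinball_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred: (n_quantiles, batch, n_variates, horizon); target: (batch, n_variates, horizon)."""
    q = QUANTILE_LEVELS.view(-1, 1, 1, 1).to(pred)
    err = target.unsqueeze(0) - pred
    # Under-prediction (err > 0) is penalized by q, over-prediction by (1 - q),
    # which drives each output channel toward its assigned quantile.
    return torch.maximum(q * err, (q - 1) * err).mean()
```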


## 🏗️ Architecture


<figure>
<img src="assets/architecture.png" alt="Overview of the Toto 2.0 architecture.">
<figcaption>A decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds <b>contiguous patch masking (CPM)</b> for single-pass parallel decoding, a <b>quantile output head</b> trained with pinball loss, a robust arcsinh input scaler, and residual MLP patch projections; the models are trained with the NorMuon optimizer. See the <a href="#-additional-resources">technical report</a> for details.</figcaption>
</figure>
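
To make the alternating attention pattern concrete, the following is a minimal illustrative sketch of one time-axis layer and one variate-axis layer operating on a `(batch, variates, steps, dim)` tensor. It is a simplified stand-in for exposition, not the actual Toto 2.0 implementation:

```python
import torch
import torch.nn as nn

class AxisAttention(nn.Module):
    """Self-attention along either the time axis (causal) or the variate axis (full)."""

    def __init__(self, dim: int, n_heads: int, time_axis: bool):
        super().__init__()
        self.time_axis = time_axis
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, v, t, d = x.shape
        if self.time_axis:
            # Fold variates into the batch and attend causally along time,
            # preserving the decoder-only (left-to-right) factorization.
            seq = x.reshape(b * v, t, d)
            mask = nn.Transformer.generate_square_subsequent_mask(t, device=x.device)
            h = self.norm(seq)
            out, _ = self.attn(h, h, h, attn_mask=mask)
            return (seq + out).reshape(b, v, t, d)
        # Fold time steps into the batch and attend fully across variates,
        # which have no inherent order, so no causal mask is needed.
        seq = x.transpose(1, 2).reshape(b * t, v, d)
        h = self.norm(seq)
        out, _ = self.attn(h, h, h)
        return (seq + out).reshape(b, t, v, d).transpose(1, 2)

# Stacking these alternately yields the time/variate pattern, e.g.:
# blocks = [AxisAttention(256, 8, time_axis=(i % 2 == 0)) for i in range(4)]
```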


## 🔗 Additional Resources


- **Technical Report** *(coming soon)*
- [Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)
- [GitHub Repository](https://github.com/DataDog/toto)
- [Toto 2.0 Collection](https://huggingface.co/collections/Datadog/toto-20): all five base checkpoints
- [BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM): Datadog's observability time-series benchmark
- [Toto 1.0 Weights](https://huggingface.co/Datadog/Toto-Open-Base-1.0)


## 📖 Citation


*(citation coming soon)*