metadata
tags:
  - time-series-forecasting
  - foundation-models
  - pretrained-models
  - time-series
  - timeseries
  - forecasting
  - observability
  - safetensors
  - pytorch_model_hub_mixin
license: apache-2.0
pipeline_tag: time-series-forecasting
thumbnail: https://web-assets.dd-static.net/42588/1778691695-toto-2-hero.png
model-index:
  - name: Toto-2.0-22m
    results:
      - task:
          type: time-series-forecasting
        dataset:
          name: BOOM
          type: BOOM
        metrics:
          - name: CRPS
            type: CRPS
            value: 0.363
          - name: MASE
            type: MASE
            value: 0.601
        source:
          name: BOOM 💥 Observability Time-Series Forecasting Leaderboard
          url: https://huggingface.co/spaces/Datadog/BOOM
      - task:
          type: time-series-forecasting
        dataset:
          name: GIFT-Eval
          type: GIFT-Eval
        metrics:
          - name: CRPS
            type: CRPS
            value: 0.496
          - name: MASE
            type: MASE
            value: 0.719
        source:
          name: GIFT-Eval Time Series Forecasting Leaderboard
          url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
      - task:
          type: time-series-forecasting
        dataset:
          name: TIME
          type: TIME
        metrics:
          - name: CRPS
            type: CRPS
            value: 0.556
          - name: MASE
            type: MASE
            value: 0.668
        source:
          name: TIME Benchmark Leaderboard
          url: https://huggingface.co/spaces/Real-TSF/TIME-leaderboard

Toto-2.0-22m

Toto (Time Series Optimized Transformer for Observability) is a family of time series foundation models for multivariate forecasting developed by Datadog. Toto 2.0 is the current generation, featuring u-μP-scaled transformers ranging from 4m to 2.5B parameters, all trained from a single recipe. Forecast quality improves reliably with parameter count across the family.

The family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT-Eval, the standard general-purpose benchmark; and the recent contamination-resistant TIME benchmark.

📊 Performance

Figure: Pareto frontier on BOOM and GIFT-Eval.
Every Toto 2.0 size sits on or near the Pareto frontier on both BOOM and GIFT-Eval. The three largest sizes rank first, second, and third among foundation models by CRPS rank on GIFT-Eval. On TIME, Toto 2.0 sizes take the top three spots on every metric, ahead of every external foundation model evaluated.

⚑ Quick Start

Inference code is available on GitHub: https://github.com/DataDog/toto

Installation

pip install "toto-2 @ git+https://github.com/DataDog/toto.git#subdirectory=toto2"

Inference Example

import torch
from toto2 import Toto2Model

model = Toto2Model.from_pretrained("Datadog/Toto-2.0-22m")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()

# (batch, n_variates, time_steps)
target = torch.randn(1, 1, 512, device=device)
target_mask = torch.ones_like(target, dtype=torch.bool)
series_ids = torch.zeros(1, 1, dtype=torch.long, device=device)

# Returns quantiles of shape (9, batch, n_variates, horizon)
# Quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
quantiles = model.forecast(
    {"target": target, "target_mask": target_mask, "series_ids": series_ids},
    horizon=96,
    decode_block_size=768,
    has_missing_values=False,
)

For more examples, see the Quick Start notebook and the GluonTS integration notebook.
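The quantile tensor returned above is usually reduced to a point forecast plus an uncertainty band before plotting or alerting. The following is a minimal sketch of that step, not part of the Toto API: `summarize_forecast` is a hypothetical helper, and the random array stands in for real model output converted with `quantiles.detach().cpu().numpy()`. It assumes the quantile axis comes first and follows the level order listed in the example.

```python
import numpy as np

# Quantile levels listed in the inference example above.
QUANTILE_LEVELS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def summarize_forecast(quantiles: np.ndarray) -> dict:
    """Split a (9, batch, n_variates, horizon) array into a median and an 80% band.

    Hypothetical helper for illustration; it only indexes the leading axis.
    """
    if quantiles.shape[0] != len(QUANTILE_LEVELS):
        raise ValueError("expected one leading slice per quantile level")
    return {
        "median": quantiles[QUANTILE_LEVELS.index(0.5)],  # point forecast
        "lower": quantiles[QUANTILE_LEVELS.index(0.1)],   # 10th percentile
        "upper": quantiles[QUANTILE_LEVELS.index(0.9)],   # 90th percentile
    }


# Stand-in for real model output: random draws sorted along the quantile axis
# so that lower quantiles never exceed higher ones.
rng = np.random.default_rng(0)
fake_quantiles = np.sort(rng.normal(size=(9, 1, 1, 96)), axis=0)

summary = summarize_forecast(fake_quantiles)
print(summary["median"].shape)  # (1, 1, 96)
```

The band between the 0.1 and 0.9 quantiles covers a nominal 80% interval; a narrower band can be taken from other pairs of levels, such as 0.2 and 0.8.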

💾 Available Checkpoints

All five Toto 2.0 sizes share the same training recipe; pick a size based on your accuracy/latency budget. Latency is forward-pass time for a 1,024-step single-pass forecast at batch size 8 on a single A100.

| Model | Params | Weights (fp32) | Latency | Recommended for |
|---|---|---|---|---|
| Toto‑2.0‑4m | 4m | 16 MB | ~3.8 ms | Edge / CPU deployment; tightest latency or memory budgets. |
| Toto‑2.0‑22m | 22m | 84 MB | ~5.0 ms | Efficient default: matches or beats Toto 1.0 quality with ~7× fewer parameters. |
| Toto‑2.0‑313m | 313m | 1.2 GB | ~15.4 ms | Strong general-purpose checkpoint; top-3 foundation model on GIFT-Eval. |
| Toto‑2.0‑1B | 1B | 3.9 GB | ~20.9 ms | Best quality / cost tradeoff for production workloads. |
| Toto‑2.0‑2.5B | 2.5B | 9.1 GB | ~36.2 ms | Highest accuracy; #1 foundation model on every benchmark. |
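One way to turn the table into a selection rule is to take the largest checkpoint whose listed latency fits a budget, since forecast quality is reported to improve with parameter count. The sketch below is illustrative only: `pick_checkpoint` is a hypothetical helper, and the latencies are the table's figures for batch size 8 on a single A100, so they would need re-measuring on other hardware or workloads.

```python
# Latencies (ms) copied from the table above: 1,024-step single-pass forecast,
# batch size 8, single A100. Ordered from smallest to largest model.
CHECKPOINT_LATENCY_MS = [
    ("Toto-2.0-4m", 3.8),
    ("Toto-2.0-22m", 5.0),
    ("Toto-2.0-313m", 15.4),
    ("Toto-2.0-1B", 20.9),
    ("Toto-2.0-2.5B", 36.2),
]


def pick_checkpoint(latency_budget_ms: float) -> str:
    """Return the largest checkpoint whose listed latency fits the budget.

    Hypothetical helper: assumes larger models are more accurate and that
    the table latencies are representative of the target deployment.
    """
    fitting = [name for name, ms in CHECKPOINT_LATENCY_MS if ms <= latency_budget_ms]
    if not fitting:
        raise ValueError(f"no checkpoint fits a {latency_budget_ms} ms budget")
    return fitting[-1]


print(pick_checkpoint(10.0))  # Toto-2.0-22m
```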

✨ Key Features

  • Zero-Shot Forecasting: Forecast without fine-tuning on your specific time series.
  • Multi-Variate Support: Efficiently process multiple variables using alternating time/variate attention.
  • Probabilistic Predictions: Generate point forecasts and uncertainty estimates via a quantile output head.
  • Decoder-Only Architecture: Support for variable prediction horizons and context lengths.
  • u-μP Scaling: A single training recipe transfers cleanly across all five sizes (4m → 2.5B).
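Quantile forecasts like these are commonly scored with the pinball (quantile) loss, which the architecture section names as the training loss for the quantile head. The sketch below shows the standard formula in NumPy; it is not taken from the Toto training code, and `pinball_loss` and `mean_pinball_loss` are hypothetical helper names.

```python
import numpy as np


def pinball_loss(y_true: np.ndarray, y_pred: np.ndarray, level: float) -> float:
    """Standard pinball loss for one quantile level, averaged over all elements.

    Under-prediction (y_true > y_pred) is weighted by `level`,
    over-prediction by `1 - level`.
    """
    diff = y_true - y_pred
    return float(np.mean(np.maximum(level * diff, (level - 1.0) * diff)))


def mean_pinball_loss(y_true: np.ndarray, quantile_preds: np.ndarray, levels: list) -> float:
    """Average the pinball loss over levels; quantile_preds has the level axis first."""
    losses = [pinball_loss(y_true, q, lvl) for q, lvl in zip(quantile_preds, levels)]
    return float(np.mean(losses))


levels = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
rng = np.random.default_rng(0)
y_true = rng.normal(size=96)                              # observed future values
quantile_preds = np.sort(rng.normal(size=(9, 96)), axis=0)  # stand-in forecasts
print(mean_pinball_loss(y_true, quantile_preds, levels))
```

At the 0.9 level, a forecast that is one unit too low costs 0.9 while one that is one unit too high costs 0.1, which is what pushes each output toward its target quantile.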

🏗️ Architecture

Figure: Overview of the Toto 2.0 architecture.
The model is a decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds contiguous patch masking (CPM) for single-pass parallel decoding, a quantile output head trained with pinball loss, a robust arcsinh input scaler, and residual MLP patch projections, and it is trained with the NorMuon optimizer. See the technical report for details.
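The alternation between time-axis and variate-axis attention can be pictured as two reshapes of the same patch tensor, each paired with its own mask. The following is a rough sketch of that idea, not Toto's implementation: the `(batch, variates, patches, dim)` layout and every function name here are assumptions made for illustration.

```python
import numpy as np


def time_axis_mask(n_patches: int) -> np.ndarray:
    """Causal mask: patch i may attend to patches 0..i of the same variate."""
    return np.tril(np.ones((n_patches, n_patches), dtype=bool))


def variate_axis_mask(n_variates: int) -> np.ndarray:
    """Full mask: at a given patch position, every variate sees every variate."""
    return np.ones((n_variates, n_variates), dtype=bool)


def as_time_sequences(x: np.ndarray) -> np.ndarray:
    """(batch, variates, patches, dim) -> (batch * variates, patches, dim)."""
    b, v, p, d = x.shape
    return x.reshape(b * v, p, d)


def as_variate_sequences(x: np.ndarray) -> np.ndarray:
    """(batch, variates, patches, dim) -> (batch * patches, variates, dim)."""
    b, v, p, d = x.shape
    return x.transpose(0, 2, 1, 3).reshape(b * p, v, d)


# Assumed layout: 2 series groups, 3 variates, 8 patches, 16-dim embeddings.
x = np.zeros((2, 3, 8, 16))
print(as_time_sequences(x).shape, time_axis_mask(8).shape)        # (6, 8, 16) (8, 8)
print(as_variate_sequences(x).shape, variate_axis_mask(3).shape)  # (16, 3, 16) (3, 3)
```

A time-axis layer would attend within each length-8 sequence under the causal mask, and a variate-axis layer within each length-3 sequence under the full mask, so information flows along one axis at a time.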

🔗 Additional Resources

📖 Citation

(citation coming soon)