---
tags:
- time-series-forecasting
- foundation-models
- pretrained-models
- time-series
- timeseries
- forecasting
- observability
- ensemble
- meta-learning
- gift-eval
license: apache-2.0
pipeline_tag: time-series-forecasting
thumbnail: https://corp.dd-static.net/img/about/presskit/kit/press_kit.png
model-index:
- name: Toto-2.0-Family-and-Friends
results:
- task:
type: time-series-forecasting
dataset:
name: GIFT-Eval
type: GIFT-Eval
metrics:
- name: CRPS
type: CRPS
value: 0.463
- name: MASE
type: MASE
value: 0.676
source:
name: GIFT-Eval Time Series Forecasting Leaderboard
url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
---
# Toto 2.0 Family-and-Friends (FnF)
> [!WARNING]
> **This is a benchmarking artifact, not a general-purpose model.**
> Toto-2.0-FnF is an FFORMA-style XGBoost meta-learner over 10 foundation models that we submitted to the GIFT-Eval leaderboard. The bundle ships pre-computed predictions for the GIFT-Eval test split and exists to make the **#1** submission fully reproducible — it cannot forecast new series without first running every base model.
>
> For real workloads, please use the base [Toto 2.0 collection](https://huggingface.co/collections/Datadog/toto-20). The base checkpoints are pretrained without any public data, generalize to every benchmark we have evaluated, and are what we recommend deploying.
## ✨ What this is?
A per-`(frequency, term)` XGBoost gate over a pool of 10 foundation models (5 Toto 2.0 sizes + 5 external models). The meta-learner consumes lightweight tsfeatures from each forecast window and emits a softmax over the model pool; the final forecast is the weighted sum of the 10 base-model quantile predictions. Following the [FFORMA](https://www.sciencedirect.com/science/article/abs/pii/S0169207019300895) framework (Montero-Manso et al., 2020).
On the full GIFT-Eval leaderboard (foundation + finetuned + ensemble + agentic), Toto-2.0-FnF takes #1 on every metric (tied for #1 on raw CRPS).
The replication notebook lives in the GIFT-Eval repo at [notebooks/toto_2_0_fnf.ipynb](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb).
## 🧩 What's in the ensemble
The Toto 2.0 family accounts for **39% of the assigned weight** across all predictions — more than any other model in the pool, ahead of Chronos-2 (32%) and more than the four remaining external models combined.
| # | Model | Family |
|:---:|---|:---:|
| 0 | [chronos-2](https://huggingface.co/amazon/chronos-2) | Chronos |
| 1 | [timesfm-2.5](https://huggingface.co/google/timesfm-2.5-200m-pytorch) | TimesFM |
| 2 | [flowstate](https://huggingface.co/ibm-research/flowstate) | FlowState |
| 3 | [tirex](https://huggingface.co/NX-AI/TiRex-1.1-gifteval) | TiRex |
| 4 | [patchtst-fm](https://huggingface.co/ibm-research/patchtst-fm-r1) | PatchTST |
| 5 | [toto-2.0-4m](https://huggingface.co/Datadog/Toto-2.0-4m) | Toto 2.0 |
| 6 | [toto-2.0-22m](https://huggingface.co/Datadog/Toto-2.0-22m) | Toto 2.0 |
| 7 | [toto-2.0-313m](https://huggingface.co/Datadog/Toto-2.0-313m) | Toto 2.0 |
| 8 | [toto-2.0-1b](https://huggingface.co/Datadog/Toto-2.0-1B) | Toto 2.0 |
| 9 | [toto-2.0-2.5b](https://huggingface.co/Datadog/Toto-2.0-2.5B) | Toto 2.0 |
Column order matters — it is tied to the booster's class indices.
## ✨ Key Features
- **Per-bucket gating:** Separate XGBoost head per `(frequency, term)` bucket — each bucket learns its own softmax over the model pool so the ensemble can specialize without one global gate trading off across regimes.
- **No retraining at inference:** The bundle ships pre-computed base-model predictions and tsfeatures for the full GIFT-Eval test split, so replication needs neither GPUs nor the base-model libraries.
- **No leakage:** tsfeatures are computed only on the lookback context preceding each forecast window; the bundle stores dataset metadata but not ground-truth labels.
## 📦 Bundle layout
```
booster_manifest.json ~4.8 GB — base64-encoded XGBoost boosters keyed by "|"
feature_columns.json train-time column order expected by the booster
feature_types.json XGBoost feature_types (c = categorical, q = float)
categories.json {"freq": [...], "domain": [...]} train-time category vocabularies
models.json list of model names in column order (column index ↔ model)
test_features//
test_features.npz (n_windows, n_tsfeatures) tsfeatures from the lookback context preceding each window
test_metadata.npz dataset-level scalars only (seasonality, prediction_length, num_variates, freq, domain)
test_predictions///
test_predictions.npz (n_windows, 9, prediction_length) quantile forecasts at QUANTILE_LEVELS = [0.1, ..., 0.9]
```
`ds_dirname` follows GIFT-Eval's canonical naming: `__` (e.g. `m4_weekly_W_short`).
## ⚡ How the booster is used
Per (dataset, term):
1. Load `test_features.npz` and `test_metadata.npz`. Reindex the tsfeatures to `feature_columns.json` — columns missing in this dataset's tsfeatures (e.g. `seasonal_strength` on yearly data) become NaN, which XGBoost handles natively. Attach scalar features (`seasonality`, `prediction_length`, `num_variates`) and categorical features (`freq`, `domain`) using the train-time categorical vocabularies in `categories.json`. *The tsfeatures are computed only on the lookback context that precedes each forecast window, so no information from the ground-truth labels is ever used at inference time.*
2. Look up the bucket booster for `(canonical_freq, term)` where canonical_freq strips pandas anchor suffixes (`W-TUE` → `W`, `Q-DEC` → `Q`).
3. `booster.predict(..., output_margin=True)` returns raw class logits of shape `(n_windows, 10)`; softmax over the model axis gives the per-window weights.
4. Stack the 10 per-model `test_predictions.npz` arrays into a `(n_windows, 10, 9, prediction_length)` tensor; weight-sum across the model axis → final quantile forecast.
5. Score with `gluonts.evaluate_model` using the same call shape every other GIFT-Eval submission uses (see `evaluate_dataset` in the notebook).
## 🔁 Reproducing from scratch
Each base model's predictions were generated by running its standard GIFT-Eval notebook (`notebooks/chronos-2.ipynb`, etc.) with a wrapper that saves the per-window quantile forecasts to `test_predictions.npz` instead of going straight into `evaluate_model`. The notebook's "Optional B" section shows the wrapper for every pool member. Time-series features come from the [tsfeatures](https://github.com/Nixtla/tsfeatures) library; "Optional A" in the notebook shows the per-window extraction call. The meta-learner boosters were trained on the corresponding train-window predictions, which are not included in this bundle.
## 🔗 Additional Resources
- [Technical Report](https://arxiv.org/abs/2605.20119)
- [Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)
- [Toto 2.0 Collection](https://huggingface.co/collections/Datadog/toto-20) — base Toto checkpoints (4m → 2.5B), which is what we recommend deploying
- [Toto-2.0-2.5B-FT](https://huggingface.co/Datadog/Toto-2.0-2.5B-FT) — companion benchmark-only finetune
- [GIFT-Eval benchmark](https://huggingface.co/spaces/Salesforce/GIFT-Eval) — leaderboard hosting this submission
- [Replication notebook](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb) — fast-path scoring + optional regeneration of every artifact in this bundle
- [GitHub Repository](https://github.com/DataDog/toto)
- [BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM)
## 📖 Citation
```bibtex
@misc{khwaja2026toto20timeseries,
title={Toto 2.0: Time Series Forecasting Enters the Scaling Era},
author={Emaad Khwaja and Chris Lettieri and Gerald Woo and Eden Belouadah and Marc Cenac and Guillaume Jarry and Enguerrand Paquin and Xunyi Zhao and Viktoriya Zhukov and Othmane Abou-Amal and Chenghao Liu and Ameet Talwalkar and David Asker},
year={2026},
eprint={2605.20119},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2605.20119},
}
```