	---
	tags:
	- time-series-forecasting
	- foundation-models
	- pretrained-models
	- time-series
	- timeseries
	- forecasting
	- observability
	- ensemble
	- meta-learning
	- gift-eval
	license: apache-2.0
	pipeline_tag: time-series-forecasting
	thumbnail: https://corp.dd-static.net/img/about/presskit/kit/press_kit.png
	model-index:
	- name: Toto-2.0-Family-and-Friends
	  results:
	  - task:
	      type: time-series-forecasting
	    dataset:
	      name: GIFT-Eval
	      type: GIFT-Eval
	    metrics:
	    - name: CRPS
	      type: CRPS
	      value: 0.463
	    - name: MASE
	      type: MASE
	      value: 0.676
	    source:
	      name: GIFT-Eval Time Series Forecasting Leaderboard
	      url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
	---

	# Toto 2.0 Family-and-Friends (FnF)

	> [!WARNING]
	> This is a benchmarking artifact, not a general-purpose model.
	> Toto-2.0-FnF is an FFORMA-style XGBoost meta-learner over 10 foundation models that we submitted to the GIFT-Eval leaderboard. The bundle ships pre-computed predictions for the GIFT-Eval test split and exists to make the #1 submission fully reproducible — it cannot forecast new series without first running every base model.
	>
	> For real workloads, please use the base [Toto 2.0 collection](https://huggingface.co/collections/Datadog/toto-20). The base checkpoints are pretrained without any public data, generalize to every benchmark we have evaluated, and are what we recommend deploying.

	## ✨ What this is

	Toto-2.0-FnF is a per-`(frequency, term)` XGBoost gate over a pool of 10 foundation models (5 Toto 2.0 sizes + 5 external models). The meta-learner consumes lightweight tsfeatures computed on the lookback context of each forecast window and emits a softmax over the model pool; the final forecast is the weighted sum of the 10 base-model quantile predictions. The design follows the [FFORMA](https://www.sciencedirect.com/science/article/abs/pii/S0169207019300895) framework (Montero-Manso et al., 2020).
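
The weighted sum described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's code: it assumes the gate's raw logits and the stacked base-model quantile forecasts are already in memory, and the helper names (`softmax`, `combine_forecasts`) and the random inputs are made up for the example.

```python
import numpy as np


def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def combine_forecasts(logits: np.ndarray, preds: np.ndarray) -> np.ndarray:
    """Weighted sum of base-model quantile forecasts (hypothetical helper).

    logits: (n_windows, n_models) raw gate outputs
    preds:  (n_windows, n_models, n_quantiles, prediction_length)
    returns (n_windows, n_quantiles, prediction_length)
    """
    weights = softmax(logits, axis=1)  # per-window weights over the model pool
    return np.einsum("wm,wmqh->wqh", weights, preds)


# Synthetic stand-ins: 4 windows, 10 models, 9 quantiles, horizon 12.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
preds = rng.normal(size=(4, 10, 9, 12))

ensemble = combine_forecasts(logits, preds)
print(ensemble.shape)  # (4, 9, 12)
```

Because the weights are non-negative and sum to one per window, each ensemble quantile is a convex combination of the base-model quantiles at the same level, so non-crossing base forecasts stay non-crossing after mixing.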

	<figure>
	<img src="assets/bar_metrics_gift_eval.png" alt="GIFT-Eval bar metrics — Toto 2.0 FnF highlighted">
	<figcaption>On the full GIFT-Eval leaderboard (foundation + finetuned + ensemble + agentic), Toto-2.0-FnF takes <b>#1 on every metric</b> (tied for #1 on raw CRPS).</figcaption>
	</figure>

	The replication notebook lives in the GIFT-Eval repo at [notebooks/toto_2_0_fnf.ipynb](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb).

	## 🧩 What's in the ensemble

	The Toto 2.0 family accounts for 39% of the assigned weight across all predictions, more than any other family in the pool: it is ahead of Chronos-2 (32%) and outweighs the four remaining external models combined.

	| # | Model | Family |
	|:---:|---|:---:|
	| 0 | [chronos-2](https://huggingface.co/amazon/chronos-2) | Chronos |
	| 1 | [timesfm-2.5](https://huggingface.co/google/timesfm-2.5-200m-pytorch) | TimesFM |
	| 2 | [flowstate](https://huggingface.co/ibm-research/flowstate) | FlowState |
	| 3 | [tirex](https://huggingface.co/NX-AI/TiRex-1.1-gifteval) | TiRex |
	| 4 | [patchtst-fm](https://huggingface.co/ibm-research/patchtst-fm-r1) | PatchTST |
	| 5 | [toto-2.0-4m](https://huggingface.co/Datadog/Toto-2.0-4m) | Toto 2.0 |
	| 6 | [toto-2.0-22m](https://huggingface.co/Datadog/Toto-2.0-22m) | Toto 2.0 |
	| 7 | [toto-2.0-313m](https://huggingface.co/Datadog/Toto-2.0-313m) | Toto 2.0 |
	| 8 | [toto-2.0-1b](https://huggingface.co/Datadog/Toto-2.0-1B) | Toto 2.0 |
	| 9 | [toto-2.0-2.5b](https://huggingface.co/Datadog/Toto-2.0-2.5B) | Toto 2.0 |

	Model order matters: the `#` index is the model's column position in `models.json`, and it is tied to the booster's class indices.

	## ✨ Key Features

	- Per-bucket gating: Separate XGBoost head per `(frequency, term)` bucket — each bucket learns its own softmax over the model pool so the ensemble can specialize without one global gate trading off across regimes.
	- No retraining at inference: The bundle ships pre-computed base-model predictions and tsfeatures for the full GIFT-Eval test split, so replication needs neither GPUs nor the base-model libraries.
	- No leakage: tsfeatures are computed only on the lookback context preceding each forecast window; the bundle stores dataset metadata but not ground-truth labels.

	## 📦 Bundle layout

	```
	booster_manifest.json      ~4.8 GB; base64-encoded XGBoost boosters keyed by "<canonical_freq>|<term>"
	feature_columns.json       train-time column order expected by the booster
	feature_types.json         XGBoost feature_types (c = categorical, q = float)
	categories.json            {"freq": [...], "domain": [...]} train-time category vocabularies
	models.json                list of model names in column order (column index ↔ model)
	test_features/<ds_dirname>/
	  test_features.npz        (n_windows, n_tsfeatures) tsfeatures from the lookback context preceding each window
	  test_metadata.npz        dataset-level scalars only (seasonality, prediction_length, num_variates, freq, domain)
	test_predictions/<model>/<ds_dirname>/
	  test_predictions.npz     (n_windows, 9, prediction_length) quantile forecasts at QUANTILE_LEVELS = [0.1, ..., 0.9]
	```

	`ds_dirname` follows GIFT-Eval's canonical naming: `<pretty_name>_<freq>_<term>` (e.g. `m4_weekly_W_short`).
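
To make the layout concrete, here is a small sketch that splits a `ds_dirname` and builds the per-dataset file paths. The helpers `split_ds_dirname` and `bundle_paths` and the `bundle` root directory are hypothetical names for this example, and the split assumes that neither the frequency nor the term contains an underscore.

```python
from pathlib import Path


def split_ds_dirname(ds_dirname: str) -> tuple[str, str, str]:
    """Split '<pretty_name>_<freq>_<term>' into its three parts.

    Splitting from the right keeps underscores inside pretty_name intact.
    Assumes freq and term themselves contain no underscore.
    """
    pretty_name, freq, term = ds_dirname.rsplit("_", 2)
    return pretty_name, freq, term


def bundle_paths(root: Path, ds_dirname: str, models: list[str]) -> dict:
    """Build the file paths for one dataset directory, following the layout above."""
    features_dir = root / "test_features" / ds_dirname
    return {
        "features": features_dir / "test_features.npz",
        "metadata": features_dir / "test_metadata.npz",
        "predictions": {
            model: root / "test_predictions" / model / ds_dirname / "test_predictions.npz"
            for model in models
        },
    }


print(split_ds_dirname("m4_weekly_W_short"))  # ('m4_weekly', 'W', 'short')

paths = bundle_paths(Path("bundle"), "m4_weekly_W_short", ["chronos-2", "toto-2.0-4m"])
print(paths["predictions"]["chronos-2"].as_posix())
```

In practice the `models` list would be read from `models.json` so that the stacking order matches the booster's class indices.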

	## ⚡ How the booster is used

	For each `(dataset, term)` pair:

	1. Load `test_features.npz` and `test_metadata.npz`. Reindex the tsfeatures to `feature_columns.json` — columns missing in this dataset's tsfeatures (e.g. `seasonal_strength` on yearly data) become NaN, which XGBoost handles natively. Attach scalar features (`seasonality`, `prediction_length`, `num_variates`) and categorical features (`freq`, `domain`) using the train-time categorical vocabularies in `categories.json`. The tsfeatures are computed only on the lookback context that precedes each forecast window, so no information from the ground-truth labels is ever used at inference time.
	2. Look up the bucket booster for `(canonical_freq, term)`, where `canonical_freq` strips pandas anchor suffixes (`W-TUE` → `W`, `Q-DEC` → `Q`).
	3. `booster.predict(..., output_margin=True)` returns raw class logits of shape `(n_windows, 10)`; softmax over the model axis gives the per-window weights.
	4. Stack the 10 per-model `test_predictions.npz` arrays into a `(n_windows, 10, 9, prediction_length)` tensor; weight-sum across the model axis → final quantile forecast.
	5. Score with `gluonts.model.evaluate_model` using the same call shape every other GIFT-Eval submission uses (see `evaluate_dataset` in the notebook).
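
Steps 1 and 2 can be sketched with pandas as follows. This is an illustration under stated assumptions, not the notebook's implementation: the helper names (`canonical_freq`, `bucket_key`, `build_feature_frame`), the toy feature values, and the category vocabularies are invented for the example, and the manifest key format is taken from the bundle layout description. Decoding the base64 boosters and calling XGBoost are left out because the manifest's exact serialization is not described here.

```python
import pandas as pd


def canonical_freq(freq: str) -> str:
    """Drop a pandas anchor suffix: 'W-TUE' -> 'W', 'Q-DEC' -> 'Q'."""
    return freq.split("-", 1)[0]


def bucket_key(freq: str, term: str) -> str:
    """Manifest key, assuming the '<canonical_freq>|<term>' format."""
    return f"{canonical_freq(freq)}|{term}"


def build_feature_frame(tsfeatures, feature_columns, scalars, categoricals, categories):
    """Assemble booster inputs for one dataset (hypothetical helper)."""
    # Reindex to the train-time column order; columns absent from this
    # dataset's tsfeatures are created and filled with NaN.
    frame = tsfeatures.reindex(columns=feature_columns)
    # Dataset-level scalars are broadcast to every window.
    for name, value in scalars.items():
        frame[name] = float(value)
    # Categoricals reuse the train-time vocabulary so integer codes line up;
    # a value outside the vocabulary becomes a missing category.
    for name, value in categoricals.items():
        frame[name] = pd.Categorical([value] * len(frame), categories=categories[name])
    return frame


# Toy inputs: two windows, with 'seasonal_strength' missing from the extraction.
tsfeatures = pd.DataFrame({"trend": [0.2, 0.7], "entropy": [0.9, 0.4]})
frame = build_feature_frame(
    tsfeatures,
    feature_columns=["entropy", "seasonal_strength", "trend"],
    scalars={"seasonality": 1, "prediction_length": 6, "num_variates": 1},
    categoricals={"freq": "A", "domain": "Econ/Fin"},
    categories={"freq": ["A", "D", "W"], "domain": ["Econ/Fin", "Web/CloudOps"]},
)

print(bucket_key("W-TUE", "short"))  # W|short
print(frame.dtypes)
```

Assigning the scalars and categoricals after the reindex works whether or not `feature_columns.json` already lists them: an existing column is overwritten in place, and a new one is appended.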

	## 🔁 Reproducing from scratch

	Each base model's predictions were generated by running its standard GIFT-Eval notebook (`notebooks/chronos-2.ipynb`, etc.) with a wrapper that saves the per-window quantile forecasts to `test_predictions.npz` instead of going straight into `evaluate_model`. The notebook's "Optional B" section shows the wrapper for every pool member. Time-series features come from the [tsfeatures](https://github.com/Nixtla/tsfeatures) library; "Optional A" in the notebook shows the per-window extraction call. The meta-learner boosters were trained on the corresponding train-window predictions, which are not included in this bundle.
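
The shape of such a saving wrapper might look like the sketch below. It is not the notebook's "Optional B" code: `ToyForecast` is a stand-in that only mimics the `.quantile(q)` accessor of GluonTS forecast objects, the helpers `stack_quantiles` and `save_predictions` are hypothetical, the sketch covers univariate forecasts only, and the array key `predictions` inside the `.npz` file is an assumption, since the bundle's actual key name is not stated here.

```python
import io

import numpy as np

QUANTILE_LEVELS = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9


class ToyForecast:
    """Stand-in exposing a `.quantile(q)` accessor like GluonTS forecasts."""

    def __init__(self, samples: np.ndarray):
        self.samples = samples  # (n_samples, prediction_length)

    def quantile(self, q: float) -> np.ndarray:
        return np.quantile(self.samples, q, axis=0)


def stack_quantiles(forecasts, levels=QUANTILE_LEVELS) -> np.ndarray:
    """Stack per-window forecasts into (n_windows, n_quantiles, prediction_length)."""
    return np.stack(
        [np.stack([np.asarray(fc.quantile(q)) for q in levels]) for fc in forecasts]
    )


def save_predictions(file, forecasts) -> np.ndarray:
    """Write the stacked quantiles to `file` (a path or file-like object)."""
    array = stack_quantiles(forecasts)
    np.savez_compressed(file, predictions=array)  # key name is hypothetical
    return array


# Three synthetic windows with horizon 8, written to an in-memory buffer.
rng = np.random.default_rng(0)
forecasts = [ToyForecast(rng.normal(size=(100, 8))) for _ in range(3)]

buffer = io.BytesIO()
saved = save_predictions(buffer, forecasts)
buffer.seek(0)
loaded = np.load(buffer)["predictions"]
print(loaded.shape)  # (3, 9, 8)
```

In a real run, `file` would be the `test_predictions.npz` path for the given model and dataset, and `forecasts` would come from the base model's predictor rather than from random samples.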

	## 🔗 Additional Resources

	- [Technical Report](https://arxiv.org/abs/2605.20119)
	- [Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)
	- [Toto 2.0 Collection](https://huggingface.co/collections/Datadog/toto-20) — base Toto checkpoints (4m → 2.5B), which is what we recommend deploying
	- [Toto-2.0-2.5B-FT](https://huggingface.co/Datadog/Toto-2.0-2.5B-FT) — companion benchmark-only finetune
	- [GIFT-Eval benchmark](https://huggingface.co/spaces/Salesforce/GIFT-Eval) — leaderboard hosting this submission
	- [Replication notebook](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb) — fast-path scoring + optional regeneration of every artifact in this bundle
	- [GitHub Repository](https://github.com/DataDog/toto)
	- [BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM)

	## 📖 Citation

	```bibtex
	@misc{khwaja2026toto20timeseries,
	  title={Toto 2.0: Time Series Forecasting Enters the Scaling Era},
	  author={Emaad Khwaja and Chris Lettieri and Gerald Woo and Eden Belouadah and Marc Cenac and Guillaume Jarry and Enguerrand Paquin and Xunyi Zhao and Viktoriya Zhukov and Othmane Abou-Amal and Chenghao Liu and Ameet Talwalkar and David Asker},
	  year={2026},
	  eprint={2605.20119},
	  archivePrefix={arXiv},
	  primaryClass={cs.LG},
	  url={https://arxiv.org/abs/2605.20119},
	}
	```