tags:
- time-series-forecasting
- foundation-models
- pretrained-models
- time-series
- timeseries
- forecasting
- observability
- ensemble
- meta-learning
- gift-eval
license: apache-2.0
pipeline_tag: time-series-forecasting
thumbnail: https://corp.dd-static.net/img/about/presskit/kit/press_kit.png
model-index:
- name: Toto-2.0-Family-and-Friends
  results:
  - task:
      type: time-series-forecasting
    dataset:
      name: GIFT-Eval
      type: GIFT-Eval
    metrics:
    - name: CRPS
      type: CRPS
      value: 0.463
    - name: MASE
      type: MASE
      value: 0.676
    source:
      name: GIFT-Eval Time Series Forecasting Leaderboard
      url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
Toto 2.0 Family-and-Friends (FnF)
This is a benchmarking artifact, not a general-purpose model. Toto-2.0-FnF is an FFORMA-style XGBoost meta-learner over 10 foundation models that we submitted to the GIFT-Eval leaderboard. The bundle ships pre-computed predictions for the GIFT-Eval test split and exists to make the #1 submission fully reproducible; it cannot forecast new series without first running every base model.
For real workloads, please use the base Toto 2.0 collection. The base checkpoints are pretrained without any public data, generalize to every benchmark we have evaluated, and are what we recommend deploying.
✨ What this is
Toto-2.0-FnF is a per-(frequency, term) XGBoost gate over a pool of 10 foundation models (5 Toto 2.0 sizes and 5 external models). The meta-learner consumes lightweight tsfeatures computed on the lookback context preceding each forecast window and emits a softmax over the model pool; the final forecast is the weighted sum of the 10 base-model quantile predictions. The design follows the FFORMA framework (Montero-Manso et al., 2020).
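As a rough illustration of the gating step described above, the sketch below turns per-window logits into softmax weights and takes the weighted sum of the base-model quantile forecasts. The function names (`softmax`, `combine_forecasts`) and the random inputs are illustrative assumptions, not the notebook's code; only the array shapes follow this card.

```python
import numpy as np


def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def combine_forecasts(logits: np.ndarray, preds: np.ndarray) -> np.ndarray:
    """Weighted sum of base-model quantile forecasts.

    logits: (n_windows, n_models) raw gate outputs
    preds:  (n_windows, n_models, n_quantiles, prediction_length)
    returns (n_windows, n_quantiles, prediction_length)
    """
    weights = softmax(logits, axis=1)  # each row sums to 1 over the model axis
    return np.einsum("wm,wmqh->wqh", weights, preds)


# Toy inputs standing in for booster logits and stacked base-model predictions.
rng = np.random.default_rng(0)
n_windows, n_models, n_quantiles, horizon = 4, 10, 9, 12
logits = rng.normal(size=(n_windows, n_models))
preds = rng.normal(size=(n_windows, n_models, n_quantiles, horizon))

ensemble = combine_forecasts(logits, preds)
print(ensemble.shape)  # (4, 9, 12)
```

Because the weights are non-negative and sum to one, each ensemble quantile is a convex combination of the base-model quantiles at the same level, so base forecasts with ordered quantiles yield an ensemble with ordered quantiles.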
The replication notebook lives in the GIFT-Eval repo at notebooks/toto_2_0_fnf.ipynb.
🧩 What's in the ensemble
The Toto 2.0 family accounts for 39% of the assigned weight across all predictions, more than any other model family in the pool: it is ahead of Chronos-2 (32%) and outweighs the four remaining external models combined.
| # | Model | Family |
|---|---|---|
| 0 | chronos-2 | Chronos |
| 1 | timesfm-2.5 | TimesFM |
| 2 | flowstate | FlowState |
| 3 | tirex | TiRex |
| 4 | patchtst-fm | PatchTST |
| 5 | toto-2.0-4m | Toto 2.0 |
| 6 | toto-2.0-22m | Toto 2.0 |
| 7 | toto-2.0-313m | Toto 2.0 |
| 8 | toto-2.0-1b | Toto 2.0 |
| 9 | toto-2.0-2.5b | Toto 2.0 |
Column order matters: it is tied to the booster's class indices.
✨ Key Features
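To make the index-to-model mapping concrete, here is a small sketch that labels a matrix of gate weights with the pool order from the table and sums the share held by one family. The helper names are hypothetical, and averaging weights across windows is only one plausible way to aggregate; this card does not specify how its 39% figure was computed.

```python
import numpy as np

# Pool order copied from the table above: index i is booster class i.
MODELS = [
    "chronos-2", "timesfm-2.5", "flowstate", "tirex", "patchtst-fm",
    "toto-2.0-4m", "toto-2.0-22m", "toto-2.0-313m", "toto-2.0-1b", "toto-2.0-2.5b",
]


def mean_weight_per_model(weights: np.ndarray) -> dict:
    """Average (n_windows, 10) gate weights into one share per model name."""
    if weights.shape[1] != len(MODELS):
        raise ValueError("weights must have one column per pool member")
    return dict(zip(MODELS, weights.mean(axis=0).tolist()))


def family_share(weights: np.ndarray, prefix: str) -> float:
    """Sum the mean weights of every model whose name starts with `prefix`."""
    shares = mean_weight_per_model(weights)
    return sum(value for name, value in shares.items() if name.startswith(prefix))


# Toy input: a uniform gate, so the five Toto sizes hold half the weight.
uniform = np.full((3, len(MODELS)), 1.0 / len(MODELS))
print(round(family_share(uniform, "toto-2.0"), 6))  # 0.5
```

In the bundle itself, `models.json` is described as carrying this same ordering, so reading the list from that file rather than hard-coding it would avoid drift.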
- Per-bucket gating: Separate XGBoost head per `(frequency, term)` bucket. Each bucket learns its own softmax over the model pool, so the ensemble can specialize without one global gate trading off across regimes.
- No retraining at inference: The bundle ships pre-computed base-model predictions and tsfeatures for the full GIFT-Eval test split, so replication needs neither GPUs nor the base-model libraries.
- No leakage: tsfeatures are computed only on the lookback context preceding each forecast window; the bundle stores dataset metadata but not ground-truth labels.
📦 Bundle layout
- `booster_manifest.json` (~4.8 GB): base64-encoded XGBoost boosters keyed by `"<canonical_freq>|<term>"`
- `feature_columns.json`: train-time column order expected by the booster
- `feature_types.json`: XGBoost `feature_types` (`c` = categorical, `q` = float)
- `categories.json`: `{"freq": [...], "domain": [...]}` train-time category vocabularies
- `models.json`: list of model names in column order (column index → model)
- `test_features/<ds_dirname>/`
  - `test_features.npz`: `(n_windows, n_tsfeatures)` tsfeatures from the lookback context preceding each window
  - `test_metadata.npz`: dataset-level scalars only (`seasonality`, `prediction_length`, `num_variates`, `freq`, `domain`)
- `test_predictions/<model>/<ds_dirname>/`
  - `test_predictions.npz`: `(n_windows, 9, prediction_length)` quantile forecasts at `QUANTILE_LEVELS = [0.1, ..., 0.9]`
ds_dirname follows GIFT-Eval's canonical naming: <pretty_name>_<freq>_<term> (e.g. m4_weekly_W_short).
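The naming convention and directory layout above can be navigated with a few path helpers. This is a sketch under stated assumptions: the helper names are hypothetical, `bundle_root` is wherever the bundle was downloaded, and the parser assumes that neither the frequency nor the term contains an underscore.

```python
from pathlib import Path
from typing import NamedTuple


class DatasetDir(NamedTuple):
    pretty_name: str
    freq: str
    term: str


def parse_ds_dirname(ds_dirname: str) -> DatasetDir:
    """Split '<pretty_name>_<freq>_<term>' into its three parts."""
    # pretty_name may itself contain underscores, so split from the right.
    pretty_name, freq, term = ds_dirname.rsplit("_", 2)
    return DatasetDir(pretty_name, freq, term)


def feature_paths(bundle_root: Path, ds_dirname: str) -> tuple:
    """Paths to the tsfeatures and metadata archives for one dataset."""
    base = bundle_root / "test_features" / ds_dirname
    return base / "test_features.npz", base / "test_metadata.npz"


def prediction_path(bundle_root: Path, model: str, ds_dirname: str) -> Path:
    """Path to one base model's quantile forecasts for one dataset."""
    return bundle_root / "test_predictions" / model / ds_dirname / "test_predictions.npz"


print(parse_ds_dirname("m4_weekly_W_short"))
# DatasetDir(pretty_name='m4_weekly', freq='W', term='short')
```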
⚡ How the booster is used
Per (dataset, term):
- Load `test_features.npz` and `test_metadata.npz`. Reindex the tsfeatures to `feature_columns.json`; columns missing in this dataset's tsfeatures (e.g. `seasonal_strength` on yearly data) become NaN, which XGBoost handles natively. Attach scalar features (`seasonality`, `prediction_length`, `num_variates`) and categorical features (`freq`, `domain`) using the train-time categorical vocabularies in `categories.json`. The tsfeatures are computed only on the lookback context that precedes each forecast window, so no information from the ground-truth labels is ever used at inference time.
- Look up the bucket booster for `(canonical_freq, term)`, where `canonical_freq` strips pandas anchor suffixes (`W-TUE` → `W`, `Q-DEC` → `Q`).
- `booster.predict(..., output_margin=True)` returns raw class logits of shape `(n_windows, 10)`; softmax over the model axis gives the per-window weights.
- Stack the 10 per-model `test_predictions.npz` arrays into a `(n_windows, 10, 9, prediction_length)` tensor; a weighted sum across the model axis gives the final quantile forecast.
- Score with `gluonts.evaluate_model` using the same call shape every other GIFT-Eval submission uses (see `evaluate_dataset` in the notebook).
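The first two steps (feature alignment and bucket lookup) can be sketched as follows. The helper names are hypothetical, the feature names in the toy example are illustrative, and the per-dataset column names are passed in explicitly because this card does not say where they are stored. The sketch uses plain NumPy; with pandas, `DataFrame.reindex(columns=...)` would give the same NaN-filling behavior.

```python
import numpy as np


def canonical_freq(freq: str) -> str:
    """Strip a pandas anchor suffix, e.g. 'W-TUE' -> 'W', 'Q-DEC' -> 'Q'."""
    return freq.split("-", 1)[0]


def bucket_key(freq: str, term: str) -> str:
    """Manifest key in the '<canonical_freq>|<term>' form used by the bundle."""
    return f"{canonical_freq(freq)}|{term}"


def reindex_features(values: np.ndarray, columns: list, target_columns: list) -> np.ndarray:
    """Reorder feature columns to the train-time order, filling gaps with NaN.

    values:  (n_windows, len(columns)) features available for this dataset
    returns: (n_windows, len(target_columns))
    """
    out = np.full((values.shape[0], len(target_columns)), np.nan)
    position = {name: i for i, name in enumerate(columns)}
    for j, name in enumerate(target_columns):
        if name in position:  # features absent for this dataset stay NaN
            out[:, j] = values[:, position[name]]
    return out


# Toy example: a yearly dataset that lacks 'seasonal_strength'.
train_columns = ["trend", "seasonal_strength", "entropy"]  # illustrative names
yearly_columns = ["entropy", "trend"]
yearly_values = np.array([[0.9, 0.1], [0.8, 0.2]])

aligned = reindex_features(yearly_values, yearly_columns, train_columns)
print(bucket_key("W-TUE", "short"))  # W|short
print(aligned.shape)                 # (2, 3)
```

Leaving the gaps as NaN rather than imputing them matches the description above, which relies on XGBoost's native handling of missing values.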
Reproducing from scratch
Each base model's predictions were generated by running its standard GIFT-Eval notebook (notebooks/chronos-2.ipynb, etc.) with a wrapper that saves the per-window quantile forecasts to test_predictions.npz instead of going straight into evaluate_model. The notebook's "Optional B" section shows the wrapper for every pool member. Time-series features come from the tsfeatures library; "Optional A" in the notebook shows the per-window extraction call. The meta-learner boosters were trained on the corresponding train-window predictions, which are not included in this bundle.
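The notebook's "Optional B" section is the reference for the actual wrappers. As a rough idea of what such a wrapper has to do, the sketch below stacks per-window quantile forecasts into the `(n_windows, 9, prediction_length)` layout and writes them to an `.npz` file. `SampleForecast` is a hypothetical stand-in that mimics the `quantile(q)` method of GluonTS forecast objects, and the archive key `predictions` is an assumption, not the bundle's documented key.

```python
import os
import tempfile

import numpy as np

QUANTILE_LEVELS = [round(0.1 * i, 1) for i in range(1, 10)]  # 0.1 ... 0.9


class SampleForecast:
    """Hypothetical stand-in for a forecast object exposing quantile(q)."""

    def __init__(self, samples: np.ndarray):
        self.samples = samples  # (n_samples, prediction_length)

    def quantile(self, q: float) -> np.ndarray:
        return np.quantile(self.samples, q, axis=0)


def stack_quantiles(forecasts) -> np.ndarray:
    """Return an array of shape (n_windows, 9, prediction_length)."""
    return np.stack(
        [np.stack([f.quantile(q) for q in QUANTILE_LEVELS]) for f in forecasts]
    )


def save_predictions(path: str, forecasts) -> None:
    """Write stacked quantiles to an .npz archive (key name is an assumption)."""
    np.savez_compressed(path, predictions=stack_quantiles(forecasts))


# Toy run: three windows, 100 sample paths each, horizon 12.
rng = np.random.default_rng(0)
forecasts = [SampleForecast(rng.normal(size=(100, 12))) for _ in range(3)]

with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "test_predictions.npz")
    save_predictions(out, forecasts)
    with np.load(out) as data:
        loaded = data["predictions"]

print(loaded.shape)  # (3, 9, 12)
```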
Additional Resources
- Technical Report
- Blog Post
- Toto 2.0 Collection: base Toto checkpoints (4M to 2.5B), which are what we recommend deploying
- Toto-2.0-2.5B-FT: companion benchmark-only finetune
- GIFT-Eval benchmark: leaderboard hosting this submission
- Replication notebook: fast-path scoring plus optional regeneration of every artifact in this bundle
- GitHub Repository
- BOOM Dataset
Citation
@misc{khwaja2026toto20timeseries,
title={Toto 2.0: Time Series Forecasting Enters the Scaling Era},
author={Emaad Khwaja and Chris Lettieri and Gerald Woo and Eden Belouadah and Marc Cenac and Guillaume Jarry and Enguerrand Paquin and Xunyi Zhao and Viktoriya Zhukov and Othmane Abou-Amal and Chenghao Liu and Ameet Talwalkar and David Asker},
year={2026},
eprint={2605.20119},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2605.20119},
}