| --- |
| tags: |
| - time-series-forecasting |
| - foundation-models |
| - pretrained-models |
| - time-series |
| - timeseries |
| - forecasting |
| - observability |
| - ensemble |
| - meta-learning |
| - gift-eval |
| license: apache-2.0 |
| pipeline_tag: time-series-forecasting |
| thumbnail: https://corp.dd-static.net/img/about/presskit/kit/press_kit.png |
| model-index: |
| - name: Toto-2.0-Family-and-Friends |
| results: |
| - task: |
| type: time-series-forecasting |
| dataset: |
| name: GIFT-Eval |
| type: GIFT-Eval |
| metrics: |
| - name: CRPS |
| type: CRPS |
| value: 0.463 |
| - name: MASE |
| type: MASE |
| value: 0.676 |
| source: |
| name: GIFT-Eval Time Series Forecasting Leaderboard |
| url: https://huggingface.co/spaces/Salesforce/GIFT-Eval |
| --- |
| |
| # Toto 2.0 Family-and-Friends (FnF) |
|
|
| > [!WARNING] |
| > **This is a benchmarking artifact, not a general-purpose model.** |
| > Toto-2.0-FnF is an FFORMA-style XGBoost meta-learner over 10 foundation models that we submitted to the GIFT-Eval leaderboard. The bundle ships pre-computed predictions for the GIFT-Eval test split and exists to make the **#1** submission fully reproducible; it cannot forecast new series without first running every base model. |
| > |
| > For real workloads, please use the base [Toto 2.0 collection](https://huggingface.co/collections/Datadog/toto-20). The base checkpoints are pretrained without any public data, generalize to every benchmark we have evaluated, and are what we recommend deploying. |
|
|
| ## ✨ What this is |
|
|
| A per-`(frequency, term)` XGBoost gate over a pool of 10 foundation models (5 Toto 2.0 sizes + 5 external models). The meta-learner consumes lightweight tsfeatures computed for each forecast window and emits a softmax over the model pool; the final forecast is the weighted sum of the 10 base-model quantile predictions. The design follows the [FFORMA](https://www.sciencedirect.com/science/article/abs/pii/S0169207019300895) framework (Montero-Manso et al., 2020). |
|
|
| <figure> |
| <img src="assets/bar_metrics_gift_eval.png" alt="GIFT-Eval bar metrics, Toto 2.0 FnF highlighted"> |
| <figcaption>On the full GIFT-Eval leaderboard (foundation + finetuned + ensemble + agentic), Toto-2.0-FnF takes <b>#1 on every metric</b> (tied for #1 on raw CRPS).</figcaption> |
| </figure> |
|
|
| The replication notebook lives in the GIFT-Eval repo at [notebooks/toto_2_0_fnf.ipynb](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb). |
|
|
| ## 🧩 What's in the ensemble |
|
|
| The Toto 2.0 family accounts for **39% of the assigned weight** across all predictions, more than any other model family in the pool: it is ahead of Chronos-2 (32%) and outweighs the four remaining external models combined. |
|
|
| | # | Model | Family | |
| |:---:|---|:---:| |
| | 0 | [chronos-2](https://huggingface.co/amazon/chronos-2) | Chronos | |
| | 1 | [timesfm-2.5](https://huggingface.co/google/timesfm-2.5-200m-pytorch) | TimesFM | |
| | 2 | [flowstate](https://huggingface.co/ibm-research/flowstate) | FlowState | |
| | 3 | [tirex](https://huggingface.co/NX-AI/TiRex-1.1-gifteval) | TiRex | |
| | 4 | [patchtst-fm](https://huggingface.co/ibm-research/patchtst-fm-r1) | PatchTST | |
| | 5 | [toto-2.0-4m](https://huggingface.co/Datadog/Toto-2.0-4m) | Toto 2.0 | |
| | 6 | [toto-2.0-22m](https://huggingface.co/Datadog/Toto-2.0-22m) | Toto 2.0 | |
| | 7 | [toto-2.0-313m](https://huggingface.co/Datadog/Toto-2.0-313m) | Toto 2.0 | |
| | 8 | [toto-2.0-1b](https://huggingface.co/Datadog/Toto-2.0-1B) | Toto 2.0 | |
| | 9 | [toto-2.0-2.5b](https://huggingface.co/Datadog/Toto-2.0-2.5B) | Toto 2.0 | |
|
|
| Model order matters: the index in the `#` column is tied to the booster's class indices. |
|
|
| ## ✨ Key Features |
|
|
| - **Per-bucket gating:** Separate XGBoost head per `(frequency, term)` bucket. Each bucket learns its own softmax over the model pool, so the ensemble can specialize without one global gate trading off across regimes. |
| - **No retraining at inference:** The bundle ships pre-computed base-model predictions and tsfeatures for the full GIFT-Eval test split, so replication needs neither GPUs nor the base-model libraries. |
| - **No leakage:** tsfeatures are computed only on the lookback context preceding each forecast window; the bundle stores dataset metadata but not ground-truth labels. |
|
|
| ## 📦 Bundle layout |
|
|
| ``` |
| booster_manifest.json ~4.8 GB, base64-encoded XGBoost boosters keyed by "<canonical_freq>|<term>" |
| feature_columns.json train-time column order expected by the booster |
| feature_types.json XGBoost feature_types (c = categorical, q = float) |
| categories.json {"freq": [...], "domain": [...]} train-time category vocabularies |
| models.json list of model names in column order (column index -> model) |
| test_features/<ds_dirname>/ |
| test_features.npz (n_windows, n_tsfeatures) tsfeatures from the lookback context preceding each window |
| test_metadata.npz dataset-level scalars only (seasonality, prediction_length, num_variates, freq, domain) |
| test_predictions/<model>/<ds_dirname>/ |
| test_predictions.npz (n_windows, 9, prediction_length) quantile forecasts at QUANTILE_LEVELS = [0.1, ..., 0.9] |
| ``` |
|
|
| `ds_dirname` follows GIFT-Eval's canonical naming: `<pretty_name>_<freq>_<term>` (e.g. `m4_weekly_W_short`). |
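
As a minimal sketch of how the layout and naming rule above fit together, the snippet below splits a `ds_dirname` into its parts and builds the per-dataset file paths. The helper names `split_ds_dirname` and `bundle_paths` are hypothetical (they are not part of the bundle or the notebook), and the split assumes that neither `<freq>` nor `<term>` contains an underscore.

```python
from pathlib import Path


def split_ds_dirname(ds_dirname):
    """Split '<pretty_name>_<freq>_<term>' into its three parts.

    Splitting from the right keeps underscores inside pretty_name intact.
    Assumption: freq and term themselves contain no underscore.
    """
    pretty_name, freq, term = ds_dirname.rsplit("_", 2)
    return pretty_name, freq, term


def bundle_paths(root, ds_dirname, models):
    """Build the file paths listed in the bundle layout for one dataset directory."""
    root = Path(root)
    feature_dir = root / "test_features" / ds_dirname
    return {
        "features": feature_dir / "test_features.npz",
        "metadata": feature_dir / "test_metadata.npz",
        "predictions": {
            model: root / "test_predictions" / model / ds_dirname / "test_predictions.npz"
            for model in models
        },
    }


# "bundle" is a placeholder for wherever the repository was downloaded.
parts = split_ds_dirname("m4_weekly_W_short")
paths = bundle_paths("bundle", "m4_weekly_W_short", ["chronos-2", "toto-2.0-4m"])
print(parts)
print(paths["predictions"]["chronos-2"])
```

In practice the `models` argument would be the list read from `models.json`, so that the dictionary order matches the booster's class indices.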
|
|
| ## ⚡ How the booster is used |
|
|
| Per (dataset, term): |
|
|
| 1. Load `test_features.npz` and `test_metadata.npz`. Reindex the tsfeatures to `feature_columns.json`; columns missing in this dataset's tsfeatures (e.g. `seasonal_strength` on yearly data) become NaN, which XGBoost handles natively. Attach scalar features (`seasonality`, `prediction_length`, `num_variates`) and categorical features (`freq`, `domain`) using the train-time categorical vocabularies in `categories.json`. *The tsfeatures are computed only on the lookback context that precedes each forecast window, so no information from the ground-truth labels is ever used at inference time.* |
| 2. Look up the bucket booster for `(canonical_freq, term)`, where `canonical_freq` strips pandas anchor suffixes (`W-TUE` → `W`, `Q-DEC` → `Q`). |
| 3. `booster.predict(..., output_margin=True)` returns raw class logits of shape `(n_windows, 10)`; softmax over the model axis gives the per-window weights. |
| 4. Stack the 10 per-model `test_predictions.npz` arrays into a `(n_windows, 10, 9, prediction_length)` tensor; the weighted sum across the model axis is the final quantile forecast. |
| 5. Score with GluonTS's `evaluate_model` (`gluonts.model.evaluate_model`) using the same call shape every other GIFT-Eval submission uses (see `evaluate_dataset` in the notebook). |
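
Steps 1 to 4 can be sketched with NumPy alone. This is an illustration under stated assumptions, not the notebook's code: the logits are random stand-ins for the output of `booster.predict(..., output_margin=True)`, the predictions are random arrays rather than real forecasts, the feature names are only examples, and the helper names (`reindex_features`, `canonical_freq`, `softmax`, `combine`) are hypothetical. The `canonical_freq` helper only drops the part after the first hyphen, which covers the two examples in step 2 but may not match the notebook's rule for every frequency string.

```python
import numpy as np

N_MODELS, N_QUANTILES = 10, 9


def reindex_features(values, columns, expected_columns):
    """Step 1: reorder columns to the train-time order; absent columns become NaN."""
    position = {name: i for i, name in enumerate(columns)}
    out = np.full((values.shape[0], len(expected_columns)), np.nan)
    for j, name in enumerate(expected_columns):
        if name in position:
            out[:, j] = values[:, position[name]]
    return out


def canonical_freq(freq):
    """Step 2: drop a pandas anchor suffix, e.g. 'W-TUE' -> 'W' (assumed rule)."""
    return freq.split("-", 1)[0]


def softmax(logits, axis=-1):
    """Step 3: numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def combine(logits, stacked_predictions):
    """Step 4: weighted sum over the model axis.

    logits:              (n_windows, n_models)
    stacked_predictions: (n_windows, n_models, n_quantiles, prediction_length)
    returns:             (n_windows, n_quantiles, prediction_length)
    """
    weights = softmax(logits, axis=1)
    return np.einsum("wm,wmqh->wqh", weights, stacked_predictions)


# Synthetic stand-ins for the bundle contents (random data, not real forecasts).
rng = np.random.default_rng(0)
n_windows, prediction_length = 4, 6
features = reindex_features(
    rng.normal(size=(n_windows, 2)),
    columns=["trend", "entropy"],  # example feature names only
    expected_columns=["entropy", "seasonal_strength", "trend"],
)
logits = rng.normal(size=(n_windows, N_MODELS))  # stands in for the booster's raw margins
stacked = rng.normal(size=(n_windows, N_MODELS, N_QUANTILES, prediction_length))
forecast = combine(logits, stacked)
bucket_key = f"{canonical_freq('W-TUE')}|short"  # key format from booster_manifest.json
print(features.shape, forecast.shape, bucket_key)
```

Because the weights are a softmax, each window's forecast is a convex combination of the base-model quantiles; a window whose logits strongly favor one model effectively reproduces that model's forecast.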
|
|
| ## Reproducing from scratch |
|
|
| Each base model's predictions were generated by running its standard GIFT-Eval notebook (`notebooks/chronos-2.ipynb`, etc.) with a wrapper that saves the per-window quantile forecasts to `test_predictions.npz` instead of going straight into `evaluate_model`. The notebook's "Optional B" section shows the wrapper for every pool member. Time-series features come from the [tsfeatures](https://github.com/Nixtla/tsfeatures) library; "Optional A" in the notebook shows the per-window extraction call. The meta-learner boosters were trained on the corresponding train-window predictions, which are not included in this bundle. |
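
The saving step of such a wrapper might look roughly like the sketch below. It is not the notebook's "Optional B" code: it assumes a base model that returns sample paths of shape `(n_windows, n_samples, prediction_length)`, reduces them to the nine quantile levels, and writes the result under a hypothetical array key `predictions`. The real key name inside `test_predictions.npz` and the helper names here are assumptions; a model that already emits quantiles would skip the reduction.

```python
import tempfile
from pathlib import Path

import numpy as np

QUANTILE_LEVELS = np.arange(1, 10) / 10  # 0.1, 0.2, ..., 0.9


def samples_to_quantiles(samples):
    """(n_windows, n_samples, prediction_length) -> (n_windows, 9, prediction_length)."""
    q = np.quantile(samples, QUANTILE_LEVELS, axis=1)  # (9, n_windows, prediction_length)
    return np.moveaxis(q, 0, 1)


def save_predictions(path, quantiles):
    """Write one dataset's quantile forecasts; the 'predictions' key is hypothetical."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    np.savez_compressed(path, predictions=quantiles)


# Random sample paths stand in for a real base model's output.
rng = np.random.default_rng(0)
samples = rng.normal(size=(3, 200, 5))
quantiles = samples_to_quantiles(samples)

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "test_predictions" / "some-model" / "m4_weekly_W_short" / "test_predictions.npz"
    save_predictions(out, quantiles)
    with np.load(out) as data:
        reloaded = data["predictions"]

print(quantiles.shape)
```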
|
|
| ## Additional Resources |
|
|
| - [Technical Report](https://arxiv.org/abs/2605.20119) |
| - [Blog Post](https://www.datadoghq.com/blog/ai/toto-2/) |
| - [Toto 2.0 Collection](https://huggingface.co/collections/Datadog/toto-20): base Toto checkpoints (4m to 2.5B), which are what we recommend deploying |
| - [Toto-2.0-2.5B-FT](https://huggingface.co/Datadog/Toto-2.0-2.5B-FT): companion benchmark-only finetune |
| - [GIFT-Eval benchmark](https://huggingface.co/spaces/Salesforce/GIFT-Eval): leaderboard hosting this submission |
| - [Replication notebook](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb): fast-path scoring + optional regeneration of every artifact in this bundle |
| - [GitHub Repository](https://github.com/DataDog/toto) |
| - [BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM) |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{khwaja2026toto20timeseries, |
| title={Toto 2.0: Time Series Forecasting Enters the Scaling Era}, |
| author={Emaad Khwaja and Chris Lettieri and Gerald Woo and Eden Belouadah and Marc Cenac and Guillaume Jarry and Enguerrand Paquin and Xunyi Zhao and Viktoriya Zhukov and Othmane Abou-Amal and Chenghao Liu and Ameet Talwalkar and David Asker}, |
| year={2026}, |
| eprint={2605.20119}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.LG}, |
| url={https://arxiv.org/abs/2605.20119}, |
| } |
| ``` |
|
|