---
tags:
- time-series-forecasting
- foundation-models
- pretrained-models
- time-series
- timeseries
- forecasting
- observability
- ensemble
- meta-learning
- gift-eval
license: apache-2.0
pipeline_tag: time-series-forecasting
thumbnail: https://corp.dd-static.net/img/about/presskit/kit/press_kit.png
model-index:
- name: Toto-2.0-Family-and-Friends
  results:
  - task:
      type: time-series-forecasting
    dataset:
      name: GIFT-Eval
      type: GIFT-Eval
    metrics:
    - name: CRPS
      type: CRPS
      value: 0.463
    - name: MASE
      type: MASE
      value: 0.676
    source:
      name: GIFT-Eval Time Series Forecasting Leaderboard
      url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
---

# Toto 2.0 Family-and-Friends (FnF)

> [!WARNING]
> **This is a benchmarking artifact, not a general-purpose model.**
> Toto-2.0-FnF is an FFORMA-style XGBoost meta-learner over 10 foundation models that we submitted to the GIFT-Eval leaderboard. The bundle ships pre-computed predictions for the GIFT-Eval test split and exists to make the **#1** submission fully reproducible; it cannot forecast new series without first running every base model.
>
> For real workloads, please use the base [Toto 2.0 collection](https://huggingface.co/collections/Datadog/toto-20). The base checkpoints are pretrained without any public data, generalize to every benchmark we have evaluated, and are what we recommend deploying.

## ✨ What is this?

A per-`(frequency, term)` XGBoost gate over a pool of 10 foundation models (5 Toto 2.0 sizes + 5 external models). The meta-learner consumes lightweight tsfeatures from each forecast window and emits a softmax over the model pool; the final forecast is the weighted sum of the 10 base-model quantile predictions. The design follows the [FFORMA](https://www.sciencedirect.com/science/article/abs/pii/S0169207019300895) framework (Montero-Manso et al., 2020).

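
The gating step described above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the code from the replication notebook: the helper names `softmax` and `combine_forecasts` are hypothetical, and random arrays stand in for the booster logits and the base-model quantile forecasts.

```python
import numpy as np


def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def combine_forecasts(logits: np.ndarray, predictions: np.ndarray) -> np.ndarray:
    """Weight per-model quantile forecasts by a per-window softmax.

    logits:      (n_windows, n_models) raw gate scores (hypothetical input)
    predictions: (n_windows, n_models, n_quantiles, horizon)
    returns:     (n_windows, n_quantiles, horizon)
    """
    weights = softmax(logits, axis=1)
    # Sum over the model axis "m", keeping window, quantile, and horizon axes.
    return np.einsum("wm,wmqh->wqh", weights, predictions)


# Synthetic stand-ins: 4 windows, 10 models, 9 quantiles, horizon of 12.
rng = np.random.default_rng(0)
n_windows, n_models, n_quantiles, horizon = 4, 10, 9, 12
logits = rng.normal(size=(n_windows, n_models))
predictions = rng.normal(size=(n_windows, n_models, n_quantiles, horizon))

forecast = combine_forecasts(logits, predictions)
print(forecast.shape)  # (4, 9, 12)
```

When every logit in a window is equal, the weights are uniform and the result reduces to a plain average of the base-model forecasts, which is a convenient sanity check.
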

*Figure: GIFT-Eval bar metrics, with Toto 2.0 FnF highlighted.*

On the full GIFT-Eval leaderboard (foundation + finetuned + ensemble + agentic), Toto-2.0-FnF takes #1 on every metric (tied for #1 on raw CRPS).

The replication notebook lives in the GIFT-Eval repo at [notebooks/toto_2_0_fnf.ipynb](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb).

## 🧩 What's in the ensemble

The Toto 2.0 family accounts for **39% of the assigned weight** across all predictions, more than any other model in the pool: ahead of Chronos-2 (32%) and more than the four remaining external models combined.

| # | Model | Family |
|:---:|---|:---:|
| 0 | [chronos-2](https://huggingface.co/amazon/chronos-2) | Chronos |
| 1 | [timesfm-2.5](https://huggingface.co/google/timesfm-2.5-200m-pytorch) | TimesFM |
| 2 | [flowstate](https://huggingface.co/ibm-research/flowstate) | FlowState |
| 3 | [tirex](https://huggingface.co/NX-AI/TiRex-1.1-gifteval) | TiRex |
| 4 | [patchtst-fm](https://huggingface.co/ibm-research/patchtst-fm-r1) | PatchTST |
| 5 | [toto-2.0-4m](https://huggingface.co/Datadog/Toto-2.0-4m) | Toto 2.0 |
| 6 | [toto-2.0-22m](https://huggingface.co/Datadog/Toto-2.0-22m) | Toto 2.0 |
| 7 | [toto-2.0-313m](https://huggingface.co/Datadog/Toto-2.0-313m) | Toto 2.0 |
| 8 | [toto-2.0-1b](https://huggingface.co/Datadog/Toto-2.0-1B) | Toto 2.0 |
| 9 | [toto-2.0-2.5b](https://huggingface.co/Datadog/Toto-2.0-2.5B) | Toto 2.0 |

Column order matters: it is tied to the booster's class indices.

## ✨ Key Features

- **Per-bucket gating:** Separate XGBoost head per `(frequency, term)` bucket. Each bucket learns its own softmax over the model pool so the ensemble can specialize without one global gate trading off across regimes.
- **No retraining at inference:** The bundle ships pre-computed base-model predictions and tsfeatures for the full GIFT-Eval test split, so replication needs neither GPUs nor the base-model libraries.
- **No leakage:** tsfeatures are computed only on the lookback context preceding each forecast window; the bundle stores dataset metadata but not ground-truth labels.

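
To make per-bucket gating concrete, here is a minimal sketch of selecting a gate from a pandas frequency string and a term. The `canonical_freq` and `select_gate` helpers, the tuple-keyed `gates` dict, and the placeholder gate names are all hypothetical; the real boosters and their key format live in `booster_manifest.json`, and the term labels below are only illustrative.

```python
def canonical_freq(freq: str) -> str:
    """Strip a pandas anchor suffix, e.g. 'W-TUE' -> 'W', 'Q-DEC' -> 'Q'."""
    return freq.split("-", 1)[0]


def select_gate(gates: dict, freq: str, term: str):
    """Return the gate registered for the (canonical frequency, term) bucket."""
    key = (canonical_freq(freq), term)
    if key not in gates:
        raise KeyError(f"no gate trained for bucket {key}")
    return gates[key]


# Hypothetical stand-ins: strings play the role of per-bucket boosters.
gates = {
    ("W", "short"): "gate_W_short",
    ("Q", "short"): "gate_Q_short",
    ("H", "long"): "gate_H_long",
}

print(select_gate(gates, "W-TUE", "short"))  # gate_W_short
```

Keying on the canonical frequency means series anchored on different weekdays or quarter ends share one gate, so each bucket is defined by sampling rate and horizon term alone.
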
## 📦 Bundle layout

```
booster_manifest.json        ~4.8 GB, base64-encoded XGBoost boosters keyed by "|"
feature_columns.json         train-time column order expected by the booster
feature_types.json           XGBoost feature_types (c = categorical, q = float)
categories.json              {"freq": [...], "domain": [...]} train-time category vocabularies
models.json                  list of model names in column order (column index ↔ model)
test_features//
  test_features.npz          (n_windows, n_tsfeatures) tsfeatures from the lookback context preceding each window
  test_metadata.npz          dataset-level scalars only (seasonality, prediction_length, num_variates, freq, domain)
test_predictions///
  test_predictions.npz       (n_windows, 9, prediction_length) quantile forecasts at QUANTILE_LEVELS = [0.1, ..., 0.9]
```

`ds_dirname` follows GIFT-Eval's canonical naming: `__` (e.g. `m4_weekly_W_short`).

## ⚡ How the booster is used

Per (dataset, term):

1. Load `test_features.npz` and `test_metadata.npz`. Reindex the tsfeatures to `feature_columns.json`; columns missing from this dataset's tsfeatures (e.g. `seasonal_strength` on yearly data) become NaN, which XGBoost handles natively. Attach scalar features (`seasonality`, `prediction_length`, `num_variates`) and categorical features (`freq`, `domain`) using the train-time categorical vocabularies in `categories.json`. *The tsfeatures are computed only on the lookback context that precedes each forecast window, so no information from the ground-truth labels is ever used at inference time.*
2. Look up the bucket booster for `(canonical_freq, term)`, where `canonical_freq` strips pandas anchor suffixes (`W-TUE` → `W`, `Q-DEC` → `Q`).
3. `booster.predict(..., output_margin=True)` returns raw class logits of shape `(n_windows, 10)`; a softmax over the model axis gives the per-window weights.
4. Stack the 10 per-model `test_predictions.npz` arrays into a `(n_windows, 10, 9, prediction_length)` tensor, then take the weighted sum across the model axis to get the final quantile forecast.
5. Score with GluonTS's `evaluate_model`, using the same call shape every other GIFT-Eval submission uses (see `evaluate_dataset` in the notebook).

## 🔁 Reproducing from scratch

Each base model's predictions were generated by running its standard GIFT-Eval notebook (`notebooks/chronos-2.ipynb`, etc.) with a wrapper that saves the per-window quantile forecasts to `test_predictions.npz` instead of passing them straight into `evaluate_model`. The notebook's "Optional B" section shows the wrapper for every pool member.

Time-series features come from the [tsfeatures](https://github.com/Nixtla/tsfeatures) library; "Optional A" in the notebook shows the per-window extraction call.

The meta-learner boosters were trained on the corresponding train-window predictions, which are not included in this bundle.

## 🔗 Additional Resources

- [Technical Report](https://arxiv.org/abs/2605.20119)
- [Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)
- [Toto 2.0 Collection](https://huggingface.co/collections/Datadog/toto-20): base Toto checkpoints (4m → 2.5B), which are what we recommend deploying
- [Toto-2.0-2.5B-FT](https://huggingface.co/Datadog/Toto-2.0-2.5B-FT): companion benchmark-only finetune
- [GIFT-Eval benchmark](https://huggingface.co/spaces/Salesforce/GIFT-Eval): leaderboard hosting this submission
- [Replication notebook](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb): fast-path scoring + optional regeneration of every artifact in this bundle
- [GitHub Repository](https://github.com/DataDog/toto)
- [BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM)

## 📖 Citation

```bibtex
@misc{khwaja2026toto20timeseries,
      title={Toto 2.0: Time Series Forecasting Enters the Scaling Era},
      author={Emaad Khwaja and Chris Lettieri and Gerald Woo and Eden Belouadah and Marc Cenac and Guillaume Jarry and Enguerrand Paquin and Xunyi Zhao and Viktoriya Zhukov and Othmane Abou-Amal and Chenghao Liu and Ameet Talwalkar and David Asker},
      year={2026},
      eprint={2605.20119},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.20119},
}
```