tags:
  - time-series-forecasting
  - foundation-models
  - pretrained-models
  - time-series
  - timeseries
  - forecasting
  - observability
  - ensemble
  - meta-learning
  - gift-eval
license: apache-2.0
pipeline_tag: time-series-forecasting
thumbnail: https://corp.dd-static.net/img/about/presskit/kit/press_kit.png
model-index:
  - name: Toto-2.0-Family-and-Friends
    results:
      - task:
          type: time-series-forecasting
        dataset:
          name: GIFT-Eval
          type: GIFT-Eval
        metrics:
          - name: CRPS
            type: CRPS
            value: 0.463
          - name: MASE
            type: MASE
            value: 0.676
        source:
          name: GIFT-Eval Time Series Forecasting Leaderboard
          url: https://huggingface.co/spaces/Salesforce/GIFT-Eval

Toto 2.0 Family-and-Friends (FnF)

This is a benchmarking artifact, not a general-purpose model. Toto-2.0-FnF is an FFORMA-style XGBoost meta-learner over 10 foundation models that we submitted to the GIFT-Eval leaderboard. The bundle ships pre-computed predictions for the GIFT-Eval test split and exists to make the #1 submission fully reproducible — it cannot forecast new series without first running every base model.

For real workloads, please use the base Toto 2.0 collection. The base checkpoints are pretrained without any public data, generalize to every benchmark we have evaluated, and are what we recommend deploying.

✨ What this is

Toto-2.0-FnF is a per-(frequency, term) XGBoost gate over a pool of 10 foundation models (5 Toto 2.0 sizes + 5 external models). The meta-learner consumes lightweight tsfeatures computed for each forecast window (from the lookback context that precedes it) and emits a softmax over the model pool; the final forecast is the weighted sum of the 10 base-model quantile predictions. The design follows the FFORMA framework (Montero-Manso et al., 2020).
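In symbols, the combination described above can be written as follows. The notation is introduced here for illustration and is not taken from the technical report: $z_{w,m}$ is the gate logit for forecast window $w$ and model $m$, $\pi_{w,m}$ is the resulting weight, and $q^{(m)}_{w}(\tau, h)$ is model $m$'s forecast of quantile level $\tau$ at horizon step $h$.

```latex
\pi_{w,m} = \frac{\exp(z_{w,m})}{\sum_{j=1}^{10} \exp(z_{w,j})},
\qquad
\hat{q}_{w}(\tau, h) = \sum_{m=1}^{10} \pi_{w,m}\, q^{(m)}_{w}(\tau, h)
```

The weights are shared across all quantile levels and horizon steps of a window, so each window gets a single mixture over the pool.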

[Figure: GIFT-Eval bar metrics, with Toto 2.0 FnF highlighted]

On the full GIFT-Eval leaderboard (foundation + finetuned + ensemble + agentic), Toto-2.0-FnF takes **#1 on every metric** (tied for #1 on raw CRPS).

The replication notebook lives in the GIFT-Eval repo at notebooks/toto_2_0_fnf.ipynb.

🧩 What's in the ensemble

The Toto 2.0 family accounts for 39% of the assigned weight across all predictions — more than any other model in the pool, ahead of Chronos-2 (32%) and more than the four remaining external models combined.

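| # | Model | Family |
|---|---------------|-----------|
| 0 | chronos-2 | Chronos |
| 1 | timesfm-2.5 | TimesFM |
| 2 | flowstate | FlowState |
| 3 | tirex | TiRex |
| 4 | patchtst-fm | PatchTST |
| 5 | toto-2.0-4m | Toto 2.0 |
| 6 | toto-2.0-22m | Toto 2.0 |
| 7 | toto-2.0-313m | Toto 2.0 |
| 8 | toto-2.0-1b | Toto 2.0 |
| 9 | toto-2.0-2.5b | Toto 2.0 |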

Column order matters — it is tied to the booster's class indices.
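As a minimal sketch of why the order matters, the snippet below pairs a weight vector with model names by position. It assumes models.json is a flat JSON list in the order shown in the table; the helper names `weights_by_model` and `toto_share` and the example weights are hypothetical and made up for illustration.

```python
import json

# Assumed contents of models.json: a flat list in booster class-index order.
MODELS_JSON = json.dumps([
    "chronos-2", "timesfm-2.5", "flowstate", "tirex", "patchtst-fm",
    "toto-2.0-4m", "toto-2.0-22m", "toto-2.0-313m", "toto-2.0-1b", "toto-2.0-2.5b",
])


def weights_by_model(models, weights):
    """Pair each weight with the model at the same column index."""
    if len(models) != len(weights):
        raise ValueError("expected exactly one weight per model column")
    return dict(zip(models, weights))


def toto_share(named_weights):
    """Sum the weight assigned to the Toto 2.0 family (hypothetical helper)."""
    return sum(w for name, w in named_weights.items() if name.startswith("toto-2.0-"))


models = json.loads(MODELS_JSON)
# Illustrative weights for a single window; NOT values from the bundle.
example_weights = [0.30, 0.05, 0.05, 0.05, 0.05, 0.10, 0.10, 0.10, 0.10, 0.10]
named = weights_by_model(models, example_weights)
print(named["chronos-2"], round(toto_share(named), 2))
```

Shuffling the list would silently assign each weight to the wrong model, which is why the name list and the booster's class indices have to be kept in sync.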

✨ Key Features

- Per-bucket gating: A separate XGBoost head per (frequency, term) bucket. Each bucket learns its own softmax over the model pool, so the ensemble can specialize without one global gate trading off across regimes.
- No retraining at inference: The bundle ships pre-computed base-model predictions and tsfeatures for the full GIFT-Eval test split, so replication needs neither GPUs nor the base-model libraries.
- No leakage: tsfeatures are computed only on the lookback context preceding each forecast window; the bundle stores dataset metadata but not ground-truth labels.

📦 Bundle layout

booster_manifest.json          ~4.8 GB — base64-encoded XGBoost boosters keyed by "<canonical_freq>|<term>"
feature_columns.json           train-time column order expected by the booster
feature_types.json             XGBoost feature_types (c = categorical, q = float)
categories.json                {"freq": [...], "domain": [...]} train-time category vocabularies
models.json                    list of model names in column order (column index ↔ model)
test_features/<ds_dirname>/
  test_features.npz            (n_windows, n_tsfeatures) tsfeatures from the lookback context preceding each window
  test_metadata.npz            dataset-level scalars only (seasonality, prediction_length, num_variates, freq, domain)
test_predictions/<model>/<ds_dirname>/
  test_predictions.npz         (n_windows, 9, prediction_length) quantile forecasts at QUANTILE_LEVELS = [0.1, ..., 0.9]

ds_dirname follows GIFT-Eval's canonical naming: <pretty_name>_<freq>_<term> (e.g. m4_weekly_W_short).
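A small sketch of how the naming convention and the layout above fit together. It assumes that neither the frequency nor the term contains an underscore (the pretty name may); the function names and the `bundle` root are hypothetical.

```python
from pathlib import PurePosixPath


def parse_ds_dirname(ds_dirname):
    """Split '<pretty_name>_<freq>_<term>' from the right.

    Assumption: freq and term contain no underscores; pretty_name may.
    """
    parts = ds_dirname.rsplit("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected ds_dirname: {ds_dirname!r}")
    pretty_name, freq, term = parts
    return pretty_name, freq, term


def features_path(root, ds_dirname):
    """Path to the tsfeatures file for one dataset directory."""
    return PurePosixPath(root) / "test_features" / ds_dirname / "test_features.npz"


def predictions_path(root, model, ds_dirname):
    """Path to one base model's quantile forecasts for one dataset directory."""
    return (
        PurePosixPath(root) / "test_predictions" / model / ds_dirname
        / "test_predictions.npz"
    )


print(parse_ds_dirname("m4_weekly_W_short"))
print(predictions_path("bundle", "chronos-2", "m4_weekly_W_short"))
```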

⚡ How the booster is used

Per (dataset, term):

1. Load test_features.npz and test_metadata.npz. Reindex the tsfeatures to feature_columns.json; columns missing in this dataset's tsfeatures (e.g. seasonal_strength on yearly data) become NaN, which XGBoost handles natively. Attach scalar features (seasonality, prediction_length, num_variates) and categorical features (freq, domain) using the train-time categorical vocabularies in categories.json. The tsfeatures are computed only on the lookback context that precedes each forecast window, so no information from the ground-truth labels is ever used at inference time.
2. Look up the bucket booster for (canonical_freq, term), where canonical_freq strips pandas anchor suffixes (W-TUE → W, Q-DEC → Q).
3. Call booster.predict(..., output_margin=True), which returns raw class logits of shape (n_windows, 10); a softmax over the model axis gives the per-window weights.
4. Stack the 10 per-model test_predictions.npz arrays into a (n_windows, 10, 9, prediction_length) tensor and take the weighted sum across the model axis to get the final quantile forecast.
5. Score with GluonTS's evaluate_model (gluonts.model.evaluate_model) using the same call shape every other GIFT-Eval submission uses (see evaluate_dataset in the notebook).
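The reindexing, bucket lookup, softmax, and weighted-sum steps above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the notebook's code: the function names are hypothetical, random arrays stand in for the real logits and base-model forecasts, and the actual canonicalisation in the notebook may handle more cases than a simple split on the hyphen.

```python
import numpy as np


def canonical_freq(freq):
    """Strip a pandas anchor suffix, e.g. 'W-TUE' -> 'W', 'Q-DEC' -> 'Q'."""
    return freq.split("-", 1)[0]


def booster_key(freq, term):
    """Build a manifest key of the form '<canonical_freq>|<term>'."""
    return f"{canonical_freq(freq)}|{term}"


def reindex_features(values, present_columns, expected_columns):
    """Reorder columns to the train-time order; absent columns become NaN."""
    out = np.full((values.shape[0], len(expected_columns)), np.nan)
    position = {name: i for i, name in enumerate(present_columns)}
    for j, name in enumerate(expected_columns):
        if name in position:
            out[:, j] = values[:, position[name]]
    return out


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def combine(logits, stacked):
    """Weighted sum over the model axis.

    logits:  (n_windows, n_models) raw class margins
    stacked: (n_windows, n_models, n_quantiles, prediction_length)
    returns: (n_windows, n_quantiles, prediction_length)
    """
    weights = softmax(logits, axis=1)
    return np.einsum("wm,wmqh->wqh", weights, stacked)


rng = np.random.default_rng(0)
# Stand-in for booster.predict(..., output_margin=True): 4 windows, 10 models.
logits = rng.normal(size=(4, 10))
# Stand-ins for the 10 per-model arrays of shape (n_windows, 9, prediction_length).
per_model = [rng.normal(size=(4, 9, 6)) for _ in range(10)]
stacked = np.stack(per_model, axis=1)  # (4, 10, 9, 6)
forecast = combine(logits, stacked)
print(booster_key("W-TUE", "short"), forecast.shape)
```

One caveat of this construction: a convex combination of quantile forecasts is applied level by level, so the combined quantiles stay ordered only if every base model's quantiles are ordered.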

🔁 Reproducing from scratch

Each base model's predictions were generated by running its standard GIFT-Eval notebook (notebooks/chronos-2.ipynb, etc.) with a wrapper that saves the per-window quantile forecasts to test_predictions.npz instead of going straight into evaluate_model. The notebook's "Optional B" section shows the wrapper for every pool member. Time-series features come from the tsfeatures library; "Optional A" in the notebook shows the per-window extraction call. The meta-learner boosters were trained on the corresponding train-window predictions, which are not included in this bundle.
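The save step of such a wrapper might look roughly like the sketch below. It is not the notebook's "Optional B" code: the function name `save_quantile_forecasts` and the array key `predictions` are hypothetical, since the bundle's actual key inside test_predictions.npz is not stated here, and random numbers stand in for a base model's output.

```python
import os
import tempfile

import numpy as np

QUANTILE_LEVELS = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1, ..., 0.9


def save_quantile_forecasts(path, forecasts, prediction_length):
    """Write (n_windows, 9, prediction_length) quantile forecasts to an .npz file."""
    forecasts = np.asarray(forecasts, dtype=np.float32)
    expected_tail = (len(QUANTILE_LEVELS), prediction_length)
    if forecasts.ndim != 3 or forecasts.shape[1:] != expected_tail:
        raise ValueError(
            f"expected shape (n_windows, {expected_tail[0]}, {expected_tail[1]}), "
            f"got {forecasts.shape}"
        )
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # "predictions" is a hypothetical key name chosen for this sketch.
    np.savez_compressed(path, predictions=forecasts)


with tempfile.TemporaryDirectory() as root:
    path = os.path.join(
        root, "test_predictions", "chronos-2", "m4_weekly_W_short",
        "test_predictions.npz",
    )
    fake = np.random.default_rng(0).normal(size=(3, 9, 4))  # stand-in forecasts
    save_quantile_forecasts(path, fake, prediction_length=4)
    with np.load(path) as data:
        loaded_shape = data["predictions"].shape
print(loaded_shape)
```

Validating the shape at write time keeps a malformed array from surfacing later as a broadcasting error when the per-model files are stacked.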

🔗 Additional Resources

- Technical Report
- Blog Post
- Toto 2.0 Collection: base Toto checkpoints (4M → 2.5B), which are what we recommend deploying
- Toto-2.0-2.5B-FT: companion benchmark-only finetune
- GIFT-Eval benchmark: leaderboard hosting this submission
- Replication notebook: fast-path scoring + optional regeneration of every artifact in this bundle
- GitHub Repository
- BOOM Dataset

📖 Citation

@misc{khwaja2026toto20timeseries,
      title={Toto 2.0: Time Series Forecasting Enters the Scaling Era}, 
      author={Emaad Khwaja and Chris Lettieri and Gerald Woo and Eden Belouadah and Marc Cenac and Guillaume Jarry and Enguerrand Paquin and Xunyi Zhao and Viktoriya Zhukov and Othmane Abou-Amal and Chenghao Liu and Ameet Talwalkar and David Asker},
      year={2026},
      eprint={2605.20119},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.20119}, 
}