---
tags:
- time-series-forecasting
- foundation-models
- pretrained-models
- time-series
- timeseries
- forecasting
- observability
- ensemble
- meta-learning
- gift-eval
license: apache-2.0
pipeline_tag: time-series-forecasting
thumbnail: https://corp.dd-static.net/img/about/presskit/kit/press_kit.png
model-index:
- name: Toto-2.0-Family-and-Friends
  results:
    - task:
        type: time-series-forecasting
      dataset:
        name: GIFT-Eval
        type: GIFT-Eval
      metrics:
        - name: CRPS
          type: CRPS
          value: 0.463
        - name: MASE
          type: MASE
          value: 0.676
      source:
        name: GIFT-Eval Time Series Forecasting Leaderboard
        url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
---

# Toto 2.0 Family-and-Friends (FnF)

> [!WARNING]
> **This is a benchmarking artifact, not a general-purpose model.**
> Toto-2.0-FnF is an FFORMA-style XGBoost meta-learner over 10 foundation models that we submitted to the GIFT-Eval leaderboard. The bundle ships pre-computed predictions for the GIFT-Eval test split and exists to make the **#1** submission fully reproducible; it cannot forecast new series without first running every base model.
>
> For real workloads, please use the base [Toto 2.0 collection](https://huggingface.co/collections/Datadog/toto-20). The base checkpoints are pretrained without any public data, generalize to every benchmark we have evaluated, and are what we recommend deploying.

## ✨ What this is

A per-`(frequency, term)` XGBoost gate over a pool of 10 foundation models (5 Toto 2.0 sizes + 5 external models). The meta-learner consumes lightweight tsfeatures computed on the lookback context of each forecast window and emits a softmax over the model pool; the final forecast is the weighted sum of the 10 base-model quantile predictions. The design follows the [FFORMA](https://www.sciencedirect.com/science/article/abs/pii/S0169207019300895) framework (Montero-Manso et al., 2020).

<figure>
<img src="assets/bar_metrics_gift_eval.png" alt="GIFT-Eval bar metrics, Toto 2.0 FnF highlighted">
<figcaption>On the full GIFT-Eval leaderboard (foundation + finetuned + ensemble + agentic), Toto-2.0-FnF takes <b>#1 on every metric</b> (tied for #1 on raw CRPS).</figcaption>
</figure>

The replication notebook lives in the GIFT-Eval repo at [notebooks/toto_2_0_fnf.ipynb](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb).

## 🧩 What's in the ensemble

The Toto 2.0 family accounts for **39% of the assigned weight** across all predictions, more than any other model family in the pool: it is ahead of Chronos-2 (32%) and outweighs the four remaining external models combined.

| # | Model | Family |
|:---:|---|:---:|
| 0 | [chronos-2](https://huggingface.co/amazon/chronos-2) | Chronos |
| 1 | [timesfm-2.5](https://huggingface.co/google/timesfm-2.5-200m-pytorch) | TimesFM |
| 2 | [flowstate](https://huggingface.co/ibm-research/flowstate) | FlowState |
| 3 | [tirex](https://huggingface.co/NX-AI/TiRex-1.1-gifteval) | TiRex |
| 4 | [patchtst-fm](https://huggingface.co/ibm-research/patchtst-fm-r1) | PatchTST |
| 5 | [toto-2.0-4m](https://huggingface.co/Datadog/Toto-2.0-4m) | Toto 2.0 |
| 6 | [toto-2.0-22m](https://huggingface.co/Datadog/Toto-2.0-22m) | Toto 2.0 |
| 7 | [toto-2.0-313m](https://huggingface.co/Datadog/Toto-2.0-313m) | Toto 2.0 |
| 8 | [toto-2.0-1b](https://huggingface.co/Datadog/Toto-2.0-1B) | Toto 2.0 |
| 9 | [toto-2.0-2.5b](https://huggingface.co/Datadog/Toto-2.0-2.5B) | Toto 2.0 |

Column order matters: the `#` index above is each model's position in `models.json`, and it is tied to the booster's class indices.
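As a rough illustration of how a per-family weight share (such as the 39% figure above) could be derived from per-window gate weights, the sketch below averages the weights over windows and totals them by family. The `FAMILIES` list mirrors the table; the `family_share` helper, its input format, and the toy weights are hypothetical and not part of the bundle.

```python
from collections import defaultdict

# Family of each class index, in the same order as the table above.
FAMILIES = ["Chronos", "TimesFM", "FlowState", "TiRex", "PatchTST"] + ["Toto 2.0"] * 5


def family_share(weights):
    """Average per-window softmax weights and total them per family.

    `weights` is a list of rows, one per forecast window, each holding
    10 non-negative weights that sum to 1 (hypothetical input format).
    """
    n_windows = len(weights)
    share = defaultdict(float)
    for row in weights:
        if len(row) != len(FAMILIES):
            raise ValueError("expected one weight per pool member")
        for class_index, weight in enumerate(row):
            share[FAMILIES[class_index]] += weight / n_windows
    return dict(share)


# Toy example with made-up weights for two windows.
toy_weights = [
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.1, 0.1],
    [0.0, 0.2, 0.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0],
]
shares = family_share(toy_weights)
```

Because the mapping is purely positional, reordering the model list without retraining would silently assign weights to the wrong models.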

## ✨ Key Features

- **Per-bucket gating:** A separate XGBoost head per `(frequency, term)` bucket; each bucket learns its own softmax over the model pool, so the ensemble can specialize without one global gate trading off across regimes.
- **No retraining at inference:** The bundle ships pre-computed base-model predictions and tsfeatures for the full GIFT-Eval test split, so replication needs neither GPUs nor the base-model libraries.
- **No leakage:** tsfeatures are computed only on the lookback context preceding each forecast window; the bundle stores dataset metadata but not ground-truth labels.

## 📦 Bundle layout

```
booster_manifest.json          ~4.8 GB, base64-encoded XGBoost boosters keyed by "<canonical_freq>|<term>"
feature_columns.json           train-time column order expected by the booster
feature_types.json             XGBoost feature_types (c = categorical, q = float)
categories.json                {"freq": [...], "domain": [...]} train-time category vocabularies
models.json                    list of model names in column order (column index ↔ model)
test_features/<ds_dirname>/
  test_features.npz            (n_windows, n_tsfeatures) tsfeatures from the lookback context preceding each window
  test_metadata.npz            dataset-level scalars only (seasonality, prediction_length, num_variates, freq, domain)
test_predictions/<model>/<ds_dirname>/
  test_predictions.npz         (n_windows, 9, prediction_length) quantile forecasts at QUANTILE_LEVELS = [0.1, ..., 0.9]
```

`ds_dirname` follows GIFT-Eval's canonical naming: `<pretty_name>_<freq>_<term>` (e.g. `m4_weekly_W_short`).
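A minimal sketch of how the paths in this layout could be assembled. The helper names (`ds_dirname`, `bundle_paths`) and the `"bundle"` root are hypothetical, and the sketch assumes each `<model>` directory is named after the model names used in this card.

```python
from pathlib import Path


def ds_dirname(pretty_name: str, freq: str, term: str) -> str:
    """Build the GIFT-Eval style directory name <pretty_name>_<freq>_<term>."""
    return f"{pretty_name}_{freq}_{term}"


def bundle_paths(root, model: str, dirname: str) -> dict:
    """Return the three per-dataset files described in the layout above."""
    root = Path(root)
    return {
        "features": root / "test_features" / dirname / "test_features.npz",
        "metadata": root / "test_features" / dirname / "test_metadata.npz",
        "predictions": root / "test_predictions" / model / dirname / "test_predictions.npz",
    }


# "bundle" is a placeholder for wherever the repository was downloaded.
paths = bundle_paths("bundle", "toto-2.0-4m", ds_dirname("m4_weekly", "W", "short"))
```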

## ⚡ How the booster is used

Per (dataset, term):

1. Load `test_features.npz` and `test_metadata.npz`. Reindex the tsfeatures to `feature_columns.json`; columns missing from this dataset's tsfeatures (e.g. `seasonal_strength` on yearly data) become NaN, which XGBoost handles natively. Attach scalar features (`seasonality`, `prediction_length`, `num_variates`) and categorical features (`freq`, `domain`) using the train-time categorical vocabularies in `categories.json`. *The tsfeatures are computed only on the lookback context that precedes each forecast window, so no information from the ground-truth labels is ever used at inference time.*
2. Look up the bucket booster for `(canonical_freq, term)`, where `canonical_freq` strips pandas anchor suffixes (`W-TUE` → `W`, `Q-DEC` → `Q`).
3. `booster.predict(..., output_margin=True)` returns raw class logits of shape `(n_windows, 10)`; softmax over the model axis gives the per-window weights.
4. Stack the 10 per-model `test_predictions.npz` arrays into a `(n_windows, 10, 9, prediction_length)` tensor; the weighted sum across the model axis gives the final quantile forecast.
5. Score with GluonTS's `evaluate_model` (`gluonts.model.evaluate_model`) using the same call shape every other GIFT-Eval submission uses (see `evaluate_dataset` in the notebook).
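Steps 1 to 4 can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the notebook's code: all function names are hypothetical, and random arrays stand in for the output of `booster.predict(..., output_margin=True)` and for the stacked `test_predictions.npz` contents.

```python
import numpy as np


def canonical_freq(freq: str) -> str:
    """Strip a pandas anchor suffix, e.g. "W-TUE" -> "W"."""
    return freq.split("-", 1)[0]


def bucket_key(freq: str, term: str) -> str:
    """Key format used by booster_manifest.json: "<canonical_freq>|<term>"."""
    return f"{canonical_freq(freq)}|{term}"


def reindex_features(values, present_columns, feature_columns):
    """Reorder columns to the train-time order; missing columns become NaN."""
    out = np.full((values.shape[0], len(feature_columns)), np.nan)
    position = {name: j for j, name in enumerate(present_columns)}
    for j, name in enumerate(feature_columns):
        if name in position:
            out[:, j] = values[:, position[name]]
    return out


def softmax(logits, axis):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def combine(logits, stacked_predictions):
    """Weight-sum base-model quantiles.

    logits:              (n_windows, n_models) raw class margins
    stacked_predictions: (n_windows, n_models, n_quantiles, prediction_length)
    returns:             (n_windows, n_quantiles, prediction_length)
    """
    weights = softmax(logits, axis=1)
    return np.einsum("wm,wmqh->wqh", weights, stacked_predictions)


# Stand-in data: 4 windows, 10 models, 9 quantiles, horizon 6.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
preds = rng.normal(size=(4, 10, 9, 6))
forecast = combine(logits, preds)
```

Since the weights are non-negative and sum to one per window, the combined quantiles stay ordered whenever each base model's quantiles are ordered.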

## 🔁 Reproducing from scratch

Each base model's predictions were generated by running its standard GIFT-Eval notebook (`notebooks/chronos-2.ipynb`, etc.) with a wrapper that saves the per-window quantile forecasts to `test_predictions.npz` instead of going straight into `evaluate_model`. The notebook's "Optional B" section shows the wrapper for every pool member. Time-series features come from the [tsfeatures](https://github.com/Nixtla/tsfeatures) library; "Optional A" in the notebook shows the per-window extraction call. The meta-learner boosters were trained on the corresponding train-window predictions, which are not included in this bundle.
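For orientation, a wrapper of this kind might look like the sketch below. It assumes a base model that returns sample paths, which are reduced to the nine quantile levels before saving. The function names and the `predictions` array key are hypothetical; the notebook's "Optional B" section is the authoritative reference.

```python
import io

import numpy as np

# Quantile levels stored in test_predictions.npz: 0.1, 0.2, ..., 0.9.
QUANTILE_LEVELS = np.round(np.arange(1, 10) / 10, 1)


def samples_to_quantiles(samples):
    """(n_windows, n_samples, prediction_length) -> (n_windows, 9, prediction_length)."""
    q = np.quantile(samples, QUANTILE_LEVELS, axis=1)  # (9, n_windows, prediction_length)
    return np.moveaxis(q, 0, 1)


def save_predictions(target, quantiles):
    """Save per-window quantile forecasts; `target` is a path or file-like object."""
    if quantiles.ndim != 3 or quantiles.shape[1] != len(QUANTILE_LEVELS):
        raise ValueError("expected shape (n_windows, 9, prediction_length)")
    # The key name "predictions" is an assumption, not the bundle's documented schema.
    np.savez_compressed(target, predictions=quantiles)


# Round trip through an in-memory buffer instead of a real file.
rng = np.random.default_rng(0)
samples = rng.normal(size=(3, 50, 5))
quantiles = samples_to_quantiles(samples)
buffer = io.BytesIO()
save_predictions(buffer, quantiles)
buffer.seek(0)
reloaded = np.load(buffer)["predictions"]
```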

## 🔗 Additional Resources

- [Technical Report](https://arxiv.org/abs/2605.20119)
- [Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)
- [Toto 2.0 Collection](https://huggingface.co/collections/Datadog/toto-20): base Toto checkpoints (4m → 2.5B), which are what we recommend deploying
- [Toto-2.0-2.5B-FT](https://huggingface.co/Datadog/Toto-2.0-2.5B-FT): companion benchmark-only finetune
- [GIFT-Eval benchmark](https://huggingface.co/spaces/Salesforce/GIFT-Eval): leaderboard hosting this submission
- [Replication notebook](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb): fast-path scoring + optional regeneration of every artifact in this bundle
- [GitHub Repository](https://github.com/DataDog/toto)
- [BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM)

## 📖 Citation

```bibtex
@misc{khwaja2026toto20timeseries,
      title={Toto 2.0: Time Series Forecasting Enters the Scaling Era}, 
      author={Emaad Khwaja and Chris Lettieri and Gerald Woo and Eden Belouadah and Marc Cenac and Guillaume Jarry and Enguerrand Paquin and Xunyi Zhao and Viktoriya Zhukov and Othmane Abou-Amal and Chenghao Liu and Ameet Talwalkar and David Asker},
      year={2026},
      eprint={2605.20119},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.20119}, 
}
```