Datadog/Toto-2.0-Family-and-Friends

@@ -41,49 +41,42 @@ model-index:
 >
 > For real workloads, please use the base [Toto 2.0 collection](https://huggingface.co/collections/Datadog/toto-20). The base checkpoints are pretrained without any public data, generalize to every benchmark we have evaluated, and are what we recommend deploying.
----
 ## ✨ What this is
 A per-`(frequency, term)` XGBoost gate over a pool of 10 foundation models (5 Toto 2.0 sizes + 5 external models). The meta-learner consumes lightweight tsfeatures computed on the lookback context preceding each forecast window and emits a softmax over the model pool; the final forecast is the weighted sum of the 10 base-model quantile predictions. The design follows the [FFORMA](https://www.sciencedirect.com/science/article/abs/pii/S0169207019300895) framework (Montero-Manso et al., 2020).
-![GIFT-Eval bar metrics — Toto 2.0 FnF highlighted](assets/bar_metrics_gift_eval.png)
-On the full GIFT-Eval leaderboard (foundation + finetuned + ensemble + agentic), Toto-2.0-FnF takes **#1 on every metric** (tied for #1 on raw CRPS).
 The replication notebook lives in the GIFT-Eval repo at [notebooks/toto_2_0_fnf.ipynb](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb).
----
 ## 🧩 What's in the ensemble
 The Toto 2.0 family accounts for **39% of the assigned weight** across all predictions — more than any other model in the pool, ahead of Chronos-2 (32%) and more than the four remaining external models combined.
-| # | Model                                                                 | Family    |
-| - | --------------------------------------------------------------------- | --------- |
-| 0 | [chronos-2](https://huggingface.co/amazon/chronos-2)                  | Chronos   |
-| 1 | [timesfm-2.5](https://huggingface.co/google/timesfm-2.5-200m-pytorch) | TimesFM   |
-| 2 | [flowstate](https://huggingface.co/ibm-research/flowstate)            | FlowState |
-| 3 | [tirex](https://huggingface.co/NX-AI/TiRex-1.1-gifteval)              | TiRex     |
-| 4 | [patchtst-fm](https://huggingface.co/ibm-research/patchtst-fm-r1)     | PatchTST  |
-| 5 | [toto-2.0-4m](https://huggingface.co/Datadog/Toto-2.0-4m)             | Toto 2.0  |
-| 6 | [toto-2.0-22m](https://huggingface.co/Datadog/Toto-2.0-22m)           | Toto 2.0  |
-| 7 | [toto-2.0-313m](https://huggingface.co/Datadog/Toto-2.0-313m)         | Toto 2.0  |
-| 8 | [toto-2.0-1b](https://huggingface.co/Datadog/Toto-2.0-1B)             | Toto 2.0  |
-| 9 | [toto-2.0-2.5b](https://huggingface.co/Datadog/Toto-2.0-2.5B)         | Toto 2.0  |
 Order matters: each model's `#` is its class index in the booster, so the softmax columns follow this order.
----
 ## ✨ Key Features
 - **Per-bucket gating:** Separate XGBoost head per `(frequency, term)` bucket — each bucket learns its own softmax over the model pool so the ensemble can specialize without one global gate trading off across regimes.
 - **No retraining at inference:** The bundle ships pre-computed base-model predictions and tsfeatures for the full GIFT-Eval test split, so replication needs neither GPUs nor the base-model libraries.
 - **No leakage:** tsfeatures are computed only on the lookback context preceding each forecast window; the bundle stores dataset metadata but not ground-truth labels.
----
 ## 📦 Bundle layout
 ```
@@ -101,8 +94,6 @@ test_predictions/<model>/<ds_dirname>/
 `ds_dirname` follows GIFT-Eval's canonical naming: `<pretty_name>_<freq>_<term>` (e.g. `m4_weekly_W_short`).
----
 ## ⚡ How the booster is used
 Per (dataset, term):
@@ -113,35 +104,27 @@ Per (dataset, term):
 4. Stack the 10 per-model `test_predictions.npz` arrays into a `(n_windows, 10, 9, prediction_length)` tensor; weight-sum across the model axis → final quantile forecast.
 5. Score with `gluonts.model.evaluate_model`, using the same call shape as every other GIFT-Eval submission (see `evaluate_dataset` in the notebook).
----
 ## 🔁 Reproducing from scratch
 Each base model's predictions were generated by running its standard GIFT-Eval notebook (`notebooks/chronos-2.ipynb`, etc.) with a wrapper that saves the per-window quantile forecasts to `test_predictions.npz` instead of going straight into `evaluate_model`. The notebook's "Optional B" section shows the wrapper for every pool member. Time-series features come from the [tsfeatures](https://github.com/Nixtla/tsfeatures) library; "Optional A" in the notebook shows the per-window extraction call. The meta-learner boosters were trained on the corresponding train-window predictions, which are not included in this bundle.
----
 ## 🔗 Additional Resources
 - **Technical Report** — *(coming soon)*
 - [Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)
-- [Toto 2.0 Collection](https://huggingface.co/collections/Datadog/toto-20) — base Toto checkpoints (4M → 2.5B), which is what we recommend deploying
 - [Toto-2.0-2.5B-FT](https://huggingface.co/Datadog/Toto-2.0-2.5B-FT) — companion benchmark-only finetune
 - [GIFT-Eval benchmark](https://huggingface.co/spaces/Salesforce/GIFT-Eval) — leaderboard hosting this submission
 - [Replication notebook](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb) — fast-path scoring + optional regeneration of every artifact in this bundle
 - [GitHub Repository](https://github.com/DataDog/toto)
 - [BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM)
----
 ## 📖 Citation
 ```bibtex
 (citation coming soon)
 ```
----
 ## 📝 License
 Apache 2.0. Each base model retains its original license — see the linked HF repos in the model pool table.

 >
 > For real workloads, please use the base [Toto 2.0 collection](https://huggingface.co/collections/Datadog/toto-20). The base checkpoints are pretrained without any public data, generalize to every benchmark we have evaluated, and are what we recommend deploying.
 ## ✨ What this is
 A per-`(frequency, term)` XGBoost gate over a pool of 10 foundation models (5 Toto 2.0 sizes + 5 external models). The meta-learner consumes lightweight tsfeatures computed on the lookback context preceding each forecast window and emits a softmax over the model pool; the final forecast is the weighted sum of the 10 base-model quantile predictions. The design follows the [FFORMA](https://www.sciencedirect.com/science/article/abs/pii/S0169207019300895) framework (Montero-Manso et al., 2020).
+<figure>
+<img src="assets/bar_metrics_gift_eval.png" alt="GIFT-Eval bar metrics — Toto 2.0 FnF highlighted">
+<figcaption>On the full GIFT-Eval leaderboard (foundation + finetuned + ensemble + agentic), Toto-2.0-FnF takes <b>#1 on every metric</b> (tied for #1 on raw CRPS).</figcaption>
+</figure>
 The replication notebook lives in the GIFT-Eval repo at [notebooks/toto_2_0_fnf.ipynb](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb).
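The gating step described in this section (a softmax over the pool, then a weighted sum of quantile forecasts) can be sketched in a few lines. This is a minimal illustration, not the notebook's code: `softmax` and `combine_quantiles` are hypothetical helper names, the gate scores are made up, and the toy pool has 2 models and 3 quantile levels instead of 10 and 9.

```python
import math


def softmax(scores):
    """Turn raw gate scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_quantiles(weights, model_forecasts):
    """Weighted sum over the model axis.

    weights[m] is the gate weight for model m.
    model_forecasts[m][q][t] is model m's forecast for quantile level q
    at horizon step t.
    """
    n_quantiles = len(model_forecasts[0])
    horizon = len(model_forecasts[0][0])
    return [
        [
            sum(w * f[q][t] for w, f in zip(weights, model_forecasts))
            for t in range(horizon)
        ]
        for q in range(n_quantiles)
    ]


# Toy input (assumption): equal gate scores give equal weights.
weights = softmax([0.0, 0.0])
forecasts = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],  # model 0
    [[3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],  # model 1
]
ensemble = combine_quantiles(weights, forecasts)
```

With equal weights the ensemble is the per-quantile average of the two models. In the real pipeline the weights would differ per window, since the gate sees each window's features.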
 ## 🧩 What's in the ensemble
 The Toto 2.0 family accounts for **39% of the assigned weight** across all predictions — more than any other model in the pool, ahead of Chronos-2 (32%) and more than the four remaining external models combined.
+| # | Model | Family |
+|:---:|---|:---:|
+| 0 | [chronos-2](https://huggingface.co/amazon/chronos-2) | Chronos |
+| 1 | [timesfm-2.5](https://huggingface.co/google/timesfm-2.5-200m-pytorch) | TimesFM |
+| 2 | [flowstate](https://huggingface.co/ibm-research/flowstate) | FlowState |
+| 3 | [tirex](https://huggingface.co/NX-AI/TiRex-1.1-gifteval) | TiRex |
+| 4 | [patchtst-fm](https://huggingface.co/ibm-research/patchtst-fm-r1) | PatchTST |
+| 5 | [toto-2.0-4m](https://huggingface.co/Datadog/Toto-2.0-4m) | Toto 2.0 |
+| 6 | [toto-2.0-22m](https://huggingface.co/Datadog/Toto-2.0-22m) | Toto 2.0 |
+| 7 | [toto-2.0-313m](https://huggingface.co/Datadog/Toto-2.0-313m) | Toto 2.0 |
+| 8 | [toto-2.0-1b](https://huggingface.co/Datadog/Toto-2.0-1B) | Toto 2.0 |
+| 9 | [toto-2.0-2.5b](https://huggingface.co/Datadog/Toto-2.0-2.5B) | Toto 2.0 |
 Order matters: each model's `#` is its class index in the booster, so the softmax columns follow this order.
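To make the index-to-model mapping concrete, the sketch below copies the pool order from the table above and aggregates per-model weights into per-family shares. It is an assumption that a family share is the mean weight over windows (the card does not say how the 39% figure was computed), `family_share` is a hypothetical helper, and the weights are invented for illustration.

```python
# Pool order copied from the table above; list position == class index.
MODEL_POOL = [
    ("chronos-2", "Chronos"),
    ("timesfm-2.5", "TimesFM"),
    ("flowstate", "FlowState"),
    ("tirex", "TiRex"),
    ("patchtst-fm", "PatchTST"),
    ("toto-2.0-4m", "Toto 2.0"),
    ("toto-2.0-22m", "Toto 2.0"),
    ("toto-2.0-313m", "Toto 2.0"),
    ("toto-2.0-1b", "Toto 2.0"),
    ("toto-2.0-2.5b", "Toto 2.0"),
]


def family_share(weight_rows):
    """Mean weight per family, given one row of 10 model weights per window."""
    totals = {}
    for row in weight_rows:
        if len(row) != len(MODEL_POOL):
            raise ValueError("expected one weight per pool member")
        for (_, family), weight in zip(MODEL_POOL, row):
            totals[family] = totals.get(family, 0.0) + weight
    n_rows = len(weight_rows)
    return {family: total / n_rows for family, total in totals.items()}


# Invented weights for two windows (illustrative only, not real results).
rows = [
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
shares = family_share(rows)
```

Reordering `MODEL_POOL` without retraining the booster would silently assign weights to the wrong models, which is why the order is fixed.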
 ## ✨ Key Features
 - **Per-bucket gating:** Separate XGBoost head per `(frequency, term)` bucket — each bucket learns its own softmax over the model pool so the ensemble can specialize without one global gate trading off across regimes.
 - **No retraining at inference:** The bundle ships pre-computed base-model predictions and tsfeatures for the full GIFT-Eval test split, so replication needs neither GPUs nor the base-model libraries.
 - **No leakage:** tsfeatures are computed only on the lookback context preceding each forecast window; the bundle stores dataset metadata but not ground-truth labels.
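Per-bucket gating implies a lookup from each dataset to its `(frequency, term)` gate. One way to sketch that routing is to parse the `<pretty_name>_<freq>_<term>` directory name given under Bundle layout. The `bucket_key` helper and the `gates` registry are hypothetical, and the parse assumes the frequency and term tokens never contain underscores.

```python
def bucket_key(ds_dirname):
    """Split '<pretty_name>_<freq>_<term>' from the right, so underscores
    inside the pretty name are preserved."""
    pretty_name, freq, term = ds_dirname.rsplit("_", 2)
    return pretty_name, (freq, term)


# Hypothetical registry: one gate per (frequency, term) bucket.
# A string stands in for a loaded XGBoost booster.
gates = {("W", "short"): "gate for weekly / short"}

name, key = bucket_key("m4_weekly_W_short")
gate = gates[key]
```

Here `name` is `m4_weekly` and `key` is `("W", "short")`, so every weekly short-term dataset would share one gate.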
 ## 📦 Bundle layout
 ```
 `ds_dirname` follows GIFT-Eval's canonical naming: `<pretty_name>_<freq>_<term>` (e.g. `m4_weekly_W_short`).
 ## ⚡ How the booster is used
 Per (dataset, term):
 4. Stack the 10 per-model `test_predictions.npz` arrays into a `(n_windows, 10, 9, prediction_length)` tensor; weight-sum across the model axis → final quantile forecast.
 5. Score with `gluonts.model.evaluate_model`, using the same call shape as every other GIFT-Eval submission (see `evaluate_dataset` in the notebook).
 ## 🔁 Reproducing from scratch
 Each base model's predictions were generated by running its standard GIFT-Eval notebook (`notebooks/chronos-2.ipynb`, etc.) with a wrapper that saves the per-window quantile forecasts to `test_predictions.npz` instead of going straight into `evaluate_model`. The notebook's "Optional B" section shows the wrapper for every pool member. Time-series features come from the [tsfeatures](https://github.com/Nixtla/tsfeatures) library; "Optional A" in the notebook shows the per-window extraction call. The meta-learner boosters were trained on the corresponding train-window predictions, which are not included in this bundle.
 ## 🔗 Additional Resources
 - **Technical Report** — *(coming soon)*
 - [Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)
+- [Toto 2.0 Collection](https://huggingface.co/collections/Datadog/toto-20) — base Toto checkpoints (4m → 2.5B), which is what we recommend deploying
 - [Toto-2.0-2.5B-FT](https://huggingface.co/Datadog/Toto-2.0-2.5B-FT) — companion benchmark-only finetune
 - [GIFT-Eval benchmark](https://huggingface.co/spaces/Salesforce/GIFT-Eval) — leaderboard hosting this submission
 - [Replication notebook](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb) — fast-path scoring + optional regeneration of every artifact in this bundle
 - [GitHub Repository](https://github.com/DataDog/toto)
 - [BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM)
 ## 📖 Citation
 ```bibtex
 (citation coming soon)
 ```
 ## 📝 License
 Apache 2.0. Each base model retains its original license — see the linked HF repos in the model pool table.