Refresh model card: benchmark-only CTA, bar metrics hero, ensemble weight share callout

by Emaad - opened 8 days ago

base: refs/heads/main ← from: refs/pr/1

+65

-45

Files changed (3)

.gitattributes +1 -0
README.md +61 -45
assets/bar_metrics_gift_eval.png +3 -0

.gitattributes CHANGED

@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 booster_manifest.json filter=lfs diff=lfs merge=lfs -text

+assets/bar_metrics_gift_eval.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -6,52 +6,76 @@ tags:
 - time-series
 - timeseries
 - forecasting
 - ensemble
 - meta-learning
 - gift-eval
-- observability
 license: apache-2.0
 pipeline_tag: time-series-forecasting
 thumbnail: https://corp.dd-static.net/img/about/presskit/kit/press_kit.png
-language:
-- en
-pretty_name: Toto 2.0 Family and Friends — GIFT-Eval artifacts
 ---
-# Toto 2.0 Family and Friends — GIFT-Eval artifact bundle
-Pre-computed artifacts for replicating the **Toto 2.0 Family and Friends** (short form: **Toto-2.0-FnF**) submission to the [GIFT-Eval benchmark](https://huggingface.co/spaces/Salesforce/GIFT-Eval). The ensemble is an FFORMA-style ([Montero-Manso et al., *International Journal of Forecasting*, 2020](https://www.sciencedirect.com/science/article/abs/pii/S0169207019300895)) meta-learner that gates a pool of foundation models on a per-(frequency, term) bucket basis using XGBoost over time-series features.
-The replication notebook lives in the GIFT-Eval repo at [`notebooks/toto_2_0_fnf.ipynb`](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb).
----
-## ✨ Key Features
-- **Per-bucket gating**: Separate XGBoost head per `(frequency, term)` bucket — each bucket learns its own softmax over the model pool so the ensemble can specialize without one global gate trading off across regimes.
-- **No retraining at inference**: The bundle ships pre-computed base-model predictions and tsfeatures for the full GIFT-Eval test split, so replication needs neither GPUs nor the base-model libraries.
-- **No leakage**: tsfeatures are computed only on the lookback context preceding each forecast window; the bundle stores dataset metadata but not ground-truth labels.
----
-## 🧩 Model pool
-The meta-learner outputs softmax weights over 10 foundation models (column order matters — it is tied to the booster's class indices):
 | # | Model | Family |
-|---|-------|--------|
-| 0 | [`chronos-2`](https://huggingface.co/amazon/chronos-2) | Chronos |
-| 1 | [`timesfm-2.5`](https://huggingface.co/google/timesfm-2.5-200m-pytorch) | TimesFM |
-| 2 | [`flowstate`](https://huggingface.co/ibm-research/flowstate) | FlowState |
-| 3 | [`tirex`](https://huggingface.co/NX-AI/TiRex-1.1-gifteval) | TiRex |
-| 4 | [`patchtst-fm`](https://huggingface.co/ibm-research/patchtst-fm-r1) | PatchTST |
-| 5 | [`toto-2.0-4m`](https://huggingface.co/Datadog/Toto-2.0-4m) | Toto 2.0 |
-| 6 | [`toto-2.0-22m`](https://huggingface.co/Datadog/Toto-2.0-22m) | Toto 2.0 |
-| 7 | [`toto-2.0-313m`](https://huggingface.co/Datadog/Toto-2.0-313m) | Toto 2.0 |
-| 8 | [`toto-2.0-1b`](https://huggingface.co/Datadog/Toto-2.0-1b) | Toto 2.0 |
-| 9 | [`toto-2.0-2.5b`](https://huggingface.co/Datadog/Toto-2.0-2.5b) | Toto 2.0 |
----
 ## 📦 Bundle layout
@@ -70,8 +94,6 @@ test_predictions/<model>/<ds_dirname>/
 `ds_dirname` follows GIFT-Eval's canonical naming: `<pretty_name>_<freq>_<term>` (e.g. `m4_weekly_W_short`).
----
 ## ⚡ How the booster is used
 Per (dataset, term):
@@ -82,24 +104,20 @@ Per (dataset, term):
 4. Stack the 10 per-model `test_predictions.npz` arrays into a `(n_windows, 10, 9, prediction_length)` tensor; weight-sum across the model axis → final quantile forecast.
 5. Score with `gluonts.evaluate_model` using the same call shape every other GIFT-Eval submission uses (see `evaluate_dataset` in the notebook).
----
 ## 🔁 Reproducing from scratch
-Each base model's predictions were generated by running its standard GIFT-Eval notebook (`notebooks/chronos-2.ipynb`, etc.) with a wrapper that saves the per-window quantile forecasts to `test_predictions.npz` instead of going straight into `evaluate_model`. The notebook's "Optional B" section shows the wrapper for every pool member. Time-series features come from the [`tsfeatures`](https://github.com/Nixtla/tsfeatures) library; "Optional A" in the notebook shows the per-window extraction call. The meta-learner boosters were trained on the corresponding train-window predictions, which are not included in this bundle.
----
 ## 🔗 Additional Resources
-- **[GIFT-Eval benchmark](https://huggingface.co/spaces/Salesforce/GIFT-Eval)** — leaderboard hosting this submission
-- **[Replication notebook](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb)** — fast-path scoring + optional regeneration of every artifact in this bundle
-- **[Toto 2.0 family](https://huggingface.co/Datadog/Toto-2.0-1B)** — base Toto checkpoints (4M → 2.5B)
-- **[Toto GitHub repository](https://github.com/DataDog/toto)** — Toto 2.0 source code
-- **[BOOM dataset](https://huggingface.co/datasets/Datadog/BOOM)** — Datadog's observability time-series benchmark
-- **[Datadog blog post](https://www.datadoghq.com/blog/ai/toto-2/)** — Toto 2.0 announcement
----
 ## 📖 Citation
@@ -107,8 +125,6 @@ Each base model's predictions were generated by running its standard GIFT-Eval n
 (citation coming soon)
 ```
----
 ## 📝 License
 Apache 2.0. Each base model retains its original license — see the linked HF repos in the model pool table.

 - time-series
 - timeseries
 - forecasting
+- observability
 - ensemble
 - meta-learning
 - gift-eval
 license: apache-2.0
 pipeline_tag: time-series-forecasting
 thumbnail: https://corp.dd-static.net/img/about/presskit/kit/press_kit.png
+model-index:
+- name: Toto-2.0-Family-and-Friends
+  results:
+    - task:
+        type: time-series-forecasting
+      dataset:
+        name: GIFT-Eval
+        type: GIFT-Eval
+      metrics:
+        - name: CRPS
+          type: CRPS
+          value: 0.463
+        - name: MASE
+          type: MASE
+          value: 0.676
+      source:
+        name: GIFT-Eval Time Series Forecasting Leaderboard
+        url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
 ---
+# Toto 2.0 Family-and-Friends (FnF)
+> [!WARNING]
+> **This is a benchmarking artifact, not a general-purpose model.**
+> Toto-2.0-FnF is an FFORMA-style XGBoost meta-learner over 10 foundation models that we submitted to the GIFT-Eval leaderboard. The bundle ships pre-computed predictions for the GIFT-Eval test split and exists to make the **#1** submission fully reproducible — it cannot forecast new series without first running every base model.
+>
+> For real workloads, please use the base [Toto 2.0 collection](https://huggingface.co/collections/Datadog/toto-20). The base checkpoints are pretrained without any public data, generalize to every benchmark we have evaluated, and are what we recommend deploying.
+## ✨ What this is
+A per-`(frequency, term)` XGBoost gate over a pool of 10 foundation models (5 Toto 2.0 sizes + 5 external models). The meta-learner consumes lightweight tsfeatures from each forecast window and emits a softmax over the model pool; the final forecast is the weighted sum of the 10 base-model quantile predictions. The design follows the [FFORMA](https://www.sciencedirect.com/science/article/abs/pii/S0169207019300895) framework (Montero-Manso et al., 2020).
+<figure>
+<img src="assets/bar_metrics_gift_eval.png" alt="GIFT-Eval bar metrics — Toto 2.0 FnF highlighted">
+<figcaption>On the full GIFT-Eval leaderboard (foundation + finetuned + ensemble + agentic), Toto-2.0-FnF takes <b>#1 on every metric</b> (tied for #1 on raw CRPS).</figcaption>
+</figure>
+The replication notebook lives in the GIFT-Eval repo at [notebooks/toto_2_0_fnf.ipynb](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb).
+## 🧩 What's in the ensemble
+The Toto 2.0 family accounts for **39% of the assigned weight** across all predictions, the largest share of any model family in the pool: ahead of Chronos-2 (32%) and more than the four remaining external models combined.
 | # | Model | Family |
+|:---:|---|:---:|
+| 0 | [chronos-2](https://huggingface.co/amazon/chronos-2) | Chronos |
+| 1 | [timesfm-2.5](https://huggingface.co/google/timesfm-2.5-200m-pytorch) | TimesFM |
+| 2 | [flowstate](https://huggingface.co/ibm-research/flowstate) | FlowState |
+| 3 | [tirex](https://huggingface.co/NX-AI/TiRex-1.1-gifteval) | TiRex |
+| 4 | [patchtst-fm](https://huggingface.co/ibm-research/patchtst-fm-r1) | PatchTST |
+| 5 | [toto-2.0-4m](https://huggingface.co/Datadog/Toto-2.0-4m) | Toto 2.0 |
+| 6 | [toto-2.0-22m](https://huggingface.co/Datadog/Toto-2.0-22m) | Toto 2.0 |
+| 7 | [toto-2.0-313m](https://huggingface.co/Datadog/Toto-2.0-313m) | Toto 2.0 |
+| 8 | [toto-2.0-1b](https://huggingface.co/Datadog/Toto-2.0-1B) | Toto 2.0 |
+| 9 | [toto-2.0-2.5b](https://huggingface.co/Datadog/Toto-2.0-2.5B) | Toto 2.0 |
+Order matters: the `#` index is each model's column in the softmax weights and is tied to the booster's class indices.
+## ✨ Key Features
+- **Per-bucket gating:** Separate XGBoost head per `(frequency, term)` bucket — each bucket learns its own softmax over the model pool so the ensemble can specialize without one global gate trading off across regimes.
+- **No retraining at inference:** The bundle ships pre-computed base-model predictions and tsfeatures for the full GIFT-Eval test split, so replication needs neither GPUs nor the base-model libraries.
+- **No leakage:** tsfeatures are computed only on the lookback context preceding each forecast window; the bundle stores dataset metadata but not ground-truth labels.
 ## 📦 Bundle layout
 `ds_dirname` follows GIFT-Eval's canonical naming: `<pretty_name>_<freq>_<term>` (e.g. `m4_weekly_W_short`).
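
As an illustration of the naming rule above, here is a minimal sketch that splits a `ds_dirname` into its three parts. The `DatasetKey` type and `parse_ds_dirname` helper are hypothetical names introduced for this example and are not part of the bundle or the notebook. The sketch assumes that `<freq>` and `<term>` never contain underscores, so only `<pretty_name>` may.

```python
from typing import NamedTuple


class DatasetKey(NamedTuple):
    """Hypothetical container for the three parts of a ds_dirname."""
    pretty_name: str
    freq: str
    term: str


def parse_ds_dirname(ds_dirname: str) -> DatasetKey:
    """Split '<pretty_name>_<freq>_<term>' from the right.

    Assumption: freq and term contain no underscores, so splitting on the
    last two underscores leaves any underscores inside pretty_name intact.
    """
    parts = ds_dirname.rsplit("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected ds_dirname: {ds_dirname!r}")
    return DatasetKey(*parts)


key = parse_ds_dirname("m4_weekly_W_short")
# The (freq, term) pair is the bucket that selects a booster head.
bucket = (key.freq, key.term)
print(key, bucket)
```

Splitting from the right keeps dataset names such as `m4_weekly` whole, which a plain `split("_")` would not.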
 ## ⚡ How the booster is used
 Per (dataset, term):
 4. Stack the 10 per-model `test_predictions.npz` arrays into a `(n_windows, 10, 9, prediction_length)` tensor; weight-sum across the model axis → final quantile forecast.
 5. Score with `gluonts.evaluate_model` using the same call shape every other GIFT-Eval submission uses (see `evaluate_dataset` in the notebook).
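
Step 4 can be sketched in NumPy as follows. This is a minimal illustration and not the notebook's code: the function name `combine_quantile_forecasts` is hypothetical, the arrays are random stand-ins for the real `test_predictions.npz` contents, and the sketch assumes one softmax weight vector per forecast window.

```python
import numpy as np


def combine_quantile_forecasts(preds: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weight-sum stacked base-model forecasts across the model axis.

    preds:   (n_windows, n_models, n_quantiles, prediction_length)
    weights: (n_windows, n_models), each row summing to 1 (assumed layout)
    returns: (n_windows, n_quantiles, prediction_length)
    """
    preds = np.asarray(preds, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if preds.ndim != 4 or weights.shape != preds.shape[:2]:
        raise ValueError("expected preds (w, m, q, h) and weights (w, m)")
    if not np.allclose(weights.sum(axis=1), 1.0):
        raise ValueError("each row of weights must sum to 1")
    # Contract over the model axis m; keep window, quantile and horizon axes.
    return np.einsum("wm,wmqh->wqh", weights, preds)


# Synthetic stand-ins: 3 windows, 10 models, 9 quantiles, horizon 4.
rng = np.random.default_rng(0)
preds = rng.normal(size=(3, 10, 9, 4))
logits = rng.normal(size=(3, 10))
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
weights = exp / exp.sum(axis=1, keepdims=True)  # row-wise softmax

forecast = combine_quantile_forecasts(preds, weights)
print(forecast.shape)
```

The model axis of `preds` has to follow the same order as the weight columns, which is why the pool order in the table is fixed.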
 ## 🔁 Reproducing from scratch
+Each base model's predictions were generated by running its standard GIFT-Eval notebook (`notebooks/chronos-2.ipynb`, etc.) with a wrapper that saves the per-window quantile forecasts to `test_predictions.npz` instead of going straight into `evaluate_model`. The notebook's "Optional B" section shows the wrapper for every pool member. Time-series features come from the [tsfeatures](https://github.com/Nixtla/tsfeatures) library; "Optional A" in the notebook shows the per-window extraction call. The meta-learner boosters were trained on the corresponding train-window predictions, which are not included in this bundle.
 ## 🔗 Additional Resources
+- **Technical Report** — *(coming soon)*
+- [Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)
+- [Toto 2.0 Collection](https://huggingface.co/collections/Datadog/toto-20) — base Toto checkpoints (4M → 2.5B), which are what we recommend deploying
+- [Toto-2.0-2.5B-FT](https://huggingface.co/Datadog/Toto-2.0-2.5B-FT) — companion benchmark-only finetune
+- [GIFT-Eval benchmark](https://huggingface.co/spaces/Salesforce/GIFT-Eval) — leaderboard hosting this submission
+- [Replication notebook](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb) — fast-path scoring + optional regeneration of every artifact in this bundle
+- [GitHub Repository](https://github.com/DataDog/toto)
+- [BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM)
 ## 📖 Citation
 (citation coming soon)
 ```
 ## 📝 License
 Apache 2.0. Each base model retains its original license — see the linked HF repos in the model pool table.

assets/bar_metrics_gift_eval.png ADDED

Git LFS Details

SHA256: 9eef1afe0c18126a9e4813345ae4b0189539c61878b4e0d3428e1205bfe13c5e
Pointer size: 131 Bytes
Size of remote file: 602 kB