Refresh model card: benchmark-only CTA, bar metrics hero, ensemble weight share callout
#1
by Emaad - opened
- .gitattributes +1 -0
- README.md +61 -45
- assets/bar_metrics_gift_eval.png +3 -0
.gitattributes
CHANGED
|
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
booster_manifest.json filter=lfs diff=lfs merge=lfs -text
|
|
|
|
|
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
booster_manifest.json filter=lfs diff=lfs merge=lfs -text
|
| 37 |
+
assets/bar_metrics_gift_eval.png filter=lfs diff=lfs merge=lfs -text
|
README.md
CHANGED
|
@@ -6,52 +6,76 @@ tags:
|
|
| 6 |
- time-series
|
| 7 |
- timeseries
|
| 8 |
- forecasting
|
|
|
|
| 9 |
- ensemble
|
| 10 |
- meta-learning
|
| 11 |
- gift-eval
|
| 12 |
-
- observability
|
| 13 |
license: apache-2.0
|
| 14 |
pipeline_tag: time-series-forecasting
|
| 15 |
thumbnail: https://corp.dd-static.net/img/about/presskit/kit/press_kit.png
|
| 16 |
-
|
| 17 |
-
-
|
| 18 |
-
|
| 19 |
---
|
| 20 |
|
| 21 |
-
# Toto 2.0 Family
|
| 22 |
|
| 23 |
-
|
| 24 |
|
| 25 |
-
|
| 26 |
|
| 27 |
-
---
|
| 28 |
-
|
| 29 |
-
## ✨ Key Features
|
| 30 |
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
|
|
|
| 34 |
|
| 35 |
-
--
|
| 36 |
|
| 37 |
-
## 🧩
|
| 38 |
|
| 39 |
-
The
|
| 40 |
|
| 41 |
| # | Model | Family |
|
| 42 |
-
|---|---
|
| 43 |
-
| 0 | [
|
| 44 |
-
| 1 | [
|
| 45 |
-
| 2 | [
|
| 46 |
-
| 3 | [
|
| 47 |
-
| 4 | [
|
| 48 |
-
| 5 | [
|
| 49 |
-
| 6 | [
|
| 50 |
-
| 7 | [
|
| 51 |
-
| 8 | [
|
| 52 |
-
| 9 | [
|
|
|
|
|
|
|
| 53 |
|
| 54 |
-
|
| 55 |
|
| 56 |
## 📦 Bundle layout
|
| 57 |
|
|
@@ -70,8 +94,6 @@ test_predictions/<model>/<ds_dirname>/
|
|
| 70 |
|
| 71 |
`ds_dirname` follows GIFT-Eval's canonical naming: `<pretty_name>_<freq>_<term>` (e.g. `m4_weekly_W_short`).
|
| 72 |
|
| 73 |
-
---
|
| 74 |
-
|
| 75 |
## ⚡ How the booster is used
|
| 76 |
|
| 77 |
Per (dataset, term):
|
|
@@ -82,24 +104,20 @@ Per (dataset, term):
|
|
| 82 |
4. Stack the 10 per-model `test_predictions.npz` arrays into a `(n_windows, 10, 9, prediction_length)` tensor; weight-sum across the model axis → final quantile forecast.
|
| 83 |
5. Score with `gluonts.evaluate_model` using the same call shape every other GIFT-Eval submission uses (see `evaluate_dataset` in the notebook).
|
| 84 |
|
| 85 |
-
---
|
| 86 |
-
|
| 87 |
## Reproducing from scratch
|
| 88 |
|
| 89 |
-
Each base model's predictions were generated by running its standard GIFT-Eval notebook (`notebooks/chronos-2.ipynb`, etc.) with a wrapper that saves the per-window quantile forecasts to `test_predictions.npz` instead of going straight into `evaluate_model`. The notebook's "Optional B" section shows the wrapper for every pool member. Time-series features come from the [
|
| 90 |
-
|
| 91 |
-
---
|
| 92 |
|
| 93 |
## Additional Resources
|
| 94 |
|
| 95 |
-
- **
|
| 96 |
-
-
|
| 97 |
-
-
|
| 98 |
-
-
|
| 99 |
-
-
|
| 100 |
-
-
|
| 101 |
-
|
| 102 |
-
-
|
| 103 |
|
| 104 |
## Citation
|
| 105 |
|
|
@@ -107,8 +125,6 @@ Each base model's predictions were generated by running its standard GIFT-Eval n
|
|
| 107 |
(citation coming soon)
|
| 108 |
```
|
| 109 |
|
| 110 |
-
---
|
| 111 |
-
|
| 112 |
## License
|
| 113 |
|
| 114 |
Apache 2.0. Each base model retains its original license; see the linked HF repos in the model pool table.
|
|
|
|
| 6 |
- time-series
|
| 7 |
- timeseries
|
| 8 |
- forecasting
|
| 9 |
+
- observability
|
| 10 |
- ensemble
|
| 11 |
- meta-learning
|
| 12 |
- gift-eval
|
|
|
|
| 13 |
license: apache-2.0
|
| 14 |
pipeline_tag: time-series-forecasting
|
| 15 |
thumbnail: https://corp.dd-static.net/img/about/presskit/kit/press_kit.png
|
| 16 |
+
model-index:
|
| 17 |
+
- name: Toto-2.0-Family-and-Friends
|
| 18 |
+
results:
|
| 19 |
+
- task:
|
| 20 |
+
type: time-series-forecasting
|
| 21 |
+
dataset:
|
| 22 |
+
name: GIFT-Eval
|
| 23 |
+
type: GIFT-Eval
|
| 24 |
+
metrics:
|
| 25 |
+
- name: CRPS
|
| 26 |
+
type: CRPS
|
| 27 |
+
value: 0.463
|
| 28 |
+
- name: MASE
|
| 29 |
+
type: MASE
|
| 30 |
+
value: 0.676
|
| 31 |
+
source:
|
| 32 |
+
name: GIFT-Eval Time Series Forecasting Leaderboard
|
| 33 |
+
url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
|
| 34 |
---
|
| 35 |
|
| 36 |
+
# Toto 2.0 Family-and-Friends (FnF)
|
| 37 |
|
| 38 |
+
> [!WARNING]
|
| 39 |
+
> **This is a benchmarking artifact, not a general-purpose model.**
|
| 40 |
+
> Toto-2.0-FnF is an FFORMA-style XGBoost meta-learner over 10 foundation models that we submitted to the GIFT-Eval leaderboard. The bundle ships pre-computed predictions for the GIFT-Eval test split and exists to make the **#1** submission fully reproducible; it cannot forecast new series without first running every base model.
|
| 41 |
+
>
|
| 42 |
+
> For real workloads, please use the base [Toto 2.0 collection](https://huggingface.co/collections/Datadog/toto-20). The base checkpoints are pretrained without any public data, generalize to every benchmark we have evaluated, and are what we recommend deploying.
|
| 43 |
|
| 44 |
+
## ✨ What this is
|
| 45 |
|
| 46 |
+
A per-`(frequency, term)` XGBoost gate over a pool of 10 foundation models (5 Toto 2.0 sizes + 5 external models). The meta-learner consumes lightweight tsfeatures from each forecast window and emits a softmax over the model pool; the final forecast is the weighted sum of the 10 base-model quantile predictions. The design follows the [FFORMA](https://www.sciencedirect.com/science/article/abs/pii/S0169207019300895) framework (Montero-Manso et al., 2020).
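As a rough illustration of the gating described above, the combination can be written as follows. The notation is ours and not taken from the model card: $z_{w,m}$ denotes the booster's raw score for model $m$ on forecast window $w$, and $q^{(m)}_{w,k,t}$ denotes model $m$'s prediction for quantile level $k$ at horizon step $t$.

```latex
\[
p_{w,m} = \frac{\exp(z_{w,m})}{\sum_{j=0}^{9} \exp(z_{w,j})},
\qquad
\hat{q}_{w,k,t} = \sum_{m=0}^{9} p_{w,m}\, q^{(m)}_{w,k,t}
\]
```

Because the weights $p_{w,m}$ are non-negative and sum to one for each window, every combined quantile is a convex combination of the ten base-model quantiles at the same level and horizon step.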
|
|
|
|
|
|
|
| 47 |
|
| 48 |
+
<figure>
|
| 49 |
+
<img src="assets/bar_metrics_gift_eval.png" alt="GIFT-Eval bar metrics, Toto 2.0 FnF highlighted">
|
| 50 |
+
<figcaption>On the full GIFT-Eval leaderboard (foundation + finetuned + ensemble + agentic), Toto-2.0-FnF takes <b>#1 on every metric</b> (tied for #1 on raw CRPS).</figcaption>
|
| 51 |
+
</figure>
|
| 52 |
|
| 53 |
+
The replication notebook lives in the GIFT-Eval repo at [notebooks/toto_2_0_fnf.ipynb](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb).
|
| 54 |
|
| 55 |
+
## 🧩 What's in the ensemble
|
| 56 |
|
| 57 |
+
The Toto 2.0 family accounts for **39% of the assigned weight** across all predictions, more than any other model in the pool, ahead of Chronos-2 (32%) and more than the four remaining external models combined.
|
| 58 |
|
| 59 |
| # | Model | Family |
|
| 60 |
+
|:---:|---|:---:|
|
| 61 |
+
| 0 | [chronos-2](https://huggingface.co/amazon/chronos-2) | Chronos |
|
| 62 |
+
| 1 | [timesfm-2.5](https://huggingface.co/google/timesfm-2.5-200m-pytorch) | TimesFM |
|
| 63 |
+
| 2 | [flowstate](https://huggingface.co/ibm-research/flowstate) | FlowState |
|
| 64 |
+
| 3 | [tirex](https://huggingface.co/NX-AI/TiRex-1.1-gifteval) | TiRex |
|
| 65 |
+
| 4 | [patchtst-fm](https://huggingface.co/ibm-research/patchtst-fm-r1) | PatchTST |
|
| 66 |
+
| 5 | [toto-2.0-4m](https://huggingface.co/Datadog/Toto-2.0-4m) | Toto 2.0 |
|
| 67 |
+
| 6 | [toto-2.0-22m](https://huggingface.co/Datadog/Toto-2.0-22m) | Toto 2.0 |
|
| 68 |
+
| 7 | [toto-2.0-313m](https://huggingface.co/Datadog/Toto-2.0-313m) | Toto 2.0 |
|
| 69 |
+
| 8 | [toto-2.0-1b](https://huggingface.co/Datadog/Toto-2.0-1B) | Toto 2.0 |
|
| 70 |
+
| 9 | [toto-2.0-2.5b](https://huggingface.co/Datadog/Toto-2.0-2.5B) | Toto 2.0 |
|
| 71 |
+
|
| 72 |
+
Column order matters: it is tied to the booster's class indices.
|
| 73 |
|
| 74 |
+
## ✨ Key Features
|
| 75 |
+
|
| 76 |
+
- **Per-bucket gating:** Separate XGBoost head per `(frequency, term)` bucket; each bucket learns its own softmax over the model pool so the ensemble can specialize without one global gate trading off across regimes.
|
| 77 |
+
- **No retraining at inference:** The bundle ships pre-computed base-model predictions and tsfeatures for the full GIFT-Eval test split, so replication needs neither GPUs nor the base-model libraries.
|
| 78 |
+
- **No leakage:** tsfeatures are computed only on the lookback context preceding each forecast window; the bundle stores dataset metadata but not ground-truth labels.
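The "no leakage" point can be pictured with a minimal sketch, assuming each series is a 1-D array and each forecast window is identified by its start index. The helper names `lookback_context` and `toy_features` are hypothetical, and `toy_features` is only a stand-in for the tsfeatures call, not that library's API.

```python
from typing import Dict, Optional

import numpy as np


def lookback_context(
    series: np.ndarray, window_start: int, max_context: Optional[int] = None
) -> np.ndarray:
    """Return only the observations strictly before the forecast window."""
    if not 0 < window_start <= len(series):
        raise ValueError("window_start must leave at least one context point")
    if max_context is not None and max_context <= 0:
        raise ValueError("max_context must be positive")
    context = series[:window_start]  # nothing at or after window_start is visible
    if max_context is not None:
        context = context[-max_context:]  # keep the most recent points only
    return context


def toy_features(context: np.ndarray) -> Dict[str, float]:
    """Hypothetical stand-in for a feature extractor; sees the context only."""
    return {
        "mean": float(context.mean()),
        "std": float(context.std()),
        "length": float(context.size),
    }


series = np.arange(10, dtype=float)       # toy series 0.0 .. 9.0
ctx = lookback_context(series, window_start=7)
feats = toy_features(ctx)                  # computed from points 0..6 only
```

The values inside the forecast window (indices 7 and later here) never reach the feature function, which is the property the bullet describes.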
|
| 79 |
|
| 80 |
## 📦 Bundle layout
|
| 81 |
|
|
|
|
| 94 |
|
| 95 |
`ds_dirname` follows GIFT-Eval's canonical naming: `<pretty_name>_<freq>_<term>` (e.g. `m4_weekly_W_short`).
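A small sketch of the naming convention, assuming that neither `<freq>` nor `<term>` contains an underscore (the dataset name itself may, as in `m4_weekly`). Both helper functions are hypothetical and are not part of GIFT-Eval or this bundle.

```python
from typing import Tuple


def make_ds_dirname(pretty_name: str, freq: str, term: str) -> str:
    """Join the three parts as <pretty_name>_<freq>_<term>."""
    return f"{pretty_name}_{freq}_{term}"


def split_ds_dirname(ds_dirname: str) -> Tuple[str, str, str]:
    """Split from the right so underscores inside pretty_name survive."""
    parts = ds_dirname.rsplit("_", 2)
    if len(parts) != 3:
        raise ValueError(f"not of the form <pretty_name>_<freq>_<term>: {ds_dirname!r}")
    pretty_name, freq, term = parts
    return pretty_name, freq, term


dirname = make_ds_dirname("m4_weekly", "W", "short")
parsed = split_ds_dirname(dirname)
```

Splitting from the right is the design choice that matters here: a left-to-right split would break on dataset names that contain underscores.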
|
| 96 |
|
|
|
|
|
|
|
| 97 |
## ⚡ How the booster is used
|
| 98 |
|
| 99 |
Per (dataset, term):
|
|
|
|
| 104 |
4. Stack the 10 per-model `test_predictions.npz` arrays into a `(n_windows, 10, 9, prediction_length)` tensor; weight-sum across the model axis → final quantile forecast.
|
| 105 |
5. Score with `gluonts.evaluate_model` using the same call shape every other GIFT-Eval submission uses (see `evaluate_dataset` in the notebook).
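Step 4 above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the notebook's code: each per-model array is assumed to have shape `(n_windows, 9, prediction_length)`, the weights are assumed to be one row per window with shape `(n_windows, 10)`, and the function name `combine_forecasts` is hypothetical. Loading the `.npz` files is left out because their internal key names are not given here.

```python
from typing import List

import numpy as np

N_MODELS = 10     # pool size, in the order fixed by the model pool table
N_QUANTILES = 9   # quantile levels per forecast


def combine_forecasts(per_model_preds: List[np.ndarray], weights: np.ndarray) -> np.ndarray:
    """Weighted sum over the model axis.

    per_model_preds: 10 arrays, each (n_windows, 9, prediction_length), in pool order.
    weights: (n_windows, 10), each row a softmax output summing to 1.
    Returns an array of shape (n_windows, 9, prediction_length).
    """
    stacked = np.stack(per_model_preds, axis=1)  # (n_windows, 10, 9, prediction_length)
    if stacked.shape[1:3] != (N_MODELS, N_QUANTILES):
        raise ValueError(f"unexpected stacked shape {stacked.shape}")
    if weights.shape != stacked.shape[:2]:
        raise ValueError("weights must have shape (n_windows, 10)")
    if not np.allclose(weights.sum(axis=1), 1.0):
        raise ValueError("each row of weights must sum to 1")
    # Contract the model axis: out[w, q, h] = sum_m weights[w, m] * stacked[w, m, q, h]
    return np.einsum("wm,wmqh->wqh", weights, stacked)


# Toy data: 4 windows, horizon 12, random stand-ins for real predictions.
rng = np.random.default_rng(0)
preds = [rng.normal(size=(4, N_QUANTILES, 12)) for _ in range(N_MODELS)]

uniform = np.full((4, N_MODELS), 1.0 / N_MODELS)
combined = combine_forecasts(preds, uniform)      # equals the plain average

one_hot = np.zeros((4, N_MODELS))
one_hot[:, 3] = 1.0
picked = combine_forecasts(preds, one_hot)        # reproduces model 3 exactly
```

The one-hot case is a quick sanity check that the model axis lines up with the weight columns, which is why the pool order has to match the booster's class indices.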
|
| 106 |
|
|
|
|
|
|
|
| 107 |
## Reproducing from scratch
|
| 108 |
|
| 109 |
+
Each base model's predictions were generated by running its standard GIFT-Eval notebook (`notebooks/chronos-2.ipynb`, etc.) with a wrapper that saves the per-window quantile forecasts to `test_predictions.npz` instead of going straight into `evaluate_model`. The notebook's "Optional B" section shows the wrapper for every pool member. Time-series features come from the [tsfeatures](https://github.com/Nixtla/tsfeatures) library; "Optional A" in the notebook shows the per-window extraction call. The meta-learner boosters were trained on the corresponding train-window predictions, which are not included in this bundle.
|
|
|
|
|
|
|
| 110 |
|
| 111 |
## Additional Resources
|
| 112 |
|
| 113 |
+
- **Technical Report**: *(coming soon)*
|
| 114 |
+
- [Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)
|
| 115 |
+
- [Toto 2.0 Collection](https://huggingface.co/collections/Datadog/toto-20): base Toto checkpoints (4m to 2.5B), which are what we recommend deploying
|
| 116 |
+
- [Toto-2.0-2.5B-FT](https://huggingface.co/Datadog/Toto-2.0-2.5B-FT): companion benchmark-only finetune
|
| 117 |
+
- [GIFT-Eval benchmark](https://huggingface.co/spaces/Salesforce/GIFT-Eval): leaderboard hosting this submission
|
| 118 |
+
- [Replication notebook](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb): fast-path scoring + optional regeneration of every artifact in this bundle
|
| 119 |
+
- [GitHub Repository](https://github.com/DataDog/toto)
|
| 120 |
+
- [BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM)
|
| 121 |
|
| 122 |
## Citation
|
| 123 |
|
|
|
|
| 125 |
(citation coming soon)
|
| 126 |
```
|
| 127 |
|
|
|
|
|
|
|
| 128 |
## License
|
| 129 |
|
| 130 |
Apache 2.0. Each base model retains its original license; see the linked HF repos in the model pool table.
|
assets/bar_metrics_gift_eval.png
ADDED