Refresh model card: benchmark-only CTA, bar metrics hero, ensemble weight share callout

#1
by Emaad - opened
Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +61 -45
  3. assets/bar_metrics_gift_eval.png +3 -0
.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  booster_manifest.json filter=lfs diff=lfs merge=lfs -text
 
 
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  booster_manifest.json filter=lfs diff=lfs merge=lfs -text
37
+ assets/bar_metrics_gift_eval.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -6,52 +6,76 @@ tags:
6
  - time-series
7
  - timeseries
8
  - forecasting
 
9
  - ensemble
10
  - meta-learning
11
  - gift-eval
12
- - observability
13
  license: apache-2.0
14
  pipeline_tag: time-series-forecasting
15
  thumbnail: https://corp.dd-static.net/img/about/presskit/kit/press_kit.png
16
- language:
17
- - en
18
- pretty_name: Toto 2.0 Family and Friends - GIFT-Eval artifacts
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
19
  ---
20
 
21
- # Toto 2.0 Family and Friends - GIFT-Eval artifact bundle
22
 
23
- Pre-computed artifacts for replicating the **Toto 2.0 Family and Friends** (short form: **Toto-2.0-FnF**) submission to the [GIFT-Eval benchmark](https://huggingface.co/spaces/Salesforce/GIFT-Eval). The ensemble is an FFORMA-style ([Montero-Manso et al., *International Journal of Forecasting*, 2020](https://www.sciencedirect.com/science/article/abs/pii/S0169207019300895)) meta-learner that gates a pool of foundation models on a per-(frequency, term) bucket basis using XGBoost over time-series features.
 
 
 
 
24
 
25
- The replication notebook lives in the GIFT-Eval repo at [`notebooks/toto_2_0_fnf.ipynb`](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb).
26
 
27
- ---
28
-
29
- ## ✨ Key Features
30
 
31
- - **Per-bucket gating**: Separate XGBoost head per `(frequency, term)` bucket. Each bucket learns its own softmax over the model pool, so the ensemble can specialize without one global gate trading off across regimes.
32
- - **No retraining at inference**: The bundle ships pre-computed base-model predictions and tsfeatures for the full GIFT-Eval test split, so replication needs neither GPUs nor the base-model libraries.
33
- - **No leakage**: tsfeatures are computed only on the lookback context preceding each forecast window; the bundle stores dataset metadata but not ground-truth labels.
 
34
 
35
- ---
36
 
37
- ## 🧩 Model pool
38
 
39
- The meta-learner outputs softmax weights over 10 foundation models (column order matters because it is tied to the booster's class indices):
40
 
41
  | # | Model | Family |
42
- |---|-------|--------|
43
- | 0 | [`chronos-2`](https://huggingface.co/amazon/chronos-2) | Chronos |
44
- | 1 | [`timesfm-2.5`](https://huggingface.co/google/timesfm-2.5-200m-pytorch) | TimesFM |
45
- | 2 | [`flowstate`](https://huggingface.co/ibm-research/flowstate) | FlowState |
46
- | 3 | [`tirex`](https://huggingface.co/NX-AI/TiRex-1.1-gifteval) | TiRex |
47
- | 4 | [`patchtst-fm`](https://huggingface.co/ibm-research/patchtst-fm-r1) | PatchTST |
48
- | 5 | [`toto-2.0-4m`](https://huggingface.co/Datadog/Toto-2.0-4m) | Toto 2.0 |
49
- | 6 | [`toto-2.0-22m`](https://huggingface.co/Datadog/Toto-2.0-22m) | Toto 2.0 |
50
- | 7 | [`toto-2.0-313m`](https://huggingface.co/Datadog/Toto-2.0-313m) | Toto 2.0 |
51
- | 8 | [`toto-2.0-1b`](https://huggingface.co/Datadog/Toto-2.0-1b) | Toto 2.0 |
52
- | 9 | [`toto-2.0-2.5b`](https://huggingface.co/Datadog/Toto-2.0-2.5b) | Toto 2.0 |
 
 
53
 
54
- ---
 
 
 
 
55
 
56
  ## 📦 Bundle layout
57
 
@@ -70,8 +94,6 @@ test_predictions/<model>/<ds_dirname>/
70
 
71
  `ds_dirname` follows GIFT-Eval's canonical naming: `<pretty_name>_<freq>_<term>` (e.g. `m4_weekly_W_short`).
72
 
73
- ---
74
-
75
  ## ⚡ How the booster is used
76
 
77
  Per (dataset, term):
@@ -82,24 +104,20 @@ Per (dataset, term):
82
  4. Stack the 10 per-model `test_predictions.npz` arrays into a `(n_windows, 10, 9, prediction_length)` tensor; weight-sum across the model axis → final quantile forecast.
83
  5. Score with `gluonts.evaluate_model` using the same call shape every other GIFT-Eval submission uses (see `evaluate_dataset` in the notebook).
84
 
85
- ---
86
-
87
  ## 🔁 Reproducing from scratch
88
 
89
- Each base model's predictions were generated by running its standard GIFT-Eval notebook (`notebooks/chronos-2.ipynb`, etc.) with a wrapper that saves the per-window quantile forecasts to `test_predictions.npz` instead of going straight into `evaluate_model`. The notebook's "Optional B" section shows the wrapper for every pool member. Time-series features come from the [`tsfeatures`](https://github.com/Nixtla/tsfeatures) library; "Optional A" in the notebook shows the per-window extraction call. The meta-learner boosters were trained on the corresponding train-window predictions, which are not included in this bundle.
90
-
91
- ---
92
 
93
  ## 🔗 Additional Resources
94
 
95
- - **[GIFT-Eval benchmark](https://huggingface.co/spaces/Salesforce/GIFT-Eval)**: leaderboard hosting this submission
96
- - **[Replication notebook](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb)**: fast-path scoring + optional regeneration of every artifact in this bundle
97
- - **[Toto 2.0 family](https://huggingface.co/Datadog/Toto-2.0-1B)**: base Toto checkpoints (4M → 2.5B)
98
- - **[Toto GitHub repository](https://github.com/DataDog/toto)**: Toto 2.0 source code
99
- - **[BOOM dataset](https://huggingface.co/datasets/Datadog/BOOM)**: Datadog's observability time-series benchmark
100
- - **[Datadog blog post](https://www.datadoghq.com/blog/ai/toto-2/)**: Toto 2.0 announcement
101
-
102
- ---
103
 
104
  ## 📖 Citation
105
 
@@ -107,8 +125,6 @@ Each base model's predictions were generated by running its standard GIFT-Eval n
107
  (citation coming soon)
108
  ```
109
 
110
- ---
111
-
112
  ## 📝 License
113
 
114
  Apache 2.0. Each base model retains its original license; see the linked HF repos in the model pool table.
 
6
  - time-series
7
  - timeseries
8
  - forecasting
9
+ - observability
10
  - ensemble
11
  - meta-learning
12
  - gift-eval
 
13
  license: apache-2.0
14
  pipeline_tag: time-series-forecasting
15
  thumbnail: https://corp.dd-static.net/img/about/presskit/kit/press_kit.png
16
+ model-index:
17
+ - name: Toto-2.0-Family-and-Friends
18
+ results:
19
+ - task:
20
+ type: time-series-forecasting
21
+ dataset:
22
+ name: GIFT-Eval
23
+ type: GIFT-Eval
24
+ metrics:
25
+ - name: CRPS
26
+ type: CRPS
27
+ value: 0.463
28
+ - name: MASE
29
+ type: MASE
30
+ value: 0.676
31
+ source:
32
+ name: GIFT-Eval Time Series Forecasting Leaderboard
33
+ url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
34
  ---
35
 
36
+ # Toto 2.0 Family-and-Friends (FnF)
37
 
38
+ > [!WARNING]
39
+ > **This is a benchmarking artifact, not a general-purpose model.**
40
+ > Toto-2.0-FnF is an FFORMA-style XGBoost meta-learner over 10 foundation models that we submitted to the GIFT-Eval leaderboard. The bundle ships pre-computed predictions for the GIFT-Eval test split and exists to make the **#1** submission fully reproducible; it cannot forecast new series without first running every base model.
41
+ >
42
+ > For real workloads, please use the base [Toto 2.0 collection](https://huggingface.co/collections/Datadog/toto-20). The base checkpoints are pretrained without any public data, generalize to every benchmark we have evaluated, and are what we recommend deploying.
43
 
44
+ ## ✨ What this is
45
 
46
+ A per-`(frequency, term)` XGBoost gate over a pool of 10 foundation models (5 Toto 2.0 sizes + 5 external models). The meta-learner consumes lightweight tsfeatures from each forecast window and emits a softmax over the model pool; the final forecast is the weighted sum of the 10 base-model quantile predictions. The design follows the [FFORMA](https://www.sciencedirect.com/science/article/abs/pii/S0169207019300895) framework (Montero-Manso et al., 2020).
 
 
47
 
48
+ <figure>
49
+ <img src="assets/bar_metrics_gift_eval.png" alt="GIFT-Eval bar metrics: Toto 2.0 FnF highlighted">
50
+ <figcaption>On the full GIFT-Eval leaderboard (foundation + finetuned + ensemble + agentic), Toto-2.0-FnF takes <b>#1 on every metric</b> (tied for #1 on raw CRPS).</figcaption>
51
+ </figure>
52
 
53
+ The replication notebook lives in the GIFT-Eval repo at [notebooks/toto_2_0_fnf.ipynb](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb).
54
 
55
+ ## 🧩 What's in the ensemble
56
 
57
+ The Toto 2.0 family accounts for **39% of the assigned weight** across all predictions. That is more than any other model family in the pool: it is ahead of Chronos-2 (32%) and exceeds the four remaining external models combined.
58
 
59
  | # | Model | Family |
60
+ |:---:|---|:---:|
61
+ | 0 | [chronos-2](https://huggingface.co/amazon/chronos-2) | Chronos |
62
+ | 1 | [timesfm-2.5](https://huggingface.co/google/timesfm-2.5-200m-pytorch) | TimesFM |
63
+ | 2 | [flowstate](https://huggingface.co/ibm-research/flowstate) | FlowState |
64
+ | 3 | [tirex](https://huggingface.co/NX-AI/TiRex-1.1-gifteval) | TiRex |
65
+ | 4 | [patchtst-fm](https://huggingface.co/ibm-research/patchtst-fm-r1) | PatchTST |
66
+ | 5 | [toto-2.0-4m](https://huggingface.co/Datadog/Toto-2.0-4m) | Toto 2.0 |
67
+ | 6 | [toto-2.0-22m](https://huggingface.co/Datadog/Toto-2.0-22m) | Toto 2.0 |
68
+ | 7 | [toto-2.0-313m](https://huggingface.co/Datadog/Toto-2.0-313m) | Toto 2.0 |
69
+ | 8 | [toto-2.0-1b](https://huggingface.co/Datadog/Toto-2.0-1B) | Toto 2.0 |
70
+ | 9 | [toto-2.0-2.5b](https://huggingface.co/Datadog/Toto-2.0-2.5B) | Toto 2.0 |
71
+
72
+ Column order matters because it is tied to the booster's class indices.
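
As a rough illustration of how a family-level weight share like the one quoted above could be computed, the sketch below averages softmax weights per column and sums them by family. It assumes the weights are available as an `(n_predictions, 10)` array whose columns follow the table order; the function name `family_weight_share` and the toy weights are hypothetical and do not reproduce the reported figures.

```python
import numpy as np

# Family of each column, in the table's order (column index == booster class index).
FAMILIES = [
    "Chronos", "TimesFM", "FlowState", "TiRex", "PatchTST",
    "Toto 2.0", "Toto 2.0", "Toto 2.0", "Toto 2.0", "Toto 2.0",
]


def family_weight_share(weights):
    """Return the mean assigned weight per family.

    weights: (n_predictions, 10) array of softmax weights, one row per
    prediction (assumed layout), each row summing to 1.
    """
    weights = np.asarray(weights, dtype=float)
    if weights.ndim != 2 or weights.shape[1] != len(FAMILIES):
        raise ValueError(f"expected shape (n, {len(FAMILIES)}), got {weights.shape}")
    per_model = weights.mean(axis=0)  # average weight of each pool member
    shares = {}
    for family, w in zip(FAMILIES, per_model):
        shares[family] = shares.get(family, 0.0) + float(w)
    return shares


# Toy weights for illustration only: one prediction fully on column 0,
# one fully on column 8.
toy_weights = np.zeros((2, 10))
toy_weights[0, 0] = 1.0
toy_weights[1, 8] = 1.0
toy_shares = family_weight_share(toy_weights)
```

Because the shares are keyed by column position, reordering the columns without reordering `FAMILIES` would silently attribute weight to the wrong family.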
73
 
74
+ ## ✨ Key Features
75
+
76
+ - **Per-bucket gating:** Separate XGBoost head per `(frequency, term)` bucket. Each bucket learns its own softmax over the model pool, so the ensemble can specialize without one global gate trading off across regimes.
77
+ - **No retraining at inference:** The bundle ships pre-computed base-model predictions and tsfeatures for the full GIFT-Eval test split, so replication needs neither GPUs nor the base-model libraries.
78
+ - **No leakage:** tsfeatures are computed only on the lookback context preceding each forecast window; the bundle stores dataset metadata but not ground-truth labels.
79
 
80
  ## 📦 Bundle layout
81
 
 
94
 
95
  `ds_dirname` follows GIFT-Eval's canonical naming: `<pretty_name>_<freq>_<term>` (e.g. `m4_weekly_W_short`).
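
A minimal sketch of splitting and rebuilding a `ds_dirname`, assuming that `<freq>` and `<term>` never contain underscores (only `<pretty_name>` may). The helper names `parse_ds_dirname` and `build_ds_dirname` are hypothetical and are not taken from the bundle or the notebook.

```python
from typing import NamedTuple


class DatasetKey(NamedTuple):
    pretty_name: str
    freq: str
    term: str


def parse_ds_dirname(ds_dirname: str) -> DatasetKey:
    """Split '<pretty_name>_<freq>_<term>' from the right.

    Splitting from the right keeps underscores inside pretty_name intact
    (e.g. 'm4_weekly'); this relies on freq and term being underscore-free.
    """
    parts = ds_dirname.rsplit("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a <pretty_name>_<freq>_<term> name: {ds_dirname!r}")
    return DatasetKey(*parts)


def build_ds_dirname(pretty_name: str, freq: str, term: str) -> str:
    """Inverse of parse_ds_dirname."""
    return f"{pretty_name}_{freq}_{term}"


key = parse_ds_dirname("m4_weekly_W_short")
# key.pretty_name == "m4_weekly", key.freq == "W", key.term == "short"
```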
96
 
 
 
97
  ## ⚡ How the booster is used
98
 
99
  Per (dataset, term):
 
104
  4. Stack the 10 per-model `test_predictions.npz` arrays into a `(n_windows, 10, 9, prediction_length)` tensor; weight-sum across the model axis → final quantile forecast.
105
  5. Score with `gluonts.evaluate_model` using the same call shape every other GIFT-Eval submission uses (see `evaluate_dataset` in the notebook).
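
Step 4 can be sketched in NumPy as follows. This is an illustration under assumptions, not the notebook's code: it assumes one weight vector per window, shaped `(n_windows, 10)`, and the function name `combine_quantile_forecasts` is hypothetical.

```python
import numpy as np


def combine_quantile_forecasts(predictions, weights):
    """Weighted sum over the model axis.

    predictions: (n_windows, n_models, n_quantiles, prediction_length)
    weights:     (n_windows, n_models), assumed to sum to 1 along axis 1
    returns:     (n_windows, n_quantiles, prediction_length)
    """
    predictions = np.asarray(predictions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if predictions.ndim != 4:
        raise ValueError(f"expected a 4-D prediction tensor, got {predictions.shape}")
    if weights.shape != predictions.shape[:2]:
        raise ValueError(
            f"weights {weights.shape} do not match predictions {predictions.shape[:2]}"
        )
    # Contract the model axis 'm': out[w, q, h] = sum_m weights[w, m] * predictions[w, m, q, h]
    return np.einsum("wm,wmqh->wqh", weights, predictions)


# Toy example: 3 windows, 10 models, 9 quantiles, horizon 4, uniform weights.
rng = np.random.default_rng(0)
toy_predictions = rng.normal(size=(3, 10, 9, 4))
uniform_weights = np.full((3, 10), 0.1)
combined = combine_quantile_forecasts(toy_predictions, uniform_weights)
```

With uniform weights the result reduces to the plain average over models, which makes a convenient sanity check before plugging in the booster's softmax output.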
106
 
 
 
107
  ## 🔁 Reproducing from scratch
108
 
109
+ Each base model's predictions were generated by running its standard GIFT-Eval notebook (`notebooks/chronos-2.ipynb`, etc.) with a wrapper that saves the per-window quantile forecasts to `test_predictions.npz` instead of going straight into `evaluate_model`. The notebook's "Optional B" section shows the wrapper for every pool member. Time-series features come from the [tsfeatures](https://github.com/Nixtla/tsfeatures) library; "Optional A" in the notebook shows the per-window extraction call. The meta-learner boosters were trained on the corresponding train-window predictions, which are not included in this bundle.
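
The save step of such a wrapper might look like the sketch below. It is not the notebook's "Optional B" code: the array key `quantiles`, the `(n_windows, 9, prediction_length)` layout, and the helper names are assumptions made for illustration; only the `test_predictions.npz` file name and the `test_predictions/<model>/<ds_dirname>/` layout come from this card.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_test_predictions(out_dir, quantile_forecasts):
    """Write per-window quantile forecasts to <out_dir>/test_predictions.npz.

    quantile_forecasts: (n_windows, n_quantiles, prediction_length) array
    (assumed layout). The key name 'quantiles' is hypothetical.
    """
    arr = np.asarray(quantile_forecasts, dtype=np.float32)
    if arr.ndim != 3:
        raise ValueError(f"expected a 3-D array, got shape {arr.shape}")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "test_predictions.npz"
    np.savez_compressed(path, quantiles=arr)
    return path


def load_test_predictions(out_dir):
    """Read the array back from <out_dir>/test_predictions.npz."""
    with np.load(Path(out_dir) / "test_predictions.npz") as data:
        return data["quantiles"]


# Round-trip check in a temporary directory that mirrors the bundle layout.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "test_predictions" / "chronos-2" / "m4_weekly_W_short"
    original = np.zeros((2, 9, 13), dtype=np.float32)
    saved_path = save_test_predictions(target, original)
    saved_name = saved_path.name
    restored = load_test_predictions(target)
```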
 
 
110
 
111
  ## 🔗 Additional Resources
112
 
113
+ - **Technical Report**: *(coming soon)*
114
+ - [Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)
115
+ - [Toto 2.0 Collection](https://huggingface.co/collections/Datadog/toto-20): base Toto checkpoints (4M → 2.5B), which are what we recommend deploying
116
+ - [Toto-2.0-2.5B-FT](https://huggingface.co/Datadog/Toto-2.0-2.5B-FT): companion benchmark-only finetune
117
+ - [GIFT-Eval benchmark](https://huggingface.co/spaces/Salesforce/GIFT-Eval): leaderboard hosting this submission
118
+ - [Replication notebook](https://github.com/SalesforceAIResearch/gift-eval/blob/main/notebooks/toto_2_0_fnf.ipynb): fast-path scoring + optional regeneration of every artifact in this bundle
119
+ - [GitHub Repository](https://github.com/DataDog/toto)
120
+ - [BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM)
121
 
122
  ## 📖 Citation
123
 
 
125
  (citation coming soon)
126
  ```
127
 
 
 
128
  ## 📝 License
129
 
130
  Apache 2.0. Each base model retains its original license; see the linked HF repos in the model pool table.
assets/bar_metrics_gift_eval.png ADDED

Git LFS Details

  • SHA256: 9eef1afe0c18126a9e4813345ae4b0189539c61878b4e0d3428e1205bfe13c5e
  • Pointer size: 131 Bytes
  • Size of remote file: 602 kB