Emaad committed

Commit 92fa6c2 · verified · 1 Parent(s): 2f3aec2

Tighten spacing: drop section HRs; convert figure prose to captions

Files changed (1): README.md (+8 -20)

````diff
--- a/README.md
+++ b/README.md
@@ -68,15 +68,12 @@ Toto (Time Series Optimized Transformer for [Observability](https://www.datadogh
 
 The family sets a new state of the art on three forecasting benchmarks: [BOOM](https://huggingface.co/spaces/Datadog/BOOM), our observability benchmark; [GIFT-Eval](https://huggingface.co/spaces/Salesforce/GIFT-Eval), the standard general-purpose benchmark; and the recent contamination-resistant [TIME](https://arxiv.org/abs/2602.12147) benchmark.
 
----
-
 ## 📊 Performance
 
-![Pareto frontier on BOOM and GIFT-Eval](assets/pareto.png)
-
-Every Toto 2.0 size sits on or near the Pareto frontier on both BOOM and GIFT-Eval. The three largest sizes rank first, second, and third among foundation models on GIFT-Eval CRPS rank. On TIME, Toto 2.0 sizes take the top three spots on every metric, ahead of every other external foundation model evaluated.
-
----
+<figure>
+<img src="assets/pareto.png" alt="Pareto frontier on BOOM and GIFT-Eval">
+<figcaption>Every Toto 2.0 size sits on or near the Pareto frontier on both BOOM and GIFT-Eval. The three largest sizes rank first, second, and third among foundation models on GIFT-Eval CRPS rank. On TIME, Toto 2.0 sizes take the top three spots on every metric, ahead of every other external foundation model evaluated.</figcaption>
+</figure>
 
 ## ⚡ Quick Start
 
@@ -115,8 +112,6 @@ quantiles = model.forecast(
 
 For more examples, see the [Quick Start notebook](https://github.com/DataDog/toto/blob/main/toto2/notebooks/quick_start.ipynb) and [GluonTS integration notebook](https://github.com/DataDog/toto/blob/main/toto2/notebooks/gluonts_integration.ipynb).
 
----
-
 ## 💾 Available Checkpoints
 
 All five Toto 2.0 sizes share the same training recipe; pick a size based on your accuracy/latency budget. Latency is forward-pass time for a 1,024-step single-pass forecast at batch size 8 on a single A100.
@@ -129,8 +124,6 @@ All five Toto 2.0 sizes share the same training recipe; pick a size based on you
 | [Toto‑2.0‑1B](https://huggingface.co/Datadog/Toto-2.0-1B) | 1B | 3.9 GB | ~20.9&nbsp;ms | Best quality / cost tradeoff for production workloads. |
 | [Toto‑2.0‑2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) | 2.5B | 9.1 GB | ~36.2&nbsp;ms | Highest accuracy; #1 foundation model on every benchmark. |
 
----
-
 ## ✨ Key Features
 
 - **Zero-Shot Forecasting:** Forecast without fine-tuning on your specific time series.
@@ -139,15 +132,12 @@ All five Toto 2.0 sizes share the same training recipe; pick a size based on you
 - **Decoder-Only Architecture:** Support for variable prediction horizons and context lengths.
 - **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4m → 2.5B).
 
----
-
 ## 🏗️ Architecture
 
-![Overview of the Toto 2.0 architecture.](assets/architecture.png)
-
-A decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds **contiguous patch masking (CPM)** for single-pass parallel decoding, a **quantile output head** trained with pinball loss, a robust arcsinh input scaler, residual MLP patch projections, and is trained with NorMuon. See the [technical report](#-additional-resources) for details.
-
----
+<figure>
+<img src="assets/architecture.png" alt="Overview of the Toto 2.0 architecture.">
+<figcaption>A decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds <b>contiguous patch masking (CPM)</b> for single-pass parallel decoding, a <b>quantile output head</b> trained with pinball loss, a robust arcsinh input scaler, residual MLP patch projections, and is trained with NorMuon. See the <a href="#-additional-resources">technical report</a> for details.</figcaption>
+</figure>
 
 ## 🔗 Additional Resources
 
@@ -158,8 +148,6 @@ A decoder-only patched transformer whose attention layers alternate between time
 - [BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM) — Datadog's observability time-series benchmark
 - [Toto 1.0 Weights](https://huggingface.co/Datadog/Toto-Open-Base-1.0)
 
----
-
 ## 📖 Citation
 
 ```bibtex
````
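The new architecture caption names two standard pieces that are worth a concrete illustration: the pinball (quantile) loss that trains the quantile output head, and arcsinh input scaling. Below is a minimal sketch of their textbook definitions; it illustrates the terms only, not Toto's actual implementation.

```python
# Textbook sketches of two components named in the architecture caption.
# Illustrative only -- not Toto's actual code.
import torch

def pinball_loss(y: torch.Tensor, y_hat: torch.Tensor, tau: float) -> torch.Tensor:
    """Pinball (quantile) loss: under-prediction is penalized by tau and
    over-prediction by (1 - tau), so the minimizer is the tau-quantile."""
    diff = y - y_hat
    return torch.maximum(tau * diff, (tau - 1.0) * diff).mean()

def arcsinh_scale(x: torch.Tensor, loc: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Robust scaling via arcsinh: roughly linear near zero and signed-log
    for large magnitudes, which tames heavy-tailed observability series."""
    return torch.asinh((x - loc) / scale)
```

Training a head against this loss at several quantile levels is what lets a model emit calibrated prediction intervals directly, instead of recovering them by sampling.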
 
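The checkpoint table defines latency as forward-pass time for a 1,024-step single-pass forecast at batch size 8 on a single A100. A generic PyTorch harness for that kind of measurement looks roughly like the sketch below; the model and tensor shapes are stand-ins, not the repo's benchmark code.

```python
# Generic GPU forward-pass timing, in the spirit of the table's latency column.
# The Linear layer and shapes are stand-ins for a real checkpoint and inputs.
import time
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()  # stand-in model
x = torch.randn(8, 1024, device="cuda")            # batch size 8

with torch.no_grad():
    for _ in range(10):          # warm-up: JIT/cuDNN autotune, cache fill
        model(x)
    torch.cuda.synchronize()     # drain queued kernels before starting the clock
    t0 = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()     # wait for all timed kernels to finish
    print(f"{(time.perf_counter() - t0) / 100 * 1e3:.1f} ms / forward pass")
```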