Emaad committed

Commit 92fa6c2 · verified · 1 Parent(s): 2f3aec2

Tighten spacing: drop section HRs; convert figure prose to captions

Files changed (1): README.md (+8 -20)

````diff
--- a/README.md
+++ b/README.md
@@ -68,15 +68,12 @@ Toto (Time Series Optimized Transformer for [Observability](https://www.datadogh
 
 The family sets a new state of the art on three forecasting benchmarks: [BOOM](https://huggingface.co/spaces/Datadog/BOOM), our observability benchmark; [GIFT-Eval](https://huggingface.co/spaces/Salesforce/GIFT-Eval), the standard general-purpose benchmark; and the recent contamination-resistant [TIME](https://arxiv.org/abs/2602.12147) benchmark.
 
----
-
 ## 📊 Performance
 
-![Pareto frontier on BOOM and GIFT-Eval](assets/pareto.png)
-
-Every Toto 2.0 size sits on or near the Pareto frontier on both BOOM and GIFT-Eval. The three largest sizes rank first, second, and third among foundation models on GIFT-Eval CRPS rank. On TIME, Toto 2.0 sizes take the top three spots on every metric, ahead of every other external foundation model evaluated.
-
----
+<figure>
+<img src="assets/pareto.png" alt="Pareto frontier on BOOM and GIFT-Eval">
+<figcaption>Every Toto 2.0 size sits on or near the Pareto frontier on both BOOM and GIFT-Eval. The three largest sizes rank first, second, and third among foundation models on GIFT-Eval CRPS rank. On TIME, Toto 2.0 sizes take the top three spots on every metric, ahead of every other external foundation model evaluated.</figcaption>
+</figure>
 
 ## ⚡ Quick Start
 
@@ -115,8 +112,6 @@ quantiles = model.forecast(
 
 For more examples, see the [Quick Start notebook](https://github.com/DataDog/toto/blob/main/toto2/notebooks/quick_start.ipynb) and [GluonTS integration notebook](https://github.com/DataDog/toto/blob/main/toto2/notebooks/gluonts_integration.ipynb).
 
----
-
 ## 💾 Available Checkpoints
 
 All five Toto 2.0 sizes share the same training recipe; pick a size based on your accuracy/latency budget. Latency is forward-pass time for a 1,024-step single-pass forecast at batch size 8 on a single A100.
@@ -129,8 +124,6 @@ All five Toto 2.0 sizes share the same training recipe; pick a size based on you
 | [Toto‑2.0‑1B](https://huggingface.co/Datadog/Toto-2.0-1B) | 1B | 3.9 GB | ~20.9&nbsp;ms | Best quality / cost tradeoff for production workloads. |
 | [Toto‑2.0‑2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) | 2.5B | 9.1 GB | ~36.2&nbsp;ms | Highest accuracy; #1 foundation model on every benchmark. |
 
----
-
 ## ✨ Key Features
 
 - **Zero-Shot Forecasting:** Forecast without fine-tuning on your specific time series.
@@ -139,15 +132,12 @@ All five Toto 2.0 sizes share the same training recipe; pick a size based on you
 - **Decoder-Only Architecture:** Support for variable prediction horizons and context lengths.
 - **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4m → 2.5B).
 
----
-
 ## 🏗️ Architecture
 
-![Overview of the Toto 2.0 architecture.](assets/architecture.png)
-
-A decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds **contiguous patch masking (CPM)** for single-pass parallel decoding, a **quantile output head** trained with pinball loss, a robust arcsinh input scaler, residual MLP patch projections, and is trained with NorMuon. See the [technical report](#-additional-resources) for details.
-
----
+<figure>
+<img src="assets/architecture.png" alt="Overview of the Toto 2.0 architecture.">
+<figcaption>A decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds <b>contiguous patch masking (CPM)</b> for single-pass parallel decoding, a <b>quantile output head</b> trained with pinball loss, a robust arcsinh input scaler, residual MLP patch projections, and is trained with NorMuon. See the <a href="#-additional-resources">technical report</a> for details.</figcaption>
+</figure>
 
 ## 🔗 Additional Resources
 
@@ -158,8 +148,6 @@ A decoder-only patched transformer whose attention layers alternate between time
 - [BOOM Dataset](https://huggingface.co/datasets/Datadog/BOOM) — Datadog's observability time-series benchmark
 - [Toto 1.0 Weights](https://huggingface.co/Datadog/Toto-Open-Base-1.0)
 
----
-
 ## 📖 Citation
 
 ```bibtex
````
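The new architecture caption names two standard pieces that are worth a concrete illustration: the pinball (quantile) loss that trains the quantile output head, and arcsinh input scaling. Below is a minimal sketch of their textbook definitions; it illustrates the terms only, not Toto's actual implementation.

```python
# Textbook sketches of two components named in the architecture caption.
# Illustrative only -- not Toto's actual code.
import torch

def pinball_loss(y: torch.Tensor, y_hat: torch.Tensor, tau: float) -> torch.Tensor:
    """Pinball (quantile) loss: under-prediction is penalized by tau and
    over-prediction by (1 - tau), so the minimizer is the tau-quantile."""
    diff = y - y_hat
    return torch.maximum(tau * diff, (tau - 1.0) * diff).mean()

def arcsinh_scale(x: torch.Tensor, loc: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Robust scaling via arcsinh: roughly linear near zero and signed-log
    for large magnitudes, which tames heavy-tailed observability series."""
    return torch.asinh((x - loc) / scale)
```

Training a head against this loss at several quantile levels is what lets a model emit calibrated prediction intervals directly, instead of recovering them by sampling.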
 
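The checkpoint table defines latency as forward-pass time for a 1,024-step single-pass forecast at batch size 8 on a single A100. A generic PyTorch harness for that kind of measurement looks roughly like the sketch below; the model and tensor shapes are stand-ins, not the repo's benchmark code.

```python
# Generic GPU forward-pass timing, in the spirit of the table's latency column.
# The Linear layer and shapes are stand-ins for a real checkpoint and inputs.
import time
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()  # stand-in model
x = torch.randn(8, 1024, device="cuda")            # batch size 8

with torch.no_grad():
    for _ in range(10):          # warm-up: JIT/cuDNN autotune, cache fill
        model(x)
    torch.cuda.synchronize()     # drain queued kernels before starting the clock
    t0 = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()     # wait for all timed kernels to finish
    print(f"{(time.perf_counter() - t0) / 100 * 1e3:.1f} ms / forward pass")
```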