Datadog/Toto-2.0-2.5B-FT

@@ -42,24 +42,21 @@ model-index:
 >
 > For real workloads, please use the base [Toto 2.0 collection](https://huggingface.co/collections/Datadog/toto-20). The base checkpoints are pretrained without any public data, generalize to every benchmark we have evaluated, and are what we recommend deploying.
----
 ## ✨ What this is
 A single Toto 2.0 2.5B base checkpoint finetuned on a mix that **includes the GIFT-Eval training split**, used to probe how far the base model can be pushed on a single in-distribution benchmark.
-![GIFT-Eval bar metrics — Toto 2.0 2.5B-FT highlighted](assets/bar_metrics_gift_eval.png)
-On the full GIFT-Eval leaderboard (foundation models + finetuned + ensemble + agentic), Toto-2.0-2.5B-FT places **#2 on CRPS rank, MASE rank, and #3 on raw CRPS / MASE**, behind only the [Toto 2.0 Family-and-Friends](https://huggingface.co/Datadog/Toto-2.0-Family-and-Friends) ensemble.
----
 ## 🔁 Finetuning recipe
 Starting from a fully-decayed [Toto-2.0-2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) base checkpoint, we finetuned for 10,000 steps on a mix designed to expose the model to in-distribution structure without overfitting to GIFT-Eval alone:
 | Source | Share |
-|---|---|
 | GIFT-Eval Pretrain | 45% |
 | Datadog 5-minute+ observability metrics | 25% |
 | GIFT-Eval train split | 15% |
@@ -71,8 +68,6 @@ The public portion (45% GIFT-Eval Pretrain) is drawn from the Toto 1.0 mix of GI
 NorMuon and AdamW learning rates were both dropped by roughly an order of magnitude from pretraining (to 0.05 and 0.001 respectively). All other architecture and inference settings match the base 2.5B model.
----
 ## ⚡ Quick Start
 ```python
@@ -87,20 +82,16 @@ model = model.to("cuda").eval()
 See the base [Toto-2.0-2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) model card for the full inference example.
----
 ## 🔗 Additional Resources
 - **Technical Report** — *(coming soon)*
 - [Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)
 - [Base model: Toto-2.0-2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) — the unfinetuned checkpoint, which is what we recommend deploying
-- [Toto 2.0 Collection](https://huggingface.co/collections/Datadog/toto-20) — all five base sizes (4M → 2.5B)
 - [Toto 2.0 Family-and-Friends](https://huggingface.co/Datadog/Toto-2.0-Family-and-Friends) — companion FFORMA-ensemble submission, also benchmark-only
 - [GIFT-Eval benchmark](https://huggingface.co/spaces/Salesforce/GIFT-Eval) — leaderboard hosting this submission
 - [GitHub Repository](https://github.com/DataDog/toto)
----
 ## 📝 License
 Apache 2.0.

 >
 > For real workloads, please use the base [Toto 2.0 collection](https://huggingface.co/collections/Datadog/toto-20). The base checkpoints are pretrained without any public data, generalize to every benchmark we have evaluated, and are what we recommend deploying.
 ## ✨ What this is
 A single Toto 2.0 2.5B base checkpoint finetuned on a mix that **includes the GIFT-Eval training split**, used to probe how far the base model can be pushed on a single in-distribution benchmark.
+<figure>
+<img src="assets/bar_metrics_gift_eval.png" alt="GIFT-Eval bar metrics — Toto 2.0 2.5B-FT highlighted">
+<figcaption>On the full GIFT-Eval leaderboard (foundation models + finetuned + ensemble + agentic), Toto-2.0-2.5B-FT places <b>#2 on CRPS rank and MASE rank</b>, behind only the <a href="https://huggingface.co/Datadog/Toto-2.0-Family-and-Friends">Toto 2.0 Family-and-Friends</a> ensemble, and <b>#3 on raw CRPS / MASE</b>.</figcaption>
+</figure>
 ## 🔁 Finetuning recipe
 Starting from a fully-decayed [Toto-2.0-2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) base checkpoint, we finetuned for 10,000 steps on a mix designed to expose the model to in-distribution structure without overfitting to GIFT-Eval alone:
 | Source | Share |
+|---|---:|
 | GIFT-Eval Pretrain | 45% |
 | Datadog 5-minute+ observability metrics | 25% |
 | GIFT-Eval train split | 15% |
 NorMuon and AdamW learning rates were both dropped by roughly an order of magnitude from pretraining (to 0.05 and 0.001 respectively). All other architecture and inference settings match the base 2.5B model.
 ## ⚡ Quick Start
 ```python
 See the base [Toto-2.0-2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) model card for the full inference example.
 ## 🔗 Additional Resources
 - **Technical Report** — *(coming soon)*
 - [Blog Post](https://www.datadoghq.com/blog/ai/toto-2/)
 - [Base model: Toto-2.0-2.5B](https://huggingface.co/Datadog/Toto-2.0-2.5B) — the unfinetuned checkpoint, which is what we recommend deploying
+- [Toto 2.0 Collection](https://huggingface.co/collections/Datadog/toto-20) — all five base sizes (4M → 2.5B)
 - [Toto 2.0 Family-and-Friends](https://huggingface.co/Datadog/Toto-2.0-Family-and-Friends) — companion FFORMA-ensemble submission, also benchmark-only
 - [GIFT-Eval benchmark](https://huggingface.co/spaces/Salesforce/GIFT-Eval) — leaderboard hosting this submission
 - [GitHub Repository](https://github.com/DataDog/toto)
 ## 📝 License
 Apache 2.0.
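The finetuning mix in the recipe table amounts to a set of sampling weights over data sources. The Python sketch below shows one hypothetical way to draw training examples in those proportions; the card does not say how the mix was actually implemented. It uses only the three shares visible in this diff and puts the remaining 15% (table rows hidden between hunks) into a placeholder bucket. The source names come from the card; `sample_sources`, `MIX`, and the placeholder bucket are illustrative.

```python
import random
from collections import Counter

# Shares from the finetuning table in the diff. The diff hides the remaining rows,
# so the unaccounted 15% goes into a hypothetical placeholder bucket.
MIX = {
    "GIFT-Eval Pretrain": 0.45,
    "Datadog 5-minute+ observability metrics": 0.25,
    "GIFT-Eval train split": 0.15,
    "(rows not shown in this diff)": 0.15,
}


def sample_sources(mix: dict[str, float], num_examples: int, seed: int = 0) -> list[str]:
    """Hypothetical helper: draw one source name per training example,
    with probability equal to that source's share of the mix."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"shares must sum to 1.0, got {total}")
    rng = random.Random(seed)  # seeded so the sampling schedule is reproducible
    names = list(mix)
    return rng.choices(names, weights=[mix[n] for n in names], k=num_examples)


if __name__ == "__main__":
    picks = sample_sources(MIX, num_examples=20_000)
    counts = Counter(picks)
    for name, share in MIX.items():
        observed = counts[name] / len(picks)
        print(f"{name:42s} target {share:5.0%}  observed {observed:6.1%}")
```

Per-example sampling with `random.choices` keeps long-run proportions close to the target shares while interleaving sources within each batch; a fixed per-batch ratio would be an equally plausible design.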