Datadog/Toto-2.0-4m

@@ -70,24 +70,6 @@ The family sets a new state of the art on three forecasting benchmarks: [BOOM](h
 ---
-## ✨ Key Features
-- **Zero-Shot Forecasting:** Forecast without fine-tuning on your specific time series.
-- **Multi-Variate Support:** Efficiently process multiple variables using alternating time/variate attention.
-- **Probabilistic Predictions:** Generate point forecasts and uncertainty estimates via a quantile output head.
-- **Decoder-Only Architecture:** Support for variable prediction horizons and context lengths.
-- **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4M → 2.5B).
----
-## 🏗️ Architecture
-![Overview of the Toto 2.0 architecture.](assets/architecture.png)
-A decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds **contiguous patch masking (CPM)** for single-pass parallel decoding, a **quantile output head** trained with pinball loss, a robust arcsinh input scaler, residual MLP patch projections, and is trained with NorMuon. See the [technical report](#-additional-resources) for details.
----
 ## 📊 Performance
 ![Pareto frontier on BOOM and GIFT-Eval](assets/pareto.png)
@@ -151,6 +133,24 @@ All five Toto 2.0 sizes share the same training recipe; pick a size based on you
 ---
 ## 🔗 Additional Resources
 - **Technical Report** — *(coming soon)*

 ---
 ## 📊 Performance
 ![Pareto frontier on BOOM and GIFT-Eval](assets/pareto.png)
 ---
+## ✨ Key Features
+- **Zero-Shot Forecasting:** Forecast without fine-tuning on your specific time series.
+- **Multi-Variate Support:** Efficiently process multiple variables using alternating time/variate attention.
+- **Probabilistic Predictions:** Generate point forecasts and uncertainty estimates via a quantile output head.
+- **Decoder-Only Architecture:** Support for variable prediction horizons and context lengths.
+- **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4M → 2.5B).
+---
+## 🏗️ Architecture
+![Overview of the Toto 2.0 architecture.](assets/architecture.png)
+A decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds **contiguous patch masking (CPM)** for single-pass parallel decoding, a **quantile output head** trained with pinball loss, a robust arcsinh input scaler, residual MLP patch projections, and is trained with NorMuon. See the [technical report](#-additional-resources) for details.
+---
 ## 🔗 Additional Resources
 - **Technical Report** — *(coming soon)*
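
The relocated Architecture section says the quantile output head is trained with pinball loss but does not spell the loss out. As a minimal sketch of the standard pinball (quantile) loss, and not Toto's actual training code, the pure-Python function below averages the loss over every target and quantile level. The name `pinball_loss` and its argument layout are assumptions made for illustration; nothing here comes from the Toto codebase.

```python
from typing import Sequence


def pinball_loss(
    y_true: Sequence[float],
    y_pred: Sequence[Sequence[float]],
    quantiles: Sequence[float],
) -> float:
    """Mean pinball loss (hypothetical helper, not part of Toto).

    y_true:    one observed value per time step
    y_pred:    for each time step, one prediction per quantile level
    quantiles: quantile levels in (0, 1), e.g. [0.1, 0.5, 0.9]
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = 0.0
    count = 0
    for target, preds in zip(y_true, y_pred):
        if len(preds) != len(quantiles):
            raise ValueError("each prediction row needs one value per quantile")
        for q, pred in zip(quantiles, preds):
            diff = target - pred
            # Under-prediction (diff > 0) costs q * diff;
            # over-prediction (diff < 0) costs (1 - q) * |diff|.
            total += max(q * diff, (q - 1.0) * diff)
            count += 1
    if count == 0:
        raise ValueError("no predictions to score")
    return total / count
```

With a high quantile level such as 0.9, predicting below the target is penalised nine times more heavily than predicting above it by the same amount, which is what pushes each output of a quantile head toward its own quantile of the forecast distribution.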