Emaad committed on
Commit 3770379 · verified · 1 Parent(s): 962b925

Reorder sections: frontload Performance + Quick Start

Files changed (1)
  1. README.md +18 -18
README.md CHANGED
@@ -70,24 +70,6 @@ The family sets a new state of the art on three forecasting benchmarks: [BOOM](h

  ---

- ## ✨ Key Features
-
- - **Zero-Shot Forecasting:** Forecast without fine-tuning on your specific time series.
- - **Multi-Variate Support:** Efficiently process multiple variables using alternating time/variate attention.
- - **Probabilistic Predictions:** Generate point forecasts and uncertainty estimates via a quantile output head.
- - **Decoder-Only Architecture:** Support for variable prediction horizons and context lengths.
- - **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4M → 2.5B).
-
- ---
-
- ## 🏗️ Architecture
-
- ![Overview of the Toto 2.0 architecture.](assets/architecture.png)
-
- A decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds **contiguous patch masking (CPM)** for single-pass parallel decoding, a **quantile output head** trained with pinball loss, a robust arcsinh input scaler, residual MLP patch projections, and is trained with NorMuon. See the [technical report](#-additional-resources) for details.
-
- ---
-
  ## 📊 Performance

  ![Pareto frontier on BOOM and GIFT-Eval](assets/pareto.png)
@@ -151,6 +133,24 @@ All five Toto 2.0 sizes share the same training recipe; pick a size based on you

  ---

+ ## ✨ Key Features
+
+ - **Zero-Shot Forecasting:** Forecast without fine-tuning on your specific time series.
+ - **Multi-Variate Support:** Efficiently process multiple variables using alternating time/variate attention.
+ - **Probabilistic Predictions:** Generate point forecasts and uncertainty estimates via a quantile output head.
+ - **Decoder-Only Architecture:** Support for variable prediction horizons and context lengths.
+ - **u-μP Scaling:** A single training recipe transfers cleanly across all five sizes (4M → 2.5B).
+
+ ---
+
+ ## 🏗️ Architecture
+
+ ![Overview of the Toto 2.0 architecture.](assets/architecture.png)
+
+ A decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds **contiguous patch masking (CPM)** for single-pass parallel decoding, a **quantile output head** trained with pinball loss, a robust arcsinh input scaler, residual MLP patch projections, and is trained with NorMuon. See the [technical report](#-additional-resources) for details.
+
+ ---
+
  ## 🔗 Additional Resources

  - **Technical Report** — *(coming soon)*