---
license: apache-2.0
tags:
- reasoning
- olympiad
- mathematics
- science
- reinforcement-learning
- test-time-scaling
- long-context
---

# SU-01: Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

A compact 30B-A3B reasoning model for rigorous mathematical and scientific olympiad problem solving.

<p align="center">
<img src="https://github.com/Simplified-Reasoning/SU-01/raw/main/page/source_html/simplex-pipeline-hires.png" alt="SU-01 training and inference pipeline" width="100%">
</p>

<p align="center">
<a href="http://arxiv.org/abs/2605.13301">
<img src="https://img.shields.io/badge/Technical_Report-A42C25?style=for-the-badge&logo=arxiv&logoColor=white" alt="Technical Report">
</a>
<a href="https://simplified-reasoning.github.io/SU-01/">
<img src="https://img.shields.io/badge/Project_Page-4285F4?style=for-the-badge&logo=googlechrome&logoColor=white" alt="Project Page">
</a>
<a href="https://github.com/Simplified-Reasoning/SU-01">
<img src="https://img.shields.io/badge/SU--01-000000?style=for-the-badge&logo=github&logoColor=white" alt="GitHub">
</a>
<a href="https://huggingface.co/Simplified-Reasoning/SU-01">
<img src="https://img.shields.io/badge/SU--01-fcd022?style=for-the-badge&logo=huggingface&logoColor=000" alt="Hugging Face Model">
</a>
</p>


<div align="center">
<p>
<a href="#introduction" style="text-decoration: none;">📖 Introduction</a> •
<a href="#key-highlights" style="text-decoration: none;">🏆 Key Highlights</a>
</p>
<p>
<a href="#getting-started" style="text-decoration: none;">🚀 Getting Started</a> •
<a href="#training-code" style="text-decoration: none;">🔧 Training Code</a> •
<a href="#test-time-scaling" style="text-decoration: none;">🧪 Test-Time Scaling</a> •
<a href="#evaluation" style="text-decoration: none;">📊 Evaluation</a>
</p>
<p>
<a href="#acknowledgement" style="text-decoration: none;">✨ Acknowledgement</a> •
<a href="#citation" style="text-decoration: none;">📝 Citation</a>
</p>
</div>


---

<a id="introduction"></a>
# 📖 Introduction


**SU-01** is a 30B-A3B olympiad reasoning model trained with a simple and unified post-training recipe for mathematical and scientific problem solving. The goal is to turn a broadly capable post-trained reasoning backbone into a rigorous long-horizon proof solver without relying on external tools, code execution, or dedicated symbolic solvers.

The recipe first applies **reverse-perplexity curriculum SFT** on roughly **338K sub-8K-token** trajectories to install explicit, proof-oriented reasoning behavior. It then uses **200 steps of two-stage reinforcement learning** to improve both answer-seeking ability and complete-proof quality. Finally, SU-01 uses a multi-round **generate-verify-revise** loop at inference time, enabling coherent natural-language reasoning trajectories beyond **100K tokens** on difficult olympiad problems.
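As a concrete illustration, the reverse-perplexity ordering can be sketched in a few lines of Python. Everything here is illustrative rather than the released training code: `nll_under_policy` is a hypothetical scorer that would return the current policy's mean negative log-likelihood per token for a trajectory.

```python
import math

def reverse_ppl_order(trajectories, nll_under_policy):
    """Order SFT trajectories by descending perplexity under the
    current policy, so each epoch starts with the teacher
    trajectories the policy finds most surprising."""
    # PPL = exp(mean per-token NLL); sorting by NLL or PPL is equivalent,
    # but we compute PPL explicitly to mirror the curriculum's definition.
    return sorted(
        trajectories,
        key=lambda t: math.exp(nll_under_policy(t)),
        reverse=True,
    )

# Toy stand-in: a lookup table plays the role of a policy forward pass.
nll = {"easy_proof": 0.2, "medium_proof": 0.9, "hard_proof": 2.1}
epoch_order = reverse_ppl_order(list(nll), nll.get)
print(epoch_order)  # hardest (highest-PPL) trajectory comes first
```

In a real pipeline the scores would be recomputed against the current checkpoint at each epoch boundary, so the ordering adapts as the policy absorbs the easier trajectories.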

In competition-style evaluations, test-time scaling brings SU-01 to **35 points on IMO 2025** and **35 points on USAMO 2026**, reaching gold-medal-level performance. SU-01 also exceeds the gold cutoff on **IPhO 2024/2025** and substantially improves over similarly sized models on proof-level benchmarks such as **IMO-ProofBench**.

---

<a id="key-highlights"></a>
# 🏆 Key Highlights


- **Reverse-perplexity curriculum SFT**: sorts long-CoT training examples by descending PPL within each epoch, exposing the model first to teacher trajectories most mismatched with the current policy.
- **Two-stage RL**: starts with verifiable-reward training for answer-seeking behavior, then shifts to proof-quality optimization with self-refinement and experience replay.
- **Long-horizon proof repair**: uses iterative generation, verification, issue localization, and refinement to produce complete olympiad-style solutions.
- **Gold-medal-level results**: reaches 35 points on both IMO 2025 and USAMO 2026 with test-time scaling, and passes IPhO 2024/2025 gold lines.

<p align="center">
<img src="https://github.com/Simplified-Reasoning/SU-01/raw/main/page/source_png/proofbench_overall.png" alt="SU-01 ProofBench overview" width="80%">
</p>

### Gold-Medal Competition Results

#### IMO 2025

| **Model** | **P1** | **P2** | **P3** | **P4** | **P5** | **P6** | **Total** |
|-----------|-------:|-------:|-------:|-------:|-------:|-------:|----------:|
| SU-01 | 1 | 7 | 1 | 6 | 6 | 0 | 21 |
| **SU-01 w/ TTS** | **7*** | **7*** | **7*** | **7*** | **7*** | **0*** | **35*** |

#### USAMO 2026

| **Model** | **P1** | **P2** | **P3** | **P4** | **P5** | **P6** | **Total** |
|-----------|-------:|-------:|-------:|-------:|-------:|-------:|----------:|
| SU-01 | 7 | 0 | 0 | 7 | 0 | 1 | 15 |
| **SU-01 w/ TTS** | **7*** | **0*** | **7*** | **7*** | **7*** | **7*** | **35*** |

`*` denotes results graded by human experts. Medal lines for IMO 2025 are 35/28/19 points for gold/silver/bronze, and medal lines for USAMO 2026 are 25/18/11 points.

---

<a id="getting-started"></a>
# 🚀 Getting Started

This Hugging Face repository hosts the **model weights**. The training and evaluation code is maintained in the GitHub repository:

- GitHub repo: [Simplified-Reasoning/SU-01](https://github.com/Simplified-Reasoning/SU-01)
- Training code: [su01-train-slime](https://github.com/Simplified-Reasoning/SU-01/tree/main/su01-train-slime)
- Evaluation code: [su01-eval](https://github.com/Simplified-Reasoning/SU-01/tree/main/su01-eval)

### Installation

Clone the code repository:

```bash
git clone https://github.com/Simplified-Reasoning/SU-01.git
cd SU-01
```

The project uses the [`slimerl/slime:nightly-dev-20260202c`](https://hub.docker.com/layers/slimerl/slime/nightly-dev-20260202c) Docker image.

```bash
docker pull slimerl/slime:nightly-dev-20260202c

docker run --gpus all --ipc=host --network=host -it \
  -v "$PWD":/workspace/SU-01 \
  -w /workspace/SU-01/su01-train-slime \
  slimerl/slime:nightly-dev-20260202c \
  /bin/bash
```

Inside the container, install the local training package:

```bash
pip install -e . --no-deps --no-index --disable-pip-version-check --no-build-isolation
```

Adjust cluster mounts, model paths, data paths, Ray environment variables, and reward-server URLs according to your infrastructure.

---

<a id="training-code"></a>
# 🔧 Training Code

The released training code contains the three major training stages used by SU-01:

```text
su01-train-slime/scripts
├── sft.sh        # Stage 1: reverse-perplexity curriculum SFT
├── coarse_rl.sh  # Stage 2: coarse RL with verifiable rewards
└── refined_rl.sh # Stage 3: refined RL with proof rewards, self-refinement, and experience replay
```

GitHub links:

- [`su01-train-slime/scripts`](https://github.com/Simplified-Reasoning/SU-01/tree/main/su01-train-slime/scripts)
- [`sft.sh`](https://github.com/Simplified-Reasoning/SU-01/blob/main/su01-train-slime/scripts/sft.sh)
- [`coarse_rl.sh`](https://github.com/Simplified-Reasoning/SU-01/blob/main/su01-train-slime/scripts/coarse_rl.sh)
- [`refined_rl.sh`](https://github.com/Simplified-Reasoning/SU-01/blob/main/su01-train-slime/scripts/refined_rl.sh)


---

<a id="test-time-scaling"></a>
# 🧪 Test-Time Scaling

SU-01 uses a model-internal verification-and-refinement loop:

1. Generate an initial complete solution.
2. Verify the full proof and produce a structured critique or bug report.
3. Refine the solution conditioned on the critique.
4. Repeat until the solution is accepted or the refinement budget is exhausted.

This expands the model's own natural-language proof-search computation rather than calling an external theorem prover, symbolic solver, or code executor. In the reported USAMO 2026 TTS traces, initial solution generations have a median length of approximately **106K tokens**, while refinement stages have a median length of approximately **83K tokens**.
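The four steps above can be sketched as a plain Python loop. This is a minimal sketch, not the released decode implementation: `generate`, `verify`, and `refine` stand in for calls to the model, and the stubs below only simulate its behavior.

```python
def generate_verify_revise(problem, generate, verify, refine, budget=4):
    """Model-internal test-time scaling loop: draft a full solution,
    verify it to obtain a critique, then refine conditioned on that
    critique until acceptance or the refinement budget is spent."""
    solution = generate(problem)
    for _ in range(budget):
        accepted, critique = verify(problem, solution)
        if accepted:
            return solution
        solution = refine(problem, solution, critique)
    return solution  # best effort after exhausting the budget

# Toy stubs: the "model" fixes one flagged issue per refinement round.
issues = ["gap in case 2", "unjustified bound"]
gen = lambda p: {"text": "draft", "issues": list(issues)}
ver = lambda p, s: (not s["issues"], s["issues"][:1])
ref = lambda p, s, c: {"text": s["text"] + "+fix", "issues": s["issues"][1:]}

final = generate_verify_revise("P3", gen, ver, ref)
print(final["text"])  # "draft+fix+fix": accepted after two revisions
```

In the real system each of these calls is a long model generation (the 106K/83K median lengths quoted above), so the budget directly controls total test-time compute.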

The released TTS implementation is in [`su01-eval/decode`](https://github.com/Simplified-Reasoning/SU-01/tree/main/su01-eval/decode), including direct decoding, TTS decoding, batch decoding, and SGLang server helpers. See [`su01-eval/decode/README.md`](https://github.com/Simplified-Reasoning/SU-01/blob/main/su01-eval/decode/README.md) for launch commands, input layout, decoding options, and smoke tests.

<p align="center">
<img src="https://github.com/Simplified-Reasoning/SU-01/raw/main/page/source_png/tts_action_length_distribution_1.png" alt="Test-time scaling action length distribution" width="80%">
</p>

---

<a id="evaluation"></a>
# 📊 Evaluation

Evaluation code is released under [`su01-eval`](https://github.com/Simplified-Reasoning/SU-01/tree/main/su01-eval). Use [`su01-eval/decode`](https://github.com/Simplified-Reasoning/SU-01/tree/main/su01-eval/decode) to generate direct or TTS predictions, and use [`su01-eval/verifiable_bench`](https://github.com/Simplified-Reasoning/SU-01/tree/main/su01-eval/verifiable_bench) to score answer-verifiable benchmarks and FrontierScience-Olympiad predictions.

See [`su01-eval/decode/README.md`](https://github.com/Simplified-Reasoning/SU-01/blob/main/su01-eval/decode/README.md) and [`su01-eval/verifiable_bench/README.md`](https://github.com/Simplified-Reasoning/SU-01/blob/main/su01-eval/verifiable_bench/README.md) for commands, input formats, output formats, and configuration options.

### Table 1: Performance on Answer-Verifiable Reasoning Tasks

AnswerBench, AMO-Bench, AIME 25/26, and FrontierScience-Olympiad are averaged over 4, 8, 8, and 4 runs, respectively. Avg. is the mean of AnswerBench, AMO-Bench, AIME 2025, AIME 2026, and FrontierScience-Olympiad.

| **Model** | **AnswerBench** | **AMO-Bench** | **AIME 25/26** | **FS-O Physics** | **FS-O Chemistry** | **FS-O Biology** | **FS-O Overall** | **Avg.** |
|-----------|----------------:|--------------:|---------------:|-----------------:|-------------------:|-----------------:|-----------------:|---------:|
| P1-30B-A3B | 69.3% | 41.3% | 90.4% / 89.6% | 57.5% | 57.5% | 27.5% | 54.5% | 69.0% |
| GLM-4.7-Flash | 73.8% | 53.8% | 91.3% / 88.3% | 54.5% | 60.0% | 17.5% | 53.0% | 72.0% |
| Nemotron-Cascade-2 | **80.5%** | 40.8% | 94.2% / 90.0% | 56.0% | 56.3% | **30.0%** | 53.5% | 71.8% |
| Qwen3.6-35B-A3B | 78.0% | 58.8% | 92.5% / 92.9% | 65.5% | **74.4%** | 25.0% | **65.0%** | **77.4%** |
| Gemma-4-31B | 74.0% | 39.3% | 88.8% / 91.3% | **69.0%** | 61.9% | 27.5% | 61.0% | 70.9% |
| **SU-01** | 77.5% | **59.8%** | **94.6%** / **93.3%** | 62.5% | 69.4% | 25.0% | 61.5% | 77.3% |

### Table 2: Performance on Non-Verifiable Benchmarks

FrontierScience-Research refers to the research subset of FrontierScience. For SU-01, `x/y` reports scores without and with TTS on IMO-ProofBench.

| **Model** | **ProofBench Basic** | **ProofBench Advanced** | **ProofBench Overall** | **FS-R Physics** | **FS-R Chemistry** | **FS-R Biology** | **FS-R Overall** |
|-----------|---------------------:|------------------------:|-----------------------:|-----------------:|-------------------:|-----------------:|-----------------:|
| Gemini 3.1 Pro Thinking | 95.2% | 50.0% | 72.6% | 0.0% | 30.0% | 10.0% | 13.3% |
| GPT-5.5-High | **96.7%** | **64.8%** | **80.7%** | **25.0%** | **40.0%** | **45.0%** | **36.7%** |
| DeepSeek-V3.2-Speciale | 77.6% | 34.3% | 56.0% | 10.0% | 20.0% | 15.0% | 15.0% |
| P1-30B-A3B | 33.8% | 6.2% | 20.0% | 0.0% | **10.0%** | 0.0% | 3.3% |
| GLM-4.7-Flash | 51.0% | 16.7% | 33.8% | 0.0% | 0.0% | 0.0% | 0.0% |
| Nemotron-Cascade-2 | 77.1% | 28.6% | 52.9% | 5.0% | 5.0% | **20.0%** | 10.0% |
| Qwen3.6-35B-A3B | 39.1% | 7.1% | 23.1% | 0.0% | 5.0% | 10.0% | 5.0% |
| Gemma-4-31B | 46.7% | 16.2% | 31.4% | 0.0% | **10.0%** | 5.0% | 5.0% |
| **SU-01** | 77.1% / **91.0%** | 38.1% / **49.5%** | 57.6% / **70.2%** | **10.0%** | **10.0%** | 15.0% | **11.7%** |

### Table 3: Performance on Olympiad Competition Problems

For IPhO, `x/y` reports scores without and with TTS. Gold lines for IPhO 2024/2025 are 20.8/19.7 points. Medal lines for IMO 2025 are 35/28/19 points, and medal lines for USAMO 2026 are 25/18/11 points.

#### IPhO 2024/2025

| **Model** | **IPhO 2024** | **IPhO 2025** |
|-----------|--------------:|--------------:|
| P1-30B-A3B | 23.1 | 17.7 |
| GLM-4.7-Flash | 22.2 | 19.5 |
| Nemotron-Cascade-2 | 21.2 | 16.7 |
| Qwen3.6-35B-A3B | 24.3 | 19.9 |
| Gemma-4-31B | 24.4 | 20.3 |
| **SU-01** | 23.5 / **25.3** | 20.3 / **21.7** |

#### IMO 2025

| **Model** | **P1** | **P2** | **P3** | **P4** | **P5** | **P6** | **Total** |
|-----------|-------:|-------:|-------:|-------:|-------:|-------:|----------:|
| SU-01 | 1 | 7 | 1 | 6 | 6 | 0 | 21 |
| **SU-01 w/ TTS** | **7*** | **7*** | **7*** | **7*** | **7*** | **0*** | **35*** |

#### USAMO 2026

| **Model** | **P1** | **P2** | **P3** | **P4** | **P5** | **P6** | **Total** |
|-----------|-------:|-------:|-------:|-------:|-------:|-------:|----------:|
| SU-01 | 7 | 0 | 0 | 7 | 0 | 1 | 15 |
| **SU-01 w/ TTS** | **7*** | **0*** | **7*** | **7*** | **7*** | **7*** | **35*** |

`*` denotes TTS results graded by human experts.

---

<a id="acknowledgement"></a>
# ✨ Acknowledgement

This work was supported by the Shanghai Artificial Intelligence Laboratory.

We thank the authors and maintainers of prior open research and infrastructure that made this work possible. In particular, we are grateful to DeepSeek for open-sourcing strong reasoning policies and generative reward models, which provided an important reference point for our work. IMO-Bench, AMO-Bench, and FrontierScience helped guide the overall system optimization by offering challenging mathematical and scientific reasoning benchmarks and evaluation protocols.

We also thank prior data efforts that supported our SFT and RL data curation, including DeepMath, NaturalReasoning, Eurus, OpenCodeReasoning, P1, and OPC, as well as the many public problem sources and communities that cannot all be listed here. We further acknowledge the broader open-source infrastructure ecosystem, including slime for training and SGLang for efficient inference and serving.

---

<a id="citation"></a>
# 📝 Citation

If you find SU-01 useful, please cite the project:

```bibtex
@misc{su012026,
  title={Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling},
  author={Yafu Li and Runzhe Zhan and Haoran Zhang and Shunkai Zhang and Yizhuo Li and Zhilin Wang and Jiacheng Chen and Futing Wang and Xuyang Hu and Yuchen Fan and Bangjie Xu and Yucheng Su and Xinmiao Han and Chenxi Li and Haodi Lei and Yufeng Zhao and Zejin Lin and Qianjia Cheng and Tong Zhu and Xiaoye Qu and Ganqu Cui and Peng Ye and Yun Luo and Zhouchen Lin and Yu Qiao and Bowen Zhou and Ning Ding and Yu Cheng},
  year={2026},
  url={http://arxiv.org/abs/2605.13301}
}
```