# OPSD Experiment Results

Reproduction of [OPSD (On-Policy Self-Distillation)](https://github.com/siyan-zhao/OPSD) on Qwen3-1.7B, 4B, and 8B.

## Methods

- **Base**: the pretrained model without fine-tuning
- **SFT**: supervised fine-tuning on OpenThoughts Math 30K
- **GRPO**: Group Relative Policy Optimization with a correctness reward
- **OPSD**: On-Policy Self-Distillation (the proposed method)

## Results (Avg@12)

### Qwen3-1.7B

| Method | AIME24 | AIME25 | HMMT25 |
|--------|:------:|:------:|:------:|
| Base | 47.2% | 35.3% | 21.9% |
| OPSD (best) | **49.2%** | **37.5%** | **24.4%** |
| SFT (best) | 37.5% | 30.8% | 19.2% |
| GRPO (best) | 47.8% | 35.0% | 22.8% |

### Qwen3-4B

| Method | AIME24 | AIME25 | HMMT25 |
|--------|:------:|:------:|:------:|
| Base | **71.1%** | 60.0% | 38.6% |
| OPSD (best) | 62.2% | 57.2% | 34.2% |
| SFT (best) | 62.5% | 58.1% | 33.3% |
| GRPO (best) | 68.9% | **65.0%** | **41.9%** |

### Qwen3-8B

| Method | AIME24 | AIME25 | HMMT25 |
|--------|:------:|:------:|:------:|
| Base | **72.8%** | 61.7% | 38.6% |
| OPSD (best) | 69.4% | 63.3% | 38.6% |
| SFT (best) | 69.2% | 60.3% | 36.1% |
| GRPO (best) | 72.2% | **65.8%** | **40.8%** |
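
Avg@12 here means accuracy averaged over 12 sampled completions per problem (val_n=12). A minimal sketch of how such a metric is computed, assuming per-problem correctness flags are already available; the function and variable names are illustrative, not taken from the OPSD codebase:

```python
from statistics import mean

def avg_at_k(correct_flags):
    """Mean accuracy over k sampled completions per problem.

    correct_flags: list with one inner list per problem, each holding
    k booleans (True = the sampled answer was graded correct).
    """
    per_problem = [mean(1.0 if c else 0.0 for c in flags) for flags in correct_flags]
    return mean(per_problem)

# Toy example: 2 problems, k=4 samples each.
flags = [
    [True, True, False, True],    # 3/4 correct -> 0.75
    [False, True, False, False],  # 1/4 correct -> 0.25
]
print(avg_at_k(flags))  # -> 0.5
```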

## Setup

- All methods: lr=5e-6, batch size 32, LoRA (r=64, alpha=128), 200 training steps
- Eval: val_n=12, temperature=1.0, thinking mode enabled
- Data: siyanzhao/Openthoughts_math_30k_opsd
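
The adapter settings above (r=64, alpha=128) can be expressed with Hugging Face `peft`. A sketch under assumptions: the `target_modules` list and task type are common choices for Qwen-style decoder layers, not values confirmed from the OPSD repo:

```python
from peft import LoraConfig

# Assumed adapter config matching the listed hyperparameters (r=64, alpha=128).
# target_modules is a guess (typical attention projections); check the OPSD
# repo for the exact layers it adapts.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```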

## Reference

[Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs](https://arxiv.org/pdf/2601.18734v3)