JLiangHe committed on
Commit d01a42d · verified · 1 Parent(s): d2dfeba

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +30 -11
README.md CHANGED
@@ -1,18 +1,37 @@
  # OPSD Experiment Results
 
- Reproduction of [OPSD (On-Policy Self-Distillation)](https://github.com/siyan-zhao/OPSD) experiments on **Qwen3-4B**.
 
- ## Methods Compared
- - **Base**: Qwen3-4B without fine-tuning
- - **SFT**: Supervised fine-tuning on OpenThoughts Math 30K
- - **GRPO**: Group Relative Policy Optimization with correctness reward
- - **OPSD**: On-Policy Self-Distillation (proposed method)
 
- ## Shared Config
- - LoRA: r=64, alpha=128
- - Learning rate: 5e-6
  - Data: siyanzhao/Openthoughts_math_30k_opsd
- - Eval: temperature=1.0, thinking mode, val_n=12
 
  ## Reference
- Paper: [Self-Distilled Reasoner](https://arxiv.org/pdf/2601.18734v3)
1
  # OPSD Experiment Results
 
+ Reproduction of [OPSD (On-Policy Self-Distillation)](https://github.com/siyan-zhao/OPSD) on Qwen3-1.7B, 4B, and 8B.
 
+ ## Results (Avg@12)
 
+ ### Qwen3-1.7B
+ | Method | AIME24 | AIME25 | HMMT25 |
+ |--------|:------:|:------:|:------:|
+ | Base | 47.2% | 35.3% | 21.9% |
+ | OPSD (best) | **49.2%** | **37.5%** | **24.4%** |
+ | SFT (best) | 37.5% | 30.8% | 19.2% |
+ | GRPO (best) | 47.8% | 35.0% | 22.8% |
+
+ ### Qwen3-4B
+ | Method | AIME24 | AIME25 | HMMT25 |
+ |--------|:------:|:------:|:------:|
+ | Base | **71.1%** | 60.0% | 38.6% |
+ | OPSD (best) | 62.2% | 57.2% | 34.2% |
+ | SFT (best) | 62.5% | 58.1% | 33.3% |
+ | GRPO (best) | 68.9% | **65.0%** | **41.9%** |
+
+ ### Qwen3-8B
+ | Method | AIME24 | AIME25 | HMMT25 |
+ |--------|:------:|:------:|:------:|
+ | Base | **72.8%** | 61.7% | 38.6% |
+ | OPSD (best) | 69.4% | 63.3% | 38.6% |
+ | SFT (best) | 69.2% | 60.3% | 36.1% |
+ | GRPO (best) | 72.2% | **65.8%** | **40.8%** |
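Per the eval settings in the updated README (val_n=12, temperature=1.0), Avg@12 is presumably the mean correctness over 12 sampled completions per problem, averaged across problems. A minimal sketch of that metric (function and variable names are hypothetical, not from the repo):

```python
# Hypothetical sketch of the Avg@12 metric: for each problem, score the
# n=12 sampled completions for correctness, then average across problems.
def avg_at_n(per_problem_correct: list[list[bool]]) -> float:
    """per_problem_correct[i] holds the correctness of the n samples
    drawn for problem i (here n = 12)."""
    per_problem_rates = [sum(samples) / len(samples) for samples in per_problem_correct]
    return sum(per_problem_rates) / len(per_problem_rates)

# Two problems, 12 samples each: 6/12 and 12/12 correct.
scores = [[True] * 6 + [False] * 6, [True] * 12]
print(round(avg_at_n(scores), 2))  # → 0.75
```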
+
+ ## Setup
+ - All methods: lr=5e-6, BS=32, LoRA r=64, alpha=128, 200 steps
+ - Eval: val_n=12, temperature=1.0, thinking mode enabled
  - Data: siyanzhao/Openthoughts_math_30k_opsd
 
  ## Reference
+ [Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs](https://arxiv.org/pdf/2601.18734v3)
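The shared hyperparameters from the Setup section can be collected into a single config sketch. The dict keys below are illustrative, not the repo's actual config schema; only the values come from the README. The alpha/r ratio shown is the standard LoRA update scaling factor.

```python
# Illustrative config dict gathering the shared hyperparameters listed in
# the README's Setup section (key names are assumptions, values are not).
shared_config = {
    "lora_r": 64,
    "lora_alpha": 128,
    "learning_rate": 5e-6,
    "batch_size": 32,
    "max_steps": 200,
    "dataset": "siyanzhao/Openthoughts_math_30k_opsd",
    "eval": {"val_n": 12, "temperature": 1.0, "thinking_mode": True},
}

# LoRA scales its weight update by alpha / r; with r=64 and alpha=128
# that scaling factor is 2.0.
scaling = shared_config["lora_alpha"] / shared_config["lora_r"]
print(scaling)  # → 2.0
```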