Update model card
README.md
---
license: other
language:
- en
- zh
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen
- qwen3
- math
- grpo
- reinforcement-learning
- on-policy-distillation
- full-finetuning
- reasoning
- safetensors
- arxiv:2604.13016
base_model: Qwen/Qwen3-4B-Base
base_model_relation: finetune
---

# Qwen3-4B-Base-GRPO

Qwen3-4B-Base-GRPO is a GRPO-trained model based on **Qwen3-4B-Base**, trained on the **DAPO-Math-17k-Processed** dataset for mathematical reasoning and problem-solving.

This model is associated with the paper **Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe** (https://arxiv.org/abs/2604.13016).
## Model Description

This model is obtained by full-parameter GRPO training from `Qwen3-4B-Base`. The training is designed to improve the model's performance on math-focused reasoning tasks under the on-policy distillation setting.

No learned reward model is used in this training run. Rewards are computed by a custom rule-based reward function for math evaluation: `verl/verl/utils/reward_score/ttrl_math/__init__.py`.
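
For illustration, a minimal sketch of what such a rule-based math reward can look like. This is an assumption about the general pattern, not the actual `ttrl_math` implementation; the function name `reward_func` matches this card's configuration and the signature follows verl's custom reward-function convention:

```python
import re

def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in `text`, if any."""
    # Simplified: does not handle nested braces inside \boxed{...}.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def reward_func(data_source, solution_str, ground_truth, extra_info=None):
    """Binary outcome reward: 1.0 if the extracted answer matches the label, else 0.0."""
    pred = extract_boxed(solution_str)
    if pred is None:
        return 0.0
    return 1.0 if pred == str(ground_truth).strip() else 0.0
```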

### Key characteristics

- **Base model**: Qwen3-4B-Base
- **Training stage**: GRPO
- **Finetuning type**: Full finetuning
- **Primary domain**: Mathematical reasoning
- **Reward model**: Not used (`reward_model.enable: false`)
- **Custom reward function**: `reward_func`
- **Rollout engine**: vLLM
- **Context length**: 32768 tokens
- **Responses per prompt**: 8

## Training Details

### Training configuration

- **Framework**: verl
- **Algorithm**: `grpo` (group-relative advantages; see the sketch after this list)
- **GRPO outcome weight**: `1.0`
- **Training dataset**: `DAPO-Math-17k-Processed`
- **Training file**: `datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet`
- **Validation datasets**: `AIME25`, `AMC23`, `AIME24`
- **Prompt length**: `1024`
- **Response length**: `7168`
- **Validation response length**: `31744`
- **Max model length**: `32768`
- **Rollout temperature**: `1.0`
- **Teacher temperature**: `1.0`
- **Repetition penalty**: `1.0`
- **Top-k log probability**: `0`
- **Top-k strategy**: `union`
- **Reward weight mode**: `student_p`
- **KL loss**: disabled
- **Format reward**: disabled
- **Loss aggregation**: `token-mean`
- **Learning rate**: `1e-6`
- **PPO mini-batch size**: `64`
- **PPO micro-batch size per GPU**: `1`
- **Tensor parallel size**: `1`
- **Number of GPUs**: `8`
- **Number of epochs**: `1`
- **Save frequency**: every `20` steps
- **Test frequency**: every `20` steps
- **Logging**: console and SwanLab
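
As a reference for the `grpo` advantage estimator, here is a minimal sketch of the standard group-relative advantage computation. It illustrates the algorithm under the settings above (8 responses per prompt), not verl's exact code:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for GRPO.

    rewards: (num_prompts, n_responses) scalar outcome rewards, e.g. the
    0/1 scores from a rule-based math reward like the sketch above.
    Each response's advantage is its reward normalized by the mean and
    std of its own prompt group; a degenerate group (all rewards equal)
    yields near-zero advantages thanks to the eps term.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts x 8 responses of binary outcome rewards.
rewards = torch.tensor([[1., 0., 0., 1., 1., 0., 0., 0.],
                        [1., 1., 1., 1., 1., 1., 1., 0.]])
print(grpo_advantages(rewards))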

### Dataset

- **Training dataset**: `DAPO-Math-17k-Processed` (see the loading sketch below)
- **Validation datasets**: `AIME25`, `AMC23`, `AIME24`
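
For reference, a minimal sketch of inspecting the training file with the `datasets` library, assuming the parquet file is available locally at the path used in this run:

```python
from datasets import load_dataset

# Load the local training parquet referenced by this card.
ds = load_dataset(
    "parquet",
    data_files="datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet",
    split="train",
)
print(ds[0])
```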

## Training Hyperparameters

For reproducibility, the core configuration is summarized below:

```bash
ACTOR_MODEL_PATH=model/Qwen3-4B-Base
ADV_ESTIMATOR=grpo
GRPO_OUTCOME_WEIGHT=1.0

TRAIN_DATASET=datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet
TRAIN_DATASET_NAME=DAPO-Math-17k-Processed
TEST_DATASET=[
  datasets/test_data/AIME25/test.parquet,
  datasets/test_data/AMC23/test.parquet,
  datasets/test_data/AIME24/test.parquet
]

MAX_PROMPT_LENGTH=1024
MAX_RESP_LENGTH=7168
MAX_VAL_RESP_LENGTH=31744
MAX_MODEL_LEN=32768

MINI_BATCH_SIZE=64
TEMPERATURE=1.0
TEACHER_TEMPERATURE=1.0
REPETITION_PENALTY=1.0
N_RESPONSES=8

LOG_PROB_TOP_K=0
TOP_K_STRATEGY=union
REWARD_WEIGHT_MODE=student_p

USE_KL=False
ENABLE_FORMAT_REWARD=False
MODEL_DTYPE=fp32
LOSS_AGG_MODE=token-mean

actor_rollout_ref.actor.optim.lr=1e-6
actor_rollout_ref.actor.ppo_mini_batch_size=64
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1
actor_rollout_ref.actor.use_dynamic_bsz=True
actor_rollout_ref.model.enable_gradient_checkpointing=True
actor_rollout_ref.model.enable_activation_offload=True

actor_rollout_ref.rollout.name=vllm
actor_rollout_ref.rollout.tensor_model_parallel_size=1
actor_rollout_ref.rollout.gpu_memory_utilization=0.8
actor_rollout_ref.rollout.n=8
actor_rollout_ref.rollout.val_kwargs.n=16
actor_rollout_ref.rollout.val_kwargs.temperature=1.0
actor_rollout_ref.rollout.val_kwargs.top_p=0.95

reward_model.enable=False
custom_reward_function.path=verl/verl/utils/reward_score/ttrl_math/__init__.py
custom_reward_function.name=reward_func

trainer.n_gpus_per_node=8
trainer.nnodes=1
trainer.total_epochs=1
trainer.save_freq=20
trainer.test_freq=20
```

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen3-4B-Base-GRPO"  # replace with the full Hub repo id

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
```
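
A minimal generation example (illustrative; the prompt is hypothetical, and the sampling settings mirror the validation rollout configuration above, `temperature=1.0` and `top_p=0.95`):

```python
prompt = "If 3x + 5 = 20, what is x? Put your final answer in \\boxed{}."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
)
# Decode only the newly generated tokens, skipping the prompt.
completion = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(completion, skip_special_tokens=True))
```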

## Citation

If you use this model, please consider citing the related paper:

```bibtex
@article{li2026rethinking,
  title={Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe},
  author={Li, Yaxuan and Zuo, Yuxin and He, Bingxiang and Zhang, Jinqian and Xiao, Chaojun and Qian, Cheng and Yu, Tianyu and Gao, Huan-ang and Yang, Wenkai and Liu, Zhiyuan and Ding, Ning},
  journal={arXiv preprint arXiv:2604.13016},
  year={2026}
}
```

Paper: https://arxiv.org/abs/2604.13016