Update model card
README.md CHANGED
@@ -25,8 +25,7 @@ base_model_relation: finetune
 # Qwen3-4B-Base-GRPO
 
 Qwen3-4B-Base-GRPO is a post-RL checkpoint trained with the **verl** framework.
-It starts from **Qwen3-4B-Base** and applies GRPO on the
-**DAPO-Math-17k-Processed** dataset for mathematical reasoning and problem-solving.
+It starts from **Qwen3-4B-Base** and applies GRPO on the **DAPO-Math-17k-Processed** dataset for mathematical reasoning and problem-solving.
 
 This model is associated with the paper:
 **Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe**

@@ -34,15 +33,8 @@ Paper link: https://arxiv.org/abs/2604.13016
 
 ## Model Description
 
-This model is obtained by applying GRPO reinforcement learning to
-`Qwen3-4B-Base` with verl. The training updates the actor model parameters and is
-intended to improve math-focused reasoning performance under the on-policy
-distillation setting.
+This model is obtained by applying GRPO reinforcement learning to `Qwen3-4B-Base` with verl. The training is intended to improve math-focused reasoning performance under the on-policy distillation setting.
 
-No learned reward model is used in this training run. In particular,
-`reward_model.enable` is set to `false`; rewards are computed by a custom
-rule-based reward function for math evaluation:
-`verl/verl/utils/reward_score/ttrl_math/__init__.py`.
 
 ### Key characteristics
 
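For readers unfamiliar with verl's rule-based reward hook, a minimal sketch of what a function like `reward_func` can look like follows. The signature matches verl's `custom_reward_function` interface, but the answer-extraction and matching logic below are illustrative assumptions, not the actual contents of `ttrl_math/__init__.py`.

```python
# Hypothetical rule-based math reward in the verl custom_reward_function
# style. The real ttrl_math implementation is not reproduced here.
import re
from typing import Optional


def _extract_boxed(text: str) -> Optional[str]:
    """Return the last \\boxed{...} payload, a common final-answer convention."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def reward_func(data_source, solution_str, ground_truth, extra_info=None):
    """Binary outcome reward: 1.0 if the extracted answer matches, else 0.0."""
    predicted = _extract_boxed(solution_str)
    if predicted is None:
        return 0.0
    return 1.0 if predicted == str(ground_truth).strip() else 0.0
```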

@@ -52,7 +44,6 @@ rule-based reward function for math evaluation:
 - **Parameter update**: Full-parameter actor update
 - **Primary domain**: Mathematical reasoning
 - **Reward model**: Not used (`reward_model.enable: false`)
-- **Custom reward function**: `reward_func`
 - **Rollout engine**: vLLM
 - **Context length**: 32768 tokens
 - **Responses per prompt**: 8
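The `Responses per prompt: 8` setting is what enables GRPO's group-relative advantage: the 8 sampled responses for each prompt form the group within which outcome rewards are normalized. A simplified sketch of that computation (not verl's actual code):

```python
# Simplified GRPO-style group-relative advantage: normalize outcome
# rewards within the group of rollouts sampled for one prompt.
import numpy as np


def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Example: 8 responses for one prompt with binary correctness rewards.
print(group_relative_advantages(np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])))
```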

@@ -75,8 +66,6 @@ rule-based reward function for math evaluation:
 - **Max model length**: `32768`
 - **Rollout temperature**: `1.0`
 - **Repetition penalty**: `1.0`
-- **Top-k log probability**: `0`
-- **Top-k strategy**: `union`
 - **KL loss**: disabled
 - **Format reward**: disabled
 - **Loss aggregation**: `token-mean`
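Of the settings above, `token-mean` aggregation averages the policy loss over all valid tokens in the batch rather than first averaging within each sequence, so long responses are not down-weighted. A toy contrast of the two modes (an illustration of the general idea, not verl's implementation):

```python
# Toy contrast between token-mean and seq-mean loss aggregation.
import numpy as np

loss = np.array([[0.5, 0.5, 0.0], [2.0, 0.0, 0.0]])  # per-token losses
mask = np.array([[1, 1, 0], [1, 0, 0]], dtype=float)  # valid-token mask

token_mean = (loss * mask).sum() / mask.sum()           # 3.0 / 3 = 1.0
seq_mean = ((loss * mask).sum(1) / mask.sum(1)).mean()  # (0.5 + 2.0) / 2 = 1.25
print(token_mean, seq_mean)
```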

@@ -88,79 +77,12 @@ rule-based reward function for math evaluation:
 - **Number of epochs**: `1`
 - **Save frequency**: every `20` steps
 - **Test frequency**: every `20` steps
-- **Logging**: console and SwanLab
 
 ### Dataset
 
 - **Training dataset**: `DAPO-Math-17k-Processed`
 - **Validation datasets**: `AIME25`, `AMC23`, `AIME24`
 
-## Training Hyperparameters
-
-For reproducibility, the core configuration is summarized below:
-
-```bash
-ACTOR_MODEL_PATH=model/Qwen3-4B-Base
-ADV_ESTIMATOR=grpo
-GRPO_OUTCOME_WEIGHT=1.0
-
-TRAIN_DATASET=datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet
-TRAIN_DATASET_NAME=DAPO-Math-17k-Processed
-TEST_DATASET=[
-datasets/test_data/AIME25/test.parquet,
-datasets/test_data/AMC23/test.parquet,
-datasets/test_data/AIME24/test.parquet
-]
-
-MAX_PROMPT_LENGTH=1024
-MAX_RESP_LENGTH=7168
-MAX_VAL_RESP_LENGTH=31744
-MAX_MODEL_LEN=32768
-
-MINI_BATCH_SIZE=64
-TEMPERATURE=1.0
-
-# TEACHER_TEMPERATURE and REWARD_WEIGHT_MODE are rollout/logit-control settings
-# from the training script. They do not indicate the use of a learned reward model.
-TEACHER_TEMPERATURE=1.0
-REPETITION_PENALTY=1.0
-N_RESPONSES=8
-
-LOG_PROB_TOP_K=0
-TOP_K_STRATEGY=union
-REWARD_WEIGHT_MODE=student_p
-
-USE_KL=False
-ENABLE_FORMAT_REWARD=False
-MODEL_DTYPE=fp32
-LOSS_AGG_MODE=token-mean
-
-actor_rollout_ref.actor.optim.lr=1e-6
-actor_rollout_ref.actor.ppo_mini_batch_size=64
-actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1
-actor_rollout_ref.actor.use_dynamic_bsz=True
-actor_rollout_ref.model.enable_gradient_checkpointing=True
-actor_rollout_ref.model.enable_activation_offload=True
-
-actor_rollout_ref.rollout.name=vllm
-actor_rollout_ref.rollout.tensor_model_parallel_size=1
-actor_rollout_ref.rollout.gpu_memory_utilization=0.8
-actor_rollout_ref.rollout.n=8
-actor_rollout_ref.rollout.val_kwargs.n=16
-actor_rollout_ref.rollout.val_kwargs.temperature=1.0
-actor_rollout_ref.rollout.val_kwargs.top_p=0.95
-
-reward_model.enable=False
-custom_reward_function.path=verl/verl/utils/reward_score/ttrl_math/__init__.py
-custom_reward_function.name=reward_func
-
-trainer.n_gpus_per_node=8
-trainer.nnodes=1
-trainer.total_epochs=1
-trainer.save_freq=20
-trainer.test_freq=20
-```
-
 ## Usage
 
 ```python
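The rollout settings in the removed block map onto vLLM sampling in a straightforward way. A hedged sketch of generating `N_RESPONSES=8` samples per prompt under those settings (the model path is the placeholder from the block above, and the prompt is illustrative):

```python
# Sketch: 8 rollouts per prompt with vLLM under the training-time sampling
# settings (temperature 1.0, repetition penalty 1.0, 7168-token responses).
from vllm import LLM, SamplingParams

llm = LLM(model="model/Qwen3-4B-Base", max_model_len=32768)
params = SamplingParams(n=8, temperature=1.0, repetition_penalty=1.0, max_tokens=7168)
outputs = llm.generate(["Compute 12 * 34."], params)
for seq in outputs[0].outputs:
    print(seq.text)
```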

@@ -187,6 +109,4 @@ If you use this model, please consider citing the related paper:
   journal={arXiv preprint arXiv:2604.13016},
   year={2026}
 }
-```
-
-Paper: https://arxiv.org/abs/2604.13016
+```