Clarify verl RL model card
README.md CHANGED
@@ -10,9 +10,11 @@ tags:
 - qwen3
 - math
 - grpo
+- verl
+- rl
 - reinforcement-learning
 - on-policy-distillation
-- full-
+- full-parameter-rl
 - reasoning
 - safetensors
 - arxiv:2604.13016
@@ -22,7 +24,8 @@ base_model_relation: finetune
 
 # Qwen3-4B-Base-GRPO
 
-Qwen3-4B-Base-GRPO is a
+Qwen3-4B-Base-GRPO is a post-RL checkpoint trained with the **verl** framework.
+It starts from **Qwen3-4B-Base** and applies GRPO on the
 **DAPO-Math-17k-Processed** dataset for mathematical reasoning and problem-solving.
 
 This model is associated with the paper:
@@ -31,19 +34,22 @@ Paper link: https://arxiv.org/abs/2604.13016
 
 ## Model Description
 
-This model is obtained by
-
-reasoning
+This model is obtained by applying GRPO reinforcement learning to
+`Qwen3-4B-Base` with verl. The training updates the actor model parameters and is
+intended to improve math-focused reasoning performance under the on-policy
+distillation setting.
 
-No learned reward model is used in this training run.
-
+No learned reward model is used in this training run. In particular,
+`reward_model.enable` is set to `false`; rewards are computed by a custom
+rule-based reward function for math evaluation:
 `verl/verl/utils/reward_score/ttrl_math/__init__.py`.
 
 ### Key characteristics
 
 - **Base model**: Qwen3-4B-Base
-- **Training
-- **
+- **Training framework**: verl
+- **Training stage**: Reinforcement Learning (GRPO)
+- **Parameter update**: Full-parameter actor update
 - **Primary domain**: Mathematical reasoning
 - **Reward model**: Not used (`reward_model.enable: false`)
 - **Custom reward function**: `reward_func`
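The card's scorer lives at `verl/verl/utils/reward_score/ttrl_math/__init__.py` and is exposed as `reward_func`. As a rough sketch of what a rule-based boxed-answer scorer can look like (the signature follows verl's custom `compute_score` reward hook; the body is an illustrative assumption, not the repository's code):

```python
# Illustrative sketch only: the card's real scorer is in
# verl/verl/utils/reward_score/ttrl_math/__init__.py; this body is an assumption.
import re

def extract_boxed(text: str) -> str | None:
    """Return the last \\boxed{...} payload in a completion, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def compute_score(data_source: str, solution_str: str,
                  ground_truth: str, extra_info: dict | None = None) -> float:
    """Binary outcome reward: 1.0 if the final boxed answer matches, else 0.0."""
    pred = extract_boxed(solution_str)
    if pred is None:
        return 0.0
    # Rule-based string comparison; real math scorers normalize more
    # aggressively (fractions, LaTeX spacing, symbolic equivalence, etc.).
    return 1.0 if pred.replace(" ", "") == str(ground_truth).replace(" ", "") else 0.0
```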
@@ -58,6 +64,8 @@ custom rule-based reward function for math evaluation:
 - **Framework**: verl
 - **Algorithm**: `grpo`
 - **GRPO outcome weight**: `1.0`
+- **Learned reward model**: disabled (`reward_model.enable: false`)
+- **Reward source**: custom rule-based math reward function
 - **Training dataset**: `DAPO-Math-17k-Processed`
 - **Training file**: `datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet`
 - **Validation datasets**: `AIME25`, `AMC23`, `AIME24`
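For readers new to the `grpo` algorithm row above: GRPO samples a group of responses per prompt and uses the group-normalized outcome reward as a critic-free advantage, which is why no learned reward or value model is needed. A minimal sketch of that normalization, with illustrative names rather than verl's internals:

```python
# Minimal sketch of GRPO's group-relative advantage: normalize each response's
# outcome reward against the other responses sampled for the same prompt.
# Names here are illustrative, not verl's internal API.
from statistics import mean, pstdev

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# With N_RESPONSES=8 rollouts per prompt and a binary math reward:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]))
```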
@@ -66,11 +74,9 @@ custom rule-based reward function for math evaluation:
 - **Validation response length**: `31744`
 - **Max model length**: `32768`
 - **Rollout temperature**: `1.0`
-- **Teacher temperature**: `1.0`
 - **Repetition penalty**: `1.0`
 - **Top-k log probability**: `0`
 - **Top-k strategy**: `union`
-- **Reward weight mode**: `student_p`
 - **KL loss**: disabled
 - **Format reward**: disabled
 - **Loss aggregation**: `token-mean`
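The `token-mean` setting averages the loss over every valid response token in the batch, instead of averaging within each sequence first, so long responses contribute proportionally more tokens. A small illustrative sketch of the two choices (not verl's implementation):

```python
# Sketch of token-mean vs. seq-mean loss aggregation over masked response tokens.
# Illustrative only; verl's trainers implement this over real logprob tensors.
import torch

def token_mean(loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # One global average over every valid token in the batch.
    return (loss * mask).sum() / mask.sum().clamp(min=1)

def seq_mean(loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Average within each sequence first, then across sequences.
    per_seq = (loss * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)
    return per_seq.mean()

loss = torch.rand(2, 5)                         # per-token losses (batch, seq)
mask = torch.tensor([[1, 1, 1, 1, 1],
                     [1, 1, 0, 0, 0]]).float()  # second response is shorter
print(token_mean(loss, mask), seq_mean(loss, mask))
```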
@@ -113,6 +119,9 @@ MAX_MODEL_LEN=32768
 
 MINI_BATCH_SIZE=64
 TEMPERATURE=1.0
+
+# TEACHER_TEMPERATURE and REWARD_WEIGHT_MODE are rollout/logit-control settings
+# from the training script. They do not indicate the use of a learned reward model.
 TEACHER_TEMPERATURE=1.0
 REPETITION_PENALTY=1.0
 N_RESPONSES=8
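As a hedged sketch of how script variables like these typically reach verl, the snippet below composes hydra-style overrides for a launch command. Every config key is an assumption about a common verl GRPO recipe, not the card's actual launch script, and `TEACHER_TEMPERATURE`/`REWARD_WEIGHT_MODE` come from the paper's own distillation code rather than stock verl:

```python
# Sketch: composing a verl launch command from the script's environment
# variables. Config keys are assumptions based on common verl GRPO recipes;
# check the pinned verl version before relying on any of them.
import os
import shlex

overrides = {
    "algorithm.adv_estimator": "grpo",
    "reward_model.enable": "False",  # no learned reward model, per the card
    "actor_rollout_ref.actor.ppo_mini_batch_size": os.getenv("MINI_BATCH_SIZE", "64"),
    "actor_rollout_ref.rollout.temperature": os.getenv("TEMPERATURE", "1.0"),
    "actor_rollout_ref.rollout.n": os.getenv("N_RESPONSES", "8"),
}
cmd = ["python3", "-m", "verl.trainer.main_ppo"] + [f"{k}={v}" for k, v in overrides.items()]
print(shlex.join(cmd))  # review first; run via subprocess.run(cmd) if it matches your setup
```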