lllyx committed
Commit 141d1d0 · verified · 1 Parent(s): 10de68f

Update model card

Files changed (1)
  1. README.md +155 -45
README.md CHANGED
@@ -1,63 +1,158 @@
 ---
 license: other
-library_name: transformers
-pipeline_tag: text-generation
-base_model:
-- Qwen/Qwen3-4B-Base
-base_model_relation: finetune
 language:
 - en
 - zh
+library_name: transformers
+pipeline_tag: text-generation
 tags:
-- safetensors
-- qwen3
 - qwen
+- qwen3
+- math
 - grpo
 - reinforcement-learning
+- on-policy-distillation
+- full-finetuning
 - reasoning
-- conversational
-- math
+- safetensors
 - arxiv:2604.13016
+base_model: Qwen/Qwen3-4B-Base
+base_model_relation: finetune
 ---

 # Qwen3-4B-Base-GRPO

-This repository contains a Qwen3-4B-Base GRPO checkpoint for the collection
-[Rethinking OPD](https://huggingface.co/collections/lllyx/rethinking-opd).
-
-The model is provided in `safetensors` format and can be loaded with
-`transformers`.
-
-## Training Configuration
-
-This checkpoint was trained with the GRPO recipe used in the Rethinking OPD
-experiments.
-
-| Setting | Value |
-| --- | --- |
-| Actor initialization | `model/Qwen3-4B-Base` |
-| Reward/teacher model | `model/Qwen3-4B` |
-| Training data | `datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet` |
-| Validation data | `AIME25`, `AMC23`, `AIME24` |
-| Advantage estimator | `grpo` |
-| GRPO outcome weight | `1.0` |
-| Rollout correction | token-level IS, threshold `2.0` |
-| Prompt length | `1024` |
-| Response length | `7168` |
-| Validation response length | `31744` |
-| Max model length | `32768` |
-| Responses per prompt | `8` |
-| Rollout temperature | `1.0` |
-| Teacher temperature | `1.0` |
-| Repetition penalty | `1.0` |
-| PPO mini-batch size | `64` |
-| Learning rate | `1e-6` |
-| KL loss | disabled |
-| Format reward | disabled |
-| Loss aggregation | `token-mean` |
-| Rollout engine | `vllm` |
-| Tensor parallel size | `1` |
-| GPUs per node | `8` |
+Qwen3-4B-Base-GRPO is a full-parameter GRPO finetune of **Qwen3-4B-Base**, trained on the
+**DAPO-Math-17k-Processed** dataset for mathematical reasoning and problem-solving.
+
+This model accompanies the paper
+**Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe**
+(https://arxiv.org/abs/2604.13016).
+
+## Model Description
+
+This model was obtained by full-parameter GRPO training from `Qwen3-4B-Base`,
+with the goal of improving performance on math-focused reasoning tasks in the
+on-policy distillation setting.
+
+No learned reward model is used in this training run. Rewards are computed by a
+custom rule-based reward function for math evaluation:
+`verl/verl/utils/reward_score/ttrl_math/__init__.py`.
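+
+As a rough illustration (the actual implementation lives at the path above),
+a rule-based math reward of this kind typically extracts the final
+`\boxed{...}` answer from a rollout and compares it with the ground truth.
+The sketch below follows verl's custom-reward-function interface, but its
+parsing logic is a simplified, hypothetical stand-in for the repo's code:
+
+```python
+import re
+
+def reward_func(data_source, solution_str, ground_truth, extra_info=None):
+    """Simplified rule-based math reward (illustrative, not the repo's code):
+    1.0 if the last boxed answer equals the ground truth, else 0.0."""
+    boxed = re.findall(r"\\boxed\{([^{}]*)\}", solution_str)
+    if not boxed:  # no final answer found -> zero reward
+        return 0.0
+    return 1.0 if boxed[-1].strip() == str(ground_truth).strip() else 0.0
+```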
+
+### Key characteristics
+
+- **Base model**: Qwen3-4B-Base
+- **Training stage**: GRPO
+- **Finetuning type**: Full finetuning
+- **Primary domain**: Mathematical reasoning
+- **Reward model**: Not used (`reward_model.enable: false`)
+- **Custom reward function**: `reward_func`
+- **Rollout engine**: vLLM
+- **Context length**: 32768 tokens
+- **Responses per prompt**: 8
+
+## Training Details
+
+### Training configuration
+
+- **Framework**: verl
+- **Algorithm**: `grpo` (group-relative advantages; see the sketch below this list)
+- **GRPO outcome weight**: `1.0`
+- **Training dataset**: `DAPO-Math-17k-Processed`
+- **Training file**: `datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet`
+- **Validation datasets**: `AIME25`, `AMC23`, `AIME24`
+- **Prompt length**: `1024`
+- **Response length**: `7168`
+- **Validation response length**: `31744`
+- **Max model length**: `32768`
+- **Rollout temperature**: `1.0`
+- **Teacher temperature**: `1.0`
+- **Repetition penalty**: `1.0`
+- **Top-k log probability**: `0`
+- **Top-k strategy**: `union`
+- **Reward weight mode**: `student_p`
+- **KL loss**: disabled
+- **Format reward**: disabled
+- **Loss aggregation**: `token-mean`
+- **Learning rate**: `1e-6`
+- **PPO mini-batch size**: `64`
+- **PPO micro-batch size per GPU**: `1`
+- **Tensor parallel size**: `1`
+- **Number of GPUs**: `8`
+- **Number of epochs**: `1`
+- **Save frequency**: every `20` steps
+- **Test frequency**: every `20` steps
+- **Logging**: console and SwanLab
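+
+For intuition, GRPO estimates advantages from outcome rewards normalized
+within each group of responses sampled for the same prompt (8 per prompt
+here). A schematic sketch, not verl's implementation:
+
+```python
+import torch
+
+def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
+    """Schematic GRPO advantage: standardize outcome rewards within each
+    prompt's group of rollouts (shape: [num_prompts, responses_per_prompt])."""
+    mean = rewards.mean(dim=-1, keepdim=True)
+    std = rewards.std(dim=-1, keepdim=True)
+    return (rewards - mean) / (std + eps)
+
+# One prompt, 8 rollouts scored by the 0/1 rule-based math reward:
+print(grpo_advantages(torch.tensor([[1., 0., 0., 1., 1., 0., 1., 0.]])))
+```
+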
+### Dataset
+
+- **Training dataset**: `DAPO-Math-17k-Processed`
+- **Validation datasets**: `AIME25`, `AMC23`, `AIME24`
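+
+To inspect the training data, the parquet file can be read directly. The
+snippet below is a generic sketch; the column layout is not documented on
+this card, so check it before relying on specific fields:
+
+```python
+import pandas as pd
+
+# Path taken from the training configuration above.
+df = pd.read_parquet("datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet")
+print(len(df), df.columns.tolist())
+```
+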
+## Training Hyperparameters
+
+For reproducibility, the core configuration is summarized below:
+
+```bash
+ACTOR_MODEL_PATH=model/Qwen3-4B-Base
+ADV_ESTIMATOR=grpo
+GRPO_OUTCOME_WEIGHT=1.0
+
+TRAIN_DATASET=datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet
+TRAIN_DATASET_NAME=DAPO-Math-17k-Processed
+TEST_DATASET=[
+  datasets/test_data/AIME25/test.parquet,
+  datasets/test_data/AMC23/test.parquet,
+  datasets/test_data/AIME24/test.parquet
+]
+
+MAX_PROMPT_LENGTH=1024
+MAX_RESP_LENGTH=7168
+MAX_VAL_RESP_LENGTH=31744
+MAX_MODEL_LEN=32768
+
+MINI_BATCH_SIZE=64
+TEMPERATURE=1.0
+TEACHER_TEMPERATURE=1.0
+REPETITION_PENALTY=1.0
+N_RESPONSES=8
+
+LOG_PROB_TOP_K=0
+TOP_K_STRATEGY=union
+REWARD_WEIGHT_MODE=student_p
+
+USE_KL=False
+ENABLE_FORMAT_REWARD=False
+MODEL_DTYPE=fp32
+LOSS_AGG_MODE=token-mean
+
+actor_rollout_ref.actor.optim.lr=1e-6
+actor_rollout_ref.actor.ppo_mini_batch_size=64
+actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1
+actor_rollout_ref.actor.use_dynamic_bsz=True
+actor_rollout_ref.model.enable_gradient_checkpointing=True
+actor_rollout_ref.model.enable_activation_offload=True
+
+actor_rollout_ref.rollout.name=vllm
+actor_rollout_ref.rollout.tensor_model_parallel_size=1
+actor_rollout_ref.rollout.gpu_memory_utilization=0.8
+actor_rollout_ref.rollout.n=8
+actor_rollout_ref.rollout.val_kwargs.n=16
+actor_rollout_ref.rollout.val_kwargs.temperature=1.0
+actor_rollout_ref.rollout.val_kwargs.top_p=0.95
+
+reward_model.enable=False
+custom_reward_function.path=verl/verl/utils/reward_score/ttrl_math/__init__.py
+custom_reward_function.name=reward_func
+
+trainer.n_gpus_per_node=8
+trainer.nnodes=1
+trainer.total_epochs=1
+trainer.save_freq=20
+trainer.test_freq=20
+```
+
+## Usage

 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -71,3 +166,18 @@ model = AutoModelForCausalLM.from_pretrained(
     device_map="auto",
 )
 ```
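+
+Below is a minimal end-to-end generation example. The repo id is an assumption
+inferred from this card's title, and the prompt is illustrative; since the
+model is trained from a base model, a plain completion prompt is used:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "lllyx/Qwen3-4B-Base-GRPO"  # assumed repo id; adjust if it differs
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
+
+prompt = "Solve: if 3x + 5 = 20, what is x? Put the final answer in \\boxed{}.\n"
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=1.0)
+print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
+```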
+
+## Citation
+
+If you use this model, please consider citing the related paper:
+
+```bibtex
+@article{li2026rethinking,
+  title={Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe},
+  author={Li, Yaxuan and Zuo, Yuxin and He, Bingxiang and Zhang, Jinqian and Xiao, Chaojun and Qian, Cheng and Yu, Tianyu and Gao, Huan-ang and Yang, Wenkai and Liu, Zhiyuan and Ding, Ning},
+  journal={arXiv preprint arXiv:2604.13016},
+  year={2026}
+}
+```
+
+Paper: https://arxiv.org/abs/2604.13016