Update model card
README.md CHANGED
@@ -25,8 +25,7 @@ base_model_relation: finetune
 # Qwen3-4B-Base-GRPO
 
 Qwen3-4B-Base-GRPO is a post-RL checkpoint trained with the **verl** framework.
-It starts from **Qwen3-4B-Base** and applies GRPO on the
-**DAPO-Math-17k-Processed** dataset for mathematical reasoning and problem-solving.
+It starts from **Qwen3-4B-Base** and applies GRPO on the **DAPO-Math-17k-Processed** dataset for mathematical reasoning and problem-solving.
 
 This model is associated with the paper:
 **Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe**

@@ -34,15 +33,8 @@ Paper link: https://arxiv.org/abs/2604.13016
 
 ## Model Description
 
-This model is obtained by applying GRPO reinforcement learning to
-`Qwen3-4B-Base` with verl. The training updates the actor model parameters and is
-intended to improve math-focused reasoning performance under the on-policy
-distillation setting.
+This model is obtained by applying GRPO reinforcement learning to `Qwen3-4B-Base` with verl. The training is intended to improve math-focused reasoning performance under the on-policy distillation setting.
 
-No learned reward model is used in this training run. In particular,
-`reward_model.enable` is set to `false`; rewards are computed by a custom
-rule-based reward function for math evaluation:
-`verl/verl/utils/reward_score/ttrl_math/__init__.py`.
 
 ### Key characteristics
 
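For readers unfamiliar with verl's rule-based reward hook, a minimal sketch of what a function like `reward_func` can look like follows. The signature matches verl's `custom_reward_function` interface, but the answer-extraction and matching logic below are illustrative assumptions, not the actual contents of `ttrl_math/__init__.py`.

```python
# Hypothetical rule-based math reward in the verl custom_reward_function
# style. The real ttrl_math implementation is not reproduced here.
import re
from typing import Optional


def _extract_boxed(text: str) -> Optional[str]:
    """Return the last \\boxed{...} payload, a common final-answer convention."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def reward_func(data_source, solution_str, ground_truth, extra_info=None):
    """Binary outcome reward: 1.0 if the extracted answer matches, else 0.0."""
    predicted = _extract_boxed(solution_str)
    if predicted is None:
        return 0.0
    return 1.0 if predicted == str(ground_truth).strip() else 0.0
```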

@@ -52,7 +44,6 @@ rule-based reward function for math evaluation:
 - **Parameter update**: Full-parameter actor update
 - **Primary domain**: Mathematical reasoning
 - **Reward model**: Not used (`reward_model.enable: false`)
-- **Custom reward function**: `reward_func`
 - **Rollout engine**: vLLM
 - **Context length**: 32768 tokens
 - **Responses per prompt**: 8
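The `Responses per prompt: 8` setting is what enables GRPO's group-relative advantage: the 8 sampled responses for each prompt form the group within which outcome rewards are normalized. A simplified sketch of that computation (not verl's actual code):

```python
# Simplified GRPO-style group-relative advantage: normalize outcome
# rewards within the group of rollouts sampled for one prompt.
import numpy as np


def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Example: 8 responses for one prompt with binary correctness rewards.
print(group_relative_advantages(np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])))
```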

@@ -75,8 +66,6 @@ rule-based reward function for math evaluation:
 - **Max model length**: `32768`
 - **Rollout temperature**: `1.0`
 - **Repetition penalty**: `1.0`
-- **Top-k log probability**: `0`
-- **Top-k strategy**: `union`
 - **KL loss**: disabled
 - **Format reward**: disabled
 - **Loss aggregation**: `token-mean`
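Of the settings above, `token-mean` aggregation averages the policy loss over all valid tokens in the batch rather than first averaging within each sequence, so long responses are not down-weighted. A toy contrast of the two modes (an illustration of the general idea, not verl's implementation):

```python
# Toy contrast between token-mean and seq-mean loss aggregation.
import numpy as np

loss = np.array([[0.5, 0.5, 0.0], [2.0, 0.0, 0.0]])  # per-token losses
mask = np.array([[1, 1, 0], [1, 0, 0]], dtype=float)  # valid-token mask

token_mean = (loss * mask).sum() / mask.sum()           # 3.0 / 3 = 1.0
seq_mean = ((loss * mask).sum(1) / mask.sum(1)).mean()  # (0.5 + 2.0) / 2 = 1.25
print(token_mean, seq_mean)
```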

@@ -88,79 +77,12 @@ rule-based reward function for math evaluation:
 - **Number of epochs**: `1`
 - **Save frequency**: every `20` steps
 - **Test frequency**: every `20` steps
-- **Logging**: console and SwanLab
 
 ### Dataset
 
 - **Training dataset**: `DAPO-Math-17k-Processed`
 - **Validation datasets**: `AIME25`, `AMC23`, `AIME24`
 
-## Training Hyperparameters
-
-For reproducibility, the core configuration is summarized below:
-
-```bash
-ACTOR_MODEL_PATH=model/Qwen3-4B-Base
-ADV_ESTIMATOR=grpo
-GRPO_OUTCOME_WEIGHT=1.0
-
-TRAIN_DATASET=datasets/DAPO-Math-17k-Processed/DAPO-Math.parquet
-TRAIN_DATASET_NAME=DAPO-Math-17k-Processed
-TEST_DATASET=[
-datasets/test_data/AIME25/test.parquet,
-datasets/test_data/AMC23/test.parquet,
-datasets/test_data/AIME24/test.parquet
-]
-
-MAX_PROMPT_LENGTH=1024
-MAX_RESP_LENGTH=7168
-MAX_VAL_RESP_LENGTH=31744
-MAX_MODEL_LEN=32768
-
-MINI_BATCH_SIZE=64
-TEMPERATURE=1.0
-
-# TEACHER_TEMPERATURE and REWARD_WEIGHT_MODE are rollout/logit-control settings
-# from the training script. They do not indicate the use of a learned reward model.
-TEACHER_TEMPERATURE=1.0
-REPETITION_PENALTY=1.0
-N_RESPONSES=8
-
-LOG_PROB_TOP_K=0
-TOP_K_STRATEGY=union
-REWARD_WEIGHT_MODE=student_p
-
-USE_KL=False
-ENABLE_FORMAT_REWARD=False
-MODEL_DTYPE=fp32
-LOSS_AGG_MODE=token-mean
-
-actor_rollout_ref.actor.optim.lr=1e-6
-actor_rollout_ref.actor.ppo_mini_batch_size=64
-actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1
-actor_rollout_ref.actor.use_dynamic_bsz=True
-actor_rollout_ref.model.enable_gradient_checkpointing=True
-actor_rollout_ref.model.enable_activation_offload=True
-
-actor_rollout_ref.rollout.name=vllm
-actor_rollout_ref.rollout.tensor_model_parallel_size=1
-actor_rollout_ref.rollout.gpu_memory_utilization=0.8
-actor_rollout_ref.rollout.n=8
-actor_rollout_ref.rollout.val_kwargs.n=16
-actor_rollout_ref.rollout.val_kwargs.temperature=1.0
-actor_rollout_ref.rollout.val_kwargs.top_p=0.95
-
-reward_model.enable=False
-custom_reward_function.path=verl/verl/utils/reward_score/ttrl_math/__init__.py
-custom_reward_function.name=reward_func
-
-trainer.n_gpus_per_node=8
-trainer.nnodes=1
-trainer.total_epochs=1
-trainer.save_freq=20
-trainer.test_freq=20
-```
-
 ## Usage
 
 ```python
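The rollout settings in the removed block map onto vLLM sampling in a straightforward way. A hedged sketch of generating `N_RESPONSES=8` samples per prompt under those settings (the model path is the placeholder from the block above, and the prompt is illustrative):

```python
# Sketch: 8 rollouts per prompt with vLLM under the training-time sampling
# settings (temperature 1.0, repetition penalty 1.0, 7168-token responses).
from vllm import LLM, SamplingParams

llm = LLM(model="model/Qwen3-4B-Base", max_model_len=32768)
params = SamplingParams(n=8, temperature=1.0, repetition_penalty=1.0, max_tokens=7168)
outputs = llm.generate(["Compute 12 * 34."], params)
for seq in outputs[0].outputs:
    print(seq.text)
```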

@@ -187,6 +109,4 @@ If you use this model, please consider citing the related paper:
   journal={arXiv preprint arXiv:2604.13016},
   year={2026}
 }
-```
-
-Paper: https://arxiv.org/abs/2604.13016
+```