qwen3-4b-agrpo-think-lr3e-6

Qwen3-4B fine-tuned with asynchronous GRPO (Group Relative Policy Optimization), thinking mode enabled.

Task

Fill-in-the-middle multiple-choice questions (MCQs) for verification in the energy domain. The model emits its answer inside \boxed{N}, where N is the option number.
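An illustrative item in this format might look like the following (hypothetical example, not from the training data):

```text
Question: The inertia of a power system is primarily provided by ____ connected to the grid.
1. photovoltaic inverters
2. synchronous generators
3. shunt capacitors
4. protective relays

Expected answer: \boxed{2}
```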

Reward Function

  • +1.0 : correct (\boxed{N} matches the ground truth)
  • -0.5 : wrong (\boxed{N} present, but the wrong option)
  • -1.0 : no answer (no \boxed{} found)
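A minimal sketch of this reward, assuming completions arrive as plain strings (the function name and regex are illustrative, not the actual training code):

```python
import re

# Matches \boxed{N} where N is an option number; the digits are captured.
BOXED_RE = re.compile(r"\\boxed\{(\d+)\}")

def reward(completion: str, ground_truth: int) -> float:
    """Score a completion: +1.0 correct, -0.5 wrong, -1.0 no boxed answer."""
    matches = BOXED_RE.findall(completion)
    if not matches:
        return -1.0  # no \boxed{} found at all
    # Use the last boxed answer, in case the model revises itself.
    return 1.0 if int(matches[-1]) == ground_truth else -0.5
```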

Training Hyperparameters

Parameter Value
base_model Qwen/Qwen3-4B
algorithm Async GRPO (TRL AsyncGRPOTrainer)
thinking enabled (enable_thinking=True)
learning_rate 3e-6
lr_scheduler cosine
warmup_steps 60
max_steps 2000
global_step_saved 2000
per_device_train_batch_size 1
gradient_accumulation_steps 32
num_train_processes 4
effective_batch_size 128 prompts/step
num_generations 9
max_completion_length 4096
temperature 1.0
epsilon 0.2
epsilon_high 0.2
max_staleness 4
weight_sync_steps 1
max_grad_norm 1.0
precision bf16
parallelism FSDP2 (4 GPUs training) + vLLM TP=4 (4 GPUs inference)
final_reward ~0.45
final_mean_completion_length ~2370 tokens
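The effective batch size follows directly from the table entries above (a quick arithmetic check, nothing model-specific):

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 32
num_train_processes = 4

# prompts per optimizer step = per-device batch x grad accumulation x processes
effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_train_processes)
print(effective_batch_size)  # 128, matching the table
```

With num_generations = 9, each optimizer step therefore scores 128 × 9 = 1152 completions.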

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EnergyAI/qwen3-4b-agrpo-think-lr3e-6", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("EnergyAI/qwen3-4b-agrpo-think-lr3e-6")