qwen3-8b-agrpo-think-lr3e-6

Qwen3-8B fine-tuned with Async GRPO (thinking mode enabled)

Task

Fill-in-the-middle multiple-choice questions (MCQs) for verification in the energy domain. The model outputs its answer inside \boxed{N}, where N is the option number.

Reward Function

  • +1.0: correct (\boxed{N} matches the ground truth)
  • -0.5: wrong answer (\boxed{N} present but incorrect)
  • -1.0: no answer (no \boxed{} found)
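The scheme above can be sketched as a simple completion scorer; the function name and regex below are illustrative, not the actual training code:

```python
import re

# Matches \boxed{N}; an illustrative parser, not the exact one used in training.
BOXED_RE = re.compile(r"\\boxed\{(\d+)\}")

def reward(completion: str, ground_truth: int) -> float:
    """Score one completion: +1.0 correct, -0.5 wrong, -1.0 if no \\boxed{} is found."""
    matches = BOXED_RE.findall(completion)
    if not matches:
        return -1.0  # no answer emitted at all
    # Use the last \boxed{} in case the reasoning mentions earlier candidates.
    return 1.0 if int(matches[-1]) == ground_truth else -0.5
```

Penalizing a missing box (-1.0) more than a wrong box (-0.5) pushes the model to always commit to an answer in the required format.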

Training Hyperparameters

| Parameter | Value |
|---|---|
| base_model | Qwen/Qwen3-8B |
| algorithm | Async GRPO (TRL AsyncGRPOTrainer) |
| thinking | enabled (enable_thinking=True) |
| learning_rate | 3e-6 |
| lr_scheduler | cosine |
| warmup_steps | 30 |
| max_steps | 2000 |
| global_step_saved | 1400 |
| per_device_train_batch_size | 1 |
| gradient_accumulation_steps | 32 |
| num_train_processes | 4 |
| effective_batch_size | 128 prompts/step |
| num_generations | 9 |
| max_completion_length | 12000 |
| temperature | 1.0 |
| epsilon | 0.2 |
| epsilon_high | 0.2 |
| max_staleness | 4 |
| weight_sync_steps | 1 |
| max_grad_norm | 1.0 |
| precision | bf16 |
| parallelism | FSDP2 (4 GPUs training) + vLLM TP=4 (4 GPUs inference) |
| final_reward | ~0.62 |
| final_mean_completion_length | ~6130 tokens |
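The effective batch size follows directly from the other entries (per-device batch × gradient-accumulation steps × training processes), and with 9 generations per prompt each optimizer step scores 9 × 128 completions. A quick consistency check:

```python
# Hyperparameters as listed in the table above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 32
num_train_processes = 4
num_generations = 9

prompts_per_step = (per_device_train_batch_size
                    * gradient_accumulation_steps
                    * num_train_processes)
completions_per_step = prompts_per_step * num_generations

print(prompts_per_step)      # 128 prompts/step, matching the table
print(completions_per_step)  # 1152 sampled completions scored per step
```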

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EnergyAI/qwen3-8b-agrpo-think-lr3e-6", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("EnergyAI/qwen3-8b-agrpo-think-lr3e-6")
```
