qwen3-8b-agrpo-think-lr3e-6

Qwen3-8B fine-tuned with Async GRPO (thinking mode enabled)

Task

Fill-in-the-middle multiple-choice questions (MCQs) for verification in the energy domain. The model outputs its answer inside \boxed{N}, where N is the option number.

Reward Function

  • +1.0: correct (\boxed{N} matches the ground truth)
  • -0.5: wrong answer (\boxed{N} present but incorrect)
  • -1.0: no answer (no \boxed{} found)
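The scheme above can be sketched as a simple completion scorer; the function name and regex below are illustrative, not the actual training code:

```python
import re

# Matches \boxed{N}; an illustrative parser, not the exact one used in training.
BOXED_RE = re.compile(r"\\boxed\{(\d+)\}")

def reward(completion: str, ground_truth: int) -> float:
    """Score one completion: +1.0 correct, -0.5 wrong, -1.0 if no \\boxed{} is found."""
    matches = BOXED_RE.findall(completion)
    if not matches:
        return -1.0  # no answer emitted at all
    # Use the last \boxed{} in case the reasoning mentions earlier candidates.
    return 1.0 if int(matches[-1]) == ground_truth else -0.5
```

Penalizing a missing box (-1.0) more than a wrong box (-0.5) pushes the model to always commit to an answer in the required format.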

Training Hyperparameters

| Parameter | Value |
|---|---|
| base_model | Qwen/Qwen3-8B |
| algorithm | Async GRPO (TRL AsyncGRPOTrainer) |
| thinking | enabled (enable_thinking=True) |
| learning_rate | 3e-6 |
| lr_scheduler | cosine |
| warmup_steps | 30 |
| max_steps | 2000 |
| global_step_saved | 1400 |
| per_device_train_batch_size | 1 |
| gradient_accumulation_steps | 32 |
| num_train_processes | 4 |
| effective_batch_size | 128 prompts/step |
| num_generations | 9 |
| max_completion_length | 12000 |
| temperature | 1.0 |
| epsilon | 0.2 |
| epsilon_high | 0.2 |
| max_staleness | 4 |
| weight_sync_steps | 1 |
| max_grad_norm | 1.0 |
| precision | bf16 |
| parallelism | FSDP2 (4 GPUs training) + vLLM TP=4 (4 GPUs inference) |
| final_reward | ~0.62 |
| final_mean_completion_length | ~6130 tokens |
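The effective batch size follows directly from the other entries (per-device batch × gradient-accumulation steps × training processes), and with 9 generations per prompt each optimizer step scores 9 × 128 completions. A quick consistency check:

```python
# Hyperparameters as listed in the table above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 32
num_train_processes = 4
num_generations = 9

prompts_per_step = (per_device_train_batch_size
                    * gradient_accumulation_steps
                    * num_train_processes)
completions_per_step = prompts_per_step * num_generations

print(prompts_per_step)      # 128 prompts/step, matching the table
print(completions_per_step)  # 1152 sampled completions scored per step
```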

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EnergyAI/qwen3-8b-agrpo-think-lr3e-6", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("EnergyAI/qwen3-8b-agrpo-think-lr3e-6")
```
