# qwen3-4b-agrpo-nothink-lr3e-6

Qwen3-4B fine-tuned with Async GRPO in "nothink" mode (thinking disabled).
## Task
Fill-in-the-middle multiple-choice questions (MCQs) for verification in the energy domain.
The model outputs its answer inside `\boxed{N}`, where N is the option number.
## Reward Function
- +1.0 → correct (`\boxed{N}` matches the ground truth)
- -0.5 → wrong (`\boxed{N}` present but the answer is wrong)
- -1.0 → no answer (no `\boxed{}` found)
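The reward shaping above can be sketched as a small function. This is a minimal illustration of the scheme described in the list, not the actual training code; the function name and the exact extraction regex are assumptions.

```python
import re

def mcq_reward(completion: str, ground_truth: int) -> float:
    """Score a completion per the scheme above:
    +1.0 if \\boxed{N} matches ground truth, -0.5 if it is wrong,
    -1.0 if no \\boxed{} is found at all."""
    match = re.search(r"\\boxed\{(\d+)\}", completion)
    if match is None:
        return -1.0  # no answer emitted
    return 1.0 if int(match.group(1)) == ground_truth else -0.5
```

Note that the -1.0 "no answer" penalty is strictly worse than a wrong guess, which pushes the model to always commit to an option.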
## Training Hyperparameters
| Parameter | Value |
|---|---|
| base_model | Qwen/Qwen3-4B |
| algorithm | Async GRPO (TRL AsyncGRPOTrainer) |
| thinking | disabled (nothink) |
| learning_rate | 3e-6 |
| lr_scheduler | cosine |
| warmup_steps | 60 |
| max_steps | 2000 |
| global_step_saved | 2000 |
| per_device_train_batch_size | 1 |
| gradient_accumulation_steps | 32 |
| num_train_processes | 4 |
| effective_batch_size | 128 prompts/step |
| num_generations | 9 |
| max_completion_length | 512 |
| temperature | 1.0 |
| epsilon | 0.2 |
| epsilon_high | 0.2 |
| max_staleness | 4 |
| weight_sync_steps | 1 |
| max_grad_norm | 1.0 |
| precision | bf16 |
| parallelism | FSDP2 (4 GPUs training) + vLLM TP=4 (4 GPUs inference) |
| final_reward | ~0.57 |
| final_mean_completion_length | ~169 tokens |
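For reference, the effective batch size in the table follows from the other rows (my arithmetic, not a value read from the training script):

```python
# Effective batch size = per-device batch * grad accumulation * train processes
per_device_train_batch_size = 1
gradient_accumulation_steps = 32
num_train_processes = 4

effective_batch = (per_device_train_batch_size
                   * gradient_accumulation_steps
                   * num_train_processes)
print(effective_batch)  # 128 prompts per optimizer step

# With num_generations = 9, each step scores 128 * 9 completions.
num_generations = 9
print(effective_batch * num_generations)  # 1152 completions per step
```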
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "EnergyAI/qwen3-4b-agrpo-nothink-lr3e-6", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained("EnergyAI/qwen3-4b-agrpo-nothink-lr3e-6")
```
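A generation sketch building on the snippet above. `enable_thinking` is the Qwen3 chat-template flag for disabling the thinking block; whether your `transformers` version passes it through is an assumption worth verifying, and the prompt content is a placeholder.

```python
# Sketch: answer an MCQ in nothink mode, matching the training setup.
messages = [{"role": "user", "content": "<your fill-in-the-middle MCQ here>"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # "nothink": suppress the <think> block
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(answer)  # expected to end with \boxed{N}
```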