# qwen3-4b-agrpo-nothink-lr3e-6

Qwen3-4B fine-tuned with Async GRPO in "nothink" mode (thinking disabled).
## Task
Fill-in-the-middle multiple-choice questions (MCQs) for verification in the energy domain.
The model outputs its answer inside `\boxed{N}`, where N is the option number.
## Reward Function
- +1.0 → correct (`\boxed{N}` matches the ground truth)
- -0.5 → wrong (`\boxed{N}` present but the answer is wrong)
- -1.0 → no answer (no `\boxed{}` found)
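The reward shaping above can be sketched as a small function. This is a minimal illustration of the scheme described in the list, not the actual training code; the function name and the exact extraction regex are assumptions.

```python
import re

def mcq_reward(completion: str, ground_truth: int) -> float:
    """Score a completion per the scheme above:
    +1.0 if \\boxed{N} matches ground truth, -0.5 if it is wrong,
    -1.0 if no \\boxed{} is found at all."""
    match = re.search(r"\\boxed\{(\d+)\}", completion)
    if match is None:
        return -1.0  # no answer emitted
    return 1.0 if int(match.group(1)) == ground_truth else -0.5
```

Note that the -1.0 "no answer" penalty is strictly worse than a wrong guess, which pushes the model to always commit to an option.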
## Training Hyperparameters
| Parameter | Value |
|---|---|
| base_model | Qwen/Qwen3-4B |
| algorithm | Async GRPO (TRL AsyncGRPOTrainer) |
| thinking | disabled (nothink) |
| learning_rate | 3e-6 |
| lr_scheduler | cosine |
| warmup_steps | 60 |
| max_steps | 2000 |
| global_step_saved | 2000 |
| per_device_train_batch_size | 1 |
| gradient_accumulation_steps | 32 |
| num_train_processes | 4 |
| effective_batch_size | 128 prompts/step |
| num_generations | 9 |
| max_completion_length | 512 |
| temperature | 1.0 |
| epsilon | 0.2 |
| epsilon_high | 0.2 |
| max_staleness | 4 |
| weight_sync_steps | 1 |
| max_grad_norm | 1.0 |
| precision | bf16 |
| parallelism | FSDP2 (4 GPUs training) + vLLM TP=4 (4 GPUs inference) |
| final_reward | ~0.57 |
| final_mean_completion_length | ~169 tokens |
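For reference, the effective batch size in the table follows from the other rows (my arithmetic, not a value read from the training script):

```python
# Effective batch size = per-device batch * grad accumulation * train processes
per_device_train_batch_size = 1
gradient_accumulation_steps = 32
num_train_processes = 4

effective_batch = (per_device_train_batch_size
                   * gradient_accumulation_steps
                   * num_train_processes)
print(effective_batch)  # 128 prompts per optimizer step

# With num_generations = 9, each step scores 128 * 9 completions.
num_generations = 9
print(effective_batch * num_generations)  # 1152 completions per step
```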
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "EnergyAI/qwen3-4b-agrpo-nothink-lr3e-6", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained("EnergyAI/qwen3-4b-agrpo-nothink-lr3e-6")
```
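A generation sketch building on the snippet above. `enable_thinking` is the Qwen3 chat-template flag for disabling the thinking block; whether your `transformers` version passes it through is an assumption worth verifying, and the prompt content is a placeholder.

```python
# Sketch: answer an MCQ in nothink mode, matching the training setup.
messages = [{"role": "user", "content": "<your fill-in-the-middle MCQ here>"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # "nothink": suppress the <think> block
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(answer)  # expected to end with \boxed{N}
```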