# qwen3.5-27b-agrpo-nothink-lr3e-6-checkpoint400

Qwen3.5-27B fine-tuned with Async GRPO (thinking disabled, "nothink" mode), checkpoint at step 400.
## Task

Fill-in-the-middle multiple-choice questions (MCQ) for energy-domain verification. The model outputs its answer inside `\boxed{N}`, where N is the option number.
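For reference, the `\boxed{N}` answer can be pulled out of a completion with a simple regex. This is a minimal sketch, not the card's actual parsing code:

```python
import re

def extract_answer(completion: str):
    """Return the option number N from the last \\boxed{N} in a completion, or None."""
    matches = re.findall(r"\\boxed\{(\d+)\}", completion)
    return int(matches[-1]) if matches else None
```

Taking the last match makes the extraction robust to intermediate `\boxed{}` expressions earlier in the completion.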
## Reward Function

- **+1.0**: correct (`\boxed{N}` matches the ground truth)
- **-0.5**: wrong (`\boxed{N}` present, but the answer is incorrect)
- **-1.0**: no answer (no `\boxed{}` found)
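The scheme above can be sketched as a reward function. This is a hypothetical reimplementation of the stated rules, not the actual training code:

```python
import re

def mcq_reward(completion: str, ground_truth: int) -> float:
    """Score a completion: +1.0 correct, -0.5 wrong, -1.0 if no \\boxed{} answer."""
    matches = re.findall(r"\\boxed\{(\d+)\}", completion)
    if not matches:
        return -1.0  # no \boxed{} found at all
    # Use the last \boxed{} occurrence as the model's final answer
    return 1.0 if int(matches[-1]) == ground_truth else -0.5
```

The asymmetric penalties (-0.5 for a wrong answer vs. -1.0 for no answer) push the model to always commit to a boxed answer rather than omit one.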
## Training Hyperparameters
| Parameter | Value |
|---|---|
| base_model | Qwen/Qwen3.5-27B |
| algorithm | Async GRPO (TRL AsyncGRPOTrainer) |
| thinking | disabled (nothink) |
| learning_rate | 3e-6 |
| lr_scheduler | cosine |
| warmup_steps | 30 |
| max_steps | 2000 |
| global_step_saved | 400 |
| per_device_train_batch_size | 2 |
| gradient_accumulation_steps | 32 |
| num_train_processes | 8 |
| effective_batch_size | 512 prompts/step |
| num_generations | 5 |
| max_completion_length | 512 |
| temperature | 1.0 |
| epsilon | 0.2 |
| epsilon_high | 0.2 |
| max_staleness | 4 |
| weight_sync_steps | 3 |
| max_grad_norm | 1.0 |
| precision | bf16 |
| parallelism | FSDP2 (8 GPUs training, 1 node) + vLLM TP=8 (8 GPUs inference, 1 node); 2 nodes total |
| final_reward | ~0.52 (at step 400) |
| final_mean_completion_length | ~190 tokens |
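The effective batch size follows from the rows above, assuming it is computed as per-device batch size × gradient-accumulation steps × number of training processes:

```python
# Hyperparameters from the table above
per_device_train_batch_size = 2
gradient_accumulation_steps = 32
num_train_processes = 8

# Prompts consumed per optimizer step
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_train_processes
)
print(effective_batch_size)  # 512
```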
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "EnergyAI/qwen3.5-27b-agrpo-nothink-lr3e-6-checkpoint400",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "EnergyAI/qwen3.5-27b-agrpo-nothink-lr3e-6-checkpoint400"
)
```
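The exact training prompt template is not published; the following shows one hypothetical way to format a fill-in-the-middle MCQ so the model can answer with `\boxed{N}` (the layout and wording are assumptions, not the actual template):

```python
def format_mcq_prompt(prefix: str, suffix: str, options: list) -> str:
    """Build a fill-in-the-middle MCQ prompt (assumed format, not the training template)."""
    lines = ["Fill in the blank:", f"{prefix} ____ {suffix}", "", "Options:"]
    # Number options from 1 so they line up with the \boxed{N} convention
    lines += [f"{i}. {opt}" for i, opt in enumerate(options, start=1)]
    lines.append("Answer with \\boxed{N}, where N is the option number.")
    return "\n".join(lines)
```

The resulting string can then be passed through the tokenizer's chat template before generation.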