qwen3.5-27b-agrpo-nothink-lr3e-6-checkpoint400

Qwen3.5-27B fine-tuned with Async GRPO in nothink mode (thinking disabled); checkpoint saved at step 400.

Task

Fill-in-the-middle multiple-choice questions (MCQs) for energy-domain verification. The model outputs its answer as \boxed{N}, where N is the option number.

Reward Function

  • +1.0: correct (\boxed{N} matches the ground truth)
  • -0.5: wrong (\boxed{N} present but incorrect)
  • -1.0: no answer (no \boxed{} found)
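The scheme above can be sketched as a small reward function; `mcq_reward` is an illustrative name, not code shipped with this repo:

```python
import re

def mcq_reward(completion: str, ground_truth: int) -> float:
    """Reward as described above: +1.0 correct, -0.5 wrong, -1.0 if no \\boxed{} is found."""
    match = re.search(r"\\boxed\{(\d+)\}", completion)
    if match is None:
        return -1.0  # no \boxed{} in the completion
    return 1.0 if int(match.group(1)) == ground_truth else -0.5
```

A TRL reward function would receive a batch of completions and return one such score per completion.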

Training Hyperparameters

Parameter Value
base_model Qwen/Qwen3.5-27B
algorithm Async GRPO (TRL AsyncGRPOTrainer)
thinking disabled (nothink)
learning_rate 3e-6
lr_scheduler cosine
warmup_steps 30
max_steps 2000
global_step_saved 400
per_device_train_batch_size 2
gradient_accumulation_steps 32
num_train_processes 8
effective_batch_size 512 prompts/step
num_generations 5
max_completion_length 512
temperature 1.0
epsilon 0.2
epsilon_high 0.2
max_staleness 4
weight_sync_steps 3
max_grad_norm 1.0
precision bf16
parallelism FSDP2 (8 GPUs training, 1 node) + vLLM TP=8 (8 GPUs inference, 1 node); 2 nodes total
final_reward ~0.52 (at step 400)
final_mean_completion_length ~190 tokens
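The effective batch size in the table is simply the product of the per-device batch size, gradient-accumulation steps, and number of training processes:

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 32
num_train_processes = 8

# 2 * 32 * 8 = 512 prompts per optimizer step
effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_train_processes)
```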

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EnergyAI/qwen3.5-27b-agrpo-nothink-lr3e-6-checkpoint400", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("EnergyAI/qwen3.5-27b-agrpo-nothink-lr3e-6-checkpoint400")
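After generation, the chosen option can be recovered from the \boxed{N} pattern; a minimal sketch (the `parse_choice` helper is ours, not part of the checkpoint):

```python
import re

def parse_choice(completion: str):
    """Return the option number from \\boxed{N}, or None if no boxed answer is present."""
    m = re.search(r"\\boxed\{(\d+)\}", completion)
    return int(m.group(1)) if m else None
```

For example, call `parse_choice(tokenizer.decode(output_ids, skip_special_tokens=True))` on the completion produced by `model.generate`.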
