# qwen3.5-27b-agrpo-nothink-lr3e-6-checkpoint400

Qwen3.5-27B fine-tuned with Async GRPO (thinking disabled, "nothink" mode), checkpoint at step 400.
## Task

Fill-in-the-middle multiple-choice questions (MCQ) for energy-domain verification. The model outputs its answer inside `\boxed{N}`, where N is the option number.
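For reference, the `\boxed{N}` answer can be pulled out of a completion with a simple regex. This is a minimal sketch, not the card's actual parsing code:

```python
import re

def extract_answer(completion: str):
    """Return the option number N from the last \\boxed{N} in a completion, or None."""
    matches = re.findall(r"\\boxed\{(\d+)\}", completion)
    return int(matches[-1]) if matches else None
```

Taking the last match makes the extraction robust to intermediate `\boxed{}` expressions earlier in the completion.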
## Reward Function

- **+1.0**: correct (`\boxed{N}` matches the ground truth)
- **-0.5**: wrong (`\boxed{N}` present, but the answer is incorrect)
- **-1.0**: no answer (no `\boxed{}` found)
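The scheme above can be sketched as a reward function. This is a hypothetical reimplementation of the stated rules, not the actual training code:

```python
import re

def mcq_reward(completion: str, ground_truth: int) -> float:
    """Score a completion: +1.0 correct, -0.5 wrong, -1.0 if no \\boxed{} answer."""
    matches = re.findall(r"\\boxed\{(\d+)\}", completion)
    if not matches:
        return -1.0  # no \boxed{} found at all
    # Use the last \boxed{} occurrence as the model's final answer
    return 1.0 if int(matches[-1]) == ground_truth else -0.5
```

The asymmetric penalties (-0.5 for a wrong answer vs. -1.0 for no answer) push the model to always commit to a boxed answer rather than omit one.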
## Training Hyperparameters
| Parameter | Value |
|---|---|
| base_model | Qwen/Qwen3.5-27B |
| algorithm | Async GRPO (TRL AsyncGRPOTrainer) |
| thinking | disabled (nothink) |
| learning_rate | 3e-6 |
| lr_scheduler | cosine |
| warmup_steps | 30 |
| max_steps | 2000 |
| global_step_saved | 400 |
| per_device_train_batch_size | 2 |
| gradient_accumulation_steps | 32 |
| num_train_processes | 8 |
| effective_batch_size | 512 prompts/step |
| num_generations | 5 |
| max_completion_length | 512 |
| temperature | 1.0 |
| epsilon | 0.2 |
| epsilon_high | 0.2 |
| max_staleness | 4 |
| weight_sync_steps | 3 |
| max_grad_norm | 1.0 |
| precision | bf16 |
| parallelism | FSDP2 (8 GPUs training, 1 node) + vLLM TP=8 (8 GPUs inference, 1 node); 2 nodes total |
| final_reward | ~0.52 (at step 400) |
| final_mean_completion_length | ~190 tokens |
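The effective batch size follows from the rows above, assuming it is computed as per-device batch size × gradient-accumulation steps × number of training processes:

```python
# Hyperparameters from the table above
per_device_train_batch_size = 2
gradient_accumulation_steps = 32
num_train_processes = 8

# Prompts consumed per optimizer step
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_train_processes
)
print(effective_batch_size)  # 512
```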
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "EnergyAI/qwen3.5-27b-agrpo-nothink-lr3e-6-checkpoint400",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "EnergyAI/qwen3.5-27b-agrpo-nothink-lr3e-6-checkpoint400"
)
```
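The exact training prompt template is not published; the following shows one hypothetical way to format a fill-in-the-middle MCQ so the model can answer with `\boxed{N}` (the layout and wording are assumptions, not the actual template):

```python
def format_mcq_prompt(prefix: str, suffix: str, options: list) -> str:
    """Build a fill-in-the-middle MCQ prompt (assumed format, not the training template)."""
    lines = ["Fill in the blank:", f"{prefix} ____ {suffix}", "", "Options:"]
    # Number options from 1 so they line up with the \boxed{N} convention
    lines += [f"{i}. {opt}" for i, opt in enumerate(options, start=1)]
    lines.append("Answer with \\boxed{N}, where N is the option number.")
    return "\n".join(lines)
```

The resulting string can then be passed through the tokenizer's chat template before generation.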