llama-3.1-8b-instruct-agrpo-nothink-lr3e-6

Llama-3.1-8B-Instruct fine-tuned with Async GRPO in nothink mode (no chain-of-thought reasoning before the answer).

Task

Fill-in-the-middle multiple-choice questions (MCQs) for energy-domain verification. The model outputs its answer as \boxed{N}, where N is the option number.

Reward Function

  • +1.0: correct (\boxed{N} matches the ground truth)
  • -0.5: wrong (\boxed{N} present but incorrect)
  • -1.0: no answer (no \boxed{} found)
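A minimal sketch of this reward as a standalone function (the function name and the convention of scoring the last \boxed{} occurrence are illustrative, not taken from the training code):

```python
import re

def mcq_reward(completion: str, ground_truth: int) -> float:
    """Score a completion per the scheme above: +1.0 correct, -0.5 wrong, -1.0 missing."""
    matches = re.findall(r"\\boxed\{(\d+)\}", completion)
    if not matches:
        return -1.0  # no \boxed{} found at all
    if int(matches[-1]) == ground_truth:
        return 1.0   # boxed answer matches the ground-truth option
    return -0.5      # boxed answer present but wrong
```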

Training Hyperparameters

| Parameter | Value |
|---|---|
| base_model | meta-llama/Llama-3.1-8B-Instruct |
| algorithm | Async GRPO (TRL AsyncGRPOTrainer) |
| thinking | disabled (nothink) |
| learning_rate | 3e-6 |
| lr_scheduler | cosine |
| warmup_steps | 60 |
| max_steps | 2000 |
| global_step_saved | 2000 |
| per_device_train_batch_size | 1 |
| gradient_accumulation_steps | 32 |
| num_train_processes | 4 |
| effective_batch_size | 128 prompts/step |
| num_generations | 9 |
| max_completion_length | 512 |
| temperature | 1.0 |
| epsilon | 0.2 |
| epsilon_high | 0.2 |
| max_staleness | 4 |
| weight_sync_steps | 1 |
| max_grad_norm | 1.0 |
| precision | bf16 |
| parallelism | FSDP2 (4 GPUs training) + vLLM TP=4 (4 GPUs inference) |
| final_reward | ~0.58 |
| final_mean_completion_length | ~42 tokens |
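The effective batch size in the table follows directly from the other settings; a quick arithmetic check (the completions-per-step figure is derived here, not stated in the training logs):

```python
# Prompts per optimizer step = per-device batch * grad accumulation * processes
per_device_train_batch_size = 1
gradient_accumulation_steps = 32
num_train_processes = 4
num_generations = 9  # completions sampled per prompt for GRPO group advantages

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_train_processes)
completions_per_step = effective_batch_size * num_generations

print(effective_batch_size)   # prompts per optimizer step
print(completions_per_step)   # completions scored per optimizer step
```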

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "EnergyAI/llama-3.1-8b-instruct-agrpo-nothink-lr3e-6",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("EnergyAI/llama-3.1-8b-instruct-agrpo-nothink-lr3e-6")
```
