llama-3.1-8b-instruct-agrpo-nothink-lr3e-6

Llama-3.1-8B-Instruct fine-tuned with Async GRPO in nothink mode (no chain-of-thought reasoning before the answer).

Task

Fill-in-the-middle multiple-choice questions (MCQs) for energy-domain verification. The model outputs its answer as \boxed{N}, where N is the option number.

Reward Function

  • +1.0: correct (\boxed{N} matches the ground truth)
  • -0.5: wrong (\boxed{N} present but incorrect)
  • -1.0: no answer (no \boxed{} found)
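A minimal sketch of this reward as a standalone function (the function name and the convention of scoring the last \boxed{} occurrence are illustrative, not taken from the training code):

```python
import re

def mcq_reward(completion: str, ground_truth: int) -> float:
    """Score a completion per the scheme above: +1.0 correct, -0.5 wrong, -1.0 missing."""
    matches = re.findall(r"\\boxed\{(\d+)\}", completion)
    if not matches:
        return -1.0  # no \boxed{} found at all
    if int(matches[-1]) == ground_truth:
        return 1.0   # boxed answer matches the ground-truth option
    return -0.5      # boxed answer present but wrong
```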

Training Hyperparameters

| Parameter | Value |
|---|---|
| base_model | meta-llama/Llama-3.1-8B-Instruct |
| algorithm | Async GRPO (TRL AsyncGRPOTrainer) |
| thinking | disabled (nothink) |
| learning_rate | 3e-6 |
| lr_scheduler | cosine |
| warmup_steps | 60 |
| max_steps | 2000 |
| global_step_saved | 2000 |
| per_device_train_batch_size | 1 |
| gradient_accumulation_steps | 32 |
| num_train_processes | 4 |
| effective_batch_size | 128 prompts/step |
| num_generations | 9 |
| max_completion_length | 512 |
| temperature | 1.0 |
| epsilon | 0.2 |
| epsilon_high | 0.2 |
| max_staleness | 4 |
| weight_sync_steps | 1 |
| max_grad_norm | 1.0 |
| precision | bf16 |
| parallelism | FSDP2 (4 GPUs training) + vLLM TP=4 (4 GPUs inference) |
| final_reward | ~0.58 |
| final_mean_completion_length | ~42 tokens |
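The effective batch size in the table follows directly from the other settings; a quick arithmetic check (the completions-per-step figure is derived here, not stated in the training logs):

```python
# Prompts per optimizer step = per-device batch * grad accumulation * processes
per_device_train_batch_size = 1
gradient_accumulation_steps = 32
num_train_processes = 4
num_generations = 9  # completions sampled per prompt for GRPO group advantages

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_train_processes)
completions_per_step = effective_batch_size * num_generations

print(effective_batch_size)   # prompts per optimizer step
print(completions_per_step)   # completions scored per optimizer step
```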

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "EnergyAI/llama-3.1-8b-instruct-agrpo-nothink-lr3e-6",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("EnergyAI/llama-3.1-8b-instruct-agrpo-nothink-lr3e-6")
```
