Paper: Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs (arXiv:2510.11062)
This is a LoRA adapter trained using Agent- and Turn-wise Group Relative Policy Optimization (AT-GRPO) on the Qwen3-30B-A3B sparse MoE model.
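To make the name concrete, here is a minimal sketch of what "agent- and turn-wise" grouping could look like when computing group-relative advantages. It is an illustration based only on the method's name, not the paper's implementation: the record keys (`agent`, `turn`, `reward`), the role names, and the normalization by population standard deviation are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_relative_advantages(samples, eps=1e-6):
    """Normalize each reward against the other samples in its (agent, turn) group.

    `samples` is a list of dicts with the hypothetical keys
    'agent', 'turn', and 'reward'.
    """
    # Collect sample indices per (agent, turn) group.
    groups = defaultdict(list)
    for idx, sample in enumerate(samples):
        groups[(sample["agent"], sample["turn"])].append(idx)

    advantages = [0.0] * len(samples)
    for indices in groups.values():
        rewards = [samples[i]["reward"] for i in indices]
        mu = mean(rewards)
        sigma = pstdev(rewards)
        for i in indices:
            # eps avoids division by zero when all rewards in a group are equal.
            advantages[i] = (samples[i]["reward"] - mu) / (sigma + eps)
    return advantages


# Hypothetical rollouts: two agents, one turn, two samples each.
samples = [
    {"agent": "coder", "turn": 0, "reward": 1.0},
    {"agent": "coder", "turn": 0, "reward": 0.0},
    {"agent": "tester", "turn": 0, "reward": 0.5},
    {"agent": "tester", "turn": 0, "reward": 0.5},
]
advantages = group_relative_advantages(samples)
print(advantages)
```

Samples are only compared with others that share the same agent role and turn index, so a reward that is high for one role does not distort the baseline of another.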
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch
# 4-bit quantization for efficient inference
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen3-30B-A3B",
    quantization_config=bnb_config,
    device_map="auto",
    # Requires the flash-attn package; drop this argument to use the default attention
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
# Load adapter
model = PeftModel.from_pretrained(model, "wheattoast11/qwen3-30b-atgrpo-production-k8")
model.eval()
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-30B-A3B", trust_remote_code=True)
# Generate
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain quantum entanglement in simple terms."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.075,
    do_sample=True,
)
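# Decode only the newly generated tokens, not the echoed prompt
generated = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)
print(response)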
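For a rough sense of what the 4-bit setting above saves, the following back-of-the-envelope sketch estimates weight memory only. The parameter count (about 30.5 billion total) and the extra 0.5 bits per parameter for NF4 block scales without double quantization are assumptions, and the estimate ignores activations, the KV cache, the adapter, and any layers left unquantized.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate weight storage in GiB for a given precision."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: roughly 30.5e9 total parameters in the base model.
total_params = 30.5e9

fp16_gib = weight_memory_gib(total_params, 16)
nf4_gib = weight_memory_gib(total_params, 4)
# Assumption: about 0.5 extra bits per parameter for per-block scales
# (block size 64, 32-bit scale) when double quantization is disabled.
nf4_with_scales_gib = weight_memory_gib(total_params, 4.5)

print(f"fp16 weights:        {fp16_gib:.1f} GiB")
print(f"nf4 weights:         {nf4_gib:.1f} GiB")
print(f"nf4 + block scales:  {nf4_with_scales_gib:.1f} GiB")
```

Treat the numbers as a lower bound when sizing hardware; actual usage during generation will be higher.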
If you use this model, please cite:
@article{chen2025atgrpo,
  title={Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs},
  author={Chen et al.},
  journal={arXiv preprint arXiv:2510.11062},
  year={2025}
}
License: Apache 2.0 (same as the base model)
Generated on AMD Strix Halo platform.