GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Paper: arXiv 2601.05242
This model was trained in two phases. This repository contains the fully merged 16-bit weights.
GDPO (arXiv 2601.05242, NVIDIA Research) decouples reward normalization across the individual rewards, normalizing each reward separately so that their relative differences are preserved when the rewards are combined.
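The decoupling idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each reward is z-normalized over the rollout group on its own and the normalized rewards are then combined with their weights; consult the paper for the exact formulation.

```python
import numpy as np

def gdpo_advantage(rewards, weights):
    """Sketch of decoupled reward normalization (assumption: each
    reward is z-normalized across the rollout group separately,
    then the normalized rewards are summed with their weights).

    rewards: (num_rewards, group_size) raw per-reward scores
    weights: (num_rewards,) per-reward weights
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    # Normalize each reward row over the group on its own, so one
    # reward's scale cannot wash out another's relative differences.
    normed = (rewards - rewards.mean(axis=1, keepdims=True)) / (
        rewards.std(axis=1, keepdims=True) + 1e-6
    )
    # Weighted sum of the per-reward normalized scores -> one
    # advantage per rollout in the group.
    return np.asarray(weights) @ normed
```

If the rewards were pooled before normalization instead, a reward with a large raw scale would dominate the group statistics; normalizing first keeps each reward's within-group ordering intact.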
| # | Reward | Weight | Description |
|---|---|---|---|
| 1 | Format Compliance | 1.0 | Approach/Output structure |
| 2 | Structured Output Validity | 3.0 | JSON/XML/YAML/TOML/CSV parsing |
| 3 | Output Length | 0.3 | Appropriate length |
| 4 | No Repetition | 0.5 | Prevents degeneration |
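As a concrete illustration of reward 2 (Structured Output Validity), a scorer might check whether the model's output parses at all. This is a hypothetical sketch covering only the JSON case, not the reward function actually used in training (which also covers XML/YAML/TOML/CSV):

```python
import json

def structured_output_reward(text: str) -> float:
    """Hypothetical validity scorer: 1.0 if the text parses as
    JSON, else 0.0. Illustrative only -- the trained reward also
    handles XML/YAML/TOML/CSV and may grade partial validity."""
    try:
        json.loads(text)
        return 1.0
    except json.JSONDecodeError:
        return 0.0
```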
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "kabuizuchi-trading/gdpo-qwen-structured-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # repo ships 16-bit weights
    device_map="auto",
)
```
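Continuing from the loading snippet, a hedged usage sketch; the prompt and decoding settings here are illustrative, not recommendations from the model authors:

```python
# Assumes `model` and `tokenizer` from the loading snippet above.
messages = [
    {"role": "user", "content": "Return a JSON object with keys 'name' and 'age'."}
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```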
Base model: Qwen/Qwen3-4B-Instruct-2507