Qwen3.5-9B Regulated-Verse Poetry Model - GRPO (Gelv Poet GRPO)

A regulated-verse (gelüshi) poetry generation model, trained with GRPO reinforcement learning on top of the SFT model wnwu/Qwen3.5-9B-gelv-poet.

Training Pipeline

Qwen3.5-9B (base)
  → SFT fine-tuning (regulated-verse data, 3 epochs, eval_loss 0.094)
    → GRPO reinforcement learning (prosody reward function, 1900 steps)

GRPO Training Details

  • Method: Group Relative Policy Optimization (GRPO) + LoRA (r=16)
  • Reward function: multi-dimensional prosody validation (structure 25% + tonal pattern 35% + rhyme 25% + parallelism 15%)
  • Training data: 2,000 regulated-verse composition prompts, with 4 sampled completions per prompt
  • KL coefficient: β=0.1
  • Hardware: RTX 4090 24GB
  • Training time: ~31.5 hours (1900 steps)
  • Reference: Composer 2 Technical Report (Cursor Research Team)
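The two GRPO ingredients above — a weighted multi-dimensional reward and group-relative normalization over the 4 completions sampled per prompt — can be sketched as follows. This is an illustrative reconstruction, not the actual training code: the function names, the sub-scorer inputs, and the example scores are all hypothetical; only the weights (25/35/25/15) come from the description above.

```python
import statistics

# Weights from the reward description above; each sub-scorer is assumed
# to return a value in [0, 1] (hypothetical interface).
WEIGHTS = {"structure": 0.25, "tone": 0.35, "rhyme": 0.25, "parallelism": 0.15}

def prosody_reward(scores: dict) -> float:
    """Weighted sum of per-dimension prosody scores."""
    return sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS)

def group_advantages(rewards: list) -> list:
    """GRPO advantage: normalize each reward against its own sampling
    group (here, the 4 completions drawn for one prompt)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

# Four hypothetical completions for one prompt, scored on prosody
rewards = [prosody_reward(s) for s in (
    {"structure": 1.0, "tone": 1.0, "rhyme": 1.0, "parallelism": 1.0},
    {"structure": 1.0, "tone": 0.8, "rhyme": 1.0, "parallelism": 1.0},
    {"structure": 1.0, "tone": 0.6, "rhyme": 0.5, "parallelism": 0.0},
    {"structure": 0.5, "tone": 0.4, "rhyme": 0.0, "parallelism": 0.0},
)]
print([round(a, 2) for a in group_advantages(rewards)])
```

Completions scoring above the group mean receive positive advantage and are reinforced; those below are penalized, which is what pushes the policy toward metrically valid verse.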

Evaluation Results

Model   Average Reward
SFT     0.967
GRPO    0.952

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "wnwu/Qwen3.5-9B-gelv-poet-grpo"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True,
)
model.eval()

SYSTEM_PROMPT = (
    "你是一位精通中国古典诗词格律的诗人。你严格遵循平仄格律规则,"
    "擅长创作五言律诗、七言律诗、五言绝句、七言绝句等格律诗。"
    "你熟知「二四六分明」的平仄规则,懂得对仗、押韵的要求。"
)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "请以「山居」为题,写一首严格符合格律的五言律诗。\n\n请直接输出诗句。"},
]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs, max_new_tokens=256, temperature=0.7,
        top_p=0.9, top_k=50, do_sample=True, repetition_penalty=1.15,
    )
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

Generation Example

Five-character regulated verse (wuyan lüshi): 「山居」 ("Mountain Dwelling")

屋在瀑泉西,茅簷下有溪。
閉門留野鹿,分食養山雞。
桂熟長收子,蘭生不作畦。
初開洞中路,深處轉松梯。
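Whether an output matches the expected five-character regulated-verse shape (8 lines of 5 characters) can be sanity-checked with a small helper. A minimal sketch — the function name and punctuation-based splitting rule are illustrative, and this checks only line structure, not tones or rhyme:

```python
import re

def check_wulu_shape(poem: str) -> bool:
    """True if the poem has 8 lines of exactly 5 characters each,
    splitting on CJK/ASCII punctuation and whitespace."""
    lines = [s for s in re.split(r"[,,。.!?!?\s]+", poem) if s]
    return len(lines) == 8 and all(len(line) == 5 for line in lines)

poem = ("屋在瀑泉西,茅簷下有溪。閉門留野鹿,分食養山雞。"
        "桂熟長收子,蘭生不作畦。初開洞中路,深處轉松梯。")
print(check_wulu_shape(poem))  # True
```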
