Tags: transformers, safetensors, trl, grpo, arabic-poetry, classical-arabic, lora

Shaer-adapters-grpo-short1k-no-trio-v2

This repo is the current best completed GRPO stage in the Shaer project.

Current Status As Of 2026-04-13

This is still the best completed GRPO run in the project.

Later runs (v3 and v4) were diagnostically useful, but neither surpassed v2 as a finished model. The next-step candidate, not yet launched, is the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.

Place In The Story

Project sequence:

  1. Shaer-AI/Shaer-adapters: trusted clean SFT baseline
  2. Shaer-AI/Shaer-adapters-grpo: historically important but reward-hacked GRPO stage
  3. Shaer-AI/Shaer-adapters-grpo-vnext: first strict anti-hack rerun
  4. Shaer-AI/Shaer-adapters-grpo-friend-v1: first judge-centered GRPO run
  5. Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst: healthier multiplicative judge-centered rerun
  6. Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2: smoother weighted-reward rerun on a short easy-first subset (this repo)
  7. Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v3: meter-times-judge-core rerun on the same short subset

This repo is the first stage where the project adopted both:

  • the short no-trio train and eval regime
  • a smoother weighted reward instead of the old multiplicative floor recipe

What Data This Run Used

  • base starting adapter: Shaer-AI/Shaer-adapters
  • GRPO dataset artifact: Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
  • source poetry dataset: Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
  • train subset: short only, 1-6 bayts
  • meter coverage: dropped المديد, المنسرح, الهزج
  • sampling policy: easy-first with medium fill
  • target cap: 1000 rows per surviving meter
  • realized train size: 9955
  • eval bank: short-only no-trio, 80 rows total
  • local run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260412_221146
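
The sampling policy above can be sketched roughly as follows. This is a minimal illustration, not the run's actual pipeline (which lives in the local run dir): the column names, the `easy`/`medium` difficulty labels, and the bucketing helper are all assumptions.

```python
from collections import defaultdict

DROPPED_METERS = {"المديد", "المنسرح", "الهزج"}
CAP_PER_METER = 1000  # target cap per surviving meter
MAX_BAYTS = 6         # "short only": 1-6 bayts

def build_train_subset(rows):
    """Easy-first sampling with medium fill, capped per surviving meter."""
    buckets = defaultdict(list)
    for row in rows:
        # Drop excluded meters and anything outside the short range.
        if row["meter"] in DROPPED_METERS or not (1 <= row["num_bayts"] <= MAX_BAYTS):
            continue
        buckets[row["meter"]].append(row)

    subset = []
    for meter_rows in buckets.values():
        easy = [r for r in meter_rows if r["difficulty"] == "easy"]
        medium = [r for r in meter_rows if r["difficulty"] == "medium"]
        picked = easy[:CAP_PER_METER]
        picked += medium[:CAP_PER_METER - len(picked)]  # medium fill up to the cap
        subset.extend(picked)
    return subset
```

Under this sketch the 9955 realized rows would come from meters that could not fill their 1000-row cap from easy rows alone and topped up from medium.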

Reward Used Here

This run introduced the weighted reward:

reward_train = hard_gate * (
    0.45 * meter
  + 0.15 * count_adherence
  + 0.30 * judge_quality
  + 0.10 * repeat_soft
)

Key ideas:

  • keep meter as the strongest signal
  • keep count inside the optimized reward, but lighter
  • use a focused Arabic judge for meaning, non-garbage, and relevance
  • keep repeat as a soft guardrail
  • use only a minimal hard gate for catastrophic junk
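
The weighted reward above can be sketched per sample as follows. This is a sketch of the formula, not the project's actual training code; the assumption that every component lies in [0, 1] and that the hard gate multiplies the whole sum is taken from the formula as written.

```python
def reward_train(meter, count_adherence, judge_quality, repeat_soft, hard_gate):
    """v2 weighted GRPO reward.

    All components are assumed to be in [0, 1]; hard_gate zeroes out
    catastrophic junk regardless of the other signals.
    """
    return hard_gate * (
        0.45 * meter
        + 0.15 * count_adherence
        + 0.30 * judge_quality
        + 0.10 * repeat_soft
    )

# A metrically strong but semantically weak poem still scores fairly well:
score = reward_train(meter=0.95, count_adherence=1.0,
                     judge_quality=0.2, repeat_soft=0.9, hard_gate=1.0)
print(round(score, 4))  # 0.7275
```

This additive form is exactly what lets semantic weakness hide behind great meter, which is the motivation for the v3 change described below.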

Judge setup in this stage:

  • focused Arabic judge
  • scores meaning
  • scores whether the poem is not garbage
  • scores relevance to the description
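
One plausible way to fold the judge's three axes into a single judge_quality component is an equal-weight average. The averaging scheme and the 0-1 axis scale here are assumptions for illustration; the card does not document how the judge scores were aggregated.

```python
def judge_quality(meaning, not_garbage, relevance):
    """Aggregate the focused Arabic judge's three axes (each assumed in [0, 1])
    into one judge_quality score via an equal-weight average."""
    for score in (meaning, not_garbage, relevance):
        if not 0.0 <= score <= 1.0:
            raise ValueError("judge axis scores must be in [0, 1]")
    return (meaning + not_garbage + relevance) / 3.0
```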

Best Checkpoint

  • best checkpoint: checkpoint-3200
  • best eval total: 0.7862
  • best eval meter: 0.9395
  • best eval judge quality: 0.4445
  • best eval count adherence: 0.9919
  • best eval repeat soft: 0.9004
  • best eval hard gate: 0.9875

Final eval at step 3300:

  • total: 0.7728
  • meter: 0.9251
  • judge quality: 0.4210
  • count adherence: 0.9975
  • repeat soft: 0.9031
  • hard gate: 0.9750

What This Run Proved

This is the healthiest completed GRPO regime so far in the project:

  • meter became very strong
  • count adherence became excellent
  • contamination was much better controlled
  • repeat-soft visibly exposed template collapse

But this is still not the final flagship paper model:

  • strong semantic samples are still too sparse
  • high-meter weak poems still survive
  • semantic quality remains the main bottleneck

Why We Moved On

Manual inspection of this run made the next issue unusually clear:

  • the additive reward was healthier than the multiplicative one
  • but a poem with excellent meter, strong count adherence, and strong repeat-soft could still earn a fairly good total score even with a weak semantic judge score

That led directly to the next stage:

  • Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v3

whose main change is:

reward_train = hard_gate * (
    0.65 * (meter * judge_quality)
  + 0.20 * count_adherence
  + 0.15 * repeat_soft
)

The whole point of that next run is to make semantic weakness unable to hide behind great meter.
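
The difference can be made concrete by scoring the same hypothetical poem under both formulas (the component values are illustrative, not from the run):

```python
def reward_v2(meter, count, judge, repeat, gate=1.0):
    """This run's additive weighted reward."""
    return gate * (0.45 * meter + 0.15 * count + 0.30 * judge + 0.10 * repeat)

def reward_v3(meter, count, judge, repeat, gate=1.0):
    """v3's reward with a multiplicative meter-times-judge core."""
    return gate * (0.65 * (meter * judge) + 0.20 * count + 0.15 * repeat)

# Excellent meter, strong count and repeat, weak semantics:
m, c, j, r = 0.95, 1.0, 0.2, 0.9
print(round(reward_v2(m, c, j, r), 4))  # 0.7275 -> semantic weakness hides
print(round(reward_v3(m, c, j, r), 4))  # 0.4585 -> meter alone no longer carries it
```

Under v3, the dominant 0.65 term collapses whenever either meter or judge_quality is weak, so a high-meter, low-judge poem can no longer reach a comfortable total.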

Useful Local Artifacts

  • full run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260412_221146
  • manual eval inspection: eval_manual_inspection.md
  • run summary: run_summary.json

Recommended Use

Use this repo as the current best completed GRPO stage in the Shaer project.

For paper writing, present it as:

  • the strongest completed GRPO regime so far
  • clearly better than the older reward-hacked and multiplicative stages
  • still short of the final desired semantic quality bar