Tags: Transformers · Safetensors · trl · grpo · arabic-poetry · classical-arabic · lora

Shaer-adapters-grpo-friend-v1-easyfirst

This repo is the judge-centered rerun of Shaer-AI/Shaer-adapters-grpo-friend-v1 on an easier, easy-first train slice.

Current Status As Of 2026-04-13

This repo is still the best result from the older multiplicative friend-style stage, but it has been surpassed as a training regime by later short-subset weighted-reward experiments.

The best completed GRPO stage is now Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2, and the current unlaunched next-step candidate is the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.

Place In The Story

Project sequence:

  1. Shaer-AI/Shaer-adapters: clean SFT baseline
  2. Shaer-AI/Shaer-adapters-grpo: reward-hacked historical GRPO result
  3. Shaer-AI/Shaer-adapters-grpo-vnext: stricter anti-template rerun
  4. Shaer-AI/Shaer-adapters-grpo-friend-v1: first judge-centered rerun
  5. Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst: easy-first judge-centered rerun (this repo)
  6. Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2: next weighted-reward rerun on the short 1-6 bayt subset

What Data It Used

  • base starting adapter: Shaer-AI/Shaer-adapters
  • GRPO dataset artifact: Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1 (see the loading sketch after this list)
  • source poetry dataset: Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
  • train subset: easy-first dropped-trio subset, capped at 2,000 rows per surviving meter
  • train size: 19,001 rows
  • eval bank: full 13-meter eval bank, 104 rows total
  • local run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260412_155553
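
For reference, a minimal sketch of loading the GRPO dataset artifact above; the split layout is an assumption, not stated on this card.

```python
# Minimal loading sketch for the GRPO dataset artifact listed above.
# The split layout is an assumption; print(ds) to confirm what is actually there.
from datasets import load_dataset

ds = load_dataset(
    "Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1"
)
print(ds)  # shows the available splits and columns
```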

Reward Used Here

This run kept the same friend-style multiplicative reward:

hard_gate * meter * judge_quality * arabic_floor * count_floor * repeat_floor

The intent was:

  • meter and judge as the core
  • count kept inside the reward
  • repeat and contamination softened into floors
  • hard zero only for catastrophic junk

This stage kept semantics in the optimized reward through judge_quality, but the overall aggregation was still multiplicative, which later proved too brittle overall and too permissive in some high-meter, weak-meaning cases.
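
A minimal sketch of how this multiplicative recipe reads in code, assuming all component scores lie in [0, 1]; the 0.2 floor constants and the junk condition are illustrative assumptions, not the run's actual settings.

```python
# Illustrative sketch of the friend-style multiplicative reward above.
# All component scores are assumed to lie in [0, 1]; the 0.2 floor constants
# and the junk gate are placeholders, not the run's actual values.

def friend_reward(meter: float,
                  judge_quality: float,
                  arabic_clean: float,
                  count_adherence: float,
                  repeat_score: float,
                  is_catastrophic_junk: bool) -> float:
    # hard_gate: zero the reward only for catastrophic junk.
    if is_catastrophic_junk:
        return 0.0

    # Soft floors: weak auxiliary scores shrink the reward but cannot zero it.
    arabic_floor = max(arabic_clean, 0.2)
    count_floor = max(count_adherence, 0.2)
    repeat_floor = max(repeat_score, 0.2)

    # Meter and judge remain the multiplicative core.
    return meter * judge_quality * arabic_floor * count_floor * repeat_floor
```

Read this way, both failure modes fall out of the product: a near-zero meter or judge score collapses the whole reward (the brittleness), while a mediocre judge score riding on a strong meter score still earns a sizable reward (the high-meter, weak-meaning permissiveness).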

Best Tracked Checkpoint

  • step: 800
  • eval total: 0.2375
  • eval meter: 0.6896
  • eval judge quality: 0.3920
  • eval count adherence: 0.6201
  • eval arabic clean: 0.9423
  • eval repeat penalty: 0.6619

Trained-meters-only snapshot at the best checkpoint:

  • total: 0.2682
  • meter: 0.8028
  • judge quality: 0.3883
  • count adherence: 0.6018
  • joint good count: 8 / 80

What This Run Proved

This run was clearly healthier than friend-v1:

  • meter recovered strongly on the trained meters
  • the run found some genuinely promising samples
  • the easy-first slice was much more learnable

But it still was not the final answer:

  • judge quality stayed too low overall
  • top-ranked samples could still look bad to humans
  • high-meter / low-quality behavior was reduced, not solved

Current Interpretation

For the paper story, this repo is the best result from the multiplicative judge-centered era. It proved that the judge-centered direction could become healthy on a friendlier train slice, but it also made the next problem obvious: semantic weakness could still survive behind strong structure.

Why We Moved On

This repo directly motivated the next redesign:

  • move from the multiplicative floor recipe to a smoother weighted reward (see the sketch after this list)
  • keep meter strongest
  • keep count inside the optimized reward but lighter
  • keep judge focused on meaning, non-garbage, and relevance
  • keep repeat as a soft guardrail
  • train on the short easy-first 1k per meter no-trio subset
  • evaluate on a new short-only no-trio eval bank
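
A minimal sketch of the smoother weighted aggregation referenced in the first item above; the weights are illustrative assumptions that encode the stated priorities, not the constants of the published v2 run.

```python
# Illustrative weighted-sum aggregation replacing the multiplicative recipe.
# The weights are assumptions reflecting the priorities above (meter strongest,
# count lighter, repeat soft); they are not the actual v2 constants.

WEIGHTS = {
    "meter": 0.40,          # kept strongest
    "judge_quality": 0.30,  # meaning, non-garbage, relevance
    "count": 0.15,          # still inside the optimized reward, but lighter
    "repeat": 0.15,         # soft guardrail rather than a hard gate
}

def weighted_reward(scores: dict[str, float]) -> float:
    # Component scores are assumed to lie in [0, 1]. A weak component now
    # subtracts proportionally instead of zeroing the whole product.
    return sum(w * scores.get(name, 0.0) for name, w in WEIGHTS.items())
```

Compared with the multiplicative recipe, a weighted sum degrades gracefully: a weak judge score lowers the reward linearly instead of collapsing it, which is the smoothness this redesign was after.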

That next stage was published as Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2.

Repo Contents

  • root adapter files: the run-end adapter state
  • best-checkpoint/: exported best checkpoint from this run

Recommended Use

Use this repo as the best result from the friend-style multiplicative reward stage, but not as the final recommended Shaer GRPO model.
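
For loading it, a minimal sketch with peft follows; the base model id is a placeholder, since this card names only the starting adapter (Shaer-AI/Shaer-adapters), not the underlying base model.

```python
# Minimal loading sketch. BASE_MODEL_ID is a placeholder: this card does not
# name the base model that Shaer-AI/Shaer-adapters was trained on.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL_ID = "BASE_MODEL_ID"  # replace with the actual base model id

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)

# The repo root holds the run-end adapter state; pass subfolder="best-checkpoint"
# to load the exported best checkpoint instead.
model = PeftModel.from_pretrained(
    model, "Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst"
)
```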
