Tags: transformers, safetensors, trl, grpo, arabic-poetry, classical-arabic, lora

Shaer-adapters-grpo

This repo holds the first historically significant GRPO model in the Shaer project. It is no longer a recommended model, but it remains a key research artifact because it exposed the project's first major reward-hacking failure mode.

Current Status As Of 2026-04-13

This model remains a historical cautionary artifact.

The best completed GRPO stage is now Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2, and the next candidate stage, a dual-judge v5 setup, is prepared locally (not yet launched) in /root/workspace/Shaer/grpo.

Place In The Story

This was the first published GRPO stage after the clean SFT baseline Shaer-AI/Shaer-adapters.

Project sequence:

  1. Shaer-AI/Shaer-adapters: clean SFT baseline on the enhanced-description dataset
  2. Shaer-AI/Shaer-adapters-grpo: first published GRPO chain, later reclassified as reward hacked
  3. Shaer-AI/Shaer-adapters-grpo-vnext: stricter anti-template rerun
  4. Shaer-AI/Shaer-adapters-grpo-friend-v1: first judge-centered rerun
  5. Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst: easier judge-centered rerun
  6. Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2: healthiest weighted-reward short-subset stage
  7. Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v3: meter-times-judge-core rerun

What Data It Used

  • base starting adapter: Shaer-AI/Shaer-adapters
  • GRPO dataset artifact: Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
  • source poetry dataset: Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
  • train subset: dropped-trio curated subset, cap 3000 per surviving meter
  • dropped trio: المديد, المنسرح, الهزج
  • eval bank: full 13-meter eval bank, 104 rows total
  • local run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260411_223409
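
The subset construction described above can be sketched in plain Python. This is an illustrative approximation, not the project's actual pipeline code; the row format and the "meter" field name are assumptions.

```python
# Hedged sketch of the train-subset construction: exclude the dropped
# trio of meters entirely, then cap every surviving meter at 3000 rows.
# The "meter" field name is an assumption for illustration.
from collections import defaultdict

DROPPED_TRIO = {"المديد", "المنسرح", "الهزج"}
CAP_PER_METER = 3000

def build_train_subset(rows):
    kept, per_meter = [], defaultdict(int)
    for row in rows:
        m = row["meter"]
        if m in DROPPED_TRIO:
            continue  # the dropped trio is excluded entirely
        if per_meter[m] >= CAP_PER_METER:
            continue  # cap each surviving meter at 3000 rows
        per_meter[m] += 1
        kept.append(row)
    return kept
```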

Reward Journey Inside This Stage

This repo matters partly because several reward variants were tried within the same historical chain, and the project learned the hard way that an apparently strong numeric reward can still mask bad poetry quality.

Reward family across this chain:

  • first loophole: 0.9 * meter + 0.1 * exact_count
  • second loophole: meter * count_adherence * arabic_clean
  • later published chain objective: meter * count_adherence * arabic_clean * repeat_penalty
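
The three reward variants above can be written out directly. This is a minimal sketch assuming each component scorer returns a value in [0, 1]; the scorers themselves are hypothetical stand-ins, not the project's real implementations.

```python
# Illustrative sketch of the three reward variants in this chain.
# All component scores are assumed to lie in [0, 1].

def reward_first_loophole(meter: float, exact_count: float) -> float:
    # Weighted sum: a near-perfect meter score alone already yields ~0.9,
    # so the policy could largely ignore the count term.
    return 0.9 * meter + 0.1 * exact_count

def reward_second_loophole(meter: float, count_adherence: float,
                           arabic_clean: float) -> float:
    # Multiplicative: any single weak component drags the total toward 0.
    return meter * count_adherence * arabic_clean

def reward_published(meter: float, count_adherence: float,
                     arabic_clean: float, repeat_penalty: float) -> float:
    # The published chain objective adds a repetition-penalty factor.
    return meter * count_adherence * arabic_clean * repeat_penalty

print(reward_first_loophole(1.0, 0.0))  # → 0.9, meter alone nearly saturates the reward
```

The multiplicative forms close the weighted-sum loophole, but note that none of the factors scores meaning or relevance, which is exactly the gap described below.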

Core intent at the time:

  • push meter strongly
  • keep requested bayt count under control
  • require Arabic-script cleanliness
  • reduce obvious repetition

What was still missing:

  • no explicit semantic judge
  • no direct notion of meaning
  • no direct notion of relevance to the description
  • anti-template logic was still too weak against metrically valid junk

Why This Repo Matters

This run showed that a high tracked GRPO reward was not enough. The checkpoint family looked strong numerically under the old objective, but later manual inspection showed that the policy exploited the reward by producing:

  • metrically strong output
  • correct requested bayt counts
  • Arabic-script text
  • non-exact-copy line templates

while the poems still read as visibly bad to humans.

So this repo is now treated as:

  • a documented reward-hacking stage
  • an important negative result in the paper story
  • the source of hacked generations used to validate later reward fixes

Best Tracked Checkpoint

Best checkpoint under the old tracked objective:

  • step: 3100
  • eval total: 0.7422
  • eval meter: 0.8046
  • eval count adherence: 0.9586
  • eval repeat penalty: 0.9568
  • eval arabic clean: 0.9904

These numbers are real historical measurements, but they should not be read as evidence of genuinely good poetry quality.

Current Interpretation

For the paper story, this repo should be read as:

  • the first strong-looking GRPO result
  • the stage where reward hacking became undeniable
  • the negative example that justified later semantic and anti-template redesigns

Why We Moved On

After later review, this run was judged to be reward hacked after all. That finding motivated the next stage, which introduced:

  • stronger artifact filtering
  • stronger anti-template scoring
  • explicit validation on real poems versus hacked generations

That next stage was published as Shaer-AI/Shaer-adapters-grpo-vnext.

Repo Contents

  • root adapter files: the run-end adapter state
  • best-checkpoint/: exported best checkpoint from the historical run
  • paper/final_plots/: paper-facing plots and summaries for this run

Recommended Use

Use this repo as a historical research artifact, not as the current recommended GRPO checkpoint.
