Shaer-adapters-grpo
This repo contains the first GRPO model in the Shaer project, a historically important stage. It is no longer a recommended model, but it remains a key research artifact because it exposed the project's first major reward-hacking failure mode.
Current Status As Of 2026-04-13
This model remains a historical cautionary artifact.
The best completed GRPO stage is now Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2, and the current unlaunched next-step candidate is the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.
Place In The Story
This was the first published GRPO stage after the clean SFT baseline Shaer-AI/Shaer-adapters.
Project sequence:
- Shaer-AI/Shaer-adapters: clean fresh SFT baseline on the enhanced-description dataset
- Shaer-AI/Shaer-adapters-grpo: first published GRPO chain, later reclassified as reward hacked
- Shaer-AI/Shaer-adapters-grpo-vnext: stricter anti-template rerun
- Shaer-AI/Shaer-adapters-grpo-friend-v1: first judge-centered rerun
- Shaer-AI/Shaer-adapters-grpo-friend-v1-easy: easier judge-centered rerun
- Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2: healthiest weighted-reward short-subset stage
- Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v3: meter-times-judge-core rerun
What Data It Used
- base starting adapter: Shaer-AI/Shaer-adapters
- GRPO dataset artifact: Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
- source poetry dataset: Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
- train subset: dropped-trio curated subset, cap 3000 per surviving meter
- dropped trio: المديد, المنسرح, الهزج
- eval bank: full 13-meter eval bank, 104 rows total
- local run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260411_223409
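The dropped-trio subset construction above can be sketched as a simple filter. This is a minimal sketch, assuming each dataset row is a dict with a "meter" field; the schema and curation order are assumptions, while the three dropped meters and the 3000-per-meter cap come from this card.

```python
from collections import defaultdict

# The three meters dropped from training (from this card).
DROPPED_TRIO = {"المديد", "المنسرح", "الهزج"}
CAP_PER_METER = 3000  # cap per surviving meter (from this card)

def build_train_subset(rows):
    """Drop the trio meters and cap each surviving meter at CAP_PER_METER.

    Assumes each row is a dict with a "meter" field; the real dataset
    schema and curation order may differ.
    """
    kept, per_meter = [], defaultdict(int)
    for row in rows:
        meter = row["meter"]
        if meter in DROPPED_TRIO:
            continue  # drop the trio entirely
        if per_meter[meter] >= CAP_PER_METER:
            continue  # cap reached for this meter
        per_meter[meter] += 1
        kept.append(row)
    return kept
```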
Reward Journey Inside This Stage
This repo matters partly because several reward variants were tried within the same historical chain, and the project learned the hard way that an apparently strong numeric reward can still mask poor poetry quality.
Reward family across this chain:
- first loophole: 0.9 * meter + 0.1 * exact_count
- second loophole: meter * count_adherence * arabic_clean
- later published chain objective: meter * count_adherence * arabic_clean * repeat_penalty
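The three combination formulas above can be written as plain scoring functions. A minimal sketch, assuming each component scorer (meter, count adherence, Arabic cleanliness, repeat penalty) returns a value in [0, 1]; the scorer internals are not specified in this card, only how the components are combined.

```python
def reward_first_loophole(meter: float, exact_count: float) -> float:
    # Additive mix: a high meter score alone nearly maxes the reward,
    # which is what made this variant exploitable.
    return 0.9 * meter + 0.1 * exact_count

def reward_second_loophole(meter: float, count_adherence: float,
                           arabic_clean: float) -> float:
    # Multiplicative: any near-zero component collapses the reward.
    return meter * count_adherence * arabic_clean

def reward_published(meter: float, count_adherence: float,
                     arabic_clean: float, repeat_penalty: float) -> float:
    # Published chain objective: adds a repetition-penalty factor.
    return meter * count_adherence * arabic_clean * repeat_penalty
```

Note that none of these terms sees semantics: a metrically perfect, clean, count-correct poem scores near 1.0 regardless of meaning.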
Core intent at the time:
- push meter strongly
- keep requested bayt count under control
- require Arabic-script cleanliness
- reduce obvious repetition
What was still missing:
- no explicit semantic judge
- no direct notion of meaning
- no direct notion of relevance to the description
- anti-template logic was still too weak against metrically valid junk
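One way to see why the anti-template logic was weak: a repeat penalty based on exact surface overlap only catches literal repetition. The sketch below is hypothetical (the actual repeat_penalty implementation is not specified in this card) and uses word n-gram overlap; lightly varied template lines share few exact n-grams, so they score as non-repetitive.

```python
def repeat_penalty(lines, n=3):
    """Return 1.0 when no word n-gram repeats across lines, lower otherwise.

    Hypothetical stand-in for the chain's repeat_penalty term, shown only
    to illustrate the non-exact-copy loophole.
    """
    seen, total, repeats = set(), 0, 0
    for line in lines:
        words = line.split()
        for i in range(len(words) - n + 1):
            gram = tuple(words[i:i + n])
            total += 1
            if gram in seen:
                repeats += 1
            seen.add(gram)
    return 1.0 if total == 0 else 1.0 - repeats / total
```

An exact duplicate line is penalized, but paraphrased lines built on the same metrical skeleton can pass nearly untouched, which matches the "non-exact-copy line templates" behavior described below.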
Why This Repo Matters
This run showed that high tracked GRPO reward was not enough. The checkpoint family looked strong numerically under the old objective, but later manual inspection showed that the policy still exploited the reward with:
- metrically strong output
- correct requested bayt counts
- Arabic-script text
- non-exact-copy line templates
while still producing visibly bad poems for humans.
So this repo is now treated as:
- a documented reward-hacking stage
- an important negative result in the paper story
- the source of hacked generations used to validate later reward fixes
Best Tracked Checkpoint
Best checkpoint under the old tracked objective:
- step: 3100
- eval total: 0.7422
- eval meter: 0.8046
- eval count adherence: 0.9586
- eval repeat penalty: 0.9568
- eval arabic clean: 0.9904
These numbers are historically real, but they should not be read as proof of truly good poetry quality.
Current Interpretation
For the paper story, this repo should be read as:
- the first strong-looking GRPO result
- the stage where reward hacking became undeniable
- the negative example that justified later semantic and anti-template redesigns
Why We Moved On
After later review, this run was judged to still be reward hacked. It motivated the next stage:
- stronger artifact filtering
- stronger anti-template scoring
- explicit validation on real poems versus hacked generations
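The real-versus-hacked validation step can be sketched as a simple separation check. A minimal sketch, assuming a scalar reward function and two labeled pools of texts; the project's actual validation protocol is not detailed in this card.

```python
def reward_separates(reward_fn, real_poems, hacked_poems):
    """Check that the reward scores every real poem above every hacked one.

    A fixed reward should place its lowest real-poem score above its
    highest hacked-generation score; any overlap means the loophole
    survives the fix.
    """
    worst_real = min(reward_fn(p) for p in real_poems)
    best_hacked = max(reward_fn(p) for p in hacked_poems)
    return worst_real > best_hacked
```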
That next stage was published as Shaer-AI/Shaer-adapters-grpo-vnext.
Repo Contents
- root adapter files: the run-end adapter state
- best-checkpoint/: exported best checkpoint from the historical run
- paper/final_plots/: paper-facing plots and summaries for this run
Recommended Use
Use this repo as a historical research artifact, not as the current recommended GRPO checkpoint.
Model tree for Shaer-AI/Shaer-adapters-grpo
- base model: humain-ai/ALLaM-7B-Instruct-preview