Tags: transformers, safetensors, trl, grpo, arabic-poetry, classical-arabic, lora

Shaer-adapters-grpo

This repo holds the first historically significant GRPO model in the Shaer project. It is no longer a recommended model, but it remains a key research artifact because it exposed the project's first major reward-hacking failure mode.

Current Status As Of 2026-04-13

This model remains a historical cautionary artifact.

The best completed GRPO stage is now Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2, and the next candidate stage, a dual-judge v5 setup, is prepared locally (not yet launched) in /root/workspace/Shaer/grpo.

Place In The Story

This was the first published GRPO stage after the clean SFT baseline Shaer-AI/Shaer-adapters.

Project sequence:

  1. Shaer-AI/Shaer-adapters: clean SFT baseline on the enhanced-description dataset
  2. Shaer-AI/Shaer-adapters-grpo: first published GRPO chain, later reclassified as reward hacked
  3. Shaer-AI/Shaer-adapters-grpo-vnext: stricter anti-template rerun
  4. Shaer-AI/Shaer-adapters-grpo-friend-v1: first judge-centered rerun
  5. Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst: easier judge-centered rerun
  6. Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2: healthiest weighted-reward short-subset stage
  7. Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v3: meter-times-judge-core rerun

What Data It Used

  • base starting adapter: Shaer-AI/Shaer-adapters
  • GRPO dataset artifact: Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
  • source poetry dataset: Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
  • train subset: dropped-trio curated subset, cap 3000 per surviving meter
  • dropped trio: المديد, المنسرح, الهزج
  • eval bank: full 13-meter eval bank, 104 rows total
  • local run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260411_223409
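
The subset construction described above can be sketched in plain Python. This is an illustrative approximation, not the project's actual pipeline code; the row format and the "meter" field name are assumptions.

```python
# Hedged sketch of the train-subset construction: exclude the dropped
# trio of meters entirely, then cap every surviving meter at 3000 rows.
# The "meter" field name is an assumption for illustration.
from collections import defaultdict

DROPPED_TRIO = {"المديد", "المنسرح", "الهزج"}
CAP_PER_METER = 3000

def build_train_subset(rows):
    kept, per_meter = [], defaultdict(int)
    for row in rows:
        m = row["meter"]
        if m in DROPPED_TRIO:
            continue  # the dropped trio is excluded entirely
        if per_meter[m] >= CAP_PER_METER:
            continue  # cap each surviving meter at 3000 rows
        per_meter[m] += 1
        kept.append(row)
    return kept
```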

Reward Journey Inside This Stage

This repo matters partly because several reward variants were tried within the same historical chain, and the project learned the hard way that an apparently strong numeric reward can still mask bad poetry quality.

Reward family across this chain:

  • first loophole: 0.9 * meter + 0.1 * exact_count
  • second loophole: meter * count_adherence * arabic_clean
  • later published chain objective: meter * count_adherence * arabic_clean * repeat_penalty
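
The three reward variants above can be written out directly. This is a minimal sketch assuming each component scorer returns a value in [0, 1]; the scorers themselves are hypothetical stand-ins, not the project's real implementations.

```python
# Illustrative sketch of the three reward variants in this chain.
# All component scores are assumed to lie in [0, 1].

def reward_first_loophole(meter: float, exact_count: float) -> float:
    # Weighted sum: a near-perfect meter score alone already yields ~0.9,
    # so the policy could largely ignore the count term.
    return 0.9 * meter + 0.1 * exact_count

def reward_second_loophole(meter: float, count_adherence: float,
                           arabic_clean: float) -> float:
    # Multiplicative: any single weak component drags the total toward 0.
    return meter * count_adherence * arabic_clean

def reward_published(meter: float, count_adherence: float,
                     arabic_clean: float, repeat_penalty: float) -> float:
    # The published chain objective adds a repetition-penalty factor.
    return meter * count_adherence * arabic_clean * repeat_penalty

print(reward_first_loophole(1.0, 0.0))  # → 0.9, meter alone nearly saturates the reward
```

The multiplicative forms close the weighted-sum loophole, but note that none of the factors scores meaning or relevance, which is exactly the gap described below.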

Core intent at the time:

  • push meter strongly
  • keep requested bayt count under control
  • require Arabic-script cleanliness
  • reduce obvious repetition

What was still missing:

  • no explicit semantic judge
  • no direct notion of meaning
  • no direct notion of relevance to the description
  • anti-template logic was still too weak against metrically valid junk

Why This Repo Matters

This run showed that a high tracked GRPO reward was not enough. The checkpoint family looked strong numerically under the old objective, but later manual inspection showed that the policy exploited the reward by producing:

  • metrically strong output
  • correct requested bayt counts
  • Arabic-script text
  • non-exact-copy line templates

while the poems still read as visibly bad to humans.

So this repo is now treated as:

  • a documented reward-hacking stage
  • an important negative result in the paper story
  • the source of hacked generations used to validate later reward fixes

Best Tracked Checkpoint

Best checkpoint under the old tracked objective:

  • step: 3100
  • eval total: 0.7422
  • eval meter: 0.8046
  • eval count adherence: 0.9586
  • eval repeat penalty: 0.9568
  • eval arabic clean: 0.9904

These numbers are real historical measurements, but they should not be read as evidence of genuinely good poetry quality.

Current Interpretation

For the paper story, this repo should be read as:

  • the first strong-looking GRPO result
  • the stage where reward hacking became undeniable
  • the negative example that justified later semantic and anti-template redesigns

Why We Moved On

After later review, this run was judged to be reward hacked after all. That finding motivated the next stage, which introduced:

  • stronger artifact filtering
  • stronger anti-template scoring
  • explicit validation on real poems versus hacked generations

That next stage was published as Shaer-AI/Shaer-adapters-grpo-vnext.

Repo Contents

  • root adapter files: the run-end adapter state
  • best-checkpoint/: exported best checkpoint from the historical run
  • paper/final_plots/: paper-facing plots and summaries for this run

Recommended Use

Use this repo as a historical research artifact, not as the current recommended GRPO checkpoint.
