Shaer-adapters-grpo-friend-v1
This repo is the first judge-centered GRPO rerun in the Shaer project.
Current Status As Of 2026-04-13
This repo remains the first important judge-in-the-loop experiment, but it is not the current recommended GRPO stage.
The best completed GRPO stage is Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2, and the current unlaunched next-step candidate is the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.
Place In The Story
Project sequence:
- Shaer-AI/Shaer-adapters: clean SFT baseline
- Shaer-AI/Shaer-adapters-grpo: reward-hacked historical GRPO result
- Shaer-AI/Shaer-adapters-grpo-vnext: stricter anti-template rerun
- Shaer-AI/Shaer-adapters-grpo-friend-v1: first run where semantic judge pressure moved into the optimized reward
- Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst: easier judge-centered rerun on a friendlier train slice
What Data It Used
- base starting adapter: Shaer-AI/Shaer-adapters
- GRPO dataset artifact: Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
- source poetry dataset: Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
- train subset: dropped-trio curated subset, cap 3000 per surviving meter
- eval bank: full 13-meter eval bank, 104 rows total
- local run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260412_152315
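The "cap 3000 per surviving meter" curation step above can be sketched as follows. This is an illustrative sketch, not the actual pipeline code: the row schema, the `meter` field name, and the random-sample tie-breaking are all assumptions.

```python
import random
from collections import defaultdict

def cap_per_meter(rows, cap=3000, seed=0):
    """Keep at most `cap` rows per meter (hypothetical curation helper).

    `rows` is assumed to be a list of dicts with a "meter" key; the
    real dataset schema is not documented here.
    """
    by_meter = defaultdict(list)
    for row in rows:
        by_meter[row["meter"]].append(row)

    rng = random.Random(seed)  # fixed seed for a reproducible subset
    capped = []
    for meter, group in by_meter.items():
        if len(group) > cap:
            group = rng.sample(group, cap)  # downsample oversized meters
        capped.extend(group)
    return capped
```

Meters with fewer than `cap` rows pass through untouched, so only the oversized meters are downsampled.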
Reward Used Here
This run used the first friend-style reward:
hard_gate * meter * judge_quality * arabic_floor * count_floor * repeat_floor
Core idea:
- keep strong pressure on meter
- use a focused Arabic judge for meaning and relevance
- soften count, Arabic, and repeat penalties into floors
- hard-zero only catastrophic junk
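A minimal sketch of the multiplicative reward above, assuming every term is a score in [0, 1]. The floor value and the exact hard-gate semantics are assumptions: the text only says the count, Arabic, and repeat penalties are softened into floors and that only catastrophic junk is hard-zeroed.

```python
def friend_reward(hard_gate: float, meter: float, judge_quality: float,
                  arabic: float, count: float, repeat: float,
                  floor: float = 0.3) -> float:
    """Friend-style multiplicative reward with floored penalty terms.

    `floor=0.3` is a hypothetical value chosen for illustration.
    """
    if hard_gate == 0.0:
        return 0.0  # hard-zero only catastrophic junk

    # Soften the penalty terms into floors so they cannot collapse
    # the whole product on their own.
    arabic_floor = max(arabic, floor)
    count_floor = max(count, floor)
    repeat_floor = max(repeat, floor)

    # meter and judge_quality stay unfloored: full pressure remains on them.
    return meter * judge_quality * arabic_floor * count_floor * repeat_floor
```

Because only the meter and judge terms can drive the product toward zero, the optimizer's gradient pressure concentrates on exactly the two signals this run cared about.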
Judge intent in this stage:
- score whether the poem has meaning
- score whether it is not garbage
- score whether it is relevant to the description
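One plausible way to fold the three judge criteria above into a single `judge_quality` term is an equal-weight average. This aggregation is an assumption for illustration, not the documented judge implementation:

```python
def judge_quality(meaning: float, not_garbage: float, relevance: float) -> float:
    # Hypothetical aggregation: equal-weight mean of the three judge
    # criteria, each assumed to be a score in [0, 1].
    return (meaning + not_garbage + relevance) / 3.0
```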
Best Tracked Checkpoint
- step: 50
- eval total: 0.1080
- eval meter: 0.3143
- eval count adherence: 0.9764
- eval judge quality: 0.3617
- eval repeat penalty: 0.9708
- eval arabic clean: 0.9231
What This Run Proved
This run showed that the semantic-judge direction was conceptually right, but the setup was still too hard for the model on this train slice.
Observed pattern:
- count and anti-repeat terms stayed high
- meter stayed too weak
- the run peaked almost immediately and did not become a healthy training trajectory
Current Interpretation
For the paper story, this repo is the first run where semantic quality was explicitly brought into the optimized GRPO objective. It is important because it marked the right direction, even though the optimization setup was still too brittle to make the direction work well yet.
Why We Moved On
The next fix was not to abandon the judge, but to make the optimization problem easier:
- switch to an easy-first subset
- reduce the amount of difficult data in the active train mix
- keep the same friend-style reward and see whether meter could recover
That next stage was published as Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst.
Recommended Use
Use this repo as the first semantic-judge GRPO experiment, not as the recommended checkpoint for downstream generation.
Model tree for Shaer-AI/Shaer-adapters-grpo-friend-v1
- Base model: humain-ai/ALLaM-7B-Instruct-preview