Shaer-adapters-grpo-short1k-no-trio-v2
This repo is the current best completed GRPO stage in the Shaer project.
Current Status As Of 2026-04-13
This is still the best completed GRPO run in the project.
Later runs v3 and v4 were useful diagnostically, but neither surpassed v2 as a finished model. The current unlaunched next-step candidate is the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.
Place In The Story
Project sequence:
- Shaer-AI/Shaer-adapters: trusted clean SFT baseline
- Shaer-AI/Shaer-adapters-grpo: historically important but reward-hacked GRPO stage
- Shaer-AI/Shaer-adapters-grpo-vnext: first strict anti-hack rerun
- Shaer-AI/Shaer-adapters-grpo-friend-v1: first judge-centered GRPO run
- Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst: healthier multiplicative judge-centered rerun
- Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2: smoother weighted-reward rerun on a short easy-first subset
- Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v3: meter-times-judge-core rerun on the same short subset
This repo is the first stage where the project adopted both:
- the short no-trio train and eval regime
- a smoother weighted reward instead of the old multiplicative floor recipe
What Data This Run Used
- base starting adapter: Shaer-AI/Shaer-adapters
- GRPO dataset artifact: Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
- source poetry dataset: Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
- train subset: short only, 1-6 bayts
- meter coverage: dropped المديد, المنسرح, الهزج
- sampling policy: easy-first with medium fill
- target cap: 1000 rows per surviving meter
- realized train size: 9955 rows
- eval bank: short-only no-trio, 80 rows total
- local run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260412_221146
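The easy-first-with-medium-fill policy above can be sketched as follows. This is an illustration only: the field names, difficulty labels, and helper function are assumptions; only the per-meter cap and the easy-then-medium fill order come from this card.

```python
def sample_per_meter(rows, cap=1000):
    """Per meter, fill the quota with 'easy' rows first, then top up with 'medium'.

    ASSUMPTION: each row is a dict with hypothetical 'meter' and 'difficulty'
    keys; the real dataset schema is not documented on this card.
    """
    by_meter = {}
    for row in rows:
        by_meter.setdefault(row["meter"], []).append(row)

    selected = []
    for meter_rows in by_meter.values():
        easy = [r for r in meter_rows if r["difficulty"] == "easy"]
        medium = [r for r in meter_rows if r["difficulty"] == "medium"]
        # Easy rows come first; medium rows only fill whatever quota remains.
        selected.extend((easy + medium)[:cap])
    return selected
```

With a cap of 1000 per surviving meter and a short-only pool, a realized size below the theoretical maximum (here 9955 rows) simply means some meters ran out of qualifying rows before hitting the cap.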
Reward Used Here
This run introduced the weighted reward:
reward_train = hard_gate * (
0.45 * meter
+ 0.15 * count_adherence
+ 0.30 * judge_quality
+ 0.10 * repeat_soft
)
Key ideas:
- keep meter as the strongest signal
- keep count inside the optimized reward, but lighter
- use a focused Arabic judge for meaning, non-garbage, and relevance
- keep repeat as a soft guardrail
- use only a minimal hard gate for catastrophic junk
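The reward above can be written as a minimal sketch (the function name and the example scores are illustrative, not from the training code; all components are assumed to lie in [0, 1]):

```python
def weighted_reward(meter, count_adherence, judge_quality, repeat_soft, hard_gate):
    """v2 training reward: a hard gate multiplying a weighted sum of soft signals."""
    return hard_gate * (
        0.45 * meter
        + 0.15 * count_adherence
        + 0.30 * judge_quality
        + 0.10 * repeat_soft
    )

# A poem with excellent meter, count, and repeat but a weak judge score
# still earns a fairly high total under this additive recipe:
r = weighted_reward(meter=0.95, count_adherence=0.99,
                    judge_quality=0.40, repeat_soft=0.90, hard_gate=1.0)
# r = 0.786
```

Note how the hard gate zeroes the whole reward for catastrophic junk, while the weights let the soft signals trade off against each other.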
Judge setup in this stage:
- focused Arabic judge
- scores meaning
- scores whether the poem is not garbage
- scores relevance to the description
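The card lists the three judge axes but does not specify how they aggregate into the judge_quality signal consumed by the reward. As an illustration only, an equal-weight mean (an assumption, not the documented rule) would look like:

```python
def judge_quality(meaning, not_garbage, relevance):
    # ASSUMPTION: equal-weight mean of the three judge axes, each in [0, 1].
    # The real aggregation used in this run is not documented on this card.
    return (meaning + not_garbage + relevance) / 3.0
```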
Best Checkpoint
- best checkpoint: checkpoint-3200
- best eval total: 0.7862
- best eval meter: 0.9395
- best eval judge quality: 0.4445
- best eval count adherence: 0.9919
- best eval repeat soft: 0.9004
- best eval hard gate: 0.9875
Final eval at step 3300:
- total: 0.7728
- meter: 0.9251
- judge quality: 0.4210
- count adherence: 0.9975
- repeat soft: 0.9031
- hard gate: 0.9750
What This Run Proved
This is the healthiest completed GRPO regime so far in the project:
- meter became very strong
- count adherence became excellent
- contamination was much better controlled
- repeat-soft visibly exposed template collapse
But this is still not the final flagship paper model:
- strong semantic samples are still too sparse
- high-meter weak poems still survive
- semantic quality remains the main bottleneck
Why We Moved On
Manual inspection of this run made the next issue unusually clear:
- the additive reward was healthier than the multiplicative one
- but a poem could still earn a fairly high total score from:
  - excellent meter
  - strong count adherence
  - strong repeat-soft
- even when the semantic judge score was weak
That led directly to the next stage:
Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v3
whose main change is:
reward_train = hard_gate * (
0.65 * (meter * judge_quality)
+ 0.20 * count_adherence
+ 0.15 * repeat_soft
)
The whole point of that next run is to make semantic weakness unable to hide behind great meter.
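The effect of that change can be sketched side by side (function names and example scores are illustrative; weights are from the two reward recipes on this card):

```python
def reward_v2(meter, count, judge, repeat, gate=1.0):
    """v2: additive mix; meter and judge contribute independently."""
    return gate * (0.45 * meter + 0.15 * count + 0.30 * judge + 0.10 * repeat)

def reward_v3(meter, count, judge, repeat, gate=1.0):
    """v3: meter * judge core; a weak judge score now drags meter down with it."""
    return gate * (0.65 * (meter * judge) + 0.20 * count + 0.15 * repeat)

# The same high-meter, semantically weak poem under both recipes:
v2 = reward_v2(meter=0.95, count=0.99, judge=0.40, repeat=0.90)  # = 0.786
v3 = reward_v3(meter=0.95, count=0.99, judge=0.40, repeat=0.90)  # = 0.580
```

Under v2 the weak judge score costs at most its 0.30 weight; under v3 it scales the dominant 0.65 term, so the same poem loses far more reward.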
Useful Local Artifacts
- full run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260412_221146
- manual eval inspection: eval_manual_inspection.md
- run summary: run_summary.json
Recommended Use
Use this repo as the current best completed GRPO stage in the Shaer project.
For paper writing, present it as:
- the strongest completed GRPO regime so far
- clearly better than the older reward-hacked and multiplicative stages
- still short of the final desired semantic quality bar
Model tree for Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2
- Base model: humain-ai/ALLaM-7B-Instruct-preview