# Shaer-adapters-grpo-friend-v1-easyfirst
This repo is the easy-first, judge-centered rerun that followed Shaer-AI/Shaer-adapters-grpo-friend-v1.
## Current Status As Of 2026-04-13
This repo is still the best result from the older multiplicative friend-style stage, but it has been surpassed as a training regime by later short-subset weighted-reward experiments.
The best completed GRPO stage is now Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2, and the current unlaunched next-step candidate is the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.
## Place In The Story
Project sequence:

- Shaer-AI/Shaer-adapters: clean SFT baseline
- Shaer-AI/Shaer-adapters-grpo: reward-hacked historical GRPO result
- Shaer-AI/Shaer-adapters-grpo-vnext: stricter anti-template rerun
- Shaer-AI/Shaer-adapters-grpo-friend-v1: first judge-centered rerun
- Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst: easy-first judge-centered rerun (this repo)
- Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2: next weighted-reward rerun on the short 1-6-bayt subset
## What Data It Used
- base starting adapter: Shaer-AI/Shaer-adapters
- GRPO dataset artifact: Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
- source poetry dataset: Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
- train subset: easy-first dropped-trio subset, cap 2000 per surviving meter
- train size: 19001
- eval bank: full 13-meter eval bank, 104 rows total
- local run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260412_155553
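To inspect the training data, here is a minimal loading sketch, assuming the GRPO dataset artifact loads as a standard Hugging Face datasets repo; the split name is an assumption, so check the repo's actual configuration.

```python
from datasets import load_dataset

# "train" split name is an assumption; the easy-first dropped-trio
# subset used in this run had 19001 rows after the 2000-per-meter cap.
ds = load_dataset(
    "Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1",
    split="train",
)
print(len(ds), ds.column_names)
```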
## Reward Used Here
This run kept the same friend-style multiplicative reward:
`hard_gate * meter * judge_quality * arabic_floor * count_floor * repeat_floor`
The intent was:
- meter and judge as the core
- count kept inside the reward
- repeat and contamination softened into floors
- hard zero only for catastrophic junk
This stage kept semantics in the optimized reward through judge_quality, but the overall aggregation was still multiplicative, which later turned out to be too brittle and too permissive in some high-meter weak-meaning cases.
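For concreteness, here is a minimal sketch of this aggregation shape. The component names follow the formula above, but the floor value and the squashing used to turn raw scores into floors are illustrative assumptions, not this run's actual constants.

```python
def friend_style_reward(
    hard_gate: float,        # 0.0 only for catastrophic junk, else 1.0
    meter: float,            # meter adherence in [0, 1]
    judge_quality: float,    # judge semantic-quality score in [0, 1]
    arabic_clean: float,     # remaining raw component scores in [0, 1]
    count_adherence: float,
    repeat_score: float,
    floor: float = 0.5,      # illustrative floor, not the run's constant
) -> float:
    def soften(x: float) -> float:
        # Squash [0, 1] into [floor, 1]: the term can dampen the
        # reward but never zero it out on its own.
        return floor + (1.0 - floor) * x

    return (
        hard_gate
        * meter
        * judge_quality
        * soften(arabic_clean)
        * soften(count_adherence)
        * soften(repeat_score)
    )
```

This shape makes the failure mode easy to see: only `hard_gate` can hard-zero the reward, and within a sampled group a strong `meter` term can still leave a competitive product despite a mediocre `judge_quality`, which is the high-meter weak-meaning leak described above.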
## Best Tracked Checkpoint
- step: 800
- eval total: 0.2375
- eval meter: 0.6896
- eval judge quality: 0.3920
- eval count adherence: 0.6201
- eval arabic clean: 0.9423
- eval repeat penalty: 0.6619
Trained-meters-only snapshot at the best checkpoint:
- total: 0.2682
- meter: 0.8028
- judge quality: 0.3883
- count adherence: 0.6018
- joint good count: 8 / 80
## What This Run Proved
This run was clearly healthier than friend-v1:
- meter recovered strongly on the trained meters
- the run found some genuinely promising samples
- the easy-first slice was much more learnable
But it still was not the final answer:
- judge quality stayed too low overall
- top-ranked samples could still look bad to humans
- high-meter / low-quality behavior was reduced, not solved
## Current Interpretation
For the paper story, this repo is the best result from the multiplicative judge-centered era. It proved that the judge-centered direction could become healthy on a friendlier train slice, but it also made the next problem obvious: semantic weakness could still survive behind strong structure.
## Why We Moved On
This repo directly motivated the next redesign:
- move from the multiplicative floor recipe to a smoother weighted reward (see the sketch after this list)
- keep meter strongest
- keep count inside the optimized reward but lighter
- keep judge focused on meaning, non-garbage, and relevance
- keep repeat as a soft guardrail
- train on the short easy-first 1k-per-meter no-trio subset
- evaluate on a new short-only no-trio eval bank
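As a contrast with the multiplicative sketch above, here is a minimal sketch of the weighted shape this motivated. The weights are illustrative assumptions chosen to match the stated priorities (meter strongest, count lighter, repeat soft); they are not the published short1k-no-trio-v2 constants.

```python
def weighted_reward(
    meter: float,
    judge_quality: float,
    count_adherence: float,
    repeat_score: float,
) -> float:
    # Illustrative weights only: meter strongest, judge next,
    # count kept inside the optimized reward but lighter, repeat soft.
    weights = {"meter": 0.45, "judge": 0.30, "count": 0.15, "repeat": 0.10}
    return (
        weights["meter"] * meter
        + weights["judge"] * judge_quality
        + weights["count"] * count_adherence
        + weights["repeat"] * repeat_score
    )
```

Because the terms add instead of multiply, a single weak component lowers the reward proportionally rather than collapsing it, which addresses the brittleness noted for the multiplicative recipe.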
That next stage was published as Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2.
## Repo Contents
- root adapter files: the run-end adapter state
- `best-checkpoint/`: exported best checkpoint from this run
## Recommended Use
Use this repo as the best result from the friend-style multiplicative reward stage, but not as the final recommended Shaer GRPO model.
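For inference, here is a minimal loading sketch, assuming a standard transformers + peft setup; the `subfolder` argument for the exported best checkpoint is an assumption about this repo's layout.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "humain-ai/ALLaM-7B-Instruct-preview"
ADAPTER = "Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst"

tokenizer = AutoTokenizer.from_pretrained(BASE)
# device_map="auto" requires the accelerate package.
base_model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype="auto", device_map="auto"
)

# Run-end adapter state from the repo root:
model = PeftModel.from_pretrained(base_model, ADAPTER)
# Or the exported best checkpoint (assumed layout):
# model = PeftModel.from_pretrained(base_model, ADAPTER, subfolder="best-checkpoint")
```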
## Base Model

humain-ai/ALLaM-7B-Instruct-preview