Shaer-adapters-grpo-short1k-no-trio-v4
This repo is a stopped post-v3 experiment in the Shaer project.
It exists because v3 made the next problem clear: the judge still over-credited awkward poems. v4 tried to fix that by making low-judge poems much harder to rescue through the count and repeat helper terms. That logic looked healthy in offline validation, but the online run turned out too harsh and did not learn well.
Current Status As Of 2026-04-13
This model is not the recommended next direction to continue training from.
The best completed GRPO stage is still Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2. The current unlaunched next-step candidate is the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.
Place In The Story
Project sequence:
- `Shaer-AI/Shaer-adapters`: clean SFT base
- `Shaer-AI/Shaer-adapters-grpo`: historically hacked early GRPO stage
- `Shaer-AI/Shaer-adapters-grpo-vnext`: structure-side anti-hack repair stage
- `Shaer-AI/Shaer-adapters-grpo-friend-v1`: first judge-centered run
- `Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst`: healthier multiplicative judge-centered rerun
- `Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2`: best completed weighted short-subset run
- `Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v3`: stopped meter-times-judge-core stage
- `Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v4`: stopped judge-gated-helper stage (this repo)
What Data This Run Used
- base starting adapter: `Shaer-AI/Shaer-adapters`
- GRPO dataset artifact: `Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1`
- source poetry dataset: `Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits`
- train subset: short only, 1-6 bayts
- meter coverage: dropped المديد, المنسرح, الهزج
- sampling policy: easy-first with medium fill
- target cap: 1000 rows per surviving meter
- realized train size: 9955
- eval bank: short-only no-trio, 80 rows total
- local run dir: `/root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260413_124347`
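The easy-first-with-medium-fill sampling policy above can be sketched roughly as follows. This is an illustrative reconstruction, not the project's actual code: the `difficulty` and `meter` field names and the two-bucket split are assumptions.

```python
from collections import defaultdict

def sample_easy_first(rows, cap_per_meter=1000):
    """Illustrative sketch of easy-first sampling with medium fill:
    fill each meter's quota with easy rows first, then top up with
    medium rows until the per-meter cap is reached.
    Field names ("meter", "difficulty") are assumed, not confirmed."""
    by_meter = defaultdict(lambda: {"easy": [], "medium": []})
    for row in rows:
        if row["difficulty"] in ("easy", "medium"):
            by_meter[row["meter"]][row["difficulty"]].append(row)

    picked = []
    for meter, buckets in by_meter.items():
        quota = buckets["easy"][:cap_per_meter]
        remaining = cap_per_meter - len(quota)
        quota += buckets["medium"][:remaining]
        picked.extend(quota)
    return picked
```

Under this sketch, a meter with 1200 easy rows contributes 1000 easy rows and no medium ones, while a meter with only 300 easy rows is topped up with 700 medium rows.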
Reward Used In This Run
reward_train = hard_gate * (
0.65 * (meter * judge_quality)
+ judge_quality * (0.20 * count_adherence + 0.15 * repeat_soft)
)
Why this was tried:
- keep `meter * judge` as the core from v3
- prevent low-judge poems from coasting on count and repeat
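The v4 training reward above can be written as a small function. This is a minimal sketch assuming all component scores are in [0, 1] and `hard_gate` is a validity gate; the key point is that the count and repeat helper terms are scaled by `judge_quality`, so a low-judge poem cannot coast on them.

```python
def v4_reward(meter, judge_quality, count_adherence, repeat_soft, hard_gate):
    """Sketch of the v4 training reward (components assumed in [0, 1]).
    Helper terms are gated by judge_quality: when the judge score is
    low, count and repeat contribute almost nothing."""
    core = meter * judge_quality
    helpers = judge_quality * (0.20 * count_adherence + 0.15 * repeat_soft)
    return hard_gate * (0.65 * core + helpers)
```

Note how tightly everything is coupled to the judge: at a judge score around the observed ~0.3, even perfect count and repeat scores contribute at most roughly 0.1 of reward, which is consistent with the run's total hovering near 0.2.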
Best And Final Tracked Metrics
Best eval:
- step: 50
- total: 0.2064
- meter: 0.5631
- judge: 0.2937
- count adherence: 0.9606
- repeat soft: 0.9909
- hard gate: 0.9125
Latest eval before stop:
- step: 250
- total: 0.1951
- meter: 0.4753
- judge: 0.3066
- count adherence: 0.9763
- repeat soft: 0.9909
- hard gate: 0.9250
Train stopped around:
- train step: 250
Why We Stopped
- the run did not look blatantly hacked
- but the online learning trend was poor
- meter stayed too low
- strong-sample rate was weak
- offline reward validation was healthier than the actual online training behavior
So this run is best interpreted as:
- a useful reward-shape experiment
- evidence that the v4 judge-gated helper design was too harsh in practice
What Came Next
The repo was then prepared for a dual-judge v5 candidate:
- one judge for meaning/description fit
- one judge for naturalness / anti-template quality
with the new proposed reward:
hard_gate * (
0.45 * (meter * judge_meaning_fit)
+ 0.20 * judge_naturalness
+ 0.20 * (count_adherence * judge_meaning_fit)
+ 0.15 * (repeat_soft * judge_naturalness)
)
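The proposed v5 reward above can be sketched the same way. A minimal sketch, again assuming all components are in [0, 1]: compared with v4, each helper term is gated by the judge axis it relates to (count by meaning fit, repeat by naturalness), and naturalness also earns a standalone 0.20 term, so the reward is no longer fully gated by a single judge score.

```python
def v5_reward(meter, judge_meaning_fit, judge_naturalness,
              count_adherence, repeat_soft, hard_gate):
    """Sketch of the proposed dual-judge v5 reward (components in [0, 1]).
    Each helper is gated by the judge axis it relates to, and
    naturalness contributes a standalone term."""
    return hard_gate * (
        0.45 * (meter * judge_meaning_fit)
        + 0.20 * judge_naturalness
        + 0.20 * (count_adherence * judge_meaning_fit)
        + 0.15 * (repeat_soft * judge_naturalness)
    )
```

One consequence of this shape: even with zero meaning fit, an otherwise valid poem can still earn up to 0.35 from the naturalness-linked terms, which softens the all-or-nothing judge gating that made v4 too harsh.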
Validation artifacts for that next candidate live under:
dual_judge_reward_validation/
Model tree for Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v4
Base model
humain-ai/ALLaM-7B-Instruct-preview