# Shaer-adapters-grpo-short1k-no-trio-v3
This repo is a stopped intermediate GRPO stage in the Shaer project.
This stage is historically important because it introduced the `meter * judge` core reward, which greatly reduced the old "high-meter / low-meaning" reward-hack pattern. We stopped it early because the run made the next bottleneck very clear: the judge was still over-crediting awkward poems, so the project moved to a stricter judge prompt and a small reward-shape fix rather than continuing to train on a blurry signal.
## Place In The Story
Project sequence:
- `Shaer-AI/Shaer-adapters`: clean SFT base
- `Shaer-AI/Shaer-adapters-grpo`: historically important but reward-hacked early GRPO stage
- `Shaer-AI/Shaer-adapters-grpo-vnext`: first stricter anti-hack rerun
- `Shaer-AI/Shaer-adapters-grpo-friend-v1`: first judge-centered GRPO run
- `Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst`: healthier multiplicative friend-style rerun
- `Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2`: weighted-reward short-subset rerun
- `Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v3`: meter-times-judge-core rerun on the same short subset (this repo)
- `Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v4`: next rerun after tightening the judge and gating auxiliary reward terms by judge quality
This repo is the stage that proved the project should keep the meter-times-judge idea, but sharpen the semantic signal.
## Current Status As Of 2026-04-13
This stage remains important because it exposed the exact next bottleneck: the single judge still over-credited awkward poems.
The follow-up v4 tried judge-gated helper terms and turned out too harsh online. The current unlaunched next-step candidate is now the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.
## What Data This Run Used
- base starting adapter: `Shaer-AI/Shaer-adapters`
- GRPO dataset artifact: `Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1`
- source poetry dataset: `Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits`
- train subset: short only, 1-6 bayts
- meter coverage: dropped المديد, المنسرح, الهزج
- sampling policy: easy-first with medium fill
- target cap: 1000 rows per surviving meter
- realized train size: 9955
- eval bank: short-only no-trio, 80 rows total
- local run dir: `/root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260413_071842`
## Reward Used In This Run
This run used the following reward:

```
reward_train = hard_gate * (
    0.65 * (meter * judge_quality)
    + 0.20 * count_adherence
    + 0.15 * repeat_soft
)
```
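As a concrete illustration, the reward shape above can be written as a small Python function. The function and argument names here are assumptions for illustration only; all component scores are taken to lie in [0, 1], with `hard_gate` near 0 only for catastrophic junk:

```python
def reward_train(meter, judge_quality, count_adherence, repeat_soft, hard_gate):
    """Composite GRPO reward used in this run (v3 shape), sketched.

    The multiplicative core means a poem cannot score well on meter
    alone: a weak judge score drags the dominant term down with it.
    """
    core = meter * judge_quality  # semantic weakness cannot hide behind meter
    return hard_gate * (0.65 * core
                        + 0.20 * count_adherence
                        + 0.15 * repeat_soft)

# Strong meter but a weak judge score: the 0.65 core term collapses,
# and only the small helper terms remain.
print(round(reward_train(0.95, 0.2, 0.9, 0.9, 1.0), 4))  # → 0.4385
```

Note how a judge score of 0.0 caps the total at 0.35 regardless of meter, which is the design intent stated below.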
Design intent:
- make semantic weakness unable to hide behind excellent meter
- keep count adherence inside the reward
- keep repeat as a small soft guardrail
- keep only a minimal hard gate for catastrophic junk
Judge setup in this stage:
- Arabic judge prompt focused on:
- meaning
- not being garbage
- relevance to the description
- same core judge family as v2, but before the later stricter prompt revision
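The judge integration itself is not shown in this card. Purely as an illustration of the usual pattern, one common approach is to have the judge reply with a bounded score and then parse and normalize it into the `judge_quality` term. The regexes, the 0-10 scale, and the function name below are assumptions, not the project's actual code:

```python
import re

def parse_judge_score(judge_reply: str) -> float:
    """Extract a numeric quality score from a free-text judge reply,
    assuming a 0-10 scale, and normalize it to [0, 1].
    Returns 0.0 when no score can be found."""
    # Prefer an explicit "x/10"-style fraction anywhere in the reply.
    m = re.search(r"(\d+(?:\.\d+)?)\s*/\s*10", judge_reply)
    if m is None:
        # Fall back to "score: x" / "score = x".
        m = re.search(r"score\s*[:=]\s*(\d+(?:\.\d+)?)", judge_reply, re.IGNORECASE)
    if m is None:
        return 0.0
    return max(0.0, min(float(m.group(1)) / 10.0, 1.0))

# Arabic reply saying roughly "rating: 7/10, the meaning is clear
# and the relevance to the description is good".
print(parse_judge_score("التقييم: 7/10، المعنى واضح والصلة بالوصف جيدة"))  # → 0.7
```

The fallback-to-zero choice matters: an unparseable judge reply silently becomes a zero reward, which is usually safer than crediting an unknown.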
## Best And Latest Eval Before Stopping

Best completed eval (step 1500):

- total: 0.5827
- meter: 0.8153
- judge quality: 0.4936
- count adherence: 0.9521
- repeat soft: 0.9100
- hard gate: 0.9750

Latest completed eval before stopping (step 2150):

- total: 0.4049
- meter: 0.7940
- judge quality: 0.1570
- count adherence: 0.9881
- repeat soft: 0.8970
- hard gate: 0.9750

Stop point:

- training was stopped manually around train step 2180
- the last completed eval remained the step-2150 checkpoint above
## What This Run Proved
This run did accomplish something important:
- the classic "high meter, obviously bad semantics, still high total" failure mode was much better controlled than before
- top poems were more often real-looking and less often blatant reward hacks
- the meter-times-judge core was the right direction
Manual inspection showed that the old exploit was no longer the dominant issue.
## Why We Stopped
We did not stop because the run catastrophically collapsed.
We stopped because the failure mode became narrower and easier to diagnose:
- the judge still over-credited:
  - awkward poems with unnatural diction
  - lexical game-playing
  - rhyme-driven phrasing with weak meaning
  - semantically thin poems that were only half convincing
That meant the run was optimizing against a judge signal that was safer than before, but still not sharp enough.
In practice, this produced a pattern like:
- top scores were less hacked than earlier stages
- but too many awkward poems were still receiving comfortable judge scores
- count and repeat could still prop up weak poems once the judge gave them a middling score
So the project moved on instead of continuing to spend compute on a blurry target.
## What Came Next
The next stage (v4) keeps the same overall direction but fixes the two issues exposed by this run:
- stricter judge prompt that:
  - better distinguishes genuinely meaningful poems from awkward, rhyme-driven, semantically thin ones
  - penalizes lexical games and half-natural nonsense more clearly
- small reward-shape fix
Instead of letting count_adherence and repeat_soft help independently, the next stage gates those helper terms by judge quality:
```
reward_train = hard_gate * (
    0.65 * (meter * judge_quality)
    + judge_quality * (0.20 * count_adherence + 0.15 * repeat_soft)
)
```
This keeps gradient flow, but makes it much harder for a bad poem to coast on count and repeat once the judge has already called it weak.
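The effect of the gating is easy to see numerically. A quick comparison of the two shapes on a metrically strong but semantically weak poem (illustrative numbers; the function names are assumptions):

```python
def reward_v3(meter, jq, count, repeat, gate=1.0):
    # v3 (this run): helper terms contribute independently of the judge.
    return gate * (0.65 * meter * jq + 0.20 * count + 0.15 * repeat)

def reward_v4(meter, jq, count, repeat, gate=1.0):
    # v4: helper terms are scaled by judge quality, so a poem the judge
    # already called weak cannot coast on count and repeat.
    return gate * (0.65 * meter * jq + jq * (0.20 * count + 0.15 * repeat))

# Good meter, weak semantics (judge score 0.3), clean count/repeat.
weak = dict(meter=0.90, jq=0.30, count=0.95, repeat=0.90)
print(round(reward_v3(**weak), 4))  # ≈ 0.50: helpers prop the poem up
print(round(reward_v4(**weak), 4))  # ≈ 0.27: helpers shrink with the judge
```

Because the helper terms are multiplied by `judge_quality` rather than zeroed out below a threshold, the gradient through them is preserved, which is the point made above.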
## Recommended Interpretation
Use this repo as:
- the run that validated the `meter * judge` direction
- the run that narrowed the problem from "obvious reward hacking" to "judge permissiveness / signal sharpness"
- an important intermediate stage, but not the final paper model
Do not present it as the final best system.
Present it as the stage that clarified the next exact fix.
## Useful Local Artifacts
- full run dir: `/root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260413_071842`
- high-score manual review: `high_score_poems_step1750.md`
- prompt-tuning validations: `reward_validation_judge_v5_current`, `reward_validation_judge_v6_current`, `reward_validation_judge_v7_current`, `reward_validation_final_v4_current`
- old-hacked comparison validations: `reward_validation_judge_v7_oldhacked`, `reward_validation_final_v4_oldhacked`
## Model tree for Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v3

Base model: `humain-ai/ALLaM-7B-Instruct-preview`