Tags: Transformers · Safetensors · trl · grpo · arabic-poetry · classical-arabic · lora

Shaer-adapters-grpo-short1k-no-trio-v3

This repo is a stopped intermediate GRPO stage in the Shaer project.

It is historically important because it introduced the meter * judge core reward, which greatly reduced the old "high-meter / low-meaning" reward-hack pattern. We stopped it early because the run made the next bottleneck very clear: the judge was still over-crediting awkward poems, so instead of continuing to train on a blurry signal, the project moved on to a stricter judge prompt and a small reward-shape fix.

Place In The Story

Project sequence:

  1. Shaer-AI/Shaer-adapters: clean SFT base
  2. Shaer-AI/Shaer-adapters-grpo: historically important but reward-hacked early GRPO stage
  3. Shaer-AI/Shaer-adapters-grpo-vnext: first stricter anti-hack rerun
  4. Shaer-AI/Shaer-adapters-grpo-friend-v1: first judge-centered GRPO run
  5. Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst: healthier multiplicative friend-style rerun
  6. Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2: weighted-reward short-subset rerun
  7. Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v3: meter-times-judge-core rerun on the same short subset
  8. Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v4: next rerun after tightening the judge and gating auxiliary reward terms by judge quality

This repo is the stage that proved the project should keep the meter-times-judge idea, but sharpen the semantic signal.

Current Status As Of 2026-04-13

This stage remains important because it exposed the exact next bottleneck: the single judge still over-credited awkward poems.

The follow-up v4 gated the helper terms by judge quality and turned out too harsh online. The current, not-yet-launched next-step candidate is the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.

What Data This Run Used

  • base starting adapter: Shaer-AI/Shaer-adapters
  • GRPO dataset artifact: Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
  • source poetry dataset: Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
  • train subset: short-only, 1-6 bayts
  • meter coverage: dropped المديد (al-madid), المنسرح (al-munsarih), الهزج (al-hazaj)
  • sampling policy: easy-first with medium fill
  • target cap: 1000 rows per surviving meter
  • realized train size: 9955
  • eval bank: short-only no-trio, 80 rows total
  • local run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260413_071842

Reward Used In This Run

This run used the following reward:

reward_train = hard_gate * (
    0.65 * (meter * judge_quality)
  + 0.20 * count_adherence
  + 0.15 * repeat_soft
)
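
A minimal Python sketch of this reward shape (a sketch only; function and argument names are illustrative, and each component is assumed to be a scalar in [0, 1]):

def reward_v3(meter, judge_quality, count_adherence, repeat_soft, hard_gate):
    # meter and judge multiply, so a poem must score on both at once;
    # count and repeat are small additive helpers
    core = 0.65 * (meter * judge_quality)
    helpers = 0.20 * count_adherence + 0.15 * repeat_soft
    return hard_gate * (core + helpers)

# a metrically perfect but semantically empty poem is capped at the
# helper ceiling: with judge_quality = 0 the core term vanishes
print(reward_v3(1.0, 0.0, 1.0, 1.0, 1.0))  # 0.35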

Design intent:

  • make semantic weakness unable to hide behind excellent meter
  • keep count adherence inside the reward
  • keep repeat as a small soft guardrail
  • keep only a minimal hard gate for catastrophic junk

Judge setup in this stage:

  • Arabic judge prompt focused on:
    • meaning
    • not being garbage
    • relevance to the description
  • same core judge family as v2, but before the later stricter prompt revision
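
The judge's scoring mechanics are not included in this repo; purely as a hypothetical illustration (not the project's actual parser), a raw judge reply carrying a 0-10 rating could be normalized into the [0, 1] judge_quality used above like this:

import re

def parse_judge_quality(raw_reply: str) -> float:
    # hypothetical: pull an "N/10" rating out of the judge's text reply
    # and clamp it into [0, 1]; an unparseable reply counts as zero
    m = re.search(r"(\d+(?:\.\d+)?)\s*/\s*10", raw_reply)
    if m is None:
        return 0.0
    return min(max(float(m.group(1)) / 10.0, 0.0), 1.0)

print(parse_judge_quality("score: 7/10"))  # 0.7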

Best And Latest Eval Before Stopping

Best completed eval:

  • step: 1500
  • total: 0.5827
  • meter: 0.8153
  • judge quality: 0.4936
  • count adherence: 0.9521
  • repeat soft: 0.9100
  • hard gate: 0.9750

Latest completed eval before stopping:

  • step: 2150
  • total: 0.4049
  • meter: 0.7940
  • judge quality: 0.1570
  • count adherence: 0.9881
  • repeat soft: 0.8970
  • hard gate: 0.9750
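
As a rough sanity check, plugging these mean components back into the training reward roughly reproduces the reported totals (this assumes the eval total uses the same formula; small residuals are expected because the run aggregates per sample before averaging):

def reward_v3(meter, judge, count, repeat, gate):
    return gate * (0.65 * meter * judge + 0.20 * count + 0.15 * repeat)

# step 1500: ~0.574 vs reported 0.5827
print(reward_v3(0.8153, 0.4936, 0.9521, 0.9100, 0.9750))
# step 2150: ~0.403 vs reported 0.4049; the judge collapse from
# 0.4936 to 0.1570 accounts for nearly the entire drop in total
print(reward_v3(0.7940, 0.1570, 0.9881, 0.8970, 0.9750))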

Stop point:

  • training was stopped manually around train step 2180
  • the last completed eval was the step-2150 checkpoint shown above

What This Run Proved

This run did accomplish something important:

  • the classic "high meter, obviously bad semantics, still high total" failure mode was much better controlled than before
  • top poems were more often real-looking and less often blatant reward hacks
  • the meter-times-judge core was the right direction

Manual inspection showed that the old exploit was no longer the dominant issue.

Why We Stopped

We did not stop because the run catastrophically collapsed.

We stopped because the failure mode became narrower and easier to diagnose: the judge was still over-crediting

  • awkward poems with unnatural diction
  • lexical game-playing
  • rhyme-driven phrasing with weak meaning
  • semantically thin poems that were only half convincing

That meant the run was optimizing against a judge signal that was safer than before, but still not sharp enough.

In practice, this produced a pattern like:

  • top scores were less hacked than earlier stages
  • but too many awkward poems were still receiving comfortable judge scores
  • count and repeat could still prop up weak poems once the judge gave them a middling score
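
A hypothetical plug-in makes that prop-up concrete (illustrative numbers, not taken from the run):

# a poem the judge rates as weak-to-middling still earns a comfortable
# total under v3, mostly from the additive count and repeat helpers
meter, judge, count, repeat, gate = 0.85, 0.30, 0.99, 0.95, 1.0
core = 0.65 * meter * judge               # ~0.17
helpers = 0.20 * count + 0.15 * repeat    # ~0.34
print(gate * (core + helpers))            # ~0.51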

So the project moved on instead of continuing to spend compute on a blurry target.

What Came Next

The next stage (v4) keeps the same overall direction but fixes the two issues exposed by this run:

  1. a stricter judge prompt, which:
    • better distinguishes genuinely meaningful poems from awkward, rhyme-driven, semantically thin ones
    • penalizes lexical games and half-natural nonsense more clearly
  2. a small reward-shape fix

Instead of letting count_adherence and repeat_soft help independently, the next stage gates those helper terms by judge quality:

reward_train = hard_gate * (
    0.65 * (meter * judge_quality)
  + judge_quality * (0.20 * count_adherence + 0.15 * repeat_soft)
)

This keeps gradient flow, but makes it much harder for a bad poem to coast on count and repeat once the judge has already called it weak.
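
Revisiting the same hypothetical weak poem from above (judge = 0.30, illustrative numbers) under both shapes shows the intended effect:

meter, judge, count, repeat, gate = 0.85, 0.30, 0.99, 0.95, 1.0
v3 = gate * (0.65 * meter * judge + 0.20 * count + 0.15 * repeat)
v4 = gate * (0.65 * meter * judge + judge * (0.20 * count + 0.15 * repeat))
print(v3)  # ~0.51: helpers prop the weak poem up
print(v4)  # ~0.27: gated helpers shrink with the judge score

For a strong poem with judge quality near 1 the two shapes are nearly identical, so the fix bites exactly in the weak-poem regime.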

Recommended Interpretation

Use this repo as:

  • the run that validated the meter * judge direction
  • the run that narrowed the problem from "obvious reward hacking" to "judge permissiveness / signal sharpness"
  • an important intermediate stage, but not the final paper model

Do not present it as the final best system.

Present it as the stage that clarified the exact next fix.

Useful Local Artifacts

  • full run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260413_071842
  • high-score manual review: high_score_poems_step1750.md
  • prompt-tuning validations:
    • reward_validation_judge_v5_current
    • reward_validation_judge_v6_current
    • reward_validation_judge_v7_current
    • reward_validation_final_v4_current
  • old-hacked comparison validations:
    • reward_validation_judge_v7_oldhacked
    • reward_validation_final_v4_oldhacked