Shaer-adapters-grpo-short1k-no-trio-v2
This repo is the current best completed GRPO stage in the Shaer project.
Current Status As Of 2026-04-13
This is still the best completed GRPO run in the project.
Later runs v3 and v4 were useful diagnostically, but neither surpassed v2 as a finished model. The current unlaunched next-step candidate is the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.
Place In The Story
Project sequence:
- Shaer-AI/Shaer-adapters: trusted clean SFT baseline
- Shaer-AI/Shaer-adapters-grpo: historically important but reward-hacked GRPO stage
- Shaer-AI/Shaer-adapters-grpo-vnext: first strict anti-hack rerun
- Shaer-AI/Shaer-adapters-grpo-friend-v1: first judge-centered GRPO run
- Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst: healthier multiplicative judge-centered rerun
- Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2: smoother weighted-reward rerun on a short easy-first subset
- Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v3: meter-times-judge-core rerun on the same short subset
This repo is the first stage where the project adopted both:
- the short no-trio train and eval regime
- a smoother weighted reward instead of the old multiplicative floor recipe
What Data This Run Used
- base starting adapter: Shaer-AI/Shaer-adapters
- GRPO dataset artifact: Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
- source poetry dataset: Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
- train subset: short only, 1-6 bayts
- meter coverage: dropped المديد, المنسرح, الهزج
- sampling policy: easy-first with medium fill
- target cap: 1000 rows per surviving meter
- realized train size: 9955 rows
- eval bank: short-only no-trio, 80 rows total
- local run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260412_221146
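The easy-first-with-medium-fill policy above can be sketched as follows. This is an illustration only: the field names, difficulty labels, and helper function are assumptions; only the per-meter cap and the easy-then-medium fill order come from this card.

```python
def sample_per_meter(rows, cap=1000):
    """Per meter, fill the quota with 'easy' rows first, then top up with 'medium'.

    ASSUMPTION: each row is a dict with hypothetical 'meter' and 'difficulty'
    keys; the real dataset schema is not documented on this card.
    """
    by_meter = {}
    for row in rows:
        by_meter.setdefault(row["meter"], []).append(row)

    selected = []
    for meter_rows in by_meter.values():
        easy = [r for r in meter_rows if r["difficulty"] == "easy"]
        medium = [r for r in meter_rows if r["difficulty"] == "medium"]
        # Easy rows come first; medium rows only fill whatever quota remains.
        selected.extend((easy + medium)[:cap])
    return selected
```

With a cap of 1000 per surviving meter and a short-only pool, a realized size below the theoretical maximum (here 9955 rows) simply means some meters ran out of qualifying rows before hitting the cap.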
Reward Used Here
This run introduced the weighted reward:
reward_train = hard_gate * (
0.45 * meter
+ 0.15 * count_adherence
+ 0.30 * judge_quality
+ 0.10 * repeat_soft
)
Key ideas:
- keep meter as the strongest signal
- keep count inside the optimized reward, but lighter
- use a focused Arabic judge for meaning, non-garbage, and relevance
- keep repeat as a soft guardrail
- use only a minimal hard gate for catastrophic junk
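The reward above can be written as a minimal sketch (the function name and the example scores are illustrative, not from the training code; all components are assumed to lie in [0, 1]):

```python
def weighted_reward(meter, count_adherence, judge_quality, repeat_soft, hard_gate):
    """v2 training reward: a hard gate multiplying a weighted sum of soft signals."""
    return hard_gate * (
        0.45 * meter
        + 0.15 * count_adherence
        + 0.30 * judge_quality
        + 0.10 * repeat_soft
    )

# A poem with excellent meter, count, and repeat but a weak judge score
# still earns a fairly high total under this additive recipe:
r = weighted_reward(meter=0.95, count_adherence=0.99,
                    judge_quality=0.40, repeat_soft=0.90, hard_gate=1.0)
# r = 0.786
```

Note how the hard gate zeroes the whole reward for catastrophic junk, while the weights let the soft signals trade off against each other.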
Judge setup in this stage:
- focused Arabic judge
- scores meaning
- scores whether the poem is not garbage
- scores relevance to the description
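The card lists the three judge axes but does not specify how they aggregate into the judge_quality signal consumed by the reward. As an illustration only, an equal-weight mean (an assumption, not the documented rule) would look like:

```python
def judge_quality(meaning, not_garbage, relevance):
    # ASSUMPTION: equal-weight mean of the three judge axes, each in [0, 1].
    # The real aggregation used in this run is not documented on this card.
    return (meaning + not_garbage + relevance) / 3.0
```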
Best Checkpoint
- best checkpoint: checkpoint-3200
- best eval total: 0.7862
- best eval meter: 0.9395
- best eval judge quality: 0.4445
- best eval count adherence: 0.9919
- best eval repeat soft: 0.9004
- best eval hard gate: 0.9875
Final eval at step 3300:
- total: 0.7728
- meter: 0.9251
- judge quality: 0.4210
- count adherence: 0.9975
- repeat soft: 0.9031
- hard gate: 0.9750
What This Run Proved
This is the healthiest completed GRPO regime so far in the project:
- meter became very strong
- count adherence became excellent
- contamination was much better controlled
- repeat-soft visibly exposed template collapse
But this is still not the final flagship paper model:
- strong semantic samples are still too sparse
- high-meter weak poems still survive
- semantic quality remains the main bottleneck
Why We Moved On
Manual inspection of this run made the next issue unusually clear:
- the additive reward was healthier than the multiplicative one
- but a poem could still earn a fairly high total score from:
  - excellent meter
  - strong count adherence
  - strong repeat-soft
- even when the semantic judge score was weak
That led directly to the next stage:
Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v3
whose main change is:
reward_train = hard_gate * (
0.65 * (meter * judge_quality)
+ 0.20 * count_adherence
+ 0.15 * repeat_soft
)
The whole point of that next run is to make semantic weakness unable to hide behind great meter.
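The effect of that change can be sketched side by side (function names and example scores are illustrative; weights are from the two reward recipes on this card):

```python
def reward_v2(meter, count, judge, repeat, gate=1.0):
    """v2: additive mix; meter and judge contribute independently."""
    return gate * (0.45 * meter + 0.15 * count + 0.30 * judge + 0.10 * repeat)

def reward_v3(meter, count, judge, repeat, gate=1.0):
    """v3: meter * judge core; a weak judge score now drags meter down with it."""
    return gate * (0.65 * (meter * judge) + 0.20 * count + 0.15 * repeat)

# The same high-meter, semantically weak poem under both recipes:
v2 = reward_v2(meter=0.95, count=0.99, judge=0.40, repeat=0.90)  # = 0.786
v3 = reward_v3(meter=0.95, count=0.99, judge=0.40, repeat=0.90)  # = 0.580
```

Under v2 the weak judge score costs at most its 0.30 weight; under v3 it scales the dominant 0.65 term, so the same poem loses far more reward.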
Useful Local Artifacts
- full run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260412_221146
- manual eval inspection: eval_manual_inspection.md
- run summary: run_summary.json
Recommended Use
Use this repo as the current best completed GRPO stage in the Shaer project.
For paper writing, present it as:
- the strongest completed GRPO regime so far
- clearly better than the older reward-hacked and multiplicative stages
- still short of the final desired semantic quality bar
Model tree for Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2
- Base model: humain-ai/ALLaM-7B-Instruct-preview