# Shaer-adapters-grpo-friend-v1-easyfirst
This repo is the easy-first, judge-centered rerun that followed Shaer-AI/Shaer-adapters-grpo-friend-v1.
## Current Status As Of 2026-04-13
This repo is still the best result from the older multiplicative friend-style stage, but it has been surpassed as a training regime by later short-subset weighted-reward experiments.
The best completed GRPO stage is now Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2, and the current unlaunched next-step candidate is the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.
## Place In The Story
Project sequence:

- Shaer-AI/Shaer-adapters: clean SFT baseline
- Shaer-AI/Shaer-adapters-grpo: reward-hacked historical GRPO result
- Shaer-AI/Shaer-adapters-grpo-vnext: stricter anti-template rerun
- Shaer-AI/Shaer-adapters-grpo-friend-v1: first judge-centered rerun
- Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst: easy-first judge-centered rerun (this repo)
- Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2: next weighted-reward rerun on the short 1-6-bayt subset
## What Data It Used
- base starting adapter: Shaer-AI/Shaer-adapters
- GRPO dataset artifact: Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
- source poetry dataset: Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
- train subset: easy-first dropped-trio subset, cap 2000 per surviving meter
- train size: 19001
- eval bank: full 13-meter eval bank, 104 rows total
- local run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260412_155553
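To inspect the training data, here is a minimal loading sketch, assuming the GRPO dataset artifact loads as a standard Hugging Face datasets repo; the split name is an assumption, so check the repo's actual configuration.

```python
from datasets import load_dataset

# "train" split name is an assumption; the easy-first dropped-trio
# subset used in this run had 19001 rows after the 2000-per-meter cap.
ds = load_dataset(
    "Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1",
    split="train",
)
print(len(ds), ds.column_names)
```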
## Reward Used Here
This run kept the same friend-style multiplicative reward:
`hard_gate * meter * judge_quality * arabic_floor * count_floor * repeat_floor`
The intent was:
- meter and judge as the core
- count kept inside the reward
- repeat and contamination softened into floors
- hard zero only for catastrophic junk
This stage kept semantics in the optimized reward through judge_quality, but the overall aggregation was still multiplicative, which later turned out to be too brittle and too permissive in some high-meter weak-meaning cases.
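For concreteness, here is a minimal sketch of this aggregation shape. The component names follow the formula above, but the floor value and the squashing used to turn raw scores into floors are illustrative assumptions, not this run's actual constants.

```python
def friend_style_reward(
    hard_gate: float,        # 0.0 only for catastrophic junk, else 1.0
    meter: float,            # meter adherence in [0, 1]
    judge_quality: float,    # judge semantic-quality score in [0, 1]
    arabic_clean: float,     # remaining raw component scores in [0, 1]
    count_adherence: float,
    repeat_score: float,
    floor: float = 0.5,      # illustrative floor, not the run's constant
) -> float:
    def soften(x: float) -> float:
        # Squash [0, 1] into [floor, 1]: the term can dampen the
        # reward but never zero it out on its own.
        return floor + (1.0 - floor) * x

    return (
        hard_gate
        * meter
        * judge_quality
        * soften(arabic_clean)
        * soften(count_adherence)
        * soften(repeat_score)
    )
```

This shape makes the failure mode easy to see: only `hard_gate` can hard-zero the reward, and within a sampled group a strong `meter` term can still leave a competitive product despite a mediocre `judge_quality`, which is the high-meter weak-meaning leak described above.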
## Best Tracked Checkpoint
- step: 800
- eval total: 0.2375
- eval meter: 0.6896
- eval judge quality: 0.3920
- eval count adherence: 0.6201
- eval arabic clean: 0.9423
- eval repeat penalty: 0.6619
Trained-meters-only snapshot at the best checkpoint:
- total: 0.2682
- meter: 0.8028
- judge quality: 0.3883
- count adherence: 0.6018
- joint good count: 8 / 80
## What This Run Proved
This run was clearly healthier than friend-v1:
- meter recovered strongly on the trained meters
- the run found some genuinely promising samples
- the easy-first slice was much more learnable
But it still was not the final answer:
- judge quality stayed too low overall
- top-ranked samples could still look bad to humans
- high-meter / low-quality behavior was reduced, not solved
## Current Interpretation
For the paper story, this repo is the best result from the multiplicative judge-centered era. It proved that the judge-centered direction could become healthy on a friendlier train slice, but it also made the next problem obvious: semantic weakness could still survive behind strong structure.
## Why We Moved On
This repo directly motivated the next redesign:
- move from the multiplicative floor recipe to a smoother weighted reward (see the sketch after this list)
- keep meter strongest
- keep count inside the optimized reward but lighter
- keep judge focused on meaning, non-garbage, and relevance
- keep repeat as a soft guardrail
- train on the short easy-first 1k-per-meter no-trio subset
- evaluate on a new short-only no-trio eval bank
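As a contrast with the multiplicative sketch above, here is a minimal sketch of the weighted shape this motivated. The weights are illustrative assumptions chosen to match the stated priorities (meter strongest, count lighter, repeat soft); they are not the published short1k-no-trio-v2 constants.

```python
def weighted_reward(
    meter: float,
    judge_quality: float,
    count_adherence: float,
    repeat_score: float,
) -> float:
    # Illustrative weights only: meter strongest, judge next,
    # count kept inside the optimized reward but lighter, repeat soft.
    weights = {"meter": 0.45, "judge": 0.30, "count": 0.15, "repeat": 0.10}
    return (
        weights["meter"] * meter
        + weights["judge"] * judge_quality
        + weights["count"] * count_adherence
        + weights["repeat"] * repeat_score
    )
```

Because the terms add instead of multiply, a single weak component lowers the reward proportionally rather than collapsing it, which addresses the brittleness noted for the multiplicative recipe.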
That next stage was published as Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2.
## Repo Contents
- root adapter files: the run-end adapter state
- `best-checkpoint/`: exported best checkpoint from this run
## Recommended Use
Use this repo as the best result from the friend-style multiplicative reward stage, but not as the final recommended Shaer GRPO model.
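For inference, here is a minimal loading sketch, assuming a standard transformers + peft setup; the `subfolder` argument for the exported best checkpoint is an assumption about this repo's layout.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "humain-ai/ALLaM-7B-Instruct-preview"
ADAPTER = "Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst"

tokenizer = AutoTokenizer.from_pretrained(BASE)
# device_map="auto" requires the accelerate package.
base_model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype="auto", device_map="auto"
)

# Run-end adapter state from the repo root:
model = PeftModel.from_pretrained(base_model, ADAPTER)
# Or the exported best checkpoint (assumed layout):
# model = PeftModel.from_pretrained(base_model, ADAPTER, subfolder="best-checkpoint")
```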
## Base Model

humain-ai/ALLaM-7B-Instruct-preview