Shaer-adapters-grpo-friend-v1
This repo is the first judge-centered GRPO rerun in the Shaer project.
Current Status As Of 2026-04-13
This repo remains the first important judge-in-the-loop experiment, but it is not the current recommended GRPO stage.
The best completed GRPO stage is Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2, and the current unlaunched next-step candidate is the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.
Place In The Story
Project sequence:
- Shaer-AI/Shaer-adapters: clean SFT baseline
- Shaer-AI/Shaer-adapters-grpo: reward-hacked historical GRPO result
- Shaer-AI/Shaer-adapters-grpo-vnext: stricter anti-template rerun
- Shaer-AI/Shaer-adapters-grpo-friend-v1: first run where semantic judge pressure moved into the optimized reward
- Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst: easier judge-centered rerun on a friendlier train slice
What Data It Used
- base starting adapter: Shaer-AI/Shaer-adapters
- GRPO dataset artifact: Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
- source poetry dataset: Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
- train subset: dropped-trio curated subset, cap 3000 per surviving meter
- eval bank: full 13-meter eval bank, 104 rows total
- local run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260412_152315
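The "cap 3000 per surviving meter" curation step above can be sketched as follows. This is an illustrative sketch, not the actual pipeline code: the row schema, the `meter` field name, and the random-sample tie-breaking are all assumptions.

```python
import random
from collections import defaultdict

def cap_per_meter(rows, cap=3000, seed=0):
    """Keep at most `cap` rows per meter (hypothetical curation helper).

    `rows` is assumed to be a list of dicts with a "meter" key; the
    real dataset schema is not documented here.
    """
    by_meter = defaultdict(list)
    for row in rows:
        by_meter[row["meter"]].append(row)

    rng = random.Random(seed)  # fixed seed for a reproducible subset
    capped = []
    for meter, group in by_meter.items():
        if len(group) > cap:
            group = rng.sample(group, cap)  # downsample oversized meters
        capped.extend(group)
    return capped
```

Meters with fewer than `cap` rows pass through untouched, so only the oversized meters are downsampled.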
Reward Used Here
This run used the first friend-style reward:
hard_gate * meter * judge_quality * arabic_floor * count_floor * repeat_floor
Core idea:
- keep strong pressure on meter
- use a focused Arabic judge for meaning and relevance
- soften count, Arabic, and repeat penalties into floors
- hard-zero only catastrophic junk
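A minimal sketch of the multiplicative reward above, assuming every term is a score in [0, 1]. The floor value and the exact hard-gate semantics are assumptions: the text only says the count, Arabic, and repeat penalties are softened into floors and that only catastrophic junk is hard-zeroed.

```python
def friend_reward(hard_gate: float, meter: float, judge_quality: float,
                  arabic: float, count: float, repeat: float,
                  floor: float = 0.3) -> float:
    """Friend-style multiplicative reward with floored penalty terms.

    `floor=0.3` is a hypothetical value chosen for illustration.
    """
    if hard_gate == 0.0:
        return 0.0  # hard-zero only catastrophic junk

    # Soften the penalty terms into floors so they cannot collapse
    # the whole product on their own.
    arabic_floor = max(arabic, floor)
    count_floor = max(count, floor)
    repeat_floor = max(repeat, floor)

    # meter and judge_quality stay unfloored: full pressure remains on them.
    return meter * judge_quality * arabic_floor * count_floor * repeat_floor
```

Because only the meter and judge terms can drive the product toward zero, the optimizer's gradient pressure concentrates on exactly the two signals this run cared about.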
Judge intent in this stage:
- score whether the poem has meaning
- score whether it is not garbage
- score whether it is relevant to the description
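One plausible way to fold the three judge criteria above into a single `judge_quality` term is an equal-weight average. This aggregation is an assumption for illustration, not the documented judge implementation:

```python
def judge_quality(meaning: float, not_garbage: float, relevance: float) -> float:
    # Hypothetical aggregation: equal-weight mean of the three judge
    # criteria, each assumed to be a score in [0, 1].
    return (meaning + not_garbage + relevance) / 3.0
```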
Best Tracked Checkpoint
- step: 50
- eval total: 0.1080
- eval meter: 0.3143
- eval count adherence: 0.9764
- eval judge quality: 0.3617
- eval repeat penalty: 0.9708
- eval arabic clean: 0.9231
What This Run Proved
This run showed that the semantic-judge direction was conceptually right, but the setup was still too hard for the model on this train slice.
Observed pattern:
- count and anti-repeat terms stayed high
- meter stayed too weak
- the run peaked almost immediately and did not become a healthy training trajectory
Current Interpretation
For the paper story, this repo is the first run where semantic quality was explicitly brought into the optimized GRPO objective. It is important because it marked the right direction, even though the optimization setup was still too brittle to make the direction work well yet.
Why We Moved On
The next fix was not to abandon the judge, but to make the optimization problem easier:
- switch to an easy-first subset
- reduce the amount of difficult data in the active train mix
- keep the same friend-style reward and see whether meter could recover
That next stage was published as Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst.
Recommended Use
Use this repo as the first semantic-judge GRPO experiment, not as the recommended checkpoint for downstream generation.
Model tree for Shaer-AI/Shaer-adapters-grpo-friend-v1
- Base model: humain-ai/ALLaM-7B-Instruct-preview