Shaer-adapters-grpo-short1k-no-trio-v4
This repo is a stopped post-v3 experiment in the Shaer project.
It exists because v3 made the next problem clear: the judge still over-credited awkward poems. v4 tried to fix that by making low-judge poems much harder to rescue through the count and repeat helper terms. That logic looked healthy in offline validation, but the online run turned out too harsh and did not learn well.
Current Status As Of 2026-04-13
This model is not the recommended next direction to continue training from.
The best completed GRPO stage is still Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2. The current unlaunched next-step candidate is the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.
Place In The Story
Project sequence:
- `Shaer-AI/Shaer-adapters`: clean SFT base
- `Shaer-AI/Shaer-adapters-grpo`: historically hacked early GRPO stage
- `Shaer-AI/Shaer-adapters-grpo-vnext`: structure-side anti-hack repair stage
- `Shaer-AI/Shaer-adapters-grpo-friend-v1`: first judge-centered run
- `Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst`: healthier multiplicative judge-centered rerun
- `Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2`: best completed weighted short-subset run
- `Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v3`: stopped meter-times-judge-core stage
- `Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v4`: stopped judge-gated-helper stage (this repo)
What Data This Run Used
- base starting adapter: `Shaer-AI/Shaer-adapters`
- GRPO dataset artifact: `Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1`
- source poetry dataset: `Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits`
- train subset: short only, 1-6 bayts
- meter coverage: dropped المديد, المنسرح, الهزج
- sampling policy: easy-first with medium fill
- target cap: 1000 rows per surviving meter
- realized train size: 9955
- eval bank: short-only no-trio, 80 rows total
- local run dir: `/root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260413_124347`
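The easy-first-with-medium-fill sampling policy above can be sketched roughly as follows. This is an illustrative reconstruction, not the project's actual code: the `difficulty` and `meter` field names and the two-bucket split are assumptions.

```python
from collections import defaultdict

def sample_easy_first(rows, cap_per_meter=1000):
    """Illustrative sketch of easy-first sampling with medium fill:
    fill each meter's quota with easy rows first, then top up with
    medium rows until the per-meter cap is reached.
    Field names ("meter", "difficulty") are assumed, not confirmed."""
    by_meter = defaultdict(lambda: {"easy": [], "medium": []})
    for row in rows:
        if row["difficulty"] in ("easy", "medium"):
            by_meter[row["meter"]][row["difficulty"]].append(row)

    picked = []
    for meter, buckets in by_meter.items():
        quota = buckets["easy"][:cap_per_meter]
        remaining = cap_per_meter - len(quota)
        quota += buckets["medium"][:remaining]
        picked.extend(quota)
    return picked
```

Under this sketch, a meter with 1200 easy rows contributes 1000 easy rows and no medium ones, while a meter with only 300 easy rows is topped up with 700 medium rows.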
Reward Used In This Run
reward_train = hard_gate * (
0.65 * (meter * judge_quality)
+ judge_quality * (0.20 * count_adherence + 0.15 * repeat_soft)
)
Why this was tried:
- keep `meter * judge` as the core from v3
- prevent low-judge poems from coasting on count and repeat
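The v4 training reward above can be written as a small function. This is a minimal sketch assuming all component scores are in [0, 1] and `hard_gate` is a validity gate; the key point is that the count and repeat helper terms are scaled by `judge_quality`, so a low-judge poem cannot coast on them.

```python
def v4_reward(meter, judge_quality, count_adherence, repeat_soft, hard_gate):
    """Sketch of the v4 training reward (components assumed in [0, 1]).
    Helper terms are gated by judge_quality: when the judge score is
    low, count and repeat contribute almost nothing."""
    core = meter * judge_quality
    helpers = judge_quality * (0.20 * count_adherence + 0.15 * repeat_soft)
    return hard_gate * (0.65 * core + helpers)
```

Note how tightly everything is coupled to the judge: at a judge score around the observed ~0.3, even perfect count and repeat scores contribute at most roughly 0.1 of reward, which is consistent with the run's total hovering near 0.2.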
Best And Final Tracked Metrics
Best eval:
- step: 50
- total: 0.2064
- meter: 0.5631
- judge: 0.2937
- count adherence: 0.9606
- repeat soft: 0.9909
- hard gate: 0.9125
Latest eval before stop:
- step: 250
- total: 0.1951
- meter: 0.4753
- judge: 0.3066
- count adherence: 0.9763
- repeat soft: 0.9909
- hard gate: 0.9250
Train stopped around:
- train step: 250
Why We Stopped
- the run did not look blatantly hacked
- but the online learning trend was poor
- meter stayed too low
- strong-sample rate was weak
- offline reward validation was healthier than the actual online training behavior
So this run is best interpreted as:
- a useful reward-shape experiment
- evidence that the v4 judge-gated helper design was too harsh in practice
What Came Next
The repo was then prepared for a dual-judge v5 candidate:
- one judge for meaning/description fit
- one judge for naturalness / anti-template quality
with the new proposed reward:
hard_gate * (
0.45 * (meter * judge_meaning_fit)
+ 0.20 * judge_naturalness
+ 0.20 * (count_adherence * judge_meaning_fit)
+ 0.15 * (repeat_soft * judge_naturalness)
)
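The proposed v5 reward above can be sketched the same way. A minimal sketch, again assuming all components are in [0, 1]: compared with v4, each helper term is gated by the judge axis it relates to (count by meaning fit, repeat by naturalness), and naturalness also earns a standalone 0.20 term, so the reward is no longer fully gated by a single judge score.

```python
def v5_reward(meter, judge_meaning_fit, judge_naturalness,
              count_adherence, repeat_soft, hard_gate):
    """Sketch of the proposed dual-judge v5 reward (components in [0, 1]).
    Each helper is gated by the judge axis it relates to, and
    naturalness contributes a standalone term."""
    return hard_gate * (
        0.45 * (meter * judge_meaning_fit)
        + 0.20 * judge_naturalness
        + 0.20 * (count_adherence * judge_meaning_fit)
        + 0.15 * (repeat_soft * judge_naturalness)
    )
```

One consequence of this shape: even with zero meaning fit, an otherwise valid poem can still earn up to 0.35 from the naturalness-linked terms, which softens the all-or-nothing judge gating that made v4 too harsh.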
Validation artifacts for that next candidate live under:
dual_judge_reward_validation/
Model tree for Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v4
Base model
humain-ai/ALLaM-7B-Instruct-preview