Tags: Transformers · Safetensors · trl · grpo · arabic-poetry · classical-arabic · lora

Shaer-adapters-grpo-short1k-no-trio-v4

This repo is a stopped post-v3 experiment in the Shaer project.

It exists because v3 made the next problem very clear: the judge still over-credited awkward poems. v4 tried to solve that by making low-judge poems much harder to rescue through the count and repeat helper rewards. That logic looked good in offline validation, but the online run proved too harsh and did not learn well enough.

Current Status As Of 2026-04-13

This model is not the recommended next direction to continue training from.

The best completed GRPO stage is still Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2. The current unlaunched next-step candidate is the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.

Place In The Story

Project sequence:

  1. Shaer-AI/Shaer-adapters - clean SFT base
  2. Shaer-AI/Shaer-adapters-grpo - historically hacked early GRPO stage
  3. Shaer-AI/Shaer-adapters-grpo-vnext - structure-side anti-hack repair stage
  4. Shaer-AI/Shaer-adapters-grpo-friend-v1 - first judge-centered run
  5. Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst - healthier multiplicative judge-centered rerun
  6. Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2 - best completed weighted short-subset run
  7. Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v3 - stopped meter-times-judge-core stage
  8. Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v4 - stopped judge-gated-helper stage (this repo)

What Data This Run Used

  • base starting adapter: Shaer-AI/Shaer-adapters
  • GRPO dataset artifact: Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
  • source poetry dataset: Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
  • train subset: short only (1-6 bayts)
  • meter coverage: dropped المديد, المنسرح, الهزج
  • sampling policy: easy-first with medium fill
  • target cap: 1000 rows per surviving meter
  • realized train size: 9955
  • eval bank: short-only no-trio, 80 rows total
  • local run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260413_124347
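The sampling policy above (drop three meters, easy-first with medium fill, cap 1000 rows per surviving meter) can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline code: the field names `meter` and `difficulty` and the function name `sample_per_meter` are assumptions.

```python
from collections import defaultdict

def sample_per_meter(rows, cap=1000, drop_meters=("المديد", "المنسرح", "الهزج")):
    """Easy-first sampling with medium fill, capped per meter.

    `rows` is assumed to be a list of dicts with hypothetical keys
    `meter` and `difficulty` ("easy" / "medium" / "hard"); the real
    pipeline's schema may differ.
    """
    by_meter = defaultdict(list)
    for row in rows:
        if row["meter"] not in drop_meters:
            by_meter[row["meter"]].append(row)

    sampled = []
    for meter, pool in by_meter.items():
        # Take easy rows first, then fill the remaining slots with medium ones.
        easy = [r for r in pool if r["difficulty"] == "easy"]
        medium = [r for r in pool if r["difficulty"] == "medium"]
        picked = easy[:cap]
        picked += medium[: cap - len(picked)]
        sampled.extend(picked)
    return sampled
```

With a cap of 1000 over the surviving meters, this kind of policy yields the realized train size of 9955 rows reported above (some meters have fewer than 1000 easy-plus-medium rows).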

Reward Used In This Run

reward_train = hard_gate * (
    0.65 * (meter * judge_quality)
  + judge_quality * (0.20 * count_adherence + 0.15 * repeat_soft)
)
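The formula above can be written out directly. This is a sketch of the stated reward shape, assuming all components are floats in [0, 1]; the function name is illustrative, not the project's actual code.

```python
def reward_v4(meter, judge_quality, count_adherence, repeat_soft, hard_gate):
    """v4 training reward: the count/repeat helpers are multiplied by
    judge_quality, so a low-judge poem cannot be rescued by perfect
    count and repeat scores alone. All inputs assumed in [0, 1]."""
    core = 0.65 * (meter * judge_quality)
    helpers = judge_quality * (0.20 * count_adherence + 0.15 * repeat_soft)
    return hard_gate * (core + helpers)
```

Note the gating effect: with judge_quality at 0 the whole reward collapses to 0 regardless of the helper scores, which is exactly the harshness that the stop notes below point to.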

Why this was tried:

  • keep meter * judge as the core from v3
  • prevent low-judge poems from coasting on count and repeat

Best And Final Tracked Metrics

Best eval:

  • step: 50
  • total: 0.2064
  • meter: 0.5631
  • judge: 0.2937
  • count adherence: 0.9606
  • repeat soft: 0.9909
  • hard gate: 0.9125

Latest eval before stop:

  • step: 250
  • total: 0.1951
  • meter: 0.4753
  • judge: 0.3066
  • count adherence: 0.9763
  • repeat soft: 0.9909
  • hard gate: 0.9250

Train stopped around:

  • train step: 250

Why We Stopped

  • the run did not look blatantly hacked
  • but the online learning trend was poor
  • meter stayed too low
  • strong-sample rate was weak
  • offline reward validation was healthier than the actual online training behavior

So this run is best interpreted as:

  • a useful reward-shape experiment
  • evidence that the v4 judge-gated helper design was too harsh in practice

What Came Next

The repo was then prepared for a dual-judge v5 candidate:

  • one judge for meaning/description fit
  • one judge for naturalness / anti-template quality

with the new proposed reward:

hard_gate * (
    0.45 * (meter * judge_meaning_fit)
  + 0.20 * judge_naturalness
  + 0.20 * (count_adherence * judge_meaning_fit)
  + 0.15 * (repeat_soft * judge_naturalness)
)
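For comparison with v4, the proposed v5 shape can be sketched the same way. Again this is an illustration of the formula above, with a hypothetical function name and inputs assumed in [0, 1].

```python
def reward_v5(meter, judge_meaning_fit, judge_naturalness,
              count_adherence, repeat_soft, hard_gate):
    """Proposed v5 reward: two judges split the gating duties.
    Meaning fit gates meter and count; naturalness gates repeat
    and also earns a direct 0.20 share."""
    return hard_gate * (
        0.45 * (meter * judge_meaning_fit)
        + 0.20 * judge_naturalness
        + 0.20 * (count_adherence * judge_meaning_fit)
        + 0.15 * (repeat_soft * judge_naturalness)
    )
```

Unlike v4, a poem with zero meaning fit but high naturalness still earns the 0.20 naturalness term and the 0.15 gated repeat term, so the gating is softer than v4's single-judge multiplier.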

Validation artifacts for that next candidate live under:

  • dual_judge_reward_validation/