Tags: Transformers · Safetensors · trl · grpo · arabic-poetry · classical-arabic · lora

Shaer-adapters-grpo-short1k-no-trio-v3

This repo is a stopped intermediate GRPO stage in the Shaer project.

It is historically important because it introduced the meter * judge core reward, which greatly reduced the old "high-meter / low-meaning" reward-hack pattern. We stopped it early because the run made the next bottleneck very clear: the judge was still over-crediting awkward poems, so instead of continuing to train on a blurry signal, the project moved on to a stricter judge prompt and a small reward-shape fix.

Place In The Story

Project sequence:

  1. Shaer-AI/Shaer-adapters: clean SFT base
  2. Shaer-AI/Shaer-adapters-grpo: historically important but reward-hacked early GRPO stage
  3. Shaer-AI/Shaer-adapters-grpo-vnext: first stricter anti-hack rerun
  4. Shaer-AI/Shaer-adapters-grpo-friend-v1: first judge-centered GRPO run
  5. Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst: healthier multiplicative friend-style rerun
  6. Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2: weighted-reward short-subset rerun
  7. Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v3: meter-times-judge-core rerun on the same short subset
  8. Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v4: next rerun after tightening the judge and gating auxiliary reward terms by judge quality

This repo is the stage that proved the project should keep the meter-times-judge idea, but sharpen the semantic signal.

Current Status As Of 2026-04-13

This stage remains important because it exposed the exact next bottleneck: the single judge still over-credited awkward poems.

The follow-up v4 gated the helper terms by judge quality and turned out too harsh online. The current, not-yet-launched next-step candidate is the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.

What Data This Run Used

  • base starting adapter: Shaer-AI/Shaer-adapters
  • GRPO dataset artifact: Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
  • source poetry dataset: Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
  • train subset: short-only, 1-6 bayts
  • meter coverage: dropped المديد (al-madid), المنسرح (al-munsarih), الهزج (al-hazaj)
  • sampling policy: easy-first with medium fill
  • target cap: 1000 rows per surviving meter
  • realized train size: 9955
  • eval bank: short-only no-trio, 80 rows total
  • local run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260413_071842

Reward Used In This Run

This run used the following reward:

reward_train = hard_gate * (
    0.65 * (meter * judge_quality)
  + 0.20 * count_adherence
  + 0.15 * repeat_soft
)
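
A minimal Python sketch of this reward shape (a sketch only; function and argument names are illustrative, and each component is assumed to be a scalar in [0, 1]):

def reward_v3(meter, judge_quality, count_adherence, repeat_soft, hard_gate):
    # meter and judge multiply, so a poem must score on both at once;
    # count and repeat are small additive helpers
    core = 0.65 * (meter * judge_quality)
    helpers = 0.20 * count_adherence + 0.15 * repeat_soft
    return hard_gate * (core + helpers)

# a metrically perfect but semantically empty poem is capped at the
# helper ceiling: with judge_quality = 0 the core term vanishes
print(reward_v3(1.0, 0.0, 1.0, 1.0, 1.0))  # 0.35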

Design intent:

  • make semantic weakness unable to hide behind excellent meter
  • keep count adherence inside the reward
  • keep repeat as a small soft guardrail
  • keep only a minimal hard gate for catastrophic junk

Judge setup in this stage:

  • Arabic judge prompt focused on:
    • meaning
    • not being garbage
    • relevance to the description
  • same core judge family as v2, but before the later stricter prompt revision
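
The judge's scoring mechanics are not included in this repo; purely as a hypothetical illustration (not the project's actual parser), a raw judge reply carrying a 0-10 rating could be normalized into the [0, 1] judge_quality used above like this:

import re

def parse_judge_quality(raw_reply: str) -> float:
    # hypothetical: pull an "N/10" rating out of the judge's text reply
    # and clamp it into [0, 1]; an unparseable reply counts as zero
    m = re.search(r"(\d+(?:\.\d+)?)\s*/\s*10", raw_reply)
    if m is None:
        return 0.0
    return min(max(float(m.group(1)) / 10.0, 0.0), 1.0)

print(parse_judge_quality("score: 7/10"))  # 0.7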

Best And Latest Eval Before Stopping

Best completed eval:

  • step: 1500
  • total: 0.5827
  • meter: 0.8153
  • judge quality: 0.4936
  • count adherence: 0.9521
  • repeat soft: 0.9100
  • hard gate: 0.9750

Latest completed eval before stopping:

  • step: 2150
  • total: 0.4049
  • meter: 0.7940
  • judge quality: 0.1570
  • count adherence: 0.9881
  • repeat soft: 0.8970
  • hard gate: 0.9750
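
As a rough sanity check, plugging these mean components back into the training reward roughly reproduces the reported totals (this assumes the eval total uses the same formula; small residuals are expected because the run aggregates per sample before averaging):

def reward_v3(meter, judge, count, repeat, gate):
    return gate * (0.65 * meter * judge + 0.20 * count + 0.15 * repeat)

# step 1500: ~0.574 vs reported 0.5827
print(reward_v3(0.8153, 0.4936, 0.9521, 0.9100, 0.9750))
# step 2150: ~0.403 vs reported 0.4049; the judge collapse from
# 0.4936 to 0.1570 accounts for nearly the entire drop in total
print(reward_v3(0.7940, 0.1570, 0.9881, 0.8970, 0.9750))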

Stop point:

  • training was stopped manually around train step 2180
  • the last completed eval was the step-2150 checkpoint shown above

What This Run Proved

This run did accomplish something important:

  • the classic "high meter, obviously bad semantics, still high total" failure mode was much better controlled than before
  • top poems were more often real-looking and less often blatant reward hacks
  • the meter-times-judge core was the right direction

Manual inspection showed that the old exploit was no longer the dominant issue.

Why We Stopped

We did not stop because the run catastrophically collapsed.

We stopped because the failure mode became narrower and easier to diagnose: the judge was still over-crediting

  • awkward poems with unnatural diction
  • lexical game-playing
  • rhyme-driven phrasing with weak meaning
  • semantically thin poems that were only half convincing

That meant the run was optimizing against a judge signal that was safer than before, but still not sharp enough.

In practice, this produced a pattern like:

  • top scores were less hacked than earlier stages
  • but too many awkward poems were still receiving comfortable judge scores
  • count and repeat could still prop up weak poems once the judge gave them a middling score
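
A hypothetical plug-in makes that prop-up concrete (illustrative numbers, not taken from the run):

# a poem the judge rates as weak-to-middling still earns a comfortable
# total under v3, mostly from the additive count and repeat helpers
meter, judge, count, repeat, gate = 0.85, 0.30, 0.99, 0.95, 1.0
core = 0.65 * meter * judge               # ~0.17
helpers = 0.20 * count + 0.15 * repeat    # ~0.34
print(gate * (core + helpers))            # ~0.51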

So the project moved on instead of continuing to spend compute on a blurry target.

What Came Next

The next stage (v4) keeps the same overall direction but fixes the two issues exposed by this run:

  1. a stricter judge prompt, which:
    • better distinguishes genuinely meaningful poems from awkward, rhyme-driven, semantically thin ones
    • penalizes lexical games and half-natural nonsense more clearly
  2. a small reward-shape fix

Instead of letting count_adherence and repeat_soft help independently, the next stage gates those helper terms by judge quality:

reward_train = hard_gate * (
    0.65 * (meter * judge_quality)
  + judge_quality * (0.20 * count_adherence + 0.15 * repeat_soft)
)

This keeps gradient flow, but makes it much harder for a bad poem to coast on count and repeat once the judge has already called it weak.
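
Revisiting the same hypothetical weak poem from above (judge = 0.30, illustrative numbers) under both shapes shows the intended effect:

meter, judge, count, repeat, gate = 0.85, 0.30, 0.99, 0.95, 1.0
v3 = gate * (0.65 * meter * judge + 0.20 * count + 0.15 * repeat)
v4 = gate * (0.65 * meter * judge + judge * (0.20 * count + 0.15 * repeat))
print(v3)  # ~0.51: helpers prop the weak poem up
print(v4)  # ~0.27: gated helpers shrink with the judge score

For a strong poem with judge quality near 1 the two shapes are nearly identical, so the fix bites exactly in the weak-poem regime.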

Recommended Interpretation

Use this repo as:

  • the run that validated the meter * judge direction
  • the run that narrowed the problem from "obvious reward hacking" to "judge permissiveness / signal sharpness"
  • an important intermediate stage, but not the final paper model

Do not present it as the final best system.

Present it as the stage that clarified the exact next fix.

Useful Local Artifacts

  • full run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260413_071842
  • high-score manual review: high_score_poems_step1750.md
  • prompt-tuning validations:
    • reward_validation_judge_v5_current
    • reward_validation_judge_v6_current
    • reward_validation_judge_v7_current
    • reward_validation_final_v4_current
  • old-hacked comparison validations:
    • reward_validation_judge_v7_oldhacked
    • reward_validation_final_v4_oldhacked