Transformers
Safetensors
trl
grpo
arabic-poetry
classical-arabic
lora

Shaer-adapters-grpo-friend-v1

This repo is the first judge-centered GRPO rerun in the Shaer project.

Current Status As Of 2026-04-13

This repo remains important as the first judge-in-the-loop experiment, but it is not the currently recommended GRPO stage.

The best completed GRPO stage is Shaer-AI/Shaer-adapters-grpo-short1k-no-trio-v2, and the current unlaunched next-step candidate is the dual-judge v5 setup prepared locally in /root/workspace/Shaer/grpo.

Place In The Story

Project sequence:

  1. Shaer-AI/Shaer-adapters: clean SFT baseline
  2. Shaer-AI/Shaer-adapters-grpo: reward-hacked historical GRPO result
  3. Shaer-AI/Shaer-adapters-grpo-vnext: stricter anti-template rerun
  4. Shaer-AI/Shaer-adapters-grpo-friend-v1: first run where semantic judge pressure moved into the optimized reward
  5. Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst: easier judge-centered rerun on a friendlier train slice

What Data It Used

  • base starting adapter: Shaer-AI/Shaer-adapters
  • GRPO dataset artifact: Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
  • source poetry dataset: Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
  • train subset: dropped-trio curated subset, cap 3000 per surviving meter
  • eval bank: full 13-meter eval bank, 104 rows total
  • local run dir: /root/workspace/Shaer/grpo/outputs/train/shaer_grpo_20260412_152315

Reward Used Here

This run used the first friend-style reward:

hard_gate * meter * judge_quality * arabic_floor * count_floor * repeat_floor

Core idea:

  • keep strong pressure on meter
  • use a focused Arabic judge for meaning and relevance
  • soften count, Arabic, and repeat penalties into floors
  • hard-zero only catastrophic junk
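The multiplicative structure above can be sketched as follows. This is an illustrative sketch, not the project's actual implementation: the function names and the 0.5 floor values are assumptions.

```python
def floored(score, floor):
    """Soften a [0, 1] penalty into a floor: the term can shrink the
    reward by at most a factor of `floor`, never zero it out.
    The floor value used here is an illustrative assumption."""
    return floor + (1.0 - floor) * score

def friend_reward(meter, judge_quality, arabic, count, repeat, is_junk):
    """Sketch of the friend-style reward:
    hard_gate * meter * judge_quality * arabic_floor * count_floor * repeat_floor."""
    hard_gate = 0.0 if is_junk else 1.0  # hard-zero only catastrophic junk
    return (hard_gate
            * meter                      # strong, unfloored pressure on meter
            * judge_quality              # semantic judge score, also unfloored
            * floored(arabic, 0.5)       # softened penalties: each can at
            * floored(count, 0.5)        # most halve the reward under the
            * floored(repeat, 0.5))      # assumed 0.5 floor
```

Under this sketch, a catastrophic-junk completion scores exactly 0, a zero count score only halves the reward, but a zero meter or judge score still zeroes it.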

Judge intent in this stage:

  • score whether the poem has meaning
  • score whether it is not garbage
  • score whether it is relevant to the description
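One simple way to fold the three judge questions into a single judge_quality term is an equal-weight average of per-criterion scores. The card does not state how the judge's answers were actually combined, so this aggregation is an assumption:

```python
def judge_quality(has_meaning, not_garbage, relevant):
    """Average three [0, 1] judge sub-scores into one quality term.
    Equal weighting is an illustrative assumption."""
    return (has_meaning + not_garbage + relevant) / 3.0
```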

Best Tracked Checkpoint

  • step: 50
  • eval total: 0.1080
  • eval meter: 0.3143
  • eval count adherence: 0.9764
  • eval judge quality: 0.3617
  • eval repeat penalty: 0.9708
  • eval arabic clean: 0.9231
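Note that multiplying the averaged components (0.3143 × 0.9764 × 0.3617 × 0.9708 × 0.9231 ≈ 0.099) does not exactly reproduce the reported eval total of 0.1080. Assuming the total is the multiplicative reward averaged over eval rows, this gap is expected: the mean of per-row products generally differs from the product of per-row means. A toy illustration:

```python
# Two eval rows with toy per-row component scores.
meter = [0.2, 0.5]
judge = [0.9, 0.3]

# Average after multiplying per row vs multiply the averages.
mean_of_products = sum(m * j for m, j in zip(meter, judge)) / 2   # 0.165
product_of_means = (sum(meter) / 2) * (sum(judge) / 2)            # 0.21

# The two aggregations disagree, so component means need not
# multiply to the tracked total.
```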

What This Run Proved

This run showed that the semantic-judge direction was conceptually right, but the setup was still too hard for the model on this train slice.

Observed pattern:

  • count and anti-repeat terms stayed high
  • meter stayed too weak
  • the run peaked almost immediately and never settled into a healthy training trajectory

Current Interpretation

For the paper story, this repo is the first run where semantic quality was explicitly brought into the optimized GRPO objective. It is important because it marked the right direction, even though the optimization setup was still too brittle to make the direction work well yet.

Why We Moved On

The next fix was not to abandon the judge, but to make the optimization problem easier:

  • switch to an easy-first subset
  • reduce the amount of difficult data in the active train mix
  • keep the same friend-style reward and see whether meter could recover
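An easy-first subset can be built by filtering train rows on a difficulty proxy. The actual easyfirst selection criteria are not documented in this card, so the `n_lines` field and the threshold below are hypothetical:

```python
def easy_first_subset(rows, max_lines=6):
    """Keep only short poems as a crude difficulty proxy.
    The 'n_lines' field and max_lines=6 threshold are illustrative."""
    return [r for r in rows if r["n_lines"] <= max_lines]

# Toy rows: only ids 1 and 3 survive the filter.
rows = [{"id": 1, "n_lines": 4}, {"id": 2, "n_lines": 12}, {"id": 3, "n_lines": 6}]
```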

That next stage was published as Shaer-AI/Shaer-adapters-grpo-friend-v1-easyfirst.

Recommended Use

Use this repo as the first semantic-judge GRPO experiment, not as the recommended checkpoint for downstream generation.
