# lapa-instruct-en-uk-grpo

A LoRA adapter (r=128, α=256, applied to the q/k/v/o attention projections) for `lapa-llm/lapa-v0.1.2-instruct` (Gemma-3 12B), trained with GRPO for English→Ukrainian translation.
## Training
This adapter corresponds to condition E from our reward-ablation study: an LLM-as-judge reward (M-Prometheus-14B) combined with calibrated guardrails. Training ran for 300 steps on WikiMatrix en-uk (132K pairs), with LoRA r=128 on the attention projections, under DeepSpeed ZeRO-2 on 4× RTX 6000 Ada GPUs.

Of the 11 reward configurations tested, this was the only one that consistently improved over the baseline on both BLEU and LLM-judge metrics for en→uk translation.
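For intuition, the r=128, α=256 setting scales the low-rank update by α/r = 2. A minimal pure-Python sketch of the LoRA update rule with toy dimensions (the shapes below are placeholders, not the actual Gemma-3 projection sizes):

```python
# Sketch of LoRA: the adapted projection computes W x + (alpha/r) * B(A(x)),
# with W frozen and only the low-rank factors A, B trained.
import random

random.seed(0)

d_model, r, alpha = 16, 4, 8     # toy values; the adapter itself uses r=128, alpha=256
scaling = alpha / r              # for this adapter: 256 / 128 = 2.0

def matvec(M, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

# Frozen base weight W; low-rank factors A (r x d_model) and B (d_model x r).
W = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(d_model)]
A = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(r)]
B = [[0.0 for _ in range(r)] for _ in range(d_model)]  # B is zero-initialized

x = [1.0] * d_model
# Adapted output: base projection plus scaled low-rank update.
y = [w + scaling * b for w, b in zip(matvec(W, x), matvec(B, matvec(A, x)))]
```

Because B starts at zero, the adapter is a no-op at initialization; training moves A and B away from that point while W stays frozen.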
## Results
| Benchmark | BLEU | Mistral judge score |
|---|---|---|
| FLORES+ devtest | 34.02 (+0.58) | 83.94 (+0.12) |
| WMT24 | 31.24 (+0.12) | 80.44 (+0.22) |

Values in parentheses are deltas over the untuned `lapa-v0.1.2-instruct` baseline.
## Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base instruct model, then attach the GRPO-trained LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("lapa-llm/lapa-v0.1.2-instruct")
model = PeftModel.from_pretrained(base, "iamthewalrus67/lapa-instruct-en-uk-grpo")
tokenizer = AutoTokenizer.from_pretrained("iamthewalrus67/lapa-instruct-en-uk-grpo")
```
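A translation call might then look like the sketch below, which reuses `model` and `tokenizer` from the loading snippet above. The instruction wording is an assumption for illustration, not a documented prompt format for this model:

```python
# Sketch: translate one sentence en→uk via the model's chat template.
# Assumes `model` and `tokenizer` are already loaded as shown above.
messages = [
    {"role": "user", "content": "Translate to Ukrainian: The weather is nice today."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```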
## Model tree

Base model lineage: `google/gemma-3-12b-pt` → finetuned as `lapa-llm/lapa-12b-pt` → finetuned as `lapa-llm/lapa-v0.1.2-instruct`, on which this adapter sits.