# lapa-instruct-en-uk-grpo

A LoRA adapter (r=128, α=256, applied to the q/k/v/o attention projections) for `lapa-llm/lapa-v0.1.2-instruct` (Gemma-3 12B), trained with GRPO for English→Ukrainian translation.
## Training
This adapter corresponds to condition E from our reward-ablation study: an LLM-as-judge reward (M-Prometheus-14B) combined with calibrated guardrails. Training ran for 300 steps on WikiMatrix en-uk (132K pairs), with LoRA r=128 on the attention projections, under DeepSpeed ZeRO-2 on 4× RTX 6000 Ada GPUs.

Of the 11 reward configurations tested, this was the only one that consistently improved over the baseline on both BLEU and LLM-judge metrics for en→uk translation.
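For intuition, the r=128, α=256 setting scales the low-rank update by α/r = 2. A minimal pure-Python sketch of the LoRA update rule with toy dimensions (the shapes below are placeholders, not the actual Gemma-3 projection sizes):

```python
# Sketch of LoRA: the adapted projection computes W x + (alpha/r) * B(A(x)),
# with W frozen and only the low-rank factors A, B trained.
import random

random.seed(0)

d_model, r, alpha = 16, 4, 8     # toy values; the adapter itself uses r=128, alpha=256
scaling = alpha / r              # for this adapter: 256 / 128 = 2.0

def matvec(M, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

# Frozen base weight W; low-rank factors A (r x d_model) and B (d_model x r).
W = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(d_model)]
A = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(r)]
B = [[0.0 for _ in range(r)] for _ in range(d_model)]  # B is zero-initialized

x = [1.0] * d_model
# Adapted output: base projection plus scaled low-rank update.
y = [w + scaling * b for w, b in zip(matvec(W, x), matvec(B, matvec(A, x)))]
```

Because B starts at zero, the adapter is a no-op at initialization; training moves A and B away from that point while W stays frozen.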
## Results
| Benchmark | BLEU | Mistral judge score |
|---|---|---|
| FLORES+ devtest | 34.02 (+0.58) | 83.94 (+0.12) |
| WMT24 | 31.24 (+0.12) | 80.44 (+0.22) |

Values in parentheses are deltas over the untuned `lapa-v0.1.2-instruct` baseline.
## Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base instruct model, then attach the GRPO-trained LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("lapa-llm/lapa-v0.1.2-instruct")
model = PeftModel.from_pretrained(base, "iamthewalrus67/lapa-instruct-en-uk-grpo")
tokenizer = AutoTokenizer.from_pretrained("iamthewalrus67/lapa-instruct-en-uk-grpo")
```
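A translation call might then look like the sketch below, which reuses `model` and `tokenizer` from the loading snippet above. The instruction wording is an assumption for illustration, not a documented prompt format for this model:

```python
# Sketch: translate one sentence en→uk via the model's chat template.
# Assumes `model` and `tokenizer` are already loaded as shown above.
messages = [
    {"role": "user", "content": "Translate to Ukrainian: The weather is nice today."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```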
## Model tree

Base model lineage: `google/gemma-3-12b-pt` → finetuned as `lapa-llm/lapa-12b-pt` → finetuned as `lapa-llm/lapa-v0.1.2-instruct`, on which this adapter sits.