# lapa-instruct-uk-en-grpo
A LoRA adapter (r=128, α=256, applied to the q/k/v/o attention projections) for lapa-llm/lapa-v0.1.2-instruct (Gemma-3 12B), trained with GRPO for Ukrainian→English translation.
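For reference, a minimal `peft` configuration matching the adapter hyperparameters above; fields not stated in this card (dropout, bias handling) are illustrative defaults, not confirmed training settings.

```python
from peft import LoraConfig

# Hyperparameters as stated above; task_type matches a causal LM.
# Any field not listed in the card is an assumption, not a documented setting.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```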
## Training
Condition A from our reward ablation study: a weighted sum of chrF (weight 0.55) and BLEU (weight 0.45) as the only reward signals, with no neural metrics and no guardrails. Trained on WikiMatrix uk-en (132K pairs) for 300 steps, LoRA r=128 on the attention projections, with DeepSpeed ZeRO-2 on 4× RTX 6000 Ada.
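A sketch of the Condition A reward under these weights, using sacrebleu for sentence-level scoring; the 0–1 scaling and the sentence-level granularity are assumptions about the setup, not documented details.

```python
from sacrebleu.metrics import BLEU, CHRF

chrf = CHRF()
bleu = BLEU(effective_order=True)  # effective order avoids zero sentence-level BLEU

def grpo_reward(hypothesis: str, reference: str) -> float:
    """Condition A reward: 0.55 * chrF + 0.45 * BLEU.

    sacrebleu scores are on a 0-100 scale; scaling to 0-1 is an assumption.
    """
    chrf_score = chrf.sentence_score(hypothesis, [reference]).score / 100.0
    bleu_score = bleu.sentence_score(hypothesis, [reference]).score / 100.0
    return 0.55 * chrf_score + 0.45 * bleu_score
```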
For uk→en, these surface-level rewards outperformed every neural-metric configuration we tested (CometKiwi, xCOMET, MetricX, LLM judge); Wilcoxon signed-rank test, p=0.002.
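A significance test of this form can be reproduced with scipy given paired scores for the two conditions; the arrays below are placeholders, not the study's data.

```python
from scipy.stats import wilcoxon

# Paired scores for the two reward conditions (placeholder values only).
surface_rewards = [43.2, 42.8, 44.0, 43.5, 42.9, 43.7]
neural_rewards = [42.1, 42.0, 43.1, 42.6, 42.3, 42.8]

statistic, p_value = wilcoxon(surface_rewards, neural_rewards)
print(f"Wilcoxon signed-rank: statistic={statistic}, p={p_value:.3f}")
```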
## Results
| Benchmark | BLEU | Mistral Judge |
|---|---|---|
| FLORES+ devtest | 43.15 (+1.13) | 88.62 (+0.23) |
| WMT24 | 35.75 (+1.15) | 84.74 (+0.15) |
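A sketch of how corpus-level BLEU numbers like those above can be computed with sacrebleu; loading the FLORES+ devtest segments and generating the hypotheses are left out, and the exact evaluation harness is not specified here.

```python
import sacrebleu

# hypotheses: one model translation per devtest segment.
# references: one reference stream, aligned with the hypotheses.
hypotheses = ["A placeholder translation."]   # replace with model outputs
references = [["A placeholder reference."]]   # replace with FLORES+ targets

result = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {result.score:.2f}")
```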
## Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base instruct model, then attach the LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("lapa-llm/lapa-v0.1.2-instruct")
model = PeftModel.from_pretrained(base, "iamthewalrus67/lapa-instruct-uk-en-grpo")
tokenizer = AutoTokenizer.from_pretrained("iamthewalrus67/lapa-instruct-uk-en-grpo")
```
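A translation call using the loaded model; the chat-template usage follows the standard transformers API, but the prompt wording is illustrative rather than a documented template for this model.

```python
# Example uk→en request (prompt phrasing is an assumption).
messages = [{"role": "user", "content": "Translate to English: Кава охолола, поки ми розмовляли."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```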
## Model tree

Fine-tuning lineage: google/gemma-3-12b-pt → lapa-llm/lapa-12b-pt → lapa-llm/lapa-v0.1.2-instruct → this adapter.