# lapa-instruct-uk-en-grpo

LoRA adapter (r=128, α=256, q/k/v/o projections) for lapa-v0.1.2-instruct (Gemma-3 12B), trained with GRPO for Ukrainian→English translation.

## Training

Condition A from our reward ablation study: chrF (0.55) + BLEU (0.45) as the only reward signals — no neural metrics, no guardrails. Trained on WikiMatrix uk-en (132K pairs), 300 steps, LoRA r=128 on attention projections, DeepSpeed ZeRO-2 on 4× RTX 6000 Ada.

For uk→en, these simpler rewards outperformed every neural-metric configuration we tested (CometKiwi, xCOMET, MetricX, LLM judge; Wilcoxon signed-rank test, p=0.002).
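The paired significance test can be reproduced per evaluation segment with SciPy. The scores below are purely illustrative placeholders, not our evaluation data.

```python
from scipy.stats import wilcoxon

# Illustrative paired per-segment scores (NOT the actual evaluation data):
# one score per segment for the simple-reward and neural-reward runs.
simple_reward = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69, 0.64, 0.73, 0.57, 0.68]
neural_reward = [0.59, 0.555, 0.70, 0.629, 0.58, 0.65, 0.655, 0.702, 0.537, 0.639]

stat, p_value = wilcoxon(simple_reward, neural_reward)
print(f"W={stat}, p={p_value:.4f}")
```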

## Results

| Benchmark | BLEU | Mistral Judge |
|---|---|---|
| FLoRes+ devtest | 43.15 (+1.13) | 88.62 (+0.23) |
| WMT24 | 35.75 (+1.15) | 84.74 (+0.15) |

## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("lapa-llm/lapa-v0.1.2-instruct")
model = PeftModel.from_pretrained(base, "iamthewalrus67/lapa-instruct-uk-en-grpo")
tokenizer = AutoTokenizer.from_pretrained("iamthewalrus67/lapa-instruct-uk-en-grpo")

# Example translation prompt (illustrative)
messages = [{"role": "user", "content": "Translate to English: Добрий день!"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```