# lapa-instruct-uk-en-grpo
A LoRA adapter (r=128, α=256, applied to the q/k/v/o attention projections) for lapa-llm/lapa-v0.1.2-instruct (Gemma-3 12B), trained with GRPO for Ukrainian→English translation.
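For reference, a minimal `peft` configuration matching the adapter hyperparameters above; fields not stated in this card (dropout, bias handling) are illustrative defaults, not confirmed training settings.

```python
from peft import LoraConfig

# Hyperparameters as stated above; task_type matches a causal LM.
# Any field not listed in the card is an assumption, not a documented setting.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```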
## Training
Condition A from our reward ablation study: a weighted sum of chrF (weight 0.55) and BLEU (weight 0.45) as the only reward signals, with no neural metrics and no guardrails. Trained on WikiMatrix uk-en (132K pairs) for 300 steps, LoRA r=128 on the attention projections, with DeepSpeed ZeRO-2 on 4× RTX 6000 Ada.
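A sketch of the Condition A reward under these weights, using sacrebleu for sentence-level scoring; the 0–1 scaling and the sentence-level granularity are assumptions about the setup, not documented details.

```python
from sacrebleu.metrics import BLEU, CHRF

chrf = CHRF()
bleu = BLEU(effective_order=True)  # effective order avoids zero sentence-level BLEU

def grpo_reward(hypothesis: str, reference: str) -> float:
    """Condition A reward: 0.55 * chrF + 0.45 * BLEU.

    sacrebleu scores are on a 0-100 scale; scaling to 0-1 is an assumption.
    """
    chrf_score = chrf.sentence_score(hypothesis, [reference]).score / 100.0
    bleu_score = bleu.sentence_score(hypothesis, [reference]).score / 100.0
    return 0.55 * chrf_score + 0.45 * bleu_score
```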
For uk→en, these surface-level rewards outperformed every neural-metric configuration we tested (CometKiwi, xCOMET, MetricX, LLM judge); Wilcoxon signed-rank test, p=0.002.
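A significance test of this form can be reproduced with scipy given paired scores for the two conditions; the arrays below are placeholders, not the study's data.

```python
from scipy.stats import wilcoxon

# Paired scores for the two reward conditions (placeholder values only).
surface_rewards = [43.2, 42.8, 44.0, 43.5, 42.9, 43.7]
neural_rewards = [42.1, 42.0, 43.1, 42.6, 42.3, 42.8]

statistic, p_value = wilcoxon(surface_rewards, neural_rewards)
print(f"Wilcoxon signed-rank: statistic={statistic}, p={p_value:.3f}")
```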
## Results
| Benchmark | BLEU | Mistral Judge |
|---|---|---|
| FLORES+ devtest | 43.15 (+1.13) | 88.62 (+0.23) |
| WMT24 | 35.75 (+1.15) | 84.74 (+0.15) |
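A sketch of how corpus-level BLEU numbers like those above can be computed with sacrebleu; loading the FLORES+ devtest segments and generating the hypotheses are left out, and the exact evaluation harness is not specified here.

```python
import sacrebleu

# hypotheses: one model translation per devtest segment.
# references: one reference stream, aligned with the hypotheses.
hypotheses = ["A placeholder translation."]   # replace with model outputs
references = [["A placeholder reference."]]   # replace with FLORES+ targets

result = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {result.score:.2f}")
```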
## Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base instruct model, then attach the LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("lapa-llm/lapa-v0.1.2-instruct")
model = PeftModel.from_pretrained(base, "iamthewalrus67/lapa-instruct-uk-en-grpo")
tokenizer = AutoTokenizer.from_pretrained("iamthewalrus67/lapa-instruct-uk-en-grpo")
```
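A translation call using the loaded model; the chat-template usage follows the standard transformers API, but the prompt wording is illustrative rather than a documented template for this model.

```python
# Example uk→en request (prompt phrasing is an assumption).
messages = [{"role": "user", "content": "Translate to English: Кава охолола, поки ми розмовляли."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```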
## Model tree

Fine-tuning lineage: google/gemma-3-12b-pt → lapa-llm/lapa-12b-pt → lapa-llm/lapa-v0.1.2-instruct → this adapter.