# lapa-instruct-en-uk-grpo

LoRA adapter (r=128, α=256, q/k/v/o projections) for lapa-v0.1.2-instruct (Gemma-3 12B), trained with GRPO for English→Ukrainian translation.

## Training

This adapter is condition E from our reward ablation study: an LLM-as-judge (M-Prometheus-14B) combined with calibrated guardrails as the reward signal. It was trained on WikiMatrix en-uk (132K pairs) for 300 steps, with LoRA r=128 on the attention projections, using DeepSpeed ZeRO-2 on 4× RTX 6000 Ada GPUs.

This was the only reward configuration (out of 11 tested) that consistently improved over baseline on both BLEU and LLM-judge metrics for en→uk translation.

## Results

| Benchmark | BLEU | Mistral Judge |
|---|---|---|
| FLoRes+ devtest | 34.02 (+0.58) | 83.94 (+0.12) |
| WMT24 | 31.24 (+0.12) | 80.44 (+0.22) |

Deltas in parentheses are relative to the unadapted base model.

## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then apply this LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained("lapa-llm/lapa-v0.1.2-instruct")
model = PeftModel.from_pretrained(base, "iamthewalrus67/lapa-instruct-en-uk-grpo")
tokenizer = AutoTokenizer.from_pretrained("iamthewalrus67/lapa-instruct-en-uk-grpo")
```
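With the model and tokenizer loaded as above, translation can be run via the chat template. The prompt wording below is illustrative (the exact instruction format used during training is not stated in this card), and the `generate` call assumes the base model's standard Gemma-3 chat template:

```python
def build_messages(text: str) -> list[dict]:
    """Wrap an English sentence in a chat-style translation request.

    The instruction wording here is an assumption, not the exact
    prompt used during GRPO training.
    """
    return [
        {
            "role": "user",
            "content": f"Translate the following text from English to Ukrainian:\n{text}",
        }
    ]

messages = build_messages("The quick brown fox jumps over the lazy dog.")

# With the model/tokenizer from the snippet above:
# inputs = tokenizer.apply_chat_template(
#     messages, add_generation_prompt=True, return_tensors="pt"
# )
# output = model.generate(inputs, max_new_tokens=256)
# print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```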