Gyeongsang Dialect to Standard Korean LoRA for Gemma 4 E4B

This repository contains a LoRA adapter fine-tuned from google/gemma-4-e4b-it for a narrow text-rewriting task:

  • input: Gyeongsang dialect Korean sentence
  • output: standard Korean sentence
  • goal: preserve meaning and avoid unnecessary additions or omissions

This is an adapter-only release. You need the base model google/gemma-4-e4b-it to use it.

Model Details

  • Base model: google/gemma-4-e4b-it
  • Adapter type: PEFT LoRA
  • LoRA rank: 16
  • LoRA alpha: 16
  • LoRA dropout: 0.0
  • Training stack: Unsloth + PEFT + Transformers
  • Primary language: Korean
  • Primary task: dialect normalization / standardization

Intended Use

This adapter is intended for:

  • converting Gyeongsang dialect utterances into standard Korean
  • deterministic rewrite workflows
  • offline batch normalization pipelines
  • experimentation inside Hugging Face, Transformers, or Unsloth Studio

This adapter is not intended for:

  • open-ended chat
  • long-form reasoning
  • instruction following outside the rewrite task
  • high-stakes legal, medical, or financial usage without human review

Prompt Format

The best-performing eval path used a plain single-turn rewrite prompt with a strict instruction not to add reasoning.

Minimal usage pattern:

๋‹น์‹ ์€ ๊ฒฝ์ƒ๋„ ๋ฐฉ์–ธ์„ ํ‘œ์ค€ ํ•œ๊ตญ์–ด๋กœ ๋ฐ”๊พธ๋Š” ์ „๋ฌธ๊ฐ€์ž…๋‹ˆ๋‹ค.
์˜๋ฏธ๋ฅผ ๋ณด์กดํ•˜๊ณ , ๋ถˆํ•„์š”ํ•œ ์ถ”๊ฐ€๋‚˜ ์ƒ๋žต ์—†์ด ํ‘œ์ค€์–ด ๋ฌธ์žฅ๋งŒ ๋‹ตํ•˜์„ธ์š”.

๋ฐฉ์–ธ: <input sentence>
ํ‘œ์ค€์–ด:

Greedy decoding worked better than sampling for this task.
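The template above can be assembled programmatically. The helper below is a hypothetical convenience function, not part of this repository; the Korean system text mirrors the prompt shown in this section:

```python
# Hypothetical prompt-builder for the strict single-turn rewrite template.
SYSTEM = (
    "๋‹น์‹ ์€ ๊ฒฝ์ƒ๋„ ๋ฐฉ์–ธ์„ ํ‘œ์ค€ ํ•œ๊ตญ์–ด๋กœ ๋ฐ”๊พธ๋Š” ์ „๋ฌธ๊ฐ€์ž…๋‹ˆ๋‹ค.\n"
    "์˜๋ฏธ๋ฅผ ๋ณด์กดํ•˜๊ณ , ๋ถˆํ•„์š”ํ•œ ์ถ”๊ฐ€๋‚˜ ์ƒ๋žต ์—†์ด ํ‘œ์ค€์–ด ๋ฌธ์žฅ๋งŒ ๋‹ตํ•˜์„ธ์š”."
)

def build_prompt(dialect_sentence: str) -> str:
    """Fill the rewrite template for a single dialect input sentence."""
    return f"{SYSTEM}\n\n๋ฐฉ์–ธ: {dialect_sentence}\nํ‘œ์ค€์–ด:"
```

Pairing this with `do_sample=False` in `generate` reproduces the greedy setup described above.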

Evaluation Summary

Final full-run evaluation was performed on internal dev, test, and hard_test splits.

split      rows   char_similarity   critical_error_rate   number_preservation   format_contamination_rate
dev        4002   0.9436            0.0265                1.0000                0.0000
test       4001   0.9433            0.0280                0.9615                0.0000
hard_test  2048   0.9337            0.0381                1.0000                0.0000

Interpretation:

  • rewrite quality is strong on the held-out evaluation splits
  • number preservation remained strong in the final E4B run
  • prompt-format contamination was reduced to zero in final evaluation
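The exact metric implementations are not published with this card. As one plausible reading, char_similarity can be taken as a character-level similarity ratio and number_preservation as a digit-sequence check; both functions below are assumptions, not the card's definitions:

```python
import re
from difflib import SequenceMatcher

def char_similarity(pred: str, ref: str) -> float:
    # Character-level similarity in [0, 1]; an assumed definition of the
    # char_similarity column, not the repository's exact metric.
    return SequenceMatcher(None, pred, ref).ratio()

def number_preservation(pred: str, ref: str) -> bool:
    # True when every digit sequence in the reference survives in the prediction.
    return all(n in pred for n in re.findall(r"\d+", ref))
```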

Training Summary

  • Final training rows: 51719
  • Eval rows in the training-run summary: 4002
  • Training runtime: 3612.06s
  • Final recorded train loss: 0.2783
  • Dataset format: prompt_completion_text
  • Prompt variant: strict
  • Completion-only loss: True

The adapter was selected from an adaptive loop that tuned prompt format, eval format, LoRA settings, and recovery behavior across multiple iterations before the final full run.
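Completion-only loss means the prompt tokens are masked out of the loss so that only the standard-Korean completion is supervised. A minimal sketch of that masking, assuming token IDs are plain Python lists:

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss

def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    # Copy the token IDs as labels, then blank out the prompt portion so
    # gradients only flow through the completion tokens.
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels
```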

Data

The training task uses Korean dialect-to-standard sentence pairs derived from Gyeongsang dialect data.

High-level properties:

  • sentence-level rewrite pairs
  • source side: dialect Korean
  • target side: standard Korean
  • task emphasis: minimal-edit normalization with meaning preservation
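The exact schema is not published. Under the prompt_completion_text format named in the training summary, a single training record could look roughly like the sketch below; the field names and the example pair are assumptions, not the repository's published data:

```python
# Hypothetical record layout for the prompt_completion_text format.
record = {
    "prompt": "๋ฐฉ์–ธ: ์™€ ์ด๋ฆฌ ์ถฅ๋…ธ ๋ฐ–์— ๋ฐ”๋žŒ์ด ์–ต์ˆ˜๋กœ ๋ถ„๋‹ค\nํ‘œ์ค€์–ด:",
    "completion": " ์™œ ์ด๋ ‡๊ฒŒ ์ถฅ๋‹ˆ, ๋ฐ–์— ๋ฐ”๋žŒ์ด ๋ฌด์ฒ™ ๋ถ„๋‹ค",
}

# The flattened training text is prompt followed by completion.
text = record["prompt"] + record["completion"]
```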

Limitations

  • This model is specialized for Gyeongsang dialect normalization and may not transfer well to other dialects.
  • It is optimized for short-to-medium sentence rewriting, not general conversation.
  • It may still make lexical or phrasing errors on rare dialect forms.
  • It should not be treated as a factual QA or reasoning model.

Usage

Example with transformers + peft:

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "google/gemma-4-e4b-it"
adapter_id = "acidsound/gyeongsang_dialect_gemma-4-e4b-it-LoRA"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, adapter_id)

prompt = (
    "๋‹น์‹ ์€ ๊ฒฝ์ƒ๋„ ๋ฐฉ์–ธ์„ ํ‘œ์ค€ ํ•œ๊ตญ์–ด๋กœ ๋ฐ”๊พธ๋Š” ์ „๋ฌธ๊ฐ€์ž…๋‹ˆ๋‹ค.\n"
    "์˜๋ฏธ๋ฅผ ๋ณด์กดํ•˜๊ณ , ๋ถˆํ•„์š”ํ•œ ์ถ”๊ฐ€๋‚˜ ์ƒ๋žต ์—†์ด ํ‘œ์ค€์–ด ๋ฌธ์žฅ๋งŒ ๋‹ตํ•˜์„ธ์š”.\n\n"
    "๋ฐฉ์–ธ: ์™€ ์ด๋ฆฌ ์ถฅ๋…ธ ๋ฐ–์— ๋ฐ”๋žŒ์ด ์–ต์ˆ˜๋กœ ๋ถ„๋‹ค\n"
    "ํ‘œ์ค€์–ด:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        repetition_penalty=1.05,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
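Because `generate` returns the prompt together with the continuation, the standard-Korean answer can be recovered by splitting on the final template label. A small post-processing sketch:

```python
def extract_standard(decoded: str) -> str:
    # Take everything after the last "ํ‘œ์ค€์–ด:" label and trim whitespace.
    return decoded.rsplit("ํ‘œ์ค€์–ด:", 1)[-1].strip()
```

Applied to the decoded output above, this yields only the rewritten sentence.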

Notes

  • Base model access, usage terms, and license follow the upstream google/gemma-4-e4b-it model.
  • This repository contains only the LoRA adapter and tokenizer/processor side files needed for the fine-tuned setup.