Gyeongsang Dialect to Standard Korean LoRA for Gemma 4 E4B

This repository contains a LoRA adapter fine-tuned from google/gemma-4-e4b-it for a narrow text-rewriting task:

  • input: Gyeongsang dialect Korean sentence
  • output: standard Korean sentence
  • goal: preserve meaning and avoid unnecessary additions or omissions

This is an adapter-only release. You need the base model google/gemma-4-e4b-it to use it.

Model Details

  • Base model: google/gemma-4-e4b-it
  • Adapter type: PEFT LoRA
  • LoRA rank: 16
  • LoRA alpha: 16
  • LoRA dropout: 0.0
  • Training stack: Unsloth + PEFT + Transformers
  • Primary language: Korean
  • Primary task: dialect normalization / standardization

Intended Use

This adapter is intended for:

  • converting Gyeongsang dialect utterances into standard Korean
  • deterministic rewrite workflows
  • offline batch normalization pipelines
  • experimentation inside Hugging Face, Transformers, or Unsloth Studio

This adapter is not intended for:

  • open-ended chat
  • long-form reasoning
  • instruction following outside the rewrite task
  • high-stakes legal, medical, or financial usage without human review

Prompt Format

The best-performing eval path used a plain single-turn rewrite prompt with a strict instruction not to add reasoning.

Minimal usage pattern:

๋‹น์‹ ์€ ๊ฒฝ์ƒ๋„ ๋ฐฉ์–ธ์„ ํ‘œ์ค€ ํ•œ๊ตญ์–ด๋กœ ๋ฐ”๊พธ๋Š” ์ „๋ฌธ๊ฐ€์ž…๋‹ˆ๋‹ค.
์˜๋ฏธ๋ฅผ ๋ณด์กดํ•˜๊ณ , ๋ถˆํ•„์š”ํ•œ ์ถ”๊ฐ€๋‚˜ ์ƒ๋žต ์—†์ด ํ‘œ์ค€์–ด ๋ฌธ์žฅ๋งŒ ๋‹ตํ•˜์„ธ์š”.

๋ฐฉ์–ธ: <input sentence>
ํ‘œ์ค€์–ด:

Greedy decoding worked better than sampling for this task.
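The template above can be assembled programmatically. The helper below is a hypothetical convenience function, not part of this repository; the Korean system text mirrors the prompt shown in this section:

```python
# Hypothetical prompt-builder for the strict single-turn rewrite template.
SYSTEM = (
    "๋‹น์‹ ์€ ๊ฒฝ์ƒ๋„ ๋ฐฉ์–ธ์„ ํ‘œ์ค€ ํ•œ๊ตญ์–ด๋กœ ๋ฐ”๊พธ๋Š” ์ „๋ฌธ๊ฐ€์ž…๋‹ˆ๋‹ค.\n"
    "์˜๋ฏธ๋ฅผ ๋ณด์กดํ•˜๊ณ , ๋ถˆํ•„์š”ํ•œ ์ถ”๊ฐ€๋‚˜ ์ƒ๋žต ์—†์ด ํ‘œ์ค€์–ด ๋ฌธ์žฅ๋งŒ ๋‹ตํ•˜์„ธ์š”."
)

def build_prompt(dialect_sentence: str) -> str:
    """Fill the rewrite template for a single dialect input sentence."""
    return f"{SYSTEM}\n\n๋ฐฉ์–ธ: {dialect_sentence}\nํ‘œ์ค€์–ด:"
```

Pairing this with `do_sample=False` in `generate` reproduces the greedy setup described above.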

Evaluation Summary

Final full-run evaluation was performed on internal dev, test, and hard_test splits.

split      rows   char_similarity   critical_error_rate   number_preservation   format_contamination_rate
dev        4002   0.9436            0.0265                1.0000                0.0000
test       4001   0.9433            0.0280                0.9615                0.0000
hard_test  2048   0.9337            0.0381                1.0000                0.0000

Interpretation:

  • rewrite quality is strong on the held-out evaluation splits
  • number preservation remained strong in the final E4B run
  • prompt-format contamination was reduced to zero in final evaluation
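The exact metric implementations are not published with this card. As one plausible reading, char_similarity can be taken as a character-level similarity ratio and number_preservation as a digit-sequence check; both functions below are assumptions, not the card's definitions:

```python
import re
from difflib import SequenceMatcher

def char_similarity(pred: str, ref: str) -> float:
    # Character-level similarity in [0, 1]; an assumed definition of the
    # char_similarity column, not the repository's exact metric.
    return SequenceMatcher(None, pred, ref).ratio()

def number_preservation(pred: str, ref: str) -> bool:
    # True when every digit sequence in the reference survives in the prediction.
    return all(n in pred for n in re.findall(r"\d+", ref))
```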

Training Summary

  • Final training rows: 51719
  • Eval rows in the training-run summary: 4002
  • Training runtime: 3612.06s
  • Final recorded train loss: 0.2783
  • Dataset format: prompt_completion_text
  • Prompt variant: strict
  • Completion-only loss: True

The adapter was selected from an adaptive loop that tuned prompt format, eval format, LoRA settings, and recovery behavior across multiple iterations before the final full run.
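Completion-only loss means the prompt tokens are masked out of the loss so that only the standard-Korean completion is supervised. A minimal sketch of that masking, assuming token IDs are plain Python lists:

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss

def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    # Copy the token IDs as labels, then blank out the prompt portion so
    # gradients only flow through the completion tokens.
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels
```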

Data

The training task uses Korean dialect-to-standard sentence pairs derived from Gyeongsang dialect data.

High-level properties:

  • sentence-level rewrite pairs
  • source side: dialect Korean
  • target side: standard Korean
  • task emphasis: minimal-edit normalization with meaning preservation
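The exact schema is not published. Under the prompt_completion_text format named in the training summary, a single training record could look roughly like the sketch below; the field names and the example pair are assumptions, not the repository's published data:

```python
# Hypothetical record layout for the prompt_completion_text format.
record = {
    "prompt": "๋ฐฉ์–ธ: ์™€ ์ด๋ฆฌ ์ถฅ๋…ธ ๋ฐ–์— ๋ฐ”๋žŒ์ด ์–ต์ˆ˜๋กœ ๋ถ„๋‹ค\nํ‘œ์ค€์–ด:",
    "completion": " ์™œ ์ด๋ ‡๊ฒŒ ์ถฅ๋‹ˆ, ๋ฐ–์— ๋ฐ”๋žŒ์ด ๋ฌด์ฒ™ ๋ถ„๋‹ค",
}

# The flattened training text is prompt followed by completion.
text = record["prompt"] + record["completion"]
```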

Limitations

  • This model is specialized for Gyeongsang dialect normalization and may not transfer well to other dialects.
  • It is optimized for short-to-medium sentence rewriting, not general conversation.
  • It may still make lexical or phrasing errors on rare dialect forms.
  • It should not be treated as a factual QA or reasoning model.

Usage

Example with transformers + peft:

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "google/gemma-4-e4b-it"
adapter_id = "acidsound/gyeongsang_dialect_gemma-4-e4b-it-LoRA"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, adapter_id)

prompt = (
    "๋‹น์‹ ์€ ๊ฒฝ์ƒ๋„ ๋ฐฉ์–ธ์„ ํ‘œ์ค€ ํ•œ๊ตญ์–ด๋กœ ๋ฐ”๊พธ๋Š” ์ „๋ฌธ๊ฐ€์ž…๋‹ˆ๋‹ค.\n"
    "์˜๋ฏธ๋ฅผ ๋ณด์กดํ•˜๊ณ , ๋ถˆํ•„์š”ํ•œ ์ถ”๊ฐ€๋‚˜ ์ƒ๋žต ์—†์ด ํ‘œ์ค€์–ด ๋ฌธ์žฅ๋งŒ ๋‹ตํ•˜์„ธ์š”.\n\n"
    "๋ฐฉ์–ธ: ์™€ ์ด๋ฆฌ ์ถฅ๋…ธ ๋ฐ–์— ๋ฐ”๋žŒ์ด ์–ต์ˆ˜๋กœ ๋ถ„๋‹ค\n"
    "ํ‘œ์ค€์–ด:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        repetition_penalty=1.05,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
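Because `generate` returns the prompt together with the continuation, the standard-Korean answer can be recovered by splitting on the final template label. A small post-processing sketch:

```python
def extract_standard(decoded: str) -> str:
    # Take everything after the last "ํ‘œ์ค€์–ด:" label and trim whitespace.
    return decoded.rsplit("ํ‘œ์ค€์–ด:", 1)[-1].strip()
```

Applied to the decoded output above, this yields only the rewritten sentence.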

Notes

  • Base model access, usage terms, and license follow the upstream google/gemma-4-e4b-it model.
  • This repository contains only the LoRA adapter and tokenizer/processor side files needed for the fine-tuned setup.