---
language:
- ky
license: mit
library_name: transformers
pipeline_tag: text-generation
base_model: google/mt5-small
tags:
- mt5
- text-normalization
- kyrgyz
- low-resource
- turkic
- continual-pretraining
datasets:
- Zarinaaa/kyrgyz-text-normalization
metrics:
- cer
- wer
- exact_match
---

# mT5-small with continual pre-training + fine-tuning for Kyrgyz text normalization

`google/mt5-small` continually pre-trained on a 538 MB Kyrgyz corpus (news portals + books) with T5-style span corruption, then fine-tuned on 1.67M noisy–clean text pairs for Kyrgyz text normalization.

This is the **continual pre-training + fine-tuning** (PT+FT) variant from the camera-ready paper *"Kyrgyz Text Normalization: A Comparative Study of Neural and Rule-Based Approaches"* (MeLLM Workshop @ ACL 2026). For the fine-tuning-only variant see [Zarinaaa/mt5-small-kyrgyz-normalization](https://huggingface.co/Zarinaaa/mt5-small-kyrgyz-normalization).

**Note on choice between the two variants:** in our experiments the additional continual pre-training step did **not** improve over direct fine-tuning (CER 0.0825 vs. 0.0796, p = 0.06). The main observable difference is a higher rate of hallucination (input repetition) in failure cases. For most users we recommend the fine-tune-only variant unless you specifically want the slightly better Digit–Word category performance (see Evaluation below).

## Usage

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "Zarinaaa/mt5-small-kyrgyz-normalization-ptft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

noisy = "барды жакшы болсун коркунучту жерлерди тазалаш керек"
inputs = tokenizer("correct: " + noisy, return_tensors="pt", truncation=True, max_length=256)
out = model.generate(**inputs, max_new_tokens=256, num_beams=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The prefix `"correct: "` is required.
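
To normalize several sentences at once, the single-example call can be extended to padded batches. The following is a minimal sketch, not part of the released code: `with_prefix` and `normalize_batch` are hypothetical helper names, and the batch size of 8 is an arbitrary assumption. The generation settings mirror the ones shown in this section.

```python
PREFIX = "correct: "  # task prefix required by the model

def with_prefix(texts):
    """Prepend the required task prefix to every input string."""
    return [PREFIX + t for t in texts]

def normalize_batch(texts,
                    model_id="Zarinaaa/mt5-small-kyrgyz-normalization-ptft",
                    batch_size=8,      # assumption: tune to your hardware
                    max_length=256,
                    num_beams=4):
    """Normalize a list of noisy strings, returning one output per input."""
    # Imported lazily so with_prefix() works without loading the libraries.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    model.eval()

    results = []
    for start in range(0, len(texts), batch_size):
        batch = with_prefix(texts[start:start + batch_size])
        # padding=True pads each batch to its longest sequence.
        enc = tokenizer(batch, return_tensors="pt", padding=True,
                        truncation=True, max_length=max_length)
        with torch.no_grad():
            out = model.generate(**enc, max_new_tokens=max_length,
                                 num_beams=num_beams)
        results.extend(tokenizer.batch_decode(out, skip_special_tokens=True))
    return results
```

Calling `normalize_batch(["...", "..."])` returns the normalized strings in input order. Because the model and tokenizer are loaded on every call, pass all inputs in one call rather than looping over it.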
## Training procedure

### Stage 1: Continual pre-training

- **Corpus:** 538 MB clean Kyrgyz text from news portals and books
- **Objective:** T5-style span corruption (mask rate 0.15, mean span length 3)
- **Epochs:** 3
- **Train/validation split:** 98 / 2, seed 42; best checkpoint by validation loss

### Stage 2: Fine-tuning

Identical to the fine-tune-only variant:

- **Effective batch size:** 64 (4 × 16 gradient accumulation)
- **Learning rate:** 3e-4, cosine schedule, 500 warmup steps
- **Epochs:** 5
- **Max sequence length:** 256
- **Train/validation split:** 95 / 5, seed 42; best checkpoint by validation loss
- **Hardware:** 1× NVIDIA RTX 5080 (16 GB VRAM)

## Evaluation

Automatic metrics on the held-out 1,000-example test set:

| Metric | Value |
|---|---|
| CER | 0.0825 ± 0.004 |
| WER | 0.2017 |
| Exact Match | 0.184 |

Compared with the fine-tune-only variant (CER 0.0796), a paired bootstrap test gives a two-sided p = 0.06. We treat this as **insufficient evidence to reject the null** of no difference, **not** as equivalence: n = 1,000 is underpowered for detecting small effects in either direction.

Human evaluation (200 examples, 2 native annotators): **99.8%** rated correct (Wilson 95% CI [0.986, 0.9996]); PABAK = 0.990, Gwet's AC1 = 0.995. These values are identical to the fine-tune-only variant, at the ceiling.

### Per-category CER

| Category | N | FT-only | **PT+FT** |
|---|---|---|---|
| Punctuation | 849 | **0.078** | 0.081 |
| Capitalization | 62 | **0.084** | 0.085 |
| All-caps | 39 | 0.084 | **0.083** |
| Digit–Word | 41 | 0.076 | **0.067** |

PT+FT is numerically slightly better on Digit–Word compounds; with N = 41 we do not treat this as a robust advantage.

### Failure analysis

In 40 examples where FT outperforms PT+FT by more than 0.05 CER, **hallucination (input repetition) is the dominant error mode (35/40 = 87.5%, 95% Wilson CI [74%, 95%])**. Two non-exclusive hypotheses (see paper §6.1):

1. **Copy bias from span corruption:** T5-style span corruption trains the decoder to reconstruct spans of the input verbatim, which may reinforce copying behavior harmful for normalization (where the target is usually not a superset of the input).
2. **Register mismatch:** continual pre-training used clean, formal text (news/books), while fine-tuning targets the normalization of noisy, informal social-media text. The register gap may push the model toward fluent formal continuations that read as hallucinations.

## Limitations

Same as the fine-tune-only variant, plus:

- **Higher hallucination rate** in failure cases; if you need maximum robustness, use the FT-only variant.
- **No measurable benefit from the additional pre-training** at this scale and corpus composition; results suggest a more targeted continual objective (in-domain noisy text, denoising closer to the normalization target) would be needed.

## Citation

```bibtex
@inproceedings{uvalieva2026kyrgyz,
  title={Kyrgyz Text Normalization: A Comparative Study of Neural and Rule-Based Approaches},
  author={Uvalieva, Zarina and Kumarbai uulu, Bektemir and Metinov, Adilet and Tashbaltaev, Tynchtykbek and Alibekov, Nurtilek},
  booktitle={Proceedings of the MeLLM Workshop at ACL 2026},
  year={2026}
}
```

## License

MIT. Code: [github.com/Zarina33/Kyrgyz-Text-Normalization-Conference](https://github.com/Zarina33/Kyrgyz-Text-Normalization-Conference).
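
## Checking the reported Wilson intervals

The Wilson 95% intervals quoted in the Evaluation and Failure analysis sections can be recomputed from the counts. The sketch below uses the standard Wilson score formula; `wilson_interval` is a hypothetical helper name. The 35/40 count is taken from the failure analysis. Reading the 99.8% human-evaluation figure as 399 correct out of 400 ratings (200 examples × 2 annotators) is an assumption on our part, not a count stated in this card.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Failure analysis: 35 of 40 examples show hallucination.
fa_lo, fa_hi = wilson_interval(35, 40)
print(f"hallucination share: [{fa_lo:.2f}, {fa_hi:.2f}]")

# Human evaluation: ASSUMED to be 399 correct out of 400 ratings.
he_lo, he_hi = wilson_interval(399, 400)
print(f"human evaluation:    [{he_lo:.3f}, {he_hi:.4f}]")
```

Under these counts the intervals round to the values reported in this card ([74%, 95%] and [0.986, 0.9996]).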