---
language:
- ky
license: mit
library_name: transformers
pipeline_tag: text-generation
base_model: google/mt5-small
tags:
- mt5
- text-normalization
- kyrgyz
- low-resource
- turkic
datasets:
- Zarinaaa/kyrgyz-text-normalization
metrics:
- cer
- wer
- exact_match
---

# mT5-small fine-tuned for Kyrgyz text normalization

Fine-tuned `google/mt5-small` for normalizing noisy Kyrgyz social-media text (YouTube comments, Instagram posts, Telegram messages) into a standardized form: it restores punctuation and capitalization and normalizes dialectal spellings and digit–word compounds.

This is the **fine-tuned only** variant from the camera-ready paper *"Kyrgyz Text Normalization: A Comparative Study of Neural and Rule-Based Approaches"* (MeLLM Workshop @ ACL 2026). For the continual pre-training + fine-tuning variant see [Zarinaaa/mt5-small-kyrgyz-normalization-ptft](https://huggingface.co/Zarinaaa/mt5-small-kyrgyz-normalization-ptft).

## Usage

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "Zarinaaa/mt5-small-kyrgyz-normalization"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

noisy = "барды жакшы болсун коркунучту жерлерди тазалаш керек"
inputs = tokenizer("correct: " + noisy, return_tensors="pt", truncation=True, max_length=256)
out = model.generate(**inputs, max_new_tokens=256, num_beams=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))
# Барды жакшы болсун. Коркунучтуу жерлерди тазалаш керек.
```

The prefix `"correct: "` is required, because the model was fine-tuned with this exact prompt.
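
For more than one input, the same call can be batched. The following is a minimal sketch, not part of the released code: the helper names `with_prefix` and `normalize_batch` are hypothetical, and the generation settings simply repeat those from the example above.

```python
PREFIX = "correct: "  # the prompt the card says the model was fine-tuned with


def with_prefix(texts):
    """Prepend the required task prefix to every input string."""
    return [PREFIX + t for t in texts]


def normalize_batch(texts, model_id="Zarinaaa/mt5-small-kyrgyz-normalization", num_beams=4):
    """Normalize a list of noisy strings in one padded batch (hypothetical helper)."""
    # Imported lazily so with_prefix() works without torch/transformers installed.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(
        with_prefix(texts),
        return_tensors="pt",
        padding=True,        # pad to the longest sequence in the batch
        truncation=True,
        max_length=256,
    )
    with torch.no_grad():    # inference only, no gradients needed
        out = model.generate(**inputs, max_new_tokens=256, num_beams=num_beams)
    return tokenizer.batch_decode(out, skip_special_tokens=True)
```

Loading the model inside the function keeps the sketch short; for repeated calls it would be cheaper to load the tokenizer and model once and pass them in.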

## Training data

1.67M noisy–clean Kyrgyz text pairs from YouTube (45%), Instagram (25%), and Telegram (30%), automatically annotated with Gemini 3 Pro and spot-checked on a 400-example sample (84% acceptance rate, 95% Wilson CI [80%, 87%]). The 1,000-example test set was fully verified by two native Kyrgyz speakers with adjudication.

A 20,000-pair subset of the training data and the full test set are released at [Zarinaaa/kyrgyz-text-normalization](https://huggingface.co/datasets/Zarinaaa/kyrgyz-text-normalization).
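
The Wilson interval quoted for the spot-check can be reproduced with a few lines of standard-library Python. This is an illustrative sketch; it assumes that the 84% acceptance rate corresponds to 336 accepted pairs out of 400, a count the card does not state explicitly.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumption: 84% of 400 spot-checked pairs = 336 accepted.
low, high = wilson_interval(336, 400)
print(f"{low:.3f}, {high:.3f}")  # 0.801, 0.873
```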

## Training procedure

- **Base model:** `google/mt5-small` (300M parameters)
- **Effective batch size:** 64 (physical batch 4 × gradient accumulation 16)
- **Learning rate:** 3e-4, cosine schedule, 500 warmup steps
- **Epochs:** 5
- **Max sequence length:** 256
- **Train/validation split:** 95 / 5, seed 42; best checkpoint by validation loss
- **Hardware:** 1× NVIDIA RTX 5080 (16 GB VRAM)
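
To make the batch-size arithmetic and the learning-rate schedule concrete, here is a rough sketch in plain Python. Several details are assumptions rather than facts from the card: the warmup is taken to be linear, the cosine decay is taken to end at zero (the shape used by `get_cosine_schedule_with_warmup` in `transformers`), and the step count is derived from the rounded 1.67M figure.

```python
import math

TRAIN_PAIRS = 1_670_000   # "1.67M" from the card, rounded
TRAIN_FRACTION = 0.95     # 95 / 5 train/validation split
PHYSICAL_BATCH = 4
GRAD_ACCUM = 16
EPOCHS = 5
PEAK_LR = 3e-4
WARMUP_STEPS = 500

effective_batch = PHYSICAL_BATCH * GRAD_ACCUM  # 64
steps_per_epoch = math.ceil(TRAIN_PAIRS * TRAIN_FRACTION / effective_batch)
total_steps = steps_per_epoch * EPOCHS         # approximate optimizer steps


def lr_at(step):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, total_steps)
print(lr_at(0), lr_at(WARMUP_STEPS), lr_at(total_steps))
```

Under these assumptions the run takes roughly 124k optimizer steps, so the 500 warmup steps cover well under 1% of training.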

The 1,000 test inputs are disjoint from the 1.67M training set (verified 0/1,000 exact-match overlap and 0/1,000 case-insensitive overlap).
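
A disjointness check of this kind can be sketched as follows. The function name and the toy strings are hypothetical, and case-insensitive matching is implemented here with `str.casefold()`, which may differ from the exact comparison the authors used.

```python
def overlap_counts(test_inputs, train_inputs):
    """Count test inputs that also occur in the training inputs.

    Returns (exact, case_insensitive) overlap counts.
    """
    train_exact = set(train_inputs)
    train_folded = {s.casefold() for s in train_inputs}
    exact = sum(s in train_exact for s in test_inputs)
    folded = sum(s.casefold() in train_folded for s in test_inputs)
    return exact, folded


# Toy data: the first test sentence differs from a training sentence only in case.
train = ["салам дүйнө", "Бишкек шаары"]
test = ["Салам дүйнө", "жаңы сүйлөм"]
print(overlap_counts(test, train))  # (0, 1)
```

Using sets keeps each lookup constant-time, which matters when the training side has 1.67M entries.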

## Evaluation

Automatic metrics on the held-out 1,000-example test set:

| Metric | Value |
|---|---|
| **CER** | **0.0796** ± 0.003 |
| WER | 0.1978 |
| Exact Match | 0.186 |

For comparison, the rule-based baseline reaches 0.2029 CER and zero-shot Gemma 4 (9.6B parameters, 32× larger) reaches 0.1620 CER; lower is better.
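
For readers who want to score their own outputs, the three metrics can be computed with a plain Levenshtein distance. This is a sketch under stated assumptions: it aggregates errors over the whole corpus (total edits divided by total reference length) and splits words on whitespace, whereas the paper's evaluation script may average per example or tokenize differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def score(references, hypotheses):
    """Corpus-level CER, WER, and exact match for paired references/hypotheses."""
    pairs = list(zip(references, hypotheses))
    char_err = sum(edit_distance(r, h) for r, h in pairs)
    char_total = sum(len(r) for r, _ in pairs)
    word_err = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    word_total = sum(len(r.split()) for r, _ in pairs)
    exact = sum(r == h for r, h in pairs)
    return {
        "cer": char_err / char_total,
        "wer": word_err / word_total,
        "exact_match": exact / len(pairs),
    }


print(score(["ab cd"], ["ab ce"]))  # {'cer': 0.2, 'wer': 0.5, 'exact_match': 0.0}
```

The toy example shows why exact match is much harsher than CER: a single wrong character zeroes the example's exact-match credit while costing only one character error.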

Human evaluation by two native Kyrgyz speakers on 200 examples: **99.8%** rated correct (Wilson 95% CI [0.986, 0.9996]). Reliability under prevalence skew: PABAK = 0.990, Gwet's AC1 = 0.995. Of the 199 outputs both annotators rated correct, 162 (81.4%) differ from the Gemini reference at the character level. This is surface-form variability that exact match (EM) penalizes but native speakers accept.
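
The two agreement coefficients follow directly from a 2×2 table of the annotators' ratings. The sketch below assumes the table implied by the numbers above: both annotators rated 199 outputs correct, and on the one remaining output exactly one annotator said "incorrect". The card does not publish the table itself, so treat these cell counts as an assumption.

```python
def agreement_stats(both_yes, a_only, b_only, both_no):
    """PABAK and Gwet's AC1 for two raters and a binary label.

    a_only / b_only: items only rater A / only rater B labeled "yes".
    """
    n = both_yes + a_only + b_only + both_no
    p_obs = (both_yes + both_no) / n       # observed agreement
    pabak = 2 * p_obs - 1                  # prevalence- and bias-adjusted kappa

    # Mean proportion of "yes" ratings across the two raters.
    pi = ((both_yes + a_only) + (both_yes + b_only)) / (2 * n)
    p_chance = 2 * pi * (1 - pi)           # AC1 chance agreement (two categories)
    ac1 = (p_obs - p_chance) / (1 - p_chance)
    return pabak, ac1


# Assumed table: 199 joint "correct", 1 disagreement, 0 joint "incorrect".
pabak, ac1 = agreement_stats(199, 1, 0, 0)
print(f"PABAK={pabak:.3f}, AC1={ac1:.3f}")  # PABAK=0.990, AC1=0.995
```

Both coefficients are used here instead of Cohen's kappa because, with almost every item rated "correct", kappa's chance-agreement term becomes very large and the statistic collapses even when raters nearly always agree.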

## Per-category CER

| Category | N | CER |
|---|---|---|
| Punctuation restoration | 849 | 0.078 |
| Capitalization | 62 | 0.084 |
| All-caps segments | 39 | 0.084 |
| Digit–word compounds | 41 | 0.076 |

## Limitations

- **Domain:** trained and evaluated on social-media text. Performance on news, speech transcripts, or formal government text is not guaranteed.
- **Reference bias:** training references were produced by Gemini 3 Pro. A probe with an independent annotator suggests the model has learned a general normalization function (CER changes by only 0.012 against an independent reference), but residual stylistic bias is possible.
- **Label noise:** ~16% of training pairs may contain minor issues per the 400-example spot-check.
- **Model size:** larger variants (mT5-base/large, ByT5) and fine-tuned LLMs were not evaluated due to compute constraints.
- **Rule-based comparison:** the baseline in the paper is intentionally minimal; a stronger Kyrgyz FST-based pipeline would likely close part of the gap.

## Citation

```bibtex
@inproceedings{uvalieva2026kyrgyz,
  title={Kyrgyz Text Normalization: A Comparative Study of Neural and Rule-Based Approaches},
  author={Uvalieva, Zarina and Kumarbai uulu, Bektemir and Metinov, Adilet and Tashbaltaev, Tynchtykbek and Alibekov, Nurtilek},
  booktitle={Proceedings of the MeLLM Workshop at ACL 2026},
  year={2026}
}
```

## License

MIT. Code: [github.com/Zarina33/Kyrgyz-Text-Normalization-Conference](https://github.com/Zarina33/Kyrgyz-Text-Normalization-Conference).