Zarinaaa committed on
Commit
cd5f316
·
verified ·
1 Parent(s): 18b6e8d

Add model card

Files changed (1)
  1. README.md +108 -0
README.md ADDED
@@ -0,0 +1,108 @@
+ ---
+ language:
+ - ky
+ license: mit
+ library_name: transformers
+ pipeline_tag: text2text-generation
+ base_model: google/mt5-small
+ tags:
+ - mt5
+ - text-normalization
+ - kyrgyz
+ - low-resource
+ - turkic
+ datasets:
+ - Zarinaaa/kyrgyz-text-normalization
+ metrics:
+ - cer
+ - wer
+ - exact_match
+ ---
+
+ # mT5-small fine-tuned for Kyrgyz text normalization
+
+ Fine-tuned `google/mt5-small` for normalizing noisy Kyrgyz social-media text (YouTube comments, Instagram posts, Telegram messages) into a standardized form: punctuation, capitalization, dialectal spelling, and digit–word compounds.
+
+ This is the **fine-tuned only** variant from the camera-ready paper *"Kyrgyz Text Normalization: A Comparative Study of Neural and Rule-Based Approaches"* (MeLLM Workshop @ ACL 2026). For the continual pre-training + fine-tuning variant, see [Zarinaaa/mt5-small-kyrgyz-normalization-ptft](https://huggingface.co/Zarinaaa/mt5-small-kyrgyz-normalization-ptft).
+
+ ## Usage
+
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+ model_id = "Zarinaaa/mt5-small-kyrgyz-normalization"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
+
+ # The "correct: " task prefix must precede every input.
+ noisy = "барды жакшы болсун коркунучту жерлерди тазалаш керек"
+ inputs = tokenizer("correct: " + noisy, return_tensors="pt", truncation=True, max_length=256)
+ # Beam search with 4 beams; output capped at 256 new tokens.
+ out = model.generate(**inputs, max_new_tokens=256, num_beams=4)
+ print(tokenizer.decode(out[0], skip_special_tokens=True))
+ # Барды жакшы болсун. Коркунучтуу жерлерди тазалаш керек.
+ ```
+
+ The prefix `"correct: "` is required: the model was fine-tuned with this exact prompt.
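
For more than one input, the prefix has to be applied to every text before tokenization. The following is a minimal sketch, not part of the released code; `add_prefix` and `batched` are hypothetical helper names, and stripping surrounding whitespace is an assumption about sensible preprocessing.

```python
from typing import Iterator, List

PREFIX = "correct: "  # task prefix stated in this model card


def add_prefix(texts: List[str], prefix: str = PREFIX) -> List[str]:
    """Prepend the task prefix to every text (whitespace-stripped first)."""
    return [prefix + t.strip() for t in texts]


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


comments = ["салам кандайсын", "рахмат", "  жакшы  "]
batches = list(batched(add_prefix(comments), size=2))
print(batches)
```

Each batch can then be passed to `tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=256)`, and the generated ids decoded with `tokenizer.batch_decode(out, skip_special_tokens=True)`.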
+
+ ## Training data
+
+ 1.67M noisy–clean Kyrgyz text pairs from YouTube (45%), Instagram (25%), and Telegram (30%), automatically annotated with Gemini 3 Pro and spot-checked on a 400-example sample (84% acceptance rate, 95% Wilson CI [80%, 87%]). The 1,000-example test set was fully verified by two native Kyrgyz speakers with adjudication.
+
+ A 20,000-pair subset of the training data and the full test set are released at [Zarinaaa/kyrgyz-text-normalization](https://huggingface.co/datasets/Zarinaaa/kyrgyz-text-normalization).
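
The Wilson interval quoted for the spot-check can be reproduced with a few lines of arithmetic. This sketch assumes the 84% acceptance rate corresponds to 336 accepted examples out of 400 (the card gives only the rounded percentage) and uses z = 1.96 for a 95% interval.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


accepted = round(0.84 * 400)  # assumed count: 336 of 400
low, high = wilson_interval(accepted, 400)
print(f"{low:.1%} to {high:.1%}")  # roughly 80% to 87%
```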
+
+ ## Training procedure
+
+ - **Base model:** `google/mt5-small` (300M parameters)
+ - **Effective batch size:** 64 (physical batch 4 × gradient accumulation 16)
+ - **Learning rate:** 3e-4, cosine schedule, 500 warmup steps
+ - **Epochs:** 5
+ - **Max sequence length:** 256
+ - **Train/validation split:** 95 / 5, seed 42; best checkpoint by validation loss
+ - **Hardware:** 1× NVIDIA RTX 5080 (16 GB VRAM)
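
To make the numbers above concrete, the sketch below derives the approximate optimizer step count and evaluates a learning-rate curve. It rests on several assumptions: the training set is taken as exactly 1,670,000 pairs (the card says only "1.67M"), a partial final accumulation step is rounded up, and the schedule is the common linear-warmup-then-cosine-decay-to-zero form; the actual training script may differ.

```python
import math

TOTAL_PAIRS = 1_670_000      # assumed exact value of "1.67M"
TRAIN_FRACTION = 0.95        # 95 / 5 train/validation split
PHYSICAL_BATCH = 4
GRAD_ACCUM = 16
EPOCHS = 5
BASE_LR = 3e-4
WARMUP_STEPS = 500

effective_batch = PHYSICAL_BATCH * GRAD_ACCUM
train_examples = round(TOTAL_PAIRS * TRAIN_FRACTION)
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step: int) -> float:
    """Linear warmup followed by cosine decay to zero (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, total_steps)
print(lr_at(0), lr_at(WARMUP_STEPS), lr_at(total_steps))
```

Under these assumptions the 500 warmup steps cover well under 1% of training, so the run spends almost all of its time on the cosine decay.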
+
+ The 1,000 test inputs are disjoint from the 1.67M training set (verified: 0/1,000 exact-match overlap and 0/1,000 case-insensitive overlap).
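
A disjointness check of this kind can be sketched as follows. The function name `overlap_counts` and the use of `str.casefold` for the case-insensitive comparison are assumptions for illustration; the check actually used for the paper is not described in this card.

```python
from typing import Sequence, Tuple


def overlap_counts(test_inputs: Sequence[str], train_inputs: Sequence[str]) -> Tuple[int, int]:
    """Count test inputs found in the training inputs, exactly and ignoring case."""
    train_exact = set(train_inputs)
    train_folded = {t.casefold() for t in train_exact}
    exact = sum(1 for t in test_inputs if t in train_exact)
    folded = sum(1 for t in test_inputs if t.casefold() in train_folded)
    return exact, folded


# Toy data: one test input matches a training input only after case folding.
train = ["салам дос", "Рахмат сага"]
test = ["САЛАМ ДОС", "кандайсын"]
print(overlap_counts(test, train))  # (0, 1)
```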
+
+ ## Evaluation
+
+ Automatic metrics on the held-out 1,000-example test set:
+
+ | Metric | Value |
+ |---|---|
+ | **CER** | **0.0796** ± 0.003 |
+ | WER | 0.1978 |
+ | Exact Match | 0.186 |
+
+ For comparison, the rule-based baseline reaches 0.2029 CER and zero-shot Gemma 4 (9.6B parameters, 32× larger) reaches 0.1620 CER.
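
For readers unfamiliar with these metrics, here is a minimal, dependency-free sketch of CER, WER, and exact match based on Levenshtein distance. It assumes corpus-level (micro-averaged) scoring and whitespace tokenization for WER; the paper's evaluation script may average per example or tokenize differently, so the sketch is not expected to reproduce the table exactly.

```python
from typing import Dict, List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_scores(refs: List[str], hyps: List[str]) -> Dict[str, float]:
    """Micro-averaged CER and WER plus the exact-match rate."""
    if not refs or len(refs) != len(hyps):
        raise ValueError("refs and hyps must be non-empty and the same length")
    char_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    word_errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return {
        "cer": char_errors / sum(len(r) for r in refs),
        "wer": word_errors / sum(len(r.split()) for r in refs),
        "exact_match": sum(r == h for r, h in zip(refs, hyps)) / len(refs),
    }


refs = ["Салам, дос.", "Рахмат."]
hyps = ["Салам дос.", "Рахмат."]
scores = corpus_scores(refs, hyps)
print(scores)
```

In the toy example a single missing comma costs one character edit out of 18 reference characters but one word error out of three reference words, which illustrates why WER and exact match are much harsher than CER on punctuation-heavy normalization.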
+
+ Human evaluation by two native Kyrgyz speakers on 200 examples: **99.8%** rated correct (Wilson 95% CI [98.6%, 99.96%]). Reliability under prevalence skew: PABAK = 0.990, Gwet's AC1 = 0.995. Of the 199 outputs both annotators rated correct, 162 (81.4%) differ from the Gemini reference at the character level: surface-form variability that EM penalizes but native speakers accept.
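
The two agreement coefficients can be recomputed from a 2×2 table of the annotators' judgments. The table used here is an inference, not a figure from the card: 199 examples rated correct by both annotators (stated above), and the one remaining example assumed to be rated correct by one annotator and incorrect by the other, which is consistent with the reported 99.8%.

```python
from typing import Tuple


def agreement_stats(both_yes: int, yes_no: int, no_yes: int, both_no: int) -> Tuple[float, float]:
    """PABAK and Gwet's AC1 for two raters and two categories."""
    n = both_yes + yes_no + no_yes + both_no
    p_obs = (both_yes + both_no) / n
    pabak = 2 * p_obs - 1
    # Mean proportion of "yes" ratings across the two raters.
    pi_yes = ((both_yes + yes_no) + (both_yes + no_yes)) / (2 * n)
    p_chance = 2 * pi_yes * (1 - pi_yes)
    ac1 = (p_obs - p_chance) / (1 - p_chance)
    return pabak, ac1


# Assumed table: 199 joint "correct", 1 split decision, 0 joint "incorrect".
pabak, ac1 = agreement_stats(both_yes=199, yes_no=1, no_yes=0, both_no=0)
print(round(pabak, 3), round(ac1, 3))  # 0.99 0.995
```

Both coefficients are reported instead of Cohen's kappa because kappa becomes unstable when nearly all ratings fall in one category.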
+
+ ## Per-category CER
+
+ | Category | N | CER |
+ |---|---|---|
+ | Punctuation restoration | 849 | 0.078 |
+ | Capitalization | 62 | 0.084 |
+ | All-caps segments | 39 | 0.084 |
+ | Digit–word compounds | 41 | 0.076 |
+
+ ## Limitations
+
+ - **Domain:** trained and evaluated on social-media text. Performance on news, speech transcripts, or formal government text is not guaranteed.
+ - **Reference bias:** training references were produced by Gemini 3 Pro; a probe with an independent annotator shows the model has learned a general normalization function (CER changes by only 0.012 against an independent reference), but residual stylistic bias is possible.
+ - **Label noise:** ~16% of training pairs may contain minor issues per the 400-example spot-check.
+ - **Model size:** larger variants (mT5-base/large, ByT5) and fine-tuned LLMs were not evaluated due to compute constraints.
+ - **Rule-based comparison:** the baseline in the paper is intentionally minimal; a stronger Kyrgyz FST-based pipeline would likely close part of the gap.
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{uvalieva2026kyrgyz,
+ title={Kyrgyz Text Normalization: A Comparative Study of Neural and Rule-Based Approaches},
+ author={Uvalieva, Zarina and Kumarbai uulu, Bektemir and Metinov, Adilet and Tashbaltaev, Tynchtykbek and Alibekov, Nurtilek},
+ booktitle={Proceedings of the MeLLM Workshop at ACL 2026},
+ year={2026}
+ }
+ ```
105
+
106
+ ## License
107
+
108
+ MIT. Code: [github.com/Zarina33/Kyrgyz-Text-Normalization-Conference](https://github.com/Zarina33/Kyrgyz-Text-Normalization-Conference).