Add model card
README.md
ADDED
@@ -0,0 +1,108 @@
---
language:
- ky
license: mit
library_name: transformers
pipeline_tag: text2text-generation
base_model: google/mt5-small
tags:
- mt5
- text-normalization
- kyrgyz
- low-resource
- turkic
datasets:
- Zarinaaa/kyrgyz-text-normalization
metrics:
- cer
- wer
- exact_match
---

# mT5-small fine-tuned for Kyrgyz text normalization

`google/mt5-small` fine-tuned to normalize noisy Kyrgyz social-media text (YouTube comments, Instagram posts, Telegram messages) into a standardized form, covering punctuation, capitalization, dialectal spelling, and digit–word compounds.

This is the **fine-tuned only** variant from the camera-ready paper *"Kyrgyz Text Normalization: A Comparative Study of Neural and Rule-Based Approaches"* (MeLLM Workshop @ ACL 2026). For the continual pre-training + fine-tuning variant, see [Zarinaaa/mt5-small-kyrgyz-normalization-ptft](https://huggingface.co/Zarinaaa/mt5-small-kyrgyz-normalization-ptft).

## Usage

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "Zarinaaa/mt5-small-kyrgyz-normalization"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Noisy input: no capitalization or punctuation, non-standard spelling.
noisy = "барды жакшы болсун коркунучту жерлерди тазалаш керек"
# Prepend the task prefix used during fine-tuning, then tokenize.
inputs = tokenizer("correct: " + noisy, return_tensors="pt", truncation=True, max_length=256)
# Decode with beam search (4 beams).
out = model.generate(**inputs, max_new_tokens=256, num_beams=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))
# Барды жакшы болсун. Коркунучтуу жерлерди тазалаш керек.
```

The prefix `"correct: "` is required: the model was fine-tuned with this exact prompt.
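
Because the prefix is easy to forget, it can help to wrap it in a small helper. The sketch below is only one possible arrangement: `with_prefix` and `normalize_batch` are hypothetical names that do not come from the repository, and the generation settings are simply copied from the usage snippet above.

```python
from typing import Iterable, List

PREFIX = "correct: "  # task prefix stated in this model card


def with_prefix(texts: Iterable[str]) -> List[str]:
    """Prepend the required task prefix to every input text."""
    return [PREFIX + t for t in texts]


def normalize_batch(texts, model_id="Zarinaaa/mt5-small-kyrgyz-normalization"):
    """Normalize several texts at once (hypothetical helper, not part of the repo)."""
    # Imported lazily so with_prefix() stays usable without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    # padding=True is needed because the texts in a batch have different lengths.
    inputs = tokenizer(with_prefix(texts), return_tensors="pt", padding=True,
                       truncation=True, max_length=256)
    out = model.generate(**inputs, max_new_tokens=256, num_beams=4)
    return tokenizer.batch_decode(out, skip_special_tokens=True)
```

Calling `normalize_batch(["...", "..."])` downloads the model on first use; for repeated calls it would be better to load the tokenizer and model once and reuse them.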

## Training data

1.67M noisy–clean Kyrgyz text pairs from YouTube (45%), Instagram (25%), and Telegram (30%), automatically annotated with Gemini 3 Pro and spot-checked on a 400-example sample (84% acceptance rate; 95% Wilson CI [80%, 87%]). The 1,000-example test set was fully verified by two native Kyrgyz speakers with adjudication.

A 20,000-pair subset of the training data and the full test set are released at [Zarinaaa/kyrgyz-text-normalization](https://huggingface.co/datasets/Zarinaaa/kyrgyz-text-normalization).
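
The Wilson interval quoted for the spot-check can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the 84% acceptance rate corresponds to exactly 336 accepted examples out of 400; the card does not state the raw count.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumption: 84% of 400 means 336 accepted pairs.
low, high = wilson_interval(336, 400)
print(f"{low:.2f}, {high:.2f}")
```

Under that assumption the bounds round to the reported [80%, 87%].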

## Training procedure

- **Base model:** `google/mt5-small` (300M parameters)
- **Effective batch size:** 64 (physical batch 4 × gradient accumulation 16)
- **Learning rate:** 3e-4, cosine schedule, 500 warmup steps
- **Epochs:** 5
- **Max sequence length:** 256
- **Train/validation split:** 95 / 5, seed 42; best checkpoint by validation loss
- **Hardware:** 1× NVIDIA RTX 5080 (16 GB VRAM)
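
To make the batch-size and schedule settings above concrete, the following sketch computes an approximate number of optimizer steps and the learning rate at a given step. It assumes linear warmup followed by cosine decay to zero, which is a common reading of "cosine schedule" but is not confirmed by the card, and it assumes roughly 1,586,500 training pairs (95% of the rounded 1.67M figure).

```python
import math

BASE_LR = 3e-4
WARMUP_STEPS = 500
EFFECTIVE_BATCH = 4 * 16  # physical batch x gradient accumulation = 64


def total_steps(num_train_pairs: int, epochs: int = 5, batch: int = EFFECTIVE_BATCH) -> int:
    """Optimizer steps over all epochs, rounding up the last partial batch."""
    return math.ceil(num_train_pairs / batch) * epochs


def lr_at(step: int, total: int, base_lr: float = BASE_LR, warmup: int = WARMUP_STEPS) -> float:
    """Assumed schedule: linear warmup, then cosine decay to zero."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumption: 95% of ~1.67M pairs are used for training.
steps = total_steps(1_586_500)
print(steps, lr_at(250, steps), lr_at(WARMUP_STEPS, steps))
```

With these assumed numbers the 500 warmup steps cover well under 1% of training, so almost the entire run is spent on the cosine decay.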

The 1,000 test inputs are disjoint from the 1.67M training set (verified: 0/1,000 exact-match overlap and 0/1,000 case-insensitive overlap).
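
A disjointness check of this kind can be sketched as follows. The function name and the use of `str.casefold` for the case-insensitive comparison are assumptions made for illustration; the card does not describe how the authors ran the check.

```python
def overlap_counts(train_inputs, test_inputs):
    """Count test inputs that also occur in the training inputs.

    Returns (exact, case_insensitive) overlap counts.
    """
    train_exact = set(train_inputs)
    # casefold() gives case-insensitive matching that also works for Cyrillic.
    train_folded = {t.casefold() for t in train_inputs}
    exact = sum(1 for t in test_inputs if t in train_exact)
    folded = sum(1 for t in test_inputs if t.casefold() in train_folded)
    return exact, folded


# Toy data: one test input matches a training input only after case folding.
print(overlap_counts(["Салам", "жакшы"], ["салам", "кандай"]))
```

A result of `(0, 0)` on the real data would correspond to the 0/1,000 figures reported above.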

## Evaluation

Automatic metrics on the held-out 1,000-example test set:

| Metric | Value |
|---|---|
| **CER** | **0.0796** ± 0.003 |
| WER | 0.1978 |
| Exact Match | 0.186 |

For comparison, the rule-based baseline reaches 0.2029 CER, and zero-shot Gemma 4 (9.6B parameters, 32× larger) reaches 0.1620 CER.
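
For readers unfamiliar with the metrics, the sketch below shows one common way to compute CER, WER, and exact match for a single reference/hypothesis pair using Levenshtein distance. It is an illustration only: the card does not say which implementation the authors used, nor whether scores are averaged per example or pooled over the corpus.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / max(1, len(ref))


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word edits divided by number of reference words."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(1, len(ref_words))


def exact_match(ref: str, hyp: str) -> float:
    """1.0 if the strings are identical, else 0.0."""
    return float(ref == hyp)
```

Because exact match requires every character to agree, a single differing comma drops it to 0 while barely moving CER, which is consistent with the gap between the EM and CER values in the table.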

Human evaluation by two native Kyrgyz speakers on 200 examples: **99.8%** rated correct (Wilson 95% CI [98.6%, 99.96%]). Reliability under prevalence skew: PABAK = 0.990, Gwet's AC1 = 0.995. Of the 199 outputs both annotators rated correct, 162 (81.4%) differ from the Gemini reference at the character level; this is surface-form variability that EM penalizes but native speakers accept.
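
The two agreement statistics can be recomputed from a 2×2 agreement table. The sketch below assumes the table is 199 outputs rated correct by both annotators, 1 output on which they disagree, and 0 rated incorrect by both; this breakdown is inferred from the figures above and is not stated explicitly in the card.

```python
def agreement_stats(both_yes: int, both_no: int, disagree: int):
    """PABAK and Gwet's AC1 for two raters and a binary label."""
    n = both_yes + both_no + disagree
    p_o = (both_yes + both_no) / n             # observed agreement
    pabak = 2 * p_o - 1                        # prevalence- and bias-adjusted kappa
    # Mean proportion of "yes" ratings across both raters.
    pi_yes = (2 * both_yes + disagree) / (2 * n)
    p_e = 2 * pi_yes * (1 - pi_yes)            # AC1 chance agreement for 2 categories
    ac1 = (p_o - p_e) / (1 - p_e)
    return pabak, ac1


# Assumed counts: 199 both-correct, 1 disagreement, 0 both-incorrect.
pabak, ac1 = agreement_stats(both_yes=199, both_no=0, disagree=1)
print(round(pabak, 3), round(ac1, 3))
```

Under that assumption the values round to the reported PABAK = 0.990 and AC1 = 0.995. Both statistics are used here because plain Cohen's kappa behaves poorly when nearly all items fall into one category.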

## Per-category CER

| Category | N | CER |
|---|---|---|
| Punctuation restoration | 849 | 0.078 |
| Capitalization | 62 | 0.084 |
| All-caps segments | 39 | 0.084 |
| Digit–word compounds | 41 | 0.076 |

## Limitations

- **Domain:** trained and evaluated on social-media text. Performance on news, speech transcripts, or formal government text is not guaranteed.
- **Reference bias:** training references were produced by Gemini 3 Pro; a probe with an independent annotator shows the model has learned a general normalization function (CER changes by only 0.012 against an independent reference), but residual stylistic bias is possible.
- **Label noise:** ~16% of training pairs may contain minor issues per the 400-example spot-check.
- **Model size:** larger variants (mT5-base/large, ByT5) and fine-tuned LLMs were not evaluated due to compute constraints.
- **Rule-based comparison:** the baseline in the paper is intentionally minimal; a stronger Kyrgyz FST-based pipeline would likely close part of the gap.

## Citation

```bibtex
@inproceedings{uvalieva2026kyrgyz,
  title={Kyrgyz Text Normalization: A Comparative Study of Neural and Rule-Based Approaches},
  author={Uvalieva, Zarina and Kumarbai uulu, Bektemir and Metinov, Adilet and Tashbaltaev, Tynchtykbek and Alibekov, Nurtilek},
  booktitle={Proceedings of the MeLLM Workshop at ACL 2026},
  year={2026}
}
```

## License

MIT. Code: [github.com/Zarina33/Kyrgyz-Text-Normalization-Conference](https://github.com/Zarina33/Kyrgyz-Text-Normalization-Conference).